Actual run · 2,000,000-row sample of live NYC TLC Yellow-Taxi 2024-01 → local Delta · 2026-06-21 UTC. Every statement below ran through conn.sql(...)
against live NYC TLC data, landing real Delta — the SQL and its actual output, top to bottom.
| Operation | Rows | Before | After | Expected | Count ✓ | Values ✓ | Time |
|---|---|---|---|---|---|---|---|
| Load zones (CSV→Delta) | +265 | 0 | 265 | 265 | ✅ | ✅ | 0.6s |
| Land trips sample | +2,000,000 | 0 | 2,000,000 | 2,000,000 | ✅ | ✅ | 9.2s |
| Write revenue mart | +251 | 0 | 251 | 251 | ✅ | ✅ | 0.3s |
| INSERT append (+100) | +100 | 2,000,000 | 2,000,100 | 2,000,100 | ✅ | ✅ | 0.2s |
| SQL upsert (1 upd, 1 ins) | +1 | 252 | 253 | 253 | ✅ | ✅ | 0.5s |
| DELETE Cash trips | -294,508 | 2,000,100 | 1,705,592 | 1,705,592 | ✅ | ✅ | 2.1s |
| Concurrent MERGE clash | +0 | 253 | 253 | 253 | ✅ | ✅ | 0.5s |
CREATE OR REPLACE TABLE zones AS
SELECT "LocationID"::INT AS zone_id, "Borough" AS borough,
"Zone" AS zone, "service_zone" AS service_zone
FROM read_csv_auto('https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv')
zones (first 6 of a real 265-zone dimension)
| zone_id | borough | zone |
|---|---|---|
| 1 | EWR | Newark Airport |
| 2 | Queens | Jamaica Bay |
| 3 | Bronx | Allerton/Pelham Gardens |
| 4 | Manhattan | Alphabet City |
| 5 | Staten Island | Arden Heights |
| 6 | Staten Island | Arrochar/Fort Wadsworth |
265 zones loaded
CREATE TEMP TABLE trips_raw AS
SELECT
t.tpep_pickup_datetime AS pickup_ts,
t."PULocationID"::INT AS zone_id,
z.borough AS borough,
z.zone AS zone,
CASE t.payment_type WHEN 1 THEN 'Credit' WHEN 2 THEN 'Cash'
WHEN 3 THEN 'NoCharge' WHEN 4 THEN 'Dispute' ELSE 'Other' END AS payment,
t.fare_amount::DOUBLE AS fare,
t.tip_amount::DOUBLE AS tip,
t.trip_distance::DOUBLE AS distance,
t.total_amount::DOUBLE AS total,
hour(t.tpep_pickup_datetime) AS pickup_hour
FROM read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet') t
JOIN zones z ON z.zone_id = t."PULocationID"
WHERE t.fare_amount > 0 AND t.trip_distance > 0
AND t.tpep_pickup_datetime >= TIMESTAMP '2024-01-01'
AND t.tpep_pickup_datetime < TIMESTAMP '2024-02-01'
2,869,697 clean trips for 2024-01 — all compute in-engine; nothing written to Delta yet
CREATE OR REPLACE TABLE trips AS SELECT pickup_ts, zone_id, borough, zone, payment, fare, tip, distance, total, pickup_hour FROM trips_raw USING SAMPLE 2000000 ROWS
2,000,000 rows landed as Delta version 0; fare>0 verified
CREATE OR REPLACE TABLE mart_zone_revenue AS SELECT borough, zone, count(*) AS trips, round(sum(total), 2) AS revenue FROM trips GROUP BY borough, zone
top zones by revenue (mart_zone_revenue)
| borough | zone | trips | revenue $ |
|---|---|---|---|
| Queens | JFK Airport | 96,352 | 7,784,696 |
| Queens | LaGuardia Airport | 61,214 | 4,060,324 |
| Manhattan | Midtown Center | 97,569 | 2,339,430 |
| Manhattan | Upper East Side South | 97,567 | 1,933,218 |
| Manhattan | Times Sq/Theatre District | 71,670 | 1,931,992 |
| Manhattan | Upper East Side North | 93,368 | 1,891,903 |
| Manhattan | Penn Station/Madison Sq West | 71,177 | 1,712,107 |
| Manhattan | Midtown East | 72,676 | 1,693,600 |
251 (borough, zone) rows written to Delta
tables in 'dbo'
| table | location (real Delta dir) |
|---|---|
| mart_zone_revenue | …/taxidemo_9yoqx797/dbo/mart_zone_revenue |
| trips | …/taxidemo_9yoqx797/dbo/trips |
| zones | …/taxidemo_9yoqx797/dbo/zones |
3 Delta tables — each a real directory with a _delta_log at the location shown
INSERT INTO trips SELECT pickup_ts, zone_id, borough, zone, payment, fare, tip, distance, total, pickup_hour FROM trips_raw USING SAMPLE 100 ROWS
2,000,000 → 2,000,100 rows
CREATE OR REPLACE TABLE zone_stats AS SELECT zone_id, any_value(zone) AS zone, count(*) AS trips, round(avg(fare), 2) AS avg_fare FROM trips GROUP BY zone_id
DELETE FROM zone_stats WHERE zone_id = 161 OR zone_id = 999
INSERT INTO zone_stats VALUES (161, 'Midtown Center', 97573, 20.52), (999, 'NEW Demo Zone', 1, 7.77)
zone 161 avg 15.52→20.52 (updated), zone 999 inserted; 252 → 253 rows
DELETE FROM trips WHERE payment = 'Cash'
SELECT count(*) FROM delta_scan('C:/Users/MDJOUA~1/AppData/Local/Temp/taxidemo_9yoqx797/dbo/trips', version => 0)
time travel: current vs landed snapshot
| before | Cash deleted | after (now) | at landed ver |
|---|---|---|---|
| 2,000,100 | 294,508 | 1,705,592 | 2,000,000 |
deleted 294,508 Cash trips (2,000,100 → 1,705,592); delta_scan(version => 0) still sees 2,000,000
# DataFrame builder API (conn.sql rejects MERGE) — both writers pin the SAME version:
writer_A = DeltaTable.forName(conn, "zone_stats").merge(srcA, "target.zone_id = source.zone_id") \
.whenMatchedUpdateAll().whenNotMatchedInsertAll() # srcA sets avg → 100.0
writer_B = DeltaTable.forName(conn, "zone_stats").merge(srcB, "target.zone_id = source.zone_id") \
.whenMatchedUpdateAll().whenNotMatchedInsertAll() # srcB sets avg → 200.0
writer_A.execute() # wins, advances the table version
writer_B.execute() # stale snapshot → CommitFailedError
both writers built against zone 161 (avg 20.52) at the same snapshot
two concurrent MERGEs against one snapshot
| writer | intended avg | outcome |
|---|---|---|
| A — fresh snapshot | 100.0 | committed ✅ |
| B — stale snapshot | 200.0 | refused ✅ |
snapshot isolation held: zone 161 avg = 100.0 (writer A won); writer B refused — a concurrent transactions added new data.