← duckrun docs

pure SQL — the connection APInot a dbt project

Actual run · 2,000,000-row sample of live NYC TLC Yellow-Taxi 2024-01 → local Delta · 2026-06-21 UTC. Every statement below ran through conn.sql(...) against live NYC TLC data, landing real Delta — the SQL and its actual output, top to bottom.

Run scorecard — ✅ all operations correct

OperationRowsBeforeAfterExpectedCount ✓Values ✓Time
Load zones (CSV→Delta)+26502652650.6s
Land trips sample+2,000,00002,000,0002,000,0009.2s
Write revenue mart+25102512510.3s
INSERT append (+100)+1002,000,0002,000,1002,000,1000.2s
SQL upsert (1 upd, 1 ins)+12522532530.5s
DELETE Cash trips-294,5082,000,1001,705,5921,705,5922.1s
Concurrent MERGE clash+02532532530.5s

1 real dimension off the web: taxi-zone lookup CSV (https) → Delta 'zones'

CREATE OR REPLACE TABLE zones AS
SELECT "LocationID"::INT AS zone_id, "Borough" AS borough,
       "Zone" AS zone, "service_zone" AS service_zone
FROM read_csv_auto('https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv')

zones (first 6 of a real 265-zone dimension)

zone_idboroughzone
1EWRNewark Airport
2QueensJamaica Bay
3BronxAllerton/Pelham Gardens
4ManhattanAlphabet City
5Staten IslandArden Heights
6Staten IslandArrochar/Fort Wadsworth

265 zones loaded

2 register the live month: read_parquet yellow_tripdata_2024-01 (https, ~50MB) → TEMP TABLE 'trips_raw'

CREATE TEMP TABLE trips_raw AS
SELECT
    t.tpep_pickup_datetime  AS pickup_ts,
    t."PULocationID"::INT    AS zone_id,
    z.borough                AS borough,
    z.zone                   AS zone,
    CASE t.payment_type WHEN 1 THEN 'Credit' WHEN 2 THEN 'Cash'
         WHEN 3 THEN 'NoCharge' WHEN 4 THEN 'Dispute' ELSE 'Other' END AS payment,
    t.fare_amount::DOUBLE    AS fare,
    t.tip_amount::DOUBLE     AS tip,
    t.trip_distance::DOUBLE  AS distance,
    t.total_amount::DOUBLE   AS total,
    hour(t.tpep_pickup_datetime) AS pickup_hour
FROM read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet') t
JOIN zones z ON z.zone_id = t."PULocationID"
WHERE t.fare_amount > 0 AND t.trip_distance > 0
  AND t.tpep_pickup_datetime >= TIMESTAMP '2024-01-01'
  AND t.tpep_pickup_datetime <  TIMESTAMP '2024-02-01'

2,869,697 clean trips for 2024-01 — all compute in-engine; nothing written to Delta yet

3 land a bounded sample → Delta 'trips' (CREATE TABLE AS … USING SAMPLE 2,000,000)

CREATE OR REPLACE TABLE trips AS
SELECT pickup_ts, zone_id, borough, zone, payment, fare, tip, distance, total, pickup_hour
FROM trips_raw USING SAMPLE 2000000 ROWS

2,000,000 rows landed as Delta version 0; fare>0 verified

4 transform → Delta: revenue by borough & zone → table 'mart_zone_revenue'

CREATE OR REPLACE TABLE mart_zone_revenue AS
SELECT borough, zone, count(*) AS trips, round(sum(total), 2) AS revenue
FROM trips GROUP BY borough, zone

top zones by revenue (mart_zone_revenue)

boroughzonetripsrevenue $
QueensJFK Airport96,3527,784,696
QueensLaGuardia Airport61,2144,060,324
ManhattanMidtown Center97,5692,339,430
ManhattanUpper East Side South97,5671,933,218
ManhattanTimes Sq/Theatre District71,6701,931,992
ManhattanUpper East Side North93,3681,891,903
ManhattanPenn Station/Madison Sq West71,1771,712,107
ManhattanMidtown East72,6761,693,600

251 (borough, zone) rows written to Delta

5 Catalog: the Delta tables this demo landed — with their real storage locations

tables in 'dbo'

tablelocation (real Delta dir)
mart_zone_revenue…/taxidemo_9yoqx797/dbo/mart_zone_revenue
trips…/taxidemo_9yoqx797/dbo/trips
zones…/taxidemo_9yoqx797/dbo/zones

3 Delta tables — each a real directory with a _delta_log at the location shown

6 raw-DML INSERT: append a 100-row late-arriving batch to 'trips'

INSERT INTO trips
SELECT pickup_ts, zone_id, borough, zone, payment, fare, tip, distance, total, pickup_hour
FROM trips_raw USING SAMPLE 100 ROWS

2,000,000 → 2,000,100 rows

7 SQL-only upsert on 'zone_stats': DELETE literal keys + INSERT (1 update, 1 insert)

CREATE OR REPLACE TABLE zone_stats AS
SELECT zone_id, any_value(zone) AS zone, count(*) AS trips, round(avg(fare), 2) AS avg_fare
FROM trips GROUP BY zone_id
DELETE FROM zone_stats WHERE zone_id = 161 OR zone_id = 999
INSERT INTO zone_stats VALUES (161, 'Midtown Center', 97573, 20.52), (999, 'NEW Demo Zone', 1, 7.77)

zone 161 avg 15.52→20.52 (updated), zone 999 inserted; 252 → 253 rows

8 raw-DML DELETE + time travel: drop 'Cash' trips, then read the landed snapshot

DELETE FROM trips WHERE payment = 'Cash'
SELECT count(*) FROM delta_scan('C:/Users/MDJOUA~1/AppData/Local/Temp/taxidemo_9yoqx797/dbo/trips', version => 0)

time travel: current vs landed snapshot

beforeCash deletedafter (now)at landed ver
2,000,100294,5081,705,5922,000,000

deleted 294,508 Cash trips (2,000,100 → 1,705,592); delta_scan(version => 0) still sees 2,000,000

9 concurrent MERGE clash (DataFrame DeltaTable.merge): two writers, one snapshot — the stale one is refused

# DataFrame builder API (conn.sql rejects MERGE) — both writers pin the SAME version:
writer_A = DeltaTable.forName(conn, "zone_stats").merge(srcA, "target.zone_id = source.zone_id") \
    .whenMatchedUpdateAll().whenNotMatchedInsertAll()   # srcA sets avg → 100.0
writer_B = DeltaTable.forName(conn, "zone_stats").merge(srcB, "target.zone_id = source.zone_id") \
    .whenMatchedUpdateAll().whenNotMatchedInsertAll()   # srcB sets avg → 200.0
writer_A.execute()   # wins, advances the table version
writer_B.execute()   # stale snapshot → CommitFailedError

both writers built against zone 161 (avg 20.52) at the same snapshot

two concurrent MERGEs against one snapshot

writerintended avgoutcome
A — fresh snapshot100.0committed ✅
B — stale snapshot200.0refused ✅

snapshot isolation held: zone 161 avg = 100.0 (writer A won); writer B refused — a concurrent transactions added new data.