Coverage vs the Spark / Delta API

duckrun's connection API (duckrun.connect()) deliberately targets parity with the surface most data engineers already know: PySpark's SparkSession / DataFrame family and Delta Lake's DeltaTable (the "Delta-on-Spark" API), so notebook code reads the same. Under the hood it is DuckDB + delta-rs, not Apache Spark: there is no Spark runtime, no JVM, no cluster.

This page tracks how far the surface matches. Two things it is not: a Spark runtime, and a fluent DataFrame transform builder. Both are design decisions, not gaps — duckrun is SQL-first (transforms are written as SQL through conn.sql(...)), and there is no cluster to configure. Those rows are marked 🚫 below precisely so they read as "we chose not to," not "still missing."

Everywhere else in the docs this is called the DataFrame API (it is not full Spark). This one page names Spark on purpose, because it is a side-by-side comparison.

The companion to this page is the method scorecard in connection-api.md (auto-generated from tests/connection_api/test_connection_api.py). The scorecard says which mapped methods actually pass their tests; this page says what maps to what.

Legend

✅ implemented, parity-faithful
🟡 implemented, duckrun-flavored or a convenience shortcut
🚫 by design — deliberately not offered (SQL-first, or no Spark runtime). Not a gap.
➖ not wired yet — could be added if asked for

Bespoke: duckrun-only, no Spark/Delta equivalent

The entire invented surface is the five entries below: a path-based entry point, a multi-catalog attach verb, a store rescan, and two compare-and-swap write modes. Everything else on this page either maps to a real Spark/delta-rs method, is a deliberate 🚫 omission, or is a ➖ TODO. There is nothing else masquerading as Spark.

duckrun	what it is	why there's no Spark name
`duckrun.connect(path, storage_options=…, schema=…, read_only=True, name=…)`	open a session bound to one lakehouse root, addressed by a storage path; read-only by default (`read_only=False` to write). The primary catalog's `name` is derived from the URL (else `data`) unless given.	Spark's entry point is `SparkSession.builder…getOrCreate()` against a cluster — there's no cluster and no builder; one connection binds to one storage root, and defaults to read-only to protect a shared lakehouse
`conn.attach(path, name=…, storage_options=…, schema=…, read_only=…)`	attach a second+ lakehouse root as a named catalog, so `catalog.schema.table` resolves across lakehouses (powers `catalog.listCatalogs` / `setCurrentCatalog`)	Spark configures catalogs (metastore/Unity) at session build; there's no runtime "attach another storage root as a catalog" verb. `name` is derived from a friendly path, mandatory for a GUID-only OneLake path; one URL ↔ one name (re-attaching either raises). `read_only` is per-catalog — a read-only reference store (e.g. a Fabric Warehouse) can sit next to a writable lakehouse.
`conn.refresh()`	re-discover the Delta tables under the store and re-register their views	duckrun finds tables by globbing storage for `_delta_log`; Spark's metastore is authoritative, so it never needs a "rescan the store" call
`write.mode("append_if_unchanged")` (alias `safeappend`)	fail-loud compare-and-swap append (commits only if the table version hasn't moved)	no Spark `SaveMode` for it — it's the duckrun/dbt concurrency primitive
`write.mode("overwrite_if_unchanged")`	fail-loud compare-and-swap full overwrite (the overwrite sibling)	no Spark `SaveMode` for it

Private plumbing (_resolve, _table_path, _connection) is bespoke too but intentionally not part of the public surface, so it isn't tracked below. One naming note: DeltaTable.delete(predicate) uses delta-rs's parameter name (predicate) rather than delta-spark's (condition) — real API, not invented.
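The `conn.refresh()` row above says duckrun finds tables by globbing storage for `_delta_log`. A minimal sketch of that discovery rule on a local filesystem, using only the standard library, might look like the following. The function name `discover_delta_tables` and the `root/schema/table` directory layout are assumptions made for illustration, not duckrun's actual implementation.

```python
import tempfile
from pathlib import Path


def discover_delta_tables(root: Path) -> list[str]:
    """Return 'schema.table' for every directory under root that holds a _delta_log."""
    names = []
    # Assumed layout (hypothetical): root/<schema>/<table>/_delta_log
    for log_dir in root.glob("*/*/_delta_log"):
        if log_dir.is_dir():
            table_dir = log_dir.parent
            names.append(f"{table_dir.parent.name}.{table_dir.name}")
    return sorted(names)


# Build a throwaway store: two Delta-looking tables and one plain directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sales" / "orders" / "_delta_log").mkdir(parents=True)
    (root / "sales" / "customers" / "_delta_log").mkdir(parents=True)
    (root / "sales" / "scratch").mkdir(parents=True)  # no _delta_log, so not a table
    found = discover_delta_tables(root)

print(found)
```

Because the store itself is the source of truth, a table written by another process only becomes visible after a rescan, which is the job the source assigns to `conn.refresh()`.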

`SparkSession` ↔ `duckrun.connect()` / `DuckSession`

Spark	duckrun	Status	Notes
`spark.sql(str)`	`conn.sql(str)`	✅	Plus raw DML routing to delta-rs (`create table as` / `insert` / `update` / `delete` / `alter add column` / `drop`).
`spark.table(name)`	`conn.table(name)`	✅
`spark.read`	`conn.read`	✅	→ `DataFrameReader`.
`spark.catalog`	`conn.catalog`	✅	→ `Catalog` (see below).
`spark.createDataFrame(rows)`	`conn.createDataFrame(data, schema=None)`	✅	Accepts a list of tuples/scalars, a pandas `DataFrame`, or a pyarrow `Table`/`RecordBatchReader`; `schema` is a list of names or a DDL string. Materialized on duckrun's own DuckDB connection — persist with `df.write.saveAsTable(...)`.
`spark.stop()`	`conn.stop()`	✅	Closes the underlying DuckDB connection.

`DataFrame`

duckrun's DataFrame is a thin handle over a DuckDB relation. Row/column transforms are done in SQL (conn.sql(...)), not via chained DataFrame methods — a deliberate SQL-first design choice (see the DuckDB Spark-API spike decision), not missing coverage. The methods below are the action/output verbs, plus a passthrough to the underlying relation.
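As a sketch of what SQL-first means in practice: an aggregation that a Spark user might chain as `groupBy(...).agg(...)` is written as one SQL string and handed to `conn.sql(...)`. The table `sales.orders` and its columns are hypothetical, and `conn` is assumed to be a session obtained from `duckrun.connect(...)`.

```python
def revenue_by_customer(conn):
    """Express the transform in SQL; the returned DataFrame only carries the result."""
    # Hypothetical table and column names, for illustration only.
    return conn.sql(
        """
        SELECT customer_id, SUM(amount) AS revenue
        FROM sales.orders
        GROUP BY customer_id
        ORDER BY revenue DESC
        """
    )
```

With a real session, the result would then be pulled out through one of the action verbs in the table below, for example `revenue_by_customer(conn).toPandas()`.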

Spark	duckrun	Status	Notes
`df.write`	`df.write`	✅	→ `DataFrameWriter`.
`df.collect()`	`df.collect()`	✅
`df.toPandas()`	`df.toPandas()`	✅
`df.toArrow()`	`df.toArrow()`	🟡	Spark collects a whole `pyarrow.Table`; duckrun returns a streaming `pyarrow.RecordBatchReader` (`to_arrow_reader()`) so big results don't materialize.
`df.count()`	`df.count()`	✅
`df.show()`	`df.show()`	✅
`df.first()`	`df.first()`	✅	First row as a tuple, or `None` if empty.
`df.head(n=1)` / `df.take(n)`	`df.head([n])` / `df.take(n)`	✅	`head()` → first row (or `None`); `head(n)` / `take(n)` → list of the first `n` rows.
`df.isEmpty()`	`df.isEmpty()`	✅
`df.createOrReplaceTempView(name)`	`df.createOrReplaceTempView(name)`	✅	Native, ephemeral DuckDB view — not Delta, not in `conn.catalog`.
`df.columns`	(passthrough)	🟡	Not reimplemented — `__getattr__` forwards to the DuckDB relation; a list of names, like Spark.
`df.dtypes`	(passthrough)	🟡	Passthrough to the DuckDB relation — returns DuckDB types, not Spark `(name, type)` tuples.
`df.schema`	`df.schema`	🟡	Returns a `StructType` of `StructField(name, dataType, nullable)` — same surface as Spark, but `dataType` is the DuckDB type string (like `df.dtypes`), not a Spark type object, and `nullable` is always `True` (the relation doesn't track it).
`df.printSchema()`	`df.printSchema()`	✅	Spark's `root / \|-- col: type (nullable = …)` tree, with DuckDB type names.
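
The `df.schema` and `df.printSchema()` rows describe a Spark-shaped schema whose `dataType` is a DuckDB type string and whose `nullable` is always `True`. The sketch below models that shape with stand-in classes (they are not duckrun's own `StructType` / `StructField`) and renders the tree in the layout Spark's `printSchema()` uses. The column names and types are made up.

```python
from typing import NamedTuple


class StructField(NamedTuple):
    """Stand-in for the field shape described above (not duckrun's class)."""
    name: str
    dataType: str          # a DuckDB type string, not a Spark type object
    nullable: bool = True  # always True: the relation does not track nullability


def schema_tree(fields: list[StructField]) -> str:
    """Render fields in the 'root / |-- col: type (nullable = ...)' layout."""
    lines = ["root"]
    for f in fields:
        lines.append(f" |-- {f.name}: {f.dataType} (nullable = {str(f.nullable).lower()})")
    return "\n".join(lines)


fields = [StructField("id", "BIGINT"), StructField("name", "VARCHAR")]
tree = schema_tree(fields)
print(tree)
```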

`DataFrameReader` (`conn.read`)

Spark	duckrun	Status	Notes
`read.format(fmt)`	`read.format(fmt)`	✅	`delta` (default), `parquet`, `csv`.
`read.option(k, v)`	`read.option(k, v)`	✅
`read.load(path)`	`read.load(path)`	✅	Honors the chosen `format`.
`read.format("delta").load(path)`	`read.format("delta").load(path)`	✅	Identical — `delta` is the default format.
`read.parquet(path)`	`read.parquet(path)`	✅
`read.csv(path)`	`read.csv(path)`	✅
`read.table(name)`	`read.table(name)`	✅
`read.option("versionAsOf", N).load(path)`	`read.option("versionAsOf", N).load(path)`	✅	Time travel via duckdb-delta `version =>`.
`read.option("timestampAsOf", ts)`	—	🚫	duckdb-delta time-travels by version only; rejected (use `versionAsOf`).
`read.schema(…)`	`read.schema(ddl_or_StructType)`	🟡	Applies to `csv` / `json` via DuckDB `columns={…}` — names + types the columns, turns off sniffing, skips the header (Spark override). A DDL string or a `StructType` (e.g. another frame's `df.schema`). Rejected for `delta` / `parquet`, which carry their own schema.
`read.json`	`read.json(path)` / `read.format("json").load(path)`	✅	DuckDB `read_json_auto`.
`read.orc`	—	🚫	DuckDB has no native ORC reader; no engine to back it.
`read.text`	—	🚫	Spark yields one row per line (`value` column); DuckDB's `read_text` returns the whole file as one value — different shape, so rejected rather than faked.
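
The shape mismatch behind the `read.text` row can be shown without either engine. This plain-Python illustration mimics the two result shapes for the same three-line file; the DuckDB side is simplified to its `content` column.

```python
text = "alpha\nbeta\ngamma\n"

# Spark's read.text: one row per line, in a single column named "value".
spark_shape = [{"value": line} for line in text.splitlines()]

# DuckDB's read_text: one row per file, holding the whole content.
duckdb_shape = [{"content": text}]

print(len(spark_shape), len(duckdb_shape))
```

Code written against the Spark shape (filtering or counting rows per line) would silently misbehave on the one-row result, which is the reason the source gives for rejecting the call rather than faking it.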

`DataFrameWriter` (`df.write`)

Spark	duckrun	Status	Notes
`write.format(fmt)`	`write.format(fmt)`	✅	`delta`.
`write.mode(m)`	`write.mode(m)`	✅	`overwrite` / `append` / `ignore` / `error` (default).
`write.option(k, v)`	`write.option(k, v)`	✅	`overwriteSchema`, `mergeSchema`.
`write.partitionBy(*cols)`	`write.partitionBy(*cols)`	✅
`write.save(path)`	`write.save(path)`	✅	Write Delta by path.
`write.saveAsTable(name)`	`write.saveAsTable(name)`	✅	Write Delta by catalog name.
`write.insertInto(name)`	`write.insertInto(name)`	✅	Appends to an existing table (errors if missing); `overwrite=True` replaces all rows.
`write.bucketBy`	—	🚫	Delta doesn't bucket; partitioning is `partitionBy`.
`write.sortBy`	—	🚫	In Spark, `sortBy` only applies together with `bucketBy`, and Delta doesn't bucket.

`Catalog` (`conn.catalog`)

Spark	duckrun	Status	Notes
`catalog.listDatabases()`	`catalog.listDatabases()`	✅	(= schemas in the lakehouse root).
`catalog.listTables(dbName)`	`catalog.listTables(dbName)`	✅
`catalog.listColumns(table, dbName)`	`catalog.listColumns(table, dbName)`	✅
`catalog.currentDatabase()`	`catalog.currentDatabase()`	✅
`catalog.setCurrentDatabase(db)`	`catalog.setCurrentDatabase(db)`	✅
`catalog.tableExists(t, db)`	`catalog.tableExists(t, db)`	✅
`catalog.databaseExists(db)`	`catalog.databaseExists(db)`	✅
`catalog.getTable(t, db)`	`catalog.getTable(t, db=None)`	✅	Returns a `Table` namedtuple (`name`, `catalog`, `database`, `description`, `tableType`, `isTemporary`); raises if absent. duckrun tables are always `MANAGED`, never temporary.
`catalog.getDatabase(db)`	`catalog.getDatabase(db)`	✅	Returns a `Database` namedtuple (`name`, `catalog`, `description`, `locationUri`); raises if absent.
`catalog.dropTempView(name)`	`catalog.dropTempView(name)`	✅	Inverse of `df.createOrReplaceTempView`; returns `True` if the view existed.
`catalog.createTable(t, schema)`	`catalog.createTable(t, schema)`	🟡	Creates an empty managed Delta table from a DDL string or a `StructType`, returns it as a DataFrame. No `path`/`source` arg — duckrun tables are always managed (read foreign data by path with `conn.read…load()`).
`catalog.refreshTable(t)`	`catalog.refreshTable(t)`	✅	Rebuilds one table's cached view from the current on-store snapshot — the per-table peer of `conn.refresh()` (bespoke), which rediscovers the whole store.
`catalog.refreshByPath`	—	🚫	Path reads aren't cached — nothing to refresh.
`catalog.currentCatalog` / `setCurrentCatalog` / `listCatalogs`	`catalog.currentCatalog()` / `setCurrentCatalog(name)` / `listCatalogs()`	✅	Each attached lakehouse root is a catalog (`catalog.schema.table`). The primary comes from `connect`; add more with `conn.attach(path, name=…)` (bespoke — see top).
`catalog.functionExists` / `listFunctions` / `registerFunction`	—	🚫	DuckDB owns the function namespace; not a duckrun catalog concept.
`catalog.dropGlobalTempView`	—	🚫	No global-temp namespace in duckrun.
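
With several attached catalogs, a table name can carry one, two, or three parts. The toy resolver below follows the usual Spark convention of filling missing parts from the current catalog and current database. It is an assumption about how `catalog.schema.table` names could be expanded, not a description of duckrun's private `_resolve`; the schema name `dbo` and catalog name `warehouse` are hypothetical, and quoted identifiers are ignored.

```python
def resolve(name: str, current_catalog: str, current_database: str) -> tuple[str, str, str]:
    """Expand a 1-, 2- or 3-part table name to (catalog, schema, table)."""
    parts = name.split(".")
    if len(parts) == 1:
        return (current_catalog, current_database, parts[0])
    if len(parts) == 2:
        return (current_catalog, parts[0], parts[1])
    if len(parts) == 3:
        return (parts[0], parts[1], parts[2])
    raise ValueError(f"expected at most catalog.schema.table, got {name!r}")


# "data" is the default primary catalog name mentioned in the connect() row.
short = resolve("orders", "data", "dbo")
cross = resolve("warehouse.dbo.customers", "data", "dbo")
print(short, cross)
```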

`DeltaTable` (Delta-on-Spark) ↔ `DeltaTable.forName(conn, name)`

The write/mutate side. merge is snapshot-pinned by default (single-snapshot MERGE): the target version is captured at build time and the commit validates against it, so a concurrent writer fails loudly (CommitFailedError) rather than silently interleaving.
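Snapshot pinning can be modelled in a few lines. The in-memory sketch below captures the table version when the merge is built and refuses to commit if the version has moved since. All class names are stand-ins invented for this illustration (including the local `CommitFailedError`, which is not imported from delta-rs), and the merge itself is reduced to a key-based upsert.

```python
class CommitFailedError(Exception):
    """Local stand-in for the error named above; not the delta-rs class."""


class ToyTable:
    """In-memory table whose version bumps on every commit."""

    def __init__(self):
        self.version = 0
        self.rows = []

    def commit(self, rows, expected_version):
        # Compare-and-swap: refuse if another writer committed since the snapshot.
        if self.version != expected_version:
            raise CommitFailedError(
                f"table moved from v{expected_version} to v{self.version}"
            )
        self.rows = rows
        self.version += 1


class ToyMerge:
    def __init__(self, table, source):
        self.table = table
        self.source = source
        self.pinned_version = table.version  # captured at build time
        self.snapshot = list(table.rows)     # the single snapshot the merge reads

    def execute(self):
        # Upsert by key against the pinned snapshot, then commit against it.
        merged = dict(self.snapshot)
        merged.update(dict(self.source))
        self.table.commit(sorted(merged.items()), self.pinned_version)


table = ToyTable()
ToyMerge(table, [(1, "a")]).execute()   # v0 -> v1

stale = ToyMerge(table, [(2, "b")])     # pins v1
ToyMerge(table, [(3, "c")]).execute()   # concurrent writer: v1 -> v2

try:
    stale.execute()
    outcome = "committed"
except CommitFailedError:
    outcome = "failed loudly"

print(outcome, table.version, table.rows)
```

The stale merge fails instead of overwriting the concurrent writer's row. The `*_if_unchanged` write modes in the Bespoke section are described in the same compare-and-swap terms.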

Spark / Delta	duckrun	Status	Notes
`DeltaTable.forName(spark, name)`	`DeltaTable.forName(conn, name)`	✅
`DeltaTable.forPath(spark, path)`	`DeltaTable.forPath(conn, path)`	✅
`.merge(source, condition)`	`.merge(source, condition)`	✅	Returns a builder; snapshot-pinned automatically.
`.whenMatchedUpdate(set=…)`	`.whenMatchedUpdate(set=…)`	✅
`.whenMatchedUpdateAll()`	`.whenMatchedUpdateAll()`	✅
`.whenNotMatchedInsertAll()`	`.whenNotMatchedInsertAll()`	✅
`.whenNotMatchedBySourceDelete()`	`.whenNotMatchedBySourceDelete()`	✅
`.whenMatchedDelete()`	`.whenMatchedDelete(condition=None)`	✅	`WHEN MATCHED [AND …] THEN DELETE`.
`.whenNotMatchedInsert(values=…)`	`.whenNotMatchedInsert(values=…)`	✅	`values` maps each target column to a source expression.
`.whenNotMatchedBySourceUpdate(set=…)`	`.whenNotMatchedBySourceUpdate(set=…)`	✅	Update target rows the source doesn't carry.
`.delete(predicate)`	`.delete(predicate)`	✅	delta-rs param name (`predicate`); takes literals, not `IN (SELECT …)`.
`.update(condition, set)`	`.update(condition=…, set=…)`	✅	delta-spark signature.
`df.write.option("replaceWhere", …)` / `INSERT OVERWRITE`	`df.write.option("replaceWhere", pred).mode("overwrite").save()` / `.saveAsTable()`	✅	Single atomic commit; snapshot-fenced.
`.history()`	`.history(limit=None)`	✅	delta-rs commit history (newest-first list of dicts: `version`, `timestamp`, `operation`, …).
`.vacuum()`	`.vacuum(retention_hours=None, dry_run=False, …)`	✅	delta-rs `vacuum`; deletes by default (Spark-like), `dry_run=True` only lists. Returns the removed paths.
`.optimize()`	`.optimize(zorder_by=None, target_size=None)`	✅	delta-rs `optimize.compact` (or `optimize.z_order` with `zorder_by`); returns the metrics dict.
`.restoreToVersion()`	`.restoreToVersion(version)`	✅	delta-rs `restore`; commits a new version, so the restore is itself revertible.
`.generate()`	—	➖	TODO (delta-rs gap — no symlink-format-manifest generation).
`.clone()`	—	➖	TODO (delta-rs gap).
`DeltaTable.convertToDelta(spark, ident, partitionSchema)`	`DeltaTable.convertToDelta(conn, ident, partitionSchema=None)`	✅	Zero-copy — writes a `_delta_log` over existing parquet, no data rewrite. `ident` is `"parquet."` (a bare path is also accepted); `partitionSchema` is a pyarrow `Schema` for a Hive-partitioned dir.
