2026-02-19 14 min read

DuckDB: The Universal Query Engine Data Engineers Actually Need

How a tiny embedded database became the Swiss Army knife of the modern data stack

Something unusual is happening in the database world. A tiny, embeddable, in-process analytical database — one you can pip install in three seconds — is quietly replacing Spark clusters, unifying data silos, and running inside web browsers. DuckDB has gone from a research project at a Dutch university to one of the fastest-growing databases on the planet, and it deserves your attention.

After using DuckDB extensively in production pipelines and local development, I believe it represents a genuine paradigm shift: the idea that you don't need infrastructure to do serious analytical work.

DuckDB as a universal query engine — one SQL interface over PostgreSQL, MySQL, SQLite, Parquet/CSV, JSON/Excel, Apache Iceberg, Delta Lake, DuckLake, S3/GCS/Azure, Pandas/Polars, dbt models, and Jupyter/WASM

What Makes DuckDB Different

DuckDB was created in 2018 by Hannes Mühleisen and Mark Raasveldt at Centrum Wiskunde & Informatica (CWI) — the same Dutch research institute that produced MonetDB and Vectorwise. It was designed with a radical premise: what if you could have the analytical power of a data warehouse, but as a library you embed in your Python script?

No server. No daemon. No configuration. Just import duckdb and start querying. This "SQLite for analytics" philosophy, combined with a columnar vectorized execution engine, is what makes DuckDB so powerful:

  • In-process execution — runs inside your application, not as a separate server. Zero network overhead, zero deployment complexity.
  • Columnar storage with vectorized execution — processes data in batches using SIMD instructions, the same approach that makes modern data warehouses fast.
  • Larger-than-memory processing — automatically spills to disk when data exceeds RAM. DuckDB 1.4 successfully ran TPC-H at scale factor 100,000 (100 TB of CSV data), spilling 7 TB to disk on a single machine.
  • Zero dependencies — a single binary or Python package. No JVM, no Docker, no cluster manager.

The Universal Query Engine

Here is where DuckDB gets genuinely exciting for data engineers. It is not just a database — it is a universal query layer that can reach into virtually any data source and let you work with them all through a single SQL interface.

Query Anything, Anywhere

Through its extension system, DuckDB can attach and query:

  • Databases: PostgreSQL, MySQL, SQLite — attach them as named schemas and query tables directly
  • File formats: Parquet, CSV, JSON, Excel — with automatic schema detection
  • Open table formats: Apache Iceberg (full read/write as of January 2026), Delta Lake (read)
  • Cloud storage: S3, GCS, Azure Blob, Cloudflare R2 — all via the httpfs extension

The killer capability is cross-source JOINs. You can write a single SQL query that joins a Postgres table with a Parquet file on S3 and an Iceberg table — with no ETL step, no data movement, and no infrastructure:

-- Attach a live Postgres database
ATTACH 'postgres:dbname=production host=db.internal' AS prod (TYPE postgres);

-- Query across Postgres, S3 Parquet, and a local CSV in one statement
SELECT
    p.customer_id,
    p.name,
    o.total_amount,
    r.region_name
FROM prod.customers p
JOIN read_parquet('s3://data-lake/orders/2026/*.parquet') o
    ON p.customer_id = o.customer_id
JOIN read_csv('regions.csv') r
    ON p.region_code = r.code
WHERE o.order_date >= '2026-01-01';

This is the equivalent of what you would need Presto, Trino, or Athena for — but running on your laptop with zero infrastructure.

Before (traditional approach): Postgres, S3 Parquet, and CSV files feed an ETL pipeline (Airflow + Spark) into a data warehouse (Redshift). Hours of setup, $$$ in infrastructure.
After (DuckDB): the same sources queried directly after pip install duckdb. Results in seconds, $0 in infrastructure.

pg_duckdb: DuckDB Inside Postgres

Perhaps the most audacious integration is pg_duckdb, which reached version 1.0 in 2025. It embeds DuckDB's vectorized columnar engine inside PostgreSQL as a native extension. Set duckdb.force_execution=true and your existing analytical queries — the slow ones with full table scans, GROUP BYs, and window functions — automatically route through DuckDB's engine. Same SQL, same Postgres client, dramatically faster analytical performance.
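In practice that looks something like the following sketch of a psql session, assuming pg_duckdb is installed on the server (the orders table is illustrative):

```sql
-- Enable the extension once per database
CREATE EXTENSION pg_duckdb;

-- Route eligible queries through DuckDB's vectorized engine
SET duckdb.force_execution = true;

-- The same Postgres SQL, now executed by DuckDB
SELECT customer_id,
       date_trunc('month', order_date) AS month,
       SUM(total_amount) AS revenue
FROM orders
GROUP BY customer_id, month;
```

The client-facing behavior is unchanged; only the execution engine underneath the query differs.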

The Numbers Tell the Story

DuckDB's growth is not incremental — it is exponential:

  • 25 million monthly PyPI downloads — up from 6 million at v1.1
  • 30,000+ GitHub stars — doubling roughly every 12 months
  • Stack Overflow 2025 — usage jumped from 1.4% to 3.3%, now the #4 database, just 0.2% behind SQLite
  • DB-Engines ranking — climbed from #81 to #51 in a single year, with a 50.7% growth rate
DuckDB adoption trajectory: 25M monthly PyPI downloads, 30K+ GitHub stars (roughly 2x per year), #4 database in the Stack Overflow 2025 survey, 127 extensions (24 core + 103 community), and use at 1,200+ companies including Facebook, Google, Airbnb, and Okta.

DuckDB 1.4 LTS: Enterprise-Ready

The September 2025 release of DuckDB 1.4.0 "Andium" marks a major milestone — it is the first Long-Term Support release, backed by one year of community support. The headline features signal that DuckDB is no longer just a developer tool — it is production infrastructure:

  • AES-256 encryption (GCM mode) — covers the database file, write-ahead log, and even temporary files. Supports hardware-accelerated OpenSSL for production-grade performance. This was a hard requirement for many enterprises.
  • MERGE INTO statement — the bread and butter of OLAP data loading. Unlike INSERT ... ON CONFLICT, MERGE works without primary keys using arbitrary conditions — critical for data warehousing patterns.
  • Iceberg write support — DuckDB can now read and write Iceberg tables, with full DML (INSERT, UPDATE, DELETE) as of January 2026. This turns DuckDB into a bridge between local development and your production lakehouse.
-- Encrypt a DuckDB database with AES-256: supply the key when attaching
ATTACH 'encrypted.duckdb' AS enc (ENCRYPTION_KEY 'my_secret_key');

-- MERGE: upsert without primary keys
MERGE INTO dim_customers t
USING staging_customers s ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET name = s.name, updated_at = now()
WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.name, now());

-- Write directly to Iceberg tables
COPY FROM DATABASE duckdb_db TO iceberg_catalog;

DuckLake: Rethinking the Lakehouse

Perhaps the boldest move from DuckDB Labs in 2025 was DuckLake — a new open lakehouse format that directly challenges Iceberg and Delta Lake. The insight is deceptively simple: instead of storing catalog metadata in a complex hierarchy of JSON and Avro manifest files (as Iceberg does), store it in a standard SQL database — DuckDB, SQLite, Postgres, or MySQL.

Data still lives in Parquet files on S3 or local storage. But the metadata — schemas, partition info, snapshots — lives in a familiar, queryable SQL database. You get full ACID transactions, time travel, and multi-table transactions, with none of the operational complexity of managing Iceberg metadata files.
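A sketch of what that looks like in DuckDB, assuming the ducklake extension is available (the file names and bucket path are made up):

```sql
INSTALL ducklake;

-- Metadata lives in a SQL database; data files go to object storage
ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 's3://my-bucket/lake/');

CREATE TABLE lake.events AS
    SELECT * FROM read_parquet('events.parquet');

-- Every commit is a snapshot, so time travel is just a query
SELECT * FROM lake.events AT (VERSION => 1);
```

Swapping the metadata store for Postgres or MySQL is a matter of changing the ATTACH string, which is exactly the operational simplicity DuckLake is arguing for.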

DuckLake 0.3 is current, with v1.0 targeted for early 2026. It already supports Iceberg interoperability — you can copy between DuckLake and Iceberg bidirectionally.

DuckDB + dbt: The Lean Data Stack

One of the most powerful applications of DuckDB is its integration with dbt through the dbt-duckdb adapter. This combination has given rise to what people are calling the "Modern Data Stack in a Box" — a complete transformation pipeline that runs on your laptop or a single server:

# profiles.yml — that's it. No warehouse credentials needed.
my_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: warehouse.duckdb
      extensions:
        - httpfs
        - parquet
      settings:
        s3_region: us-east-1

With this setup, dbt runs all your SQL models against DuckDB. You can read source data from S3 Parquet files, transform it through your dbt DAG, and write the results back to DuckDB or export to Parquet/Iceberg. The practical applications are significant:

  • Local development: Develop and test dbt models locally against real data without needing Snowflake/BigQuery credentials or incurring cloud costs
  • CI/CD testing: Run your full dbt build in GitHub Actions against DuckDB — validate models, test data quality, and catch regressions before they hit production
  • Cost-effective production: For smaller datasets (up to hundreds of GBs), teams are running the full pipeline — dlt for ingestion, DuckDB for storage, dbt for transformations, Evidence.dev for dashboards — on a single server, replacing $5,000/month cloud warehouse bills
The lean data stack — a complete pipeline with zero cloud infrastructure: dlt (ingest) → DuckDB (store & query) → dbt (transform) → Evidence (dashboard). $0 cloud costs; runs on a laptop or a single $50/month server.
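As a concrete sketch of the CI/CD pattern, here is a minimal GitHub Actions workflow; the file name, Python version, and dbt target are assumptions to adapt to your project:

```yaml
# .github/workflows/dbt-ci.yml -- sketch; adjust paths and targets
name: dbt-ci
on: [pull_request]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install dbt-duckdb
      # dbt build runs models, tests, seeds and snapshots against local DuckDB
      - run: dbt build --profiles-dir . --target dev
```

Because the warehouse is a local file, every pull request gets a full, isolated build with no shared credentials and no cloud spend.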

The WASM Frontier: Analytics in the Browser

DuckDB compiles to WebAssembly and runs entirely in the browser. As of early 2026, this has matured from a novelty into something genuinely useful:

  • Browser-based Iceberg queries — DuckDB-WASM can now query Iceberg REST Catalogs directly from a browser tab. Zero infrastructure. Your laptop, a browser, and an S3 bucket — that is your entire analytics stack.
  • Offline-capable dashboards — using DuckDB-WASM with Service Workers, you can build BI dashboards that cache data locally and continue working without network access.
  • Client-side ETL — validate and transform data before upload for privacy-first workflows. The data never leaves the user's browser.
  • Portable analytics — entire analytical notebooks that can be shared as a single HTML file.

When DuckDB Replaces Spark (and When It Does Not)

The "DuckDB vs Spark" conversation is one of the hottest in data engineering right now. The benchmarks are consistent and striking:

  • At 10 GB scale, DuckDB is roughly 2x faster than Spark across most analytical queries
  • At 40 GB (real-world workload), DuckDB was 2.5x faster than Spark
  • For small-to-medium datasets, DuckDB can be up to 100x faster because it avoids JVM startup, PySpark serialization, and cluster coordination overhead
  • DuckDB on a 12-core laptop ran TPC-H at SF3000 (3 TB) in 47 minutes

The rule of thumb I use: default to DuckDB for everything, and reach for Spark only when your data genuinely exceeds what fits on a single machine (roughly 1+ TB consistently) or when you need Spark's fault tolerance for multi-hour distributed jobs. For the vast majority of data engineering work — exploration, development, testing, and moderate-scale production — DuckDB is faster, simpler, and cheaper.

Getting Started: Five Minutes to Productive

The best thing about DuckDB is how quickly you can go from zero to productive. Here are some recipes I reach for constantly:

import duckdb

# Query a Parquet file on S3 — no download needed
result = duckdb.sql("""
    SELECT region, SUM(revenue) as total
    FROM read_parquet('s3://my-bucket/sales/2026/*.parquet')
    GROUP BY region
    ORDER BY total DESC
""")

# Query a Pandas DataFrame with SQL
import pandas as pd
df = pd.read_csv('local_data.csv')
duckdb.sql("SELECT category, AVG(price) FROM df GROUP BY category")

# Attach Postgres and join with local files
duckdb.sql("ATTACH 'postgres:dbname=prod' AS pg (TYPE postgres)")
duckdb.sql("""
    SELECT u.name, o.*
    FROM pg.users u
    JOIN read_parquet('orders.parquet') o
        ON u.id = o.user_id
""")

# Export query results to Parquet
duckdb.sql("""
    COPY (SELECT * FROM my_table WHERE year = 2026)
    TO 'output.parquet' (FORMAT parquet)
""")

Conclusion

DuckDB represents a fundamental rethinking of what a database can be. It is not trying to replace your production warehouse — it is filling the vast gap between "I need to write a SQL query" and "let me spin up a cluster." For data engineers, it eliminates the infrastructure tax on exploratory analysis, local development, data quality testing, and medium-scale production workloads.

With the 1.4 LTS release bringing encryption, MERGE, and Iceberg writes, the DuckLake project challenging lakehouse complexity, and the WASM runtime enabling zero-infrastructure browser analytics, DuckDB is not slowing down. The duck has landed — and it is here to stay.