Should You Use DuckDB?

Answer a few questions to find out whether DuckDB is a good fit for your workload.

Interactive Decision Tree

1. Is your workload analytical (scans, aggregations, batch transforms) or transactional (point reads, single-row writes, real-time updates)?
2. Is your dataset size under ~100GB (fits on a single machine with room for processing)?
3. Do you need concurrent multi-user access (multiple services or users reading and writing simultaneously)?
DuckDB is an excellent fit

Your workload is analytical, fits on one machine, and doesn't require multi-user concurrency. This is DuckDB's sweet spot -- you'll get fast OLAP performance with zero server setup.

⚠️ Consider DuckDB with caveats

Your workload is analytical but the data may exceed single-machine limits. DuckDB works well up to ~500GB--1TB with disk spilling. For truly massive datasets, consider MotherDuck (cloud DuckDB) or distributed systems like ClickHouse, Spark, or BigQuery.

⚠️ Possible, but with limitations

DuckDB supports multiple concurrent readers but only one writer at a time. If your concurrency needs are read-heavy, DuckDB works in READ_ONLY mode from multiple processes. For full read-write concurrency, a client-server database like PostgreSQL is more appropriate.

DuckDB is not the right fit

DuckDB is designed for OLAP, not OLTP. For transactional workloads with point reads and writes, use PostgreSQL, MySQL, or SQLite. DuckDB supports one writer at a time and is optimized for scanning, not key-value lookups.

When to Use It

DuckDB excels in five categories of workloads, all centered on analytical processing without the overhead of a server.

🔍 Ad-Hoc Analytical Queries

You have Parquet, CSV, or JSON files and need answers fast. DuckDB queries these directly without a loading step -- SELECT AVG(price) FROM 'sales.parquet' WHERE region = 'EMEA' just works. This is DuckDB's sweet spot: a data scientist with a laptop and a question.

🔄 Data Pipeline Transformations

Replace pandas or Spark for small-to-medium ETL transformations (up to ~100GB). DuckDB's SQL engine is faster and more memory-efficient than pandas for aggregations and joins. FinQore cut their financial ETL pipeline from 8 hours to 8 minutes by replacing PostgreSQL with DuckDB.

📦 Embedded Analytics

Ship DuckDB inside your product to provide analytical query capabilities without requiring users to set up a database server. Evidence uses DuckDB as a universal SQL engine for BI, and Rill uses it as their analytics backbone (3x--30x faster than SQLite for analytical queries).

📓 Notebook Data Exploration

In Jupyter, R, or Observable notebooks, DuckDB provides instant SQL querying of DataFrames, Arrow tables, and files. Hex reports 5--10x speedups in notebook execution after switching to DuckDB.

🌐 Browser-Based Analytics

DuckDB compiles to WebAssembly, enabling analytical queries directly in web browsers. South Australia's government uses duckdb-wasm for their climate change data dashboard. Mosaic used it to explore 18M data points from the Gaia star catalog entirely in-browser.

When NOT to Use It

DuckDB is deliberately focused on analytical workloads. These are the scenarios where it is the wrong tool.

High-Concurrency OLTP Workloads

DuckDB is not a replacement for PostgreSQL, MySQL, or SQLite for transaction-heavy applications with hundreds of concurrent users doing point reads and writes. It supports one writer at a time and is optimized for scanning, not key-value lookups.

Multi-Terabyte Distributed Datasets

DuckDB runs on a single machine. If your data exceeds what one machine can handle (typically beyond ~500GB--1TB), use distributed systems like Apache Spark, ClickHouse, or BigQuery. MotherDuck extends DuckDB to the cloud with hybrid execution, but for truly massive datasets, native distributed systems are more appropriate.

Real-Time Streaming Ingestion

DuckDB is designed for batch-oriented analytical queries, not continuous streaming. For real-time event processing, use Kafka + Flink or similar streaming architectures. DuckDB works well for querying the results of streaming pipelines after they land in Parquet files.

Multi-User Shared Database

DuckDB is an embedded database -- it's designed for single-user or single-application access. If multiple services need to share a database with concurrent read-write access, use a client-server database like PostgreSQL.

Real-World Examples

Production deployments across industries show DuckDB handling everything from carbon analytics to AI dataset exploration.

Watershed -- Carbon Analytics (10x faster than PostgreSQL)

Watershed processes carbon footprint data for enterprises. They store customer datasets as Parquet files on Google Cloud Storage (largest: ~750MB, 17M rows) and use DuckDB to translate natural-language analytics requests into SQL. DuckDB handles 75,000 daily queries with 10x performance gains over their previous PostgreSQL setup.

Okta -- Enterprise Security (7.5 trillion records processed)

Okta, a Fortune 500 identity provider, uses DuckDB to process security telemetry at massive scale -- 7.5 trillion records. DuckDB's ability to efficiently scan and aggregate columnar data makes it viable for security analytics workloads that would traditionally require dedicated distributed infrastructure.

Hugging Face -- AI Dataset Access (150,000+ datasets queryable via SQL)

Hugging Face integrated DuckDB to provide SQL querying across their 150,000+ AI/ML datasets. Users query datasets directly using hf:// protocol URLs, making it trivial to explore and filter training data without downloading entire datasets.

NSW Department of Education -- Data Portal (modern data stack with local-first dev)

Australia's NSW Department of Education uses DuckDB as part of a modern data stack (Dagster + dbt + dlt + Evidence) for their education data portal. DuckDB enables local-first development and testing of analytical pipelines before deploying to production.

Ibis Project -- Large-Scale Analytics (1.1 billion rows in 38 seconds)

The Ibis team processed 1.1 billion PyPI package download rows in 38 seconds on a laptop using only 1GB of RAM, demonstrating DuckDB's efficiency for large-scale analytical processing on commodity hardware.