Architecture — DuckDB Course

High-Level Design

DuckDB follows a classic database architecture organized as a multi-stage query processing pipeline, but with critical design choices optimized for analytical workloads. The system is a single-process, multi-threaded C++ library with zero external dependencies. Every component — from SQL parsing to storage management — is compiled into a single library that runs inside the host application's process space.

The architecture can be summarized as a pipeline:

SQL String → Parser → Binder → Logical Planner → Optimizer → Physical Planner → Executor → Results

Data flows through this pipeline for every query, with each stage transforming the representation from human-readable SQL to machine-executable operators working on columnar vectors.

DuckDB Query Pipeline Architecture

Click any component to see implementation details

📝

Parser

SQL to AST

🔗

Binder

Name & type resolution

📑

Catalog

Metadata registry

📐

Logical Planner

Abstract query plan

⚡

Optimizer

30+ optimization passes

🏗️

Physical Planner

Concrete execution plan

🚀

Executor

Push-based pipelines

📅

Task Scheduler

Thread pool + events

💾

Storage Engine

Single-file, columnar

Processing

Storage

Scheduling

Key Components

Each stage of the pipeline is a distinct component with clear responsibilities. The separation between stages enables clean testing, extensibility, and the ability to inspect the query plan at any point in the transformation.

📝 Parser compute

Converts SQL text into an AST. DuckDB uses a fork of PostgreSQL's battle-tested parser, which provides broad SQL compatibility. The parser is completely catalog-unaware — it produces SQLStatement, QueryNode, TableRef, and ParsedExpression nodes without resolving any names or types. This separation exists because parsing is a syntactic concern; mixing in catalog lookups would create circular dependencies.

🔗 Binder compute

Resolves names and types against the catalog. Transforms parsed nodes into bound equivalents (SQLStatement → BoundStatement, ParsedExpression → Expression) by looking up table names, resolving column references, checking types, and expanding wildcards (SELECT *). Supports CTEs, subqueries, and correlated references through a hierarchical binder chain where child binders inherit parent context.

📐 Logical Planner compute

Produces a tree of LogicalOperator nodes representing the query's intent without specifying how to execute it. Operators include LogicalGet (scan), LogicalFilter, LogicalProjection, LogicalAggregate, LogicalJoin, and more. This abstract representation is what the optimizer works on.

⚡ Optimizer compute

The heart of query performance. DuckDB runs 30+ sequential optimization passes on the logical plan, including:

Filter Pushdown — moves predicates close to the data source to reduce I/O
Join Order Optimization — uses the DPccp algorithm for dynamic-programming-based join enumeration
Column Pruning — removes columns never referenced downstream
Common Subexpression Elimination — deduplicates repeated computation
Late Materialization — defers column reads until actually needed
Statistics Propagation — uses cardinality estimates to guide operator selection
TopN Optimization — converts ORDER BY ... LIMIT N into a single operator

The optimizer is extensible — extensions can inject custom pre- and post-optimization passes.

🏗️ Physical Planner compute

Converts the optimized logical plan into a tree of PhysicalOperator nodes that specify concrete execution strategies. For example, a LogicalJoin might become a PhysicalHashJoin or a PhysicalMergeJoin depending on the join condition, data sizes, and available indexes. The physical planner also resolves column bindings from logical column references to physical vector indices.

🚀 Executor compute

Arranges physical operators into pipelines and schedules them for parallel execution. The executor uses an event-driven model where pipeline completion triggers downstream pipelines. Each pipeline has a source, zero or more intermediate operators, and a sink. Worker threads from the task scheduler each grab a morsel of work and push it through the pipeline independently.

📑 Catalog storage

Manages all database metadata: schemas, tables, views, functions, types, sequences, and extensions. DuckDB supports attaching multiple databases simultaneously (including read-only access to external databases like PostgreSQL, MySQL, and SQLite via extensions), and the catalog provides a unified namespace across all attached databases.

💾 Storage Engine storage

Manages persistent and in-memory data using a single-file format. Data is organized into row groups (~122,880 rows each), with each column in a row group compressed independently. The storage engine includes a buffer manager for out-of-core processing (datasets larger than RAM), a WAL for crash recovery, and checkpoint logic for compacting the WAL into the main database file.

📅 Task Scheduler scheduler

Manages a thread pool that executes pipeline tasks. The scheduler supports both fully parallel pipelines (multiple threads on different data morsels) and sequential pipelines (when operator semantics require ordered execution). It also handles inter-pipeline dependencies through an event system.

Data Flow

To understand how the components interact, let's trace a concrete query through the entire pipeline:

Query Pipeline Walkthrough

SELECT region, SUM(amount) FROM sales WHERE year = 2025 GROUP BY region

Parser

Tokenizes the SQL and produces an AST with a SelectStatement containing a TableRef("sales"), a ComparisonExpression(year = 2025), and a FunctionExpression(SUM(amount)).

↓

Binder

Resolves sales to a physical table in the catalog, verifies that region, amount, and year columns exist with compatible types, and produces bound expressions with resolved type information.

↓

Logical Planner

Creates: LogicalGet(sales) → LogicalFilter(year = 2025) → LogicalAggregate(GROUP BY region, SUM(amount)) → LogicalProjection(region, sum).

↓

Optimizer

Pushes the year = 2025 filter down into the scan (eliminating a separate filter operator), prunes unused columns (only region, amount, year are read), and propagates statistics to estimate result cardinalities.

↓

Physical Planner

Converts to: PhysicalTableScan(sales, filter: year=2025, columns: [region, amount, year]) → PhysicalHashAggregate(GROUP BY region, SUM(amount)).

↓

Executor

Pipeline 1: Scan → Hash Aggregate (build phase). Multiple threads each scan different row groups, checking zone maps to skip groups where year max < 2025 or min > 2025, and push DataChunks into the hash table.
Pipeline 2: Hash Aggregate (probe phase) → Result. A single thread reads the aggregated hash table and produces the final output.

Design Decisions

DuckDB's architecture reflects a series of deliberate trade-offs, each driven by the goal of making analytical queries on local data as fast and frictionless as possible.

💡

In-Process over Client-Server. DuckDB runs inside the application process, eliminating network round-trips and serialization overhead. Analytical queries on local data shouldn't require running a server. The trade-off is that DuckDB isn't designed for concurrent multi-user access (though it supports multiple read-only connections).

💡

PostgreSQL Parser Fork. Rather than building a SQL parser from scratch, DuckDB forked PostgreSQL's parser. This provides mature SQL dialect support (CTEs, window functions, lateral joins) with decades of battle-testing. DuckDB is developing an experimental PEG-based parser (introduced in v1.5.0) for better error messages and extensibility, but the PostgreSQL parser remains the production default.

💡

Push-Based over Pull-Based Execution. Most databases use pull-based (Volcano) execution where the root operator calls next() on children. DuckDB uses push-based execution where sources push data downstream. This eliminates virtual function call overhead per-tuple and enables more natural morsel-driven parallelism — a source can distribute morsels across threads without complex synchronization at every operator boundary.

💡

Single-File Storage. The entire database (data, metadata, indexes) lives in a single file, similar to SQLite. This simplifies deployment, backup, and data sharing. The WAL is stored as a separate file during operation but is compacted into the main file during checkpoints.

💡

Zero External Dependencies. DuckDB compiles with just a C++17 compiler. All dependencies (compression algorithms, parsers, etc.) are vendored. This was a deliberate decision to make embedding trivial — no package manager conflicts, no shared library version issues, no system library requirements.