Vectorized Query Processing

DuckDB's execution engine processes data in vectors of STANDARD_VECTOR_SIZE (2048) values. Every operator in the pipeline receives a DataChunk — a collection of Vector objects, one per column — and produces a transformed DataChunk as output.

Each Vector stores its data in a flat array optimized for CPU cache access. When the engine executes a filter like WHERE price > 100, it doesn't check one row at a time. Instead, it runs a tight loop comparing 2048 price values against 100 in a single batch. Modern compilers auto-vectorize these loops into SIMD instructions (SSE, AVX), processing 4–8 comparisons per CPU cycle.
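The batch-at-a-time filter can be sketched in a few lines of Python. This is illustrative only — DuckDB's real kernels are templated C++ operating on Vector/DataChunk objects — but it shows the key idea: the filter emits a selection vector (row indices that pass) rather than copying row data.

```python
# Illustrative sketch of a vectorized filter; names are not DuckDB's.
STANDARD_VECTOR_SIZE = 2048

def filter_gt(values, threshold):
    """One tight loop over the whole batch, producing a selection vector
    (indices of passing rows) instead of materializing a new batch."""
    return [i for i, v in enumerate(values) if v > threshold]

prices = [50, 150, 99, 101] * 512        # one batch of 2048 price values
sel = filter_gt(prices, 100)             # WHERE price > 100
# Downstream operators read only the selected rows.
passing = [prices[i] for i in sel]
```

Because the loop body is branch-light and touches contiguous data, a C++ compiler can auto-vectorize the equivalent loop into SIMD compares.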

🔭
Why 2048 values? 2048 values of 8 bytes each = 16KB, which fits comfortably in L1 cache (typically 32–64KB per core). This means the CPU processes the entire vector without cache misses, achieving near-theoretical throughput.

Vector Representations

DuckDB uses multiple vector representations to avoid unnecessary data copying:

  - Flat Vectors: contiguous arrays — the most common and fastest format for computation
  - Constant Vectors: a single value repeated for all rows (e.g., the literal 100 in price > 100)
  - Dictionary Vectors: indices into a dictionary of unique values, used for compressed string data
  - Sequence Vectors: arithmetically generated sequences (e.g., rowid) without materializing values
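A minimal sketch of the four representations shows why they avoid copying — each one answers "what is the value at row i?" without holding a full array (class names here are illustrative, not DuckDB's actual types):

```python
# Illustrative vector representations; not DuckDB's actual classes.
class FlatVector:
    def __init__(self, data): self.data = data       # one slot per row
    def get(self, i): return self.data[i]

class ConstantVector:
    def __init__(self, value): self.value = value    # one value, any row count
    def get(self, i): return self.value

class DictionaryVector:
    def __init__(self, dictionary, indices):         # unique values + indices
        self.dictionary, self.indices = dictionary, indices
    def get(self, i): return self.dictionary[self.indices[i]]

class SequenceVector:
    def __init__(self, start, increment):            # nothing materialized
        self.start, self.increment = start, increment
    def get(self, i): return self.start + i * self.increment

names = DictionaryVector(["US", "EU"], [0, 1, 1, 0])
rowid = SequenceVector(0, 1)
```

A constant vector represents the literal in `price > 100` for all 2048 rows with a single stored value; a sequence vector computes `rowid` on demand.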

Push-Based Pipeline Execution

DuckDB's executor arranges physical operators into pipelines — chains of operators that process data without materializing intermediate results. Each pipeline has a source that produces data, zero or more streaming operators in between (filters, projections, hash-join probes), and a sink that consumes the result.

Pipeline breaks occur at operators that must see all input before producing output — hash joins (build side), aggregations, sorts, and window functions. These operators become the sink of one pipeline and the source of the next.

Example: SELECT region, SUM(amount) FROM sales WHERE year = 2025 GROUP BY region ORDER BY SUM(amount)

  1. Pipeline 1 — Parallel Scan & Aggregate: TableScan → Filter (year=2025) → HashAggregate (build). Each thread scans different row groups in parallel.
  2. Pipeline 2 — Sort Aggregated Results: HashAggregate (scan) → Sort (build). Reads aggregated results and sorts them by SUM(amount).
  3. Pipeline 3 — Output: Sort (scan) → ResultCollector. Produces final ordered output to the client.
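Pipeline 1 above can be sketched as a push-based loop: the source yields chunks, streaming operators transform them in place, and the sink (a hash aggregate) accumulates state. This is a toy Python model of the control flow, not DuckDB's executor:

```python
# Toy push-based pipeline: scan -> filter -> hash aggregate (build).
from collections import defaultdict

def table_scan(rows, chunk_size=3):
    """Source: yields the table in small chunks (stand-ins for DataChunks)."""
    for i in range(0, len(rows), chunk_size):
        yield rows[i:i + chunk_size]

def run_pipeline(rows):
    sums = defaultdict(int)                               # sink state
    for chunk in table_scan(rows):                        # source
        chunk = [r for r in chunk if r["year"] == 2025]   # streaming filter
        for r in chunk:                                   # sink: aggregate build
            sums[r["region"]] += r["amount"]
    return dict(sums)

rows = [{"year": 2025, "region": "EU", "amount": 10},
        {"year": 2024, "region": "EU", "amount": 99},
        {"year": 2025, "region": "US", "amount": 5},
        {"year": 2025, "region": "EU", "amount": 7}]
result = run_pipeline(rows)
# → {"EU": 17, "US": 5}
```

No intermediate result is ever materialized: each chunk flows from the scan through the filter into the aggregate's hash table, and only the sink survives the pipeline.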

Morsel-Driven Parallelism

DuckDB parallelizes query execution using the morsel-driven model (SIGMOD 2014). Instead of partitioning data upfront or using exchange operators (like Spark), DuckDB uses a dynamic work-stealing approach:

  1. The source operator divides its input into morsels (chunks of row groups)
  2. Worker threads from the task scheduler each grab a morsel
  3. Each thread pushes its morsel through the entire pipeline independently
  4. Threads share the same sink state (e.g., a concurrent hash table) using thread-local partitions that are merged at the end
💡
Why this scales linearly: No synchronization between threads during pipeline execution (each thread has its own pipeline executor state). Work is dynamically distributed — fast threads grab more morsels, naturally load-balancing. The task scheduler limits concurrency to available hardware threads, avoiding over-subscription.

The Pipeline::ScheduleParallel() method checks that all operators in the chain support parallelism before enabling multi-threaded execution. If any operator requires ordered processing, the pipeline falls back to sequential execution.
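The morsel-driven steps above can be sketched with a shared work queue and thread-local aggregation state — a simplified stand-in for DuckDB's task scheduler and partitioned sink:

```python
# Sketch of morsel-driven work stealing: threads pull morsels from a shared
# queue, aggregate into thread-local state, and merge once at the end.
import queue
import threading
from collections import Counter

def parallel_aggregate(data, morsel_size=4, n_threads=3):
    work = queue.Queue()
    for i in range(0, len(data), morsel_size):
        work.put(data[i:i + morsel_size])       # one morsel per queue entry
    locals_ = [Counter() for _ in range(n_threads)]

    def worker(tid):
        while True:
            try:
                morsel = work.get_nowait()      # dynamic distribution:
            except queue.Empty:                 # fast threads grab more morsels
                return
            for key, amount in morsel:
                locals_[tid][key] += amount     # thread-local, no locking

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()

    merged = Counter()                          # merge step at pipeline end
    for c in locals_:
        merged.update(c)
    return dict(merged)

data = [("EU", 1), ("US", 2)] * 10
totals = parallel_aggregate(data)
```

The only synchronization is the queue pop and the final merge; the per-morsel work is entirely lock-free, which is why the model scales with core count.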

Storage Engine

DuckDB uses a custom single-file storage format designed for analytical access patterns.

Block-Based I/O

The fundamental I/O unit is a 256KB block. This size balances sequential read performance (large enough for SSDs/HDDs to read efficiently) with memory management granularity (small enough to manage in a buffer pool).

Row Groups

Tables are horizontally partitioned into row groups of ~122,880 rows (60 × STANDARD_VECTOR_SIZE). Each row group stores columns independently with per-column compression, per-column zone maps (min/max statistics) for predicate pushdown, and per-column null bitmasks.

Buffer Manager

DuckDB includes a unified buffer manager that enables out-of-core processing. When available memory is exhausted, the buffer manager transparently spills data to disk using temporary files. This means DuckDB can process datasets significantly larger than RAM — the system handles the spilling automatically with no user configuration.
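The spilling behavior can be illustrated with a toy buffer pool that holds a fixed number of blocks in memory and writes the least-recently-used block to a temp file on overflow. This is a sketch of the concept only; DuckDB's buffer manager is far more sophisticated (pinning, block eviction policy, shared temporary files):

```python
# Toy out-of-core buffer pool: at most `capacity` blocks stay in memory;
# the LRU block is transparently spilled to a temp file and reloaded on access.
import os
import pickle
import tempfile
from collections import OrderedDict

class BufferPool:
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = OrderedDict()            # block_id -> data, LRU order
        self.spill_dir = tempfile.mkdtemp()

    def _spill_path(self, block_id):
        return os.path.join(self.spill_dir, f"block_{block_id}.tmp")

    def put(self, block_id, data):
        self.memory[block_id] = data
        self.memory.move_to_end(block_id)
        if len(self.memory) > self.capacity:
            victim, vdata = self.memory.popitem(last=False)   # evict LRU
            with open(self._spill_path(victim), "wb") as f:
                pickle.dump(vdata, f)                         # spill to disk

    def get(self, block_id):
        if block_id in self.memory:
            self.memory.move_to_end(block_id)                 # mark recently used
            return self.memory[block_id]
        with open(self._spill_path(block_id), "rb") as f:     # reload spill
            data = pickle.load(f)
        self.put(block_id, data)
        return data

pool = BufferPool(capacity=2)
for i in range(5):
    pool.put(i, list(range(i, i + 3)))   # blocks 0..2 get spilled to disk
```

From the caller's perspective, `get()` always succeeds — exactly the property that lets a query process more data than fits in RAM.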

Write-Ahead Log

For persistent databases, DuckDB uses a WAL for crash recovery. Writes go to the WAL first, and periodic checkpoints compact the WAL into the main database file. Since v1.5.0, checkpointing is non-blocking — reads and writes can proceed concurrently during checkpoint operations using MVCC.

📊
Non-blocking checkpoints yield ~17% throughput improvement on mixed read-write workloads compared to the previous blocking checkpoint model.

Compression Codecs

DuckDB automatically selects the best compression codec per column segment based on data characteristics:

📌 Constant — all identical values in the segment. Stores a single value regardless of row count — maximum compression.

🔁 RLE — Run-Length Encoding for consecutive repeated values. Stores (value, count) pairs instead of individual entries.

📖 Dictionary — low-cardinality columns. Stores a dictionary of unique values plus compact indices into that dictionary.

🧩 Bitpacking — integers with limited range. Uses the minimal number of bits per value (e.g., 3 bits for values 0–7).

🔤 FSST — Fast Static Symbol Table for string compression. Replaces frequent byte sequences with short codes, achieving 2–5× compression.

📈 Chimp/Patas — floating-point compression for time-series data. Exploits the fact that consecutive float values often share many bits.

🎯 ALP — Adaptive Lossless floating-Point compression for doubles. Encodes values as scaled integers where possible for superior compression.
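Two of the simpler codecs above — RLE and bitpacking — can be sketched in a few lines each (illustrative encodings, not DuckDB's on-disk format):

```python
# Illustrative sketches of RLE and bitpacking; not DuckDB's actual format.

def rle_encode(values):
    """Run-length encoding: consecutive repeats become (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def bitpack(values):
    """Bitpacking: use just enough bits for the segment's maximum value."""
    width = max(values).bit_length() or 1
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * width)
    return packed, width

def bitunpack(packed, width, count):
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]

runs = rle_encode([7, 7, 7, 2, 2])       # [[7, 3], [2, 2]]
vals = [0, 5, 7, 3]
packed, width = bitpack(vals)            # 3 bits per value for the range 0–7
```

Because codec choice is per column segment, a column can be RLE-encoded in one row group and bitpacked in the next, depending on the local data distribution.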

Query Optimizer Internals

The optimizer runs 30+ passes sequentially on the logical plan. Key passes include:

Filter Pushdown

Pushes predicates as close to the data source as possible. For a query joining two tables with a filter on one, the optimizer moves the filter below the join so it's applied during the scan — reducing the number of rows that enter the join.

Join Order Optimization

Uses the DPccp algorithm (dynamic programming over connected subgraph–complement pairs), which efficiently enumerates all valid join orderings for up to ~20 tables. For larger joins, it falls back to heuristics. The optimizer also converts cross products into joins when possible and determines the build/probe sides for hash joins based on estimated cardinalities.

Late Materialization

Defers reading column data until it's actually needed. When a scan has a selective filter, DuckDB first scans only the filtered column, determines which rows pass, and then reads the remaining columns for only those rows. This can dramatically reduce I/O for queries with selective predicates on wide tables.
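The access pattern can be sketched as follows — scan only the filtered column, compute a selection, then gather the remaining columns for just the surviving rows (an illustrative model, not DuckDB's scan code):

```python
# Sketch of late materialization over a column-oriented table.
def scan_late_materialized(columns, filter_col, predicate):
    # Step 1: read only the filtered column and build a selection vector.
    sel = [i for i, v in enumerate(columns[filter_col]) if predicate(v)]
    # Step 2: touch the other (possibly wide) columns only for selected rows.
    return {name: [col[i] for i in sel] for name, col in columns.items()}

table = {
    "year":   [2023, 2025, 2024, 2025],
    "region": ["EU", "US", "EU", "EU"],   # imagine many wide columns here
    "amount": [1, 2, 3, 4],
}
result = scan_late_materialized(table, "year", lambda y: y == 2025)
# → {"year": [2025, 2025], "region": ["US", "EU"], "amount": [2, 4]}
```

With a 1%-selective predicate on a 50-column table, step 2 reads roughly 1% of the data for 49 of the 50 columns — the source of the I/O savings.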

Statistics Propagation

Maintains cardinality estimates as the plan transforms. Each optimizer pass updates these estimates, which downstream passes use to make better decisions (e.g., which side of a join to use as the build side for a hash join).

🔭
Zone maps at work: Each column segment stores min/max statistics (zone maps). During a scan with WHERE year = 2025, DuckDB checks each row group's zone map for the year column and skips entire row groups where max < 2025 or min > 2025 — without reading any data.
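The pruning check described in the callout is a simple interval test per row group — sketched here with hypothetical per-group (min, max) statistics:

```python
# Sketch of zone-map pruning for WHERE year = 2025: skip any row group
# whose [min, max] range for the column cannot contain the constant.
def prune_row_groups(zone_maps, value):
    return [gid for gid, (lo, hi) in enumerate(zone_maps)
            if lo <= value <= hi]     # only these groups are read at all

# Hypothetical per-row-group (min, max) statistics for the `year` column:
year_zone_maps = [(2019, 2021), (2022, 2025), (2025, 2025), (2026, 2027)]
surviving = prune_row_groups(year_zone_maps, 2025)
# Groups 0 and 3 are skipped without reading a single value.
```

The same test generalizes to range predicates: for `year > 2025`, a group is skipped whenever its `max <= 2025`.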

Performance Characteristics

Where DuckDB Excels

  - Analytical Scans & Aggregation (Excellent): columnar storage + vectorized execution + automatic parallelism. Competes with ClickHouse and Hyper on TPC-H benchmarks.
  - Parquet/CSV File Queries (Excellent): queries Parquet directly with column pruning and predicate pushdown. A 2GB CSV with 50M rows aggregates in under 3 seconds.
  - Joins on Moderate Data (Excellent): highly optimized hash join with partitioned build/probe and vectorized probing.

Where DuckDB Is Slower

  - Point Lookups by Primary Key (Poor): no B-tree indexes for OLTP-style lookups. SQLite is ~4x faster for single-record retrieval.
  - Concurrent OLTP Workloads (Poor): uses MVCC for transactions but is not optimized for hundreds of concurrent read-write transactions.
  - Very Large Datasets, Multi-TB (Limited): handles out-of-core processing, but truly massive datasets benefit from distributed systems like Spark or ClickHouse.