Vectorized Query Processing
DuckDB's execution engine processes data in vectors of STANDARD_VECTOR_SIZE (2048) values. Every operator in the pipeline receives a DataChunk — a collection of Vector objects, one per column — and produces a transformed DataChunk as output.
Each Vector stores its data in a flat array optimized for CPU cache access. When the engine executes a filter like WHERE price > 100, it doesn't check one row at a time. Instead, it runs a tight loop comparing 2048 price values against 100 in a single batch. Modern compilers auto-vectorize these loops into SIMD instructions (SSE, AVX), processing 4–8 comparisons per CPU cycle.
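The batched comparison can be sketched in plain Python (illustrative only, not DuckDB source): the filter produces a selection vector of passing row indices rather than copying data.

```python
# Illustrative sketch: a predicate evaluated over one vector of values
# produces a selection vector (indices of rows that pass the filter).
VECTOR_SIZE = 2048

def filter_greater_than(prices, threshold):
    """Tight branch-per-row loop; compilers can auto-vectorize loops like this."""
    sel = []  # selection vector: row indices that satisfy the predicate
    for i in range(len(prices)):
        if prices[i] > threshold:
            sel.append(i)
    return sel

batch = [50, 150, 99, 101] + [100] * (VECTOR_SIZE - 4)
print(filter_greater_than(batch, 100))  # → [1, 3]
```

Downstream operators then process only the selected indices, so no row data is copied by the filter itself.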
Vector Representations
DuckDB uses multiple vector representations to avoid unnecessary data copying:
- Flat: a contiguous array with one entry per row (the default representation)
- Constant: a single stored value logically repeated for every row (e.g., the literal 100 in price > 100)
- Dictionary: per-row indices into a shared dictionary of values, so duplicated values are stored once
- Sequence: a base and an increment describing a generated series (e.g., rowid) without materializing values
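A toy sketch of two of these non-flat representations (class names here are illustrative, not DuckDB's actual types):

```python
# Hypothetical sketch of non-flat vector representations.
class ConstantVector:
    """One stored value stands in for every row; no per-row storage."""
    def __init__(self, value, count):
        self.value, self.count = value, count
    def get(self, i):
        return self.value

class DictionaryVector:
    """Rows are small indices into a shared dictionary of unique values."""
    def __init__(self, dictionary, indices):
        self.dictionary, self.indices = dictionary, indices
    def get(self, i):
        return self.dictionary[self.indices[i]]

v = DictionaryVector(["US", "EU", "APAC"], [0, 0, 2, 1, 0])
print([v.get(i) for i in range(5)])  # → ['US', 'US', 'APAC', 'EU', 'US']
```

Operators that understand these representations can, for example, evaluate a predicate against a constant vector exactly once instead of 2048 times.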
Push-Based Pipeline Execution
DuckDB's executor arranges physical operators into pipelines — chains of operators that process data without materializing intermediate results. Each pipeline has:
- A source that produces DataChunks (e.g., a table scan)
- Zero or more intermediate operators that transform DataChunks in-place (e.g., filter, projection)
- A sink that consumes DataChunks and builds state (e.g., hash table for aggregation, sort buffer for ORDER BY)
Pipeline breaks occur at operators that must see all input before producing output — hash joins (build side), aggregations, sorts, and window functions. These operators become the sink of one pipeline and the source of the next.
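The source/operator/sink shape can be sketched as follows (a minimal illustration, not DuckDB's executor API): chunks flow from the source through intermediate operators into a sink that accumulates state.

```python
# Minimal push-based pipeline sketch: a source yields chunks, intermediate
# operators transform them, and a sink accumulates state until input ends.
def table_scan(rows, chunk_size=3):              # source
    for i in range(0, len(rows), chunk_size):
        yield rows[i:i + chunk_size]

def filter_op(chunks, pred):                     # intermediate operator
    for chunk in chunks:
        yield [r for r in chunk if pred(r)]

class SumSink:                                   # sink: must see all input
    def __init__(self):
        self.total = 0
    def consume(self, chunk):
        self.total += sum(r["amount"] for r in chunk)

rows = [{"amount": a, "year": y} for a, y in [(10, 2025), (5, 2024), (7, 2025)]]
sink = SumSink()
for chunk in filter_op(table_scan(rows), lambda r: r["year"] == 2025):
    sink.consume(chunk)
print(sink.total)  # → 17
```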
Consider the query SELECT region, SUM(amount) FROM sales WHERE year = 2025 GROUP BY region ORDER BY SUM(amount). It splits at its two pipeline breakers: the first pipeline scans sales, filters on year, and sinks into the hash aggregate; the second reads the aggregated groups and sinks into the sort; the final pipeline emits the sorted result.
Morsel-Driven Parallelism
DuckDB parallelizes query execution using the morsel-driven model (SIGMOD 2014). Instead of partitioning data upfront or using exchange operators (like Spark), DuckDB uses a dynamic work-stealing approach:
- The source operator divides its input into morsels (chunks of row groups)
- Worker threads from the task scheduler each grab a morsel
- Each thread pushes its morsel through the entire pipeline independently
- Threads share the same sink state (e.g., a concurrent hash table) using thread-local partitions that are merged at the end
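The dynamics above can be sketched with standard threads (illustrative only; DuckDB's scheduler and sink states are far more elaborate): workers pull morsels from a shared queue, aggregate into thread-local state, and merge at the end.

```python
# Sketch of morsel-driven work sharing: threads grab morsels from a queue,
# build thread-local partial aggregates, then merge them when work runs out.
import queue
import threading
from collections import Counter

def run_parallel(rows, morsel_size=4, n_threads=3):
    work = queue.Queue()
    for i in range(0, len(rows), morsel_size):
        work.put(rows[i:i + morsel_size])       # each entry is one morsel
    partials = []                               # one partial aggregate per thread

    def worker():
        local = Counter()                       # thread-local partition
        while True:
            try:
                morsel = work.get_nowait()      # dynamic scheduling: grab next morsel
            except queue.Empty:
                break
            for region, amount in morsel:
                local[region] += amount
        partials.append(local)                  # list.append is atomic in CPython

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    merged = Counter()                          # final merge step
    for p in partials:
        merged.update(p)
    return dict(merged)

rows = [("EU", 1), ("US", 2)] * 10
print(run_parallel(rows))  # → {'EU': 10, 'US': 20}
```

Because each thread only touches its local `Counter` until the merge, no locking is needed during the hot loop; this is the same reason DuckDB favors thread-local partitions over a single contended hash table.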
The Pipeline::ScheduleParallel() method checks that all operators in the chain support parallelism before enabling multi-threaded execution. If any operator requires ordered processing, the pipeline falls back to sequential execution.
Storage Engine
DuckDB uses a custom single-file storage format designed for analytical access patterns.
Block-Based I/O
The fundamental I/O unit is a 256KB block. This size balances sequential read performance (large enough for SSDs/HDDs to read efficiently) with memory management granularity (small enough to manage in a buffer pool).
Row Groups
Tables are horizontally partitioned into row groups of ~122,880 rows (60 × STANDARD_VECTOR_SIZE). Each row group stores columns independently with per-column compression, per-column zone maps (min/max statistics) for predicate pushdown, and per-column null bitmasks.
Buffer Manager
DuckDB includes a unified buffer manager that enables out-of-core processing. When available memory is exhausted, the buffer manager transparently spills data to disk using temporary files. This means DuckDB can process datasets significantly larger than RAM — the system handles the spilling automatically with no user configuration.
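A toy version of the spilling mechanism (purely illustrative; DuckDB's buffer manager is far more sophisticated): when the in-memory block budget is exceeded, the least-recently-used block is written to a temporary file and transparently reloaded on demand.

```python
# Toy out-of-core buffer pool: evict LRU blocks to temp files when over budget.
import os
import pickle
import tempfile
from collections import OrderedDict

class TinyBufferPool:
    def __init__(self, max_blocks):
        self.max_blocks = max_blocks
        self.memory = OrderedDict()            # block_id -> data, in LRU order
        self.spill_dir = tempfile.mkdtemp()

    def _spill_path(self, block_id):
        return os.path.join(self.spill_dir, f"block_{block_id}.tmp")

    def put(self, block_id, data):
        self.memory[block_id] = data
        self.memory.move_to_end(block_id)
        while len(self.memory) > self.max_blocks:
            victim, vdata = self.memory.popitem(last=False)   # evict LRU block
            with open(self._spill_path(victim), "wb") as f:
                pickle.dump(vdata, f)

    def get(self, block_id):
        if block_id not in self.memory:        # transparently reload from disk
            with open(self._spill_path(block_id), "rb") as f:
                self.put(block_id, pickle.load(f))
        self.memory.move_to_end(block_id)
        return self.memory[block_id]

pool = TinyBufferPool(max_blocks=2)
for i in range(4):
    pool.put(i, list(range(i, i + 3)))
print(pool.get(0))  # → [0, 1, 2]  (reloaded after being spilled)
```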
Write-Ahead Log
For persistent databases, DuckDB uses a WAL for crash recovery. Writes go to the WAL first, and periodic checkpoints compact the WAL into the main database file. Since v1.5.0, checkpointing is non-blocking — reads and writes can proceed concurrently during checkpoint operations using MVCC.
Compression Codecs
DuckDB automatically selects the best compression codec per column segment based on data characteristics:
Constant
All identical values in the segment. Stores a single value regardless of row count — maximum compression.
RLE
Run-Length Encoding for consecutive repeated values. Stores (value, count) pairs instead of individual entries.
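The (value, count) scheme is simple to sketch:

```python
# Run-length encoding as described above (illustrative sketch).
def rle_encode(values):
    runs = []  # list of (value, count) pairs
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((v, 1))               # start a new run
    return runs

def rle_decode(runs):
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

data = ["a", "a", "a", "b", "b", "a"]
print(rle_encode(data))  # → [('a', 3), ('b', 2), ('a', 1)]
```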
Dictionary
Low-cardinality columns. Stores a dictionary of unique values plus compact indices into that dictionary.
Bitpacking
Integers with limited range. Uses the minimal number of bits per value (e.g., 3 bits for values 0–7).
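A minimal sketch of the idea, combined with a frame of reference (subtracting the segment minimum first, a common companion technique):

```python
# Frame-of-reference bitpacking sketch: store each value with only as many
# bits as the segment's range requires.
def pack(values):
    base = min(values)                                       # frame of reference
    width = max((v - base).bit_length() for v in values) or 1  # bits per value
    packed, bits_used = 0, 0
    for v in values:
        packed |= (v - base) << bits_used
        bits_used += width
    return base, width, packed

def unpack(base, width, packed, count):
    mask = (1 << width) - 1
    return [base + ((packed >> (i * width)) & mask) for i in range(count)]

base, width, packed = pack([1000, 1003, 1001, 1007])
print(width)                           # → 3  (offsets from the base span 0–7)
print(unpack(base, width, packed, 4))  # → [1000, 1003, 1001, 1007]
```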
FSST
Fast Static Symbol Table for string compression. Replaces frequent byte sequences with short codes, achieving 2–5× compression.
Chimp/Patas
Floating-point compression for time-series data. Exploits the fact that consecutive float values often share many bits.
ALP
Adaptive Lossless floating-Point compression for doubles. Encodes values as integer linear combinations for superior compression.
Query Optimizer Internals
The optimizer runs 30+ passes sequentially on the logical plan. Key passes include:
Filter Pushdown
Pushes predicates as close to the data source as possible. For a query joining two tables with a filter on one, the optimizer moves the filter below the join so it's applied during the scan — reducing the number of rows that enter the join.
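The effect is easy to demonstrate with plain lists (an illustrative nested-loop join, not DuckDB's implementation): filtering before the join shrinks one input, so the join processes fewer rows while producing the same result.

```python
# Why filter pushdown matters: filter-then-join touches fewer rows than
# join-then-filter, with identical output.
orders = [{"cust": c, "year": y} for c in range(100) for y in (2024, 2025)]
customers = [{"cust": c, "region": "EU"} for c in range(100)]

# Without pushdown: join everything, then apply the filter.
joined = [(o, c) for o in orders for c in customers if o["cust"] == c["cust"]]
late = [p for p in joined if p[0]["year"] == 2025]

# With pushdown: filter during the scan, then join the smaller input.
filtered = [o for o in orders if o["year"] == 2025]
early = [(o, c) for o in filtered for c in customers if o["cust"] == c["cust"]]

assert late == early                  # same result either way
print(len(joined), len(early))        # → 200 100: half as many rows enter the join
```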
Join Order Optimization
Uses the DPccp algorithm (dynamic programming over connected subgraph and complement pairs), which efficiently enumerates all valid join orderings for up to ~20 tables. For larger joins, it falls back to heuristics. The optimizer also converts cross products into joins when possible and determines the build/probe sides for hash joins based on estimated cardinalities.
Late Materialization
Defers reading column data until it's actually needed. When a scan has a selective filter, DuckDB first scans only the filtered column, determines which rows pass, and then reads the remaining columns for only those rows. This can dramatically reduce I/O for queries with selective predicates on wide tables.
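The two-phase scan looks roughly like this (illustrative sketch over in-memory columns):

```python
# Late materialization sketch: read only the filter column first, then fetch
# the remaining columns for just the passing rows.
price   = [10, 250, 30, 400, 55]    # columnar storage: one list per column
name    = ["a", "b", "c", "d", "e"]
comment = ["..."] * 5               # a wide column we'd rather not read eagerly

# Phase 1: scan only the filtered column and collect passing row ids.
row_ids = [i for i, p in enumerate(price) if p > 100]

# Phase 2: materialize the other columns only for those rows.
result = [(name[i], price[i], comment[i]) for i in row_ids]
print(result)  # → [('b', 250, '...'), ('d', 400, '...')]
```

With a 1% selective filter on a 50-column table, phase 2 reads roughly 1% of the non-filter columns instead of all of them.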
Statistics Propagation
Maintains cardinality estimates as the plan transforms. Each optimizer pass updates these estimates, which downstream passes use to make better decisions (e.g., which side of a join to use as the build side for a hash join).
Zone Map Pruning
For a filter such as WHERE year = 2025, DuckDB checks each row group's zone map for the year column and skips entire row groups where max < 2025 or min > 2025, without reading any data.
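The pruning test itself is a trivial comparison against the stored min/max (illustrative sketch):

```python
# Zone-map pruning sketch: skip a row group when its min/max statistics prove
# no row inside it can satisfy an equality predicate.
def can_skip(zone_min, zone_max, target):
    """True if `col = target` cannot match any row in this row group."""
    return zone_max < target or zone_min > target

row_groups = [
    {"min": 2019, "max": 2021, "rows": "..."},   # skipped: max < 2025
    {"min": 2021, "max": 2025, "rows": "..."},   # scanned: 2025 in [min, max]
    {"min": 2026, "max": 2027, "rows": "..."},   # skipped: min > 2025
]
scanned = [g for g in row_groups if not can_skip(g["min"], g["max"], 2025)]
print(len(scanned))  # → 1: only the middle row group is read
```

On data that is naturally ordered or clustered by the filtered column (timestamps, auto-incrementing ids), this prunes most row groups before any I/O happens.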