Vectors in RAM: N * D * 4 bytes for float32. 1M vectors at 1536d = ~5.7 GB.
HNSW graph: ~N * m * 2 * 8 bytes. 1M vectors, m=16 = ~256 MB.
With scalar quantization (int8): Vector RAM drops to N * D * 1 byte -- 4x reduction. 1M at 1536d = ~1.4 GB.
With mmap: Vectors on disk, RAM depends on OS page cache. Full dataset can exceed RAM.
Rule of thumb: Budget ~2 bytes/dimension/point for quantized + indexed storage. 10M points at 768d = ~15 GB.
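The budgeting formulas above are simple arithmetic; a small helper (names are illustrative, not part of any Qdrant API) reproduces the figures:

```python
def ram_bytes(n, dim, bytes_per_dim=4, m=16):
    """Estimate RAM per the rules above: vectors = N * D * bytes
    (4 for float32, 1 for int8), HNSW links = ~N * m * 2 * 8 bytes."""
    vectors = n * dim * bytes_per_dim
    graph = n * m * 2 * 8
    return vectors, graph

vec, graph = ram_bytes(1_000_000, 1536)
print(round(vec / 2**30, 1))    # ~5.7 GB of float32 vectors
print(graph // 10**6)           # ~256 MB of HNSW links

vec_int8, _ = ram_bytes(1_000_000, 1536, bytes_per_dim=1)
print(round(vec_int8 / 2**30, 1))  # ~1.4 GB with int8 quantization
```

Note the mixed units: vector sizes are reported in binary GB (GiB), the graph in decimal MB, matching the figures quoted above.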
Cosine: Measures angle between vectors, ignoring magnitude. Best for most text embedding models (OpenAI, sentence-transformers). Qdrant pre-normalizes at insert time, making it as fast as Dot Product.
Dot Product: Measures both direction and magnitude. Use when magnitude carries meaning (popularity-weighted embeddings). Identical to Cosine on L2-normalized vectors.
Euclidean: Straight-line distance. Use for geometric applications or when the model documentation specifically recommends L2.
When in doubt, use Cosine. It is the most universally compatible choice.
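The equivalence claimed above (dot product equals cosine on L2-normalized vectors) is easy to verify, and is exactly why pre-normalizing at insert time lets the cheaper dot product stand in for cosine:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

a, b = [3.0, 4.0], [4.0, 3.0]
an, bn = normalize(a), normalize(b)
# Dot product of normalized vectors == cosine of the originals.
print(abs(dot(an, bn) - cosine(a, b)) < 1e-9)  # True
```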
Scalar (int8): 4x memory reduction, <1% recall loss. Works with all models. Safest default.
Binary: 32x reduction with Hamming distance. Only for compatible models (OpenAI text-embedding-3-large, Cohere embed-v3). Test recall first.
Product: 8-32x reduction for high-dim vectors (1536+). Larger accuracy trade-off.
No quantization: When accuracy is paramount and RAM is abundant, or for small datasets.
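A minimal sketch of the idea behind scalar quantization: map each float to one byte using a per-vector min/max range. This is illustrative only; production implementations differ in details (e.g., Qdrant's scalar quantization exposes a quantile option to clip outliers before computing the range).

```python
def quantize(v):
    """Map floats to uint8 codes over the vector's [min, max] range."""
    lo, hi = min(v), max(v)
    scale = (hi - lo) / 255 or 1.0  # avoid div-by-zero for constant vectors
    codes = [round((x - lo) / scale) for x in v]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Approximate reconstruction: error is at most one quantization step."""
    return [lo + c * scale for c in codes]

v = [0.12, -0.53, 0.98, 0.07]
codes, lo, scale = quantize(v)
restored = dequantize(codes, lo, scale)
max_err = max(abs(a - b) for a, b in zip(v, restored))
print(max_err < scale)  # True: 4x smaller storage, bounded error
```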
Payload filtering (recommended): Add tenant_id to every point, create keyword index, include in every query filter. Qdrant's filter-aware HNSW handles this efficiently.
Shard-per-tenant (v1.16+): Large tenants get dedicated shards, small tenants share a fallback shard. Physical isolation with cost efficiency.
Collection-per-tenant: Strongest isolation but scales poorly beyond ~100 tenants due to per-collection overhead.
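The payload-filtering pattern can be sketched as a query body in the shape of Qdrant's REST filter schema (plain dict; the tenant_id field name follows the convention above, and field names should be checked against your Qdrant version's API reference):

```python
def tenant_query(tenant_id, query_vector, limit=10):
    """Build a search body that scopes results to one tenant via a
    `must` filter on the indexed tenant_id keyword field."""
    return {
        "vector": query_vector,
        "limit": limit,
        "filter": {
            "must": [
                {"key": "tenant_id", "match": {"value": tenant_id}}
            ]
        },
    }

body = tenant_query("acme-corp", [0.1, 0.2, 0.3])
print(body["filter"]["must"][0]["key"])  # tenant_id
```

The important discipline is that the filter is attached to every query path, not applied post-hoc, so the filter-aware HNSW traversal can use it.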
1. Enable scalar quantization with always_ram=true.
2. Tune hnsw_ef (start at 128; lower for latency, raise for recall).
3. Build payload indexes on all filtered fields.
4. Use indexed_only: true to skip unindexed segments.
5. Right-size the HNSW m parameter (16-32).
6. Monitor the optimizer: a rising segment count means it is falling behind.
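The collection-level and query-level knobs from this checklist can be sketched as Qdrant REST-style request bodies (plain dicts here; verify field names against the API reference for your Qdrant version):

```python
# Collection creation body: quantization kept in RAM (step 1) and
# a right-sized HNSW graph (step 5).
collection_config = {
    "vectors": {"size": 1536, "distance": "Cosine"},
    "hnsw_config": {"m": 16},
    "quantization_config": {
        "scalar": {"type": "int8", "always_ram": True}
    },
}

# Per-query search params: recall/latency knob (step 2) and
# skipping unindexed segments (step 4).
search_params = {
    "hnsw_ef": 128,
    "indexed_only": True,
}

print(collection_config["quantization_config"]["scalar"]["always_ram"])  # True
```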
Performance: Qdrant is typically 5-20x faster, especially with filtering. SIMD-optimized, filterable HNSW vs. pgvector's general-purpose extension.
Scalability: Qdrant supports horizontal sharding and replication. pgvector is limited to PostgreSQL's options.
Features: Qdrant offers sparse vectors, named multi-vectors, multiple quantization methods, hybrid search with RRF.
pgvector wins when: small dataset (<1M vectors), simple filtering, you want a single database, operational simplicity matters most.
100M+ vectors with appropriate configuration. 100M at 768d with scalar quantization needs ~77 GB for quantized vectors + ~26 GB for the HNSW graph (m=16, per the formula above). At ~103 GB total, it fits on 128 GB RAM.
Beyond ~200M vectors, distributed deployment with multiple shards is recommended. Benchmarks show 3ms p99 latency for 1M OpenAI embeddings.
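The arithmetic behind the 100M-point budget is a direct application of the earlier formulas (decimal GB here). The vector term is fixed by the dimensionality; the graph term depends on the HNSW m value, shown with m=16, the rule-of-thumb default used earlier:

```python
n, dim, m = 100_000_000, 768, 16
quantized_gb = n * dim * 1 / 10**9   # int8: 1 byte per dimension
graph_gb = n * m * 2 * 8 / 10**9     # HNSW links: N * m * 2 * 8 bytes
print(round(quantized_gb, 1))                # 76.8
print(round(graph_gb, 1))                    # 25.6
print(round(quantized_gb + graph_gb, 1))     # 102.4
```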
Upserts: New points go to the appendable segment. Old versions are soft-deleted via ID tracker. During search, only the latest version is returned.
Deletes: Soft-deleted via ID tracker. Points remain in HNSW graph but are excluded from results. VacuumOptimizer periodically rebuilds segments without deleted points.
Why not modify HNSW in-place? Removing graph nodes requires complex rewiring that degrades quality. Soft-delete + periodic rebuild is simpler, faster per operation, and produces better graph quality.
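The soft-delete pattern can be illustrated with a toy segment (class and method names are illustrative, not Qdrant's internals): tombstones accumulate in an ID-tracker set, search filters them out without touching the graph, and a vacuum pass rebuilds without them.

```python
class Segment:
    """Toy model of soft-delete + periodic rebuild."""

    def __init__(self):
        self.points = {}      # point_id -> vector (stand-in for the HNSW graph)
        self.deleted = set()  # ID tracker's tombstones

    def upsert(self, point_id, vector):
        self.points[point_id] = vector
        self.deleted.discard(point_id)  # a re-upserted ID is live again

    def delete(self, point_id):
        self.deleted.add(point_id)  # graph untouched; excluded at search time

    def search_ids(self):
        # Search sees only live points, even though deleted ones remain stored.
        return [pid for pid in self.points if pid not in self.deleted]

    def vacuum(self):
        # VacuumOptimizer analogue: rebuild the segment without deleted points.
        self.points = {p: v for p, v in self.points.items()
                       if p not in self.deleted}
        self.deleted.clear()

seg = Segment()
seg.upsert(1, [0.1])
seg.upsert(2, [0.2])
seg.delete(1)
print(seg.search_ids())   # [2]  (point 1 hidden but still stored)
seg.vacuum()
print(len(seg.points))    # 1   (point 1 physically removed)
```

The delete path is O(1), and the expensive graph rebuild is amortized across many operations, which is the trade-off described above.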