Vector Indexing Under the Hood: HNSW vs IVF-PQ vs Flat Vector Indexes

Understanding proximity graphs, quantization, clustering, and search speed/recall tradeoffs.

Written by Shyank
Shyank
Banner

SHARE

In modern artificial intelligence infrastructure, vector search is the critical engine that powers Retrieval-Augmented Generation (RAG), semantic recommendation systems, and agentic workflows. As applications scale from a few thousand prototype documents to billions of multi-modal embeddings, the indexing strategies deployed behind the scenes dictate whether a system remains responsive and cost-effective or degrades under memory bloat and latency spikes.

Traditional relational database indexing relies on structured, sorted trees (such as B-Trees) to navigate 1D ordered data in logarithmic time. Lexical search engines use inverted indices to map discrete words to document occurrences. Neither of these models applies to dense vector embeddings. Embeddings are high-dimensional, continuous representations (often 768 or 1536 dimensions) where relevance is defined by mathematical distance (e.g., Euclidean distance or Cosine similarity) in a high-dimensional space.

To query these spaces efficiently, database engineers must choose between Exact Nearest Neighbor (kNN) and Approximate Nearest Neighbor (ANN) search. This article provides an under-the-hood, algorithmic, and architectural deep dive into the three foundational vector indexing strategies: Flat (exact) indices, Hierarchical Navigable Small World (HNSW) proximity graphs, and Inverted File with Product Quantization (IVF-PQ). We will analyze their mathematical formulations, memory layouts, production failure modes, and hardware optimization strategies.


What Is It?

To evaluate vector indexing options, we must first define what each index structure is and what physical representation it constructs in memory or disk.

1. Flat Index (The Baseline)

A Flat index is the simplest possible vector index: it performs no transformation, clustering, or graph construction on the raw vectors. It stores the multidimensional array of floating-point numbers sequentially in memory or on disk.

  • Search Method: When a query vector is received, the engine performs a brute-force linear scan, calculating the exact distance between the query vector and every single vector in the index.
  • Recall: Because it calculates the exact similarity score for every candidate, a Flat index guarantees a recall of 1.0 (100% accuracy). It is the ground truth against which all approximate indexing algorithms are measured.

2. HNSW (Hierarchical Navigable Small World)

HNSW is a graph-based Approximate Nearest Neighbor (ANN) search algorithm. It structures the multi-dimensional vector space as a multi-layer graph, inspired by the concept of skip lists in 1D data structures and the "small-world" property of social networks.

  • Structure: The index builds a hierarchy of graph layers. The top layer (Layer L) is extremely sparse with long-range links, while the bottom layer (Layer 0) is highly dense containing all vectors and their immediate local neighbors.
  • Search Method: The query starts at the top layer, routing through long-range connections to find the closest local node, then drops to the next layer down to refine the search. This multi-layered navigation scales logarithmically, offering high search speeds and outstanding recall.

3. IVF-PQ (Inverted File with Product Quantization)

IVF-PQ is a hybrid indexing strategy designed to achieve extreme memory compression and high query throughput on massive datasets. It combines space partitioning (Inverted File - IVF) with vector compression (Product Quantization - PQ).

  • Clustering (IVF): The high-dimensional space is divided into a discrete number of Voronoi cells using K-Means clustering. Each cell is represented by a centroid. During insertion, vectors are assigned to their nearest centroid's bucket (the inverted list).
  • Quantization (PQ): To prevent storing raw high-dimensional floats, PQ splits each vector into smaller sub-vectors, quantizes each sub-vector subspace independently using a localized codebook, and replaces the sub-vectors with short, 8-bit integer indices. The raw vector is compressed into a compact array of byte codes.

Why It Matters

Selecting the correct vector index is a critical system design decision that directly impacts application performance, infrastructure costs, and quality of service. In production environments, search speed and memory footprint represent a stark trade-off.

If you rely solely on Flat indexing, search latency scales linearly with the dataset size. For a dataset of 100,000 vectors of 1536 dimensions, a sequential scan might take 10 milliseconds. However, at 10 million vectors, that same query will exceed 1,000 milliseconds, rendering the search engine useless for real-time applications.

To scale, systems must transition to ANN indices. However, approximate indices sacrifice 100% recall guarantee for lower latency, introducing a delicate Pareto frontier:

Search Recall (Accuracy) vs. Query Latency (QPS)
   High |        [Flat] (1.0 Recall / Low QPS)
        |          \
        |           \
        |            * [HNSW] (0.98 Recall / High QPS)
        |             \
        |              * [IVF-PQ] (0.90 Recall / Maximum QPS)
   Low  +--------------------------------------
        Slow                               Fast
                      Search Latency

Beyond speed, memory is often the primary bottleneck. Storing 100 million raw vectors of 1536 dimensions requires approximately 614 GB of memory just for the raw coordinates. If using HNSW, the graph structures add substantial memory overhead, often expanding the required memory footprint to over 1.2 TB of RAM. For many organizations, the infrastructure cost of maintaining such high-RAM instances is prohibitive. IVF-PQ can compress this same dataset down to less than 20 GB of memory, allowing it to run on a single, inexpensive cloud instance.

Evaluating these indices in production requires rigorous tracking of search quality. Developers should integrate robust metrics to evaluate retrieval performance under realistic query distributions, as detailed in our guide on RAG Evaluation Metrics: RAGAS and TruLens.


How It Works

Understanding these indexing strategies requires a deep dive into the underlying mathematical algorithms that drive index creation and query execution.

1. Flat Indexing Mechanics

A Flat index relies on direct distance computations. Given a query vector Q and a database vector V in D-dimensional space:

Euclidean (L2) Distance:
d_L2(Q, V) = sqrt( sum_{i=1}^{D} (Q_i - V_i)^2 )

Inner Product (IP) Distance:
d_IP(Q, V) = sum_{i=1}^{D} Q_i * V_i

Cosine Similarity:
d_Cosine(Q, V) = (Q . V) / (||Q|| * ||V||)

For Cosine Similarity, if the vectors are unit-normalized (length of 1.0) during ingestion, the calculation simplifies to a dot product, which can be heavily optimized using hardware SIMD (Single Instruction, Multiple Data) instructions such as AVX-512 or ARM NEON.

Distance Metric Comparison Table

The choice of distance metric dictates how similarity is calculated between vectors. The table below compares the three primary distance metrics used in production vector databases:

Distance MetricMathematical RepresentationOptimization StrategyBest Suited ForTypical Models & Use Cases
L2 (Euclidean) Distanced = sqrt( sum( (Q_i - V_i)^2 ) )SIMD-based subtraction and squaring (AVX-512 / ARM NEON)Datasets where coordinate magnitude is physically meaningfulImage retrieval, audio classification, classic clustering
Inner Product (IP)d = sum( Q_i * V_i )Fused Multiply-Add (FMA) hardware instructions, GPU tensor coresNon-normalized embeddings, dot-product query scalingDocument retrieval, recommendation scoring, Cohere embeddings
Cosine Similarityd = (Q . V) / ( ||Q|| * ||V|| )Unit-normalize vectors during ingestion, then perform fast IP searchHigh-dimensional text search where length normalization is criticalOpenAI text-embedding-3, Ada-002, Sentence-Transformers

2. HNSW Algorithmic Mechanics

HNSW structures data as a hierarchical network. Each vector inserted into the database is assigned an intrinsic maximum layer l using an exponential decay probability distribution:

l = floor( -ln(uniform_random(0, 1)) * m_L )

Here, m_L is a normalization factor that determines the rate of decay. Most nodes will be assigned to Layer 0. Fewer nodes will reach Layer 1, even fewer Layer 2, and very few will make it to the top layer.

HNSW Layer Hierarchy:

Layer 2 (Sparse)      [Node A] ---------------------------- [Node B]
                         |                                     |
Layer 1 (Medium)      [Node A] -------- [Node C] ---------- [Node B]
                         |                 |                   |
Layer 0 (Dense)       [Node A] - [Node D] - [Node C] - [Node E] - [Node B]

Index Construction Phase:

When a new vector V is inserted:

  1. The algorithm starts at the top layer with an entry point node.
  2. It performs a greedy search on the current layer: it calculates the distance from V to the entry point and all its neighbors. It moves to the neighbor closest to V.
  3. When a local minimum is reached on the current layer, the algorithm uses that local minimum as the entry point for the next layer down.
  4. Once it reaches the assigned insertion layer l of the new node V, it adds V to the graph. It finds the M nearest neighbors on that layer and establishes bidirectional links.
  5. To optimize the graph quality and ensure strong global connectivity, the algorithm uses a dynamic candidate list of size ef_construction during neighbor selection.

Index Query Phase:

At query time:

  1. A greedy search starts from the entry point at the top layer.
  2. The search travels down the layers, updating the nearest neighbor.
  3. Upon reaching Layer 0, the search switches from a simple greedy search to maintaining a priority queue of size ef_search containing the closest elements found so far.
  4. The search continues exploring neighbors of the elements in the queue until the closest unexplored neighbor is further than the furthest element in the queue.

3. IVF-PQ Algorithmic Mechanics

IVF-PQ uses a two-stage coarse-to-fine quantization process to partition and compress the vector space.

Coarse Quantization (IVF Partitioning):

  1. During index training, a representative subset of the dataset is clustered using K-Means to identify n_list centroids: [C_1, C_2, ..., C_{n_list}].
  2. Each centroid represents a Voronoi cell partition.
  3. Every vector in the dataset is assigned to its nearest coarse centroid C_i.
  4. An inverted list index is built where centroid C_i maps to a list of vectors belonging to its cell.

Fine Quantization (Product Quantization Compression):

  1. To compress the vectors within each cluster, the original D-dimensional space is split into m orthogonal sub-spaces, each of dimension d_sub = D / m.
  2. For a 1536-dimensional vector with m = 96 sub-spaces, each sub-vector has a dimension of 16.
  3. For each sub-space, a separate codebook is trained using K-Means to find a set of centroids (typically 2^8 = 256 centroids, so each centroid index can be stored in a single byte).
  4. Each sub-vector of a database vector is replaced by the 8-bit index of the closest centroid in its sub-space codebook.
  5. A 1536-dimensional vector (originally requiring 1536 * 4 = 6144 bytes as float32) is compressed into 96 bytes. This yields a compression ratio of 64x.

To learn more about the underlying mathematical concepts of vector compression and quantization techniques used in deep learning, refer to our detailed post on Quantization Mathematics: GPTQ, AWQ, and GGUF.

Distance Computation (Asymmetric Distance Computation - ADC):

To calculate the distance between a query Q and a compressed vector V represented by code [i_1, i_2, ..., i_m]:

  1. The query Q is split into m sub-vectors: [Q^1, Q^2, ..., Q^m].
  2. For each sub-vector Q^j, we compute the exact float32 distance to all 256 centroids in the sub-space codebook.
  3. These distances are stored in a lookup table of size 256 x m (the distance table).
  4. The distance from the query to any compressed database vector is computed by looking up and summing the pre-calculated distances from the lookup table:
Distance_ADC(Q, V) = sum_{j=1}^{m} DistanceTable[ j, V_code[j] ]

Because this lookup-and-sum operation avoids calculating high-dimensional Euclidean distances during search, it is exceptionally fast and can be vector-parallelized on CPUs and GPUs.


Architecture

The physical architecture of a vector index determines how it organizes memory, stores coordinates, and routes queries.

HNSW Architecture

HNSW maintains graph connectivity structures along with the raw vector data. In memory, each node is structured as follows:

HNSW In-Memory Node Structure:
+-------------------------------------------------------------+
| Vector ID (64-bit Int)                                      |
+-------------------------------------------------------------+
| Raw Coordinates (e.g., 1536 * 4 bytes for Float32)          |
+-------------------------------------------------------------+
| Layer 0 Connections: [ID_1, ID_2, ..., ID_{2M}] (M-ints)    |
+-------------------------------------------------------------+
| Layer 1 Connections: [ID_1, ID_2, ..., ID_M]                 |
+-------------------------------------------------------------+
| Layer L Connections: [ID_1, ID_2, ..., ID_M]                 |
+-------------------------------------------------------------+

Because HNSW requires random memory access to traverse graph links across layers, the entire graph structure and the raw vectors must reside in RAM to maintain low search latency. Attempts to store HNSW graphs on traditional disks lead to severe I/O bottlenecks.

To structure document hierarchies before indexing them into such graph databases, advanced chunking strategies are often applied, as discussed in Advanced RAG: Hierarchical Node Parsing, Parent-Child Retrievers, and Metadata Pre-Filtering.

IVF-PQ Architecture

The IVF-PQ index separates the coarse centroid layout from the quantized data payload:

IVF-PQ Layout:

 Coarse Centroids (In RAM):
 Centroid 1 ----> Inverted List: [ Vector ID 12 | PQ Code (m-bytes) ]
                                 [ Vector ID 45 | PQ Code (m-bytes) ]
 Centroid 2 ----> Inverted List: [ Vector ID 99 | PQ Code (m-bytes) ]
                                 [ Vector ID 11 | PQ Code (m-bytes) ]

 Sub-space Codebooks (In RAM):
 Sub-space 1 Centroids: [ Centroid_0 (d_sub floats) ] ... [ Centroid_255 ]
 ...
 Sub-space m Centroids: [ Centroid_0 (d_sub floats) ] ... [ Centroid_255 ]

Because the payload in the inverted lists consists of contiguous arrays of bytes (IDs and PQ codes), IVF-PQ is highly amenable to disk-resident architectures. The coarse centroids and the codebooks (which are very small) are loaded into RAM. The inverted lists themselves can be stored on high-speed NVMe SSDs and memory-mapped (mmap). When a query is executed, the engine identifies the closest centroids, reads only the matching inverted list buckets from disk, and evaluates the ADC distances.

If your vector retrieval pipeline needs to scale even further, combining dense semantic search with sparse lexical indices provides a resilient failover. For an architectural blueprint on setting up hybrid retrieval, read our post on Hybrid Search Architectures: Reciprocal Rank Fusion (RRF) and Cross-Encoder Re-ranking.


Production Deployment Considerations

Deploying vector indices at scale requires careful calculations of memory usage, build time, and throughput requirements.

Index Comparison Table

The following table summarizes the operational trade-offs of the three indexing strategies:

Vector IndexSearch SpeedRecall AccuracyRAM ConsumptionBuild / Training TimeUpdate Overhead
FlatO(N) (Slow)1.0 (Exact)Base Vector SizeO(1) (None)O(1) (Instant)
HNSWO(log N) (Fast)0.95 - 0.991.5x to 3.0x BaseO(N log N) (High)O(log N) (Moderate)
IVF-PQO(N/k) (Very Fast)0.80 - 0.920.01x to 0.15x BaseO(N) (Medium-High)Requires Reconstruction

Memory Sizing Formulas

To plan your bare-metal or cloud infrastructure, you can estimate memory usage using the following formulas:

1. Flat Index Memory (Bytes):
M_Flat = N * D * 4

2. HNSW Index Memory (Bytes):
M_HNSW = (N * D * 4) + (N * M * 2 * 8) + Graph_Metadata_Overhead
(Where Graph_Metadata_Overhead typically adds 15-20% to the total size)

3. IVF-PQ Index Memory (Bytes):
M_IVFPQ = (N * m * (nbits / 8)) + (2^nbits * D * 4) + (n_list * D * 4)

Let us apply these formulas to a concrete example of 10,000,000 vectors with 1536 dimensions (e.g., OpenAI embeddings) using typical production parameters:

  • N = 10,000,000
  • D = 1536
  • M = 16 (for HNSW)
  • m = 96 (for IVF-PQ)
  • nbits = 8 (for IVF-PQ)
  • n_list = 4096 (for IVF-PQ)
Concrete Memory Size Calculations (10M Vectors, 1536-D):

Flat Memory:
M_Flat = 10,000,000 * 1536 * 4 bytes
       = 61,440,000,000 bytes
       ≈ 61.44 GB

HNSW Memory:
M_HNSW = 61.44 GB + (10,000,000 * 16 * 2 * 8 bytes)
       = 61.44 GB + 2,560,000,000 bytes
       = 61.44 GB + 2.56 GB (minimum link memory)
       With graph overhead, actual memory allocations range from 90 GB to 130 GB.

IVF-PQ Memory:
M_IVFPQ = (10,000,000 * 96 * (8 / 8) bytes) + (256 * 1536 * 4 bytes) + (4096 * 1536 * 4 bytes)
        = 960,000,000 bytes + 1,572,864 bytes + 25,165,824 bytes
        = 960 MB (codes) + 1.57 MB (codebooks) + 25.16 MB (centroids)
        ≈ 986.73 MB

In this scenario, IVF-PQ achieves a 62x memory compression compared to Flat, and up to a 100x compression compared to the fully realized in-memory HNSW graph. This makes IVF-PQ highly cost-effective for multi-billion vector databases.

Hyperparameter Tuning Checklist

When configuring indices, you must balance search recall and latency by tuning these primary hyperparameters:

ParameterIndex TypeAction / EffectTuning Guideline
MHNSWDefines maximum links per node. Higher M increases recall and connectivity but increases index size.Use 16 for standard text RAG. Use 32 to 64 for high-dimensional or noisy vectors.
ef_constructionHNSWSet size of candidate pool during build. Higher values improve graph quality but increase build time.Default to 64 or 128. Increase to 256 or 512 for final production builds.
ef_searchHNSWSet size of candidate pool during query. Higher values increase search recall and increase latency.Dynamic runtime setting. Start with 32 or 64. Set to 128 or 256 if recall is dropping.
n_listIVF-PQSet number of Voronoi partitions. Higher values speed up queries but increase index build time.Set to sqrt(N) as a baseline. For 10M vectors, use 4096 to 16384.
n_probesIVF-PQSet number of Voronoi cells to search. Higher values increase recall but increase latency.Dynamic runtime setting. Set n_probes to 5% to 10% of n_list for high-recall tasks.

Common Mistakes

Through dozens of post-mortems of vector search deployments, we have identified several recurring architectural errors.

1. The Small Dataset Overkill

Engineers often implement complex, approximate indices (such as HNSW or IVF-PQ) prematurely on small datasets containing under 50,000 vectors. Managing approximate graphs or quantization codebooks introduces runtime complexity, clustering training delays, and indexing overhead. On datasets of this scale, a simple flat vector scan leveraging SIMD vectorization (using numpy, faiss.IndexFlatL2, or raw PostgreSQL sequential scans) is faster, uses less memory, and guarantees a 100% search recall.

2. Failing to Tuned ef_search and n_probes Dynamically

Many developers build their HNSW or IVF-PQ indices, measure the initial recall on test queries, and deploy the system with default runtime parameters. Over time, as the database content changes or the query patterns shift, search quality degrades. Because ef_search (in HNSW) and n_probes (in IVF-PQ) are dynamic runtime variables, they should be configured on a per-query basis. For example, a search engine can use a low ef_search for fast autocomplete suggestions, and a high ef_search for critical semantic context fetching.

3. Misunderstanding Distance Metrics and Vector Normalization

Creating an index using Cosine similarity while inserting vectors that are not unit-normalized can lead to incorrect search results. When using Cosine similarity, ensure that your client code normalizes the vectors to unit length, or choose Inner Product (IP) if your embedding provider already generates normalized outputs.


Lessons From Production Deployments

Operating large-scale vector search systems reveals several silent failure modes that standard benchmarking tools fail to capture.

1. Centroid Collapse and Voronoi Cell Skew

In IVF-PQ databases, coarse partitioning relies on K-Means clustering to build Voronoi cells. K-Means assumes a relatively uniform distribution of data. In production, however, semantic embedding spaces are rarely uniform. Documents often cluster tightly around a few popular topics (semantic hotspots).

This causes Centroid Collapse, where a few Voronoi cells contain millions of vectors, while other cells remain nearly empty. When queries target these dense clusters, the engine is forced to scan massive buckets, leading to latency spikes and hot shards. To mitigate this skew, implement custom clustering routines, increase the coarse list size n_list, or use hierarchical IVF systems.

2. Graph Disconnection under Real-Time Updates

When using HNSW in systems with high update and delete throughput, the index can suffer from Graph Disconnection. In HNSW, deleting a node requires finding its neighbors and re-routing their connections. If deletions are frequent, the graph can fracture into disconnected islands. A query starting at the top layer may become trapped on a disconnected graph island, causing search recall to drop from 99% to less than 60% without throwing any system errors. Databases must periodically run background optimization and graph consolidation tasks to prevent this drift.

3. Silent Quality Degradation

Unlike traditional databases that return empty sets or throw errors when a matching query is not found, vector databases always return the top-K nearest neighbors. If the database does not contain relevant information, the engine will still return the closest vectors, which represent irrelevant semantic noise. This is a silent failure mode that leads directly to hallucination in RAG pipelines. Production systems must implement distance threshold calibration, rejecting matches that fall beyond a verified similarity limit.


What Most Articles Miss

While most technical articles cover the basic theory of HNSW graphs and Product Quantization, they overlook several critical engineering trade-offs.

1. Quantizer Drift

When training an IVF-PQ index, the coarse centroids and PQ sub-space codebooks are computed using a training set. This training set must represent the distribution of your data. If your application starts ingesting new types of documents (e.g., adding technical manuals to a general news database), the mathematical distribution of your vector embeddings will shift.

Quantizer Drift Visualized:
+------------------------------------------------------+
|  (Centroid A)         * (New Vector)                 |
|    * * *             /                               |
|   * * *             / [High Quantization Error]      |
|                    /                                 |
|                   v                                  |
|         [Original Codebook Center]                   |
+------------------------------------------------------+

This shift is known as Quantizer Drift. Because the codebooks were trained on the old distribution, they cannot represent the new vectors accurately. This causes high quantization reconstruction errors, leading to search recall degradation. To resolve this, you must set up monitoring to track average quantization error and trigger automatic index re-training when the drift exceeds a specific threshold.

2. Memory-Recall Pareto Frontier Non-Linearity

The trade-offs in approximate nearest neighbor search are non-linear. Increasing your HNSW parameter M from 8 to 16 yields a massive jump in search recall with minimal memory overhead. However, increasing M from 32 to 64 increases index memory usage and build times significantly while providing only marginal improvements in recall. Teams should construct empirical Pareto curves using their own datasets to locate the optimal configuration.


Best Practices

To build a robust, high-performance vector search architecture, follow these implementation rules:

  1. Establish a Ground Truth Baseline: Always maintain a subset of your dataset indexed using a Flat configuration. Run a daily evaluation suite comparing your production HNSW/IVF-PQ search results against the Flat index to detect recall degradation early.
  2. Match the Index to Your Data Scale:
    • N < 50,000: Use a Flat index.
    • 50,000 < N < 10,000,000: Default to HNSW for low latency and high recall, provided you have the budget for RAM.
    • N > 10,000,000: Use IVF-PQ or DiskANN to prevent high infrastructure costs.
  3. Optimize Build Times: Building HNSW graphs can take hours for large datasets. Utilize parallel indexing features in modern databases (such as pgvector parallel worker processes) and consider offloading index builds to GPU-accelerated environments like NVIDIA cuVS.
  4. Calibrate Distance Thresholds: Never pass vector search results directly to downstream LLMs without filtering. Track similarity scores and apply strict distance thresholds to prune irrelevant noise.
  5. Monitor Quantization Error: For IVF-PQ indexes, monitor the average distance between raw vectors and their reconstructed quantized versions. Re-train codebooks if this quantization error drifts upward.

FAQ

Q1: Why is my HNSW index build taking hours?

HNSW graph construction is computationally intensive due to the requirement of performing nearest-neighbor searches for every inserted node to establish optimal links. To speed up index construction, increase CPU concurrency (e.g., set max_parallel_workers in pgvector), reduce the ef_construction parameter, or perform batch index construction on GPU-accelerated database instances.

Q2: Can I add new vectors to an IVF-PQ index without retraining the codebook?

Yes. You can insert new vectors into an existing IVF-PQ index by assigning them to their closest coarse centroids and quantizing their sub-vectors using the existing codebooks. However, if the new vectors belong to a different semantic domain, this will cause quantizer drift and degrade recall over time, eventually requiring a complete index rebuild and retraining.

Q3: How do I choose between L2 (Euclidean) and Cosine similarity?

The choice of distance metric must match the training objective of your embedding model. Models like OpenAI's text-embedding-3 or Cohere's embeddings are trained to work with Cosine similarity. If using Cosine similarity, you can unit-normalize your vectors to 1.0 length and utilize the faster Inner Product (IP) index, which behaves identically to Cosine similarity on normalized data.

Q4: Why does HNSW use so much RAM compared to the raw vector files?

HNSW stores both the raw vector coordinates and the graph connections for multiple layers. Each node maintains pointers to its neighbors across layers. For a graph with M = 16, each node stores up to 32 connections in Layer 0 and 16 connections in higher layers. Storing these bidirectional pointers (64-bit integers) alongside metadata can double or triple the memory footprint of your raw vector float data.

Q5: What is DiskANN, and how does it compare to HNSW and IVF-PQ?

DiskANN is a disk-resident vector indexing algorithm designed to search billion-scale datasets. It structures a compressed graph index that is stored on high-speed NVMe SSDs, maintaining only a small cache of layout data in RAM. DiskANN achieves search latency comparable to HNSW while keeping memory costs close to IVF-PQ.

Q6: Can I tune search recall and latency dynamically at runtime?

Yes. In HNSW, you can adjust ef_search on a per-query basis. A higher ef_search explores more paths, increasing recall at the cost of higher latency. In IVF-PQ, you can adjust n_probes dynamically. Increasing n_probes forces the engine to search more Voronoi cells, improving recall but increasing search latency.

Q7: Does pgvector support both HNSW and IVF indexing?

Yes. pgvector supports both HNSW and IVFFlat indices. HNSW is recommended for most dynamic production applications due to its higher recall and support for real-time updates. IVFFlat is useful when memory is limited and the dataset is static.

Q8: What causes "silent recall drops" in production vector search?

Silent recall drops are typically caused by graph fragmentation in HNSW due to high delete/update rates, or quantizer drift in IVF-PQ when the semantic distribution of ingested documents deviates from the initial training set. Because the database continues returning the nearest matches without errors, these drops can go unnoticed without active recall monitoring.

Q9: What is the optimal number of dimensions for vector search?

There is no single optimal number of dimensions. The dimensionality is fixed by the embedding model you select (e.g., 384 for all-MiniLM-L6-v2, 1536 for OpenAI text-embedding-3-small, or 3072 for text-embedding-3-large). Higher dimensions capture richer semantic details but require more memory and computation.

Q10: How do metadata filters interact with vector indices?

Metadata filtering can be implemented in three ways: pre-filtering, post-filtering, or single-stage filtered indexing. In pre-filtering, the database filters the document set using metadata first and then performs a vector scan on the subset (which can degrade if the subset is large). In post-filtering, the database performs a vector search first and then filters the top-K results (which can lead to returning fewer than K results). Modern databases use single-stage filtered indexing, traversing the vector index graph while evaluating metadata constraints at each node.


Key Takeaways

  • Flat indexing provides a 100% recall guarantee and is the fastest option for datasets under 50,000 vectors.
  • HNSW is the standard index for real-time RAG applications under 10 million vectors, offering excellent recall and low search latency at the cost of high RAM overhead.
  • IVF-PQ is ideal for large-scale and resource-constrained environments, compressing high-dimensional vectors by up to 100x to run massive indices on cost-effective hardware.
  • Active runtime tuning of parameters like ef_search (HNSW) and n_probes (IVF-PQ) is required to balance search recall and query-per-second (QPS) targets.
  • Production monitoring must track quantization errors, data drift, and physical memory footprint to prevent silent search quality degradation.

About & Technical Stack

Shyank Akshar

Shyank Akshar

Hi! I'm Shyank, a full-stack Software Developer and a Call of Duty enthusiast. I help businesses scale by engineering robust technology solutions that automate complex tasks, save hundreds of hours, and delight users. Over the years, I've partnered with leading global startups and government organizations to deliver high-performance, secure applications at scale.

Technical Stack

Languages, platforms, and architectures I build on.

iOS
Swift
GCP
AWS
Java
backend
Golang
Javascript
Typescript
Mongo DB
MySQL
Redis
Kotlin
Kafka
Kubernetes
Docker
Microservices
System Design
Distributed Systems
More Blogs
Recent Blogs