
RAG in Production: Why Retrieval Gets Slow

The bottlenecks that actually hurt production RAG systems, from chunking and cross-region traffic to reranking and stale indexes.

Published: March 10, 2024
Reading time: 8 min

Most RAG demos are fast because they avoid the parts that make production retrieval expensive.

The dataset is small. The chunks are clean. The embedding call is nearby. The vector index is warm. There are no permission filters, no reranking stage, and no requirement to explain why a result was selected.

Real systems are slower because retrieval is not one step. It is a pipeline.

The Latency Budget Is Not Just Vector Search

When a user asks a question, a production retrieval flow often includes:

  1. query normalization
  2. embedding generation
  3. metadata filtering
  4. vector lookup
  5. lexical lookup
  6. reranking
  7. prompt assembly
  8. the final model call
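The steps above can be instrumented directly: time each stage per request, and you usually find the budget is dominated by one or two of them. A minimal sketch (the stage names and the `answer` function are illustrative, not a fixed API):

```typescript
// Record how long each pipeline stage takes for a single request.
type StageTimings = Record<string, number>;

async function timed<T>(
  timings: StageTimings,
  stage: string,
  fn: () => Promise<T>,
): Promise<T> {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    timings[stage] = performance.now() - start;
  }
}

// Two stages shown; the remaining ones follow the same pattern.
async function answer(query: string): Promise<StageTimings> {
  const timings: StageTimings = {};
  const normalized = await timed(timings, "normalize", async () =>
    query.trim().toLowerCase(),
  );
  await timed(timings, "embed", async () => Array(8).fill(normalized.length));
  return timings;
}
```

Logging these per request turns "the vector database is slow" into a claim you can check rather than assume.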

Teams often blame the vector database because it is the most visible component. In practice, the slowest part is often somewhere else.

Cross-region traffic is a common example. If your application runs in one region, your embedding provider is in another, and your vector store is in a third, the system can feel slow even when each individual component is "fast" in isolation.

Chunking Quality Decides Retrieval Quality

Bad chunks produce bad recall.

A naive fixed-width chunker will split through headings, code blocks, tables, and paragraphs without understanding any of them. The resulting embeddings are technically valid and semantically weak.

If the content is structured, chunk with the structure:

  • respect Markdown headings
  • keep code blocks intact
  • preserve section titles with their content
  • keep metadata such as document id, section path, language, and access scope
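Those guidelines fit in a small function. A sketch of a structure-aware Markdown chunker that splits on headings, never inside a fenced code block, and carries the heading path as metadata (a sketch, not a complete chunking library):

```typescript
type Chunk = { sectionPath: string[]; text: string };

// Split Markdown on headings, keep fenced code blocks intact,
// and attach the heading path to every chunk it produces.
function chunkMarkdown(doc: string): Chunk[] {
  const chunks: Chunk[] = [];
  let path: string[] = [];
  let buf: string[] = [];
  let inFence = false;

  const flush = () => {
    const text = buf.join("\n").trim();
    if (text) chunks.push({ sectionPath: [...path], text });
    buf = [];
  };

  for (const line of doc.split("\n")) {
    if (line.startsWith("```")) inFence = !inFence;
    const heading = !inFence && line.match(/^(#+)\s+(.*)$/);
    if (heading) {
      flush();
      const level = heading[1].length;
      path = [...path.slice(0, level - 1), heading[2]];
      buf.push(line); // keep the section title with its content
    } else {
      buf.push(line);
    }
  }
  flush();
  return chunks;
}
```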

You do not need a perfect chunker. You do need one that avoids obviously breaking the meaning of the source material.

Hybrid Retrieval Is Usually the Better Default

Dense retrieval is good at semantic similarity. Lexical retrieval is good at exact terms, ids, filenames, error codes, and product names.

Production systems usually need both.

A practical shape looks like this:

type RetrievalResult = {
  id: string;
  score: number;
  source: "dense" | "lexical";
};

async function retrieve(query: string): Promise<RetrievalResult[]> {
  // Run the dense and lexical lookups concurrently; neither waits on the other.
  const [dense, lexical] = await Promise.all([
    vectorIndex.search(query),
    bm25Index.search(query),
  ]);

  return mergeAndDedupe([...dense, ...lexical]);
}
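One common choice for the merge step is reciprocal rank fusion (RRF), which combines the two ranked lists without needing dense and lexical scores to be comparable. The article's code leaves `mergeAndDedupe` abstract, so this is one possible sketch of it:

```typescript
// Same RetrievalResult shape as above, repeated so this block stands alone.
type RetrievalResult = {
  id: string;
  score: number;
  source: "dense" | "lexical";
};

// Reciprocal rank fusion: each source contributes 1 / (k + rank), so an id
// that ranks well in both lists rises to the top regardless of raw scores.
function mergeAndDedupe(results: RetrievalResult[], k = 60): RetrievalResult[] {
  const bySource = new Map<string, RetrievalResult[]>();
  for (const r of results) {
    const list = bySource.get(r.source) ?? [];
    list.push(r);
    bySource.set(r.source, list);
  }
  const fused = new Map<string, RetrievalResult>();
  for (const list of bySource.values()) {
    [...list]
      .sort((a, b) => b.score - a.score) // best-first within a source
      .forEach((r, rank) => {
        const prev = fused.get(r.id);
        fused.set(r.id, { ...r, score: (prev?.score ?? 0) + 1 / (k + rank + 1) });
      });
  }
  return [...fused.values()].sort((a, b) => b.score - a.score);
}
```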

That does not make the system simpler, but it usually makes it more dependable.

Reranking Is Expensive and Often Worth It

Approximate nearest-neighbor indexes optimize for speed, not perfect ranking. That is fine for the candidate generation stage. It is often not enough for the final answer stage.

Reranking improves precision by taking a smaller candidate set and scoring each result against the actual query with a stronger model.

The trade-off is obvious:

  • better relevance
  • more latency
  • more cost

That is why reranking belongs after retrieval, not before it. Let the fast system find candidates. Let the slower system refine them.
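In code, that ordering is explicit: fetch a wide candidate set cheaply, then score only a small slice with the stronger model. The `scoreWithCrossEncoder` function below is a hypothetical stand-in for whatever reranking model you actually call:

```typescript
type Candidate = { id: string; text: string; score: number };

// Hypothetical stand-in for a cross-encoder call; a real system would
// invoke a reranking model here, which is exactly why the input is capped.
async function scoreWithCrossEncoder(query: string, text: string): Promise<number> {
  const terms = query.toLowerCase().split(/\s+/);
  return terms.filter((t) => text.toLowerCase().includes(t)).length;
}

async function rerank(
  query: string,
  candidates: Candidate[],
  topK = 20, // rerank only a small slice; this is where the cost lives
): Promise<Candidate[]> {
  const slice = candidates.slice(0, topK);
  const scored = await Promise.all(
    slice.map(async (c) => ({ ...c, score: await scoreWithCrossEncoder(query, c.text) })),
  );
  return scored.sort((a, b) => b.score - a.score);
}
```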

Freshness Is a Systems Problem

A lot of RAG frustration is actually an indexing problem.

The source of truth changes, but the retrieval index lags behind. Then the user gets an answer from stale chunks and the team blames the language model.

Treat indexing as its own pipeline:

  • capture document changes
  • enqueue re-embedding work
  • upsert the new chunks
  • remove or tombstone obsolete chunks
  • track index version or document revision

If you skip this, you do not have a retrieval system. You have a snapshot.
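As code, that pipeline is mostly bookkeeping: compare revisions, re-embed only what changed, tombstone what disappeared. An in-memory sketch (the `embed` stand-in and the revision fields are illustrative assumptions):

```typescript
type Doc = { id: string; revision: number; text: string };
type IndexedChunk = { docId: string; revision: number; vector: number[]; tombstoned: boolean };

// Illustrative embedding stand-in; a real system calls an embedding model here.
const embed = (text: string): number[] => [text.length];

class ChunkIndex {
  private chunks = new Map<string, IndexedChunk>();

  // Upsert: re-embed only when the stored revision is behind the source.
  upsert(doc: Doc): boolean {
    const existing = this.chunks.get(doc.id);
    if (existing && existing.revision >= doc.revision) return false; // already fresh
    this.chunks.set(doc.id, {
      docId: doc.id,
      revision: doc.revision,
      vector: embed(doc.text),
      tombstoned: false,
    });
    return true;
  }

  // Tombstone instead of delete, so queries can filter and audits keep history.
  tombstone(docId: string): void {
    const c = this.chunks.get(docId);
    if (c) c.tombstoned = true;
  }

  live(): IndexedChunk[] {
    return [...this.chunks.values()].filter((c) => !c.tombstoned);
  }
}
```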

Filters Change the Shape of the Problem

Enterprise search rarely means "search everything". It usually means "search only what this user is allowed to see".

That means metadata design matters almost as much as embedding quality.

A chunk should usually carry enough metadata to support:

  • tenant isolation
  • document type filters
  • source system filters
  • freshness windows
  • permission-aware retrieval
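A concrete shape for that metadata, with the filters applied before (or inside) the similarity search rather than after the model has seen the text. The field names here are assumptions for illustration, not a standard schema:

```typescript
type ChunkMeta = {
  tenantId: string;
  docType: string;
  sourceSystem: string;
  updatedAt: number; // epoch millis, for freshness windows
  allowedGroups: string[]; // permission-aware retrieval
};

type Query = {
  tenantId: string;
  userGroups: string[];
  maxAgeMs?: number;
  docTypes?: string[];
};

// Every filter runs in the retrieval layer; a chunk the user cannot see
// should never reach the prompt at all.
function passesFilters(meta: ChunkMeta, q: Query, now = Date.now()): boolean {
  if (meta.tenantId !== q.tenantId) return false;
  if (!meta.allowedGroups.some((g) => q.userGroups.includes(g))) return false;
  if (q.maxAgeMs !== undefined && now - meta.updatedAt > q.maxAgeMs) return false;
  if (q.docTypes && !q.docTypes.includes(meta.docType)) return false;
  return true;
}
```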

This is also where some vector database benchmarks become less useful. A benchmark without real filters and access constraints does not look like a production workload.

A Better Production Posture

If your RAG feature feels slow, check these before blaming embeddings:

  • Are services in the same region?
  • Are you over-fetching candidates?
  • Are you reranking too many chunks?
  • Are chunks structurally sane?
  • Is the index fresh?
  • Are permissions being enforced in the retrieval layer?

RAG quality is not just model quality. It is retrieval engineering.
