RAG in Production: Why Retrieval Gets Slow
The bottlenecks that actually hurt production RAG systems, from chunking and cross-region traffic to reranking and stale indexes.
Most RAG demos are fast because they avoid the parts that make production retrieval expensive.
The dataset is small. The chunks are clean. The embedding call is nearby. The vector index is warm. There are no permission filters, no reranking stage, and no requirement to explain why a result was selected.
Real systems are slower because retrieval is not one step. It is a pipeline.
The Latency Budget Is Not Just Vector Search
When a user asks a question, a production retrieval flow often includes:
- query normalization
- embedding generation
- metadata filtering
- vector lookup
- lexical lookup
- reranking
- prompt assembly
- the final model call
Teams often blame the vector database because it is the most visible component. In practice, the slowest part is often somewhere else.
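A cheap way to find the real bottleneck is to time each stage independently. Here is a sketch; every stage body is a stub standing in for a real call (embedding API, vector store, reranker, LLM):

```typescript
// Per-stage timing for the retrieval pipeline. Stage bodies are stubs.
type Timings = Record<string, number>;

async function timed<T>(name: string, out: Timings, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  const result = await fn();
  out[name] = Date.now() - start;
  return result;
}

async function answerQuery(query: string): Promise<Timings> {
  const timings: Timings = {};
  const normalized = await timed("normalize", timings, async () => query.trim());
  const embedding = await timed("embed", timings, async () => [0.1, 0.2]); // stub
  const candidates = await timed("retrieve", timings, async () =>
    embedding.length > 0 ? [`doc-for-${normalized}`] : [],
  );
  const reranked = await timed("rerank", timings, async () => candidates); // stub
  await timed("generate", timings, async () => `answer from ${reranked[0]}`); // stub
  return timings; // log this per request; the slow stage is often not the vector lookup
}
```

Logging this per request turns "the vector database feels slow" into a measurable claim about a specific stage.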
Cross-region traffic is a common example. If your application runs in one region, your embedding provider is in another, and your vector store is in a third, the system can feel slow even when each individual component is "fast" in isolation.
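To make that concrete with illustrative, made-up round-trip times (not measurements):

```typescript
// Hypothetical cross-region round-trip times in milliseconds.
const rttMs = { appToEmbedder: 80, appToVectorStore: 70, appToLlm: 60 };

// These calls happen in sequence, so the network cost alone stacks up
// before any compute time is counted.
const networkFloorMs =
  rttMs.appToEmbedder + rttMs.appToVectorStore + rttMs.appToLlm;

console.log(networkFloorMs); // 210 ms spent on the wire, best case
```

Co-locating services removes most of that floor without touching a single model or index parameter.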
Chunking Quality Decides Retrieval Quality
Bad chunks produce bad recall.
A naive fixed-width chunker will split through headings, code blocks, tables, and paragraphs without understanding any of them. The resulting embeddings are technically valid and semantically weak.
If the content is structured, chunk with the structure:
- respect Markdown headings
- keep code blocks intact
- preserve section titles with their content
- keep metadata such as document id, section path, language, and access scope
You do not need a perfect chunker. You do need one that avoids obviously breaking the meaning of the source material.
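A minimal heading-aware chunker along those lines, sketched for Markdown. The `Chunk` shape and splitting rule are assumptions for illustration, not a production parser:

```typescript
// Split on "## " headings, but never inside fenced code blocks, and keep
// each heading attached to its section content.
const FENCE = "```";

type Chunk = { sectionTitle: string; text: string };

function chunkMarkdown(doc: string): Chunk[] {
  const chunks: Chunk[] = [];
  let title = "(intro)";
  let buffer: string[] = [];
  let inCodeBlock = false;

  const flush = () => {
    const text = buffer.join("\n").trim();
    if (text) chunks.push({ sectionTitle: title, text });
    buffer = [];
  };

  for (const line of doc.split("\n")) {
    if (line.startsWith(FENCE)) inCodeBlock = !inCodeBlock;
    if (!inCodeBlock && line.startsWith("## ")) {
      flush();
      title = line.slice(3).trim();
    }
    buffer.push(line); // the heading stays with its own section
  }
  flush();
  return chunks;
}
```

Even this crude version beats a fixed-width splitter on structured documents, because it never severs a heading from its body or bisects a code block.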
Hybrid Retrieval Is Usually the Better Default
Dense retrieval is good at semantic similarity. Lexical retrieval is good at exact terms, ids, filenames, error codes, and product names.
Production systems usually need both.
A practical shape looks like this:
```typescript
type RetrievalResult = {
  id: string;
  score: number;
  source: "dense" | "lexical";
};

async function retrieve(query: string) {
  // Run both retrievers concurrently; neither waits on the other.
  const [dense, lexical] = await Promise.all([
    vectorIndex.search(query),
    bm25Index.search(query),
  ]);
  return mergeAndDedupe([...dense, ...lexical]);
}
```
That does not make the system simpler, but it usually makes it more dependable.
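One common way to implement the merging step is reciprocal rank fusion (RRF), which combines ranked lists without requiring dense and lexical scores to be comparable. A sketch, with a slightly different signature (two ranked lists rather than one concatenated array, since RRF needs per-list ranks):

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per id, so
// ids near the top of either list win. k = 60 is the conventional constant.
function mergeAndDedupe(
  dense: { id: string }[],
  lexical: { id: string }[],
  k = 60,
): { id: string; score: number }[] {
  const fused = new Map<string, number>();
  for (const list of [dense, lexical]) {
    list.forEach((r, rank) => {
      fused.set(r.id, (fused.get(r.id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...fused.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

An id that appears in both lists accumulates contributions from each, which is exactly the behavior you want: agreement between retrievers is a strong relevance signal.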
Reranking Is Expensive and Often Worth It
Approximate nearest-neighbor indexes optimize for speed, not perfect ranking. That is fine for the candidate generation stage. It is often not enough for the final answer stage.
Reranking improves precision by taking a smaller candidate set and scoring each result against the actual query with a stronger model.
The trade-off is obvious:
- better relevance
- more latency
- more cost
That is why reranking belongs after retrieval, not before it. Let the fast system find candidates. Let the slower system refine them.
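The two-stage shape can be sketched like this. `scoreAgainstQuery` is a stand-in for a cross-encoder or rerank API call; here it is a trivial term-overlap scorer so the sketch runs on its own:

```typescript
// Fast candidate generation upstream; rerank only the top N candidates,
// then keep a handful for the prompt.
async function rerank(
  query: string,
  candidates: { id: string; text: string }[],
  topN = 20,
  keep = 5,
): Promise<{ id: string; text: string }[]> {
  const shortlist = candidates.slice(0, topN); // cap rerank cost up front
  const scored = await Promise.all(
    shortlist.map(async (c) => ({ c, s: await scoreAgainstQuery(query, c.text) })),
  );
  return scored.sort((a, b) => b.s - a.s).slice(0, keep).map((x) => x.c);
}

// Stand-in scorer: counts query-term overlap. A real system would call a
// cross-encoder model here instead.
async function scoreAgainstQuery(query: string, text: string): Promise<number> {
  const terms = query.toLowerCase().split(/\s+/);
  return terms.filter((t) => text.toLowerCase().includes(t)).length;
}
```

The `topN` and `keep` parameters are the knobs that trade relevance against latency and cost.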
Freshness Is a Systems Problem
A lot of RAG frustration is actually an indexing problem.
The source of truth changes, but the retrieval index lags behind. Then the user gets an answer from stale chunks and the team blames the language model.
Treat indexing as its own pipeline:
- capture document changes
- enqueue re-embedding work
- upsert the new chunks
- remove or tombstone obsolete chunks
- track index version or document revision
If you skip this, you do not have a retrieval system. You have a snapshot.
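The indexing side of that pipeline can be sketched as follows. The in-memory `index` map is a stand-in for a real vector store's upsert and delete API, and the revision scheme is an assumption for illustration:

```typescript
// On a document change: upsert the new revision's chunks, then tombstone
// chunks from older revisions of the same document.
type IndexedChunk = { chunkId: string; docId: string; revision: number; text: string };

const index = new Map<string, IndexedChunk>();

function onDocumentChanged(docId: string, revision: number, chunks: string[]) {
  // Upsert the new revision's chunks under revision-scoped ids.
  chunks.forEach((text, i) => {
    const chunkId = `${docId}:${revision}:${i}`;
    index.set(chunkId, { chunkId, docId, revision, text });
  });
  // Remove anything from an older revision of the same document.
  for (const [id, c] of index) {
    if (c.docId === docId && c.revision < revision) index.delete(id);
  }
}
```

Tracking the revision on every chunk also makes staleness observable: you can alert when the indexed revision lags the source of truth.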
Filters Change the Shape of the Problem
Enterprise search rarely means "search everything". It usually means "search only what this user is allowed to see".
That means metadata design matters almost as much as embedding quality.
A chunk should usually carry enough metadata to support:
- tenant isolation
- document type filters
- source system filters
- freshness windows
- permission-aware retrieval
This is also where some vector database benchmarks become less useful. A benchmark without real filters and access constraints does not look like a production workload.
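A metadata shape that supports those filters might look like this. The field names and the predicate-style filter are illustrative, not any particular vector store's API:

```typescript
// Per-chunk metadata carried alongside the embedding.
type ChunkMetadata = {
  tenantId: string;
  docType: "wiki" | "ticket" | "code";
  sourceSystem: string;
  updatedAt: number;       // epoch ms, enables freshness windows
  allowedGroups: string[]; // enables permission-aware retrieval
};

// Build a filter from the requesting user's context, applied in the
// retrieval layer so unauthorized chunks never reach the prompt.
function buildFilter(user: { tenantId: string; groups: string[] }, maxAgeMs: number) {
  const now = Date.now();
  return (m: ChunkMetadata) =>
    m.tenantId === user.tenantId &&
    m.updatedAt >= now - maxAgeMs &&
    m.allowedGroups.some((g) => user.groups.includes(g));
}
```

Pushing the filter into the retrieval layer, rather than post-filtering results, matters for both correctness and recall: post-filtering can silently return fewer candidates than requested.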
A Better Production Posture
If your RAG feature feels slow, check these before blaming embeddings:
- Are services in the same region?
- Are you over-fetching candidates?
- Are you reranking too many chunks?
- Are chunks structurally sane?
- Is the index fresh?
- Are permissions being enforced in the retrieval layer?
RAG quality is not just model quality. It is retrieval engineering.