RAG in Production: Why Retrieval Gets Slow
The bottlenecks that actually hurt production RAG systems, from chunking and cross-region traffic to reranking and stale indexes.
Most RAG demos are fast because they avoid the parts that make production retrieval expensive.
The dataset is small. The chunks are clean. The embedding call is nearby. The vector index is warm. There are no permission filters, no reranking stage, and no requirement to explain why a result was selected.
Real systems are slower because retrieval is not one step. It is a pipeline.
The Latency Budget Is Not Just Vector Search
When a user asks a question, a production retrieval flow often includes:
- query normalization
- embedding generation
- metadata filtering
- vector lookup
- lexical lookup
- reranking
- prompt assembly
- the final model call
Teams often blame the vector database because it is the most visible component. In practice, the slowest part is often somewhere else.
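A cheap way to find the real bottleneck is to time each stage independently. Here is a sketch; every stage body is a stub standing in for a real call (embedding API, vector store, reranker, LLM):

```typescript
// Per-stage timing for the retrieval pipeline. Stage bodies are stubs.
type Timings = Record<string, number>;

async function timed<T>(name: string, out: Timings, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  const result = await fn();
  out[name] = Date.now() - start;
  return result;
}

async function answerQuery(query: string): Promise<Timings> {
  const timings: Timings = {};
  const normalized = await timed("normalize", timings, async () => query.trim());
  const embedding = await timed("embed", timings, async () => [0.1, 0.2]); // stub
  const candidates = await timed("retrieve", timings, async () =>
    embedding.length > 0 ? [`doc-for-${normalized}`] : [],
  );
  const reranked = await timed("rerank", timings, async () => candidates); // stub
  await timed("generate", timings, async () => `answer from ${reranked[0]}`); // stub
  return timings; // log this per request; the slow stage is often not the vector lookup
}
```

Logging this per request turns "the vector database feels slow" into a measurable claim about a specific stage.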
Cross-region traffic is a common example. If your application runs in one region, your embedding provider is in another, and your vector store is in a third, the system can feel slow even when each individual component is "fast" in isolation.
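To make that concrete with illustrative, made-up round-trip times (not measurements):

```typescript
// Hypothetical cross-region round-trip times in milliseconds.
const rttMs = { appToEmbedder: 80, appToVectorStore: 70, appToLlm: 60 };

// These calls happen in sequence, so the network cost alone stacks up
// before any compute time is counted.
const networkFloorMs =
  rttMs.appToEmbedder + rttMs.appToVectorStore + rttMs.appToLlm;

console.log(networkFloorMs); // 210 ms spent on the wire, best case
```

Co-locating services removes most of that floor without touching a single model or index parameter.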
Chunking Quality Decides Retrieval Quality
Bad chunks produce bad recall.
A naive fixed-width chunker will split through headings, code blocks, tables, and paragraphs without understanding any of them. The resulting embeddings are technically valid and semantically weak.
If the content is structured, chunk with the structure:
- respect Markdown headings
- keep code blocks intact
- preserve section titles with their content
- keep metadata such as document id, section path, language, and access scope
You do not need a perfect chunker. You do need one that avoids obviously breaking the meaning of the source material.
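A minimal heading-aware chunker along those lines, sketched for Markdown. The `Chunk` shape and splitting rule are assumptions for illustration, not a production parser:

```typescript
// Split on "## " headings, but never inside fenced code blocks, and keep
// each heading attached to its section content.
const FENCE = "```";

type Chunk = { sectionTitle: string; text: string };

function chunkMarkdown(doc: string): Chunk[] {
  const chunks: Chunk[] = [];
  let title = "(intro)";
  let buffer: string[] = [];
  let inCodeBlock = false;

  const flush = () => {
    const text = buffer.join("\n").trim();
    if (text) chunks.push({ sectionTitle: title, text });
    buffer = [];
  };

  for (const line of doc.split("\n")) {
    if (line.startsWith(FENCE)) inCodeBlock = !inCodeBlock;
    if (!inCodeBlock && line.startsWith("## ")) {
      flush();
      title = line.slice(3).trim();
    }
    buffer.push(line); // the heading stays with its own section
  }
  flush();
  return chunks;
}
```

Even this crude version beats a fixed-width splitter on structured documents, because it never severs a heading from its body or bisects a code block.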
Hybrid Retrieval Is Usually the Better Default
Dense retrieval is good at semantic similarity. Lexical retrieval is good at exact terms, ids, filenames, error codes, and product names.
Production systems usually need both.
A practical shape looks like this:
```typescript
type RetrievalResult = {
  id: string;
  score: number;
  source: "dense" | "lexical";
};

async function retrieve(query: string) {
  // Run both retrievers concurrently; neither waits on the other.
  const [dense, lexical] = await Promise.all([
    vectorIndex.search(query),
    bm25Index.search(query),
  ]);
  return mergeAndDedupe([...dense, ...lexical]);
}
```
That does not make the system simpler, but it usually makes it more dependable.
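One common way to implement the merging step is reciprocal rank fusion (RRF), which combines ranked lists without requiring dense and lexical scores to be comparable. A sketch, with a slightly different signature (two ranked lists rather than one concatenated array, since RRF needs per-list ranks):

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per id, so
// ids near the top of either list win. k = 60 is the conventional constant.
function mergeAndDedupe(
  dense: { id: string }[],
  lexical: { id: string }[],
  k = 60,
): { id: string; score: number }[] {
  const fused = new Map<string, number>();
  for (const list of [dense, lexical]) {
    list.forEach((r, rank) => {
      fused.set(r.id, (fused.get(r.id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...fused.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

An id that appears in both lists accumulates contributions from each, which is exactly the behavior you want: agreement between retrievers is a strong relevance signal.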
Reranking Is Expensive and Often Worth It
Approximate nearest-neighbor indexes optimize for speed, not perfect ranking. That is fine for the candidate generation stage. It is often not enough for the final answer stage.
Reranking improves precision by taking a smaller candidate set and scoring each result against the actual query with a stronger model.
The trade-off is obvious:
- better relevance
- more latency
- more cost
That is why reranking belongs after retrieval, not before it. Let the fast system find candidates. Let the slower system refine them.
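The two-stage shape can be sketched like this. `scoreAgainstQuery` is a stand-in for a cross-encoder or rerank API call; here it is a trivial term-overlap scorer so the sketch runs on its own:

```typescript
// Fast candidate generation upstream; rerank only the top N candidates,
// then keep a handful for the prompt.
async function rerank(
  query: string,
  candidates: { id: string; text: string }[],
  topN = 20,
  keep = 5,
): Promise<{ id: string; text: string }[]> {
  const shortlist = candidates.slice(0, topN); // cap rerank cost up front
  const scored = await Promise.all(
    shortlist.map(async (c) => ({ c, s: await scoreAgainstQuery(query, c.text) })),
  );
  return scored.sort((a, b) => b.s - a.s).slice(0, keep).map((x) => x.c);
}

// Stand-in scorer: counts query-term overlap. A real system would call a
// cross-encoder model here instead.
async function scoreAgainstQuery(query: string, text: string): Promise<number> {
  const terms = query.toLowerCase().split(/\s+/);
  return terms.filter((t) => text.toLowerCase().includes(t)).length;
}
```

The `topN` and `keep` parameters are the knobs that trade relevance against latency and cost.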
Freshness Is a Systems Problem
A lot of RAG frustration is actually an indexing problem.
The source of truth changes, but the retrieval index lags behind. Then the user gets an answer from stale chunks and the team blames the language model.
Treat indexing as its own pipeline:
- capture document changes
- enqueue re-embedding work
- upsert the new chunks
- remove or tombstone obsolete chunks
- track index version or document revision
If you skip this, you do not have a retrieval system. You have a snapshot.
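The indexing side of that pipeline can be sketched as follows. The in-memory `index` map is a stand-in for a real vector store's upsert and delete API, and the revision scheme is an assumption for illustration:

```typescript
// On a document change: upsert the new revision's chunks, then tombstone
// chunks from older revisions of the same document.
type IndexedChunk = { chunkId: string; docId: string; revision: number; text: string };

const index = new Map<string, IndexedChunk>();

function onDocumentChanged(docId: string, revision: number, chunks: string[]) {
  // Upsert the new revision's chunks under revision-scoped ids.
  chunks.forEach((text, i) => {
    const chunkId = `${docId}:${revision}:${i}`;
    index.set(chunkId, { chunkId, docId, revision, text });
  });
  // Remove anything from an older revision of the same document.
  for (const [id, c] of index) {
    if (c.docId === docId && c.revision < revision) index.delete(id);
  }
}
```

Tracking the revision on every chunk also makes staleness observable: you can alert when the indexed revision lags the source of truth.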
Filters Change the Shape of the Problem
Enterprise search rarely means "search everything". It usually means "search only what this user is allowed to see".
That means metadata design matters almost as much as embedding quality.
A chunk should usually carry enough metadata to support:
- tenant isolation
- document type filters
- source system filters
- freshness windows
- permission-aware retrieval
This is also where some vector database benchmarks become less useful. A benchmark without real filters and access constraints does not look like a production workload.
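A metadata shape that supports those filters might look like this. The field names and the predicate-style filter are illustrative, not any particular vector store's API:

```typescript
// Per-chunk metadata carried alongside the embedding.
type ChunkMetadata = {
  tenantId: string;
  docType: "wiki" | "ticket" | "code";
  sourceSystem: string;
  updatedAt: number;       // epoch ms, enables freshness windows
  allowedGroups: string[]; // enables permission-aware retrieval
};

// Build a filter from the requesting user's context, applied in the
// retrieval layer so unauthorized chunks never reach the prompt.
function buildFilter(user: { tenantId: string; groups: string[] }, maxAgeMs: number) {
  const now = Date.now();
  return (m: ChunkMetadata) =>
    m.tenantId === user.tenantId &&
    m.updatedAt >= now - maxAgeMs &&
    m.allowedGroups.some((g) => user.groups.includes(g));
}
```

Pushing the filter into the retrieval layer, rather than post-filtering results, matters for both correctness and recall: post-filtering can silently return fewer candidates than requested.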
A Better Production Posture
If your RAG feature feels slow, check these before blaming embeddings:
- Are services in the same region?
- Are you over-fetching candidates?
- Are you reranking too many chunks?
- Are chunks structurally sane?
- Is the index fresh?
- Are permissions being enforced in the retrieval layer?
RAG quality is not just model quality. It is retrieval engineering.