Retrieval-augmented generation went from research technique to startup category to enterprise buzzword in about eighteen months. That pace created a lot of implementations that use the name without capturing the substance. Most "RAG systems" in production are retrieval systems bolted onto generation systems, with the hard parts — retrieval quality, grounding verification, citation integrity — left as exercises for the user to figure out.
The original RAG paper from Facebook AI Research described a specific architecture: a dense passage retriever trained end-to-end with a seq2seq generator, where both components were optimized jointly so the retriever learned to surface passages that the generator could use effectively. Almost no production implementations actually do this. They use separate off-the-shelf components that weren't trained together, and they accept the quality loss that comes from that mismatch.
The Basic Architecture and Its Gaps
The standard production RAG pipeline looks like this (a minimal code sketch follows the list):
- Index a corpus by embedding documents into a vector space
- At query time, embed the query and retrieve the top-k most similar chunks
- Prepend the retrieved chunks to the query as context
- Pass the augmented prompt to a generation model
- Return the generated output, optionally with citations
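In code, that whole loop fits on a page. Here is a minimal sketch, assuming caller-supplied `embed` and `generate` functions that stand in for whatever embedding model and generation API a given deployment actually uses; nothing here is tied to a specific library:

```python
import numpy as np

def build_index(chunks, embed):
    """Embed each chunk once; keep vectors and raw text side by side."""
    vectors = np.array([embed(c) for c in chunks], dtype=float)
    # Normalize so a dot product against a normalized query is cosine similarity.
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors, chunks

def retrieve(query, vectors, chunks, embed, k=5):
    """Return the top-k chunks by cosine similarity to the query."""
    q = np.asarray(embed(query), dtype=float)
    q /= np.linalg.norm(q)
    scores = vectors @ q
    return [chunks[i] for i in np.argsort(-scores)[:k]]

def answer(query, vectors, chunks, embed, generate, k=5):
    """Prepend the retrieved chunks to the query and call the generator."""
    context = retrieve(query, vectors, chunks, embed, k=k)
    prompt = (
        "Answer using only the context below, and cite chunk numbers.\n\n"
        + "\n\n".join(f"[{i}] {c}" for i, c in enumerate(context))
        + f"\n\nQuestion: {query}"
    )
    return generate(prompt)
```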
This works well when the corpus is reasonably clean, the queries are well-formed, the relevant content is contained within a single chunk, and the generator handles the retrieved context appropriately. That's a fairly restrictive set of conditions.
The gaps start showing up immediately in real deployments. The corpus is often noisy — documents in multiple formats, varying quality, stale content mixed with current content. Queries from real users are not always precise — they're colloquial, ambiguous, and sometimes express a need that doesn't map cleanly to any existing document. Relevant content often spans multiple chunks or requires connecting information from several different sources. And generators hallucinate — they produce confident-sounding text that isn't supported by the retrieved context, especially when the retrieved context is incomplete or contradictory.
Where Serious Implementations Differ
The production RAG systems behind high-stakes applications such as legal research platforms, medical information systems, and financial analysis tools close these gaps with techniques that don't make it into tutorials:
Hybrid retrieval. Combining dense vector retrieval with sparse keyword matching (BM25 or similar) and fusing the results. Dense retrieval handles semantic similarity; sparse retrieval handles exact terminology and proper nouns. Many domain-specific queries benefit from exact-match retrieval that vector similarity doesn't provide reliably.
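One common way to fuse the two result lists is reciprocal rank fusion, which needs nothing beyond the ranked document IDs from each retriever. A sketch, with the k=60 constant taken from the standard RRF formulation and the document IDs purely illustrative:

```python
def reciprocal_rank_fusion(ranked_lists, k=60, top_n=10):
    """Fuse ranked lists of document IDs (e.g. one from dense retrieval,
    one from BM25). Documents that rank well in either list rise; the
    constant k damps the influence of any single list's top positions."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Illustrative IDs only: doc7 and doc3 end up on top because both
# retrievers rank them, even though neither list agrees on the order.
dense = ["doc3", "doc7", "doc1"]
sparse = ["doc7", "doc9", "doc3"]
print(reciprocal_rank_fusion([dense, sparse]))
```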
Reranking. The initial retrieval returns candidates. A cross-encoder reranker scores each candidate against the full query — a much more expensive operation but one that produces significantly better precision in the top results. The candidates that make it into context are the reranker's top-k, not the retriever's.
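A sketch of that second stage using the sentence-transformers CrossEncoder class; the checkpoint name is just one commonly used MS MARCO reranker, and a real system would load the model once at startup rather than per call:

```python
from sentence_transformers import CrossEncoder

def rerank(query, candidates, top_k=5,
           model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Score every (query, passage) pair jointly and keep the best top_k.
    Far more expensive than a vector lookup, so it only runs over the
    candidate set that first-stage retrieval already narrowed down."""
    model = CrossEncoder(model_name)  # load once at startup in production
    scores = model.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_k]]
```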
Query expansion and decomposition. Multi-part questions often retrieve better when decomposed into sub-questions. Ambiguous queries benefit from expansion — generating alternative formulations and retrieving for each. The additional retrieval cost is usually worth it for precision on complex queries.
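A sketch of decomposition using the same caller-supplied `retrieve` and `generate` functions as above; the decomposition prompt is deliberately naive and would need tuning against real query traffic:

```python
def decompose_and_retrieve(query, retrieve, generate, k_per_subquery=3):
    """Split a multi-part question into sub-questions, retrieve for each,
    and return the deduplicated union of passages for the generator."""
    prompt = (
        "Split the following question into independent sub-questions, "
        "one per line. If it is already a single question, return it unchanged.\n\n"
        + query
    )
    sub_questions = [line.strip() for line in generate(prompt).splitlines() if line.strip()]

    seen, passages = set(), []
    for sub_q in sub_questions or [query]:
        for passage in retrieve(sub_q, k=k_per_subquery):
            if passage not in seen:   # keep first occurrence, drop duplicates
                seen.add(passage)
                passages.append(passage)
    return passages
```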
Citation grounding and hallucination detection. For applications where accuracy matters, every claim in the output should be verifiable against a retrieved passage. Systems that include citation metadata and a post-generation verification step — checking that claims are supported by retrieved sources — produce outputs that can be trusted more reliably than unchecked generation.
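A sketch of that verification step. The `entails` function is an assumption standing in for whatever support scorer a team actually uses, for example an NLI model or an LLM judge that returns a probability that a passage supports a sentence:

```python
def verify_grounding(answer_sentences, retrieved_passages, entails, threshold=0.8):
    """Map each sentence of the generated answer to its best supporting
    passage; anything below the support threshold is flagged rather than
    silently returned to the user."""
    supported, flagged = {}, []
    for sentence in answer_sentences:
        best_score, best_passage = 0.0, None
        for passage in retrieved_passages:
            score = entails(passage, sentence)
            if score > best_score:
                best_score, best_passage = score, passage
        if best_score >= threshold:
            supported[sentence] = best_passage   # usable as a citation
        else:
            flagged.append(sentence)             # candidate hallucination
    return supported, flagged
```

What to do with flagged sentences is a product decision: drop them, regenerate with tighter instructions, or surface them to the user with a warning.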
Corpus management as a first-class problem. The retrieval quality of any RAG system is bounded by corpus quality. Managing the index — handling updates, removals, permission changes, duplicate detection, freshness scoring — is operational work that has direct quality implications. Teams that treat corpus management as ongoing infrastructure, not a one-time indexing job, have significantly better retrieval quality over time.
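A sketch of the bookkeeping involved, with an invented document schema (doc_id, text, last_modified) and an illustrative 180-day freshness half-life:

```python
import hashlib
import time

def content_hash(text):
    """Stable fingerprint used to detect unchanged and duplicate chunks."""
    return hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()

def plan_index_update(existing, incoming, now=None, half_life_days=180.0):
    """existing maps doc_id -> hash already in the index; incoming is the
    latest crawl. Returns what to (re)embed, a freshness weight per doc,
    and the IDs that disappeared and should be removed from the index."""
    now = now or time.time()
    to_embed, freshness, seen_hashes = [], {}, set()
    for doc in incoming:
        h = content_hash(doc["text"])
        if h in seen_hashes:
            continue                       # duplicate content, skip
        seen_hashes.add(h)
        if existing.get(doc["doc_id"]) != h:
            to_embed.append(doc)           # new or changed: re-embed
        age_days = (now - doc["last_modified"]) / 86400.0
        freshness[doc["doc_id"]] = 0.5 ** (age_days / half_life_days)
    stale_ids = set(existing) - {d["doc_id"] for d in incoming}
    return to_embed, freshness, stale_ids
```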
The Grounding Problem
The problem that takes the most engineering effort to solve well is grounding: ensuring that the generated output is actually supported by the retrieved content, rather than using the retrieved content as a launching pad for the generator's own knowledge.
Language models are trained to produce fluent, coherent text. That training doesn't constrain them to stay within what the context provides. They'll extrapolate, generalize, and fill in gaps using their parametric knowledge, sometimes in ways that contradict or go beyond the retrieved sources. For a knowledge base Q&A system, that's a hallucination risk. For a legal research tool, it can produce citations to cases that don't exist.
The techniques for improving grounding — constrained generation, attribution scoring, factuality classifiers trained on domain data — are active research areas that are starting to produce deployable tools. The gap between "RAG that mostly works" and "RAG that can be trusted for high-stakes tasks" is substantially about grounding, and closing that gap is where the real engineering challenge sits.
Why This Matters for Investment
The companies building serious retrieval infrastructure — not just wrapping existing models with a vector database, but solving the hard problems of retrieval quality, grounding, and corpus management — are building genuinely difficult products with real defensibility. The basics are easy. Production-grade is not.
We're particularly interested in teams that have built RAG infrastructure for specific high-stakes domains, where the grounding problem is most acute and the tolerance for errors is lowest. That's where the hardest problems get solved, and where the products that emerge have the highest value to the organizations that need them.
Building retrieval infrastructure or grounding tooling for high-stakes AI applications? Let's talk.