What is RAG (Retrieval-Augmented Generation)?
Ground LLM answers in your own documents — retrieve at query time, augment the prompt, generate the response. Plus what goes wrong and how to govern it.
What Is RAG
Retrieval-Augmented Generation (RAG) is a pattern for grounding large language model answers in a specific body of knowledge. Instead of relying only on what the model learned during training, the system retrieves relevant passages from your own documents at query time, pastes them into the prompt, and asks the model to answer based on those passages.
The technique was introduced in a 2020 paper from Facebook AI Research, but it went mainstream in 2023 once embedding models and vector databases got cheap enough to run at production scale. Today, almost every enterprise LLM use case — from customer support chatbots to internal knowledge assistants to code search — runs some variant of RAG under the hood.
The reason RAG matters is that it closes the two biggest gaps of out-of-the-box LLMs: they don't know your private data, and they can't update themselves when your data changes. RAG lets an LLM answer questions about a contract signed yesterday using a model whose training cutoff was a year ago — without retraining the model.
How RAG Works
A production RAG system runs two pipelines. An offline indexing pipeline processes your documents into a searchable vector store; an online query pipeline retrieves from that store and generates an answer. The same five steps show up in every implementation: chunk and embed, index, retrieve, augment, generate.
1. Chunk & embed
Split documents, turn each chunk into a vector
2. Index
Store vectors in a vector database
3. Retrieve
Embed the query, find top-k nearest chunks
4. Augment
Stuff retrieved chunks into the prompt
5. Generate
LLM answers with grounded context
Indexing (1–2) happens offline; retrieval + generation (3–5) happens per query.
Chunking splits long documents into passages of a few hundred to a few thousand tokens. The chunk size is a tradeoff: smaller chunks are more precise but lose surrounding context; larger chunks preserve context but dilute the signal.
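To make the tradeoff concrete, here is a minimal chunker sketch in Python. It approximates tokens with whitespace words (a real pipeline would use the embedding model's tokenizer) and overlaps adjacent chunks so that context straddling a boundary survives in at least one of them. The function name and defaults are illustrative, not a standard API.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into overlapping chunks of roughly chunk_size tokens.

    Tokens are approximated by whitespace-separated words. The overlap
    softens the chunking-boundary-loss failure mode: context cut at a
    boundary still appears whole in the neighboring chunk.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        # Stop once the final chunk has consumed the tail of the document.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Tuning `chunk_size` and `overlap` against your eval set is usually worth more than swapping retrieval algorithms.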
Embedding turns each chunk into a vector — typically 768 to 3,072 dimensions — using a model trained so that semantically similar text produces similar vectors. OpenAI's text-embedding-3, Cohere Embed, Voyage, and open-source models like BGE and E5 are common choices.
Indexing stores those vectors in a vector database (Pinecone, Weaviate, Qdrant, pgvector, Elasticsearch) alongside the original text and metadata like source URL, author, and access-control tags.
Retrieval embeds the user's query with the same model and finds the top-k nearest chunks. Modern systems combine dense vector search with keyword (BM25) search — a pattern called hybrid retrieval — because neither alone catches every query shape.
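A toy version of the dense half of retrieval, scoring an in-memory list of vectors by cosine similarity. A real system delegates this to the vector database's approximate-nearest-neighbor index; this sketch just shows what "top-k nearest" means.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=3):
    """index: list of (chunk_id, vector). Return the k nearest chunk ids."""
    scored = [(cosine(query_vec, vec), cid) for cid, vec in index]
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:k]]
```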
Generation passes the query plus the retrieved chunks to the LLM with a prompt like "Answer the question using only the following context. If the context does not contain the answer, say so."
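The augmentation step itself is just string assembly. A minimal sketch; the instruction wording and the `[chunk-id]` citation convention are illustrative, and giving each chunk a visible id is what later makes citation enforcement possible.

```python
def build_prompt(question, chunks):
    """chunks: list of (chunk_id, text). Chunk ids double as citation handles."""
    context = "\n\n".join(f"[{cid}] {text}" for cid, text in chunks)
    return (
        "Answer the question using only the following context. "
        "Cite the chunk ids you used, like [doc-1]. If the context "
        "does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```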
RAG vs Other Grounding Strategies
RAG is not the only way to get an LLM to use private data. The choice between RAG, fine-tuning, prompt stuffing, and long-context models comes down to the shape of your data and how often it changes.
Fine-tuning is the right choice when you want to change the model's style, format, or behavior — not when you want to inject facts. Facts embedded in weights are difficult to update and hard to audit.
Long-context models (Claude's 200K window, Gemini's 2M) are tempting because they seem to remove the need for retrieval. In practice they work well for single large documents but get expensive fast on large corpora, and they suffer from the "lost in the middle" problem: the model pays less attention to content buried in the center of a long context.
Prompt stuffing — just pasting everything into the prompt — is fine for tiny, static corpora. Past a few megabytes of source material, you're paying to send the same tokens on every call and you lose the ability to cite which chunks were used.
RAG wins when the corpus is large, changes frequently, and you need per-answer citation. Most enterprise knowledge-assistant use cases hit all three.
Where RAG Goes Wrong
RAG systems fail quietly. A hallucinating LLM without retrieval is obvious — it says something clearly wrong. A RAG system that retrieves the wrong chunks and answers confidently looks authoritative and still says something wrong. This makes RAG failures more dangerous, not less.
Retrieval miss
The right chunk exists in the index but ranks outside top-k — answer is wrong with no warning
Stale index
Source doc was updated last week; embedding is from last month — answer cites outdated policy
Chunking boundary loss
Critical context sits across a chunk boundary; neither half is retrieved
Lost in the middle
Retrieval returned the right chunks, but the LLM ignored what was buried in the middle of the prompt
Confident hallucination
Retrieval returned nothing relevant; LLM invented an answer anyway with no disclaimer
Permission leak
Index mixes documents across access boundaries; user sees chunks they should never access
The retrieval problem is harder than the LLM problem
Two failures deserve special mention. The first is stale indexes: a document was updated, but the embedding pipeline hasn't rerun. The LLM answers from yesterday's version and no one notices for weeks. The fix is event-driven reindexing — not nightly batch jobs.
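One way to keep reindexing event-driven is a staleness check that runs whenever a source system reports a change. A sketch, assuming the two dicts stand in for metadata from your source system and your index; any comparable timestamp type works.

```python
def stale_doc_ids(source_docs, index_meta):
    """source_docs: {doc_id: last_modified}; index_meta: {doc_id: indexed_at}.

    A doc is stale if it changed after it was last embedded, or was never
    indexed at all. Run this on every source-change event so the answer
    can't quietly cite last month's version of the policy.
    """
    stale = []
    for doc_id, modified in source_docs.items():
        indexed = index_meta.get(doc_id)
        if indexed is None or modified > indexed:
            stale.append(doc_id)
    return stale
```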
The second is permission leaks: your index mixes documents that should be isolated by access control. Without retrieval-time access enforcement, a user's query can surface a chunk they were never entitled to read. Access control has to live on the retrieval path, not just on the source system.
Enterprise RAG
Turning a RAG demo into a production system is mostly about governance plumbing. The retrieval pipeline, the embedding model choice, and the LLM itself are the easy parts — almost every vector database and embedding provider can do the basics. What separates a toy RAG from an enterprise RAG is how you handle provenance, access, freshness, evaluation, and logging.
Source provenance
Every chunk traces to its origin doc, version, and owner
Access control on retrieval
Retrieval respects the user's doc permissions — not just the index
Citation in the answer
The LLM response cites which chunks it used, so a human can verify
Eval harness
A labeled set of queries with known-good answers, run on every index rebuild
Retrieval logging
Every query, top-k scores, and chunks shown to the LLM are captured
PII / sensitive data scrubbing
Pipeline redacts or flags sensitive content before it gets indexed
The enterprise RAG stack
- Vector databases (Pinecone, Weaviate, Qdrant, pgvector) handle storage and search. None of them handle access-controlled retrieval end-to-end.
- Embedding providers (OpenAI, Cohere, Voyage, Nomic) handle the embedding model. Model choice matters for quality; none of them handle governance.
- RAG frameworks (LangChain, LlamaIndex, Haystack) wire the pipeline. Production teams typically replace framework abstractions with their own code as they hit scale.
- Governance layer — provenance, access control, eval harness, logging — is the piece that determines whether RAG can be deployed to regulated users.
How Axiom differs
Most RAG tooling focuses on the retrieval pipeline itself. The governance layers — provenance, access-controlled retrieval, citation enforcement, eval harnesses, query logs — usually end up built in-house. Axiom's LLM Gateway and MCP Gateway give you the observability layer for free: every retrieval call, every LLM call, every chunk shown to the model is captured with who-asked-what for audit.
RAG and AI Agents
The frontier of RAG is moving from a single retrieval step to retrieval as a tool that agents decide when to use. Instead of always retrieving before answering, the agent reasons about the query, decides whether external context is needed, chooses which index to search, and may retrieve multiple times as it refines its answer. This pattern is sometimes called agentic RAG.
In this world, retrieval becomes one tool among many — typically exposed to agents through MCP. An agent might search internal docs, then search Jira, then search code, then synthesize. Each retrieval call is a tool use that an MCP gateway can log, allowlist, and audit.
This is where RAG governance and agent governance converge. A coding agent that retrieves from your codebase, a support agent that retrieves from your knowledge base, and a compliance agent that retrieves from your policies all need the same primitives: tracked queries, access-controlled retrieval, cited answers, and full logs.
Govern every retrieval, not just the LLM call
Axiom's gateways sit in front of both the LLM and the tools your agents call — including retrieval. Every query, every chunk returned, every answer generated is captured with the user identity that asked, so audit becomes "who asked what, what did we surface, and what did the model say?" — not "good luck reconstructing that."
Getting Started
A realistic first RAG deployment is a scoped internal use case — one corpus, one user audience, one evaluation target. "Answer questions about our HR policies" is a good starting shape; "be a general-purpose company brain" is not. Tight scope lets you measure whether the system works.
Step 1 — Build an eval set before you build the system
Write 50–200 real questions with known-good answers and which source document each answer comes from. This is the ground truth against which every pipeline change is measured. Teams that skip this step end up making retrieval "feel better" without knowing whether it actually improved.
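With that set in hand, "does retrieval work?" becomes a number you can track across index rebuilds. A sketch of recall@k; the `retrieve(question, k)` function returning ranked doc ids is a hypothetical stand-in for your pipeline.

```python
def recall_at_k(eval_set, retrieve, k=5):
    """eval_set: list of (question, expected_doc_id) pairs.

    retrieve(question, k) -> ranked list of doc ids (your pipeline).
    Returns the fraction of questions whose known-good source doc
    appears in the top k retrieved results.
    """
    hits = sum(
        1 for question, expected in eval_set
        if expected in retrieve(question, k)[:k]
    )
    return hits / len(eval_set)
```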
Step 2 — Start with hybrid retrieval
Dense-only retrieval (just embeddings) misses exact-term matches. BM25-only misses semantic matches. Hybrid retrieval plus a reranker is the pragmatic baseline. Start there; tune later.
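A common way to merge the dense and BM25 result lists is Reciprocal Rank Fusion (RRF), which needs only each list's ranking, not its raw scores. A minimal sketch; `k=60` is the constant commonly used in the RRF literature.

```python
def rrf_fuse(dense_ranking, bm25_ranking, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).

    Each ranking is an ordered list of chunk ids (best first). Chunks that
    rank well in either list, or moderately in both, float to the top.
    """
    scores = {}
    for ranking in (dense_ranking, bm25_ranking):
        for rank, cid in enumerate(ranking, start=1):
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A cross-encoder reranker can then reorder the fused top results before they reach the prompt.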
Step 3 — Force citations from day one
Every answer should include which chunks it used. This is the governance primitive that makes everything else possible: users can verify, reviewers can audit, evals can check whether citations match the stated source.
Step 4 — Wire access control into retrieval, not just ingestion
If a user can't read the source doc, they shouldn't retrieve its chunks — even if the index contains them. The safest pattern is to tag chunks with their source's access groups and filter at query time.
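A sketch of that query-time filter. In production you push this predicate into the vector DB query itself (metadata filtering) so unauthorized chunks never leave the index; filtering after retrieval, as shown here, is the minimum viable version.

```python
def allowed_chunks(candidates, user_groups):
    """candidates: list of (chunk_id, score, access_groups).

    Keep only chunks whose access groups intersect the querying user's
    groups. Access control lives on the retrieval path: an indexed chunk
    the user can't read must never reach the prompt.
    """
    return [
        (cid, score) for cid, score, groups in candidates
        if set(groups) & set(user_groups)
    ]
```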
Step 5 — Log every query and retrieval
You will need to debug retrieval quality, trace incidents, and prove compliance. Log the query, top-k scores, chunks passed to the LLM, and the final answer. An LLM Gateway makes this automatic.
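A sketch of what one such audit record might look like as structured JSON; the field names are illustrative, not a fixed schema.

```python
import json
from datetime import datetime, timezone

def log_retrieval(user_id, query, results, answer):
    """Emit one audit record per query: who asked, what was retrieved
    (chunk ids plus scores), and what the model answered.

    results: list of (chunk_id, score) actually passed to the LLM.
    """
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user_id,
        "query": query,
        "chunks": [{"id": cid, "score": score} for cid, score in results],
        "answer": answer,
    }
    return json.dumps(record)
```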
Make RAG auditable without rebuilding your stack
VibeFlow + Axiom's gateways give you the observability, access control, and audit trail layer on top of your existing retrieval pipeline. Bring your own vector DB, embedding model, and LLM — inherit the governance you need to ship RAG to regulated users.