
What is RAG (Retrieval-Augmented Generation)?

Ground LLM answers in your own documents — retrieve at query time, augment the prompt, generate the response. Plus what goes wrong and how to govern it.

10 min read

What Is RAG

Retrieval-Augmented Generation (RAG) is a pattern for grounding large language model answers in a specific body of knowledge. Instead of relying only on what the model learned during training, the system retrieves relevant passages from your own documents at query time, pastes them into the prompt, and asks the model to answer based on those passages.

The technique was introduced in a 2020 paper from Facebook AI Research, but it went mainstream in 2023 once embedding models and vector databases got cheap enough to run at production scale. Today, almost every enterprise LLM use case — from customer support chatbots to internal knowledge assistants to code search — runs some variant of RAG under the hood.

The reason RAG matters is that it closes the two biggest gaps of out-of-the-box LLMs: they don't know your private data, and they can't update themselves when your data changes. RAG lets an LLM answer questions about a contract signed yesterday using a model whose training cutoff was a year ago — without retraining the model.

How RAG Works

A production RAG system runs two pipelines. An offline indexing pipeline processes your documents into a searchable vector store; an online query pipeline retrieves from that store and generates an answer. The same five concepts — chunk, embed, index, retrieve, generate — show up in every implementation.

1. Chunk & embed: split documents and turn each chunk into a vector
2. Index: store the vectors in a vector database
3. Retrieve: embed the query and find the top-k nearest chunks
4. Augment: stuff the retrieved chunks into the prompt
5. Generate: the LLM answers with grounded context

Indexing (steps 1–2) happens offline; retrieval and generation (steps 3–5) happen per query.

Chunking splits long documents into passages of a few hundred to a few thousand tokens. The chunk size is a tradeoff: smaller chunks are more precise but lose surrounding context; larger chunks preserve context but dilute the signal.
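That tradeoff can be sketched as a simple overlapping-window chunker. This is a minimal illustration, not a production splitter: a whitespace split stands in for a real tokenizer, and the size and overlap values are arbitrary.

```python
def chunk(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping windows of roughly `size` tokens.
    A whitespace split stands in for a real tokenizer; the overlap
    keeps context that straddles a boundary present in both chunks."""
    assert size > overlap, "overlap must be smaller than chunk size"
    tokens = text.split()
    chunks, step = [], size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        if window:
            chunks.append(" ".join(window))
        if start + size >= len(tokens):
            break
    return chunks

doc = " ".join(f"token{i}" for i in range(500))
pieces = chunk(doc)
# 500 tokens with step 160 → 3 chunks, each sharing 40 tokens with its neighbor
```

The overlap is what softens the boundary-loss failure discussed later: context that sits across a chunk edge appears whole in at least one chunk.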

Embedding turns each chunk into a vector — typically 768 to 3,072 dimensions — using a model trained so that semantically similar text produces similar vectors. OpenAI's text-embedding-3, Cohere Embed, Voyage, and open-source models like BGE and E5 are common choices.

Indexing stores those vectors in a vector database (Pinecone, Weaviate, Qdrant, pgvector, Elasticsearch) alongside the original text and metadata like source URL, author, and access-control tags.

Retrieval embeds the user's query with the same model and finds the top-k nearest chunks. Modern systems combine dense vector search with keyword (BM25) search — a pattern called hybrid retrieval — because neither alone catches every query shape.

Generation passes the query plus the retrieved chunks to the LLM with a prompt like "Answer the question using only the following context. If the context does not contain the answer, say so."
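The five steps can be sketched end to end in a few lines. Everything here is a toy stand-in: the bag-of-words `embed` replaces a real embedding model such as text-embedding-3, and a plain Python list replaces the vector database.

```python
import math
from collections import Counter

def embed(text: str, vocab: list[str]) -> list[float]:
    """Toy bag-of-words embedding, a stand-in for a real embedding model.
    Real vectors are 768 to 3,072 dims and capture meaning, not just
    literal word overlap."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# 1–2. Chunk & embed, then index (a list stands in for a vector DB)
chunks = [
    "Refunds are processed within 14 days of the return.",
    "Our office is closed on public holidays.",
    "Shipping to the EU takes 3 to 5 business days.",
]
vocab = sorted({w for c in chunks for w in c.lower().split()})
index = [(embed(c, vocab), c) for c in chunks]

# 3. Retrieve: embed the query with the same model, take top-k chunks
query = "How long do refunds take?"
q = embed(query, vocab)
top_k = sorted(index, key=lambda entry: cosine(q, entry[0]), reverse=True)[:2]

# 4. Augment: stuff the retrieved chunks into the prompt
context = "\n".join(text for _, text in top_k)
prompt = (
    "Answer the question using only the following context. "
    "If the context does not contain the answer, say so.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}"
)
# 5. Generate: `prompt` now goes to whichever LLM you use
```

Swapping the toy pieces for a real embedding model and a real vector store changes the quality, not the shape, of the pipeline.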

RAG vs Other Grounding Strategies

RAG is not the only way to get an LLM to use private data. The choice between RAG, fine-tuning, prompt stuffing, and long-context models comes down to the shape of your data and how often it changes.

| Approach | How Data Gets In | Best For | Cost Shape |
| --- | --- | --- | --- |
| Prompt stuffing | Paste full docs into the prompt | Tiny, static corpus | High token cost on every call |
| Fine-tuning | Bake knowledge into weights | Style and behavior, not facts | Training cost; stale data |
| RAG | Retrieve at query time | Large, changing corpus | Index cost plus retrieval latency |
| Long-context LLM | 1M-token context window | Single large document | Per-call token cost scales with input |

Fine-tuning is the right choice when you want to change the model's style, format, or behavior — not when you want to inject facts. Facts embedded in weights are difficult to update and hard to audit.

Long-context models (Claude's 200K window, Gemini's 2M) are tempting because they seem to remove the need for retrieval. In practice they work well for single large documents but get expensive fast on large corpora, and they suffer from the "lost in the middle" problem: the model pays less attention to content buried in the center of a long context.

Prompt stuffing — just pasting everything into the prompt — is fine for tiny, static corpora. Past a few megabytes of source material, you're paying to send the same tokens on every call and you lose the ability to cite which chunks were used.

RAG wins when the corpus is large, changes frequently, and you need per-answer citation. Most enterprise knowledge-assistant use cases hit all three.

Where RAG Goes Wrong

RAG systems fail quietly. A hallucinating LLM without retrieval is obvious — it says something clearly wrong. A RAG system that retrieves the wrong chunks and answers confidently looks authoritative and still says something wrong. This makes RAG failures more dangerous, not less.

  • Retrieval miss: the right chunk exists in the index but ranks outside the top-k, so the answer is wrong with no warning.
  • Stale index: the source doc was updated last week but the embedding is from last month, so the answer cites outdated policy.
  • Chunking boundary loss: critical context sits across a chunk boundary, and neither half is retrieved.
  • Lost in the middle: retrieval returned the right chunks, but the LLM ignored what was buried in the middle of the prompt.
  • Confident hallucination: retrieval returned nothing relevant, and the LLM invented an answer anyway with no disclaimer.
  • Permission leak: the index mixes documents across access boundaries, and a user sees chunks they should never access.

The retrieval problem is harder than the LLM problem

Most "RAG isn't working" incidents trace back to retrieval, not generation. Better prompting doesn't fix a retriever that can't find the right chunk. Investing in hybrid search (dense + BM25), reranking, and a real eval harness pays off more than trying another model.

Two failures deserve special mention. The first is stale indexes: a document was updated, but the embedding pipeline hasn't rerun. The LLM answers from yesterday's version and no one notices for weeks. The fix is event-driven reindexing — not nightly batch jobs.
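Event-driven reindexing can be as small as one handler that re-embeds a document the moment its source system reports a change. The sketch below assumes the source can fire an update event; the in-memory class and the stand-in chunk/embed functions are illustrative.

```python
class InMemoryIndex:
    """Minimal stand-in for a vector DB supporting delete and upsert."""
    def __init__(self):
        self.rows = {}  # chunk id -> (vector, metadata)

    def delete(self, doc_id: str):
        self.rows = {k: v for k, v in self.rows.items()
                     if v[1]["doc_id"] != doc_id}

    def upsert(self, chunk_id: str, vector, metadata: dict):
        self.rows[chunk_id] = (vector, metadata)

def on_document_updated(event: dict, index, embed, chunk):
    """Re-chunk and re-embed one document as soon as the source system
    reports a change, instead of waiting for a nightly batch job."""
    index.delete(event["doc_id"])  # drop the stale chunks first
    for i, text in enumerate(chunk(event["content"])):
        index.upsert(f"{event['doc_id']}:{i}", embed(text),
                     {"doc_id": event["doc_id"], "version": event["version"]})

# Wire it up with trivial stand-ins for chunking and embedding
idx = InMemoryIndex()
fake_embed = lambda t: [float(len(t))]
fake_chunk = lambda t: [t]
on_document_updated({"doc_id": "policy-7", "content": "v1 text", "version": 1},
                    idx, fake_embed, fake_chunk)
on_document_updated({"doc_id": "policy-7", "content": "v2 text, longer", "version": 2},
                    idx, fake_embed, fake_chunk)
# The index now holds only the version-2 chunks for policy-7
```

The important property is that the delete-then-upsert happens per event, so the window in which the index can answer from yesterday's version shrinks from a day to seconds.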

The second is permission leaks: your index mixes documents that should be isolated by access control. Without retrieval-time access enforcement, a user's query can surface a chunk they were never entitled to read. Access control has to live on the retrieval path, not just on the source system.

Enterprise RAG

Turning a RAG demo into a production system is mostly about governance plumbing. The retrieval pipeline, the embedding model choice, and the LLM itself are the easy parts — almost every vector database and embedding provider can do the basics. What separates a toy RAG from an enterprise RAG is how you handle provenance, access, freshness, evaluation, and logging.

  • Source provenance: every chunk traces to its origin document, version, and owner.
  • Access control on retrieval: retrieval respects the user's document permissions, not just the index.
  • Citation in the answer: the LLM response cites which chunks it used, so a human can verify.
  • Eval harness: a labeled set of queries with known-good answers, run on every index rebuild.
  • Retrieval logging: every query, the top-k scores, and the chunks shown to the LLM are captured.
  • PII / sensitive data scrubbing: the pipeline redacts or flags sensitive content before it gets indexed.

The enterprise RAG stack

  • Vector databases (Pinecone, Weaviate, Qdrant, pgvector) handle storage and search. None of them handle access-controlled retrieval end-to-end.
  • Embedding providers (OpenAI, Cohere, Voyage, Nomic) handle the embedding model. Model choice matters for quality; none of them handle governance.
  • RAG frameworks (LangChain, LlamaIndex, Haystack) wire the pipeline. Production teams typically replace framework abstractions with their own code as they hit scale.
  • Governance layer — provenance, access control, eval harness, logging — is the piece that determines whether RAG can be deployed to regulated users.

How Axiom differs

Most RAG tooling focuses on the retrieval pipeline itself. The governance layers — provenance, access-controlled retrieval, citation enforcement, eval harnesses, query logs — usually end up built in-house. Axiom's LLM Gateway and MCP Gateway give you the observability layer for free: every retrieval call, every LLM call, every chunk shown to the model is captured with who-asked-what for audit.

RAG and AI Agents

The frontier of RAG is moving from a single retrieval step to retrieval as a tool that agents decide when to use. Instead of always retrieving before answering, the agent reasons about the query, decides whether external context is needed, chooses which index to search, and may retrieve multiple times as it refines its answer. This pattern is sometimes called agentic RAG.

In this world, retrieval becomes one tool among many — typically exposed to agents through MCP. An agent might search internal docs, then search Jira, then search code, then synthesize. Each retrieval call is a tool use that an MCP gateway can log, allowlist, and audit.

This is where RAG governance and agent governance converge. A coding agent that retrieves from your codebase, a support agent that retrieves from your knowledge base, and a compliance agent that retrieves from your policies all need the same primitives: tracked queries, access-controlled retrieval, cited answers, and full logs.

Govern every retrieval, not just the LLM call

Axiom's gateways sit in front of both the LLM and the tools your agents call — including retrieval. Every query, every chunk returned, every answer generated is captured with the user identity that asked, so audit becomes "who asked what, what did we surface, and what did the model say?" — not "good luck reconstructing that."

See MCP Gateway

Getting Started

A realistic first RAG deployment is a scoped internal use case — one corpus, one user audience, one evaluation target. "Answer questions about our HR policies" is a good starting shape; "be a general-purpose company brain" is not. Tight scope lets you measure whether the system works.

Step 1 — Build an eval set before you build the system

Write 50–200 real questions with known-good answers and which source document each answer comes from. This is the ground truth against which every pipeline change is measured. Teams that skip this step end up making retrieval "feel better" without knowing whether it actually improved.
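A retrieval-level metric over such an eval set can be a single hit-rate function: the fraction of questions whose known-good source document shows up in the top-k. The eval cases and the stand-in retriever below are made up for illustration.

```python
def retrieval_hit_rate(eval_set: list[dict], retrieve, k: int = 5) -> float:
    """Fraction of eval questions whose known-good source document
    appears among the top-k retrieved chunks. Run on every pipeline
    change to see whether retrieval actually improved."""
    hits = 0
    for case in eval_set:
        retrieved_docs = {c["doc_id"] for c in retrieve(case["question"], k)}
        hits += case["source_doc"] in retrieved_docs
    return hits / len(eval_set)

# Hypothetical eval cases; a real deployment wants 50-200 of these
eval_set = [
    {"question": "How many vacation days do new hires get?",
     "source_doc": "hr-leave-policy"},
    {"question": "Who approves travel expenses?",
     "source_doc": "hr-expense-policy"},
]

# A stand-in retriever that always returns the leave policy
retrieve = lambda q, k: [{"doc_id": "hr-leave-policy"}]
score = retrieval_hit_rate(eval_set, retrieve)  # → 0.5
```

Because the metric only needs document ids, it can be computed without calling the LLM at all, which keeps eval runs cheap enough to attach to every index rebuild.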

Step 2 — Start with hybrid retrieval

Dense-only retrieval (just embeddings) misses exact-term matches. BM25-only misses semantic matches. Hybrid retrieval plus a reranker is the pragmatic baseline. Start there; tune later.
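One common way to combine the two retrievers is Reciprocal Rank Fusion (RRF), which merges ranked lists without having to calibrate their raw scores against each other. The document ids below are made up.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each input list contributes
    1 / (k + rank) per document, so items ranked well by *both*
    dense and BM25 retrieval rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]   # ranked by embedding similarity
bm25  = ["d1", "d9", "d3"]   # ranked by keyword (BM25) match
fused = rrf_fuse([dense, bm25])
# d1 (ranks 2 and 1) and d3 (ranks 1 and 3) beat single-list hits
```

The constant `k = 60` is the value commonly used in the RRF literature; it damps the advantage of a single first-place ranking.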

Step 3 — Force citations from day one

Every answer should include which chunks it used. This is the governance primitive that makes everything else possible: users can verify, reviewers can audit, evals can check whether citations match the stated source.
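Enforcing citations can be mechanical if answers use a fixed marker format. The `[chunk:id]` convention below is an assumption for illustration, not a standard; the check is that every cited id refers to a chunk that was actually shown to the model.

```python
import re

def cited_chunk_ids(answer: str, allowed_ids: set[str]) -> set[str]:
    """Extract [chunk:id] citations (an assumed marker format) and
    verify each one refers to a chunk that was actually retrieved.
    A citation of a never-retrieved chunk is a hard failure."""
    cited = set(re.findall(r"\[chunk:([\w-]+)\]", answer))
    unknown = cited - allowed_ids
    if unknown:
        raise ValueError(f"answer cites chunks never retrieved: {unknown}")
    return cited

answer = "Refunds take 14 days [chunk:policy-7-2]."
used = cited_chunk_ids(answer, {"policy-7-2", "policy-7-3"})
# → {"policy-7-2"}
```

An eval harness can then assert not just that the answer is right, but that its citations point at the documents the labeled answer came from.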

Step 4 — Wire access control into retrieval, not just ingestion

If a user can't read the source doc, they shouldn't retrieve its chunks — even if the index contains them. The safest pattern is to tag chunks with their source's access groups and filter at query time.
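A sketch of that query-time filter: chunks carry their source's access groups, and anything the caller's groups don't cover is dropped before ranking. The plain list with group tags is a stand-in for a vector DB metadata filter (most vector databases expose an equivalent).

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# (vector, text, allowed access groups) per chunk; toy data
index = [
    ([1.0, 0.0], "Exec comp bands for 2025", {"hr-admins"}),
    ([0.9, 0.1], "General leave policy", {"all-employees"}),
]

def retrieve_for_user(query_vec, index, user_groups: set[str], k: int = 3):
    """Drop chunks the caller's groups don't cover *before* ranking,
    so restricted content is never even a retrieval candidate."""
    visible = [row for row in index if row[2] & user_groups]
    visible.sort(key=lambda row: dot(query_vec, row[0]), reverse=True)
    return [text for _, text, _ in visible[:k]]

print(retrieve_for_user([1.0, 0.0], index, {"all-employees"}))
# prints ['General leave policy']
```

Filtering before ranking matters: filtering afterwards can silently shrink the result set below k, and a bug there fails open instead of closed.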

Step 5 — Log every query and retrieval

You will need to debug retrieval quality, trace incidents, and prove compliance. Log the query, top-k scores, chunks passed to the LLM, and the final answer. An LLM Gateway makes this automatic.
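A minimal structured record covering those fields might look like the sketch below. The field names are illustrative, not a fixed schema, and the list sink stands in for whatever log pipeline you already run.

```python
import json
import time

def log_retrieval(query: str, user_id: str, scored_chunks, answer: str, sink):
    """Append one structured record per query: who asked, what was
    retrieved with what scores, and what the model answered."""
    sink.append(json.dumps({
        "ts": time.time(),
        "user": user_id,
        "query": query,
        "retrieved": [{"chunk_id": cid, "score": round(s, 4)}
                      for cid, s in scored_chunks],
        "answer": answer,
    }))

log_lines: list[str] = []
log_retrieval("refund window?", "u-42",
              [("policy-7-2", 0.91), ("policy-7-3", 0.78)],
              "Refunds take 14 days.", log_lines)
record = json.loads(log_lines[0])
```

With records in this shape, "who asked what, what did we surface, and what did the model say" becomes a query over logs rather than a reconstruction exercise.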

Make RAG auditable without rebuilding your stack

VibeFlow + Axiom's gateways give you the observability, access control, and audit trail layer on top of your existing retrieval pipeline. Bring your own vector DB, embedding model, and LLM — inherit the governance you need to ship RAG to regulated users.

Talk to us

Make RAG auditable from day one

Axiom's LLM Gateway and MCP Gateway capture every query, every chunk, and every generated answer — so retrieval becomes governed, not invisible.

Contact Us