RAG: Retrieval-Augmented Generation Explained
Retrieval-Augmented Generation (RAG) is among the most widely deployed techniques for building AI systems that answer questions based on your own data: documentation, internal knowledge bases, databases, or any corpus that changes over time. Understanding RAG is essential for anyone building production AI applications.
The Problem RAG Solves
Language models have a training cutoff. They don't know about:
- Your company's internal documentation
- Product updates released after their training data
- Private data (customer records, proprietary research)
- Real-time information
You could attempt to solve this with fine-tuning — training the model on your data. But fine-tuning is expensive, slow to update, and doesn't reliably teach factual recall. Models learn patterns, not specific facts, from fine-tuning.
RAG takes a different approach: retrieve the relevant information at query time, then inject it into the prompt. The model doesn't need to memorize facts — it reads them from the retrieved context, just like a human researcher reads a reference before answering.
The RAG Architecture
User query
↓
[Retrieval]
Embed the query → search vector database → return top-K relevant documents
↓
[Augmentation]
Insert retrieved documents into the prompt as context
↓
[Generation]
Model reads context + generates answer grounded in retrieved documents
↓
Response to user (with citations)
Components
1. Document Store
Your corpus: documentation, PDFs, database records, web pages. Ingested and chunked into retrievable pieces.
2. Embedding Model
Converts text chunks into vector representations. Semantically similar texts produce similar vectors. Common choices: text-embedding-3-large (OpenAI), embed-english-v3.0 (Cohere), or open-source models via sentence-transformers.
3. Vector Database
Stores and searches vectors efficiently. Options: Pinecone, Weaviate, Qdrant, pgvector (PostgreSQL extension), Chroma (local).
4. Retrieval Logic
Given a query vector, find the most similar document vectors (cosine similarity or dot product). Return the top K chunks.
5. Language Model
Receives the retrieved context + user question → generates the answer.
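The retrieval logic in component 4 can be sketched in plain Python. This is a toy illustration with hand-written 3-dimensional vectors; a real system uses embedding-model outputs (hundreds to thousands of dimensions) and delegates the search to the vector database:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scored[:k]]

# Toy 3-dimensional "embeddings", for illustration only.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [1.0, 0.05, 0.0]
print(top_k(query, docs, k=2))  # → [0, 1]: the two vectors closest to the query
```

Vector databases implement the same idea with approximate nearest-neighbor indexes so the search stays fast over millions of chunks.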
Chunking Strategy
How you split documents into chunks is one of the highest-impact RAG decisions:
Fixed-size chunking: Split every N characters or tokens. Simple, but can cut mid-sentence or mid-concept.
Sentence/paragraph chunking: Split on sentence or paragraph boundaries. More semantically coherent.
Recursive chunking: Try to split on paragraphs, then sentences, then words. Preserves structure where possible.
Semantic chunking: Use an embedding model to identify natural breakpoints in meaning. Most accurate, most expensive.
Practical recommendation: Start with paragraph-level chunking with 200-500 token chunks and 50-token overlap between chunks. The overlap ensures concepts that span chunk boundaries are retrievable.
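The fixed-size-with-overlap recommendation can be sketched as follows. This version uses whitespace splitting as a stand-in tokenizer; production code should count tokens with the embedding model's actual tokenizer:

```python
def chunk_tokens(text, chunk_size=300, overlap=50):
    """Split text into fixed-size token chunks with overlap.

    Whitespace tokenization is used here for illustration only;
    use your embedding model's tokenizer for real token counts.
    """
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap  # each chunk starts `step` tokens after the last
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the final chunk already reaches the end of the document
    return chunks

doc = " ".join(f"word{i}" for i in range(700))
chunks = chunk_tokens(doc, chunk_size=300, overlap=50)
print(len(chunks))  # → 3 chunks: tokens 0-299, 250-549, 500-699
```

Note how tokens 250-299 appear in both the first and second chunks: that overlap is what keeps a concept retrievable when it straddles a boundary.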
The RAG Prompt Pattern
System:
You are a helpful assistant for [company]. Answer questions based on
the provided context. If the context doesn't contain enough information
to answer, say so clearly — do not answer from general knowledge.
Cite the source document for each claim.
User:
Context:
---
[Document 1: Customer Support Policy v3.2]
Refund requests must be submitted within 30 days of purchase...
---
[Document 2: FAQ - Returns]
Items must be in original packaging to qualify for return...
---
Question: Can I return a product after 45 days?
Key Elements
System instruction to use only context: Critical for preventing hallucination. The model should never blend retrieved facts with training data without flagging it.
Source labels: Tell the model where each chunk came from. This enables proper citation.
Fallback instruction: What to do when the answer isn't in the retrieved context. Always specify this — the alternative is confident hallucination.
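Putting the three elements together, prompt assembly is straightforward string formatting. A minimal sketch, assuming each retrieved chunk arrives as a dict with hypothetical `source` and `text` keys (adapt to whatever your retrieval layer actually returns):

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer questions based on the provided "
    "context. If the context doesn't contain enough information to answer, "
    "say so clearly. Cite the source document for each claim."
)

def build_user_prompt(question, chunks):
    """Format retrieved chunks with source labels, then append the question."""
    parts = ["Context:"]
    for chunk in chunks:
        parts.append("---")
        parts.append(f"[{chunk['source']}]")  # source label enables citation
        parts.append(chunk["text"])
    parts.append("---")
    parts.append(f"Question: {question}")
    return "\n".join(parts)

chunks = [
    {"source": "Customer Support Policy v3.2",
     "text": "Refund requests must be submitted within 30 days of purchase."},
]
print(build_user_prompt("Can I return a product after 45 days?", chunks))
```

The system prompt stays constant per application; only the user message changes per query.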
Improving Retrieval Quality
Retrieval is where most RAG systems fail. The model is only as good as the context it receives.
Hybrid Search
Combine dense (vector/semantic) retrieval with sparse (keyword/BM25) retrieval. Dense retrieval finds semantically related content; sparse retrieval finds exact keyword matches. Merging both outperforms either alone.
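One common way to merge the two result lists is Reciprocal Rank Fusion (RRF), which needs only each retriever's ranking, not comparable scores. A minimal sketch, assuming each retriever returns a ranked list of document IDs:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked result lists with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant proposed in the original RRF paper.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_results = ["doc_a", "doc_b", "doc_c"]   # from vector search
sparse_results = ["doc_b", "doc_d", "doc_a"]  # from BM25
print(reciprocal_rank_fusion([dense_results, sparse_results]))
# → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

doc_b wins because it ranks highly in both lists; documents found by only one retriever still make the merged list, just lower down.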
Query Rewriting
The user's question may not be the best retrieval query. Use the model to rewrite it before retrieval:
System: Rewrite the following question into 3 search queries optimized for
retrieval from technical documentation. Output as a JSON array of strings.
User question: "Why does my login keep failing?"
Rewritten queries:
["authentication failure causes", "login error troubleshooting", "session token validation"]
Reranking
After initial retrieval (e.g., the top 20 chunks), use a reranking model to select the most relevant few (e.g., the top 5). Rerankers are trained specifically for relevance judgment and typically beat raw similarity search on precision.
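The retrieve-then-rerank pattern itself is simple. In this sketch the scorer is a deliberately crude word-overlap heuristic standing in for a real reranking model (e.g., a cross-encoder or a hosted reranking API), which would score (query, text) pairs the same way:

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Stage two of retrieval: score each candidate against the query
    with score_fn and keep the top_n highest-scoring chunks."""
    scored = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:top_n]

# Toy stand-in scorer: counts shared words. A real reranker is a
# trained relevance model, not a word-overlap heuristic.
def word_overlap(query, text):
    return len(set(query.lower().split()) & set(text.lower().split()))

candidates = [
    "Refund requests must be submitted within 30 days.",
    "Our office is closed on public holidays.",
    "Items must be in original packaging to qualify for a refund of the purchase.",
]
print(rerank("refund within 30 days", candidates, word_overlap, top_n=2))
```

Only the surviving top_n chunks go into the prompt, which keeps the context short and focused.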
Metadata Filtering
Narrow retrieval before semantic search using metadata filters:
# Chroma-style query. Note: recent Chroma versions require an explicit
# $and operator to combine multiple metadata conditions.
results = collection.query(
    query_embeddings=[query_vector],
    where={"$and": [
        {"category": "returns-policy"},
        {"version": "current"},
    ]},
    n_results=5,
)
RAG vs. Fine-Tuning
| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Updates knowledge | Easy — update the document store | Hard — requires retraining |
| Factual accuracy | High (grounded in retrieved text) | Unreliable for specific facts |
| Cost | Retrieval + inference | Training cost + inference |
| Setup complexity | Moderate | High |
| Teaches style/behavior | No | Yes |
| Best for | Dynamic knowledge, Q&A | Style, tone, task behavior |
RAG and fine-tuning are complementary, not competing. RAG provides current knowledge; fine-tuning shapes behavior.
Evaluating a RAG System
Retrieval metrics:
- Recall@K — does the correct document appear in the top K results?
- Precision@K — of the top K results, how many are actually relevant?
Generation metrics:
- Faithfulness — does the answer stay within the retrieved context?
- Answer relevance — does the answer actually address the question?
- Context utilization — did the model use all relevant retrieved context?
Frameworks like RAGAS automate these evaluations.
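The two retrieval metrics are easy to compute yourself, given gold labels (the document IDs a human judged relevant for each query). The IDs below are illustrative:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are actually relevant."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k

retrieved = ["doc_3", "doc_7", "doc_1", "doc_9", "doc_4"]  # system output
relevant = ["doc_1", "doc_2"]                               # gold labels
print(recall_at_k(retrieved, relevant, k=5))     # → 0.5: found 1 of 2 relevant docs
print(precision_at_k(retrieved, relevant, k=5))  # → 0.2: 1 of 5 results relevant
```

Averaging these over a held-out query set gives a retrieval score you can track as you tune chunking, hybrid search, and reranking.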
Key Takeaways
- RAG solves the knowledge cutoff and private data problems without expensive retraining
- The core pattern: embed query → retrieve relevant chunks → inject into prompt → generate grounded response
- Chunking strategy and retrieval quality are the biggest levers on RAG performance
- Always instruct the model to stay within retrieved context and cite sources
- Hybrid search (semantic + keyword) outperforms either approach alone
- Evaluate both retrieval quality and generation faithfulness separately