Retrieval-Augmented Generation (RAG)
RAG combines retrieval systems with language models, allowing LLMs to access external knowledge beyond their training data. It's become essential for building practical AI applications.
The Problem with Pure LLMs
Standalone LLMs have several limitations:
- Knowledge cutoff: Only know what was in training data
- Hallucinations: Confidently generate false information
- No source attribution: Can't tell you where info came from
- Static knowledge: Can't access real-time information
- Domain gaps: May lack specialized knowledge
The RAG Solution
Combine retrieval with generation:
Query: "What were Q3 2024 earnings?"
↓
[Retriever] → Find relevant documents
↓
[Retrieved docs] + [Query]
↓
[LLM generates answer with context]
↓
Answer with citations
RAG Architecture
1. Indexing Phase (Offline)
Documents → Chunk → Embed → Store in Vector DB
Chunking: Split documents into manageable pieces
- Fixed size (e.g., 512 tokens)
- Semantic boundaries (paragraphs, sections)
- Overlapping windows
Embedding: Convert chunks to vectors
- Models: OpenAI embeddings, sentence-transformers
- Dimension: typically 384-1536
Vector Store: Efficient similarity search
- Pinecone, Weaviate, Chroma, FAISS, Qdrant
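The indexing phase can be sketched end to end in a few lines. This is a toy illustration, not any library's API: `embed()` is a deterministic hash-based stand-in for a real embedding model, and a plain Python list stands in for the vector store.

```python
import hashlib

def embed(text: str, dim: int = 8) -> list[float]:
    """Toy deterministic 'embedding': hashed bytes scaled to [0, 1).
    A real system would call an embedding model here instead."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:dim]]

def chunk(document: str, size: int = 40) -> list[str]:
    """Fixed-size character chunks; production systems chunk by tokens."""
    return [document[i:i + size] for i in range(0, len(document), size)]

# "Vector store": a list of (vector, text) pairs standing in for
# Pinecone / Chroma / FAISS.
index: list[tuple[list[float], str]] = []
for doc in ["RAG combines retrieval with generation.",
            "Embeddings map text to dense vectors."]:
    for piece in chunk(doc):
        index.append((embed(piece), piece))
```

The same `index` structure is what the retrieval phase searches over.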
2. Retrieval Phase
Query → Embed → Search vector DB → Top-k chunks
Similarity Search: Find most relevant chunks
- Cosine similarity
- Approximate nearest neighbors (ANN)
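A minimal exact similarity search over the same `(vector, text)` pairs looks like this; ANN indexes (HNSW, IVF) approximate this brute-force scan to stay fast on large collections:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: list[float], index: list[tuple[list[float], str]],
          k: int = 3) -> list[str]:
    # Brute-force scan: score every stored chunk, keep the k best.
    ranked = sorted(index, key=lambda pair: cosine(query_vec, pair[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]
```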
3. Generation Phase
[System prompt] + [Retrieved context] + [Query] → LLM → Answer
The LLM generates an answer grounded in retrieved context.
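Assembling that prompt is straightforward string formatting. A minimal sketch (the exact template is an assumption, and varies by model and framework); numbering the chunks lets the model cite sources:

```python
def build_prompt(system: str, chunks: list[str], query: str) -> str:
    # Number each chunk so the model can cite sources as [1], [2], ...
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (f"{system}\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {query}\n"
            f"Answer (cite sources by number):")
```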
Retrieval Strategies
Dense Retrieval
Embed query and documents, find similar vectors.
Pros: Captures semantic meaning
Cons: Needs good embeddings, compute-intensive
Sparse Retrieval (BM25)
Keyword-based matching.
Pros: Fast, no training needed
Cons: Misses synonyms, semantic relationships
Hybrid Retrieval
Combine dense and sparse:
final_score = α × dense_score + (1-α) × sparse_score
Often the best approach!
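One practical detail: dense and sparse scores live on different scales, so they should be normalized before mixing. A sketch of the weighted combination above, using min-max normalization (one common choice among several):

```python
def minmax(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so dense and sparse are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid(dense: dict[str, float], sparse: dict[str, float],
           alpha: float = 0.5) -> dict[str, float]:
    # final_score = alpha * dense + (1 - alpha) * sparse, over normalized scores
    d, s = minmax(dense), minmax(sparse)
    return {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
            for doc in set(d) | set(s)}
```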
Re-ranking
Retrieve many candidates, then re-rank with a more powerful model:
Query → BM25 (top 100) → Re-ranker (top 10) → LLM
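The re-ranking stage can be sketched as below. The scorer here is a toy term-overlap heuristic standing in for a real cross-encoder, which would jointly encode each (query, passage) pair:

```python
def rerank(query: str, candidates: list[str], top_n: int = 10) -> list[str]:
    # Toy relevance score: count of query terms appearing in the passage.
    # A real re-ranker would run a cross-encoder model here.
    terms = set(query.lower().split())
    def score(passage: str) -> int:
        return len(terms & set(passage.lower().split()))
    return sorted(candidates, key=score, reverse=True)[:top_n]
```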
Chunking Strategies
Fixed-Size Chunks
Split every N tokens, optional overlap
Simple but may break mid-sentence.
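A sliding-window chunker over a token list is a few lines; the step size is `size - overlap`, so each chunk repeats the tail of the previous one:

```python
def fixed_chunks(tokens: list, size: int = 512,
                 overlap: int = 64) -> list[list]:
    """Slide a window of `size` tokens, stepping by size - overlap."""
    assert 0 <= overlap < size
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covers the end
    return chunks
```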
Recursive Splitting
Split by paragraphs, then sentences, then characters
Respects document structure.
Semantic Chunking
Split when embedding similarity drops
Keeps related content together.
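The idea can be sketched with a cheap stand-in: word overlap substitutes for embedding cosine similarity in this toy version, and a new chunk starts whenever similarity between adjacent sentences drops below a threshold.

```python
def word_overlap(a: str, b: str) -> float:
    """Jaccard word overlap: a stand-in for embedding cosine similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    # Start a new chunk whenever similarity to the previous sentence drops.
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if word_overlap(prev, cur) >= threshold:
            chunks[-1].append(cur)
        else:
            chunks.append([cur])
    return [" ".join(c) for c in chunks]
```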
Document-Specific
- Code: Split by functions/classes
- HTML: Split by sections/headings
- Tables: Keep rows together
Advanced RAG Techniques
Query Transformation
Rewrite query for better retrieval:
"What's the latest iPhone?" → "iPhone 15 Pro specifications features 2024"
HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer, embed that:
Query → LLM generates fake answer → Embed fake answer → Search
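Structurally, HyDE is a small change to the retrieval call. In this sketch `generate`, `embed`, and `search` are injected placeholders for an LLM call, an embedding model, and a vector-store lookup:

```python
def hyde_retrieve(query: str, generate, embed, search):
    """HyDE: embed a hypothetical answer instead of the query itself.

    The intuition: a fake answer is closer in embedding space to real
    answer passages than the question is."""
    hypothetical_answer = generate(query)
    return search(embed(hypothetical_answer))
```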
Multi-Query RAG
Generate multiple query variations:
Original → [Query 1, Query 2, Query 3] → Retrieve for each → Merge results
Self-RAG
Model decides when to retrieve and validates retrieved info.
Agentic RAG
Multi-step retrieval with reasoning:
Step 1: Retrieve, realize need more info
Step 2: Formulate new query, retrieve again
Step 3: Synthesize all retrieved info
Evaluation
Retrieval Metrics
- Precision@k: Relevant docs in top k
- Recall@k: Found relevant docs / all relevant docs
- MRR: Mean reciprocal rank, the average over queries of 1/(rank of the first relevant result)
- NDCG: Normalized discounted cumulative gain, which rewards relevant results ranked higher
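The first three retrieval metrics are simple to compute for a single query (MRR averages the reciprocal rank over a whole query set; one query is shown here):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result; MRR averages this per query."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0
```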
Generation Metrics
- Faithfulness: Is answer supported by context?
- Relevance: Does answer address the question?
- Completeness: Are all aspects covered?
Tools
- RAGAS (RAG Assessment)
- TruLens
- LangSmith
Common Pitfalls
1. Poor Chunking
Chunks that are too small lose context; chunks that are too large dilute relevance.
2. Wrong Embedding Model
General embeddings may fail for domain-specific content.
3. Ignoring Retrieved Context
LLM may rely on parametric knowledge instead of context.
4. No Re-ranking
Top semantic results aren't always most relevant.
5. Missing Metadata
Filters (date, source, type) are often crucial for good retrieval.
RAG vs Fine-tuning
| Aspect | RAG | Fine-tuning |
|---|---|---|
| Knowledge updates | Easy (update DB) | Requires retraining |
| Attribution | Yes (sources) | No |
| Hallucination | Reduced | Still possible |
| Cost | Inference overhead | Training cost |
| Best for | Facts, docs | Style, behavior |
Best practice: Use both! Fine-tune for style, RAG for facts.
Key Takeaways
- RAG grounds LLMs in retrieved documents
- Architecture: Index → Retrieve → Generate
- Hybrid retrieval often works best
- Chunking strategy significantly impacts quality
- Advanced techniques: query rewriting, re-ranking, multi-step
- Evaluate both retrieval and generation quality