Retrieval-Augmented Generation (RAG)
RAG combines retrieval systems with language models, allowing LLMs to access external knowledge beyond their training data. It's become essential for building practical AI applications.
The Problem with Pure LLMs
Standalone LLMs have several limitations:
- Knowledge cutoff: Only know what was in training data
- Hallucinations: Confidently generate false information
- No source attribution: Can't tell you where info came from
- Static knowledge: Can't access real-time information
- Domain gaps: May lack specialized knowledge
The RAG Solution
Combine retrieval with generation:
Query: "What were Q3 2024 earnings?"
↓
[Retriever] → Find relevant documents
↓
[Retrieved docs] + [Query]
↓
[LLM generates answer with context]
↓
Answer with citations
RAG Architecture
1. Indexing Phase (Offline)
Documents → Chunk → Embed → Store in Vector DB
Chunking: Split documents into manageable pieces
- Fixed size (e.g., 512 tokens)
- Semantic boundaries (paragraphs, sections)
- Overlapping windows
Embedding: Convert chunks to vectors
- Models: OpenAI embeddings, sentence-transformers
- Dimension: typically 384-1536
Vector Store: Efficient similarity search
- Pinecone, Weaviate, Chroma, FAISS, Qdrant
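The indexing phase can be sketched end to end in a few lines. This is a toy illustration, not any library's API: `embed()` is a deterministic hash-based stand-in for a real embedding model, and a plain Python list stands in for the vector store.

```python
import hashlib

def embed(text: str, dim: int = 8) -> list[float]:
    """Toy deterministic 'embedding': hashed bytes scaled to [0, 1).
    A real system would call an embedding model here instead."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:dim]]

def chunk(document: str, size: int = 40) -> list[str]:
    """Fixed-size character chunks; production systems chunk by tokens."""
    return [document[i:i + size] for i in range(0, len(document), size)]

# "Vector store": a list of (vector, text) pairs standing in for
# Pinecone / Chroma / FAISS.
index: list[tuple[list[float], str]] = []
for doc in ["RAG combines retrieval with generation.",
            "Embeddings map text to dense vectors."]:
    for piece in chunk(doc):
        index.append((embed(piece), piece))
```

The same `index` structure is what the retrieval phase searches over.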
2. Retrieval Phase
Query → Embed → Search vector DB → Top-k chunks
Similarity Search: Find most relevant chunks
- Cosine similarity
- Approximate nearest neighbors (ANN)
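A minimal exact similarity search over the same `(vector, text)` pairs looks like this; ANN indexes (HNSW, IVF) approximate this brute-force scan to stay fast on large collections:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: list[float], index: list[tuple[list[float], str]],
          k: int = 3) -> list[str]:
    # Brute-force scan: score every stored chunk, keep the k best.
    ranked = sorted(index, key=lambda pair: cosine(query_vec, pair[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]
```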
3. Generation Phase
[System prompt] + [Retrieved context] + [Query] → LLM → Answer
The LLM generates an answer grounded in retrieved context.
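Assembling that prompt is straightforward string formatting. A minimal sketch (the exact template is an assumption, and varies by model and framework); numbering the chunks lets the model cite sources:

```python
def build_prompt(system: str, chunks: list[str], query: str) -> str:
    # Number each chunk so the model can cite sources as [1], [2], ...
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (f"{system}\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {query}\n"
            f"Answer (cite sources by number):")
```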
Retrieval Strategies
Dense Retrieval
Embed query and documents, find similar vectors.
Pros: Captures semantic meaning
Cons: Needs good embeddings, compute-intensive
Sparse Retrieval (BM25)
Keyword-based matching.
Pros: Fast, no training needed
Cons: Misses synonyms, semantic relationships
Hybrid Retrieval
Combine dense and sparse:
final_score = α × dense_score + (1-α) × sparse_score
Often the best approach!
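One practical detail: dense and sparse scores live on different scales, so they should be normalized before mixing. A sketch of the weighted combination above, using min-max normalization (one common choice among several):

```python
def minmax(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so dense and sparse are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid(dense: dict[str, float], sparse: dict[str, float],
           alpha: float = 0.5) -> dict[str, float]:
    # final_score = alpha * dense + (1 - alpha) * sparse, over normalized scores
    d, s = minmax(dense), minmax(sparse)
    return {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
            for doc in set(d) | set(s)}
```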
Re-ranking
Retrieve many candidates, then re-rank with a more powerful model:
Query → BM25 (top 100) → Re-ranker (top 10) → LLM
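The re-ranking stage can be sketched as below. The scorer here is a toy term-overlap heuristic standing in for a real cross-encoder, which would jointly encode each (query, passage) pair:

```python
def rerank(query: str, candidates: list[str], top_n: int = 10) -> list[str]:
    # Toy relevance score: count of query terms appearing in the passage.
    # A real re-ranker would run a cross-encoder model here.
    terms = set(query.lower().split())
    def score(passage: str) -> int:
        return len(terms & set(passage.lower().split()))
    return sorted(candidates, key=score, reverse=True)[:top_n]
```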
Chunking Strategies
Fixed-Size Chunks
Split every N tokens, optional overlap
Simple but may break mid-sentence.
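A sliding-window chunker over a token list is a few lines; the step size is `size - overlap`, so each chunk repeats the tail of the previous one:

```python
def fixed_chunks(tokens: list, size: int = 512,
                 overlap: int = 64) -> list[list]:
    """Slide a window of `size` tokens, stepping by size - overlap."""
    assert 0 <= overlap < size
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covers the end
    return chunks
```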
Recursive Splitting
Split by paragraphs, then sentences, then characters
Respects document structure.
Semantic Chunking
Split when embedding similarity drops
Keeps related content together.
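The idea can be sketched with a cheap stand-in: word overlap substitutes for embedding cosine similarity in this toy version, and a new chunk starts whenever similarity between adjacent sentences drops below a threshold.

```python
def word_overlap(a: str, b: str) -> float:
    """Jaccard word overlap: a stand-in for embedding cosine similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    # Start a new chunk whenever similarity to the previous sentence drops.
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if word_overlap(prev, cur) >= threshold:
            chunks[-1].append(cur)
        else:
            chunks.append([cur])
    return [" ".join(c) for c in chunks]
```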
Document-Specific
- Code: Split by functions/classes
- HTML: Split by sections/headings
- Tables: Keep rows together
Advanced RAG Techniques
Query Transformation
Rewrite query for better retrieval:
"What's the latest iPhone?" → "iPhone 15 Pro specifications features 2024"
HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer, embed that:
Query → LLM generates fake answer → Embed fake answer → Search
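Structurally, HyDE is a small change to the retrieval call. In this sketch `generate`, `embed`, and `search` are injected placeholders for an LLM call, an embedding model, and a vector-store lookup:

```python
def hyde_retrieve(query: str, generate, embed, search):
    """HyDE: embed a hypothetical answer instead of the query itself.

    The intuition: a fake answer is closer in embedding space to real
    answer passages than the question is."""
    hypothetical_answer = generate(query)
    return search(embed(hypothetical_answer))
```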
Multi-Query RAG
Generate multiple query variations:
Original → [Query 1, Query 2, Query 3] → Retrieve for each → Merge results
Self-RAG
Model decides when to retrieve and validates retrieved info.
Agentic RAG
Multi-step retrieval with reasoning:
Step 1: Retrieve, realize need more info
Step 2: Formulate new query, retrieve again
Step 3: Synthesize all retrieved info
Evaluation
Retrieval Metrics
- Precision@k: Relevant docs in top k
- Recall@k: Found relevant docs / all relevant docs
- MRR: Mean reciprocal rank, the average over queries of 1/(rank of the first relevant result)
- NDCG: Normalized discounted cumulative gain, which rewards relevant results ranked higher
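The first three retrieval metrics are simple to compute for a single query (MRR averages the reciprocal rank over a whole query set; one query is shown here):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result; MRR averages this per query."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0
```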
Generation Metrics
- Faithfulness: Is answer supported by context?
- Relevance: Does answer address the question?
- Completeness: Are all aspects covered?
Tools
- RAGAS (RAG Assessment)
- TruLens
- LangSmith
Common Pitfalls
1. Poor Chunking
Chunks that are too small lose context; chunks that are too large dilute relevance.
2. Wrong Embedding Model
General embeddings may fail for domain-specific content.
3. Ignoring Retrieved Context
LLM may rely on parametric knowledge instead of context.
4. No Re-ranking
Top semantic results aren't always most relevant.
5. Missing Metadata
Filters (date, source, type) are often crucial for good retrieval.
RAG vs Fine-tuning
| Aspect | RAG | Fine-tuning |
|---|---|---|
| Knowledge updates | Easy (update DB) | Requires retraining |
| Attribution | Yes (sources) | No |
| Hallucination | Reduced | Still possible |
| Cost | Inference overhead | Training cost |
| Best for | Facts, docs | Style, behavior |
Best practice: Use both! Fine-tune for style, RAG for facts.
Key Takeaways
- RAG grounds LLMs in retrieved documents
- Architecture: Index → Retrieve → Generate
- Hybrid retrieval often works best
- Chunking strategy significantly impacts quality
- Advanced techniques: query rewriting, re-ranking, multi-step
- Evaluate both retrieval and generation quality