Intermediate · LLMs & Generative AI

Learn about RAG - the technique that enhances LLMs with external knowledge by retrieving relevant documents at inference time.

Tags: rag, retrieval, llms, knowledge-bases, embeddings

Retrieval-Augmented Generation (RAG)

RAG combines retrieval systems with language models, allowing LLMs to access external knowledge beyond their training data. It's become essential for building practical AI applications.

The Problem with Pure LLMs

LLMs have limitations:

  • Knowledge cutoff: Only know what was in training data
  • Hallucinations: Confidently generate false information
  • No source attribution: Can't tell you where info came from
  • Static knowledge: Can't access real-time information
  • Domain gaps: May lack specialized knowledge

The RAG Solution

Combine retrieval with generation:

Query: "What were Q3 2024 earnings?"
         ↓
    [Retriever] → Find relevant documents
         ↓
    [Retrieved docs] + [Query]
         ↓
    [LLM generates answer with context]
         ↓
    Answer with citations

RAG Architecture

1. Indexing Phase (Offline)

Documents → Chunk → Embed → Store in Vector DB

Chunking: Split documents into manageable pieces

  • Fixed size (e.g., 512 tokens)
  • Semantic boundaries (paragraphs, sections)
  • Overlapping windows
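A minimal fixed-size chunker with overlapping windows might look like this. Whitespace tokens stand in for real model tokens here; a production pipeline would chunk with the embedding model's own tokenizer.

```python
def chunk_text(text, chunk_size=512, overlap=64):
    """Split text into fixed-size, overlapping chunks of whitespace tokens."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # advance by less than chunk_size to overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the tail
    return chunks
```

The overlap ensures that a sentence straddling a chunk boundary appears whole in at least one chunk.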

Embedding: Convert chunks to vectors

  • Models: OpenAI embeddings, sentence-transformers
  • Dimension: typically 384-1536

Vector Store: Efficient similarity search

  • Pinecone, Weaviate, Chroma, FAISS, Qdrant
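As a sketch of what these stores do under the hood, here is a toy in-memory version using exact cosine similarity. The class name and API are illustrative; real systems (FAISS, Qdrant, etc.) use ANN indexes rather than a brute-force scan.

```python
import math

class TinyVectorStore:
    """Toy in-memory vector store with exact cosine-similarity search."""

    def __init__(self):
        self._items = []  # list of (vector, chunk_text) pairs

    def add(self, vector, chunk):
        self._items.append((vector, chunk))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def search(self, query_vec, k=3):
        """Return the top-k (score, chunk) pairs by cosine similarity."""
        scored = [(self._cosine(query_vec, v), c) for v, c in self._items]
        scored.sort(key=lambda s: s[0], reverse=True)
        return scored[:k]
```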

2. Retrieval Phase

Query → Embed → Search vector DB → Top-k chunks

Similarity Search: Find most relevant chunks

  • Cosine similarity
  • Approximate nearest neighbors (ANN)

3. Generation Phase

[System prompt] + [Retrieved context] + [Query] → LLM → Answer

The LLM generates an answer grounded in retrieved context.
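A sketch of the prompt assembly for this phase; the exact prompt wording is an assumption, not a standard, but numbering the chunks is what makes citations like [2] possible.

```python
def build_rag_prompt(query, retrieved_chunks):
    """Assemble: system instructions + numbered context chunks + user query."""
    context = "\n\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return (
        "Answer using ONLY the context below. "
        "Cite sources as [n]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```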

Retrieval Strategies

Dense Retrieval

Embed query and documents, find similar vectors.

Pros: Captures semantic meaning
Cons: Needs good embeddings, compute-intensive

Sparse Retrieval (BM25)

Keyword-based matching.

Pros: Fast, no training needed
Cons: Misses synonyms, semantic relationships
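For reference, a compact implementation of Okapi BM25 scoring, using naive lowercase whitespace tokenization (a real system would use a proper analyzer):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter()  # document frequency per term
    for d in tokenized:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(score)
    return scores
```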

Hybrid Retrieval

Combine dense and sparse:

final_score = α × dense_score + (1-α) × sparse_score

Often the best approach!
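The combination formula above is a one-liner; the min-max normalization step here is an added assumption to put the two score scales (cosine vs. BM25) on comparable footing before weighting.

```python
def hybrid_scores(dense, sparse, alpha=0.5):
    """Combine per-doc dense and sparse scores: α·dense + (1-α)·sparse."""
    def norm(xs):
        # min-max normalize so both score lists live in [0, 1]
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    d, s = norm(dense), norm(sparse)
    return [alpha * dv + (1 - alpha) * sv for dv, sv in zip(d, s)]
```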

Re-ranking

Retrieve many candidates, then re-rank with a more powerful model:

Query → BM25 (top 100) → Re-ranker (top 10) → LLM

Chunking Strategies

Fixed-Size Chunks

Split every N tokens, optional overlap

Simple but may break mid-sentence.

Recursive Splitting

Split by paragraphs, then sentences, then characters

Respects document structure.
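A simplified recursive splitter in the spirit of this strategy; the separator list and the hard character-cut fallback are illustrative choices.

```python
def recursive_split(text, max_len=200, separators=("\n\n", "\n", ". ")):
    """Split on the coarsest separator present until every piece fits."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        if sep in text:
            pieces = []
            for part in text.split(sep):
                pieces.extend(recursive_split(part, max_len, separators))
            return [p for p in pieces if p.strip()]
    # no separator left: fall back to a hard character cut
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Paragraph breaks are tried first, so structure is preserved whenever the pieces fit.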

Semantic Chunking

Split when embedding similarity drops

Keeps related content together.

Document-Specific

  • Code: Split by functions/classes
  • HTML: Split by sections/headings
  • Tables: Keep rows together

Advanced RAG Techniques

Query Transformation

Rewrite query for better retrieval:

"What's the latest iPhone?" → "iPhone 15 Pro specifications features 2024"

HyDE (Hypothetical Document Embeddings)

Generate a hypothetical answer, embed that:

Query → LLM generates fake answer → Embed fake answer → Search

Multi-Query RAG

Generate multiple query variations:

Original → [Query 1, Query 2, Query 3] → Retrieve for each → Merge results
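One common way to merge the per-query result lists is reciprocal rank fusion (RRF), sketched here; k=60 is the conventionally used constant.

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked doc-id lists: each doc earns 1/(k + rank) per list."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear near the top of several lists outrank a document that tops only one.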

Self-RAG

Model decides when to retrieve and validates retrieved info.

Agentic RAG

Multi-step retrieval with reasoning:

Step 1: Retrieve, realize need more info
Step 2: Formulate new query, retrieve again
Step 3: Synthesize all retrieved info

Evaluation

Retrieval Metrics

  • Precision@k: Fraction of the top-k results that are relevant
  • Recall@k: Fraction of all relevant docs found in the top k
  • MRR: Mean reciprocal rank of the first relevant result
  • NDCG: Ranking quality, weighted by position
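These retrieval metrics are straightforward to compute for a single query; a minimal sketch (MRR would then be averaged over a query set):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs found in the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant result, or 0 if none appears."""
    for i, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0
```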

Generation Metrics

  • Faithfulness: Is answer supported by context?
  • Relevance: Does answer address the question?
  • Completeness: Are all aspects covered?

Tools

  • RAGAS (RAG Assessment)
  • TruLens
  • LangSmith

Common Pitfalls

1. Poor Chunking

Chunks that are too small lose context; chunks that are too large dilute relevance.

2. Wrong Embedding Model

General embeddings may fail for domain-specific content.

3. Ignoring Retrieved Context

LLM may rely on parametric knowledge instead of context.

4. No Re-ranking

The top results by embedding similarity aren't always the most relevant to the question.

5. Missing Metadata

Filters (date, source, type) often crucial for good retrieval.

RAG vs Fine-tuning

Aspect             RAG                  Fine-tuning
Knowledge updates  Easy (update DB)     Requires retraining
Attribution        Yes (sources)        No
Hallucination      Reduced              Still possible
Cost               Inference overhead   Training cost
Best for           Facts, docs          Style, behavior

Best practice: Use both! Fine-tune for style, RAG for facts.

Key Takeaways

  1. RAG grounds LLMs in retrieved documents
  2. Architecture: Index → Retrieve → Generate
  3. Hybrid retrieval often works best
  4. Chunking strategy significantly impacts quality
  5. Advanced techniques: query rewriting, re-ranking, multi-step
  6. Evaluate both retrieval and generation quality
