Understand LLM context windows - the maximum amount of text a model can process at once, and strategies for working within these limits.

Context Window

The context window (or context length) is the maximum number of tokens an LLM can process in a single forward pass, including both input and output.

What is a Context Window?

┌──────────────────────────────────────────────────┐
│              Context Window (e.g., 8K tokens)     │
│                                                   │
│  ┌────────────────────┐  ┌────────────────────┐  │
│  │   Input Tokens     │  │   Output Tokens    │  │
│  │   (your prompt)    │  │   (model response) │  │
│  │                    │  │                    │  │
│  │   System prompt    │  │   Generated text   │  │
│  │   + User message   │  │   until max_tokens │  │
│  │   + Context        │  │   or EOS           │  │
│  └────────────────────┘  └────────────────────┘  │
│                                                   │
│         Total ≤ Context Window Size               │
└──────────────────────────────────────────────────┘

Context Lengths by Model

Model       Context Window
GPT-3.5     4K / 16K tokens
GPT-4       8K / 32K / 128K tokens
Claude 3    200K tokens
Llama 2     4K tokens
Llama 3     8K tokens
Mistral     8K / 32K tokens
Gemini 1.5  1M tokens

Why Context Window Matters

1. Information Retention

Short context: Model "forgets" earlier parts of long documents
Long context: Can process entire books, codebases, conversations

2. Use Cases

Context Size  Enables
4K            Short conversations, simple Q&A
32K           Long documents, detailed analysis
128K          Multiple documents, extended conversations
1M+           Entire codebases, book-length content

Token Counting

Rough Estimates

1 token ≈ 4 characters (English)
1 token ≈ 0.75 words
1 page ≈ 500 tokens
1 book ≈ 100,000 tokens
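These rules of thumb can be turned into a quick estimator. This is an illustrative helper, not a real tokenizer; use tiktoken (below) when you need exact counts:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def estimate_tokens_by_words(text: str) -> int:
    """Alternative estimate: 1 token ~= 0.75 words, so tokens ~= words / 0.75."""
    return max(1, round(len(text.split()) / 0.75))

print(estimate_tokens("Hello, world!"))           # character rule
print(estimate_tokens_by_words("Hello, world!"))  # word rule
```

Both rules land close to the true count for typical English prose, but diverge badly for code, non-English text, or unusual formatting.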

Accurate Counting

import tiktoken

# For OpenAI models
encoder = tiktoken.encoding_for_model("gpt-4")
tokens = encoder.encode("Hello, world!")
print(f"Token count: {len(tokens)}")  # 4 tokens

# For Hugging Face models
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokens = tokenizer.encode("Hello, world!")
print(f"Token count: {len(tokens)}")

Challenges with Long Contexts

1. Lost in the Middle

Models often struggle to use information in the middle of long contexts:

┌─────────────────────────────────────────────┐
│ Beginning    │    Middle    │      End      │
│   (strong)   │    (weak)    │   (strong)    │
│              │              │               │
│ ████████████ │ ░░░░░░░░░░░ │ ████████████  │
│   Attention  │  Attention   │   Attention   │
└─────────────────────────────────────────────┘

Solution: Place important information at the beginning or end.
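One way to apply this is at prompt-assembly time: pin the instruction to the start and the question to the end, letting supporting material sit in the middle. A minimal sketch (the helper name and chunk ordering are illustrative, not a library API):

```python
def assemble_prompt(instruction: str, chunks: list[str], question: str) -> str:
    """Put the instruction first and the question last, so the most important
    pieces sit at the strongly-attended edges of the context window.
    Supporting chunks go in the weakly-attended middle."""
    middle = "\n\n".join(chunks)
    return f"{instruction}\n\n{middle}\n\nQuestion: {question}"

p = assemble_prompt(
    "Answer using only the excerpts below.",
    ["Excerpt A ...", "Excerpt B ..."],
    "What does excerpt A say?",
)
```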

2. Computational Cost

Transformer attention: O(n²) complexity

  4K context → 16M operations
 32K context → 1B operations
128K context → 16B operations
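The figures above follow directly from the quadratic cost: attention compares every token with every other token, so doubling the context quadruples the work. A quick check (taking 4K to mean 4096 tokens):

```python
def attention_ops(n: int) -> int:
    """Number of pairwise attention scores for a context of n tokens: n * n."""
    return n * n

for n in (4_096, 32_768, 131_072):
    print(f"{n:>7} tokens -> {attention_ops(n):>15,} attention scores")
```

Going from 4K to 32K is an 8x longer context but 64x more attention work, which is why long-context models lean on the sparse-attention tricks described later.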

3. Quality Degradation

Longer contexts can lead to:

  • Less focused responses
  • More hallucinations
  • Slower inference

Strategies for Limited Context

1. Chunking

def chunk_document(text, chunk_size=1000, overlap=100):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap  # Overlap for continuity across chunk boundaries
    return chunks

# Process each chunk separately
chunks = chunk_document(document)
for chunk in chunks:
    response = model.generate(chunk + question)

2. Summarization

def recursive_summarize(long_text, model, target_tokens=2000):
    if count_tokens(long_text) <= target_tokens:
        return long_text
    
    # Split into chunks
    chunks = split_into_chunks(long_text)
    
    # Summarize each chunk
    summaries = [model.summarize(chunk) for chunk in chunks]
    
    # Recursively summarize summaries
    combined = "\n".join(summaries)
    return recursive_summarize(combined, model, target_tokens)

3. RAG (Retrieval Augmented Generation)

from sentence_transformers import SentenceTransformer
import faiss

# Index documents
embedder = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = embedder.encode(document_chunks)
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# Retrieve relevant chunks for query
def get_context(query, k=3):
    query_embedding = embedder.encode([query])
    distances, indices = index.search(query_embedding, k)
    return [document_chunks[i] for i in indices[0]]

# Only include relevant context in prompt
relevant_context = "\n\n".join(get_context(user_question))
prompt = f"Context: {relevant_context}\n\nQuestion: {user_question}"

4. Conversation Pruning

def manage_conversation(messages, max_tokens=4000):
    # Always keep system message and recent messages
    system_msg = messages[0]
    recent = messages[-4:]  # Last 2 exchanges
    
    total_tokens = count_tokens(system_msg) + sum(count_tokens(m) for m in recent)
    
    # Add older messages if space permits
    older_messages = []
    for msg in reversed(messages[1:-4]):
        msg_tokens = count_tokens(msg)
        if total_tokens + msg_tokens < max_tokens:
            older_messages.insert(0, msg)
            total_tokens += msg_tokens
        else:
            break
    
    return [system_msg] + older_messages + recent

Extended Context Techniques

1. Sliding Window Attention

Instead of full attention, attend to local window:

[tok1 tok2 tok3 tok4 tok5 tok6 tok7 tok8]
        └──────┴──────┘
          Window = 3
          
Each token attends to 3 neighbors only
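The diagram above corresponds to a banded attention mask. A toy sketch (real implementations bake this into the attention kernel rather than materializing a full mask):

```python
def sliding_window_mask(n: int, window: int) -> list[list[bool]]:
    """Boolean n x n mask: token i may attend to token j iff they are
    within window // 2 positions of each other (symmetric local window)."""
    half = window // 2
    return [[abs(i - j) <= half for j in range(n)] for i in range(n)]

mask = sliding_window_mask(8, 3)  # window of 3, as in the diagram
# Row 4 (tok5) attends only to tok4, tok5, tok6
```

This reduces attention cost from O(n²) to O(n * window), which is what makes very long contexts tractable.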

2. Sparse Attention

Combine local + global attention patterns:

█ ░ ░ █ ░ ░ █ ░  (global every 3rd)
█ █ ░ ░ ░ ░ ░ ░  (local first 2)
░ █ █ ░ ░ ░ ░ ░  (local sliding)
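The combined pattern can be sketched by OR-ing a local band with a set of global positions. This is a toy illustration of the idea (models such as Longformer implement it inside the attention kernel):

```python
def sparse_mask(n: int, local: int = 1, global_stride: int = 3) -> list[list[bool]]:
    """Token i attends to token j if they are within `local` positions
    (local band), or if either i or j is a global token (every
    `global_stride`-th position attends to, and is attended by, everyone)."""
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            local_ok = abs(i - j) <= local
            global_ok = i % global_stride == 0 or j % global_stride == 0
            mask[i][j] = local_ok or global_ok
    return mask

m = sparse_mask(8)
```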

3. Memory Layers

Memory tokens that persist across segments:

[MEMORY][segment 1] → process → update [MEMORY]
[MEMORY][segment 2] → process → update [MEMORY]
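A toy sketch of the recurrence: carry a small, bounded `memory` forward between segments. Here a trivial string truncation stands in for the learned compression a real memory-augmented model would perform:

```python
def summarize(segment: str) -> str:
    """Stand-in for a learned compression step: keep the first few words."""
    return " ".join(segment.split()[:3]) + " ..."

def process_with_memory(segments: list[str], memory_slots: int = 2) -> list[str]:
    memory: list[str] = []
    contexts = []
    for seg in segments:
        # The model would see [MEMORY][segment] at each step.
        context = " | ".join(memory) + " || " + seg if memory else seg
        contexts.append(context)
        memory.append(summarize(seg))   # update memory after each segment
        memory = memory[-memory_slots:]  # keep memory bounded
    return contexts

outs = process_with_memory(["first segment text here", "second segment follows now"])
```

Each segment is processed with a fixed-size memory prefix, so total cost grows linearly with document length instead of quadratically.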

Best Practices

1. Monitor Token Usage

def check_context_fit(prompt, max_context, reserved_output=500):
    prompt_tokens = count_tokens(prompt)
    available = max_context - reserved_output
    
    if prompt_tokens > available:
        raise ValueError(
            f"Prompt ({prompt_tokens}) exceeds available context ({available})"
        )
    
    return max_context - prompt_tokens  # Remaining for output

2. Structure Prompts Efficiently

# Inefficient
prompt = (f"Here is a very long document: {full_document}. "
          f"Please answer: {question}")

# Efficient  
prompt = f"""Question: {question}

Relevant excerpts:
{relevant_excerpts}

Answer based only on the excerpts above."""

3. Use Appropriate Models

# Choose model based on task
if token_count < 4000:
    model = "gpt-3.5-turbo"  # Cheaper, faster
elif token_count < 32000:
    model = "gpt-4-32k"
else:
    model = "claude-3-opus"  # 200K context

Key Takeaways

  1. Context window = total tokens (input + output) model can handle
  2. Longer context enables more complex tasks but costs more
  3. Models struggle with information in the middle of long contexts
  4. Use chunking, summarization, or RAG for documents exceeding context
  5. Always reserve tokens for model output
  6. Choose model context size appropriate to your task