Context Window
The context window (or context length) is the maximum number of tokens an LLM can process in a single forward pass, including both input and output.
What Is a Context Window?
```
┌──────────────────────────────────────────────────┐
│ Context Window (e.g., 8K tokens)                 │
│                                                  │
│ ┌────────────────────┐ ┌────────────────────┐    │
│ │ Input Tokens       │ │ Output Tokens      │    │
│ │ (your prompt)      │ │ (model response)   │    │
│ │                    │ │                    │    │
│ │ System prompt      │ │ Generated text     │    │
│ │ + User message     │ │ until max_tokens   │    │
│ │ + Context          │ │ or EOS             │    │
│ └────────────────────┘ └────────────────────┘    │
│                                                  │
│ Total ≤ Context Window Size                      │
└──────────────────────────────────────────────────┘
```
Context Lengths by Model
| Model | Context Window |
|---|---|
| GPT-3.5 | 4K / 16K tokens |
| GPT-4 | 8K / 32K / 128K tokens |
| Claude 3 | 200K tokens |
| Llama 2 | 4K tokens |
| Llama 3 | 8K tokens |
| Mistral | 8K / 32K tokens |
| Gemini 1.5 | 1M tokens |
Why Context Window Matters
1. Information Retention
- Short context: the model "forgets" earlier parts of long documents
- Long context: can process entire books, codebases, and conversations
2. Use Cases
| Context Size | Enables |
|---|---|
| 4K | Short conversations, simple Q&A |
| 32K | Long documents, detailed analysis |
| 128K | Multiple documents, extended conversations |
| 1M+ | Entire codebases, book-length content |
Token Counting
Rough Estimates
- 1 token ≈ 4 characters (English)
- 1 token ≈ 0.75 words
- 1 page ≈ 500 tokens
- 1 book ≈ 100,000 tokens
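For a quick back-of-envelope check before reaching for a real tokenizer, the ~4-characters-per-token rule can be wrapped in a small helper (a heuristic only; actual counts vary by tokenizer and language, and the `estimate_tokens` name is illustrative):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters-per-token rule of thumb."""
    return max(1, len(text) // 4)

print(estimate_tokens("Hello, world!"))  # ~3 by this heuristic
```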
Accurate Counting
```python
import tiktoken

# For OpenAI models
encoder = tiktoken.encoding_for_model("gpt-4")
tokens = encoder.encode("Hello, world!")
print(f"Token count: {len(tokens)}")  # 4 tokens

# For Hugging Face models
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b")
tokens = tokenizer.encode("Hello, world!")
print(f"Token count: {len(tokens)}")
```
Challenges with Long Contexts
1. Lost in the Middle
Models often struggle to use information in the middle of long contexts:
```
┌─────────────────────────────────────────────┐
│ Beginning    │ Middle      │ End            │
│ (strong)     │ (weak)      │ (strong)       │
│              │             │                │
│ ████████████ │ ░░░░░░░░░░░ │ ████████████   │
│ Attention    │ Attention   │ Attention      │
└─────────────────────────────────────────────┘
```
Solution: Place important information at the beginning or end.
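One way to apply that advice is to state the question before the long context and restate it afterwards, so the key instruction sits in the well-attended beginning and end regions. A minimal sketch (the `build_prompt` helper and its argument names are hypothetical):

```python
def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Place the question at the start AND restate it at the end, keeping
    the critical instruction out of the weakly-attended middle region."""
    context = "\n\n".join(context_chunks)
    return (
        f"Question: {question}\n\n"
        f"Context:\n{context}\n\n"
        f"Reminder, answer this question: {question}"
    )
```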
2. Computational Cost
Transformer attention has O(n²) complexity in sequence length:

- 4K context → ~16M operations
- 32K context → ~1B operations
- 128K context → ~16B operations
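The quadratic growth above can be checked directly, under the simplified assumption of one operation per token pair (real attention adds constant factors for the feature dimension):

```python
def attention_ops(context_len: int) -> int:
    """Pairwise attention scores scale quadratically with sequence length."""
    return context_len * context_len

for n in (4_096, 32_768, 131_072):
    print(f"{n:>7} tokens -> {attention_ops(n):,} score computations")
```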
3. Quality Degradation
Longer contexts can lead to:
- Less focused responses
- More hallucinations
- Slower inference
Strategies for Limited Context
1. Chunking
```python
def chunk_document(text, chunk_size=1000, overlap=100):
    """Split text into overlapping chunks; the overlap preserves continuity."""
    assert overlap < chunk_size  # otherwise the loop never advances
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap  # Step back so consecutive chunks overlap
    return chunks

# Process each chunk separately
chunks = chunk_document(document)
for chunk in chunks:
    response = model.generate(chunk + question)
```
2. Summarization
```python
def recursive_summarize(long_text, model, target_tokens=2000):
    if count_tokens(long_text) <= target_tokens:
        return long_text
    # Split into chunks
    chunks = split_into_chunks(long_text)
    # Summarize each chunk
    summaries = [model.summarize(chunk) for chunk in chunks]
    # Recursively summarize the summaries
    combined = "\n".join(summaries)
    return recursive_summarize(combined, model, target_tokens)
```
3. RAG (Retrieval Augmented Generation)
```python
from sentence_transformers import SentenceTransformer
import faiss

# Index documents
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(document_chunks)
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# Retrieve the most relevant chunks for a query
def get_context(query, k=3):
    query_embedding = embedder.encode([query])
    distances, indices = index.search(query_embedding, k)
    return [document_chunks[i] for i in indices[0]]

# Only include relevant context in the prompt
relevant_context = "\n\n".join(get_context(user_question))
prompt = f"Context: {relevant_context}\n\nQuestion: {user_question}"
```
4. Conversation Pruning
```python
def manage_conversation(messages, max_tokens=4000):
    # Always keep the system message and the most recent messages
    system_msg = messages[0]
    recent = messages[-4:]  # Last 2 exchanges
    total_tokens = count_tokens(system_msg) + sum(count_tokens(m) for m in recent)
    # Re-add older messages (newest first) while space permits
    older_messages = []
    for msg in reversed(messages[1:-4]):
        msg_tokens = count_tokens(msg)
        if total_tokens + msg_tokens < max_tokens:
            older_messages.insert(0, msg)
            total_tokens += msg_tokens
        else:
            break
    return [system_msg] + older_messages + recent
```
Extended Context Techniques
1. Sliding Window Attention
Instead of full attention, each token attends only to a local window:

```
[tok1 tok2 tok3 tok4 tok5 tok6 tok7 tok8]
       └──────┴──────┘
         Window = 3
```

Each token attends to 3 neighbors only.
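A hedged sketch of that pattern as a NumPy boolean mask, where `True` means "may attend" (a real implementation would typically also apply a causal constraint; the window-of-3 default mirrors the diagram):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int = 3) -> np.ndarray:
    """Each position may attend only to `window` consecutive positions
    centered on itself (fewer at the sequence edges)."""
    half = window // 2
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= half

mask = sliding_window_mask(8)
```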
2. Sparse Attention
Combine local + global attention patterns:
```
█ ░ ░ █ ░ ░ █ ░   (global every 3rd)
█ █ ░ ░ ░ ░ ░ ░   (local first 2)
░ █ █ ░ ░ ░ ░ ░   (local sliding)
```
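Sketching the combined pattern as a mask (the `window` and `global_stride` parameters are illustrative choices, not a specific model's settings):

```python
import numpy as np

def sparse_mask(seq_len: int, window: int = 1, global_stride: int = 3) -> np.ndarray:
    """Local sliding window plus global tokens every `global_stride`
    positions; global tokens attend everywhere and are attended to by all."""
    idx = np.arange(seq_len)
    local = np.abs(idx[:, None] - idx[None, :]) <= window
    is_global = (idx % global_stride) == 0
    glob = is_global[:, None] | is_global[None, :]
    return local | glob
```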
3. Memory Layers
Memory tokens that persist across segments:
```
[MEMORY][segment 1] → process → update [MEMORY]
[MEMORY][segment 2] → process → update [MEMORY]
```
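A toy sketch of this segment-level recurrence, where `model` is a stand-in callable (not a real library API) that consumes the current memory plus one segment and returns an output and the updated memory:

```python
def process_with_memory(segments, model, memory):
    """Carry a fixed-size memory across segments: each step sees
    [memory; segment] and hands an updated memory to the next step."""
    outputs = []
    for segment in segments:
        output, memory = model(memory, segment)
        outputs.append(output)
    return outputs, memory
```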
Best Practices
1. Monitor Token Usage
```python
def check_context_fit(prompt, max_context, reserved_output=500):
    prompt_tokens = count_tokens(prompt)
    available = max_context - reserved_output
    if prompt_tokens > available:
        raise ValueError(
            f"Prompt ({prompt_tokens}) exceeds available context ({available})"
        )
    return max_context - prompt_tokens  # Tokens remaining for output
```
2. Structure Prompts Efficiently
```python
# Inefficient: the question is buried after a huge document
prompt = f"Here is a very long document: {full_document}. \
Please answer: {question}"

# Efficient: question up front, only relevant excerpts included
prompt = f"""Question: {question}

Relevant excerpts:
{relevant_excerpts}

Answer based only on the excerpts above."""
```
3. Use Appropriate Models
```python
# Choose a model based on prompt size
if token_count < 4_000:
    model = "gpt-3.5-turbo"  # Cheaper, faster
elif token_count < 32_000:
    model = "gpt-4-32k"
else:
    model = "claude-3-opus"  # 200K context
```
Key Takeaways
- Context window = total tokens (input + output) model can handle
- Longer context enables more complex tasks but costs more
- Models struggle with information in the middle of long contexts
- Use chunking, summarization, or RAG for documents exceeding context
- Always reserve tokens for model output
- Choose model context size appropriate to your task