Temperature and Sampling
Sampling strategies control how language models select the next token. Understanding these parameters is essential for getting the right balance between creativity and reliability.
How LLMs Generate Text
LLMs output a probability distribution over vocabulary:
"The cat sat on the ___"
Probabilities:
mat: 0.35
floor: 0.25
couch: 0.15
roof: 0.10
table: 0.08
...
Sampling decides which token to pick.
Temperature
Temperature scales the logits before softmax:
P(token) = softmax(logits / T)
Effect of Temperature
| T | Effect | Use Case |
|---|---|---|
| 0 | Greedy (always pick highest prob) | Factual, deterministic |
| 0.1-0.5 | Low randomness | Code, math, factual |
| 0.7-0.9 | Balanced | General conversation |
| 1.0 | Original distribution | Creative writing |
| >1.0 | More random | Brainstorming, diversity |
Mathematical View
(Values below are illustrative and show only the top three tokens.)
Low T (0.1): [0.35, 0.25, 0.15] → [0.95, 0.04, 0.01]
(very peaked)
T = 1.0: [0.35, 0.25, 0.15] → [0.35, 0.25, 0.15]
(unchanged)
High T (2.0): [0.35, 0.25, 0.15] → [0.40, 0.32, 0.28]
(flattened)
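The rescaling above can be sketched in a few lines of Python. To make the numbers well-defined, the five example probabilities are extended with an assumed 0.07 "everything else" bucket so the distribution sums to 1; the exact outputs therefore differ slightly from the rounded figures shown above.

```python
import math

def apply_temperature(probs, T):
    """Rescale a probability distribution with temperature T.

    Converts probabilities back to log space, divides by T, and
    re-applies softmax. T < 1 sharpens the peak; T > 1 flattens it.
    """
    logits = [math.log(p) / T for p in probs]
    m = max(logits)                          # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Top five tokens from the example, plus an assumed 0.07 tail
# so the distribution sums to 1.
probs = [0.35, 0.25, 0.15, 0.10, 0.08, 0.07]
print(apply_temperature(probs, 0.1))  # sharply peaked on the first token
print(apply_temperature(probs, 1.0))  # unchanged
print(apply_temperature(probs, 2.0))  # flattened toward uniform
```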
Top-K Sampling
Only consider the top k most likely tokens:
Vocabulary: 50,000 tokens
Top-k = 40: Only sample from 40 highest probability tokens
Process:
- Sort tokens by probability
- Keep only top k
- Renormalize probabilities
- Sample from reduced set
Why:
- Prevents sampling very unlikely tokens
- Reduces nonsensical outputs
- Slightly faster sampling from the reduced set
Downside:
- Fixed k regardless of distribution shape
- May cut off good tokens or include bad ones
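The four steps above can be sketched directly (a minimal reference implementation, not how production inference servers do it):

```python
import random

def top_k_sample(probs, k, rng=random):
    """Sample a token index after top-k filtering.

    probs: probability for every token in the vocabulary.
    Sorts tokens by probability, keeps the top k, renormalizes,
    and samples from the reduced set.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = ranked[:k]                              # top k token indices
    total = sum(probs[i] for i in kept)
    weights = [probs[i] / total for i in kept]     # renormalize
    return rng.choices(kept, weights=weights, k=1)[0]

# With k=1 this reduces to greedy decoding:
print(top_k_sample([0.35, 0.25, 0.15, 0.10, 0.08], k=1))  # → 0
```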
Top-P (Nucleus) Sampling
Sample from smallest set of tokens whose cumulative probability exceeds p:
Tokens: mat floor couch roof table ...
Probs: 0.35 0.25 0.15 0.10 0.08 ...
Cumulative: 0.35 0.60 0.75 0.85 0.93 ...
Top-p = 0.9: {mat, floor, couch, roof} sums to 0.85, still below 0.9
Adding table brings the cumulative sum to 0.93 ≥ 0.9, so it is included
Final set: {mat, floor, couch, roof, table}
Why:
- Adapts to distribution shape
- Confident predictions → fewer tokens
- Uncertain → more tokens
- Often better than fixed top-k
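Nucleus sampling differs from top-k only in where the cutoff falls; a minimal sketch:

```python
import random

def top_p_sample(probs, p, rng=random):
    """Sample a token index after nucleus (top-p) filtering.

    Keeps the smallest set of highest-probability tokens whose
    cumulative probability reaches p, renormalizes, and samples.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:      # nucleus now covers probability mass p
            break
    total = sum(probs[i] for i in kept)
    weights = [probs[i] / total for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

# The worked example: with p=0.9, only the five nucleus tokens
# (indices 0-4) can ever be sampled; the 0.07 tail never is.
probs = [0.35, 0.25, 0.15, 0.10, 0.08, 0.07]
print(top_p_sample(probs, p=0.9))
```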
Combining Strategies
Most APIs let you combine:
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[...],
    temperature=0.7,  # Scale distribution
    top_p=0.9,        # Nucleus sampling
    # Note: OpenAI recommends changing one, not both
)
Order of Operations
Logits → Temperature scaling → Top-k filter → Top-p filter → Sample
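The pipeline above can be sketched end to end. This is a simplified model of what an inference server does per token; the exact filter order can vary between implementations.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Temperature scaling → top-k filter → top-p filter → sample."""
    # Temperature 0 is conventionally treated as greedy argmax
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # 1. Temperature-scaled softmax
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 2. Rank tokens, most probable first
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # 3. Top-k: keep at most k candidates
    if top_k is not None:
        ranked = ranked[:top_k]
    # 4. Top-p: keep the smallest prefix covering probability mass p
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept
    # 5. Renormalize the survivors and sample
    mass = sum(probs[i] for i in ranked)
    return rng.choices(ranked, weights=[probs[i] / mass for i in ranked], k=1)[0]
```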
Other Sampling Methods
Greedy Decoding
Always pick argmax(probability)
- Deterministic
- Fast
- Often repetitive and boring
Beam Search
Maintain top-n sequences, extend each, keep best n overall
- Better than greedy for translation/summarization
- Can still be repetitive
- Computationally expensive
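A minimal beam search over a toy bigram table. The table and its probabilities are invented for illustration; a real decoder would query the model's log-probabilities at each step.

```python
import math

def beam_search(next_logprobs, start, length, beam_width):
    """Toy beam search over a tabulated token model.

    next_logprobs(token) returns a dict {next_token: log_prob}.
    At each step, every beam is extended by every continuation,
    then only the beam_width highest-scoring sequences survive.
    """
    beams = [([start], 0.0)]  # (sequence, cumulative log-prob)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_logprobs(seq[-1]).items():
                candidates.append((seq + [tok], score + lp))
        # keep the best beam_width continuations overall
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

# Hypothetical bigram model for illustration
TABLE = {
    "the": {"cat": math.log(0.6), "dog": math.log(0.4)},
    "cat": {"sat": math.log(0.9), "ran": math.log(0.1)},
    "dog": {"ran": math.log(0.8), "sat": math.log(0.2)},
    "sat": {"down": math.log(1.0)},
    "ran": {"off": math.log(1.0)},
}
best_seq, best_score = beam_search(lambda t: TABLE.get(t, {}), "the", 3, beam_width=2)[0]
print(best_seq)  # → ['the', 'cat', 'sat', 'down']
```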
Contrastive Search
Balance probability with distinctiveness from previous tokens
- Reduces repetition
- Newer method
Typical Sampling
Sample from tokens with probability close to expected information content
- Avoids both too-common and too-rare tokens
Repetition Penalties
Separate from sampling, but related:
Frequency Penalty
Subtracts from a token's logit in proportion to how often it has already appeared:
logit(token) = logit(token) - frequency_penalty × count(token)
Presence Penalty
A flat, one-time penalty for any token that has appeared at all:
logit(token) = logit(token) - presence_penalty   (if count(token) > 0)
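Both penalties can be sketched as a single logit adjustment, modeled on the OpenAI-style formulation (function and parameter names here are illustrative):

```python
from collections import Counter

def apply_penalties(logits, generated, frequency_penalty=0.0, presence_penalty=0.0):
    """Penalize logits of tokens that already appear in the output.

    frequency_penalty is subtracted once per occurrence;
    presence_penalty is subtracted once if the token appeared at all.
    """
    counts = Counter(generated)          # token index → occurrence count
    adjusted = list(logits)
    for token, n in counts.items():
        adjusted[token] -= frequency_penalty * n + presence_penalty
    return adjusted

# Token 0 appeared twice, token 1 once; token 2 is untouched.
print(apply_penalties([2.0, 1.0, 0.0], [0, 0, 1],
                      frequency_penalty=0.5, presence_penalty=0.25))
# → [0.75, 0.25, 0.0]
```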
Practical Guidelines
For Different Tasks
| Task | Temperature | Top-p | Notes |
|---|---|---|---|
| Code generation | 0-0.2 | 0.1 | Deterministic, correct |
| Math/reasoning | 0-0.3 | 0.1-0.5 | Consistent logic |
| Factual Q&A | 0.3-0.5 | 0.5-0.8 | Accurate but natural |
| Conversation | 0.7-0.9 | 0.9 | Natural, varied |
| Creative writing | 0.8-1.2 | 0.95 | Diverse, surprising |
| Brainstorming | 1.0-1.5 | 1.0 | Maximum diversity |
Common Combinations
# Deterministic (testing, code)
temperature=0
# Balanced (general use)
temperature=0.7, top_p=0.9
# Creative
temperature=1.0, top_p=0.95
# Reduce repetition in long outputs
frequency_penalty=0.5, presence_penalty=0.5
Temperature = 0 Gotcha
Even with T=0, outputs may vary:
- Floating point non-determinism
- GPU parallelism effects
- Some APIs add small noise
For true determinism:
seed=42 # If supported
temperature=0
Debugging Poor Outputs
Too Repetitive?
- Increase temperature (0.7 → 0.9)
- Add frequency/presence penalty
- Try typical sampling
Too Random/Nonsensical?
- Decrease temperature
- Lower top-p (0.95 → 0.8)
- Add top-k constraint
Good But Different Each Time?
- Lower temperature
- Use seed if available
Key Takeaways
- Temperature controls distribution peakedness (creativity vs determinism)
- Top-k limits to k most likely tokens
- Top-p (nucleus) adapts to distribution shape
- Lower values = more deterministic, higher = more creative
- Match settings to your task requirements
- Repetition penalties help with long-form generation