Temperature and Sampling
Sampling strategies control how language models select the next token. Understanding these parameters is essential for getting the right balance between creativity and reliability.
How LLMs Generate Text
LLMs output a probability distribution over vocabulary:
"The cat sat on the ___"
Probabilities:
mat: 0.35
floor: 0.25
couch: 0.15
roof: 0.10
table: 0.08
...
Sampling decides which token to pick.
Temperature
Temperature scales the logits before softmax:
P(token) = softmax(logits / T)
Effect of Temperature
| T | Effect | Use Case |
|---|---|---|
| 0 | Greedy (always pick highest prob) | Factual, deterministic |
| 0.1-0.5 | Low randomness | Code, math, factual |
| 0.7-0.9 | Balanced | General conversation |
| 1.0 | Original distribution | Creative writing |
| >1.0 | More random | Brainstorming, diversity |
Mathematical View
(Values below are illustrative and show only the top three tokens.)
Low T (0.1): [0.35, 0.25, 0.15] → [0.95, 0.04, 0.01]
(very peaked)
T = 1.0: [0.35, 0.25, 0.15] → [0.35, 0.25, 0.15]
(unchanged)
High T (2.0): [0.35, 0.25, 0.15] → [0.40, 0.32, 0.28]
(flattened)
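The rescaling above can be sketched in a few lines of Python. To make the numbers well-defined, the five example probabilities are extended with an assumed 0.07 "everything else" bucket so the distribution sums to 1; the exact outputs therefore differ slightly from the rounded figures shown above.

```python
import math

def apply_temperature(probs, T):
    """Rescale a probability distribution with temperature T.

    Converts probabilities back to log space, divides by T, and
    re-applies softmax. T < 1 sharpens the peak; T > 1 flattens it.
    """
    logits = [math.log(p) / T for p in probs]
    m = max(logits)                          # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Top five tokens from the example, plus an assumed 0.07 tail
# so the distribution sums to 1.
probs = [0.35, 0.25, 0.15, 0.10, 0.08, 0.07]
print(apply_temperature(probs, 0.1))  # sharply peaked on the first token
print(apply_temperature(probs, 1.0))  # unchanged
print(apply_temperature(probs, 2.0))  # flattened toward uniform
```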
Top-K Sampling
Only consider the top k most likely tokens:
Vocabulary: 50,000 tokens
Top-k = 40: Only sample from 40 highest probability tokens
Process:
- Sort tokens by probability
- Keep only top k
- Renormalize probabilities
- Sample from reduced set
Why:
- Prevents sampling very unlikely tokens
- Reduces nonsensical outputs
- Slightly faster sampling from the reduced set
Downside:
- Fixed k regardless of distribution shape
- May cut off good tokens or include bad ones
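The four steps above can be sketched directly (a minimal reference implementation, not how production inference servers do it):

```python
import random

def top_k_sample(probs, k, rng=random):
    """Sample a token index after top-k filtering.

    probs: probability for every token in the vocabulary.
    Sorts tokens by probability, keeps the top k, renormalizes,
    and samples from the reduced set.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = ranked[:k]                              # top k token indices
    total = sum(probs[i] for i in kept)
    weights = [probs[i] / total for i in kept]     # renormalize
    return rng.choices(kept, weights=weights, k=1)[0]

# With k=1 this reduces to greedy decoding:
print(top_k_sample([0.35, 0.25, 0.15, 0.10, 0.08], k=1))  # → 0
```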
Top-P (Nucleus) Sampling
Sample from smallest set of tokens whose cumulative probability exceeds p:
Tokens: mat floor couch roof table ...
Probs: 0.35 0.25 0.15 0.10 0.08 ...
Cumulative: 0.35 0.60 0.75 0.85 0.93 ...
Top-p = 0.9: {mat, floor, couch, roof} sums to 0.85, still below 0.9
Adding table brings the cumulative sum to 0.93 ≥ 0.9, so it is included
Final set: {mat, floor, couch, roof, table}
Why:
- Adapts to distribution shape
- Confident predictions → fewer tokens
- Uncertain → more tokens
- Often better than fixed top-k
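Nucleus sampling differs from top-k only in where the cutoff falls; a minimal sketch:

```python
import random

def top_p_sample(probs, p, rng=random):
    """Sample a token index after nucleus (top-p) filtering.

    Keeps the smallest set of highest-probability tokens whose
    cumulative probability reaches p, renormalizes, and samples.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:      # nucleus now covers probability mass p
            break
    total = sum(probs[i] for i in kept)
    weights = [probs[i] / total for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

# The worked example: with p=0.9, only the five nucleus tokens
# (indices 0-4) can ever be sampled; the 0.07 tail never is.
probs = [0.35, 0.25, 0.15, 0.10, 0.08, 0.07]
print(top_p_sample(probs, p=0.9))
```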
Combining Strategies
Most APIs let you combine:
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[...],
    temperature=0.7,  # Scale distribution
    top_p=0.9,        # Nucleus sampling
    # Note: OpenAI recommends changing one, not both
)
Order of Operations
Logits → Temperature scaling → Top-k filter → Top-p filter → Sample
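The pipeline above can be sketched end to end. This is a simplified model of what an inference server does per token; the exact filter order can vary between implementations.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Temperature scaling → top-k filter → top-p filter → sample."""
    # Temperature 0 is conventionally treated as greedy argmax
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # 1. Temperature-scaled softmax
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 2. Rank tokens, most probable first
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # 3. Top-k: keep at most k candidates
    if top_k is not None:
        ranked = ranked[:top_k]
    # 4. Top-p: keep the smallest prefix covering probability mass p
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept
    # 5. Renormalize the survivors and sample
    mass = sum(probs[i] for i in ranked)
    return rng.choices(ranked, weights=[probs[i] / mass for i in ranked], k=1)[0]
```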
Other Sampling Methods
Greedy Decoding
Always pick argmax(probability)
- Deterministic
- Fast
- Often repetitive and boring
Beam Search
Maintain top-n sequences, extend each, keep best n overall
- Better than greedy for translation/summarization
- Can still be repetitive
- Computationally expensive
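A minimal beam search over a toy bigram table. The table and its probabilities are invented for illustration; a real decoder would query the model's log-probabilities at each step.

```python
import math

def beam_search(next_logprobs, start, length, beam_width):
    """Toy beam search over a tabulated token model.

    next_logprobs(token) returns a dict {next_token: log_prob}.
    At each step, every beam is extended by every continuation,
    then only the beam_width highest-scoring sequences survive.
    """
    beams = [([start], 0.0)]  # (sequence, cumulative log-prob)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_logprobs(seq[-1]).items():
                candidates.append((seq + [tok], score + lp))
        # keep the best beam_width continuations overall
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

# Hypothetical bigram model for illustration
TABLE = {
    "the": {"cat": math.log(0.6), "dog": math.log(0.4)},
    "cat": {"sat": math.log(0.9), "ran": math.log(0.1)},
    "dog": {"ran": math.log(0.8), "sat": math.log(0.2)},
    "sat": {"down": math.log(1.0)},
    "ran": {"off": math.log(1.0)},
}
best_seq, best_score = beam_search(lambda t: TABLE.get(t, {}), "the", 3, beam_width=2)[0]
print(best_seq)  # → ['the', 'cat', 'sat', 'down']
```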
Contrastive Search
Balance probability with distinctiveness from previous tokens
- Reduces repetition
- Newer method
Typical Sampling
Sample from tokens with probability close to expected information content
- Avoids both too-common and too-rare tokens
Repetition Penalties
Separate from sampling, but related:
Frequency Penalty
Subtracts from a token's logit in proportion to how often it has already appeared:
logit(token) = logit(token) - frequency_penalty × count(token)
Presence Penalty
A flat, one-time penalty for any token that has appeared at all:
logit(token) = logit(token) - presence_penalty   (if count(token) > 0)
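Both penalties can be sketched as a single logit adjustment, modeled on the OpenAI-style formulation (function and parameter names here are illustrative):

```python
from collections import Counter

def apply_penalties(logits, generated, frequency_penalty=0.0, presence_penalty=0.0):
    """Penalize logits of tokens that already appear in the output.

    frequency_penalty is subtracted once per occurrence;
    presence_penalty is subtracted once if the token appeared at all.
    """
    counts = Counter(generated)          # token index → occurrence count
    adjusted = list(logits)
    for token, n in counts.items():
        adjusted[token] -= frequency_penalty * n + presence_penalty
    return adjusted

# Token 0 appeared twice, token 1 once; token 2 is untouched.
print(apply_penalties([2.0, 1.0, 0.0], [0, 0, 1],
                      frequency_penalty=0.5, presence_penalty=0.25))
# → [0.75, 0.25, 0.0]
```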
Practical Guidelines
For Different Tasks
| Task | Temperature | Top-p | Notes |
|---|---|---|---|
| Code generation | 0-0.2 | 0.1 | Deterministic, correct |
| Math/reasoning | 0-0.3 | 0.1-0.5 | Consistent logic |
| Factual Q&A | 0.3-0.5 | 0.5-0.8 | Accurate but natural |
| Conversation | 0.7-0.9 | 0.9 | Natural, varied |
| Creative writing | 0.8-1.2 | 0.95 | Diverse, surprising |
| Brainstorming | 1.0-1.5 | 1.0 | Maximum diversity |
Common Combinations
# Deterministic (testing, code)
temperature=0
# Balanced (general use)
temperature=0.7, top_p=0.9
# Creative
temperature=1.0, top_p=0.95
# Reduce repetition in long outputs
frequency_penalty=0.5, presence_penalty=0.5
Temperature = 0 Gotcha
Even with T=0, outputs may vary:
- Floating point non-determinism
- GPU parallelism effects
- Some APIs add small noise
For true determinism:
seed=42 # If supported
temperature=0
Debugging Poor Outputs
Too Repetitive?
- Increase temperature (0.7 → 0.9)
- Add frequency/presence penalty
- Try typical sampling
Too Random/Nonsensical?
- Decrease temperature
- Lower top-p (0.95 → 0.8)
- Add top-k constraint
Good But Different Each Time?
- Lower temperature
- Use seed if available
Key Takeaways
- Temperature controls distribution peakedness (creativity vs determinism)
- Top-k limits to k most likely tokens
- Top-p (nucleus) adapts to distribution shape
- Lower values = more deterministic, higher = more creative
- Match settings to your task requirements
- Repetition penalties help with long-form generation