Beginner · LLMs & Generative AI

Understand temperature, top-k, top-p, and other sampling strategies that control the randomness and creativity of LLM outputs.

sampling · temperature · top-p · generation · inference

Temperature and Sampling

Sampling strategies control how language models select the next token. Understanding these parameters is essential for getting the right balance between creativity and reliability.

How LLMs Generate Text

At each step, an LLM outputs a probability distribution over the entire vocabulary:

"The cat sat on the ___"

Probabilities:
  mat:    0.35
  floor:  0.25
  couch:  0.15
  roof:   0.10
  table:  0.08
  ...

Sampling decides which token to pick.
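
This last step can be sketched in a few lines of Python. A toy illustration using the probabilities above; real models sample over the full vocabulary, and `sample_token` is an illustrative helper, not a library function:

```python
import random

# The model's distribution for the next token (illustrative numbers from the
# example above; real vocabularies have tens of thousands of entries).
probs = {"mat": 0.35, "floor": 0.25, "couch": 0.15, "roof": 0.10, "table": 0.08}

def sample_token(probs, rng=random):
    """Pick one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

token = sample_token(probs)  # usually "mat" or "floor", occasionally a rarer token
```

Every sampling strategy below is a different way of reshaping or truncating `probs` before this weighted draw.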

Temperature

Temperature scales the logits before softmax:

P(token) = softmax(logits / T)

Effect of Temperature

T          Effect                              Use Case
0          Greedy (always pick highest prob)   Factual, deterministic
0.1-0.5    Low randomness                      Code, math, factual
0.7-0.9    Balanced                            General conversation
1.0        Original distribution               Creative writing
>1.0       More random                         Brainstorming, diversity

Mathematical View

Low T (0.1):  [0.35, 0.25, 0.15] → [0.97, 0.03, 0.00]
                                    (very peaked)

T = 1.0:      [0.35, 0.25, 0.15] → [0.35, 0.25, 0.15]
                                    (unchanged)

High T (2.0): [0.35, 0.25, 0.15] → [0.40, 0.34, 0.26]
                                    (flattened)

(Probabilities renormalized over just these three tokens for illustration.)
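
The scaling can be sketched directly from the formula. A minimal Python illustration; the logit values are made up, and `apply_temperature` is an illustrative helper:

```python
import math

def softmax(logits):
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, T):
    """P(token) = softmax(logits / T)."""
    return softmax([x / T for x in logits])

logits = [2.0, 1.0, 0.5]                   # illustrative raw scores
low  = apply_temperature(logits, 0.1)      # very peaked on the top token
mid  = apply_temperature(logits, 1.0)      # the original softmax distribution
high = apply_temperature(logits, 2.0)      # flattened toward uniform
```

Dividing logits by a small T widens the gaps between scores before the exponential, so the largest logit dominates; dividing by a large T shrinks the gaps, pushing the distribution toward uniform.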

Top-K Sampling

Only consider the top k most likely tokens:

Vocabulary: 50,000 tokens
Top-k = 40: Only sample from the 40 highest-probability tokens

Process:

  1. Sort tokens by probability
  2. Keep only top k
  3. Renormalize probabilities
  4. Sample from reduced set
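
The four steps above, as a minimal Python sketch (`top_k_filter` is an illustrative helper, not a library function):

```python
def top_k_filter(probs, k):
    """Keep only the k highest-probability tokens and renormalize.

    `probs` maps token -> probability; returns the reduced, renormalized dict.
    """
    # 1-2. Sort tokens by probability and keep only the top k
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # 3. Renormalize so the kept probabilities sum to 1
    total = sum(p for _, p in kept)
    # 4. The caller then samples from this reduced set
    return {tok: p / total for tok, p in kept}

probs = {"mat": 0.35, "floor": 0.25, "couch": 0.15, "roof": 0.10, "table": 0.08}
reduced = top_k_filter(probs, k=2)  # {"mat": 0.5833..., "floor": 0.4166...}
```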

Why:

  • Prevents sampling very unlikely tokens
  • Reduces nonsensical outputs
  • Faster computation

Downside:

  • Fixed k regardless of distribution shape
  • May cut off good tokens or include bad ones

Top-P (Nucleus) Sampling

Sample from the smallest set of tokens whose cumulative probability reaches p:

Tokens:        mat    floor  couch  roof   table  ...
Probs:         0.35   0.25   0.15   0.10   0.08   ...
Cumulative:    0.35   0.60   0.75   0.85   0.93   ...

Top-p = 0.9:   {mat, floor, couch, roof} sums to only 0.85 < 0.9
               Adding table brings the cumulative to 0.93 ≥ 0.9 ✓
               Final set: {mat, floor, couch, roof, table}

Why:

  • Adapts to distribution shape
  • Confident predictions → fewer tokens
  • Uncertain → more tokens
  • Often better than fixed top-k
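
A minimal sketch of the same logic in Python (`top_p_filter` is an illustrative helper, not a library function):

```python
def top_p_filter(probs, p):
    """Keep the smallest prefix of tokens (in descending probability order)
    whose cumulative probability reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:              # stop once the nucleus covers p
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

probs = {"mat": 0.35, "floor": 0.25, "couch": 0.15, "roof": 0.10, "table": 0.08}
nucleus = top_p_filter(probs, p=0.9)  # keeps all five (cumulative 0.93 >= 0.9)
```

With a more confident distribution (say p = 0.6 over the same tokens), the nucleus shrinks to just {mat, floor}, which is exactly the adaptiveness top-k lacks.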

Combining Strategies

Most APIs let you combine:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[...],
    temperature=0.7,  # Scale distribution
    top_p=0.9,        # Nucleus sampling
    # Note: OpenAI recommends changing one, not both
)

Order of Operations

Logits → Temperature scaling → Top-k filter → Top-p filter → Sample
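
The whole pipeline can be sketched end to end. A simplified dict-based illustration, not any particular library's implementation; real implementations operate on tensors, but the order of operations matches:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Logits -> temperature scaling -> top-k filter -> top-p filter -> sample.

    `logits` maps token -> raw score (a toy stand-in for a model's output).
    """
    if temperature == 0:                              # T = 0: greedy / argmax
        return max(logits, key=logits.get)

    # Temperature scaling, then softmax
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())                          # stabilize exponentials
    exps = {t: math.exp(s - m) for t, s in scaled.items()}
    z = sum(exps.values())
    ranked = sorted(((t, e / z) for t, e in exps.items()),
                    key=lambda kv: kv[1], reverse=True)

    if top_k is not None:                             # top-k filter
        ranked = ranked[:top_k]

    if top_p is not None:                             # top-p (nucleus) filter
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept

    tokens, weights = zip(*ranked)                    # sample from the survivors
    return rng.choices(tokens, weights=weights, k=1)[0]
```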

Other Sampling Methods

Greedy Decoding

Always pick argmax(probability)
  • Deterministic
  • Fast
  • Often repetitive and boring

Beam Search

Maintain top-n sequences, extend each, keep best n overall
  • Better than greedy for translation/summarization
  • Can still be repetitive
  • Computationally expensive
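
A toy sketch of the idea (the `next_probs` model here is a stand-in for a real LM, and `beam_search` is an illustrative helper):

```python
import math

def beam_search(next_probs, start, beam_width=2, steps=3):
    """Toy beam search. `next_probs(seq)` returns {token: prob} for the
    next token given the sequence so far."""
    beams = [(0.0, [start])]                      # (log-probability, sequence)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for tok, p in next_probs(seq).items():
                candidates.append((score + math.log(p), seq + [tok]))
        # Keep the best `beam_width` extended sequences overall
        beams = sorted(candidates, reverse=True)[:beam_width]
    return beams
```

Summing log-probabilities instead of multiplying raw probabilities avoids floating-point underflow on long sequences; it is the standard trick in decoder implementations.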

Contrastive Search

Balance probability with distinctiveness from previous tokens
  • Reduces repetition
  • Newer method

Typical Sampling

Sample from tokens with probability close to expected information content
  • Avoids both too-common and too-rare tokens

Repetition Penalties

Separate from sampling, but related:

Frequency Penalty

Reduce a token's logit in proportion to how often it has already appeared:

logit(token) -= frequency_penalty × count(token)

Presence Penalty

A one-time logit reduction if the token has appeared at all:

logit(token) -= presence_penalty    (if token in context)
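
Both penalties can be sketched together. A simplified illustration using the additive, logit-level form described in the OpenAI API documentation; `apply_penalties` and the logit values are illustrative:

```python
from collections import Counter

def apply_penalties(logits, generated, frequency_penalty=0.0, presence_penalty=0.0):
    """Subtract penalties from raw logits (before softmax) based on tokens
    already present in the generated context."""
    counts = Counter(generated)
    adjusted = {}
    for tok, logit in logits.items():
        if counts[tok] > 0:
            logit -= frequency_penalty * counts[tok]  # grows with repetitions
            logit -= presence_penalty                 # flat, one-time penalty
        adjusted[tok] = logit
    return adjusted

logits = {"the": 2.0, "cat": 1.5, "mat": 1.0}
adjusted = apply_penalties(logits, ["the", "the", "cat"],
                           frequency_penalty=0.5, presence_penalty=0.5)
# "the": 2.0 - 0.5*2 - 0.5 = 0.5;  "cat": 1.5 - 0.5 - 0.5 = 0.5;  "mat": unchanged
```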

Practical Guidelines

For Different Tasks

Task                 Temperature   Top-p      Notes
Code generation      0-0.2         0.1        Deterministic, correct
Math/reasoning       0-0.3         0.1-0.5    Consistent logic
Factual Q&A          0.3-0.5       0.5-0.8    Accurate but natural
Conversation         0.7-0.9       0.9        Natural, varied
Creative writing     0.8-1.2       0.95       Diverse, surprising
Brainstorming        1.0-1.5       1.0        Maximum diversity

Common Combinations

# Deterministic (testing, code)
temperature=0

# Balanced (general use)
temperature=0.7, top_p=0.9

# Creative
temperature=1.0, top_p=0.95

# Reduce repetition in long outputs
frequency_penalty=0.5, presence_penalty=0.5

Temperature = 0 Gotcha

Even with T=0, outputs may vary:

  • Floating point non-determinism
  • GPU parallelism effects
  • Some APIs add small noise

For true determinism:

seed=42  # If supported
temperature=0

Debugging Poor Outputs

Too Repetitive?

  • Increase temperature (0.7 → 0.9)
  • Add frequency/presence penalty
  • Try typical sampling

Too Random/Nonsensical?

  • Decrease temperature
  • Lower top-p (0.95 → 0.8)
  • Add top-k constraint

Good But Different Each Time?

  • Lower temperature
  • Use seed if available

Key Takeaways

  1. Temperature controls distribution peakedness (creativity vs determinism)
  2. Top-k limits to k most likely tokens
  3. Top-p (nucleus) adapts to distribution shape
  4. Lower values = more deterministic, higher = more creative
  5. Match settings to your task requirements
  6. Repetition penalties help with long-form generation

Practice Questions

Test your understanding with these related interview questions: