RLHF (Reinforcement Learning from Human Feedback)
RLHF trains language models to follow human preferences rather than just predicting text. It's the key technique behind ChatGPT's helpfulness and safety.
Why RLHF?
The Problem with Pre-training
Pre-trained LLMs learn to predict text, not be helpful:
❌ May continue harmful content
❌ May not follow instructions
❌ May be verbose or unhelpful
❌ May hallucinate confidently
What RLHF Adds
✓ Follows instructions
✓ Refuses harmful requests
✓ Admits uncertainty
✓ Provides helpful, concise answers
The Three Stages
1. Supervised Fine-Tuning (SFT)
Pre-trained model → Fine-tuned on demonstrations
2. Reward Model Training
Learn human preferences from comparisons
3. RL Optimization
Optimize policy against reward model
Stage 1: Supervised Fine-Tuning
Goal
Teach model the desired format and style.
Process
Human demonstrations:
User: "Explain gravity"
Assistant: "Gravity is the force that..."
Fine-tune pre-trained model on these examples.
Result
A model that responds in the desired format, though its quality is capped by the demonstrations it was trained on.
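The SFT objective above is ordinary next-token cross-entropy, usually masked so that only the assistant's tokens contribute to the loss. A minimal numeric sketch (the log-probabilities below are hypothetical model outputs, not real ones):

```python
# Toy sketch of the SFT loss: mean negative log-likelihood over the
# demonstration, masked so only assistant tokens are trained on.

# log-probability the model assigns to each target token (hypothetical)
token_logprobs = [-0.9, -1.2, -0.3, -0.5, -0.7, -0.2]
# 0 = prompt token (ignored), 1 = assistant token (trained on)
loss_mask = [0, 0, 1, 1, 1, 1]

masked = [-lp for lp, m in zip(token_logprobs, loss_mask) if m == 1]
sft_loss = sum(masked) / len(masked)
print(round(sft_loss, 3))  # mean NLL over the response tokens only
```

Masking the prompt tokens is a common convention: the model should learn to produce responses, not to reproduce user prompts.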
Stage 2: Reward Model
Goal
Learn a function that scores responses by human preference.
Collecting Preferences
Prompt: "Write a poem about AI"
Response A: [Generated text A]
Response B: [Generated text B]
Human: A is better than B
Training the Reward Model
Input: (prompt, response)
Output: scalar reward score
Loss: -log(sigmoid(r(A) - r(B)))
Trained to assign higher scores to preferred responses.
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_model, hidden_size):
        super().__init__()
        self.backbone = base_model
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids):
        # Score each sequence from the final token's hidden state
        hidden = self.backbone(input_ids).last_hidden_state[:, -1]
        return self.head(hidden).squeeze(-1)
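The pairwise loss above can be checked numerically in pure Python (the reward scores below are hypothetical reward-model outputs):

```python
import math

def pairwise_loss(r_preferred, r_rejected):
    """Bradley-Terry style loss -log(sigmoid(r(A) - r(B))):
    small when the preferred response scores higher,
    large when the ranking is violated."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_preferred - r_rejected))))

low = pairwise_loss(2.0, 0.5)   # correct ranking -> small loss
high = pairwise_loss(0.5, 2.0)  # wrong ranking -> large loss
```

Minimizing this loss pushes the score gap r(A) - r(B) up whenever A was preferred, which is exactly "assign higher scores to preferred responses".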
Stage 3: RL Optimization
Goal
Optimize the policy (language model) to maximize reward.
PPO Algorithm
For each batch:
1. Generate responses from policy
2. Score with reward model
3. Update policy to increase expected reward
4. KL penalty to stay close to SFT model
The Objective
max E[r(x, y)] - β × KL(π || π_SFT)
r(x, y): Reward model score
KL: Divergence from SFT model
β: KL penalty coefficient
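The objective above can be sketched per sample. In practice the KL term is often estimated from per-token log-probability differences between the policy and the SFT model; all numbers below are hypothetical:

```python
beta = 0.1  # KL penalty coefficient β

def penalized_reward(reward, logp_policy, logp_sft):
    # Simple KL estimate: sum of log pi(y_t|x) - log pi_SFT(y_t|x)
    kl_est = sum(p - s for p, s in zip(logp_policy, logp_sft))
    return reward - beta * kl_est

# A response the reward model likes, but that has drifted from the SFT model:
r = penalized_reward(1.8, [-0.2, -0.4, -0.1], [-0.9, -1.1, -0.8])
```

The larger the drift from the SFT model, the bigger the penalty subtracted from the reward, which is what discourages reward hacking.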
Why KL Penalty?
- Prevents reward hacking
- Maintains language quality
- Keeps model in-distribution
Reward Hacking
The Problem
Model finds shortcuts to high reward:
❌ Overly long responses
❌ Repeating user's question
❌ Excessive flattery
❌ Exploiting reward model weaknesses
Solutions
- KL penalty to stay near SFT
- Regularly update reward model
- Diverse training prompts
- Red-teaming
Alternatives to PPO
DPO (Direct Preference Optimization)
Skip reward model, train directly on preferences:
Loss = -log(sigmoid(β × ((log π(y_w|x) - log π(y_l|x))
                       - (log π_ref(y_w|x) - log π_ref(y_l|x)))))
y_w: Preferred response
y_l: Dispreferred response
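Under these definitions, the loss can be computed from summed sequence log-probabilities; the values below are hypothetical:

```python
import math

beta = 0.1

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l):
    # How much more the policy prefers y_w over y_l than the
    # frozen reference model does
    margin = (logp_w - logp_l) - (ref_logp_w - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy prefers y_w more strongly than the reference -> lower loss
loss = dpo_loss(-12.0, -20.0, -15.0, -18.0)
```

Note that no reward model appears anywhere: the preference signal is expressed directly through the policy's and reference model's log-probabilities.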
Advantages:
- Simpler (no reward model)
- More stable
- Faster
RLAIF
Replaces human preference labels with AI feedback:
AI model ranks responses instead of humans
Constitutional AI
Self-improvement with principles:
1. Generate response
2. Critique based on constitution
3. Revise response
4. Train on revised version
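The four steps above can be sketched as a loop. `model` here is a hypothetical stub standing in for real LLM calls; an actual implementation would prompt the same model for each step:

```python
CONSTITUTION = ["Avoid harmful instructions.", "Admit uncertainty."]

def model(prompt):
    # Hypothetical stub standing in for an LLM API call.
    if prompt.startswith("Critique"):
        return "The response states speculation as fact."
    if prompt.startswith("Revise"):
        return "I'm not certain, but one common explanation is..."
    return "The answer is definitely X."

def constitutional_step(user_prompt):
    response = model(user_prompt)                                    # 1. generate
    critique = model(f"Critique per {CONSTITUTION}: {response}")     # 2. critique
    revised = model(f"Revise given critique: {critique}\n{response}")  # 3. revise
    return revised  # 4. (prompt, revised) pairs become training data

revised = constitutional_step("Explain gravity")
```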
Practical Considerations
Data Collection
- Quality > quantity
- Clear guidelines for labelers
- Diverse prompts
- Include edge cases
Labeler Agreement
Inter-annotator agreement:
- High for clear cases
- Low for subjective preferences
Handle disagreement:
- Majority vote
- Model uncertainty
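Majority-vote aggregation can be sketched in a few lines; the labels below are hypothetical. Low-agreement items can be down-weighted or dropped rather than forced into a single label:

```python
from collections import Counter

def aggregate(labels):
    # Majority vote over labeler preferences, plus an agreement score
    (winner, count), = Counter(labels).most_common(1)
    agreement = count / len(labels)
    return winner, agreement

label, agreement = aggregate(["A", "A", "B", "A"])  # 3 of 4 labelers prefer A
```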
Hyperparameters
# PPO
learning_rate = 1e-5
batch_size = 64
kl_penalty = 0.1 # β
ppo_epochs = 4
# DPO
beta = 0.1
learning_rate = 1e-6
Evaluation
Automatic
- Reward model score
- Win rate vs baseline
- Toxicity classifiers
- Helpfulness classifiers
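Win rate against a baseline is computed from pairwise judgments; a common convention counts ties as half a win. The outcomes below are hypothetical ("W" = model wins, "L" = loses, "T" = tie):

```python
def win_rate(outcomes):
    wins = outcomes.count("W")
    ties = outcomes.count("T")
    return (wins + 0.5 * ties) / len(outcomes)  # ties count as half a win

rate = win_rate(["W", "W", "L", "T", "W"])  # 3 wins, 1 tie, 1 loss
```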
Human
- A/B comparisons
- Likert ratings
- Task success rate
Benchmarks
- MT-Bench
- AlpacaEval
- HumanEval (for code)
Current Challenges
Specification Gaming
Model optimizes metric but not true intent.
Scalable Oversight
How to get feedback on superhuman tasks?
Value Alignment
Whose values? How to aggregate preferences?
Robustness
Adversarial prompts can bypass safety training.
The Full Pipeline
[Pre-training]
↓
[SFT on demonstrations]
↓
[Collect preference data]
↓
[Train reward model]
↓
[PPO/DPO optimization]
↓
[Evaluation & red-teaming]
↓
[Deployment]
Key Takeaways
- RLHF aligns LLMs with human preferences
- Three stages: SFT → Reward Model → RL
- Reward model learns from human comparisons
- PPO optimizes policy with KL constraint
- DPO is simpler alternative to PPO
- Key challenges: reward hacking, scalable oversight