RLHF (Reinforcement Learning from Human Feedback)
RLHF trains language models to follow human preferences rather than just predicting text. It's the key technique behind ChatGPT's helpfulness and safety.
Why RLHF?
The Problem with Pre-training
Pre-trained LLMs learn to predict text, not be helpful:
❌ May continue harmful content
❌ May not follow instructions
❌ May be verbose or unhelpful
❌ May hallucinate confidently
What RLHF Adds
✓ Follows instructions
✓ Refuses harmful requests
✓ Admits uncertainty
✓ Provides helpful, concise answers
The Three Stages
1. Supervised Fine-Tuning (SFT)
Pre-trained model → Fine-tuned on demonstrations
2. Reward Model Training
Learn human preferences from comparisons
3. RL Optimization
Optimize policy against reward model
Stage 1: Supervised Fine-Tuning
Goal
Teach model the desired format and style.
Process
Human demonstrations:
User: "Explain gravity"
Assistant: "Gravity is the force that..."
Fine-tune pre-trained model on these examples.
Result
A model that responds in the desired format, though its quality is capped by the demonstrations it was trained on.
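The SFT objective above is ordinary next-token cross-entropy, usually masked so that only the assistant's tokens contribute to the loss. A minimal numeric sketch (the log-probabilities below are hypothetical model outputs, not real ones):

```python
# Toy sketch of the SFT loss: mean negative log-likelihood over the
# demonstration, masked so only assistant tokens are trained on.

# log-probability the model assigns to each target token (hypothetical)
token_logprobs = [-0.9, -1.2, -0.3, -0.5, -0.7, -0.2]
# 0 = prompt token (ignored), 1 = assistant token (trained on)
loss_mask = [0, 0, 1, 1, 1, 1]

masked = [-lp for lp, m in zip(token_logprobs, loss_mask) if m == 1]
sft_loss = sum(masked) / len(masked)
print(round(sft_loss, 3))  # mean NLL over the response tokens only
```

Masking the prompt tokens is a common convention: the model should learn to produce responses, not to reproduce user prompts.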
Stage 2: Reward Model
Goal
Learn a function that scores responses by human preference.
Collecting Preferences
Prompt: "Write a poem about AI"
Response A: [Generated text A]
Response B: [Generated text B]
Human: A is better than B
Training the Reward Model
Input: (prompt, response)
Output: scalar reward score
Loss: -log(sigmoid(r(A) - r(B)))
Trained to assign higher scores to preferred responses.
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_model, hidden_size):
        super().__init__()
        self.backbone = base_model
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids):
        # Score each sequence from the final token's hidden state
        hidden = self.backbone(input_ids).last_hidden_state[:, -1]
        return self.head(hidden).squeeze(-1)
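The pairwise loss above can be checked numerically in pure Python (the reward scores below are hypothetical reward-model outputs):

```python
import math

def pairwise_loss(r_preferred, r_rejected):
    """Bradley-Terry style loss -log(sigmoid(r(A) - r(B))):
    small when the preferred response scores higher,
    large when the ranking is violated."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_preferred - r_rejected))))

low = pairwise_loss(2.0, 0.5)   # correct ranking -> small loss
high = pairwise_loss(0.5, 2.0)  # wrong ranking -> large loss
```

Minimizing this loss pushes the score gap r(A) - r(B) up whenever A was preferred, which is exactly "assign higher scores to preferred responses".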
Stage 3: RL Optimization
Goal
Optimize the policy (language model) to maximize reward.
PPO Algorithm
For each batch:
1. Generate responses from policy
2. Score with reward model
3. Update policy to increase expected reward
4. KL penalty to stay close to SFT model
The Objective
max E[r(x, y)] - β × KL(π || π_SFT)
r(x, y): Reward model score
KL: Divergence from SFT model
β: KL penalty coefficient
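The objective above can be sketched per sample. In practice the KL term is often estimated from per-token log-probability differences between the policy and the SFT model; all numbers below are hypothetical:

```python
beta = 0.1  # KL penalty coefficient β

def penalized_reward(reward, logp_policy, logp_sft):
    # Simple KL estimate: sum of log pi(y_t|x) - log pi_SFT(y_t|x)
    kl_est = sum(p - s for p, s in zip(logp_policy, logp_sft))
    return reward - beta * kl_est

# A response the reward model likes, but that has drifted from the SFT model:
r = penalized_reward(1.8, [-0.2, -0.4, -0.1], [-0.9, -1.1, -0.8])
```

The larger the drift from the SFT model, the bigger the penalty subtracted from the reward, which is what discourages reward hacking.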
Why KL Penalty?
- Prevents reward hacking
- Maintains language quality
- Keeps model in-distribution
Reward Hacking
The Problem
Model finds shortcuts to high reward:
❌ Overly long responses
❌ Repeating user's question
❌ Excessive flattery
❌ Exploiting reward model weaknesses
Solutions
- KL penalty to stay near SFT
- Regularly update reward model
- Diverse training prompts
- Red-teaming
Alternatives to PPO
DPO (Direct Preference Optimization)
Skip reward model, train directly on preferences:
Loss = -log(sigmoid(β × ((log π(y_w|x) - log π(y_l|x))
                       - (log π_ref(y_w|x) - log π_ref(y_l|x)))))
y_w: Preferred response
y_l: Dispreferred response
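Under these definitions, the loss can be computed from summed sequence log-probabilities; the values below are hypothetical:

```python
import math

beta = 0.1

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l):
    # How much more the policy prefers y_w over y_l than the
    # frozen reference model does
    margin = (logp_w - logp_l) - (ref_logp_w - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy prefers y_w more strongly than the reference -> lower loss
loss = dpo_loss(-12.0, -20.0, -15.0, -18.0)
```

Note that no reward model appears anywhere: the preference signal is expressed directly through the policy's and reference model's log-probabilities.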
Advantages:
- Simpler (no reward model)
- More stable
- Faster
RLAIF
Replaces human preference labels with AI feedback:
AI model ranks responses instead of humans
Constitutional AI
Self-improvement with principles:
1. Generate response
2. Critique based on constitution
3. Revise response
4. Train on revised version
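The four steps above can be sketched as a loop. `model` here is a hypothetical stub standing in for real LLM calls; an actual implementation would prompt the same model for each step:

```python
CONSTITUTION = ["Avoid harmful instructions.", "Admit uncertainty."]

def model(prompt):
    # Hypothetical stub standing in for an LLM API call.
    if prompt.startswith("Critique"):
        return "The response states speculation as fact."
    if prompt.startswith("Revise"):
        return "I'm not certain, but one common explanation is..."
    return "The answer is definitely X."

def constitutional_step(user_prompt):
    response = model(user_prompt)                                    # 1. generate
    critique = model(f"Critique per {CONSTITUTION}: {response}")     # 2. critique
    revised = model(f"Revise given critique: {critique}\n{response}")  # 3. revise
    return revised  # 4. (prompt, revised) pairs become training data

revised = constitutional_step("Explain gravity")
```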
Practical Considerations
Data Collection
- Quality > quantity
- Clear guidelines for labelers
- Diverse prompts
- Include edge cases
Labeler Agreement
Inter-annotator agreement:
- High for clear cases
- Low for subjective preferences
Handle disagreement:
- Majority vote
- Model uncertainty
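Majority-vote aggregation can be sketched in a few lines; the labels below are hypothetical. Low-agreement items can be down-weighted or dropped rather than forced into a single label:

```python
from collections import Counter

def aggregate(labels):
    # Majority vote over labeler preferences, plus an agreement score
    (winner, count), = Counter(labels).most_common(1)
    agreement = count / len(labels)
    return winner, agreement

label, agreement = aggregate(["A", "A", "B", "A"])  # 3 of 4 labelers prefer A
```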
Hyperparameters
# PPO
learning_rate = 1e-5
batch_size = 64
kl_penalty = 0.1 # β
ppo_epochs = 4
# DPO
beta = 0.1
learning_rate = 1e-6
Evaluation
Automatic
- Reward model score
- Win rate vs baseline
- Toxicity classifiers
- Helpfulness classifiers
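Win rate against a baseline is computed from pairwise judgments; a common convention counts ties as half a win. The outcomes below are hypothetical ("W" = model wins, "L" = loses, "T" = tie):

```python
def win_rate(outcomes):
    wins = outcomes.count("W")
    ties = outcomes.count("T")
    return (wins + 0.5 * ties) / len(outcomes)  # ties count as half a win

rate = win_rate(["W", "W", "L", "T", "W"])  # 3 wins, 1 tie, 1 loss
```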
Human
- A/B comparisons
- Likert ratings
- Task success rate
Benchmarks
- MT-Bench
- AlpacaEval
- HumanEval (for code)
Current Challenges
Specification Gaming
Model optimizes metric but not true intent.
Scalable Oversight
How to get feedback on superhuman tasks?
Value Alignment
Whose values? How to aggregate preferences?
Robustness
Adversarial prompts can bypass safety training.
The Full Pipeline
[Pre-training]
↓
[SFT on demonstrations]
↓
[Collect preference data]
↓
[Train reward model]
↓
[PPO/DPO optimization]
↓
[Evaluation & red-teaming]
↓
[Deployment]
Key Takeaways
- RLHF aligns LLMs with human preferences
- Three stages: SFT → Reward Model → RL
- Reward model learns from human comparisons
- PPO optimizes policy with KL constraint
- DPO is simpler alternative to PPO
- Key challenges: reward hacking, scalable oversight