
Understand RLHF - the technique that aligns language models with human preferences, making them helpful and safe.


RLHF (Reinforcement Learning from Human Feedback)

RLHF trains language models to follow human preferences rather than just predicting text. It's the key technique behind ChatGPT's helpfulness and safety.

Why RLHF?

The Problem with Pre-training

Pre-trained LLMs learn to predict the next token, not to be helpful:

❌ May continue harmful content
❌ May not follow instructions
❌ May be verbose or unhelpful
❌ May hallucinate confidently

What RLHF Adds

✓ Follows instructions
✓ Refuses harmful requests
✓ Admits uncertainty
✓ Provides helpful, concise answers

The Three Stages

1. Supervised Fine-Tuning (SFT)
   Pre-trained model → Fine-tuned on demonstrations

2. Reward Model Training
   Learn human preferences from comparisons

3. RL Optimization
   Optimize policy against reward model

Stage 1: Supervised Fine-Tuning

Goal

Teach the model the desired response format and style.

Process

Human demonstrations:
  User: "Explain gravity"
  Assistant: "Gravity is the force that..."

Fine-tune pre-trained model on these examples.

Result

A model that responds in the desired format, though it only imitates the demonstrations and is not yet optimized against human preferences.
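The key mechanics of SFT can be sketched without any framework: compute the usual next-token negative log-likelihood, but mask out the prompt tokens so the loss covers only the demonstrated assistant response. The function name and toy numbers below are illustrative, not from any particular library:

```python
def sft_loss(token_logprobs, loss_mask):
    """Average negative log-likelihood over assistant tokens only.

    token_logprobs: model log-prob of each target token in the sequence
    loss_mask: 1 for assistant (demonstration) tokens, 0 for prompt tokens
    """
    masked = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return sum(masked) / len(masked)

# Prompt tokens (mask 0) are excluded; only the demonstration is trained on.
logprobs = [-0.5, -1.2, -0.1, -0.3, -0.2]
mask     = [0,    0,    1,    1,    1]
loss = sft_loss(logprobs, mask)  # mean NLL over the last three tokens
```

Masking the prompt is a common convention: it keeps the model from being trained to reproduce user inputs.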

Stage 2: Reward Model

Goal

Learn a function that scores responses by human preference.

Collecting Preferences

Prompt: "Write a poem about AI"

Response A: [Generated text A]
Response B: [Generated text B]

Human: A is better than B

Training the Reward Model

Input: (prompt, response)
Output: scalar reward score

Loss: -log(sigmoid(r(A) - r(B)))

Trained to assign higher scores to preferred responses.

import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_model, hidden_size):
        super().__init__()
        self.backbone = base_model
        self.head = nn.Linear(hidden_size, 1)  # scalar reward head

    def forward(self, input_ids):
        # Score the sequence from the final token's hidden state
        hidden = self.backbone(input_ids).last_hidden_state[:, -1]
        return self.head(hidden).squeeze(-1)
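The pairwise loss above can also be sketched numerically. This minimal, framework-free example assumes the scalar reward scores for the two responses have already been computed; the function name is illustrative:

```python
import math

def reward_pair_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log(sigmoid(r(A) - r(B)))."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Equal scores give maximal uncertainty: loss = log(2)
baseline = reward_pair_loss(0.0, 0.0)

# A larger margin in favor of the preferred response lowers the loss
confident = reward_pair_loss(2.0, 0.0)
```

Minimizing this loss pushes the score of the preferred response above the score of the dispreferred one.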

Stage 3: RL Optimization

Goal

Optimize the policy (language model) to maximize reward.

PPO Algorithm

For each batch:
  1. Generate responses from policy
  2. Score with reward model
  3. Update policy to increase expected reward
  4. KL penalty to stay close to SFT model

The Objective

max E[r(x, y)] - β × KL(π || π_SFT)

r(x, y): Reward model score
KL: Divergence from SFT model
β: KL penalty coefficient
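In practice the objective is often implemented as a shaped reward: the reward model score minus a KL penalty estimated from per-token log-probabilities under the policy and the frozen SFT reference. A minimal sketch, assuming those log-probs are available (names and numbers are illustrative):

```python
def kl_shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Reward used in the RL step: RM score minus a KL penalty.

    Uses the common estimate KL ≈ log π(y|x) - log π_SFT(y|x),
    summed over response tokens.
    """
    kl = sum(p - r for p, r in zip(logp_policy, logp_ref))
    return rm_score - beta * kl

# Policy drifts slightly from the reference → small penalty off the RM score
shaped = kl_shaped_reward(1.0, [-0.1, -0.2], [-0.3, -0.4], beta=0.1)
```

A larger β keeps the policy closer to the SFT model at the cost of smaller reward gains.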

Why KL Penalty?

  • Prevents reward hacking
  • Maintains language quality
  • Keeps model in-distribution

Reward Hacking

The Problem

Model finds shortcuts to high reward:

❌ Overly long responses
❌ Repeating user's question
❌ Excessive flattery
❌ Exploiting reward model weaknesses

Solutions

  • KL penalty to stay near SFT
  • Regularly update reward model
  • Diverse training prompts
  • Red-teaming

Alternatives to PPO

DPO (Direct Preference Optimization)

Skips the explicit reward model and trains the policy directly on preference pairs:

Loss = -log(sigmoid(β × (log π(y_w|x) - log π(y_l|x) 
                         - log π_ref(y_w|x) + log π_ref(y_l|x))))

y_w: Preferred response
y_l: Dispreferred response

Advantages:

  • Simpler (no reward model)
  • More stable
  • Faster
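The DPO loss above maps directly to code. This sketch assumes the summed log-probabilities of each full response under the policy and the frozen reference model are already computed:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss on one preference pair.

    logp_w / logp_l: policy log-probs of the preferred / dispreferred response
    ref_logp_w / ref_logp_l: same quantities under the reference model
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy and reference agree → zero margin → loss = log(2)
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)

# Policy favors the preferred response more than the reference does → lower loss
improved = dpo_loss(-0.5, -1.0, -1.0, -1.0)
```

Note that the reference model plays the same role here as the KL penalty in PPO: the loss rewards shifts in preference *relative* to the reference, not raw likelihood.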

RLAIF

Reinforcement Learning from AI Feedback replaces human preference labels with AI-generated ones:

A strong AI model ranks responses instead of human labelers, making preference data cheaper to collect at scale.

Constitutional AI

Self-improvement with principles:

1. Generate response
2. Critique based on constitution
3. Revise response
4. Train on revised version

Practical Considerations

Data Collection

  • Quality > quantity
  • Clear guidelines for labelers
  • Diverse prompts
  • Include edge cases

Labeler Agreement

Inter-annotator agreement:
- High for clear cases
- Low for subjective preferences

Handle disagreement:
- Majority vote
- Model uncertainty
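The majority-vote strategy can be sketched in a few lines; the helper names here are illustrative. Ties are surfaced rather than resolved, so downstream code can drop or down-weight uncertain comparisons:

```python
from collections import Counter

def majority_label(votes):
    """Resolve labeler disagreement by majority vote; None on a tie."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: treat the comparison as uncertain
    return counts[0][0]

def agreement_rate(votes):
    """Fraction of labelers agreeing with the modal label."""
    counts = Counter(votes)
    return max(counts.values()) / len(votes)
```

A low agreement rate on a prompt category is itself a useful signal: it often marks genuinely subjective preferences where the reward model should not be confident.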

Hyperparameters

# PPO
learning_rate = 1e-5
batch_size = 64
kl_penalty = 0.1  # β
ppo_epochs = 4

# DPO
beta = 0.1
learning_rate = 1e-6

Evaluation

Automatic

  • Reward model score
  • Win rate vs baseline
  • Toxicity classifiers
  • Helpfulness classifiers
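Win rate against a baseline is straightforward to compute once each head-to-head comparison is labeled; one common convention, assumed here, is to count ties as half a win:

```python
def win_rate(outcomes):
    """Win rate vs a baseline model; ties count as half a win.

    outcomes: list of "win", "lose", or "tie" from head-to-head comparisons.
    """
    wins = sum(1.0 for o in outcomes if o == "win")
    ties = sum(0.5 for o in outcomes if o == "tie")
    return (wins + ties) / len(outcomes)

rate = win_rate(["win", "lose", "tie", "win"])  # 2 wins, 1 tie over 4 games
```

A win rate of 0.5 means the model is indistinguishable from the baseline under this metric.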

Human

  • A/B comparisons
  • Likert ratings
  • Task success rate

Benchmarks

  • MT-Bench
  • AlpacaEval
  • HumanEval (for code)

Current Challenges

Specification Gaming

Model optimizes metric but not true intent.

Scalable Oversight

How to get feedback on superhuman tasks?

Value Alignment

Whose values? How to aggregate preferences?

Robustness

Adversarial prompts can bypass safety training.

The Full Pipeline

[Pre-training]
      ↓
[SFT on demonstrations]
      ↓
[Collect preference data]
      ↓
[Train reward model]
      ↓
[PPO/DPO optimization]
      ↓
[Evaluation & red-teaming]
      ↓
[Deployment]

Key Takeaways

  1. RLHF aligns LLMs with human preferences
  2. Three stages: SFT → Reward Model → RL
  3. Reward model learns from human comparisons
  4. PPO optimizes policy with KL constraint
  5. DPO is simpler alternative to PPO
  6. Key challenges: reward hacking, scalable oversight
