# Fine-tuning LLMs
Fine-tuning adapts a pretrained language model to specific tasks or domains. It's how you make general models excel at your particular use case.
## Why Fine-tune?
Prompting limitations:
- Limited context window
- Inconsistent outputs
- Can't learn new behaviors
- May not match your style/format
Fine-tuning enables:
- Task-specific optimization
- Style and format consistency
- Improved accuracy on domain
- Potentially smaller, faster models
## Types of Fine-tuning

### Full Fine-tuning

Update all model parameters:

    All weights → Training → All weights updated
Pros: maximum flexibility, best potential performance.
Cons: expensive, risks catastrophic forgetting, needs a lot of data.
### Parameter-Efficient Fine-tuning (PEFT)

Update only a small subset of parameters:

    Most weights frozen → Train small adapter → Merge or use alongside
Methods: LoRA, QLoRA, Adapters, Prefix Tuning
## LoRA (Low-Rank Adaptation)

The most popular PEFT method.

### How It Works

Instead of updating the weight matrix W directly, add a low-rank update:

    Original: y = Wx
    LoRA:     y = Wx + BAx
Where:
- W (frozen): Original weights [d × d]
- A: Low-rank down-projection [r × d], maps the input down to rank r
- B: Low-rank up-projection [d × r], maps back up to dimension d
- r << d (typical r = 8-64)
### Why It Works

Empirically, the weight updates learned during fine-tuning tend to have low intrinsic rank. LoRA exploits this:
- Train only B and A (few parameters)
- Captures task-specific adaptations
- Can merge back: W' = W + BA
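A minimal NumPy sketch of the forward pass and the merge (toy sizes; real LoRA also multiplies the adapter path by a scaling factor alpha / r):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                             # hidden size and LoRA rank (toy values)

W = rng.standard_normal((d, d))         # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection [r x d]
B = np.zeros((d, r))                    # trainable up-projection [d x r], zero-init
                                        # so the adapter path starts as a no-op

x = rng.standard_normal(d)
y_lora = W @ x + B @ (A @ x)            # LoRA forward: y = Wx + BAx

W_merged = W + B @ A                    # merge after training: W' = W + BA
y_merged = W_merged @ x
assert np.allclose(y_lora, y_merged)    # identical outputs, zero inference overhead
```

Because B starts at zero, the adapted model is exactly the base model at step 0; training then moves only A and B.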
### Advantages
- 10-100x fewer trainable parameters
- Multiple LoRAs for different tasks
- Easy to swap/combine adapters
- Memory efficient
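The parameter savings are easy to check. For a single d × d projection at an illustrative 7B-class hidden size:

```python
# Trainable parameters for one d x d projection matrix
d, r = 4096, 16          # hidden size (illustrative) and LoRA rank

full = d * d             # full fine-tuning updates every entry of W
lora = r * d + d * r     # LoRA trains only A [r x d] and B [d x r]

print(full // lora)      # -> 128, i.e. 128x fewer trainable parameters here
```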
## QLoRA (Quantized LoRA)

Combine quantization with LoRA:

    Base model: 4-bit quantized (frozen)
    Adapters:   full precision (trained)

This makes it possible to fine-tune a 65B model on a single 48 GB GPU.
### Key Techniques
- 4-bit NormalFloat quantization
- Double quantization (quantize quantization constants)
- Paged optimizers (offload to CPU when needed)
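These pieces are typically wired together via Hugging Face transformers + bitsandbytes. A sketch, not a canonical recipe (the model name is a placeholder and a GPU is required):

```python
# Illustrative QLoRA base-model loading; requires transformers, bitsandbytes, a GPU
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # 4-bit NormalFloat quantization
    bnb_4bit_use_double_quant=True,     # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",         # placeholder; any causal LM works
    quantization_config=bnb_config,
)
# LoRA adapters are then attached on top of this frozen 4-bit base.
```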
## Other PEFT Methods

### Adapters

Insert small trainable modules between layers:

    Layer → Adapter → Layer → Adapter → ...
### Prefix Tuning

Learn "soft prompts": continuous prefix embeddings, injected at every layer of the model:

    [Learned prefix embeddings] + [Input tokens]
### Prompt Tuning

Similar to prefix tuning but simpler: learned embeddings are prepended only to the input embeddings, not injected at every layer.
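The core idea fits in a few lines of NumPy (toy sizes): trainable prefix vectors are concatenated in front of the frozen token embeddings before the model runs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, prefix_len, seq_len = 16, 4, 6               # toy dimensions

prefix = rng.standard_normal((prefix_len, d_model))   # trainable soft-prompt vectors
tokens = rng.standard_normal((seq_len, d_model))      # frozen input token embeddings

hidden = np.concatenate([prefix, tokens], axis=0)     # what the frozen model sees
print(hidden.shape)                                   # -> (10, 16)
```

Only `prefix` receives gradients; the model and token embeddings stay frozen.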
## Fine-tuning Process

### 1. Prepare Data

    # Instruction format
    {
        "instruction": "Summarize the following article",
        "input": "<article text>",
        "output": "<summary>"
    }

    # Chat format
    [
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."}
    ]
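The two formats are interconvertible; `to_chat` below is a hypothetical preprocessing helper (real training would additionally apply the model's chat template):

```python
# Convert an instruction-format record into chat-format messages
def to_chat(record):
    user_content = record["instruction"]
    if record.get("input"):                    # optional input field
        user_content += "\n\n" + record["input"]
    return [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": record["output"]},
    ]

example = {
    "instruction": "Summarize the following article",
    "input": "LoRA freezes the base model and trains low-rank adapters.",
    "output": "LoRA trains small adapter matrices instead of the full model.",
}
messages = to_chat(example)
print(messages[0]["role"])   # -> user
```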
### 2. Choose Method
| Scenario | Recommendation |
|---|---|
| Limited compute | QLoRA |
| Multiple tasks | LoRA (separate adapters) |
| Maximum quality | Full fine-tune (if resources permit) |
| Quick iteration | LoRA with small r |
### 3. Set Hyperparameters

    training_args = {
        "learning_rate": 2e-5,        # Much lower than pretraining
        "num_epochs": 3,              # Usually 1-5
        "batch_size": 4,              # As large as memory allows
        "gradient_accumulation": 4,   # Effective batch size = 4 × 4 = 16
        "warmup_ratio": 0.1,
        "weight_decay": 0.01,
    }

    # LoRA-specific
    lora_config = {
        "r": 16,                          # Rank
        "lora_alpha": 32,                 # Scaling (effective scale = alpha / r)
        "target_modules": ["q_proj", "v_proj"],
        "lora_dropout": 0.05,
    }
### 4. Train

    from transformers import Trainer, TrainingArguments
    from peft import LoraConfig, get_peft_model

    # The settings from step 3, as proper config objects:
    training_args = TrainingArguments(learning_rate=2e-5, num_train_epochs=3, ...)
    lora_config = LoraConfig(r=16, lora_alpha=32,
                             target_modules=["q_proj", "v_proj"], lora_dropout=0.05)

    model = get_peft_model(base_model, lora_config)
    trainer = Trainer(model=model, args=training_args, ...)
    trainer.train()
### 5. Evaluate
- Task-specific metrics (accuracy, F1, BLEU, etc.)
- Human evaluation for generation quality
- Compare to baseline (original model + prompting)
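A toy baseline comparison using exact-match accuracy (`exact_match` and all predictions here are made up for illustration; real evaluation uses task-appropriate metrics):

```python
# Exact-match accuracy, case- and whitespace-insensitive
def exact_match(preds, refs):
    hits = sum(p.strip().lower() == r.strip().lower() for p, r in zip(preds, refs))
    return hits / len(refs)

references    = ["paris", "blue", "42"]
baseline_out  = ["Paris", "red",  "41"]   # original model + prompting
finetuned_out = ["paris", "Blue", "42"]   # fine-tuned model

print(exact_match(baseline_out, references))    # -> 0.3333333333333333
print(exact_match(finetuned_out, references))   # -> 1.0
```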
## Common Challenges

### Catastrophic Forgetting

The model loses general abilities while learning the new task.
Solutions:
- Mix general data in with the task data
- Use a lower learning rate
- Prefer PEFT methods (frozen base weights forget less)
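The first idea can be sketched with a hypothetical `mix` helper that blends a slice of general-purpose data back into the task set (here roughly 10% of the final mix; the ratio is an assumption, not a rule):

```python
import random

def mix(task_data, general_data, general_ratio=0.1, seed=0):
    # How many general examples make up general_ratio of the combined set
    n_general = round(len(task_data) * general_ratio / (1 - general_ratio))
    rng = random.Random(seed)
    combined = task_data + rng.sample(general_data, min(n_general, len(general_data)))
    rng.shuffle(combined)
    return combined

task = [f"task_{i}" for i in range(90)]
general = [f"gen_{i}" for i in range(100)]
mixed = mix(task, general)
print(len(mixed))   # -> 100 (90 task + 10 general)
```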
### Overfitting

Small datasets lead to memorization.
Solutions:
- More data
- Data augmentation
- Regularization (dropout, weight decay)
- Early stopping
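Early stopping reduces to a patience check on validation loss; `should_stop` is a hypothetical helper sketching it:

```python
# Stop when validation loss has not improved for `patience` epochs
def should_stop(val_losses, patience=2):
    best_epoch = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_epoch >= patience

losses = [1.9, 1.4, 1.2, 1.25, 1.3]      # starts overfitting after epoch 2
print(should_stop(losses, patience=2))   # -> True
```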
### Data Quality
Garbage in, garbage out.
Solutions:
- Clean, deduplicate data
- Diverse examples
- Include edge cases
- Quality > quantity
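A toy version of the clean/deduplicate step (`dedupe` is a hypothetical helper; real pipelines also do near-duplicate detection, e.g. MinHash):

```python
# Drop duplicates after cheap normalization (lowercase, collapse whitespace)
def dedupe(examples):
    seen, kept = set(), []
    for ex in examples:
        key = " ".join(ex.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept

data = ["Summarize this.", "summarize   this.", "Translate to French."]
print(len(dedupe(data)))   # -> 2 (the second item is a duplicate)
```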
## Instruction Tuning vs Task Fine-tuning

### Instruction Tuning

Teach the model to follow instructions in general:
- Diverse tasks
- Natural instructions
- Creates "helpful assistant"
### Task-Specific Fine-tuning
Optimize for one specific task:
- Single task
- Consistent format
- Maximum task performance
## RLHF (Reinforcement Learning from Human Feedback)

Fine-tuning with human preferences:

1. Supervised fine-tuning (SFT) on demonstration data
2. Train a reward model on human preference comparisons
3. Optimize the policy with PPO against the reward model

This is how ChatGPT was aligned.
A simpler alternative: DPO (Direct Preference Optimization)
- No separate reward model
- Optimizes directly on preference pairs
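The DPO objective can be sketched numerically. `dpo_loss` is a hypothetical helper operating on summed log-probabilities of the chosen (w) and rejected (l) responses under the policy and the frozen reference model; beta is the usual temperature (e.g. 0.1):

```python
import math

# loss = -log sigmoid( beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)] )
def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# Policy prefers the chosen answer more than the reference does -> low loss
low  = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-6.0)
# Policy prefers the rejected answer -> higher loss
high = dpo_loss(logp_w=-9.0, logp_l=-5.0, ref_logp_w=-6.0, ref_logp_l=-6.0)
print(low < high)   # -> True
```

The loss depends only on these log-probability margins, which is why no separate reward model is needed.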
## When to Fine-tune vs Prompt
| Factor | Use Prompting | Use Fine-tuning |
|---|---|---|
| Data available | Little | Plenty |
| Task complexity | Simple | Complex |
| Budget | Low | Higher |
| Need consistency | Less critical | Critical |
| Iteration speed | Need fast | Can wait |
## Key Takeaways
- Fine-tuning adapts models to specific tasks/domains
- LoRA/QLoRA enable efficient fine-tuning on limited hardware
- Quality data is more important than quantity
- Watch for catastrophic forgetting and overfitting
- Often combine with prompting for best results
- Evaluate carefully against baselines