# Fine-tuning LLMs
Fine-tuning adapts a pretrained language model to specific tasks or domains. It's how you make general models excel at your particular use case.
## Why Fine-tune?
Prompting limitations:
- Limited context window
- Inconsistent outputs
- Can't learn new behaviors
- May not match your style/format
Fine-tuning enables:
- Task-specific optimization
- Style and format consistency
- Improved accuracy on domain
- Potentially smaller, faster models
## Types of Fine-tuning

### Full Fine-tuning

Update all model parameters:

    All weights → Training → All weights updated
Pros: maximum flexibility, best potential performance.
Cons: expensive, risks catastrophic forgetting, needs a lot of data.
### Parameter-Efficient Fine-tuning (PEFT)

Update only a small subset of parameters:

    Most weights frozen → Train small adapter → Merge or use alongside
Methods: LoRA, QLoRA, Adapters, Prefix Tuning
## LoRA (Low-Rank Adaptation)

The most popular PEFT method.

### How It Works

Instead of updating the weight matrix W directly, add a low-rank update:

    Original: y = Wx
    LoRA:     y = Wx + BAx
Where:
- W (frozen): Original weights [d × d]
- A: Low-rank down-projection [r × d], maps the input down to rank r
- B: Low-rank up-projection [d × r], maps back up to dimension d
- r << d (typical r = 8-64)
### Why It Works

Empirically, the weight updates learned during fine-tuning tend to have low intrinsic rank. LoRA exploits this:
- Train only B and A (few parameters)
- Captures task-specific adaptations
- Can merge back: W' = W + BA
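A minimal NumPy sketch of the forward pass and the merge (toy sizes; real LoRA also multiplies the adapter path by a scaling factor alpha / r):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                             # hidden size and LoRA rank (toy values)

W = rng.standard_normal((d, d))         # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection [r x d]
B = np.zeros((d, r))                    # trainable up-projection [d x r], zero-init
                                        # so the adapter path starts as a no-op

x = rng.standard_normal(d)
y_lora = W @ x + B @ (A @ x)            # LoRA forward: y = Wx + BAx

W_merged = W + B @ A                    # merge after training: W' = W + BA
y_merged = W_merged @ x
assert np.allclose(y_lora, y_merged)    # identical outputs, zero inference overhead
```

Because B starts at zero, the adapted model is exactly the base model at step 0; training then moves only A and B.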
### Advantages
- 10-100x fewer trainable parameters
- Multiple LoRAs for different tasks
- Easy to swap/combine adapters
- Memory efficient
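The parameter savings are easy to check. For a single d × d projection at an illustrative 7B-class hidden size:

```python
# Trainable parameters for one d x d projection matrix
d, r = 4096, 16          # hidden size (illustrative) and LoRA rank

full = d * d             # full fine-tuning updates every entry of W
lora = r * d + d * r     # LoRA trains only A [r x d] and B [d x r]

print(full // lora)      # -> 128, i.e. 128x fewer trainable parameters here
```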
## QLoRA (Quantized LoRA)

Combine quantization with LoRA:

    Base model: 4-bit quantized (frozen)
    Adapters:   full precision (trained)

This makes it possible to fine-tune a 65B model on a single 48 GB GPU.
### Key Techniques
- 4-bit NormalFloat quantization
- Double quantization (quantize quantization constants)
- Paged optimizers (offload to CPU when needed)
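These pieces are typically wired together via Hugging Face transformers + bitsandbytes. A sketch, not a canonical recipe (the model name is a placeholder and a GPU is required):

```python
# Illustrative QLoRA base-model loading; requires transformers, bitsandbytes, a GPU
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # 4-bit NormalFloat quantization
    bnb_4bit_use_double_quant=True,     # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",         # placeholder; any causal LM works
    quantization_config=bnb_config,
)
# LoRA adapters are then attached on top of this frozen 4-bit base.
```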
## Other PEFT Methods

### Adapters

Insert small trainable modules between layers:

    Layer → Adapter → Layer → Adapter → ...
### Prefix Tuning

Learn "soft prompts": continuous prefix embeddings, injected at every layer of the model:

    [Learned prefix embeddings] + [Input tokens]
### Prompt Tuning

Similar to prefix tuning but simpler: learned embeddings are prepended only to the input embeddings, not injected at every layer.
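The core idea fits in a few lines of NumPy (toy sizes): trainable prefix vectors are concatenated in front of the frozen token embeddings before the model runs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, prefix_len, seq_len = 16, 4, 6               # toy dimensions

prefix = rng.standard_normal((prefix_len, d_model))   # trainable soft-prompt vectors
tokens = rng.standard_normal((seq_len, d_model))      # frozen input token embeddings

hidden = np.concatenate([prefix, tokens], axis=0)     # what the frozen model sees
print(hidden.shape)                                   # -> (10, 16)
```

Only `prefix` receives gradients; the model and token embeddings stay frozen.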
## Fine-tuning Process

### 1. Prepare Data

    # Instruction format
    {
        "instruction": "Summarize the following article",
        "input": "<article text>",
        "output": "<summary>"
    }

    # Chat format
    [
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."}
    ]
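The two formats are interconvertible; `to_chat` below is a hypothetical preprocessing helper (real training would additionally apply the model's chat template):

```python
# Convert an instruction-format record into chat-format messages
def to_chat(record):
    user_content = record["instruction"]
    if record.get("input"):                    # optional input field
        user_content += "\n\n" + record["input"]
    return [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": record["output"]},
    ]

example = {
    "instruction": "Summarize the following article",
    "input": "LoRA freezes the base model and trains low-rank adapters.",
    "output": "LoRA trains small adapter matrices instead of the full model.",
}
messages = to_chat(example)
print(messages[0]["role"])   # -> user
```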
### 2. Choose Method
| Scenario | Recommendation |
|---|---|
| Limited compute | QLoRA |
| Multiple tasks | LoRA (separate adapters) |
| Maximum quality | Full fine-tune (if resources permit) |
| Quick iteration | LoRA with small r |
### 3. Set Hyperparameters

    training_args = {
        "learning_rate": 2e-5,        # Much lower than pretraining
        "num_epochs": 3,              # Usually 1-5
        "batch_size": 4,              # As large as memory allows
        "gradient_accumulation": 4,   # Effective batch size = 4 × 4 = 16
        "warmup_ratio": 0.1,
        "weight_decay": 0.01,
    }

    # LoRA-specific
    lora_config = {
        "r": 16,                          # Rank
        "lora_alpha": 32,                 # Scaling (effective scale = alpha / r)
        "target_modules": ["q_proj", "v_proj"],
        "lora_dropout": 0.05,
    }
### 4. Train

    from transformers import Trainer, TrainingArguments
    from peft import LoraConfig, get_peft_model

    # The settings from step 3, as proper config objects:
    training_args = TrainingArguments(learning_rate=2e-5, num_train_epochs=3, ...)
    lora_config = LoraConfig(r=16, lora_alpha=32,
                             target_modules=["q_proj", "v_proj"], lora_dropout=0.05)

    model = get_peft_model(base_model, lora_config)
    trainer = Trainer(model=model, args=training_args, ...)
    trainer.train()
### 5. Evaluate
- Task-specific metrics (accuracy, F1, BLEU, etc.)
- Human evaluation for generation quality
- Compare to baseline (original model + prompting)
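A toy baseline comparison using exact-match accuracy (`exact_match` and all predictions here are made up for illustration; real evaluation uses task-appropriate metrics):

```python
# Exact-match accuracy, case- and whitespace-insensitive
def exact_match(preds, refs):
    hits = sum(p.strip().lower() == r.strip().lower() for p, r in zip(preds, refs))
    return hits / len(refs)

references    = ["paris", "blue", "42"]
baseline_out  = ["Paris", "red",  "41"]   # original model + prompting
finetuned_out = ["paris", "Blue", "42"]   # fine-tuned model

print(exact_match(baseline_out, references))    # -> 0.3333333333333333
print(exact_match(finetuned_out, references))   # -> 1.0
```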
## Common Challenges

### Catastrophic Forgetting

The model loses general abilities while learning the new task.
Solutions:
- Mix general data in with the task data
- Use a lower learning rate
- Prefer PEFT methods (frozen base weights forget less)
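The first idea can be sketched with a hypothetical `mix` helper that blends a slice of general-purpose data back into the task set (here roughly 10% of the final mix; the ratio is an assumption, not a rule):

```python
import random

def mix(task_data, general_data, general_ratio=0.1, seed=0):
    # How many general examples make up general_ratio of the combined set
    n_general = round(len(task_data) * general_ratio / (1 - general_ratio))
    rng = random.Random(seed)
    combined = task_data + rng.sample(general_data, min(n_general, len(general_data)))
    rng.shuffle(combined)
    return combined

task = [f"task_{i}" for i in range(90)]
general = [f"gen_{i}" for i in range(100)]
mixed = mix(task, general)
print(len(mixed))   # -> 100 (90 task + 10 general)
```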
### Overfitting

Small datasets lead to memorization.
Solutions:
- More data
- Data augmentation
- Regularization (dropout, weight decay)
- Early stopping
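Early stopping reduces to a patience check on validation loss; `should_stop` is a hypothetical helper sketching it:

```python
# Stop when validation loss has not improved for `patience` epochs
def should_stop(val_losses, patience=2):
    best_epoch = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_epoch >= patience

losses = [1.9, 1.4, 1.2, 1.25, 1.3]      # starts overfitting after epoch 2
print(should_stop(losses, patience=2))   # -> True
```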
### Data Quality
Garbage in, garbage out.
Solutions:
- Clean, deduplicate data
- Diverse examples
- Include edge cases
- Quality > quantity
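A toy version of the clean/deduplicate step (`dedupe` is a hypothetical helper; real pipelines also do near-duplicate detection, e.g. MinHash):

```python
# Drop duplicates after cheap normalization (lowercase, collapse whitespace)
def dedupe(examples):
    seen, kept = set(), []
    for ex in examples:
        key = " ".join(ex.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept

data = ["Summarize this.", "summarize   this.", "Translate to French."]
print(len(dedupe(data)))   # -> 2 (the second item is a duplicate)
```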
## Instruction Tuning vs Task Fine-tuning

### Instruction Tuning

Teach the model to follow instructions in general:
- Diverse tasks
- Natural instructions
- Creates "helpful assistant"
### Task-Specific Fine-tuning
Optimize for one specific task:
- Single task
- Consistent format
- Maximum task performance
## RLHF (Reinforcement Learning from Human Feedback)

Fine-tuning with human preferences:

1. Supervised fine-tuning (SFT) on demonstration data
2. Train a reward model on human preference comparisons
3. Optimize the policy with PPO against the reward model

This is how ChatGPT was aligned.
A simpler alternative: DPO (Direct Preference Optimization)
- No separate reward model
- Optimizes directly on preference pairs
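The DPO objective can be sketched numerically. `dpo_loss` is a hypothetical helper operating on summed log-probabilities of the chosen (w) and rejected (l) responses under the policy and the frozen reference model; beta is the usual temperature (e.g. 0.1):

```python
import math

# loss = -log sigmoid( beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)] )
def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# Policy prefers the chosen answer more than the reference does -> low loss
low  = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-6.0)
# Policy prefers the rejected answer -> higher loss
high = dpo_loss(logp_w=-9.0, logp_l=-5.0, ref_logp_w=-6.0, ref_logp_l=-6.0)
print(low < high)   # -> True
```

The loss depends only on these log-probability margins, which is why no separate reward model is needed.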
## When to Fine-tune vs Prompt
| Factor | Use Prompting | Use Fine-tuning |
|---|---|---|
| Data available | Little | Plenty |
| Task complexity | Simple | Complex |
| Budget | Low | Higher |
| Need consistency | Less critical | Critical |
| Iteration speed | Need fast | Can wait |
## Key Takeaways
- Fine-tuning adapts models to specific tasks/domains
- LoRA/QLoRA enable efficient fine-tuning on limited hardware
- Quality data is more important than quantity
- Watch for catastrophic forgetting and overfitting
- Often combine with prompting for best results
- Evaluate carefully against baselines