
Learn how to fine-tune large language models for specific tasks, from full fine-tuning to parameter-efficient methods like LoRA.

Tags: fine-tuning, lora, qlora, training, peft

Fine-tuning LLMs

Fine-tuning adapts a pretrained language model to specific tasks or domains. It's how you make general models excel at your particular use case.

Why Fine-tune?

Prompting limitations:

  • Limited context window
  • Inconsistent outputs
  • Can't learn new behaviors
  • May not match your style/format

Fine-tuning enables:

  • Task-specific optimization
  • Style and format consistency
  • Improved accuracy on domain
  • Potentially smaller, faster models

Types of Fine-tuning

Full Fine-tuning

Update all model parameters:

All weights → Training → All weights updated

Pros: Most flexibility, best potential performance
Cons: Expensive, catastrophic forgetting risk, needs lots of data

Parameter-Efficient Fine-tuning (PEFT)

Update only a small subset of parameters:

Most weights frozen → Train small adapter → Merge or use alongside

Methods: LoRA, QLoRA, Adapters, Prefix Tuning

LoRA (Low-Rank Adaptation)

The most popular PEFT method.

How It Works

Instead of updating weight matrix W, add low-rank matrices:

Original: y = Wx
LoRA:     y = Wx + BAx

Where:
- W (frozen): Original weights [d × d]
- A: Low-rank down-projection [r × d]
- B: Low-rank up-projection [d × r]
- r << d (typically r = 8 to 64)

Why It Works

The LoRA paper's key observation is that the weight updates learned during fine-tuning tend to have low intrinsic rank. LoRA exploits this:

  • Train only B and A (few parameters)
  • Captures task-specific adaptations
  • Can merge back: W' = W + BA
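The forward pass and the merge step above can be sketched in plain Python with toy dimensions (real implementations use GPU tensors; values here are arbitrary):

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

d, r = 4, 2
random.seed(0)
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]    # frozen weights [d x d]
A = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(r)]  # down-projection [r x d]
# Real LoRA initializes B to zeros; nonzero here so the merge check is non-trivial
B = [[random.gauss(0, 0.1) for _ in range(r)] for _ in range(d)]  # up-projection [d x r]
x = [1.0, 2.0, 3.0, 4.0]

# LoRA forward: y = Wx + B(Ax)
y = [wx + bax for wx, bax in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Merge for inference: W' = W + BA gives identical outputs with no extra latency
BA = matmul(B, A)
W_merged = [[w + ba for w, ba in zip(wrow, barow)] for wrow, barow in zip(W, BA)]
y_merged = matvec(W_merged, x)
assert all(abs(a - b) < 1e-9 for a, b in zip(y, y_merged))
```

Note that only A and B (2·d·r values) are trained, versus d² for the full matrix.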

Advantages

  • 10-100x fewer trainable parameters
  • Multiple LoRAs for different tasks
  • Easy to swap/combine adapters
  • Memory efficient

QLoRA (Quantized LoRA)

Combine quantization with LoRA:

Base model: 4-bit quantized (frozen)
Adapters: Full precision (trained)

Fine-tune 65B-parameter models on a single 48 GB GPU!

Key Techniques

  • 4-bit NormalFloat quantization
  • Double quantization (quantize quantization constants)
  • Paged optimizers (offload to CPU when needed)
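These techniques map directly onto the Hugging Face transformers + bitsandbytes API. A loading sketch (the model name is a placeholder; this requires a GPU and is not a complete training script):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # 4-bit NormalFloat quantization
    bnb_4bit_use_double_quant=True,      # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "base-model-name",                   # placeholder
    quantization_config=bnb_config,
)
# LoRA adapters are then attached on top of this frozen 4-bit base model.
```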

Other PEFT Methods

Adapters

Insert small trainable modules between layers:

Layer → Adapter → Layer → Adapter → ...
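An adapter is a small bottleneck network with a residual connection. A minimal sketch with made-up sizes (real adapters sit inside each transformer layer and are trained with torch):

```python
def relu(v):
    return [max(0.0, vi) for vi in v]

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def adapter(h, W_down, W_up):
    """Down-project, apply nonlinearity, up-project, then add the residual."""
    z = relu(matvec(W_down, h))   # [d] -> [bottleneck]
    out = matvec(W_up, z)         # [bottleneck] -> [d]
    return [hi + oi for hi, oi in zip(h, out)]

# Hidden size d=3, bottleneck size 2 (toy values)
W_down = [[0.1, 0.2, 0.0], [0.0, -0.1, 0.3]]
W_up = [[0.5, 0.0], [0.0, 0.5], [0.1, 0.1]]
h = [1.0, 2.0, 3.0]
h_out = adapter(h, W_down, W_up)  # same shape as h, so it slots between layers
```

Only W_down and W_up are trained; the surrounding layers stay frozen.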

Prefix Tuning

Learn "soft prompts" - continuous embeddings prepended to input:

[Learned prefix embeddings] + [Input tokens]

Prompt Tuning

Similar to prefix tuning but simpler: learned embeddings are prepended only at the input layer, rather than to every layer's activations.
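The mechanics of prompt tuning reduce to a concatenation; a toy sketch (dimensions and values are arbitrary):

```python
# Trainable soft prompt: 2 virtual tokens, embedding dim 3
# (in practice these are updated by gradient descent)
soft_prompt = [[0.1, 0.1, 0.1], [0.2, 0.2, 0.2]]

# Frozen embeddings of the actual input tokens
token_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# The model consumes the concatenated sequence; only soft_prompt is trained
model_input = soft_prompt + token_embeddings
assert len(model_input) == len(soft_prompt) + len(token_embeddings)
```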

Fine-tuning Process

1. Prepare Data

# Instruction format
{
  "instruction": "Summarize the following article",
  "input": "<article text>",
  "output": "<summary>"
}

# Chat format
[
  {"role": "user", "content": "..."},
  {"role": "assistant", "content": "..."}
]
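A hypothetical helper that renders an instruction record into a single training string (the "###" template below is illustrative, not a standard; use whatever format your base model expects):

```python
def format_example(record):
    """Render an instruction record as one prompt + response string."""
    prompt = f"### Instruction:\n{record['instruction']}\n"
    if record.get("input"):  # the input field is optional
        prompt += f"### Input:\n{record['input']}\n"
    prompt += f"### Response:\n{record['output']}"
    return prompt

example = {
    "instruction": "Summarize the following article",
    "input": "Large language models are...",
    "output": "LLMs are...",
}
text = format_example(example)
```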

2. Choose Method

Scenario            Recommendation
Limited compute     QLoRA
Multiple tasks      LoRA (separate adapters)
Maximum quality     Full fine-tune (if resources permit)
Quick iteration     LoRA with small r

3. Set Hyperparameters

training_args = {
    "learning_rate": 2e-5,      # Lower than pretraining
    "num_epochs": 3,            # Usually 1-5
    "batch_size": 4,            # As large as fits
    "gradient_accumulation": 4,  # Effective batch = 16
    "warmup_ratio": 0.1,
    "weight_decay": 0.01,
}

# LoRA specific
lora_config = {
    "r": 16,              # Rank
    "lora_alpha": 32,     # Scaling
    "target_modules": ["q_proj", "v_proj"],
    "lora_dropout": 0.05,
}

4. Train

from transformers import Trainer
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(base_model, lora_config)
trainer = Trainer(model=model, args=training_args, ...)
trainer.train()

5. Evaluate

  • Task-specific metrics (accuracy, F1, BLEU, etc.)
  • Human evaluation for generation quality
  • Compare to baseline (original model + prompting)

Common Challenges

Catastrophic Forgetting

Model loses general abilities while learning task.

Solutions:

  • Mix general data with task data
  • Use lower learning rate
  • PEFT methods (less forgetting)

Overfitting

Small datasets → memorization.

Solutions:

  • More data
  • Data augmentation
  • Regularization (dropout, weight decay)
  • Early stopping
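Early stopping from the list above can be as simple as a patience check on the validation-loss history (a minimal sketch; trainers like Hugging Face's provide this as a callback):

```python
def should_stop(val_losses, patience=2):
    """Stop when the best loss hasn't improved for `patience` epochs."""
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    return len(val_losses) - 1 - best_epoch >= patience

# Loss still improving -> keep training
assert not should_stop([1.0, 0.8, 0.7])
# Loss has gotten worse for 2 epochs since the best -> stop
assert should_stop([1.0, 0.8, 0.9, 0.95])
```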

Data Quality

Garbage in, garbage out.

Solutions:

  • Clean, deduplicate data
  • Diverse examples
  • Include edge cases
  • Quality > quantity

Instruction Tuning vs Task Fine-tuning

Instruction Tuning

Teach model to follow instructions generally:

  • Diverse tasks
  • Natural instructions
  • Creates "helpful assistant"

Task-Specific Fine-tuning

Optimize for one specific task:

  • Single task
  • Consistent format
  • Maximum task performance

RLHF (Reinforcement Learning from Human Feedback)

Fine-tuning with human preferences:

1. Supervised fine-tuning (SFT)
2. Train reward model on human preferences
3. Optimize policy with PPO against reward model

How ChatGPT was aligned!

Simpler alternative: DPO (Direct Preference Optimization)

  • No separate reward model
  • Direct optimization from preferences
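The DPO loss compares how much more the policy prefers the chosen answer over the rejected one, relative to the reference model. A toy computation (the log-probabilities are made-up numbers; real implementations get them from the policy and reference models):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r)))"""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy prefers the chosen answer more than the reference does -> loss
# falls below -log(0.5), the value at zero margin
loss = dpo_loss(logp_chosen=-5.0, logp_rejected=-9.0,
                ref_chosen=-6.0, ref_rejected=-8.0)
assert loss < math.log(2)
```

Beta plays the same role as the KL penalty in RLHF: it controls how far the policy may drift from the reference model.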

When to Fine-tune vs Prompt

Factor              Use Prompting     Use Fine-tuning
Data available      Little            Plenty
Task complexity     Simple            Complex
Budget              Low               Higher
Need consistency    Less critical     Critical
Iteration speed     Need fast         Can wait

Key Takeaways

  1. Fine-tuning adapts models to specific tasks/domains
  2. LoRA/QLoRA enable efficient fine-tuning on limited hardware
  3. Quality data is more important than quantity
  4. Watch for catastrophic forgetting and overfitting
  5. Often combine with prompting for best results
  6. Evaluate carefully against baselines
