Transfer Learning
Transfer learning uses knowledge gained from solving one problem to help solve a different but related problem. It's one of the most powerful techniques in modern deep learning.
Why Transfer Learning?
The Problem
- Deep learning needs lots of data
- Labeling data is expensive
- Training from scratch takes time and compute
- Many tasks have limited data
The Solution
- Start with a model pretrained on large dataset
- Adapt it to your specific task
- Achieve good results with less data and time
How It Works
The Intuition
Models learn hierarchical features:
Images (CNN):
Layer 1: Edges, colors
Layer 2: Textures, patterns
Layer 3: Parts (eyes, wheels)
Layer 4: Objects
Layer 5: Scenes
Text (Transformer):
Lower layers: Syntax, grammar
Middle layers: Semantics
Upper layers: Task-specific
Lower layers learn general features that transfer across tasks!
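To make this concrete, the short sketch below lists the top-level blocks of torchvision's ResNet-50 (used here only as an example backbone); the layer1-layer4 names it prints are the same ones referenced in the fine-tuning snippets later in this section.
import torchvision.models as models

# Print ResNet-50's top-level blocks: conv1/bn1/relu/maxpool (low-level stem),
# layer1-layer4 (increasingly abstract feature stages), avgpool/fc (task head)
model = models.resnet50()
for name, module in model.named_children():
    print(name, '->', type(module).__name__)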
Transfer Learning Strategies
Strategy 1: Feature Extraction
Use pretrained model as fixed feature extractor:
import torch.nn as nn
from torch.optim import Adam

# Freeze pretrained layers
for param in pretrained_model.parameters():
    param.requires_grad = False

# Add new classifier head (assumes pretrained_model outputs a flat
# feature vector of size num_features)
model = nn.Sequential(
    pretrained_model,
    nn.Linear(num_features, num_classes)
)

# Only train the new head
optimizer = Adam(model[-1].parameters(), lr=1e-3)
When to use:
- Very small dataset
- Target task very similar to source
- Limited compute
Strategy 2: Fine-tuning
Unfreeze some/all layers and train with small learning rate:
from torch.optim import Adam

# Unfreeze the last block of the backbone (e.g. ResNet's layer4)
for param in pretrained_model.layer4.parameters():
    param.requires_grad = True

# Train with a small learning rate
optimizer = Adam(model.parameters(), lr=1e-5)
When to use:
- Moderate dataset size
- Have compute resources
- Target task somewhat different from source
Strategy 3: Gradual Unfreezing
Unfreeze layers progressively during training:
Epoch 1-2: Train only head
Epoch 3-4: Unfreeze last layer, train
Epoch 5-6: Unfreeze more layers, train
...
Prevents catastrophic forgetting of pretrained features.
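A rough sketch of this schedule in PyTorch is below; model.head, model.layer4, model.layer3, num_epochs, and train_one_epoch are assumed names for illustration, not part of any specific library.
from torch.optim import Adam

# Epoch at which each block is unfrozen; the attribute names are assumptions
# about the model's layout, not a fixed API
unfreeze_schedule = {1: model.head, 3: model.layer4, 5: model.layer3}

for param in model.parameters():            # start with everything frozen
    param.requires_grad = False

for epoch in range(1, num_epochs + 1):
    if epoch in unfreeze_schedule:
        for param in unfreeze_schedule[epoch].parameters():
            param.requires_grad = True
    # Rebuild the optimizer so newly unfrozen parameters are included
    optimizer = Adam((p for p in model.parameters() if p.requires_grad), lr=1e-5)
    train_one_epoch(model, optimizer)       # hypothetical one-epoch training helper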
Fine-tuning Best Practices
Learning Rate
from torch.optim import Adam

# Use a lower learning rate than when training from scratch
lr = 1e-5          # typically 1e-5 to 1e-4, vs ~1e-3 from scratch

# Discriminative learning rates (a different lr per parameter group)
optimizer = Adam([
    {'params': model.base.parameters(), 'lr': 1e-5},
    {'params': model.head.parameters(), 'lr': 1e-3},
])
Data Augmentation
from torchvision import transforms

# Match normalization to the pretrained model's training data
transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                     std=[0.229, 0.224, 0.225])
Batch Size
- Can often use larger batch sizes with frozen layers
- Reduce batch size when fine-tuning (gradients and optimizer state for unfrozen layers need extra memory)
Computer Vision Transfer Learning
Popular Pretrained Models
| Model | Pretrained On | Use Case |
|---|---|---|
| ResNet | ImageNet | General, good baseline |
| EfficientNet | ImageNet | Efficient, accurate |
| ViT | ImageNet-21k | Strong with more data |
| CLIP | 400M image-text pairs | Zero-shot, multimodal |
Example: Image Classification
import torch.nn as nn
import torchvision.models as models

# Load pretrained ResNet (newer torchvision versions use weights=models.ResNet50_Weights.DEFAULT)
model = models.resnet50(pretrained=True)

# Replace final layer for your task
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, num_classes)

# Freeze the backbone and train only the new head (feature extraction);
# unfreeze more layers to fine-tune
for param in model.parameters():
    param.requires_grad = False
for param in model.fc.parameters():
    param.requires_grad = True
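A minimal training step for this setup might look like the sketch below; train_loader, num_classes, and the choice of Adam with cross-entropy loss are assumptions for illustration, not fixed by torchvision.
import torch.nn as nn
from torch.optim import Adam

criterion = nn.CrossEntropyLoss()
optimizer = Adam(model.fc.parameters(), lr=1e-3)   # only the new head is trainable

model.train()
for images, labels in train_loader:                # train_loader: your labeled dataset
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()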
NLP Transfer Learning
Evolution
Word2Vec (2013): Pretrained word embeddings
↓
ELMo (2018): Contextualized embeddings
↓
BERT (2018): Pretrained transformers
↓
GPT-3 (2020): Massive pretrained models
↓
ChatGPT (2022): Instruction-tuned LLMs
Modern NLP Pattern
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Load pretrained BERT
model = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Add a task-specific head on top of BERT's [CLS] representation
# (AutoModel returns a ModelOutput, so wrap it rather than using nn.Sequential)
class BertClassifier(nn.Module):
    def __init__(self, base, num_classes):
        super().__init__()
        self.base = base
        self.head = nn.Linear(base.config.hidden_size, num_classes)   # 768 for bert-base
    def forward(self, **inputs):
        return self.head(self.base(**inputs).last_hidden_state[:, 0])   # [CLS] token

classifier = BertClassifier(model, num_classes)
# Fine-tune on your data
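A quick usage sketch (the example sentence and num_classes are placeholders):
inputs = tokenizer("Transfer learning is great!", return_tensors='pt')
logits = classifier(**inputs)   # shape: (1, num_classes)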
When Transfer Learning Helps
Dataset Size Guidelines
| Your Data | Strategy |
|---|---|
| Very small (<1K) | Feature extraction only |
| Small (1K-10K) | Fine-tune top layers |
| Medium (10K-100K) | Fine-tune most/all layers |
| Large (>100K) | Fine-tune all, or train from scratch |
Domain Similarity
| Similarity | Approach |
|---|---|
| Very similar | Feature extraction works well |
| Somewhat similar | Fine-tune top layers |
| Different | Fine-tune more layers, or consider training from scratch |
Negative Transfer
When transfer learning hurts:
- Source and target domains too different
- Pretrained model captures irrelevant features
- Target dataset very large (doesn't need transfer)
Signs:
- Fine-tuned model worse than from-scratch
- Training loss doesn't decrease
Solutions:
- Try different pretrained model
- Fine-tune fewer layers
- Train from scratch
Key Takeaways
- Transfer learning leverages pretrained models for new tasks
- Lower layers learn general features that transfer well
- Feature extraction: freeze model, train new head
- Fine-tuning: unfreeze layers, use small learning rate
- More data and different domains → more fine-tuning
- Essential for modern CV and NLP (BERT, ResNet, etc.)