Data Augmentation
Data augmentation creates new training samples by applying transformations to existing data. It's one of the most effective techniques for improving model generalization, especially with limited data.
Why Augmentation Works
The Intuition
- Models should be invariant to certain transformations
- A rotated cat is still a cat
- Augmentation teaches these invariances
- Effectively increases dataset size
Benefits
- Reduces overfitting: More diverse training data
- Improves generalization: Learns true patterns, not artifacts
- Handles data scarcity: Multiplies effective dataset size
- Encodes invariances: Makes model robust to transformations
Image Augmentation
Geometric Transformations
import albumentations as A

transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
    A.Rotate(limit=15, p=0.5),
    A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.1, rotate_limit=15),
    A.RandomCrop(height=224, width=224),
    A.Perspective(p=0.3),
])
Common transforms:
- Flip (horizontal/vertical)
- Rotation
- Scaling
- Translation (shift)
- Cropping
- Perspective/affine transforms
Color Transformations
transform = A.Compose([
    A.RandomBrightnessContrast(p=0.5),
    A.HueSaturationValue(p=0.5),
    A.RGBShift(p=0.3),
    A.CLAHE(p=0.3),  # Contrast Limited Adaptive Histogram Equalization
    A.ToGray(p=0.1),
])
Noise and Blur
transform = A.Compose([
    A.GaussianBlur(blur_limit=3, p=0.3),
    A.MotionBlur(p=0.2),
    A.GaussNoise(p=0.3),
    A.ISONoise(p=0.2),
])
Cutout / Random Erasing
transform = A.Compose([
    A.CoarseDropout(max_holes=8, max_height=16, max_width=16, p=0.5),
])
Randomly masks out rectangular regions, which forces the model to rely on multiple features rather than a single discriminative patch.
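The same idea is only a few lines in plain NumPy; a minimal sketch (the function name random_erase and its defaults are illustrative, not a library API):

```python
import numpy as np

def random_erase(img, max_holes=8, max_size=16, rng=None):
    """Zero out up to max_holes small rectangles (Cutout-style).

    img is an H x W x C array; returns an augmented copy.
    """
    rng = rng or np.random.default_rng()
    out = img.copy()
    h, w = img.shape[:2]
    for _ in range(rng.integers(1, max_holes + 1)):
        hh = int(rng.integers(1, max_size + 1))   # hole height
        ww = int(rng.integers(1, max_size + 1))   # hole width
        y = int(rng.integers(0, h - hh + 1))
        x = int(rng.integers(0, w - ww + 1))
        out[y:y + hh, x:x + ww] = 0
    return out
```

In practice the fill value can also be noise or the dataset mean instead of zero.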
MixUp
Blend two images and their labels:
lambda_ = np.random.beta(alpha, alpha)  # mixing coefficient, lambda ~ Beta(alpha, alpha)
image = lambda_ * image1 + (1 - lambda_) * image2
label = lambda_ * label1 + (1 - lambda_) * label2
Labels become soft (e.g., 0.7 cat, 0.3 dog).
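Applied to a whole batch, each sample is typically blended with a randomly shuffled partner; a minimal NumPy sketch assuming one-hot labels:

```python
import numpy as np

def mixup_batch(images, labels, alpha=0.2, rng=None):
    """MixUp a batch: blend each sample with a random partner.

    images is (N, H, W, C); labels is one-hot (N, num_classes).
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)            # one mixing coefficient per batch
    perm = rng.permutation(len(images))     # random pairing of samples
    mixed_x = lam * images + (1 - lam) * images[perm]
    mixed_y = lam * labels + (1 - lam) * labels[perm]
    return mixed_x, mixed_y
```

Training then uses the soft labels directly with a cross-entropy loss.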
CutMix
Paste patch from one image onto another:
# Cut patch from image2, paste onto image1
image1[y1:y2, x1:x2] = image2[y1:y2, x1:x2]
label = lambda_ * label1 + (1 - lambda_) * label2
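A fuller sketch in NumPy, where the patch size is derived from lambda and lambda is then recomputed from the actual (clipped) patch area:

```python
import numpy as np

def cutmix(img1, label1, img2, label2, alpha=1.0, rng=None):
    """Paste a random rectangle from img2 onto img1; mix labels by area."""
    rng = rng or np.random.default_rng()
    h, w = img1.shape[:2]
    lam = rng.beta(alpha, alpha)
    # Choose patch dimensions so its area is roughly (1 - lam) of the image
    cut_h = int(h * np.sqrt(1 - lam))
    cut_w = int(w * np.sqrt(1 - lam))
    cy, cx = int(rng.integers(0, h)), int(rng.integers(0, w))  # patch center
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    out = img1.copy()
    out[y1:y2, x1:x2] = img2[y1:y2, x1:x2]
    # Recompute lambda from the clipped patch so labels match the pixels
    lam = 1 - ((y2 - y1) * (x2 - x1)) / (h * w)
    return out, lam * label1 + (1 - lam) * label2
```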
AutoAugment / RandAugment
Learned or random sequences of augmentations:
from torchvision.transforms import RandAugment
transform = RandAugment(num_ops=2, magnitude=9)
Text Augmentation
Synonym Replacement
"The quick brown fox" → "The fast brown fox"
Random Insertion
"The quick fox" → "The very quick fox"
Random Swap
"I love cats" → "cats love I"
Random Deletion
"The quick brown fox" → "The brown fox"
Back-Translation
# English → French → English
"Hello world" → "Bonjour le monde" → "Hello everyone"
EDA (Easy Data Augmentation)
from eda import eda
augmented = eda(sentence, alpha_sr=0.1, alpha_ri=0.1, alpha_rs=0.1, p_rd=0.1, num_aug=4)
LLM-based Augmentation
prompt = f"Paraphrase this sentence: '{sentence}'"
augmented = llm.generate(prompt)
Audio Augmentation
# Time stretching
audio_stretched = librosa.effects.time_stretch(audio, rate=1.2)
# Pitch shifting
audio_shifted = librosa.effects.pitch_shift(audio, sr=sr, n_steps=4)
# Add noise
audio_noisy = audio + 0.005 * np.random.randn(len(audio))
# Time masking (SpecAugment)
spec[t1:t2, :] = 0
# Frequency masking (SpecAugment)
spec[:, f1:f2] = 0
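The two SpecAugment masks above can be combined in one small function; a NumPy sketch (mask widths and the function name are illustrative):

```python
import numpy as np

def spec_augment(spec, max_t=20, max_f=10, rng=None):
    """Zero out one random time band and one frequency band.

    spec is a (time, freq) spectrogram; returns an augmented copy.
    """
    rng = rng or np.random.default_rng()
    out = spec.copy()
    T, F = spec.shape
    t = int(rng.integers(0, max_t + 1))       # time-mask width
    t0 = int(rng.integers(0, T - t + 1))      # time-mask start
    f = int(rng.integers(0, max_f + 1))       # frequency-mask width
    f0 = int(rng.integers(0, F - f + 1))      # frequency-mask start
    out[t0:t0 + t, :] = 0   # time masking
    out[:, f0:f0 + f] = 0   # frequency masking
    return out
```

The original SpecAugment recipe applies several such masks plus a time warp; this shows just the masking step.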
Tabular Data Augmentation
Less common but possible:
SMOTE (for imbalanced data)
from imblearn.over_sampling import SMOTE
X_aug, y_aug = SMOTE().fit_resample(X, y)
Noise Injection
X_aug = X + np.random.normal(0, 0.01, X.shape)
Feature Mixup
# Similar to image MixUp
X_aug = lambda_ * X1 + (1 - lambda_) * X2
Best Practices
Match Domain
# Medical images: Usually no horizontal flip (anatomy matters)
# Satellite images: All rotations valid
# Text: Back-translation better than random swaps
Don't Augment Validation Set
train_transform = A.Compose([...augmentations...])
val_transform = A.Compose([A.Resize(224, 224)]) # No augmentation!
Augmentation Intensity
- Start mild, increase if overfitting
- Too strong can hurt performance
- Task-dependent (classification vs detection)
Online vs Offline
Online (during training):
for batch in dataloader:
    augmented = transform(batch)  # Different each epoch
Offline (precompute):
for img in dataset:
    for i in range(5):
        save(transform(img))
Online is usually preferred (more diversity).
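The difference shows up in a toy dataset wrapper: online augmentation applies a fresh random transform on every fetch, so the same index yields a different view each epoch (an illustrative sketch, not any specific framework's API):

```python
import numpy as np

class OnlineAugmentedDataset:
    """Applies a fresh random transform each time a sample is fetched."""

    def __init__(self, images, transform, seed=0):
        self.images = images
        self.transform = transform
        self.rng = np.random.default_rng(seed)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        # New randomness on every access -> different view each epoch
        return self.transform(self.images[i], self.rng)

def jitter(img, rng):
    """Example transform: add small Gaussian noise."""
    return img + rng.normal(0, 0.01, img.shape)

ds = OnlineAugmentedDataset(np.zeros((10, 8, 8)), jitter)
a, b = ds[0], ds[0]  # two different augmented views of the same sample
```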
Common Pipeline
import albumentations as A
from albumentations.pytorch import ToTensorV2
train_transform = A.Compose([
    A.RandomResizedCrop(224, 224, scale=(0.8, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    A.GaussNoise(p=0.2),
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ToTensorV2(),
])
val_transform = A.Compose([
    A.Resize(256, 256),
    A.CenterCrop(224, 224),
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ToTensorV2(),
])
Key Takeaways
- Augmentation artificially expands training data
- Teaches model invariances to transformations
- Image: flips, rotations, color jitter, cutout, mixup
- Text: synonym replacement, back-translation
- Don't augment validation/test sets
- Match augmentation to domain constraints
- Online augmentation provides more diversity