# Handling Imbalanced Data
Imbalanced data occurs when classes have very different numbers of samples. It's extremely common in real-world problems like fraud detection, disease diagnosis, and anomaly detection.
## The Problem

**Example: Fraud Detection**

- 99.9% legitimate transactions
- 0.1% fraudulent transactions

A model that predicts "not fraud" for every transaction achieves 99.9% accuracy but catches zero fraud!
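This accuracy paradox is easy to reproduce. The sketch below uses synthetic labels (999 legitimate, 1 fraudulent; the numbers are illustrative) and scikit-learn's `DummyClassifier` as the always-predict-"not fraud" model:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 999 legitimate (0), 1 fraudulent (1) -- illustrative numbers
rng = np.random.default_rng(42)
y = np.array([0] * 999 + [1])
X = rng.normal(size=(1000, 4))  # features are irrelevant for this demo

# A "classifier" that always predicts the majority class
model = DummyClassifier(strategy="most_frequent")
model.fit(X, y)
y_pred = model.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.3f}")  # looks great
print(f"Recall:   {recall_score(y, y_pred):.3f}")    # catches no fraud
```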
### Why It Matters

- Standard algorithms optimize for the majority class
- The minority class gets ignored
- High accuracy masks poor minority-class performance
## Measuring Performance

### Don't Use Accuracy

```text
Accuracy       = 99.9%  ← misleading!
Recall (fraud) = 0%     ← reality
```
### Better Metrics

- Precision and recall for the minority class
- F1 score (harmonic mean of precision and recall)
- PR-AUC (area under the precision-recall curve)
- ROC-AUC (can look deceptively good under severe imbalance)
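A quick sketch of computing these metrics with scikit-learn; the ~5%-positive synthetic dataset and the logistic regression model are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (f1_score, average_precision_score,
                             roc_auc_score, classification_report)

# Imbalanced synthetic dataset (about 5% positives)
X, y = make_classification(n_samples=5000, weights=[0.95], flip_y=0,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred, digits=3))  # per-class P/R/F1
print("F1:     ", f1_score(y_test, y_pred))
print("PR-AUC: ", average_precision_score(y_test, y_proba))
print("ROC-AUC:", roc_auc_score(y_test, y_proba))
```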
## Resampling Strategies

### Oversampling: Increase the Minority Class

**Random oversampling:**

```python
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
```

Simple, but the exact duplicates it creates can lead to overfitting.
**SMOTE (Synthetic Minority Oversampling Technique):**

```python
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
```

SMOTE creates synthetic samples by interpolating between a minority sample and one of its minority-class nearest neighbors:

```text
Existing points:  A ——— B
SMOTE creates:    A — × — B
```
**SMOTE variants:**

- **Borderline-SMOTE**: synthesizes samples only near the decision boundary
- **ADASYN**: generates more synthetic samples in regions that are harder to learn
- **SMOTE-NC**: handles datasets with a mix of continuous and categorical features
### Undersampling: Reduce the Majority Class

**Random undersampling:**

```python
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)
```

Simple, but discards potentially useful majority-class information.
- **Tomek links**: remove majority samples that form cross-class mutual-nearest-neighbor pairs with minority samples.
- **Edited Nearest Neighbors (ENN)**: remove majority samples that a KNN classifier misclassifies.
- **NearMiss**: keep only the majority samples closest to minority samples.
## Combined Approaches

**SMOTEENN** (SMOTE + Edited Nearest Neighbors):

```python
from imblearn.combine import SMOTEENN

smoteenn = SMOTEENN(random_state=42)
X_resampled, y_resampled = smoteenn.fit_resample(X, y)
```

**SMOTETomek** (SMOTE + Tomek links): oversamples with SMOTE, then cleans the boundary by removing Tomek links.
## Algorithm-Level Approaches

### Class Weights

Tell the algorithm to penalize errors on the minority class more heavily:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Scikit-learn: weight classes inversely to their frequency
model = LogisticRegression(class_weight='balanced')
model = RandomForestClassifier(class_weight='balanced')

# Manual weights: errors on class 1 (the minority) cost 10x more
weights = {0: 1, 1: 10}
model = RandomForestClassifier(class_weight=weights)

# Or compute balanced weights from the data
weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
```
### Cost-Sensitive Learning

Assign explicit misclassification costs:

```text
                 Predicted
                 Neg   Pos
Actual   Neg      0     1    (FP cost)
         Pos     10     0    (FN cost: 10x higher!)
```
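scikit-learn has no cost-matrix argument, but the same effect can be sketched with per-sample weights: give positives a weight equal to the FN/FP cost ratio. The data and model below are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0,
                           random_state=42)

# Encode the cost matrix above as sample weights: a missed positive (FN)
# costs 10x a false alarm (FP), so positive samples get weight 10
sample_weight = np.where(y == 1, 10.0, 1.0)

plain = LogisticRegression(max_iter=1000).fit(X, y)
costly = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=sample_weight)

print("Recall, unweighted:    ", recall_score(y, plain.predict(X)))
print("Recall, cost-sensitive:", recall_score(y, costly.predict(X)))
```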
### Threshold Adjustment

The default decision threshold is 0.5. For imbalanced data, adjusting it often helps:

```python
# Get predicted probabilities for the positive class
y_proba = model.predict_proba(X_test)[:, 1]

# Lower the threshold to catch more positives (at the cost of more false alarms)
threshold = 0.3
y_pred = (y_proba >= threshold).astype(int)
```

Use the precision-recall curve to find the threshold that best trades off precision against recall.
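One way to do that, sketched on synthetic data (the dataset and model are illustrative): scan every threshold on the PR curve and keep the one with the best F1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve

X, y = make_classification(n_samples=3000, weights=[0.9], flip_y=0,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

# precision[i], recall[i] correspond to predicting positive when
# y_proba >= thresholds[i]; the final PR point has no threshold
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = int(np.argmax(f1[:-1]))
print(f"Best threshold: {thresholds[best]:.3f}  F1: {f1[best]:.3f}")

y_pred = (y_proba >= thresholds[best]).astype(int)
```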
## Ensemble Methods for Imbalanced Data

### BalancedRandomForest

Undersamples the majority class within each bootstrap sample:

```python
from imblearn.ensemble import BalancedRandomForestClassifier

model = BalancedRandomForestClassifier(n_estimators=100)
```

### EasyEnsemble

Trains multiple classifiers on different balanced subsets:

```python
from imblearn.ensemble import EasyEnsembleClassifier

model = EasyEnsembleClassifier(n_estimators=10)
```

### RUSBoost

Combines random undersampling with AdaBoost.
## When to Use What
| Situation | Approach |
|---|---|
| Moderate imbalance (1:10) | Class weights |
| Severe imbalance (1:100+) | SMOTE + class weights |
| Very small minority (<100) | Careful oversampling, anomaly detection |
| Large dataset | Undersampling + ensemble |
| Extreme imbalance (1:1000+) | Consider anomaly detection |
## Important Considerations

### Resampling in Cross-Validation

Resampling before the split leaks information: synthetic samples derived from test points end up in the training data.

```python
# WRONG: resampling before the split leaks information across CV folds
X_resampled, y_resampled = SMOTE().fit_resample(X, y)
scores = cross_val_score(model, X_resampled, y_resampled)  # leakage!

# RIGHT: resample inside each fold via imblearn's Pipeline
from imblearn.pipeline import Pipeline  # not sklearn.pipeline.Pipeline!

pipeline = Pipeline([
    ('smote', SMOTE()),
    ('classifier', RandomForestClassifier())
])
scores = cross_val_score(pipeline, X, y)
```
### Validation Set

Keep the validation set at the real-world (imbalanced) class distribution so it measures true performance; never resample it.

### Stratified Splits

Always use stratified sampling so every split preserves the class ratio:

```python
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5)
```
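For a simple hold-out split, `train_test_split(..., stratify=y)` does the same job. A sketch on synthetic 5%-positive data (illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0,
                           random_state=42)

# stratify=y keeps the class ratio identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print("Train positive rate:", round(y_train.mean(), 3))
print("Test positive rate: ", round(y_test.mean(), 3))
```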
## Anomaly Detection Approach

For extreme imbalance, treat the minority class as anomalies and model only the majority class:

```python
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

# Train on the majority class only; contamination is the expected anomaly fraction
model = IsolationForest(contamination=0.01)
model.fit(X_majority)
```
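`IsolationForest.predict` returns +1 for inliers and -1 for anomalies, so the output needs mapping back to class labels. A self-contained sketch on synthetic data (parameters are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

X, y = make_classification(n_samples=2000, weights=[0.98], flip_y=0,
                           random_state=42)
X_majority = X[y == 0]

# Fit on the majority class only; contamination = expected anomaly fraction
model = IsolationForest(contamination=0.02, random_state=42).fit(X_majority)

# predict() returns +1 (inlier) / -1 (anomaly); map -1 to the positive class
raw = model.predict(X)
y_pred = (raw == -1).astype(int)
print("Flagged as anomalies:", int(y_pred.sum()))
```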
## Practical Pipeline

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(class_weight='balanced'))
])
pipeline.fit(X_train, y_train)
```
## Key Takeaways

- Don't use accuracy for imbalanced data; use F1 or PR-AUC instead
- Class weights are simple and often effective
- SMOTE creates synthetic minority samples
- Always resample inside cross-validation, never before splitting
- Keep validation/test sets at the original distribution
- Combine approaches (SMOTE + class weights + threshold tuning)
- For extreme imbalance, consider anomaly detection