# Class Imbalance

Class imbalance occurs when the distribution of classes in a dataset is highly skewed, with some classes having far more samples than others. It is common in fraud detection, medical diagnosis, and anomaly detection.
## The Problem

```text
Balanced Dataset:          Imbalanced Dataset:
Class A: ████████ 50%      Class A (Normal): █████████████████ 99%
Class B: ████████ 50%      Class B (Fraud):  █ 1%
```

A naive classifier can achieve 99% accuracy by always predicting the majority class!
## Why It Matters

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# With 99% negative, 1% positive (fraud):
y_test = np.array([0] * 99 + [1])

# Naive classifier: always predict "not fraud"
y_pred = np.zeros_like(y_test)

accuracy_score(y_test, y_pred)                    # 0.99: 99% accuracy!
precision_score(y_test, y_pred, zero_division=0)  # 0.0: no frauds detected
recall_score(y_test, y_pred)                      # 0.0: missed all frauds
```

Accuracy is misleading for imbalanced data.
Solution Categories
┌─────────────────────────────────────────────────────────┐
│ Handling Class Imbalance │
├───────────────┬───────────────┬───────────────┬─────────┤
│ Data Level │ Algorithm │ Cost-Sensitive│ Metric │
│ │ Level │ │ │
├───────────────┼───────────────┼───────────────┼─────────┤
│ Oversampling │ Class weights │ Custom loss │ Use F1, │
│ Undersampling │ Ensemble │ Threshold │ AUC-ROC │
│ SMOTE │ One-class │ tuning │ PR-AUC │
└───────────────┴───────────────┴───────────────┴─────────┘
## Data-Level Solutions

### Random Oversampling

Duplicate minority class samples:

```python
import numpy as np
from sklearn.utils import resample

# Separate classes
X_majority = X[y == 0]
X_minority = X[y == 1]

# Upsample minority with replacement until it matches the majority size
X_minority_upsampled = resample(
    X_minority,
    replace=True,               # Sample with replacement
    n_samples=len(X_majority),  # Match majority count
    random_state=42,
)

# Combine
X_balanced = np.vstack([X_majority, X_minority_upsampled])
y_balanced = np.hstack([np.zeros(len(X_majority)),
                        np.ones(len(X_minority_upsampled))])
```
### Random Undersampling

Remove majority class samples:

```python
from sklearn.utils import resample

# Downsample majority without replacement to match the minority size
X_majority_downsampled = resample(
    X_majority,
    replace=False,
    n_samples=len(X_minority),
    random_state=42,
)

X_balanced = np.vstack([X_majority_downsampled, X_minority])
```

**Warning:** Discards information from the majority class.
### SMOTE (Synthetic Minority Oversampling)

Create synthetic samples by interpolating between minority-class neighbors:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

print(f"Original:  {Counter(y)}")
print(f"Resampled: {Counter(y_resampled)}")
```
How SMOTE works:

1. Pick a minority-class sample.
2. Find its k nearest minority-class neighbors.
3. Create a synthetic point on the line segment between the sample and a randomly chosen neighbor.

```text
○ minority sample
│
│←──×──→ ○ neighbor
    ↑
    synthetic sample
```
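The interpolation step can be sketched in a few lines of NumPy. This is a toy illustration of the idea (random neighbor, random interpolation factor), not imblearn's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class samples (2D for readability)
X_minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def smote_point(X_min, k=2, rng=rng):
    """Generate one synthetic sample by interpolating toward a random neighbor."""
    i = rng.integers(len(X_min))
    sample = X_min[i]

    # Indices of the k nearest minority neighbors (excluding the sample itself)
    dists = np.linalg.norm(X_min - sample, axis=1)
    neighbors = np.argsort(dists)[1:k + 1]
    neighbor = X_min[rng.choice(neighbors)]

    # Synthetic point somewhere on the segment between sample and neighbor
    lam = rng.random()
    return sample + lam * (neighbor - sample)

synthetic = smote_point(X_minority)
print(synthetic)  # Lies between two existing minority samples
```

Because the synthetic point is a convex combination of two real minority samples, it always falls inside the minority class's convex hull.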
### SMOTE Variants

```python
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN

# Borderline-SMOTE: focus on minority samples near the decision boundary
borderline_smote = BorderlineSMOTE(random_state=42)

# ADASYN: adaptively generate more samples where the minority class is harder to learn
adasyn = ADASYN(random_state=42)
```
### Combined Sampling

```python
from imblearn.combine import SMOTETomek, SMOTEENN

# SMOTE + Tomek links (removes overlapping majority/minority pairs)
smote_tomek = SMOTETomek(random_state=42)
X_res, y_res = smote_tomek.fit_resample(X, y)

# SMOTE + Edited Nearest Neighbours (more aggressive cleaning)
smote_enn = SMOTEENN(random_state=42)
X_res, y_res = smote_enn.fit_resample(X, y)
```
## Algorithm-Level Solutions

### Class Weights

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Automatically balance: weights inversely proportional to class frequencies
model = LogisticRegression(class_weight='balanced')
model = RandomForestClassifier(class_weight='balanced')

# Manual weights
model = LogisticRegression(
    class_weight={0: 1, 1: 99}  # Weight the minority class 99x more
)
```
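What `'balanced'` does under the hood: each class gets weight `n_samples / (n_classes * count_of_class)`. A quick check with sklearn's `compute_class_weight`, which implements exactly this heuristic, on a toy 99:1 label array:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 99:1 imbalanced labels
y = np.array([0] * 990 + [1] * 10)

weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))
# Class 0: 1000 / (2 * 990) ≈ 0.505
# Class 1: 1000 / (2 * 10)  = 50.0
```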
### Weighted Loss in Deep Learning

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Calculate weights inversely proportional to class counts
class_counts = [9900, 100]  # 99% vs 1%
weights = 1 / torch.tensor(class_counts, dtype=torch.float)
weights = weights / weights.sum()  # Normalize

# Weighted cross-entropy
criterion = nn.CrossEntropyLoss(weight=weights)

# Or focal loss (down-weights easy, well-classified examples)
class FocalLoss(nn.Module):
    def __init__(self, alpha=1, gamma=2):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, inputs, targets):
        ce_loss = F.cross_entropy(inputs, targets, reduction='none')
        pt = torch.exp(-ce_loss)  # Probability assigned to the true class
        focal_loss = self.alpha * (1 - pt) ** self.gamma * ce_loss
        return focal_loss.mean()
```
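To see why the focal term `(1 - pt) ** gamma` helps: a confidently correct (easy) example contributes almost nothing to the loss, while a hard example keeps most of its weight. A plain-Python sketch with gamma = 2:

```python
def focal_weight(pt, gamma=2.0):
    """Modulating factor the focal loss applies to the cross-entropy term."""
    return (1 - pt) ** gamma

# Easy example: model already assigns 0.99 to the true class
easy = focal_weight(0.99)  # (1 - 0.99)^2 = 0.0001
# Hard example: model assigns only 0.10 to the true class
hard = focal_weight(0.10)  # (1 - 0.10)^2 = 0.81

print(f"easy-example weight: {easy:.4f}")
print(f"hard-example weight: {hard:.2f}")
print(f"hard/easy ratio: {hard / easy:.0f}x")
```

With thousands of easy majority-class examples, this rescaling keeps them from dominating the gradient.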
### Ensemble Methods

```python
from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier

# Balanced random forest (undersamples the majority class for each tree)
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
brf.fit(X_train, y_train)

# Easy Ensemble (AdaBoost learners trained on balanced subsets)
easy_ensemble = EasyEnsembleClassifier(n_estimators=10, random_state=42)
easy_ensemble.fit(X_train, y_train)
```
### Threshold Tuning

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Get positive-class probabilities
y_probs = model.predict_proba(X_test)[:, 1]

# Find the threshold that maximizes F1
precision, recall, thresholds = precision_recall_curve(y_test, y_probs)
f1_scores = 2 * (precision * recall) / (precision + recall + 1e-8)
best_threshold = thresholds[np.argmax(f1_scores[:-1])]  # Last PR point has no threshold

print("Default threshold: 0.5")
print(f"Optimal threshold: {best_threshold:.3f}")

# Apply the tuned threshold
y_pred_optimized = (y_probs >= best_threshold).astype(int)
```
## Evaluation Metrics

### Use These Instead of Accuracy

```python
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    average_precision_score,
    f1_score,
)

# Confusion matrix
print(confusion_matrix(y_test, y_pred))
#              Predicted
#              Neg   Pos
# Actual Neg  [TN]  [FP]
#        Pos  [FN]  [TP]

# Per-class precision, recall, and F1
print(classification_report(y_test, y_pred))

# AUC-ROC (threshold-independent)
roc_auc = roc_auc_score(y_test, y_probs)

# Precision-recall AUC (better for imbalanced data)
pr_auc = average_precision_score(y_test, y_probs)

# F1 score
f1 = f1_score(y_test, y_pred)
```
### Metric Comparison for Imbalanced Data
| Metric | Good For Imbalanced? | Why |
|---|---|---|
| Accuracy | No | Dominated by majority |
| Precision | Yes | Focuses on positive predictions |
| Recall | Yes | Focuses on finding positives |
| F1 | Yes | Balances precision/recall |
| AUC-ROC | Somewhat | Can be misleading if very imbalanced |
| PR-AUC | Yes | Best for high imbalance |
## Practical Guidelines

### Mild Imbalance (80:20 to 90:10)

```python
# Try class weights first
model = RandomForestClassifier(class_weight='balanced')
```

### Moderate Imbalance (95:5 to 99:1)

```python
# SMOTE + class weights
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)

model = RandomForestClassifier(class_weight='balanced')
model.fit(X_res, y_res)
```

### Severe Imbalance (>99:1)

```python
# Ensemble methods + threshold tuning + focal loss
model = EasyEnsembleClassifier(n_estimators=20, random_state=42)

# Or reframe the problem as anomaly detection
from sklearn.ensemble import IsolationForest
```
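A minimal sketch of the anomaly-detection route, on an assumed toy dataset from `make_classification`: `IsolationForest` is trained without labels, and its outlier predictions are mapped onto the rare "fraud" class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

# Toy data: ~99% normal, ~1% "fraud"
X, y = make_classification(n_samples=2000, weights=[0.99], random_state=42)

# contamination = expected fraction of anomalies in the data
iso = IsolationForest(contamination=0.01, random_state=42)
pred = iso.fit_predict(X)          # +1 = inlier, -1 = outlier

y_pred = (pred == -1).astype(int)  # Map outliers to the "fraud" label
print(f"Flagged as anomalies: {y_pred.sum()} of {len(y_pred)}")
```

Note the model never sees `y`; whether the flagged outliers align with actual fraud depends on how separable the minority class is.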
## Common Mistakes

### 1. Resampling Before the Split

```python
from sklearn.model_selection import train_test_split

# WRONG: data leakage! Synthetic points derived from test samples end up in training
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res)

# CORRECT: resample only the training data
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
# The test set remains untouched!
```
### 2. Ignoring Stratification in Cross-Validation

```python
# Use stratified CV to maintain class proportions in every fold
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='f1')
```
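These two mistakes interact: any resampling must happen inside each fold, on that fold's training portion only. A sketch using plain sklearn with random oversampling per fold, on an assumed toy dataset (imblearn's `Pipeline` automates the same pattern for SMOTE):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Toy imbalanced dataset: ~95% negative, ~5% positive
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_f1 = []

for train_idx, test_idx in cv.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_te, y_te = X[test_idx], y[test_idx]

    # Oversample the minority class *inside* the fold, on training data only
    X_maj, y_maj = X_tr[y_tr == 0], y_tr[y_tr == 0]
    X_min, y_min = X_tr[y_tr == 1], y_tr[y_tr == 1]
    X_min_up, y_min_up = resample(
        X_min, y_min, replace=True, n_samples=len(X_maj), random_state=42
    )
    X_bal = np.vstack([X_maj, X_min_up])
    y_bal = np.hstack([y_maj, y_min_up])

    model = LogisticRegression(max_iter=1000)
    model.fit(X_bal, y_bal)
    fold_f1.append(f1_score(y_te, model.predict(X_te)))

print(f"Per-fold F1: {np.round(fold_f1, 3)}")
print(f"Mean F1: {np.mean(fold_f1):.3f}")
```

Each fold's test portion stays untouched by resampling, so the scores estimate performance on the true class distribution.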
## Key Takeaways

- Don't use accuracy for imbalanced data; use F1, PR-AUC, or AUC-ROC
- Try `class_weight='balanced'` as a first approach
- SMOTE creates synthetic samples; apply it only to training data
- Threshold tuning can significantly improve performance
- For severe imbalance, consider anomaly detection approaches
- Always maintain class proportions in cross-validation (stratified splits)