# Handling Imbalanced Data
Imbalanced data occurs when classes have very different numbers of samples. It's extremely common in real-world problems like fraud detection, disease diagnosis, and anomaly detection.
## The Problem

**Example: Fraud Detection**

- 99.9% legitimate transactions
- 0.1% fraudulent transactions

A model that predicts "not fraud" for every transaction achieves 99.9% accuracy but catches zero fraud!
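This accuracy paradox is easy to reproduce. The sketch below uses synthetic labels (999 legitimate, 1 fraudulent; the numbers are illustrative) and scikit-learn's `DummyClassifier` as the always-predict-"not fraud" model:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 999 legitimate (0), 1 fraudulent (1) -- illustrative numbers
rng = np.random.default_rng(42)
y = np.array([0] * 999 + [1])
X = rng.normal(size=(1000, 4))  # features are irrelevant for this demo

# A "classifier" that always predicts the majority class
model = DummyClassifier(strategy="most_frequent")
model.fit(X, y)
y_pred = model.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.3f}")  # looks great
print(f"Recall:   {recall_score(y, y_pred):.3f}")    # catches no fraud
```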
### Why It Matters

- Standard algorithms optimize for the majority class
- The minority class gets ignored
- High accuracy masks poor minority-class performance
## Measuring Performance

### Don't Use Accuracy

```text
Accuracy       = 99.9%  ← misleading!
Recall (fraud) = 0%     ← reality
```
### Better Metrics

- Precision and recall for the minority class
- F1 score (harmonic mean of precision and recall)
- PR-AUC (area under the precision-recall curve)
- ROC-AUC (can look deceptively good under severe imbalance)
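A quick sketch of computing these metrics with scikit-learn; the ~5%-positive synthetic dataset and the logistic regression model are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (f1_score, average_precision_score,
                             roc_auc_score, classification_report)

# Imbalanced synthetic dataset (about 5% positives)
X, y = make_classification(n_samples=5000, weights=[0.95], flip_y=0,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred, digits=3))  # per-class P/R/F1
print("F1:     ", f1_score(y_test, y_pred))
print("PR-AUC: ", average_precision_score(y_test, y_proba))
print("ROC-AUC:", roc_auc_score(y_test, y_proba))
```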
## Resampling Strategies

### Oversampling: Increase the Minority Class

**Random oversampling:**

```python
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
```

Simple, but the exact duplicates it creates can lead to overfitting.
**SMOTE (Synthetic Minority Oversampling Technique):**

```python
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
```

SMOTE creates synthetic samples by interpolating between a minority sample and one of its minority-class nearest neighbors:

```text
Existing points:  A ——— B
SMOTE creates:    A — × — B
```
**SMOTE variants:**

- **Borderline-SMOTE**: synthesizes samples only near the decision boundary
- **ADASYN**: generates more synthetic samples in regions that are harder to learn
- **SMOTE-NC**: handles datasets with a mix of continuous and categorical features
### Undersampling: Reduce the Majority Class

**Random undersampling:**

```python
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)
```

Simple, but discards potentially useful majority-class information.
- **Tomek links**: remove majority samples that form cross-class mutual-nearest-neighbor pairs with minority samples.
- **Edited Nearest Neighbors (ENN)**: remove majority samples that a KNN classifier misclassifies.
- **NearMiss**: keep only the majority samples closest to minority samples.
## Combined Approaches

**SMOTEENN** (SMOTE + Edited Nearest Neighbors):

```python
from imblearn.combine import SMOTEENN

smoteenn = SMOTEENN(random_state=42)
X_resampled, y_resampled = smoteenn.fit_resample(X, y)
```

**SMOTETomek** (SMOTE + Tomek links): oversamples with SMOTE, then cleans the boundary by removing Tomek links.
## Algorithm-Level Approaches

### Class Weights

Tell the algorithm to penalize errors on the minority class more heavily:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Scikit-learn: weight classes inversely to their frequency
model = LogisticRegression(class_weight='balanced')
model = RandomForestClassifier(class_weight='balanced')

# Manual weights: errors on class 1 (the minority) cost 10x more
weights = {0: 1, 1: 10}
model = RandomForestClassifier(class_weight=weights)

# Or compute balanced weights from the data
weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
```
### Cost-Sensitive Learning

Assign explicit misclassification costs:

```text
                 Predicted
                 Neg   Pos
Actual   Neg      0     1    (FP cost)
         Pos     10     0    (FN cost: 10x higher!)
```
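scikit-learn has no cost-matrix argument, but the same effect can be sketched with per-sample weights: give positives a weight equal to the FN/FP cost ratio. The data and model below are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0,
                           random_state=42)

# Encode the cost matrix above as sample weights: a missed positive (FN)
# costs 10x a false alarm (FP), so positive samples get weight 10
sample_weight = np.where(y == 1, 10.0, 1.0)

plain = LogisticRegression(max_iter=1000).fit(X, y)
costly = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=sample_weight)

print("Recall, unweighted:    ", recall_score(y, plain.predict(X)))
print("Recall, cost-sensitive:", recall_score(y, costly.predict(X)))
```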
### Threshold Adjustment

The default decision threshold is 0.5. For imbalanced data, adjusting it often helps:

```python
# Get predicted probabilities for the positive class
y_proba = model.predict_proba(X_test)[:, 1]

# Lower the threshold to catch more positives (at the cost of more false alarms)
threshold = 0.3
y_pred = (y_proba >= threshold).astype(int)
```

Use the precision-recall curve to find the threshold that best trades off precision against recall.
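One way to do that, sketched on synthetic data (the dataset and model are illustrative): scan every threshold on the PR curve and keep the one with the best F1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve

X, y = make_classification(n_samples=3000, weights=[0.9], flip_y=0,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

# precision[i], recall[i] correspond to predicting positive when
# y_proba >= thresholds[i]; the final PR point has no threshold
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = int(np.argmax(f1[:-1]))
print(f"Best threshold: {thresholds[best]:.3f}  F1: {f1[best]:.3f}")

y_pred = (y_proba >= thresholds[best]).astype(int)
```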
## Ensemble Methods for Imbalanced Data

### BalancedRandomForest

Undersamples the majority class within each bootstrap sample:

```python
from imblearn.ensemble import BalancedRandomForestClassifier

model = BalancedRandomForestClassifier(n_estimators=100)
```

### EasyEnsemble

Trains multiple classifiers on different balanced subsets:

```python
from imblearn.ensemble import EasyEnsembleClassifier

model = EasyEnsembleClassifier(n_estimators=10)
```

### RUSBoost

Combines random undersampling with AdaBoost.
## When to Use What
| Situation | Approach |
|---|---|
| Moderate imbalance (1:10) | Class weights |
| Severe imbalance (1:100+) | SMOTE + class weights |
| Very small minority (<100) | Careful oversampling, anomaly detection |
| Large dataset | Undersampling + ensemble |
| Extreme imbalance (1:1000+) | Consider anomaly detection |
## Important Considerations

### Resampling in Cross-Validation

Resampling before the split leaks information: synthetic samples derived from test points end up in the training data.

```python
# WRONG: resampling before the split leaks information across CV folds
X_resampled, y_resampled = SMOTE().fit_resample(X, y)
scores = cross_val_score(model, X_resampled, y_resampled)  # leakage!

# RIGHT: resample inside each fold via imblearn's Pipeline
from imblearn.pipeline import Pipeline  # not sklearn.pipeline.Pipeline!

pipeline = Pipeline([
    ('smote', SMOTE()),
    ('classifier', RandomForestClassifier())
])
scores = cross_val_score(pipeline, X, y)
```
### Validation Set

Keep the validation set at the real-world (imbalanced) class distribution so it measures true performance; never resample it.

### Stratified Splits

Always use stratified sampling so every split preserves the class ratio:

```python
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5)
```
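For a simple hold-out split, `train_test_split(..., stratify=y)` does the same job. A sketch on synthetic 5%-positive data (illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0,
                           random_state=42)

# stratify=y keeps the class ratio identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print("Train positive rate:", round(y_train.mean(), 3))
print("Test positive rate: ", round(y_test.mean(), 3))
```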
## Anomaly Detection Approach

For extreme imbalance, treat the minority class as anomalies and model only the majority class:

```python
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

# Train on the majority class only; contamination is the expected anomaly fraction
model = IsolationForest(contamination=0.01)
model.fit(X_majority)
```
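`IsolationForest.predict` returns +1 for inliers and -1 for anomalies, so the output needs mapping back to class labels. A self-contained sketch on synthetic data (parameters are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

X, y = make_classification(n_samples=2000, weights=[0.98], flip_y=0,
                           random_state=42)
X_majority = X[y == 0]

# Fit on the majority class only; contamination = expected anomaly fraction
model = IsolationForest(contamination=0.02, random_state=42).fit(X_majority)

# predict() returns +1 (inlier) / -1 (anomaly); map -1 to the positive class
raw = model.predict(X)
y_pred = (raw == -1).astype(int)
print("Flagged as anomalies:", int(y_pred.sum()))
```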
## Practical Pipeline

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(class_weight='balanced'))
])
pipeline.fit(X_train, y_train)
```
## Key Takeaways

- Don't use accuracy for imbalanced data; use F1 or PR-AUC instead
- Class weights are simple and often effective
- SMOTE creates synthetic minority samples
- Always resample inside cross-validation, never before splitting
- Keep validation/test sets at the original distribution
- Combine approaches (SMOTE + class weights + threshold tuning)
- For extreme imbalance, consider anomaly detection