Classical Machine Learning (Intermediate)

Learn strategies for handling imbalanced datasets where one class significantly outnumbers others, a common challenge in fraud detection, medical diagnosis, and more.

Tags: imbalanced-data, classification, sampling, smote, evaluation

Class Imbalance

Class imbalance occurs when the distribution of classes in a dataset is highly skewed, with some classes having significantly more samples than others. This is common in fraud detection, medical diagnosis, and anomaly detection.

The Problem

Balanced Dataset:           Imbalanced Dataset:

 Class A: ████████ 50%      Class A (Normal):  █████████████████ 99%
 Class B: ████████ 50%      Class B (Fraud):   █ 1%

A naive classifier can achieve 99% accuracy by always
predicting the majority class!

Why It Matters

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# With 99% negative, 1% positive (fraud):
y_test = np.array([0] * 99 + [1])

# Naive classifier (always predict negative)
y_pred = np.zeros_like(y_test)  # Always "not fraud"

accuracy = accuracy_score(y_test, y_pred)    # 0.99 -- 99% accuracy!
precision = precision_score(y_test, y_pred, zero_division=0)  # 0.0 -- no frauds detected
recall = recall_score(y_test, y_pred)        # 0.0 -- missed all frauds

Accuracy is misleading for imbalanced data.

Solution Categories

┌─────────────────────────────────────────────────────────┐
│                Handling Class Imbalance                 │
├───────────────┬───────────────┬───────────────┬─────────┤
│  Data Level   │  Algorithm    │ Cost-         │ Metric  │
│               │  Level        │ Sensitive     │         │
├───────────────┼───────────────┼───────────────┼─────────┤
│ Oversampling  │ Class weights │ Custom loss   │ Use F1, │
│ Undersampling │ Ensemble      │ Threshold     │ AUC-ROC │
│ SMOTE         │ One-class     │ tuning        │ PR-AUC  │
└───────────────┴───────────────┴───────────────┴─────────┘

Data-Level Solutions

Random Oversampling

Duplicate minority class samples:

import numpy as np
from sklearn.utils import resample

# Separate classes
X_majority = X[y == 0]
X_minority = X[y == 1]

# Upsample minority
X_minority_upsampled = resample(
    X_minority,
    replace=True,  # With replacement
    n_samples=len(X_majority),  # Match majority
    random_state=42
)

# Combine
X_balanced = np.vstack([X_majority, X_minority_upsampled])
y_balanced = np.hstack([np.zeros(len(X_majority)), 
                        np.ones(len(X_minority_upsampled))])

Random Undersampling

Remove majority class samples:

from sklearn.utils import resample

X_majority_downsampled = resample(
    X_majority,
    replace=False,
    n_samples=len(X_minority),
    random_state=42
)

X_balanced = np.vstack([X_majority_downsampled, X_minority])

Warning: Loses information from majority class.

SMOTE (Synthetic Minority Oversampling)

Create synthetic samples by interpolating:

from collections import Counter

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

print(f"Original: {Counter(y)}")
print(f"Resampled: {Counter(y_resampled)}")

How SMOTE works:

1. Pick a minority sample
2. Find its k nearest minority neighbors
3. Create synthetic point between them

   minority ○──────×──────○ neighbor
   sample          ↑
            synthetic sample
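
The interpolation in step 3 is just a convex combination of a minority sample and one of its neighbors. A minimal sketch of the idea (not imblearn's actual internals; `x_i` and `x_neighbor` are hypothetical feature vectors):

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hypothetical minority-class feature vectors
x_i = np.array([1.0, 2.0])
x_neighbor = np.array([3.0, 4.0])

# SMOTE draws lam uniformly from [0, 1) and interpolates
lam = rng.uniform(0, 1)
x_synthetic = x_i + lam * (x_neighbor - x_i)

# The synthetic point lies on the line segment between the two samples
print(x_synthetic)
```

Because the new point always lies between two existing minority samples, SMOTE expands the minority region without duplicating exact copies the way random oversampling does.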

SMOTE Variants

from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN

# Borderline-SMOTE: Focus on samples near decision boundary
borderline_smote = BorderlineSMOTE(random_state=42)

# ADASYN: Adaptively generate more samples where harder to learn
adasyn = ADASYN(random_state=42)

Combined Sampling

from imblearn.combine import SMOTETomek, SMOTEENN

# SMOTE + Tomek links (clean up overlapping)
smote_tomek = SMOTETomek(random_state=42)
X_res, y_res = smote_tomek.fit_resample(X, y)

# SMOTE + ENN (more aggressive cleaning)
smote_enn = SMOTEENN(random_state=42)

Algorithm-Level Solutions

Class Weights

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Automatically balance
model = LogisticRegression(class_weight='balanced')
model = RandomForestClassifier(class_weight='balanced')

# Manual weights
model = LogisticRegression(
    class_weight={0: 1, 1: 99}  # Weight minority 99x more
)

Weighted Loss in Deep Learning

import torch
import torch.nn as nn
import torch.nn.functional as F

# Calculate weights
class_counts = [9900, 100]  # 99% vs 1%
weights = 1 / torch.tensor(class_counts, dtype=torch.float)
weights = weights / weights.sum()  # Normalize

# Weighted CrossEntropy
criterion = nn.CrossEntropyLoss(weight=weights)

# Or Focal Loss (reduces easy example contribution)
class FocalLoss(nn.Module):
    def __init__(self, alpha=1, gamma=2):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
    
    def forward(self, inputs, targets):
        ce_loss = F.cross_entropy(inputs, targets, reduction='none')
        pt = torch.exp(-ce_loss)
        focal_loss = self.alpha * (1 - pt) ** self.gamma * ce_loss
        return focal_loss.mean()

Ensemble Methods

from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier

# Balanced Random Forest (undersamples for each tree)
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
brf.fit(X_train, y_train)

# Easy Ensemble (AdaBoost on balanced subsets)
easy_ensemble = EasyEnsembleClassifier(n_estimators=10, random_state=42)
easy_ensemble.fit(X_train, y_train)

Threshold Tuning

import numpy as np
from sklearn.metrics import precision_recall_curve

# Get probabilities
y_probs = model.predict_proba(X_test)[:, 1]

# Find optimal threshold for F1
precision, recall, thresholds = precision_recall_curve(y_test, y_probs)
f1_scores = 2 * (precision * recall) / (precision + recall + 1e-8)
best_threshold = thresholds[np.argmax(f1_scores[:-1])]

print("Default threshold: 0.5")
print(f"Optimal threshold: {best_threshold:.3f}")

# Use optimal threshold
y_pred_optimized = (y_probs >= best_threshold).astype(int)

Evaluation Metrics

Use These Instead of Accuracy

from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    average_precision_score,
    f1_score
)

# Confusion Matrix
print(confusion_matrix(y_test, y_pred))
#           Predicted
#           Neg  Pos
# Actual Neg [TN] [FP]
#        Pos [FN] [TP]

# Classification Report
print(classification_report(y_test, y_pred))

# AUC-ROC (threshold-independent)
roc_auc = roc_auc_score(y_test, y_probs)

# Precision-Recall AUC (better for imbalanced)
pr_auc = average_precision_score(y_test, y_probs)

# F1 Score
f1 = f1_score(y_test, y_pred)

Metric Comparison for Imbalanced Data

┌───────────┬──────────────────────┬──────────────────────────────────────┐
│ Metric    │ Good for imbalanced? │ Why                                  │
├───────────┼──────────────────────┼──────────────────────────────────────┤
│ Accuracy  │ No                   │ Dominated by majority class          │
│ Precision │ Yes                  │ Focuses on positive predictions      │
│ Recall    │ Yes                  │ Focuses on finding positives         │
│ F1        │ Yes                  │ Balances precision/recall            │
│ AUC-ROC   │ Somewhat             │ Can be misleading if very imbalanced │
│ PR-AUC    │ Yes                  │ Best for high imbalance              │
└───────────┴──────────────────────┴──────────────────────────────────────┘
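
The AUC-ROC caveat is easy to demonstrate: with enough easy negatives, ROC-AUC can stay high even when precision is poor throughout. A small synthetic sketch (the score values below are made up purely for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# 990 easy negatives, 50 hard negatives that overlap the positives,
# and only 10 positives (~1% of the data)
y_true = np.array([0] * 1040 + [1] * 10)
y_score = np.concatenate([
    np.linspace(0.0, 0.4, 990),   # easy negatives: low scores
    np.linspace(0.5, 0.9, 50),    # hard negatives: overlap positives
    np.linspace(0.5, 0.9, 10),    # positives
])

roc_auc = roc_auc_score(y_true, y_score)           # high: looks great
pr_auc = average_precision_score(y_true, y_score)  # much lower: reveals the problem
print(f"ROC-AUC: {roc_auc:.3f}, PR-AUC: {pr_auc:.3f}")
```

ROC-AUC stays high because the 990 easy negatives dominate the pairwise ranking, while PR-AUC exposes that every predicted positive competes with roughly five overlapping hard negatives.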

Practical Guidelines

Mild Imbalance (80:20 to 90:10)

# Try class weights first
model = RandomForestClassifier(class_weight='balanced')

Moderate Imbalance (95:5 to 99:1)

# SMOTE + class weights
smote = SMOTE()
X_res, y_res = smote.fit_resample(X_train, y_train)
model = RandomForestClassifier(class_weight='balanced')
model.fit(X_res, y_res)

Severe Imbalance (>99:1)

# Ensemble methods + threshold tuning + focal loss
model = EasyEnsembleClassifier(n_estimators=20)
# Or anomaly detection approach
from sklearn.ensemble import IsolationForest
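
For completeness, here is a minimal sketch of the anomaly-detection route with IsolationForest on synthetic data (the cluster locations are made up for illustration). It treats fraud as outliers rather than a class: fit_predict returns -1 for anomalies and 1 for normal points.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# 990 "normal" points near the origin, 10 "fraud" points far away
X_normal = rng.normal(0.0, 1.0, size=(990, 2))
X_fraud = rng.normal(6.0, 0.5, size=(10, 2))
X = np.vstack([X_normal, X_fraud])

# contamination = expected fraction of anomalies (here ~1%)
iso = IsolationForest(contamination=0.01, random_state=42)
labels = iso.fit_predict(X)           # -1 = anomaly, 1 = normal
y_pred = (labels == -1).astype(int)   # Map to 1 = fraud

print(f"Flagged {y_pred.sum()} points, "
      f"{y_pred[990:].sum()} of 10 true frauds")
```

Note that this is fully unsupervised: no fraud labels are used during fitting, which is exactly why it remains viable when positives are too rare to learn from directly.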

Common Mistakes

1. Resampling Before Split

# WRONG - data leakage!
smote = SMOTE()
X_res, y_res = smote.fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res)

# CORRECT - resample only training data
X_train, X_test, y_train, y_test = train_test_split(X, y)
smote = SMOTE()
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
# Test set remains untouched!

2. Ignoring Cross-Validation

# Use stratified CV to maintain class proportions
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True)
scores = cross_val_score(model, X, y, cv=cv, scoring='f1')

Key Takeaways

  1. Don't use accuracy for imbalanced data - use F1, PR-AUC, or AUC-ROC
  2. Try class_weight='balanced' as a first approach
  3. SMOTE creates synthetic samples; use only on training data
  4. Threshold tuning can significantly improve performance
  5. For severe imbalance, consider anomaly detection approaches
  6. Always maintain class distribution in cross-validation