Classical Machine Learning (Intermediate)

Learn strategies for handling imbalanced datasets where one class significantly outnumbers others, a common challenge in fraud detection, medical diagnosis, and more.

Tags: imbalanced-data, classification, sampling, smote, evaluation

Class Imbalance

Class imbalance occurs when the distribution of classes in a dataset is highly skewed, with some classes having significantly more samples than others. This is common in fraud detection, medical diagnosis, and anomaly detection.

The Problem

Balanced Dataset:           Imbalanced Dataset:

 Class A: ████████ 50%      Class A (Normal):  █████████████████ 99%
 Class B: ████████ 50%      Class B (Fraud):   █ 1%

A naive classifier can achieve 99% accuracy by always
predicting the majority class!

Why It Matters

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# With 99% negative, 1% positive (fraud):
y_test = np.array([0] * 99 + [1])

# Naive classifier (always predict negative)
y_pred = np.zeros_like(y_test)  # Always "not fraud"

accuracy = accuracy_score(y_test, y_pred)    # 0.99 -- 99% accuracy!
precision = precision_score(y_test, y_pred, zero_division=0)  # 0.0 -- no frauds detected
recall = recall_score(y_test, y_pred)        # 0.0 -- missed all frauds

Accuracy is misleading for imbalanced data.

Solution Categories

┌─────────────────────────────────────────────────────────┐
│                Handling Class Imbalance                 │
├───────────────┬───────────────┬───────────────┬─────────┤
│  Data Level   │  Algorithm    │ Cost-         │ Metric  │
│               │  Level        │ Sensitive     │         │
├───────────────┼───────────────┼───────────────┼─────────┤
│ Oversampling  │ Class weights │ Custom loss   │ Use F1, │
│ Undersampling │ Ensemble      │ Threshold     │ AUC-ROC │
│ SMOTE         │ One-class     │ tuning        │ PR-AUC  │
└───────────────┴───────────────┴───────────────┴─────────┘

Data-Level Solutions

Random Oversampling

Duplicate minority class samples:

import numpy as np
from sklearn.utils import resample

# Separate classes
X_majority = X[y == 0]
X_minority = X[y == 1]

# Upsample minority
X_minority_upsampled = resample(
    X_minority,
    replace=True,  # With replacement
    n_samples=len(X_majority),  # Match majority
    random_state=42
)

# Combine
X_balanced = np.vstack([X_majority, X_minority_upsampled])
y_balanced = np.hstack([np.zeros(len(X_majority)), 
                        np.ones(len(X_minority_upsampled))])

Random Undersampling

Remove majority class samples:

from sklearn.utils import resample

X_majority_downsampled = resample(
    X_majority,
    replace=False,
    n_samples=len(X_minority),
    random_state=42
)

X_balanced = np.vstack([X_majority_downsampled, X_minority])

Warning: Loses information from majority class.

SMOTE (Synthetic Minority Oversampling)

Create synthetic samples by interpolating:

from collections import Counter

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

print(f"Original: {Counter(y)}")
print(f"Resampled: {Counter(y_resampled)}")

How SMOTE works:

1. Pick a minority sample
2. Find its k nearest minority neighbors
3. Create synthetic point between them

   minority ○──────×──────○ neighbor
   sample          ↑
            synthetic sample
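
The interpolation in step 3 is just a convex combination of a minority sample and one of its neighbors. A minimal sketch of the idea (not imblearn's actual internals; `x_i` and `x_neighbor` are hypothetical feature vectors):

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hypothetical minority-class feature vectors
x_i = np.array([1.0, 2.0])
x_neighbor = np.array([3.0, 4.0])

# SMOTE draws lam uniformly from [0, 1) and interpolates
lam = rng.uniform(0, 1)
x_synthetic = x_i + lam * (x_neighbor - x_i)

# The synthetic point lies on the line segment between the two samples
print(x_synthetic)
```

Because the new point always lies between two existing minority samples, SMOTE expands the minority region without duplicating exact copies the way random oversampling does.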

SMOTE Variants

from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN

# Borderline-SMOTE: Focus on samples near decision boundary
borderline_smote = BorderlineSMOTE(random_state=42)

# ADASYN: Adaptively generate more samples where harder to learn
adasyn = ADASYN(random_state=42)

Combined Sampling

from imblearn.combine import SMOTETomek, SMOTEENN

# SMOTE + Tomek links (clean up overlapping)
smote_tomek = SMOTETomek(random_state=42)
X_res, y_res = smote_tomek.fit_resample(X, y)

# SMOTE + ENN (more aggressive cleaning)
smote_enn = SMOTEENN(random_state=42)

Algorithm-Level Solutions

Class Weights

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Automatically balance
model = LogisticRegression(class_weight='balanced')
model = RandomForestClassifier(class_weight='balanced')

# Manual weights
model = LogisticRegression(
    class_weight={0: 1, 1: 99}  # Weight minority 99x more
)

Weighted Loss in Deep Learning

import torch
import torch.nn as nn
import torch.nn.functional as F

# Calculate weights
class_counts = [9900, 100]  # 99% vs 1%
weights = 1 / torch.tensor(class_counts, dtype=torch.float)
weights = weights / weights.sum()  # Normalize

# Weighted CrossEntropy
criterion = nn.CrossEntropyLoss(weight=weights)

# Or Focal Loss (reduces easy example contribution)
class FocalLoss(nn.Module):
    def __init__(self, alpha=1, gamma=2):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
    
    def forward(self, inputs, targets):
        ce_loss = F.cross_entropy(inputs, targets, reduction='none')
        pt = torch.exp(-ce_loss)
        focal_loss = self.alpha * (1 - pt) ** self.gamma * ce_loss
        return focal_loss.mean()

Ensemble Methods

from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier

# Balanced Random Forest (undersamples for each tree)
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
brf.fit(X_train, y_train)

# Easy Ensemble (AdaBoost on balanced subsets)
easy_ensemble = EasyEnsembleClassifier(n_estimators=10, random_state=42)
easy_ensemble.fit(X_train, y_train)

Threshold Tuning

import numpy as np
from sklearn.metrics import precision_recall_curve

# Get probabilities
y_probs = model.predict_proba(X_test)[:, 1]

# Find optimal threshold for F1
precision, recall, thresholds = precision_recall_curve(y_test, y_probs)
f1_scores = 2 * (precision * recall) / (precision + recall + 1e-8)
best_threshold = thresholds[np.argmax(f1_scores[:-1])]

print("Default threshold: 0.5")
print(f"Optimal threshold: {best_threshold:.3f}")

# Use optimal threshold
y_pred_optimized = (y_probs >= best_threshold).astype(int)

Evaluation Metrics

Use These Instead of Accuracy

from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    average_precision_score,
    f1_score
)

# Confusion Matrix
print(confusion_matrix(y_test, y_pred))
#           Predicted
#           Neg  Pos
# Actual Neg [TN] [FP]
#        Pos [FN] [TP]

# Classification Report
print(classification_report(y_test, y_pred))

# AUC-ROC (threshold-independent)
roc_auc = roc_auc_score(y_test, y_probs)

# Precision-Recall AUC (better for imbalanced)
pr_auc = average_precision_score(y_test, y_probs)

# F1 Score
f1 = f1_score(y_test, y_pred)

Metric Comparison for Imbalanced Data

┌───────────┬──────────────────────┬──────────────────────────────────────┐
│ Metric    │ Good for imbalanced? │ Why                                  │
├───────────┼──────────────────────┼──────────────────────────────────────┤
│ Accuracy  │ No                   │ Dominated by majority class          │
│ Precision │ Yes                  │ Focuses on positive predictions      │
│ Recall    │ Yes                  │ Focuses on finding positives         │
│ F1        │ Yes                  │ Balances precision/recall            │
│ AUC-ROC   │ Somewhat             │ Can be misleading if very imbalanced │
│ PR-AUC    │ Yes                  │ Best for high imbalance              │
└───────────┴──────────────────────┴──────────────────────────────────────┘
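
The AUC-ROC caveat is easy to demonstrate: with enough easy negatives, ROC-AUC can stay high even when precision is poor throughout. A small synthetic sketch (the score values below are made up purely for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# 990 easy negatives, 50 hard negatives that overlap the positives,
# and only 10 positives (~1% of the data)
y_true = np.array([0] * 1040 + [1] * 10)
y_score = np.concatenate([
    np.linspace(0.0, 0.4, 990),   # easy negatives: low scores
    np.linspace(0.5, 0.9, 50),    # hard negatives: overlap positives
    np.linspace(0.5, 0.9, 10),    # positives
])

roc_auc = roc_auc_score(y_true, y_score)           # high: looks great
pr_auc = average_precision_score(y_true, y_score)  # much lower: reveals the problem
print(f"ROC-AUC: {roc_auc:.3f}, PR-AUC: {pr_auc:.3f}")
```

ROC-AUC stays high because the 990 easy negatives dominate the pairwise ranking, while PR-AUC exposes that every predicted positive competes with roughly five overlapping hard negatives.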

Practical Guidelines

Mild Imbalance (80:20 to 90:10)

# Try class weights first
model = RandomForestClassifier(class_weight='balanced')

Moderate Imbalance (95:5 to 99:1)

# SMOTE + class weights
smote = SMOTE()
X_res, y_res = smote.fit_resample(X_train, y_train)
model = RandomForestClassifier(class_weight='balanced')
model.fit(X_res, y_res)

Severe Imbalance (>99:1)

# Ensemble methods + threshold tuning + focal loss
model = EasyEnsembleClassifier(n_estimators=20)
# Or anomaly detection approach
from sklearn.ensemble import IsolationForest
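
For completeness, here is a minimal sketch of the anomaly-detection route with IsolationForest on synthetic data (the cluster locations are made up for illustration). It treats fraud as outliers rather than a class: fit_predict returns -1 for anomalies and 1 for normal points.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# 990 "normal" points near the origin, 10 "fraud" points far away
X_normal = rng.normal(0.0, 1.0, size=(990, 2))
X_fraud = rng.normal(6.0, 0.5, size=(10, 2))
X = np.vstack([X_normal, X_fraud])

# contamination = expected fraction of anomalies (here ~1%)
iso = IsolationForest(contamination=0.01, random_state=42)
labels = iso.fit_predict(X)           # -1 = anomaly, 1 = normal
y_pred = (labels == -1).astype(int)   # Map to 1 = fraud

print(f"Flagged {y_pred.sum()} points, "
      f"{y_pred[990:].sum()} of 10 true frauds")
```

Note that this is fully unsupervised: no fraud labels are used during fitting, which is exactly why it remains viable when positives are too rare to learn from directly.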

Common Mistakes

1. Resampling Before Split

# WRONG - data leakage!
smote = SMOTE()
X_res, y_res = smote.fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res)

# CORRECT - resample only training data
X_train, X_test, y_train, y_test = train_test_split(X, y)
smote = SMOTE()
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
# Test set remains untouched!

2. Ignoring Cross-Validation

# Use stratified CV to maintain class proportions
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True)
scores = cross_val_score(model, X, y, cv=cv, scoring='f1')

Key Takeaways

  1. Don't use accuracy for imbalanced data - use F1, PR-AUC, or AUC-ROC
  2. Try class_weight='balanced' as a first approach
  3. SMOTE creates synthetic samples; use only on training data
  4. Threshold tuning can significantly improve performance
  5. For severe imbalance, consider anomaly detection approaches
  6. Always maintain class distribution in cross-validation