
Learn about data leakage: when information from outside the training set improperly influences the model, leading to overly optimistic results.


Data Leakage

Data leakage occurs when information that wouldn't be available at prediction time is used during training. It's one of the most common and dangerous mistakes in machine learning.

Why Leakage Is Dangerous

With leakage:    Training AUC = 0.99, Production AUC = 0.65
Without leakage: Training AUC = 0.85, Production AUC = 0.83

Leakage gives false confidence. Models seem great but fail in production.
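This gap is easy to reproduce. The sketch below (synthetic data, illustrative only) adds one feature that is essentially a noisy copy of the label and compares cross-validated AUC with and without it:

```python
# Toy demo: a feature derived from the target inflates offline scores.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 5))
y = (X[:, 0] + rng.normal(scale=2.0, size=n) > 0).astype(int)

# Leaked feature: the label plus a little noise
leak = y + rng.normal(scale=0.1, size=n)
X_leaky = np.column_stack([X, leak])

clean_auc = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="roc_auc").mean()
leaky_auc = cross_val_score(LogisticRegression(), X_leaky, y, cv=5, scoring="roc_auc").mean()
print(f"clean AUC: {clean_auc:.2f}, leaky AUC: {leaky_auc:.2f}")
```

The leaky model looks near-perfect offline, but the extra column would not exist at prediction time, so the clean score is the honest one.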

Types of Data Leakage

1. Target Leakage

Features contain information about the target:

Predicting: Will customer default on loan?

❌ Leaked feature: "collection_agency_contacted"
   (Only happens after default!)

❌ Leaked feature: "account_closed_reason=default"
   (Directly encodes the target)
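A quick way to spot this pattern is a cross-tabulation of the suspect feature against the target; a leaked flag lines up perfectly with the label. A minimal sketch with made-up data:

```python
# Sketch: a target-leaked flag perfectly separates the classes.
import pandas as pd

df = pd.DataFrame({
    "default": [0, 0, 0, 1, 1],
    "collection_agency_contacted": [0, 0, 0, 1, 1],  # only set after a default
})
ct = pd.crosstab(df["default"], df["collection_agency_contacted"])
print(ct)  # off-diagonal cells are zero: the feature encodes the target
```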

2. Train-Test Contamination

Test data influences training:

# ❌ Wrong: fit the scaler on all data before splitting
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Statistics computed from test rows too!
X_train, X_test = train_test_split(X_scaled)

# ✓ Correct: split first, then fit the scaler on training data only
X_train, X_test = train_test_split(X)
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

3. Temporal Leakage

Using future information to predict the past:

Predicting: Stock price tomorrow

❌ Using data from tomorrow
❌ Using "will_price_increase_next_week"

4. Group/Entity Leakage

Same entity in train and test:

Patient data: Multiple visits per patient

❌ Patient in both train and test
   (Model memorizes patient, not patterns)

Common Leakage Sources

Preprocessing Before Split

Step               ❌ Wrong              ✓ Correct
Scaling            Fit on all data       Fit only on train
Imputation         Use global mean       Use train mean
Feature selection  Select on all data    Select on train
Encoding           Fit on all data       Fit on train

Time-based Features

❌ customer_lifetime_value  (includes future)
❌ final_purchase_date      (future knowledge)
❌ total_transactions       (if includes future)

ID-related Features

❌ patient_id → encodes which hospital → encodes outcome rates
❌ timestamp → encodes data collection period → encodes policy changes

Derived Features

❌ avg_order_value = total_revenue / n_orders
   (If predicting revenue, this leaks target)
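The danger with such a derived feature is that the target can be reconstructed from it exactly. A toy sketch (illustrative numbers only):

```python
# Sketch: avg_order_value is a deterministic function of the revenue target,
# so revenue can be recovered exactly from (avg_order_value, n_orders).
import numpy as np

rng = np.random.default_rng(1)
n_orders = rng.integers(1, 20, size=200)
revenue = n_orders * rng.uniform(10, 50, size=200)   # this is the target
avg_order_value = revenue / n_orders                 # derived feature: leaks the target

reconstructed = avg_order_value * n_orders
print(np.allclose(reconstructed, revenue))  # True
```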

Detecting Leakage

Signs of Leakage

  1. Too-good-to-be-true performance: AUC > 0.95 on hard problems
  2. Large train-production gap: Works great offline, fails online
  3. Features too correlated with target: r > 0.9
  4. Unreasonable feature importance: Unexpected feature dominates
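Sign 3 can be automated as a quick screen. A minimal sketch (the `leakage_screen` helper and column names are illustrative, not a standard API):

```python
# Sketch: flag features whose absolute correlation with the target is suspiciously high.
import numpy as np
import pandas as pd

def leakage_screen(df, target, threshold=0.9):
    corrs = df.drop(columns=[target]).corrwith(df[target]).abs()
    return corrs[corrs > threshold].sort_values(ascending=False)

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.normal(size=500),
    "age": rng.normal(size=500),
    "target": rng.integers(0, 2, size=500),
})
df["leaked_flag"] = df["target"]  # a perfect copy of the label
print(leakage_screen(df, "target"))  # only leaked_flag should be flagged
```

A flagged feature is not automatically leaked (some legitimate features correlate strongly), but each one deserves a manual "would I have this at prediction time?" review.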

Debugging Steps

# 1. Check feature importances
print(model.feature_importances_)
# Is the top feature suspicious?

# 2. Inspect high-importance features
print(df.groupby('target')['suspicious_feature'].describe())
# Does it perfectly separate classes?

# 3. Check feature availability
# Would this feature exist at prediction time?

# 4. Temporal validation
# Train on past, test on future
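Step 2 can be taken further with a single-feature probe: fit a tiny model on each feature alone and flag any that score near-perfectly by itself. A sketch with synthetic data (`probe_features` is an illustrative helper, not a library function):

```python
# Sketch: a feature that alone predicts the target almost perfectly
# is a prime leakage suspect.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def probe_features(X, y, threshold=0.95):
    suspicious = []
    for j in range(X.shape[1]):
        auc = cross_val_score(DecisionTreeClassifier(max_depth=3),
                              X[:, [j]], y, cv=5, scoring="roc_auc").mean()
        if auc > threshold:
            suspicious.append((j, auc))
    return suspicious

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 3))
X[:, 2] = y  # deliberately leaked column
print(probe_features(X, y))  # only column 2 should be flagged
```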

Preventing Leakage

1. Use Pipelines

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# Scaling is refit inside each CV fold, preventing leakage
cross_val_score(pipeline, X, y, cv=5)

2. Split First, Process After

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Then process, fitting only on the training split
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

3. Use Proper CV for Groups

from sklearn.model_selection import GroupKFold

# Ensure same patient not in train and test
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=patient_ids):
    ...
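A quick way to convince yourself this works: enumerate the splits on toy data and check that no group ever appears on both sides. A minimal sketch:

```python
# Sketch: GroupKFold keeps every group entirely on one side of each split.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(20).reshape(-1, 1)
y = np.zeros(20)
groups = np.repeat(np.arange(5), 4)  # 5 "patients", 4 rows each

overlaps = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
    overlaps.append(set(groups[train_idx]) & set(groups[test_idx]))
print(all(len(o) == 0 for o in overlaps))  # True: no patient on both sides
```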

4. Use Time-based Splits

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # train always before test
    ...
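The guarantee can be checked the same way: in every split, the largest training index precedes the smallest test index. A small sketch:

```python
# Sketch: with TimeSeriesSplit, all training indices precede all test indices.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)
splits = list(TimeSeriesSplit(n_splits=5).split(X))
print(all(tr.max() < te.min() for tr, te in splits))  # True
```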

5. Ask "Would I Have This?"

For every feature:

"At prediction time, would this information be available?"

If no → remove or modify.

Example: Preventing Common Leakage

Feature Engineering

def create_features(df, reference_date):
    """Create features using only data before reference_date"""
    past_data = df[df['date'] < reference_date]
    
    features = {
        'n_purchases': len(past_data),
        'avg_amount': past_data['amount'].mean(),
        'days_since_last': (reference_date - past_data['date'].max()).days
    }
    return features
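A hypothetical usage of the function above, repeated here so the sketch runs standalone (the dates and amounts are made up):

```python
import pandas as pd

def create_features(df, reference_date):
    """Create features using only data before reference_date (as above)."""
    past_data = df[df['date'] < reference_date]
    return {
        'n_purchases': len(past_data),
        'avg_amount': past_data['amount'].mean(),
        'days_since_last': (reference_date - past_data['date'].max()).days
    }

df = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-05', '2024-02-10', '2024-03-20']),
    'amount': [20.0, 35.0, 50.0],
})
feats = create_features(df, pd.Timestamp('2024-03-01'))
print(feats)  # only the two purchases before 2024-03-01 are used
```

Because the cutoff is explicit, the March 20 purchase cannot leak into features computed as of March 1.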

Full Pipeline

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
])

pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('model', RandomForestClassifier())
])

# CV with group-aware splits
scores = cross_val_score(
    pipeline, X, y,
    cv=GroupKFold(n_splits=5),
    groups=group_ids
)

Leakage Checklist

  • Split before any preprocessing
  • Use pipelines for cross-validation
  • Check temporal ordering of features
  • Group-aware splitting if needed
  • Review feature importance for suspicious patterns
  • Verify features available at prediction time
  • Test in production-like environment

Key Takeaways

  1. Leakage = using unavailable information during training
  2. Causes unrealistic performance estimates
  3. Always split before preprocessing
  4. Use pipelines to automate correct ordering
  5. Ask: "Would I have this at prediction time?"
  6. Validate with proper temporal/group splits
