
Learn about data leakage: when information from outside the training set improperly influences the model, leading to overly optimistic results.


Data Leakage

Data leakage occurs when information that wouldn't be available at prediction time is used during training. It's one of the most common and dangerous mistakes in machine learning.

Why Leakage Is Dangerous

With leakage:    Training AUC = 0.99, Production AUC = 0.65
Without leakage: Training AUC = 0.85, Production AUC = 0.83

Leakage gives false confidence. Models seem great but fail in production.
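This gap is easy to reproduce. The sketch below (synthetic data, illustrative only) adds one feature that is essentially a noisy copy of the label and compares cross-validated AUC with and without it:

```python
# Toy demo: a feature derived from the target inflates offline scores.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 5))
y = (X[:, 0] + rng.normal(scale=2.0, size=n) > 0).astype(int)

# Leaked feature: the label plus a little noise
leak = y + rng.normal(scale=0.1, size=n)
X_leaky = np.column_stack([X, leak])

clean_auc = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="roc_auc").mean()
leaky_auc = cross_val_score(LogisticRegression(), X_leaky, y, cv=5, scoring="roc_auc").mean()
print(f"clean AUC: {clean_auc:.2f}, leaky AUC: {leaky_auc:.2f}")
```

The leaky model looks near-perfect offline, but the extra column would not exist at prediction time, so the clean score is the honest one.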

Types of Data Leakage

1. Target Leakage

Features contain information about the target:

Predicting: Will customer default on loan?

❌ Leaked feature: "collection_agency_contacted"
   (Only happens after default!)

❌ Leaked feature: "account_closed_reason=default"
   (Directly encodes the target)
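A quick way to spot this pattern is a cross-tabulation of the suspect feature against the target; a leaked flag lines up perfectly with the label. A minimal sketch with made-up data:

```python
# Sketch: a target-leaked flag perfectly separates the classes.
import pandas as pd

df = pd.DataFrame({
    "default": [0, 0, 0, 1, 1],
    "collection_agency_contacted": [0, 0, 0, 1, 1],  # only set after a default
})
ct = pd.crosstab(df["default"], df["collection_agency_contacted"])
print(ct)  # off-diagonal cells are zero: the feature encodes the target
```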

2. Train-Test Contamination

Test data influences training:

# ❌ Wrong: fit the scaler on all data before splitting
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Statistics computed from test rows too!
X_train, X_test = train_test_split(X_scaled)

# ✓ Correct: split first, then fit the scaler on training data only
X_train, X_test = train_test_split(X)
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

3. Temporal Leakage

Using future information to predict the past:

Predicting: Stock price tomorrow

❌ Using data from tomorrow
❌ Using "will_price_increase_next_week"

4. Group/Entity Leakage

Same entity in train and test:

Patient data: Multiple visits per patient

❌ Patient in both train and test
   (Model memorizes patient, not patterns)

Common Leakage Sources

Preprocessing Before Split

Step               ❌ Wrong              ✓ Correct
Scaling            Fit on all data       Fit only on train
Imputation         Use global mean       Use train mean
Feature selection  Select on all data    Select on train
Encoding           Fit on all data       Fit on train

Time-based Features

❌ customer_lifetime_value  (includes future)
❌ final_purchase_date      (future knowledge)
❌ total_transactions       (if includes future)

ID-related Features

❌ patient_id → encodes which hospital → encodes outcome rates
❌ timestamp → encodes data collection period → encodes policy changes

Derived Features

❌ avg_order_value = total_revenue / n_orders
   (If predicting revenue, this leaks target)
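The danger with such a derived feature is that the target can be reconstructed from it exactly. A toy sketch (illustrative numbers only):

```python
# Sketch: avg_order_value is a deterministic function of the revenue target,
# so revenue can be recovered exactly from (avg_order_value, n_orders).
import numpy as np

rng = np.random.default_rng(1)
n_orders = rng.integers(1, 20, size=200)
revenue = n_orders * rng.uniform(10, 50, size=200)   # this is the target
avg_order_value = revenue / n_orders                 # derived feature: leaks the target

reconstructed = avg_order_value * n_orders
print(np.allclose(reconstructed, revenue))  # True
```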

Detecting Leakage

Signs of Leakage

  1. Too-good-to-be-true performance: AUC > 0.95 on hard problems
  2. Large train-production gap: Works great offline, fails online
  3. Features too correlated with target: r > 0.9
  4. Unreasonable feature importance: Unexpected feature dominates
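Sign 3 can be automated as a quick screen. A minimal sketch (the `leakage_screen` helper and column names are illustrative, not a standard API):

```python
# Sketch: flag features whose absolute correlation with the target is suspiciously high.
import numpy as np
import pandas as pd

def leakage_screen(df, target, threshold=0.9):
    corrs = df.drop(columns=[target]).corrwith(df[target]).abs()
    return corrs[corrs > threshold].sort_values(ascending=False)

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.normal(size=500),
    "age": rng.normal(size=500),
    "target": rng.integers(0, 2, size=500),
})
df["leaked_flag"] = df["target"]  # a perfect copy of the label
print(leakage_screen(df, "target"))  # only leaked_flag should be flagged
```

A flagged feature is not automatically leaked (some legitimate features correlate strongly), but each one deserves a manual "would I have this at prediction time?" review.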

Debugging Steps

# 1. Check feature importances
print(model.feature_importances_)
# Is the top feature suspicious?

# 2. Inspect high-importance features
print(df.groupby('target')['suspicious_feature'].describe())
# Does it perfectly separate classes?

# 3. Check feature availability
# Would this feature exist at prediction time?

# 4. Temporal validation
# Train on past, test on future
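Step 2 can be taken further with a single-feature probe: fit a tiny model on each feature alone and flag any that score near-perfectly by itself. A sketch with synthetic data (`probe_features` is an illustrative helper, not a library function):

```python
# Sketch: a feature that alone predicts the target almost perfectly
# is a prime leakage suspect.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def probe_features(X, y, threshold=0.95):
    suspicious = []
    for j in range(X.shape[1]):
        auc = cross_val_score(DecisionTreeClassifier(max_depth=3),
                              X[:, [j]], y, cv=5, scoring="roc_auc").mean()
        if auc > threshold:
            suspicious.append((j, auc))
    return suspicious

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 3))
X[:, 2] = y  # deliberately leaked column
print(probe_features(X, y))  # only column 2 should be flagged
```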

Preventing Leakage

1. Use Pipelines

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# Scaling is refit inside each CV fold, preventing leakage
cross_val_score(pipeline, X, y, cv=5)

2. Split First, Process After

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Then process, fitting only on the training split
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

3. Use Proper CV for Groups

from sklearn.model_selection import GroupKFold

# Ensure same patient not in train and test
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=patient_ids):
    ...
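A quick way to convince yourself this works: enumerate the splits on toy data and check that no group ever appears on both sides. A minimal sketch:

```python
# Sketch: GroupKFold keeps every group entirely on one side of each split.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(20).reshape(-1, 1)
y = np.zeros(20)
groups = np.repeat(np.arange(5), 4)  # 5 "patients", 4 rows each

overlaps = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
    overlaps.append(set(groups[train_idx]) & set(groups[test_idx]))
print(all(len(o) == 0 for o in overlaps))  # True: no patient on both sides
```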

4. Use Time-based Splits

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # train always before test
    ...
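The guarantee can be checked the same way: in every split, the largest training index precedes the smallest test index. A small sketch:

```python
# Sketch: with TimeSeriesSplit, all training indices precede all test indices.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)
splits = list(TimeSeriesSplit(n_splits=5).split(X))
print(all(tr.max() < te.min() for tr, te in splits))  # True
```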

5. Ask "Would I Have This?"

For every feature:

"At prediction time, would this information be available?"

If no → remove or modify.

Example: Preventing Common Leakage

Feature Engineering

def create_features(df, reference_date):
    """Create features using only data before reference_date"""
    past_data = df[df['date'] < reference_date]
    
    features = {
        'n_purchases': len(past_data),
        'avg_amount': past_data['amount'].mean(),
        'days_since_last': (reference_date - past_data['date'].max()).days
    }
    return features
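A hypothetical usage of the function above, repeated here so the sketch runs standalone (the dates and amounts are made up):

```python
import pandas as pd

def create_features(df, reference_date):
    """Create features using only data before reference_date (as above)."""
    past_data = df[df['date'] < reference_date]
    return {
        'n_purchases': len(past_data),
        'avg_amount': past_data['amount'].mean(),
        'days_since_last': (reference_date - past_data['date'].max()).days
    }

df = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-05', '2024-02-10', '2024-03-20']),
    'amount': [20.0, 35.0, 50.0],
})
feats = create_features(df, pd.Timestamp('2024-03-01'))
print(feats)  # only the two purchases before 2024-03-01 are used
```

Because the cutoff is explicit, the March 20 purchase cannot leak into features computed as of March 1.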

Full Pipeline

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
])

pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('model', RandomForestClassifier())
])

# CV with group-aware splits
scores = cross_val_score(
    pipeline, X, y,
    cv=GroupKFold(n_splits=5),
    groups=group_ids
)

Leakage Checklist

  • Split before any preprocessing
  • Use pipelines for cross-validation
  • Check temporal ordering of features
  • Group-aware splitting if needed
  • Review feature importance for suspicious patterns
  • Verify features available at prediction time
  • Test in production-like environment

Key Takeaways

  1. Leakage = using unavailable information during training
  2. Causes unrealistic performance estimates
  3. Always split before preprocessing
  4. Use pipelines to automate correct ordering
  5. Ask: "Would I have this at prediction time?"
  6. Validate with proper temporal/group splits
