Data Leakage
Data leakage occurs when information that wouldn't be available at prediction time is used during training. It's one of the most common and dangerous mistakes in machine learning.
Why Leakage Is Dangerous
With leakage: Training AUC = 0.99, Production AUC = 0.65
Without leakage: Training AUC = 0.85, Production AUC = 0.83
Leakage gives false confidence. Models seem great but fail in production.
Types of Data Leakage
1. Target Leakage
Features contain information about the target:
Predicting: Will customer default on loan?
❌ Leaked feature: "collection_agency_contacted"
(Only happens after default!)
❌ Leaked feature: "account_closed_reason=default"
(Directly encodes the target)
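A synthetic sketch of the effect (the data and feature names here are made up for illustration): a feature that merely proxies the target drives the score toward a perfect AUC that will never survive production.

```python
# Synthetic sketch: a feature that proxies the target ("was a collection
# agency contacted?") makes the model look nearly perfect on paper.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
legit = rng.normal(size=1000)                   # genuinely uninformative
leaked = y + rng.normal(scale=0.05, size=1000)  # only exists after default

X = np.column_stack([legit, leaked])
model = LogisticRegression().fit(X, y)
auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
print(f"AUC with leaked feature: {auc:.2f}")
```

The leaked column separates the classes almost perfectly, so the model leans entirely on information that would not exist at prediction time.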
2. Train-Test Contamination
Test data influences training:
# ❌ Wrong: scaler is fit on all data, so it sees the test set
X_scaled = scaler.fit_transform(X)
X_train, X_test = train_test_split(X_scaled)
# ✓ Correct: split first, then fit the scaler on training data only
X_train, X_test = train_test_split(X)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
3. Temporal Leakage
Using future information to predict the past:
Predicting: Stock price tomorrow
❌ Using data from tomorrow
❌ Using "will_price_increase_next_week"
4. Group/Entity Leakage
Same entity in train and test:
Patient data: Multiple visits per patient
❌ Patient in both train and test
(Model memorizes patient, not patterns)
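A quick illustration with made-up patient IDs: a plain shuffled `KFold` routinely places visits from the same patient on both sides of the split, while `GroupKFold` keeps each patient wholly on one side.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

rng = np.random.default_rng(0)
patient_ids = np.repeat(np.arange(20), 5)   # 20 patients, 5 visits each
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Shuffled KFold: visits from one patient land on both sides of the split
tr, te = next(KFold(n_splits=5, shuffle=True, random_state=0).split(X))
shared = set(patient_ids[tr]) & set(patient_ids[te])
print(f"patients in both train and test: {len(shared)}")

# GroupKFold: every patient is entirely in train or entirely in test
tr, te = next(GroupKFold(n_splits=5).split(X, y, groups=patient_ids))
assert not (set(patient_ids[tr]) & set(patient_ids[te]))
```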
Common Leakage Sources
Preprocessing Before Split
| Step | ❌ Wrong | ✓ Correct |
|---|---|---|
| Scaling | Fit on all data | Fit only on train |
| Imputation | Use global mean | Use train mean |
| Feature selection | Select on all data | Select on train |
| Encoding | Fit on all data | Fit on train |
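The table's rules look like this in scikit-learn, sketched here for imputation on toy data: the fill statistic is computed from the training rows only, then applied to both sets.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0], [np.nan]])
X_train, X_test = train_test_split(X, test_size=0.5, random_state=0)

# ✓ Fit on training rows only: missing values in *both* sets are filled
#   with the train mean, so no test-set statistics reach the model
imputer = SimpleImputer(strategy="mean").fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
```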
Time-based Features
❌ customer_lifetime_value (includes future)
❌ final_purchase_date (future knowledge)
❌ total_transactions (if includes future)
ID-related Features
❌ patient_id → encodes which hospital → encodes outcome rates
❌ timestamp → encodes data collection period → encodes policy changes
Derived Features
❌ avg_order_value = total_revenue / n_orders
(If predicting revenue, this leaks the target)
Detecting Leakage
Signs of Leakage
- Too-good-to-be-true performance: AUC > 0.95 on hard problems
- Large train-production gap: Works great offline, fails online
- Features too correlated with target: r > 0.9
- Unreasonable feature importance: Unexpected feature dominates
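A rough screen for the third sign (the `suspicious_features` helper below is hypothetical, not a library function): flag numeric columns whose correlation with the target looks implausibly high.

```python
import pandas as pd

def suspicious_features(df: pd.DataFrame, target: str, threshold: float = 0.9):
    """Flag numeric features whose |correlation| with the target exceeds threshold."""
    corr = df.drop(columns=[target]).corrwith(df[target]).abs()
    return corr[corr > threshold].sort_values(ascending=False)

# Toy data: "leaked" perfectly tracks the target, "income" only weakly
df = pd.DataFrame({
    "income": [30, 50, 40, 60, 45, 55],
    "leaked": [0, 1, 0, 1, 0, 1],
    "target": [0, 1, 0, 1, 0, 1],
})
print(suspicious_features(df, "target"))
```

A flagged feature is not proof of leakage, but it is exactly the kind of column to trace back to its source before trusting the model.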
Debugging Steps
# 1. Check feature importances
print(model.feature_importances_)
# Is the top feature suspicious?
# 2. Inspect high-importance features
print(df.groupby('target')['suspicious_feature'].describe())
# Does it perfectly separate classes?
# 3. Check feature availability
# Would this feature exist at prediction time?
# 4. Temporal validation
# Train on past, test on future
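Step 4 can be sketched with a simple date cutoff (the `date` column and cutoff here are illustrative): everything strictly before the cutoff trains the model, everything after evaluates it.

```python
import pandas as pd

# Toy frame standing in for a real dataset with a timestamp column
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "value": range(10),
})
cutoff = pd.Timestamp("2024-01-08")
train = df[df["date"] < cutoff]    # past only
test = df[df["date"] >= cutoff]    # future only
print(len(train), len(test))  # 7 3
```

If the model's offline score collapses under this split but looked fine under a shuffled split, temporal leakage is the prime suspect.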
Preventing Leakage
1. Use Pipelines
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
# The scaler is refit on each CV training fold, preventing leakage
cross_val_score(pipeline, X, y, cv=5)
2. Split First, Process After
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Split first
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Then fit preprocessing on the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
3. Use Proper CV for Groups
from sklearn.model_selection import GroupKFold
# Ensure same patient not in train and test
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=patient_ids):
    ...
4. Use Time-based Splits
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # train indices always precede test indices
    ...
5. Ask "Would I Have This?"
For every feature:
"At prediction time, would this information be available?"
If no → remove or modify.
Example: Preventing Common Leakage
Feature Engineering
def create_features(df, reference_date):
    """Create features using only data available before reference_date."""
    past_data = df[df['date'] < reference_date]
    features = {
        'n_purchases': len(past_data),
        'avg_amount': past_data['amount'].mean(),
        'days_since_last': (reference_date - past_data['date'].max()).days
    }
    return features
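A quick usage sketch with toy purchase data (the function is repeated inside the snippet so it runs on its own): the purchase on 2024-02-01 is after the reference date, so it never enters the features.

```python
import pandas as pd

def create_features(df, reference_date):  # repeated from above
    past_data = df[df['date'] < reference_date]
    return {
        'n_purchases': len(past_data),
        'avg_amount': past_data['amount'].mean(),
        'days_since_last': (reference_date - past_data['date'].max()).days,
    }

df = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-01-10', '2024-02-01']),
    'amount': [10.0, 30.0, 99.0],
})
feats = create_features(df, pd.Timestamp('2024-01-15'))
print(feats['n_purchases'], feats['avg_amount'], feats['days_since_last'])  # 2 20.0 5
```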
Full Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GroupKFold
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
])
pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('model', RandomForestClassifier())
])
# CV with proper groups
scores = cross_val_score(
    pipeline, X, y,
    cv=GroupKFold(n_splits=5),
    groups=group_ids
)
Leakage Checklist
- Split before any preprocessing
- Use pipelines for cross-validation
- Check temporal ordering of features
- Group-aware splitting if needed
- Review feature importance for suspicious patterns
- Verify features available at prediction time
- Test in production-like environment
Key Takeaways
- Leakage = using unavailable information during training
- Causes unrealistic performance estimates
- Always split before preprocessing
- Use pipelines to automate correct ordering
- Ask: "Would I have this at prediction time?"
- Validate with proper temporal/group splits