A/B Testing for ML
A/B testing (split testing) is a method to compare two versions of a model or feature by randomly assigning users to groups and measuring which performs better.
Basic Concept
                ┌──────────────┐
                │ User Traffic │
                └──────┬───────┘
                       │
                 Random Split
                       │
            ┌──────────┴──────────┐
            │                     │
      ┌─────▼─────┐         ┌─────▼─────┐
      │  Group A  │         │  Group B  │
      │  Control  │         │ Treatment │
      │(old model)│         │(new model)│
      └─────┬─────┘         └─────┬─────┘
            │                     │
      Metric: 2.1%          Metric: 2.4%
            │                     │
            └──────────┬──────────┘
                       │
               Statistical Test
              Is 2.4% > 2.1%?
Key Components
1. Hypothesis
Null Hypothesis (H₀): New model has no effect (μA = μB)
Alternative (H₁): New model has an effect (μA ≠ μB)
2. Metrics
Primary metric: the main goal (e.g., conversion rate)
Guardrail metrics: must not degrade (e.g., latency, crash rate)
3. Statistical Parameters
α (significance level): P(reject H₀ | H₀ true) - typically 0.05
β (Type II error): P(fail to reject H₀ | H₁ true)
Power (1-β): P(reject H₀ | H₁ true) - typically 0.80
MDE (Minimum Detectable Effect): Smallest improvement worth detecting
Sample Size Calculation
```python
from scipy import stats
import numpy as np

def calculate_sample_size(
    baseline_rate,
    mde,          # minimum detectable effect (relative)
    alpha=0.05,
    power=0.80,
):
    """Calculate required sample size per group."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde)  # expected rate with treatment

    # Pooled proportion
    p_pool = (p1 + p2) / 2

    # Z-scores
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-tailed
    z_beta = stats.norm.ppf(power)

    # Sample size formula (per group)
    n = (2 * p_pool * (1 - p_pool) * (z_alpha + z_beta) ** 2) / (p2 - p1) ** 2
    return int(np.ceil(n))

# Example: 2% baseline CTR, want to detect a 10% relative improvement
n = calculate_sample_size(baseline_rate=0.02, mde=0.10)
print(f"Need {n:,} users per group")
# Output: Need 80,683 users per group
```
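Sample size is highly sensitive to the MDE: since n scales roughly with 1/MDE², halving the detectable effect quadruples the users needed. A quick self-contained sketch of the same pooled-variance formula (the helper name `sample_size` is just for this illustration):

```python
from scipy import stats
import numpy as np

def sample_size(p1, mde, alpha=0.05, power=0.80):
    # Same pooled-variance formula as calculate_sample_size above.
    p2 = p1 * (1 + mde)
    p_pool = (p1 + p2) / 2
    z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
    return int(np.ceil(2 * p_pool * (1 - p_pool) * z**2 / (p2 - p1) ** 2))

# Halving the MDE roughly quadruples the required sample size.
for mde in [0.05, 0.10, 0.20]:
    print(f"MDE {mde:.0%}: {sample_size(0.02, mde):,} users per group")
```

This is why teams often settle for a larger MDE than they would like: detecting a 5% relative lift on a 2% baseline needs several hundred thousand users per group.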
Running the Test
Randomization
```python
import hashlib

def assign_group(user_id, experiment_id, control_fraction=0.5):
    """Deterministic assignment based on user ID."""
    # Hashing ensures consistent assignment across sessions
    hash_input = f"{user_id}_{experiment_id}"
    hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
    # Normalize to [0, 1)
    normalized = (hash_value % 10000) / 10000
    return 'control' if normalized < control_fraction else 'treatment'

# The same user always gets the same assignment for a given experiment
print(assign_group("user123", "model_v2_test"))
print(assign_group("user456", "model_v2_test"))
```
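A healthy randomizer should also pass a sample ratio mismatch (SRM) check: over many users, the observed split should not deviate significantly from the intended fractions. A sketch (the simulated user IDs are illustrative; the assignment function restates the one above):

```python
import hashlib
from scipy import stats

def assign_group(user_id, experiment_id, control_fraction=0.5):
    # Same hash-based assignment as above.
    h = int(hashlib.md5(f"{user_id}_{experiment_id}".encode()).hexdigest(), 16)
    return 'control' if (h % 10000) / 10000 < control_fraction else 'treatment'

# SRM check: simulate 100k users and chi-square test the observed
# split against the intended 50/50.
counts = {'control': 0, 'treatment': 0}
for i in range(100_000):
    counts[assign_group(f"user{i}", "model_v2_test")] += 1

chi2, p = stats.chisquare([counts['control'], counts['treatment']])
print(counts, f"SRM p-value: {p:.3f}")  # p << 0.05 would signal a broken split
```

An SRM failure usually means a bug in assignment or logging, and it invalidates the experiment regardless of how the metrics look.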
Data Collection
```python
from datetime import datetime

class ABTestLogger:
    def __init__(self):
        self.events = []

    def log_exposure(self, user_id, experiment_id, group):
        """Log when a user is exposed to the experiment."""
        self.write_to_log({
            'timestamp': datetime.now(),
            'user_id': user_id,
            'experiment_id': experiment_id,
            'group': group,
            'event_type': 'exposure',
        })

    def log_conversion(self, user_id, experiment_id, value=1):
        """Log a conversion event."""
        self.write_to_log({
            'timestamp': datetime.now(),
            'user_id': user_id,
            'experiment_id': experiment_id,
            'value': value,
            'event_type': 'conversion',
        })

    def write_to_log(self, event):
        # In production this would write to a durable event store
        self.events.append(event)
```
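Before analysis, the raw event stream has to be aggregated into per-group counts, counting each user at most once per event type. One sketch of that step (the `summarize` helper and the event shapes are illustrative assumptions, not part of the logger above):

```python
from collections import defaultdict

def summarize(events):
    """Aggregate exposure/conversion events into per-group user counts."""
    group_of = {}  # user_id -> assigned group, taken from exposure events
    for e in events:
        if e['event_type'] == 'exposure':
            group_of[e['user_id']] = e['group']

    seen = defaultdict(set)  # (group, event_type) -> distinct user ids
    for e in events:
        g = group_of.get(e['user_id'])
        if g is not None:  # ignore events from users never exposed
            seen[(g, e['event_type'])].add(e['user_id'])

    return {g: {'exposed': len(seen[(g, 'exposure')]),
                'converted': len(seen[(g, 'conversion')])}
            for g in ('control', 'treatment')}

events = [
    {'user_id': 'u1', 'group': 'control',   'event_type': 'exposure'},
    {'user_id': 'u2', 'group': 'treatment', 'event_type': 'exposure'},
    {'user_id': 'u2', 'event_type': 'conversion'},
]
print(summarize(events))
# {'control': {'exposed': 1, 'converted': 0},
#  'treatment': {'exposed': 1, 'converted': 1}}
```

Deduplicating by user matters: a user who converts twice should not count as two conversions in a conversion-rate test.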
Statistical Analysis
Two-Proportion Z-Test
```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

def analyze_ab_test(control_conversions, control_total,
                    treatment_conversions, treatment_total):
    """Analyze A/B test results."""
    # Conversion rates
    p_control = control_conversions / control_total
    p_treatment = treatment_conversions / treatment_total

    # Relative lift
    lift = (p_treatment - p_control) / p_control * 100

    # Two-proportion z-test (proportions_ztest lives in statsmodels, not scipy)
    count = [treatment_conversions, control_conversions]
    nobs = [treatment_total, control_total]
    z_stat, p_value = proportions_ztest(count, nobs)

    # 95% confidence interval for the difference in rates
    se = np.sqrt(p_control * (1 - p_control) / control_total +
                 p_treatment * (1 - p_treatment) / treatment_total)
    ci_low = (p_treatment - p_control) - 1.96 * se
    ci_high = (p_treatment - p_control) + 1.96 * se

    return {
        'control_rate': p_control,
        'treatment_rate': p_treatment,
        'lift': lift,
        'p_value': p_value,
        'significant': p_value < 0.05,
        'ci_95': (ci_low, ci_high),
    }

# Example
results = analyze_ab_test(
    control_conversions=1050, control_total=50000,
    treatment_conversions=1150, treatment_total=50000,
)
print(f"Control:     {results['control_rate']:.2%}")
print(f"Treatment:   {results['treatment_rate']:.2%}")
print(f"Lift:        {results['lift']:.1f}%")
print(f"P-value:     {results['p_value']:.4f}")
print(f"Significant: {results['significant']}")
```
For Continuous Metrics (t-test)
```python
import numpy as np
from scipy import stats

def analyze_continuous_metric(control_values, treatment_values):
    """Analyze a continuous metric (e.g., revenue per user)."""
    t_stat, p_value = stats.ttest_ind(treatment_values, control_values)
    control_mean = np.mean(control_values)
    treatment_mean = np.mean(treatment_values)
    lift = (treatment_mean - control_mean) / control_mean * 100
    return {
        'control_mean': control_mean,
        'treatment_mean': treatment_mean,
        'lift': lift,
        'p_value': p_value,
        'significant': p_value < 0.05,
    }
```
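A usage sketch on synthetic data (the exponential revenue distributions and effect size are assumptions chosen for illustration; real revenue data is typically even more skewed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic revenue-per-user samples: treatment mean is 20% higher.
control   = rng.exponential(scale=10.0, size=5000)
treatment = rng.exponential(scale=12.0, size=5000)

t_stat, p_value = stats.ttest_ind(treatment, control)
lift = (treatment.mean() - control.mean()) / control.mean() * 100
print(f"lift={lift:.1f}%  p={p_value:.4f}")
```

For heavily skewed metrics like revenue, a t-test on means can be noisy; winsorizing outliers or using a nonparametric test (e.g., Mann-Whitney U) are common alternatives.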
Common Pitfalls
1. Peeking Problem
❌ Check results daily, stop when significant
Day 1: p=0.08 (continue)
Day 2: p=0.03 (stop! significant!) ← FALSE POSITIVE
✓ Pre-commit to sample size and test duration
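The inflation from peeking is easy to demonstrate with a Monte Carlo sketch: both groups are drawn from the same distribution (so H₀ is true), yet stopping at the first significant interim look produces far more than 5% false positives (the peek schedule and sample sizes below are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def stops_early(n_peeks=10, n_per_peek=1000, alpha=0.05):
    """Both groups come from the SAME distribution (H0 is true).
    Return True if any interim look reports p < alpha."""
    a = np.array([])
    b = np.array([])
    for _ in range(n_peeks):
        a = np.concatenate([a, rng.normal(size=n_per_peek)])
        b = np.concatenate([b, rng.normal(size=n_per_peek)])
        if stats.ttest_ind(a, b).pvalue < alpha:
            return True  # experimenter "stops" on a false positive
    return False

n_sims = 500
fp = sum(stops_early() for _ in range(n_sims)) / n_sims
print(f"False positive rate with peeking: {fp:.1%}  (nominal: 5.0%)")
```

If interim looks are genuinely needed, sequential testing methods (e.g., group-sequential boundaries or always-valid p-values) adjust for them; ad-hoc peeking does not.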
2. Multiple Testing
```python
# Testing many metrics inflates the false positive rate
n_metrics = 20
alpha = 0.05

# Probability of at least one false positive across 20 independent tests
prob_false_positive = 1 - (1 - alpha) ** n_metrics
print(f"P(false positive): {prob_false_positive:.1%}")  # 64.2%!

# Solution: Bonferroni correction
adjusted_alpha = alpha / n_metrics  # 0.0025
```
3. Selection Bias
❌ Only include users who completed onboarding
(Treatment might affect onboarding rate!)
✓ Include all assigned users (intent-to-treat)
4. Novelty/Primacy Effects
Week 1: New model +15% (users exploring)
Week 2: New model +8%
Week 3: New model +5% (stabilized)
✓ Run test long enough for effects to stabilize
ML-Specific Considerations
1. Interleaving for Ranking Models
Instead of: Group A sees Model A's results, Group B sees Model B's results.
Interleave: each user sees a single list that mixes results from both models; compare which model's results get more clicks.
Benefit: every user sees both models, so the comparison converges much faster.
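One common scheme is team-draft interleaving: the two rankings alternate "drafting" their best remaining item into the shown list, and a click is credited to whichever model drafted that item. A simplified sketch (the helper name and click-credit bookkeeping here are illustrative, not a production implementation):

```python
import random

def team_draft_interleave(ranking_a, ranking_b, seed=None):
    """Interleave two rankings; return the shown list and, per item,
    which model ('A' or 'B') drafted it."""
    rng = random.Random(seed)
    shown, credit = [], {}
    ia = ib = 0
    while ia < len(ranking_a) or ib < len(ranking_b):
        # Randomize draft order each round to avoid position bias.
        for model in rng.sample(['A', 'B'], 2):
            src, i = (ranking_a, ia) if model == 'A' else (ranking_b, ib)
            while i < len(src) and src[i] in credit:
                i += 1  # skip items the other model already drafted
            if model == 'A':
                ia = i
            else:
                ib = i
            if i < len(src):
                shown.append(src[i])
                credit[src[i]] = model

    return shown, credit

shown, credit = team_draft_interleave(['d1', 'd2', 'd3'],
                                      ['d3', 'd4', 'd1'], seed=0)
clicks = ['d3', 'd4']  # hypothetical user clicks
wins = {'A': 0, 'B': 0}
for doc in clicks:
    wins[credit[doc]] += 1
print(shown, wins)
```

Aggregated over many queries, the model with more credited clicks wins; because each user evaluates both models at once, the variance is far lower than in a traditional split.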
2. Offline vs Online Metrics
Offline (test set): AUC improved 2%
Online (A/B test): Revenue unchanged
Why? Offline metrics may not capture real user behavior
✓ Always validate with online A/B test
3. Shadow Mode Testing
┌──────────────────────────────────────┐
│              Production              │
├──────────────────────────────────────┤
│  Old Model → serves users            │
│  New Model → logs predictions only   │
└──────────────────────────────────────┘
Compare predictions offline before exposing to users
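A minimal sketch of a shadow-mode wrapper (the class shape and in-memory log are assumptions for illustration; production systems would log asynchronously to a durable store):

```python
class ShadowDeployment:
    """Serve the production model's prediction; run the candidate
    in shadow and log its output without affecting users."""

    def __init__(self, prod_model, shadow_model):
        self.prod_model = prod_model
        self.shadow_model = shadow_model
        self.shadow_log = []

    def predict(self, features):
        prod_pred = self.prod_model(features)
        try:
            shadow_pred = self.shadow_model(features)
            self.shadow_log.append(
                {'features': features, 'prod': prod_pred, 'shadow': shadow_pred})
        except Exception:
            pass  # shadow failures must never break the user-facing path
        return prod_pred  # users only ever see the production output

# Toy models: same score, slightly different decision thresholds.
old = lambda x: x > 0.5
new = lambda x: x > 0.4
deploy = ShadowDeployment(old, new)
preds = [deploy.predict(x) for x in [0.3, 0.45, 0.7]]
disagreements = sum(e['prod'] != e['shadow'] for e in deploy.shadow_log)
print(preds, f"disagreements: {disagreements}/3")
```

Analyzing the logged disagreements tells you where the models differ and lets you sanity-check latency and error rates on real traffic before any user sees the new model.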
Best Practices
- Pre-register your hypothesis and metrics
- Calculate sample size before starting
- Don't peek at results early
- Run for full cycles (capture weekly patterns)
- Check guardrail metrics (latency, errors)
- Document everything for reproducibility
Key Takeaways
- A/B testing compares model versions on real users
- Calculate required sample size before running
- Use proper statistical tests (z-test, t-test)
- Avoid peeking and multiple testing problems
- Run long enough to capture stable effects
- Offline metrics don't guarantee online success