A/B Testing for ML
A/B testing (split testing) is a method to compare two versions of a model or feature by randomly assigning users to groups and measuring which performs better.
Basic Concept
                ┌──────────────┐
                │ User Traffic │
                └──────┬───────┘
                       │
                 Random Split
                       │
            ┌──────────┴──────────┐
            │                     │
      ┌─────▼─────┐         ┌─────▼─────┐
      │  Group A  │         │  Group B  │
      │  Control  │         │ Treatment │
      │(old model)│         │(new model)│
      └─────┬─────┘         └─────┬─────┘
            │                     │
      Metric: 2.1%          Metric: 2.4%
            │                     │
            └──────────┬──────────┘
                       │
               Statistical Test
              Is 2.4% > 2.1%?
Key Components
1. Hypothesis
Null Hypothesis (H₀): New model has no effect (μA = μB)
Alternative (H₁): New model has an effect (μA ≠ μB)
2. Metrics
Primary metric: the main goal (e.g., conversion rate)
Guardrail metrics: must not degrade (e.g., latency, crash rate)
3. Statistical Parameters
α (significance level): P(reject H₀ | H₀ true) - typically 0.05
β (Type II error): P(fail to reject H₀ | H₁ true)
Power (1-β): P(reject H₀ | H₁ true) - typically 0.80
MDE (Minimum Detectable Effect): Smallest improvement worth detecting
Sample Size Calculation
```python
from scipy import stats
import numpy as np

def calculate_sample_size(
    baseline_rate,
    mde,          # minimum detectable effect (relative)
    alpha=0.05,
    power=0.80,
):
    """Calculate required sample size per group."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde)  # expected rate with treatment

    # Pooled proportion
    p_pool = (p1 + p2) / 2

    # Z-scores
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-tailed
    z_beta = stats.norm.ppf(power)

    # Sample size formula (per group)
    n = (2 * p_pool * (1 - p_pool) * (z_alpha + z_beta) ** 2) / (p2 - p1) ** 2
    return int(np.ceil(n))

# Example: 2% baseline CTR, want to detect a 10% relative improvement
n = calculate_sample_size(baseline_rate=0.02, mde=0.10)
print(f"Need {n:,} users per group")
# Output: Need 80,683 users per group
```
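Sample size is highly sensitive to the MDE: since n scales roughly with 1/MDE², halving the detectable effect quadruples the users needed. A quick self-contained sketch of the same pooled-variance formula (the helper name `sample_size` is just for this illustration):

```python
from scipy import stats
import numpy as np

def sample_size(p1, mde, alpha=0.05, power=0.80):
    # Same pooled-variance formula as calculate_sample_size above.
    p2 = p1 * (1 + mde)
    p_pool = (p1 + p2) / 2
    z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
    return int(np.ceil(2 * p_pool * (1 - p_pool) * z**2 / (p2 - p1) ** 2))

# Halving the MDE roughly quadruples the required sample size.
for mde in [0.05, 0.10, 0.20]:
    print(f"MDE {mde:.0%}: {sample_size(0.02, mde):,} users per group")
```

This is why teams often settle for a larger MDE than they would like: detecting a 5% relative lift on a 2% baseline needs several hundred thousand users per group.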
Running the Test
Randomization
```python
import hashlib

def assign_group(user_id, experiment_id, control_fraction=0.5):
    """Deterministic assignment based on user ID."""
    # Hashing ensures consistent assignment across sessions
    hash_input = f"{user_id}_{experiment_id}"
    hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
    # Normalize to [0, 1)
    normalized = (hash_value % 10000) / 10000
    return 'control' if normalized < control_fraction else 'treatment'

# The same user always gets the same assignment for a given experiment
print(assign_group("user123", "model_v2_test"))
print(assign_group("user456", "model_v2_test"))
```
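A healthy randomizer should also pass a sample ratio mismatch (SRM) check: over many users, the observed split should not deviate significantly from the intended fractions. A sketch (the simulated user IDs are illustrative; the assignment function restates the one above):

```python
import hashlib
from scipy import stats

def assign_group(user_id, experiment_id, control_fraction=0.5):
    # Same hash-based assignment as above.
    h = int(hashlib.md5(f"{user_id}_{experiment_id}".encode()).hexdigest(), 16)
    return 'control' if (h % 10000) / 10000 < control_fraction else 'treatment'

# SRM check: simulate 100k users and chi-square test the observed
# split against the intended 50/50.
counts = {'control': 0, 'treatment': 0}
for i in range(100_000):
    counts[assign_group(f"user{i}", "model_v2_test")] += 1

chi2, p = stats.chisquare([counts['control'], counts['treatment']])
print(counts, f"SRM p-value: {p:.3f}")  # p << 0.05 would signal a broken split
```

An SRM failure usually means a bug in assignment or logging, and it invalidates the experiment regardless of how the metrics look.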
Data Collection
```python
from datetime import datetime

class ABTestLogger:
    def __init__(self):
        self.events = []

    def log_exposure(self, user_id, experiment_id, group):
        """Log when a user is exposed to the experiment."""
        self.write_to_log({
            'timestamp': datetime.now(),
            'user_id': user_id,
            'experiment_id': experiment_id,
            'group': group,
            'event_type': 'exposure',
        })

    def log_conversion(self, user_id, experiment_id, value=1):
        """Log a conversion event."""
        self.write_to_log({
            'timestamp': datetime.now(),
            'user_id': user_id,
            'experiment_id': experiment_id,
            'value': value,
            'event_type': 'conversion',
        })

    def write_to_log(self, event):
        # In production this would write to a durable event store
        self.events.append(event)
```
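Before analysis, the raw event stream has to be aggregated into per-group counts, counting each user at most once per event type. One sketch of that step (the `summarize` helper and the event shapes are illustrative assumptions, not part of the logger above):

```python
from collections import defaultdict

def summarize(events):
    """Aggregate exposure/conversion events into per-group user counts."""
    group_of = {}  # user_id -> assigned group, taken from exposure events
    for e in events:
        if e['event_type'] == 'exposure':
            group_of[e['user_id']] = e['group']

    seen = defaultdict(set)  # (group, event_type) -> distinct user ids
    for e in events:
        g = group_of.get(e['user_id'])
        if g is not None:  # ignore events from users never exposed
            seen[(g, e['event_type'])].add(e['user_id'])

    return {g: {'exposed': len(seen[(g, 'exposure')]),
                'converted': len(seen[(g, 'conversion')])}
            for g in ('control', 'treatment')}

events = [
    {'user_id': 'u1', 'group': 'control',   'event_type': 'exposure'},
    {'user_id': 'u2', 'group': 'treatment', 'event_type': 'exposure'},
    {'user_id': 'u2', 'event_type': 'conversion'},
]
print(summarize(events))
# {'control': {'exposed': 1, 'converted': 0},
#  'treatment': {'exposed': 1, 'converted': 1}}
```

Deduplicating by user matters: a user who converts twice should not count as two conversions in a conversion-rate test.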
Statistical Analysis
Two-Proportion Z-Test
```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

def analyze_ab_test(control_conversions, control_total,
                    treatment_conversions, treatment_total):
    """Analyze A/B test results."""
    # Conversion rates
    p_control = control_conversions / control_total
    p_treatment = treatment_conversions / treatment_total

    # Relative lift
    lift = (p_treatment - p_control) / p_control * 100

    # Two-proportion z-test (proportions_ztest lives in statsmodels, not scipy)
    count = [treatment_conversions, control_conversions]
    nobs = [treatment_total, control_total]
    z_stat, p_value = proportions_ztest(count, nobs)

    # 95% confidence interval for the difference in rates
    se = np.sqrt(p_control * (1 - p_control) / control_total +
                 p_treatment * (1 - p_treatment) / treatment_total)
    ci_low = (p_treatment - p_control) - 1.96 * se
    ci_high = (p_treatment - p_control) + 1.96 * se

    return {
        'control_rate': p_control,
        'treatment_rate': p_treatment,
        'lift': lift,
        'p_value': p_value,
        'significant': p_value < 0.05,
        'ci_95': (ci_low, ci_high),
    }

# Example
results = analyze_ab_test(
    control_conversions=1050, control_total=50000,
    treatment_conversions=1150, treatment_total=50000,
)
print(f"Control:     {results['control_rate']:.2%}")
print(f"Treatment:   {results['treatment_rate']:.2%}")
print(f"Lift:        {results['lift']:.1f}%")
print(f"P-value:     {results['p_value']:.4f}")
print(f"Significant: {results['significant']}")
```
For Continuous Metrics (t-test)
```python
import numpy as np
from scipy import stats

def analyze_continuous_metric(control_values, treatment_values):
    """Analyze a continuous metric (e.g., revenue per user)."""
    t_stat, p_value = stats.ttest_ind(treatment_values, control_values)
    control_mean = np.mean(control_values)
    treatment_mean = np.mean(treatment_values)
    lift = (treatment_mean - control_mean) / control_mean * 100
    return {
        'control_mean': control_mean,
        'treatment_mean': treatment_mean,
        'lift': lift,
        'p_value': p_value,
        'significant': p_value < 0.05,
    }
```
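A usage sketch on synthetic data (the exponential revenue distributions and effect size are assumptions chosen for illustration; real revenue data is typically even more skewed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic revenue-per-user samples: treatment mean is 20% higher.
control   = rng.exponential(scale=10.0, size=5000)
treatment = rng.exponential(scale=12.0, size=5000)

t_stat, p_value = stats.ttest_ind(treatment, control)
lift = (treatment.mean() - control.mean()) / control.mean() * 100
print(f"lift={lift:.1f}%  p={p_value:.4f}")
```

For heavily skewed metrics like revenue, a t-test on means can be noisy; winsorizing outliers or using a nonparametric test (e.g., Mann-Whitney U) are common alternatives.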
Common Pitfalls
1. Peeking Problem
❌ Check results daily, stop when significant
Day 1: p=0.08 (continue)
Day 2: p=0.03 (stop! significant!) ← FALSE POSITIVE
✓ Pre-commit to sample size and test duration
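The inflation from peeking is easy to demonstrate with a Monte Carlo sketch: both groups are drawn from the same distribution (so H₀ is true), yet stopping at the first significant interim look produces far more than 5% false positives (the peek schedule and sample sizes below are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def stops_early(n_peeks=10, n_per_peek=1000, alpha=0.05):
    """Both groups come from the SAME distribution (H0 is true).
    Return True if any interim look reports p < alpha."""
    a = np.array([])
    b = np.array([])
    for _ in range(n_peeks):
        a = np.concatenate([a, rng.normal(size=n_per_peek)])
        b = np.concatenate([b, rng.normal(size=n_per_peek)])
        if stats.ttest_ind(a, b).pvalue < alpha:
            return True  # experimenter "stops" on a false positive
    return False

n_sims = 500
fp = sum(stops_early() for _ in range(n_sims)) / n_sims
print(f"False positive rate with peeking: {fp:.1%}  (nominal: 5.0%)")
```

If interim looks are genuinely needed, sequential testing methods (e.g., group-sequential boundaries or always-valid p-values) adjust for them; ad-hoc peeking does not.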
2. Multiple Testing
```python
# Testing many metrics inflates the false positive rate
n_metrics = 20
alpha = 0.05

# Probability of at least one false positive across 20 independent tests
prob_false_positive = 1 - (1 - alpha) ** n_metrics
print(f"P(false positive): {prob_false_positive:.1%}")  # 64.2%!

# Solution: Bonferroni correction
adjusted_alpha = alpha / n_metrics  # 0.0025
```
3. Selection Bias
❌ Only include users who completed onboarding
(Treatment might affect onboarding rate!)
✓ Include all assigned users (intent-to-treat)
4. Novelty/Primacy Effects
Week 1: New model +15% (users exploring)
Week 2: New model +8%
Week 3: New model +5% (stabilized)
✓ Run test long enough for effects to stabilize
ML-Specific Considerations
1. Interleaving for Ranking Models
Instead of: Group A sees Model A's results, Group B sees Model B's results.
Interleave: each user sees a single list that mixes results from both models; compare which model's results get more clicks.
Benefit: every user sees both models, so the comparison converges much faster.
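One common scheme is team-draft interleaving: the two rankings alternate "drafting" their best remaining item into the shown list, and a click is credited to whichever model drafted that item. A simplified sketch (the helper name and click-credit bookkeeping here are illustrative, not a production implementation):

```python
import random

def team_draft_interleave(ranking_a, ranking_b, seed=None):
    """Interleave two rankings; return the shown list and, per item,
    which model ('A' or 'B') drafted it."""
    rng = random.Random(seed)
    shown, credit = [], {}
    ia = ib = 0
    while ia < len(ranking_a) or ib < len(ranking_b):
        # Randomize draft order each round to avoid position bias.
        for model in rng.sample(['A', 'B'], 2):
            src, i = (ranking_a, ia) if model == 'A' else (ranking_b, ib)
            while i < len(src) and src[i] in credit:
                i += 1  # skip items the other model already drafted
            if model == 'A':
                ia = i
            else:
                ib = i
            if i < len(src):
                shown.append(src[i])
                credit[src[i]] = model

    return shown, credit

shown, credit = team_draft_interleave(['d1', 'd2', 'd3'],
                                      ['d3', 'd4', 'd1'], seed=0)
clicks = ['d3', 'd4']  # hypothetical user clicks
wins = {'A': 0, 'B': 0}
for doc in clicks:
    wins[credit[doc]] += 1
print(shown, wins)
```

Aggregated over many queries, the model with more credited clicks wins; because each user evaluates both models at once, the variance is far lower than in a traditional split.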
2. Offline vs Online Metrics
Offline (test set): AUC improved 2%
Online (A/B test): Revenue unchanged
Why? Offline metrics may not capture real user behavior
✓ Always validate with online A/B test
3. Shadow Mode Testing
┌──────────────────────────────────────┐
│              Production              │
├──────────────────────────────────────┤
│  Old Model → serves users            │
│  New Model → logs predictions only   │
└──────────────────────────────────────┘
Compare predictions offline before exposing to users
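A minimal sketch of a shadow-mode wrapper (the class shape and in-memory log are assumptions for illustration; production systems would log asynchronously to a durable store):

```python
class ShadowDeployment:
    """Serve the production model's prediction; run the candidate
    in shadow and log its output without affecting users."""

    def __init__(self, prod_model, shadow_model):
        self.prod_model = prod_model
        self.shadow_model = shadow_model
        self.shadow_log = []

    def predict(self, features):
        prod_pred = self.prod_model(features)
        try:
            shadow_pred = self.shadow_model(features)
            self.shadow_log.append(
                {'features': features, 'prod': prod_pred, 'shadow': shadow_pred})
        except Exception:
            pass  # shadow failures must never break the user-facing path
        return prod_pred  # users only ever see the production output

# Toy models: same score, slightly different decision thresholds.
old = lambda x: x > 0.5
new = lambda x: x > 0.4
deploy = ShadowDeployment(old, new)
preds = [deploy.predict(x) for x in [0.3, 0.45, 0.7]]
disagreements = sum(e['prod'] != e['shadow'] for e in deploy.shadow_log)
print(preds, f"disagreements: {disagreements}/3")
```

Analyzing the logged disagreements tells you where the models differ and lets you sanity-check latency and error rates on real traffic before any user sees the new model.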
Best Practices
- Pre-register your hypothesis and metrics
- Calculate sample size before starting
- Don't peek at results early
- Run for full cycles (capture weekly patterns)
- Check guardrail metrics (latency, errors)
- Document everything for reproducibility
Key Takeaways
- A/B testing compares model versions on real users
- Calculate required sample size before running
- Use proper statistical tests (z-test, t-test)
- Avoid peeking and multiple testing problems
- Run long enough to capture stable effects
- Offline metrics don't guarantee online success