Hypothesis Testing
Hypothesis testing is a statistical method for making decisions about populations from sample data: it asks whether an observed effect is real or plausibly due to random chance.
Core Concepts
The Hypotheses
Null Hypothesis (H₀): Default assumption, usually "no effect"
Alternative Hypothesis (H₁): What we're trying to find evidence for
Example:
H₀: The new model has the same accuracy as the old model
H₁: The new model has different (better/worse) accuracy
The Process
1. State hypotheses (H₀ and H₁)
2. Choose significance level (α)
3. Collect data
4. Calculate test statistic
5. Compute p-value
6. Make decision: Reject H₀ if p < α
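The six steps above can be run end-to-end in Python. This is a minimal sketch with made-up accuracy values; the test statistic is also computed by hand to show where it comes from:

```python
# Walking through the six steps with a one-sample t-test.
# The accuracy values below are illustrative, not real measurements.
import numpy as np
from scipy import stats

# 1. Hypotheses: H0: mean accuracy = 0.80, H1: mean accuracy != 0.80
mu0 = 0.80
# 2. Significance level
alpha = 0.05
# 3. Data
accuracies = np.array([0.82, 0.85, 0.79, 0.84, 0.81, 0.83])
# 4. Test statistic: t = (x̄ - μ0) / (s / √n)
n = len(accuracies)
t_manual = (accuracies.mean() - mu0) / (accuracies.std(ddof=1) / np.sqrt(n))
# 5. p-value (two-tailed), via SciPy for comparison
t_stat, p_value = stats.ttest_1samp(accuracies, mu0)
# 6. Decision
decision = "Reject H0" if p_value < alpha else "Fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.4f} -> {decision}")
```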
P-Value
Definition: Probability of observing results at least as extreme as the data, assuming H₀ is true.
P-value = P(data as extreme or more | H₀ is true)
Small p-value: Data unlikely under H₀ → Reject H₀
Large p-value: Data consistent with H₀ → Fail to reject H₀
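One way to make this definition concrete is a small permutation simulation (a sketch with illustrative data): under H₀ the group labels are exchangeable, so the p-value can be estimated as the fraction of label shuffles whose mean difference is at least as extreme as the observed one:

```python
# Estimating a p-value by permutation: shuffle the group labels
# (valid under H0) and count how often the shuffled difference in
# means is at least as extreme as the observed difference.
import numpy as np

rng = np.random.default_rng(0)
model_a = np.array([0.82, 0.85, 0.79, 0.84, 0.81])
model_b = np.array([0.88, 0.87, 0.90, 0.86, 0.89])
observed = abs(model_a.mean() - model_b.mean())

pooled = np.concatenate([model_a, model_b])
n_perm = 10_000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = abs(pooled[:5].mean() - pooled[5:].mean())
    if diff >= observed - 1e-12:  # tolerance guards against float rounding
        count += 1
p_value = count / n_perm
print(f"Estimated p-value: {p_value:.4f}")
```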
Common Thresholds
α = 0.05: Standard (5% false positive rate)
α = 0.01: Stringent (e.g., medical research)
α = 0.10: Lenient (exploratory research)
Types of Errors
                        Reality
                 H₀ True    H₀ False
             ┌──────────┬──────────┐
   Reject H₀ │  Type I  │ Correct  │
Decision     │   (α)    │ (Power)  │
             ├──────────┼──────────┤
     Keep H₀ │ Correct  │ Type II  │
             │  (1-α)   │   (β)    │
             └──────────┴──────────┘
Type I Error (α): False positive - rejecting true H₀
Type II Error (β): False negative - keeping false H₀
Power (1-β): Correctly rejecting false H₀
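A quick simulation (illustrative data, not from the text) shows what α controls: if both samples are drawn from the same distribution, H₀ is true by construction, so a test at α = 0.05 should reject in roughly 5% of repetitions:

```python
# Simulating the Type I error rate: both samples come from the same
# normal distribution, so every rejection is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, alpha = 5_000, 0.05
false_positives = 0
for _ in range(n_sims):
    a = rng.normal(0.8, 0.05, size=30)
    b = rng.normal(0.8, 0.05, size=30)  # same distribution: H0 is true
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1
print(f"Observed Type I error rate: {false_positives / n_sims:.3f}")
```

The observed rate should land close to the nominal α = 0.05, up to simulation noise.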
Common Statistical Tests
One-Sample t-Test
Is the mean different from a known value?
from scipy import stats
# Test if model accuracy differs from 80%
accuracies = [0.82, 0.85, 0.79, 0.84, 0.81, 0.83]
t_stat, p_value = stats.ttest_1samp(accuracies, 0.80)
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Reject H₀: Accuracy differs from 80%")
Two-Sample t-Test
Are two means different?
# Compare two model accuracies
model_a = [0.82, 0.85, 0.79, 0.84, 0.81]
model_b = [0.88, 0.87, 0.90, 0.86, 0.89]
# Independent samples
t_stat, p_value = stats.ttest_ind(model_a, model_b)
print(f"p-value: {p_value:.4f}")
# Paired samples (same test data, different models)
t_stat, p_value = stats.ttest_rel(model_a, model_b)
print(f"p-value (paired): {p_value:.4f}")
Chi-Square Test
Are categorical variables independent?
import numpy as np
from scipy.stats import chi2_contingency
# Confusion matrix comparison
observed = np.array([
    [50, 10],  # Model A: [correct, incorrect]
    [45, 15]   # Model B: [correct, incorrect]
])
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"Chi-square: {chi2:.3f}")
print(f"p-value: {p_value:.4f}")
ANOVA (Analysis of Variance)
Compare means of 3+ groups:
# Compare 3 models
model_a = [0.82, 0.85, 0.79, 0.84, 0.81]
model_b = [0.88, 0.87, 0.90, 0.86, 0.89]
model_c = [0.80, 0.83, 0.78, 0.82, 0.79]
f_stat, p_value = stats.f_oneway(model_a, model_b, model_c)
print(f"F-statistic: {f_stat:.3f}")
print(f"p-value: {p_value:.4f}")
Mann-Whitney U Test
Non-parametric alternative to t-test:
# When data isn't normally distributed
from scipy.stats import mannwhitneyu
u_stat, p_value = mannwhitneyu(model_a, model_b, alternative='two-sided')
print(f"p-value: {p_value:.4f}")
One-Tailed vs Two-Tailed
Two-tailed (H₁: μ ≠ μ₀):
Tests for any difference
p-value = P(|t| > |t_observed|)
One-tailed (H₁: μ > μ₀ or H₁: μ < μ₀):
Tests for directional difference
p-value = P(t > t_observed) or P(t < t_observed)
# Two-tailed (default)
t_stat, p_value_two = stats.ttest_ind(model_a, model_b)
# One-tailed (is B better than A, i.e. H₁: μ_A < μ_B?)
p_value_one = p_value_two / 2 if t_stat < 0 else 1 - p_value_two / 2
# SciPy ≥ 1.6 can compute this directly:
# t_stat, p_value_one = stats.ttest_ind(model_a, model_b, alternative='less')
Multiple Testing Problem
With α = 0.05 and 20 tests:
P(at least one false positive) = 1 - 0.95²⁰ ≈ 64%!
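The same calculation for a few values of the number of independent tests shows how quickly the family-wise error rate grows:

```python
# Family-wise error rate: P(at least one false positive) = 1 - (1 - α)^m
# for m independent tests at significance level α.
alpha = 0.05
for m in (1, 5, 20, 100):
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:>3} tests: P(>=1 false positive) = {fwer:.1%}")
```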
Corrections
from statsmodels.stats.multitest import multipletests
p_values = [0.01, 0.03, 0.05, 0.02, 0.15]
# Bonferroni correction (conservative)
rejected, p_adjusted, _, _ = multipletests(p_values, method='bonferroni')
# Benjamini-Hochberg (FDR control, less conservative)
rejected, p_adjusted, _, _ = multipletests(p_values, method='fdr_bh')
print(f"Original p-values: {p_values}")
print(f"Adjusted p-values: {p_adjusted}")
print(f"Rejected: {rejected}")
Effect Size
Statistical significance ≠ practical significance:
# Cohen's d for comparing means
def cohens_d(group1, group2):
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    pooled_std = np.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2))
    return (np.mean(group1) - np.mean(group2)) / pooled_std
d = cohens_d(model_a, model_b)
print(f"Cohen's d: {d:.3f}")
# Interpretation:
# |d| < 0.2: Small effect
# |d| = 0.5: Medium effect
# |d| > 0.8: Large effect
Confidence Intervals
import numpy as np
from scipy import stats
data = [0.82, 0.85, 0.79, 0.84, 0.81, 0.83]
mean = np.mean(data)
sem = stats.sem(data) # Standard error of mean
# 95% confidence interval
# (SciPy ≥ 1.9; older versions use alpha= instead of confidence=)
ci = stats.t.interval(
    confidence=0.95,
    df=len(data)-1,
    loc=mean,
    scale=sem
)
print(f"Mean: {mean:.3f}")
print(f"95% CI: [{ci[0]:.3f}, {ci[1]:.3f}]")
Power Analysis
from statsmodels.stats.power import TTestIndPower
power_analysis = TTestIndPower()
# Calculate required sample size
sample_size = power_analysis.solve_power(
    effect_size=0.5,  # Expected Cohen's d
    power=0.8,        # Desired power
    alpha=0.05,       # Significance level
    ratio=1.0         # Equal group sizes
)
print(f"Required sample size per group: {int(np.ceil(sample_size))}")
# Calculate power given sample size
power = power_analysis.power(
    effect_size=0.5,
    nobs1=50,
    alpha=0.05,
    ratio=1.0
)
print(f"Power with n=50: {power:.2%}")
Common Mistakes
1. Misinterpreting P-Values
❌ "p = 0.03 means 3% chance H₀ is true"
✓ "p = 0.03 means a 3% chance of seeing data at least this extreme if H₀ is true"
2. P-Hacking
❌ Try many tests, report only significant ones
❌ Stop data collection when p < 0.05
❌ Remove outliers to get significance
✓ Pre-register your analysis plan
✓ Report all tests performed
✓ Use corrections for multiple testing
3. Ignoring Effect Size
❌ "p < 0.001, so the effect is huge"
✓ Report both p-value AND effect size/CI
With n=10,000, even tiny differences are "significant"
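A sketch with simulated data illustrates the point: with 10,000 samples per group, even a half-percentage-point difference in mean accuracy typically reaches p far below 0.05, while Cohen's d stays around 0.1 (a small effect):

```python
# Large n makes tiny effects statistically significant.
# The data here are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
a = rng.normal(0.800, 0.05, size=10_000)
b = rng.normal(0.805, 0.05, size=10_000)  # true difference of only 0.005
t_stat, p_value = stats.ttest_ind(a, b)
# Cohen's d with the pooled standard deviation
d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
print(f"p-value: {p_value:.2e}, Cohen's d: {d:.3f}")
```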
4. Assuming Non-Significance = No Effect
❌ "p = 0.08, so there's no difference"
✓ "We failed to detect a difference (could be low power)"
Hypothesis Testing in ML
Comparing Model Accuracies
# McNemar's test for paired classifiers
from statsmodels.stats.contingency_tables import mcnemar
# Counts of predictions:
#                    Model B
#                 Wrong   Right
# Model A  Wrong    a       b
#          Right    c       d
table = [[10, 15],   # A wrong
         [5, 70]]    # A right
result = mcnemar(table, exact=False)
print(f"p-value: {result.pvalue:.4f}")
Cross-Validation Comparison
# Paired t-test on CV folds (model_a and model_b here are unfitted
# scikit-learn estimators, and X, y a dataset, defined elsewhere)
from sklearn.model_selection import cross_val_score
scores_a = cross_val_score(model_a, X, y, cv=10)
scores_b = cross_val_score(model_b, X, y, cv=10)
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
print(f"Model A: {scores_a.mean():.3f} ± {scores_a.std():.3f}")
print(f"Model B: {scores_b.mean():.3f} ± {scores_b.std():.3f}")
print(f"p-value: {p_value:.4f}")
Key Takeaways
- P-value = probability of data (or more extreme) given H₀ is true
- Reject H₀ when p < α (typically 0.05)
- Always report effect sizes, not just p-values
- Correct for multiple testing when running many tests
- Calculate required sample size before experiments
- Non-significant ≠ no effect (could be low power)