Hypothesis Testing
Hypothesis testing is a statistical method for making decisions about populations from sample data: it asks whether an observed effect is real or plausibly due to random chance.
Core Concepts
The Hypotheses
Null Hypothesis (H₀): Default assumption, usually "no effect"
Alternative Hypothesis (H₁): What we're trying to find evidence for
Example:
H₀: The new model has the same accuracy as the old model
H₁: The new model has different (better/worse) accuracy
The Process
1. State hypotheses (H₀ and H₁)
2. Choose significance level (α)
3. Collect data
4. Calculate test statistic
5. Compute p-value
6. Make decision: Reject H₀ if p < α
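The six steps above can be run end-to-end in Python. This is a minimal sketch with made-up accuracy values; the test statistic is also computed by hand to show where it comes from:

```python
# Walking through the six steps with a one-sample t-test.
# The accuracy values below are illustrative, not real measurements.
import numpy as np
from scipy import stats

# 1. Hypotheses: H0: mean accuracy = 0.80, H1: mean accuracy != 0.80
mu0 = 0.80
# 2. Significance level
alpha = 0.05
# 3. Data
accuracies = np.array([0.82, 0.85, 0.79, 0.84, 0.81, 0.83])
# 4. Test statistic: t = (x̄ - μ0) / (s / √n)
n = len(accuracies)
t_manual = (accuracies.mean() - mu0) / (accuracies.std(ddof=1) / np.sqrt(n))
# 5. p-value (two-tailed), via SciPy for comparison
t_stat, p_value = stats.ttest_1samp(accuracies, mu0)
# 6. Decision
decision = "Reject H0" if p_value < alpha else "Fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.4f} -> {decision}")
```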
P-Value
Definition: Probability of observing results at least as extreme as the data, assuming H₀ is true.
P-value = P(data as extreme or more | H₀ is true)
Small p-value: Data unlikely under H₀ → Reject H₀
Large p-value: Data consistent with H₀ → Fail to reject H₀
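One way to make this definition concrete is a small permutation simulation (a sketch with illustrative data): under H₀ the group labels are exchangeable, so the p-value can be estimated as the fraction of label shuffles whose mean difference is at least as extreme as the observed one:

```python
# Estimating a p-value by permutation: shuffle the group labels
# (valid under H0) and count how often the shuffled difference in
# means is at least as extreme as the observed difference.
import numpy as np

rng = np.random.default_rng(0)
model_a = np.array([0.82, 0.85, 0.79, 0.84, 0.81])
model_b = np.array([0.88, 0.87, 0.90, 0.86, 0.89])
observed = abs(model_a.mean() - model_b.mean())

pooled = np.concatenate([model_a, model_b])
n_perm = 10_000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = abs(pooled[:5].mean() - pooled[5:].mean())
    if diff >= observed - 1e-12:  # tolerance guards against float rounding
        count += 1
p_value = count / n_perm
print(f"Estimated p-value: {p_value:.4f}")
```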
Common Thresholds
α = 0.05: Standard (5% false positive rate)
α = 0.01: Stringent (e.g., medical research)
α = 0.10: Lenient (exploratory research)
Types of Errors
                        Reality
                 H₀ True    H₀ False
             ┌──────────┬──────────┐
   Reject H₀ │  Type I  │ Correct  │
Decision     │   (α)    │ (Power)  │
             ├──────────┼──────────┤
     Keep H₀ │ Correct  │ Type II  │
             │  (1-α)   │   (β)    │
             └──────────┴──────────┘
Type I Error (α): False positive - rejecting true H₀
Type II Error (β): False negative - keeping false H₀
Power (1-β): Correctly rejecting false H₀
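A quick simulation (illustrative data, not from the text) shows what α controls: if both samples are drawn from the same distribution, H₀ is true by construction, so a test at α = 0.05 should reject in roughly 5% of repetitions:

```python
# Simulating the Type I error rate: both samples come from the same
# normal distribution, so every rejection is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, alpha = 5_000, 0.05
false_positives = 0
for _ in range(n_sims):
    a = rng.normal(0.8, 0.05, size=30)
    b = rng.normal(0.8, 0.05, size=30)  # same distribution: H0 is true
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1
print(f"Observed Type I error rate: {false_positives / n_sims:.3f}")
```

The observed rate should land close to the nominal α = 0.05, up to simulation noise.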
Common Statistical Tests
One-Sample t-Test
Is the mean different from a known value?
from scipy import stats
# Test if model accuracy differs from 80%
accuracies = [0.82, 0.85, 0.79, 0.84, 0.81, 0.83]
t_stat, p_value = stats.ttest_1samp(accuracies, 0.80)
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Reject H₀: Accuracy differs from 80%")
Two-Sample t-Test
Are two means different?
# Compare two model accuracies
model_a = [0.82, 0.85, 0.79, 0.84, 0.81]
model_b = [0.88, 0.87, 0.90, 0.86, 0.89]
# Independent samples
t_stat, p_value = stats.ttest_ind(model_a, model_b)
print(f"p-value: {p_value:.4f}")
# Paired samples (same test data, different models)
t_stat, p_value = stats.ttest_rel(model_a, model_b)
print(f"p-value (paired): {p_value:.4f}")
Chi-Square Test
Are categorical variables independent?
import numpy as np
from scipy.stats import chi2_contingency
# Confusion matrix comparison
observed = np.array([
    [50, 10],  # Model A: [correct, incorrect]
    [45, 15]   # Model B: [correct, incorrect]
])
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"Chi-square: {chi2:.3f}")
print(f"p-value: {p_value:.4f}")
ANOVA (Analysis of Variance)
Compare means of 3+ groups:
# Compare 3 models
model_a = [0.82, 0.85, 0.79, 0.84, 0.81]
model_b = [0.88, 0.87, 0.90, 0.86, 0.89]
model_c = [0.80, 0.83, 0.78, 0.82, 0.79]
f_stat, p_value = stats.f_oneway(model_a, model_b, model_c)
print(f"F-statistic: {f_stat:.3f}")
print(f"p-value: {p_value:.4f}")
Mann-Whitney U Test
Non-parametric alternative to t-test:
# When data isn't normally distributed
from scipy.stats import mannwhitneyu
u_stat, p_value = mannwhitneyu(model_a, model_b, alternative='two-sided')
print(f"p-value: {p_value:.4f}")
One-Tailed vs Two-Tailed
Two-tailed (H₁: μ ≠ μ₀):
Tests for any difference
p-value = P(|t| > |t_observed|)
One-tailed (H₁: μ > μ₀ or H₁: μ < μ₀):
Tests for directional difference
p-value = P(t > t_observed) or P(t < t_observed)
# Two-tailed (default)
t_stat, p_value_two = stats.ttest_ind(model_a, model_b)
# One-tailed (is B better than A, i.e. H₁: μ_A < μ_B?)
p_value_one = p_value_two / 2 if t_stat < 0 else 1 - p_value_two / 2
# SciPy ≥ 1.6 can compute this directly:
# t_stat, p_value_one = stats.ttest_ind(model_a, model_b, alternative='less')
Multiple Testing Problem
With α = 0.05 and 20 tests:
P(at least one false positive) = 1 - 0.95²⁰ ≈ 64%!
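The same calculation for a few values of the number of independent tests shows how quickly the family-wise error rate grows:

```python
# Family-wise error rate: P(at least one false positive) = 1 - (1 - α)^m
# for m independent tests at significance level α.
alpha = 0.05
for m in (1, 5, 20, 100):
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:>3} tests: P(>=1 false positive) = {fwer:.1%}")
```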
Corrections
from statsmodels.stats.multitest import multipletests
p_values = [0.01, 0.03, 0.05, 0.02, 0.15]
# Bonferroni correction (conservative)
rejected, p_adjusted, _, _ = multipletests(p_values, method='bonferroni')
# Benjamini-Hochberg (FDR control, less conservative)
rejected, p_adjusted, _, _ = multipletests(p_values, method='fdr_bh')
print(f"Original p-values: {p_values}")
print(f"Adjusted p-values: {p_adjusted}")
print(f"Rejected: {rejected}")
Effect Size
Statistical significance ≠ practical significance:
# Cohen's d for comparing means
def cohens_d(group1, group2):
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    pooled_std = np.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2))
    return (np.mean(group1) - np.mean(group2)) / pooled_std
d = cohens_d(model_a, model_b)
print(f"Cohen's d: {d:.3f}")
# Interpretation:
# |d| < 0.2: Small effect
# |d| = 0.5: Medium effect
# |d| > 0.8: Large effect
Confidence Intervals
import numpy as np
from scipy import stats
data = [0.82, 0.85, 0.79, 0.84, 0.81, 0.83]
mean = np.mean(data)
sem = stats.sem(data) # Standard error of mean
# 95% confidence interval
# (SciPy ≥ 1.9; older versions use alpha= instead of confidence=)
ci = stats.t.interval(
    confidence=0.95,
    df=len(data)-1,
    loc=mean,
    scale=sem
)
print(f"Mean: {mean:.3f}")
print(f"95% CI: [{ci[0]:.3f}, {ci[1]:.3f}]")
Power Analysis
from statsmodels.stats.power import TTestIndPower
power_analysis = TTestIndPower()
# Calculate required sample size
sample_size = power_analysis.solve_power(
    effect_size=0.5,  # Expected Cohen's d
    power=0.8,        # Desired power
    alpha=0.05,       # Significance level
    ratio=1.0         # Equal group sizes
)
print(f"Required sample size per group: {int(np.ceil(sample_size))}")
# Calculate power given sample size
power = power_analysis.power(
    effect_size=0.5,
    nobs1=50,
    alpha=0.05,
    ratio=1.0
)
print(f"Power with n=50: {power:.2%}")
Common Mistakes
1. Misinterpreting P-Values
❌ "p = 0.03 means 3% chance H₀ is true"
✓ "p = 0.03 means a 3% chance of seeing data at least this extreme if H₀ is true"
2. P-Hacking
❌ Try many tests, report only significant ones
❌ Stop data collection when p < 0.05
❌ Remove outliers to get significance
✓ Pre-register your analysis plan
✓ Report all tests performed
✓ Use corrections for multiple testing
3. Ignoring Effect Size
❌ "p < 0.001, so the effect is huge"
✓ Report both p-value AND effect size/CI
With n=10,000, even tiny differences are "significant"
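A sketch with simulated data illustrates the point: with 10,000 samples per group, even a half-percentage-point difference in mean accuracy typically reaches p far below 0.05, while Cohen's d stays around 0.1 (a small effect):

```python
# Large n makes tiny effects statistically significant.
# The data here are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
a = rng.normal(0.800, 0.05, size=10_000)
b = rng.normal(0.805, 0.05, size=10_000)  # true difference of only 0.005
t_stat, p_value = stats.ttest_ind(a, b)
# Cohen's d with the pooled standard deviation
d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
print(f"p-value: {p_value:.2e}, Cohen's d: {d:.3f}")
```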
4. Assuming Non-Significance = No Effect
❌ "p = 0.08, so there's no difference"
✓ "We failed to detect a difference (could be low power)"
Hypothesis Testing in ML
Comparing Model Accuracies
# McNemar's test for paired classifiers
from statsmodels.stats.contingency_tables import mcnemar
# Counts of predictions:
#                    Model B
#                 Wrong   Right
# Model A  Wrong    a       b
#          Right    c       d
table = [[10, 15],   # A wrong
         [5, 70]]    # A right
result = mcnemar(table, exact=False)
print(f"p-value: {result.pvalue:.4f}")
Cross-Validation Comparison
# Paired t-test on CV folds (model_a and model_b here are unfitted
# scikit-learn estimators, and X, y a dataset, defined elsewhere)
from sklearn.model_selection import cross_val_score
scores_a = cross_val_score(model_a, X, y, cv=10)
scores_b = cross_val_score(model_b, X, y, cv=10)
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
print(f"Model A: {scores_a.mean():.3f} ± {scores_a.std():.3f}")
print(f"Model B: {scores_b.mean():.3f} ± {scores_b.std():.3f}")
print(f"p-value: {p_value:.4f}")
Key Takeaways
- P-value = probability of data (or more extreme) given H₀ is true
- Reject H₀ when p < α (typically 0.05)
- Always report effect sizes, not just p-values
- Correct for multiple testing when running many tests
- Calculate required sample size before experiments
- Non-significant ≠ no effect (could be low power)