Probability Distributions
Probability distributions are mathematical functions that describe how likely different outcomes are. They're fundamental to machine learning because they help us model uncertainty and make predictions.
Discrete vs. Continuous
Discrete Distributions
For countable outcomes (integers, categories):
- Probability Mass Function (PMF) gives P(X = x)
- Probabilities sum to 1
Continuous Distributions
For uncountable outcomes (real numbers):
- Probability Density Function (PDF) gives relative likelihood
- Area under curve equals 1
- P(X = exact value) = 0; we use ranges P(a ≤ X ≤ b)
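To make the distinction concrete, here's a minimal sketch using scipy.stats (a fair die and a standard normal are assumed examples): the PMF values sum to 1, while the PDF gives densities and probability comes only from ranges.

```python
from scipy import stats

# Discrete: PMF values over the support sum to 1.
die = stats.randint(1, 7)                    # fair six-sided die, support {1,...,6}
print(die.pmf(3))                            # P(X = 3) = 1/6
print(sum(die.pmf(k) for k in range(1, 7)))  # 1.0

# Continuous: the PDF integrates to 1; any single point has probability 0.
z = stats.norm(0, 1)                         # standard normal
print(z.pdf(0.0))                            # a density, NOT a probability (≈ 0.3989)
print(z.cdf(1.0) - z.cdf(-1.0))              # P(-1 ≤ X ≤ 1) ≈ 0.6827
```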
Key Discrete Distributions
Bernoulli Distribution
Single binary trial (success/failure).
P(X = 1) = p
P(X = 0) = 1 - p
Use case: Modeling binary outcomes like click/no-click.
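A quick sketch of simulating Bernoulli clicks; the click probability p = 0.3 is an assumed value for illustration.

```python
from scipy import stats

p = 0.3
clicks = stats.bernoulli(p).rvs(size=10_000, random_state=0)  # 0/1 samples
print(clicks.mean())   # ≈ 0.3, the empirical estimate of p
```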
Binomial Distribution
Number of successes in n independent Bernoulli trials.
P(X = k) = C(n,k) × p^k × (1-p)^(n-k)
Use case: Counting successful conversions in n visitors.
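A short sketch evaluating the binomial PMF with scipy.stats; n = 100 visitors and a 5% conversion rate are assumed numbers.

```python
from scipy import stats

n, p = 100, 0.05                     # 100 visitors, 5% conversion rate
binom = stats.binom(n, p)
print(binom.pmf(5))                  # probability of exactly 5 conversions
print(binom.cdf(10))                 # probability of at most 10 conversions
print(binom.mean(), binom.var())     # n*p = 5.0, n*p*(1-p) = 4.75
```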
Poisson Distribution
Number of events in a fixed interval when events occur independently at a constant average rate.
P(X = k) = (λ^k × e^(-λ)) / k!
Use case: Modeling number of customer arrivals per hour.
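A brief sketch, assuming an arrival rate of λ = 4 customers per hour.

```python
from scipy import stats

lam = 4.0
pois = stats.poisson(lam)
print(pois.pmf(0))                   # P(no arrivals in the hour) = e^(-4) ≈ 0.018
print(pois.pmf(4))                   # P(exactly 4 arrivals)
print(pois.mean(), pois.var())       # both equal λ for a Poisson
```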
Categorical/Multinomial
The categorical distribution generalizes the Bernoulli to k > 2 categories; the multinomial generalizes the binomial to counts over k categories. Use case: Multi-class classification outputs.
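A quick sketch of categorical sampling and multinomial counts with NumPy; the class probabilities are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.2, 0.5, 0.3])                 # must sum to 1
labels = rng.choice(3, size=10_000, p=probs)      # categorical samples
print(np.bincount(labels) / labels.size)          # ≈ [0.2, 0.5, 0.3]

# Multinomial: counts over k categories from n trials in one draw.
print(rng.multinomial(100, probs))                # e.g. [19, 52, 29]
```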
Key Continuous Distributions
Normal (Gaussian) Distribution
The famous bell curve, defined by its mean (μ) and standard deviation (σ).
f(x) = (1 / (σ√(2π))) × e^(-(x-μ)²/(2σ²))
Why it's everywhere:
- Central Limit Theorem: averages of many independent samples from (almost) any finite-variance distribution tend toward normal (demonstrated below)
- Many natural phenomena are approximately normal
- Computational convenience in ML
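The sketch below illustrates the Central Limit Theorem empirically: means of samples drawn from a uniform distribution (decidedly non-normal) land close to the normal the CLT predicts. The sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sample_means = rng.uniform(0, 1, size=(100_000, 30)).mean(axis=1)

# Uniform(0,1) has mean 1/2 and variance 1/12; the CLT predicts the mean
# of 30 draws is approximately Normal(0.5, sqrt(1/12/30)).
print(sample_means.mean())           # ≈ 0.5
print(sample_means.std())            # ≈ 0.0527 = sqrt(1/(12*30))
```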
Exponential Distribution
Time between events in a Poisson process.
f(x) = λe^(-λx) for x ≥ 0
Use case: Modeling time until next customer arrives.
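A brief sketch, reusing the assumed λ = 4 arrivals per hour from the Poisson example; note that scipy parameterizes the exponential by scale = 1/λ.

```python
from scipy import stats

lam = 4.0
exp = stats.expon(scale=1 / lam)     # scipy uses scale = 1/λ
print(exp.mean())                    # 1/λ = 0.25 hours between arrivals
print(1 - exp.cdf(0.5))              # P(wait > 30 min) = e^(-λ·0.5) ≈ 0.135
```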
Uniform Distribution
All values in a range are equally likely.
f(x) = 1/(b-a) for a ≤ x ≤ b
Use case: Random initialization, some priors.
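A minimal sketch of uniform weight initialization; the range [-0.05, 0.05] and the layer shape are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = -0.05, 0.05
weights = rng.uniform(a, b, size=(256, 128))  # one layer's weight matrix
print(weights.min(), weights.max())           # all values fall in [a, b)
```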
Beta Distribution
Distribution over values between 0 and 1, making it a natural distribution over probabilities. Use case: Prior (and conjugate posterior) for probability parameters such as a Bernoulli's p.
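A short sketch of Bayesian updating with a Beta prior: because the Beta is the conjugate prior for the Bernoulli parameter, a Beta(1, 1) (uniform) prior plus assumed counts of 20 successes and 80 failures yields a Beta(21, 81) posterior.

```python
from scipy import stats

posterior = stats.beta(1 + 20, 1 + 80)   # prior counts + observed counts
print(posterior.mean())                  # ≈ 0.206, pulled toward 20/100
print(posterior.interval(0.95))          # a 95% credible interval for p
```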
Distributions in Machine Learning
Classification
- Softmax outputs parameterize a categorical distribution over classes (see the sketch after this list)
- Naive Bayes assumes features follow specific distributions
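Here's a minimal sketch of the softmax-to-categorical connection; the logits are made-up numbers for a three-class problem.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])
probs = softmax(logits)
print(probs, probs.sum())            # nonnegative, sums to 1

# Sampling a predicted class from this categorical distribution:
rng = np.random.default_rng(0)
print(rng.choice(3, p=probs))
```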
Regression
- Squared-error loss corresponds to assuming normally distributed errors (it is the Gaussian maximum-likelihood solution)
- Heteroscedasticity: error variance that changes with X, violating the usual constant-variance assumption (simulated below)
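A small simulation contrasting constant (homoscedastic) and x-dependent (heteroscedastic) error variance; the noise scales and the true line y = 2x are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 1_000)

homo = 2 * x + rng.normal(0, 1.0, x.size)        # constant error variance
hetero = 2 * x + rng.normal(0, 0.2 * x + 0.1)    # variance grows with x

# Compare residual spread in the low-x and high-x halves:
for name, y in [("homo", homo), ("hetero", hetero)]:
    resid = y - 2 * x
    print(name, resid[:500].std(), resid[500:].std())
```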
Deep Learning
- Weight initialization often draws from normal or uniform distributions
- Dropout masks units with Bernoulli random variables (sketched after this list)
- VAEs typically model the latent space as Gaussian
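A sketch of "inverted" dropout as a Bernoulli mask; the keep probability of 0.8 is an assumed hyperparameter.

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8
activations = rng.normal(size=100)

mask = rng.random(100) < keep_prob            # Bernoulli(keep_prob) per unit
dropped = activations * mask / keep_prob      # rescale so expected value is preserved
print(mask.mean())                            # ≈ 0.8 of units kept
```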
Generative Models
- Model the full distribution P(X) of the data
- GANs implicitly learn to sample from data distribution
Important Properties
Mean (Expected Value)
The center of the distribution.
E[X] = Σ x × P(x) (discrete)
E[X] = ∫ x × f(x) dx (continuous)
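Both formulas can be checked numerically; the sketch below uses a fair die for the discrete case and an Exponential(λ = 1) for the continuous case (both assumed examples).

```python
import numpy as np
from scipy import integrate, stats

# Discrete: E[X] = Σ x · P(x) for a fair six-sided die.
x = np.arange(1, 7)
print((x * (1 / 6)).sum())                    # 3.5

# Continuous: E[X] = ∫ x · f(x) dx for Exponential(λ=1); should be 1/λ = 1.
f = stats.expon().pdf
val, _ = integrate.quad(lambda t: t * f(t), 0, np.inf)
print(val)                                    # ≈ 1.0
```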
Variance
Measures spread around the mean.
Var(X) = E[(X - μ)²] = E[X²] - (E[X])²
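A quick numerical check that both forms of the identity agree; the exponential with scale 2 (true variance 4) is an assumed example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)  # Exponential with mean 2

lhs = ((x - x.mean()) ** 2).mean()              # E[(X - μ)²]
rhs = (x ** 2).mean() - x.mean() ** 2           # E[X²] - (E[X])²
print(lhs, rhs)                                 # both ≈ 4.0 (= scale²)
```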
Skewness and Kurtosis
- Skewness: asymmetry of the distribution
- Kurtosis: "tailedness", i.e., how heavy the tails are and how prone the distribution is to extreme values (see the sketch below)
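A short sketch comparing a symmetric, light-tailed sample with a right-skewed one; note that scipy.stats.kurtosis reports excess kurtosis (0 for a normal) by default.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal = rng.normal(size=100_000)         # symmetric, light tails
expon = rng.exponential(size=100_000)     # right-skewed, heavier tail

print(stats.skew(normal), stats.kurtosis(normal))  # ≈ 0, ≈ 0 (excess)
print(stats.skew(expon), stats.kurtosis(expon))    # ≈ 2, ≈ 6 for an exponential
```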
Key Takeaways
- Distributions model uncertainty in data and predictions
- Normal distribution is ubiquitous due to Central Limit Theorem
- Choose distributions that match your data's nature
- Understanding distributions helps interpret model outputs
- Many ML algorithms assume specific distributions