
Probability Distributions

Probability distributions are mathematical functions that describe how likely different outcomes are. They're fundamental to machine learning because they help us model uncertainty and make predictions.

Discrete vs. Continuous

Discrete Distributions

For countable outcomes (integers, categories):

  • Probability Mass Function (PMF) gives P(X = x)
  • Probabilities sum to 1

Continuous Distributions

For uncountable outcomes (real numbers):

  • Probability Density Function (PDF) gives relative likelihood
  • Area under curve equals 1
  • P(X = exact value) = 0; we use ranges P(a ≤ X ≤ b)

Key Discrete Distributions

Bernoulli Distribution

Single binary trial (success/failure).

P(X = 1) = p
P(X = 0) = 1 - p

Use case: Modeling binary outcomes like click/no-click.
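As a minimal sketch, a Bernoulli(p) draw can be simulated by comparing a uniform random number to p. The `bernoulli_sample` helper below is illustrative, not from any library:

```python
import random

def bernoulli_sample(p, n, seed=0):
    """Draw n Bernoulli(p) samples: 1 with probability p, else 0."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]

# Simulate 10,000 clicks with a 30% click probability
samples = bernoulli_sample(0.3, 10_000)
print(sum(samples) / len(samples))  # empirical rate, close to 0.3
```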

Binomial Distribution

Number of successes in n independent Bernoulli trials.

P(X = k) = C(n,k) × p^k × (1-p)^(n-k)

Use case: Counting successful conversions in n visitors.
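The PMF above translates directly into code using `math.comb` for C(n, k); a quick sanity check is that the probabilities over k = 0..n sum to 1:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p**k * (1 - p)**(n - k)"""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 3 conversions among 10 visitors, p = 0.2
print(binomial_pmf(3, 10, 0.2))  # ≈ 0.2013
# PMF sums to 1 over all possible counts
print(sum(binomial_pmf(k, 10, 0.2) for k in range(11)))  # ≈ 1.0
```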

Poisson Distribution

Number of events in a fixed interval when events occur independently at a constant average rate.

P(X = k) = (λ^k × e^(-λ)) / k!

Use case: Modeling number of customer arrivals per hour.
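The Poisson PMF is a one-liner with `math.exp` and `math.factorial`; the example rate λ = 4 arrivals/hour is an arbitrary illustration:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) = lam**k * e**(-lam) / k!"""
    return lam**k * exp(-lam) / factorial(k)

# If customers arrive at 4 per hour on average,
# probability of exactly 2 arrivals in an hour:
print(poisson_pmf(2, 4.0))  # ≈ 0.1465
```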

Categorical/Multinomial

The categorical distribution generalizes the Bernoulli to more than two categories; the multinomial generalizes the binomial the same way. Use case: Multi-class classification outputs.
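Sampling from a categorical distribution is built into the standard library via `random.choices`; the three class labels and their probabilities below are made up for illustration:

```python
import random

# Hypothetical class probabilities from a 3-class classifier
probs = [0.7, 0.2, 0.1]
rng = random.Random(0)
labels = rng.choices(["cat", "dog", "bird"], weights=probs, k=10_000)
print(labels.count("cat") / len(labels))  # close to 0.7
```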

Key Continuous Distributions

Normal (Gaussian) Distribution

The famous bell curve. Defined by mean (μ) and standard deviation (σ).

f(x) = (1 / (σ√(2π))) × e^(-(x-μ)²/(2σ²))

Why it's everywhere:

  • Central Limit Theorem: averages of many independent samples from almost any distribution (finite variance) tend toward normal
  • Many natural phenomena are approximately normal
  • Computational convenience in ML
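Both points can be sketched in a few lines: the PDF formula above in code, plus a tiny Central Limit Theorem demo where averages of uniform samples (nothing like a bell curve individually) concentrate around 0.5:

```python
import random
from math import sqrt, pi, exp

def normal_pdf(x, mu=0.0, sigma=1.0):
    """f(x) = (1 / (sigma * sqrt(2*pi))) * exp(-(x - mu)**2 / (2 * sigma**2))"""
    return exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

print(normal_pdf(0.0))  # peak of the standard normal, ≈ 0.3989

# CLT sketch: each entry is the mean of 30 uniform draws
rng = random.Random(0)
means = [sum(rng.random() for _ in range(30)) / 30 for _ in range(5_000)]
print(sum(means) / len(means))  # close to 0.5, and the histogram is bell-shaped
```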

Exponential Distribution

Time between events in a Poisson process.

f(x) = λe^(-λx) for x ≥ 0

Use case: Modeling time until next customer arrives.
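A quick check of the link to the Poisson example: with λ = 4 arrivals/hour, waiting times drawn with `random.expovariate` should average 1/λ = 0.25 hours:

```python
import random
from math import exp

def exponential_pdf(x, lam):
    """f(x) = lam * exp(-lam * x) for x >= 0, else 0"""
    return lam * exp(-lam * x) if x >= 0 else 0.0

rng = random.Random(0)
waits = [rng.expovariate(4.0) for _ in range(10_000)]
print(sum(waits) / len(waits))  # close to 1/4 = 0.25
```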

Uniform Distribution

All values in a range are equally likely.

f(x) = 1/(b-a) for a ≤ x ≤ b

Use case: Random initialization, some priors.
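As a sketch of the initialization use case, drawing weights uniformly from a small range (the bounds below are arbitrary) with `random.uniform`; the constant density 1/(b - a) follows from the formula above:

```python
import random

a, b = -0.05, 0.05          # hypothetical weight-initialization range
density = 1 / (b - a)        # constant PDF value over [a, b]
print(density)               # 1 / 0.1 = 10

rng = random.Random(0)
w = [rng.uniform(a, b) for _ in range(10_000)]
print(min(w) >= a and max(w) <= b)  # True: all draws fall inside [a, b]
```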

Beta Distribution

Distribution over probabilities (values between 0 and 1). Use case: Prior distribution for probability parameters.
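The standard library can sample from a Beta distribution via `random.betavariate`; a useful fact to verify is that Beta(α, β) has mean α / (α + β) and support [0, 1] (α = 3, β = 7 below are arbitrary):

```python
import random

alpha, beta = 3, 7  # mean should be 3 / (3 + 7) = 0.3
rng = random.Random(0)
draws = [rng.betavariate(alpha, beta) for _ in range(10_000)]
print(sum(draws) / len(draws))              # close to 0.3
print(all(0.0 <= d <= 1.0 for d in draws))  # True: values are probabilities
```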

Distributions in Machine Learning

Classification

  • Softmax outputs follow a categorical distribution
  • Naive Bayes assumes features follow specific distributions
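The softmax point can be made concrete: it maps raw scores to a valid categorical distribution (non-negative, summing to 1). A minimal, numerically stable sketch:

```python
from math import exp

def softmax(logits):
    """Map raw scores to a categorical distribution over classes."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)        # highest logit gets highest probability
print(sum(probs))   # 1.0, as required of a distribution
```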

Regression

  • Often assume errors are normally distributed
  • Heteroscedasticity: error variance that changes with X, violating the constant-variance assumption

Deep Learning

  • Weight initialization often uses normal or uniform
  • Dropout uses Bernoulli
  • VAEs model latent space as Gaussian

Generative Models

  • Model the full distribution P(X) of the data
  • GANs implicitly learn to sample from data distribution

Important Properties

Mean (Expected Value)

The center of the distribution.

E[X] = Σ x × P(x)  (discrete)
E[X] = ∫ x × f(x) dx  (continuous)
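For the discrete case, the sum is a direct computation; a fair six-sided die is the classic worked example:

```python
# E[X] = sum(x * P(x)) for a fair six-sided die
outcomes = [1, 2, 3, 4, 5, 6]
pmf = [1 / 6] * 6
mean = sum(x * p for x, p in zip(outcomes, pmf))
print(mean)  # 3.5
```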

Variance

Measures spread around the mean.

Var(X) = E[(X - μ)²] = E[X²] - (E[X])²
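Continuing the die example, the shortcut form E[X²] − (E[X])² avoids a second pass over centered values:

```python
# Var(X) = E[X**2] - (E[X])**2 for a fair six-sided die
outcomes = [1, 2, 3, 4, 5, 6]
pmf = [1 / 6] * 6
mean = sum(x * p for x, p in zip(outcomes, pmf))          # 3.5
ex2 = sum(x**2 * p for x, p in zip(outcomes, pmf))        # E[X**2] = 91/6
var = ex2 - mean**2
print(var)  # 35/12 ≈ 2.9167
```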

Skewness and Kurtosis

  • Skewness: asymmetry of the distribution
  • Kurtosis: "tailedness" - how extreme outliers are
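As a sketch, both can be computed from centered moments: skewness is the third standardized moment, kurtosis the fourth (shown here in its non-excess form; subtract 3 for excess kurtosis). The helper name is illustrative:

```python
def sample_skew_kurtosis(xs):
    """Moment-based sample skewness and (non-excess) kurtosis."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # variance
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m3 / m2**1.5, m4 / m2**2

# Perfectly symmetric data has zero skewness
skew, kurt = sample_skew_kurtosis([1, 2, 3, 4, 5])
print(skew)  # 0.0
```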

Key Takeaways

  1. Distributions model uncertainty in data and predictions
  2. Normal distribution is ubiquitous due to Central Limit Theorem
  3. Choose distributions that match your data's nature
  4. Understanding distributions helps interpret model outputs
  5. Many ML algorithms assume specific distributions
