Beginner · Classical Machine Learning

Master logistic regression - the foundational classification algorithm that models probability using the sigmoid function.

classification · linear-models · probability · interpretability

Logistic Regression

Despite its name, logistic regression is a classification algorithm, not regression. It's one of the most fundamental and widely used methods for binary classification.

The Core Idea

Model the probability of class 1 using a linear combination of features, passed through the sigmoid function:

P(y=1|x) = σ(w·x + b) = 1 / (1 + e^(-(w·x + b)))

The sigmoid squashes any real number to (0, 1), giving us a valid probability.

The Sigmoid Function

σ(z) = 1 / (1 + e^(-z))

Properties:

  • Output is always between 0 and 1
  • σ(0) = 0.5
  • Smooth and differentiable
  • σ'(z) = σ(z)(1 - σ(z))
      1 |              ————————
        |            /
    0.5 |- - - - - •- - - - - -
        |         /
      0 |————————
        |______________________
                 z = 0
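These properties can be verified directly. Below is a minimal NumPy sketch of the sigmoid and its derivative, with clipping added to avoid floating-point overflow for large |z|:

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z)), computed stably for large |z|."""
    z = np.clip(z, -500, 500)  # avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0))  # 0.5, as listed in the properties above
```

The self-derivative form `σ(z)(1 − σ(z))` is what makes gradients cheap to compute during training: the forward pass already gives you everything the backward pass needs.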

Decision Boundary

Predict class 1 if P(y=1|x) > 0.5, which means w·x + b > 0.

The decision boundary is a linear hyperplane in feature space.
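A small sketch of this equivalence, using hypothetical weights for a 2-feature model (the sigmoid is not needed for the hard decision, since σ is monotone and σ(0) = 0.5):

```python
import numpy as np

# Hypothetical trained weights for a 2-feature model
w = np.array([2.0, -1.0])
b = 0.5

def predict(X):
    # Predict class 1 exactly when w.x + b > 0, i.e. P(y=1|x) > 0.5
    return (X @ w + b > 0).astype(int)

X = np.array([[1.0, 1.0],    # w.x + b =  1.5 -> class 1
              [0.0, 2.0]])   # w.x + b = -1.5 -> class 0
print(predict(X))  # [1 0]
```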

Training: Maximum Likelihood

We want to maximize the likelihood of observing the training labels:

L(w) = Π P(yᵢ|xᵢ) = Π σ(w·xᵢ + b)^yᵢ × (1 - σ(w·xᵢ + b))^(1-yᵢ)

Taking the negative log (for minimization):

Loss = -Σ [yᵢ log(σ(w·xᵢ + b)) + (1-yᵢ) log(1 - σ(w·xᵢ + b))]

This is the binary cross-entropy loss!

Why Not Mean Squared Error?

MSE with sigmoid creates a non-convex loss surface:

  • Multiple local minima
  • Vanishing gradients when predictions are wrong but confident

Cross-entropy is convex and has better gradient properties.
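The vanishing-gradient problem is easy to see numerically. With cross-entropy the sigmoid's derivative cancels, leaving dLoss/dz = p − y; with MSE the extra σ'(z) factor survives and crushes the gradient exactly when the model is confidently wrong:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# True label is 1, but the model is confidently wrong: z very negative, p ~ 0
z, y = -8.0, 1.0
p = sigmoid(z)

grad_ce  = p - y                    # cross-entropy: d(loss)/dz = p - y
grad_mse = (p - y) * p * (1 - p)    # MSE: extra sigma'(z) factor

print(abs(grad_ce), abs(grad_mse))  # CE gradient near 1, MSE gradient near 0
```

So the cross-entropy gradient stays large until the mistake is fixed, while the MSE gradient nearly disappears, stalling learning precisely where correction is most needed.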

Regularization

Logistic regression can overfit with many features. Add regularization:

L2 Regularization (Ridge)

Loss = CrossEntropy + λ × ||w||²

Shrinks weights toward zero, keeps all features.

L1 Regularization (Lasso)

Loss = CrossEntropy + λ × ||w||₁

Drives some weights exactly to zero (feature selection).

Elastic Net

Loss = CrossEntropy + λ₁||w||₁ + λ₂||w||²

Combines both.
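A sketch of how these three penalties are typically exposed in practice, assuming scikit-learn is available (note its `C` parameter is the inverse of λ, so smaller `C` means a stronger penalty):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, only 5 of them informative
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

l2 = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
l1 = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)
enet = LogisticRegression(penalty="elasticnet", l1_ratio=0.5, C=1.0,
                          solver="saga", max_iter=5000).fit(X, y)

# L1 drives some coefficients exactly to zero (feature selection)
print("L1 zeroed out", (l1.coef_ == 0).sum(), "of", l1.coef_.size, "weights")
```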

Multi-class Extension

One-vs-Rest (OvR)

  • Train k binary classifiers (one per class vs all others)
  • Predict the class with highest probability

Multinomial (Softmax)

  • Extend to k classes directly:
P(y=k|x) = e^(wₖ·x) / Σⱼ e^(wⱼ·x)
  • More principled, often better
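The softmax formula above can be sketched as follows; shifting by the max score before exponentiating is a standard trick to avoid overflow and does not change the result:

```python
import numpy as np

def softmax(scores):
    """P(y=k|x) = exp(w_k . x) / sum_j exp(w_j . x), computed stably."""
    scores = scores - scores.max(axis=-1, keepdims=True)  # stability shift
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical per-class scores w_k . x for 3 classes
probs = softmax(np.array([2.0, 1.0, 0.1]))
print(round(probs.sum(), 6))  # 1.0 -- a valid probability distribution
```

With k = 2 classes, softmax reduces exactly to the sigmoid applied to the difference of the two scores, so binary logistic regression is a special case.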

Interpreting Coefficients

Logistic regression is interpretable!

Log-Odds (Logit)

log(P/(1-P)) = w·x + b

Coefficients are changes in log-odds per unit increase in feature.

Odds Ratios

e^(wᵢ) = multiplicative change in odds for unit increase in xᵢ

Example: wᵢ = 0.5 means odds multiply by e^0.5 ≈ 1.65 per unit increase.
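This can be checked directly: since P/(1−P) = e^(w·x + b), the odds ratio for a unit increase is exactly e^w no matter where you start. A small sketch with a hypothetical single-feature model:

```python
import math

w, b = 0.5, -1.0               # hypothetical single-feature model

def odds(x):
    p = 1.0 / (1.0 + math.exp(-(w * x + b)))
    return p / (1 - p)

# Odds ratio for a unit increase in x is e^w, regardless of the starting x
ratio = odds(3.0) / odds(2.0)
print(round(ratio, 4), round(math.exp(w), 4))  # both ~1.6487
```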

Advantages

  1. Interpretable: Coefficients have meaning
  2. Probability outputs: Not just class labels
  3. Fast: Convex optimization, scales well
  4. Works with sparse data: Efficient implementations
  5. Good baseline: Often competitive

Disadvantages

  1. Linear boundaries: Can't capture complex patterns
  2. Assumes linear log-odds: Strong assumption
  3. Sensitive to outliers: Without regularization
  4. Can't model interactions: Unless you add them manually

When to Use Logistic Regression

Good for:

  • Binary classification baseline
  • When interpretability matters
  • High-dimensional sparse data (text)
  • When you need probabilities
  • Linear relationships in log-odds

Consider alternatives:

  • Complex non-linear relationships: Trees, neural networks
  • When features interact: Add interaction terms or use trees

Practical Tips

  1. Standardize features: Helps optimization and interpretation
  2. Add regularization: Almost always helps
  3. Check calibration: Probabilities may need calibration
  4. Handle imbalance: Use class weights or resampling
  5. Check linearity assumption: Plot log-odds vs features
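The first, second, and fourth tips can be combined in one pipeline. A sketch assuming scikit-learn, on a synthetic imbalanced dataset (90/10 class split):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data: ~90% class 0, ~10% class 1
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

model = make_pipeline(
    StandardScaler(),                              # tip 1: standardize features
    LogisticRegression(C=1.0,                      # tip 2: L2 penalty (default)
                       class_weight="balanced"),   # tip 4: reweight rare class
)
model.fit(X, y)
print(model.predict_proba(X[:2]))  # probability outputs, one row per sample
```

Scaling inside the pipeline (rather than before the split) also prevents test-set statistics from leaking into training when you later cross-validate.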

Key Takeaways

  1. Logistic regression = linear model + sigmoid for classification
  2. Trained by minimizing cross-entropy (log loss)
  3. Coefficients represent change in log-odds
  4. Always use regularization (L1 or L2)
  5. Great baseline that's hard to beat for linear problems
