Logistic Regression
Despite its name, logistic regression is a classification algorithm, not a regression method. It is one of the most fundamental and widely used techniques for binary classification.
The Core Idea
Model the probability of class 1 using a linear combination of features, passed through the sigmoid function:
P(y=1|x) = σ(w·x + b) = 1 / (1 + e^(-(w·x + b)))
The sigmoid squashes any real number to (0, 1), giving us a valid probability.
The Sigmoid Function
σ(z) = 1 / (1 + e^(-z))
Properties:
- Output is always between 0 and 1
- σ(0) = 0.5
- Smooth and differentiable
- σ'(z) = σ(z)(1 - σ(z))
  1   |               ________
      |             /
  0.5 |- - - - - -/- - - - - -
      |         /
  0   | ______/
      +----------------------
              z = 0
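The sigmoid and its derivative fit in a few lines; a minimal NumPy sketch that also verifies the property σ'(z) = σ(z)(1 - σ(z)) numerically:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0))       # 0.5
print(sigmoid_grad(0.0))  # 0.25 -- the slope is steepest at z = 0
```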
Decision Boundary
Predict class 1 if P(y=1|x) > 0.5, which means w·x + b > 0.
The decision boundary is a linear hyperplane in feature space.
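The threshold rule can be sketched directly (the weights below are hypothetical, not fitted):

```python
import numpy as np

def predict(x, w, b):
    # Predict class 1 iff w.x + b > 0, i.e. P(y=1|x) > 0.5
    return int(np.dot(w, x) + b > 0)

w, b = np.array([2.0, -1.0]), -0.5        # illustrative parameters
print(predict(np.array([1.0, 0.0]), w, b))  # 1: 2*1 - 0 - 0.5 = 1.5 > 0
print(predict(np.array([0.0, 1.0]), w, b))  # 0: -1 - 0.5 = -1.5 < 0
```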
Training: Maximum Likelihood
We want to maximize the likelihood of observing the training labels:
L(w, b) = Π P(yᵢ|xᵢ) = Π pᵢ^yᵢ × (1-pᵢ)^(1-yᵢ), where pᵢ = σ(w·xᵢ + b)
Taking the negative log (for minimization):
Loss = -Σ [yᵢ log(σ(w·xᵢ + b)) + (1-yᵢ) log(1 - σ(w·xᵢ + b))]
This is the binary cross-entropy loss!
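A bare-bones sketch of minimizing this loss by gradient descent (toy data and a fixed learning rate; real implementations use more sophisticated optimizers, but the gradient Xᵀ(p - y)/n is the same):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, lr=0.1, steps=2000):
    # Gradient of binary cross-entropy w.r.t. w is X^T (p - y) / n
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y)) / n
        b -= lr * np.mean(p - y)
    return w, b

# Toy 1-D data: class 1 whenever x > 0
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = train(X, y)
preds = (sigmoid(X @ w + b) > 0.5).astype(int)
print(preds)  # [0 0 1 1]
```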
Why Not Mean Squared Error?
MSE with sigmoid creates a non-convex loss surface:
- Multiple local minima
- Vanishing gradients when predictions are confidently wrong (the sigmoid saturates)
Cross-entropy is convex and has better gradient properties.
Regularization
Logistic regression can overfit with many features. Add regularization:
L2 Regularization (Ridge)
Loss = CrossEntropy + λ × ||w||²
Shrinks weights toward zero, keeps all features.
L1 Regularization (Lasso)
Loss = CrossEntropy + λ × ||w||₁
Drives some weights exactly to zero (feature selection).
Elastic Net
Loss = CrossEntropy + λ₁||w||₁ + λ₂||w||²
Combines both.
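In scikit-learn the penalty is selected with the `penalty` argument; note that sklearn parameterizes strength as C = 1/λ, so smaller C means stronger regularization. A sketch on synthetic data where only the first two features matter:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only features 0 and 1 drive the label; features 2-4 are noise
y = (X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=200) > 0).astype(int)

l2 = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)
l1 = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)

print(l2.coef_.round(2))  # all five weights shrunk but nonzero
print(l1.coef_.round(2))  # L1 typically zeros out the noise features
```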
Multi-class Extension
One-vs-Rest (OvR)
- Train k binary classifiers (one per class vs all others)
- Predict the class with highest probability
Multinomial (Softmax)
- Extend to k classes directly:
P(y=k|x) = e^(wₖ·x) / Σⱼ e^(wⱼ·x)
- More principled, often better
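The softmax generalizes the sigmoid to k classes; a stable implementation subtracts the max logit before exponentiating so the exponentials cannot overflow:

```python
import numpy as np

def softmax(logits):
    # Shifting by the max changes nothing (it cancels in the ratio)
    # but keeps np.exp from overflowing
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p.round(3))  # highest logit -> highest probability
print(p.sum())     # sums to 1 (up to float rounding)
```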
Interpreting Coefficients
Logistic regression is interpretable!
Log-Odds (Logit)
log(P/(1-P)) = w·x + b, where P = P(y=1|x)
Each coefficient wᵢ is the change in log-odds per unit increase in feature xᵢ, holding the other features fixed.
Odds Ratios
e^(wᵢ) = multiplicative change in odds for unit increase in xᵢ
Example: wᵢ = 0.5 means odds multiply by e^0.5 ≈ 1.65 per unit increase.
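Checking the arithmetic for that example:

```python
import math

w_i = 0.5
odds_ratio = math.exp(w_i)
# Each unit increase in x_i multiplies the odds by ~1.65
print(round(odds_ratio, 2))  # 1.65
```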
Advantages
- Interpretable: Coefficients have meaning
- Probability outputs: Not just class labels
- Fast: Convex optimization, scales well
- Works with sparse data: Efficient implementations
- Good baseline: Often competitive
Disadvantages
- Linear boundaries: Can't capture complex patterns
- Assumes linear log-odds: Strong assumption
- Sensitive to outliers: Without regularization
- Can't model interactions: Unless you add them manually
When to Use Logistic Regression
Good for:
- Binary classification baseline
- When interpretability matters
- High-dimensional sparse data (text)
- When you need probabilities
- Linear relationships in log-odds
Consider alternatives:
- Complex non-linear relationships: Trees, neural networks
- When features interact: Add interaction terms or use trees
Practical Tips
- Standardize features: Helps optimization and interpretation
- Add regularization: Almost always helps
- Check calibration: Probabilities may need calibration
- Handle imbalance: Use class weights or resampling
- Check linearity assumption: Plot log-odds vs features
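Several of these tips combine naturally in a scikit-learn pipeline; a sketch using `StandardScaler` for standardization and `class_weight="balanced"` for imbalance (the synthetic dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data: roughly 80/20 class split
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

model = make_pipeline(
    StandardScaler(),                                    # standardize features
    LogisticRegression(C=1.0, class_weight="balanced"),  # regularized, imbalance-aware
)
model.fit(X, y)
print(model.score(X, y))
```

Fitting the scaler inside the pipeline (rather than on the full dataset) also prevents leakage when you cross-validate.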
Key Takeaways
- Logistic regression = linear model + sigmoid for classification
- Trained by minimizing cross-entropy (log loss)
- Coefficients represent change in log-odds
- Always use regularization (L1 or L2)
- Great baseline that's hard to beat for linear problems