Logistic Regression
Despite its name, logistic regression is a classification algorithm, not a regression method. It is one of the most fundamental and widely used techniques for binary classification.
The Core Idea
Model the probability of class 1 using a linear combination of features, passed through the sigmoid function:
P(y=1|x) = σ(w·x + b) = 1 / (1 + e^(-(w·x + b)))
The sigmoid squashes any real number to (0, 1), giving us a valid probability.
The Sigmoid Function
σ(z) = 1 / (1 + e^(-z))
Properties:
- Output is always between 0 and 1
- σ(0) = 0.5
- Smooth and differentiable
- σ'(z) = σ(z)(1 - σ(z))
  1   |               ________
      |             /
  0.5 |- - - - - -/- - - - - -
      |         /
  0   | ______/
      +----------------------
              z = 0
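The sigmoid and its derivative fit in a few lines; a minimal NumPy sketch that also verifies the property σ'(z) = σ(z)(1 - σ(z)) numerically:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0))       # 0.5
print(sigmoid_grad(0.0))  # 0.25 -- the slope is steepest at z = 0
```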
Decision Boundary
Predict class 1 if P(y=1|x) > 0.5, which means w·x + b > 0.
The decision boundary is a linear hyperplane in feature space.
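The threshold rule can be sketched directly (the weights below are hypothetical, not fitted):

```python
import numpy as np

def predict(x, w, b):
    # Predict class 1 iff w.x + b > 0, i.e. P(y=1|x) > 0.5
    return int(np.dot(w, x) + b > 0)

w, b = np.array([2.0, -1.0]), -0.5        # illustrative parameters
print(predict(np.array([1.0, 0.0]), w, b))  # 1: 2*1 - 0 - 0.5 = 1.5 > 0
print(predict(np.array([0.0, 1.0]), w, b))  # 0: -1 - 0.5 = -1.5 < 0
```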
Training: Maximum Likelihood
We want to maximize the likelihood of observing the training labels:
L(w, b) = Π P(yᵢ|xᵢ) = Π pᵢ^yᵢ × (1-pᵢ)^(1-yᵢ), where pᵢ = σ(w·xᵢ + b)
Taking the negative log (for minimization):
Loss = -Σ [yᵢ log(σ(w·xᵢ + b)) + (1-yᵢ) log(1 - σ(w·xᵢ + b))]
This is the binary cross-entropy loss!
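A bare-bones sketch of minimizing this loss by gradient descent (toy data and a fixed learning rate; real implementations use more sophisticated optimizers, but the gradient Xᵀ(p - y)/n is the same):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, lr=0.1, steps=2000):
    # Gradient of binary cross-entropy w.r.t. w is X^T (p - y) / n
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y)) / n
        b -= lr * np.mean(p - y)
    return w, b

# Toy 1-D data: class 1 whenever x > 0
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = train(X, y)
preds = (sigmoid(X @ w + b) > 0.5).astype(int)
print(preds)  # [0 0 1 1]
```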
Why Not Mean Squared Error?
MSE with sigmoid creates a non-convex loss surface:
- Multiple local minima
- Vanishing gradients when predictions are confidently wrong (the sigmoid saturates)
Cross-entropy is convex and has better gradient properties.
Regularization
Logistic regression can overfit with many features. Add regularization:
L2 Regularization (Ridge)
Loss = CrossEntropy + λ × ||w||²
Shrinks weights toward zero, keeps all features.
L1 Regularization (Lasso)
Loss = CrossEntropy + λ × ||w||₁
Drives some weights exactly to zero (feature selection).
Elastic Net
Loss = CrossEntropy + λ₁||w||₁ + λ₂||w||²
Combines both.
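In scikit-learn the penalty is selected with the `penalty` argument; note that sklearn parameterizes strength as C = 1/λ, so smaller C means stronger regularization. A sketch on synthetic data where only the first two features matter:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only features 0 and 1 drive the label; features 2-4 are noise
y = (X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=200) > 0).astype(int)

l2 = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)
l1 = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)

print(l2.coef_.round(2))  # all five weights shrunk but nonzero
print(l1.coef_.round(2))  # L1 typically zeros out the noise features
```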
Multi-class Extension
One-vs-Rest (OvR)
- Train k binary classifiers (one per class vs all others)
- Predict the class with highest probability
Multinomial (Softmax)
- Extend to k classes directly:
P(y=k|x) = e^(wₖ·x) / Σⱼ e^(wⱼ·x)
- More principled, often better
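The softmax generalizes the sigmoid to k classes; a stable implementation subtracts the max logit before exponentiating so the exponentials cannot overflow:

```python
import numpy as np

def softmax(logits):
    # Shifting by the max changes nothing (it cancels in the ratio)
    # but keeps np.exp from overflowing
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p.round(3))  # highest logit -> highest probability
print(p.sum())     # sums to 1 (up to float rounding)
```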
Interpreting Coefficients
Logistic regression is interpretable!
Log-Odds (Logit)
log(P/(1-P)) = w·x + b, where P = P(y=1|x)
Each coefficient wᵢ is the change in log-odds per unit increase in feature xᵢ, holding the other features fixed.
Odds Ratios
e^(wᵢ) = multiplicative change in odds for unit increase in xᵢ
Example: wᵢ = 0.5 means odds multiply by e^0.5 ≈ 1.65 per unit increase.
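Checking the arithmetic for that example:

```python
import math

w_i = 0.5
odds_ratio = math.exp(w_i)
# Each unit increase in x_i multiplies the odds by ~1.65
print(round(odds_ratio, 2))  # 1.65
```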
Advantages
- Interpretable: Coefficients have meaning
- Probability outputs: Not just class labels
- Fast: Convex optimization, scales well
- Works with sparse data: Efficient implementations
- Good baseline: Often competitive
Disadvantages
- Linear boundaries: Can't capture complex patterns
- Assumes linear log-odds: Strong assumption
- Sensitive to outliers: Without regularization
- Can't model interactions: Unless you add them manually
When to Use Logistic Regression
Good for:
- Binary classification baseline
- When interpretability matters
- High-dimensional sparse data (text)
- When you need probabilities
- Linear relationships in log-odds
Consider alternatives:
- Complex non-linear relationships: Trees, neural networks
- When features interact: Add interaction terms or use trees
Practical Tips
- Standardize features: Helps optimization and interpretation
- Add regularization: Almost always helps
- Check calibration: Probabilities may need calibration
- Handle imbalance: Use class weights or resampling
- Check linearity assumption: Plot log-odds vs features
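Several of these tips combine naturally in a scikit-learn pipeline; a sketch using `StandardScaler` for standardization and `class_weight="balanced"` for imbalance (the synthetic dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data: roughly 80/20 class split
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

model = make_pipeline(
    StandardScaler(),                                    # standardize features
    LogisticRegression(C=1.0, class_weight="balanced"),  # regularized, imbalance-aware
)
model.fit(X, y)
print(model.score(X, y))
```

Fitting the scaler inside the pipeline (rather than on the full dataset) also prevents leakage when you cross-validate.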
Key Takeaways
- Logistic regression = linear model + sigmoid for classification
- Trained by minimizing cross-entropy (log loss)
- Coefficients represent change in log-odds
- Always use regularization (L1 or L2)
- Great baseline that's hard to beat for linear problems