Linear Regression
Linear regression models the relationship between a target and one or more features with a linear equation. It is a standard baseline for predictive modeling and the starting point for understanding more complex algorithms.
The Model
Simple Linear Regression
One feature, one target:
y = β₀ + β₁x + ε
β₀: intercept (y when x=0)
β₁: slope (change in y per unit x)
ε: error term
Multiple Linear Regression
Multiple features:
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
In matrix form:
y = Xβ + ε
Finding the Coefficients
Ordinary Least Squares (OLS)
Minimize sum of squared errors:
min Σ(yᵢ - ŷᵢ)²
Closed-form solution:
β = (XᵀX)⁻¹Xᵀy
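As a sketch, the normal equation can be solved directly with NumPy on toy data (the true coefficients below are illustrative; `np.linalg.solve` on XᵀX is numerically more stable than forming the inverse explicitly):

```python
import numpy as np

# Toy data: y = 2 + 3x plus a little noise (illustrative values)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 0.1, 50)

# Design matrix with an intercept column of ones
X = np.column_stack([np.ones_like(x), x])

# Normal equation: solve (X^T X) beta = X^T y instead of inverting X^T X
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # approximately [2, 3]
```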
Gradient Descent
For large datasets, iteratively update:
β = β - α × ∇Loss
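A minimal gradient-descent sketch for the squared-error loss (the learning rate and iteration count are illustrative choices, not tuned recommendations):

```python
import numpy as np

# Toy data: y = 1 + 4x plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)
y = 1.0 + 4.0 * x + rng.normal(0, 0.05, 200)
X = np.column_stack([np.ones_like(x), x])

beta = np.zeros(2)
alpha = 0.5  # learning rate (illustrative)
for _ in range(5000):
    # Gradient of the mean squared loss (1/n)*||y - X beta||^2
    grad = -2 / len(y) * X.T @ (y - X @ beta)
    beta -= alpha * grad  # beta = beta - alpha * gradient
print(beta)  # approximately [1, 4]
```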
Assumptions
Linear regression makes several assumptions:
1. Linearity
Relationship between X and y is linear.
Check: Plot residuals vs fitted values (should be random)
2. Independence
Observations are independent of each other.
Violation: Time series, clustered data
3. Homoscedasticity
Constant variance of errors across all X values.
Check: Residuals should have constant spread
4. Normality
Errors are normally distributed.
Check: Q-Q plot of residuals
5. No Multicollinearity
Features are not highly correlated with each other.
Check: VIF (Variance Inflation Factor) < 10
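As an illustration of the VIF check, the statistic can be computed from scratch by regressing each feature on the others (the `vif` helper below is a hypothetical utility written for this sketch, not a library function; statsmodels offers an equivalent):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X (no intercept column)."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns (with intercept)
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        r2 = 1 - resid.var() / X[:, j].var()
        vifs.append(1 / (1 - r2))
    return np.array(vifs)

rng = np.random.default_rng(2)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.1 * rng.normal(size=300)   # c is nearly collinear with a
X = np.column_stack([a, b, c])
print(vif(X))  # VIFs for a and c are large; b stays near 1
```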
Interpreting Coefficients
Suppose a fitted model predicts salary from years of experience and a degree indicator (1 if the person has a degree, 0 otherwise):
Salary = 30,000 + 2,500×Years + 5,000×Degree
- Intercept (30,000): Base salary with 0 years, no degree
- Years (2,500): Each year adds $2,500, holding degree constant
- Degree (5,000): Having degree adds $5,000, holding years constant
Standardized Coefficients
To compare importance across features:
β_standardized = β × (σₓ / σᵧ)
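A quick sketch of why standardization matters: two features on very different scales can have equal real influence, which the raw coefficients hide but the standardized ones reveal (all numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(0, 1, 500)       # small-scale feature
x2 = rng.normal(0, 100, 500)     # large-scale feature
y = 5 * x1 + 0.05 * x2 + rng.normal(0, 0.1, 500)  # equal real influence

# Fit with an intercept column; raw coefficients differ by 100x
X = np.column_stack([np.ones(500), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# beta_j * (sigma_xj / sigma_y) puts features on a common scale
beta_std = beta[1:] * np.array([x1.std(), x2.std()]) / y.std()
print(beta_std)  # the two standardized coefficients are nearly equal
```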
Evaluation Metrics
R² (Coefficient of Determination)
R² = 1 - (SS_res / SS_tot)
SS_res = Σ(y - ŷ)² (residual sum of squares)
SS_tot = Σ(y - ȳ)² (total sum of squares)
- R² = 1: Perfect fit
- R² = 0: Model no better than predicting the mean
- R² < 0: Model worse than predicting the mean (possible, e.g. on test data)
Adjusted R²
Penalizes adding useless features:
Adj R² = 1 - (1-R²)(n-1)/(n-p-1), where n is the number of samples and p the number of features.
RMSE (Root Mean Squared Error)
RMSE = √(Σ(y - ŷ)² / n)
In same units as target.
MAE (Mean Absolute Error)
MAE = Σ|y - ŷ| / n
More robust to outliers than RMSE.
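The four metrics above can be computed directly from their formulas; the `regression_metrics` helper and the data values below are illustrative:

```python
import numpy as np

def regression_metrics(y_true, y_pred, p):
    """R², adjusted R², RMSE, and MAE, following the formulas above.

    p is the number of features (used by adjusted R²)."""
    n = len(y_true)
    ss_res = np.sum((y_true - y_pred) ** 2)   # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    rmse = np.sqrt(ss_res / n)
    mae = np.mean(np.abs(y_true - y_pred))
    return r2, adj_r2, rmse, mae

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9])
print(regression_metrics(y_true, y_pred, p=1))
```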
Regularized Linear Regression
Ridge (L2)
Loss = Σ(y - ŷ)² + λΣβ²
Shrinks coefficients, keeps all features.
Lasso (L1)
Loss = Σ(y - ŷ)² + λΣ|β|
Can zero out coefficients (feature selection).
Elastic Net
Loss = Σ(y - ŷ)² + λ₁Σ|β| + λ₂Σβ²
Combines L1 and L2.
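A small sketch of the behavioral difference: on synthetic data where only two of five features matter, Lasso zeroes out the irrelevant coefficients while Ridge merely shrinks everything (the alpha values are illustrative, not tuned):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# 5 features, but only the first two actually drive y
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, 200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge keeps all coefficients nonzero; Lasso drives the
# irrelevant ones exactly to zero (implicit feature selection)
print(np.round(ridge.coef_, 3))
print(np.round(lasso.coef_, 3))
```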
Common Issues
Multicollinearity
Problem: Correlated features → unstable coefficients
Solutions:
- Remove correlated features
- Use regularization (Ridge)
- PCA before regression
Outliers
Problem: Large errors dominate OLS
Solutions:
- Remove outliers
- Use robust regression (Huber loss)
- Transform target (log)
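The robust-regression option can be sketched with scikit-learn's `HuberRegressor` (data and outlier magnitudes below are illustrative):

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, (100, 1))
y = 2 * X[:, 0] + rng.normal(0, 0.2, 100)
y[:5] += 50  # inject a few large outliers

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

# Squared error lets the outliers distort the OLS fit;
# the Huber loss bounds their influence, so the slope stays near 2
print(ols.coef_[0], huber.coef_[0])
```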
Non-linear Relationships
Problem: Linear model can't capture curves
Solutions:
- Polynomial features: x, x², x³
- Log transform: log(x)
- Splines
- Use non-linear models
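The polynomial-features option can be sketched as a scikit-learn pipeline: the model is still linear in its parameters, but fitting it on x and x² captures the curve (degree and data are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Quadratic ground truth that a plain linear fit cannot capture
rng = np.random.default_rng(6)
x = rng.uniform(-3, 3, (200, 1))
y = 1 + 2 * x[:, 0] - 0.5 * x[:, 0] ** 2 + rng.normal(0, 0.1, 200)

# Expand to [1, x, x^2], then fit an ordinary linear regression
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly.fit(x, y)
print(poly.score(x, y))  # R² close to 1
```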
Code Example
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data so the example runs end to end
X, y = make_regression(n_samples=200, n_features=3, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Basic linear regression
model = LinearRegression()
model.fit(X_train, y_train)
print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")

# Predictions
y_pred = model.predict(X_test)

# Evaluation (np.sqrt avoids the deprecated squared=False argument)
print(f"R²: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")

# Regularized versions
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
When to Use Linear Regression
Good when:
- Interpretability is important
- Relationships are approximately linear
- You need a quick baseline model
- You want to understand feature effects
Consider alternatives when:
- Patterns are complex and non-linear
- You need better accuracy (try gradient boosting)
- The problem is classification (use logistic regression)
Key Takeaways
- Linear regression fits y = Xβ + ε
- OLS minimizes squared errors
- Check assumptions: linearity, independence, homoscedasticity, normality, no multicollinearity
- R² measures explained variance
- Use regularization (Ridge, Lasso) to prevent overfitting
- Coefficients show effect of each feature (holding others constant)