Naive Bayes
Naive Bayes is a family of probabilistic classifiers based on Bayes' theorem with a "naive" assumption of feature independence. Despite its simplicity, it works surprisingly well, especially for text classification.
Bayes' Theorem
P(y|X) = P(X|y) × P(y) / P(X)
Posterior = (Likelihood × Prior) / Evidence
Classify by finding the class with highest posterior probability.
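A tiny worked example makes the rule concrete. All numbers below are made up for illustration: a hypothetical prior P(spam) = 0.4 and word likelihoods for a single word "free".

```python
# Hypothetical numbers: P(spam) = 0.4, P(ham) = 0.6,
# P("free" | spam) = 0.5, P("free" | ham) = 0.05
p_spam, p_ham = 0.4, 0.6
p_free_spam, p_free_ham = 0.5, 0.05

# Evidence P("free") via the law of total probability
p_free = p_free_spam * p_spam + p_free_ham * p_ham  # 0.20 + 0.03 = 0.23

# Posterior via Bayes' theorem
p_spam_given_free = p_free_spam * p_spam / p_free
print(round(p_spam_given_free, 3))  # → 0.87, so "spam" wins
```

Since 0.87 > 0.13, the message containing "free" is classified as spam.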
The "Naive" Assumption
Assume features are conditionally independent given the class:
P(x₁, x₂, ..., xₙ | y) = P(x₁|y) × P(x₂|y) × ... × P(xₙ|y)
This is rarely true in practice, but it works anyway!
Classification Rule
ŷ = argmax_y P(y) × ∏ P(xᵢ|y)
In log space (for numerical stability):
ŷ = argmax_y [log P(y) + Σ log P(xᵢ|y)]
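The log-space rule above can be sketched directly in NumPy. The probabilities here are invented toy values, with binary features indicating which words are present:

```python
import numpy as np

# Toy model: 2 classes, 3 features (all probabilities made up)
log_prior = np.log(np.array([0.6, 0.4]))  # log P(y)
log_likelihood = np.log(np.array([        # log P(x_i | y)
    [0.2, 0.5, 0.9],   # class 0
    [0.7, 0.1, 0.3],   # class 1
]))

x = np.array([1, 0, 1])  # features 0 and 2 are active in this example

# Sum the log-likelihoods of the active features, add the log prior
scores = log_prior + (log_likelihood * x).sum(axis=1)
y_hat = int(np.argmax(scores))
print(y_hat)  # → 0
```

Summing logs avoids the underflow you would get from multiplying many small probabilities directly.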
Variants
Gaussian Naive Bayes
For continuous features:
P(xᵢ|y) = N(xᵢ; μᵢᵧ, σ²ᵢᵧ)
Estimate mean and variance per feature per class.
```python
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```
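To see what GaussianNB estimates under the hood, here is a minimal manual sketch on toy data (the data and equal-priors assumption are invented for illustration):

```python
import numpy as np

def gauss_pdf(x, mu, var):
    # Gaussian density N(x; mu, var)
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Toy 1-feature dataset: class 0 clusters near 1.0, class 1 near 3.0
X = np.array([1.0, 1.2, 0.8, 3.0, 3.2, 2.8])
y = np.array([0, 0, 0, 1, 1, 1])

# Estimate mean and variance per class (one feature here)
stats = {c: (X[y == c].mean(), X[y == c].var()) for c in (0, 1)}

def predict(x):
    # Equal priors assumed, so pick the class with the larger likelihood
    scores = {c: gauss_pdf(x, mu, var) for c, (mu, var) in stats.items()}
    return max(scores, key=scores.get)

print(predict(1.1))  # → 0 (near the class-0 mean)
print(predict(2.9))  # → 1 (near the class-1 mean)
```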
Multinomial Naive Bayes
For discrete counts (e.g., word frequencies):
P(xᵢ|y) = count(xᵢ, y) / Σⱼ count(xⱼ, y)

i.e., the fraction of all word tokens in class y that are word xᵢ.
Common for text classification.
```python
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_counts, y)
```
Bernoulli Naive Bayes
For binary features (word present/absent):
P(xᵢ|y) = P(xᵢ=1|y)^xᵢ × P(xᵢ=0|y)^(1-xᵢ)
```python
from sklearn.naive_bayes import BernoulliNB

model = BernoulliNB()
model.fit(X_binary, y)
```
Text Classification Example
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Pipeline: text → word counts → classifier
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])
pipeline.fit(train_texts, train_labels)
predictions = pipeline.predict(test_texts)
```
Spam Detection
P(spam | "free money now") ∝ P(spam) × P("free"|spam) × P("money"|spam) × P("now"|spam)
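The product above can be computed by hand. All conditional probabilities here are hypothetical, standing in for counts from a training corpus:

```python
# Made-up conditional probabilities from a hypothetical training corpus
p_word_given_spam = {"free": 0.30, "money": 0.20, "now": 0.10}
p_word_given_ham  = {"free": 0.02, "money": 0.05, "now": 0.08}
p_spam, p_ham = 0.5, 0.5

def score(words, prior, cond):
    # Unnormalized posterior: prior times product of word likelihoods
    s = prior
    for w in words:
        s *= cond[w]
    return s

msg = ["free", "money", "now"]
spam_score = score(msg, p_spam, p_word_given_spam)  # 0.5·0.30·0.20·0.10 = 0.003
ham_score  = score(msg, p_ham,  p_word_given_ham)   # 0.5·0.02·0.05·0.08 = 0.00004
print("spam" if spam_score > ham_score else "ham")  # → spam
```

The scores are unnormalized, but only their relative order matters for classification.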
Laplace Smoothing
The Problem
Zero probability for unseen words:
```python
P("cryptocurrency" | spam) = 0   # never seen in training
```

so the entire product ∏ P(xᵢ|y) collapses to 0.
The Solution
Add small count to all features:
P(xᵢ|y) = (count(xᵢ, y) + α) / (Σⱼ count(xⱼ, y) + α × |vocabulary|)
α = 1: Laplace smoothing
α < 1: Lidstone smoothing
```python
model = MultinomialNB(alpha=1.0)  # Laplace smoothing
```
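A quick numeric check shows what smoothing does to an unseen word. The counts below are hypothetical:

```python
# Hypothetical counts: "cryptocurrency" never appeared in spam training mail
vocab_size = 10_000        # |vocabulary|
total_spam_words = 50_000  # Σⱼ count(xⱼ, spam)
count = 0                  # count("cryptocurrency", spam)
alpha = 1.0                # Laplace smoothing

p_unsmoothed = count / total_spam_words                                 # 0.0 → kills the product
p_smoothed = (count + alpha) / (total_spam_words + alpha * vocab_size)  # small but nonzero

print(p_unsmoothed, p_smoothed)
```

The smoothed estimate is tiny (1/60000 here) but never exactly zero, so a single unseen word can no longer veto a class.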
Advantages
- Fast: training and prediction are linear in the number of examples and features
- Simple: Few parameters
- Works with small data: Needs less training data
- Handles high dimensions: Works well with many features
- Interpretable: Can inspect class probabilities
- No tuning: Works well out of the box
Limitations
- Independence assumption: Often violated
- Probability estimates: Often poorly calibrated
- Feature interactions: Can't capture them
- Zero frequency: Needs smoothing
When to Use
Good for:
- Text classification (spam, sentiment)
- Quick baseline
- High-dimensional sparse data
- Small training sets
- Real-time predictions
Consider alternatives when:
- Features are highly correlated
- Need accurate probability estimates
- Have enough data for complex models
Why It Works Despite Violations
Even when independence is violated:
- Still ranks classes correctly (relative ordering)
- Decision boundaries can still be good
- Errors from independence can cancel out
Comparison with Other Classifiers
| Aspect | Naive Bayes | Logistic Regression | SVM |
|---|---|---|---|
| Training speed | Very fast | Fast | Slow |
| Prediction speed | Very fast | Fast | Slow |
| Interpretability | High | High | Low |
| Feature interactions | No | Limited | Yes |
| Probability calibration | Poor | Good | Poor |
Calibration
If you need accurate probabilities:
```python
from sklearn.calibration import CalibratedClassifierCV

calibrated = CalibratedClassifierCV(MultinomialNB(), cv=5)
calibrated.fit(X_train, y_train)
probabilities = calibrated.predict_proba(X_test)
```
Key Takeaways
- Based on Bayes' theorem with independence assumption
- Fast, simple, works well for text
- Gaussian for continuous, Multinomial for counts, Bernoulli for binary
- Use smoothing to handle unseen features
- Great baseline, especially with small data
- Calibrate if you need accurate probabilities