Naive Bayes
Naive Bayes is a family of probabilistic classifiers based on Bayes' theorem with a "naive" assumption of feature independence. Despite its simplicity, it works surprisingly well, especially for text classification.
Bayes' Theorem
P(y|X) = P(X|y) × P(y) / P(X)
Posterior = (Likelihood × Prior) / Evidence
Classify by finding the class with highest posterior probability.
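A tiny worked example makes the rule concrete. All numbers below are made up for illustration: a hypothetical prior P(spam) = 0.4 and word likelihoods for a single word "free".

```python
# Hypothetical numbers: P(spam) = 0.4, P(ham) = 0.6,
# P("free" | spam) = 0.5, P("free" | ham) = 0.05
p_spam, p_ham = 0.4, 0.6
p_free_spam, p_free_ham = 0.5, 0.05

# Evidence P("free") via the law of total probability
p_free = p_free_spam * p_spam + p_free_ham * p_ham  # 0.20 + 0.03 = 0.23

# Posterior via Bayes' theorem
p_spam_given_free = p_free_spam * p_spam / p_free
print(round(p_spam_given_free, 3))  # → 0.87, so "spam" wins
```

Since 0.87 > 0.13, the message containing "free" is classified as spam.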
The "Naive" Assumption
Assume features are conditionally independent given the class:
P(x₁, x₂, ..., xₙ | y) = P(x₁|y) × P(x₂|y) × ... × P(xₙ|y)
This is rarely true in practice, but it works anyway!
Classification Rule
ŷ = argmax_y P(y) × ∏ P(xᵢ|y)
In log space (for numerical stability):
ŷ = argmax_y [log P(y) + Σ log P(xᵢ|y)]
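The log-space rule above can be sketched directly in NumPy. The probabilities here are invented toy values, with binary features indicating which words are present:

```python
import numpy as np

# Toy model: 2 classes, 3 features (all probabilities made up)
log_prior = np.log(np.array([0.6, 0.4]))  # log P(y)
log_likelihood = np.log(np.array([        # log P(x_i | y)
    [0.2, 0.5, 0.9],   # class 0
    [0.7, 0.1, 0.3],   # class 1
]))

x = np.array([1, 0, 1])  # features 0 and 2 are active in this example

# Sum the log-likelihoods of the active features, add the log prior
scores = log_prior + (log_likelihood * x).sum(axis=1)
y_hat = int(np.argmax(scores))
print(y_hat)  # → 0
```

Summing logs avoids the underflow you would get from multiplying many small probabilities directly.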
Variants
Gaussian Naive Bayes
For continuous features:
P(xᵢ|y) = N(xᵢ; μᵢᵧ, σ²ᵢᵧ)
Estimate mean and variance per feature per class.
```python
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```
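To see what GaussianNB estimates under the hood, here is a minimal manual sketch on toy data (the data and equal-priors assumption are invented for illustration):

```python
import numpy as np

def gauss_pdf(x, mu, var):
    # Gaussian density N(x; mu, var)
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Toy 1-feature dataset: class 0 clusters near 1.0, class 1 near 3.0
X = np.array([1.0, 1.2, 0.8, 3.0, 3.2, 2.8])
y = np.array([0, 0, 0, 1, 1, 1])

# Estimate mean and variance per class (one feature here)
stats = {c: (X[y == c].mean(), X[y == c].var()) for c in (0, 1)}

def predict(x):
    # Equal priors assumed, so pick the class with the larger likelihood
    scores = {c: gauss_pdf(x, mu, var) for c, (mu, var) in stats.items()}
    return max(scores, key=scores.get)

print(predict(1.1))  # → 0 (near the class-0 mean)
print(predict(2.9))  # → 1 (near the class-1 mean)
```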
Multinomial Naive Bayes
For discrete counts (e.g., word frequencies):
P(xᵢ|y) = count(xᵢ, y) / Σⱼ count(xⱼ, y)

i.e., the fraction of all word tokens in class y that are word xᵢ.
Common for text classification.
```python
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_counts, y)
```
Bernoulli Naive Bayes
For binary features (word present/absent):
P(xᵢ|y) = P(xᵢ=1|y)^xᵢ × P(xᵢ=0|y)^(1-xᵢ)
```python
from sklearn.naive_bayes import BernoulliNB

model = BernoulliNB()
model.fit(X_binary, y)
```
Text Classification Example
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Pipeline: text → word counts → classifier
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])
pipeline.fit(train_texts, train_labels)
predictions = pipeline.predict(test_texts)
```
Spam Detection
P(spam | "free money now") ∝ P(spam) × P("free"|spam) × P("money"|spam) × P("now"|spam)
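The product above can be computed by hand. All conditional probabilities here are hypothetical, standing in for counts from a training corpus:

```python
# Made-up conditional probabilities from a hypothetical training corpus
p_word_given_spam = {"free": 0.30, "money": 0.20, "now": 0.10}
p_word_given_ham  = {"free": 0.02, "money": 0.05, "now": 0.08}
p_spam, p_ham = 0.5, 0.5

def score(words, prior, cond):
    # Unnormalized posterior: prior times product of word likelihoods
    s = prior
    for w in words:
        s *= cond[w]
    return s

msg = ["free", "money", "now"]
spam_score = score(msg, p_spam, p_word_given_spam)  # 0.5·0.30·0.20·0.10 = 0.003
ham_score  = score(msg, p_ham,  p_word_given_ham)   # 0.5·0.02·0.05·0.08 = 0.00004
print("spam" if spam_score > ham_score else "ham")  # → spam
```

The scores are unnormalized, but only their relative order matters for classification.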
Laplace Smoothing
The Problem
Zero probability for unseen words:
```python
P("cryptocurrency" | spam) = 0   # never seen in training
```

so the entire product ∏ P(xᵢ|y) collapses to 0.
The Solution
Add small count to all features:
P(xᵢ|y) = (count(xᵢ, y) + α) / (Σⱼ count(xⱼ, y) + α × |vocabulary|)
α = 1: Laplace smoothing
α < 1: Lidstone smoothing
```python
model = MultinomialNB(alpha=1.0)  # Laplace smoothing
```
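A quick numeric check shows what smoothing does to an unseen word. The counts below are hypothetical:

```python
# Hypothetical counts: "cryptocurrency" never appeared in spam training mail
vocab_size = 10_000        # |vocabulary|
total_spam_words = 50_000  # Σⱼ count(xⱼ, spam)
count = 0                  # count("cryptocurrency", spam)
alpha = 1.0                # Laplace smoothing

p_unsmoothed = count / total_spam_words                                 # 0.0 → kills the product
p_smoothed = (count + alpha) / (total_spam_words + alpha * vocab_size)  # small but nonzero

print(p_unsmoothed, p_smoothed)
```

The smoothed estimate is tiny (1/60000 here) but never exactly zero, so a single unseen word can no longer veto a class.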
Advantages
- Fast: training and prediction are linear in the number of examples and features
- Simple: Few parameters
- Works with small data: Needs less training data
- Handles high dimensions: Works well with many features
- Interpretable: Can inspect class probabilities
- No tuning: Works well out of the box
Limitations
- Independence assumption: Often violated
- Probability estimates: Often poorly calibrated
- Feature interactions: Can't capture them
- Zero frequency: Needs smoothing
When to Use
Good for:
- Text classification (spam, sentiment)
- Quick baseline
- High-dimensional sparse data
- Small training sets
- Real-time predictions
Consider alternatives when:
- Features are highly correlated
- Need accurate probability estimates
- Have enough data for complex models
Why It Works Despite Violations
Even when independence is violated:
- Still ranks classes correctly (relative ordering)
- Decision boundaries can still be good
- Errors from independence can cancel out
Comparison with Other Classifiers
| Aspect | Naive Bayes | Logistic Regression | SVM |
|---|---|---|---|
| Training speed | Very fast | Fast | Slow |
| Prediction speed | Very fast | Fast | Slow |
| Interpretability | High | High | Low |
| Feature interactions | No | Limited | Yes |
| Probability calibration | Poor | Good | Poor |
Calibration
If you need accurate probabilities:
```python
from sklearn.calibration import CalibratedClassifierCV

calibrated = CalibratedClassifierCV(MultinomialNB(), cv=5)
calibrated.fit(X_train, y_train)
probabilities = calibrated.predict_proba(X_test)
```
Key Takeaways
- Based on Bayes' theorem with independence assumption
- Fast, simple, works well for text
- Gaussian for continuous, Multinomial for counts, Bernoulli for binary
- Use smoothing to handle unseen features
- Great baseline, especially with small data
- Calibrate if you need accurate probabilities