Beginner · Classical Machine Learning

Learn about Naive Bayes - a simple yet powerful probabilistic classifier based on Bayes' theorem with strong independence assumptions.

Tags: classification, probabilistic, text-classification, bayes

Naive Bayes

Naive Bayes is a family of probabilistic classifiers based on Bayes' theorem with a "naive" assumption of feature independence. Despite its simplicity, it works surprisingly well, especially for text classification.

Bayes' Theorem

P(y|X) = P(X|y) × P(y) / P(X)

Posterior = (Likelihood × Prior) / Evidence

Classify by finding the class with highest posterior probability.

The "Naive" Assumption

Assume features are conditionally independent given the class:

P(x₁, x₂, ..., xₙ | y) = P(x₁|y) × P(x₂|y) × ... × P(xₙ|y)

This is rarely true in practice, but it works anyway!

Classification Rule

ŷ = argmax_y P(y) × ∏ P(xᵢ|y)

In log space (for numerical stability):

ŷ = argmax_y [log P(y) + Σ log P(xᵢ|y)]
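The log-space rule above can be sketched directly with NumPy. The priors and per-feature likelihoods here are made-up toy numbers, not estimates from real data:

```python
import numpy as np

# Hypothetical toy numbers: 2 classes, 3 observed feature values.
log_prior = np.log(np.array([0.6, 0.4]))  # log P(y)

# log P(x_i | y) for the observed feature values, one row per class
log_likelihood = np.log(np.array([
    [0.2, 0.5, 0.1],   # class 0
    [0.4, 0.3, 0.3],   # class 1
]))

# log P(y) + sum_i log P(x_i | y), then pick the class with the highest score
log_posterior = log_prior + log_likelihood.sum(axis=1)
prediction = int(np.argmax(log_posterior))
```

Summing logs instead of multiplying raw probabilities avoids underflow when there are many features.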

Variants

Gaussian Naive Bayes

For continuous features:

P(xᵢ|y) = N(xᵢ; μᵢᵧ, σ²ᵢᵧ)

Estimate mean and variance per feature per class.

from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
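Under the hood, fitting amounts to the per-class estimates described above. A minimal sketch with toy, made-up data:

```python
import numpy as np

# Toy data: 6 samples, 2 features, 2 classes (illustrative values only).
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
              [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]])
y = np.array([0, 0, 0, 1, 1, 1])

# Mean and variance of each feature, estimated separately per class
means = np.array([X[y == c].mean(axis=0) for c in (0, 1)])
variances = np.array([X[y == c].var(axis=0) for c in (0, 1)])

def gaussian_pdf(x, mu, var):
    # N(x; mu, var), evaluated elementwise — the likelihood P(x_i | y)
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
```

Classification then plugs these densities into the log-space rule from earlier.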

Multinomial Naive Bayes

For discrete counts (e.g., word frequencies):

P(xᵢ|y) = count(xᵢ, y) / count(y)

Common for text classification.

from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_counts, y)

Bernoulli Naive Bayes

For binary features (word present/absent):

P(xᵢ|y) = P(xᵢ=1|y)^xᵢ × P(xᵢ=0|y)^(1-xᵢ)

from sklearn.naive_bayes import BernoulliNB

model = BernoulliNB()
model.fit(X_binary, y)

Text Classification Example

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Pipeline: text → word counts → classifier
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

pipeline.fit(train_texts, train_labels)
predictions = pipeline.predict(test_texts)

Spam Detection

P(spam | "free money now") ∝ P(spam) × P("free"|spam) × P("money"|spam) × P("now"|spam)
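The proportionality above can be turned into a tiny scorer. The word likelihoods and priors below are hypothetical numbers standing in for estimates from a training corpus:

```python
import math

# Hypothetical per-word likelihoods "estimated" from training emails
p_word_spam = {"free": 0.05, "money": 0.04, "now": 0.03}
p_word_ham  = {"free": 0.005, "money": 0.008, "now": 0.02}
p_spam, p_ham = 0.3, 0.7  # hypothetical class priors

def log_score(words, prior, likelihoods):
    # log P(y) + sum of log P(word | y)
    return math.log(prior) + sum(math.log(likelihoods[w]) for w in words)

words = ["free", "money", "now"]
spam_score = log_score(words, p_spam, p_word_spam)
ham_score = log_score(words, p_ham, p_word_ham)
is_spam = spam_score > ham_score
```

Only the relative scores matter: the evidence term P(X) is the same for both classes, so it can be dropped.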

Laplace Smoothing

The Problem

Zero probability for unseen words:

P("cryptocurrency"|spam) = 0  # Never seen in training
→ Entire product becomes 0

The Solution

Add small count to all features:

P(xᵢ|y) = (count(xᵢ, y) + α) / (count(y) + α × |vocabulary|)

α = 1: Laplace smoothing
α < 1: Lidstone smoothing

model = MultinomialNB(alpha=1.0)  # Laplace smoothing
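The smoothed estimate can be sketched by hand. The word counts here are toy values, with one word never seen in the class:

```python
import numpy as np

# Toy word counts for one class over a 3-word vocabulary (illustrative)
counts = np.array([3, 1, 0])       # last word never seen in this class
vocab_size = len(counts)
alpha = 1.0                        # Laplace smoothing

# Smoothed estimate: (count + alpha) / (total count + alpha * |vocabulary|)
probs = (counts + alpha) / (counts.sum() + alpha * vocab_size)
```

Every word now gets a nonzero probability, so a single unseen word no longer zeroes out the whole product.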

Advantages

  1. Fast: training and prediction are linear in the number of samples and features
  2. Simple: Few parameters
  3. Works with small data: Needs less training data
  4. Handles high dimensions: Works well with many features
  5. Interpretable: Can inspect class probabilities
  6. No tuning: Works well out of the box

Limitations

  1. Independence assumption: Often violated
  2. Probability estimates: Often poorly calibrated
  3. Feature interactions: Can't capture them
  4. Zero frequency: Needs smoothing

When to Use

Good for:

  • Text classification (spam, sentiment)
  • Quick baseline
  • High-dimensional sparse data
  • Small training sets
  • Real-time predictions

Consider alternatives when:

  • Features are highly correlated
  • Need accurate probability estimates
  • Have enough data for complex models

Why It Works Despite Violations

Even when independence is violated:

  1. Still ranks classes correctly (relative ordering)
  2. Decision boundaries can still be good
  3. Errors from independence can cancel out

Comparison with Other Classifiers

Aspect                    Naive Bayes   Logistic Regression   SVM
Training speed            Very fast     Fast                  Slow
Prediction speed          Very fast     Fast                  Slow
Interpretability          High          High                  Low
Feature interactions      No            Limited               Yes
Probability calibration   Poor          Good                  Poor

Calibration

If you need accurate probabilities:

from sklearn.calibration import CalibratedClassifierCV

calibrated = CalibratedClassifierCV(MultinomialNB(), cv=5)
calibrated.fit(X_train, y_train)
probabilities = calibrated.predict_proba(X_test)

Key Takeaways

  1. Based on Bayes' theorem with independence assumption
  2. Fast, simple, works well for text
  3. Gaussian for continuous, Multinomial for counts, Bernoulli for binary
  4. Use smoothing to handle unseen features
  5. Great baseline, especially with small data
  6. Calibrate if you need accurate probabilities

Practice Questions

Test your understanding with these related interview questions: