Beginner · Evaluation & Metrics

Master precision and recall - the fundamental metrics for evaluating classification models, and understand when to prioritize each.

Tags: metrics, classification, evaluation, f1-score

Precision and Recall

Precision and recall are fundamental metrics for classification tasks. They capture different aspects of model performance, and understanding their tradeoff is essential for building effective ML systems.

The Basics

For binary classification:

                    Predicted
                  Pos    Neg
Actual  Pos       TP     FN
        Neg       FP     TN
  • TP (True Positive): Correctly predicted positive
  • FP (False Positive): Incorrectly predicted positive (Type I error)
  • FN (False Negative): Incorrectly predicted negative (Type II error)
  • TN (True Negative): Correctly predicted negative
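
As a quick sketch, the four counts can be tallied directly in Python (toy data; the helper name is illustrative):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn

# Example: 4 actual positives, 4 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
print(confusion_counts(y_true, y_pred))  # (2, 1, 2, 3)
```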

Precision

Precision = TP / (TP + FP)

"Of all items I predicted positive, how many were actually positive?"

Also called: Positive Predictive Value (PPV)

Recall

Recall = TP / (TP + FN)

"Of all actual positives, how many did I find?"

Also called: Sensitivity, True Positive Rate (TPR)
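
A minimal sketch of both formulas, guarding against empty denominators:

```python
def precision(tp, fp):
    """Fraction of positive predictions that were correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Fraction of actual positives that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Using toy counts TP=2, FP=1, FN=2:
print(precision(2, 1))  # 2 of the 3 positive predictions were right
print(recall(2, 2))     # found 2 of the 4 actual positives
```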

Intuitive Examples

Spam Filter

  • High precision: Very few legitimate emails marked as spam
  • High recall: Very few spam emails reach inbox

Medical Diagnosis

  • High precision: When we say "disease", we're usually right
  • High recall: We catch most cases of the disease

Search Engine

  • High precision: Top results are relevant
  • High recall: All relevant pages are found

The Tradeoff

Precision and recall are often at odds:

Threshold ↓ (more positive predictions)
  → More TP (good for recall)
  → More FP (bad for precision)
  → Recall ↑, Precision ↓

Threshold ↑ (fewer positive predictions)
  → Fewer FP (good for precision)
  → Fewer TP (bad for recall)
  → Precision ↑, Recall ↓
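
The threshold effect can be seen on a small set of model scores (values are made up for illustration):

```python
# Model-assigned probabilities and ground-truth labels (toy data).
scores = [0.95, 0.85, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,    1,    0,   1,   0,   1,   0,   0]

def p_r_at_threshold(scores, labels, thr):
    """Precision and recall when predicting positive for score >= thr."""
    preds = [s >= thr for s in scores]
    tp = sum(p and t for p, t in zip(preds, labels))
    fp = sum(p and not t for p, t in zip(preds, labels))
    fn = sum((not p) and t for p, t in zip(preds, labels))
    prec = tp / (tp + fp) if tp + fp else 1.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

# Lowering the threshold trades precision for recall:
for thr in (0.8, 0.5, 0.15):
    print(thr, p_r_at_threshold(scores, labels, thr))
```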

When to Prioritize What

Prioritize Precision When:

  • False positives are costly
  • Users will lose trust with wrong predictions
  • Acting on prediction has high cost

Examples:

  • Spam filter (don't lose important emails)
  • Content moderation (don't wrongly ban users)
  • Autonomous vehicles (don't brake unnecessarily)

Prioritize Recall When:

  • False negatives are costly
  • Missing a positive is dangerous
  • Better to over-detect than under-detect

Examples:

  • Disease screening (don't miss cancer)
  • Fraud detection (don't miss fraud)
  • Safety systems (don't miss hazards)

F1 Score

Harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Properties:

  • Ranges from 0 to 1
  • Only high when BOTH precision and recall are high
  • Harmonic mean penalizes extreme values

Example:

  • P=0.9, R=0.1 → F1=0.18 (low!)
  • P=0.6, R=0.6 → F1=0.60 (balanced)
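
A one-line sketch that reproduces the numbers above:

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

print(round(f1(0.9, 0.1), 2))  # 0.18 -- the low recall drags F1 down
print(round(f1(0.6, 0.6), 2))  # 0.6  -- balanced inputs, balanced F1
```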

F-beta Score

Weighted version when you care more about one:

Fβ = (1 + β²) × (P × R) / (β²P + R)
  • β < 1: Precision-focused (F0.5)
  • β = 1: Balanced (F1)
  • β > 1: Recall-focused (F2)
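
The same idea as a small helper (the 0.9 / 0.5 values are arbitrary; note that F0.5 > F1 > F2 whenever precision exceeds recall):

```python
def f_beta(p, r, beta):
    """F-beta: beta < 1 favors precision, beta > 1 favors recall."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r) if (b2 * p + r) else 0.0

p, r = 0.9, 0.5
print(round(f_beta(p, r, 0.5), 3))  # weights precision more
print(round(f_beta(p, r, 1.0), 3))  # plain F1
print(round(f_beta(p, r, 2.0), 3))  # weights recall more
```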

Precision-Recall Curve

Plot precision vs recall at different thresholds:

Precision
    1 |\    
      | \    
      |  \___
      |      \____
    0 |____________
      0           1
            Recall

Good model: curve stays high toward the upper right.
Random classifier: horizontal line at the positive class rate.

Average Precision (AP)

The area under the PR curve: a single number summarizing performance across all thresholds.
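
One common way to compute AP is to average the precision at each rank where a true positive occurs; when scores have no ties this matches scikit-learn's step-sum definition. A pure-Python sketch on toy data:

```python
def average_precision(scores, labels):
    """AP: mean of precision@k over the ranks k of the true positives."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    num_pos = sum(labels)
    tp, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            ap += tp / rank  # precision at this recall step
    return ap / num_pos

scores = [0.9, 0.8, 0.7, 0.6]
labels = [1,   0,   1,   0]
print(average_precision(scores, labels))  # (1/1 + 2/3) / 2 ≈ 0.833
```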

Multi-class Extensions

Macro Average

Compute metric for each class, then average:

Macro F1 = (F1_class1 + F1_class2 + ... + F1_classN) / N

Treats all classes equally.

Micro Average

Pool all predictions, compute globally:

Micro Precision = Total TP / (Total TP + Total FP)

Weights by class frequency.

Weighted Average

Weight by class support (number of samples):

Weighted F1 = Σ (support_i × F1_i) / total_support
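
A sketch computing all three averages by hand on a tiny 3-class example (the labels are made up):

```python
from collections import Counter

def per_class_stats(y_true, y_pred, cls):
    """TP, FP, FN, and F1 for one class treated as 'positive'."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return tp, fp, fn, f1

y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]
classes = sorted(set(y_true))
stats = {c: per_class_stats(y_true, y_pred, c) for c in classes}

# Macro: unweighted mean of per-class F1.
macro_f1 = sum(s[3] for s in stats.values()) / len(classes)
# Weighted: per-class F1 weighted by class support.
support = Counter(y_true)
weighted_f1 = sum(support[c] * stats[c][3] for c in classes) / len(y_true)
# Micro: pool the counts, then compute globally.
total_tp = sum(s[0] for s in stats.values())
total_fp = sum(s[1] for s in stats.values())
micro_precision = total_tp / (total_tp + total_fp)

print(round(macro_f1, 3), round(weighted_f1, 3), round(micro_precision, 3))
```

For single-label multi-class problems, micro-averaged precision, recall, and F1 all reduce to plain accuracy, since every false positive for one class is a false negative for another.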

Common Pitfalls

1. Accuracy Isn't Enough

99% class 0, 1% class 1
Predict everything as class 0 → 99% accuracy!
But precision and recall for class 1 = 0
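
This failure mode is easy to reproduce:

```python
# 990 negatives, 10 positives; a "model" that always predicts class 0.
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall_pos = tp / (tp + fn)

# 99% accuracy, yet zero recall on the class we care about
# (precision is undefined here: no positive predictions at all).
print(accuracy, recall_pos)  # 0.99 0.0
```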

2. Ignoring Class Imbalance

With imbalanced data:

  • High precision may be easy (a model can predict only a few very confident positives)
  • High recall is more telling (it requires actually finding the rare positives)

3. Threshold Dependence

Precision and recall depend on the classification threshold. Always report the threshold at which they were computed.

4. Not Matching Business Needs

F1 assumes equal importance. Real costs often differ.

Practical Guidelines

Choosing Metrics

Scenario              Metric
--------------------  ----------------
Balanced classes      Accuracy, F1
Imbalanced classes    F1, PR-AUC
FP costly             Precision
FN costly             Recall
Need single number    F1 or F-beta
Comparing models      PR-AUC

Setting Thresholds

  1. Plot PR curve
  2. Identify acceptable operating point
  3. Consider business constraints
  4. May use cost-sensitive threshold
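
Steps 1-4 can be sketched as a simple search over candidate thresholds (the scores, labels, and 0.75 precision floor are all invented for illustration):

```python
# Toy model scores and ground truth.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
labels = [1,    1,   1,   0,   1,   0,   0,   1]

def pr_at(thr):
    """Precision and recall at a given decision threshold."""
    preds = [s >= thr for s in scores]
    tp = sum(p and t for p, t in zip(preds, labels))
    fp = sum(p and not t for p, t in zip(preds, labels))
    fn = sum((not p) and t for p, t in zip(preds, labels))
    prec = tp / (tp + fp) if tp + fp else 1.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

# Business constraint: precision must stay at or above 0.75;
# among qualifying thresholds, maximize recall.
best = None
for thr in sorted(set(scores)):  # each score is a candidate operating point
    prec, rec = pr_at(thr)
    if prec >= 0.75 and (best is None or rec > best[2]):
        best = (thr, prec, rec)
print(best)  # (threshold, precision, recall) of the chosen operating point
```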

Key Takeaways

  1. Precision: correctness of positive predictions
  2. Recall: coverage of actual positives
  3. They trade off against each other
  4. F1 balances both, F-beta weights them
  5. Choose based on cost of FP vs FN
  6. Use PR curves and AP for full picture

Practice Questions

Test your understanding with related interview questions.