Beginner · Evaluation & Metrics

Master precision and recall - the fundamental metrics for evaluating classification models, and understand when to prioritize each.

Tags: metrics, classification, evaluation, f1-score

Precision and Recall

Precision and recall are fundamental metrics for classification tasks. They capture different aspects of model performance, and understanding their tradeoff is essential for building effective ML systems.

The Basics

For binary classification:

                    Predicted
                  Pos    Neg
Actual  Pos       TP     FN
        Neg       FP     TN
  • TP (True Positive): Correctly predicted positive
  • FP (False Positive): Incorrectly predicted positive (Type I error)
  • FN (False Negative): Incorrectly predicted negative (Type II error)
  • TN (True Negative): Correctly predicted negative
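
As a quick sketch, the four counts can be tallied directly in Python (toy data; the helper name is illustrative):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn

# Example: 4 actual positives, 4 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
print(confusion_counts(y_true, y_pred))  # (2, 1, 2, 3)
```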

Precision

Precision = TP / (TP + FP)

"Of all items I predicted positive, how many were actually positive?"

Also called: Positive Predictive Value (PPV)

Recall

Recall = TP / (TP + FN)

"Of all actual positives, how many did I find?"

Also called: Sensitivity, True Positive Rate (TPR)
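
A minimal sketch of both formulas, guarding against empty denominators:

```python
def precision(tp, fp):
    """Fraction of positive predictions that were correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Fraction of actual positives that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Using toy counts TP=2, FP=1, FN=2:
print(precision(2, 1))  # 2 of the 3 positive predictions were right
print(recall(2, 2))     # found 2 of the 4 actual positives
```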

Intuitive Examples

Spam Filter

  • High precision: Very few legitimate emails marked as spam
  • High recall: Very few spam emails reach inbox

Medical Diagnosis

  • High precision: When we say "disease", we're usually right
  • High recall: We catch most cases of the disease

Search Engine

  • High precision: Top results are relevant
  • High recall: All relevant pages are found

The Tradeoff

Precision and recall are often at odds:

Threshold ↓ (more positive predictions)
  → More TP (good for recall)
  → More FP (bad for precision)
  → Recall ↑, Precision ↓

Threshold ↑ (fewer positive predictions)
  → Fewer FP (good for precision)
  → Fewer TP (bad for recall)
  → Precision ↑, Recall ↓
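
The threshold effect can be seen on a small set of model scores (values are made up for illustration):

```python
# Model-assigned probabilities and ground-truth labels (toy data).
scores = [0.95, 0.85, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,    1,    0,   1,   0,   1,   0,   0]

def p_r_at_threshold(scores, labels, thr):
    """Precision and recall when predicting positive for score >= thr."""
    preds = [s >= thr for s in scores]
    tp = sum(p and t for p, t in zip(preds, labels))
    fp = sum(p and not t for p, t in zip(preds, labels))
    fn = sum((not p) and t for p, t in zip(preds, labels))
    prec = tp / (tp + fp) if tp + fp else 1.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

# Lowering the threshold trades precision for recall:
for thr in (0.8, 0.5, 0.15):
    print(thr, p_r_at_threshold(scores, labels, thr))
```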

When to Prioritize What

Prioritize Precision When:

  • False positives are costly
  • Users will lose trust with wrong predictions
  • Acting on prediction has high cost

Examples:

  • Spam filter (don't lose important emails)
  • Content moderation (don't wrongly ban users)
  • Autonomous vehicles (don't brake unnecessarily)

Prioritize Recall When:

  • False negatives are costly
  • Missing a positive is dangerous
  • Better to over-detect than under-detect

Examples:

  • Disease screening (don't miss cancer)
  • Fraud detection (don't miss fraud)
  • Safety systems (don't miss hazards)

F1 Score

Harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Properties:

  • Ranges from 0 to 1
  • Only high when BOTH precision and recall are high
  • Harmonic mean penalizes extreme values

Example:

  • P=0.9, R=0.1 → F1=0.18 (low!)
  • P=0.6, R=0.6 → F1=0.60 (balanced)
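
A one-line sketch that reproduces the numbers above:

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

print(round(f1(0.9, 0.1), 2))  # 0.18 -- the low recall drags F1 down
print(round(f1(0.6, 0.6), 2))  # 0.6  -- balanced inputs, balanced F1
```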

F-beta Score

Weighted version when you care more about one:

Fβ = (1 + β²) × (P × R) / (β²P + R)
  • β < 1: Precision-focused (F0.5)
  • β = 1: Balanced (F1)
  • β > 1: Recall-focused (F2)
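
The same idea as a small helper (the 0.9 / 0.5 values are arbitrary; note that F0.5 > F1 > F2 whenever precision exceeds recall):

```python
def f_beta(p, r, beta):
    """F-beta: beta < 1 favors precision, beta > 1 favors recall."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r) if (b2 * p + r) else 0.0

p, r = 0.9, 0.5
print(round(f_beta(p, r, 0.5), 3))  # weights precision more
print(round(f_beta(p, r, 1.0), 3))  # plain F1
print(round(f_beta(p, r, 2.0), 3))  # weights recall more
```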

Precision-Recall Curve

Plot precision vs recall at different thresholds:

Precision
    1 |\    
      | \    
      |  \___
      |      \____
    0 |____________
      0           1
            Recall

Good model: curve stays high toward the upper right.
Random classifier: horizontal line at the positive class rate.

Average Precision (AP)

The area under the PR curve: a single number summarizing performance across all thresholds.
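
One common way to compute AP is to average the precision at each rank where a true positive occurs; when scores have no ties this matches scikit-learn's step-sum definition. A pure-Python sketch on toy data:

```python
def average_precision(scores, labels):
    """AP: mean of precision@k over the ranks k of the true positives."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    num_pos = sum(labels)
    tp, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            ap += tp / rank  # precision at this recall step
    return ap / num_pos

scores = [0.9, 0.8, 0.7, 0.6]
labels = [1,   0,   1,   0]
print(average_precision(scores, labels))  # (1/1 + 2/3) / 2 ≈ 0.833
```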

Multi-class Extensions

Macro Average

Compute metric for each class, then average:

Macro F1 = (F1_class1 + F1_class2 + ... + F1_classN) / N

Treats all classes equally.

Micro Average

Pool all predictions, compute globally:

Micro Precision = Total TP / (Total TP + Total FP)

Weights by class frequency.

Weighted Average

Weight by class support (number of samples):

Weighted F1 = Σ (support_i × F1_i) / total_support
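
A sketch computing all three averages by hand on a tiny 3-class example (the labels are made up):

```python
from collections import Counter

def per_class_stats(y_true, y_pred, cls):
    """TP, FP, FN, and F1 for one class treated as 'positive'."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return tp, fp, fn, f1

y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]
classes = sorted(set(y_true))
stats = {c: per_class_stats(y_true, y_pred, c) for c in classes}

# Macro: unweighted mean of per-class F1.
macro_f1 = sum(s[3] for s in stats.values()) / len(classes)
# Weighted: per-class F1 weighted by class support.
support = Counter(y_true)
weighted_f1 = sum(support[c] * stats[c][3] for c in classes) / len(y_true)
# Micro: pool the counts, then compute globally.
total_tp = sum(s[0] for s in stats.values())
total_fp = sum(s[1] for s in stats.values())
micro_precision = total_tp / (total_tp + total_fp)

print(round(macro_f1, 3), round(weighted_f1, 3), round(micro_precision, 3))
```

For single-label multi-class problems, micro-averaged precision, recall, and F1 all reduce to plain accuracy, since every false positive for one class is a false negative for another.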

Common Pitfalls

1. Accuracy Isn't Enough

99% class 0, 1% class 1
Predict everything as class 0 → 99% accuracy!
But precision and recall for class 1 = 0
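
This failure mode is easy to reproduce:

```python
# 990 negatives, 10 positives; a "model" that always predicts class 0.
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall_pos = tp / (tp + fn)

# 99% accuracy, yet zero recall on the class we care about
# (precision is undefined here: no positive predictions at all).
print(accuracy, recall_pos)  # 0.99 0.0
```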

2. Ignoring Class Imbalance

With imbalanced data:

  • High precision may be easy (a model can predict only a few very confident positives)
  • High recall is more telling (it requires actually finding the rare positives)

3. Threshold Dependence

Precision and recall depend on the classification threshold. Always report the threshold at which they were computed.

4. Not Matching Business Needs

F1 assumes equal importance. Real costs often differ.

Practical Guidelines

Choosing Metrics

Scenario              Metric
--------------------  ----------------
Balanced classes      Accuracy, F1
Imbalanced classes    F1, PR-AUC
FP costly             Precision
FN costly             Recall
Need single number    F1 or F-beta
Comparing models      PR-AUC

Setting Thresholds

  1. Plot PR curve
  2. Identify acceptable operating point
  3. Consider business constraints
  4. May use cost-sensitive threshold
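
Steps 1-4 can be sketched as a simple search over candidate thresholds (the scores, labels, and 0.75 precision floor are all invented for illustration):

```python
# Toy model scores and ground truth.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
labels = [1,    1,   1,   0,   1,   0,   0,   1]

def pr_at(thr):
    """Precision and recall at a given decision threshold."""
    preds = [s >= thr for s in scores]
    tp = sum(p and t for p, t in zip(preds, labels))
    fp = sum(p and not t for p, t in zip(preds, labels))
    fn = sum((not p) and t for p, t in zip(preds, labels))
    prec = tp / (tp + fp) if tp + fp else 1.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

# Business constraint: precision must stay at or above 0.75;
# among qualifying thresholds, maximize recall.
best = None
for thr in sorted(set(scores)):  # each score is a candidate operating point
    prec, rec = pr_at(thr)
    if prec >= 0.75 and (best is None or rec > best[2]):
        best = (thr, prec, rec)
print(best)  # (threshold, precision, recall) of the chosen operating point
```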

Key Takeaways

  1. Precision: correctness of positive predictions
  2. Recall: coverage of actual positives
  3. They trade off against each other
  4. F1 balances both, F-beta weights them
  5. Choose based on cost of FP vs FN
  6. Use PR curves and AP for full picture

Practice Questions

Test your understanding with related interview questions.