Thresholds, accuracy's lie, and computing metrics

Intermediate Python for Data Science

Created by Best · 24.06.2026 at 14:03 UTC

A classifier doesn't output yes/no directly; it outputs a score, and you pick a threshold that turns the score into a decision. Lower the threshold and you catch more positives (recall rises) but raise more false alarms (precision falls). Where you set it is a policy choice about which mistake is worse, not a fact the maths decides for you.
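To make the trade-off concrete, here is a minimal sketch that sweeps three thresholds over a handful of scores. The scores, labels, and the helper name `precision_recall_at` are invented for illustration; they are not from any real model.

```python
# Hypothetical scores and true labels, invented for illustration only.
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]


def precision_recall_at(threshold, scores, labels):
    """Predict positive when score >= threshold, then count outcomes."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, labels) if p == 0 and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Lowering the threshold turns more scores into positive predictions.
for threshold in (0.75, 0.50, 0.25):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, recall climbs from 0.50 to 0.75 to 1.00 as the threshold drops, while precision slides from 1.00 to 0.75 to 0.67. Real data is noisier, and precision need not fall at every single step, but the overall direction is the same.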

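F1 = 2 * precision * recall / (precision + recall) combines the two into one number (their harmonic mean), but precisely because it collapses them into a single value, it hides the trade-off: two models with very different precision and recall can share the same F1. Always look at precision and recall themselves, not just F1.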
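A short sketch of how F1 can hide the trade-off: the two models below are hypothetical, with opposite failure modes, yet the formula gives them the same score. The helper name `f1` and the numbers are assumptions made up for this example.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Two hypothetical models with opposite failure modes.
cautious = {"precision": 0.90, "recall": 0.30}  # few alerts, misses most positives
eager = {"precision": 0.30, "recall": 0.90}     # catches most positives, many false alarms

for name, m in (("cautious", cautious), ("eager", eager)):
    print(name, round(f1(m["precision"], m["recall"]), 2))
```

Both lines print 0.45. Which model is preferable depends on which mistake is more costly, and F1 alone cannot tell you that.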

Accuracy — the fraction of all predictions that are correct — sounds like the obvious metric, but it lies when one class is rare. If only 0.1% of transactions are fraud, a model that simply predicts "not fraud" for everything scores 99.9% accuracy while catching zero fraud. It is useless and looks excellent.
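The fraud example can be checked in a few lines. This is a sketch on a synthetic dataset of 100,000 transactions with 100 frauds (an assumed size that gives the 0.1% rate), not real data.

```python
# Hypothetical dataset: 100,000 transactions, 0.1% of them fraud (label 1).
n_total = 100_000
n_fraud = 100
true = [1] * n_fraud + [0] * (n_total - n_fraud)

# A "model" that predicts "not fraud" for every transaction.
pred = [0] * n_total

correct = sum(1 for p, t in zip(pred, true) if p == t)
tp = sum(1 for p, t in zip(pred, true) if p == 1 and t == 1)
fn = sum(1 for p, t in zip(pred, true) if p == 0 and t == 1)

accuracy = correct / n_total
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.3f}  recall={recall:.3f}  fraud caught={tp} of {n_fraud}")
```

Accuracy comes out at 0.999 while recall is 0.000: every one of the 100 frauds is missed.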

That's why, for rare positives, you reason in precision, recall, and absolute counts rather than accuracy (and prefer a precision-recall curve to an ROC curve). This is the literacy that turns "the model is 97% accurate" from a reassurance into a question.
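One way to see why absolute counts matter: an ROC curve plots recall against the false positive rate, and a 1% false positive rate looks excellent on that plot. The sketch below converts such a point into counts for a rare class. All numbers are assumed, chosen only to illustrate the arithmetic.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
n_total = 1_000_000
n_pos = 1_000                # 0.1% positive class
n_neg = n_total - n_pos

recall = 0.90                # true positive rate (y-axis of an ROC curve)
false_positive_rate = 0.01   # x-axis of an ROC curve

tp = round(recall * n_pos)               # positives caught
fp = round(false_positive_rate * n_neg)  # negatives wrongly flagged
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"TP={tp}  FP={fp}  precision={precision:.3f}")
```

With these assumptions, 900 true positives are buried among 9,990 false alarms, so precision is only about 8%. The ROC point barely moves when negatives are plentiful, whereas precision exposes the problem directly.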

Putting it together: count TP, FP and FN by comparing predictions with the truth, then apply the formulas, guarding any zero denominator:

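# Sample labels for illustration (positive = 1); substitute your own lists
pred = [1, 1, 0, 1]
true = [1, 0, 0, 1]
# Count each outcome by comparing every prediction with the truth
tp = sum(1 for p, t in zip(pred, true) if p == 1 and t == 1)  # true positives
fp = sum(1 for p, t in zip(pred, true) if p == 1 and t == 0)  # false positives
fn = sum(1 for p, t in zip(pred, true) if p == 0 and t == 1)  # false negatives
# Guard each division so a zero denominator yields 0.0 instead of an error
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0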

With these definitions in hand, scikit-learn's classification_report becomes something you can read critically instead of taking on faith.
This card leads into “Every metric is an estimate; confidence intervals”.


Related cards

Builds on Confusion matrix, precision, and recall · Python for Data Science

Next Every metric is an estimate; confidence intervals · Python for Data Science

Tasks

Question 1

A fraud model has 99% recall but 4% precision. A stakeholder asks if it is 'good'. What is the most correct response?

Yes — 99% recall is excellent

It depends on the base rate and review capacity; 4% precision may flood reviewers with false alarms

No — recall above 95% is always overfitting

Cannot say anything without the ROC-AUC

Question 2

Why can 'accuracy' be a misleading metric when the positive class is very rare (e.g. 0.1% of cases)?

Accuracy cannot be computed for rare classes

A model that always predicts the majority class scores near-perfect accuracy while catching zero positives

Accuracy always equals recall

Rare classes make accuracy exceed 100%

Question 3

stdin: line 1 = predicted labels (0/1, space-separated); line 2 = true labels (0/1, space-separated), positive = 1. Compute precision, recall, and F1; print them space-separated, each rounded to 2 decimals. Use 0.0 for any metric whose denominator is 0.

Example input:

1 1 0 1
1 0 0 1

Expected output:

0.67 1.0 0.8


3 test cases will be used for grading


Card Info

Topic: Python for Data Science
Difficulty: Intermediate

Creator

Best

BestBuddy