Thresholds, accuracy's lie, and computing metrics
A classifier doesn't output yes/no directly; it outputs a score, and you pick a threshold that turns the score into a decision. Lower the threshold and you catch more positives (recall rises) but raise more false alarms (precision falls). Where you set it is a policy choice about which mistake is worse, not a fact the maths decides for you.
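The trade-off can be seen by sweeping a threshold over a handful of scores. This is a minimal sketch: the scores, the labels, and the helper name `precision_recall_at` are invented for illustration, and a score at or above the threshold is assumed to count as a positive prediction.

```python
# Hypothetical model scores and true labels (1 = positive), for illustration only.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]


def precision_recall_at(threshold, scores, labels):
    """Return (precision, recall) when score >= threshold is predicted positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, labels) if p == 0 and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Lowering the threshold trades precision for recall on this toy data.
for threshold in (0.9, 0.5, 0.2):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, the strictest threshold gives perfect precision but misses two of the three positives, while the loosest catches all three at the cost of two false alarms. Which row is "best" depends on what each kind of mistake costs.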
F1 = 2 * precision * recall / (precision + recall) combines the two into one number (it is their harmonic mean). Precisely because it collapses two numbers into one, it hides the trade-off: always look at precision and recall themselves, not just F1.
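To make the point concrete, here is a small sketch showing two very different classifiers that share the same F1. The helper name `harmonic_f1` and the precision/recall pairs are hypothetical, chosen only to illustrate the arithmetic.

```python
def harmonic_f1(precision, recall):
    """F1 as defined above, returning 0.0 when both inputs are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical model A: balanced, half of its alerts are right, half of positives found.
balanced = harmonic_f1(0.5, 0.5)

# Hypothetical model B: never raises a false alarm, but finds only a third of positives.
lopsided = harmonic_f1(1.0, 1 / 3)

print(f"balanced F1 = {balanced:.3f}")
print(f"lopsided F1 = {lopsided:.3f}")
```

Both calls give an F1 of 0.5, yet one model misses two thirds of the positives and the other does not. F1 alone cannot tell them apart.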
Accuracy — the fraction of all predictions that are correct — sounds like the obvious metric, but it lies when one class is rare. If only 0.1% of transactions are fraud, a model that simply predicts "not fraud" for everything scores 99.9% accuracy while catching zero fraud. It is useless and looks excellent.
That's why, for rare positives, you reason in precision, recall, and absolute counts rather than accuracy (and prefer a precision-recall curve to an ROC curve). This is the literacy that turns "the model is 97% accurate" from a reassurance into a question.
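The fraud example can be checked numerically. This sketch assumes a made-up dataset of 100,000 transactions with a 0.1% fraud rate and a "model" that always predicts "not fraud"; the variable names are illustrative.

```python
# Assumed dataset size and fraud rate (0.1%), for illustration only.
n_total = 100_000
n_fraud = 100

true = [1] * n_fraud + [0] * (n_total - n_fraud)
pred = [0] * n_total  # always predict "not fraud"

# Accuracy: fraction of predictions that match the truth.
accuracy = sum(1 for p, t in zip(pred, true) if p == t) / len(true)

# Recall: fraction of actual fraud cases that were caught.
tp = sum(1 for p, t in zip(pred, true) if p == 1 and t == 1)
fn = sum(1 for p, t in zip(pred, true) if p == 0 and t == 1)
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.3f}  recall={recall:.1f}  fraud caught={tp} of {n_fraud}")
```

Accuracy comes out at 0.999 while recall is 0.0: the absolute count, zero of 100 fraud cases caught, says what the accuracy figure conceals.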
Putting it together: count TP, FP and FN by comparing predictions with the truth, then apply the formulas, guarding any zero denominator:
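# pred and true are assumed to be equal-length sequences of 0/1 labels (1 = positive)
# Count each outcome by comparing every prediction with its true label
tp = sum(1 for p, t in zip(pred, true) if p == 1 and t == 1)  # true positives
fp = sum(1 for p, t in zip(pred, true) if p == 1 and t == 0)  # false positives
fn = sum(1 for p, t in zip(pred, true) if p == 0 and t == 1)  # false negatives
# Fall back to 0.0 whenever a denominator would be zero
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0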
With these definitions in hand, scikit-learn's classification_report becomes something you can read critically instead of taking on faith.
This card leads into “Every metric is an estimate; confidence intervals”.