⚡ The F1 score is the harmonic mean of precision (when the model says positive, how often is it right?) and recall (of all actual positives, how many did the model find?). It is the go-to metric for imbalanced datasets — where accuracy is misleading. A cancer screener that never flags anyone is 99% accurate if 99% of patients are healthy. Its F1 score is zero.
Category: Machine Learning · Difficulty: Beginner · Last updated: 15 May 2026 · 4 min read
What is F1 score?
Accuracy sounds like the obvious way to measure a model. If 94 out of 100 predictions are correct, the model is 94% accurate. Simple. Useful. Until you have an imbalanced dataset.
Imagine building a model to detect a rare disease that affects 1% of the population. A model that always says “no disease” is 99% accurate — it correctly classifies every healthy person. But it misses every single patient. That is a useless model with a great accuracy score. Accuracy lied.
F1 score fixes this by combining two complementary metrics — precision and recall — into one number that penalises both false alarms and missed detections. A model that ignores the rare class entirely scores an F1 of zero, regardless of its accuracy. You cannot hide behind the majority class.
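The rare-disease scenario can be checked with a few lines of plain Python. This is a minimal sketch on made-up data (1,000 patients, 10 of them sick, both numbers chosen only for illustration), and it assumes the common convention that F1 is reported as 0 when the model produces no true positives.

```python
# Toy data for the rare-disease example: 1% of patients are sick (assumed numbers).
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # a model that always says "no disease"

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # sick, flagged
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # sick, missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, cleared

accuracy = (tp + tn) / len(y_true)

# F1 written in terms of counts: 2TP / (2TP + FP + FN), which is algebraically
# the same as the harmonic mean of precision and recall.
denom = 2 * tp + fp + fn
f1 = 2 * tp / denom if denom else 0.0  # convention: 0 when undefined

print(f"accuracy = {accuracy:.2f}, F1 = {f1:.2f}")
```

The model is right about all 990 healthy patients, so accuracy looks excellent, while the 10 missed patients drive F1 on the disease class to zero.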
How F1 score works
Precision — of everything the model predicted as positive, what fraction actually was positive?
Formula: True Positives / (True Positives + False Positives)
High precision = few false alarms. “When I say someone has the disease, I am usually right.”
Recall (Sensitivity) — of all the actual positives that exist, what fraction did the model find?
Formula: True Positives / (True Positives + False Negatives)
High recall = few missed cases. “I find most of the disease cases that are actually there.”
F1 Score — the harmonic mean of precision and recall.
Formula: 2 × (Precision × Recall) / (Precision + Recall)
Ranges from 0 (worst) to 1 (perfect). Requires both precision and recall to be high to score well. Punishes models that sacrifice one for the other.
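The three formulas above can be sketched as one small function. The function name `precision_recall_f1` and the counts passed to it are hypothetical, chosen for illustration, and returning 0.0 when a denominator is zero is an assumed convention rather than part of the definitions.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts.

    Assumed convention: any metric with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Balanced case (hypothetical counts): 80 true positives, 20 false alarms, 40 misses.
p_bal, r_bal, f1_bal = precision_recall_f1(tp=80, fp=20, fn=40)

# Lopsided case: never wrong when it fires, but it finds only 10 of 100 positives.
p_lop, r_lop, f1_lop = precision_recall_f1(tp=10, fp=0, fn=90)

print(f"balanced: P={p_bal:.2f} R={r_bal:.2f} F1={f1_bal:.2f}")
print(f"lopsided: P={p_lop:.2f} R={r_lop:.2f} F1={f1_lop:.2f}")
```

In the lopsided case the arithmetic mean of precision and recall would be 0.55, while F1 comes out near 0.18. The harmonic mean is pulled toward the weaker of the two numbers, which is why a model cannot score well by maximising only one of them.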
Real-world examples
Illustrative scenarios where the choice of metric changes the conclusion.
- A fraud detection model with 99% accuracy sounds impressive — until you check that 99.5% of transactions are legitimate. F1 score on the fraud class reveals whether the model actually catches fraud or just classifies everything as legitimate.
- A cancer screening AI evaluated only on accuracy might look excellent. Evaluated on recall, you discover it misses 30% of actual cancer cases. That 30% is people who receive a false all-clear and do not get treatment.
- Email spam filters use precision-recall tradeoffs deliberately — they prioritise precision (rarely blocking legitimate emails) over recall (occasionally letting spam through), because a false positive costs the user more than a false negative.
Common pitfalls
- F1 treats false positives and false negatives equally — but they rarely cost equally in the real world. Use F-beta score to weight recall more (beta > 1) or precision more (beta < 1) based on your specific cost structure.
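- Macro vs micro F1: for multi-class problems, macro F1 computes F1 separately for each class and then averages the results, so rare classes count as much as common ones. Micro F1 pools true positives, false positives, and false negatives across all classes before computing a single score, so common classes dominate; in single-label multi-class problems it equals accuracy. Choose macro when rare classes matter as much as common ones.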
- F1 is class-specific — always specify which class you are measuring F1 for. F1 on the positive class tells a very different story from F1 on the negative class.
- Threshold dependence — F1 is calculated at a specific decision threshold. The precision-recall curve shows performance across all thresholds — always examine the full curve, not just F1 at the default 0.5 threshold.
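Three of the pitfalls above can be illustrated together in a short sketch: F-beta weighting, macro versus micro averaging, and threshold dependence. Everything here is hypothetical: the helper names `f_beta` and `f1_from_counts`, the per-class counts, and the scores and labels are invented for illustration. The F-beta formula used is the standard (1 + beta²) · P · R / (beta² · P + R).

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 weights recall more, beta < 1 weights precision more."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


def f1_from_counts(tp, fp, fn):
    """F1 from counts, using 2TP / (2TP + FP + FN); 0.0 if undefined (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# 1) F-beta: same model (precision 0.8, recall 0.5), three different weightings.
f1_score = f_beta(0.8, 0.5)
f2_score = f_beta(0.8, 0.5, beta=2.0)      # recall-heavy: penalised for low recall
f_half_score = f_beta(0.8, 0.5, beta=0.5)  # precision-heavy: rewarded for high precision

# 2) Macro vs micro F1 on hypothetical per-class counts (tp, fp, fn).
counts = {"common": (90, 9, 5), "rare": (1, 5, 9)}
macro_f1 = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)
totals = [sum(col) for col in zip(*counts.values())]  # pooled tp, fp, fn
micro_f1 = f1_from_counts(*totals)

# 3) Threshold dependence: the same scores give a different F1 at each cut-off.
scores = [0.95, 0.80, 0.60, 0.45, 0.30, 0.10]  # hypothetical model outputs
labels = [1, 1, 0, 1, 0, 0]                    # hypothetical ground truth
f1_by_threshold = {}
for threshold in (0.3, 0.5, 0.7):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    f1_by_threshold[threshold] = f1_from_counts(tp, fp, fn)

print(f"F0.5={f_half_score:.3f}  F1={f1_score:.3f}  F2={f2_score:.3f}")
print(f"macro F1={macro_f1:.3f}  micro F1={micro_f1:.3f}")
print({t: round(v, 3) for t, v in f1_by_threshold.items()})
```

On these toy numbers the poorly handled rare class drags macro F1 far below micro F1, and the default 0.5 threshold gives the lowest F1 of the three cut-offs tried. Neither result generalises beyond this example; the point is only that a single reported F1 hides both choices.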
Frequently asked questions
QUESTION 1 What is the F1 score in simple terms?
ANSWER 1 It combines precision (when the model says positive, how often is it right?) and recall (of all actual positives, how many did it find?) into one number — penalising models that sacrifice either.
QUESTION 2 What is the difference between precision and recall?
ANSWER 2 Precision measures how few false alarms the model raises; recall measures how few real cases it misses. Raising one usually lowers the other, and F1 balances the two in a single metric.
QUESTION 3 Why is accuracy misleading for imbalanced datasets?
ANSWER 3 A model predicting the majority class for every input achieves high accuracy while completely ignoring the minority class — which is often the entire point of the model.
QUESTION 4 When should you prioritise recall over precision?
ANSWER 4 When missing a positive is very costly — cancer screening, fraud detection. Prioritise precision when false positives are costly — spam filtering, unnecessary medical treatment.