Choosing a threshold is a product decision

Most classification models do not output decisions. They output scores. For a binary classifier that is usually a sigmoid or softmax output: a number between zero and one that behaves like a confidence, and that ranks cases from least to most likely to be positive. Calling it a probability flatters most models, since uncalibrated scores drift from true frequencies, but the ranking is real, and it is the part you can use. The model can be excellent at ranking and still be useless in practice, because at some point a human or a system has to draw a line and say this one, act on it; that one, leave it. That line is the threshold, and where you put it is a product and operational decision, not something the training run hands you for free. The 0.5 everyone defaults to is just the midpoint of the score scale; nothing about your problem lives there.

In healthcare and other high-stakes settings, the two kinds of mistake are not symmetric. Missing a real case (a false negative) and chasing a case that is not there (a false positive) carry very different costs, and those costs are about the world, not the loss function. So before tuning anything, it helps to make the tradeoff concrete. Two numbers carry most of the weight:

Precision answers: of the cases we flagged, how many were real? It is the price of acting.
Recall answers: of the real cases, how many did we catch? It is the cost of missing.

Raise the threshold and you flag fewer, more confident cases: precision climbs, recall falls. Lower it and you catch more real cases at the cost of more false alarms. You cannot maximize both at once with a single number, so the honest question is not what is the best threshold but which mistake can we better afford.
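A minimal sketch of that movement, using ten made-up cases (the scores and labels below are hypothetical, chosen only so the arithmetic is easy to check by hand):

```python
import numpy as np

# Hypothetical toy data: ten cases, five of them really positive.
scores = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])
labels = np.array([False, False, False, True, False, True, False, True, True, True])

def precision_recall(scores, labels, threshold):
    pred = scores >= threshold            # flag everything at or above the line
    tp = np.sum(pred & labels)            # flagged and really positive
    fp = np.sum(pred & ~labels)           # flagged but really negative
    fn = np.sum(~pred & labels)           # really positive but not flagged
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return float(precision), float(recall)

for t in (0.3, 0.5, 0.7):
    p, r = precision_recall(scores, labels, t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.3  precision=0.71  recall=1.00
# threshold=0.5  precision=0.80  recall=0.80
# threshold=0.7  precision=1.00  recall=0.60
```

Same scores, same labels, three different systems: the only thing that changed is the line.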

See the tradeoff move

Below is a synthetic population of 240 cases. Each one has a model score, and each one is really positive or really negative. Drag the threshold and watch the same model become a cautious one or an aggressive one. Nothing about the model changes; only the line does.

[Interactive figure: Precision / recall. Move the decision threshold (default 0.50). Really positive and really negative cases are plotted by score; everything right of the line is flagged positive. Readouts: precision, recall, flagged, true positives (caught), false positives (false alarms), false negatives (missed), true negatives.]

Notice there is no setting that makes both numbers go to 100 percent, because the two score distributions overlap. That overlap is the model's actual uncertainty, and no choice of threshold can argue it away. The threshold only decides how you spend it.

When the better model looks worse

Here is where this stops being abstract. Sooner or later every production ML team runs a champion-challenger evaluation. The champion is the model already in production. Suppose it is an ensemble: several models voting, accurate, but heavy. Every document costs multiple forward passes, latency adds up, and so does the maintenance surface. The challenger is a single model, a fraction of the cost and latency, easier to monitor and retrain. You want it to win.

So you run the validation set through both and compare at the default threshold of 0.5. The challenger's precision is clearly better. Its recall is clearly worse. In a setting where a false negative is a missed real case, the expensive mistake, that reads as disqualifying: faster and cheaper, but it misses more. Keep the champion?

That verdict is premature, and usually wrong. Comparing two models at the same threshold feels like apples to apples, but 0.5 is not a property of the problem; it is a property of each model's score scale. Two models can rank cases almost identically and still distribute their scores differently: one spreads scores wide, the other compresses them toward the low end. The same cut line lands at completely different operating points on each. You have not compared two models. You have compared two arbitrary points.
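To make the score-scale effect concrete, here is a small sketch under invented assumptions: model B's scores are simply model A's scores cubed, a hypothetical stand-in for a model that compresses scores toward the low end. The ranking is identical by construction, yet the same 0.5 cut lands at very different recall.

```python
import numpy as np

def recall_at(scores, labels, threshold):
    pred = scores >= threshold
    return float(np.sum(pred & labels) / np.sum(labels))

# Hypothetical toy data: ten cases, five of them really positive.
labels = np.array([False, False, False, True, False, True, False, True, True, True])
scores_a = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])
scores_b = scores_a ** 3   # monotone transform: same order, scores squeezed toward zero

same_ranking = np.array_equal(np.argsort(scores_a), np.argsort(scores_b))
recall_a = recall_at(scores_a, labels, 0.5)
recall_b = recall_at(scores_b, labels, 0.5)
recall_b_retuned = recall_at(scores_b, labels, 0.5 ** 3)  # B's equivalent of A's 0.5

print(same_ranking, recall_a, recall_b, recall_b_retuned)
# True 0.8 0.4 0.8
```

Model B looks like it misses twice as many cases at 0.5, but moved to its own equivalent cut it is exactly model A. Real models will not differ by a clean transform like this; the point is only that a shared threshold does not imply a shared operating point.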

Compare curves, not points

The fix is to evaluate threshold-free first. Sweep the threshold across its whole range, trace out every precision-recall pair each model can offer, and you get the precision–recall curve. If the challenger's curve sits at or above the champion's, it is at least as good at every operating point. It does not need the champion's threshold; it needs its own. One reference line to read it by: the dashed gray baseline is what flagging at random would get you – precision equal to the prevalence of true positives, at every recall. Any model worth running lives above it.
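The sweep itself can be sketched in a few lines of NumPy. The function names `pr_curve` and `precision_at_recall` are illustrative, not from any library, and the data are hypothetical; the sketch assumes boolean labels with at least one positive.

```python
import numpy as np

def pr_curve(scores, labels):
    """Sweep every distinct score as a threshold (flag when score >= threshold).

    Returns thresholds (high to low) with the precision and recall at each.
    Assumes labels is boolean and contains at least one positive.
    """
    order = np.argsort(-scores, kind="stable")   # highest score first
    s = scores[order]
    y = labels[order].astype(int)
    tp = np.cumsum(y)                            # true positives if we cut after each case
    fp = np.cumsum(1 - y)                        # false positives at the same cuts
    # Tied scores cannot be separated by a threshold: keep the last index of each tie run.
    last = np.r_[np.nonzero(np.diff(s))[0], len(s) - 1]
    precision = tp[last] / (tp[last] + fp[last])
    recall = tp[last] / y.sum()
    return s[last], precision, recall

def precision_at_recall(precision, recall, floor):
    """Best precision among operating points whose recall clears the floor."""
    ok = recall >= floor
    return float(precision[ok].max()) if ok.any() else 0.0

# Hypothetical toy data: ten cases, five of them really positive.
scores = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])
labels = np.array([False, False, False, True, False, True, False, True, True, True])

thresholds, precision, recall = pr_curve(scores, labels)
baseline = float(labels.mean())   # precision you would get by flagging at random
print(f"baseline={baseline:.2f}  "
      f"precision at recall>=0.98: {precision_at_recall(precision, recall, 0.98):.2f}")
```

To compare two models, run each through the same sweep and compare `precision_at_recall` at the recall levels that matter, rather than comparing both at one shared threshold.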

[Interactive figure: Champion vs challenger. A fixed champion, a tunable challenger. Precision-recall curves for the champion (ensemble) and the challenger (single model), with a 98% recall floor. Gray dot: champion, fixed at its production threshold of 0.50. Accent dot: challenger at the slider threshold (default 0.50). Readouts: champion precision and recall at 0.50; challenger precision and recall.]

Watch the dots. The gray one never moves: the champion is in production at its chosen operating point. At 0.5 the challenger sits parked at the far high-precision, low-recall end of its own curve, and that gap is the whole illusion. Drag its threshold down and it walks along its curve until it matches, then beats, the champion's recall, while still holding the precision advantage. The model was never worse; the default just parked it at the wrong operating point.

Could you retune the champion too? Of course. Its threshold is just as movable, and everything tuning could buy is already drawn: its full curve. That is the point of the curve. But look where that curve sits – below the challenger's at every recall. No threshold rescues it. When one curve dominates another, the bake-off is over before either model picks an operating point; the only remaining question is where on the winning curve to run.

The gold line is the part that settles the bake-off. In cancer surveillance the floor is real and unforgiving: a registry is expected to capture at least 98 percent of true cases. So the deciding question is not who wins at 0.5; it is who delivers more precision at that floor. Drag the threshold past it and look at the gap.

Best F1 is just another default

At this point a tempting shortcut appears: sweep the threshold, pick the one that maximizes F1, and report that. It feels principled because it came from a curve. But F1 is the harmonic mean of precision and recall, which means it weights a false alarm and a missed case equally, and that equality is an assumption smuggled in as a default, exactly like 0.5 was. If a missed case costs ten times what a false alarm costs, F1 is quietly optimizing the wrong objective.

Two things to notice. First, the F1 curve is flat across a wide band of thresholds: many settings look almost equally good by F1 while recall swings substantially underneath. A metric that cannot tell those settings apart is a blunt instrument for choosing between them. Second, the best-F1 point is marked because it is worth knowing, not because it decides anything. If the requirement is to catch at least 98 percent of real cases, the requirement picks the threshold: the highest one that still clears the floor. Best F1 is a diagnostic. The recall floor is the decision.
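A small sketch of how best F1 and a cost-aware choice can part ways. The ten cases are hypothetical, and the 10-to-1 cost ratio is an assumption borrowed from the example above, not a recommendation.

```python
import numpy as np

def confusion_counts(scores, labels, threshold):
    pred = scores >= threshold
    tp = int(np.sum(pred & labels))
    fp = int(np.sum(pred & ~labels))
    fn = int(np.sum(~pred & labels))
    return tp, fp, fn

def f1(tp, fp, fn):
    # F1 = 2TP / (2TP + FP + FN): FP and FN enter with the same weight.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def total_cost(fp, fn, cost_fp=1.0, cost_fn=10.0):
    # Assumed costs: a missed case is ten times as expensive as a false alarm.
    return cost_fp * fp + cost_fn * fn

# Hypothetical toy data: one real case (score 0.15) hides among low-scoring negatives.
scores = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])
labels = np.array([False, True, False, False, False, True, False, True, True, True])

thresholds = np.unique(scores)                      # every distinct score is a candidate cut
counts = [confusion_counts(scores, labels, t) for t in thresholds]
best_f1_threshold = float(thresholds[np.argmax([f1(*c) for c in counts])])
min_cost_threshold = float(thresholds[np.argmin([total_cost(c[1], c[2]) for c in counts])])

print(best_f1_threshold, min_cost_threshold)
# 0.55 0.15
```

On this data the best-F1 cut stops at 0.55 and accepts one missed case; once a miss is priced at ten false alarms, the cheaper policy is to drop to 0.15 and catch everything.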

Doing this in code

Mechanically this is a few lines. The work is not the arithmetic; it is deciding what precision and recall need to be for this particular use, given who is affected by each kind of error.

import numpy as np

# scores: 1-D float array, the model's positive-class score per case
#         (e.g. sigmoid/softmax output in [0, 1]); higher = more likely positive.
# labels: 1-D boolean array, the ground truth; True = really positive.
# Same length, same order. threshold: the cut you're testing.
def metrics_at(scores, labels, threshold):
    pred = scores >= threshold  # flag everything at or above the line
    tp = np.sum(pred & labels)
    fp = np.sum(pred & ~labels)
    fn = np.sum(~pred & labels)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall    = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Pick the operating point from the requirement, not the other way around.
# e.g. "we must catch at least 98% of real cases" -> choose the
# highest threshold whose recall is still >= 0.98.
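One way to turn that requirement into a number is sketched below. The helper name `threshold_for_recall` is hypothetical, the data are made up, and the function assumes boolean labels with at least one positive. It returns the highest observed score that still reaches the required recall on the data given, which is a point estimate from that sample and nothing more.

```python
import numpy as np

def threshold_for_recall(scores, labels, min_recall):
    """Highest cut (flag when score >= cut) whose recall on this data is >= min_recall."""
    pos = np.sort(scores[labels])[::-1]                # positive-class scores, high to low
    # Number of positives that must be caught; the tiny epsilon guards against
    # floating-point noise in products such as 0.98 * 50.
    k = max(int(np.ceil(min_recall * len(pos) - 1e-12)), 1)
    return float(pos[k - 1])                           # the k-th highest positive score

# Hypothetical toy data: ten cases, five of them really positive.
scores = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])
labels = np.array([False, False, False, True, False, True, False, True, True, True])

cut_98 = threshold_for_recall(scores, labels, 0.98)    # must catch all five positives
cut_80 = threshold_for_recall(scores, labels, 0.80)    # may miss one
print(cut_98, cut_80)
# 0.35 0.55
```

Any cut above the returned value catches fewer than the required number of positives, so this is the most precise operating point that still honors the floor on this sample.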

The trap is to report a single accuracy number, or to quietly leave the threshold at 0.5 because that is the default. On an imbalanced problem, 0.5 is rarely the right line, and accuracy can look excellent while the model misses most of the cases you actually care about.
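A constructed illustration of that trap, with entirely made-up numbers: 1,000 cases at 2 percent prevalence, where most of the real cases score below 0.5.

```python
import numpy as np

# Hypothetical imbalanced data: 20 real cases out of 1,000.
labels = np.zeros(1000, dtype=bool)
labels[:20] = True

scores = np.full(1000, 0.1)   # every negative scores low
scores[:5] = 0.9              # 5 positives score high
scores[5:20] = 0.3            # 15 positives score above the negatives but below 0.5

pred = scores >= 0.5
accuracy = float(np.mean(pred == labels))
recall = float(np.sum(pred & labels) / np.sum(labels))

print(f"accuracy={accuracy:.3f}  recall={recall:.2f}")
# accuracy=0.985  recall=0.25
```

The ranking here is perfect (every positive outscores every negative), and a cut at 0.3 would catch all twenty. At 0.5 the accuracy number still looks excellent while three quarters of the real cases go unflagged.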

Pick the operating point from the requirement, then measure the model against it. Do not let the default threshold quietly make a decision that belongs to you.

Where a threshold can let you down

A threshold is not magic. It is a policy decision attached to a score distribution, and it inherits every weakness of that distribution. Three limitations worth naming:

It is only as stable as the scores underneath it. Tune it on a validation set, deploy, and then the case mix shifts or the model is retrained: the score distribution moves, and the threshold silently stops meaning what it meant. The operating point you validated is no longer the one you are running. This is why threshold choice and drift monitoring are one system, not two.
It is an estimate, not a fact. “Recall is 98% at this threshold” is measured from finite data, and the data are thinnest exactly where high-stakes operating points live, out in the tail. Put uncertainty around the number, and pick the threshold whose lower confidence bound on recall clears the floor, rather than the one whose point estimate does.
One line may be too few. Nothing says a decision needs exactly one threshold. Auto-accept above a high line, auto-reject below a low one, and route the uncertain middle to human review. That two-threshold design is how high-recall systems protect the floor in practice without drowning anyone in false positives.
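The last two points can be sketched briefly. The Wilson score interval used below is one common choice for a confidence bound on a proportion, swapped in here for illustration; the text does not prescribe a method. The helper names, the z value of 1.96, and the 0.2 / 0.8 cut lines are all assumptions.

```python
import math
import numpy as np

def wilson_lower_bound(successes, n, z=1.96):
    """Lower end of the Wilson score interval for a proportion (z=1.96: 95% two-sided)."""
    if n == 0:
        return 0.0
    p = successes / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)

# Observed recall is 98% in both cases, but the evidence is very different.
small = wilson_lower_bound(49, 50)     # 49 of 50 real cases caught
large = wilson_lower_bound(490, 500)   # 490 of 500 real cases caught
print(f"n=50: {small:.3f}   n=500: {large:.3f}")

def triage(scores, low, high):
    """Auto-accept at or above `high`, auto-reject below `low`, review the middle."""
    return np.where(scores >= high, "accept",
                    np.where(scores < low, "reject", "review"))

print(triage(np.array([0.05, 0.50, 0.95]), low=0.2, high=0.8).tolist())
# ['reject', 'review', 'accept']
```

Neither sample supports the claim that recall is at least 98 percent with this level of confidence, even though both measured exactly 98 percent; the larger sample merely gets closer. A threshold chosen on the lower bound will sit lower than one chosen on the point estimate.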

And what about sidestepping the threshold entirely – ensembling several models and acting on their consensus instead? It is a good instinct, but it is not actually an alternative. A vote is a threshold in disguise: “flag when two of three models agree” is a cut at two-thirds on the share of models voting positive, just coarser and harder to tune than a cut on a continuous score. What ensembling genuinely buys is better, more stable scores: averaging across models smooths out individual quirks, usually improves the ranking, and tends to make the score distribution drift less, which protects whatever operating point you choose. So the two are complements, not rivals. Ensemble to score well; threshold to act deliberately.
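The equivalence between a majority vote and a cut is easy to check on a toy example. The votes below are hypothetical: five cases, three models, each model's decision already made.

```python
import numpy as np

# Hypothetical per-model decisions: rows are cases, columns are the three models.
votes = np.array([
    [True,  True,  True ],
    [True,  True,  False],
    [True,  False, False],
    [False, False, False],
    [False, True,  True ],
])

flag_by_vote = votes.sum(axis=1) >= 2            # "two of three models agree"

vote_share = votes.mean(axis=1)                  # share of models voting positive
flag_by_cut = vote_share >= 2 / 3 - 1e-9         # cut at two-thirds (tolerance for float rounding)

print(np.array_equal(flag_by_vote, flag_by_cut))
# True
print(len(np.unique(vote_share)))                # distinct values the vote share takes here
# 4
```

With three models the vote share can only take four values, so the voting rule offers at most four operating points, where a continuous score offers as many as there are distinct scores.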

What to take away

A model gives you a ranking. The threshold turns that ranking into action, and it encodes a value judgment about which mistakes you can live with. Treat it as part of the product: write down the requirement, choose the operating point that meets it, and revisit it when the cost of errors changes. And when a challenger looks worse than your champion at the default threshold, do not conclude; sweep. Compare the curves, match the models at the recall the business actually needs, and only then decide. Then hold the line loosely: a threshold is only as trustworthy as the scores beneath it, so monitor it like the part of the model it is. The math is the easy part.