AKA: When a model gives you a prediction, how confident is it?
First, let’s motivate why we would want confidence scores.
If your LLM-as-a-judge says a piece of LLM-generated code doesn’t contain insecure or backdoored code before it’s submitted as a PR, wouldn’t it be great to be certain the output really is clean whenever the confidence score is above a threshold?
If an agent is about to execute an action, say a credit card transaction, wouldn’t it be great to know it’s aligned with what its owner would do (and, if it’s unsure of what its owner would do and less confident, to route to a human)?
This is where confidence scores come in: a model’s prediction is used if the score is above a certain threshold, and discarded (with the system routing to a human) otherwise.
Use sampling consistency as a proxy for calibrated probability scores.
What this means:
The method uses observed consistency (established through multiple model calls at different temperatures on the same question) and self-reflection consistency (established through follow-up model calls that ask the model to assess the accuracy of its initial answer). The method is evaluated by how well the uncertainty scores predict answer accuracy, and also by its usefulness for improving model performance more generally.
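To make that concrete, here’s a minimal sketch in Python of what such a score could look like. This is illustrative, not the exact recipe: `call_model` is a stand-in for your actual LLM client, the temperatures, sample count, weighting, and self-reflection prompt are assumptions, and answers are compared by exact string match (in practice you’d normalize answers or check semantic equivalence).

```python
from collections import Counter

def call_model(prompt: str, temperature: float) -> str:
    """Placeholder: swap in your actual LLM client call."""
    raise NotImplementedError

def observed_consistency(question: str, n_samples: int = 5,
                         temperatures=(0.3, 0.7, 1.0)) -> tuple[str, float]:
    """Sample the same question several times at varied temperatures and
    measure how often the most common answer appears."""
    answers = [
        call_model(question, temperature=temperatures[i % len(temperatures)])
        for i in range(n_samples)
    ]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / n_samples

def self_reflection_score(question: str, answer: str) -> float:
    """Follow-up call asking the model to judge its own answer, mapping a
    verbalized verdict to a rough numeric score."""
    prompt = (
        f"Question: {question}\nProposed answer: {answer}\n"
        "Is this answer (A) correct, (B) possibly correct, or (C) incorrect? "
        "Reply with a single letter."
    )
    verdict = call_model(prompt, temperature=0.0).strip().upper()
    return {"A": 1.0, "B": 0.5, "C": 0.0}.get(verdict[:1], 0.5)

def confidence_score(question: str, w_consistency: float = 0.7) -> tuple[str, float]:
    """Combine the two signals into one score (the 0.7/0.3 weighting is an assumption)."""
    answer, consistency = observed_consistency(question)
    reflection = self_reflection_score(question, answer)
    return answer, w_consistency * consistency + (1 - w_consistency) * reflection
```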
At a high level, you’d want an evaluation dataset for the task at hand, one the model sometimes gets wrong. (Note: if it gets 100% on the eval every time, confidence scores might not be needed, assuming the eval task is representative of prod!)
Run your confidence-scoring algorithm on that evaluation dataset, and calculate the accuracy/precision/recall of the model’s predictions above the confidence threshold. You’d of course want to keep an eye on the number of predictions below the threshold, otherwise your human annotators or ML team will be slammed with predictions to review!
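Here’s a rough sketch of that evaluation step, assuming each eval example has a ground-truth label and an already-computed confidence score. The `ScoredPrediction` shape and the "insecure" positive label are hypothetical, just for illustration; precision/recall here are computed over the above-threshold subset, i.e., the predictions you’d actually keep.

```python
from dataclasses import dataclass

@dataclass
class ScoredPrediction:
    prediction: str    # the model's answer
    label: str         # ground-truth answer from the eval set
    confidence: float  # score from your confidence method

def threshold_report(results: list[ScoredPrediction], threshold: float,
                     positive_label: str = "insecure") -> dict:
    """Metrics on the kept predictions (confidence >= threshold), plus the
    fraction of traffic that would land on human reviewers."""
    above = [r for r in results if r.confidence >= threshold]
    kept = len(above)
    correct = sum(r.prediction == r.label for r in above)
    tp = sum(r.prediction == positive_label and r.label == positive_label for r in above)
    pred_pos = sum(r.prediction == positive_label for r in above)
    actual_pos = sum(r.label == positive_label for r in above)
    return {
        "threshold": threshold,
        "fraction_routed_to_human": 1 - kept / len(results),
        "accuracy_above_threshold": correct / kept if kept else float("nan"),
        "precision_above_threshold": tp / pred_pos if pred_pos else float("nan"),
        "recall_above_threshold": tp / actual_pos if actual_pos else float("nan"),
    }

# Sweep a few candidate thresholds to see the accuracy-vs-review-load tradeoff:
# for t in (0.5, 0.7, 0.9):
#     print(threshold_report(eval_results, t))
```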