AKA: When a model gives you a prediction, how confident is it?
First, let’s motivate why we would want confidence scores.
If your LLM-as-a-judge says a piece of LLM-generated code doesn’t contain insecure or backdoored code before it’s submitted as a PR, wouldn’t it be great to be certain the output really is clean whenever the confidence score is above a threshold?
If an agent is about to execute an action, say a credit card transaction, wouldn’t it be great to know it’s aligned with what its owner would do (and, if it’s unsure of what its owner would do and less confident, to route to a human)?
This is where confidence scores come in: a model’s prediction is used if the score is above a certain threshold, and discarded (with the system routing to a human) otherwise.
Use sampling consistency as a proxy for calibrated probability scores.
What this means:
The method uses observed consistency (established through multiple model calls at different temperatures on the same question) and self-reflection consistency (established through follow-up model calls that ask the model to assess the accuracy of its initial answer). The method is evaluated by how well the uncertainty scores predict answer accuracy, and also by its usefulness for improving model performance more generally.
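To make that concrete, here’s a minimal sketch in Python of what such a score could look like. This is illustrative, not the exact recipe: `call_model` is a stand-in for your actual LLM client, the temperatures, sample count, weighting, and self-reflection prompt are assumptions, and answers are compared by exact string match (in practice you’d normalize answers or check semantic equivalence).

```python
from collections import Counter

def call_model(prompt: str, temperature: float) -> str:
    """Placeholder: swap in your actual LLM client call."""
    raise NotImplementedError

def observed_consistency(question: str, n_samples: int = 5,
                         temperatures=(0.3, 0.7, 1.0)) -> tuple[str, float]:
    """Sample the same question several times at varied temperatures and
    measure how often the most common answer appears."""
    answers = [
        call_model(question, temperature=temperatures[i % len(temperatures)])
        for i in range(n_samples)
    ]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / n_samples

def self_reflection_score(question: str, answer: str) -> float:
    """Follow-up call asking the model to judge its own answer, mapping a
    verbalized verdict to a rough numeric score."""
    prompt = (
        f"Question: {question}\nProposed answer: {answer}\n"
        "Is this answer (A) correct, (B) possibly correct, or (C) incorrect? "
        "Reply with a single letter."
    )
    verdict = call_model(prompt, temperature=0.0).strip().upper()
    return {"A": 1.0, "B": 0.5, "C": 0.0}.get(verdict[:1], 0.5)

def confidence_score(question: str, w_consistency: float = 0.7) -> tuple[str, float]:
    """Combine the two signals into one score (the 0.7/0.3 weighting is an assumption)."""
    answer, consistency = observed_consistency(question)
    reflection = self_reflection_score(question, answer)
    return answer, w_consistency * consistency + (1 - w_consistency) * reflection
```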
At a high level, you’d want an evaluation dataset for the task at hand, one the model sometimes gets wrong. (Note: if it gets 100% on the eval every time, confidence scores might not be needed, assuming the eval task is representative of prod!)
Run your confidence-scoring algorithm on that evaluation dataset, and calculate the accuracy/precision/recall of the model’s predictions above the confidence threshold. You’d of course want to keep an eye on the number of predictions below the threshold, otherwise your human annotators or ML team will be slammed with predictions to review!
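Here’s a rough sketch of that evaluation step, assuming each eval example has a ground-truth label and an already-computed confidence score. The `ScoredPrediction` shape and the "insecure" positive label are hypothetical, just for illustration; precision/recall here are computed over the above-threshold subset, i.e., the predictions you’d actually keep.

```python
from dataclasses import dataclass

@dataclass
class ScoredPrediction:
    prediction: str    # the model's answer
    label: str         # ground-truth answer from the eval set
    confidence: float  # score from your confidence method

def threshold_report(results: list[ScoredPrediction], threshold: float,
                     positive_label: str = "insecure") -> dict:
    """Metrics on the kept predictions (confidence >= threshold), plus the
    fraction of traffic that would land on human reviewers."""
    above = [r for r in results if r.confidence >= threshold]
    kept = len(above)
    correct = sum(r.prediction == r.label for r in above)
    tp = sum(r.prediction == positive_label and r.label == positive_label for r in above)
    pred_pos = sum(r.prediction == positive_label for r in above)
    actual_pos = sum(r.label == positive_label for r in above)
    return {
        "threshold": threshold,
        "fraction_routed_to_human": 1 - kept / len(results),
        "accuracy_above_threshold": correct / kept if kept else float("nan"),
        "precision_above_threshold": tp / pred_pos if pred_pos else float("nan"),
        "recall_above_threshold": tp / actual_pos if actual_pos else float("nan"),
    }

# Sweep a few candidate thresholds to see the accuracy-vs-review-load tradeoff:
# for t in (0.5, 0.7, 0.9):
#     print(threshold_report(eval_results, t))
```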