Language Models Struggle With Numeric Calibration

Evaluating numeric calibration in language model outputs.

Research @ Kensho

2025

Imagine you reach into a bag of marbles containing 98 blue marbles and 99 red marbles. If you grab one without looking, what are the chances it's red versus blue? Without pulling out a pen and paper, most people would say about an even chance. When we ask gpt-4o, it puts 99.7% of its probability mass on red.

From 98 blue marbles and 99 red marbles, Tommy reached blindly into a bag and grabbed a marble with the color [blue/red]

Left: The problem setup. Center: A calibrated output would be roughly 50/50. Right: What gpt-4o actually produces.
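
As a quick check on what "calibrated" means for this example, the ideal probabilities follow directly from the marble counts. The helper name `ideal_distribution` below is ours and purely illustrative.

```python
from fractions import Fraction


def ideal_distribution(counts):
    """Return the calibrated probability of drawing each color from a bag."""
    total = sum(counts.values())
    return {color: Fraction(n, total) for color, n in counts.items()}


bag = {"blue": 98, "red": 99}
ideal = ideal_distribution(bag)
for color, p in ideal.items():
    # Prints "blue: 98/197 = 0.497" and "red: 99/197 = 0.503"
    print(f"{color}: {p} = {float(p):.3f}")
```

A calibrated model would put roughly 50.3% of its mass on red, not 99.7%.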

This is not necessarily bad or surprising in isolation. It is possible that in aggregate the model would produce roughly balanced results over many prompt variations. But what we found is that a number of different models have stable preferences for specific colors and orderings. Word order and identity, which should be arbitrary factors, have a real and systematic impact on model outputs.

In our recent paper, we studied how language models answer and assign probabilities in basic scenarios like these. And we are not the only ones—since our work came out, a number of other groups have asked similar questions. See related work by Balepur et al., Gao et al., Chen et al., and Zhao et al.

Patterns in model behavior

The figure below is involved but reveals a great deal. Each cell corresponds to 100 problems like the example above, with 50 cases where the first option has a higher value and 50 where the second does. gpt-4o-mini exhibits a stable pattern across the diagonal: the model's behavior depends strongly on the order of the keywords. For one ordering, say white/purple, the model behaves in one way and switches to an entirely different behavior profile for the reverse ordering. We see similar patterns for other models. See our paper for more detail.

Each cell represents model performance over 100 balanced examples. The top-left cell reads: when purple is listed before white, purple receives nearly 100% of the probability mass in all cases. The bottom-right cell: when white is listed before purple, the option with the higher numeric value is correctly picked 99 out of 100 times. The strong diagonal pattern reveals systematic ordering bias.
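
To make the cell construction concrete, here is a minimal sketch of how one balanced cell of 100 problems could be generated for a single color ordering. The prompt template is taken from the example above; the function name, the range of marble counts, and the seeding are assumptions for illustration and may differ from the setup in the paper.

```python
import random

TEMPLATE = (
    "From {n1} {c1} marbles and {n2} {c2} marbles, Tommy reached blindly "
    "into a bag and grabbed a marble with the color"
)


def make_balanced_cell(first, second, n=100, max_count=100, seed=0):
    """Build n problems for one ordering: half with the first option higher."""
    rng = random.Random(seed)
    problems = []
    for i in range(n):
        # Two distinct counts, so one option is always strictly more likely.
        lo, hi = sorted(rng.sample(range(1, max_count + 1), 2))
        # First half: first-listed color has more marbles; second half: fewer.
        n1, n2 = (hi, lo) if i < n // 2 else (lo, hi)
        problems.append(
            {
                "prompt": TEMPLATE.format(n1=n1, c1=first, n2=n2, c2=second),
                "counts": {first: n1, second: n2},
            }
        )
    return problems


cell = make_balanced_cell("purple", "white")
print(cell[0]["prompt"])
```

Swapping the arguments, as in `make_balanced_cell("white", "purple")`, gives the cell on the other side of the diagonal.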

Measuring calibration

We measure the distance between the calibrated ideal and the model outputs using Wasserstein Distance (WD). Wasserstein Distance captures how much "shifting" between two probability distributions is needed for them to match. A WD of 0 means the distributions are identical. See Lilian Weng's explainer for a nice introduction. To provide context, we compare against several baseline strategies: Pick Higher places all probability mass on the option with the higher value; Pick Lower does the opposite; Pick First/Second ignores values entirely; and Pick Random randomly assigns probability.

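| Baseline | Calibration (WD) [↓] |
| --- | --- |
| Pick Higher | 0.47 |
| Pick Lower | 0.95 |
| Pick First / Second | 0.71 |
| Pick Random | 0.27 |

| Model | Calibration (WD) [↓] |
| --- | --- |
| Mistral 7B v0.3 | 0.48 |
| Yi 1.5 | 0.49 |
| Llama 3.1 8B | 0.40 |
| gemma 2 9b | 0.50 |
| gpt-4o-mini | 0.42 |
| gpt-4o | 0.40 |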
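
The distance and the baseline strategies described above can be sketched as follows. This is a minimal illustration, assuming the two options sit one unit apart, in which case the Wasserstein distance reduces to the absolute difference in mass on the first option. The ground metric and normalization used in the paper may differ, so the values here are not meant to reproduce the table. All function names are ours, and drawing Pick Random from a uniform distribution is an assumption.

```python
import random
from itertools import accumulate


def wasserstein_1d(p, q):
    """W1 distance between two distributions over the same ordered options,
    assuming unit spacing between adjacent options."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    cdf_p = accumulate(p)
    cdf_q = accumulate(q)
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


def ideal(n_first, n_second):
    total = n_first + n_second
    return [n_first / total, n_second / total]


def pick_higher(n_first, n_second):
    return [1.0, 0.0] if n_first > n_second else [0.0, 1.0]


def pick_lower(n_first, n_second):
    return [0.0, 1.0] if n_first > n_second else [1.0, 0.0]


def pick_first(n_first, n_second):
    return [1.0, 0.0]  # ignores the values entirely


def pick_random(n_first, n_second, rng=random):
    p = rng.random()  # assumption: mass on the first option is uniform on [0, 1)
    return [p, 1.0 - p]


# The marble example: 98 blue (listed first) and 99 red.
target = ideal(98, 99)
gpt4o_like = [0.003, 0.997]  # 99.7% of the mass on red, as reported above
for name, dist in [
    ("pick_higher", pick_higher(98, 99)),
    ("pick_lower", pick_lower(98, 99)),
    ("pick_first", pick_first(98, 99)),
    ("gpt-4o-like", gpt4o_like),
]:
    print(f"{name}: {wasserstein_1d(target, dist):.3f}")
```

Averaging this distance over many problems gives a single calibration score per strategy or model.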

The results are striking: all models are poorly calibrated. None is more calibrated than randomly assigning probability mass, and only half outperform the Pick Higher baseline. We also study relative entropy and find that models tend to produce outputs that are too confident—far too low in entropy. Much of this mode collapse occurs after instruction tuning, though instruction tuning does have the benefit of leading models to at least choose valid words (like red or blue).
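
To illustrate what "too low in entropy" looks like, the sketch below compares the Shannon entropy of a near-deterministic output against the calibrated ideal for the marble example, and also computes the KL divergence between them. The exact entropy measure used in the paper is not spelled out here, so treat both quantities as illustrative assumptions.

```python
import math


def entropy_bits(p):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_divergence_bits(p, q):
    """KL(p || q) in bits; infinite if q misses an outcome that p allows."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0:
            continue
        if b == 0:
            return math.inf
        total += a * math.log2(a / b)
    return total


ideal_dist = [98 / 197, 99 / 197]  # calibrated answer for the marble example
model_dist = [0.003, 0.997]        # 99.7% of the mass on red

print(f"ideal entropy: {entropy_bits(ideal_dist):.3f} bits")
print(f"model entropy: {entropy_bits(model_dist):.3f} bits")
print(f"KL(ideal || model): {kl_divergence_bits(ideal_dist, model_dist):.3f} bits")
```

The ideal distribution carries close to one full bit of uncertainty, while the overconfident output carries almost none.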

Testing newer models

These results were established on models available in early 2024, and our paper was published at ACL 2025. We now check whether the results hold up on models released afterward, including gpt-4.1 and the gpt-5 series. Many of the latest models (most of the gpt-5-* and o* series) don't expose token logprobs, in part because they are reasoning models that hide thinking tokens. For gpt-5.1 and gpt-5.2, logprobs are available when reasoning effort is set to 'none'.
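
When logprobs are available, they still have to be turned into a distribution over the two options. The sketch below shows one way to do that, assuming the next-token logprobs have already been collected into a plain dict mapping token strings to log probabilities. That input format, the function name, the renormalization over valid options, and the exact-match rule for tokens are all assumptions for illustration. In practice a color word may span several tokens, which this sketch ignores.

```python
import math


def option_probabilities(top_logprobs, options):
    """Renormalize next-token mass over the allowed option words.

    top_logprobs: dict of token string -> log probability (assumed format).
    Returns None when no mass lands on any valid option.
    """
    mass = {opt: 0.0 for opt in options}
    for token, logprob in top_logprobs.items():
        word = token.strip().lower()  # " Red" and "red" count as the same option
        if word in mass:
            mass[word] += math.exp(logprob)
    total = sum(mass.values())
    if total == 0:
        return None
    return {opt: m / total for opt, m in mass.items()}


example = {" red": math.log(0.6), " blue": math.log(0.2), " the": math.log(0.2)}
print(option_probabilities(example, ["blue", "red"]))
```

Mass on tokens outside the option set (here " the") is dropped before renormalizing, so the returned probabilities always sum to one.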

Newer models are not obviously better. When we run 10,000 new examples per model, we see different but stable behavioral signatures. Even when we fix the seed, results tend to change across runs, though the overall patterns remain relatively stable. gpt-5.2 always picks the higher value; gpt-4.1-mini almost always picks the first item; gpt-4o-mini almost always picks the first item but also exhibits color-ordering effects.

Average Wasserstein Distance (calibration error) by model across 10,000 examples. Lower is better. Even the best-performing model (gpt-4o) remains far from perfect calibration.

The aggregate calibration error tells part of the story, but the per-color-pair heatmaps below reveal the specific behavioral patterns behind each model's score.

[Figure: behavior heatmaps. Axes: First Color Listed, Second Color Listed. Panel titles include gpt-5.1, gpt-5.2, gpt-4.1, and gpt-4.1-mini. Legend: calibrated, higher, lower, first, second, null.]
Behavior heatmaps for six models across all color-pair orderings at scale. Each model develops its own distinct pattern of biases. Some are dominated by position (pick-first), others by numeric value (pick-higher), and some by complex color-ordering interactions.
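
One way to assign a heatmap cell to a behavior label is sketched below. The label names come from the figure legend, but the decision rule, the thresholds `tol` and `dominance`, and the use of "null" for cells with no dominant strategy are assumptions made for illustration, not the procedure used to produce the figure.

```python
def classify_cell(examples, tol=0.1, dominance=0.9):
    """Label one cell of balanced examples.

    examples: list of (p_first, n_first, n_second), where p_first is the
    model's probability on the first-listed option and the counts are the
    numeric values attached to each option (assumed distinct).
    """
    n = len(examples)
    mean_err = sum(abs(p - n1 / (n1 + n2)) for p, n1, n2 in examples) / n
    if mean_err <= tol:
        return "calibrated"

    picks_first = [p > 0.5 for p, _, _ in examples]
    first_is_higher = [n1 > n2 for _, n1, n2 in examples]
    frac_first = sum(picks_first) / n
    # The model "picks higher" when its choice agrees with the larger count.
    frac_higher = sum(a == b for a, b in zip(picks_first, first_is_higher)) / n

    if frac_higher >= dominance:
        return "higher"
    if frac_higher <= 1 - dominance:
        return "lower"
    if frac_first >= dominance:
        return "first"
    if frac_first <= 1 - dominance:
        return "second"
    return "null"


always_first = [(0.99, 60, 40)] * 50 + [(0.99, 40, 60)] * 50
always_higher = [(0.99, 60, 40)] * 50 + [(0.01, 40, 60)] * 50
print(classify_cell(always_first), classify_cell(always_higher))
```

Because each cell is balanced, a model that always picks the first option agrees with the higher value exactly half the time, which is what lets the rule separate position bias from value-following.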

We can also examine calibration at the individual example level. In the scatter plots below, the x-axis represents the ideal (true) probability and the y-axis represents the model's predicted probability. Points falling on the diagonal line are perfectly calibrated. The shaded band around the diagonal marks approximately calibrated predictions. The top-right and bottom-left quadrants (green) indicate the model gets the direction right—it assigns higher probability to the more likely option. The top-left and bottom-right quadrants (red) indicate the model gets it wrong.

How to read the calibration scatter plots. Green quadrants indicate directionally calibrated predictions; red quadrants indicate the model assigns higher probability to the wrong option. The diagonal band marks approximately calibrated outputs.
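
The reading guide above can be expressed as a small function that places a single (ideal, predicted) point into one of the three regions. The band half-width of 0.1 is an assumed value, since the exact width of the shaded band is not stated, and the region names are ours.

```python
def scatter_region(ideal, predicted, band=0.1):
    """Place one (ideal, predicted) probability pair on the scatter plot."""
    if abs(predicted - ideal) <= band:
        return "calibrated band"
    # Green quadrants: both values fall on the same side of 0.5.
    same_side = (ideal - 0.5) * (predicted - 0.5) > 0
    return "directionally correct" if same_side else "wrong direction"


# Marble example, red option: ideal 99/197, model output 0.997.
print(scatter_region(99 / 197, 0.997))
```

A point that commits almost fully to the more likely option lands in a green quadrant but well outside the band, which is the overconfident pattern the plots show.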

None of the models exhibit truly calibrated behavior along the diagonal. Notably, gpt-5.1 and gpt-5.2 behave very differently: gpt-5.2 is always directionally calibrated because it always picks the higher value, which is itself a form of mode collapse. gpt-4.1-mini collapses to a different strategy: always picking the first option listed. (This is also discernible from the heatmaps above.) We take these results to suggest that our findings continue, for now, to replicate.

Calibration scatter plots showing ideal probability versus model probability for each option. Points along the diagonal represent perfect calibration. The heavy clustering at the extremes (0 and 1) reflects model overconfidence—models tend to commit almost entirely to one option rather than expressing graded uncertainty.

So what?

Language models exhibit strong biases and systematic patterns when faced with probabilistic choices, and they often fail to beat even basic heuristics such as Pick Random.

Becoming more helpful for day-to-day tasks does not appear to solve these more fundamental problems. Models that are excellent at coding and reasoning still fail at proportional probability assignment.

We recommend caution when using models to make decisions in such environments. Both benchmarks and user testimony point to strong reasoning and coding abilities, but there appear to be gaps when it comes to calibrated uncertainty.

Open questions and related work

Several directions are worth watching.

Complex scenarios. How far does this extend beyond simple two-option settings? There is some evidence that point-wise bias of the kind we observe here does not necessarily transfer to longer generation scenarios. See Ruted Evaluation.

Verbalized Sampling. Verbalized Sampling is a recently proposed method that has models generate multiple outputs along with verbalized probabilities, and demonstrates strong results.

Debiasing approaches. Li (2025) takes a different approach, using a Non-parametric Order-Preserving Algorithm (NOA) to improve debiasing in selection settings.

Citation

See more details in our paper, Language Model Probabilities are Not Calibrated in Numeric Contexts.

@inproceedings{lovering-etal-2025-language,
    title = "Language Model Probabilities are
             $Not$ Calibrated in Numeric Contexts",
    author = "Lovering, Charles and
      Krumdick, Michael and
      Lai, Viet Dac and
      Reddy, Varshini and
      Ebner, Seth and
      Kumar, Nilesh and
      Koncel-Kedziorski, Rik and
      Tanner, Chris",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.1417/",
    doi = "10.18653/v1/2025.acl-long.1417",
}
Charles Lovering © 2026