Imagine you reach into a bag of marbles containing 98 blue marbles and 99 red marbles.
If you grab one without looking, what are the chances it's red versus blue?
Without pulling out a pen and paper, most people would say about an even chance.
When we ask gpt-4o, it puts 99.7% of its probability mass on red. The prompt:
From 98 blue marbles and 99 red marbles, Tommy reached blindly into a bag and grabbed a marble with the color [blue/red]
[Figure: what gpt-4o actually produces.]
This is not necessarily bad or surprising in isolation. It is possible that, in aggregate, the model would produce roughly balanced results over many prompt variations. But we found that a number of different models have stable preferences for specific colors and orderings. Word order and identity, which should be arbitrary factors, have a real and systematic impact on model outputs.
In our recent paper, we studied how language models answer and assign probabilities in basic scenarios like these. And we are not the only ones—since our work came out, a number of other groups have asked similar questions. See related work by Balepur et al., Gao et al., Chen et al., and Zhao et al.
Patterns in model behavior
The figure below is involved but reveals a great deal. Each cell corresponds to 100
problems like the example above, with 50 cases where the first option has a higher
value and 50 where the second does. gpt-4o-mini exhibits a stable pattern across the diagonal: the model's behavior depends strongly
on the order of the keywords. For one ordering, say white/purple, the model
behaves in one way and switches to an entirely different behavior profile for the
reverse ordering. We see similar patterns for other models.
See our paper for more detail.
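As a rough illustration of how one cell of such a grid could be assembled, the sketch below builds 100 problems for a single color ordering, half with the first-listed color having the higher count. The template follows the marble prompt quoted earlier; the count range, the `Problem` class, and `make_cell` are assumptions made for this example, not the paper's actual generation code.

```python
import random
from dataclasses import dataclass

# Prompt template modeled on the marble example; the model is expected
# to continue with one of the two color words.
TEMPLATE = (
    "From {n1} {c1} marbles and {n2} {c2} marbles, Tommy reached blindly "
    "into a bag and grabbed a marble with the color"
)


@dataclass
class Problem:
    prompt: str
    ideal_first: float  # ideal probability of the first-listed color


def make_cell(c1: str, c2: str, n_problems: int = 100,
              low: int = 1, high: int = 100, seed: int = 0) -> list:
    """Build problems for one (first color, second color) ordering.

    The first half of the problems give the first-listed color the higher
    count, the second half give it the lower count. The count range
    [low, high] is a hypothetical choice.
    """
    rng = random.Random(seed)
    problems = []
    for i in range(n_problems):
        a, b = rng.sample(range(low, high + 1), 2)  # two distinct counts
        larger, smaller = max(a, b), min(a, b)
        n1, n2 = (larger, smaller) if i < n_problems // 2 else (smaller, larger)
        prompt = TEMPLATE.format(n1=n1, c1=c1, n2=n2, c2=c2)
        # A calibrated answer is proportional to the counts.
        problems.append(Problem(prompt, n1 / (n1 + n2)))
    return problems


cell = make_cell("white", "purple")
reverse_cell = make_cell("purple", "white")  # the mirrored cell across the diagonal
```

Comparing a model's behavior on `cell` against `reverse_cell` is the kind of check that exposes the ordering effect described above.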
Measuring calibration
We measure the distance between the calibrated ideal and the model outputs using Wasserstein Distance (WD). Wasserstein Distance captures how much "shifting" between two probability distributions is needed for them to match. A WD of 0 means the distributions are identical. See Lilian Weng's explainer for a nice introduction. To provide context, we compare against several baseline strategies: Pick Higher places all probability mass on the option with the higher value; Pick Lower does the opposite; Pick First/Second ignores values entirely; and Pick Random randomly assigns probability.
| Baseline | Calibration (WD) [↓] |
|---|---|
| Pick Higher | 0.47 |
| Pick Lower | 0.95 |
| Pick First / Second | 0.71 |
| Pick Random | 0.27 |
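To make the metric concrete, here is a minimal sketch of a one-dimensional Wasserstein distance applied to the single marble example. It assumes the options are placed one unit apart, and it scores one example at a time; the paper's aggregation over many examples may differ, so these per-example values are not expected to reproduce the table entries. The function name `wasserstein_1d` is our own.

```python
from typing import Sequence


def wasserstein_1d(p: Sequence[float], q: Sequence[float]) -> float:
    """Wasserstein-1 distance between two distributions over the same
    ordered options, with adjacent options one unit apart (an assumption)."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same options")
    distance = 0.0
    cdf_gap = 0.0
    # In one dimension, W1 is the summed absolute gap between the two CDFs.
    for p_i, q_i in zip(p[:-1], q[:-1]):
        cdf_gap += p_i - q_i
        distance += abs(cdf_gap)
    return distance


# Options are ordered (blue, red) for the 98-blue / 99-red example.
ideal = (98 / 197, 99 / 197)
model = (0.003, 0.997)      # gpt-4o: 99.7% of its mass on red
pick_higher = (0.0, 1.0)    # all mass on red, the larger count
pick_lower = (1.0, 0.0)     # all mass on blue, the smaller count

for name, dist in [("model", model),
                   ("pick_higher", pick_higher),
                   ("pick_lower", pick_lower)]:
    print(f"{name}: {wasserstein_1d(ideal, dist):.3f}")
```

With only two options the distance reduces to the absolute difference in the probability of either option, which is why the near-deterministic model output lands almost exactly where Pick Higher does on this example.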

| Model | Calibration (WD) [↓] |
|---|---|
| Mistral 7B v0.3 | 0.48 |
| Yi 1.5 | 0.49 |
| Llama 3.1 8B | 0.40 |
| gemma 2 9b | 0.50 |
| gpt-4o-mini | 0.42 |
| gpt-4o | 0.40 |
The results are striking: all models are poorly calibrated. None is more calibrated than randomly assigning probability mass, and only half outperform the Pick Higher baseline. We also study relative entropy and find that models tend to produce outputs that are too confident—far too low in entropy. Much of this mode collapse occurs after instruction tuning, though instruction tuning does have the benefit of leading models to at least choose valid words (like red or blue).
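The "too confident" observation can be illustrated with plain Shannon entropy on the marble example. This is only a sketch: the exact relative-entropy measure used in the paper is not spelled out here, so the code simply compares the entropy of the ideal distribution with that of the model's output.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


ideal_entropy = entropy_bits((98 / 197, 99 / 197))  # close to 1 bit
model_entropy = entropy_bits((0.003, 0.997))        # a small fraction of a bit

print(f"ideal: {ideal_entropy:.3f} bits, model: {model_entropy:.3f} bits")
```

A nearly even split carries almost a full bit of uncertainty, while putting 99.7% of the mass on one color leaves only a few hundredths of a bit.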
Testing newer models
These results were established on models available in early 2024.
Our paper was published at ACL 2025. We now check whether these results hold
up on models released afterward—including gpt-4.1 and the gpt-5 series. Many of the latest models (most of the gpt-5-* and o* series)
don't expose a logprobs endpoint, in part because they are reasoning models
that hide thinking tokens. For gpt-5.1 and gpt-5.2,
logprobs are available when reasoning effort is set to 'none'.
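Once log probabilities for the next token are available, they still need to be turned into a distribution over the two options. The sketch below shows one way to do that offline. The logprob values are made up for illustration, `option_probs` is a hypothetical helper, and renormalizing over only the valid options is an assumption rather than a description of the paper's pipeline.

```python
import math


def option_probs(top_logprobs: dict, options: list) -> dict:
    """Convert next-token log probabilities into a distribution over the
    given options, renormalizing over the options that were returned."""
    raw = {opt: math.exp(top_logprobs[opt]) for opt in options if opt in top_logprobs}
    total = sum(raw.values())
    if total == 0:
        raise ValueError("none of the options appear in the returned logprobs")
    # Options missing from the returned tokens get probability 0.
    return {opt: raw.get(opt, 0.0) / total for opt in options}


# Made-up logprobs for the token following the marble prompt.
example_logprobs = {"red": -0.003, "blue": -5.81, "green": -9.2}
probs = option_probs(example_logprobs, ["blue", "red"])
print(probs)
```

In practice, tokenizers may return the color with a leading space or split it into several tokens, so the keys would need to be matched accordingly.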
Newer models are not obviously better. When we run 10,000 new examples per model, we see different but stable behavioral signatures. Even with a fixed seed, individual results tend to change across runs, though the overall patterns remain relatively stable. gpt-5.2 always picks the higher value; gpt-4.1-mini almost always picks the first item; gpt-4o-mini almost always picks the first item but also exhibits color-ordering effects. gpt-4o remains far from perfect calibration.
The aggregate calibration error tells part of the story, but the per-color-pair heatmaps below reveal the specific behavioral patterns behind each model's score.
[Figure: per-color-pair heatmaps for gpt-5.1, gpt-5.2, gpt-4.1, and gpt-4.1-mini. Axis label: Second Color Listed.]
We can also examine calibration at the individual example level. In the scatter plots below, the x-axis represents the ideal (true) probability and the y-axis represents the model's predicted probability. Points falling on the diagonal line are perfectly calibrated. The shaded band around the diagonal marks approximately calibrated predictions. The top-right and bottom-left quadrants (green) indicate the model gets the direction right—it assigns higher probability to the more likely option. The top-left and bottom-right quadrants (red) indicate the model gets it wrong.
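The reading of a single scatter-plot point described above can be written as a small helper. This is a sketch under assumptions: the width of the shaded band is not stated, so `band=0.1` is a hypothetical tolerance, and `read_point` is our own name.

```python
def read_point(ideal: float, predicted: float, band: float = 0.1) -> dict:
    """Classify one (ideal, predicted) point for the first-listed option.

    near_diagonal: the prediction is within `band` of the ideal value.
    direction_right: both values fall on the same side of 0.5, i.e. the
    point sits in the top-right or bottom-left quadrant. Points exactly
    at 0.5 on either axis lie on a boundary and count as not right here.
    """
    return {
        "near_diagonal": abs(predicted - ideal) <= band,
        "direction_right": (ideal - 0.5) * (predicted - 0.5) > 0,
    }


# The marble example, scored for "red": right direction, far from the diagonal.
marble = read_point(ideal=99 / 197, predicted=0.997)
# A prediction that favors the less likely option.
flipped = read_point(ideal=0.8, predicted=0.2)
# A prediction close to the ideal value.
close = read_point(ideal=0.7, predicted=0.75)
```

The marble point shows why the two checks are separate: a model can favor the more likely option every time and still be far from calibrated.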
None of the models exhibits truly calibrated behavior along the diagonal. Notably, there is a large difference in behavior between gpt-5.1 and gpt-5.2: the latter is always directionally correct, which in other words means it exhibits mode collapse. gpt-4.1-mini collapses to a different strategy: always picking the first option listed. (This was also discernible from the heatmaps above.) We take these results to suggest that, for now, our findings continue to replicate.
So what?
Language models exhibit strong biases and systematic patterns—even over basic heuristics—when faced with probabilistic choices.
Becoming more helpful for day-to-day tasks does not appear to solve these more fundamental problems. Models that are excellent at coding and reasoning still fail at proportional probability assignment.
We recommend caution when using models to make decisions in such environments. Both benchmarks and user testimony point to strong reasoning and coding abilities, but there appear to be gaps when it comes to calibrated uncertainty.
Open questions and related work
Several directions are worth watching.
Complex scenarios. How far does this extend beyond simple two-option settings? There is some evidence that pointwise bias of the kind we observe here does not necessarily transfer to longer generation scenarios. See Ruted Evaluation.
Verbalized Sampling. Verbalized Sampling is a recently proposed method that has models generate multiple outputs along with verbalized probabilities, and demonstrates strong results.
Debiasing approaches. Li (2025) takes a different approach using a Non-parametric Order-Preserving Algorithm (NOA) to improve debiasing over selection settings.
Citation
See more details in our paper, Language Model Probabilities are Not Calibrated in Numeric Contexts.
@inproceedings{lovering-etal-2025-language,
title = "Language Model Probabilities are
$Not$ Calibrated in Numeric Contexts",
author = "Lovering, Charles and
Krumdick, Michael and
Lai, Viet Dac and
Reddy, Varshini and
Ebner, Seth and
Kumar, Nilesh and
Koncel-Kedziorski, Rik and
Tanner, Chris",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.1417/",
doi = "10.18653/v1/2025.acl-long.1417",
}