On Finding Inconsistencies in Documents

In our paper, On Finding Inconsistencies in Documents, we benchmark how well current language models find inconsistencies within documents. This is a large problem space, so we focused on financial documents, along with some case studies on computer science papers. With the help of financial experts, we created and released a dataset, FIND, with 500 examples (test and validation). Each example consists of a technical document and an identified inconsistency. The model's task is to find the inconsistency.

The FIND task. A document with an inconsistency is passed to a language model, which must identify the evidence and describe the issue.

Our findings

Despite the documents being long, technical, and complex, the best-performing model (gpt-5) recovered 64% of the inserted inconsistencies. Surprisingly, gpt-5 also found previously undiscovered inconsistencies in the original documents. For example, on 50 arXiv papers, we judged 136 out of 196 of the model's suggestions (about 69%) to be legitimate inconsistencies missed by the original authors. Even so, the best model still misses more than a third of the inserted inconsistencies in FIND, which shows that inconsistency detection remains a challenging task. These results rely on language models as judges, but our testing suggested that the judges were reliable at the verification task, reaching a Cohen's kappa above 0.9 with a human judge.

More generally, inconsistency detection appears to be a rare task where even far-from-perfect performance can significantly help users, depending on the importance of the document. Most found inconsistencies are easy to verify; they are just difficult or onerous for people to find on their own.
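The judge-agreement statistic mentioned above, Cohen's kappa, corrects raw agreement for the agreement two raters would reach by chance. The sketch below is only an illustration of the standard formula, not the evaluation code from the paper; the function name `cohens_kappa` and the two verdict lists are hypothetical and chosen purely to show the calculation.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical verdicts (1 = "prediction matches the target", 0 = "no match").
# These are made-up values, not data from the paper.
llm_judge = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
human_judge = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1]

kappa = cohens_kappa(llm_judge, human_judge)
print(f"kappa = {kappa:.2f}")
```

In this toy example the raters agree on 9 of 10 items, but kappa comes out lower than 0.9 because half of that agreement would be expected by chance. That is why a kappa above 0.9 is a stronger statement than 90% raw agreement.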

Our results

Average task score (recall) across all datasets. Scores represent the rate at which the model found the target inconsistency within the document. gpt-5 and gemini-2.5-pro do best overall.

Scores are presented as percentages (out of 100), representing the rate at which the model found the target inconsistency within the document. The top models score about 60%, and scores drop for the document sources that tend to be longer or more complicated. BLS contains reports from the Bureau of Labor Statistics, PRE contains presale reports on bonds, SEC contains 10-Qs, EMM contains reporting on US municipal securities, and PG contains nonfiction books. The last column averages across all datasets.

Are models useful?

High recall is only part of the story. If a model finds the inconsistency but also flags dozens of false positives, it's not useful in practice. We manually evaluated gpt-5's predictions on a subset of 25 documents per source.

Precision and usefulness of gpt-5 predictions across datasets. Overall, 53% of predictions exactly match the inserted inconsistency (precision), and 67% are judged useful by domain experts (usefulness). The model averages 2.6 findings per document. Precision counts only exact matches to the inserted inconsistency. Usefulness is broader: it counts any suggestion that a domain expert judged to be a real or helpful finding, including issues present in the original document.

The gap between precision and usefulness is telling: models frequently identify real issues in the original documents that were not part of the benchmark. This suggests that language models can serve as a genuine auditing tool, not just a benchmark solver.
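To make the difference between the two metrics concrete, here is a minimal sketch of how they could be computed from per-prediction annotations. The `Prediction` record, its field names, and the sample annotations are all hypothetical; the sketch also assumes that an exact match to the inserted inconsistency is always counted as useful, which the text implies but does not state outright.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    matches_inserted: bool  # exact match to the inserted inconsistency
    judged_useful: bool     # expert judged it a real or helpful finding


def precision_and_usefulness(predictions):
    """Return (precision, usefulness) as fractions of all predictions."""
    if not predictions:
        raise ValueError("no predictions to score")
    n = len(predictions)
    precision = sum(p.matches_inserted for p in predictions) / n
    usefulness = sum(p.judged_useful for p in predictions) / n
    return precision, usefulness


# Hypothetical annotations for six predictions (not data from the paper).
sample = [
    Prediction(matches_inserted=True, judged_useful=True),
    Prediction(matches_inserted=True, judged_useful=True),
    Prediction(matches_inserted=True, judged_useful=True),
    Prediction(matches_inserted=False, judged_useful=True),   # real issue in the original
    Prediction(matches_inserted=False, judged_useful=False),  # false positive
    Prediction(matches_inserted=False, judged_useful=False),  # false positive
]

precision, usefulness = precision_and_usefulness(sample)
print(f"precision = {precision:.0%}, usefulness = {usefulness:.0%}")
```

Under these assumptions usefulness can never fall below precision, and the gap between them is the share of predictions that flag a genuine issue outside the benchmark.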

Takeaways & case study on our own paper

We found the results (surprisingly) compelling, so the next step was to see how well models work on a real-world use case, and our own write-up of these results was readily available. Models found five inconsistencies in late drafts of our paper. We document these directly below, and we think they show how effective these models are, how they can be useful, and perhaps the limits of our findings.

Notably, even after multiple runs and multiple models, we manually found one additional inconsistency the models missed. (In the appendix of our work, a table had misordered columns that clashed with the aggregated results in the main body.) This speaks to the two-sided nature of our results. These models are effective tools, and probably worth trying on your documents of choice, but as both the results on our benchmark and our direct experience show, models still make mistakes.

Lastly, most of the errors found so far are not game-changing. Both in our own work and in others' work, the inconsistencies we found were typically small. Most authors would likely be happy to catch such errors, but the errors did not significantly change the conclusions of the works.

Examples of inconsistencies in late drafts of our work

Compute time math error (found by gemini-2.5-pro)

Evidence

gemini-2.5-flash & 6h & 2.1m

The times above are reported for the 420 items

Description

In Table A.1, the total compute time for gemini-2.5-flash is listed as "6h" and the per-document time is "2.1m". These two figures are contradictory for the stated 420 documents. 6 hours is 360 minutes, which averages to 0.86 minutes per document, not 2.1. Conversely, 2.1 minutes per document totals 882 minutes (14.7 hours), not 6 hours.
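The arithmetic in the description above is easy to reproduce. The following sketch is only a check on the quoted figures; the function name `check_time_consistency` and the 10% tolerance are assumptions made for illustration.

```python
def check_time_consistency(total_hours, minutes_per_doc, n_docs, rel_tol=0.1):
    """Compare a reported total time against a reported per-document time.

    Returns (consistent, implied_minutes_per_doc, implied_total_hours).
    rel_tol is an arbitrary tolerance that allows for rounding in the table.
    """
    implied_minutes_per_doc = total_hours * 60 / n_docs
    implied_total_hours = minutes_per_doc * n_docs / 60
    consistent = (
        abs(implied_minutes_per_doc - minutes_per_doc) <= rel_tol * minutes_per_doc
    )
    return consistent, implied_minutes_per_doc, implied_total_hours


# Figures quoted in the evidence: 6h total, 2.1m per document, 420 documents.
consistent, per_doc, total = check_time_consistency(6, 2.1, 420)
print(f"consistent: {consistent}")
print(f"6h over 420 documents is {per_doc:.2f} minutes per document")
print(f"2.1 minutes per document over 420 documents is {total:.1f} hours")
```

Both directions of the check disagree with the table by more than a factor of two, so rounding alone cannot explain the mismatch.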

Test set size mismatch (found by gpt-5)

Evidence

containing 375 test and 125 development problems.

In the test set 83 out of 420 documents (at most, depending on the model and tokenizer) had their text truncated.

Description

The document defines the test set as 375 items, but later refers to the "test set" as having 420 documents when discussing truncation, creating a contradiction about test set size.

The 420 value includes the WLD data.

Model name inconsistency (found by gpt-5)

Evidence

gpt-5-mini

\model{gpt-5-nano}

Description

The Methods list "gpt-5-nano" as a tested model, whereas results tables and later sections report "gpt-5-mini", creating an inconsistency in the model variant named as evaluated.

A real oversight in the paper draft.

URL typo (found by gpt-5)

Evidence

\footnote{\url{https://emma.msrb.org/}}

\url{EMMA.msrp.org}

Description

The EMM source is given as https://emma.msrb.org/ in the text, but later the figure caption uses EMMA.msrp.org, changing msrb to msrp. This mismatch in the domain name is an inconsistency in the referenced source.

Found after the first round of fixes.
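A mismatch like this one is mechanical enough to check with a few lines of code. The sketch below compares the hostnames of the two quoted URLs using Python's standard `urllib.parse`; the helper name `hostname` is hypothetical, and prepending `https://` to a scheme-less URL is an assumption made so that the parser recognizes the host.

```python
from urllib.parse import urlparse


def hostname(url):
    """Return the lower-cased hostname of a URL, with or without a scheme."""
    # urlparse only fills in the host when a scheme separator is present,
    # so a bare "EMMA.msrp.org" would otherwise be treated as a path.
    if "://" not in url:
        url = "https://" + url
    return (urlparse(url).hostname or "").lower()


footnote_host = hostname("https://emma.msrb.org/")
caption_host = hostname("EMMA.msrp.org")

print(footnote_host, caption_host, footnote_host == caption_host)
```

Normalizing case first matters here: the capitalization difference between the two URLs is harmless, while the one-letter difference in the domain is the actual error.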

Recall range mismatch (found by gemini-2.5-pro)

Evidence

the highest recall scores range from 7 to 11 percent,

sonnet-v4        | 12 | 33 | 2
gpt-5-mini       | 10 | 35 | 2
gpt-5            | 12 | 36 | 7
o3-mini          |  0 | 24 | 9
o3               |  9 | 33 | 9
gemini-2.5-flash | 12 | 34 | 9
gemini-2.5-pro   | 16 | 38 | 9

Description

Section 7.2 states that for the MFR dataset, "the highest recall scores range from 7 to 11 percent". However, the data in Table A.7 shows that the highest recall score achieved by any model is 9%, making the stated range incorrect.

Issue arose due to an out-of-date table in the appendix.
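The check described above amounts to comparing a range stated in prose against the maximum of a table column. Here is a minimal sketch of that comparison using the quoted evidence. The helper names are hypothetical, and because the quoted rows carry no column headers, treating the last column as the MFR recall is an assumption (it is the only column whose maximum is 9).

```python
import re

claim = "the highest recall scores range from 7 to 11 percent"

# Rows as quoted in the evidence; the columns are unlabeled there.
table = """\
sonnet-v4        | 12 | 33 | 2
gpt-5-mini       | 10 | 35 | 2
gpt-5            | 12 | 36 | 7
o3-mini          |  0 | 24 | 9
o3               |  9 | 33 | 9
gemini-2.5-flash | 12 | 34 | 9
gemini-2.5-pro   | 16 | 38 | 9
"""


def column_values(table_text, column):
    """Return one numeric column (0-based, counted after the model name)."""
    values = []
    for line in table_text.strip().splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        values.append(int(cells[1 + column]))
    return values


def claimed_range(text):
    """Extract (low, high) from a phrase like 'range from 7 to 11'."""
    match = re.search(r"range from (\d+) to (\d+)", text)
    if match is None:
        raise ValueError("no range found in claim")
    return int(match.group(1)), int(match.group(2))


low, high = claimed_range(claim)
scores = column_values(table, column=2)  # assumption: last column is MFR recall
table_max = max(scores)
reaches_claimed_high = table_max >= high

print(f"claimed range: {low} to {high}, table maximum: {table_max}")
print(f"table reaches the claimed upper bound: {reaches_claimed_high}")
```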

Undefined abbreviation (found by gemini-2.5-pro)

Evidence

PG  & 75 & 109±106.6 & 4±2.6 & 8±6.0 & 22±24.1
PGS & 25 & 146±111.8 & 5±3.7 & 7±4.8 & 22±17.9

Description

The paper uses the abbreviation 'PG' for "Project Gutenberg" documents in the test set tables. However, in the development set tables, it uses 'PGS' for what appears to be the same source. The dataset statistics for 'PG' (avg. 109k tokens) and 'PGS' (avg. 146k tokens) are different, but the 'PGS' abbreviation is never defined, creating ambiguity.

PGS stands for "Project Gutenberg Seen", as the PG development data may be leaked to models.

Citation

@misc{lovering2025findinginconsistenciesdocuments,
      title={On Finding Inconsistencies in Documents},
      author={Charles J. Lovering and Seth Ebner and Brandon Smock
              and Michael Krumdick and Saad Rabbani and Ahmed Muhammad
              and Varshini Reddy and Chris Tanner},
      year={2025},
      eprint={2512.18601},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.18601},
}