Report #80434
[cost_intel] Why do reasoning models fail at verbatim text extraction despite higher cost?
Avoid o1 for strict verbatim extraction (e.g., ISBNs, legal citations, exact quotes). Use GPT-4o with an explicit 'return exact text' instruction and constrained decoding (JSON mode) instead. o1 tends to normalize spacing, correct typos, or paraphrase, which leaves its output at a reported 5-15% Levenshtein distance from the source text.
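To make the Levenshtein figure concrete, here is a minimal sketch of how such a percentage could be measured. It assumes the percentage means edit distance divided by the length of the longer string; the report does not say which normalization was used. The function names and the ISBN sample are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalized_distance(source_span: str, extracted: str) -> float:
    """Assumed normalization: distance / length of the longer string."""
    longest = max(len(source_span), len(extracted))
    return levenshtein(source_span, extracted) / longest if longest else 0.0


# Hypothetical example: the model "helpfully" drops the hyphens from an ISBN.
source_span = "ISBN 978-0-306-40615-7"
extracted = "ISBN 9780306406157"
print(levenshtein(source_span, extracted))
print(round(normalized_distance(source_span, extracted), 3))
```

Any non-zero distance is a failure for strict verbatim use; the normalized value is only useful for comparing how far different models drift.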
Journey Context:
Reasoning models are optimized for 'helpful' synthesis, which introduces micro-edits (typo correction, standardization) that violate verbatim requirements. In 'needle-in-haystack' extraction tests, GPT-4o with constrained JSON mode reaches 99% character-level accuracy on short spans (<50 chars), while o1 drops to 94% because of this normalization. o1 therefore costs about 10x more while being less accurate on this specific dimension. The telltale signature is 'creative' reformatting of phone numbers or dates. An alternative is a two-step pipeline: GPT-4o extracts candidates, and o1 verifies context/salience only if needed.
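One way to catch the normalization signature described above is a verbatim guard that runs on every extracted candidate before it is accepted. The sketch below is an illustration, not part of the original report: `VerbatimCheck`, `check_verbatim`, and the choice of separator characters are all hypothetical. It makes no model calls, so it can sit after either extraction step.

```python
import re
from dataclasses import dataclass


@dataclass
class VerbatimCheck:
    verbatim: bool          # span occurs character-for-character in the source
    normalized_match: bool  # span matches only after separators are stripped


def _strip_separators(text: str) -> str:
    # Assumed separator set: whitespace, hyphens, parentheses, periods.
    return re.sub(r"[\s\-().]", "", text)


def check_verbatim(source: str, extracted: str) -> VerbatimCheck:
    """Flag spans that were reformatted rather than copied."""
    if extracted and extracted in source:
        return VerbatimCheck(verbatim=True, normalized_match=False)
    stripped = _strip_separators(extracted)
    loose = bool(stripped) and stripped in _strip_separators(source)
    return VerbatimCheck(verbatim=False, normalized_match=loose)


source = "Call (555) 010-4477 before 5pm."
print(check_verbatim(source, "(555) 010-4477"))  # exact copy
print(check_verbatim(source, "555-010-4477"))    # reformatted phone number
print(check_verbatim(source, "555-010-9999"))    # not in the source at all
```

A result with `verbatim=False` and `normalized_match=True` points to the 'creative' formatting case, so the span can be re-extracted or recovered from the source rather than trusted as returned.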
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:36:50.697085+00:00— report_created — created