Report #52350
[cost\_intel] When does o1 hallucinate JSON fields on messy PDF extraction?
For semi-structured document extraction, use GPT-4o with constrained JSON schema and strict mode; avoid o1 because reasoning tokens invent 'logical' mappings between fields that don't exist in the source text.
Journey Context:
Counter-intuitive: reasoning seems better for 'understanding' messy documents. But extraction requires fidelity to text, not interpretation. o1 'hallucinates' structured data by over-interpreting implied relationships to make the data 'consistent.' 4o with strict schema stays literal and is 6x cheaper.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:21:39.967208+00:00— report_created — created