Report #59194
[cost_intel] Structured extraction from PDFs and forms: when do reasoning models hallucinate more despite higher 'IQ'?
For schema-following extraction (invoices, tax forms, structured medical records), use GPT-4o with JSON mode/constrained decoding; reserve reasoning models for 'interpretive extraction' requiring causal reasoning (e.g., 'determine if this contract clause creates a termination right' or 'extract implied obligations not explicitly stated').
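A minimal sketch of what a schema for constrained decoding might look like when every field is allowed to be null, so the model has a legal way to say "not present" instead of inventing a value. The field names (`invoice_number`, `invoice_date`, `total_amount`) and both helper functions are hypothetical and not taken from this report; how the schema is passed to a particular provider's JSON mode is left out and should be checked against that provider's documentation.

```python
import json

# Hypothetical invoice fields; names are illustrative, not from the report.
FIELDS = {
    "invoice_number": "string",
    "invoice_date": "string",
    "total_amount": "number",
}


def build_nullable_schema(fields):
    """Build a JSON Schema where every field is required but may be null."""
    return {
        "type": "object",
        "properties": {
            name: {"type": [json_type, "null"]}
            for name, json_type in fields.items()
        },
        "required": list(fields),
        "additionalProperties": False,
    }


def conforms(record, schema):
    """Minimal structural check: exact key set, each value typed or None."""
    type_map = {"string": str, "number": (int, float)}
    props = schema["properties"]
    if set(record) != set(props):
        return False
    for name, spec in props.items():
        value = record[name]
        if value is None:
            continue  # null is always an acceptable answer
        json_type = spec["type"][0]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, type_map[json_type]):
            return False
    return True


schema = build_nullable_schema(FIELDS)
print(json.dumps(schema, indent=2))
```

Making every field required but nullable is a design choice: the model must emit each key, and a missing value shows up as an explicit null rather than a silently dropped or guessed field.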
Journey Context:
o1 tends to 'connect dots' that aren't there, inventing values to satisfy implied intent rather than returning null. On the VRDU (Visually-Rich Document Understanding) dataset and DocVQA, GPT-4o with prompt chaining beats o1 on accuracy by 8% at 1/20th the cost. The error mode: when the document is messy (stained, skewed), reasoning models hallucinate field values by 'imagining' what should be there based on context, whereas instruct models stick closer to the literal text. Signal: if the task is 'read this field,' the cheap model wins; if it is 'analyze why this field matters,' the reasoning model wins.
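One way to catch the 'imagined value' failure described above is a grounding check: keep an extracted value only if it appears literally in the OCR text, and null it out otherwise. The sketch below is an assumption-laden illustration, not a method from this report; `normalize`, `ground_fields`, and the sample invoice text are all hypothetical.

```python
import re


def normalize(text):
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def ground_fields(extracted, source_text):
    """Keep a value only if its normalized form occurs in the source text.

    Values that cannot be found are replaced with None, so an invented
    field surfaces as a null instead of a plausible-looking string.
    """
    haystack = f" {normalize(source_text)} "
    grounded = {}
    for field, value in extracted.items():
        if value is None:
            grounded[field] = None
            continue
        needle = normalize(str(value))
        # Pad with spaces so matches must fall on token boundaries
        found = bool(needle) and f" {needle} " in haystack
        grounded[field] = value if found else None
    return grounded


# Hypothetical OCR output and model response, for illustration only.
ocr_text = "INVOICE No. INV-2041\nDate: 2024-03-05\nTotal due: 1,250.00 USD"
model_output = {
    "invoice_number": "INV-2041",
    "invoice_date": "2024-03-05",
    "po_number": "PO-7781",  # absent from the document: an invented value
    "total_amount": "1,250.00",
}

result = ground_fields(model_output, ocr_text)
print(result)
```

This check only works for 'read this field' tasks where the answer is a literal span. It will wrongly reject values the model has reformatted (for example a date rewritten into another format), and it does not apply to interpretive extraction, where the answer is not expected to appear verbatim.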
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:51:01.463137+00:00— report_created — created