Report #83870
[cost_intel] Cheap models fail at 'needle in a haystack' reasoning with conflicts
Use o1/o3 for RAG contexts over 20k tokens that contain contradictory sources; use 4o for simple retrieval under 10k tokens. The cost is 10x, but the error rate on conflict resolution drops by 60%.
Journey Context:
Instruct models suffer from 'lost in the middle' attention decay: accuracy drops to 60% on facts in the middle of 32k-token contexts. When conflicting information is present (e.g., outdated vs. current docs), 4o often picks whichever source it saw first, or picks at random. Reasoning models actively weigh the evidence and resolve conflicts with 85% accuracy. The cliff appears when context length reaches 20k tokens and the sources are ambiguous.
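The routing rule in this report can be sketched as a small function. This is a minimal illustration, not a tested production router: the function name `choose_model`, the constants, and the model identifier strings are all hypothetical placeholders. The report gives no guidance for contexts between 10k and 20k tokens, or for long contexts without conflicts, so falling back to the cheaper model in those cases is an assumption.

```python
# Hypothetical model identifiers; substitute whatever names your provider uses.
REASONING_MODEL = "o3"
CHEAP_MODEL = "gpt-4o"

# Threshold taken from the report: the accuracy cliff is described at
# roughly 20k tokens when conflicting sources are present.
CONFLICT_TOKEN_THRESHOLD = 20_000


def choose_model(context_tokens: int, has_conflicting_sources: bool) -> str:
    """Pick a model tier from context size and whether sources conflict.

    Only the case "more than 20k tokens AND conflicting sources" is routed
    to the reasoning model, as the report recommends. Every other case falls
    back to the cheap model, which is an assumption for the ranges the
    report does not cover.
    """
    if context_tokens > CONFLICT_TOKEN_THRESHOLD and has_conflicting_sources:
        return REASONING_MODEL
    return CHEAP_MODEL


if __name__ == "__main__":
    # Long context with outdated and current docs mixed together.
    print(choose_model(32_000, has_conflicting_sources=True))
    # Short, simple retrieval.
    print(choose_model(8_000, has_conflicting_sources=False))
```

How `has_conflicting_sources` gets set is left open here; the report does not describe a detection method, so that flag would have to come from your own pipeline (for example, from document version metadata).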
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:21:49.588238+00:00— report_created — created