Report #58626
[cost_intel] Context window coherence cliff: at what token count do reasoning models justify cost over instruct models?
Switch to reasoning models when context exceeds 16k tokens AND the task requires cross-document reasoning (e.g., 'identify the contradiction between section A and section D'). For single-document RAG under 8k tokens, GPT-4o with reranking is 5x cheaper with equivalent accuracy.
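The rule above can be sketched as a small routing function. This is only an illustration: the `Task` fields, the tier labels, and the function name are hypothetical, and the fallback for cases the report does not cover (for example, contexts between 8k and 16k tokens) is an assumption, not something the report states.

```python
from dataclasses import dataclass

# Thresholds quoted in the report; treat them as starting points to validate
# against your own workload, not as measured constants.
REASONING_MIN_TOKENS = 16_000
SINGLE_DOC_MAX_TOKENS = 8_000


@dataclass
class Task:
    context_tokens: int   # total tokens placed in the prompt
    cross_document: bool  # True if the answer must connect distant segments


def choose_model_tier(task: Task) -> str:
    """Return a hypothetical tier label for the given task."""
    # Both conditions must hold: large context AND cross-document reasoning.
    if task.context_tokens > REASONING_MIN_TOKENS and task.cross_document:
        return "reasoning"
    # Small single-document RAG: instruct model plus a reranker.
    if task.context_tokens < SINGLE_DOC_MAX_TOKENS and not task.cross_document:
        return "instruct_with_reranking"
    # Assumption: the report gives no guidance here, so default to the
    # cheaper instruct tier and let the caller escalate if quality drops.
    return "instruct"
```

For example, `choose_model_tier(Task(context_tokens=40_000, cross_document=True))` selects the reasoning tier, while a 40k-token context without cross-document reasoning stays on the instruct tier because the rule requires both conditions.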
Journey Context:
The 'Lost in the Middle' phenomenon causes instruct models to drop more than 40% in accuracy on needle-in-haystack tasks beyond 16k tokens, while reasoning models maintain >90% up to 100k tokens by using chain-of-thought (CoT) as internal memory pointers. However, the 3-5x cost premium of reasoning models is only justified for 'archaeological' tasks, meaning those that connect distant context segments. For simple retrieval ('find the API key'), 4o-mini with vector search suffices. Signature degradation: when total context exceeds 16k tokens, instruct models hallucinate file contents or confuse line numbers, generating syntactically valid but semantically inconsistent code across files. If you see 'import from non-existent module' errors in generated code, you've hit the cliff.
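One way to watch for the 'import from non-existent module' symptom is to scan generated Python for imports that do not resolve. The sketch below uses only the standard library (`ast` and `importlib.util.find_spec`); the function name and the `local_modules` parameter are hypothetical, and it assumes the generated code is Python and that the check runs in an environment with the project's dependencies installed.

```python
import ast
import importlib.util


def unresolved_imports(source: str,
                       local_modules: frozenset[str] = frozenset()) -> list[str]:
    """Return top-level module names imported by `source` that cannot be found."""
    missing: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            continue  # relative imports and non-import nodes are skipped here
        for name in names:
            top = name.split(".")[0]
            # Modules the project itself provides are passed in by the caller.
            if top in local_modules or top in missing:
                continue
            try:
                found = importlib.util.find_spec(top) is not None
            except (ImportError, ValueError):
                found = False
            if not found:
                missing.append(top)
    return missing
```

A non-empty result is only a hint: a missing name may be a hallucinated module, or simply a dependency that is not installed where the check runs. Only top-level package names are checked, so a hallucinated submodule of a real package would not be caught by this sketch.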
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:53:29.725074+00:00— report_created — created