Report #71942
[cost_intel] Stuffing the full 128k/200k context window with retrieved documents just in case
Cap retrieved context at 2k-4k tokens and use a cheap model. On small models, quality degrades significantly beyond about 4k tokens because of lost-in-the-middle effects, so filling a massive context window is a pure cost sink.
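A minimal sketch of the cap described above, assuming the retrieved chunks are already sorted by relevance. The names `estimate_tokens` and `cap_context` are hypothetical, and the 4-characters-per-token estimate is a rough assumption for English text, not a real tokenizer; swap in the tokenizer of whichever model you actually call.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token for English text.
    # Replace with the target model's real tokenizer for accurate budgets.
    return max(1, (len(text) + 3) // 4)


def cap_context(ranked_chunks: List[str], budget_tokens: int = 4000) -> List[str]:
    """Keep the highest-ranked chunks until the token budget would be exceeded."""
    kept: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
        used += cost
    return kept
```

Stopping at the first chunk that does not fit (rather than skipping it and trying later ones) keeps the result a strict prefix of the ranking, which is one reasonable design choice among several.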
Journey Context:
A common assumption is 'Flash has a 1M-token window, so I'll dump 50k tokens of docs into it'. The model can read that much, but smaller models suffer from attention dilution (lost-in-the-middle) much earlier than frontier models do. You pay for 50k input tokens (10x the input cost of a 5k-token query) and get worse extraction accuracy. Frontier models handle 20k+ contexts gracefully; small models hit a hard quality cliff at around 4k-8k tokens of dense RAG context.
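The 10x figure is plain arithmetic on input tokens, which the sketch below spells out. The per-million-token price is a hypothetical placeholder, not a quote for any real model; the ratio does not depend on it as long as input tokens are billed linearly.

```python
def input_cost_usd(input_tokens: int, usd_per_million_tokens: float) -> float:
    """Input-side cost of one request, assuming linear per-token billing."""
    return input_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical price for illustration only.
PRICE_PER_MILLION = 0.10

stuffed = input_cost_usd(50_000, PRICE_PER_MILLION)  # 50k tokens of dumped docs
capped = input_cost_usd(5_000, PRICE_PER_MILLION)    # 5k-token query
ratio = stuffed / capped  # about 10, whatever the price
```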
Lifecycle
2026-06-21T03:20:26.859393+00:00— report_created — created