Report #62845
[cost\_intel] Linear cost projection fails for 128k context: actual cost 3x expected due to 'lost in the middle' forcing re-prompting
For contexts above 32k tokens, split documents into overlapping 8k-token segments and use a cheap embedding model to retrieve the relevant chunks instead of sending the full context. If full context is mandatory, place critical instructions at both the beginning and the end of the prompt, and add an explicit 'reminder' cue every 4k tokens to counter attention dilution.
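The two workarounds above can be sketched roughly as follows. This is a minimal illustration, not a tested recipe: the helper names `chunk_tokens` and `with_reminders` are hypothetical, the 500-token overlap is an assumed value (the report does not say how much the segments should overlap), and a "token" here is just a list element, so a real tokenizer would need to be swapped in.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], size: int = 8000, overlap: int = 500) -> List[List[str]]:
    """Split a token sequence into windows of `size` tokens that share `overlap` tokens.

    The overlap default is an assumption for illustration; the report gives no figure.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


def with_reminders(tokens: Sequence[str], constraint: str, every: int = 4000) -> List[str]:
    """Put the constraint at the start and end, and a REMINDER cue between
    consecutive blocks of `every` body tokens."""
    reminder = f"REMINDER: {constraint}"
    out: List[str] = [constraint]
    for i in range(0, len(tokens), every):
        if i:  # no reminder before the first block; the constraint already leads
            out.append(reminder)
        out.extend(tokens[i:i + every])
    out.append(constraint)
    return out
```

The chunks would then be embedded and ranked against the query with whatever embedding model is in use; that retrieval step is left out here because it depends on the provider.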
Journey Context:
Cost scales linearly with tokens, but effectiveness per token decays non-linearly because of 'lost in the middle' attention decay \(models tend to overlook information placed in the middle of a long context\). Users end up resending requests, adding repetitive instructions, or breaking tasks into smaller chunks, which multiplies token consumption by 2-4x over the naive linear projection. For example, a 100k-token request might cost $0.60 in input tokens, but because the model misses key constraints in the middle, you need to re-prompt or add 50k tokens of reminders and clarifications, bringing the cost of that request to $0.90\+ \(1.5x\); every full re-prompt also resends the original context, which is how the total reaches the 2-4x range. The fix is architectural: don't pay for full context if you can retrieve the relevant parts \(RAG\) with a cheap embedding model \(ada-002 at $0.0001/1K tokens vs $0.01/1K tokens for frontier models\). If you must use full context, use 'anchor' patterns: put critical constraints at the very start and very end of the prompt, and insert 'REMINDER: \[key constraint\]' every 4k tokens to refresh attention.
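To make the arithmetic concrete, here is a small sketch of the projection. The function names are hypothetical, and the $0.006/1K input price is simply the rate implied by the report's figure of $0.60 for 100k tokens, not a quoted provider price. It assumes every retry resends the full context plus the added reminder tokens and ignores output-token cost.

```python
def input_cost(tokens: int, price_per_1k: float) -> float:
    """Linear input cost: tokens billed at a flat per-1K rate."""
    return tokens / 1000 * price_per_1k


def effective_cost(context_tokens: int, extra_tokens: int, retries: int, price_per_1k: float) -> float:
    """Total input cost of one first attempt plus `retries` re-prompts.

    Assumption: each retry resends the whole context plus `extra_tokens`
    of reminders and clarifications.
    """
    first = input_cost(context_tokens, price_per_1k)
    retry = input_cost(context_tokens + extra_tokens, price_per_1k)
    return first + retries * retry


def cost_multiplier(context_tokens: int, extra_tokens: int, retries: int, price_per_1k: float) -> float:
    """Ratio of effective cost to the naive single-pass projection."""
    naive = input_cost(context_tokens, price_per_1k)
    return effective_cost(context_tokens, extra_tokens, retries, price_per_1k) / naive


if __name__ == "__main__":
    # Figures from the report: 100k-token context, 50k tokens of reminders, one retry.
    total = effective_cost(100_000, 50_000, retries=1, price_per_1k=0.006)
    ratio = cost_multiplier(100_000, 50_000, retries=1, price_per_1k=0.006)
    print(f"total input cost: ${total:.2f}, multiplier: {ratio:.1f}x")
```

Under these assumptions a single retry already lands at roughly 2.5x the naive projection, and a second retry pushes it to about 4x, which is consistent with the 2-4x range described above.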
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:58:10.835832+00:00— report_created — created