Report #90490
[cost\_intel] 100k\+ context windows requiring 2x token duplication to maintain attention on critical instructions due to 'lost in the middle' decay
Apply 'instruction wrapping': place critical constraints at both the absolute start \(system prompt\) and absolute end \(after user content\) of the context; for documents >50k tokens, use RAG rather than full context to avoid the duplication penalty.
Journey Context:
Research \(Liu et al. 2023\) demonstrates that LLMs suffer from 'lost in the middle' attention decay: information in the middle of long contexts \(50k-200k tokens\) is retrieved significantly worse than information at the beginning or end. When using 100k\+ context windows \(Claude 3.5, GPT-4 Turbo, Gemini 1.5\), critical instructions placed only at the beginning are often ignored or forgotten by the time the model processes the end of a long document. The common 'fix' is to duplicate the critical instruction at the end of the prompt \(e.g., after the long document\), effectively doubling the token count for that content. For a 100k document with 2k of critical instructions, this adds 2k tokens \($0.006-0.01\) per request. The trap is assuming that 'long context' means 'the model sees everything equally.' The alternative of using RAG instead of full context avoids this but adds embedding latency and retrieval complexity. The correct pattern for long-context use is 'sandwiching': system prompt with constraints, user content, then repetition of constraints. For documents >50k tokens, the cost of duplication often exceeds the cost of RAG setup, making full-context a false economy for anything but single-shot Q&A where duplication isn't needed.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:28:57.100141+00:00— report_created — created