Report #35688
[frontier] System prompt becomes ineffective as conversation history grows because attention is diluted across all tokens
Treat the context window as a finite attention budget. Actively manage the instruction-to-conversation ratio by summarizing or compressing older turns when the ratio drops below a threshold (roughly 1:20 for critical constraints). Implement context management that preserves instruction visibility: when conversation history exceeds ~60% of the context window, compress the oldest turns while keeping the full system prompt intact.
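A minimal sketch of the compression trigger described above, assuming a crude whitespace token counter and a placeholder summarizer. The names `count_tokens`, `summarize`, `needs_compression`, and `compress_history` are hypothetical and not taken from any library; a real system would use the model's own tokenizer and a model-backed summarizer. The 60% and 1:20 thresholds are the rough figures from this report, not tuned values.

```python
from typing import List

HISTORY_FRACTION_LIMIT = 0.60   # compress once history exceeds ~60% of the window
MIN_INSTRUCTION_RATIO = 1 / 20  # ~1:20 instruction-to-conversation floor


def count_tokens(text: str) -> int:
    """Hypothetical stand-in: whitespace word count, not a real tokenizer."""
    return len(text.split())


def summarize(turns: List[str]) -> str:
    """Hypothetical stand-in: a real system would call a summarization model."""
    return "[summary of %d earlier turns]" % len(turns)


def needs_compression(system_prompt: str, turns: List[str], context_limit: int) -> bool:
    """True when history is over budget or the instruction ratio is too low."""
    history = sum(count_tokens(t) for t in turns)
    if history == 0:
        return False
    over_budget = history > HISTORY_FRACTION_LIMIT * context_limit
    ratio_too_low = count_tokens(system_prompt) / history < MIN_INSTRUCTION_RATIO
    return over_budget or ratio_too_low


def compress_history(system_prompt: str, turns: List[str],
                     context_limit: int, keep_recent: int = 2) -> List[str]:
    """Fold the oldest turns into one summary entry; the system prompt is never touched."""
    if not needs_compression(system_prompt, turns, context_limit):
        return list(turns)
    max_cut = max(len(turns) - keep_recent, 0)
    if max_cut == 0:
        return list(turns)  # nothing old enough to compress
    # Compress as few of the oldest turns as possible to get back under both thresholds.
    for cut in range(1, max_cut + 1):
        candidate = [summarize(turns[:cut])] + list(turns[cut:])
        if not needs_compression(system_prompt, candidate, context_limit):
            return candidate
    # Still over the thresholds: compress everything except the most recent turns.
    return [summarize(turns[:max_cut])] + list(turns[max_cut:])
```

The system prompt is passed in only to measure the ratio and is kept out of the list being compressed, which is one way to guarantee it stays intact. Compressing the smallest possible prefix is a design choice in this sketch; compressing in larger batches would trigger summarization less often.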
Journey Context:
The context window is not just a size limit; it is an attention budget. As conversation history grows, it consumes more of this budget, leaving less for system instructions. A 1000-token system prompt in a 2000-token context makes up 50% of the tokens competing for attention. The same prompt in a 100,000-token context makes up 1%. Token share is only a first-order proxy, since attention is not spread evenly across tokens, but the trend is real: softmax attention is normalized across every token in the context, so each added turn competes with the instructions for weight. Production teams are shifting from 'fill the context' to 'manage the attention budget', actively controlling the ratio of instruction tokens to conversation tokens. The emerging practice: when conversation history exceeds a threshold, compress older turns into structured summaries (preserving decisions and state, discarding conversational filler) to maintain a minimum instruction-to-context ratio. This is a shift in how we think about context windows: from passive containers to active resources that must be budgeted. The StreamingLLM paper's analysis of attention sinks supports this view: attention is not uniformly distributed, and managing its distribution matters for long-session stability.
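The arithmetic behind these figures can be sketched as follows. Both helper names are hypothetical, and token share is used here as a rough proxy for attention, not a measurement of it.

```python
def instruction_share(prompt_tokens: int, total_tokens: int) -> float:
    """Fraction of the context occupied by the system prompt."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return prompt_tokens / total_tokens


def max_history_tokens(prompt_tokens: int, conversation_per_instruction: int = 20) -> int:
    """Largest history that still satisfies a 1:N instruction-to-conversation ratio."""
    return prompt_tokens * conversation_per_instruction


# The two cases from the text: the same prompt in a small and a large context.
small_context_share = instruction_share(1000, 2000)
large_context_share = instruction_share(1000, 100_000)

# History budget for a 1000-token prompt under the rough 1:20 threshold.
history_budget = max_history_tokens(1000)
```

Under the 1:20 threshold, a 1000-token prompt allows about 20,000 tokens of conversation before older turns should be compressed, well short of filling a 100,000-token window.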
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:22:58.274336+00:00— report_created — created