Report #36328
[frontier] When context window fills, agent drops system prompt tokens before user history, causing personality flip
Implement the Heavy-Hitter Oracle (H2O) algorithm in vLLM or TGI to identify and retain attention-heavy tokens (including system prompts) in the KV cache while evicting low-attention conversation tokens.
Journey Context:
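Standard KV cache eviction policies (FIFO or LRU) rank tokens by position or recency, not by how much the model relies on them. H2O instead tracks the accumulated attention each cached token receives from later tokens and retains the highest-scoring ones (the "heavy hitters") alongside a window of recent tokens. System instructions often fall into the heavy-hitter set, though this is not guaranteed. By integrating H2O into the inference engine, teams make it more likely that identity-critical tokens persist in GPU memory even when the context window is full, reducing the 'amnesia' that causes drift. This is a fix at the inference-engine level for a problem that surfaces as agent behavior.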
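The retention rule described above can be illustrated with a minimal sketch. It assumes a single attention head and a flat list of per-token scores. The function names `accumulate_scores` and `h2o_keep_indices` and the `budget` and `recent` parameters are hypothetical and are not part of the vLLM or TGI APIs. A real integration would apply this per layer and per head inside the engine's cache manager.

```python
from typing import List, Sequence


def accumulate_scores(acc: List[float], attn_row: Sequence[float]) -> List[float]:
    """Add one decoding step's attention weights to the running per-token totals."""
    # attn_row may be longer than acc when a new token was just appended.
    grown = list(acc) + [0.0] * (len(attn_row) - len(acc))
    return [a + w for a, w in zip(grown, attn_row)]


def h2o_keep_indices(acc: Sequence[float], budget: int, recent: int) -> List[int]:
    """Choose cache positions to keep: the `recent` newest tokens plus the
    highest-scoring older tokens, up to `budget` positions in total."""
    n = len(acc)
    if n <= budget:
        return list(range(n))  # nothing needs to be evicted yet
    recent = min(recent, budget)
    recent_start = n - recent
    recent_idx = list(range(recent_start, n))
    # Rank older tokens by accumulated attention, highest first;
    # ties are broken in favor of the earlier position.
    heavy = sorted(range(recent_start), key=lambda i: (-acc[i], i))[: budget - recent]
    return sorted(heavy + recent_idx)


if __name__ == "__main__":
    # Position 0 stands in for a system-prompt token with high accumulated attention.
    scores = [5.0, 0.1, 0.2, 4.0, 0.3, 0.1]
    print(h2o_keep_indices(scores, budget=4, recent=2))  # [0, 3, 4, 5]
```

In this toy run the positions with low accumulated attention (1 and 2) are evicted, while position 0 survives because of its score rather than its age. A system prompt that receives little attention would still be evicted under this rule, so pinning those positions explicitly may be a safer complement.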
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:27:19.958083+00:00 - report_created - created