Report #61122
[frontier] How do I handle contexts longer than the KV cache limit without losing critical information?
Implement Heavy Hitter Oracle (H2O): keep only the tokens that accumulate the most attention (heavy hitters) plus the most recent tokens, and evict the rest. This is reported to maintain accuracy with about 20% of the original KV cache.
Journey Context:
Standard KV cache eviction uses a simple FIFO policy (rolling buffer) or windowed attention, both of which drop distant tokens that can still be crucial, such as 'attention sinks' or semantic anchors. H2O identifies 'heavy hitter' tokens, those whose accumulated attention scores stay high across layers and heads, and protects them from eviction alongside the most recent tokens. Because the cache is held to a fixed budget, generation can continue over contexts far longer than the cache limit (100k+ tokens) while retaining only about 20% of the KV cache, with accuracy reported to stay close to the full-cache baseline. Evicted tokens are no longer visible to attention, so check the budget against your own workload. The stated advantage over quantization-based methods is that H2O preserves the semantic heavyweights at full precision rather than uniformly compressing all tokens.
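The eviction rule described above can be sketched in a few lines. This is a toy, single-head illustration and not the reference H2O implementation: it tracks token positions and accumulated attention only, with no key/value tensors. The names `H2OCache`, `heavy_budget`, `recent_budget`, and `step` are hypothetical, and the attention weights in the demo are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class H2OCache:
    """Toy single-head cache index: heavy hitters plus recent tokens."""
    heavy_budget: int   # slots for non-recent (heavy hitter) tokens (assumed name)
    recent_budget: int  # most recent tokens that are never evicted (assumed name)
    positions: list = field(default_factory=list)  # cached token positions, in order
    scores: dict = field(default_factory=dict)     # position -> accumulated attention

    def step(self, position, attn):
        """Cache one new token, then evict until the budget is respected.

        `attn` maps cached positions to the attention weight that the new
        token's query assigned to them.
        """
        self.positions.append(position)
        self.scores.setdefault(position, 0.0)
        # Accumulate the attention mass each cached token has received so far.
        for pos, weight in attn.items():
            if pos in self.scores:
                self.scores[pos] += weight
        # Evict the lowest-scoring token outside the recent window.
        while len(self.positions) > self.heavy_budget + self.recent_budget:
            cutoff = len(self.positions) - self.recent_budget
            candidates = self.positions[:cutoff]
            victim = min(candidates, key=lambda p: self.scores[p])
            self.positions.remove(victim)
            del self.scores[victim]
        return list(self.positions)


# Demo with made-up attention weights: token 0 keeps attracting attention.
cache = H2OCache(heavy_budget=1, recent_budget=2)
cache.step(0, {0: 1.0})
cache.step(1, {0: 0.9, 1: 0.1})
cache.step(2, {0: 0.8, 1: 0.1, 2: 0.1})
kept = cache.step(3, {0: 0.7, 1: 0.05, 2: 0.05, 3: 0.2})
print(kept)  # prints [0, 2, 3]
```

With a total budget of three slots, a FIFO buffer would have dropped token 0 at this step. Here token 1 is evicted instead, because token 0 has the highest accumulated score and tokens 2 and 3 fall inside the recent window. A real implementation would apply the same rule per layer and head to the actual key/value tensors.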
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:04:47.150075+00:00— report_created — created