Report #39010
[frontier] Agent retains ability to execute dangerous actions while forgetting safety guardrails restricting them
Implement H2O (Heavy Hitter Oracle) KV-cache eviction to retain attention patterns for constraint tokens while evicting conversational fluff
Journey Context:
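Standard KV-cache management keeps all history, so cache memory grows linearly with context length, total attention cost grows as O(n²), and early constraint signals get diluted among later tokens. H2O keeps a fixed cache budget made up of 'heavy hitter' tokens, meaning those with the highest accumulated attention scores (often, though not guaranteed to be, system prompts and critical constraints), plus a window of the most recent tokens, and evicts the less attended tokens. This targets 'attention dilution drift', and it avoids the failure mode of recency-only eviction, where the model literally cannot attend to early constraints because cache pressure has already evicted them. Tradeoff: slight accuracy loss on content that depended on evicted tokens, in exchange for what should be markedly better constraint adherence.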
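The eviction rule described above can be sketched in a few lines. This is a minimal illustration, not the reference H2O implementation: the names `accumulate` and `h2o_keep_indices` are hypothetical, scores are plain Python lists rather than per-head tensors, and the `pinned` argument is an assumed extension (not part of H2O) for forcing known constraint positions to stay in the cache.

```python
from typing import Iterable, List, Sequence


def accumulate(acc_scores: Sequence[float], attn_row: Sequence[float]) -> List[float]:
    """Add one decoding step's attention weights to the running per-token totals."""
    # Newly appended tokens start with an accumulated score of 0.0.
    out = list(acc_scores) + [0.0] * (len(attn_row) - len(acc_scores))
    for i, w in enumerate(attn_row):
        out[i] += w
    return out


def h2o_keep_indices(
    acc_scores: Sequence[float],
    budget: int,
    recent: int,
    pinned: Iterable[int] = (),
) -> List[int]:
    """Return the sorted cache positions to keep under a fixed budget.

    Kept positions = the `recent` newest tokens, any `pinned` positions
    (hypothetical extension, not part of H2O), and then the tokens with the
    highest accumulated attention scores until the budget is filled.
    """
    n = len(acc_scores)
    if budget >= n:
        return list(range(n))  # nothing needs to be evicted yet
    if recent > budget:
        raise ValueError("recent window cannot exceed the cache budget")

    keep = set(range(n - recent, n))                  # recent window
    keep.update(i for i in pinned if 0 <= i < n)      # assumed pinning step
    if len(keep) > budget:
        raise ValueError("pinned + recent positions exceed the cache budget")

    # Heavy hitters: highest accumulated score first, ties broken by position.
    candidates = sorted(
        (i for i in range(n) if i not in keep),
        key=lambda i: (-acc_scores[i], i),
    )
    keep.update(candidates[: budget - len(keep)])
    return sorted(keep)


if __name__ == "__main__":
    scores = [5.0, 0.1, 0.2, 3.0, 0.1, 0.3, 0.2, 0.4]
    print(h2o_keep_indices(scores, budget=4, recent=2))
    print(h2o_keep_indices(scores, budget=4, recent=2, pinned=[1]))
```

In the first call, positions 0 and 3 survive as heavy hitters alongside the two newest tokens. The second call shows the assumed pinning step: position 1 has a low score and would normally be evicted, but it is kept because it was marked as a constraint. Without pinning, a constraint token is retained only if it actually accumulates high attention.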
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:57:16.927070+00:00— report_created — created