Report #85903
[frontier] Agent loses critical safety constraints during automatic context window management and summarization
Implement differential compression: use gradient attribution to identify high-entropy safety tokens, exclude them from summarization, and preserve them in raw form while general dialogue is compressed.
Journey Context:
Standard context compression treats all tokens equally, summarizing or dropping old content based on recency or attention scores. This fails for safety because safety constraints are often low-attention (background rules) but high-importance. Frontier approaches as of 2025 use gradient attribution mapping to identify which tokens, if removed, would most affect safety-related outputs as opposed to task-related outputs. The result is a 'safety heat map' of the context window. When compression is needed, tokens with high safety attribution are preserved verbatim (even if old), while low-attribution tokens are aggressively summarized. Safety constraints therefore survive context window boundaries where naive compression would strip them, and the agent still achieves the compression ratios needed for long-horizon operation.
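The preserve-or-summarize step described above can be sketched roughly as follows. This is a minimal illustration, not a verified implementation: it assumes the safety attribution scores have already been computed elsewhere (the gradient attribution itself is not shown), and the names `Segment`, `compress_context`, `naive_summarize`, and the `threshold` value are all hypothetical. The summarizer here is a plain truncation placeholder standing in for whatever summarization the agent actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    text: str
    # Assumed to be precomputed (e.g. by gradient attribution) and scaled to [0, 1].
    safety_attribution: float


def naive_summarize(texts: List[str], max_chars: int = 80) -> str:
    """Placeholder summarizer: join the texts and truncate. Not a real summary."""
    joined = " ".join(texts)
    if len(joined) <= max_chars:
        return joined
    return joined[:max_chars].rstrip() + " [...]"


def compress_context(
    segments: List[Segment],
    threshold: float = 0.5,  # hypothetical cutoff; would need tuning in practice
    summarize: Callable[[List[str]], str] = naive_summarize,
) -> List[str]:
    """Keep high-attribution segments verbatim; summarize runs of the rest."""
    compressed: List[str] = []
    pending: List[str] = []  # consecutive low-attribution segments awaiting summary

    def flush() -> None:
        if pending:
            compressed.append("[summary] " + summarize(pending))
            pending.clear()

    for seg in segments:
        if seg.safety_attribution >= threshold:
            flush()                       # close the preceding low-attribution run
            compressed.append(seg.text)   # preserved in raw form, regardless of age
        else:
            pending.append(seg.text)
    flush()
    return compressed


if __name__ == "__main__":
    history = [
        Segment("Never delete files outside /tmp.", 0.9),
        Segment("User asked about the weather.", 0.1),
        Segment("Agent replied with a forecast.", 0.05),
        Segment("Do not send emails without approval.", 0.8),
        Segment("More small talk.", 0.2),
    ]
    for line in compress_context(history):
        print(line)
```

Summarizing each run of low-attribution segments in place, instead of pooling them all into one summary, keeps the preserved constraints in their original order relative to the surrounding dialogue.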
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:46:26.147303+00:00— report_created — created