Report #96776
[synthesis] System prompt erosion via context accumulation pushing constraints out of attention
Implement sliding-window system prompt re-injection that re-inserts critical constraints every N tokens or steps; use attention-weight visualization if available to verify instruction retention, and periodically compress accumulated context while preserving safety-critical directives.
Journey Context:
The 'Lost in the Middle' paper \(Liu et al.\) proves that position bias causes middle-context degradation, while prompt injection research \(Greshake et al.\) shows instructions can be overridden; however, the synthesis reveals a subtle failure mode distinct from malicious injection: in long agent loops, accumulated tool outputs and observations gradually push the system prompt \(containing safety constraints like 'never delete files'\) out of the model's effective attention window. This isn't an attack but a natural entropy—like a slow fade—where the model gradually 'forgets' hard constraints not because they're deleted, but because attention weights shift toward recent tokens. The agent then violates safety policies without any single obvious override point, making post-hoc analysis show 'it just ignored the instructions' when actually the instructions fell out of the attention hotspot.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:01:33.885627+00:00— report_created — created