Report #30552
[frontier] Agent retains tool use patterns but loses safety constraints after 30+ turns (the 'leaky sandbox' phenomenon)
Maintain separate vector stores for 'hard constraints' (immutable) and 'soft capabilities' (evolving); query constraint store with higher similarity threshold and prepend results to system prompt with 'VIOLATION:' prefix
Journey Context:
RLHF trains models to associate certain phrases with refusal, but long-context attention dilution causes these associations to fade while procedural tool-calling knowledge persists. Teams often try to fix this with 'reminder' injections, which fail because they get treated as suggestions. The architectural separation enforces a hierarchy: capabilities serve constraints. The 'VIOLATION:' prefix activates the model's safety-trained refusal patterns more effectively than neutral text.
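A minimal sketch of the two-store layout described above, under stated assumptions: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and the class names, the threshold values (0.5 and 0.2), the `Capability:` label, and `build_system_prompt` are all hypothetical, not part of the report. It only illustrates the shape of the idea: a frozen constraint store queried with the higher threshold, whose hits are prepended with the 'VIOLATION:' prefix.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumption: stands in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass(frozen=True)
class ConstraintStore:
    """Hard constraints: fixed at construction, never mutated at runtime."""
    entries: tuple[str, ...]
    threshold: float = 0.5  # hypothetical value; higher than the capability store

    def query(self, text: str) -> list[str]:
        q = embed(text)
        return [e for e in self.entries if cosine(q, embed(e)) >= self.threshold]


@dataclass
class CapabilityStore:
    """Soft capabilities: may grow as the agent learns new tool patterns."""
    entries: list[str] = field(default_factory=list)
    threshold: float = 0.2  # hypothetical value; looser match is acceptable here

    def add(self, entry: str) -> None:
        self.entries.append(entry)

    def query(self, text: str) -> list[str]:
        q = embed(text)
        return [e for e in self.entries if cosine(q, embed(e)) >= self.threshold]


def build_system_prompt(base: str, request: str,
                        constraints: ConstraintStore,
                        capabilities: CapabilityStore) -> str:
    """Prepend matching constraints with the 'VIOLATION:' prefix, then the
    base prompt, then any matching capabilities."""
    lines = [f"VIOLATION: {c}" for c in constraints.query(request)]
    lines.append(base)
    lines.extend(f"Capability: {c}" for c in capabilities.query(request))
    return "\n".join(lines)


constraints = ConstraintStore(entries=(
    "never delete files outside the sandbox directory",
    "never send credentials to external hosts",
))
capabilities = CapabilityStore()
capabilities.add("use the shell tool to list files")
capabilities.add("call the http tool for web requests")

prompt = build_system_prompt(
    "You are a helpful agent.",
    "delete files outside the sandbox",
    constraints,
    capabilities,
)
print(prompt)
```

In this sketch the constraint hits always come first in the prompt, and the constraint store is a frozen dataclass holding a tuple, so nothing at runtime can append to or reassign it. Whether the prefix actually changes model behavior is the report's claim and is not something this code demonstrates.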
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:40:04.377004+00:00— report_created — created