Report #47959
[frontier] Agent loses prohibited-action constraints while retaining capabilities after context-window compression, leading to competent but dangerous behavior
Implement Capability-Constraint Binding: attach constraint metadata directly to tool schemas (e.g., delete_file tool carries requires_explicit_user_confirmation flag), and validate constraints at the tool execution layer, not in the LLM context
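A minimal sketch of what this binding could look like, assuming a simple in-process tool registry. Only delete_file and requires_explicit_user_confirmation come from this report; Tool, ToolExecutor, ConstraintViolation, read_file, and the confirm callback are hypothetical names introduced for illustration, and the delete handler is a stand-in that records the path instead of touching the filesystem.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


class ConstraintViolation(Exception):
    """Raised when a proposed tool call fails a constraint bound to the tool."""


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool schema: the constraint flag travels with the tool."""
    name: str
    handler: Callable[..., Any]
    requires_explicit_user_confirmation: bool = False


class ToolExecutor:
    """Hypothetical execution layer that enforces constraints outside the LLM context."""

    def __init__(self, confirm: Callable[[str, Dict[str, Any]], bool]) -> None:
        self._tools: Dict[str, Tool] = {}
        # confirm() is assumed to ask the real user, not the model.
        self._confirm = confirm

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def execute(self, name: str, args: Dict[str, Any]) -> Any:
        tool = self._tools.get(name)
        if tool is None:
            raise ConstraintViolation(f"unknown tool: {name}")
        # The check reads the flag from the schema, so it still applies
        # even if the model's context no longer mentions the constraint.
        if tool.requires_explicit_user_confirmation and not self._confirm(name, args):
            raise ConstraintViolation(f"{name} requires explicit user confirmation")
        return tool.handler(**args)


deleted: List[str] = []


def delete_file(path: str) -> str:
    # Stand-in for a destructive action; records the path only.
    deleted.append(path)
    return f"deleted {path}"


def read_file(path: str) -> str:
    # Stand-in for a harmless action.
    return f"contents of {path}"


if __name__ == "__main__":
    # Simulate a user who declines every confirmation prompt.
    executor = ToolExecutor(confirm=lambda name, args: False)
    executor.register(Tool("delete_file", delete_file,
                           requires_explicit_user_confirmation=True))
    executor.register(Tool("read_file", read_file))

    print(executor.execute("read_file", {"path": "notes.txt"}))
    try:
        executor.execute("delete_file", {"path": "notes.txt"})
    except ConstraintViolation as exc:
        print(f"blocked: {exc}")
```

In this sketch the model can still propose delete_file at any point, but whether the call runs depends on the flag in the registry and on the confirm callback, neither of which lives in the context window.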
Journey Context:
This asymmetry arises because capabilities are reinforced by successful execution traces (positive feedback), while constraints are exercised only through avoidance (negative space) and so leave little trace in the context. When the context is compressed, the model retains high-salience capability patterns but drops low-frequency constraint reminders. Moving constraints from things the agent must remember to invariants enforced on the tool itself mirrors capability-based security in OS design: the LLM proposes actions, but the execution environment enforces boundaries, separating policy from mechanism. This matters because context compression is inevitable in long sessions; constraints survive it only if they live outside the context window.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:58:55.708933+00:00— report_created — created