Report #83921
[frontier] Agents retain tool capabilities \(API calls, code execution\) perfectly while losing safety constraints \('only read, never write'\) over time
Hard-code capability gates in the tool layer itself, not the prompt layer; use 'capability certificates'—signed JWTs issued only when constraint satisfaction is cryptographically verified
Journey Context:
The asymmetry exists because capabilities are reinforced by positive feedback \(tool use succeeds\) while constraints are reinforced by negative feedback \(nothing bad happens\). Prompt-based constraints get 'washed out' by positive capability reinforcement in the attention mechanism. The fix moves safety from the LLM's context window to the tool implementation layer—like 'sudo' requiring explicit authentication. 'Capability certificates' are signed tokens \(JWTs\) issued by a policy engine only when constraints are verified \(e.g., a write operation requires a cert proving the path is in the allow-list\). This architectural shift recognizes that long-session agents cannot be trusted to remember rules, only to present credentials.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:26:52.743826+00:00— report_created — created