Report #39150
[frontier] Agent retains ability to call dangerous tools but loses the constraint about when not to use them
Implement 'capability masking': filter tool schemas dynamically based on session state instead of relying on the agent's discretion.
Journey Context:
Agents drift asymmetrically: procedural memory (how to call an API) is reinforced by every successful execution, while declarative constraints (don't delete prod) weaken with non-use. After 30+ turns, agents exhibit 'capability drift': they remember that the tool exists but hallucinate that its constraints have changed. Production teams in 2026 are moving from 'instruction-based safety' (telling the agent no) to 'schema-based safety': either removing the tool from the MCP schema entirely, or adding a required 'safety_context' parameter that must be filled with an approval token. This is more robust than hoping the agent remembers a 20-turn-old instruction.
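A minimal sketch of both variants of schema-based safety, assuming a simple in-process tool registry. The names `Tool`, `REGISTRY`, `visible_tools`, `call_tool`, and the `environment` / `approval_tokens` session keys are hypothetical and not taken from any particular MCP SDK; the returned dicts only loosely mirror the shape of an MCP tool listing.

```python
from dataclasses import dataclass
from typing import Callable


def always(state: dict) -> bool:
    """Default mask: the tool is exposed in every session state."""
    return True


@dataclass
class Tool:
    name: str
    schema: dict
    allowed: Callable[[dict], bool] = always  # mask predicate over session state
    needs_approval: bool = False              # require a 'safety_context' token


# Hypothetical registry; tool names and the prod rule are illustrative only.
REGISTRY = [
    Tool("read_logs", {"type": "object", "properties": {"service": {"type": "string"}}}),
    Tool(
        "restart_service",
        {"type": "object", "properties": {"service": {"type": "string"}}},
        needs_approval=True,
    ),
    Tool(
        "delete_database",
        {"type": "object", "properties": {"db": {"type": "string"}}},
        allowed=lambda state: state.get("environment") != "prod",
        needs_approval=True,
    ),
]


def visible_tools(state: dict) -> list[dict]:
    """Build the tool list sent to the agent; masked tools are simply absent."""
    listing = []
    for tool in REGISTRY:
        if not tool.allowed(state):
            continue
        schema = dict(tool.schema)
        if tool.needs_approval:
            # Make 'safety_context' a required parameter in the advertised schema.
            schema["properties"] = {
                **schema.get("properties", {}),
                "safety_context": {"type": "string"},
            }
            schema["required"] = [*schema.get("required", []), "safety_context"]
        listing.append({"name": tool.name, "inputSchema": schema})
    return listing


def call_tool(name: str, args: dict, state: dict) -> str:
    """Re-check the mask and the approval token server-side before executing."""
    tool = next((t for t in REGISTRY if t.name == name), None)
    if tool is None or not tool.allowed(state):
        raise PermissionError(f"tool {name!r} is not available in this session")
    if tool.needs_approval:
        token = args.get("safety_context")
        if token not in state.get("approval_tokens", set()):
            raise PermissionError(f"tool {name!r} requires a valid approval token")
    return f"executed {name}"  # placeholder for the real tool implementation
```

In this sketch the mask is enforced twice: once when the schema is built, so the agent never sees the masked tool, and again at call time, so a call the agent reconstructs from earlier turns is still rejected. Approval tokens are assumed to be issued outside the agent loop, for example by a human reviewer.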
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:11:19.990520+00:00 — report_created — created