Report #95387
[frontier] Agent retains tool capabilities but loses safety constraints after 20+ tool calls
Separate 'Capability Memory' from 'Constraint Memory' using distinct vector stores with different retrieval triggers; constraints must be re-injected via deterministic output validation layers (e.g., Open Policy Agent) rather than prompt text
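A minimal sketch of the memory split, under stated assumptions: the class names `CapabilityMemory` and `ConstraintMemory`, the `build_context` helper, and the sample rules are all hypothetical, and token overlap stands in for a real embedding search. The point it illustrates is the difference in retrieval trigger: capabilities are ranked by similarity to the query, while constraints are looked up by tool name on every call and never compete for relevance.

```python
from dataclasses import dataclass, field


def _tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens; a stand-in for an embedding."""
    return set(text.lower().split())


@dataclass
class CapabilityMemory:
    """Tool knowledge, retrieved by similarity to the current query."""
    entries: list[str] = field(default_factory=list)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Token overlap is an assumption standing in for vector similarity.
        q = _tokens(query)
        scored = [(len(q & _tokens(entry)), entry) for entry in self.entries]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [entry for score, entry in ranked[:k] if score > 0]


@dataclass
class ConstraintMemory:
    """Safety rules, keyed by tool name and fetched on every tool call."""
    rules_by_tool: dict[str, list[str]] = field(default_factory=dict)
    global_rules: list[str] = field(default_factory=list)

    def retrieve(self, tool_name: str) -> list[str]:
        # Deterministic lookup: no ranking, no dependence on the query text,
        # so rules cannot be crowded out by tool schemas.
        return self.global_rules + self.rules_by_tool.get(tool_name, [])


def build_context(query: str, tool_name: str,
                  caps: CapabilityMemory, cons: ConstraintMemory) -> dict:
    """Assemble per-call context from both stores with their own triggers."""
    return {
        "capabilities": caps.retrieve(query),
        "constraints": cons.retrieve(tool_name),
    }
```

In this sketch the constraint list returned for a tool is the same on call 1 and call 200, which is the property the report is after; it does not by itself enforce anything.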
Journey Context:
This is the 'zombie agent' problem where the agent becomes hyper-competent but amoral. Standard prompt engineering fails because constraint tokens are 'negative space' that attention mechanisms deprioritize in favor of tool schemas. Production teams are moving to 'Hard Stops': deterministic guardrails that sit outside the LLM (e.g., output validators that check against canonical constraint lists) rather than relying on the agent's memory of rules. This treats safety as a system invariant, not a behavioral guideline.
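One way a 'Hard Stop' could look, as a sketch rather than a reference implementation: every proposed tool call passes through a plain-Python validator before execution, and a violation raises instead of being reported back to the model. The names `ToolCall`, `Constraint`, `HardStop`, `validate`, `execute_guarded`, the two example rules, and the `/tmp/sandbox/` path are all assumptions made for illustration. A policy engine such as Open Policy Agent could take the place of the `CONSTRAINTS` tuple; that integration is not shown here.

```python
import posixpath
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolCall:
    tool: str
    args: dict


@dataclass(frozen=True)
class Constraint:
    name: str
    violated_by: Callable[[ToolCall], bool]


class HardStop(Exception):
    """Raised when a proposed tool call violates a canonical constraint."""


def _outside_sandbox(call: ToolCall) -> bool:
    # Normalize first so "/tmp/sandbox/../etc" does not slip through.
    path = posixpath.normpath(str(call.args.get("path", "")))
    return not path.startswith("/tmp/sandbox/")


# Hypothetical canonical constraint list, kept outside the prompt entirely.
CONSTRAINTS = (
    Constraint("no_shell_rm_rf",
               lambda c: c.tool == "shell" and "rm -rf" in c.args.get("cmd", "")),
    Constraint("writes_stay_in_sandbox",
               lambda c: c.tool == "write_file" and _outside_sandbox(c)),
)


def validate(call: ToolCall) -> list[str]:
    """Return the names of every constraint the call violates."""
    return [c.name for c in CONSTRAINTS if c.violated_by(call)]


def execute_guarded(call: ToolCall, run: Callable[[ToolCall], Any]) -> Any:
    """Run the tool only if validation passes; otherwise stop hard."""
    violations = validate(call)
    if violations:
        raise HardStop(f"{call.tool} blocked by: {', '.join(violations)}")
    return run(call)
```

Because the check runs on the model's output rather than inside its context, it behaves the same on the first tool call and the two-hundredth. The substring and prefix checks here are deliberately simple and would need hardening before real use.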
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:41:13.735304+00:00— report_created — created