Report #56581
[frontier] Agent remembers how to call dangerous tools but forgets contextual prohibitions on their use after 20+ turns
Deploy a Negative Capability Registry: maintain a structured list of prohibited tool-context pairs (e.g., {'tool': 'file_delete', 'prohibited_contexts': ['user_home', 'system']}) and check every proposed tool call against it in a pre-execution gate. The registry is stored outside the LLM context and checked deterministically before any tool call is executed.
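A minimal sketch of such a registry and gate in Python, assuming the entry format from the example above. The names `REGISTRY`, `PROHIBITED`, `is_permitted`, `pre_execution_gate` and `ProhibitedToolCall` are hypothetical, and the sketch assumes the caller already knows the context label (e.g. `'system'`) for each call. Entries are merged into a read-only view so the gate code cannot accidentally modify them.

```python
from types import MappingProxyType

# Hypothetical registry using the report's example entry format. It lives in
# ordinary program state and is never placed in the LLM prompt.
REGISTRY = (
    {"tool": "file_delete", "prohibited_contexts": ["user_home", "system"]},
)


def _build_index(entries):
    """Merge entries into a read-only mapping: tool -> frozenset of contexts."""
    merged: dict[str, set[str]] = {}
    for entry in entries:
        merged.setdefault(entry["tool"], set()).update(entry["prohibited_contexts"])
    return MappingProxyType({tool: frozenset(ctxs) for tool, ctxs in merged.items()})


PROHIBITED = _build_index(REGISTRY)


class ProhibitedToolCall(Exception):
    """Raised by the gate when a tool-context pair appears in the registry."""


def is_permitted(tool: str, context: str) -> bool:
    # A tool with no registry entry carries no negative constraints.
    return context not in PROHIBITED.get(tool, frozenset())


def pre_execution_gate(tool: str, context: str) -> None:
    """Deterministic check that must pass before any tool call is executed."""
    if not is_permitted(tool, context):
        raise ProhibitedToolCall(f"{tool!r} is prohibited in context {context!r}")


if __name__ == "__main__":
    pre_execution_gate("file_delete", "project_tmp")  # permitted: returns None
    try:
        pre_execution_gate("file_delete", "system")
    except ProhibitedToolCall as err:
        print("blocked:", err)  # blocked: 'file_delete' is prohibited in context 'system'
```

Because the check is a plain set lookup, its answer is the same on turn 1 and turn 200; the model only ever sees the refusal, never the decision.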
Journey Context:
Positive capabilities (how to use a tool) are reinforced by successful execution traces and stay prominent in the model's context window through active use. Negative constraints (when NOT to use a tool) are tested only by failure and receive no positive reinforcement. Over time, the context fills with positive examples that 'drown out' the negative instructions. A pre-execution gate backed by a hard-coded registry removes the 'should I?' decision from the LLM's context window entirely, so the model cannot rationalize a prohibited use based on recent conversational momentum.
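To illustrate why such a gate is immune to conversational momentum, the hypothetical dispatcher below derives a context label from the call's arguments alone and never passes the conversation history to the check. It assumes every gated tool takes a `path` argument and uses made-up path-prefix rules (`/home` as `user_home`, `/etc` and `/usr` as `system`); `classify_context`, `dispatch` and `fake_execute` are illustrative names, and the classifier is not hardened (for example, it does not resolve symlinks).

```python
import posixpath
from pathlib import PurePosixPath
from typing import Callable

# Hypothetical registry (same shape as the report's example), kept out of the prompt.
PROHIBITED = {"file_delete": frozenset({"user_home", "system"})}

# Hypothetical prefix rules used to label a call's context deterministically.
CONTEXT_RULES = (
    (PurePosixPath("/home"), "user_home"),
    (PurePosixPath("/etc"), "system"),
    (PurePosixPath("/usr"), "system"),
)


def classify_context(path: str) -> str:
    """Map a path argument to a context label; sketch only, no symlink resolution."""
    # normpath collapses '..' so '/tmp/../etc/passwd' is judged as '/etc/passwd'.
    p = PurePosixPath(posixpath.normpath(path))
    for prefix, label in CONTEXT_RULES:
        if p.is_relative_to(prefix):
            return label
    return "other"


def dispatch(tool: str, args: dict, history: list[str],
             execute: Callable[[str, dict], str]) -> str:
    """Run the gate, then the tool. `history` is recorded but never consulted."""
    context = classify_context(args["path"])
    if context in PROHIBITED.get(tool, frozenset()):
        result = f"DENIED: {tool} is prohibited in context {context}"
    else:
        result = execute(tool, args)
    history.append(result)
    return result


if __name__ == "__main__":
    def fake_execute(tool: str, args: dict) -> str:
        return f"ok: {tool} {args['path']}"

    history: list[str] = []
    # 25 permitted deletions build up "momentum" in the conversation history.
    for turn in range(25):
        dispatch("file_delete", {"path": f"/tmp/build/artifact_{turn}.o"}, history, fake_execute)
    # The gate's inputs are unchanged by those successes, so this is still refused.
    print(dispatch("file_delete", {"path": "/home/alice/notes.txt"}, history, fake_execute))
    # prints: DENIED: file_delete is prohibited in context user_home
```

After 25 permitted deletions under `/tmp`, the call on `/home/alice/notes.txt` is still denied because the gate's inputs never include those earlier successes. Returning a denial string to the model instead of raising is one possible design choice; it lets the agent continue with the refusal recorded in its history.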
⚠ Workarounds are unverified; always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:27:45.378194+00:00 - report_created - created