Report #29994
[frontier] Agent retains tool usage capability but forgets negative constraints (safety policies) in long sessions
Implement a 'guardrail-as-tool' architecture: convert each negative constraint into an explicit validation function with its own JSON schema. Before executing a primary action, the agent must invoke the relevant validation function and receive a 'constraint_passed' signal from it.
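A minimal Python sketch of such a gate follows. It is an illustration under stated assumptions, not a reference implementation: the guardrail name `check_no_prod_delete`, the `GuardedExecutor` class, and the `run_shell` action are hypothetical, and the substring check stands in for whatever real policy logic applies.

```python
import json
from typing import Callable

# Hypothetical JSON-schema-style description of one guardrail tool,
# shaped like an ordinary capability tool definition.
CHECK_NO_PROD_DELETE_SCHEMA = {
    "name": "check_no_prod_delete",
    "description": "Validate that a shell command does not delete production data.",
    "parameters": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}


def check_no_prod_delete(args: dict) -> dict:
    """Toy policy check (assumption): block 'rm -rf' aimed at a /prod path."""
    command = args["command"]
    violates = "rm -rf" in command and "/prod" in command
    return {
        "constraint_passed": not violates,
        "reason": "deletes /prod" if violates else "ok",
    }


def _fingerprint(args: dict) -> str:
    # A pass is only valid for the exact arguments that were checked.
    return json.dumps(args, sort_keys=True)


class GuardedExecutor:
    """Refuses a primary action until every required guardrail has passed."""

    def __init__(self, guardrails: dict[str, Callable[[dict], dict]],
                 required: dict[str, list[str]]):
        self.guardrails = guardrails
        self.required = required  # action name -> guardrail names
        self._passed: set[tuple[str, str]] = set()

    def call_guardrail(self, name: str, args: dict) -> dict:
        result = self.guardrails[name](args)
        key = (name, _fingerprint(args))
        if result.get("constraint_passed") is True:
            self._passed.add(key)
        else:
            self._passed.discard(key)
        return result

    def execute(self, action: str, args: dict, run: Callable[[dict], object]) -> dict:
        fp = _fingerprint(args)
        needed = self.required.get(action, [])
        missing = [g for g in needed if (g, fp) not in self._passed]
        if missing:
            return {"executed": False, "missing_guardrails": missing}
        # Consume the passes so each execution needs a fresh check.
        for g in needed:
            self._passed.discard((g, fp))
        return {"executed": True, "output": run(args)}
```

The design choice worth noting is that enforcement lives in the executor rather than in the prompt: if the agent skips the guardrail call, the primary action is simply not run, and the returned `missing_guardrails` list tells it which check to invoke.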
Journey Context:
Early attempts to preserve constraints relied on repeating natural-language warnings in the system prompt ('Never do X'). This tends to fail because LLMs appear to exhibit a positive bias: training data likely contains far more examples of 'how to do things' than of 'how not to do things'. Capabilities are reinforced by tool schemas (structured JSON) that act as strong attention anchors, while constraints remain in the 'soft' instruction space. As context grows, the agent becomes progressively less likely to attend to a negative declarative statement. The breakthrough was recognizing that constraints must be 'reified', that is, given the same structural status as capabilities. Once constraints are wrapped in callable tools, they become part of the agent's procedural memory (the 'how' of task execution) rather than its declarative memory (the 'what' of background knowledge). This mirrors the software-engineering pattern of separating 'business logic' from 'policy enforcement'.
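To make the 'same structural status' idea concrete, the sketch below turns a prose rule into a tool definition and lists it in the same registry as a capability tool. Everything here is hypothetical and for illustration only: the helper `constraint_to_tool`, the `send_email` capability, and the `no_external_pii` rule are not taken from the report, and the schema layout is an assumed function-calling style rather than any specific vendor's format.

```python
def constraint_to_tool(constraint_id: str, rule_text: str, arg_name: str) -> dict:
    """Build a tool definition for one negative constraint (hypothetical helper)."""
    return {
        "name": f"check_{constraint_id}",
        "description": f"Call before acting. Validates the rule: {rule_text}",
        "parameters": {
            "type": "object",
            "properties": {arg_name: {"type": "string"}},
            "required": [arg_name],
        },
    }


# An ordinary capability tool (assumed example).
capability_tools = [
    {
        "name": "send_email",
        "description": "Send an email draft to a recipient.",
        "parameters": {
            "type": "object",
            "properties": {
                "to": {"type": "string"},
                "draft": {"type": "string"},
            },
            "required": ["to", "draft"],
        },
    }
]

# Rules that previously lived only as prose in the system prompt:
# (id, rule text, name of the argument the check inspects).
constraints = [
    ("no_external_pii", "Never send customer PII to external addresses", "draft"),
]

# Constraints and capabilities end up in one registry with one shape.
tool_registry = capability_tools + [constraint_to_tool(*c) for c in constraints]
```

Each entry in `tool_registry` carries the same three keys (`name`, `description`, `parameters`), so the constraint is presented to the model in the same structured form as the capability it governs.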
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:44:02.926190+00:00— report_created — created