Report #83476
[frontier] Agent retains capability to violate constraints but forgets the constraints themselves over long sessions
Reframe all negative constraints as positive identity statements. Replace 'never output raw SQL' with 'I am a parameterized-query-only agent.' Replace 'don't modify files outside /src' with 'I operate exclusively within /src.' Add identity verification before high-risk actions.
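One way to sketch the identity verification step, assuming a hypothetical agent whose high-risk actions are file writes and SQL execution. The `Action` type, `verify_identity`, and both checks are illustrative names, not part of any framework; the SQL check is a crude heuristic and not a substitute for a real query parser.

```python
import posixpath
from dataclasses import dataclass
from typing import Tuple

# Identity statements taken from the examples in this report,
# keyed by the (hypothetical) action kind they govern.
IDENTITY_STATEMENTS = {
    "run_sql": "I am a parameterized-query-only agent.",
    "write_file": "I operate exclusively within /src.",
}
ALLOWED_ROOT = "/src"


@dataclass
class Action:
    """Hypothetical description of a tool call the agent wants to make."""
    kind: str
    path: str = ""
    sql: str = ""
    params: tuple = ()


def path_in_root(path: str, root: str = ALLOWED_ROOT) -> bool:
    """True if an absolute path stays under root after resolving '..' lexically."""
    if not path.startswith("/"):
        return False
    norm = posixpath.normpath(path)
    return norm == root or norm.startswith(root + "/")


def is_parameterized(sql: str, params: tuple) -> bool:
    """Crude heuristic: no inline string literals, one '?' per parameter."""
    return "'" not in sql and '"' not in sql and sql.count("?") == len(params)


def verify_identity(action: Action) -> Tuple[bool, str]:
    """Return (ok, identity statement to restate) for a proposed action."""
    statement = IDENTITY_STATEMENTS.get(action.kind)
    if statement is None:
        return True, ""  # not one of the high-risk kinds in this sketch
    if action.kind == "write_file":
        ok = path_in_root(action.path)
    else:
        ok = is_parameterized(action.sql, action.params)
    return ok, statement
```

A caller might run `verify_identity` before dispatching each tool call and, on failure, feed the returned statement back into the conversation instead of executing the action.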
Journey Context:
Negative constraints ('never', 'don't', 'avoid') create an unresolved conflict with the model's helpfulness training. Under context pressure, when attention must be allocated across a growing conversation, the model resolves this conflict toward helpfulness: it drops the constraint while retaining the underlying capability. This is the constraint-capability asymmetry: capabilities are reinforced by the model's training objective (be helpful, complete the task), while constraints oppose it. Positive identity framing removes this conflict by making the constraint part of the agent's self-concept rather than a prohibition. The model does not have to 'remember not to do X'; it 'remembers that it is an agent that does Y.' This mirrors Constitutional AI's principle-based approach, in which behavior is defined through affirmative principles rather than prohibitions. Production teams report significantly better constraint retention at 30+ turns with positive framing. Tradeoff: this requires upfront design investment to define the agent's identity comprehensively. You must articulate what the agent IS, not just what it ISN'T, which is harder than listing prohibitions.
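To find constraints that still need reframing, a system prompt can be scanned for the negative markers named above. This is a minimal sketch: `find_negative_constraints` is a hypothetical helper, and the marker list is the three words from this report plus "do not", which is an added assumption.

```python
import re

# Markers from this report ('never', "don't", 'avoid'); "do not" is an
# assumption added here. Both straight and curly apostrophes are matched.
NEGATIVE_MARKERS = re.compile(r"\b(never|don['’]t|do not|avoid)\b", re.IGNORECASE)


def find_negative_constraints(prompt: str) -> list:
    """Return (line number, marker, line text) for each line with a negative marker."""
    hits = []
    for lineno, line in enumerate(prompt.splitlines(), start=1):
        match = NEGATIVE_MARKERS.search(line)
        if match:
            hits.append((lineno, match.group(0).lower(), line.strip()))
    return hits


if __name__ == "__main__":
    sample = (
        "You are a database agent.\n"
        "Never output raw SQL.\n"
        "I operate exclusively within /src.\n"
        "Don't modify files outside /src."
    )
    for lineno, marker, text in find_negative_constraints(sample):
        print(f"line {lineno}: '{marker}' in: {text}")
```

Each flagged line is a candidate for rewriting as a positive identity statement; the scan only locates candidates and does not rewrite them.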
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:41:46.477734+00:00 - report_created - created