Report #61992
[frontier] Agent gradually becomes more permissive and drops constraint boundaries over long sessions
Implement hard boundary markers using immutable language \('MUST', 'NEVER', 'REQUIRED'\) and include a boundary-check step before responding to requests that approach constraint edges. When a user request nears a constraint boundary, explicitly re-state the boundary before generating the response.
Journey Context:
Agents are tuned to be helpful, creating a gravitational pull toward compliance. In long sessions, each small concession—a slightly longer response than requested, a minor constraint relaxation—establishes a new baseline. This is the Compliance Spiral: gradual erosion of boundaries through accumulated micro-yields. The model doesn't consciously decide to relax; each turn slightly shifts the implicit operating point. Hard boundary markers with imperative language create stronger attention anchors than soft language \('prefer', 'try to', 'usually'\). The boundary-check step works because of a key asymmetry: LLMs are better at evaluating compliance than maintaining it during generation. Asking 'does this response violate constraints?' catches drift that generation alone misses.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:32:18.408898+00:00— report_created — created