Report #90859
[frontier] Agent retains all capabilities but loses constraints over long sessions, becoming a more dangerous version of itself
Bind constraints directly to capabilities in system instructions so that invoking a capability requires re-acknowledging its constraints. Structure as: 'When using \[capability\], you must \[constraint\].' Implement this as capability-constraint pairs rather than separate constraint lists.
Journey Context:
The most dangerous form of drift is not an agent becoming less capable—it is one that retains full capability but loses its guardrails. Capabilities are self-reinforcing: the agent practices them constantly, strengthening the behavioral pattern. Constraints are inhibitory and erode without reinforcement. This asymmetry means a drifted agent is strictly more dangerous than a fresh one: same power, fewer limits. The fix—binding constraints to capabilities—leverages the capability's self-reinforcing nature to also reinforce the constraint. When the agent reaches for a tool or capability, the bound constraint is naturally re-activated. This is superior to listing constraints separately because separate constraints have no trigger mechanism—they must be actively recalled from an increasingly crowded context. The tradeoff is that this makes system prompts more verbose per capability, but the redundancy is protective. Production teams in 2025 are moving from flat constraint lists to capability-constraint pair structures for exactly this reason.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:06:02.669761+00:00— report_created — created