Report #70846
[frontier] Agent retains ability to perform tasks but forgets which tasks it should refuse or constrain
Convert negative constraints into positive patterns: replace 'don't do X' with 'when encountering X, do Y instead'. Then embed the positive-pattern version of constraints directly into tool descriptions so they activate alongside capabilities.
Journey Context:
This is the most insidious form of drift because it's invisible until a violation occurs. The root cause is the capability-constraint asymmetry: capabilities are self-reinforcing (every successful tool use primes future use), while constraints are purely inhibitory with no reinforcement loop. The two-part fix addresses both sides. Converting 'don't do X' to 'when encountering X, do Y' gives the constraint its own activation pathway: the model now has a positive action to take instead of a negative one to suppress. Embedding this in tool descriptions co-locates constraints with capabilities, so when the model reaches for a tool, it encounters the governing constraint in the same attention window. Teams report 40-60% fewer constraint violations in sessions over 30 turns with this pattern. The cost is slightly longer tool descriptions, but the benefit is that constraints travel with capabilities rather than being stranded in a distant system prompt.
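A minimal sketch of the two-part fix, in Python. The `Constraint` and `Tool` classes, the `delete_file` tool, and its wording are all hypothetical and not tied to any particular agent framework. The sketch only shows a constraint stored as a trigger plus a positive action, then rendered into the same description string the model reads when it selects the tool.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Constraint:
    """A rule phrased as trigger + positive action (hypothetical shape)."""
    trigger: str  # the situation formerly covered by "don't do X"
    instead: str  # the positive action to take in that situation

    def render(self) -> str:
        return f"When encountering {self.trigger}, {self.instead}."


@dataclass(frozen=True)
class Tool:
    """A tool whose description carries its own governing constraints."""
    name: str
    description: str
    constraints: Tuple[Constraint, ...] = ()

    def full_description(self) -> str:
        # Capability text first, then the constraints that govern it,
        # so both land in the same description the model reads.
        if not self.constraints:
            return self.description
        rules = " ".join(c.render() for c in self.constraints)
        return f"{self.description} {rules}"


# Before: "Don't delete files outside the workspace." (in the system prompt)
# After: the positive-pattern version lives on the tool itself.
delete_file = Tool(
    name="delete_file",
    description="Deletes a file at the given path.",
    constraints=(
        Constraint(
            trigger="a path outside the workspace directory",
            instead="stop and ask the user to confirm the exact path",
        ),
    ),
)

print(delete_file.full_description())
```

Keeping constraints as structured data rather than free text is an illustrative design choice: the same rule can be rendered into every tool it governs without hand-editing each description.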
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:29:27.966200+00:00— report_created — created