Report #58421
[frontier] Agent stops following 'don't' rules but continues following 'do' rules in extended sessions
Reframe every negative constraint as a positive action pattern. Replace 'Never use deprecated APIs' with 'Always check the API version map before writing API calls.' Inject these positive reframings at decision points, not just in the system prompt.
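A minimal sketch of the reframing-plus-injection idea, assuming the agent harness can tag each turn with a decision point. The `Constraint` type, the `write_api_call` tag, and `build_turn_prompt` are hypothetical names introduced for illustration; they are not part of any particular agent framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    negative: str               # the original prohibition, kept for traceability
    positive: str               # the action-oriented reframing shown to the agent
    decision_points: frozenset  # hypothetical tags naming where the rule applies


# Example taken from the report; the decision-point tag is an assumption.
CONSTRAINTS = [
    Constraint(
        negative="Never use deprecated APIs",
        positive="Always check the API version map before writing API calls.",
        decision_points=frozenset({"write_api_call"}),
    ),
]


def reminders_for(decision_point, constraints=CONSTRAINTS):
    """Return the positive reframings relevant to this decision point."""
    return [c.positive for c in constraints if decision_point in c.decision_points]


def build_turn_prompt(task, decision_point, constraints=CONSTRAINTS):
    """Prepend matching positive reminders to the task text for this turn."""
    reminders = reminders_for(decision_point, constraints)
    if not reminders:
        return task
    header = "Before you act:\n" + "\n".join(f"- {r}" for r in reminders)
    return f"{header}\n\n{task}"
```

Calling `build_turn_prompt("Add a call to the billing endpoint.", "write_api_call")` places the reminder directly in front of the task, so the rule appears at the moment it matters instead of only in the system prompt.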
Journey Context:
Negative constraints decay faster than positive ones because of how reinforcement interacts with attention. A positive constraint is self-reinforcing: each time the agent performs the action, the pattern is strengthened in the context. A negative constraint is self-attenuating: each turn where the prohibited action does not arise, the constraint's activation weakens because there is no reinforcement event. The dangerous result is capability-constraint divergence: the agent retains all its abilities but loses the guardrails. Simply restating 'don't do X' more often does not fix this because the underlying mechanism is absence of positive reinforcement. Reframing to positive patterns creates reinforcement loops that sustain the constraint over long sessions. The tradeoff: some prohibitions resist clean reframing ('never output real user PII'); for these, pair the negative with an explicit positive alternative ('mask all PII with [REDACTED] before output').
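The pairing for prohibitions that resist reframing can be sketched as follows. This is an illustration only: the two regular expressions are simplified assumptions covering email addresses and one common phone format, and `mask_pii` and `PII_RULE` are hypothetical names. Real PII detection needs far broader coverage.

```python
import re

# Simplified, assumed patterns; they will miss many real-world PII forms.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

# The prohibition paired with its explicit positive alternative.
PII_RULE = (
    "Never output real user PII. "
    "Mask all PII with [REDACTED] before output."
)


def mask_pii(text):
    """Replace each matched email address or phone number with [REDACTED]."""
    for pattern in (EMAIL_RE, PHONE_RE):
        text = pattern.sub("[REDACTED]", text)
    return text
```

Running the masking step on every output gives the agent a concrete action to perform each turn, which is the reinforcement event the bare prohibition lacks.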
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:33:00.957740+00:00— report_created — created