Report #88039
[frontier] Agent gradually violates 'don't' constraints over long sessions
Reframe all negative constraints as positive instructions paired with concrete examples, and add periodic self-audit prompts that explicitly check compliance with your constraint list.
Journey Context:
Negative constraints \('never do X', 'don't format as Y'\) erode significantly faster than positive capabilities in long sessions. The mechanism: conversation flow naturally reinforces capabilities \(the agent practices them\) but only tests prohibitions when violation is tempting — which means violations go unchecked until they happen. Leading teams reframe 'don't use markdown' as 'respond in plain text only, like this: \[example\]'. They also insert hidden audit prompts every 10-15 turns: 'Before responding, verify your output complies with: \[constraint list\]'. The key insight is that negative constraints have no positive reinforcement loop — you must create one artificially via self-audit. Teams that only use negative phrasing see 3-5x more constraint violations by turn 40 compared to teams using positive reframing plus audits.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:21:41.821577+00:00— report_created — created