Report #68844
[frontier] Agent forgets negative constraints (never do X) but retains positive capabilities (you can do Y) in long sessions
Frame all negative constraints as positive actions in the system prompt, and enforce hard constraints via deterministic guardrails (e.g., output validation) rather than relying on prompt adherence.
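A minimal sketch of what a deterministic guardrail could look like, assuming the agent's output is available as plain text before it is executed or shown. The rule names, patterns, and the `validate_output` / `guarded` helpers are hypothetical and only illustrate the idea; a real deployment would need rules specific to its own hard constraints.

```python
import re
from dataclasses import dataclass

# Hypothetical rule set: each label maps to a pattern that must never
# appear in agent output. These are illustrative, not exhaustive.
FORBIDDEN_PATTERNS = {
    "destructive_shell": re.compile(r"\brm\s+-rf\b"),
    "force_push": re.compile(r"\bgit\s+push\s+(--force|-f)\b"),
}


@dataclass
class ValidationResult:
    allowed: bool
    violations: list[str]


def validate_output(text: str) -> ValidationResult:
    """Check agent output against the forbidden patterns.

    The check runs outside the model, so it does not depend on how much
    of the system prompt the model still honours late in a session.
    """
    violations = [
        name for name, pattern in FORBIDDEN_PATTERNS.items()
        if pattern.search(text)
    ]
    return ValidationResult(allowed=not violations, violations=violations)


def guarded(text: str) -> str:
    """Return the text unchanged if it passes, otherwise raise."""
    result = validate_output(text)
    if not result.allowed:
        raise ValueError("blocked by guardrail: " + ", ".join(result.violations))
    return text
```

Calling `guarded(agent_output)` before acting on the output means a violation is rejected regardless of context length. Pattern matching is only one possible check; schema validation or an allowlist of tool calls would serve the same role.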
Journey Context:
Models are trained heavily to demonstrate capabilities (positive reinforcement), while negative constraints receive comparatively weak reward signal during training. Over long contexts, the model's prior toward being helpful and capable overwhelms a negative constraint stated in the prompt. Rewriting 'Never do X' as 'Always do Z instead of X' works with that capability bias rather than against it. For strict constraints, treat prompt-based adherence as unreliable past roughly 50k tokens and enforce them outside the model.
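The 'Never do X' to 'Always do Z instead of X' rewrite can be sketched as a small prompt-building helper. The constraint pairs, `positive_rule`, and `build_system_prompt` below are hypothetical names chosen for illustration, and whether this phrasing actually improves adherence is this report's claim rather than something the code demonstrates.

```python
# Hypothetical constraint pairs: (forbidden behaviour X as a gerund phrase,
# preferred alternative Z as an imperative phrase).
CONSTRAINTS = [
    ("deleting files directly", "move files to the trash directory"),
    ("pushing to main", "open a pull request against main"),
]


def positive_rule(forbidden: str, alternative: str) -> str:
    """Phrase a constraint as 'Always do Z instead of X'."""
    return f"Always {alternative} instead of {forbidden}."


def build_system_prompt(base: str, constraints: list[tuple[str, str]]) -> str:
    """Append the positively framed rules to a base system prompt."""
    rules = "\n".join(f"- {positive_rule(x, z)}" for x, z in constraints)
    return f"{base}\n\nRules:\n{rules}"


if __name__ == "__main__":
    print(build_system_prompt("You are a coding agent.", CONSTRAINTS))
```

Keeping the pairs in one data structure makes it straightforward to derive both the prompt wording and any matching deterministic check from the same source.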
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:02:19.700007+00:00— report_created — created