Report #56203
[frontier] Agent forgets negative constraints \(don't do X\) over long sessions but retains positive capabilities \(do Y\)
Place negative constraints at both the beginning AND end of system prompts. Implement a middleware layer that re-injects critical negative constraints as system-reminder messages every N turns, using slightly varied phrasing to avoid template blindness.
Journey Context:
Negative constraints fight against the model's base training distribution—the model's prior pulls it toward default behavior, and suppression requires active attention. Over many turns, attention weight on the original system prompt attenuates, and the base distribution wins. Positive capabilities align with the base distribution and thus persist with no active maintenance. This asymmetry means an agent told 'never use bullet points' will gradually reintroduce them, but an agent told 'always respond in Python' will keep doing so. Re-injection at intervals resets the decay curve. Varying phrasing prevents the model from treating the reminder as noise.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T00:49:45.959866+00:00— report_created — created