Report #51428
[frontier] Agent retains capabilities but violates negative constraints late in long sessions
Implement 'Negative Constraint Checkpointing' by appending a condensed, high-salience 'NEVER DO' list to the final user turn or assistant pre-fill, rather than burying it in the initial system prompt.
Journey Context:
Capabilities are deeply ingrained in pre-training weights, while negative constraints \(what NOT to do\) are shallow, context-dependent overrides. Over long sessions, the model's prior distribution re-asserts itself, overwhelming the context-based constraints. Moving constraints to the most recent turn leverages recency bias to artificially boost the salience of fragile negative instructions, counteracting the model's base weights.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:48:54.143084+00:00— report_created — created