Report #38162
[frontier] Agent keeps capabilities but ignores negative constraints after many turns
Use a Constraint Reflection Loop: run a hidden verification step in which the model checks its drafted response against a lightweight constraint checklist before anything is shown to the user.
Journey Context:
LLMs are heavily trained to be helpful (capabilities), so they gravitate towards fulfilling user requests even when doing so violates a negative constraint (e.g., never use markdown, stay in character). Over a long session, the user's immediate requests come to outweigh static negative constraints that were stated once at the start. A hidden self-critique loop acts as an immune system, catching drift before the user sees it.
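A minimal Python sketch of the loop described above, under stated assumptions. The names constraint_reflection_loop, generate, verify, max_revisions and fallback are hypothetical and do not come from any particular framework. In a real agent, generate and verify would each wrap a model call; here they are replaced by deterministic stand-ins so the control flow can be run and inspected.

```python
from typing import Callable, List

# Hypothetical signatures; in practice both would wrap LLM calls.
Generate = Callable[[str, List[str]], str]      # (user message, violations) -> draft
Verify = Callable[[str, List[str]], List[str]]  # (draft, checklist) -> violated items


def constraint_reflection_loop(
    user_msg: str,
    checklist: List[str],
    generate: Generate,
    verify: Verify,
    max_revisions: int = 2,
    fallback: str = "Sorry, I could not produce a compliant answer.",
) -> str:
    """Draft, check the draft against the checklist, and revise before output."""
    draft = generate(user_msg, [])
    for attempt in range(max_revisions + 1):
        violations = verify(draft, checklist)  # hidden step, never shown to the user
        if not violations:
            return draft
        if attempt < max_revisions:
            # Feed the violated constraints back so the next draft can fix them.
            draft = generate(user_msg, violations)
    # Every draft was checked and none passed, so nothing unverified is returned.
    return fallback


# Stand-ins for demonstration only (assumption: a single "never use markdown" rule).
def fake_generate(user_msg: str, violations: List[str]) -> str:
    if "never use markdown" in violations:
        return "Step one: install. Step two: run."
    return "**Step one:** install.\n- Step two: run."


def fake_verify(draft: str, checklist: List[str]) -> List[str]:
    markdown_tokens = ("**", "\n- ", "# ")
    if "never use markdown" in checklist and any(t in draft for t in markdown_tokens):
        return ["never use markdown"]
    return []


if __name__ == "__main__":
    print(constraint_reflection_loop(
        "How do I set it up?", ["never use markdown"], fake_generate, fake_verify
    ))
```

Keeping the checklist short and passing only the violated items back to the generator is one possible design choice: it keeps the hidden step cheap, and the fallback means a draft that never passes the check is not shown to the user.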
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:32:02.928069+00:00— report_created — created