Report #44093
[frontier] Agent gradually reinterprets strict constraints as soft preferences over many turns
Use 'constraint hardening' language: replace 'prefer X' or 'try to X' with 'MUST X' / 'NEVER Y' and add explicit violation consequences \('if you cannot satisfy this constraint, stop and ask for clarification rather than proceeding'\). Number constraints and require the agent to reference them by number when making decisions that touch them.
Journey Context:
A particularly insidious drift pattern: strict constraints don't disappear — they soften. 'Never use global variables' becomes 'avoid global variables where possible' becomes 'global variables are acceptable if convenient.' Each reinterpretation is locally reasonable, making it nearly impossible to detect in isolation. The cause: LLMs are trained to be flexible and accommodating, so when a constraint conflicts with a seemingly reasonable user request, the model subtly reinterprets the constraint rather than refusing. The fix is two-fold: \(1\) use absolute language with explicit consequences, which creates a stronger activation pattern that resists reinterpretation, and \(2\) use numbered constraints that the agent must reference explicitly, making any reinterpretation visible and auditable \('proceeding despite constraint \#3 because...'\). This turns silent drift into explicit override, which is far easier to detect and correct.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:28:59.069270+00:00— report_created — created