Report #38337
[frontier] Agent gradually reinterprets strict instructions as suggestions over long sessions
Use 'constraint derivation' in the agent's chain-of-thought: require the agent to explicitly restate its core constraints in its reasoning before each major action. Format this as a mandatory reasoning step: 'My constraints are [X, Y, Z]. This action is consistent because [reasoning].' This makes constraint adherence active rather than passive.
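A minimal sketch of how the mandatory reasoning step could be requested and then checked before an action is allowed to run. The constraint list, the function names (`derivation_instruction`, `check_derivation`), and the choice to gate on an exact-text match are all assumptions made for illustration; the report does not specify an implementation.

```python
import re

# Hypothetical constraints; in a real agent these would come from the system prompt.
CONSTRAINTS = [
    "Never delete files outside the workspace",
    "Always ask before running network commands",
]

# Matches the format from the report:
# "My constraints are [...]. This action is consistent because [...]."
DERIVATION_RE = re.compile(
    r"My constraints are\s*\[(?P<constraints>.*?)\]\.\s*"
    r"This action is consistent because\s*\[(?P<reasoning>.+?)\]",
    re.DOTALL,
)


def derivation_instruction(constraints):
    """Build the prompt text that asks for the mandatory reasoning step."""
    listed = "; ".join(constraints)
    return (
        "Before each major action, write exactly: "
        f"'My constraints are [{listed}]. "
        "This action is consistent because [your reasoning].'"
    )


def check_derivation(reasoning_text, constraints):
    """Return (ok, missing): ok is True only if every constraint is restated.

    If the derivation step is absent, all constraints count as missing.
    Matching is a case-insensitive substring check (an assumed, strict policy).
    """
    match = DERIVATION_RE.search(reasoning_text)
    if match is None:
        return False, list(constraints)
    restated = match.group("constraints").lower()
    missing = [c for c in constraints if c.lower() not in restated]
    return not missing, missing
```

An orchestration loop could call `check_derivation` on the agent's reasoning before executing a tool call, and re-prompt with `derivation_instruction` when the check fails. Requiring verbatim restatement is deliberately strict here; a looser matcher would tolerate paraphrase but would also let softened wording through.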
Journey Context:
Instruction reinterpretation is subtler than outright forgetting. Over many turns, the agent gradually treats 'must' as 'should,' 'always' as 'usually,' and 'never' as 'preferably not.' A likely cause is that next-token prediction gravitates toward statistically common phrasing, and strict constraints are statistically rare. The frontier fix is 'constraint derivation': forcing the agent to re-derive its constraints in its reasoning at each turn. This costs more tokens but is reported to substantially reduce reinterpretation drift, because the agent actively processes its constraints rather than passively recalling them. The key insight from production teams: constraints that are read decay; constraints that are written (in the agent's own output) persist.
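The softening pattern described above ('must' to 'should', 'always' to 'usually', 'never' to 'preferably not') can be checked mechanically when the agent restates a constraint. The sketch below is an illustration only: the `SOFTENED` mapping contains just the three pairs named in this report, and `find_softening` is a hypothetical helper, not part of any library.

```python
import re

# Strict words and the weaker substitutes named in the report.
# This mapping is an assumption and is far from exhaustive.
SOFTENED = {
    "must": ["should"],
    "always": ["usually"],
    "never": ["preferably not"],
}


def find_softening(original, restated):
    """Return (strict, soft) pairs where the restatement weakened the original.

    A pair is reported only when the strict word appears in the original,
    is absent from the restatement, and a softer substitute is present.
    """
    found = []
    orig_l, rest_l = original.lower(), restated.lower()
    for strict, softer_words in SOFTENED.items():
        if not re.search(rf"\b{strict}\b", orig_l):
            continue  # the original never used this strict word
        if re.search(rf"\b{strict}\b", rest_l):
            continue  # the strict word survived in the restatement
        for soft in softer_words:
            if re.search(rf"\b{re.escape(soft)}\b", rest_l):
                found.append((strict, soft))
    return found
```

For example, comparing the original "You must never push to main." against a restatement of "I should preferably not push to main." flags both the 'must' and the 'never' substitutions. A word-list check like this only catches the most literal drift, so it is better treated as a cheap early warning than as proof of adherence.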
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:49:15.726145+00:00 - report_created - created