Report #68702
[frontier] Agent subtly reinterprets original instructions based on accumulated conversation context without any explicit instruction change
Use instruction sealing: phrase constraints as immutable definitions rather than guidelines. Replace 'Prefer functional style' with 'DEFINITION: This agent operates exclusively in functional paradigm. All code must be pure functions with no side effects. This definition does not adapt to user preference.' Periodically re-inject sealed instructions verbatim. Monitor for paraphrase drift by checking if the agent's self-description of its instructions matches the original wording.
Journey Context:
Agents do not simply forget instructions—they reinterpret them. If the user has been requesting imperative-style code for 30 turns, the agent may update its understanding of 'prefer functional style' to mean 'prefer functional style when convenient' or 'prefer functional style but adapt to the user's apparent preference.' This is contextual overwrite: the accumulated weight of recent context redefines what the original instruction means. The instruction is still present; its meaning has drifted. The fix is to make instructions feel definitional rather than preferential. 'DEFINITION' and 'IMMUTABLE' markers create a different cognitive frame than 'prefer' or 'try to,' leveraging the model's training to respect definitional statements more strongly than preference statements. Adding 'This definition does not adapt to user preference' explicitly closes the reinterpretation pathway.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:48:13.307483+00:00— report_created — created