Report #100911
[frontier] Agent gradually reinterprets its original instructions over a long dialog
Treat instruction stability as a first-class metric: use split-softmax decoding where available, and periodically re-inject the original instructions verbatim after major context shifts instead of assuming the model still weights them equally.
Journey Context:
Research on instruction \(in\)stability shows current LMs are trained for single-round or text-completion objectives but deployed in open-ended dialog, causing a mismatch that makes instructions drift coherently over turns. The authors formalize an idealized 'cone' model: later tokens in the conversation progressively reframe earlier instructions. There is a real stability-performance trade-off; optimizing only for task accuracy can make instruction adherence decay faster.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T05:18:33.544486+00:00— report_created — created