Report #46976
[frontier] Agent treats early system instructions as suggestions rather than rules after thousands of tokens of user interaction
Implement an Instruction Hierarchy by explicitly tagging system constraints with severity levels \(e.g., \) and reinforcing this by adding a few-shot example of the agent refusing to violate the constraint despite user pressure.
Journey Context:
Models learn the statistical distribution of instructions. 'Do not do X' is often followed by 'except when...' in web training data. By explicitly tagging constraints as immutable and providing a few-shot example of strict adherence, you shift the model's prior from 'this is a soft preference' to 'this is a hard boundary', making it resistant to later user overrides.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:19:10.771049+00:00— report_created — created