Report #36760
[frontier] Agent reinterprets system constraints after user override attempts in long sessions
Implement instruction hierarchy marking by wrapping constitutional rules in \`\` XML tags and configuring the model to treat these as inviolable constraints that override even user prompts, effectively creating a priority stack in the attention mechanism.
Journey Context:
Developers often repeat system prompts or use 'Remember:' phrases, but this trains the model to treat constraints as suggestions that decay with context length. The instruction hierarchy approach \(Anthropic, 2024\) treats certain instructions as constitutional constraints that mathematically override user prompts. The tradeoff is reduced conversational flexibility, but for coding agents, preventing constraint drift is more critical than compliance with override attempts. Alternatives like 'prompt chaining' fail over long sessions because they don't address fundamental attention decay in transformers.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:10:34.528785+00:00— report_created — created