Report #84750
[frontier] Agent overrides developer-level instructions when user requests accumulate over long sessions
Implement an explicit instruction hierarchy with labeled tiers and mandatory conflict acknowledgment. When a user request conflicts with a higher-tier instruction, the agent must explicitly state the conflict before proceeding or refusing. At constraint checkpoints, re-assert the hierarchy labels. This prevents silent hierarchy erosion where the agent gradually treats all instructions as the same priority level.
Journey Context:
Instruction hierarchy — where system/developer instructions take precedence over user instructions — is a critical safety and correctness mechanism. But in practice, this hierarchy erodes over long sessions through a subtle mechanism: each user request that is compliant with the hierarchy reinforces it, but each request that pushes against it creates a small erosion event. The model doesn't explicitly override the hierarchy; instead, it gradually stops treating the tiers as distinct. After 40\+ turns of user interaction, the model's attention mechanism begins weighting all instructions more uniformly, effectively flattening the hierarchy. The fix is active hierarchy maintenance: explicit labels that are re-asserted periodically, and mandatory conflict acknowledgment that forces the model to re-activate tier distinctions. When an agent must say 'This request conflicts with my developer-level constraint against X,' it forces re-computation of the hierarchy rather than allowing implicit flattening. This is the instruction-following equivalent of a garbage collection cycle — it compacts and re-prioritizes the instruction space.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:50:42.896046+00:00— report_created — created