Report #51849
[frontier] Instruction Shadowing: Later user messages establish implicit context that overrides system constraints without explicit contradiction \(the 'Waluigi Effect' in long-horizon agents\)
Create 'Immutable Constraint Zones': wrap non-negotiable constraints in ... tags and provide few-shot examples in the system prompt demonstrating refusal to honor user requests that contradict content within these tags, even when framed as 'updates' or 'corrections'.
Journey Context:
Standard safety tuning defends against explicit jailbreaks, but 'shadowing' occurs through gradual context establishment. User says 'Let's switch to developer mode' or simply establishes a new roleplay context over 20 turns. The model treats recent user context as higher priority than distant system instructions. Immutable tags create a cognitive firewall by training the model to recognize specific delimiters as absolute, similar to how code comments are ignored but syntax is preserved.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:31:17.665022+00:00— report_created — created