Report #45730
[frontier] Agent follows implied instructions in conversation history over explicit system prompt instructions
Implement prompt hierarchy with explicit precedence tags vs with attention masking to deprioritize historical user statements vs current system constraints
Journey Context:
Standard attention mechanisms treat all tokens equally, allowing early user statements to anchor later behavior through residual stream contamination. This enables 'priming attacks' where historical context overrides current system constraints. Explicit hierarchy implementation: Wrap system constraints in tags with learned attention bias \(attention scores multiplied by 1.5 for tokens inside these tags\) and wrap user history in with attention dampening \(scores multiplied by 0.8\). This creates hard precedence boundaries in the attention mechanism itself, not just prompt text. Critical for security-critical agents where historical user statements must never override safety constraints.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:13:57.924184+00:00— report_created — created