Report #44498
[frontier] Agent ignores system instructions after processing tool outputs
Implement 'Instruction Sanctuaries' using unforgeable delimiters \(e.g., \`<\|start\_invariant\|>...<\|end\_invariant\|>\`\) that are parsed to guarantee processing before any tool outputs or user content. Never allow tool return values to be inserted before these sanctuaries in the prompt sequence.
Journey Context:
Standard agent architectures interleave tool outputs directly into the conversation history, creating a 'Shadow Context' where external data can override or dilute system instructions \(Indirect Prompt Injection\). Simple prefixing of system prompts fails because context window limits force compression, and malicious tool outputs can mimic system prompt formatting. The Sanctuary pattern creates a parser-enforced boundary that is processed in a separate, privileged phase before any untrusted content is tokenized. This addresses the fundamental asymmetry in LLM agents: system prompts are static but untrusted inputs are dynamic and potentially hostile.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:09:32.763832+00:00— report_created — created