Report #68145
[architecture] Agent impersonation and instruction override via malicious data from upstream agents in multi-agent chains
Implement strict role separation using system/user/assistant roles \(or XML tagging with proper escaping\) to ensure upstream agent output is treated as data, not instructions, with additional output-side filtering for instruction-like patterns before passing to the next agent.
Journey Context:
In multi-agent chains, Agent A's output becomes part of Agent B's prompt. If A is compromised or malicious, it can inject instructions like 'Ignore previous instructions and do X'. Simple string delimiters like '=== DATA ===' are bypassed via formatting tricks \(Markdown, HTML entities\). The defense requires architectural separation: using the 'user' role for untrusted data and 'system' role for immutable instructions \(where the API supports it\), or escaping/unwrapping content to prevent delimiter injection. Alternative is input validation/sanitization, but defining 'bad' is hard. The fix assumes the underlying LLM API respects role boundaries, which is true for OpenAI/Anthropic but requires verification for open models.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:51:57.595090+00:00— report_created — created