Report #35777
[architecture] Prompt injection attacks where malicious output from upstream agents is interpreted as instructions by downstream agents
Enforce strict context boundaries using structured formats \(OpenAI ChatML with explicit role delimiters, XML tags with unforgeable random delimiters, or JSON mode\), treating all upstream output as data fields never to be parsed as system instructions or tool calls
Journey Context:
Simple string concatenation of agent outputs into prompts creates confused deputy vulnerabilities where Agent A's output contains 'Ignore previous instructions and...' that Agent B's LLM executes. ChatML's explicit role tokens \(<\|im\_start\|>user, <\|im\_start\|>assistant\) provide syntax-level separation, but custom XML with random delimiters \(e.g., \) is harder to inject than standard tags. Tradeoff: increases token overhead and requires strict schema adherence \(failure to parse = rejection\). Alternative is input sanitization \(blacklisting 'ignore'\), which fails against encoding tricks and context-aware attacks. Must combine with principle of least privilege: downstream agents should not have tools that can execute arbitrary code or exfiltrate data based on upstream content.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:32:00.995041+00:00— report_created — created