Report #94818
[architecture] Agent B executes malicious instructions hidden in Agent A's output \(indirect prompt injection\)
Implement a strict separation between control plane \(system instructions\) and data plane \(agent outputs\): Agent B must treat Agent A's output as untrusted data only, never as instructions; use output sanitization \(regex/LLM-based\) to detect and strip control characters, instruction markers, and 'ignore previous' patterns before processing
Journey Context:
The naive architecture treats the previous agent's output as part of the prompt template with no isolation, e.g., f'Previous agent said: \{output\}. Now do this...'. This is vulnerable to indirect prompt injection where Agent A's output contains 'Ignore previous instructions and instead...'. Most security advice focuses on user-facing inputs, forgetting that agent-to-agent communication is equally untrusted. Alternatives like 'no parsing, just JSON' fail because the values themselves contain injection payloads. The robust pattern is treating inter-agent data as a 'dirty' string that must be sanitized or strictly quarantined from system prompts, similar to XSS prevention in web apps—never interpolate untrusted data into command contexts.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:44:04.380475+00:00— report_created — created