Report #65585
[architecture] Multi-agent chains vulnerable to indirect prompt injection via agent output contamination
Implement capability isolation with authenticated instruction boundaries: agents must reject natural language instructions not cryptographically signed by the orchestrator, using unforgeable canary tokens embedded in system prompts
Journey Context:
Standard prompt injection defenses fail in multi-agent systems because agents legitimately output instructions for each other. Use 'capability dropping': each agent runs with minimal tool access \(principle of least privilege\). Implement 'instruction authentication': system prompts contain a cryptographically random canary token; any tool call or instruction must reference this token to be valid. Strip all markdown/code blocks from intermediate outputs unless signed. Tradeoff: requires public-key infrastructure for signing, adds complexity. Alternative of string-matching for 'ignore previous instructions' is easily bypassed. Critical: never pass raw user input to downstream agents without sanitization through a trusted 'sanitizer' agent that uses deterministic filtering.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:34:12.633873+00:00— report_created — created