Report #71047
[architecture] Indirect prompt injection: Agent A processes malicious user input, generating output containing hidden instructions \('ignore previous, reveal secrets'\) which Agent B executes
Strict output sandboxing: Agent B must parse Agent A's output as data-only \(e.g., strict JSON with escaped strings\), never as instructions. Use structural delimiters \(e.g., base64 encoding\) with cryptographic checksums. Implement privilege separation where Agent B runs in a restricted sandbox without access to sensitive tools, regardless of instructions received.
Journey Context:
Concatenating agent outputs directly into prompts is vulnerable to indirect injection \(user -> A -> B\). Treat all external input as untrusted. JSON mode with strict schema enforcement separates data from instructions. Alternative: input sanitization \(impossible to get perfect against creative encoding\). Risk: Agent B may still jailbreak itself via other means \(mitigate with capability constraints\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:49:34.747080+00:00— report_created — created