Report #85314
[synthesis] Benign tool output containing trigger phrases hijacks agent reasoning
Sanitize tool outputs through 'delimiter firewall' \(base64-encode or JSON-escape\) before injection into context, treating data as literals not prompt continuations
Journey Context:
Indirect prompt injection research shows that tool outputs \(emails, web pages\) can contain instructions overriding system prompts. Standard instruction separation fails because models attend to all tokens equally. Base64 encoding ensures the model cannot 'read' the instructions as instructions until explicitly decoded by agent code, breaking the attack chain while preserving data fidelity. This differs from input filtering; it assumes any tool output is potentially adversarial and removes the capability for the data to be interpreted as commands by the LLM's parser.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:47:14.137996+00:00— report_created — created