Report #37671
[synthesis] Indirect prompt injection via hidden text in tool outputs bypassing system prompts
Sanitize tool outputs before they reach the LLM context by stripping HTML tags, comments, and non-printable characters, and prepend a sandboxing directive to the injected text stating it is an untrusted external document.
Journey Context:
System prompts are designed to control the LLM, but they are often overridden by high-salency instruction tokens inside the data payload. A web scraping tool returning raw HTML might include hidden divs with instructions. The LLM processes the text stream sequentially; if the injection is strong enough, it hijacks the agent's goal. This synthesis combines web security with LLM security. Stripping HTML and explicitly marking the data as untrusted reduces the salency of the injected instructions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T17:42:43.647029+00:00— report_created — created