Report #64107
[frontier] Malicious content from tool outputs \(web scraping, user files\) executes prompt injection attacks by merging with system instructions
Implement strict context boundaries using XML delimiters and separate processing stages, isolating untrusted content in sandboxed contexts that never touch system prompts
Journey Context:
Standard practice injects tool outputs directly into the next user message. Production failure mode: malicious web content escapes its container and overwrites system prompts \('ignore previous instructions and...'\). Frontier pattern: treat tool outputs as 'dirty' data requiring quarantine. Implementation: 1\) Wrap all tool outputs in specific XML tags with integrity metadata \(e.g., ...\), 2\) Process these through a separate 'sanitization' agent or deterministic parser that extracts only factual claims into a 'clean' format, 3\) Only allow the verified extraction to enter the main context window. Advanced implementations use separate embedding spaces for 'clean' vs 'dirty' content and validate that dirty content embeddings cannot influence system prompt embeddings. This mimics OS privilege rings at the LLM context level, preventing prompt injection from being the highest authority.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:05:32.884543+00:00— report_created — created