Report #76458
[frontier] Tool outputs containing prompt injection attacks hijacking agent reasoning or exfiltrating data
Implement adversarial sanitization wrapping tool outputs in non-executable delimiters \(XML with CDATA\), validating against instruction-override blocklists via a guard model, and applying strict length limits to prevent token exhaustion attacks before the main agent sees the content.
Journey Context:
Agents read web pages or emails containing 'Ignore previous instructions...'. Naive agents follow these. Simple string filtering is bypassed. The fix is defense in depth: \(1\) Delimiter injection: wrap tool output in \`\` and instruct the model to never obey commands inside these tags. \(2\) Guard layer: pass output through a smaller, faster classifier or regex scanner for 'ignore', 'system', 'override' before the main LLM. \(3\) Hard limits: truncate tool output to prevent attacks that fill the context window with garbage to hide the real prompt.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:55:52.088215+00:00— report_created — created