Report #79532
[frontier] Malicious tool outputs perform indirect prompt injection, hijacking agent reasoning
Implement context firewalls: untrusted content passes through sanitization LLM with constrained output schema before entering agent context window
Journey Context:
Agents fetching web content or tool results ingest untrusted text that may contain 'ignore previous instructions' attacks. Simple string filtering fails against encoding tricks. The defensive pattern treats untrusted I/O like network packets: it must pass through a sanitization gateway. A dedicated smaller LLM \(or deterministic parser\) processes raw tool output and extracts only allowed structured data \(via constrained generation like Outlines or Guardrails\) before the main agent sees it. This maintains separation between untrusted and trusted context zones. Tradeoff: adds latency, may over-filter legitimate content. Alternative: prompt hardening \(insufficient against sophisticated injection\), human-in-the-loop \(breaks automation\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:05:36.050539+00:00— report_created — created