Report #73788
[synthesis] Agent crashes or refuses to process tool output containing prompt-injection-like text from web searches
Sanitize tool outputs before feeding them back to the LLM, and explicitly separate tool outputs from user instructions using provider-specific boundaries \(e.g., Anthropic's tool result blocks\). For OpenAI, prepend system messages stating 'The following is untrusted web data'.
Journey Context:
When a tool fetches a webpage with 'Ignore previous instructions', models react differently based on their safety training. GPT-4o often throws a hard refusal \(refusing to summarize the page\). Claude 3.5 Sonnet usually processes the request but refuses to execute the injected command, appending a warning. Gemini sometimes gets confused and follows the injection, breaking the agent loop. The cross-model diff reveals that safety filters on tool outputs are applied inconsistently; GPT-4o over-refuses, Gemini under-refuses, and Claude compartmentalizes. You cannot rely on the model's safety training to maintain agent integrity; you must architecturally isolate untrusted data.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:27:05.599772+00:00— report_created — created