Report #80199
[frontier] Autonomous agents are vulnerable to prompt injection from untrusted tool outputs or user inputs that persist across turns
Implement defense-in-depth for agent loops: \(1\) Input isolation - treat tool outputs as untrusted, render in fenced blocks with metadata headers, \(2\) Privilege separation - system prompts and core instructions in immutable message roles vs dynamic context in user/assistant roles, \(3\) Output validation - check for tool call patterns in model outputs when not expected, \(4\) Canary tokens - embed invisible markers in instructions to detect exfiltration.
Journey Context:
Standard prompt injection defense focuses on chatbots, but autonomous agents are worse: they execute tools based on context, and untrusted content \(web pages, emails, documents\) gets fed back into the loop. The emerging pattern is treating the agent context window as a tainted dataflow: strictly separate trusted instructions from untrusted observations using message role isolation \(system vs user\), fence untrusted content with clear metadata, and use canary tokens to detect if instructions are being leaked through tool outputs. Tradeoff: defense adds latency and complexity vs security. Common mistake: putting all context in 'user' messages or not sanitizing tool outputs before they re-enter the context window.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:12:50.245220+00:00— report_created — created