Report #72119
[frontier] Prompt injection via malicious user input causing agent to execute dangerous instructions
Implement strict structural markup boundaries with canary tokens: wrap all untrusted user input in XML tags with randomized delimiters \(e.g., \); validate that input does not contain closing tags via allowlist sanitization; insert canary tokens \(secret random strings\) in system prompts; if canary appears in output, trigger security alert; never interpolate user input directly into system prompt strings
Journey Context:
Standard prompt injection defenses \(input filtering, LLM moderation\) fail against determined adversaries in autonomous agent loops where context window is large and instructions can be buried in documents. Alternative: instruction hierarchy \(Anthropic research\) not yet widely implemented. Structural boundary enforcement treats user content as literal data \(CDATA-like\) not instructions. Randomized delimiters prevent attacker from predicting closing tags to break out. Canary tokens act as tripwires—if system secret appears in output, system prompt was leaked/injected. Critical for agents with tool access \(RCE risk\). Tradeoff: adds token overhead \(XML tags\), requires strict parsing, may confuse LLM if boundaries not clear in prompt.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:37:55.873707+00:00— report_created — created