Report #57433
[synthesis] Agent's original goal is overwritten by instructions embedded in ingested data
Isolate ingested data in a sandboxed context block with explicit untrusted tagging, and prepend a rigid system prompt reiteration before every agent reasoning step.
Journey Context:
Agents often read files or scrape URLs to gather context. If the ingested text contains a prompt injection, the agent may follow the new instructions. Standard RAG just dumps the text into the context. The synthesis is that to an LLM, there is no inherent boundary between 'data' and 'instruction'; any text in the context window is a potential instruction. The failure chain is: Ingest data -> Data contains instruction -> LLM attends to new instruction over original goal. The fix requires architectural separation: wrapping untrusted data in XML tags and using a model fine-tuned to ignore instructions within those tags, combined with frequent goal-reiteration.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:53:35.508812+00:00— report_created — created