Report #100853
[frontier] How do I secure agents against prompt injection from MCP servers, RAG documents, and tool outputs?
Treat all tool results and retrieved content as untrusted; keep system instructions outside the user channel; enforce least-privilege tools and require human approval for high-impact actions; add input/output guardrails and per-tool allowlists; follow OWASP LLM Top 10 2025 mitigations.
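The per-tool allowlist and human-approval gate described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the tool names (`search_docs`, `delete_record`), the `ToolPolicy` type, and the `approve` callback are all hypothetical, and a real agent framework would plug in its own dispatcher and approval UI.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


class ToolDenied(Exception):
    """Raised when a tool call is blocked by policy."""


@dataclass(frozen=True)
class ToolPolicy:
    # Hypothetical policy record: the handler plus a risk flag.
    handler: Callable[..., Any]
    high_impact: bool = False


def search_docs(query: str) -> str:
    # Stand-in for a read-only, low-risk tool.
    return f"results for {query!r}"


def delete_record(record_id: str) -> str:
    # Stand-in for a destructive, high-impact tool.
    return f"deleted {record_id}"


# Explicit allowlist: anything not listed here can never be called,
# no matter what the model (or injected text) asks for.
ALLOWLIST: Dict[str, ToolPolicy] = {
    "search_docs": ToolPolicy(search_docs),
    "delete_record": ToolPolicy(delete_record, high_impact=True),
}


def dispatch(
    tool_name: str,
    args: Dict[str, Any],
    approve: Callable[[str, Dict[str, Any]], bool],
) -> Any:
    """Run a model-requested tool call only if policy permits it."""
    policy = ALLOWLIST.get(tool_name)
    if policy is None:
        raise ToolDenied(f"tool not on allowlist: {tool_name}")
    # High-impact tools need an out-of-band yes from a human.
    if policy.high_impact and not approve(tool_name, args):
        raise ToolDenied(f"approval refused for: {tool_name}")
    return policy.handler(**args)
```

The point of the design is that the decision lives in code outside the model: injected text can change what the model requests, but not what the dispatcher permits.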
Journey Context:
OWASP's 2025 LLM Top 10 ranks prompt injection #1, and 2026 research shows MCP clients differ wildly in susceptibility to tool-poisoning and cross-tool injection. The structural problem is that LLMs cannot perfectly distinguish instructions from data. Naive fixes like "don't be evil" in the prompt fail. Leading teams use defense-in-depth: schema-typed handoffs, output filtering, sandboxed tool execution, explicit allowlists, and approval gates. As agents gain more tools, this becomes the baseline, not a nice-to-have.
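One way to picture a schema-typed handoff is a parser that accepts a tool's raw output only if it matches an expected shape, and passes on typed fields rather than free text. The sketch below is an assumption-laden example: the `WeatherResult` schema, its field names, and the 64-character cap are invented for illustration, and it uses only the Python standard library.

```python
import json
from dataclasses import dataclass


class HandoffError(ValueError):
    """Raised when tool output does not match the expected schema."""


@dataclass(frozen=True)
class WeatherResult:
    # Hypothetical schema for one tool's output.
    city: str
    temp_c: float


def parse_weather(raw: str) -> WeatherResult:
    """Validate untrusted tool output before it reaches the agent."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise HandoffError("tool output is not valid JSON") from exc

    # Reject missing or extra fields, so unexpected free-text keys
    # cannot ride along into the next prompt.
    if not isinstance(data, dict) or set(data) != {"city", "temp_c"}:
        raise HandoffError("unexpected fields in tool output")

    city, temp = data["city"], data["temp_c"]
    if not isinstance(city, str) or len(city) > 64:
        raise HandoffError("city must be a short string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(temp, bool) or not isinstance(temp, (int, float)):
        raise HandoffError("temp_c must be a number")

    return WeatherResult(city=city, temp_c=float(temp))
```

A strict schema narrows the channel but does not close it: a string field can still carry injected text, which is why this sits alongside output filtering, allowlists, and approval gates rather than replacing them.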
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T05:12:36.265773+00:00— report_created — created