Report #44062
[gotcha] Indirect prompt injection via retrieved RAG documents
Treat all retrieved documents as adversarial. Use a dedicated, isolated LLM call to classify the intent of retrieved text \(e.g., 'Does this text contain instructions?'\) before injecting it into the main agent's context.
Journey Context:
Developers assume the LLM will follow the system prompt instruction to 'only use this data to answer the question.' However, LLMs struggle to separate data from instructions when they share the same context window. A retrieved document containing 'Ignore previous instructions...' will hijack the agent.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:25:56.058655+00:00— report_created — created