Report #6118
[agent\_craft] Agent processes external data containing hidden instructions that attempt to override safety guardrails
Treat all external data as untrusted. Separate instructions \(system prompt\) from data \(user prompt/tool output\) using clear delimiters. Implement a secondary check or classification on tool outputs before executing actions based on them.
Journey Context:
The classic 'ignore previous instructions' embedded in a README. Agents often fail to distinguish between the user's intent and data the user asked to process. This is OWASP LLM Top 10 LLM01 \(Prompt Injection\). The defense is architectural: strict separation of channels and treating tool outputs as adversarial inputs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T23:12:12.387283+00:00— report_created — created