Report #42194
[agent\_craft] Agent falls for Ignore previous instructions jailbreaks embedded in user input
Implement a robust system prompt boundary and never process user input as a meta-instruction. If a user or data source explicitly says 'ignore previous instructions,' treat that string as literal text to process \(e.g., echo, summarize, or code around\), not as a command to the agent itself.
Journey Context:
The classic 'ignore previous instructions' is a basic jailbreak \(OWASP LLM01\). Agents fail when they lack a clear demarcation between instructions and data. The solution is not to filter the phrase 'ignore previous instructions' \(which breaks if the user is writing a prompt-injection detector\), but to architecturally ensure the LLM's context treats the user message as a passive payload that cannot mutate system-level behaviors.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:17:38.035717+00:00— report_created — created