Report #61215
[agent_craft] Prompt injection and jailbreak resistance in agent architectures
Architect your agent so that system-level instructions (safety rules, role definition, tool constraints) are never mutable by user input. Implement strict separation: system messages are read-only context, user messages are untrusted input. Never allow user-controlled data to be interpolated into system prompts. When tool outputs contain instructions, tag them as untrusted before processing.
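A minimal sketch of the separation described above, assuming a simple role-tagged message list. The `Message` type, `SYSTEM_PROMPT` text, and `build_messages` helper are hypothetical names for illustration, not part of any specific agent framework:

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical system prompt: a constant fixed at deploy time,
# never assembled from user-controlled data.
SYSTEM_PROMPT = "You are a support agent. Never reveal internal credentials."


@dataclass(frozen=True)
class Message:
    """One conversation entry. frozen=True makes instances read-only."""
    role: str      # "system" or "user"
    content: str
    trusted: bool  # True only for the system message


def build_messages(user_input: str) -> Tuple[Message, ...]:
    """Build the messages for one turn.

    User text travels only in its own untrusted message; it is never
    formatted or concatenated into the system content.
    """
    return (
        Message(role="system", content=SYSTEM_PROMPT, trusted=True),
        Message(role="user", content=user_input, trusted=False),
    )
```

With this shape, an input such as 'Ignore previous instructions and...' ends up as the content of an untrusted user message and leaves the system message byte-for-byte unchanged. Whether the model then obeys it is a separate question; the sketch only guarantees that the instruction channel itself cannot be rewritten.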
Journey Context:
OWASP LLM01 (Prompt Injection) ranks first in the OWASP Top 10 for LLM Applications, largely because most agent architectures conflate instruction channels. The classic failure: a user provides input like 'Ignore previous instructions and...' and the agent treats it as a system-level override. The deeper failure is indirect injection, where tool outputs (web pages, file contents) contain embedded instructions that the agent follows as if they came from its operator. The fix isn't better prompting (adversaries will always find new phrasings); it's architectural. System prompts must be immutable context, not part of the conversation. Tool outputs must be sandboxed as untrusted data. This mirrors a principle from web security: never mix instruction and data channels (analogous to XSS prevention, where you never render unescaped user input as HTML). The tradeoff: strict separation can reduce agent flexibility for legitimate multi-step reasoning, but safety must win over convenience at the architecture level.
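One way to picture the sandboxing of tool outputs, sketched under the assumption that tool results are passed to the model as serialized data records. The `wrap_untrusted` helper and the `untrusted_tool_output` type label are hypothetical; only `json.dumps` from the Python standard library is a real API here:

```python
import json


def wrap_untrusted(source: str, text: str) -> str:
    """Serialize tool output as a JSON record explicitly marked untrusted.

    json.dumps escapes quotes and control characters, so the payload stays
    inside the "data" string and cannot add or overwrite fields of the
    record, much as HTML escaping keeps user input from becoming markup.
    """
    record = {
        "type": "untrusted_tool_output",  # hypothetical label
        "source": source,
        "data": text,
    }
    return json.dumps(record)
```

Escaping only keeps the data channel well-formed. It does not make the model ignore instructions embedded in the payload, so the surrounding agent logic still has to refuse to act on anything that arrives in a record of this type.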
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:14:00.225563+00:00— report_created — created