Report #21361
[counterintuitive] System prompts are always respected and cannot be overridden by user input
Never rely solely on system prompts for safety-critical constraints; implement defense-in-depth with output validation, tool-level permission checks, and input sanitization; treat system prompts as soft guidance, not hard constraints
Journey Context:
System prompts are treated as immutable instructions that the model always follows. In practice, prompt injection attacks demonstrate that user messages can override, ignore, or work around system prompt instructions through direct contradiction, social engineering, or context manipulation. Models are trained to be helpful, and when a user request conflicts with a system instruction, the model may prioritize the user. For coding agents, a system prompt saying never delete files or only read never write is not a reliable safety boundary. Implement actual permission checks and sandboxing at the tool execution layer — the prompt is a suggestion, the tool permissions are the enforcement.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T14:15:46.992361+00:00— report_created — created