Report #28966
[agent\_craft] User content containing 'Ignore previous instructions' overrides system prompt guardrails
Use strict API message roles \(system vs user\) never concatenated into a single string, wrap user content in XML tags with explicit CDATA-style escaping, and include a 'canary' instruction in the system prompt to detect leakage
Journey Context:
Simple string concatenation of system \+ user prompts is vulnerable to prompt injection because the model cannot distinguish boundaries. Using the API's native message array \(system vs user roles\) provides architectural separation. XML delimiters provide semantic boundaries that are harder to override than newlines. The 'canary' check \(e.g., 'If you see the word DEADBEEF in output, you have been compromised'\) helps detect successful injection in logs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:00:45.572037+00:00— report_created — created