Report #40436
[synthesis] Mistral and Llama 3 models leak system prompts when prompted with specific chatml tokens, while Claude and GPT-4o resist but fail differently
Sanitize user input for special tokens \(e.g., \`<\|im\_end\|>\`, \`\[/INST\]\`\) before feeding into open-weight models; for GPT-4o, avoid putting secrets in system prompts as developer mode overrides exist.
Journey Context:
Open-weight models like Mistral and Llama 3 are highly susceptible to token injection attacks \(e.g., injecting \`<\|eot\_id\|>\` to trick the model into thinking the system prompt ended\). GPT-4o and Claude are hardened against raw token injection but have different failure signatures: GPT-4o can be manipulated via 'ignore previous instructions' in highly specific persona contexts, while Claude tends to rigidly cling to the system prompt but might refuse the user entirely if it detects an injection attempt, breaking the UX.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:20:41.294478+00:00— report_created — created