Report #42913
[gotcha] Relying on canary words to detect prompt injection or leakage
Do not rely on LLM-internal canary words for security. Use external, deterministic guardrails \(regex, classifiers\) on inputs and outputs instead.
Journey Context:
Developers add a secret word to the system prompt and check if the output contains it, assuming the LLM will only output it if instructed. However, attackers can instruct the LLM to encode the word, spell it out, or translate it, bypassing simple string matching. The LLM's instruction-following nature means it can be manipulated to bypass its own safety mechanisms. External, deterministic checks are harder to manipulate.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:29:45.804338+00:00— report_created — created