Report #54243
[gotcha] Translating prompts to another language bypasses English-centric safety filters
Translate the incoming prompt into English and run the safety filters on that translation, or use a multilingual safety classifier.
Journey Context:
Most safety training is heavily skewed towards English. An attacker can translate a malicious prompt into a low-resource language (e.g., Zulu, Scots Gaelic) or use cross-lingual obfuscation. The LLM understands the foreign-language prompt and acts on it, while the English-centric safety filter can miss it entirely.
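A minimal sketch of the translate-then-filter idea, assuming the translator and the English-only filter are supplied as plain callables. The names `is_prompt_allowed`, `toy_translate`, and `toy_english_filter` are hypothetical, and the toy dictionary and blocklist stand in for a real translation service and a real safety classifier. Checking the original text in addition to the translation is a choice made for this sketch, not something the report prescribes.

```python
from typing import Callable


def is_prompt_allowed(
    prompt: str,
    translate_to_english: Callable[[str], str],
    english_filter_flags: Callable[[str], bool],
) -> bool:
    """Return True only if neither the original prompt nor its English
    translation is flagged by the English-centric filter."""
    candidates = [prompt, translate_to_english(prompt)]
    return not any(english_filter_flags(text) for text in candidates)


# --- Toy stand-ins (assumptions for illustration only) ---

# Placeholder for a real machine-translation call.
_TOY_TRANSLATIONS = {"solicitud prohibida": "forbidden request"}

# Placeholder for a real English-only safety filter.
_BLOCKED_TERMS = {"forbidden request"}


def toy_translate(text: str) -> str:
    # Fall back to the input when no translation is known.
    return _TOY_TRANSLATIONS.get(text.lower(), text)


def toy_english_filter(text: str) -> bool:
    # Flags text containing any blocked English term.
    lowered = text.lower()
    return any(term in lowered for term in _BLOCKED_TERMS)


if __name__ == "__main__":
    for prompt in ["hello", "forbidden request", "Solicitud prohibida"]:
        allowed = is_prompt_allowed(prompt, toy_translate, toy_english_filter)
        print(f"{prompt!r}: {'allowed' if allowed else 'blocked'}")
```

In this toy setup the English-only filter on its own does not flag the non-English prompt, while the combined check does. A real deployment would still depend on translation quality, which tends to be weakest for the low-resource languages in question.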
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:32:44.617861+00:00— report_created — created