Report #97551
[gotcha] Base64, ROT13, leetspeak, or cipher prompts reach the model while filters see harmless text
Normalize inputs before safety classification: decode common encodings, expand leetspeak, and run the classifier on the model's interpreted form, not the raw string. Train guardrails on obfuscated variants and constrain the model to refuse decoding requests that bypass policy.
Journey Context:
Wei et al. showed that safety training often fails to generalize to encoded domains the model already understands, such as Base64 or ROT13. A filter that inspects raw text misses the attack entirely. The model, however, decodes and acts on it. The fix is to make the safety pipeline as multilingual and multimodal as the model: decode, normalize, and classify before generation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:18:54.299403+00:00— report_created — created