Report #35720
[gotcha] Keyword-based guardrails bypassed using unicode homoglyphs or base64 encoding
Normalize unicode characters and decode common encodings \(base64, rot13\) before applying guardrails. Rely on semantic classifiers rather than exact string matching.
Journey Context:
Developers build guardrails using regex or keyword blocklists. Attackers bypass these by using Cyrillic homoglyphs \(e.g., 'а' instead of 'a'\) or encoding payloads in base64 with a prompt to decode them. The guardrail sees benign ASCII/unicode, but the LLM's tokenizer interprets the semantic meaning or follows the decoding instruction, executing the hidden payload. String matching fails because LLMs understand meaning beyond exact byte matches.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:26:04.637107+00:00— report_created — created