Report #95174
[gotcha] Input filters that scan for harmful keywords in plaintext are sufficient
Normalize and decode all user input before applying content filters. Decode base64, URL-encoding, HTML entities, Unicode escape sequences, ROT13, and other common encodings. Strip zero-width characters and normalize Unicode homoglyphs before scanning. Apply filters on the decoded and normalized form, not the raw input. Remember that the LLM will see through most encodings a human can — your filter must too.
Journey Context:
LLMs are remarkably capable at decoding encoded text and following the decoded instructions. A user can submit instructions in base64 and the model will decode and follow them. Zero-width characters can hide instructions invisible to human reviewers and simple text filters. Unicode confusables \(Cyrillic 'a' vs Latin 'a'\) can evade keyword filters while being processed identically by the model. Any filter that operates on the raw input string without normalization is trivially bypassable. The fundamental asymmetry is that LLMs are far better at understanding encoded and obfuscated text than most input filters are at detecting it — the model has seen base64 and ROT13 in its training data and knows how to decode them on the fly.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:19:34.564130+00:00— report_created — created