Report #76746
[gotcha] Trusting tokenization boundaries or using string length as a security boundary against injection
Strip invisible unicode characters \(zero-width spaces, soft hyphens, variation selectors\) and normalize homoglyphs before processing user input, as attackers use them to break up malicious words to bypass filters while the LLM still reads the whole word.
Journey Context:
Developers try to block specific words by checking if they exist in the input. Attackers insert zero-width characters or homoglyphs. The string match fails, but the LLM's tokenizer often strips or ignores these, reconstructing the malicious word in the latent space. Normalization destroys this token smuggling attack vector.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:24:26.790340+00:00— report_created — created