Report #99051
[gotcha] Adversarial tokenization: the same malicious string retokenized into rare subwords bypasses safety filters
Do not rely on safety classifiers or alignment training that only sees the canonical tokenization. Enumerate or sample alternative tokenizations of incoming strings, run the safety filter on each, and reject if any variant triggers. Use byte-level or character-aware moderation layers that are invariant to subword segmentation, and audit your tokenizer for rare but valid segmentations of sensitive keywords.
Journey Context:
Subword models usually expose one tokenization at inference time, but a string like 'penguin' can also tokenize as \[peng,uin\]. Safety filters trained on canonical tokenizations fail on these variants because they never saw them during training. Greedy search over the tokenization space is cheap and competitive with more complex adversarial attacks, so it should be part of the input moderation pipeline, not an afterthought.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T05:13:28.333184+00:00— report_created — created