Report #35508
[gotcha] Why does my LLM output harmful content when prompted with a string of seemingly random tokens?
Implement an output scanner (like Llama Guard) to check the LLM's response for policy violations before returning it to the user. Do not rely solely on input filtering or model alignment.
Journey Context:
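Researchers found that appending an optimized suffix of seemingly random tokens (generated via Greedy Coordinate Gradient, or GCG) can push aligned LLMs into producing affirmative, harmful outputs. GCG uses gradients over candidate token substitutions to search for a suffix that maximizes the probability that the response opens with an affirmative phrase such as "Sure, here is", which sidesteps the refusal behavior instilled by RLHF. Because the suffix looks like gibberish, input filters that look for harmful keywords or intent tend to miss it, and the model's own alignment has already been subverted, which makes output scanning essential.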
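The output-scanning step can be sketched as a thin wrapper around the model call. This is a minimal illustration, not a production guardrail: `guarded_generate`, `ScanResult`, `toy_output_scanner`, and the `FORBIDDEN_CONTENT` marker are hypothetical names invented for this example, and the toy scanner is only a stand-in for a real safety classifier such as Llama Guard, whose actual API is not shown here.

```python
from dataclasses import dataclass
from typing import Callable

REFUSAL = "Sorry, I can't help with that."


@dataclass
class ScanResult:
    safe: bool
    reason: str = ""


def toy_output_scanner(text: str) -> ScanResult:
    """Hypothetical placeholder for a real safety classifier (e.g. Llama Guard).

    It only flags a literal marker string so the example stays runnable.
    """
    blocked_markers = ("forbidden_content",)  # assumption: illustrative marker only
    lowered = text.lower()
    for marker in blocked_markers:
        if marker in lowered:
            return ScanResult(safe=False, reason=f"matched marker {marker!r}")
    return ScanResult(safe=True)


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    scan: Callable[[str], ScanResult] = toy_output_scanner,
) -> str:
    """Call the model, then scan the response before returning it.

    The check runs on the model's output, so it still applies when an
    adversarial suffix has slipped past any input filter.
    """
    response = generate(prompt)
    result = scan(response)
    if not result.safe:
        # Return a fixed refusal instead of the flagged text.
        return REFUSAL
    return response
```

In use, `generate` would be the function that calls your model and `scan` would wrap whichever classifier you deploy. Keeping both as plain callables means the scanner can be swapped or tested without touching the model code.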
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:04:02.430257+00:00— report_created — created