Report #71101
[gotcha] Seemingly random, nonsensical strings appended to prompts bypass safety alignment
Implement input perplexity filters or use specialized adversarial suffix detection models. Standard keyword filters do not catch these attacks, because the suffixes contain no recognizable harmful terms.
Journey Context:
GCG (Greedy Coordinate Gradient) attacks search for specific token sequences that, when appended to a harmful request, cause the LLM to bypass its safety training. The search is guided by the model's own gradients, so the resulting suffix looks like gibberish to humans and to keyword filters while still steering the model's output. Defending requires either adversarial training (expensive) or heuristic filters that detect abnormally high-perplexity (low token probability) sequences that don't match natural language.
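A minimal sketch of the perplexity heuristic, under loud assumptions: a toy character-bigram model trained on a tiny hypothetical reference corpus stands in for a real language model, and the threshold of 120 and window of 20 characters are illustrative values, not tuned ones. The names `REFERENCE_TEXT`, `train_bigram_counts`, `perplexity`, `max_window_perplexity`, and `is_suspicious` are invented for this example. A real filter would score tokens with an actual LM and calibrate the threshold on benign traffic. Scoring sliding windows rather than the whole prompt keeps a short gibberish suffix from being averaged away by a long natural-language request.

```python
import math
from collections import Counter

# Hypothetical reference corpus (assumption): a stand-in for the statistics
# a real language model would provide.
REFERENCE_TEXT = (
    "please explain how the system works and describe the steps in detail. "
    "write a short summary of the report and list the main points. "
    "what is the best way to learn a new language and practice every day. "
    "tell me about the history of the city and the people who live there. "
)


def train_bigram_counts(text):
    """Count character bigrams and how often each character starts one."""
    pair_counts = Counter(zip(text, text[1:]))
    context_counts = Counter(text[:-1])
    return pair_counts, context_counts


def perplexity(text, pair_counts, context_counts, vocab_size=128):
    """Per-character perplexity under the bigram model, add-one smoothed.

    vocab_size=128 assumes roughly ASCII input (illustrative only).
    """
    text = text.lower()
    if len(text) < 2:
        return 1.0
    log_prob = 0.0
    for prev, cur in zip(text, text[1:]):
        p = (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(text) - 1))


def max_window_perplexity(text, pair_counts, context_counts, window=20):
    """Highest perplexity over all sliding windows of `window` characters."""
    if len(text) <= window:
        return perplexity(text, pair_counts, context_counts)
    return max(
        perplexity(text[i:i + window], pair_counts, context_counts)
        for i in range(len(text) - window + 1)
    )


def is_suspicious(prompt, pair_counts, context_counts, threshold=120.0, window=20):
    """Flag a prompt if any window looks far less natural than the reference."""
    return max_window_perplexity(prompt, pair_counts, context_counts, window) > threshold


pair_counts, context_counts = train_bigram_counts(REFERENCE_TEXT)
benign = "please explain how the system works"
suffix = "}{]]((!!--;;@@##$$%%^^&&**~~||<<>>"  # made-up gibberish, not a real GCG suffix

print(is_suspicious(benign, pair_counts, context_counts))
print(is_suspicious(benign + " " + suffix, pairs := pair_counts, context_counts))
```

A filter like this only flags unnatural-looking input. It should be treated as one layer among several and checked for false positives on legitimate inputs such as code, URLs, or non-English text.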
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:55:29.780484+00:00— report_created — created