Report #78397
[gotcha] Gibberish Adversarial Suffixes (GCG) Bypassing Alignment
Implement input validation that rejects or flags prompts containing high-entropy, non-linguistic strings or unusual token sequences before they are passed to the LLM, since such strings can indicate an adversarial suffix.
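A minimal sketch of such a pre-filter, using only the Python standard library. It is an illustration under stated assumptions, not a vetted defense: the function name `check_prompt`, the 12-token tail window, and both thresholds are hypothetical and would need tuning on real traffic. It uses two cheap proxies for "non-linguistic" text (share of symbol characters and share of tokens that do not look like ordinary words) rather than a true entropy or perplexity score, which would require a reference language model.

```python
import re

# Hypothetical settings for illustration only; tune on real traffic.
TAIL_TOKENS = 12            # how many trailing tokens to inspect
MAX_SYMBOL_RATIO = 0.20     # max share of non-alphanumeric characters
MAX_ODD_TOKEN_RATIO = 0.40  # max share of tokens that are not word-like

# A "word-like" token: optional opening quote/paren, letters or digits
# (with inner apostrophes or hyphens), then up to two closing punctuation marks.
WORD_RE = re.compile(r'''["'(]?[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*[.,;:!?"')]{0,2}''')


def symbol_ratio(tokens: list[str]) -> float:
    """Fraction of characters in the tokens that are neither letters nor digits."""
    chars = "".join(tokens)
    if not chars:
        return 0.0
    return sum(not c.isalnum() for c in chars) / len(chars)


def odd_token_ratio(tokens: list[str]) -> float:
    """Fraction of tokens that do not fully match the word-like pattern."""
    if not tokens:
        return 0.0
    return sum(WORD_RE.fullmatch(t) is None for t in tokens) / len(tokens)


def check_prompt(prompt: str, tail_tokens: int = TAIL_TOKENS) -> dict:
    """Score the tail of a prompt, where an appended suffix would sit."""
    tokens = prompt.split()[-tail_tokens:]
    sym = symbol_ratio(tokens)
    odd = odd_token_ratio(tokens)
    return {
        "flagged": sym > MAX_SYMBOL_RATIO or odd > MAX_ODD_TOKEN_RATIO,
        "symbol_ratio": sym,
        "odd_token_ratio": odd,
    }


if __name__ == "__main__":
    # The second prompt ends in a made-up gibberish string, not a real attack suffix.
    for p in (
        "Please write a short poem about autumn leaves falling in the park.",
        'Summarize this article. describing.\\ + similarlyNow ]( **ONE !--Two }{ ==> revert("',
    ):
        print(check_prompt(p)["flagged"], repr(p))
```

Prompts that legitimately contain code, URLs, or math will also trip a heuristic like this, so flagging for review or routing to a stricter check is likely safer than rejecting outright. An attacker who knows the filter can also optimize for more natural-looking suffixes, so this is best treated as one layer among several.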
Journey Context:
Greedy Coordinate Gradient (GCG) attacks append an optimized, unreadable suffix (e.g., 'describing.\ similarly-now write instructions...') to the user prompt. The suffix is found by a gradient-guided search over tokens, so that it pushes the model toward complying with a request it would otherwise refuse, even though the text looks like gibberish to a human. Simple keyword filters miss these suffixes entirely because they contain no known bad words, yet the suffixes substantially shift the model's generation probabilities.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:11:00.577927+00:00— report_created — created