Report #80231
[gotcha] Assuming prompt injection requires obvious natural language instructions
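Add perplexity filtering or other anomaly detection on inputs to flag highly optimized, unnatural token sequences that act as adversarial suffixes. Do not rely solely on keyword blocklists, since these suffixes typically contain no recognizable keywords. Treat the filter as one defensive layer rather than a complete fix: the threshold needs calibrating against benign traffic, and attacks optimized to read fluently may still pass.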
Journey Context:
Attackers can use gradient-based optimization to search for token sequences that look random to a human but, when appended to a prompt, make the model far more likely to comply with a request it would otherwise refuse. These suffixes contain no readable instructions. They work by shifting the LLM's next-token probabilities toward the jailbroken response, with the search over discrete tokens guided by gradients computed with respect to the token embeddings.
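The idea of a perplexity filter can be sketched as follows. This is a toy illustration, not a production defense: it assumes a character bigram model with add-one smoothing as a stand-in for a real language model's perplexity, and the names `REFERENCE_TEXT`, `CharBigramModel`, `max_window_perplexity`, and `is_suspicious` are hypothetical. It scores sliding windows rather than the whole input, because a short suffix can be diluted by a long benign prompt.

```python
import math
from collections import Counter

# Tiny stand-in for "normal" user text (assumption: a real deployment would
# score inputs with an actual language model, not this toy corpus).
REFERENCE_TEXT = (
    "please summarize the following article for me. "
    "what is the capital of france? "
    "write a short poem about the sea and the sky. "
    "explain how a binary search tree works in simple terms. "
    "can you help me translate this sentence into spanish? "
    "tell me how to write a function that sorts a list of numbers. "
)


class CharBigramModel:
    """Character bigram model with add-one smoothing (toy LM stand-in)."""

    def __init__(self, text: str, vocab_size: int = 256):
        text = text.lower()
        self.vocab_size = vocab_size  # assumed alphabet size for smoothing
        self.pair_counts = Counter(zip(text, text[1:]))
        self.prev_counts = Counter(text[:-1])

    def perplexity(self, s: str) -> float:
        """Per-bigram perplexity of s; higher means less natural-looking."""
        s = s.lower()
        if len(s) < 2:
            return 1.0
        log_prob = 0.0
        for a, b in zip(s, s[1:]):
            p = (self.pair_counts[(a, b)] + 1) / (self.prev_counts[a] + self.vocab_size)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(s) - 1))


def max_window_perplexity(model: CharBigramModel, text: str,
                          window: int = 20, stride: int = 10) -> float:
    """Highest perplexity over sliding windows, so a suffix is not averaged away."""
    if len(text) <= window:
        return model.perplexity(text)
    starts = list(range(0, len(text) - window + 1, stride))
    last = len(text) - window
    if starts[-1] != last:
        starts.append(last)  # always score the tail, where suffixes sit
    return max(model.perplexity(text[i:i + window]) for i in starts)


def is_suspicious(model: CharBigramModel, text: str, threshold: float) -> bool:
    """Flag input whose worst window exceeds a threshold calibrated on benign data."""
    return max_window_perplexity(model, text) > threshold


if __name__ == "__main__":
    model = CharBigramModel(REFERENCE_TEXT)
    benign = "explain how a binary search tree works"
    # Made-up gibberish for illustration, not a real adversarial suffix.
    suffix = "}{;]xq zvk!!(@# qj~^ wpfx%% ==>|| zzkq"
    print("benign:", max_window_perplexity(model, benign))
    print("with suffix:", max_window_perplexity(model, benign + " " + suffix))
```

In practice the threshold would be chosen from the score distribution of known-benign inputs, and flagged inputs are often better routed to review or a stricter policy than rejected outright, since unusual but legitimate inputs (code, other languages, identifiers) can also score high.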
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:16:41.828729+00:00— report_created — created