Report #38788
[gotcha] Using a second LLM as a guardrail solves my prompt injection problem
Never use an LLM as the sole security boundary against prompt injection. The guard is itself an LLM, so it is susceptible to the same class of attacks as the model it is supposed to protect. Make critical security decisions with deterministic, programmatic checks: regex filters, allowlists, output format validation, and parameter schema enforcement. An LLM-based guard can supplement these checks as a noisy signal, but it must never be the primary defense, and it must never process the same untrusted input as the primary LLM.
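As a rough illustration of what "deterministic, programmatic checks" can look like, the sketch below validates a tool call proposed by an LLM before anything executes. The tool names, the parameter rules, the example.com domain, and the validate_tool_call helper are all hypothetical, chosen only for this example; a real system would substitute its own allowlist and schema.

```python
import re

# Hypothetical rules for illustration only: real tools and formats will differ.
INTERNAL_EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@example\.com")
ORDER_ID_RE = re.compile(r"ORD-[0-9]{6}")


def _is_internal_email(value):
    return isinstance(value, str) and INTERNAL_EMAIL_RE.fullmatch(value) is not None


def _is_short_text(value):
    return isinstance(value, str) and 0 < len(value) <= 200


def _is_order_id(value):
    return isinstance(value, str) and ORDER_ID_RE.fullmatch(value) is not None


# Allowlist: tool name -> exact set of parameters, each with a validator.
ALLOWED_TOOLS = {
    "send_email": {"to": _is_internal_email, "subject": _is_short_text},
    "lookup_order": {"order_id": _is_order_id},
}


def validate_tool_call(call):
    """Return (ok, reason) for a tool call parsed from model output.

    No LLM is involved: the result depends only on the allowlist above.
    """
    if not isinstance(call, dict):
        return False, "call is not an object"
    name = call.get("tool")
    if not isinstance(name, str) or name not in ALLOWED_TOOLS:
        return False, "tool not on allowlist"
    schema = ALLOWED_TOOLS[name]
    args = call.get("args")
    if not isinstance(args, dict):
        return False, "args is not an object"
    # Require exactly the expected parameters: nothing missing, nothing extra.
    if set(args) != set(schema):
        return False, "unexpected or missing parameters"
    for key, check in schema.items():
        if not check(args[key]):
            return False, f"invalid value for {key!r}"
    return True, "ok"
```

However the model was manipulated into proposing a call, the same input always gets the same verdict: an off-allowlist tool, an extra parameter, or an external recipient is rejected by code that cannot be talked out of it.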
Journey Context:
The dual-LLM guard pattern seems elegant: one LLM does the work, another checks for safety violations. In practice it creates a false sense of security while roughly doubling inference cost. If the guard LLM processes any user-influenced or externally sourced content, it can be injected too, and an attacker can cause it to approve the very attack it was meant to catch. You have not added a security layer; you have added a second vulnerable component. Simon Willison, who has written extensively about prompt injection, has pointed to the fundamental limitation behind this: LLMs cannot reliably distinguish instructions from data because they are instruction-following machines. The only reliable defense is keeping untrusted content out of the LLM context for security-critical decisions, which means using non-LLM code for those checks.
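One way to picture "non-LLM code for those checks" is a decision function in which the guard's verdict can only make the outcome stricter, never more permissive. This is a minimal sketch under stated assumptions: authorize, the action names, and the three outcome strings are hypothetical, and user_permissions is assumed to come from a trusted source such as the authenticated session, not from model output.

```python
def authorize(action, user_permissions, guard_flagged):
    """Decide whether an already-validated action may run.

    action:           name of the requested action (string)
    user_permissions: set of actions the authenticated user may perform,
                      loaded from a trusted source (assumption)
    guard_flagged:    optional verdict from an LLM guard, treated as a
                      noisy signal that can veto but never grant
    """
    # The deterministic permission check is the security boundary.
    if action not in user_permissions:
        return "deny"
    # A guard flag can only escalate to human review.
    if guard_flagged:
        return "review"
    return "allow"
```

With this shape, a guard that has been injected into always answering "safe" gains the attacker nothing: authorize("issue_refund", {"lookup_order"}, False) still returns "deny", because the permission check never consults the guard.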
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:35:00.260294+00:00— report_created — created