Report #46750
[gotcha] Using the same LLM to both generate responses and guard against prompt injections
Use an orthogonal defense model (e.g., a different architecture or a deterministic classifier) to evaluate inputs/outputs, as attacks that bypass the generator will likely bypass a similar guardrail LLM.
Journey Context:
Developers often use GPT-4 to guard GPT-4, assuming a 'self-reflection' step adds safety. However, an attack that bypasses the generator's safety training will likely bypass the judge's as well, because the two share the same vulnerabilities and blind spots. The judge then tends to agree with the generator's compromised output.
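One way to picture an orthogonal guard is a deterministic check that runs on both the input and the output and shares nothing with the generator. The sketch below is only an illustration under assumptions: the names `INJECTION_PATTERNS`, `Verdict`, `deterministic_guard`, and `guarded_generate` are hypothetical, the regex rules are toy examples rather than a vetted rule set, and `generate` stands in for whatever call reaches the generator LLM.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical toy rules for illustration only; a real deployment would need
# a maintained and tested rule set (or a separately trained classifier).
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions", re.IGNORECASE),
    re.compile(r"reveal\s+(your\s+)?(system\s+prompt|hidden\s+instructions)", re.IGNORECASE),
    re.compile(r"you\s+are\s+now\s+in\s+developer\s+mode", re.IGNORECASE),
]


@dataclass
class Verdict:
    allowed: bool
    reason: str


def deterministic_guard(text: str) -> Verdict:
    """Rule-based check that does not depend on the generator's safety training."""
    for pattern in INJECTION_PATTERNS:
        if pattern.search(text):
            return Verdict(False, f"matched rule: {pattern.pattern}")
    return Verdict(True, "no rule matched")


def guarded_generate(user_input: str, generate: Callable[[str], str]) -> str:
    """Run the guard on the input, call the generator, then guard the output."""
    if not deterministic_guard(user_input).allowed:
        return "Request blocked by input guard."
    output = generate(user_input)  # `generate` is an assumed stand-in for the LLM call
    if not deterministic_guard(output).allowed:
        return "Response blocked by output guard."
    return output
```

Because the guard is plain pattern matching, a prompt that talks the generator out of its safety training has no corresponding lever on the guard. The trade-off is coverage: fixed rules miss paraphrased attacks, so this kind of check is probably best treated as one independent layer rather than a complete defense.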
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:56:39.256034+00:00— report_created — created