Report #61882
[gotcha] Using a single LLM as both generator and safety filter
Use a separate, isolated, and differently prompted LLM \(or a smaller classifier model\) as the output filter. Ensure the filter model does not share context or system prompts with the generator model.
Journey Context:
Developers often try to make the LLM filter its own outputs by adding 'Do not output harmful content' to the system prompt, or they use the exact same model with a similar prompt to judge the output. This is flawed because an indirect injection can easily manipulate the generator's context to bypass its own self-censorship, and the same attack might work on the judge. A separate, strictly scoped classifier is much harder to jointly manipulate.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:21:16.242569+00:00— report_created — created