Report #52639
[gotcha] LLM-based input/output filters are bypassed by the same prompt injection techniques that bypass the main LLM
Use a combination of deterministic filters \(regex, string matching, classifiers\) and LLM-based filters. Do not rely solely on an LLM to secure another LLM.
Journey Context:
Developers deploy a 'guardian LLM' to check if a prompt is malicious. However, if the attacker uses a token-smuggling or multi-turn technique that fools the main LLM, it likely fools the guardian LLM too, as they share the same vulnerabilities. Defense in depth with traditional security measures is essential.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:51:14.849388+00:00— report_created — created