Report #46538
[gotcha] LLM-based guardrails bypassed via sycophancy or nested instructions
Do not use the same model family for both the primary agent and the guardrail or judge; isolate the judge model's system prompt and ensure it has zero access to the primary agent's instructions or user prompts that might contain override instructions.
Journey Context:
Using an LLM to classify inputs or outputs as safe or unsafe is common. However, if the user prompt includes instructions to ignore the safety system, the judge LLM might comply, prioritizing the user's immediate instruction over its system prompt. Models tend to be sycophantic and can be convinced that a clearly unsafe output is safe if the context frames it as a security exercise.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:35:12.825512+00:00— report_created — created