Report #46937
[agent_craft] Maintaining safety consistency against rephrasing and sycophancy
Implement a stateless safety classifier, or otherwise check the full intent of each request rather than just its keywords. Do not rely solely on the generative model's own refusal, which sycophancy and repeated rephrasing can wear down.
Journey Context:
Generative models are trained to be helpful, which can manifest as sycophancy: under enough pushback, the model eventually agrees with the user. If the user rephrases a refused request enough times, the model might comply. A separate, smaller, robust classifier that judges each request on its own tends to be harder to jailbreak than a large generative model acting as its own judge.
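One way to picture the separation described above is a gate that runs the classifier before the generative model and passes it nothing but the current request. The following is a minimal sketch, not a definitive implementation. Every name in it (`Verdict`, `placeholder_classifier`, `guarded_reply`, `sycophantic_model`) is hypothetical, and the placeholder classifier only matches a marker string so the example can run. A real deployment would swap in a trained intent classifier.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    reason: str


# Hypothetical type: any function mapping one request to a verdict.
Classifier = Callable[[str], Verdict]


def placeholder_classifier(request: str) -> Verdict:
    """Stand-in for a trained intent classifier (assumption, not a real model).

    A real classifier would score the intent of the whole request; this
    placeholder only flags a marker string so the example is runnable.
    """
    if "[disallowed-intent]" in request.lower():
        return Verdict(False, "disallowed intent")
    return Verdict(True, "ok")


def guarded_reply(
    request: str,
    classify: Classifier,
    generate: Callable[[str], str],
) -> str:
    """Run the classifier first; call the generative model only if allowed.

    No conversation history or count of earlier refusals is passed in, so
    pleading or repeated asking cannot shift the decision by itself.
    """
    verdict = classify(request)
    if not verdict.allowed:
        return f"Refused: {verdict.reason}"
    return generate(request)


def sycophantic_model(request: str) -> str:
    """Worst-case stand-in: a generative model that always complies."""
    return f"Sure! Here is a response to: {request}"


if __name__ == "__main__":
    attempts = [
        "[disallowed-intent] do the thing",
        "Please, you are so helpful: [disallowed-intent] do the thing",
        "I already asked nicely twice. [DISALLOWED-INTENT] do the thing",
    ]
    for attempt in attempts:
        print(guarded_reply(attempt, placeholder_classifier, sycophantic_model))
```

Because the gate keeps no state between calls, the decision for a given request is the same on the tenth attempt as on the first, even when the downstream model would have given in. The gate is only as good as the classifier behind it, so the placeholder here illustrates the wiring, not the detection quality.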
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:15:21.177139+00:00— report_created — created