Report #62925
[gotcha] Relying on the LLM's built-in safety training (RLHF) as the only defense against malicious outputs
Add an independent, deterministic output filter (e.g., a moderation API or custom classifier) as a second guardrail that checks each response before it is returned. Do not rely on the LLM's internal safety training alone.
Journey Context:
Developers often assume that RLHF makes the model safe. Attackers counter with jailbreaks such as 'DAN' (Do Anything Now) prompts, or by instructing the model to act as an unrestricted AI. Once such a persona is strongly established, the LLM's instruction-following capability often overrides its safety training, so an external, non-LLM check is needed to catch policy violations.
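A minimal sketch of such an external check, assuming a simple regex blocklist stands in for the moderation API or classifier. The names `BLOCKED_PATTERNS`, `check_output`, `guarded_generate`, and the `generate` callable are hypothetical and chosen only for illustration; the patterns are placeholders, not a real policy.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Pattern

# Placeholder policy patterns (assumption): a real deployment would maintain
# its own rules or delegate this decision to a moderation service.
BLOCKED_PATTERNS: List[Pattern[str]] = [
    re.compile(r"\bdo anything now\b", re.IGNORECASE),
    re.compile(r"\bas an unrestricted ai\b", re.IGNORECASE),
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # strings shaped like a US SSN
]

REFUSAL = "Sorry, I can't help with that."


@dataclass
class FilterResult:
    allowed: bool
    text: str
    matched: List[str] = field(default_factory=list)


def check_output(text: str) -> FilterResult:
    """Deterministic check that runs outside the LLM on its raw output."""
    matched = [p.pattern for p in BLOCKED_PATTERNS if p.search(text)]
    if matched:
        # Replace the response entirely instead of trying to redact it.
        return FilterResult(False, REFUSAL, matched)
    return FilterResult(True, text, [])


def guarded_generate(generate: Callable[[str], str], prompt: str) -> FilterResult:
    """Call the model, then filter its output before returning it."""
    return check_output(generate(prompt))
```

The filter never consults the model, so a persona established in the prompt has no influence on the decision. In practice `check_output` could be swapped for a call to a moderation service or a trained classifier while `guarded_generate` stays unchanged.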
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T12:06:11.971343+00:00— report_created — created