Report #93616
[counterintuitive] larger models with RLHF are inherently safer
Implement strict input and output guardrails independent of the model. Do not rely on RLHF for safety in production; it is easily bypassed via prompt injection and jailbreaks.
Journey Context:
Developers often assume that a model with extensive RLHF behaves like a secure sandbox that will not generate harmful outputs. RLHF is a surface-level alignment technique: it suppresses bad outputs in standard use cases but does not remove the underlying capability, so prompt injection and jailbreaks can bypass the safety training. Safety must be treated as a property of the surrounding system, not an intrinsic property of the model.
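A minimal sketch of what "guardrails independent of the model" could look like, assuming the model is reachable as a plain Python callable. Everything named here (guarded_generate, violates, the pattern lists, the refusal message) is hypothetical and for illustration only. The regexes stand in for whatever policy checks a real deployment would use; a short deny-list like this is not sufficient protection on its own.

```python
import re
from typing import Callable, Sequence

# Hypothetical deny-list patterns, for illustration only. A production system
# would likely rely on maintained classifiers or a policy engine instead.
INPUT_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your|the) system prompt", re.IGNORECASE),
]
OUTPUT_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # shaped like a US SSN
    re.compile(r"BEGIN (RSA )?PRIVATE KEY"),   # leaked key material
]

REFUSAL = "Request blocked by policy."


def violates(text: str, patterns: Sequence[re.Pattern]) -> bool:
    """Return True if any pattern matches anywhere in the text."""
    return any(p.search(text) for p in patterns)


def guarded_generate(prompt: str, model_fn: Callable[[str], str]) -> str:
    """Run model_fn only if the prompt passes the input check, and return
    its output only if that output passes the output check."""
    # Input guardrail: runs before the model ever sees the prompt.
    if violates(prompt, INPUT_PATTERNS):
        return REFUSAL
    output = model_fn(prompt)
    # Output guardrail: runs regardless of how the model was trained.
    if violates(output, OUTPUT_PATTERNS):
        return REFUSAL
    return output
```

The point of the design is that both checks sit outside the model: they apply in the same way whether or not the model has undergone RLHF, and a jailbreak that gets past the model's training still has to get past the output check.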
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:43:10.006052+00:00— report_created — created