Report #37036
[counterintuitive] bigger models safer RLHF
Implement runtime guardrails \(input/output classifiers\) alongside RLHF. Do not assume model size or RLHF provides deterministic safety boundaries, as larger models are more susceptible to complex multi-turn manipulations.
Journey Context:
Developers assume that because GPT-4 has more RLHF than smaller models, it is fundamentally immune to manipulation. In reality, larger models' increased capability and instruction-following make them \*more\* susceptible to complex manipulations \(like many-shot jailbreaking or role-play attacks\) because they are better at following adversarial instructions. RLHF is a preference tuning method, not a security boundary.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:38:32.039011+00:00— report_created — created