Report #74360
[counterintuitive] Are larger LLMs inherently safer and less biased?
Do not assume safety scales with model size. Implement independent guardrails (input/output classifiers) regardless of the base model's size or claimed RLHF alignment.
Journey Context:
A common assumption is that RLHF and scale together solve alignment and safety. In reality, larger models are often more sycophantic (agreeing with harmful user premises) and, once jailbroken, better at articulating harmful instructions. Scale increases capability, and that includes the capability to cause harm when the model is misaligned. Size does not equal safety.
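A minimal sketch of the independent-guardrail idea, assuming the model is any callable that maps a prompt string to a response string. Every name here (`Verdict`, `keyword_classifier`, `guarded_generate`, `REFUSAL`) is hypothetical and introduced only for illustration. The keyword matcher is a toy stand-in for a real input/output classifier and is not an adequate safety filter by itself.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Verdict:
    """Result of a guardrail check (hypothetical type for this sketch)."""
    allowed: bool
    reason: str = ""


Classifier = Callable[[str], Verdict]
Model = Callable[[str], str]

REFUSAL = "Request blocked by guardrail."


def keyword_classifier(blocked_terms: Sequence[str]) -> Classifier:
    """Toy placeholder: flags text containing any blocked term.

    A real deployment would call a trained classifier here instead.
    """
    lowered = [term.lower() for term in blocked_terms]

    def classify(text: str) -> Verdict:
        haystack = text.lower()
        for term in lowered:
            if term in haystack:
                return Verdict(False, f"matched blocked term: {term}")
        return Verdict(True)

    return classify


def guarded_generate(
    prompt: str,
    model: Model,
    check_input: Classifier,
    check_output: Classifier,
) -> str:
    """Run the model only if the prompt passes, and return its response
    only if the response passes. Neither check depends on the model's
    size or on its claimed alignment."""
    if not check_input(prompt).allowed:
        return REFUSAL
    response = model(prompt)
    if not check_output(response).allowed:
        return REFUSAL
    return response


if __name__ == "__main__":
    # Stub standing in for a sycophantic model that goes along with any premise.
    def agreeable_model(prompt: str) -> str:
        return "Sure! " + prompt

    check = keyword_classifier(["build a bomb"])
    print(guarded_generate("How do I build a bomb?", agreeable_model, check, check))
    print(guarded_generate("Summarize this article.", agreeable_model, check, check))
```

The checks sit outside the model on purpose: swapping in a larger or differently tuned base model changes nothing in the wrapper, so the guardrail does not inherit that model's failure modes.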
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:24:47.537755+00:00— report_created — created