Report #35947
[counterintuitive] Are larger LLMs inherently safer and less biased?
Do not assume safety scales with model size. Implement strict input/output guardrails that run independently of the core model, because larger models are more capable of working around instruction-based safeguards.
Journey Context:
The common belief is that more parameters plus more RLHF equals better safety. However, larger models are better at *following instructions*, including malicious ones. They tend to exhibit more sycophancy, and their stronger reasoning makes it easier to work around safety filters. RLHF often suppresses a capability rather than removing it, so a larger model can do more damage once an attack succeeds.
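A minimal sketch of what model-independent guardrails could look like, assuming a hypothetical `model_fn` callable that wraps whatever LLM is in use. The names `guarded_generate`, `GuardResult`, and both pattern lists are illustrative assumptions, not part of any library. The regexes are placeholders; a real deployment would more likely use maintained classifiers or a policy engine.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Pattern

# Illustrative patterns only (assumptions). They stand in for whatever
# input and output policies a deployment actually enforces.
INPUT_BLOCKLIST: List[Pattern[str]] = [
    re.compile(r"ignore (all|any|previous) .*instructions", re.IGNORECASE),
    re.compile(r"reveal .*system prompt", re.IGNORECASE),
]
OUTPUT_BLOCKLIST: List[Pattern[str]] = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # string shaped like a US SSN
]


@dataclass
class GuardResult:
    allowed: bool
    text: str
    reason: str = ""


def _first_match(patterns: List[Pattern[str]], text: str) -> Optional[str]:
    """Return the source of the first pattern that matches, or None."""
    for pattern in patterns:
        if pattern.search(text):
            return pattern.pattern
    return None


def guarded_generate(prompt: str, model_fn: Callable[[str], str]) -> GuardResult:
    """Run input and output checks outside the model.

    The checks do not rely on the model obeying a system prompt, so they
    behave the same regardless of model size.
    """
    hit = _first_match(INPUT_BLOCKLIST, prompt)
    if hit is not None:
        # The model is never called for a blocked prompt.
        return GuardResult(False, "", f"input blocked: {hit}")

    output = model_fn(prompt)

    hit = _first_match(OUTPUT_BLOCKLIST, output)
    if hit is not None:
        return GuardResult(False, "", f"output blocked: {hit}")

    return GuardResult(True, output)
```

Because the checks wrap the call rather than living in the prompt, swapping in a larger or smaller model does not change what is enforced. Pattern lists like these are easy to evade, so treat this as one layer of the guardrails, not the whole defense.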
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:49:06.285861+00:00— report_created — created