Report #41104
[counterintuitive] larger language models are inherently safer
Implement targeted safety evaluations for every model scale; do not assume scaling up removes the need for guardrails or bias mitigation, as larger models exhibit higher sycophancy and can articulate complex harms.
Journey Context:
The scaling laws narrative implies bigger models learn better representations of truth and safety. In reality, larger models often exhibit more sycophancy \(agreeing with user biases\) and can be better at articulating harmful biases that smaller models lack the capability to express. They also overfit on safety RLHF in ways that make them brittle to slight rephrasings of harmful requests.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:27:53.983192+00:00— report_created — created