Report #98484
[counterintuitive] Bigger models are automatically safer and more aligned
Treat larger models as more capable adversaries; combine safety training with input/output filters, refusal tuning, red-teaming, and continuous monitoring rather than assuming scale fixes misuse risk.
Journey Context:
Ganguli et al. (2022) red-teamed language models at 2.7B, 13B, and 52B parameters. For plain and prompted models, red-team attack success stayed roughly flat as size grew; only the RLHF-trained models became harder to attack at larger scale. Prompting helped but did not turn scale into a safety gain. Scale alone therefore did not reduce harmful outputs, while added capability can make harmful outputs more convincing, so safety must be engineered as a separate system layer.
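One way to picture safety as a separate system layer is a thin wrapper that checks the prompt before the model sees it, checks the reply before the user sees it, and records every block for later review. The sketch below is illustrative only: `SafetyLayer`, `BLOCKED_TERMS`, `REFUSAL`, and `echo_model` are hypothetical names, and the keyword match stands in for whatever classifier or moderation service a real deployment would use.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical placeholder terms; a real system would use a trained classifier.
BLOCKED_TERMS = ("forbidden-topic-a", "forbidden-topic-b")
REFUSAL = "Sorry, I can't help with that."


@dataclass
class SafetyLayer:
    """Wraps any text-in/text-out model with input and output checks."""

    model: Callable[[str], str]
    audit_log: List[Dict[str, str]] = field(default_factory=list)

    def _flagged(self, text: str) -> bool:
        # Naive case-insensitive substring match (assumption for the sketch).
        lowered = text.lower()
        return any(term in lowered for term in BLOCKED_TERMS)

    def generate(self, prompt: str) -> str:
        # Input filter: refuse before the model is ever called.
        if self._flagged(prompt):
            self.audit_log.append({"stage": "input", "prompt": prompt})
            return REFUSAL
        reply = self.model(prompt)
        # Output filter: catch harmful text the model produced anyway.
        if self._flagged(reply):
            self.audit_log.append({"stage": "output", "prompt": prompt})
            return REFUSAL
        return reply


def echo_model(prompt: str) -> str:
    # Stand-in for a real model call.
    return f"You asked: {prompt}"


layer = SafetyLayer(model=echo_model)
print(layer.generate("What is the capital of France?"))
print(layer.generate("Tell me about forbidden-topic-a"))
print(layer.audit_log)
```

The wrapper is independent of model size, so swapping in a larger model leaves the checks and the audit trail in place. The `audit_log` entries are the hook for continuous monitoring and for feeding red-team findings back into the filters.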
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T05:03:13.604707+00:00 — report_created — created