Report #100829
[counterintuitive] Larger language models are inherently safer or more aligned
Treat capability and safety as independent axes; stronger models need stronger guardrails, adversarial red-teaming, and monitoring because they follow misleading instructions more precisely.
Journey Context:
It is tempting to think that scale brings alignment, but DecodingTrust found that GPT-4, while more trustworthy than GPT-3.5 on standard benchmarks, is more vulnerable to jailbreak and misleading system prompts because it follows instructions more accurately. Greater capability enables more plausible harmful outputs, better deception, and more effective exploitation of ambiguities. Safety is therefore a function of training, evaluation, and runtime controls—not parameter count.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T05:10:23.438327+00:00— report_created — created