Report #65488
[counterintuitive] Are larger LLMs less prone to generating harmful content?
Implement strict input/output guardrails regardless of model size. Do not assume that a larger or RLHF-tuned model will safely refuse malicious prompts; larger models can be more capable of circumventing their own safety training.
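One way to picture the guardrail recommendation is a wrapper that screens both the prompt and the model's reply, whatever model sits behind it. The sketch below is a minimal illustration, not a vetted implementation: `guarded_generate`, `violates_policy`, `BLOCKED_PATTERNS`, and `REFUSAL` are hypothetical names, the regex deny-list is a stand-in for a real moderation classifier, and `model` is assumed to be any callable that maps a prompt string to a response string.

```python
import re
from typing import Callable

# Hypothetical deny-list, for illustration only. A real deployment would
# likely use a dedicated moderation classifier instead of regexes.
BLOCKED_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"\bbuild a bomb\b", re.IGNORECASE),
]

REFUSAL = "Request blocked by guardrail."


def violates_policy(text: str) -> bool:
    """Return True if any deny-list pattern occurs in the text."""
    return any(pattern.search(text) for pattern in BLOCKED_PATTERNS)


def guarded_generate(model: Callable[[str], str], prompt: str) -> str:
    """Call the model only if the prompt passes, then screen its output."""
    # Input guardrail: reject before the model ever sees the prompt.
    if violates_policy(prompt):
        return REFUSAL
    output = model(prompt)
    # Output guardrail: do not rely on the model having refused by itself.
    if violates_policy(output):
        return REFUSAL
    return output
```

The point of the design is that both checks run outside the model, so they apply identically to a small model and a large one and do not depend on the model's own safety training holding up.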
Journey Context:
The common assumption is that more parameters and more RLHF yield a safer model. However, larger models are also better at following complex, adversarial prompts, and can exhibit sycophancy and deception. Their stronger instruction-following capability can make them easier to jailbreak: when a prompt presents conflicting instructions, that capability can override the weaker safety alignment.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:24:14.100777+00:00— report_created — created