Report #75003
[counterintuitive] Are larger LLMs inherently less prone to jailbreaking?
Implement runtime guardrails (input/output classifiers) and strict system prompts regardless of model size. Do not rely on RLHF alone for security.
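As a rough illustration of that layering, here is a minimal sketch in Python. Everything in it is hypothetical: `SYSTEM_PROMPT`, `BLOCKLIST`, `is_flagged`, and `guarded_generate` are names made up for this example, the keyword match is only a stand-in for a real trained classifier, and `model` is assumed to be any callable that takes a system prompt and a user message and returns text.

```python
from typing import Callable

# Hypothetical strict system prompt, applied on every call.
SYSTEM_PROMPT = "You are a helpful assistant. Refuse requests for harmful content."

# Placeholder terms only; a real deployment would call a trained classifier here.
BLOCKLIST = ("build a bomb", "steal credentials")

REFUSAL = "Request blocked by guardrail."


def is_flagged(text: str) -> bool:
    """Stand-in classifier: flags text containing any blocklisted phrase."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def guarded_generate(user_input: str, model: Callable[[str, str], str]) -> str:
    """Run the input check, the model call, and the output check in sequence."""
    # Input guardrail: reject before the model ever sees the request.
    if is_flagged(user_input):
        return REFUSAL
    output = model(SYSTEM_PROMPT, user_input)
    # Output guardrail: catch harmful text even if the input looked benign.
    if is_flagged(output):
        return REFUSAL
    return output
```

The point of the shape, rather than of the toy classifier, is that both checks run outside the model, so they apply the same way whatever the model's size or training.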
Journey Context:
RLHF trains models to refuse harmful requests, but it acts more like a 'wrapper' around the model's capabilities than a removal of them. Larger models have a more complex capability surface, which leaves more edge cases through which RLHF constraints can be bypassed (e.g., multi-language attacks, base64 encoding). Scaling up capability without scaling alignment proportionally therefore increases certain attack surfaces.
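To make the base64 case concrete, the sketch below shows one way an input guardrail might look at decoded content as well as the raw text. It is an assumption-laden example, not a known implementation: `B64_TOKEN`, `expand_encodings`, and `flag_any_view` are hypothetical names, the 16-character minimum is an arbitrary threshold, and `classifier` stands for whatever input classifier is already in use. It does nothing for multi-language attacks, which would need a multilingual classifier or a translation step.

```python
import base64
import binascii
import re
from typing import Callable, List

# Heuristic: runs of 16+ base64 alphabet characters, optionally padded.
B64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def expand_encodings(text: str) -> List[str]:
    """Return the original text plus any base64 payloads that decode to UTF-8."""
    views = [text]
    for token in B64_TOKEN.findall(text):
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            # Not valid base64, or not text once decoded: ignore this token.
            continue
        views.append(decoded)
    return views


def flag_any_view(text: str, classifier: Callable[[str], bool]) -> bool:
    """Flag the input if the raw text or any decoded payload is flagged."""
    return any(classifier(view) for view in expand_encodings(text))
```

A caller would pass its existing classifier as `classifier`, so the same check runs on the raw input and on each decoded payload.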
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:29:15.124220+00:00— report_created — created