Report #49745
[gotcha] Relying solely on system prompts to prevent jailbreaks and harmful outputs
Implement defense in depth: run an independent classifier (for example, a separate guard LLM) over both inputs and outputs, enforce structured outputs (JSON schema), and constrain which tools the model may call and with what arguments. A system prompt is a weak baseline, not a security boundary.
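As a rough illustration of the structured-output and tool-constraint layers, here is a minimal sketch using only the Python standard library. The tool names, the `ALLOWED_TOOLS` table, and the `parse_tool_call` helper are hypothetical and not part of any particular framework; a real deployment would more likely validate against a full JSON schema.

```python
import json

# Hypothetical allowlist: tool name -> exact set of argument names it accepts.
ALLOWED_TOOLS = {
    "search_docs": {"query"},
    "get_weather": {"city"},
}


def parse_tool_call(raw_output: str) -> dict:
    """Parse model output as a tool call; raise ValueError if it is off-contract."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError("output is not valid JSON") from exc

    # The top level must be an object with exactly these two keys.
    if not isinstance(data, dict) or set(data) != {"tool", "arguments"}:
        raise ValueError("unexpected top-level shape")

    tool, args = data["tool"], data["arguments"]
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        raise ValueError("tool is not on the allowlist")
    if not isinstance(args, dict) or set(args) != ALLOWED_TOOLS[tool]:
        raise ValueError("arguments do not match the tool's contract")
    if not all(isinstance(value, str) for value in args.values()):
        raise ValueError("argument values must be strings")
    return data
```

The point of the design is that the check runs in ordinary code outside the model: whatever the prompt persuaded the model to emit, anything that is not a well-formed call to an allowlisted tool is rejected before it is acted on.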
Journey Context:
Developers often treat the system prompt as a firewall. In reality, the LLM is a next-token predictor that weighs the entire context window, so a long, detailed user prompt (or injected context such as retrieved documents or tool results) can easily outweigh a generic "be safe" instruction. Security requires real architectural boundaries (separate classifiers, output parsing) that are enforced outside the model, not polite requests inside the context window.
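One way to picture such a boundary is a wrapper that runs guards before and after the model call, so the decision to block does not depend on the model obeying its prompt. The sketch below is an assumption-laden illustration: `guarded_call`, `keyword_guard`, and `REFUSAL` are hypothetical names, and the keyword check is only a stand-in for an independent classifier model.

```python
from typing import Callable

Classifier = Callable[[str], bool]  # returns True when the text should be blocked
Model = Callable[[str], str]        # takes user input, returns the model's reply

REFUSAL = "Request blocked by policy."


def guarded_call(
    user_input: str,
    model: Model,
    input_guard: Classifier,
    output_guard: Classifier,
) -> str:
    """Call the model only if the input passes, and release only a passing reply."""
    if input_guard(user_input):
        return REFUSAL  # the model never sees blocked input
    reply = model(user_input)
    if output_guard(reply):
        return REFUSAL  # a harmful reply is dropped instead of returned
    return reply


def keyword_guard(text: str) -> bool:
    """Stand-in classifier for the demo; a real guard would be a separate model."""
    blocked_phrases = ("ignore previous instructions", "begin private key")
    lowered = text.lower()
    return any(phrase in lowered for phrase in blocked_phrases)
```

Usage would look like `guarded_call(prompt, call_llm, keyword_guard, keyword_guard)`, where `call_llm` is whatever function wraps your model API. Because the guards are plain callables, the keyword stand-in can be swapped for a dedicated classifier without touching the control flow.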
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:58:36.974822+00:00— report_created — created