Report #77870
[counterintuitive] Putting instructions in the system prompt guarantees the model will follow them
Treat system prompts as strong priors, not enforced constraints. For critical requirements, implement application-layer validation. Use structured outputs (JSON schema, function calling) for format constraints. Test system prompt adherence under adversarial or edge-case user inputs. Never rely on system prompts alone for security or safety guarantees.
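As a minimal sketch of application-layer validation, the snippet below checks a model reply in code instead of trusting the system prompt to enforce the format. Everything named here is an assumption made for illustration: the `REQUIRED_FIELDS` schema, the `InvalidReply` exception, and the `generate` callable, which stands in for whatever client call returns the model's raw text.

```python
import json
from typing import Callable

# Hypothetical schema for illustration: field name -> expected Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list}


class InvalidReply(ValueError):
    """Raised when model output fails the application-layer checks."""


def validate_reply(raw: str) -> dict:
    """Parse raw model text as JSON and check it against REQUIRED_FIELDS."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidReply(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise InvalidReply("top-level value must be a JSON object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            raise InvalidReply(f"missing field: {name}")
        if not isinstance(data[name], expected_type):
            raise InvalidReply(f"wrong type for field: {name}")
    return data


def generate_validated(generate: Callable[[], str], max_attempts: int = 3) -> dict:
    """Call `generate` until a reply passes validation or attempts run out.

    `generate` is a placeholder for the real model call (an assumption).
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return validate_reply(generate())
        except InvalidReply as exc:
            last_error = exc  # remember why the latest attempt failed
    raise InvalidReply(f"no valid reply after {max_attempts} attempts: {last_error}")
```

The same wrapper can be pointed at adversarial or edge-case inputs in a test suite: a reply that ignores the format instruction surfaces as an `InvalidReply` instead of reaching downstream code.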
Journey Context:
System prompts work by placing instructions in the context, usually at the start, in a position the model has been fine-tuned to weight heavily. The model still generates each token probabilistically by attending over the whole context; there is no separate 'rule enforcement' module. If user input creates context that conflicts strongly enough with the system prompt, the model's output distribution shifts. This isn't a bug; it follows from how transformer attention works: every token in the context can influence the probability of the next token. 'Never do X' instructions are therefore less reliable than making X structurally impossible (e.g., if the model must not output PII, don't give it PII in the first place, and validate output with regex/PII detectors). The same mechanism enables prompt injection: system and user text share one context with no hard privilege boundary, so user tokens can outweigh system tokens.
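A rough sketch of the regex-based PII check mentioned above, using only the standard library. The two patterns (email address and US-style SSN) and the names `PII_PATTERNS`, `find_pii`, and `redact_pii` are assumptions chosen for illustration; simple patterns like these miss many real-world formats, so a production filter would likely need a dedicated PII detector.

```python
import re

# Illustrative patterns only (assumption): simplified email and US SSN formats.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs for every pattern hit in `text`."""
    return [
        (label, match.group())
        for label, pattern in PII_PATTERNS.items()
        for match in pattern.finditer(text)
    ]


def redact_pii(text: str) -> str:
    """Replace every pattern hit with a labeled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label}]", text)
    return text
```

One way to use this is on both sides of the model call: run `redact_pii` on input before it enters the context, so the model never sees the data, and run `find_pii` on the output as a final check before returning it to the user.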
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:18:14.653274+00:00— report_created — created