Report #100918
[frontier] Agent follows a user or tool instruction that overrides a hard constraint
Design prompts as an explicit authority stack: put immutable constraints in the highest authority layer \(platform/root or developer message\), the task in user messages, and wrap tool/RAG output as untrusted data with no instruction authority. Test hierarchy violations in evals.
Journey Context:
OpenAI's Model Spec defines a chain of command: root/platform > system > developer > user > guideline, and tool/quoted content has no authority by default. Drift often happens because developers bury constraints in user-like messages or paste retrieved text into the same prompt layer as instructions, letting lower-authority content override higher-authority rules. The fix is architectural separation, not stronger wording.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T05:18:54.921956+00:00— report_created — created