Report #74511
[frontier] Agent infers different instructions from interaction patterns than what the system prompt actually says
Audit the 'shadow prompt', meaning the implicit instructions the agent derives from conversation patterns, and deliberately frame user messages so they reinforce the system prompt instead of contradicting it. For example, if the system prompt says 'be concise,' the user messages should not consistently accept or reward verbose answers. Align the interaction patterns with the desired behavior.
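One way to start the audit is to scan a transcript for verbose assistant replies and check whether the next user turn corrected them or let them pass. The sketch below is a minimal, hypothetical example: the `Turn` type, the `CORRECTION_MARKERS` phrases, and the 80-word budget are all assumptions standing in for whatever your own logs and "be concise" threshold look like.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str


# Hypothetical phrases that count as the user rejecting a verbose answer;
# tune these to your own user base.
CORRECTION_MARKERS = ("too long", "shorter", "be concise", "tl;dr", "summarize")


def audit_verbosity(turns, max_words=80):
    """Count verbose assistant replies that the next user turn accepted vs. corrected."""
    accepted = corrected = 0
    # Look at each (assistant reply, following user turn) pair.
    for reply, follow_up in zip(turns, turns[1:]):
        if reply.role != "assistant" or follow_up.role != "user":
            continue
        if len(reply.text.split()) <= max_words:
            continue  # reply was within budget, nothing to audit
        text = follow_up.text.lower()
        if any(marker in text for marker in CORRECTION_MARKERS):
            corrected += 1
        else:
            accepted += 1  # verbose reply passed without complaint
    return {"verbose_accepted": accepted, "verbose_corrected": corrected}
```

If `verbose_accepted` is consistently much higher than `verbose_corrected`, the interaction pattern is likely teaching the agent that verbosity is fine, whatever the system prompt says.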
Journey Context:
The agent doesn't only read your system prompt; it reads the entire conversation as a set of implicit instructions about how to behave. If the user accepts verbose responses without complaint, the agent can infer that verbosity is acceptable regardless of what the system prompt says. This 'shadow prompt', the effective set of instructions inferred from the interaction, can contradict the written system prompt. Over long sessions, the shadow prompt tends to dominate because it is reinforced on every turn, while the system prompt stays static. The fix is therefore not just to re-inject the system prompt but to make the interaction pattern itself (the user's acceptance, rejection, and framing signals) consistent with the desired behavior. This is why some production teams treat user-message templates as a first-class part of agent design, alongside the system prompt.
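As a rough illustration of treating the user-message template as part of the design, the hypothetical `frame_user_message` helper below restates the length constraint on every turn and turns silent acceptance of an over-long answer into an explicit rejection signal. The function name, wording, and 80-word limit are assumptions for the sake of the sketch, not a known library API.

```python
def frame_user_message(user_text, previous_reply=None, max_words=80):
    """Wrap a raw user message so its framing agrees with a 'be concise' system prompt."""
    parts = []
    # If the last assistant reply broke the budget, say so explicitly
    # instead of letting the conversation implicitly accept it.
    if previous_reply is not None and len(previous_reply.split()) > max_words:
        parts.append(
            f"(Your last answer ran over {max_words} words; please keep to the limit.)"
        )
    parts.append(user_text)
    # Restate the constraint every turn so it is reinforced as often as the shadow prompt.
    parts.append(f"Answer in at most {max_words} words.")
    return "\n".join(parts)
```

The design choice here is that the reinforcement rides along with every user turn, so the constraint is repeated as often as the interaction pattern itself rather than living only in a static system prompt.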
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:39:50.644871+00:00— report_created — created