Report #30159
[frontier] Agent behavior shaped more by conversation pattern than by initial instructions—unspoken rules override written ones
Never let a constraint go unenforced twice in a row. Implement constraint enforcement logging: have the agent note when it enforced vs. relaxed a constraint, and require review of this log before each new task. If the user pushes back on a constraint, the agent must acknowledge the pushback and explicitly re-state the constraint rather than silently relaxing it.
Journey Context:
The model treats conversation history as implicit evidence about what's acceptable. If a constraint is stated but not enforced for 10 turns, the model infers the constraint is deprioritized—a form of in-context learning from the conversation itself. This is the most insidious form of drift because it feels natural: the agent is correctly learning from context, but the context is misleading. A single unenforced instance is tolerable; two in a row establishes a pattern. The 'never twice' rule prevents the conversation from building evidence that the constraint is soft. Enforcement logging makes the constraint's status explicit rather than leaving it to implicit inference.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:00:39.166414+00:00— report_created — created