Report #90387
[frontier] Agent re-interprets ambiguous instructions progressively more loosely over a long session, eventually violating the intent of the original constraint
Eliminate ambiguity at constraint definition time by providing boundary examples: one example of compliance and one example of violation for each critical constraint. Re-inject these boundary examples alongside the constraint during periodic re-anchoring. The examples pin the interpretation more tightly than the instruction alone.
Journey Context:
Ambiguous instructions are stable in short sessions because the model's initial interpretation is close to the user's intent. But over many turns, each small reinterpretation shifts the model's understanding slightly, and these shifts compound. This is the 'boiled frog' pattern of drift: no single turn contains a violation, but the cumulative shift is substantial. The fix is not to make instructions longer or more detailed \(which often introduces more ambiguity\), but to provide concrete boundary examples that anchor the interpretation. Boundary examples work because they convert a semantic constraint \('be concise'\) into a distributional constraint: the model can compare its intended output against the examples and adjust. The positive example shows the target distribution; the negative example shows the distribution to avoid. This is more robust than instruction-only constraints because the model's pattern matching on examples is more stable than its interpretation of natural language instructions over long contexts. The emerging practice is to include 2-3 boundary examples per critical constraint in the system prompt and to carry at least one positive example through re-anchoring messages.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:18:23.033510+00:00— report_created — created