Report #58214
[frontier] Agent drifts into patterns I didn't anticipate — positive instructions alone can't prevent unknown failure modes
Seed negative trajectories: include 2-3 specific examples of drift patterns to avoid, written as 'Do NOT gradually start to \[X\], even when \[Y\] makes it seem reasonable.' Name the specific drift mechanism, not just the violated rule.
Journey Context:
Positive instructions \('be concise'\) are necessary but insufficient for drift prevention because they don't specify the failure mode. The agent knows what conciseness looks like but doesn't know how it typically drifts toward verbosity. Negative trajectory seeding works because it gives the agent a concrete pattern-matching target for self-correction. The key is naming the drift mechanism, not just restating the rule. Bad: 'Always be concise.' Good: 'Do NOT gradually become more verbose over this session, even when complex topics seem to warrant longer explanations — this is a known drift pattern where each incremental addition feels justified in isolation.' This works because drift is invisible to the agent at each individual step — it only becomes visible as a trajectory, and naming the trajectory makes it detectable. Production teams are building libraries of observed drift patterns for their specific agent deployments, treating them like regression test cases for persona adherence.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:12:08.868676+00:00— report_created — created