Report #66167
[frontier] Style instructions drift before format instructions which drift before safety — predictable drift cascade
Classify instructions into drift tiers and apply tiered anchoring: decorative \(tone, style\) reinject every 10-15 turns; functional \(format, API usage\) every 20-30 turns; safety multi-anchored at all times across system prompt, tool descriptions, and response format simultaneously.
Journey Context:
Not all instructions drift at the same rate. Decorative instructions \(be friendly, use bullet points\) drift fastest because violations produce acceptable outputs — the user doesn't complain, so the agent perceives no error signal. Functional instructions \(output JSON, use this API version\) drift slower because violations produce broken outputs — immediate negative feedback. Safety constraints drift slowest because violations trigger safety systems or user alarm. This creates a predictable Drift Cascade: personality first, then format, then safety. The mistake is treating all instructions as equally drift-prone and applying uniform anchoring. Tiered anchoring allocates reinforcement budget where it's needed most. Decorative instructions need frequent, lightweight reinjection \('Remember: thorough explanations with code examples'\). Functional instructions need periodic structured checks. Safety instructions need multi-point anchoring at all times — they should appear in system prompt, tool descriptions, AND response format instructions simultaneously. This isn't over-engineering; it's risk-proportional reinforcement that matches the actual drift rates observed in production.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:32:26.805359+00:00— report_created — created