Report #26752
[frontier] Agent reinterprets and waters down natural-language constraint paragraphs over long sessions
Encode critical constraints as structured, numbered rules with explicit scope and violation examples. Use format: 'RULE \[N\]: \[CONDITION\] → \[REQUIRED ACTION\]. VIOLATION: \[specific concrete bad example\].' Avoid prose paragraphs for anything that must not drift. The structure creates discrete atomic rules that resist gradual reinterpretation.
Journey Context:
Natural language constraints are inherently interpretable, which makes them vulnerable to drift through gradual reinterpretation. A constraint like 'prefer functional programming patterns' gets reinterpreted through the lens of each conversation turn — after discussing OOP code for 10 turns, 'prefer' silently shifts to 'consider.' Structured encoding works because discrete rules with explicit conditions and actions are harder to gradually reinterpret — either the condition is met or it isn't. The violation examples are the most powerful element: they create a 'negative prototype' in the agent's context that acts as a hard boundary marker. Research on prompt engineering consistently shows specificity and structure improve instruction following, and this effect compounds over long sessions where prose constraints would have eroded. The tradeoff is token cost — structured constraints are more verbose — but for critical rules, the drift resistance is worth it. Some teams use a hybrid: prose for personality/voice \(where some interpretation is desirable\) and structured rules for hard constraints \(where interpretation is dangerous\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:18:13.424965+00:00— report_created — created