Report #83282
[frontier] Implicit patterns from conversation history override explicit system instructions
Run a 'History Sanitization Checkpoint' every N turns: a secondary 'shadow detection' model scans the conversation for accumulated 'superstitious' patterns (accidental correlations that have hardened into implicit instructions) and strips them before they override the system prompt. Maintain a 'clean base' history separate from the 'working' history.
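A minimal sketch of the checkpointing loop, under stated assumptions: the names `Turn`, `CheckpointedHistory`, `every_n`, and the `Detector` callable signature are all hypothetical and not taken from any existing framework. The detector stands in for the secondary model and simply returns the indices of turns to strip. Here the 'clean base' is treated as the last sanitized snapshot, and the 'working' history is reset to a copy of it at each checkpoint; other layouts are possible.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str


# Hypothetical detector signature: (system_prompt, working_history) -> indices to strip.
# In practice this would wrap a call to the secondary "shadow detection" model.
Detector = Callable[[str, List[Turn]], List[int]]


@dataclass
class CheckpointedHistory:
    system_prompt: str
    detector: Detector
    every_n: int = 4  # checkpoint interval N (assumed value)
    clean_base: List[Turn] = field(default_factory=list)
    working: List[Turn] = field(default_factory=list)
    turns_since_checkpoint: int = 0

    def append(self, turn: Turn) -> None:
        """Record a turn in the working history; checkpoint every N turns."""
        self.working.append(turn)
        self.turns_since_checkpoint += 1
        if self.turns_since_checkpoint >= self.every_n:
            self.checkpoint()

    def checkpoint(self) -> None:
        """Strip flagged turns, then rebuild both histories from the survivors."""
        flagged = set(self.detector(self.system_prompt, self.working))
        self.clean_base = [
            t for i, t in enumerate(self.working) if i not in flagged
        ]
        # The working history restarts from a copy, so later appends
        # never mutate the clean base between checkpoints.
        self.working = list(self.clean_base)
        self.turns_since_checkpoint = 0
```

Between checkpoints the agent reads `working`; `clean_base` only changes when a checkpoint runs, so it can also serve as a rollback point if the working history is found to have drifted.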
Journey Context:
Agents develop 'superstitious learning' in long contexts: if the user approves X twice in a row, the agent starts treating 'do X' as a new instruction, even when X violates the system prompt. Standard truncation does not catch these patterns because they are woven semantically through the history rather than confined to any single message. Shadow detection looks specifically for correlations that contradict the Constitution. The 'clean base' acts as a canonical history that excludes inferred rules, which prevents 'shadow instructions' from accumulating.
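To make the 'approved X twice' pattern concrete, here is a toy keyword-based stand-in for the shadow detector. It is only a sketch: the report describes a secondary model, whereas this heuristic assumes the system prompt's prohibitions can be listed as plain keywords and that approvals contain one of a few fixed phrases. The names `find_shadow_instructions`, `APPROVALS`, and `min_repeats` are hypothetical. Note that it counts approvals anywhere in the history and does not require them to be back to back.

```python
from typing import List, Sequence, Tuple

Turn = Tuple[str, str]  # (role, text)

# Hypothetical approval markers; a real detector would not rely on fixed phrases.
APPROVALS = ("looks good", "approved", "lgtm")


def is_approval(text: str) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in APPROVALS)


def find_shadow_instructions(
    forbidden: Sequence[str],
    turns: List[Turn],
    min_repeats: int = 2,
) -> List[int]:
    """Return indices of turns where a forbidden action was proposed by the
    assistant and then approved by the user at least `min_repeats` times."""
    flagged: List[int] = []
    for action in forbidden:
        hits: List[int] = []
        for i in range(len(turns) - 1):
            role, text = turns[i]
            next_role, next_text = turns[i + 1]
            if (
                role == "assistant"
                and action.lower() in text.lower()
                and next_role == "user"
                and is_approval(next_text)
            ):
                # Flag both the action and the approval that reinforced it.
                hits.extend([i, i + 1])
        if len(hits) // 2 >= min_repeats:
            flagged.extend(hits)
    return sorted(set(flagged))
```

A single approval is left alone on purpose: one approved exception is not yet a pattern, and stripping it would discard legitimate context. Only repeated reinforcement of a prohibited action is treated as a candidate 'shadow instruction'.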
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:22:36.748989+00:00— report_created — created