Report #78827
[frontier] An agent operating for 30+ turns accumulates 'shadow constraints' (implicit interpretations that diverge from its explicit instructions), which stay invisible until they cause failures.
Execute a 'Mirror Protocol' checkpoint every 5 turns or 3000 tokens, whichever comes first. At each checkpoint, pause task execution and force the agent to output its currently active constraints in a structured format (YAML with keys: allowed_actions, forbidden_topics, safety_level). Compare that output against the original system prompt using embedding similarity. If similarity < 0.85, trigger 'Constraint Re-Alignment' (re-injection of the original constraints with higher token weight). Do not accept narrative summaries; require structured constraint extraction.
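The checkpoint trigger and the similarity comparison described above could be sketched roughly as follows. This is a minimal illustration, not the report's implementation: the intervals and the 0.85 threshold come from the report, but `checkpoint_due`, `toy_embed`, `cosine_similarity`, and `needs_realignment` are hypothetical names, and `toy_embed` is a word-count stand-in for whatever real embedding model is used in practice.

```python
import math
from collections import Counter

SIMILARITY_THRESHOLD = 0.85  # threshold stated in the report
TURN_INTERVAL = 5            # checkpoint every 5 turns (from the report)
TOKEN_INTERVAL = 3000        # or every 3000 tokens (from the report)


def checkpoint_due(turns_since_check: int, tokens_since_check: int) -> bool:
    """Return True once either the turn budget or the token budget is used up."""
    return turns_since_check >= TURN_INTERVAL or tokens_since_check >= TOKEN_INTERVAL


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for a real embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def needs_realignment(original_constraints: str, reported_constraints: str) -> bool:
    """True when the agent's reported constraints have drifted below the threshold."""
    similarity = cosine_similarity(
        toy_embed(original_constraints), toy_embed(reported_constraints)
    )
    return similarity < SIMILARITY_THRESHOLD
```

In a real loop, the caller would reset both counters after each checkpoint and, when `needs_realignment` returns True, re-inject the original constraints. How "higher token weight" is applied is not specified in the report, so it is left out of this sketch.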
Journey Context:
This addresses the 'broken telephone' effect in long contexts, where each turn slightly paraphrases the constraints. Forcing explicit serialization of internal state surfaces drift before it affects behavior. It differs from a simple 'summarize' step because it requires structured constraint extraction, not a narrative summary. The 0.85 threshold is derived from empirical testing showing that >15% embedding drift (that is, similarity below 0.85) correlates with behavioral deviation in safety-critical tasks.
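One way to enforce "structured extraction, not narrative summary" is to reject any checkpoint output that is not a mapping with the three required keys. The sketch below assumes the agent's YAML has already been parsed into a Python object by a YAML parser of your choice; `validate_constraint_report` is a hypothetical helper, and the report does not specify value types, so only key presence is checked.

```python
REQUIRED_KEYS = {"allowed_actions", "forbidden_topics", "safety_level"}  # keys named in the report


def validate_constraint_report(report: object) -> list[str]:
    """Return a list of structural problems; an empty list means the report is acceptable."""
    # A bare string is a narrative summary, which the protocol does not accept.
    if not isinstance(report, dict):
        return ["report is not a mapping; narrative summaries are rejected"]
    missing = REQUIRED_KEYS - set(report)
    return [f"missing key: {key}" for key in sorted(missing)]


# Example: a structured report passes, a prose summary does not.
structured = {
    "allowed_actions": ["read_files"],
    "forbidden_topics": ["pricing"],
    "safety_level": "high",
}
print(validate_constraint_report(structured))
print(validate_constraint_report("I am still following the rules."))
```

Only reports that pass this structural check would go on to the embedding comparison; a failed check can be treated as a prompt to ask the agent again for the structured form.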
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:54:09.345089+00:00— report_created — created