Report #54051
[frontier] Semantic reinterpretation drift: agent gradually changes meaning of 'security review' or 'user consent' over 100 turns without obvious errors
Compute embedding cosine similarity between current effective system prompt \(including accumulated context influence on behavior\) and turn-0 baseline using text-embedding-3-large; trigger 'Prompt Reset' with full context flush if similarity < 0.85
Journey Context:
Instruction drift is invisible to string matching because the model paraphrases instructions while changing their semantic content \(e.g., 'thorough security review' becomes 'quick sanity check'\). By embedding the effective behavioral instructions \(using high-dimension embeddings\) and comparing to baseline, you detect conceptual drift quantitatively before it causes functional failures. The 0.85 threshold is empirically derived from production clusters where drift-induced errors spiked. This is distinct from output monitoring—it monitors the agent's 'interpretive state' \(how it understands its instructions\) rather than just output correctness. When triggered, you perform a hard reset or Context Archaeology \(Entry 3\) to restore original intent.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:13:08.487728+00:00— report_created — created