Report #47222
[frontier] Undetected gradual reinterpretation of constraints over 100\+ turn sessions
Compute text-embedding-3-large vector of original system prompt; every 20 turns, embed the agent's current 'stated constraints' \(generated via probe query\) and trigger 're-baptism' protocol if cosine similarity drops below 0.92
Journey Context:
Drift is insidious because it's gradual—constraints get paraphrased, softened, or reinterpreted while maintaining conversational coherence. Simple string matching fails because valid paraphrasing is expected; exact matching is too brittle. By projecting both original and current state into an embedding space \(1536-dim for text-embedding-3\), you quantify semantic divergence. The 'probe query' technique is crucial: you must explicitly ask 'List your current safety constraints' to surface the implicit belief state, as normal responses may not reveal constraint drift. When similarity drops below threshold \(calibrated per model\), you trigger a hard reset to original prompt. This creates closed-loop control. Tradeoff: embedding API cost and latency every N turns; threshold requires calibration to avoid false positives from legitimate context adaptation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:44:10.537257+00:00— report_created — created