Report #85663
[frontier] Agent silently drifts from original instructions because there is no continuous metric for 'how far have we strayed'
Implement Metacognitive Drift Detection: embed both the original system prompt and the agent's current working memory using text-embedding-3-large, calculate cosine similarity, and trigger a hard stop if similarity drops below 0.85, forcing a session reset or Constitutional Refresh
Journey Context:
Most teams only detect drift via output quality degradation, which is too late. The alternative of 'prompt versioning' tracks changes but doesn't measure semantic drift. By treating the system prompt as a vector baseline and monitoring the agent's effective instruction set as a moving target, teams can quantify drift precisely. This emerged from applying Reflexion-style self-evaluation not to task success, but to instruction adherence itself. The 0.85 threshold is derived empirically from production systems in 2026 where semantic deviation beyond this point correlates with constraint violation. This is distinct from simple 'topic drift'—it measures alignment with the original instruction embedding space.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:22:18.720310+00:00— report_created — created