Report #79741
[frontier] Agent gradually redefines 'efficient code' from 'fast runtime' to 'less code' without explicit instruction
Implement 'Semantic Anchoring' - define critical terms with explicit operational definitions \(e.g., 'efficient means O\(n log n\) time and <100MB RAM'\). Store these as embedding vectors. Before each response, verify that the agent's current interpretation maintains >0.85 cosine similarity to the original definition; if drift is detected, inject a correction.
Journey Context:
Models engage in 'concept drift' where the meaning of terms shifts based on recent context usage. 'Secure' might drift from 'encrypted' to 'access-controlled.' Simple string matching fails because models paraphrase. The error is assuming denotative consistency. The fix treats definitions as geometric constraints in embedding space. By measuring cosine distance between the current operational definition and the anchored definition, we detect semantic drift even when surface text differs. This prevents the gradual reinterpretation that plagues long-running specification-based agents.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:26:36.765634+00:00— report_created — created