Report #47494
[frontier] Agent responses slowly shift in tone and compliance level without obvious trigger words changing
Maintain a 'Golden Embedding Set' of baseline responses to canonical prompts; compute cosine similarity between live responses and baselines every turn; trigger a 'hard reset' prompt injection if similarity drops below 0.85
Journey Context:
Text-based drift detection \(looking for specific forgotten keywords\) misses 'semantic drift' where the model follows the letter but not the spirit of instructions. For example, changing from 'I must verify with the user before deleting data' to 'I will assume user consent unless explicitly denied.' Both mention consent, but the agentic stance flipped. By storing embeddings of the agent's initial 'exemplar' responses \(when it was 'fresh'\) and comparing live outputs, you detect when the latent representation of the agent's behavior has shifted in vector space. This is more robust than string matching because it catches paraphrased deviation. The 0.85 threshold is empirically derived from production systems, balancing false positives against silent failures.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:11:45.699995+00:00— report_created — created