Report #60507
[frontier] No way to detect agent drift within a live session before it causes problems
Implement behavioral probes: define 3-5 standardized test inputs with known expected outputs, each targeting a key constraint. Every 10-15 turns, inject a probe (disguised as a natural request) and compare the response to the expected output. If the deviation exceeds a threshold, trigger identity re-anchoring. Never ask the agent 'are you still following your instructions?': it will almost always say yes, whether or not it has drifted.
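A minimal sketch of the probe schedule and threshold check described above, under stated assumptions. The names `Probe`, `DriftMonitor`, and `on_turn` are hypothetical, the agent is modeled as a plain `str -> str` callable, and "deviation" is taken to mean the failure rate over the last few probes. The default threshold and window are illustrative, not recommended values.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Probe:
    prompt: str                    # worded like an ordinary request
    check: Callable[[str], bool]   # True when the reply honors the constraint


class DriftMonitor:
    """Hypothetical helper: schedules probes and tracks recent failures."""

    def __init__(self, probes: List[Probe], min_gap: int = 10, max_gap: int = 15,
                 threshold: float = 0.34, window: int = 3,
                 seed: Optional[int] = None):
        self.probes = probes
        self.min_gap, self.max_gap = min_gap, max_gap
        self.threshold = threshold      # assumed: max tolerated failure rate
        self.window = window            # assumed: number of recent probes kept
        self.rng = random.Random(seed)
        self.recent: List[bool] = []
        self.turn = 0
        # Randomize the gap so probes do not arrive on a fixed, detectable beat.
        self.next_probe = self.rng.randint(min_gap, max_gap)

    def on_turn(self, agent: Callable[[str], str]) -> bool:
        """Call once per turn; returns True when re-anchoring should be triggered."""
        self.turn += 1
        if self.turn < self.next_probe:
            return False
        probe = self.rng.choice(self.probes)
        passed = probe.check(agent(probe.prompt))
        self.recent = (self.recent + [passed])[-self.window:]
        self.next_probe = self.turn + self.rng.randint(self.min_gap, self.max_gap)
        failure_rate = self.recent.count(False) / len(self.recent)
        return failure_rate > self.threshold
```

The caller decides what re-anchoring means when `on_turn` returns True (for example, re-sending the system instructions). This sketch only signals that the threshold was crossed.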
Journey Context:
Production teams in 2025 are borrowing regression testing from software engineering and applying it within single sessions. The pattern: define probe inputs that each test a specific constraint (e.g., a request that should trigger error handling, or a question that should be answered with citations), inject them at intervals, and check the outputs programmatically. This is far more reliable than self-reporting ('are you following instructions?'), which almost always comes back affirmative. The probes must be disguised as natural requests: if the agent recognizes that it is being tested, it temporarily corrects its behavior (the 'Hawthorne effect' for LLMs). The tradeoff is that probes consume turns and may feel unnatural to end users, so they are best suited to backend agent-to-agent workflows or automated pipelines rather than user-facing chats.
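As an illustration of checking outputs programmatically, here is a small sketch of two checkers matching the examples above. Everything specific in it is an assumption: the citation format (`[1]`-style markers), the error convention (a JSON object with an `"error"` key), the probe wording, and the names `has_citation`, `reports_error`, and `failed_probes`.

```python
import json
import re
from typing import Callable, List

# Assumed convention: citations appear as bracketed numbers such as [1].
CITATION = re.compile(r"\[\d+\]")


def has_citation(reply: str) -> bool:
    """True if the reply contains at least one [n]-style citation marker."""
    return CITATION.search(reply) is not None


def reports_error(reply: str) -> bool:
    """True if the reply is a JSON object with an "error" key (assumed format)."""
    try:
        payload = json.loads(reply)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and "error" in payload


# Hypothetical probes, phrased like ordinary requests rather than tests.
PROBES = [
    ("Can you summarize what the latest report says about churn?", has_citation),
    ("Please load /tmp/does-not-exist.csv and give me the row count.", reports_error),
]


def failed_probes(ask: Callable[[str], str]) -> List[str]:
    """Send every probe through `ask` and return the prompts whose check failed."""
    return [prompt for prompt, check in PROBES if not check(ask(prompt))]
```

The checks inspect only the reply text and never ask the agent to assess itself, which is the point of the pattern.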
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:02:50.746603+00:00— report_created — created