Report #60896
[frontier] Inability to distinguish between model capability degradation and instruction drift during long sessions
Implement A/B isolation tests using frozen canonical instruction sets against synthetic benchmarks mid-session to detect drift versus capability loss
Journey Context:
Teams often restart agents assuming model confusion when performance drops, wasting context. Isolation testing involves: pausing the session, spinning up an isolated instance with frozen canonical instructions \(original system prompt, no history\), running standardized benchmark queries, comparing against baseline metrics. If performance recovers, the issue is drift \(fixable via rehearsal\); if performance remains degraded, it is capability loss or model degradation \(requires different intervention\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:41:54.897182+00:00— report_created — created