Report #61251
[frontier] No programmatic way to detect agent instruction drift without expensive LLM-as-judge evaluation or manual review
Implement output fingerprinting: define measurable, machine-checkable properties that every agent response should have (required sections, specific phrases, format patterns, tone markers, persona checksums). After each response, verify the fingerprints programmatically. When fingerprints go missing past a threshold (e.g., 3 consecutive responses missing a fingerprint), trigger a booster prompt or an alert. Define fingerprints at each constraint priority tier: P0 fingerprints trigger immediate intervention, P2 fingerprints trigger logging only.
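A minimal sketch of the check loop described above, in Python. The names `Fingerprint`, `DriftMonitor`, and the two example fingerprints are hypothetical and not part of the report. It assumes a single consecutive-miss threshold shared by all tiers, and that "intervention" means the caller sends a booster prompt.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Fingerprint:
    name: str
    tier: str                     # "P0" or "P2", the tiers named in the report
    check: Callable[[str], bool]  # True when the property is present


@dataclass
class DriftMonitor:
    fingerprints: list[Fingerprint]
    max_consecutive_misses: int = 3  # the example threshold from the report
    misses: dict[str, int] = field(default_factory=dict)

    def observe(self, response: str) -> list[tuple[str, str]]:
        """Check one response; return (action, fingerprint name) pairs that fired."""
        actions = []
        for fp in self.fingerprints:
            if fp.check(response):
                self.misses[fp.name] = 0  # a hit resets the streak
                continue
            self.misses[fp.name] = self.misses.get(fp.name, 0) + 1
            if self.misses[fp.name] >= self.max_consecutive_misses:
                # Assumption: P0 maps to intervention, every other tier to logging.
                action = "intervene" if fp.tier == "P0" else "log"
                actions.append((action, fp.name))
        return actions


# Hypothetical fingerprints: a required section and a sign-off format pattern.
SIGNOFF = re.compile(r"^- Ada\s*$", re.MULTILINE)
EXAMPLE_FINGERPRINTS = [
    Fingerprint("has_summary", "P0", lambda r: "Summary:" in r),
    Fingerprint("has_signoff", "P2", lambda r: SIGNOFF.search(r) is not None),
]

monitor = DriftMonitor(EXAMPLE_FINGERPRINTS)
for reply in ["Summary: all good\n- Ada", "no structure", "still none", "drifted"]:
    for action, name in monitor.observe(reply):
        print(action, name)
```

The caller decides what each action means: "intervene" could re-inject the system instructions, while "log" only records the event. A per-tier threshold (for example, a lower one for P0) would be a small extension of the same structure.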
Journey Context:
Drift is gradual: it is hard to detect in any single response but obvious across multiple responses. Human review doesn't scale. LLM-as-judge evaluation is expensive, and the evaluator has its own reliability issues. Output fingerprinting emerged as a lightweight alternative: fingerprint checks are fast, cheap, and deterministic. The key insight is that you don't need to verify the agent is following every instruction perfectly; you need canary signals that indicate when drift is occurring. A missing persona checksum, an absent required section, or a format change is an early warning of broader drift. The fingerprint set must have a low false-positive rate (normal variation shouldn't trigger alerts) but high sensitivity to actual drift. Over-fingerprinting (checking too many properties) causes alert fatigue; under-fingerprinting misses drift. The sweet spot is 3 to 5 fingerprints per priority tier.
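One way to sanity-check a fingerprint set against these constraints is to replay it over responses already judged acceptable and to count fingerprints per tier. The sketch below is illustrative only: `false_positive_rates`, `tiers_outside_range`, and the sample data are hypothetical names and values, and the 3 to 5 range is taken from the guideline above.

```python
from collections import Counter
from typing import Callable

Check = Callable[[str], bool]


def false_positive_rates(checks: dict[str, Check], known_good: list[str]) -> dict[str, float]:
    """Fraction of known-good responses on which each fingerprint is missing."""
    if not known_good:
        raise ValueError("need at least one known-good response")
    return {
        name: sum(not check(r) for r in known_good) / len(known_good)
        for name, check in checks.items()
    }


def tiers_outside_range(tier_of: dict[str, str], low: int = 3, high: int = 5) -> dict[str, int]:
    """Tiers whose fingerprint count falls outside the low..high guideline."""
    counts = Counter(tier_of.values())
    return {tier: n for tier, n in counts.items() if not low <= n <= high}


# Hypothetical checks and sample responses, for illustration only.
checks = {
    "has_summary": lambda r: "Summary:" in r,
    "is_short": lambda r: len(r) < 20,
}
known_good = [
    "Summary: a",
    "Summary: this one is quite a bit longer",
    "Summary: b",
    "Summary: c",
]
print(false_positive_rates(checks, known_good))
print(tiers_outside_range({"has_summary": "P0", "is_short": "P2"}))
```

A fingerprint that fails often on known-good responses is a candidate for removal or loosening, since it would fire on normal variation rather than on drift.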
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:17:45.831315+00:00— report_created — created