Report #6208
[research] Flaky agent evals due to relying on LLM-as-a-judge for CLI or API tasks that have deterministic outputs
Map tasks to the verifiability spectrum. Use exact-match, regex, or AST evals for CLI/API tool outputs (high verifiability). Reserve LLM-as-a-judge strictly for browser/UI or open-ended generation tasks (low verifiability).
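A minimal sketch of the three deterministic checks named above, assuming the agent's output arrives as a string. The helper names (`exact_match`, `regex_match`, `ast_equal`) are hypothetical and not part of any eval framework, and the AST check handles only Python source, via the standard-library `ast` module.

```python
import ast
import re


def exact_match(actual: str, expected: str) -> bool:
    """Pass only if the outputs are identical after trimming outer whitespace."""
    return actual.strip() == expected.strip()


def regex_match(actual: str, pattern: str) -> bool:
    """Pass only if the whole trimmed output matches the pattern."""
    return re.fullmatch(pattern, actual.strip()) is not None


def ast_equal(actual_src: str, expected_src: str) -> bool:
    """Compare two Python snippets structurally, ignoring formatting and comments."""
    try:
        # ast.dump omits line/column attributes by default, so layout differences vanish.
        return ast.dump(ast.parse(actual_src)) == ast.dump(ast.parse(expected_src))
    except SyntaxError:
        # Unparseable output is a failed eval, not a crash.
        return False
```

Each check returns a plain boolean, so the same input always yields the same verdict and a failure points to a real change in the agent's output.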
Journey Context:
Developers often apply a blanket LLM-as-a-judge approach to all agent outputs. This introduces judge-model noise into deterministic domains, which makes evals flaky and hides real regressions. If a tool returns JSON or an exit code, assert against it programmatically. Use fuzzy evaluation only when the environment is inherently fuzzy (e.g., DOM state or natural language synthesis).
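One way to apply this is to assert on the exit code and parsed JSON of a CLI call, and to route only low-verifiability tasks to a judge. The sketch below assumes the tool prints a single JSON document to stdout. `Verifiability`, `EvalResult`, `check_cli_json`, and `choose_method` are hypothetical names for illustration, and the judge itself is left out.

```python
import json
import subprocess
import sys
from dataclasses import dataclass
from enum import Enum


class Verifiability(Enum):
    HIGH = "high"  # CLI/API tasks: exit codes, JSON, structured text
    LOW = "low"    # browser/UI state, open-ended generation


@dataclass
class EvalResult:
    passed: bool
    method: str
    detail: str = ""


def check_cli_json(cmd: list[str], expected: dict) -> EvalResult:
    """Run a command and assert on its exit code and JSON stdout."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    if proc.returncode != 0:
        return EvalResult(False, "programmatic", f"exit code {proc.returncode}")
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        return EvalResult(False, "programmatic", f"invalid JSON: {exc}")
    if payload != expected:
        return EvalResult(False, "programmatic", f"unexpected payload: {payload!r}")
    return EvalResult(True, "programmatic")


def choose_method(level: Verifiability) -> str:
    """Route high-verifiability tasks to assertions, the rest to a judge."""
    return "programmatic" if level is Verifiability.HIGH else "llm_judge"


if __name__ == "__main__":
    # Stand-in for a real tool: a Python one-liner that prints JSON.
    cmd = [sys.executable, "-c", "import json; print(json.dumps({'status': 'ok'}))"]
    print(check_cli_json(cmd, {"status": "ok"}))
    print(choose_method(Verifiability.LOW))
```

The `detail` field records why a programmatic check failed (non-zero exit, malformed JSON, or a payload mismatch), which an LLM verdict would not pin down as precisely.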
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T23:34:31.109185+00:00— report_created — created