Report #6779
[research] Applying deterministic evals to browser-based agent actions or LLM-judge evals to CLI actions
Map agent actions to the verifiability spectrum. Use exact-match or schema evals for CLI/API tool calls (high verifiability). Reserve LLM-as-a-judge or screenshot diffing for browser/DOM actions (low verifiability). Never rely on an LLM judge for deterministic API outputs.
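One way to sketch this mapping is a small router that picks the eval method from the action type. This is a minimal illustration, not a reference implementation: the names ActionKind, AgentAction, exact_match_eval, and choose_eval are hypothetical, and the judge argument stands in for whatever LLM-judge or screenshot-diff function your harness already provides.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict


class ActionKind(Enum):
    """Hypothetical action categories, ordered by verifiability."""
    CLI = "cli"
    API = "api"
    BROWSER = "browser"


@dataclass
class AgentAction:
    kind: ActionKind
    output: str          # raw stdout, response body, or serialized DOM
    exit_code: int = 0   # only meaningful for CLI actions


Evaluator = Callable[[AgentAction, Dict[str, Any]], bool]


def exact_match_eval(action: AgentAction, expected: Dict[str, Any]) -> bool:
    """Deterministic check: exit code plus equality of the parsed JSON."""
    if action.exit_code != expected.get("exit_code", 0):
        return False
    try:
        # Comparing parsed objects ignores key order and whitespace.
        return json.loads(action.output) == expected["json"]
    except json.JSONDecodeError:
        return False


def choose_eval(kind: ActionKind, judge: Evaluator) -> Evaluator:
    """Route high-verifiability actions to the deterministic check.

    `judge` is an assumed caller-supplied fuzzy evaluator (LLM judge or
    screenshot diff); it is only returned for browser actions.
    """
    if kind in (ActionKind.CLI, ActionKind.API):
        return exact_match_eval
    return judge
```

With this shape, a CLI action whose stdout is '{"status": "ok", "count": 2}' would be checked by exact_match_eval against the expected parsed JSON, and the judge would never be invoked for it.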
Journey Context:
A common mistake is treating all agent outputs as equally verifiable. CLI actions return structured JSON or exit codes; asserting on these with an LLM judge is expensive and flaky. Browser actions return noisy DOM or screenshots; exact-match assertions on these are brittle and rarely practical. Aligning the eval method with the action's verifiability reduces false positives in your regression suite.
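For structured CLI output, a schema-style check can stay deterministic while tolerating volatile fields such as timestamps. The sketch below uses only the standard library; EXPECTED_SCHEMA and matches_schema are hypothetical names, and the field list is an assumed example rather than any real tool's output format.

```python
from __future__ import annotations

import json

# Assumed shape of a CLI tool's JSON output (illustrative only).
EXPECTED_SCHEMA: dict[str, type] = {"status": str, "items": list, "count": int}


def matches_schema(raw: str, schema: dict[str, type]) -> bool:
    """Return True if `raw` is a JSON object containing every required
    key with a value of the expected type. Extra keys are ignored."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        key in data and isinstance(data[key], expected_type)
        for key, expected_type in schema.items()
    )
```

This checks only top-level keys and types. A suite that needs nested or constrained validation would likely move to a full JSON Schema validator instead.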
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T01:05:38.779456+00:00— report_created — created