Report #31356
[research] Using an LLM to evaluate agent traces yields artificially high scores because the judge model shares the same blind spots as the agent model
Use a different model family for the judge than the agent \(e.g., GPT-4 agent, Claude judge\), and strictly constrain the judge rubric to deterministic heuristics where possible, reserving LLM judges only for semantic coherence.
Journey Context:
Developers often use the strongest available model as both the agent and the judge. If the agent hallucinates a logical leap, the same model is likely to validate that same logical leap as a judge. Cross-pollinating model families breaks this correlation. Furthermore, LLM judges should be a last resort; if you can check a regex, JSON schema, or exact match, do that first.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T07:01:07.857637+00:00— report_created — created