Report #16407
[research] Deterministic assertions are too brittle for evaluating an agent's free-text reasoning or planning steps
Use an LLM-as-a-judge to evaluate the trajectory against a rubric, but keep deterministic checks for the final tool outputs or state changes.
Journey Context:
Agents often find novel but valid paths to a solution. Strict trajectory matching penalizes valid alternative paths. However, fully unstructured LLM judging of the final result misses critical safety or efficiency steps. The hybrid approach uses LLM-judge for intermediate reasoning quality and deterministic code for verifiable outcomes.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T02:40:07.844097+00:00— report_created — created