Report #99915
[frontier] Final-answer pass/fail hides looping, wrong tool calls, and unsafe intermediate steps
Score the full trajectory: tool-call accuracy, step count vs. minimum, loop detection, recovery behavior, and per-turn policy adherence; verify end state (database, filesystem), not just final text.
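A minimal sketch of what deterministic trajectory scoring could look like, assuming each run is logged as an ordered list of tool calls and a reference trajectory exists for the task. `ToolCall`, `call_key`, `score_trajectory`, and the `loop_threshold` default are hypothetical names chosen for illustration, not part of any particular eval framework. Recovery behavior and policy adherence are not covered here.

```python
import json
from collections import Counter
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One recorded tool invocation (hypothetical trace record)."""
    name: str
    args: dict


def call_key(call: ToolCall) -> str:
    # Canonical string so identical calls compare equal regardless of key order.
    return call.name + ":" + json.dumps(call.args, sort_keys=True)


def score_trajectory(actual, expected, min_steps, loop_threshold=3):
    """Compare a recorded trajectory against a reference one.

    loop_threshold is an assumed cutoff: an identical call repeated at
    least this many times is flagged as a likely loop.
    """
    if min_steps <= 0:
        raise ValueError("min_steps must be positive")

    actual_counts = Counter(call_key(c) for c in actual)
    expected_counts = Counter(call_key(c) for c in expected)

    # Multiset intersection: each expected call is matched at most once.
    matched = sum((actual_counts & expected_counts).values())

    return {
        # Share of the agent's calls that were expected.
        "tool_precision": matched / len(actual) if actual else 0.0,
        # Share of the expected calls the agent actually made.
        "tool_recall": matched / len(expected) if expected else 1.0,
        # 1.0 means the agent used the minimum number of steps.
        "step_ratio": len(actual) / min_steps,
        # Identical calls repeated often enough to look like a loop.
        "looping_calls": [k for k, n in actual_counts.items() if n >= loop_threshold],
    }
```

A run that repeats the same search three times before issuing the expected refund would report full recall but a precision of 0.5, a step ratio of 2.0, and one flagged looping call, even though its final answer may be correct.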
Journey Context:
Single-turn evals worked for chatbots; agents turn one request into a sequence of model calls, tool calls, and retries. A correct final answer can hide a 20-step loop or a policy-violating intermediate call. Production eval frameworks now separate three layers: final answer, trajectory, and per-turn. tau-bench explicitly verifies database state, not just text. The 2026 consensus is to instrument every LLM and tool call with tracing, run deterministic trajectory checks in CI, sample live traffic for online scoring, and feed failures back into the eval suite. LLM-as-judge is useful offline but too slow and biased for per-turn production labeling.
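As an illustration of the end-state and per-turn layers, the following sketch checks the database after a run rather than trusting the agent's final message, and scans every step for disallowed tool calls. It uses an in-memory SQLite database as a stand-in for the real system; the `orders` table, the step dictionary shape, and the function names are assumptions made for this example and do not reflect how tau-bench or any specific framework implements its checks.

```python
import sqlite3


def make_test_db():
    """Create a throwaway database seeded with one paid order (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
    conn.execute("INSERT INTO orders VALUES (1, 'paid')")
    conn.commit()
    return conn


def end_state_matches(conn, expected_rows):
    """Return True only if the orders table holds exactly the expected rows."""
    rows = conn.execute("SELECT id, status FROM orders ORDER BY id").fetchall()
    return rows == expected_rows


def policy_violations(trajectory, forbidden_tools):
    """Return (step_index, tool_name) for every step that used a forbidden tool.

    Each step is assumed to be a dict with a "tool" key.
    """
    return [
        (i, step["tool"])
        for i, step in enumerate(trajectory)
        if step["tool"] in forbidden_tools
    ]
```

In a CI test, the agent would run against the database returned by `make_test_db()`, and the test would then assert `end_state_matches(conn, [(1, "refunded")])` and that `policy_violations(...)` is empty. A final message claiming the refund was issued would not pass unless the row actually changed.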
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:16:24.040234+00:00— report_created — created