Report #100245
[research] How do I keep trajectory evaluation costs from exceeding my LLM inference bill?
Score trajectories with cheap deterministic checks where possible, and use distilled classifiers or sampled LLM judges for semantic dimensions. Avoid running a full LLM-as-judge on every step of every trace; instead, trigger expensive judging only on failure signals such as retries, loops, poor user feedback, or turns flagged by anomaly detection.
Journey Context:
A 10-step trajectory with three judges per step fires 30 model calls per request (10 × 3). At scale, the evaluation bill can exceed the cost of the inference it is meant to evaluate. The right tradeoff is tiered scoring. Tier 1 is deterministic: tool name match, argument schema validation, regex checks, compilation, unit tests, and database-state assertions. Tier 2 is a small classifier trained on labeled traces for semantic failure modes like looping, off-task drift, or bad tool selection. Tier 3 is LLM-as-judge, reserved for offline benchmark runs, human-review calibration, or sampled production traffic. This keeps the per-turn signal fast and cheap while preserving rigorous scoring on the cases that matter.
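The tiering above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `Step`, `TOOL_SCHEMAS`, `tier1_check`, `has_failure_signal`, and `score_trajectory` are hypothetical names, the `classifier` and `judge` arguments stand in for whatever Tier 2 model and LLM judge you actually run, and the thresholds and sample rate are placeholder values.

```python
import random
from dataclasses import dataclass


@dataclass
class Step:
    tool: str
    args: dict
    retries: int = 0


# Hypothetical schemas: tool name -> required argument names.
TOOL_SCHEMAS = {"search": {"query"}, "read_file": {"path"}}


def tier1_check(step):
    """Tier 1: deterministic check (known tool, required arguments present)."""
    required = TOOL_SCHEMAS.get(step.tool)
    if required is None:
        return False
    return required.issubset(step.args)


def has_failure_signal(steps, max_retries=1):
    """Cheap failure signals: an identical back-to-back call (loop) or excess retries."""
    for prev, cur in zip(steps, steps[1:]):
        if prev.tool == cur.tool and prev.args == cur.args:
            return True
    return any(s.retries > max_retries for s in steps)


def score_trajectory(steps, judge, classifier=None, threshold=0.5,
                     sample_rate=0.02, rng=random):
    """Run cheap tiers on every trace; call the expensive judge only when needed."""
    tier1_pass = all(tier1_check(s) for s in steps)

    # Tier 2 (optional): assumed to return a failure probability in [0, 1].
    tier2_flag = classifier is not None and classifier(steps) >= threshold

    # Tier 3 fires on a failure signal or for a small random sample of traffic.
    escalate = (
        not tier1_pass
        or tier2_flag
        or has_failure_signal(steps)
        or rng.random() < sample_rate
    )
    judge_score = judge(steps) if escalate else None
    return {"tier1_pass": tier1_pass, "escalated": escalate, "judge_score": judge_score}
```

With this shape the judge is called at most once per escalated trajectory rather than once per step, so the 30-call case only arises if you choose to judge step by step inside `judge`. Setting `sample_rate=0.0` disables random sampling, which is convenient for deterministic tests.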
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T04:54:07.098689+00:00— report_created — created