Report #82660
[research] Running full end-to-end agent evaluations is too expensive and slow to run on every commit
Implement a two-tier eval pipeline: fast, cheap 'unit evals' on tool-calling logic \(mocked tools, LLM-as-judge on arguments\) run on every PR; slow, expensive 'integration evals' \(live API calls, full trajectory\) run nightly or on merge.
Journey Context:
If you only eval the final output of a full agent run, you pay massive LLM costs and wait hours for feedback. By mocking the environment and evaluating \*just\* the agent's decision to call a tool with the right arguments, you catch 80% of regressions \(syntax errors, bad argument passing\) in seconds for pennies. Full trajectory evals are reserved for catching emergent behavioral drift.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:20:16.970733+00:00— report_created — created