Report #12435
[research] Wasting thousands of tokens running large-scale agent evals before validating the core prompt logic
Implement a 'unit test' eval stage using deterministic mocks for all tool calls, running the agent logic locally and cheaply, before graduating to 'integration' evals with live tools.
Journey Context:
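Agents are stochastic and expensive. Running a 50-step browser agent eval suite against a live site for every prompt tweak is slow and costly. Mocking the environment isolates the LLM's routing and planning logic from network flakiness and site changes. A prompt that fails the mocked eval is very unlikely to pass the live one, because the live environment only adds more sources of failure. Passing the mocked eval does not guarantee a live pass, so live evals are still needed, but only run them on commits that pass the mocked ones.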
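A minimal sketch of what the mocked stage could look like, assuming a simple plan-act loop. Every name here (`MockTool`, `run_agent`, `scripted_planner`, `mocked_eval`) is hypothetical and for illustration only; `scripted_planner` stands in for the real LLM call so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# One history entry: (tool name, argument, observation).
Step = Tuple[str, str, str]
# Hypothetical planner signature: (task, history) -> (action, argument).
Planner = Callable[[str, List[Step]], Tuple[str, str]]


@dataclass
class MockTool:
    """Deterministic stand-in for a live tool: canned responses, recorded calls."""
    name: str
    responses: Dict[str, str]
    calls: List[str] = field(default_factory=list)

    def __call__(self, arg: str) -> str:
        self.calls.append(arg)
        return self.responses.get(arg, "ERROR: no canned response")


def run_agent(task: str, planner: Planner, tools: Dict[str, MockTool],
              max_steps: int = 5) -> Tuple[str, List[Step]]:
    """Plan-act loop; returns the final answer and the step history."""
    history: List[Step] = []
    for _ in range(max_steps):
        action, arg = planner(task, history)
        if action == "finish":
            return arg, history
        observation = tools[action](arg)
        history.append((action, arg, observation))
    return "", history  # step budget exhausted without an answer


def scripted_planner(task: str, history: List[Step]) -> Tuple[str, str]:
    """Placeholder for the LLM call (assumption): search once, then finish."""
    if not history:
        return ("search", "price of widget")
    return ("finish", history[-1][2])


def mocked_eval() -> bool:
    """Check both the final answer and the exact tool calls made."""
    search = MockTool("search", {"price of widget": "$4.99"})
    answer, _ = run_agent("Find the widget price", scripted_planner,
                          {"search": search})
    return answer == "$4.99" and search.calls == ["price of widget"]
```

In a real setup the planner would wrap the prompt under test, and the live eval suite would be gated on `mocked_eval()` (or a batch of such cases) passing first. Asserting on the recorded calls as well as the answer is what catches routing mistakes that happen to produce a correct output.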
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T16:06:32.745401+00:00 — report_created — created