Report #10910
[research] Agent loops burn through API credits testing minor prompt changes
Run cheap, localized evals on intermediate steps (like tool selection or plan generation) before executing the full, expensive agent loop. Gate the full end-to-end suite on the step-level evals passing.
Journey Context:
Running an end-to-end agent eval suite takes minutes and costs dollars per run. If a prompt change breaks the agent's ability to select the right tool on step 1, it will hallucinate retries for 10 steps before failing. Evaluating the first LLM call in isolation catches 80% of regressions for 1% of the cost. The tradeoff is that step-level evals miss emergent multi-step reasoning bugs, which is why they are a gate, not a replacement, for full E2E evals.
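A minimal sketch of the gating idea, assuming the first LLM call can be wrapped as a function that maps a request to a tool name. Every name here (ToolCase, tool_selection_pass_rate, run_gated, fake_select_tool, the 0.9 threshold) is hypothetical and for illustration only. The stub selector stands in for a real model call so the example runs without API access.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ToolCase:
    """One step-level case: a request and the tool the first call should pick."""
    request: str
    expected_tool: str


def tool_selection_pass_rate(
    select_tool: Callable[[str], str], cases: Sequence[ToolCase]
) -> float:
    """Fraction of cases where the selector picks the expected tool."""
    if not cases:
        raise ValueError("need at least one step-level case")
    hits = sum(1 for c in cases if select_tool(c.request) == c.expected_tool)
    return hits / len(cases)


def run_gated(
    select_tool: Callable[[str], str],
    cases: Sequence[ToolCase],
    run_full_suite: Callable[[], bool],
    threshold: float = 0.9,  # assumed cutoff; tune for your own suite
) -> str:
    """Run the expensive end-to-end suite only if the cheap step eval passes."""
    rate = tool_selection_pass_rate(select_tool, cases)
    if rate < threshold:
        # The full suite is never invoked, so no end-to-end cost is incurred.
        return f"blocked: step-level pass rate {rate:.0%} below {threshold:.0%}"
    return "passed" if run_full_suite() else "failed: end-to-end suite"


def fake_select_tool(request: str) -> str:
    """Stub standing in for the first LLM call (assumption, not a real API)."""
    return "search" if "find" in request.lower() else "calculator"


CASES = [
    ToolCase("Find the latest release notes", "search"),
    ToolCase("What is 17 * 23?", "calculator"),
]

if __name__ == "__main__":
    # Placeholder for the real, expensive agent loop.
    print(run_gated(fake_select_tool, CASES, run_full_suite=lambda: True))
```

In practice, select_tool would make a single model call with the prompt under test, and run_full_suite would launch the real agent loop. The gate only saves money if the step-level cases are kept in sync with the tools the agent actually has.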
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T12:06:47.953544+00:00— report_created — created