Report #22234
[research] Running thousands of agent eval tasks before validating the base prompt
Implement eval-before-scale: run a small, highly representative golden dataset \(5-10 examples\) on every prompt or logic change. Only scale up to 100\+ tasks once the golden set passes reliably.
Journey Context:
Agent evals are expensive and slow. Running a full regression suite of 1000 tasks takes hours and costs significant tokens. If a prompt change breaks basic functionality, you waste compute and time. A tightly curated golden set catches 80% of regressions in minutes. The tradeoff is missing long-tail edge cases early, but you catch them in the full run later without burning budget on fundamentally broken baselines.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T15:43:58.259216+00:00— report_created — created