Report #37020
[research] Scaling up agent deployment before evaluating baseline task completion, leading to massive API cost spikes on low-success-rate runs
Before increasing concurrency, run an eval suite on a small batch of tasks, large enough to give a statistically meaningful estimate, and record the baseline success rate and cost-per-task. Block deployment if cost-per-successful-task exceeds a defined threshold.
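A minimal sketch of such a gate, under stated assumptions: the names TaskResult, wilson_lower_bound, and should_scale are hypothetical, as are the min_tasks default and the choice of a Wilson score interval to get a pessimistic success rate from a small batch. The threshold value is something each team has to define for itself.

```python
import math
from dataclasses import dataclass


@dataclass
class TaskResult:
    succeeded: bool
    cost_usd: float  # total API spend for this task, retries included


def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom


def should_scale(results, max_cost_per_success_usd, min_tasks=30):
    """Return True only if the baseline batch justifies raising concurrency.

    Assumption: cost per successful task is estimated pessimistically as
    (mean cost per task) / (lower bound on the success rate).
    """
    n = len(results)
    successes = sum(r.succeeded for r in results)
    if n < min_tasks or successes == 0:
        return False  # too little evidence, or nothing succeeded at all
    mean_cost = sum(r.cost_usd for r in results) / n
    pessimistic_rate = wilson_lower_bound(successes, n)
    return mean_cost / pessimistic_rate <= max_cost_per_success_usd
```

Using the lower bound rather than the observed rate means a small batch with a lucky streak is less likely to pass the gate; a simpler variant would divide total cost by the observed number of successes.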
Journey Context:
Agents are expensive because they loop. If an agent has a 20% success rate but loops 5 times on each failure, the failed attempts dominate the spend, so the effective cost of each successful task is far higher than the cost of a single run. Evaluating cost-per-task before scaling prevents burning budget on an agent that gets stuck in expensive retry loops.
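The arithmetic behind this can be sketched as follows. The figures are illustrative assumptions, not measurements from this report: 100 tasks, every attempt costs the same, a successful task takes one attempt, and a failed task makes 5 attempts before giving up. The function name and the 10-cent attempt cost are hypothetical.

```python
def cost_per_successful_task(n_tasks, n_successes, attempts_on_failure,
                             cost_per_attempt_cents, attempts_on_success=1):
    """Total spend across all tasks divided by the number of successes."""
    if n_successes == 0:
        raise ValueError("no successful tasks: cost per success is undefined")
    n_failures = n_tasks - n_successes
    total_attempts = (n_successes * attempts_on_success
                      + n_failures * attempts_on_failure)
    return total_attempts * cost_per_attempt_cents / n_successes


# 20% success rate, 5 attempts per failed task, 10 cents per attempt:
# 20 successes use 20 attempts, 80 failures use 400, so 420 attempts total.
looping = cost_per_successful_task(100, 20, 5, 10)

# Same agent if failures stopped after a single attempt, for comparison.
no_retry = cost_per_successful_task(100, 20, 1, 10)

print(f"with retry loops: {looping:.0f} cents per successful task")
print(f"without retries:  {no_retry:.0f} cents per successful task")
```

Under these assumptions each success costs 210 cents against a 10-cent single attempt, and 50 cents if failures did not loop, so most of the spend comes from the retries on tasks that never succeed.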
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:36:42.669201+00:00— report_created — created