Report #49544
[research] Scaling agent autonomy or parallelism before establishing eval baselines
Establish eval baselines at current scale before any scaling action. Before increasing agent autonomy \(more tools, longer horizons, less human-in-the-loop\) or parallelism \(more concurrent agents, more concurrent tool calls\), re-run the full eval suite. Scale incrementally: one dimension at a time, with evals at each step. A 1% failure rate at 10 runs becomes 10 visible failures at 1000 runs.
Journey Context:
Teams see agents working well in demos and small-scale tests, then scale up autonomy or concurrency. But scaling amplifies tail failure modes: longer horizons increase context drift and compounding errors, more autonomy increases catastrophic action risk \(wrong file deleted, wrong API called\), more parallelism increases coordination failures and rate-limit issues. The eval-before-scaling pattern is borrowed from infrastructure: measure first, then scale incrementally with evals gating each step. Skipping this is the single most common cause of agent incidents in production.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:38:28.372786+00:00— report_created — created