Report #60968
[research] Agent behavior changes unpredictably when scaled to more concurrent users or parallel runs
Run your eval suite at target concurrency before scaling. Measure both per-request quality scores and aggregate metrics (P95 latency, token usage distribution, error rate under load). Scale only if eval scores hold within 5% of the single-request baseline. Watch for rate-limit-induced retries that change the agent's reasoning path.
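The scale gate described above can be sketched as a small check. This is a minimal illustration, not a definitive implementation: the function names (`p95`, `relative_drop`, `safe_to_scale`) are hypothetical, the percentile uses the nearest-rank method, and "within 5%" is interpreted here as a one-sided drop in mean score relative to the baseline, which is an assumption.

```python
from statistics import mean


def p95(values):
    """Nearest-rank 95th percentile (hypothetical helper)."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def relative_drop(baseline_scores, load_scores):
    """Fractional drop of the mean score under load vs. the baseline."""
    base = mean(baseline_scores)
    if base == 0:
        raise ValueError("baseline mean score is zero")
    return (base - mean(load_scores)) / base


def safe_to_scale(baseline_scores, load_scores, max_drop=0.05):
    """True if scores under load stay within max_drop of the baseline.

    Assumption: only a decrease counts against the gate; an
    improvement under load is not treated as a failure.
    """
    return relative_drop(baseline_scores, load_scores) <= max_drop
```

For example, a baseline mean of 0.90 and a mean under load of 0.88 is a drop of about 2.2% and passes the gate, while a mean under load of 0.80 is a drop of about 11% and fails it. The same `p95` helper can be applied to per-request latencies collected during the load run.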
Journey Context:
Agents that work at 1 QPS can fail at 100 QPS because of rate limits, context window pressure from concurrent sessions, or non-deterministic ordering of parallel tool results. Eval-before-scale means running your quality evals under production-like load. The mistake is treating evals and load tests as separate concerns; for agents, they must be combined. A rate-limited agent may fall back to a different reasoning path that produces worse results, a failure mode that sequential testing cannot reveal. The 5% threshold is a practical guardrail: smaller drifts are usually noise, while larger ones signal real degradation under load.
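Combining evals with load can be sketched as running the same eval cases at a chosen concurrency and recording retries next to each score. Everything here is hypothetical and for illustration only: `FakeAgent` stands in for a real agent client and simulates a rate limit by rejecting calls beyond `capacity` in flight, and `RateLimited`, `eval_case`, and `run_suite` are made-up names, not part of any library.

```python
import asyncio


class RateLimited(Exception):
    """Raised by the fake agent when too many calls are in flight."""


class FakeAgent:
    """Hypothetical stand-in for a real agent client."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_flight = 0

    async def run(self, prompt):
        if self.in_flight >= self.capacity:
            raise RateLimited(prompt)
        self.in_flight += 1
        try:
            await asyncio.sleep(0.01)  # simulated model latency
            return prompt.upper()
        finally:
            self.in_flight -= 1


async def eval_case(agent, prompt, expected, max_retries=50):
    """Score one case and record how many retries it needed."""
    retries = 0
    while True:
        try:
            output = await agent.run(prompt)
            break
        except RateLimited:
            retries += 1
            if retries > max_retries:
                # Giving up counts as a failed case.
                return {"score": 0.0, "retries": retries}
            await asyncio.sleep(0.005)  # fixed backoff, for simplicity
    score = 1.0 if output == expected else 0.0
    return {"score": score, "retries": retries}


async def run_suite(agent, cases, concurrency):
    """Run all cases with at most `concurrency` in flight at once."""
    gate = asyncio.Semaphore(concurrency)

    async def bounded(case):
        prompt, expected = case
        async with gate:
            return await eval_case(agent, prompt, expected)

    return await asyncio.gather(*(bounded(c) for c in cases))
```

Running the suite once with `concurrency=1` and once at the target concurrency gives two sets of scores to compare. Keeping the retry count per case makes it possible to check whether the cases that scored worse are the ones that hit the rate limit.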
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:49:29.954636+00:00— report_created — created