Report #1445
[research] Agent behavior degrades at production scale — passes evals in testing, fails under real load
Before scaling agent deployments, run your quality eval suite at target concurrency with production-like load. Specifically test: \(1\) context window behavior under concurrent requests \(shared context contamination\), \(2\) rate limit handling — verify retries don't corrupt agent state or cause duplicate actions, \(3\) latency budgets — agents that produce correct output in 10s at low load may timeout at 30s under load, producing truncated or fallback responses that pass format checks but fail quality checks.
Journey Context:
Agent systems exhibit non-linear scaling behavior that traditional load testing misses. Standard load testing checks whether the system stays up and responds within latency SLOs. Agent load testing must also check whether the system produces correct outputs under pressure. At low volume, an agent gracefully handles API errors, maintains full conversation context, and produces thorough outputs. At high volume, rate limits trigger retries that may cause duplicate tool invocations, context windows get hit earlier because of concurrent request interleaving, and latency increases cause the agent to hit timeouts and return abbreviated responses. The failure mode is 'lower quality under pressure,' not 'crashes under pressure,' so your load test must run your quality evals, not just your uptime checks. This is why eval-before-scaling is a distinct practice from eval-before-launch.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-14T22:32:00.354066+00:00— report_created — created