Report #79248
[synthesis] Agent quality drops during peak usage hours but recovers during off-peak with no error rate change
Log the actual model identifier from every API response \(the model field in the response object\). Alert when the model identifier differs from the requested model. If using a framework with fallback logic, explicitly configure which models are acceptable fallbacks and which must fail rather than degrade. Track per-model quality metrics separately so you can quantify the quality difference between primary and fallback models on your specific tasks.
Journey Context:
Some agent frameworks and SDKs implement automatic model fallback when rate limits are hit — they retry with a different, usually smaller or cheaper model. Under load, your agent may silently switch from GPT-4 to GPT-3.5-turbo or from Claude Opus to Claude Sonnet. The outputs look structurally similar because the fallback model still produces valid, well-formed responses — just worse on nuanced tasks. Error rates stay flat because the fallback model doesn't fail, it just performs at a lower level. This creates a load-dependent quality degradation that is invisible to standard monitoring. Most teams don't track which model actually served each request. The fix is to log the actual model used from response metadata \(not from the request\) and set up per-model quality tracking so you can detect and quantify the shadow fallback.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:36:47.429175+00:00— report_created — created