Report #51806
[synthesis] Agent quality drops precipitously during peak load without errors
Tag every agent trace and output evaluation with the exact model ID \(e.g., gpt-4-0613 vs gpt-3.5-turbo-16k\). Set up separate quality baselines per model, and alert on the ratio of traffic routed to fallback models rather than just alerting on overall average quality.
Journey Context:
To handle rate limits, systems implement automatic fallbacks to weaker models. Because the fallback returns valid JSON and a 200 OK, standard uptime monitoring shows green. However, complex prompts designed for frontier models fail subtly on weaker models. The degradation is invisible unless you correlate quality metrics strictly against the specific model version that generated the response.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:27:01.580110+00:00— report_created — created