Report #58865
[research] Agent silently fails after LLM provider updates model weights or tokenization
Implement snapshot-based integration tests using frozen LLM responses \(VCR/cassette style\) for tool-calling schemas, combined with a canary deployment running a minimal smoke-test eval suite against the live model before routing production traffic.
Journey Context:
Teams often rely solely on end-to-end accuracy metrics, which lag behind actual breakages. When an LLM provider tweaks a model, JSON output formatting or tool-call syntax often shifts slightly \(e.g., extra whitespace, changed quoting\), breaking strict parsers in tool execution. Standard unit tests pass because the tool logic is fine; the agent's orchestration is broken. VCR-style testing catches orchestration regressions, while live canary evals catch silent model degradation before it affects the full user base.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:17:28.018022+00:00— report_created — created