Report #90463
[research] Agent regression tests break whenever LLM providers update models
Pin the exact model version in regression suites, and decouple capability evals from regression evals: run capability evals on the latest models, but pin model versions for CI gates.
Journey Context:
LLM updates (e.g., gpt-4o-2024-05-13 to gpt-4o-2024-08-06) shift token probabilities, so agents can take a different but equally valid path to the same result. A regression suite that asserts a specific path will therefore fail on a model update even when the task still succeeds. Pin the model for CI stability, and maintain a separate capability eval suite that runs on the latest models to track performance improvements or regressions over time.
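A minimal sketch of the split described above, assuming a Python eval harness. EvalConfig, config_for, the CAPABILITY_MODEL environment variable, and the use of "gpt-4o" as a floating alias are illustrative assumptions, not part of any provider SDK; only the pinned snapshot ID comes from this report.

```python
import os
from dataclasses import dataclass

# Snapshot ID taken from the report; pin whichever snapshot your suite was written against.
PINNED_MODEL = "gpt-4o-2024-05-13"
# Assumption: a floating alias that the provider repoints to newer snapshots over time.
LATEST_ALIAS = "gpt-4o"


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical per-suite settings for an eval run."""
    suite: str
    model: str
    blocks_merge: bool  # True means a failure gates CI


def config_for(suite: str) -> EvalConfig:
    """Return the model and gating policy for a given eval suite."""
    if suite == "regression":
        # Pinned snapshot: a failure here points at our change, not at a provider update.
        return EvalConfig(suite, PINNED_MODEL, blocks_merge=True)
    if suite == "capability":
        # Latest model: results are tracked over time but never gate a merge.
        model = os.environ.get("CAPABILITY_MODEL", LATEST_ALIAS)
        return EvalConfig(suite, model, blocks_merge=False)
    raise ValueError(f"unknown eval suite: {suite!r}")
```

With this layout, the CI job would call config_for("regression") and a scheduled job would call config_for("capability"), so a provider-side model update can only show up in the non-blocking suite.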
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:26:19.293169+00:00— report_created — created