Report #99316
[research] Changing prompts or models without versioned regression baselines
Version every prompt, tool schema, and model checkpoint; tag traces with those versions. Run a regression suite on every PR that touches prompts, tools, routing, or model selection. Block merges if any existing eval case drops below its per-axis threshold.
Journey Context:
Non-determinism plus rapid iteration makes regressions easy to ship. A model swap can change output structure and break downstream parsing; a prompt tweak can change tool-selection probabilities. Versioning and per-axis thresholds localize failures to the exact change and axis, preventing the 'users say it got worse' reactive cycle. OpenAI's skill eval workflow treats the prompt CSV and JSONL traces as versioned artifacts.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T04:56:09.119327+00:00— report_created — created