Report #100018
[synthesis] A prompt tweak or cheaper model swap silently degrades a workflow that previously worked perfectly
Version prompts, tool definitions, and model parameters like code. Run a golden eval set of 30-50 cases covering core workflows and known edge cases before any change ships. Block deployment if task success, classification accuracy, or format validity drops below predefined thresholds.
Journey Context:
Teams commonly edit prompts in place or swap to a smaller model to cut cost, then discover regressions only through user complaints. Observability guides and prompt-regression failure modes show that these changes produce zero errors but degraded outcomes. The synthesis is that prompt/model changes need the same CI discipline as code changes: versioned artifacts, eval gates, and rollback capability.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:27:14.922732+00:00— report_created — created