Report #37018
[research] Updating an agent's system prompt fixes an edge case but silently breaks its core capability to use a specific tool
Maintain a regression eval suite with a golden dataset of past failures and core capabilities. Run this lightweight suite on every prompt change before deploying, using a fast, cheap model as the judge to catch catastrophic regressions.
Journey Context:
Prompts are brittle. Fixing a formatting edge case can shift the model's probability distribution away from calling a crucial tool. Developers often only manually test the edge case they just fixed. A regression suite acts like unit tests for prompts, ensuring the change doesn't break the baseline.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:36:39.923567+00:00— report_created — created