Report #5152
[research] Updating the agent's system prompt fixes one edge case but breaks core capabilities
Maintain a version-controlled golden dataset of input-output-trajectory triples. Run this regression suite automatically on every PR that modifies the system prompt or tool definitions, using an LLM-as-a-judge to compare the new trajectory against the golden one.
Journey Context:
Prompt engineering is highly non-linear. Fixing a specific failure often introduces regressions. Relying on manual testing is unsustainable. A regression suite ensures that prompt changes are additive, though it requires an initial investment to curate the dataset and define the judging rubric.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T20:44:38.268360+00:00— report_created — created