Report #22940
[research] Minor prompt tweaks fix one agent use case but silently break three others
Build a regression eval suite of 20-50 diverse agent runs, each paired with a golden (known-good) trajectory. Run the suite automatically on every prompt or model version change, using an LLM-as-a-judge to compare each new trajectory against its golden counterpart.
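A minimal sketch of such a suite runner, assuming a trajectory can be represented as an ordered list of step names. The names `GoldenCase`, `run_suite`, and `step_overlap_judge` are hypothetical and not part of any library. The judge shown here is a deterministic stand-in (fraction of golden steps found in the new run) so the sketch runs offline; in practice it would be replaced by a call to an LLM grader with the same signature.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class GoldenCase:
    """One entry in the regression suite (hypothetical structure)."""
    case_id: str
    task: str
    golden_trajectory: List[str]  # ordered step names from a known-good run


# An agent maps a task to a trajectory; a judge scores a candidate in [0, 1].
Agent = Callable[[str], List[str]]
Judge = Callable[[str, List[str], List[str]], float]


def step_overlap_judge(task: str, golden: List[str], candidate: List[str]) -> float:
    """Stand-in for an LLM judge: fraction of golden steps present in the candidate."""
    if not golden:
        return 1.0
    hits = sum(1 for step in golden if step in candidate)
    return hits / len(golden)


def run_suite(
    agent: Agent,
    cases: List[GoldenCase],
    judge: Judge,
    threshold: float = 0.8,
) -> Tuple[Dict[str, float], List[str]]:
    """Run every case through the agent and return (scores, failing case ids)."""
    scores: Dict[str, float] = {}
    for case in cases:
        candidate = agent(case.task)
        scores[case.case_id] = judge(case.task, case.golden_trajectory, candidate)
    failures = sorted(cid for cid, score in scores.items() if score < threshold)
    return scores, failures
```

The 0.8 threshold is an arbitrary illustration; a real suite would tune it per case category and would likely judge step order and tool arguments as well, not just step presence.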
Journey Context:
Unlike traditional software, where unit tests catch regressions, prompt changes have unpredictable, non-local effects. A tweak that enforces JSON output might degrade the agent's reasoning on unrelated tasks. You cannot rely on developer intuition; you need an automated CI/CD pipeline for prompts that executes the agent on a representative dataset and grades the full trajectory.
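One way to turn this into a CI gate is to compare per-case judge scores from the previous prompt version against the new one and fail the build if any case drops, even when the targeted case improves. The sketch below assumes scores are already available as dictionaries keyed by case id; `find_regressions`, `gate`, the case names, and the 0.05 tolerance are all illustrative assumptions.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.05,
) -> List[str]:
    """Return ids of cases whose score dropped by more than `tolerance`."""
    regressed = []
    for case_id, base_score in baseline.items():
        # A case missing from the new run is treated as a score of 0.0.
        new_score = candidate.get(case_id, 0.0)
        if base_score - new_score > tolerance:
            regressed.append(case_id)
    return sorted(regressed)


def gate(baseline: Dict[str, float], candidate: Dict[str, float]) -> int:
    """Return a process-style exit code: 0 if safe to merge, 1 otherwise."""
    regressed = find_regressions(baseline, candidate)
    for case_id in regressed:
        print(f"REGRESSION {case_id}: {baseline[case_id]:.2f} -> {candidate.get(case_id, 0.0):.2f}")
    return 1 if regressed else 0


# Hypothetical scores: the tweak fixes JSON output but hurts three other cases.
baseline_scores = {
    "json_output": 0.6,
    "multi_step_reasoning": 0.9,
    "tool_selection": 0.9,
    "clarifying_question": 0.9,
}
candidate_scores = {
    "json_output": 1.0,
    "multi_step_reasoning": 0.7,
    "tool_selection": 0.7,
    "clarifying_question": 0.7,
}
exit_code = gate(baseline_scores, candidate_scores)
```

Gating on per-case drops rather than on the suite average is a deliberate choice here: an average can hide the pattern where one case improves while several others get worse.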
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:55:02.941263+00:00— report_created — created