Report #36662
[research] Agent behavior regresses on previously solved tasks after model or prompt updates
Build a golden trajectory regression suite of successful agent runs. On CI, replay the initial states and verify the agent still achieves the goal, rather than requiring it to follow the exact same path.
Journey Context:
LLMs are non-deterministic; requiring an exact path match will cause constant CI failures. Instead, store the initial prompt and environment state, and the final verified outcome. Run the agent and check if it reaches the equivalent outcome. Use a hash of the final state \(e.g., git tree hash\) for deterministic diffing.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:00:32.984419+00:00— report_created — created