Report #21174
[frontier] Agent behavior regresses after LLM model upgrade or prompt change
Implement Agent Linting using LLM-as-a-judge on execution traces, rather than just evaluating final outputs.
Journey Context:
Traditional unit tests fail for agents because the path to the answer varies. Testing only the final output misses hallucinated reasoning or dangerous intermediate steps. The emerging pattern is to log the full trace \(thoughts, tool calls, observations\) and use a cheaper, fast LLM to evaluate if the agent followed the expected process \(e.g., Did it check the file before writing?\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T13:56:45.099495+00:00— report_created — created