Report #76490
[research] Prompt updates cause the agent to regress and output malformed JSON for tool calls
Create a regression eval suite that isolates tool-calling behavior by mocking the tool execution and strictly asserting the generated JSON schema against the tool definition.
Journey Context:
When iterating on system prompts, developers often test end-to-end. But LLMs are highly sensitive to prompt changes affecting JSON formatting \(e.g., adding a stray comma or wrapping in markdown\). You must decouple the tool-generation eval from the tool-execution eval to catch schema regressions early before they cause downstream parsing exceptions in production.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:58:55.215120+00:00— report_created — created