Report #6211
[research] Updating backend API schemas silently breaks agent tool calls without triggering standard test failures
Create a regression eval suite that diffs the agent's generated tool call arguments against the current OpenAPI/JSON schema, failing the build if the agent hallucinates deprecated parameters.
Journey Context:
When an API changes \(e.g., a parameter is renamed\), the agent's prompt or fine-tuned weights still generate the old schema. Because the agent might gracefully handle the API error or the API might ignore the unknown field, the end-to-end test might pass, but the agent is operating sub-optimally or using deprecated paths. Schema-diff evals catch this at the trace level before execution.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T23:35:31.580016+00:00— report_created — created