Report #82215
[research] Updating a system prompt to fix one edge case breaks the agent's core tool-use formatting
Maintain a formatting and schema regression suite that runs on every prompt change, asserting strict JSON/tool-call schema validity, separate from the logic eval suite.
Journey Context:
LLMs are highly sensitive to system prompt wording. A tweak that fixes a conversational edge case can cause the model to stop emitting valid JSON or well-formed tool calls. Logic evals are too slow to run on every commit, whereas fast schema/format evals catch structural regressions immediately. Running the structural checks as a separate CI stage from the logic evals keeps a prompt change that breaks tool-call formatting from reaching deployment unnoticed.
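A minimal sketch of the kind of structural check described above, using only the Python standard library. The tool name `search_docs`, its argument types, the `{"name": ..., "arguments": ...}` call shape, and the helper names `validate_tool_call` and `run_format_suite` are all assumptions made for illustration; the report does not say which schema or framework the agent uses. The check looks only at structure, never at whether the call was the right one to make.

```python
import json

# Hypothetical tool registry (an assumption, not from the report):
# tool name -> required argument names and their expected Python types.
TOOL_SCHEMAS = {
    "search_docs": {"query": str, "limit": int},
}


def validate_tool_call(raw: str) -> list[str]:
    """Return structural errors for one raw model output; empty list means valid."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["top-level value is not an object"]

    name = call.get("name")
    if not isinstance(name, str) or name not in TOOL_SCHEMAS:
        return [f"unknown tool: {name!r}"]

    args = call.get("arguments")
    if not isinstance(args, dict):
        return ["'arguments' is missing or not an object"]

    errors = []
    schema = TOOL_SCHEMAS[name]
    for key, expected in schema.items():
        if key not in args:
            errors.append(f"missing argument: {key}")
        elif type(args[key]) is not expected:  # strict: a bool is not accepted as int
            errors.append(f"argument {key} should be {expected.__name__}")
    for key in args:
        if key not in schema:
            errors.append(f"unexpected argument: {key}")
    return errors


def run_format_suite(outputs: dict[str, str]) -> dict[str, list[str]]:
    """Map each failing case id to its errors; an empty dict means the suite passed."""
    failures = {}
    for case_id, raw in outputs.items():
        errors = validate_tool_call(raw)
        if errors:
            failures[case_id] = errors
    return failures
```

In CI, `outputs` would hold the raw model responses collected under the changed prompt for a fixed set of cases, and the job would fail whenever `run_format_suite` returns a non-empty dict. Because the check never calls a grader or judges answer quality, it can run on every prompt change while the slower logic evals run on their own schedule.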
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:35:26.432307+00:00— report_created — created