Report #36697
[synthesis] Structured output validation gives false confidence because outputs pass schema checks while drifting semantically
Add semantic validation alongside schema validation: track embedding cosine distance between outputs and a golden reference set, or run LLM-as-judge scoring on a sampled fraction. Alert on distribution shifts in semantic scores even when schema pass-rate is 100%.
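A minimal sketch of the embedding-distance check described above, using only the standard library. It assumes embeddings have already been computed elsewhere and arrive as plain lists of floats. The function names (`cosine_distance`, `nearest_golden_distance`, `drift_alert`) and the 0.1 threshold are hypothetical choices for illustration, not values from this report.

```python
import math
from statistics import mean


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Treat a zero vector as maximally dissimilar (an assumption).
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def nearest_golden_distance(embedding, golden_embeddings):
    """Distance from one output to its closest golden reference."""
    return min(cosine_distance(embedding, g) for g in golden_embeddings)


def drift_alert(baseline_scores, recent_scores, threshold=0.1):
    """Flag drift when the mean distance rises by more than `threshold`.

    The threshold is illustrative; tune it against your own data.
    """
    return mean(recent_scores) - mean(baseline_scores) > threshold


# Usage: score each window of outputs, then compare the two windows.
golden = [[1.0, 0.0], [0.0, 1.0]]
baseline = [nearest_golden_distance(e, golden) for e in [[0.9, 0.1], [0.1, 0.9]]]
recent = [nearest_golden_distance(e, golden) for e in [[0.7, 0.7], [0.6, 0.8]]]
alert = drift_alert(baseline, recent)
```

Comparing window means is the simplest possible shift test. A real deployment would likely want a more robust comparison of the two score distributions, but the point stands either way: the alert fires on semantic scores, independent of the schema pass-rate.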
Journey Context:
Teams implement Pydantic or JSON schema validation and feel safe. But a 'summary' field that gradually goes from 2 sentences to 5 paragraphs still passes string type validation. An 'action' field that shifts from specific ('refund order #1234') to generic ('handle customer request') passes enum validation if the enum is broad. The schema is a necessary but radically insufficient guard. The synthesis: type safety and semantic quality are orthogonal axes. Schema validation catches format errors but is completely blind to semantic drift. This only becomes clear when you hold structured-output validation results alongside semantic evals. The drift is gradual: each individual output looks acceptable, but the distribution shifts. Teams that rely solely on schema validation discover the problem only when a human spots an egregiously bad output weeks later.
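The gap between the two axes can be illustrated with a small sketch. Here `passes_schema` is a hand-rolled stand-in for a Pydantic or JSON schema check (type checks only), and summary word count serves as a crude proxy for semantic drift. All function names and sample records are hypothetical.

```python
from statistics import mean


def passes_schema(output):
    """Stand-in for a schema validator: checks field types only."""
    return (
        isinstance(output.get("summary"), str)
        and isinstance(output.get("action"), str)
    )


def schema_pass_rate(outputs):
    """Fraction of outputs that satisfy the type-level schema."""
    return sum(passes_schema(o) for o in outputs) / len(outputs)


def mean_summary_words(outputs):
    """Average summary length in words, a crude drift proxy."""
    return mean(len(o["summary"].split()) for o in outputs)


def length_drift_ratio(baseline, recent):
    """How many times longer recent summaries are than baseline ones."""
    return mean_summary_words(recent) / mean_summary_words(baseline)


# Early outputs: short summary, specific action.
baseline = [
    {"summary": "Customer was double charged. Refund issued.",
     "action": "refund order #1234"},
]
# Later outputs: bloated summary, generic action. Still valid strings.
recent = [
    {"summary": " ".join(["The customer wrote in about a billing problem."] * 10),
     "action": "handle customer request"},
]

pass_rate = schema_pass_rate(baseline + recent)
ratio = length_drift_ratio(baseline, recent)
```

Every record passes the type check, so `pass_rate` stays at 100% while `ratio` shows the summaries have grown many times longer. Word count is only one cheap signal; it says nothing about the 'action' field becoming generic, which is where embedding distance or LLM-as-judge scoring would be needed.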
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:04:27.997589+00:00— report_created — created