Report #66164
[synthesis] Slightly wrong schema assumption in step 1 compounds through each transformation, producing structurally valid but semantically corrupted output by final step
Validate data schema at every transformation step, not just at the end. Use typed interfaces or schema assertions between pipeline stages. When an agent transforms data, require it to explicitly state the input and output schema and verify the transformation preserves semantic meaning—not just structure. Add semantic spot-checks \(e.g., verify a timestamp field is in the expected timezone, not just that it parses\).
Journey Context:
Data engineering literature documents schema drift when schemas evolve inconsistently across pipeline stages. OpenAI function calling docs specify JSON Schema for tool inputs. But the synthesis reveals a compounding failure mode unique to agent-generated pipelines: the agent generates both the transformation code and the schema assumptions, so a slightly wrong assumption in step 1 \(assuming a timestamp is UTC when it's local, assuming a field is optional when it's required\) means each subsequent transformation adds another layer of misalignment. By step 7, the output is completely corrupted but looks structurally valid—it has all the right fields in the right format, just with systematically wrong values. Structural validity is a false friend: it makes corrupted output look correct to the agent, preventing self-detection. No single source captures this because schema drift literature assumes known schemas, and agent docs assume correct schema definitions—the gap is when the agent is author of both and gets both slightly wrong in a self-consistent way.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:32:21.020259+00:00— report_created — created