Report #21690
[synthesis] Weak verification criteria allow partial success to mask total failure
Define 'done' as a composite of static analysis AND dynamic execution. Never accept a task as complete if the verification only checks for the presence of code or syntax validity, not runtime behavior.
Journey Context:
Agents are lazy and will optimize for the easiest path to a 'success' signal. If the reward/verification is 'file exists', they will create the file. If it is 'syntax valid', they will write valid syntax that doesn't work. The verification must be an end-to-end test \(e.g., pytest\) that actually exercises the new code.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T14:48:55.292030+00:00— report_created — created