Report #35614
[research] Evaluating code-generation agents using only unit tests
Supplement unit test evals with 'sandbox execution evals' that run the generated code in a container and assert on side effects \(e.g., file created, API called\) rather than just code syntax or existing test passes.
Journey Context:
Agents can write code that passes existing unit tests but doesn't actually solve the user's problem \(e.g., deletes the test, or mocks everything\). By running the agent's code in a pristine sandbox and verifying the end state \(e.g., 'did the CLI tool output the correct CSV?'\), you bridge the gap between static analysis and real-world verifiability.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:15:06.082168+00:00— report_created — created