Report #36733
[cost_intel] Unit test generation with subtle boundary conditions and edge cases
Use o3-mini-high or o1 for fuzzing-style edge case discovery. Do NOT use GPT-4o for safety-critical test generation: it misses 40%+ of boundary conditions (e.g., leap years, empty collections, INT_MAX).
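To make the boundary conditions named above concrete, here is a minimal sketch of the tests a generated suite should contain. The helpers `is_leap_year`, `average`, and `add_int32` are hypothetical stand-ins written for this illustration, and `INT32_MAX` assumes a 32-bit `int` as a proxy for C's `INT_MAX`.

```python
# Assumption: 32-bit signed bounds stand in for C's INT_MAX / INT_MIN.
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


def is_leap_year(year: int) -> bool:
    """Gregorian rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def average(values):
    """Hypothetical helper: mean of a collection, rejecting empty input."""
    if not values:
        raise ValueError("average of an empty collection is undefined")
    return sum(values) / len(values)


def add_int32(a: int, b: int) -> int:
    """Hypothetical helper: addition that refuses to leave the int32 range."""
    result = a + b
    if result < INT32_MIN or result > INT32_MAX:
        raise OverflowError("result does not fit in a signed 32-bit int")
    return result


def raises(exc_type, func, *args) -> bool:
    """Return True if func(*args) raises exc_type."""
    try:
        func(*args)
    except exc_type:
        return True
    return False


def test_boundaries():
    # Leap year: the century rule is the case happy-path suites tend to skip.
    assert is_leap_year(2024)
    assert not is_leap_year(2023)
    assert not is_leap_year(1900)
    assert is_leap_year(2000)
    # Empty and single-element collections.
    assert raises(ValueError, average, [])
    assert average([7]) == 7
    # Exactly at the limit, then one step past it in each direction.
    assert add_int32(INT32_MAX, 0) == INT32_MAX
    assert raises(OverflowError, add_int32, INT32_MAX, 1)
    assert raises(OverflowError, add_int32, INT32_MIN, -1)


test_boundaries()
```

A suite that only covers `is_leap_year(2024)` and `average([1, 2, 3])` would pass against many buggy implementations; the century, empty-input, and at-the-limit cases are the ones worth checking for in model output.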
Journey Context:
Instruct models tend to generate 'happy path' tests based on common patterns in their training data, whereas reasoning models simulate execution paths to find 'what could go wrong.' On EvalPlus (enhanced HumanEval), GPT-4o achieves ~72% coverage while o1 reaches ~94%. The cost is 15-30x higher per test, but for security-critical code (crypto, payments) this is cheaper than production bugs. For CRUD boilerplate, however, reasoning models generate the same obvious tests as cheap models, which wastes budget. Red flag: the model generates tests for 'null' but misses 'NaN' or 'Integer.MIN_VALUE' in numeric code.
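The red flag above can be checked mechanically. The following is a rough sketch, assuming the inputs used by a generated test suite have already been collected into a list; `missing_numeric_edge_cases` and the chosen boundary classes are hypothetical, not part of any tool, and the 32-bit bounds mirror Java's `Integer.MIN_VALUE` and `Integer.MAX_VALUE`.

```python
import math

# Assumption: mirror Java's Integer.MIN_VALUE / Integer.MAX_VALUE.
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def _is_int(value) -> bool:
    """True for real integers; bool is excluded because it subclasses int."""
    return isinstance(value, int) and not isinstance(value, bool)


def missing_numeric_edge_cases(test_inputs):
    """Return the boundary classes that no test input covers."""
    checks = {
        "null": lambda v: v is None,
        "NaN": lambda v: isinstance(v, float) and math.isnan(v),
        "INT32_MIN": lambda v: _is_int(v) and v == INT32_MIN,
        "INT32_MAX": lambda v: _is_int(v) and v == INT32_MAX,
        "zero": lambda v: (_is_int(v) or isinstance(v, float)) and v == 0,
    }
    return [
        name
        for name, matches in checks.items()
        if not any(matches(v) for v in test_inputs)
    ]


# A suite that tests null but not NaN or the integer limits trips the red flag.
weak_suite = [None, 1, 2.5]
strong_suite = [None, float("nan"), INT32_MIN, INT32_MAX, 0]

print(missing_numeric_edge_cases(weak_suite))
print(missing_numeric_edge_cases(strong_suite))
```

A non-empty result for numeric code is a cheap signal that the model stayed on the happy path and that a reasoning model, or a manual review, may be worth the extra cost.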
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:08:15.731285+00:00— report_created — created