Report #87860
[counterintuitive] Why does AI code generation work on examples but fail on my real codebase?
Evaluate AI on your actual codebase patterns, not on benchmarks. When your code uses uncommon libraries, novel architectures, or domain-specific patterns, reduce AI autonomy and increase human verification. Provide codebase-specific context, internal API signatures, and architectural constraints in prompts. Treat benchmark performance as an upper bound, not a predictor.
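One way to put internal API signatures and architectural constraints into a prompt is to generate that context from the code itself, so it cannot drift from what the codebase actually exposes. The sketch below is a minimal illustration, not a prescribed method: `reserve_stock`, `describe_api`, `build_prompt`, and the prompt wording are all hypothetical names made up for this example.

```python
import inspect


def reserve_stock(sku: str, qty: int) -> bool:
    """Reserve qty units of sku; return False if stock is insufficient."""
    return qty > 0  # placeholder body standing in for a hypothetical internal API


def describe_api(functions):
    """Render one 'name(signature)  # summary' line per function."""
    lines = []
    for fn in functions:
        doc = inspect.getdoc(fn) or ""
        summary = doc.splitlines()[0] if doc else "(no docstring)"
        lines.append(f"{fn.__name__}{inspect.signature(fn)}  # {summary}")
    return "\n".join(lines)


def build_prompt(task, functions, constraints):
    """Assemble the API listing, the constraints, and the task into one prompt."""
    parts = [
        "Internal API (call only these, with exactly these signatures):",
        describe_api(functions),
        "Architectural constraints:",
        *[f"- {c}" for c in constraints],
        "Task:",
        task,
    ]
    return "\n".join(parts)


prompt = build_prompt(
    task="Add a function that reserves stock for every line of an order.",
    functions=[reserve_stock],
    constraints=["No direct database access outside the inventory module."],
)
print(prompt)
```

Reading signatures with `inspect.signature` rather than typing them by hand keeps the prompt in sync with the code. It does not guarantee the model will respect them, so human verification is still needed.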
Journey Context:
AI models are trained on code that follows common patterns from popular open-source repositories. They perform well on code resembling their training distribution (standard CRUD apps, common frameworks, well-documented public APIs). On code that shifts away from this distribution (internal or proprietary APIs, domain-specific abstractions, unusual architectural patterns, novel library combinations), they fail catastrophically: not just slightly worse, but qualitatively differently. The failure mode is dangerous because the model doesn't recognize that it's out of distribution; it generates plausible-looking code with confidently wrong API calls or logic. A human developer working in unfamiliar code at least feels uncertainty. The performance cliff is sharp, not gradual: there's no reliable warning that you've crossed from in-distribution to out-of-distribution territory.
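Because the model gives no warning of its own, a cheap mechanical check can catch at least one symptom: calls to API names that do not exist. The sketch below is a minimal illustration that uses Python's standard `ast` module to list calls of the form `alias.name(...)` whose `name` is not in a known set. The module alias `inventory`, the function `unknown_calls`, and the sample generated code are hypothetical.

```python
import ast


def unknown_calls(generated_code, module_alias, known_names):
    """Return sorted names called as `module_alias.<name>(...)` that are not known."""
    tree = ast.parse(generated_code)
    missing = set()
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == module_alias
            and node.func.attr not in known_names
        ):
            missing.add(node.func.attr)
    return sorted(missing)


# Hypothetical AI output: one real call and one invented, plausible-looking call.
generated = (
    "import inventory\n"
    "inventory.reserve_stock('A1', 2)\n"
    "inventory.release_all()\n"
)

suspect = unknown_calls(generated, "inventory", {"reserve_stock"})
print(suspect)  # ['release_all']
```

This only flags names that do not exist. It says nothing about wrong arguments or wrong logic, so it narrows the human review rather than replacing it.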
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:03:38.953868+00:00— report_created — created