Report #35654
[synthesis] Why does my AI feature work perfectly in testing but fail unpredictably for different users in production?
Test with a diverse prompt corpus representing the full distribution of real user inputs, not developer test cases. Implement production prompt logging and replay testing. Track per-user-segment success rates, not just aggregate rates. Invest in red-teaming with adversarial and edge-case inputs that represent your production long tail.
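Production prompt logging and replay testing can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the `LoggedPrompt` schema, the `replay` helper, the segment labels, and the stub model and checker are all hypothetical names introduced here. A real setup would call the deployed model and a task-specific evaluator.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class LoggedPrompt:
    """One prompt captured from production (hypothetical schema)."""
    prompt: str
    segment: str  # e.g. locale or customer tier; the labels below are made up


def replay(
    logged: Iterable[LoggedPrompt],
    model: Callable[[str], str],
    passes: Callable[[str, str], bool],
) -> List[LoggedPrompt]:
    """Re-run every logged prompt and return the ones whose output fails the check."""
    failures = []
    for item in logged:
        output = model(item.prompt)
        if not passes(item.prompt, output):
            failures.append(item)
    return failures


# Stand-ins for illustration only: the stub "model" returns nothing for very
# short prompts, mimicking a system that handles terse input badly.
def stub_model(prompt: str) -> str:
    return "" if len(prompt.split()) < 3 else "answer to: " + prompt


def non_empty(prompt: str, output: str) -> bool:
    return bool(output.strip())


corpus = [
    LoggedPrompt("How do I reset my password?", "well_formed"),
    LoggedPrompt("pwd reset??", "terse"),
    LoggedPrompt("hi", "terse"),
]

failed = replay(corpus, stub_model, non_empty)
for item in failed:
    print(item.segment, repr(item.prompt))
```

Passing the model and the checker as plain callables keeps the replay loop independent of any particular provider, so the same logged corpus can be re-run after each prompt or model change.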
Journey Context:
Traditional software has 'works on my machine': environment differences. AI has 'works on my prompt': output quality depends on the exact input phrasing, context, and conversation history, so two users asking 'the same question' in different words can get answers of different quality. QA therefore has to be distributional, not point-based. The synthesis: conventional software testing assumes deterministic behavior for a given input, whereas an AI system's quality varies across the input distribution in ways that are hard to predict. Developer test cases are systematically biased toward well-formed, clear prompts, which are exactly the inputs AI handles well. Production inputs are messy, ambiguous, and culturally varied. The key insight: user trust is determined by worst-case experiences, not the average case, so you must test the tails. Aggregate success rates hide the segments where the AI fails consistently.
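To illustrate how an aggregate rate can hide a consistently failing segment, here is a small sketch. The helper names, the segment labels, and the counts are invented for illustration and are not measurements from any real system.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Result = Tuple[str, bool]  # (segment label, whether the output was acceptable)


def aggregate_rate(results: List[Result]) -> float:
    """Overall fraction of acceptable outputs, ignoring segments."""
    return sum(ok for _, ok in results) / len(results)


def segment_rates(results: Iterable[Result]) -> Dict[str, float]:
    """Fraction of acceptable outputs within each segment."""
    totals: Dict[str, int] = defaultdict(int)
    passed: Dict[str, int] = defaultdict(int)
    for segment, ok in results:
        totals[segment] += 1
        passed[segment] += int(ok)
    return {seg: passed[seg] / totals[seg] for seg in totals}


# Made-up data: 90 clear prompts that all succeed, 10 code-switched prompts
# of which only 2 succeed.
sample: List[Result] = (
    [("clear_english", True)] * 90
    + [("code_switched", True)] * 2
    + [("code_switched", False)] * 8
)

overall = aggregate_rate(sample)
by_segment = segment_rates(sample)
worst_segment = min(by_segment, key=by_segment.get)

print(f"aggregate: {overall:.2f}")
for seg, rate in sorted(by_segment.items()):
    print(f"{seg}: {rate:.2f}")
print("worst segment:", worst_segment)
```

With this invented sample the aggregate is 92% while the smaller segment succeeds only 20% of the time, which is the pattern an aggregate-only dashboard would miss. Reporting the minimum over segments alongside the aggregate is one simple way to keep the tail visible.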
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:19:07.107137+00:00— report_created — created