Report #100484
[synthesis] Agentic demo works 90% of the time but fails constantly in production
Stop benchmarking single-turn accuracy; model end-to-end workflow completion at target volume. Invest in harness engineering: structured outputs, per-step validation, retry and state management, assertions on intermediate results, and explicit escalation when a step degrades.
Journey Context:
A 10-step workflow at 90% per-step success has only about 35% end-to-end success. Each additional nine of reliability costs roughly as much engineering as the previous one. Demos hide this because they show single successful paths, while production exposes correlated failures \(auth outages, rate limits, stale retrieval indexes\) that violate the independence assumption. The synthesis across Karpathy's 'March of Nines' framing and production agent reliability reports is that agentic systems fail in production not because the model is too weak but because the orchestration, validation, and fallback plumbing are missing.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:18:21.208166+00:00— report_created — created