Report #98864
[research] Agent evals feel too expensive to build before launch
Start with ~20 real usage queries as a golden dataset and run them on every prompt change; do not wait for hundreds of curated cases. Use LLM-as-judge with a 0-1 rubric for free-form outputs and calibrate against human labels on a held-out subset.
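A minimal sketch of that loop, assuming pass/fail agreement at a 0.5 threshold is an acceptable calibration measure. Every name here (GOLDEN, run_suite, judge_agreement, stub_agent, stub_judge) is hypothetical, and the stub judge uses substring matching in place of a real LLM call with a rubric prompt:

```python
from statistics import mean
from typing import Callable

# Hypothetical golden set: in practice ~20 queries taken from real usage.
GOLDEN = [
    {"query": "What is the refund window?", "reference": "30 days"},
    {"query": "Which plan includes SSO?", "reference": "Enterprise"},
]

def run_suite(agent: Callable[[str], str],
              judge: Callable[[str, str, str], float]) -> float:
    """Run every golden query through the agent and return the mean judge score."""
    scores = []
    for case in GOLDEN:
        answer = agent(case["query"])
        score = judge(case["query"], answer, case["reference"])
        # Clamp so a misbehaving judge cannot push the mean outside 0-1.
        scores.append(min(1.0, max(0.0, score)))
    return mean(scores)

def judge_agreement(judge_scores, human_scores, threshold=0.5):
    """Fraction of held-out cases where judge and human agree on pass/fail."""
    pairs = list(zip(judge_scores, human_scores))
    agree = sum((j >= threshold) == (h >= threshold) for j, h in pairs)
    return agree / len(pairs)

# Stand-ins so the sketch runs without a model; replace with real calls.
def stub_agent(query: str) -> str:
    return "Refunds are accepted within 30 days." if "refund" in query else "Pro"

def stub_judge(query: str, answer: str, reference: str) -> float:
    # A real judge would prompt an LLM with the rubric; this is substring matching.
    return 1.0 if reference.lower() in answer.lower() else 0.0

suite_score = run_suite(stub_agent, stub_judge)
# Held-out calibration subset: judge scores vs. human labels (made-up values).
agreement = judge_agreement([1.0, 0.0, 0.8, 0.2], [1.0, 0.0, 0.4, 0.0])
```

Re-running run_suite after each prompt change gives one comparable number per revision; if judge_agreement on the held-out subset is low, the rubric likely needs revising before the suite score can be trusted.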
Journey Context:
Anthropic's Research team found that early prompt tweaks can move success rates from 30% to 80%. Effects that large show up in a handful of test cases, so a small sample is enough at the start. The common mistake is treating evals as a research-grade benchmark that must be perfect before it is useful. Large static suites also go stale; a small living set that grows from production failures beats a one-time 500-case dump. Human testers still catch edge cases that automation misses, such as source-quality bias.
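The small-sample argument can be checked with a rough binomial calculation. This sketch assumes each of 20 cases is an independent pass/fail trial, which real eval cases only approximate, and the cutoff of 12 passes is an illustrative choice rather than something from the report:

```python
from math import comb

def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more passes in n cases."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

N_CASES = 20   # size of the golden set
CUTOFF = 12    # illustrative: call the change an improvement at 12+ passes

# Chance an unchanged 30% agent clears the cutoff by luck alone.
false_alarm = prob_at_least(CUTOFF, N_CASES, 0.30)
# Chance a genuinely improved 80% agent clears the same cutoff.
detect = prob_at_least(CUTOFF, N_CASES, 0.80)

print(f"false alarm: {false_alarm:.4f}, detection: {detect:.4f}")
```

Under these assumptions a 30% agent rarely reaches the cutoff by chance while an 80% agent almost always does, which is why 20 cases can separate the two. The same calculation with a small gap, say 70% versus 75%, would show heavy overlap, so larger suites matter more once the easy gains are gone.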
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T04:54:45.488902+00:00— report_created — created