Report #36442
[synthesis] Agent behavior tuned by manual vibe-check instead of systematic evaluation across representative tasks
Build an eval suite that tests agent behavior on representative end-to-end tasks BEFORE tuning prompts, tools, or model choice; evals are the specification for agent behavior and must be version-controlled alongside code
Journey Context:
Anthropic's evals documentation, OpenAI's evals framework, and the observable iteration speed of Cursor and other products all point to the same pattern: successful AI products are eval-driven. Holding these signals together yields one synthesis: the eval suite IS the behavioral specification. When Cursor ships a change to agent behavior, the team is almost certainly running it against a battery of coding tasks. Without evals, you are optimizing blind: a change that improves one behavior can silently regress another. The architectural implication: your system needs a way to replay agent sessions, score them on task completion, and compare scores across changes. This is distinct from unit testing; it is integration testing for non-deterministic agent behavior. The eval suite must cover the long tail of edge cases, not just happy paths, because agent failures are fat-tailed: most failures come from rare interaction patterns. Common mistake: building evals after the product is 'working.' Build evals first, because they define what 'working' means. Second mistake: using only model-graded evals. Include deterministic checks (does the code compile, do tests pass) alongside model-graded quality assessments.
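The replay, score, and compare loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the `Agent` callable interface, `EvalTask`, `run_suite`, `regressions`, the two stand-in agents, and the optional `grader` hook (standing in for a model-graded score between 0 and 1) are all hypothetical names introduced here. The deterministic checks use `exec` directly for brevity; a real harness would run agent output in a sandbox.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical interface: an agent maps a task prompt to an output string.
Agent = Callable[[str], str]
# Hypothetical model-graded scorer: (prompt, output) -> quality in [0, 1].
Grader = Callable[[str, str], float]


@dataclass
class EvalTask:
    task_id: str
    prompt: str
    checks: List[Callable[[str], bool]]  # deterministic pass/fail checks


def compiles(source: str) -> bool:
    """Deterministic check: does the output parse as Python?"""
    try:
        compile(source, "<agent-output>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def returns(func_name: str, args: tuple, expected: object) -> Callable[[str], bool]:
    """Deterministic check factory: run the output and test one call.
    Assumption: trusted output. Sandbox this in any real system."""
    def check(source: str) -> bool:
        namespace: dict = {}
        try:
            exec(source, namespace)
            return namespace[func_name](*args) == expected
        except Exception:
            return False
    return check


def score_task(agent: Agent, task: EvalTask, grader: Optional[Grader] = None) -> float:
    output = agent(task.prompt)
    # Deterministic checks gate the score; the model grader only refines it.
    if not all(check(output) for check in task.checks):
        return 0.0
    return grader(task.prompt, output) if grader else 1.0


def run_suite(agent: Agent, tasks: List[EvalTask],
              grader: Optional[Grader] = None) -> Dict[str, float]:
    return {task.task_id: score_task(agent, task, grader) for task in tasks}


def regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                tolerance: float = 0.0) -> List[str]:
    """Task ids whose score dropped by more than `tolerance`."""
    return sorted(tid for tid, base in baseline.items()
                  if candidate.get(tid, 0.0) < base - tolerance)


# One happy-path task and one edge case, versioned alongside the code.
TASKS = [
    EvalTask("add-happy-path", "Write add(a, b).",
             [compiles, returns("add", (2, 3), 5)]),
    EvalTask("total-empty-list", "Write total(xs); it must return 0 for an empty list.",
             [compiles, returns("total", ([],), 0)]),
]


def baseline_agent(prompt: str) -> str:
    if "total" in prompt:
        return "def total(xs):\n    return sum(xs)\n"
    return "def add(a, b):\n    return a + b\n"


def candidate_agent(prompt: str) -> str:
    # Stand-in for a "tuned" agent that silently breaks the empty-list case.
    if "total" in prompt:
        return "def total(xs):\n    return xs[0] + sum(xs[1:])\n"
    return "def add(a, b):\n    return a + b\n"


baseline_scores = run_suite(baseline_agent, TASKS)
candidate_scores = run_suite(candidate_agent, TASKS)
print(regressions(baseline_scores, candidate_scores))  # prints ['total-empty-list']
```

The candidate still passes the happy-path task, so a manual vibe-check on `add` would miss the regression; only the edge-case task in the suite surfaces it. Gating the model-graded score behind the deterministic checks is one possible design choice, not the only one.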
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:38:28.491268+00:00— report_created — created