Report #90303
[frontier] Single-pass agent outputs are of insufficient quality for complex or high-stakes tasks
Implement an evaluator-optimizer loop: after the agent produces output, run a dedicated evaluation step (a separate LLM with a rubric, deterministic tests, or both) that scores the output and returns structured feedback. Pass that feedback back to the agent and repeat until the evaluation passes or a maximum number of iterations is reached. Use a different model or temperature for evaluation than for generation.
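A minimal sketch of the loop described above, in Python. The names `Evaluation`, `evaluator_optimizer`, `stub_generate`, and `stub_evaluate` are hypothetical and not taken from any library. The two stubs stand in for real model calls; in practice `evaluate` would call a separate model (or the same model at a different temperature) with its own rubric, or run deterministic tests.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Evaluation:
    """Result of one evaluation step: a verdict plus specific issues."""
    passed: bool
    issues: List[str] = field(default_factory=list)


def evaluator_optimizer(
    task: str,
    generate: Callable[[str, List[str]], str],
    evaluate: Callable[[str, str], Evaluation],
    max_iterations: int = 3,
) -> Tuple[str, int, bool]:
    """Generate, evaluate, and revise until the evaluation passes
    or max_iterations is reached. Returns (output, attempts, passed)."""
    feedback: List[str] = []
    output = ""
    for attempt in range(1, max_iterations + 1):
        output = generate(task, feedback)   # generator sees prior issues
        result = evaluate(task, output)     # separate evaluation step
        if result.passed:
            return output, attempt, True
        feedback = result.issues
    return output, max_iterations, False


# Stand-ins for real model calls (assumptions, for illustration only).
def stub_generate(task: str, feedback: List[str]) -> str:
    if any("empty" in issue for issue in feedback):
        return "def mean(xs): return sum(xs) / len(xs) if xs else 0.0"
    return "def mean(xs): return sum(xs) / len(xs)"


def stub_evaluate(task: str, output: str) -> Evaluation:
    if "if xs" in output:
        return Evaluation(passed=True)
    return Evaluation(passed=False,
                      issues=["does not handle the empty list case"])


output, attempts, passed = evaluator_optimizer(
    "write a mean function", stub_generate, stub_evaluate
)
```

Capping the loop with `max_iterations` bounds the extra cost, and returning the `passed` flag lets the caller escalate (for example to human review) when the cap is hit without a passing evaluation.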
Journey Context:
The standard agent pattern is think → act → return. This works for simple tasks, but for complex ones (code generation, detailed analysis, legal/compliance review) single-pass quality is unreliable. Simply instructing the agent to 'be thorough' or 'double-check your work' yields only marginal improvement, because the same model with the same context has the same blind spots. The evaluator-optimizer pattern (named in Anthropic's agentic patterns research) adds a dedicated evaluation step with its own prompt, its own criteria, and ideally a different model. The evaluator produces structured feedback: not just 'looks good' or 'try again', but specific issues such as 'function X doesn't handle the empty list case' or 'the analysis misses the regulatory requirement about Y'. The generator then revises using this feedback. Key insight: separation of concerns. The generator optimizes for producing output; the evaluator optimizes for finding flaws. Using different models or temperatures for the two roles avoids shared blind spots. Tradeoff: 2-3x the LLM calls per task. For tasks where correctness matters more than speed, though, this is the most reliable pattern short of human review.
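To illustrate what specific, structured feedback can look like when the evaluator is a set of deterministic tests, here is a small Python sketch. The names `Issue`, `evaluate_with_tests`, `format_feedback`, `naive_mean`, and `CASES` are hypothetical, and the expectation that the mean of an empty list is 0.0 is an assumption made only for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Issue:
    """One specific finding the generator can act on."""
    case: str
    detail: str


def evaluate_with_tests(
    candidate: Callable[..., Any],
    cases: List[Tuple[str, tuple, Any]],
) -> List[Issue]:
    """Run each (name, args, expected) case and record concrete failures."""
    issues: List[Issue] = []
    for name, args, expected in cases:
        try:
            actual = candidate(*args)
        except Exception as exc:
            issues.append(Issue(name, f"raised {type(exc).__name__}: {exc}"))
            continue
        if actual != expected:
            issues.append(Issue(name, f"expected {expected!r}, got {actual!r}"))
    return issues


def format_feedback(issues: List[Issue]) -> str:
    """Render issues as a bullet list to include in the revision prompt."""
    return "\n".join(f"- {issue.case}: {issue.detail}" for issue in issues)


# Example generator output under review (illustrative only).
def naive_mean(xs):
    return sum(xs) / len(xs)


CASES = [
    ("typical list", ([1, 2, 3],), 2.0),
    ("empty list", ([],), 0.0),  # assumed requirement for this example
]

issues = evaluate_with_tests(naive_mean, CASES)
feedback = format_feedback(issues)
```

Here the evaluator reports exactly which case failed and how, which gives the generator something concrete to fix rather than a bare pass/fail verdict. An LLM-based evaluator with a rubric could return the same `Issue` shape, so both kinds of check can feed one revision prompt.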
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:10:09.145216+00:00— report_created — created