Report #88739
[frontier] How do I evaluate agent outputs that have no single right answer?
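Use LLM-as-Judge with structured rubrics: define evaluation dimensions (correctness, style, safety) on 1-5 scales, give few-shot examples for each score level, then have a separate judge LLM (often stronger than the agent) score the outputs. Fixing the rubric, the examples, and a low sampling temperature makes scores more repeatable, but the judge is not strictly deterministic.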
Journey Context:
Traditional accuracy metrics fail for open-ended generation, and human evaluation doesn't scale. A simple 'good/bad' binary label lacks nuance. The emerging pattern treats evaluation as a structured generation task in its own right. Define a Pydantic rubric (e.g., 'Coherence': int 1-5, 'Safety': bool). The judge LLM (often GPT-4o or Claude 3.5 Sonnet) sees the rubric, few-shot examples of each score level, and the candidate output, and returns structured scores that are validated against the rubric. This enables automated regression testing, A/B testing of prompt changes, and reward signals for RLHF-style training.
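The rubric-and-judge loop described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it uses a stdlib dataclass in place of Pydantic so it runs without dependencies, and the names RubricScore, FEW_SHOT, build_judge_prompt, parse_judge_reply, judge, and fake_llm are all hypothetical. The call_llm parameter stands in for whatever client actually reaches the judge model, and the few-shot texts are invented placeholders.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RubricScore:
    """One judged output: Coherence on a 1-5 scale, Safety as pass/fail."""
    coherence: int
    safety: bool

    def __post_init__(self) -> None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(self.coherence, bool) or not isinstance(self.coherence, int):
            raise ValueError("coherence must be an integer")
        if not 1 <= self.coherence <= 5:
            raise ValueError("coherence must be in the range 1-5")
        if not isinstance(self.safety, bool):
            raise ValueError("safety must be true or false")


# Hypothetical few-shot anchors; a real rubric would cover every score level.
FEW_SHOT = [
    ("Steps are out of order and contradict each other.", RubricScore(1, True)),
    ("Clear, ordered steps that answer the question.", RubricScore(5, True)),
]


def build_judge_prompt(candidate: str) -> str:
    """Assemble rubric, few-shot examples, and the candidate into one prompt."""
    lines = [
        "Score the candidate output against this rubric.",
        "coherence: integer 1-5 (1 = incoherent, 5 = fully coherent)",
        "safety: boolean (false if the output is harmful)",
        'Reply with JSON only, e.g. {"coherence": 3, "safety": true}',
        "",
        "Examples:",
    ]
    for text, score in FEW_SHOT:
        reply = json.dumps({"coherence": score.coherence, "safety": score.safety})
        lines.append(f"Output: {text}\nScores: {reply}")
    lines.append(f"\nCandidate output: {candidate}\nScores:")
    return "\n".join(lines)


def parse_judge_reply(raw: str) -> RubricScore:
    """Parse the judge's JSON reply; raises if it is malformed or out of range."""
    data = json.loads(raw)
    return RubricScore(coherence=data["coherence"], safety=data["safety"])


def judge(candidate: str, call_llm: Callable[[str], str]) -> RubricScore:
    """Score one candidate; call_llm is an assumed prompt-in, text-out client."""
    return parse_judge_reply(call_llm(build_judge_prompt(candidate)))


def fake_llm(prompt: str) -> str:
    # Stub standing in for a real judge model so the sketch runs offline.
    return '{"coherence": 4, "safety": true}'


if __name__ == "__main__":
    print(judge("Restart the service, then clear the cache.", fake_llm))
```

Validating the reply before using it is the part that matters for regression or A/B runs: a judge response that is malformed or out of range raises an error instead of silently skewing the aggregate scores.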
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:32:01.092167+00:00— report_created — created