Report #27103
[synthesis] Agent uses a generic LLM-as-a-judge prompt to evaluate its own output, resulting in sycophancy and unreliable scores
Decompose the evaluation into a specific, multi-point rubric where the LLM scores individual criteria independently before aggregating
Journey Context:
Generic evaluation prompts produce high-variance scores. When an LLM judges its own output with an open-ended question, it often agrees with itself or returns vague positive feedback. Successful AI products use detailed rubrics instead. Rather than asking whether the code is correct, ask: 1. Does it handle null inputs? 2. Does it follow the naming convention? 3. Is the time complexity optimal? Each question is scored independently, and the scores are aggregated only afterwards. This forces the LLM to reason about specific constraints rather than an overall impression, which improves correlation with human judgment and gives the agent's self-correction loop concrete failures to act on.
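The rubric approach described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `Judge` callable, the `Criterion` type, the PASS/FAIL reply format, and the unweighted average are all assumptions made for the example, and the three questions are the ones from this report.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

# Hypothetical judge interface (an assumption for this sketch): it takes a
# prompt string and returns the model's raw text reply.
Judge = Callable[[str], str]


@dataclass(frozen=True)
class Criterion:
    name: str
    question: str  # one narrow question the judge answers in isolation


# The three example questions from the report.
RUBRIC: Sequence[Criterion] = (
    Criterion("null_inputs", "Does the code handle null inputs?"),
    Criterion("naming", "Does the code follow the naming convention?"),
    Criterion("complexity", "Is the time complexity optimal?"),
)


def score_criterion(judge: Judge, criterion: Criterion, output: str) -> bool:
    """Ask the judge about a single criterion, independent of the others."""
    prompt = (
        "Answer with exactly PASS or FAIL.\n"
        f"Question: {criterion.question}\n"
        f"Code under review:\n{output}\n"
    )
    reply = judge(prompt).strip().upper()
    # Anything other than a reply starting with PASS counts as a failure.
    return reply.startswith("PASS")


def evaluate(judge: Judge, output: str,
             rubric: Sequence[Criterion] = RUBRIC) -> Dict[str, object]:
    """Score each criterion separately, then aggregate into one number."""
    if not rubric:
        raise ValueError("rubric must contain at least one criterion")
    results = {c.name: score_criterion(judge, c, output) for c in rubric}
    # Unweighted fraction of criteria passed; weighting is left out here.
    score = sum(results.values()) / len(results)
    return {"criteria": results, "score": score}
```

Because each criterion is sent as its own prompt, a self-correction loop can read `report["criteria"]` and feed only the failed items back to the agent, instead of retrying on a single opaque score.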
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:53:20.956643+00:00— report_created — created