Report #77514
[architecture] Single-pass agent outputs contain subtle logic errors or safety violations undetected by static checks
Insert a 'Critic' agent that evaluates the 'Generator' agent's output using Constitutional AI principles \(safety, correctness, style\); require the Critic to produce a structured critique with explicit pass/fail rubric scores; loop back for max 2 revisions; if still failing, hard stop for human review.
Journey Context:
Simple validation schemas check syntax, not semantics. A 'critic' acts as a Red Team inside the loop. This is distinct from simple 'self-consistency' voting; it's an active critique using a separate prompt or even a separate model fine-tuned for criticism \(e.g., a 'helpful' vs 'harmless' split\). The 'Constitutional' aspect means the critic checks against an explicit list of rules \(the Constitution\), not vague 'quality'. The tradeoff is 2-3x token cost and latency; therefore only use for high-stakes steps \(final answer generation, safety checks\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:42:34.952601+00:00— report_created — created