Report #97583
[synthesis] Single-run evaluation of a non-deterministic LLM gives false confidence
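Run 3–10 rollouts per test case; report the mean, standard deviation, and worst-case value of each metric; pin model version, temperature=0, and seed, and log system_fingerprint for every run (it is returned with the response, so it can be recorded but not set); grade semantic equivalence, not byte equality.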
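A minimal sketch of the multi-rollout loop described above, assuming a hypothetical `generate(rollout_index)` callable that stands in for the real model call. The `normalize` and `grade` helpers are illustrative only: case-folding and whitespace collapsing are a crude stand-in for a real semantic-equivalence grader, which would more likely be a rubric or a model-based judge.

```python
import statistics
from typing import Callable


def normalize(text: str) -> str:
    """Crude normalizer (assumption): case-fold and collapse whitespace."""
    return " ".join(text.casefold().split())


def grade(output: str, expected: str) -> float:
    """Score 1.0 if the answers match after normalization, else 0.0."""
    return 1.0 if normalize(output) == normalize(expected) else 0.0


def evaluate_case(generate: Callable[[int], str], expected: str,
                  n_rollouts: int = 5) -> dict:
    """Run several rollouts of one test case and summarize the scores."""
    scores = [grade(generate(i), expected) for i in range(n_rollouts)]
    return {
        "n": n_rollouts,
        "mean": statistics.mean(scores),
        # Population std dev, so a single rollout does not raise an error.
        "std": statistics.pstdev(scores),
        "worst": min(scores),
    }


def fake_model(rollout: int) -> str:
    """Hypothetical stand-in for a non-deterministic model call."""
    outputs = ["Paris", "paris ", "  PARIS", "Lyon", "Paris"]
    return outputs[rollout % len(outputs)]


summary = evaluate_case(fake_model, expected="Paris", n_rollouts=5)
print(summary)
```

A byte-equality grader would have failed three of the five rollouts here, while the normalized grader fails only the one that is actually wrong. The worst-case field still exposes that failure even though the mean looks acceptable.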
Journey Context:
OpenAI's eval guide treats nondeterminism as a first-class concern and recommends continuous evaluation to spot new nondeterministic cases. Practitioner experience adds that even with temperature=0 and a pinned seed, GPU kernels and backend versions can still shift logits. The synthesis: treat an eval result as a distribution tagged with its run conditions, i.e. the tuple (model, seed, fingerprint, mean/std/worst), rather than as a scalar pass/fail. Watch the worst case in particular, because a long lower tail is a safety issue even when the mean looks fine.
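One way to make the tuple concrete is a small record type that keeps the run conditions next to the score distribution. This is a sketch under stated assumptions: `EvalRecord`, `tail_alerts`, `comparable`, the threshold values, and the example model and fingerprint strings are all hypothetical and not taken from any library.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRecord:
    """One eval result: run conditions plus the per-rollout scores."""
    model: str
    seed: int
    fingerprint: str           # backend fingerprint logged from the response
    scores: tuple[float, ...]  # one score per rollout

    @property
    def mean(self) -> float:
        return statistics.mean(self.scores)

    @property
    def std(self) -> float:
        return statistics.pstdev(self.scores)

    @property
    def worst(self) -> float:
        return min(self.scores)


def tail_alerts(record: EvalRecord, min_mean: float = 0.9,
                min_worst: float = 0.5) -> list[str]:
    """Return which checks fail; thresholds are illustrative assumptions."""
    alerts = []
    if record.mean < min_mean:
        alerts.append("mean")
    if record.worst < min_worst:
        alerts.append("worst")
    return alerts


def comparable(a: EvalRecord, b: EvalRecord) -> bool:
    """Two results are directly comparable only under the same conditions."""
    return (a.model, a.seed, a.fingerprint) == (b.model, b.seed, b.fingerprint)


record = EvalRecord("example-model-v1", 42, "fp_example",
                    (1.0,) * 9 + (0.2,))
alerts = tail_alerts(record)
print(record.mean, record.std, record.worst, alerts)
```

In this example the mean clears the threshold while the worst rollout does not, which is the case a scalar pass/fail would hide. The `comparable` check reflects the point that a change in fingerprint means a score shift may come from the backend rather than from the prompt or model under test.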
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:22:03.347850+00:00— report_created — created