Report #99105
[synthesis] Canary and smoke-test assertions fail for LLM deployments because single-request behavior is stochastic
Monitor output distributions (refusal rate, perplexity, embedding drift, consistency, hallucination rate) and compare them statistically against the baseline; gate rollout on distributional thresholds, not on individual requests passing.
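A statistical comparison of one of these metrics against the baseline could look like the sketch below. It uses a one-sided two-proportion z-test on refusal rate. The function names, the `alpha` and `max_increase` thresholds, and the assumption that refusals are already counted by some upstream classifier are all illustrative and not part of this report.

```python
import math


def two_proportion_z(base_hits, base_n, cand_hits, cand_n):
    """Return (z, one_sided_p) for H1: candidate rate > baseline rate."""
    p_base = base_hits / base_n
    p_cand = cand_hits / cand_n
    pooled = (base_hits + cand_hits) / (base_n + cand_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / cand_n))
    if se == 0:
        # Both samples are all-hits or all-misses: no evidence of a shift.
        return 0.0, 0.5
    z = (p_cand - p_base) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


def gate(base_hits, base_n, cand_hits, cand_n, alpha=0.01, max_increase=0.02):
    """Hypothetical gate: block only if the rise is significant AND large."""
    _, p_value = two_proportion_z(base_hits, base_n, cand_hits, cand_n)
    increase = cand_hits / cand_n - base_hits / base_n
    if p_value < alpha and increase > max_increase:
        return "rollback"
    return "proceed"


if __name__ == "__main__":
    # 5% refusals on baseline vs 12% on the candidate, 1000 requests each.
    print(gate(50, 1000, 120, 1000))
    # 5% vs 5.5%: within noise for this sample size.
    print(gate(50, 1000, 55, 1000))
```

Requiring both a small p-value and a minimum absolute increase is a design choice in this sketch: with large traffic volumes, tiny shifts become statistically significant without being operationally meaningful.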
Journey Context:
In deterministic software a canary checks for crashes or exact outputs; an LLM can pass a canary while silently shifting toward more verbose, sycophantic, or hallucinated outputs. Sculley et al.'s paper "Hidden Technical Debt in Machine Learning Systems" notes that ML systems entangle inputs and behavior ("changing anything changes everything"), which makes failure modes emergent rather than local to one request. The practical response is to treat the model as a distribution generator: run shadow traffic, measure per-class drift, and roll back automatically when a distributional bound is breached.
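The per-class drift check and rollback decision described above can be sketched as follows. This is a minimal illustration, assuming "class" means a request category (for example chat vs code prompts) and that the tracked metric is a continuous value such as response length in tokens. The names `ks_statistic`, `drifted_classes`, `rollout_decision` and the `max_ks` bound are hypothetical. The drift measure is the two-sample Kolmogorov-Smirnov statistic, chosen here for simplicity and not prescribed by this report.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The maximum gap between two step functions occurs at a data point.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def drifted_classes(baseline, candidate, max_ks=0.2):
    """Map each request class whose KS statistic exceeds max_ks to that value.

    baseline / candidate: dict of class name -> list of metric values.
    max_ks is an assumed, illustrative bound; tune it per metric.
    """
    breaches = {}
    for cls, base_vals in baseline.items():
        cand_vals = candidate.get(cls)
        if not base_vals or not cand_vals:
            continue  # no traffic for this class on one side yet
        d = ks_statistic(base_vals, cand_vals)
        if d > max_ks:
            breaches[cls] = d
    return breaches


def rollout_decision(baseline, candidate, max_ks=0.2):
    """Roll back if any single class breaches its bound."""
    return "rollback" if drifted_classes(baseline, candidate, max_ks) else "proceed"


if __name__ == "__main__":
    # Response lengths (tokens) from baseline and shadow traffic.
    baseline = {"code": [100, 110, 120, 130], "chat": [40, 50, 60, 70]}
    candidate = {"code": [105, 115, 125, 135], "chat": [140, 150, 160, 170]}
    print(drifted_classes(baseline, candidate, max_ks=0.5))
    print(rollout_decision(baseline, candidate, max_ks=0.5))
```

Checking each class separately matters because an aggregate metric can stay flat while one class shifts sharply. A fixed bound on the statistic ignores sample size; where SciPy is available, `scipy.stats.ks_2samp` also returns a p-value that accounts for it.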
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T05:19:17.762458+00:00— report_created — created