Report #56921
[synthesis] Agent self-reported confidence score becomes uncalibrated after prompt tweaks
Maintain a holdout validation set and run periodic calibration checks (for example, expected calibration error) against actual task success, repeating them after every prompt or model change. Never treat raw logit probabilities or self-reported confidence scores as static thresholds.
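As an illustration of the calibration check, here is a minimal sketch of expected calibration error (ECE) using equal-width bins. The function name `expected_calibration_error`, the 10-bin default, and the input shape (one confidence in [0, 1] and one success flag per holdout task) are assumptions made for this example, not part of the original report.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    successes: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted mean gap between average confidence and observed success rate.

    Hypothetical helper: `confidences` are the agent's scores on a holdout
    set, `successes` records whether each task actually succeeded.
    """
    if len(confidences) != len(successes):
        raise ValueError("confidences and successes must have the same length")
    if not confidences:
        raise ValueError("holdout set is empty")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    # Group samples into equal-width confidence bins; 1.0 goes in the last bin.
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, successes):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Each bin contributes its gap, weighted by its share of the samples.
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

One way to use this is to record the ECE for the current prompt as a baseline, then recompute it on the same holdout set after each prompt or model change and flag the change for review if the value rises noticeably. What counts as "noticeably" depends on the application and holdout size.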
Journey Context:
Developers often use the model's logit probability or a self-reported confidence score to gate actions (e.g., only proceed if confidence > 0.8). However, model confidence is highly sensitive to prompt formatting and minor weight updates. A prompt change can shift the baseline confidence upward, so that the 0.8 threshold no longer separates reliable outputs from unreliable ones. The agent then starts taking actions it should not, and no error is thrown, because the threshold is still technically met. Taken together, prompt sensitivity and fixed-threshold gating show that confidence scores are relative, not absolute, and can drift with any system change.
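The failure mode described above can be sketched by deriving the gate threshold from holdout data instead of hard-coding it. The helper below is a hypothetical example, not the reporter's implementation: `recalibrate_threshold`, `target_precision`, and `min_support` are names invented here, and the sample scores are made up to show an upward shift.

```python
from typing import Optional, Sequence


def recalibrate_threshold(
    confidences: Sequence[float],
    successes: Sequence[bool],
    target_precision: float = 0.9,
    min_support: int = 1,
) -> Optional[float]:
    """Return the lowest threshold t such that tasks with confidence >= t
    succeeded at a rate of at least `target_precision` on the holdout set.

    Returns None when no threshold reaches the target.
    """
    if len(confidences) != len(successes):
        raise ValueError("confidences and successes must have the same length")

    # Walk from the most confident sample down, tracking cumulative precision.
    pairs = sorted(zip(confidences, successes), key=lambda p: p[0], reverse=True)
    best: Optional[float] = None
    hits = 0
    for count, (conf, ok) in enumerate(pairs, start=1):
        hits += 1 if ok else 0
        # Evaluate only at the end of a run of tied scores, since a
        # threshold cannot separate samples with identical confidence.
        if count < len(pairs) and pairs[count][0] == conf:
            continue
        if count >= min_support and hits / count >= target_precision:
            best = conf
    return best


# Made-up holdout outcomes: the same five tasks before and after a prompt tweak.
outcomes = [True, True, True, False, False]
before = [0.95, 0.90, 0.85, 0.70, 0.60]
after = [0.99, 0.97, 0.95, 0.90, 0.85]  # every score shifted upward

threshold_before = recalibrate_threshold(before, outcomes, target_precision=1.0)
threshold_after = recalibrate_threshold(after, outcomes, target_precision=1.0)
```

In this made-up data a fixed 0.8 gate admits only the three successful tasks before the prompt change but all five afterwards, including both failures, while the re-derived threshold moves from 0.85 to 0.95. Re-deriving the threshold on every system change is one option; another is to keep the threshold fixed and block deployment whenever the calibration check regresses.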
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:01:51.161996+00:00— report_created — created