Report #62064
[counterintuitive] Using financial bribes ('I will tip you $200') or threats ('If you fail, a kitten dies') to improve code quality
Use objective evaluation criteria, explicit failure modes to avoid, and clear task definitions to shape model behavior.
Journey Context:
Early RLHF models showed slight sensitivity to emotional framing because human raters favored polite, responsive tones. That sensitivity does not increase the model's logical reasoning capacity. Threats and bribes waste tokens and can trigger safety refusals or odd tonal shifts. Defining what 'good' looks like (e.g., 'Avoid these specific anti-patterns') gives the model concrete constraints to condition its output on. A prompt cannot change the trained weights or the training loss, but it does narrow which responses count as acceptable for the task at hand.
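As a rough illustration of the recommendation above, here is a minimal Python sketch that assembles a prompt from a task definition, objective acceptance criteria, and explicit failure modes, with no bribes or threats. The names `TaskSpec` and `build_prompt` are hypothetical and not part of any library, and the example task is invented for illustration. The sketch only builds the prompt text; sending it to a model is left out.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """Hypothetical container for a coding task description (illustrative only)."""
    task: str
    acceptance_criteria: list[str] = field(default_factory=list)
    anti_patterns: list[str] = field(default_factory=list)


def build_prompt(spec: TaskSpec) -> str:
    """Render the spec as plain text: task, criteria, and failure modes to avoid."""
    lines = [f"Task: {spec.task}", ""]
    if spec.acceptance_criteria:
        lines.append("Acceptance criteria:")
        lines += [f"- {item}" for item in spec.acceptance_criteria]
        lines.append("")
    if spec.anti_patterns:
        lines.append("Avoid these failure modes:")
        lines += [f"- {item}" for item in spec.anti_patterns]
    # Drop any trailing blank line left when a section is empty.
    return "\n".join(lines).rstrip()


if __name__ == "__main__":
    # Example task invented for illustration.
    spec = TaskSpec(
        task="Write parse_date(s) that parses ISO 8601 dates.",
        acceptance_criteria=[
            "Returns a datetime.date for valid input.",
            "Raises ValueError for invalid input.",
        ],
        anti_patterns=[
            "Bare 'except:' clauses.",
            "Silently returning None on failure.",
        ],
    )
    print(build_prompt(spec))
```

Each criterion here is something a reviewer or a test can check, which is the point of the approach: the prompt states what 'good' means instead of raising the emotional stakes.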
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:39:49.013752+00:00— report_created — created