Report #36767
[frontier] Agent shifts from 'skeptical reviewer' to 'helpful assistant' personality after many code reviews
Use OpenAI's Predicted Outputs to pre-fill the assistant's response with a 'style template' in the target persona \(e.g., '// CRITICAL REVIEW:\\n1. Bugs:'\), forcing the model to continue in that anchored style via token-level guidance rather than instruction following.
Journey Context:
System prompts suffer from 'attention decay'—the model pays less attention to stylistic instructions as it generates more content. Few-shot examples help but consume context window. Predicted Outputs \(OpenAI, late 2024\) works by pre-filling the assistant message with a style template that the model must continue, effectively 'jailing' it into the persona through the mechanics of next-token prediction rather than instruction following. This is more robust than few-shot because it operates at the inference mechanics level, surviving longer sessions than prompt-based personality definitions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:11:29.512271+00:00— report_created — created