Report #99024
[counterintuitive] Prompt length should be minimized; repeating a prompt is just wasted tokens.
For non-reasoning models, repeat the full prompt once (QUERY → QUERY QUERY) as a free, latency-neutral accuracy boost.
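A minimal sketch of the QUERY → QUERY QUERY transformation, in case it helps to see it as code. The helper name `repeat_prompt`, the `copies` parameter, and the blank-line `separator` are all assumptions made for illustration; the report does not specify how the two copies should be joined, so try the separator that suits your model.

```python
def repeat_prompt(query: str, copies: int = 2, separator: str = "\n\n") -> str:
    """Return `query` repeated `copies` times, joined by `separator`.

    Hypothetical helper: the name, the default separator, and the
    `copies` parameter are illustrative assumptions, not part of any
    library or of the original report.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return separator.join([query] * copies)


if __name__ == "__main__":
    # Build the repeated prompt, then send it to the model in place of
    # the original query (the model call itself is left out here).
    print(repeat_prompt("Ana, Bo, Cy. Which name is third in the list?"))
```

The output is an ordinary string, so it can be passed to whichever client you already use. Only the input changes; the generation settings stay as they were.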
Journey Context:
Causal transformer attention looks only at prior tokens, so in a prompt like '' the earlier tokens cannot attend to anything that comes after them. Google Research showed that duplicating the prompt lets every prompt token attend to every other prompt token (through the second copy), and tested this on Gemini, GPT-4o, Claude, and DeepSeek models across 70 benchmark-model combinations. Prompt repetition won 47 times and lost 0 times. It did not increase generated length or latency, because the repeated tokens are processed in the parallelizable prefill phase. The effect was dramatic on some custom tasks (Gemini 2.0 Flash-Lite NameIndex accuracy rose from 21.33% to 97.33%). On reasoning models the effect is neutral to slightly positive, so the main payoff is with non-reasoning models.
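The attention argument can be checked with a toy causal mask. The sketch below is only an illustration of the visibility pattern, not a model of any real transformer; the function names `causal_visible` and `covers_whole_prompt` are hypothetical. It asks whether, for every prompt token, some copy of that token sits at a position that can attend to the whole prompt.

```python
def causal_visible(num_tokens: int) -> list[list[bool]]:
    """visible[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(num_tokens)] for i in range(num_tokens)]


def covers_whole_prompt(prompt_len: int, copies: int) -> bool:
    """True if every prompt token has a copy that can attend to all prompt tokens.

    Illustrative assumption: the input is `copies` back-to-back copies of a
    prompt of `prompt_len` tokens, with nothing in between.
    """
    total = prompt_len * copies
    visible = causal_visible(total)
    for token in range(prompt_len):
        seen = set()
        # Positions at which this prompt token appears, one per copy.
        for copy_index in range(copies):
            pos = token + copy_index * prompt_len
            # Map each visible position back to the prompt token it holds.
            seen |= {j % prompt_len for j in range(total) if visible[pos][j]}
        if len(seen) < prompt_len:
            return False
    return True


if __name__ == "__main__":
    print(covers_whole_prompt(prompt_len=4, copies=1))  # single copy
    print(covers_whole_prompt(prompt_len=4, copies=2))  # repeated prompt
```

With a single copy, the first token can see only itself, so the check fails for any prompt longer than one token. With two copies, every token in the second copy can see the entire first copy, so the check passes.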
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T05:10:57.333664+00:00— report_created — created