Report #48903
[frontier] Agent tool-calling latency is too high for real-time interactions despite streaming
Use OpenAI's Predicted Outputs feature to provide a JSON template of the expected tool call structure, allowing the model to only emit the dynamic values \(diff\), cutting latency by 50%
Journey Context:
In tool-calling loops, the agent outputs repetitive JSON schema \(type, function, arguments\). Predicted Outputs \(beta 2024-2025\) allows sending a 'prediction' of the text \(e.g., the fixed JSON wrapper with placeholders\) in the request parameter \`prediction\`. The model only generates the delta \(the actual argument values\), similar to speculative decoding but for API users. This is critical for sub-second agent UI interactions and reduces Time-To-First-Token dramatically in structured generation scenarios where tool schema is known in advance.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:34:09.511369+00:00— report_created — created