Report #20992
[cost\_intel] Tool calling latency adds 500ms\+ per turn versus inline generation
Inline tool schemas directly into the prompt with few-shot examples, instead of using native function calling \(the tools parameter\), when you have fewer than 5 tools and deterministic execution paths. Native tool calling requires two API round-trips per tool use \(generate -> tool -> generate\), while the inline pattern aims for single-pass generation with tool calls and results written into the same stream, cutting latency by an estimated 30-50%.
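A minimal sketch of how such an inline prompt might be assembled, assuming the `name{...}` call format quoted in this report. The `TOOL_SPECS` list, its field names, and `build_inline_prompt` are hypothetical names chosen for illustration; they are not part of any vendor SDK.

```python
import json

# Hypothetical tool descriptions; the field names are illustrative assumptions.
TOOL_SPECS = [
    {
        "name": "calculator",
        "description": "Evaluate an arithmetic expression.",
        "example_args": {"expr": "2+2"},
        "example_result": "4",
    },
    {
        "name": "file_read",
        "description": "Return the contents of a text file.",
        "example_args": {"path": "notes.txt"},
        "example_result": "hello",
    },
]

MAX_INLINE_TOOLS = 4  # mirrors the report's "fewer than 5 tools" rule of thumb


def build_inline_prompt(specs, question):
    """Render tool descriptions and one few-shot example per tool as plain prompt text."""
    if len(specs) > MAX_INLINE_TOOLS:
        raise ValueError("too many tools for the inline pattern")
    lines = ["You have access to tools. To use a tool, output: name{...}.", ""]
    for spec in specs:
        call = spec["name"] + json.dumps(spec["example_args"])
        lines.append(f"- {spec['name']}: {spec['description']}")
        lines.append(f"  Example: {call} Result: {spec['example_result']}")
    lines += ["", f"Now answer: {question}"]
    return "\n".join(lines)


prompt = build_inline_prompt(TOOL_SPECS, "What is 2+2?")
```

The resulting string is sent as ordinary prompt text, so no tools parameter is involved; the few-shot examples are the only thing teaching the model the call format.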
Journey Context:
Developers adopt OpenAI/Anthropic function calling for 'reliability,' accepting the latency penalty of the 'think -> call -> observe -> think' loop. For simple agents with 2-3 deterministic tools \(search, calculator, file\_read\), this architecture adds 500ms-1s per step for JSON parsing and the second API call. The inline pattern puts the protocol in the prompt: 'You have access to tools. To use a tool, output: name\{...\}. Example: calculator\{"expr": "2\+2"\} Result: 4. Now answer...' This lets the model emit tool calls mid-generation in a single stream. The tradeoff: you lose automatic schema validation and parallel tool execution \(the model must generate calls sequentially\), so your own code has to parse and validate each call. For high-frequency trading agents or real-time coding assistants where 300ms matters, inline wins. For complex multi-tool parallel plans \(research agents\), native tool calling is worth the latency.
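The parse-and-dispatch side of the inline pattern could look roughly like the sketch below. It assumes the `name{...}` format with JSON arguments shown above; `TOOLS`, `find_tool_call`, and `execute_inline` are hypothetical names, and the calculator is a deliberately small stand-in. How the result is fed back into an in-progress generation (for example via stop sequences) is not specified in this report and is left out here.

```python
import ast
import json
import operator
import re

# Only basic arithmetic is allowed; this replaces eval() with a whitelist.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _eval_arith(node):
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval_arith(node.left), _eval_arith(node.right))
    raise ValueError("unsupported expression")


def calculator(args):
    return _eval_arith(ast.parse(args["expr"], mode="eval").body)


TOOLS = {"calculator": calculator}  # hypothetical registry of deterministic tools
CALL_START = re.compile(r"([A-Za-z_]\w*)\{")
_DECODER = json.JSONDecoder()


def find_tool_call(text):
    """Return (name, args, end_index) for the first valid call to a known tool, else None."""
    for match in CALL_START.finditer(text):
        name = match.group(1)
        if name not in TOOLS:
            continue
        try:
            # raw_decode parses one JSON value starting at the opening brace.
            args, end = _DECODER.raw_decode(text, match.end() - 1)
        except json.JSONDecodeError:
            continue  # incomplete or malformed arguments; manual validation in action
        return name, args, end
    return None


def execute_inline(text):
    """Run the first tool call found in text and append its result after the call."""
    call = find_tool_call(text)
    if call is None:
        return text
    name, args, end = call
    return f"{text[:end]} Result: {TOOLS[name](args)}"


print(execute_inline('Let me compute. calculator{"expr": "2+2"}'))
```

Because nothing validates arguments against a schema here, a malformed call is simply skipped, which illustrates the validation the tradeoff above says you give up.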
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T13:38:40.104292+00:00— report_created — created