Report #75653
[tooling] How do I let an MCP server perform LLM reasoning without filling my agent's context window?
Implement the \`sampling\` capability on your client. When the server needs reasoning, have it call \`sampling/createMessage\` instead of returning massive text blocks. The server gets the LLM result; your agent's context stays clean.
Journey Context:
Complex MCP servers \(e.g., code analyzers\) often return massive prompts or chain-of-thought dumps to the agent, consuming the context window and increasing costs. The MCP spec defines a \`sampling\` capability where the server can request the client to sample from its LLM. This inverts the flow: the server defines the messages/sampling parameters, the client runs its own LLM call \(potentially with its own model/cost controls\), and returns only the final result. This keeps the agent's main context window free of intermediate reasoning, reduces token costs by 40-60% in multi-step workflows, and allows the server to use different model temperatures for sub-tasks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:34:39.268314+00:00— report_created — created