If your API has more than a few dozen endpoints, exposing each one as a separate MCP tool creates a problem that scales in exactly the wrong direction. Every tool definition consumes tokens in the agent’s context window, and for large API surfaces the math gets ugly fast. A 2,500-endpoint API expressed as individual MCP tools can consume over a million input tokens before the agent even starts reasoning about what to do.
The search/execute pattern is a design approach that compresses that cost to roughly 1,000 tokens regardless of how many endpoints your API has. It does this by replacing hundreds or thousands of individual tool definitions with just two: one to discover the right operations and one to run them. This article breaks down how it works, when to use it, and how to implement it on an OpenAPI-first API gateway.
The token-cost problem with one-tool-per-endpoint MCP servers
The Model Context Protocol defines how AI
agents discover and invoke external tools. The standard approach is
straightforward: each API endpoint becomes one MCP tool, and each tool
definition includes its name, description, parameters, and response schema. When
an agent connects, it calls tools/list and receives every definition at once.
For a small API, this works well. Ten endpoints might produce a few hundred tokens of tool definitions. But context window consumption grows linearly with the number of endpoints. Consider a large API surface:
- 10 endpoints: ~500 tokens of tool definitions
- 100 endpoints: ~5,000 tokens
- 1,000 endpoints: ~50,000 tokens
- 2,500 endpoints: ~1,170,000 tokens
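At the larger end, you are spending more than a million tokens just to describe the tools. Note that the last figure is not a straight-line extrapolation of the smaller rows: it works out to roughly 470 tokens per tool rather than ~50, which is consistent with definitions that carry full parameter and response schemas rather than minimal ones. Either way, the tool list leaves less room for the actual conversation, the user’s instructions, and the agent’s reasoning. It also drives up cost: at typical input-token pricing, every agent session starts with a substantial bill before any work gets done.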
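To make the arithmetic concrete, here is a minimal sketch of the linear cost model. The helper name `estimateToolListCost`, the per-tool token counts, and the price of 3 USD per million input tokens are illustrative assumptions, not measured values.

```typescript
// Linear cost model for a one-tool-per-endpoint MCP server.
// Every number passed in below is an assumption for illustration.
interface ToolListCost {
  tokens: number; // tokens consumed by the tool definitions
  usd: number; // input cost of those tokens for one session
}

function estimateToolListCost(
  endpointCount: number,
  tokensPerTool: number,
  usdPerMillionInputTokens: number,
): ToolListCost {
  const tokens = endpointCount * tokensPerTool;
  const usd = (tokens / 1e6) * usdPerMillionInputTokens;
  return { tokens, usd };
}

// About 50 tokens per tool reproduces the smaller rows in the list above.
console.log(estimateToolListCost(100, 50, 3).tokens);

// 1,170,000 tokens spread over 2,500 tools is 468 tokens per tool.
console.log(estimateToolListCost(2500, 468, 3).tokens);
```

Because the cost is a plain product, halving the size of each definition only halves the bill; it does not change how the bill scales with endpoint count.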
There is also a quality problem. LLMs perform worse when they have to select from thousands of options. The more tool definitions an agent holds in context, the more likely it is to pick the wrong one or hallucinate parameters for a tool that almost matches what it needs.
How the search/execute pattern works
The search/execute pattern solves the token problem by splitting tool access into two phases, each backed by a single MCP tool.
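As a rough sketch, the entire tool list for such a server could look like the following. The `name`, `description`, and `inputSchema` fields follow the shape of an MCP tool definition; the descriptions and the parameter names (`query`, `tag`, `code`) are assumptions made for illustration.

```typescript
// An MCP tool definition: a name, a description, and a JSON Schema
// describing the tool's input.
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, { type: string; description: string }>;
    required: string[];
  };
}

// Hypothetical search tool: finds operations in the OpenAPI spec.
const searchTool: ToolDefinition = {
  name: "search",
  description: "Find API operations by keyword or tag.",
  inputSchema: {
    type: "object",
    properties: {
      query: { type: "string", description: "What you want to do" },
      tag: { type: "string", description: "Optional OpenAPI tag filter" },
    },
    required: ["query"],
  },
};

// Hypothetical execute tool: runs agent-written code in a sandbox.
const executeTool: ToolDefinition = {
  name: "execute",
  description: "Run TypeScript that calls the API through a typed client.",
  inputSchema: {
    type: "object",
    properties: {
      code: { type: "string", description: "Program to run" },
    },
    required: ["code"],
  },
};

// The list stays this size no matter how many endpoints the API has.
const toolList: ToolDefinition[] = [searchTool, executeTool];
console.log(toolList.map((tool) => tool.name));
```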
Phase 1: Search
The search tool gives the agent a way to query the API’s
OpenAPI specification without loading
the entire spec into context. The agent describes what it wants to do in natural
language or filters by path, tag, or product area, and the search tool returns
only the matching operation definitions.
For example, an agent that needs to list DNS records does not need to see the
schemas for compute instances, billing, or storage. It calls search with a
query like “DNS records” and receives back only the relevant endpoints — their
paths, methods, parameters, and descriptions.
This keeps the context window lean. Instead of loading thousands of tool definitions upfront, the agent loads a compact search interface and pulls in endpoint details on demand.
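A minimal sketch of the search phase, assuming a plain keyword match over a parsed OpenAPI document. The function name `searchOperations` and the reduced spec types are hypothetical; a production search tool might use embeddings or ranking instead of substring matching.

```typescript
// Reduced view of an OpenAPI document: only the fields the search needs.
interface SpecOperation {
  operationId: string;
  summary?: string;
  tags?: string[];
}

interface OpenApiSpec {
  paths: Record<string, Record<string, SpecOperation>>;
}

interface SearchHit {
  method: string;
  path: string;
  operationId: string;
  summary: string;
}

// Return operations whose path, operationId, summary, or tags contain
// every term of the query (case-insensitive).
function searchOperations(spec: OpenApiSpec, query: string, limit = 10): SearchHit[] {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const hits: SearchHit[] = [];
  for (const [path, item] of Object.entries(spec.paths)) {
    for (const [method, op] of Object.entries(item)) {
      const haystack = [path, op.operationId, op.summary ?? "", ...(op.tags ?? [])]
        .join(" ")
        .toLowerCase();
      if (terms.every((term) => haystack.includes(term))) {
        hits.push({
          method: method.toUpperCase(),
          path,
          operationId: op.operationId,
          summary: op.summary ?? "",
        });
      }
    }
  }
  return hits.slice(0, limit);
}

// Illustrative spec with one DNS operation and one unrelated operation.
const exampleSpec: OpenApiSpec = {
  paths: {
    "/zones/{zone_id}/dns_records": {
      get: { operationId: "list-dns-records", summary: "List DNS records", tags: ["dns"] },
    },
    "/instances": {
      post: { operationId: "create-instance", summary: "Create a compute instance", tags: ["compute"] },
    },
  },
};

console.log(searchOperations(exampleSpec, "DNS records"));
```

Only the matching operation reaches the agent's context; the compute endpoint is never loaded.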
Phase 2: Execute
The execute tool accepts code — typically JavaScript or TypeScript — that
calls the API using a typed SDK generated from the OpenAPI spec. The code runs
inside a sandboxed runtime, usually a V8 isolate, and can chain multiple API
calls, handle pagination, apply conditional logic, and aggregate results in a
single execution.
This is powerful because it replaces the typical multi-turn pattern where an
agent calls one tool, feeds the result back into the LLM, reasons about the next
step, calls another tool, and repeats. With execute, the agent writes a short
program that handles the entire workflow in one shot.
The agent gets a single, structured result back. No intermediate round trips. No context pollution from feeding raw API responses through the LLM between calls.
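To illustrate, here is the kind of program an agent might submit to the execute tool: it paginates through DNS records and aggregates them by type in one run. The `ApiClient` interface and `listDnsRecords` method are hypothetical stand-ins for an SDK generated from the spec, and `makeFakeClient` is an in-memory substitute so the sketch runs on its own.

```typescript
// Hypothetical SDK surface; a real one would be generated from the OpenAPI spec.
interface DnsRecord {
  id: string;
  type: string;
  name: string;
}

interface CursorPage<T> {
  items: T[];
  nextCursor?: string;
}

interface ApiClient {
  listDnsRecords(zoneId: string, cursor?: string): Promise<CursorPage<DnsRecord>>;
}

// Agent-written workflow: follow the cursor until it runs out,
// counting records by type along the way.
async function countRecordsByType(api: ApiClient, zoneId: string): Promise<Record<string, number>> {
  const counts: Record<string, number> = {};
  let cursor: string | undefined;
  do {
    const page = await api.listDnsRecords(zoneId, cursor);
    for (const record of page.items) {
      counts[record.type] = (counts[record.type] ?? 0) + 1;
    }
    cursor = page.nextCursor;
  } while (cursor !== undefined);
  return counts;
}

// In-memory stand-in for the sandbox-provided client; serves two records per page.
function makeFakeClient(records: DnsRecord[]): ApiClient {
  return {
    async listDnsRecords(_zoneId: string, cursor?: string): Promise<CursorPage<DnsRecord>> {
      const start = cursor ? Number(cursor) : 0;
      const end = start + 2;
      return {
        items: records.slice(start, end),
        nextCursor: end < records.length ? String(end) : undefined,
      };
    },
  };
}

const sampleRecords: DnsRecord[] = [
  { id: "1", type: "A", name: "example.com" },
  { id: "2", type: "A", name: "www.example.com" },
  { id: "3", type: "CNAME", name: "blog.example.com" },
  { id: "4", type: "A", name: "api.example.com" },
  { id: "5", type: "TXT", name: "example.com" },
];

countRecordsByType(makeFakeClient(sampleRecords), "zone-1").then((counts) => console.log(counts));
```

The three page fetches happen inside the sandbox, and only the final counts object is returned to the model.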
Why the token savings are so dramatic
The fixed cost of the search/execute pattern is roughly constant:
- The search tool definition: ~200 tokens
- The execute tool definition: ~200 tokens
- The typed SDK interface documentation: ~600 tokens
That is around 1,000 tokens total, regardless of whether the underlying API has 50 endpoints or 5,000. Compare that to the one-tool-per-endpoint approach where 2,500 endpoints produce over 1.17 million tokens of definitions. That is a 99.9% reduction in context window consumption.
When to use the search/execute pattern
The search/execute pattern is not universally better than the one-tool-per-endpoint approach. Each design makes different trade-offs.
Use search/execute when
- Your API surface is large: more than ~50 endpoints, and especially above 100. The token savings become significant at this scale.
- Agents need to chain operations: workflows that involve multiple dependent API calls benefit from the execute tool’s ability to run them as a single program.
- Your API has a high-quality OpenAPI spec: the search tool depends on rich descriptions, accurate schemas, and consistent tagging to return useful results. A sparse or inaccurate spec will produce poor search results and broken generated code.
- You need semantic discovery: when agents should be able to describe what they want in natural language rather than knowing exact tool names.
Stick with one-tool-per-endpoint when
- Your API is small: fewer than ~30 endpoints fit comfortably in the context window. The overhead of the search/execute layer is not worth it.
- Tools are well-named and obvious: if your tool names are self-explanatory (e.g., create-user, get-invoice), agents can pick the right one without a search step.
- You want maximum predictability: individual tools produce deterministic behavior. The execute step involves code generation, which introduces a small risk of the agent writing incorrect code.
- Your users expect direct tool calls: some MCP clients present tool lists to users directly. A two-tool interface with search and execute is less intuitive than a list of named operations.
Most teams building MCP servers for internal APIs with under 50 endpoints should start with the one-tool-per-endpoint approach. If you want a deeper look at that approach, see Best Practices for Mapping REST APIs to MCP Tools.
Security implications of the execute step
The execute tool runs agent-generated code against your API. That is powerful,
and it requires careful guardrails.
Sandbox boundaries
The code execution environment must be locked down:
- No file system access — the sandbox should have no ability to read or write files on the host.
- No environment variable access — API keys and secrets must not be accessible to the generated code through environment variables or global state.
- Controlled network access — the sandbox should only be able to reach the APIs it is explicitly authorized to use. Outbound fetch calls to arbitrary URLs should be blocked.
- Resource limits — execution time, memory, and CPU should all be capped to prevent denial-of-service through infinite loops or excessive allocations.
V8 isolates are the natural fit here. They start in milliseconds, use minimal memory, and keep each execution’s heap and globals separate from every other isolate in the same process, without the overhead of containers or a process per request. This is the same technology that powers edge runtimes like Cloudflare Workers and Zuplo’s programmable gateway.
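Two of these guardrails can be sketched in a few lines: an allow-list check for outbound requests and a time budget for asynchronous work. The names `SandboxPolicy`, `isRequestAllowed`, and `withTimeout`, and the example host, are hypothetical and not part of any particular runtime.

```typescript
// Hypothetical sandbox policy: hosts the generated code may call, plus a time budget.
interface SandboxPolicy {
  allowedHosts: string[];
  timeoutMs: number;
}

// Allow only HTTPS requests to explicitly listed hosts.
function isRequestAllowed(rawUrl: string, policy: SandboxPolicy): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // unparseable URLs are rejected outright
  }
  return url.protocol === "https:" && policy.allowedHosts.includes(url.hostname);
}

// Reject if the task has not settled within the time budget.
function withTimeout<T>(task: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`execution exceeded ${timeoutMs} ms`)),
      timeoutMs,
    );
    task.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

const examplePolicy: SandboxPolicy = { allowedHosts: ["api.example.com"], timeoutMs: 50 };
console.log(isRequestAllowed("https://api.example.com/v1/dns", examplePolicy));
console.log(isRequestAllowed("https://attacker.example.net/", examplePolicy));
```

A promise-based timeout like this cannot interrupt a synchronous infinite loop, so CPU and memory caps still have to be enforced by the isolate runtime itself.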
Credential scoping
Every agent session should operate with the minimum credentials required for its task. In practice this means:
- Per-agent API keys — each agent or agent session gets its own API key with scoped permissions, not a shared admin key. Zuplo’s API key management supports this with per-consumer keys and configurable metadata that policies can use to enforce permissions.
- Per-route policies — even within a scoped key, individual routes can enforce additional constraints like rate limits, IP restrictions, or request validation. This limits the blast radius if an agent generates code that calls endpoints it should not.
- OAuth scope downscoping — for APIs using OAuth, the token issued to the agent session should carry only the scopes needed for the current task.
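As an illustration of per-key scoping, the check below decides whether a key may invoke a given operation. The `ApiKeyMetadata` shape and the trailing-`*` prefix convention are assumptions for this sketch; they are not Zuplo's metadata format.

```typescript
// Hypothetical key metadata: which operations this consumer may invoke.
interface ApiKeyMetadata {
  consumer: string;
  // Exact operationId values, or prefixes ending in "*".
  allowedOperations: string[];
}

// True if the operationId matches any exact entry or prefix pattern.
function isOperationAllowed(key: ApiKeyMetadata, operationId: string): boolean {
  return key.allowedOperations.some((pattern) =>
    pattern.endsWith("*")
      ? operationId.startsWith(pattern.slice(0, -1))
      : operationId === pattern,
  );
}

// A key scoped to read-only DNS operations.
const dnsReaderKey: ApiKeyMetadata = {
  consumer: "dns-audit-agent",
  allowedOperations: ["list-dns-records", "get-dns-*"],
};

console.log(isOperationAllowed(dnsReaderKey, "get-dns-record"));
console.log(isOperationAllowed(dnsReaderKey, "delete-compute-instance"));
```

The check has to run on the gateway for every call the generated code makes, not inside the sandbox, so that the code cannot bypass it.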
Rate limiting
Agent-generated code can produce bursts of API calls within a single execute
invocation. A short script that paginates through thousands of records will hit
the API much harder than a human user making one request at a time. Your
rate limiting strategy should account for
this by setting per-key and per-route limits that accommodate legitimate
multi-call workflows while preventing runaway scripts from overwhelming your
backend.
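One way to allow short bursts while capping sustained load is a token bucket per key and route. The sketch below is a generic in-memory version with an injected clock; the class names and the capacity and refill numbers are assumptions, and a distributed gateway would need shared state rather than a local `Map`.

```typescript
// Token bucket: allows bursts up to `capacity`, refilled at `refillPerSecond`.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity;
    this.lastRefillMs = nowMs;
  }

  // Consumes one token and returns true if the call is allowed.
  tryConsume(nowMs: number): boolean {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// One bucket per (API key, route) pair.
class KeyRouteLimiter {
  private readonly buckets = new Map<string, TokenBucket>();
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
  }

  allow(apiKey: string, route: string, nowMs: number): boolean {
    const id = `${apiKey}:${route}`;
    let bucket = this.buckets.get(id);
    if (!bucket) {
      bucket = new TokenBucket(this.capacity, this.refillPerSecond, nowMs);
      this.buckets.set(id, bucket);
    }
    return bucket.tryConsume(nowMs);
  }
}

// Burst of 3 calls allowed, then 1 call per second sustained.
const limiter = new KeyRouteLimiter(3, 1);
const burst = [0, 0, 0, 0].map((t) => limiter.allow("agent-key", "/dns_records", t));
console.log(burst);
```

The capacity sets how large a legitimate multi-call workflow can be, while the refill rate bounds what a runaway pagination loop can sustain.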
Implementing the pattern on an OpenAPI-first gateway
The search/execute pattern depends on three things being true about your infrastructure:
- Your API is described by a high-quality OpenAPI spec — this is the search index and the source of type information for the execute step.
- You have a runtime that can execute code cheaply and safely — V8 isolates at the edge are ideal.
- You can enforce authentication and authorization per operation — the execute step calls real endpoints, and each one needs its own access control.
An OpenAPI-first API gateway satisfies all three by design.
How Zuplo’s architecture maps to the pattern
Zuplo is built around OpenAPI as the source of truth for routing, validation, documentation, and — with the MCP Server Handler — for MCP tool definitions. Here is how each part of the search/execute pattern maps to existing Zuplo capabilities:
Search → OpenAPI-derived tool discovery
The MCP Server Handler already transforms OpenAPI-defined routes into MCP tool
definitions that agents discover through tools/list. You control which
operations are exposed by listing their operationId values in the handler
configuration. The tool names, descriptions, and parameter schemas are derived
directly from the OpenAPI spec.
For a search/execute implementation, this same OpenAPI spec becomes the search index. A custom search tool can query the spec by path, tag, description, or operation ID and return matching operation definitions to the agent.
Execute → Programmable policies in V8 isolates
Zuplo’s programmable gateway runs custom TypeScript policies inside V8 isolates at 300+ edge locations. This is the same primitive that the execute step needs: a sandboxed runtime that can call API operations quickly, cheaply, and safely.
A custom policy or handler can accept agent-generated code, execute it in an isolated context with access to a typed API client, and return the result. The policy pipeline ensures that every API call made during execution passes through authentication, rate limiting, and validation — the same policies that protect direct API calls.
Access control → Per-key, per-route policies
Zuplo’s API key management
provides per-consumer keys with configurable metadata and permissions. Combined
with per-route inbound policies, this gives you fine-grained control over what
each agent can do. An agent with a key scoped to read-only DNS operations cannot
use the execute tool to delete a compute instance, even if it generates code
that tries to.
A practical configuration
The MCP Server Handler configuration exposes a curated set of operations as tools by listing them explicitly.
Each operation points to an operationId in your OpenAPI spec. The handler
reads the spec, extracts the route definition, and generates the corresponding
MCP tool definition automatically. No duplication, no drift between your API
definition and what agents see.
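The handler's actual option names are not reproduced here. As a hedged sketch of the underlying idea only, the function below looks up a list of operationId values in a parsed spec and derives a tool definition for each. The name `toolsFromOperationIds` and the reduced spec types are hypothetical, and mapping every parameter to a flat input property is a simplification.

```typescript
// Reduced OpenAPI types: only what the derivation needs.
interface SpecParameter {
  name: string;
  required?: boolean;
  schema?: { type?: string };
}

interface SpecOperation {
  operationId: string;
  summary?: string;
  parameters?: SpecParameter[];
}

interface ParsedSpec {
  paths: Record<string, Record<string, SpecOperation>>;
}

interface DerivedTool {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, { type: string }>;
    required: string[];
  };
}

// Build one tool per listed operationId; fail loudly on ids the spec lacks.
function toolsFromOperationIds(spec: ParsedSpec, operationIds: string[]): DerivedTool[] {
  const byId = new Map<string, SpecOperation>();
  for (const item of Object.values(spec.paths)) {
    for (const op of Object.values(item)) {
      byId.set(op.operationId, op);
    }
  }
  return operationIds.map((id): DerivedTool => {
    const op = byId.get(id);
    if (!op) throw new Error(`operationId not found in spec: ${id}`);
    const properties: Record<string, { type: string }> = {};
    const required: string[] = [];
    for (const param of op.parameters ?? []) {
      properties[param.name] = { type: param.schema?.type ?? "string" };
      if (param.required) required.push(param.name);
    }
    return {
      name: id,
      description: op.summary ?? "",
      inputSchema: { type: "object", properties, required },
    };
  });
}

const curatedSpec: ParsedSpec = {
  paths: {
    "/zones/{zone_id}/dns_records": {
      get: {
        operationId: "list-dns-records",
        summary: "List DNS records",
        parameters: [{ name: "zone_id", required: true, schema: { type: "string" } }],
      },
      post: { operationId: "create-dns-record", summary: "Create a DNS record" },
    },
  },
};

// Expose only the read operation.
console.log(toolsFromOperationIds(curatedSpec, ["list-dns-records"]));
```

Deriving the tool from the spec at load time is what prevents drift: renaming a parameter in the OpenAPI document changes the tool definition with no second edit.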
For a full walkthrough of setting this up, see Create an MCP Server from Your OpenAPI Spec.
Observability: keeping agent workflows auditable
When an agent chains five API calls inside a single execute invocation, your
standard request logs show five individual requests. But without context linking
them together, you cannot tell that they were part of a single agent workflow.
Good observability for the search/execute pattern requires logging at two levels:
Search queries — log what the agent searched for, what results were returned, and how many operations matched. This tells you which parts of your API surface agents are actually using and where your OpenAPI descriptions might need improvement.
Execute payloads — log the code the agent submitted, which API calls it made, and the results of each call. Correlate these with a session or request ID so you can reconstruct the full workflow from a single audit trail.
Because the MCP Server Handler re-invokes routes within the gateway rather than
making external HTTP calls, each API call passes through the full policy
pipeline — including any logging policies you have configured. This means you
can attach request logging to individual routes and capture each operation
invoked during an execute step with its own log entry, giving you the raw
material to reconstruct agent workflows after the fact.
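A small sketch of the correlation step, assuming each per-route log entry carries a shared workflow ID for the execute invocation that produced it. The field names in `CallLogEntry` and the `summarizeWorkflow` helper are hypothetical.

```typescript
// Hypothetical per-route log entry emitted for each API call.
interface CallLogEntry {
  workflowId: string; // shared by every call made in one execute invocation
  operationId: string;
  status: number; // HTTP status of the call
  durationMs: number;
}

interface WorkflowSummary {
  workflowId: string;
  operations: string[]; // in the order the calls were logged
  failedCalls: number;
  totalDurationMs: number;
}

// Reassemble one agent workflow from interleaved request logs.
function summarizeWorkflow(entries: CallLogEntry[], workflowId: string): WorkflowSummary {
  const calls = entries.filter((entry) => entry.workflowId === workflowId);
  return {
    workflowId,
    operations: calls.map((entry) => entry.operationId),
    failedCalls: calls.filter((entry) => entry.status >= 400).length,
    totalDurationMs: calls.reduce((sum, entry) => sum + entry.durationMs, 0),
  };
}

// Logs from two workflows, interleaved as they would be in a shared log stream.
const requestLogs: CallLogEntry[] = [
  { workflowId: "wf-1", operationId: "list-zones", status: 200, durationMs: 40 },
  { workflowId: "wf-2", operationId: "get-invoice", status: 200, durationMs: 25 },
  { workflowId: "wf-1", operationId: "list-dns-records", status: 200, durationMs: 60 },
  { workflowId: "wf-1", operationId: "delete-dns-record", status: 403, durationMs: 10 },
];

console.log(summarizeWorkflow(requestLogs, "wf-1"));
```

With the shared ID in place, a denied call such as the 403 above can be traced back to the exact program that attempted it.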
The broader trend: OpenAPI as the MCP interface layer
The search/execute pattern is part of a larger shift in how APIs expose themselves to AI agents. Rather than building separate MCP servers that duplicate API logic, teams are treating their existing OpenAPI specs as the single source of truth and deriving MCP tool definitions from them.
This is exactly what tools like Speakeasy’s x-speakeasy-mcp extension and
various open-source openapi-mcp generators are formalizing: the OpenAPI spec
defines the API surface, and the MCP layer is a view on top of it. The
search/execute pattern takes this one step further by making the spec queryable
at runtime rather than expanded into static tool definitions.
For teams already running an OpenAPI-first API gateway, the path to the search/execute pattern is shorter than it looks. The spec is already there. The routing and validation are already there. The per-route access control is already there. What is new is the search interface and the sandboxed execution layer — and both map cleanly onto capabilities that a programmable edge gateway already provides.
Getting started
The search/execute pattern gives large APIs a way to participate in the MCP ecosystem without overwhelming agent context windows. If your API has a well-maintained OpenAPI spec, you already have the foundation.
To build your first MCP server on Zuplo, start with the MCP Server Handler documentation. For a step-by-step walkthrough, see Create an MCP Server from Your OpenAPI Spec. To understand how API gateways, AI gateways, and MCP gateways fit together in a modern AI infrastructure stack, read The Three Gates of AI Infrastructure. And for a comprehensive overview of MCP architecture and implementation options, see What Is an MCP Server?.