Zuplo

The Search/Execute MCP Design Pattern: How Token-Efficient MCP Servers Are Reshaping Agent Integrations

Nate Totten
May 22, 2026

Learn how the search/execute MCP pattern cuts token costs by 99.9% for large APIs and why OpenAPI-first gateways are the natural place to implement it.

If your API has more than a few dozen endpoints, exposing each one as a separate MCP tool creates a problem that scales in exactly the wrong direction. Every tool definition consumes tokens in the agent’s context window, and for large API surfaces the math gets ugly fast. A 2,500-endpoint API expressed as individual MCP tools can consume over a million input tokens before the agent even starts reasoning about what to do.

The search/execute pattern is a design approach that compresses that cost to roughly 1,000 tokens regardless of how many endpoints your API has. It does this by replacing hundreds or thousands of individual tool definitions with just two: one to discover the right operations and one to run them. This article breaks down how it works, when to use it, and how to implement it on an OpenAPI-first API gateway.

The token-cost problem with one-tool-per-endpoint MCP servers

The Model Context Protocol defines how AI agents discover and invoke external tools. The standard approach is straightforward: each API endpoint becomes one MCP tool, and each tool definition includes its name, description, parameters, and response schema. When an agent connects, it calls tools/list and receives every definition at once.
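To make that concrete, here is a sketch of one tool definition as it might appear in a tools/list response, with a rough estimate of what it costs in tokens. The DNS tool is a made-up example, and the four-characters-per-token ratio is only a rule of thumb standing in for a real tokenizer.

```typescript
// Shape of one tools/list entry: name, description, and a JSON Schema for inputs.
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: { [key: string]: { type: string; description?: string } };
    required?: string[];
  };
}

// Hypothetical tool for a hypothetical DNS API.
const listDnsRecordsTool: ToolDefinition = {
  name: "list_dns_records",
  description: "List DNS records for a zone, optionally filtered by record type.",
  inputSchema: {
    type: "object",
    properties: {
      zone_id: { type: "string", description: "Identifier of the zone" },
      type: { type: "string", description: "Record type, for example A or CNAME" },
    },
    required: ["zone_id"],
  },
};

// Assumption: about 4 characters of serialized JSON per token.
function estimateTokens(tools: ToolDefinition[]): number {
  const chars = tools.reduce((sum, t) => sum + JSON.stringify(t).length, 0);
  return Math.ceil(chars / 4);
}

// One definition per endpoint means the up-front cost scales with endpoint count.
const thousandTools: ToolDefinition[] = [];
for (let i = 0; i < 1000; i++) thousandTools.push(listDnsRecordsTool);

const oneToolTokens = estimateTokens([listDnsRecordsTool]);
const thousandToolTokens = estimateTokens(thousandTools);
console.log({ oneToolTokens, thousandToolTokens });
```

Real definitions vary widely in size, so treat the output as an order-of-magnitude guide rather than a bill.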

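For a small API, this works well. Ten endpoints might produce a few hundred tokens of tool definitions. But context window consumption grows at least linearly with the number of endpoints. Consider some rough figures:

  • 10 endpoints: ~500 tokens of tool definitions
  • 100 endpoints: ~5,000 tokens
  • 1,000 endpoints: ~50,000 tokens
  • 2,500 endpoints: ~1,170,000 tokens

The first three rows assume compact definitions of about 50 tokens each. The last row implies roughly 470 tokens per endpoint, so it is not a straight extrapolation of the others and presumably reflects much fuller schemas.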

At the larger end, you are spending more than a million tokens just to describe the tools. That leaves less room for the actual conversation, the user’s instructions, and the agent’s reasoning. It also drives up cost: at typical input-token pricing, every agent session starts with a substantial bill before any work gets done.

There is also a quality problem. LLMs perform worse when they have to select from thousands of options. The more tool definitions an agent holds in context, the more likely it is to pick the wrong one or hallucinate parameters for a tool that almost matches what it needs.

How the search/execute pattern works

The search/execute pattern solves the token problem by splitting tool access into two phases, each backed by a single MCP tool.
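As a rough sketch, the entire tools/list payload for such a server could be as small as the two entries below. The tool names, descriptions, and parameters here are illustrative assumptions, not a published schema.

```typescript
// Hypothetical two-tool surface for a search/execute MCP server.
const searchExecuteTools = [
  {
    name: "search",
    description:
      "Find API operations by natural-language query, tag, or path prefix. Returns matching operation definitions.",
    inputSchema: {
      type: "object",
      properties: {
        query: { type: "string", description: "What the agent wants to do" },
        tag: { type: "string", description: "Optional OpenAPI tag filter" },
      },
      required: ["query"],
    },
  },
  {
    name: "execute",
    description:
      "Run a short script against the typed API client and return its output.",
    inputSchema: {
      type: "object",
      properties: {
        code: { type: "string", description: "Script source to run in the sandbox" },
      },
      required: ["code"],
    },
  },
];

console.log(searchExecuteTools.map((t) => t.name).join(", "));
```

However many endpoints sit behind it, this is all the agent has to hold in context before it starts working.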

Phase 1: Search

The search tool gives the agent a way to query the API’s OpenAPI specification without loading the entire spec into context. The agent describes what it wants to do in natural language or filters by path, tag, or product area, and the search tool returns only the matching operation definitions.

For example, an agent that needs to list DNS records does not need to see the schemas for compute instances, billing, or storage. It calls search with a query like “DNS records” and receives back only the relevant endpoints — their paths, methods, parameters, and descriptions.

This keeps the context window lean. Instead of loading thousands of tool definitions upfront, the agent loads a compact search interface and pulls in endpoint details on demand.
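Here is a minimal sketch of what the search side could look like, assuming the spec has already been flattened into a list of operations. It uses naive keyword matching. A production version would more likely use embeddings or a proper text index, and every name below is hypothetical.

```typescript
interface Operation {
  operationId: string;
  method: string;
  path: string;
  tags: string[];
  summary: string;
}

// Tiny stand-in for an index derived from an OpenAPI spec.
const operations: Operation[] = [
  { operationId: "listDnsRecords", method: "GET", path: "/zones/{zone_id}/dns_records", tags: ["dns"], summary: "List DNS records for a zone" },
  { operationId: "createDnsRecord", method: "POST", path: "/zones/{zone_id}/dns_records", tags: ["dns"], summary: "Create a DNS record" },
  { operationId: "listInstances", method: "GET", path: "/compute/instances", tags: ["compute"], summary: "List compute instances" },
  { operationId: "getInvoice", method: "GET", path: "/billing/invoices/{id}", tags: ["billing"], summary: "Get an invoice" },
];

// Score each operation by how many query terms appear in its searchable text,
// then return the best matches and nothing else.
function searchOperations(query: string, limit: number = 5): Operation[] {
  const terms = query.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  const scored = operations.map((op) => {
    const haystack = [op.operationId, op.path, op.summary, op.tags.join(" ")]
      .join(" ")
      .toLowerCase();
    const score = terms.filter((t) => haystack.indexOf(t) !== -1).length;
    return { op, score };
  });
  return scored
    .filter((s) => s.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((s) => s.op);
}

const dnsHits = searchOperations("DNS records");
console.log(dnsHits.map((op) => op.method + " " + op.path));
```

The compute and billing operations never reach the agent's context, which is the point of the pattern.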

Phase 2: Execute

The execute tool accepts code — typically JavaScript or TypeScript — that calls the API using a typed SDK generated from the OpenAPI spec. The code runs inside a sandboxed runtime, usually a V8 isolate, and can chain multiple API calls, handle pagination, apply conditional logic, and aggregate results in a single execution.

This is powerful because it replaces the typical multi-turn pattern where an agent calls one tool, feeds the result back into the LLM, reasons about the next step, calls another tool, and repeats. With execute, the agent writes a short program that handles the entire workflow in one shot.

```typescript
// Agent-generated code running inside the execute sandbox.
// `api` is the typed SDK client generated from the OpenAPI spec.
const zones = await api.zones.list({ name: "example.com" });
const zone = zones.result[0];
if (!zone) {
  throw new Error("Zone example.com not found");
}

// Fetch the zone's DNS records and keep only the A records.
const records = await api.dns.records.list({ zone_id: zone.id });
const aRecords = records.result.filter((r) => r.type === "A");

// Emit one compact JSON result for the agent.
console.log(
  JSON.stringify({
    zone: zone.name,
    aRecords: aRecords.map((r) => ({ name: r.name, content: r.content })),
  }),
);
```

The agent gets a single, structured result back. No intermediate round trips. No context pollution from feeding raw API responses through the LLM between calls.

Why the token savings are so dramatic

The fixed cost of the search/execute pattern is roughly constant:

  • The search tool definition: ~200 tokens
  • The execute tool definition: ~200 tokens
  • The typed SDK interface documentation: ~600 tokens

That is around 1,000 tokens total, regardless of whether the underlying API has 50 endpoints or 5,000. Compare that to the one-tool-per-endpoint approach where 2,500 endpoints produce over 1.17 million tokens of definitions. That is a 99.9% reduction in context window consumption.
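Plugging the rough figures from earlier into that comparison shows both the headline saving and the break-even point. The numbers are the estimates quoted in this article, not measurements.

```typescript
// Fixed cost of the pattern: search tool + execute tool + SDK interface docs.
const fixedCostTokens = 200 + 200 + 600;

// Estimated one-tool-per-endpoint costs quoted earlier: [endpoints, tokens].
const perEndpointEstimates: Array<[number, number]> = [
  [10, 500],
  [100, 5000],
  [1000, 50000],
  [2500, 1170000],
];

// Percentage of context saved by switching to search/execute.
// A negative value means the pattern costs more than listing the tools directly.
function reductionPercent(toolListTokens: number): string {
  return ((1 - fixedCostTokens / toolListTokens) * 100).toFixed(1);
}

const reductions = perEndpointEstimates.map((row) => reductionPercent(row[1]));
perEndpointEstimates.forEach((row, i) => {
  console.log(row[0] + " endpoints: " + reductions[i] + "% reduction");
});
```

For the 10-endpoint case the result is negative: the fixed overhead is larger than the tool list it replaces, which is why small APIs are better served by the direct approach.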

When to use the search/execute pattern

The search/execute pattern is not universally better than the one-tool-per-endpoint approach. Each design makes different trade-offs.

Use search/execute when

  • Your API surface is large — more than ~50 endpoints, and especially above 100. The token savings become significant at this scale.
  • Agents need to chain operations — workflows that involve multiple dependent API calls benefit from the execute tool’s ability to run them as a single program.
  • Your API has a high-quality OpenAPI spec — the search tool depends on rich descriptions, accurate schemas, and consistent tagging to return useful results. A sparse or inaccurate spec will produce poor search results and broken generated code.
  • You need semantic discovery — when agents should be able to describe what they want in natural language rather than knowing exact tool names.

Stick with one-tool-per-endpoint when

  • Your API is small — fewer than ~30 endpoints fit comfortably in the context window. The overhead of the search/execute layer is not worth it.
  • Tools are well-named and obvious — if your tool names are self-explanatory (e.g., create-user, get-invoice), agents can pick the right one without a search step.
  • You want maximum predictability — individual tools produce deterministic behavior. The execute step involves code generation, which introduces a small risk of the agent writing incorrect code.
  • Your users expect direct tool calls — some MCP clients present tool lists to users directly. A two-tool interface with search and execute is less intuitive than a list of named operations.

Most teams building MCP servers for internal APIs with under 50 endpoints should start with the one-tool-per-endpoint approach. If you want a deeper look at that approach, see Best Practices for Mapping REST APIs to MCP Tools.

Security implications of the execute step

The execute tool runs agent-generated code against your API. That is powerful, and it requires careful guardrails.

Sandbox boundaries

The code execution environment must be locked down:

  • No file system access — the sandbox should have no ability to read or write files on the host.
  • No environment variable access — API keys and secrets must not be accessible to the generated code through environment variables or global state.
  • Controlled network access — the sandbox should only be able to reach the APIs it is explicitly authorized to use. Outbound fetch calls to arbitrary URLs should be blocked.
  • Resource limits — execution time, memory, and CPU should all be capped to prevent denial-of-service through infinite loops or excessive allocations.

V8 isolates are the natural fit here. They start in milliseconds, use minimal memory, and give each script its own isolated heap inside a shared process, without the overhead of containers or separate processes. This is the same technology that powers edge runtimes like Cloudflare Workers and Zuplo’s programmable gateway.
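The controlled network access rule is the easiest of these to get subtly wrong, so here is a small sketch of an outbound allow-list check. The host name and function names are assumptions for illustration. In a real sandbox the check would wrap whatever fetch implementation the generated code is given.

```typescript
// Hosts the sandboxed code may call. Hypothetical value.
const allowedHosts = ["api.example.com"];

// Allow only HTTPS requests whose exact hostname is on the list.
// Parsing with URL avoids prefix tricks like "api.example.com.evil.net".
function isAllowedUrl(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch (_err) {
    return false; // not a valid absolute URL
  }
  return url.protocol === "https:" && allowedHosts.indexOf(url.hostname) !== -1;
}

// Wrap the real transport so blocked requests fail before leaving the sandbox.
function guardedRequest<T>(rawUrl: string, send: (url: string) => T): T {
  if (!isAllowedUrl(rawUrl)) {
    throw new Error("Blocked outbound request to " + rawUrl);
  }
  return send(rawUrl);
}

console.log(isAllowedUrl("https://api.example.com/zones"));
console.log(isAllowedUrl("https://api.example.com.evil.net/zones"));
```

Matching on the parsed hostname rather than on a string prefix is the design choice that matters here.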

Credential scoping

Every agent session should operate with the minimum credentials required for its task. In practice this means:

  • Per-agent API keys — each agent or agent session gets its own API key with scoped permissions, not a shared admin key. Zuplo’s API key management supports this with per-consumer keys and configurable metadata that policies can use to enforce permissions.
  • Per-route policies — even within a scoped key, individual routes can enforce additional constraints like rate limits, IP restrictions, or request validation. This limits the blast radius if an agent generates code that calls endpoints it should not.
  • OAuth scope downscoping — for APIs using OAuth, the token issued to the agent session should carry only the scopes needed for the current task.
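A scoped key only helps if something checks it on every call. The sketch below shows one way such a check could look. The metadata shape (`allowedTags`, `readOnly`) is a hypothetical example, not Zuplo's actual key metadata format.

```typescript
// Hypothetical metadata attached to an agent's API key.
interface AgentKeyMetadata {
  agentId: string;
  allowedTags: string[]; // product areas the key may touch
  readOnly: boolean;
}

interface OperationRef {
  operationId: string;
  method: string;
  tag: string;
}

// Deny by default: the operation must be in an allowed area,
// and read-only keys may only use safe HTTP methods.
function isPermitted(key: AgentKeyMetadata, op: OperationRef): boolean {
  const isRead = op.method === "GET" || op.method === "HEAD";
  if (key.readOnly && !isRead) return false;
  return key.allowedTags.indexOf(op.tag) !== -1;
}

const dnsReaderKey: AgentKeyMetadata = {
  agentId: "agent-123",
  allowedTags: ["dns"],
  readOnly: true,
};

console.log(isPermitted(dnsReaderKey, { operationId: "listDnsRecords", method: "GET", tag: "dns" }));
console.log(isPermitted(dnsReaderKey, { operationId: "deleteInstance", method: "DELETE", tag: "compute" }));
```

Because the check runs per operation, it applies equally to a direct tool call and to a call made from inside generated code.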

Rate limiting

Agent-generated code can produce bursts of API calls within a single execute invocation. A short script that paginates through thousands of records will hit the API much harder than a human user making one request at a time. Your rate limiting strategy should account for this by setting per-key and per-route limits that accommodate legitimate multi-call workflows while preventing runaway scripts from overwhelming your backend.
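One common way to allow short bursts while capping sustained load is a token bucket per key. The following is a minimal in-memory sketch with made-up limits. It is not how any particular gateway implements rate limiting, and a distributed deployment would need shared state.

```typescript
// Token bucket: holds up to `capacity` tokens, refilled at a steady rate.
// Each API call spends one token. The clock is passed in to keep it testable.
class TokenBucket {
  private capacity: number;
  private refillPerSecond: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity;
    this.lastRefillMs = nowMs;
  }

  tryConsume(nowMs: number): boolean {
    const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// One bucket per API key, created on first use.
const buckets: { [apiKey: string]: TokenBucket } = {};

function allowCall(apiKey: string, nowMs: number): boolean {
  let bucket: TokenBucket | undefined = buckets[apiKey];
  if (!bucket) {
    bucket = new TokenBucket(3, 1, nowMs); // burst of 3, then 1 call per second
    buckets[apiKey] = bucket;
  }
  return bucket.tryConsume(nowMs);
}

// A script firing four calls at once: the first three pass, the fourth is rejected.
const burstResults = [0, 0, 0, 0].map((t) => allowCall("agent-key-a", t));
console.log(burstResults);
```

The burst size sets how long a legitimate pagination loop can run at full speed, and the refill rate bounds what a runaway script can do over time.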

Implementing the pattern on an OpenAPI-first gateway

The search/execute pattern depends on three things being true about your infrastructure:

  1. Your API is described by a high-quality OpenAPI spec — this is the search index and the source of type information for the execute step.
  2. You have a runtime that can execute code cheaply and safely — V8 isolates at the edge are ideal.
  3. You can enforce authentication and authorization per operation — the execute step calls real endpoints, and each one needs its own access control.

An OpenAPI-first API gateway satisfies all three by design.

How Zuplo’s architecture maps to the pattern

Zuplo is built around OpenAPI as the source of truth for routing, validation, documentation, and — with the MCP Server Handler — for MCP tool definitions. Here is how each part of the search/execute pattern maps to existing Zuplo capabilities:

Search → OpenAPI-derived tool discovery

The MCP Server Handler already transforms OpenAPI-defined routes into MCP tool definitions that agents discover through tools/list. You control which operations are exposed by listing their operationId values in the handler configuration. The tool names, descriptions, and parameter schemas are derived directly from the OpenAPI spec.

For a search/execute implementation, this same OpenAPI spec becomes the search index. A custom search tool can query the spec by path, tag, description, or operation ID and return matching operation definitions to the agent.
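Turning the spec into something searchable is mostly a matter of flattening its paths object. Here is a rough sketch of that step. The types cover only the handful of OpenAPI fields used, and the sample spec is invented.

```typescript
// Minimal subset of an OpenAPI document: paths -> HTTP method -> operation.
interface OpenApiOperation {
  operationId?: string;
  summary?: string;
  tags?: string[];
}
interface OpenApiDocument {
  paths: { [path: string]: { [method: string]: OpenApiOperation } };
}

interface IndexedOperation {
  operationId: string;
  method: string;
  path: string;
  tags: string[];
  summary: string;
}

const HTTP_METHODS = ["get", "put", "post", "delete", "options", "head", "patch", "trace"];

// Flatten the nested paths object into one entry per operation.
// Operations without an operationId are skipped because nothing can refer to them.
function buildIndex(doc: OpenApiDocument): IndexedOperation[] {
  const index: IndexedOperation[] = [];
  for (const path of Object.keys(doc.paths)) {
    const pathItem = doc.paths[path];
    if (!pathItem) continue;
    for (const method of HTTP_METHODS) {
      const op = pathItem[method];
      if (!op || !op.operationId) continue;
      index.push({
        operationId: op.operationId,
        method: method.toUpperCase(),
        path: path,
        tags: op.tags || [],
        summary: op.summary || "",
      });
    }
  }
  return index;
}

const sampleSpec: OpenApiDocument = {
  paths: {
    "/zones/{zone_id}/dns_records": {
      get: { operationId: "listDnsRecords", summary: "List DNS records", tags: ["dns"] },
      post: { operationId: "createDnsRecord", summary: "Create a DNS record", tags: ["dns"] },
    },
    "/compute/instances": {
      get: { operationId: "listInstances", summary: "List compute instances", tags: ["compute"] },
    },
  },
};

const specIndex = buildIndex(sampleSpec);
const dnsOperations = specIndex.filter((op) => op.tags.indexOf("dns") !== -1);
console.log(dnsOperations.map((op) => op.operationId));
```

The index can be built once when the spec is deployed, so each search call only pays for the lookup.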

Execute → Programmable policies in V8 isolates

Zuplo’s programmable gateway runs custom TypeScript policies inside V8 isolates at 300+ edge locations. This is the same primitive that the execute step needs: a sandboxed runtime that can call API operations quickly, cheaply, and safely.

A custom policy or handler can accept agent-generated code, execute it in an isolated context with access to a typed API client, and return the result. The policy pipeline ensures that every API call made during execution passes through authentication, rate limiting, and validation — the same policies that protect direct API calls.

Access control → Per-key, per-route policies

Zuplo’s API key management provides per-consumer keys with configurable metadata and permissions. Combined with per-route inbound policies, this gives you fine-grained control over what each agent can do. An agent with a key scoped to read-only DNS operations cannot use the execute tool to delete a compute instance, even if it generates code that tries to.

A practical configuration

Here is what the MCP Server Handler configuration looks like for exposing a curated set of operations as tools:

```json
{
  "handler": {
    "export": "mcpServerHandler",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "name": "my-api-mcp",
      "version": "1.0.0",
      "operations": [
        {
          "file": "./config/routes.oas.json",
          "id": "listDnsRecords"
        },
        {
          "file": "./config/routes.oas.json",
          "id": "createDnsRecord"
        },
        {
          "file": "./config/routes.oas.json",
          "id": "deleteDnsRecord"
        }
      ]
    }
  }
}
```

Each operation points to an operationId in your OpenAPI spec. The handler reads the spec, extracts the route definition, and generates the corresponding MCP tool definition automatically. No duplication, no drift between your API definition and what agents see.

For a full walkthrough of setting this up, see Create an MCP Server from Your OpenAPI Spec.

Observability: keeping agent workflows auditable

When an agent chains five API calls inside a single execute invocation, your standard request logs show five individual requests. But without context linking them together, you cannot tell that they were part of a single agent workflow.

Good observability for the search/execute pattern requires logging at two levels:

Search queries — log what the agent searched for, what results were returned, and how many operations matched. This tells you which parts of your API surface agents are actually using and where your OpenAPI descriptions might need improvement.

Execute payloads — log the code the agent submitted, which API calls it made, and the results of each call. Correlate these with a session or request ID so you can reconstruct the full workflow from a single audit trail.

Because the MCP Server Handler re-invokes routes within the gateway rather than making external HTTP calls, each API call passes through the full policy pipeline — including any logging policies you have configured. This means you can attach request logging to individual routes and capture each operation invoked during an execute step with its own log entry, giving you the raw material to reconstruct agent workflows after the fact.
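Reconstructing a workflow from those entries comes down to grouping them by a shared identifier. The sketch below assumes each log entry carries an `executionId` stamped by the execute step. That field name and the log shape are illustrative assumptions, not an existing log format.

```typescript
// Hypothetical per-request log entry emitted by a logging policy.
interface RequestLog {
  executionId: string; // assumed to be set once per execute invocation
  operationId: string;
  status: number;
  durationMs: number;
}

interface WorkflowTrace {
  executionId: string;
  calls: RequestLog[];
  totalDurationMs: number;
  failed: boolean;
}

// Group individual request logs into one trace per execute invocation,
// preserving the order in which calls were made.
function groupByExecution(logs: RequestLog[]): WorkflowTrace[] {
  const traces: WorkflowTrace[] = [];
  for (const log of logs) {
    let trace: WorkflowTrace | undefined = traces.filter(
      (t) => t.executionId === log.executionId,
    )[0];
    if (!trace) {
      trace = { executionId: log.executionId, calls: [], totalDurationMs: 0, failed: false };
      traces.push(trace);
    }
    trace.calls.push(log);
    trace.totalDurationMs += log.durationMs;
    if (log.status >= 400) trace.failed = true;
  }
  return traces;
}

const workflowTraces = groupByExecution([
  { executionId: "exec-1", operationId: "listZones", status: 200, durationMs: 40 },
  { executionId: "exec-2", operationId: "getInvoice", status: 403, durationMs: 12 },
  { executionId: "exec-1", operationId: "listDnsRecords", status: 200, durationMs: 65 },
]);
console.log(JSON.stringify(workflowTraces, null, 2));
```

With traces in this form, questions like "which execute calls touched billing" or "which workflows hit a 403" become simple filters.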

The broader trend: OpenAPI as the MCP interface layer

The search/execute pattern is part of a larger shift in how APIs expose themselves to AI agents. Rather than building separate MCP servers that duplicate API logic, teams are treating their existing OpenAPI specs as the single source of truth and deriving MCP tool definitions from them.

This is exactly what tools like Speakeasy’s x-speakeasy-mcp extension and various open-source openapi-mcp generators are formalizing: the OpenAPI spec defines the API surface, and the MCP layer is a view on top of it. The search/execute pattern takes this one step further by making the spec queryable at runtime rather than expanded into static tool definitions.

For teams already running an OpenAPI-first API gateway, the path to the search/execute pattern is shorter than it looks. The spec is already there. The routing and validation are already there. The per-route access control is already there. What is new is the search interface and the sandboxed execution layer — and both map cleanly onto capabilities that a programmable edge gateway already provides.

Getting started

The search/execute pattern gives large APIs a way to participate in the MCP ecosystem without overwhelming agent context windows. If your API has a well-maintained OpenAPI spec, you already have the foundation.

To build your first MCP server on Zuplo, start with the MCP Server Handler documentation. For a step-by-step walkthrough, see Create an MCP Server from Your OpenAPI Spec. To understand how API gateways, AI gateways, and MCP gateways fit together in a modern AI infrastructure stack, read The Three Gates of AI Infrastructure. And for a comprehensive overview of MCP architecture and implementation options, see What Is an MCP Server?.