---
title: "The Search/Execute MCP Design Pattern: How Token-Efficient MCP Servers Are Reshaping Agent Integrations"
description: "Learn how the search/execute MCP pattern cuts token costs by 99.9% for large APIs and why OpenAPI-first gateways are the natural place to implement it."
canonicalUrl: "https://zuplo.com/learning-center/search-execute-mcp-design-pattern"
pageType: "learning-center"
authors: "nate"
tags: "Model Context Protocol, OpenAPI"
image: "https://zuplo.com/og?text=The%20Search%2FExecute%20MCP%20Design%20Pattern"
---
If your API has more than a few dozen endpoints, exposing each one as a separate
MCP tool creates a problem that scales in exactly the wrong direction. Every
tool definition consumes tokens in the agent's context window, and for large API
surfaces the math gets ugly fast. A 2,500-endpoint API expressed as individual
MCP tools can consume over a million input tokens before the agent even starts
reasoning about what to do.

The **search/execute pattern** is a design approach that compresses that upfront
cost to roughly 1,000 tokens regardless of how many endpoints your API has. It does this
by replacing hundreds or thousands of individual tool definitions with just two:
one to discover the right operations and one to run them. This article breaks
down how it works, when to use it, and how to implement it on an OpenAPI-first
API gateway.

## The token-cost problem with one-tool-per-endpoint MCP servers

The [Model Context Protocol](https://modelcontextprotocol.io) defines how AI
agents discover and invoke external tools. The standard approach is
straightforward: each API endpoint becomes one MCP tool, and each tool
definition includes its name, description, parameters, and response schema. When
an agent connects, it calls `tools/list` and receives every definition at once.

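For a small API, this works well. Ten simple endpoints might produce a few
hundred tokens of tool definitions. But context window consumption grows at
least linearly with the number of endpoints, and it grows faster when each
definition carries full parameter and response schemas. As rough, illustrative
estimates:

- **10 endpoints**: ~500 tokens
- **100 endpoints**: ~5,000 tokens
- **1,000 endpoints**: ~50,000 tokens
- **2,500 endpoints**: ~1,170,000 tokens

The first three figures assume a lean ~50 tokens per tool. The last works out to
roughly 470 tokens per tool (1,170,000 / 2,500), so it reflects much heavier
definitions rather than pure linear scaling from the smaller cases.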

At the larger end, you are spending more than a million tokens just to describe
the tools. That leaves less room for the actual conversation, the user's
instructions, and the agent's reasoning. It also drives up cost: at typical
input-token pricing, every agent session starts with a substantial bill before
any work gets done.

There is also a quality problem. LLMs tend to select tools less reliably as the
number of options grows into the thousands. The more tool definitions an agent
holds in context, the more likely it is to pick the wrong one or hallucinate
parameters for a tool that almost matches what it needs.

## How the search/execute pattern works

The search/execute pattern solves the token problem by splitting tool access
into two phases, each backed by a single MCP tool.
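
To make the two-tool surface concrete, here is a minimal sketch of what a `tools/list` response could contain under this pattern. The `name`, `description`, and `inputSchema` fields follow the general shape of MCP tool definitions, but the specific descriptions and parameters (`query`, `tag`, `code`) are illustrative assumptions, not taken from any particular server.

```typescript
// Hypothetical tool definitions for a search/execute MCP server.
// Parameter names and descriptions are assumptions for illustration.
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, { type: string; description: string }>;
    required: string[];
  };
}

const tools: ToolDefinition[] = [
  {
    name: "search",
    description: "Find API operations matching a natural-language query or tag.",
    inputSchema: {
      type: "object",
      properties: {
        query: { type: "string", description: "What the agent wants to do" },
        tag: { type: "string", description: "Optional OpenAPI tag filter" },
      },
      required: ["query"],
    },
  },
  {
    name: "execute",
    description: "Run code against the typed API client inside a sandbox.",
    inputSchema: {
      type: "object",
      properties: {
        code: { type: "string", description: "Program to run in the sandbox" },
      },
      required: ["code"],
    },
  },
];

// The list stays at two entries no matter how many endpoints sit behind it.
console.log(tools.map((t) => t.name).join(", "));
```

The point of the sketch is the size of the list: the agent's upfront context holds these two definitions, and everything else is pulled in on demand.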

### Phase 1: Search

The `search` tool gives the agent a way to query the API's
[OpenAPI specification](https://zuplo.com/docs/articles/openapi) without loading
the entire spec into context. The agent describes what it wants to do in natural
language or filters by path, tag, or product area, and the search tool returns
only the matching operation definitions.

For example, an agent that needs to list DNS records does not need to see the
schemas for compute instances, billing, or storage. It calls `search` with a
query like "DNS records" and receives back only the relevant endpoints — their
paths, methods, parameters, and descriptions.

This keeps the context window lean. Instead of loading thousands of tool
definitions upfront, the agent loads a compact search interface and pulls in
endpoint details on demand.
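
As a rough illustration of the search phase, the sketch below scores operations by how many query terms appear in their ID, path, summary, or tags. The `Operation` shape, the sample data, and the `searchOperations` helper are all hypothetical; a production search tool would more likely use embeddings or a proper index, and would read operations from the real spec.

```typescript
// Minimal keyword search over OpenAPI-style operations (illustrative only).
interface Operation {
  operationId: string;
  method: string;
  path: string;
  summary: string;
  tags: string[];
}

// Hypothetical stand-in for operations extracted from an OpenAPI spec.
const operations: Operation[] = [
  { operationId: "listDnsRecords", method: "GET", path: "/zones/{zone_id}/dns_records", summary: "List DNS records for a zone", tags: ["dns"] },
  { operationId: "createDnsRecord", method: "POST", path: "/zones/{zone_id}/dns_records", summary: "Create a DNS record", tags: ["dns"] },
  { operationId: "listInstances", method: "GET", path: "/compute/instances", summary: "List compute instances", tags: ["compute"] },
  { operationId: "getInvoice", method: "GET", path: "/billing/invoices/{id}", summary: "Get an invoice", tags: ["billing"] },
];

function searchOperations(query: string, ops: Operation[], limit = 5): Operation[] {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  return ops
    .map((op) => {
      // Build one lowercase string per operation and count matching terms.
      const haystack = [op.operationId, op.path, op.summary, ...op.tags]
        .join(" ")
        .toLowerCase();
      const score = terms.filter((t) => haystack.includes(t)).length;
      return { op, score };
    })
    .filter((s) => s.score > 0) // drop operations that match nothing
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((s) => s.op);
}

const matches = searchOperations("DNS records", operations);
console.log(matches.map((m) => `${m.method} ${m.path}`));
```

With this sample data, the query returns only the two DNS operations, so the compute and billing schemas never enter the agent's context.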

### Phase 2: Execute

The `execute` tool accepts code — typically JavaScript or TypeScript — that
calls the API using a typed SDK generated from the OpenAPI spec. The code runs
inside a sandboxed runtime, usually a V8 isolate, and can chain multiple API
calls, handle pagination, apply conditional logic, and aggregate results in a
single execution.

This is powerful because it replaces the typical multi-turn pattern where an
agent calls one tool, feeds the result back into the LLM, reasons about the next
step, calls another tool, and repeats. With `execute`, the agent writes a short
program that handles the entire workflow in one shot.

```typescript
// Agent-generated code running inside the execute sandbox.
// `api` is the typed client the sandbox injects; it is not imported here.
const zones = await api.zones.list({ name: "example.com" });
const zone = zones.result[0];
if (!zone) {
  throw new Error("No zone found for example.com");
}

const records = await api.dns.records.list({ zone_id: zone.id });
const aRecords = records.result.filter((r) => r.type === "A");

// Only this compact summary is returned to the agent.
console.log(
  JSON.stringify({
    zone: zone.name,
    aRecords: aRecords.map((r) => ({ name: r.name, content: r.content })),
  }),
);
```

The agent gets a single, structured result back. No intermediate round trips. No
context pollution from feeding raw API responses through the LLM between calls.

### Why the token savings are so dramatic

The upfront cost of the search/execute pattern is roughly constant:

- The `search` tool definition: ~200 tokens
- The `execute` tool definition: ~200 tokens
- The typed SDK interface documentation: ~600 tokens

That is around 1,000 tokens upfront, regardless of whether the underlying API
has 50 endpoints or 5,000. Each `search` call adds the operations it returns,
but only for the endpoints the task actually needs. Compare that to the
one-tool-per-endpoint approach, where 2,500 endpoints produce roughly 1.17
million tokens of definitions. That is a **99.9% reduction** in upfront context
window consumption.
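
The arithmetic behind that percentage can be checked directly. This short sketch uses only the estimates quoted in this article; the variable names are just labels for those figures.

```typescript
// Figures quoted in this article; all are rough estimates.
const perEndpointTotal = 1170000; // one-tool-per-endpoint, 2,500 endpoints
const endpointCount = 2500;
const searchExecuteTotal = 200 + 200 + 600; // search + execute + SDK docs

// Fraction of upfront tokens saved by switching patterns.
const reduction = 1 - searchExecuteTotal / perEndpointTotal;
// Average definition size implied by the 2,500-endpoint figure.
const tokensPerTool = perEndpointTotal / endpointCount;

console.log(`${(reduction * 100).toFixed(1)}% reduction`); // 99.9% reduction
console.log(`${Math.round(tokensPerTool)} tokens per tool on average`); // 468
```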

## When to use the search/execute pattern

The search/execute pattern is not universally better than the
one-tool-per-endpoint approach. Each design makes different trade-offs.

### Use search/execute when

- **Your API surface is large** — more than ~50 endpoints, and especially
  above 100. The token savings become significant at this scale.
- **Agents need to chain operations** — workflows that involve multiple
  dependent API calls benefit from the `execute` tool's ability to run them as a
  single program.
- **Your API has a high-quality OpenAPI spec** — the search tool depends on rich
  descriptions, accurate schemas, and consistent tagging to return useful
  results. A sparse or inaccurate spec will produce poor search results and
  broken generated code.
- **You need semantic discovery** — when agents should be able to describe what
  they want in natural language rather than knowing exact tool names.

### Stick with one-tool-per-endpoint when

- **Your API is small** — fewer than ~30 endpoints fit comfortably in the
  context window. The overhead of the search/execute layer is not worth it.
- **Tools are well-named and obvious** — if your tool names are self-explanatory
  (e.g., `create-user`, `get-invoice`), agents can pick the right one without a
  search step.
- **You want maximum predictability** — individual tools produce deterministic
  behavior. The execute step involves code generation, which introduces a small
  risk of the agent writing incorrect code.
- **Your users expect direct tool calls** — some MCP clients present tool lists
  to users directly. A two-tool interface with `search` and `execute` is less
  intuitive than a list of named operations.

Most teams building MCP servers for internal APIs with under 50 endpoints should
start with the one-tool-per-endpoint approach. If you want a deeper look at that
approach, see
[Best Practices for Mapping REST APIs to MCP Tools](/learning-center/mapping-rest-apis-to-mcp-tools).

## Security implications of the execute step

The `execute` tool runs agent-generated code against your API. That is powerful,
and it requires careful guardrails.

### Sandbox boundaries

The code execution environment must be locked down:

- **No file system access** — the sandbox should have no ability to read or
  write files on the host.
- **No environment variable access** — API keys and secrets must not be
  accessible to the generated code through environment variables or global
  state.
- **Controlled network access** — the sandbox should only be able to reach the
  APIs it is explicitly authorized to use. Outbound fetch calls to arbitrary
  URLs should be blocked.
- **Resource limits** — execution time, memory, and CPU should all be capped to
  prevent denial-of-service through infinite loops or excessive allocations.
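
One way to picture the controlled network access rule is a fetch wrapper that rejects any URL whose origin is not on an allow list. This is a minimal sketch, not how any particular runtime implements it: `isAllowedUrl`, `makeScopedFetch`, and the example origins are hypothetical, and the wrapper takes a generic fetch-like function so it can be exercised without real network calls.

```typescript
// Sketch of an allow-listed fetch for sandboxed code (hypothetical helpers).
type Fetcher<T> = (url: string) => Promise<T>;

function isAllowedUrl(rawUrl: string, allowedOrigins: string[]): boolean {
  let origin: string;
  try {
    origin = new URL(rawUrl).origin; // scheme + host + port
  } catch {
    return false; // unparseable URLs are rejected outright
  }
  return allowedOrigins.includes(origin);
}

function makeScopedFetch<T>(inner: Fetcher<T>, allowedOrigins: string[]): Fetcher<T> {
  return (url) => {
    if (!isAllowedUrl(url, allowedOrigins)) {
      // The inner fetch is never invoked for blocked destinations.
      return Promise.reject(new Error(`Blocked outbound request to ${url}`));
    }
    return inner(url);
  };
}

const allowedOrigins = ["https://api.example.com"];
console.log(isAllowedUrl("https://api.example.com/v1/zones", allowedOrigins)); // true
console.log(isAllowedUrl("https://api.example.com.evil.net/", allowedOrigins)); // false
```

Comparing full origins rather than checking a hostname prefix avoids the lookalike-domain case shown in the last line.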

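V8 isolates are a natural fit here. They start in milliseconds, use minimal
memory, and give each execution its own heap and globals inside a shared
process, without the overhead of containers or per-request processes. An isolate
by itself only separates memory, so the restrictions above still depend on which
APIs the host runtime exposes to the code. This is the same technology that
powers edge runtimes like Cloudflare Workers and Zuplo's
[programmable gateway](https://zuplo.com/features/programmable).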

### Credential scoping

Every agent session should operate with the minimum credentials required for its
task. In practice this means:

- **Per-agent API keys** — each agent or agent session gets its own API key with
  scoped permissions, not a shared admin key. Zuplo's
  [API key management](https://zuplo.com/docs/articles/api-key-management)
  supports this with per-consumer keys and configurable metadata that policies
  can use to enforce permissions.
- **Per-route policies** — even within a scoped key, individual routes can
  enforce additional constraints like rate limits, IP restrictions, or request
  validation. This limits the blast radius if an agent generates code that calls
  endpoints it should not.
- **OAuth scope downscoping** — for APIs using OAuth, the token issued to the
  agent session should carry only the scopes needed for the current task.
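
The scoping rules above boil down to a check that runs before each call the generated code makes. The sketch below assumes a key carries a list of permitted operation IDs and a read-only flag; the `AgentKey` shape and `isCallAllowed` helper are hypothetical and do not represent a Zuplo API.

```typescript
// Hypothetical per-agent key metadata and a least-privilege check.
interface AgentKey {
  consumer: string;
  allowedOperations: string[];
  readOnly: boolean;
}

interface OperationCall {
  operationId: string;
  method: string;
}

function isCallAllowed(key: AgentKey, call: OperationCall): boolean {
  const isRead = call.method === "GET" || call.method === "HEAD";
  if (key.readOnly && !isRead) {
    return false; // read-only keys never mutate, whatever the operation
  }
  return key.allowedOperations.includes(call.operationId);
}

const dnsReader: AgentKey = {
  consumer: "dns-report-agent",
  allowedOperations: ["listZones", "listDnsRecords"],
  readOnly: true,
};

console.log(isCallAllowed(dnsReader, { operationId: "listDnsRecords", method: "GET" })); // true
console.log(isCallAllowed(dnsReader, { operationId: "deleteInstance", method: "DELETE" })); // false
```

Enforcing this at the gateway rather than inside the sandbox means the check still holds even if the generated code tries to work around the typed client.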

### Rate limiting

Agent-generated code can produce bursts of API calls within a single `execute`
invocation. A short script that paginates through thousands of records will hit
the API much harder than a human user making one request at a time. Your
[rate limiting](/learning-center/api-rate-limiting) strategy should account for
this by setting per-key and per-route limits that accommodate legitimate
multi-call workflows while preventing runaway scripts from overwhelming your
backend.
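
In addition to gateway-level limits, a per-execution call budget can stop a runaway script early. The following is a minimal sketch under that assumption; `CallBudget` and `withBudget` are hypothetical helpers, and the limit of 3 is arbitrary.

```typescript
// Hypothetical per-execution budget that caps how many API calls one
// `execute` invocation may make.
class CallBudget {
  private used = 0;
  private readonly maxCalls: number;

  constructor(maxCalls: number) {
    this.maxCalls = maxCalls;
  }

  consume(): void {
    if (this.used >= this.maxCalls) {
      throw new Error(`Call budget of ${this.maxCalls} exceeded`);
    }
    this.used += 1;
  }

  get remaining(): number {
    return this.maxCalls - this.used;
  }
}

// Wrap any client method so each invocation spends one unit of budget.
function withBudget<A extends unknown[], R>(
  fn: (...args: A) => R,
  budget: CallBudget,
): (...args: A) => R {
  return (...args: A) => {
    budget.consume();
    return fn(...args);
  };
}

const budget = new CallBudget(3);
const listPage = withBudget((page: number) => `page ${page}`, budget);

listPage(1);
listPage(2);
console.log(budget.remaining); // 1
```

A budget like this complements per-key rate limits: the rate limit protects the backend across sessions, while the budget bounds a single script.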

## Implementing the pattern on an OpenAPI-first gateway

The search/execute pattern depends on three things being true about your
infrastructure:

1. **Your API is described by a high-quality OpenAPI spec** — this is the search
   index and the source of type information for the execute step.
2. **You have a runtime that can execute code cheaply and safely** — V8 isolates
   at the edge are ideal.
3. **You can enforce authentication and authorization per operation** — the
   execute step calls real endpoints, and each one needs its own access control.

An OpenAPI-first API gateway satisfies all three by design.

### How Zuplo's architecture maps to the pattern

Zuplo is built around
[OpenAPI as the source of truth](https://zuplo.com/docs/articles/openapi) for
routing, validation, documentation, and — with the
[MCP Server Handler](https://zuplo.com/docs/handlers/mcp-server) — for MCP tool
definitions. Here is how each part of the search/execute pattern maps to
existing Zuplo capabilities:

**Search → OpenAPI-derived tool discovery**

The MCP Server Handler already transforms OpenAPI-defined routes into MCP tool
definitions that agents discover through `tools/list`. You control which
operations are exposed by listing their `operationId` values in the handler
configuration. The tool names, descriptions, and parameter schemas are derived
directly from the OpenAPI spec.

For a search/execute implementation, this same OpenAPI spec becomes the search
index. A custom search tool can query the spec by path, tag, description, or
operation ID and return matching operation definitions to the agent.

**Execute → Programmable policies in V8 isolates**

Zuplo's [programmable gateway](https://zuplo.com/features/programmable) runs
custom TypeScript policies inside V8 isolates at 300+ edge locations. This is
the same primitive that the execute step needs: a sandboxed runtime that can
call API operations quickly, cheaply, and safely.

A custom policy or handler can accept agent-generated code, execute it in an
isolated context with access to a typed API client, and return the result. The
policy pipeline ensures that every API call made during execution passes through
authentication, rate limiting, and validation — the same policies that protect
direct API calls.

**Access control → Per-key, per-route policies**

Zuplo's [API key management](https://zuplo.com/docs/articles/api-key-management)
provides per-consumer keys with configurable metadata and permissions. Combined
with per-route inbound policies, this gives you fine-grained control over what
each agent can do. An agent with a key scoped to read-only DNS operations cannot
use the `execute` tool to delete a compute instance, even if it generates code
that tries to.

### A practical configuration

Here is what the MCP Server Handler configuration looks like for exposing a
curated set of operations as tools:

```json
{
  "handler": {
    "export": "mcpServerHandler",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "name": "my-api-mcp",
      "version": "1.0.0",
      "operations": [
        {
          "file": "./config/routes.oas.json",
          "id": "listDnsRecords"
        },
        {
          "file": "./config/routes.oas.json",
          "id": "createDnsRecord"
        },
        {
          "file": "./config/routes.oas.json",
          "id": "deleteDnsRecord"
        }
      ]
    }
  }
}
```

Each operation points to an `operationId` in your OpenAPI spec. The handler
reads the spec, extracts the route definition, and generates the corresponding
MCP tool definition automatically. No duplication, no drift between your API
definition and what agents see.

For a full walkthrough of setting this up, see
[Create an MCP Server from Your OpenAPI Spec](/learning-center/create-mcp-server-from-openapi).

## Observability: keeping agent workflows auditable

When an agent chains five API calls inside a single `execute` invocation, your
standard request logs show five individual requests. But without context linking
them together, you cannot tell that they were part of a single agent workflow.

Good observability for the search/execute pattern requires logging at two
levels:

**Search queries** — log what the agent searched for, what results were
returned, and how many operations matched. This tells you which parts of your
API surface agents are actually using and where your OpenAPI descriptions might
need improvement.

**Execute payloads** — log the code the agent submitted, which API calls it
made, and the results of each call. Correlate these with a session or request ID
so you can reconstruct the full workflow from a single audit trail.

Because the MCP Server Handler re-invokes routes within the gateway rather than
making external HTTP calls, each API call passes through the full policy
pipeline — including any logging policies you have configured. This means you
can attach request logging to individual routes and capture each operation
invoked during an `execute` step with its own log entry, giving you the raw
material to reconstruct agent workflows after the fact.
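
As a sketch of the correlation step, the code below groups per-request log entries by a session ID so each agent workflow can be read back in order. The `AuditEntry` shape and the sample entries are hypothetical; real log fields depend on the logging policy you attach.

```typescript
// Hypothetical per-request log entries tagged with a session ID.
interface AuditEntry {
  sessionId: string;
  operationId: string;
  status: number;
}

// Entries from two agent sessions, interleaved as they might arrive.
const logEntries: AuditEntry[] = [
  { sessionId: "sess-a", operationId: "listZones", status: 200 },
  { sessionId: "sess-b", operationId: "getInvoice", status: 200 },
  { sessionId: "sess-a", operationId: "listDnsRecords", status: 200 },
  { sessionId: "sess-a", operationId: "createDnsRecord", status: 403 },
];

// Rebuild each workflow as an ordered list of operations per session.
function groupBySession(entries: AuditEntry[]): Map<string, string[]> {
  const workflows = new Map<string, string[]>();
  for (const entry of entries) {
    const ops = workflows.get(entry.sessionId) ?? [];
    ops.push(`${entry.operationId}:${entry.status}`);
    workflows.set(entry.sessionId, ops);
  }
  return workflows;
}

const workflows = groupBySession(logEntries);
console.log(workflows.get("sess-a"));
```

Here the `sess-a` workflow reads as list zones, list records, then a rejected create, which is the kind of sequence that is invisible when the same requests are viewed as unrelated log lines.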

## The broader trend: OpenAPI as the MCP interface layer

The search/execute pattern is part of a larger shift in how APIs expose
themselves to AI agents. Rather than building separate MCP servers that
duplicate API logic, teams are treating their existing OpenAPI specs as the
single source of truth and deriving MCP tool definitions from them.

This is exactly what tools like Speakeasy's `x-speakeasy-mcp` extension and
various open-source `openapi-mcp` generators are formalizing: the OpenAPI spec
defines the API surface, and the MCP layer is a view on top of it. The
search/execute pattern takes this one step further by making the spec queryable
at runtime rather than expanded into static tool definitions.

For teams already running an OpenAPI-first API gateway, the path to the
search/execute pattern is shorter than it looks. The spec is already there. The
routing and validation are already there. The per-route access control is
already there. What is new is the search interface and the sandboxed execution
layer — and both map cleanly onto capabilities that a programmable edge gateway
already provides.

## Getting started

The search/execute pattern gives large APIs a way to participate in the MCP
ecosystem without overwhelming agent context windows. If your API has a
well-maintained OpenAPI spec, you already have the foundation.

To build your first MCP server on Zuplo, start with the
[MCP Server Handler documentation](https://zuplo.com/docs/handlers/mcp-server).
For a step-by-step walkthrough, see
[Create an MCP Server from Your OpenAPI Spec](/learning-center/create-mcp-server-from-openapi).
To understand how API gateways, AI gateways, and MCP gateways fit together in a
modern AI infrastructure stack, read
[The Three Gates of AI Infrastructure](/learning-center/three-gates-ai-infrastructure-api-ai-mcp-gateway).
And for a comprehensive overview of MCP architecture and implementation options,
see [What Is an MCP Server?](/learning-center/what-is-an-mcp-server).