---
title: "Never Ship an MCP Server Without a Rate Limit"
description: "GitHub's MCP server inherits the 5,000/hour REST API limit, and that's the only thing standing between an agent and a suspended account. Most public MCP servers either have no limit, the wrong limit, or one so tight it breaks their own protocol. Put a real rate limit on every MCP route you publish."
canonicalUrl: "https://zuplo.com/blog/2026/05/18/never-ship-mcp-server-without-rate-limit"
pageType: "blog"
date: "2026-05-18"
authors: "nate"
tags: "rate-limiting, mcp, AI"
image: "https://zuplo.com/og?text=Never%20Ship%20an%20MCP%20Server%20Without%20a%20Rate%20Limit"
---
A few days ago I was using Claude with the GitHub MCP server connected, asking
it to look through a busy repo. About two minutes in, the tool started returning
errors. Claude paused, retried, retried again, then surfaced a
`403 rate limit exceeded` from GitHub. One conversation, mostly read-only, and I
had walked straight into a 5,000-request-per-hour ceiling.

The agent was reading issues, listing files, fetching some history: the kind of
thing a developer would do over a couple of hours, and an MCP-connected LLM does
in a couple of minutes. The ceiling I hit isn't an MCP-layer limit. It's
[GitHub's REST API rate limit](https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api),
and the GitHub MCP server is a thin proxy with no throttling of its own. Every
`tools/call` (the MCP method an agent uses to invoke a tool) quietly burns one
or more of those 5,000 requests. The only reason my account didn't end up worse
off is that the underlying API was already protecting itself.

That's the new shape of API consumption in 2026. An MCP server turns your API
into something an LLM can drive at full agent speed, and your best customers are
the ones running those agents. If you don't put a rate limit at the MCP edge,
the only thing protecting your service and your customers' accounts is whatever
quota your downstream happens to have. As I just demonstrated to myself, that's
not much.

<CalloutAudience
  variant="useIf"
  items={[
    `You run, or are about to ship, an MCP server in front of an API you own`,
    `You've had a customer's agent burn through their own rate limit or quota inside a single chat session`,
    `You want MCP traffic to fail safely without taking your direct API users down with it`,
  ]}
/>

## Public MCP servers are getting this wrong in three different ways

I went looking for how the major MCP servers in the wild handle rate limiting,
expecting a spread of sensible defaults. Instead, the ecosystem sits in one of
three failure modes.

**No MCP-layer limit at all.** GitHub's
[github-mcp-server](https://github.com/github/github-mcp-server) is the example
I hit. The remote-server docs cover toolsets, read-only modes, and lockdown
headers, but they have no rate-limit section. An open
[issue on the repo](https://github.com/github/github-mcp-server/issues/933)
captures the symptom from a real support escalation:

> who was getting api rate limits broken constantly and after working with them
> they see that the MCP extension for VSCode was consuming a lot of tokens for
> this user.

The attached rate-limit JSON shows the same user's usage jumping from 2,995 to
3,050 to 4,192 within seconds. Cloudflare's and Stripe's MCP servers sit in the
same category, with no documented MCP-layer limit, so every call goes straight
at the underlying API's quota.

**Limits tuned for human request shapes, saturated in seconds.** Sentry's MCP
server caps each authenticated user at 60 requests per 60 seconds.
[Issue #844](https://github.com/getsentry/sentry-mcp/issues/844) on that repo
reports that three or four parallel automation runs sharing one user bucket
"saturate the limit within seconds." Atlassian doesn't publish its MCP rate
limit, and an Atlassian community thread describes a user being
[blocked for thirty minutes after a query that returned 30 cards](https://community.atlassian.com/forums/Jira-questions/Atlassian-MCP-Server-Rate-Limits/qaq-p/3198188),
with no signal about how long to wait or how much budget remains. Notion is more
honest: 180 requests per minute general, 30 per minute for search,
[documented up front](https://developers.notion.com/docs/mcp-supported-tools).
None of these were chosen wrong on purpose. They were chosen for a request shape
that no longer exists.

**Overcorrection that breaks the protocol itself.** AWS Knowledge MCP throttles
to roughly one request per fifteen seconds per IP. The MCP spec requires a client
to send `initialize`, then `tools/list`, then any `tools/call`. As
[issue #2949](https://github.com/awslabs/mcp/issues/2949) records:

> Client sends `initialize` — 200 OK. Client immediately sends `tools/list` —
> 429 Too Many Requests. Client treats this as a connection failure and marks
> the server as unavailable.

The handshake itself exceeds the rate limit. The client can't tell whether the
server is throttled or just broken, so it gives up.
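
To make the arithmetic concrete, here is a minimal token-bucket sketch. It
assumes the limiter behaves like a bucket refilled at one token per fifteen
seconds, which is an assumption about the shape of the limit and not AWS's
actual implementation. `createTokenBucket` and `replayHandshake` are
hypothetical names. With a burst capacity of one, the second handshake message
is rejected; a capacity of three lets the whole handshake through while the
sustained rate stays the same.

```ts
interface Bucket {
  tryTake(nowMs: number): boolean;
}

// capacity = burst size, refillPerSec = sustained rate.
function createTokenBucket(
  capacity: number,
  refillPerSec: number,
  startMs: number,
): Bucket {
  let tokens = capacity; // start full so a new client can burst
  let last = startMs;
  return {
    // Returns true if a request arriving at `nowMs` is admitted.
    tryTake(nowMs: number): boolean {
      const elapsedSec = (nowMs - last) / 1000;
      tokens = Math.min(capacity, tokens + elapsedSec * refillPerSec);
      last = nowMs;
      if (tokens < 1) return false;
      tokens -= 1;
      return true;
    },
  };
}

// Replays initialize, tools/list, tools/call 100 ms apart against a bucket
// refilled at one token per fifteen seconds.
function replayHandshake(capacity: number): boolean[] {
  const bucket = createTokenBucket(capacity, 1 / 15, 0);
  return ["initialize", "tools/list", "tools/call"].map((_method, i) =>
    bucket.tryTake(i * 100),
  );
}

const noBurst = replayHandshake(1); // [true, false, false]
const withBurst = replayHandshake(3); // [true, true, true]
```

The design point is that burst capacity and sustained rate are separate knobs:
a limit can stay strict over time and still admit the few back-to-back requests
the protocol requires.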

Three failure modes, one root cause: a rate limit designed for human request
patterns in front of a request shape that no longer matches. For a deeper take
on why agent traffic breaks request-count limits in general, see
[Rate-limiting AI agents beyond request counts](/blog/rate-limit-ai-agents-beyond-request-counts).

## The MCP spec gives clients no way to back off

You might think a well-behaved client would slow down on a `429`. It can't,
because the spec hasn't told it how. The
[MCP Streamable HTTP transport spec](https://modelcontextprotocol.io/specification/2025-11-25/basic/transports)
is explicit about which status code belongs where: for an invalid `Origin`
header servers "**MUST** respond with HTTP 403 Forbidden", for a request without
a session id they "**SHOULD** respond... with HTTP 400 Bad Request", an unknown
session id gets "HTTP 404 Not Found", and a server that doesn't offer an SSE
stream returns "HTTP 405 Method Not Allowed". `429` is not in that list. There
is no
`Retry-After` convention, no token bucket headers, no error code in the JSON-RPC
envelope for "rate-limited, retry in N seconds." The SSE `retry` field exists
for reconnects, not for throttling.

The practical consequence is the one you'd expect: clients see a `429`, have no
specified behavior, and do something arbitrary. Some retry immediately. Some
give up. Some surface a confusing error. None of them implement the backoff a
thoughtful HTTP client would, because the spec never told them to. If your
gateway doesn't return clear, machine-readable rate-limit information, the
client won't learn to slow down, and your gateway gets to enjoy the same request
pattern again three seconds later.
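
For reference, the backoff a thoughtful HTTP client falls back on when the
server gives no hint is usually capped exponential backoff with jitter. The
sketch below is one common variant ("full jitter"), not anything the MCP spec
defines. `backoffDelayMs` and its default values are hypothetical, and the
random source is injectable so the behavior can be checked deterministically.

```ts
// Capped exponential backoff with full jitter.
// attempt: 0 for the first retry, 1 for the second, and so on.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 500,
  capMs: number = 30_000,
  random: () => number = Math.random,
): number {
  // The ceiling doubles with each attempt until it reaches the cap.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  // Full jitter: pick uniformly in [0, ceiling) so parallel agents spread out.
  return Math.floor(random() * ceiling);
}

// With a fixed random source of 0.5:
const first = backoffDelayMs(0, 500, 30_000, () => 0.5); // 250
const fourth = backoffDelayMs(3, 500, 30_000, () => 0.5); // 2000
const capped = backoffDelayMs(10, 500, 30_000, () => 0.5); // 15000
```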

The github-mcp-server issue above shows a user's REST quota climbing from 2,995
to 4,192 in seconds once an editor extension started driving the MCP server.
Sentry's #844 records three or four parallel agents on a single user bucket
saturating its 60/min limit immediately. The AWS Knowledge MCP handshake bug was
fixed in April 2026, but the underlying pattern, a 429 with no machine-readable
backoff signal, applies to any MCP route that doesn't return one. The protection
has to come from your edge.

## Rate-limit your MCP route by tier, from API key metadata

The point of a rate limit in front of an MCP server isn't to punish agent
traffic. It's to keep the API usable for the customer running the agent, and to
keep the customers who _aren't_ running agents from being collateral damage when
one of them does.

Zuplo's [MCP gateway](/blog/zuplo-mcp-gateway) makes the MCP server a route
handler. You add a route, attach the
[`mcpServerHandler`](https://zuplo.com/docs/handlers/mcp-server), and point its
`operations` array at OpenAPI operation IDs in your routes file. Each tool call
that arrives at the `/mcp` route is dispatched internally to the underlying
route, so the request runs through the policy chains on both routes in this
order:

1. Inbound policies on the `/mcp` route
2. Inbound policies on the tool's route
3. The tool's handler
4. Outbound policies on the tool's route
5. Outbound policies on the `/mcp` route

The rate limit goes in step 1, where it sees all MCP-driven traffic before any
tool dispatch.
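
The ordering above can be sketched as a toy model of two nested policy chains.
This is not Zuplo's implementation: `runRoute`, `simulateToolCall`, the
`/orders` route, and the policy names marked hypothetical are all illustrative,
and the model assumes a rejecting inbound policy simply stops the chain. It
shows why a limit in step 1 keeps a throttled call from ever reaching the
tool's route.

```ts
interface Policy {
  name: string;
  allow: boolean; // false models a policy that returns a Response early
}

interface RouteModel {
  name: string;
  inbound: Policy[];
  outbound: string[];
}

// Runs inbound policies, then `inner`, then outbound policies.
// Returns false as soon as an inbound policy rejects.
function runRoute(
  route: RouteModel,
  inner: () => void,
  trace: string[],
): boolean {
  for (const policy of route.inbound) {
    trace.push(`${route.name} inbound: ${policy.name}`);
    if (!policy.allow) return false;
  }
  inner();
  for (const name of route.outbound) {
    trace.push(`${route.name} outbound: ${name}`);
  }
  return true;
}

function simulateToolCall(rateLimitAllows: boolean): string[] {
  const trace: string[] = [];
  const mcpRoute: RouteModel = {
    name: "/mcp",
    inbound: [
      { name: "api-key-inbound", allow: true },
      { name: "mcp-rate-limit-inbound-policy", allow: rateLimitAllows },
    ],
    outbound: ["mcp-outbound (hypothetical)"],
  };
  const toolRoute: RouteModel = {
    name: "/orders",
    inbound: [{ name: "tool-inbound (hypothetical)", allow: true }],
    outbound: ["tool-outbound (hypothetical)"],
  };
  // The MCP handler dispatches internally to the tool's route.
  runRoute(
    mcpRoute,
    () => runRoute(toolRoute, () => trace.push("/orders handler"), trace),
    trace,
  );
  return trace;
}

const allowed = simulateToolCall(true); // six entries, steps 1 through 5
const throttled = simulateToolCall(false); // stops after the rate limiter
```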

A minimal `routes.oas.json` with the MCP handler and a rate-limit policy
reference:

```json
// config/routes.oas.json
{
  "paths": {
    "/mcp": {
      "post": {
        "x-zuplo-route": {
          "handler": {
            "export": "mcpServerHandler",
            "module": "$import(@zuplo/runtime)",
            "options": {
              "name": "my-mcp-server",
              "version": "1.0.0",
              "operations": [
                { "file": "./config/routes.oas.json", "id": "listOrders" },
                { "file": "./config/routes.oas.json", "id": "getOrder" },
                { "file": "./config/routes.oas.json", "id": "createOrder" }
              ]
            }
          },
          "policies": {
            "inbound": ["api-key-inbound", "mcp-rate-limit-inbound-policy"]
          }
        }
      }
    }
  }
}
```

The API key policy has to run first so the rate limiter has a consumer to
identify. After that, the rate-limit policy can read whatever metadata you've
stored on the key.

The interesting choice is what counts as "the right limit." A blanket
1,000-per-minute might be right for an enterprise team running a research agent.
The same number on a free-tier evaluator is ruinous. Drive the limit from a
`plan` field on the consumer's API key metadata, using `rate-limit-inbound`'s
`function` mode:

```json
// config/policies.json
{
  "name": "mcp-rate-limit-inbound-policy",
  "policyType": "rate-limit-inbound",
  "handler": {
    "export": "RateLimitInboundPolicy",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "rateLimitBy": "function",
      "requestsAllowed": 60,
      "timeWindowMinutes": 1,
      "headerMode": "retry-after",
      "identifier": {
        "module": "$import(./modules/mcp-rate-limit)",
        "export": "mcpRateLimit"
      }
    }
  }
}
```

Create a `modules/` directory at your project root. The API key policy puts the
consumer's id on `request.user.sub` and the key's metadata on
`request.user.data`. The function reads them and returns a per-tier ceiling:

```ts
// modules/mcp-rate-limit.ts
import {
  CustomRateLimitDetails,
  ZuploContext,
  ZuploRequest,
} from "@zuplo/runtime";

const PLAN_LIMITS: Record<string, number> = {
  enterprise: 600,
  pro: 120,
  free: 30,
};

export function mcpRateLimit(
  request: ZuploRequest,
  _context: ZuploContext,
  _policyName: string,
): CustomRateLimitDetails {
  const sub = request.user?.sub ?? "anonymous";
  const plan = request.user?.data?.plan ?? "free";

  return {
    key: sub,
    requestsAllowed: PLAN_LIMITS[plan] ?? PLAN_LIMITS.free,
    timeWindowMinutes: 1,
  };
}
```

A consumer's API key stores the plan in metadata:

```json
{
  "name": "acme-corp-key",
  "metadata": {
    "plan": "enterprise",
    "orgId": "org_acme"
  }
}
```

Free-tier consumers get 30 MCP calls per minute. Pro gets 120. Enterprise
gets 600. Raising a customer's ceiling is a metadata update on the key, not a
gateway redeploy. The same pattern extends to anything on the metadata object:
region, model family, whether the consumer has accepted a usage policy.
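
If you want the tier lookup to be unit-testable without the gateway runtime,
one option is to pull it into a pure function. The sketch below is a variation
on the function above, not part of Zuplo's API: `resolveRequestsAllowed` is a
hypothetical helper, and it assumes plan names should be matched
case-insensitively. It uses a `Map` so that an unexpected plan string such as
`"constructor"` can't resolve to a property inherited by a plain object.

```ts
const TIER_LIMITS = new Map<string, number>([
  ["enterprise", 600],
  ["pro", 120],
  ["free", 30],
]);

const DEFAULT_LIMIT = 30; // same ceiling as the free tier

// Accepts the key's metadata object and returns requests allowed per minute.
function resolveRequestsAllowed(metadata: unknown): number {
  const plan =
    typeof metadata === "object" && metadata !== null
      ? (metadata as Record<string, unknown>).plan
      : undefined;
  // Missing or non-string plans fall back to the most restrictive tier.
  if (typeof plan !== "string") return DEFAULT_LIMIT;
  return TIER_LIMITS.get(plan.trim().toLowerCase()) ?? DEFAULT_LIMIT;
}

const enterprise = resolveRequestsAllowed({ plan: "enterprise", orgId: "org_acme" }); // 600
const unknownPlan = resolveRequestsAllowed({ plan: "constructor" }); // 30
const noMetadata = resolveRequestsAllowed(undefined); // 30
```

Inside the rate-limit function, the call would take `request.user?.data` as its
argument and feed the result into `requestsAllowed`.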

<CalloutDoc
  title="Rate Limiting Policy Reference"
  description="Full reference for rate-limit-inbound, including the function mode used above for per-tier limits."
  href="https://zuplo.com/docs/policies/rate-limit-inbound"
  icon="book"
/>

<CalloutTip variant="tip">
  Keep `mode: "strict"` (the default) on MCP routes. Async mode lets some
  requests through over the limit while replication catches up across edge
  locations. That's fine on a chatty human API. On an MCP route where one agent
  can fire fifty calls in a second, async mode lets a burst clear before the
  counter has caught up.
</CalloutTip>

## Stop runaway loops before they trip the limit

The traffic shape that's hardest on your backend isn't 1,000 well-behaved
requests in an hour. It's the agent that gets stuck calling the same tool with
the same arguments hundreds of times in a minute because some upstream returned
a confusing error. We see this shape on Zuplo-hosted MCP routes too: a stuck
agent can fire dozens of identical `tools/call` payloads in a few seconds. A
per-minute rate limit catches it eventually, but "eventually" can be a minute of
pegged backend and several hundred failed retries.

Zuplo's
[programmable gateway](https://zuplo.com/docs/articles/custom-code-patterns)
lets you sit a small detector ahead of the rate-limit policy. It builds a
signature from the consumer, the tool name, and the serialized arguments, counts
how many times that signature appears in a short window, and short-circuits past
a threshold:

```ts
// modules/mcp-loop-breaker.ts
import {
  ZuploContext,
  ZuploRequest,
  HttpProblems,
  ZoneCache,
} from "@zuplo/runtime";

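const WINDOW_SECONDS = 30; // TTL for each repeat counter
const MAX_REPEATS = 10; // identical calls allowed before short-circuiting

export async function mcpLoopBreaker(
  request: ZuploRequest,
  context: ZuploContext,
): Promise<ZuploRequest | Response> {
  // Parse a clone so the original body stream stays unread for the MCP
  // handler. A body that isn't valid JSON resolves to null and passes through.
  const body = await request
    .clone()
    .json()
    .catch(() => null);
  // Only tool invocations are counted; every other MCP method passes through.
  if (body?.method !== "tools/call") return request;

  // One counter per consumer + tool name + serialized arguments.
  const consumer = request.user?.sub ?? "anonymous";
  const sig = `${body?.params?.name}:${JSON.stringify(body?.params?.arguments ?? {})}`;
  const cacheKey = `mcp-loop:${consumer}:${sig}`;

  // The read-then-write below is not atomic, so concurrent calls can
  // undercount. That is acceptable for a coarse loop detector.
  const cache = new ZoneCache<number>("mcp-loop-breaker", context);
  const next = ((await cache.get(cacheKey)) ?? 0) + 1;

  if (next > MAX_REPEATS) {
    context.log.warn(
      `MCP loop detected for ${consumer}: ${next} identical calls to ${body?.params?.name}`,
    );
    return HttpProblems.tooManyRequests(request, context, {
      detail:
        "The same tool is being called repeatedly with identical arguments. Check your agent's retry logic.",
    });
  }

  // Each put stores a fresh TTL, so the window restarts on every counted call.
  await cache.put(cacheKey, next, WINDOW_SECONDS);
  return request;
}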
```

A legitimate agent varies arguments between calls and never trips it. A stuck
agent trips it on the eleventh identical call, gets a clear `429` with a Problem
Details body, and stops eating the customer's per-minute budget on requests that
were never going to succeed. If you expose a polling tool that legitimately
takes identical arguments, exclude it by name or widen `MAX_REPEATS` for that
route.
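
One caveat with the detector's signature: `JSON.stringify` output depends on
key order, so `{"a":1,"b":2}` and `{"b":2,"a":1}` would count as different
calls. If your agents regenerate arguments between retries, a canonical
serializer closes that gap. The sketch below is one way to do it, assuming the
input came from `JSON.parse` (so it contains only JSON values).
`stableStringify` is a hypothetical helper, not a Zuplo or built-in function.

```ts
// Serializes a parsed-JSON value with object keys sorted at every level,
// so logically equal arguments always produce the same string.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    // Array order is meaningful, so it is preserved.
    return `[${value.map((item) => stableStringify(item)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`);
    return `{${entries.join(",")}}`;
  }
  // Strings, numbers, booleans, and null serialize as ordinary JSON.
  return JSON.stringify(value);
}

const left = stableStringify({ repo: "zuplo/docs", page: 1 });
const right = stableStringify({ page: 1, repo: "zuplo/docs" });
// left === right: '{"page":1,"repo":"zuplo/docs"}'
```

Swapping it in for `JSON.stringify` when building `sig` would make the counter
insensitive to key order while still treating different values as different
calls.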

Wire it as a `custom-code-inbound` policy ahead of the rate-limit policy on the
`/mcp` route:

```json
// config/policies.json
{
  "name": "mcp-loop-breaker-inbound-policy",
  "policyType": "custom-code-inbound",
  "handler": {
    "export": "mcpLoopBreaker",
    "module": "$import(./modules/mcp-loop-breaker)"
  }
}
```

The `/mcp` route's inbound array now reads:

```json
"inbound": [
  "api-key-inbound",
  "mcp-loop-breaker-inbound-policy",
  "mcp-rate-limit-inbound-policy"
]
```

<CalloutTip variant="mistake">
  Don't consume the request body in your policy. The MCP handler still needs to
  read it after the policy runs, so parse a copy with `request.clone().json()`
  and leave the original body stream for the handler.
</CalloutTip>

## Send back signals the client can actually use

The MCP spec doesn't standardize what `429` means, so it's on you to make your
response well-behaved. Zuplo's
[rate-limit-inbound](https://zuplo.com/docs/policies/rate-limit-inbound) policy
with `headerMode: "retry-after"` (the default) returns a `Retry-After` header on
every throttle, alongside a Problem Details JSON body. Both ride out unchanged
through the MCP handler:

```http
HTTP/1.1 429 Too Many Requests
Retry-After: 60
Content-Type: application/problem+json

{
  "type": "https://httpproblems.com/http-status/429",
  "title": "Too Many Requests",
  "detail": "You have exceeded the rate limit for MCP tool calls",
  "instance": "/mcp"
}
```

Most MCP clients today won't read either field correctly. That's their bug, and
the spec's. But the right machine-readable signal at your edge is the
prerequisite for any client to ever do the right thing, and it gives you a
defensible answer when a customer's agent gets cut off mid-task: point at the
`Retry-After` and say "this is how long to wait."
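
On the client side, honoring that header takes very little code. HTTP allows
`Retry-After` to carry either a number of seconds or an HTTP date, and the
sketch below handles both. `parseRetryAfterMs` is a hypothetical helper, not
part of any MCP SDK, and the clock is passed in so the result is deterministic.

```ts
// Converts a Retry-After header value into a wait time in milliseconds.
// Returns null when the header is missing or unparseable.
function parseRetryAfterMs(
  value: string | null,
  nowMs: number = Date.now(),
): number | null {
  if (value === null) return null;
  const trimmed = value.trim();
  // Form 1: delay in whole seconds, e.g. "60".
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const when = Date.parse(trimmed);
  if (Number.isNaN(when)) return null;
  // A date in the past means the client may retry immediately.
  return Math.max(0, when - nowMs);
}

const fromSeconds = parseRetryAfterMs("60", 0); // 60000
const fromDate = parseRetryAfterMs(
  "Wed, 21 Oct 2015 07:28:00 GMT",
  Date.UTC(2015, 9, 21, 7, 27, 0),
); // 60000
```

A client would wait that long before retrying, and fall back to its own
backoff policy when the function returns `null`.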

<CalloutDoc
  title="MCP Server Handler Reference"
  description="Full reference for mcpServerHandler, including the operations array, policy execution order, and how route policies compose with the handler."
  href="https://zuplo.com/docs/handlers/mcp-server"
  icon="book"
/>

## Why the rate limit belongs at the MCP edge, not the underlying API

It's tempting to say "we already rate-limit the underlying API, why double up?"
The rate limit on the underlying API is the wrong one for agent traffic. It was
sized for the request shape human customers produce. An MCP call multiplies that
shape, sometimes by a lot. One agent intent can produce ten tool calls in the
time it takes a developer to read one screen.

The `/mcp` route is the only place in your routing where you know the request
came from an LLM-driven client, so it's where the agent-shaped limit belongs.
Underlying-API limits stay where they are. The MCP route gets a tier-aware
policy that matches how agents actually call, a loop breaker for the worst case,
and a `Retry-After` so any client that knows how to read it can back off
correctly. For more on what agent traffic actually looks like at the gateway,
see
[what autonomous agents actually need from your APIs](/blog/what-autonomous-agents-actually-need-from-your-apis).

GitHub didn't have any of this when I hit the 5,000-per-hour ceiling, and the
only reason that ceiling held was that GitHub's REST API was already enforcing
it. Your API probably doesn't have GitHub's headroom. Don't ship an MCP server
without putting a rate limit at the edge first.