A few days ago I was using Claude with the GitHub MCP server connected, asking
it to look through a busy repo. About two minutes in, the tool started returning
errors. Claude paused, retried, retried again, then surfaced a
403 rate limit exceeded from GitHub. One conversation, mostly read-only, and I
had walked straight into a 5,000-request-per-hour ceiling.
The agent was reading issues, listing files, fetching some history: the kind of
thing a developer would do over a couple of hours, and an MCP-connected LLM does
in a couple of minutes. The ceiling I hit isn’t an MCP-layer limit. It’s
GitHub’s REST API rate limit,
and the GitHub MCP server is a thin proxy with no throttling of its own. Every
tools/call (the MCP method an agent uses to invoke a tool) quietly burns one
or more of those 5,000 requests. The only reason my account didn’t end up worse
is that the underlying API was already protecting itself.
That’s the new shape of API consumption in 2026. An MCP server turns your API into something an LLM can drive at full agent speed, and your best customers are the ones running those agents. If you don’t put a rate limit at the MCP edge, the only thing protecting your service and your customers’ accounts is whatever quota your downstream happens to have. As I just demonstrated to myself, that’s not much.
This post is for you if:
- You run, or are about to ship, an MCP server in front of an API you own
- You've had a customer's agent burn through their own rate limit or quota inside a single chat session
- You want MCP traffic to fail safely without taking your direct API users down with it
Public MCP servers are getting this wrong in three different ways
I went looking for how the major MCP servers in the wild handle rate limiting, expecting a spread of sensible defaults. Instead, the ecosystem sits in one of three failure modes.
No MCP-layer limit at all. GitHub’s github-mcp-server is the example I hit. The remote-server docs cover toolsets, read-only modes, and lockdown headers, but they have no rate-limit section. An open issue on the repo captures the symptom from a real support escalation:
who was getting api rate limits broken constantly and after working with them they see that the MCP extension for VSCode was consuming a lot of tokens for this user.
The attached rate-limit JSON shows the same user’s usage jumping from 2,995 to 3,050 to 4,192 within seconds. Cloudflare’s and Stripe’s MCP servers sit in the same category, with no documented MCP-layer limit, so every call goes straight at the underlying API’s quota.
Limits tuned for human request shapes, saturated in seconds. Sentry’s MCP server caps each authenticated user at 60 requests per 60 seconds. Issue #844 on that repo reports that three or four parallel automation runs sharing one user bucket “saturate the limit within seconds.” Atlassian doesn’t publish its MCP rate limit, and an Atlassian community thread describes a user being blocked for thirty minutes after a query that returned 30 cards, with no signal about how long to wait or how much budget remains. Notion is more honest: 180 requests per minute general, 30 per minute for search, documented up front. None of these were chosen wrong on purpose. They were chosen for a request shape that no longer exists.
Overcorrection that breaks the protocol itself. AWS Knowledge MCP throttles
roughly one request per fifteen seconds per IP. An MCP session has to open with
initialize, and clients typically follow it immediately with tools/list before
any tools/call. As issue #2949 records:
Client sends initialize — 200 OK. Client immediately sends tools/list — 429 Too Many Requests. Client treats this as a connection failure and marks the server as unavailable.
The handshake itself exceeds the rate limit. The client can’t tell whether the server is throttled or just broken, so it gives up.
Three failure modes, one root cause: a rate limit designed for human request patterns in front of a request shape that no longer matches. For a deeper take on why agent traffic breaks request-count limits in general, see Rate-limiting AI agents beyond request counts.
The MCP spec gives clients no way to back off
You might think a well-behaved client would slow down on a 429. It can’t,
because the spec hasn’t told it how. The
MCP Streamable HTTP transport spec
is explicit about which status code belongs where: an invalid Origin header
“MUST respond with HTTP 403 Forbidden”, a request without a session id
“MUST respond… with HTTP 400 Bad Request”, an unknown session id gets
“HTTP 404 Not Found”, and a server that doesn’t offer an SSE stream returns
“HTTP 405 Method Not Allowed”. 429 is not in that list. There is no
Retry-After convention, no token bucket headers, no error code in the JSON-RPC
envelope for “rate-limited, retry in N seconds.” The SSE retry field exists
for reconnects, not for throttling.
The practical consequence is the one you’d expect: clients see a 429, have no
specified behavior, and do something arbitrary. Some retry immediately. Some
give up. Some surface a confusing error. None of them implement the backoff a
thoughtful HTTP client would, because the spec never told them to. If your
gateway doesn’t return clear, machine-readable rate-limit information, the
client won’t learn to slow down, and your gateway gets to enjoy the same request
pattern again three seconds later.
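For contrast, here is roughly what the missing client-side behavior could look like. This is a minimal sketch, not anything the MCP spec defines: retryAfterMs and backoffDelayMs are hypothetical helper names, and the base delay and cap are arbitrary assumptions. It reads a standard HTTP Retry-After value (either delta-seconds or an HTTP-date) and falls back to capped exponential backoff when the header is missing or unreadable.

```typescript
// Hypothetical client-side helpers; nothing here is defined by the MCP spec.

/** Parse an HTTP Retry-After value (delta-seconds or HTTP-date) into milliseconds. */
function retryAfterMs(header: string | null, nowMs: number): number | null {
  if (header === null) return null;
  const value = header.trim();
  if (value === "") return null;

  // Form 1: a non-negative integer number of seconds, e.g. "30".
  if (/^\d+$/.test(value)) return Number(value) * 1000;

  // Form 2: an HTTP-date, e.g. "Wed, 21 Oct 2026 07:28:00 GMT".
  const dateMs = Date.parse(value);
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, dateMs - nowMs);
}

/** Delay before retry number `attempt` (0-based) after receiving a 429. */
function backoffDelayMs(
  header: string | null,
  attempt: number,
  nowMs: number = Date.now(),
  baseMs: number = 1_000, // assumed starting delay
  capMs: number = 60_000, // assumed upper bound for the fallback
): number {
  // A usable server hint always wins over the client's own guess.
  const hinted = retryAfterMs(header, nowMs);
  if (hinted !== null) return hinted;

  // No usable hint: exponential backoff, capped.
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

The fallback branch is the part a server cannot control, which is why sending the header matters: with it, the client waits exactly as long as the server asked; without it, the client is guessing.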
The github-mcp-server issue above shows a user’s REST quota climbing from 2,995 to 4,192 in seconds once an editor extension started driving the MCP server. Sentry’s #844 records three or four parallel agents on a single user bucket saturating its 60/min limit immediately. The AWS Knowledge MCP handshake bug was fixed in April 2026, but the underlying pattern, a 429 with no machine-readable backoff signal, applies to any MCP route that doesn’t return one. The protection has to come from your edge.
Rate-limit your MCP route by tier, from API key metadata
The point of a rate limit in front of an MCP server isn’t to punish agent traffic. It’s to keep the API usable for the customer running the agent, and to keep the customers who aren’t running agents from being collateral damage when one of them does.
Zuplo’s MCP gateway makes the MCP server a route
handler. You add a route, attach the
mcpServerHandler, and point its
operations array at OpenAPI operation IDs in your routes file. Each tool call
that arrives at the /mcp route is dispatched internally to the underlying
route, so the request runs through the policy chains on both routes in this
order:
- Inbound policies on the /mcp route
- Inbound policies on the tool’s route
- The tool’s handler
- Outbound policies on the tool’s route
- Outbound policies on the /mcp route
The rate limit goes in step 1, where it sees all MCP-driven traffic before any tool dispatch.
A minimal routes.oas.json with the MCP handler and a rate-limit policy
reference:
The API key policy has to run first so the rate limiter has a consumer to identify. After that, the rate-limit policy can read whatever metadata you’ve stored on the key.
The interesting choice is what counts as “the right limit.” A blanket
1,000-per-minute might be right for an enterprise team running a research agent.
The same number on a free-tier evaluator is ruinous. Drive the limit from a
plan field on the consumer’s API key metadata, using rate-limit-inbound’s
function mode:
Create a modules/ directory at your project root. The API key policy puts the
consumer’s id on request.user.sub and the key’s metadata on
request.user.data. The function reads them and returns a per-tier ceiling:
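A minimal sketch of such a function follows. The three ceilings (30, 120, and 600 per minute) are the tiers this post uses; everything else is an assumption made so the block stands alone. RequestLike and RateLimitDetails are local stand-ins rather than imports from @zuplo/runtime, the name rateLimitByPlan is hypothetical, and the returned shape (key, requestsAllowed, timeWindowMinutes) should be checked against the rate-limit-inbound reference before use.

```typescript
// Local stand-ins (assumptions) for the runtime's request and return types,
// so this sketch compiles without the gateway SDK.
interface RequestLike {
  user?: { sub: string; data?: { plan?: string } };
}

interface RateLimitDetails {
  key: string; // bucket identifier: one bucket per consumer
  requestsAllowed: number;
  timeWindowMinutes: number;
}

// Per-minute ceilings by plan, taken from the tiers described in this post.
const FREE_LIMIT = 30;
const PLAN_LIMITS = new Map<string, number>([
  ["free", FREE_LIMIT],
  ["pro", 120],
  ["enterprise", 600],
]);

// Hypothetical function name; the policy's function mode would point at it.
function rateLimitByPlan(request: RequestLike): RateLimitDetails {
  const user = request.user;
  // Missing or unrecognized plans fall back to the most restrictive tier.
  const plan = user?.data?.plan ?? "free";
  const requestsAllowed = PLAN_LIMITS.get(plan) ?? FREE_LIMIT;
  return {
    // The API key policy is expected to have set user.sub already;
    // "anonymous" is only a defensive fallback.
    key: user?.sub ?? "anonymous",
    requestsAllowed,
    timeWindowMinutes: 1,
  };
}
```

Falling back to the free tier on an unknown plan is a deliberate choice in this sketch: a typo in key metadata then produces a tighter limit rather than an unlimited one.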
A consumer’s API key stores the plan in metadata:
Free-tier consumers get 30 MCP calls per minute. Pro gets 120. Enterprise gets 600. Raising a customer’s ceiling is a metadata update on the key, not a gateway redeploy. The same pattern extends to anything on the metadata object: region, model family, whether the consumer has accepted a usage policy.
Rate Limiting Policy Reference
Full reference for rate-limit-inbound, including the function mode used above for per-tier limits.
Pro tip:
Keep mode: "strict" (the default) on MCP routes. Async mode lets some
requests through over the limit while replication catches up across edge
locations. That’s fine on a chatty human API. On an MCP route where one agent
can fire fifty calls in a second, async mode lets a burst clear before the
counter has caught up.
Stop runaway loops before they trip the limit
The traffic shape that’s hardest on your backend isn’t 1,000 well-behaved
requests in an hour. It’s the agent that gets stuck calling the same tool with
the same arguments hundreds of times in a minute because some upstream returned
a confusing error. We see this shape on Zuplo-hosted MCP routes too: a stuck
agent can fire dozens of identical tools/call payloads in a few seconds before
any per-minute limit catches up. A per-minute rate limit catches it eventually,
but “eventually” can be a minute of pegged backend and several hundred failed
retries.
Zuplo’s programmable gateway lets you sit a small detector ahead of the rate-limit policy. It hashes the consumer plus the tool name plus the arguments, counts how many times the same hash hits in a short window, and short-circuits past a threshold:
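The core of such a detector can be sketched as below. Treat it as an illustration under stated assumptions rather than the production policy: the counters live in an in-memory Map (a real gateway would need a store shared across instances), the hash is a simple 32-bit FNV-1a rather than a cryptographic digest, WINDOW_MS is an assumed ten-second window, and the names fingerprint and isRunawayLoop are hypothetical. MAX_REPEATS is 10, so the eleventh identical call inside a window is the one that trips it.

```typescript
const MAX_REPEATS = 10; // identical calls tolerated per window
const WINDOW_MS = 10_000; // assumed window length

// In-memory counters, for illustration only: entries are never evicted here,
// and a real gateway needs a store shared across instances.
const seen = new Map<string, { count: number; windowStart: number }>();

// Serialize with sorted object keys so {a, b} and {b, a} produce the same text.
function canonical(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonical).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const parts = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonical(obj[k])}`);
    return `{${parts.join(",")}}`;
  }
  return String(JSON.stringify(value));
}

// 32-bit FNV-1a. Small and dependency-free, but collisions are possible;
// a production detector would likely use a stronger digest.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16);
}

function fingerprint(consumer: string, tool: string, args: unknown): string {
  return fnv1a(`${consumer}\n${tool}\n${canonical(args)}`);
}

// Returns true when this call should be short-circuited with a 429.
function isRunawayLoop(consumer: string, tool: string, args: unknown, nowMs: number): boolean {
  const key = fingerprint(consumer, tool, args);
  const entry = seen.get(key);
  // First sighting, or the previous window has expired: start a new window.
  if (entry === undefined || nowMs - entry.windowStart >= WINDOW_MS) {
    seen.set(key, { count: 1, windowStart: nowMs });
    return false;
  }
  entry.count += 1;
  return entry.count > MAX_REPEATS;
}
```

Inside a policy, the consumer would come from the authenticated user, and the tool name and arguments from the parsed tools/call payload.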
A legitimate agent varies arguments between calls and never trips it. A stuck
agent trips it on the eleventh identical call, gets a clear 429 with a Problem
Details body, and stops eating the customer’s per-minute budget on requests that
were never going to succeed. If you expose a polling tool that legitimately
takes identical arguments, exclude it by name or widen MAX_REPEATS for that
route.
Wire it as a custom-code-inbound policy ahead of the rate-limit policy on the
/mcp route:
The /mcp route’s inbound array now reads:
Common mistake:
Don’t consume the request body in your policy. A request body stream can only
be read once, and the MCP handler still needs to read it after your policy
runs. Parsing a copy with request.clone().json() leaves the original body
stream intact for the handler.
Send back signals the client can actually use
The MCP spec doesn’t standardize what 429 means, so it’s on you to make your
response well-behaved. Zuplo’s
rate-limit-inbound policy
with headerMode: "retry-after" (the default) returns a Retry-After header on
every throttle, alongside a Problem Details JSON body. Both ride out unchanged
through the MCP handler:
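To illustrate what that pair of signals looks like, the sketch below assembles a comparable response by hand. The helper name tooManyRequests and the SimpleResponse type are hypothetical, and the body fields follow the general Problem Details format (RFC 9457) rather than the policy's exact output, which may differ in fields and wording.

```typescript
// Plain-object stand-in for an HTTP response, so the sketch has no runtime dependencies.
interface SimpleResponse {
  status: number;
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: builds a 429 that carries both machine-readable signals.
function tooManyRequests(retryAfterSeconds: number, instance: string): SimpleResponse {
  // Retry-After takes whole seconds; round up so clients never retry early.
  const seconds = Math.max(1, Math.ceil(retryAfterSeconds));
  const problem = {
    type: "about:blank", // Problem Details default when no specific type URI is defined
    title: "Too Many Requests",
    status: 429,
    detail: `Rate limit exceeded. Retry after ${seconds} seconds.`,
    instance, // the route that was throttled, e.g. "/mcp"
  };
  return {
    status: 429,
    headers: {
      "Content-Type": "application/problem+json",
      "Retry-After": String(seconds),
    },
    body: JSON.stringify(problem),
  };
}
```

The header is for generic HTTP clients and retry middleware; the body is for whatever surfaces the error to a person or a model. Sending both costs nothing and covers either kind of consumer.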
Most MCP clients today won’t read either field correctly. That’s their bug, and
the spec’s. But the right machine-readable signal at your edge is the
prerequisite for any client to ever do the right thing, and it gives you a
defensible answer when a customer’s agent gets cut off mid-task: point at the
Retry-After and say “this is how long to wait.”
MCP Server Handler Reference
Full reference for mcpServerHandler, including the operations array, policy execution order, and how route policies compose with the handler.
Why the rate limit belongs at the MCP edge, not the underlying API
It’s tempting to say “we already rate-limit the underlying API, why double up?” The rate limit on the underlying API is the wrong one for agent traffic. It was sized for the request shape human customers produce. An MCP call multiplies that shape, sometimes by a lot. One agent intent can produce ten tool calls in the time it takes a developer to read one screen.
The /mcp route is the only place in your routing where you know the request
came from an LLM-driven client, so it’s where the agent-shaped limit belongs.
Underlying-API limits stay where they are. The MCP route gets a tier-aware
policy that matches how agents actually call, a loop breaker for the worst-case,
and a Retry-After so any client that knows how to read it can back off
correctly. For more on what agent traffic actually looks like at the gateway,
see
what autonomous agents actually need from your APIs.
GitHub didn’t have any of this when I hit the 5,000-per-hour ceiling, and the only reason that ceiling held was that GitHub’s REST API was already enforcing it. Your API probably doesn’t have GitHub’s headroom. Don’t ship an MCP server without putting a rate limit at the edge first.