Never Ship an MCP Server Without a Rate Limit

A few days ago I was using Claude with the GitHub MCP server connected, asking it to look through a busy repo. About two minutes in, the tool started returning errors. Claude paused, retried, retried again, then surfaced a 403 “rate limit exceeded” error from GitHub. One conversation, mostly read-only, and I had walked straight into a 5,000-request-per-hour ceiling.

The agent was reading issues, listing files, fetching some history: the kind of thing a developer would do over a couple of hours, and an MCP-connected LLM does in a couple of minutes. The ceiling I hit isn’t an MCP-layer limit. It’s GitHub’s REST API rate limit, and the GitHub MCP server is a thin proxy with no throttling of its own. Every tools/call (the MCP method an agent uses to invoke a tool) quietly burns one or more of those 5,000 requests. The only reason my account didn’t end up worse is that the underlying API was already protecting itself.

That’s the new shape of API consumption in 2026. An MCP server turns your API into something an LLM can drive at full agent speed, and your best customers are the ones running those agents. If you don’t put a rate limit at the MCP edge, the only thing protecting your service and your customers’ accounts is whatever quota your downstream happens to have. As I just demonstrated to myself, that’s not much.

Use this approach if:

You run, or are about to ship, an MCP server in front of an API you own
You've had a customer's agent burn through their own rate limit or quota inside a single chat session
You want MCP traffic to fail safely without taking your direct API users down with it

Public MCP servers are getting this wrong in three different ways

I went looking for how the major MCP servers in the wild handle rate limiting, expecting a spread of sensible defaults. Instead, the ecosystem sits in one of three failure modes.

No MCP-layer limit at all. GitHub’s github-mcp-server is the example I hit. The remote-server docs cover toolsets, read-only modes, and lockdown headers, but have no rate-limit section. An open issue on the repo captures the symptom from a real support escalation:

who was getting api rate limits broken constantly and after working with them they see that the MCP extension for VSCode was consuming a lot of tokens for this user.

The attached rate-limit JSON shows the same user’s usage jumping from 2,995 to 3,050 to 4,192 within seconds. Cloudflare’s and Stripe’s MCP servers sit in the same category, with no documented MCP-layer limit, so every call goes straight at the underlying API’s quota.

Limits tuned for human request shapes, saturated in seconds. Sentry’s MCP server caps each authenticated user at 60 requests per 60 seconds. Issue #844 on that repo reports that three or four parallel automation runs sharing one user bucket “saturate the limit within seconds.” Atlassian doesn’t publish its MCP rate limit, and an Atlassian community thread describes a user being blocked for thirty minutes after a query that returned 30 cards, with no signal about how long to wait or how much budget remains. Notion is more honest: 180 requests per minute general, 30 per minute for search, documented up front. None of these were chosen wrong on purpose. They were chosen for a request shape that no longer exists.

Overcorrection that breaks the protocol itself. AWS Knowledge MCP throttles to roughly one request per fifteen seconds per IP. A typical MCP client sends initialize, then tools/list, then any tools/call, back to back. As issue #2949 records:

Client sends initialize — 200 OK. Client immediately sends tools/list — 429 Too Many Requests. Client treats this as a connection failure and marks the server as unavailable.

The handshake itself exceeds the rate limit. The client can’t tell whether the server is throttled or just broken, so it gives up.
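To make the handshake failure concrete, here is a toy token bucket, written for illustration only. It is not how AWS or any other server implements throttling; the class, the 50 ms spacing between requests, and the burst sizes are assumptions. It replays three back-to-back handshake requests against a refill rate of one token per fifteen seconds.

```typescript
// Toy token bucket with an injectable clock so the result is deterministic.
// Illustrative only: names and numbers are assumptions, not any vendor's code.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number, startMs: number) {
    this.capacity = capacity; // maximum burst size
    this.refillPerSecond = refillPerSecond; // sustained rate
    this.tokens = capacity; // start full
    this.lastRefillMs = startMs;
  }

  // Returns true if a request arriving at nowMs is admitted.
  tryConsume(nowMs: number): boolean {
    const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond,
    );
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// Replays initialize, tools/list, and a first tools/call, 50 ms apart.
function replayHandshake(burstCapacity: number): boolean[] {
  const bucket = new TokenBucket(burstCapacity, 1 / 15, 0);
  return [0, 50, 100].map((arrivalMs) => bucket.tryConsume(arrivalMs));
}

const noBurst = replayHandshake(1); // [true, false, false]
const withBurst = replayHandshake(3); // [true, true, true]
```

With a burst capacity of one, the second request is rejected 50 ms after the first, which matches the failure in the issue. Allowing a burst of at least three keeps the same sustained rate while letting the handshake complete.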

Three failure modes, one root cause: a rate limit designed for human request patterns in front of a request shape that no longer matches. For a deeper take on why agent traffic breaks request-count limits in general, see Rate-limiting AI agents beyond request counts.

The MCP spec gives clients no way to back off

You might think a well-behaved client would slow down on a 429. It can’t, because the spec hasn’t told it how. The MCP Streamable HTTP transport spec is explicit about which status code belongs where: for an invalid Origin header the server “MUST respond with HTTP 403 Forbidden”, for a request without a session id it “MUST respond… with HTTP 400 Bad Request”, an unknown session id gets “HTTP 404 Not Found”, and a server that doesn’t offer an SSE stream returns “HTTP 405 Method Not Allowed”. 429 is not in that list. There is no Retry-After convention, no token bucket headers, no error code in the JSON-RPC envelope for “rate-limited, retry in N seconds.” The SSE retry field exists for reconnects, not for throttling.

The practical consequence is the one you’d expect: clients see a 429, have no specified behavior, and do something arbitrary. Some retry immediately. Some give up. Some surface a confusing error. None of them implement the backoff a thoughtful HTTP client would, because the spec never told them to. If your gateway doesn’t return clear, machine-readable rate-limit information, the client won’t learn to slow down, and your gateway gets to enjoy the same request pattern again three seconds later.
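For contrast, this is roughly what a cooperative client could do if the server does send a Retry-After header. It is a sketch, not something the MCP spec defines: the function names, the one-second base, and the sixty-second cap are assumptions. It honours the header when present (either delay-seconds or an HTTP-date) and falls back to capped exponential backoff when it is missing.

```typescript
// Parses a Retry-After value into milliseconds, or null if it is unusable.
// The header carries either delay-seconds ("60") or an HTTP-date.
function parseRetryAfterMs(
  headerValue: string | null,
  nowMs: number,
): number | null {
  if (headerValue === null) return null;
  const trimmed = headerValue.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, dateMs - nowMs); // a date in the past means "retry now"
}

// Picks the wait before retry number `attempt` (0-based).
// BASE_MS and CAP_MS are illustrative defaults, not values from any spec.
function nextDelayMs(
  headerValue: string | null,
  attempt: number,
  nowMs: number,
): number {
  const fromServer = parseRetryAfterMs(headerValue, nowMs);
  if (fromServer !== null) return fromServer;
  const BASE_MS = 1000;
  const CAP_MS = 60000;
  return Math.min(CAP_MS, BASE_MS * 2 ** attempt);
}
```

A production client would usually add random jitter to the fallback delay so that parallel agents do not retry in lockstep; it is left out here to keep the sketch deterministic.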

The github-mcp-server issue above shows a user’s REST quota climbing from 2,995 to 4,192 in seconds once an editor extension started driving the MCP server. Sentry’s #844 records three or four parallel agents on a single user bucket saturating its 60/min limit immediately. The AWS Knowledge MCP handshake bug was fixed in April 2026, but the underlying pattern, a 429 with no machine-readable backoff signal, applies to any MCP route that doesn’t return one. The protection has to come from your edge.

Rate-limit your MCP route by tier, from API key metadata

The point of a rate limit in front of an MCP server isn’t to punish agent traffic. It’s to keep the API usable for the customer running the agent, and to keep the customers who aren’t running agents from being collateral damage when one of them does.

Zuplo’s MCP gateway makes the MCP server a route handler. You add a route, attach the mcpServerHandler, and point its operations array at OpenAPI operation IDs in your routes file. Each tool call that arrives at the /mcp route is dispatched internally to the underlying route, so the request runs through the policy chains on both routes in this order:

1. Inbound policies on the /mcp route
2. Inbound policies on the tool’s route
3. The tool’s handler
4. Outbound policies on the tool’s route
5. Outbound policies on the /mcp route

The rate limit goes in step 1, where it sees all MCP-driven traffic before any tool dispatch.
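The ordering above can be modelled in a few lines. This is a toy trace, not Zuplo code: the function and the stage labels are made up for illustration. It only shows why an early rejection is cheap, and it deliberately does not model whether any outbound policies run after a short-circuit.

```typescript
// Toy model of the dispatch order described above. Stage labels are
// hypothetical; this is not part of any gateway runtime.
function traceToolCall(mcpInboundRejects: boolean): string[] {
  const trace: string[] = ["inbound:/mcp"];
  if (mcpInboundRejects) {
    // A throttle in the /mcp inbound chain ends the request here, so the
    // tool route's policies and handler are never reached.
    trace.push("response:429");
    return trace;
  }
  trace.push(
    "inbound:tool-route",
    "handler:tool-route",
    "outbound:tool-route",
    "outbound:/mcp",
  );
  return trace;
}

const admitted = traceToolCall(false);
const throttled = traceToolCall(true);
```

In the throttled case the trace stops before any tool-route stage, which is the point of putting the limit first: a rejected call costs the backend nothing.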

A minimal routes.oas.json with the MCP handler and a rate-limit policy reference:

// config/routes.oas.json
{
  "paths": {
    "/mcp": {
      "post": {
        "x-zuplo-route": {
          "handler": {
            "export": "mcpServerHandler",
            "module": "$import(@zuplo/runtime)",
            "options": {
              "name": "my-mcp-server",
              "version": "1.0.0",
              "operations": [
                { "file": "./config/routes.oas.json", "id": "listOrders" },
                { "file": "./config/routes.oas.json", "id": "getOrder" },
                { "file": "./config/routes.oas.json", "id": "createOrder" }
              ]
            }
          },
          "policies": {
            "inbound": ["api-key-inbound", "mcp-rate-limit-inbound-policy"]
          }
        }
      }
    }
  }
}

The API key policy has to run first so the rate limiter has a consumer to identify. After that, the rate-limit policy can read whatever metadata you’ve stored on the key.

The interesting choice is what counts as “the right limit.” A blanket 1,000-per-minute might be right for an enterprise team running a research agent. The same number on a free-tier evaluator is ruinous. Drive the limit from a plan field on the consumer’s API key metadata, using rate-limit-inbound’s function mode:

// config/policies.json
{
  "name": "mcp-rate-limit-inbound-policy",
  "policyType": "rate-limit-inbound",
  "handler": {
    "export": "RateLimitInboundPolicy",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "rateLimitBy": "function",
      "requestsAllowed": 60,
      "timeWindowMinutes": 1,
      "headerMode": "retry-after",
      "identifier": {
        "module": "$import(./modules/mcp-rate-limit)",
        "export": "mcpRateLimit"
      }
    }
  }
}

Create a modules/ directory at your project root. The API key policy puts the consumer’s id on request.user.sub and the key’s metadata on request.user.data. The function reads them and returns a per-tier ceiling:

// modules/mcp-rate-limit.ts
import {
  CustomRateLimitDetails,
  ZuploContext,
  ZuploRequest,
} from "@zuplo/runtime";

const PLAN_LIMITS: Record<string, number> = {
  enterprise: 600,
  pro: 120,
  free: 30,
};

export function mcpRateLimit(
  request: ZuploRequest,
  _context: ZuploContext,
  _policyName: string,
): CustomRateLimitDetails {
  const sub = request.user?.sub ?? "anonymous";
  const plan = request.user?.data?.plan ?? "free";

  return {
    key: sub,
    requestsAllowed: PLAN_LIMITS[plan] ?? PLAN_LIMITS.free,
    timeWindowMinutes: 1,
  };
}

A consumer’s API key stores the plan in metadata:

{
  "name": "acme-corp-key",
  "metadata": {
    "plan": "enterprise",
    "orgId": "org_acme"
  }
}

Free-tier consumers get 30 MCP calls per minute. Pro gets 120. Enterprise gets 600. Raising a customer’s ceiling is a metadata update on the key, not a gateway redeploy. The same pattern extends to anything on the metadata object: region, model family, whether the consumer has accepted a usage policy.
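One edge case worth a guard: a plain-object lookup such as PLAN_LIMITS[plan] also resolves inherited property names, so a plan value of "constructor" would return a function rather than a number and skip the ?? fallback. A minimal sketch of a stricter lookup follows, assuming the same three tiers. The names resolveLimit and TIER_LIMITS and the case-insensitive match are illustrative choices, not part of Zuplo’s API.

```typescript
// Hypothetical tier table mirroring the limits above. A Map has no
// inherited keys, so only the listed plans can match.
const TIER_LIMITS = new Map<string, number>([
  ["enterprise", 600],
  ["pro", 120],
  ["free", 30],
]);
const FALLBACK_LIMIT = 30; // same ceiling as the free tier

// Resolves a per-minute ceiling from untyped API key metadata.
function resolveLimit(metadata: unknown): number {
  if (typeof metadata !== "object" || metadata === null) return FALLBACK_LIMIT;
  const plan = (metadata as Record<string, unknown>).plan;
  if (typeof plan !== "string") return FALLBACK_LIMIT;
  // Lowercasing is an assumption about how plan names are stored.
  const limit = TIER_LIMITS.get(plan.toLowerCase());
  return limit === undefined ? FALLBACK_LIMIT : limit;
}
```

Inside the rate-limit function, the same idea would replace the PLAN_LIMITS[plan] expression with a call like resolveLimit(request.user?.data).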

Rate Limiting Policy Reference

Full reference for rate-limit-inbound, including the function mode used above for per-tier limits.

Pro tip:

Keep mode: "strict" (the default) on MCP routes. Async mode lets some requests through over the limit while replication catches up across edge locations. That’s fine on a chatty human API. On an MCP route where one agent can fire fifty calls in a second, async mode lets a burst clear before the counter has caught up.

Stop runaway loops before they trip the limit

The traffic shape that’s hardest on your backend isn’t 1,000 well-behaved requests in an hour. It’s the agent that gets stuck calling the same tool with the same arguments hundreds of times in a minute because some upstream returned a confusing error. We see this shape on Zuplo-hosted MCP routes too: a stuck agent can fire dozens of identical tools/call payloads in a few seconds before any per-minute limit catches up. A per-minute rate limit catches it eventually, but “eventually” can be a minute of pegged backend and several hundred failed retries.

Zuplo’s programmable gateway lets you sit a small detector ahead of the rate-limit policy. It builds a signature from the consumer, the tool name, and the arguments, counts how many times the same signature arrives in a short window, and short-circuits past a threshold:

// modules/mcp-loop-breaker.ts
import {
  ZuploContext,
  ZuploRequest,
  HttpProblems,
  ZoneCache,
} from "@zuplo/runtime";

const WINDOW_SECONDS = 30;
const MAX_REPEATS = 10;

export async function mcpLoopBreaker(
  request: ZuploRequest,
  context: ZuploContext,
): Promise<ZuploRequest | Response> {
  const body = await request
    .clone()
    .json()
    .catch(() => null);
  if (body?.method !== "tools/call") return request;

  const consumer = request.user?.sub ?? "anonymous";
  const sig = `${body?.params?.name}:${JSON.stringify(body?.params?.arguments ?? {})}`;
  const cacheKey = `mcp-loop:${consumer}:${sig}`;

  const cache = new ZoneCache<number>("mcp-loop-breaker", context);
  const next = ((await cache.get(cacheKey)) ?? 0) + 1;

  if (next > MAX_REPEATS) {
    context.log.warn(
      `MCP loop detected for ${consumer}: ${next} identical calls to ${body?.params?.name}`,
    );
    return HttpProblems.tooManyRequests(request, context, {
      detail:
        "The same tool is being called repeatedly with identical arguments. Check your agent's retry logic.",
    });
  }

  await cache.put(cacheKey, next, WINDOW_SECONDS);
  return request;
}

A legitimate agent varies arguments between calls and never trips it. A stuck agent trips it on the eleventh identical call, gets a clear 429 with a Problem Details body, and stops eating the customer’s per-minute budget on requests that were never going to succeed. If you expose a polling tool that legitimately takes identical arguments, exclude it by name or widen MAX_REPEATS for that route.
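One refinement to consider: the detector above builds its signature with JSON.stringify, which is sensitive to key order, so {"id": 1, "expand": true} and {"expand": true, "id": 1} count as different calls. The sketch below shows a key-order-independent signature. The helper names are hypothetical and it assumes the arguments are plain JSON values.

```typescript
// Serialises a JSON-like value with object keys sorted, so two argument
// objects that differ only in key order produce the same string.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map((item) => stableStringify(item)).join(",") + "]";
  }
  if (typeof value === "object" && value !== null) {
    const record = value as Record<string, unknown>;
    const parts = Object.keys(record)
      .sort()
      .map((key) => JSON.stringify(key) + ":" + stableStringify(record[key]));
    return "{" + parts.join(",") + "}";
  }
  // Primitives: defer to JSON.stringify; undefined is written as null.
  const encoded: string | undefined = JSON.stringify(value);
  return encoded === undefined ? "null" : encoded;
}

// Builds the cache key for one consumer + tool + arguments combination.
function loopSignature(
  consumer: string,
  toolName: string,
  args: unknown,
): string {
  const normalizedArgs = args === undefined || args === null ? {} : args;
  return `mcp-loop:${consumer}:${toolName}:${stableStringify(normalizedArgs)}`;
}
```

If tool arguments can be large, the resulting string could be digested (for example with SHA-256) before it is used as a cache key, so the key length stays bounded.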

Wire it as a custom-code-inbound policy ahead of the rate-limit policy on the /mcp route:

// config/policies.json
{
  "name": "mcp-loop-breaker-inbound-policy",
  "policyType": "custom-code-inbound",
  "handler": {
    "export": "mcpLoopBreaker",
    "module": "$import(./modules/mcp-loop-breaker)"
  }
}

The /mcp route’s inbound array now reads:

"inbound": [
  "api-key-inbound",
  "mcp-loop-breaker-inbound-policy",
  "mcp-rate-limit-inbound-policy"
]

Common mistake:

Don’t consume the request body directly in the policy. The MCP handler needs to read it after your policy runs, and a body stream can only be read once. request.clone().json() reads a copy and keeps the original body stream available for the handler.

Send back signals the client can actually use

The MCP spec doesn’t standardize what 429 means, so it’s on you to make your response well-behaved. Zuplo’s rate-limit-inbound policy with headerMode: "retry-after" (the default) returns a Retry-After header on every throttle, alongside a Problem Details JSON body. Both ride out unchanged through the MCP handler:

HTTP/1.1 429 Too Many Requests
Retry-After: 60
Content-Type: application/problem+json

{
  "type": "https://httpproblems.com/http-status/429",
  "title": "Too Many Requests",
  "detail": "You have exceeded the rate limit for MCP tool calls",
  "instance": "/mcp"
}

Most MCP clients today won’t read either field correctly. That’s their bug, and the spec’s. But sending the right machine-readable signal from your edge is the prerequisite for any client to ever do the right thing, and it gives you a defensible answer when a customer’s agent gets cut off mid-task: point at the Retry-After and say “this is how long to wait.”

MCP Server Handler Reference

Full reference for mcpServerHandler, including the operations array, policy execution order, and how route policies compose with the handler.

Why the rate limit belongs at the MCP edge, not the underlying API

It’s tempting to say “we already rate-limit the underlying API, why double up?” The rate limit on the underlying API is the wrong one for agent traffic. It was sized for the request shape human customers produce. An MCP call multiplies that shape, sometimes by a lot. One agent intent can produce ten tool calls in the time it takes a developer to read one screen.
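The multiplication can be turned into a rough sizing rule. The function below is a back-of-the-envelope sketch, not a recommendation. Its name and all three inputs (the downstream quota, the worst-case number of downstream requests per tool call, and the share of the quota you are willing to give to agents) are assumptions you would replace with your own numbers.

```typescript
// Derives a per-minute MCP-edge ceiling that keeps agent traffic inside a
// chosen share of a downstream hourly quota. All inputs are assumptions.
function mcpLimitPerMinute(
  downstreamQuotaPerHour: number, // e.g. 5000 for an hourly API quota
  downstreamCallsPerToolCall: number, // worst-case fan-out of one tools/call
  shareForAgents: number, // fraction of the quota agents may use, 0..1
): number {
  const agentBudgetPerMinute = (downstreamQuotaPerHour * shareForAgents) / 60;
  return Math.floor(agentBudgetPerMinute / downstreamCallsPerToolCall);
}

// 5,000/hour quota, two downstream requests per tool call, half the
// quota reserved for agents: 2,500 / 60 / 2, rounded down.
const exampleLimit = mcpLimitPerMinute(5000, 2, 0.5); // 20
```

With those example inputs the edge limit comes out at 20 tool calls per minute, which leaves the other half of the hourly quota for direct API use.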

The /mcp route is the only place in your routing where you know the request came from an LLM-driven client, so it’s where the agent-shaped limit belongs. Underlying-API limits stay where they are. The MCP route gets a tier-aware policy that matches how agents actually call, a loop breaker for the worst-case, and a Retry-After so any client that knows how to read it can back off correctly. For more on what agent traffic actually looks like at the gateway, see what autonomous agents actually need from your APIs.

The GitHub MCP server didn’t have any of this when I hit the 5,000-per-hour ceiling, and the only reason that ceiling held was that GitHub’s REST API was already enforcing it. Your API probably doesn’t have GitHub’s headroom. Don’t ship an MCP server without putting a rate limit at the edge first.