A request count is a terrible proxy for LLM cost. Two calls to the same endpoint can differ by three orders of magnitude in tokens, dollars, and latency. One might be a 30-token classifier ping. The next might ship a 40,000-token document plus tool definitions and ask for a long structured response. A 60-RPM cap treats them as equal, and the heavy user empties your provider budget before breakfast.
Token-weighted rate limiting is the fix when:
- You proxy OpenAI, Anthropic, or another LLM provider through your own API
- You bill or budget per customer and a single oversized request can blow a month of margin
- Your current rate limit is requests-per-minute and the heavy users are eating the cheap users' headroom
Why requests-per-minute breaks for LLM APIs
A normal CRUD endpoint has flat cost. Whether the body is 100 bytes or 10 KB, the work is roughly the same, and counting requests maps cleanly to load.
An LLM call doesn’t behave like that. Cost scales with input tokens, output tokens, model class, whether prompt caching hit, and whether the response streamed. Two requests with identical paths and headers can hit your provider bill for $0.0001 and $4. Rate limiting on request count is the wrong axis.
The providers know this. They publish the right axes themselves.
OpenAI and Anthropic limit on tokens, not requests
Anthropic’s docs are unambiguous about what their meter actually measures:
> The rate limits for the Messages API are measured in requests per minute (RPM), input tokens per minute (ITPM), and output tokens per minute (OTPM) for each model class.
Three counters per model, and the token counters dominate. Tier 1 Sonnet 4.x is 50 RPM but only 30,000 input tokens per minute and 8,000 output tokens per minute. Fifty 30-token pings sail through; a single 40,000-token document is already over the input ceiling.
Azure OpenAI applies the same shape: TPM and RPM as separate limits, allocated per model and deployment. OpenAI’s own rate-limit docs match. Tokens are what run out first on real workloads.
If your gateway sits between customers and these providers, it should meter in the same units. Counting requests when the provider counts tokens means you either limit too loosely (a mega-request trips the upstream limit anyway) or too tightly (a chatty cheap user gets capped like one running 50K-token jobs).
Track tokens with complex-rate-limit-inbound
Zuplo’s rate-limit-inbound policy meters one counter per request. That’s the
right shape for CRUD. For LLM traffic you want complex-rate-limit-inbound,
which supports multiple named counters in the same window and lets each request
count for an arbitrary amount against any of them, rather than always counting
as one.
The config is a limits dictionary plus a time window. Each key is a counter,
each value is its budget for the window. This entry goes in
config/policies.json alongside any other inbound policies on the route:
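Here's a sketch of what that entry could look like, mirroring Anthropic's Tier 1 Sonnet budgets. The counter names (requests, inputTokens, outputTokens) and the exact option keys such as timeWindowMinutes are illustrative rather than the policy's canonical schema; the Complex Rate Limit Policy reference linked further down has the authoritative shape:

```json
{
  "name": "llm-token-rate-limit",
  "policyType": "complex-rate-limit-inbound",
  "handler": {
    "export": "ComplexRateLimitInboundPolicy",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "rateLimitBy": "user",
      "timeWindowMinutes": 1,
      "limits": {
        "requests": 50,
        "inputTokens": 30000,
        "outputTokens": 8000
      }
    }
  }
}
```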
Three counters, all keyed on the authenticated consumer. rateLimitBy: "user"
reads request.user.sub, the subject claim populated by an upstream auth policy
on the route (the API key or
JWT inbound policies
both populate it). The auth policy has to run before the rate limiter, otherwise
there’s no consumer to key on. The values mirror Anthropic’s Tier 1 Sonnet shape
so the gateway runs out at the same time the upstream would. Any counter
overrunning trips a 429 with a retry-after header.
By default each request counts as 1 against every counter, which is no better
than RPM. The interesting part is replacing those increments with the real token
counts.
Count real tokens against the limit
ComplexRateLimitInboundPolicy.setIncrements() lets a custom policy set the
per-counter increment for the request that’s in flight. Call it from a custom
outbound policy after the upstream response arrives and you can apply the real
token counts from the provider’s usage block:
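A minimal sketch of that outbound policy, written against Zuplo's custom-code-outbound signature. The setIncrements argument shape assumed here (the request context plus a map of counter names matching the limits dictionary above) is an assumption, not the documented API; check the policy reference below for the real signature:

```ts
import {
  ComplexRateLimitInboundPolicy,
  ZuploContext,
  ZuploRequest,
} from "@zuplo/runtime";

export default async function policy(
  response: Response,
  request: ZuploRequest,
  context: ZuploContext,
  options: unknown,
  policyName: string,
) {
  let inputTokens = 0;
  let outputTokens = 0;

  try {
    // Clone so the original body still flows to the client untouched.
    const body = await response.clone().json();
    const usage = body?.usage;
    if (usage) {
      // Anthropic: usage.input_tokens / usage.output_tokens
      // OpenAI:    usage.prompt_tokens / usage.completion_tokens
      inputTokens = usage.input_tokens ?? usage.prompt_tokens ?? 0;
      outputTokens = usage.output_tokens ?? usage.completion_tokens ?? 0;
    }
  } catch {
    // Non-JSON body or parse failure: token increments stay 0,
    // the request still counts as 1 on the requests counter.
  }

  // Assumed call shape: counter names must match the limits dictionary
  // configured on the inbound policy.
  ComplexRateLimitInboundPolicy.setIncrements(context, {
    requests: 1,
    inputTokens,
    outputTokens,
  });

  return response;
}
```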
Two response shapes covered. Anthropic returns usage.input_tokens /
usage.output_tokens. OpenAI returns usage.prompt_tokens /
usage.completion_tokens. If .json() fails or there’s no usage block, the
token increments fall back to 0 and the request still counts as 1 on the
requests counter.
Register the outbound policy in config/policies.json:
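Assuming the module above lives at modules/count-llm-tokens.ts (both names are placeholders), the entry looks like any other custom code policy:

```json
{
  "name": "count-llm-tokens-outbound",
  "policyType": "custom-code-outbound",
  "handler": {
    "export": "default",
    "module": "$import(./modules/count-llm-tokens)"
  }
}
```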
Attach both policies to the LLM-proxy route in your OpenAPI routes file. Inbound limiter runs first, the upstream call happens, outbound policy applies the real token counts:
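In routes.oas.json that looks roughly like this, where the path, the api-key-auth policy name, and the forwarding handler stand in for whatever your proxy route already uses:

```json
"/v1/messages": {
  "post": {
    "x-zuplo-route": {
      "handler": {
        "export": "urlForwardHandler",
        "module": "$import(@zuplo/runtime)",
        "options": { "baseUrl": "https://api.anthropic.com" }
      },
      "policies": {
        "inbound": ["api-key-auth", "llm-token-rate-limit"],
        "outbound": ["count-llm-tokens-outbound"]
      }
    }
  }
}
```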
setIncrements writes the real counts to the bucket before the response leaves
Zuplo, so the in-flight request lands on the counter at its real weight, and the
next request sees updated totals. A user who blows the input token budget on a
single 40K-token call gets 429’d on their next attempt, not after several free
passes.
Common mistake:
response.clone().json() only works on buffered JSON. If you proxy streaming
SSE from OpenAI or Anthropic, the body is a token-by-token event stream and
.json() will reject. Counting tokens from a stream needs a streaming-aware
outbound hook built on
StreamingZoneCache
that accumulates usage events from the SSE chunks: a separate pattern, not
covered here.
Complex Rate Limit Policy
Reference for multi-counter rate limiting and the setIncrements API used to weight requests by real token usage.
Size the budget per plan
The limits block above is one global ceiling. Real APIs run different plans
with different ceilings. The cleanest way to model that is one
complex-rate-limit-inbound instance per plan, each with its own token and
request budgets, attached to a route the matching consumers hit.
A free plan might be 60 RPM, 30,000 input tokens, 8,000 output tokens. A pro
plan on the same upstream might be 600 RPM, 300,000 input tokens, 80,000 output
tokens. Both keyed on rateLimitBy: "user" for per-consumer counters, both
attached to a route that filters consumers by plan with a small inbound gate.
The setIncrements hook stays the same across plans because the increment is
the real token count regardless of budget.
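As a sketch, the two plan instances in the policies array are identical except for their budgets (same illustrative option names as before):

```json
{
  "name": "llm-token-rate-limit-free",
  "policyType": "complex-rate-limit-inbound",
  "handler": {
    "export": "ComplexRateLimitInboundPolicy",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "rateLimitBy": "user",
      "timeWindowMinutes": 1,
      "limits": { "requests": 60, "inputTokens": 30000, "outputTokens": 8000 }
    }
  }
},
{
  "name": "llm-token-rate-limit-pro",
  "policyType": "complex-rate-limit-inbound",
  "handler": {
    "export": "ComplexRateLimitInboundPolicy",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "rateLimitBy": "user",
      "timeWindowMinutes": 1,
      "limits": { "requests": 600, "inputTokens": 300000, "outputTokens": 80000 }
    }
  }
}
```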
For a pre-flight ceiling against a single oversize request (a 200K-token prompt
that would burn the upstream’s per-request token cap), add a custom inbound
policy that reads the request body’s messages / prompt, estimates input
tokens, and rejects with 413 if it exceeds a hard per-request cap. That’s a
separate gate from the per-minute counter and worth running on every LLM route.
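A rough sketch of that gate, using a crude four-characters-per-token estimate; the cap, the estimation method, and the error body are all placeholder choices to tune against your upstream's actual per-request limit:

```ts
import { ZuploContext, ZuploRequest } from "@zuplo/runtime";

// Hypothetical hard cap; tune to your upstream's per-request token limit.
const MAX_INPUT_TOKENS = 100_000;

export default async function policy(
  request: ZuploRequest,
  context: ZuploContext,
) {
  // Clone so the route handler can still read the body downstream.
  const body = await request
    .clone()
    .json()
    .catch(() => null);

  // Gather whatever text the request carries: chat messages or a raw prompt.
  const pieces: string[] = [];
  if (Array.isArray(body?.messages)) {
    for (const m of body.messages) {
      pieces.push(
        typeof m?.content === "string" ? m.content : JSON.stringify(m?.content ?? ""),
      );
    }
  }
  if (typeof body?.prompt === "string") {
    pieces.push(body.prompt);
  }

  // Crude pre-flight estimate: roughly 4 characters per token for English text.
  const estimatedTokens = Math.ceil(pieces.join(" ").length / 4);

  if (estimatedTokens > MAX_INPUT_TOKENS) {
    // Returning a Response from an inbound policy short-circuits the route.
    return new Response(
      JSON.stringify({
        error: "request_too_large",
        detail: `Estimated ${estimatedTokens} input tokens exceeds the per-request cap of ${MAX_INPUT_TOKENS}`,
      }),
      { status: 413, headers: { "content-type": "application/json" } },
    );
  }

  return request;
}
```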
The same per-consumer token signal also doubles as a billing signal if you want it to. Zuplo’s monetization-inbound policy ties usage to a consumer’s subscription, so the counts that drive your rate limits can also feed plan-based billing without a second metering pipeline. Rate limiting is the focus here; the same plumbing extends to monetization when you’re ready.
What a token-weighted gateway buys you
A token-weighted gateway throttles every consumer by what they actually used. The chatty classifier user keeps their 30-token calls flowing. The batch-summarization user gets capped at the budget their plan paid for. The cheap user isn't squeezed out of their RPM headroom by someone else's 40K-token jobs, and your upstream provider quota stops running out by surprise.
