---
title: "Rate Limit LLM APIs by Tokens Not Requests"
description: "Requests-per-minute is the wrong meter for LLM endpoints. One call can be 50 tokens or 50,000. Rate limit on input and output tokens with Zuplo's complex-rate-limit-inbound policy and the real counts from each upstream response."
canonicalUrl: "https://zuplo.com/blog/2026/05/12/rate-limit-llm-apis-by-tokens-not-requests"
pageType: "blog"
date: "2026-05-12"
authors: "martyn"
tags: "rate-limiting, llm, ai-gateway"
image: "https://zuplo.com/og?text=Rate%20Limit%20LLM%20APIs%20by%20Tokens%20Not%20Requests"
---
A request count is a terrible proxy for LLM cost. Two calls to the same endpoint
can differ by three orders of magnitude in tokens, dollars, and latency. One
might be a 30-token classifier ping. The next might ship a 40,000-token document
plus tool definitions and ask for a long structured response. A 60-RPM cap
treats them as equal, and the heavy user empties your provider budget before
breakfast.

<CalloutAudience
  variant="useIf"
  items={[
    `You proxy OpenAI, Anthropic, or another LLM provider through your own API`,
    `You bill or budget per customer and a single oversized request can blow a month of margin`,
    `Your current rate limit is requests-per-minute and the heavy users are eating the cheap users' headroom`,
  ]}
/>

## Why requests-per-minute breaks for LLM APIs

A normal CRUD endpoint has flat cost. Whether the body is 100 bytes or 10 KB,
the work is roughly the same, and counting requests maps cleanly to load.

An LLM call doesn't behave like that. Cost scales with input tokens, output
tokens, model class, whether the prompt cache hit, and whether the response
streamed. Two requests with identical paths and headers can cost $0.0001 or $4
on your provider bill. Rate limiting on request count is the wrong axis.

The providers know this. They publish the right axes themselves.

## OpenAI and Anthropic limit on tokens, not requests

Anthropic's docs are unambiguous about what their meter actually measures:

> The rate limits for the Messages API are measured in requests per minute
> (RPM), input tokens per minute (ITPM), and output tokens per minute (OTPM) for
> each model class.

Three counters per model, and the token counters dominate. Tier 1 Sonnet 4.x is
50 RPM but only 30,000 input tokens per minute and 8,000 output tokens per
minute. Fifty 30-token pings sail through; a single 40,000-token document is
already over the input ceiling.
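
A quick back-of-the-envelope makes the asymmetry concrete (the per-request token
counts here are illustrative):

```ts
// Tier 1 Sonnet-shaped ceilings from the limits quoted above.
const RPM = 50;
const INPUT_TPM = 30_000;

// Fifty 30-token classifier pings: right at the request ceiling,
// but only 5% of the input-token ceiling.
console.log(50 <= RPM, 50 * 30 <= INPUT_TPM); // true true

// One 40,000-token document: 2% of the request ceiling,
// already over the input-token ceiling on the first call.
console.log(1 <= RPM, 40_000 <= INPUT_TPM); // true false
```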

Azure OpenAI applies the same shape: TPM and RPM as separate limits, allocated
per model and deployment.
[OpenAI's own rate-limit docs](https://platform.openai.com/docs/guides/rate-limits)
match. Tokens are what run out first on real workloads.

If your gateway sits between customers and these providers, it should meter in
the same units. Counting requests when the provider counts tokens means you
either limit too loosely (a mega-request trips the upstream limit anyway) or too
tightly (a chatty cheap user gets capped like one running 50K-token jobs).

## Track tokens with complex-rate-limit-inbound

Zuplo's `rate-limit-inbound` policy meters one counter per request. That's the
right shape for CRUD. For LLM traffic you want `complex-rate-limit-inbound`,
which supports multiple named counters in the same window and lets each request
count for an arbitrary amount against any of them, rather than always counting
as one.

The config is a `limits` dictionary plus a time window. Each key is a counter,
each value is its budget for the window. This entry goes in
`config/policies.json` alongside any other inbound policies on the route:

```json
// config/policies.json
{
  "name": "llm-rate-limit-inbound-policy",
  "policyType": "complex-rate-limit-inbound",
  "handler": {
    "export": "ComplexRateLimitInboundPolicy",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "limits": {
        "requests": 60,
        "inputTokens": 30000,
        "outputTokens": 8000
      },
      "rateLimitBy": "user",
      "timeWindowMinutes": 1
    }
  }
}
```

Three counters, all keyed on the authenticated consumer. `rateLimitBy: "user"`
reads `request.user.sub`, the subject claim populated by an upstream auth policy
on the route (the [API key](https://zuplo.com/docs/policies/api-key-inbound) or
[JWT](https://zuplo.com/docs/policies/open-id-jwt-auth-inbound) inbound policies
both populate it). The auth policy has to run before the rate limiter, otherwise
there's no consumer to key on. The values mirror Anthropic's Tier 1 Sonnet shape
so the gateway runs out at the same time the upstream would. Overrunning any one
counter trips a `429` with a `retry-after` header.
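
On the consumer side, that `retry-after` value is the back-off signal. A minimal
sketch of a caller that honors it (the proxy URL is a placeholder, and a
production client would cap retries and add jitter):

```ts
// Hypothetical caller against your gateway's LLM proxy route.
async function callLlmProxy(payload: unknown): Promise<Response> {
  const send = () =>
    fetch("https://api.example.com/v1/messages", {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(payload),
    });

  let response = await send();
  if (response.status === 429) {
    // retry-after is in seconds; default to 5 if it's missing or unparsable.
    const waitSeconds = Number(response.headers.get("retry-after")) || 5;
    await new Promise((resolve) => setTimeout(resolve, waitSeconds * 1000));
    response = await send();
  }
  return response;
}
```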

By default each request counts as `1` against every counter, which is no better
than RPM. The interesting part is replacing those increments with the real token
counts.

## Count real tokens against the limit

`ComplexRateLimitInboundPolicy.setIncrements()` lets a custom policy set the
per-counter increment for the request that's in flight. Call it from a custom
outbound policy after the upstream response arrives and you can apply the real
token counts from the provider's `usage` block:

```ts
// modules/count-llm-tokens.ts
import { ComplexRateLimitInboundPolicy, ZuploContext } from "@zuplo/runtime";

export default async function countLlmTokens(
  response: Response,
  request: Request,
  context: ZuploContext,
): Promise<Response> {
  // Errors don't have a usage block, so leave the default 1-per-request count.
  if (!response.ok) return response;

  // Clone the response so the client still gets an unread body to consume.
  const body = await response
    .clone()
    .json()
    .catch(() => null);

  const usage = body?.usage;
  // Anthropic names them input_tokens / output_tokens, OpenAI uses prompt_tokens / completion_tokens.
  const inputTokens = usage?.input_tokens ?? usage?.prompt_tokens ?? 0;
  const outputTokens = usage?.output_tokens ?? usage?.completion_tokens ?? 0;

  ComplexRateLimitInboundPolicy.setIncrements(context, {
    inputTokens,
    outputTokens,
  });

  return response;
}
```

Two response shapes covered. Anthropic returns `usage.input_tokens` /
`usage.output_tokens`. OpenAI returns `usage.prompt_tokens` /
`usage.completion_tokens`. If `.json()` fails or there's no `usage` block, the
token increments fall back to `0` and the request still counts as `1` on the
`requests` counter.
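
If you want to sanity-check that mapping outside the gateway, the same fallback
chain can be exercised against both shapes (the values below are made up; real
responses carry additional fields the policy ignores):

```ts
// Illustrative usage blocks from each provider.
const anthropicUsage = { input_tokens: 40_213, output_tokens: 1_024 };
const openaiUsage = { prompt_tokens: 40_213, completion_tokens: 1_024, total_tokens: 41_237 };

// Same fallback chain the outbound policy applies.
const inputTokens = (usage: { input_tokens?: number; prompt_tokens?: number }) =>
  usage.input_tokens ?? usage.prompt_tokens ?? 0;

console.log(inputTokens(anthropicUsage)); // 40213
console.log(inputTokens(openaiUsage)); // 40213
console.log(inputTokens({})); // 0 — the no-usage fallback
```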

Register the outbound policy in `config/policies.json`:

```json
// config/policies.json
{
  "name": "count-llm-tokens",
  "policyType": "custom-code-outbound",
  "handler": {
    "export": "default",
    "module": "$import(./modules/count-llm-tokens)"
  }
}
```

Attach both policies to the LLM-proxy route in your OpenAPI routes file (your
auth policy goes ahead of the limiter in the same inbound array). The inbound
limiter runs first, the upstream call happens, and the outbound policy applies
the real token counts:

```json
// config/routes.oas.json
{
  "paths": {
    "/v1/messages": {
      "x-zuplo-path": { "pathMode": "open-api" },
      "post": {
        "x-zuplo-route": {
          "handler": {
            "export": "urlForwardHandler",
            "module": "$import(@zuplo/runtime)",
            "options": { "baseUrl": "https://api.anthropic.com/v1/messages" }
          },
          "policies": {
            "inbound": ["llm-rate-limit-inbound-policy"],
            "outbound": ["count-llm-tokens"]
          }
        }
      }
    }
  }
}
```

`setIncrements` writes the real counts to the bucket before the response leaves
Zuplo, so the in-flight request lands on the counter at its real weight, and the
next request sees updated totals. A user who blows the input token budget on a
single 40K-token call gets `429`'d on their next attempt, not after several free
passes.

<CalloutTip variant="mistake">
  `response.clone().json()` only works on buffered JSON. If you proxy streaming
  SSE from OpenAI or Anthropic, the body is a token-by-token event stream and
  `.json()` will reject. Counting tokens from a stream needs a streaming-aware
  outbound hook built on
  [StreamingZoneCache](https://zuplo.com/docs/programmable-api/streaming-zone-cache)
  that accumulates `usage` events from the SSE chunks: a separate pattern, not
  covered here.
</CalloutTip>

<CalloutDoc
  title="Complex Rate Limit Policy"
  description="Reference for multi-counter rate limiting and the setIncrements API used to weight requests by real token usage."
  href="https://zuplo.com/docs/policies/complex-rate-limit-inbound"
  icon="book"
/>

## Size the budget per plan

The `limits` block above is one global ceiling. Real APIs run different plans
with different ceilings. The cleanest way to model that is one
`complex-rate-limit-inbound` instance per plan, each with its own token and
request budgets, attached to a route the matching consumers hit.

A `free` plan might be 60 RPM, 30,000 input tokens, 8,000 output tokens. A `pro`
plan on the same upstream might be 600 RPM, 300,000 input tokens, 80,000 output
tokens. Both keyed on `rateLimitBy: "user"` for per-consumer counters, both
attached to a route that filters consumers by plan with a small inbound gate.
The `setIncrements` hook stays the same across plans because the increment is
the real token count regardless of budget.
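
Here's what that inbound gate can look like. The sketch assumes your auth policy
tags each consumer with a `plan` field in their metadata; `request.user.data.plan`
is that assumption, not something Zuplo sets for you:

```ts
// modules/require-plan.ts
import { ZuploContext, ZuploRequest } from "@zuplo/runtime";

type RequirePlanOptions = { plan: string };

export default async function requirePlan(
  request: ZuploRequest,
  context: ZuploContext,
  options: RequirePlanOptions,
): Promise<ZuploRequest | Response> {
  // Assumes the auth policy attached plan metadata to the consumer.
  const plan = (request.user?.data as { plan?: string } | undefined)?.plan;

  if (plan !== options.plan) {
    // Wrong plan for this route: reject before the rate limiter even runs.
    context.log.warn(`Consumer on plan "${plan}" hit the ${options.plan} route`);
    return new Response(
      JSON.stringify({ error: `This route requires the ${options.plan} plan` }),
      { status: 403, headers: { "content-type": "application/json" } },
    );
  }
  return request;
}
```

Register it as a `custom-code-inbound` policy ahead of each plan's rate limiter,
one instance per plan with the matching `plan` option.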

For a pre-flight ceiling against a single oversize request (a 200K-token prompt
that would burn the upstream's per-request token cap), add a custom inbound
policy that reads the request body's `messages` / `prompt`, estimates input
tokens, and rejects with `413` if it exceeds a hard per-request cap. That's a
separate gate from the per-minute counter and worth running on every LLM route.
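
A sketch of that pre-flight gate, using a rough characters-divided-by-four
estimate instead of a real tokenizer. The 50,000-token per-request cap, the
heuristic, and the module name are all assumptions to tune for your models:

```ts
// modules/reject-oversize-prompts.ts
import { ZuploContext, ZuploRequest } from "@zuplo/runtime";

const MAX_ESTIMATED_INPUT_TOKENS = 50_000; // assumed per-request ceiling

export default async function rejectOversizePrompts(
  request: ZuploRequest,
  context: ZuploContext,
): Promise<ZuploRequest | Response> {
  // Clone so the forwarded request still has an unread body.
  const raw = await request.clone().text();

  // Rough heuristic: ~4 characters per token for English-ish text. Measuring
  // the whole body overestimates slightly versus summing just messages/prompt,
  // which is fine for a hard ceiling; a real tokenizer would be tighter.
  const estimatedTokens = Math.ceil(raw.length / 4);

  if (estimatedTokens > MAX_ESTIMATED_INPUT_TOKENS) {
    context.log.warn(`Rejecting request: ~${estimatedTokens} estimated input tokens`);
    return new Response(
      JSON.stringify({
        error: "Request too large",
        estimatedTokens,
        limit: MAX_ESTIMATED_INPUT_TOKENS,
      }),
      { status: 413, headers: { "content-type": "application/json" } },
    );
  }
  return request;
}
```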

The same per-consumer token signal also doubles as a billing signal if you want
it to. Zuplo's
[monetization-inbound policy](https://zuplo.com/docs/articles/monetization) ties
usage to a consumer's subscription, so the counts that drive your rate limits
can also feed plan-based billing without a second metering pipeline. Rate
limiting is the focus here; the same plumbing extends to monetization when
you're ready.

## What a token-weighted gateway buys you

A token-weighted gateway throttles every consumer by what they actually used.
The chatty classifier user keeps their 30-token calls flowing. The
batch-summarization user gets capped at the budget their plan paid for. The
cheap user isn't squeezed out of their RPM headroom by someone else's 40K-token
jobs, and you stop getting surprise 429s from your upstream provider.