A request count is a terrible proxy for LLM cost. Two calls to the same endpoint can differ by three orders of magnitude in tokens, dollars, and latency. One might be a 30-token classifier ping. The next might ship a 40,000-token document plus tool definitions and ask for a long structured response. A 60-RPM cap treats them as equal, and the heavy user empties your provider budget before breakfast.
Token-weighted rate limiting is the fix when:
- You proxy OpenAI, Anthropic, or another LLM provider through your own API
- You bill or budget per customer and a single oversized request can blow a month of margin
- Your current rate limit is requests-per-minute and the heavy users are eating the cheap users' headroom
Why requests-per-minute breaks for LLM APIs
A normal CRUD endpoint has flat cost. Whether the body is 100 bytes or 10 KB, the work is roughly the same, and counting requests maps cleanly to load.
An LLM call doesn’t behave like that. Cost scales with input tokens, output tokens, model class, whether prompt caching hit, and whether the response streamed. Two requests with identical paths and headers can hit your provider bill for $0.0001 and $4. Rate limiting on request count is the wrong axis.
The providers know this. They publish the right axes themselves.
OpenAI and Anthropic limit on tokens, not requests
Anthropic’s docs are unambiguous about what their meter actually measures:
> The rate limits for the Messages API are measured in requests per minute (RPM), input tokens per minute (ITPM), and output tokens per minute (OTPM) for each model class.
Three counters per model, and the token counters dominate. Tier 1 Sonnet 4.x is 50 RPM but only 30,000 input tokens per minute and 8,000 output tokens per minute. Fifty 30-token pings sail through; a single 40,000-token document is already over the input ceiling.
Azure OpenAI applies the same shape: TPM and RPM as separate limits, allocated per model and deployment. OpenAI’s own rate-limit docs match. Tokens are what run out first on real workloads.
If your gateway sits between customers and these providers, it should meter in the same units. Counting requests when the provider counts tokens means you either limit too loosely (a mega-request trips the upstream limit anyway) or too tightly (a chatty cheap user gets capped like one running 50K-token jobs).
Track tokens with complex-rate-limit-inbound
Zuplo’s rate-limit-inbound policy meters one counter per request. That’s the
right shape for CRUD. For LLM traffic you want complex-rate-limit-inbound,
which supports multiple named counters in the same window and lets each request
count for an arbitrary amount against any of them, rather than always counting
as one.
The config is a limits dictionary plus a time window. Each key is a counter,
each value is its budget for the window. This entry goes in
config/policies.json alongside any other inbound policies on the route:
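Here's a sketch of what that entry could look like, mirroring Anthropic's Tier 1 Sonnet budgets. The counter names (requests, inputTokens, outputTokens) and the exact option keys such as timeWindowMinutes are illustrative rather than the policy's canonical schema; the Complex Rate Limit Policy reference linked further down has the authoritative shape:

```json
{
  "name": "llm-token-rate-limit",
  "policyType": "complex-rate-limit-inbound",
  "handler": {
    "export": "ComplexRateLimitInboundPolicy",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "rateLimitBy": "user",
      "timeWindowMinutes": 1,
      "limits": {
        "requests": 50,
        "inputTokens": 30000,
        "outputTokens": 8000
      }
    }
  }
}
```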
Three counters, all keyed on the authenticated consumer. rateLimitBy: "user"
reads request.user.sub, the subject claim populated by an upstream auth policy
on the route (the API key or
JWT inbound policies
both populate it). The auth policy has to run before the rate limiter, otherwise
there’s no consumer to key on. The values mirror Anthropic’s Tier 1 Sonnet shape
so the gateway runs out at the same time the upstream would. Any counter
overrunning trips a 429 with a retry-after header.
By default each request counts as 1 against every counter, which is no better
than RPM. The interesting part is replacing those increments with the real token
counts.
Count real tokens against the limit
ComplexRateLimitInboundPolicy.setIncrements() lets a custom policy set the
per-counter increment for the request that’s in flight. Call it from a custom
outbound policy after the upstream response arrives and you can apply the real
token counts from the provider’s usage block:
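A minimal sketch of that outbound policy, written against Zuplo's custom-code-outbound signature. The setIncrements argument shape assumed here (the request context plus a map of counter names matching the limits dictionary above) is an assumption, not the documented API; check the policy reference below for the real signature:

```ts
import {
  ComplexRateLimitInboundPolicy,
  ZuploContext,
  ZuploRequest,
} from "@zuplo/runtime";

export default async function policy(
  response: Response,
  request: ZuploRequest,
  context: ZuploContext,
  options: unknown,
  policyName: string,
) {
  let inputTokens = 0;
  let outputTokens = 0;

  try {
    // Clone so the original body still flows to the client untouched.
    const body = await response.clone().json();
    const usage = body?.usage;
    if (usage) {
      // Anthropic: usage.input_tokens / usage.output_tokens
      // OpenAI:    usage.prompt_tokens / usage.completion_tokens
      inputTokens = usage.input_tokens ?? usage.prompt_tokens ?? 0;
      outputTokens = usage.output_tokens ?? usage.completion_tokens ?? 0;
    }
  } catch {
    // Non-JSON body or parse failure: token increments stay 0,
    // the request still counts as 1 on the requests counter.
  }

  // Assumed call shape: counter names must match the limits dictionary
  // configured on the inbound policy.
  ComplexRateLimitInboundPolicy.setIncrements(context, {
    requests: 1,
    inputTokens,
    outputTokens,
  });

  return response;
}
```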
Two response shapes covered. Anthropic returns usage.input_tokens /
usage.output_tokens. OpenAI returns usage.prompt_tokens /
usage.completion_tokens. If .json() fails or there’s no usage block, the
token increments fall back to 0 and the request still counts as 1 on the
requests counter.
Register the outbound policy in config/policies.json:
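Assuming the module above lives at modules/count-llm-tokens.ts (both names are placeholders), the entry looks like any other custom code policy:

```json
{
  "name": "count-llm-tokens-outbound",
  "policyType": "custom-code-outbound",
  "handler": {
    "export": "default",
    "module": "$import(./modules/count-llm-tokens)"
  }
}
```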
Attach both policies to the LLM-proxy route in your OpenAPI routes file. Inbound limiter runs first, the upstream call happens, outbound policy applies the real token counts:
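In routes.oas.json that looks roughly like this, where the path, the api-key-auth policy name, and the forwarding handler stand in for whatever your proxy route already uses:

```json
"/v1/messages": {
  "post": {
    "x-zuplo-route": {
      "handler": {
        "export": "urlForwardHandler",
        "module": "$import(@zuplo/runtime)",
        "options": { "baseUrl": "https://api.anthropic.com" }
      },
      "policies": {
        "inbound": ["api-key-auth", "llm-token-rate-limit"],
        "outbound": ["count-llm-tokens-outbound"]
      }
    }
  }
}
```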
setIncrements writes the real counts to the bucket before the response leaves
Zuplo, so the in-flight request lands on the counter at its real weight, and the
next request sees updated totals. A user who blows the input token budget on a
single 40K-token call gets 429’d on their next attempt, not after several free
passes.
Common mistake:
response.clone().json() only works on buffered JSON. If you proxy streaming
SSE from OpenAI or Anthropic, the body is a token-by-token event stream and
.json() will reject. Counting tokens from a stream needs a streaming-aware
outbound hook built on
StreamingZoneCache
that accumulates usage events from the SSE chunks: a separate pattern, not
covered here.
Complex Rate Limit Policy
Reference for multi-counter rate limiting and the setIncrements API used to weight requests by real token usage.
Size the budget per plan
The limits block above is one global ceiling. Real APIs run different plans
with different ceilings. The cleanest way to model that is one
complex-rate-limit-inbound instance per plan, each with its own token and
request budgets, attached to a route the matching consumers hit.
A free plan might be 60 RPM, 30,000 input tokens, 8,000 output tokens. A pro
plan on the same upstream might be 600 RPM, 300,000 input tokens, 80,000 output
tokens. Both keyed on rateLimitBy: "user" for per-consumer counters, both
attached to a route that filters consumers by plan with a small inbound gate.
The setIncrements hook stays the same across plans because the increment is
the real token count regardless of budget.
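As a sketch, the two plan instances in the policies array are identical except for their budgets (same illustrative option names as before):

```json
{
  "name": "llm-token-rate-limit-free",
  "policyType": "complex-rate-limit-inbound",
  "handler": {
    "export": "ComplexRateLimitInboundPolicy",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "rateLimitBy": "user",
      "timeWindowMinutes": 1,
      "limits": { "requests": 60, "inputTokens": 30000, "outputTokens": 8000 }
    }
  }
},
{
  "name": "llm-token-rate-limit-pro",
  "policyType": "complex-rate-limit-inbound",
  "handler": {
    "export": "ComplexRateLimitInboundPolicy",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "rateLimitBy": "user",
      "timeWindowMinutes": 1,
      "limits": { "requests": 600, "inputTokens": 300000, "outputTokens": 80000 }
    }
  }
}
```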
For a pre-flight ceiling against a single oversize request (a 200K-token prompt
that would burn the upstream’s per-request token cap), add a custom inbound
policy that reads the request body’s messages / prompt, estimates input
tokens, and rejects with 413 if it exceeds a hard per-request cap. That’s a
separate gate from the per-minute counter and worth running on every LLM route.
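A rough sketch of that gate, using a crude four-characters-per-token estimate; the cap, the estimation method, and the error body are all placeholder choices to tune against your upstream's actual per-request limit:

```ts
import { ZuploContext, ZuploRequest } from "@zuplo/runtime";

// Hypothetical hard cap; tune to your upstream's per-request token limit.
const MAX_INPUT_TOKENS = 100_000;

export default async function policy(
  request: ZuploRequest,
  context: ZuploContext,
) {
  // Clone so the route handler can still read the body downstream.
  const body = await request
    .clone()
    .json()
    .catch(() => null);

  // Gather whatever text the request carries: chat messages or a raw prompt.
  const pieces: string[] = [];
  if (Array.isArray(body?.messages)) {
    for (const m of body.messages) {
      pieces.push(
        typeof m?.content === "string" ? m.content : JSON.stringify(m?.content ?? ""),
      );
    }
  }
  if (typeof body?.prompt === "string") {
    pieces.push(body.prompt);
  }

  // Crude pre-flight estimate: roughly 4 characters per token for English text.
  const estimatedTokens = Math.ceil(pieces.join(" ").length / 4);

  if (estimatedTokens > MAX_INPUT_TOKENS) {
    // Returning a Response from an inbound policy short-circuits the route.
    return new Response(
      JSON.stringify({
        error: "request_too_large",
        detail: `Estimated ${estimatedTokens} input tokens exceeds the per-request cap of ${MAX_INPUT_TOKENS}`,
      }),
      { status: 413, headers: { "content-type": "application/json" } },
    );
  }

  return request;
}
```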
The same per-consumer token signal also doubles as a billing signal if you want it to. Zuplo’s monetization-inbound policy ties usage to a consumer’s subscription, so the counts that drive your rate limits can also feed plan-based billing without a second metering pipeline. Rate limiting is the focus here; the same plumbing extends to monetization when you’re ready.
What a token-weighted gateway buys you
A token-weighted gateway throttles every consumer by what they actually used. The chatty classifier user keeps their 30-token calls flowing. The batch-summarization user gets capped at the budget their plan paid for. The cheap user isn't squeezed out of their RPM headroom by someone else's 40K-token jobs, and your upstream provider quota stops running out by surprise.
