Your API is serving AI agents now, whether you planned for it or not. A chatbot sending ten requests a minute looks identical to a runaway automation loop sending ten requests a minute. One pays you. The other drains your compute budget. Your rate limiter can’t tell them apart, because the only thing it counts is requests.
That’s what happens when you shape AI agent traffic with tools built for human developers. A person hits an endpoint a handful of times per minute, takes a break, hits it again. An autonomous agent can chain fifty calls in ten seconds and then go silent for an hour. Per-minute counters were never designed for that.
This piece is for you if any of the following sounds familiar:
- AI agents are a meaningful slice of your API traffic
- You've had a per-minute rate limit break a legitimate agent workflow
- You're billing or throttling by request count and the math no longer lines up with your costs
Why Fixed-Window Rate Limits Break on Agents
Standard API rate limiting counts requests in a fixed time window. Cross the line, get a 429. That model assumes requests have roughly uniform cost and arrive at a predictable cadence. Agent traffic violates both.
Bursts Look Like Abuse
An autonomous agent completing one task might chain 10 to 50 API calls in seconds: tool lookups, retrieval queries, multi-step reasoning, a final completion. Then it idles for minutes. A fixed-window limit either blocks the burst and breaks the workflow, or sets the ceiling high enough to allow it and leaves your API exposed during sustained abuse. No single ceiling handles both.
Loops Don’t Rest
Humans take breaks. Agents don’t. A batch agent can hold a steady cadence under your per-minute limit for twenty-four hours straight, consuming more cumulative resources than any human would touch in a week. Without cumulative tracking, you never see it.
Single Actions Fan Out
Agent workflows span endpoints. One user prompt can trigger a vector search, three LLM calls with different models, a tool-use API call, and a final summarisation. Rate limiting each endpoint independently means the workflow fails unpredictably whenever one link gets throttled, even when overall consumption is reasonable.
Three Agent Traffic Patterns, Three Different Strategies
Before you touch any policy config, figure out which shape of traffic you’re dealing with.
Conversational Agents
Request-response exchanges with variable gaps. A chatbot sends a prompt, waits, processes the reply, maybe follows up. Bursty within a session, idle between sessions. Short bursts look like abuse to a fixed-window limiter, but hourly volume is moderate.
Use sliding-window limits with generous per-minute allowances and tighter per-hour caps. A sliding window re-evaluates the count over the last N seconds on every request, so a burst doesn’t collide with an arbitrary bucket boundary. Zuplo’s rate limiter slides by default. Group by API key identity, not IP: conversational agents run behind shared cloud-provider egress ranges that pool many customers behind one address.
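As a sketch, that pairing is two stacked instances of the rate limiting policy: a generous per-minute burst allowance and a tighter hourly cap. Option names here follow my reading of the Rate Limiting policy config, and the numbers are illustrative; verify both against the policy reference:

```json
{
  "policies": [
    {
      "name": "agent-burst-limit",
      "policyType": "rate-limit-inbound",
      "handler": {
        "export": "RateLimitInboundPolicy",
        "module": "$import(@zuplo/runtime)",
        "options": {
          "rateLimitBy": "user",
          "requestsAllowed": 120,
          "timeWindowMinutes": 1
        }
      }
    },
    {
      "name": "agent-hourly-cap",
      "policyType": "rate-limit-inbound",
      "handler": {
        "export": "RateLimitInboundPolicy",
        "module": "$import(@zuplo/runtime)",
        "options": {
          "rateLimitBy": "user",
          "requestsAllowed": 1500,
          "timeWindowMinutes": 60
        }
      }
    }
  ]
}
```

Note the asymmetry: 120/minute would allow 7,200/hour if sustained, but the hourly cap stops a session at 1,500. Bursts pass; grinding does not.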
Batch Processing Agents
Sustained, high-volume streams. A data pipeline processes thousands of records sequentially at a steady rate. Predictable but relentless. These agents sit politely under your per-minute limit forever, while burning a year of resources in a day.
Layer short-term rate limits with longer-term quotas tied to the consumer’s plan, and combine request-count limits with token-based rate limiting so you count resource consumption, not just call volume.
Autonomous Workflow Agents
Multi-step workflows where the agent decides which APIs to call based on previous responses. Volume and pattern are non-deterministic: three calls or thirty, depending on what it finds. The hardest to rate limit, because a static ceiling that works for one task is too tight for another.
Use per-workflow rate limits with custom grouping keys that track consumption per task or session, not just per consumer. Add circuit breakers that detect anomalous patterns (retry loops, stuck agents) and halt traffic before it spirals.
Identity-Aware Limits With API Key Metadata
Every strategy above rests on the same foundation: the gateway needs to know which agent is calling before it can apply the right limit. Bucketing by IP is already a bad idea for human developers. For agents, it’s unworkable.
Zuplo’s API key management lets you attach arbitrary metadata to each consumer. Store what the rate limiter needs to decide:
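Something like the following, where every field name is yours to define (these are illustrative, not a required schema):

```json
{
  "agentType": "conversational",
  "tier": "pro",
  "requestsPerMinute": 300,
  "tokensPerDay": 2000000
}
```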

When a request arrives with a valid key, the API Key Authentication policy populates request.user.sub with the consumer name and request.user.data with the metadata object. Every downstream policy can read those fields.
Now drive the limit from that metadata. Zuplo’s Rate Limiting policy supports a rateLimitBy: "function" mode where a TypeScript function returns the grouping key and per-request limit overrides:
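A sketch of such a function. The stand-in interfaces below exist so the sketch is self-contained; in a real Zuplo module you would import ZuploRequest and ZuploContext from "@zuplo/runtime" instead. The metadata field names (tier, requestsPerMinute) are assumptions matching whatever you stored on the key:

```typescript
// Minimal stand-ins for the Zuplo runtime types, for illustration only.
interface AgentRequest {
  user?: {
    sub: string;
    data: { tier?: string; requestsPerMinute?: number };
  };
}

interface RateLimitDetails {
  key: string;
  requestsAllowed?: number;
  timeWindowMinutes?: number;
}

export function agentRateLimitKey(request: AgentRequest): RateLimitDetails {
  const data = request.user?.data ?? {};
  return {
    // Group by consumer identity, never by IP.
    key: request.user?.sub ?? "anonymous",
    // Per-consumer override from key metadata; falls back to a default.
    requestsAllowed: data.requestsPerMinute ?? 60,
    timeWindowMinutes: 1,
  };
}
```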
Wire it into the policy:
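A sketch of the wiring, assuming the custom function is exported as agentRateLimitKey from a module at ./modules/agent-rate-limit (both names illustrative); the identifier option shape should be checked against the Rate Limiting policy reference:

```json
{
  "name": "agent-rate-limit",
  "policyType": "rate-limit-inbound",
  "handler": {
    "export": "RateLimitInboundPolicy",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "rateLimitBy": "function",
      "requestsAllowed": 60,
      "timeWindowMinutes": 1,
      "identifier": {
        "module": "$import(./modules/agent-rate-limit)",
        "export": "agentRateLimitKey"
      }
    }
  }
}
```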
The requestsAllowed and timeWindowMinutes in policy options are defaults; the function overrides them per request. Onboarding a new agent consumer is a metadata update on the API key, not a gateway redeploy.
Token Budgets for What Agents Actually Consume
Request counts tell you how often an agent calls your API. For anything that proxies an LLM or runs variable-cost work, that’s the wrong measurement. Two agents sending the same number of requests can differ by orders of magnitude in what they cost you.
Zuplo’s Complex Rate Limiting policy supports multiple named counters with dynamic increments: the primitive you want for token budgets. Instead of incrementing by one per request, increment by the token count the upstream LLM reports (the example below reads OpenAI-style usage.total_tokens; adapt to your provider’s shape):
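A minimal sketch of that increment logic, assuming an OpenAI-style response body. The counter name ("tokens") and the commented setIncrements wiring are assumptions, not the policy's verbatim API; check the Complex Rate Limiting policy reference for the exact call shape:

```typescript
interface NamedIncrement {
  name: string;
  amount: number;
}

// Pure helper: pull the total token count out of a parsed response body.
// Returns a zero increment when no usage block is present (e.g. an error body).
export function tokenIncrement(body: unknown): NamedIncrement {
  const usage = (body as { usage?: { total_tokens?: number } })?.usage;
  return { name: "tokens", amount: usage?.total_tokens ?? 0 };
}

// In the outbound policy you would clone the response, parse it, and report
// the increment before the response is finalised — roughly (shape assumed):
//
//   const body = await response.clone().json();
//   const inc = tokenIncrement(body);
//   RateLimitInboundPolicy.setIncrements(context, [inc]);
//   return response;
```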
Pair it with an outbound policy on the same route that reads token usage from the upstream response and calls setIncrements before the response is finalised. The inbound policy enforces the limit on the next request: the current response passes through, and its token cost is charged to the budget for subsequent calls.
Now an agent sending a few large prompts totalling 50,000 tokens hits the budget just as fast as one sending hundreds of tiny requests to the same total. The limiter tracks what actually costs you.
In production you’ll want to skip error responses and streaming bodies (they don’t carry a parseable usage block), and adapt the field path to your provider: Anthropic returns usage.input_tokens and usage.output_tokens, for example.
One important distinction: the pattern above is for the case where agents are calling your own API, and any LLM call is one step inside your handler. If the API you’re protecting is itself an LLM proxy, forwarding directly to OpenAI, Anthropic, or Gemini, Zuplo’s AI Gateway handles token budgeting, hierarchical team and app limits, semantic caching, and provider switching natively, without the custom outbound policy.
Pro tip:
You want both layers on the same route. A burst of small requests trips the request-count limit. A few massive prompts trip the token budget. Neither alone catches both failure modes.
Complex Rate Limiting Policy Reference
Full reference for named counters, dynamic increments, and setIncrements.
Tiered Access and Usage-Based Pricing
Agent traffic maps neatly onto tiered pricing. A developer on a free tier has different needs than an enterprise running production autonomous agents. If limits are already driven by API key metadata, tiers are cheap to add.
Store the plan and entitlements on the key:
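For example (field names illustrative; the only contract is that your limiting function reads the same names you store):

```json
{
  "plan": "pro",
  "requestsPerMinute": 300,
  "tokensPerDay": 5000000,
  "hardLimit": false
}
```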
Your custom rate limiting function reads request.user.data.requestsPerMinute directly. New tiers or custom limits for a specific customer are a metadata update, not a code change.
For APIs where agents are a revenue source, API Monetization connects the rate limiter to billing. Define meters that count API calls, tokens, or any custom unit, attach them to subscription plans with included allowances and overage pricing, and Zuplo handles enforcement and Stripe billing.
A typical agent monetisation setup:
- Developer plan: 10,000 requests/month included, $0.001 per overage
- Pro plan: 100,000 requests/month included, $0.0005 per overage
- Enterprise plan: Custom allowances with volume discounts
The monetization-inbound-policy validates subscriptions and tracks usage against each consumer’s plan in real time. It exposes entitlement data (current usage, remaining allowance) on every request, which your pipeline uses to reject hard-limit consumers once exhausted or to let soft-limit usage through for Stripe to bill at period close. Rate limiting becomes part of the pricing surface, not just a defensive measure.
Circuit Breakers for Runaway Agents
The worst agent traffic pattern isn’t high volume, it’s a loop. An agent stuck retrying the same failed request hundreds of times will exhaust its rate limits, inflate your costs, and degrade service for everyone else. A per-minute limiter catches it eventually, but not before the damage is done.
This is a different job than the classical circuit breaker that trips on failing downstreams. Here the signal is an agent repeatedly hitting the same path, but the response is the same: stop serving the pattern before it compounds. Zuplo’s programmable gateway lets you write a custom inbound policy that detects the pattern in real time. If the same signature repeats past a threshold in a short window, short-circuit before it hits your backend:
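A sketch of the detection logic: a sliding count per request signature (consumer plus method plus path), tripping when the same signature repeats past a threshold inside a short window. In Zuplo this would live in a custom inbound policy module; the thresholds are illustrative:

```typescript
class LoopBreaker {
  private hits = new Map<string, number[]>();

  constructor(
    private threshold = 20, // identical requests allowed...
    private windowMs = 10_000, // ...within this window
  ) {}

  // Record one occurrence of a signature; returns true when the
  // signature has exceeded the threshold and should be short-circuited.
  trip(signature: string, now = Date.now()): boolean {
    const recent = (this.hits.get(signature) ?? []).filter(
      (t) => now - t < this.windowMs,
    );
    recent.push(now);
    this.hits.set(signature, recent);
    return recent.length > this.threshold;
  }
}

// In the policy: build the signature from the authenticated consumer and the
// request shape, then return an error response before hitting the backend, e.g.
//   const sig = `${request.user?.sub}:${request.method}:${new URL(request.url).pathname}`;
//   if (breaker.trip(sig)) {
//     return new Response("Repeated identical requests detected. Stop retrying.", { status: 429 });
//   }
```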
Legitimate traffic varies endpoint, parameters, or timing, so it passes through. A stuck agent hitting the same path over and over trips the breaker and gets a clear error back telling it to stop.
Common mistake:
Placing the breaker after your rate limiters in the pipeline. Put it ahead of them; otherwise runaway agents burn quota that legitimate traffic needs.
A working order for the layers:
- Circuit breaker: catches retry loops and stuck agents
- Per-agent rate limit: per-minute request caps
- Token budget: hourly or daily token consumption caps
- Monetisation quota: billing-period usage caps
Each layer addresses a different failure mode. Together they cover the full spread of agent traffic without asking one counter to do everything.
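In a Zuplo route config, that ordering is just the inbound policy array. The policy names below are illustrative; the point is the sequence:

```json
{
  "policies": {
    "inbound": [
      "agent-loop-breaker",
      "per-agent-rate-limit",
      "token-budget-limit",
      "monetization-inbound-policy"
    ]
  }
}
```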
How to Implement a Circuit Breaker at the API Gateway
Walkthrough of the classical circuit breaker pattern for failing downstreams, the complement to the agent-loop variant above.
The Layered Model Is the Point
AI agents aren’t going away, and they aren’t going to start behaving like humans. The only rate limiting strategy that survives is layered: identity-aware limits driven by metadata, token budgets for actual cost, billing quotas for period caps, and circuit breakers for runaway loops. Each layer catches what the others miss.
If you’re starting from a single per-minute counter, the first move is identity. Get API key metadata into the limiting function so the gateway knows who’s calling. Everything else composes on top of that.
Rate Limiting Policy Reference
Full reference for the rate-limit-inbound policy, including custom function mode.
