---
title: "How to Rate Limit AI Agents Beyond Request Counts"
description: "A chatbot and a runaway automation loop both send ten requests a minute. One pays you, the other drains your compute. Fixed-window limiters can't tell them apart. Here's how to layer limits so AI agent traffic behaves."
canonicalUrl: "https://zuplo.com/blog/2026/04/27/rate-limit-ai-agents-beyond-request-counts"
pageType: "blog"
date: "2026-04-27"
authors: "martyn"
tags: "AI, API Rate Limiting"
image: "https://zuplo.com/og?text=How%20to%20Rate%20Limit%20AI%20Agents%20Beyond%20Request%20Counts"
---
Your API is serving AI agents now, whether you planned for it or not. A chatbot
sending ten requests a minute looks identical to a runaway automation loop
sending ten requests a minute. One pays you. The other drains your compute
budget. Your rate limiter can't tell them apart, because the only thing it
counts is requests.

That's what happens when you shape AI agent traffic with tools built for human
developers. A person hits an endpoint a handful of times per minute, takes a
break, hits it again. An autonomous agent can chain fifty calls in ten seconds
and then go silent for an hour. Per-minute counters were never designed for
that.

<CalloutAudience
  variant="useIf"
  items={[
    `AI agents are a meaningful slice of your API traffic`,
    `You've had a per-minute rate limit break a legitimate agent workflow`,
    `You're billing or throttling by request count and the math no longer lines up with your costs`,
  ]}
/>

## Why Fixed-Window Rate Limits Break on Agents

Standard [API rate limiting](https://zuplo.com/docs/concepts/rate-limiting)
counts requests in a fixed time window. Cross the line, get a 429. That model
assumes requests have roughly uniform cost and arrive at a predictable cadence.
Agent traffic violates both.

### Bursts Look Like Abuse

An autonomous agent completing one task might chain 10 to 50 API calls in
seconds: tool lookups, retrieval queries, multi-step reasoning, a final
completion. Then it idles for minutes. A fixed-window limit either blocks the
burst and breaks the workflow, or sets the ceiling high enough to allow it and
leaves your API exposed during sustained abuse. No single ceiling handles both.

### Loops Don't Rest

Humans take breaks. Agents don't. A batch agent can hold a steady cadence under
your per-minute limit for twenty-four hours straight, consuming more resources
in a day than any human would touch in a week. Without cumulative tracking, you
never see it.

### Single Actions Fan Out

Agent workflows span endpoints. One user prompt can trigger a vector search,
three LLM calls with different models, a tool-use API call, and a final
summarisation. Rate limiting each endpoint independently means the workflow
fails unpredictably whenever one link gets throttled, even when overall
consumption is reasonable.

## Three Agent Traffic Patterns, Three Different Strategies

Before you touch any policy config, figure out which shape of traffic you're
dealing with.

### Conversational Agents

Request-response exchanges with variable gaps. A chatbot sends a prompt, waits,
processes the reply, maybe follows up. Bursty within a session, idle between
sessions. Short bursts look like abuse to a fixed-window limiter, but hourly
volume is moderate.

Use sliding-window limits with generous per-minute allowances and tighter
per-hour caps. A sliding window re-evaluates the count over the last N seconds
on every request, so a burst doesn't collide with an arbitrary bucket boundary.
Zuplo's rate limiter slides by default. Group by API key identity, not IP:
conversational agents run behind shared cloud-provider egress ranges that pool
many customers behind one address.
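That layering can be sketched as two instances of the standard rate limiting policy on the same route. The numbers and policy names here are illustrative, not recommendations:

```json
[
  {
    "name": "chat-burst-allowance",
    "policyType": "rate-limit-inbound",
    "handler": {
      "export": "RateLimitInboundPolicy",
      "module": "$import(@zuplo/runtime)",
      "options": {
        "rateLimitBy": "user",
        "requestsAllowed": 120,
        "timeWindowMinutes": 1
      }
    }
  },
  {
    "name": "chat-hourly-cap",
    "policyType": "rate-limit-inbound",
    "handler": {
      "export": "RateLimitInboundPolicy",
      "module": "$import(@zuplo/runtime)",
      "options": {
        "rateLimitBy": "user",
        "requestsAllowed": 1000,
        "timeWindowMinutes": 60
      }
    }
  }
]
```

The generous per-minute window absorbs in-session bursts; the hourly cap bounds total session volume. `rateLimitBy: "user"` buckets by the authenticated consumer, not the shared egress IP.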

### Batch Processing Agents

Sustained, high-volume streams. A data pipeline processes thousands of records
sequentially at a steady rate. Predictable but relentless. These agents sit
politely under your per-minute limit forever, while burning a year of resources
in a day.

Layer short-term rate limits with longer-term quotas tied to the consumer's
plan, and combine request-count limits with token-based rate limiting so you
count resource consumption, not just call volume.
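One way to sketch that combination is the Complex Rate Limiting policy covered later in this post, with one counter per dimension. The numbers are illustrative, and the `tokens` counter only becomes meaningful once an outbound policy reports per-request token increments, as shown in the token-budget section below:

```json
{
  "name": "batch-agent-limits",
  "policyType": "complex-rate-limit-inbound",
  "handler": {
    "export": "ComplexRateLimitInboundPolicy",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "rateLimitBy": "user",
      "timeWindowMinutes": 60,
      "limits": {
        "requests": 3000,
        "tokens": 200000
      }
    }
  }
}
```

A batch agent that stays politely under the request counter still hits the token ceiling once its cumulative consumption outruns its plan.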

### Autonomous Workflow Agents

Multi-step workflows where the agent decides which APIs to call based on
previous responses. Volume and pattern are non-deterministic: three calls or
thirty, depending on what it finds. The hardest to rate limit, because a static
ceiling that works for one task is too tight for another.

Use per-workflow rate limits with custom grouping keys that track consumption
per task or session, not just per consumer. Add circuit breakers that detect
anomalous patterns (retry loops, stuck agents) and halt traffic before it
spirals.
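Here is a minimal sketch of per-workflow grouping with the Zuplo-specific types stripped out, so the keying logic stands alone. The workflow id would come from something like an `x-workflow-id` request header — a convention your agents would have to adopt, not a built-in — and both allowances are illustrative:

```typescript
interface WorkflowLimit {
  key: string;
  requestsAllowed: number;
  timeWindowMinutes: number;
}

// Derive a rate-limit bucket per workflow rather than per consumer,
// so one runaway task can't starve a consumer's other workflows.
export function workflowLimit(
  consumer: string,
  workflowId: string | null,
): WorkflowLimit {
  if (workflowId) {
    // Each workflow gets its own per-minute bucket.
    return {
      key: `${consumer}:wf:${workflowId}`,
      requestsAllowed: 60,
      timeWindowMinutes: 1,
    };
  }
  // No workflow id: fall back to a tighter consumer-wide bucket.
  return { key: consumer, requestsAllowed: 20, timeWindowMinutes: 1 };
}
```

Inside a Zuplo custom rate limit function, `consumer` would be `request.user.sub` and the id would be read off `request.headers`.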

## Identity-Aware Limits With API Key Metadata

Every strategy above rests on the same foundation: the gateway needs to know
which agent is calling before it can apply the right limit. Bucketing by IP is
already a bad idea for human developers. For agents, it's unworkable.

Zuplo's [API key management](https://zuplo.com/docs/articles/api-key-management)
lets you attach arbitrary metadata to each consumer. Store what the rate limiter
needs to decide:

![Zuplo Developer Portal Change consumer dialog showing a literature-review-agent consumer with JSON metadata including agentType, tier, monthlyTokenBudget, maxRequestsPerMinute, and model fields](/blog-images/rate-limit-ai-agents-beyond-request-counts/api-key-metadata.png)

```json
{
  "name": "acme-research-agent",
  "metadata": {
    "agentType": "autonomous",
    "tier": "enterprise",
    "monthlyTokenBudget": 5000000,
    "maxRequestsPerMinute": 500,
    "model": "gpt-4o"
  }
}
```

When a request arrives with a valid key, the
[API Key Authentication policy](https://zuplo.com/docs/articles/api-key-authentication)
populates `request.user.sub` with the consumer name and `request.user.data` with
the metadata object. Every downstream policy can read those fields.

Now drive the limit from that metadata. Zuplo's
[Rate Limiting policy](https://zuplo.com/docs/policies/rate-limit-inbound)
supports a `rateLimitBy: "function"` mode where a TypeScript function returns
the grouping key and per-request limit overrides:

```typescript
import {
  CustomRateLimitDetails,
  ZuploContext,
  ZuploRequest,
} from "@zuplo/runtime";

export function agentRateLimit(
  request: ZuploRequest,
  context: ZuploContext,
  policyName: string,
): CustomRateLimitDetails {
  const sub = request.user?.sub ?? "anonymous";
  const agentType = request.user?.data?.agentType ?? "unknown";
  const tier = request.user?.data?.tier ?? "free";

  if (agentType === "autonomous") {
    const limits: Record<string, number> = {
      enterprise: 500,
      pro: 200,
      free: 30,
    };
    return {
      // Own bucket for autonomous traffic.
      key: `${sub}-autonomous`,
      // Unknown tier, treat as free.
      requestsAllowed: limits[tier] ?? 30,
      timeWindowMinutes: 1,
    };
  }

  if (agentType === "batch") {
    return {
      key: `${sub}-batch`,
      requestsAllowed: tier === "enterprise" ? 300 : 60,
      timeWindowMinutes: 1,
    };
  }

  return {
    key: sub,
    requestsAllowed: tier === "enterprise" ? 100 : 20,
    timeWindowMinutes: 1,
  };
}
```

Wire it into the policy:

```json
{
  "name": "rate-limit-inbound-policy",
  "policyType": "rate-limit-inbound",
  "handler": {
    "export": "RateLimitInboundPolicy",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "rateLimitBy": "function",
      "requestsAllowed": 20,
      "timeWindowMinutes": 1,
      "identifier": {
        "module": "$import(./modules/agent-rate-limit)",
        "export": "agentRateLimit"
      }
    }
  }
}
```

The `requestsAllowed` and `timeWindowMinutes` in policy options are defaults;
the function overrides them per request. Onboarding a new agent consumer is a
metadata update on the API key, not a gateway redeploy.

## Token Budgets for What Agents Actually Consume

Request counts tell you how often an agent calls your API. For anything that
proxies an LLM or runs variable-cost work, that's the wrong measurement. Two
agents sending the same number of requests can differ by orders of magnitude in
what they cost you.

Zuplo's
[Complex Rate Limiting policy](https://zuplo.com/docs/policies/complex-rate-limit-inbound)
supports multiple named counters with dynamic increments: the primitive you want
for token budgets. Instead of incrementing by one per request, increment by the
token count the upstream LLM reports (the example below reads OpenAI-style
`usage.total_tokens`; adapt to your provider's shape):

```json
{
  "name": "complex-rate-limit-inbound-policy",
  "policyType": "complex-rate-limit-inbound",
  "handler": {
    "export": "ComplexRateLimitInboundPolicy",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "rateLimitBy": "user",
      "timeWindowMinutes": 60,
      "limits": {
        "tokens": 100000
      }
    }
  }
}
```

Pair it with an outbound policy on the same route that reads token usage from
the upstream response and calls `setIncrements` before the response is
finalised. The inbound policy enforces the limit on the next request: the
current response passes through, and its token cost is charged to the budget for
subsequent calls.

```typescript
import {
  ComplexRateLimitInboundPolicy,
  ZuploContext,
  ZuploRequest,
} from "@zuplo/runtime";

export default async function trackTokens(
  response: Response,
  request: ZuploRequest,
  context: ZuploContext,
  options: never,
  policyName: string,
) {
  // Bail on anything that isn't JSON: streaming bodies, errors, empty 204s.
  const contentType = response.headers.get("content-type") ?? "";
  if (!response.ok || !contentType.includes("application/json")) {
    return response;
  }

  // Clone so the original response still streams to the client.
  const body = await response.clone().json();
  // Missing usage still costs something, never free.
  const totalTokens = body?.usage?.total_tokens ?? 1;

  ComplexRateLimitInboundPolicy.setIncrements(context, {
    tokens: totalTokens,
  });

  return response;
}
```

Now an agent sending a few large prompts totalling 50,000 tokens hits the budget
just as fast as one sending hundreds of tiny requests to the same total. The
limiter tracks what actually costs you.

The guard at the top of the policy already skips error responses and streaming
bodies, since neither carries a parseable usage block. What you will need to
adapt is the field path: Anthropic, for example, reports `usage.input_tokens`
and `usage.output_tokens` rather than a single total.

One important distinction: the pattern above is for the case where agents are
calling your own API, and any LLM call is one step inside your handler. If the
API you're protecting is itself an LLM proxy, forwarding directly to OpenAI,
Anthropic, or Gemini, Zuplo's
[AI Gateway](https://zuplo.com/docs/ai-gateway/introduction) handles token
budgeting, hierarchical team and app limits, semantic caching, and provider
switching natively, without the custom outbound policy.

<CalloutTip variant="tip">
  You want both layers on the same route. A burst of small requests trips the
  request-count limit. A few massive prompts trip the token budget. Neither
  alone catches both failure modes.
</CalloutTip>
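Wiring both layers onto one route looks like this fragment of a route config. The path is illustrative, and `api-key-inbound` and `track-tokens` are placeholders for whatever you named your authentication policy and the outbound token-tracking module above:

```json
{
  "/v1/chat/completions": {
    "post": {
      "x-zuplo-route": {
        "policies": {
          "inbound": [
            "api-key-inbound",
            "rate-limit-inbound-policy",
            "complex-rate-limit-inbound-policy"
          ],
          "outbound": ["track-tokens"]
        }
      }
    }
  }
}
```

Authentication runs first so both limiters can read `request.user`; the outbound policy reports token usage back to the complex limiter after the upstream responds.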

<CalloutDoc
  title="Complex Rate Limiting Policy Reference"
  description="Full reference for named counters, dynamic increments, and setIncrements."
  href="https://zuplo.com/docs/policies/complex-rate-limit-inbound"
  icon="book"
/>

## Tiered Access and Usage-Based Pricing

Agent traffic maps neatly onto tiered pricing. A developer on a free tier has
different needs than an enterprise running production autonomous agents. If
limits are already driven by API key metadata, tiers are cheap to add.

Store the plan and entitlements on the key:

```json
{
  "name": "enterprise-workflow-agent",
  "metadata": {
    "plan": "enterprise",
    "requestsPerMinute": 500,
    "tokensPerHour": 500000,
    "tokensPerMonth": 10000000
  }
}
```

Your custom rate limiting function reads `request.user.data.requestsPerMinute`
directly. New tiers or custom limits for a specific customer are a metadata
update, not a code change.
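Stripped to the keying logic, that read is a sketch like the following — the metadata shape and the fallback default are illustrative, not part of the Zuplo API:

```typescript
interface PlanMetadata {
  plan?: string;
  requestsPerMinute?: number;
}

// Pull the per-minute entitlement straight off the key's metadata,
// with a conservative default for keys that predate the field.
export function planLimit(sub: string, data?: PlanMetadata) {
  return {
    key: sub,
    requestsAllowed: data?.requestsPerMinute ?? 20,
    timeWindowMinutes: 1,
  };
}
```

Raising a customer's limit is then an edit to their key's metadata; the function picks it up on the next request.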

For APIs where agents are a revenue source,
[API Monetization](https://zuplo.com/docs/articles/monetization) connects the
rate limiter to billing. Define meters that count API calls, tokens, or any
custom unit, attach them to subscription plans with included allowances and
overage pricing, and Zuplo handles enforcement and Stripe billing.

A typical agent monetisation setup:

- **Developer plan**: 10,000 requests/month included, $0.001 per overage
- **Pro plan**: 100,000 requests/month included, $0.0005 per overage
- **Enterprise plan**: Custom allowances with volume discounts

The `monetization-inbound-policy` validates subscriptions and tracks usage
against each consumer's plan in real time. It exposes entitlement data (current
usage, remaining allowance) on every request, which your pipeline uses to reject
hard-limit consumers once exhausted or to let soft-limit usage through for
Stripe to bill at period close. Rate limiting becomes part of the pricing
surface, not just a defensive measure.

## Circuit Breakers for Runaway Agents

The worst agent traffic pattern isn't high volume, it's a loop. An agent stuck
retrying the same failed request hundreds of times will exhaust its rate limits,
inflate your costs, and degrade service for everyone else. A per-minute limiter
catches it eventually, but not before the damage is done.

This is a different job than the
[classical circuit breaker](https://zuplo.com/blog/how-to-implement-circuit-breaker-at-the-api-gateway)
that trips on failing downstreams. Here the signal is an agent repeatedly
hitting the same path, but the response is the same: stop serving the pattern
before it compounds. Zuplo's
[programmable gateway](https://zuplo.com/docs/articles/custom-code-patterns)
lets you write a custom inbound policy that detects the pattern in real time. If
the same signature repeats past a threshold in a short window, short-circuit
before it hits your backend:

```typescript
import {
  ZuploContext,
  ZuploRequest,
  HttpProblems,
  ZoneCache,
} from "@zuplo/runtime";

interface CircuitBreakerOptions {
  maxRepeats: number;
  windowSeconds: number;
}

export default async function circuitBreaker(
  request: ZuploRequest,
  context: ZuploContext,
  options: CircuitBreakerOptions,
  policyName: string,
): Promise<ZuploRequest | Response> {
  const consumer = request.user?.sub ?? "anonymous";
  // Method plus path only, so a retry loop with shifting params still matches.
  const requestSignature = `${request.method}:${new URL(request.url).pathname}`;
  // One counter per consumer, so agents don't trip each other's breakers.
  const cacheKey = `circuit:${consumer}:${requestSignature}`;

  const cache = new ZoneCache<number>("circuit-breaker", context);
  // Read-modify-write against the cache is best-effort, not atomic;
  // concurrent requests may undercount slightly, which is fine for
  // loop detection.
  const current = await cache.get(cacheKey);
  const count = (current ?? 0) + 1;

  if (count > options.maxRepeats) {
    context.log.warn(
      `Circuit breaker tripped for ${consumer}: ${count} repeated requests to ${requestSignature}`,
    );
    return HttpProblems.tooManyRequests(request, context, {
      detail: `Repeated request pattern detected. Please check your agent's retry logic.`,
    });
  }

  // Refresh the TTL on every hit so a sustained loop keeps tripping.
  await cache.put(cacheKey, count, options.windowSeconds);

  return request;
}
```

Legitimate traffic varies endpoint, parameters, or timing, so it passes through.
A stuck agent hitting the same path over and over trips the breaker and gets a
clear error back telling it to stop.

<CalloutTip variant="mistake">
  Put the breaker ahead of your rate limiters in the pipeline. Otherwise runaway
  agents burn quota that legitimate traffic needs.
</CalloutTip>

A working order for the layers:

1. **Circuit breaker**: catches retry loops and stuck agents
2. **Per-agent rate limit**: per-minute request caps
3. **Token budget**: hourly or daily token consumption caps
4. **Monetisation quota**: billing-period usage caps

Each layer addresses a different failure mode. Together they cover the full
spread of agent traffic without asking one counter to do everything.
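As a route's inbound policy array, that order reads top to bottom. The `agent-circuit-breaker` name is a placeholder for however you register the custom policy above; the others match the configs earlier in this post:

```json
{
  "inbound": [
    "api-key-inbound",
    "agent-circuit-breaker",
    "rate-limit-inbound-policy",
    "complex-rate-limit-inbound-policy",
    "monetization-inbound-policy"
  ]
}
```

Authentication still has to run before the breaker, since the breaker keys its counters on `request.user.sub`.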

<CalloutDoc
  title="How to Implement a Circuit Breaker at the API Gateway"
  description="Walkthrough of the classical circuit breaker pattern for failing downstreams, the complement to the agent-loop variant above."
  href="https://zuplo.com/blog/how-to-implement-circuit-breaker-at-the-api-gateway"
  icon="lightning"
/>

## The Layered Model Is the Point

AI agents aren't going away, and they aren't going to start behaving like
humans. The only rate limiting strategy that survives is layered: identity-aware
limits driven by metadata, token budgets for actual cost, billing quotas for
period caps, and circuit breakers for runaway loops. Each layer catches what the
others miss.

If you're starting from a single per-minute counter, the first move is identity.
Get API key metadata into the limiting function so the gateway knows who's
calling. Everything else composes on top of that.

<CalloutDoc
  title="Rate Limiting Policy Reference"
  description="Full reference for the rate-limit-inbound policy, including custom function mode."
  href="https://zuplo.com/docs/policies/rate-limit-inbound"
  icon="book"
/>