Your API is serving AI agents now, whether you planned for it or not. A chatbot sending ten requests a minute looks identical to a runaway automation loop sending ten requests a minute. One pays you. The other drains your compute budget. Your rate limiter can’t tell them apart, because the only thing it counts is requests.
That’s what happens when you shape AI agent traffic with tools built for human developers. A person hits an endpoint a handful of times per minute, takes a break, hits it again. An autonomous agent can chain fifty calls in ten seconds and then go silent for an hour. Per-minute counters were never designed for that.
This piece is for you if any of the following sounds familiar:
- AI agents are a meaningful slice of your API traffic
- You've had a per-minute rate limit break a legitimate agent workflow
- You're billing or throttling by request count and the math no longer lines up with your costs
Why Fixed-Window Rate Limits Break on Agents
Standard API rate limiting counts requests in a fixed time window. Cross the line, get a 429. That model assumes requests have roughly uniform cost and arrive at a predictable cadence. Agent traffic violates both.
Bursts Look Like Abuse
An autonomous agent completing one task might chain 10 to 50 API calls in seconds: tool lookups, retrieval queries, multi-step reasoning, a final completion. Then it idles for minutes. A fixed-window limit either blocks the burst and breaks the workflow, or sets the ceiling high enough to allow it and leaves your API exposed during sustained abuse. No single ceiling handles both.
Loops Don’t Rest
Humans take breaks. Agents don’t. A batch agent can hold a steady cadence under your per-minute limit for twenty-four hours straight, consuming more cumulative resources than any human would touch in a week. Without cumulative tracking, you never see it.
Single Actions Fan Out
Agent workflows span endpoints. One user prompt can trigger a vector search, three LLM calls with different models, a tool-use API call, and a final summarisation. Rate limiting each endpoint independently means the workflow fails unpredictably whenever one link gets throttled, even when overall consumption is reasonable.
Three Agent Traffic Patterns, Three Different Strategies
Before you touch any policy config, figure out which shape of traffic you’re dealing with.
Conversational Agents
Request-response exchanges with variable gaps. A chatbot sends a prompt, waits, processes the reply, maybe follows up. Bursty within a session, idle between sessions. Short bursts look like abuse to a fixed-window limiter, but hourly volume is moderate.
Use sliding-window limits with generous per-minute allowances and tighter per-hour caps. A sliding window re-evaluates the count over the last N seconds on every request, so a burst doesn’t collide with an arbitrary bucket boundary. Zuplo’s rate limiter slides by default. Group by API key identity, not IP: conversational agents run behind shared cloud-provider egress ranges that pool many customers behind one address.
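As a sketch, that pairing is two stacked instances of the rate limiting policy: a generous per-minute burst allowance and a tighter hourly cap. Option names here follow my reading of the Rate Limiting policy config, and the numbers are illustrative; verify both against the policy reference:

```json
{
  "policies": [
    {
      "name": "agent-burst-limit",
      "policyType": "rate-limit-inbound",
      "handler": {
        "export": "RateLimitInboundPolicy",
        "module": "$import(@zuplo/runtime)",
        "options": {
          "rateLimitBy": "user",
          "requestsAllowed": 120,
          "timeWindowMinutes": 1
        }
      }
    },
    {
      "name": "agent-hourly-cap",
      "policyType": "rate-limit-inbound",
      "handler": {
        "export": "RateLimitInboundPolicy",
        "module": "$import(@zuplo/runtime)",
        "options": {
          "rateLimitBy": "user",
          "requestsAllowed": 1500,
          "timeWindowMinutes": 60
        }
      }
    }
  ]
}
```

Note the asymmetry: 120/minute would allow 7,200/hour if sustained, but the hourly cap stops a session at 1,500. Bursts pass; grinding does not.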
Batch Processing Agents
Sustained, high-volume streams. A data pipeline processes thousands of records sequentially at a steady rate. Predictable but relentless. These agents sit politely under your per-minute limit forever, while burning a year of resources in a day.
Layer short-term rate limits with longer-term quotas tied to the consumer’s plan, and combine request-count limits with token-based rate limiting so you count resource consumption, not just call volume.
Autonomous Workflow Agents
Multi-step workflows where the agent decides which APIs to call based on previous responses. Volume and pattern are non-deterministic: three calls or thirty, depending on what it finds. The hardest to rate limit, because a static ceiling that works for one task is too tight for another.
Use per-workflow rate limits with custom grouping keys that track consumption per task or session, not just per consumer. Add circuit breakers that detect anomalous patterns (retry loops, stuck agents) and halt traffic before it spirals.
Identity-Aware Limits With API Key Metadata
Every strategy above rests on the same foundation: the gateway needs to know which agent is calling before it can apply the right limit. Bucketing by IP is already a bad idea for human developers. For agents, it’s unworkable.
Zuplo’s API key management lets you attach arbitrary metadata to each consumer. Store what the rate limiter needs to decide:
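Something like the following, where every field name is yours to define (these are illustrative, not a required schema):

```json
{
  "agentType": "conversational",
  "tier": "pro",
  "requestsPerMinute": 300,
  "tokensPerDay": 2000000
}
```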

When a request arrives with a valid key, the API Key Authentication policy populates request.user.sub with the consumer name and request.user.data with the metadata object. Every downstream policy can read those fields.
Now drive the limit from that metadata. Zuplo’s Rate Limiting policy supports a rateLimitBy: "function" mode where a TypeScript function returns the grouping key and per-request limit overrides:
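A sketch of such a function. The stand-in interfaces below exist so the sketch is self-contained; in a real Zuplo module you would import ZuploRequest and ZuploContext from "@zuplo/runtime" instead. The metadata field names (tier, requestsPerMinute) are assumptions matching whatever you stored on the key:

```typescript
// Minimal stand-ins for the Zuplo runtime types, for illustration only.
interface AgentRequest {
  user?: {
    sub: string;
    data: { tier?: string; requestsPerMinute?: number };
  };
}

interface RateLimitDetails {
  key: string;
  requestsAllowed?: number;
  timeWindowMinutes?: number;
}

export function agentRateLimitKey(request: AgentRequest): RateLimitDetails {
  const data = request.user?.data ?? {};
  return {
    // Group by consumer identity, never by IP.
    key: request.user?.sub ?? "anonymous",
    // Per-consumer override from key metadata; falls back to a default.
    requestsAllowed: data.requestsPerMinute ?? 60,
    timeWindowMinutes: 1,
  };
}
```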
Wire it into the policy:
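A sketch of the wiring, assuming the custom function is exported as agentRateLimitKey from a module at ./modules/agent-rate-limit (both names illustrative); the identifier option shape should be checked against the Rate Limiting policy reference:

```json
{
  "name": "agent-rate-limit",
  "policyType": "rate-limit-inbound",
  "handler": {
    "export": "RateLimitInboundPolicy",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "rateLimitBy": "function",
      "requestsAllowed": 60,
      "timeWindowMinutes": 1,
      "identifier": {
        "module": "$import(./modules/agent-rate-limit)",
        "export": "agentRateLimitKey"
      }
    }
  }
}
```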
The requestsAllowed and timeWindowMinutes in policy options are defaults; the function overrides them per request. Onboarding a new agent consumer is a metadata update on the API key, not a gateway redeploy.
Token Budgets for What Agents Actually Consume
Request counts tell you how often an agent calls your API. For anything that proxies an LLM or runs variable-cost work, that’s the wrong measurement. Two agents sending the same number of requests can differ by orders of magnitude in what they cost you.
Zuplo’s Complex Rate Limiting policy supports multiple named counters with dynamic increments: the primitive you want for token budgets. Instead of incrementing by one per request, increment by the token count the upstream LLM reports (the example below reads OpenAI-style usage.total_tokens; adapt to your provider’s shape):
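A minimal sketch of that increment logic, assuming an OpenAI-style response body. The counter name ("tokens") and the commented setIncrements wiring are assumptions, not the policy's verbatim API; check the Complex Rate Limiting policy reference for the exact call shape:

```typescript
interface NamedIncrement {
  name: string;
  amount: number;
}

// Pure helper: pull the total token count out of a parsed response body.
// Returns a zero increment when no usage block is present (e.g. an error body).
export function tokenIncrement(body: unknown): NamedIncrement {
  const usage = (body as { usage?: { total_tokens?: number } })?.usage;
  return { name: "tokens", amount: usage?.total_tokens ?? 0 };
}

// In the outbound policy you would clone the response, parse it, and report
// the increment before the response is finalised — roughly (shape assumed):
//
//   const body = await response.clone().json();
//   const inc = tokenIncrement(body);
//   RateLimitInboundPolicy.setIncrements(context, [inc]);
//   return response;
```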
Pair it with an outbound policy on the same route that reads token usage from the upstream response and calls setIncrements before the response is finalised. The inbound policy enforces the limit on the next request: the current response passes through, and its token cost is charged to the budget for subsequent calls.
Now an agent sending a few large prompts totalling 50,000 tokens hits the budget just as fast as one sending hundreds of tiny requests to the same total. The limiter tracks what actually costs you.
In production you’ll want to skip error responses and streaming bodies (they don’t carry a parseable usage block), and adapt the field path to your provider: Anthropic returns usage.input_tokens and usage.output_tokens, for example.
One important distinction: the pattern above is for the case where agents are calling your own API, and any LLM call is one step inside your handler. If the API you’re protecting is itself an LLM proxy, forwarding directly to OpenAI, Anthropic, or Gemini, Zuplo’s AI Gateway handles token budgeting, hierarchical team and app limits, semantic caching, and provider switching natively, without the custom outbound policy.
Pro tip:
You want both layers on the same route. A burst of small requests trips the request-count limit. A few massive prompts trip the token budget. Neither alone catches both failure modes.
Complex Rate Limiting Policy Reference
Full reference for named counters, dynamic increments, and setIncrements.
Tiered Access and Usage-Based Pricing
Agent traffic maps neatly onto tiered pricing. A developer on a free tier has different needs than an enterprise running production autonomous agents. If limits are already driven by API key metadata, tiers are cheap to add.
Store the plan and entitlements on the key:
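For example (field names illustrative; the only contract is that your limiting function reads the same names you store):

```json
{
  "plan": "pro",
  "requestsPerMinute": 300,
  "tokensPerDay": 5000000,
  "hardLimit": false
}
```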
Your custom rate limiting function reads request.user.data.requestsPerMinute directly. New tiers or custom limits for a specific customer are a metadata update, not a code change.
For APIs where agents are a revenue source, API Monetization connects the rate limiter to billing. Define meters that count API calls, tokens, or any custom unit, attach them to subscription plans with included allowances and overage pricing, and Zuplo handles enforcement and Stripe billing.
A typical agent monetisation setup:
- Developer plan: 10,000 requests/month included, $0.001 per overage
- Pro plan: 100,000 requests/month included, $0.0005 per overage
- Enterprise plan: Custom allowances with volume discounts
The monetization-inbound-policy validates subscriptions and tracks usage against each consumer’s plan in real time. It exposes entitlement data (current usage, remaining allowance) on every request, which your pipeline uses to reject hard-limit consumers once exhausted or to let soft-limit usage through for Stripe to bill at period close. Rate limiting becomes part of the pricing surface, not just a defensive measure.
Circuit Breakers for Runaway Agents
The worst agent traffic pattern isn’t high volume, it’s a loop. An agent stuck retrying the same failed request hundreds of times will exhaust its rate limits, inflate your costs, and degrade service for everyone else. A per-minute limiter catches it eventually, but not before the damage is done.
This is a different job than the classical circuit breaker that trips on failing downstreams. Here the signal is an agent repeatedly hitting the same path, but the response is the same: stop serving the pattern before it compounds. Zuplo’s programmable gateway lets you write a custom inbound policy that detects the pattern in real time. If the same signature repeats past a threshold in a short window, short-circuit before it hits your backend:
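A sketch of the detection logic: a sliding count per request signature (consumer plus method plus path), tripping when the same signature repeats past a threshold inside a short window. In Zuplo this would live in a custom inbound policy module; the thresholds are illustrative:

```typescript
class LoopBreaker {
  private hits = new Map<string, number[]>();

  constructor(
    private threshold = 20, // identical requests allowed...
    private windowMs = 10_000, // ...within this window
  ) {}

  // Record one occurrence of a signature; returns true when the
  // signature has exceeded the threshold and should be short-circuited.
  trip(signature: string, now = Date.now()): boolean {
    const recent = (this.hits.get(signature) ?? []).filter(
      (t) => now - t < this.windowMs,
    );
    recent.push(now);
    this.hits.set(signature, recent);
    return recent.length > this.threshold;
  }
}

// In the policy: build the signature from the authenticated consumer and the
// request shape, then return an error response before hitting the backend, e.g.
//   const sig = `${request.user?.sub}:${request.method}:${new URL(request.url).pathname}`;
//   if (breaker.trip(sig)) {
//     return new Response("Repeated identical requests detected. Stop retrying.", { status: 429 });
//   }
```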
Legitimate traffic varies endpoint, parameters, or timing, so it passes through. A stuck agent hitting the same path over and over trips the breaker and gets a clear error back telling it to stop.
Common mistake:
Placing the breaker after your rate limiters in the pipeline. Put it ahead of them; otherwise runaway agents burn quota that legitimate traffic needs.
A working order for the layers:
- Circuit breaker: catches retry loops and stuck agents
- Per-agent rate limit: per-minute request caps
- Token budget: hourly or daily token consumption caps
- Monetisation quota: billing-period usage caps
Each layer addresses a different failure mode. Together they cover the full spread of agent traffic without asking one counter to do everything.
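In a Zuplo route config, that ordering is just the inbound policy array. The policy names below are illustrative; the point is the sequence:

```json
{
  "policies": {
    "inbound": [
      "agent-loop-breaker",
      "per-agent-rate-limit",
      "token-budget-limit",
      "monetization-inbound-policy"
    ]
  }
}
```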
How to Implement a Circuit Breaker at the API Gateway
Walkthrough of the classical circuit breaker pattern for failing downstreams, the complement to the agent-loop variant above.
The Layered Model Is the Point
AI agents aren’t going away, and they aren’t going to start behaving like humans. The only rate limiting strategy that survives is layered: identity-aware limits driven by metadata, token budgets for actual cost, billing quotas for period caps, and circuit breakers for runaway loops. Each layer catches what the others miss.
If you’re starting from a single per-minute counter, the first move is identity. Get API key metadata into the limiting function so the gateway knows who’s calling. Everything else composes on top of that.
Rate Limiting Policy Reference
Full reference for the rate-limit-inbound policy, including custom function mode.
