---
title: "API Cost Protection: How Rate Limits, Quotas, and Spending Caps Prevent Billing Disasters"
description: "Learn how rate limits, quotas, and hard spending caps protect against runaway API costs — with real incidents and implementation patterns."
canonicalUrl: "https://zuplo.com/learning-center/api-cost-protection-rate-limits-quotas-spending-caps"
pageType: "learning-center"
authors: "nate"
tags: "API Rate Limiting, API Best Practices"
image: "https://zuplo.com/og?text=API%20Cost%20Protection%3A%20Rate%20Limits%2C%20Quotas%2C%20and%20Spending%20Caps"
---
An AI agent stuck in a retry loop reprocesses each contract in a
thousand-document batch over a thousand times, at $1.58 per call, across a
weekend. By Monday, the bill is $1.6 million. A stolen Gemini API key racks up
$82,314 in 48 hours. Both of these incidents happened in early 2026, and both
share the same root cause: APIs without hard spending guardrails.

Billing alerts didn't help. The API gateways involved approved every request
because each one looked perfectly valid in isolation. The tokens were
authenticated, the payloads were well-formed, and the rate limits (where they
existed) measured throughput — not cost.

If you're exposing APIs that cost real money per call — especially to AI agents
or external consumers — you need more than monitoring. You need enforcement.
This article breaks down the three layers of API cost protection that actually
prevent billing disasters: rate limits, quotas, and hard spending caps.

## The Problem: APIs Without Spending Guardrails

Two high-profile incidents in early 2026 illustrate what happens when API cost
protection is treated as an afterthought.

### The $1.6 Million Weekend

In March 2026,
[SD Times reported](https://sdtimes.com/api/the-1-6-million-weekend-why-simple-api-gateways-fail-in-the-agentic-era/)
on an enterprise whose AI-powered contract review API cost $1.58 per document to
process. The team exposed the API via MCP (Model Context Protocol) for agentic
consumption. On a Friday evening, an AI agent hit a timeout and began retrying
relentlessly. A single document was processed over a thousand times. Multiplied
across a batch of a thousand contracts, the weekend bill reached $1.6 million.

The gateway approved every single request. The token was valid, rate limits were
respected, and scope was authorized. As Derric Gilling wrote in his analysis: "A
token rate limit measures throughput, not waste; a slow retry loop passes every
rate limit while burning money for hours."

### The $82,000 API Key Theft

In February 2026, a Mexico-based startup with three developers had their Google
Gemini API key compromised. As
[The Register reported](https://www.theregister.com/2026/03/03/gemini_api_key_82314_dollar_charge/),
the stolen key generated an $82,314.44 bill in just 48 hours — a roughly 457x
spike from their typical $180 monthly spending. The attacker simply used the API
as designed, generating automated requests at machine speed.

The developer
[described being](https://securityboulevard.com/2026/03/when-a-stolen-ai-api-key-becomes-an-82000-problem/)
"in a state of shock and panic," warning that enforcing even partial payment
would cause the startup to go bankrupt. Google cited their shared responsibility
model, placing the burden of key security on the user.

### Why Billing Alerts Aren't Enough

Both incidents expose the same gap: billing alerts notify you _after_ the damage
is done. They are monitoring tools, not enforcement mechanisms. By the time a
human reads an email notification at 2 AM on a Saturday, the costs have already
accumulated.

What these scenarios needed were hard controls — mechanisms that automatically
block requests when spending thresholds are exceeded, without waiting for human
intervention.

## Rate Limiting as a First Line of Defense

[Rate limiting](/learning-center/api-rate-limiting) is the most fundamental
layer of API cost protection. It caps how many requests a consumer can make
within a given time window, preventing any single client from overwhelming your
API — whether through malicious intent, buggy code, or agentic retry loops.

### Per-Endpoint, Per-User, and Per-Key Limits

Effective rate limiting isn't one-size-fits-all. You need different limits at
different levels of granularity:

- **Per-endpoint limits** protect expensive operations. A `/generate-report`
  endpoint that triggers LLM inference should have tighter limits than a
  `/health` check.
- **Per-user limits** prevent any single authenticated user from monopolizing
  resources, regardless of how many API keys they hold.
- **Per-API-key limits** let you enforce different tiers of access. A free-tier
  key might get 100 requests per hour, while an enterprise key gets 10,000.

With Zuplo's
[rate limiting policy](https://zuplo.com/docs/policies/rate-limit-inbound), you
configure these limits declaratively. Here's a basic per-user rate limit:

```json
{
  "name": "cost-protection-rate-limit",
  "policyType": "rate-limit-inbound",
  "handler": {
    "export": "RateLimitInboundPolicy",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "rateLimitBy": "user",
      "requestsAllowed": 500,
      "timeWindowMinutes": 60
    }
  }
}
```

This limits each authenticated user to 500 requests per hour. When they exceed
the limit, they receive a `429 Too Many Requests` response with a `Retry-After`
header.
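
The check behind a response like this can be pictured as a fixed-window counter. The sketch below is an illustration only, not Zuplo's implementation: `FixedWindowLimiter`, its in-memory `Map`, and the caller-supplied clock are all assumptions made to keep the example self-contained.

```typescript
// Illustrative fixed-window limiter. All names here are hypothetical; a real
// gateway would keep these counters in a shared store, not process memory.
interface WindowState {
  windowStart: number; // epoch milliseconds when the current window opened
  count: number; // requests counted in the current window
}

interface LimitDecision {
  allowed: boolean;
  status: 200 | 429;
  retryAfterSeconds?: number; // only set when the request is rejected
}

class FixedWindowLimiter {
  private windows = new Map<string, WindowState>();
  private requestsAllowed: number;
  private windowMs: number;

  constructor(requestsAllowed: number, timeWindowMinutes: number) {
    this.requestsAllowed = requestsAllowed;
    this.windowMs = timeWindowMinutes * 60000;
  }

  // nowMs is passed in so the behavior is deterministic and easy to test.
  check(key: string, nowMs: number): LimitDecision {
    let state = this.windows.get(key);
    if (!state || nowMs - state.windowStart >= this.windowMs) {
      // First request for this key, or the previous window has expired.
      state = { windowStart: nowMs, count: 0 };
      this.windows.set(key, state);
    }
    if (state.count >= this.requestsAllowed) {
      const msLeft = state.windowStart + this.windowMs - nowMs;
      return {
        allowed: false,
        status: 429,
        retryAfterSeconds: Math.ceil(msLeft / 1000),
      };
    }
    state.count += 1;
    return { allowed: true, status: 200 };
  }
}

// Mirrors the policy above: 500 requests per user per 60 minutes.
const limiter = new FixedWindowLimiter(500, 60);
```

Each key (here, a user ID) gets its own counter, so one noisy consumer cannot use up another consumer's allowance.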

### Dynamic Rate Limits Based on Customer Tier

Static limits are a starting point, but real cost protection requires dynamic
limits that adapt to each consumer's plan. Zuplo supports this through the
`rateLimitBy: "function"` option, where a custom TypeScript function determines
the limit based on request context:

```typescript
import {
  CustomRateLimitDetails,
  ZuploContext,
  ZuploRequest,
} from "@zuplo/runtime";

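export async function rateLimitKey(
  request: ZuploRequest,
  context: ZuploContext,
  policyName: string,
): Promise<CustomRateLimitDetails> {
  // request.user is only populated once an authentication policy has run, so
  // avoid dereferencing it directly. Unauthenticated callers share one bucket.
  const key = request.user?.sub ?? "anonymous";
  const customerType = request.user?.data?.customerType;

  // Requests allowed per hour, by plan.
  switch (customerType) {
    case "enterprise":
      return { key, requestsAllowed: 10000, timeWindowMinutes: 60 };
    case "pro":
      return { key, requestsAllowed: 1000, timeWindowMinutes: 60 };
    default:
      // Unknown or missing plans get the most restrictive (free-tier) limit.
      return { key, requestsAllowed: 100, timeWindowMinutes: 60 };
  }
}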
```

This pattern is particularly powerful for cost protection because it ties your
rate limits directly to your pricing model. Enterprise customers paying for
higher throughput get it; free-tier users are automatically constrained.

### Strict vs. Asynchronous Enforcement

Zuplo's rate limiting supports two enforcement modes:

- **Strict mode** checks the rate limit synchronously before processing the
  request. This guarantees hard enforcement but adds a small amount of latency.
- **Asynchronous mode** processes the request while checking the limit in the
  background. This minimizes latency but may allow a small number of requests
  through above the limit during high-concurrency bursts.

For cost protection, **strict mode is almost always the right choice**. The
slight latency penalty is trivial compared to the cost of letting excess
requests through to expensive downstream services.

## Quotas and Hard Spending Caps

Rate limits control the _speed_ of consumption. Quotas control the _total
amount_. Both are essential for cost protection, and they serve different
purposes.

### The Difference Between Rate Limits and Quotas

A rate limit of 100 requests per minute prevents a burst of traffic from
overwhelming your API. But a consumer making 99 requests per minute, 24 hours a
day, still generates 142,560 requests per day — which could translate to
enormous costs if each request triggers expensive downstream processing.

Quotas set an absolute ceiling on total usage over a billing period. For
example:

- **10,000 API calls per month** on the free tier
- **100,000 API calls per month** on the pro tier
- **Custom quota** negotiated for enterprise contracts

When a consumer hits their quota, they're blocked until the next billing cycle
or until they upgrade their plan.
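
As a rough illustration of that behavior, the sketch below keeps a per-consumer counter for each billing period and rejects calls once the plan's quota is used up. The `MonthlyQuota` class, the `MONTHLY_QUOTAS` table, and the choice of a UTC calendar month as the billing period are assumptions for the example, not part of any particular gateway.

```typescript
// Hypothetical plan table mirroring the example tiers above.
const MONTHLY_QUOTAS: Record<string, number> = {
  free: 10000,
  pro: 100000,
};

class MonthlyQuota {
  private used = new Map<string, number>();

  // Usage is keyed by consumer and UTC calendar month, so counts start from
  // zero in each new billing period (assumed here to be a calendar month).
  private periodKey(consumerId: string, now: Date): string {
    return `${consumerId}:${now.getUTCFullYear()}-${now.getUTCMonth() + 1}`;
  }

  // Returns true and records the call if quota remains, otherwise false.
  tryConsume(consumerId: string, plan: string, now: Date): boolean {
    const quota = MONTHLY_QUOTAS[plan] ?? 0; // unknown plans get no quota
    const key = this.periodKey(consumerId, now);
    const used = this.used.get(key) ?? 0;
    if (used >= quota) return false;
    this.used.set(key, used + 1);
    return true;
  }
}

// How long a steady request rate takes to exhaust a quota, in hours.
function hoursToExhaust(quota: number, requestsPerMinute: number): number {
  return quota / (requestsPerMinute * 60);
}
```

At the 99 requests per minute from the earlier example, `hoursToExhaust(100000, 99)` comes to just under 17 hours. A consumer that never trips the rate limit would still be stopped by the pro-tier quota within the first day.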

### Hard Caps vs. Soft Caps

This distinction is critical for cost protection:

- **Hard caps** block requests when the quota is exhausted. The consumer
  receives a `429` or `402` response and must wait or upgrade. This is the
  mechanism that would have capped the damage in both the $1.6M weekend and the
  $82K API key theft.
- **Soft caps** allow requests to continue but flag the overage for billing
  later. These are appropriate for trusted enterprise customers who prefer
  uninterrupted service and are willing to pay for overages.

The right approach depends on the consumer. Free tiers and self-service plans
should use hard caps by default. Paid tiers can use soft caps with overage
billing, but only when the customer has explicitly opted in.
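
The decision logic described above is small enough to sketch directly. The function names, the `CapDecision` shape, and the use of `429` for the blocked case are illustrative assumptions; a real system would also persist the overage for billing.

```typescript
type CapMode = "hard" | "soft";

interface CapDecision {
  allow: boolean;
  status: 200 | 429;
  overageUnits: number; // units beyond quota to bill later (soft caps only)
}

// Hard caps are the default; soft caps require a paid plan and explicit opt-in.
function capModeFor(isPaidPlan: boolean, optedIntoOverage: boolean): CapMode {
  return isPaidPlan && optedIntoOverage ? "soft" : "hard";
}

// usedBefore is the number of units consumed before this request arrives.
function applyCap(
  usedBefore: number,
  quota: number,
  mode: CapMode,
): CapDecision {
  if (usedBefore < quota) {
    return { allow: true, status: 200, overageUnits: 0 };
  }
  if (mode === "hard") {
    return { allow: false, status: 429, overageUnits: 0 };
  }
  // Soft cap: let the request through and record how far over quota it is.
  return { allow: true, status: 200, overageUnits: usedBefore - quota + 1 };
}
```

Keeping the mode selection in one place makes the default explicit: unless a paid customer has opted in, every path through `capModeFor` ends in a hard cap.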

### Tying Quotas to Billing Plans

Quotas are most effective when they're directly connected to your billing
system, so that enforcement happens automatically based on each consumer's
subscription. Zuplo's
[API monetization](https://zuplo.com/docs/articles/monetization) features
support this pattern. You define meters (what you count), features (what
customers buy), and plans (tiers with rate cards), and the gateway enforces
limits in real time:

- **Meters** count usage dimensions like requests, tokens, or bytes.
- **Features** connect meters to your product catalog (e.g., "10,000 API calls
  per month").
- **Enforcement** happens at the gateway before your backend is called. If the
  consumer's balance is insufficient, the request is rejected.

Because the gateway is the system of record for both metering and enforcement,
there's no gap between usage tracking and access control. This is fundamentally
different from architectures where billing runs on a separate system and
"eventually" syncs with access policies.

## Circuit Breakers for AI Agent Traffic

The incidents described above highlight a new reality: AI agents are becoming
primary API consumers, and they behave fundamentally differently from
human-driven applications.

### Why Agentic Consumers Are Different

Human-driven API clients are comparatively predictable. They follow documented
code paths, retry a handful of times, and give up when something breaks. AI
agents exhibit none of these characteristics:

- **Relentless retries.** An agent that encounters a timeout doesn't get
  frustrated and stop. It retries according to its programming — potentially
  thousands of times — because achieving the outcome is its objective.
- **Non-deterministic behavior.** The same prompt can trigger dramatically
  different chains of API calls. You can't predict what an agent will do based
  on what it did last time.
- **Identity blurring.** When an AI agent acts on behalf of a user, it's unclear
  who bears responsibility for the costs. The agent has its own credentials, but
  the user initiated the action.
- **Machine-speed consumption.** Agents generate requests at a rate no human
  could match. The $82K Gemini incident demonstrated this — automated requests
  at machine speed converted valid authentication into five figures of charges
  in two days.

### Implementing Cost-Aware Circuit Breaking

Traditional rate limiting measures throughput, but agentic cost protection
requires tracking accumulated cost. Here are the patterns that matter:

**Session-based cost tracking.** Instead of just counting requests, track the
cumulative cost of all requests within a session or time window. When the
accumulated cost exceeds a threshold, block further requests. This is what would
have caught the $1.6M retry storm — a thousand retries of a $1.58 call would
have triggered a session cost limit long before Monday.
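
A minimal sketch of this idea follows, assuming costs are tracked in integer cents and that the gateway knows the price of each call before forwarding it. `SessionCostBreaker` and the $50 session budget are hypothetical values chosen for illustration.

```typescript
class SessionCostBreaker {
  private spentCents = new Map<string, number>();
  private open = new Set<string>(); // sessions whose breaker has tripped
  private maxCentsPerSession: number;

  constructor(maxCentsPerSession: number) {
    this.maxCentsPerSession = maxCentsPerSession;
  }

  // Returns true and records the cost if the call fits the session budget.
  // Once a call is refused, the session stays blocked until reset() is called.
  authorize(sessionId: string, costCents: number): boolean {
    if (this.open.has(sessionId)) return false;
    const spent = this.spentCents.get(sessionId) ?? 0;
    if (spent + costCents > this.maxCentsPerSession) {
      this.open.add(sessionId);
      return false;
    }
    this.spentCents.set(sessionId, spent + costCents);
    return true;
  }

  spent(sessionId: string): number {
    return this.spentCents.get(sessionId) ?? 0;
  }

  // Intended to be called after a human (or policy) has reviewed the session.
  reset(sessionId: string): void {
    this.open.delete(sessionId);
    this.spentCents.delete(sessionId);
  }
}

// Simulate a retry storm: 1,000 attempts at $1.58 against a $50 budget.
const breaker = new SessionCostBreaker(5000);
let allowedCalls = 0;
for (let attempt = 0; attempt < 1000; attempt++) {
  if (breaker.authorize("session-1", 158)) allowedCalls++;
}
// allowedCalls is 31 and the session has spent 4,898 cents ($48.98).
```

Under these assumed numbers, the loop is cut off after 31 calls instead of running a thousand times, and the session stays blocked until someone resets it.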

**Spend velocity monitoring.** Flag abnormal burn rates even when absolute
limits haven't been reached. If a consumer's hourly spend jumps by 10x compared
to their baseline, that's a signal to pause and verify — regardless of whether
they've hit a hard cap.
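
One simple way to express that check is to compare the current hour's spend against a multiple of the consumer's historical hourly average. The function names, the 10x default, and the minimum baseline (which stops very small accounts from triggering on trivial amounts) are assumptions for this sketch.

```typescript
// Mean of past hourly spend, in cents. Returns 0 when there is no history.
function averageHourlySpend(historyCents: number[]): number {
  if (historyCents.length === 0) return 0;
  const total = historyCents.reduce((sum, value) => sum + value, 0);
  return total / historyCents.length;
}

// Flags the current hour when spend reaches `multiplier` times the baseline.
// minBaselineCents acts as a floor so that a near-zero history does not make
// every small purchase look like a spike.
function isSpendSpike(
  currentHourCents: number,
  historyCents: number[],
  multiplier = 10,
  minBaselineCents = 100,
): boolean {
  const baseline = Math.max(
    averageHourlySpend(historyCents),
    minBaselineCents,
  );
  return currentHourCents >= baseline * multiplier;
}
```

For a sense of scale, the startup in the Gemini incident averaged roughly $0.25 per hour ($180 over a 30-day month), while the stolen key burned roughly $1,700 per hour. A check like `isSpendSpike(171500, [25, 25, 25])` would flag that within the first hour.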

**Loop detection.** Recognize when an agent is making repetitive, similar
requests to the same endpoint. Rate limits alone won't catch this if the agent
paces its retries within the allowed throughput.
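
A basic version of loop detection fingerprints each request and counts identical fingerprints inside a sliding window. The `LoopDetector` class below is a hypothetical sketch: it compares raw request bodies for exact equality, whereas a production system would more likely hash the body and might use fuzzier matching.

```typescript
class LoopDetector {
  private seen = new Map<string, number[]>(); // fingerprint -> timestamps (ms)
  private maxRepeats: number;
  private windowMs: number;

  constructor(maxRepeats: number, windowMs: number) {
    this.maxRepeats = maxRepeats;
    this.windowMs = windowMs;
  }

  // Records a request and returns true when the same consumer has sent an
  // identical request more than maxRepeats times within the window.
  record(
    consumerId: string,
    method: string,
    path: string,
    body: string,
    nowMs: number,
  ): boolean {
    // JSON.stringify on an array avoids collisions between adjacent fields.
    const fingerprint = JSON.stringify([consumerId, method, path, body]);
    const recent = (this.seen.get(fingerprint) ?? []).filter(
      (timestamp) => nowMs - timestamp < this.windowMs,
    );
    recent.push(nowMs);
    this.seen.set(fingerprint, recent);
    return recent.length > this.maxRepeats;
  }
}
```

With `new LoopDetector(3, 3600000)`, an agent that resubmits the same document every five minutes is flagged on its fourth attempt, even though one request per five minutes would pass almost any throughput-based limit.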

With Zuplo's
[complex rate limiting policy](https://zuplo.com/docs/policies/complex-rate-limit-inbound)
(available on enterprise plans; free to try in development), you can implement
multi-dimensional limits that go beyond simple request counting. For example,
you can define separate limits for request count and compute cost, and set
dynamic increments per request based on the actual cost of the operation:

```json
{
  "name": "cost-aware-rate-limit",
  "policyType": "complex-rate-limit-inbound",
  "handler": {
    "export": "ComplexRateLimitInboundPolicy",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "rateLimitBy": "user",
      "limits": {
        "requests": 1000,
        "computeCost": 500
      },
      "timeWindowMinutes": 60
    }
  }
}
```

You can then set dynamic increments in a custom policy to weight expensive
operations more heavily:

```typescript
import {
  ComplexRateLimitInboundPolicy,
  ZuploContext,
  ZuploRequest,
} from "@zuplo/runtime";

export async function setCostIncrement(
  request: ZuploRequest,
  context: ZuploContext,
) {
  const endpoint = new URL(request.url).pathname;

  // Weight expensive endpoints higher
  if (endpoint.includes("/generate") || endpoint.includes("/analyze")) {
    ComplexRateLimitInboundPolicy.setIncrements(context, {
      requests: 1,
      computeCost: 50,
    });
  } else {
    ComplexRateLimitInboundPolicy.setIncrements(context, {
      requests: 1,
      computeCost: 1,
    });
  }

  return request;
}
```

This means a consumer who calls cheap endpoints uses their budget slowly, while
expensive operations burn through it quickly — providing natural cost protection
even without dollar-amount tracking.
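
To see what those weights mean in practice, the arithmetic can be worked through for a consumer who makes only one kind of call. This sketch assumes a request is rejected once any counter would exceed its limit; the `callsPerWindow` helper is purely illustrative and is not part of the Zuplo API.

```typescript
interface Counters {
  requests: number;
  computeCost: number;
}

// Number of identical calls that fit in one window before any limit runs out.
function callsPerWindow(limits: Counters, increments: Counters): number {
  return Math.min(
    Math.floor(limits.requests / increments.requests),
    Math.floor(limits.computeCost / increments.computeCost),
  );
}

// Limits and increments taken from the configuration and policy above.
const limits: Counters = { requests: 1000, computeCost: 500 };
const expensiveCalls = callsPerWindow(limits, { requests: 1, computeCost: 50 });
const cheapCalls = callsPerWindow(limits, { requests: 1, computeCost: 1 });
// expensiveCalls is 10 and cheapCalls is 500 per 60-minute window.
```

In both cases the `computeCost` limit is the binding constraint: ten expensive calls or five hundred cheap ones exhaust it well before the 1,000-request limit is reached.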

## Building a Layered Cost Protection Strategy

No single mechanism is sufficient. Effective API cost protection requires
layering multiple controls:

### Layer 1: Rate Limits (Burst Protection)

Set per-endpoint, per-user, and per-key rate limits to prevent any consumer from
overwhelming your API in a short time window. Use
[strict enforcement](https://zuplo.com/features/rate-limiting) for endpoints
that trigger expensive downstream operations.

### Layer 2: Quotas (Total Usage Control)

Define monthly or daily quotas tied to billing plans. Use hard caps for
self-service tiers and soft caps with overage billing for enterprise accounts.
Enforce these at the gateway so requests are rejected before reaching your
backend.

### Layer 3: Anomaly Detection (Behavioral Protection)

Monitor for unusual patterns — sudden spikes in usage, repetitive requests to
the same endpoint, or consumption rates that deviate significantly from a
consumer's baseline. Flag these for review or automatically throttle the
consumer.

### Layer 4: API Key Hygiene (Credential Protection)

The $82K Gemini incident started with a stolen API key. Strong key management
practices are an essential complement to spending controls:

- **Key rotation**: Regularly rotate API keys to limit the window of exposure.
  Zuplo's
  [API key management](https://zuplo.com/docs/articles/api-key-management)
  supports self-service key rotation and revocation.
- **Leak detection**: Zuplo partners with GitHub's secret scanning program to
  automatically detect
  [API key leaks](https://zuplo.com/docs/articles/api-key-leak-detection) in
  source code repositories.
- **Scoped permissions**: Issue keys with the minimum permissions necessary.
  Don't give a key access to expensive endpoints if the consumer only needs read
  access.

## What Should You Do Next?

If you're running APIs that cost money per call — whether that's LLM inference,
document processing, or any metered third-party service — here's the minimum you
should implement:

1. **Add rate limits to every endpoint**, especially expensive ones. Start with
   per-user limits and
   [adjust based on traffic patterns](/learning-center/10-best-practices-for-api-rate-limiting-in-2025).
2. **Set hard quotas on self-service tiers.** Free-tier consumers should never
   be able to generate unlimited costs.
3. **Audit your API key security.** Rotate keys regularly, enable leak
   detection, and scope permissions tightly.
4. **Plan for agentic consumers.** If you're exposing APIs via MCP or any agent
   framework, implement session-aware cost tracking — not just request counting.

The $1.6M weekend and the $82K key theft weren't edge cases. They're the
predictable result of APIs built without spending guardrails in an era where
machines are the primary consumers. The tools to prevent these disasters exist
today. The question is whether you'll implement them before your own billing
surprise arrives.

Zuplo's [programmable rate limiting](https://zuplo.com/features/rate-limiting),
[API key management](https://zuplo.com/docs/articles/api-key-management), and
[monetization features](https://zuplo.com/docs/articles/monetization) give you
the building blocks to implement comprehensive cost protection at the gateway
layer. [Get started with Zuplo](https://portal.zuplo.com) and add rate limiting
to your first endpoint in under five minutes.