API Cost Protection: How Rate Limits, Quotas, and Spending Caps Prevent Billing Disasters

An AI agent retries a $1.58 API call over a thousand times per document across a weekend batch of a thousand contracts. By Monday, the bill is $1.6 million. A stolen Gemini API key racks up $82,314 in 48 hours. Both of these incidents happened in early 2026, and both share the same root cause: APIs without hard spending guardrails.

Billing alerts didn’t help. The API gateways involved approved every request because each one looked perfectly valid in isolation. The tokens were authenticated, the payloads were well-formed, and the rate limits (where they existed) measured throughput — not cost.

If you’re exposing APIs that cost real money per call — especially to AI agents or external consumers — you need more than monitoring. You need enforcement. This article breaks down the three layers of API cost protection that actually prevent billing disasters: rate limits, quotas, and hard spending caps.

The Problem: APIs Without Spending Guardrails

Two high-profile incidents in early 2026 illustrate what happens when API cost protection is treated as an afterthought.

The $1.6 Million Weekend

In March 2026, SD Times reported on an enterprise whose AI-powered contract review API cost $1.58 per document to process. The team exposed the API via MCP (Model Context Protocol) for agentic consumption. On a Friday evening, an AI agent hit a timeout and began retrying relentlessly. A single document was processed over a thousand times. Multiplied across a batch of a thousand contracts, that is roughly 1,000 × 1,000 × $1.58, and the weekend bill reached $1.6 million.

The gateway approved every single request. The token was valid, rate limits were respected, and scope was authorized. As Derric Gilling wrote in his analysis: “A token rate limit measures throughput, not waste; a slow retry loop passes every rate limit while burning money for hours.”

The $82,000 API Key Theft

In February 2026, a Mexico-based startup with three developers had their Google Gemini API key compromised. As The Register reported, the stolen key generated an $82,314.44 bill in just 48 hours — a roughly 457x spike from their typical $180 monthly spending. The attacker simply used the API as designed, generating automated requests at machine speed.

The developer described being “in a state of shock and panic,” warning that enforcing even partial payment would cause the startup to go bankrupt. Google cited their shared responsibility model, placing the burden of key security on the user.

Why Billing Alerts Aren’t Enough

Both incidents expose the same gap: billing alerts notify you after the damage is done. They are monitoring tools, not enforcement mechanisms. By the time a human reads an email notification at 2 AM on a Saturday, the costs have already accumulated.

What these scenarios needed were hard controls — mechanisms that automatically block requests when spending thresholds are exceeded, without waiting for human intervention.

Rate Limiting as a First Line of Defense

Rate limiting is the most fundamental layer of API cost protection. It caps how many requests a consumer can make within a given time window, preventing any single client from overwhelming your API — whether through malicious intent, buggy code, or agentic retry loops.

Per-Endpoint, Per-User, and Per-Key Limits

Effective rate limiting isn’t one-size-fits-all. You need different limits at different levels of granularity:

Per-endpoint limits protect expensive operations. A /generate-report endpoint that triggers LLM inference should have tighter limits than a /health check.
Per-user limits prevent any single authenticated user from monopolizing resources, regardless of how many API keys they hold.
Per-API-key limits let you enforce different tiers of access. A free-tier key might get 100 requests per hour, while an enterprise key gets 10,000.

With Zuplo’s rate limiting policy, you configure these limits declaratively. Here’s a basic per-user rate limit:

json

{
  "name": "cost-protection-rate-limit",
  "policyType": "rate-limit-inbound",
  "handler": {
    "export": "RateLimitInboundPolicy",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "rateLimitBy": "user",
      "requestsAllowed": 500,
      "timeWindowMinutes": 60
    }
  }
}

This limits each authenticated user to 500 requests per hour. When they exceed the limit, they receive a 429 Too Many Requests response with a Retry-After header.
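
On the consumer side, a 429 only helps if the client honors it. The following is a minimal sketch, not part of Zuplo’s SDK, of how a caller might turn a Retry-After header into a bounded wait. The function names, the five-attempt budget, and the one-second base delay are assumptions chosen for illustration.

```typescript
/** Parse a Retry-After value (delta-seconds or HTTP-date) into milliseconds. */
function parseRetryAfterMs(header: string | null, nowMs: number): number | null {
  if (header === null || header.trim() === "") return null;
  const value = header.trim();
  // Form 1: a non-negative integer number of seconds, e.g. "120".
  if (/^\d+$/.test(value)) return Number(value) * 1000;
  // Form 2: an HTTP-date, e.g. "Thu, 01 Jan 1970 00:00:10 GMT".
  const dateMs = Date.parse(value);
  if (isNaN(dateMs)) return null;
  return Math.max(0, dateMs - nowMs);
}

/**
 * Return the delay before the next attempt, or null when the caller should stop.
 * `attempt` is 1 for the first retry. maxAttempts and baseDelayMs are assumed defaults.
 */
function nextRetryDelayMs(
  attempt: number,
  retryAfterHeader: string | null,
  nowMs: number,
  maxAttempts = 5,
  baseDelayMs = 1000,
): number | null {
  if (attempt > maxAttempts) return null; // retry budget exhausted: give up
  const serverHint = parseRetryAfterMs(retryAfterHeader, nowMs);
  const backoff = baseDelayMs * Math.pow(2, attempt - 1); // 1s, 2s, 4s, ...
  // Never retry sooner than the server asked for.
  return serverHint === null ? backoff : Math.max(serverHint, backoff);
}
```

Returning null once the attempt budget is spent is the part that matters for cost: a client that keeps retrying politely within the rate limit is still retrying.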

Dynamic Rate Limits Based on Customer Tier

Static limits are a starting point, but real cost protection requires dynamic limits that adapt to each consumer’s plan. Zuplo supports this through the rateLimitBy: "function" option, where a custom TypeScript function determines the limit based on request context:

typescript

import {
  CustomRateLimitDetails,
  ZuploContext,
  ZuploRequest,
} from "@zuplo/runtime";

export async function rateLimitKey(
  request: ZuploRequest,
  context: ZuploContext,
  policyName: string,
): Promise<CustomRateLimitDetails> {
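  // request.user is populated by an upstream authentication policy.
  // Assumption: unauthenticated requests share a single "anonymous" bucket
  // and fall through to the most restrictive tier below.
  const key = request.user?.sub ?? "anonymous";
  const customerType = request.user?.data?.customerType;

  switch (customerType) {
    case "enterprise":
      return {
        key,
        requestsAllowed: 10000,
        timeWindowMinutes: 60,
      };
    case "pro":
      return {
        key,
        requestsAllowed: 1000,
        timeWindowMinutes: 60,
      };
    default:
      // Free tier and any unrecognized plan get the lowest limit.
      return {
        key,
        requestsAllowed: 100,
        timeWindowMinutes: 60,
      };
  }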
}

This pattern is particularly powerful for cost protection because it ties your rate limits directly to your pricing model. Enterprise customers paying for higher throughput get it; free-tier users are automatically constrained.

Strict vs. Asynchronous Enforcement

Zuplo’s rate limiting supports two enforcement modes:

Strict mode checks the rate limit synchronously before processing the request. This guarantees hard enforcement but adds a small amount of latency.
Asynchronous mode processes the request while checking the limit in the background. This minimizes latency but may allow a small number of requests through above the limit during high-concurrency bursts.

For cost protection, strict mode is almost always the right choice. The slight latency penalty is trivial compared to the cost of letting excess requests through to expensive downstream services.
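
To make the trade-off concrete, here is a toy in-memory model of the two modes. It is not Zuplo’s implementation: the class, its method names, and the explicit flush step that stands in for the background update are all assumptions, and the burst shown is a deliberately exaggerated worst case.

```typescript
// Toy model: "strict" counts a request before admitting it; "async" admits
// based on the last recorded count and records the request later.
class WindowCounter {
  private committed = 0; // count visible to the limit check
  private pending = 0;   // admitted but not yet recorded (async mode only)
  private limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  /** Strict: check and increment in one step, so the limit is never exceeded. */
  admitStrict(): boolean {
    if (this.committed >= this.limit) return false;
    this.committed += 1;
    return true;
  }

  /** Async: decide using the committed count only; the increment lands later. */
  admitAsync(): boolean {
    if (this.committed >= this.limit) return false;
    this.pending += 1;
    return true;
  }

  /** Simulates the background update catching up. */
  flush(): void {
    this.committed += this.pending;
    this.pending = 0;
  }
}

function countAdmitted(total: number, admit: () => boolean): number {
  let admitted = 0;
  for (let i = 0; i < total; i++) {
    if (admit()) admitted += 1;
  }
  return admitted;
}

const strictCounter = new WindowCounter(100);
const strictAdmitted = countAdmitted(150, () => strictCounter.admitStrict());

const laggingCounter = new WindowCounter(100);
// The whole burst of 150 arrives before any background update lands.
const asyncAdmitted = countAdmitted(150, () => laggingCounter.admitAsync());
laggingCounter.flush();
const admittedAfterFlush = countAdmitted(10, () => laggingCounter.admitAsync());
```

In this model strictAdmitted is 100, asyncAdmitted is 150, and admittedAfterFlush is 0. Real asynchronous enforcement catches up far sooner than this, but every request admitted in the gap still reaches the expensive backend.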

Quotas and Hard Spending Caps

Rate limits control the speed of consumption. Quotas control the total amount. Both are essential for cost protection, and they serve different purposes.

The Difference Between Rate Limits and Quotas

A rate limit of 100 requests per minute prevents a burst of traffic from overwhelming your API. But a consumer making 99 requests per minute, 24 hours a day, still generates 142,560 requests per day — which could translate to enormous costs if each request triggers expensive downstream processing.
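
The arithmetic is worth spelling out. This short sketch reproduces the daily request count and then prices it, using the $1.58 per-call figure from the contract review incident and an assumed one cent per call as a cheaper comparison. Costs are kept in integer cents to avoid floating-point drift.

```typescript
/** Requests per day for a client that sustains a fixed per-minute rate. */
function requestsPerDay(requestsPerMinute: number): number {
  return requestsPerMinute * 60 * 24;
}

/** Daily spend in cents, assuming every request costs the same amount. */
function dailyCostCents(requestsPerMinute: number, costPerRequestCents: number): number {
  return requestsPerDay(requestsPerMinute) * costPerRequestCents;
}

const sustainedPerDay = requestsPerDay(99);        // 142560 requests
const costAtOneCent = dailyCostCents(99, 1);       // 142560 cents = $1,425.60 per day
const costAtDocPrice = dailyCostCents(99, 158);    // 22524480 cents = $225,244.80 per day
```

A client that never once exceeds the 100-per-minute limit can still spend six figures in a day at $1.58 per call.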

Quotas set an absolute ceiling on total usage over a billing period. For example:

10,000 API calls per month on the free tier
100,000 API calls per month on the pro tier
Custom quota negotiated for enterprise contracts

When a consumer hits their quota, they’re blocked until the next billing cycle or until they upgrade their plan.

Hard Caps vs. Soft Caps

This distinction is critical for cost protection:

Hard caps block requests when the quota is exhausted. The consumer receives a 429 or 402 response and must wait or upgrade. This is the mechanism that would have capped the damage in both the $1.6M weekend and the $82K API key theft.
Soft caps allow requests to continue but flag the overage for billing later. These are appropriate for trusted enterprise customers who prefer uninterrupted service and are willing to pay for overages.

The right approach depends on the consumer. Free tiers and self-service plans should use hard caps by default. Paid tiers can use soft caps with overage billing, but only when the customer has explicitly opted in.
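
A minimal in-memory sketch of that decision follows. The type names, the QuotaTracker class, and the choice of 429 for a blocked call are illustrative assumptions rather than any gateway’s API; a real deployment would keep the counter in shared storage and reset it from the billing system.

```typescript
type CapMode = "hard" | "soft";

interface QuotaPlan {
  monthlyLimit: number; // calls included per billing period
  capMode: CapMode;     // "soft" only for customers who opted in to overages
}

interface QuotaDecision {
  allowed: boolean;
  status: 200 | 429;    // 402 is the other common choice for a hard cap
  overageTotal: number; // calls so far beyond the included limit
}

class QuotaTracker {
  private used = 0;
  private plan: QuotaPlan;

  constructor(plan: QuotaPlan) {
    this.plan = plan;
  }

  /** Record one call and decide whether it may proceed. */
  consume(): QuotaDecision {
    if (this.used < this.plan.monthlyLimit) {
      this.used += 1;
      return { allowed: true, status: 200, overageTotal: 0 };
    }
    if (this.plan.capMode === "soft") {
      // Soft cap: let the call through and count it for overage billing.
      this.used += 1;
      return { allowed: true, status: 200, overageTotal: this.used - this.plan.monthlyLimit };
    }
    // Hard cap: reject without touching the backend.
    return { allowed: false, status: 429, overageTotal: 0 };
  }

  /** Call at the start of each billing period. */
  reset(): void {
    this.used = 0;
  }
}
```

The only difference between the two modes is a single branch, which is why the default matters: a plan that falls through to the soft path by accident has no ceiling at all.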

Tying Quotas to Billing Plans

Quotas are most effective when they’re directly connected to your billing system, so that enforcement happens automatically based on each consumer’s subscription. Zuplo’s API monetization features support this pattern. You define meters (what you count), features (what customers buy), and plans (tiers with rate cards), and the gateway enforces limits in real time:

Meters count usage dimensions like requests, tokens, or bytes.
Features connect meters to your product catalog (e.g., “10,000 API calls per month”).
Enforcement happens at the gateway before your backend is called. If the consumer’s balance is insufficient, the request is rejected.

Because the gateway is the system of record for both metering and enforcement, there’s no gap between usage tracking and access control. This is fundamentally different from architectures where billing runs on a separate system and “eventually” syncs with access policies.

Circuit Breakers for AI Agent Traffic

The incidents described above highlight a new reality: AI agents are becoming primary API consumers, and they behave fundamentally differently from human-driven applications.

Why Agentic Consumers Are Different

Human-driven API clients are relatively predictable. They follow documented code paths, retry a handful of times, and give up when something breaks. AI agents break each of these assumptions:

Relentless retries. An agent that encounters a timeout doesn’t get frustrated and stop. It retries according to its programming — potentially thousands of times — because achieving the outcome is its objective.
Non-deterministic behavior. The same prompt can trigger dramatically different chains of API calls. You can’t predict what an agent will do based on what it did last time.
Identity blurring. When an AI agent acts on behalf of a user, it’s unclear who bears responsibility for the costs. The agent has its own credentials, but the user initiated the action.
Machine-speed consumption. Agents generate requests at a rate no human could match. The $82K Gemini incident demonstrated this — automated requests at machine speed converted valid authentication into five figures of charges in two days.

Implementing Cost-Aware Circuit Breaking

Traditional rate limiting measures throughput, but agentic cost protection requires tracking accumulated cost. Here are the patterns that matter:

Session-based cost tracking. Instead of just counting requests, track the cumulative cost of all requests within a session or time window. When the accumulated cost exceeds a threshold, block further requests. This is what would have caught the $1.6M retry storm — a thousand retries of a $1.58 call would have triggered a session cost limit long before Monday.
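
As a rough sketch of the idea, the class below keeps a running total per session and refuses any call that would push the session past its budget. The class name, the $50 budget, and the session identifier are assumptions; the $1.58 call price comes from the incident above.

```typescript
// Track cumulative spend per session (in cents) and block once a budget is spent.
class SessionCostBudget {
  private spentCents = new Map<string, number>();
  private budgetCents: number;

  constructor(budgetCents: number) {
    this.budgetCents = budgetCents;
  }

  /** Records the cost and returns true only if the call fits within the budget. */
  tryCharge(sessionId: string, costCents: number): boolean {
    const spent = this.spentCents.get(sessionId) ?? 0;
    if (spent + costCents > this.budgetCents) return false;
    this.spentCents.set(sessionId, spent + costCents);
    return true;
  }

  /** Total charged to a session so far. */
  totalFor(sessionId: string): number {
    return this.spentCents.get(sessionId) ?? 0;
  }
}

// A $50 session budget against a $1.58 call that an agent retries 1,000 times.
const sessionBudget = new SessionCostBudget(5000);
let allowedCalls = 0;
for (let attempt = 0; attempt < 1000; attempt++) {
  if (sessionBudget.tryCharge("agent-session-1", 158)) allowedCalls += 1;
}
```

With these numbers only 31 of the 1,000 attempts go through (31 × $1.58 = $48.98), and the remaining 969 are rejected before they reach the backend.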

Spend velocity monitoring. Flag abnormal burn rates even when absolute limits haven’t been reached. If a consumer’s hourly spend jumps by 10x compared to their baseline, that’s a signal to pause and verify — regardless of whether they’ve hit a hard cap.
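
One simple way to express this check, sketched under stated assumptions: the baseline is the mean of recent hourly spend samples, the threshold is the 10x multiplier mentioned above, and a small floor keeps a new or idle consumer from dividing by zero. The function and field names are hypothetical.

```typescript
interface VelocityCheck {
  ratio: number;     // current hourly spend divided by the baseline
  abnormal: boolean; // true when the ratio meets or exceeds the threshold
}

function checkSpendVelocity(
  recentHourlySpendCents: number[],
  currentHourSpendCents: number,
  thresholdMultiplier = 10,
  minBaselineCents = 1,
): VelocityCheck {
  const total = recentHourlySpendCents.reduce((sum, value) => sum + value, 0);
  const mean = recentHourlySpendCents.length > 0 ? total / recentHourlySpendCents.length : 0;
  // Floor the baseline so an empty or all-zero history cannot produce Infinity.
  const baseline = Math.max(mean, minBaselineCents);
  const ratio = currentHourSpendCents / baseline;
  return { ratio, abnormal: ratio >= thresholdMultiplier };
}

// Figures from the Gemini incident, assuming a 30-day month:
// $180 per month is about 25 cents per hour, and $82,314.44 over 48 hours
// is about 171,488 cents per hour.
const geminiCheck = checkSpendVelocity([25, 25, 25], 171488);
```

Here geminiCheck.abnormal is true with a ratio in the thousands, so the very first hour of abuse would have stood out against the baseline.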

Loop detection. Recognize when an agent is making repetitive, similar requests to the same endpoint. Rate limits alone won’t catch this if the agent paces its retries within the allowed throughput.
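
A minimal sketch of loop detection might fingerprint each request and count identical fingerprints inside a sliding window. Everything here is an assumption made for illustration, including the class name, the exact-match fingerprint, and the thresholds; catching merely similar requests would need some normalization of the body first.

```typescript
class LoopDetector {
  // fingerprint -> timestamps (ms) of recent identical requests.
  // A production version would also evict idle fingerprints to bound memory.
  private seen = new Map<string, number[]>();
  private maxRepeats: number;
  private windowMs: number;

  constructor(maxRepeats: number, windowMs: number) {
    this.maxRepeats = maxRepeats;
    this.windowMs = windowMs;
  }

  /** Records the request and returns true when it looks like a retry loop. */
  record(callerId: string, method: string, path: string, body: string, nowMs: number): boolean {
    const fingerprint = JSON.stringify([callerId, method, path, body]);
    const previous = this.seen.get(fingerprint) ?? [];
    // Keep only timestamps that are still inside the sliding window.
    const recent = previous.filter((t) => nowMs - t < this.windowMs);
    recent.push(nowMs);
    this.seen.set(fingerprint, recent);
    return recent.length > this.maxRepeats;
  }
}

// One identical request every 20 seconds stays far below a 100-per-minute
// rate limit, yet the fourth repeat within ten minutes trips the detector.
const loopDetector = new LoopDetector(3, 10 * 60 * 1000);
const verdicts = [0, 20000, 40000, 60000].map((t) =>
  loopDetector.record("agent-1", "POST", "/review", '{"doc":"contract-17"}', t),
);
```

The verdicts come out as false, false, false, true. The detector keys on repetition rather than speed, which is the property a throughput limit lacks.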

With Zuplo’s complex rate limiting policy (available on enterprise plans; free to try in development), you can implement multi-dimensional limits that go beyond simple request counting. For example, you can define separate limits for request count and compute cost, and set dynamic increments per request based on the actual cost of the operation:

json

{
  "name": "cost-aware-rate-limit",
  "policyType": "complex-rate-limit-inbound",
  "handler": {
    "export": "ComplexRateLimitInboundPolicy",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "rateLimitBy": "user",
      "limits": {
        "requests": 1000,
        "computeCost": 500
      },
      "timeWindowMinutes": 60
    }
  }
}

You can then set dynamic increments in a custom policy to weight expensive operations more heavily:

typescript

import {
  ComplexRateLimitInboundPolicy,
  ZuploContext,
  ZuploRequest,
} from "@zuplo/runtime";

export async function setCostIncrement(
  request: ZuploRequest,
  context: ZuploContext,
) {
  const endpoint = new URL(request.url).pathname;

  // Weight expensive endpoints higher
  if (endpoint.includes("/generate") || endpoint.includes("/analyze")) {
    ComplexRateLimitInboundPolicy.setIncrements(context, {
      requests: 1,
      computeCost: 50,
    });
  } else {
    ComplexRateLimitInboundPolicy.setIncrements(context, {
      requests: 1,
      computeCost: 1,
    });
  }

  return request;
}

This means a consumer who calls cheap endpoints uses their budget slowly, while expensive operations burn through it quickly — providing natural cost protection even without dollar-amount tracking.

Building a Layered Cost Protection Strategy

No single mechanism is sufficient. Effective API cost protection requires layering multiple controls:

Layer 1: Rate Limits (Burst Protection)

Set per-endpoint, per-user, and per-key rate limits to prevent any consumer from overwhelming your API in a short time window. Use strict enforcement for endpoints that trigger expensive downstream operations.

Layer 2: Quotas (Total Usage Control)

Define monthly or daily quotas tied to billing plans. Use hard caps for self-service tiers and soft caps with overage billing for enterprise accounts. Enforce these at the gateway so requests are rejected before reaching your backend.

Layer 3: Anomaly Detection (Behavioral Protection)

Monitor for unusual patterns — sudden spikes in usage, repetitive requests to the same endpoint, or consumption rates that deviate significantly from a consumer’s baseline. Flag these for review or automatically throttle the consumer.

Layer 4: API Key Hygiene (Credential Protection)

The $82K Gemini incident started with a stolen API key. Strong key management practices are an essential complement to spending controls:

Key rotation: Regularly rotate API keys to limit the window of exposure. Zuplo’s API key management supports self-service key rotation and revocation.
Leak detection: Zuplo partners with GitHub’s secret scanning program to automatically detect API key leaks in source code repositories.
Scoped permissions: Issue keys with the minimum permissions necessary. Don’t give a key access to expensive endpoints if the consumer only needs read access.

What Should You Do Next?

If you’re running APIs that cost money per call — whether that’s LLM inference, document processing, or any metered third-party service — here’s the minimum you should implement:

Add rate limits to every endpoint, especially expensive ones. Start with per-user limits and adjust based on traffic patterns.
Set hard quotas on self-service tiers. Free-tier consumers should never be able to generate unlimited costs.
Audit your API key security. Rotate keys regularly, enable leak detection, and scope permissions tightly.
Plan for agentic consumers. If you’re exposing APIs via MCP or any agent framework, implement session-aware cost tracking — not just request counting.

The $1.6M weekend and the $82K key theft weren’t edge cases. They’re the predictable result of APIs built without spending guardrails in an era where machines are the primary consumers. The tools to prevent these disasters exist today. The question is whether you’ll implement them before your own billing surprise arrives.

Zuplo’s programmable rate limiting, API key management, and monetization features give you the building blocks to implement comprehensive cost protection at the gateway layer. Get started with Zuplo and add rate limiting to your first endpoint in under five minutes.