---
title: "How to Control AI Costs with an API Gateway"
description: "Five concrete ways to reduce AI and LLM costs using your API gateway — semantic caching, rate limiting, spend limits, model routing, and token-based billing."
canonicalUrl: "https://zuplo.com/blog/2026/02/26/control-ai-costs-api-gateway"
pageType: "blog"
date: "2026-02-26"
authors: "nate"
tags: "AI, API Best Practices"
image: "https://zuplo.com/og?text=How%20to%20Control%20AI%20Costs%20with%20an%20API%20Gateway"
---
AI costs are out of control. If you are running a GPT-4 endpoint handling 10,000
requests per day, you could be looking at $30,000 or more per month in inference
costs alone. And that number only goes up as your users grow.

The wild part? Most of that spend is preventable. Duplicate prompts, runaway
consumers, overqualified models answering simple questions — these are all
problems you can solve before the request ever hits your LLM provider.

Your API gateway is the single best place to do it. It sits between your
consumers and your AI services, which means it sees every request, every
response, and every token. That makes it the perfect control plane for AI cost
management.

Here are five concrete levers you can pull today.

## Lever 1: Semantic Caching

The easiest win. A huge percentage of AI requests are duplicates or
near-duplicates. "Summarize our refund policy" gets asked a hundred different
ways, but the answer is always the same.

Semantic caching stores responses for identical (or similar) prompts and serves
them from cache instead of making another expensive inference call. Unlike
traditional HTTP caching that matches on exact URLs, semantic caching can
recognize that "What is your return policy?" and "How do returns work?" should
return the same cached response.

At the gateway level, you intercept the request, check your cache, and either
return the cached response instantly or forward the request to the LLM and cache
the result on the way back.

In Zuplo, you can add caching to any route with a simple policy configuration.
Note that a configuration like this matches on exact field values; treating
paraphrased prompts as equivalent takes extra normalization or embedding logic
on top:

```json
{
  "name": "ai-cache-policy",
  "policyType": "caching-inbound",
  "handler": {
    "export": "default",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "cacheControl": "public, max-age=3600",
      "varyBy": ["body.prompt", "body.model"],
      "ttlSeconds": 3600
    }
  }
}
```

**Potential savings: 30-60%** for workloads with repetitive prompts. Customer
support bots, FAQ endpoints, and content generation pipelines see the highest
cache hit rates.
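Exact-match caching misses paraphrases. A lightweight first step toward semantic matching is normalizing prompts before hashing them into a cache key, so trivially different phrasings collide. This is an illustrative sketch, not a Zuplo API — the `normalizePrompt` helper and key scheme are assumptions:

```typescript
import { createHash } from "node:crypto";

// Lowercase, strip punctuation, and collapse whitespace so trivially
// different phrasings of the same prompt produce the same cache key.
export function normalizePrompt(prompt: string): string {
  return prompt
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, "")
    .replace(/\s+/g, " ")
    .trim();
}

// The key combines model and normalized prompt so responses from
// different models are never served interchangeably.
export function cacheKey(model: string, prompt: string): string {
  const digest = createHash("sha256")
    .update(`${model}:${normalizePrompt(prompt)}`)
    .digest("hex");
  return `ai-cache:${digest}`;
}
```

Normalization only catches surface-level variation. Recognizing that "What is your return policy?" and "How do returns work?" are the same question requires embedding the prompt and doing a nearest-neighbor lookup against cached embeddings.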

## Lever 2: Per-Consumer Rate Limiting

Without rate limits, a single misconfigured consumer can burn through your
entire monthly AI budget in hours. One developer's infinite loop or one
enthusiastic beta tester can send your OpenAI bill through the roof.

Per-consumer rate limiting puts a ceiling on how many AI requests any single API
key can make. This is not about throttling your overall system — it is about
preventing any single actor from dominating your spend.

Here is a Zuplo rate limiting policy that caps each API key to 100 AI requests
per hour:

```json
{
  "name": "ai-rate-limit-policy",
  "policyType": "rate-limit-inbound",
  "handler": {
    "export": "default",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "rateLimitBy": "user",
      "requestsAllowed": 100,
      "timeWindowMinutes": 60,
      "identifier": {
        "source": "header",
        "name": "Authorization"
      }
    }
  }
}
```

When a consumer exceeds their limit, they get a `429 Too Many Requests` response
with a `Retry-After` header. Clean and predictable.
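On the client side, well-behaved consumers should honor that header before retrying. A small helper for computing the wait — this is generic HTTP handling, not Zuplo-specific:

```typescript
// Parse a Retry-After header value into a delay in milliseconds.
// Per the HTTP spec, the value is either a number of seconds or an
// HTTP date; clamp to zero so a stale date never yields a negative wait.
export function retryDelayMs(
  retryAfter: string,
  now: Date = new Date(),
): number {
  const seconds = Number(retryAfter);
  if (Number.isFinite(seconds)) {
    return Math.max(0, seconds * 1000);
  }
  const date = new Date(retryAfter);
  return Math.max(0, date.getTime() - now.getTime());
}
```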

**Potential savings: prevents 2-10x cost overruns** from runaway consumers. The
savings here are not about optimization — they are about preventing catastrophic
bills.

## Lever 3: Per-App Spend Limits

Rate limiting controls request volume, but AI costs are driven by tokens, not
requests. A single complex prompt with a long context window can cost more than
a hundred simple ones.

Spend limits track actual token usage per consumer and enforce monthly or daily
budgets. When a consumer hits their cap, they get blocked until the next billing
cycle.

Here is a Zuplo custom policy that tracks token spend and enforces a monthly
cap:

```typescript
import { ZuploContext, ZuploRequest } from "@zuplo/runtime";

const MONTHLY_TOKEN_LIMIT = 1_000_000; // 1M tokens per consumer per month

export default async function spendLimitPolicy(
  request: ZuploRequest,
  context: ZuploContext,
) {
  const consumerId = request.user?.sub;
  if (!consumerId) {
    return new Response("Unauthorized", { status: 401 });
  }

  const currentMonth = new Date().toISOString().slice(0, 7);
  const usageKey = `usage:${consumerId}:${currentMonth}`;

  // Get current token usage from your tracking store
  const currentUsage = await context.storage.get(usageKey);
  const tokensUsed = parseInt(currentUsage ?? "0", 10);

  if (tokensUsed >= MONTHLY_TOKEN_LIMIT) {
    // The budget resets at the start of the NEXT calendar month
    const now = new Date();
    const resetsAt = new Date(
      Date.UTC(now.getUTCFullYear(), now.getUTCMonth() + 1, 1),
    ).toISOString();
    return new Response(
      JSON.stringify({
        error: "Monthly token budget exceeded",
        limit: MONTHLY_TOKEN_LIMIT,
        used: tokensUsed,
        resetsAt,
      }),
      {
        status: 429,
        headers: { "Content-Type": "application/json" },
      },
    );
  }

  // Forward the request — track tokens in the response hook
  return request;
}
```
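The response side of that policy — the part the final comment defers to a hook — reads the provider's usage report and increments the counter. A sketch assuming an OpenAI-style `usage` object in the response body; the `incrementUsage` callback stands in for whatever tracking store you use:

```typescript
// Extract total tokens from an OpenAI-style usage object.
// Falls back to zero when the provider omits either field.
export function totalTokens(usage?: {
  prompt_tokens?: number;
  completion_tokens?: number;
}): number {
  return (usage?.prompt_tokens ?? 0) + (usage?.completion_tokens ?? 0);
}

// Hypothetical outbound hook: meter tokens after the LLM responds.
export async function trackTokens(
  response: Response,
  incrementUsage: (key: string, tokens: number) => Promise<void>,
  usageKey: string,
): Promise<Response> {
  // Clone so the original body still reaches the consumer untouched
  const body = await response.clone().json();
  await incrementUsage(usageKey, totalTokens(body.usage));
  return response;
}
```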

**Potential savings: hard cap on worst-case spend.** If you set a $500/month
limit per consumer and you have 50 consumers, your maximum AI cost is $25,000 no
matter what.

## Lever 4: Intelligent Model Routing

Not every request needs GPT-4. A simple classification task, a short summary, or
a structured extraction can be handled perfectly well by GPT-4o-mini or Claude
3.5 Haiku at a fraction of the cost.

Intelligent model routing inspects the incoming request and routes it to the
cheapest model that can handle it. The logic can be as simple as checking prompt
length, or as sophisticated as classifying intent.

Here is a Zuplo custom policy that routes based on prompt complexity:

```typescript
import { ZuploContext, ZuploRequest } from "@zuplo/runtime";

const SIMPLE_MODEL = "gpt-4o-mini";
const COMPLEX_MODEL = "gpt-4o";
const COMPLEXITY_THRESHOLD = 500; // characters, a rough proxy for token count

export default async function modelRoutingPolicy(
  request: ZuploRequest,
  context: ZuploContext,
) {
  const body = await request.json();
  // Total prompt length in characters; a missing messages array counts as zero
  const promptLength = (body.messages ?? [])
    .map((m: { content: string }) => m.content)
    .join(" ").length;

  // Simple heuristic: short prompts go to the cheap model
  const selectedModel =
    promptLength < COMPLEXITY_THRESHOLD ? SIMPLE_MODEL : COMPLEX_MODEL;

  context.log.info(
    `Routing to ${selectedModel} (prompt length: ${promptLength})`,
  );

  const modifiedBody = {
    ...body,
    model: selectedModel,
  };

  return new Request(request.url, {
    method: request.method,
    headers: request.headers,
    body: JSON.stringify(modifiedBody),
  });
}
```

You can get more advanced by classifying the actual intent of the request —
routing code generation to one model, summarization to another, and simple Q&A
to the cheapest option. The gateway is the perfect place for this logic because
it has access to the full request before it reaches any backend.
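A rough sketch of that intent-based variant using keyword heuristics — the keyword lists and intent-to-model mapping here are illustrative assumptions, and a production system would likely use a small classifier model instead:

```typescript
type Intent = "code" | "summarize" | "qa";

// Crude keyword heuristics; good enough to demonstrate the routing shape.
export function classifyIntent(prompt: string): Intent {
  const p = prompt.toLowerCase();
  if (/\b(function|class|code|implement|debug|refactor)\b/.test(p)) {
    return "code";
  }
  if (/\b(summarize|summary|condense)\b/.test(p)) {
    return "summarize";
  }
  return "qa";
}

// Map each intent to the cheapest model that handles it well.
const MODEL_BY_INTENT: Record<Intent, string> = {
  code: "gpt-4o",
  summarize: "gpt-4o-mini",
  qa: "gpt-4o-mini",
};

export function selectModel(prompt: string): string {
  return MODEL_BY_INTENT[classifyIntent(prompt)];
}
```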

**Potential savings: 40-70%** on mixed workloads. GPT-4o-mini is more than an
order of magnitude cheaper than GPT-4o per token. If even half your traffic can
use the cheaper model, you cut your bill dramatically.

## Lever 5: Token-Based Billing

Here is the ultimate lever: stop absorbing AI costs and pass them through to
your consumers. If your API wraps an LLM, your consumers should pay for the
tokens they use.

Token-based billing meters actual token consumption per API key and bills
accordingly. Your consumers get transparency into their usage, and you get
sustainable unit economics.

Zuplo's
[API monetization features](https://zuplo.com/blog/zuplo-api-monetization) make
this straightforward. You can define metering based on token usage, attach it to
pricing plans, and connect it to Stripe for automated billing. The gateway
tracks every token in real time so your billing is always accurate.

The pattern looks like this:

1. **Meter** token usage on every response (input tokens + output tokens)
2. **Aggregate** usage per consumer per billing period
3. **Bill** based on tiered pricing — for example, $0.01 per 1K tokens for the
   first million, $0.008 per 1K after that
4. **Enforce** quotas so consumers cannot exceed their plan limits without
   upgrading
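Step 3's tiered math can be sketched as a pure function. The rates and tier boundary are the illustrative numbers from the list above, not real prices:

```typescript
// Tiered pricing: $0.01 per 1K tokens for the first million,
// $0.008 per 1K thereafter (illustrative rates).
const TIER_BOUNDARY = 1_000_000;
const RATE_FIRST_TIER = 0.01; // dollars per 1K tokens
const RATE_SECOND_TIER = 0.008;

export function monthlyCharge(tokensUsed: number): number {
  const firstTier = Math.min(tokensUsed, TIER_BOUNDARY);
  const secondTier = Math.max(tokensUsed - TIER_BOUNDARY, 0);
  const dollars =
    (firstTier / 1000) * RATE_FIRST_TIER +
    (secondTier / 1000) * RATE_SECOND_TIER;
  // Round to cents to avoid floating-point artifacts on invoices
  return Math.round(dollars * 100) / 100;
}
```

For example, 2.5M tokens bills as 1M at the first-tier rate ($10) plus 1.5M at the second-tier rate ($12), for $22 total.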

**Potential savings: 100% margin recovery.** You are no longer eating AI costs.
Your consumers pay for what they use, and your gateway ensures accurate metering
and enforcement.

## Putting It All Together

Each of these levers is powerful on its own. Combined, they form a complete AI
cost management strategy that runs entirely at the gateway layer.

Here is how they work together in practice:

1. **Semantic caching** catches duplicate requests before they hit your LLM.
   That alone cuts 30-60% of your inference costs.
2. **Rate limiting** prevents any single consumer from burning through your
   budget. No more surprise bills from runaway integrations.
3. **Spend limits** enforce hard caps on token usage per consumer per month.
   Your worst-case cost is always known.
4. **Model routing** sends simple requests to cheap models and complex ones to
   expensive models. You stop overpaying for easy tasks.
5. **Token-based billing** passes the remaining costs through to your consumers,
   making your AI API a revenue generator instead of a cost center.

In a Zuplo gateway, you stack these as policies on your AI routes. Each request
flows through caching, rate limiting, spend tracking, and model routing before
it ever reaches your LLM provider. On the way back, you meter the token usage
for billing.
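In route-config terms, that stacking is just an ordered policy list on the AI route. The fragment below is a hedged sketch of only the inbound `policies` block — the policy names reference the examples in this post (the spend-limit and model-routing names are hypothetical), and order matters: caching runs first so cache hits skip everything else.

```json
{
  "policies": {
    "inbound": [
      "ai-cache-policy",
      "ai-rate-limit-policy",
      "spend-limit-policy",
      "model-routing-policy"
    ]
  }
}
```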

The best part? None of this requires changes to your AI backend. Your LLM
integration stays exactly the same. The gateway handles all the cost control
logic transparently.

## Start Controlling AI Costs Today

If you are running AI endpoints without cost controls, you are leaving money on
the table — or worse, burning through it. An API gateway gives you the
visibility and control you need to run AI services sustainably.

Zuplo's [free tier](https://portal.zuplo.com) gives you everything you need to
get started: built-in rate limiting, custom policies for spend tracking and
model routing, caching, and monetization features for token-based billing. You
can have all five levers running in production in an afternoon.

Stop letting AI costs surprise you.
[Get started with Zuplo](https://portal.zuplo.com) and take control of your AI
spend.