
How to Control AI Costs with an API Gateway

Nate Totten
·
February 26, 2026
·
5 min read

Five concrete ways to reduce AI and LLM costs using your API gateway — semantic caching, rate limiting, spend limits, model routing, and token-based billing.


AI costs are out of control. If you are running a GPT-4 endpoint handling 10,000 requests per day, you could be looking at $30,000 or more per month in inference costs alone. And that number only goes up as your users grow.

The wild part? Most of that spend is preventable. Duplicate prompts, runaway consumers, overqualified models answering simple questions — these are all problems you can solve before the request ever hits your LLM provider.

Your API gateway is the single best place to do it. It sits between your consumers and your AI services, which means it sees every request, every response, and every token. That makes it the perfect control plane for AI cost management.

Here are five concrete levers you can pull today.

Lever 1: Semantic Caching

The easiest win. A huge percentage of AI requests are duplicates or near-duplicates. "Summarize our refund policy" gets asked a hundred different ways, but the answer is always the same.

Semantic caching stores responses for identical (or similar) prompts and serves them from cache instead of making another expensive inference call. Unlike traditional HTTP caching that matches on exact URLs, semantic caching can recognize that "What is your return policy?" and "How do returns work?" should return the same cached response.

At the gateway level, you intercept the request, check your cache, and either return the cached response instantly or forward the request to the LLM and cache the result on the way back.
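What "similar" means in practice is usually embedding similarity. Here is a minimal sketch of the lookup step, assuming prompt embeddings come from a separate embeddings API call (not shown) and using an assumed 0.92 cosine-similarity threshold — a tuning value you would calibrate for your own traffic:

```typescript
// Hypothetical semantic cache lookup. Embeddings are assumed to be
// precomputed (e.g. via an embeddings API); the threshold is illustrative.
type CacheEntry = { embedding: number[]; response: string };

const SIMILARITY_THRESHOLD = 0.92; // assumed tuning value

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the cached response for the closest prompt above the threshold,
// or undefined on a cache miss.
function findCachedResponse(
  promptEmbedding: number[],
  cache: CacheEntry[],
): string | undefined {
  let best: CacheEntry | undefined;
  let bestScore = SIMILARITY_THRESHOLD;
  for (const entry of cache) {
    const score = cosineSimilarity(promptEmbedding, entry.embedding);
    if (score >= bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best?.response;
}
```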

In Zuplo, you can add caching to any route with a simple policy configuration:

{
  "name": "ai-cache-policy",
  "policyType": "caching-inbound",
  "handler": {
    "export": "default",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "cacheControl": "public, max-age=3600",
      "varyBy": ["body.prompt", "body.model"],
      "ttlSeconds": 3600
    }
  }
}

Potential savings: 30-60% for workloads with repetitive prompts. Customer support bots, FAQ endpoints, and content generation pipelines see the highest cache hit rates.

Lever 2: Per-Consumer Rate Limiting

Without rate limits, a single misconfigured consumer can burn through your entire monthly AI budget in hours. One developer's infinite loop or one enthusiastic beta tester can send your OpenAI bill through the roof.

Per-consumer rate limiting puts a ceiling on how many AI requests any single API key can make. This is not about throttling your overall system — it is about preventing any single actor from dominating your spend.

Here is a Zuplo rate limiting policy that caps each API key to 100 AI requests per hour:

{
  "name": "ai-rate-limit-policy",
  "policyType": "rate-limit-inbound",
  "handler": {
    "export": "default",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "rateLimitBy": "user",
      "requestsAllowed": 100,
      "timeWindowMinutes": 60,
      "identifier": {
        "source": "header",
        "name": "Authorization"
      }
    }
  }
}

When a consumer exceeds their limit, they get a 429 Too Many Requests response with a Retry-After header. Clean and predictable.
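Well-behaved clients can use that header to back off. A small client-side helper might parse it like this (a sketch — per RFC 9110, Retry-After can be either delta-seconds or an HTTP date, and the 1-second default here is an assumption):

```typescript
// Hypothetical client helper: convert a 429 response's Retry-After header
// into a delay in milliseconds. Handles both delta-seconds and HTTP dates.
function retryDelayMs(
  retryAfter: string | null,
  now: Date = new Date(),
): number {
  if (!retryAfter) return 1000; // assumed default backoff when header is absent
  const seconds = Number(retryAfter);
  if (!Number.isNaN(seconds)) return Math.max(0, seconds * 1000);
  const date = new Date(retryAfter);
  if (!Number.isNaN(date.getTime())) {
    return Math.max(0, date.getTime() - now.getTime());
  }
  return 1000; // unparseable header: fall back to the default
}
```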

Potential savings: prevents 2-10x cost overruns from runaway consumers. The savings here are not about optimization — they are about preventing catastrophic bills.

Lever 3: Per-App Spend Limits

Rate limiting controls request volume, but AI costs are driven by tokens, not requests. A single complex prompt with a long context window can cost more than a hundred simple ones.

Spend limits track actual token usage per consumer and enforce monthly or daily budgets. When a consumer hits their cap, they get blocked until the next billing cycle.

Here is a Zuplo custom policy that tracks token spend and enforces a monthly cap:

import { ZuploContext, ZuploRequest } from "@zuplo/runtime";

const MONTHLY_TOKEN_LIMIT = 1_000_000; // 1M tokens per consumer per month

export default async function spendLimitPolicy(
  request: ZuploRequest,
  context: ZuploContext,
) {
  const consumerId = request.user?.sub;
  if (!consumerId) {
    return new Response("Unauthorized", { status: 401 });
  }

  const currentMonth = new Date().toISOString().slice(0, 7);
  const usageKey = `usage:${consumerId}:${currentMonth}`;

  // Get current token usage from your tracking store
  const currentUsage = await context.storage.get(usageKey);
  const tokensUsed = parseInt(currentUsage ?? "0", 10);

  if (tokensUsed >= MONTHLY_TOKEN_LIMIT) {
    // The budget resets at the start of the *next* month, not the current one
    const [year, month] = currentMonth.split("-").map(Number);
    const resetsAt = new Date(Date.UTC(year, month, 1)).toISOString();
    return new Response(
      JSON.stringify({
        error: "Monthly token budget exceeded",
        limit: MONTHLY_TOKEN_LIMIT,
        used: tokensUsed,
        resetsAt,
      }),
      {
        status: 429,
        headers: { "Content-Type": "application/json" },
      },
    );
  }

  // Forward the request — track tokens in the response hook
  return request;
}
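The tracking half happens on the way back. A minimal sketch of that step, assuming an OpenAI-style `usage` object on the LLM response (the storage call mirrors the simplified API used above and is illustrative):

```typescript
// Hypothetical sketch: extract token counts from an OpenAI-style response
// body so the response hook can increment the consumer's monthly counter.
type Usage = { prompt_tokens?: number; completion_tokens?: number };

function tokensInResponse(body: { usage?: Usage }): number {
  const usage = body.usage;
  if (!usage) return 0; // streaming or non-standard responses report no usage
  return (usage.prompt_tokens ?? 0) + (usage.completion_tokens ?? 0);
}

// In the response hook, roughly:
//   const used = tokensInResponse(await response.clone().json());
//   await context.storage.put(usageKey, String(tokensUsed + used));
```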

Potential savings: hard cap on worst-case spend. If you set a $500/month limit per consumer and you have 50 consumers, your maximum AI cost is $25,000 no matter what.

Lever 4: Intelligent Model Routing

Not every request needs GPT-4. A simple classification task, a short summary, or a structured extraction can be handled perfectly well by GPT-4o-mini or Claude 3.5 Haiku at a fraction of the cost.

Intelligent model routing inspects the incoming request and routes it to the cheapest model that can handle it. The logic can be as simple as checking prompt length, or as sophisticated as classifying intent.

Here is a Zuplo custom policy that routes based on prompt complexity:

import { ZuploContext, ZuploRequest } from "@zuplo/runtime";

const SIMPLE_MODEL = "gpt-4o-mini";
const COMPLEX_MODEL = "gpt-4o";
const COMPLEXITY_THRESHOLD = 500; // characters (a rough proxy for complexity)

export default async function modelRoutingPolicy(
  request: ZuploRequest,
  context: ZuploContext,
) {
  const body = await request.json();
  // Optional chaining short-circuits if `messages` is missing; default to the
  // capable model rather than assuming an unknown payload is simple
  const promptLength =
    body.messages
      ?.map((m: { content: string }) => m.content)
      .join(" ").length ?? Infinity;

  // Simple heuristic: short prompts go to the cheap model
  const selectedModel =
    promptLength < COMPLEXITY_THRESHOLD ? SIMPLE_MODEL : COMPLEX_MODEL;

  context.log.info(
    `Routing to ${selectedModel} (prompt length: ${promptLength})`,
  );

  const modifiedBody = {
    ...body,
    model: selectedModel,
  };

  return new Request(request.url, {
    method: request.method,
    headers: request.headers,
    body: JSON.stringify(modifiedBody),
  });
}

You can get more advanced by classifying the actual intent of the request — routing code generation to one model, summarization to another, and simple Q&A to the cheapest option. The gateway is the perfect place for this logic because it has access to the full request before it reaches any backend.
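As a sketch of that idea, a keyword-based classifier could extend the length heuristic. The keyword lists and model assignments below are purely illustrative, not a recommendation:

```typescript
// Hypothetical intent-based routing. In production you might use a small
// classifier model instead of regexes; the mapping here is illustrative.
const MODEL_BY_INTENT: Record<string, string> = {
  code: "gpt-4o",        // code generation gets the capable model
  summarize: "gpt-4o-mini",
  qa: "gpt-4o-mini",     // simple Q&A gets the cheapest option
};

function classifyIntent(prompt: string): string {
  const p = prompt.toLowerCase();
  if (/\b(code|function|implement|debug)\b/.test(p)) return "code";
  if (/\b(summarize|summary|tl;dr)\b/.test(p)) return "summarize";
  return "qa";
}

function selectModel(prompt: string): string {
  return MODEL_BY_INTENT[classifyIntent(prompt)];
}
```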

Potential savings: 40-70% on mixed workloads. GPT-4o-mini is more than an order of magnitude cheaper than GPT-4o per token. If even half your traffic can use the cheaper model, you are cutting your bill dramatically.

Lever 5: Token-Based Billing

Here is the ultimate lever: stop absorbing AI costs and pass them through to your consumers. If your API wraps an LLM, your consumers should pay for the tokens they use.

Token-based billing meters actual token consumption per API key and bills accordingly. Your consumers get transparency into their usage, and you get sustainable unit economics.

Zuplo's API monetization features make this straightforward. You can define metering based on token usage, attach it to pricing plans, and connect it to Stripe for automated billing. The gateway tracks every token in real time so your billing is always accurate.

The pattern looks like this:

  1. Meter token usage on every response (input tokens + output tokens)
  2. Aggregate usage per consumer per billing period
  3. Bill based on tiered pricing — for example, $0.01 per 1K tokens for the first million, $0.008 per 1K after that
  4. Enforce quotas so consumers cannot exceed their plan limits without upgrading

Potential savings: 100% margin recovery. You are no longer eating AI costs. Your consumers pay for what they use, and your gateway ensures accurate metering and enforcement.

Putting It All Together

Each of these levers is powerful on its own. Combined, they form a complete AI cost management strategy that runs entirely at the gateway layer.

Here is how they work together in practice:

  1. Semantic caching catches duplicate requests before they hit your LLM. That alone cuts 30-60% of your inference costs.
  2. Rate limiting prevents any single consumer from burning through your budget. No more surprise bills from runaway integrations.
  3. Spend limits enforce hard caps on token usage per consumer per month. Your worst-case cost is always known.
  4. Model routing sends simple requests to cheap models and complex ones to expensive models. You stop overpaying for easy tasks.
  5. Token-based billing passes the remaining costs through to your consumers, making your AI API a revenue generator instead of a cost center.

In a Zuplo gateway, you stack these as policies on your AI routes. Each request flows through caching, rate limiting, spend tracking, and model routing before it ever reaches your LLM provider. On the way back, you meter the token usage for billing.
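A route definition stacking the policies might look like the following sketch. This is illustrative only: the inbound policy names are the ones defined above, `token-metering-policy` is a hypothetical outbound metering policy, and the backend URL is a placeholder.

```json
{
  "paths": {
    "/v1/chat/completions": {
      "post": {
        "x-zuplo-route": {
          "handler": {
            "export": "urlForwardHandler",
            "module": "$import(@zuplo/runtime)",
            "options": { "baseUrl": "https://api.openai.com" }
          },
          "policies": {
            "inbound": [
              "ai-cache-policy",
              "ai-rate-limit-policy",
              "spend-limit-policy",
              "model-routing-policy"
            ],
            "outbound": ["token-metering-policy"]
          }
        }
      }
    }
  }
}
```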

The best part? None of this requires changes to your AI backend. Your LLM integration stays exactly the same. The gateway handles all the cost control logic transparently.

Start Controlling AI Costs Today

If you are running AI endpoints without cost controls, you are leaving money on the table — or worse, burning through it. An API gateway gives you the visibility and control you need to run AI services sustainably.

Zuplo's free tier gives you everything you need to get started: built-in rate limiting, custom policies for spend tracking and model routing, caching, and monetization features for token-based billing. You can have all five levers running in production in an afternoon.

Stop letting AI costs surprise you. Get started with Zuplo and take control of your AI spend.
