AI adoption across the enterprise is accelerating at a pace that governance frameworks can barely keep up with. Engineering teams are integrating OpenAI, Anthropic, Google Gemini, and open-source models into everything from customer support chatbots to code generation pipelines. But while the pace of adoption is impressive, the controls around that usage often range from informal to nonexistent.
The result? Uncontrolled costs, unknown data exposure, and compliance gaps that only surface during audits or incidents. The teams best positioned to close these gaps are API teams -- because every AI and LLM interaction, whether it's a call to GPT-4o or an internal fine-tuned model, is ultimately an API call.
This guide walks through building a practical AI governance framework centered on the API gateway. You'll learn how to enforce access controls, manage costs, maintain compliance, and create an audit trail -- with concrete patterns and code examples you can implement today.
Why API Teams Own AI Governance
Think about the path every AI request takes. A developer's application sends a prompt. That prompt travels over HTTP to a model endpoint -- OpenAI's /v1/chat/completions, Anthropic's /v1/messages, or your own internally hosted model. The response comes back over the same channel.
This means the API layer is the single chokepoint for all AI traffic in your organization. The API gateway sits at exactly the right position in the stack to enforce governance policies uniformly, regardless of which team, application, or model is involved.
Here's why this matters:
- Centralized enforcement: Instead of relying on every team to implement their own controls, the gateway applies policies consistently across all AI traffic.
- Separation of concerns: Application developers focus on building features. The platform team handles governance at the infrastructure level.
- Visibility: The gateway sees every request and response, making it the natural place to log, meter, and audit AI usage.
- Speed of implementation: Adding a new policy to a gateway takes minutes. Retrofitting controls into dozens of individual applications takes months.
The API gateway is not just the transport layer for AI -- it's the control plane. And that makes the API team the de facto AI governance team, whether they signed up for the job or not.
Building a Governance Framework
A governance framework for AI APIs needs three pillars: clear roles and policies, well-defined access tiers, and technical enforcement mechanisms that don't rely on trust alone.
Roles and Policies
Start by defining who can do what with AI services. This isn't just about blocking unauthorized access -- it's about creating an approval workflow that scales as AI adoption grows.
A practical starting point:
- Platform team: Owns the AI gateway configuration. Approves new model integrations. Defines rate limits, cost caps, and compliance policies.
- Application teams: Request access to specific models for specific use cases. Operate within the guardrails set by the platform team.
- Security/compliance team: Defines data classification rules. Reviews audit logs. Signs off on new external AI providers.
- Finance: Sets departmental budget caps for AI spend. Reviews usage reports.
For new AI service requests, establish a lightweight approval flow. A team wants to use Claude for summarization? They submit a request specifying the model, the use case, estimated volume, and the data classification of inputs. The platform team provisions access with appropriate controls. This doesn't need to be bureaucratic -- a Slack workflow or a simple form backed by API key provisioning is enough to start.
Access Tiers
Not every team needs access to every model. Access tiers let you match model capabilities (and costs) to actual needs.
A common tiering structure:
| Tier | Models Available | Use Cases | Rate Limit |
|---|---|---|---|
| Development | GPT-4o-mini, Claude Haiku | Prototyping, testing | 100 req/min |
| Standard | GPT-4o, Claude Sonnet | Production features, internal tools | 500 req/min |
| Premium | GPT-4o, Claude Opus, o1-pro | Revenue-critical, complex reasoning | 2,000 req/min |
| Restricted | Fine-tuned internal models | Sensitive data processing | Custom |
Each tier maps to an API key or JWT claim that the gateway uses to enforce routing and limits. Teams start in the Development tier and move up through the approval process.
JWT-Claim Routing
When your organization uses JWT-based authentication, you can embed the access tier directly in the token claims. The gateway then routes requests to the appropriate model endpoint without any application-level logic.
Here's a Zuplo inbound policy that reads the tier from a JWT claim and routes accordingly:
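A sketch of what that policy can look like. The simplified types below stand in for Zuplo's ZuploRequest and ZuploContext from @zuplo/runtime, and the model IDs are illustrative placeholders:

```typescript
// Allowed models per tier, mirroring the tier table above.
export const TIER_MODELS: Record<string, string[]> = {
  development: ["gpt-4o-mini", "claude-haiku"],
  standard: ["gpt-4o", "claude-sonnet"],
  premium: ["gpt-4o", "claude-opus", "o1-pro"],
};

// Pure check used by the policy (and easy to unit-test).
export function modelAllowed(tier: string, model: string): boolean {
  return (TIER_MODELS[tier] ?? []).includes(model);
}

interface PolicyRequest {
  user?: { data: Record<string, unknown> }; // claims set by the JWT policy
  json(): Promise<{ model?: string }>;
}

export async function tierRoutingPolicy(
  request: PolicyRequest
): Promise<PolicyRequest | Response> {
  const tier = String(request.user?.data?.tier ?? "development");
  const { model = "" } = await request.json();

  if (!modelAllowed(tier, model)) {
    // Reject with a clear error listing the models this tier may use.
    return new Response(
      JSON.stringify({
        error: `Model "${model}" is not available in the "${tier}" tier`,
        allowedModels: TIER_MODELS[tier] ?? [],
      }),
      { status: 403, headers: { "content-type": "application/json" } }
    );
  }
  return request; // continue to the backend configured for this tier
}
```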
This pattern keeps routing logic out of application code entirely. Developers send requests to a single AI gateway endpoint. The gateway reads their token, checks their tier, validates the requested model, and routes accordingly. If a developer tries to access a model above their tier, they get a clear error telling them which models they can use.
Cost Controls
AI API costs can escalate quickly. A single runaway process calling GPT-4o in a loop can burn through thousands of dollars in hours. Effective cost controls require multiple layers: per-team quotas, smart caching, model tiering, and usage tracking.
Per-Team Quotas
Rate limiting is the first line of defense, but for AI governance you need more than simple requests-per-minute limits. You need quotas that map to business units and budgets.
With Zuplo, you can configure rate limiting per API key, which maps directly to teams or applications:
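A rate-limit policy configuration along these lines -- the limit values here are placeholders to adjust per tier, and you should verify option names against the current Zuplo policy reference:

```json
{
  "name": "per-team-ai-rate-limit",
  "policyType": "rate-limit-inbound",
  "handler": {
    "export": "RateLimitInboundPolicy",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "rateLimitBy": "user",
      "requestsAllowed": 10000,
      "timeWindowMinutes": 1440
    }
  }
}
```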
The rateLimitBy: "user" configuration ensures each API consumer gets their own quota bucket. Set requestsAllowed to the daily limit appropriate for each tier, and the gateway enforces it automatically.
For monthly spend caps, you need to track cumulative token usage. More on that in the token-based billing section below.
Semantic Caching
If multiple users or applications send identical (or near-identical) prompts to the same model, you're paying for the same computation repeatedly. Semantic caching intercepts these duplicate requests and serves the cached response instead.
The concept works like this:
- A request comes in with a prompt.
- The gateway computes a hash of the prompt (and relevant parameters like model, temperature, and system prompt).
- If a cached response exists for that hash, return it immediately.
- If not, forward the request to the model, cache the response, and return it.
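The hashing and lookup steps above can be sketched like this, using Node's built-in crypto module. A Map stands in for what would be a shared cache (such as Redis or the gateway's own cache) in production:

```typescript
import { createHash } from "node:crypto";

// Hash exactly the fields that determine the response, so identical
// requests produce identical keys.
export function cacheKey(
  model: string,
  temperature: number,
  systemPrompt: string,
  userPrompt: string
): string {
  return createHash("sha256")
    .update(JSON.stringify({ model, temperature, systemPrompt, userPrompt }))
    .digest("hex");
}

const cache = new Map<string, string>(); // stand-in for a shared cache

// Serve a cached response if one exists; otherwise call the model,
// cache the result, and return it.
export async function cachedCompletion(
  key: string,
  callModel: () => Promise<string>
): Promise<string> {
  const hit = cache.get(key);
  if (hit !== undefined) return hit;
  const response = await callModel();
  cache.set(key, response);
  return response;
}
```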
This is especially effective for common operations like classification, extraction from templates, and FAQ-style queries where the same questions recur frequently. Hit rates vary by workload, but for repetitive AI traffic cache hit rates in the 15-40% range are common, translating directly to cost savings.
For prompts that aren't identical but semantically similar, you can use embedding similarity to match against cached responses. This adds complexity but can significantly increase hit rates for use cases like customer support where the same question gets phrased many different ways.
Model Tiering for Cost Optimization
Not every request needs the most capable (and expensive) model. A request classifier at the gateway level can route simple requests to cheaper models automatically.
Consider this pattern:
- Simple lookups and classifications: Route to GPT-4o-mini or Claude Haiku. These models handle straightforward tasks at a fraction of the cost.
- Standard generation and summarization: Route to GPT-4o or Claude Sonnet. Good balance of quality and cost.
- Complex reasoning and analysis: Route to Claude Opus, o1-pro, or specialized models. Reserve these for tasks that genuinely need them.
You can implement this as a gateway policy that inspects request metadata -- a custom header like X-AI-Priority or a field in the request body -- and routes accordingly:
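A sketch of that routing logic, assuming the X-AI-Priority header described above. The priority-to-model mapping and the backend URL are illustrative placeholders:

```typescript
export const PRIORITY_MODEL: Record<string, string> = {
  low: "gpt-4o-mini", // simple lookups and classification
  standard: "gpt-4o", // generation and summarization
  high: "claude-opus", // complex reasoning
};

// Pure resolver: unknown or missing priorities fall back to "standard".
export function resolveModel(priority: string | null): string {
  return PRIORITY_MODEL[priority ?? "standard"] ?? PRIORITY_MODEL["standard"];
}

interface PolicyRequest {
  headers: { get(name: string): string | null };
  json(): Promise<Record<string, unknown>>;
}

// Inbound policy sketch: overwrite the model field before forwarding.
// The backend URL is a hypothetical internal endpoint.
export async function modelTieringPolicy(
  request: PolicyRequest
): Promise<Request> {
  const priority = request.headers.get("X-AI-Priority");
  const body = await request.json();
  body.model = resolveModel(priority);
  return new Request("https://ai-backend.internal/v1/chat/completions", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(body),
  });
}
```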
Application teams set the priority based on the use case. The gateway handles the rest. This approach can reduce AI spend by 30-60% for organizations with a mix of simple and complex AI workloads.
Token-Based Billing and Metering
To enforce monthly spend caps and provide accurate usage reporting, you need to track token consumption per consumer. Zuplo's metering capabilities let you log token usage alongside standard request metrics.
Here's an outbound policy that extracts token usage from the AI provider's response and records it:
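A sketch of that outbound policy. The parseUsage helper handles both OpenAI-style usage objects (prompt_tokens/completion_tokens) and Anthropic-style ones (input_tokens/output_tokens); the log callback is a placeholder for Zuplo's logger or your metering pipeline:

```typescript
export interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Extract token counts from a provider response body, defaulting to zero
// when the usage object is missing or unrecognized.
export function parseUsage(body: unknown): TokenUsage {
  const u = (body as { usage?: Record<string, number> })?.usage ?? {};
  return {
    inputTokens: u.prompt_tokens ?? u.input_tokens ?? 0,
    outputTokens: u.completion_tokens ?? u.output_tokens ?? 0,
  };
}

// Outbound policy sketch: record usage and pass the response through.
export async function meterTokensPolicy(
  response: Response,
  log: (entry: object) => void,
  consumer: string
): Promise<Response> {
  const clone = response.clone(); // don't consume the body we return
  const usage = parseUsage(await clone.json().catch(() => ({})));
  log({ consumer, ...usage, at: new Date().toISOString() });
  return response;
}
```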
This data feeds into dashboards and alerting. When a team approaches their monthly budget, you can trigger warnings. When they hit the cap, the rate limiter kicks in. Finance gets a clear report of AI spend by team, model, and application.
Spend Limit Enforcement
Combining token tracking with spend limits creates a hard cap on AI costs. Here's an inbound policy that checks cumulative spend before allowing a request:
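A sketch of that check. A Map stands in for the shared store (for example Zuplo's key-value storage or Redis) that would hold cumulative spend in production, and the cap values are illustrative:

```typescript
const spendUsdByTeam = new Map<string, number>();

// Add a request's cost to the team's running total and return the total.
export function recordSpend(team: string, usd: number): number {
  const total = (spendUsdByTeam.get(team) ?? 0) + usd;
  spendUsdByTeam.set(team, total);
  return total;
}

export function isOverCap(team: string, capUsd: number): boolean {
  return (spendUsdByTeam.get(team) ?? 0) >= capUsd;
}

// Inbound policy sketch: reject before the model is ever called once the
// team's monthly cap is reached; null means "allow the request".
export function spendLimitPolicy(
  team: string,
  capUsd: number
): Response | null {
  if (isOverCap(team, capUsd)) {
    return new Response(
      JSON.stringify({
        error: `Monthly AI spend cap of $${capUsd} reached for ${team}`,
      }),
      { status: 429, headers: { "content-type": "application/json" } }
    );
  }
  return null;
}
```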
This gives teams a clear, predictable boundary. No surprises on the monthly AI bill.
Compliance and Audit
Cost controls protect your budget. Compliance controls protect your business. For organizations in regulated industries -- or any company handling customer data -- AI governance requires robust auditing, data protection, and residency controls.
Audit Logging
Every AI request should be logged with enough context to answer these questions during an audit:
- Who made the request? (User identity, team, application)
- What model was called, with what parameters?
- When did the request occur?
- How many tokens were consumed?
- What was the response status?
Here's a Zuplo policy that creates comprehensive audit logs:
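A sketch of the record-building core of such a policy. The field names are illustrative and should be adapted to your SIEM's schema; in a real Zuplo outbound policy the inputs would come from the request context and the provider response:

```typescript
export interface AuditRecord {
  timestamp: string; // when the request occurred
  userId: string; // who made the request
  team: string;
  model: string; // which model was called
  status: number; // response status
  inputTokens: number; // tokens consumed
  outputTokens: number;
}

// Build one structured audit record per request, answering the audit
// questions listed above.
export function buildAuditRecord(params: {
  userId: string;
  team: string;
  model: string;
  status: number;
  inputTokens: number;
  outputTokens: number;
}): AuditRecord {
  return { timestamp: new Date().toISOString(), ...params };
}
```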
Ship these logs to your SIEM (Splunk, Datadog, etc.) or a dedicated audit store. The key is making them immutable and queryable. When the compliance team asks "which teams used GPT-4o to process customer data last quarter?", you should be able to answer in minutes, not weeks.
PII Safeguards
One of the biggest risks with external AI services is inadvertently sending personally identifiable information (PII) to a third-party provider. An inbound policy can scan request payloads before they leave your network:
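A minimal regex-based sketch. These patterns are intentionally simple illustrations -- they will miss edge cases and produce some false positives, which is exactly why a dedicated detector is recommended below:

```typescript
export const PII_PATTERNS: Record<string, RegExp> = {
  email: /[\w.+-]+@[\w-]+\.[\w.-]+/,
  usSsn: /\b\d{3}-\d{2}-\d{4}\b/,
  creditCard: /\b(?:\d[ -]?){13,16}\b/,
};

// Return the names of all pattern categories found in the text.
export function findPii(text: string): string[] {
  return Object.entries(PII_PATTERNS)
    .filter(([, pattern]) => pattern.test(text))
    .map(([name]) => name);
}

// Inbound policy sketch: block the request if any prompt contains PII;
// null means "no PII found, let the request proceed".
export function piiGuard(promptText: string): Response | null {
  const hits = findPii(promptText);
  if (hits.length > 0) {
    return new Response(
      JSON.stringify({
        error: "Request blocked: possible PII detected",
        categories: hits,
      }),
      { status: 400, headers: { "content-type": "application/json" } }
    );
  }
  return null;
}
```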
For production deployments, you'll want more sophisticated PII detection -- potentially using a dedicated NLP model or a service like Microsoft Presidio. The regex-based approach above catches the most common patterns and serves as a first layer of defense.
You can also implement PII redaction instead of blocking, replacing detected PII with placeholders before the request reaches the model. This lets the request proceed while protecting sensitive data.
Data Residency
For organizations operating across regions, data residency requirements dictate where AI processing can happen. European customer data might need to stay within the EU. Healthcare data might need to remain in specific jurisdictions.
The gateway can enforce this by routing requests to region-specific model endpoints based on the user's location or data classification:
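A sketch of the routing table. The endpoint URLs are hypothetical placeholders for region-pinned deployments, and the residency value would come from a JWT claim or data-classification header:

```typescript
export const REGION_ENDPOINTS: Record<string, string> = {
  eu: "https://eu-models.example.com/v1", // EU-hosted deployment
  us: "https://us-models.example.com/v1", // US-hosted deployment
};

// Resolve the backend from a data-residency value; default to "us" when
// no residency requirement is present.
export function resolveRegionEndpoint(residency: string | undefined): string {
  return REGION_ENDPOINTS[residency ?? "us"] ?? REGION_ENDPOINTS["us"];
}
```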
This approach works well with Azure OpenAI Service deployments, which let you host models in specific Azure regions. For other providers, you may need to maintain separate accounts or use provider-specific regional endpoints.
Retention Policies
AI request and response logs can contain sensitive information -- the prompts themselves, the generated content, and metadata about usage patterns. Define clear retention policies:
- Audit metadata (who, when, which model, token count): Retain for 12-24 months for compliance. This data is small and doesn't contain sensitive content.
- Full request/response payloads: Retain for 30-90 days for debugging and quality monitoring. Auto-delete after the retention period.
- PII-flagged requests: Either don't log the payload at all, or encrypt it with a key that gets rotated on a schedule.
Configure your logging pipeline to separate these tiers. The audit metadata goes to your long-term compliance store. Full payloads go to a time-limited store with automatic expiry. This balances operational needs with data minimization principles.
Practical Implementation with Zuplo
Zuplo's AI Gateway brings these governance patterns together in a single platform. Here's how the pieces fit.
JWT Validation and Claim-Based Routing
Zuplo's built-in JWT authentication policy validates tokens and extracts claims automatically. Combine it with the tier-routing policy from earlier:
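A route configuration fragment showing the chain -- the policy names are whatever you registered in your policies config, and the exact handler shape should be checked against the current Zuplo routing documentation:

```json
{
  "/v1/chat/completions": {
    "post": {
      "x-zuplo-route": {
        "handler": {
          "export": "urlForwardHandler",
          "module": "$import(@zuplo/runtime)",
          "options": { "baseUrl": "https://api.openai.com" }
        },
        "policies": {
          "inbound": ["jwt-auth", "tier-routing"]
        }
      }
    }
  }
}
```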
The JWT policy runs first, validating the token and populating request.user with the token claims. The tier-routing policy then reads the tier claim and routes the request to the appropriate model endpoint.
Rate Limiting Per API Key
For teams that use API keys instead of JWTs, Zuplo's API key authentication gives you per-consumer rate limiting out of the box. Each API key can be assigned metadata (team, tier, spend limit) that your policies can reference.
The rate limiting policy applies per-key quotas automatically. You can set different limits for different keys through the Zuplo Developer Portal, where consumers self-serve their API keys and you control the access parameters.
Custom Logging for Audit Trail
Chain the audit logging policy as an outbound handler on your AI routes. Every request that passes through gets logged with full context:
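A route's policy section might chain the full set like this (the policy names are illustrative; each one refers to a policy defined in your Zuplo config):

```json
{
  "policies": {
    "inbound": ["jwt-auth", "tier-routing", "pii-guard", "spend-limit"],
    "outbound": ["token-metering", "audit-log"]
  }
}
```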
Notice the policy chain: inbound policies handle authentication, routing, PII scanning, and spend limits. Outbound policies capture token usage and write the audit log. This layered approach means each policy does one thing well, and you can mix and match them across different AI routes.
Bringing It All Together
A complete AI governance setup in Zuplo looks like this:
- Authentication: JWT or API key validation on every request.
- Authorization: Tier-based access control using token claims or key metadata.
- PII protection: Inbound scan for sensitive data before it leaves your network.
- Cost controls: Rate limiting, spend caps, and model tiering to keep costs predictable.
- Audit trail: Comprehensive logging of every request with identity, model, tokens, and cost.
- Data residency: Region-based routing for compliance with local regulations.
Each layer is a separate policy, configured declaratively and applied at the gateway level. No changes to application code. No reliance on developers remembering to implement controls. The governance is built into the infrastructure.
Governance Checklist
Before going to production with AI APIs, make sure you've addressed each of these items:
Access Controls
- Every AI API consumer is authenticated (JWT or API key)
- Access tiers are defined and mapped to specific models
- An approval workflow exists for new AI service access requests
- Unused API keys and access grants are reviewed and revoked quarterly
Cost Management
- Per-team or per-application rate limits are configured
- Monthly spend caps are set and enforced at the gateway
- Semantic caching is enabled for high-volume, repetitive workloads
- Model tiering routes low-priority requests to cost-effective models
- Token usage is tracked and reported per consumer
Compliance and Data Protection
- PII scanning is enabled on inbound requests to external AI providers
- Data residency requirements are mapped to regional model endpoints
- Audit logs capture identity, model, tokens, cost, and timestamp for every request
- Audit logs are shipped to an immutable store with appropriate retention policies
- Full request/response payload logging has a defined retention period with auto-expiry
Operational Readiness
- Alerting is configured for spend anomalies and rate limit breaches
- A runbook exists for responding to AI-related incidents (data exposure, cost spikes)
- The governance configuration is version-controlled and reviewed through the same process as application code
- Dashboards show real-time AI usage by team, model, and cost
Organizational
- Roles and responsibilities for AI governance are documented
- The platform team has the authority to enforce controls at the gateway level
- Application teams understand the available tiers and how to request changes
- Finance receives regular reports on AI spend by department
This checklist is a starting point. Your organization may have additional requirements based on your industry, regulatory environment, and risk tolerance. The important thing is that governance is explicit, enforced at the infrastructure level, and not left to individual teams to implement on their own.
Get Started with Zuplo's AI Gateway
AI governance doesn't have to be a bottleneck. With the right architecture -- an API gateway as the enforcement point, clear policies, and layered controls -- you can give teams the AI capabilities they need while maintaining the visibility and control your organization requires.
Zuplo's AI Gateway gives you the building blocks: JWT and API key authentication, programmable policies in TypeScript, per-consumer rate limiting, and comprehensive logging. You can start with basic access controls and add layers as your AI usage matures.
Sign up for Zuplo and deploy your first AI governance policy in minutes. Your finance team and compliance team will both thank you.