AI adoption across the enterprise is accelerating at a pace that governance frameworks can barely keep up with. Engineering teams are integrating OpenAI, Anthropic, Google Gemini, and open-source models into everything from customer support chatbots to code generation pipelines. But while the pace of adoption is impressive, the controls around that usage often range from informal to nonexistent.
The result? Uncontrolled costs, unknown data exposure, and compliance gaps that only surface during audits or incidents. The teams best positioned to close these gaps are API teams -- because every AI and LLM interaction, whether it's a call to GPT-4o or an internal fine-tuned model, is ultimately an API call.
This guide walks through building a practical AI governance framework centered on the API gateway. You'll learn how to enforce access controls, manage costs, maintain compliance, and create an audit trail -- with concrete patterns and code examples you can implement today.
Why API Teams Own AI Governance
Think about the path every AI request takes. A developer's application sends a prompt. That prompt travels over HTTP to a model endpoint -- OpenAI's /v1/chat/completions, Anthropic's /v1/messages, or your own internally hosted model. The response comes back over the same channel.
This means the API layer is the single chokepoint for all AI traffic in your organization. The API gateway sits at exactly the right position in the stack to enforce governance policies uniformly, regardless of which team, application, or model is involved.
Here's why this matters:
- Centralized enforcement: Instead of relying on every team to implement their own controls, the gateway applies policies consistently across all AI traffic.
- Separation of concerns: Application developers focus on building features. The platform team handles governance at the infrastructure level.
- Visibility: The gateway sees every request and response, making it the natural place to log, meter, and audit AI usage.
- Speed of implementation: Adding a new policy to a gateway takes minutes. Retrofitting controls into dozens of individual applications takes months.
The API gateway is not just the transport layer for AI -- it's the control plane. And that makes the API team the de facto AI governance team, whether they signed up for the job or not.
Building a Governance Framework
A governance framework for AI APIs needs three pillars: clear roles and policies, well-defined access tiers, and technical enforcement mechanisms that don't rely on trust alone.
Roles and Policies
Start by defining who can do what with AI services. This isn't just about blocking unauthorized access -- it's about creating an approval workflow that scales as AI adoption grows.
A practical starting point:
- Platform team: Owns the AI gateway configuration. Approves new model integrations. Defines rate limits, cost caps, and compliance policies.
- Application teams: Request access to specific models for specific use cases. Operate within the guardrails set by the platform team.
- Security/compliance team: Defines data classification rules. Reviews audit logs. Signs off on new external AI providers.
- Finance: Sets departmental budget caps for AI spend. Reviews usage reports.
For new AI service requests, establish a lightweight approval flow. A team wants to use Claude for summarization? They submit a request specifying the model, the use case, estimated volume, and the data classification of inputs. The platform team provisions access with appropriate controls. This doesn't need to be bureaucratic -- a Slack workflow or a simple form backed by API key provisioning is enough to start.
Access Tiers
Not every team needs access to every model. Access tiers let you match model capabilities (and costs) to actual needs.
A common tiering structure:
| Tier | Models Available | Use Cases | Rate Limit |
|---|---|---|---|
| Development | GPT-4o-mini, Claude Haiku | Prototyping, testing | 100 req/min |
| Standard | GPT-4o, Claude Sonnet | Production features, internal tools | 500 req/min |
| Premium | GPT-4o, Claude Opus, o1-pro | Revenue-critical, complex reasoning | 2,000 req/min |
| Restricted | Fine-tuned internal models | Sensitive data processing | Custom |
Each tier maps to an API key or JWT claim that the gateway uses to enforce routing and limits. Teams start in the Development tier and move up through the approval process.
JWT-Claim Routing
When your organization uses JWT-based authentication, you can embed the access tier directly in the token claims. The gateway then routes requests to the appropriate model endpoint without any application-level logic.
Here's a Zuplo inbound policy that reads the tier from a JWT claim and routes accordingly:
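A sketch of what that policy can look like. The simplified types below stand in for Zuplo's ZuploRequest and ZuploContext from @zuplo/runtime, and the model IDs are illustrative placeholders:

```typescript
// Allowed models per tier, mirroring the tier table above.
export const TIER_MODELS: Record<string, string[]> = {
  development: ["gpt-4o-mini", "claude-haiku"],
  standard: ["gpt-4o", "claude-sonnet"],
  premium: ["gpt-4o", "claude-opus", "o1-pro"],
};

// Pure check used by the policy (and easy to unit-test).
export function modelAllowed(tier: string, model: string): boolean {
  return (TIER_MODELS[tier] ?? []).includes(model);
}

interface PolicyRequest {
  user?: { data: Record<string, unknown> }; // claims set by the JWT policy
  json(): Promise<{ model?: string }>;
}

export async function tierRoutingPolicy(
  request: PolicyRequest
): Promise<PolicyRequest | Response> {
  const tier = String(request.user?.data?.tier ?? "development");
  const { model = "" } = await request.json();

  if (!modelAllowed(tier, model)) {
    // Reject with a clear error listing the models this tier may use.
    return new Response(
      JSON.stringify({
        error: `Model "${model}" is not available in the "${tier}" tier`,
        allowedModels: TIER_MODELS[tier] ?? [],
      }),
      { status: 403, headers: { "content-type": "application/json" } }
    );
  }
  return request; // continue to the backend configured for this tier
}
```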
This pattern keeps routing logic out of application code entirely. Developers send requests to a single AI gateway endpoint. The gateway reads their token, checks their tier, validates the requested model, and routes accordingly. If a developer tries to access a model above their tier, they get a clear error telling them which models they can use.
Cost Controls
AI API costs can escalate quickly. A single runaway process calling GPT-4o in a loop can burn through thousands of dollars in hours. Effective cost controls require multiple layers: per-team quotas, smart caching, model tiering, and usage tracking.
Per-Team Quotas
Rate limiting is the first line of defense, but for AI governance you need more than simple requests-per-minute limits. You need quotas that map to business units and budgets.
With Zuplo, you can configure rate limiting per API key, which maps directly to teams or applications:
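A rate-limit policy configuration along these lines -- the limit values here are placeholders to adjust per tier, and you should verify option names against the current Zuplo policy reference:

```json
{
  "name": "per-team-ai-rate-limit",
  "policyType": "rate-limit-inbound",
  "handler": {
    "export": "RateLimitInboundPolicy",
    "module": "$import(@zuplo/runtime)",
    "options": {
      "rateLimitBy": "user",
      "requestsAllowed": 10000,
      "timeWindowMinutes": 1440
    }
  }
}
```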
The rateLimitBy: "user" configuration ensures each API consumer gets their own quota bucket. Set requestsAllowed to the daily limit appropriate for each tier, and the gateway enforces it automatically.
For monthly spend caps, you need to track cumulative token usage. More on that in the token-based billing section below.
Semantic Caching
If multiple users or applications send identical (or near-identical) prompts to the same model, you're paying for the same computation repeatedly. Semantic caching intercepts these duplicate requests and serves the cached response instead.
The concept works like this:
- A request comes in with a prompt.
- The gateway computes a hash of the prompt (and relevant parameters like model, temperature, and system prompt).
- If a cached response exists for that hash, return it immediately.
- If not, forward the request to the model, cache the response, and return it.
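The hashing and lookup steps above can be sketched like this, using Node's built-in crypto module. A Map stands in for what would be a shared cache (such as Redis or the gateway's own cache) in production:

```typescript
import { createHash } from "node:crypto";

// Hash exactly the fields that determine the response, so identical
// requests produce identical keys.
export function cacheKey(
  model: string,
  temperature: number,
  systemPrompt: string,
  userPrompt: string
): string {
  return createHash("sha256")
    .update(JSON.stringify({ model, temperature, systemPrompt, userPrompt }))
    .digest("hex");
}

const cache = new Map<string, string>(); // stand-in for a shared cache

// Serve a cached response if one exists; otherwise call the model,
// cache the result, and return it.
export async function cachedCompletion(
  key: string,
  callModel: () => Promise<string>
): Promise<string> {
  const hit = cache.get(key);
  if (hit !== undefined) return hit;
  const response = await callModel();
  cache.set(key, response);
  return response;
}
```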
This is especially effective for common operations like classification, extraction from templates, and FAQ-style queries where the same questions recur frequently. Hit rates vary by workload, but for repetitive AI traffic cache hit rates in the 15-40% range are common, translating directly to cost savings.
For prompts that aren't identical but semantically similar, you can use embedding similarity to match against cached responses. This adds complexity but can significantly increase hit rates for use cases like customer support where the same question gets phrased many different ways.
Model Tiering for Cost Optimization
Not every request needs the most capable (and expensive) model. A request classifier at the gateway level can route simple requests to cheaper models automatically.
Consider this pattern:
- Simple lookups and classifications: Route to GPT-4o-mini or Claude Haiku. These models handle straightforward tasks at a fraction of the cost.
- Standard generation and summarization: Route to GPT-4o or Claude Sonnet. Good balance of quality and cost.
- Complex reasoning and analysis: Route to Claude Opus, o1-pro, or specialized models. Reserve these for tasks that genuinely need them.
You can implement this as a gateway policy that inspects request metadata -- a custom header like X-AI-Priority or a field in the request body -- and routes accordingly:
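A sketch of that routing logic, assuming the X-AI-Priority header described above. The priority-to-model mapping and the backend URL are illustrative placeholders:

```typescript
export const PRIORITY_MODEL: Record<string, string> = {
  low: "gpt-4o-mini", // simple lookups and classification
  standard: "gpt-4o", // generation and summarization
  high: "claude-opus", // complex reasoning
};

// Pure resolver: unknown or missing priorities fall back to "standard".
export function resolveModel(priority: string | null): string {
  return PRIORITY_MODEL[priority ?? "standard"] ?? PRIORITY_MODEL["standard"];
}

interface PolicyRequest {
  headers: { get(name: string): string | null };
  json(): Promise<Record<string, unknown>>;
}

// Inbound policy sketch: overwrite the model field before forwarding.
// The backend URL is a hypothetical internal endpoint.
export async function modelTieringPolicy(
  request: PolicyRequest
): Promise<Request> {
  const priority = request.headers.get("X-AI-Priority");
  const body = await request.json();
  body.model = resolveModel(priority);
  return new Request("https://ai-backend.internal/v1/chat/completions", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(body),
  });
}
```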
Application teams set the priority based on the use case. The gateway handles the rest. This approach can reduce AI spend by 30-60% for organizations with a mix of simple and complex AI workloads.
Token-Based Billing and Metering
To enforce monthly spend caps and provide accurate usage reporting, you need to track token consumption per consumer. Zuplo's metering capabilities let you log token usage alongside standard request metrics.
Here's an outbound policy that extracts token usage from the AI provider's response and records it:
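A sketch of that outbound policy. The parseUsage helper handles both OpenAI-style usage objects (prompt_tokens/completion_tokens) and Anthropic-style ones (input_tokens/output_tokens); the log callback is a placeholder for Zuplo's logger or your metering pipeline:

```typescript
export interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Extract token counts from a provider response body, defaulting to zero
// when the usage object is missing or unrecognized.
export function parseUsage(body: unknown): TokenUsage {
  const u = (body as { usage?: Record<string, number> })?.usage ?? {};
  return {
    inputTokens: u.prompt_tokens ?? u.input_tokens ?? 0,
    outputTokens: u.completion_tokens ?? u.output_tokens ?? 0,
  };
}

// Outbound policy sketch: record usage and pass the response through.
export async function meterTokensPolicy(
  response: Response,
  log: (entry: object) => void,
  consumer: string
): Promise<Response> {
  const clone = response.clone(); // don't consume the body we return
  const usage = parseUsage(await clone.json().catch(() => ({})));
  log({ consumer, ...usage, at: new Date().toISOString() });
  return response;
}
```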
This data feeds into dashboards and alerting. When a team approaches their monthly budget, you can trigger warnings. When they hit the cap, the rate limiter kicks in. Finance gets a clear report of AI spend by team, model, and application.
Spend Limit Enforcement
Combining token tracking with spend limits creates a hard cap on AI costs. Here's an inbound policy that checks cumulative spend before allowing a request:
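A sketch of that check. A Map stands in for the shared store (for example Zuplo's key-value storage or Redis) that would hold cumulative spend in production, and the cap values are illustrative:

```typescript
const spendUsdByTeam = new Map<string, number>();

// Add a request's cost to the team's running total and return the total.
export function recordSpend(team: string, usd: number): number {
  const total = (spendUsdByTeam.get(team) ?? 0) + usd;
  spendUsdByTeam.set(team, total);
  return total;
}

export function isOverCap(team: string, capUsd: number): boolean {
  return (spendUsdByTeam.get(team) ?? 0) >= capUsd;
}

// Inbound policy sketch: reject before the model is ever called once the
// team's monthly cap is reached; null means "allow the request".
export function spendLimitPolicy(
  team: string,
  capUsd: number
): Response | null {
  if (isOverCap(team, capUsd)) {
    return new Response(
      JSON.stringify({
        error: `Monthly AI spend cap of $${capUsd} reached for ${team}`,
      }),
      { status: 429, headers: { "content-type": "application/json" } }
    );
  }
  return null;
}
```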
This gives teams a clear, predictable boundary. No surprises on the monthly AI bill.
Compliance and Audit
Cost controls protect your budget. Compliance controls protect your business. For organizations in regulated industries -- or any company handling customer data -- AI governance requires robust auditing, data protection, and residency controls.
Audit Logging
Every AI request should be logged with enough context to answer these questions during an audit:
- Who made the request? (User identity, team, application)
- What model was called, with what parameters?
- When did the request occur?
- How many tokens were consumed?
- What was the response status?
Here's a Zuplo policy that creates comprehensive audit logs:
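A sketch of the record-building core of such a policy. The field names are illustrative and should be adapted to your SIEM's schema; in a real Zuplo outbound policy the inputs would come from the request context and the provider response:

```typescript
export interface AuditRecord {
  timestamp: string; // when the request occurred
  userId: string; // who made the request
  team: string;
  model: string; // which model was called
  status: number; // response status
  inputTokens: number; // tokens consumed
  outputTokens: number;
}

// Build one structured audit record per request, answering the audit
// questions listed above.
export function buildAuditRecord(params: {
  userId: string;
  team: string;
  model: string;
  status: number;
  inputTokens: number;
  outputTokens: number;
}): AuditRecord {
  return { timestamp: new Date().toISOString(), ...params };
}
```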
Ship these logs to your SIEM (Splunk, Datadog, etc.) or a dedicated audit store. The key is making them immutable and queryable. When the compliance team asks "which teams used GPT-4o to process customer data last quarter?", you should be able to answer in minutes, not weeks.
PII Safeguards
One of the biggest risks with external AI services is inadvertently sending personally identifiable information (PII) to a third-party provider. An inbound policy can scan request payloads before they leave your network:
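A minimal regex-based sketch. These patterns are intentionally simple illustrations -- they will miss edge cases and produce some false positives, which is exactly why a dedicated detector is recommended below:

```typescript
export const PII_PATTERNS: Record<string, RegExp> = {
  email: /[\w.+-]+@[\w-]+\.[\w.-]+/,
  usSsn: /\b\d{3}-\d{2}-\d{4}\b/,
  creditCard: /\b(?:\d[ -]?){13,16}\b/,
};

// Return the names of all pattern categories found in the text.
export function findPii(text: string): string[] {
  return Object.entries(PII_PATTERNS)
    .filter(([, pattern]) => pattern.test(text))
    .map(([name]) => name);
}

// Inbound policy sketch: block the request if any prompt contains PII;
// null means "no PII found, let the request proceed".
export function piiGuard(promptText: string): Response | null {
  const hits = findPii(promptText);
  if (hits.length > 0) {
    return new Response(
      JSON.stringify({
        error: "Request blocked: possible PII detected",
        categories: hits,
      }),
      { status: 400, headers: { "content-type": "application/json" } }
    );
  }
  return null;
}
```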
For production deployments, you'll want more sophisticated PII detection -- potentially using a dedicated NLP model or a service like Microsoft Presidio. The regex-based approach above catches the most common patterns and serves as a first layer of defense.
You can also implement PII redaction instead of blocking, replacing detected PII with placeholders before the request reaches the model. This lets the request proceed while protecting sensitive data.
Data Residency
For organizations operating across regions, data residency requirements dictate where AI processing can happen. European customer data might need to stay within the EU. Healthcare data might need to remain in specific jurisdictions.
The gateway can enforce this by routing requests to region-specific model endpoints based on the user's location or data classification:
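A sketch of the routing table. The endpoint URLs are hypothetical placeholders for region-pinned deployments, and the residency value would come from a JWT claim or data-classification header:

```typescript
export const REGION_ENDPOINTS: Record<string, string> = {
  eu: "https://eu-models.example.com/v1", // EU-hosted deployment
  us: "https://us-models.example.com/v1", // US-hosted deployment
};

// Resolve the backend from a data-residency value; default to "us" when
// no residency requirement is present.
export function resolveRegionEndpoint(residency: string | undefined): string {
  return REGION_ENDPOINTS[residency ?? "us"] ?? REGION_ENDPOINTS["us"];
}
```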
This approach works well with Azure OpenAI Service deployments, which let you host models in specific Azure regions. For other providers, you may need to maintain separate accounts or use provider-specific regional endpoints.
Retention Policies
AI request and response logs can contain sensitive information -- the prompts themselves, the generated content, and metadata about usage patterns. Define clear retention policies:
- Audit metadata (who, when, which model, token count): Retain for 12-24 months for compliance. This data is small and doesn't contain sensitive content.
- Full request/response payloads: Retain for 30-90 days for debugging and quality monitoring. Auto-delete after the retention period.
- PII-flagged requests: Either don't log the payload at all, or encrypt it with a key that gets rotated on a schedule.
Configure your logging pipeline to separate these tiers. The audit metadata goes to your long-term compliance store. Full payloads go to a time-limited store with automatic expiry. This balances operational needs with data minimization principles.
Practical Implementation with Zuplo
Zuplo's AI Gateway brings these governance patterns together in a single platform. Here's how the pieces fit.
JWT Validation and Claim-Based Routing
Zuplo's built-in JWT authentication policy validates tokens and extracts claims automatically. Combine it with the tier-routing policy from earlier:
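A route configuration fragment showing the chain -- the policy names are whatever you registered in your policies config, and the exact handler shape should be checked against the current Zuplo routing documentation:

```json
{
  "/v1/chat/completions": {
    "post": {
      "x-zuplo-route": {
        "handler": {
          "export": "urlForwardHandler",
          "module": "$import(@zuplo/runtime)",
          "options": { "baseUrl": "https://api.openai.com" }
        },
        "policies": {
          "inbound": ["jwt-auth", "tier-routing"]
        }
      }
    }
  }
}
```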
The JWT policy runs first, validating the token and populating request.user with the token claims. The tier-routing policy then reads the tier claim and routes the request to the appropriate model endpoint.
Rate Limiting Per API Key
For teams that use API keys instead of JWTs, Zuplo's API key authentication gives you per-consumer rate limiting out of the box. Each API key can be assigned metadata (team, tier, spend limit) that your policies can reference.
The rate limiting policy applies per-key quotas automatically. You can set different limits for different keys through the Zuplo Developer Portal, where consumers self-serve their API keys and you control the access parameters.
Custom Logging for Audit Trail
Chain the audit logging policy as an outbound handler on your AI routes. Every request that passes through gets logged with full context:
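A route's policy section might chain the full set like this (the policy names are illustrative; each one refers to a policy defined in your Zuplo config):

```json
{
  "policies": {
    "inbound": ["jwt-auth", "tier-routing", "pii-guard", "spend-limit"],
    "outbound": ["token-metering", "audit-log"]
  }
}
```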
Notice the policy chain: inbound policies handle authentication, routing, PII scanning, and spend limits. Outbound policies capture token usage and write the audit log. This layered approach means each policy does one thing well, and you can mix and match them across different AI routes.
Bringing It All Together
A complete AI governance setup in Zuplo looks like this:
- Authentication: JWT or API key validation on every request.
- Authorization: Tier-based access control using token claims or key metadata.
- PII protection: Inbound scan for sensitive data before it leaves your network.
- Cost controls: Rate limiting, spend caps, and model tiering to keep costs predictable.
- Audit trail: Comprehensive logging of every request with identity, model, tokens, and cost.
- Data residency: Region-based routing for compliance with local regulations.
Each layer is a separate policy, configured declaratively and applied at the gateway level. No changes to application code. No reliance on developers remembering to implement controls. The governance is built into the infrastructure.
Governance Checklist
Before going to production with AI APIs, make sure you've addressed each of these items:
Access Controls
- Every AI API consumer is authenticated (JWT or API key)
- Access tiers are defined and mapped to specific models
- An approval workflow exists for new AI service access requests
- Unused API keys and access grants are reviewed and revoked quarterly
Cost Management
- Per-team or per-application rate limits are configured
- Monthly spend caps are set and enforced at the gateway
- Semantic caching is enabled for high-volume, repetitive workloads
- Model tiering routes low-priority requests to cost-effective models
- Token usage is tracked and reported per consumer
Compliance and Data Protection
- PII scanning is enabled on inbound requests to external AI providers
- Data residency requirements are mapped to regional model endpoints
- Audit logs capture identity, model, tokens, cost, and timestamp for every request
- Audit logs are shipped to an immutable store with appropriate retention policies
- Full request/response payload logging has a defined retention period with auto-expiry
Operational Readiness
- Alerting is configured for spend anomalies and rate limit breaches
- A runbook exists for responding to AI-related incidents (data exposure, cost spikes)
- The governance configuration is version-controlled and reviewed through the same process as application code
- Dashboards show real-time AI usage by team, model, and cost
Organizational
- Roles and responsibilities for AI governance are documented
- The platform team has the authority to enforce controls at the gateway level
- Application teams understand the available tiers and how to request changes
- Finance receives regular reports on AI spend by department
This checklist is a starting point. Your organization may have additional requirements based on your industry, regulatory environment, and risk tolerance. The important thing is that governance is explicit, enforced at the infrastructure level, and not left to individual teams to implement on their own.
Get Started with Zuplo's AI Gateway
AI governance doesn't have to be a bottleneck. With the right architecture -- an API gateway as the enforcement point, clear policies, and layered controls -- you can give teams the AI capabilities they need while maintaining the visibility and control your organization requires.
Zuplo's AI Gateway gives you the building blocks: JWT and API key authentication, programmable policies in TypeScript, per-consumer rate limiting, and comprehensive logging. You can start with basic access controls and add layers as your AI usage matures.
Sign up for Zuplo and deploy your first AI governance policy in minutes. Your finance team and compliance team will both thank you.