---
title: "AI Firewalls Are a Layer Not a Wall"
description: "AI firewalls and guardrails are a real defense-in-depth layer, and Anthropic builds them too. But probabilistic detection cannot be the primary control for an autonomous agent. The wall has to be deterministic."
canonicalUrl: "https://zuplo.com/blog/2026/07/02/ai-firewalls-layer-not-wall"
pageType: "blog"
date: "2026-07-02"
authors: "nate"
tags: "ai-agents, API Security, AI"
image: "https://zuplo.com/og?text=AI%20Firewalls%20Are%20a%20Layer%20Not%20a%20Wall"
---
The AI security vendors pitching us this year share a recurring promise: drop
our firewall in front of your model and it blocks prompt injection. These
products catch attacks and earn their spot in the stack. What worries me is the
verb. "Blocks" is a guarantee, and the thing under the hood is a probability.

<CalloutAudience
  variant="useIf"
  items={[
    `Evaluating an AI firewall or guardrail product for an agent endpoint`,
    `Exposing tools or MCP servers to Claude Code, Cursor, or ChatGPT`,
    `On the hook to tell a security reviewer what actually stops an attack`,
  ]}
/>

## Even Anthropic builds guardrails

Guardrails are not snake oil. Anthropic, the company training the models these
firewalls sit in front of, builds classifier defenses itself.
[Constitutional Classifiers](https://www.anthropic.com/research/constitutional-classifiers)
are input and output classifiers trained on synthetic data that, in their words,
"filter the overwhelming majority of jailbreaks with minimal over-refusals."

The same approach runs at the tool layer. In
[how Anthropic contains Claude](https://www.anthropic.com/engineering/how-we-contain-claude),
tool responses route through proxies that can run a classifier over a return
value before it re-enters the model's context, and that classifier "can be a
small, fast model."

The people who understand the threat best ship these defenses. Which is exactly
why it is worth listening when they say where the defenses stop.

## The model layer can't stand alone

The same containment writeup is candid about the ceiling the marketing copy
skips. Three numbers from it, each measuring the model layer's own defenses:

| What Anthropic measured                                       | Failure rate |
| ------------------------------------------------------------- | ------------ |
| Prompt injection, single attempt (Claude Opus 4.7, Gray Swan) | ~0.1%        |
| Prompt injection, after 100 adaptive attempts                 | 5 to 6%      |
| Claude Code auto mode, overeager actions allowed through      | ~17%         |

The first attempt almost never lands. Patience and the second row do, and the
third is a different defense leaking at a far higher rate.

Anthropic's own conclusion is the sentence to take to your security review:
"protection in the model layer will never be 100% effective, which is why it
can't stand alone." If the lab with the best model and the best classifiers will
not lean its safety on detection, neither should you.

## A 95% wall is a 100% breach with patience

A firewall product is in a weaker spot than Anthropic's own stack. It usually
runs a lighter, faster classifier than a frontier model like Opus so it can sit
inline without adding latency, and the attacker gets unlimited adaptive retries
against it.

A control that catches 95% of injection attempts on the first try, the number
these products tend to advertise, is not a 95% wall against a determined
adversary, it is a delay. Given enough attempts, the gap is the whole attack
surface.

Simon Willison made this point before the current wave of products. Reviewing
guardrail vendors in
[his lethal trifecta writeup](https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/),
he notes they "almost always carry confident claims that they capture '95% of
attacks' or similar," then delivers the line: "in web application security 95%
is very much a failing grade."

A SQL injection filter that blocks 95% of payloads is a vulnerable application,
and a prompt-injection filter is no different once you accept the attacker will
iterate.

<CalloutTip variant="mistake">
  Reading a detection rate as a guarantee. "Catches 95% of prompt injection" is
  a property of a probabilistic classifier facing a fixed test set, not a
  promise about an adversary who retries. Treat it as a layer that lowers
  volume, never as the control that makes an action safe.
</CalloutTip>

## Buy the firewall, but build a deterministic wall

The resolution is not to throw the firewall out, it is to stop asking it to be
the wall. Probabilistic detection lowers the volume of attempts that reach the
thing that actually has to hold, and that thing is deterministic. It has no
success rate against adaptive attacks because it is not pattern-matching content
at all.

Four controls do the load-bearing work, and none of them is a classifier:

| Control                          | What it enforces                                                                               |
| -------------------------------- | ---------------------------------------------------------------------------------------------- |
| Identity                         | Every call is tied to a caller and the agent acting on a user's behalf                         |
| Token scope and audience binding | A token minted for one server is rejected at another, so a compromised server cannot replay it |
| Tool curation                    | The agent can only invoke the tools you exposed                                                |
| Egress control                   | Data cannot leave through a path you did not sanction                                          |

An attacker who jailbreaks the model still hits these, and they answer the same
way whether the prompt was clever or not.

Say the injection works and the model is convinced to call a tool it should not:
the audience-bound token it carries is rejected the moment it is presented at a
server it was not minted for, full stop, no classifier consulted. As Anthropic
puts it, the deterministic boundary is what gets hit when everything
probabilistic misses.

Our
[Q1 2026 API and agent security scorecard](/blog/q1-2026-api-agent-security-scorecard)
tracks how those miss rates play out against real endpoints.

## Zuplo ships the wall, not the guess

We built this boundary for MCP and we use it ourselves. The deterministic
controls are the guarantee:
[authentication, audience-bound tokens](https://zuplo.com/docs/mcp-gateway/auth/overview),
no token passthrough, and
[per-route tool curation](https://zuplo.com/docs/mcp-gateway/capability-filtering).
They have no success rate to quote, because they do not detect, they enforce.

Defense in depth is a real option on top of that, not a replacement. You can
compose the
[Akamai AI Firewall](https://zuplo.com/docs/ai-gateway/policies/akamai-ai-firewall)
onto an AI or MCP route for prompt-injection and data-loss screening.

You can also write a
[custom outbound TypeScript policy](https://zuplo.com/docs/policies/custom-code-outbound)
to inspect or redact a tool response before it re-enters the model's context.
Both lower the volume of bad input, but neither is what I would point a security
reviewer at when they ask what stops the attack.

That outbound policy matters more than it looks, because
[prompt injection in MCP flows backwards](/blog/protect-mcp-against-prompt-injection),
arriving in a tool response rather than the user prompt. The layer that scans it
has to watch what comes back, not just what the user typed.

<CalloutDoc
  title="MCP Gateway authentication"
  description="OAuth, audience-bound tokens, and no token passthrough. The deterministic controls that hold whether or not a classifier catches the prompt."
  href="https://zuplo.com/docs/mcp-gateway/auth/overview"
  icon="book"
/>

Buy the firewall, and run it as a layer that lowers the volume of attempts. Just
build the wall out of identity, token scope, tool curation, and egress, the
controls that hold the same way on the millionth adaptive retry as on the first.
That is the slice Zuplo ships off the shelf.