Zuplo

AI Firewalls Are a Layer Not a Wall

Nate Totten
July 2, 2026

AI firewalls and guardrails are a real defense-in-depth layer, and Anthropic builds them too. But probabilistic detection cannot be the primary control for an autonomous agent. The wall has to be deterministic.

The AI security vendors pitching us this year share a recurring promise: drop our firewall in front of your model and it blocks prompt injection. These products catch attacks and earn their spot in the stack. What worries me is the verb. “Blocks” is a guarantee, and the thing under the hood is a probability.

Use this approach if you're:
  • Evaluating an AI firewall or guardrail product for an agent endpoint
  • Exposing tools or MCP servers to Claude Code, Cursor, or ChatGPT
  • On the hook to tell a security reviewer what actually stops an attack

Even Anthropic builds guardrails

Guardrails are not snake oil. Anthropic, the company training the models these firewalls sit in front of, builds classifier defenses itself. Constitutional Classifiers are input and output classifiers trained on synthetic data that, in their words, “filter the overwhelming majority of jailbreaks with minimal over-refusals.”

The same approach runs at the tool layer. In Anthropic’s writeup on how it contains Claude, tool responses route through proxies that can run a classifier over a return value before it re-enters the model’s context, and that classifier “can be a small, fast model.”

The people who understand the threat best ship these defenses. Which is exactly why it is worth listening when they say where the defenses stop.

The model layer can’t stand alone

The same containment writeup is candid about the ceiling the marketing copy skips. Three numbers from it, each measuring the model layer’s own defenses:

What Anthropic measured | Failure rate
Prompt injection, single attempt (Claude Opus 4.7, Gray Swan) | ~0.1%
Prompt injection, after 100 adaptive attempts | 5 to 6%
Claude Code auto mode, overeager actions allowed through | ~17%

The first attempt almost never lands. A patient attacker does far better, as the second row shows, and the third row is a different defense leaking at a far higher rate.

Anthropic’s own conclusion is the sentence to take to your security review: “protection in the model layer will never be 100% effective, which is why it can’t stand alone.” If the lab with the best model and the best classifiers will not rest its safety on detection alone, neither should you.

A 95% wall is a 100% breach with patience

A firewall product is in a weaker spot than Anthropic’s own stack. It usually runs a lighter, faster classifier than a frontier model like Opus so it can sit inline without adding latency, and the attacker gets unlimited adaptive retries against it.

A control that catches 95% of injection attempts on the first try, the number these products tend to advertise, is not a 95% wall against a determined adversary. It is a delay. Given enough attempts, the 5% gap is the whole attack surface.

Simon Willison made this point before the current wave of products. Reviewing guardrail vendors in his lethal trifecta writeup, he notes they “almost always carry confident claims that they capture ‘95% of attacks’ or similar,” then delivers the line: “in web application security 95% is very much a failing grade.”

A SQL injection filter that blocks 95% of payloads is a vulnerable application, and a prompt-injection filter is no different once you accept the attacker will iterate.
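The retry argument can be made concrete with a little arithmetic. The sketch below assumes every attempt is independent and slips past the filter with the same probability, which is a simplification: an adaptive attacker learns from each miss and would likely do better. The names `breachProbability` and `attemptsToReach` are illustrative, not part of any product.

```typescript
// Chance that at least one of `attempts` tries gets past a filter that
// catches each try with probability `detectionRate`.
// ASSUMPTION: attempts are independent and identically likely to succeed.
function breachProbability(detectionRate: number, attempts: number): number {
  if (detectionRate < 0 || detectionRate > 1) {
    throw new RangeError("detectionRate must be between 0 and 1");
  }
  if (!Number.isInteger(attempts) || attempts < 0) {
    throw new RangeError("attempts must be a non-negative integer");
  }
  // P(all attempts caught) = detectionRate^attempts; the breach is the complement.
  return 1 - Math.pow(detectionRate, attempts);
}

// Smallest number of attempts at which the cumulative breach chance
// reaches `target`, under the same independence assumption.
function attemptsToReach(detectionRate: number, target: number): number {
  if (detectionRate <= 0 || detectionRate >= 1 || target <= 0 || target >= 1) {
    throw new RangeError("detectionRate and target must be strictly between 0 and 1");
  }
  // Solve 1 - detectionRate^n >= target for n.
  return Math.ceil(Math.log(1 - target) / Math.log(detectionRate));
}
```

Under that assumption, a filter that catches 95% of attempts is more likely than not to be beaten within 14 tries, and the cumulative odds of a breach pass 99% at 90 tries.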

Common mistake:

Reading a detection rate as a guarantee. “Catches 95% of prompt injection” is a property of a probabilistic classifier facing a fixed test set, not a promise about an adversary who retries. Treat it as a layer that lowers volume, never as the control that makes an action safe.

Buy the firewall, but build a deterministic wall

The resolution is not to throw the firewall out but to stop asking it to be the wall. Probabilistic detection lowers the volume of attempts that reach the thing that actually has to hold, and that thing is deterministic. A deterministic control has no detection rate for adaptive attacks to erode, because it is not pattern-matching content at all.

Four controls do the load-bearing work, and none of them is a classifier:

Control | What it enforces
Identity | Every call is tied to a caller and the agent acting on a user’s behalf
Token scope and audience binding | A token minted for one server is rejected at another, so a compromised server cannot replay it
Tool curation | The agent can only invoke the tools you exposed
Egress control | Data cannot leave through a path you did not sanction

An attacker who jailbreaks the model still hits these, and they answer the same way whether the prompt was clever or not.
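As a rough illustration of why these controls answer the same way every time, tool curation and egress control reduce to set membership checks. The tool names, hosts, and function names below are hypothetical, and a real gateway would load the lists from route configuration rather than hard-code them.

```typescript
// HYPOTHETICAL allowlists for illustration only.
const EXPOSED_TOOLS: ReadonlySet<string> = new Set(["search_docs", "get_invoice"]);
const ALLOWED_EGRESS_HOSTS: ReadonlySet<string> = new Set(["api.example.com"]);

// Tool curation: a tool that was never exposed cannot be invoked,
// no matter what the prompt persuaded the model to ask for.
function isToolAllowed(toolName: string): boolean {
  return EXPOSED_TOOLS.has(toolName);
}

// Egress control: outbound requests must use HTTPS and target an exact,
// sanctioned hostname. Unparseable URLs are denied.
function isEgressAllowed(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return false;
  }
  return url.protocol === "https:" && ALLOWED_EGRESS_HOSTS.has(url.hostname);
}
```

Neither function looks at prompt content, so there is nothing for an attacker to rephrase around. The exact-match hostname check is a deliberate choice here: suffix or substring matching would let a lookalike such as `api.example.com.evil.test` through.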

Say the injection works and the model is convinced to call a tool it should not: the audience-bound token it carries is rejected the moment it is presented at a server it was not minted for, full stop, no classifier consulted. As Anthropic puts it, the deterministic boundary is what gets hit when everything probabilistic misses.
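The audience check in that scenario can be sketched in a few lines. This is a minimal illustration, not Zuplo's implementation: it assumes the token's signature has already been verified by a JWT library, and the `acceptsToken` function, its parameters, and the example URLs are hypothetical. The claim names `aud`, `exp`, and `scope` follow the standard JWT and OAuth conventions.

```typescript
// Claims from an ALREADY signature-verified token (verification is out of
// scope for this sketch and must happen first in any real deployment).
interface TokenClaims {
  sub: string;
  aud: string | string[]; // JWT allows a single audience or a list
  exp: number;            // expiry, seconds since the Unix epoch
  scope?: string;         // space-delimited OAuth scopes
}

// Deterministic acceptance check: the token must be minted for this server,
// still be valid, and carry the scope the requested tool needs.
function acceptsToken(
  claims: TokenClaims,
  thisServer: string,
  requiredScope: string,
  nowSeconds: number
): boolean {
  const audiences = Array.isArray(claims.aud) ? claims.aud : [claims.aud];
  if (!audiences.includes(thisServer)) return false; // minted for another server
  if (claims.exp <= nowSeconds) return false;        // expired
  const scopes = (claims.scope ?? "").split(" ").filter((s) => s.length > 0);
  return scopes.includes(requiredScope);
}
```

A token minted for a hypothetical `https://mcp.billing.example.com` fails the first comparison when presented at `https://mcp.hr.example.com`, and the outcome does not depend on anything the model was told.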

Our Q1 2026 API and agent security scorecard tracks how those miss rates play out against real endpoints.

Zuplo ships the wall, not the guess

We built this boundary for MCP and we use it ourselves. The deterministic controls are the guarantee: authentication, audience-bound tokens, no token passthrough, and per-route tool curation. They have no success rate to quote because they do not detect; they enforce.

Defense in depth is a real option on top of that, not a replacement. You can compose the Akamai AI Firewall onto an AI or MCP route for prompt-injection and data-loss screening.

You can also write a custom outbound TypeScript policy to inspect or redact a tool response before it re-enters the model’s context. Both lower the volume of bad input, but neither is what I would point a security reviewer at when they ask what stops the attack.

That outbound policy matters more than it looks, because prompt injection in MCP flows backwards, arriving in a tool response rather than the user prompt. The layer that scans it has to watch what comes back, not just what the user typed.
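To make the idea of screening a tool response concrete, here is a minimal sketch of the inspect-and-redact step as a plain function over the response body. It deliberately does not use Zuplo's policy signature; `screenToolResponse`, the `ScreenResult` shape, and both patterns are hypothetical, and a real policy would need far broader rules.

```typescript
interface ScreenResult {
  text: string;        // body with matched secrets replaced
  redactions: number;  // how many secrets were replaced
  suspicious: boolean; // true if the body contains a known injection phrase
}

// HYPOTHETICAL patterns: strings shaped like "sk-..." / "pk-..." keys, and
// one well-known injection phrase. Real rule sets would be much larger.
const SECRET_PATTERN = /\b(?:sk|pk)-[A-Za-z0-9]{16,}\b/g;
const INSTRUCTION_PATTERN = /ignore (?:all )?(?:previous|prior) instructions/i;

// Screens a tool's return value before it re-enters the model's context.
function screenToolResponse(body: string): ScreenResult {
  let redactions = 0;
  const text = body.replace(SECRET_PATTERN, () => {
    redactions += 1;
    return "[REDACTED]";
  });
  return { text, redactions, suspicious: INSTRUCTION_PATTERN.test(body) };
}
```

This is pattern matching, so it belongs to the probabilistic layer: an attacker can rephrase around it, which is why it reduces volume rather than serving as the control that has to hold.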

MCP Gateway authentication

OAuth, audience-bound tokens, and no token passthrough. The deterministic controls that hold whether or not a classifier catches the prompt.

Buy the firewall, and run it as a layer that lowers the volume of attempts. Build the wall itself out of identity, token scope, tool curation, and egress control, the controls that hold the same way on the millionth adaptive retry as on the first. That is the slice Zuplo ships off the shelf.