---
title: "How to implement a circuit breaker at the API gateway"
description: "When a backend fails, retry storms can make recovery even harder. Learn how to implement the circuit breaker pattern as custom TypeScript policies in Zuplo to automatically stop traffic to failing services, with per-route thresholds and RFC 7807 error responses."
canonicalUrl: "https://zuplo.com/blog/2026/03/17/how-to-implement-circuit-breaker-at-the-api-gateway"
pageType: "blog"
date: "2026-03-17"
authors: "martyn"
tags: "API Gateway"
image: "https://zuplo.com/og?text=How%20to%20Implement%20a%20Circuit%20Breaker%20at%20the%20API%20Gateway"
---

Imagine this scenario: Your backend goes down. Every client retries simultaneously. The retry storm adds more load, making recovery harder. Meanwhile your gateway is burning resources on requests that will never succeed. Sounds like a bad day, right?

Fortunately, there's an approach that you can use to prevent this at the gateway level. The **circuit breaker pattern** monitors backend health and automatically stops forwarding traffic when a service is failing, giving it time to recover without being hammered by doomed requests.

## The three states

A circuit breaker is a state machine with three states:

1. **Closed**: Requests flow normally. The breaker tracks failures in a rolling window.
2. **Open**: Failures exceeded the threshold. All requests immediately get a 503 response. No traffic reaches the backend.
3. **Half-open**: After a cool down period, the breaker allows a test request through. If it succeeds, the circuit closes. If it fails, it opens again.

For a deeper look at the pattern and how it fits into a broader resilience strategy (retries, timeouts, bulkheads), see the [API Gateway Resilience and Fault Tolerance](https://zuplo.com/learning-center/api-gateway-resilience-fault-tolerance) article in our learning center.
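
Before looking at gateway code, it can help to see the three states as a plain transition function. The sketch below is illustrative only: the `Breaker`, `BreakerConfig`, and `BreakerEvent` types and the `transition` function are hypothetical names made up for this example, not Zuplo APIs. It uses a simple failure counter rather than a rolling window to keep the logic short.

```typescript
// Illustrative sketch only: every name here is an assumption for this example.
type BreakerState = "closed" | "open" | "half-open";

interface Breaker {
  state: BreakerState;
  failures: number;
  openedAt: number; // epoch milliseconds when the circuit last opened
}

interface BreakerConfig {
  failureThreshold: number;
  cooldownMs: number;
}

type BreakerEvent =
  | { kind: "success" }
  | { kind: "failure"; at: number }
  | { kind: "tick"; at: number }; // a request arrives and the clock is checked

function transition(b: Breaker, e: BreakerEvent, cfg: BreakerConfig): Breaker {
  if (b.state === "closed") {
    // Closed: only failures matter; open once the threshold is reached.
    if (e.kind !== "failure") return b;
    const failures = b.failures + 1;
    if (failures >= cfg.failureThreshold) {
      return { state: "open", failures, openedAt: e.at };
    }
    return { ...b, failures };
  }

  if (b.state === "open") {
    // Open: only the passage of time moves the circuit forward.
    if (e.kind === "tick" && e.at - b.openedAt >= cfg.cooldownMs) {
      return { ...b, state: "half-open" };
    }
    return b;
  }

  // Half-open: the test request decides whether to close or reopen.
  if (e.kind === "success") {
    return { state: "closed", failures: 0, openedAt: 0 };
  }
  if (e.kind === "failure") {
    return { ...b, state: "open", openedAt: e.at };
  }
  return b;
}
```

Keeping the transitions in a pure function like this makes the behavior easy to unit test separately from wherever the state is stored.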
## The implementation

In a programmable gateway, you can implement this as two custom policies that share state: an inbound policy that checks the circuit before each request, and an outbound policy that tracks failures from backend responses. The shared state lives in [ZoneCache](https://zuplo.com/docs/programmable-api/zone-cache), Zuplo's low-latency cache within each deployment zone.

### Inbound policy: check the circuit

This policy runs before the request reaches your backend. If the circuit is open, it short-circuits and returns a 503 immediately.

```ts
// modules/circuit-breaker-inbound.ts
import {
  ZuploContext,
  ZuploRequest,
  ZoneCache,
  HttpProblems,
} from "@zuplo/runtime";

interface CircuitState {
  failures: number;
  lastFailure: number;
  state: "closed" | "open" | "half-open";
}

interface CircuitBreakerOptions {
  failureThreshold: number;
  cooldownSeconds: number;
  backendId: string;
  stateTtlSeconds?: number;
}

const DEFAULT_STATE: CircuitState = {
  failures: 0,
  lastFailure: 0,
  state: "closed",
};

export default async function circuitBreakerInbound(
  request: ZuploRequest,
  context: ZuploContext,
  options: CircuitBreakerOptions,
  policyName: string,
) {
  const cache = new ZoneCache("circuit-breaker", context);
  const cacheKey = `cb:${options.backendId}`;
  const state = (await cache.get(cacheKey)) ?? { ...DEFAULT_STATE };

  if (state.state === "open") {
    const elapsed = Date.now() - state.lastFailure;
    if (elapsed < options.cooldownSeconds * 1000) {
      // Still within cooldown, reject immediately
      context.log.warn(`Circuit open for backend '${options.backendId}'.`);
      return HttpProblems.serviceUnavailable(request, context, {
        detail: `Service temporarily unavailable. Retry after ${options.cooldownSeconds} seconds.`,
      });
    }
    // Cooldown expired, transition to half-open
    state.state = "half-open";
    await cache.put(cacheKey, state, options.stateTtlSeconds ?? 300);
  }

  return request;
}
```

When the circuit is open and the cooldown hasn't expired, the client gets a standard [RFC 7807](https://datatracker.ietf.org/doc/html/rfc7807) problem response with a 503 status. No request ever reaches the backend. The response looks like this:

```json
{
  "type": "https://httpproblems.com/http-status/503",
  "title": "Service Unavailable",
  "status": 503,
  "detail": "Service temporarily unavailable. Retry after 30 seconds.",
  "instance": "/v1/payments",
  "trace": {
    "timestamp": "2025-03-17T10:42:03.128Z",
    "requestId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
  }
}
```

This is a standard Zuplo problem response. Your clients can check for a 503 status and implement their own backoff logic on their end.

Once the cooldown period passes, the policy transitions to half-open and lets the next request through as a test.

### Outbound policy: track failures

This policy inspects backend responses and updates the circuit state. On failure it increments the counter. When the threshold is crossed, it opens the circuit.

```ts
// modules/circuit-breaker-outbound.ts
import { ZuploContext, ZuploRequest, ZoneCache } from "@zuplo/runtime";

interface CircuitState {
  failures: number;
  lastFailure: number;
  state: "closed" | "open" | "half-open";
}

interface CircuitBreakerOptions {
  failureThreshold: number;
  cooldownSeconds: number;
  backendId: string;
  stateTtlSeconds?: number;
}

const DEFAULT_STATE: CircuitState = {
  failures: 0,
  lastFailure: 0,
  state: "closed",
};

export default async function circuitBreakerOutbound(
  response: Response,
  request: ZuploRequest,
  context: ZuploContext,
  options: CircuitBreakerOptions,
  policyName: string,
) {
  const cache = new ZoneCache("circuit-breaker", context);
  const cacheKey = `cb:${options.backendId}`;
  const state = (await cache.get(cacheKey)) ?? { ...DEFAULT_STATE };

  if (response.ok) {
    // Success during half-open: close the circuit
    if (state.state === "half-open") {
      context.log.info(`Circuit closing for backend '${options.backendId}'.`);
      state.state = "closed";
      state.failures = 0;
      state.lastFailure = 0;
      await cache.put(cacheKey, state, options.stateTtlSeconds ?? 300);
    }
    return response;
  }

  // Failure: increment counter
  state.failures += 1;
  state.lastFailure = Date.now();
  context.log.warn(
    `Backend '${options.backendId}' returned ${response.status}. ` +
      `Failures: ${state.failures}/${options.failureThreshold}.`,
  );

  if (state.failures >= options.failureThreshold) {
    context.log.error(`Circuit opening for backend '${options.backendId}'.`);
    state.state = "open";
  }

  await cache.put(cacheKey, state, options.stateTtlSeconds ?? 300);
  return response;
}
```

The outbound policy uses `response.ok` to classify success vs. failure. This treats any 2xx response as a success and everything else as a failure. You can customize this. For example, you might only count 5xx responses as failures and treat 4xx client errors as normal:

```ts
// Only count server errors as failures
const isFailure = response.status >= 500;
```

### Wiring it up

Add both policies to your `policies.json` and attach them to the route:

```json
// config/policies.json
{
  "policies": [
    {
      "name": "circuit-breaker-inbound",
      "policyType": "custom-code-inbound",
      "handler": {
        "export": "default",
        "module": "$import(./modules/circuit-breaker-inbound)",
        "options": {
          "failureThreshold": 5,
          "cooldownSeconds": 30,
          "backendId": "my-backend-api"
        }
      }
    },
    {
      "name": "circuit-breaker-outbound",
      "policyType": "custom-code-outbound",
      "handler": {
        "export": "default",
        "module": "$import(./modules/circuit-breaker-outbound)",
        "options": {
          "failureThreshold": 5,
          "cooldownSeconds": 30,
          "backendId": "my-backend-api"
        }
      }
    }
  ]
}
```

Then reference both policies on any route that should be protected:

```json
"policies": {
  "inbound": ["circuit-breaker-inbound"],
  "outbound": ["circuit-breaker-outbound"]
}
```

The `backendId` option is the key to per-route customization. Set a different `backendId` for each backend, and each one gets its own independent circuit state. A payment service can trip after 3 failures while a search endpoint tolerates 10.

If you're also using other policies like rate limiting or authentication, order matters. The circuit breaker inbound policy should run after authentication (no point checking the circuit for unauthenticated requests) but before rate limiting (a tripped circuit should return 503 before consuming a rate limit token).

## Choosing thresholds

Getting thresholds right matters. Too sensitive and you'll trip on transient errors. Too generous and real outages affect clients for too long.

**Failure threshold**: Start with 5 failures. For critical payment flows, drop it to 2 or 3. For search or non-critical reads, 10 is reasonable.

**Cooldown period**: 30 seconds is a good starting point. Long enough for most transient issues to resolve, short enough that you aren't blocking traffic for ages if the backend recovered quickly.

**Cache TTL** (`stateTtlSeconds`): This is a safety net. The policies refresh the TTL only when they write state (on a failure or a state transition), so if no failures are recorded for this period, the state expires and resets to closed. The default of 300 seconds (5 minutes) works for most cases. Set it higher for low-traffic routes, and keep it longer than the cooldown so an open circuit doesn't expire before the cooldown ends.

## Testing the circuit breaker

You can verify the circuit breaker works without waiting for a real outage. The quickest approach is to create a mock backend using [Mockbin](https://mockbin.io) that returns a 500 error.

Create a new bin and configure the response like this:

- **Status**: `500`
- **Headers**: `Content-Type: application/json`
- **Body**:

  ```json
  {
    "error": "Internal Server Error",
    "message": "Simulated backend failure"
  }
  ```

Copy the bin URL and use it as your route's backend URL. Every request to that route will now get a 500 response, which the outbound policy counts as a failure.

For more control, you can swap your route handler for a simple one that fails on demand via a query parameter:

```ts
// modules/test-handler.ts
import { ZuploContext, ZuploRequest } from "@zuplo/runtime";

export default async function (request: ZuploRequest, context: ZuploContext) {
  const fail = request.query.fail === "true";
  if (fail) {
    return new Response("Internal Server Error", { status: 500 });
  }
  return new Response(JSON.stringify({ status: "ok" }), {
    headers: { "content-type": "application/json" },
  });
}
```

Either way, set your failure threshold to 3 and cooldown to 10 seconds so you can cycle through the states quickly. Then:

1. Send a few normal requests to confirm they pass through (circuit closed).
2. Send 3 failing requests (via the Mockbin route or `?fail=true`) to trip the circuit.
3. Send another request and confirm you get a 503 with no backend call.
4. Wait 10 seconds, send a successful request, and confirm the circuit closes.

Check your Zuplo logs for the circuit state transitions. You should see the `warn` and `error` messages from both policies as the state changes.

## Why implement this in code?

Config-based gateways that support circuit breakers typically give you a few knobs: threshold, cooldown, maybe a status code filter. That works until it doesn't.

With a programmable gateway, the circuit breaker logic is just TypeScript. You can:

- Factor in response latency, not just error codes
- Use different failure detection per route without duplicating config
- Send alerts (via a webhook in the outbound policy) when a circuit opens
- Log structured circuit state changes to your observability stack
- Implement gradual recovery in half-open state instead of a single test request

The tradeoff is that you maintain the code. But it's ~60 lines per policy and the logic is straightforward.

## Deploy a circuit breaker in seconds with GitOps

A circuit breaker adds overhead you might not need when your backends are healthy.
The good news: you don't have to treat this as permanent infrastructure.

Because Zuplo projects are Git repos, adding a circuit breaker to a route is a code change. When a backend starts misbehaving, you can:

1. Add the two policy files to your project.
2. Register them in `policies.json` and reference them on the affected route.
3. Push to your branch. Zuplo deploys in seconds.

Once your production gateway rebuilds, the circuit breaker is live. Once the backend is stable again, you can remove the policies from the route and push again. You're back to zero overhead.

This works well as an incident response tool. Keep the policy modules in your repo but don't attach them to any routes. When something goes wrong, wiring them up is a one-line change to your route config. If you use environment-based routing, you can even test the circuit breaker on a preview branch before promoting it to production.

## Going further

This implementation covers the core pattern. A few things you might add for production use:

**Rolling window**: Instead of a simple counter, track failures within a time window (e.g., 5 failures in the last 60 seconds). Reset the counter when the window rolls over.

**Gradual half-open recovery**: Allow 3 test requests through in half-open state instead of one. Close the circuit only if all 3 succeed.

**Alerting**: Fire a webhook or write to a queue when the circuit opens. Your on-call team should know when a backend is failing hard enough to trip the breaker.

**Combine with retries and timeouts**: Circuit breakers work best alongside other resilience patterns. Add a timeout to prevent slow backends from holding connections, and a retry policy for transient errors that happen while the circuit is closed.
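
As a starting point for the first two ideas, here is a minimal sketch of a windowed failure counter and a multi-request half-open check. The `WindowedCounter` and `HalfOpenProbe` types and the three helper functions are hypothetical names introduced for illustration. They are not part of the policies above or of the Zuplo runtime, and storing the extra fields in the cached circuit state is left to you.

```typescript
// Illustrative sketch: these types and helpers are assumptions for this example.
interface WindowedCounter {
  windowStart: number; // epoch milliseconds when the current window began
  failures: number; // failures recorded inside the current window
}

// Count a failure, starting a fresh window if the current one has rolled over.
function recordFailure(
  counter: WindowedCounter,
  now: number,
  windowMs: number,
): WindowedCounter {
  const rolledOver = now - counter.windowStart >= windowMs;
  if (rolledOver) {
    return { windowStart: now, failures: 1 };
  }
  return { ...counter, failures: counter.failures + 1 };
}

// The circuit should open once the current window holds enough failures.
function shouldOpen(counter: WindowedCounter, failureThreshold: number): boolean {
  return counter.failures >= failureThreshold;
}

interface HalfOpenProbe {
  successes: number; // successful test requests seen since entering half-open
}

// Count a successful test request and report whether the circuit can close.
function recordProbeSuccess(
  probe: HalfOpenProbe,
  requiredSuccesses: number,
): { probe: HalfOpenProbe; close: boolean } {
  const successes = probe.successes + 1;
  return { probe: { successes }, close: successes >= requiredSuccesses };
}
```

In the outbound policy, `recordFailure` would replace the plain `state.failures += 1`, and any failure during half-open would reset the probe count and reopen the circuit.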