Every API will eventually return an error. The question is whether that error helps the developer fix the problem or sends them on a guessing game. Poorly structured error responses — vague messages, wrong status codes, missing context — lead to frustrated developers, wasted debugging hours, and a flood of support tickets.
This guide covers concrete strategies for building consistent, useful API error responses. You will learn how to select the right HTTP status codes, structure error payloads using the RFC 9457 Problem Details standard, handle errors across REST, GraphQL, and gRPC protocols, and implement retry logic, idempotency, and observability patterns that make errors actionable instead of opaque.
Here is what a well-structured error response looks like, following RFC 9457:
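As a minimal illustration, the TypeScript object below models such a response body. The type URI, instance path, field name, and wording are hypothetical placeholders rather than values from a real API.

```typescript
// Hypothetical RFC 9457 problem body for a failed signup request.
// Every value below is a placeholder chosen for illustration.
const exampleProblem = {
  type: "https://api.example.com/problems/invalid-email", // hypothetical URI
  title: "Invalid email address",
  status: 422,
  detail: "The 'email' field must be a valid address such as name@example.com.",
  instance: "/signups/requests/abc123", // hypothetical occurrence identifier
  // Extension member: points at the offending field in the request body.
  errors: [{ pointer: "#/email", detail: "must be a valid email address" }],
};

// Serialized form of the object, as it would appear in the response body.
const exampleBody: string = JSON.stringify(exampleProblem, null, 2);
```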
This response tells the developer exactly what went wrong, which field caused the problem, and how to fix it. That is the goal.
Choosing the Right HTTP Status Codes
HTTP status codes are the first signal a client receives about what happened. Using them correctly is table stakes for any API.
4xx codes indicate the client sent a bad request. The client needs to change something before retrying:
- 400 Bad Request — Malformed syntax, invalid JSON, or missing required fields.
- 401 Unauthorized — Missing or invalid authentication credentials. (Despite the name, this means “unauthenticated.”)
- 403 Forbidden — Authenticated, but lacking permission for this operation.
- 404 Not Found — The requested resource does not exist.
- 409 Conflict — The request conflicts with the current state of the resource (e.g., a duplicate creation or concurrent update).
- 422 Unprocessable Content — The request is syntactically valid but fails business validation rules.
- 429 Too Many Requests — The client has exceeded its rate limit. Always include a Retry-After header.
5xx codes indicate the server failed to fulfill a valid request. These failures are often transient, so they are usually worth retrying:
- 500 Internal Server Error — An unhandled exception on the server side.
- 502 Bad Gateway — The server, acting as a proxy, received an invalid upstream response.
- 503 Service Unavailable — The server is temporarily down for maintenance or overloaded. Include a Retry-After header.
- 504 Gateway Timeout — The upstream server did not respond in time.
The critical rule: use the most specific status code that applies. Returning
400 for every client error or 500 for every server error strips the response
of useful information. If a user’s email fails validation, return 422 with
field-level details — not a generic 400.
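One way to keep status codes specific is to map each failure category to its code in a single place. The sketch below assumes a hypothetical FailureKind union; the category names are illustrative and not part of any standard.

```typescript
// Hypothetical failure categories an API handler might distinguish.
type FailureKind =
  | "malformed_body"
  | "unauthenticated"
  | "forbidden"
  | "not_found"
  | "conflict"
  | "validation_failed"
  | "rate_limited";

// Map each category to the most specific 4xx status code.
function statusFor(kind: FailureKind): number {
  switch (kind) {
    case "malformed_body":
      return 400;
    case "unauthenticated":
      return 401;
    case "forbidden":
      return 403;
    case "not_found":
      return 404;
    case "conflict":
      return 409;
    case "validation_failed":
      return 422;
    case "rate_limited":
      return 429;
  }
}
```

Centralizing the mapping makes it harder for an individual handler to fall back to a generic 400.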
You can find a full reference of status codes on MDN. For deeper dives on specific codes, see our guides on HTTP 429 Too Many Requests and HTTP 431 Request Header Fields Too Large.
Structuring Error Responses with RFC 9457 Problem Details
Ad-hoc error formats are one of the biggest sources of friction in API
integrations. Every API invents its own shape — { "error": "..." },
{ "code": 123, "msg": "..." }, { "errors": [{ ... }] } — and consumers have
to write custom parsing logic for each one.
RFC 9457 (Problem Details for HTTP APIs) solves this by defining a standard error response format. It is the successor to the widely adopted RFC 7807 and is fully backward compatible. If you are starting fresh, adopt RFC 9457. If you already use RFC 7807, your responses are already compliant.
The Five Standard Fields
A Problem Details response uses the application/problem+json content type and
includes these fields (all optional, but all recommended):
- type (URI) — Identifies the specific error type. When set to "about:blank", the title should match the standard HTTP status phrase.
- title (string) — A short, human-readable summary. Should remain stable for a given type.
- status (integer) — The HTTP status code, duplicated in the body for convenience.
- detail (string) — A human-readable explanation specific to this occurrence.
- instance (URI) — Identifies the specific request occurrence, useful for log correlation.
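These five members can be captured in a small TypeScript type. The sketch below is one possible modeling, not an official one: the makeProblem helper and the STATUS_PHRASES table are hypothetical names, and the table covers only a handful of codes.

```typescript
// The five standard members, all optional, plus room for extensions.
interface ProblemDetails {
  type?: string;
  title?: string;
  status?: number;
  detail?: string;
  instance?: string;
  [extension: string]: unknown;
}

// A few standard reason phrases, used as the title when type is "about:blank".
const STATUS_PHRASES: Record<number, string> = {
  400: "Bad Request",
  401: "Unauthorized",
  403: "Forbidden",
  404: "Not Found",
  422: "Unprocessable Content",
  429: "Too Many Requests",
};

// Hypothetical helper: fills in type and title defaults for a status code.
// Callers can override any member, or add extensions, via `overrides`.
function makeProblem(
  status: number,
  detail: string,
  overrides: ProblemDetails = {},
): ProblemDetails {
  return {
    type: "about:blank",
    title: STATUS_PHRASES[status] ?? "Unknown Error",
    status,
    detail,
    ...overrides,
  };
}
```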
Extension Members and Validation Errors
RFC 9457 explicitly supports custom extension members beyond the five standard
fields. This is where you add application-specific context. RFC 9457 illustrates
how extension members can include an errors array for reporting multiple
validation problems in a single response:
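A minimal sketch of that pattern follows. The validateSignup function, its rules, and the type URI are hypothetical; the entries in errors use the detail and pointer members that appear in the RFC's own example.

```typescript
interface FieldError {
  detail: string; // what is wrong with the value
  pointer: string; // JSON Pointer to the offending field
}

interface SignupInput {
  email?: string;
  age?: number;
}

// Collect every validation failure instead of stopping at the first one.
function validateSignup(input: SignupInput): FieldError[] {
  const errors: FieldError[] = [];
  if (!input.email || !input.email.includes("@")) {
    errors.push({ detail: "must be a valid email address", pointer: "#/email" });
  }
  if (input.age === undefined || !Number.isInteger(input.age) || input.age <= 0) {
    errors.push({ detail: "must be a positive integer", pointer: "#/age" });
  }
  return errors;
}

// Wrap the collected errors in a problem body (hypothetical type URI).
function toValidationProblem(errors: FieldError[]) {
  return {
    type: "https://api.example.com/problems/validation-error",
    title: "Your request is not valid.",
    status: 422,
    errors,
  };
}
```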
Returning all validation errors at once — rather than one at a time — saves developers from the frustrating loop of fix-one-submit-get-another-error.
Here is the full HTTP response, including headers:
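The sketch below renders a problem body as raw HTTP/1.1 text to show where the content type fits. The renderProblemResponse helper is hypothetical, and the Content-Language header is included only as an illustration.

```typescript
// Render a problem body as raw HTTP/1.1 response text (illustrative only;
// a real server or framework would normally write these headers for you).
function renderProblemResponse(statusLine: string, problem: object): string {
  const head = [
    `HTTP/1.1 ${statusLine}`,
    "Content-Type: application/problem+json",
    "Content-Language: en",
  ];
  // A blank line separates the headers from the JSON body.
  return head.join("\r\n") + "\r\n\r\n" + JSON.stringify(problem);
}

const rawResponse = renderProblemResponse("422 Unprocessable Content", {
  type: "https://api.example.com/problems/validation-error", // hypothetical
  title: "Your request is not valid.",
  status: 422,
  errors: [{ detail: "must be a positive integer", pointer: "#/age" }],
});
```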
We talked with Erik Wilde, one of the authors of RFC 9457, about the design decisions behind the standard.
Zuplo and Problem Details
If you use an API gateway like Zuplo, Problem Details
responses come built in. Zuplo defaults to the RFC 7807 Problem Details format
for every error — from authentication failures to rate limit violations to
request validation errors. Each response automatically includes a trace
extension with a requestId, buildId, and timestamp for debugging.
You can also customize error responses programmatically using the HttpProblems
helper from @zuplo/runtime, which provides methods for every HTTP status code
(badRequest(), unauthorized(), tooManyRequests(), and so on) with support
for custom detail, type, and extension fields.
Writing Clear, Secure Error Messages
A good error message answers three questions: What went wrong? Where did it happen? How do I fix it?
Do this:
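For instance, a hypothetical validation failure on a date field might be reported like this (the field name, type URI, and wording are invented for illustration):

```typescript
// A specific error: names the field, the problem, and the expected format.
const helpfulError = {
  type: "https://api.example.com/problems/invalid-date", // hypothetical URI
  title: "Invalid date format",
  status: 422,
  detail: "The 'birth_date' field must use the YYYY-MM-DD format, for example 1990-04-23.",
  errors: [{ pointer: "#/birth_date", detail: "expected format YYYY-MM-DD" }],
};
```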
Not this:
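A hypothetical unhelpful response might look like this:

```typescript
// A vague error: no field, no cause, no hint about how to fix the request.
const vagueError = {
  error: "Invalid input",
};
```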
The first response tells the developer exactly which field failed and what format is expected. The second is useless.
Security Considerations
Error messages must balance helpfulness with security. Follow these rules:
- Never expose internal details. Stack traces, database queries, file paths, and internal service names do not belong in API responses.
- Use neutral authentication messages. Return the same error for “user not found” and “wrong password” to prevent account enumeration attacks. A message like "Invalid credentials" works for both cases.
- Validate and sanitize inputs before including them in error messages to prevent injection attacks.
An API gateway can act as a safety net here. With a programmable gateway like Zuplo, you can write outbound policies that scan response bodies for sensitive data patterns — internal IP addresses, database connection strings, stack traces — and strip them before they reach the client. See our API security best practices guide for more on securing your API surface.
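The scanning step can be sketched as a plain string filter. This is a generic illustration, not Zuplo's policy API: the redactSensitive function and its patterns are assumptions, and the patterns are deliberately simple (for example, they do not cover the 172.16.0.0/12 private range).

```typescript
// Hypothetical patterns for data that should never reach a client.
const SENSITIVE_PATTERNS: RegExp[] = [
  // Loopback and some private IPv4 ranges (10.x.x.x, 127.x.x.x, 192.168.x.x).
  /\b(?:10\.\d{1,3}|127\.\d{1,3}|192\.168)\.\d{1,3}\.\d{1,3}\b/g,
  // URLs with embedded credentials, such as database connection strings.
  /\b[a-z]+:\/\/[^\s:@\/]+:[^\s@\/]+@[^\s"]+/gi,
  // Indented stack frames of the form "at fn (file:line:col)".
  /^\s+at .+\(.+:\d+:\d+\)$/gm,
];

// Replace every match of every pattern with a fixed marker.
function redactSensitive(body: string): string {
  return SENSITIVE_PATTERNS.reduce(
    (text, pattern) => text.replace(pattern, "[redacted]"),
    body,
  );
}
```

A filter like this is a last line of defense. The primary fix is still to avoid putting internal details into error messages in the first place.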
Error Handling by API Protocol
Different API protocols handle errors differently. Understanding these conventions is important when you build or consume APIs, and especially when you run mixed-protocol architectures behind a single gateway.
REST API Errors
REST APIs communicate errors through HTTP status codes paired with structured error payloads. The RFC 9457 Problem Details format described above is the current best practice for REST error responses. It supports content negotiation between JSON and XML, and its extension mechanism handles everything from simple validation errors to complex multi-step workflow failures.
GraphQL Errors
GraphQL servers conventionally respond with HTTP 200 OK even when errors occur during execution. Error information lives in an errors array in the response body. This design enables partial success, where some fields resolve successfully while others fail:
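The sketch below models a hypothetical partial result in TypeScript, in which the user's name resolved but the orders field failed. The schema, values, and the errorCodesByPath helper are invented for illustration.

```typescript
interface GraphQLErrorEntry {
  message: string;
  path?: (string | number)[];
  extensions?: { code?: string };
}

interface GraphQLResult {
  data: Record<string, unknown> | null;
  errors?: GraphQLErrorEntry[];
}

// Hypothetical result: `user.name` resolved, `user.orders` did not.
const partialResult: GraphQLResult = {
  data: { user: { name: "Ada", orders: null } },
  errors: [
    {
      message: "Not authorized to view orders",
      path: ["user", "orders"],
      extensions: { code: "FORBIDDEN" },
    },
  ],
};

// Collect machine-readable codes keyed by the dotted path that failed.
function errorCodesByPath(result: GraphQLResult): Record<string, string> {
  const codes: Record<string, string> = {};
  for (const err of result.errors ?? []) {
    const key = (err.path ?? []).join(".");
    codes[key] = err.extensions?.code ?? "UNKNOWN";
  }
  return codes;
}
```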
The path field pinpoints exactly which data field triggered the error. The
extensions.code field provides a machine-readable error code for programmatic
handling (e.g., UNAUTHENTICATED, FORBIDDEN, BAD_USER_INPUT).
For domain-specific errors, many GraphQL APIs use union types to make error states part of the schema itself, making every possible outcome explicit and type-safe. For more on GraphQL API design patterns, see our dedicated guide.
gRPC Errors
gRPC uses a fixed set of 17 status codes (0 through 16) defined in the
grpc.status package. Every response includes a numeric code and a string
message, with optional structured details via the google.rpc package:
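As a rough illustration, the object below shows how such a status might look when rendered as JSON, using the google.rpc.BadRequest detail type. The message and field values are hypothetical.

```typescript
// JSON shape of a status: numeric code, message, and optional typed details.
interface RpcStatus {
  code: number;
  message: string;
  details?: Array<{ "@type": string; [key: string]: unknown }>;
}

// Hypothetical INVALID_ARGUMENT (3) status carrying one field violation.
const invalidArgument: RpcStatus = {
  code: 3,
  message: "Invalid email address",
  details: [
    {
      "@type": "type.googleapis.com/google.rpc.BadRequest",
      fieldViolations: [
        { field: "email", description: "must be a valid email address" },
      ],
    },
  ],
};
```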
A key difference from REST: well-formed gRPC responses always use HTTP 200 OK
at the transport level. The actual error status is carried in the grpc-status
trailer. Here are the most commonly used gRPC status codes and their HTTP
equivalents:
- INVALID_ARGUMENT (3) → 400 Bad Request
- NOT_FOUND (5) → 404 Not Found
- PERMISSION_DENIED (7) → 403 Forbidden
- RESOURCE_EXHAUSTED (8) → 429 Too Many Requests
- UNIMPLEMENTED (12) → 501 Not Implemented
- INTERNAL (13) → 500 Internal Server Error
- UNAVAILABLE (14) → 503 Service Unavailable
- UNAUTHENTICATED (16) → 401 Unauthorized
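The mapping above can be expressed as a lookup table. In this sketch the httpStatusForGrpc name is hypothetical, and falling back to 500 for unlisted codes is an assumption rather than something the list prescribes.

```typescript
// gRPC status code -> HTTP status, covering only the codes listed above
// plus OK (0).
const GRPC_TO_HTTP: Record<number, number> = {
  0: 200, // OK
  3: 400, // INVALID_ARGUMENT
  5: 404, // NOT_FOUND
  7: 403, // PERMISSION_DENIED
  8: 429, // RESOURCE_EXHAUSTED
  12: 501, // UNIMPLEMENTED
  13: 500, // INTERNAL
  14: 503, // UNAVAILABLE
  16: 401, // UNAUTHENTICATED
};

// Assumption: any code missing from the table is reported as a 500.
function httpStatusForGrpc(code: number): number {
  return GRPC_TO_HTTP[code] ?? 500;
}
```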
For a deeper comparison of these protocols, see our REST or gRPC guide.
Cross-Protocol Consistency
Regardless of which protocol you use, two principles remain constant: machine-readable error codes for programmatic handling and human-readable details for debugging. When you run APIs across multiple protocols, an API gateway can normalize error formats — translating gRPC status codes into Problem Details responses for REST consumers, for example — so that downstream clients get a consistent experience.
Retry Strategies and Backoff Patterns
Not every error is permanent. Transient failures — rate limits, temporary overloads, upstream timeouts — will often succeed on retry. But retrying incorrectly can make things worse, turning a brief spike into a sustained outage.
Which Errors to Retry
Retry these: 429 Too Many Requests, 500 Internal Server Error,
502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout.
Do not retry these: 400, 401, 403, 404, 422 — these are client
errors that will fail identically on every attempt. Retrying them wastes
resources and can trigger rate limits.
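In code, this rule reduces to a small predicate. The isRetryable name is a hypothetical choice for this sketch.

```typescript
// Status codes that usually indicate a transient failure.
const RETRYABLE_STATUSES = new Set<number>([429, 500, 502, 503, 504]);

// True only for the transient codes above; all other statuses are final.
function isRetryable(status: number): boolean {
  return RETRYABLE_STATUSES.has(status);
}
```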
Exponential Backoff with Jitter
The standard retry pattern is exponential backoff: double the wait time after each failed attempt. But naive exponential backoff has a problem. If a server goes down and 1,000 clients fail simultaneously, they will all retry at the same intervals, creating a “thundering herd” that overwhelms the recovering server.
Jitter fixes this by randomizing the delay. Here is a practical implementation:
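The following is a sketch of such an implementation, not a definitive one. The names withRetry, backoffDelay, parseRetryAfter, and RetryOptions are hypothetical. The caller supplies send (for example, a wrapper around fetch that extracts the status and the Retry-After header) and sleep (for example, a wrapper around setTimeout), which keeps the sketch independent of any runtime.

```typescript
interface RetryOptions {
  maxRetries: number; // cap on retries after the first attempt
  baseDelayMs: number; // exponential base before jitter is applied
  maxDelayMs: number; // upper bound for any computed backoff delay
}

interface MinimalResponse {
  status: number;
  retryAfter: string | null; // raw Retry-After header value, if any
}

const RETRYABLE = new Set<number>([429, 500, 502, 503, 504]);

// Parse Retry-After as seconds or an HTTP date; returns milliseconds or null.
function parseRetryAfter(value: string | null, nowMs: number): number | null {
  if (value === null || value.trim() === "") return null;
  const seconds = Number(value);
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);
  const dateMs = Date.parse(value);
  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - nowMs);
}

// Full jitter: a random delay in [0, min(maxDelayMs, baseDelayMs * 2^attempt)).
function backoffDelay(
  attempt: number,
  opts: RetryOptions,
  random: () => number = Math.random,
): number {
  const exponentialDelay = Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** attempt);
  return random() * exponentialDelay;
}

// Retry transient failures, preferring the server's Retry-After hint.
async function withRetry(
  send: () => Promise<MinimalResponse>,
  sleep: (ms: number) => Promise<void>,
  opts: RetryOptions,
): Promise<MinimalResponse> {
  for (let attempt = 0; ; attempt++) {
    const res = await send();
    if (!RETRYABLE.has(res.status) || attempt >= opts.maxRetries) return res;
    const hinted = parseRetryAfter(res.retryAfter, Date.now());
    await sleep(hinted ?? backoffDelay(attempt, opts));
  }
}
```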
Key details in this implementation:
- Always honor the Retry-After header when present. This header appears in 429 and 503 responses and tells you exactly when to retry, either as seconds (Retry-After: 120) or an HTTP date (Retry-After: Thu, 28 May 2026 14:30:00 GMT).
- Full jitter (Math.random() * exponentialDelay) provides the best distribution of retry times across clients.
- Cap your maximum delay and maximum retry count to prevent unbounded waits.
Idempotency Keys for Safe Error Recovery
Retries introduce a second problem: what happens when the request succeeded on
the server but the response was lost in transit? Without safeguards, a retried
POST /payments could create a duplicate charge.
Idempotency keys solve this. The client generates a unique key (typically a UUID) and sends it with the request. The server uses this key to deduplicate:
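On the client side, this could look like the sketch below. The Idempotency-Key header name is a widely used convention that is assumed here, and the URL, payload, and buildPaymentRequest helper are hypothetical.

```typescript
interface OutgoingRequest {
  method: string;
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Build a payment request that carries a client-generated key.
function buildPaymentRequest(idempotencyKey: string, amountCents: number): OutgoingRequest {
  return {
    method: "POST",
    url: "https://api.example.com/payments", // hypothetical endpoint
    headers: {
      "Content-Type": "application/json",
      "Idempotency-Key": idempotencyKey, // assumed header name
    },
    body: JSON.stringify({ amountCents, currency: "USD" }),
  };
}

// Generate the key once per logical operation and reuse it for every retry.
const paymentKey = "7f1c2a9e-5b7d-4c1e-9a57-3d2f8b6e4c10"; // example UUID
const firstAttempt = buildPaymentRequest(paymentKey, 2500);
const retryAttempt = buildPaymentRequest(paymentKey, 2500);
```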
How the server handles this:
- First request with this key: Process normally, store the key and response.
- Subsequent requests with the same key: Return the stored response without reprocessing.
- Same key but different parameters: Return an error — the key is bound to the original request payload.
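The three cases above can be sketched with an in-memory store. This is a simplification under stated assumptions: IdempotencyStore is a hypothetical class, the raw payload string stands in for a payload hash, and a production version would need persistence, key expiry, and protection against concurrent requests with the same key.

```typescript
interface StoredResult {
  fingerprint: string; // the original payload (a real system might hash it)
  response: string;
}

type Outcome =
  | { kind: "processed"; response: string }
  | { kind: "replayed"; response: string }
  | { kind: "conflict" };

class IdempotencyStore {
  private readonly results = new Map<string, StoredResult>();

  handle(key: string, payload: string, run: (payload: string) => string): Outcome {
    const existing = this.results.get(key);
    if (existing === undefined) {
      // First request with this key: process it and remember the result.
      const response = run(payload);
      this.results.set(key, { fingerprint: payload, response });
      return { kind: "processed", response };
    }
    if (existing.fingerprint !== payload) {
      // Same key, different parameters: reject instead of reprocessing.
      return { kind: "conflict" };
    }
    // Same key, same payload: return the stored response unchanged.
    return { kind: "replayed", response: existing.response };
  }
}
```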
Idempotency keys should have a defined lifetime (24–48 hours is a common choice) and are only needed for non-idempotent methods such as POST and, in most cases, PATCH. GET, PUT, and DELETE are idempotent by definition.
For more on implementing this pattern, see our guide to idempotency keys in REST APIs.
Observability and Error Tracking
Returning good error responses is only half the battle. You also need visibility into what is failing, how often, and why.
Correlation IDs
Every API request should carry a unique identifier that flows through your entire system — from the gateway, through backend services, into logs, and back in the error response. This lets you trace a single failed request across multiple services in seconds.
In Zuplo, every request automatically receives a zp-rid (request ID) header.
This ID appears in the error response trace object, in gateway logs, and can
be forwarded to your backend services for cross-system correlation.
Structured Logging
Log error events with structured data — not just free-text messages. Include:
- Request ID (correlation ID)
- HTTP method and path
- Status code returned
- Error type and detail
- Timestamp
- Client identifier (API key name, user ID — never the raw credential)
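A structured entry with these fields might be built as in the sketch below. The ErrorLogEntry shape and the buildErrorLog helper are hypothetical, and the field names are one possible choice.

```typescript
interface ErrorLogEntry {
  requestId: string; // correlation ID
  method: string;
  path: string;
  status: number;
  errorType: string;
  detail: string;
  timestamp: string; // ISO 8601
  clientId: string; // API key name or user ID, never the raw credential
}

// Serialize one error event as a single JSON line for log ingestion.
function buildErrorLog(fields: Omit<ErrorLogEntry, "timestamp">, now: Date): string {
  const entry: ErrorLogEntry = { ...fields, timestamp: now.toISOString() };
  return JSON.stringify(entry);
}
```

Emitting one JSON object per line keeps the entries easy to query and aggregate in most log platforms.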
Structured logs make it possible to query, aggregate, and alert on error patterns. For example, you can set up alerts when the error rate for a specific endpoint exceeds a threshold, or when a particular client starts hitting validation errors at an unusual rate.
Key Metrics to Track
- Error rate by endpoint — Which endpoints are failing most?
- Error rate by status code — Are you seeing more 5xx (your problem) or 4xx (client problems)?
- P95 error resolution time — How quickly are errors being investigated and fixed?
- Error recurrence rate — Are the same errors coming back after fixes?
Tools like Sentry and Raygun provide error aggregation with distributed tracing. Zuplo integrates with logging platforms like Datadog, Splunk, Dynatrace, and Google Cloud Logging through its log forwarding plugins, so error data flows directly into your existing observability stack.
For more on API observability tooling, see our guide to API monitoring tools.
Evolving Error Schemas Without Breaking Clients
Error response formats are part of your API contract. Changing them carelessly will break client integrations just like changing your success response schemas would.
When you need to evolve your error format:
- Add optional fields instead of changing existing ones. Adding a new errors array alongside an existing detail field is safe. Removing detail is not.
- Preserve legacy formats during transitions. Support both old and new formats simultaneously until clients have migrated.
- Use semantic versioning for your API so clients can opt in to new error formats at their own pace.
- Include deprecation notices in response headers when old error formats are being phased out.
For example, adding an errors array alongside an existing detail field is a
safe, additive change:
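A rough TypeScript illustration of that change, with hypothetical type names and values, shows an older and a newer client reading the same body:

```typescript
// Original error shape that existing clients already depend on.
interface ProblemV1 {
  title: string;
  status: number;
  detail: string;
}

// Evolved shape: everything from V1 plus an optional errors array.
interface ProblemV2 extends ProblemV1 {
  errors?: Array<{ pointer: string; detail: string }>;
}

const evolvedProblem: ProblemV2 = {
  title: "Your request is not valid.",
  status: 422,
  detail: "Two fields failed validation.",
  errors: [
    { pointer: "#/email", detail: "must be a valid email address" },
    { pointer: "#/age", detail: "must be a positive integer" },
  ],
};

// A legacy client reads only `detail` and ignores the new member.
function legacyMessage(problem: ProblemV1): string {
  return problem.detail;
}

// A newer client can use field-level entries when they are present.
function fieldMessages(problem: ProblemV2): string[] {
  return (problem.errors ?? []).map((e) => `${e.pointer}: ${e.detail}`);
}
```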
Existing clients that only read detail continue to work. New clients can parse
the errors array for field-level granularity.
An API gateway is particularly useful during error format migrations. You can use outbound policies to transform upstream error responses into your target format, giving you a consistent error contract at the gateway level while your backend services migrate incrementally.
Testing Your Error Handling
Your error handling code deserves the same testing rigor as your happy-path logic. Three approaches work well together:
- Schema validation — If you have defined error response schemas in your OpenAPI spec, validate live responses against those schemas. This is called contract testing, and it catches regressions where an error response drifts from its documented format.
- Automated end-to-end tests — Write tests that intentionally trigger every error path: invalid inputs, missing authentication, nonexistent resources, rate limits. Tools like Playwright and StepCI work well for end-to-end API testing.
- Mock API simulation — Use mock APIs to simulate error scenarios like network timeouts, 500 errors, and malformed responses. This lets you verify your client-side error handling without depending on a live backend.
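For the schema validation approach, a hand-rolled check can stand in for a full OpenAPI validator in small test suites. The checkProblemShape function below is a hypothetical sketch, and it assumes your contract requires status to be present and equal to the HTTP status, which is stricter than RFC 9457 itself.

```typescript
// Returns a list of violations; an empty list means the body passed.
function checkProblemShape(body: unknown, expectedStatus: number): string[] {
  if (typeof body !== "object" || body === null || Array.isArray(body)) {
    return ["body is not a JSON object"];
  }
  const record = body as Record<string, unknown>;
  const issues: string[] = [];
  // Standard string members are optional but must be strings when present.
  for (const field of ["type", "title", "detail", "instance"]) {
    if (field in record && typeof record[field] !== "string") {
      issues.push(`${field} must be a string`);
    }
  }
  // Assumed contract rule: status is required and matches the HTTP status.
  if (record["status"] !== expectedStatus) {
    issues.push(`status must equal ${expectedStatus}`);
  }
  return issues;
}
```

In a test, you would parse the response body of a deliberately triggered error and assert that the returned list is empty.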
If you use Zuplo’s request validation policy, your API will automatically reject requests that do not match your OpenAPI schema and return a Problem Details response with field-level error pointers — no custom validation code required.
Summary
Good API error handling is not about catching every edge case. It is about building a consistent, predictable system that helps developers understand and recover from failures quickly.
The key practices covered in this guide:
- Use specific HTTP status codes — 422 for validation errors, 429 for rate limits, 503 for temporary outages. Never overload 400 or 500.
- Adopt RFC 9457 Problem Details — A standard format with type, title, status, detail, and instance fields, plus extension members for application-specific context.
- Follow protocol conventions — REST uses status codes and response bodies, GraphQL uses errors arrays, gRPC uses its own status codes.
- Implement retry logic correctly — Exponential backoff with jitter for transient errors, and always honor Retry-After headers.
- Use idempotency keys — Prevent duplicate side effects when retrying mutating operations.
- Invest in observability — Correlation IDs, structured logging, and error-rate monitoring turn error data into actionable insights.
If you are looking for a platform that handles many of these patterns out of the box — Problem Details responses, request validation, rate limiting, and log forwarding — try Zuplo for free.
