Slow APIs kill user experience. Full stop. When milliseconds separate you from
your competitors, laggy API responses send users running straight to
alternatives. Today's users expect instant gratification, and they'll abandon
your product faster than you can say "server timeout" if it doesn't deliver.
The stakes couldn't be higher for developers tackling high-traffic API
performance. A
100-millisecond delay can slash conversion rates by 7%,
directly impacting your bottom line. But here's the good news—you've got more
weapons than ever to fight latency. Let's dive into what's actually causing your
API slowdowns and the battle-tested strategies that will transform your APIs
from sluggish to spectacular.
The Real Cost of Slow APIs: Why Milliseconds Matter
API latency isn't just a
technical metric—it's the silent conversion killer lurking in your codebase.
When users tap or click and nothing happens immediately, they don't blame their
connection; they blame your product.
Think about what different latency levels actually mean for your users:
Under 100ms: Perfect responsiveness that feels instantaneous
100-300ms: Acceptable for most applications
300-1000ms: Users notice delays and get frustrated
Over 1 second: Watch your user retention metrics plummet
This latency breakdown helps developers target specific improvements. For
instance, edge computing dramatically cuts network latency by processing
requests closer to users. According to
Macrometa's research,
this approach can reduce round-trip times from hundreds of milliseconds to
single digits in many scenarios.
API latency breaks down into three key components:
Network Latency: The time data spends traveling between client and server,
affected by physical distance, network congestion, and routing complexity.
Server Processing Time: How long your server takes to handle the request,
from database queries to business logic and response generation.
Client-Side Processing: While not strictly API latency, client operations
affect perceived performance and matter for comprehensive optimization.
As high-traffic APIs become the backbone of modern software, solving latency
problems becomes mission-critical. Let's examine what's slowing your APIs down
and how to fix it.
Latency Villains: What's Really Dragging Your API Down
API performance hinges on identifying and eliminating delay sources.
Understanding what creates bottlenecks helps you target the right fixes for
maximum impact.
Network Bottlenecks: The Distance Dilemma
Network latency typically accounts for the biggest chunk of API delays,
especially for global applications:
Physical Distance: This creates baseline latency that can't be
negotiated—it's pure physics. Data traveling halfway around the world simply
takes longer.
Network Congestion: Just like rush hour traffic, data congestion creates
unpredictable slowdowns when multiple services compete for limited bandwidth.
Network Hops: Each router or switch in the data path adds precious
milliseconds. Complex routes with numerous hops create noticeable cumulative
delays.
DNS Resolution Delays: Before API calls even begin, DNS must convert domain
names to IP addresses, adding latency especially for first-time connections.
To combat network latency, use CDNs to cache content near users. Better yet,
consider operating on the worldwide edge by implementing edge computing to move
actual processing closer to users, minimizing data travel times dramatically.
Server Slowdowns: When Your Backend Breaks
What happens on your servers can add significant latency too:
Overloaded Servers: When servers reach capacity limits during traffic spikes,
response times skyrocket as request queues grow.
Resource Starvation: Limited CPU, memory, or network bandwidth creates
performance bottlenecks that turn simple tasks into waiting games.
Database Query Problems: Slow database operations often hide behind API
delays. Missing indexes, complex queries, or overloaded database servers can
transform millisecond operations into multi-second nightmares.
Code Inefficiency: Unoptimized server-side code multiplies processing time
through redundant computations and poor algorithms. Memory leaks
progressively degrade performance, while blocking operations that lack async
handling cause needless waiting.
Implementing
smart routing for microservices can
optimize server processing and reduce latency by efficiently directing requests.
Additionally, employing
API rate-limiting techniques
helps manage server resources and prevent overload during traffic spikes.
Client-Side Culprits: The Forgotten Frontier
Often overlooked, client-side factors significantly impact perceived API
performance:
Heavy Client Processing: Complex JavaScript execution can delay API requests
and response processing, affecting overall responsiveness.
Mobile Network Variability: Cellular networks have higher and more
inconsistent latency than wired connections, creating unpredictable
performance.
Battery Optimization: Mobile devices may throttle network activity to
preserve battery life, causing erratic latency patterns.
Minimize client-side latency by optimizing client code, implementing data
caching, and using lightweight data formats. Design APIs to handle varying
network conditions gracefully, especially for mobile users.
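One defensive pattern worth having in your client toolkit: wrap fetch with a
timeout and a limited retry so a stalled cellular request fails fast instead of
hanging. Here's a minimal sketch (the timeout and retry values are illustrative,
not recommendations):

```typescript
// Fetch with a timeout and limited retries, for flaky mobile networks.
// Aborting a stalled request and retrying often beats waiting it out.
async function fetchWithTimeout(
  url: string,
  timeoutMs = 3000,
  retries = 1,
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      return await fetch(url, { signal: controller.signal });
    } catch (err) {
      if (attempt >= retries) throw err; // out of retries, surface the error
    } finally {
      clearTimeout(timer);
    }
  }
}
```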
Performance Detective Work: Measuring What Matters
You can't improve what you don't measure. Effective performance analysis
requires the right tools and methodologies to identify exactly where latency
occurs.
Define key performance indicators: Focus on metrics like response time,
throughput, and error rates to evaluate API performance objectively.
Establish realistic thresholds: Create latency budgets based on user
expectations and business requirements. For example, aim for 95% of requests
completing under 200ms.
Benchmark against competitors: Analyze similar services to understand
industry standards and set competitive targets.
These baselines help track improvements and spot performance regressions over
time.
Your API Testing Toolkit
Several powerful
API monitoring tools
can help diagnose and solve latency problems:
JMeter: This open-source powerhouse excels
at load testing and stress testing, simulating thousands of concurrent users
to reveal how your API performs under pressure.
Postman: Beyond API development, Postman
offers robust performance testing capabilities that integrate with existing
workflows.
K6: A developer-friendly tool using JavaScript for test
scripts, with excellent cloud support and high concurrency handling for
realistic traffic simulation (a sample script follows this list).
Gatling: Specialized in high-performance load
testing with detailed visualizations to identify bottlenecks quickly.
Wrk: A lightweight HTTP benchmarking tool that can generate substantial load
from a single machine, handy for stress-testing endpoints at spike-level
traffic.
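As a starting point, here's a minimal k6 script that drives 50 virtual users
against a placeholder endpoint and fails the run when p95 latency crosses
200ms (the URL and thresholds are illustrative):

```typescript
// k6 load test sketch: 50 virtual users for 30 seconds.
import http from "k6/http";
import { sleep } from "k6";

export const options = {
  vus: 50,
  duration: "30s",
  thresholds: {
    // Fail the test run if the 95th-percentile request time exceeds 200ms.
    http_req_duration: ["p(95)<200"],
  },
};

export default function () {
  http.get("https://api.example.com/products"); // placeholder endpoint
  sleep(1); // think time between iterations
}
```

The threshold doubles as a pass/fail gate in CI. Recent k6 releases run
TypeScript files directly; otherwise keep the script plain JavaScript, which
this sketch already is.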
For maximum insight, focus on percentile measurements rather than averages. The
95th and 99th percentiles reveal the actual experience of users during peak
loads or edge cases—precisely when performance matters most.
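If you're aggregating raw response times yourself, percentiles are cheap to
compute. A quick nearest-rank sketch with made-up sample data:

```typescript
// Nearest-rank percentile over recorded response times (milliseconds).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const latenciesMs = [87, 92, 95, 98, 99, 101, 105, 110, 430, 1200]; // sample data
console.log(`p95 = ${percentile(latenciesMs, 95)}ms`); // tail the average hides
console.log(`p99 = ${percentile(latenciesMs, 99)}ms`);
```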
Speed Solutions: Battle-Tested Strategies That Work
Now for the good stuff—proven techniques to slash API latency even under heavy
traffic. These approaches work across industries and application types.
Edge Computing: Bringing APIs Closer to Users
Edge computing
demolishes latency by moving computation and data storage closer to users. When
API functions run at edge locations, you eliminate the physical distance data
must travel, delivering dramatically faster responses.
The killer advantage? Processing requests locally reduces dependence on distant
centralized servers. This matters most for applications where every millisecond
counts—real-time analytics, interactive gaming, or financial transactions where
delays mean lost opportunities.
Moving an API to the edge typically involves:
Identifying which API functions can run independently at the edge
Using serverless platforms with edge deployment capabilities
Choosing efficient data serialization formats
Designing stateless microservices that work autonomously at edge locations
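To make that concrete, here's a minimal sketch of a stateless edge handler in
the Cloudflare Workers style, answering from the edge cache when possible and
falling back to origin (the origin URL, cache name, and TTL are hypothetical):

```typescript
// Edge function sketch (Workers-style fetch handler): serve from the
// edge cache when possible; otherwise fetch from origin and cache it.
export default {
  async fetch(request: Request): Promise<Response> {
    const cache = await caches.open("api-cache"); // hypothetical cache name
    const hit = await cache.match(request);
    if (hit) return hit; // served entirely at the edge, no origin round trip

    const origin = await fetch("https://origin.example.com/products"); // hypothetical origin
    const response = new Response(origin.body, origin);
    response.headers.set("Cache-Control", "public, max-age=60");
    await cache.put(request, response.clone());
    return response;
  },
};
```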
We have thought long and hard about this at Zuplo, and shamelessly recommend
you try our edge API gateway
which makes it easy to run code-intensive tasks at the edge while keeping your
IO-intensive services close to your database.
Caching Magic: Store Now, Serve Instantly
Smart caching
transforms API performance by storing frequently accessed data closer to users,
slashing response times and reducing backend load:
In-Memory Caching: Use Redis or
Memcached to store frequently requested data in RAM
for lightning-fast access. This works beautifully for read-heavy workloads
with infrequent updates.
CDN Caching: Store API responses at global edge locations. This approach is
particularly effective for geographically distributed users who get content
from nearby edge servers rather than distant origins.
For instance, here's a minimal sketch of CDN-friendly cache headers in an
Express-style handler (the endpoint, data helper, and header values are
illustrative):
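```typescript
// Express handler sketch: max-age governs browser caches, while
// s-maxage lets shared caches (CDNs) hold the response longer.
import express from "express";

const app = express();

// Hypothetical data access standing in for a real database query.
async function loadProducts() {
  return [{ id: 1, name: "widget" }];
}

app.get("/api/products", async (_req, res) => {
  const products = await loadProducts();
  res.set("Cache-Control", "public, max-age=60, s-maxage=300");
  res.json(products);
});

app.listen(3000);
```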
HTTP Caching: Implement proper HTTP headers (Cache-Control, ETag) to tell
clients and proxies when to cache responses. This eliminates unnecessary
requests for unchanged data.
Application-Level Caching: Build custom caching targeting expensive
computations or data aggregations that slow down responses. For example,
caching API responses can significantly
reduce latency for AI-powered applications.
The caching challenge is maintaining data freshness. Implement event-triggered
invalidation or appropriate TTL values for frequently changing data to avoid
serving stale content.
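A common shape for this is cache-aside with a short TTL. Here's a minimal
sketch using the node-redis client (the key name, TTL, and query helper are
hypothetical):

```typescript
import { createClient } from "redis";

// Hypothetical expensive query that the cache protects.
async function queryProducts(): Promise<string> {
  return JSON.stringify([{ id: 1, name: "widget" }]);
}

async function main() {
  const redis = createClient();
  await redis.connect();

  // Cache-aside with a 30-second TTL: short enough to limit staleness,
  // long enough to absorb bursts of identical reads.
  async function getProducts(): Promise<string> {
    const cached = await redis.get("products:all"); // hypothetical key
    if (cached) return cached;
    const fresh = await queryProducts();
    await redis.set("products:all", fresh, { EX: 30 });
    return fresh;
  }

  console.log(await getProducts()); // misses, queries, caches
  console.log(await getProducts()); // hits the cache
  await redis.quit();
}

main();
```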
Code Optimization: Building Speed from Within
Optimizing your API code creates the foundation for any latency reduction
strategy:
Asynchronous Processing:
Use non-blocking I/O and async patterns to handle more concurrent requests.
This approach shines with I/O-heavy operations that would otherwise block
your API (see the sketch after this list).
Database Tuning: Improve database performance through proper indexing, query
optimization, and connection pooling. Focus relentlessly on your most
frequent and resource-intensive queries.
Lightweight Data Formats: Choose efficient formats and compression to reduce
payload sizes. Consider
Protocol Buffers or
MessagePack for more efficient serialization than
JSON.
Regular Profiling: Routinely analyze your API code to identify and eliminate
performance bottlenecks. Remove unnecessary computations and optimize
critical paths.
Efficient Resource Management: Reuse database connections and external
service connections through proper pooling to avoid connection establishment
overhead.
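On the async point above, the cheapest win is often issuing independent I/O
concurrently instead of sequentially. A small sketch (the URLs are
placeholders):

```typescript
// Sequential awaits add each call's latency together; Promise.all
// overlaps them so total time approaches the slowest single call.
async function getDashboard() {
  const [user, orders, stats] = await Promise.all([
    fetch("https://api.example.com/user").then((r) => r.json()),
    fetch("https://api.example.com/orders").then((r) => r.json()),
    fetch("https://api.example.com/stats").then((r) => r.json()),
  ]);
  return { user, orders, stats };
}
```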
By combining these strategies—edge computing, smart caching, and code
optimization—you'll create APIs that deliver consistently fast responses even
under heavy load.
Staying Ahead: Monitoring and Scaling for Growth
Once your API is fast, keeping it that way requires vigilant monitoring and
flexible scaling strategies. Here's how to maintain performance as your traffic
grows.
Real-Time Performance Radar
Continuous monitoring catches latency issues before users notice them:
Set actionable alerts: Define clear thresholds for key metrics. For example,
trigger alerts when p95 response times exceed 200ms for critical endpoints.
Track comprehensive metrics: Monitor response times, error rates, request
volumes, and resource utilization across your entire API ecosystem.
Implement distributed tracing: Follow requests across services to pinpoint
exactly where delays occur. Tools like Jaeger or Zipkin visualize request
paths through complex systems (a minimal setup sketch follows this list).
Gather real user data: Collect performance metrics from actual users to
understand how latency affects different regions, devices, and network
conditions.
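For the tracing item above, here's a minimal sketch of bootstrapping
OpenTelemetry in a Node service and exporting spans over OTLP to a local
collector (Jaeger can ingest OTLP); the service name and endpoint are
assumptions:

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: "orders-api", // hypothetical service name
  // Export spans over OTLP/HTTP to a locally running collector.
  traceExporter: new OTLPTraceExporter({ url: "http://localhost:4318/v1/traces" }),
  // Auto-instruments common libraries (http, express, pg, and others).
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```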
Elastic Growth Strategies
To handle increasing traffic without performance degradation, build scalability
into your architecture:
Auto-scaling infrastructure: Automatically adjust server count based on
traffic patterns and resource utilization. Cloud platforms make this
particularly straightforward.
Database scaling tactics: Implement read replicas, connection pooling, and
sharding to ensure your database doesn't become a bottleneck.
Intelligent load balancing: Distribute traffic across servers based on actual
capacity and current load, not just round-robin assignment.
Microservices architecture: Break monolithic applications into independently
scalable services that can grow based on specific demand patterns.
Circuit breakers and fallbacks: Implement patterns that prevent cascading
failures when individual components experience problems (a minimal breaker
sketch follows this list).
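A circuit breaker can be surprisingly small. A minimal sketch (the threshold
and cooldown values are illustrative):

```typescript
// Circuit-breaker sketch: after `threshold` consecutive failures,
// reject calls for `cooldownMs` so a struggling dependency can recover.
class CircuitBreaker {
  private failures = 0;
  private openUntil = 0;

  constructor(private threshold = 5, private cooldownMs = 10_000) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (Date.now() < this.openUntil) {
      throw new Error("circuit open: failing fast");
    }
    try {
      const result = await fn();
      this.failures = 0; // success resets the count
      return result;
    } catch (err) {
      if (++this.failures >= this.threshold) {
        this.openUntil = Date.now() + this.cooldownMs; // trip the breaker
        this.failures = 0;
      }
      throw err;
    }
  }
}
```

Wrap calls to a fragile dependency in `breaker.call(...)` so that, once the
breaker trips, requests fail in microseconds instead of piling up behind a
struggling service.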
By combining proactive monitoring with these scaling strategies, you'll maintain
consistent performance even as your API usage grows dramatically.
API Gateway Optimization
To optimize your API gateway for
handling increased traffic:
Configure intelligent routing rules based on priority, resource availability,
and client needs
Implement request batching to consolidate related API calls and reduce network
overhead
Deploy gateway-level caching to eliminate unnecessary backend processing
Set up advanced rate limiting to protect services during traffic surges (a
token-bucket sketch follows this list)
Enable content compression to reduce payload sizes and transmission times
Implement circuit breakers at the gateway level to prevent cascading failures
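For the rate-limiting item above, here's a minimal in-memory token-bucket
sketch (the capacity, refill rate, and per-client key are illustrative; a
production gateway would back this with shared state):

```typescript
// Token-bucket rate limiter sketch: each client gets `capacity` tokens
// that refill at `refillPerSec`; a request spends one token or is rejected.
type Bucket = { tokens: number; lastRefill: number };

const capacity = 100;
const refillPerSec = 10;
const buckets = new Map<string, Bucket>();

function allowRequest(clientId: string): boolean {
  const now = Date.now();
  const bucket = buckets.get(clientId) ?? { tokens: capacity, lastRefill: now };

  // Refill based on elapsed time, capped at capacity.
  const elapsed = (now - bucket.lastRefill) / 1000;
  bucket.tokens = Math.min(capacity, bucket.tokens + elapsed * refillPerSec);
  bucket.lastRefill = now;

  if (bucket.tokens < 1) {
    buckets.set(clientId, bucket);
    return false; // caller should respond 429 Too Many Requests
  }
  bucket.tokens -= 1;
  buckets.set(clientId, bucket);
  return true;
}
```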
A well-optimized API gateway becomes your first line of defense against latency
issues, managing traffic intelligently before it ever reaches your backend
services. This centralized control point gives you powerful leverage for
maintaining performance as your user base grows.
Service Mesh Architecture
Enhance reliability and performance with service mesh architecture:
Deploy lightweight proxies alongside services to handle cross-cutting
communication concerns
Implement service discovery for automatic endpoint management as services
scale
Use intelligent load balancing that considers service health and response
times
Configure transparent retries and timeouts without changing application code
Leverage traffic splitting for canary deployments of performance improvements
Enable observability through automated metrics collection and distributed
tracing
Implement fault injection testing to verify resilience during performance
degradation
By abstracting communication concerns away from your service code, a service
mesh creates a resilient foundation that maintains consistent performance even
as your architecture evolves and scales. This approach pays dividends especially
in
high-traffic, microservice-heavy environments
where traditional scaling methods fall short.
Speed Up, Stand Out: Your Latency-Busting Action Plan
The strategies we’ve explored above offer practical, high-impact ways to boost
API performance and user experience. What next? Start with quick wins: implement
caching, compress large responses, and optimize your most frequently accessed
endpoints. These simple steps can deliver immediate, measurable gains.
From there, level up with more advanced improvements like edge computing and
database tuning. Keep in mind that performance optimization isn’t a one-time
task—it’s an ongoing process. As your API scales and user traffic shifts,
consistent monitoring and fine-tuning are essential. Tools like distributed
tracing and real user monitoring can reveal bottlenecks and guide smart
adjustments.
Your users demand speed—and now you’ve got the tools to deliver it. In today’s
fast-moving digital landscape, even a few milliseconds can make or break the
experience. Ready to go from laggy to lightning-fast?
Sign up for a free Zuplo account
and discover how our developer-first platform simplifies these performance
strategies with intuitive interfaces and powerful optimization tools built right
in.