Hard Limits, Soft Limits, and Progressive Friction for Monetized APIs

Plan quota enforcement loses you customers in two mirror-image ways.

Option A, hard-limit failure. A developer wakes up at 3am to production 429s. Their app crossed its monthly plan quota overnight, and nothing warned them.

Option B, soft-limit failure. The same developer signed up for overage pricing with eyes open, so the bill itself isn’t the issue. The issue is that nothing surfaced their usage climbing through the month, so an invoice at five times their budget is the first they hear of it.

These are the same product bug in different clothes, and neither is a config problem. Both are design problems, and most teams make the design call before they’ve thought about it.

The default framing is binary: hard (block at the threshold and return 429) versus soft (pass the request and bill the overage). There is a third pattern, better than either, and it’s where experienced API teams end up after their first postmortem involving one of the scenarios above.

Every team I’ve watched land on progressive friction got there the hard way, usually after an end-of-month support thread that started with a developer asking why their app “just stopped working” at 2am.

Use this approach if:

You charge for API access on a monthly plan with a request budget
You've had a customer hit a 429 they didn't see coming, or an invoice they didn't see building
You're designing the first paid tier above your free plan and aren't sure which way to jump

The three patterns

A rate card in Zuplo is the per-plan price sheet, and each line on it is an entitlement, a feature on the plan. Entitlements come in four flavours: Metered (tracks usage against a monthly allowance), Boolean (on/off), Static (a fixed config value like “max 5 webhooks”), and No entitlement (feature isn’t on this plan). This post is about the Metered kind, because the other three don’t have a quota to overflow.

Hard limit. The request is blocked at the threshold and a 429 is returned. Predictable and blunt, it’s the right call for free tiers, for abuse prevention, and for any entitlement where “one more request” is cheaper for you to refuse than to serve.

Soft limit. The request passes, and every call above the included allowance is billed at a per-unit rate, usually via graduated tiered pricing. This is a revenue-positive setup when the customer expects the bill, and a support incident when they don’t, because you’re no longer monetizing, you’re surprising them.

Progressive friction. Layered enforcement that escalates across thresholds: a warning at 80% via the developer portal and an email to the owner, induced latency at 95%+, soft overage billing at 100%, and a hard cutoff only much higher (say 200% of plan) to cap runaway cost. The customer knows the ceiling is coming, has time to upgrade, and the app never falls off a cliff in production.
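The escalation ladder above can be read as a small classification function. The sketch below is illustrative only: the type and function names are hypothetical, the thresholds are the example values from the paragraph, and none of it is part of Zuplo's API.

```typescript
// Thresholds are fractions of the plan's included allowance.
// These mirror the example ladder in the text; tune them per plan.
type FrictionStage = "none" | "warn" | "slow" | "overage" | "cutoff";

interface FrictionThresholds {
  warn: number; // portal banner plus an email to the owner
  slow: number; // induced latency starts
  overage: number; // included allowance exhausted, soft-limit billing starts
  cutoff: number; // hard stop that caps runaway cost
}

const exampleThresholds: FrictionThresholds = {
  warn: 0.8,
  slow: 0.95,
  overage: 1.0,
  cutoff: 2.0,
};

// Return the highest stage whose threshold the usage ratio has reached.
function frictionStage(
  usedRatio: number,
  t: FrictionThresholds = exampleThresholds,
): FrictionStage {
  if (usedRatio >= t.cutoff) return "cutoff";
  if (usedRatio >= t.overage) return "overage";
  if (usedRatio >= t.slow) return "slow";
  if (usedRatio >= t.warn) return "warn";
  return "none";
}
```

The stages stack rather than replace each other: a customer at 120% is being billed for overage and is still being slowed, so the returned stage is best treated as "everything up to and including this step."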

Implementation with Zuplo

The Monetization API is there for teams managing plans as code, but the portal is the faster path.

Hard limit

Use the Free pricing model. Set the entitlement to Metered (track usage) with a usage limit of, say, 20 requests per month. No soft-limit toggle appears because there’s no overage tier to bill against. Hit the limit, you’re blocked.

Free plan rate card with a Metered entitlement and a 20-request monthly usage limit

Soft limit

Use a Tiered pricing model with Graduated price mode. Two tiers: the first from 0 to your included allowance at $0, the second from allowance+1 to ∞ at the per-unit overage rate. Once the overage tier exists, the Soft limit toggle appears. On means requests past the allowance are billed at the overage rate. Off means they’re blocked. Same monetization-inbound policy, one UI switch.
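To make the two-tier arithmetic concrete, here is a minimal sketch of how a graduated charge could be computed. The Tier shape, the function name, and the example plan (10,000 included requests, $0.002 per request after that) are assumptions made up for illustration, not Zuplo's billing code or real prices.

```typescript
interface Tier {
  upTo: number | null; // last unit covered by this tier; null means no upper bound
  unitPrice: number; // price per request inside this tier, in dollars
}

// Graduated pricing: each unit is charged at the rate of the tier it falls in.
function graduatedCharge(usage: number, tiers: Tier[]): number {
  let total = 0;
  let previousCap = 0;
  for (const tier of tiers) {
    const cap = tier.upTo ?? Infinity;
    const unitsInTier = Math.max(0, Math.min(usage, cap) - previousCap);
    total += unitsInTier * tier.unitPrice;
    if (usage <= cap) break; // usage ends inside this tier
    previousCap = cap;
  }
  return total;
}

// Hypothetical plan: the allowance is free, everything above it is overage.
const exampleTiers: Tier[] = [
  { upTo: 10000, unitPrice: 0 },
  { upTo: null, unitPrice: 0.002 },
];
```

With this example plan, 12,500 requests in a month means 2,500 overage units at $0.002, or about $5 on top of the base price. Anything at or under 10,000 requests adds nothing.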

Tiered rate card with graduated pricing and the Soft limit toggle enabled

Pro tip:

Soft limit off doesn’t mean “no limit.” It means requests past the allowance are blocked, even though an overage tier is configured. The toggle is the single switch between “bill them” and “block them,” so it’s worth confirming the state before you publish a rate card.

Rate Cards Reference

Full reference for rate cards, entitlement types, the Soft limit toggle, and the pricing models they pair with.

Progressive friction

Start from the soft-limit configuration above, then add a custom-code-inbound policy to the route. custom-code-inbound is Zuplo’s “drop in your own TypeScript” escape hatch: the file lives in modules/, and config/policies.json points at it by export name.

Order it after the monetization policy in the inbound pipeline. The monetization policy attaches the customer’s subscription to the request context, so the code that runs next can read it and add friction when usage crosses a threshold. This is the same pattern the monetization-inbound policy docs describe as a soft-limit example, with a latency step added.

Zuplo inbound policy pipeline showing monetization-inbound, then apply-progressive-friction, then set-user-headers, ordered before the URL Forward handler

import {
  ZuploContext,
  ZuploRequest,
  MonetizationInboundPolicy,
} from "@zuplo/runtime";

export default async function (request: ZuploRequest, context: ZuploContext) {
  const subscription = MonetizationInboundPolicy.getSubscriptionData(context);
  const entitlement = subscription?.entitlements?.["api_requests"];
  // No metered entitlement on this plan, nothing to slow down.
  if (!entitlement?.balance) return request;

  const used = entitlement.usage / entitlement.balance;
  if (used < 0.95) return request;

  await new Promise((r) => setTimeout(r, 2000));
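  // Inbound headers are immutable, so clone the request to add one.
  // Note: a header set on the request travels upstream to the backend; the
  // caller only sees it if something copies it onto the response.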
  const warned = new ZuploRequest(request);
  warned.headers.set(
    "X-Usage-Warning",
    `${Math.round(used * 100)}% of plan used`,
  );
  return warned;
}

On the subscription object, "api_requests" is the meter name you set on the rate card, balance is the allowance granted for the period (not the remaining amount, a naming quirk worth flagging), and usage is how much has been consumed. The usage value lags the current request by one increment because it reflects backend state, which is fine for a threshold like “slow down at 95%.”
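Because balance means "allowance granted" rather than "amount left," it can help to derive the numbers you actually want in one place. The helper below is a sketch that assumes only the two fields described above; the interface and function names are hypothetical, and the real subscription object may carry more.

```typescript
// Minimal shape of the two fields the text describes.
interface MeteredEntitlement {
  balance: number; // allowance granted for the period (not what is left)
  usage: number; // units consumed so far in the period
}

function usageSnapshot(e: MeteredEntitlement) {
  return {
    // Guard against a zero allowance so the ratio is never NaN or Infinity.
    usedRatio: e.balance > 0 ? e.usage / e.balance : 0,
    // Clamp at zero: once the customer is in overage, nothing remains.
    remaining: Math.max(0, e.balance - e.usage),
    // Units consumed past the included allowance.
    overage: Math.max(0, e.usage - e.balance),
  };
}
```

For an entitlement with a balance of 1,000 and usage of 950, this gives a ratio of 0.95 with 50 remaining and no overage. At a usage of 1,200 the remaining amount stays at zero and the overage is 200.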

The delay is doing most of the work here, not the header. Induced latency shows up in the developer’s own logs, dashboards, and alerts, where a response header won’t. The X-Usage-Warning header pairs with the delay rather than replacing it, and the stock rate-limit-inbound policy only emits Retry-After, so if you want the client to see their remaining balance, that’s on your custom code.

The snippet only implements the 95% slowdown because that’s the representative step. The three custom thresholds (80%, 95%, and 200%) live in the same policy, each as its own if block; the 100% overage step is already handled by the Soft limit toggle. The shape stays the same across them; what changes is the action per threshold.

At 80%, you’d add an outbound HTTP call to whichever transactional email provider you already use: Resend, Twilio SendGrid, Cloudflare’s new Email Service, anything with a send API.

At 200%, a second if block returns a 429 and cuts the customer off. Usage this far over has stopped looking like a friction problem and started looking like an incident.
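One detail the 80% step needs in practice: the policy runs on every request, so the email should fire once per billing period rather than on every call above the threshold. The sketch below shows the two outer if blocks with that guard. Everything here is hypothetical (the Action type, the function name, the key format), and the in-memory Set is a stand-in: a gateway running many instances would need a shared store for the guard to hold.

```typescript
type Action =
  | { kind: "pass" }
  | { kind: "notify"; message: string }
  | { kind: "block"; httpStatus: 429; body: string };

// ASSUMPTION: in-memory record of warnings already sent. This is per-process
// state, so it only illustrates the idea of a once-per-period guard.
const alreadyWarned = new Set<string>();

function thresholdAction(
  subscriptionId: string,
  billingPeriod: string, // e.g. "2025-06"; any stable per-period key works
  usedRatio: number,
): Action {
  // 200%: stop serving. Checked first so a runaway client is never just emailed.
  if (usedRatio >= 2.0) {
    return {
      kind: "block",
      httpStatus: 429,
      body: "Usage is over 200% of plan. Upgrade or contact support.",
    };
  }
  // 80%: tell the owner once, then stay quiet for the rest of the period.
  if (usedRatio >= 0.8) {
    const key = `${subscriptionId}:${billingPeriod}`;
    if (!alreadyWarned.has(key)) {
      alreadyWarned.add(key);
      return {
        kind: "notify",
        message: `${Math.round(usedRatio * 100)}% of plan used`,
      };
    }
  }
  return { kind: "pass" };
}
```

In a real policy the notify branch would make the outbound call to your email provider and then let the request continue, and the block branch would return a 429 response instead of the request.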

Pro tip:

The snippet shows the shape, not a production implementation. Four places worth hardening before it sees real traffic:

Scale the delay with usage. A flat 2-second setTimeout holds a worker slot per slowed request. Under a burst, those stack up and add latency to requests that weren’t even over the threshold.
Move thresholds and the meter name into policy options. Otherwise tweaking “slow down at 95%” to “slow down at 90%” means a code change and a redeploy.
Log via context.log when friction fires. A customer opening a “my app feels slow” ticket is much easier to diagnose if you can see whether friction was the cause.
Use the IETF RateLimit draft header instead of a custom X-*. Any client library that understands the standard can back off automatically, instead of every caller having to learn your bespoke header.
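Three of these hardening steps are easy to sketch as pure helpers. The code below assumes the options object reaches the policy as a plain argument; all option names, default values, and the secondsToReset parameter are hypothetical. The header names follow the three-field form used in earlier revisions of the IETF draft (RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset). Later revisions restructure these fields, so check the revision your clients target.

```typescript
interface FrictionOptions {
  meter: string; // meter name from the rate card
  slowAt: number; // usage ratio where latency starts
  fullDelayAt: number; // usage ratio where latency reaches maxDelayMs
  minDelayMs: number;
  maxDelayMs: number;
}

const defaultOptions: FrictionOptions = {
  meter: "api_requests",
  slowAt: 0.95,
  fullDelayAt: 1.5,
  minDelayMs: 250,
  maxDelayMs: 2000,
};

// Merge whatever the policy configuration supplies over the defaults.
function withDefaults(options: Partial<FrictionOptions> = {}): FrictionOptions {
  return { ...defaultOptions, ...options };
}

// Linear ramp from minDelayMs at slowAt up to maxDelayMs at fullDelayAt.
function delayMs(usedRatio: number, o: FrictionOptions): number {
  if (usedRatio < o.slowAt) return 0;
  const span = o.fullDelayAt - o.slowAt;
  const progress = span > 0 ? Math.min(1, (usedRatio - o.slowAt) / span) : 1;
  return Math.round(o.minDelayMs + progress * (o.maxDelayMs - o.minDelayMs));
}

// Header values derived from the allowance (balance) and consumption (usage).
function rateLimitHeaders(
  balance: number,
  usage: number,
  secondsToReset: number, // ASSUMPTION: supplied by the caller
): Record<string, string> {
  return {
    "RateLimit-Limit": String(balance),
    "RateLimit-Remaining": String(Math.max(0, balance - usage)),
    "RateLimit-Reset": String(secondsToReset),
  };
}
```

With the ramp, a customer who has just crossed 95% pays a small, noticeable delay, and only sustained overage reaches the full two seconds, which keeps fewer requests parked in long sleeps during a burst.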

Monetization Policy Reference

Full reference for the monetization-inbound policy, the documented soft-limit example this pattern extends, and the subscription data model it exposes.

Where each fits

Hard limits for free tiers, abuse prevention, and any entitlement where the marginal cost of an extra call is too high to eat.
Soft limits for enterprise contracts, customers with payment on file and spending forecasts, and any call whose value exceeds the marginal cost.
Progressive friction as the default for paid plans, where the goal is customers who upgrade rather than churn or rage-quit.

Design, not default

A 429 in production at 3am and an invoice landing at five times budget are two shapes of the same failure: the gateway had the usage data and didn’t surface it in time for anyone to act on it. That makes the gateway the right place to close the gap, because it’s the only thing that sees usage in real time, before the backend does and before the customer does. Progressive friction is that visibility turned into signals the customer can respond to.

This is the same logic behind Zuplo’s meterOnStatusCodes defaulting to "200-299": failed requests shouldn’t count against a customer’s quota. Quota enforcement is a sibling design decision, and one worth making deliberately rather than by default.