Rate Limiting: The System Design Concept You Can't Afford to Skip

Welcome, Developer 👋

Last time we talked about caching, and how it reduces the number of times you hit your database. Today we are going to talk about the other side of the same coin.

Because here is the scenario that keeps people up at night. You built a great API. You cached the hot paths. You are handling solid traffic. Then one client sends 100,000 requests in a minute, your database connection pool drains, and the whole thing falls over. Everyone gets a 500.

Rate limiting is what separates a well-designed system from an average one. Like caching, most developers know it exists, but far fewer can reason about which algorithm to pick or why their implementation has a race condition hiding in it.

Let’s get into it.

Why Rate Limiting Exists

When people hear “rate limiting” they think about blocking attackers. That is part of it, but it is not the main reason most systems need it.

Here are the three situations that actually bite you in production:

A buggy client in a retry loop. Someone ships a mobile app with a retry on failure, forgets to add backoff, and now every error triggers an instant retry. One bad deploy and you have thousands of clients hammering a failing endpoint as fast as the network allows.

A legitimate user doing too much. A customer writes a script against your public API to sync their data. Their loop has no delay. They are not malicious, they are just enthusiastic, and they are eating capacity that belongs to everyone else.

An actual denial of service. Traffic that is deliberately designed to overwhelm you.

Notice that two of those three are your own users acting in good faith. Your database does not care why it is being hammered. It falls over the same way regardless of intent. Rate limiting is how you protect the system from load, whatever the source.

Where Rate Limiting Lives

Before we get into algorithms, it helps to know where the limiter can sit. Same as we did with cache layers in the last post.

At the API gateway. Tools like Azure API Management, Kong, or AWS API Gateway can reject traffic before it ever reaches your application code. This is the cheapest place to drop a request because your app never spends a cycle on it. Good for coarse limits like “this API key gets 1000 requests per minute”.

In your application code. When you need finer control, per user, per endpoint, per resource, you do it in the app. You have all the context here. You know who the user is, what they are asking for, and what tier they are on. The cost is that the request already made it to your server before you reject it.

At the infrastructure level. Nginx and most load balancers can rate limit by IP. Fast and blunt. Useful as a first line of defence against raw connection floods, but it has no idea about users or business logic.

Most real systems combine at least two of these. A gateway limit to catch the obvious abuse, plus application-level limits for the per-user rules that actually matter to your product.

For the rest of this post we will focus on the application layer, because that is where you write the interesting logic and where the algorithm choice matters.

Setup

We are using the following versions throughout this post:

Node.js 22 LTS
TypeScript 6
ioredis 5 (the current major version of this Redis client for Node.js)
Express 5
Redis 8 (the current major version — Lua scripting API is unchanged from v7)

Install the dependencies:

npm install ioredis express
npm install -D typescript @types/node @types/express

Create your Redis client once and reuse it across the app. In ioredis v5, the named import is the preferred style for TypeScript projects:

// src/redis.ts
import { Redis } from "ioredis";
 
// Reads REDIS_URL from your environment. Falls back to localhost for local dev.
// Example: REDIS_URL=redis://username:password@your-host:6379
export const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");
 
redis.on("error", (err) => {
  console.error("Redis connection error:", err);
});

Beginner tip: REDIS_URL is a convention, not a built-in. You set it as an environment variable in your deployment platform (Render, Railway, Fly, etc.) and the Redis client reads it. For local development, the fallback redis://localhost:6379 connects to a Redis instance running on your machine. You can start one with Docker: docker run -p 6379:6379 redis:8.

The Four Core Algorithms

1. Fixed Window Counter

The simplest approach. Divide time into fixed buckets — say one minute each — and count requests inside the current bucket. When the count exceeds your limit, reject the request. When the clock rolls over to the next bucket, the counter resets automatically because we set a TTL on the Redis key.

Here is the basic implementation:

// src/limiters/fixed-window.ts
import { redis } from "../redis";
 
interface RateLimitResult {
  allowed: boolean;
  remaining: number;
  resetInSeconds: number;
}
 
export async function fixedWindowLimit(
  key: string,          // e.g. "ratelimit:user:123"
  limit: number,        // max requests allowed in the window
  windowSeconds: number // window size in seconds
): Promise<RateLimitResult> {
  // Build a key that includes the current window start time.
  // Math.floor(now / window) * window gives the unix timestamp of the window start.
  // e.g. at 12:01:35 with a 60s window, windowStart = 12:01:00
  const now = Math.floor(Date.now() / 1000);
  const windowStart = Math.floor(now / windowSeconds) * windowSeconds;
  const windowKey = `${key}:${windowStart}`;
 
  // INCR atomically increments and returns the new value.
  // If the key does not exist, Redis creates it at 0 before incrementing, so the
  // first call in any window always returns 1.
  const count = await redis.incr(windowKey);
 
  if (count === 1) {
    // This is the first request in this window. Set the TTL so the key
    // expires when the window ends, keeping Redis memory clean.
    await redis.expire(windowKey, windowSeconds);
  }
 
  return {
    allowed: count <= limit,
    remaining: Math.max(0, limit - count),
    resetInSeconds: windowSeconds - (now - windowStart),
  };
}

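Note for beginners: INCR is atomic in Redis, meaning no two clients can increment and read the value at the same time. The danger here is not the increment itself but the expire call on the line below it. Those are two separate round trips, so if your server crashes between them you end up with a key that never expires. With the window-stamped key used here, that is a slow memory leak, because the next window gets a fresh key. With a plain key that has no window suffix, the counter would never reset and would permanently block that user. The Lua scripts later in this post avoid that kind of gap by running the whole sequence as one atomic script.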

This is easy to understand and cheap to run. But it has a boundary problem that can let twice your limit through in a short burst.

Say your limit is 100 requests per minute. A client sends 100 requests at 00:00:59. All allowed, the window A counter is full. One second later at 00:01:00, the window resets, and the client sends 100 more. Also allowed, because that is window B.

You just served 200 requests in about one second, even though your limit was “100 per minute”. The next two algorithms exist to fix exactly this.
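To see the boundary problem in numbers, here is a minimal in-memory sketch of the same counting logic. It is an illustration under assumptions, not the Redis implementation: the InMemoryFixedWindow class and its Map storage are made up for this example, and timestamps are passed in by hand so the run is deterministic.

```typescript
// In-memory stand-in for the Redis fixed window counter (illustration only).
class InMemoryFixedWindow {
  private readonly counts = new Map<number, number>();
  private readonly limit: number;
  private readonly windowSeconds: number;

  constructor(limit: number, windowSeconds: number) {
    this.limit = limit;
    this.windowSeconds = windowSeconds;
  }

  // nowSeconds is passed in (rather than read from the clock) so the
  // example is deterministic.
  allow(nowSeconds: number): boolean {
    const windowStart =
      Math.floor(nowSeconds / this.windowSeconds) * this.windowSeconds;
    const count = (this.counts.get(windowStart) ?? 0) + 1;
    this.counts.set(windowStart, count);
    return count <= this.limit;
  }
}

const fixedWindowDemo = new InMemoryFixedWindow(100, 60);
let allowedAcrossBoundary = 0;

// 100 requests in the last second of window A (t = 59s)...
for (let i = 0; i < 100; i++) {
  if (fixedWindowDemo.allow(59)) allowedAcrossBoundary++;
}
// ...and 100 more in the first second of window B (t = 60s).
for (let i = 0; i < 100; i++) {
  if (fixedWindowDemo.allow(60)) allowedAcrossBoundary++;
}

console.log(allowedAcrossBoundary); // 200
```

Each window on its own respects the limit. It is only the pair of adjacent windows that lets the double burst through.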

2. Sliding Window Log

Instead of counting per fixed bucket, store the exact timestamp of every request. On each new request, discard timestamps older than your window, then count what is left. If the count is under the limit, allow it and record the timestamp. If not, reject.

Redis sorted sets are perfect here. The score is the timestamp, and you can range-delete old entries in a single command.

We use a Lua script to make the whole operation atomic — no gap between the read and the write:

-- sliding_window_log.lua
-- KEYS[1]: the rate limit key (e.g. "ratelimit:user:123")
-- ARGV[1]: current time in milliseconds
-- ARGV[2]: window size in milliseconds
-- ARGV[3]: request limit
-- ARGV[4]: a unique member ID for this request
--
-- Why a unique member ID? Redis sorted sets identify members by value, not position.
-- If two requests arrive at the exact same millisecond, they would have the same
-- score AND the same value if you used just the timestamp, causing one to overwrite
-- the other. Passing something like "<timestamp>-<random>" avoids that collision.
 
local now    = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit  = tonumber(ARGV[3])
 
-- Remove all entries older than the sliding window
redis.call('ZREMRANGEBYSCORE', KEYS[1], 0, now - window)
 
-- Count remaining entries — these are all requests inside the current window
local count = redis.call('ZCARD', KEYS[1])
 
if count < limit then
  -- Allow the request and record its timestamp
  redis.call('ZADD', KEYS[1], now, ARGV[4])
  -- Keep the key alive for one full window after the last request
  redis.call('PEXPIRE', KEYS[1], window)
  return { 1, limit - count - 1 } -- { allowed, remaining }
else
  return { 0, 0 }
end

The TypeScript wrapper:

// src/limiters/sliding-window-log.ts
import { redis } from "../redis";
import { readFileSync } from "fs";
 
const SCRIPT = readFileSync("./src/lua/sliding_window_log.lua", "utf8");
 
interface RateLimitResult {
  allowed: boolean;
  remaining: number;
}
 
export async function slidingWindowLogLimit(
  key: string,
  limit: number,
  windowMs: number // window in milliseconds
): Promise<RateLimitResult> {
  const now = Date.now();
  // Unique member: timestamp + random suffix to handle same-millisecond requests
  const member = `${now}-${Math.random().toString(36).slice(2)}`;
 
  const [allowed, remaining] = (await redis.eval(
    SCRIPT,
    1,       // number of KEYS
    key,     // KEYS[1]
    now,     // ARGV[1]
    windowMs,// ARGV[2]
    limit,   // ARGV[3]
    member,  // ARGV[4]
  )) as [number, number];
 
  return { allowed: allowed === 1, remaining };
}

This is perfectly accurate. There is no boundary trick that lets someone double up, because the window truly slides with each request.

The cost is memory. Every allowed request leaves a timestamp entry in the sorted set. If a user is allowed 10,000 requests per hour, you hold up to 10,000 entries for that one user. For most public APIs with thousands of active users, that memory cost rules this one out. Use it for internal services where the user base is small and accuracy matters more than memory.
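If you want to watch both the accuracy and the memory cost without a Redis instance, a rough in-memory equivalent looks like this. It is a sketch under assumptions: InMemorySlidingLog is a hypothetical class, it keeps a plain array instead of a sorted set, and it assumes calls arrive in non-decreasing time order.

```typescript
// In-memory sketch of the sliding window log (illustration only).
class InMemorySlidingLog {
  private readonly timestamps: number[] = [];
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  allow(nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;

    // Drop entries with timestamp <= cutoff, like ZREMRANGEBYSCORE key 0 cutoff.
    let expired = 0;
    while (expired < this.timestamps.length) {
      const ts = this.timestamps[expired];
      if (ts === undefined || ts > cutoff) break;
      expired++;
    }
    this.timestamps.splice(0, expired);

    // Only allowed requests are recorded, same as the Lua script.
    if (this.timestamps.length >= this.limit) return false;
    this.timestamps.push(nowMs);
    return true;
  }

  entriesHeld(): number {
    return this.timestamps.length;
  }
}

const slidingLog = new InMemorySlidingLog(100, 60000);
let allowedAtEndOfMinute = 0;
let allowedJustAfter = 0;

for (let i = 0; i < 100; i++) {
  if (slidingLog.allow(59000)) allowedAtEndOfMinute++;
}
for (let i = 0; i < 100; i++) {
  if (slidingLog.allow(60000)) allowedJustAfter++;
}

console.log(allowedAtEndOfMinute);     // 100
console.log(allowedJustAfter);         // 0
console.log(slidingLog.entriesHeld()); // 100
```

The second burst is fully rejected, which is the accuracy win. The last line is the cost: one stored entry per allowed request, per user.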

3. Sliding Window Counter

This is the one worth knowing best, because it gives you most of the accuracy of the log with a tiny fraction of the memory.

The core idea: instead of storing every individual timestamp, keep just two integers — a counter for the current fixed window and a counter for the previous one. When a new request arrives, estimate how many requests have happened in the true sliding window by weighting the previous counter based on how much of it still overlaps.

Start with the concept before the formula. Imagine your window is one minute and you are currently 15 seconds into minute B. That means 45 seconds of minute A are still inside your one-minute sliding view. So the “true” request count is everything in minute B, plus 75% of everything in minute A.

That is all the formula says:

overlap = (windowSize - elapsedInCurrentWindow) / windowSize
estimated = currentWindowCount + (previousWindowCount × overlap)

Walk through the boundary example from the fixed window section. The client sent 100 requests at the end of window A. One second into window B, the overlap ratio is (60 - 1) / 60 ≈ 0.98. The estimate is 0 + (100 × 0.98) = 98. That is under the limit of 100, so a small number of new requests in window B would still be allowed. If the client immediately fires 100 more, only the first two get through: each allowed request raises the current counter, and once the estimate reaches 100 every further request is rejected. Had all 100 been counted, the estimate would be 100 + 98 = 198, nearly double the limit. The boundary hole is closed, and you only stored two integers per user.
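The arithmetic above is easy to check in a few lines. This is a small sketch of the formula only, with no Redis involved. The function name estimateSlidingCount and the variable names are made up for this example.

```typescript
// Weighted estimate of requests inside the true sliding window.
function estimateSlidingCount(
  currentCount: number,
  previousCount: number,
  elapsedSeconds: number,
  windowSeconds: number,
): number {
  const overlap = (windowSeconds - elapsedSeconds) / windowSeconds;
  return currentCount + previousCount * overlap;
}

// 15 seconds into minute B: 75% of minute A still counts.
console.log(estimateSlidingCount(0, 100, 15, 60)); // 75

// Replay the boundary burst: window A already holds 100 requests,
// then 100 more arrive one second into window B.
let countInWindowB = 0;
let allowedInBurst = 0;
for (let i = 0; i < 100; i++) {
  if (estimateSlidingCount(countInWindowB, 100, 1, 60) < 100) {
    countInWindowB++; // only allowed requests increment the counter
    allowedInBurst++;
  }
}

console.log(allowedInBurst); // 2
```

Two requests get through instead of one hundred, which is the whole point of weighting the previous window.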

The trade-off is that this is an approximation. It assumes requests in the previous window were spread evenly across that minute. In practice the error is small enough that almost nobody cares, and the memory saving is large enough that almost everybody does. This is the algorithm Cloudflare uses for their own rate limiting at scale.

Here is the implementation, as a Lua script to keep everything atomic:

-- sliding_window_counter.lua
-- KEYS[1]: current window key  (e.g. "ratelimit:user:123:1719878400")
-- KEYS[2]: previous window key (e.g. "ratelimit:user:123:1719878340")
-- ARGV[1]: request limit
-- ARGV[2]: window size in seconds
-- ARGV[3]: seconds elapsed so far in the current window
 
local limit   = tonumber(ARGV[1])
local window  = tonumber(ARGV[2])
local elapsed = tonumber(ARGV[3])
 
-- Read both counters. A missing key returns false in Lua, so default to 0.
local current_count  = tonumber(redis.call('GET', KEYS[1]) or 0)
local previous_count = tonumber(redis.call('GET', KEYS[2]) or 0)
 
-- How much of the previous window still overlaps the sliding view?
local overlap   = (window - elapsed) / window
local estimated = current_count + (previous_count * overlap)
 
if estimated >= limit then
  -- Already at or over the limit. Reject without incrementing.
  return { 0, 0 }
end
 
-- Increment the current window counter atomically and reset its TTL.
-- We keep the key alive for two full windows to ensure the previous window
-- counter is still readable when the next window starts.
local new_count = redis.call('INCR', KEYS[1])
redis.call('EXPIRE', KEYS[1], window * 2)
 
local remaining = math.floor(limit - estimated - 1)
return { 1, math.max(0, remaining) }

The TypeScript wrapper:

// src/limiters/sliding-window-counter.ts
import { redis } from "../redis";
import { readFileSync } from "fs";
 
const SCRIPT = readFileSync("./src/lua/sliding_window_counter.lua", "utf8");
 
interface RateLimitResult {
  allowed: boolean;
  remaining: number;
}
 
export async function slidingWindowCounterLimit(
  key: string,
  limit: number,
  windowSeconds: number
): Promise<RateLimitResult> {
  const now = Math.floor(Date.now() / 1000);
 
  // Compute the start of the current and previous fixed windows
  const currentWindowStart  = Math.floor(now / windowSeconds) * windowSeconds;
  const previousWindowStart = currentWindowStart - windowSeconds;
 
  // Seconds elapsed since the current window started
  const elapsed = now - currentWindowStart;
 
  const currentKey  = `${key}:${currentWindowStart}`;
  const previousKey = `${key}:${previousWindowStart}`;
 
  const [allowed, remaining] = (await redis.eval(
    SCRIPT,
    2,              // number of KEYS
    currentKey,     // KEYS[1]
    previousKey,    // KEYS[2]
    limit,          // ARGV[1]
    windowSeconds,  // ARGV[2]
    elapsed,        // ARGV[3]
  )) as [number, number];
 
  return { allowed: allowed === 1, remaining };
}

Two integers per user. No timestamp entries. Accurate enough for every public API use case I have encountered.

4. Token Bucket

The most intuitive model, and the one most production systems actually use — including AWS API Gateway throttling and Stripe’s API limits.

Picture a bucket that can hold up to N tokens. Every request consumes one token. Tokens are added back at a fixed rate — say 10 per second — up to the bucket’s capacity. If the bucket is empty when a request arrives, that request is rejected.

What makes this feel good to users is the burst behaviour. If a client has been quiet, the bucket fills back up to capacity. They can then fire a burst of requests up to that capacity all at once, then they are limited to the steady refill rate afterward. This matches how real clients actually behave — quiet for a while, then a flurry of activity, then quiet again.

There is a close sibling called the leaky bucket, which processes requests at a constant output rate, like water leaking from a hole at the bottom at a fixed drip. It smooths traffic to a perfectly steady stream but never allows bursts. Token bucket allows bursts, leaky bucket eliminates them. Most APIs prefer the burst-friendly behaviour of token bucket. Leaky bucket shows up more at the infrastructure layer, for example in Nginx’s limit_req module, where smoothing the incoming request rate is exactly the goal.
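Before the Redis version, the token bucket itself fits in a small class. Treat this as a single-process sketch under assumptions: InMemoryTokenBucket is a hypothetical name, the state lives in memory, and the clock is passed in as seconds so the burst behaviour is easy to follow.

```typescript
// In-memory token bucket (illustration only, single process).
class InMemoryTokenBucket {
  private tokens: number;
  private lastRefillSeconds: number;
  private readonly capacity: number;
  private readonly refillRate: number; // tokens added per second

  constructor(capacity: number, refillRate: number, startSeconds: number) {
    this.capacity = capacity;
    this.refillRate = refillRate;
    this.tokens = capacity; // a new bucket starts full
    this.lastRefillSeconds = startSeconds;
  }

  tryConsume(nowSeconds: number, requested = 1): boolean {
    // Refill in proportion to elapsed time, capped at capacity.
    const elapsed = Math.max(0, nowSeconds - this.lastRefillSeconds);
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
    this.lastRefillSeconds = nowSeconds;

    if (this.tokens < requested) return false;
    this.tokens -= requested;
    return true;
  }
}

// Capacity 10, refilling at 10 tokens per second.
const demoBucket = new InMemoryTokenBucket(10, 10, 0);

let burstAllowed = 0;
for (let i = 0; i < 15; i++) {
  if (demoBucket.tryConsume(0)) burstAllowed++;
}
console.log(burstAllowed); // 10

// Half a second later, 5 tokens have refilled.
const allowedAfterPause = demoBucket.tryConsume(0.5);
console.log(allowedAfterPause); // true
```

The burst of 15 is cut off at the capacity of 10, and after a short pause the client is back in business at the refill rate.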

Picking the Right Algorithm

Here is a quick reference before we move on to the hard part:

Algorithm	Memory	Accuracy	Burst-friendly	Best for
Fixed Window	Low	Low	No	Prototypes, admin tools, quick internal scripts
Sliding Window Log	High	High	No	Internal services with a small, controlled user base
Sliding Window Counter	Low	Good	No	Public APIs, high user counts
Token Bucket	Low	Good	Yes	Any API where clients have naturally bursty behaviour

A practical rule of thumb: reach for the sliding window counter when you want accuracy without memory cost, token bucket when you want to be forgiving about short bursts, and fixed window only when you can live with the boundary problem.

Distributed Rate Limiting: Where It Gets Hard

Everything above works fine on a single server. The problem is you do not run a single server.

Picture ten application instances behind a load balancer, each keeping its own in-memory counter. A client sends requests, the load balancer spreads them across all ten instances, and each instance sees only a tenth of the traffic. Your “100 per minute” limit just became “1000 per minute” without changing a single config. The limit is effectively meaningless.
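You can reproduce that multiplication with a toy simulation. The sketch below assumes a perfectly even round-robin load balancer and uses made-up names. Real balancers are less tidy, but the effect is the same.

```typescript
// Ten instances, each enforcing "100 per minute" with its own local counter.
const PER_INSTANCE_LIMIT = 100;
const INSTANCE_COUNT = 10;
const localCounters = new Array<number>(INSTANCE_COUNT).fill(0);

let allowedByCluster = 0;

// One client sends 2000 requests within the same minute.
for (let requestIndex = 0; requestIndex < 2000; requestIndex++) {
  const instance = requestIndex % INSTANCE_COUNT; // round-robin load balancer
  const count = (localCounters[instance] ?? 0) + 1;
  localCounters[instance] = count;
  if (count <= PER_INSTANCE_LIMIT) allowedByCluster++;
}

console.log(allowedByCluster); // 1000
```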

The fix is a shared store that every instance talks to. Redis is the standard answer — one counter, one source of truth, every app server checking the same numbers.

But moving to a shared store introduces a new problem: race conditions.

The Check-Then-Act Trap

Here is code that looks correct and is not:

// BROKEN. Do not ship this.
const count = await redis.get(key); // reads 99
if (Number(count) < limit) {
  // 99 < 100, looks fine
  await redis.incr(key); // now 100
  return true;           // allowed
}

Imagine two requests arriving at the same moment. Both read 99. Both decide 99 is under 100. Both increment. The counter is now 101 and both requests were allowed. Under heavy concurrency — exactly when rate limiting matters most — this leaks requests constantly.

The problem is that the read and the write are two separate operations with a network round trip between them. Anything can happen in that gap.

Redis gives you two clean ways to close this gap:

For simple cases: use commands that atomically read and modify in one step. INCR is the classic example — it increments and returns the new value as a single atomic operation. Check the return value rather than reading separately.

For complex cases: use a Lua script. Redis executes Lua scripts atomically. Nothing else runs between the first line and the last. The whole check-and-update happens as one indivisible unit.

All the Lua scripts in this post use that second approach.
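The difference between the two orderings shows up even in a toy model. In this sketch the interleaving of two requests is forced by hand rather than produced by real concurrency, and FakeCounterStore is a hypothetical in-memory stand-in for Redis whose incr() behaves like INCR.

```typescript
// Hypothetical in-memory stand-in for a Redis counter.
class FakeCounterStore {
  private value: number;

  constructor(initial: number) {
    this.value = initial;
  }

  get(): number {
    return this.value;
  }

  // Like Redis INCR: increments and returns the new value in one step.
  incr(): number {
    this.value += 1;
    return this.value;
  }
}

const RACE_LIMIT = 100;

// Broken ordering: both requests read before either one writes.
const brokenStore = new FakeCounterStore(99);
const readByA = brokenStore.get(); // 99
const readByB = brokenStore.get(); // 99, A has not written yet
const brokenAllowedA = readByA < RACE_LIMIT;
if (brokenAllowedA) brokenStore.incr();
const brokenAllowedB = readByB < RACE_LIMIT;
if (brokenAllowedB) brokenStore.incr();

console.log(brokenAllowedA, brokenAllowedB, brokenStore.get()); // true true 101

// Atomic ordering: increment first, then decide from the returned value.
const atomicStore = new FakeCounterStore(99);
const atomicAllowedA = atomicStore.incr() <= RACE_LIMIT; // 100, allowed
const atomicAllowedB = atomicStore.incr() <= RACE_LIMIT; // 101, rejected

console.log(atomicAllowedA, atomicAllowedB); // true false
```

In the second version there is no moment where a request holds a stale read, because the read and the write are the same operation.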

A Production-Ready Token Bucket

Let’s put everything together. Here is a full token bucket in Lua, atomic, using Redis’s own clock so every app server agrees on the time.

-- token_bucket.lua
-- KEYS[1]: bucket key (e.g. "ratelimit:user:123")
-- ARGV[1]: capacity      — max tokens the bucket can hold
-- ARGV[2]: refillRate    — tokens added per second
-- ARGV[3]: requested     — tokens this request wants (almost always 1)
 
local capacity   = tonumber(ARGV[1])
local refillRate = tonumber(ARGV[2])
local requested  = tonumber(ARGV[3])
 
-- Read the clock from Redis, not from the application server.
-- If you pass Date.now() from ten app servers whose clocks drift even slightly,
-- the refill calculation drifts with them. redis.call('TIME') returns a table
-- of { seconds, microseconds } from the Redis process — one shared clock for
-- every app instance talking to this Redis.
local t   = redis.call('TIME')
local now = tonumber(t[1]) + (tonumber(t[2]) / 1000000)
 
-- Read the current state of the bucket
local data      = redis.call('HMGET', KEYS[1], 'tokens', 'timestamp')
local tokens    = tonumber(data[1])
local timestamp = tonumber(data[2])
 
-- First request for this key — start with a full bucket
if tokens == nil then
  tokens    = capacity
  timestamp = now
end
 
-- Refill: add tokens proportional to the time that has passed, capped at capacity.
-- math.max(0, ...) guards against the (rare) case where clock adjustment makes
-- now appear slightly earlier than the stored timestamp.
local elapsed = math.max(0, now - timestamp)
tokens = math.min(capacity, tokens + (elapsed * refillRate))
 
local allowed = 0
if tokens >= requested then
  allowed = 1
  tokens  = tokens - requested
end
 
-- Persist the updated token count and the current timestamp
redis.call('HSET', KEYS[1], 'tokens', tokens, 'timestamp', now)
 
-- Auto-expire idle buckets. A full bucket holds `capacity` tokens which
-- refill at `refillRate` per second, so capacity/refillRate seconds is
-- exactly how long a depleted bucket takes to fully refill. After that,
-- the key contains no useful state — a fresh key would behave identically.
redis.call('EXPIRE', KEYS[1], math.ceil(capacity / refillRate) + 1)
 
-- Important Redis/Lua gotcha: when a Lua script returns a number, Redis
-- converts it to an integer by truncating, not rounding. We store the precise
-- fractional token count in the hash for accurate refill calculations, but
-- we floor the remaining count before returning it. If you ever return
-- a fractional value here and wonder why it comes back as a whole number,
-- this is why.
return { allowed, math.floor(tokens) }

Now the TypeScript wrapper. Note the defineCommand pattern — this is the production-correct way to handle Lua scripts with ioredis. It handles the EVALSHA / EVAL fallback automatically, meaning it caches the script by SHA on Redis and only re-sends the full script if the cache was cleared.

// src/limiters/token-bucket.ts
import type { Result } from "ioredis";
import { readFileSync } from "fs";
import { redis } from "../redis"; // reuse the shared client, do not open a second connection

// Tell TypeScript about our custom command. ioredis documents augmenting
// RedisCommander for this, so the method is typed on clients and pipelines.
declare module "ioredis" {
  interface RedisCommander<Context> {
    tokenBucket(
      key: string,
      capacity: number,
      refillRate: number,
      requested: number
    ): Result<[number, number], Context>;
  }
}

// Register the Lua script as a named command.
// ioredis will use EVALSHA internally (faster: it sends only the hash, not the
// full script) and fall back to EVAL automatically if Redis doesn't have it
// cached yet (e.g. after a Redis restart).
redis.defineCommand("tokenBucket", {
  numberOfKeys: 1,
  lua: readFileSync("./src/lua/token_bucket.lua", "utf8"),
});

interface RateLimitResult {
  allowed: boolean;
  remaining: number;
}

export async function tokenBucketLimit(
  key: string,
  capacity: number,        // max tokens in the bucket
  refillRate: number,      // tokens added per second
  requested = 1            // tokens this request costs
): Promise<RateLimitResult> {
  // numberOfKeys is fixed in defineCommand, so the key count is NOT passed here.
  // Passing it would shift every argument: the 1 would become KEYS[1].
  const [allowed, remaining] = await redis.tokenBucket(
    key,        // KEYS[1]
    capacity,   // ARGV[1]
    refillRate, // ARGV[2]
    requested,  // ARGV[3]
  );

  return { allowed: allowed === 1, remaining };
}

Beginner tip: defineCommand vs eval — eval sends the entire Lua script to Redis on every call. defineCommand sends it once, Redis stores it by its SHA checksum, and every subsequent call just sends that short hash (EVALSHA). For a script that runs on every API request, defineCommand is always the right choice in production.

Wiring It Into Express

Here is the middleware that puts it all together, using the token bucket and the standard rate limit response headers that well-behaved clients rely on.

// src/middleware/rate-limit.ts
import type { Request, Response, NextFunction } from "express";
import { tokenBucketLimit } from "../limiters/token-bucket";
 
export function rateLimitMiddleware(capacity: number, refillRate: number) {
  return async (req: Request, res: Response, next: NextFunction): Promise<void> => {
    // Limit by authenticated user ID when available, fall back to IP.
    // IP-based limiting can punish everyone behind a corporate NAT or a
    // shared office network — try to identify the actual user when you can.
    const identifier = (req as any).user?.id ?? req.ip ?? "anonymous";
    const key = `ratelimit:${identifier}`;
 
    try {
      const { allowed, remaining } = await tokenBucketLimit(
        key,
        capacity,
        refillRate,
      );
 
      // Set standard rate limit headers on every response, not just on rejection.
      // Well-behaved clients read these to throttle themselves proactively.
      // GitHub uses X-RateLimit-* (the older convention).
      // The IETF draft standardises RateLimit-* without the X- prefix.
      // Either works — pick one and be consistent across your API.
      res.setHeader("RateLimit-Limit", capacity);
      res.setHeader("RateLimit-Remaining", remaining);
 
      if (!allowed) {
        // Retry-After tells the client how many seconds to wait.
        // 1 / refillRate is how long one whole token takes to refill, rounded
        // up: 1 second at 10 tokens/s, 63 seconds at 0.016 tokens/s.
        const retryAfter = Math.ceil(1 / refillRate);
        res.setHeader("Retry-After", retryAfter);
 
        // 429 Too Many Requests, not 500. Never return a 500 for rate limiting.
        res.status(429).json({ error: "Too many requests" });
        return;
      }
 
      next();
    } catch (err) {
      // Redis is down. See the fail-open vs fail-closed discussion below.
      console.error("Rate limiter error:", err);
      next();
    }
  };
}

Use it in your Express app:

// src/app.ts
import express from "express";
import { rateLimitMiddleware } from "./middleware/rate-limit";
 
const app = express();
 
// rateLimitMiddleware(capacity, refillRate)
//   capacity   — the maximum number of tokens the bucket can hold.
//                This is also the maximum burst a user can send all at once.
//   refillRate — tokens added back per second while the user is under the limit.
//
// rateLimitMiddleware(100, 10) means:
//   - A user can fire up to 100 requests instantly if their bucket is full.
//   - After that, they get 10 more requests every second.
//   - A fully depleted bucket takes 100 / 10 = 10 seconds to recover completely.
app.use(rateLimitMiddleware(100, 10));
 
// Tighter limits for sensitive endpoints.
// rateLimitMiddleware(5, 0.016) means:
//   - Burst of 5 requests at once.
//   - Refill rate of ~0.016 tokens per second = roughly 1 token per minute.
//   - A depleted bucket takes about 5 minutes to fully recover.
app.post(
  "/api/send-email",
  rateLimitMiddleware(5, 0.016),
  async (req, res) => {
    res.json({ sent: true });
  }
);

The two parameters always mean the same thing regardless of the endpoint: the first controls how big a burst you allow, the second controls the sustained rate once the burst is spent. A useful formula to keep in mind:

recovery time (seconds) = capacity / refillRate

Here are some realistic examples to build intuition:

Use case	capacity	refillRate	Max burst	Sustained rate	Recovery time
Public search API	20	1	20 requests	1 per second	20 seconds
Send email endpoint	5	0.016	5 requests	~1 per minute	~5 minutes
Authenticated dashboard	200	10	200 requests	10 per second	20 seconds
Webhook delivery	10	0.1	10 requests	1 per 10 seconds	100 seconds

The lower your refillRate, the harder it is to abuse your endpoint in a sustained way. The larger your capacity, the more forgiving it feels for clients that send occasional short bursts, like a mobile app that wakes up and syncs several items at once.
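If you would rather compute those columns than eyeball them, a tiny helper does it. This is only a sketch of the arithmetic. The name describeBucket and the shape of its result are assumptions made for this example.

```typescript
// Derive the intuition-building numbers from the two bucket parameters.
function describeBucket(capacity: number, refillRate: number) {
  return {
    maxBurst: capacity,
    // Worst-case wait for one whole token, as used for Retry-After.
    retryAfterSeconds: Math.ceil(1 / refillRate),
    // Time for an empty bucket to refill completely.
    recoverySeconds: capacity / refillRate,
  };
}

console.log(describeBucket(20, 1));   // public search API
console.log(describeBucket(200, 10)); // authenticated dashboard
console.log(describeBucket(5, 0.016)); // send email endpoint
```

For the send email row this gives a Retry-After of 63 seconds and a recovery time of roughly 312 seconds, which is where the "about 5 minutes" figure comes from.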

Fail Open or Fail Closed

Look at that catch block again. When Redis is unreachable, the example calls next() and lets the request through. That is a deliberate choice, and it is the choice that separates a senior answer from a staff-level one.

Fail open means if the rate limiter breaks, you allow traffic. Your API stays up, but for the duration of the outage you have no protection. A flood during a Redis blip goes straight through to your database.

Fail closed means if the rate limiter breaks, you reject traffic. You stay protected, but a Redis outage now takes your whole API down with it — including users who were nowhere near their limit.

There is no universally correct answer. For a public API where abuse is the bigger threat, failing closed might be right. For an internal service where availability matters more than perfect limiting, failing open usually wins.

Here is how you implement fail closed if that is your choice:

} catch (err) {
  console.error("Rate limiter error:", err);
  // Fail closed: treat a limiter error as a rejection
  res.status(503).json({ error: "Service temporarily unavailable" });
  return;
}

The point is that this is a decision you make on purpose, with the trade-off written down — not a default you backed into because that is how the catch block happened to be structured.
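One way to make that decision explicit is to pass the failure mode in as a parameter, so it shows up in code review instead of hiding in a catch block. The sketch below is an assumption about how you might structure it, not part of the middleware above: checkWithFailureMode, FailureMode, and LimiterDecision are hypothetical names, and the limiter is injected as a plain async function.

```typescript
type FailureMode = "open" | "closed";

interface LimiterDecision {
  allowed: boolean;
  degraded: boolean; // true when the limiter itself failed
}

// Runs a limiter check and applies the chosen policy if the check throws.
async function checkWithFailureMode(
  check: () => Promise<boolean>,
  mode: FailureMode,
): Promise<LimiterDecision> {
  try {
    return { allowed: await check(), degraded: false };
  } catch {
    // Fail open lets the request through, fail closed rejects it.
    return { allowed: mode === "open", degraded: true };
  }
}

// Simulate Redis being unreachable.
const unreachableLimiter = async (): Promise<boolean> => {
  throw new Error("Redis unreachable");
};

checkWithFailureMode(unreachableLimiter, "open").then((decision) => {
  console.log(decision); // allowed: true, degraded: true
});
checkWithFailureMode(unreachableLimiter, "closed").then((decision) => {
  console.log(decision); // allowed: false, degraded: true
});
```

The degraded flag is there so the caller can pick the right status code: a 429 when the client is over the limit, a 503 when the limiter is the thing that broke.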

How to Think About This in a System Design Interview

When an interviewer asks how you would protect an API under heavy load, caching is one answer and rate limiting is the very next one. When you bring it up, here is the mental checklist that signals you have done this for real:

What are you limiting on? Per IP, per user, per API key, per endpoint? An IP limit punishes everyone behind a corporate NAT. A per-user limit needs the user identified before the check. The dimension you pick changes everything downstream.
What is the limit and the window? A burst-friendly token bucket behaves very differently from a strict fixed window. Match it to how your clients actually behave.
Where does the counter live, and how do you keep it consistent? This is where you mention the distributed problem and atomic operations in Redis. If you can say the words “check-then-act race condition” and explain why a Lua script fixes it, you are ahead of most candidates.
What happens when the limiter itself fails? Fail open or fail closed. Saying this out loud tells the interviewer you think about the failure modes of your own infrastructure, not just the happy path.

That fourth point is the one that lands. Anyone can describe a token bucket. The people who get the senior offer are the ones who think about what breaks when the thing they built to protect the system is itself the thing that goes down.
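For the first question on that checklist, it helps to have the dimension written down as code. Here is one possible sketch, assuming a preference order of user, then API key, then IP. The names buildLimitKey and LimitSubject are hypothetical, and your own ordering may differ.

```typescript
interface LimitSubject {
  userId?: string;
  apiKey?: string;
  ip?: string;
}

// Builds the rate limit key from the most specific identity available.
function buildLimitKey(subject: LimitSubject, endpoint?: string): string {
  const who =
    subject.userId !== undefined ? `user:${subject.userId}`
    : subject.apiKey !== undefined ? `key:${subject.apiKey}`
    : subject.ip !== undefined ? `ip:${subject.ip}`
    : "anonymous";

  // An optional endpoint suffix gives per-endpoint limits for the same caller.
  return endpoint === undefined
    ? `ratelimit:${who}`
    : `ratelimit:${who}:${endpoint}`;
}

console.log(buildLimitKey({ userId: "123", ip: "10.0.0.1" })); // ratelimit:user:123
console.log(buildLimitKey({ ip: "10.0.0.1" }, "send-email"));  // ratelimit:ip:10.0.0.1:send-email
```

Prefixing the dimension also keeps a numeric user ID from ever colliding with some other identifier that happens to look the same.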

What I Learned

Rate limiting feels like a solved problem right up until you are staring at a dashboard at 2am wondering why a paying customer is getting 429s when they are clearly under their limit. Then you find the per-IP key, realise half your users sit behind the same office gateway, and understand why the dimension you limit on is not a detail.

The algorithm is the easy part. The hard parts are the ones nobody draws on the whiteboard. Clock skew between servers. The race condition that only shows up under load. The key that never got its TTL and quietly lingers forever. The decision about what to do when Redis itself is down. Those are where the real engineering lives.

Conclusion

Caching reduces how often you hit the database. Rate limiting protects you when caching is not enough, when a single client decides to send you a year’s worth of traffic in a minute. They are the first two lines of defence in any system that has to survive real load, and they belong on your design checklist from the very start, not bolted on after the first outage.

Pick your algorithm based on how your clients behave. Token bucket when you want to allow bursts, sliding window counter when you want accuracy without the memory cost, fixed window only when you can live with the boundary problem. Make every check atomic. Use a shared store the moment you have more than one server. And decide on purpose what happens when the limiter breaks.

That is what separates a developer who adds a rate limit because someone told them to, from one who designs a system that stays standing when the traffic gets ugly.

If you are prepping for system design interviews, this one pairs directly with the caching post, so read them together.

Stay focused, Developer!