Rate Limiter: The Invisible Bouncer Protecting Every API You've Ever Used
Every time you scroll Twitter during the Super Bowl, smash the like button on Instagram, or fire off API calls to Stripe, there's an invisible system deciding whether you're allowed to continue. That system is a rate limiter — and designing one that's fast, fair, and fault-tolerant is one of the most practical system design challenges you'll face.
Why Rate Limiting Matters
Rate limiting isn't just about stopping malicious actors. It's about protecting shared resources from legitimate overuse. A single user with a buggy retry loop can take down your service for everyone. A sudden mobile app reconnection storm can saturate your database. Rate limiting is traffic control for your infrastructure.
The core tension: you want to block abuse without punishing legitimate bursts. A user replaying 50 queued requests after reconnecting? Probably fine. That same user sustaining 50 requests per second for an hour? That's a problem.
The Four Algorithms You Need to Know
| Algorithm | Burst Handling | Memory | Precision | Best For |
|---|---|---|---|---|
| Token Bucket | Excellent | Low | Good | APIs with burst tolerance |
| Leaky Bucket | Smooths output | Low | Good | Steady-rate processing |
| Fixed Window | Poor (boundary exploit) | Very Low | Low | Simple use cases |
| Sliding Window | Good | Higher | High | Fair, accurate limiting |
Token Bucket is the most popular choice. Tokens accumulate at a steady rate (e.g., 10/second) up to a maximum bucket size. Each request costs a token. Bursts are naturally supported — if you've been idle, your bucket is full.
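The refill-on-demand variant needs no background timer: each request first tops up the bucket based on elapsed time, then tries to spend a token. A minimal single-process sketch (class and parameter names are illustrative; the `now` parameter exists only to make the logic deterministic in tests):

```python
import time

class TokenBucket:
    """Refill-on-demand token bucket: no background refill thread needed."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate                # tokens added per second
        self.capacity = capacity        # maximum burst size
        self.tokens = capacity          # start full: an idle user gets a full burst
        self.last_refill = time.monotonic() if now is None else now

    def allow(self, cost=1.0, now=None):
        now = time.monotonic() if now is None else now
        # Lazily add the tokens that accumulated since the last call, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A distributed version would keep `tokens` and `last_refill` in a shared store such as Redis and perform the refill-and-spend step atomically.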
Fixed Window is simple but has a well-known exploit: with a limit of 1,000 requests per minute, send 1,000 requests at 11:59:59 and another 1,000 at 12:00:01. Each batch stays within its own window, yet you've pushed 2,000 requests through in two seconds, double the intended rate.
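A toy in-memory counter makes the boundary exploit concrete (illustrative names, not production code). Two full bursts land two-tenths of a second apart, one on each side of the window boundary, and both are accepted:

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Illustrative fixed-window counter; window boundaries enable the burst exploit."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)   # (user, window_id) -> request count

    def allow(self, user, now=None):
        now = time.time() if now is None else now
        window_id = int(now // self.window)   # e.g. minute number since the epoch
        key = (user, window_id)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True
```

With `limit=3, window_seconds=60`, three requests at t=59.9 fill the first window, and three more at t=60.1 are all accepted in the next one: six requests in 0.2 seconds.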
Sliding Window fixes this by tracking requests across a rolling time period, but requires more memory per user — you need timestamps, not just counters.
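The exact "log" variant stores one timestamp per recent request, which is where the extra memory goes. An illustrative per-user sketch (names are mine, not from any library):

```python
import time
from collections import deque

class SlidingWindowLog:
    """Exact sliding-window limiter: one timestamp per recent allowed request."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.events = deque()   # timestamps of allowed requests, oldest first

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Evict timestamps that have aged out of the rolling window.
        while self.events and now - self.events[0] >= self.window:
            self.events.popleft()
        if len(self.events) < self.limit:
            self.events.append(now)
            return True
        return False
```

Production systems often approximate this with weighted counts from two adjacent fixed windows ("sliding window counter") to avoid storing a timestamp per request.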
Leaky Bucket processes requests at a fixed rate regardless of input burst, acting like a queue with constant drain. Great for smoothing traffic to downstream services.
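A queue-based sketch shows the shape: requests are admitted up to a fixed capacity, and a separate timer releases them at a constant rate. The class name and the timer-driven `drain` hook are illustrative assumptions:

```python
from collections import deque

class LeakyBucket:
    """Queue-based leaky bucket: absorbs bursts up to capacity, drains at a fixed rate."""

    def __init__(self, drain_rate, capacity):
        self.drain_rate = drain_rate   # requests released per second
        self.capacity = capacity       # max queued requests before rejecting
        self.queue = deque()

    def offer(self, request):
        """Enqueue a request, or reject it if the bucket is full."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def drain(self):
        """Called by a timer every 1/drain_rate seconds: release one request FIFO."""
        return self.queue.popleft() if self.queue else None
```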
The Interview Framework
When you get this in an interview, start here:
- Clarify requirements — Per-user? Per-IP? Per-endpoint? What's the scale (1K QPS or 100K)?
- Non-functional requirements — Latency budget (adding 50ms per request is unacceptable), availability requirements, fail-open vs fail-closed policy
- Back-of-envelope math — 10M users × 8 bytes per counter = 80MB of raw counter data. Even allowing for Redis's per-key overhead, it fits comfortably in a single instance.
- High-level architecture — API Gateway → Rate Limit Middleware → Redis counters → Your service
- Deep dive — Distributed coordination, algorithm choice, failure modes
Key insight: The rate limiter sits on the hot path of every request. Anything you add here multiplies across your entire QPS.
The Distributed Problem
Here's where "just use Redis" stops being a complete answer. With multiple API servers, you need coordination:
Centralized (Redis): All servers check/increment the same counter. Atomic operations (INCR + EXPIRE, or Lua scripts) prevent race conditions. Trade-off: every request adds a network hop to Redis.
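The centralized pattern can be sketched with an in-memory stand-in for Redis so the control flow is visible. In production, the increment and expiry must run atomically server-side (for example via a Lua script wrapping INCR and EXPIRE), not as separate client steps; the names below are illustrative:

```python
import time

# In real Redis the two steps below would be one atomic Lua script, e.g.:
#   local n = redis.call('INCR', KEYS[1])
#   if n == 1 then redis.call('EXPIRE', KEYS[1], ARGV[1]) end
#   return n
# A plain dict stands in here so the logic is runnable locally.

class CentralizedCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.store = {}   # key -> (count, expires_at)

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        count, expires_at = self.store.get(key, (0, now + self.window))
        if now >= expires_at:                       # key "expired": fresh window
            count, expires_at = 0, now + self.window
        count += 1                                  # the INCR step
        self.store[key] = (count, expires_at)
        return count <= self.limit
```

Every API server calling `allow` against the same store sees the same counter, which is exactly what the network hop buys you.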
Local + Periodic Sync: Each server counts locally and syncs to a central store periodically. Lower latency, but you lose precision — a user could exceed limits by hitting different servers before sync occurs.
Batched Updates: Buffer several requests worth of counter increments and flush them in batches. Reduces Redis load but introduces a small accuracy window.
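A sketch of the buffering side, assuming a `flush_fn` that pushes the accumulated deltas to the central store in one round trip (for example, a Redis pipeline of INCRBY calls); the class and its parameters are hypothetical:

```python
from collections import defaultdict

class BatchedReporter:
    """Buffers per-key increments locally, flushing to the central store in batches."""

    def __init__(self, flush_fn, batch_size=100):
        self.flush_fn = flush_fn       # sends {key: delta} to the shared store
        self.batch_size = batch_size
        self.pending = defaultdict(int)
        self.buffered = 0

    def record(self, key, n=1):
        self.pending[key] += n
        self.buffered += n
        if self.buffered >= self.batch_size:
            self.flush()

    def flush(self):
        """Also call this on a short timer, so quiet keys still sync promptly."""
        if self.pending:
            self.flush_fn(dict(self.pending))
            self.pending.clear()
            self.buffered = 0
```

The accuracy window mentioned above is exactly the interval between flushes: requests counted locally but not yet flushed are invisible to other servers.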
The thundering herd is real: 100 servers simultaneously incrementing the same Redis key during a traffic spike. Solutions include request coalescing, Redis clustering, and client-side jitter.
Fail-Open vs Fail-Closed
When your rate limiting infrastructure goes down:
- Fail-open: Allow all requests through. Preserves availability but removes protection. Most companies choose this — a brief period without rate limiting is better than dropping all traffic.
- Fail-closed: Reject all requests. Safer against attacks but causes an outage for legitimate users.
The pragmatic approach: fail-open with local fallback counters and circuit breakers. If Redis is unreachable, apply conservative in-memory limits until connectivity recovers.
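The fallback wiring itself is small. A sketch under the assumption that the remote check raises an exception on connectivity failure and that the local limiter applies a tighter in-memory threshold (both callables are placeholders):

```python
class ResilientLimiter:
    """Fail-open wrapper: if the central check errors, fall back to a local limit."""

    def __init__(self, remote_allow, local_allow):
        self.remote_allow = remote_allow   # e.g. a Redis-backed check; may raise
        self.local_allow = local_allow     # conservative in-memory limiter

    def allow(self, key):
        try:
            return self.remote_allow(key)
        except Exception:
            # Central store unreachable: stay available, keep the local safety net.
            return self.local_allow(key)
```

A production version would add a circuit breaker so that, while the store is known-down, requests skip the failing remote call instead of paying a timeout on every one.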
How the Big Players Built It
Stripe uses Redis with sliding window counters and Lua scripts for atomicity. They batch operations and handle billions of API requests with sub-millisecond latency overhead. Their multi-tier approach applies different limits at the API key, IP, and account levels simultaneously.
Cloudflare does rate limiting at the edge network level — blocking bad traffic before it reaches your origin servers. It's like having bouncers at every entrance to the city, not just your building.
GitHub evolved from simple counters to sophisticated abuse detection. They use a sharded, replicated rate limiter in Redis that differentiates between authenticated and unauthenticated requests, with different limits per endpoint category.
Key Takeaways
- Algorithm choice depends on your burst tolerance — Token bucket for most APIs, sliding window when fairness is critical
- Distributed coordination is the hard part — Not the algorithm itself
- Monitor your false positive rate — If legitimate users are getting rate limited, your thresholds are wrong
- Design for failure — Your rate limiter will go down; decide what happens when it does
- Layer your defenses — Edge-level, application-level, and database-level limits each catch different problems
- Don't add latency to the hot path — Async updates, batching, and local caches are your friends