Rate Limiter: The Invisible Bouncer Protecting Every API You've Ever Used
Every time you scroll Twitter during the Super Bowl, smash the like button on Instagram, or fire off API calls to Stripe, there's an invisible system deciding whether you're allowed to continue. That system is a rate limiter — and designing one that's fast, fair, and fault-tolerant is one of the most practical system design challenges you'll face.
Why Rate Limiting Matters
Rate limiting isn't just about stopping malicious actors. It's about protecting shared resources from legitimate overuse. A single user with a buggy retry loop can take down your service for everyone. A sudden mobile app reconnection storm can saturate your database. Rate limiting is traffic control for your infrastructure.
The core tension: you want to block abuse without punishing legitimate bursts. A user replaying 50 queued requests after reconnecting? Probably fine. That same user sustaining 50 requests per second for an hour? That's a problem.
The Four Algorithms You Need to Know
| Algorithm | Burst Handling | Memory | Precision | Best For |
|---|---|---|---|---|
| Token Bucket | Excellent | Low | Good | APIs with burst tolerance |
| Leaky Bucket | Smooths output | Low | Good | Steady-rate processing |
| Fixed Window | Poor (boundary exploit) | Very Low | Low | Simple use cases |
| Sliding Window | Good | Higher | High | Fair, accurate limiting |
Token Bucket is the most popular choice. Tokens accumulate at a steady rate (e.g., 10/second) up to a maximum bucket size. Each request costs a token. Bursts are naturally supported — if you've been idle, your bucket is full.
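The refill-on-demand variant needs no background timer: each request first tops up the bucket based on elapsed time, then tries to spend a token. A minimal single-process sketch (class and parameter names are illustrative; the `now` parameter exists only to make the logic deterministic in tests):

```python
import time

class TokenBucket:
    """Refill-on-demand token bucket: no background refill thread needed."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate                # tokens added per second
        self.capacity = capacity        # maximum burst size
        self.tokens = capacity          # start full: an idle user gets a full burst
        self.last_refill = time.monotonic() if now is None else now

    def allow(self, cost=1.0, now=None):
        now = time.monotonic() if now is None else now
        # Lazily add the tokens that accumulated since the last call, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A distributed version would keep `tokens` and `last_refill` in a shared store such as Redis and perform the refill-and-spend step atomically.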
Fixed Window is simple but has a well-known exploit: with a limit of 1,000 requests per minute, send 1,000 requests at 11:59:59 and another 1,000 at 12:00:01. Each batch stays within its own window, yet you've pushed 2,000 requests through in two seconds, double the intended rate.
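A toy in-memory counter makes the boundary exploit concrete (illustrative names, not production code). Two full bursts land two-tenths of a second apart, one on each side of the window boundary, and both are accepted:

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Illustrative fixed-window counter; window boundaries enable the burst exploit."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)   # (user, window_id) -> request count

    def allow(self, user, now=None):
        now = time.time() if now is None else now
        window_id = int(now // self.window)   # e.g. minute number since the epoch
        key = (user, window_id)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True
```

With `limit=3, window_seconds=60`, three requests at t=59.9 fill the first window, and three more at t=60.1 are all accepted in the next one: six requests in 0.2 seconds.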
Sliding Window fixes this by tracking requests across a rolling time period, but requires more memory per user — you need timestamps, not just counters.
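The exact "log" variant stores one timestamp per recent request, which is where the extra memory goes. An illustrative per-user sketch (names are mine, not from any library):

```python
import time
from collections import deque

class SlidingWindowLog:
    """Exact sliding-window limiter: one timestamp per recent allowed request."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.events = deque()   # timestamps of allowed requests, oldest first

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Evict timestamps that have aged out of the rolling window.
        while self.events and now - self.events[0] >= self.window:
            self.events.popleft()
        if len(self.events) < self.limit:
            self.events.append(now)
            return True
        return False
```

Production systems often approximate this with weighted counts from two adjacent fixed windows ("sliding window counter") to avoid storing a timestamp per request.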
Leaky Bucket processes requests at a fixed rate regardless of input burst, acting like a queue with constant drain. Great for smoothing traffic to downstream services.
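A queue-based sketch shows the shape: requests are admitted up to a fixed capacity, and a separate timer releases them at a constant rate. The class name and the timer-driven `drain` hook are illustrative assumptions:

```python
from collections import deque

class LeakyBucket:
    """Queue-based leaky bucket: absorbs bursts up to capacity, drains at a fixed rate."""

    def __init__(self, drain_rate, capacity):
        self.drain_rate = drain_rate   # requests released per second
        self.capacity = capacity       # max queued requests before rejecting
        self.queue = deque()

    def offer(self, request):
        """Enqueue a request, or reject it if the bucket is full."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def drain(self):
        """Called by a timer every 1/drain_rate seconds: release one request FIFO."""
        return self.queue.popleft() if self.queue else None
```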
The Interview Framework
When you get this in an interview, start here:
- Clarify requirements — Per-user? Per-IP? Per-endpoint? What's the scale (1K QPS or 100K)?
- Non-functional requirements — Latency budget (adding 50ms per request is unacceptable), availability requirements, fail-open vs fail-closed policy
- Back-of-envelope math — 10M users × 8 bytes per counter = 80MB of raw counter data. Even allowing for Redis's per-key overhead, it fits comfortably in a single instance.
- High-level architecture — API Gateway → Rate Limit Middleware → Redis counters → Your service
- Deep dive — Distributed coordination, algorithm choice, failure modes
Key insight: The rate limiter sits on the hot path of every request. Anything you add here multiplies across your entire QPS.
The Distributed Problem
Here's where "just use Redis" stops being a complete answer. With multiple API servers, you need coordination:
Centralized (Redis): All servers check/increment the same counter. Atomic operations (INCR + EXPIRE, or Lua scripts) prevent race conditions. Trade-off: every request adds a network hop to Redis.
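The centralized pattern can be sketched with an in-memory stand-in for Redis so the control flow is visible. In production, the increment and expiry must run atomically server-side (for example via a Lua script wrapping INCR and EXPIRE), not as separate client steps; the names below are illustrative:

```python
import time

# In real Redis the two steps below would be one atomic Lua script, e.g.:
#   local n = redis.call('INCR', KEYS[1])
#   if n == 1 then redis.call('EXPIRE', KEYS[1], ARGV[1]) end
#   return n
# A plain dict stands in here so the logic is runnable locally.

class CentralizedCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.store = {}   # key -> (count, expires_at)

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        count, expires_at = self.store.get(key, (0, now + self.window))
        if now >= expires_at:                       # key "expired": fresh window
            count, expires_at = 0, now + self.window
        count += 1                                  # the INCR step
        self.store[key] = (count, expires_at)
        return count <= self.limit
```

Every API server calling `allow` against the same store sees the same counter, which is exactly what the network hop buys you.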
Local + Periodic Sync: Each server counts locally and syncs to a central store periodically. Lower latency, but you lose precision — a user could exceed limits by hitting different servers before sync occurs.
Batched Updates: Buffer several requests worth of counter increments and flush them in batches. Reduces Redis load but introduces a small accuracy window.
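A sketch of the buffering side, assuming a `flush_fn` that pushes the accumulated deltas to the central store in one round trip (for example, a Redis pipeline of INCRBY calls); the class and its parameters are hypothetical:

```python
from collections import defaultdict

class BatchedReporter:
    """Buffers per-key increments locally, flushing to the central store in batches."""

    def __init__(self, flush_fn, batch_size=100):
        self.flush_fn = flush_fn       # sends {key: delta} to the shared store
        self.batch_size = batch_size
        self.pending = defaultdict(int)
        self.buffered = 0

    def record(self, key, n=1):
        self.pending[key] += n
        self.buffered += n
        if self.buffered >= self.batch_size:
            self.flush()

    def flush(self):
        """Also call this on a short timer, so quiet keys still sync promptly."""
        if self.pending:
            self.flush_fn(dict(self.pending))
            self.pending.clear()
            self.buffered = 0
```

The accuracy window mentioned above is exactly the interval between flushes: requests counted locally but not yet flushed are invisible to other servers.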
The thundering herd is real: 100 servers simultaneously incrementing the same Redis key during a traffic spike. Solutions include request coalescing, Redis clustering, and client-side jitter.
Fail-Open vs Fail-Closed
When your rate limiting infrastructure goes down:
- Fail-open: Allow all requests through. Preserves availability but removes protection. Most companies choose this — a brief period without rate limiting is better than dropping all traffic.
- Fail-closed: Reject all requests. Safer against attacks but causes an outage for legitimate users.
The pragmatic approach: fail-open with local fallback counters and circuit breakers. If Redis is unreachable, apply conservative in-memory limits until connectivity recovers.
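The fallback wiring itself is small. A sketch under the assumption that the remote check raises an exception on connectivity failure and that the local limiter applies a tighter in-memory threshold (both callables are placeholders):

```python
class ResilientLimiter:
    """Fail-open wrapper: if the central check errors, fall back to a local limit."""

    def __init__(self, remote_allow, local_allow):
        self.remote_allow = remote_allow   # e.g. a Redis-backed check; may raise
        self.local_allow = local_allow     # conservative in-memory limiter

    def allow(self, key):
        try:
            return self.remote_allow(key)
        except Exception:
            # Central store unreachable: stay available, keep the local safety net.
            return self.local_allow(key)
```

A production version would add a circuit breaker so that, while the store is known-down, requests skip the failing remote call instead of paying a timeout on every one.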
How the Big Players Built It
Stripe uses Redis with sliding window counters and Lua scripts for atomicity. They batch operations and handle billions of API requests with sub-millisecond latency overhead. Their multi-tier approach applies different limits at the API key, IP, and account levels simultaneously.
Cloudflare does rate limiting at the edge network level — blocking bad traffic before it reaches your origin servers. It's like having bouncers at every entrance to the city, not just your building.
GitHub evolved from simple counters to sophisticated abuse detection. They use a sharded, replicated rate limiter in Redis that differentiates between authenticated and unauthenticated requests, with different limits per endpoint category.
Key Takeaways
- Algorithm choice depends on your burst tolerance — Token bucket for most APIs, sliding window when fairness is critical
- Distributed coordination is the hard part — Not the algorithm itself
- Monitor your false positive rate — If legitimate users are getting rate limited, your thresholds are wrong
- Design for failure — Your rate limiter will go down; decide what happens when it does
- Layer your defenses — Edge-level, application-level, and database-level limits each catch different problems
- Don't add latency to the hot path — Async updates, batching, and local caches are your friends