Rate limiting is one of the most important defensive mechanisms in distributed systems. A public-facing API without it can be taken down by a single malicious user, or by one program stuck in an infinite loop. Netflix handles hundreds of millions of requests per day, and Stripe's API processes tens of thousands of payments per second; both stay stable thanks to carefully designed rate limiting.
Rate limiting solves three core problems:
Availability protection: keep a single user from exhausting service resources
Cost control: high-cost operations (LLM API calls, SMS) must be constrained
Fairness: ensure resources are shared fairly among users
Rate limiting is one of the most critical defensive mechanisms in distributed systems. Without it, a single misbehaving client — whether a DDoS attacker or a developer's buggy retry loop — can bring down your entire service. Netflix, Stripe, GitHub, and every major API platform relies on sophisticated rate limiting. This deep dive covers four algorithms with increasing sophistication, a production-grade Redis implementation, and the distributed systems challenges that make this problem genuinely hard.
Part 1: Theory / 理论基础 (5 min)
四大限流算法 / The Four Algorithms
中文:
限流有四种主流算法,每种都有不同的特性和适用场景。
English:
There are four main rate limiting algorithms, each with distinct trade-offs.
Sliding Window Log
Precisely solves the fixed-window boundary problem. Maintain a sorted log of request timestamps per user. On each request, evict timestamps that fall outside the window, then check the count.
Pro: exact. Con: memory-heavy — stores every timestamp for every active user.
User A: [12:00:01, 12:00:23, 12:00:45, 12:00:58] ← 4 requests, all within 1 min window
New request at 12:01:05:
Evict 12:00:01 (>60s ago) → [12:00:23, 12:00:45, 12:00:58, 12:01:05]
Count = 4 → OK (if limit is 5)
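The evict-then-count logic in the example above can be sketched as a small in-memory class. This is a hypothetical version using a deque; a production deployment would typically keep the log in a Redis sorted set instead.

```python
import time
from collections import deque
from typing import Optional

class SlidingWindowLog:
    """Exact rate limiter: keeps one timestamp per admitted request."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()  # Timestamps, oldest first (we only ever append "now")

    def is_allowed(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Evict timestamps that have fallen out of the window
        while self.log and now - self.log[0] >= self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)   # Count this request
            return True
        return False

# Replaying the example from the text (limit 5, 60s window, times as seconds):
rl = SlidingWindowLog(limit=5, window_seconds=60)
for t in (1, 23, 45, 58):          # 12:00:01 ... 12:00:58
    rl.is_allowed(now=t)
print(rl.is_allowed(now=65))       # True: 12:00:01 evicted, 4 requests in window
```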
Leaky Bucket
The mirror image of the token bucket. Requests enter a queue and are processed at a fixed rate. Guarantees smooth output but offers no burst tolerance. A good fit for video streaming or payment processing, where you need steady throughput.
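A minimal sketch of the idea, treating the bucket as a drainable counter rather than a literal queue (an implementation choice of ours, sometimes called a leaky bucket "as a meter"):

```python
import time
from typing import Optional

class LeakyBucket:
    """Pending work 'leaks' out at a fixed rate; overflow is rejected."""

    def __init__(self, capacity: float, leak_rate: float, now: Optional[float] = None):
        self.capacity = capacity        # Max amount of queued work
        self.leak_rate = leak_rate      # Units drained per second
        self.water = 0.0
        self.last = time.monotonic() if now is None else now

    def allow(self, amount: float = 1.0, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Drain whatever leaked since the last call
        self.water = max(0.0, self.water - (now - self.last) * self.leak_rate)
        self.last = now
        if self.water + amount <= self.capacity:
            self.water += amount        # Accept and enqueue
            return True
        return False                    # Bucket full: reject

b = LeakyBucket(capacity=2, leak_rate=1.0, now=0.0)
print(b.allow(now=0.0))   # True
print(b.allow(now=0.0))   # True
print(b.allow(now=0.0))   # False: bucket full
print(b.allow(now=1.0))   # True: one unit leaked out
```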
算法选择指南 / Algorithm Selection Guide
| 场景 / Scenario | 推荐算法 / Algorithm |
| --- | --- |
| 简单 API 限流 / Simple API rate limit | Sliding Window Counter |
| 允许突发 / Burst traffic OK | Token Bucket |
| 需要平滑输出 / Smooth output required | Leaky Bucket |
| 精确计费 / Exact billing | Sliding Window Log |
Part 2: Step-by-Step Implementation / 一步一步实现 (8 min)
中文:
我们来实现两个版本:(1)单机内存版——理解算法核心;(2)分布式 Redis 版——生产可用。
English:
We'll build two versions: (1) in-memory single-node to understand the algorithm; (2) distributed Redis version that's production-ready.
Version 1: In-Memory Token Bucket / 单机令牌桶
import time
import threading
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class TokenBucket:
    capacity: float    # Max tokens (burst limit)
    rate: float        # Tokens added per second
    tokens: float = field(init=False)
    last_refill: float = field(init=False)
    lock: threading.Lock = field(default_factory=threading.Lock, init=False)

    def __post_init__(self):
        self.tokens = self.capacity  # Start full
        self.last_refill = time.monotonic()

    def _refill(self):
        """Add tokens based on elapsed time since last refill."""
        now = time.monotonic()
        elapsed = now - self.last_refill
        # Calculate how many tokens to add
        new_tokens = elapsed * self.rate
        # Cap at capacity (can't overflow the bucket)
        self.tokens = min(self.capacity, self.tokens + new_tokens)
        self.last_refill = now

    def consume(self, tokens: float = 1.0) -> bool:
        """Try to consume `tokens`. Returns True if allowed."""
        with self.lock:
            self._refill()
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True   # Allow request
            return False      # Reject: bucket empty


class RateLimiter:
    """Per-user rate limiter using the token bucket algorithm."""

    def __init__(self, capacity: int = 10, rate: float = 2.0):
        self.capacity = capacity
        self.rate = rate
        self._buckets: Dict[str, TokenBucket] = {}
        self._lock = threading.Lock()

    def _get_bucket(self, user_id: str) -> TokenBucket:
        """Lazily create a bucket for each user."""
        with self._lock:
            if user_id not in self._buckets:
                self._buckets[user_id] = TokenBucket(
                    capacity=self.capacity,
                    rate=self.rate,
                )
            return self._buckets[user_id]

    def is_allowed(self, user_id: str) -> bool:
        return self._get_bucket(user_id).consume()


# Usage
limiter = RateLimiter(capacity=5, rate=1.0)  # 5 burst, 1/sec sustained
for i in range(8):
    allowed = limiter.is_allowed("user_123")
    print(f"Request {i+1}: {'✅ ALLOWED' if allowed else '❌ REJECTED'}")
# Output: first 5 allowed (burst), next 3 rejected (bucket empty)
Version 2: Distributed Sliding Window Counter with Redis / 分布式滑动窗口计数器
Single-node limiters have a fatal flaw: in multi-instance deployments, each instance has its own counters. A user hitting 3 servers with 10 req/server sees 30 effective requests per window. Redis atomic operations solve this.
import time
import redis
from typing import Tuple

class RedisRateLimiter:
    """
    Sliding Window Counter using Redis.
    Key insight: we store TWO counters per user — current window and previous.
    We use the weighted formula to approximate a true sliding window.
    This uses O(1) memory per user, vs O(n) for the log-based approach.
    """

    def __init__(
        self,
        redis_client: redis.Redis,
        limit: int = 100,
        window_seconds: int = 60,
        key_prefix: str = "rl",
    ):
        self.redis = redis_client
        self.limit = limit
        self.window = window_seconds
        self.prefix = key_prefix

    def _get_keys(self, user_id: str) -> Tuple[str, str]:
        """Get Redis keys for the current and previous windows."""
        # Integer division gives us the current window bucket
        current_window = int(time.time()) // self.window
        prev_window = current_window - 1
        curr_key = f"{self.prefix}:{user_id}:{current_window}"
        prev_key = f"{self.prefix}:{user_id}:{prev_window}"
        return curr_key, prev_key

    def is_allowed(self, user_id: str) -> Tuple[bool, int]:
        """
        Check if a request is allowed.
        Returns (allowed: bool, remaining: int).
        """
        curr_key, prev_key = self._get_keys(user_id)
        # Lua script for atomic read-increment-check.
        # Redis executes Lua atomically — no race conditions possible.
        lua_script = """
        local curr_key = KEYS[1]
        local prev_key = KEYS[2]
        local limit = tonumber(ARGV[1])
        local window = tonumber(ARGV[2])
        local now = tonumber(ARGV[3])

        -- Get current counts (default 0)
        local curr_count = tonumber(redis.call('GET', curr_key) or 0)
        local prev_count = tonumber(redis.call('GET', prev_key) or 0)

        -- Calculate what fraction of the current window has elapsed
        local elapsed_in_window = now % window
        local elapsed_ratio = elapsed_in_window / window

        -- Weighted estimate: prev window contributes less as the current window progresses
        local estimated = prev_count * (1 - elapsed_ratio) + curr_count

        if estimated >= limit then
            return {0, 0}  -- Reject: over limit
        end

        -- Increment the current window counter; TTL of 2x window keeps it
        -- alive long enough to serve as "previous" in the next window
        redis.call('INCR', curr_key)
        redis.call('EXPIRE', curr_key, window * 2)

        local remaining = math.floor(limit - estimated - 1)
        return {1, remaining}
        """
        script = self.redis.register_script(lua_script)
        now = time.time()
        result = script(
            keys=[curr_key, prev_key],
            args=[self.limit, self.window, now],
        )
        allowed = bool(result[0])
        remaining = int(result[1])
        return allowed, remaining

    def get_headers(self, user_id: str) -> dict:
        """Return rate limit headers (standard HTTP convention)."""
        curr_key, prev_key = self._get_keys(user_id)
        curr_count = int(self.redis.get(curr_key) or 0)
        return {
            "X-RateLimit-Limit": str(self.limit),
            "X-RateLimit-Remaining": str(max(0, self.limit - curr_count)),
            "X-RateLimit-Reset": str(
                (int(time.time()) // self.window + 1) * self.window
            ),
        }
# FastAPI middleware example
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
redis_client = redis.Redis(host="localhost", port=6379, db=0)
limiter = RedisRateLimiter(redis_client, limit=100, window_seconds=60)

@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    # Extract user identifier (API key, JWT sub, or IP as fallback)
    user_id = request.headers.get("X-API-Key") or request.client.host
    allowed, remaining = limiter.is_allowed(user_id)
    headers = limiter.get_headers(user_id)
    if not allowed:
        # Retry-After expects delta-seconds (or an HTTP-date), not an epoch timestamp
        retry_after = max(0, int(headers["X-RateLimit-Reset"]) - int(time.time()))
        return JSONResponse(
            status_code=429,
            content={
                "error": "rate_limit_exceeded",
                "message": "Too many requests. Please slow down.",
                "retry_after": retry_after,
            },
            headers={**headers, "Retry-After": str(retry_after)},
        )
    response = await call_next(request)
    # Inject rate limit headers into every response (good practice)
    for key, value in headers.items():
        response.headers[key] = value
    return response
Tiered Rate Limiting / 分级限流
中文:
真实系统中通常有多层限流:不同 API 端点有不同限制,不同用户等级有不同配额。
English:
Production systems use tiered limits — different endpoints, different user tiers, different time windows:
# Tiered config — typically loaded from DB or feature flags
RATE_LIMIT_CONFIG = {
    "free": {
        "default": {"limit": 60, "window": 60},        # 60 req/min
        "/api/search": {"limit": 10, "window": 60},    # Search is expensive
        "/api/export": {"limit": 2, "window": 3600},   # 2 exports/hour
    },
    "pro": {
        "default": {"limit": 1000, "window": 60},
        "/api/search": {"limit": 100, "window": 60},
        "/api/export": {"limit": 50, "window": 3600},
    },
    "enterprise": {
        "default": {"limit": 10000, "window": 60},  # Effectively unlimited
    },
}

def get_limit_config(user_tier: str, endpoint: str) -> dict:
    tier_config = RATE_LIMIT_CONFIG.get(user_tier, RATE_LIMIT_CONFIG["free"])
    # Fall back to default if endpoint not specifically configured
    return tier_config.get(endpoint, tier_config["default"])
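A quick sanity check of the fallback chain; the config is restated here in trimmed form so the snippet stands alone:

```python
RATE_LIMIT_CONFIG = {
    "free": {
        "default": {"limit": 60, "window": 60},
        "/api/search": {"limit": 10, "window": 60},
    },
    "pro": {
        "default": {"limit": 1000, "window": 60},
    },
}

def get_limit_config(user_tier: str, endpoint: str) -> dict:
    # Unknown tier -> "free"; unknown endpoint -> tier's "default"
    tier_config = RATE_LIMIT_CONFIG.get(user_tier, RATE_LIMIT_CONFIG["free"])
    return tier_config.get(endpoint, tier_config["default"])

print(get_limit_config("free", "/api/search"))   # {'limit': 10, 'window': 60}
print(get_limit_config("pro", "/api/unknown"))   # {'limit': 1000, 'window': 60}
print(get_limit_config("trial", "/api/search"))  # {'limit': 10, 'window': 60}
```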
Part 3: Edge Cases & Gotchas / 边界情况 (2 min)
1. Race conditions
Using GET → business logic → SET instead of a Redis Lua script creates a race: two requests both read count=99, both decide they may take the 100th slot, and 101 requests end up admitted. **Always use atomic operations: INCR or a Lua script.**
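The interleaving can be made concrete in a few lines of plain Python (a simulation, with local variables standing in for Redis GET/SET):

```python
# Simulate two requests racing a GET -> check -> SET limiter
LIMIT = 100
count = 99                 # 99 requests already admitted; one slot left

# Both requests GET before either one SETs:
read_a = count             # Request A reads 99
read_b = count             # Request B reads 99
admitted = 0
if read_a < LIMIT:         # A: "99 < 100, I'm in"
    count = read_a + 1
    admitted += 1
if read_b < LIMIT:         # B: "99 < 100, I'm in" (stale read!)
    count = read_b + 1     # B's write clobbers A's increment
    admitted += 1

print(count, admitted)     # 100 2 — the counter reads 100, yet 101 total got through
```

Atomic INCR avoids this because each caller receives a distinct post-increment value, so only one of the two could ever see 100.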
2. What happens when Redis goes down?
Two strategies:
Fail open: allow all requests when Redis is unavailable; availability comes first.
Fail closed: reject all requests when Redis is unavailable; safety comes first.
Most API services choose fail open.
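The switch between the two policies is a small wrapper. This sketch assumes a backend exposing a boolean `is_allowed`; it catches `Exception` broadly for the demo, while production code would catch `redis.exceptions.RedisError` specifically:

```python
class FailOpenLimiter:
    """Wrap a limiter; when the backing store errors, apply the chosen policy."""

    def __init__(self, backend, fail_open: bool = True):
        self.backend = backend
        self.fail_open = fail_open

    def is_allowed(self, user_id: str) -> bool:
        try:
            return self.backend.is_allowed(user_id)
        except Exception:
            # Fail open keeps the API up; fail closed keeps it safe
            return self.fail_open

class DownBackend:
    """Stub that simulates a Redis outage."""
    def is_allowed(self, user_id):
        raise ConnectionError("redis unreachable")

print(FailOpenLimiter(DownBackend(), fail_open=True).is_allowed("u1"))   # True
print(FailOpenLimiter(DownBackend(), fail_open=False).is_allowed("u1"))  # False
```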
3. Distributed clock drift
System clocks on different servers can drift out of sync (by hundreds of milliseconds), producing inconsistent window boundaries. Solution: use Redis's TIME command as the single source of time.
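Deriving the window from Redis's clock instead of the local one might look like this; redis-py's `time()` returns a `(seconds, microseconds)` tuple, and the stub below mimics that shape so the snippet runs without a server:

```python
def current_window(redis_client, window_seconds: int) -> int:
    """Compute the window bucket from Redis's clock, not the local one."""
    seconds, _micros = redis_client.time()   # Redis TIME command
    return int(seconds) // window_seconds

class StubRedis:
    """Stand-in exposing the same .time() shape as redis-py."""
    def time(self):
        return (1_700_000_123, 456_789)

print(current_window(StubRedis(), 60))   # 28333335
```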
Part 4: Real-World Case Studies / 真实案例
Netflix: each API route has its own rate limit configuration; Hystrix (and its successor Resilience4j) supplies circuit breakers as a complement to rate limiting.
GitHub API: 60 unauthenticated / 5000 authenticated reqs/hour. Search is separately limited at 10/min — same backend, different resource cost profile.
Stripe: 100 reqs/sec baseline, but billing webhooks use exponential backoff rather than fixed rate limiting to handle payment processing spikes.
OpenAI: Limits on tokens per minute (TPM), not just requests — effectively a token bucket where the "token" unit is an LLM token. They also layer requests per minute (RPM) on top.
Cloudflare Workers: Rate limiting at the edge (CDN PoP level), before requests even reach your origin. Reduces attack surface at sub-millisecond latency with zero origin load.
DoorDash: Uses a combination of per-user limits AND global service limits with adaptive throttling — if downstream services degrade, they automatically tighten rate limits upstream.
Part 5: Interview Simulation / 面试模拟 (3 min)
中文:面试中,限流是系统设计面试的常见子问题。以下是面试官最常问的 5 个追问。
English: Rate limiting appears in nearly every system design interview, either as the main topic or as a component. Here are the 5 most common follow-ups:
Q1: 如果我们有 100 个 API 服务器,怎么保证全局限流准确?
If we have 100 API servers, how do we enforce a global rate limit accurately?
A: Centralized storage (a Redis cluster) is the standard approach. Every API server reads and writes the same Redis, and atomic operations guarantee consistency. If Redis latency is a problem, consider a "local tokens + sync" scheme: each server caches a small batch of tokens (say 10% of the quota) and periodically replenishes from Redis, cutting network round trips. This is a token bucket with a local cache; it trades a little precision for performance.
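A sketch of that local-batch idea. The names are ours, and a plain dict stands in for the shared Redis counter (in real life the claim would be an atomic DECRBY):

```python
class LocalBatchLimiter:
    """Claim tokens from a central store in batches to cut round trips."""

    def __init__(self, central: dict, key: str, batch_size: int = 10):
        self.central = central        # Shared quota (stand-in for Redis)
        self.key = key
        self.batch = batch_size
        self.local_tokens = 0

    def _refill_from_central(self) -> None:
        # One "network" round trip claims a whole batch at once
        available = self.central.get(self.key, 0)
        claim = min(self.batch, available)
        self.central[self.key] = available - claim
        self.local_tokens += claim

    def is_allowed(self) -> bool:
        if self.local_tokens == 0:
            self._refill_from_central()
        if self.local_tokens > 0:
            self.local_tokens -= 1    # Serve from the local cache: no round trip
            return True
        return False

central = {"user_123": 25}            # 25 tokens left this window, globally
lim = LocalBatchLimiter(central, "user_123", batch_size=10)
allowed = sum(lim.is_allowed() for _ in range(30))
print(allowed)                        # 25: batches of 10, 10, 5, then rejections
```

The precision loss shows up when several servers each hold an unspent local batch: the global count can briefly exceed what a single central counter would allow.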
Q2: 怎么防止用户伪造 IP 绕过限流?
How do you prevent IP spoofing to bypass rate limits?
A: IP-based limiting should be a last resort. Prefer authenticated identifiers (API keys, JWT user IDs). For unauthenticated endpoints, trust only the last hop of X-Forwarded-For that your own infrastructure controls (e.g., the IP your load balancer added). Never trust client-supplied IPs blindly. For high-security scenarios, layer fingerprinting (User-Agent + TLS fingerprint) on top.
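The "trust only your own hops" rule can be written as a small parser. The function name and signature are ours, for illustration:

```python
def client_ip(xff_header: str, trusted_proxies: int = 1) -> str:
    """
    Pick the client IP from X-Forwarded-For, trusting only the last
    `trusted_proxies` entries (the ones our own infrastructure appended).
    Everything further left is client-controlled and may be forged.
    """
    hops = [h.strip() for h in xff_header.split(",") if h.strip()]
    if len(hops) < trusted_proxies:
        raise ValueError("fewer hops than trusted proxies; header is suspect")
    # Each trusted proxy appended the IP of the peer that connected to it,
    # so counting from the right skips exactly the forgeable entries.
    return hops[-trusted_proxies]

# Attacker sends a forged prefix; our single LB appends the real peer IP:
print(client_ip("1.2.3.4, 10.0.0.9, 203.0.113.7", trusted_proxies=1))
# 203.0.113.7 — the forged 1.2.3.4 is ignored
```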
Q3: 限流和熔断器(Circuit Breaker)有什么区别?
What's the difference between rate limiting and a circuit breaker?
A: Rate limiting protects your service from excessive client requests (inbound). Circuit breakers protect your service from failing downstream dependencies (outbound). Rate limiter rejects requests at the door; circuit breaker stops you from calling a service that's already down. They're complementary: use both. Netflix Hystrix (and its successor Resilience4j) implements circuit breakers; Redis or API Gateway handles rate limiting.
Q4: 如何对 "代价不同" 的 API 操作做限流?比如写操作比读操作贵得多。
How do you rate limit operations with different "costs"? Writes vs reads, for example.
A: Use weighted token consumption. Instead of consume(1), writes call consume(5). This is exactly how OpenAI handles LLM tokens — a 4K-token completion costs 4000 units from the TPM bucket, while a 100-token prompt costs 100. Define a "cost unit" (e.g., 1 unit = 1 DB read), assign costs to each operation, and run one token bucket per user with a capacity in cost units rather than request counts.
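Weighted consumption is just a cost table in front of a token bucket. A compact sketch, with illustrative cost values of our choosing:

```python
import time
from typing import Optional

class WeightedLimiter:
    """Token bucket where each operation type burns a different number of units."""

    # Illustrative cost table: 1 unit = roughly one DB read
    COSTS = {"read": 1, "search": 5, "write": 5, "export": 50}

    def __init__(self, capacity: float, rate: float, now: Optional[float] = None):
        self.capacity = capacity              # Budget in cost units, not requests
        self.rate = rate                      # Units regenerated per second
        self.tokens = capacity                # Start full
        self.last = time.monotonic() if now is None else now

    def allow(self, op: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        cost = self.COSTS[op]
        if self.tokens >= cost:
            self.tokens -= cost               # Weighted consume
            return True
        return False

lim = WeightedLimiter(capacity=100, rate=10, now=0)
print(lim.allow("export", now=0))   # True: burns 50 of 100 units
print(lim.allow("export", now=0))   # True: burns the remaining 50
print(lim.allow("read", now=0))     # False: bucket is empty
print(lim.allow("read", now=1))     # True: 10 units regenerated in 1s
```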
Q5: 用户触发限流了,怎么给他们好的体验?
What's the user experience when a rate limit is hit?
A: Return HTTP 429 with three key pieces of info: (1) a Retry-After header with the exact seconds until the window resets; (2) X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset on every response so clients can self-throttle before hitting the limit; (3) a clear error body with a link to your rate limit docs. Pro tip: if the window has effectively reset by the time you respond, Retry-After: 0 signals that the client can retry immediately. Client libraries like the official Stripe SDK and OpenAI Python client parse these headers and auto-retry with backoff.
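The client side of that contract can be sketched with an injectable transport, so the snippet runs without a network. The function and the `send`/`sleep` parameters are ours, not from any SDK:

```python
import time

def call_with_retry(send, max_attempts: int = 3, sleep=time.sleep):
    """
    Call `send()` (returns (status_code, headers, body)); on HTTP 429,
    honor Retry-After before retrying, falling back to exponential backoff.
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        # Retry-After is delta-seconds; default to 2^attempt if absent
        delay = float(headers.get("Retry-After", 2 ** attempt))
        sleep(delay)
    return status, body   # Give up: surface the final 429 to the caller

# Simulated server: two 429s, then success
responses = iter([
    (429, {"Retry-After": "1"}, None),
    (429, {"Retry-After": "2"}, None),
    (200, {}, "ok"),
])
slept = []
status, body = call_with_retry(lambda: next(responses), sleep=slept.append)
print(status, body, slept)   # 200 ok [1.0, 2.0]
```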
🔗 References / 参考资料
📄 System Design Primer — Rate Limiting: https://github.com/donnemartin/system-design-primer
📄 Stripe Engineering Blog — Rate Limiters: https://stripe.com/blog/rate-limiters
📄 Cloudflare — How We Built Rate Limiting: https://blog.cloudflare.com/counting-things-a-lot-of-different-things/
📄 Redis Official Docs — INCR pattern: https://redis.io/docs/latest/commands/incr/
📄 RFC 6585 — HTTP 429 Too Many Requests: https://tools.ietf.org/html/rfc6585