Message queues are the circulatory system of modern distributed architecture. They allow services to communicate asynchronously — decoupling producers from consumers so each can scale, fail, and recover independently. If you've ever placed an Amazon order and received a confirmation email seconds later without the checkout page hanging, you've experienced event-driven architecture in action.
Why it matters in interviews: This topic appears in ~80% of senior-level system design interviews. "Design Twitter", "Design Uber", "Design a notification system" — all roads lead to message queues.
In a synchronous world, the order service would call inventory → payment → warehouse → email → loyalty service, all in sequence. This creates:
Temporal coupling: All services must be up at the same time
Performance coupling: The slowest service determines the total latency
Failure coupling: One failure cascades through the whole chain
Message queues break all three couplings.
核心概念 / Core Concepts
中文 → English glossary:
| 中文 | English | Description |
| --- | --- | --- |
| 生产者 | Producer | The service that sends messages |
| 消费者 | Consumer | The service that receives messages |
| 消息 | Message / Event | The unit of data being passed |
| 队列 | Queue | Point-to-point: each message is consumed by exactly one consumer |
| 主题 | Topic | Publish/subscribe: one message is consumed by multiple consumers |
| 消费者组 | Consumer Group | Multiple consumer instances share the work of consuming one topic |
| 偏移量 | Offset (Kafka) | A message's position within a partition; consumers track their own |
| 确认 | Acknowledgment (ACK) | The consumer tells the queue "I processed this successfully" |
| 死信队列 | Dead Letter Queue (DLQ) | The final destination for messages that repeatedly fail processing |
两种核心模型 / Two Models
Model 1: Point-to-Point (Queue)
Producer → [Queue] → Consumer A (Consumer B never sees this message)
Each message is consumed exactly once
Best for: task distribution, work queues
Examples: AWS SQS, RabbitMQ (default)
Model 2: Publish-Subscribe (Topic)
Producer → [Topic] → Consumer A
→ Consumer B
→ Consumer C (all three get the same message)
Every message is delivered to all subscribers
Best for: event notifications, data pipelines
Examples: Apache Kafka, AWS SNS, Google Pub/Sub
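The difference between the two models fits in a few lines of plain Python. This is an in-memory toy, not a real broker; the class names are made up for illustration:

```python
from collections import deque

class PointToPointQueue:
    """Toy queue: each message is delivered to exactly one consumer."""
    def __init__(self):
        self.messages = deque()

    def send(self, msg):
        self.messages.append(msg)

    def receive(self):
        # First consumer to poll wins; the message is then gone.
        return self.messages.popleft() if self.messages else None

class Topic:
    """Toy topic: every subscriber gets its own copy of each message."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, msg):
        for handler in self.subscribers:
            handler(msg)

# Point-to-point: only one consumer sees "order-1"
q = PointToPointQueue()
q.send("order-1")
a, b = q.receive(), q.receive()   # a gets the message, b gets None

# Pub/sub: both consumers see "order-1"
seen_a, seen_b = [], []
t = Topic()
t.subscribe(seen_a.append)
t.subscribe(seen_b.append)
t.publish("order-1")              # both lists receive a copy
```

The asymmetry is the whole point: a queue distributes work, a topic fans out events.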
主流技术对比 / Technology Comparison
The comparison interviewers ask about most often is Kafka vs RabbitMQ. Remember the core differences:
| Feature | Apache Kafka | RabbitMQ / AWS SQS |
| --- | --- | --- |
| Model | Log-based (append-only log) | Queue-based (deleted after consumption) |
| Message retention | Time/size-based (replayable) | Deleted after consumption (default) |
| Throughput | Very high (millions/sec) | High (10K-100K/sec) |
| Ordering guarantee | Ordered within a partition | Ordered within a queue |
| Best for | Stream processing, logs, event sourcing | Task queues, RPC, complex routing |
| Consumption model | Pull (consumers fetch) | RabbitMQ pushes to consumers; SQS consumers poll |
The key insight: Kafka is a distributed log, not a traditional queue. Messages stay on disk until a retention policy removes them. Consumers maintain their own offsets, enabling:
Replay: Reprocess all events from the beginning
Multiple independent consumers: Each consumer group gets full history
Event sourcing: The log IS the source of truth
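A toy model makes this concrete: the broker is just an append-only list, and each consumer group tracks its own read position. This is purely illustrative and uses no real Kafka API:

```python
class ToyLog:
    """Append-only log; messages are never removed on read."""
    def __init__(self):
        self.entries = []
        self.offsets = {}   # consumer group -> next offset to read

    def append(self, msg):
        self.entries.append(msg)

    def poll(self, group):
        offset = self.offsets.get(group, 0)
        batch = self.entries[offset:]
        self.offsets[group] = len(self.entries)   # "commit" the new offset
        return batch

    def seek_to_beginning(self, group):
        # Replay: rewind this group's offset to 0; the data is still on the log
        self.offsets[group] = 0

log = ToyLog()
for e in ["order_placed", "payment_ok", "shipped"]:
    log.append(e)

analytics = log.poll("analytics")   # gets all 3 events
email = log.poll("email")           # independent group: also gets all 3

log.seek_to_beginning("analytics")
replayed = log.poll("analytics")    # full history again (replay)
```

Because reading never deletes, "email" and "analytics" consume independently, and rewinding an offset is all it takes to replay history.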
Part 2: Step-by-Step Implementation / 一步一步实现 (8 min)
场景:设计一个通知系统 / Scenario: Notification System Design
Design a notification system for an e-commerce platform that sends email, SMS, and push notifications after key events (order placed, payment failed, shipment update).
English: At-least-once delivery means the same message can arrive twice (e.g., consumer crashed after processing but before committing offset). Always design consumers to be idempotent.
```python
# idempotency.py — Using Redis to track processed events
import logging
from dataclasses import dataclass

import redis

logger = logging.getLogger(__name__)

@dataclass
class NotificationEvent:
    event_id: str
    # other fields (channel, recipient, payload) omitted

class IdempotentNotificationHandler:
    def __init__(self):
        self.redis = redis.Redis(host="localhost", port=6379, db=0)
        self.TTL = 86400  # 24 hours — enough to catch duplicates

    def process_event(self, event: NotificationEvent) -> bool:
        idempotency_key = f"processed:{event.event_id}"
        # SET NX = "set if not exists" — atomic check-and-set
        was_new = self.redis.set(
            idempotency_key,
            "1",
            ex=self.TTL,
            nx=True,  # Only set if key doesn't exist
        )
        if not was_new:
            # This event was already processed — skip it
            logger.info(f"Duplicate event {event.event_id} — skipping")
            return True
        # First time seeing this event — process it
        return self._do_send_notification(event)
```
When consumers join/leave a group, Kafka triggers rebalancing — all consumption pauses. Use cooperative rebalancing (partition.assignment.strategy=CooperativeStickyAssignor) to minimize disruption.
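Assuming the confluent-kafka Python client (backed by librdkafka), where the equivalent setting value is the string `cooperative-sticky`, the consumer config looks roughly like this; the broker address and group id are placeholders:

```python
from confluent_kafka import Consumer

# cooperative-sticky keeps most partition assignments stable while members
# join or leave, so only the moved partitions pause, not the whole group.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "notification-workers",
    "partition.assignment.strategy": "cooperative-sticky",
    "enable.auto.commit": False,   # commit manually after processing
})
```

In the Java client the same property takes the `CooperativeStickyAssignor` class name instead of a string alias.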
Whenever your design has one service that needs to trigger multiple downstream services, or you need async processing, buffering, or decoupling — reach for a message queue.
Part 5: Interview Simulation / 面试模拟 (3 min)
The five most common follow-up questions, with concise answer strategies:
Q1: 如何保证消息不丢失?/ How do you guarantee no message loss?
Producer side: Set acks=all (wait for all in-sync replicas) + enable_idempotence=True.
Broker side: Set min.insync.replicas=2 to require at least 2 replicas to acknowledge.
Consumer side: Disable auto-commit; manually commit only after successful processing.
Result: At-least-once delivery. Accept that duplicates can happen, design consumers to be idempotent.
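As a sketch of the producer side (confluent-kafka client assumed; topic name and broker address are placeholders), the answer above translates to:

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "acks": "all",               # wait for all in-sync replicas
    "enable.idempotence": True,  # broker dedupes producer retries
})

def on_delivery(err, msg):
    if err is not None:
        # Delivery failed even after retries: log and alert, never ignore
        print(f"delivery failed: {err}")

producer.produce("orders", value=b'{"order_id": 42}', callback=on_delivery)
producer.flush()   # block until delivery reports arrive
```

Note that `min.insync.replicas=2` is a broker/topic-level setting, configured on the topic rather than in the client.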
Q2: Kafka vs SQS,你会怎么选?/ Kafka vs SQS, how do you choose?
- Choose Kafka when: you need message replay, multiple independent consumers, very high throughput (>100K/s), stream processing, or event sourcing.
- Choose SQS when: you want fully managed simplicity, visibility timeout semantics, built-in DLQ, and don't need replay or complex routing. Great for task queues.
- Rule of thumb: SQS for task queues, Kafka for data pipelines and event streams.
Q3: 如何处理消费者崩溃?/ What happens when a consumer crashes?
With auto-commit disabled (enable_auto_commit=False) and offsets committed manually after processing, the failure mode is:
1. Consumer crashes after processing but before committing → the same message is redelivered to the next consumer in the group
This is why idempotency is non-negotiable. Use Redis or a DB to track processed_event_ids.
Q4: 消息队列如何帮助削峰填谷?/ How does a queue help with traffic spikes?
Queue acts as a buffer. During Black Friday, order service produces 100K orders/minute. Payment service can only process 10K/minute. Without a queue, payment service crashes under load. With a queue, orders buffer up; payment service consumes at its own pace. Users see "order received" immediately, payment confirms asynchronously. The queue absorbs the spike.
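The buffering effect is easy to simulate: a producer that bursts faster than the consumer can drain simply grows the queue, and nothing crashes (toy numbers, scaled down from the Black Friday figures):

```python
from collections import deque

queue = deque()
backlog = []

# Spike: 100 orders arrive per tick, but the consumer handles only 10 per tick
for tick in range(5):
    for i in range(100):
        queue.append(f"order-{tick}-{i}")
    for _ in range(min(10, len(queue))):
        queue.popleft()
    backlog.append(len(queue))

# The backlog grows by 90 per tick; once the spike ends, the consumer
# drains it at its own pace instead of falling over under peak load.
```

In a real system you would monitor that backlog (consumer lag) and alert when it keeps growing instead of draining.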
Q5: 如果消费者处理一直失败怎么办?/ What if a message keeps failing?
Dead Letter Queue (DLQ) pattern:
1. Set max_delivery_attempts = 3 (or retry in consumer code)
2. After N failures, move message to DLQ topic
3. DLQ messages trigger alerts to on-call engineer
4. Engineer investigates, fixes the bug, then replays DLQ messages back to the original topic
Never silently drop messages. Always have a DLQ.
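The retry-then-DLQ flow above can be sketched without any broker; the handler and data structures here are illustrative:

```python
MAX_DELIVERY_ATTEMPTS = 3
dead_letter_queue = []

def process_with_dlq(message, handler):
    """Try the handler up to MAX_DELIVERY_ATTEMPTS times, then dead-letter."""
    for attempt in range(1, MAX_DELIVERY_ATTEMPTS + 1):
        try:
            handler(message)
            return True
        except Exception as exc:
            last_error = exc
    # All retries exhausted: park the message for a human, never drop it
    dead_letter_queue.append({"message": message, "error": str(last_error)})
    return False

def always_fails(msg):
    raise ValueError("downstream service rejected payload")

ok = process_with_dlq({"order_id": 7}, always_fails)
# ok is False; the message now sits in dead_letter_queue for investigation
```

In production the DLQ is a real topic/queue with alerting attached, and replay means re-publishing its messages to the original topic after the fix ships.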
Next Saturday we'll go deeper on Kafka internals: partitions, replication, and the ISR mechanism. 🚀