🏗️ 系统设计 Day 2 / System Design Day 2

Topic: DNS, IP, and TCP/UDP — 互联网的"电话本"与"快递公司"


场景引入 / Scenario

想象你在设计一个全球用户访问的网站。你写了 https://myblog.com,浏览器是怎么找到你服务器的?从你按下 Enter 到页面出现,背后发生了什么魔法?

Imagine you're building a website for global users. You type https://myblog.com — how does your browser find your server? What magic happens between pressing Enter and seeing the page?


DNS:互联网的电话本 / DNS: The Internet's Phone Book

人类记得 google.com,机器只认识 142.250.80.46。DNS(Domain Name System)就是把"人话"翻译成"机器话"的翻译官。

Humans remember google.com, machines only understand 142.250.80.46. DNS translates human-readable names into machine-readable IPs.

DNS 查询流程 / DNS Resolution Flow

真实类比 / Real-world analogy:

根服务器 = 全国电话总机 → TLD 服务器 = 城市区号本 → 权威服务器 = 某公司的直线
*Root server = National operator → TLD = City directory → Authoritative = Company's direct line*
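The resolution chain above can be exercised from Python's standard library; a minimal sketch using the OS resolver (which walks root → TLD → authoritative, with caching at each layer). Results depend on your network; `"localhost"` is used here only because it resolves everywhere:

```python
import socket

def resolve_ipv4(hostname: str) -> list[str]:
    # Ask the OS resolver for the IPv4 addresses behind a hostname.
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    # Each entry is (family, type, proto, canonname, (ip, port)).
    return sorted({info[4][0] for info in infos})

print(resolve_ipv4("localhost"))  # typically ['127.0.0.1']
```

Swap in any real domain to see the public IPs your resolver (and its caches) currently return.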

IP:你在网络上的"门牌号" / IP: Your Network "Address"

  • IPv4: 203.0.113.42 — 32位,约43亿个地址,已经快用完了
  • IPv6: 2001:0db8:85a3::8a2e:0370:7334 — 128位,几乎无限

IPv4 is 32-bit (~4.3 billion addresses, nearly exhausted). IPv6 is 128-bit, essentially unlimited.

公网 vs 私网 / Public vs Private IP:

家庭网络 / Home Network:

你的电脑 → 192.168.1.100 (私网/Private)
你的手机 → 192.168.1.101 (私网/Private)
路由器对外 → 203.0.113.42 (公网/Public) ← 互联网只看到这个 / Internet only sees this

NAT(网络地址转换)帮你把私网地址映射到公网 / NAT (Network Address Translation) maps private to public
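A quick way to check which side of the NAT boundary an address sits on is Python's stdlib `ipaddress` module, which knows the RFC 1918 private ranges (192.168/16, 10/8, 172.16/12):

```python
import ipaddress

# Private addresses are what NAT hides; public ones are routable on the internet.
for ip in ["192.168.1.100", "10.0.0.7", "8.8.8.8"]:
    label = "private" if ipaddress.ip_address(ip).is_private else "public"
    print(ip, label)
```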

TCP vs UDP:快递公司 vs 广播电台

| 特性 / Feature | TCP | UDP |
| 连接方式 / Connection | 三次握手 / 3-way handshake | 无连接 / Connectionless |
| 可靠性 / Reliability | ✅ 保证送达 / Guaranteed | ❌ 尽力而为 / Best-effort |
| 顺序 / Order | ✅ 有序 / Ordered | ❌ 可能乱序 / May arrive out of order |
| 速度 / Speed | 较慢 / Slower | 更快 / Faster |
| 适用场景 / Use cases | HTTP, Email, File transfer | Video streaming, Gaming, DNS |

TCP 三次握手 / TCP 3-Way Handshake

为什么需要3次?/ Why 3 handshakes?

2次不够——服务器无法确认客户端收到了回复。就像打电话:"喂?" "喂,听到了吗?" "听到了,开始说吧。"

2 isn't enough — the server can't confirm the client received its reply. Like a phone call: "Hello?" "Hello, can you hear me?" "Yes, go ahead."
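The contrast is visible directly in Python's socket API; a loopback sketch (the kernel performs the 3-way handshake inside `connect()`, while the UDP send succeeds even though nothing is listening on the destination port):

```python
import socket

# TCP: connect() blocks until SYN → SYN-ACK → ACK completes.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))  # the handshake happens here
conn, _ = server.accept()
client.sendall(b"hello over tcp")
data = conn.recv(1024)               # delivery and order are guaranteed
print(data)

# UDP: no handshake, no delivery guarantee; sendto() returns immediately
# even though nobody is listening on the destination port.
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"fire and forget", ("127.0.0.1", 9))

for s in (client, conn, server, udp):
    s.close()
```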


为什么这样设计?/ Why This Design?

DNS 分层设计的好处 / Benefits of hierarchical DNS:

  • 可扩展性: 逻辑上只有 13 组根服务器(通过 anycast 部署了数百个实例),却支撑全球数十亿个域名
  • 缓存: 每层都可以缓存,减少重复查询
  • 容错: 多个根服务器,一个挂了其他继续工作

Scalability (just 13 logical root servers, replicated worldwide via anycast, handle billions of domains via delegation), caching at every layer, and fault tolerance through redundancy.


别踩这个坑 / Don't Fall Into This Trap

坑1: DNS 缓存污染面试题

面试问:"为什么我改了 DNS 记录,但用户还是访问旧服务器?"

答:TTL(Time To Live)没过期。DNS 记录有缓存时间,改了之后要等 TTL 归零才会全面生效。上线前提前降低 TTL!

DNS cache: after changing DNS records, users still hit old servers until TTL expires. Best practice: lower TTL hours before a migration.
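The TTL behavior can be sketched as a tiny cache (the names here are hypothetical; real resolvers do exactly this at every layer of the hierarchy):

```python
import time

_cache: dict = {}  # domain -> (ip, expiry timestamp)

def cached_resolve(domain: str, upstream, ttl: float = 300.0) -> str:
    """Serve a cached answer until its TTL expires, then re-query upstream."""
    now = time.monotonic()
    entry = _cache.get(domain)
    if entry and now < entry[1]:
        return entry[0]              # still fresh: the old IP keeps being served
    ip = upstream(domain)            # expired (or never cached): fresh lookup
    _cache[domain] = (ip, now + ttl)
    return ip

# Even after the "real" record changes, the cache answers with the old IP
# until the TTL runs out:
ip1 = cached_resolve("example.test", lambda d: "1.1.1.1", ttl=60)
ip2 = cached_resolve("example.test", lambda d: "2.2.2.2", ttl=60)
print(ip1, ip2)  # 1.1.1.1 1.1.1.1 (the second lookup never reaches upstream)
```

Lowering `ttl` before a migration shrinks the window in which users see the stale answer.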

坑2: TCP 不等于安全

TCP 保证送达,但不加密。http:// 用 TCP,但数据是明文。需要加密要用 TLS(即 https://)。

TCP guarantees delivery, not security. HTTP over TCP is plaintext. TLS (HTTPS) is needed for encryption.


关键要点 / Key Takeaways

  1. DNS = 域名 → IP 的翻译,分层设计,有缓存
  2. IPv4 快用完了,IPv6 是未来
  3. TCP = 可靠但慢(文件、网页);UDP = 快但不可靠(直播、游戏)
  4. 三次握手确保双向通信可靠建立

DNS translates domains to IPs with hierarchical caching. IPv4 is nearly exhausted, IPv6 is the future. TCP = reliable but slower; UDP = fast but lossy. 3-way handshake ensures both ends can send and receive.


*Day 2 of 100 | #ByteByByte | 系统设计基础系列*
💻 算法 Day 2 / Algorithms Day 2

#242 Valid Anagram(有效的字母异位词)— Easy | Pattern: Arrays & Hashing


生活类比 / Real-World Analogy

想象你有两袋乐高积木。不管你怎么排列,只要袋子里的积木种类和数量完全一样,就是"字母异位词"。我们要做的,就是数清楚每个袋子里有什么积木。

Imagine two bags of Lego pieces. As long as both bags contain the exact same types and counts of pieces — no matter how they're arranged — they're anagrams. Our job: count the pieces in each bag and compare.


题目 / Problem Statement

中文: 给定两个字符串 s 和 t,判断 t 是否是 s 的字母异位词(即用完全相同的字母,重新排列而成)。

English: Given two strings s and t, return true if t is an anagram of s, and false otherwise. An anagram uses the same characters with the same frequencies.

Input: s = "anagram", t = "nagaram" → Output: True
Input: s = "rat", t = "car" → Output: False

解题思路 / Step-by-Step Walkthrough

核心想法 / Core Idea:

字母异位词 = 每个字符出现的次数完全相同。

Anagrams have identical character frequency distributions.

方法: 哈希表计数 / Method: Hash Map Counting

用一个字典,遍历 s 时 +1,遍历 t 时 -1。最后所有值都是 0 → 是异位词。

Use one dictionary: +1 for each char in s, -1 for each char in t. If all values are 0 → anagram.

具体追踪 / Concrete Trace

s = "anagram", t = "nagaram"

遍历 s (+1):

'a' → count['a'] = 1
'n' → count['n'] = 1
'a' → count['a'] = 2
'g' → count['g'] = 1
'r' → count['r'] = 1
'a' → count['a'] = 3
'm' → count['m'] = 1

状态 / State: {'a':3, 'n':1, 'g':1, 'r':1, 'm':1}

遍历 t (-1):

'n' → count['n'] = 0
'a' → count['a'] = 2
'g' → count['g'] = 0
'a' → count['a'] = 1
'r' → count['r'] = 0
'a' → count['a'] = 0
'm' → count['m'] = 0

状态 / State: {'a':0, 'n':0, 'g':0, 'r':0, 'm':0}

所有值为 0 → True ✅


Python 解法 / Python Solution

from collections import defaultdict

def isAnagram(s: str, t: str) -> bool:
    # Quick check: different lengths can't be anagrams
    # 长度不同直接排除
    if len(s) != len(t):
        return False
    
    # Count character frequencies
    # 统计每个字符出现的频率
    count = defaultdict(int)
    
    # +1 for every char in s
    for char in s:
        count[char] += 1
    
    # -1 for every char in t
    for char in t:
        count[char] -= 1
    
    # If all zeros, they have the same characters
    # 所有计数为零,说明字符完全匹配
    return all(v == 0 for v in count.values())


# 更 Pythonic 的写法 / More Pythonic version:
from collections import Counter

def isAnagram_v2(s: str, t: str) -> bool:
    return Counter(s) == Counter(t)

复杂度分析 / Complexity Analysis

| 复杂度 / Complexity | 结果 / Result | 说明 / Explanation |
| 时间 / Time | O(n) | n = len(s),遍历两次 / two passes |
| 空间 / Space | O(k) | k = 字符集大小,最多 26 个字母 / at most 26 letters |

为什么不排序?/ Why not sort?

排序是 O(n log n),哈希表是 O(n),更快。面试时提出这个比较能加分。

Sorting is O(n log n) vs O(n) for hash map. Always worth mentioning this tradeoff in interviews.
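For comparison, the sorting approach the text mentions is a correct one-liner, just asymptotically slower:

```python
def isAnagram_sorted(s: str, t: str) -> bool:
    # Anagrams sort to the identical character sequence: O(n log n) time.
    return sorted(s) == sorted(t)

print(isAnagram_sorted("anagram", "nagaram"))  # True
print(isAnagram_sorted("rat", "car"))          # False
```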


边界情况 / Edge Cases

isAnagram("a", "a")     # True  — single char match
isAnagram("a", "b")     # False — single char mismatch
isAnagram("", "")       # True  — both empty (Counter({}) == Counter({}))
isAnagram("ab", "a")    # False — length check catches this early
isAnagram("aa", "bb")   # False — same length, different chars

举一反三 / Pattern Recognition

这道题的核心模式:用哈希表统计频率,再比较。以下题目用同一个模式:

Core pattern: use a hash map to count frequencies, then compare. Same pattern appears in:

| 题目 / Problem | 变化 / Twist |
| #49 Group Anagrams | 把所有互为异位词的字符串分组 / group all anagrams together |
| #438 Find All Anagrams in a String | 滑动窗口找所有异位词位置 / sliding window to find positions |
| #383 Ransom Note | 一个字符串能否由另一个构成 / can s be built from t's chars |

进阶思考 / Follow-up:

如果字符串包含 Unicode(中文、emoji)怎么办?用 Counter 依然有效,因为它能统计任何 hashable 的字符。

What if strings contain Unicode (Chinese, emoji)? Counter still works — it handles any hashable character.
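A quick check that the Counter version is agnostic to the character set (Chinese characters and emoji are just hashable code points to Python). Note that with Unicode the O(k) space bound is no longer capped at 26:

```python
from collections import Counter

def isAnagram_v2(s: str, t: str) -> bool:
    # Works for any characters, not just a-z: Counter keys on code points.
    return Counter(s) == Counter(t)

print(isAnagram_v2("猫狗猫", "狗猫猫"))  # True
print(isAnagram_v2("🙂🙃", "🙃🙂"))     # True
print(isAnagram_v2("猫狗", "猫猫"))     # False
```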


*Day 2 of 100 | #ByteByByte | Arrays & Hashing 系列*

🗣️ 软技能 Day 2 / Soft Skills Day 2

Topic: 没有直接权力如何影响他人 / Influencing Without Authority


为什么这很重要 / Why This Matters

在大公司里,真正的工作不是独自完成任务,而是让别人帮你完成任务——在你没有权力命令他们的情况下。这是 Senior 和 Staff 工程师的核心技能,也是最常被问到的行为面试题之一。

In big tech, the real job isn't doing work alone — it's getting others to do work without being able to order them. This is the core skill separating senior from staff engineers, and one of the most common behavioral interview questions.


经典面试题 / Classic Question

"描述一次你需要在没有直接管理权限的情况下影响他人的经历。"
"Describe a situation where you had to influence others without having direct authority."

STAR 框架拆解 / STAR Framework Breakdown

S — Situation(情境)

设定背景:你在哪个团队,影响的是谁(另一个团队、高级别同事、跨职能合作者),以及为什么他们没有义务听你的。

Set the scene: which team, who you needed to influence (cross-team, senior colleague, partner org), and crucially — why they had no obligation to listen to you.

T — Task(任务)

你的目标是什么?为什么这件事很重要?影响失败会有什么后果?

What was your goal? Why did it matter? What was at stake if influence failed?

A — Action(行动) ← 这是重点 / This is where you shine

Senior 级别要展示的不是"我很能说服人",而是系统性的影响力策略

  1. 理解对方的激励机制 — 他们关心什么?什么对他们有利?
  2. 用数据说话 — 不是"我觉得应该这样",而是"数据显示这会影响 X% 用户"
  3. 建立盟友 — 先和愿意接受的人对齐,再扩大影响范围
  4. 给对方一个赢的理由 — frame 成对他们也有好处,不是你求他们帮忙

Don't just say "I convinced them." Show a systematic influence strategy: understand their incentives, use data, build allies, frame it as their win too.

R — Result(结果)

量化结果。不只是"成功了",而是"减少了 X 毫秒延迟"、"帮助团队提前 2 周上线"。

Quantify outcomes. Not "it worked" but "reduced P99 latency by 40ms" or "helped the team ship 2 weeks early."


❌ 坏回答 vs ✅ 好回答 / Bad vs Good Answer

❌ 坏回答:

"我们的设计文档需要安全团队 review,但他们很忙不想做。我就发了很多邮件催他们,最后他们 review 了。"

问题:被动、无策略、显示影响力只是"坚持催",没有展示任何高级技能。

Bad: "The security team was busy, so I kept emailing until they reviewed it." — Passive, no strategy, just persistence.


✅ 好回答:

"我需要安全团队 review 一个涉及用户数据的新功能,但他们的 Q4 排期已经满了,而我们的 launch date 是固定的。我没有直接汇报关系。
我先花时间了解了安全团队的 OKR——他们那季度的目标之一是'减少高风险数据暴露事件'。我把这个功能的 review 包装成他们达成 OKR 的机会,而不是额外负担。
我准备了一份 1 页的风险摘要,聚焦于如果不 review 可能的合规风险,并提议缩小 review 范围(只看数据流部分,而不是整个 PR)来降低他们的时间成本。
同时,我找到了一位之前和安全团队合作过的 Staff Engineer,请他帮我引荐,建立了初始信任。
最终安全团队在 3 天内完成了 review,我们按时上线,功能还因为 review 过程发现并修复了一个边界条件。"

Good: Understood their OKRs, reframed as their win, reduced their cost, used a warm introduction to build trust. Showed systematic strategy.


场景模板 / Scenario Template to Adapt

情境: 我需要 [另一个团队/高级工程师/PM] 在 [时间节点] 前完成 [X],
     但他们没有义务优先处理我的需求。

策略:
  1. 我研究了他们的 [优先级/OKR/痛点]
  2. 我将需求包装成对他们的 [利益/风险规避/认可机会]
  3. 我降低了他们的参与成本,通过 [缩小范围/提供草稿/async 方式]
  4. 我通过 [共同认识的人/过去的合作] 建立了信任基础

结果: [量化的结果]

Senior/Staff 级别加分项 / Senior/Staff Level Tips

  1. 展示系统性思维,而不是一次性技巧。 Staff 工程师影响的不是一个人,而是建立了一套让他人自愿对齐的系统(写 RFC、建立 review 文化、设计 API 让正确做法成为默认)。
  2. 提到失败的尝试。 "我第一次直接发需求,他们无视了。然后我调整策略…" — 这比一帆风顺更真实,也更展示学习能力。
  3. 区分说服和操纵。 好的影响力是基于真实利益对齐,不是包装欺骗。面试官会探究"你是如何确保这对他们也真的有好处的?"

Show systematic thinking not one-off tricks. Mention what didn't work first — shows learning. Distinguish persuasion (real alignment) from manipulation (packaging deception).


关键要点 / Key Takeaways

  1. 理解对方的激励,而不是假设他们应该配合你
  2. 用数据和风险框架,而不是个人请求
  3. 降低对方的参与成本 — 帮他们更容易说"好"
  4. 量化结果 — "成功了"不够,需要具体数字

Understand their incentives. Use data/risk framing, not personal favors. Lower their cost to say yes. Always quantify results.


*Day 2 of 100 | #ByteByByte | 行为面试系列*
🎨 前端 Day 2 / Frontend Day 2

Topic: Flexbox — 一维布局的瑞士军刀 / One-Dimensional Layouts Made Easy


猜猜这段代码输出什么?/ What Does This Code Output?

<div class="container">
  <div class="box">A</div>
  <div class="box">B</div>
  <div class="box">C</div>
</div>
.container{
  display: flex;
  justify-content: space-between;
  width: 300px;
}
.box{
  width: 80px;
  height: 80px;
  background: steelblue;
}


你的猜测 / Your guess:

A) 三个方块左对齐,紧靠在一起

B) 三个方块均匀分布,A 在最左,C 在最右,B 在中间

C) 三个方块居中显示

D) 报错,因为 80×3=240 < 300


答案: B ✅

容器 300px / container 300px:

[  A  ] ←30px→ [  B  ] ←30px→ [  C  ]
 ←80px→         ←80px→         ←80px→   (C 也是 80px,右端对齐 / C is also 80px, flush right)

space-between 的含义:第一个元素靠左边,最后一个元素靠右边,中间的间距平均分配

space-between: first item at start, last item at end, remaining space distributed equally between items.

剩余空间 = 300 - 80×3 = 60px,分成 2 份间距 = 每份 30px

Remaining space = 300 - 240 = 60px, split into 2 gaps = 30px each.


Flexbox 心智模型 / Mental Model

Flexbox 的核心:一个主轴(main axis)和一个交叉轴(cross axis)

flex-direction: row (默认/default)

主轴 main axis →→→→→→→→→→→→→→→→→→→→→→→
┌──────┐  ┌──────┐  ┌──────┐
│  A   │  │  B   │  │  C   │
└──────┘  └──────┘  └──────┘
交叉轴 cross axis ↓ (垂直方向/vertical)

flex-direction: column

主轴 main axis ↓
┌──────┐
│  A   │
├──────┤
│  B   │
├──────┤
│  C   │
└──────┘
交叉轴 cross axis → (水平方向/horizontal)

核心属性速查 / Key Properties Cheat Sheet

父容器属性 / Container Properties

.container{
  display: flex;
  
  /* 主轴方向 / Main axis direction */
  flex-direction: row | row-reverse | column | column-reverse;
  
  /* 主轴对齐 / Main axis alignment */
  justify-content: flex-start | flex-end | center 
                 | space-between | space-around | space-evenly;
  
  /* 交叉轴对齐 / Cross axis alignment */
  align-items: stretch | flex-start | flex-end | center | baseline;
  
  /* 换行 / Wrapping */
  flex-wrap: nowrap | wrap | wrap-reverse;
  
  /* gap (现代写法/modern) */
  gap: 16px;  /* 比 margin 更优雅 / cleaner than margin hacks */
}

子元素属性 / Item Properties

.item{
  /* 伸长比例 / Grow ratio */
  flex-grow: 0;    /* 默认不伸长 / default: don't grow */
  flex-grow: 1;    /* 占据剩余空间 / take remaining space */
  
  /* 收缩比例 / Shrink ratio */
  flex-shrink: 1;  /* 默认允许收缩 / default: can shrink */
  flex-shrink: 0;  /* 禁止收缩 / don't shrink */
  
  /* 基准尺寸 / Base size */
  flex-basis: auto | 200px | 30%;
  
  /* 简写 / Shorthand */
  flex: 1;        /* = flex-grow: 1, flex-shrink: 1, flex-basis: 0 */
  flex: 0 0 200px; /* = 固定200px,不伸不缩 / fixed 200px */
}

你可能不知道 / You Might Not Know (Gotcha!)

flex: 1 和 flex: 1 1 auto 不一样!/ flex: 1 and flex: 1 1 auto are not the same!

/* flex: 1 → flex-grow:1, flex-shrink:1, flex-basis: 0 */
/* 基准是 0,意思是从 0 开始按比例分配空间 */
/* base is 0: space is distributed purely by ratio */

/* flex: 1 1 auto → flex-grow:1, flex-shrink:1, flex-basis: auto */
/* 基准是内容大小,先按内容分,剩余的再按比例分 */
/* base is content size: content first, then distribute remaining */

.container { display: flex; width: 300px; }
.a { flex: 1; }          /* a 和 b 各得 150px */
.b { flex: 1; }          /* split evenly from 0 */

/* vs: 假设 .a 的文字内容比 .b 长 / suppose .a's text content is longer */
.a { flex: 1 1 auto; }   /* a 会更宽!/ a gets more space! */
.b { flex: 1 1 auto; }

flex: 1 splits space from zero (equal shares). flex: 1 1 auto splits from content size (content-biased). This trips up many senior devs!


经典布局示例 / Classic Layout Example

圣杯布局(Header + Sidebar + Main + Footer)

/* 用 Flexbox 实现三列布局 / Three-column layout */
.page{
  display: flex;
  flex-direction: column;
  min-height: 100vh;
}

.content-area{
  display: flex;
  flex: 1;  /* 占据剩余高度 / fill remaining height */
}

.sidebar{
  flex: 0 0 240px;  /* 固定宽度,不伸不缩 / fixed width */
}

.main{
  flex: 1;  /* 占据剩余宽度 / take remaining width */
}

Mini Challenge 🎯

用纯 CSS Flexbox(不用 Grid),实现这个布局:

┌─────────────────────────────┐
│           Header            │
├────────┬────────────────────┤
│Sidebar │   Main Content     │
│(200px) │    (flexible)      │
├────────┴────────────────────┤
│           Footer            │
└─────────────────────────────┘

侧边栏固定 200px,主内容区自适应,整体高度 100vh。

Sidebar fixed at 200px, main content flexible, total height 100vh.

答案明天揭晓!/ Answer revealed tomorrow!


*Day 2 of 100 | #ByteByByte | CSS Fundamentals 系列*

🤖 AI Day 2

Topic: Transformer 是怎么工作的?— "Attention Is All You Need"


从"翻译"说起 / Start With Translation

2017年之前,翻译系统用 RNN(循环神经网络):逐字读取,逐字生成。就像一个翻译员,读完一个字才能记下来,再读下一个,记忆有限,长句子容易忘记开头。

Before 2017, translation used RNNs: process word by word, like a translator who reads one word at a time with limited working memory. Long sentences = forgotten beginnings.

2017年,Google 发了一篇论文:"Attention Is All You Need"。核心思想震惊了整个 AI 界:

"你不需要按顺序读句子。你可以一次性看整个句子,然后决定每个词该'关注'哪些其他词。"
"You don't need to read sequentially. Look at the whole sentence at once, and let each word 'attend' to whichever other words are most relevant."

直觉解释 / Intuitive Explanation

为什么需要 Attention(注意力机制)?

翻译 "The animal didn't cross the street because it was too tired."

"it" 指的是什么?是 "animal" 还是 "street"?

人类一眼就知道是 "animal"(动物会累,街道不会累)。

RNN 在处理 "it" 的时候,离 "animal" 已经太远了,可能已经"忘了"。

Attention 的解法: 处理 "it" 时,让模型自动"回头看"整个句子,计算 "it" 和每个其他词的相关性分数。

"it" 与各词的 attention 分数示意 / Attention scores for "it":

The     → 0.05
animal  → 0.72  ← 高分!/ High score!
didn't  → 0.03
cross   → 0.04
the     → 0.02
street  → 0.08
because → 0.03
it      → 0.03

Attention solves the "it" problem: when processing "it", the model looks back at all words, assigns a relevance score to each, and "pays attention" to "animal" the most.


Transformer 核心机制 / Core Mechanism

Self-Attention 三步走 / Three Steps

每个词(token)会生成三个向量:

Each word generates three vectors:

输入词 / Input word: "animal"

┌──────────────────────────────────────────────┐
│ Q (Query)       K (Key)       V (Value)      │
│ "我想问什么"     "我是什么"     "我代表什么信息" │
│ "what I ask"    "what I am"   "my info"      │
└──────────────────────────────────────────────┘

计算过程 / Computation:

步骤1: Score = Q · K^T / √d_k
  用"查询"和每个词的"键"做点积,得到相关性分数
  Dot product of Query with all Keys → relevance scores

步骤2: Softmax(Score)
  把分数转成概率分布(所有词的权重加起来 = 1)
  Convert scores to probability distribution (weights sum to 1)

步骤3: Output = Σ(weight × V)
  用权重加权所有词的"值"向量,得到最终表示
  Weighted sum of all Value vectors → final representation

Transformer 整体架构 / Overall Architecture

输入文本 / Input Text: "I love cats"
        │
        ▼
[词嵌入 + 位置编码 / Token Embedding + Positional Encoding]
(告诉模型每个词的位置,因为 Attention 本身没有位置感)
(adds position info, since Attention has no inherent order)
        │
        ▼
┌─────────────────────────────┐
│     Encoder Block × N       │ ← N 层叠加 / N stacked layers
│  ┌──────────────────────┐   │
│  │    Multi-Head        │   │ ← 多个 Attention "头"并行
│  │    Self-Attention    │   │   multiple Attention heads in parallel
│  └──────────┬───────────┘   │
│             │ (+ residual)  │
│  ┌──────────▼───────────┐   │
│  │    Feed Forward      │   │ ← 每个位置独立的 MLP
│  │    Network (FFN)     │   │   per-position MLP
│  └──────────────────────┘   │
└─────────────────────────────┘
        │
        ▼
[丰富的上下文表示 / Rich contextual representation]

多头注意力 / Multi-Head Attention: 同时用多个 Attention(比如 8 个),每个关注不同的语义关系(语法、指代、情感等),最后拼接。就像同时从 8 个角度看一张照片。

Multi-head: run 8 attention mechanisms in parallel, each learning different relationship types (syntax, coreference, sentiment). Concatenate results. Like viewing a photo from 8 angles simultaneously.
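The idea above can be sketched in a few lines of NumPy. This is a simplification for illustration: each head reuses its slice of the input as Q, K, and V, whereas real models learn separate projection matrices per head:

```python
import numpy as np

def softmax(x):
    # Row-wise softmax, subtracting the max for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, num_heads=2):
    # Split d_model across heads, attend within each head, then concatenate.
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        sub = X[:, h * d_head:(h + 1) * d_head]  # this head's slice of the input
        Q = K = V = sub                          # simplification: no learned projections
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        heads.append(weights @ V)
    return np.concatenate(heads, axis=-1)        # back to (seq_len, d_model)

X = np.random.randn(3, 4)                        # 3 tokens, d_model=4
out = multi_head_self_attention(X, num_heads=2)
print(out.shape)  # (3, 4)
```

Each head sees only its own d_head-dimensional slice, which is why different heads can specialize in different relationships.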


为什么 Transformer 改变了一切?/ Why Did It Change Everything?

| 特性 / Feature | RNN | Transformer |
| 并行计算 / Parallelizable | ❌ 必须顺序 / Must be sequential | ✅ 所有位置同时处理 / All positions at once |
| 长距离依赖 / Long-range | ❌ 容易遗忘 / Forgets | ✅ 直接 Attention / Direct connection |
| 可扩展性 / Scalability | 差 / Poor | 优秀 / Excellent |
| GPU 利用率 / GPU usage | 低 / Low | 高 / High |

这就是 GPT、BERT、Claude、Gemini 的核心。 这篇 2017 年的论文,启动了整个现代 AI 时代。

This is the foundation of GPT, BERT, Claude, Gemini — every modern LLM. The 2017 paper that started the modern AI era.


代码片段:简化版 Attention / Simplified Attention in Code

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """
    Q: Query matrix  (seq_len, d_k)
    K: Key matrix    (seq_len, d_k)
    V: Value matrix  (seq_len, d_v)
    """
    d_k = Q.shape[-1]
    
    # Step 1: Compute attention scores
    # scores[i][j] = how much position i should attend to position j
    scores = Q @ K.T / np.sqrt(d_k)
    
    # Step 2: Softmax — convert scores to probabilities
    # subtracting the row max before exp keeps it numerically stable
    exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attention_weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
    
    # Step 3: Weighted sum of values
    output = attention_weights @ V
    
    return output, attention_weights

# Example: 3 tokens, d_k=4
np.random.seed(42)
Q = np.random.randn(3, 4)  # 3 tokens asking questions
K = np.random.randn(3, 4)  # 3 tokens presenting keys
V = np.random.randn(3, 4)  # 3 tokens' actual info

output, weights = scaled_dot_product_attention(Q, K, V)
print("Attention weights shape:", weights.shape)  # (3, 3)
# Each row sums to 1.0 — how much each token attends to each other

一句话总结 / One-Liner

Transformer = "让每个词直接看所有其他词,用相关性加权求和,并行计算,堆叠多层"
Transformer = "let every word directly look at all others, weight by relevance, compute in parallel, stack many layers"

延伸阅读 / Going Deeper

  • 原始论文 / Original paper: "Attention Is All You Need" (Vaswani et al., 2017)
  • 可视化工具 / Visualization: The Illustrated Transformer
  • 明天 Day 3 AI 主题预告:BERT vs GPT — 为什么双向比单向更聪明(有时候)

Tomorrow Day 3 AI preview: BERT vs GPT — why bidirectional beats unidirectional (sometimes).


*Day 2 of 100 | #ByteByByte | AI Foundations 系列*
