Inside Linux TCP stack: Overview
<strong>Inside</strong> <strong>Linux</strong> <strong>TCP</strong> <strong>stack</strong>: <strong>Overview</strong><br />
Li Yu
Contents<br />
<strong>TCP</strong>/IP protocol <strong>stack</strong> overview<br />
Key components<br />
How the protocol <strong>stack</strong> interacts with other kernel subsystems<br />
Related important RFC documents review<br />
Key features of <strong>TCP</strong> <strong>stack</strong><br />
Create new connections<br />
Transmit data<br />
Receive data<br />
Huh!
Key components<br />
Applications<br />
System call<br />
Virtual file system<br />
Protocol independent socket layer<br />
ICMP<br />
INET socket layer<br />
Neighbor(ARP) layer<br />
Internet Connection Socket Layer<br />
<strong>TCP</strong><br />
Routing layer<br />
NIC device management layer / NIC device drivers<br />
IP<br />
UDP
Interactions<br />
Device drivers: NIC, PCI, USB, …<br />
Virtual file system: sockfs glue layer, NFS, …<br />
Timing subsystem: various <strong>TCP</strong> timers, ARP timers, …<br />
Interrupt dispatch subsystem: SoftIRQ, MultiQueue, RPS, …<br />
Security subsystem: IPSec, NetLabel, … (TODO: wzt :)<br />
Virtual memory subsystem, of course!<br />
…
RFC documents review – Terms<br />
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<br />
|          Source Port          |       Destination Port        |<br />
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<br />
|                        Sequence Number                        |<br />
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<br />
|                    Acknowledgment Number                      |<br />
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<br />
|  Data |           |U|A|P|R|S|F|                               |<br />
| Offset| Reserved  |R|C|S|S|Y|I|            Window             |<br />
|       |           |G|K|H|T|N|N|                               |<br />
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<br />
|           Checksum            |         Urgent Pointer        |<br />
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<br />
|                    Options                    |    Padding    |<br />
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<br />
|                             data                              |<br />
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
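To make the field layout concrete, here is a minimal sketch of decoding the fixed 20-byte header from a raw byte buffer. The type `tcp_hdr_view` and the function `tcp_parse_header()` are hypothetical names chosen for illustration; they are not kernel APIs, and options are not decoded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded view of the fixed part of a TCP header (hypothetical type). */
struct tcp_hdr_view {
    uint16_t src_port, dst_port;
    uint32_t seq, ack;
    uint8_t  data_offset;            /* header length in 32-bit words */
    bool     urg, ack_flag, psh, rst, syn, fin;
    uint16_t window, checksum, urgent_ptr;
};

/* Read big-endian (network order) integers from a byte buffer. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}

static uint32_t rd32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns false if the buffer is too short or the data offset is invalid. */
bool tcp_parse_header(const uint8_t *buf, size_t len, struct tcp_hdr_view *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 20)
        return false;

    out->src_port    = rd16(buf);
    out->dst_port    = rd16(buf + 2);
    out->seq         = rd32(buf + 4);
    out->ack         = rd32(buf + 8);
    out->data_offset = (uint8_t)(buf[12] >> 4);

    uint8_t flags = buf[13] & 0x3f;   /* URG ACK PSH RST SYN FIN */
    out->urg      = (flags & 0x20) != 0;
    out->ack_flag = (flags & 0x10) != 0;
    out->psh      = (flags & 0x08) != 0;
    out->rst      = (flags & 0x04) != 0;
    out->syn      = (flags & 0x02) != 0;
    out->fin      = (flags & 0x01) != 0;

    out->window     = rd16(buf + 14);
    out->checksum   = rd16(buf + 16);
    out->urgent_ptr = rd16(buf + 18);

    /* The header is at least 5 words and must fit in the buffer. */
    if (out->data_offset < 5 || (size_t)out->data_offset * 4 > len)
        return false;
    return true;
}
```

The flag bits are read from the low six bits of byte 13, in the same left-to-right order as the diagram (URG first, FIN last).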
<strong>TCP</strong> protocol review – RFC<br />
793 : TRANSMISSION CONTROL PROTOCOL<br />
Core concepts, packet format, connection create and release, window management<br />
and a reference implementation.<br />
813 : WINDOW AND ACKNOWLEDGEMENT STRATEGY IN <strong>TCP</strong><br />
Introduces the silly window syndrome (SWS) and explains why sending ACKs should be delayed.<br />
1122 : Requirements for Internet Hosts -- Communication Layers<br />
This RFC is an official specification for the Internet community. It incorporates by<br />
reference, amends, corrects, and supplements the primary protocol standards<br />
documents relating to hosts.<br />
5681 : <strong>TCP</strong> Congestion Control<br />
RFC2001 -> RFC2581 -> RFC5681<br />
slow start, congestion avoidance, fast retransmit, and fast recovery.
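The window arithmetic that RFC 5681 describes can be sketched as follows. This is an illustrative user-space model, not the kernel's implementation: the struct and function names are hypothetical, all quantities are in bytes, and the Linux stack keeps its congestion window in packets instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical congestion-control state; all fields are in bytes. */
struct cc_state {
    uint32_t smss;      /* sender maximum segment size */
    uint32_t cwnd;      /* congestion window */
    uint32_t ssthresh;  /* slow start threshold */
};

static uint32_t u32_min(uint32_t a, uint32_t b) { return a < b ? a : b; }
static uint32_t u32_max(uint32_t a, uint32_t b) { return a > b ? a : b; }

/* Called for each ACK that acknowledges new data. */
void cc_on_new_ack(struct cc_state *cc, uint32_t acked_bytes)
{
    assert(cc != NULL && cc->smss > 0 && cc->cwnd > 0);
    if (cc->cwnd < cc->ssthresh) {
        /* Slow start: grow by at most one SMSS per ACK. */
        cc->cwnd += u32_min(acked_bytes, cc->smss);
    } else {
        /* Congestion avoidance: roughly one SMSS per round trip. */
        uint32_t inc = (uint32_t)(((uint64_t)cc->smss * cc->smss) / cc->cwnd);
        cc->cwnd += inc ? inc : 1;
    }
}

/* Fast retransmit / fast recovery entry (third duplicate ACK). */
void cc_on_fast_retransmit(struct cc_state *cc, uint32_t flight_size)
{
    assert(cc != NULL);
    cc->ssthresh = u32_max(flight_size / 2, 2 * cc->smss);
    cc->cwnd = cc->ssthresh + 3 * cc->smss;
}

/* Retransmission timeout: collapse to a one-segment loss window. */
void cc_on_rto(struct cc_state *cc, uint32_t flight_size)
{
    assert(cc != NULL);
    cc->ssthresh = u32_max(flight_size / 2, 2 * cc->smss);
    cc->cwnd = cc->smss;
}
```

With `smss = 1000` and `ssthresh = 4000`, three full-sized ACKs take `cwnd` from 1000 to 4000 in slow start; the next ACK adds only 1000 * 1000 / 4000 = 250 bytes.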
<strong>TCP</strong> protocol review – RFC<br />
The RFCs related to <strong>TCP</strong> add up to more than 4000 pages. Some are experimental,<br />
some are specifications, and some are unused or obsoleted, so we need a<br />
roadmap to study them: RFC 4614.<br />
Are the RFCs enough? No! Much information can only be found in other<br />
sources, e.g. Fast<strong>TCP</strong>, and so on.
Create new connections<br />
Client vs Server<br />
Client: CLOSE -> SYN_SENT in tcp_connect(); sends SYN<br />
Server: LISTEN, handled by tcp_v4_conn_request(); replies with SYN/ACK<br />
Client: SYN_SENT -> ESTABLISHED in tcp_rcv_synsent_state_process(); sends ACK<br />
Server: LISTEN -> (new) SYN_RECV -> (new) ESTABLISHED in tcp_child_process() / tcp_rcv_state_process()
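The transitions on this slide can be modeled as a small state machine. The sketch below is a simplification under stated assumptions: the enum and function names are hypothetical, and the server side tracks the state of the newly created connection, whereas in the kernel the listening socket itself stays in LISTEN.

```c
#include <assert.h>
#include <stddef.h>

enum hs_state { HS_CLOSE, HS_LISTEN, HS_SYN_SENT, HS_SYN_RECV, HS_ESTABLISHED };
enum hs_event { HS_EV_CONNECT, HS_EV_RCV_SYN, HS_EV_RCV_SYN_ACK, HS_EV_RCV_ACK };
enum hs_reply { HS_SEND_NONE, HS_SEND_SYN, HS_SEND_SYN_ACK, HS_SEND_ACK };

/*
 * Apply one event to a connection state. Returns the next state and
 * stores the segment to send (if any) in *reply. Unexpected events
 * leave the state unchanged.
 */
enum hs_state hs_step(enum hs_state s, enum hs_event ev, enum hs_reply *reply)
{
    assert(reply != NULL);
    *reply = HS_SEND_NONE;

    switch (s) {
    case HS_CLOSE:          /* client: active open */
        if (ev == HS_EV_CONNECT) {
            *reply = HS_SEND_SYN;
            return HS_SYN_SENT;
        }
        break;
    case HS_LISTEN:         /* server: SYN arrives */
        if (ev == HS_EV_RCV_SYN) {
            *reply = HS_SEND_SYN_ACK;
            return HS_SYN_RECV;
        }
        break;
    case HS_SYN_SENT:       /* client: SYN/ACK arrives */
        if (ev == HS_EV_RCV_SYN_ACK) {
            *reply = HS_SEND_ACK;
            return HS_ESTABLISHED;
        }
        break;
    case HS_SYN_RECV:       /* server: final ACK arrives */
        if (ev == HS_EV_RCV_ACK)
            return HS_ESTABLISHED;
        break;
    case HS_ESTABLISHED:
        break;
    }
    return s;
}
```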
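Create new connections<br />
Client: - / SYN<br />
inet_stream_connect()<br />
Check the socket's "abstract" state; if it is not connected, start the connection.<br />
Then wait for the connection to complete. In non-blocking mode, return immediately.<br />
tcp_v4_connect()<br />
Set the SYN_SENT state.<br />
Port check: if no port is bound, bind one automatically.<br />
Look up the output route based on the socket pair.<br />
Compute the <strong>TCP</strong> sequence number and the IP packet ID.<br />
tcp_connect()<br />
tcp_connect_init():<br />
Compute the usable MSS (MTU + sockopt), initialize <strong>TCP</strong>-MTUP, and estimate rcv_mss.<br />
Compute the initial window (buffer size, window scaling, sockopt).<br />
Initialize <strong>TCP</strong> internal state (snd_una, rcv_nxt, rcv_wup, copied_seq, and so on) and reset the counters.<br />
--------------------------<br />
Allocate an skb and set up the related send-queue information (memory accounting, quota statistics, restarting the RTO).<br />
tcp_transmit_skb()<br />
Build the <strong>TCP</strong> options (timestamp, MSS, window scaling, SACK permitted).<br />
Build the <strong>TCP</strong> header (including the checksum).<br />
Send the IP packet.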
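As a rough illustration of the "MTU + sockopt" step, the sketch below derives an initial MSS from the path MTU and an optional user limit. It assumes IPv4 without IP options and a 20-byte TCP header; `initial_mss()` is a hypothetical helper, and the kernel's own calculation also accounts for TCP options and other clamps.

```c
#include <assert.h>
#include <stdint.h>

#define IPV4_HDR_LEN 20u   /* IPv4 header without options */
#define TCP_HDR_LEN  20u   /* TCP header without options */

/*
 * Hypothetical helper: MSS usable on a link with the given MTU.
 * user_mss models a user-supplied cap (0 means "not set").
 * Returns 0 if the MTU cannot carry any TCP payload.
 */
uint32_t initial_mss(uint32_t mtu, uint32_t user_mss)
{
    if (mtu <= IPV4_HDR_LEN + TCP_HDR_LEN)
        return 0;

    uint32_t mss = mtu - IPV4_HDR_LEN - TCP_HDR_LEN;
    if (user_mss != 0 && user_mss < mss)
        mss = user_mss;   /* the socket option can only lower the MSS */

    assert(mss > 0);
    return mss;
}
```

For a standard Ethernet MTU of 1500 this yields 1460 bytes, and a user cap of 1200 lowers it to 1200.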
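Create new connections<br />
Server: SYN / SYN-ACK<br />
tcp_v4_do_rcv()<br />
Basic checks: segment length and checksum.<br />
If the socket is in the LISTEN state, continue processing.<br />
tcp_rcv_state_process()<br />
In the LISTEN state, check whether the incoming segment has the ACK or RST bit set;<br />
if so, drop it.<br />
If SYN is set, continue processing; otherwise, drop the segment.<br />
tcp_v4_conn_request()<br />
Reject broadcast and multicast segments.<br />
Check whether there are too many requests or too many half-open (accepting) connections.<br />
Create a new request structure and parse the <strong>TCP</strong> options of the incoming segment.<br />
Record the basic connection information in the request structure.<br />
Compute the ISN, depending on whether SYN cookies are in use.<br />
Send the SYN/ACK.<br />
Add the request to the request queue.<br />
SYN_RECV?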
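The checks listed for the LISTEN state amount to a short decision chain, sketched below. The names are hypothetical, and the queue-full rule (fall back to SYN cookies when they are enabled, otherwise drop) is a simplifying assumption about how the "too many requests" check and the SYN cookie step interact.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FLAG_SYN 0x02u
#define FLAG_RST 0x04u
#define FLAG_ACK 0x10u

enum listen_action { LISTEN_DROP, LISTEN_SEND_SYN_ACK };

/* Hypothetical view of a listening socket's request queue. */
struct listen_ctx {
    unsigned req_queue_len;   /* pending connection requests */
    unsigned req_queue_max;   /* limit on pending requests */
    bool     syn_cookies;     /* assumption: cookies enabled */
};

/* Decide what to do with a segment that arrives on a LISTEN socket. */
enum listen_action listen_rcv(const struct listen_ctx *c, uint8_t flags,
                              bool bcast_or_mcast)
{
    assert(c != NULL);
    if (flags & (FLAG_ACK | FLAG_RST))
        return LISTEN_DROP;          /* ACK or RST is not a new request */
    if (!(flags & FLAG_SYN))
        return LISTEN_DROP;          /* only SYN opens a connection */
    if (bcast_or_mcast)
        return LISTEN_DROP;          /* TCP is unicast only */
    if (c->req_queue_len >= c->req_queue_max && !c->syn_cookies)
        return LISTEN_DROP;          /* too many pending requests */
    return LISTEN_SEND_SYN_ACK;
}
```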
Create new connections<br />
Client: SYN-ACK / ACK<br />
tcp_v4_do_rcv()<br />
tcp_rcv_state_process()<br />
If the socket is in the SYN_SENT state, continue processing.<br />
If processing fails, return to tcp_v4_do_rcv(), which causes a RST to be sent.<br />
tcp_rcv_synsent_state_process()<br />
Parse the <strong>TCP</strong> options.<br />
If the ACK flag is set {<br />
Check whether the ACK number equals snd_nxt.<br />
Timestamp check.<br />
RST?<br />
Is SYN set?<br />
Update window / scaling factor / timestamp.<br />
Initialize <strong>TCP</strong> MTUP / MSS / rcv MSS.<br />
Set the ESTABLISHED state.<br />
Initialize congestion avoidance handling.<br />
Send the ACK.<br />
}
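The ordering of the checks inside the ACK branch can be sketched as a pure function. This is a simplified model with hypothetical names; it omits the option parsing and timestamp checks, and it leaves segments without ACK (for example a simultaneous open) outside its scope.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FLAG_SYN 0x02u
#define FLAG_RST 0x04u
#define FLAG_ACK 0x10u

enum synsent_result {
    SYNSENT_SEND_RST,      /* unacceptable ACK: caller answers with RST */
    SYNSENT_CONN_RESET,    /* peer reset the connection attempt */
    SYNSENT_DROP,          /* ACK without SYN: ignore the segment */
    SYNSENT_ESTABLISHED,   /* valid SYN/ACK: send ACK, enter ESTABLISHED */
    SYNSENT_NOT_COVERED    /* no ACK flag: not modeled in this sketch */
};

/* Hypothetical minimal view of an incoming segment. */
struct synsent_seg {
    uint8_t  flags;
    uint32_t ack_seq;
};

enum synsent_result synsent_rcv(const struct synsent_seg *seg, uint32_t snd_nxt)
{
    assert(seg != NULL);
    if (!(seg->flags & FLAG_ACK))
        return SYNSENT_NOT_COVERED;
    if (seg->ack_seq != snd_nxt)     /* must acknowledge exactly our SYN */
        return SYNSENT_SEND_RST;
    if (seg->flags & FLAG_RST)
        return SYNSENT_CONN_RESET;
    if (!(seg->flags & FLAG_SYN))
        return SYNSENT_DROP;
    return SYNSENT_ESTABLISHED;
}
```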
Create new connections<br />
Server: Recv ACK<br />
tcp_v4_do_rcv()<br />
If the socket is in the LISTEN state, call tcp_v4_hnd_req().<br />
If tcp_v4_hnd_req() created a new socket, call tcp_child_process().<br />
tcp_v4_hnd_req()<br />
Search for a matching req; if one is found, call tcp_check_req(),<br />
which does the following:<br />
1. Parse the <strong>TCP</strong> options.<br />
2. Perform the remaining RFC 793 checks.<br />
3. Call tcp_v4_syn_recv_sock() to create a new socket.<br />
This socket is in the SYN_RECV state, and most of its state is copied from the parent socket.<br />
4. Remove the req from the request queue and add it to the accept queue.<br />
tcp_child_process()<br />
1. Call tcp_rcv_state_process() to handle<br />
the new socket in the SYN_RECV state:<br />
1.1 Set the state to ESTABLISHED.<br />
1.2 Update the various <strong>TCP</strong> internal states.<br />
2. Wake up the sleeping process (the accept() syscall).
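Key features: Tx data<br />
tcp_sendmsg()<br />
Compute the send timeout and the usable MSS size.<br />
Check whether the connection is in the connected or half-connected state.<br />
Check whether the connection has already been shut down.<br />
Take the last skb from the send queue.<br />
If there is no such skb, or the skb has no free space, allocate a new skb.<br />
If no memory is available, sleep until enough memory becomes free.<br />
Append the newly allocated skb to the tail of the send queue.<br />
Copy data into the skb; this may involve both the linear area and the fragmented area.<br />
If the data written is more than max_window/2 bytes past the SEQ of the last push,<br />
push out what has been written.<br />
If the segment just written is the next segment to be sent, push it directly.<br />
After all data has been copied, decide whether to push based on the Nagle and max_window/2 conditions.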
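The max_window/2 rule is a comparison in 32-bit sequence space, so it has to survive wraparound. The sketch below shows one way to write it; `tx_state`, `seq_after()` and `should_force_push()` are hypothetical names, and the signed-difference trick assumes two's-complement conversion, which holds on the platforms Linux runs on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if seq1 is later than seq2 in 32-bit sequence space. */
static bool seq_after(uint32_t seq1, uint32_t seq2)
{
    /* The unsigned subtraction wraps; the sign of the result
     * tells which value is ahead. */
    return (int32_t)(seq2 - seq1) < 0;
}

/* Hypothetical subset of the sender state used by the push decision. */
struct tx_state {
    uint32_t write_seq;    /* sequence number after the last byte queued */
    uint32_t pushed_seq;   /* write_seq at the time of the last push */
    uint32_t max_window;   /* largest window the peer has advertised */
};

/* Push once more than half of the peer's largest window is queued
 * beyond the last push point. */
bool should_force_push(const struct tx_state *tx)
{
    assert(tx != NULL);
    return seq_after(tx->write_seq, tx->pushed_seq + (tx->max_window >> 1));
}
```

Comparing with plain `>` would give the wrong answer whenever `pushed_seq + max_window / 2` wraps past 2^32, which is why the difference is interpreted as a signed value.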
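Key features: Tx data<br />
tcp_write_xmit()<br />
Try to piggyback an MTU probe (MTUP).<br />
Walk the send queue and send data until a send condition is violated:<br />
Check whether the congestion window or the send window would be exceeded.<br />
If there is only one segment, perform the Nagle check.<br />
If the segment was not pushed, check whether sending can be deferred,<br />
to reduce the cost of TSO splitting segments.<br />
tcp_transmit_skb()<br />
Adjust the send queue.<br />
Reset the retransmission queue.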
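The core of the send loop, walking the queue until a window test fails, can be sketched as below. The types and the function `xmit_ready_count()` are hypothetical, the congestion window is counted in segments and the send window in bytes, and the Nagle and TSO deferral checks are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if seq1 is later than seq2 in 32-bit sequence space. */
static bool seq_after(uint32_t seq1, uint32_t seq2)
{
    return (int32_t)(seq2 - seq1) < 0;
}

struct tx_seg {
    uint32_t seq;   /* first sequence number of the segment */
    uint32_t len;   /* payload length in bytes */
};

/* Hypothetical subset of the sender state. */
struct snd_state {
    uint32_t snd_una;     /* oldest unacknowledged sequence number */
    uint32_t snd_wnd;     /* peer's advertised window, in bytes */
    uint32_t cwnd;        /* congestion window, in segments */
    uint32_t in_flight;   /* segments sent but not yet acknowledged */
};

/* Count how many queued segments may be sent right now. The state is
 * passed by value, so the caller's copy is not modified. */
size_t xmit_ready_count(const struct tx_seg *q, size_t n, struct snd_state s)
{
    size_t sent = 0;

    assert(q != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (s.in_flight >= s.cwnd)
            break;                        /* congestion window is full */

        uint32_t end_seq = q[i].seq + q[i].len;
        if (seq_after(end_seq, s.snd_una + s.snd_wnd))
            break;                        /* would exceed the send window */

        s.in_flight++;                    /* "transmit" the segment */
        sent++;
    }
    return sent;
}
```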
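Rx Data: from NIC<br />
Key features: Rx data<br />
tcp_v4_do_rcv() -> tcp_rcv_established() (FAST PATH)<br />
The fast receive path is taken if all of the following hold: the segment is in order, only the PSH/ACK flags are set,<br />
there are no options other than timestamp, and the send window is unchanged:<br />
1. Parse the timestamp and run the PAWS check; if it fails, fall back to<br />
the normal receive path.<br />
2. If the segment is shorter than the minimum length, drop it and return 0.<br />
3. Try to copy the data using a DMA channel.<br />
4. If the current task is the receiver, try to copy directly into the "receive area".<br />
5. If the attempts above succeed:<br />
5.1 If the segment carries the ACK for the first segment of the last transmission,<br />
update the timestamp record.<br />
5.2 RTTM (round-trip time measurement).<br />
5.3 Send any data that can be sent.<br />
6. If none of the optimizations above succeeded, verify the checksum, then<br />
perform step 5, and afterwards put the segment on the receive queue.<br />
7. Call tcp_ack() to process the received ACK and update <strong>TCP</strong> internal state<br />
such as snd_una, and refresh the retransmission queue.<br />
8. Notify the tasks sleeping on this socket.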
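The eligibility test for the fast path, together with the PAWS comparison from step 1, can be sketched as one predicate. The struct fields and the function name are hypothetical; the kernel folds most of these conditions into a single precomputed flag word, which is not modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FLAG_PSH 0x08u
#define FLAG_ACK 0x10u

/* Hypothetical view of an incoming segment. */
struct rx_seg {
    uint32_t seq;          /* first sequence number of the segment */
    uint8_t  flags;        /* TCP flag bits */
    uint16_t window;       /* window advertised by the peer */
    bool     has_ts;       /* timestamp option present */
    bool     other_opts;   /* any option other than timestamp present */
    uint32_t tsval;        /* timestamp value, valid if has_ts */
};

/* Hypothetical subset of the receiver state. */
struct rx_state {
    uint32_t rcv_nxt;       /* next sequence number expected */
    uint16_t peer_window;   /* window the peer advertised last time */
    uint32_t ts_recent;     /* most recent timestamp accepted */
};

bool take_fast_path(const struct rx_state *st, const struct rx_seg *seg)
{
    assert(st != NULL && seg != NULL);
    if (seg->seq != st->rcv_nxt)
        return false;                       /* not the in-order segment */
    if ((seg->flags & (uint8_t)~FLAG_PSH) != FLAG_ACK)
        return false;                       /* flags other than PSH/ACK */
    if (seg->other_opts)
        return false;                       /* only timestamp is allowed */
    if (seg->window != st->peer_window)
        return false;                       /* window changed */
    /* PAWS: a timestamp older than ts_recent means a stale segment. */
    if (seg->has_ts && (int32_t)(seg->tsval - st->ts_recent) < 0)
        return false;
    return true;
}
```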
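Rx Data: from system call<br />
Key features: Rx data<br />
tcp_recvmsg()<br />
1. Compute the receive timeout.<br />
2. Enter the main work loop:<br />
2.1 If urgent data has been received, or a signal needs to be handled, exit this system call.<br />
2.2 Starting from copied_seq, search the receive queue for the skb at the appropriate position.<br />
2.3 Copy the segment data into the user's buffer.<br />
2.4 Adjust the state of the receive queue to speed up the next lookup.<br />
2.5 Unless MSG_PEEK is set, remove this segment from the receive queue.<br />
2.6 If step 2.2 did not find the needed skb, run a series of checks for possible errors; if no error occurred,<br />
go to sleep. Later, when tcp_rcv_established() receives data, it copies it to the prequeue (the "receive<br />
area") and wakes this task, which then goes through a process similar to steps 2.2 - 2.5.<br />
Receive queues:<br />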
sk_receive_queue<br />
DMA channel<br />
ucopy.prequeue<br />
sk_backlog_queue<br />
async_queue
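Steps 2.2 to 2.5 of the receive loop can be modeled with an array standing in for the receive queue. This is a sketch under stated assumptions: the types and `rx_copy()` are hypothetical, the queue is assumed to hold contiguous, in-order, non-empty segments, and sleeping, urgent data and the other queues are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct rx_skb {
    uint32_t       seq;    /* sequence number of the first byte */
    uint32_t       len;    /* payload length, assumed > 0 */
    const uint8_t *data;
};

/* Hypothetical receive queue: skbs[head..count) are still queued. */
struct rx_queue {
    const struct rx_skb *skbs;
    size_t               count;
    size_t               head;         /* first skb not fully consumed */
    uint32_t             copied_seq;   /* next byte to hand to the user */
};

/* Copy up to buflen bytes to buf, starting at copied_seq. With peek set,
 * the queue position is left untouched (as with MSG_PEEK). */
size_t rx_copy(struct rx_queue *q, uint8_t *buf, size_t buflen, bool peek)
{
    assert(q != NULL && buf != NULL);
    size_t   copied = 0;
    size_t   idx = q->head;
    uint32_t seq = q->copied_seq;

    while (copied < buflen && idx < q->count) {
        const struct rx_skb *skb = &q->skbs[idx];
        uint32_t offset = seq - skb->seq;   /* bytes of this skb already read */
        assert(offset < skb->len);          /* holds for a contiguous queue */

        size_t avail = skb->len - offset;
        size_t room  = buflen - copied;
        size_t n     = avail < room ? avail : room;

        memcpy(buf + copied, skb->data + offset, n);
        copied += n;
        seq    += (uint32_t)n;
        if (n == avail)
            idx++;                          /* skb fully consumed */
    }

    if (!peek) {                            /* commit the new position */
        q->head = idx;
        q->copied_seq = seq;
    }
    return copied;
}
```

A peek followed by a normal read returns the same leading bytes twice, because only the non-peek call advances `copied_seq` and `head`.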
GAME OVER<br />
Let's walk through the Tx source<br />
code if we still have time<br />
And<br />
Q&A<br />
That is all, Thanks!
Pending questions