02.07.2013 Views


Inside Linux TCP stack: Overview

Li Yu


Contents

TCP/IP protocol stack overview
  Key components
  How the protocol stack interacts with other kernel subsystems
Related important RFC documents review
Key features of TCP stack
  Create new connections
  Transmit data
  Receive data

Huh!


Key components (top to bottom)

Applications
System call
Virtual file system
Protocol independent socket layer
INET socket layer
Internet Connection Socket Layer
TCP | UDP | ICMP
IP
Routing layer | Neighbor (ARP) layer
NIC device management layer / NIC device drivers


Interactions

Device drivers: NIC, PCI, USB, …
Virtual file system: sockfs glue layer, NFS, …
Timing subsystem: various TCP timers, ARP timers, …
Interrupt dispatch subsystem: SoftIRQ, MultiQueue, RPS, …
Security subsystem: IPSec, NetLabel, … (TODO: wzt :)
Virtual memory subsystem, of course!


RFC documents review – Terms

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Acknowledgment Number                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Data |           |U|A|P|R|S|F|                               |
| Offset| Reserved  |R|C|S|S|Y|I|            Window             |
|       |           |G|K|H|T|N|N|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |         Urgent Pointer        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Options                    |    Padding    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             data                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
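To make the header layout concrete, here is a minimal Python sketch that decodes the fixed 20-byte part of the header shown above. The names `TcpHeader`, `parse_tcp_header` and `flag_names` are illustrative, and the sample SYN segment is made up; none of this is kernel code.

```python
import struct
from typing import NamedTuple


class TcpHeader(NamedTuple):
    src_port: int
    dst_port: int
    seq: int
    ack: int
    data_offset: int  # header length in 32-bit words
    flags: int        # low 6 bits: URG ACK PSH RST SYN FIN
    window: int
    checksum: int
    urgent: int


# Flag names ordered from bit 0 (FIN) up to bit 5 (URG).
FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG"]


def parse_tcp_header(raw: bytes) -> TcpHeader:
    """Decode the fixed 20-byte TCP header (network byte order)."""
    src, dst, seq, ack, off_flags, window, checksum, urgent = struct.unpack(
        "!HHIIHHHH", raw[:20])
    return TcpHeader(src, dst, seq, ack, off_flags >> 12, off_flags & 0x3F,
                     window, checksum, urgent)


def flag_names(flags: int) -> list[str]:
    """Return the names of the flag bits that are set."""
    return [name for bit, name in enumerate(FLAG_NAMES) if flags & (1 << bit)]


# Hypothetical SYN segment: port 40000 -> 80, ISN 1000, 5-word header.
syn = struct.pack("!HHIIHHHH", 40000, 80, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)
hdr = parse_tcp_header(syn)
```

The 16-bit field that follows the acknowledgment number packs the data offset into its top 4 bits and the six flags into its low 6 bits, which is why the sketch shifts and masks it.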




TCP protocol review – RFC

793: TRANSMISSION CONTROL PROTOCOL
  Core concepts, packet format, connection setup and release, window management, and a reference implementation.
813: WINDOW AND ACKNOWLEDGEMENT STRATEGY IN TCP
  Introduces silly window syndrome (SWS) and explains why ACKs should be delayed.
1122: Requirements for Internet Hosts -- Communication Layers
  An official specification for the Internet community. It incorporates by reference, amends, corrects, and supplements the primary protocol standards documents relating to hosts.
5681: TCP Congestion Control
  RFC 2001 -> RFC 2581 -> RFC 5681
  Slow start, congestion avoidance, fast retransmit, and fast recovery.
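As a rough illustration of the RFC 5681 mechanisms listed above, the following sketch grows a congestion window per ACK and shrinks it on fast retransmit. The function names and the `SMSS` value are assumptions chosen for the example; the Linux implementation counts in packets and is considerably more elaborate.

```python
SMSS = 1460  # sender maximum segment size in bytes (assumed value)


def on_ack(cwnd: int, ssthresh: int) -> int:
    """Grow cwnd (bytes) for one ACK of new data, RFC 5681 section 3.1 style."""
    if cwnd < ssthresh:
        return cwnd + SMSS                      # slow start: one SMSS per ACK
    return cwnd + max(1, SMSS * SMSS // cwnd)   # congestion avoidance: ~one SMSS per RTT


def on_fast_retransmit(flight_size: int) -> tuple[int, int]:
    """Return (cwnd, ssthresh) after the third duplicate ACK, section 3.2 style."""
    ssthresh = max(flight_size // 2, 2 * SMSS)
    return ssthresh + 3 * SMSS, ssthresh
```

For example, with `cwnd = 14600` and `ssthresh = 14600`, one ACK adds `1460 * 1460 // 14600 = 146` bytes, so roughly ten ACKs (one window) add one full segment.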


TCP protocol review – RFC

The RFCs related to TCP run to more than 4000 pages. Some are experimental, some are specifications, and some are unused or obsolete, so we need a roadmap to study them: RFC 4614.
Are the RFCs enough? No! Much information can only be found in other resources, e.g. FastTCP.


Create new connections
Client vs Server

Client: CLOSE -> SYN_SENT (tcp_connect())                            SYN >
Server: LISTEN (tcp_v4_conn_request())                             < SYN/ACK
Client: SYN_SENT -> ESTABLISHED (tcp_rcv_synsent_state_process())    ACK >
Server: LISTEN -> (new) SYN_RECV -> (new) ESTABLISHED (tcp_child_process() / tcp_rcv_state_process())
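The exchange above can be sketched as a small transition table. This is a simplified model: it folds the listening socket and its new child socket into one variable, whereas in the kernel the listener stays in LISTEN and only the child moves through SYN_RECV. The names `TRANSITIONS`, `step` and `handshake` are illustrative.

```python
# (state, event) -> (next state, segment sent in response)
TRANSITIONS = {
    ("CLOSE", "connect()"): ("SYN_SENT", "SYN"),
    ("LISTEN", "rcv SYN"): ("SYN_RECV", "SYN/ACK"),
    ("SYN_SENT", "rcv SYN/ACK"): ("ESTABLISHED", "ACK"),
    ("SYN_RECV", "rcv ACK"): ("ESTABLISHED", None),
}


def step(state, event):
    """Look up the next state and the reply segment for one event."""
    return TRANSITIONS[(state, event)]


def handshake():
    """Run the three-way handshake; return final states and segments sent."""
    client, server = "CLOSE", "LISTEN"
    trace = []
    client, seg = step(client, "connect()")
    trace.append(seg)
    server, seg = step(server, "rcv " + seg)
    trace.append(seg)
    client, seg = step(client, "rcv " + seg)
    trace.append(seg)
    server, _ = step(server, "rcv " + seg)
    return client, server, trace
```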


Create new connections
Client: - / SYN

Call chain: inet_stream_connect() -> tcp_v4_connect() -> tcp_connect() -> tcp_transmit_skb()

inet_stream_connect()
  Check the socket's "abstract" state; if it is not connected, start the connection.
  Then wait for the connection to complete. In non-blocking mode, return immediately.
tcp_v4_connect()
  Set the SYN_SENT state.
  Check the port; if no port is bound, bind one automatically.
  Look up the output route from the socket pair.
  Compute the TCP sequence number and the IP packet ID.
tcp_connect()
  tcp_connect_init()
    Compute the usable MSS (MTU + sockopt), initialize TCP-MTUP, and estimate rcv_mss.
    Compute the window (buffer size, scaling, sockopt).
    Initialize internal TCP state (snd_una, rcv_nxt, rcv_wup, copied_seq, etc.) and reset counters.
  --------------------------
  Allocate an skb and set up the related send-queue information (memory accounting, quota statistics, restart the RTO timer).
tcp_transmit_skb()
  Build the TCP options (timestamp, MSS, window scaling, SACK capability).
  Build the TCP header (including the checksum).
  Send the IP packet.
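To illustrate the MSS calculation and the option-building step for the SYN, here is a hedged sketch. The helper names `mss_from_mtu` and `build_syn_options` are hypothetical, the 20-byte IPv4 and TCP header sizes assume no IP options, and the option ordering is chosen for readability rather than copied from the kernel. The option kinds and lengths (MSS 2/4, SACK permitted 4/2, timestamps 8/10, window scale 3/3, NOP 1) are the standard ones.

```python
import struct


def mss_from_mtu(mtu, user_mss=None):
    """MSS = MTU minus IPv4 and TCP headers, optionally clamped by a sockopt value."""
    mss = mtu - 20 - 20
    return min(mss, user_mss) if user_mss else mss


def build_syn_options(mss, wscale, tsval, sack_ok=True):
    """Encode the options a SYN typically carries, padded to a 32-bit boundary."""
    opts = struct.pack("!BBH", 2, 4, mss)            # MSS: kind 2, length 4
    if sack_ok:
        opts += struct.pack("!BB", 4, 2)             # SACK permitted: kind 4, length 2
    opts += struct.pack("!BBII", 8, 10, tsval, 0)    # timestamps: kind 8, length 10
    opts += struct.pack("!BBB", 3, 3, wscale)        # window scale: kind 3, length 3
    opts += b"\x01" * (-len(opts) % 4)               # pad with NOP (kind 1)
    return opts


options = build_syn_options(mss_from_mtu(1500), wscale=7, tsval=12345)
```

With these options the header grows from 20 to 40 bytes, so the data offset field of such a SYN would be 10 words.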


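Create new connections
Server: SYN / SYN-ACK

Call chain: tcp_v4_do_rcv() -> tcp_rcv_state_process() -> tcp_v4_conn_request()

tcp_v4_do_rcv()
  Basic checks: SEG length, checksum.
  If in the LISTEN state, continue processing.
tcp_rcv_state_process()
  If in the LISTEN state, check whether the incoming SEG has the ACK or RST bit set; if so, drop it.
  If SYN is set, continue processing; otherwise, drop it.
tcp_v4_conn_request()
  Reject broadcast/multicast SEGs.
  Check whether there are too many requests, or too many half-open (accepting) connections.
  Create a new request structure and parse the TCP options of the incoming SEG.
  Record the connection's basic information in the request structure.
  Compute the ISN, depending on whether SYN cookies are in use.
  Send the SYN/ACK.
  Add the request to the request queue.
  SYN_RECV?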
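The filtering order described on this slide can be sketched as follows. `Segment`, `Listener` and the `max_requests` limit are illustrative names, the sketch follows the slide in treating every rejected segment as a plain drop, and it ignores SYN cookies.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Segment:
    syn: bool = False
    ack: bool = False
    rst: bool = False
    dst_is_unicast: bool = True


class Listener:
    """Toy listening socket with a bounded request (SYN) queue."""

    def __init__(self, max_requests):
        self.max_requests = max_requests
        self.request_queue = deque()

    def on_segment(self, seg):
        if seg.ack or seg.rst:
            return "drop"            # ACK/RST is not valid for a listener
        if not seg.syn:
            return "drop"            # only a SYN can open a connection
        if not seg.dst_is_unicast:
            return "drop"            # reject broadcast/multicast
        if len(self.request_queue) >= self.max_requests:
            return "drop"            # too many pending requests
        self.request_queue.append(seg)
        return "SYN/ACK"
```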


Create new connections
Client: SYN-ACK / ACK

Call chain: tcp_v4_do_rcv() -> tcp_rcv_state_process() -> tcp_rcv_synsent_state_process()

If in the SYN_SENT state, continue processing.
If processing fails, return to tcp_v4_do_rcv(), which causes a RST to be sent.

tcp_rcv_synsent_state_process()
  Parse the TCP options
  If the ACK flag is set {
    Check whether the ACK number equals snd_nxt
    Timestamp check
    RST?
    Is SYN set?
    Update window / scaling factor / timestamp
    Initialize TCP MTUP / MSS / rcv MSS
    Set the ESTABLISHED state
    Initialize congestion avoidance
    Send the ACK
  }
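A compact sketch of the checks in the braces above, assuming the simplified rule from the slide that an acceptable ACK must equal snd_nxt. The function name and return convention are made up for illustration, and the case of a SYN without ACK (simultaneous open) is left out.

```python
def process_synack(state, snd_nxt, ack_flag, syn_flag, rst_flag, ack_seq):
    """Return (new_state, reply) for a segment arriving in SYN_SENT."""
    if state != "SYN_SENT":
        return state, None
    if ack_flag:
        if ack_seq != snd_nxt:
            return state, "RST"        # unacceptable ACK: caller sends a reset
        if rst_flag:
            return "CLOSE", None       # peer refused the connection
        if syn_flag:
            return "ESTABLISHED", "ACK"
    return state, None                 # anything else is ignored in this sketch
```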


Create new connections
Server: Recv ACK

tcp_v4_do_rcv()
  If in the LISTEN state: tcp_v4_hnd_req()
  If tcp_v4_hnd_req() created a new socket: tcp_child_process()

tcp_v4_hnd_req()
  Search for a matching req; if one is found, call tcp_check_req(), which:
  1. Parses the TCP options
  2. Performs the remaining RFC 793 checks
  3. Calls tcp_v4_syn_recv_sock() to create a new socket. This socket is in the SYN_RECV state, and most of its state is copied from the parent socket
  4. Removes the req from the request queue and adds it to the accept queue
tcp_child_process()
  1. Calls tcp_rcv_state_process() to handle the new socket in the SYN_RECV state:
    1.1 Set the state to ESTABLISHED
    1.2 Update various internal TCP state
  2. Wakes up the sleeping process (accept() syscall)
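The hand-off from request queue to accept queue might be modeled like this. `ListenSocket` and its methods are hypothetical, and the only check performed is that the final ACK acknowledges the server ISN plus one; the kernel's tcp_check_req() does much more.

```python
from collections import deque


class ListenSocket:
    """Toy model of the request queue and accept queue of a listener."""

    def __init__(self):
        self.request_queue = {}        # peer -> ACK number we expect
        self.accept_queue = deque()

    def add_request(self, peer, server_isn):
        """Remember a request after the SYN/ACK has been sent."""
        self.request_queue[peer] = (server_isn + 1) % 2**32

    def on_ack(self, peer, ack_seq):
        """Promote a matching request to an established child socket."""
        expected = self.request_queue.get(peer)
        if expected is None or ack_seq != expected:
            return None
        del self.request_queue[peer]
        child = {"peer": peer, "state": "SYN_RECV"}   # created in SYN_RECV
        self.accept_queue.append(child)
        child["state"] = "ESTABLISHED"                # then moved to ESTABLISHED
        return child

    def accept(self):
        """What accept() would return, or None if nothing is ready."""
        return self.accept_queue.popleft() if self.accept_queue else None
```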


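Key features: Tx data
tcp_sendmsg()

Compute the send timeout and the usable MSS.
Check whether the connection is in a connected or half-connected state.
Check whether the connection has already been shut down.
Take the last skb from the send queue.
  If there is no such skb, or it has no free space, allocate a new skb.
  If no memory is free, go to sleep and wait until enough memory becomes available.
  Add the newly allocated skb to the tail of the send queue.
Copy data into the skb; this may involve both the linear area and the fragmented area.
If the data written so far is more than max_window/2 bytes beyond the SEQ of the last push, push out what has been written.
If the seg being filled is the next seg to be sent, push it directly.
After all the data has been copied, decide whether to push based on the Nagle and max_window/2 conditions.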
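The two push conditions mentioned above can be sketched with wrap-safe sequence arithmetic. `seq_after` mirrors the usual signed 32-bit comparison of sequence numbers; `forced_push` and `nagle_allows_send` are simplified stand-ins written for this example, not the kernel functions.

```python
MASK32 = 0xFFFFFFFF


def seq_after(a, b):
    """True if sequence number a is after b, modulo 2**32."""
    return ((b - a) & MASK32) >= 0x80000000


def forced_push(write_seq, pushed_seq, max_window):
    """Push once more than max_window/2 bytes were written since the last push."""
    return seq_after(write_seq, pushed_seq + (max_window >> 1))


def nagle_allows_send(seg_len, mss, packets_out, nodelay=False):
    """Simplified Nagle test: full segments go out, small ones only when idle."""
    return nodelay or seg_len >= mss or packets_out == 0
```

The modular comparison matters because sequence numbers wrap: a `write_seq` of 100 is still "after" a `pushed_seq` just below 2**32.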


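Key features: Tx data
tcp_write_xmit()

Try to piggyback an MTU probe (MTUP).
Walk the send queue, sending data until a send condition is violated:
  Check whether the congestion window or the send window would be exceeded.
  If there is only one seg, run the Nagle check.
  If the SEG was not pushed, check whether sending can be deferred, to reduce the cost of TSO splitting the SEG.
  tcp_transmit_skb()
  Adjust the send queue.
  Reset the retransmission queue.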
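A minimal sketch of the transmit loop, keeping only the congestion-window and send-window tests. It assumes the congestion window is counted in packets and the send window in bytes, and it ignores sequence wraparound, Nagle, TSO deferral and MTU probing; `write_xmit` and its parameters are illustrative.

```python
from collections import deque


def write_xmit(send_queue, cwnd, packets_out, snd_una, snd_wnd, next_seq):
    """Send queued segment lengths until cwnd or the peer's window is exhausted.

    Returns (sent, next_seq, packets_out) where sent lists (seq, length) pairs.
    """
    sent = []
    while send_queue:
        seg_len = send_queue[0]
        if packets_out >= cwnd:
            break                                  # congestion window is full
        if next_seq + seg_len > snd_una + snd_wnd:
            break                                  # would overrun the send window
        send_queue.popleft()                       # advance the send head
        sent.append((next_seq, seg_len))
        next_seq += seg_len
        packets_out += 1
    return sent, next_seq, packets_out
```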


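Key features: Rx data
Rx Data: from NIC

tcp_v4_do_rcv() -> tcp_rcv_established() (FAST PATH)

If the following conditions hold, enter the fast receive path: the SEG is in order, only the PSH/ACK flags are set, there are no options other than timestamp, and the send window is unchanged.
1. Parse the timestamp and run the PAWS check; if it fails, fall back to the normal receive path
2. If the SEG length is below the minimum, drop it and return 0
3. Try to copy using a DMA channel
4. If the current task is the receiver, try to copy directly into the "receive area"
5. If the attempts above succeeded:
  5.1 If the SEG carries the ACK for the first SEG of the last transmission, update the timestamp record
  5.2 RTTM
  5.3 Send any data that can be sent
6. If none of the optimizations above succeeded, verify the checksum, perform step 5, and then put the SEG on the receive queue
7. Call tcp_ack() to process the received ACK, update internal TCP state such as snd_una, and refresh the retransmission queue
8. Notify the tasks sleeping on this socket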
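The entry conditions and the PAWS test can be expressed as two small predicates. This is an approximation: the kernel folds most of these tests into a single precomputed header-prediction word, and the parameter names here are invented for the example. The flag values are the standard TCP bits (PSH 0x08, ACK 0x10).

```python
PSH, ACK = 0x08, 0x10


def paws_ok(tsval, ts_recent):
    """PAWS passes if the segment's timestamp is not older than ts_recent (mod 2**32)."""
    return ((tsval - ts_recent) & 0xFFFFFFFF) < 0x80000000


def fast_path_eligible(seq, rcv_nxt, flags, only_ts_option, window, snd_wnd):
    """Check the slide's fast-path conditions for an incoming segment."""
    in_order = seq == rcv_nxt
    plain_flags = (flags & ~PSH) == ACK       # ACK set, optionally PSH, nothing else
    return in_order and plain_flags and only_ts_option and window == snd_wnd
```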


Key features: Rx data
Rx Data: from system call

tcp_recvmsg()
1. Compute the receive timeout
2. Enter the main work loop
  2.1 If urgent data has arrived, or a signal needs handling, exit this system call
  2.2 Starting from copied_seq, search the receive queue for the skb at the right position
  2.3 Copy the SEG data into the user's buffer
  2.4 Adjust the receive queue state to speed up the next lookup
  2.5 Unless MSG_PEEK is set, remove this SEG from the receive queue
  2.6 If step 2.2 did not find the needed skb, run a series of checks for possible errors; if no error occurred, go to sleep. Later, when tcp_rcv_established() receives data, it copies it to the prequeue (the "receive area") and wakes this task, which then goes through a process similar to 2.2 - 2.5

Receive queues:
  sk_receive_queue
  DMA channel
  ucopy.prequeue
  sk_backlog_queue
  async_queue
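Steps 2.2 to 2.5 of the loop above, including the MSG_PEEK behavior, can be sketched on a single in-order receive queue. `RxSocket` is a hypothetical toy class; it never sleeps, handles no urgent data or signals, and ignores sequence wraparound.

```python
import socket
from collections import deque


class RxSocket:
    """Toy receive side: an in-order queue of (seq, data) and a copied_seq cursor."""

    def __init__(self, start_seq):
        self.copied_seq = start_seq
        self.receive_queue = deque()

    def recvmsg(self, length, flags=0):
        peek = bool(flags & socket.MSG_PEEK)
        seq_cursor = self.copied_seq
        out = bytearray()
        while len(out) < length:
            # 2.2: find the segment that contains the byte at seq_cursor
            seg = next(((s, d) for s, d in self.receive_queue
                        if s <= seq_cursor < s + len(d)), None)
            if seg is None:
                break                    # real code would check errors, then sleep
            seq, data = seg
            offset = seq_cursor - seq
            chunk = data[offset:offset + length - len(out)]
            out += chunk                 # 2.3: copy to the user's buffer
            seq_cursor += len(chunk)
            if not peek:
                self.copied_seq = seq_cursor
                if seq_cursor == seq + len(data):
                    self.receive_queue.remove(seg)   # 2.5: fully consumed
        return bytes(out)
```

A peek returns the same bytes as a normal read but leaves both `copied_seq` and the queue untouched, so the next call sees the data again.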


GAME OVER

Navigate your eyes to Tx source code if we still have time

And Q&A

That is all, Thanks!


Pending questions
