TCP Implementation in Linux: A Brief Tutorial - University of Virginia
Fig. 2: Packet Transmission
[Figure not reproduced. Diagram labels: User Space (Application, write()); Kernel Space (TCP send buffer (tcp_wmem), TCP process, tcp_transmit_skb(), IP layer, ip_queue_xmit(), dev_queue_transmit(), qdisc (txqueuelen), device driver, hard_start_xmit(), ring buffer, packet data in kernel memory, completion queue, softirq to free packet descriptor); DMA engine; NIC memory; interrupt generator; packet transmission.]
The analogue to the receiver's netdev_max_backlog is the sender's txqueuelen. The TCP layer builds packets when data is available in the send buffer, or builds ACK packets in response to data packets received. Each packet is pushed down to the IP layer for transmission. The IP layer enqueues each packet in an output queue (qdisc) associated with the NIC. The size of the qdisc can be modified by assigning a value to the txqueuelen variable associated with each NIC device. If the output queue is full, the attempt to enqueue a packet generates a local-congestion event, which is propagated upward to the TCP layer. The TCP congestion-control algorithm then enters the Congestion Window Reduced (CWR) state and reduces the congestion window by one packet every other ACK (known as rate halving). After a packet is successfully queued inside the output queue, the packet descriptor (sk_buff) is placed in the output ring buffer (tx_ring). When packets are available inside the ring buffer, the device driver invokes the NIC DMA engine to transmit packets onto the wire.
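
The rate-halving behavior described above can be illustrated with a short simulation. This is only a sketch under simplifying assumptions: the window is counted in whole packets, the reduction is assumed to stop once the window reaches half of its value at the time of the local-congestion event, and the name `rate_halving` is hypothetical rather than a kernel symbol.

```python
def rate_halving(cwnd: int, acks: int) -> list:
    """Return the congestion window after each of `acks` ACKs in CWR state.

    Assumption: the window shrinks by one packet on every second ACK and
    never drops below half of its starting value (and never below 1).
    """
    target = max(cwnd // 2, 1)
    history = []
    for ack in range(1, acks + 1):
        # Only every other ACK triggers a reduction.
        if ack % 2 == 0 and cwnd > target:
            cwnd -= 1
        history.append(cwnd)
    return history


if __name__ == "__main__":
    # A window of 10 packets takes 10 ACKs to reach 5 packets.
    print(rate_halving(cwnd=10, acks=12))
```

Under these assumptions the window is halved over roughly one window's worth of ACKs, which is gentler than cutting it in half at once.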
While the above parameters dictate the flow-control profile of a connection, the congestion-control behavior can also have a large impact on throughput. TCP uses one of several congestion-control algorithms to match its sending rate with the bottleneck-link rate. Over a connectionless network, a large number of TCP flows and other types of traffic share the same bottleneck link. As the number of flows sharing the bottleneck link changes, the bandwidth available to a given TCP flow varies. Packets are lost when the sending rate of a TCP flow is higher than the available bandwidth. In a circuit, on the other hand, packets are not lost due to competition with other flows, because bandwidth is reserved. However, when a fast sender is connected to a circuit with a lower rate, packets can still be lost due to buffer overflow at the switch.
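
The last case, a fast sender feeding a slower circuit, can be sketched as a finite FIFO queue at the switch. The model below is an assumption made for illustration: time advances in discrete ticks, rates are expressed in packets per tick, and the function name `simulate_switch` and its parameters are hypothetical.

```python
def simulate_switch(send_rate: int, link_rate: int, buffer_size: int, ticks: int):
    """Return (packets left in the buffer, packets dropped) after `ticks`.

    Each tick the sender offers `send_rate` packets; packets that do not
    fit in the buffer are dropped; the outgoing circuit then drains up to
    `link_rate` packets.
    """
    queued = 0
    dropped = 0
    for _ in range(ticks):
        queued += send_rate
        if queued > buffer_size:
            # Tail drop: everything beyond the buffer capacity is lost.
            dropped += queued - buffer_size
            queued = buffer_size
        queued -= min(queued, link_rate)
    return queued, dropped


if __name__ == "__main__":
    print(simulate_switch(send_rate=10, link_rate=6, buffer_size=20, ticks=10))
    print(simulate_switch(send_rate=5, link_rate=6, buffer_size=20, ticks=10))
```

In this sketch losses begin only after the buffer has filled, and they never occur when the sender stays at or below the circuit rate.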
When a TCP connection is set up, a TCP sender uses ACK packets as a 'clock', known as ACK-clocking, to inject new packets into the network [1]. Since TCP receivers cannot send ACK packets faster than the bottleneck-link rate, a TCP sender's transmission rate while under ACK-clocking is matched with the bottleneck-link rate. In order to start the ACK-clock, a TCP sender uses the slow-start mechanism. During the slow-start phase, for each ACK packet received, a TCP sender transmits two data packets back-to-back. Since ACK packets are coming at the bottleneck-link rate, the sender is essentially transmitting data twice as fast as the bottleneck link can sustain. The slow-start phase ends when the size of the congestion window grows beyond ssthresh. In many congestion-control algorithms, such as BIC [2], the initial slow-start threshold (ssthresh) can be adjusted, as can other factors such as the maximum increment, to make BIC more or less aggressive. However, like changing the buffers via sysctl, these are system-wide changes which could adversely affect other ongoing and future connections.
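
The growth of the congestion window during slow start can be traced with a few lines of code. This is a minimal sketch, assuming that every packet is acknowledged individually, that each ACK grows the window by one packet (so one ACK releases two packets: one replacing the acknowledged packet and one for the growth), and that the window is capped at ssthresh for simplicity. The name `slow_start` is hypothetical.

```python
def slow_start(initial_cwnd: int, ssthresh: int) -> list:
    """Return the congestion window (in packets) at the start of each RTT."""
    if initial_cwnd < 1:
        raise ValueError("initial_cwnd must be at least 1 packet")
    cwnd = initial_cwnd
    per_rtt = [cwnd]
    while cwnd < ssthresh:
        acks = cwnd  # one ACK arrives for each outstanding packet
        # Each ACK adds one packet to the window; stop growing at ssthresh.
        cwnd = min(cwnd + acks, ssthresh)
        per_rtt.append(cwnd)
    return per_rtt


if __name__ == "__main__":
    print(slow_start(initial_cwnd=1, ssthresh=64))
```

Starting from one packet, the window in this sketch doubles every round trip and reaches a threshold of 64 packets after six round trips.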
A TCP sender is allowed to send a number of packets equal to the minimum of the congestion window and the receiver's advertised window. Therefore, during slow start the number of outstanding packets is doubled in each round-trip time, unless bounded by the receiver's advertised window. As packets are being forwarded at the bottleneck-link rate, doubling the number of outstanding packets in each round-trip time will also double the buffer occupancy inside the bottleneck switch. Eventually, there will be packet losses inside the bottleneck switch once the buffer overflows.
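
The interaction between the two windows and the bottleneck buffer can be sketched as follows. The model is a deliberate simplification and rests on assumptions not stated in the text: all quantities are in packets, `bdp` stands for the number of packets the path itself can hold, any outstanding packets beyond `bdp` are assumed to sit in the switch buffer, and the function and parameter names are hypothetical.

```python
def outstanding_per_rtt(cwnd, rwnd, bdp, buffer_size, max_rtts=32):
    """Trace slow start until the bottleneck buffer would overflow.

    Returns a list of (outstanding, buffered) pairs, one per RTT, ending
    with the first RTT in which `buffered` exceeds `buffer_size`, or after
    `max_rtts` entries if no overflow occurs.
    """
    trace = []
    for _ in range(max_rtts):
        outstanding = min(cwnd, rwnd)         # sender's effective window
        buffered = max(0, outstanding - bdp)  # excess queues at the switch
        trace.append((outstanding, buffered))
        if buffered > buffer_size:
            break                             # overflow: packets are lost
        cwnd *= 2                             # slow start doubles per RTT
    return trace


if __name__ == "__main__":
    # Large advertised window: the buffer eventually overflows.
    print(outstanding_per_rtt(cwnd=1, rwnd=1000, bdp=20, buffer_size=30))
    # Small advertised window: the sender is bounded before any loss.
    print(outstanding_per_rtt(cwnd=1, rwnd=40, bdp=20, buffer_size=30)[-1])
```

With a small advertised window the number of outstanding packets stops growing before the buffer fills, which is the flow-control bound mentioned above.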
After packet loss occurs, a TCP sender enters the congestion-avoidance phase. During congestion avoidance, the congestion window is increased by one packet in each round-trip time. As ACK packets are coming at the bottleneck-link rate, the congestion window keeps growing, as does the number of outstanding packets. Therefore, packets will get lost again once the number of outstanding packets grows larger than the buffer size in the bottleneck switch plus the number of packets on the wire.
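
The loss condition above lends itself to a small worked example. The sketch below counts how many round trips of linear growth fit before the window exceeds the buffer size plus the packets on the wire. It assumes the window is not limited by the receiver, and the names `rtts_until_loss`, `bdp` (packets on the wire), and `buffer_size` are hypothetical.

```python
def rtts_until_loss(cwnd: int, bdp: int, buffer_size: int) -> int:
    """Count RTTs of congestion avoidance before the next loss.

    Loss is assumed to occur as soon as the number of outstanding packets
    exceeds `buffer_size + bdp`. All quantities are in packets.
    """
    capacity = bdp + buffer_size
    rtts = 0
    while cwnd <= capacity:
        cwnd += 1   # additive increase: one packet per round-trip time
        rtts += 1
    return rtts


if __name__ == "__main__":
    # 20 packets on the wire plus a 30-packet buffer hold 50 packets.
    print(rtts_until_loss(cwnd=32, bdp=20, buffer_size=30))
```

Because growth is linear, a larger switch buffer postpones the next loss by one round trip per additional packet of buffering.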
There are many other parameters that are relevant to the operation of TCP in Linux, and each is at least briefly explained in the documentation included in the distribution (Documentation/networking/ip-sysctl.txt). An example of a configurable parameter in the TCP implementation is the RFC 2861 congestion window restart function. RFC 2861 proposes restarting the congestion window if the sender is idle for a period of time (one RTO). The purpose is to ensure that the congestion window reflects the current state of the network. If the connection has been idle, the congestion window may reflect an obsolete view of the network and so is reset. This behavior can be disabled using the sysctl tcp_slow_start_after_idle but, again, this change affects all connections system-wide.
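
The restart behavior can be sketched in two parts: reading the system-wide setting and decaying an idle window. This is an illustration, not the kernel code. It assumes the sysctl is exposed at /proc/sys/net/ipv4/tcp_slow_start_after_idle, that the window is halved once for every full RTO of idle time, and that it never falls below a restart window equal to the smaller of the initial window and the current window. The function names are hypothetical.

```python
from pathlib import Path

SYSCTL_PATH = Path("/proc/sys/net/ipv4/tcp_slow_start_after_idle")


def slow_start_after_idle_enabled(path: Path = SYSCTL_PATH):
    """Return True or False for the system-wide setting, or None if the
    file cannot be read (for example on a non-Linux host)."""
    try:
        return path.read_text().strip() != "0"
    except OSError:
        return None


def restart_cwnd(cwnd: int, initial_cwnd: int, idle_time: float, rto: float) -> int:
    """Decay `cwnd` (in packets) after `idle_time` seconds without sending."""
    if rto <= 0:
        raise ValueError("rto must be positive")
    restart_window = min(initial_cwnd, cwnd)
    for _ in range(int(idle_time // rto)):
        if cwnd <= restart_window:
            break
        cwnd //= 2  # halve once per full RTO of idle time
    return max(cwnd, restart_window)


if __name__ == "__main__":
    print(slow_start_after_idle_enabled())
    print(restart_cwnd(cwnd=64, initial_cwnd=10, idle_time=3.0, rto=1.0))
```

A connection that has been idle for less than one RTO keeps its window unchanged in this sketch, while a long idle period brings it back to the restart window.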
REFERENCES
[1] V. Jacobson, “Congestion avoidance and control,” in Proceedings of SIGCOMM, Stanford, CA, Aug. 1988.
[2] L. Xu, K. Harfoush, and I. Rhee, “Binary increase congestion control for fast long-distance networks,” in Proceedings of IEEE INFOCOM, Mar. 2003.