25.12.2012 Views

TCP Implementation in Linux: A Brief Tutorial - University of Virginia

TCP Implementation in Linux: A Brief Tutorial - University of Virginia

TCP Implementation in Linux: A Brief Tutorial - University of Virginia

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

User Space<br />

Kernel Space<br />

s<strong>of</strong>tirf to free<br />

packet descriptor<br />

Application<br />

<strong>TCP</strong> send Buffer<br />

(tcp_wmem)<br />

<strong>TCP</strong> Process<br />

completion<br />

IP Layer<br />

queue<br />

dev_queue_transmit()<br />

Device Driver<br />

qdisc<br />

(txqueuelen)<br />

hard_start_xmit()<br />

Interrupt<br />

Generator<br />

write()<br />

tcp_transmit_skb()<br />

ip_queue_xmit()<br />

R<strong>in</strong>g<br />

Buffer<br />

NIC Memory<br />

Packet<br />

Transmission<br />

Kernel<br />

Memory<br />

Packet Data<br />

DMA Eng<strong>in</strong>e<br />

Fig. 2: Packet Transmission<br />

The analogue to the receiver’s netdev max backlog is the<br />

sender’s txqueuelen. The <strong>TCP</strong> layer builds packets when data<br />

is available <strong>in</strong> the send buffer or ACK packets <strong>in</strong> response to<br />

data packets received. Each packet is pushed down to the IP<br />

layer for transmission. The IP layer enqueues each packet <strong>in</strong> an<br />

output queue (qdisc) associated with the NIC. The size <strong>of</strong> the<br />

qdisc can be modified by assign<strong>in</strong>g a value to the txqueuelen<br />

variable associated with each NIC device. If the output queue<br />

is full, the attempt to enqueue a packet generates a localcongestion<br />

event, which is propagated upward to the <strong>TCP</strong><br />

layer. The <strong>TCP</strong> congestion-control algorithm then enters <strong>in</strong>to<br />

the Congestion W<strong>in</strong>dow Reduced (CWR) state, and reduces<br />

the congestion w<strong>in</strong>dow by one every other ACK (known as<br />

rate halv<strong>in</strong>g). After a packet is successfully queued <strong>in</strong>side the<br />

output queue, the packet descriptor (sk buff ) is then placed<br />

<strong>in</strong> the output r<strong>in</strong>g buffer tx r<strong>in</strong>g. When packets are available<br />

<strong>in</strong>side the r<strong>in</strong>g buffer, the device driver <strong>in</strong>vokes the NIC DMA<br />

eng<strong>in</strong>e to transmit packets onto the wire.<br />

While the above parameters dictate the flow-control pr<strong>of</strong>ile<br />

<strong>of</strong> a connection, the congestion-control behavior can also have<br />

a large impact on the throughput. <strong>TCP</strong> uses one <strong>of</strong> several<br />

congestion control algorithms to match its send<strong>in</strong>g rate with<br />

the bottleneck-l<strong>in</strong>k rate. Over a connectionless network, a large<br />

number <strong>of</strong> <strong>TCP</strong> flows and other types <strong>of</strong> traffic share the same<br />

bottleneck l<strong>in</strong>k. As the number <strong>of</strong> flows shar<strong>in</strong>g the bottleneck<br />

l<strong>in</strong>k changes, the available bandwidth for a certa<strong>in</strong> <strong>TCP</strong> flow<br />

varies. Packets get lost when the send<strong>in</strong>g rate <strong>of</strong> a <strong>TCP</strong> flow<br />

is higher than the available bandwidth. On the other hand,<br />

packets are not lost due to competition with other flows <strong>in</strong> a<br />

circuit as bandwidth is reserved. However, when a fast sender<br />

is connected to a circuit with lower rate, packets can get lost<br />

due to buffer overflow at the switch.<br />

When a <strong>TCP</strong> connection is set up, a <strong>TCP</strong> sender uses ACK<br />

packets as a ’clock, known as ACK-clock<strong>in</strong>g, to <strong>in</strong>ject new<br />

2<br />

packets <strong>in</strong>to the network [1]. S<strong>in</strong>ce <strong>TCP</strong> receivers cannot<br />

send ACK packets faster than the bottleneck-l<strong>in</strong>k rate, a<br />

<strong>TCP</strong> senders transmission rate while under ACK-clock<strong>in</strong>g is<br />

matched with the bottleneck l<strong>in</strong>k rate. In order to start the<br />

ACK-clock, a <strong>TCP</strong> sender uses the slow-start mechanism.<br />

Dur<strong>in</strong>g the slow-start phase, for each ACK packet received,<br />

a <strong>TCP</strong> sender transmits two data packets back-to-back. S<strong>in</strong>ce<br />

ACK packets are com<strong>in</strong>g at the bottleneck-l<strong>in</strong>k rate, the sender<br />

is essentially transmitt<strong>in</strong>g data twice as fast as the bottleneck<br />

l<strong>in</strong>k can susta<strong>in</strong>. The slow-start phase ends when the size<br />

<strong>of</strong> the congestion w<strong>in</strong>dow grows beyond ssthresh. In many<br />

congestion control algorithms, such as BIC [2], the <strong>in</strong>itial<br />

slow start threshold (ssthresh) can be adjusted, as can other<br />

factors such as the maximum <strong>in</strong>crement, to make BIC more<br />

or less aggressive. However, like chang<strong>in</strong>g the buffers via<br />

the sysctl function, these are system-wide changes which<br />

could adversely affect other ongo<strong>in</strong>g and future connections.<br />

A <strong>TCP</strong> sender is allowed to send the m<strong>in</strong>imum <strong>of</strong> the congestion<br />

w<strong>in</strong>dow and the receivers advertised w<strong>in</strong>dow number<br />

<strong>of</strong> packets. Therefore, the number <strong>of</strong> outstand<strong>in</strong>g packets<br />

is doubled <strong>in</strong> each roundtrip time, unless bounded by the<br />

receivers advertised w<strong>in</strong>dow. As packets are be<strong>in</strong>g forwarded<br />

by the bottleneck-l<strong>in</strong>k rate, doubl<strong>in</strong>g the number <strong>of</strong> outstand<strong>in</strong>g<br />

packets <strong>in</strong> each roundtrip time will also double the buffer<br />

occupancy <strong>in</strong>side the bottleneck switch. Eventually, there will<br />

be packet losses <strong>in</strong>side the bottleneck switch once the buffer<br />

overflows.<br />

After packet loss occurs, a <strong>TCP</strong> sender enters <strong>in</strong>to the<br />

congestion avoidance phase. Dur<strong>in</strong>g congestion avoidance,<br />

the congestion w<strong>in</strong>dow is <strong>in</strong>creased by one packet <strong>in</strong> each<br />

roundtrip time. As ACK packets are com<strong>in</strong>g at the bottleneck<br />

l<strong>in</strong>k rate, the congestion w<strong>in</strong>dow keeps grow<strong>in</strong>g, as does the<br />

the number <strong>of</strong> outstand<strong>in</strong>g packets. Therefore, packets will get<br />

lost aga<strong>in</strong> once the number <strong>of</strong> outstand<strong>in</strong>g packets grows larger<br />

than the buffer size <strong>in</strong> the bottleneck switch plus the number<br />

<strong>of</strong> packets on the wire.<br />

There are many other parameters that are relevant to the<br />

operation <strong>of</strong> <strong>TCP</strong> <strong>in</strong> L<strong>in</strong>ux, and each is at least briefly<br />

expla<strong>in</strong>ed <strong>in</strong> the documentation <strong>in</strong>cluded <strong>in</strong> the distribution<br />

(Documentation/network<strong>in</strong>g/ip-sysctl.txt). An example <strong>of</strong> a<br />

configurable parameter <strong>in</strong> the <strong>TCP</strong> implementation is the<br />

RFC2861 congestion w<strong>in</strong>dow restart function. RFC2861 proposes<br />

restart<strong>in</strong>g the congestion w<strong>in</strong>dow if the sender is idle for<br />

a period <strong>of</strong> time (one RTO). The purpose is to ensure that the<br />

congestion w<strong>in</strong>dow reflects the current state <strong>of</strong> the network.<br />

If the connection has been idle, the congestion w<strong>in</strong>dow may<br />

reflect an obsolete view <strong>of</strong> the network and so is reset. This behavior<br />

can be disabled us<strong>in</strong>g the sysctl tcp slow start after idle<br />

but, aga<strong>in</strong>, this change affects all connections system-wide.<br />

REFERENCES<br />

[1] V. Jacobson, “Congestion avoidance and control,” <strong>in</strong> Proceed<strong>in</strong>gs <strong>of</strong><br />

SIGCOMM, Stanford, CA, Aug. 1988.<br />

[2] L. Xu, K. Harfoush, and I. Rhee, “B<strong>in</strong>ary <strong>in</strong>crease congestion control for<br />

fast long-distance networks.” <strong>in</strong> Proceed<strong>in</strong>gs <strong>of</strong> IEEE INFOCOM, Mar.<br />

2003.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!