For a site with millions or tens of millions of DAUs, it is not enough to optimize only the web application and the database; the TCP/IP protocol layer needs tuning as well.

In my work, I have used the following basic optimization approaches.

Increasing the maximum number of connections

In Linux, all network connections are made via file descriptors, so the number of file descriptors a process can open determines the maximum number of connections it can create.

By default, Linux limits the number of file descriptors a process can open to 1024; for a large distributed site this limit is not sufficient, and it is recommended to raise the value appropriately.

First query Linux for the limit on the number of file descriptors.

ulimit -n

By default it will show 1024. Then edit the /etc/security/limits.conf file and add the following two lines of configuration.

* soft nofile 10000
* hard nofile 10000

After logging out and back in (or restarting the system), run ulimit -n again and the result should be 10000.
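To double-check that the new limit really applies to a particular service, the per-process limits exposed under /proc can be inspected. This is a small sketch using standard Linux tools; <pid> is a placeholder for the process ID of the service you care about.

# Limit for the current shell session
ulimit -n

# Limit already applied to a running process (replace <pid> with a real PID)
grep 'Max open files' /proc/<pid>/limits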

Reducing the TIME_WAIT time on TCP disconnection

During the four-way teardown (FIN) phase of a TCP disconnection, the side that initiates the close enters the TIME_WAIT state.

Execute the following command on your Linux server.

netstat -n | grep 'tcp'

In the last column of the output you may see lines with the value TIME_WAIT, which means those TCP connections have entered the TIME_WAIT state. A connection in this state is not released immediately; by default it must wait 2MSL (MSL, the Maximum Segment Lifetime, is roughly the time it takes a packet's TTL to decay to 0, and RFC 793 specifies it as 2 minutes). The connection therefore stays in TIME_WAIT and holds a file handle for two 2-minute periods after disconnection, which on a large site exhausts the server's connections very quickly. In practice 2MSL (4 minutes) is far too long for today's network speeds, and it is recommended to reduce the wait to 30 seconds.
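To see at a glance how many connections are sitting in each state, and in particular how many are stuck in TIME_WAIT, the netstat output can be summarised. This is a convenience sketch built only on standard tools; ss -tan works equally well as the data source.

# Count TCP connections per state (the state is the last column of netstat output)
netstat -n | awk '/^tcp/ {print $NF}' | sort | uniq -c | sort -rn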

  • Edit the /etc/sysctl.conf file and add the following lines.

    net.ipv4.tcp_fin_timeout = 30
    net.ipv4.tcp_tw_recycle = 1
    net.ipv4.tcp_tw_reuse = 1
    
  • Execute the sudo /sbin/sysctl -p command to make the changes take effect immediately.
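A quick way to confirm that the new values are active is to query them back with sysctl. net.ipv4.tcp_tw_recycle is left out of the check below because it was removed in Linux 4.12 and is known to break connections from clients behind NAT, so on recent kernels only the other two settings apply.

sysctl net.ipv4.tcp_fin_timeout net.ipv4.tcp_tw_reuse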

Disabling Delayed Acknowledgements

In the TCP protocol, delayed acknowledgement (Delayed ACK) is a double-edged sword. In the vast majority of cases, instead of returning an ACK for every packet the server receives, Delayed ACK combines the ACKs for multiple packets into one, which improves transmission performance and reduces the load on the network. Under certain conditions, however, it can have a negative impact.

The impact is mainly twofold.

  • The TCP protocol introduced the Nagle algorithm in RFC 896, mainly to solve the small-packet problem in TCP transmission: roughly speaking, when the sender has only a small amount of data to send, it does not send it immediately but waits until the data to be sent accumulates to a certain threshold (MSS, the Maximum Segment Size). Wikipedia gives the following pseudocode description of the algorithm.

    if there is new data to send then
        if the window size ≥ MSS and available data is ≥ MSS then
            send complete MSS segment now
        else
            if there is unconfirmed data still in the pipe then
                enqueue data in the buffer until an acknowledge is received
            else
                send data immediately
            end if
        end if
    end if
    

    Note the " an acknowledge is received" above, which means that if a sender is using Nagle’s algorithm to send small data, this data may be delayed until the receiver responds with an ACK, and the receiver will only send an ACK to the sender after the delayed acknowledgement timeout has been reached because of the delayed acknowledgement; thus causing the time to send the data to be stretched.

  • In network environments with significant packet loss, delayed acknowledgement can make the performance of the transmission even worse.

    For example, suppose the sender needs to send nine packets, with sequence numbers 1 through 9, to the receiver. Packets 1, 2 and 3 arrive successfully, packets 4, 5, 6 and 7 are lost, and when packet 8 arrives the receiver returns an ACK seq=4 to the sender, which then resends the fourth packet. But on receiving the fourth packet the receiver does not send an ACK seq=5 immediately; it waits until the delayed-acknowledgement timeout before responding, then the sender resends the fifth packet, the receiver again waits for the delayed-acknowledgement timeout before responding with ACK seq=6, and so on, which makes an already poor network transmission even worse.

As mentioned above, delayed acknowledgement is a double-edged sword that is useful in most environments, so it should be turned on or off depending on the circumstances, and in some cases the delayed-acknowledgement timeout can be reduced instead.
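There is no single global sysctl on Linux that switches delayed acknowledgement off; it is normally controlled per socket by the application (the TCP_QUICKACK socket option) or per route via ip route's quickack flag (available since Linux 3.11). The sketch below assumes the default route goes through the placeholder gateway 192.0.2.1 on eth0; adjust both to match the output of the first command.

# Show the current default route so its attributes can be copied
ip route show default

# Re-add the default route with quick (non-delayed) ACKs enabled
sudo ip route change default via 192.0.2.1 dev eth0 quickack 1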

Enabling SACK

In the second of the two scenarios above, the sender has to wait for the receiver's ACK before resending each lost packet: the fourth packet is resent when ACK 4 is received, the fifth when ACK 5 is received, the sixth when ACK 6 is received, and so on. When many packets have been lost, this step becomes particularly lengthy.

SACK (Selective Acknowledgment) solves this problem: after receiving the 8th packet, the receiver includes in its ACK seq=4 response the sequence numbers it has already received (in this case 8, meaning the 8th packet has arrived). On receiving this ACK, the sender knows that the 1st, 2nd, 3rd and 8th packets were delivered successfully, and can deduce that the 4th, 5th, 6th and 7th packets need to be retransmitted.

SACK is a two-sided option and must be enabled by both the sender and the receiver to take effect.
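On Linux, SACK is controlled by the net.ipv4.tcp_sack sysctl and is already enabled by default on modern kernels, so usually it only needs to be verified. A small sketch for checking and persisting the setting:

# Check whether SACK is enabled (1 = on)
sysctl net.ipv4.tcp_sack

# Enable it persistently if it is off
echo 'net.ipv4.tcp_sack = 1' | sudo tee -a /etc/sysctl.conf
sudo /sbin/sysctl -p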

Using the QUIC protocol

TCP, as a connection-oriented, reliable transport protocol, also has a number of inherent drawbacks, such as the retransmission problem discussed above, the three-way handshake and four-way teardown required to establish and close a connection, and head-of-line blocking when sending packets, so its transport performance is not always good enough. Google therefore proposed the QUIC protocol, which runs on top of UDP: it does not need the three-way handshake and four-way teardown when establishing and closing connections, and it uses a RAID 5-like forward error correction scheme to mitigate the retransmission problem caused by packet loss in TCP.

The QUIC protocol is now supported by Google’s Chrome browser and Microsoft’s Edge browser, while the upcoming HTTP/3 is also built on top of the QUIC protocol (RFC 9000).

Many popular web servers, such as Caddy, have built-in support for the QUIC protocol, and Nginx has released a preview version, nginx-quic, with official support for QUIC and HTTP/3.
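A simple way to check whether a site is reachable over QUIC/HTTP/3 from the command line is a curl build compiled with HTTP/3 support; the --http3 flag only exists in such builds, and https://example.com is just a placeholder URL.

# Request the headers of a page over HTTP/3 (QUIC); requires an HTTP/3-enabled curl
curl --http3 -I https://example.com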