In Kubernetes, per-container resource metrics are collected by the kubelet's built-in cAdvisor component. However, for performance reasons, capturing the network-related metrics of every container would consume a lot of CPU and memory, so the collection of network-related (and some other) metrics is turned off by default in cAdvisor.

  • https://github.com/google/cadvisor/blob/master/docs/runtime_options.md#metrics
  • https://github.com/kubernetes/kubernetes/issues/60279

As a result, in the kubelet cAdvisor metrics endpoint that Prometheus scrapes by default, the metrics container_network_tcp_usage_total and container_network_udp_usage_total are always 0, while real business monitoring may still need these metrics for alerting and for troubleshooting problems.

Option 1

If you want to get these metrics from the kubelet's built-in cAdvisor, you need to change its configuration inside the kubelet, which means modifying the kubelet source code and recompiling it; this is not a native, maintainable approach.

The other approach is to deploy cAdvisor as a DaemonSet with TCP/UDP network metrics collection enabled and have Prometheus scrape it. However, this duplicates the cAdvisor deployment, and in our tests a standalone cAdvisor on a 64C/384G physical machine running about 300 containers consumed a significant amount of CPU and memory, which is exactly why cAdvisor disables network metrics collection by default. The Prometheus scrape configuration also has to be adjusted, otherwise two copies of the cAdvisor metrics will be scraped at the same time.

In summary, deploying an additional cAdvisor and scraping it with Prometheus consumes substantial resources, duplicates metrics collection, and still requires querying the data back out of Prometheus, processing it, and uploading it to the Open-Falcon monitoring platform.

Option 2

To obtain network connection metrics inside each container, building a monitoring component into every container is not practical, and collecting a container's network metrics is different from doing so for a virtual machine, so the usual VM monitoring agent cannot be reused as-is.

We know that containers isolate each process's network using Linux network namespaces, and the nsenter tool can execute network commands inside the network namespace of a given process. Refer to: https://stackoverflow.com/questions/40350456/docker-any-way-to-list-open-sockets-inside-a-running-docker-container.
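
As a rough sketch of this idea (assuming Python on the host and root privileges; the pid value is a hypothetical placeholder for a container's main process), ss -s can be run inside a container's network namespace like this:

import subprocess

def ss_summary_in_netns(pid):
    # Run `ss -s` inside the network namespace of the given host pid.
    # Requires root on the host; pid is the host pid of any process
    # inside the target container (typically its main process).
    cmd = ["nsenter", "-t", str(pid), "--net", "ss", "-s"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

print(ss_summary_in_netns(12345))   # 12345 is a hypothetical container pid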

In addition, the mapping between a project's container name and the host-side resources can be obtained through the Docker API, for example the container-to-pid correspondence and the container-to-NIC-device correspondence.
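
For example, a minimal sketch using the Docker SDK for Python (an assumption; the project could equally call the Docker HTTP API directly), relying on the io.kubernetes.* labels that the kubelet sets on Docker containers:

import docker   # Docker SDK for Python

client = docker.from_env()
for c in client.containers.list():                      # running containers only
    pid = c.attrs["State"]["Pid"]                       # host pid of the container's init process
    pod = c.labels.get("io.kubernetes.pod.name", "")    # labels set by the kubelet
    ns = c.labels.get("io.kubernetes.pod.namespace", "")
    print(ns, pod, c.name, pid)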

Based on the above, we can write our own script that captures only the network metrics we need, run it periodically on each host to collect the network status of every application container, and upload the results in monitoring-data format to the Open-Falcon monitoring platform for unified monitoring and alerting.
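
The upload step could look roughly like the following, assuming the standard push interface of the local falcon-agent on 127.0.0.1:1988 (the metric name and tag below are illustrative, not the project's actual ones):

import json, socket, time
import requests

def push_to_falcon(metric, value, tags, step=60,
                   agent_url="http://127.0.0.1:1988/v1/push"):
    payload = [{
        "endpoint": socket.gethostname(),
        "metric": metric,
        "timestamp": int(time.time()),
        "step": step,                 # reporting interval in seconds
        "value": value,
        "counterType": "GAUGE",
        "tags": tags,                 # e.g. "pod=my-app-0" (hypothetical)
    }]
    requests.post(agent_url, data=json.dumps(payload), timeout=5)

push_to_falcon("container.net.tcp.estab", 80, "pod=my-app-0")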

Take ss --summary (ss -s) as an example: it shows the number of TCP and UDP connections in their various states, and the script reads the statistics from the following fields.

Total: 141267 (kernel 7503901)
TCP:   28090 (estab 80, closed 27999, orphaned 139, synrecv 0, timewait 6586/0), ports 0

Transport Total     IP        IPv6
*         7503901   -         -
RAW       0         0         0
UDP       3         3         0
TCP       91        91        0
INET      94        94        0
FRAG      0         0         0
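
Parsing this output is straightforward; a minimal sketch (in Python, assuming the output format shown above) might look like:

import re

def parse_ss_summary(text):
    # Extract the TCP counters from `ss -s` output in the format shown above.
    stats = {}
    m = re.search(r"TCP:\s+(\d+) \(estab (\d+), closed (\d+), orphaned (\d+), "
                  r"synrecv (\d+), timewait (\d+)/\d+\)", text)
    if m:
        keys = ["tcp_total", "estab", "closed", "orphaned", "synrecv", "timewait"]
        stats.update(zip(keys, map(int, m.groups())))
    m = re.search(r"^Total:\s+(\d+)", text, re.MULTILINE)
    if m:
        stats["sockets_total"] = int(m.group(1))
    return stats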

On the machine described above (a 64C/384G physical machine running about 300 containers), this approach takes on the order of minutes to complete.

Since using the ss command to collect network status is slow, two optimizations are worth considering:

  1. Change the serial, per-container execution of the network status command to multi-threaded concurrent execution, limiting the concurrency and keeping an eye on CPU consumption (see the sketch after this list).
  2. Replace the ss command with a direct read of the /proc/{pid}/net/sockstat file to obtain the network statistics of the corresponding process.
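
A minimal sketch of the first optimization, assuming Python with a bounded thread pool (the nsenter/ss call is the same as in the earlier sketch):

import subprocess
from concurrent.futures import ThreadPoolExecutor

def netns_summary(pid):
    # Per-container collection step: `ss -s` inside the pid's network namespace.
    cmd = ["nsenter", "-t", str(pid), "--net", "ss", "-s"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def collect_all(pids, max_workers=8):
    # A bounded worker count keeps the collector itself from eating too much CPU.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(pids, pool.map(netns_summary, pids)))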

Option 3

Use the optimized version of option 2: instead of running the ss command in each container's network namespace, read the sockstat file under each container's pid, i.e. /proc/{pid}/net/sockstat, which is much faster than invoking ss. In addition, obtain the pod name and pid of each running container by talking to the Docker API, which is again much faster than executing commands.

Contents of the sockstat file:

sockets: used 141118
TCP: inuse 89 orphan 96 tw 7181 alloc 21341 mem 13896
UDP: inuse 3 mem 116
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
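
A minimal parsing sketch in Python, under the assumption that each protocol line consists of key/value pairs as shown above:

def read_sockstat(pid):
    # /proc/<pid>/net/sockstat reflects the network namespace of that pid,
    # i.e. the container the process belongs to.
    stats = {}
    with open(f"/proc/{pid}/net/sockstat") as f:
        for line in f:
            proto, rest = line.split(":", 1)
            parts = rest.split()
            # fields come in "key value" pairs, e.g. "inuse 89 orphan 96 tw 7181 ..."
            stats[proto] = {k: int(v) for k, v in zip(parts[::2], parts[1::2])}
    return stats

s = read_sockstat(12345)           # hypothetical container pid
tcp = s["TCP"]                     # inuse / orphan / tw / alloc / mem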

Note that the TCP inuse value is taken from tcp4_hashed + tcp6_hashed, i.e. the number of TCP sockets in use, not the number of TCP connections in the ESTABLISHED state; the count of established connections has to be read from the CurrEstab field of the /proc/net/snmp file.
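
Reading CurrEstab can be sketched as follows, using /proc/{pid}/net/snmp so that the value reflects that container's network namespace (the file consists of alternating header and value lines for each protocol):

def read_curr_estab(pid):
    # /proc/<pid>/net/snmp is the snmp view of the pid's network namespace.
    with open(f"/proc/{pid}/net/snmp") as f:
        lines = f.read().splitlines()
    # Lines come in header/value pairs: "Tcp: RtoAlgorithm ... CurrEstab ..."
    # followed by "Tcp: 1 200 ... <value> ...".
    for header, values in zip(lines[::2], lines[1::2]):
        if header.startswith("Tcp:"):
            fields = dict(zip(header.split()[1:], values.split()[1:]))
            return int(fields["CurrEstab"])
    return 0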

This solution reports six network connection status values, Establish, InUse, TimeWait, Orphan, Total, and UDP, for each container to the Open-Falcon monitoring platform, providing the in-container network connection status that the Prometheus setup cannot see and enabling monitoring and alerting on it.
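
Tying the earlier sketches together, the six values could be assembled per container roughly like this (read_sockstat, read_curr_estab, and push_to_falcon are the hypothetical helpers sketched above; the metric names are illustrative, not the project's actual ones):

def container_net_metrics(pid):
    s = read_sockstat(pid)
    return {
        "container.net.tcp.establish": read_curr_estab(pid),   # CurrEstab from snmp
        "container.net.tcp.inuse":     s["TCP"]["inuse"],
        "container.net.tcp.timewait":  s["TCP"]["tw"],
        "container.net.tcp.orphan":    s["TCP"]["orphan"],
        "container.net.sockets.total": s["sockets"]["used"],
        "container.net.udp.inuse":     s["UDP"]["inuse"],
    }

for name, value in container_net_metrics(12345).items():       # hypothetical pid
    push_to_falcon(name, value, "pod=my-app-0")                 # hypothetical tag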

On the same machine (a 64C/384G physical machine running about 300 containers), this approach completes in the millisecond range (around 300 ms) and consumes about 40 MB of memory, so the project was implemented with option 3.