Kubernetes application troubleshooting

Set reasonable Req and Limit

If you don’t set requests (Req) and limits (Limit), a spike in an application's CPU or memory usage endangers other Pods on the same node, and can even bring down cluster nodes one after another.
There are four values in total: the CPU request, CPU limit, memory request, and memory limit. Which of them are set determines the Pod's QoS class. When node resource usage reaches the kubelet’s eviction thresholds, the kubelet evicts Pods, generally starting with BestEffort, then Burstable, and Guaranteed last.

Where:

Guaranteed: every container sets both CPU and memory requests and limits, and each request equals its limit
Burstable: the Pod does not meet the Guaranteed criteria, but at least one container sets a CPU or memory request or limit
BestEffort: no container sets any CPU or memory request or limit
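The three definitions above can be sketched as a small shell function. This is a simplified illustration only: `qos_class` is a hypothetical helper, it assumes a Pod with a single container, and it compares quantities as plain strings (so `0.5` and `500m` would not count as equal). An empty argument means "not set".

```shell
# Classify a single-container Pod from its four resource values.
# Arguments: cpu_request cpu_limit mem_request mem_limit ("" = not set).
# Hypothetical helper; real Pods may have several containers.
qos_class() {
  cpu_lim="$2"
  mem_lim="$4"
  # Kubernetes defaults a missing request to the limit when the limit is set.
  cpu_req="${1:-$2}"
  mem_req="${3:-$4}"

  if [ -z "$cpu_req$cpu_lim$mem_req$mem_lim" ]; then
    echo BestEffort
  elif [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] \
    && [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
    echo Guaranteed
  else
    echo Burstable
  fi
}

qos_class 500m 500m 1Gi 1Gi   # all four set, requests equal limits
qos_class 250m 500m 1Gi 1Gi   # CPU request below its limit
qos_class "" "" "" ""         # nothing set
```

On a running cluster, the class Kubernetes actually assigned can be read with kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'.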

Req and Limit should not differ too much. A common choice is Limit = Req * 1.5, or values based on monitoring history. The more resources a Pod consumes, the closer its Req and Limit should be.
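As a rough sketch of the Limit = Req * 1.5 rule, the helper below turns a CPU request (in millicores) and a memory request (in MiB) into flags for kubectl set resources. The function name `suggest_resources` and the example numbers are assumptions for illustration, not values from any real workload.

```shell
# Print --requests/--limits flags with Limit = Req * 1.5.
# Arguments: cpu request in millicores, memory request in MiB.
# Uses integer arithmetic, so odd values round down.
suggest_resources() {
  cpu_m="$1"
  mem_mi="$2"
  cpu_lim=$(( cpu_m * 3 / 2 ))
  mem_lim=$(( mem_mi * 3 / 2 ))
  echo "--requests=cpu=${cpu_m}m,memory=${mem_mi}Mi --limits=cpu=${cpu_lim}m,memory=${mem_lim}Mi"
}

suggest_resources 500 512
```

The output could then be passed along, for example: kubectl set resources deployment <deployment-name> $(suggest_resources 500 512). Values taken from monitoring history should take precedence over the fixed 1.5 factor.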

Focus on CPU Limiting for Applications

Since CPU is a compressible resource, an application with a Limit set usually stays available even under high load, but it can be slow.

This is because the kernel allocates CPU time in slices. If the application uses up its Limit's worth of CPU time within one scheduling period, it has to wait for the next period before it can run again. At this point, the application is being throttled.
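The effect of the slice mechanism can be estimated with a little arithmetic. The sketch below assumes the default scheduling period of 100 ms, a single-threaded burst of work that starts at the beginning of a period, and no other contention. `throttled_wall_ms` is a hypothetical name.

```shell
# Estimate wall-clock time (ms) for a single-threaded burst of CPU work
# under a CPU limit. Arguments: work in ms of CPU time, limit in millicores,
# optional period in ms (assumed default: 100).
throttled_wall_ms() {
  work_ms="$1"
  limit_m="$2"
  period_ms="${3:-100}"
  # CPU time the container may use in each period.
  quota_ms=$(( limit_m * period_ms / 1000 ))
  if [ "$quota_ms" -lt 1 ]; then
    echo "limit too small for this sketch" >&2
    return 1
  fi
  # One thread cannot use more than a full period, so no throttling here.
  if [ "$quota_ms" -ge "$period_ms" ]; then
    echo "$work_ms"
    return 0
  fi
  # Periods whose quota is fully used up before the final slice of work.
  exhausted=$(( (work_ms - 1) / quota_ms ))
  echo $(( exhausted * period_ms + work_ms - exhausted * quota_ms ))
}

throttled_wall_ms 200 500   # prints 350
throttled_wall_ms 40 500    # prints 40
```

With a 500m limit the application gets 50 ms of CPU per 100 ms period, so 200 ms of work is spread over four periods and finishes after about 350 ms, while a 40 ms burst fits in one period and is not delayed.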

CPU throttling makes the application less responsive.
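One way to check whether a container is being limited is to read the `nr_periods` and `nr_throttled` counters from its cgroup's cpu.stat file. The sketch below computes the share of periods in which the container was throttled. The location of cpu.stat depends on the cgroup version and container runtime, so the path is left as an argument, and `throttled_percent` is a hypothetical helper name.

```shell
# Print the percentage of scheduling periods in which the cgroup was throttled.
# Argument: path to a cpu.stat file (location varies by node setup).
throttled_percent() {
  awk '$1 == "nr_periods"   { p = $2 }
       $1 == "nr_throttled" { t = $2 }
       END { if (p > 0) printf "%d\n", t * 100 / p; else print 0 }' "$1"
}

# Example with made-up counter values written to a temporary file.
sample=$(mktemp)
printf 'nr_periods 200\nnr_throttled 50\n' > "$sample"
throttled_percent "$sample"   # prints 25
rm -f "$sample"
```

A value that stays high over time suggests the Limit is too low for the application's load.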

Note that the application does not slow down only when the Pod’s own CPU usage reaches its limit. If the CPU load on the node is high, the application is also starved of CPU time, because it can only use the resources that the node itself and the other workloads on it have not consumed.

Restarting Prometheus with scaling

Do not use a rolling restart. A rolling restart briefly doubles the component's resource consumption, which affects cluster stability. Instead, scale the Deployment down to 0 and back up to 1:

kubectl scale deployment prom-prometheus-server --replicas=0
kubectl scale deployment prom-prometheus-server --replicas=1

Also, by default Prometheus runs without the --storage.tsdb.no-lockfile flag, so it locks its data directory and cannot be restarted on a rolling basis. It can only be restarted using the method above.

Otherwise, the new instance reports the error: opening storage failed: lock DB directory: resource temporarily unavailable.

Multiple Prometheus instances sharing one storage volume may result in dirty data.
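The two scale commands can be wrapped so that the old Pod is confirmed gone before the new one starts, which is the point of avoiding two instances on one volume. This is a minimal sketch: `restart_by_scaling` is a hypothetical helper, the label selector is an assumption that must match the labels your Prometheus Pods actually carry, and the 120s timeout is arbitrary.

```shell
# Restart a Deployment by scaling to 0 and back to 1, waiting in between.
# Arguments: deployment name, label selector for its Pods (an assumption).
restart_by_scaling() {
  deploy="$1"
  selector="$2"
  kubectl scale deployment "$deploy" --replicas=0 || return 1
  # Wait until the old Pod is really deleted, so it has released the data directory.
  kubectl wait --for=delete pod -l "$selector" --timeout=120s || return 1
  kubectl scale deployment "$deploy" --replicas=1 || return 1
  # Wait until the new Pod is reported as available.
  kubectl rollout status "deployment/$deploy" --timeout=120s
}
```

Usage, with an assumed selector: restart_by_scaling prom-prometheus-server app=prometheus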
