Because of business needs, I decided to add a few nodes to our Elasticsearch (ES) cluster. The current cluster runs version 7.5.2, and the nodes to be added were moved over from our old 1.5.2 cluster. The 7.5.2 cluster already has 6 data nodes, and my job this time was to add 5 new data nodes to it.

Since these 5 nodes were all data nodes from the old 1.5.2 cluster and had been running there stably for years without any problems, I was very confident about their configuration. This careless attitude set the stage for the trouble that followed: because the nodes had never caused problems before, I added all 5 of them to the ES7 cluster at once as data nodes.

An error occurred

After the nodes were added, the cluster started to rebalance, relocating and replicating shards, and it seemed our work would soon be done. That calm was quickly shattered by an SMS alert: the business side had suddenly started to see a large number of read and write failures. At the same time, our WeChat Work alert group was flooded with alerts, all of them CircuitBreakingException errors. An excerpt of the error message is shown below.

[parent] Data too large, data for [<transport_request>] would be [15634611356/14.5gb], which is larger than the limit of [15300820992/14.2gb], real usage: [15634605168/14.5gb], new bytes reserved: [6188/6kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=6188/6kb, accounting=18747688/17.8mb]

The reason for the error is actually simple: current memory usage has tripped the parent-level circuit breaker, so the transport_request cannot proceed, because letting it continue might push ES into an OutOfMemory error. ES defines several circuit breakers to avoid OOM (see the documentation at reference/7.5/circuit-breaker.html). Their job is to actively reject the next operation when memory is running low, rather than keep allocating until an OutOfMemoryError takes the whole process down.
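If you want to see how close each breaker is to its limit, the node stats API exposes per-breaker statistics. This is a routine check rather than something taken from the incident itself:

GET /_nodes/stats/breaker

Each breaker (parent, request, fielddata, in_flight_requests, accounting) is reported with its configured limit, its current estimated size, and a tripped counter, which makes it easy to confirm which breaker is rejecting requests.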

We now knew that the error was the Java process running out of heap memory, but what exactly was eating the memory? At that moment I wasn't thinking about that: the 5 new nodes kept reporting insufficient memory, which was causing a lot of online read and write failures, and my first priority was to stop the bleeding. This is where the earlier carelessness ambushed me. Because I had added all 5 nodes at once, the primary shard and all replicas of a given shard might now live entirely on the new nodes, so I could not simply stop all of them at once: doing so would take those shards completely offline and turn the whole cluster red.

In fact, rattled by the flood of alerts, I did exactly that: I stopped several nodes and the cluster status immediately turned red. Fortunately the shard data was still on the disks of the stopped nodes, so I rushed to bring them back up and the cluster left the red state. After that, all I could do was endure the alerts while waiting for shards to be replicated, and only stop a node once I had confirmed it no longer held the only copy of any shard.
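A standard way to confirm that (not necessarily the exact command I used at the time) is to list shard placement with the cat shards API and check that every shard on the node you want to stop also has a STARTED copy on some other node:

GET /_cat/shards?v&h=index,shard,prirep,state,node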

So I kept stopping nodes while the alerts rolled in, and after a while all 5 nodes were finally shut down and the read/write alerts caused by insufficient memory stopped. The remaining problem was that the cluster now had some unassigned shards, i.e. shards that had not been allocated successfully. The following command explains why a shard has not been allocated:

GET /_cluster/allocation/explain

The reason given was as follows:

shard has exceeded the maximum number of retries

This means the earlier memory errors made the allocation of these shards fail repeatedly until the maximum number of retries was reached, at which point ES gave up allocating them. In this case we just need to run the following command to retry the failed allocations manually:

POST /_cluster/reroute?retry_failed=true

The cluster then starts allocating the unassigned shards, and after waiting some time for allocation and replication to finish, the whole cluster was finally back to green.
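Progress can be watched with the cluster health API; for example, the following call (a routine check, not quoted from the incident) blocks until the cluster reaches green or the timeout expires:

GET /_cluster/health?wait_for_status=green&timeout=60s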

Cause of error

Our first thought was that memory was not being reclaimed in time, i.e. GC was not keeping up and too little memory was left, which triggered the errors. We looked at the G1 garbage collector's logs, which roughly fall into three kinds of entries:

# Normal young GC
Pause Young (Normal) (G1 Evacuation Pause)

# Young GC accompanied by concurrent marking operations
Pause Young (Concurrent Start) (G1 Humongous Allocation)
Concurrent Cycle

# Mixed GC
Pause Young (Prepare Mixed) (G1 Evacuation Pause)
Pause Young (Mixed) (G1 Evacuation Pause)
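For reference, Elasticsearch 7.x ships with GC logging enabled; the relevant entry in config/jvm.options looks roughly like the line below (flags and paths can differ between installs, so treat it as a sketch rather than our exact configuration):

-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utc,pid,tags:filecount=32,filesize=64m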

From the GC logs we found that collections were only triggered once heap occupancy was already high, which makes it easy to end up with too much unreclaimed memory and too little headroom. If we could make GC start earlier, we could reduce the chance of running out of memory (at the cost of some throughput, since GC would run more often). Searching the pull requests in the ES source repository, we found the following GC configuration:

-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30

The two parameters are explained in the official documentation of the G1 garbage collector as follows.

-XX:G1ReservePercent=10
Sets the percentage of reserve memory to keep free so as to reduce the risk of to-space overflows. The default is 10 percent. When you increase or decrease the percentage, make sure to adjust the total Java heap by the same amount. This setting is not available in Java HotSpot VM, build 23.

-XX:InitiatingHeapOccupancyPercent=45
Sets the Java heap occupancy threshold that triggers a marking cycle. The default occupancy is 45 percent of the entire Java heap.

In short, -XX:G1ReservePercent is the percentage of the heap kept in reserve to reduce the risk of to-space overflow; the default is 10 and we raised it to 25. -XX:InitiatingHeapOccupancyPercent is the heap occupancy threshold that triggers a concurrent marking cycle; the default is 45 and we lowered it to 30. With these two parameters, the JVM starts GC earlier, which greatly reduces the chance of running out of memory.
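On our 7.5.x nodes this kind of JVM option goes into config/jvm.options (the exact file location depends on how ES is installed); a minimal sketch of the lines we are describing:

# G1 tuning: keep more reserve headroom and start concurrent marking earlier
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30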

After modifying the parameters we restarted the nodes, this time one by one, waiting a few hours after each start to make sure nothing was wrong before starting the next. In addition, we set cluster.routing.allocation.enable to none before adding a node and back to all once the node was confirmed up, so that we could start and stop shard allocation manually (see the sketch below). After the change, the nodes no longer hit frequent out-of-memory errors, which shows that triggering GC earlier really does reduce the out-of-memory problems caused by GC running too late. Memory pressure was mostly gone, but one problem remained: shard relocation after a node joins could still occasionally trip the out-of-memory error, so we also needed to slow relocation down.
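The toggle itself is a dynamic cluster setting; a sketch of the call (this example uses a transient setting, a persistent one works the same way):

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}

Once the new node is confirmed up, the same call with "all" re-enables allocation.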

We changed cluster.routing.allocation.node_concurrent_recoveries from its default of 2 to 1, which limits how many shards a node relocates at the same time, and lowered indices.recovery.max_bytes_per_sec from the default 40mb to 20mb per node to slow down shard recovery and further reduce memory pressure during relocation. After these changes, the nodes never hit the out-of-memory error again.
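Both of these are dynamic cluster settings, so they can be applied without restarting anything; a sketch of the request with the values from the text:

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 1,
    "indices.recovery.max_bytes_per_sec": "20mb"
  }
}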

In summary, the problem had two causes:

  1. GC was triggered too late, so memory was not reclaimed in time.
  2. Memory grew too fast, driven by heavy shard relocation.

The solutions are as follows.

  1. Lower the GC trigger threshold so GC runs earlier and more often.
  2. Slow down shard recovery so memory grows more slowly.