Today I’d like to share a problem we ran into some time ago in our open source project Easegress: etcd’s large memory footprint. The problem itself is fairly simple, but I also want to talk about the causes and consequences around it: why we use etcd, the scenarios in which we use it, the etcd design decisions that lead to a relatively large memory footprint, and finally some suggestions. I hope this article gives you more than just a glimpse of a simple memory problem. And of course, if you think our open source software is doing a good job, you are welcome to follow the project.


Why etcd

Let’s start with why we use etcd, illustrated by a practice of our own: the API gateway we built, Easegress.

Easegress is an API application gateway we developed and open-sourced. It is not just a reverse proxy like nginx; it can do a lot more, such as API orchestration, service discovery, resilience patterns (circuit breaking, rate limiting, retries, etc.), and authentication/authorization (JWT, OAuth2, HMAC, etc.).

It also supports various cloud native architectures, such as microservices, Service Mesh, and Serverless/FaaS integration, and can carry more advanced enterprise-level scenarios such as high concurrency, canary releases, and full-chain stress testing. To achieve these goals, back in 2017 we felt we could not evolve such software on top of existing gateways like Nginx and had to write a new one. (Later, Lyft apparently came to the same conclusion and wrote Envoy, except that Envoy is written in C++, while we chose Go for its lower technical threshold.)

In addition, Easegress has three core design points:

  • First, the ability to elect a leader and form a cluster on its own, without third-party dependencies.
  • Second, pipeline-style plugin orchestration, like the Linux command-line pipeline (with Go/WebAssembly support).
  • Third, a built-in data store for cluster control and data sharing.

Any distributed system needs a Paxos/Raft-based leader-election mechanism, and some key control/configuration data and related shared data must be synchronized across the cluster so that the behavior of the whole cluster is uniform and consistent. Without such a thing, there is no way to run a distributed system. This is why components like Zookeeper/etcd exist: they are not primarily there for you to store data, but to coordinate the cluster.

Zookeeper is a popular open source project used in production at major companies and by other open source software such as Kafka. However, it makes other software carry an external dependency and brings a lot of operational complexity; the latest versions of Kafka have abandoned the external Zookeeper design in favor of a built-in leader-election algorithm. etcd is the mainstay on the Go side and a key component of Kubernetes clustering. When Easegress started out (5 years ago), we used the gossip protocol to synchronize state (thinking too far ahead, trying to support wide-area-network clusters). We found the protocol too complex and hard to debug, and a WAN API gateway did not have the right use cases anyway. So, 3 years ago, for stability reasons, we replaced it with an embedded etcd, which has been in use ever since.

Easegress puts all of its configuration into etcd, along with some statistics/monitoring data and some user-defined data (so that a user’s own plugins can share data not only within a pipeline but across the whole cluster), which is very convenient for users who want to extend it. Extensibility of the codebase has always been our primary goal; open source software in particular should find ways to lower the technical threshold and make the technology easy to extend, which is also why Google’s open source projects tend to choose Java.
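The underlying pattern here is plain etcd usage: one member writes a key, and every member watches for changes. Below is a minimal sketch of that pattern using etcd’s clientv3 (the /demo/config/ key layout is made up for illustration and is not Easegress’s actual schema):

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// One member publishes a config change (hypothetical key layout).
	putResp, err := cli.Put(ctx, "/demo/config/pipeline-1", `{"rateLimit": 100}`)
	if err != nil {
		log.Fatal(err)
	}

	// Every member watches the shared prefix; starting the watch at the
	// Put's revision replays that event, so nothing is missed.
	watchCh := cli.Watch(ctx, "/demo/config/",
		clientv3.WithPrefix(), clientv3.WithRev(putResp.Header.Revision))

	for resp := range watchCh {
		for _, ev := range resp.Events {
			fmt.Printf("%s %q = %q\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
		return // one event is enough for this sketch
	}
}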

Background

Now that I’ve described why we use etcd, let me share a real problem. One of our users configured thousands of pipelines in Easegress, causing its memory usage to spike very high (10+GB) and stay there for a long time. Generally speaking, even with many APIs, you should not need that many pipelines: it is usually reasonable to group a category of APIs into one pipeline by HTTP API prefix, much like location blocks in an nginx configuration, of which there are rarely that many. This was the first time we had seen a scenario with thousands of pipelines.

The problem the user reported: on Easegress 1.4.1 with an HTTP object and 1000 pipelines, memory usage was about 400MB right after startup, 2GB after 80 minutes of running, and 4GB after 200 minutes. During this time, nothing was done and not a single request was sent to Easegress.

After investigation, we found that the memory usage basically all came from etcd. This surprised us: there were not many keys in etcd, so why would it take up 10+GB of memory? At this point one usually suspects a memory leak in etcd, so I searched etcd’s GitHub and found that etcd did have memory leaks in versions 3.2 and 3.3, but both had been fixed.

Easegress uses the latest version, 3.5, and besides, memory leaks are generally not this big. So we started to wonder where we had misused etcd. To find out, there was only one way: sink our teeth into etcd’s design and take a good look at it.

After spending about two days reading etcd’s design, I discovered the following memory-hungry design decisions, which are frankly quite expensive. I’m sharing them here so others can avoid falling into the same traps.

First and foremost: the Raft log. etcd keeps a Raft log, mainly to help followers synchronize data, and the underlying implementation of this log is not a file but memory. On top of that, it keeps at least the 5000 most recent requests. If the values are large, those 5000 entries incur a lot of memory overhead. For example, if you keep updating a 1MB key, even if it is always the same key, those 5000 log entries amount to 5000MB ≈ 5GB of memory. This was raised in etcd’s issue list (issue #12548) but never resolved; the 5000 is still hardcoded and cannot be changed (see the DefaultSnapshotCatchUpEntries source below).

// DefaultSnapshotCatchUpEntries is the number of entries for a slow follower
// to catch-up after compacting the raft storage entries.
// We expect the follower has a millisecond level latency with the leader.
// The max throughput is around 10K. Keep a 5K entries is enough for helping
// follower to catch up.
DefaultSnapshotCatchUpEntries uint64 = 5000

We also found that, in etcd’s history, the official team reduced this default from 10000 to 5000. We suspect they too realized that 10000 was a little too memory-hungry, so they cut it in half, but kept 5000 for fear that followers would otherwise fail to catch up. (Personally, I feel there must be a better way; at the very least, not everything has to live in memory…)
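To make the arithmetic concrete, here is a back-of-the-envelope sketch (our own illustration, not etcd code) of the lower bound that the retained entries place on memory:

package main

import "fmt"

// estimateRaftLogBytes is a rough lower bound for the in-memory raft storage:
// the number of retained entries times the average entry (request) size.
// Real usage is higher, since per-entry metadata is ignored here.
func estimateRaftLogBytes(entryBytes, retainedEntries uint64) uint64 {
	return entryBytes * retainedEntries
}

func main() {
	const MB = 1 << 20
	// Repeatedly updating a single 1MB key: 5000 retained entries ≈ 5GB.
	fmt.Printf("1MB entries: %d MB\n", estimateRaftLogBytes(1*MB, 5000)/MB)
	// The user scenario below: a ~2MB statistics key ≈ 10GB.
	fmt.Printf("2MB entries: %d MB\n", estimateRaftLogBytes(2*MB, 5000)/MB)
}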

The following also cause etcd’s memory to grow:

  1. Index. etcd keeps a B-tree index in memory for its key-value pairs. The overhead of this index is related to key length, and since etcd also keeps historical versions, the B-tree’s memory grows with both key length and the number of historical revisions (demonstrated in the sketch after this list).
  2. mmap. etcd uses mmap, the venerable Unix file-mapping technique, to map its boltdb file into virtual memory, so the larger the db size, the larger the memory.
  3. Watchers. Watches also take up a lot of memory; with many watches and many connections, all of it piles up.

(Obviously, etcd does all this for performance.)
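The historical-versions point in item 1 is easy to see in practice: every Put to the same key creates a new revision, and until a compaction runs, all of them remain indexed and readable. A minimal clientv3 sketch (the key name is hypothetical):

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Three Puts to the same key create three revisions; all of them are
	// tracked by the in-memory B-tree index until a compaction runs.
	var firstRev int64
	for i := 1; i <= 3; i++ {
		resp, err := cli.Put(ctx, "/demo/key", fmt.Sprintf("v%d", i))
		if err != nil {
			log.Fatal(err)
		}
		if i == 1 {
			firstRev = resp.Header.Revision
		}
	}

	// Historical reads still work, proving the old versions are retained.
	old, err := cli.Get(ctx, "/demo/key", clientv3.WithRev(firstRev))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("value at rev %d: %s\n", firstRev, old.Kvs[0].Value)
}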

The problem in Easegress was mostly a Raft log problem; we don’t believe the other three issues were the cause of this user’s problem. For the index and mmap, etcd’s compact and defrag (compaction and defragmentation) should reduce memory, but they are not the core cause here.
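For reference, compaction and defragmentation can be triggered from code as well as from etcdctl; a minimal sketch assuming a single member at localhost:2379:

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Read the current revision, then compact away all older versions,
	// shrinking the in-memory B-tree index.
	status, err := cli.Status(ctx, "localhost:2379")
	if err != nil {
		log.Fatal(err)
	}
	if _, err := cli.Compact(ctx, status.Header.Revision); err != nil {
		log.Fatal(err)
	}

	// Defragment rewrites boltdb so freed pages are actually released,
	// shrinking the mmap'd db file.
	if _, err := cli.Defragment(ctx, "localhost:2379"); err != nil {
		log.Fatal(err)
	}
}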

In the user’s case there were 1000+ pipelines, and Easegress keeps statistics for each pipeline (M1, M5, M15, P99, P90, P50, and so on). Each pipeline’s statistics are about 1KB-2KB, but Easegress combined all 1000 of them and wrote them into a single key, producing a key with an average size of about 2MB; with 5000 in-memory Raft log entries, that comes to roughly 10GB of memory consumed by etcd. There had never been a scenario with so many pipelines before, so this memory problem had not been exposed.

So our final solution was also very simple: we changed our strategy to no longer write such a large value. Instead of one big key, we split the data into multiple smaller keys, bringing down the size of each Raft log entry, and the problem was solved. The related PR is here: PR#542.
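The shape of the fix, sketched with plain clientv3 calls (the key names and the PipelineStats type are illustrative stand-ins, not the actual Easegress code; see PR#542 for that):

package main

import (
	"context"
	"encoding/json"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// PipelineStats is a stand-in for a per-pipeline statistics record (~1-2KB).
type PipelineStats struct {
	Name string  `json:"name"`
	M1   float64 `json:"m1"`
	P99  float64 `json:"p99"`
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	stats := []PipelineStats{{Name: "pipeline-1", M1: 12.3, P99: 45.6}}

	// Before: one huge key holding all pipelines -> every update appends a
	// multi-MB entry to the in-memory raft log.
	//   cli.Put(ctx, "/stats/all", marshalAll(stats))
	//
	// After: one small key per pipeline -> each raft log entry stays at the
	// size of a single record (a few KB).
	for _, s := range stats {
		b, err := json.Marshal(s)
		if err != nil {
			log.Fatal(err)
		}
		if _, err := cli.Put(ctx, "/stats/"+s.Name, string(b)); err != nil {
			log.Fatal(err)
		}
	}
}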

Summary

To use etcd well, follow these practices:

  • Avoid large keys and values: they occupy a lot of memory through the in-memory Raft log, and the multi-version B-tree index also consumes memory because of them.
  • Avoid letting the DB grow too large, and reduce memory with compact and defrag (compaction and defragmentation).
  • Avoid large numbers of watch clients and watches; this overhead is also considerable.
  • Finally, use newer versions whenever possible, of both Go and etcd; you will hit far fewer memory problems. For example, Go had a Linux-kernel-related memory issue: starting with Go 1.12, the runtime released memory with MADV_FREE instead of MADV_DONTNEED. The difference is that with FREE, although the process marks the memory as unneeded, the OS keeps it until memory pressure forces reclamation, whereas with DONTNEED it is reclaimed immediately, which you can see in the resident set size (RSS); so with MADV_FREE a process’s RSS looks inflated even though the memory is reclaimable. Linux’s MADV_FREE implementation also had problems in some cases, so Go 1.16 changed the default back to MADV_DONTNEED (on Go 1.12-1.15 you can force the old behavior with GODEBUG=madvdontneed=1). And etcd 3.4 was compiled with Go 1.12.

Finally, you are welcome to check out our open source projects! https://github.com/megaease/


Reference https://coolshell.cn/articles/22242.html