The company had multiple go-redis clients and multiple versions of the redis cluster. When conducting a business stress test, we found that even if we only access the redis interface, the latency can be as high as a second, which is very counterintuitive.

We use different versions of go-redis and different versions of redis cluster to do a simple stress test. redis commands are simple get, kv size is one byte, here we put the data out first.

Client Version Server Version Average delay qps
v6 5.0 2.58ms 42138.85
v8 5.0 2.29ms 43567.68
v6 6.0 549.22ms 117
v8 6.0 2.48ms 41794.86

You can see that the third line of data exceptions, access to redis is surprisingly 500ms, indicating that if the client and server versions are not on the right will indeed have more trouble.

Let’s analyze this problem.

Stress test environment preparation

Install redis-server, there is nothing to say, don’t use brew for mac, because the brew install doesn’t have the create-cluster script tool. Just download the distribution from the official website and go to the redis root directory on the mac and execute make once.

redis server install

Here we only test server 5.0 and 6.0.

Go to the utils/create-cluster directory of redis and execute the following command directly.

1
2
./create-cluster start
./create-cluster create

The default should be to start three masters and three slaves, and after starting it, you can connect a random instance to get in.

1
2
~ ❯❯❯ redis-cli -c -p 30001
127.0.0.1:30002> set 1 1

Save the kv in advance and create the client with 30001, 30002 and 30003 instances.

The client side should be prepared with go-redis/redis and go-redis/redis/v8 respectively.

v6 version.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
package main

import (
        "fmt"
        "log"
        "net/http"

        _ "net/http/pprof"

        "github.com/go-redis/redis"
)

var cli = redis.NewClusterClient(&redis.ClusterOptions{
        Addrs:        []string{"127.0.0.1:30001", "127.0.0.1:30002", "127.0.0.1:30003"},
})

func sayhello(wr http.ResponseWriter, r *http.Request) {
        s := cli.Get( "1")
        err := s.Err()
        if err != nil {
                fmt.Println(err)
        }
}

func main() {
        http.HandleFunc("/", sayhello)
        err := http.ListenAndServe(":10003", nil)
        if err != nil {
                log.Fatal("ListenAndServe:", err)
        }
}

v8 version.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
package main

import (
        "context"
        "fmt"
        "log"
        "net/http"

        _ "net/http/pprof"

        "github.com/go-redis/redis/v8"
)

var cli = redis.NewClusterClient(&redis.ClusterOptions{
        Addrs:        []string{"127.0.0.1:30001", "127.0.0.1:30002", "127.0.0.1:30003"},
})

func sayhello(wr http.ResponseWriter, r *http.Request) {
        s := cli.Get(context.TODO(), "1")
        err := s.Err()
        if err != nil {
                fmt.Println(err)
        }
}

func main() {
        http.HandleFunc("/", sayhello)
        err := http.ListenAndServe(":10003", nil)
        if err != nil {
                log.Fatal("ListenAndServe:", err)
        }
}

Once started, it is straightforward to use wrk.

1
~ ❯❯❯ wrk -t10 -c100 -d60 http://localhost:10003

The data table at the beginning of this article can be obtained by combining the stress tests on the client and server sides of different versions separately.

Problem Analysis

Why does the v6 go-redis/redis with server 6.0 cluster have such a large latency? First, use pprof to see what goroutine is doing during the stress test.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
goroutine profile: total 205
100 @ 0x1047b2c88 0x1047ac364 0x1047dc484 0x10481f31c 0x10481fe54 0x10481fe45 0x1048a511c 0x1048af4b8 0x104944678 0x1047e1c94
#	0x1047dc483	internal/poll.runtime_pollWait+0xa3		/usr/local/go/src/runtime/netpoll.go:302
#	0x10481f31b	internal/poll.(*pollDesc).wait+0x2b		/usr/local/go/src/internal/poll/fd_poll_runtime.go:83
#	0x10481fe53	internal/poll.(*pollDesc).waitRead+0x1e3	/usr/local/go/src/internal/poll/fd_poll_runtime.go:88
#	0x10481fe44	internal/poll.(*FD).Read+0x1d4			/usr/local/go/src/internal/poll/fd_unix.go:167
#	0x1048a511b	net.(*netFD).Read+0x2b				/usr/local/go/src/net/fd_posix.go:55
#	0x1048af4b7	net.(*conn).Read+0x37				/usr/local/go/src/net/net.go:183
#	0x104944677	net/http.(*connReader).backgroundRead+0x47	/usr/local/go/src/net/http/server.go:672

69 @ 0x1047b2c88 0x1047c43a4 0x1047c4381 0x1047dddc8 0x1047f27a0 0x10499b324 0x10499b2bd 0x1049ab318 0x1049a226c 0x1049a25dc 0x1049a2830 0x1049a2748 0x1049ab744 0x1049b00ec 0x10494b06c 0x10494c744 0x10494d95c 0x10494a05c 0x1047e1c94
#	0x1047dddc7	sync.runtime_SemacquireMutex+0x27				/usr/local/go/src/runtime/sema.go:71
#	0x1047f279f	sync.(*Mutex).lockSlow+0x17f					/usr/local/go/src/sync/mutex.go:162
#	0x10499b323	sync.(*Mutex).Lock+0xa3						/usr/local/go/src/sync/mutex.go:81
#	0x10499b2bc	github.com/go-redis/redis/internal.(*Once).Do+0x3c		/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/internal/once.go:50
#	0x1049ab317	github.com/go-redis/redis.(*cmdsInfoCache).Get+0x47		/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/command.go:1963
#	0x1049a226b	github.com/go-redis/redis.(*ClusterClient).cmdInfo+0x2b		/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/cluster.go:754
#	0x1049a25db	github.com/go-redis/redis.(*ClusterClient).cmdSlotAndNode+0x5b	/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/cluster.go:790
#	0x1049a282f	github.com/go-redis/redis.(*ClusterClient).defaultProcess+0xaf	/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/cluster.go:918
#	0x1049a2747	github.com/go-redis/redis.(*ClusterClient).Process+0x37		/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/cluster.go:905
#	0x1049ab743	github.com/go-redis/redis.(*cmdable).Get+0xf3			/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/commands.go:756
#	0x1049b00eb	main.sayhello+0x3b						/Users/sg00186ml/test/go/redis/goredis/v6.go:25
#	0x10494b06b	net/http.HandlerFunc.ServeHTTP+0x3b				/usr/local/go/src/net/http/server.go:2084
#	0x10494c743	net/http.(*ServeMux).ServeHTTP+0x143				/usr/local/go/src/net/http/server.go:2462
#	0x10494d95b	net/http.serverHandler.ServeHTTP+0x3fb				/usr/local/go/src/net/http/server.go:2916
#	0x10494a05b	net/http.(*conn).serve+0x56b					/usr/local/go/src/net/http/server.go:1966

29 @ 0x1047b2c88 0x1047c43a4 0x1047c4381 0x1047dddc8 0x1047f27a0 0x10499b324 0x10499b2bd 0x1049ab318 0x1049a226c 0x1049a24f8 0x1049a25f0 0x1049a2830 0x1049a2748 0x1049ab744 0x1049b00ec 0x10494b06c 0x10494c744 0x10494d95c 0x10494a05c 0x1047e1c94
#	0x1047dddc7	sync.runtime_SemacquireMutex+0x27				/usr/local/go/src/runtime/sema.go:71
#	0x1047f279f	sync.(*Mutex).lockSlow+0x17f					/usr/local/go/src/sync/mutex.go:162
#	0x10499b323	sync.(*Mutex).Lock+0xa3						/usr/local/go/src/sync/mutex.go:81
#	0x10499b2bc	github.com/go-redis/redis/internal.(*Once).Do+0x3c		/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/internal/once.go:50
#	0x1049ab317	github.com/go-redis/redis.(*cmdsInfoCache).Get+0x47		/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/command.go:1963
#	0x1049a226b	github.com/go-redis/redis.(*ClusterClient).cmdInfo+0x2b		/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/cluster.go:754
#	0x1049a24f7	github.com/go-redis/redis.(*ClusterClient).cmdSlot+0x147	/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/cluster.go:780
#	0x1049a25ef	github.com/go-redis/redis.(*ClusterClient).cmdSlotAndNode+0x6f	/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/cluster.go:791
#	0x1049a282f	github.com/go-redis/redis.(*ClusterClient).defaultProcess+0xaf	/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/cluster.go:918
#	0x1049a2747	github.com/go-redis/redis.(*ClusterClient).Process+0x37		/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/cluster.go:905
#	0x1049ab743	github.com/go-redis/redis.(*cmdable).Get+0xf3			/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/commands.go:756
#	0x1049b00eb	main.sayhello+0x3b						/Users/sg00186ml/test/go/redis/goredis/v6.go:25
#	0x10494b06b	net/http.HandlerFunc.ServeHTTP+0x3b				/usr/local/go/src/net/http/server.go:2084
#	0x10494c743	net/http.(*ServeMux).ServeHTTP+0x143				/usr/local/go/src/net/http/server.go:2462
#	0x10494d95b	net/http.serverHandler.ServeHTTP+0x3fb				/usr/local/go/src/net/http/server.go:2916
#	0x10494a05b	net/http.(*conn).serve+0x56b					/usr/local/go/src/net/http/server.go:1966

Or as I wrote in my previous article, most of the performance problems of the business are in the locks, this time the locks of the base library. We need to read the go-redis code briefly.

When starting a cluster client, each cluster client has an object called cmdsInfoCache.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15

// http://redis.io/topics/cluster-spec.
func NewClusterClient(opt *ClusterOptions) *ClusterClient {
   .....

   c := &ClusterClient{
      opt:   opt,
      nodes: newClusterNodes(opt),
   }
   .....
   c.cmdsInfoCache = newCmdsInfoCache(c.cmdsInfo) // 这里
   ......

   return c
}

This object is initialized the first time any command is executed on this client (our code example here is the Get command for redis).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
// Implementation of the get command
func (c *cmdsInfoCache) Get() (map[string]*CommandInfo, error) {
   err := c.once.Do(func() error {
      cmds, err := c.fn()
      if err != nil {
         return err
      }
      c.cmds = cmds
      return nil
   })
   return c.cmds, err
}

As you can see, the first few lines of each command have to execute this c.once.Do logic. What exactly is this?

go

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
func (c *ClusterClient) cmdsInfo() (map[string]*CommandInfo, error) {
   addrs, err := c.nodes.Addrs() // Get node information, no network interaction
   if err != nil {
      return nil, err
   }

   var firstErr error
   for _, addr := range addrs {
      node, err := c.nodes.Get(addr)
      ......

      info, err := node.Client.Command().Result()
      ......
   }
   return nil, firstErr
}

As soon as a node responds, the result is returned and cached in the c.cmds object.

If the err returned by the function is empty, the cmdsInfoCache is initialized. By the nature of sync.Once, the next once.Do for all commands simply performs an atomic.Load, which is not costly.

However, after a long pressure test, we found that each command still gets stuck on the lock, which means that the err is non-empty.

Simply put, the response to the command command returned by redis server 6.0 to the client will error out in the go-redis/redis v6 version parse.

Refer to this issues: https://github.com/go-redis/redis/pull/1355

Every parse is wrong, so naturally every once.Do will go into slow path. The client of redis cluster is a global common, so the lock here is a global lock and there are slower network calls inside the lock.

From the description in the PR, this fix goes directly to the latest two client versions. As is the practice in most open source community projects, only the latest two versions are officially maintained to avoid overburdening development with too many versions being maintained at the same time.

This is also true for the Go language itself, so if you are not using one of the two latest versions and report some weird problem to the officials, you will usually be advised to upgrade to the new version and try again.