Background

Elastic Cloud online services have long been plagued by cache inconsistency.

Cache inconsistency usually shows up after an upgrade or restart of kube-apiserver. When it occurs, the impact is clearly visible to users, and in severe cases it leads to online service failures. The common failure modes are as follows.

  • Platform data inconsistency: a Pod's status flips back and forth between normal and abnormal.
  • Loss of service management events: when a service changes, service management no longer works correctly, e.g. the service tree is not mounted and traffic is not routed to the service.

Before the root cause was located, Elastic Cloud had developed a number of problem-detection and rapid stop-loss strategies.

  • Problem detection:
    • Manual: when kube-apiserver is upgraded or restarted, manually notify the affected teams so that the platform services are restarted as well
    • Automatic: configure monitoring and alerting policies that fire an alert when no change events for k8s objects have been received for a period of time (a sketch of such a watchdog follows this list)
  • Rapid stop-loss:
    • Restart: when a cache inconsistency occurs, restart the service so that it pulls the latest data from kube-apiserver in full
    • Self-healing: in some scenarios even a restart cannot fully recover the service, so a self-healing policy proactively detects and handles the abnormal cases
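
As a rough illustration of the "no change events for a period of time" alert above, here is a minimal sketch of an event watchdog. This is not the actual Elastic Cloud monitoring code; the type and method names (eventWatchdog, Touch, Run) and the threshold are hypothetical.

// Package monitor sketches an informer "event watchdog": informer event
// handlers call Touch on every Add/Update/Delete, and Run raises an alert
// when nothing has arrived for longer than the configured threshold.
package monitor

import (
    "log"
    "sync"
    "time"
)

type eventWatchdog struct {
    mu        sync.Mutex
    lastEvent time.Time
    threshold time.Duration
}

// Touch records that an event has just been received.
func (w *eventWatchdog) Touch() {
    w.mu.Lock()
    defer w.mu.Unlock()
    w.lastEvent = time.Now()
}

// Run periodically checks how long it has been since the last event.
func (w *eventWatchdog) Run(stopCh <-chan struct{}) {
    ticker := time.NewTicker(time.Minute)
    defer ticker.Stop()
    for {
        select {
        case <-stopCh:
            return
        case <-ticker.C:
            w.mu.Lock()
            idle := time.Since(w.lastEvent)
            w.mu.Unlock()
            if idle > w.threshold {
                // In production this would send an alert message instead of logging.
                log.Printf("no k8s change events for %v, possible cache inconsistency", idle)
            }
        }
    }
}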

Detection and stop-loss strategies do not actually solve the problem; they only restore service in known, deterministic scenarios, and as more abnormal scenarios are discovered the strategies have to be extended in step.

Problem Location

Detection and stop-loss are patches rather than a cure, and we would obviously prefer a complete fix. So let us start from the root cause of the cache inconsistency.

We chose notifier to troubleshoot the problem. Notifier is a collection of controllers for cluster management services; its functionality mainly consists of:

  • Service tree mounts
  • DNS registration
  • LVS traffic attachment, etc.

The reason for choosing notifier is its simplicity: it uses client-go's informer and registers handler functions for core resource events, and it has no complex business logic that would interfere with troubleshooting.
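
For readers less familiar with this pattern, here is a minimal, self-contained sketch of how a client-go shared informer registers handler functions for a core resource (Endpoints). It is illustrative only and is not the notifier source; the kubeconfig path and the empty handler bodies are placeholders.

package main

import (
    "time"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Build a clientset from a kubeconfig (path is a placeholder).
    config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(config)

    // The factory keeps a local cache in sync with kube-apiserver via List+Watch.
    factory := informers.NewSharedInformerFactory(client, 30*time.Second)
    endpointInformer := factory.Core().V1().Endpoints().Informer()

    // Business logic (service tree, DNS, LVS) hangs off handlers like these.
    endpointInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) {
            ep := obj.(*corev1.Endpoints)
            _ = ep // handle the new Endpoints object
        },
        UpdateFunc: func(oldObj, newObj interface{}) { /* handle the change */ },
        DeleteFunc: func(obj interface{}) { /* handle the deletion */ },
    })

    stopCh := make(chan struct{})
    factory.Start(stopCh)
    factory.WaitForCacheSync(stopCh)
    <-stopCh
}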

Problem Reproduction

We tested in our offline environment and found that the problem could be reproduced by restarting the kube-apiserver service, which made troubleshooting much easier. The steps to reproduce the problem are as follows.

  • Start the notifier service
  • Restart the kube-apiserver service

Status Analysis

When the problem occurs, we first do some basic checks on the service status.

# # Service process status
# ps -ef|grep notifier
stupig 1007922 1003335  2 13:41 pts/1    00:00:08 ./notifier -c configs/notifier-test.toml
 
# # Service open file descriptors (FDs)
# lsof -nP -p 1007922
COMMAND       PID       USER   FD      TYPE             DEVICE SIZE/OFF        NODE NAME
nobody    1007922     stupig   0u       CHR              136,1      0t0           4 /dev/pts/1
nobody    1007922     stupig   1u       CHR              136,1      0t0           4 /dev/pts/1
nobody    1007922     stupig   2u       CHR              136,1      0t0           4 /dev/pts/1
nobody    1007922     stupig   3u      unix 0xffff8810a3132400      0t0  4254094659 socket
nobody    1007922     stupig   4u   a_inode                0,9        0        8548 [eventpoll]
nobody    1007922     stupig   5r      FIFO                0,8      0t0  4253939077 pipe
nobody    1007922     stupig   6w      FIFO                0,8      0t0  4253939077 pipe
nobody    1007922     stupig   8u      IPv4         4254094660      0t0         UDP *:37087
nobody    1007922     stupig   9r       CHR                1,9      0t0        2057 /dev/urandom
nobody    1007922     stupig   10u     IPv4         4253939079      0t0         TCP *:4397 (LISTEN)
nobody    1007922     stupig   11u      REG               8,17 12538653  8604570895 ./logs/notifier.stupig.log.INFO.20211127-134138.1007922
nobody    1007922     stupig   15u     IPv4         4254204931      0t0         TCP 127.0.0.1:43566->127.0.0.1:2479 (ESTABLISHED)   # ETCD
nobody    1007922     stupig   19u      REG                8,5   252384         821 /tmp/notifier.stupig.log.ERROR.20211127-134505.1007922
nobody    1007922     stupig   20u      REG                8,5   252384         822 /tmp/notifier.stupig.log.WARNING.20211127-134505.1007922
nobody    1007922     stupig   21u      REG               8,17   414436  8606917935 ./logs/notifier.stupig.log.WARNING.20211127-134139.1007922
nobody    1007922     stupig   24u      REG               8,17   290725  8606917936 ./logs/notifier.stupig.log.ERROR.20211127-134238.1007922
nobody    1007922     stupig   30u      REG                8,5   252384         823 /tmp/notifier.stupig.log.INFO.20211127-134505.1007922

Comparing this with the service status captured before the problem occurred, we found a serious issue: the connection that notifier had established to the kube-apiserver service address https://localhost:6443 had disappeared.

As a result, notifier lost data synchronization with kube-apiserver: it no longer received business change events and eventually lost its ability to manage services.

Log Analysis

Next we analyze notifier's runtime logs, focusing on what notifier printed when kube-apiserver restarted.

The key exception message in those logs is http2: no cached connection was available, and the associated operation is the EndpointInformer's ListAndWatch.

Here we have the key clue, and the next step is to analyze the code to locate the root cause.

Code Analysis

How the Informer works internally is not the focus of this article; we only need to look at the following code snippet.

// Run starts a watch and handles watch events. Will restart the watch if it is closed.
// Run will exit when stopCh is closed.
func (r *Reflector) Run(stopCh <-chan struct{}) {
   glog.V(3).Infof("Starting reflector %v (%s) from %s", r.expectedType, r.resyncPeriod, r.name)
   wait.Until(func() {
      if err := r.ListAndWatch(stopCh); err != nil {
         utilruntime.HandleError(err)
      }
   }, r.period, stopCh)
}

The Informer's Reflector component runs in a separate goroutine and repeatedly calls ListAndWatch to receive notification events from kube-apiserver; when ListAndWatch returns an error, wait.Until simply invokes it again after r.period.
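
The retry behavior of wait.Until matters here. The following standalone sketch (ours, not notifier code) shows that the wrapped function is simply invoked again after each period, which is why a ListAndWatch that keeps failing is retried forever rather than crashing the goroutine.

package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

func main() {
    stopCh := make(chan struct{})
    go func() {
        time.Sleep(3 * time.Second)
        close(stopCh) // stop after a few iterations
    }()

    // wait.Until re-invokes the function every period until stopCh is closed,
    // mirroring how Reflector.Run keeps retrying ListAndWatch after an error.
    wait.Until(func() {
        fmt.Println("ListAndWatch attempt at", time.Now().Format(time.RFC3339))
    }, time.Second, stopCh)
}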

From the log analysis we conclude that, when kube-apiserver is restarted, every ListAndWatch operation in the notifier service returns the http2: no cached connection was available error.

Therefore, we shifted our focus to this error message.

By searching the code, we located where this error is defined and where it is returned.

// file: vendor/golang.org/x/net/http2/transport.go:L301
var ErrNoCachedConn = errors.New("http2: no cached connection was available")
 
// file: vendor/golang.org/x/net/http2/client_conn_pool.go:L55~80
func (p *clientConnPool) getClientConn(req *http.Request, addr string, dialOnMiss bool) (*ClientConn, error) {
   if isConnectionCloseRequest(req) && dialOnMiss {
      // It gets its own connection.
      const singleUse = true
      cc, err := p.t.dialClientConn(addr, singleUse)
      if err != nil {
         return nil, err
      }
      return cc, nil
   }
   p.mu.Lock()
   for _, cc := range p.conns[addr] {
      if cc.CanTakeNewRequest() {
         p.mu.Unlock()
         return cc, nil
      }
   }
   if !dialOnMiss {
      p.mu.Unlock()
      return nil, ErrNoCachedConn
   }
   call := p.getStartDialLocked(addr)
   p.mu.Unlock()
   <-call.done
   return call.res, call.err
}

The above code returns ErrNoCachedConn only when both of the following hold:

  • The dialOnMiss parameter is false
  • No connection in the p.conns pool can take a new request

In theory, when an HTTP request is sent and the connection pool is empty, a new connection should be established first and the request then sent over it; the pool should also evict connections in an abnormal state automatically. So how does the problem described in this article occur at all?

Now we focus on the call chain of the getClientConn method, which has two main call stacks.

Stack one.

0  0x0000000000a590b8 in notifier/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn
    at ./go/src/notifier/vendor/golang.org/x/net/http2/client_conn_pool.go:55
 1  0x0000000000a5aea6 in notifier/vendor/golang.org/x/net/http2.noDialClientConnPool.GetClientConn
    at ./go/src/notifier/vendor/golang.org/x/net/http2/client_conn_pool.go:255
 2  0x0000000000a6c4f9 in notifier/vendor/golang.org/x/net/http2.(*Transport).RoundTripOpt
    at ./go/src/notifier/vendor/golang.org/x/net/http2/transport.go:345
 3  0x0000000000a6bd0e in notifier/vendor/golang.org/x/net/http2.(*Transport).RoundTrip
    at ./go/src/notifier/vendor/golang.org/x/net/http2/transport.go:313
 4  0x0000000000a5b97e in notifier/vendor/golang.org/x/net/http2.noDialH2RoundTripper.RoundTrip
    at ./go/src/notifier/vendor/golang.org/x/net/http2/configure_transport.go:75
 5  0x0000000000828e45 in net/http.(*Transport).roundTrip
    at /usr/local/go1.16.7/src/net/http/transport.go:537
 6  0x00000000008016de in net/http.(*Transport).RoundTrip
    at /usr/local/go1.16.7/src/net/http/roundtrip.go:17
 7  0x00000000016a1ef8 in notifier/vendor/k8s.io/client-go/transport.(*userAgentRoundTripper).RoundTrip
    at ./go/src/notifier/vendor/k8s.io/client-go/transport/round_trippers.go:162
 8  0x00000000007a3aa2 in net/http.send
    at /usr/local/go1.16.7/src/net/http/client.go:251
 9  0x00000000007a324b in net/http.(*Client).send
    at /usr/local/go1.16.7/src/net/http/client.go:175
10  0x00000000007a6ed5 in net/http.(*Client).do
    at /usr/local/go1.16.7/src/net/http/client.go:717
11  0x00000000007a5d9e in net/http.(*Client).Do
    at /usr/local/go1.16.7/src/net/http/client.go:585
12  0x00000000016b9487 in notifier/vendor/k8s.io/client-go/rest.(*Request).request
    at ./go/src/notifier/vendor/k8s.io/client-go/rest/request.go:732
13  0x00000000016b9f2d in notifier/vendor/k8s.io/client-go/rest.(*Request).Do
    at ./go/src/notifier/vendor/k8s.io/client-go/rest/request.go:804
14  0x00000000017093bb in notifier/vendor/k8s.io/client-go/kubernetes/typed/core/v1.(*endpoints).List
    at ./go/src/notifier/vendor/k8s.io/client-go/kubernetes/typed/core/v1/endpoints.go:83
……

Stack two.

0  0x0000000000a590b8 in notifier/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn
    at ./go/src/notifier/vendor/golang.org/x/net/http2/client_conn_pool.go:55
 1  0x0000000000a5aea6 in notifier/vendor/golang.org/x/net/http2.noDialClientConnPool.GetClientConn
    at ./go/src/notifier/vendor/golang.org/x/net/http2/client_conn_pool.go:255
 2  0x0000000000a6c4f9 in notifier/vendor/golang.org/x/net/http2.(*Transport).RoundTripOpt
    at ./go/src/notifier/vendor/golang.org/x/net/http2/transport.go:345
 3  0x0000000000a6bd0e in notifier/vendor/golang.org/x/net/http2.(*Transport).RoundTrip
    at ./go/src/notifier/vendor/golang.org/x/net/http2/transport.go:313
 4  0x00000000008296ed in net/http.(*Transport).roundTrip
    at /usr/local/go1.16.7/src/net/http/transport.go:590
 5  0x00000000008016de in net/http.(*Transport).RoundTrip
    at /usr/local/go1.16.7/src/net/http/roundtrip.go:17
 6  0x00000000016a1ef8 in notifier/vendor/k8s.io/client-go/transport.(*userAgentRoundTripper).RoundTrip
    at ./go/src/notifier/vendor/k8s.io/client-go/transport/round_trippers.go:162
 7  0x00000000007a3aa2 in net/http.send
    at /usr/local/go1.16.7/src/net/http/client.go:251
 8  0x00000000007a324b in net/http.(*Client).send
    at /usr/local/go1.16.7/src/net/http/client.go:175
 9  0x00000000007a6ed5 in net/http.(*Client).do
    at /usr/local/go1.16.7/src/net/http/client.go:717
10  0x00000000007a5d9e in net/http.(*Client).Do
    at /usr/local/go1.16.7/src/net/http/client.go:585
11  0x00000000016b9487 in notifier/vendor/k8s.io/client-go/rest.(*Request).request
    at ./go/src/notifier/vendor/k8s.io/client-go/rest/request.go:732
12  0x00000000016b9f2d in notifier/vendor/k8s.io/client-go/rest.(*Request).Do
    at ./go/src/notifier/vendor/k8s.io/client-go/rest/request.go:804
13  0x00000000017093bb in notifier/vendor/k8s.io/client-go/kubernetes/typed/core/v1.(*endpoints).List
    at ./go/src/notifier/vendor/k8s.io/client-go/kubernetes/typed/core/v1/endpoints.go:83
……

After tracing the two call stacks separately, we can quickly rule out stack one.

// file: net/http/transport.go:L502~620
func (t *Transport) roundTrip(req *Request) (*Response, error) {
   if altRT := t.alternateRoundTripper(req); altRT != nil {               // L537
      if resp, err := altRT.RoundTrip(req); err != ErrSkipAltProtocol {
         return resp, err
      }
      var err error
      req, err = rewindBody(req)
      if err != nil {
         return nil, err
      }
   }
   ……
}
 
// file: vendor/golang.org/x/net/http2/configure_transport.go:L74~80
func (rt noDialH2RoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
   res, err := rt.t.RoundTrip(req)                                        // L75
   if err == ErrNoCachedConn {
      return nil, http.ErrSkipAltProtocol
   }
   return res, err
}
 
// file: vendor/golang.org/x/net/http2/transport.go:L312~314
func (t *Transport) RoundTrip(req *http.Request) (*http.Response, error) {
   return t.RoundTripOpt(req, RoundTripOpt{})                             // L313
}
 
// file: vendor/golang.org/x/net/http2/transport.go:L337~379
func (t *Transport) RoundTripOpt(req *http.Request, opt RoundTripOpt) (*http.Response, error) {
   addr := authorityAddr(req.URL.Scheme, req.URL.Host)
   for retry := 0; ; retry++ {
      cc, err := t.connPool().GetClientConn(req, addr)                    // L345
      if err != nil {
         t.vlogf("http2: Transport failed to get client conn for %s: %v", addr, err)
         return nil, err
      }
   }
}
 
// file: vendor/golang.org/x/net/http2/client_conn_pool.go:L254~256
func (p noDialClientConnPool) GetClientConn(req *http.Request, addr string) (*ClientConn, error) {
   return p.getClientConn(req, addr, noDialOnMiss)                        // L255
}

In stack one, getClientConn returns the ErrNoCachedConn error, which noDialH2RoundTripper.RoundTrip replaces with http.ErrSkipAltProtocol; roundTrip then skips the alternate round tripper, continues with the rest of its normal flow, and eventually reaches the path shown in stack two.

We therefore focus on the stack two path.

// file: net/http/transport.go:L502~620
func (t *Transport) roundTrip(req *Request) (*Response, error) {
   for {
      var resp *Response
      if pconn.alt != nil {
         // HTTP/2 path.
         t.setReqCanceler(cancelKey, nil) // not cancelable with CancelRequest
         resp, err = pconn.alt.RoundTrip(req)                             // L590
      }
      if err == nil {
         resp.Request = origReq
         return resp, nil
      }
 
      // Failed. Clean up and determine whether to retry.
      if http2isNoCachedConnError(err) {
         if t.removeIdleConn(pconn) {
            t.decConnsPerHost(pconn.cacheKey)
         }
      } else if !pconn.shouldRetryRequest(req, err) {
         return nil, err
      }
   }
}
 
// file: vendor/golang.org/x/net/http2/transport.go:L312~314
func (t *Transport) RoundTrip(req *http.Request) (*http.Response, error) {
   return t.RoundTripOpt(req, RoundTripOpt{})                             // L313
}
 
// The inner call stack is the same as in stack one and is not listed again

Unlike stack one, stack two does not convert the returned error: ErrNoCachedConn is passed up directly, and roundTrip's error handling explicitly checks for this kind of error. If http2isNoCachedConnError returns true, the stale connection is removed from the pool and the loop retries the request.

Everything seems to make sense, so how does the problem still occur? The culprit is http2isNoCachedConnError.

// file: net/http/h2_bundle.go:L6922~6928
// isNoCachedConnError reports whether err is of type noCachedConnError
// or its equivalent renamed type in net/http2's h2_bundle.go. Both types
// may coexist in the same running program.
func http2isNoCachedConnError(err error) bool {
   _, ok := err.(interface{ IsHTTP2NoCachedConnError() })
   return ok
}

If err implements the anonymous interface (which declares a single method, IsHTTP2NoCachedConnError), the function returns true; otherwise it returns false.

So, does the error value returned by getClientConn implement that interface? Clearly not: it is a plain value created with errors.New.

// file: vendor/golang.org/x/net/http2/transport.go:L301
var ErrNoCachedConn = errors.New("http2: no cached connection was available")

At this point the root cause is clear: the error returned by our vendored x/net does not implement IsHTTP2NoCachedConnError, so net/http does not recognize it, never evicts the dead connection entry from the pool, and never dials a new one; every subsequent request keeps failing with http2: no cached connection was available.

Solution

Since the problem is caused by the error value ErrNoCachedConn returned by getClientConn not implementing the IsHTTP2NoCachedConnError method, the natural fix is to change the error to a concrete type that implements that interface.

Note that this code comes from an external library that we vendor. Checking the latest golang.org/x/net code, we found that the problem was already fixed back in January 2018; see the golang.org/x/net fix.

// noCachedConnError is the concrete type of ErrNoCachedConn, which
// needs to be detected by net/http regardless of whether it's its
// bundled version (in h2_bundle.go with a rewritten type name) or
// from a user's x/net/http2. As such, as it has a unique method name
// (IsHTTP2NoCachedConnError) that net/http sniffs for via func
// isNoCachedConnError.
type noCachedConnError struct{}

func (noCachedConnError) IsHTTP2NoCachedConnError() {}
func (noCachedConnError) Error() string             { return "http2: no cached connection was available" }

// isNoCachedConnError reports whether err is of type noCachedConnError
// or its equivalent renamed type in net/http2's h2_bundle.go. Both types
// may coexist in the same running program.
func isNoCachedConnError(err error) bool {
    _, ok := err.(interface{ IsHTTP2NoCachedConnError() })
    return ok
}

var ErrNoCachedConn error = noCachedConnError{}
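
As a quick sanity check, the following standalone sketch (written for this article, not part of x/net or net/http) contrasts the old and new error values against the same sniffing logic: only the fixed concrete type satisfies the interface that net/http looks for.

package main

import (
    "errors"
    "fmt"
)

// oldErr mimics the pre-fix definition: a plain value from errors.New.
var oldErr = errors.New("http2: no cached connection was available")

// noCachedConnError mimics the post-fix concrete type with the marker method.
type noCachedConnError struct{}

func (noCachedConnError) IsHTTP2NoCachedConnError() {}
func (noCachedConnError) Error() string {
    return "http2: no cached connection was available"
}

// isNoCachedConnError uses the same type assertion that net/http performs.
func isNoCachedConnError(err error) bool {
    _, ok := err.(interface{ IsHTTP2NoCachedConnError() })
    return ok
}

func main() {
    fmt.Println(isNoCachedConnError(oldErr))              // false: the dead connection is never evicted
    fmt.Println(isNoCachedConnError(noCachedConnError{})) // true: net/http evicts it and retries
}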

The version we are using online is still: 1c05540f6.

Our fix is therefore very simple: upgrade the vendored dependency to a version that includes the fix.

The online notifier service has since been upgraded and rolled out to all data centers, and we have verified that restarting kube-apiserver no longer causes the notifier service to misbehave.