Direct phenomenon: I compiled kube-controller-manager with Go 1.13, and after running it for a while, I found that the controller did not work, and when I checked the logs, I found that it printed “http2: no cached connection was available “.

1
2
3
4
5
6
7
I0328 09:48:59.925056       1 round_trippers.go:383] GET https://10.220.14.10:8443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager
I0328 09:48:59.925085       1 round_trippers.go:390] Request Headers:
I0328 09:48:59.925094       1 round_trippers.go:393]     User-Agent: kube-controller-manager/v1.11.1 (linux/amd64) kubernetes/b1b2997/leader-election
I0328 09:48:59.925102       1 round_trippers.go:393]     Accept: application/vnd.kubernetes.protobuf, */*
I0328 09:48:59.925148       1 round_trippers.go:408] Response Status:  in 0 milliseconds
E0328 09:48:59.925199       1 leaderelection.go:234] error retrieving resource lock kube-system/kube-controller-manager: Get https://10.220.14.10:8443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager: http2: no cached connection was available
I0328 09:48:59.925216       1 leaderelection.go:190] failed to acquire lease kube-system/kube-controller-manager

Using 1.13 will have the above phenomenon. No problem with 1.12. k8s version is 1.11.

From the go code query “no cached connection was available”, we can see that it is an http2noCachedConnError error.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
// noCachedConnError is the concrete type of ErrNoCachedConn, which
// needs to be detected by net/http regardless of whether it's its
// bundled version (in h2_bundle.go with a rewritten type name) or
// from a user's x/net/http2. As such, as it has a unique method name
// (IsHTTP2NoCachedConnError) that net/http sniffs for via func
// isNoCachedConnError.
type http2noCachedConnError struct{}

func (http2noCachedConnError) IsHTTP2NoCachedConnError() {}

func (http2noCachedConnError) Error() string { return "http2: no cached connection was available" }

// isNoCachedConnError reports whether err is of type noCachedConnError
// or its equivalent renamed type in net/http2's h2_bundle.go. Both types
// may coexist in the same running program.
func http2isNoCachedConnError(err error) bool {
    _, ok := err.(interface{ IsHTTP2NoCachedConnError() })
    return ok
}

var http2ErrNoCachedConn error = http2noCachedConnError{}

Since the problem is sure to occur, you may want to use delve trace.

Reproduction method: Configure kube-controller-manage to connect to 1 API Server, shut down the API Server when kube-controller-manag is running normally, then restart the API Server, and check the kube-controller-manager log afterwards. log, it will print “http2: no cached connection was available”.

From the code analysis, http2ErrNoCachedConn is only returned in src/net/http/h2_bundle.go, so it is possible that there is an error in the function getClientConn. Take a look at the Go source code implementation.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
func (p *http2clientConnPool) getClientConn(req *Request, addr string, dialOnMiss bool) (*http2ClientConn, error) {
    if http2isConnectionCloseRequest(req) && dialOnMiss {
        // It gets its own connection.
        http2traceGetConn(req, addr)
        const singleUse = true
        cc, err := p.t.dialClientConn(addr, singleUse)
        if err != nil {
            return nil, err
        }
        return cc, nil
    }
    p.mu.Lock()
    for _, cc := range p.conns[addr] {
        if st := cc.idleState(); st.canTakeNewRequest {
            if p.shouldTraceGetConn(st) {
                http2traceGetConn(req, addr)
            }
            p.mu.Unlock()
            return cc, nil
        }
    }
    if !dialOnMiss {
        p.mu.Unlock()
        return nil, http2ErrNoCachedConn
    }

After dlv attach pid, I tried to breakpoint getClientConn and found that there are 2 getClientConn, one in Go source code as seen above, and the other in the vendor directory of k8s source code. So, which one will be called in the actual run? It doesn’t matter, hit the breakpoint on both.

1
2
3
4
5
6
(dlv) b getClientConn
Command failed: Location "getClientConn" ambiguous: net/http.(*http2clientConnPool).getClientConn, k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn…
(dlv) b net/http.(*http2clientConnPool).getClientConn
Breakpoint 3 (enabled) set at 0x7b6b93 for net/http.(*http2clientConnPool).getClientConn() .usr/local/go/src/net/http/h2_bundle.go:749
(dlv) b k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn
Breakpoint 4 (enabled) set at 0x841a03 for k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn() .export/hubt/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/client_conn_pool.go:55

Continuing the execution, it will enter a breakpoint and you can see that the call is to getClientConn in the vendor directory of the k8s source code.

Stack 1.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
0  0x0000000000851ff3 in k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn
    at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/client_conn_pool.go:55
 1  0x000000000085355e in k8s.io/kubernetes/vendor/golang.org/x/net/http2.noDialClientConnPool.GetClientConn
    at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/client_conn_pool.go:255
 2  0x000000000086d985 in k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*Transport).RoundTripOpt
    at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/transport.go:345
 3  0x0000000000853cee in k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*Transport).RoundTrip
    at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/transport.go:313
 4  0x0000000000853cee in k8s.io/kubernetes/vendor/golang.org/x/net/http2.noDialH2RoundTripper.RoundTrip
    at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/configure_transport.go:75
 5  0x00000000008161aa in net/http.(*Transport).roundTrip
    at .usr/local/go/src/net/http/transport.go:486
 6  0x00000000007fc4d5 in net/http.(*Transport).RoundTrip
    at .usr/local/go/src/net/http/roundtrip.go:17
 7  0x0000000000ea9100 in k8s.io/kubernetes/vendor/k8s.io/client-go/transport.(*userAgentRoundTripper).RoundTrip
    at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/transport/round_trippers.go:162
 8  0x00000000007bfafa in net/http.send
    at .usr/local/go/src/net/http/client.go:250
 9  0x00000000007bf51a in net/http.(*Client).send

Come see it implemented.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
var ErrNoCachedConn = errors.New("http2: no cached connection was available")

func (p *clientConnPool) getClientConn(req *http.Request, addr string, dialOnMiss bool) (*ClientConn, error) {
    if isConnectionCloseRequest(req) && dialOnMiss {
        // It gets its own connection.
        const singleUse = true
        cc, err := p.t.dialClientConn(addr, singleUse)
        if err != nil {
            return nil, err
        }
        return cc, nil
    }
    p.mu.Lock()
    for _, cc := range p.conns[addr] {
        if cc.CanTakeNewRequest() {
            p.mu.Unlock()
            return cc, nil
        }
    }
    if !dialOnMiss {
        p.mu.Unlock()
        return nil, ErrNoCachedConn
    }
    call := p.getStartDialLocked(addr)
    p.mu.Unlock()
    <-call.done
    return call.res, call.err
}

As you can see, ErrNoCachedConn is returned here (from the stack, rt is noDialH2RoundTripper, so the dialOnMiss value is false), and the value of ErrNoCachedConn is indeed the kube-controller-manager print string.

Looking at the previous stack, the error code goes all the way back to RoundTrip in configure_transport.go. Take a look at this piece of code.

1
2
3
4
5
6
7
func (rt noDialH2RoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
    res, err := rt.t.RoundTrip(req)
    if err == ErrNoCachedConn {
        return nil, http.ErrSkipAltProtocol
    }
    return res, err
}

As you can see here, if the returned error code is noDialH2RoundTripper, it will be converted to http.ErrSkipAltProtocol and returned again. The error is not what is printed here.

Continuing to look back up the stack, in the Go source roundTrip, there is indeed special handling for http.ErrSkipAltProtocol, which does not return an error directly for such errors.

Continuing the single-step trace, you can see that ErrNoCachedConn is returned in pconn’s roundTrip, but in the http2isNoCachedConnError judgment, it is not considered to be the same error, does not execute t.removeIdleConn(pconn), and enters pconn.shouldRetryRequest and eventually returned ErrNoCachedConn .

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
// roundTrip implements a RoundTripper over HTTP.
func (t *Transport) roundTrip(req *Request) (*Response, error) {
...
    if t.useRegisteredProtocol(req) {
        altProto, _ := t.altProto.Load().(map[string]RoundTripper)
        if altRT := altProto[scheme]; altRT != nil {
            if resp, err := altRT.RoundTrip(req); err != ErrSkipAltProtocol {
                return resp, err
            }
        }
    }
...
    for {
...
        pconn, err := t.getConn(treq, cm)
        if err != nil {
            t.setReqCanceler(req, nil)
            req.closeBody()
            return nil, err
        }

        var resp *Response
        if pconn.alt != nil {
            // HTTP/2 path.
            t.setReqCanceler(req, nil) // not cancelable with CancelRequest
            resp, err = pconn.alt.RoundTrip(req)
        } else {
            resp, err = pconn.roundTrip(treq)
        }
        if err == nil {
            return resp, nil
        }
        if http2isNoCachedConnError(err) {
            t.removeIdleConn(pconn)
        } else if !pconn.shouldRetryRequest(req, err) {
            // Issue 16465: return underlying net.Conn.Read error from peek,
            // as we've historically done.
            if e, ok := err.(transportReadFromServerError); ok {
                err = e.err
            }
            return nil, err
        }

Put a breakpoint on shouldRetryRequest as well (again, it appears in 2 places).

1
2
b net/http.(*persistConn).shouldRetryRequest
b k8s.io/kubernetes/vendor/golang.org/x/net/http2.shouldRetryRequest

When reproducing it, I see the stack as follows, and print the entry err, which is indeed ErrNoCachedConn.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
 0  0x0000000000816413 in net/http.(*persistConn).shouldRetryRequest
    at .usr/local/go/src/net/http/transport.go:569
 1  0x0000000000815ce9 in net/http.(*Transport).roundTrip
    at .usr/local/go/src/net/http/transport.go:543
 2  0x00000000007fc4d5 in net/http.(*Transport).RoundTrip
    at .usr/local/go/src/net/http/roundtrip.go:17
 3  0x0000000000ea9100 in k8s.io/kubernetes/vendor/k8s.io/client-go/transport.(*userAgentRoundTripper).RoundTrip
    at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/transport/round_trippers.go:162
 4  0x00000000007bfafa in net/http.send
    at .usr/local/go/src/net/http/client.go:250
 5  0x00000000007bf51a in net/http.(*Client).send
(dlv) p err
error(*errors.errorString) *{
    s: "http2: no cached connection was available",}

Let’s take a look at the implementation of shouldRetryRequest.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
func (pc *persistConn) shouldRetryRequest(req *Request, err error) bool {
    if http2isNoCachedConnError(err) {
        // Issue 16582: if the user started a bunch of
        // requests at once, they can all pick the same conn
        // and violate the server's max concurrent streams.
        // Instead, match the HTTP/1 behavior for now and dial
        // again to get a new TCP connection, rather than failing
        // this request.
        return true
    }
...
    return false // conservatively

As you can see, if there is no cached conn, it will return true, which means that it should retry, but if you trace it separately, you will find that there is no retry, so that from the previous code analysis, it will return an error code to http.send, which is the log printed out.

So, as you can see, the problem is http2isNoCachedConnError. The implementation is posted earlier, so I’ll post it again.

1
2
3
4
func http2isNoCachedConnError(err error) bool {
    _, ok := err.(interface{ IsHTTP2NoCachedConnError() })
    return ok
}

Write a small code to check if ErrNoCachedConn satisfies http2isNoCachedConnError (https://play.golang.org/p/fB8ilA6-VPZ).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
package main

import (
    "errors"
    "fmt"
)

type http2noCachedConnError struct{}

func (http2noCachedConnError) IsHTTP2NoCachedConnError() {}

func (http2noCachedConnError) Error() string { return "http2: no cached connection was available" }

// isNoCachedConnError reports whether err is of type noCachedConnError
// or its equivalent renamed type in net/http2's h2_bundle.go. Both types
// may coexist in the same running program.
func http2isNoCachedConnError(err error) bool {
    _, ok := err.(interface{ IsHTTP2NoCachedConnError() })
    return ok
}
func main(){
    var ErrNoCachedConn = errors.New("http2: no cached connection was available")
    res := http2isNoCachedConnError(ErrNoCachedConn)
    fmt.Printf("result 1 %v\n", res)

    var err = http2noCachedConnError{}
    res = http2isNoCachedConnError(err)
    fmt.Printf("result 2 %v\n", res)

    if ErrNoCachedConn.Error() == err.Error() {
        fmt.Printf("same result!")
    }
}

As you can see, although their values are the same, ErrNoCachedConn does not meet the requirements of http2isNoCachedConnError, which also leads to the above error being reported all the time and no reconnection being initiated.

Of course, the solution to this problem is very simple, do not use Go 1.13 version to compile, use Go 1.12 version will not be a problem. The reconnection logic in version 1.12 is different from that in 1.13, and how it differs will be analyzed when we have a chance.

Of course, it is possible to modify the implementation of vendor in the k8s source code to unify the definition of error codes on both sides, so that subsequent versions of Go can be used, but it is better not to change the vendor.

In fact, Go does this in subsequent versions of its implementation, as seen in these two patches.

However, this patch is not very well thought out, and it would be better to change it to be forward compatible with.

1
2
3
4
5
6
7
8
9
func http2isNoCachedConnError(err error) bool {
    _, ok := err.(interface{ IsHTTP2NoCachedConnError() })
    if !ok {
        if err.Error() == http2ErrNoCachedConn.Error() {
            return true
        }
    }
    return ok
}