One of the problems we cannot avoid in development is how to achieve reliable network communication over unreliable networks, and HTTP request retry is a commonly used technique. However, the Go standard library net/http does not provide a retry capability, so this article explains how to implement request retry in Go.

Overview

In general, handling a network communication failure can be broken down into the following steps.

  1. Perceive the error. Different errors are identified by different error codes; in HTTP, the status code can be used to distinguish the type of error.
  2. Retry decision. This step is mainly about reducing unnecessary retries. For example, HTTP 4xx errors usually indicate a client error, so the client should not retry the operation, and some business-defined errors should not be retried either. Judging by such rules effectively cuts down the number of unnecessary retries and improves response speed (a small code sketch of this step follows the list).
  3. Retry policy. The retry policy covers the retry interval, the number of retries, and so on. If the number of retries is too small, it may not cover the short window of failure; if the retries are too many, or the interval too small, a lot of resources (CPU, memory, threads, network) may be wasted. We discuss this below.
  4. Hedging strategy. Hedging refers to actively sending multiple requests for a single call without waiting for a response, and taking the first response that comes back. The concept comes from gRPC, and I borrow it here.
  5. Circuit breaking & fallback. If retrying still does not help, the failure is not transient but long lasting. The service can then trip the circuit breaker and fall back: subsequent requests are no longer retried, a downgrade path is taken to reduce unnecessary requests, and normal requests resume after the server recovers. There are many implementations of this, such as go-zero, sentinel, and hystrix-go, which are also quite interesting.
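
As a small illustration of the retry-decision step (step 2), a helper along these lines can decide whether an attempt is worth repeating. The shouldRetry name and the exact rules here are assumptions for illustration, not a standard API; it assumes net/http is imported.

// shouldRetry is an illustrative retry-decision helper (not part of net/http).
func shouldRetry(resp *http.Response, err error) bool {
    // network-level errors (timeouts, connection resets) are usually transient, so retry
    if err != nil {
        return true
    }
    // 4xx means a client error: sending the same request again will not help
    if resp.StatusCode >= 400 && resp.StatusCode < 500 {
        return false
    }
    // 5xx server errors may be transient, so they are worth retrying
    return resp.StatusCode >= 500
}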

Retry strategy

Retry strategies come in many flavors. On the one hand, we have to consider how long a delayed request the business can tolerate; on the other hand, we have to consider the extra load that retries put on the downstream service. In short, it is a trade-off.

So a retry algorithm generally adds a gap between attempts; interested readers can also have a look at this article. Combining our own usual practice with the algorithms in that article, we can roughly summarize the following rules.

  • Linear interval (Linear Backoff): each retry happens after a fixed interval, e.g. retry once every 1s.
  • Linear interval + random time (Linear Jitter Backoff): if every client uses the same fixed interval, many requests may arrive at the same moment, so we add a random time that fluctuates by some percentage on top of the linear interval.
  • Exponential interval (Exponential Backoff): the interval grows exponentially, e.g. wait 3s, 9s, 27s before successive retries.
  • Exponential interval + random time (Exponential Jitter Backoff): similar to the second one, adding a random fluctuation on top of the exponentially growing interval.

The latter two strategies add jitter in order to prevent the Thundering Herd Problem; a code sketch of all four strategies follows the quote below.

In computer science, the thundering herd problem occurs when a large number of processes or threads waiting for an event are awoken when that event occurs, but only one process is able to handle the event. When the processes wake up, they will each try to handle the event, but only one will win. All processes will compete for resources, possibly freezing the computer, until the herd is calmed down again
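
To make these strategies concrete, here is a minimal sketch in Go, assuming the math/rand and time packages are imported. The retryDo and hystrix-go examples later in this article take a BackoffStrategy and reference ExponentialBackoff; the other function names and the exact intervals here are illustrative assumptions, not a fixed API.

// BackoffStrategy maps the attempt number (starting at 1) to the wait time before the next attempt.
type BackoffStrategy func(attempt int) time.Duration

// LinearBackoff: a fixed interval between retries, e.g. 1s every time.
func LinearBackoff(attempt int) time.Duration {
    return 1 * time.Second
}

// LinearJitterBackoff: the fixed interval plus a random fluctuation of up to ±50%.
func LinearJitterBackoff(attempt int) time.Duration {
    base := float64(time.Second)
    jitter := (rand.Float64() - 0.5) * base
    return time.Duration(base + jitter)
}

// ExponentialBackoff: the interval grows exponentially, e.g. 3s, 9s, 27s.
func ExponentialBackoff(attempt int) time.Duration {
    d := 3 * time.Second
    for i := 1; i < attempt; i++ {
        d *= 3
    }
    return d
}

// ExponentialJitterBackoff: the exponential interval plus a random fluctuation of up to +50%.
func ExponentialJitterBackoff(attempt int) time.Duration {
    base := float64(ExponentialBackoff(attempt))
    return time.Duration(base + rand.Float64()*0.5*base)
}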

Problems caused by using net/http retries

In Go, the retry cannot be done simply by wrapping the request in a for loop for the desired number of attempts. A GET request has no request body, so it can be retried directly, but for a POST request the request body has to be put into a Reader, as follows.

req, _ := http.NewRequest("POST", "localhost", strings.NewReader("hello"))

When the server receives the request, it calls Read() on the Reader to consume the data. Reading normally advances the offset, and the next read continues from that offset. So if we simply retry with the same request, the Reader has already been drained and nothing more can be read from it.
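
A tiny self-contained illustration of this offset behavior, independent of HTTP: reading the same strings.Reader twice yields nothing the second time.

package main

import (
    "fmt"
    "io"
    "strings"
)

func main() {
    r := strings.NewReader("hello")
    first, _ := io.ReadAll(r)  // reads "hello" and moves the offset to the end
    second, _ := io.ReadAll(r) // the offset is already at the end, so this reads nothing
    fmt.Printf("first: %q, second: %q\n", first, second) // first: "hello", second: ""
}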

Now let's look at a full HTTP example.

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "strings"
    "time"
)

func main() {
    go func() {
        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            time.Sleep(time.Millisecond * 20)
            body, _ := ioutil.ReadAll(r.Body)
            fmt.Printf("received body with length %v containing: %v\n", len(body), string(body))
            w.WriteHeader(http.StatusOK)
        })
        http.ListenAndServe(":8090", nil)
    }()
    fmt.Print("Try with bare strings.Reader\n")
    retryDo()
}

func retryDo() {
    originalBody := []byte("abcdefghigklmnopqrst")
    reader := strings.NewReader(string(originalBody))
    req, _ := http.NewRequest("POST", "http://localhost:8090/", reader)
    client := http.Client{
        Timeout: time.Millisecond * 10,
    }

    for {
        _, err := client.Do(req)
        if err != nil {
            fmt.Printf("error sending the first time: %v\n", err)
        }
        time.Sleep(time.Second)
    }
}

// output:
error sending the first time: Post "http://localhost:8090/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
error sending the first time: Post "http://localhost:8090/": http: ContentLength=20 with Body length 0
error sending the first time: Post "http://localhost:8090/": http: ContentLength=20 with Body length 0
received body with length 20 containing: abcdefghigklmnopqrst
error sending the first time: Post "http://localhost:8090/": http: ContentLength=20 with Body length 0
....

In the above example, a 10ms timeout is set on the client side. On the server side, we simulate slow request handling by sleeping for 20ms before reading the request data, so the client request is guaranteed to time out.

When the request is sent again, we find that the client's request Body is not the 20 bytes we expected but 0, which results in an error. Therefore, the Body Reader needs to be reset, as follows.

func resetBody(request *http.Request, originalBody []byte) {
    request.Body = io.NopCloser(bytes.NewBuffer(originalBody))
    request.GetBody = func() (io.ReadCloser, error) {
        return io.NopCloser(bytes.NewBuffer(originalBody)), nil
    }
}

In the code above, we use io.NopCloser to reset the request Body, and also set GetBody, so that the next request does not run into unexpected errors.

Then, compared with the crude example above, we can improve it by adding the status-code check, retry policy, and retry count mentioned earlier. It can be written like this.

func retryDo(req *http.Request, maxRetries int, timeout time.Duration,
    backoffStrategy BackoffStrategy) (*http.Response, error) {
    var (
        originalBody []byte
        err          error
    )
    if req != nil && req.Body != nil {
        originalBody, err = copyBody(req.Body)
        resetBody(req, originalBody)
    }
    if err != nil {
        return nil, err
    }
    AttemptLimit := maxRetries
    if AttemptLimit <= 0 {
        AttemptLimit = 1
    }

    client := http.Client{
        Timeout: timeout,
    }
    var resp *http.Response
    // number of retries
    for i := 1; i <= AttemptLimit; i++ {
        resp, err = client.Do(req)
        if err != nil {
            fmt.Printf("error sending attempt %v: %v\n", i, err)
        }
        // only status codes of 500 and above are retried
        if err == nil && resp.StatusCode < 500 {
            return resp, err
        }
        // if we are going to retry, close the body to release the fd
        if resp != nil {
            resp.Body.Close()
        }
        // reset the body
        if req.Body != nil {
            resetBody(req, originalBody)
        }
        time.Sleep(backoffStrategy(i) + 1*time.Microsecond)
    }
    // reaching here means the retries did not help; return the last response and error
    return resp, err
}

func copyBody(src io.ReadCloser) ([]byte, error) {
    b, err := ioutil.ReadAll(src)
    if err != nil {
        return nil, err
    }
    src.Close()
    return b, nil
}

func resetBody(request *http.Request, originalBody []byte) {
    request.Body = io.NopCloser(bytes.NewBuffer(originalBody))
    request.GetBody = func() (io.ReadCloser, error) {
        return io.NopCloser(bytes.NewBuffer(originalBody)), nil
    }
}

Hedging strategy

The above covers the concept of retrying. Sometimes our interface only fails occasionally, and the downstream service does not mind receiving a request more than once; in that case we can borrow a concept from gRPC: hedged requests.

Hedging means actively sending multiple requests for a single call without waiting for a response, and taking the first response that comes back. The main difference between hedging and retrying is that hedging sends another request directly if no response has been received after a specified time, while retrying waits for the server's response before sending the next request. So hedging can be seen as a more aggressive retry strategy.

One thing to note when using hedging: because the downstream service may sit behind a load-balancing policy, the downstream interface generally needs to be idempotent, so that multiple concurrent requests are safe and still produce the expected result.

Concurrent mode processing

Because hedged retries add concurrency, goroutines are used to send the requests concurrently, and we can push the results into a channel for asynchronous processing.

And since multiple goroutines are processing at once, we also need to return an error when every goroutine has finished but failed, so it is important that the main flow does not block forever waiting on the channel.

However, Go does not hand back each goroutine's result directly, and here we only care about the first successful result and want to ignore failures, so we combine the channel with a WaitGroup for flow control. An example follows.

package main

import (
    "fmt"
    "math/rand"
    "sync"
    "time"
)

func main() {
    totalSentRequests := &sync.WaitGroup{}
    allRequestsBackCh := make(chan struct{})
    multiplexCh := make(chan struct {
        result string
        retry  int
    })
    go func() {
        // close allRequestsBackCh once all requests have finished
        totalSentRequests.Wait()
        close(allRequestsBackCh)
    }()
    for i := 1; i <= 10; i++ {
        totalSentRequests.Add(1)
        go func(i int) {
            // mark this goroutine as done
            defer totalSentRequests.Done()
            // simulate a time-consuming operation
            time.Sleep(500 * time.Microsecond)
            // simulate a successful result
            if rand.Intn(500)%2 == 0 {
                multiplexCh <- struct {
                    result string
                    retry  int
                }{"finish success", i}
            }
            // failures are ignored here; they could also be sent to an error channel for further handling
        }(i)
    }
    select {
    case <-multiplexCh:
        fmt.Println("finish success")
    case <-allRequestsBackCh:
        // reaching here means every goroutine finished, but all requests failed
        fmt.Println("all req finish,but all fail")
    }
}

As the code above shows, flow control needs an extra channel allRequestsBackCh, a WaitGroup totalSentRequests, plus another goroutine to close allRequestsBackCh asynchronously, just to coordinate the goroutines. It is rather cumbersome; if any reader has a cleaner implementation, I would be happy to discuss it.

Besides the concurrent request control above, note one more thing for hedged retries: since the requests are no longer serial, each attempt needs its own http.Request, so the request (together with its context) must be cloned before every attempt to keep them independent. However, Clone does not give each copy its own Body: the clones share the same Reader, so once one copy has been read the offset has moved and the others read nothing, which means we have to reset the Body again.

func main() {
    req, _ := http.NewRequest("POST", "localhost", strings.NewReader("hello"))
    req2 := req.Clone(req.Context())
    contents, _ := io.ReadAll(req.Body)
    contents2, _ := io.ReadAll(req2.Body)
    fmt.Printf("First read: %v\n", string(contents))
    fmt.Printf("Second read: %v\n", string(contents2))
}

//output:
First read: hello
Second read: 

So combining the above examples, we can change the code for hedging retries to:

func retryHedged(req *http.Request, maxRetries int, timeout time.Duration,
    backoffStrategy BackoffStrategy) (*http.Response, error) {
    var (
        originalBody []byte
        err          error
    )
    if req != nil && req.Body != nil {
        originalBody, err = copyBody(req.Body)
    }
    if err != nil {
        return nil, err
    }

    AttemptLimit := maxRetries
    if AttemptLimit <= 0 {
        AttemptLimit = 1
    }

    client := http.Client{
        Timeout: timeout,
    }

    // copy a fresh request for every attempt
    copyRequest := func() (request *http.Request) {
        request = req.Clone(req.Context())
        if request.Body != nil {
            resetBody(request, originalBody)
        }
        return
    }

    multiplexCh := make(chan struct {
        resp  *http.Response
        err   error
        retry int
    })

    totalSentRequests := &sync.WaitGroup{}
    allRequestsBackCh := make(chan struct{})
    go func() {
        totalSentRequests.Wait()
        close(allRequestsBackCh)
    }()
    for i := 1; i <= AttemptLimit; i++ {
        totalSentRequests.Add(1)
        go func(i int) {
            // mark this goroutine as done
            defer totalSentRequests.Done()
            request := copyRequest()
            resp, err := client.Do(request)
            if err != nil {
                fmt.Printf("error sending attempt %v: %v\n", i, err)
            }
            // only report success for status codes below 500
            if err == nil && resp.StatusCode < 500 {
                multiplexCh <- struct {
                    resp  *http.Response
                    err   error
                    retry int
                }{resp: resp, err: err, retry: i}
                return
            }
            // this attempt failed: close the body to release the fd
            if resp != nil {
                resp.Body.Close()
            }
            // reset the body
            if request.Body != nil {
                resetBody(request, originalBody)
            }
            time.Sleep(backoffStrategy(i) + 1*time.Microsecond)
        }(i)
    }

    select {
    case res := <-multiplexCh:
        return res.resp, res.err
    case <-allRequestsBackCh:
        // reaching here means all goroutines finished, but every request failed
        return nil, errors.New("all req finish,but all fail")
    }
}

Circuit Breaker & Fallback

When we make HTTP calls, the external services invoked are often unreliable. Problems in an external service can make our own interface calls wait, leading to long call times and a large backlog of calls, which slowly exhausts service resources and eventually causes an avalanche of service failures. So circuit breaking and fallback are very necessary in a service.

In fact, circuit breaking and degradation are implemented in broadly similar ways everywhere. The core idea is to use a global counter to track the number of calls and the number of successes/failures. The breaker is modeled with three states: Closed, Open, and Half-Open; the following diagram, borrowed from sentinel, shows the relationship between the three.

[Diagram: circuit breaker state transitions between Closed, Open, and Half-Open]

The initial state is Closed. Every call is counted toward the total and the success/failure counters; once a certain threshold or condition is reached, the breaker switches to the Open state and requests are rejected.

A breaker timeout (retry-after interval) is configured in the breaker rule; once it elapses, the breaker moves to the Half-Open state. In this state sentinel sends periodic probe requests, while go-zero lets a certain percentage of requests through. Whether by active probing or by passing real requests, as long as a request returns normally, the counters are reset and the breaker goes back to Closed (a minimal sketch of this state machine follows the policy list below).

In general, two breaking policies are supported.

  • Error rate: the error rate of the requests within the breaker's time window (given a minimum number of requests) exceeds the error-rate threshold, which trips the breaker.
  • Average RT (response time): the average response time of the requests within the breaker's time window exceeds the RT threshold, which trips the breaker.
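
To tie the three states and the error-rate policy together, here is a minimal, illustrative breaker sketch, assuming the sync and time packages are imported. The SimpleBreaker type, its field names, and the fixed-window counting are assumptions for illustration; real implementations such as sentinel, go-zero, and hystrix-go use sliding windows, limited probes, and scheduled resets.

type State int

const (
    Closed State = iota
    Open
    HalfOpen
)

type SimpleBreaker struct {
    mu          sync.Mutex
    state       State
    total       int           // calls counted in the current window
    failures    int           // failed calls counted in the current window
    minRequests int           // minimum calls before the error rate is evaluated
    threshold   float64       // error-rate threshold, e.g. 0.3
    openUntil   time.Time     // when the Open state may move to Half-Open
    cooldown    time.Duration // how long to stay Open before probing
}

// Allow reports whether the call may go through.
func (b *SimpleBreaker) Allow() bool {
    b.mu.Lock()
    defer b.mu.Unlock()
    switch b.state {
    case Open:
        // after the cooldown, let a probe request through (Half-Open)
        if time.Now().After(b.openUntil) {
            b.state = HalfOpen
            return true
        }
        return false
    default: // Closed or Half-Open: requests are allowed
        return true
    }
}

// Report records the outcome of a call and updates the state.
func (b *SimpleBreaker) Report(success bool) {
    b.mu.Lock()
    defer b.mu.Unlock()
    if success {
        if b.state == HalfOpen {
            // the probe succeeded: reset the counters and close the breaker again
            b.state, b.total, b.failures = Closed, 0, 0
            return
        }
        b.total++
        return
    }
    b.total++
    b.failures++
    // the probe failed, or the error rate crossed the threshold: open the breaker
    if b.state == HalfOpen ||
        (b.total >= b.minRequests && float64(b.failures)/float64(b.total) > b.threshold) {
        b.state = Open
        b.openUntil = time.Now().Add(b.cooldown)
    }
}

A caller would check Allow() before sending the request and call Report() with the outcome; when Allow() returns false, the fallback path is taken immediately instead of calling the downstream service.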

For example, we can use hystrix-go to handle circuit breaking for our service interface, combined with the retries we mentioned above, to further protect our service.

hystrix.ConfigureCommand("my_service", hystrix.CommandConfig{
    ErrorPercentThreshold: 30,
})
_ = hystrix.Do("my_service", func() error {
    req, _ := http.NewRequest("POST", "http://localhost:8090/", strings.NewReader("test"))
    _, err := retryDo(req, 5, 20*time.Millisecond, ExponentialBackoff)
    if err != nil {
        fmt.Printf("get error: %v\n", err)
        return err
    }
    return nil
}, func(err error) error {
    fmt.Printf("handle error: %v\n", err)
    return nil
})

The above example uses hystrix-go and sets the error percentage threshold to 30; above that, the circuit breaker trips and subsequent calls go to the fallback function.

Summary

This article starts from the points to consider when retrying interface calls and explains several retry strategies; in the practical part, it shows the problems you run into when retrying with net/http directly; for the hedging strategy, it uses a channel plus a WaitGroup to control the concurrent requests; finally, it uses hystrix-go to break on a faulty service and prevent requests from piling up and exhausting resources.