One problem we cannot avoid in development is achieving reliable communication over unreliable networks, and retrying HTTP requests is a commonly used technique. However, Go's standard library net/http has no built-in retry support, so this article explains how to implement request retries in Go.
In general, handling a network communication failure breaks down into the following steps.
- Perceive the error. Identify different errors by their error codes; in HTTP, the status code distinguishes the error types.
- Retry decision. This step mainly reduces unnecessary retries: HTTP 4xx errors usually indicate a client error, so the client should not retry them, and some business-defined errors should not be retried either. Applying these rules effectively cuts pointless retries and improves response times.
- Retry policy. The retry policy covers the retry interval, the retry count, and so on. Too few retries may fail to cover a short failure window, while too many retries, or too small an interval, can waste a lot of resources (CPU, memory, threads, network). We discuss this below.
- Hedging strategy. Hedging means proactively sending multiple requests for a single call without waiting for a response, then taking the first response that comes back. The concept comes from gRPC, and I borrow it here.
- Circuit breaking & fallback. If the call still fails after retrying, the failure is not transient but long-lived. The service can then break the circuit and fall back: subsequent requests are not retried, degraded handling reduces unnecessary load, and normal requests resume once the server recovers. There are many implementations of this, such as go-zero, sentinel, and hystrix-go, all quite interesting.
Retry strategies come in many flavors. On one hand we must consider how long a delayed request the business can tolerate; on the other hand we must consider the extra load retries put on the downstream service. In short, it is a trade-off.
For the retry algorithm, the usual approach is to add a gap between retries; interested readers can also look at this article. Combining common practice with the algorithms in that article, we can summarize the following rules.
- Linear interval (Linear Backoff): retry at a fixed interval, e.g. once every 1s.
- Linear interval + random time (Linear Jitter Backoff): if every client retries at the same fixed interval, many requests may land at the same moment, so we add a random fluctuation of some percentage of the interval on top of the linear time.
- Exponential interval (Exponential Backoff): the interval grows exponentially with each retry, e.g. waiting 3s, 9s, 27s before successive attempts.
- Exponential interval + random time (Exponential Jitter Backoff): like the previous strategy, but with a random fluctuation added on top of the exponentially growing interval.
The two jittered strategies add randomness in order to prevent the thundering herd problem.
In computer science, the thundering herd problem occurs when a large number of processes or threads waiting for an event are awoken when that event occurs, but only one process is able to handle the event. When the processes wake up, they will each try to handle the event, but only one will win. All processes will compete for resources, possibly freezing the computer, until the herd is calmed down again.
Problems with retrying via net/http directly
In Go, retries cannot be implemented by simply wrapping the call in a for loop over the attempt count. A GET request carries no body, so it can be retried directly; but a POST request's body must be placed in a Reader, as follows.
When the server receives the request, it calls the Read() function to pull data from that Reader. Reading usually advances the reader's offset, and the next Read continues from the new offset position. So if we retry directly with the same Reader, the second attempt will have no data left to read.
Let's look at an example first.
In the example above, the client sets a 10ms timeout. On the server we simulate slow request handling by sleeping 20ms before reading the request data, so the request is guaranteed to time out.
When the request is made again, we find that the client's Body is not the 20 bytes we expected but 0, producing an err. The Body Reader therefore needs to be reset, as follows.
In the code above, we use io.NopCloser to reset the request's Body, avoiding unexpected errors on the next request.
Then, building on the bare-bones example above, we can add the StatusCode retry check, the retry policy, and the retry count mentioned earlier, giving something like this.
So much for plain retries. Sometimes an interface only fails occasionally, and the downstream service doesn't mind receiving a call more than once, so we can borrow a concept from gRPC: hedged requests.
Hedging means proactively sending multiple requests for a single call without waiting for a response, then taking the first response that returns. The main difference from retrying is that hedging fires the next request as soon as a specified time passes without a response, whereas retrying waits for the server's (failed) response before sending again. Hedging is therefore a more aggressive retry strategy.
One caveat when hedging: because the downstream may sit behind load balancing, the downstream service generally needs to be idempotent, so that multiple concurrent requests are safe and still produce the expected result.
Handling concurrent requests
Since hedged retries introduce concurrency, we use goroutines to issue the requests and push the results into a channel for asynchronous processing.
And since multiple goroutines process the messages, we must return an err once every goroutine has finished without a success, taking care not to block the main flow forever on a channel wait.
However, a goroutine cannot return its result directly, and we only care about the first correct result while ignoring the errors, so we pair the channels with a WaitGroup for flow control. An example follows.
As the code above shows, flow control costs two extra channels, totalSentRequests and allRequestsBackCh, plus an extra goroutine that asynchronously closes allRequestsBackCh. That is a lot of machinery just for flow control; if anyone has a cleaner implementation, I would be happy to discuss it.
Besides the concurrency control above, note that for hedged retries the http.Request's context becomes an issue because the requests are no longer serial: clone the request (and its context) before each attempt so that every request stays independent. After each clone, however, the Body's Reader offset has still advanced, so we must also reset the Body before each request.
So, combining the examples above, the code for hedged retries becomes:
Circuit Breaker & Fallback
When we make HTTP calls, the external services involved are often unreliable. Problems in an external service can leave our own interface calls waiting, driving up call times and building a large backlog that slowly exhausts our resources until the whole call chain avalanches. Using circuit breaking with fallback in the service is therefore essential.
The implementations of circuit breaking and degradation are broadly similar. The core idea is a global counter tracking the number of calls and the successes/failures, with the breaker moving between three states: closed, open, and half-open. The diagram below, borrowed from sentinel, shows how the three relate.
The initial state is closed. Each call updates the counters for the total and for successes/failures; once a certain threshold or condition is reached, the breaker switches to the open state and requests are rejected.
The breaker rule also configures a breaker timeout; once it elapses, the breaker moves to half-open. In this state, sentinel launches periodic probes, while go-zero lets a certain percentage of requests through. Whether by active probing or by passing real requests, as soon as a request returns normally, the counters are reset and the breaker returns to the closed state.
In general, two breaking policies are supported.
- Error rate: the error rate of requests within the breaking time window exceeds the error-rate threshold, tripping the breaker.
- Average RT (response time): the average response time of requests within the window exceeds the RT threshold, tripping the breaker.
For example, we can use hystrix-go to break on failures of our service interface, combining it with the retries we discussed above to further protect our service.
The example above uses hystrix-go with a maximum error percentage of 30; above that, the breaker trips.
This article explored retries from the angle of interface calls and explained several retry strategies; in the practice sections it showed what goes wrong when retrying directly with net/http, and for the hedging strategy it implemented concurrent request control with channels plus a WaitGroup; finally, it used hystrix-go to break on a faulty service so requests don't pile up and exhaust resources.