
One of the most important problems that the Linkerd service mesh solves is observability: providing a detailed view of service behavior. Linkerd’s value proposition here is that it can report the golden metrics for your HTTP and gRPC services automatically, without code changes or developer involvement.

Out of the box, Linkerd provides these metrics on a per-service basis: they cover all requests to a service, regardless of what those requests are. Sometimes, however, we need more granular metrics. For example, the metrics that Linkerd reported for the Emoji microservice of the Emojivoto application in the previous section are aggregated across all endpoints of that service. In real-world scenarios we often want the success rate or latency of specific endpoints, because one endpoint may be particularly critical to the service, or particularly slow.

To solve this problem, Linkerd uses the concept of a service profile, which is an optional configuration that informs Linkerd about the different types of requests for a service, categorized by their routes. A route is simply an endpoint (for gRPC) or an HTTP verb and endpoint (for HTTP).

Service profiles allow Linkerd to report per-route metrics for a service instead of only per-service metrics. In later sections we will also see that service profiles let Linkerd apply per-route configuration such as retries and timeouts, but for now let’s focus on metrics.

It is also important to note that service profiles are not required for a service to run with Linkerd; they are optional bits of configuration that enable more advanced behavior, and they are one of the few examples of Linkerd using a Kubernetes CRD. As before, we will illustrate service profiles with the Emojivoto application.
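As a quick optional check (a minimal sketch, assuming the Linkerd control plane is already installed), you can confirm that the ServiceProfile CRD is registered in your cluster before creating any profiles:

# Optional sanity check: the ServiceProfile CRD should exist once Linkerd is installed.
$ kubectl get crd serviceprofiles.linkerd.io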

Generating a service profile

Linkerd’s service profile is implemented by instantiating a Kubernetes CRD named ServiceProfile. The ServiceProfile will enumerate the routes that Linkerd expects for that service.

We can create ServiceProfiles manually, but they can also be generated automatically: the Linkerd CLI has a profile command that can generate service profiles in a few different ways. One of the most common is to generate them from artifacts the service already has, such as an OpenAPI/Swagger specification or a protobuf definition.

$ linkerd profile -h
Output service profile config for Kubernetes.

Usage:
  linkerd profile [flags] (--template | --open-api file | --proto file) (SERVICE)

Examples:
  # Output a basic template to apply after modification.
  linkerd profile -n emojivoto --template web-svc

  # Generate a profile from an OpenAPI specification.
  linkerd profile -n emojivoto --open-api web-svc.swagger web-svc

  # Generate a profile from a protobuf definition.
  linkerd profile -n emojivoto --proto Voting.proto vote-svc


Flags:
  -h, --help               help for profile
      --ignore-cluster     Output a service profile through offline generation
  -n, --namespace string   Namespace of the service
      --open-api string    Output a service profile based on the given OpenAPI spec file
      --proto string       Output a service profile based on the given Protobuf spec file
      --template           Output a service profile template

Global Flags:
      --api-addr string            Override kubeconfig and communicate directly with the control plane at host:port (mostly for testing)
      --as string                  Username to impersonate for Kubernetes operations
      --as-group stringArray       Group to impersonate for Kubernetes operations
      --cni-namespace string       Namespace in which the Linkerd CNI plugin is installed (default "linkerd-cni")
      --context string             Name of the kubeconfig context to use
      --kubeconfig string          Path to the kubeconfig file to use for CLI requests
  -L, --linkerd-namespace string   Namespace in which Linkerd is installed ($LINKERD_NAMESPACE) (default "linkerd")
      --verbose                    Turn on debug logging

The help output above lists the flags that can be used to generate YAML for a ServiceProfile resource. Among them, the --open-api flag generates a service profile from the specified OpenAPI (Swagger) document, and the --proto flag generates one from the specified protobuf file.

The web service for Emojivoto has a simple Swagger specification file that reads as follows.

# web.swagger
openapi: 3.0.1
version: v10
paths:
  /api/list:
    get: {}
  /api/vote:
    get: {}

Now we can use the above specification file to generate a ServiceProfile object with the following command.

$ linkerd profile -n emojivoto --open-api web.swagger web-svc > web-sp.yaml

The above command outputs the ServiceProfile resource manifest for the web-svc service. Note that, just like linkerd install, the linkerd profile command only generates YAML; it does not apply it to the cluster, which is why we redirect the output to the web-sp.yaml file. The generated file looks like this.

apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  creationTimestamp: null
  name: web-svc.emojivoto.svc.cluster.local
  namespace: emojivoto
spec:
  routes:
    - condition:
        method: GET
        pathRegex: /api/list
      name: GET /api/list
    - condition:
        method: GET
        pathRegex: /api/vote
      name: GET /api/vote

The manifest above is a typical declaration of a ServiceProfile object: spec.routes declares all of the routing rules, and each route contains a name and a condition attribute.

  • name is the display name of the route.
  • condition describes how requests are matched to the route. The condition generated in the example above has two fields:
    • method: the HTTP method the request must use.
    • pathRegex: the regular expression used to match the path. In our example these happen to be exact matches, but they can be arbitrary regular expressions (see the hypothetical example below).
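For example, a route whose path contains a parameter would normally rely on a real regular expression rather than an exact match. The route below is purely hypothetical (it is not part of Emojivoto) and only illustrates the shape of such an entry:

# Hypothetical route: matches GET /api/items/1, /api/items/42, and so on.
- condition:
    method: GET
    pathRegex: /api/items/\d+
  name: GET /api/items/{id}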

In addition to being generated from an OpenAPI document, service profiles can also be generated from a protobuf definition. gRPC uses protobuf to encode and decode requests and responses, which means every gRPC service has a protobuf definition.

The Voting microservice for the Emojivoto application contains a protobuf definition, the contents of which are shown below.

syntax = "proto3";
option go_package = "github.com/buoyantio/emojivoto/proto";

package emojivoto.v1;

message VotingResult {
    string Shortcode = 1;
    int32 Votes = 2;
}

message VoteRequest {
}

message VoteResponse {
}

message ResultsRequest {
}

message ResultsResponse {
    repeated VotingResult results = 1;
}

service VotingService {
    rpc VotePoop (VoteRequest) returns (VoteResponse);
    rpc VoteJoy (VoteRequest) returns (VoteResponse);
    rpc VoteSunglasses (VoteRequest) returns (VoteResponse);
    // ... remaining RPCs omitted
}

We can now use the linkerd profile command again to generate the corresponding ServiceProfile object.

$ linkerd profile -n emojivoto --proto Voting.proto voting-svc > voting-sp.yaml

The output of this command is much longer than that of the previous one because Voting.proto defines many more routes. Take a look at voting-sp.yaml and compare it with the ServiceProfile resource created above.

There is also a way to generate a ServiceProfile dynamically: Linkerd can watch live requests over a specified time period and collect the route data from them. This is very useful for cluster administrators who do not know the routes a service exposes internally, since Linkerd takes care of observing the requests and writing the result to a file. Under the hood this is built on the Tap feature for watching requests in real time that we covered in the previous section.

Now let’s monitor the emoji service for 10 seconds and redirect the output to a file. As with the commands we used earlier, the output is simply printed to the terminal, so we redirect it to a file; we only have to run the command (and wait 10 seconds) once.

The Linkerd Viz extension has its own profile subcommand that uses Tap to generate service profiles from live traffic, as shown below.

$ linkerd viz profile emoji-svc --tap deploy/emoji --tap-duration 10s -n emojivoto > emoji-sp.yaml

As requests flow to the emoji service, the routes are collected via Tap. We could also use the Emoji.proto file to generate a service profile and then compare the ServiceProfile resources produced by the --proto and --tap approaches.
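If you want to make that comparison yourself, one possible approach (assuming the Emoji.proto file is available locally) is to generate a second profile from the protobuf definition and diff the two files:

# Assumes Emoji.proto is in the current directory; the file names are illustrative.
$ linkerd profile -n emojivoto --proto Emoji.proto emoji-svc > emoji-sp-proto.yaml
$ diff emoji-sp-proto.yaml emoji-sp.yaml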

The contents of the generated ServiceProfile resource manifest file are shown below.

# emoji-sp.yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  creationTimestamp: null
  name: emoji-svc.emojivoto.svc.cluster.local
  namespace: emojivoto
spec:
  routes:
    - condition:
        method: POST
        pathRegex: /emojivoto\.v1\.EmojiService/FindByShortcode
      name: POST /emojivoto.v1.EmojiService/FindByShortcode
    - condition:
        method: POST
        pathRegex: /emojivoto\.v1\.EmojiService/ListAll
      name: POST /emojivoto.v1.EmojiService/ListAll

If you need to write a service profile by hand, the linkerd profile command has a --template flag that generates a scaffold for the ServiceProfile resource, which you can then fill in with your service’s details.

$ linkerd profile --template -n emojivoto emoji-svc
### ServiceProfile for emoji-svc.emojivoto ###
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: emoji-svc.emojivoto.svc.cluster.local
  namespace: emojivoto
spec:
  # A service profile defines a list of routes.  Linkerd can aggregate metrics
  # like request volume, latency, and success rate by route.
  routes:
  - name: '/authors/{id}'

    # Each route must define a condition.  All requests that match the
    # condition will be counted as belonging to that route.  If a request
    # matches more than one route, the first match wins.
    condition:
      # The simplest condition is a path regular expression.
      pathRegex: '/authors/\d+'

      # This is a condition that checks the request method.
      method: POST

      # If more than one condition field is set, all of them must be satisfied.
      # This is equivalent to using the 'all' condition:
      # all:
      # - pathRegex: '/authors/\d+'
      # - method: POST

      # Conditions can be combined using 'all', 'any', and 'not'.
      # any:
      # - all:
      #   - method: POST
      #   - pathRegex: '/authors/\d+'
      # - all:
      #   - not:
      #       method: DELETE
      #   - pathRegex: /info.txt

    # A route may be marked as retryable.  This indicates that requests to this
    # route are always safe to retry and will cause the proxy to retry failed
    # requests on this route whenever possible.
    # isRetryable: true

    # A route may optionally define a list of response classes which describe
    # how responses from this route will be classified.
    responseClasses:

    # Each response class must define a condition.  All responses from this
    # route that match the condition will be classified as this response class.
    - condition:
        # The simplest condition is a HTTP status code range.
        status:
          min: 500
          max: 599

        # Specifying only one of min or max matches just that one status code.
        # status:
        #   min: 404 # This matches 404s only.

        # Conditions can be combined using 'all', 'any', and 'not'.
        # all:
        # - status:
        #     min: 500
        #     max: 599
        # - not:
        #     status:
        #       min: 503

      # The response class defines whether responses should be counted as
      # successes or failures.
      isFailure: true

    # A route can define a request timeout.  Any requests to this route that
    # exceed the timeout will be canceled.  If unspecified, the default timeout
    # is '10s' (ten seconds).
    # timeout: 250ms

  # A service profile can also define a retry budget.  This specifies the
  # maximum total number of retries that should be sent to this service as a
  # ratio of the original request volume.
  # retryBudget:
  #   The retryRatio is the maximum ratio of retries requests to original
  #   requests.  A retryRatio of 0.2 means that retries may add at most an
  #   additional 20% to the request load.
  #   retryRatio: 0.2

  #   This is an allowance of retries per second in addition to those allowed
  #   by the retryRatio.  This allows retries to be performed, when the request
  #   rate is very low.
  #   minRetriesPerSecond: 10

  #   This duration indicates for how long requests should be considered for the
  #   purposes of calculating the retryRatio.  A higher value considers a larger
  #   window and therefore allows burstier retries.
  #   ttl: 10s

The command above outputs the fields of a ServiceProfile resource together with a detailed description of each one, so the --template flag also doubles as a quick reference for the ServiceProfile definition.

Now that we know how to generate ServiceProfile manifests, let’s look at the metrics for each route defined in a service profile.

Viewing Per-Route Metrics in the Linkerd Dashboard

Above we learned how to use the linkerd profile command to generate ServiceProfile manifests; now let’s rerun those commands and apply the generated ServiceProfile objects directly to the cluster.

$ linkerd profile -n emojivoto --open-api web.swagger web-svc | kubectl apply -f -
serviceprofile.linkerd.io/web-svc.emojivoto.svc.cluster.local created
$ linkerd profile -n emojivoto --proto Voting.proto voting-svc | kubectl apply -f -
serviceprofile.linkerd.io/voting-svc.emojivoto.svc.cluster.local created
$ linkerd viz profile emoji-svc --tap deploy/emoji --tap-duration 10s -n emojivoto | kubectl apply -f -
serviceprofile.linkerd.io/emoji-svc.emojivoto.svc.cluster.local created
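Before opening the dashboard, you can optionally confirm that the three ServiceProfile objects now exist in the namespace with a plain kubectl listing:

$ kubectl get serviceprofiles -n emojivoto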

Once these commands have run successfully, open the Linkerd dashboard to see the metrics, starting with the web Deployment and its per-route metrics.

Routing Metrics in Linkerd Dashboard

In the figure above, the ROUTE METRICS tab now shows two routes in addition to the [DEFAULT] route: the two routes we generated from the web.swagger file with linkerd profile --open-api, each with its own metrics. Before deploying the ServiceProfile we could only see aggregated metrics for the web service; now we can see that the /api/list route is 100% successful while the /api/vote route has some errors.

Before the service profile we only knew that the web service was returning errors; now we know those errors come from the /api/vote route. The extra [DEFAULT] route is the route Linkerd uses when no route in the service profile matches a request, and it also catches any traffic observed before the ServiceProfile was applied.

Now let’s look at another service profile. The voting service is more interesting because it has many routes. In the Linkerd dashboard, go to the voting Deployment in the emojivoto namespace and switch to the ROUTE METRICS tab. We already know from the web service that the /api/vote route has a success rate below 100%, and one of the voting routes should reveal the underlying errors. Since there are so many routes, sort them in ascending order by the Success Rate column; you should see that the VoteDoughnut route has a success rate of 0%.

Voting Service Routing

Viewing Per-Route Metrics via Linkerd CLI

We’ve seen how to view per-route metrics for the Emojivoto services in the dashboard; now let’s do the same with the CLI.

The linkerd viz routes command exposes the same metrics as the dashboard; let’s use it to view the routes of the emoji service, as shown below.

$ linkerd viz routes deploy/emoji -n emojivoto
ROUTE                                               SERVICE   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99
POST /emojivoto.v1.EmojiService/FindByShortcode   emoji-svc   100.00%   1.0rps           1ms           1ms           1ms
POST /emojivoto.v1.EmojiService/ListAll           emoji-svc   100.00%   1.0rps           1ms           1ms           1ms
[DEFAULT]                                         emoji-svc         -        -             -             -             -

We can see that both routes defined for the emoji service succeed and process requests within 1ms, which shows that these routes are healthy. Note also the route marked [DEFAULT]: again, this is the route Linkerd uses when no route in the service profile matches the request.

Now we use the linkerd viz routes command in the same way to see the routes for the voting and web services, as follows.

$ linkerd viz routes deploy/web -n emojivoto
ROUTE           SERVICE   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99
GET /api/list   web-svc   100.00%   1.0rps           1ms           1ms           1ms
GET /api/vote   web-svc    88.33%   1.0rps           2ms           7ms           9ms
[DEFAULT]       web-svc         -        -             -             -             -
$ linkerd viz routes deploy/voting -n emojivoto
ROUTE                             SERVICE   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99
Results                        voting-svc         -        -             -             -             -
Vote100                        voting-svc   100.00%   0.0rps           1ms           1ms           1ms
VoteBacon                      voting-svc         -        -             -             -             -
VoteBalloon                    voting-svc         -        -             -             -             -
# ......

The output for the voting service is long because it has many routes, so you can find the VoteDoughnut route with grep: linkerd viz routes deploy/voting -n emojivoto | grep VoteDoughnut (or use the -o json flag and a tool like jq to parse the output).

At this point we have covered Linkerd’s service profile functionality from the observability angle: service profiles let us view the golden metrics for each route. Next we will dive further into ServiceProfiles and explore Linkerd’s retry and timeout features.

Retries and Timeouts

Next, we’ll look at how to configure retries and timeouts using ServiceProfiles. Linkerd improves reliability through traffic splitting, load balancing, retries, and timeouts, each of which plays an important role in the overall reliability of the system.

But what kind of reliability do these features actually add? It comes down to this: Linkerd can help mask transient failures. If a service is completely down, always returns a failure, or always hangs, then no amount of retrying or load balancing will help. But if only one instance of the service is failing, or if the problem is temporary, that is when Linkerd comes in handy, and these partial, temporary failures are the most common kind of problem in distributed systems.

There are two important things to understand about Linkerd’s core reliability features of load balancing, retries, and timeouts (traffic splitting is also a reliability feature, though to a lesser degree, and we will cover it in a later section).

  1. All of these techniques are applied on the client side: the proxy making the call is the one performing these functions, not the server-side proxy. If your server is in the mesh but your client is not, these features will not apply to calls between the two.
  2. These three features work best together. Without retries, timeouts are of little value; and without load balancing, retries are almost worthless.

Let’s start with load balancing. Linkerd automatically balances requests across the possible destinations. Note the word requests: unlike Layer 4 (TCP) load balancing, which balances connections, Linkerd establishes connections to the set of possible endpoints and balances individual requests across all of those connections. This allows much finer-grained control over traffic. Behind the scenes Linkerd’s approach is quite sophisticated: it uses an exponentially weighted moving average (EWMA) of endpoint latency to decide where to send requests, pools connections across endpoints where possible, and automatically upgrades HTTP/1.1 traffic to HTTP/2 connections between proxies. From our perspective, however, there is nothing to configure; we just need to know that Linkerd automatically balances requests across a service’s endpoints.

Next, look at timeouts, which set a maximum time on a route. Say you have a route named getValue() on your service, and the performance of getValue() is unpredictable: most of the time it returns within 10 milliseconds, but sometimes it takes 10 minutes, perhaps because of lock contention on some contested resource. Without a timeout, calling getValue() can take anywhere from 10 milliseconds to 10 minutes; with a timeout of 500 milliseconds, a call to getValue() takes at most 500 milliseconds.
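Expressed as a ServiceProfile route, the hypothetical getValue() example could be capped at 500 milliseconds with a timeout field. The path and name below are made up purely for illustration:

# Hypothetical route for the getValue() example; not part of Emojivoto.
- condition:
    method: GET
    pathRegex: /getValue
  name: GET /getValue
  timeout: 500ms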

A timeout limits the system’s worst-case latency: the caller of getValue() gets more predictable performance and does not tie up resources waiting for a 10-minute call to return. Second, and more importantly, a timeout can be combined with retries and load balancing to automatically resend the request to a different instance of the service. If getValue() is slow on instance A, it may be fast on instance B or C, and even multiple retries will take far less time than waiting 10 minutes.

Finally, let’s look at retries, where Linkerd automatically retries a request. When is this useful? Again, with temporary errors: if a specific route on a specific instance returns an error, simply retrying the request may produce a successful response. Of course, retrying does not guarantee success: if the underlying error is not temporary, or if Linkerd is unable to retry, the application will still need to handle the failure.

In practice, retries can be tricky to get right. Applied naively, they add load to the system, and that extra load can make things worse. A common failure scenario is the retry storm: a transient failure in service A triggers retries from B on its requests; these retries overload A, so more of B’s requests fail; that in turn triggers retries from B’s caller C, which overloads B, and so on. This is especially problematic in systems that configure a maximum number of retries per request: a maximum of 3 retries per request may not sound like much, but in the worst case it increases the load by 300%.

Linkerd minimizes the possibility of retry storms by parameterizing retries with a retry budget rather than a per-request limit. The retry budget is the percentage of total request volume that may be retried; Linkerd’s default is to allow retries for up to 20% of requests, plus an additional 10 retries per second. For example, if the original request load is 1000 requests per second, Linkerd will allow up to 210 retries per second. When the budget is exhausted, Linkerd does not retry the request and the failure is returned to the client.

In summary: load balancing, retries, and timeouts are all designed to protect application reliability in the event of partial, temporary failures and to prevent those failures from escalating into global outages. But they are not a panacea, and we must still be able to handle errors in our applications.

Using Per-Route Metrics to Determine When to Retry and Timeout

We learned above about the reasons for using retries and timeouts in Linkerd, so let’s build on the observability features we learned about earlier and use metrics to make decisions about applying retries and timeouts.

First, we’ll look at the statistics of all Deployments in the emojivoto namespace, and then we’ll dive into unhealthy services. This is done directly using the linkerd viz stat command, as follows.

$ linkerd viz stat deploy -n emojivoto
NAME       MESHED   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   TCP_CONN
emoji         1/1   100.00%   2.3rps           1ms           1ms           4ms          3
vote-bot      1/1   100.00%   0.3rps           1ms           2ms           2ms          1
voting        1/1    87.01%   1.3rps           1ms           7ms           9ms          3
web           1/1    91.91%   2.3rps           1ms          10ms          28ms          3

The stat command shows us the golden metrics, including success rate and latency. Notice that the success rates of the voting and web services are below 100%. Next we can use the linkerd viz edges command to understand the relationships between the services.

$ linkerd viz edges deploy -n emojivoto
SRC          DST        SRC_NS        DST_NS      SECURED
vote-bot     web        emojivoto     emojivoto   √
web          emoji      emojivoto     emojivoto   √
web          voting     emojivoto     emojivoto   √
prometheus   emoji      linkerd-viz   emojivoto   √
prometheus   vote-bot   linkerd-viz   emojivoto   √
prometheus   voting     linkerd-viz   emojivoto   √
prometheus   web        linkerd-viz   emojivoto   √

Of course, you can also use the Linkerd dashboard’s octopus graph to understand the relationships between services.

octopus diagram

From these results we can see that the web service’s Pods call the voting service’s Pods, so we can guess that the voting service is causing the errors in the web service. But that is not the end of the story: we have per-route metrics and should be able to see exactly which routes have a success rate below 100%. Let’s look at the voting service’s route metrics with the linkerd viz routes command, as follows.

$ linkerd viz routes deploy/voting -n emojivoto
# ......
VoteDog                        voting-svc         -        -             -             -             -
VoteDoughnut                   voting-svc     0.00%   0.1rps           1ms           1ms           1ms
VoteFax                        voting-svc         -        -             -             -             -
VoteFire                       voting-svc   100.00%   0.0rps           1ms           1ms           1ms
VoteFlightDeparture            voting-svc         -        -             -             -             -
# ......

We can also view the ROUTE METRICS information for the voting service through the Linkerd Dashboard, as shown below.

Linkerd Dashboard

We can finally pinpoint VoteDoughnut as the failing route. We now know not only that errors occur between the web and voting services, but also that they occur on the VoteDoughnut route. Next we can use retries to mitigate the errors while asking the developers to debug the code.

Configuring Retries

Before configuring retries for the VoteDoughnut route, we should take a closer look at the web and voting service metrics, because that will help us understand whether retries will actually help. Here we will use only the Linkerd CLI, because with the -o wide flag it can show us both the actual and the effective request volume and success rates. The Linkerd dashboard shows the overall success rate and RPS, but not the actual versus effective breakdown. The difference between the two is as follows.

  • Actual values are measured from the perspective of the server receiving the requests.
  • Effective values are measured from the perspective of the client sending the requests.

Without retries and timeouts, the two figures are obviously the same. Once retries or timeouts are configured, they can diverge. For example, retries can make the actual RPS higher than the effective RPS, because from the server’s perspective a retry is another request, while from the client’s perspective it is the same request. Retries can also make the actual success rate lower than the effective success rate, because failed retry attempts still hit the server but are never exposed to the client. A timeout can have the opposite effect: a request that times out but eventually completes successfully can make the actual success rate higher than the effective success rate, because the server counts it as a success while the client only sees a failure.

The overall point is that Linkerd’s actual and effective metrics can differ when retries or timeouts are in play: the actual numbers represent what actually hit the server, while the effective numbers represent what the client effectively experienced after Linkerd’s reliability logic had done its work.
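As a purely hypothetical illustration of how the two views can diverge (the numbers below are invented, not taken from Emojivoto): suppose a client sends 100 requests, 20 of them fail, each failed request is retried once, and 15 of the retries succeed.

actual    (server view):   (80 + 15) successes / (100 + 20) requests ≈ 79%
effective (client view):   (80 + 15) successes / 100 requests        = 95%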

For example, let’s check the route metrics of the vote-bot deployment with the following command.

$ linkerd viz routes deploy/vote-bot -n emojivoto --to deploy/web -owide
ROUTE           SERVICE   EFFECTIVE_SUCCESS   EFFECTIVE_RPS   ACTUAL_SUCCESS   ACTUAL_RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99
GET /api/list   web-svc             100.00%          1.0rps          100.00%       1.0rps           1ms           2ms           2ms
GET /api/vote   web-svc              86.21%          1.0rps           86.21%       1.0rps           2ms           7ms          10ms
[DEFAULT]       web-svc                   -               -                -            -             -             -             -

In the command above we added the -o wide flag so that the output contains both actual and effective success rates and RPS. Looking at the vote-bot deployment, the effective and actual success rates for the web service’s /api/vote route are identical and both below 100%, because we have not configured retries yet. Also note that not every request can be retried; Linkerd is quite specific about which requests are retryable:

  • Currently, requests that use the HTTP POST method are not retryable in Linkerd. Because POST requests almost always carry data in the request body, retrying them would require the proxy to buffer that data in memory. To keep memory usage to a minimum, the proxy does not store POST request bodies, so they cannot be retried.

  • As mentioned earlier, by default Linkerd only treats 5XX status codes in responses as errors; both 2XX and 4XX are counted as successes. A 4XX status code (such as 404) means the server handled the request but the client asked for something it could not satisfy, which is correct server behavior, while a 5XX status code means the server itself failed while processing the request. (The sketch after this list shows how that default classification could be changed for a route.)
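If a service does need some non-5XX status to count as a failure, the --template output we saw earlier shows that a route can carry responseClasses. A minimal sketch for the hypothetical case of treating 404 responses on /api/vote as failures might look like this:

# Hypothetical: classify 404 responses on this route as failures.
- condition:
    method: GET
    pathRegex: /api/vote
  name: GET /api/vote
  responseClasses:
    - condition:
        status:
          min: 404   # min alone matches exactly this status code
      isFailure: true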

Now let’s put this into practice by adding retries to the web service’s /api/vote route. Here is the web service’s ServiceProfile object again.

# web-sp.yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: web-svc.emojivoto.svc.cluster.local
  namespace: emojivoto
spec:
  routes:
    - condition:
        method: GET
        pathRegex: /api/list
      name: GET /api/list
    - condition:
        method: GET
        pathRegex: /api/vote
      name: GET /api/vote

Next we add an attribute isRetryable: true to the route /api/vote, as follows.

# web-sp-with-retry.yaml
# ......
- condition:
    method: GET
    pathRegex: /api/vote
  name: GET /api/vote
  isRetryable: true

Reapply the ServiceProfile object after updating.

$ kubectl apply -f web-sp-with-retry.yaml

After applying it, we can observe how the vote-bot service’s route metrics change.

$ watch linkerd viz routes deploy/vote-bot -n emojivoto --to deploy/web -o wide
Every 2.0s: linkerd viz routes deploy/vote-bot -n emojivoto --to deploy/web -o wide                   MBP2022.local: Sun Aug 28 17:41:37 2022

ROUTE           SERVICE   EFFECTIVE_SUCCESS   EFFECTIVE_RPS   ACTUAL_SUCCESS   ACTUAL_RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99
GET /api/list   web-svc             100.00%          1.0rps          100.00%       1.0rps           2ms           8ms          10ms
GET /api/vote   web-svc              85.00%          1.0rps           16.50%       5.0rps           4ms         255ms         290ms
[DEFAULT]       web-svc                   -               -                -            -             -             -             -

You can see that the actual success rate drops sharply, because many of the retried requests still fail, while the effective success rate the client sees stays roughly the same. As mentioned above, Linkerd’s retry behavior is governed by a retry budget; when isRetryable: true is set, the default retry budget is applied. The following ServiceProfile fragment shows the retryBudget configuration with its default values.

spec:
  retryBudget:
    retryRatio: 0.2
    minRetriesPerSecond: 10
    ttl: 10s

The retryBudget object has three main fields:

  • retryRatio: the maximum ratio of retry requests to original requests. A retryRatio of 0.2 means retries may add at most 20% to the request load.
  • minRetriesPerSecond: the number of retries per second allowed in addition to those allowed by retryRatio (this permits retries even when the request rate is very low). The default is 10 RPS.
  • ttl: the time window over which requests are considered when calculating the retry ratio; a larger value considers a larger window and therefore allows burstier retries. The default is 10 seconds. (A sketch of overriding these defaults follows below.)
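If the defaults are too aggressive or too conservative for a given service, the budget can be set explicitly alongside the routes. The values below are only an illustration of the syntax, not a recommendation:

# Hypothetical override of the default retry budget in web-svc's ServiceProfile.
spec:
  retryBudget:
    retryRatio: 0.1          # retries may add at most 10% extra load
    minRetriesPerSecond: 5   # small fixed allowance for low-traffic periods
    ttl: 30s                 # window used when computing the ratio
  routes:
    - condition:
        method: GET
        pathRegex: /api/vote
      name: GET /api/vote
      isRetryable: true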

Configuring Timeouts

In addition to retries and the retry budget, Linkerd provides a timeout feature that ensures requests for a given route never take longer than a specified amount of time.

To illustrate this, let’s take a fresh look at the route metrics for the web and voting services.

$ linkerd viz routes deploy/vote-bot -n emojivoto --to deploy/web -o wide
$ linkerd viz routes deploy/web -n emojivoto --to deploy/voting -o wide

We already know that the latency between web and voting is around 1ms. To demonstrate timeouts, we will set the /api/vote route’s timeout to 0.5ms, which essentially no request can meet. Timeouts will occur, Linkerd will return errors to the client, and the success rate will drop to 0.

Modify the ServiceProfile object of the web service and add the timeout property, as follows.

# web-sp-with-timeout.yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: web-svc.emojivoto.svc.cluster.local
  namespace: emojivoto
spec:
  routes:
    - condition:
        method: GET
        pathRegex: /api/list
      name: GET /api/list
    - condition:
        method: GET
        pathRegex: /api/vote
      name: GET /api/vote
      timeout: 0.5ms
      isRetryable: true

After applying this object, both the effective and actual success rates drop to 0, because requests to /api/vote cannot complete within 0.5ms.

$ watch linkerd viz routes deploy/vote-bot -n emojivoto --to deploy/web -o wide
Every 2.0s: linkerd viz routes deploy/vote-bot -n emojivoto --to deploy/web -o wide                   MBP2022.local: Sun Aug 28 19:12:15 2022

ROUTE           SERVICE   EFFECTIVE_SUCCESS   EFFECTIVE_RPS   ACTUAL_SUCCESS   ACTUAL_RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99
GET /api/list   web-svc             100.00%          1.0rps          100.00%       1.0rps           2ms           2ms           2ms
GET /api/vote   web-svc               0.00%          1.0rps            0.00%       0.0rps           0ms           0ms           0ms
[DEFAULT]       web-svc                   -               -                -            -             -             -             -

At this point we understand how retries and timeouts work in Linkerd; they are part of Linkerd’s overall approach to increasing reliability in the face of transient failures. We use the per-route metrics enabled by service profiles to decide when and how to configure retries and timeouts, and Linkerd parameterizes its retries with a retry budget to avoid retry storms: when the budget is exhausted, the proxy stops retrying so that the service is not overloaded and the rest of the application is not affected.