Implementing Progressive Release with Argo Rollouts

Argo Rollouts is a Kubernetes operator that provides advanced deployment capabilities for Kubernetes, such as blue-green and canary updates, canary analysis, experimentation, and progressive delivery. It enables automated, GitOps-based progressive delivery for cloud-native applications and services.

The following features are supported.

Blue-green update strategy
Canary update strategy
Fine-grained, weighted traffic splitting
Automatic rollback
Manual judgment
Customizable metric queries and business KPI analysis
Ingress controller integration: NGINX, ALB
Service mesh integration: Istio, Linkerd, SMI
Metric provider integration: Prometheus, Wavefront, Kayenta, Web, Kubernetes Jobs, Datadog, New Relic

Implementation Principle

Similar to the Deployment object, the Argo Rollouts controller manages the creation, scaling, and deletion of ReplicaSets. These ReplicaSets are defined by the spec.template field of the Rollout resource, which uses the same pod template as the Deployment object.

When spec.template changes, it signals to the Argo Rollouts controller that a new ReplicaSet will be introduced. The controller uses the strategy set in the spec.strategy field to determine how the rollout progresses from the old ReplicaSet to the new one. Once the new ReplicaSet is scaled up (and, optionally, passes an Analysis), the controller marks it as stable.

If spec.template changes again during the transition from the stable ReplicaSet to the new ReplicaSet (i.e., the application version is changed in the middle of a release), the previous new ReplicaSet is scaled down and the controller tries to roll out the ReplicaSet that reflects the updated spec.template field.

Before we continue let’s understand some basic concepts.

Rollout

A Rollout is a Kubernetes custom resource (CRD) that is equivalent to a Kubernetes Deployment object. It is intended to replace the Deployment object where more advanced deployment or progressive delivery capabilities are needed. A Rollout provides the following features that a Kubernetes Deployment does not:

Blue-green deployments
Canary deployments
Integration with Ingress controllers and service meshes for advanced traffic routing
Integration with metric providers for blue-green and canary analysis
Automated promotion or rollback based on successful or failed metrics

Progressive Delivery

Progressive delivery is the process of releasing product updates in a controlled and gradual manner, thereby reducing the risk of a release. It typically combines automation and metric analysis to drive the automated promotion or rollback of an update.

Progressive delivery is often described as an evolution of continuous delivery that extends the speed benefits of CI/CD to the deployment process. A new release is first limited to a subset of users, its behavior is observed and analyzed, and then more traffic is gradually shifted to it while its correctness is continuously verified.

Deployment Strategies

While the industry uses a consistent set of terms for deployment strategies, their implementation often varies from tool to tool. To clarify how Argo Rollouts behaves, the strategies are described below.

RollingUpdate: Gradually replaces the old version with the new one. As pods of the new version become available, pods of the old version are scaled down so that the total number of running pods is maintained. This is the default strategy for Deployment objects.
Recreate: Removes the old version of the application before starting the new one. This ensures that the two versions never run at the same time, but it causes downtime during the deployment.
Blue-Green: A blue-green release (sometimes called red-black) runs the old and new versions of the application side by side. During this time only the old version receives production traffic, which lets developers test the new version before live traffic is switched to it.
Canary: A canary release exposes a subset of users to the new version of the application while the rest of the traffic is still served by the old version. Once the new version is verified to be correct, it can gradually replace the old version. Ingress controllers and service meshes, such as NGINX Ingress and Istio, allow more sophisticated canary traffic splitting than is natively available (for example, very fine-grained traffic splitting, or splitting based on HTTP headers).

For example, a canary can have two phases that send 10% and then 33% of the traffic to the new version. With Argo Rollouts we can define the exact number of phases and the traffic percentage of each to match our needs.

Architecture

The components involved in a deployment managed by Argo Rollouts are described below.

Rollout Controller

This is the main controller. It watches cluster events and reacts whenever a resource of type Rollout changes. The controller reads all the details of the Rollout and brings the cluster to the state described in the Rollout definition.

Note that Argo Rollouts will not tamper with or respond to any changes that occur on normal Deployment resources, which means you can install Argo Rollouts in a cluster that uses other methods of deploying applications.

Rollout Resources

A Rollout resource is a custom Kubernetes resource introduced and managed by Argo Rollouts that is largely compatible with the native Kubernetes Deployment resource, but has additional fields to control more advanced deployment methods such as canary and blue/green deployments.

The Argo Rollouts controller will only react to changes in the Rollout resource and will not do anything to the normal Deployment resource, so if you want to manage your Deployment with Argo Rollouts, you will need to migrate your Deployment to Rollouts.

Old and new versions of ReplicaSets

These are instances of the standard Kubernetes ReplicaSet resource. Argo Rollouts adds some extra metadata to them to keep track of the different versions that belong to an application.

Note also that ReplicaSets participating in Rollout are completely managed automatically by the controller and you should not tamper with them using external tools.

Ingress/Service

This is the mechanism by which traffic from live users enters the cluster and is redirected to the appropriate version. Argo Rollouts uses the standard Kubernetes Service resource, with some additional metadata.

Argo Rollouts is very flexible in terms of network configuration. First, you can have different Services during a Rollout that route only to the new version, only to the old version, or to both. For canary deployments in particular, Argo Rollouts supports multiple service mesh and Ingress solutions for splitting traffic by a specific percentage, rather than by a simple balance based on pod counts.

Analysis and AnalysisRun

An Analysis is a custom Kubernetes resource that connects a Rollout to a metric provider and defines thresholds for certain metrics that determine whether a release is successful. For each Analysis you can define one or more metric queries along with their expected results. The Rollout progresses if the metrics look good, rolls back automatically if the metrics show a failure, and pauses if the metrics cannot give a success/failure answer.

An Analysis is only a template that describes which metrics to query. The actual run attached to a Rollout is the AnalysisRun custom resource. You can define an Analysis for a specific Rollout, or globally on the cluster so that multiple Rollouts can share it.

Note that using Analysis and metrics in a Rollout is completely optional. You can pause and promote releases manually, or drive them with other external methods through the API or CLI. You do not need a metrics solution just to use Argo Rollouts, and you can mix automatic (i.e. Analysis-based) and manual steps in the same Rollout.

In addition to metrics, you can also determine the success of a release by running a Kubernetes Job or running a webhook.

Metric Providers

Argo Rollouts includes native integrations with several popular metric providers that you can use in your Analysis resources to automatically promote or roll back releases.

CLI and UI

Rollouts can also be viewed and managed with the Argo Rollouts CLI or the integrated UI; both are optional.

Installation

Install Argo Rollouts directly using the following command.

$ kubectl create namespace argo-rollouts
$ kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/download/v1.2.2/install.yaml

This creates a namespace named argo-rollouts, in which the Argo Rollouts controller runs.

$ kubectl get pods -n argo-rollouts
NAME                             READY   STATUS    RESTARTS   AGE
argo-rollouts-845b79ff9-crx9v    1/1     Running   0          58s

In addition, we can install the kubectl plugin, which is very handy for managing and visualizing releases from the command line. Download the Argo Rollouts kubectl plugin with curl.

$ curl -LO https://github.com/argoproj/argo-rollouts/releases/download/v1.2.2/kubectl-argo-rollouts-linux-amd64

Then make the kubectl-argo-rollouts binary executable.

$ chmod +x ./kubectl-argo-rollouts-linux-amd64

Move the binary to a directory on your PATH.

$ sudo mv ./kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts

Execute the following command to verify that the plugin is installed successfully.

$ kubectl argo rollouts version
kubectl-argo-rollouts: v1.2.2+22aff27
  BuildDate: 2022-07-26T17:24:43Z
  GitCommit: 22aff273bf95646e0cd02555fbe7d2da0f903316
  GitTreeState: clean
  GoVersion: go1.17.6
  Compiler: gc
  Platform: linux/amd64

Usage

Next we will walk through a few simple examples that deploy, update, promote, and abort a Rollout, to demonstrate the main features of Argo Rollouts.

1. Deploying Rollout

First we deploy a Rollout resource and a Kubernetes Service object that targets it. The Rollout in this example uses a canary update strategy: it sends 20% of the traffic to the canary, waits for a manual promotion, and then increases the traffic gradually and automatically for the rest of the upgrade. The following Rollout describes this strategy.

# basic-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: rollouts-demo
spec:
  replicas: 5 # run five replicas
  strategy: # update strategy
    canary: # canary release
      steps: # steps of the release
        - setWeight: 20
        - pause: {} # pause indefinitely until promoted
        - setWeight: 40
        - pause: { duration: 10 } # pause for 10 seconds
        - setWeight: 60
        - pause: { duration: 10 }
        - setWeight: 80
        - pause: { duration: 10 }
  revisionHistoryLimit: 2 # the fields from here on are compatible with Deployment
  selector:
    matchLabels:
      app: rollouts-demo
  template:
    metadata:
      labels:
        app: rollouts-demo
    spec:
      containers:
        - name: rollouts-demo
          image: argoproj/rollouts-demo:blue
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
          resources:
            requests:
              memory: 32Mi
              cpu: 5m

We also need a Service resource object, as shown below.

# basic-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: rollouts-demo
spec:
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app: rollouts-demo

Create the two resource objects above directly.

$ kubectl apply -f basic-rollout.yaml
$ kubectl apply -f basic-service.yaml

The initial creation of any Rollout immediately scales the replicas up to 100% (skipping any canary steps, analysis, and so on), because no upgrade has occurred yet.

$ kubectl get pods -l app=rollouts-demo
NAME                             READY   STATUS    RESTARTS   AGE
rollouts-demo-687d76d795-6ppnh   1/1     Running   0          53s
rollouts-demo-687d76d795-8swrk   1/1     Running   0          53s
rollouts-demo-687d76d795-fnt2w   1/1     Running   0          53s
rollouts-demo-687d76d795-mtvtw   1/1     Running   0          53s
rollouts-demo-687d76d795-sh56l   1/1     Running   0          53s

The kubectl plugin for Argo Rollouts allows us to visualize the Rollout and related resource objects and show real-time status changes. To watch the Rollout during deployment, you can run the plugin’s get rollout --watch command, e.g.

$ kubectl argo rollouts get rollout rollouts-demo --watch

2. Update Rollout

Now that the Rollout is deployed, it is time to perform an update. As with a Deployment, any change to the pod template fields results in a new version (i.e. a new ReplicaSet) being deployed. Updating a Rollout is usually done by changing the container image version and then running kubectl apply; for convenience, the rollouts plugin also provides a set image command. For example, the command below updates the Rollout above to the yellow version of the container image.

$ kubectl argo rollouts set image rollouts-demo \
  rollouts-demo=argoproj/rollouts-demo:yellow
rollout "rollouts-demo" image updated

During an update, the controller works through the steps defined in the Rollout's update strategy. This example Rollout sets a 20% traffic weight for the canary and stays paused until the user aborts or promotes the release. After updating the image, watch the Rollout again until it reaches the paused state.

$ kubectl argo rollouts get rollout rollouts-demo --watch

When the demo Rollout reaches the second step, the plugin shows that it is paused and that 1 of the 5 replicas now runs the new version of the pod while the other 4 still run the old version. This corresponds to the 20% canary weight defined by the setWeight: 20 step.

3. Promote Rollout

After the update above, the Rollout is paused. When a Rollout reaches a pause step with no duration, it stays paused until it is resumed (promoted). To manually advance a Rollout to the next step, run the plugin's promote command.

$ kubectl argo rollouts promote rollouts-demo
rollout 'rollouts-demo' promoted

After promotion, the Rollout continues with the remaining steps. In our example the remaining steps are fully automated, so the Rollout eventually completes them and fully transitions to the new version. Watch the Rollout again until it has completed all the steps.

$ kubectl argo rollouts get rollout rollouts-demo --watch

The promote command also supports skipping all remaining steps and analysis with the --full flag.

Once all the steps have completed, you can see that the stable version has switched to the revision:2 ReplicaSet. Whenever an update is aborted, either automatically by a failed canary analysis or manually by the user, the Rollout falls back to this stable version.

4. Interrupting Rollout

Next, let’s see how to manually abort a rollout during an update. First, use the set image command to deploy a new red version of the container and wait for the rollout to reach the pause step again.

$ kubectl argo rollouts set image rollouts-demo \
  rollouts-demo=argoproj/rollouts-demo:red
rollout "rollouts-demo" image updated

This time, instead of promoting the Rollout to the next step, we abort the update so that the Rollout goes back to the stable version. The plugin provides an abort command that can manually abort a Rollout at any point during an update.

$ kubectl argo rollouts abort rollouts-demo

When a Rollout is aborted, the controller scales up the stable ReplicaSet (in this case the yellow version) and scales down any other versions. Although the stable ReplicaSet may be running and healthy, the Rollout as a whole is still considered Degraded, because the desired version (the red version) is not the one that is actually running.

For the Rollout to be considered Healthy again rather than Degraded, the desired state must be changed back to the previous stable version. In our case, we can simply re-run the set image command with the previous yellow image.

$ kubectl argo rollouts set image rollouts-demo \
  rollouts-demo=argoproj/rollouts-demo:yellow

After running this command, you can see that the Rollout immediately becomes Healthy, and no new ReplicaSet is created.

When a Rollout has not yet reached its desired state (for example, it was aborted or is in the middle of an update) and the stable manifest is re-applied, the Rollout detects that this is a rollback rather than an update, and quickly deploys the stable ReplicaSet by skipping analysis and steps.

The Rollout in the example above does not use an Ingress controller or a service mesh to control traffic. Instead, it uses a normal Kubernetes Service and approximates the canary weight with the ratio of new to old replicas. As a result, this Rollout is limited: the smallest weight it can achieve is 20%, by scaling one of the 5 pods to run the new version. A more fine-grained canary requires an Ingress controller or a service mesh.
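
The replica-based approximation described above can be illustrated with a small calculation. This is only a sketch under stated assumptions, not the controller's actual code: the helper names basic_canary_replicas and effective_weight are hypothetical, the canary count is assumed to be rounded up, and surge or unavailability settings are ignored.

```python
def basic_canary_replicas(replicas: int, weight: int) -> tuple[int, int]:
    """Return approximate (canary, stable) pod counts for a weight in percent."""
    if replicas <= 0:
        raise ValueError("replicas must be positive")
    if not 0 <= weight <= 100:
        raise ValueError("weight must be between 0 and 100")
    # Assumption: round up, so any non-zero weight gets at least one canary pod.
    canary = (replicas * weight + 99) // 100
    # Simplification: keep the total at `replicas` (no surge pods modelled).
    stable = replicas - canary
    return canary, stable


def effective_weight(replicas: int, weight: int) -> float:
    """Share of pods (in percent) that actually run the new version."""
    canary, _ = basic_canary_replicas(replicas, weight)
    return 100.0 * canary / replicas


if __name__ == "__main__":
    for requested in (5, 20, 40, 60, 80):
        canary, stable = basic_canary_replicas(5, requested)
        print(f"setWeight {requested}: {canary} canary / {stable} stable "
              f"({effective_weight(5, requested):.0f}% effective)")
```

With 5 replicas, a requested weight of 5% still needs one whole pod, so the effective weight in this model is 20%, which is the limitation the paragraph above describes.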

Dashboard

The Argo Rollouts kubectl plugin can serve a local dashboard to visualize your Rollouts.

To start this Dashboard, run the kubectl argo rollouts dashboard command in the namespace containing the Rollouts resource object, and then just access localhost:3100.

Click a Rollout to open its detail page, where you can see its configuration and perform common operations such as restart, retry, and abort directly in the UI.

Analysis and Progressive Delivery

Argo Rollouts provides several ways to perform analysis to drive progressive delivery. We start with an overview of the CRDs involved.

Rollout: a drop-in replacement for the Deployment resource. It provides the additional blueGreen and canary update strategies, which can create AnalysisRuns and Experiments during an update to either progress or abort it.
AnalysisTemplate: a template that defines how to perform a canary analysis, such as the metrics to measure, their frequency, and the values considered successful or failed. An AnalysisTemplate can be parameterized with input values.
ClusterAnalysisTemplate: similar to an AnalysisTemplate, but cluster-scoped, so it can be used by any Rollout in the cluster.
AnalysisRun: an instantiation of an AnalysisTemplate. Like Jobs, AnalysisRuns eventually complete. A completed run is considered successful, failed, or inconclusive, and the result determines whether the Rollout's update continues, aborts, or pauses, respectively.

Background Analysis

Analysis can run in the background while the canary is progressing through its rollout steps.

The following example increases the canary weight by 20% every 10 minutes until it reaches 100%. In the background, an AnalysisRun is started from an AnalysisTemplate named success-rate. The success-rate template queries a Prometheus server and measures the HTTP success rate at 5-minute intervals. The run has no end time and continues until it is stopped or fails. If the measured value is below 95%, and there are three such measurements, the analysis is considered failed. A failed analysis aborts the Rollout, sets the canary weight back to zero, and leaves the Rollout in a Degraded state. Otherwise, if the Rollout completes all of its canary steps, it is considered successful and the controller stops the analysis run.

The Rollout resource object is shown below.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: guestbook
spec:
  # ...
  strategy:
    canary:
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 2 # delay the analysis until step 3 (setWeight: 40)
        args:
          - name: service-name
            value: guestbook-svc.default.svc.cluster.local
      steps:
        - setWeight: 20
        - pause: { duration: 10m }
        - setWeight: 40
        - pause: { duration: 10m }
        - setWeight: 60
        - pause: { duration: 10m }
        - setWeight: 80
        - pause: { duration: 10m }

The success-rate template referenced above is defined as follows.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 5m
      # NOTE: prometheus queries return results in the form of a vector.
      # So it is common to access the index 0 of the returned array to obtain the value
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.example.com:9090
          query: |
            sum(irate(
              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]
            )) /
            sum(irate(
              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
            ))

Inline Analysis

Analysis can also be performed as an inline analysis step of the rollout. When analysis is performed inline, an AnalysisRun is started when the rollout reaches that step, and the rollout is blocked from advancing until the run completes. The success or failure of the run determines whether the rollout proceeds to the next step or is aborted entirely.

In the example shown below we set the Canary weight to 20%, pause for 5 minutes, and then run the analysis. If the analysis is successful, the rollout continues, otherwise it is aborted.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: guestbook
spec:
  # ...
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: guestbook-svc.default.svc.cluster.local

In the object above, the analysis is inlined as one of the Rollout steps: the success-rate template is executed after the canary has been held at a 20% weight for 5 minutes.

The AnalysisTemplate here is the same as in the background analysis example above, except that no interval is specified, so the analysis performs a single measurement and then completes.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
    - name: prometheus-port
      value: 9090
  metrics:
    - name: success-rate
      successCondition: result[0] >= 0.95
      provider:
        prometheus:
          address: "http://prometheus.example.com:{{args.prometheus-port}}"
          query: |
            sum(irate(
              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]
            )) /
            sum(irate(
              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
            ))

In addition, we can specify the count and interval fields so that multiple measurements are taken over a longer period.

metrics:
  - name: success-rate
    successCondition: result[0] >= 0.95
    interval: 60s
    count: 5
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: ...

Multiple templates for analysis

A Rollout can reference multiple AnalysisTemplates when constructing an AnalysisRun, which lets us compose an analysis from several templates. If multiple templates are referenced, the controller merges them, combining the metrics and args fields of all the templates.

This is shown below.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: guestbook
spec:
  # ...
  strategy:
    canary:
      analysis:
        templates:
          - templateName: success-rate
          - templateName: error-rate
        args:
          - name: service-name
            value: guestbook-svc.default.svc.cluster.local
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 5m
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.example.com:9090
          query: |
            sum(irate(
              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]
            )) /
            sum(irate(
              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
            ))            
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 5m
      successCondition: result[0] <= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.example.com:9090
          query: |
            sum(irate(
              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code=~"5.*"}[5m]
            )) /
            sum(irate(
              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
            ))

When the analysis is performed, the controller will merge the success-rate and error-rate templates above into an AnalysisRun object.

Note that the controller reports an error when merging the templates if:

Multiple metrics in the templates have the same name
Two arguments with the same name both have values
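
The two merge rules above can be sketched in a few lines. This is a minimal illustration, assuming templates are plain dictionaries shaped like the spec sections shown earlier; merge_templates is a hypothetical helper, not part of Argo Rollouts, and the error messages are made up for the example.

```python
def merge_templates(templates: list[dict]) -> dict:
    """Combine the metrics and args of several template specs into one."""
    metrics: list[dict] = []
    args: dict[str, object] = {}
    seen_metrics: set[str] = set()

    for spec in templates:
        for metric in spec.get("metrics", []):
            name = metric["name"]
            # Rule 1: metric names must be unique across all templates.
            if name in seen_metrics:
                raise ValueError(f"two metrics have the same name: {name}")
            seen_metrics.add(name)
            metrics.append(metric)

        for arg in spec.get("args", []):
            name, value = arg["name"], arg.get("value")
            # Rule 2: the same argument may be declared twice,
            # but only one of the declarations may carry a value.
            if args.get(name) is not None and value is not None:
                raise ValueError(f"two args with the same name have values: {name}")
            if args.get(name) is None:
                args[name] = value

    return {"metrics": metrics, "args": args}


if __name__ == "__main__":
    success_rate = {"args": [{"name": "service-name"}],
                    "metrics": [{"name": "success-rate"}]}
    error_rate = {"args": [{"name": "service-name"}],
                  "metrics": [{"name": "error-rate"}]}
    merged = merge_templates([success_rate, error_rate])
    print([m["name"] for m in merged["metrics"]], merged["args"])
```

Declaring service-name in both templates is fine here because neither declaration sets a value; the value is expected to come from the Rollout.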

Analysis Template Parameters

An AnalysisTemplate can declare a set of arguments that a Rollout can pass in. These arguments can then be used in the metrics configuration and are resolved when the AnalysisRun is created. Argument placeholders are written as {{ args.<name> }}, as shown below.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: args-example
spec:
  args:
    # required
    - name: service-name
    - name: stable-hash
    - name: latest-hash
    # optional
    - name: api-url
      value: http://example/measure
    # from secret
    - name: api-token
      valueFrom:
        secretKeyRef:
          name: token-secret
          key: apiToken
  metrics:
    - name: webmetric
      successCondition: result == 'true'
      provider:
        web:
          # placeholders are resolved when an AnalysisRun is created
          url: "{{ args.api-url }}?service={{ args.service-name }}"
          headers:
            - key: Authorization
              value: "Bearer {{ args.api-token }}"
          jsonPath: "{$.results.ok}"

When creating an AnalysisRun, the parameters defined in the Rollout are merged with those of the AnalysisTemplate, as follows.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: guestbook
spec:
  # ...
  strategy:
    canary:
      analysis:
        templates:
          - templateName: args-example
        args:
          # required value
          - name: service-name
            value: guestbook-svc.default.svc.cluster.local
          # override default value
          - name: api-url
            value: http://other-api
          # pod template hash from the stable ReplicaSet
          - name: stable-hash
            valueFrom:
              podTemplateHashValue: Stable
          # pod template hash from the latest ReplicaSet
          - name: latest-hash
            valueFrom:
              podTemplateHashValue: Latest
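
To make the argument handling more concrete, the following sketch resolves arguments and then substitutes placeholders. It is an illustration only: resolve_args and render are hypothetical helpers, secret and valueFrom lookups are not modelled, and arguments that the template does not declare are simply ignored, which is an assumption rather than documented behavior.

```python
import re

# Matches "{{ args.name }}" as well as "{{args.name}}".
PLACEHOLDER = re.compile(r"\{\{\s*args\.([A-Za-z0-9_-]+)\s*\}\}")


def resolve_args(declared: list[dict], supplied: dict[str, str]) -> dict[str, str]:
    """Take the Rollout's value if given, else the template default."""
    resolved = {}
    for arg in declared:
        name = arg["name"]
        value = supplied.get(name, arg.get("value"))
        if value is None:
            # A declared arg without a default must be supplied by the Rollout.
            raise ValueError(f"no value for required arg: {name}")
        resolved[name] = value
    return resolved


def render(text: str, args: dict[str, str]) -> str:
    """Replace every args placeholder in text with its resolved value."""
    return PLACEHOLDER.sub(lambda match: str(args[match.group(1)]), text)


if __name__ == "__main__":
    declared = [
        {"name": "service-name"},                             # required
        {"name": "api-url", "value": "http://example/measure"},  # optional
    ]
    supplied = {
        "service-name": "guestbook-svc.default.svc.cluster.local",
        "api-url": "http://other-api",                        # overrides default
    }
    args = resolve_args(declared, supplied)
    print(render("{{ args.api-url }}?service={{ args.service-name }}", args))
```

The design mirrors the two examples above: defaults live in the template, and the Rollout only has to supply the required values plus any overrides.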

In addition, analysis arguments support valueFrom with a fieldRef, which reads a metadata field of the Rollout and passes it to the AnalysisTemplate as an argument. The example below references the env and region labels in the metadata and passes them to the AnalysisTemplate.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: guestbook
  labels:
    appType: demo-app
    buildType: nginx-app
    ...
    env: dev
    region: us-west-2
spec:
...
  strategy:
    canary:
      analysis:
        templates:
        - templateName: args-example
        args:
        ...
        - name: env
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['env']
        # region where this app is deployed
        - name: region
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['region']

BlueGreen Pre-Release Analysis

A Rollout using the BlueGreen strategy can launch an AnalysisRun before it switches traffic to the new version, using pre-promotion analysis (the prePromotionAnalysis field). The success or failure of the AnalysisRun determines whether the Rollout switches traffic or aborts entirely, as shown below.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: guestbook
spec:
  # ...
  strategy:
    blueGreen:
      activeService: active-svc
      previewService: preview-svc
      prePromotionAnalysis:
        templates:
          - templateName: smoke-tests
        args:
          - name: service-name
            value: preview-svc.default.svc.cluster.local

In the example above, once the new ReplicaSet is fully available, the Rollout creates a pre-promotion AnalysisRun. The Rollout does not switch traffic to the new version immediately, but waits until the AnalysisRun completes successfully.

Note: If the autoPromotionSeconds field is specified and the Rollout has waited for that many seconds, the Rollout marks the AnalysisRun as successful and automatically switches traffic to the new version. If the AnalysisRun completes before then, the Rollout does not create another AnalysisRun and waits out the remainder of autoPromotionSeconds.

BlueGreen Post-Release Analysis

A Rollout using the BlueGreen strategy can also run post-promotion analysis after traffic has been switched to the new version. If the post-promotion analysis fails or errors, the Rollout is aborted and traffic is switched back to the previous stable ReplicaSet. When the analysis succeeds, the Rollout is considered fully promoted and the new ReplicaSet is marked as stable. The old ReplicaSet is then scaled down after scaleDownDelaySeconds (30 seconds by default).

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: guestbook
spec:
  # ...
  strategy:
    blueGreen:
      activeService: active-svc
      previewService: preview-svc
      scaleDownDelaySeconds: 600 # 10 minutes
      postPromotionAnalysis:
        templates:
          - templateName: smoke-tests
        args:
          - name: service-name
            value: preview-svc.default.svc.cluster.local

Failure Condition

failureCondition configures when a measurement counts as a failure. The following example polls the Prometheus server every 5 minutes to get the total number of errors, and a measurement fails if 10 or more errors are encountered.

metrics:
  - name: total-errors
    interval: 5m
    failureCondition: result[0] >= 10
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code=~"5.*"}[5m]
          ))

Inconclusive runs

An analysis run can also end as Inconclusive, which means the run neither succeeded nor failed. An inconclusive run causes the release to pause at its current step, and manual intervention is then required to either resume the Rollout or abort it. One way an analysis run can become inconclusive is when a metric defines no success or failure condition.

metrics:
  - name: my-query
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: ...

An inconclusive analysis run can also occur when both success and failure conditions are specified, but the measured value satisfies neither of them.

metrics:
  - name: success-rate
    successCondition: result[0] >= 0.90
    failureCondition: result[0] < 0.50
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: ...

One use case for inconclusive analysis runs is to let Argo Rollouts automate the execution of analysis runs and collect the measurements, while still leaving it to a human to judge whether the measured values are acceptable and to decide whether to proceed or abort.
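
The way a single measurement ends up successful, failed, or inconclusive can be sketched as follows. This is a simplified model, not the controller's implementation. The function classify is hypothetical, conditions are modelled as Python callables instead of expression strings, and two behaviors are assumptions: a lone condition implies its inverse, and a value matching both conditions is treated as inconclusive.

```python
from typing import Callable, Optional

Condition = Callable[[float], bool]


def classify(value: float,
             success: Optional[Condition] = None,
             failure: Optional[Condition] = None) -> str:
    """Classify one measured value against optional success/failure conditions."""
    if success is None and failure is None:
        # No condition at all: nothing can be decided.
        return "Inconclusive"
    if failure is None:
        # Assumption: with only a success condition, not succeeding means failing.
        return "Successful" if success(value) else "Failed"
    if success is None:
        # Assumption: with only a failure condition, not failing means succeeding.
        return "Failed" if failure(value) else "Successful"
    ok, bad = success(value), failure(value)
    if ok and not bad:
        return "Successful"
    if bad and not ok:
        return "Failed"
    # Neither condition holds (or, by assumption, both do).
    return "Inconclusive"


if __name__ == "__main__":
    at_least_90 = lambda result: result >= 0.90
    below_50 = lambda result: result < 0.50
    for measured in (0.95, 0.70, 0.30):
        print(measured, classify(measured, at_least_90, below_50))
```

With the conditions from the example above (success at 0.90 or more, failure below 0.50), a measured value of 0.70 satisfies neither, so it falls into the inconclusive band that is left for a human to judge.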

Delaying analysis runs

An analysis run can be delayed for specific metrics if it does not need to start immediately (e.g., to give the metric provider time to collect metrics for the canary version). Each metric can be configured with a different delay. In addition to per-metric delays, rollouts with background analysis can delay the creation of the analysis run until a certain step is reached.

The following delays a specific analysis metric.

metrics:
  - name: success-rate
    # Do not start this analysis until 5 minutes after the analysis run starts
    initialDelay: 5m
    successCondition: result[0] >= 0.90
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: ...

Delay the start of the background analysis run until step 3 (set weight 40%).

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: guestbook
spec:
  strategy:
    canary:
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 2
      steps:
        - setWeight: 20
        - pause: { duration: 10m }
        - setWeight: 40
        - pause: { duration: 10m }

In addition, the OpenKruise project has recently launched a similar progressive delivery tool, Kruise Rollouts. If you are interested, you can learn more at https://github.com/openkruise/rollouts.
