Someone raised a problem a while ago about false alarms when monitoring Job tasks with Prometheus: a CronJob controls a set of Jobs, an earlier Job fails and monitoring triggers an alert, and even after the issue is fixed and a newly created Job runs successfully, the alert for the earlier failure keeps firing.

This happens because we generally keep some history when running Job tasks to make troubleshooting easier, so a failed Job continues to exist even after a later run succeeds. The default alerting rule used by most out-of-the-box kube-prometheus installations is essentially kube_job_status_failed > 0, which is obviously inaccurate here: the only way to clear the false positive is to manually delete the failed Job, which in turn throws away the history we wanted for troubleshooting. Let's reorganize our thinking and solve this problem properly.
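
To make the problem concrete, a rule built on that expression looks roughly like the following (the alert name, for duration, and annotations here are illustrative, not the exact rule shipped by kube-prometheus):

- alert: KubeJobFailed
  # Fires for every failed Job still kept in history, even if a newer run succeeded
  expr: kube_job_status_failed > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    description: 'Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.'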

A CronJob creates a Job object at each scheduled execution time, and the .spec.successfulJobsHistoryLimit and .spec.failedJobsHistoryLimit properties control how many completed and failed Jobs are kept; the defaults are 3 and 1 respectively. For example, the following declares a CronJob resource object.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            - date;
          restartPolicy: OnFailure

According to the resource object specification above, Kubernetes will keep only one failed Job and one successful Job, so listing the Jobs (for example with kubectl get jobs) will show something like this.

NAME               COMPLETIONS   DURATION   AGE
hello-4111706356   0/1           2m         10d
hello-4111706356   1/1           5s         5s

To solve the above false positives, we also need the kube-state-metrics service, which listens to the Kubernetes APIServer and generates metrics about the state of objects such as Deployments, Nodes, Jobs, and Pods. Here we will use the following metrics.

  • kube_job_owner: find the relationship between a Job and the CronJob that triggered it
  • kube_job_status_start_time: get the time when the Job was started
  • kube_job_status_failed: get the Jobs that failed to execute
  • kube_cronjob_spec_suspend: filter out suspended CronJobs

Here is an example of these metrics, with their labels, generated by a hello task triggered by the CronJob above.

kube_job_owner{job_name="hello-1604875860", namespace="myNamespace", owner_is_controller="true", owner_kind="CronJob", owner_name="hello"} 1
kube_job_status_start_time{job_name="hello-1604875860", namespace="myNamespace"} 1604875874
kube_job_status_failed{job_name="hello-1604875860", namespace="myNamespace", reason="BackoffLimitExceeded"} 1
kube_cronjob_spec_suspend{cronjob="hello",job="kube-state-metrics", namespace="myNamespace"} 0

To make the monitoring alert accurate, we really only need to look at the last Job among the group of Jobs triggered by the same CronJob, and fire an alert only when that Job fails.

Since the kube_job_status_failed and kube_job_status_start_time metrics do not carry a label identifying the CronJob they belong to, the first step is to add one. The owner_name label of the kube_job_owner metric is exactly what we need, and we can merge it in with the following PromQL statement.

max(
  kube_job_status_start_time
  * ON(job_name, namespace) GROUP_RIGHT()
  kube_job_owner{owner_name != ""}
  )
BY (job_name, owner_name, namespace)

We use the max function here because we may run multiple replicas of kube-state-metrics for high availability, and max collapses the duplicates into a single result per Job task. Assuming our Job history contains two tasks (one failed and the other successful), the result will look like this.

{job_name="hello-1623578940", namespace="myNamespace", owner_name="hello"} 1623578959
{job_name="hello-1617667200", namespace="myNamespace", owner_name="hello"} 1617667204

Now that we know the owner of each Job, we need to find the last Job that was executed, which we can do by aggregating the results by the owner_name label.

max(
  kube_job_status_start_time
  * ON(job_name,namespace) GROUP_RIGHT()
  kube_job_owner{owner_name!=""}
) 
BY (owner_name)
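
With the two example Jobs above, this aggregation keeps only the most recent start time per owner (the exact value will of course depend on your own data):

{owner_name="hello"} 1623578959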

The statement above finds the latest Job start time for each owner (that is, for each CronJob). We then join it with the earlier query and keep only the records whose start time equals that latest value, which leaves exactly the last executed Job of each CronJob.

max(
 kube_job_status_start_time
 * ON(job_name,namespace) GROUP_RIGHT()
 kube_job_owner{owner_name!=""}
)
BY (job_name, owner_name, namespace)
== ON(owner_name) GROUP_LEFT()
max(
 kube_job_status_start_time
 * ON(job_name,namespace) GROUP_RIGHT()
 kube_job_owner{owner_name!=""}
)
BY (owner_name)

The results will show the last job executed for each CronJob and only the last one.

{job_name="hello-1623578940", namespace="myNamespace", owner_name="hello"} 1623578959

To improve readability we can also map the job_name and owner_name labels onto job and cronjob labels, which makes the result easier to read.

label_replace(
  label_replace(
    max(
      kube_job_status_start_time
      * ON(job_name,namespace) GROUP_RIGHT()
      kube_job_owner{owner_name!=""}
    )
    BY (job_name, owner_name, namespace)
    == ON(owner_name) GROUP_LEFT()
    max(
      kube_job_status_start_time
      * ON(job_name,namespace) GROUP_RIGHT()
      kube_job_owner{owner_name!=""}
    )
    BY (owner_name),
  "job", "$1", "job_name", "(.+)"),
"cronjob", "$1", "owner_name", "(.+)")

You will now see a result similar to the following.

{job="hello-1623578940", cronjob="hello", job_name="hello-1623578940", namespace="myNamespace", owner_name="hello"} 1623578959

Since the above query is fairly complex, computing it in real time for every alert evaluation would put noticeable load on Prometheus, so we can use recording rules to precompute the result, which greatly improves efficiency. Create a recording rule as shown below to represent the last executed Job of each CronJob.

- record: job:kube_job_status_start_time:max
  expr: |
    label_replace(
      label_replace(
        max(
          kube_job_status_start_time
          * ON(job_name,namespace) GROUP_RIGHT()
          kube_job_owner{owner_name!=""}
        )
        BY (job_name, owner_name, namespace)
        == ON(owner_name) GROUP_LEFT()
        max(
          kube_job_status_start_time
          * ON(job_name,namespace) GROUP_RIGHT()
          kube_job_owner{owner_name!=""}
        )
        BY (owner_name),
      "job", "$1", "job_name", "(.+)"),
    "cronjob", "$1", "owner_name", "(.+)")    

Now that we know the most recent Job each CronJob started, we can use the kube_job_status_failed metric to filter out the failed ones.

- record: job:kube_job_status_failed:sum
  expr: |
    clamp_max(job:kube_job_status_start_time:max, 1)
      * ON(job) GROUP_LEFT()
      label_replace(
        (kube_job_status_failed > 0),
        "job", "$1", "job_name", "(.+)"
      )    

The clamp_max function converts the result of job:kube_job_status_start_time:max into a set of time series capped at 1; multiplying it with the failed-Job series filters the result down to the most recent Jobs that actually failed. We save this as another recording rule, named job:kube_job_status_failed:sum.
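
Assuming the most recent hello Job is the one that failed, the recorded series would look roughly like this (an illustration based on the example labels used earlier):

{job="hello-1623578940", cronjob="hello", job_name="hello-1623578940", namespace="myNamespace", owner_name="hello"} 1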

The final step is to add an alerting rule on these failed Job tasks, as shown below.

- alert: CronJobStatusFailed
  expr: |
    job:kube_job_status_failed:sum
    * ON(cronjob, namespace) GROUP_LEFT()
    (kube_cronjob_spec_suspend == 0)
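
In practice you will probably also want a for duration, a severity label, and annotations so that the notification tells you which CronJob failed. A minimal sketch of such a rule, assuming the same expression as above (the field values are illustrative):

- alert: CronJobStatusFailed
  expr: |
    job:kube_job_status_failed:sum
    * ON(cronjob, namespace) GROUP_LEFT()
    (kube_cronjob_spec_suspend == 0)
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: 'CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} failed'
    description: 'The last Job ({{ $labels.job_name }}) triggered by CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} failed to complete.'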

The (kube_cronjob_spec_suspend == 0) condition excludes suspended CronJobs to avoid further false alarms. With that, the problem of false alarms when monitoring CronJob tasks with Prometheus is solved. Although kube-prometheus ships with a lot of built-in monitoring and alerting rules, we should not trust them blindly: they do not always fit our actual needs.