As the most popular cloud-native monitoring tool, Prometheus is undoubtedly a great performer. However, as we use it, the number of metrics stored in Prometheus grows over time and queries become more frequent; as we add more Grafana dashboards, we may gradually find that Grafana can no longer render charts in time and occasionally times out. This is especially true when we aggregate a large amount of metric data over a long time range, where Prometheus queries are even more likely to time out. What we need is a mechanism similar to background batch processing, which performs these complex computations in the background so that users only need to query the precomputed results.

Prometheus provides Recording Rules to support this kind of background computation, which can optimize the performance of complex PromQL queries and improve query efficiency.

Problems

Let’s say we want to know the actual CPU and memory utilization across Kubernetes nodes. We can query them using the metrics container_cpu_usage_seconds_total and container_memory_usage_bytes. Since every running container exposes these two metrics, even a moderately large production environment may have thousands of containers running at the same time; when we query, say, a week of data at 5-minute resolution across thousands of containers, Prometheus will have a hard time returning results quickly.

Let’s say we divide the sum of container_cpu_usage_seconds_total by the sum of kube_node_status_allocatable_cpu_cores to get the CPU utilization:

sum(rate(container_cpu_usage_seconds_total[5m])) / avg_over_time(sum(kube_node_status_allocatable_cpu_cores)[5m:5m])
Load time: 15723ms

Similarly, use a sliding window to calculate memory utilization by dividing the sum of container_memory_usage_bytes by the sum of kube_node_status_allocatable_memory_bytes:

avg_over_time(sum(container_memory_usage_bytes)[15m:15m]) / avg_over_time(sum(kube_node_status_allocatable_memory_bytes)[5m:5m])
Load time: 18656ms

Recording Rules

As mentioned above, Prometheus provides a way to optimize our query statements called Recording Rules. The basic idea is that they let us precompute new time series from existing ones. If you use the Prometheus Operator, you will find a number of such rules already configured in Prometheus, such as:

groups:
  - name: k8s.rules
    rules:
      - expr: |
          sum(rate(container_cpu_usage_seconds_total{image!="", container!=""}[5m])) by (namespace)
        record: namespace:container_cpu_usage_seconds_total:sum_rate
      - expr: |
          sum(container_memory_usage_bytes{image!="", container!=""}) by (namespace)
        record: namespace:container_memory_usage_bytes:sum

These two rules would serve our queries above perfectly: they are evaluated continuously and store their results in a much smaller set of time series. sum(rate(container_cpu_usage_seconds_total{image!="", container!=""}[5m])) by (namespace) is evaluated at a predefined interval and the result stored as a new metric, namespace:container_cpu_usage_seconds_total:sum_rate, which can then be queried just like any other metric in memory.

Now, I can change the query to derive the CPU utilization as follows.

sum(namespace:container_cpu_usage_seconds_total:sum_rate) / avg_over_time(sum(kube_node_status_allocatable_cpu_cores)[5m:5m])
Load time: 1077ms

Now, it runs 14 times faster!
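We could go one step further and precompute the denominator as well. A minimal sketch of such a rule, where the group name node.rules and the record name node:allocatable_cpu_cores:sum are our own choices rather than Operator defaults:

```yaml
groups:
  - name: node.rules
    rules:
      # Precompute the total allocatable CPU cores across all nodes,
      # so the dashboard query only has to divide two tiny series.
      - record: node:allocatable_cpu_cores:sum
        expr: sum(kube_node_status_allocatable_cpu_cores)
```

The CPU utilization query would then simplify to sum(namespace:container_cpu_usage_seconds_total:sum_rate) / node:allocatable_cpu_cores:sum.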

Memory utilization is calculated in the same way:

sum(namespace:container_memory_usage_bytes:sum) / avg_over_time(sum(kube_node_status_allocatable_memory_bytes)[5m:5m])
Load time: 677ms

Now it runs 27 times faster!

Recording rule usage

In the Prometheus configuration file, we can define the paths to recording rule files via rule_files, in much the same way as we define alerting rule files:

rule_files:
  [ - <filepath_glob> ... ]
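For example, assuming our rule files are kept in a hypothetical /etc/prometheus/rules/ directory, the glob could look like this:

```yaml
rule_files:
  # Load every YAML rules file in this (assumed) directory
  - /etc/prometheus/rules/*.yml
```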

Each rule file is defined by the following format.

groups:
  [ - <rule_group> ]

A simple rules file might look like this.

groups:
- name: example
  rules:
  - record: job:http_inprogress_requests:sum
    expr: sum(http_inprogress_requests) by (job)

The specific configuration items for rule_group are shown below.

# The name of the group. Must be unique within a file.
name: <string>
# How often rules in the group are evaluated.
[ interval: <duration> | default = global.evaluation_interval ]
rules:
  [ - <rule> ... ]

As with alerting rules, a group can contain multiple rules:

# The name of the time series to output. Must be a valid metric name.
record: <string>
# The PromQL expression to evaluate. In each evaluation cycle it is evaluated
# at the current time, and the result is recorded as a new set of time series
# with the metric name given by record.
expr: <string>
# Labels to add or overwrite before storing the result.
labels:
  [ <labelname>: <labelvalue> ]

As defined above, Prometheus evaluates the PromQL expression in expr in the background and saves the results under the new time series name given by record, with the option of attaching extra labels to these samples via the labels field.
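For instance, a hypothetical rule that tags the precomputed series with an extra label might look like this (the team label and its value are purely illustrative):

```yaml
groups:
  - name: example-with-labels
    rules:
      - record: job:http_inprogress_requests:sum
        expr: sum(http_inprogress_requests) by (job)
        labels:
          # Added to (or overwriting the same label on) every resulting sample
          team: platform
```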

By default, these rule files are evaluated at the same frequency as alerting rules, defined by global.evaluation_interval:

global:
  [ evaluation_interval: <duration> | default = 1m ]
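Before reloading Prometheus, it is also worth validating the rule files with promtool, the command-line utility that ships with Prometheus (the file path here is an assumption):

```
promtool check rules /etc/prometheus/rules/example.yml
```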