vmagent Manual

I. What is vmagent

The following description is taken from the official documentation.

vmagent is a tiny but powerful agent that collects metrics from various sources and stores them in VictoriaMetrics or in any other Prometheus-compatible storage system that supports the remote_write protocol.

Features of vmagent

Can be used as a drop-in replacement for Prometheus for scraping targets such as node_exporter.
Can read data from Kafka.
Can write data to Kafka.
Can add, remove, and modify labels via Prometheus relabeling, and can filter data before it is sent to remote storage.
Accepts data via all ingestion protocols supported by VictoriaMetrics.
Can replicate collected metrics to multiple remote storage systems simultaneously.
Works smoothly in environments with unstable connections to remote storage. If the remote storage is unavailable, the collected metrics are buffered at -remoteWrite.tmpDataPath and are sent once the connection is restored. The maximum disk usage of the buffer can be limited with -remoteWrite.maxDiskUsagePerURL.
Uses less RAM, CPU, disk IO, and network bandwidth than Prometheus.
Can distribute scrape targets among multiple vmagent instances when a large number of targets must be scraped.
Can efficiently scrape targets that expose millions of time series, such as the /federate endpoint in Prometheus.
Can handle high cardinality and high churn rate by limiting the number of unique time series at scrape time and before they are sent to remote storage.
Can load scrape configs from multiple files.
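
To make a few of these features concrete, here is a minimal sketch of a scrape config that filters metrics with relabeling, limits the number of unique series per target, and scrapes a large /federate endpoint in stream parsing mode. The job names, target addresses, regex, and limit value are hypothetical and are not part of the deployment described later; series_limit and stream_parse are vmagent-specific options, so check that your vmagent version supports them.

```yaml
# Illustrative only: targets, regex and limits are assumptions, not values from this deployment.
scrape_configs:
  - job_name: 'federate'
    metrics_path: /federate
    honor_labels: true
    params:
      'match[]': ['{__name__!=""}']
    # vmagent-specific: parse the response as a stream to reduce memory usage on huge targets
    stream_parse: true
    static_configs:
      - targets: ['prometheus.example:9090']

  - job_name: 'node-exporter-static'
    # vmagent-specific: cap the number of unique time series accepted from each target
    series_limit: 50000
    static_configs:
      - targets: ['node-exporter.example:9100']
    metric_relabel_configs:
      # drop Go runtime metrics before they are sent to remote storage
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop
```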

II. Architecture diagram

vmagent collects metrics from node-exporter, kubernetes-cadvisor, and kube-state-metrics.
vmagent then sends the collected metrics to the monitoring team's VictoriaMetrics.

Architecture diagram

III. Deployment

namespace.yml

apiVersion: v1
kind: Namespace
metadata:
  name: sbux-monitoring

serviceaccount.yml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: vmagent
  namespace: sbux-monitoring

clusterrole.yml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: vmagent
rules:
  - apiGroups: ["", "networking.k8s.io", "extensions"]
    resources:
      - nodes
      - nodes/metrics
      - services
      - endpoints
      - endpointslices
      - pods
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources:
      - namespaces
      - configmaps
    verbs: ["get"]
  - nonResourceURLs: ["/metrics", "/metrics/resources"]
    verbs: ["get"]

clusterrolebinding.yml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: vmagent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: vmagent
subjects:
  - kind: ServiceAccount
    name: vmagent
    namespace: sbux-monitoring

configmap-vmagent.yml

This is the vmagent configuration file. Based on the actual requirements, it scrapes node-exporter, cadvisor, and kube-state-metrics metrics.

global.external_labels: set the cluster label to the name of each cluster.

scrape_timeout: set to 60 seconds here. In testing, kube-state-metrics was sometimes slow to respond. Besides adjusting this timeout, you should also increase the CPU and memory of the kube-state-metrics pod.

apiVersion: v1
kind: ConfigMap
metadata:
  name: vmagent-config
  namespace: sbux-monitoring
data:
  scrape.yml: |
    global:
      scrape_interval: 30s
      scrape_timeout: 60s
      external_labels:
        cluster: gds-poc
    scrape_configs:
    - job_name: 'vmagent'
      static_configs:
        - targets: ['vmagent:8429']
    - job_name: 'node-exporter'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_endpoints_name]
        regex: node-exporter
        action: keep
      - source_labels: [__meta_kubernetes_namespace]
        regex: kube-system
        action: keep
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: cps_node_name

    - job_name: 'kubernetes-cadvisor'

      scheme: https
      tls_config:
        ca_file: /secrets/kubelet/ca
        key_file: /secrets/kubelet/key
        cert_file: /secrets/kubelet/cert
      metrics_path: /metrics/cadvisor

      kubernetes_sd_configs:
      - role: node

      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: cps_node_name
        replacement: $1

    - job_name: 'kube-state-metrics'

      kubernetes_sd_configs:
      - role: endpoints

      relabel_configs:
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: keep
        regex: kube-state-metrics
      - source_labels: [__meta_kubernetes_namespace]
        action: keep
        regex: kube-system
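
As an alternative to raising the global timeout, the longer timeout could be applied only to the slow job. The fragment below is a sketch of that idea, not the configuration used in this deployment. It also raises scrape_interval for this job, because Prometheus rejects a scrape_timeout that is greater than the scrape_interval, so keeping the two consistent is the safer assumption.

```yaml
# Sketch: per-job override for the slow kube-state-metrics target (values are illustrative).
scrape_configs:
  - job_name: 'kube-state-metrics'
    # overrides the global 30s interval for this job only
    scrape_interval: 60s
    scrape_timeout: 60s
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: keep
        regex: kube-state-metrics
      - source_labels: [__meta_kubernetes_namespace]
        action: keep
        regex: kube-system
```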

secret_kubelet.yml

The cadvisor job uses TLS, so the kubelet client certificate must be mounted into the vmagent pod.

# Use the kubelet-client-tls-secret from the kube-system namespace.
# In the exported file, change the namespace field and remove creationTimestamp, resourceVersion, selfLink, uid, etc.
$ kubectl -n kube-system get secret kubelet-client-tls-secret -o yaml > secret_kubelet.yml
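
After the cleanup, secret_kubelet.yml might look roughly like the sketch below. The key names ca, cert, and key are inferred from the tls_config paths in the scrape config (/secrets/kubelet/ca, /secrets/kubelet/cert, /secrets/kubelet/key); the values are placeholders for the base64 data in your exported secret, and any type field present in the export should be kept as it is.

```yaml
# Sketch of the cleaned-up secret; the data values are placeholders, not real content.
apiVersion: v1
kind: Secret
metadata:
  name: kubelet-client-tls-secret
  # changed from kube-system to the namespace where vmagent runs
  namespace: sbux-monitoring
data:
  ca: <base64-encoded CA certificate>
  cert: <base64-encoded client certificate>
  key: <base64-encoded client private key>
```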

deployment.yml

Command-line parameters:

-promscrape.config=/config/scrape.yml Path to the vmagent configuration file. It must match the mount path in the volumeMounts field.
-remoteWrite.tmpDataPath=/tmpData, -remoteWrite.maxDiskUsagePerURL=10GB Buffer unsent metrics in the /tmpData directory, using at most 10GB of disk space per remote write URL.
-remoteWrite.url=http://victoriametrics.victoriametrics:8428/api/v1/write, -remoteWrite.url=https://prometheus-vminsert.xxxxxx.net/insert/0/prometheus Two remote write addresses, one for my own test VictoriaMetrics and one for the monitoring team's VictoriaMetrics.
-remoteWrite.tlsInsecureSkipVerify=true Required because the HTTPS certificate of the remote write address is self-signed. For production environments, adding basic auth configuration is recommended to strengthen security.
-promscrape.maxScrapeSize=50MB The maximum size of a scrape response. In actual testing, when the cluster runs a large number of applications, the response exceeds 20MB.

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vmagent
  namespace: sbux-monitoring
  labels:
    app: vmagent
spec:
  selector:
    matchLabels:
      app: vmagent
  template:
    metadata:
      labels:
        app: vmagent
    spec:
      serviceAccountName: vmagent
      containers:
        - name: vmagent
          image: "registry.xxxxxx.net/library/vmagent:v1.77.1"
          imagePullPolicy: IfNotPresent
          args:
            - -promscrape.config=/config/scrape.yml
            - -remoteWrite.tmpDataPath=/tmpData
            - -promscrape.maxScrapeSize=50MB
            - -remoteWrite.maxDiskUsagePerURL=10GB
            - -remoteWrite.url=http://victoriametrics.victoriametrics:8428/api/v1/write
            - -remoteWrite.url=https://prometheus-vminsert.xxxxxxcf.net/insert/0/prometheus
            - -remoteWrite.tlsInsecureSkipVerify=true
            - -envflag.enable=true
            - -envflag.prefix=VM_
            - -loggerFormat=json
          ports:
            - name: http
              containerPort: 8429
          resources:
            limits:
              cpu: "4"
              memory: 8Gi
            requests:
              cpu: "1"
              memory: 1Gi
          volumeMounts:
            - name: config
              mountPath: /config
            - name: kubelet-client-tls-secret
              mountPath: /secrets/kubelet
      volumes:
        - name: config
          configMap:
            name: vmagent-config
        - name: kubelet-client-tls-secret
          secret:
            defaultMode: 420
            optional: true
            secretName: kubelet-client-tls-secret
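
For the basic auth recommendation, one possible approach is to pass the credentials through environment variables, since the Deployment already sets -envflag.enable=true and -envflag.prefix=VM_. The fragment below is a sketch under several assumptions: it shows a single remote write URL for simplicity, the Secret name vmagent-remote-write-auth and its keys are hypothetical, and the environment variable names assume that dots in flag names are replaced with underscores after the prefix. Verify the mapping against the documentation of your vmagent version.

```yaml
# Sketch of the vmagent container spec; the Secret name and keys are hypothetical.
containers:
  - name: vmagent
    args:
      - -remoteWrite.url=https://prometheus-vminsert.xxxxxxcf.net/insert/0/prometheus
      - -envflag.enable=true
      - -envflag.prefix=VM_
    env:
      # assumed to be read as -remoteWrite.basicAuth.username
      - name: VM_remoteWrite_basicAuth_username
        valueFrom:
          secretKeyRef:
            name: vmagent-remote-write-auth
            key: username
      # assumed to be read as -remoteWrite.basicAuth.password
      - name: VM_remoteWrite_basicAuth_password
        valueFrom:
          secretKeyRef:
            name: vmagent-remote-write-auth
            key: password
```

Keeping the password in a Secret avoids placing it in the Deployment manifest or in the process arguments.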

service.yml

---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: vmagent
  name: vmagent
  namespace: sbux-monitoring
spec:
  ports:
  - name: http-8429
    port: 8429
    protocol: TCP
    targetPort: 8429
  selector:
    app: vmagent
  type: ClusterIP

IV. Problems encountered

kube-state-metrics metrics are missing, and vmagent reports errors

{"ts":"2022-05-30T13:20:16.594Z","level":"error","caller":"VictoriaMetrics/lib/promscrape/scrapework.go:355","msg":"error when scraping \"http://192.168.154.27:8080/metrics\" from job \"kube-state-metrics\" with labels {cluster=\"prod-azure\",instance=\"192.168.154.27:8080\",job=\"kube-state-metrics\"}: cannot read Prometheus exposition data: cannot read a block of data in 0.000s: the response from \"http://192.168.154.27:8080/metrics\" exceeds -promscrape.maxScrapeSize=16777216; either reduce the response size for the target or increase -promscrape.maxScrapeSize"}

Solution: increase -promscrape.maxScrapeSize. The limit shown in the error message, 16777216 bytes, is 16MiB.

kube-state-metrics scrape timeouts

Solution: the resources allocated to kube-state-metrics were too low, so its CPU and memory were increased. The scrape timeout in the vmagent configuration file was also changed to 60s.
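
The resource adjustment could be expressed as a fragment like the one below in the kube-state-metrics Deployment. The numbers are purely illustrative placeholders, not the values used in this cluster; size them from the observed CPU and memory usage of the pod.

```yaml
# Sketch: resources for the kube-state-metrics container (all values are illustrative).
containers:
  - name: kube-state-metrics
    resources:
      requests:
        cpu: "500m"
        memory: 512Mi
      limits:
        cpu: "2"
        memory: 2Gi
```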

dashboard

Grafana dashboard ID for vmagent: 12683

V. Resource consumption

In general, 1 CPU core and 1 GB of memory are enough for vmagent. The cluster measured here has the following size:

Nodes: 400
Pods: 4500 (790 are business pods; the rest are system components and so on)
Deployments: 191 (131 are business applications; the rest are system components)
