Recently I ran into a host stuck in a hung state. At the time, the host's IPMI console showed the log line audit: backlog limit exceeded. For some reason the NMI signal was not sent in time to trigger a kernel core dump, so I could only troubleshoot from the information available. The notes below record what I learned about audit buffer configuration along the way.

Audit

The Linux kernel introduced audit in 2.6 to better record various security events in the system, such as file modification events and system call events.

Configuration methods

Audit rules are configured under the /etc/audit directory and fall into three types:

  • Control rules: set some behavior of the audit system and modify its default settings
  • File system rules: audit files, record access to special files or directories
  • System call rules: record the system call behavior of some special applications

Buffer configuration

  • When events cannot be logged correctly, the kernel prints messages like these:

    
    audit: audit_backlog=321 > audit_backlog_limit=320
    audit: audit_lost=44395 audit_rate_limit=0 audit_backlog_limit=320
    audit: backlog limit exceeded
    
  • What happens next is determined by audit's failure flag: when the flag is 1, a message is printed to the log; when it is 2, the kernel panics. The default is 1.

  • In the audit system, a socket buffer queue is used to hold events. Whenever a new event arrives, it is recorded and added to this queue. Two parameters control this behavior:

    • backlog_limit
      • The maximum queue length. When logging an event would push the queue length past this limit, a failure occurs.
    • rate_limit
      • The rate limit. If the number of events in one second exceeds this limit, the event is not added to the queue and a failure occurs.
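
When triaging, the counters in the second sample log line above are the most useful part. The following is a minimal sketch for pulling them out of kernel messages. It assumes the messages keep exactly the field order shown in the sample, and the function name audit_lost_summary is made up for illustration.

```shell
# Read kernel log lines on stdin and print the counters from every
# "audit_lost=..." line in a compact form. Assumes the field order
# seen in the sample log above.
audit_lost_summary() {
    sed -n 's/.*audit_lost=\([0-9][0-9]*\) audit_rate_limit=\([0-9][0-9]*\) audit_backlog_limit=\([0-9][0-9]*\).*/lost=\1 rate_limit=\2 backlog_limit=\3/p'
}

# Example: show the most recent counters from the kernel ring buffer.
# dmesg | audit_lost_summary | tail -n 1
```

Lines that do not contain these counters are skipped, so the whole dmesg output can be piped through it.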

Troubleshooting

  • If an event cannot be logged, a failure occurs, and how it is handled depends on the flag setting:
    • 0, silent: the failure is ignored silently
    • 1, printk (default): a message is printed to the system log, rate-limited by the following kernel parameters
      
      # sysctl -a | grep kernel.printk_rate
      kernel.printk_ratelimit = 5
      kernel.printk_ratelimit_burst = 10
      
    • 2, panic: the kernel panics
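
The three flag values can be captured in a small helper for use in scripts. This is only a sketch: the function name failure_mode_name is hypothetical, and the labels simply repeat the list above.

```shell
# Map an audit failure flag value to the behavior described above.
# Only 0, 1 and 2 are valid; anything else returns a non-zero status.
failure_mode_name() {
    case "$1" in
        0) echo "silent" ;;
        1) echo "printk" ;;
        2) echo "panic" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

# To change the flag at runtime (requires root):
# auditctl -f 1
```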

Buffer resource calculation

  • The queue lives in memory, so backlog_limit should be set to a reasonable value to avoid using too much memory. Each event takes around 9000 bytes, so a limit of 320 uses about 320 * 9000 = 2,880,000 bytes, or roughly 2.7 MiB.
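
The same estimate can be reproduced for any candidate limit. This is a rough sketch that takes the figure of about 9000 bytes per event from above as given; the function names are made up for illustration.

```shell
# Approximate per-event size in bytes, taken from the estimate above.
EVENT_BYTES=9000

# Worst-case queue memory in bytes for a given backlog_limit.
backlog_bytes() {
    echo $(( $1 * EVENT_BYTES ))
}

# The same figure in MiB, rounded to two decimals.
backlog_mib() {
    LC_ALL=C awk -v b="$(backlog_bytes "$1")" 'BEGIN { printf "%.2f\n", b / 1048576 }'
}

backlog_mib 320     # prints 2.75
```

By the same arithmetic, a limit of 8192 works out to about 70 MiB in the worst case.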

Possible problems encountered

  • audit: backlog limit exceeded
    • The IPMI console prints the above log, which means events are not being logged correctly and the current number of queued events exceeds backlog_limit. This may cause the system to hang or become unresponsive.

      The AWS KB describes the same risk: "An audit buffer queue at or exceeding capacity might also cause the instance to hang or remain in an unresponsive state."

    • Adjust backlog_limit to fit the actual workload, for example to 8192.

    • Possible causes:

      • The audit system parameters are not set properly.
      • The file system is frozen (usually because of a system snapshot).
    • With audit version 2.4.1-5, the configuration is:

      
      [root@dogfood-idc-elf-65 audit]# rpm -qa |grep audit
      audit-2.4.1-5.el7.x86_64
      audit-libs-2.4.1-5.el7.x86_64
      [root@dogfood-idc-elf-65 audit]# auditctl -s
      enabled 1
      flag 1
      pid 1093
      rate_limit 0
      backlog_limit 320
      lost 0
      backlog 0
      loginuid_immutable 0 unlocked
      
    • With audit version 2.7.6-3, the configuration is:

      
      [root@node90 14:16:09 ~]$rpm -q audit
      audit-2.7.6-3.el7.x86_64
      [root@node90 14:16:13 ~]$auditctl -s
      enabled 1
      failure 1
      pid 1133
      rate_limit 0
      backlog_limit 8192
      lost 0
      backlog 0
      loginuid_immutable 0 unlocked
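
Note that the two versions label the failure mode differently in the outputs above: flag in 2.4.1 and failure in 2.7.6. A script that checks the status on mixed hosts has to accept both. Here is a minimal sketch; the function name audit_status_get is hypothetical, and it assumes the "key value" layout shown in the two outputs.

```shell
# Print the value of one field from "auditctl -s" output read on stdin.
# Older audit releases label the failure mode "flag", newer ones
# "failure" (see the two outputs above), so both names are accepted.
audit_status_get() {
    awk -v want="$1" '
        BEGIN { if (want == "flag") want = "failure" }
        { key = $1 }
        key == "flag" { key = "failure" }
        key == want   { print $2; exit }
    '
}

# Example (requires root):
# auditctl -s | audit_status_get backlog_limit
```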
      

Note: the auditd unit sets RefuseManualStop=yes, so systemctl cannot be used to stop or restart the service. Use the service command instead: service auditd restart
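
To make a larger backlog_limit survive a restart, the -b line in the rules file has to change as well. The sketch below sets or replaces that line in a given file. The function name set_backlog_rule is made up, sed -i assumes GNU sed, and the path in the usage comment assumes the rules.d layout used on RHEL 7; adjust it if your rules live elsewhere.

```shell
# Set or replace the "-b <limit>" line in an audit rules file.
# $1 is the rules file, $2 is the new backlog limit.
set_backlog_rule() {
    rules_file=$1
    limit=$2
    if grep -q '^-b[[:space:]]' "$rules_file" 2>/dev/null; then
        # A backlog line already exists: rewrite it in place.
        sed -i "s/^-b[[:space:]].*/-b $limit/" "$rules_file"
    else
        # No backlog line yet: append one.
        printf '%s\n' "-b $limit" >> "$rules_file"
    fi
}

# Example (requires root; path is an assumption about the host layout):
# set_backlog_rule /etc/audit/rules.d/audit.rules 8192
# augenrules --load
```

For a one-off change on a running system, auditctl -b 8192 applies the new limit immediately but does not persist it.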

Summary

Red Hat recommends configuring audit-related parameters sensibly on production servers to avoid unexpected problems caused by unreasonable values. However, the impact of an audit failure is not very clear. The AWS KB states:

An audit buffer queue at or exceeding capacity might also cause the instance to hang or remain in an unresponsive state.

The Red Hat KB, however, gives no detail about what the impact might be, so this needs further investigation.