Recently, I ran into a case where a host hung, and the host's IPMI console showed this log at the time:
audit: backlog limit exceeded
For some reason, the NMI signal was not sent in time to trigger a kernel core dump, so I could only troubleshoot from the information available. These notes record what I learned about the audit buffer configuration along the way.
The Linux kernel introduced audit in 2.6 to better record security events in the system, such as file modifications and system calls. Audit rules fall into three categories:
- Control rules: set some behavior of the audit system and modify its default settings
- File system rules: audit files, record access to special files or directories
- System call rules: record the system call behavior of some special applications
When events cannot be logged correctly, the outcome is determined by audit's failure flag: with flag == 1 a message is printed to the log; with flag == 2 the kernel panics. The default is flag = 1.
In the audit system, a socket buffer queue holds events. Whenever a new event is generated, it is logged and prepared to be added to this queue. Several parameters control this behavior:
- backlog_limit: the maximum queue length; when logging an event would push the queue length past the limit, a failure occurs
- rate_limit: if the number of events in one second exceeds the limit, the event is not queued and a failure occurs
- failure flag: if an event cannot be logged, a failure occurs and this flag determines how it is handled
  - 0, silent: the failure is handled silently
  - 1, printk (default): prints to the system log, subject to the kernel's print rate limits
  - 2, panic: kernel panic
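The interaction of these three parameters can be sketched with a toy model. This is only an illustration under assumptions, not the kernel's implementation: the class `ToyAuditQueue`, the `KernelPanic` exception, and the treatment of 0 as "no limit" are hypothetical names and simplifications chosen for the example.

```python
from collections import deque
from dataclasses import dataclass, field

SILENT, PRINTK, PANIC = 0, 1, 2  # values of the failure flag


class KernelPanic(Exception):
    """Stands in for a kernel panic in this toy model."""


@dataclass
class ToyAuditQueue:
    backlog_limit: int = 320  # maximum queue length (0 = no limit, an assumption)
    rate_limit: int = 0       # events per second (0 = no limit, an assumption)
    failure: int = PRINTK     # how a lost event is handled
    queue: deque = field(default_factory=deque)
    log: list = field(default_factory=list)
    lost: int = 0
    _second: int = -1
    _seen: int = 0

    def _fail(self, reason):
        """Count the lost event and apply the failure flag."""
        self.lost += 1
        if self.failure == PRINTK:
            self.log.append(f"audit: {reason}")
        elif self.failure == PANIC:
            raise KernelPanic(f"audit: {reason}")
        # SILENT: drop the event without any message

    def submit(self, event, now):
        """Try to queue one event at time `now` (seconds); return True if queued."""
        second = int(now)
        if second != self._second:
            # A new one-second window starts: reset the per-second counter.
            self._second, self._seen = second, 0
        self._seen += 1
        if self.rate_limit and self._seen > self.rate_limit:
            self._fail("rate limit exceeded")
            return False
        if self.backlog_limit and len(self.queue) >= self.backlog_limit:
            self._fail("backlog limit exceeded")
            return False
        self.queue.append(event)
        return True
```

With `ToyAuditQueue(backlog_limit=2)`, the third `submit` call without anything draining the queue returns `False` and appends `audit: backlog limit exceeded` to `log`; with `failure=PANIC` the same call raises `KernelPanic` instead.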
Buffer resource calculation
- The queue lives in memory, so backlog_limit needs a reasonable value to avoid occupying too much memory. Each event takes around 9000 bytes, so a limit of 320 occupies 320 * 9000 = 2,880,000 bytes, or about 2.75 MiB.
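The same arithmetic can be repeated for other candidate limits. This is a rough estimate only: the 9000-byte figure is the approximate per-event size quoted above, and the function names are made up for the example.

```python
EVENT_BYTES = 9000  # approximate size of one queued event, as quoted above


def backlog_memory_bytes(backlog_limit, event_bytes=EVENT_BYTES):
    """Worst-case memory held by a completely full audit queue."""
    return backlog_limit * event_bytes


def to_mib(num_bytes):
    """Convert bytes to MiB (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


for limit in (320, 8192):
    size = backlog_memory_bytes(limit)
    print(f"backlog_limit={limit}: {size} bytes = {to_mib(size):.2f} MiB")
```

For a limit of 8192 this gives 73,728,000 bytes, roughly 70 MiB, which is worth keeping in mind on hosts with little memory.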
Possible problems encountered
audit: backlog limit exceeded
The IPMI console prints the above log, which indicates that events are not being logged correctly and that the current number of queued events exceeds backlog_limit.
An audit buffer queue at or exceeding capacity might also cause the instance to hang or remain in an unresponsive state.
Adjust backlog_limit to suit the actual workload, for example to 8192.
Possible causes:
- Audit system parameters are not set properly.
- File system freeze (usually due to a system snapshot)
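To judge how close the queue is to backlog_limit before changing it, the counters reported by `auditctl -s` can be compared. The sketch below parses a pasted status dump rather than calling the tool. The sample text, the helper names, and the 80% warning threshold are assumptions for illustration, and the exact output layout of `auditctl -s` differs between audit versions.

```python
import re

# Hypothetical status dump; real values and layout depend on the audit version.
SAMPLE_STATUS = """\
enabled 1
failure 1
pid 1234
rate_limit 0
backlog_limit 320
lost 57
backlog 321
"""


def parse_audit_status(text):
    """Collect 'name value' or 'name=value' integer pairs into a dict."""
    return {key: int(value) for key, value in re.findall(r"(\w+)[ =](\d+)", text)}


def backlog_report(status, warn_ratio=0.8):
    """Summarize queue pressure from the parsed status counters."""
    limit = status["backlog_limit"]
    backlog = status["backlog"]
    usage = backlog / limit if limit else 0.0  # a limit of 0 is treated as unlimited
    return {
        "usage": usage,
        "lost": status.get("lost", 0),
        "near_limit": usage >= warn_ratio,
    }


report = backlog_report(parse_audit_status(SAMPLE_STATUS))
print(report)
```

A non-zero `lost` counter together with a backlog at or above the limit suggests the queue is too small for the event rate. `auditctl -b 8192` raises the limit at runtime; making the change persistent depends on where the installed audit version keeps its rules.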
For audit version 2.4.1-5, the configuration is:
For audit version 2.7.6-3, the configuration is:
Note: the auditd unit sets RefuseManualStop=yes, which refuses manual stop requests (and therefore restarts), so you cannot use systemctl to stop or restart the service. You can use service instead: service auditd restart
Red Hat recommends configuring the audit-related parameters sensibly on production servers to avoid unexpected behavior caused by unreasonable values. However, the impact of an audit failure is not very clear: the AWS KB mentions a possible hang, while the Red Hat KB gives no detail about what the impact might be, so this needs further investigation.