Recently, a colleague ran into a strange problem on one of his production servers: every command he executed failed with the error “fork: Unable to allocate memory”. The problem had only started recently. Rebooting made it go away the first few times, but it kept coming back every 2-3 days.

# service docker stop
-bash fork: Unable to Allocate Memory
# vi 1.txt
-bash fork: Unable to Allocate Memory

When you see this message, your first reaction is naturally to suspect that the machine really is out of memory. But when we checked the memory usage, that was not the case at all: plenty of memory was still free. (Retrying a command several times occasionally let it run successfully.)

[Figure: output of the Linux free command]

After some discussion, three ideas came to mind.

  1. Under the NUMA architecture, could the process have been bound to a node when it started, so that only the memory of that one node was usable?
  2. Under the NUMA architecture, if all of the memory modules are inserted into the slots of one node, the other nodes will be short of memory.
  3. Check how many processes and threads currently exist, and whether that number exceeds the maximum limit.

After a period of troubleshooting, the cause was finally found and the problem was solved. To give you the conclusion up front: the earlier guesses about a NUMA memory shortage were wrong. The real cause was number 3 above. Some Java processes on this server had created too many threads, and that, not a real lack of memory, triggered the error.
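To check this kind of suspicion quickly, one option is to compare the number of tasks that currently exist with the PID limit. The sketch below is only an illustration and not the procedure used in the original troubleshooting: the function names are made up here, and it assumes a Linux procfs in which the fourth field of /proc/loadavg has the form runnable/total and /proc/sys/kernel/pid_max holds the limit.

```python
from pathlib import Path


def parse_total_tasks(loadavg_text):
    """Return the total number of tasks (processes plus threads).

    The fourth field of /proc/loadavg looks like "2/31875":
    currently runnable tasks, a slash, then all existing tasks.
    """
    fourth_field = loadavg_text.split()[3]
    return int(fourth_field.split("/")[1])


def pid_usage(proc_root="/proc"):
    """Return (existing tasks, pid_max) read from a procfs mounted at proc_root."""
    root = Path(proc_root)
    total = parse_total_tasks((root / "loadavg").read_text())
    pid_max = int((root / "sys" / "kernel" / "pid_max").read_text())
    return total, pid_max


# Parse a made-up sample line, so the example runs on any machine.
sample = "0.50 0.40 0.30 2/31875 12345"
print(parse_total_tasks(sample))  # prints 31875
```

On an affected host, a task count from pid_usage() that sits close to pid_max would point toward PID exhaustion rather than a real memory shortage.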

I. The underlying process analysis

What makes this problem tricky is that the Linux error message is misleading. Nobody thought of the number of processes first, which is why the troubleshooting turned out to be so long and tortuous.

So I want to dig into the kernel to see how this error is actually produced, and why such an inappropriate message gets reported. Along the way, we will also get to understand how a process is created.

The operating system of the online server in question is CentOS 7.8, and the corresponding kernel version is 3.10.0-1127.

1.1 Anatomy of do_fork

In the Linux kernel, the creation of both processes and threads ends up in the core function do_fork. Inside this function, the kernel data objects needed by the new process (thread) are created by copying.

//file:kernel/fork.c
long do_fork(unsigned long clone_flags, ...)
{
 //The so-called "creation" is really a copy based on the current process
 //Note: the second-to-last argument passed in is NULL
 p = copy_process(clone_flags, stack_start, stack_size,
    child_tidptr, NULL, trace);
 ...
}

The core of process creation lies in copy_process. Let’s look at its source code.

//file:kernel/fork.c
static struct task_struct *copy_process(unsigned long clone_flags, 
    ...
    struct pid *pid,
    int trace)
{
 //The kernel data structure that represents a process (thread) is task_struct
 struct task_struct *p;

 ......

 //Create the new process's core data structure by copying
 p = dup_task_struct(current);

 //Create the new process's other core data by copying
 retval = copy_semundo(clone_flags, p);
 retval = copy_files(clone_flags, p);
 retval = copy_fs(clone_flags, p);
 retval = copy_sighand(clone_flags, p);
 retval = copy_mm(clone_flags, p);
 retval = copy_namespaces(clone_flags, p);
 retval = copy_io(clone_flags, p);
 retval = copy_thread(clone_flags, stack_start, stack_size, p);

 //Note this part!
 //Allocate the integer pid value
 if (pid != &init_struct_pid) {
  retval = -ENOMEM;
  pid = alloc_pid(p->nsproxy->pid_ns);
  if (!pid)
   goto bad_fork_cleanup_io;
 }

 //Store the allocated integer pid value in the new process's task_struct
 p->pid = pid_nr(pid);
 p->tgid = p->pid;
 if (clone_flags & CLONE_THREAD)
  p->tgid = current->tgid;

bad_fork_cleanup_io:
 if (p->io_context)
  exit_io_context(p);
......
fork_out:
 return ERR_PTR(retval); 
}

As you can see from the code above, the Linux kernel builds up the new process’s kernel objects by calling a series of copy_xxx functions, covering the mm struct, the namespaces, and so on.

Let’s focus on the part related to alloc_pid. Its purpose is to allocate a pid object, and if the allocation fails, an error is returned. Note one detail of this code: no matter what kind of failure alloc_pid runs into, the error code is hard-coded to -ENOMEM. To make this easier to see, here is that logic again on its own.

//file:kernel/fork.c
static struct task_struct *copy_process(...){
 ......

 //Allocate the integer pid value
 if (pid != &init_struct_pid) {
  retval = -ENOMEM;
  pid = alloc_pid(p->nsproxy->pid_ns);
  if (!pid)
   goto bad_fork_cleanup_io;
 }
bad_fork_cleanup_io:
...
fork_out:
 return ERR_PTR(retval); 
} 

The error code is set to -ENOMEM (retval = -ENOMEM) right before the call to alloc_pid, so whenever alloc_pid fails, ENOMEM is what gets returned to the upper layer, no matter what actually caused alloc_pid to fail.

Let’s look at the definition of ENOMEM. It stands for “Out of memory”. (The kernel only returns the error code and the application layer supplies the message text, which is why the message actually displayed is “unable to allocate memory”.)

//file:include/uapi/asm-generic/errno-base.h
#define ENOMEM  12 /* Out of memory */
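
The split between the kernel’s numeric code and the user-space message text can be observed from any language that exposes errno. Here is a small sketch in Python, used only for illustration: the kernel reports the number, and the C library supplies the wording. The exact text depends on the C library and the locale, so no particular message is assumed here.

```python
import errno
import os

# The kernel only reports a number; errno.ENOMEM is that number on this platform.
print(errno.ENOMEM)

# The symbolic name that corresponds to the number.
print(errno.errorcode[errno.ENOMEM])  # prints ENOMEM

# The human-readable text is supplied in user space by the C library.
print(os.strerror(errno.ENOMEM))
```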

I have to say, this error code from the kernel is really problematic. It causes users a lot of confusion.

1.2 Causes of alloc_pid failure

So in which cases does allocating a pid fail? Let’s look at the source code of alloc_pid.

//file:kernel/pid.c
struct pid *alloc_pid(struct pid_namespace *ns)
{
 //Case 1: allocating the pid kernel object fails
 pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
 if (!pid)
  goto out;

 //Case 2: allocating the integer pid number fails
 //Call alloc_pidmap to allocate a free pid
 tmp = ns;
 pid->level = ns->level;
 for (i = ns->level; i >= 0; i--) {
  nr = alloc_pidmap(tmp);
  if (nr < 0)
   goto out_free;

  pid->numbers[i].nr = nr;
  pid->numbers[i].ns = tmp;
  tmp = tmp->parent;
 }

 ...
out:
 return pid;
out_free:
 //Undo the partial allocation, then return NULL to signal failure
 ...
 pid = NULL;
 goto out;
}

What we usually call pid is not a simple integer type in the kernel, but a small structure (struct pid), as follows.

//file:include/linux/pid.h
struct pid
{
 atomic_t count;
 unsigned int level;
 struct hlist_head tasks[PIDTYPE_MAX];
 struct rcu_head rcu;
 struct upid numbers[1];
};

So the kernel first has to allocate a piece of memory to hold this small object. The first failure case is when that memory allocation fails, in which case alloc_pid returns a failure. This really is a memory problem, and there is nothing wrong with the kernel returning ENOMEM for it.

In the second case, alloc_pidmap requests a process number for the new process, which is what we usually call the PID. If that request fails, alloc_pid also returns a failure.

Here, only the allocation of a process number has failed, which has nothing to do with running out of memory. Yet the kernel still returns an ENOMEM (Out of memory) error to the upper layer. This is quite unreasonable.
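
The way both failure cases collapse into the same error code can be mimicked with a toy model. The sketch below is not kernel code: the Python class and functions are hypothetical stand-ins that only borrow the kernel’s names, and the PID pool is deliberately tiny so that it runs out quickly.

```python
import errno


class PidNamespace:
    """Hypothetical stand-in for struct pid_namespace: a small pool of PID numbers."""

    def __init__(self, pid_max):
        self.pid_max = pid_max
        self.used = set()

    def alloc_pidmap(self):
        """Return a free PID number, or -1 when every number is taken."""
        for nr in range(1, self.pid_max):
            if nr not in self.used:
                self.used.add(nr)
                return nr
        return -1


def alloc_pid(ns, memory_available=True):
    """Return a PID number, or None on either kind of failure (like a NULL return)."""
    if not memory_available:   # case 1: no memory for the pid object
        return None
    nr = ns.alloc_pidmap()
    if nr < 0:                 # case 2: no free PID number left
        return None
    return nr


def copy_process(ns, memory_available=True):
    """Return a PID number on success, or a negative errno value on failure."""
    retval = -errno.ENOMEM     # preset before the call, as in the 3.10 excerpt
    pid = alloc_pid(ns, memory_available)
    if pid is None:
        return retval          # the real reason for the failure is lost here
    return pid


ns = PidNamespace(pid_max=4)   # only PIDs 1, 2 and 3 are available
results = [copy_process(ns) for _ in range(4)]
print(results == [1, 2, 3, -errno.ENOMEM])  # prints True

out_of_memory = copy_process(PidNamespace(pid_max=4), memory_available=False)
print(out_of_memory == -errno.ENOMEM)       # prints True
```

The caller sees the same value in both situations, so from user space a full PID pool cannot be told apart from a genuine allocation failure.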

Here is one more thing we learned along the way: a process does not necessarily request just one process number. Through a for loop, it requests one for each pid namespace level it belongs to.

//file:kernel/pid.c
struct pid *alloc_pid(struct pid_namespace *ns)
{
 //Call alloc_pidmap to allocate a free pid
 tmp = ns;
 pid->level = ns->level;
 for (i = ns->level; i >= 0; i--) {
  nr = alloc_pidmap(tmp);
  if (nr < 0)
   goto out_free;

  pid->numbers[i].nr = nr;
  pid->numbers[i].ns = tmp;
  tmp = tmp->parent;
 }
}

If the process being created lives in a container, it has to request at least two PID numbers: one is its process number in the container’s namespace, and the other is its process number in the root namespace (the host).

This is in line with our usual experience. Every process in the container is actually visible to us in the host. But the process number you see in the container is generally different from the one you see on the host. For example, if the pid of a process in the container is 5, and in the host namespace it is 1256, then the object of the process in the kernel will look something like this.

[Figure: the process’s kernel object, with its PID number in each namespace]
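
The per-level loop in alloc_pid can be imitated with a small toy model. This is a sketch under stated assumptions rather than kernel code: the PidNamespace class is a hypothetical stand-in, and the first_pid parameter exists only so that the output reproduces the example numbers 5 and 1256 used above.

```python
from itertools import count


class PidNamespace:
    """Hypothetical stand-in for a pid namespace that hands out increasing numbers."""

    def __init__(self, parent=None, first_pid=1):
        self.parent = parent
        self.level = 0 if parent is None else parent.level + 1
        self._next = count(first_pid)

    def alloc_pidmap(self):
        """Return the next PID number in this namespace."""
        return next(self._next)


def alloc_pid(ns):
    """Allocate one PID number per namespace level, innermost level first."""
    numbers = [None] * (ns.level + 1)
    tmp = ns
    for i in range(ns.level, -1, -1):
        numbers[i] = tmp.alloc_pidmap()
        tmp = tmp.parent
    return numbers


host = PidNamespace(first_pid=1256)                  # root namespace, level 0
container = PidNamespace(parent=host, first_pid=5)   # container namespace, level 1

numbers = alloc_pid(container)
print(numbers)  # prints [1256, 5]
```

Index 0 holds the number seen on the host and index 1 the number seen inside the container, mirroring the numbers[] array in struct pid. A process created directly on the host would get a single entry.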

II. Has the new version improved?

The next thing that occurred to me was that the kernel version we were using might simply be too old. (The source I have been reading so far is kernel 3.10.1, chosen to stay close to the version on our online server.)

So I turned to the much newer Linux 5.16.11 to see whether it had fixed the inappropriate error.

A recommended tool is https://elixir.bootlin.com/ where you can browse the source code of any Linux kernel version. It is very handy when you only want a quick look.

//file:kernel/fork.c
static __latent_entropy struct task_struct *copy_process(...)
{
 ...
 pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid,
    args->set_tid_size);
 if (IS_ERR(pid)) {
  retval = PTR_ERR(pid);
  goto bad_fork_cleanup_thread;
 }
}

This looks promising: retval is no longer hard-coded to ENOMEM, but is set from the actual error returned by alloc_pid. Let’s see whether alloc_pid sets the error code correctly.

I was a little disappointed when I opened the alloc_pid source code and saw this big comment.

//file:kernel/pid.c
struct pid *alloc_pid(struct pid_namespace *ns, ...)
{
 /*
  * ENOMEM is not the most obvious choice especially for the case
  * where the child subreaper has already exited and the pid
  * namespace denies the creation of any new processes. But ENOMEM
  * is what we have exposed to userspace for a long time and it is
  * documented behavior for pid namespaces. So we can't easily
  * change it even if there were an error code better suited.
  */
 retval = -ENOMEM;
 .......
 
 return ERR_PTR(retval);
}

In other words: ENOMEM is not the most obvious choice, especially for the case where the child subreaper has already exited and the pid namespace refuses to create any new processes. But ENOMEM has been exposed to userspace for a long time and is documented behavior for pid namespaces, so it cannot easily be changed even if a better-suited error code exists.

So this is not well addressed in the latest version either.

Conclusion

When Linux creates a process and there are not enough pids, the error returned is “out of memory”. This inappropriate error has confused a lot of people.

After today’s analysis, the next time we run into this kind of out-of-memory error, we should take care not to be fooled by the kernel, and first check whether there are too many processes (threads).

As for how to solve this problem, you can increase the number of available pids by modifying the kernel parameters (/proc/sys/kernel/pid_max).
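
A minimal sketch of inspecting and raising that limit through the file is shown below. The helper names are made up for illustration, and the value 65536 is an arbitrary example rather than a recommendation. Writing to the real file requires root, and a value written this way only lasts until the next reboot. The demonstration at the end runs against a temporary stand-in file, so nothing on the system is changed.

```python
import tempfile
from pathlib import Path

PID_MAX_PATH = "/proc/sys/kernel/pid_max"


def read_pid_max(path=PID_MAX_PATH):
    """Return the current PID limit as an integer."""
    return int(Path(path).read_text())


def raise_pid_max(new_value, path=PID_MAX_PATH):
    """Raise the PID limit to new_value and return the previous limit.

    Refuses to lower the limit, since this helper is only meant for raising it.
    """
    current = read_pid_max(path)
    if new_value <= current:
        raise ValueError(f"new value {new_value} is not above the current {current}")
    Path(path).write_text(f"{new_value}\n")
    return current


# Demonstration against a temporary stand-in file instead of the real procfs entry.
with tempfile.TemporaryDirectory() as tmp:
    fake = Path(tmp) / "pid_max"
    fake.write_text("32768\n")
    old = raise_pid_max(65536, path=fake)
    print(old, read_pid_max(fake))  # prints 32768 65536
```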

But I think the most fundamental fix is to find out why the system has so many processes (threads) and then eliminate the cause. The default limit of 20,000 to 30,000 processes is already more than enough for most servers, so exceeding even that number almost certainly means something is wrong.
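
One way to find the culprits is to rank processes by their thread count. The sketch below is a hedged illustration, not the method used in the original investigation: the function names are hypothetical, and it assumes that each /proc/<pid>/status file contains Name: and Threads: lines, as on Linux.

```python
from pathlib import Path


def status_field(status_text, key):
    """Return the value of a "Key:<tab>value" line from a status file, or None."""
    prefix = key + ":"
    for line in status_text.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None


def thread_count(status_text):
    """Return the number of threads reported in a status file (0 if missing)."""
    value = status_field(status_text, "Threads")
    return int(value) if value else 0


def top_thread_owners(proc_root="/proc", limit=5):
    """Return up to `limit` tuples (threads, pid, name), largest thread count first."""
    rows = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries
        try:
            text = (entry / "status").read_text()
        except OSError:
            continue  # the process exited, or its status file is not readable
        rows.append((thread_count(text), int(entry.name), status_field(text, "Name")))
    return sorted(rows, reverse=True)[:limit]


# Parse a made-up status snippet, so the example runs on any machine.
sample_status = "Name:\tjava\nPid:\t4242\nThreads:\t18000\n"
print(status_field(sample_status, "Name"), thread_count(sample_status))  # prints java 18000
```

Running top_thread_owners() on the affected host would list the processes holding the most threads, which is where the investigation would naturally start.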