We all know that the smallest unit of scheduling in k8s is the Pod, and that each Pod has a so-called infra container named pause. So what exactly is the pause container? What does it look like, and what does it do?

Analyze the source code

From the official pause.c.

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STRINGIFY(x) #x
#define VERSION_STRING(x) STRINGIFY(x)

#ifndef VERSION
#define VERSION HEAD
#endif

static void sigdown(int signo) {
  psignal(signo, "Shutting down, got signal");
  exit(0);
}

static void sigreap(int signo) {
  while (waitpid(-1, NULL, WNOHANG) > 0)
    ;
}

int main(int argc, char **argv) {
  int i;
  for (i = 1; i < argc; ++i) {
    if (!strcasecmp(argv[i], "-v")) {
      printf("pause.c %s\n", VERSION_STRING(VERSION));
      return 0;
    }
  }

  if (getpid() != 1)
    /* Not an error because pause sees use outside of infra containers. */
    fprintf(stderr, "Warning: pause should be the first process\n");

  if (sigaction(SIGINT, &(struct sigaction){.sa_handler = sigdown}, NULL) < 0)
    return 1;
  if (sigaction(SIGTERM, &(struct sigaction){.sa_handler = sigdown}, NULL) < 0)
    return 2;
  if (sigaction(SIGCHLD, &(struct sigaction){.sa_handler = sigreap,
                                             .sa_flags = SA_NOCLDSTOP},
                NULL) < 0)
    return 3;

  for (;;)
    pause();
  fprintf(stderr, "Error: infinite loop terminated\n");
  return 42;
}

You can see that the Pause container does the following two things.

  1. It registers signal handlers for two kinds of signals: termination signals and child signals. On SIGINT or SIGTERM it exits immediately. On SIGCHLD it calls waitpid to reap exited child processes.
  2. The main process calls pause() in a for loop. pause() puts the process to sleep until it is terminated or a signal is caught; when a handler returns, pause() returns too and the loop calls it again.

The suspicious waitpid

My C fundamentals are still shaky. I always thought waitpid was just the parent process waiting to reap an exited child process, but is that really all it does?

zerun.dong$ man waitpid
WAIT(2)                     BSD System Calls Manual                    WAIT(2)

NAME
     wait, wait3, wait4, waitpid -- wait for process termination

SYNOPSIS
     #include <sys/wait.h>

The man page on macOS says "wait for process termination", which matches that understanding. Now log on to Ubuntu 18.04 and check there.

:~# man waitpid
WAIT(2)                                                      Linux Programmer's Manual                                                      WAIT(2)

NAME
       wait, waitpid, waitid - wait for process to change state

In the Linux man page the summary becomes "wait for process to change state": it waits for any state change, not just termination!

All of these system calls are used to wait for state changes in a child of the calling process, and obtain information about the child whose
state has changed.  A state change is considered to be: the child terminated; the child was stopped by a signal; or the child was resumed by
a  signal.   In the case of a terminated child, performing a wait allows the system to release the resources associated with the child; if a
wait is not performed, then the terminated child remains in a "zombie" state (see NOTES below).

The man page also thoughtfully provides test code.

#include <sys/wait.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
   pid_t cpid, w;
   int wstatus;

   cpid = fork();
   if (cpid == -1) {
       perror("fork");
       exit(EXIT_FAILURE);
   }

   if (cpid == 0) {            /* Code executed by child */
       printf("Child PID is %ld\n", (long) getpid());
       if (argc == 1)
           pause();                    /* Wait for signals */
       _exit(atoi(argv[1]));

   } else {                    /* Code executed by parent */
       do {
           w = waitpid(cpid, &wstatus, WUNTRACED | WCONTINUED);
           if (w == -1) {
               perror("waitpid");
               exit(EXIT_FAILURE);
           }

           if (WIFEXITED(wstatus)) {
               printf("exited, status=%d\n", WEXITSTATUS(wstatus));
           } else if (WIFSIGNALED(wstatus)) {
               printf("killed by signal %d\n", WTERMSIG(wstatus));
           } else if (WIFSTOPPED(wstatus)) {
               printf("stopped by signal %d\n", WSTOPSIG(wstatus));
           } else if (WIFCONTINUED(wstatus)) {
               printf("continued\n");
           }
       } while (!WIFEXITED(wstatus) && !WIFSIGNALED(wstatus));
       exit(EXIT_SUCCESS);
   }
}

The child process sleeps in pause(), while the parent process waits for the child to change state. Let's open one session to run the code and another session to send signals.

~$ ./a.out
Child PID is 70718
stopped by signal 19

continued
stopped by signal 19
continued
^C

In the second session:

~# ps aux | grep a.out
zerun.d+   70717  0.0  0.0   4512   744 pts/0    S+   06:48   0:00 ./a.out
zerun.d+   70718  0.0  0.0   4512    72 pts/0    S+   06:48   0:00 ./a.out
root       71155  0.0  0.0  16152  1060 pts/1    S+   06:49   0:00 grep --color=auto a.out
~#
~# kill -STOP 70718
~#
~# ps aux | grep a.out
zerun.d+   70717  0.0  0.0   4512   744 pts/0    S+   06:48   0:00 ./a.out
zerun.d+   70718  0.0  0.0   4512    72 pts/0    T+   06:48   0:00 ./a.out
root       71173  0.0  0.0  16152  1060 pts/1    S+   06:49   0:00 grep --color=auto a.out
~#
~# kill -CONT 70718
~#
~# ps aux | grep a.out
zerun.d+   70717  0.0  0.0   4512   744 pts/0    S+   06:48   0:00 ./a.out
zerun.d+   70718  0.0  0.0   4512    72 pts/0    S+   06:48   0:00 ./a.out
root       71296  0.0  0.0  16152  1056 pts/1    R+   06:49   0:00 grep --color=auto a.out

Sending the STOP and CONT signals to the child process stops and resumes it, and the parent's waitpid reports each state change.

It seems the same C function is described somewhat differently on different systems. I made a fuss over nothing.

Which namespaces are shared

When Pods come up, most people know that containers within the same Pod can reach each other via localhost. If you picture a k8s cluster as a distributed operating system, then a Pod corresponds to a process group, whose members must share certain things. So which namespaces are shared by default?

Build an environment with minikube, and first look at the Pod definition file.

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  shareProcessNamespace: true
  containers:
  - name: nginx
    image: nginx
  - name: shell
    image: busybox
    securityContext:
      capabilities:
        add:
        - SYS_PTRACE
    stdin: true
    tty: true

From 1.17 onwards the shareProcessNamespace field controls whether the PID namespace is shared within the Pod. The default is false, so set the field explicitly if you need sharing.

~$ kubectl attach -it nginx -c shell
If you don't see a command prompt, try pressing enter.
/ # ps aux
PID   USER     TIME  COMMAND
    1 root      0:00 /pause
    8 root      0:00 nginx: master process nginx -g daemon off;
   41 101       0:00 nginx: worker process
   42 root      0:00 sh
   49 root      0:00 ps aux

Attaching to the shell container shows every process in the Pod, and only the pause process is PID 1, the init process.

/ # kill -HUP 8
/ # ps aux
PID   USER     TIME  COMMAND
    1 root      0:00 /pause
    8 root      0:00 nginx: master process nginx -g daemon off;
   42 root      0:00 sh
   50 101       0:00 nginx: worker process
   51 root      0:00 ps aux

This tests sending a HUP signal to the nginx master, which restarts its worker process (the worker PID changes from 41 to 50).

If the PID namespace is not shared, the main process of each container is PID 1 in its own namespace. What are the effects of sharing the PID namespace? Refer to this article.

  1. Container processes no longer have PID 1. Some container images refuse to start without PID 1 (for example, containers using systemd), or run commands like kill -HUP 1 to signal the container process. In Pods with a shared process namespace, kill -HUP 1 signals the Pod sandbox (/pause in the above example).
  2. Processes are visible to other containers in the Pod. This includes all information visible in /proc, such as passwords passed as arguments or environment variables. These are protected only by regular Unix permissions.
  3. Container file systems are visible to other containers in the Pod through the /proc/$pid/root link. This makes debugging easier, but it also means that file system security is protected only by file system permissions.

Look up the process IDs of nginx and sh on the host, then view their namespace IDs via /proc/<pid>/ns.

~# ls -l /proc/140756/ns
total 0
lrwxrwxrwx 1 root root 0 May  6 09:08 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 May  6 09:08 ipc -> 'ipc:[4026532497]'
lrwxrwxrwx 1 root root 0 May  6 09:08 mnt -> 'mnt:[4026532561]'
lrwxrwxrwx 1 root root 0 May  6 09:08 net -> 'net:[4026532500]'
lrwxrwxrwx 1 root root 0 May  6 09:08 pid -> 'pid:[4026532498]'
lrwxrwxrwx 1 root root 0 May  6 09:08 pid_for_children -> 'pid:[4026532498]'
lrwxrwxrwx 1 root root 0 May  6 09:08 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 May  6 09:08 uts -> 'uts:[4026532562]'
~# ls -l /proc/140879/ns
total 0
lrwxrwxrwx 1 root root 0 May  6 09:08 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 May  6 09:08 ipc -> 'ipc:[4026532497]'
lrwxrwxrwx 1 root root 0 May  6 09:08 mnt -> 'mnt:[4026532563]'
lrwxrwxrwx 1 root root 0 May  6 09:08 net -> 'net:[4026532500]'
lrwxrwxrwx 1 root root 0 May  6 09:08 pid -> 'pid:[4026532498]'
lrwxrwxrwx 1 root root 0 May  6 09:08 pid_for_children -> 'pid:[4026532498]'
lrwxrwxrwx 1 root root 0 May  6 09:08 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 May  6 09:08 uts -> 'uts:[4026532564]'

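The cgroup, ipc, net, pid and user namespace IDs are identical for both processes, so those namespaces are shared here, while mnt and uts differ. The cgroup and user IDs look like those of the host's initial namespaces, so these two are probably shared with the host rather than created for the Pod. This only holds for this test setup.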

Kill the Pause container

Now test how k8s handles the Pod when the pause container is killed. Build the environment with minikube again and look at the Pod definition file first.

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  shareProcessNamespace: false
  containers:
  - name: nginx
    image: nginx
  - name: shell
    image: busybox
    securityContext:
      capabilities:
        add:
        - SYS_PTRACE
    stdin: true
    tty: true

After the Pod starts, look up the pause process ID, kill it, and then check the Pod's events.

~$ kubectl describe pod nginx
......
Events:
  Type    Reason          Age                   From     Message
  ----    ------          ----                  ----     -------
  Normal  SandboxChanged  3m1s (x2 over 155m)   kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal  Killing         3m1s (x2 over 155m)   kubelet  Stopping container nginx
  Normal  Killing         3m1s (x2 over 155m)   kubelet  Stopping container shell
  Normal  Pulling         2m31s (x3 over 156m)  kubelet  Pulling image "nginx"
  Normal  Pulling         2m28s (x3 over 156m)  kubelet  Pulling image "busybox"
  Normal  Created         2m28s (x3 over 156m)  kubelet  Created container nginx
  Normal  Started         2m28s (x3 over 156m)  kubelet  Started container nginx
  Normal  Pulled          2m28s                 kubelet  Successfully pulled image "nginx" in 2.796081224s
  Normal  Created         2m25s (x3 over 156m)  kubelet  Created container shell
  Normal  Started         2m25s (x3 over 156m)  kubelet  Started container shell
  Normal  Pulled          2m25s                 kubelet  Successfully pulled image "busybox" in 2.856292466s

When k8s detects that the pause container is in an abnormal state, it re-creates the Pod sandbox and restarts the Pod's containers. This is not hard to understand: whether or not the PID namespace is shared, once the infra container exits the Pod has to be restarted, since the Pod's life cycle is tied to that of the infra container.