The following is a standard explanation of containerization.

A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings.

Containers differ from virtual machines, which embed an entire guest operating system and run it on the host through a hypervisor. Containers, by contrast, share the host kernel and achieve lightweight environment isolation and resource limitation through kernel features such as namespaces and cgroups.

Namespace

In the Linux operating system, namespaces are a mechanism for isolating processes. If a process belongs to a namespace, what it can see is limited to that namespace. To put it another way, a process inside a namespace cannot see, and therefore cannot directly affect, resources of that type outside the namespace. The main types of namespaces in common use are as follows.

  • Mount (mount): isolates the set of mount points, so each namespace has its own view of mounted file systems, devices, etc.
  • Process ID (PID): a process's PID is unique only within its namespace, so the same number can be reused in different namespaces.
  • Network (network): each network namespace has its own network devices, which can be configured with separate network addresses, as well as its own ports and routing tables.
  • User (user): each user namespace can have its own user and group IDs. A process running as an unprivileged user on the host may have the root identity inside the user namespace.
  • UTS: isolates the host name, domain name, etc.
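
The namespaces a process belongs to can be inspected under /proc/<pid>/ns. As a minimal sketch (show_namespaces is a hypothetical helper name, not part of any tool), the following prints the identifier of each namespace type listed above for a given process. Two processes that print the same identifier for a type share that namespace.

```shell
# Print the namespace identifiers of a process (default: the current one).
show_namespaces() {
    pid="${1:-self}"
    for ns in mnt pid net user uts; do
        # Each entry is a symlink whose target looks like "pid:[4026531836]".
        printf '%s\n' "$(readlink "/proc/$pid/ns/$ns")"
    done
}

show_namespaces
```

Reading another user's process this way may require root privileges.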

Create a PID namespace.

➜  sudo unshare --fork --pid --mount-proc bash

Running ps -aux in the bash process under this namespace lists only the processes in this PID namespace, which shows that PID isolation is achieved within a separate PID namespace.

root@andersonu20:/home/anderson# ps -aux 
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0  12916  4280 pts/2    S    14:55   0:00 bash
root          12  0.0  0.0  14768  3720 pts/2    R+   14:55   0:00 ps -aux

From here we can see that the bash process has PID 1 inside the namespace. Now return to a terminal on the host and look at the same process.

➜  ps -aux | grep bash
root      540064  0.0  0.0  12916  4244 pts/2    S+   14:57   0:00 bash

This means that a process inside the container is also an ordinary process on the host, just with a different PID. This is why namespaces are the basis of containerization technology.
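
On kernels that report it (the NSpid field of /proc/<pid>/status, available since Linux 4.1), the host can show all the PIDs of such a process at once. A small sketch, where ns_pids is a hypothetical helper name:

```shell
# Print every PID a process has, outermost PID namespace first.
# $1: path to a /proc/<pid>/status file
ns_pids() {
    # Drop the "NSpid:" label and print the remaining fields space-separated.
    awk '/^NSpid:/ { $1 = ""; sub(/^ +/, ""); print }' "$1"
}

ns_pids /proc/self/status
```

Run on the host against the bash process from the example (ns_pids /proc/540064/status), this should list the host PID first and the PID inside the new namespace last.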

Cgroup

While a namespace isolates one or more processes, a Cgroup (control group) measures, limits and monitors the resource usage of a group of processes, such as memory, CPU and I/O. There are many types of Cgroups, such as the memory Cgroup and the CPU Cgroup, all of which are exposed under the /sys/fs/cgroup/ directory.

➜  ~ ls /sys/fs/cgroup/
blkio/             devices/           net_cls,net_prio/  systemd/         
cpu@               freezer/           net_prio@          unified/         
cpuacct@           hugetlb/           perf_event/                         
cpu,cpuacct/       memory/            pids/                               
cpuset/            net_cls@           rdma/
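
To see which group a process currently belongs to in one of these hierarchies, /proc/<pid>/cgroup can be read. Each line has the form hierarchy-ID:controller-list:path. The sketch below (cgroup_path is a hypothetical helper name) extracts the path for one controller. It assumes the path itself contains no colon.

```shell
# Print the cgroup path of a process for one controller.
# $1: controller name (e.g. memory), $2: path to a /proc/<pid>/cgroup file
cgroup_path() {
    awk -F: -v c="$1" '{
        # The second field may list several controllers, e.g. "cpu,cpuacct".
        n = split($2, ctl, ",")
        for (i = 1; i <= n; i++) if (ctl[i] == c) print $3
    }' "$2"
}

cgroup_path memory /proc/self/cgroup
```

On a system that mounts only the unified hierarchy, the controller list is empty and this prints nothing.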

Create a Cgroup.

➜  ~ sudo apt-get install cgroup-tools
➜  ~ sudo cgcreate -t $USER:$USER -a $USER:$USER -g memory:m_group
➜  ~ ls -l /sys/fs/cgroup/memory | grep m_group
drwxr-xr-x   2 anderson root 0 Nov 20 15:15 m_group

View the memory resource limits. Each of the following files controls or reports a different aspect of the group's memory usage.

➜  ~ ls /sys/fs/cgroup/memory/m_group/
cgroup.clone_children               memory.memsw.failcnt
cgroup.event_control                memory.memsw.limit_in_bytes
cgroup.procs                        memory.memsw.max_usage_in_bytes
memory.failcnt                      memory.memsw.usage_in_bytes
memory.force_empty                  memory.move_charge_at_immigrate
memory.kmem.failcnt                 memory.numa_stat
memory.kmem.limit_in_bytes          memory.oom_control
memory.kmem.max_usage_in_bytes      memory.pressure_level
memory.kmem.slabinfo                memory.soft_limit_in_bytes
memory.kmem.tcp.failcnt             memory.stat
memory.kmem.tcp.limit_in_bytes      memory.swappiness
memory.kmem.tcp.max_usage_in_bytes  memory.usage_in_bytes
memory.kmem.tcp.usage_in_bytes      memory.use_hierarchy
memory.kmem.usage_in_bytes          notify_on_release
memory.limit_in_bytes               tasks
memory.max_usage_in_bytes
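
A few of these files are enough for a quick overview: memory.limit_in_bytes holds the limit, memory.usage_in_bytes the current usage, memory.max_usage_in_bytes the peak usage, and memory.failcnt the number of times the limit was hit. As a minimal sketch (cgroup_mem_report is a hypothetical helper name), they can be printed together:

```shell
# Print the main memory counters of a cgroup v1 memory group.
# $1: the group's directory, e.g. /sys/fs/cgroup/memory/m_group
cgroup_mem_report() {
    dir="$1"
    for f in memory.limit_in_bytes memory.usage_in_bytes \
             memory.max_usage_in_bytes memory.failcnt; do
        # Skip files that are missing or not readable on this system.
        if [ -r "$dir/$f" ]; then
            printf '%s=%s\n' "$f" "$(cat "$dir/$f")"
        fi
    done
}

cgroup_mem_report /sys/fs/cgroup/memory/m_group
```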

Check the current maximum memory limit and change it from 9223372036854771712 bytes to 8 KB.

➜  ~ cat /sys/fs/cgroup/memory/m_group/memory.limit_in_bytes 
9223372036854771712
➜  ~ echo 8192 > /sys/fs/cgroup/memory/m_group/memory.limit_in_bytes
➜  ~ cat /sys/fs/cgroup/memory/m_group/memory.limit_in_bytes   
8192
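
The large default value is not arbitrary. A plausible reading, assuming a 64-bit kernel with 4096-byte pages (both are assumptions, not stated above), is that it is the largest signed 64-bit integer rounded down to a whole number of pages, which effectively means no limit. The arithmetic can be checked in the shell:

```shell
page_size=4096                  # assumed; check with: getconf PAGESIZE
long_max=9223372036854775807    # largest signed 64-bit integer (2^63 - 1)

# Integer division drops the remainder, so this rounds down to a page multiple.
unlimited=$(( long_max / page_size * page_size ))
echo "$unlimited"               # 9223372036854771712

# The new 8192-byte limit expressed in pages.
pages=$(( 8192 / page_size ))
echo "$pages"                   # 2
```

Under the same assumption, the new limit leaves the group only two pages of memory, far less than a shell needs to start.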

Create a process and associate it with the m_group just created.

➜  ~ cgexec -g memory:m_group bash
[1]    542707 killed     cgexec -g memory:m_group bash

You can see that the process was killed by the OOM killer because it exceeded the 8 KB limit. Check the kernel log for the OOM details with the dmesg command.

➜  ~  dmesg
...
[787533.474951] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[787533.474952] [ 542707]  1000 542707     1298      528    53248        0             0 cgexec
[787533.474955] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/m_group,task_memcg=/m_group,task=cgexec,pid=542707,uid=1000
[787533.474963] Memory cgroup out of memory: Killed process 542707 (cgexec) total-vm:5192kB, anon-rss:832kB, file-rss:1280kB, shmem-rss:0kB, UID:1000 pgtables:52kB oom_score_adj:0
[787533.475092] oom_reaper: reaped process 542707 (cgexec), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Raise the resource limit and then start a bash process again. This time the bash process is created normally.

➜  ~ echo 8192000 > /sys/fs/cgroup/memory/m_group/memory.limit_in_bytes
➜  ~ cat /sys/fs/cgroup/memory/m_group/memory.limit_in_bytes           
8192000
➜  ~ cgexec -g memory:m_group bash                                     
anderson@andersonu20:~$ echo 1
1

Container

Containerization is built on precisely these two core technologies, Namespace and Cgroup.

  • Namespace is used to achieve isolation of the environment.
  • Cgroup is used to restrict the use of resources.

The creation process of a container can be roughly understood as the following steps.

  • First, create a new process with clone() and place it in the specified namespaces.
  • Then write the process's PID to the specified cgroup (echo $pid > /sys/fs/cgroup/$type/tasks), so that the process is bound by that cgroup's limits.
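
The second step can be sketched as a small helper. This is an illustration only, assuming the cgroup v1 layout used in the examples above. The name attach_to_cgroup is hypothetical, and a real container engine does considerably more than this.

```shell
# Attach a process to a cgroup by writing its PID to the group's tasks file.
# $1: PID as seen on the host, $2: cgroup directory
attach_to_cgroup() {
    pid="$1"
    cgroup_dir="$2"
    if [ ! -d "$cgroup_dir" ]; then
        echo "no such cgroup directory: $cgroup_dir" >&2
        return 1
    fi
    # On a real cgroup, the kernel moves the process into the group on write.
    echo "$pid" > "$cgroup_dir/tasks"
}
```

For example, after starting a shell with sudo unshare --fork --pid --mount-proc bash, running attach_to_cgroup 540064 /sys/fs/cgroup/memory/m_group on the host (using the host-side PID of that shell) would place the isolated shell under the m_group memory limit.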

Container engine

Container engines, such as the common Docker and LXC, encapsulate containerization technologies and simplify the process of creating and managing containers on the host. Container orchestration systems, such as Kubernetes, simplify running and managing containers at scale, greatly improving operations and maintenance (O&M) efficiency and productivity.