Starting with the OCI specification

The OCI (Open Container Initiative) specification is the de facto container standard, adopted by most container implementations and container orchestration systems, including Docker and Kubernetes. It was introduced in 2015, with Docker as the lead company.

Starting with the OCI specification to understand container images allows us to build a clearer picture of container technology rather than getting bogged down in implementation details. The OCI specification is divided into two parts, Image spec and Runtime spec, which cover different phases of the container lifecycle.

OCI specification

Image Specification

The image specification defines how to create an OCI-compliant image. It specifies the content and format that an image build system must output. The resulting container image can be unpacked into a runtime bundle: a folder with a specific file and directory structure from which the container can be run according to the runtime specification.

What’s inside the image

The specification requires that the contents of an image include the following three parts.

  • Image Manifest: provides the configuration and filesystem-layer location information for an image; it can be thought of as the image's table of contents. JSON format.
  • Image Layer Filesystem Changeset: a serialized filesystem and its changes, which can be applied sequentially, layer by layer, to form a container's rootfs; each changeset is therefore often simply called a layer (synonymous with the image layers mentioned below). The file format can be tar, gzip, or other archive or compression formats.
  • Image Configuration: contains the execution parameters the image uses at runtime, as well as the ordered rootfs change information. JSON format.

The rootfs (root filesystem) is the filesystem mounted at the / mount point. It contains the files, configuration, and directories of an operating system, but not the operating system kernel: all containers on the same machine share the kernel of the host operating system.

Next, let’s explore the actual contents of an image using Docker and nginx as examples. Pull a recent version of the nginx image, save it as a tar file, and extract it.

$ docker pull nginx
$ docker save nginx -o nginx-img.tar
$ mkdir nginx-img
$ tar -xf nginx-img.tar --directory=nginx-img

The nginx-img directory now has the following contents.

nginx-img
├── 013a6edf61f54428da349193e7a2077a714697991d802a1c5298b07dbe0519c9
│   ├── json
│   ├── layer.tar
│   └── VERSION
├── 2bf70c858e6c8243c4713064cf43dea840866afefe52089a3b339f06576b930e
│   ├── json
│   ├── layer.tar
│   └── VERSION
├── 490a3e67a61048564048a15d501b8e075d951d0dbba8098d5788bb8453f2371f
│   ├── json
│   ├── layer.tar
│   └── VERSION
├── 4cdc5dd7eaadff5080649e8d0014f2f8d36d4ddf2eff2fdf577dd13da85c5d2f.json
├── 761c908ee54e7ccd769e815f38e3040f7b3ff51f1c04f55aac12b9ea3d544cfe
│   ├── json
│   ├── layer.tar
│   └── VERSION
├── 96bfd5bf4ab4c2513fb43534d51e816c4876620767858377d14dcc5a7de5f1fd
│   ├── json
│   ├── layer.tar
│   └── VERSION
├── d18832ef411b346c36b7ba42a6c2e3f77097026fb80651c2d870f19c6fd9ccef
│   ├── json
│   ├── layer.tar
│   └── VERSION
├── manifest.json
└── repositories

First look at the contents of the manifest.json file, which is the Image Manifest for the image.

$ python -m json.tool manifest.json
[
    {
        "Config": "4cdc5dd7eaadff5080649e8d0014f2f8d36d4ddf2eff2fdf577dd13da85c5d2f.json",
        "Layers": [
            "490a3e67a61048564048a15d501b8e075d951d0dbba8098d5788bb8453f2371f/layer.tar",
            "2bf70c858e6c8243c4713064cf43dea840866afefe52089a3b339f06576b930e/layer.tar",
            "013a6edf61f54428da349193e7a2077a714697991d802a1c5298b07dbe0519c9/layer.tar",
            "761c908ee54e7ccd769e815f38e3040f7b3ff51f1c04f55aac12b9ea3d544cfe/layer.tar",
            "d18832ef411b346c36b7ba42a6c2e3f77097026fb80651c2d870f19c6fd9ccef/layer.tar",
            "96bfd5bf4ab4c2513fb43534d51e816c4876620767858377d14dcc5a7de5f1fd/layer.tar"
        ],
        "RepoTags": [
            "nginx:latest"
        ]
    }
]

It records the file locations of Config and Layers, which correspond to the Image Configuration and the Image Layer Filesystem Changesets defined by the specification.

Config is stored in a separate JSON file, too long to show in full, which contains the following information.

  • The configuration of the image, which is written to the runtime configuration file when the image is unpacked into a runtime bundle.
  • The diff IDs of the image's layers.
  • Meta information such as the build history of the image.
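As a sketch, the shape of this configuration can be explored with Python's standard json module. The abbreviated document below uses field names from the OCI image spec (config, rootfs.diff_ids, history); the digests and values are placeholders, not real data from the nginx image.

```python
import json

# Abbreviated OCI image configuration illustrating the three kinds of
# information listed above (runtime config, layer diff IDs, history).
# The digests below are placeholders, not real values.
config_json = """
{
  "config": {
    "Env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],
    "Cmd": ["nginx", "-g", "daemon off;"]
  },
  "rootfs": {
    "type": "layers",
    "diff_ids": ["sha256:0000aaaa", "sha256:0000bbbb"]
  },
  "history": [
    {"created_by": "ADD rootfs.tar.xz in /"}
  ]
}
"""

config = json.loads(config_json)

# Execution parameters that end up in the runtime bundle's config
print(config["config"]["Cmd"])       # ['nginx', '-g', 'daemon off;']
# Ordered digests of the uncompressed layers, bottom layer first
print(config["rootfs"]["diff_ids"])  # ['sha256:0000aaaa', 'sha256:0000bbbb']
```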

The tar files listed in Layers together form the rootfs of the resulting container. A container image is built in layers, and the order of the elements in Layers is the order in which the layers are stacked: all layers form a stack from bottom to top. Let's first look at the contents of the base layer, the first entry.

$ mkdir base
$ tar -xf 490a3e67a61048564048a15d501b8e075d951d0dbba8098d5788bb8453f2371f/layer.tar --directory=base

The contents of the extracted file in the base directory are as follows.

drwxr-xr-x  2 root root 4096 6月  21 08:00 bin
drwxr-xr-x  2 root root    6 6月  13 18:30 boot
drwxr-xr-x  2 root root    6 6月  21 08:00 dev
drwxr-xr-x 28 root root 4096 6月  21 08:00 etc
drwxr-xr-x  2 root root    6 6月  13 18:30 home
drwxr-xr-x  7 root root   85 6月  21 08:00 lib
drwxr-xr-x  2 root root   34 6月  21 08:00 lib64
drwxr-xr-x  2 root root    6 6月  21 08:00 media
drwxr-xr-x  2 root root    6 6月  21 08:00 mnt
drwxr-xr-x  2 root root    6 6月  21 08:00 opt
drwxr-xr-x  2 root root    6 6月  13 18:30 proc
drwx------  2 root root   37 6月  21 08:00 root
drwxr-xr-x  3 root root   30 6月  21 08:00 run
drwxr-xr-x  2 root root 4096 6月  21 08:00 sbin
drwxr-xr-x  2 root root    6 6月  21 08:00 srv
drwxr-xr-x  2 root root    6 6月  13 18:30 sys
drwxrwxrwt  2 root root    6 6月  21 08:00 tmp
drwxr-xr-x 10 root root  105 6月  21 08:00 usr
drwxr-xr-x 11 root root  139 6月  21 08:00 var

This is already a complete rootfs. Now look at the contents extracted from the topmost layer.

96bfd5bf4ab4c2513fb43534d51e816c4876620767858377d14dcc5a7de5f1fd/
└── docker-entrypoint.d
    └── 30-tune-worker-processes.sh

There is only one shell script, which shows that the image is built incrementally: each layer contains only the files that changed relative to the layers below it. This is what keeps container images small.

How to delete a file in an image layer

Each entry in Layers is a changeset for the filesystem, and a changeset can contain three kinds of changes: additions, modifications, and deletions. Adding or modifying (replacing) a file is straightforward, but how is a file deleted when a changeset is applied? The answer is whiteouts, which mark a file or directory as deleted.

A whiteout is an empty file with a special name: the prefix .wh. is added to the base name of the path to be deleted, indicating that the path (in a lower layer) should be removed. Suppose a layer contains the following files.

./etc/my-app.d/
./etc/my-app.d/default.cfg
./bin/my-app-tools
./etc/my-app-config

If a higher layer contains ./etc/.wh.my-app-config, the original ./etc/my-app-config path is removed when that layer is applied.
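The whiteout rule can be sketched in a few lines of Python. Here whiteout_name and apply_layer are illustrative helpers (not part of any real tooling), and layers are simplified to sets of file paths.

```python
import posixpath

def whiteout_name(path: str) -> str:
    """Whiteout entry that marks `path` for deletion: the prefix
    `.wh.` is added to the base name of the path."""
    head, base = posixpath.split(path)
    return posixpath.join(head, ".wh." + base)

def apply_layer(files: set, layer: set) -> set:
    """Apply one changeset to the files accumulated from lower
    layers: whiteout entries delete, all other entries add."""
    result = set(files)
    for entry in layer:
        base = posixpath.basename(entry)
        if base.startswith(".wh."):
            # Remove the lower-layer path named after the prefix
            target = posixpath.join(posixpath.dirname(entry), base[len(".wh."):])
            result.discard(target)
        else:
            result.add(entry)
    return result

lower = {"etc/my-app.d/default.cfg", "bin/my-app-tools", "etc/my-app-config"}
upper = {whiteout_name("etc/my-app-config")}   # etc/.wh.my-app-config
print(sorted(apply_layer(lower, upper)))
# ['bin/my-app-tools', 'etc/my-app.d/default.cfg']
```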

How to merge multiple image layers into a single file system

The specification gives only a schematic description of how to apply multiple image layers to form a single filesystem. To apply Layer B on top of Layer A:

  • First copy the filesystem directory of Layer A to a snapshot directory A.snapshot, preserving file attributes.
  • Then apply the file changes contained in Layer B inside the snapshot directory; none of the changes affect the original changesets.

In practice, more efficient implementations such as union filesystems are used.
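The two steps above can be sketched as follows. Layers are modeled as plain dicts mapping path to content, with None standing in for a whiteout; this illustrates the schematic procedure, not how any real runtime implements it.

```python
def merge(lower: dict, upper: dict) -> dict:
    """Apply Layer B (upper) on top of Layer A (lower): snapshot
    the lower layer, then replay the upper layer's changes."""
    snapshot = dict(lower)          # step 1: copy; the original is untouched
    for path, content in upper.items():
        if content is None:         # whiteout: delete the lower entry
            snapshot.pop(path, None)
        else:                       # add or modify (replace)
            snapshot[path] = content
    return snapshot

layer_a = {"/etc/app.cfg": "v1", "/etc/old.cfg": "stale", "/bin/tool": "binary"}
layer_b = {"/etc/app.cfg": "v2", "/etc/old.cfg": None}

rootfs = merge(layer_a, layer_b)
print(rootfs["/etc/app.cfg"])       # v2 -- the upper layer wins
print("/etc/old.cfg" in rootfs)     # False -- deleted by whiteout
print(layer_a["/etc/old.cfg"])      # stale -- the original changeset is intact
```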

What is Union File System

Union File System, also known as UnionFS, is a filesystem that mounts multiple directories from different locations onto the same mount point.

The following demonstrates the effect of a union mount, using an Ubuntu distribution and the unionfs-fuse implementation.

  1. First install unionfs-fuse, which is an implementation of UnionFS, using the package manager.

    $ apt install unionfs-fuse
    
  2. Then create the following directory structure.

    A
    ├── a
    └── x
    B
    ├── b
    └── x
    
  3. Create directory C and union-mount directories A and B under C.

    $ unionfs ./B:./A ./C
    
  4. The contents of the C directory after mounting are as follows.

    C
    ├── a
    ├── b
    └── x
    
  5. If we put different content in the x files of directories A and B, accessing x in directory C returns the contents of B/x, because B is listed first in the mount and therefore has higher priority.
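The precedence rule in step 5 can be sketched as a lookup across an ordered list of branches (modeled here as plain dicts); the branch listed first, B, shadows the others.

```python
def union_lookup(path, branches):
    """Return the content of `path` from the first branch that
    contains it; branch order encodes mount priority."""
    for branch in branches:
        if path in branch:
            return branch[path]
    raise FileNotFoundError(path)

A = {"a": "content of A/a", "x": "content of A/x"}
B = {"b": "content of B/b", "x": "content of B/x"}

# unionfs ./B:./A ./C  ->  B is searched before A
print(union_lookup("x", [B, A]))   # content of B/x
print(union_lookup("a", [B, A]))   # content of A/a
```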

How OverlayFS in Docker works

The union filesystem implementation currently used by Docker on most Linux distributions is overlay2, which is lighter and more efficient than the alternatives, so let's use it as an example to understand how union mounting works.

Continuing with the nginx image above, pull the image; Docker extracts the corresponding layers into the /var/lib/docker/overlay2 directory.

$ ll /var/lib/docker/overlay2 | tee layers.a

drwx-----x 4 root root     72 7月  20 17:20 335aaf02cbde069ddf7aa0077fecac172d4b2f0240975ab0ebecc3f94f1420cc
drwx-----x 3 root root     47 7月  19 10:04 560df35d349e6a750f1139db22d4cb52cba2a1f106616dc1c0c68b3cf11e3df6
drwx-----x 4 root root     72 7月  20 17:20 769a9f5d698522d6e55bd9882520647bd84375a751a67a8ccad1f7bb1ca066dd
drwx-----x 4 root root     72 7月  20 17:20 97aaf293fef495f0f06922d422a6187a952ec6ab29c0aa94cd87024c40e1a7e8
drwx-----x 4 root root     72 7月  20 17:20 a91fb6955249dadfb34a3f5f06d083c192f2774fbec5fbb1db42a04e918432c0
brw------- 1 root root 253, 1 7月  19 10:00 backingFsBlockDev
drwx-----x 4 root root     72 7月  20 17:20 fa29ec8cfe5a6c0b2cd1486f27a20a02867126edf654faad7f3520a220f3705f
drwx-----x 2 root root    278 7月  20 17:25 l

We save the output to the layers.a file for later comparison. The six hash-named directories hold the image's six layers (the directory names do not correspond to those in manifest.json), and the l directory contains symbolic links to the layer folders; the short link names exist mainly to keep the mount command's argument string from exceeding the page-size limit.

The contents of each layer folder are as follows.

$ cd /var/lib/docker/overlay2/
$ ll 335aaf02cbde069ddf7aa0077fecac172d4b2f0240975ab0ebecc3f94f1420cc

-rw------- 1 root root  0 7月  15 17:00 committed
drwxr-xr-x 3 root root 33 7月  15 17:00 diff
-rw-r--r-- 1 root root 26 7月  15 17:00 link
-rw-r--r-- 1 root root 86 7月  15 17:00 lower
drwx------ 2 root root  6 7月  15 17:00 work

link records the layer's short name in the l directory; lower records the layers below this one (if the file is absent, the current layer is the lowest, i.e. the base layer); the work directory is used internally by overlay2; and the diff directory holds the layer's filesystem contents.
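As a sketch of how these bookkeeping files fit together, the helper below reconstructs a layer's lower stack by reading its lower file. It assumes the colon-separated short-link format used by overlay2 (e.g. l/XXXX:l/YYYY) and is demonstrated on a fabricated directory, not a real /var/lib/docker.

```python
import os
import tempfile

def lower_chain(layer_dir: str) -> list:
    """Read a layer's `lower` file (colon-separated short links);
    a missing file means this is the base layer."""
    lower_file = os.path.join(layer_dir, "lower")
    if not os.path.exists(lower_file):
        return []
    with open(lower_file) as f:
        return f.read().strip().split(":")

# Fabricated layout standing in for /var/lib/docker/overlay2
root = tempfile.mkdtemp()
layer = os.path.join(root, "layer1")
os.makedirs(layer)
with open(os.path.join(layer, "lower"), "w") as f:
    f.write("l/AAAA:l/BBBB")

print(lower_chain(layer))                       # ['l/AAAA', 'l/BBBB']
print(lower_chain(os.path.join(root, "base")))  # [] -- no `lower` file
```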

$ ll 335aaf02cbde069ddf7aa0077fecac172d4b2f0240975ab0ebecc3f94f1420cc/diff/
drwxr-xr-x  2 root root    6 7月   7 03:39 docker-entrypoint.d
drwxr-xr-x 20 root root 4096 7月   7 03:39 etc
drwxr-xr-x  5 root root   56 7月   7 03:39 lib
drwxrwxrwt  2 root root    6 7月   7 03:39 tmp
drwxr-xr-x  7 root root   66 6月  21 08:00 usr
drwxr-xr-x  5 root root   41 6月  21 08:00 var

Now let's run a container from this image and see the effect of union mounting at the container stage.

$ docker run -d --name nginx_container  nginx

Execute the mount command to confirm the addition of a read/write overlay mount point.

$ mount | grep overlay

overlay on /var/lib/docker/overlay2/bab121ecb1d54b787b7b1834810baf212b035e28ca8d7875a09b1af837116011/merged type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/6Y7DPCTGLB6JHUPVBGTOEK2QFN:/var/lib/docker/overlay2/l/QMEEOSTJM2QON4M7PJJBB4KDEF:/var/lib/docker/overlay2/l/XNN2MRN4KWITFTZYLFUSLBP322:/var/lib/docker/overlay2/l/6DC6VDOMBZMLBZBT3QSOWLCR37:/var/lib/docker/overlay2/l/NXYWG253WSMELQKF2E2NH2GWCG:/var/lib/docker/overlay2/l/M4SO5XMO4VXRIJIGUHDMTATWH3:/var/lib/docker/overlay2/l/QI3P6ONJSLQI26DVPFGWIZI2EW,upperdir=/var/lib/docker/overlay2/bab121ecb1d54b787b7b1834810baf212b035e28ca8d7875a09b1af837116011/diff,workdir=/var/lib/docker/overlay2/bab121ecb1d54b787b7b1834810baf212b035e28ca8d7875a09b1af837116011/work)

This mount point contains a rootfs that combines all the image layers.

$ ll /var/lib/docker/overlay2/bab121ecb1d54b787b7b1834810baf212b035e28ca8d7875a09b1af837116011/merged

drwxr-xr-x 2 root root 4096 6月  21 08:00 bin
drwxr-xr-x 2 root root    6 6月  13 18:30 boot
drwxr-xr-x 1 root root   43 7月  16 16:52 dev
drwxr-xr-x 1 root root   41 7月   7 03:39 docker-entrypoint.d
-rwxrwxr-x 1 root root 1202 7月   7 03:39 docker-entrypoint.sh
drwxr-xr-x 1 root root   19 7月  16 16:52 etc
drwxr-xr-x 2 root root    6 6月  13 18:30 home
drwxr-xr-x 1 root root   56 7月   7 03:39 lib
drwxr-xr-x 2 root root   34 6月  21 08:00 lib64
drwxr-xr-x 2 root root    6 6月  21 08:00 media
drwxr-xr-x 2 root root    6 6月  21 08:00 mnt
drwxr-xr-x 2 root root    6 6月  21 08:00 opt
drwxr-xr-x 2 root root    6 6月  13 18:30 proc
drwx------ 2 root root   37 6月  21 08:00 root
drwxr-xr-x 1 root root   23 7月  16 16:52 run
drwxr-xr-x 2 root root 4096 6月  21 08:00 sbin
drwxr-xr-x 2 root root    6 6月  21 08:00 srv
drwxr-xr-x 2 root root    6 6月  13 18:30 sys
drwxrwxrwt 1 root root    6 7月   7 03:39 tmp
drwxr-xr-x 1 root root   66 6月  21 08:00 usr
drwxr-xr-x 1 root root   19 6月  21 08:00 var

In addition to union-mounting the original image layers at the merged directory shown above, a diff of the directory listings shows that two new layer directories appear under /var/lib/docker/overlay2 after the container starts, and merged is located in one of them.

$ ll /var/lib/docker/overlay2 | tee layers.b
$ diff layers.a layers.b

> drwx-----x 5 root root     69 7月  19 10:08 bab121ecb1d54b787b7b1834810baf212b035e28ca8d7875a09b1af837116011
> drwx-----x 4 root root     72 7月  19 10:08 bab121ecb1d54b787b7b1834810baf212b035e28ca8d7875a09b1af837116011-init
< drwx-----x 2 root root    210 7月  19 10:08 l
---
> drwx-----x 2 root root    278 7月  19 10:08 l

Inspecting the GraphDriver of the running container shows more clearly how the container's layers differ from the image's.

$ docker inspect nginx_container

....
"GraphDriver": {
    "Data": {
        "LowerDir": "/var/lib/docker/overlay2/bab121ecb1d54b787b7b1834810baf212b035e28ca8d7875a09b1af837116011-init/diff:/var/lib/docker/overlay2/97aaf293fef495f0f06922d422a6187a952ec6ab29c0aa94cd87024c40e1a7e8/diff:/var/lib/docker/overlay2/fa29ec8cfe5a6c0b2cd1486f27a20a02867126edf654faad7f3520a220f3705f/diff:/var/lib/docker/overlay2/769a9f5d698522d6e55bd9882520647bd84375a751a67a8ccad1f7bb1ca066dd/diff:/var/lib/docker/overlay2/a91fb6955249dadfb34a3f5f06d083c192f2774fbec5fbb1db42a04e918432c0/diff:/var/lib/docker/overlay2/335aaf02cbde069ddf7aa0077fecac172d4b2f0240975ab0ebecc3f94f1420cc/diff:/var/lib/docker/overlay2/560df35d349e6a750f1139db22d4cb52cba2a1f106616dc1c0c68b3cf11e3df6/diff",
        "MergedDir": "/var/lib/docker/overlay2/bab121ecb1d54b787b7b1834810baf212b035e28ca8d7875a09b1af837116011/merged",
        "UpperDir": "/var/lib/docker/overlay2/bab121ecb1d54b787b7b1834810baf212b035e28ca8d7875a09b1af837116011/diff",
        "WorkDir": "/var/lib/docker/overlay2/bab121ecb1d54b787b7b1834810baf212b035e28ca8d7875a09b1af837116011/work"
    },
    "Name": "overlay2"
}
....

LowerDir records the original image-layer filesystems plus a new init layer at the top; all of these are read-only while the container runs. MergedDir is the mount point where the LowerDir directories are union-mounted. UpperDir is another new layer, stacked on top of all of the above, and unlike the image layers it is writable. The layers of the container stage are shown below.

The layers of the container stage

The new writable layer added when a container is created is called the container layer. Any filesystem changes during the container's runtime, including additions, modifications, and deletions of files, are written only to this layer, leaving the underlying image layers unchanged; this greatly improves the efficiency of image distribution. The init layer, between the image layers and the container layer, records configuration files written by the container at startup; this happens before the writable layer is added, and we do not want such data written into the original image.

These two new layers exist only for the lifetime of the container and are deleted when the container is deleted. Multiple containers can be created from the same image simply by creating multiple writable layers; a container changed at runtime can also be repackaged into a new image by committing its writable layer as a read-only layer of the new image.

Why containers are not as efficient as native file systems for reading and writing

To minimize I/O and reduce image size, a container's union filesystem uses a copy-on-write strategy for reading and writing files. If a file or directory exists in a lower layer and another layer (including the writable layer) needs read access to it, it reads the file in the lower layer directly. The first time another layer needs to write to that file (when building an image or running a container), the file is copied up to that layer and modified there. This keeps container startup fast (the new writable layer has very few files to write at startup), but the first modification of each file after the container starts requires copying the entire file into the container layer.
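The copy-up behavior can be sketched with a toy two-layer filesystem. CowFS is a hypothetical class for illustration only; it counts how many whole-file copy-ups occur.

```python
class CowFS:
    """Toy copy-on-write view over a read-only lower layer."""

    def __init__(self, lower):
        self.lower = lower     # merged read-only image layers
        self.upper = {}        # writable container layer
        self.copy_ups = 0      # whole-file copies performed

    def read(self, path):
        # Reads fall through to the lower layer until the file is copied up
        if path in self.upper:
            return self.upper[path]
        return self.lower[path]

    def write(self, path, data):
        # The first write to a lower-layer file copies the whole file up
        if path not in self.upper and path in self.lower:
            self.upper[path] = self.lower[path]
            self.copy_ups += 1
        self.upper[path] = data

fs = CowFS({"/var/log/app.log": "x" * 1_000_000})
fs.read("/var/log/app.log")        # served from the lower layer, no copy
fs.write("/var/log/app.log", "y")  # triggers a full copy-up first
fs.write("/var/log/app.log", "z")  # already copied up, no second copy
print(fs.copy_ups)                 # 1
```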

For these reasons, the read and especially write performance of a container at runtime is worse than that of the native filesystem, so heavy file I/O should not be done in the container layer. Instead, use a data volume (Volume or Bind Mount), which is mounted into the container directly from the host filesystem and bypasses the performance loss of copy-on-write.

Runtime specification

The runtime specification describes the configuration, execution environment, and lifecycle of a container. It details the field format of the configuration file config.json for different container runtime architectures; it specifies how these configurations are applied and injected into the execution environment, so that programs running inside a container see a consistent environment across different runtimes; and it defines a uniform set of operations over the container's lifecycle.

Container lifecycle

The container lifecycle, as defined by the specification, is a timeline of events from the creation of a container to its exit. It defines 13 different events, and the following diagram depicts how the container's state changes along that timeline.

Container lifecycle

The specification defines only four container states (runtime implementations may add others), and it also specifies the operations that a runtime must support.

  • Query State: query the current state of the container.
  • Create: create a new container from the image and configuration, but do not run the user-specified program.
  • Start: run the user-specified program in a created container.
  • Kill: send a specific signal to terminate the container process.
  • Delete: delete the resources created for a stopped container.

Each operation can also be preceded or followed by hooks, which a compliant runtime must execute.
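The four states and five operations above can be sketched as a small state machine. This is an illustration of the specification's lifecycle, not a real runtime: an actual implementation also runs hooks and may add states, and "deleted" below is just a marker for completed cleanup, not a spec-defined state.

```python
# Legal transitions: (operation, current state) -> next state
TRANSITIONS = {
    ("create", "creating"): "created",
    ("start", "created"): "running",
    ("kill", "running"): "stopped",
    ("delete", "stopped"): "deleted",
}

class Container:
    def __init__(self):
        self.state = "creating"

    def query_state(self):
        return self.state

    def op(self, name):
        nxt = TRANSITIONS.get((name, self.state))
        if nxt is None:
            raise RuntimeError(f"cannot {name} a {self.state} container")
        self.state = nxt

c = Container()
c.op("create")             # resources created, no user program yet
c.op("start")              # the container process starts running
c.op("kill")               # a signal terminates the container process
c.op("delete")             # clean up the created resources
print(c.query_state())     # deleted
```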

The essence of a container is a process

The runtime specification uses the concept of a container process, which corresponds to the user-specified program mentioned above; in some contexts this process is also called the container's init process. To run a container, the container process must be defined in config.json, with fields for the command arguments, environment parameters, execution path, and so on.

Changes in container state actually reflect changes in the container process. We can divide the container's lifecycle and state changes into the following phases.

  1. Before the container process runs.
    • The runtime executes the create command, creating the specified resources based on config.json.
    • Once the resources are created successfully, the container enters the created state.
  2. The container process runs.
    • The runtime executes the start command.
    • The runtime runs the user-specified program, the container process.
    • The container enters the running state.
  3. The container process ends, either because the program finished, because it errored or crashed, or because the runtime's kill command signaled it to terminate.
    • The container enters the stopped state.
    • The runtime's delete command clears all the resources created by the create command.

The container process is the core of the container runtime: the image filesystem satisfies the dependencies the process needs to run, the runtime's preparation exists to run the process correctly, the runtime continuously monitors the state of the process, and once the process ends the container is declared (temporarily) dead and the runtime performs the final cleanup.

Of course, other processes can run in the container, but they only share the container process's environment.

Implementation and Ecology

Docker donated its container runtime, the runC project, to the OCI as the reference implementation of the specification. Most container projects today use runC directly as their runtime implementation.

The following diagram summarizes the relationship between Docker-related organizations and projects within the container ecosystem.

relationship between Docker-related organizations and projects within the container ecosystem

Kubernetes defines the CRI (Container Runtime Interface) to make the container runtime interchangeable; there are several implementations, such as cri-containerd, cri-o, and docker (via dockershim), but all of them are ultimately based on runC as well.

Container Runtime Interface