I have been working on CSI recently, and during development I found that the details of CSI are quite intricate. Writing up the CSI workflow helps me deepen my own understanding of CSI and lets me share what I have learned with you.

I will introduce CSI in two articles. The first focuses on the basic components and workings of CSI, using Kubernetes as the CO (Container Orchestration system). The second will pick a few typical CSI projects and analyze their concrete implementations.

Basic components of CSI

Kubernetes storage plugins come in two types: in-tree and out-of-tree. The former is a storage plugin that runs inside the k8s core components; the latter is a storage plugin that runs independently of the k8s components. This article focuses on out-of-tree plugins.

An out-of-tree plugin interacts with k8s components through gRPC interfaces, and the community provides a set of SideCar components that work with CSI plugins to deliver rich functionality. For out-of-tree plugins, the components involved fall into two groups: the SideCar components and the plugins that third parties need to implement.

SideCar components

external-attacher

Watches VolumeAttachment objects and calls the ControllerPublishVolume and ControllerUnpublishVolume interfaces of the CSI driver Controller service to attach volumes to and detach volumes from nodes.

This component is needed if the storage system requires the attach/detach step, as the K8s internal Attach/Detach Controller does not call the CSI driver interface directly.
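
For reference, a VolumeAttachment object essentially records which PV should be attached to which node and by which CSI driver. Below is a minimal sketch built with the Kubernetes Go API types; the driver name, PV name, and node name are hypothetical.

package main

import (
	"fmt"

	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pvName := "pvc-example" // hypothetical PV name
	va := storagev1.VolumeAttachment{
		ObjectMeta: metav1.ObjectMeta{Name: "csi-example-attachment"},
		Spec: storagev1.VolumeAttachmentSpec{
			// Attacher must match the CSI driver name so that the right
			// external-attacher picks this object up.
			Attacher: "example.csi.driver", // hypothetical driver name
			NodeName: "node-1",
			Source: storagev1.VolumeAttachmentSource{
				PersistentVolumeName: &pvName,
			},
		},
	}
	// After ControllerPublishVolume succeeds, the external-attacher sets
	// Status.Attached on this object to true.
	fmt.Printf("%+v\n", va.Spec)
}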

external-provisioner

Watches PVC objects. If the provisioner field of the StorageClass specified in the PVC matches the driver name returned by the GetPluginInfo interface of the CSI driver Identity service, the component calls the CreateVolume interface of the CSI driver Controller service to provision a volume. Once the new volume is provisioned, the component creates the corresponding PV in the cluster.

If the reclaim policy of the PV bound to the PVC is Delete, the component watches for the deletion of the PVC and calls the DeleteVolume interface of the CSI driver Controller service. Once the volume is deleted successfully, the component also deletes the corresponding PV.

The component also supports creating volumes from snapshot data sources. If a Snapshot CRD is specified as the data source in the PVC, the component obtains information about the snapshot through the SnapshotContent object and passes it to the CSI driver when calling the CreateVolume interface; the CSI driver then needs to create the volume from that snapshot data source.
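
To make the matching rule concrete, the sketch below builds a StorageClass with the Kubernetes Go API types; example.csi.driver is a hypothetical driver name standing in for whatever GetPluginInfo returns, and the parameters are arbitrary.

package main

import (
	"fmt"

	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	sc := storagev1.StorageClass{
		ObjectMeta: metav1.ObjectMeta{Name: "example-sc"},
		// Provisioner must equal the driver name reported by the CSI driver's
		// Identity service (GetPluginInfo); otherwise the external-provisioner
		// shipped with that driver ignores PVCs using this StorageClass.
		Provisioner: "example.csi.driver", // hypothetical driver name
		// Parameters are passed through to CreateVolume as-is.
		Parameters: map[string]string{"type": "ssd"},
	}
	fmt.Println(sc.Provisioner)
}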

external-resizer

Watches PVC objects; if a user requests more storage on a PVC, the component calls the ControllerExpandVolume interface of the CSI driver Controller service to expand the volume.

external-snapshotter

This component works together with the Snapshot Controller. The Snapshot Controller creates the corresponding VolumeSnapshotContent based on the VolumeSnapshot objects created in the cluster, while the external-snapshotter watches VolumeSnapshotContent objects. When it observes a new VolumeSnapshotContent, it passes the corresponding parameters in a CreateSnapshotRequest to the CSI driver Controller service, invoking its CreateSnapshot interface. This component is also responsible for calling the DeleteSnapshot and ListSnapshots interfaces.

livenessprobe

Monitors the health of the CSI driver and reports it to k8s through the Liveness Probe mechanism; when an abnormality is detected in the CSI driver, the pod is restarted.
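
Conceptually, the sidecar just turns the driver's Probe RPC into an HTTP health endpoint that the kubelet's liveness probe can hit. The sketch below only illustrates that idea; it is not the actual livenessprobe implementation, and the socket path and port are hypothetical.

package main

import (
	"context"
	"log"
	"net/http"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

func main() {
	// Dial the CSI driver over its Unix domain socket (hypothetical path).
	conn, err := grpc.Dial("unix:///csi/csi.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	identity := csi.NewIdentityClient(conn)

	// /healthz succeeds only if the driver answers Probe, so the kubelet's
	// liveness probe will restart the pod when the driver stops responding.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 10*time.Second)
		defer cancel()
		if _, err := identity.Probe(ctx, &csi.ProbeRequest{}); err != nil {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":9808", nil))
}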

node-driver-registrar

Calls the NodeGetInfo interface of the CSI driver Node service directly and registers the CSI driver with the kubelet on the corresponding node through the kubelet's plugin registration mechanism.

external-health-monitor-controller

Checks the health of CSI volumes by calling the ListVolumes or ControllerGetVolume interface of the CSI driver Controller service and reports it as events on the PVC.

external-health-monitor-agent

Checks the health of CSI volumes by calling the NodeGetVolumeStats interface of the CSI driver Node service and reports it as events on the pod.

Third-party plugins

The third-party storage provider (SP, Storage Provider) needs to implement two plugins: Controller, which is responsible for volume management and is deployed as a StatefulSet, and Node, which is responsible for mounting volumes into pods and is deployed as a DaemonSet on every node.

The CSI plugin interacts with the kubelet and the k8s external components through gRPC over a Unix Domain Socket. CSI defines three sets of RPC interfaces that an SP needs to implement in order to communicate with the k8s external components: CSI Identity, CSI Controller, and CSI Node. They are described in detail below.

CSI Identity

The CSI Identity is used to provide the identity information of the CSI driver and needs to be implemented by both the Controller and the Node. The interfaces are as follows.

service Identity {
  rpc GetPluginInfo(GetPluginInfoRequest)
    returns (GetPluginInfoResponse) {}

  rpc GetPluginCapabilities(GetPluginCapabilitiesRequest)
    returns (GetPluginCapabilitiesResponse) {}

  rpc Probe (ProbeRequest)
    returns (ProbeResponse) {}
}

GetPluginInfo is mandatory; the node-driver-registrar component calls this interface to register the CSI driver with the kubelet. GetPluginCapabilities indicates which capabilities the CSI driver provides, for example whether it implements the Controller service. Probe is used by callers such as the livenessprobe component to check whether the CSI driver is healthy and ready.
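
As a concrete reference, here is a minimal sketch of an Identity service written against the CSI spec's Go bindings (github.com/container-storage-interface/spec/lib/go/csi) and served over a Unix domain socket; the driver name, version, and socket path are hypothetical, and error handling is kept to a minimum.

package main

import (
	"context"
	"log"
	"net"
	"os"

	"google.golang.org/grpc"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

type identityServer struct{}

// GetPluginInfo returns the driver name that node-driver-registrar registers
// with the kubelet and that StorageClass.provisioner must match.
func (s *identityServer) GetPluginInfo(ctx context.Context, req *csi.GetPluginInfoRequest) (*csi.GetPluginInfoResponse, error) {
	return &csi.GetPluginInfoResponse{
		Name:          "example.csi.driver", // hypothetical driver name
		VendorVersion: "0.1.0",
	}, nil
}

// GetPluginCapabilities advertises that this driver also implements the
// Controller service, in addition to the mandatory Node service.
func (s *identityServer) GetPluginCapabilities(ctx context.Context, req *csi.GetPluginCapabilitiesRequest) (*csi.GetPluginCapabilitiesResponse, error) {
	return &csi.GetPluginCapabilitiesResponse{
		Capabilities: []*csi.PluginCapability{{
			Type: &csi.PluginCapability_Service_{
				Service: &csi.PluginCapability_Service{
					Type: csi.PluginCapability_Service_CONTROLLER_SERVICE,
				},
			},
		}},
	}, nil
}

// Probe is called (for example by the livenessprobe sidecar) to check driver
// health. Leaving Ready unset means the caller assumes the plugin is ready;
// a real driver would run its own health checks here.
func (s *identityServer) Probe(ctx context.Context, req *csi.ProbeRequest) (*csi.ProbeResponse, error) {
	return &csi.ProbeResponse{}, nil
}

func main() {
	socket := "/csi/csi.sock" // hypothetical socket path shared with the sidecars
	_ = os.Remove(socket)     // clean up a stale socket from a previous run
	lis, err := net.Listen("unix", socket)
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	csi.RegisterIdentityServer(srv, &identityServer{})
	log.Fatal(srv.Serve(lis))
}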

CSI Controller

The CSI Controller service is used to create/delete volumes, attach/detach volumes, take volume snapshots, expand volumes, and so on. The Controller plugin needs to implement this set of interfaces. The interfaces are as follows.

service Controller {
  rpc CreateVolume (CreateVolumeRequest)
    returns (CreateVolumeResponse) {}

  rpc DeleteVolume (DeleteVolumeRequest)
    returns (DeleteVolumeResponse) {}

  rpc ControllerPublishVolume (ControllerPublishVolumeRequest)
    returns (ControllerPublishVolumeResponse) {}

  rpc ControllerUnpublishVolume (ControllerUnpublishVolumeRequest)
    returns (ControllerUnpublishVolumeResponse) {}

  rpc ValidateVolumeCapabilities (ValidateVolumeCapabilitiesRequest)
    returns (ValidateVolumeCapabilitiesResponse) {}

  rpc ListVolumes (ListVolumesRequest)
    returns (ListVolumesResponse) {}

  rpc GetCapacity (GetCapacityRequest)
    returns (GetCapacityResponse) {}

  rpc ControllerGetCapabilities (ControllerGetCapabilitiesRequest)
    returns (ControllerGetCapabilitiesResponse) {}

  rpc CreateSnapshot (CreateSnapshotRequest)
    returns (CreateSnapshotResponse) {}

  rpc DeleteSnapshot (DeleteSnapshotRequest)
    returns (DeleteSnapshotResponse) {}

  rpc ListSnapshots (ListSnapshotsRequest)
    returns (ListSnapshotsResponse) {}

  rpc ControllerExpandVolume (ControllerExpandVolumeRequest)
    returns (ControllerExpandVolumeResponse) {}

  rpc ControllerGetVolume (ControllerGetVolumeRequest)
    returns (ControllerGetVolumeResponse) {
        option (alpha_method) = true;
    }
}

As mentioned in the introduction of the k8s external components above, different interfaces are provided to different components for different functions. For example, CreateVolume / DeleteVolume work with the external-provisioner to create/delete volumes, and ControllerPublishVolume / ControllerUnpublishVolume work with the external-attacher to attach/detach volumes.
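
To connect the interfaces to the sidecars, below is a heavily simplified sketch of CreateVolume / DeleteVolume for a toy Controller service that only records volumes in an in-memory map; the ID scheme and the 1 GiB default size are hypothetical, and a real driver would talk to its storage backend at these points.

package driver

import (
	"context"
	"sync"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// controllerServer "provisions" volumes by recording them in a map instead of
// calling a real storage backend.
type controllerServer struct {
	mu      sync.Mutex
	volumes map[string]int64 // volume ID -> capacity in bytes
}

func newControllerServer() *controllerServer {
	return &controllerServer{volumes: map[string]int64{}}
}

func (cs *controllerServer) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error) {
	if req.GetName() == "" {
		return nil, status.Error(codes.InvalidArgument, "volume name is required")
	}
	// external-provisioner derives CapacityRange from the PVC's storage request.
	capacity := req.GetCapacityRange().GetRequiredBytes()
	if capacity == 0 {
		capacity = 1 << 30 // hypothetical 1 GiB default
	}

	cs.mu.Lock()
	defer cs.mu.Unlock()
	// Deriving the ID from the request name keeps the call idempotent:
	// retries for the same PVC return the same volume.
	volID := "vol-" + req.GetName()
	cs.volumes[volID] = capacity

	return &csi.CreateVolumeResponse{
		Volume: &csi.Volume{
			VolumeId:      volID,
			CapacityBytes: capacity,
		},
	}, nil
}

func (cs *controllerServer) DeleteVolume(ctx context.Context, req *csi.DeleteVolumeRequest) (*csi.DeleteVolumeResponse, error) {
	cs.mu.Lock()
	defer cs.mu.Unlock()
	delete(cs.volumes, req.GetVolumeId())
	return &csi.DeleteVolumeResponse{}, nil
}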

CSI Node

The Node plugin needs to implement this set of interfaces to mount/unmount volumes, check volume status, and so on. The interfaces are as follows.

service Node {
  rpc NodeStageVolume (NodeStageVolumeRequest)
    returns (NodeStageVolumeResponse) {}

  rpc NodeUnstageVolume (NodeUnstageVolumeRequest)
    returns (NodeUnstageVolumeResponse) {}

  rpc NodePublishVolume (NodePublishVolumeRequest)
    returns (NodePublishVolumeResponse) {}

  rpc NodeUnpublishVolume (NodeUnpublishVolumeRequest)
    returns (NodeUnpublishVolumeResponse) {}

  rpc NodeGetVolumeStats (NodeGetVolumeStatsRequest)
    returns (NodeGetVolumeStatsResponse) {}

  rpc NodeExpandVolume(NodeExpandVolumeRequest)
    returns (NodeExpandVolumeResponse) {}

  rpc NodeGetCapabilities (NodeGetCapabilitiesRequest)
    returns (NodeGetCapabilitiesResponse) {}

  rpc NodeGetInfo (NodeGetInfoRequest)
    returns (NodeGetInfoResponse) {}
}

NodeStageVolume is what makes it possible for multiple pods to share a single volume: the volume is first mounted to a staging directory on the node, and then mounted into the pod via NodePublishVolume; NodeUnstageVolume reverses the operation.
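
As an illustration of the staging/publishing split, the sketch below shows a toy Node service whose volumes are just directories on the host: NodeStageVolume prepares the volume once under the staging path, and NodePublishVolume bind-mounts that staging path into each pod's target path using k8s.io/mount-utils. The directory-backed "volume" and the omitted checks are simplifying assumptions; a real driver would format and mount a device or a remote filesystem in the staging step.

package driver

import (
	"context"
	"os"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
	mount "k8s.io/mount-utils"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// nodeServer is a toy Node service where "staging" just ensures a directory
// exists and "publishing" bind-mounts it into the pod's volume path.
type nodeServer struct {
	mounter mount.Interface
}

func newNodeServer() *nodeServer {
	return &nodeServer{mounter: mount.New("")}
}

func (ns *nodeServer) NodeStageVolume(ctx context.Context, req *csi.NodeStageVolumeRequest) (*csi.NodeStageVolumeResponse, error) {
	staging := req.GetStagingTargetPath()
	if staging == "" {
		return nil, status.Error(codes.InvalidArgument, "staging target path is required")
	}
	// A real driver would format/mount the block device or remote filesystem
	// here, once per node, so several pods can share the staged volume.
	if err := os.MkdirAll(staging, 0o750); err != nil {
		return nil, status.Error(codes.Internal, err.Error())
	}
	return &csi.NodeStageVolumeResponse{}, nil
}

func (ns *nodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error) {
	staging := req.GetStagingTargetPath()
	target := req.GetTargetPath() // the pod's volume directory under the kubelet's pods dir
	if err := os.MkdirAll(target, 0o750); err != nil {
		return nil, status.Error(codes.Internal, err.Error())
	}
	// Bind-mount the staged volume into the pod; each pod gets its own target.
	if err := ns.mounter.Mount(staging, target, "", []string{"bind"}); err != nil {
		return nil, status.Error(codes.Internal, err.Error())
	}
	return &csi.NodePublishVolumeResponse{}, nil
}

func (ns *nodeServer) NodeUnpublishVolume(ctx context.Context, req *csi.NodeUnpublishVolumeRequest) (*csi.NodeUnpublishVolumeResponse, error) {
	// Unmount the bind mount and clean up the target directory.
	if err := mount.CleanupMountPoint(req.GetTargetPath(), ns.mounter, true); err != nil {
		return nil, status.Error(codes.Internal, err.Error())
	}
	return &csi.NodeUnpublishVolumeResponse{}, nil
}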

Workflow

Now let's walk through the entire workflow of mounting a volume into a pod. The process has three phases: Provision/Delete, Attach/Detach, and Mount/Unmount, but not every storage solution goes through all three; for example, NFS has no Attach/Detach phase.

The entire process involves not only the components described above, but also the AttachDetachController and PVController inside the ControllerManager, as well as the kubelet. Below, the Provision, Attach, and Mount phases are analyzed in detail.

Provision

[Figure: CSI Provision]

Let's look at the Provision phase first; the whole process is shown in the figure above. Both the external-provisioner and the PVController watch PVC resources.

  1. When the PVController observes that a PVC has been created in the cluster, it checks whether there is an in-tree plugin matching it. If not, it determines that the storage type is out-of-tree and annotates the PVC with volume.beta.kubernetes.io/storage-provisioner={csi driver name}.
  2. When the external-provisioner observes that the csi driver in the PVC's annotation matches its own, it calls the CreateVolume interface of the CSI Controller.
  3. When the CreateVolume interface of the CSI Controller returns success, the external-provisioner creates the corresponding PV in the cluster.
  4. When the PVController observes that a PV has been created in the cluster, it binds the PV to the PVC.

Attach

[Figure: CSI Attach]

The Attach phase refers to attaching a volume to a node, and the whole process is shown in the figure above.

  1. When the ADController observes that a pod scheduled to a node is using a PV of CSI type, it calls the interface of the internal in-tree CSI plugin, which creates a VolumeAttachment resource in the cluster.
  2. The external-attacher component watches VolumeAttachment resources; when one is created, it calls the ControllerPublishVolume interface of the CSI Controller.
  3. When the ControllerPublishVolume interface of the CSI Controller returns success, the external-attacher sets the Attached state of the corresponding VolumeAttachment object to true.
  4. When the ADController observes that the Attached state of the VolumeAttachment object is true, it updates its internal state, ActualStateOfWorld.

Mount

[Figure: CSI Mount]

The last step, mounting the volume into the pod, involves the kubelet. In short, during pod creation the kubelet on the corresponding node calls the CSI Node plugin to perform the mount operation. The following is a breakdown of the kubelet's internal components.

First, in syncPod, the main function through which the kubelet creates a pod, the kubelet calls the WaitForAttachAndMount method of its sub-component volumeManager and waits for the volume mount to complete.

func (kl *Kubelet) syncPod(o syncPodOptions) error {
...
	// Volume manager will not mount volumes for terminated pods
	if !kl.podIsTerminated(pod) {
		// Wait for volumes to attach/mount
		if err := kl.volumeManager.WaitForAttachAndMount(pod); err != nil {
			kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedMountVolume, "Unable to attach or mount volumes: %v", err)
			klog.Errorf("Unable to attach or mount volumes for pod %q: %v; skipping pod", format.Pod(pod), err)
			return err
		}
	}
...
}

The volumeManager contains two components, the desiredStateOfWorldPopulator and the reconciler, which work together to complete the mounting and unmounting of the volumes used by pods. The whole process is as follows.

[Figure: volumeManager workflow]

The desiredStateOfWorldPopulator and the reconciler form a producer-consumer model. The volumeManager maintains two caches (technically interfaces, but acting as queues here): DesiredStateOfWorld and ActualStateOfWorld. The former records the desired state of the volumes on the current node; the latter records their actual state.

The desiredStateOfWorldPopulator does only two things in its loop. First, it gets newly created pods from the kubelet's podManager and records the volumes they need mounted in DesiredStateOfWorld. Second, it gets the pods that have been deleted from the podManager, checks whether their volumes are still recorded in ActualStateOfWorld, and if not, removes them from DesiredStateOfWorld as well. This ensures that DesiredStateOfWorld records the desired state of all volumes on the node. The relevant code is as follows (some code has been removed to streamline the logic).

// Iterate through all pods and add to desired state of world if they don't
// exist but should
func (dswp *desiredStateOfWorldPopulator) findAndAddNewPods() {
	// Map unique pod name to outer volume name to MountedVolume.
	mountedVolumesForPod := make(map[volumetypes.UniquePodName]map[string]cache.MountedVolume)
	...
	processedVolumesForFSResize := sets.NewString()
	for _, pod := range dswp.podManager.GetPods() {
		dswp.processPodVolumes(pod, mountedVolumesForPod, processedVolumesForFSResize)
	}
}

// processPodVolumes processes the volumes in the given pod and adds them to the
// desired state of the world.
func (dswp *desiredStateOfWorldPopulator) processPodVolumes(
	pod *v1.Pod,
	mountedVolumesForPod map[volumetypes.UniquePodName]map[string]cache.MountedVolume,
	processedVolumesForFSResize sets.String) {
	uniquePodName := util.GetUniquePodName(pod)
    ...
	for _, podVolume := range pod.Spec.Volumes {   
		pvc, volumeSpec, volumeGidValue, err :=
			dswp.createVolumeSpec(podVolume, pod, mounts, devices)

		// Add volume to desired state of world
		_, err = dswp.desiredStateOfWorld.AddPodToVolume(
			uniquePodName, pod, volumeSpec, podVolume.Name, volumeGidValue)
		dswp.actualStateOfWorld.MarkRemountRequired(uniquePodName)
    }
}

The reconciler is the consumer, and it does three main things.

  1. unmountVolumes(): iterates over the volumes in ActualStateOfWorld and checks whether each one is present in DesiredStateOfWorld; if not, it calls the CSI Node interface to unmount it and records the result in ActualStateOfWorld.
  2. mountAttachVolumes(): gets the volumes that need to be mounted from DesiredStateOfWorld, calls the CSI Node interface to mount or expand them, and records the result in ActualStateOfWorld.
  3. unmountDetachDevices(): iterates over the volumes in ActualStateOfWorld; if a volume has been attached but no pod is using it and it is not recorded in DesiredStateOfWorld, it is unmounted/detached.

Let’s take mountAttachVolumes() as an example and see how it calls the CSI Node interface.

func (rc *reconciler) mountAttachVolumes() {
	// Ensure volumes that should be attached/mounted are attached/mounted.
	for _, volumeToMount := range rc.desiredStateOfWorld.GetVolumesToMount() {
		volMounted, devicePath, err := rc.actualStateOfWorld.PodExistsInVolume(volumeToMount.PodName, volumeToMount.VolumeName)
		volumeToMount.DevicePath = devicePath
		if cache.IsVolumeNotAttachedError(err) {
			...
		} else if !volMounted || cache.IsRemountRequiredError(err) {
			// Volume is not mounted, or is already mounted, but requires remounting
			err := rc.operationExecutor.MountVolume(
				rc.waitForAttachTimeout,
				volumeToMount.VolumeToMount,
				rc.actualStateOfWorld,
				isRemount)
			...
		} else if cache.IsFSResizeRequiredError(err) {
			err := rc.operationExecutor.ExpandInUseVolume(
				volumeToMount.VolumeToMount,
				rc.actualStateOfWorld)
			...
		}
	}
}

The mount itself is executed by rc.operationExecutor; let's look at the operationExecutor code.

func (oe *operationExecutor) MountVolume(
	waitForAttachTimeout time.Duration,
	volumeToMount VolumeToMount,
	actualStateOfWorld ActualStateOfWorldMounterUpdater,
	isRemount bool) error {
	...
	var generatedOperations volumetypes.GeneratedOperations
		generatedOperations = oe.operationGenerator.GenerateMountVolumeFunc(
			waitForAttachTimeout, volumeToMount, actualStateOfWorld, isRemount)

	// Avoid executing mount/map from multiple pods referencing the
	// same volume in parallel
	podName := nestedpendingoperations.EmptyUniquePodName

	return oe.pendingOperations.Run(
		volumeToMount.VolumeName, podName, "" /* nodeName */, generatedOperations)
}

This function first constructs the operation function and then executes it. Next, let's look at the generator.

func (og *operationGenerator) GenerateMountVolumeFunc(
	waitForAttachTimeout time.Duration,
	volumeToMount VolumeToMount,
	actualStateOfWorld ActualStateOfWorldMounterUpdater,
	isRemount bool) volumetypes.GeneratedOperations {

	volumePlugin, err :=
		og.volumePluginMgr.FindPluginBySpec(volumeToMount.VolumeSpec)

	mountVolumeFunc := func() volumetypes.OperationContext {
		// Get mounter plugin
		volumePlugin, err := og.volumePluginMgr.FindPluginBySpec(volumeToMount.VolumeSpec)
		volumeMounter, newMounterErr := volumePlugin.NewMounter(
			volumeToMount.VolumeSpec,
			volumeToMount.Pod,
			volume.VolumeOptions{})
		...
		// Execute mount
		mountErr := volumeMounter.SetUp(volume.MounterArgs{
			FsUser:              util.FsUserFrom(volumeToMount.Pod),
			FsGroup:             fsGroup,
			DesiredSize:         volumeToMount.DesiredSizeLimit,
			FSGroupChangePolicy: fsGroupChangePolicy,
		})
		// Update actual state of world
		markOpts := MarkVolumeOpts{
			PodName:             volumeToMount.PodName,
			PodUID:              volumeToMount.Pod.UID,
			VolumeName:          volumeToMount.VolumeName,
			Mounter:             volumeMounter,
			OuterVolumeSpecName: volumeToMount.OuterVolumeSpecName,
			VolumeGidVolume:     volumeToMount.VolumeGidValue,
			VolumeSpec:          volumeToMount.VolumeSpec,
			VolumeMountState:    VolumeMounted,
		}

		markVolMountedErr := actualStateOfWorld.MarkVolumeAsMounted(markOpts)
		...
		return volumetypes.NewOperationContext(nil, nil, migrated)
	}

	return volumetypes.GeneratedOperations{
		OperationName:     "volume_mount",
		OperationFunc:     mountVolumeFunc,
		EventRecorderFunc: eventRecorderFunc,
		CompleteFunc:      util.OperationCompleteHook(util.GetFullQualifiedPluginNameForVolume(volumePluginName, volumeToMount.VolumeSpec), "volume_mount"),
	}
}

Here, we first look up the corresponding plugin in the list of CSI plugins registered with the kubelet, then execute volumeMounter.SetUp, and finally update the ActualStateOfWorld record. For an out-of-tree CSI plugin, the mounter is csiMountMgr, and its code is as follows.

func (c *csiMountMgr) SetUp(mounterArgs volume.MounterArgs) error {
	return c.SetUpAt(c.GetPath(), mounterArgs)
}

func (c *csiMountMgr) SetUpAt(dir string, mounterArgs volume.MounterArgs) error {
	csi, err := c.csiClientGetter.Get()
	...

	err = csi.NodePublishVolume(
		ctx,
		volumeHandle,
		readOnly,
		deviceMountPath,
		dir,
		accessMode,
		publishContext,
		volAttribs,
		nodePublishSecrets,
		fsType,
		mountOptions,
	)
    ...
	return nil
}

As you can see, the NodePublishVolume / NodeUnpublishVolume interfaces of the CSI Node service are called by the csiMountMgr in the kubelet's volumeManager.

Summary

This article analyzed the whole CSI system from three angles: the CSI components, the CSI interfaces, and the process by which a volume is mounted into a pod.

CSI is the standard storage interface for the entire container ecosystem, and the CO communicates with the CSI plug-in via gRPC. In order to achieve universality, k8s has designed many external components to work with the CSI plug-in to achieve different functions, thus ensuring the purity of k8s internal logic and the simplicity of the CSI plug-in.