Using CUDA in containers requires nvidia-container-toolkit, a reasonably recent NVIDIA driver, and a container runtime (podman, docker, etc.).

pacman -S podman
paru -S nvidia-container-toolkit

Generate a CDI description file:

nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

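You can then pull a CUDA image and test whether the GPU is visible inside the container (if the pull fails because the image has no latest tag, append an explicit tag to the image name):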

podman run --rm --device nvidia.com/gpu=all docker.io/nvidia/cuda nvidia-smi -L

The steps above follow the Installation Guide in the NVIDIA Cloud Native Technologies documentation; refer to it for details.

Current problem encountered

In short, after the NVIDIA driver was updated, starting a container failed with the following error:

Error: unable to start container "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx": crun: error stat'ing file `/usr/lib/libEGL_nvidia.so.530.41.03`: No such file or directory: OCI runtime attempted to invoke a command that was not found

The root cause is that the container was started with --device nvidia.com/gpu=all (i.e. via CDI), which relies on the spec files under /etc/cdi/. The entry in /etc/cdi/nvidia.yaml shown below (an excerpt) points to a file that no longer exists on the host, because nvidia was updated from 530.41.03-17 to 535.54.03-2. That stale entry is what caused the error above.

  - containerPath: /usr/lib/libEGL_nvidia.so.530.41.03
    hostPath: /usr/lib/libEGL_nvidia.so.530.41.03
    options:
    - ro
    - nosuid
    - nodev
    - bind
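Since the failure comes from a hostPath that has disappeared, it can help to scan the spec for such entries before starting a container. The following is only a minimal sketch: find_stale_cdi_paths is a hypothetical helper name (not part of nvidia-ctk), and it assumes that hostPath values are unquoted and sit on their own line, as in the excerpt above.

```shell
#!/usr/bin/env bash
# find_stale_cdi_paths is a hypothetical helper, not part of nvidia-ctk.
# It prints every hostPath in a CDI spec that is missing on the host.
# Assumption: hostPath values are unquoted and appear one per line.
find_stale_cdi_paths() {
    local spec="${1:-/etc/cdi/nvidia.yaml}"
    local path
    # Strip everything up to and including "hostPath:" on matching lines.
    sed -n 's/^[[:space:]]*\(- \)\{0,1\}hostPath:[[:space:]]*//p' "$spec" |
        while IFS= read -r path; do
            # Report the path only if nothing exists at that location.
            [ -e "$path" ] || printf '%s\n' "$path"
        done
}
```

Running find_stale_cdi_paths /etc/cdi/nvidia.yaml right after a driver update should list the old versioned libraries if the spec has not been regenerated, and print nothing once it is up to date.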

In this case, regenerating the file solves the problem (the same command is given in the Installation Guide in the NVIDIA Cloud Native Technologies documentation).

nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

Automation

To avoid running into this again, the following pacman hook was created to regenerate /etc/cdi/nvidia.yaml automatically whenever nvidia is updated. Since nvidia has only been updated once so far, it is not yet known whether the hook works properly.

# This file is located at /etc/pacman.d/hooks/nvidia-generate-cdi.hook
[Trigger]
Operation=Install
Operation=Upgrade
Operation=Remove
Type=Package
Target=nvidia

[Action]
Description=Update cdi for container
Depends=nvidia-container-toolkit
When=PostTransaction
NeedsTargets
Exec=/usr/bin/nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
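One way to find out whether the hook actually ran after the next driver update is to compare the driver version referenced in the spec with the installed one. The sketch below is an illustration only: cdi_spec_driver_versions is a hypothetical helper name, and it assumes the spec lists a versioned libcuda.so.<version> file whose suffix matches the driver version.

```shell
#!/usr/bin/env bash
# cdi_spec_driver_versions is a hypothetical helper, not part of nvidia-ctk.
# It prints each distinct driver version referenced in a CDI spec.
# Assumption: the spec contains paths such as /usr/lib/libcuda.so.535.54.03.
cdi_spec_driver_versions() {
    local spec="${1:-/etc/cdi/nvidia.yaml}"
    # Match only fully versioned names (at least X.Y), so the plain
    # libcuda.so.1 soname is ignored; then keep just the version part.
    grep -oE 'libcuda\.so\.[0-9]+(\.[0-9]+)+' "$spec" |
        sed 's/^libcuda\.so\.//' | sort -u
}
```

The output can be compared by hand with the version reported by nvidia-smi --query-gpu=driver_version --format=csv,noheader or by pacman -Q nvidia. If the two differ after an update, the hook did not regenerate the spec.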