Using CUDA in containers requires nvidia-container-toolkit, in addition to a recent NVIDIA driver and a working container runtime (podman, docker, etc.).
Generate a CDI description file:
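A minimal sketch of this step, assuming nvidia-ctk (shipped with nvidia-container-toolkit) is on the PATH and that /etc/cdi/nvidia.yaml is the desired output location. The CDI_SPEC variable and the guard around the command are illustrative additions, not part of the official instructions.

```shell
#!/bin/sh
# Where the CDI specification is written; override by exporting CDI_SPEC.
CDI_SPEC="${CDI_SPEC:-/etc/cdi/nvidia.yaml}"

if command -v nvidia-ctk >/dev/null 2>&1; then
    # Describe the GPUs and driver files in a CDI spec (root is needed for /etc/cdi).
    sudo nvidia-ctk cdi generate --output="$CDI_SPEC"
    # Show the device names that can later be passed to --device.
    nvidia-ctk cdi list
else
    echo "nvidia-ctk not found; install nvidia-container-toolkit first" >&2
fi
```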
You can then pull a CUDA image and test whether CUDA is available:
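One way to run this check is sketched below. The image tag is an assumption (any CUDA image compatible with the installed driver should do), and the check_cuda_container helper and CUDA_IMAGE variable are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Assumed image tag; replace it with one that matches your driver version.
CUDA_IMAGE="${CUDA_IMAGE:-docker.io/nvidia/cuda:12.2.0-base-ubuntu22.04}"

# Run nvidia-smi in a throwaway container that receives every GPU via CDI.
# $1 is the container runtime command and defaults to podman.
check_cuda_container() {
    runtime="${1:-podman}"
    "$runtime" run --rm --device nvidia.com/gpu=all "$CUDA_IMAGE" nvidia-smi
}
```

Calling check_cuda_container with no argument uses podman; if the GPU is visible inside the container, nvidia-smi prints its usual device table. On SELinux-enabled hosts, an extra --security-opt=label=disable may be needed.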
The installation steps outlined above follow the Installation Guide in the NVIDIA Cloud Native Technologies documentation; please refer to it for details.
Current problem encountered
In short, the following error occurred after the NVIDIA driver was updated.
The root cause was that the container was started with --device nvidia.com/gpu=all (i.e. CDI), which depends on the files under /etc/cdi/. However, the file referenced by one section of /etc/cdi/nvidia.yaml (part of which is shown below) no longer exists, because nvidia was updated from 530.41.03-17 to 535.54.03-2; this is what eventually led to the above error.
In this case, regenerating the file will solve the problem (the following command is also mentioned in the Installation Guide - NVIDIA Cloud Native Technologies documentation).
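The diagnosis and fix can be sketched together as follows. This assumes mount entries in the spec appear as unquoted "hostPath: /some/path" lines; missing_host_paths and regenerate_if_stale are hypothetical helper names, and only the nvidia-ctk command itself comes from the Installation Guide.

```shell
#!/bin/sh
# Print each hostPath in a CDI spec that no longer exists on the host.
# Assumes unquoted "hostPath: /some/path" lines in the spec file given as $1.
missing_host_paths() {
    sed -n 's/^[[:space:]-]*hostPath:[[:space:]]*//p' "$1" | sort -u |
    while IFS= read -r host_path; do
        [ -e "$host_path" ] || printf '%s\n' "$host_path"
    done
}

# Regenerate the spec only when it references files that are gone.
regenerate_if_stale() {
    cdi_spec="${1:-/etc/cdi/nvidia.yaml}"
    if [ -n "$(missing_host_paths "$cdi_spec")" ]; then
        sudo nvidia-ctk cdi generate --output="$cdi_spec"
    fi
}
```

Running missing_host_paths /etc/cdi/nvidia.yaml after a driver update should list any versioned driver files that the old spec still points to; regenerate_if_stale then rewrites the spec in place.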
To avoid further problems, the following hook was created to automatically regenerate /etc/cdi/nvidia.yaml whenever nvidia is updated. Since nvidia has only been updated once so far, it is not yet known whether the hook works properly.
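A minimal, untested sketch of what such a hook could look like, assuming a pacman-based system (Arch Linux), driver packages named nvidia and nvidia-utils, and nvidia-ctk installed at /usr/bin/nvidia-ctk. The write_cdi_hook helper and the nvidia-cdi.hook file name are hypothetical, and this is not necessarily the hook described above.

```shell
#!/bin/sh
# Write a pacman hook that regenerates the CDI spec after driver upgrades.
# Assumptions: pacman, packages nvidia / nvidia-utils, /usr/bin/nvidia-ctk.
# $1 is the hook directory and defaults to pacman's local hook directory.
write_cdi_hook() {
    hook_dir="${1:-/etc/pacman.d/hooks}"
    mkdir -p "$hook_dir" || return 1
    cat > "$hook_dir/nvidia-cdi.hook" <<'EOF'
[Trigger]
Operation = Install
Operation = Upgrade
Type = Package
Target = nvidia
Target = nvidia-utils

[Action]
Description = Regenerating NVIDIA CDI specification...
When = PostTransaction
Exec = /usr/bin/nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
EOF
}
```

Called from a root shell with no argument, write_cdi_hook places the file under /etc/pacman.d/hooks. One caveat to keep in mind: right after a driver upgrade the loaded kernel module may still be the old version, so the regeneration may not succeed until the machine is rebooted.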