diff --git a/doc/source/node_management/kubernetes/hardware_acceleration_devices/configure-nvidia-gpu-operator-for-pci-passthrough.rst b/doc/source/node_management/kubernetes/hardware_acceleration_devices/configure-nvidia-gpu-operator-for-pci-passthrough.rst
new file mode 100644
index 000000000..17f9ca2f2
--- /dev/null
+++ b/doc/source/node_management/kubernetes/hardware_acceleration_devices/configure-nvidia-gpu-operator-for-pci-passthrough.rst
@@ -0,0 +1,171 @@

.. fgy1616003207054
.. _configure-nvidia-gpu-operator-for-pci-passthrough:

=================================================
Configure NVIDIA GPU Operator for PCI Passthrough
=================================================

|release-caveat|

This section provides instructions for configuring the NVIDIA GPU Operator.

.. rubric:: |context|

.. note::
   The NVIDIA GPU Operator is supported only with the standard performance
   kernel profile. The low-latency performance kernel profile is not
   supported.

The NVIDIA GPU Operator automates the installation, maintenance, and
management of the NVIDIA software needed to provision NVIDIA GPUs, as well as
the provisioning of pods that require ``nvidia.com/gpu`` resources.

The NVIDIA GPU Operator is delivered as a Helm chart that installs a number
of services and pods to automate the provisioning of NVIDIA GPUs with the
required NVIDIA software components. These components include:

.. _fgy1616003207054-ul-sng-blk-z4b:

- NVIDIA drivers \(to enable CUDA, a parallel computing platform\)

- Kubernetes device plugin for GPUs

- NVIDIA Container Runtime

- Automatic node labelling

- DCGM \(NVIDIA Data Center GPU Manager\) based monitoring

.. rubric:: |prereq|

Download the **gpu-operator-v3-1.6.0.3.tgz** file from
`http://mirror.starlingx.cengn.ca/mirror/starlingx/
<http://mirror.starlingx.cengn.ca/mirror/starlingx/>`__.

Use the following steps to configure the GPU Operator container:

.. rubric:: |proc|

#. Lock the host\(s\).

   .. code-block:: none

      ~(keystone_admin)]$ system host-lock

#. Configure the container runtime host path to the NVIDIA runtime, which
   will be installed by the GPU Operator Helm deployment.

   .. code-block:: none

      ~(keystone_admin)]$ system service-parameter-add platform container_runtime custom_container_runtime=nvidia:/usr/local/nvidia/toolkit/nvidia-container-runtime

#. Unlock the host\(s\). Once unlocked, the host will reboot automatically.

   .. code-block:: none

      ~(keystone_admin)]$ system host-unlock

#. Create the RuntimeClass resource definition and apply it to the system.

   .. code-block:: none

      cat > nvidia.yml << EOF
      kind: RuntimeClass
      apiVersion: node.k8s.io/v1beta1
      metadata:
        name: nvidia
      handler: nvidia
      EOF

   .. code-block:: none

      ~(keystone_admin)]$ kubectl apply -f nvidia.yml

#. Install the GPU Operator Helm charts.

   .. code-block:: none

      ~(keystone_admin)]$ helm install --name gpu-operator /path/to/gpu-operator-v3-1.6.0.3.tgz

#. Check that the GPU Operator is deployed using the following command.

   .. code-block:: none

      ~(keystone_admin)]$ kubectl get pods -A
      NAMESPACE               NAME      READY   STATUS      RESTARTS   AGE
      default                 g-node..  1/1     Running     1          7h54m
      default                 g-node..  1/1     Running     1          7h54m
      default                 gpu-ope.  1/1     Running     1          7h54m
      gpu-operator-resources  gpu-..    1/1     Running     4          28m
      gpu-operator-resources  nvidia..  1/1     Running     0          28m
      gpu-operator-resources  nvidia..  1/1     Running     0          28m
      gpu-operator-resources  nvidia..  1/1     Running     0          28m
      gpu-operator-resources  nvidia..  0/1     Completed   0          7h53m
      gpu-operator-resources  nvidia..  1/1     Running     0          28m

   The plugin validation pod is marked Completed.

#. Check that the ``nvidia.com/gpu`` resources are available using the
   following command.

   .. code-block:: none

      ~(keystone_admin)]$ kubectl describe nodes | grep nvidia

#. Create a pod that uses the NVIDIA RuntimeClass and requests a
   ``nvidia.com/gpu`` resource.
   Update the **nvidia-usage-example-pod.yml** file to launch a pod with an
   NVIDIA GPU. For example:

   .. code-block:: none

      cat > nvidia-usage-example-pod.yml << EOF
      apiVersion: v1
      kind: Pod
      metadata:
        name: nvidia-usage-example-pod
      spec:
        runtimeClassName: nvidia
        containers:
        - name: nvidia-usage-example-pod
          image: nvidia/samples:cuda10.2-vectorAdd
          imagePullPolicy: IfNotPresent
          command: [ "/bin/bash", "-c", "--" ]
          args: [ "while true; do sleep 300000; done;" ]
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
      EOF

#. Create the pod using the following command.

   .. code-block:: none

      ~(keystone_admin)]$ kubectl create -f nvidia-usage-example-pod.yml

#. Check that the pod has been set up correctly. The status of the NVIDIA
   device is displayed in the command output.

   .. code-block:: none

      ~(keystone_admin)]$ kubectl exec -it nvidia-usage-example-pod -- nvidia-smi
      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
      |-------------------------------+----------------------+----------------------+
      | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
      |                               |                      |               MIG M. |
      |===============================+======================+======================|
      |   0  Tesla T4            On   | 00000000:AF:00.0 Off |                    0 |
      | N/A   28C    P8    14W /  70W |      0MiB / 15109MiB |      0%      Default |
      |                               |                      |                  N/A |
      +-------------------------------+----------------------+----------------------+

      +-----------------------------------------------------------------------------+
      | Processes:                                                                  |
      |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
      |        ID   ID                                                   Usage      |
      |=============================================================================|
      |  No running processes found                                                 |
      +-----------------------------------------------------------------------------+

For information on deleting the GPU Operator, see :ref:`Delete the GPU
Operator <delete-the-gpu-operator>`.
diff --git a/doc/source/node_management/kubernetes/hardware_acceleration_devices/delete-the-gpu-operator.rst b/doc/source/node_management/kubernetes/hardware_acceleration_devices/delete-the-gpu-operator.rst
new file mode 100644
index 000000000..6791f6caa
--- /dev/null
+++ b/doc/source/node_management/kubernetes/hardware_acceleration_devices/delete-the-gpu-operator.rst
@@ -0,0 +1,59 @@

.. nsr1616019467549
.. _delete-the-gpu-operator:

=======================
Delete the GPU Operator
=======================

|release-caveat|

Use the commands in this section to delete the GPU Operator, if required.

.. rubric:: |prereq|

Ensure that all user-generated pods with access to ``nvidia.com/gpu``
resources are deleted first.

.. rubric:: |proc|

#. Remove the GPU Operator pods from the system using the following commands:

   .. code-block:: none

      ~(keystone_admin)]$ helm delete --purge gpu-operator
      ~(keystone_admin)]$ kubectl delete runtimeclasses.node.k8s.io nvidia

#. Remove the ``platform container_runtime custom_container_runtime`` service
   parameter from the system, using the following commands:

   #. Lock the host\(s\).

      .. code-block:: none

         ~(keystone_admin)]$ system host-lock

   #. List the service parameters using the following command.

      .. code-block:: none

         ~(keystone_admin)]$ system service-parameter-list

   #. Remove the ``platform container_runtime custom_container_runtime``
      service parameter from the system, using the following command.

      .. code-block:: none

         ~(keystone_admin)]$ system service-parameter-delete <parameter_id>

      where ``<parameter_id>`` is the ID of the service parameter, for
      example, 3c509c97-92a6-4882-a365-98f1599a8f56.

   #. Unlock the host\(s\).

      .. code-block:: none

         ~(keystone_admin)]$ system host-unlock

For information on configuring the GPU Operator, see :ref:`Configure NVIDIA
GPU Operator for PCI Passthrough
<configure-nvidia-gpu-operator-for-pci-passthrough>`.
diff --git a/doc/source/node_management/kubernetes/index.rst b/doc/source/node_management/kubernetes/index.rst
index 052300021..5faf5ebbe 100644
--- a/doc/source/node_management/kubernetes/index.rst
+++ b/doc/source/node_management/kubernetes/index.rst
@@ -273,17 +273,6 @@ Node inventory tasks
 Hardware acceleration devices
 -----------------------------
 
-.. toctree::
-   :maxdepth: 1
-
-   hardware_acceleration_devices/uploading-a-device-image
-   hardware_acceleration_devices/listing-uploaded-device-images
-   hardware_acceleration_devices/listing-device-labels
-   hardware_acceleration_devices/removing-a-device-image
-   hardware_acceleration_devices/removing-a-device-label
-   hardware_acceleration_devices/initiating-a-device-image-update-for-a-host
-   hardware_acceleration_devices/displaying-the-status-of-device-images
-
 ************************
 Intel N3000 FPGA support
 ************************
@@ -295,8 +284,22 @@ Intel N3000 FPGA support
    hardware_acceleration_devices/updating-an-intel-n3000-fpga-image
    hardware_acceleration_devices/n3000-fpga-forward-error-correction
    hardware_acceleration_devices/showing-details-for-an-fpga-device
+   hardware_acceleration_devices/uploading-a-device-image
    hardware_acceleration_devices/common-device-management-tasks
 
+Common device management tasks
+******************************
+
+.. toctree::
+   :maxdepth: 2
+
+   hardware_acceleration_devices/listing-uploaded-device-images
+   hardware_acceleration_devices/listing-device-labels
+   hardware_acceleration_devices/removing-a-device-image
+   hardware_acceleration_devices/removing-a-device-label
+   hardware_acceleration_devices/initiating-a-device-image-update-for-a-host
+   hardware_acceleration_devices/displaying-the-status-of-device-images
+
 ***********************************************
 vRAN Accelerator ACC100 Adapter \(Mount Bryce\)
 ***********************************************
@@ -306,6 +309,17 @@ vRAN Accelerator ACC100 Adapter \(Mount Bryce\)
    hardware_acceleration_devices/enabling-mount-bryce-hw-accelerator-for-hosted-vram-containerized-workloads
    hardware_acceleration_devices/set-up-pods-to-use-sriov
+
+*******************
+NVIDIA GPU Operator
+*******************
+
+.. toctree::
+   :maxdepth: 1
+
+   hardware_acceleration_devices/configure-nvidia-gpu-operator-for-pci-passthrough
+   hardware_acceleration_devices/delete-the-gpu-operator
+
 
 ------------------------
 Host hardware management
 ------------------------
diff --git a/doc/source/planning/kubernetes/verified-commercial-hardware.rst b/doc/source/planning/kubernetes/verified-commercial-hardware.rst
index 936c84a7f..c739b655f 100755
--- a/doc/source/planning/kubernetes/verified-commercial-hardware.rst
+++ b/doc/source/planning/kubernetes/verified-commercial-hardware.rst
@@ -176,6 +176,10 @@ Verified and approved hardware components for use with |prod| are listed here.
 | Hardware Accelerator Devices Verified for PCI-Passthrough or PCI SR-IOV Access | - ACC100 Adapter \(Mount Bryce\) - SRIOV only                                  |
 +--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
 | GPUs Verified for PCI Passthrough                                              | - NVIDIA Corporation: VGA compatible controller - GM204GL \(Tesla M60 rev a1\) |
+|                                                                                |                                                                                |
+|                                                                                | - NVIDIA T4 TENSOR CORE GPU                                                    |
+|                                                                                |                                                                                |
+|                                                                                |                                                                                |
 +--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
 | Board Management Controllers                                                   | - HPE iLO3                                                                     |
 |                                                                                |                                                                                |
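
As a reviewer convenience, the two manifests written by heredocs in the
configure procedure above (the RuntimeClass definition and the example GPU
pod) can be generated with a single script. This is a minimal sketch: the
file names and the sample CUDA image are taken from the procedure, and the
``kubectl apply``/``kubectl create`` steps are left to the operator, since
they require cluster access.

```shell
# Write the RuntimeClass and example pod manifests from the procedure above.
# Nothing here talks to the cluster; it only generates the two YAML files.
set -e

cat > nvidia.yml << 'EOF'
kind: RuntimeClass
apiVersion: node.k8s.io/v1beta1
metadata:
  name: nvidia
handler: nvidia
EOF

cat > nvidia-usage-example-pod.yml << 'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-usage-example-pod
spec:
  runtimeClassName: nvidia
  containers:
  - name: nvidia-usage-example-pod
    image: nvidia/samples:cuda10.2-vectorAdd
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
EOF

echo "manifests written: nvidia.yml nvidia-usage-example-pod.yml"
```

After running the script, the procedure's ``kubectl apply -f nvidia.yml`` and
``kubectl create -f nvidia-usage-example-pod.yml`` steps can be run unchanged.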