New Content for NVIDIA T4 GPU Support

Created 2 new topics in Node management - HW Acceleration Devices:
- Configure NVIDIA GPU Operator for PCI Passthrough
- Delete the GPU Operator
Patch 4: Added NVIDIA information in Planning - Verified Comm HW
Patch 5: Acted on Greg's comment
Patch 6: updated Index as requested in review
worked on comments from Ghada
Patch 7 and 8: acted on Mary's comments
Added 'release-caveat'
Acted on Ron's comments

Story: 2008434
Task: 42220

https://review.opendev.org/c/starlingx/docs/+/785251
Signed-off-by: Adil <mohamed.adilassakkali@windriver.com>
Change-Id: I337e33e805d89621436b35c238aca800b0727e0b
2021-04-07 14:42:46 -03:00
commit 3053ff6e40
parent 1521b4c4a9
4 changed files with 259 additions and 11 deletions
--- a/doc/source/node_management/kubernetes/hardware_acceleration_devices/configure-nvidia-gpu-operator-for-pci-passthrough.rst
+++ b/doc/source/node_management/kubernetes/hardware_acceleration_devices/configure-nvidia-gpu-operator-for-pci-passthrough.rst
@@ -0,0 +1,171 @@
+
+.. fgy1616003207054
+.. _configure-nvidia-gpu-operator-for-pci-passthrough:
+
+=================================================
+Configure NVIDIA GPU Operator for PCI Passthrough
+=================================================
+
+|release-caveat|
+
+This section provides instructions for configuring NVIDIA GPU Operator.
+
+.. rubric:: |context|
+
+.. note::
+    NVIDIA GPU Operator is supported only with the standard performance
+    kernel profile. The low-latency performance kernel profile is not
+    supported.
+
+NVIDIA GPU Operator automates the installation, maintenance, and management of
+the NVIDIA software needed to provision NVIDIA GPUs, and the provisioning of
+pods that require ``nvidia.com/gpu`` resources.
+
+NVIDIA GPU Operator is delivered as a Helm chart that installs a number of
+services and pods, which automate the provisioning of NVIDIA GPUs with the
+required NVIDIA software components. These components include:
+
+.. _fgy1616003207054-ul-sng-blk-z4b:
+
+-   NVIDIA drivers \(to enable CUDA, which is a parallel computing platform\)
+
+-   Kubernetes device plugin for GPUs
+
+-   NVIDIA Container Runtime
+
+-   Automatic Node labelling
+
+-   DCGM \(NVIDIA Data Center GPU Manager\) based monitoring
+
+.. rubric:: |prereq|
+
+Download the **gpu-operator-v3-1.6.0.3.tgz** file from
+`http://mirror.starlingx.cengn.ca/mirror/starlingx/
+<http://mirror.starlingx.cengn.ca/mirror/starlingx/>`__.
+
+Use the following steps to configure the GPU Operator container:
+
+.. rubric:: |proc|
+
+#.  Lock the host\(s\).
+
+    .. code-block:: none
+
+        ~(keystone_admin)]$ system host-lock <hostname>
+
+#.  Configure the Container Runtime host path to the NVIDIA runtime, which will be installed by the GPU Operator Helm deployment.
+
+    .. code-block:: none
+
+        ~(keystone_admin)]$ system service-parameter-add platform container_runtime custom_container_runtime=nvidia:/usr/local/nvidia/toolkit/nvidia-container-runtime
+
+#.  Unlock the host\(s\). Once unlocked, the system reboots automatically.
+
+    .. code-block:: none
+
+        ~(keystone_admin)]$ system host-unlock <hostname>
+
+#.  Create the RuntimeClass resource definition and apply it to the system.
+
+    .. code-block:: none
+
+        cat > nvidia.yml << EOF
+        kind: RuntimeClass
+        apiVersion: node.k8s.io/v1beta1
+        metadata:
+          name: nvidia
+        handler: nvidia
+        EOF
+
+    .. code-block:: none
+
+        ~(keystone_admin)]$ kubectl apply -f nvidia.yml
+
+#.  Install the GPU Operator Helm charts.
+
+    .. code-block:: none
+
+        ~(keystone_admin)]$ helm install --name gpu-operator /path/to/gpu-operator-1.6.0.3.tgz
+
+#.  Check if the GPU Operator is deployed using the following command.
+
+    .. code-block:: none
+
+        ~(keystone_admin)]$ kubectl get pods -A
+        NAMESPACE               NAME      READY  STATUS    RESTART  AGE
+        default                 g-node..  1/1    Running   1       7h54m
+        default                 g-node..  1/1    Running   1       7h54m
+        default                 gpu-ope.  1/1    Running   1       7h54m
+        gpu-operator-resources  gpu-..    1/1    Running   4       28m
+        gpu-operator-resources  nvidia..  1/1    Running   0       28m
+        gpu-operator-resources  nvidia..  1/1    Running   0       28m
+        gpu-operator-resources  nvidia..  1/1    Running   0       28m
+        gpu-operator-resources  nvidia..  0/1    Completed 0       7h53m
+        gpu-operator-resources  nvidia..  1/1    Running   0       28m
+
+    The plugin validation pod is marked completed.
+
+#.  Check if the nvidia.com/gpu resources are available using the following command.
+
+    .. code-block:: none
+
+        ~(keystone_admin)]$ kubectl describe nodes <hostname> | grep nvidia
+
+#.  Define a pod that uses the NVIDIA RuntimeClass and requests an
+    ``nvidia.com/gpu`` resource. Create the nvidia-usage-example-pod.yml file
+    to launch a pod that uses an NVIDIA GPU. For example:
+
+    .. code-block:: none
+
+        cat <<EOF > nvidia-usage-example-pod.yml
+        apiVersion: v1
+        kind: Pod
+        metadata:
+          name: nvidia-usage-example-pod
+        spec:
+          runtimeClassName: nvidia
+          containers:
+            - name: nvidia-usage-example-pod
+              image: nvidia/samples:cuda10.2-vectorAdd
+              imagePullPolicy: IfNotPresent
+              command: [ "/bin/bash", "-c", "--" ]
+              args: [ "while true; do sleep 300000; done;" ]
+              resources:
+                requests:
+                  nvidia.com/gpu: 1
+                limits:
+                  nvidia.com/gpu: 1
+        EOF
+
+#.  Create a pod using the following command.
+
+    .. code-block:: none
+
+        ~(keystone_admin)]$ kubectl create -f nvidia-usage-example-pod.yml
+
+#.  Check that the pod has been set up correctly. The status of the NVIDIA device is displayed in the table.
+
+    .. code-block:: none
+
+        ~(keystone_admin)]$ kubectl exec -it nvidia-usage-example-pod -- nvidia-smi
+        +-----------------------------------------------------------------------------+
+        | NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
+        |-------------------------------+----------------------+----------------------+
+        | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
+        | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
+        |                               |                      |               MIG M. |
+        |===============================+======================+======================|
+        |   0  Tesla T4            On   | 00000000:AF:00.0 Off |                    0 |
+        | N/A   28C    P8    14W /  70W |      0MiB / 15109MiB |      0%      Default |
+        |                               |                      |                  N/A |
+        +-------------------------------+----------------------+----------------------+
+
+        +-----------------------------------------------------------------------------+
+        | Processes:                                                                  |
+        |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
+        |        ID   ID                                                   Usage      |
+        |=============================================================================|
+        |  No running processes found                                                 |
+        +-----------------------------------------------------------------------------+
+
+    For information on deleting the GPU Operator, see :ref:`Delete the GPU
+    Operator <delete-the-gpu-operator>`.
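
The deployment check in this procedure (inspecting the ``kubectl get pods -A`` listing) can be scripted. The following is a minimal sketch, not part of StarlingX or the GPU Operator: ``gpu_operator_pods_ok`` is a hypothetical helper name, and the sketch assumes the default ``kubectl get pods -A`` column order (NAMESPACE, NAME, READY, STATUS, ...) shown in the sample output above.

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration, not a StarlingX command).
# Reads `kubectl get pods -A` output on stdin and prints "<pod>: <status>" for
# every pod in the gpu-operator-resources namespace whose STATUS column is
# neither Running nor Completed.
# Exit status: 0 if no such pod was found, 1 otherwise.
gpu_operator_pods_ok() {
    awk '
        $1 == "gpu-operator-resources" && $4 != "Running" && $4 != "Completed" {
            print $2 ": " $4
            bad = 1
        }
        END { exit bad + 0 }
    '
}

# Assumed usage on a live system:
#   kubectl get pods -A | gpu_operator_pods_ok && echo "GPU Operator pods are up"
```

The helper only reads text from standard input, so it can be tried against saved command output before it is used on a live system.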
--- a/doc/source/node_management/kubernetes/hardware_acceleration_devices/delete-the-gpu-operator.rst
+++ b/doc/source/node_management/kubernetes/hardware_acceleration_devices/delete-the-gpu-operator.rst
@@ -0,0 +1,59 @@
+
+.. nsr1616019467549
+.. _delete-the-gpu-operator:
+
+=======================
+Delete the GPU Operator
+=======================
+
+|release-caveat|
+
+Use the commands in this section to delete the GPU Operator, if required.
+
+.. rubric:: |prereq|
+
+Ensure that all user-generated pods with access to ``nvidia.com/gpu`` resources are deleted first.
+
+.. rubric:: |proc|
+
+#.  Remove the GPU Operator pods from the system using the following commands:
+
+    .. code-block:: none
+
+        ~(keystone_admin)]$ helm delete --purge gpu-operator
+        ~(keystone_admin)]$ kubectl delete runtimeclasses.node.k8s.io nvidia
+
+#.  Remove the GPU Operator, and remove the service parameter
+    ``platform container_runtime custom_container_runtime`` from the system,
+    using the following commands:
+
+    #.  Lock the host\(s\).
+
+        .. code-block:: none
+
+            ~(keystone_admin)]$ system host-lock <hostname>
+
+    #.  List the service parameters using the following command.
+
+        .. code-block:: none
+
+            ~(keystone_admin)]$ system service-parameter-list
+
+    #.  Remove the service parameter ``platform container_runtime custom_container_runtime``
+        from the system, using the following command.
+
+        .. code-block:: none
+
+            ~(keystone_admin)]$ system service-parameter-delete <service param ID>
+
+        where ``<service param ID>`` is the ID of the service parameter, for example, 3c509c97-92a6-4882-a365-98f1599a8f56.
+
+    #.  Unlock the host\(s\).
+
+        .. code-block:: none
+
+            ~(keystone_admin)]$ system host-unlock <hostname>
+
+    For information on configuring the GPU Operator, see :ref:`Configure NVIDIA
+    GPU Operator for PCI Passthrough
+    <configure-nvidia-gpu-operator-for-pci-passthrough>`.
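
Looking up the service parameter ID by eye in the ``system service-parameter-list`` output can be error-prone. The following is a minimal sketch of extracting it with a script. ``find_runtime_param_id`` is a hypothetical helper name, and the sketch assumes the listing is a pipe-delimited table whose first four columns are uuid, service, section, and name. Verify the column order on your system before relying on it.

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration, not a StarlingX command).
# Reads a pipe-delimited table on stdin, assumed to have the columns
#   | uuid | service | section | name | ...
# and prints the uuid of the row for
#   platform / container_runtime / custom_container_runtime.
find_runtime_param_id() {
    awk -F'|' '
        NF >= 5 {
            # Trim the padding around the first four data columns.
            for (i = 2; i <= 5; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
            if ($3 == "platform" && $4 == "container_runtime" &&
                $5 == "custom_container_runtime")
                print $2
        }
    '
}

# Assumed usage on a locked host:
#   id=$(system service-parameter-list | find_runtime_param_id)
#   [ -n "$id" ] && system service-parameter-delete "$id"
```

Checking that the ID is non-empty before deleting avoids running ``service-parameter-delete`` with no argument when the parameter has already been removed.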
--- a/doc/source/node_management/kubernetes/index.rst
+++ b/doc/source/node_management/kubernetes/index.rst
@@ -273,17 +273,6 @@ Node inventory tasks
 Hardware acceleration devices
 -----------------------------

-.. toctree::
-   :maxdepth: 1
-
-   hardware_acceleration_devices/uploading-a-device-image
-   hardware_acceleration_devices/listing-uploaded-device-images
-   hardware_acceleration_devices/listing-device-labels
-   hardware_acceleration_devices/removing-a-device-image
-   hardware_acceleration_devices/removing-a-device-label
-   hardware_acceleration_devices/initiating-a-device-image-update-for-a-host
-   hardware_acceleration_devices/displaying-the-status-of-device-images
-
 ************************
 Intel N3000 FPGA support
 ************************
@@ -295,8 +284,22 @@ Intel N3000 FPGA support
   hardware_acceleration_devices/updating-an-intel-n3000-fpga-image
   hardware_acceleration_devices/n3000-fpga-forward-error-correction
   hardware_acceleration_devices/showing-details-for-an-fpga-device
+   hardware_acceleration_devices/uploading-a-device-image
   hardware_acceleration_devices/common-device-management-tasks

+Common device management tasks
+******************************
+
+.. toctree::
+   :maxdepth: 2
+
+   hardware_acceleration_devices/listing-uploaded-device-images
+   hardware_acceleration_devices/listing-device-labels
+   hardware_acceleration_devices/removing-a-device-image
+   hardware_acceleration_devices/removing-a-device-label
+   hardware_acceleration_devices/initiating-a-device-image-update-for-a-host
+   hardware_acceleration_devices/displaying-the-status-of-device-images
+
 ***********************************************
 vRAN Accelerator ACC100 Adapter \(Mount Bryce\)
 ***********************************************
@@ -306,6 +309,17 @@ vRAN Accelerator ACC100 Adapter \(Mount Bryce\)
   hardware_acceleration_devices/enabling-mount-bryce-hw-accelerator-for-hosted-vram-containerized-workloads
   hardware_acceleration_devices/set-up-pods-to-use-sriov

+
+*******************
+NVIDIA GPU Operator
+*******************
+.. toctree::
+   :maxdepth: 1
+
+   hardware_acceleration_devices/configure-nvidia-gpu-operator-for-pci-passthrough
+   hardware_acceleration_devices/delete-the-gpu-operator
+
+
 ------------------------
 Host hardware management
 ------------------------
--- a/doc/source/planning/kubernetes/verified-commercial-hardware.rst
+++ b/doc/source/planning/kubernetes/verified-commercial-hardware.rst
@@ -176,6 +176,10 @@ Verified and approved hardware components for use with |prod| are listed here.
    | Hardware Accelerator Devices Verified for PCI-Passthrough or PCI SR-IOV Access | -   ACC100 Adapter \(Mount Bryce\) - SRIOV only                                                                                                                                                                                                                                                                                                                                                                                        |
    +--------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | GPUs Verified for PCI Passthrough                                              | -   NVIDIA Corporation: VGA compatible controller - GM204GL \(Tesla M60 rev a1\)                                                                                                                                                                                                                                                                                                                                                       |
+    |                                                                                |                                                                                                                                                                                                                                                                                                                                                                                                                                        |
+    |                                                                                | -   NVIDIA T4 TENSOR CORE GPU                                                                                                                                                                                                                                                                                                                                                                                                          |
+    |                                                                                |                                                                                                                                                                                                                                                                                                                                                                                                                                        |
+    |                                                                                |                                                                                                                                                                                                                                                                                                                                                                                                                                        |
    +--------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Board Management Controllers                                                   | -   HPE iLO3                                                                                                                                                                                                                                                                                                                                                                                                                           |
    |                                                                                |                                                                                                                                                                                                                                                                                                                                                                                                                                        |