Hardware offloads comparative performance analysis
Introduce a test plan for assessing throughput and the effect on CPU utilization when various hardware offloads are enabled. Test results and their analysis are attached.

Change-Id: I2269e281a11a9e05138119692c680f70767fce15
@@ -0,0 +1,519 @@

.. _hardware_offloads_performance_analysis:

====================================================
Hardware Offloads - Comparative performance analysis
====================================================

:status: **draft**
:version: 1.0

:Abstract:

  The aim of this document is to present scenarios for a comparative
  performance analysis of various hardware offloads. We examine differences
  in throughput as well as the effect on CPU utilization when these hardware
  offloads are enabled.

  The end goal is to provide documentation that a deployer of OpenStack
  can use to answer a number of common questions when constructing their
  environments:

  - Enabling which hardware offloads will give me the biggest "bang for
    the buck" when it comes to increased network throughput?

  - What impact on CPU utilization does disabling certain hardware
    offloads have?

:Conventions:

  - **VxLAN:** Virtual eXtensible LAN

  - **NIC:** Network Interface Card

  - **TX:** Transmit (packets transmitted out the interface)

  - **RX:** Receive (packets received on the interface)

  - **TSO:** TCP Segmentation Offload

  - **GRO:** Generic Receive Offload

  - **GSO:** Generic Segmentation Offload

  - **MTU:** Maximum Transmission Unit

  - **OVS:** Open vSwitch

Test Plan
=========

The following hardware offloads were examined in this performance
comparison:

- Tx checksumming

- TCP segmentation offload (TSO)

- Generic segmentation offload (GSO)

- Tx UDP tunneling segmentation

- Rx checksumming

- Generic receive offload (GRO)

The performance analysis is based on the results of running test
scenarios in a lab consisting of two hardware nodes.

- Scenario #1: Baseline physical

- Scenario #2: Baseline physical over VxLAN tunnel

- Scenario #3: VM-to-VM on different hardware nodes

- Scenario #4: VM-to-VM on different nodes over VxLAN tunnel

For each of the above scenarios, we used `netperf` to measure the network
throughput and the `sar` utility to measure CPU usage.

One node was used as the transmitter while the other served as the
receiver and ran the `netserver` daemon.

Performance validation involved running both TCP and UDP stream tests
with all offloads on, and then turning each of the offloads off, one
by one.

For the transmit side, the following offloads were toggled:

- TSO

- Tx checksumming

- GSO (no VxLAN tunneling)

- Tx-udp\_tnl-segmentation (VxLAN tunneling only)

For the receive side:

- GRO

- Rx checksumming

Baseline physical tests involved turning offloads on and off for the `p514p2`
physical interface on both the transmitter and the receiver.

Some hardware offloads require certain other offloads to be enabled or
disabled in order to have any effect. For example, TSO cannot be
enabled without Tx checksumming being turned on, while GSO only kicks in
when TSO is disabled. For this reason, the following order of turning off
offloads was chosen for the tests:

+-------------------+--------------------------------+--------------------------------+--------------------------------+------------------------------------+
| **Transmitter**   | All on                         | TSO off                        | Tx checksumming off            | GSO/tx-udp\_tnl-segmentation off   |
+===================+================================+================================+================================+====================================+
| Active offloads   | TSO,                           | Tx checksumming,               | GSO/tx-udp\_tnl-segmentation   | All off                            |
|                   |                                |                                |                                |                                    |
|                   | Tx checksumming                | GSO/tx-udp\_tnl-segmentation   |                                |                                    |
+-------------------+--------------------------------+--------------------------------+--------------------------------+------------------------------------+
| Offloads enabled  | TSO,                           | Tx checksumming,               | GSO/tx-udp\_tnl-segmentation   | All off                            |
|                   |                                |                                |                                |                                    |
|                   | Tx checksumming,               | GSO/tx-udp\_tnl-segmentation   |                                |                                    |
|                   |                                |                                |                                |                                    |
|                   | GSO/tx-udp\_tnl-segmentation   |                                |                                |                                    |
+-------------------+--------------------------------+--------------------------------+--------------------------------+------------------------------------+

+-------------------+-------------------+-------------------+-----------------------+
| **Receiver**      | All on            | GRO off           | Rx checksumming off   |
+===================+===================+===================+=======================+
| Active offloads   | GRO,              | Rx checksumming   | All off               |
|                   |                   |                   |                       |
|                   | Rx checksumming   |                   |                       |
+-------------------+-------------------+-------------------+-----------------------+
| Offloads enabled  | GRO,              | Rx checksumming   | All off               |
|                   |                   |                   |                       |
|                   | Rx checksumming   |                   |                       |
+-------------------+-------------------+-------------------+-----------------------+
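
The transmitter-side ordering above can be sketched as a small dry-run script. The `ethtool -K` feature names (`tx`, `tso`, `gso`, `tx-udp_tnl-segmentation`) follow ethtool conventions, and the interface name `p514p2` is the one used in this lab; with `DRY_RUN=1` (the default here) the script only prints the commands so the sequence can be reviewed before touching a live NIC.

```bash
# Dry-run sketch of the transmitter-side off-order used in the tests.
# Interface name is the lab's example; DRY_RUN=1 prints instead of executing.
set -eo pipefail

IFACE="${IFACE:-p514p2}"
DRY_RUN="${DRY_RUN:-1}"
CMDS=()

run() {
    CMDS+=("ethtool -K $IFACE $*")
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: sudo ethtool -K $IFACE $*"
    else
        sudo ethtool -K "$IFACE" "$@"
    fi
}

# Step 0: all offloads on
run tx on tso on gso on tx-udp_tnl-segmentation on
# Step 1: TSO off (GSO / tx-udp_tnl-segmentation takes over segmentation)
run tso off
# Step 2: Tx checksumming off (TSO cannot be re-enabled without it)
run tx off
# Step 3: GSO / tx-udp_tnl-segmentation off -> all off
run gso off tx-udp_tnl-segmentation off
```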

Test Environment
----------------

Preparation
^^^^^^^^^^^

Baseline physical host setup
++++++++++++++++++++++++++++

We force the physical device to use only one queue. This allows us to
obtain consistent results between test runs, as we avoid the possibility
of RSS assigning the flows to a different queue, and as a result a
different CPU core:

``$ sudo ethtool -L p514p2 combined 1``

where `p514p2` is the physical interface name.

Pin that queue to CPU0, either by using the `set\_irq\_affinity` script or by
running:

``$ echo 1 | sudo tee /proc/irq/<irq_num>/smp_affinity``

(Note that ``sudo echo 1 > /proc/irq/<irq_num>/smp_affinity`` would not
work, since the redirection is performed by the unprivileged shell.)

This guarantees that all interrupts related to the `p514p2` interface are
processed on a designated CPU (CPU0 in our case), which allows us to isolate
the tests.

At the same time, any traffic generators, such as `netperf`, were manually
pinned to CPU1 using `taskset`:

``$ sudo taskset -c 1 netperf -H <peer_ip> -cC [-t UDP_STREAM]``
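
The throughput and CPU-utilization figures come from columns of the `netperf -cC` output. A small parsing sketch; the sample line below is illustrative of a TCP_STREAM data row (socket sizes, message size, elapsed time, throughput, send/receive CPU%), and the exact column layout can vary between netperf test types:

```bash
# Sketch: extract throughput and CPU utilization from a netperf -cC run.
# The sample data row is illustrative; column layout may differ per test type.
set -eo pipefail

parse_netperf() {
    # Data rows start with numeric socket sizes; keep the last one seen.
    awk 'NF >= 7 && $1 ~ /^[0-9]+$/ { tput=$5; scpu=$6; rcpu=$7 }
         END { print tput, scpu, rcpu }'
}

sample=' 87380  16384  16384    10.00      9413.76   5.10     7.20     0.355   0.501'

read -r tput send_cpu recv_cpu <<<"$(printf '%s\n' "$sample" | parse_netperf)"
echo "throughput=${tput} 10^6 bits/s, send CPU=${send_cpu}%, recv CPU=${recv_cpu}%"
```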

Next, it is important to prevent the CPUs from switching frequencies and
power states (C-states), as this can affect CPU utilization and thus
skew the test results.

Make the processors stay in the C0 C-state by writing 0 to the
`/dev/cpu\_dma\_latency` file. This prevents any other C-states
from being used, as long as the `/dev/cpu\_dma\_latency` file is kept
open. It can easily be done with `this helper
program <https://github.com/gtcasl/hpc-benchmarks/blob/master/setcpulatency.c>`__:

``$ make setcpulatency``

``$ sudo ./setcpulatency 0 &``

We made the CPUs run at maximum frequency by choosing the `performance`
scaling governor:

.. code-block:: bash

   $ echo "performance" | sudo tee \
       /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

In order to set up a VxLAN tunnel, an OVS bridge (`virbr`) was created:

``$ sudo ovs-vsctl add-br virbr``

Next, we created a VxLAN port, specifying the corresponding remote IP:

.. code-block:: bash

   $ sudo ovs-vsctl add-port virbr vxlan -- set interface vxlan type=vxlan \
       options:remote_ip=10.1.2.2 options:local_ip=10.1.2.1
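
The `local_ip`/`remote_ip` options are mirrored between the two hosts, which is easy to get backwards. A small sketch that derives both commands from a single address pair (bridge, port, and IPs as used above):

```bash
# Sketch: build the mirrored VxLAN port commands for both endpoints from one
# address pair, so local_ip/remote_ip cannot be swapped by hand. Names as above.
set -eo pipefail

vxlan_cmd() {
    # $1 = bridge, $2 = port, $3 = local IP, $4 = remote IP
    echo "ovs-vsctl add-port $1 $2 -- set interface $2 type=vxlan options:remote_ip=$4 options:local_ip=$3"
}

host1_cmd="$(vxlan_cmd virbr vxlan 10.1.2.1 10.1.2.2)"
host2_cmd="$(vxlan_cmd virbr vxlan 10.1.2.2 10.1.2.1)"
echo "$host1_cmd"
echo "$host2_cmd"
```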

.. image:: network_diagram_physical.png
   :width: 650px

VM-to-VM on different nodes setup
+++++++++++++++++++++++++++++++++

Scenarios 3 and 4 assess the performance impact of hardware offloads in
a deployment with two VMs running on separate hardware nodes. The
scenarios measure VM-to-VM TCP/UDP traffic throughput, with and without
VxLAN encapsulation.

In order to get more accurate CPU consumption metrics, the CPUs on which
the VMs would be running should be isolated from the kernel scheduler. This
prevents any other processes from running on those CPUs. In this
installation, half of the CPUs were isolated by adding the following to
the end of the ``/etc/default/grub`` file:

``GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX isolcpus=6-11"``

These settings were applied and the nodes rebooted:

``$ sudo update-grub``

``$ sudo reboot``
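
After rebooting, it is worth confirming which CPUs the `isolcpus=` option actually covers. A sketch that expands a single `a-b` range from a cmdline string; on a live host one would substitute the contents of `/proc/cmdline` for the example string:

```bash
# Sketch: expand the isolcpus= range from a kernel command line.
# Handles a single a-b range only; example string mirrors the GRUB entry above.
set -eo pipefail

cmdline="BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro isolcpus=6-11"

range="$(printf '%s\n' "$cmdline" | grep -o 'isolcpus=[0-9-]*' | cut -d= -f2)"
isolated="$(seq -s ' ' "${range%-*}" "${range#*-}")"
echo "isolated CPUs: $isolated"
```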

libvirt was installed and the default storage pool appropriately
configured:

``$ sudo apt-get install qemu-kvm libvirt-bin virtinst virt-manager``

``$ sudo virsh -c qemu:///system pool-define-as store dir --target /home/elena/store``

``$ sudo virsh -c qemu:///system pool-autostart store``

``$ sudo virsh -c qemu:///system pool-start store``

An OVS bridge (`virbr`) was created next:

``$ sudo ovs-vsctl add-br virbr``

Then, we added the physical interface `p514p1` as a port to `virbr` and
cleared all IP addresses from it:

``$ sudo ovs-vsctl add-port virbr p514p1``

``$ sudo ifconfig p514p1 0``

An IP address from the same subnet as the VMs was then added to `virbr`.
In this case, 10.1.6.0/24 was the subnet used for VM-to-VM communication:

``$ sudo ip addr add 10.1.6.1/24 dev virbr``

From that point, `virbr` had two IP addresses: 10.1.1.1 and 10.1.6.1
(10.1.1.2 and 10.1.6.3 on Host 2). Finally, we created a tap port `vport1`
to connect the VM to `virbr`:

``$ sudo ip tuntap add mode tap vport1``

``$ sudo ovs-vsctl add-port virbr vport1``

**Guest setup**

In scenarios 3 and 4, an Ubuntu Trusty cloud image was used. The
VMs were defined from an XML domain file. Each VM was pinned to a pair
of CPUs that had been isolated from the kernel scheduler, in the following
way in the `libvirt.xml` files:

.. code-block:: xml

   <vcpu placement='static' cpuset='7-8'>2</vcpu>
   <cputune>
     <vcpupin vcpu='0' cpuset='7'/>
     <vcpupin vcpu='1' cpuset='8'/>
   </cputune>

Each VM has two interfaces: `eth0` connects through a tap device (`vport1`)
to an OVS bridge (`virbr`), from which there is a link to the other host;
`eth1` is connected to `em1` through a virtual network (with NAT), which
allows the VM to access the Internet. Interface configuration was done as
follows:

.. code-block:: xml

   <interface type='bridge'>
     <source bridge='virbr'/>
     <virtualport type='openvswitch'/>
     <target dev='vport1'/>
     <model type='virtio'/>
     <alias name='net0'/>
   </interface>
   <interface type='network'>
     <source network='net1'/>
     <target dev='vnet0'/>
     <model type='rtl8139'/>
     <alias name='net1'/>
   </interface>

Interface configuration for VM1 was done in the following way:

/etc/network/interfaces.d/eth0.cfg

.. code-block:: bash

   # The primary network interface
   auto eth0
   iface eth0 inet static
       address 10.1.6.2
       netmask 255.255.255.0
       network 10.1.6.0
       gateway 10.1.6.1
       broadcast 10.1.6.255

/etc/network/interfaces.d/eth1.cfg

.. code-block:: bash

   auto eth1
   iface eth1 inet static
       address 192.168.100.2
       netmask 255.255.255.0
       network 192.168.100.0
       gateway 192.168.100.1
       broadcast 192.168.100.255
       dns-nameservers 8.8.8.8 8.8.4.4

And on VM2:

/etc/network/interfaces.d/eth0.cfg

.. code-block:: bash

   # The primary network interface
   auto eth0
   iface eth0 inet static
       address 10.1.6.4
       netmask 255.255.255.0
       network 10.1.6.0
       gateway 10.1.6.3
       broadcast 10.1.6.255

/etc/network/interfaces.d/eth1.cfg

.. code-block:: bash

   auto eth1
   iface eth1 inet static
       address 192.168.100.129
       netmask 255.255.255.0
       network 192.168.100.0
       gateway 192.168.100.1
       broadcast 192.168.100.255
       dns-nameservers 8.8.8.8 8.8.4.4

.. image:: network_diagram_VM-to-VM.png
   :width: 650px

For scenario 4, the `p514p1` port was removed from `virbr` and its IP
address restored:

``$ sudo ovs-vsctl del-port virbr p514p1``

``$ sudo ip addr add 10.1.1.1/24 dev p514p1``

For setting up a VxLAN tunnel, we added a `vxlan` type port to the `virbr`
bridge:

.. code-block:: bash

   # Host1
   $ sudo ovs-vsctl add-port virbr vxlan1 -- set interface vxlan1 \
       type=vxlan options:remote_ip=10.1.1.2 options:local_ip=10.1.1.1

.. code-block:: bash

   # Host2
   $ sudo ovs-vsctl add-port virbr vxlan1 -- set interface vxlan1 \
       type=vxlan options:remote_ip=10.1.1.1 options:local_ip=10.1.1.2

.. image:: network_diagram_VM-to-VM_VxLAN.png
   :width: 650px

Environment description
^^^^^^^^^^^^^^^^^^^^^^^

Hardware
++++++++

The environment consists of two hardware nodes with the following
configuration:

.. table::

   +-------------+-------+-----------------------------------------------------------------------------------------------------------------------+
   | Parameter   | Value | Comments                                                                                                              |
   +-------------+-------+-----------------------------------------------------------------------------------------------------------------------+
   | Server      |       | Supermicro SYS-5018R-WR, 1U, 1xCPU, 4/6 FAN, 4x 3.5" SAS/SATA Hotswap, 2x PS                                          |
   +-------------+-------+-----------------------------------------------------------------------------------------------------------------------+
   | Motherboard |       | Supermicro X10SRW-F, 1xCPU(LGA 2011), Intel C612, 8xDIMM Up To 512GB RAM, 10xSATA3, IPMI, 2xGbE LAN, sn: NM15BS004776 |
   +-------------+-------+-----------------------------------------------------------------------------------------------------------------------+
   | CPU         |       | Intel Xeon E5-2620v3, 2.4GHz, Socket 2011, 15MB Cache, 6 cores, 85W                                                   |
   +-------------+-------+-----------------------------------------------------------------------------------------------------------------------+
   | RAM         |       | 4x 16GB Samsung M393A2G40DB0-CPB DDR-IV PC4-2133P ECC Reg. CL13                                                       |
   +-------------+-------+-----------------------------------------------------------------------------------------------------------------------+
   | Storage     |       | HDD: 2x 1TB Seagate Constellation ES.3, ST1000NM0033, SATA 6.0Gb/s, 7200 RPM, 128MB Cache, 3.5"                       |
   +-------------+-------+-----------------------------------------------------------------------------------------------------------------------+
   | NIC         |       | AOC-STG-i4S, PCI Express 3.0, STD 4-port 10 Gigabit Ethernet SFP+ (`Intel XL710 controller`_)                         |
   +-------------+-------+-----------------------------------------------------------------------------------------------------------------------+

Software
++++++++

This section describes the installed software.

+--------------+-------+------------------+
| Parameter    | Value | Comment          |
+--------------+-------+------------------+
| OS           |       | Ubuntu 14.04     |
+--------------+-------+------------------+
| Kernel       |       | 4.2.0-27-generic |
+--------------+-------+------------------+
| QEMU         |       | 2.0.0            |
+--------------+-------+------------------+
| Libvirt      |       | 1.2.2            |
+--------------+-------+------------------+
| Open vSwitch |       | 2.0.2            |
+--------------+-------+------------------+
| Netperf      |       | 2.7.0            |
+--------------+-------+------------------+

Test Case 1: Baseline physical scenario
---------------------------------------

Description
^^^^^^^^^^^

This test measures network performance with hardware offloads on/off for two
hardware nodes when sending non-encapsulated traffic.

List of performance metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^

======== =============== ================= =====================================================
Priority Value           Measurement Units Description
======== =============== ================= =====================================================
1        TCP throughput  10^6 bits/sec     Average throughput for TCP traffic
1        UDP throughput  10^6 bits/sec     Average throughput for UDP traffic
1        CPU consumption %                 Average utilization of CPU used for packet processing
======== =============== ================= =====================================================

Test Case 2: Baseline physical over VxLAN scenario
--------------------------------------------------

Description
^^^^^^^^^^^

This test measures network performance with hardware offloads on/off for two
hardware nodes when sending VxLAN-encapsulated traffic.

List of performance metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^

======== =============== ================= =====================================================
Priority Value           Measurement Units Description
======== =============== ================= =====================================================
1        TCP throughput  10^6 bits/sec     Average throughput for TCP traffic
1        UDP throughput  10^6 bits/sec     Average throughput for UDP traffic
1        CPU consumption %                 Average utilization of CPU used for packet processing
======== =============== ================= =====================================================

Test Case 3: VM-to-VM on different nodes scenario
-------------------------------------------------

Description
^^^^^^^^^^^

This test assesses the performance impact of hardware offloads in
a deployment with two VMs running on separate hardware nodes. The
scenario measures VM-to-VM TCP/UDP traffic throughput as well as host
CPU consumption.

List of performance metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^

======== =============== ================= =====================================================
Priority Value           Measurement Units Description
======== =============== ================= =====================================================
1        TCP throughput  10^6 bits/sec     Average throughput for TCP traffic
1        UDP throughput  10^6 bits/sec     Average throughput for UDP traffic
1        CPU consumption %                 Average utilization of CPU used for packet processing
======== =============== ================= =====================================================

Test Case 4: VM-to-VM on different nodes over VxLAN scenario
------------------------------------------------------------

Description
^^^^^^^^^^^

This test assesses the performance impact of hardware offloads in
a deployment with two VMs running on separate hardware nodes. The
scenario measures VM-to-VM TCP/UDP traffic throughput as well as host
CPU consumption when VxLAN encapsulation is used.

List of performance metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^

======== =============== ================= =====================================================
Priority Value           Measurement Units Description
======== =============== ================= =====================================================
1        TCP throughput  10^6 bits/sec     Average throughput for TCP traffic
1        UDP throughput  10^6 bits/sec     Average throughput for UDP traffic
1        CPU consumption %                 Average utilization of CPU used for packet processing
======== =============== ================= =====================================================

.. _Intel XL710 controller: http://www.intel.com/content/www/us/en/embedded/products/networking/xl710-10-40-gbe-controller-brief.html
doc/source/test_plans/hardware_features/index.rst
@@ -0,0 +1,12 @@

.. raw:: pdf

   PageBreak oneColumn

============================
Hardware features test plans
============================

.. toctree::
   :maxdepth: 3

   hardware_offloads/test_plan

@@ -18,4 +18,5 @@ Test Plans

   keystone/plan
   container_cluster_systems/plan
   neutron_features/l3_ha/test_plan
   hardware_features/index
@@ -0,0 +1,197 @@

================================
Hardware Offloads - Test results
================================

Baseline physical scenario
--------------------------

In this scenario, hardware offloads provided a significant performance
gain in terms of CPU usage and throughput, both for the transmitter and
the receiver. Throughput was affected in tests with a small MTU
(1500), while CPU consumption changes were only visible with jumbo frames
enabled (MTU 9000).

- Enabling offloads on a transmitter with MTU 1500 gives a 60% throughput
  gain for TCP and 5% for UDP, with TSO and GSO giving the highest
  boost.

.. image:: throughput01.png
   :width: 650px

- Enabling offloads on a transmitter with MTU 9000 gives a 55% CPU
  performance gain for TCP and 11.6% for UDP, with Tx checksumming
  and GSO giving the highest boost.

.. image:: cpu_consumption01.png
   :width: 650px

- Enabling offloads on a receiver with MTU 1500 gives a 40% throughput
  gain for TCP and 11.7% for UDP, with GRO and Rx checksumming
  giving the highest boost respectively.

.. image:: throughput02.png
   :width: 650px

- Enabling offloads on a receiver with MTU 9000 gives a 44% CPU
  performance gain for TCP and 27.3% for UDP, with GRO and Rx
  checksumming giving the highest boost respectively.

.. image:: cpu_consumption02.png
   :width: 650px

Baseline physical over VxLAN scenario
-------------------------------------

As in the baseline physical scenario, hardware offloads introduced a
significant performance gain in terms of CPU usage and throughput, both
for the transmitter and the receiver. For TCP tests the effect was most
noticeable on the receiver side, while for UDP the most significant
improvements were achieved on the transmitter side.

- Enabling offloads on a transmitter with MTU 1500 gives a 23.3%
  throughput gain for TCP and 7.4% for UDP, with Tx checksumming and
  Tx UDP tunneling segmentation giving the highest boost.

.. image:: throughput03.png
   :width: 650px

- Enabling offloads on a transmitter with MTU 9000 gives a 25% CPU
  performance gain for TCP and 17.4% for UDP, with Tx checksumming
  giving the highest boost.

.. image:: cpu_consumption03.png
   :width: 650px

- Enabling offloads on a receiver with MTU 1500 gives a 66% throughput
  gain for TCP and 2.4% for UDP, with GRO giving the highest boost.

.. image:: throughput04.png
   :width: 650px

- Enabling offloads on a receiver with MTU 9000 gives a 48% CPU
  performance gain for TCP and 29% for UDP, with GRO and Rx
  checksumming giving the highest boost respectively.

.. image:: cpu_consumption04.png
   :width: 650px

VM-to-VM on different nodes scenario
------------------------------------

- Enabling Tx checksumming and TSO on the transmit side increases throughput
  by 44% for TCP and by 44.7% for UDP respectively.

.. image:: throughput05.png
   :width: 650px

- Enabling GRO on the receive side increases throughput by 59.7% for
  TCP and by 61.6% for UDP respectively.

.. image:: throughput06.png
   :width: 650px

- CPU performance improvement is mostly visible on the transmitter:
  turning offloads off causes total CPU consumption to rise by
  64.8% and 54.5% for the TCP and UDP stream tests respectively.

.. image:: cpu_consumption05.png
   :width: 650px

.. image:: cpu_consumption06.png
   :width: 650px

- Total CPU consumption on the receiver side in TCP stream tests
  does not change significantly, but turning offloads off leads to
  processes in user space taking more CPU time while system time
  decreases.

.. image:: cpu_consumption07.png
   :width: 650px

- In UDP stream tests, the user-space part of CPU consumption on the
  receiver drops by 3.5% with offloads on.

.. image:: cpu_consumption08.png
   :width: 650px

VM-to-VM on different nodes over VxLAN scenario
-----------------------------------------------

- Enabling Tx checksumming and TSO on the transmit side increases throughput
  by 35.4% for TCP and by 70% for UDP respectively.

.. image:: throughput07.png
   :width: 650px

- Enabling GRO on the receive side increases throughput by 26% for TCP
  and by 4% for UDP respectively.

.. image:: throughput08.png
   :width: 650px

- CPU performance improvement is mostly visible on the transmitter:
  turning offloads off causes total CPU consumption to rise by
  78.6% and 72.5% for the TCP and UDP stream tests respectively.

.. image:: cpu_consumption09.png
   :width: 650px

.. image:: cpu_consumption10.png
   :width: 650px

- Total CPU consumption on the receiver side in TCP stream tests
  does not change significantly, but turning offloads off leads to
  processes in kernel space taking more CPU time while user-space
  time decreases.

.. image:: cpu_consumption11.png
   :width: 650px

- In UDP stream tests, the user-space part of CPU consumption on the
  receiver drops by 7% with offloads on.

.. image:: cpu_consumption12.png
   :width: 650px

Summary
-------

Network hardware offloads provide a significant performance improvement in
terms of CPU usage and throughput on both the transmit and receive sides.
This impact is particularly strong in the case of VM-to-VM communication.

Based on the test results, the following recommendations on using offloads
to improve throughput and CPU performance can be made:

- To increase TCP throughput:

  - Enable TSO and GSO (tx-udp\_tnl-segmentation for VxLAN
    encapsulation) on the transmitter

  - Enable GRO on the receiver

- To increase UDP throughput:

  - Enable Tx checksumming on the transmitter

  - Enable GRO on the receiver

- To improve TCP CPU performance:

  - Enable TSO on the transmitter

  - Enable GRO on the receiver

- To improve UDP CPU performance:

  - Enable Tx checksumming on the transmitter

  - Enable Rx checksumming on the receiver
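
Mapped to `ethtool -K` feature names (`tx`, `rx`, `tso`, `gso`, `gro`, `tx-udp_tnl-segmentation`, following ethtool conventions), the recommendations could be applied as sketched below; the interface name is an example and the commands are only printed here:

```bash
# Sketch: recommended offload sets per role, printed as ethtool commands.
# Feature names follow ethtool conventions; the interface name is an example.
set -eo pipefail

offloads_for() {
    case "$1" in
        transmitter) echo "tx on tso on gso on tx-udp_tnl-segmentation on" ;;
        receiver)    echo "rx on gro on" ;;
        *)           return 1 ;;
    esac
}

tx_cmd="ethtool -K p514p2 $(offloads_for transmitter)"
rx_cmd="ethtool -K p514p2 $(offloads_for receiver)"
echo "$tx_cmd"
echo "$rx_cmd"
```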

Using `kernel 3.19 <http://kernelnewbies.org/Linux_3.19>`__ or
later, together with hardware capable of performing offloads for encapsulated
traffic, such as the `Intel XL710
controller <http://www.intel.com/content/www/us/en/embedded/products/networking/xl710-10-40-gbe-controller-brief.html>`__,
means that these improvements can also be seen in deployments that involve
VxLAN encapsulation.

BIN  doc/source/test_results/hardware_features/hardware_offloads/throughput01.png
BIN  doc/source/test_results/hardware_features/hardware_offloads/throughput02.png
BIN  doc/source/test_results/hardware_features/hardware_offloads/throughput03.png
BIN  doc/source/test_results/hardware_features/hardware_offloads/throughput04.png
BIN  doc/source/test_results/hardware_features/hardware_offloads/throughput05.png
BIN  doc/source/test_results/hardware_features/hardware_offloads/throughput06.png
BIN  doc/source/test_results/hardware_features/hardware_offloads/throughput07.png
BIN  doc/source/test_results/hardware_features/hardware_offloads/throughput08.png

doc/source/test_results/hardware_features/index.rst
@@ -0,0 +1,12 @@

.. raw:: pdf

   PageBreak oneColumn

=========================
Hardware features testing
=========================

.. toctree::
   :maxdepth: 3

   hardware_offloads/test_results

@@ -16,4 +16,5 @@ Test Results

   keystone/index
   container_platforms/index
   neutron_features/index
   hardware_features/index