ETCD health research: test plan
Change-Id: I893a9449ced14c888f08fdfe903371fecc889c25
parent 5c76b09ab9
commit 6816ebca3d
doc/source/test_plans/container_cluster_systems/etcd.rst
Normal file
@@ -0,0 +1,213 @@

.. _ETCD_health_tests:

==============================
ETCD health research test plan
==============================

:status: **ready**
:version: 1.0

:Abstract:

  This document is a test plan for ETCD health research. It describes a
  procedure for determining how the health of the ETCD cluster depends on
  the load applied to the Kubernetes cluster.

Test Plan
=========

The test results are obtained by collecting crucial system metrics that
ETCD and the Kubernetes API provide natively, comparing and normalizing
them, and plotting dependency graphs.

Test Environment
----------------

Preparation
^^^^^^^^^^^

1. The monitoring system must be set up and working, based on the
   `Monitoring`_ methodology documentation.
2. The K8S cluster should be deployed using `Kargo`_ on top of 430 nodes
   with Ubuntu Xenial preinstalled.
3. On one of the K8S masters, check that the following packages/tools are
   present and install any that are missing:

.. table:: Software to be installed

  +--------------+---------+-----------------------------------+
  | package name | version | source                            |
  +==============+=========+===================================+
  | `curl`_      | latest  | Ubuntu Xenial universe repository |
  +--------------+---------+-----------------------------------+
  | `jq`_        | latest  | Ubuntu Xenial universe repository |
  +--------------+---------+-----------------------------------+
  | `paste`_     | latest  | Ubuntu Xenial universe repository |
  +--------------+---------+-----------------------------------+
  | `MMM`_       | latest  | GitHub                            |
  +--------------+---------+-----------------------------------+
  | `Hoseproxy`_ | latest  | GitHub                            |
  +--------------+---------+-----------------------------------+

Environment description
^^^^^^^^^^^^^^^^^^^^^^^

Test results MUST include a description of the environment used. The following
items should be included:

- **Hardware configuration of each server.** If virtual machines are used then
  both physical and virtual hardware should be fully documented.
  An example format is given below:

.. table:: Description of server hardware

  +-------+----------------+-------+-------+
  |server |name            |       |       |
  |       +----------------+-------+-------+
  |       |role            |       |       |
  |       +----------------+-------+-------+
  |       |vendor,model    |       |       |
  |       +----------------+-------+-------+
  |       |operating_system|       |       |
  +-------+----------------+-------+-------+
  |CPU    |vendor,model    |       |       |
  |       +----------------+-------+-------+
  |       |processor_count |       |       |
  |       +----------------+-------+-------+
  |       |core_count      |       |       |
  |       +----------------+-------+-------+
  |       |frequency_MHz   |       |       |
  +-------+----------------+-------+-------+
  |RAM    |vendor,model    |       |       |
  |       +----------------+-------+-------+
  |       |amount_MB       |       |       |
  +-------+----------------+-------+-------+
  |NETWORK|interface_name  |       |       |
  |       +----------------+-------+-------+
  |       |vendor,model    |       |       |
  |       +----------------+-------+-------+
  |       |bandwidth       |       |       |
  +-------+----------------+-------+-------+
  |STORAGE|dev_name        |       |       |
  |       +----------------+-------+-------+
  |       |vendor,model    |       |       |
  |       +----------------+-------+-------+
  |       |SSD/HDD         |       |       |
  |       +----------------+-------+-------+
  |       |size            |       |       |
  +-------+----------------+-------+-------+

- **Configuration of hardware network switches.** The configuration file from
  the switch can be downloaded and attached.

- **Network scheme.** The plan should show how all hardware is connected and
  how the components communicate. All ethernet/fibrechannel and VLAN channels
  should be included. Each interface of every hardware component should be
  matched with the corresponding L2 channel and IP address.

Test Cases
----------

Description
^^^^^^^^^^^

Two specific cases should be conducted:

1. Load K8S with as many pods per node as possible.
   Stop when either K8S or ETCD degrades or existing limits are reached.

2. Load K8S with as many services as possible.
   Stop when either K8S or ETCD degrades or existing limits are reached.

List of performance metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Based on the `CoreOS ETCD`_ documentation, we collected a list of key metrics
that define the health state of the ETCD cluster:

.. table:: List of performance metrics

  +------------------------------------------------------------+------------------------------------------+
  | Metrics                                                    | Short description                        |
  +============================================================+==========================================+
  | process_resident_memory_bytes                              | Resident memory size in bytes.           |
  +------------------------------------------------------------+------------------------------------------+
  | etcd_debugging_snap_save_total_duration_seconds_bucket     | The total latency distributions of save  |
  |                                                            | called by snapshot.                      |
  +------------------------------------------------------------+------------------------------------------+
  | etcd_disk_backend_commit_duration_seconds_bucket           | The latency distributions of commit      |
  |                                                            | called by backend.                       |
  +------------------------------------------------------------+------------------------------------------+
  | etcd_http_failed_total                                     | Counter of handle failures of requests   |
  |                                                            | (non-watches), by method (GET/PUT etc.)  |
  |                                                            | and code (400, 500 etc.).                |
  +------------------------------------------------------------+------------------------------------------+
  | etcd_network_peer_(received|sent)_bytes_total              | The total number of bytes received/sent  |
  |                                                            | from/to peers.                           |
  +------------------------------------------------------------+------------------------------------------+
  | etcd_server_proposals_(pending|committed|applied|failed)   | Current number of proposals pending/     |
  |                                                            | committed/applied/failed.                |
  +------------------------------------------------------------+------------------------------------------+

K8S-side metrics only need to record the total number of pods/services in the
cluster at each moment of time within the testing period.

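The ``_bucket`` metrics in the table are Prometheus histogram buckets, so a
single latency value per moment of time has to be derived from them. One
common way, shown here as an assumption rather than as the method used in this
research, is the PromQL ``histogram_quantile`` function. The helper name
``quantile_query`` is hypothetical, and the quantile and rate window are
illustrative parameters.

.. code:: bash

  # Hypothetical helper: build a PromQL expression for a quantile of a
  # histogram metric. It only formats a string and contacts no server.
  # Usage: quantile_query <quantile> <bucket_metric> <rate_window>
  quantile_query() {
      printf 'histogram_quantile(%s, sum(rate(%s[%s])) by (le))' "$1" "$2" "$3"
  }

For example, ``quantile_query 0.99 etcd_disk_backend_commit_duration_seconds_bucket 5m``
prints an expression for the 99th percentile of backend commit latency, which
can be passed as the ``query`` parameter of a Prometheus ``query_range``
request.
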
Collecting metrics
^^^^^^^^^^^^^^^^^^

Each required metric can be gathered through the `Prometheus API`_, using
curl to send the query and jq to extract the JSON objects and strip off
extra data. For example, suppose we need the values of ``<metric_a>`` for
the period starting at ``<start>`` and finishing at ``<end>`` with a time
step of ``<step>``, and the Prometheus IP address is
``<prometheus_server>``. The resulting query looks like:

.. code:: bash

  curl -q 'http://<prometheus_server>/api/v1/query_range?query=<metric_a>&start=<start>&end=<end>&step=<step>'

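The curl and jq steps can be combined in a small shell helper. The following
is a minimal sketch, not the exact procedure used for this research: the
function names ``query_range_url`` and ``fetch_metric`` are hypothetical, and
it assumes that ``<prometheus_server>`` includes the port if one is needed and
that only the first returned series is of interest. Parameters are passed with
``curl -G --data-urlencode`` so that queries containing special characters are
encoded, and jq reduces the JSON response to ``timestamp,value`` lines.

.. code:: bash

  # Hypothetical helper: print the query_range endpoint of a Prometheus server.
  query_range_url() {
      printf 'http://%s/api/v1/query_range' "$1"
  }

  # Hypothetical helper: fetch one metric and print "timestamp,value" lines.
  # Only the first series of the response is used (an assumption); an empty
  # result produces no output.
  # Usage: fetch_metric <prometheus_server> <query> <start> <end> <step>
  fetch_metric() {
      curl -s -G "$(query_range_url "$1")" \
          --data-urlencode "query=$2" \
          --data-urlencode "start=$3" \
          --data-urlencode "end=$4" \
          --data-urlencode "step=$5" |
          jq -r '(.data.result[0].values // [])[] | "\(.[0]),\(.[1])"'
  }

For example, ``fetch_metric <prometheus_server> process_resident_memory_bytes
<start> <end> <step> > memory.csv`` would save one series to a file for later
merging.
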
Plotting 'K8S vs ETCD dependency'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After obtaining the metrics for each case, we need to make plots showing the
dependency between the number of K8S pods/services and each corresponding
metric. It is convenient to merge the collected metrics into two CSV files
(one per case) so that the plots can be made easily with third-party tools
such as `Google sheets`_ or `Plotly`_.

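The merge itself can be done with ``paste``. The sketch below assumes, for
illustration only, that each series was saved as ``timestamp,value`` lines and
that all series were queried with the same ``<start>``, ``<end>`` and
``<step>``, so that rows line up. The function name ``merge_series`` and the
column names in the header are hypothetical.

.. code:: bash

  # Hypothetical helper: join a K8S series and an ETCD series into one CSV.
  # Usage: merge_series <k8s_series_file> <etcd_series_file>
  merge_series() {
      echo 'timestamp,k8s_items,etcd_metric'
      # Keep only the value column of the ETCD series and append it to each
      # row of the K8S series. Rows are matched by position, not by timestamp.
      cut -d, -f2 "$2" | paste -d, "$1" -
  }

Because rows are matched by position, series with different lengths or time
steps should be re-queried with identical parameters before merging.
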
Reports
=======

Resulting report page:

* :ref:`Results_of_the_ETCD_health_tests`

.. references:

.. _Kargo: https://github.com/kubernetes-incubator/kargo.git
.. _Monitoring: https://docs.openstack.org/developer/performance-docs/methodologies/monitoring/index.html
.. _curl: https://curl.haxx.se/
.. _jq: https://stedolan.github.io/jq/
.. _paste: https://linux.die.net/man/1/paste
.. _MMM: https://github.com/AleksandrNull/MMM
.. _Hoseproxy: https://github.com/ivan4th/hoseproxy
.. _CoreOS ETCD: https://coreos.com/etcd/docs/latest/metrics.html
.. _Prometheus API: https://prometheus.io/docs/querying/api/
.. _Google sheets: https://docs.google.com/spreadsheets/
.. _Plotly: https://plot.ly/

@@ -17,6 +17,7 @@ Contents
 kargo_deploy_performance
 performance_and_scaling
 API_latency
+etcd

 .. raw:: pdf

@@ -8,7 +8,7 @@ Results of the ETCD health tests
 :Abstract:

 This piece of art includes the results of the ETCD tests made
-basing on the _ETCD_health_tests plan.
+basing on the :ref:`ETCD_health_tests`.
 Our goal was to research how many Kubernetes items (pods and services)
 could be spawned in terms of ETCD. We figured out which ETCD metrics
 are crucial and collected them under appropriate (pods or services)