browbeat/doc/source/usage.rst
Joe Talerico 21900bffad Query Elastic to compare software-metadata
Method to query ElasticSearch for a specific set of browbeat_uuids and
compare the metadata to determine if there are differences.

This work will also tell the user if a option or value is missing.

Eventually, I would like to see us query Elastic for collectd data to
see if there has been CPU/Memory/DiskIO increases during a specific
Browbeat run -- this is a longer-term goal.

Example of this :
https://gist.github.com/jtaleric/ffc1508eba3cba9515ca24cfcf23583c

Change-Id: Ie65e2c3d505aa2f19ba10109276ba982ee4ab67b
2017-06-22 10:13:15 -04:00

282 lines
14 KiB
ReStructuredText

========
Usage
========
Run Overcloud checks
--------------------
::
$ ansible-playbook -i hosts check/site.yml
Your Overcloud check output is located in results/bug_report.log
NOTE: It is strongly advised to not run the ansible playbooks in a venv.
Run performance stress tests through Browbeat on the undercloud:
----------------------------------------------------------------
::
$ ssh undercloud-root
[root@ospd ~]# su - stack
[stack@ospd ~]$ screen -S browbeat
[stack@ospd ~]$ . browbeat-venv/bin/activate
(browbeat-venv)[stack@ospd ~]$ cd browbeat/
(browbeat-venv)[stack@ospd browbeat]$ vi browbeat-config.yaml # Edit browbeat-config.yaml to control how many stress tests are run.
(browbeat-venv)[stack@ospd browbeat]$ ./browbeat.py <workload> #perfkit, rally, shaker or "all"
Run performance stress tests through Browbeat
---------------------------------------------
::
[stack@ospd ansible]$ . ../../browbeat-venv/bin/activate
(browbeat-venv)[stack@ospd ansible]$ cd ..
(browbeat-venv)[stack@ospd browbeat]$ vi browbeat-config.yaml # Edit browbeat.cfg to control how many stress tests are run.
(browbeat-venv)[stack@ospd browbeat]$ ./browbeat.py <workload> #perfkit, rally, shaker or "all"
Running PerfKitBenchmarker
---------------------------
Work is on-going to utilize PerfKitBenchmarker as a workload provider to
Browbeat. Many benchmarks work out of the box with Browbeat. You must
ensure that your network is setup correctly to run those benchmarks and
you will need to configure the settings in
ansible/install/group_vars/all.yml for Browbeat public/private
networks. Currently tested benchmarks include: aerospike, bonnie++,
cluster_boot, copy_throughput(cp,dd,scp), fio, iperf, mesh_network,
mongodb_ycsb, netperf, object_storage_service, ping, scimark2, and
sysbench_oltp.
To run Browbeat's PerfKit Benchmarks, you can start by viewing the
tested benchmark's configuration in conf/browbeat-perfkit-complete.yaml.
You must add them to your specific Browbeat config yaml file or
enable/disable the benchmarks you wish to run in the default config file
(browbeat-config.yaml). There are many flags exposed in the
configuration files to tune how those benchmarks run. Additional flags
are exposed in the source code of PerfKitBenchmarker available on the
Google Cloud Github_.
.. _Github: https://github.com/GoogleCloudPlatform/PerfKitBenchmarker
Example running only PerfKitBenchmarker benchmarks with Browbeat from
browbeat-config.yaml:
::
(browbeat-venv)[stack@ospd browbeat]$ ./browbeat.py perfkit -s browbeat-config.yaml
Running Shaker
---------------
Running Shaker requires the shaker image to be built, which in turn requires
instances to be able to access the internet. The playbooks for this installation
have been described in the installation documentation but for the sake of
convenience they are being mentioned here as well.
::
$ ansible-playbook -i hosts install/browbeat_network.yml
$ ansible-playbook -i hosts install/shaker_build.yml
.. note:: The playbook to setup networking is provided as an example only and
might not work for you based on your underlay/overlay network setup. In such
cases, the exercise of setting up networking for instances to be able to access
the internet is left to the user.
Once the shaker image is built, you can run Shaker via Browbeat by filling in a
few configuration options in the configuration file. The meaning of each option is
summarized below:
**shaker:**
:enabled: Boolean ``true`` or ``false``, enable shaker or not
:server: IP address of the shaker-server for agent to talk to (undercloud IP
by default)
:port: Port to connect to the shaker-server (undercloud port 5555 by default)
:flavor: OpenStack instance flavor you want to use
:join_timeout: Timeout in seconds for agents to join
:sleep_before: Time in seconds to sleep before executing a scenario
:sleep_after: Time in seconds to sleep after executing a scenario
:venv: venv to execute shaker commands in, ``/home/stack/shaker-venv`` by
default
:shaker_region: OpenStack region you want to use
:external_host: IP of a server for external tests (should have
``browbeat/util/shaker-external.sh`` executed on it previously and have
iptables/firewalld/selinux allowing connections on the ports used by network
testing tools netperf and iperf)
**scenarios:** List of scenarios you want to run
:\- name: Name for the scenario. It is used to create directories/files
accordingly
:enabled: Boolean ``true`` or ``false`` depending on whether or not you
want to execute the scenario
:density: Number of instances
:compute: Number of compute nodes across which to spawn instances
:placement: ``single_room`` would mean one instance per compute node and
``double_room`` would give you two instances per compute node
:progression: ``null`` means all agents are involved, ``linear`` means
execution starts with one agent and increases linearly, ``quadratic``
would result in quadratic growth in number of agents participating
in the test concurrently
:time: Time in seconds you want each test in the scenario
file to run
:file: The base shaker scenario file to use to override
options (this would depend on whether you want to run L2, L3 E-W or L3
N-S tests and also on the class of tool you want to use such as flent or
iperf3)
To analyze results sent to Elasticsearch (you must have Elasticsearch enabled
and the IP of the Elasticsearch host provided in the browbeat configuration
file), you can use the following playbook to setup some prebuilt dashboards for
you:
::
$ ansible-playbook -i hosts install/kibana-visuals.yml
Alternatively you can create your own visualizations of specific shaker runs
using some simple searches such as:
::
shaker_uuid: 97092334-34e8-446c-87d6-6a0f361b9aa8 AND record.concurrency: 1 AND result.result_type: bandwidth
shaker_uuid: c918a263-3b0b-409b-8cf8-22dfaeeaf33e AND record.concurrency:1 AND record.test:Bi-Directional
Running YODA
============
YODA (Yet Openstack Deployment tool, Another) is a workload integrated into
Browbeat for benchmarking TripleO deployment. This includes importing baremetal
nodes, running introspections and overcloud deployements of various kinds. Note
that YODA assumes it is on the undercloud of a TripleO instance post undercloud
installation and introspection.
Configuration
-------------
For examples of the configuration see `browbeat-complete.yaml` in the repo root directory.
Additional configuration documentation can be found below for each subworkload of YODA.
Overcloud
~~~~~~~~~
For overcloud workloads, note that the nodes dictionary is dynamic, so you don't
have to define types you aren't using, this is done in the demonstration
configurations for the sake of completeness. Furthermore the node name is taken
from the name of the field, meaning custom role names should work fine there.
The step parameter decides how many nodes can be distributed between the various
types to get from start scale to end scale, if these are the same it won't
matter. But if they are different up to that many nodes will be distributed to
the different node types (in no particular order) before the next deploy is
performed. The step rule is violated if and only if it is required to keep the
deployment viable, for example if the step dictates that 2 control nodes be
deployed it will skip to 3 even if it violates step.
YODA has basic support for custom templates and more advanced roles, configure the
`templates:` paramater in the overcloud benchmark section with a string for
template paths.
templates: "-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml"
Note that `--templates` is passed to the `overcloud deploy` command before this,
then nodes sizes, ntp server and timeout are passed after, so your templates
will override the defaults, but not scale, timeout, or ntp settings from the
YODA config. If you want to use scheduling hints for your overcloud deploy you
will need to pip install [ostag](https://github.com/jkilpatr/ostag) and set
`node_pinning: True` in your config file. Ostag will be used before every deploy
to clean all tags and tag the appropriate nodes. If you set `node_pinning: False`
tags will be cleaned before the deploy. If you need more advanced features view
the ostag readme for how to tag based on node properties. If you don't want YODA
to edit your node properties, don't define `node_pinning` in your configuration.
Introspection
~~~~~~~~~~~~~
Introspection workloads have two modes, batch and individual, the batch workload
follows the documentation exactly, nodes are imported, then bulk introspection
is run. Individual introspection has it's own custom batch size and handles
failures more gracefully (individual instead of group retries). Both have a
timeout configured in seconds and record the amount of time required for each
node to pxe and the number of failures.
`timeout` is how long we wait for the node to come back from introspection this is
hardware variable. Although the default 900 seconds has been shown to be the 99th
percentile for success across at least two stes of hardware. Adjust as required.
Note that `batch_size` can not produce a batch of unintrospected ndoes if none exist
so the last batch may be below the maximum size. When nodes in a batch fail the `failure_count`
is incremented and the nodes are returned to the pool. So it's possible that same node will
fail again in another batch. There is a saftey mechanism that will kill Yoda if a node exceeds
10 retries as that's pretty much garunteed to be misconfigured. For bulk introspection all nodes
are tried once and what you get is what you get.
If you wish to change the introspection workload failure threshold of 10% you can
set `max_fail_amnt` to any floating point value you desire.
I would suggest bulk introspection for testing documented TripleO workflows and
individual introspection to test the performance of introspection itself.
Interpreting Browbeat Results
------------------------------
By default results for each test will be placed in a timestamped folder `results/` inside your Browbeat folder.
Each run folder will contain output files from the various workloads and benchmarks that ran during that Browbeat
run, as well as a report card that summarizes the results of the tests.
Browbeat for the most part tries to restrict itself to running tests, it will only exit with a nonzero return code
if a workload failed to run. If, for example, Rally where to run but not be able to boot any instances on your cloud
Browbeat would return with RC 0 without any complaints, only by looking into the Rally results for that Browbeat run
would you determine that your cloud had a problem that made benchmarking it impossible.
Likewise if Rally manages to run at a snails pace, Browbeat will still exit without complaint. Be aware of this when
running Browbeat and take the time to either view the contents of the results folder after a run. Or setup Elasticsearch
and Kibana to view them more easily.
Working with Multiple Clouds
-----------------------------
If you are running playbooks from your local machine you can run against more
than one cloud at the same time. To do this, you should create a directory
per-cloud and clone Browbeat into that specific directory:
::
[browbeat@laptop ~]$ mkdir cloud01; cd cloud01
[browbeat@laptop cloud01]$ git clone git@github.com:openstack/browbeat.git
...
[browbeat@laptop cloud01]$ cd browbeat/ansible
[browbeat@laptop ansible]$ ./generate_tripleo_hostfile.sh -t <cloud01-ip-address>
[browbeat@laptop ansible]$ ansible-playbook -i hosts (Your playbook you wish to run...)
[browbeat@laptop ansible]$ ssh -F ssh-config overcloud-controller-0 # Takes you to first controller
Repeat the above steps for as many clouds as you have to run playbooks against your clouds.
Compare software-metadata from two different runs
--------------------------------------------------
Browbeat's metadata is great to help build visuals in Kibana by querying on specific metadata fields, but sometimes
we need to see what the difference between two builds might be. Kibana doesn't have a good way to show this, so we
added an option to Browbeat CLI to query ElasticSearch.
To use :
::
$ python browbeat.py --compare software-metadata --uuid "browbeat-uuid-1" "browbeat-uuid-2"
Real world use-case, we had two builds in our CI that used the exact same DLRN hash, however the later build had a
10x performance hit for two Neutron operations, router-create and add-interface-to-router. Given we had exactly the
same DLRN hash, the only difference could be how things were configured. Using this new code, we could quickly identify
the difference -- TripleO enabled l3_ha.
::
[rocketship:browbeat] jtaleric:browbeat$ python browbeat.py --compare software-metadata --uuid "3fc2f149-7091-4e16-855a-60738849af17" "6738eed7-c8dd-4747-abde-47c996975a57"
2017-05-25 02:34:47,230 - browbeat.Tools - INFO - Validating the configuration file passed by the user
2017-05-25 02:34:47,311 - browbeat.Tools - INFO - Validation successful
2017-05-25 02:34:47,311 - browbeat.Elastic - INFO - Querying Elastic : index [_all] : role [controller] : uuid [3fc2f149-7091-4e16-855a-60738849af17]
2017-05-25 02:34:55,684 - browbeat.Elastic - INFO - Querying Elastic : index [_all] : role [controller] : uuid [6738eed7-c8dd-4747-abde-47c996975a57]
2017-05-25 02:35:01,165 - browbeat.Elastic - INFO - Difference found : Host [overcloud-controller-2] Service [neutron] l3_ha [False]
2017-05-25 02:35:01,168 - browbeat.Elastic - INFO - Difference found : Host [overcloud-controller-1] Service [neutron] l3_ha [False]
2017-05-25 02:35:01,172 - browbeat.Elastic - INFO - Difference found : Host [overcloud-controller-0] Service [neutron] l3_ha [False]