From 6881008de0f0361549f359bd13db6ec62041004b Mon Sep 17 00:00:00 2001
From: Clark Boylan
Date: Fri, 31 May 2013 11:46:29 -0700
Subject: [PATCH] Add logstash documentation.

* doc/source/logstash.rst: Add documentation on our Logstash system
  architecture and how to query logstash.

Change-Id: I9da3e6d6391081131d1fd852230ddac6326c01a2
Reviewed-on: https://review.openstack.org/31257
Reviewed-by: James E. Blair
Reviewed-by: Elizabeth Krumbach Joseph
Approved: Jeremy Stanley
Reviewed-by: Jeremy Stanley
Tested-by: Jenkins
---
 doc/source/logstash.rst | 188 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 185 insertions(+), 3 deletions(-)

diff --git a/doc/source/logstash.rst b/doc/source/logstash.rst
index e6f466ded0..b48ee8b108 100644
--- a/doc/source/logstash.rst
+++ b/doc/source/logstash.rst
@@ -12,7 +12,7 @@ At a Glance
 
 :Hosts:
   * http://logstash.openstack.org
-  * logstash-worker-\*.openstack.org
+  * logstash-worker\*.openstack.org
   * elasticsearch.openstack.org
 :Puppet:
   * :file:`modules/logstash`
@@ -21,13 +21,16 @@
   * :file:`modules/openstack_project/manifests/elasticsearch.pp`
 :Configuration:
   * :file:`modules/openstack_project/files/logstash`
+  * :file:`modules/openstack_project/templates/logstash`
 :Projects:
   * http://logstash.net/
   * http://kibana.org/
+  * http://www.elasticsearch.org/
 :Bugs:
   * http://bugs.launchpad.net/openstack-ci
   * https://logstash.jira.com/secure/Dashboard.jspa
   * https://github.com/rashidkpc/Kibana/issues
+  * https://github.com/elasticsearch/elasticsearch/issues
 
 Overview
 ========
@@ -38,7 +41,186 @@
 sources in a single test run, searching for errors or particular events
 within a test run, as well as searching for log event trends across test
 runs.
 
-TODO(clarkb): more details about system architecture
-TODO(clarkb): useful queries

System Architecture
===================

There are four major layers in our Logstash setup:

1. Log Pusher Script.
   Subscribes to the Jenkins ZeroMQ Event Publisher, listening for build
   finished events. When a build finishes this script fetches the logs
   generated by that build, chops them up, annotates them with Jenkins
   build info, and finally sends them to a Logstash indexer process.
2. Logstash Indexer.
   Reads these log events from the log pusher, filters them to remove
   unwanted lines, collapses multiline events together, and parses
   useful information out of the events before shipping them to
   ElasticSearch for storage and indexing.
3. ElasticSearch.
   Provides log storage, indexing, and search.
4. Kibana.
   A Logstash oriented web client for ElasticSearch. You can perform
   queries on your Logstash logs in ElasticSearch through Kibana using
   the Lucene query language.

Each layer scales horizontally. As the number of logs grows we can add
more log pushers, more Logstash indexers, and more ElasticSearch nodes.
Currently we have multiple Logstash worker nodes that each pair a log
pusher with a Logstash indexer. We did this because each Logstash process
can only dedicate a single thread to filtering log events, which becomes
a bottleneck very quickly. This looks something like:

::

               _ logstash-worker1 _
              /                    \
    jenkins -- logstash-worker2 -- elasticsearch -- kibana
              \_                  _/
                 logstash-worker3

Log Pusher
----------

This is a simple Python script that is given a list of log files to push
to Logstash when Jenkins builds complete.

Log pushing looks like this:

* Jenkins publishes build complete notifications.
* The log pusher receives the notification from Jenkins.
* Using info in the notification, the log files are retrieved.
* The log files are processed, then shipped to Logstash.

In the near future this script will be modified to act as a Gearman
worker so that we can add an arbitrary number of them without needing
to partition the log files that each worker handles by hand. Instead,
each worker will be able to fetch and push any log file and will do
so as directed by Gearman.

If you are interested in the technical details, the source of this script
can be found at
:file:`modules/openstack_project/files/logstash/log-pusher.py`.
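For a concrete picture of the first steps, a minimal sketch of subscribing
to the Jenkins ZeroMQ Event Publisher with pyzmq might look like the
following. The endpoint, subscription topic, and JSON field names here are
illustrative assumptions, not the script's actual configuration; the
authoritative version is the log-pusher.py file referenced above.

::

    # Illustrative sketch only; see log-pusher.py for the real implementation.
    import json

    import zmq

    context = zmq.Context()
    subscriber = context.socket(zmq.SUB)
    # Hypothetical endpoint; the actual Jenkins host and port come from our
    # puppet configuration.
    subscriber.connect("tcp://jenkins.example.org:8888")
    # Assumes the event publisher prefixes each message with the build phase,
    # e.g. "onFinalized {...json...}".
    subscriber.setsockopt(zmq.SUBSCRIBE, b"onFinalized")

    while True:
        phase, _, payload = subscriber.recv().partition(b" ")
        event = json.loads(payload)
        # The exact notification fields are assumptions here.
        build = event.get("build", {})
        print(event.get("name"), build.get("number"), build.get("status"))
        # A real pusher would now fetch this build's log files, annotate
        # them with the metadata above, and ship them to Logstash.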
Logstash
--------

Logstash does the heavy lifting of squashing all of our log lines into
events with a common format. It reads the JSON log events from the log
pusher connected to it, deletes events we don't want, parses log lines
to set the timestamp, message, and other fields for the event, then
ships these processed events off to ElasticSearch where they are stored
and made queryable.

At a high level Logstash takes:

::

    {
      "fields": {
        "build_name": "gate-foo",
        "build_number": "10",
        "event_message": "2013-05-31T17:31:39.113 DEBUG Something happened"
      }
    }

And turns that into:

::

    {
      "fields": {
        "build_name": "gate-foo",
        "build_number": "10",
        "loglevel": "DEBUG"
      },
      "@message": "Something happened",
      "@timestamp": "2013-05-31T17:31:39.113Z"
    }

It flattens each log line into something that looks very much like all
of the other events regardless of the source log line format. This makes
it very easy to query your logs for lines from a specific build, between
two timestamps, with specific message content. You don't need to write
complicated greps; instead you query against a schema.

The config file that tells Logstash how to do this flattening can be
found at
:file:`modules/openstack_project/templates/logstash/indexer.conf.erb`.
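The transformation itself is expressed in Logstash's own filter
configuration rather than in Python, but as a rough illustration of the
idea, flattening a single event could look something like the sketch
below. The regular expression and field handling are simplified
assumptions, not the production grok patterns from indexer.conf.erb.

::

    # Rough Python illustration of the flattening described above; the real
    # behaviour is defined by the Logstash filters in indexer.conf.erb.
    import re

    LOG_LINE = re.compile(
        r"^(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)\s+"
        r"(?P<loglevel>[A-Z]+)\s+(?P<message>.*)$")

    def flatten(event):
        """Turn a raw pusher event into a Logstash-style flattened event."""
        match = LOG_LINE.match(event["fields"]["event_message"])
        if match is None:
            return None  # unparseable lines would be dropped or tagged
        return {
            "fields": {
                "build_name": event["fields"]["build_name"],
                "build_number": event["fields"]["build_number"],
                "loglevel": match.group("loglevel"),
            },
            "@message": match.group("message"),
            "@timestamp": match.group("timestamp") + "Z",
        }

    raw = {
        "fields": {
            "build_name": "gate-foo",
            "build_number": "10",
            "event_message": "2013-05-31T17:31:39.113 DEBUG Something happened",
        },
    }
    print(flatten(raw))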
ElasticSearch
-------------

ElasticSearch is basically a REST API layer for Lucene. It provides the
storage and search engine for Logstash. It scales horizontally and loves
it when you give it more memory. Currently we run a single node cluster
on a large VM to give ElasticSearch both memory and disk space. Per index
(Logstash creates one index per day) we have one replica (kept on the
same node; this does not provide HA, but it does speed up searches) and
five shards (each shard is basically its own index; having multiple
shards increases indexing throughput).

As this setup grows and handles more logs we may need to add more
ElasticSearch nodes and run a proper cluster. We haven't reached that
point yet, but it will probably be necessary as disk and memory
footprints increase.

Kibana
------

Kibana is a Ruby app sitting behind Apache that provides a nice web UI
for querying Logstash events stored in ElasticSearch. Our install can be
reached at http://logstash.openstack.org. See :ref:`query-logstash` for
more info on using Kibana to perform queries.

.. _query-logstash:

Querying Logstash
=================

Hop on over to http://logstash.openstack.org and by default you get the
last 15 minutes of everything Logstash knows about in chunks of 100. We
run a lot of tests, but it is possible no logs have come in over the last
15 minutes; change the dropdown in the top left from ``Last 15m`` to
``Last 60m`` to get a better window on the logs. At this point you should
see a list of logs. If you click on a log event it will expand and show
you all of the fields associated with that event and their values (note:
Chromium and Kibana seem to have trouble with this at times and some
fields end up without values; use Firefox if this happens). You can
search based on all of these fields, and if you click the magnifying
glass next to a field in the expanded event view it will add that field
and value to your search. This is a good way of refining searches without
a lot of typing.

The above is good info for poking around in the Logstash logs, but say
one of your changes has a failing test and you want to know why. We can
jumpstart the refining process with a simple query.

``@fields.build_change:"$FAILING_CHANGE" AND @fields.build_patchset:"$FAILING_PATCHSET" AND @fields.build_name:"$FAILING_BUILD_NAME" AND @fields.build_number:"$FAILING_BUILD_NUMBER"``

This will show you all logs available from the patchset and build pair
that failed. Chances are that this is still a significant number of logs
and you will want to do more filtering. You can add more filters to the
query using ``AND`` and ``OR``, and parentheses can be used to group
sections of the query. Potential additions to the above query might be:

* ``AND @fields.filename:"logs/syslog.txt"`` to get syslog events.
* ``AND @fields.filename:"logs/screen-n-api.txt"`` to get Nova API events.
* ``AND @fields.loglevel:"ERROR"`` to get ERROR level events.
* ``AND @message:"error"`` to get events with "error" in their message,
  and so on.

General query tips:

* Don't search ``All time``. ElasticSearch is bad at trying to find all
  the things it ever knew about. Give it a window of time to look
  through. You can use the presets in the dropdown to select a window or
  use the ``foo`` to ``bar`` boxes above the frequency graph.
* Only the @message field can have fuzzy searches performed on it. Other
  fields require specific information.
* This system is growing fast and may not always keep up with the load.
  Be patient. If expected logs do not show up immediately after the
  Jenkins job completes, wait a few minutes.
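Kibana ultimately just sends Lucene queries like the ones above to
ElasticSearch, so the same syntax works against the ElasticSearch search
API directly if you have access to a node. Below is a minimal sketch
using the Python ``requests`` library, assuming the default ElasticSearch
port (9200) and Logstash's daily index naming; the host shown is an
assumption and the cluster may not be reachable from outside our own
hosts.

::

    # Illustrative sketch; host, index name, and reachability are assumptions.
    import requests

    query = ('@fields.build_name:"gate-foo" '
             'AND @fields.loglevel:"ERROR"')

    response = requests.get(
        "http://elasticsearch.openstack.org:9200/logstash-2013.05.31/_search",
        params={"q": query, "size": 10})
    response.raise_for_status()

    for hit in response.json()["hits"]["hits"]:
        source = hit["_source"]
        print(source["@timestamp"], source["@message"])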