system-config/doc/source/logstash.rst
Clark Boylan 4aba312807 Use gearman to distribute logstash log pushing.
This change reorgs the logstash log pushing so that there is a central
gearman server that listens to Jenkins ZMQ events which are then
converted to per log file gearman jobs which are processed by gearman
workers. The central gearman server will live on logstash.o.o and the
existing logstash-worker hosts will be converted to gearman log pusher
workers.

This commit includes relavent documentation changes.

Change-Id: I45f7185c2479c54b090d223408dff268e1e8d7db
Reviewed-on: https://review.openstack.org/32455
Reviewed-by: Jeremy Stanley <fungi@yuggoth.org>
Approved: Clark Boylan <clark.boylan@gmail.com>
Reviewed-by: Clark Boylan <clark.boylan@gmail.com>
Tested-by: Jenkins
2013-06-12 21:45:42 +00:00

243 lines
8.8 KiB
ReStructuredText

:title: Logstash
.. _logstash:
Logstash
########
Logstash is a high-performance indexing and search engine for logs.
At a Glance
===========
:Hosts:
* http://logstash.openstack.org
* logstash-worker\*.openstack.org
* elasticsearch.openstack.org
:Puppet:
* :file:`modules/logstash`
* :file:`modules/openstack_project/manifests/logstash.pp`
* :file:`modules/openstack_project/manifests/logstash_worker.pp`
* :file:`modules/openstack_project/manifests/elasticsearch.pp`
:Configuration:
* :file:`modules/openstack_project/files/logstash`
* :file:`modules/openstack_project/templates/logstash`
:Projects:
* http://logstash.net/
* http://kibana.org/
* http://www.elasticsearch.org/
:Bugs:
* http://bugs.launchpad.net/openstack-ci
* https://logstash.jira.com/secure/Dashboard.jspa
* https://github.com/rashidkpc/Kibana/issues
* https://github.com/elasticsearch/elasticsearch/issues
Overview
========
Logs from Jenkins test runs are sent to logstash where they are
indexed and stored. Logstash facilitates reviewing logs from mulitple
sources in a single test run, searching for errors or particular
events within a test run, as well as searching for log event trends
across test runs.
System Architecture
===================
There are four major layers in our Logstash setup.
1. Log Pusher Scripts.
Subscribes to the Jenkins ZeroMQ Event Publisher listening for build
finished events. When a build finishes an event is received from
Jenkins which is then converted into Gearman jobs specific to that
event for each log file we care about. These jobs trigger Gearman
workers that then fetch the logs generated by that build, chop them
up, annotate them with Jenkins build info and finally sends them to a
Logstash indexer process.
2. Logstash Indexer.
Reads these log events from the log pusher, filters them to remove
unwanted lines, collapses multiline events together, and parses
useful information out of the events before shipping them to
ElasticSearch for storage and indexing.
3. ElasticSearch.
Provides log storage, indexing, and search.
4. Kibana.
A Logstash oriented web client for ElasticSearch. You can perform
queries on your Logstash logs in ElasticSearch through Kibana using
the Lucene query language.
Each layer scales horizontally. As the number of logs grows we can add
more log pushers, more Logstash indexers, and more ElasticSearch nodes.
Currently we have multiple Logstash worker nodes that pair a log pusher
with a Logstash indexer. We did this as each Logstash process can only
dedicate a single thread to filtering log events which turns into a
bottleneck very quickly. This looks something like:
::
jenkins
|
|
gearman-client
/ | \
/ | \
gearman gearman gearman
worker1 worker2 worker3
| | |
logstash logstash logstash
indexer1 indexer2 indexer3
\ | /
\ | /
elasticsearch
|
|
kibana
Log Pusher
----------
This is a pair of simple Python scripts. The first listens to Jenkins
build events and converts them into Gearman jobs and the second performs
Gearman jobs to push log files into logstash.
Log pushing looks like this:
* Jenkins publishes build complete notifications.
* Receive notification from Jenkins and convert to Gearman jobs.
* Using info in the Gearman job log files are retrieved.
* Log files are processed then shipped to Logstash.
Using Gearman allows us to scale the number of log pushers
horizontally. It is as simple as adding another process that talks to
the Gearman server.
If you are interested in technical details the source of these scripts
can be found at
* :file:`modules/openstack_project/files/logstash/log-gearman-client.py`
* :file:`modules/openstack_project/files/logstash/log-gearman-worker.py`
Logstash
--------
Logstash does the heavy lifting of squashing all of our log lines into
events with a common format. It reads the JSON log events from the log
pusher connected to it, deletes events we don't want, parses log lines
to set the timestamp, message, and other fields for the event, then
ships these processed events off to ElasticSearch where they are stored
and made queryable.
At a high level Logstash takes:
::
{
"fields" {
"build_name": "gate-foo",
"build_numer": "10",
"event_message": "2013-05-31T17:31:39.113 DEBUG Something happened",
},
}
And turns that into:
::
{
"fields" {
"build_name": "gate-foo",
"build_numer": "10",
"loglevel": "DEBUG"
},
"@message": "Something happened",
"@timestamp": "2013-05-31T17:31:39.113Z",
}
It flattens each log line into something that looks very much like
all of the other events regardless of the source log line format. This
makes querying your logs for lines from a specific build that failed
between two timestamps with specific message content very easy. You
don't need to write complicated greps instead you query against a
schema.
The config file that tells Logstash how to do this flattening can be
found at
:file:`modules/openstack_project/templates/logstash/indexer.conf.erb`
ElasticSearch
-------------
ElasticSearch is basically a REST API layer for Lucene. It provides
the storage and search engine for Logstash. It scales horizontally and
loves it when you give it more memory. Currently we run a single node
cluster on a large VM to give ElasticSearch both memory and disk space.
Per index (Logstash creates one index per day) we have one replica (on
the same node, this does not provide HA, it speeds up searches) and
five shards (each shard is basically its own index, having multiple
shards increases indexing throughput).
As this setup grows and handles more logs we may need to add more
ElasticSearch nodes and run a proper cluster. Haven't reached that point
yet, but will probably be necessary as disk and memory footprints
increase.
Kibana
------
Kibana is a ruby app sitting behind Apache that provides a nice web UI
for querying Logstash events stored in ElasticSearch. Our install can
be reached at http://logstash.openstack.org. See
:ref:`query-logstash` for more info on using Kibana to perform
queries.
.. _query-logstash:
Querying Logstash
=================
Hop on over to http://logstash.openstack.org and by default you get the
last 15 minutes of everything Logstash knows about in chunks of 100.
We run a lot of tests but it is possible no logs have come in over the
last 15 minutes, change the dropdown in the top left from ``Last 15m``
to ``Last 60m`` to get a better window on the logs. At this point you
should see a list of logs, if you click on a log event it will expand
and show you all of the fields associated with that event and their
values (note Chromium and Kibana seem to have trouble with this at times
and some fields end up without values, use Firefox if this happens).
You can search based on all of these fields and if you click the
magnifying glass next to a field in the expanded event view it will add
that field and value to your search. This is a good way of refining
searches without a lot of typing.
The above is good info for poking around in the Logstash logs, but
one of your changes has a failing test and you want to know why. We
can jumpstart the refining process with a simple query.
``@fields.build_change:"$FAILING_CHANGE" AND @fields.build_patchset:"$FAILING_PATCHSET" AND @fields.build_name:"$FAILING_BUILD_NAME" AND @fields.build_number:"$FAILING_BUILD_NUMBER"``
This will show you all logs available from the patchset and build pair
that failed. Chances are that this is still a significant number of
logs and you will want to do more filtering. You can add more filters
to the queriy using ``AND`` and ``OR`` and parentheses can be used to
group sections of the query. Potential additions to the above query
might be
* ``AND @fields.filename:"logs/syslog.txt"`` to get syslog events.
* ``AND @fields.filename:"logs/screen-n-api.txt"`` to get Nova API events.
* ``AND @fields.loglevel:"ERROR"`` to get ERROR level events.
* ``AND @message"error"`` to get events with error in their message.
and so on.
General query tips:
* Don't search ``All time``. ElasticSearch is bad at trying to find all
the things it ever knew about. Give it a window of time to look
through. You can use the presets in the dropdown to select a window or
use the ``foo`` to ``bar`` boxes above the frequency graph.
* Only the @message field can have fuzzy searches performed on it. Other
fields require specific information.
* This system is growing fast and may not always keep up with the load.
Be patient. If expected logs do not show up immediately after the
Jenkins job completes wait a few minutes.