Merge "Logging with Heka spec"
This commit is contained in:
commit
5991abbb4b
313
specs/logging-with-heka.rst
Normal file
=================
Logging with Heka
=================

https://blueprints.launchpad.net/kolla/+spec/heka

Kolla currently uses Rsyslog for logging. Change Request ``252968`` [1]
suggests using ELK (Elasticsearch, Logstash, Kibana) to index and visualize
all the logs.

This spec suggests using Heka [2] instead of Logstash, while still using
Elasticsearch for indexing and Kibana for visualization. It also discusses
the removal of Rsyslog along the way.

What is Heka? Heka is an open-source stream processing tool created and
maintained by Mozilla.

Using Heka will provide a lightweight and scalable log processing solution
for Kolla.

Problem description
===================

Change Request ``252968`` [1] adds an Ansible role named "elk" that enables
deploying ELK (Elasticsearch, Logstash, Kibana) on nodes with that role. This
spec builds on that work, proposing a scalable log processing architecture
based on the Heka [2] stream processing software.

We think that Heka provides a lightweight, flexible and powerful solution
for processing data streams, including logs.

Our primary goal in using Heka is to distribute the log processing load across
the OpenStack nodes, rather than relying on a centralized log processing engine
that represents a bottleneck and a single point of failure.

We also know from experience that Heka is flexible enough to process data
streams other than log messages. For example, we already use Heka together
with Elasticsearch for logs, but also with collectd and InfluxDB for
statistics and metrics.

Proposed change
===============

We propose to build on the ELK infrastructure brought by CR ``252968`` [1], and
use Heka to collect and process logs in a distributed and scalable way.

This is the proposed architecture:

.. image:: logging-with-heka.svg

In this architecture Heka runs on every node of the OpenStack cluster. It runs
in a dedicated container, referred to as the Heka container in the rest of this
document.

Each Heka instance reads and processes the logs local to the node it runs on,
and sends these logs to Elasticsearch for indexing. Elasticsearch may be
distributed on multiple nodes for resiliency and scalability, but that part is
outside the scope of this specification.

Heka, written in Go, is fast and has a small footprint, making it possible to
run it on every node of the cluster. In contrast, Logstash runs in a JVM and
is known [3] to be too heavy to run on every node.

Another important aspect is flow control and avoiding the loss of log messages
in case of overload. Heka's filter and output plugins, and the Elasticsearch
output plugin in particular, support the use of a disk-based message queue.
This message queue allows plugins to reprocess messages from the queue when
downstream servers (Elasticsearch) are down or cannot keep up with the data
flow.
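
The idea behind a disk-based queue can be illustrated with a minimal Python
sketch. This is not Heka's implementation: the ``DiskQueue`` class, its file
names and the ``send`` callback are hypothetical names chosen for this example.
Messages are appended to a file, and a cursor advances only after the
downstream call succeeds, so messages survive an outage and are retried later.

```python
import json
import os


class DiskQueue:
    """Append-only on-disk buffer with a persisted read cursor (illustrative)."""

    def __init__(self, directory):
        os.makedirs(directory, exist_ok=True)
        self.data_path = os.path.join(directory, "queue.log")
        self.cursor_path = os.path.join(directory, "cursor")
        # Make sure the data file exists so it can always be opened for reading.
        open(self.data_path, "ab").close()

    def put(self, message):
        # One JSON document per line; fsync so the message survives a crash.
        with open(self.data_path, "ab") as f:
            f.write(json.dumps(message).encode("utf-8") + b"\n")
            f.flush()
            os.fsync(f.fileno())

    def _cursor(self):
        try:
            with open(self.cursor_path) as f:
                return int(f.read().strip() or 0)
        except FileNotFoundError:
            return 0

    def drain(self, send):
        """Call send(message) for each pending message; stop at the first failure.

        Returns the number of messages delivered. The cursor only advances
        after a successful send, so undelivered messages are retried later.
        """
        delivered = 0
        offset = self._cursor()
        with open(self.data_path, "rb") as f:
            f.seek(offset)
            while True:
                line = f.readline()
                if not line:
                    break
                try:
                    send(json.loads(line))
                except ConnectionError:
                    break  # downstream is unavailable: keep the message queued
                offset = f.tell()
                with open(self.cursor_path, "w") as c:
                    c.write(str(offset))
                delivered += 1
        return delivered
```

In this sketch ``send`` stands in for the call that indexes a message in
Elasticsearch. A failed ``drain`` leaves the cursor untouched, so the next
call resumes from the first undelivered message.
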

With Logstash it is often recommended [3] to use Redis as a centralized queue,
which introduces some complexity and additional points of failure.

Remove Rsyslog
--------------

Kolla currently uses Rsyslog. The Kolla services are configured to write their
logs to Syslog. Rsyslog gets the logs from the ``/var/lib/kolla/dev/log`` Unix
socket and dispatches them to log files on the local file system. Because
Rsyslog runs in a Docker container, the log files are stored in a Docker volume
(named ``rsyslog``).

With Rsyslog already running on each cluster node, the question of using two
log processing daemons, namely ``rsyslogd`` and ``hekad``, has been raised on
the mailing list. This spec evaluates the possibility of using ``hekad`` only,
based on some prototyping work we have conducted [4].

Note: Kolla doesn't currently collect logs from RabbitMQ, HAProxy and
Keepalived. For RabbitMQ the problem is related to RabbitMQ not having the
capability to write its logs to Syslog. HAProxy and Keepalived do have that
capability, but the ``/var/lib/kolla/dev/log`` Unix socket file is currently
not mounted into the HAProxy and Keepalived containers.

Use Heka's ``DockerLogInput`` plugin
------------------------------------

To remove Rsyslog and only use Heka, one option would be to make the Kolla
services write their logs to ``stdout`` (or ``stderr``) and rely on Heka's
``DockerLogInput`` plugin [5] for reading the logs. Our experiments have
revealed a number of problems with this option:

* The ``DockerLogInput`` plugin doesn't currently work for containers that have
  a ``tty`` allocated. And Kolla currently allocates a tty for all containers
  (for good reasons).

* When ``DockerLogInput`` is used there is no way to differentiate log messages
  for containers producing multiple log streams. ``neutron-agents`` is an
  example of such a container. (Sam Yaple has raised that issue multiple
  times.)

* If Heka is stopped and restarted later then log messages will be lost, as the
  ``DockerLogInput`` plugin doesn't currently have a mechanism for tracking its
  positions in the log streams. This is in contrast to the ``LogstreamerInput``
  plugin [6] which does include that mechanism.

For these reasons we think that relying on the ``DockerLogInput`` plugin may
not be a practical option.

For the record, our experiments have also shown that the OpenStack containers'
logs written to ``stdout`` are visible to neither Heka nor ``docker logs``.
This problem is not reproducible when ``stderr`` is used rather than
``stdout``. The cause of this problem is currently unknown, and it looks like
other people have come across that issue [7].

Use local log files
-------------------

Another option consists of configuring all the Kolla services to log into local
files, and using Heka's ``LogstreamerInput`` plugin [6].

This option involves using a Docker named volume, mounted both into the service
containers (in ``rw`` mode) and into the Heka container (in ``ro`` mode). The
services write logs into files placed in that volume, and Heka reads logs from
the files found in that volume.

This option doesn't present the problems described in the previous section.
And it relies on Heka's ``LogstreamerInput`` plugin, which, based on our
experience, is efficient and robust.

Keeping file logs locally on the nodes has been established as a requirement by
the Kolla developers. With this option, and the Docker volume used, meeting
that requirement necessitates no additional mechanism.

For this option to be applicable the services must have the capability of
logging into files. Most of the Kolla services have this capability. The
exceptions are HAProxy and Keepalived, for which a different mechanism should
be used (described further down in the document). Note that this will make it
possible to collect logs from RabbitMQ, which does not support logging to
Syslog but does support logging to a file.

Also, this option requires that the services have the permission to create
files in the Docker volume, and that Heka has the permission to read these
files. This means that the Docker named volume must have the appropriate
owner, group and permission bits. With the Heka container running under
a specific user (see below) this will mean using an ``extend_start.sh`` script
including ``sudo chown`` and possibly ``sudo chmod`` commands. Our prototype
[4] already includes this.

As mentioned already the ``LogstreamerInput`` plugin includes a mechanism for
tracking positions in log streams. This works with journal files stored on the
file system (in ``/var/cache/hekad``). A specific volume, private to Heka,
will be used for these journal files. In this way no logs will be lost if the
Heka container is removed and a new one is created.
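
The principle of position tracking can be sketched in a few lines of Python.
This is only an illustration of the idea, not the format or logic of the
``LogstreamerInput`` journal: the ``read_new_lines`` function and the JSON
journal layout are assumptions made for this example. The reader stores its
byte offset in a journal file, so a restarted reader resumes where the previous
one stopped.

```python
import json
import os


def read_new_lines(log_path, journal_path):
    """Return lines appended to log_path since the last call, using a journal."""
    offset = 0
    if os.path.exists(journal_path):
        with open(journal_path) as j:
            offset = json.load(j).get("offset", 0)

    # If the file shrank (e.g. it was truncated or rotated), start over.
    if offset > os.path.getsize(log_path):
        offset = 0

    lines = []
    with open(log_path, "rb") as f:
        f.seek(offset)
        for raw in f:
            if not raw.endswith(b"\n"):
                break  # incomplete line: wait until the writer finishes it
            lines.append(raw.decode("utf-8", errors="replace").rstrip("\n"))
            offset += len(raw)

    # Write the journal atomically so a crash never leaves it half-written.
    tmp = journal_path + ".tmp"
    with open(tmp, "w") as j:
        json.dump({"offset": offset}, j)
    os.replace(tmp, journal_path)
    return lines
```

Because the journal lives in its own file, keeping it on a separate volume
(as proposed above for ``/var/cache/hekad``) lets a new reader pick up the
stored offset even if the previous process is gone.
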

Handling HAProxy and Keepalived
-------------------------------

As already mentioned HAProxy and Keepalived do not support logging to files.
This means that some other mechanism should be used for these two services (and
any other services that only support logging to Syslog).

Our prototype has demonstrated that we can make Heka act as a Syslog server.
This works by using Heka's ``UdpInput`` plugin with its ``net`` option set
to ``unixgram``.
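
For readers unfamiliar with ``unixgram`` sockets, the following Python sketch
shows what acting as a Syslog server over a Unix datagram socket involves. It
is not Heka code: ``open_syslog_socket`` and ``parse_priority`` are
hypothetical helper names, and only the leading ``<PRI>`` field of the Syslog
message is parsed.

```python
import os
import re
import socket

# "<PRI>" prefix of a syslog message, e.g. "<134>".
_PRI_RE = re.compile(r"^<(\d{1,3})>")


def open_syslog_socket(path):
    """Bind a Unix datagram socket at path, replacing any stale socket file."""
    if os.path.exists(path):
        os.unlink(path)
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sock.bind(path)
    return sock


def parse_priority(datagram):
    """Split a raw syslog datagram into (facility, severity, text)."""
    text = datagram.decode("utf-8", errors="replace")
    match = _PRI_RE.match(text)
    if not match:
        return None, None, text
    pri = int(match.group(1))
    # Per the syslog convention, PRI = facility * 8 + severity.
    return pri // 8, pri % 8, text[match.end():]
```

A service that logs to Syslog then only needs the socket file to be visible at
the path it writes to, which is why the socket has to be mounted into the
service containers.
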

This also requires that a Unix socket is created by Heka, and that socket is
mounted into the HAProxy and Keepalived containers. For that we will use the
same technique as the one currently used in Kolla with Rsyslog, that is
mounting ``/var/lib/kolla/dev`` into the Heka container and mounting
``/var/lib/kolla/dev/log`` into the service containers.

Our prototype already includes some code demonstrating this. See [4].

Also, to be able to store a copy of the HAProxy and Keepalived logs locally on
the node, we will use Heka's ``FileOutput`` plugin. We will possibly create
two instances of that plugin, one for HAProxy and one for Keepalived, with
specific filters (``message_matcher``).

Read Python Tracebacks
----------------------

In case of exceptions the OpenStack services log Python Tracebacks as multiple
log messages. If no special care is taken then the Python Tracebacks will be
indexed as separate documents in Elasticsearch, and displayed as distinct log
entries in Kibana, making them hard to read. To address that issue we will use
a custom Heka decoder, which will be responsible for coalescing the log lines
making up a Python Traceback into one message. Our prototype includes that
decoder [4].
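
The coalescing logic can be sketched in Python as follows. This is not the
decoder from the prototype: ``coalesce_tracebacks`` is a hypothetical function,
and it assumes that the payload of each log message has already been extracted
from its log prefix (timestamp, PID, level, logger name). Chained exceptions
are not handled.

```python
def coalesce_tracebacks(payloads):
    """Yield payloads, merging each Python traceback into a single string."""
    buffered = []
    for payload in payloads:
        if buffered:
            buffered.append(payload)
            # Indented lines ("  File ...", source lines) continue the traceback;
            # the first non-indented line is the exception summary and ends it.
            if not payload.startswith((" ", "\t")):
                yield "\n".join(buffered)
                buffered = []
        elif payload.startswith("Traceback (most recent call last):"):
            buffered = [payload]
        else:
            yield payload
    if buffered:
        # Input ended mid-traceback: emit what was collected rather than drop it.
        yield "\n".join(buffered)
```

With this approach a traceback reaches Elasticsearch as one document, while
ordinary log messages pass through unchanged.
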

Collect system logs
-------------------

In addition to container logs we think it is important to collect system logs
as well. For that we propose to mount the host's ``/var/log`` directory into
the Heka container, and configure Heka to get logs from standard log files
located in that directory (e.g. ``kern.log``, ``auth.log``, ``messages``). The
list of system log files will be determined at development time.

Log rotation
------------

Log rotation is an important aspect of the logging system. Currently Kolla
doesn't rotate logs. Logs just accumulate in the ``rsyslog`` Docker volume.
The work on Heka proposed in this spec isn't directly related to log rotation,
but we suggest addressing this issue for Mitaka. This will mean creating a new
container that uses ``logrotate`` to manage the log files created by the Kolla
containers.

Create a ``heka`` user
----------------------

For security reasons a ``heka`` user will be created in the Heka container and
the ``hekad`` daemon will run under that user. The ``heka`` user will be added
to the ``kolla`` group, to make sure that Heka can read the log files created
by the services.

Security impact
---------------

Heka is a mature product maintained and used in production by Mozilla, so we
trust Heka to be secure. We also trust the Heka developers to respond seriously
should security vulnerabilities be found in the Heka code.

As described above we are proposing to use a Docker volume between the service
containers and the Heka container. The group of the volume directory and the
log files will be ``kolla``, and the owner of the log files will be the user
that executes the service producing logs. But the ``gid`` of the ``kolla``
group and the ``uid``'s of the users executing the services may correspond
to a different group and different users on the host system. This means
that the permissions may not be right on the host system. This problem is
not specific to this specification, and it already exists in Kolla (for
the mariadb data volume for example).

Performance Impact
------------------

The ``hekad`` daemon will run in a container on each cluster node, but
``rsyslogd`` will be removed, and we have assessed that Heka is lightweight
enough to run on every node. Also, a possible option would be to constrain the
Heka container to only use a defined amount of resources.

Alternatives
------------

An alternative to this proposal involves using Logstash in a centralized
way as done in [1].

Another alternative would be to execute Logstash on each cluster node, as this
spec proposes with Heka. But this would mean running a JVM on each cluster
node, and using Redis as a centralized queue.

Also, as described above, we initially considered relying on services writing
their logs to ``stdout`` and using Heka's ``DockerLogInput`` plugin. But our
prototyping work has demonstrated the limits of that approach. See the
``DockerLogInput`` section above for more information.

Implementation
==============

Assignee(s)
-----------

Éric Lemoine (elemoine)

Milestones
----------

Target Milestone for completion: Mitaka 3 (March 4th, 2016).

Work Items
----------

1. Create a Heka Docker image
2. Create a Heka configuration for Kolla
3. Develop the necessary Heka decoders (with support for Python Tracebacks)
4. Create Ansible deployment files for Heka
5. Modify the services' logging configuration when required
6. Correctly handle RabbitMQ, HAProxy and Keepalived
7. Integrate with Elasticsearch and Kibana
8. Verify that logs from all the Kolla services are collected
9. Make the Heka container upgradable
10. Integrate with kolla-mesos (will be done after Mitaka)

Testing
=======

We will rely on the existing gate checks.

Documentation Impact
====================

The location of log files on the host will be mentioned in the documentation.

References
==========

[1] <https://review.openstack.org/#/c/252968/>
[2] <http://hekad.readthedocs.org>
[3] <http://blog.sematext.com/2015/09/28/recipe-rsyslog-redis-logstash/>
[4] <https://review.openstack.org/#/c/269745/>
[5] <http://hekad.readthedocs.org/en/latest/config/inputs/docker_log.html>
[6] <http://hekad.readthedocs.org/en/latest/config/inputs/logstreamer.html>
[7] <https://review.openstack.org/#/c/269952/>
614
specs/logging-with-heka.svg
Normal file