self-healing-sig/use-cases/memory-leak.rst
Adam Spiers 3bca09913f Remove CC license headers from individual specs / use cases
These are unnecessary because they are already covered by the
top-level LICENSE file.

Change-Id: I98145df081c03ce5cfa0ecfbf5133f9d658cc126
2019-04-25 23:54:21 +01:00

176 lines
4.9 KiB
ReStructuredText

=============================================
Memory leak mitigation and failure prevention
=============================================
In a cloud infrastructure, it is important to detect and mitigate memory leaks
in critical infrastructure software before catastrophic failure takes place.
Problem description
===================
Many long-running software services are susceptible to memory leaks. A service
experiencing memory leak tends to gradually consume more and more memory
during its operation, ultimately compromising the performance and the
stability of the service. In the case of critical infrastructure software,
it is important to detect and prevent catastrophic failure due to memory
leaks.
Fault class
===========
* Software error
* Performance degradation
OpenStack projects used
=======================
There are at least three solution architectures.
1. Local: mitigation decisions and actions are conducted locally on each
server.
2. Central: mitigation decisions and actions are orchestrated by a central
service.
3. Delegated: mitigation decision policy can be defined centrally but
localized to each server.
Local
-----
In the local architecture, a local utility can make the decision to restart a
process or take other mitigating actions when the process exceeds certain
fixed memory thresholds specified by the cloud operator. Candidate
implementations include:
- custom scripts.
- native memory limit mechanisms (e.g., cgroups) which would kill a process
when memory usage becomes too high, allowing another mechanism to restart
the process.
- `Monit`_.
.. _Monit: https://mmonit.com/monit/documentation/monit.html#Process-resource-tests
Central
-------
In the central architecture, mitigation decisions can be made at a central
level which is able to use cloud level information and policy not available at
a local level. Mitigation actions can include the orchestration of graceful
failovers that involve multiple servers.
There are three logical components to a solution.
* Memory usage collection:
- Monasca (monasca-agent)
- Nagios
- Zabbix
* Mitigation decision:
- Congress
- Vitrage
- Watcher
* Mitigation action:
- Mistral
- Watcher
Delegated
---------
The delegated architecture might be implemented as a mixture of the above two.
Remediation class
=================
Proactive / preemptive
Fault detection
===============
Definitive detection of memory leak is an unsolved problem. For the purpose of
this use case, suspected memory leak can be determined based on operator-set
limits or a more generic procedure based on memory usage history and other
relevant information.
Inputs and decision-making
==========================
Inputs:
* Memory usage by process.
* Memory usage by server.
* (Potentially) Memory usage history.
* (Potentially) A list of candidate services or processes for memory leak
mitigation.
Decision making:
* The simplest case is when the operator prescribes memory limits for each
relevant process. Take mitigating actions when prescribed memory limits
for a service/process is breached.
* The appropriate memory limits for each service/process might be determined
by an inductive algorithm. The subject is under active investigation by
the research community (for selected references, see `References`_).
* When there are no prescribed memory limits, decisions can be made on the
basis of a more generic procedure or policy. For example, a policy sketch
may be as follows.
- When a server's overall memory usage exceeds 90% of available memory for
a period of 10 minutes, take mitigating actions on the candidate
services or processes, prioritized by parameters such as:
+ Each service's total memory usage.
+ Each service's historical memory usage.
+ Risk and level of disruption of mitigating action taken upon each
service.
Remediation
===========
Two main mitigating approaches are available:
* Restart the service experiencing memory leak.
* Orchestrate a graceful fail-over.
Existing implementation(s)
==========================
Existing implementations are available for the local architecture. See `Local`_.
Future work
===========
If there is operator interest in the central or delegated architectures,
future work would include implementing the architectures using the referenced
projects and documenting the results.
Dependencies
============
Not applicable.
References
==========
Matthias Hauswirth and Trishul M. Chilimbi. 2004.
Low-overhead memory leak detection using adaptive statistical profiling.
SIGOPS Oper. Syst. Rev. 38, 5 (October 2004), 156-164.
DOI=http://dx.doi.org/10.1145/1037949.1024412
Sor, Vladimir, Plumbr Ou, Tarvo Treier and Satish Narayana Srirama.
“Improving Statistical Approach for Memory Leak Detection Using Machine
Learning.” 2013 IEEE International Conference on Software Maintenance (2013):
544-547.