=============================================
Memory leak mitigation and failure prevention
=============================================

In a cloud infrastructure, it is important to detect and mitigate memory leaks
in critical infrastructure software before catastrophic failure takes place.


Problem description
===================

Many long-running software services are susceptible to memory leaks. A service
experiencing a memory leak gradually consumes more and more memory during its
operation, ultimately compromising the performance and the stability of the
service. In the case of critical infrastructure software, it is important to
detect such leaks and prevent the catastrophic failures they can cause.


Fault class
===========

* Software error
* Performance degradation


OpenStack projects used
=======================

There are at least three solution architectures.

1. Local: mitigation decisions and actions are conducted locally on each
   server.
2. Central: mitigation decisions and actions are orchestrated by a central
   service.
3. Delegated: mitigation decision policy can be defined centrally but
   localized to each server.


Local
-----

In the local architecture, a local utility can make the decision to restart a
process or take other mitigating actions when the process exceeds certain
fixed memory thresholds specified by the cloud operator. Candidate
implementations include:

- custom scripts.
- native memory limit mechanisms (e.g., cgroups) which would kill a process
  when memory usage becomes too high, allowing another mechanism to restart
  the process.
- `Monit`_.

.. _Monit: https://mmonit.com/monit/documentation/monit.html#Process-resource-tests


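As an illustration of the custom-script option, the fixed-threshold check can
be sketched in a few lines of Python. This is a minimal sketch only, assuming
a Linux ``/proc`` filesystem. The function names and the limits used in any
call are hypothetical, and the restart itself is left to whatever mechanism
the operator already uses.

.. code-block:: python

    import re
    from pathlib import Path
    from typing import Optional

    # Matches the resident set size line of /proc/<pid>/status,
    # for example "VmRSS:    524288 kB".
    _VM_RSS = re.compile(r"^VmRSS:\s+(\d+)\s+kB$", re.MULTILINE)


    def parse_vm_rss_kib(status_text: str) -> Optional[int]:
        """Return the resident set size in KiB, or None if the field is absent."""
        match = _VM_RSS.search(status_text)
        return int(match.group(1)) if match else None


    def exceeds_limit(status_text: str, limit_kib: int) -> bool:
        """True when resident memory is above the operator-set limit."""
        rss = parse_vm_rss_kib(status_text)
        return rss is not None and rss > limit_kib


    def check_pid(pid: int, limit_kib: int) -> bool:
        """Read /proc/<pid>/status (Linux only) and compare it with the limit."""
        return exceeds_limit(Path(f"/proc/{pid}/status").read_text(), limit_kib)

A periodic job could call ``check_pid`` for each monitored process and trigger
a restart when it returns ``True``. Processes without a ``VmRSS`` field, such
as kernel threads, are never flagged.

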
Central
-------

In the central architecture, mitigation decisions can be made at a central
level which is able to use cloud level information and policy not available at
a local level. Mitigation actions can include the orchestration of graceful
failovers that involve multiple servers.

There are three logical components to a solution.

* Memory usage collection:

  - Monasca (monasca-agent)
  - Nagios
  - Zabbix

* Mitigation decision:

  - Congress
  - Vitrage
  - Watcher

* Mitigation action:

  - Mistral
  - Watcher


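The separation into collection, decision and action can be sketched as three
pluggable steps. The following is an illustrative sketch only: ``Sample``,
``Action``, ``decide`` and ``run_once`` are hypothetical names and do not
correspond to any API of the projects listed above. The rule of preferring a
failover when a healthy peer runs on another server is likewise an assumption,
chosen to show a decision that needs cloud level information.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Callable, Dict, List


    @dataclass(frozen=True)
    class Sample:
        server: str
        service: str
        rss_mib: float


    @dataclass(frozen=True)
    class Action:
        server: str
        service: str
        kind: str  # "restart" or "failover"


    def decide(samples: List[Sample], limits_mib: Dict[str, float]) -> List[Action]:
        """Decision step: sees the samples from every server at once."""
        over = [s for s in samples
                if s.rss_mib > limits_mib.get(s.service, float("inf"))]
        actions = []
        for s in over:
            # Assumed rule: fail over only if the same service runs within
            # its limit on a different server; otherwise restart in place.
            has_peer = any(
                p.service == s.service and p.server != s.server and p not in over
                for p in samples
            )
            kind = "failover" if has_peer else "restart"
            actions.append(Action(s.server, s.service, kind))
        return actions


    def run_once(collect: Callable[[], List[Sample]],
                 decide_fn: Callable[[List[Sample]], List[Action]],
                 act: Callable[[Action], None]) -> List[Action]:
        """One pass through collection, decision and action."""
        actions = decide_fn(collect())
        for action in actions:
            act(action)
        return actions

In a real deployment each callable would be backed by one of the candidate
projects, so that the collection, decision and action components can be
replaced independently.

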
Delegated
---------

The delegated architecture might be implemented as a mixture of the above two.


Remediation class
=================

Proactive / preemptive


Fault detection
===============

Definitive detection of memory leaks is an unsolved problem. For the purpose
of this use case, a suspected memory leak can be identified based on
operator-set limits or a more generic procedure based on memory usage history
and other relevant information.


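One possible history-based procedure is to estimate how fast memory usage
grows and flag sustained growth as a suspected leak. The sketch below fits a
least-squares line to ``(timestamp, usage)`` samples. It is a heuristic under
stated assumptions, not a definitive detector: the function names, the minimum
sample count and the rate threshold are hypothetical, and steady growth can
also come from legitimate causes such as cache warm-up.

.. code-block:: python

    from typing import Sequence, Tuple

    History = Sequence[Tuple[float, float]]  # (timestamp_s, memory_usage)


    def growth_rate(history: History) -> float:
        """Least-squares slope of memory usage over time (usage per second)."""
        n = len(history)
        if n < 2:
            return 0.0
        mean_t = sum(t for t, _ in history) / n
        mean_m = sum(m for _, m in history) / n
        var_t = sum((t - mean_t) ** 2 for t, _ in history)
        if var_t == 0:
            return 0.0  # all samples share one timestamp; no trend to fit
        cov = sum((t - mean_t) * (m - mean_m) for t, m in history)
        return cov / var_t


    def suspect_leak(history: History, min_rate: float,
                     min_samples: int = 6) -> bool:
        """Flag a suspected leak when enough samples show sustained growth."""
        return len(history) >= min_samples and growth_rate(history) >= min_rate

A suitable ``min_rate`` depends on the service and would have to be chosen by
the operator or learned from past behaviour.

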
Inputs and decision-making
==========================

Inputs:

* Memory usage by process.
* Memory usage by server.
* (Potentially) Memory usage history.
* (Potentially) A list of candidate services or processes for memory leak
  mitigation.

Decision making:

* The simplest case is when the operator prescribes memory limits for each
  relevant process. Take mitigating actions when the prescribed memory limit
  for a service/process is breached.

* The appropriate memory limits for each service/process might be determined
  by an inductive algorithm. The subject is under active investigation by
  the research community (for selected references, see `References`_).

* When there are no prescribed memory limits, decisions can be made on the
  basis of a more generic procedure or policy. For example, a policy sketch
  may be as follows.

  - When a server's overall memory usage exceeds 90% of available memory for
    a period of 10 minutes, take mitigating actions on the candidate
    services or processes, prioritized by parameters such as:

    + Each service's total memory usage.
    + Each service's historical memory usage.
    + Risk and level of disruption of mitigating action taken upon each
      service.


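The policy sketch above could be expressed roughly as follows. This is an
illustration under stated assumptions: the ``Candidate`` fields, the function
names and the particular ordering (least disruptive first, then fastest
growing, then largest) are hypothetical choices, since the policy only names
the parameters and does not say how to weigh them.

.. code-block:: python

    from dataclasses import dataclass
    from typing import List, Sequence, Tuple


    @dataclass(frozen=True)
    class Candidate:
        name: str
        rss_mib: float           # current total memory usage of the service
        growth_mib_per_h: float  # summary of historical usage (assumed metric)
        disruption: int          # operator-assigned cost of acting; lower is safer


    def sustained_pressure(samples: Sequence[Tuple[float, float]],
                           threshold: float = 0.90,
                           window_s: float = 600.0) -> bool:
        """True when usage stayed above the threshold for the whole window.

        samples holds (timestamp_s, used_fraction) pairs, oldest first.
        """
        if not samples:
            return False
        end = samples[-1][0]
        start = end
        # Walk backwards to the start of the latest unbroken run above threshold.
        for ts, used in reversed(samples):
            if used <= threshold:
                break
            start = ts
        return end - start >= window_s


    def prioritize(candidates: Sequence[Candidate]) -> List[Candidate]:
        """Order candidates for mitigation (assumed weighting, see lead-in)."""
        return sorted(
            candidates,
            key=lambda c: (c.disruption, -c.growth_mib_per_h, -c.rss_mib),
        )

Mitigation would then be applied to the candidates in the returned order until
the server's memory usage falls back below the threshold.

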
Remediation
===========

Two main mitigating approaches are available:

* Restart the service experiencing the memory leak.
* Orchestrate a graceful failover.

Existing implementation(s)
==========================

Existing implementations are available for the local architecture. See `Local`_.


Future work
===========

If there is operator interest in the central or delegated architectures,
future work would include implementing the architectures using the referenced
projects and documenting the results.


Dependencies
============

Not applicable.


References
==========

Matthias Hauswirth and Trishul M. Chilimbi. 2004.
Low-overhead memory leak detection using adaptive statistical profiling.
SIGOPS Oper. Syst. Rev. 38, 5 (October 2004), 156-164.
DOI=http://dx.doi.org/10.1145/1037949.1024412

Sor, Vladimir, Plumbr Ou, Tarvo Treier and Satish Narayana Srirama.
“Improving Statistical Approach for Memory Leak Detection Using Machine
Learning.” 2013 IEEE International Conference on Software Maintenance (2013):
544-547.