Gorka Eguileor 245b16fc71 Dynamic Log Level control via REST API

Specs to add REST API to control Cinder services' log levels
dynamically.

blueprint: dynamic-log-levels
Change-Id: Ic4f76011a8bb34a2a125e591c8c6228421b29416

2017-04-20 10:01:48 -05:00

Dynamic Log Level control via REST API

https://blueprints.launchpad.net/cinder/+spec/dynamic-log-levels

Add a REST API to control Cinder services' log levels dynamically.

Problem description

To change log levels in a service, the service's configuration needs to be changed and the service restarted. The restart can be done by restarting the service itself or by requesting an internal restart via a SIGHUP signal.

In some services, such as API and Scheduler, a restart is not a big deal because they operate only in the control plane and do not perform long-running operations. In other services, such as Volume and Backup, it is a bigger deal because they are in the data plane as well, and restarting the service may take a long time.

We should be able to change service log levels dynamically as needed, even if they will revert back to the defaults on restart.

A downside to being able to dynamically change log levels is that we'll no longer be sure of what log level a service is running at at a given time, so we'll also need a mechanism to query the current log levels of a service.

Use Cases

Cloud users encounter problems when using the cloud and contact support. The system operator starts looking at the logs, only to find that the current log levels are insufficient to determine the root cause of the problem and need to be changed to DEBUG.

Another use case, satisfied as a side effect of implementing this spec, is a system administrator who wants to confirm Message Broker connectivity in a service: the log level query mechanism can be used as a ping to the service via the Message Broker.

Proposed change

The proposal is to introduce two new service REST API actions, one to modify log levels at runtime and another to query them. The log level changes will last only for the current service run, as they will revert to those defined in the configuration file upon restart.

Setting the log levels will be possible for all Volume, Scheduler, and Backup services. In the API service it will be limited to the service process that receives the request, since there is currently no mechanism to propagate the request to other API nodes, and adding one for this feature would be unnecessary complexity at this point.

This is a reasonable limitation, since API services can be easily restarted without impacting the cloud because they are only in the control plane and are usually deployed in an Active/Active configuration. If they are not in an Active/Active configuration, then there is only one API service running, and not being able to propagate the API log level change isn't such a big deal.

While some operators may prefer to restart the API services to change the log levels, others may prefer to avoid restarts by making the dynamic log level change directly on all the API nodes, skipping the load balancer. Still others will change just one API node dynamically, skipping the load balancer, and make the test request to that one API node.

The mechanism to set the log level should be versatile enough that no scripting is necessary when we want to make multiple changes. The way to achieve this will be to allow changing log levels on all addressable services, or on a subset limited by binary and/or server.

It'll also be possible to decide which log levels to change in the service, so we'll be able to change not only the log levels of the cinder service itself, but also those of its libraries (e.g. the SQLAlchemy library).
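As an illustration of how a prefix could select loggers inside a single process, here is a minimal sketch using only the Python standard library logging module. The helper names (set_log_levels, get_log_levels) are hypothetical and not part of Cinder, and the simple "name starts with prefix" rule is an assumption about how the prefix would be interpreted.

```python
import logging

# Levels the spec lists as accepted values for the API.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def _matching_loggers(prefix):
    """Yield (name, logger) for existing loggers whose name starts with prefix."""
    # loggerDict also holds PlaceHolder entries for intermediate names; skip them.
    for name, logger in list(logging.root.manager.loggerDict.items()):
        if isinstance(logger, logging.Logger) and name.startswith(prefix or ""):
            yield name, logger


def set_log_levels(prefix, level):
    """Set level (case insensitive) on matching loggers; return their names."""
    try:
        level_no = LEVELS[level.upper()]
    except KeyError:
        raise ValueError("unsupported log level: %r" % (level,))
    changed = []
    for name, logger in _matching_loggers(prefix):
        logger.setLevel(level_no)
        changed.append(name)
    return sorted(changed)


def get_log_levels(prefix):
    """Return {logger name: effective level name} for matching loggers."""
    return {
        name: logging.getLevelName(logger.getEffectiveLevel())
        for name, logger in _matching_loggers(prefix)
    }


if __name__ == "__main__":
    logging.getLogger("sqlalchemy.engine")  # make sure the logger exists
    set_log_levels("sqlalchemy", "debug")
    print(get_log_levels("sqlalchemy"))
```

Because the change only touches in-memory logger objects, it is lost when the process restarts, which matches the lifetime described above.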

Both mechanisms will allow setting/querying multiple services, but will only work on services that are up as per DB heartbeats.
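The selection rules described above (all services, or limited by binary and/or server, and only services that are up) can be sketched as a simple filter. This is an illustration only: ServiceRecord, select_services, and the 60 second heartbeat threshold are assumptions made for the example, not Cinder's actual data model or code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ServiceRecord:
    """Hypothetical stand-in for a row of the services table."""
    binary: str
    host: str
    last_heartbeat: datetime


def select_services(services, binary=None, server=None, now=None,
                    max_age=timedelta(seconds=60)):
    """Return the services matching binary/server whose heartbeat is recent."""
    now = now or datetime.now(timezone.utc)
    # "*", empty string, None, or a missing value all mean "every binary".
    all_binaries = binary in (None, "", "*")
    selected = []
    for svc in services:
        if not all_binaries and svc.binary != binary:
            continue
        if server and svc.host != server:
            continue
        if now - svc.last_heartbeat > max_age:
            continue  # considered down, so it cannot receive the request
        selected.append(svc)
    return selected
```

With a filter of this shape, one request can address every running service, one kind of service, or a single host, without any client-side scripting.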

Alternatives

An alternative would be to support Dynamic Reconfiguration after modifying cinder.conf, but that is a considerably bigger problem that will require more code changes. While it would be more powerful, it also has some drawbacks: it requires access to the nodes to change the configuration of each of the services, and also to trigger the reload of each of them.

The benefit of having an API for the log levels is that you don't have to have access to the infrastructure as you can request the change through the REST API and then check the logs in the log monitoring service.

Data model impact

None

REST API impact

Set log level: This will be implemented as a service action like enable and disable, but will use the set-log identifier. The effective URL /v3/{tenant_id}/os-services/set-log will take the following parameters in the body:
- binary (optional): A string indicating the binary of the service to change. Accepted values are cinder-volume, cinder-scheduler, cinder-backup, cinder-api, *, null, or the empty string; the parameter may also be missing. The last four possibilities (*, null, empty string, or missing) are equivalent and select all services.
- server (optional): A string indicating the server to change. It can be a host or cluster reference (host@backend or cluster@backend), or it can be null, the empty string, or missing to select all servers matching the binary.
- prefix (optional): A string indicating the prefix for the log path, for example cinder. or sqlalchemy.engine. When not present, all logs will be changed.
- level (required): A string with the log level to set, case insensitive. Accepted values are INFO, WARNING, ERROR, and DEBUG.
Get log level: Service action with the get-log identifier. The effective URL /v3/{tenant_id}/os-services/get-log will accept the following parameters in the body:
- binary (optional): A string indicating the binary of the service to query. Accepted values are *, empty string, null, cinder-volume, cinder-scheduler, cinder-backup, and cinder-api. If the parameter is missing, or * or the empty string is passed, all binaries will be used.
- server (optional): A string indicating the server to query. It can be a host or a cluster reference (host@backend or cluster@backend).
- prefix (optional): A string indicating the prefix for the log path we are querying, for example cinder. or sqlalchemy.engine. When not present, or when the empty string is passed, all log levels will be retrieved.
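To make the parameter rules concrete, the following sketch builds and validates a set-log request body on the client side. It assumes the parameters are sent as a flat JSON object, which the text above does not spell out, and build_set_log_body is a hypothetical helper rather than part of any Cinder client.

```python
import json

ACCEPTED_LEVELS = ("INFO", "WARNING", "ERROR", "DEBUG")
ACCEPTED_BINARIES = (
    "cinder-volume", "cinder-scheduler", "cinder-backup", "cinder-api", "*", "",
)


def build_set_log_body(level, binary=None, server=None, prefix=None):
    """Return a dict for the set-log body, omitting unset optional fields."""
    # level is required and case insensitive.
    if not isinstance(level, str) or level.upper() not in ACCEPTED_LEVELS:
        raise ValueError("level must be one of %s" % ", ".join(ACCEPTED_LEVELS))
    # None stands for a null or missing binary, which means all services.
    if binary is not None and binary not in ACCEPTED_BINARIES:
        raise ValueError("unknown binary: %r" % (binary,))
    body = {"level": level.upper()}
    for key, value in (("binary", binary), ("server", server), ("prefix", prefix)):
        if value is not None:
            body[key] = value
    return body


if __name__ == "__main__":
    # Raise only the cinder.* loggers of every volume service to DEBUG.
    body = build_set_log_body("debug", binary="cinder-volume", prefix="cinder.")
    print(json.dumps(body, indent=2))
```

A get-log body would take the same shape without the level field.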

Example response to get-log:

{
   "log_levels":[
       {
          "binary": "cinder-api",
          "host": "hostname1",
          "levels":{
             "cinder.api": "DEBUG",
             "cinder.api.common": "DEBUG",
             "cinder.db.sqlalchemy.api": "DEBUG"
          }
       },
       {
          "binary": "cinder-scheduler",
          "host": "hostname1",
          "levels":{
             "cinder": "DEBUG",
             "cinder.scheduler.manager": "DEBUG",
             "eventlet": "ERROR"
          }
       },
       {
          "binary": "cinder-volume",
          "host": "hostname2@backend#pool",
          "levels":{
             "cinder": "DEBUG",
             "cinder.volume.drivers.rbd": "DEBUG",
             "sqlalchemy": "WARNING"
          }
       }
    ]
}
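A client consuming a response of this shape might want to list which loggers sit at a given level. The sketch below assumes only the structure shown in the example (a log_levels list whose entries carry binary, host, and levels); the function name loggers_at_level and the trimmed sample data are illustrative.

```python
import json

# Trimmed-down response with the same shape as the example response.
SAMPLE_RESPONSE = """
{
  "log_levels": [
    {"binary": "cinder-scheduler", "host": "hostname1",
     "levels": {"cinder": "DEBUG", "eventlet": "ERROR"}},
    {"binary": "cinder-volume", "host": "hostname2@backend#pool",
     "levels": {"cinder": "DEBUG", "sqlalchemy": "WARNING"}}
  ]
}
"""


def loggers_at_level(response, level):
    """Return (binary, host, logger) tuples whose reported level equals level."""
    wanted = level.upper()
    found = []
    for service in response.get("log_levels", []):
        for logger_name, logger_level in service.get("levels", {}).items():
            if logger_level.upper() == wanted:
                found.append((service["binary"], service["host"], logger_name))
    return found


if __name__ == "__main__":
    parsed = json.loads(SAMPLE_RESPONSE)
    for binary, host, logger_name in loggers_at_level(parsed, "DEBUG"):
        print(binary, host, logger_name)
```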

Security impact

None, since it will use the same service update access control policy that is used for operations such as enable, disable, and freeze.

Notifications impact

For audit purposes a new notification will be emitted with every dynamic log level change.

Other end user impact

None

Performance Impact

None, besides the possible increase in log volume when a service is changed to a more verbose log level, for example DEBUG.

Other deployer impact

None.

Developer impact

None

Implementation

Assignee(s)

Primary assignee: Gorka Eguileor (geguileo)

Work Items

Add the set API endpoint and mechanism on the services
Cinder client support for set action
Add the get API endpoint and mechanism on the services
Cinder client support for get action

Dependencies

None

Testing

Unit tests for the new API behavior.

Documentation Impact

Only the changes to the API need to be documented.
