User facing error Summary Messages

The purpose of this spec is to define a mechanism in Cinder that will
allow a user to see error messages for asynchronous operations.

This is more specific to failed operations and cause of their failure.

Spec started in Newton

APIImpact New /messages resource

DocImpact

Co-Authored-By: Alex Meade <mr.alex.meade@gmail.com>

Implements : blueprint summarymessage

Change-Id: If7f8b7f277863a6dbf9686dadab145590ec9567d
Sheel Rana 2016-01-29 13:42:19 +05:30
parent 8afa9c447b
commit 48ddf3ff2a

@ -0,0 +1,432 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
==============================================
User facing "Failure" Message and Event Viewer
==============================================
For quite some time, OpenStack services have wanted to be able to send
messages, especially error messages, to API end users (here "user" means
the person interacting with the API, not the operator).
If a user performs an operation and it fails or hangs, there must be
some interface through which the user can see the reason
for this behavior.
This spec therefore lets users see error messages for asynchronous
operations, either directly through APIs or through a new
"Event Viewer" tab in Horizon or a Horizon plugin.
The focus is on failed operations and the cause of their failure.
https://blueprints.launchpad.net/cinder/+spec/summarymessage
Problem description
===================
If an operation such as create volume or create snapshot fails, the user
gets no detailed information about the failure.
Sometimes only the operation status is updated to error, failed, etc.,
without any explanation to the user.
In some cases it is worse than this.
For example:
1. If RabbitMQ is inactive and the user tries to create a volume,
Horizon hangs forever waiting for a response from the API.
2. If RabbitMQ is active but the recipient service is not running
and the user tries to create a volume, the API hangs forever waiting
for a response from RabbitMQ.
Only a few resources across the OpenStack projects report errors
to end users for asynchronous operations, and those that do are inconsistent
with each other.
In addition to a mechanism to enable error reporting in a
consistent way across OpenStack, the solution must also be able to
accommodate a deployment of Cinder that contains no other OpenStack service.
Use Cases
=========
Motivation for this blueprint:
1. To help the admin debug failed operations with
less log filtering.
2. To notify the admin to start services that failed due to
some abnormal condition.
3. To inform the user when an operation fails.
4. To give the user enough information about the failure.
General Use Cases:
* Cinder volume/snapshot creation goes to ERROR status due
to lack of capacity. (Scheduling error)
* Cinder add volume to CG fails. How do we tell the user that the volume
and group are not on the same backend?
* Cinder volume goes from attaching to available. Why?
* Volume retype fails.
* Volume extend fails. I'd like to know why and still be able to
use my volume instead of it being in error_extending.
* Etc.
Proposed change
===============
A user can fetch operation failure details through direct API calls or
through the new Horizon tab "Event Viewer".
The CLI client can be used to display the same information.
The suggested implementation is based on a two-way approach:
1. Push Information:
During user operations, messages will be pushed to the database
using component-specific notifications.
These messages will be generated and pushed to the database by the
component on operation start, operation completion or operation
failure.
Message generation will be based on an eventID to eventMessage
mapping kept in message constant files, which hold the different
notification message mappings.
This way a deployer can easily modify notification messages as
required.
2. Pull Information:
If a user needs to check operation status, the user can pull details
using the CLI client or the Horizon tab.
Results may be shown in tabular form, as below:
+--------+---------+----------+-------+-------+----------+------+---------+
| Tenant | EventID | NodeName | ReqID | Level | Resource | Time | Summary |
+========+=========+==========+=======+=======+==========+======+=========+
| Sheel | UKN_ERR | BS-cind1 | {...} | Error | Volume | {..} | {....} |
+--------+---------+----------+-------+-------+----------+------+---------+
Every 'operation' initiated by the user has a request ID returned as
an HTTP header in the context.
These notification messages will be tied to the operation's request ID.
(This request ID will be used to map a request to what happened in Cinder
for that operation.)
The summary message will contain an operation-specific failure message.
For example:
"Volume create operation failed - {Reason of failure}"
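As a minimal sketch of the eventID to eventMessage mapping and the request ID
tie-in described above, the following builds a message record for a failed
operation. The event IDs, the message texts and the ``build_message`` helper
are hypothetical names chosen for illustration; this spec does not define them.

.. code-block:: python

    import uuid

    # Hypothetical event IDs and texts; a deployer could edit this mapping.
    EVENT_MESSAGES = {
        "VOLUME_CREATE_FAILED": "Volume create operation failed - {reason}",
        "UKN_ERR": "Unknown error - {reason}",
    }


    def build_message(event_id, request_id, project_id, reason,
                      resource_type=None, resource_uuid=None):
        """Build a message record tied to the request ID of the operation."""
        template = EVENT_MESSAGES[event_id]  # KeyError for an unmapped event ID
        return {
            "id": str(uuid.uuid4()),
            "event_id": event_id,
            "request_id": request_id,
            "project_id": project_id,
            "resource_type": resource_type,
            "resource_uuid": resource_uuid,
            "message_level": "ERROR",
            "user_message": template.format(reason=reason),
        }

Because the record carries the request ID, a user who kept the request ID of a
failed call can look up what happened for exactly that operation.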
Filters:
Results can be filtered by TenantID, HostID, UserID,
operation outcome (Fail/Pass), etc.
Type of Messages:
1. API Events: messages for failed operations.
2. Service Logs: information about failed services, whether stopped by the
user or stopped due to an abnormal condition.
3. System Logs: any logs other than API and service logs.
**Suggested Architecture**:
The proposed change is to add a new /v3/<tenant>/messages API
resource backed by a messages table in the Cinder DB.
This endpoint will return a list of error messages that are
intended for the end-user regarding failed asynchronous operations.
In short:
* /v3/<tenant>/messages API resource, which exposes notification messages
subject to filters
* message_ttl config option that dictates the minimum lifetime of a message, in seconds
* messages DB table
Questions
---------
None
Alternatives
------------
* User facing notifications
Use the existing notification framework in combination with an AMQP
consumer to pull messages off and provide an endpoint for the user.
The drawbacks of this approach are that we do not want to display the
information currently in notifications to the user, and that it would
require many more services as dependencies.
* Per resource faults
This alternative suggests adding a sub-resource to each resource, such as
volumes/<volume-id>/faults, similar to Nova's instance faults. This
makes it difficult to poll for messages for more than a single resource
or resource type. It also adds significant complexity to the api as
every resource must add /faults in order to support messages.
* Exposing user messages via a separate service (such as Zaqar)
This approach suggests storing user messages in another service that the
user could query for messages or the service could utilize webhooks to
notify the user. One major drawback to this approach is the complexity
in writing bindings for the separate service(s) and the need for a
separate service as a dependency.
What this specification does not solve
--------------------------------------
* State change notifications.
This solution does not intend to solve the use-case of alerting users when
a volume or any other resource changes state. For example, when a volume
changes from ``creating`` to ``available``.
REST API impact
---------------
New APIs:
* GET /v3/<tenant>/messages
With filters by attribute. Ex: GET /v3/<tenant>/messages?resource_type=volume
* GET /v3/<tenant>/messages/<message-id>
* DELETE /v3/<tenant>/messages/<message-id>
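For illustration, the path of a filtered list call could be built as follows.
This is only a sketch: ``messages_url`` is a hypothetical helper, and which
attributes are accepted as filters is left to the implementation.

.. code-block:: python

    from urllib.parse import quote, urlencode


    def messages_url(tenant, **filters):
        """Return the list path for a tenant, with optional attribute filters."""
        path = "/v3/{}/messages".format(quote(tenant, safe=""))
        # Sort the filters so the same filters always give the same URL.
        query = urlencode(sorted(filters.items()))
        return path + "?" + query if query else path


    print(messages_url("demo", resource_type="volume"))
    # /v3/demo/messages?resource_type=volume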
Message schema ::
Message:
type: object
required:
- user_message
- id
- project_id
- request_id
- event_id
- created_at
- message_level
- expires_at
properties:
id:
type: string
description: UUID will be stored in 'id' field.
message_level:
type: string
enum:
- ERROR
description: The level of the message. In the future we may expand to
sending information to the user that is not an error.
user_message:
type: string
event_id:
type: string
description: Event ID can be used
a. by the deployer, to update message text for some specific situation.
b. by the user, to report errors.
c. to debug quickly, as it is easy to search for where a specific eventID
is used for reporting an error.
resource_uuid:
type: string
description: The uuid of the offending resource.
resource_type:
type: string
description: The type of resource this message pertains to.
For example, volume, snapshot, backup, etc.
request_id:
type: string
created_at:
type: string
expires_at:
type: string
description: After this time the message may no longer exist
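A small sketch of how a message could be checked against the schema above.
The required fields and the ``ERROR`` level are taken from the schema; the
``validate_message`` helper and all values in ``example`` are made up for
illustration.

.. code-block:: python

    REQUIRED_FIELDS = (
        "user_message", "id", "project_id", "request_id",
        "event_id", "created_at", "message_level", "expires_at",
    )
    ALLOWED_LEVELS = ("ERROR",)


    def validate_message(message):
        """Return a list of problems; an empty list means the message conforms."""
        problems = ["missing required field: " + name
                    for name in REQUIRED_FIELDS if name not in message]
        level = message.get("message_level")
        if level is not None and level not in ALLOWED_LEVELS:
            problems.append("unsupported message_level: " + str(level))
        return problems


    # Hypothetical message; none of these values come from a real deployment.
    example = {
        "id": "5429fffa-5c76-4d68-a671-37a8e24f37cf",
        "project_id": "demo-project",
        "request_id": "req-00000000-0000-0000-0000-000000000001",
        "event_id": "VOLUME_CREATE_FAILED",
        "message_level": "ERROR",
        "user_message": "Volume create operation failed - no capacity",
        "resource_type": "volume",
        "resource_uuid": "0a4f4f0c-6d1c-4c83-8f6e-5b6f3f1f6a10",
        "created_at": "2016-01-29T08:12:19Z",
        "expires_at": "2016-02-05T08:12:19Z",
    }

    assert validate_message(example) == []

``resource_uuid`` and ``resource_type`` are not in the required list, so a
message without them still passes this check.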
Data model impact
-----------------
A new messages table in the DB will store all messages. This table may
grow large in a cloud with lots of errors. The admin will be able to use
the ``expires_at`` column to reap messages.
Security impact
---------------
Messages must be highly scrutinized before becoming visible to the user, to
prevent any sensitive data from being shown. This will be mitigated by
having all user-visible messages defined in a single module. The messaging
mechanism will assert that any message it creates comes from the sanctioned
location.
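One possible way to enforce the "single sanctioned module" rule is sketched
below, under the assumption that user-visible texts are members of one enum.
``EventIds`` and ``create_user_message`` are hypothetical names, not part of
this spec, and the texts are examples only.

.. code-block:: python

    import enum


    class EventIds(enum.Enum):
        """Hypothetical single home for every user-visible message text."""
        UNKNOWN_ERROR = "An unknown error occurred."
        SCHEDULER_NO_CAPACITY = "No backend has enough free capacity."


    def create_user_message(event):
        """Refuse any text that does not come from the sanctioned enum."""
        if not isinstance(event, EventIds):
            raise TypeError("user messages must be members of EventIds")
        return {"event_id": event.name, "user_message": event.value}

With this shape, free-form strings such as exception text cannot reach the
user by accident, because only reviewed enum members are accepted.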
Notifications impact
--------------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
None
Other deployer impact
---------------------
* New configuration option ``message_ttl`` that will dictate the number of
seconds after a message's creation time at which to set the ``expires_at``
attribute on generated messages.
* New configuration option ``message_reap_interval`` that will dictate the
number of seconds between calls to delete old messages. With a value of -1
the reaper will never run. DocImpact: This option should not be set on a
large number of nodes, since too many nodes trying this delete at the same
time will cause transaction bouncing and degraded DB performance.
* New configuration option ``message_reap_batch_size`` that dictates the number
of expired messages to delete each interval. This allows a deployer to
limit DB performance impact by setting a ceiling for the number of
messages deleted at a time.
* The messages table will be potentially large and may be reaped based on
the ``expires_at`` column: all messages with an
``expires_at`` date earlier than the current time can be safely
deleted.
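The interaction of ``message_ttl`` and ``message_reap_batch_size`` can be
sketched as below. This is an assumption-laden illustration, not the Cinder
implementation: it uses an in-memory style SQLite table with a reduced,
hypothetical schema, stores times as integer epoch seconds, and the helper
names are invented.

.. code-block:: python

    import sqlite3

    # Reduced, hypothetical schema: only the columns the reaper needs.
    SCHEMA = "CREATE TABLE messages (id TEXT PRIMARY KEY, expires_at INTEGER)"


    def compute_expires_at(created_at, message_ttl):
        """Expiry time is the creation time plus message_ttl seconds."""
        return created_at + message_ttl


    def reap_expired(conn, now, batch_size):
        """Delete at most batch_size expired rows; return how many were deleted."""
        cursor = conn.execute(
            "DELETE FROM messages WHERE id IN ("
            " SELECT id FROM messages WHERE expires_at < ?"
            " ORDER BY expires_at LIMIT ?)",
            (now, batch_size),
        )
        conn.commit()
        return cursor.rowcount

Each call removes no more than one batch, so a periodic task running every
``message_reap_interval`` seconds bounds the DB work done per run.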
Developer impact
----------------
Developers should be aware of use-cases where the user needs information
about an error. In these situations, an appropriate user message should be
written and creation of the message added in the specific code path(s).
Implementation
==============
Assignee(s)
-----------
Primary assignee:
Sheel Rana
Alex Meade
Work Items
----------
This whole implementation depends upon message generation, transport,
collection, storage and analytics of different failure messages.
* cinder:
Implementation to generate notification messages at the time of
failure for all existing operations.
* cinder:
A notification listener is required, which will serve as the
basis for handling event messages from different components.
* cinder:
A collector is required to collect, validate and store event
messages in the database.
* cinder:
New API to fetch details from the database depending upon filters.
* cinder:
Add pagination to messages
* cinder-manage:
Add mechanism to automatically, and via a cinder-manage command,
reap expired messages in the db depending upon ttl value.
* cinder:
Documentation for new API details.
* cinder:
Update "Getting started Guide".
* cinder:
Database schema preparation to store notification messages.
* cinder:
Delete messages from the database once their expiry time has passed.
For example, if the user has set ``message_ttl`` to 7 days, then all messages
older than 7 days will be purged from the database.
* horizon:
Separate tab for cinder to display event messages.
* cinder-client:
cinder cli to communicate with API and fetch event messages.
* cinder-client:
Update to CLI reference Guide.
* Tempest tests
Implementation Phases:
----------------------
This whole feature will be implemented in multiple phases:
Phase 1. Basic implementation of notification generation and storage
in the database, with "/messages" exposed to view notification messages.
This spec targets Phase 1 first; other phases will be implemented after
acceptance of Phase 1.
Phase 2. Implementation allowing the admin to configure notification
storage: DB, Zaqar, or both.
If both are configured by the admin, notification messages will be
stored in Zaqar as well as in the database.
Phase 3. Implementation for consuming information from zaqar directly.
Phase 4. Horizon and CLI implementations to view notifications in a more
formatted manner.
Phase 5. Handling of special cases where generating notifications
requires separate handling, such as RabbitMQ-related implementations for
showing notifications when RabbitMQ is in a failed state or the RabbitMQ
recipient is inactive.
Dependencies
============
None
Testing
=======
Tempest tests should be written and run in the gate. It may prove difficult
to implement complete functional testing of the feature as messages will not
be created unless there is an error, which may be difficult to trigger.
However, with unlimited quotas some failures are easy to trigger.
One example is creating a thick provisioned volume too big to be stored on the
backend.
Example Test Cases
------------------
#. List messages with no messages
#. Attempt creation of a TOO LARGE volume and verify appropriate scheduling
   error message is created
#. List messages with filters, especially resource_type
Documentation Impact
====================
* REST API documentation
* New config option, ``message_ttl`` (time to live)
* New config option, ``message_reap_interval``
(number of seconds between calls to delete old messages)
* New config option, ``message_reap_batch_size``
(number of messages which could be deleted in one batch)
* New API policies for messages
References
==========
Mitaka Midcycle discussion
https://etherpad.openstack.org/p/mitaka-cinder-midcycle-user-notifications
https://etherpad.openstack.org/p/mitaka-cinder-midcycle-day-1
Kilo Summit Discussion
https://etherpad.openstack.org/p/kilo-cinder-async-reporting
Liberty Summit Discussion (in conjunction with HEAT)
https://etherpad.openstack.org/p/liberty-cross-project-user-notifications