User facing error Summary Messages
The purpose of this spec is to define a mechanism in Cinder that will allow a user to see error messages for asynchronous operations. This is more specific to failed operations and cause of their failure.

Spec started in Newton

APIImpact New /messages resource
DocImpact

Co-Authored-By: Alex Meade <mr.alex.meade@gmail.com>
Implements: blueprint summarymessage
Change-Id: If7f8b7f277863a6dbf9686dadab145590ec9567d
parent 8afa9c447b
commit 48ddf3ff2a

specs/newton/summarymessage.rst (new file, 432 lines)

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==============================================
User facing "Failure" Message and Event Viewer
==============================================

For quite some time, OpenStack services have wanted to be able to send
messages, especially error messages, to API end users (by user we mean the
person interacting with the API, not the operator).

If a user performs an operation and it fails or hangs, there must be some
interface through which the user can see the reason for this behavior.

This spec is about letting users see error messages for asynchronous
operations, either directly through the API or through a new "Event Viewer"
tab in Horizon or a Horizon plugin.

It focuses on failed operations and the cause of their failure.

https://blueprints.launchpad.net/cinder/+spec/summarymessage

Problem description
===================

If an operation such as create volume or create snapshot fails, the user
gets no detailed information about the failure.
Sometimes only the operation status is updated (to error, failed, etc.)
with no further information given to the user.

In some cases it is worse than this.
For example:

1. If RabbitMQ is inactive and the user tries to create a volume,
   Horizon hangs forever waiting for a response from the API.
2. If RabbitMQ is active but the recipient service is not running
   and the user tries to create a volume, the API hangs forever waiting
   for a response from RabbitMQ.

Only a few resources across the OpenStack projects report errors to end
users for asynchronous operations, and those that do are inconsistent
with each other.
In addition to enabling error reporting in a consistent way across
OpenStack, the solution must also accommodate a deployment of Cinder that
contains no other OpenStack service.

Use Cases
=========

Motivation for this blueprint:

1. To help admins debug failed operations with less log filtering.
2. To notify admins to start services that failed due to
   abnormal conditions.
3. To inform the user when an operation fails.
4. To give the user enough information about the failure.

General use cases:

* Cinder volume/snapshot creation goes to ERROR status due
  to lack of capacity (scheduling error).
* Adding a Cinder volume to a consistency group (CG) fails. How do we tell
  the user that the volume and group are not on the same backend?
* A Cinder volume goes from attaching to available. Why?
* Volume retype fails.
* Volume extend fails. I'd like to know why and still be able to
  use my volume instead of it being in error_extending.
* Etc.

Proposed change
===============

Users can fetch operation failure details through direct API calls or
through a new Horizon tab, "Event Viewer".
From the command line, the CLI client can be used to display the same
information.

The suggested implementation is based on a two-way approach:

1. Push information:
   During user operations, messages will be pushed to the database
   using component-specific notifications.

   These messages will be generated and pushed to the database by the
   component on operation start, operation completion, or operation
   failure.

   Message generation will be based on an eventID-to-eventMessage
   mapping held in message constant files, which keep the different
   notification message mappings.
   This way a deployer can easily modify notification messages as
   required.

2. Pull information:
   If a user needs to check the status of an operation, the user can pull
   the details using the CLI client or the Horizon tab.

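The eventID-to-eventMessage mapping described under "Push information" could
be sketched roughly as below. This is only an illustration: apart from
``UKN_ERR``, which appears in the example table in this spec, the event IDs,
the message texts, and the ``render_message`` helper are hypothetical and are
not defined by this spec.

```python
# Hypothetical default mapping; the real message constants are not defined here.
DEFAULT_EVENT_MESSAGES = {
    "UKN_ERR": "An unknown error occurred.",
    "SCHED_NO_CAPACITY": "No backend has enough free capacity for the request.",
}


def render_message(event_id, overrides=None):
    """Return the user-facing text for event_id.

    overrides stands in for deployer-modified message texts and takes
    precedence over the defaults.
    """
    messages = dict(DEFAULT_EVENT_MESSAGES)
    messages.update(overrides or {})
    try:
        return messages[event_id]
    except KeyError:
        # Only known event IDs may produce a user-visible message.
        raise ValueError("unknown event ID: %s" % event_id)
```

Keeping the texts in one mapping means a deployer changes the wording in a
single place, and code paths refer only to event IDs.
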
Results may be shown in tabular form, as below:

+--------+---------+----------+-------+-------+----------+------+---------+
| Tenant | EventID | NodeName | ReqID | Level | Resource | Time | Summary |
+========+=========+==========+=======+=======+==========+======+=========+
| Sheel  | UKN_ERR | BS-cind1 | {...} | Error | Volume   | {..} | {....}  |
+--------+---------+----------+-------+-------+----------+------+---------+

Every operation initiated by the user has a request ID, carried in the
request context and returned as an HTTP header.
These notification messages will be tied to the request ID of the operation.
(The request ID will be used to map a request to what happened in Cinder
for that operation.)

The summary message will contain an operation-specific failure message.
For example:
"Volume create operation failed - {Reason of failure}"

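As a rough sketch of how a message could be tied to a request ID and given an
expiry time, the snippet below builds one record as a plain dict. The
``build_message`` helper and its arguments are hypothetical; only the field
names and the ``message_ttl`` option come from this spec, and real storage
would be a row in the messages table rather than a dict.

```python
import datetime
import uuid


def build_message(request_id, project_id, event_id, reason, message_ttl, now=None):
    """Build one message record for a failed volume create (illustrative only)."""
    created_at = now or datetime.datetime.now(datetime.timezone.utc)
    expires_at = created_at + datetime.timedelta(seconds=message_ttl)
    return {
        "id": str(uuid.uuid4()),
        "request_id": request_id,  # ties the message to the user's operation
        "project_id": project_id,
        "event_id": event_id,
        "message_level": "ERROR",
        "user_message": "Volume create operation failed - %s" % reason,
        "created_at": created_at.isoformat(),
        "expires_at": expires_at.isoformat(),
    }
```

With ``message_ttl`` set to 3600, a message created at 00:00 UTC would carry
an ``expires_at`` of 01:00 UTC the same day.
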
Filters:

Results can be filtered by TenantID, HostID, UserID,
operation outcome/result (fail/pass), etc.

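Attribute filtering of this kind can be pictured as a simple equality match
on message fields. The sketch below works on in-memory dicts and uses a
hypothetical ``filter_messages`` helper; an actual implementation would most
likely apply the filters in the database query instead.

```python
def filter_messages(messages, **filters):
    """Return the messages whose fields equal every given filter value."""
    return [
        message for message in messages
        if all(message.get(key) == value for key, value in filters.items())
    ]


# Illustrative records only; the field values are made up.
sample = [
    {"project_id": "p1", "resource_type": "volume", "event_id": "UKN_ERR"},
    {"project_id": "p1", "resource_type": "snapshot", "event_id": "UKN_ERR"},
    {"project_id": "p2", "resource_type": "volume", "event_id": "UKN_ERR"},
]
volumes_for_p1 = filter_messages(sample, project_id="p1", resource_type="volume")
```
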
Types of messages:

1. API events: messages for failed operations.
2. Service logs: information about failed services, whether stopped by
   the user or stopped due to abnormal conditions.
3. System logs: any logs other than API and service logs.

**Suggested Architecture**:

The proposed change is to add a new /v3/<tenant>/messages API
resource backed by a messages table in the Cinder DB.
This endpoint will return a list of error messages
intended for the end user regarding failed asynchronous operations.

In short:

* A /v3/<tenant>/messages API resource that exposes notification messages,
  subject to filters
* A message_ttl config option that dictates the minimum lifetime of a
  message, in seconds
* A messages DB table

Questions
---------

None

Alternatives
------------

* User facing notifications

  Use the existing notification framework in combination with an AMQP
  consumer to pull messages off the queue and provide an endpoint for the
  user. The faults with this approach are that we do not want to display
  the information currently in notifications to the user, and that it
  would require many more services as dependencies.

* Per resource faults

  This alternative suggests adding a sub-resource to each resource, such as
  volumes/<volume-id>/faults, similar to Nova's instance faults. This
  makes it difficult to poll for messages for more than a single resource
  or resource type. It also adds significant complexity to the API, as
  every resource must add /faults in order to support messages.

* Exposing user messages via a separate service (such as Zaqar)

  This approach suggests storing user messages in another service that the
  user could query for messages, or that could use webhooks to
  notify the user. One major drawback to this approach is the complexity
  of writing bindings for the separate service(s) and the need for a
  separate service as a dependency.

What this specification does not solve
--------------------------------------

* State change notifications.

  This solution does not intend to solve the use case of alerting users when
  a volume or any other resource changes state, for example when a volume
  changes from ``creating`` to ``available``.

REST API impact
---------------

New APIs:

* GET /v3/<tenant>/messages

  With filters by attribute. Ex: GET /v3/<tenant>/messages?resource_type=volume
* GET /v3/<tenant>/messages/<message-id>
* DELETE /v3/<tenant>/messages/<message-id>

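To make the URL shapes above concrete, the following sketch assembles them
with the standard library. The ``messages_url`` helper and the example
endpoint are assumptions for illustration; they are not part of
python-cinderclient or of this spec.

```python
from urllib.parse import urlencode


def messages_url(endpoint, tenant, message_id=None, **filters):
    """Build a /v3/<tenant>/messages URL for a list, show, or delete call."""
    url = "%s/v3/%s/messages" % (endpoint.rstrip("/"), tenant)
    if message_id is not None:
        # Show and delete address a single message; filters do not apply.
        return "%s/%s" % (url, message_id)
    if filters:
        # Sort for a stable query string.
        url += "?" + urlencode(sorted(filters.items()))
    return url


# Hypothetical endpoint and tenant.
list_url = messages_url("http://cinder.example.com:8776", "t1", resource_type="volume")
```

The same URL serves GET (show) and DELETE for a single message; only the
HTTP method differs.
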
Message schema ::

    Message:
      type: object
      required:
        - user_message
        - id
        - project_id
        - request_id
        - event_id
        - created_at
        - message_level
        - expires_at
      properties:
        id:
          type: string
          description: UUID will be stored in 'id' field.
        message_level:
          type: string
          enum:
            - ERROR
          description: The level of the message. In the future we may expand to
                       sending information to the user that is not an error.
        user_message:
          type: string
        event_id:
          type: string
          description: Event ID can be used to
                       a. update message text at deployer end for some specific situation.
                       b. to report errors by user.
                       c. to debug fast as it is easy to search where specific eventID is
                          used for reporting error.
        resource_uuid:
          type: string
          description: The uuid of the offending resource.
        resource_type:
          type: string
          description: The type of resource this message pertains to.
                       For ex, volume, snapshot, backup etc
        request_id:
          type: string
        created_at:
          type: string
        expires_at:
          type: string
          description: After this time the message may no longer exist

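A minimal check of a message against the required fields and the
``message_level`` enum in the schema might look like the sketch below. The
``validate_message`` helper is hypothetical and deliberately checks only field
presence and the level value, not types or formats.

```python
# Field names copied from the "required" list in the schema.
REQUIRED_FIELDS = (
    "user_message", "id", "project_id", "request_id",
    "event_id", "created_at", "message_level", "expires_at",
)
ALLOWED_LEVELS = ("ERROR",)


def validate_message(message):
    """Return a list of problems; an empty list means the message conforms."""
    problems = [
        "missing field: %s" % field
        for field in REQUIRED_FIELDS
        if field not in message
    ]
    level = message.get("message_level")
    if level is not None and level not in ALLOWED_LEVELS:
        problems.append("invalid message_level: %s" % level)
    return problems
```
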
Data model impact
-----------------

A new messages table in the DB will store all messages. This table may grow
large in a cloud with lots of errors. The admin will be able to use
the ``expires_at`` column to reap messages.

Security impact
---------------

Messages must be carefully scrutinized before becoming visible to the user,
so that no sensitive data is shown. This will be mitigated by
having all user-visible messages defined in a single module. The messaging
mechanism will assert that any message it creates comes from the sanctioned
location.

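One way to picture the "sanctioned location" assertion is to define the
user-visible texts as members of a single enumeration and refuse anything
else. This is a sketch under that assumption; the ``EventId`` class, its
member, and ``user_message_for`` are hypothetical names, and the spec does not
prescribe this mechanism.

```python
import enum


class EventId(enum.Enum):
    """Hypothetical single module-level home for all user-visible messages."""

    UKN_ERR = "An unknown error occurred."


def user_message_for(event):
    """Return the sanctioned text for event, rejecting free-form strings."""
    # A raw string (for example str(exception)) could leak sensitive data,
    # so only EventId members are accepted.
    if not isinstance(event, EventId):
        raise TypeError("user-visible messages must come from EventId")
    return event.value
```
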
Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

None

Other deployer impact
---------------------

* New configuration option ``message_ttl`` that dictates the number of
  seconds after a message's creation time used to set the 'guaranteed_until'
  attribute (``expires_at`` in the message schema) on generated messages.

* New configuration option ``message_reap_interval`` that dictates the
  number of seconds between calls to delete old messages. With a value of
  -1 the reaper never runs. DocImpact: this option should not be set on a
  large number of nodes, since too many nodes attempting the delete at the
  same time will cause transaction bouncing and degraded DB performance.

* New configuration option ``message_reap_batch_size`` that dictates the
  number of expired messages to delete each interval. This allows a deployer
  to limit the DB performance impact by capping the number of messages
  deleted at a time.

* The messages table is potentially large and may be reaped based on
  the ``expires_at`` column: all messages with an ``expires_at`` date
  earlier than the current time can be safely deleted.

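The batch-limited reaping described above can be sketched with the standard
library's ``sqlite3`` module. This is an illustration only: Cinder uses its
own DB layer, and the ``reap_expired`` helper, the integer timestamps, and the
two-column table layout assumed here are simplifications, not the proposed
schema.

```python
import sqlite3


def reap_expired(conn, now, batch_size):
    """Delete at most batch_size messages whose expires_at is earlier than now.

    Assumes a table messages(id, expires_at) with comparable timestamps.
    Returns the number of rows deleted.
    """
    cursor = conn.execute(
        "DELETE FROM messages WHERE id IN ("
        " SELECT id FROM messages WHERE expires_at < ?"
        " ORDER BY expires_at LIMIT ?)",
        (now, batch_size),
    )
    conn.commit()
    return cursor.rowcount
```

A periodic task would call this once per ``message_reap_interval``, so a
backlog larger than ``message_reap_batch_size`` is cleared over several
intervals rather than in one large transaction.
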
Developer impact
----------------

Developers should be aware of use cases where the user needs information
about an error. In these situations, an appropriate user message should be
written and creation of the message added in the specific code path(s).

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Sheel Rana

  Alex Meade

Work Items
----------

The whole implementation depends on the generation, transport,
collection, storage, and analysis of different failure messages.

* cinder:
  Generate notification messages at the time of
  failure for all existing operations.

* cinder:
  A notification listener is required, which will serve as the
  basis for handling event messages from different components.

* cinder:
  A collector is required to collect, validate, and store event
  messages in the database.

* cinder:
  New API to fetch details from the database, subject to filters.

* cinder:
  Add pagination to messages.

* cinder-manage:
  Add a mechanism to reap expired messages in the DB, both automatically
  and via a cinder-manage command, depending on the TTL value.

* cinder:
  Documentation for the new API.

* cinder:
  Update the "Getting Started Guide".

* cinder:
  Database schema preparation to store notification messages.

* cinder:
  Delete messages from the database once their expiry time has passed.
  For example, if the user has set ``message_ttl`` to 7 days, then all
  messages older than 7 days will be purged from the database.

* horizon:
  Separate tab for Cinder to display event messages.

* cinder-client:
  Cinder CLI support to communicate with the API and fetch event messages.

* cinder-client:
  Update the CLI Reference Guide.

* Tempest tests

Implementation Phases
---------------------

This whole feature will be implemented in multiple phases:

Phase 1. Basic implementation of notification generation and storage
in the database, with "/messages" exposed to view notification messages.
This spec targets Phase 1 first; the other phases will be implemented after
acceptance of Phase 1.

Phase 2. Allow the admin to configure the notification
store: DB, Zaqar, or both.
If the admin configures both, notification messages would be
stored in Zaqar as well as in the database.

Phase 3. Implementation for consuming information from Zaqar directly.

Phase 4. Horizon and CLI implementations to view notifications in a
better formatted manner.

Phase 5. Handling of special cases where generating notifications
requires separate handling, such as RabbitMQ-related work to show
notifications when RabbitMQ is in a failed state or the RabbitMQ recipient
is inactive.

Dependencies
============

None

Testing
=======

Tempest tests should be written and run in the gate. It may prove difficult
to implement complete functional testing of the feature, as messages are not
created unless there is an error, and errors may be difficult to trigger.
However, with unlimited quotas some operations are easy to make fail.
One example is creating a thick-provisioned volume too big to be stored on
the backend.

Example Test Cases
------------------

#. List messages when there are no messages
#. Attempt creation of a volume that is too large and verify that the
   appropriate scheduling error message is created
#. List messages with filters, especially resource_type

Documentation Impact
====================

* REST API documentation
* New config option, ``message_ttl`` (time to live)
* New config option, ``message_reap_interval``
  (number of seconds between calls to delete old messages)
* New config option, ``message_reap_batch_size``
  (number of messages which could be deleted in one batch)
* New API policies for messages

References
==========

Mitaka Midcycle discussion
https://etherpad.openstack.org/p/mitaka-cinder-midcycle-user-notifications
https://etherpad.openstack.org/p/mitaka-cinder-midcycle-day-1

Kilo Summit Discussion
https://etherpad.openstack.org/p/kilo-cinder-async-reporting

Liberty Summit Discussion (in conjunction with HEAT) -
https://etherpad.openstack.org/p/liberty-cross-project-user-notifications