User facing error Summary Messages

The purpose of this spec is to define a mechanism in Cinder that will allow a user to see error messages for asynchronous operations. This is more specific to failed operations and cause of their failure.

Spec started in Newton

APIImpact New /messages resource
DocImpact
Co-Authored-By: Alex Meade <mr.alex.meade@gmail.com>
Implements : blueprint summarymessage
Change-Id: If7f8b7f277863a6dbf9686dadab145590ec9567d
2016-01-29 13:42:19 +05:30 · 2016-01-29 13:42:19 +05:30 · 48ddf3ff2a
commit 48ddf3ff2a
parent 8afa9c447b
1 changed files with 432 additions and 0 deletions
--- a/specs/newton/summarymessage.rst
+++ b/specs/newton/summarymessage.rst
@@ -0,0 +1,432 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+==============================================
+User facing "Failure" Message and Event Viewer
+==============================================
+
+For quite some time, OpenStack services have wanted to be able to send
+messages, especially error messages, to API end users (here "user" means
+the person interacting with the API, not the operator).
+
+If a user performs an operation and it fails or hangs, there must be
+some interface through which the user can see the reason for this
+behavior.
+
+This spec therefore makes error messages for asynchronous operations
+visible to the user, either directly through APIs or through a new
+"Event Viewer" tab in Horizon or a Horizon plugin.
+
+The focus is on failed operations and the cause of their failure.
+
+https://blueprints.launchpad.net/cinder/+spec/summarymessage
+
+Problem description
+===================
+
+If operations such as create volume or create snapshot fail, the user gets
+no detailed information about the failure.
+Sometimes only the operation status is updated (to error, failed, etc.)
+without any further information for the user.
+
+In some cases it is worse than this.
+For example:
+
+1. If RabbitMQ is inactive and the user tries to create a volume,
+   Horizon hangs forever waiting for a response from the API.
+2. If RabbitMQ is active but the recipient service is not running
+   and the user tries to create a volume, the API hangs forever waiting
+   for a response from RabbitMQ.
+
+A few resources across the OpenStack projects report errors to end users
+for asynchronous operations, but those that do are inconsistent
+with each other.
+In addition to enabling error reporting in a consistent way across
+OpenStack, the solution must also accommodate a deployment of Cinder
+that contains no other OpenStack service.
+
+Use Cases
+=========
+
+Motivation for this blueprint:
+
+1. To help the admin debug failed operations with
+   less log filtering.
+2. To notify the admin to start services that failed due to
+   some abnormal condition.
+3. To inform the user when an operation fails.
+4. To give the user enough information about the failure.
+
+General use cases:
+
+* Cinder volume/snapshot creation goes to ERROR status due
+  to lack of capacity (scheduling error).
+* Cinder add volume to CG fails. How do we tell the user that the volume
+  and group are not on the same backend?
+* Cinder volume goes from attaching to available. Why?
+* Volume retype fails.
+* Volume extend fails. I'd like to know why and still be able to
+  use my volume instead of it being in error_extending.
+* Etc.
+
+Proposed change
+===============
+
+The user can fetch operation failure details through direct API calls or
+through a new Horizon tab, "Event Viewer".
+The CLI client can be used to display the same information.
+
+The suggested implementation is based on a two-way approach:
+
+1. Push information:
+   During user operations, messages will be pushed to the database
+   using component-specific notifications.
+
+   These messages will be generated and pushed to the database by the
+   component on operation start, operation completion, or operation
+   failure.
+
+   Message generation will be based on an eventID-to-eventMessage
+   mapping kept in message constant files, which hold the
+   different notification messages.
+   This way the deployer can easily modify notification messages as
+   required.
+
+2. Pull information:
+   If the user needs to check operation status, the user can pull the
+   details using the CLI client or the Horizon tab.
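+
+The eventID-to-eventMessage mapping described above could look roughly like
+the following minimal sketch. The event ID ``SCHED_NO_CAPACITY``, the message
+texts, and the helper names are assumptions made for illustration only; they
+are not identifiers defined by this spec or by Cinder.
+

```python
# Illustrative sketch only: the event IDs, texts, and helper names below are
# assumptions made for this example, not identifiers defined by Cinder.

# Default event-ID-to-message mapping (the "message constants").
DEFAULT_EVENT_MESSAGES = {
    "SCHED_NO_CAPACITY": "Volume create operation failed - "
                         "no backend has enough free capacity.",
    "UKN_ERR": "Operation failed - unknown error.",
}


def load_event_messages(deployer_overrides=None):
    """Return the mapping with any deployer-specific texts applied."""
    messages = dict(DEFAULT_EVENT_MESSAGES)
    messages.update(deployer_overrides or {})
    return messages


def summary_for(event_id, event_messages):
    """Look up the user-facing text; unknown IDs fall back to UKN_ERR."""
    return event_messages.get(event_id, event_messages["UKN_ERR"])


# A deployer replaces one text; the other default is left untouched.
event_messages = load_event_messages(
    {"UKN_ERR": "Operation failed - please contact support."})
print(summary_for("SCHED_NO_CAPACITY", event_messages))
print(summary_for("SOMETHING_ELSE", event_messages))
```

+
+Keeping the texts in one mapping means a deployer only touches data, not the
+code paths that report the failure.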
+
+Results may be shown in tabular form, for example:
+
++--------+---------+----------+-------+-------+----------+------+---------+
+| Tenant | EventID | NodeName | ReqID | Level | Resource | Time | Summary |
++========+=========+==========+=======+=======+==========+======+=========+
+| Sheel  | UKN_ERR | BS-cind1 | {...} | Error | Volume   | {..} | {....}  |
++--------+---------+----------+-------+-------+----------+------+---------+
+
+Every operation initiated by the user has a request ID, which is returned
+as an HTTP header.
+These notification messages will be tied to the operation's request ID,
+which is used to map a request to what happened in Cinder
+for that operation.
+
+The summary message will contain an operation-specific failure message.
+For example:
+"Volume create operation failed - {Reason of failure}"
+
+
+Filters:
+
+Results can be filtered by TenantID, HostID, UserID,
+operation outcome/result (Fail/Pass), etc.
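+
+As a rough illustration of attribute filtering, the sketch below keeps only
+the messages that match every given filter. The function name and the sample
+records are hypothetical; the attribute names follow the message schema in
+this spec.
+

```python
# Hypothetical helper: exact-match filtering over message records.
def filter_messages(messages, **filters):
    """Return the messages whose attributes equal every given filter."""
    return [
        message for message in messages
        if all(message.get(key) == value for key, value in filters.items())
    ]


# Sample records with made-up values.
sample = [
    {"id": "m1", "project_id": "p1", "resource_type": "volume"},
    {"id": "m2", "project_id": "p1", "resource_type": "snapshot"},
    {"id": "m3", "project_id": "p2", "resource_type": "volume"},
]

p1_volumes = filter_messages(sample, project_id="p1", resource_type="volume")
print([message["id"] for message in p1_volumes])
```
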
+
+Types of messages:
+
+1. API events: messages for failed operations.
+2. Service logs: information about failed services, whether stopped by
+   the user or stopped due to an abnormal condition.
+3. System logs: any logs other than API and service logs.
+
+**Suggested Architecture**:
+
+The proposed change is to add a new /v3/<tenant>/messages API
+resource backed by a messages table in the Cinder DB.
+This endpoint will return a list of error messages that are
+intended for the end-user regarding failed asynchronous operations.
+
+In short:
+
+* /v3/<tenant>/messages API resource that exposes notification messages
+  depending upon filters
+* message_ttl config option that dictates the minimum message lifetime in
+  seconds
+* messages DB table
+
+Questions
+---------
+None
+
+Alternatives
+------------
+
+* User facing notifications
+  Use the existing notification framework in combination with an AMQP
+  consumer to pull messages off and provide an endpoint for the user.
+  Faults with this approach are that we do not want to display the current
+  information in notifications to the user and it will require many more
+  services as dependencies.
+
+* Per resource faults
+  This alternative suggests adding a sub-resource to each resource, such as
+  volumes/<volume-id>/faults, similar to Nova's instance faults. This
+  makes it difficult to poll for messages for more than a single resource
+  or resource type. It also adds significant complexity to the API as
+  every resource must add /faults in order to support messages.
+
+* Exposing user messages via a separate service (such as Zaqar)
+  This approach suggests storing user messages in another service that the
+  user could query for messages or the service could utilize webhooks to
+  notify the user. One major drawback to this approach is the complexity
+  in writing bindings for the separate service(s) and the need for a
+  separate service as a dependency.
+
+What this specification does not solve
+--------------------------------------
+
+* State change notifications.
+  This solution does not intend to solve the use-case of alerting users when
+  a volume or any other resource changes state. For example, when a volume
+  changes from ``creating`` to ``available``.
+
+REST API impact
+---------------
+
+New APIs:
+
+* GET /v3/<tenant>/messages
+  (supports filtering by attribute, for example
+  GET /v3/<tenant>/messages?resource_type=volume)
+* GET /v3/<tenant>/messages/<message-id>
+* DELETE /v3/<tenant>/messages/<message-id>
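+
+A client could assemble these URLs along the following lines. This is only a
+sketch: ``messages_url``, the endpoint ``http://cinder.example:8776``, and the
+tenant and message IDs are made-up placeholders, and no request is sent.
+

```python
from urllib.parse import quote, urlencode


def messages_url(endpoint, tenant, message_id=None, **filters):
    """Build the URL for the message list or for a single message."""
    path = "/v3/%s/messages" % quote(tenant, safe="")
    if message_id is not None:
        path += "/" + quote(message_id, safe="")
    # Sort the filters so the generated query string is deterministic.
    query = urlencode(sorted(filters.items()))
    return endpoint.rstrip("/") + path + ("?" + query if query else "")


list_url = messages_url("http://cinder.example:8776", "t1",
                        resource_type="volume")
show_url = messages_url("http://cinder.example:8776/", "t1",
                        message_id="abc-123")
print(list_url)
print(show_url)
```

+
+The same URL serves both GET and DELETE for a single message; only the HTTP
+method differs.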
+
+Message schema ::
+
+  Message:
+    type: object
+    required:
+    - user_message
+    - id
+    - project_id
+    - request_id
+    - event_id
+    - created_at
+    - message_level
+    - expires_at
+    properties:
+      id:
+        type: string
+        description: UUID will be stored in 'id' field.
+      message_level:
+        type: string
+        enum:
+        - ERROR
+        description: The level of the message. In the future we may expand to
+          sending information to the user that is not an error.
+      user_message:
+        type: string
+      event_id:
+        type: string
+        description: Event ID can be used
+          a. to update the message text at the deployer end for a specific
+          situation,
+          b. by the user to report errors,
+          c. to debug quickly, as it is easy to search for where a specific
+          eventID is used for reporting an error.
+      resource_uuid:
+        type: string
+        description: The uuid of the offending resource.
+      resource_type:
+        type: string
+        description: The type of resource this message pertains to,
+          for example volume, snapshot, backup, etc.
+      request_id:
+        type: string
+      created_at:
+        type: string
+      expires_at:
+        type: string
+        description: After this time the message may no longer exist
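+
+To make the schema concrete, the following sketch checks a message record
+against the required fields and the ``message_level`` enum listed above. The
+validator and every value in ``example`` are hypothetical; in particular the
+ISO 8601 timestamp format is an assumption, since the schema only says
+``string``.
+

```python
# Required fields and allowed levels, copied from the schema above.
REQUIRED_FIELDS = (
    "user_message", "id", "project_id", "request_id",
    "event_id", "created_at", "message_level", "expires_at",
)
ALLOWED_LEVELS = ("ERROR",)


def validate_message(message):
    """Return a list of problems; an empty list means the message is valid."""
    problems = [
        "missing required field: %s" % field
        for field in REQUIRED_FIELDS if field not in message
    ]
    level = message.get("message_level")
    if level is not None and level not in ALLOWED_LEVELS:
        problems.append("unsupported message_level: %s" % level)
    return problems


# Made-up example record; resource_uuid and resource_type are optional.
example = {
    "id": "5429fffa-5c76-4d68-a671-37a8e24f37cf",
    "project_id": "6c430ede-9476-4128-8838-8d3929ced223",
    "request_id": "req-936666d2-4c8f-4e41-9ac9-237b43f8b848",
    "event_id": "UKN_ERR",
    "message_level": "ERROR",
    "user_message": "Volume create operation failed - unknown error.",
    "resource_type": "volume",
    "created_at": "2016-01-29T08:12:19Z",
    "expires_at": "2016-02-28T08:12:19Z",
}
print(validate_message(example))
```
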
+
+Data model impact
+-----------------
+
+A new messages table in the DB will store all messages. This table may
+grow large in a cloud with lots of errors. The admin will be able to use
+the ``expires_at`` column to reap messages.
+
+Security impact
+---------------
+
+Messages must be highly scrutinized before becoming visible to the user in
+order to prevent any sensitive data from being shown. This will be
+mitigated by having all user-visible messages defined in a single module.
+The messaging mechanism will assert that any message it creates comes from
+the sanctioned location.
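+
+One possible way to enforce the "sanctioned location" rule is to accept only
+members of a single enumeration, so that free-form strings (which might carry
+sensitive data) are rejected. This is a sketch of the idea, not the mechanism
+this spec mandates; ``UserMessage`` and ``create_user_message`` are
+hypothetical names.
+

```python
import enum


class UserMessage(enum.Enum):
    """The single place where user-visible texts are defined (hypothetical)."""

    NO_CAPACITY = "Volume create operation failed - not enough capacity."
    UNKNOWN_ERROR = "Operation failed - unknown error."


def create_user_message(event):
    """Build a message, refusing anything not defined in UserMessage."""
    if not isinstance(event, UserMessage):
        raise TypeError("user messages must be members of UserMessage")
    return {"event_id": event.name, "user_message": event.value}


record = create_user_message(UserMessage.NO_CAPACITY)
print(record["event_id"])

try:
    # A raw string, for example text copied from a traceback, is rejected.
    create_user_message("Traceback: password=...")
except TypeError as exc:
    rejected = str(exc)
```
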
+
+Notifications impact
+--------------------
+
+None
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+None
+
+Other deployer impact
+---------------------
+
+* New configuration option ``message_ttl`` that dictates the number of
+  seconds after a message's creation time used to set the ``expires_at``
+  attribute on generated messages.
+
+* New configuration option ``message_reap_interval`` that dictates the
+  number of seconds between calls to delete old messages. With a value of
+  -1 the reaper never runs. DocImpact: This option should not be set on a
+  large number of nodes, since too many nodes attempting this delete at the
+  same time will cause transaction bouncing and degraded DB performance.
+
+* New configuration option ``message_reap_batch_size`` that dictates the number
+  of expired messages to delete each interval. This allows a deployer to
+  limit DB performance impact by setting a ceiling for the number of
+  messages deleted at a time.
+
+* The messages table will be potentially large and may be reaped based on
+  the ``expires_at`` column: all messages with an ``expires_at`` date
+  earlier than the current time can be safely deleted.
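+
+The batch-limited reaping could work roughly as in this sketch, which uses an
+in-memory SQLite database purely for illustration. The table layout, the use
+of integer timestamps, and the ``reap_expired`` helper are assumptions; the
+real implementation would go through Cinder's DB layer.
+

```python
import sqlite3


def reap_expired(conn, now, batch_size):
    """Delete at most batch_size messages whose expires_at is before now."""
    cursor = conn.execute(
        "DELETE FROM messages WHERE id IN ("
        " SELECT id FROM messages WHERE expires_at < ?"
        " ORDER BY expires_at LIMIT ?)",
        (now, batch_size),
    )
    conn.commit()
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id TEXT PRIMARY KEY, expires_at INTEGER)")
# Five messages expiring at times 100..104 (arbitrary integer timestamps).
conn.executemany(
    "INSERT INTO messages VALUES (?, ?)",
    [("m%d" % i, 100 + i) for i in range(5)],
)
conn.commit()

# At time 103 three messages have expired; each pass removes at most two.
first_pass = reap_expired(conn, now=103, batch_size=2)
second_pass = reap_expired(conn, now=103, batch_size=2)
remaining = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
print(first_pass, second_pass, remaining)
```

+
+Each pass corresponds to one ``message_reap_interval`` tick, and the
+``LIMIT`` plays the role of ``message_reap_batch_size``.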
+
+Developer impact
+----------------
+
+Developers should be aware of use-cases where the user needs information
+about an error. In these situations, an appropriate user message should be
+written and creation of the message added in the specific code path(s).
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  Sheel Rana
+  Alex Meade
+
+Work Items
+----------
+
+This whole implementation depends upon message generation, transport,
+collection, storage and analytics of different failure messages.
+
+* cinder:
+  Implementation to generate notification messages at the time of
+  failure for all existing operations.
+
+* cinder:
+  A notification listener is required, which will serve as the
+  basis for handling event messages from different components.
+
+* cinder:
+  A collector is required to collect, validate and store event
+  messages in the database.
+
+* cinder:
+  New API to fetch details from the database depending upon filters.
+
+* cinder:
+  Add pagination to messages
+
+* cinder-manage:
+  Add mechanism to automatically, and via a cinder-manage command,
+  reap expired messages in the db depending upon ttl value.
+
+* cinder:
+  Documentation for new API details.
+
+* cinder:
+  Update "Getting started Guide".
+
+* cinder:
+  Database schema preparation to store notification messages.
+
+* cinder:
+  Implement deletion of messages from the database after their
+  expiry time, according to the configured message lifetime.
+  For example, if the user has set ``message_ttl`` to 7 days, then all
+  messages older than 7 days will be purged from the database.
+
+* horizon:
+  Separate tab for cinder to display event messages.
+
+* cinder-client:
+  cinder cli to communicate with API and fetch event messages.
+
+* cinder-client:
+  Update to CLI reference Guide.
+
+* Tempest tests
+
+
+Implementation Phases:
+----------------------
+This whole feature will be implemented in multiple phases:
+
+Phase 1. Basic implementation of notification generation and storage
+in the database, with "/messages" exposed to view notification messages.
+This spec targets Phase 1 first; the other phases will be implemented after
+acceptance of Phase 1.
+
+Phase 2. Allow the admin to configure the notification
+storage: DB, Zaqar, or both.
+If both are configured by the admin, the notification message will be
+stored in Zaqar as well as in the database.
+
+Phase 3. Implementation for consuming information from Zaqar directly.
+
+Phase 4. Horizon and CLI implementations to view notifications in a more
+formatted manner.
+
+Phase 5. Handling of special cases where generation of notifications
+requires separate handling, such as RabbitMQ-related implementations for
+showing notifications when RabbitMQ is in a failed state or the RabbitMQ
+recipient is inactive.
+
+Dependencies
+============
+None
+
+
+Testing
+=======
+
+Tempest tests should be written and run in the gate. It may prove difficult
+to implement complete functional testing of the feature, as messages will not
+be created unless there is an error, which may be difficult to trigger.
+However, with unlimited quotas some failures are easy to trigger.
+One example is creating a thick provisioned volume too big to be stored on the
+backend.
+
+Example Test Cases
+------------------
+
+#. List messages with no messages
+#. Attempt creation of a TOO LARGE volume and verify appropriate scheduling
+   error message is created
+#. List messages with filters, especially resource_type
+
+Documentation Impact
+====================
+
+* REST API documentation
+* New config option, ``message_ttl`` (time to live)
+* New config option, ``message_reap_interval``
+  (number of seconds between calls to delete old messages)
+* New config option, ``message_reap_batch_size``
+  (number of messages which could be deleted in one batch)
+* New API policies for messages
+
+References
+==========
+
+Mitaka Midcycle discussion
+ https://etherpad.openstack.org/p/mitaka-cinder-midcycle-user-notifications
+ https://etherpad.openstack.org/p/mitaka-cinder-midcycle-day-1
+
+Kilo Summit Discussion
+ https://etherpad.openstack.org/p/kilo-cinder-async-reporting
+
+Liberty Summit Discussion (in conjunction with HEAT) -
+ https://etherpad.openstack.org/p/liberty-cross-project-user-notifications