Add backup & restore guide

Story: 2006770
Task: 38487
Closes-Bug: 1871065
Change-Id: I9de2b9a2d925f34291f67fb0ab1c8d46945995d6
Signed-off-by: MCamp859 <maryx.camp@intel.com>

doc/source/developer_resources/backup_restore.rst (new file, 518 lines)

@@ -0,0 +1,518 @@
==================
Backup and Restore
==================

This guide describes the StarlingX backup and restore functionality.

.. contents::
   :local:
   :depth: 2

--------
Overview
--------

This feature provides a last resort disaster recovery option for situations
where the StarlingX software and/or data are compromised. The provided backup
utility creates a deployment state snapshot, which can be used to restore the
deployment to a previously known good working state.

There are two main options for backup and restore:

* Platform restore, where the platform data is re-initialized but the
  applications, including OpenStack if previously installed, are preserved.
  During this process, you can choose to keep the Ceph cluster (default
  option: ``wipe_ceph_osds=false``) or to wipe it and restore Ceph data from
  off-box copies (``wipe_ceph_osds=true``).

* OpenStack application backup and restore, where only the OpenStack
  application is restored. This scenario deletes the OpenStack application,
  re-applies it, and restores data from off-box copies (Glance images, Ceph
  volumes, and the database).

This guide describes both restore options, including the backup procedure.

.. note::

   * Ceph application data is **not** backed up. It is preserved by the
     restore process by default (``wipe_ceph_osds=false``), but it is not
     restored if ``wipe_ceph_osds=true`` is used. You can protect against
     Ceph cluster failures by keeping off-box custom backups.

   * During restore, images for applications that are integrated with
     StarlingX are automatically downloaded to the local registry from
     external sources. If your system has custom Kubernetes pods that use the
     local registry and are **not** integrated with StarlingX, confirm after
     the restore that the correct images are present so that the applications
     can restart automatically.
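
As a quick spot-check after a restore, you can list the pods and inspect one of
your custom pods to confirm that its image was pulled successfully. This is a
minimal sketch, assuming kubectl access on controller-0; the ``custom-apps``
namespace and pod name are placeholders:

::

  # List all pods and their status; custom pods stuck in ImagePullBackOff
  # indicate a missing image in the local registry.
  kubectl get pods --all-namespaces

  # Inspect a specific pod (hypothetical name) to see which image it uses
  # and any image pull errors reported in its events.
  kubectl describe pod <pod-name> -n custom-apps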

----------
Backing up
----------

There are two methods for backing up: the local play method and the remote
play method.

~~~~~~~~~~~~~~~~~
Local play method
~~~~~~~~~~~~~~~~~

Run the following command:

::

  ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml -e "ansible_become_pass=<sysadmin password> admin_password=<sysadmin password>"

The ``<admin_password>`` and ``<ansible_become_pass>`` variables must be set
correctly by one of the following methods:

* The ``-e`` option on the command line
* An override file
* The Ansible secret file
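
For example, the override-file approach can look like the following. This is a
minimal sketch; the file name ``backup-overrides.yml`` is only an illustration:

::

  # backup-overrides.yml (hypothetical file name)
  admin_password: <sysadmin password>
  ansible_become_pass: <sysadmin password>

Then pass the file to the playbook with the ``-e @<file>`` syntax:

::

  ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml -e @backup-overrides.yml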

The output of the command is a file named in this format:
``<inventory_hostname>_platform_backup_<timestamp>.tgz``

The prefixes ``<platform_backup_filename_prefix>`` and
``<openstack_backup_filename_prefix>`` can be overridden via the ``-e`` option
on the command line or an override file.

The generated backup tar files will look like this:
``localhost_platform_backup_2019_08_08_15_25_36.tgz`` and
``localhost_openstack_backup_2019_08_08_15_25_36.tgz``. They are located in
the ``/opt/backups`` directory on controller-0.
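
For safekeeping, copy the generated tar files off the controller as soon as the
backup completes. This is a minimal sketch, assuming the example file names
above and a hypothetical off-box destination host ``backup-server``:

::

  # List the backups produced on controller-0.
  ls -lh /opt/backups/

  # Copy them to an off-box location (hypothetical destination).
  scp /opt/backups/localhost_platform_backup_2019_08_08_15_25_36.tgz \
      /opt/backups/localhost_openstack_backup_2019_08_08_15_25_36.tgz \
      sysadmin@backup-server:/srv/starlingx-backups/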

~~~~~~~~~~~~~~~~~~
Remote play method
~~~~~~~~~~~~~~~~~~

#. Log in to the host where Ansible is installed and clone the playbook code
   from OpenDev at https://opendev.org/starlingx/ansible-playbooks.git

#. Provide an inventory file, either a customized one that is specified via
   the ``-i`` option or the default one that resides in the Ansible
   configuration directory (``/etc/ansible/hosts``). You must specify the IP
   of the controller host. For example, if the host name is ``my_vbox``, the
   inventory file should have an entry called ``my_vbox``, as shown in the
   example below:

   ::

     all:
       hosts:
         wc68:
           ansible_host: 128.222.100.02
         my_vbox:
           ansible_host: 128.224.141.74

#. Run Ansible with the command:

   ::

     ansible-playbook <path-to-backup-playbook-entry-file> --limit host-name -i <inventory-file> -e <optional-extra-vars>

   The generated backup tar files can be found in ``<host_backup_dir>``, which
   is ``$HOME`` by default. It can be overridden by the ``-e`` option on the
   command line or in an override file.

   The generated backup tar files follow the same naming convention as the
   local play method.

   Example:

   ::

     ansible-playbook /localdisk/designer/repo/cgcs-root/stx/stx-ansible-playbooks/playbookconfig/src/playbooks/backup-restore/backup.yml --limit my_vbox -i $HOME/br_test/hosts -e "host_backup_dir=$HOME/br_test ansible_become_pass=Li69nux* admin_password=Li69nux* ansible_ssh_pass=Li69nux*"

~~~~~~~~~~~~~~~~~~~~~~
Backup content details
~~~~~~~~~~~~~~~~~~~~~~

The backup contains the following:

* PostgreSQL config: roles, table spaces, and schemas for the databases

* PostgreSQL data:

  * template1, sysinv, barbican db data, fm db data
  * keystone db for the primary region
  * dcmanager db for the distributed cloud controller
  * dcorch db for the distributed cloud controller

* etcd database

* LDAP db

* Ceph crushmap

* DNS server list

* System Inventory network overrides. These are needed at restore to correctly
  set up the OS configuration:

  * addrpool
  * pxeboot_subnet
  * management_subnet
  * management_start_address
  * cluster_host_subnet
  * cluster_pod_subnet
  * cluster_service_subnet
  * external_oam_subnet
  * external_oam_gateway_address
  * external_oam_floating_address

* Docker registries on the controller

* Docker proxy (See :doc:`../configuration/docker_proxy_config` for details.)

* Backup data:

  * OS configuration

    ok: [localhost] => (item=/etc)

    Note: Although everything in this directory is backed up, not all of the
    content will be restored.

  * Home directory of the ``sysadmin`` user and all LDAP user accounts

    ok: [localhost] => (item=/home)

  * Generated platform configuration

    ok: [localhost] => (item=/opt/platform/config/<SW_VERSION>)

    ok: [localhost] => (item=/opt/platform/puppet/<SW_VERSION>/hieradata)

    All the hieradata in this folder is backed up. However, only the static
    hieradata (static.yaml and secure_static.yaml) will be restored to
    bootstrap controller-0.

  * Keyring

    ok: [localhost] => (item=/opt/platform/.keyring/<SW_VERSION>)

  * Patching and package repositories

    ok: [localhost] => (item=/opt/patching)

    ok: [localhost] => (item=/www/pages/updates)

  * Extension filesystem

    ok: [localhost] => (item=/opt/extension)

  * patch-vault filesystem for the distributed cloud system controller

    ok: [localhost] => (item=/opt/patch-vault)

  * Armada manifests

    ok: [localhost] => (item=/opt/platform/armada/<SW_VERSION>)

  * Helm charts

    ok: [localhost] => (item=/opt/platform/helm_charts)
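
You can spot-check that the items listed above were captured by listing the
contents of the generated archive. This is a minimal sketch, assuming the
example platform backup file name used earlier in this guide:

::

  # List the archived paths without extracting anything.
  tar tzf /opt/backups/localhost_platform_backup_2019_08_08_15_25_36.tgz | less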

---------
Restoring
---------

This section describes the platform restore and OpenStack restore processes.

~~~~~~~~~~~~~~~~
Platform restore
~~~~~~~~~~~~~~~~

In the platform restore process, the etcd and system inventory databases are
preserved by default. You can choose to preserve the Ceph data or to wipe it:

* To preserve the Ceph cluster data, use ``wipe_ceph_osds=false``.

* To start with an empty Ceph cluster, use ``wipe_ceph_osds=true``. After the
  restore procedure is complete and before you restart the applications, you
  must restore the Ceph data from off-box copies.

Steps:

#. Backup: Run the backup.yml playbook, whose output is a platform backup
   tarball. Move the backup tarball outside of the cluster for safekeeping.

#. Restore:

   a. If using ``wipe_ceph_osds=true``, power down all the nodes.

      **Do not** power down storage nodes if using ``wipe_ceph_osds=false``.

      .. important::

         It is mandatory for the storage cluster to remain functional during
         restore when ``wipe_ceph_osds=false``, otherwise data loss will
         occur. Power down storage nodes only when ``wipe_ceph_osds=true``.

   #. Reinstall controller-0.

   #. Run the Ansible restore_platform.yml playbook to restore a full system
      from the platform tarball archive. For this step, similar to the backup
      procedure, there are two options: local play and remote play.

      **Local play**

      i. Download the backup to the controller. You can also use an external
         storage device, for example, a USB drive.

      #. Run the command:

         ::

           ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml -e "initial_backup_dir=<location_of_tarball> ansible_become_pass=<admin_password> admin_password=<admin_password> backup_filename=<backup_filename>"

      **Remote play**

      i. Log in to the host where Ansible is installed and clone the playbook
         code from OpenDev at
         https://opendev.org/starlingx/ansible-playbooks.git

      #. Provide an inventory file, either a customized one that is specified
         via the ``-i`` option or the default one that resides in the Ansible
         configuration directory (``/etc/ansible/hosts``). You must specify
         the IP of the controller host. For example, if the host name is
         ``my_vbox``, the inventory file should have an entry called
         ``my_vbox``, as shown in the example below.

         ::

           all:
             hosts:
               wc68:
                 ansible_host: 128.222.100.02
               my_vbox:
                 ansible_host: 128.224.141.74

      #. Run Ansible:

         ::

           ansible-playbook <path-to-restore-playbook-entry-file> --limit host-name -i <inventory-file> -e <optional-extra-vars>

         Where ``optional-extra-vars`` include:

         * ``wipe_ceph_osds`` is set to either ``wipe_ceph_osds=false``
           (default: keep the Ceph data intact) or ``wipe_ceph_osds=true``
           (start with an empty Ceph cluster).

         * ``<backup_filename>`` is the platform backup tar file. It must be
           provided via the ``-e`` option on the command line. For example,
           ``-e "backup_filename=localhost_platform_backup_2019_07_15_14_46_37.tgz"``

         * ``<initial_backup_dir>`` is the location on the Ansible control
           machine where the platform backup tar file is placed to restore
           the platform. It must be provided via the ``-e`` option on the
           command line.

         * ``<admin_password>``, ``<ansible_become_pass>``, and
           ``<ansible_ssh_pass>`` must be set correctly via the ``-e`` option
           on the command line or in the Ansible secret file.
           ``<ansible_ssh_pass>`` is the password for the sysadmin user on
           controller-0.

         * ``<ansible_remote_tmp>`` should be set to a new directory (no need
           to create it ahead of time) under ``/home/sysadmin`` on
           controller-0 via the ``-e`` option on the command line.

         Example command:

         ::

           ansible-playbook /localdisk/designer/jenkins/tis-stx-dev/cgcs-root/stx/ansible-playbooks/playbookconfig/src/playbooks/restore_platform.yml --limit my_vbox -i $HOME/br_test/hosts -e "ansible_become_pass=Li69nux* admin_password=Li69nux* ansible_ssh_pass=Li69nux* initial_backup_dir=$HOME/br_test backup_filename=my_vbox_system_backup_2019_08_08_15_25_36.tgz ansible_remote_tmp=/home/sysadmin/ansible-restore"

   #. After the playbook has run, perform the following steps based on your
      deployment mode. A sketch of the corresponding ``system`` CLI commands
      follows this procedure.

      **AIO-SX**

      Unlock controller-0 and wait for it to boot.

      **AIO-DX**

      i. Unlock controller-0 and wait for it to boot.

      #. Reinstall controller-1 (boot it from PXE and wait for it to become
         `online`).

      #. Unlock controller-1.

      **Standard (without controller storage)**

      i. Unlock controller-0 and wait for it to boot. After unlock, you will
         see all nodes, including storage nodes, as offline.

      #. Reinstall controller-1 and the compute nodes (boot them from PXE and
         wait for them to become `online`).

      #. Unlock controller-1 and wait for it to become available.

      #. Unlock the compute nodes and wait for them to become available.

      **Standard (with controller storage)**

      i. Unlock controller-0 and wait for it to boot. After unlock, you will
         see all nodes, except storage nodes, as offline. If
         ``wipe_ceph_osds=false`` is used, storage nodes must be powered on
         and in the `available` state throughout the procedure. Otherwise,
         storage nodes must be powered off.

      #. Reinstall controller-1 and the compute nodes (boot them from PXE and
         wait for them to become `online`).

      #. Unlock controller-1 and wait for it to become available.

      #. If ``wipe_ceph_osds=true`` is used, reinstall the storage nodes.

      #. Unlock the compute nodes and wait for them to become available.

      #. (Optional) Reinstall the storage nodes.

   #. Re-apply applications (for example, OpenStack) to force the pods to
      restart.
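
The following is a minimal sketch of the ``system`` CLI commands that
correspond to the unlock, status-check, and re-apply steps above; the host and
application names are the examples used in this guide:

::

  # Check the state of all hosts (hosts transition to 'online', then
  # 'available' after unlock).
  system host-list

  # Unlock a host once it is ready, for example controller-0.
  system host-unlock controller-0

  # Re-apply an application so that its pods restart, for example OpenStack.
  system application-apply stx-openstack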

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
OpenStack application backup and restore
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this procedure, only the OpenStack application will be restored.

Steps:

#. Backup: Run the backup.yml playbook, whose output is a platform backup
   tarball. Move the backup tarball outside of the cluster for safekeeping.

   .. note::

      When OpenStack is running, the backup.yml playbook generates two
      tarballs: a platform backup tarball and an OpenStack backup tarball.

#. Restore:

   a. Delete the old OpenStack application and upload the application again.
      (Note that images and volumes will remain in Ceph.)

      ::

        system application-remove stx-openstack
        system application-delete stx-openstack
        system application-upload stx-openstack-<ver>.tgz

   #. (Optional) If you want to delete the Ceph data, remove the old Glance
      images and Cinder volumes from the Ceph pools (a sketch of this cleanup
      is shown at the end of this section).

   #. Run the restore_openstack.yml Ansible playbook to restore the OpenStack
      tarball.

      If you do not want to manipulate the Ceph data, execute this command:

      ::

        ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'initial_backup_dir=<location_of_backup_filename> ansible_become_pass=<admin_password> admin_password=<admin_password> backup_filename=<backup_filename>'

      For example:

      ::

        ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'initial_backup_dir=/opt/backups ansible_become_pass=Li69nux* admin_password=Li69nux* backup_filename=localhost_openstack_backup_2019_12_13_12_43_17.tgz'

      If you want to restore Glance images and Cinder volumes from external
      storage (the optional step above was executed), or you want to reconcile
      newer data in the Glance and Cinder volume pools with older data, then
      you must execute the following steps:

      * Run the restore_openstack playbook with the
        ``restore_cinder_glance_data`` flag enabled. This step brings up the
        MariaDB services, restores the MariaDB data, and brings up the Cinder
        and Glance services.

        ::

          ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'restore_cinder_glance_data=true initial_backup_dir=<location_of_backup_filename> ansible_become_pass=<admin_password> admin_password=<admin_password> backup_filename=<backup_filename>'

        For example:

        ::

          ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'restore_cinder_glance_data=true ansible_become_pass=Li69nux* admin_password=Li69nux* backup_filename=localhost_openstack_backup_2019_12_13_12_43_17.tgz initial_backup_dir=/opt/backups'

      * Restore Glance images and Cinder volumes using the image-backup.sh
        and tidy_storage_post_restore helper scripts.

        The tidy storage script is used to detect any discrepancy between the
        Cinder/Glance DB and the rbd pools.

        Discrepancies between the Glance images DB and the rbd images pool
        are handled in the following ways:

        * If an image is in the Glance images DB but not in the rbd images
          pool, list the image and suggested actions to take in a log file.

        * If an image is in the rbd images pool but not in the Glance images
          DB, create a Glance image in the Glance images DB to associate with
          the backend data. Also, list the image and suggested actions to
          take in a log file.

        Discrepancies between the Cinder volumes DB and the rbd cinder-volumes
        pool are handled in the following ways:

        * If a volume is in the Cinder volumes DB but not in the rbd
          cinder-volumes pool, set the volume state to "error". Also, list
          the volume and suggested actions to take in a log file.

        * If a volume is in the rbd cinder-volumes pool but not in the Cinder
          volumes DB, remove any snapshot(s) associated with this volume in
          the rbd pool and create a volume in the Cinder volumes DB to
          associate with the backend data. List the volume and suggested
          actions to take in a log file.

        * If a volume is in both the Cinder volumes DB and the rbd
          cinder-volumes pool and it has snapshot(s) in the rbd pool,
          re-create the snapshot in Cinder if it doesn't exist.

        * If a snapshot is in the Cinder DB but not in the rbd pool, it will
          be deleted.

        Usage:

        ::

          tidy_storage_post_restore <log_file>

        The image-backup.sh script is used to back up and restore Glance
        images from the Ceph images pool.

        Usage:

        ::

          image-backup export <uuid>            - export the image with <uuid> into the backup file /opt/backups/image_<uuid>.tgz
          image-backup import image_<uuid>.tgz  - import the image from the backup source file at /opt/backups/image_<uuid>.tgz

   #. To bring up the remaining OpenStack services, run the playbook again
      with ``restore_openstack_continue`` set to true:

      ::

        ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'restore_openstack_continue=true initial_backup_dir=<location_of_backup_filename> ansible_become_pass=<admin_password> admin_password=<admin_password> backup_filename=<backup_filename>'

      For example:

      ::

        ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'restore_openstack_continue=true ansible_become_pass=Li69nux* admin_password=Li69nux* backup_filename=localhost_openstack_backup_2019_12_13_12_43_17.tgz initial_backup_dir=/opt/backups'
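
For the optional Ceph cleanup step referenced earlier in this section, the old
images and volumes can be removed directly from the rbd pools. This is a
minimal sketch, assuming the ``images`` and ``cinder-volumes`` pool names used
above, Ceph client access on the controller, and hypothetical identifiers;
double-check every name before deleting anything:

::

  # List the objects currently stored in the Glance and Cinder pools.
  rbd -p images ls
  rbd -p cinder-volumes ls

  # Remove an individual image or volume (hypothetical identifiers).
  rbd -p images rm <image-uuid>
  rbd -p cinder-volumes rm volume-<volume-uuid>

After the restore completes, and assuming the OpenStack CLI clients are
configured, you can confirm that the recovered images and volumes are visible
to OpenStack:

::

  openstack image list
  openstack volume list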

@@ -17,6 +17,7 @@ Developer Resources

   build_docker_image
   move_to_new_openstack_version_in_starlingx
   mirror_repo
   backup_restore
   Project Specifications <https://docs.starlingx.io/specs/>
   stx_ipv6_deployment
   stx_tsn_in_kata