diff --git a/doc/source/developer_resources/backup_restore.rst b/doc/source/developer_resources/backup_restore.rst
new file mode 100644
index 000000000..01cb69036
--- /dev/null
+++ b/doc/source/developer_resources/backup_restore.rst
@@ -0,0 +1,518 @@
==================
Backup and Restore
==================

This guide describes the StarlingX backup and restore functionality.


.. contents::
   :local:
   :depth: 2

--------
Overview
--------

This feature provides a last-resort disaster recovery option for situations
where the StarlingX software and/or data are compromised. The provided backup
utility creates a snapshot of the deployment state, which can be used to
restore the deployment to a previously known good working state.

There are two main options for backup and restore:

* Platform restore, where the platform data is re-initialized but the
  applications are preserved, including OpenStack if it was previously
  installed. During this process, you can choose to keep the Ceph cluster
  (the default, ``wipe_ceph_osds=false``) or to wipe it and restore the Ceph
  data from off-box copies (``wipe_ceph_osds=true``).

* OpenStack application backup and restore, where only the OpenStack
  application is restored. This scenario deletes the OpenStack application,
  re-applies it, and restores data from off-box copies (Glance images, Ceph
  volumes, and the database).

This guide describes both restore options, including the backup procedure.

.. note::

   * Ceph application data is **not** backed up. It is preserved by the
     restore process by default (``wipe_ceph_osds=false``), but it is not
     restored if ``wipe_ceph_osds=true`` is used. You can protect against
     Ceph cluster failures by keeping off-box custom backups.

   * During restore, images for applications that are integrated with
     StarlingX are automatically downloaded to the local registry from
     external sources. If your system has custom Kubernetes pods that use the
     local registry and are **not** integrated with StarlingX, confirm after
     the restore that the correct images are present so those applications
     can restart automatically.

----------
Backing up
----------

There are two methods for backing up: the local play method and the remote
play method.

~~~~~~~~~~~~~~~~~
Local play method
~~~~~~~~~~~~~~~~~

Run the following command:

::

   ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml -e "ansible_become_pass=<sysadmin password> admin_password=<admin password>"

The ``admin_password`` and ``ansible_become_pass`` must be set correctly by
one of the following methods:

* The ``-e`` option on the command line
* An override file
* The Ansible secret file

The output of the command is a file named in this format:
``<inventory_hostname>_platform_backup_<timestamp>.tgz``

The filename prefixes (``platform_backup_filename_prefix`` and
``openstack_backup_filename_prefix``) can be overridden via the ``-e`` option
on the command line or an override file.

The generated backup tar files look like this:
``localhost_platform_backup_2019_08_08_15_25_36.tgz`` and
``localhost_openstack_backup_2019_08_08_15_25_36.tgz``. They are located in
the ``/opt/backups`` directory on controller-0.
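The tarballs under ``/opt/backups`` should be copied off-box as soon as they
are generated, since a later failure of the controller would also take the
local copies with it. The following is a minimal sketch run from a remote
workstation; the OAM floating IP and the destination directory are placeholder
values and are not produced by the playbook:

::

   # Copy the generated tarballs off controller-0 (adjust the IP and destination).
   scp sysadmin@<oam-floating-ip>:/opt/backups/*_backup_*.tgz /srv/stx-backups/

   # Record checksums so the archives can be verified before a future restore.
   sha256sum /srv/stx-backups/*_backup_*.tgz > /srv/stx-backups/SHA256SUMS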
~~~~~~~~~~~~~~~~~~
Remote play method
~~~~~~~~~~~~~~~~~~

#. Log in to the host where Ansible is installed and clone the playbook code
   from OpenDev at https://opendev.org/starlingx/ansible-playbooks.git

#. Provide an inventory file, either a customized one that is specified via
   the ``-i`` option or the default one that resides in the Ansible
   configuration directory (``/etc/ansible/hosts``). You must specify the IP
   of the controller host. For example, if the host name is ``my_vbox``, the
   inventory file should have an entry called ``my_vbox`` as shown in the
   example below:

   ::

       all:
         hosts:
           wc68:
             ansible_host: 128.222.100.02
           my_vbox:
             ansible_host: 128.224.141.74

#. Run Ansible with the command:

   ::

       ansible-playbook <path-to-backup-playbook-entry-file> --limit host-name -i <inventory-file> -e <optional-extra-vars>

   The generated backup tar files can be found in ``<host_backup_dir>``, which
   is ``$HOME`` by default. It can be overridden by the ``-e`` option on the
   command line or in an override file.

   The generated backup tar file has the same naming convention as the local
   play method.

Example:

::

   ansible-playbook /localdisk/designer/repo/cgcs-root/stx/stx-ansible-playbooks/playbookconfig/src/playbooks/backup-restore/backup.yml --limit my_vbox -i $HOME/br_test/hosts -e "host_backup_dir=$HOME/br_test ansible_become_pass=Li69nux* admin_password=Li69nux* ansible_ssh_pass=Li69nux*"


~~~~~~~~~~~~~~~~~~~~~~
Backup content details
~~~~~~~~~~~~~~~~~~~~~~

The backup contains the following:

* PostgreSQL configuration: roles, table spaces, and schemas for the databases

* PostgreSQL data:

  * template1, sysinv, barbican, and fm database data

  * keystone database for the primary region

  * dcmanager database for a distributed cloud controller

  * dcorch database for a distributed cloud controller

* etcd database

* LDAP database

* Ceph crushmap

* DNS server list

* System inventory network overrides. These are needed at restore time to
  correctly set up the OS configuration:

  * addrpool

  * pxeboot_subnet

  * management_subnet

  * management_start_address

  * cluster_host_subnet

  * cluster_pod_subnet

  * cluster_service_subnet

  * external_oam_subnet

  * external_oam_gateway_address

  * external_oam_floating_address

* Docker registries on the controller

* Docker proxy (see :doc:`../configuration/docker_proxy_config` for details)

* Backup data:

  * OS configuration

    ok: [localhost] => (item=/etc)

    Note: although everything under ``/etc`` is backed up, not all of the
    content will be restored.

  * Home directory of the ``sysadmin`` user and of all LDAP user accounts

    ok: [localhost] => (item=/home)

  * Generated platform configuration

    ok: [localhost] => (item=/opt/platform/config/)

    ok: [localhost] => (item=/opt/platform/puppet/<version>/hieradata)

    All the hieradata in this directory is backed up. However, only the
    static hieradata (``static.yaml`` and ``secure_static.yaml``) will be
    restored to bootstrap controller-0.

  * Keyring

    ok: [localhost] => (item=/opt/platform/.keyring/)

  * Patching and package repositories

    ok: [localhost] => (item=/opt/patching)

    ok: [localhost] => (item=/www/pages/updates)

  * Extension filesystem

    ok: [localhost] => (item=/opt/extension)

  * Patch-vault filesystem for the distributed cloud system controller

    ok: [localhost] => (item=/opt/patch-vault)

  * Armada manifests

    ok: [localhost] => (item=/opt/platform/armada/)

  * Helm charts

    ok: [localhost] => (item=/opt/platform/helm_charts)
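Before relying on an archive for disaster recovery, it can be useful to
confirm that the content listed above is actually present in it. A minimal
sketch using standard ``tar`` options; the filename is the example one from
the backup section, so substitute your own archive name:

::

   # List everything captured in the platform backup archive without extracting it.
   tar -tzf localhost_platform_backup_2019_08_08_15_25_36.tgz | less

   # Or summarize just the top-level directories that were captured.
   tar -tzf localhost_platform_backup_2019_08_08_15_25_36.tgz | cut -d/ -f1 | sort -u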
---------
Restoring
---------

This section describes the platform restore and OpenStack restore processes.

~~~~~~~~~~~~~~~~
Platform restore
~~~~~~~~~~~~~~~~

In the platform restore process, the etcd and system inventory databases are
preserved by default. You can choose to preserve the Ceph data or to wipe it.

* To preserve the Ceph cluster data, use ``wipe_ceph_osds=false``.

* To start with an empty Ceph cluster, use ``wipe_ceph_osds=true``. After the
  restore procedure is complete, and before you restart the applications, you
  must restore the Ceph data from off-box copies.

Steps:

#. Backup: Run the backup.yml playbook, whose output is a platform backup
   tarball. Move the backup tarball outside of the cluster for safekeeping.

#. Restore:

   a. If using ``wipe_ceph_osds=true``, power down all the nodes.

      **Do not** power down storage nodes if using ``wipe_ceph_osds=false``.

      .. important::

         The storage cluster must remain functional during the restore when
         ``wipe_ceph_osds=false``; otherwise data loss will occur. Power down
         the storage nodes only when ``wipe_ceph_osds=true``.

   #. Reinstall controller-0.

   #. Run the Ansible restore_platform.yml playbook to restore a full system
      from the platform tarball archive. For this step, as with the backup
      procedure, there are two options: local play and remote play.

      **Local play**

      i. Download the backup to the controller. You can also use an external
         storage device, for example, a USB drive.

      #. Run the command:

         ::

            ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml -e "initial_backup_dir=<location_of_tarball> ansible_become_pass=<sysadmin password> admin_password=<admin password> backup_filename=<backup_filename>"

      **Remote play**

      i. Log in to the host where Ansible is installed and clone the playbook
         code from OpenDev at
         https://opendev.org/starlingx/ansible-playbooks.git

      #. Provide an inventory file, either a customized one that is specified
         via the ``-i`` option or the default one that resides in the Ansible
         configuration directory (``/etc/ansible/hosts``). You must specify
         the IP of the controller host. For example, if the host name is
         ``my_vbox``, the inventory file should have an entry called
         ``my_vbox`` as shown in the example below.

         ::

             all:
               hosts:
                 wc68:
                   ansible_host: 128.222.100.02
                 my_vbox:
                   ansible_host: 128.224.141.74

      #. Run Ansible:

         ::

             ansible-playbook <path-to-restore-platform-playbook-entry-file> --limit host-name -i <inventory-file> -e <optional-extra-vars>

         Where ``optional-extra-vars`` include:

         * ``wipe_ceph_osds`` is set to either ``wipe_ceph_osds=false``
           (default: keep the Ceph data intact) or ``wipe_ceph_osds=true``
           (start with an empty Ceph cluster).

         * ``backup_filename`` is the platform backup tar file. It must be
           provided via the ``-e`` option on the command line. For example,
           ``-e "backup_filename=localhost_platform_backup_2019_07_15_14_46_37.tgz"``

         * ``initial_backup_dir`` is the location on the Ansible control
           machine where the platform backup tar file is placed to restore
           the platform. It must be provided via the ``-e`` option on the
           command line.

         * ``admin_password``, ``ansible_become_pass``, and
           ``ansible_ssh_pass`` must be set correctly via the ``-e`` option
           on the command line or in the Ansible secret file.
           ``ansible_ssh_pass`` is the password for the sysadmin user on
           controller-0.

         * ``ansible_remote_tmp`` should be set to a new directory (no need
           to create it ahead of time) under ``/home/sysadmin`` on
           controller-0 via the ``-e`` option on the command line.

         Example command:

         ::

             ansible-playbook /localdisk/designer/jenkins/tis-stx-dev/cgcs-root/stx/ansible-playbooks/playbookconfig/src/playbooks/restore_platform.yml --limit my_vbox -i $HOME/br_test/hosts -e "ansible_become_pass=Li69nux* admin_password=Li69nux* ansible_ssh_pass=Li69nux* initial_backup_dir=$HOME/br_test backup_filename=my_vbox_system_backup_2019_08_08_15_25_36.tgz ansible_remote_tmp=/home/sysadmin/ansible-restore"

   #. After Ansible has run, perform the following steps based on your
      deployment mode:

      **AIO-SX**

      Unlock controller-0 and wait for it to boot.

      **AIO-DX**

      i. Unlock controller-0 and wait for it to boot.

      #. Reinstall controller-1 (boot it from PXE and wait for it to become
         ``online``).

      #. Unlock controller-1.

      **Standard (without storage nodes)**

      i. Unlock controller-0 and wait for it to boot. After the unlock, you
         will see all the other nodes as offline.

      #. Reinstall controller-1 and the compute nodes (boot them from PXE and
         wait for them to become ``online``).

      #. Unlock controller-1 and wait for it to become available.

      #. Unlock the compute nodes and wait for them to become available.

      **Standard (with storage nodes)**

      i. Unlock controller-0 and wait for it to boot. After the unlock, you
         will see all nodes, except the storage nodes, as offline. If
         ``wipe_ceph_osds=false`` is used, the storage nodes must be powered
         on and in the ``available`` state throughout the procedure.
         Otherwise, the storage nodes must be powered off.

      #. Reinstall controller-1 and the compute nodes (boot them from PXE and
         wait for them to become ``online``).

      #. Unlock controller-1 and wait for it to become available.

      #. If ``wipe_ceph_osds=true`` is used, reinstall the storage nodes.

      #. Unlock the compute nodes and wait for them to become available.

      #. (Optional) Reinstall the storage nodes.

      #. Re-apply applications (e.g. OpenStack) to force their pods to
         restart, as shown in the sketch below.
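The unlock, monitoring, and re-apply steps above map onto the standard
StarlingX ``system`` CLI. The following is a minimal sketch of those final
commands on controller-0; the application name assumes stx-openstack was
previously applied, so adjust it to whatever your deployment runs:

::

   source /etc/platform/openrc

   # Unlock a host and watch it come back; repeat host-list until it reports "available".
   system host-unlock controller-0
   system host-list

   # Once the hosts are available, re-apply applications so that their pods restart.
   system application-list
   system application-apply stx-openstack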
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
OpenStack application backup and restore
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this procedure, only the OpenStack application is restored.

Steps:

#. Backup: Run the backup.yml playbook, whose output is a platform backup
   tarball. Move the backup tarball outside of the cluster for safekeeping.

   .. note::

      When OpenStack is running, the backup.yml playbook generates two
      tarballs: a platform backup tarball and an OpenStack backup tarball.

#. Restore:

   a. Delete the old OpenStack application and upload the application again.
      (Note that images and volumes will remain in Ceph.)

      ::

         system application-remove stx-openstack
         system application-delete stx-openstack
         system application-upload stx-openstack-<version>.tgz

   #. (Optional) If you want to delete the Ceph data, remove the old Glance
      images and Cinder volumes from the Ceph pools.

   #. Run the restore_openstack.yml Ansible playbook to restore the OpenStack
      tarball.

      If you do not want to manipulate the Ceph data, execute this command:

      ::

         ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'initial_backup_dir=<location_of_tarball> ansible_become_pass=<sysadmin password> admin_password=<admin password> backup_filename=<backup_filename>'

      For example:

      ::

         ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'initial_backup_dir=/opt/backups ansible_become_pass=Li69nux* admin_password=Li69nux* backup_filename=localhost_openstack_backup_2019_12_13_12_43_17.tgz'

      If you want to restore Glance images and Cinder volumes from external
      storage (that is, the optional step above was executed), or you want to
      reconcile newer data in the Glance and Cinder volume pools with older
      backup data, then execute the following steps instead:

      * Run the restore_openstack playbook with the
        ``restore_cinder_glance_data`` flag enabled. This step brings up the
        MariaDB services, restores the MariaDB data, and brings up the Cinder
        and Glance services.

        ::

           ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'restore_cinder_glance_data=true initial_backup_dir=<location_of_tarball> ansible_become_pass=<sysadmin password> admin_password=<admin password> backup_filename=<backup_filename>'

        For example:

        ::

           ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'restore_cinder_glance_data=true ansible_become_pass=Li69nux* admin_password=Li69nux* backup_filename=localhost_openstack_backup_2019_12_13_12_43_17.tgz initial_backup_dir=/opt/backups'

      * Restore Glance images and Cinder volumes using the image-backup.sh
        and tidy_storage_post_restore helper scripts.

        The tidy storage script is used to detect any discrepancy between the
        Cinder/Glance DB and the rbd pools.

        Discrepancies between the Glance images DB and the rbd images pool
        are handled in the following ways:

        * If an image is in the Glance images DB but not in the rbd images
          pool, list the image and the suggested actions to take in a log
          file.

        * If an image is in the rbd images pool but not in the Glance images
          DB, create a Glance image in the Glance images DB to associate with
          the backend data. Also, list the image and the suggested actions to
          take in a log file.

        Discrepancies between the Cinder volumes DB and the rbd
        cinder-volumes pool are handled in the following ways:

        * If a volume is in the Cinder volumes DB but not in the rbd
          cinder-volumes pool, set the volume state to "error". Also, list
          the volume and the suggested actions to take in a log file.

        * If a volume is in the rbd cinder-volumes pool but not in the Cinder
          volumes DB, remove any snapshots associated with this volume in the
          rbd pool and create a volume in the Cinder volumes DB to associate
          with the backend data. List the volume and the suggested actions to
          take in a log file.

        * If a volume is in both the Cinder volumes DB and the rbd
          cinder-volumes pool and it has snapshots in the rbd pool, re-create
          the snapshot in Cinder if it doesn't exist.

        * If a snapshot is in the Cinder DB but not in the rbd pool, it will
          be deleted.

        Usage:

        ::

           tidy_storage_post_restore <log_file>

        The image-backup.sh script is used to back up and restore Glance
        images from the Ceph images pool.

        Usage:

        ::

           image-backup export <uuid>           - export the image with <uuid> into the backup file /opt/backups/image_<uuid>.tgz

           image-backup import image_<uuid>.tgz - import the image from the backup source file at /opt/backups/image_<uuid>.tgz

   #. To bring up the remaining OpenStack services, run the playbook again
      with ``restore_openstack_continue`` set to true:

      ::

         ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'restore_openstack_continue=true initial_backup_dir=<location_of_tarball> ansible_become_pass=<sysadmin password> admin_password=<admin password> backup_filename=<backup_filename>'

      For example:

      ::

         ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'restore_openstack_continue=true ansible_become_pass=Li69nux* admin_password=Li69nux* backup_filename=localhost_openstack_backup_2019_12_13_12_43_17.tgz initial_backup_dir=/opt/backups'
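After the final pass completes, it is worth confirming that the application is
back in the applied state and that its pods restarted cleanly. A short check
along these lines; the ``openstack`` namespace is where the stx-openstack pods
normally run, and the OpenStack CLI commands assume client credentials are
already configured:

::

   # Confirm the application returned to the "applied" status.
   system application-list

   # Check that the OpenStack pods have restarted and are Running.
   kubectl -n openstack get pods

   # If OpenStack CLI credentials are configured, spot-check the restored resources.
   openstack image list
   openstack volume list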
diff --git a/doc/source/developer_resources/index.rst b/doc/source/developer_resources/index.rst
index 525a28007..563df892a 100644
--- a/doc/source/developer_resources/index.rst
+++ b/doc/source/developer_resources/index.rst
@@ -17,6 +17,7 @@ Developer Resources
    build_docker_image
    move_to_new_openstack_version_in_starlingx
    mirror_repo
+   backup_restore
    Project Specifications
    stx_ipv6_deployment
    stx_tsn_in_kata