==================
Backup and Restore
==================

This guide describes the StarlingX backup and restore functionality.

.. contents::
   :local:
   :depth: 2

--------
Overview
--------

This feature provides a last-resort disaster recovery option for situations
where the StarlingX software and/or data are compromised. The provided backup
utility creates a deployment state snapshot, which can be used to restore the
deployment to a previously known good working state.

There are two main options for backup and restore:

* Platform restore, where the platform data is re-initialized but the
  applications, including OpenStack if previously installed, are preserved.
  During this process, you can choose to keep the Ceph cluster (default
  option: ``wipe_ceph_osds=false``) or to wipe it and restore Ceph data from
  off-box copies (``wipe_ceph_osds=true``).

* OpenStack application backup and restore, where only the OpenStack
  application is restored. This scenario deletes the OpenStack application,
  re-applies it, and restores data from off-box copies (Glance images, Ceph
  volumes, database).

This guide describes both restore options, including the backup procedure.

.. note::

   * Ceph application data is **not** backed up. It is preserved by the
     restore process by default (``wipe_ceph_osds=false``), but it is not
     restored if ``wipe_ceph_osds=true`` is used. You can protect against
     Ceph cluster failures by using off-box custom backups.

   * During restore, images for applications that are integrated with
     StarlingX are automatically downloaded to the local registry from
     external sources. If your system has custom Kubernetes pods that use the
     local registry and are **not** integrated with StarlingX, confirm after
     the restore that the correct images are present so the applications can
     restart automatically (see the example below).
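
For such custom pods, one way to confirm that the expected images are present
after the restore is to query the local registry directly. The commands below
are a minimal sketch: they assume the default local registry address
``registry.local:9001``, admin credentials, and the standard Docker Registry
v2 API, and the repository name is purely illustrative.

::

    # List the repositories currently stored in the local registry.
    curl -u admin:<admin_password> https://registry.local:9001/v2/_catalog

    # List the tags available for one repository (repository name is illustrative).
    curl -u admin:<admin_password> https://registry.local:9001/v2/<my-custom-app>/tags/list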

----------
Backing up
----------

There are two methods for backing up: the local play method and the remote
play method.

~~~~~~~~~~~~~~~~~
Local play method
~~~~~~~~~~~~~~~~~

Run the following command:

::

    ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml -e "ansible_become_pass=<sysadmin password> admin_password=<sysadmin password>"

The ``admin_password`` and ``ansible_become_pass`` variables must be set
correctly by one of the following methods:

* The ``-e`` option on the command line
* An override file (see the sketch below)
* The Ansible secret file
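
For example, the passwords can be supplied through an Ansible override file
instead of on the command line. This is a minimal sketch; the file name and
path are illustrative, not part of the standard procedure:

::

    # Hypothetical overrides file, e.g. $HOME/backup-overrides.yml
    admin_password: <sysadmin password>
    ansible_become_pass: <sysadmin password>

Then pass the file to the playbook using the standard Ansible ``-e @<file>``
syntax:

::

    ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml -e @$HOME/backup-overrides.yml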

If you deployed the system with the Rook storage backend instead of Ceph, you
must add the ``rook_enabled=true`` variable:

::

    ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml -e "ansible_become_pass=<sysadmin password> admin_password=<sysadmin password> rook_enabled=true"

The output of the command is a file named in this format:
``<inventory_hostname>_platform_backup_<timestamp>.tgz``

The prefixes ``<platform_backup_filename_prefix>`` and
``<openstack_backup_filename_prefix>`` can be overridden via the ``-e`` option
on the command line or an override file.

The generated backup tar files will look like this:
``localhost_platform_backup_2019_08_08_15_25_36.tgz`` and
``localhost_openstack_backup_2019_08_08_15_25_36.tgz``. They are located in
the ``/opt/backups`` directory on controller-0.
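
Because the restore procedure reinstalls controller-0, copy the generated
tarballs to a location outside the cluster as soon as the backup completes. A
minimal sketch using ``scp`` is shown below; the destination host and path are
illustrative:

::

    scp /opt/backups/localhost_platform_backup_2019_08_08_15_25_36.tgz <user>@<backup-server>:<safe-location>/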

~~~~~~~~~~~~~~~~~~
Remote play method
~~~~~~~~~~~~~~~~~~

#. Log in to the host where Ansible is installed and clone the playbook code
   from OpenDev at https://opendev.org/starlingx/ansible-playbooks.git

#. Provide an inventory file, either a customized one that is specified via
   the ``-i`` option or the default one that resides in the Ansible
   configuration directory (``/etc/ansible/hosts``). You must specify the IP
   of the controller host. For example, if the host name is ``my_vbox``, the
   inventory file should have an entry called ``my_vbox`` as shown in the
   example below:

   ::

       all:
         hosts:
           wc68:
             ansible_host: 128.222.100.02
           my_vbox:
             ansible_host: 128.224.141.74

#. Run Ansible with the command:

   ::

       ansible-playbook <path-to-backup-playbook-entry-file> --limit host-name -i <inventory-file> -e <optional-extra-vars>

   The generated backup tar files can be found in ``<host_backup_dir>``,
   which is ``$HOME`` by default. It can be overridden by the ``-e`` option
   on the command line or in an override file.

   The generated backup tar file has the same naming convention as the local
   play method.

   Example:

   ::

       ansible-playbook /localdisk/designer/repo/cgcs-root/stx/stx-ansible-playbooks/playbookconfig/src/playbooks/backup-restore/backup.yml --limit my_vbox -i $HOME/br_test/hosts -e "host_backup_dir=$HOME/br_test ansible_become_pass=Li69nux* admin_password=Li69nux* ansible_ssh_pass=Li69nux*"

#. If you deployed the system with the Rook storage backend instead of Ceph,
   you must add the ``rook_enabled=true`` variable.

~~~~~~~~~~~~~~~~~~~~~~
Backup content details
~~~~~~~~~~~~~~~~~~~~~~

The backup contains the following:

* PostgreSQL config: roles, table spaces and schemas for databases

* PostgreSQL data:

  * template1, sysinv, barbican db data, fm db data
  * keystone db for the primary region
  * dcmanager db for the distributed cloud controller
  * dcorch db for the distributed cloud controller

* etcd database

* LDAP db

* Ceph crushmap

* DNS server list

* System Inventory network overrides. These are needed at restore time to
  correctly set up the OS configuration:

  * addrpool
  * pxeboot_subnet
  * management_subnet
  * management_start_address
  * cluster_host_subnet
  * cluster_pod_subnet
  * cluster_service_subnet
  * external_oam_subnet
  * external_oam_gateway_address
  * external_oam_floating_address

* Docker registries on controller

* Docker proxy (See :ref:`docker_proxy_config` for details.)

* Backup data:

  * OS configuration (``/etc``). Although everything here is backed up, not
    all of the content will be restored.
  * Home directory of the sysadmin user and all LDAP user accounts
    (``/home``)
  * Generated platform configuration
    (``/opt/platform/config/<SW_VERSION>`` and
    ``/opt/platform/puppet/<SW_VERSION>/hieradata``). All the hieradata in
    this folder is backed up; however, only the static hieradata
    (``static.yaml`` and ``secure_static.yaml``) will be restored to
    bootstrap controller-0.
  * Keyring (``/opt/platform/.keyring/<SW_VERSION>``)
  * Patching and package repositories (``/opt/patching`` and
    ``/var/www/pages/updates``)
  * Extension filesystem (``/opt/extension``)
  * Patch-vault filesystem for the distributed cloud system controller
    (``/opt/patch-vault``)
  * FluxCD manifests (``/opt/platform/armada/<SW_VERSION>``)
  * Helm charts (``/opt/platform/helm_charts``)
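
To confirm what a particular backup archive contains before attempting a
restore, you can list its contents without extracting it. This is a plain
``tar`` invocation, shown only as an illustration:

::

    tar -tzvf localhost_platform_backup_2019_08_08_15_25_36.tgz | less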

---------
Restoring
---------

This section describes the platform restore and OpenStack restore processes.

~~~~~~~~~~~~~~~~
Platform restore
~~~~~~~~~~~~~~~~

In the platform restore process, the etcd and system inventory databases are
preserved by default. You can choose to preserve the Ceph data or to wipe it.

* To preserve Ceph cluster data, use ``wipe_ceph_osds=false``.

* To start with an empty Ceph cluster, use ``wipe_ceph_osds=true``. After the
  restore procedure is complete and before you restart the applications, you
  must restore the Ceph data from off-box copies (one possible approach is
  sketched below).
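
The playbooks do not create these off-box Ceph copies for you. One possible
way to capture and later re-import individual RBD images is sketched below;
the pool and image names are illustrative and your own backup tooling may
differ:

::

    # Export an RBD image from the cluster to a local file before the restore.
    rbd export cinder-volumes/volume-<uuid> /opt/backups/volume-<uuid>.img

    # After a restore with wipe_ceph_osds=true, import it back into the pool.
    rbd import /opt/backups/volume-<uuid>.img cinder-volumes/volume-<uuid>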

Steps:

#. Backup: Run the backup.yml playbook, whose output is a platform backup
   tarball. Move the backup tarball outside of the cluster for safekeeping.

#. Restore:

   a. If using ``wipe_ceph_osds=true``, power down all the nodes.
      **Do not** power down storage nodes if using ``wipe_ceph_osds=false``.

      .. important::

         It is mandatory for the storage cluster to remain functional
         during restore when ``wipe_ceph_osds=false``, otherwise data
         loss will occur. Power down storage nodes only when
         ``wipe_ceph_osds=true``.

   #. Reinstall controller-0.

   #. Run the Ansible restore_platform.yml playbook to restore a full system
      from the platform tarball archive. For this step, similar to the backup
      procedure, there are two options: local and remote play.

      **Local play**

      i. Download the backup to the controller. You can also use an external
         storage device, for example, a USB drive.

      #. Run the command:

         ::

             ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml -e "initial_backup_dir=<location_of_tarball> ansible_become_pass=<admin_password> admin_password=<admin_password> backup_filename=<backup_filename>"

      #. If you deployed the system with the Rook storage backend instead of
         Ceph, you must add the ``rook_enabled=true`` variable to the command
         above.

      **Remote play**

      i. Log in to the host where Ansible is installed and clone the playbook
         code from OpenDev at
         https://opendev.org/starlingx/ansible-playbooks.git

      #. Provide an inventory file, either a customized one that is specified
         via the ``-i`` option or the default one that resides in the Ansible
         configuration directory (``/etc/ansible/hosts``). You must specify
         the IP of the controller host. For example, if the host name is
         ``my_vbox``, the inventory file should have an entry called
         ``my_vbox`` as shown in the example below.

         ::

             all:
               hosts:
                 wc68:
                   ansible_host: 128.222.100.02
                 my_vbox:
                   ansible_host: 128.224.141.74

      #. Run Ansible:

         ::

             ansible-playbook <path-to-backup-playbook-entry-file> --limit host-name -i <inventory-file> -e <optional-extra-vars>

         Where ``optional-extra-vars`` include:

         * ``wipe_ceph_osds`` is set to either ``wipe_ceph_osds=false``
           (default: keep Ceph data intact) or ``wipe_ceph_osds=true``
           (start with an empty Ceph cluster).

         * ``backup_filename`` is the platform backup tar file. It must be
           provided via the ``-e`` option on the command line. For example,
           ``-e "backup_filename=localhost_platform_backup_2019_07_15_14_46_37.tgz"``

         * ``initial_backup_dir`` is the location on the Ansible control
           machine where the platform backup tar file is placed to restore
           the platform. It must be provided via the ``-e`` option on the
           command line.

         * ``admin_password``, ``ansible_become_pass`` and
           ``ansible_ssh_pass`` must be set correctly via the ``-e`` option
           on the command line or in the Ansible secret file.
           ``ansible_ssh_pass`` is the password for the sysadmin user on
           controller-0.

         * ``ansible_remote_tmp`` should be set to a new directory (no need
           to create it ahead of time) under ``/home/sysadmin`` on
           controller-0 via the ``-e`` option on the command line.

         Example command:

         ::

             ansible-playbook /localdisk/designer/jenkins/tis-stx-dev/cgcs-root/stx/ansible-playbooks/playbookconfig/src/playbooks/restore_platform.yml --limit my_vbox -i $HOME/br_test/hosts -e "ansible_become_pass=Li69nux* admin_password=Li69nux* ansible_ssh_pass=Li69nux* initial_backup_dir=$HOME/br_test backup_filename=my_vbox_system_backup_2019_08_08_15_25_36.tgz ansible_remote_tmp=/home/sysadmin/ansible-restore"

      #. If you deployed the system with the Rook storage backend instead of
         Ceph, you must add the ``rook_enabled=true`` variable to the command
         above.

   #. After Ansible is executed, perform the following steps based on your
      deployment mode:

      **AIO-SX**

      i. Unlock controller-0 and wait for it to boot.

      #. Applications should transition from `restore-requested` to
         `applying` and make a final transition to the `applied` state. If
         applications transition from `applying` back to `restore-requested`,
         ensure there is network access and access to the Docker registry.
         The process is repeated once per minute until all applications have
         transitioned to the `applied` state.

      **AIO-DX**

      i. Unlock controller-0 and wait for it to boot.

      #. Applications should transition from `restore-requested` to
         `applying` and make a final transition to the `applied` state. If
         applications transition from `applying` back to `restore-requested`,
         ensure there is network access and access to the Docker registry.
         The process is repeated once per minute until all applications have
         transitioned to the `applied` state.

      #. Reinstall controller-1 (boot it from PXE and wait for it to become
         `online`).

      #. Unlock controller-1.

      **Standard (with controller storage)**

      i. Unlock controller-0 and wait for it to boot. After unlock, you will
         see all nodes, including storage nodes, as offline.

      #. Applications should transition from `restore-requested` to
         `applying` and make a final transition to the `applied` state. If
         applications transition from `applying` back to `restore-requested`,
         ensure there is network access and access to the Docker registry.
         The process is repeated once per minute until all applications have
         transitioned to the `applied` state.

      #. Reinstall controller-1 and the compute nodes (boot them from PXE and
         wait for them to become `online`).

      #. Unlock controller-1 and wait for it to become available.

      #. Unlock the compute nodes and wait for them to become available.

      **Standard (without controller storage)**

      i. Unlock controller-0 and wait for it to boot. After unlock, you will
         see all nodes, except storage nodes, as offline. If
         ``wipe_ceph_osds=false`` is used, storage nodes must be powered on
         and in the `available` state throughout the procedure. Otherwise,
         storage nodes must be powered off.

      #. Applications should transition from `restore-requested` to
         `applying` and make a final transition to the `applied` state. If
         applications transition from `applying` back to `restore-requested`,
         ensure there is network access and access to the Docker registry.
         The process is repeated once per minute until all applications have
         transitioned to the `applied` state.

      #. Reinstall controller-1 and the compute nodes (boot them from PXE and
         wait for them to become `online`).

      #. Unlock controller-1 and wait for it to become available.

      #. If ``wipe_ceph_osds=true`` is used, reinstall the storage nodes.

      #. Unlock the compute nodes and wait for them to become available.

      #. (Optional) Reinstall the storage nodes.

   #. Wait for the Calico and CoreDNS pods to start, then run the
      ``system restore-complete`` command. Type 750.006 alarms will disappear
      one at a time as the applications are auto-applied (see the monitoring
      example after these steps).
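
To follow the application and alarm activity described in the steps above,
you can poll the platform from controller-0. This is a minimal sketch using
standard StarlingX CLI commands, assuming the platform credentials have been
sourced:

::

    # Watch application status until everything reaches the "applied" state.
    watch -n 30 system application-list

    # Confirm that the 750.006 auto-apply alarms are clearing.
    fm alarm-list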

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
OpenStack application backup and restore
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this procedure, only the OpenStack application will be restored.

Steps:

#. Backup: Run the backup.yml playbook, whose output is a platform backup
   tarball. Move the backup tarball outside of the cluster for safekeeping.

   .. note::

      When OpenStack is running, the backup.yml playbook generates two
      tarballs: a platform backup tarball and an OpenStack backup tarball.

#. Restore:

   a. Delete the old OpenStack application and upload the application again.
      (Note that images and volumes will remain in Ceph.)

      .. parsed-literal::

         system application-remove |prefix|-openstack
         system application-delete |prefix|-openstack
         system application-upload |prefix|-openstack-<ver>.tgz

   #. (Optional) If you want to delete the Ceph data, remove the old Glance
      images and Cinder volumes from the Ceph pool.

   #. Run the restore_openstack.yml Ansible playbook to restore the OpenStack
      tarball.

      If you don't want to manipulate the Ceph data, execute this command:

      ::

          ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'initial_backup_dir=<location_of_backup_filename> ansible_become_pass=<admin_password> admin_password=<admin_password> backup_filename=<backup_filename>'

      For example:

      ::

          ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'initial_backup_dir=/opt/backups ansible_become_pass=Li69nux* admin_password=Li69nux* backup_filename=localhost_openstack_backup_2019_12_13_12_43_17.tgz'

      If you want to restore Glance images and Cinder volumes from external
      storage (the optional step above was executed), or you want to
      reconcile newer data in the Glance and Cinder volume pools with older
      data, execute the following steps:

      * Run the restore_openstack playbook with the
        ``restore_cinder_glance_data`` flag enabled. This step brings up the
        MariaDB services, restores the MariaDB data, and brings up the Cinder
        and Glance services.

        ::

            ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'restore_cinder_glance_data=true initial_backup_dir=<location_of_backup_filename> ansible_become_pass=<admin_password> admin_password=<admin_password> backup_filename=<backup_filename>'

        For example:

        ::

            ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'restore_cinder_glance_data=true ansible_become_pass=Li69nux* admin_password=Li69nux* backup_filename=localhost_openstack_backup_2019_12_13_12_43_17.tgz initial_backup_dir=/opt/backups'

      * Restore Glance images and Cinder volumes using the image-backup.sh
        and tidy_storage_post_restore helper scripts.

        The tidy storage script is used to detect any discrepancy between the
        Cinder/Glance DB and the rbd pools.
        Discrepancies between the Glance images DB and the rbd images pool
        are handled in the following ways:

        * If an image is in the Glance images DB but not in the rbd images
          pool, the image and suggested actions to take are listed in a log
          file.

        * If an image is in the rbd images pool but not in the Glance images
          DB, a Glance image is created in the Glance images DB to associate
          with the backend data. The image and suggested actions to take are
          also listed in a log file.

        Discrepancies between the Cinder volumes DB and the rbd
        cinder-volumes pool are handled in the following ways:

        * If a volume is in the Cinder volumes DB but not in the rbd
          cinder-volumes pool, the volume state is set to "error". The volume
          and suggested actions to take are also listed in a log file.

        * If a volume is in the rbd cinder-volumes pool but not in the Cinder
          volumes DB, any snapshot(s) associated with this volume in the rbd
          pool are removed, and a volume is created in the Cinder volumes DB
          to associate with the backend data. The volume and suggested
          actions to take are listed in a log file.

        * If a volume is in both the Cinder volumes DB and the rbd
          cinder-volumes pool and it has snapshot(s) in the rbd pool, the
          snapshot is re-created in Cinder if it doesn't exist.

        * If a snapshot is in the Cinder DB but not in the rbd pool, it is
          deleted.

        Usage:

        ::

            tidy_storage_post_restore <log_file>
        The image-backup.sh script is used to back up and restore Glance
        images from the Ceph image pool.

        Usage:

        ::

            image-backup export <uuid>           - export the image with <uuid> into the backup file /opt/backups/image_<uuid>.tgz
            image-backup import image_<uuid>.tgz - import the image from the backup file at /opt/backups/image_<uuid>.tgz

   #. To bring up the remaining OpenStack services, run the playbook again
      with ``restore_openstack_continue`` set to true:

      ::

          ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'restore_openstack_continue=true initial_backup_dir=<location_of_backup_filename> ansible_become_pass=<admin_password> admin_password=<admin_password> backup_filename=<backup_filename>'

      For example:

      ::

          ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'restore_openstack_continue=true ansible_become_pass=Li69nux* admin_password=Li69nux* backup_filename=localhost_openstack_backup_2019_12_13_12_43_17.tgz initial_backup_dir=/opt/backups'
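
After the final playbook run completes, you may want to verify that the
OpenStack application is back in the `applied` state and that its pods are
running. This is a minimal sketch; it assumes the ``openstack`` namespace
used by the application and the same application name pattern used earlier in
this guide:

.. parsed-literal::

   # Confirm the OpenStack application state.
   system application-show |prefix|-openstack

   # Confirm that the OpenStack pods are running.
   kubectl get pods -n openstack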