==================
Backup and Restore
==================

This guide describes the StarlingX backup and restore functionality.

.. contents::
   :local:
   :depth: 2

--------
Overview
--------

This feature provides a last-resort disaster recovery option for situations
where the StarlingX software and/or data are compromised. The provided backup
utility creates a deployment state snapshot, which can be used to restore the
deployment to a previously good working state.

There are two main options for backup and restore:

* Platform restore, where the platform data is re-initialized but the
  applications are preserved, including OpenStack, if previously installed.
  During this process, you can choose to keep the Ceph cluster (default
  option: ``wipe_ceph_osds=false``) or to wipe it and restore Ceph data from
  off-box copies (``wipe_ceph_osds=true``).

* OpenStack application backup and restore, where only the OpenStack
  application is restored. This scenario deletes the OpenStack application,
  re-applies it, and restores data from off-box copies (Glance, Ceph volumes,
  database).

This guide describes both restore options, including the backup procedure.

.. note::

   * Ceph application data is **not** backed up. It is preserved by the
     restore process by default (``wipe_ceph_osds=false``), but it is not
     restored if ``wipe_ceph_osds=true`` is used. You can protect against
     Ceph cluster failures by using off-box custom backups.

   * During restore, images for applications that are integrated with
     StarlingX are automatically downloaded to the local registry from
     external sources. If your system has custom Kubernetes pods that use the
     local registry and are **not** integrated with StarlingX, confirm after
     the restore that the correct images are present so that those
     applications can restart automatically.
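
For example, a quick way to confirm which images are present in the local
registry after a restore (this assumes the ``system registry-image-list`` and
``system registry-image-tags`` commands are available in your release; the
image name is a placeholder):

::

  system registry-image-list
  system registry-image-tags <image-name>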

----------
Backing up
----------

There are two methods for backing up: the local play method and the remote
play method.

~~~~~~~~~~~~~~~~~
Local play method
~~~~~~~~~~~~~~~~~

Run the following command:

::

  ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml -e "ansible_become_pass=<sysadmin password> admin_password=<sysadmin password>"

The ``<admin_password>`` and ``<ansible_become_pass>`` variables must be set
correctly by one of the following methods:

* The ``-e`` option on the command line
* An override file
* The Ansible secret file
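
For example, a minimal override file passed with ``-e @`` (the file name
``backup-overrides.yml`` is only an example):

::

  cat > backup-overrides.yml <<EOF
  ansible_become_pass: <sysadmin password>
  admin_password: <sysadmin password>
  EOF

  ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml -e @backup-overrides.yml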

If you deployed the system with the Rook storage backend instead of the Ceph
backend, you must add the ``rook_enabled=true`` variable:

::

  ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml -e "ansible_become_pass=<sysadmin password> admin_password=<sysadmin password> rook_enabled=true"

The output of the command is a file named in this format:
``<inventory_hostname>_platform_backup_<timestamp>.tgz``

The prefixes ``<platform_backup_filename_prefix>`` and
``<openstack_backup_filename_prefix>`` can be overridden via the ``-e`` option
on the command line or an override file.

The generated backup tar files will look like this:
``localhost_platform_backup_2019_08_08_15_25_36.tgz`` and
``localhost_openstack_backup_2019_08_08_15_25_36.tgz``. They are located in
the ``/opt/backups`` directory on controller-0.
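
Keep a copy of the backup outside the cluster for safekeeping. For example
(the destination host and path are placeholders):

::

  ls -lh /opt/backups/
  scp /opt/backups/localhost_platform_backup_2019_08_08_15_25_36.tgz user@backup-server:/path/to/safekeeping/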

~~~~~~~~~~~~~~~~~~
Remote play method
~~~~~~~~~~~~~~~~~~

#. Log in to the host where Ansible is installed and clone the playbook code
   from OpenDev at https://opendev.org/starlingx/ansible-playbooks.git
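
   For example (clone into a working directory of your choice):

   ::

     git clone https://opendev.org/starlingx/ansible-playbooks.git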

#. Provide an inventory file, either a customized one that is specified via
   the ``-i`` option or the default one that resides in the Ansible
   configuration directory (``/etc/ansible/hosts``). You must specify the IP
   of the controller host. For example, if the host name is ``my_vbox``, the
   inventory file should have an entry called ``my_vbox`` as shown in the
   example below:

   ::

     all:
       hosts:
         wc68:
           ansible_host: 128.222.100.02
         my_vbox:
           ansible_host: 128.224.141.74

#. Run Ansible with the command:

   ::

     ansible-playbook <path-to-backup-playbook-entry-file> --limit host-name -i <inventory-file> -e <optional-extra-vars>

   The generated backup tar files can be found in ``<host_backup_dir>``, which
   is ``$HOME`` by default. It can be overridden by the ``-e`` option on the
   command line or in an override file.

   The generated backup tar file has the same naming convention as in the
   local play method.

   Example:

   ::

     ansible-playbook /localdisk/designer/repo/cgcs-root/stx/stx-ansible-playbooks/playbookconfig/src/playbooks/backup-restore/backup.yml --limit my_vbox -i $HOME/br_test/hosts -e "host_backup_dir=$HOME/br_test ansible_become_pass=Li69nux* admin_password=Li69nux* ansible_ssh_pass=Li69nux*"

#. If you deployed the system with the Rook storage backend instead of the
   Ceph backend, add the ``rook_enabled=true`` variable to the command above.

~~~~~~~~~~~~~~~~~~~~~~
Backup content details
~~~~~~~~~~~~~~~~~~~~~~

The backup contains the following (a quick way to inspect a backup archive is
shown after this list):

* Postgresql config: backup of roles, table spaces and schemas for the
  databases.

* Postgresql data:

  * template1, sysinv, barbican db data, fm db data
  * keystone db for the primary region
  * dcmanager db for a distributed cloud controller
  * dcorch db for a distributed cloud controller

* ETCD database

* LDAP db

* Ceph crushmap

* DNS server list

* System Inventory network overrides. These are needed at restore time to
  correctly set up the OS configuration:

  * addrpool
  * pxeboot_subnet
  * management_subnet
  * management_start_address
  * cluster_host_subnet
  * cluster_pod_subnet
  * cluster_service_subnet
  * external_oam_subnet
  * external_oam_gateway_address
  * external_oam_floating_address

* Docker registries on the controller

* Docker proxy (see :ref:`docker_proxy_config` for details)

* Backup data:

  * OS configuration

    ``ok: [localhost] => (item=/etc)``

    Note: although everything here is backed up, not all of the content will
    be restored.

  * Home directory of the ``sysadmin`` user and all LDAP user accounts

    ``ok: [localhost] => (item=/home)``

  * Generated platform configuration

    ``ok: [localhost] => (item=/opt/platform/config/<SW_VERSION>)``

    ``ok: [localhost] => (item=/opt/platform/puppet/<SW_VERSION>/hieradata)``

    All the hieradata in this folder is backed up. However, only the static
    hieradata (``static.yaml`` and ``secure_static.yaml``) will be restored
    to bootstrap controller-0.

  * Keyring

    ``ok: [localhost] => (item=/opt/platform/.keyring/<SW_VERSION>)``

  * Patching and package repositories

    ``ok: [localhost] => (item=/opt/patching)``

    ``ok: [localhost] => (item=/var/www/pages/updates)``

  * Extension filesystem

    ``ok: [localhost] => (item=/opt/extension)``

  * Patch-vault filesystem for the distributed cloud system controller

    ``ok: [localhost] => (item=/opt/patch-vault)``

  * FluxCD manifests

    ``ok: [localhost] => (item=/opt/platform/armada/<SW_VERSION>)``

  * Helm charts

    ``ok: [localhost] => (item=/opt/platform/helm_charts)``
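
To see exactly what a given backup archive contains, list the tarball
contents. For example (the file name is a placeholder):

::

  tar -tzvf /opt/backups/localhost_platform_backup_2019_08_08_15_25_36.tgz | less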

---------
Restoring
---------

This section describes the platform restore and OpenStack restore processes.

~~~~~~~~~~~~~~~~
Platform restore
~~~~~~~~~~~~~~~~

In the platform restore process, the etcd and system inventory databases are
preserved by default. You can choose to preserve the Ceph data or to wipe it.

* To preserve Ceph cluster data, use ``wipe_ceph_osds=false``.

* To start with an empty Ceph cluster, use ``wipe_ceph_osds=true``. After the
  restore procedure is complete and before you restart the applications, you
  must restore the Ceph data from off-box copies.

Steps:

#. Backup: Run the backup.yml playbook, whose output is a platform backup
   tarball. Move the backup tarball outside of the cluster for safekeeping.

#. Restore:

   a. If using ``wipe_ceph_osds=true``, power down all the nodes.

      **Do not** power down storage nodes if using ``wipe_ceph_osds=false``.

      .. important::

         It is mandatory for the storage cluster to remain functional during
         restore when ``wipe_ceph_osds=false``, otherwise data loss will
         occur. Power down storage nodes only when ``wipe_ceph_osds=true``.

   #. Reinstall controller-0.

   #. Run the Ansible restore_platform.yml playbook to restore a full system
      from the platform tarball archive. For this step, similar to the backup
      procedure, there are two options: local and remote play.

      **Local play**

      i. Download the backup to the controller. You can also use an external
         storage device, for example, a USB drive.

      #. Run the command:

         ::

           ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml -e "initial_backup_dir=<location_of_tarball> ansible_become_pass=<admin_password> admin_password=<admin_password> backup_filename=<backup_filename>"

      #. If you deployed the system with the Rook storage backend instead of
         the Ceph backend, add the ``rook_enabled=true`` variable to the
         command above.

      **Remote play**

      i. Log in to the host where Ansible is installed and clone the playbook
         code from OpenDev at
         https://opendev.org/starlingx/ansible-playbooks.git

      #. Provide an inventory file, either a customized one that is specified
         via the ``-i`` option or the default one that resides in the Ansible
         configuration directory (``/etc/ansible/hosts``). You must specify
         the IP of the controller host. For example, if the host name is
         ``my_vbox``, the inventory file should have an entry called
         ``my_vbox`` as shown in the example below.

         ::

           all:
             hosts:
               wc68:
                 ansible_host: 128.222.100.02
               my_vbox:
                 ansible_host: 128.224.141.74

      #. Run Ansible:

         ::

           ansible-playbook <path-to-restore-playbook-entry-file> --limit host-name -i <inventory-file> -e <optional-extra-vars>

         Where ``<optional-extra-vars>`` include:

         * ``<wipe_ceph_osds>`` is set to either ``wipe_ceph_osds=false``
           (default: keep Ceph data intact) or ``wipe_ceph_osds=true`` (start
           with an empty Ceph cluster).

         * ``<backup_filename>`` is the platform backup tar file. It must be
           provided via the ``-e`` option on the command line. For example,
           ``-e "backup_filename=localhost_platform_backup_2019_07_15_14_46_37.tgz"``

         * ``<initial_backup_dir>`` is the location on the Ansible control
           machine where the platform backup tar file is placed to restore
           the platform. It must be provided via the ``-e`` option on the
           command line.

         * ``<admin_password>``, ``<ansible_become_pass>`` and
           ``<ansible_ssh_pass>`` must be set correctly via the ``-e`` option
           on the command line or in the Ansible secret file.
           ``<ansible_ssh_pass>`` is the password for the sysadmin user on
           controller-0.

         * ``<ansible_remote_tmp>`` should be set to a new directory (no need
           to create it ahead of time) under ``/home/sysadmin`` on
           controller-0 via the ``-e`` option on the command line.

         Example command:

         ::

           ansible-playbook /localdisk/designer/jenkins/tis-stx-dev/cgcs-root/stx/ansible-playbooks/playbookconfig/src/playbooks/restore_platform.yml --limit my_vbox -i $HOME/br_test/hosts -e "ansible_become_pass=Li69nux* admin_password=Li69nux* ansible_ssh_pass=Li69nux* initial_backup_dir=$HOME/br_test backup_filename=my_vbox_system_backup_2019_08_08_15_25_36.tgz ansible_remote_tmp=/home/sysadmin/ansible-restore"

      #. If you deployed the system with the Rook storage backend instead of
         the Ceph backend, add the ``rook_enabled=true`` variable to the
         command above.

   #. After Ansible has executed, perform the following steps based on your
      deployment mode.

      **AIO-SX**

      i. Unlock controller-0 and wait for it to boot.

      #. Applications should transition from `restore-requested` to `applying`
         and make a final transition to the `applied` state. If applications
         transition from `applying` back to `restore-requested`, ensure there
         is network access and access to the Docker registry. The process is
         repeated once per minute until all applications have transitioned to
         the `applied` state.
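
         For example, to watch the application states while this is in
         progress (assuming the platform ``system`` CLI is available after
         the unlock):

         ::

           watch -n 60 system application-list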

      **AIO-DX**

      i. Unlock controller-0 and wait for it to boot.

      #. Applications should transition from `restore-requested` to `applying`
         and make a final transition to the `applied` state. If applications
         transition from `applying` back to `restore-requested`, ensure there
         is network access and access to the Docker registry. The process is
         repeated once per minute until all applications have transitioned to
         the `applied` state.

      #. Reinstall controller-1 (boot it from PXE and wait for it to become
         `online`).

      #. Unlock controller-1.

      **Standard (with controller storage)**

      i. Unlock controller-0 and wait for it to boot. After the unlock, you
         will see all nodes, including storage nodes, as offline.

      #. Applications should transition from `restore-requested` to `applying`
         and make a final transition to the `applied` state. If applications
         transition from `applying` back to `restore-requested`, ensure there
         is network access and access to the Docker registry. The process is
         repeated once per minute until all applications have transitioned to
         the `applied` state.

      #. Reinstall controller-1 and the compute nodes (boot them from PXE and
         wait for them to become `online`).

      #. Unlock controller-1 and wait for it to be available.

      #. Unlock the compute nodes and wait for them to be available.

      **Standard (without controller storage)**

      i. Unlock controller-0 and wait for it to boot. After the unlock, you
         will see all nodes, except storage nodes, as offline. If
         ``wipe_ceph_osds=false`` is used, storage nodes must be powered on
         and in the `available` state throughout the procedure. Otherwise,
         storage nodes must be powered off.

      #. Applications should transition from `restore-requested` to `applying`
         and make a final transition to the `applied` state. If applications
         transition from `applying` back to `restore-requested`, ensure there
         is network access and access to the Docker registry. The process is
         repeated once per minute until all applications have transitioned to
         the `applied` state.

      #. Reinstall controller-1 and the compute nodes (boot them from PXE and
         wait for them to become `online`).

      #. Unlock controller-1 and wait for it to be available.

      #. If ``wipe_ceph_osds=true`` is used, reinstall the storage nodes.

      #. Unlock the compute nodes and wait for them to be available.

      #. (Optional) Reinstall the storage nodes.

   #. Wait for the Calico and CoreDNS pods to start, then run the
      ``system restore-complete`` command. Type 750.006 alarms will disappear
      one at a time as the applications are auto-applied.
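
      For example, to check that the Calico and CoreDNS pods are running and
      that the 750.006 alarms are clearing (a sketch; pod names and namespaces
      follow the platform defaults):

      ::

        kubectl get pods -n kube-system | grep -e calico -e coredns
        system restore-complete
        fm alarm-list | grep 750.006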

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
OpenStack application backup and restore
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this procedure, only the OpenStack application will be restored.

Steps:

#. Backup: Run the backup.yml playbook, whose output is a platform backup
   tarball. Move the backup tarball outside of the cluster for safekeeping.

   .. note::

      When OpenStack is running, the backup.yml playbook generates two
      tarballs: a platform backup tarball and an OpenStack backup tarball.

#. Restore:

   a. Delete the old OpenStack application and upload the application again.
      (Note that images and volumes will remain in Ceph.)

      .. parsed-literal::

         system application-remove |prefix|-openstack
         system application-delete |prefix|-openstack
         system application-upload |prefix|-openstack-<ver>.tgz
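
      After the upload completes, the application should be reported in the
      `uploaded` state. For example (a quick check, assuming the ``system``
      CLI is available):

      ::

        system application-list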

   #. (Optional) If you want to delete the Ceph data, remove the old Glance
      images and Cinder volumes from the Ceph pool.

   #. Run the restore_openstack.yml Ansible playbook to restore the OpenStack
      tarball.

      If you do not want to manipulate the Ceph data, execute this command:

      ::

        ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'initial_backup_dir=<location_of_backup_filename> ansible_become_pass=<admin_password> admin_password=<admin_password> backup_filename=<backup_filename>'

      For example:

      ::

        ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'initial_backup_dir=/opt/backups ansible_become_pass=Li69nux* admin_password=Li69nux* backup_filename=localhost_openstack_backup_2019_12_13_12_43_17.tgz'

      If you want to restore Glance images and Cinder volumes from external
      storage (the optional step above was executed), or you want to reconcile
      newer data in the Glance and Cinder volume pools with older data, then
      you must execute the following steps:

      * Run the restore_openstack playbook with the
        ``restore_cinder_glance_data`` flag enabled. This step brings up the
        MariaDB services, restores the MariaDB data, and brings up the Cinder
        and Glance services.

        ::

          ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'restore_cinder_glance_data=true initial_backup_dir=<location_of_backup_filename> ansible_become_pass=<admin_password> admin_password=<admin_password> backup_filename=<backup_filename>'

        For example:

        ::

          ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'restore_cinder_glance_data=true ansible_become_pass=Li69nux* admin_password=Li69nux* backup_filename=localhost_openstack_backup_2019_12_13_12_43_17.tgz initial_backup_dir=/opt/backups'

      * Restore Glance images and Cinder volumes using the image-backup.sh and
        tidy_storage_post_restore helper scripts.

        The tidy storage script is used to detect any discrepancy between the
        Cinder/Glance DB and the rbd pools.

        Discrepancies between the Glance images DB and the rbd images pool are
        handled in the following ways:

        * If an image is in the Glance images DB but not in the rbd images
          pool, list the image and the suggested actions to take in a log
          file.

        * If an image is in the rbd images pool but not in the Glance images
          DB, create a Glance image in the Glance images DB to associate with
          the backend data. Also, list the image and the suggested actions to
          take in a log file.

        Discrepancies between the Cinder volumes DB and the rbd cinder-volumes
        pool are handled in the following ways:

        * If a volume is in the Cinder volumes DB but not in the rbd
          cinder-volumes pool, set the volume state to "error". Also, list the
          volume and the suggested actions to take in a log file.

        * If a volume is in the rbd cinder-volumes pool but not in the Cinder
          volumes DB, remove any snapshot(s) associated with this volume in
          the rbd pool and create a volume in the Cinder volumes DB to
          associate with the backend data. List the volume and the suggested
          actions to take in a log file.

        * If a volume is in both the Cinder volumes DB and the rbd
          cinder-volumes pool and it has snapshot(s) in the rbd pool,
          re-create the snapshot in Cinder if it does not exist.

        * If a snapshot is in the Cinder DB but not in the rbd pool, it will
          be deleted.

        Usage:

        ::

          tidy_storage_post_restore <log_file>

        The image-backup.sh script is used to back up and restore Glance
        images from the Ceph image pool.

        Usage:

        ::

          image-backup export <uuid> - export the image with <uuid> into the backup file /opt/backups/image_<uuid>.tgz

          image-backup import image_<uuid>.tgz - import the image from the backup source file at /opt/backups/image_<uuid>.tgz
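
        For example, to import every image archive previously exported to
        ``/opt/backups`` (a sketch; it assumes the archives follow the
        ``image_<uuid>.tgz`` naming shown above):

        ::

          for f in /opt/backups/image_*.tgz; do
              image-backup import "$(basename "$f")"
          done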

   #. To bring up the remaining OpenStack services, run the playbook again
      with ``restore_openstack_continue`` set to true:

      ::

        ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'restore_openstack_continue=true initial_backup_dir=<location_of_backup_filename> ansible_become_pass=<admin_password> admin_password=<admin_password> backup_filename=<backup_filename>'

      For example:

      ::

        ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'restore_openstack_continue=true ansible_become_pass=Li69nux* admin_password=Li69nux* backup_filename=localhost_openstack_backup_2019_12_13_12_43_17.tgz initial_backup_dir=/opt/backups'
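
      When the playbook finishes, the OpenStack pods should come back up. A
      quick way to verify (assuming the default ``openstack`` namespace used
      by the application):

      ::

        kubectl get pods -n openstack
        system application-list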