openstack-ansible/doc/source/install-guide/ops-galera-recovery.rst
Hector I Gonzalez aac703c63d Doc: File name incorrect
The doc show an example on how to configure a container as:

 openstack-ansible infrastructure-setup.yml \
-l node3_galera_container-3ea2cbd3

When the file name should be: setup-infrastructure.yml

Change-Id: Iaa279fb2604a1b26fd1df1314ec9f1ec91419d0a
Closes-Bug: 1544698
2016-02-12 16:22:19 +00:00

Galera cluster recovery

When one or all nodes in a Galera cluster fail, you may need to re-bootstrap the environment. To take advantage of the automation that Ansible provides, run the galera-install.yml play with the galera-bootstrap tag to automatically recover a node or an entire environment.

  1. Run the following Ansible command to re-bootstrap the cluster:

    # openstack-ansible galera-install.yml --tags galera-bootstrap

Upon completion of this command, the cluster should be back online and in a functional state.

Single-node failure

If a single node fails, the other nodes maintain quorum and continue to process SQL requests.

  1. Run the following Ansible command to determine the failed node:

    # ansible galera_container -m shell -a "mysql -h localhost \
    -e 'show status like \"%wsrep_cluster_%\";'"
    node3_galera_container-3ea2cbd3 | FAILED | rc=1 >>
    ERROR 2002 (HY000): Can't connect to local MySQL server through
    socket '/var/run/mysqld/mysqld.sock' (111)
    
    node2_galera_container-49a47d25 | success | rc=0 >>
    Variable_name             Value
    wsrep_cluster_conf_id     17
    wsrep_cluster_size        3
    wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
    wsrep_cluster_status      Primary
    
    node4_galera_container-76275635 | success | rc=0 >>
    Variable_name             Value
    wsrep_cluster_conf_id     17
    wsrep_cluster_size        3
    wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
    wsrep_cluster_status      Primary

    In this example, node 3 has failed.

  2. Restart MariaDB on the failed node and verify that it rejoins the cluster.

  3. If MariaDB fails to start, run the mysqld command and perform further analysis on the output. As a last resort, rebuild the container for the node.
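
The check in step 1 can be scripted. The following is a minimal sketch and not part of OpenStack-Ansible: the function name galera_failed_nodes is hypothetical, and it assumes that Ansible prints host headers in the "name | FAILED | rc=N >>" form shown above (other Ansible releases may format these lines differently).

```shell
#!/bin/sh
# Hypothetical helper: list the nodes that Ansible reported as FAILED.
# Reads ad-hoc Ansible output on standard input and prints one failed
# node name per line. Assumes the "name | STATUS | rc=N >>" header form.
galera_failed_nodes() {
    awk -F' [|] ' '
        $2 == "FAILED" {
            sub(/^[ \t]+/, "", $1)   # drop any leading indentation
            print $1
        }
    '
}
```

To use it, pipe the output of the status command from step 1 into galera_failed_nodes. An empty result means that no host header was marked FAILED.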

Multi-node failure

When all but one of the nodes fail, the remaining node cannot achieve quorum and stops processing SQL requests. In this situation, failed nodes that recover cannot join the cluster because it no longer exists.

  1. Run the following Ansible command to show the failed nodes:

    # ansible galera_container -m shell -a "mysql \
    -h localhost -e 'show status like \"%wsrep_cluster_%\";'"
    node2_galera_container-49a47d25 | FAILED | rc=1 >>
    ERROR 2002 (HY000): Can't connect to local MySQL server
    through socket '/var/run/mysqld/mysqld.sock' (111)
    
    node3_galera_container-3ea2cbd3 | FAILED | rc=1 >>
    ERROR 2002 (HY000): Can't connect to local MySQL server
    through socket '/var/run/mysqld/mysqld.sock' (111)
    
    node4_galera_container-76275635 | success | rc=0 >>
    Variable_name             Value
    wsrep_cluster_conf_id     18446744073709551615
    wsrep_cluster_size        1
    wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
    wsrep_cluster_status      non-Primary

    In this example, nodes 2 and 3 have failed. The remaining operational server indicates non-Primary because it cannot achieve quorum.

  2. Run the following command on the operational node to re-bootstrap the cluster from it. The output shown after the command is from rerunning the status query in step 1.

    # mysql -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=yes';"
    node4_galera_container-76275635 | success | rc=0 >>
    Variable_name             Value
    wsrep_cluster_conf_id     15
    wsrep_cluster_size        1
    wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
    wsrep_cluster_status      Primary
    
    node3_galera_container-3ea2cbd3 | FAILED | rc=1 >>
    ERROR 2002 (HY000): Can't connect to local MySQL server
    through socket '/var/run/mysqld/mysqld.sock' (111)
    
    node2_galera_container-49a47d25 | FAILED | rc=1 >>
    ERROR 2002 (HY000): Can't connect to local MySQL server
    through socket '/var/run/mysqld/mysqld.sock' (111)

    The remaining operational node becomes the primary node and begins processing SQL requests.

  3. Restart MariaDB on the failed nodes and verify that they rejoin the cluster.

    # ansible galera_container -m shell -a "mysql \
    -h localhost -e 'show status like \"%wsrep_cluster_%\";'"
    node3_galera_container-3ea2cbd3 | success | rc=0 >>
    Variable_name             Value
    wsrep_cluster_conf_id     17
    wsrep_cluster_size        3
    wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
    wsrep_cluster_status      Primary
    
    node2_galera_container-49a47d25 | success | rc=0 >>
    Variable_name             Value
    wsrep_cluster_conf_id     17
    wsrep_cluster_size        3
    wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
    wsrep_cluster_status      Primary
    
    node4_galera_container-76275635 | success | rc=0 >>
    Variable_name             Value
    wsrep_cluster_conf_id     17
    wsrep_cluster_size        3
    wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
    wsrep_cluster_status      Primary

  4. If MariaDB fails to start on any of the failed nodes, run the mysqld command and perform further analysis on the output. As a last resort, rebuild the container for the node.
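
The verification in step 3 can be automated. The following is a minimal sketch under stated assumptions: the function name galera_cluster_healthy is hypothetical, and it expects the status output in the format shown above on standard input. It only inspects text; it does not contact the database.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if no host is marked FAILED, every
# answering node reports wsrep_cluster_status Primary, every reported
# wsrep_cluster_size equals the expected size, and the number of nodes
# that answered also equals the expected size.
# Usage: <status output> | galera_cluster_healthy 3
galera_cluster_healthy() {
    expected_size=$1
    awk -v want="$expected_size" '
        / [|] FAILED [|] /            { bad = 1 }
        $1 == "wsrep_cluster_size"    { seen++; if ($2 != want) bad = 1 }
        $1 == "wsrep_cluster_status" && $2 != "Primary" { bad = 1 }
        END { if (bad || seen != want) exit 1 }
    '
}
```

A zero exit status means that the output is consistent with a fully rejoined cluster of the expected size. Any other status means that at least one node still needs attention.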

Complete failure

If all of the nodes in a Galera cluster fail (that is, they do not shut down gracefully), the integrity of the database can no longer be guaranteed and the database should be restored from backup. Run the following command to determine if all nodes in the cluster have failed:

    # ansible galera_container -m shell -a "cat /var/lib/mysql/grastate.dat"
    node3_galera_container-3ea2cbd3 | success | rc=0 >>
    # GALERA saved state
    version: 2.1
    uuid:    338b06b0-2948-11e4-9d06-bef42f6c52f1
    seqno:   -1
    cert_index:
    
    node2_galera_container-49a47d25 | success | rc=0 >>
    # GALERA saved state
    version: 2.1
    uuid:    338b06b0-2948-11e4-9d06-bef42f6c52f1
    seqno:   -1
    cert_index:
    
    node4_galera_container-76275635 | success | rc=0 >>
    # GALERA saved state
    version: 2.1
    uuid:    338b06b0-2948-11e4-9d06-bef42f6c52f1
    seqno:   -1
    cert_index:

All the nodes have failed if mysqld is not running on any of the nodes and all of the nodes contain a seqno value of -1.

If any single node has a positive seqno value, then that node can be used to restart the cluster. However, because there is no guarantee that each node has an identical copy of the data, restarting the cluster by starting one node with the --wsrep-new-cluster option is not recommended.
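
The seqno comparison can be scripted. The following is a minimal sketch: the function name galera_best_seqno is hypothetical, and it assumes the grastate.dat output format shown above. It only reports which node holds the highest seqno; it does not restart anything, and the caution about restoring from backup still applies.

```shell
#!/bin/sh
# Hypothetical helper: read the grastate.dat output on standard input
# and print "node seqno" for the node with the highest seqno. Exits
# non-zero if no seqno line is found or the highest seqno is negative
# (for example, when every node reports -1).
galera_best_seqno() {
    awk -F' [|] ' '
        NF >= 3 {                        # host header line
            sub(/^[ \t]+/, "", $1)
            node = $1
            next
        }
        /^[ \t]*seqno:/ {
            v = $0
            sub(/^[ \t]*seqno:[ \t]*/, "", v)
            v += 0                       # force numeric comparison
            if (!found || v > best) { best = v; bestnode = node; found = 1 }
        }
        END {
            if (!found || best < 0) exit 1
            print bestnode, best
        }
    '
}
```

To use it, pipe the output of the grastate.dat command above into galera_best_seqno. A non-zero exit status corresponds to the complete-failure case described in this section.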

Rebuilding a container

Sometimes recovering from a failure requires rebuilding one or more containers.

  1. Disable the failed node on the load balancer.

    Do not rely on the load balancer health checks to disable the node. If the node is not disabled, the load balancer will send SQL requests to it before it rejoins the cluster and cause data inconsistencies.

  2. Use the following commands to destroy the container and remove MariaDB data stored outside of the container. In this example, node 3 failed.

    # lxc-stop -n node3_galera_container-3ea2cbd3
    # lxc-destroy -n node3_galera_container-3ea2cbd3
    # rm -rf /openstack/node3_galera_container-3ea2cbd3/*

  3. Run the host setup playbook to rebuild the container specifically on node 3:

    # openstack-ansible setup-hosts.yml -l node3 \
    -l node3_galera_container-3ea2cbd3

    The playbook will also restart all other containers on the node.

  4. Run the infrastructure playbook to configure the container specifically on node 3:

    # openstack-ansible setup-infrastructure.yml \
    -l node3_galera_container-3ea2cbd3

    The new container runs a single-node Galera cluster, a dangerous state because the environment contains more than one active database with potentially different data.

    # ansible galera_container -m shell -a "mysql \
    -h localhost -e 'show status like \"%wsrep_cluster_%\";'"
    node3_galera_container-3ea2cbd3 | success | rc=0 >>
    Variable_name             Value
    wsrep_cluster_conf_id     1
    wsrep_cluster_size        1
    wsrep_cluster_state_uuid  da078d01-29e5-11e4-a051-03d896dbdb2d
    wsrep_cluster_status      Primary
    
    node2_galera_container-49a47d25 | success | rc=0 >>
    Variable_name             Value
    wsrep_cluster_conf_id     4
    wsrep_cluster_size        2
    wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
    wsrep_cluster_status      Primary
    
    node4_galera_container-76275635 | success | rc=0 >>
    Variable_name             Value
    wsrep_cluster_conf_id     4
    wsrep_cluster_size        2
    wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
    wsrep_cluster_status      Primary

  5. Restart MariaDB in the new container and verify that it rejoins the cluster.

    # ansible galera_container -m shell -a "mysql \
    -h localhost -e 'show status like \"%wsrep_cluster_%\";'"
    node2_galera_container-49a47d25 | success | rc=0 >>
    Variable_name             Value
    wsrep_cluster_conf_id     5
    wsrep_cluster_size        3
    wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
    wsrep_cluster_status      Primary
    
    node3_galera_container-3ea2cbd3 | success | rc=0 >>
    Variable_name             Value
    wsrep_cluster_conf_id     5
    wsrep_cluster_size        3
    wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
    wsrep_cluster_status      Primary
    
    node4_galera_container-76275635 | success | rc=0 >>
    Variable_name             Value
    wsrep_cluster_conf_id     5
    wsrep_cluster_size        3
    wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
    wsrep_cluster_status      Primary

  6. Enable the failed node on the load balancer.
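
Before enabling the node on the load balancer, it can help to confirm that every node reports the same wsrep_cluster_state_uuid, because the dangerous single-node state in step 4 shows up as a second UUID. The following is a minimal sketch: the function name galera_cluster_uuids is hypothetical, and it assumes the status output format shown above.

```shell
#!/bin/sh
# Hypothetical helper: read the status output on standard input, print
# each distinct wsrep_cluster_state_uuid once, and exit non-zero unless
# exactly one distinct UUID was seen.
galera_cluster_uuids() {
    awk '
        $1 == "wsrep_cluster_state_uuid" && !($2 in seen) {
            seen[$2] = 1
            count++
            print $2
        }
        END { if (count != 1) exit 1 }
    '
}
```

With the output from step 4 as input, the function prints two UUIDs and exits non-zero. With the output from step 5, it prints one UUID and exits zero.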