openstack-ansible/doc/source/install-guide/ops-galera-recoverymulti.rst
Matt Thompson 06708191d6 Install Guide Cleanup
This commit does the following:

- sets all shell prompts in code-blocks to the root prompt
- uses shell-session code-block since the shell prompt was being
  treated as a comment
- links configure-aodh.rst in configure.rst (running tox was
  complaining that this file wasn't being linked anywhere)
- other minor cleanup

Change-Id: I9e3ac8bb0cabd1cc17952cfd765dbb0d8f7b6fa2
2015-10-21 07:25:25 +01:00

3.6 KiB

Home OpenStack-Ansible Installation Guide

Multi-node failure

When all but one node fails, the remaining node cannot achieve quorum and stops processing SQL requests. In this situation, failed nodes that recover cannot join the cluster because it no longer exists.

  1. Run the following Ansible command to show the failed nodes:

    # ansible galera_container -m shell -a "mysql \
    -h localhost -e 'show status like \"%wsrep_cluster_%\";'"
    node2_galera_container-49a47d25 | FAILED | rc=1 >>
    ERROR 2002 (HY000): Can't connect to local MySQL server
    through socket '/var/run/mysqld/mysqld.sock' (111)
    
    node3_galera_container-3ea2cbd3 | FAILED | rc=1 >>
    ERROR 2002 (HY000): Can't connect to local MySQL server
    through socket '/var/run/mysqld/mysqld.sock' (111)
    
    node4_galera_container-76275635 | success | rc=0 >>
    Variable_name             Value
    wsrep_cluster_conf_id     18446744073709551615
    wsrep_cluster_size        1
    wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
    wsrep_cluster_status      non-Primary

    In this example, nodes 2 and 3 have failed. The remaining operational server indicates non-Primary because it cannot achieve quorum.

  2. Run the following command to rebootstrap the operational node into the cluster.

    # mysql -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=yes';"
    node4_galera_container-76275635 | success | rc=0 >>
    Variable_name             Value
    wsrep_cluster_conf_id     15
    wsrep_cluster_size        1
    wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
    wsrep_cluster_status      Primary
    
    node3_galera_container-3ea2cbd3 | FAILED | rc=1 >>
    ERROR 2002 (HY000): Can't connect to local MySQL server
    through socket '/var/run/mysqld/mysqld.sock' (111)
    
    node2_galera_container-49a47d25 | FAILED | rc=1 >>
    ERROR 2002 (HY000): Can't connect to local MySQL server
    through socket '/var/run/mysqld/mysqld.sock' (111)

    The remaining operational node becomes the primary node and begins processing SQL requests.

  3. Restart MariaDB on the failed nodes and verify that they rejoin the cluster.

    # ansible galera_container -m shell -a "mysql \
    -h localhost -e 'show status like \"%wsrep_cluster_%\";'"
    node3_galera_container-3ea2cbd3 | success | rc=0 >>
    Variable_name             Value
    wsrep_cluster_conf_id     17
    wsrep_cluster_size        3
    wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
    wsrep_cluster_status      Primary
    
    node2_galera_container-49a47d25 | success | rc=0 >>
    Variable_name             Value
    wsrep_cluster_conf_id     17
    wsrep_cluster_size        3
    wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
    wsrep_cluster_status      Primary
    
    node4_galera_container-76275635 | success | rc=0 >>
    Variable_name             Value
    wsrep_cluster_conf_id     17
    wsrep_cluster_size        3
    wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
    wsrep_cluster_status      Primary
  4. If MariaDB fails to start on any of the failed nodes, run the mysqld command and perform further analysis on the output. As a last resort, rebuild the container for the node.