Change-Id: I233346c372bb582a5a4df3e5fc16042a38866788
8.2 KiB
RabbitMQ cluster maintenance
A RabbitMQ broker is a logical grouping of one or several Erlang nodes with each node running the RabbitMQ application and sharing users, virtual hosts, queues, exchanges, bindings, and runtime parameters. A collection of nodes is often referred to as a cluster. For more information on RabbitMQ clustering, see RabbitMQ cluster.
Within OpenStack-Ansible, all data and states required for operation of the RabbitMQ cluster is replicated across all nodes including the message queues providing high availability. RabbitMQ nodes address each other using domain names. The hostnames of all cluster members must be resolvable from all cluster nodes, as well as any machines where CLI tools related to RabbitMQ might be used. There are alternatives that may work in more restrictive environments. For more details on that setup, see Inet Configuration.
Note
There is currently an Ansible bug in regards to
HOSTNAME
. If the host .bashrc
holds a var
named HOSTNAME
, the container where the
lxc_container
module attaches will inherit this var and
potentially set the wrong $HOSTNAME
. See the Ansible fix
which will be released in Ansible version 2.3.
Create a RabbitMQ cluster
RabbitMQ clusters can be formed in two ways:
- Manually with
rabbitmqctl
- Declaratively (list of cluster nodes in a config, with
rabbitmq-autocluster
, orrabbitmq-clusterer
plugins)
Note
RabbitMQ brokers can tolerate the failure of individual nodes within the cluster. These nodes can start and stop at will as long as they have the ability to reach previously known members at the time of shutdown.
There are two types of nodes you can configure: disk and RAM nodes. Most commonly, you will use your nodes as disk nodes (preferred). Whereas RAM nodes are more of a special configuration used in performance clusters.
RabbitMQ nodes and the CLI tools use an erlang cookie
to
determine whether or not they have permission to communicate. The cookie
is a string of alphanumeric characters and can be as short or as long as
you would like.
Note
The cookie value is a shared secret and should be protected and kept private.
The default location of the cookie on *nix
environments
is /var/lib/rabbitmq/.erlang.cookie
or in
$HOME/.erlang.cookie
.
Tip
While troubleshooting, if you notice one node is refusing to join the cluster, it is definitely worth checking if the erlang cookie matches the other nodes. When the cookie is misconfigured (for example, not identical), RabbitMQ will log errors such as "Connection attempt from disallowed node" and "Could not auto-cluster". See clustering for more information.
To form a RabbitMQ Cluster, you start by taking independent RabbitMQ brokers and re-configuring these nodes into a cluster configuration.
Using a 3 node example, you would be telling nodes 2 and 3 to join the cluster of the first node.
Login to the 2nd and 3rd node and stop the rabbitmq application.
Join the cluster, then restart the application:
rabbit2$ rabbitmqctl stop_app Stopping node rabbit@rabbit2 ...done. rabbit2$ rabbitmqctl join_cluster rabbit@rabbit1 Clustering node rabbit@rabbit2 with [rabbit@rabbit1] ...done. rabbit2$ rabbitmqctl start_app Starting node rabbit@rabbit2 ...done.
Check the RabbitMQ cluster status
- Run
rabbitmqctl cluster_status
from either node.
You will see rabbit1
and rabbit2
are both
running as before.
The difference is that the cluster status section of the output, both nodes are now grouped together:
rabbit1$ rabbitmqctl cluster_status
Cluster status of node rabbit@rabbit1 ...
[{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2]}]},
{running_nodes,[rabbit@rabbit2,rabbit@rabbit1]}]
...done.
To add the third RabbitMQ node to the cluster, repeat the above process by stopping the RabbitMQ application on the third node.
Join the cluster, and restart the application on the third node.
Execute
rabbitmq cluster_status
to see all 3 nodes:rabbit1$ rabbitmqctl cluster_status Cluster status of node rabbit@rabbit1 ... [{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2,rabbit@rabbit3]}]}, {running_nodes,[rabbit@rabbit3,rabbit@rabbit2,rabbit@rabbit1]}] ...done.
Stop and restart a RabbitMQ cluster
To stop and start the cluster, keep in mind the order in which you shut the nodes down. The last node you stop, needs to be the first node you start. This node is the master.
If you start the nodes out of order, you could run into an issue where it thinks the current master should not be the master and drops the messages to ensure that no new messages are queued while the real master is down.
RabbitMQ and mnesia
Mnesia is a distributed database that RabbitMQ uses to store information about users, exchanges, queues, and bindings. Messages, however are not stored in the database.
For more information about Mnesia, see the Mnesia overview.
To view the locations of important Rabbit files, see File Locations.
Repair a partitioned RabbitMQ cluster for a single-node
Invariably due to something in your environment, you are likely to lose a node in your cluster. In this scenario, multiple LXC containers on the same host are running Rabbit and are in a single Rabbit cluster.
If the host still shows as part of the cluster, but it is not running, execute:
# rabbitmqctl start_app
However, you may notice some issues with your application as clients may be trying to push messages to the un-responsive node. To remedy this, forget the node from the cluster by executing the following:
Ensure RabbitMQ is not running on the node:
# rabbitmqctl stop_app
On the Rabbit2 node, execute:
# rabbitmqctl forget_cluster_node rabbit@rabbit1
By doing this, the cluster can continue to run effectively and you can repair the failing node.
Important
Watch out when you restart the node, it will still think it is part of the cluster and will require you to reset the node. After resetting, you should be able to rejoin it to other nodes as needed.
rabbit1$ rabbitmqctl start_app
Starting node rabbit@rabbit1 ...
Error: inconsistent_cluster: Node rabbit@rabbit1 thinks it's clustered with node rabbit@rabbit2, but rabbit@rabbit2 disagrees
rabbit1$ rabbitmqctl reset
Resetting node rabbit@rabbit1 ...done.
rabbit1$ rabbitmqctl start_app
Starting node rabbit@mcnulty ...
...done.
Repair a partitioned RabbitMQ cluster for a multi-node cluster
The same concepts apply to a multi-node cluster that exist in a single-node cluster. The only difference is that the various nodes will actually be running on different hosts. The key things to keep in mind when dealing with a multi-node cluster are:
When the entire cluster is brought down, the last node to go down must be the first node to be brought online. If this does not happen, the nodes will wait 30 seconds for the last disc node to come back online, and fail afterwards.
If the last node to go offline cannot be brought back up, it can be removed from the cluster using the
forget_cluster_node
command.If all cluster nodes stop in a simultaneous and uncontrolled manner, (for example, with a power cut) you can be left with a situation in which all nodes think that some other node stopped after them. In this case you can use the
force_boot
command on one node to make it bootable again.
Consult the rabbitmqctl manpage for more information.