diff --git a/doc/source/node_management/kubernetes/customizing_the_host_life_cycles/handling-maintenance-heartbeat-failure-for-active-controller-service-activation-70fb51663717.rst b/doc/source/node_management/kubernetes/customizing_the_host_life_cycles/handling-maintenance-heartbeat-failure-for-active-controller-service-activation-70fb51663717.rst
new file mode 100644
index 000000000..bb8794a7c
--- /dev/null
+++ b/doc/source/node_management/kubernetes/customizing_the_host_life_cycles/handling-maintenance-heartbeat-failure-for-active-controller-service-activation-70fb51663717.rst
@@ -0,0 +1,50 @@
+.. _handling-maintenance-heartbeat-failure-for-active-controller-service-activation-70fb51663717:
+
+=============================================================================
+Handle Maintenance Heartbeat Failure for Active Controller Service Activation
+=============================================================================
+
+Maintenance is started by Service Management (along with other active
+controller services) based on one of the following three events:
+
+- on initial controller activity startup, or
+
+- on a controlled or uncontrolled controller |SWACT|, or
+
+- on active controller selection following a double controller reboot/power
+  outage; i.e., |DOR|
+
+In such events, Maintenance process startup queries System Inventory for a list
+of provisioned hosts along with their configuration and state information.
+
+Hosts that are found to be in the unlocked/enabled state are expected to
+service Maintenance heartbeat.
+
+However, the uptime on the active controller can impact how quickly Maintenance
+reacts to unlocked/enabled hosts that fail heartbeat following controller
+services activation.
+
+If the active controller reboots or loses power, then the standby controller
+takes over by way of an uncontrolled |SWACT|.
+
+**Greater than 15 minute uptime**: When Maintenance starts on a controller whose
+uptime is greater than 15 minutes, any host found to be in the unlocked/enabled
+state and not servicing heartbeat will be given a 5-second grace period before
+Maintenance declares the node failed and puts it into **Graceful Recovery**.
+
+**Graceful Recovery** is a Maintenance heartbeat failure state capable of avoiding
+a second reboot if the host is found to have already rebooted upon heartbeat
+loss recovery.
+
+If both controllers reboot or lose power, then Service Management will start
+services on the first healthy controller following the outage.
+
+**Less than 15 minute uptime**: When Maintenance starts on a controller whose
+uptime is less than 15 minutes, it assumes the system is in |DOR| mode.
+Maintenance is more tolerant of unlocked/enabled hosts that are not immediately
+servicing heartbeat following Maintenance process startup in |DOR| mode.
+Instead of failing a node after 5 seconds, it waits up to 10 minutes to give
+servers a longer grace period to recover, knowing that power outage recovery
+time can vary from server to server.
+
+
diff --git a/doc/source/node_management/kubernetes/index.rst b/doc/source/node_management/kubernetes/index.rst
index af05497c7..ad13ccab1 100644
--- a/doc/source/node_management/kubernetes/index.rst
+++ b/doc/source/node_management/kubernetes/index.rst
@@ -277,6 +277,7 @@ Customize host life cycles
    customizing_the_host_life_cycles/adjusting-the-host-heartbeat-interval-and-heartbeat-response-thresholds
    customizing_the_host_life_cycles/configuring-heartbeat-failure-action
    customizing_the_host_life_cycles/configuring-multi-node-failure-avoidance
+   customizing_the_host_life_cycles/handling-maintenance-heartbeat-failure-for-active-controller-service-activation-70fb51663717
 
 --------------------
 Node inventory tasks
diff --git a/doc/source/shared/abbrevs.txt b/doc/source/shared/abbrevs.txt
index 971291b8c..802395acf 100755
--- a/doc/source/shared/abbrevs.txt
+++ b/doc/source/shared/abbrevs.txt
@@ -34,6 +34,7 @@
 .. |CVE| replace:: :abbr:`CVE (Common Vulnerabilities and Exposures)`
 .. |DAD| replace:: :abbr:`DAD (Duplicate Address Detection)`
 .. |DC| replace:: :abbr:`DC (Distributed Cloud)`
+.. |DOR| replace:: :abbr:`DOR (Dead Office Recovery)`
 .. |DHCP| replace:: :abbr:`DHCP (Dynamic Host Configuration Protocol)`
 .. |DMA| replace:: :abbr:`DMA (Direct Memory Access)`
 .. |DNS| replace:: :abbr:`DNS (Domain Name System)`
@@ -123,6 +124,7 @@
 .. |SSH| replace:: :abbr:`SSH (Secure Shell)`
 .. |SSL| replace:: :abbr:`SSL (Secure Socket Layer)`
 .. |STP| replace:: :abbr:`STP (Spanning Tree Protocol)`
+.. |SWACT| replace:: :abbr:`SWACT (SWitch ACTivity)`
 .. |TCP| replace:: :abbr:`TCP (Transition Control Protocol)`
 .. |TFTP| replace:: :abbr:`TFTP (Trivial File Transfer Protocol)`
 .. |TLS| replace:: :abbr:`TLS (Transport Layer Security)`