Recovery after application node failure

When an application node (a Kubernetes master node) fails or reboots in a Fabric Services System cluster, the application is unavailable until the services that were running on that node are recovered on the remaining application nodes.

This recovery process may require manual intervention as some services do not restart automatically on the remaining nodes.

Distribution of applications and services across nodes

The different applications or services that are part of the Fabric Services System are deployed across the three application nodes. When a node is unavailable or reboots, Kubernetes reschedules some of these applications on the remaining nodes. By default, Kubernetes waits 5 minutes before it considers a node to be unavailable and starts rescheduling the applications.
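
This five-minute window corresponds to the default Kubernetes toleration for unreachable nodes (tolerationSeconds: 300 for the node.kubernetes.io/unreachable taint), which Kubernetes normally adds to pods automatically. As a quick check, you can inspect the tolerations of any pod to confirm the value in effect in your deployment; replace <pod-name> with the name of an actual pod:
  $ kubectl get pod <pod-name> -n default -o jsonpath='{.spec.tolerations}'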

Applications in the Fabric Services System deployment can be one of the following types:
  • Deployments – stateless applications that Kubernetes automatically reschedules after the unreachable timeout
  • Stateful Sets – stateful applications that Kubernetes does not automatically reschedule after the unreachable timeout
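
To see which applications are deployed as Deployments and which as Stateful Sets, you can list both workload types. This assumes the Fabric Services System applications run in the default namespace, as in the examples later in this section:
  $ kubectl get deployments,statefulsets -n default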

The following procedures instruct Kubernetes to restart the Stateful Set services on a different node when the node hosting them becomes unavailable.

When a node that only contains Deployments becomes unavailable for more than five minutes, the Fabric Services System recovers automatically: after the timeout, Kubernetes reschedules those services on the remaining available nodes without any intervention.

By default, the Fabric Services System deploys all the Stateful Set applications on a single node. This strategy reduces the likelihood that manual intervention is needed: intervention is required only when the node running the Stateful Sets is down for more than five minutes.
Note: Execute the following procedures only if the application does not automatically recover when a node is down. By default, it takes Kubernetes up to five minutes to detect that a node is down, and another couple of minutes to restart the stateless applications. Normally, manual intervention is needed only if the node containing the stateful applications is unavailable.
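
To check which node currently hosts the Stateful Set applications, list the pods together with their node assignments. The pod names shown here match the example output in the following procedure; the exact names depend on the release name chosen at installation:
  $ kubectl get pods -n default -o wide | grep -E 'kafka|zookeeper|mongodb|neo4j|postgresql'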

Recovering an application after node failure

When a node fails for an extended period and cannot be recovered immediately, the services that were running on it also remain offline. In this scenario, Kubernetes must be instructed to restart these services on the remaining nodes so that the application recovers as quickly as possible.
Execute the commands from a system that has its kubeconfig environment configured to reach the Fabric Services System Kubernetes cluster.
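As a quick sanity check, confirm that kubectl can reach the cluster before you start. The kubeconfig path below is a placeholder; use the path of the kubeconfig file for your deployment:
  $ export KUBECONFIG=/path/to/fss-kubeconfig
  $ kubectl cluster-info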
  1. Verify which node is offline.
    In this example, node03 is offline.
    $ kubectl get nodes
    NAME         STATUS     ROLES                  AGE   VERSION
    fss-node01   Ready      control-plane,master   26d   v1.23.1
    fss-node02   Ready      control-plane,master   26d   v1.23.1
    fss-node03   NotReady   control-plane,master   26d   v1.23.1
    fss-node04   Ready      <none>                 26d   v1.23.1
    fss-node05   Ready      <none>                 26d   v1.23.1
    fss-node06   Ready      <none>                 26d   v1.23.1
  2. Determine which pods on this node have failed.
    $ kubectl get pods -n default --field-selector spec.nodeName=fss-node03
    NAME                                             READY   STATUS        RESTARTS      AGE
    prod-cp-kafka-2                                  2/2     Terminating   0             26d
    prod-cp-zookeeper-1                              2/2     Terminating   0             26d
    prod-fss-cfggen-86859cfb5f-ctn54                 1/1     Terminating   0             26d
    prod-fss-da-78f9fdfc7c-cwsv9                     1/1     Terminating   0             26d
    prod-fss-da-78f9fdfc7c-rtrc6                     1/1     Terminating   0             26d
    prod-fss-da-78f9fdfc7c-wltzd                     1/1     Terminating   0             26d
    prod-fss-deviationmgr-netinst-7d7fc645bd-qbkf6   1/1     Terminating   0             26d
    prod-fss-digitalsandbox-7d86cc5fc4-7xfxn         1/1     Terminating   2 (26d ago)   26d
    prod-fss-oper-da-67c6d6c6bb-2bzhx                1/1     Terminating   0             26d
    prod-fss-oper-da-67c6d6c6bb-8r4w9                1/1     Terminating   0             26d
    prod-fss-oper-topomgr-6548c8d6c4-vsttk           1/1     Terminating   1 (26d ago)   26d
    prod-fss-topomgr-5f997b544d-4mfnk                1/1     Terminating   0             26d
    prod-fss-version-f5b4d74f-9nhss                  1/1     Terminating   0             26d
    prod-fss-workloadmgr-64ffcf7547-rrvbc            1/1     Terminating   1 (26d ago)   26d
    prod-fss-ztp-7bd78ccd9-x5vb7                     1/1     Terminating   1 (26d ago)   26d
    prod-mongodb-arbiter-0                           1/1     Terminating   0             26d
    prod-mongodb-secondary-0                         1/1     Terminating   0             26d
    prod-neo4j-core-0                                1/1     Terminating   0             26d
    prod-postgresql-0                                1/1     Terminating   0             26d
  3. Wait until all of the pods on the failed node show a status of Terminating.
  4. Force delete all of the pods that are in the Terminating state, as shown in the output of Step 2.
    Enter the following command for each pod (an optional one-liner that deletes all of these pods at once is shown after this procedure):
    $ kubectl delete pods --grace-period=0 --force <pod-name>
    $ kubectl delete pods --grace-period=0 --force prod-fss-cfggen-86859cfb5f-ctn54
    Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
    pod "prod-fss-cfggen-86859cfb5f-ctn54" force deleted
  5. Wait for all pods in the default namespace to return to the Running state; you can monitor progress with the watch command shown after this procedure.
    This step can take longer if any pods are in a CrashLoopBackOff state, because Kubernetes restarts them with an increasing back-off delay between attempts.
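
The following commands are optional shortcuts for Steps 4 and 5, assuming the failed node is fss-node03 as in the example above. The first command force deletes every pod still scheduled on the failed node in a single pass instead of deleting the pods one by one; the second watches the pods in the default namespace until they return to the Running state (press Ctrl+C to stop watching).
  $ kubectl get pods -n default --field-selector spec.nodeName=fss-node03 -o name \
      | xargs -r kubectl delete -n default --grace-period=0 --force
  $ kubectl get pods -n default -w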