Recovery after application node failure
When an application node (a Kubernetes master node) fails or reboots in a Fabric Services System cluster, the application is unavailable until the services that were running on that node are recovered on the remaining application nodes.
This recovery process can require manual intervention, because some services do not restart automatically on the remaining nodes.
Distribution of applications and services across nodes
The applications and services that are part of the Fabric Services System are deployed across the three application nodes. When a node becomes unavailable or reboots, Kubernetes reschedules some of these applications on the remaining nodes. By default, Kubernetes waits five minutes before it considers a node unavailable and starts rescheduling its applications. The affected workloads fall into two categories:
- Deployments – stateless applications that Kubernetes automatically reschedules after the unreachable timeout
- Stateful Sets – stateful applications that Kubernetes does not automatically reschedule after the unreachable timeout
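To check which of these workloads are Deployments and which are Stateful Sets, you can list both resource types with standard kubectl. This example assumes the default namespace used throughout the procedure below:
$ kubectl get deployments,statefulsets -n default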
The following procedure instructs Kubernetes to restart the Stateful Set services on a different node when the node hosting them becomes unavailable.
When a node that contains only Deployments is unavailable for more than five minutes, the Fabric Services System recovers automatically: once the timeout expires, Kubernetes restarts those services on the remaining available nodes without intervention.
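The five-minute delay is not specific to the Fabric Services System. Assuming the cluster runs with the standard Kubernetes admission defaults, every pod is given a NoExecute toleration of 300 seconds for the node.kubernetes.io/unreachable taint, and eviction begins when that toleration expires. You can inspect these tolerations on any pod; for example:
$ kubectl get pod prod-fss-topomgr-5f997b544d-4mfnk -o jsonpath='{.spec.tolerations}'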
Recovering an application after node failure
1. Verify which node is offline.
   In this example, node03 is offline.
   $ kubectl get nodes
   NAME         STATUS     ROLES                  AGE   VERSION
   fss-node01   Ready      control-plane,master   26d   v1.23.1
   fss-node02   Ready      control-plane,master   26d   v1.23.1
   fss-node03   NotReady   control-plane,master   26d   v1.23.1
   fss-node04   Ready      <none>                 26d   v1.23.1
   fss-node05   Ready      <none>                 26d   v1.23.1
   fss-node06   Ready      <none>                 26d   v1.23.1
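   Optionally, to investigate why the node is NotReady, describe it and review the Conditions and Events sections. This is standard kubectl and is read-only:
   $ kubectl describe node fss-node03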
2. Determine which pods on this node have failed.
   $ kubectl get pods -n default --field-selector spec.nodeName=fss-node03
   NAME                                             READY   STATUS        RESTARTS      AGE
   prod-cp-kafka-2                                  2/2     Terminating   0             26d
   prod-cp-zookeeper-1                              2/2     Terminating   0             26d
   prod-fss-cfggen-86859cfb5f-ctn54                 1/1     Terminating   0             26d
   prod-fss-da-78f9fdfc7c-cwsv9                     1/1     Terminating   0             26d
   prod-fss-da-78f9fdfc7c-rtrc6                     1/1     Terminating   0             26d
   prod-fss-da-78f9fdfc7c-wltzd                     1/1     Terminating   0             26d
   prod-fss-deviationmgr-netinst-7d7fc645bd-qbkf6   1/1     Terminating   0             26d
   prod-fss-digitalsandbox-7d86cc5fc4-7xfxn         1/1     Terminating   2 (26d ago)   26d
   prod-fss-oper-da-67c6d6c6bb-2bzhx                1/1     Terminating   0             26d
   prod-fss-oper-da-67c6d6c6bb-8r4w9                1/1     Terminating   0             26d
   prod-fss-oper-topomgr-6548c8d6c4-vsttk           1/1     Terminating   1 (26d ago)   26d
   prod-fss-topomgr-5f997b544d-4mfnk                1/1     Terminating   0             26d
   prod-fss-version-f5b4d74f-9nhss                  1/1     Terminating   0             26d
   prod-fss-workloadmgr-64ffcf7547-rrvbc            1/1     Terminating   1 (26d ago)   26d
   prod-fss-ztp-7bd78ccd9-x5vb7                     1/1     Terminating   1 (26d ago)   26d
   prod-mongodb-arbiter-0                           1/1     Terminating   0             26d
   prod-mongodb-secondary-0                         1/1     Terminating   0             26d
   prod-neo4j-core-0                                1/1     Terminating   0             26d
   prod-postgresql-0                                1/1     Terminating   0             26d
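   To follow these pods as they transition to the Terminating state, which the next step waits for, you can add the watch flag to the same command:
   $ kubectl get pods -n default --field-selector spec.nodeName=fss-node03 -w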
3. Wait for all of the pods to show Terminating.
4. Force delete all of the pods that are in the Terminating state, as shown in the output of Step 2.
   Enter the following command for each pod:
   $ kubectl delete pods --grace-period=0 --force <pod-name>
   For example:
   $ kubectl delete pods --grace-period=0 --force prod-fss-cfggen-86859cfb5f-ctn54
   Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
   pod "prod-fss-cfggen-86859cfb5f-ctn54" force deleted
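   If many pods are stuck, the per-pod deletion can be scripted. The following is an optional sketch, assuming GNU xargs and that the pods on the failed node are the only ones in the Terminating state:
   $ kubectl get pods -n default --field-selector spec.nodeName=fss-node03 -o name \
       | xargs -r -n1 kubectl delete --grace-period=0 --force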
5. Wait for all of the pods in the default namespace to be in the Running state again.
   This step can take longer if some pods are in a CrashLoopBackOff state, because Kubernetes restarts them with a progressively increasing delay between attempts.
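   To monitor progress, you can list only the pods that are not yet in the Running phase; the recovery is complete when this returns nothing (apart from any Completed pods). The phase-based field selector is standard kubectl:
   $ kubectl get pods -n default --field-selector status.phase!=Running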