Recovery after application node failure
When an application node (one of the Kubernetes master nodes) fails or reboots in a Fabric Services System cluster, the application is unavailable until the services that were running on that node are recovered on the remaining application nodes.
This recovery can require manual intervention, because some services do not restart automatically on the remaining nodes.
Distribution of applications and services across nodes
The different applications and services that are part of the Fabric Services System are deployed across the three application nodes. When a node becomes unavailable or reboots, Kubernetes reschedules some of these applications on the remaining nodes. By default, Kubernetes waits five minutes before it considers a node unavailable and starts rescheduling applications.
- Deployments: These are stateless applications that Kubernetes automatically reschedules after the unreachable timeout.
- Stateful Sets: These are stateful applications that use persistent storage. Kubernetes does not automatically reschedule them after the unreachable timeout, because it cannot guarantee that it is safe to start a replacement while the state of the original pod is unknown.
The following procedures instruct Kubernetes to restart the Stateful Set services on a different node when the node that hosts them becomes unavailable.
When a node that contains only Deployments becomes unavailable for more than five minutes, the Fabric Services System recovers automatically: Kubernetes reschedules those services on the remaining available nodes without intervention.
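To see which of these categories a given workload falls into, you can list the Deployments and Stateful Sets in the default namespace; this check is illustrative and the exact workload names can differ per release:
$ kubectl get deployments -n default
$ kubectl get statefulsets -n default
Any workload that appears in the statefulsets output (for example the kafka, zookeeper, mongodb, neo4j, and postgresql instances shown later in this section) may need the manual recovery steps that follow.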
Recovering an application after node failure
1. Verify which node is offline.
In this example, node03 is offline.
$ kubectl get nodes
NAME         STATUS     ROLES                  AGE   VERSION
fss-node01   Ready      control-plane,master   26d   v1.23.1
fss-node02   Ready      control-plane,master   26d   v1.23.1
fss-node03   NotReady   control-plane,master   26d   v1.23.1
fss-node04   Ready      <none>                 26d   v1.23.1
fss-node05   Ready      <none>                 26d   v1.23.1
fss-node06   Ready      <none>                 26d   v1.23.1
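If it is unclear why the node is offline, the node conditions can provide more detail; for example, with fss-node03 as the failed node from the preceding output:
$ kubectl describe node fss-node03
The Conditions section of the output indicates why Kubernetes reports the node as NotReady.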
2. Find all pods running on this node that have failed.
$ kubectl get pods -n default --field-selector spec.nodeName=fss-node03
NAME                                             READY   STATUS        RESTARTS      AGE
prod-cp-kafka-2                                  2/2     Terminating   0             26d
prod-cp-zookeeper-1                              2/2     Terminating   0             26d
prod-fss-cfggen-86859cfb5f-ctn54                 1/1     Terminating   0             26d
prod-fss-da-78f9fdfc7c-cwsv9                     1/1     Terminating   0             26d
prod-fss-da-78f9fdfc7c-rtrc6                     1/1     Terminating   0             26d
prod-fss-da-78f9fdfc7c-wltzd                     1/1     Terminating   0             26d
prod-fss-deviationmgr-netinst-7d7fc645bd-qbkf6   1/1     Terminating   0             26d
prod-fss-digitalsandbox-7d86cc5fc4-7xfxn         1/1     Terminating   2 (26d ago)   26d
prod-fss-oper-da-67c6d6c6bb-2bzhx                1/1     Terminating   0             26d
prod-fss-oper-da-67c6d6c6bb-8r4w9                1/1     Terminating   0             26d
prod-fss-oper-topomgr-6548c8d6c4-vsttk           1/1     Terminating   1 (26d ago)   26d
prod-fss-topomgr-5f997b544d-4mfnk                1/1     Terminating   0             26d
prod-fss-version-f5b4d74f-9nhss                  1/1     Terminating   0             26d
prod-fss-workloadmgr-64ffcf7547-rrvbc            1/1     Terminating   1 (26d ago)   26d
prod-fss-ztp-7bd78ccd9-x5vb7                     1/1     Terminating   1 (26d ago)   26d
prod-mongodb-arbiter-0                           1/1     Terminating   0             26d
prod-mongodb-secondary-0                         1/1     Terminating   0             26d
prod-neo4j-core-0                                1/1     Terminating   0             26d
prod-postgresql-0                                1/1     Terminating   0             26d
3. Wait for all the pods to show a status of Terminating.
4. Force delete each pod that is in the Terminating state, as shown in the output of Step 2.
Enter the following command for each pod:
$ kubectl delete pods --grace-period=0 --force <pod-name>
$ kubectl delete pods --grace-period=0 --force prod-fss-cfggen-86859cfb5f-ctn54
Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "prod-fss-cfggen-86859cfb5f-ctn54" force deleted
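As an alternative to deleting the pods one at a time, the following single command is a sketch that force deletes every pod still scheduled on the failed node; it assumes fss-node03 is the failed node identified in Step 1:
$ kubectl get pods -n default --field-selector spec.nodeName=fss-node03 -o name \
    | xargs -r kubectl delete -n default --grace-period=0 --force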
5. Wait for all pods in the default namespace to return to a Running state.
This step can take longer if there are pods in a CrashLoopBackOff state, because Kubernetes restarts them with an increasing delay between attempts.
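To monitor progress, you can watch the pod list, or filter out the pods that already show a STATUS of Running; both commands are shown here only as a convenience:
$ kubectl get pods -n default --watch
$ kubectl get pods -n default | grep -v Running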
6. When all the kafka, zookeeper, postgresql, mongodb, and neo4j pods are in a Running state but other pods continue to restart and enter a CrashLoopBackOff state, verify that all kafka pods are running (see the example after this procedure), then force another restart of the zookeeper pods.
Scale the prod-cp-zookeeper Stateful Set down, wait until its pods have terminated, then scale it back up:
$ kubectl -n default scale statefulset --replicas 0 prod-cp-zookeeper
statefulset.apps/prod-cp-zookeeper scaled
$ kubectl -n default scale statefulset --replicas 3 prod-cp-zookeeper
statefulset.apps/prod-cp-zookeeper scaled
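One way to verify the kafka pods, as referenced in Step 6, is to filter the pod list on the Stateful Set name; the prod- prefix below matches the example output in this procedure:
$ kubectl get pods -n default | grep prod-cp-kafka
All prod-cp-kafka pods should report a STATUS of Running (READY 2/2 in the example output in Step 2) before you force the zookeeper restart.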
Recovering an application after node reboot
Without a healthy Kafka cluster, the Fabric Services System micro-services go into a failed state, as they require a stable Kafka cluster to function.
Complete the following steps to recover when the application is in a failed state and all Kubernetes nodes are available.
1. Scale down, then scale up the zookeeper pods.
Wait for the prod-cp-zookeeper pods to scale down, then scale the prod-cp-zookeeper Stateful Set back up:
$ kubectl -n default scale statefulset --replicas 0 prod-cp-zookeeper
statefulset.apps/prod-cp-zookeeper scaled
$ kubectl -n default scale statefulset --replicas 3 prod-cp-zookeeper
statefulset.apps/prod-cp-zookeeper scaled
These commands force the Zookeeper service to restart and recover, which enables Kafka and the other applications to recover.
2. Optional: In some scenarios, you may also need to scale down and scale up the kafka pods to recover the Kafka service.
Wait for the prod-cp-kafka pods to scale down, then scale the prod-cp-kafka Stateful Set back up:
$ kubectl -n default scale statefulset --replicas 0 prod-cp-kafka
statefulset.apps/prod-cp-kafka scaled
$ kubectl -n default scale statefulset --replicas 3 prod-cp-kafka
statefulset.apps/prod-cp-kafka scaled
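To confirm that the restart worked, you can check that the kafka and zookeeper pods return to a Running state; the filter below is illustrative:
$ kubectl get pods -n default | grep -E 'prod-cp-kafka|prod-cp-zookeeper'
Once these pods are Running, the remaining micro-services should recover from their failed state on their own.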