Recovery after application node failure

When an application node (the Kubernetes master node) fails or reboots in a Fabric Services System cluster, the application is unavailable until the services running on that node can be recovered on the remaining application nodes.

This recovery process may require manual intervention as some services do not restart automatically on the remaining nodes.

Distribution of applications and services across nodes

The applications and services that make up the Fabric Services System are deployed across the three application nodes. When a node is unavailable or reboots, Kubernetes reschedules some of these applications on the remaining nodes. By default, Kubernetes waits five minutes before it considers a node to be unavailable and starts rescheduling the applications.
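
In upstream Kubernetes, this five-minute delay normally corresponds to a `tolerationSeconds` value of 300 on each pod's `node.kubernetes.io/unreachable` toleration. The following is a minimal sketch for reading that value from one pod. It assumes the cluster uses this default mechanism; `unreachable_toleration` is a hypothetical helper name, not part of the product.

```shell
# Print how many seconds a pod tolerates an unreachable node before eviction.
# Assumption: the pod carries the standard node.kubernetes.io/unreachable
# toleration that Kubernetes adds by default.
unreachable_toleration() {
  # $1 = pod name, $2 = namespace (defaults to "default")
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{range .spec.tolerations[*]}{.key}{"\t"}{.tolerationSeconds}{"\n"}{end}' |
    awk -F '\t' '$1 == "node.kubernetes.io/unreachable" { print $2 }'
}
```

For example, `unreachable_toleration prod-postgresql-0` prints the timeout, in seconds, that applies to that pod. An empty result suggests that the pod does not carry the toleration.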

Applications in the Fabric Services System deployment can be one of the following types:

Deployments – stateless applications that Kubernetes automatically reschedules after the unreachable timeout
Stateful Sets – stateful applications that Kubernetes does not automatically reschedule after the unreachable timeout

Stateful Sets are applications that use persistent storage. Kubernetes does not automatically restart them on another node because it cannot confirm that the original pod on the unreachable node has stopped, and two instances using the same persistent storage could lead to data loss.

Use the following procedures to instruct Kubernetes to restart the Stateful Set services on a different node when they are unavailable.

When a node that contains only Deployments becomes unavailable for more than five minutes, the Fabric Services System recovers automatically: once the timeout expires, Kubernetes restarts those services on the remaining available nodes without intervention.

By default, the Fabric Services System deploys all the Stateful Set applications on a single node. This strategy reduces the likelihood that manual intervention is required: it is needed only when the node running the Stateful Sets is down for more than five minutes.
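
To see which node currently hosts the Stateful Set pods, you can list each pod's owning workload kind alongside its node. The following is a sketch under the assumption that the first owner reference of each pod is its controlling workload; `statefulset_pods_by_node` is a hypothetical helper name.

```shell
# List pods owned by a StatefulSet, prefixed by the node each one runs on.
# Assumption: the first ownerReference of a pod is its controlling workload
# (pods created by a Deployment report "ReplicaSet" here and are skipped).
statefulset_pods_by_node() {
  # $1 = namespace (defaults to "default")
  kubectl get pods -n "${1:-default}" \
    -o jsonpath='{range .items[*]}{.metadata.ownerReferences[0].kind}{"\t"}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}' |
    awk -F '\t' '$1 == "StatefulSet" { print $3 "\t" $2 }' |
    sort
}
```

Running `statefulset_pods_by_node` prints one line per Stateful Set pod, sorted by node name. The nodes that appear are the ones whose failure is likely to require the manual procedure.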

Note: Execute the following procedures only if the application does not automatically recover when a node is down. By default, it takes Kubernetes up to five minutes to detect that a node is down, and another couple of minutes to restart the stateless applications. Normally, manual intervention is needed only if the node containing the stateful applications is unavailable.

Recovering an application after node failure

When a node is down for an extended period and cannot be recovered immediately, the services that ran on it also remain offline. In this scenario, Kubernetes must be instructed to restart these services on the remaining nodes so that the application is recovered as quickly as possible.

Execute the commands from a system that has its kubeconfig environment configured to reach the Fabric Services System Kubernetes cluster.

Verify which node is offline.

In this example, fss-node03 is offline.

```
$ kubectl get nodes
NAME         STATUS     ROLES                  AGE   VERSION
fss-node01   Ready      control-plane,master   26d   v1.23.1
fss-node02   Ready      control-plane,master   26d   v1.23.1
fss-node03   NotReady   control-plane,master   26d   v1.23.1
fss-node04   Ready      <none>                 26d   v1.23.1
fss-node05   Ready      <none>                 26d   v1.23.1
fss-node06   Ready      <none>                 26d   v1.23.1
```
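
When the node list is long, the offline node can be picked out with a small filter. The following is a minimal sketch; `not_ready_nodes` is a hypothetical helper name, and it assumes the default column layout of `kubectl get nodes`, with STATUS in the second column.

```shell
# Read `kubectl get nodes --no-headers` output on stdin and print the name of
# every node whose STATUS column does not start with "Ready".
# Nodes reported as "Ready,SchedulingDisabled" are treated as ready.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Typical use (requires access to the cluster):
#   kubectl get nodes --no-headers | not_ready_nodes
```

With the node list from this example as input, the filter prints only fss-node03.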

Determine which pods on this node have failed.

```
$ kubectl get pods -n default --field-selector spec.nodeName=fss-node03
NAME                                             READY   STATUS        RESTARTS      AGE
prod-cp-kafka-2                                  2/2     Terminating   0             26d
prod-cp-zookeeper-1                              2/2     Terminating   0             26d
prod-fss-cfggen-86859cfb5f-ctn54                 1/1     Terminating   0             26d
prod-fss-da-78f9fdfc7c-cwsv9                     1/1     Terminating   0             26d
prod-fss-da-78f9fdfc7c-rtrc6                     1/1     Terminating   0             26d
prod-fss-da-78f9fdfc7c-wltzd                     1/1     Terminating   0             26d
prod-fss-deviationmgr-netinst-7d7fc645bd-qbkf6   1/1     Terminating   0             26d
prod-fss-digitalsandbox-7d86cc5fc4-7xfxn         1/1     Terminating   2 (26d ago)   26d
prod-fss-oper-da-67c6d6c6bb-2bzhx                1/1     Terminating   0             26d
prod-fss-oper-da-67c6d6c6bb-8r4w9                1/1     Terminating   0             26d
prod-fss-oper-topomgr-6548c8d6c4-vsttk           1/1     Terminating   1 (26d ago)   26d
prod-fss-topomgr-5f997b544d-4mfnk                1/1     Terminating   0             26d
prod-fss-version-f5b4d74f-9nhss                  1/1     Terminating   0             26d
prod-fss-workloadmgr-64ffcf7547-rrvbc            1/1     Terminating   1 (26d ago)   26d
prod-fss-ztp-7bd78ccd9-x5vb7                     1/1     Terminating   1 (26d ago)   26d
prod-mongodb-arbiter-0                           1/1     Terminating   0             26d
prod-mongodb-secondary-0                         1/1     Terminating   0             26d
prod-neo4j-core-0                                1/1     Terminating   0             26d
prod-postgresql-0                                1/1     Terminating   0             26d
```

Wait until all of the pods show the Terminating status.

Force delete all of the pods that are in the Terminating state, as shown in the output of the preceding kubectl get pods command.

Enter the following command for each pod:

```
$ kubectl delete pods --grace-period=0 --force <pod-name>
```

```
$ kubectl delete pods --grace-period=0 --force prod-fss-cfggen-86859cfb5f-ctn54
Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "prod-fss-cfggen-86859cfb5f-ctn54" force deleted
```
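
Entering the command once per pod is tedious when many pods are affected, so the per-pod deletion can be wrapped in a loop. The following is a sketch, not a supported tool: `force_delete_terminating` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods`, with STATUS in the third column.

```shell
# Force delete every pod on one node that is reported as Terminating.
# Assumption: STATUS is the third column of `kubectl get pods --no-headers`.
force_delete_terminating() {
  # $1 = node name, $2 = namespace (defaults to "default")
  ns="${2:-default}"
  kubectl get pods -n "$ns" --no-headers --field-selector "spec.nodeName=$1" |
    awk '$3 == "Terminating" { print $1 }' |
    while read -r pod; do
      # Redirect stdin so the delete command cannot consume the pod list.
      kubectl delete pods -n "$ns" --grace-period=0 --force "$pod" < /dev/null
    done
}
```

For the example in this procedure, the call would be `force_delete_terminating fss-node03`. Confirm that the node is really offline before you run it, because force deletion does not wait for confirmation that the pods have stopped.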

Wait for all pods in the default namespace to return to the Running state.
This step can take longer if pods are in a CrashLoopBackOff state, because Kubernetes increases the delay between successive restart attempts.
If pods continue to restart and re-enter the CrashLoopBackOff state after all the kafka, zookeeper, postgresql, mongodb, and neo4j pods are in a Running state, verify that all kafka pods are running.
To force another restart of the zookeeper pods, execute the following command:
```
$ kubectl -n default scale statefulset --replicas 0 prod-cp-zookeeper
statefulset.apps/prod-cp-zookeeper scaled
```
Wait until the prod-cp-zookeeper pods have scaled down, then scale the prod-cp-zookeeper Stateful Set back up.
```
$ kubectl -n default scale statefulset --replicas 3 prod-cp-zookeeper
statefulset.apps/prod-cp-zookeeper scaled
```
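
While waiting for the pods to return to the Running state in this step, it helps to list only the pods that are not there yet. The following is a minimal sketch; `pods_not_running` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods`.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print the name and
# status of every pod that is neither Running nor Completed.
pods_not_running() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 "\t" $3 }'
}

# Typical use (requires access to the cluster):
#   kubectl get pods -n default --no-headers | pods_not_running
```

Empty output means that every pod reports Running or Completed. The filter looks only at the STATUS column, so a pod that is Running but not yet ready in the READY column is not listed.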

Recovering an application after node reboot

When a node reboots, services such as Kafka and Zookeeper try to recover. However, a known issue with Zookeeper may prevent Kafka and Zookeeper from recognizing that they are part of a working cluster again, so they do not recover.

Without a healthy Kafka cluster, the Fabric Services System microservices go into a failed state, as they require a stable Kafka cluster to function.

Complete the following steps to recover when the application is in a failed state and all Kubernetes nodes are available.

Scale down, then scale up the prod-cp-zookeeper Stateful Set.
```
$ kubectl -n default scale statefulset --replicas 0 prod-cp-zookeeper
statefulset.apps/prod-cp-zookeeper scaled
```
Wait for the prod-cp-zookeeper pods to scale down, then scale the prod-cp-zookeeper Stateful Set back up.
```
$ kubectl -n default scale statefulset --replicas 3 prod-cp-zookeeper
statefulset.apps/prod-cp-zookeeper scaled
```
The preceding commands force the Zookeeper service to restart and recover, which enables Kafka and other applications to recover.
Optional: In some scenarios, you may also need to scale down and scale up the prod-cp-kafka Stateful Set to recover the Kafka service.
The following command scales down the prod-cp-kafka Stateful Set.
```
$ kubectl -n default scale statefulset --replicas 0 prod-cp-kafka
statefulset.apps/prod-cp-kafka scaled
```
Wait for the prod-cp-kafka pods to scale down, then scale the prod-cp-kafka Stateful Set back up.
```
$ kubectl -n default scale statefulset --replicas 3 prod-cp-kafka
statefulset.apps/prod-cp-kafka scaled
```
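
The scale-down, wait, and scale-up sequence used for Zookeeper and Kafka can be combined into one helper. The following is a sketch under stated assumptions: `restart_statefulset` is a hypothetical helper name, the wait polls the `status.replicas` field of the Stateful Set, the limits of 60 polls and 5 seconds between polls are arbitrary, and the final `kubectl rollout status` call assumes that the Stateful Set uses the RollingUpdate update strategy.

```shell
# Scale a StatefulSet to zero, wait until no replicas remain, then scale it
# back up and wait for the pods to become ready again.
restart_statefulset() {
  # $1 = StatefulSet name, $2 = replica count to restore, $3 = namespace
  name="$1"; replicas="$2"; ns="${3:-default}"
  kubectl -n "$ns" scale statefulset --replicas 0 "$name" || return 1
  # Poll status.replicas until the controller reports zero pods.
  tries=0
  until [ "$(kubectl -n "$ns" get statefulset "$name" -o jsonpath='{.status.replicas}')" = "0" ]; do
    tries=$((tries + 1))
    # Give up after 60 polls (about five minutes) instead of looping forever.
    [ "$tries" -ge 60 ] && return 1
    sleep 5
  done
  kubectl -n "$ns" scale statefulset --replicas "$replicas" "$name" || return 1
  # Assumption: the StatefulSet uses the RollingUpdate update strategy.
  kubectl -n "$ns" rollout status "statefulset/$name"
}
```

For the commands in this procedure, the equivalent calls would be `restart_statefulset prod-cp-zookeeper 3` followed, if needed, by `restart_statefulset prod-cp-kafka 3`.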