Recovery after application node failure

When an application node (a Kubernetes master node) fails or reboots in a Fabric Services System cluster, the application is unavailable until the services that were running on that node are recovered on the remaining application nodes.

This recovery process may require manual intervention as some services do not restart automatically on the remaining nodes.

Recovering an application after node reboot

When a node reboots, services such as Kafka and Zookeeper try to recover automatically, but a known issue with Zookeeper can prevent Kafka and Zookeeper from recovering and recognizing that they are again part of a working cluster.

Without a healthy Kafka cluster, the Fabric Services System micro-services go into a failed state, as they require a stable Kafka cluster to function.
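
As an optional quick check, you can see which micro-services are affected by listing the pods whose STATUS column is not Running or Completed; unhealthy pods typically report a status such as Error or CrashLoopBackOff. The filter below is only a rough sketch and does not catch pods that are Running but not ready.

  $ kubectl -n default get pods | grep -vE 'Running|Completed'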

Complete the following steps to recover when the application is in a failed state and all Kubernetes nodes are available.
  1. Scale down, then scale up the Zookeeper pods.
    $ kubectl -n default scale statefulset --replicas 0 prod-cp-zookeeper
    statefulset.apps/prod-cp-zookeeper scaled
    
    Wait for the prod-cp-zookeeper pods to scale down, then scale the prod-cp-zookeeper pods back up. A way to verify that the scale operations completed is shown after this procedure.
    
    $ kubectl -n default scale statefulset --replicas 3 prod-cp-zookeeper
    statefulset.apps/prod-cp-zookeeper scaled
    
    The preceding commands force the Zookeeper service to restart and recover, which enables Kafka and other applications to recover.
  2. Optional: In some scenarios, you may also need to scale down and scale up the Kafka pods to recover the Kafka service.
    This command scales down the prod-cp-kafka pods.
    $ kubectl -n default scale statefulset --replicas 0 prod-cp-kafka
    statefulset.apps/prod-cp-kafka scaled
    
    Wait for the prod-cp-kafka pods to scale down, then scale the prod-cp-kafka pods back up.
    $ kubectl -n default scale statefulset --replicas 3 prod-cp-kafka
    statefulset.apps/prod-cp-kafka scaled
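
To confirm that the scale operations in step 1 (and, if performed, step 2) completed, one option is to check the StatefulSet status directly. The output below is illustrative; with the replica counts used above, both StatefulSets should eventually report 3/3 ready.

  $ kubectl -n default get statefulset prod-cp-zookeeper prod-cp-kafka
  NAME                READY   AGE
  prod-cp-zookeeper   3/3     26d
  prod-cp-kafka       3/3     26d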
    

Recovering an application after node failure

When a node fails for an extended period and cannot be recovered immediately, the services that were running on it also remain offline. In this scenario, Kubernetes must be instructed to restart these services on the remaining nodes so that the application recovers as quickly as possible.
Execute the commands from a system that has its kubeconfig environment configured to reach the Fabric Services System Kubernetes cluster.
  1. Verify which node is offline.
    In this example, fss-node03 is offline.
    $ kubectl get nodes
    NAME         STATUS     ROLES                  AGE   VERSION
    fss-node01   Ready      control-plane,master   26d   v1.23.1
    fss-node02   Ready      control-plane,master   26d   v1.23.1
    fss-node03   NotReady   control-plane,master   26d   v1.23.1
    fss-node04   Ready      <none>                 26d   v1.23.1
    fss-node05   Ready      <none>                 26d   v1.23.1
    fss-node06   Ready      <none>                 26d   v1.23.1
    
    
  2. Find all pods running on this node that have failed.
    $ kubectl get pods -n default --field-selector spec.nodeName=fss-node03
    NAME                                             READY   STATUS        RESTARTS      AGE
    prod-cp-kafka-2                                  2/2     Terminating   0             26d
    prod-cp-zookeeper-1                              2/2     Terminating   0             26d
    prod-fss-cfggen-86859cfb5f-ctn54                 1/1     Terminating   0             26d
    prod-fss-da-78f9fdfc7c-cwsv9                     1/1     Terminating   0             26d
    prod-fss-da-78f9fdfc7c-rtrc6                     1/1     Terminating   0             26d
    prod-fss-da-78f9fdfc7c-wltzd                     1/1     Terminating   0             26d
    prod-fss-deviationmgr-netinst-7d7fc645bd-qbkf6   1/1     Terminating   0             26d
    prod-fss-digitalsandbox-7d86cc5fc4-7xfxn         1/1     Terminating   2 (26d ago)   26d
    prod-fss-oper-da-67c6d6c6bb-2bzhx                1/1     Terminating   0             26d
    prod-fss-oper-da-67c6d6c6bb-8r4w9                1/1     Terminating   0             26d
    prod-fss-oper-topomgr-6548c8d6c4-vsttk           1/1     Terminating   1 (26d ago)   26d
    prod-fss-topomgr-5f997b544d-4mfnk                1/1     Terminating   0             26d
    prod-fss-version-f5b4d74f-9nhss                  1/1     Terminating   0             26d
    prod-fss-workloadmgr-64ffcf7547-rrvbc            1/1     Terminating   1 (26d ago)   26d
    prod-fss-ztp-7bd78ccd9-x5vb7                     1/1     Terminating   1 (26d ago)   26d
    prod-mongodb-arbiter-0                           1/1     Terminating   0             26d
    prod-mongodb-secondary-0                         1/1     Terminating   0             26d
    prod-neo4j-core-0                                1/1     Terminating   0             26d
    prod-postgresql-0                                1/1     Terminating   0             26d
    
  3. Wait until all of these pods show the Terminating status.
  4. Force delete all of the pods that are in the Terminating state, as shown in the output of Step 2.
    Enter the following command for each pod; a single-command alternative is sketched at the end of this procedure:
    $ kubectl delete pods --grace-period=0 --force <pod-name>
    $ kubectl delete pods --grace-period=0 --force prod-fss-cfggen-86859cfb5f-ctn54
    Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
    pod "prod-fss-cfggen-86859cfb5f-ctn54" force deleted
    
  5. Wait for all pods in the default namespace to return to a Running state; a command to watch their progress is shown at the end of this procedure.
    This step can take longer if some pods enter a CrashLoopBackOff state, because Kubernetes restarts them with an increasing delay between attempts.
  6. If, after the kafka, zookeeper, postgresql, mongodb, and neo4j pods are all in a Running state, other pods continue to restart and enter a CrashLoopBackOff state, verify that all kafka pods are running.
    If they are not, force another restart of the zookeeper pods by executing the following command:
    $ kubectl -n default scale statefulset --replicas 0 prod-cp-zookeeper
    statefulset.apps/prod-cp-zookeeper scaled
    
    
    Wait until the prod-cp-zookeeper pods have scaled down, then scale the prod-cp-zookeeper pods back up.
    
    $ kubectl -n default scale statefulset --replicas 3 prod-cp-zookeeper
    statefulset.apps/prod-cp-zookeeper scaled
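
As the single-command alternative mentioned in Step 4, the pods stuck in the Terminating state on the failed node can be force deleted in one operation by selecting on the node name. This is only a sketch: substitute the name of your failed node for fss-node03, and note that the same immediate-deletion warning applies.

  $ kubectl -n default delete pods --grace-period=0 --force --field-selector spec.nodeName=fss-node03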
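
To monitor the recovery described in Step 5, one option is to leave a watch running until all pods in the default namespace report a Running status, then stop it with Ctrl+C.

  $ kubectl -n default get pods --watch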