What are the NSP cluster DR functions?

Description

The NSP disaster recovery (DR) function involves redundant NSP clusters in a warm standby configuration for fault tolerance in the event of a cluster failure. The following procedures describe how to control and manage NSP DR.

DR functions

The following NSP DR functions swap the primary and standby NSP cluster roles:

failover—automatic DR role change initiated by the standby NSP cluster when a primary cluster failure is suspected
switchover—manual DR operation that switches the NSP cluster roles

Failovers and switchovers

NSP DR failovers and switchovers are controlled by the ASM and role manager services, which run as the nspos-asm-app and nsp-role-manager pods in each DR NSP cluster. The standby role manager periodically checks the connectivity to the role manager in the primary NSP cluster.

In addition, the role manager monitors essential primary pods and services such as the following:

ZooKeeper
Kafka
PostgreSQL
nspos-tomcat
nsp-tomcat
Keycloak
prometheus-server

If the role manager connectivity check fails for two minutes, or if an essential primary pod or service is down, the ASM triggers a failover to the standby cluster. The standby role manager and NSP cluster then assume the primary role. When the fault is resolved, the NSP automatically returns to normal operation with functional primary and standby clusters.
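The failover trigger described above can be illustrated with a minimal sketch. This is not the ASM or role manager implementation; the names PrimaryHealth, should_fail_over, and CONNECTIVITY_TIMEOUT_S are hypothetical, and treating "fails for two minutes" as 120 seconds or more since the last successful check is an assumption.

```python
from dataclasses import dataclass, field

# Two-minute window taken from the text; the constant name is hypothetical.
CONNECTIVITY_TIMEOUT_S = 120.0


@dataclass
class PrimaryHealth:
    """Standby-side view of the primary cluster (illustrative only)."""
    # Time, in seconds, of the last successful role manager connectivity check.
    last_contact_s: float
    # Essential primary pods or services currently reported down, e.g. {"Kafka"}.
    down_services: set[str] = field(default_factory=set)


def should_fail_over(health: PrimaryHealth, now_s: float) -> bool:
    """Return True if either failover condition from the text is met."""
    # Condition 1: connectivity check has been failing for the full window.
    connectivity_lost = (now_s - health.last_contact_s) >= CONNECTIVITY_TIMEOUT_S
    # Condition 2: at least one essential primary pod or service is down.
    essential_down = bool(health.down_services)
    return connectivity_lost or essential_down
```

In this sketch, either condition alone is sufficient, which mirrors the "or" in the description: a reachable primary with a failed essential service still results in a failover.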

To display which role, primary or standby, is assigned to each NSP cluster, see How do I identify the NSP cluster DR roles?. To restore the initial cluster roles after a failover, you perform a manual switchover, as described in How do I perform an NSP DR switchover?.

Note: After a failover or switchover, NSP functions restart processes that were interrupted. If downstream functions are not up yet, the restarted processes may fail. For example, if a network configuration deployment was auditing at the time of a failover, the audit will restart when Infrastructure Configuration Management is up. If Network Intents is not back up yet when the audit is restarted, the audit will fail. The process can be restarted manually when the NSP has stabilized.

Disabling and enabling failovers

NSP DR failovers are enabled by default in a DR NSP deployment. If required, you can disable failovers to prevent disruption during a period of maintenance activity, as described in How do I disable NSP DR failovers?

To identify whether failovers are enabled, see How do I display the NSP DR failover setting?.

Note: The failover setting persists through an NSP software upgrade.

Note: For maximum fault tolerance, disable failovers only for the duration of a maintenance period, and re-enable them as soon as the maintenance period ends, as described in How do I enable NSP DR failovers?.

Standby cluster alarms

The NSP raises the following alarms for a standby NSP cluster:

PodDownAlarm
DiskSpaceBelowThresholdAlarm
NodeMemoryBelowThresholdAlarm
ServerMemoryBelowThresholdAlarm
BaseServiceDownAlarm (InstanceDown)
ClusterNodeDownAlarm

Before you take action to respond to an alarm, you must clearly identify the system raising the alarm and the node or pod at fault.

For better visibility in the standby cluster, the Source Type field of an alarm indicates the Site ID, Site Name, or Alarmed Object Name.

The Site ID has this format: dc-name:node-name
The Site Name has this format: dc-name:node-name
The Alarmed Object Name has this format: dc-name:node-name:pod-name

where dc-name is the DR data center name.

When the pod is pending, the node-name is N/A.

For example, depending on the Source Type field of a ServerMemoryBelowThreshold alarm, the operator views the Site ID, Site Name, or Alarmed Object Name field to identify which node or pod is at fault.

If the Site ID is DR1:node1, node1 in the DR1 data center is at fault.
If the Site Name is DR1:dr1-node1, dr1-node1 in the DR1 data center is at fault.
If the Alarmed Object Name is DR1:dr1-node1:nspos-app1-tomcat-jmx-svc, the nspos-app1-tomcat-jmx-svc pod on dr1-node1 in the DR1 data center is at fault.
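The identifier formats above can be split into their parts with a short helper, for example when sorting standby cluster alarms by data center or node. This is a minimal sketch; AlarmSource and parse_alarm_source are hypothetical names and are not part of any NSP API. It assumes that the names contain no colons and that a pending pod is reported with the literal node-name N/A.

```python
from typing import NamedTuple, Optional


class AlarmSource(NamedTuple):
    dc_name: str
    node_name: Optional[str]  # None when the node-name is reported as N/A
    pod_name: Optional[str]   # None for Site ID and Site Name values


def parse_alarm_source(value: str) -> AlarmSource:
    """Split dc-name:node-name or dc-name:node-name:pod-name into fields."""
    parts = value.split(":")
    if len(parts) not in (2, 3):
        raise ValueError(f"unexpected identifier format: {value!r}")
    dc_name, node_name = parts[0], parts[1]
    pod_name = parts[2] if len(parts) == 3 else None
    # A pending pod has no assigned node, so the node-name is N/A.
    return AlarmSource(dc_name, None if node_name == "N/A" else node_name, pod_name)
```

For the Alarmed Object Name in the last example, parse_alarm_source("DR1:dr1-node1:nspos-app1-tomcat-jmx-svc") yields the data center DR1, the node dr1-node1, and the pod nspos-app1-tomcat-jmx-svc.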