Disaster recovery

A DR NSP deployment consists of identical primary and standby NSP clusters and ancillary components in separate, geographically distributed data centers, or “sites”. One cluster holds the primary role and processes all client requests.

The standby NSP cluster in a DR deployment operates in warm standby mode. If a primary cluster failure is detected, the standby automatically initializes as the primary, and fully assumes the primary role.

Note: Nokia strongly recommends that all primary components in a DR deployment be in the same physical facility. An NSP administrator can align the NSP component roles, as required.

NSP Role Manager

In a DR NSP deployment, the Role Manager runs in an NSP cluster and acts as a Kubernetes controller. The Role Manager monitors the Kubernetes objects for changes, and updates the objects as required based on the current primary or standby site role.

The Role Manager has the following operation modes:

standalone: The Role Manager sets the cluster mode to 'active' at initialization time, and does nothing more.
DR: The Role Manager negotiates the local role with the DR peer, determining which cluster will run in 'active' and which in 'standby' mode.

The Role Manager uses the configuration in the dr section of the NSP configuration file to identify the local and peer sites.
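The two operation modes can be pictured with a minimal Python sketch. This is an illustration only, not the Role Manager implementation: the names Mode, Role, and initial_role are hypothetical, and the negotiation with the DR peer is left as a caller-supplied function because the negotiation procedure is not described here.

```python
from enum import Enum
from typing import Callable, Optional


class Mode(Enum):
    STANDALONE = "standalone"
    DR = "DR"


class Role(Enum):
    ACTIVE = "active"
    STANDBY = "standby"


def initial_role(mode: Mode,
                 negotiate: Optional[Callable[[], Role]] = None) -> Role:
    """Return the local cluster role at initialization time.

    `negotiate` is a hypothetical stand-in for the exchange with the
    DR peer; how the real negotiation works is not modeled here.
    """
    if mode is Mode.STANDALONE:
        # Standalone: the cluster is set to 'active' and nothing more happens.
        return Role.ACTIVE
    if negotiate is None:
        raise ValueError("DR mode requires a negotiation function")
    # DR: the local role is whatever the negotiation with the peer decides.
    return negotiate()
```

For example, initial_role(Mode.STANDALONE) returns Role.ACTIVE, while in DR mode the result depends entirely on the supplied negotiation function.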

The NSP monitors the following NSP base services in a DR deployment:

Kafka
Keycloak
NSP Tomcat
nspOS Tomcat
PostgreSQL
Prometheus
ZooKeeper

DR fault conditions

If any base service in a DR deployment is unavailable for more than three minutes, or two instances of a service in an HA+DR deployment are unavailable for more than three minutes, the following occur:

An activity switch occurs; consequently, the peer NSP cluster assumes the primary role.
An alarm is raised against the service or containing pod to indicate that the service or pod is down.

Note: Depending on the circumstances, a base service disruption may not generate such an alarm.

A major ActivitySwitch alarm is raised against the former active site, which is now the standby site.
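The trigger condition above can be sketched as a small Python check. This is a simplified model under stated assumptions, not NSP code: ServiceStatus, should_switch, and the per-instance downtime representation are hypothetical, and a plain DR deployment is assumed to run one instance of each base service.

```python
from dataclasses import dataclass
from typing import List

OUTAGE_THRESHOLD_S = 180  # three minutes, as stated in the text


@dataclass
class ServiceStatus:
    name: str
    # Seconds each instance has been continuously unavailable (0 = healthy).
    instance_downtime_s: List[float]


def should_switch(services: List[ServiceStatus], ha: bool) -> bool:
    """Return True if the fault condition for an activity switch is met.

    DR:    any service unavailable for more than three minutes.
    HA+DR: two instances of one service unavailable for more than
           three minutes.
    """
    required_down = 2 if ha else 1
    for svc in services:
        # Count instances whose outage exceeds the threshold.
        down = sum(1 for d in svc.instance_downtime_s
                   if d > OUTAGE_THRESHOLD_S)
        if down >= required_down:
            return True
    return False
```

In this model, a single PostgreSQL instance that has been down for 200 seconds triggers a switch in a DR deployment, but not in an HA+DR deployment unless a second PostgreSQL instance has also been down for more than three minutes.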

The NSP raises the following alarms against the NmsSystem object in response to such a failure:

ActivitySwitch—severity Major
NspApplicationPodDown—severity Critical

Note: If you clear an alarm while the failure condition is still present, the NSP does not raise the alarm again.

The following example describes an alarm condition in a simple DR deployment.

An nspOS service at the primary site fails.
An activity switch occurs; the standby site consequently assumes the primary role.
A major ActivitySwitch alarm is raised against the former primary site, which is now the standby site.

DR for integrated components

A DR NSP deployment can include NFM-P and WS-NOC. NFM-P can be standalone or redundant; however, WS-NOC must be redundant. For example, in a DR deployment that includes classic management, NFM-P can be standalone while WS-NOC is redundant.

The following figure shows a simple NSP DR deployment.

Figure 8-1: NSP DR deployment with integrated NFM-P