NSP application downtime in HA and redundant deployments

Overview

NSP components in a redundant configuration will experience application downtime during an activity switch. NSP components that support HA deployment may experience application downtime during pod reselection. NSP components that do not support HA deployment may incur downtime for a pod restart.

Time estimates in this section are based on testing in a lab environment with a small number of nodes. Customer production networks may experience different downtime intervals depending on factors that include, but are not limited to, deployment type, managed network size and installation options. The estimates are intended as guidance for network engineers and administrators.

Note: The first DR switchover in an NSP deployment after an install or upgrade initializes adaptors on the redundant NSP cluster. As a result, NSP services are expected to recover more slowly on the first DR switchover. Subsequent DR switchovers do not have this performance impact.

Launchpad and Access Token

Users and client applications that need access to Launchpad and NSP APIs will experience downtime during an activity switch. When an activity switch is initiated, all active sessions terminate. Once the new active cluster is up and initialized, new GUI and API sessions can be opened.

Launchpad
Login down time for an activity switch in a redundant NSP deployment	8 minutes
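
Because all active sessions terminate during an activity switch, a client application has to wait and then open a new session. The following is a minimal Python sketch of such a wait loop, not an NSP API. The names wait_for_service and probe are hypothetical: probe is any caller-supplied function that returns True once a new session can be opened. The 15-minute default timeout is an assumed margin over the 8-minute lab estimate and should be tuned for the deployment.

```python
import time
from typing import Callable


def wait_for_service(
    probe: Callable[[], bool],
    timeout_s: float = 15 * 60,   # assumed margin over the 8-minute lab estimate
    interval_s: float = 15.0,     # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll probe() until it returns True and return the elapsed seconds.

    Raises TimeoutError if probe() has not succeeded within timeout_s.
    """
    start = clock()
    while True:
        try:
            if probe():
                return clock() - start
        except Exception:
            # Errors such as refused connections are treated as "still down".
            pass
        if clock() - start >= timeout_s:
            raise TimeoutError(f"service not available after {timeout_s} seconds")
        sleep(interval_s)
```

The clock and sleep parameters exist only so that the loop can be exercised without real waiting; in normal use they can be left at their defaults.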

Service management

Service provisioning activities will be impacted by a DR switchover from the active to the standby NSP cluster.

Service management
Down time for an activity switch from active to standby in a redundant deployment	10 minutes

PCE operations

PCE operations will be affected by an HA switchover in an HA deployment and by an activity switch in a redundant deployment. PCE operations support HA functionality through a replica pod.

PCE operations
Down time for pod reselection in an HA deployment	30 seconds
Down time for an activity switch from active to standby in a redundant deployment	15 minutes

Alarms

Alarm events and updates will be affected during HA switchovers and DR activity switches.

Alarms
Down time for alarm event notifications due to HA switchover	up to 4 minutes
Down time for alarm updates due to HA switchover	up to 4 minutes
Down time for alarm event notifications due to a DR activity switch	up to 10 minutes
Down time for alarm updates due to a DR activity switch	up to 10 minutes

Telemetry Collection

In an N+M MDM deployment, telemetry collection will be impacted during HA and DR switchovers.

Active NSP cluster MDM pod restart	Telemetry Collection Down Time
Reset 1 active MDM server pod (switch to protection pod)	15 seconds
Reset all active MDM server pods	90 seconds

When the new active NSP is starting up following a DR switchover, the telemetry application must wait for all MDM servers and the NFM-P main server to be active, and for the Postgres DB to be up and running, before it enables previously persisted telemetry subscriptions.

DR switchover type	Restconf replays telemetry subscription	MDM server completes subscriptions to NEs	Total Down Time
Manual switchover	7 minutes	5 minutes	12 minutes
Automatic switchover	4 minutes	5 minutes	9 minutes
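
The totals in the table above are the sum of the two recovery phases. As a small illustration, the arithmetic can be sketched in Python; the dictionary keys and the total_downtime helper are hypothetical names chosen for this example, and the durations are the lab estimates from the table.

```python
from datetime import timedelta

# Phase durations in minutes, copied from the DR switchover table above.
PHASES = {
    "manual": {"restconf_replay": 7, "mdm_subscribe": 5},
    "automatic": {"restconf_replay": 4, "mdm_subscribe": 5},
}


def total_downtime(switchover_type: str) -> timedelta:
    """Return the total telemetry downtime as the sum of both phases."""
    return timedelta(minutes=sum(PHASES[switchover_type].values()))


for kind in PHASES:
    print(f"{kind}: {total_downtime(kind)}")
# prints:
# manual: 0:12:00
# automatic: 0:09:00
```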

For deployments with CN Telemetry, the following manual DR switchover times can be anticipated:

Test environment dimensions: 4000 NEs under management with throughput factor 2 configured in nsp-config.yml

Table 5-10: gNMI collection with output to DB and Kafka

Event following DR switchover	Elapsed time
Collector pods up	120 seconds
First NE record processed	7 minutes
Ramp to 4000 NEs and 3000 records/s	37 minutes

Table 5-11: Accounting collection with output to DB, Kafka and file

Event following DR switchover	Elapsed time
Accounting processor pods up	160 seconds
First NE file processed	7 minutes
Ramp to 4000 NEs and 3.88 million records / 15 minutes	27 minutes
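
The two tables state throughput in different units (records per second for gNMI, records per 15 minutes for accounting). To compare them, the accounting figure can be converted to a per-second rate. The sketch below assumes that "3.88 million records / 15 minutes" means 3.88 million records processed in each 15-minute interval; records_per_second is a hypothetical helper name.

```python
def records_per_second(records: float, window_minutes: float) -> float:
    """Convert a record count per time window into records per second."""
    return records / (window_minutes * 60)


# Accounting collection: 3.88 million records per 15 minutes (assumed reading).
accounting_rate = records_per_second(3.88e6, 15)
print(f"accounting: about {accounting_rate:.0f} records/s")
# prints: accounting: about 4311 records/s
```

Under that assumption, the accounting test reaches roughly 4311 records/s, compared with 3000 records/s for the gNMI test.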