NSP application downtime in HA and redundant deployments

Overview

NSP components in a redundant configuration experience application downtime during an activity switch. NSP components that support HA deployment may experience application downtime during pod reselection. NSP components that do not support HA deployment may incur downtime for a pod restart.

Time estimates in this section are based on testing in a lab environment with a small number of nodes. Downtime in customer production networks may differ depending on factors that include deployment type, managed network size, and installation options. The estimates are intended as guidance for network engineers and administrators.

Note: The first DR switchover in an NSP deployment after an install or upgrade initializes adaptors on the redundant NSP cluster. As a result, NSP services recover more slowly on the first DR switchover. Subsequent DR switchovers do not have this performance impact.

Launchpad and Access Token

Users and client applications that need access to Launchpad and NSP APIs experience downtime during an activity switch. When an activity switch is initiated, all active sessions terminate. Once the new active NSP cluster is up and initialized, new GUI and API sessions can be opened.
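For API clients, the session loss described above is typically handled by retrying the login until the new active cluster accepts sessions. The following is a minimal sketch of that pattern, not an NSP client: `try_login` is a hypothetical callable supplied by the caller that returns a token on success and `None` on failure, and the 15-second polling interval is an assumption. The default budget uses the 8-minute lab estimate for Launchpad login downtime; production deployments, and the first DR switchover after an install or upgrade, may need a larger value.

```python
import time
from typing import Callable, Optional


def wait_for_login(
    try_login: Callable[[], Optional[str]],   # hypothetical: token or None
    budget_s: float = 8 * 60,                 # lab estimate for login downtime
    poll_s: float = 15.0,                     # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call try_login until it returns a token or the time budget is spent."""
    deadline = clock() + budget_s
    while True:
        token = try_login()
        if token is not None:
            return token
        # Stop if another wait would run past the deadline.
        if clock() + poll_s > deadline:
            return None
        sleep(poll_s)
```

The clock and sleep functions are parameters so that the loop can be exercised with a simulated clock instead of waiting in real time.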

| Launchpad | Downtime |
|---|---|
| Login downtime for an activity switch in a redundant NSP deployment | 8 minutes |

Service management

Service provisioning activities are affected by a DR switchover from the active to the standby NSP cluster.

| Service management | Downtime |
|---|---|
| Downtime for an activity switch from active to standby in a redundant deployment | 10 minutes |

PCE operations

PCE operations are affected by an HA switchover in an HA deployment and by an activity switch in a redundant deployment. PCE operations support HA through a replica pod.

| PCE operations | Downtime |
|---|---|
| Downtime for pod reselection in an HA deployment | 30 seconds |
| Downtime for an activity switch from active to standby in a redundant deployment | 15 minutes |

Alarms

Alarm events and updates will be affected during HA switchovers and DR activity switches.

| Alarms | Downtime |
|---|---|
| Downtime for alarm event notifications due to HA switchover | up to 4 minutes |
| Downtime for alarm updates due to HA switchover | up to 4 minutes |
| Downtime for alarm event notifications due to a DR activity switch | up to 10 minutes |
| Downtime for alarm updates due to a DR activity switch | up to 10 minutes |

Telemetry Collection

In an N+M MDM deployment, telemetry collection is affected during HA and DR switchovers.

| Active NSP cluster MDM pod restart | Telemetry collection downtime |
|---|---|
| Reset 1 active MDM server pod (switch to protection pod) | 15 seconds |
| Reset all active MDM server pods | 90 seconds |

When the new active NSP starts up following a DR switchover, the telemetry application waits until all MDM servers, the NFM-P main server, and the Postgres database are up and running before it enables previously persisted telemetry subscriptions.
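The startup dependency described above amounts to a readiness gate: subscriptions are enabled only when every prerequisite reports ready. The sketch below illustrates that logic only. It is not NSP code; the probe names and the stubbed probe results are hypothetical, and a real check would query each component.

```python
from typing import Callable, Dict, List

# Hypothetical readiness probe: returns True once its dependency is up.
Probe = Callable[[], bool]


def pending_dependencies(probes: Dict[str, Probe]) -> List[str]:
    """Return the names of dependencies that are not yet ready."""
    return [name for name, is_ready in probes.items() if not is_ready()]


def can_enable_subscriptions(probes: Dict[str, Probe]) -> bool:
    """Subscriptions are enabled only when no dependency is pending."""
    return not pending_dependencies(probes)


# Example with stubbed probes (names and states are illustrative).
example_probes: Dict[str, Probe] = {
    "mdm-server-1": lambda: True,
    "mdm-server-2": lambda: False,  # still starting
    "nfmp-main-server": lambda: True,
    "postgres": lambda: True,
}
print(pending_dependencies(example_probes))      # ['mdm-server-2']
print(can_enable_subscriptions(example_probes))  # False
```

Because the gate waits for the slowest dependency, the time before subscriptions are enabled is bounded below by the startup time of whichever component comes up last.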

| DR switchover type | Restconf replays telemetry subscription | MDM server completes subscriptions to NEs | Total downtime |
|---|---|---|---|
| Manual switchover | 7 minutes | 5 minutes | 12 minutes |
| Automatic switchover | 4 minutes | 5 minutes | 9 minutes |
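In the figures above, each total equals the sum of the two preceding phases, which suggests that the phases run one after the other. The short sketch below reproduces that arithmetic from the lab estimates; the dictionary keys and the function name are illustrative only.

```python
# Lab estimates from the table above, in minutes.
PHASES_MIN = {
    "manual": {"restconf_replay": 7, "mdm_subscribe": 5},
    "automatic": {"restconf_replay": 4, "mdm_subscribe": 5},
}


def total_downtime_min(switchover: str) -> int:
    """Sum the sequential phases for one DR switchover type."""
    return sum(PHASES_MIN[switchover].values())


print(total_downtime_min("manual"))     # 12
print(total_downtime_min("automatic"))  # 9
```

The difference between the two switchover types comes entirely from the Restconf replay phase; the MDM subscription phase is 5 minutes in both cases.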

For deployments with CN Telemetry the following manual DR switchover times can be anticipated:

Test environment dimensions: 4000 NEs under management with throughput factor 2 configured in nsp-config.yml

Table 5-10: gNMI collection with output to DB and Kafka

| Event following DR switchover | Elapsed time |
|---|---|
| Collector pods up | 120 seconds |
| First NE record processed | 7 minutes |
| Ramp to 4000 NEs and 3000 records/s | 37 minutes |

Table 5-11: Accounting collection with output to DB, Kafka and file

| Event following DR switchover | Elapsed time |
|---|---|
| Accounting processor pods up | 160 seconds |
| First NE file processed | 7 minutes |
| Ramp to 4000 NEs and 3.88 million records / 15 minutes | 27 minutes |
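The two tables state their steady-state throughput in different units (records per second for gNMI, records per 15 minutes for accounting). As a convenience, the sketch below converts the accounting figure to records per second so that the two can be read side by side. The function name is illustrative.

```python
def records_per_second(records: float, window_minutes: float) -> float:
    """Convert a record count over a window in minutes to records per second."""
    return records / (window_minutes * 60)


# Accounting lab figure: 3.88 million records per 15 minutes.
accounting_rate = records_per_second(3_880_000, 15)
print(round(accounting_rate))  # 4311
```

The result is roughly 4300 records/s for accounting, compared with the 3000 records/s stated for gNMI collection. The two collection types differ in record format and processing path, so the rates are not a like-for-like performance comparison.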