NSP application downtime in HA and redundant deployments
Overview
NSP components in a redundant configuration experience application downtime during an activity switch. NSP components that support HA deployment may experience application downtime during pod reselection. NSP components that do not support HA deployment may incur downtime for a pod restart.
Time estimates in this section are based on testing in a lab environment with a small number of nodes. Production networks may experience different downtime intervals depending on factors that include, but are not limited to, deployment type, managed network size, and installation options. The estimates are intended as guidance for network engineers and administrators.
Note: The first DR switchover in an NSP deployment after an install or upgrade initializes adaptors on the redundant NSP cluster. As a result, NSP services are expected to recover more slowly on the first DR switchover. Subsequent DR switchovers do not have this performance impact.
Launchpad and Access Token
Users and client applications that need access to Launchpad and NSP APIs will experience downtime during an activity switch. When an activity switch is initiated, all active sessions terminate. Once the new active cluster is up and initialized, new GUI and API sessions can be opened.
Launchpad | Estimate |
---|---|
Login down time for an activity switch in a redundant NSP deployment | 8 minutes |
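A client application can ride out an activity switch by retrying its login until the new active cluster accepts sessions again. The following is a minimal sketch under stated assumptions, not an NSP API: `try_login` is a hypothetical callable supplied by the caller that returns `True` once a session can be opened, and the safety factor and polling interval are illustrative values chosen here, not documented recommendations.

```python
import time
from typing import Callable

# Lab estimate for Launchpad login downtime during an activity switch.
LOGIN_DOWNTIME_ESTIMATE_S = 8 * 60


def wait_for_login(
    try_login: Callable[[], bool],       # hypothetical: returns True when login works
    estimate_s: float = LOGIN_DOWNTIME_ESTIMATE_S,
    safety_factor: float = 2.0,          # assumption: production may be slower than the lab
    poll_interval_s: float = 15.0,       # assumption: arbitrary polling period
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll try_login until it succeeds or the padded estimate has elapsed."""
    deadline = clock() + estimate_s * safety_factor
    while True:
        if try_login():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting. In normal use the defaults apply, and a `False` result indicates that recovery took longer than the padded estimate.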
Service management
Service provisioning activities will be impacted by a DR switchover from the active to the standby NSP cluster.
Service management | Estimate |
---|---|
Down time for an activity switch from active to standby in a redundant deployment | 10 minutes |
PCE operations
PCE operations will be affected by an HA switchover in an HA deployment and by an activity switch in a redundant deployment. PCE operations support HA functionality through a replica pod.
PCE operations | Estimate |
---|---|
Down time for pod reselection in an HA deployment | 30 seconds |
Down time for an activity switch from active to standby in a redundant deployment | 15 minutes |
Alarms
Alarm events and updates will be affected during HA switchovers and DR activity switches.
Alarms | Estimate |
---|---|
Down time for alarm event notifications due to HA switchover | up to 4 minutes |
Down time for alarm updates due to HA switchover | up to 4 minutes |
Down time for alarm event notifications due to a DR activity switch | up to 10 minutes |
Down time for alarm updates due to a DR activity switch | up to 10 minutes |
Telemetry Collection
In an N+M MDM deployment, telemetry collection will be impacted during HA and DR switchovers.
Active NSP cluster MDM pod restart | Telemetry Collection Down Time |
---|---|
Reset 1 active MDM server pod (switch to protection pod) | 15 seconds |
Reset all active MDM server pods | 90 seconds |
When the new active NSP cluster starts up following a DR switchover, the telemetry application must wait for all MDM servers and the NFM-P main server to be active, and for the Postgres DB to be up and running, before it enables previously persisted telemetry subscriptions.
DR switchover type | Restconf replays telemetry subscription | MDM server completes subscriptions to NEs | Total Down Time |
---|---|---|---|
Manual switchover | 7 minutes | 5 minutes | 12 minutes |
Automatic switchover | 4 minutes | 5 minutes | 9 minutes |
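The totals in the table above are consistent with the two phases running one after the other, so the total is the sum of the phase durations. The short sketch below reproduces that arithmetic; the dictionary keys and function name are illustrative and not NSP identifiers.

```python
# Phase durations in minutes, copied from the table above.
# Key names are illustrative labels, not NSP identifiers.
PHASES_MIN = {
    "manual": {"restconf_replay": 7, "mdm_subscribe": 5},
    "automatic": {"restconf_replay": 4, "mdm_subscribe": 5},
}


def total_downtime_min(switchover_type: str) -> int:
    """Sum the sequential phases for the given DR switchover type."""
    return sum(PHASES_MIN[switchover_type].values())


print(total_downtime_min("manual"))     # 12
print(total_downtime_min("automatic"))  # 9
```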
For deployments with CN Telemetry, the following manual DR switchover times can be anticipated:
Test environment dimensions: 4000 NEs under management with throughput factor 2 configured in nsp-config.yml
Table 5-10: gNMI collection with output to DB and Kafka
Event following DR switchover | Elapsed time |
---|---|
Collector pods up | 120 seconds |
First NE record processed | 7 minutes |
Ramp to 4000 NEs and 3000 records/s | 37 minutes |
Table 5-11: Accounting collection with output to DB, Kafka and file
Event following DR switchover | Elapsed time |
---|---|
Accounting processor pods up | 160 seconds |
First NE file processed | 7 minutes |
Ramp to 4000 NEs and 3.88 million records / 15 minutes | 27 minutes |
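The two tables state their target rates in different units: records per second for gNMI collection and records per 15 minutes for accounting collection. To compare them, the accounting figure can be converted to records per second. The helper below is an illustrative sketch and its name is not an NSP identifier.

```python
def records_per_second(records: float, window_minutes: float) -> float:
    """Convert a record count over a window in minutes to a per-second rate."""
    return records / (window_minutes * 60)


# Accounting target from Table 5-11: 3.88 million records per 15 minutes.
accounting_rate = records_per_second(3.88e6, 15)
print(round(accounting_rate))  # 4311
```

At roughly 4,311 records per second, the accounting target is of the same order as the 3000 records/s gNMI target in Table 5-10.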