NSP system failure and recovery scenarios

Introduction

The following topics describe how the NSP recovers from a redundancy failure. A failure scenario may apply to multiple deployment configurations. The scenarios are examples only.

Primary NSP cluster failure

The redundant nsp-role-manager agents exchange a heartbeat every five seconds. If the agent on the standby cluster does not receive a heartbeat within 60 seconds, the standby cluster is promoted to primary. The new primary cluster subsequently communicates with the NFM-P and the newly active VSR-NRC. A switchover can also be triggered when base services on the primary cluster stop running.
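The heartbeat timeout described above can be sketched as follows. This is a minimal illustration only, not the nsp-role-manager implementation: the class name StandbyAgent, its methods, and the choice to promote at exactly 60 seconds of silence are assumptions made for the example. Only the 5-second interval and 60-second timeout come from the text.

```python
from dataclasses import dataclass

HEARTBEAT_INTERVAL_S = 5   # from the text: heartbeat exchanged every five seconds
HEARTBEAT_TIMEOUT_S = 60   # from the text: promotion after 60 seconds of silence


@dataclass
class StandbyAgent:
    """Hypothetical model of the role-manager agent on the standby cluster."""
    last_heartbeat: float      # time (in seconds) of the last heartbeat received
    role: str = "standby"

    def on_heartbeat(self, now: float) -> None:
        # Record the arrival time of a heartbeat from the primary cluster.
        self.last_heartbeat = now

    def check(self, now: float) -> str:
        # Promote to primary once no heartbeat has arrived for the timeout period.
        # Treating exactly 60 seconds as a timeout is an assumption of this sketch.
        if self.role == "standby" and now - self.last_heartbeat >= HEARTBEAT_TIMEOUT_S:
            self.role = "primary"
        return self.role


agent = StandbyAgent(last_heartbeat=0.0)
agent.on_heartbeat(55.0)     # last heartbeat seen at t = 55 s
print(agent.check(100.0))    # 45 s of silence: role is unchanged
print(agent.check(115.0))    # 60 s of silence: the standby is promoted
```

Time is passed in explicitly rather than read from a clock so that the timeout logic can be exercised without waiting.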

NSP cluster communication failure

When communication between the NSP clusters fails, each NSP cluster assumes the active role, which creates a split-brain scenario. A loss of communication between the primary and standby NSP clusters that lasts 60 seconds may trigger a switchover.

After communication in a split-brain scenario is restored, the NSP cluster with the higher uptime value assumes the primary role, and the peer cluster assumes the standby role. The assumption is that the cluster running for the longer time was the primary cluster at the time of the loss. In such a scenario, the clients continue to communicate with the same primary cluster.
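The role resolution after a split-brain scenario can be sketched as a comparison of uptime values. This is a minimal sketch under stated assumptions: the Cluster type, the resolve_split_brain function, and the tie-breaking behavior (the first argument wins on equal uptime) are hypothetical and are not taken from the NSP software.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical record of an NSP cluster during a split-brain scenario."""
    name: str
    uptime_s: float
    role: str = "primary"   # during split brain, both clusters act as primary


def resolve_split_brain(a: Cluster, b: Cluster) -> Cluster:
    # The cluster with the higher uptime keeps the primary role; the peer
    # becomes standby. Tie handling is an assumption of this sketch.
    primary, standby = (a, b) if a.uptime_s >= b.uptime_s else (b, a)
    primary.role = "primary"
    standby.role = "standby"
    return primary


site_a = Cluster("site-a", uptime_s=86_400.0)   # running for one day
site_b = Cluster("site-b", uptime_s=120.0)      # promoted two minutes ago
winner = resolve_split_brain(site_a, site_b)
print(winner.name, site_a.role, site_b.role)
```

In this example, the longer-running cluster retains the primary role, so clients that were connected to it before the communication loss are unaffected.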

Figure 8-2: Primary and standby NSP cluster communication failure