Sync failure and recovery
Sync failure
If the sync connection between the active and standby systems fails, the standby keeps trying to sync with the active system for 90 seconds. The connection between the active and standby systems is considered lost after three missed heartbeats (approximately 90 seconds).
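The loss-detection rule above can be sketched as a small heartbeat monitor. This is only an illustration: the 30-second interval is an assumption derived from "three missed heartbeats (approximately 90 seconds)", and the class and method names are hypothetical, not part of the product.

```python
from dataclasses import dataclass

# Assumed values: the text states only "three missed heartbeats
# (approximately 90 seconds)", which implies roughly 30 seconds per heartbeat.
HEARTBEAT_INTERVAL_S = 30.0
MAX_MISSED_HEARTBEATS = 3


@dataclass
class HeartbeatMonitor:
    """Hypothetical helper: tracks the last heartbeat and flags a lost peer."""

    last_heartbeat_s: float
    interval_s: float = HEARTBEAT_INTERVAL_S
    max_missed: int = MAX_MISSED_HEARTBEATS

    def record_heartbeat(self, now_s: float) -> None:
        # Called whenever a heartbeat arrives from the peer.
        self.last_heartbeat_s = now_s

    def missed_heartbeats(self, now_s: float) -> int:
        # Number of whole heartbeat intervals elapsed since the last heartbeat.
        elapsed = max(0.0, now_s - self.last_heartbeat_s)
        return int(elapsed // self.interval_s)

    def connection_lost(self, now_s: float) -> bool:
        # The connection counts as lost once the missed-heartbeat limit is reached.
        return self.missed_heartbeats(now_s) >= self.max_missed
```

With these assumed values, a monitor whose last heartbeat arrived at time 0 still treats the peer as connected at 89 seconds and reports the connection as lost at 90 seconds.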
Recovery
After a sync failure, you can do one of the following:
- Attempt to recover from a temporary sync failure.
For instructions, see Recovering from sync failure.
- Initiate the switchover of roles between the two clusters, if the system with the active role is still reachable.
For instructions, see Initiating failover: switching between the active and standby clusters.
- Initiate standalone operation on the standby.
For instructions, see Making the standby system active - standalone operation.
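One possible reading of these options, expressed as a decision helper. The mapping is an assumption made for illustration: the text ties sync recovery to a temporary failure and switchover to a reachable active system, but it does not state when standalone operation is the right choice, so treating "active unreachable" as the standalone case is a guess. The function and enum names are hypothetical.

```python
from enum import Enum


class RecoveryAction(Enum):
    # Values are the titles of the procedures referenced in the text.
    RECOVER_SYNC = "Recovering from sync failure"
    SWITCHOVER = "Initiating failover: switching between the active and standby clusters"
    STANDALONE = "Making the standby system active - standalone operation"


def choose_recovery_action(active_reachable: bool, failure_is_temporary: bool) -> RecoveryAction:
    """Pick a procedure to consult; the branching rules are assumptions."""
    if not active_reachable:
        # Assumption: with no reachable active system, only the standby can act.
        return RecoveryAction.STANDALONE
    if failure_is_temporary:
        # Both clusters are available and the failure was a glitch.
        return RecoveryAction.RECOVER_SYNC
    # The active system is reachable, so roles can be switched.
    return RecoveryAction.SWITCHOVER
```

The helper only names which procedure to read; an operator still makes the final call based on the state of both clusters.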
Recovering from sync failure
If the sync failure is caused by a temporary glitch, and both the active and standby clusters are still available, use the following procedure to recover the sync connection and initiate a force start on the active cluster.
- From the main menu of the active system, select Geo-Redundancy.
- On the upper-right, click Start Syncing.
WARNING:
Before proceeding to the next step, ensure that there are no pending workload jobs, deployments, or any operations that could potentially modify the database in the background.
- Click Reconcile.
Initiating failover: switching between the active and standby clusters
- Do not make configuration changes until the system is stable and geo-redundancy has been restored. If you must make a configuration change, use only the active site, not the standby.
- If both clusters are still active, ensure that both clusters are in maintenance mode. Do not restart SR Linux nodes.
This procedure applies to the following scenarios:
- Maintenance: if you need to perform maintenance activities on the active system.
Note: You need to disable the sync connection first, as shown in step 1 of this procedure.
- Disaster recovery: if cluster 1 becomes unreachable and you want to make cluster 2 the active cluster.
- If you are performing maintenance on the active cluster, stop the sync connection from the active cluster. Skip this step if you are switching over for disaster recovery reasons.
On the Geo-Redundancy page of the active system, click Stop Sync. The standby site automatically enters the Sync Aborted state when the active cluster has stopped syncing. Wait until the status of the standby system is Sync Aborted, then continue to step 2.
- From the Geo-Redundancy page on the standby system, run an audit.
From the standby Audit drop-down menu, click AuditStart. The status of the audit is shown on the Geo-Redundancy page.
Note: The auth pod can go into a crash loop for a few restarts.
When the audit completes, continue with the next step.
- From the upper-right drop-down list, select ForceStandalone.
- Optional: If you are performing this procedure for maintenance, click CONFIGURE; otherwise, skip this step.
- In the Local Site section, enable the Active field.
- In the Remote Site section, disable the Active field.
- Optional: If you are performing this procedure for maintenance, click SAVE; otherwise, skip this step.
- Upload software images to the standby system.
- Align the deployer configuration between the active and standby systems.
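Step 1 of this procedure waits for the standby to report the Sync Aborted state before continuing. The text describes this as a visual check on the Geo-Redundancy page and documents no status API, so the sketch below takes a hypothetical `get_status` callable supplied by the caller. The timeout and polling interval are also assumed values, not product defaults.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    expected: str = "Sync Aborted",
    timeout_s: float = 300.0,        # assumed value
    poll_interval_s: float = 5.0,    # assumed value
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_status() until it returns `expected` or the timeout expires.

    get_status is a hypothetical callable; how the status is actually
    obtained is outside the scope of this sketch.
    """
    deadline = clock() + timeout_s
    while True:
        if get_status() == expected:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

The `sleep` and `clock` parameters are injectable so the loop can be exercised without real waiting. A return value of `False` means the expected state was not observed in time, and the procedure should not continue to the audit step.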