Sync failure and recovery

Sync failure

If the sync connection between the active and standby systems fails, the standby attempts to re-establish sync with the active instance for 90 seconds. The connection between the active and standby instances is considered lost after three missed heartbeats (approximately 90 seconds).
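
Three heartbeats in approximately 90 seconds implies a heartbeat interval of roughly 30 seconds. The following is a minimal illustrative sketch of this kind of missed-heartbeat detection, not the product's implementation; the interval, peer host name, and ping-based reachability check are assumptions for the example only.

  HEARTBEAT_INTERVAL=30   # assumed seconds between heartbeats
  MAX_MISSED=3            # missed heartbeats before the peer is declared lost

  # Stand-in reachability check; the real system uses its own sync heartbeat.
  peer_heartbeat_ok() {
      ping -c 1 -W 5 "$1" > /dev/null 2>&1
  }

  missed=0
  while [ "$missed" -lt "$MAX_MISSED" ]; do
      sleep "$HEARTBEAT_INTERVAL"
      if peer_heartbeat_ok "standby.example.com"; then
          missed=0
      else
          missed=$((missed + 1))
      fi
  done
  echo "Peer considered lost after ~$((MAX_MISSED * HEARTBEAT_INTERVAL)) seconds"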

Recovery

In a geo-redundant system, if the sync connection fails, the active cluster becomes read-only (the standby is always read-only). An administrator can then decide on the next steps, as described in the following procedures.

Recovering from sync failure

If the sync failure is because of a temporary glitch, and both active and standby clusters are still available, use the following procedure to recover the sync connection and initiate a force start on the active cluster.

  1. From the main menu of the active system, select Geo-Redundancy.
  2. On the upper-right, click Start Syncing.
    WARNING:

    Before proceeding to the next step, ensure that there are no pending workload jobs, deployments, or other operations that could modify the database in the background.

  3. Click Reconcile.

Initiating failover: switching between the active and standby clusters

  • Do not make configuration changes until the system is stable and geo-redundancy has been restored. If you must make a configuration change, use only the active site, not the standby.
  • If both clusters are still active, ensure that both clusters are in maintenance mode. Do not restart SR Linux nodes.

In a geo-redundant system, assume cluster 1 is active and cluster 2 is standby. Use this procedure to switch the roles assigned to cluster 1 and cluster 2 in the following scenarios:
  • Maintenance: if you need to perform maintenance activities on the active system
    Note: You need to stop the sync connection first, as shown in step 1 of this procedure.
  • Disaster recovery: if cluster 1 becomes unreachable and you want to make cluster 2 the active cluster
  1. If you are performing maintenance on the active cluster, stop the sync connection from the active cluster. Skip this step if you are switching over for disaster recovery reasons.
    On the Geo-Redundancy page of the active system, click Stop Sync.
    The standby site automatically enters the Sync Aborted state when the active cluster has stopped syncing. Wait until the status of the standby system is Sync Aborted, then continue to step 2.
  2. From the Geo-Redundancy page on the standby system, run an audit.
    From the Audit drop-down menu on the standby system, select Start.
    The status of the audit is shown on the Geo-Redundancy page.
    Note: The auth pod can go into a crash loop for a few restarts before it stabilizes. A CLI check that you can use to confirm that the pods have settled is shown after this procedure.
    When the audit completes, continue with the next step.
  3. From the upper-right drop-down list, select ForceStandalone.
  4. Optional: If you are performing this procedure for maintenance, click CONFIGURE; otherwise, skip this step.
    1. In the Local Site section, enable the Active field.
    2. In the Remote Site section, disable the Active field.
  5. Optional: If you are performing this procedure for maintenance, click SAVE; otherwise, skip this step.
  6. Upload software images to the standby system.
    1. Upload SR Linux images to the standby system.
      For instructions, see Adding a new software image.
    2. After you have uploaded the software images to the standby system, from the Image Catalog page, click REGEN PROVISION SCRIPTS.
      This action regenerates the ZTP provisioning scripts in case you need to bootstrap or upgrade SR Linux nodes.
  7. Align the deployer configuration between the active and standby systems.
    1. Determine the names of the active and standby sites.
      [root@fss-deployer ~]# /root/bin/fss-install.sh status-georedundancy
    2. While logged in to the deployer VM in cluster 1, enter the following command to make the deployer in cluster 2 the primary deployer (see the example after this procedure).
      [root@fss-deployer ~]# /root/bin/fss-install.sh set-active-deployer -t <name>
      Note: If both deployer VMs are still available, repeat this step on the deployer VM in cluster 2.
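
For example, in step 7 the command sequence on the deployer VM in cluster 1 could look like the following, where cluster2-site is a placeholder for the standby site name; substitute the name reported by your own status-georedundancy output.

  [root@fss-deployer ~]# /root/bin/fss-install.sh status-georedundancy
  [root@fss-deployer ~]# /root/bin/fss-install.sh set-active-deployer -t cluster2-site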
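
To confirm from the CLI that the standby pods have settled after the audit in step 2, including the auth pod that can restart a few times, a check along the following lines can be used. This assumes kubectl access to the standby cluster; the grep filter is an assumption and may need to be adjusted for your deployment.

  # List pods whose name contains "auth" and confirm that the restart count has stopped increasing.
  kubectl get pods -A | grep -i auth

  # Optionally, list any pods that are not yet in the Running phase.
  kubectl get pods -A --field-selector=status.phase!=Running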