Geo-redundancy operations

Synchronizing (sync)

During geo-redundancy configuration, the sync connection is established between the active and standby sites. The active and standby clusters are considered synchronized after a successful heartbeat exchange between them. While the sync connection is active, configuration data is copied from the active site to the standby site; this includes all the data that the standby system needs to manage the fabric.
Note:
  • Data that was not fully synced can be relearned or regenerated within a reasonable time (minutes).
  • Software images are not synced because they are large files.

    If a standby cluster becomes the new active cluster, it uploads images from the local image source. Software images are not transferred from the old active cluster to the new active cluster.

  • Performance metrics or platform health metrics (Prometheus data) are not synchronized.

The options to restart and stop synchronizing are available only on the active site. When the active system stops synchronizing, it goes into the Sync Stopped state and becomes read-only. When the standby system stops synchronizing, it goes to a Sync Aborted state.
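The stop-sync behavior described above can be sketched as a small state model. This is only an illustration under stated assumptions, not product code: the `Role` and `SyncState` enums and both functions are hypothetical names, and the state labels are taken from the text.

```python
from enum import Enum


class Role(Enum):
    """Hypothetical representation of a site's geo-redundancy role."""
    ACTIVE = "active"
    STANDBY = "standby"


class SyncState(Enum):
    """State labels as they appear in the text above."""
    SYNC_STOPPED = "Sync Stopped"
    SYNC_ABORTED = "Sync Aborted"


def state_after_sync_stops(role: Role) -> SyncState:
    """Return the state a site enters when synchronization stops."""
    if role is Role.ACTIVE:
        return SyncState.SYNC_STOPPED
    return SyncState.SYNC_ABORTED


def can_control_sync(role: Role) -> bool:
    """Restart and stop options are offered only on the active site."""
    return role is Role.ACTIVE
```

For example, `state_after_sync_stops(Role.STANDBY)` returns `SyncState.SYNC_ABORTED`, which is the state the standby system must be in before an audit can run.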

Audits

If the sync connection is down and the active system is reachable and operational, you can perform sync and reconcile operations to retain the active site. If you want to fail over to the standby system and make it active, you must first run an audit on the standby system.

During an audit, each service in the cluster verifies its own data, in sequence, to ensure that no data in its database is in an inconsistent state. An audit is not a verification of data between the clusters in a geo-redundant system. An audit is performed on the standby system. The system blocks any changes to system configuration while an audit is in progress.

You can run an audit only when the sync connection between the active and standby clusters is down and the standby system is in the Sync Aborted state. After an audit completes successfully, you can make the standby system active.

You can generate an audit report that shows, for each app, what the app attempted to do to correct inconsistencies in its data collections. If the data is already consistent, the report may contain little information.

In the rare event of an audit failure (that is, the state moves to Audit Fail), you can recover by reinstalling the configuration from backup. For instructions, see Backup and restore.
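The audit preconditions and the failover rule described in this section can be expressed as two simple checks. This is a minimal sketch: the `StandbyStatus` class, its fields, and the function names are hypothetical and do not correspond to any product API.

```python
from dataclasses import dataclass


@dataclass
class StandbyStatus:
    """Hypothetical snapshot of the standby system's condition."""
    sync_connection_up: bool
    state: str                    # for example "Sync Aborted"
    audit_succeeded: bool = False


def can_run_audit(status: StandbyStatus) -> bool:
    """An audit requires the sync connection to be down and the
    standby system to be in the Sync Aborted state."""
    return (not status.sync_connection_up) and status.state == "Sync Aborted"


def can_make_active(status: StandbyStatus) -> bool:
    """The standby system can be made active only after a successful audit."""
    return status.audit_succeeded
```

A standby system with the sync connection still up fails the first check, so `can_run_audit(StandbyStatus(True, "Sync Aborted"))` returns `False`.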

Reconciling

The reconcile operation initiates the replication of the data set from the active to the standby cluster. This action replaces the data set in the standby system with the data set from the active system.

This operation is needed when the data between the active and standby sites is not in sync, such as in the following scenarios:
  • When the sync is first established (such as during the initial geo-redundancy configuration)
  • When the sync connection recovers after an unintentional disruption
Note:
  • This operation is available from the active system.

  • The sync connection must be active before you can initiate the reconcile operation.

  • There should be no pending workload jobs, deployments, or other operations that could modify the database in the background.
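The reconcile prerequisites listed in the note can be gathered into a single pre-check that reports every unmet condition. This is a sketch under assumptions: the function name and its parameters are hypothetical, and how the pending-operation count is obtained is left to the caller.

```python
from typing import List


def reconcile_blockers(is_active_site: bool,
                       sync_connection_up: bool,
                       pending_operations: int) -> List[str]:
    """Return the reasons a reconcile should not be started.

    An empty list means all three prerequisites from the note are met.
    """
    blockers = []
    if not is_active_site:
        blockers.append("reconcile is available only from the active system")
    if not sync_connection_up:
        blockers.append("the sync connection must be active")
    if pending_operations > 0:
        blockers.append("pending jobs or deployments could modify the database")
    return blockers
```

Returning a list rather than a single boolean lets an operator see all unmet conditions at once instead of fixing them one at a time.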

REST API geo-redundancy operations

The operations allowed on the REST API vary depending on the current geo-redundancy status of the site.

Table 1. Allowed REST operations based on current state
State             Active site       Standby site
STANDALONE        Read/Write        Read/Write
ACTIVE_SYNCING    Read/Write        Not applicable
SYNC_STOPPED      Read/Write        Not applicable
SYNC_ABORTED      Read-only         Read-only
AUDIT             Not applicable    Read-only
AUDIT_DONE        Not applicable    Read-only
AUDIT_FAILED      Not applicable    Read-only
STANBY_SYNCING    Not applicable    Read-only
RECONCILE         Read-only         Read-only
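
A client script might encode Table 1 as a lookup so that it can decide whether to attempt a write before calling the REST API. The following is a minimal sketch: the dictionary and function names are hypothetical, the state names are copied exactly as they appear in the table, and how the current state is retrieved from the system is not shown.

```python
from typing import Optional

NOT_APPLICABLE = "Not applicable"

# State -> (active site access, standby site access), transcribed from Table 1.
# State names are spelled exactly as in the table.
ALLOWED_OPERATIONS = {
    "STANDALONE": ("Read/Write", "Read/Write"),
    "ACTIVE_SYNCING": ("Read/Write", NOT_APPLICABLE),
    "SYNC_STOPPED": ("Read/Write", NOT_APPLICABLE),
    "SYNC_ABORTED": ("Read-only", "Read-only"),
    "AUDIT": (NOT_APPLICABLE, "Read-only"),
    "AUDIT_DONE": (NOT_APPLICABLE, "Read-only"),
    "AUDIT_FAILED": (NOT_APPLICABLE, "Read-only"),
    "STANBY_SYNCING": (NOT_APPLICABLE, "Read-only"),
    "RECONCILE": ("Read-only", "Read-only"),
}


def allowed_access(state: str, site: str) -> Optional[str]:
    """Return "Read/Write" or "Read-only", or None when the state
    does not apply to the given site ("active" or "standby")."""
    if site not in ("active", "standby"):
        raise ValueError("site must be 'active' or 'standby'")
    active, standby = ALLOWED_OPERATIONS[state]
    access = active if site == "active" else standby
    return None if access == NOT_APPLICABLE else access


def is_write_allowed(state: str, site: str) -> bool:
    """True only when the table lists Read/Write for this state and site."""
    return allowed_access(state, site) == "Read/Write"
```

For example, `is_write_allowed("RECONCILE", "active")` returns `False`, so a script would postpone configuration changes until the reconcile completes.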