Geo-redundancy configuration

Deployment considerations

  • Geo-redundancy works for one, three, or six node deployments, but both clusters must have the same type of node deployment.
  • The active and standby deployments must have the same number of nodes and the same resource configuration.
  • The active and standby deployments must have the same version installed.
  • Signing certificates must be aligned.

    For instructions, see Realigning certificates.
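The deployment rules above can be written down as a small pre-check. The following is a minimal sketch, not part of the product: the `Deployment` record and its field names are hypothetical, and the values would have to be collected manually from each site.

```python
from dataclasses import dataclass

SUPPORTED_NODE_COUNTS = {1, 3, 6}  # deployment sizes listed above


@dataclass(frozen=True)
class Deployment:
    """Hypothetical summary of one site; not a product data structure."""
    name: str
    node_count: int
    version: str
    cpus_per_node: int
    memory_gb_per_node: int


def parity_problems(active: Deployment, standby: Deployment) -> list[str]:
    """Return human-readable mismatches between the two sites."""
    problems = []
    for site in (active, standby):
        if site.node_count not in SUPPORTED_NODE_COUNTS:
            problems.append(f"{site.name}: unsupported node count {site.node_count}")
    if active.node_count != standby.node_count:
        problems.append("node counts differ")
    if active.version != standby.version:
        problems.append("software versions differ")
    if (active.cpus_per_node, active.memory_gb_per_node) != (
        standby.cpus_per_node,
        standby.memory_gb_per_node,
    ):
        problems.append("resource configuration differs")
    return problems
```

An empty list means the two records satisfy the node count, version, and resource rules; certificate alignment is not covered by this sketch.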

Networking considerations

Geo-redundancy imposes the following networking and connectivity requirements between the active and standby sites:

  • Synchronization uses the API service and should use the OAM network. When configuring geo-redundancy, make sure to use the FQDN or VIP of the other cluster on the OAM network.
  • Connectivity between the active and standby cluster can be through a stretched L2 subnet between the sites, or routed with two different L2 subnets.
  • The active and standby sites must use different IP and VIP addresses.
  • Synchronization is supported over IPv4 and IPv6.
  • The maximum allowed RTT latency between the active and standby sites is 100 ms, but a maximum of 50 ms is highly recommended. The lower the RTT latency, the better.
  • The connection speed between the active and standby sites must be a minimum of 1 Gbps.
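The latency and bandwidth limits above can be captured in a short helper. This is a minimal sketch that assumes you have already measured the round-trip time and link speed between the sites by other means; the function name and the three result labels are illustrative only.

```python
MAX_RTT_MS = 100.0          # maximum allowed RTT stated above
RECOMMENDED_RTT_MS = 50.0   # recommended RTT ceiling stated above
MIN_LINK_GBPS = 1.0         # minimum connection speed stated above


def assess_link(rtt_ms: float, link_gbps: float) -> str:
    """Classify a measured inter-site link as 'ok', 'marginal', or 'unsupported'."""
    if rtt_ms > MAX_RTT_MS or link_gbps < MIN_LINK_GBPS:
        return "unsupported"
    if rtt_ms > RECOMMENDED_RTT_MS:
        return "marginal"
    return "ok"
```

For example, a 10 Gbps link with a 75 ms RTT is within the allowed limit but above the recommended one, so the helper reports it as marginal.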

ZTP and DHCP handling after active failure

When the active site fails and the standby site has not been activated yet, the DHCP and ZTP capabilities of the platform are unavailable. At that time, SR Linux nodes cannot be rebooted, bootstrapped, or upgraded.

After the standby site has been made active, it runs the DHCP service and supports the ZTP process for SR Linux nodes.

Geo-redundancy configuration tasks

The following are the high-level tasks that you need to complete to configure a geo-redundant system.
  1. Deploy the deployer VM on the active and standby sites.

    For instructions, see The Fabric Services System deployer VM in the Fabric Services System Installation Guide.

  2. Deploy the Fabric Services System on the active and the standby sites, using the installation procedures provided in the Fabric Services System Installation Guide.
    Note: You can also upgrade an existing standalone deployment first, then set up the standby site.
  3. Configuring geo-redundancy information in deployer VMs
  4. Verifying that the setup is ready for geo-redundancy using the deployer VMs
  5. Realigning certificates
  6. Configuring geo-redundancy

Configuring geo-redundancy information in deployer VMs

Use this procedure to configure the deployer VMs with the remote site details. The steps in this procedure help you view the status of both the local and remote Fabric Services System clusters and determine whether both sites are configured so that geo-redundancy can work correctly.

This procedure is optional because it does not affect the geo-redundancy functionality of the platform. However, making the deployer VMs on the active and standby sites aware of each other helps with troubleshooting and with inspecting the infrastructure for discrepancies.

  1. Configure passwordless SSH access locally on both the active and standby deployer VMs.

    Enter the following command on both the active and standby deployer VMs:

    cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
  2. Configure passwordless SSH access from the local deployer VM to the remote deployer VM and vice versa.
    Append the contents of the /root/.ssh/id_rsa.pub file from the remote deployer to the /root/.ssh/authorized_keys file of the local deployer, then do the same in the other direction.
  3. Add the necessary details from the remote site by copying the input.json file from the remote site.
    Enter the following command:
    fss-install.sh add-remote-deployer <input_json_of_remote_deployer>
    Note: For this command to work, the deployernode.role field must be set to active in the input.json file of the active site and to standby in the input.json file of the standby site. This setting tells each deployer VM which site is considered active and which is considered standby by default.
  4. Repeat step 3 from the remote deployer.
  5. Verify the configuration by displaying the contents of the sites.json file on both deployer VMs.
[root@fss-deployersite01 ~]# cat /var/lib/fss/sites/sites.json
{
  "local": {
    "name": "site01",
    "ipv4": "10.x.x.1",
    "ipv6": "",
    "accessip": "10.x.x.1",
    "role": "active"
  },
  "remote": [
    {
      "name": "site02",
      "ipv4": "10.x.x.11",
      "ipv6": "",
      "accessip": "10.x.x.11",
      "role": "standby"
    }
  ]
}
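The verification in step 5 can also be scripted. The following is a minimal sketch that assumes the file has the structure shown in the example above and that Python is available wherever you run it; the function name and the specific checks are illustrative and not part of the deployer tooling.

```python
import json


def check_sites(sites: dict) -> list[str]:
    """Return a list of problems found in a parsed sites.json document."""
    problems = []
    local = sites.get("local", {})
    remotes = sites.get("remote", [])
    if not remotes:
        problems.append("no remote site configured")
    roles = [local.get("role")] + [remote.get("role") for remote in remotes]
    if roles.count("active") != 1:
        problems.append(f"expected exactly one active site, found roles {roles}")
    for remote in remotes:
        if remote.get("role") == local.get("role"):
            problems.append(f"{remote.get('name')} has the same role as the local site")
        if remote.get("accessip") == local.get("accessip"):
            problems.append(f"{remote.get('name')} shares the local access IP")
    return problems


def check_sites_file(path: str) -> list[str]:
    """Load a sites.json file from disk and run the same checks."""
    with open(path) as handle:
        return check_sites(json.load(handle))
```

On a deployer VM, `check_sites_file("/var/lib/fss/sites/sites.json")` would return an empty list for the example output above.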

Verifying that the setup is ready for geo-redundancy using the deployer VMs

The deployer VM provides the following tools that you can use to verify and display information about geo-redundancy configuration and status:

  • cat /var/lib/fss/sites/sites.json: displays the local and remote clusters, with the access IP address, IP addresses, and role of each cluster
  • /root/bin/fss-install.sh status-georedundancy: displays basic geo-redundancy status
  • /root/bin/fss-install.sh status-georedundancy -v: displays detailed geo-redundancy status
  • /root/bin/fss-install.sh status-georedundancy -t site01: displays the details of the Fabric Services System cluster, certificates, and applications

Status of geo-redundancy, basic output

In the output, Active(fd56:1:91:2::21) vs Standby(fd56:1:91:2::6a) reports the IP addresses used to connect to the active and standby sites.

[root@fss-deployersite01 ~]# /root/bin/fss-install.sh status-georedundancy
=====================================================
Sites Overview
=====================================================
+--------------+---------+--------+-------------+
|     NAME     |  ROLE   | STATUS | CONSISTENCY | 
+--------------+---------+--------+-------------+
| site01(self) | active  |  GOOD  |     N/A     |
|    site02    | standby |  GOOD  |    ERROR    |
+--------------+---------+--------+-------------+
=====================================================
Active(fd56:1:91:2::21) vs Standby(fd56:1:91:2::6a) 
=====================================================
+--------------+----------+
|     NAME     |  STATUS  |
+--------------+----------+
|    NODES     |   GOOD   |
|  PASSWORDS   |   GOOD   |
| CERTIFICATES | MISMATCH |
|   VERSION    |   GOOD   |
+--------------+----------+
Note: If the system displays ERROR or MISMATCH in the output, align both sites so they have the same configuration before you start the procedure Configuring geo-redundancy.
You can also display the output in YAML format:
[root@fss-deployersite01 ~]# /root/bin/fss-install.sh status-georedundancy -o yaml
Overview:
- NAME: site01(self)
  ROLE: active
  STATUS: GOOD
  CONSISTENCY: N/A
- NAME: site02
  ROLE: standby
  STATUS: GOOD
  CONSISTENCY: ERROR
standby-site02:
- NAME: NODES
  STATUS: GOOD
- NAME: PASSWORDS
  STATUS: GOOD
- NAME: CERTIFICATES
  STATUS: MISMATCH
- NAME: VERSION
  STATUS: GOOD
[root@fss-deployersite01 ~]#

If the CONSISTENCY column reports an error, use the -v option with the /root/bin/fss-install.sh status-georedundancy command to display more information. To display detailed information about a particular site, use the fss-install.sh status-georedundancy -t <site name> command.
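When the YAML-format output is captured to a text file, the entries that need attention can be picked out with a few lines of code. This is a minimal sketch that assumes the output keeps the flat "- KEY: value" layout shown above; it is a simple line scanner rather than a full YAML parser, and the function name is illustrative.

```python
BAD_VALUES = {"ERROR", "MISMATCH"}


def find_inconsistencies(status_yaml: str) -> list[dict]:
    """Return every list item whose fields include ERROR or MISMATCH."""
    items = []
    current = None
    for raw in status_yaml.splitlines():
        line = raw.strip()
        if line.startswith("- "):
            current = {}          # "- KEY: value" opens a new list item
            items.append(current)
            line = line[2:]
        if current is None or ": " not in line:
            continue              # section headers such as "Overview:" and prompts
        key, value = line.split(": ", 1)
        current[key] = value.strip()
    return [item for item in items if BAD_VALUES & set(item.values())]
```

Applied to the basic YAML output shown above, the function returns the site02 overview entry and the CERTIFICATES entry.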

Detailed geo-redundancy information

Use the /root/bin/fss-install.sh status-georedundancy -v command to display details about consistency errors. In the output, the Sites Overview section shows a consistency error. The subsequent sections indicate the area with the consistency error. In the example below, the details about the error are shown in the Details about CERTIFICATES section. The serial number mismatch is not a severe issue, but the different node CAs in site01 and site02 should be addressed.

[root@fss-deployersite01 ~]# /root/bin/fss-install.sh status-georedundancy -v
=====================================================
Sites Overview
=====================================================
+--------------+---------+--------+-------------+
|     NAME     |  ROLE   | STATUS | CONSISTENCY |
+--------------+---------+--------+-------------+
| site01(self) | active  |  GOOD  |     N/A     |
|    site02    | standby |  GOOD  |    ERROR    |
+--------------+---------+--------+-------------+
=====================================================
Active(fd56:1:91:2::21) vs Standby(fd56:1:91:2::6a)
=====================================================
-----------------------------------------------------
Details about fss VERSION
-----------------------------------------------------
+-------------------+--------------+--------+
|       NAME        | CHARTVERSION | STATUS |
+-------------------+--------------+--------+
|   cert-manager    |     GOOD     |  GOOD  |
|     fss-logs      |     GOOD     |  GOOD  |
|       kafka       |     GOOD     |  GOOD  |
|      kafkaop      |     GOOD     |  GOOD  |
|      metallb      |     GOOD     |  GOOD  |
|       prod        |     GOOD     |  GOOD  |
|     rook-ceph     |     GOOD     |  GOOD  |
| rook-ceph-cluster |     GOOD     |  GOOD  |
|      traefik      |     GOOD     |  GOOD  |
+-------------------+--------------+--------+
-----------------------------------------------------
Details about CERTIFICATES
-----------------------------------------------------
+--------------+--------+---------+---------------+---------+
|  CERTSOURCE  | ISSUER | SUBJECT | SERIAL-NUMBER | VALIDTO |
+--------------+--------+---------+---------------+---------+
| fss gui/rest |  GOOD  |  GOOD   |   MISMATCH    |  GOOD   | 
|    kafka     |  GOOD  |  GOOD   |   MISMATCH    |  GOOD   |
|   node CA    | ERROR  |  ERROR  |     ERROR     |  ERROR  | 
+--------------+--------+---------+---------------+---------+
-----------------------------------------------------
Details about NODES
-----------------------------------------------------
+------------+-------------+
|    NAME    | CONSISTENCY |
+------------+-------------+
| master_cnt |    GOOD     |
| total_cnt  |    GOOD     |
+------------+-------------+
-----------------------------------------------------
Details about PASSWORDS
-----------------------------------------------------
+------------+-----------------+-------------+
|    APP     |      USER       | CONSISTENCY |
+------------+-----------------+-------------+
|  mongodb   |      root       |    GOOD     |
|  mongodb   |    fsp_user     |    GOOD     |
|   neo4j    |      root       |    GOOD     |
|  keycloak  |     master      |    GOOD     |
|  keycloak  |       fss       |    GOOD     |
|  keycloak  |       ztp       |    GOOD     |
| postgresql |      root       |    GOOD     |
| postgresql |    keycloak     |    GOOD     |
|   kafka    | fss-kafka-admin |    GOOD     |
+------------+-----------------+-------------+
You can also display the output in YAML format:
[root@fss-deployersite01 ~]# /root/bin/fss_geo_redundancy.py status -v -o yaml
Overview:
- NAME: site01(self)
  ROLE: active
  STATUS: GOOD
  CONSISTENCY: N/A
- NAME: site02
  ROLE: standby
  STATUS: GOOD
  CONSISTENCY: ERROR
standby-site02:
  HELMVERSION:
  - NAME: cert-manager
    CHARTVERSION: GOOD
    STATUS: GOOD
  - NAME: fss-logs
    CHARTVERSION: GOOD
    STATUS: GOOD
  - NAME: kafka
    CHARTVERSION: GOOD
    STATUS: GOOD
  - NAME: kafkaop
    CHARTVERSION: GOOD
    STATUS: GOOD
  - NAME: metallb
    CHARTVERSION: GOOD
    STATUS: GOOD
  - NAME: prod
    CHARTVERSION: GOOD
    STATUS: GOOD
  - NAME: rook-ceph
    CHARTVERSION: GOOD
    STATUS: GOOD
  - NAME: rook-ceph-cluster
    CHARTVERSION: GOOD
    STATUS: GOOD
  - NAME: traefik
    CHARTVERSION: GOOD
    STATUS: GOOD
  CERTIFICATES:
  - CERTSOURCE: fss gui/rest
    ISSUER: GOOD
    SUBJECT: GOOD
    SERIAL-NUMBER: MISMATCH
    VALIDTO: GOOD
  - CERTSOURCE: kafka
    ISSUER: GOOD
    SUBJECT: GOOD
    SERIAL-NUMBER: MISMATCH
    VALIDTO: GOOD
  - CERTSOURCE: node CA
    ISSUER: ERROR
    SUBJECT: ERROR
    SERIAL-NUMBER: ERROR
    VALIDTO: ERROR
  NODES:
  - NAME: master_cnt
    CONSISTENCY: GOOD
  - NAME: total_cnt
    CONSISTENCY: GOOD
  PASSWORDS:
  - APP: mongodb
    USER: root
    CONSISTENCY: GOOD
  - APP: mongodb
    USER: fsp_user
    CONSISTENCY: GOOD
  - APP: neo4j
    USER: root
    CONSISTENCY: GOOD
  - APP: keycloak
    USER: master
    CONSISTENCY: GOOD
  - APP: keycloak
    USER: fss
    CONSISTENCY: GOOD
  - APP: keycloak
    USER: ztp
    CONSISTENCY: GOOD
  - APP: postgresql
    USER: root
    CONSISTENCY: GOOD
  - APP: postgresql
    USER: keycloak
    CONSISTENCY: GOOD
  - APP: kafka
    USER: fss-kafka-admin
    CONSISTENCY: GOOD
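The guidance above (a serial number mismatch alone is not severe, while a node CA difference should be addressed) can be turned into a simple triage rule for CERTIFICATES entries. This is a minimal sketch based on one interpretation of that guidance; the function name and the three labels are assumptions, and each row is assumed to be a dictionary with the field names shown in the YAML output.

```python
def triage_certificate_row(row: dict) -> str:
    """Classify one CERTIFICATES entry as 'ok', 'informational', or 'action required'."""
    # Collect every comparison field that is not GOOD, ignoring the row label.
    bad = {key: value for key, value in row.items()
           if key != "CERTSOURCE" and value != "GOOD"}
    if not bad:
        return "ok"
    if bad == {"SERIAL-NUMBER": "MISMATCH"}:
        return "informational"   # only the serial number differs
    return "action required"     # issuer, subject, or validity also differ
```

With the example output, the fss gui/rest and kafka entries are informational, and the node CA entry requires action.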

Realigning certificates

When the independent clusters of a geo-redundant system are first installed, each site has a set of default self-installed certificates; the node-signing CA is unique for each cluster. In a geo-redundant system, the node-signing certificate must be the same for the active and standby cluster.
Note: Before configuring geo-redundancy, the node signing certificates must be aligned. In addition, if custom signing certificates are in use for northbound, UI, and Kafka interfaces, these certificate files must also be aligned.

Use the fss-certificate.sh export [-d directory] utility to export the signing certificate files installed in the intended active system to a local directory. If you do not specify a directory, the certificates are exported to the local /root/userdata/certificates directory. Then, copy the needed certificate files to the intended standby and deploy them.

  1. From the deployer of the intended active system, execute the fss-certificate.sh export command.
    The following output shows that default signing certificates are present.
    [root@fss-deployer ~]# /root/bin/fss-certificate.sh export
    Certificates will be exported to  /root/userdata/certificates
    Default install generated signing certificates are in use for generating node certificates
    Default install generated signing certificates are in use for nbi/gui/kafka
    Server Certificates are generated and renewed using signing certificates for nbi/gui/kafka
    
    In this example, custom signing certificates are in use for the northbound, GUI, and Kafka interfaces.
    Certificates will be exported to  /root/userdata/certificates
    Default install generated signing certificates are in use for generating node certificates
    Custom signing certificates are in use for nbi/gui/kafka
    Server Certificates are generated and renewed using signing certificates for nbi/gui/kafka
  2. View the exported certificate files.
    [root@fss-deployer ~]# ls -ltr /root/userdata/certificates/
    total 28
    -r-------- 1 root root 1675 Dec 13 03:59 current-nodesigning__rootCA.key
    -r-------- 1 root root 1269 Dec 13 03:59 current-nodesigning__rootCA.pem
    -r-------- 1 root root 1679 Dec 13 03:59 current-nbi__tls.key
    -r-------- 1 root root 1501 Dec 13 03:59 current-nbi__tls.crt
    -r-------- 1 root root 1874 Dec 13 03:59 current-nbi__ca.crt
    -r-------- 1 root root 3272 Dec 13 03:59 current-nbisigning__tls.key
    -r-------- 1 root root 1874 Dec 13 03:59 current-nbisigning__tls.crt
    The following files were exported to the /root/userdata/certificates directory.
    • current-nodesigning__rootCA.key: the signing certificate used by Cert-Manager to sign and generate certificates for managed nodes.
    • current-nodesigning__rootCA.pem: the self-signed root certificate for managed nodes.
    • current-nbi__tls.crt: the signing certificate used by Cert-Manager to sign and generate other certificates.
    • current-nbi__tls.key: the private key.
    • current-nbi__ca.crt: the signed certificate.
    • current-nbisigning__tls.key: the private key for the signing certificate.
    • current-nbisigning__tls.crt: the signed certificate.
    Note: current-nbi__ca.crt and current-nbisigning__tls.crt are the same files.
  3. Copy the contents of the current-nodesigning__rootCA.key and current-nodesigning__rootCA.pem files to a directory in the intended standby system.
    The content of the current-nodesigning__rootCA.key file resembles the example below. Copy the entire contents shown to a file in the intended standby.
    -----BEGIN RSA PRIVATE KEY-----
    MIIJKAIBAAKCAgEAuMg5L2oizpf+g77atvmtuvc6Y4xBok27DbUDlYMBgkmy8Lj2
    uolLD+WGlEODCrPcn+88IMG+xiHyuomu0vqMVF2UxJZD8K0AHrhRv6uDPXPr+D1e
    SHj3MfntkQEcCHH0Bakk7sc0FhqgvgWNJWRXz+g/QI24BAhJx/lvEDtwrwnLg4Sg
    ydTjd2D+a+XtcxoMvyWGxQdkqse/qVY1zibzBtmQKJ+3dXjOc6UHVVyrxP5fgWn2
    ebw1hxG6rQdJ7HkFpwH3p/rYUHjrGXSxhgm7YEPNLXuuhxzW+maFxZ3VpyHwl/lE
    vrGzMhTsBXogm+Jj0fZdbiGF4khJwNp6OaUhqHM37rabWCzMxki8uNR1pXkFdgHf
    b9Ph5e0bfTix8L+keUmCSyfQdp404eKEsMmc3JFruH6oJU/9bdNESyHTZ2eK+F4g
    +roe2Fu9TB1p64QUUtQv8k2s77qFiuqvaRL1hDNV4sNuIeNmKcu1n8dU+vRiGL2T
    z95xqGYjYNx6SeNC/WCLBodyVAjPAayFRTB5y5K28x81Ip0Ozjz7+XdFFSV8amOa
  4. From the deployer on the intended standby system, deploy the required certificate files.
    In the standby system, assume that you copied the key to the standby-nodesigning__rootCA.key file and the root CA to the standby-nodesigning__rootCA.pem file.

    Enter the following command to deploy the certificates:

    [root@fss-standby-deployer ~]# /root/bin/fss-certificate.sh deploy-node-ca-certs \
        --certificate /root/directory/standby-nodesigning__rootCA.key \
        --pem /root/directory/standby-nodesigning__rootCA.pem
    Note: If the deployer is already configured with the node signing certificates, use the --force option.
  5. Verify that the deployed certificates in the intended standby system are correct.
    [root@fss-standby-deployer ~]# /root/bin/fss-certificate.sh export -d directory
    Certificates will be exported to  /root/userdata/directory
    Custom signing certificates are in use for generating node certificates
    Custom server certificates are in use for nbi/gui/kafka
    
  6. Repeat steps 3 and 4 as needed for other certificates.
    In step 1, if the output shows that custom signing certificates are in use for northbound, UI, and Kafka interfaces, copy the needed files to the intended standby and deploy them.
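After the procedure, one way to confirm that both sites hold the same node-signing root certificate is to compare fingerprints of the exported .pem files. This is a minimal sketch using only the Python standard library; it assumes each file contains a single PEM-encoded certificate, and the function names are illustrative.

```python
import hashlib
import ssl


def pem_fingerprint(pem_text: str) -> str:
    """Return the SHA-256 fingerprint of a single PEM-encoded certificate."""
    der = ssl.PEM_cert_to_DER_cert(pem_text.strip())
    return hashlib.sha256(der).hexdigest()


def same_certificate(active_pem: str, standby_pem: str) -> bool:
    """True when both PEM texts encode the same certificate bytes."""
    return pem_fingerprint(active_pem) == pem_fingerprint(standby_pem)
```

Read the contents of current-nodesigning__rootCA.pem as exported on each site and pass both strings to `same_certificate`. Because the comparison is made on the decoded bytes, differences in line wrapping or trailing whitespace do not affect the result.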

Configuring geo-redundancy

Geo-redundancy parameters

Table 1. Parameter descriptions

| Parameter | Description | Value |
|-----------|-------------|-------|
| Local parameters: configures the active system | | |
| Name | Specifies the name of the local site. The local site is assigned the active role. | String |
| URL | Specifies the URL of the local system. | |
| User and Password | Specifies the credentials that the remote system uses to log in to this local system. You can only configure geo-redundancy using the geored user account and you must provide the default geored password. You can change the password for the geored user as needed. | String |
| Active | Specifies whether the local cluster is active. | Enable on the active cluster |
| Verify Remote CA | Checks whether the certificates on the standby cluster are valid. If enabled, enter the Root CA for the standby site. | |
| Remote parameters: configures the standby system | | |
| Name | Specifies the name of the remote site; the remote is the standby site. | String |
| URL | Specifies the URL of the remote system. | IP notation |
| User and Password | Specifies what credentials to use to log in to the remote system. You can only configure geo-redundancy using the geored user account. You can change the password for the geored user as needed. | String |
| Active | Specifies that the remote cluster is the standby; must be disabled for the standby. | |
| Verify Remote CA | Checks whether the certificates on the active cluster are valid. If enabled, enter the Root CA for the active cluster. | |
| Sync queue length | | |
| Sync Queue Length | Specifies the number of messages that can be buffered in the queue. Note: Available only over the API. | Integer. Default: 25000 |
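Before filling in the form, the planned values can be sanity-checked against the rules in Table 1. The following is a minimal sketch: the `SiteSettings` record is a hypothetical mirror of the form fields and does not represent the API schema, and treating the sync queue length as a positive integer is an assumption.

```python
from dataclasses import dataclass

DEFAULT_SYNC_QUEUE_LENGTH = 25000  # default value from Table 1


@dataclass(frozen=True)
class SiteSettings:
    """Hypothetical mirror of one section of the form; not the API schema."""
    name: str
    url: str
    user: str
    active: bool


def settings_problems(local: SiteSettings, remote: SiteSettings,
                      sync_queue_length: int = DEFAULT_SYNC_QUEUE_LENGTH) -> list[str]:
    """Return the Table 1 rules that the planned values would violate."""
    problems = []
    if not local.active:
        problems.append("Active must be enabled for the local site")
    if remote.active:
        problems.append("Active must be disabled for the remote site")
    if local.url == remote.url:
        problems.append("local and remote URLs are identical")
    for site in (local, remote):
        if site.user != "geored":
            problems.append(f"{site.name}: only the geored user can be used")
    if sync_queue_length <= 0:  # assumption: the queue length must be positive
        problems.append("sync queue length must be a positive integer")
    return problems
```

An empty result only means that the values are consistent with the table; it does not test connectivity or credentials.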

Configuring geo-redundancy

  • Perform this procedure during a maintenance window.
  • The intended active and standby systems must be running the same Fabric Services System software version.
  • The intended active and standby systems must be reachable.
  • The standby system should not be running Digital Sandbox workloads.
  • Update the password for the geored user. For instructions, see Resetting internal passwords.
  • Be prepared to provide the following information for the active and standby Fabric Services System instances:
    • names for the active and standby systems
    • the URLs for the active and standby systems
    • the password for the geored user
Configure geo-redundancy locally from the intended active system. In the Geo-Redundancy Configuration form, the Local Sites section configures the active system; the Remote Sites section configures the standby system.
  1. Realign the certificates between the intended active and standby systems.
    For instructions, see Realigning certificates.
  2. From the main menu of the intended active system, select Geo-Redundancy.
    1. Click CONFIGURE.
      In the LOCAL SITE section, configure the following settings for the active system.
      • Provide a name for the active system.
      • Provide the URL of the active system.
      • Provide the login credentials (username and password) for the active system.
      • Ensure that the Active parameter is enabled.
    2. In the Remote Site section, configure the standby system.
      • Provide a name for the standby system.
      • Provide the URL of the standby system.
      • Provide the login credentials (username and password) for the standby system.
      • Ensure that the Active parameter is disabled.
      Note: If a remote site is already configured as a standby for another system, this new configuration takes precedence and overwrites the previous configuration.
    3. Test the connection between the endpoints of the geo-redundant system.
      Click TEST CONNECTION.
    4. Optional: Update the setting for the sync queue length.
    5. Click SAVE.
      Warning: This action restarts multiple services on both the active and standby sites; it can take 5 to 10 minutes before all services are up and running again. During this time, the API and UI may behave unpredictably. Monitor the services in the Geo-Redundancy page and ensure that all services are up before proceeding.
    The systems should automatically come up in the Syncing state. If the systems do not come up in the Syncing state, from the Geo-Redundancy page of the active cluster, click SyncStart to initiate the sync connection.
    The Geo-Redundancy page displays:
    • the names of the local and remote sites
    • the role of each site, either active or standby
    No services are displayed at this point.
    Warning:

    Before proceeding to the next step, ensure that there are no pending workload jobs, deployments, or any operations that could potentially modify the database in the background.

  3. Reconcile data from the active to the standby.
    From the Geo-Redundancy page of the active system, click Reconcile.
    The Sync Status shows Reconcile, then moves to Syncing.

    This action replaces the data collection set in the standby system with the data collection set from the active system.

The Geo-Redundancy page displays:
  • the names and roles (active or standby) of the clusters in the geo-redundant system

    The status of the active cluster is Active syncing; the status of the standby cluster is Standby syncing.

  • the Fabric Services System services and the status of each service
Note: On the standby site, the UI loads with the Geo-Redundancy page as its home page. Only the Alarms and Geo-Redundancy options are available from the main menu. The API is disabled, except for GET calls for alarms and geo-redundancy.