Redundancy
As part of critical infrastructure, EDA must remain resilient during outages so that it can continue to support that infrastructure. Outages can affect power, the network, storage, or any other infrastructure that EDA depends on, and EDA must be able to mitigate the loss of visibility and automation during these events. Outages can also impact the connectivity between members of an EDA cluster; in these cases, EDA needs to avoid split-brain scenarios.
EDA provides resiliency via redundancy, using the following strategies:
- Localized restartability: any application is assumed to be able to fail at any time, and the system must reconcile afterward. EDA takes this approach in general, and it is especially relevant for services like ConfigEngine. Any service should be able to restart, and the system should converge back to a golden state. If any EDA pod fails, either Kubernetes or ConfigEngine should restart it.
- Localized redundancy and microservices: multiple instances of a common service with load balancing. This strategy limits the impact of localized outages; in most cases, only in-flight requests are lost.
- Remote redundancy: multiple clusters (or cluster members, depending on hierarchy). Typically referred to as geo-redundancy, where one or more cluster members are present and each one can operate the full load of management activities, with only one active at a time. In EDA, pushes to redundant sites are not synchronous as long as changes are persisted in the majority of configured git servers. This means that some in-flight changes could be lost during a switchover.
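The majority condition mentioned for remote redundancy can be illustrated with a minimal sketch. It assumes that "majority" means a strict majority of the configured git servers; the function name and the shape of the input are hypothetical and are not part of EDA.

```python
def persisted_on_majority(push_results: dict[str, bool]) -> bool:
    """Return True when a strict majority of configured git servers accepted a push.

    push_results maps a git server name to whether the push succeeded.
    Assumption: a strict majority (more than half) is required.
    """
    if not push_results:
        return False
    successes = sum(1 for ok in push_results.values() if ok)
    return successes > len(push_results) // 2


# Hypothetical push outcomes keyed by git server name.
print(persisted_on_majority({"git1": True, "git2": True, "git3": False}))  # True
print(persisted_on_majority({"git1": True, "git2": False}))  # False
```

Under this assumption, a deployment with two git servers needs both pushes to succeed before a change counts as persisted.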
Local redundancy
EDA supports automatic recovery of local services in the event of a failure. EDA leverages Kubernetes for deployment of its core services, which provides out-of-the-box redundancy when more than one worker node is available, with EDA services able to be scheduled or rescheduled to remaining available nodes during failures.
Cluster recovery
EDA supports cluster recovery by allowing the bootstrapping of a cluster from any member. This process removes all members, starts the active member, and then adds the members back.
Remote redundancy
Remote redundancy is accomplished by configuring a set of members within the EngineConfig resource in the .spec.cluster.redundancy.members context, and a credential to authenticate members in the .spec.cluster.redundancy.credential context.
Synchronization occurs when changes are pushed to the set of git servers for backup.
Alarms
EDA supports the following alarms, which are generated only on the active cluster:
- Reachability to any member fails. Additional details should be included; ideally, a user should be able to determine whether the issue relates to connectivity or authentication, for example.
- Latency to a member is above a specified threshold.
- Any core-generated alarms from any standby member. These are forwarded to the active member for display, with the node set to the name of the member that raised it.
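The first two alarm conditions above can be sketched as a small evaluation function. This is only an illustration: the MemberProbe type, its fields, the alarm text, and the 100 ms threshold are all hypothetical and do not reflect how EDA represents alarms internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemberProbe:
    """Hypothetical result of probing one cluster member from the active."""
    name: str
    reachable: bool
    failure_reason: Optional[str] = None  # e.g. "connectivity" or "authentication"
    latency_ms: Optional[float] = None


def evaluate_alarms(probes: list[MemberProbe], latency_threshold_ms: float) -> list[str]:
    """Return one alarm string per member that is unreachable or too slow."""
    alarms = []
    for p in probes:
        if not p.reachable:
            # Include the reason so a user can tell connectivity from authentication issues.
            reason = p.failure_reason or "unknown"
            alarms.append(f"{p.name}: unreachable ({reason})")
        elif p.latency_ms is not None and p.latency_ms > latency_threshold_ms:
            alarms.append(
                f"{p.name}: latency {p.latency_ms:.0f} ms exceeds {latency_threshold_ms:.0f} ms"
            )
    return alarms


probes = [
    MemberProbe("us-east-2", reachable=False, failure_reason="authentication"),
    MemberProbe("us-east-3", reachable=True, latency_ms=250.0),
]
for alarm in evaluate_alarms(probes, latency_threshold_ms=100.0):
    print(alarm)
```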
Geo-redundancy (remote redundancy)
EDA supports two concepts of remote redundancy that can be used together or separately:
- Git redundancy: EDA supports remote redundancy through the backup of configuration information and data to a set of git servers and restoring backed up data from the same set of git servers. The git servers are defined in the .spec.git.servers context of the EngineConfig CR. Whenever a change occurs in the system, the active ConfigEngine asynchronously pushes changes to all git servers, and from there, any other ConfigEngine can start with the same content via the same git servers.
- Cluster redundancy: In a true geo-redundant environment, multiple EDA deployments are running in different locations, where one deployment is designated the active, and the other deployment is designated as standby. Both deployments must have the same git servers configured so they have access to the same data. An operator must define the members of a geo-redundant cluster, where each member is a standalone EDA deployment configured to be part of a cluster. It takes two members to form a cluster, with manual intervention currently required for switchovers to occur. For details, see Switching the active deployment.
Adding remotes
An operator can enable remote redundancy during initial installation or after installation. All cluster members must be running the same software version.
Initial standalone configuration
The following example shows the initial EngineConfig CR fields for the standalone member, us-west-1. This resource defines a single-member cluster with two git servers. The cluster is exposed via a load balancer or directly at the address 10.0.0.1 for IPv4 or 2000::101 for IPv6, and is also reachable via the domain name cluster.eda.nokia.com (which maps to the two IP addresses).
apiVersion: core.eda.nokia.com/v1
kind: EngineConfig
metadata:
  name: us-west-1
spec:
  git:
    servers:
      - name: git1
        url: https://git1.eda.nokia.com
        credential: git1-token
      - name: git2
        url: https://git2.eda.nokia.com
        credential: git2
    backup:
      repo: sr/eda/backup
    userStorage:
      repo: sr/eda/user-storage
    apps:
      repo: sr/eda/apps
  cluster:
    external:
      ipv4Address: 10.0.0.1
      ipv6Address: 2000::101
      domainName: cluster.eda.nokia.com
      port: 51101
Adding another EDA instance
The following EngineConfig CR is for the new EDA instance, us-east-2:
apiVersion: core.eda.nokia.com/v1
kind: EngineConfig
metadata:
  name: us-east-2
spec:
  git:
    servers:
      - name: git1
        url: https://git1.eda.nokia.com
        credential: git1-token
      - name: git2
        url: https://git2.eda.nokia.com
        credential: git2
    backup:
      repo: sr/eda/backup
    userStorage:
      repo: sr/eda/user-storage
    apps:
      repo: sr/eda/apps
  cluster:
    external:
      ipv4Address: 10.0.0.1
      ipv6Address: 2000::101
      domainName: cluster.eda.nokia.com
      port: 51101
    redundancy:
      credential: cluster-cred
      active: us-west-1
      members:
        - name: us-west-1
          address: 10.0.0.2
          port: 55000
        - name: us-east-2
          address: 20.0.0.1
          port: 55001
When the us-east-2 cluster starts, it attempts to connect to us-west-1, which is not yet configured as a cluster member. The attempt to join is expected to fail, with us-east-2 retrying to form a cluster at a back-off interval. The active cluster is then updated to:
apiVersion: core.eda.nokia.com/v1
kind: EngineConfig
metadata:
  name: us-west-1
spec:
  git:
    servers:
      - name: git1
        url: https://git1.eda.nokia.com
        credential: git1-token
      - name: git2
        url: https://git2.eda.nokia.com
        credential: git2
    backup:
      repo: sr/eda/backup
    userStorage:
      repo: sr/eda/user-storage
    apps:
      repo: sr/eda/apps
  cluster:
    external:
      ipv4Address: 10.0.0.1
      ipv6Address: 2000::101
      domainName: cluster.eda.nokia.com
      port: 51101
    redundancy:
      credential: cluster-cred
      active: us-west-1
      members:
        - name: us-west-1
          address: 10.0.0.2
          port: 55000
        - name: us-east-2
          address: 20.0.0.1
          port: 55001
This resource describes a two-member cluster, where each member knows how to reach the other using the credential, address, and port provided. The address value can be a DNS name or an IPv4/IPv6 address, and the address and port are mapped directly to the ConfigEngine resource in each cluster.
The name field in the EngineConfig resource differs per cluster, and should map to one of the members listed.
In this example, the cluster grows from 0 members to 2. Both members must specify the same member as active. In this sample configuration, the previously standalone member remains active.
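The two rules above (each deployment's name maps to a listed member, and all members agree on the active member) can be checked before applying the resources. The following is a minimal sketch that operates on plain dictionaries mirroring the .spec.cluster.redundancy content of the examples; the check_cluster function is hypothetical and is not an EDA tool.

```python
def check_cluster(configs: dict[str, dict]) -> list[str]:
    """Validate redundancy settings keyed by each deployment's EngineConfig name.

    Each value mirrors the .spec.cluster.redundancy block of that deployment.
    Returns a list of human-readable problems (empty when none are found).
    """
    problems = []
    for name, redundancy in configs.items():
        members = [m["name"] for m in redundancy.get("members", [])]
        # With no members listed, the cluster is single-member and the name check is skipped.
        if members and name not in members:
            problems.append(f"{name}: name does not match any listed member")
    actives = {r.get("active") for r in configs.values()}
    if len(actives) > 1:
        problems.append("members disagree on the active member")
    return problems


redundancy = {
    "credential": "cluster-cred",
    "active": "us-west-1",
    "members": [
        {"name": "us-west-1", "address": "10.0.0.2", "port": 55000},
        {"name": "us-east-2", "address": "20.0.0.1", "port": 55001},
    ],
}
print(check_cluster({"us-west-1": redundancy, "us-east-2": redundancy}))  # []
```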
Removing remotes
After installation, you can decommission a remote and reinstall it or remove it entirely. You can remove a remote member even if it is unreachable. You can only remove a member that is a standby, so if you want to remove the active member, you should first switch over to a member that is not being removed. The following example EngineConfig CR describes a three-member cluster:
apiVersion: core.eda.nokia.com/v1
kind: EngineConfig
metadata:
  name: us-west-1
spec:
  git:
    servers:
      - name: git1
        url: https://git1.eda.nokia.com
        credential: git1-token
      - name: git2
        url: https://git2.eda.nokia.com
        credential: git2
    backup:
      repo: sr/eda/backup
    userStorage:
      repo: sr/eda/user-storage
    apps:
      repo: sr/eda/apps
  cluster:
    external:
      ipv4Address: 10.0.0.1
      ipv6Address: 2000::101
      domainName: cluster.eda.nokia.com
      port: 51101
    redundancy:
      credential: cluster-cred
      members:
        - name: us-west-1
          address: 10.0.0.2
          port: 55000
        - name: us-east-2
          address: 20.0.0.1
          port: 55001
        - name: us-east-3
          address: 30.0.0.1
          port: 55001
To reduce this cluster to the single member us-west-1, the following would need to occur:
- Make us-west-1 the active member.
- Remove the us-east-3 member from us-west-1 and us-east-2.
- Uninstall us-east-3.
- Remove us-east-2 from us-west-1.
- Uninstall us-east-2.
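The ordering of these steps can be sketched as a small planning function. It only produces a list of human-readable steps and does not touch a cluster; the function name is hypothetical, and the example assumes that us-east-2 is the active member before the procedure starts, which the sample resource does not state.

```python
def plan_removal(members: list[str], active: str, remove: list[str]) -> list[str]:
    """Return ordered, human-readable steps for removing the given members."""
    steps = []
    remaining = list(members)
    if active in remove:
        # Only a standby can be removed, so activity first moves to a member that stays.
        survivors = [m for m in members if m not in remove]
        if not survivors:
            raise ValueError("at least one member must remain")
        active = survivors[0]
        steps.append(f"make {active} the active member")
    for target in remove:
        remaining.remove(target)
        # Update the EngineConfig of every member still in the cluster, then uninstall.
        steps.append(f"remove {target} from {', '.join(remaining)}")
        steps.append(f"uninstall {target}")
    return steps


for step in plan_removal(
    members=["us-west-1", "us-east-2", "us-east-3"],
    active="us-east-2",  # assumption for illustration only
    remove=["us-east-3", "us-east-2"],
):
    print(step)
```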
Cluster members
The following fields in the EngineConfig CR define the members of a cluster:
- In the .spec.cluster.redundancy.members context:
  - name: a user-friendly name for the member. This setting is validated against the name of the local EngineConfig resource to determine which cluster member the local ConfigEngine is. This requires changes to the current EngineConfig name. If no members are provided, the cluster is assumed to be a single-member cluster, and the name check does not occur.
  - address: either an IPv4 or IPv6 address, or a domain name that can be resolved.
  - port: the port on which a peer ConfigEngine (proxied through APIServer) is exposed. Both the address and port are external and may live on a load balancer.
- In the .spec.cluster.redundancy.credential context:
  - credential: this value is used for authentication between members. The value must be the same for all members.
The following settings must be consistent across cluster members:
- The set of git servers provided in the .spec.git.servers context must be identical.
- The number of replicas for the API server (.spec.api.replicas) and State Aggregator (.spec.stateAggregator.replicas) must be consistent between the clusters. This ensures that standby clusters can take the load of the active cluster. This check occurs only when initially syncing a remote, as the values can change at run time.
- The content of the .spec.cluster context must match. This includes the members in .spec.cluster.redundancy.members, and the information about external reachability of the cluster in the .spec.cluster.external context.
- The content of .spec.playground and .spec.simulate must match.
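These consistency requirements can be sketched as a comparison of two spec dictionaries, for example ones loaded from each member's EngineConfig CR. The helper names below are hypothetical, and the comparison is deliberately strict: lists such as the git servers are treated as different when their order differs, which may be stricter than what EDA itself enforces.

```python
# Dotted paths under .spec that the list above says must match between members.
MUST_MATCH = [
    "git.servers",
    "api.replicas",
    "stateAggregator.replicas",
    "cluster",
    "playground",
    "simulate",
]


def get_path(spec: dict, path: str):
    """Walk a dotted path such as 'api.replicas'; return None when it is absent."""
    node = spec
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def mismatched_fields(local_spec: dict, remote_spec: dict) -> list[str]:
    """Return the dotted paths whose values differ between two member specs."""
    return [p for p in MUST_MATCH if get_path(local_spec, p) != get_path(remote_spec, p)]


local = {"git": {"servers": [{"name": "git1"}, {"name": "git2"}]}, "api": {"replicas": 2}}
remote = {"git": {"servers": [{"name": "git1"}, {"name": "git2"}]}, "api": {"replicas": 1}}
print(mismatched_fields(local, remote))  # ['api.replicas']
```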
Verifying the geo-redundancy state
$ edactl cluster
Name       Address     ActivityState  AveLatency(ms)  Reachable  InSyncWithActive
us-east-2  192.0.2.11  Standby                        true       true
us-west-1  self        Active                         true       true
Switching the active deployment
edactl cluster take-activity <name of member to make active>
If the other deployment is still active and can be reached, the local deployment instructs it to go into standby mode, and makes itself active.
If the other deployment is no longer available (or reachable), the local deployment assumes it to be lost and makes itself active.
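The two cases above can be summarized in a short sketch. It models only the decision described in this section and returns a description of the actions; it does not call edactl or any EDA API, and the function and parameter names are hypothetical.

```python
def take_activity(local: str, peer: str, peer_reachable: bool) -> list[str]:
    """Describe what the local deployment does when asked to take activity."""
    actions = []
    if peer_reachable:
        # The other deployment is still active and reachable: demote it first.
        actions.append(f"{local} instructs {peer} to go into standby mode")
    else:
        # The other deployment cannot be reached: treat it as lost.
        actions.append(f"{local} assumes {peer} is lost")
    actions.append(f"{local} makes itself active")
    return actions


print(take_activity("us-east-2", "us-west-1", peer_reachable=True))
print(take_activity("us-east-2", "us-west-1", peer_reachable=False))
```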