Redundancy
As part of critical infrastructure, EDA must remain resilient during outages so that it can continue to support that infrastructure. Outages can be caused by failures in power, network, storage, or any other dependent infrastructure, and EDA must be able to mitigate the loss of visibility and automation during these events. Outages can also affect connectivity between members of an EDA cluster; in these cases, EDA must avoid split-brain scenarios.
EDA provides resiliency via redundancy, using the following strategies:
- Localized restartability: EDA assumes that any application can fail at any time and that the system must reconcile afterward. This approach applies throughout EDA and is especially relevant for services like ConfigEngine. Any service should be able to restart, with the system converging back to a golden state; if any EDA pod fails, either Kubernetes or ConfigEngine restarts it.
- Localized redundancy and microservices: multiple instances of a common service behind load balancing. This strategy limits the impact of localized outages; in most cases, only in-flight requests are lost.
- Remote redundancy: multiple clusters (or cluster members, depending on hierarchy). This is typically referred to as geo-redundancy: one or more cluster members are present, each able to handle the full load of management activities, with only one active at a time. In EDA, pushes to redundant sites are not synchronous; changes are considered persisted once they reach the majority of configured git servers. This means that some in-flight changes could be lost during a switchover.
Local redundancy
EDA supports automatic recovery of local services in the event of a failure. EDA leverages Kubernetes for deployment of its core services, which provides out-of-the-box redundancy when more than one worker node is available, with EDA services able to be scheduled or rescheduled to remaining available nodes during failures.
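For example, assuming the EDA core services run in the eda-system namespace (the namespace here is an assumption for illustration), standard Kubernetes tooling can show how EDA pods are distributed across worker nodes and confirm that they are rescheduled after a node failure:
# List EDA pods together with the worker node each one is scheduled on
$ kubectl get pods -n eda-system -o wide
# List worker nodes and their readiness; after a node failure, affected pods
# are rescheduled onto the remaining Ready nodes
$ kubectl get nodes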
Cluster recovery
EDA supports cluster recovery by allowing a cluster to be bootstrapped from any member. This process removes all members, starts the active member, and then adds the members back.
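Conceptually, and assuming that membership is driven by the EngineConfig resource as in the examples later in this section (the exact recovery commands are not covered here), the recovery sequence can be sketched as follows:
# 1. On the member chosen to bootstrap from, run with no members configured;
#    a cluster with no members is treated as a single-member cluster.
spec:
  cluster:
    redundancy:
      credential: cluster-cred
      members: []
# 2. Once this member is up as the active, re-add the other members to
#    .spec.cluster.redundancy.members and bring them back online.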
Remote redundancy
Remote redundancy is accomplished by configuring a set of members in the .spec.cluster.redundancy.members context of the EngineConfig resource, and a credential to authenticate members in the .spec.cluster.redundancy.credential context. Synchronization occurs when changes are pushed to the set of git servers for backup.
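As a preview of the complete EngineConfig examples later in this section, the relevant stanza looks like this (the member names, addresses, and ports are taken from those examples):
spec:
  cluster:
    redundancy:
      credential: cluster-cred   # shared credential used to authenticate members
      members:
      - name: us-west-1
        address: 10.0.0.2
        port: 55000
      - name: us-east-2
        address: 20.0.0.1
        port: 55001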
Alarms
EDA supports the following alarms, which are generated only on the active cluster:
- Reachability to any member fails. Additional details are included so that a user can determine, for example, whether the issue relates to connectivity or authentication.
- Latency to a member is above a specified threshold.
- Any core-generated alarms from any standby member. These are forwarded to the active cluster for display, with the node set to the name of the member that raised the alarm.
Geo-redundancy (remote redundancy)
EDA supports two concepts of remote redundancy that can be used together or separately:
- Git redundancy: EDA supports remote redundancy through the backup of configuration information and data to a set of git servers, and the restoration of that backed-up data from the same set of git servers. The git servers are defined in the .spec.git.servers context of the EngineConfig CR. Whenever a change occurs in the system, the active ConfigEngine asynchronously pushes changes to all git servers; from there, any other ConfigEngine can start with the same content via the same git servers.
- Cluster redundancy: In a true geo-redundant environment, multiple EDA deployments run in different locations, where one deployment is designated as active and the other as standby. Both deployments must have the same git servers configured so that they have access to the same data. An operator must define the members of a geo-redundant cluster, where each member is a standalone EDA deployment configured to be part of the cluster. It takes two members to form a cluster, and manual intervention is currently required for switchovers. For details, see Switching the active deployment.
Adding remotes
An operator can enable remote redundancy during initial installation or after installation. All cluster members must be running the same software version.
Initial standalone configuration
The following example shows the initial EngineConfig CR fields for the standalone member, us-west-1. This resource defines a single-member cluster with two git servers, exposed via a load balancer or directly via the address 10.0.0.1 for IPv4 or 2000::101 for IPv6, or reachable via the domain name cluster.eda.nokia.com (which maps to the two IP addresses).
apiVersion: core.eda.nokia.com/v1
kind: EngineConfig
metadata:
  name: us-west-1
spec:
  git:
    servers:
    - name: git1
      url: https://git1.eda.nokia.com
      credential: git1-token
    - name: git2
      url: https://git2.eda.nokia.com
      credential: git2
    backup:
      repo: sr/eda/backup
    userStorage:
      repo: sr/eda/user-storage
    apps:
      repo: sr/eda/apps
  cluster:
    external:
      ipv4Address: 10.0.0.1
      ipv6Address: 2000::101
      domainName: cluster.eda.nokia.com
      port: 51101
Adding another EDA instance
The following EngineConfig CR is for the new EDA instance, us-east-2:
apiVersion: core.eda.nokia.com/v1
kind: EngineConfig
metadata:
  name: us-east-2
spec:
  git:
    servers:
    - name: git1
      url: https://git1.eda.nokia.com
      credential: git1-token
    - name: git2
      url: https://git2.eda.nokia.com
      credential: git2
    backup:
      repo: sr/eda/backup
    userStorage:
      repo: sr/eda/user-storage
    apps:
      repo: sr/eda/apps
  cluster:
    external:
      ipv4Address: 10.0.0.1
      ipv6Address: 2000::101
      domainName: cluster.eda.nokia.com
      port: 51101
    redundancy:
      credential: cluster-cred
      active: us-west-1
      members:
      - name: us-west-1
        address: 10.0.0.2
        port: 55000
      - name: us-east-2
        address: 20.0.0.1
        port: 55001
When this configuration is applied to the us-east-2 cluster, it attempts to connect to us-west-1, which is not yet configured as a cluster member. The attempt to join should fail, with us-east-2 retrying cluster formation at a back-off interval. The active cluster is then updated to:
apiVersion: core.eda.nokia.com/v1
kind: EngineConfig
metadata:
  name: us-west-1
spec:
  git:
    servers:
    - name: git1
      url: https://git1.eda.nokia.com
      credential: git1-token
    - name: git2
      url: https://git2.eda.nokia.com
      credential: git2
    backup:
      repo: sr/eda/backup
    userStorage:
      repo: sr/eda/user-storage
    apps:
      repo: sr/eda/apps
  cluster:
    external:
      ipv4Address: 10.0.0.1
      ipv6Address: 2000::101
      domainName: cluster.eda.nokia.com
      port: 51101
    redundancy:
      credential: cluster-cred
      active: us-west-1
      members:
      - name: us-west-1
        address: 10.0.0.2
        port: 55000
      - name: us-east-2
        address: 20.0.0.1
        port: 55001
This resource describes a two-member cluster, where each member knows how to reach the other using the credential, address, and port provided. The address can be a DNS name or an IPv4/IPv6 address, and the address and port are mapped directly to the ConfigEngine resource in each cluster. The name field in the EngineConfig resource differs per cluster and should map to one of the members listed.
In this example, the cluster grows from 0 members to 2. Both members must specify the same member as active. In this sample configuration, the previously standalone member remains active.
Removing remotes
After installation, you can decommission a remote and reinstall it, or remove it entirely. You can remove a remote member even if it is unreachable. Only a standby member can be removed, so to remove the active member, first switch activity over to a member that is not being removed. The following example shows an EngineConfig for a three-member cluster:
apiVersion: core.eda.nokia.com/v1
kind: EngineConfig
metadata:
  name: us-west-1
spec:
  git:
    servers:
    - name: git1
      url: https://git1.eda.nokia.com
      credential: git1-token
    - name: git2
      url: https://git2.eda.nokia.com
      credential: git2
    backup:
      repo: sr/eda/backup
    userStorage:
      repo: sr/eda/user-storage
    apps:
      repo: sr/eda/apps
  cluster:
    external:
      ipv4Address: 10.0.0.1
      ipv6Address: 2000::101
      domainName: cluster.eda.nokia.com
      port: 51101
    redundancy:
      credential: cluster-cred
      members:
      - name: us-west-1
        address: 10.0.0.2
        port: 55000
      - name: us-east-2
        address: 20.0.0.1
        port: 55001
      - name: us-east-3
        address: 30.0.0.1
        port: 55001
To update the configuration so that only the standalone member, us-west-1, remains, the following must occur:
- Make us-west-1 the active member.
- Remove the us-east-3 member from us-west-1 and us-east-2.
- Uninstall us-east-3.
- Remove us-east-2 from us-west-1.
- Uninstall us-east-2.
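For illustration, after removing the us-east-3 member, the redundancy members list on us-west-1 (and on us-east-2) would contain only the two remaining members:
spec:
  cluster:
    redundancy:
      credential: cluster-cred
      members:
      - name: us-west-1
        address: 10.0.0.2
        port: 55000
      - name: us-east-2
        address: 20.0.0.1
        port: 55001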
Cluster members
The following fields in the EngineConfig CR define the members of a cluster:
- In the .spec.cluster.redundancy.members context:
  - name: a user-friendly name for the member. This setting is validated against the name of the local EngineConfig resource to determine which cluster member the local ConfigEngine is; this may require changing the name of the current EngineConfig resource. If no members are provided, the cluster is assumed to be a single-member cluster and the name check does not occur.
  - address: either an IPv4 or IPv6 address, or a domain name that can be resolved.
- In the .spec.cluster.redundancy.credential context: a credential used for authentication between members. The value must be the same for all members.
- port: the port on which a peer ConfigEngine (proxied through the APIServer) is exposed. Both the address and port are external values and may live on a load balancer.
- The set of git servers provided in the .spec.git.servers context must be identical.
- The number of replicas for the API server (.spec.api.replicas) and State Aggregator (.spec.stateAggregator.replicas) must be consistent between the clusters. This ensures that standby clusters can take the load of the active cluster. This check occurs only when initially syncing a remote, as the values can change at run time.
- The content of the .spec.cluster context must match. This includes the members in .spec.cluster.redundancy.members and the information about external reachability of the cluster in the .spec.cluster.external context.
- The content of .spec.playground and .spec.simulate must match.
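As an illustration of the replica consistency requirement, both clusters would carry matching values in these contexts; the replica counts shown here are placeholders, not recommended values:
spec:
  api:
    replicas: 3            # must match the value on the other cluster
  stateAggregator:
    replicas: 3            # must match the value on the other cluster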
Verifying the geo-redundancy state
$ edactl cluster
Name        Address       ActivityState   AveLatency(ms)   Reachable   InSyncWithActive
us-east-2   192.0.2.11    Standby                          true        true
us-west-1   self          Active                           true        true
Switching the active deployment
To switch which deployment is active, run the following command:
edactl cluster take-activity <name of member to make active>
If the other deployment is still active and can be reached, the local deployment instructs it to go into standby mode, and then makes itself active.
If the other deployment is no longer available (or reachable), the local deployment assumes it to be lost and makes itself active.
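For example, with the two-member cluster shown earlier, and assuming the command is run on the member that is to become active (consistent with the behavior described above), a switchover to us-east-2 might look like this; the output is illustrative only:
$ edactl cluster take-activity us-east-2
$ edactl cluster
Name        Address       ActivityState   AveLatency(ms)   Reachable   InSyncWithActive
us-west-1   10.0.0.2      Standby                          true        true
us-east-2   self          Active                           true        true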