Alarms
Alarms in EDA can arise from a variety of sources, including the EDA system itself and
the wide array of supported apps. For any alarm, the source/affected object is
identified as part of the alarm in the group
and kind
fields.
Alarm are also associated with a namespace; this could be the base EDA namespace, or some other namespace. Users can only see and interact with alarms in namespaces for which they have access permissions.
Some alarms can be generated by intent-based apps within EDA. EDA treats such alarms as having been cleared if the app stops reporting that alarm.
Alarms associated with apps are described in documentation for individual apps.
Alarms on standby clusters
Standby cluster alarms can be important in understanding the state of redundancy in an EDA cluster. It is therefore useful to be able to see alarms generated on a standby cluster member even when working with the active member.
EDA supports this using the `cluster_member` field, which is set to the name of the cluster member that raised the alarm. This allows an operator to view alarms for all clusters, but still distinguish alarms for the active cluster from those for a standby cluster. For alarms that are not cluster-specific, this field remains unset.
Alarms in the EDA GUI
The EDA GUI includes several summary views of alarms known to EDA:
- An alarm summary is displayed on the EDA home page.
- A more detailed summary of alarms affecting key EDA components (clusters, Git servers, App catalogs and registries) is displayed on the main Alarms Summary page
- The Alarms list displays a list of all active alarms
- Suppress an alarm: this sets the suppressed flag for the current instance of the
alarm. By default, suppressed alarms are not displayed in the EDA GUI.Note: You can still view suppressed alarms by choosing "Show all alarms" from the Alarm List Table Settings & Actions menu.
- Delete an alarm: this removes all history of the alarm. Deletion is only allowed for cleared alarms. The option is disabled for active alarms.
- Acknowledge an alarm: this sets the Acknowledged flag for the current instance of the alarm.
Alarms with multiple parents
It is common for an alarm to have more than one parent. For example, an
InterfaceDegraded
alarm may be caused by one or more of its
component members being down. If an interface had four members, it would have two
parents if two of its members were down. To support this relationship, the
parent_alarm
field on the update_alarm
function supports both a string (for a single parent) and an array of strings (for
cases with more than one parent).
The Alarms Summary

Dashlet | Description |
---|---|
Alarm States | The top band of the Summary page displays a breakdown of total alarms
by their Active, Acknowledged, and Suppressed status. Clicking the View link opens the Alarms List. |
Active Alarms | The Active Alarms dashlet displays a total count of alarms affecting
EDA applications and the EDA platform itself. Charts break these counts
down by severity. Clicking the View link opens the Alarms List. |
Active Application Alarms | Building on the data displayed in the Active Alarms panel, the Active
Application Alarms dashlet lists the active alarms affecting EDA
applications, their severity, and their type. Clicking the View link opens the Alarms List. |
Active Platform Alarms | Similarly, the Active Platform Alarms dashlet lists the active alarms
affecting the EDA application, their severity, and their
type. Clicking the View link opens the Alarms List. |
The Alarms list

# | Name | Function |
---|---|---|
1 | Alarms menu | The Alarms menu includes:
The Show all alarms action updates the alarms list to include suppressed alarms, which are hidden by default. Subsequently choosing Hide suppressed alarms restores the default behavior and removes suppressed alarms from the list. |
2 | Alarm count | Displays the number of current alarms of various severities. |
Columns
The list of alarms displays the following columns by default.
Column | Description |
---|---|
Severity | The importance of the alarm, as defined by the alarm itself.
Supported severities are:
|
Type | The alarm type, as defined by the alarm itself. For example, InterfaceDown. |
Cleared | Whether the alarm has been cleared by an operator. Possible
values are:
|
Occurrences | The number of occurrences for the alarm. |
Resource | Indicates the name of the resource that this alarm is present
on.
|
Kind | Indicates the kind of resource the alarm is present on.
|
Group | Indicates the group of the resource the alarm is present on.
|
Last Changed | Indicates the time the alarm last changed state. The timestamp is updated any time an alarm changes state between cleared and not cleared. |
Namespace | Indicates the namespace to which the alarm belongs. Alarms that are not specific to a namespace, such as platform certificate alarms, do not have or display a Namespace value. |
Name | Indicates the name of the alarm. |
Acknowledged | Indicates whether the alarm has been acknowledged (True or False) |
Last Acknowledged | Indicates the date and time when the most recent acknowledgement occurred for this alarm. |
Acknowledged Until | If the alarm has been temporarily acknowledged, this indicates the date and time at which the acknowledgement will expire. |
Parent Alarms | Indicates whether the alarm is associated with one or more parent alarms. It is common for alarms to have one or more parents; for example, an InterfaceDegraded alarm may be caused by one or more of its component members being down; that condition is itself the subject of a separate alarm. |
Cluster Member | For EDA platform alarms, the EDA cluster member to which the alarm applies. |
Source Resource | The EDA-managed resource to which the alarm applies. |
Source Kind | The kind of the resource to which the alarm applies. |
Source Group | The group of the resource to which the alarm applies. |
Description | The description of the alarm, from the alarm's encoded Description field. |
Probable Cause | The probably cause of the alarm, from the alarm's encoded Probable Cause field. |
Remedial Action | The suggested remedial action to resolve the alarm, from the alarm's encoded Remedial Action field. |
JS Path | For an EDA-managed resource, the JS path to that resource. For example, if the alarm pertains to an interface, this is the JS path to that interface: .node{.name=="spine-1"}.srl.interface{.name=="ethernet-1/14"} |
Last Suppressed | Indicates the date and time when the most recent suppression occurred for this alarm. |
Suppressed Until | If the alarm has been temporarily suppressed, this indicates the date and time at which the suppression will expire. |
Sample core alarms
Property | Description |
---|---|
Name | RepositoryReachabilityDown-<cluster>-<server-name>-<repo-type>-<source> |
Severity | Critical |
Description | Connectivity between <source-kind> "<source>" and the "<repo-type>" repository at "<server-uri/remote-path>" is down. This alarm is raised after three failures to connect to a repository, where each attempt is made at a 15s interval. After three failures the alarm is generated (so after 45s), and is cleared on a connection attempt succeeding. |
Probable cause | Connectivity issues, Kubernetes CNI misconfiguration, or credential/TLS misconfiguration/expiration. |
Remedial action | Restore connectivity between the corresponding <source-kind> and apps repository/git server. Ensure credentials and proxy configuration is correct, and any offered certificates are trusted. |
|
Property | Description |
---|---|
Name | ServiceReachabilityDown-<cluster>-<service>-<source> |
Severity | Critical |
Description | Connectivity between <source-kind> "<source>" and the <kind> on "<service>" is down. |
Probable cause | Connectivity issues between worker nodes in the Kubernetes cluster, Kubernetes CNI misconfiguration, pod failure, or TLS misconfiguration/expiration. |
Remedial action | Restore connectivity between the corresponding source and destination. Ensure credentials and proxy configuration is correct (typically using no proxy for inter-cluster HTTPS), and certificate validity. |
|
Property | Description |
---|---|
Name | PodNotRunning-<cluster>-<pod> |
Severity | Critical |
Description | Pod "<pod>" is not in the "Running" state. Any functionality provided by this pod is not available. This alarm can be raised transiently at system startup. |
Probable cause | Kubernetes controller or registry reachability issues, worker node failure, initial instantiation. |
Remedial action | Validate reachability to the registry used to pull the image for the specified pod, ensure no worker node, storage, or networking issues exist that would cause the Kubernetes controller to mark the pod in any state other than "Running". |
|
Property | Description |
---|---|
Name | DeploymentDegraded-<cluster>-<deployment> |
Severity | Critical |
Description | Deployment "<deployment>" has at least one replica not in the "Running" state. Depending on the application this may result in loss of functionality or loss of service capacity. This alarm can be raised transiently at system startup. |
Probable cause | Kubernetes infrastructure issues, worker node failure, initial instantiation. |
Remedial action | Validate reachability to the registry used to pull images for any failed pods in the Deployment, ensure no worker node, storage, or networking issues exist that would cause the Kubernetes controller to mark pods in any state other than "Running". |
|
Property | Description |
---|---|
Name | DeploymentDown-<cluster>-<deployment> |
Severity | Critical |
Description | Deployment "<deployment>" is down, with no pods in the "Running" state. Any functionality provided by the Deployment is not available. This alarm can be raised transiently at system startup. |
Probable cause | Kubernetes infrastructure issues, worker node failure, initial instantiation. |
Remedial action | Validate reachability to the registry used to pull images for failed pods in the Deployment, ensure no worker node, storage, or networking issues exist that would cause the Kubernetes controller to mark pods in any state other than "Running". |
|
Property | Description |
---|---|
Name | PPDown-<cluster>-<npp> |
Severity | Critical |
Description | Connectivity between ConfigEngine "<config-engine>" and the NPP "<npp>" is down. This results in no new transactions succeeding to targets served by this NPP (unless operating in null mode), and no telemetry updates being received. Effectively targets served by this NPP are offline. Look for a corresponding PodNotRunning alarm. |
Probable cause | Connectivity issues between worker nodes in the Kubernetes cluster, Kubernetes CNI misconfiguration, pod failure, or TLS misconfiguration/expiration. |
Remedial action | Restore connectivity between the corresponding ConfigEngine and the destination NPP. Ensure credentials and proxy configuration is correct (typically using no proxy for inter-cluster HTTPS), and certificate validity. |
|
Property | Description |
---|---|
Name | PoolThresholdExceeded-<pool-type>-<pool-name>-<pool-instance> |
Severity | Varies; see definitions |
Description | The "<pool-instance>" instance of the <pool-type> "<pool-name>" has crossed the <severity> threshold of <threshold>. |
Probable cause | Pool utilization. |
Remedial action | Expand the pool via growing a segment, or add additional segments. Additionally you may move pool consumers to a different pool. |
|
Property | Description |
---|---|
Name | StateEngineReachabilityDown-<state-engine>-<state-controller> |
Severity | Critical |
Description | Connectivity between State Controller "<state-controller>" and the State Engine "<state-engine>" is down. This results in no new state application instances being deployed to the corresponding State Engine, and the rebalancing of already-pinned instances to other State Engines. This connectivity is also used to distribute the map of shards to State Engine, meaning the corresponding State Engine will not receive shard updates (assuming it is still running). |
Probable cause | Connectivity issues between worker nodes in the Kubernetes cluster, Kubernetes CNI misconfiguration, pod failure, or TLS misconfiguration/expiration. |
Remedial action | Restore connectivity between the corresponding State Controller and the destination State Engine. Ensure credentials and proxy configuration is correct (typically using no proxy for inter-cluster HTTPS), and certificate validity. |
|
Viewing alarms
The page in the EDA UI in which to view and interact with alarms is located at
.- is sorted first by "Severity", and then by the "last changed" timestamp in descending order (most recent change first)
- hides any suppressed alarms
-
To include suppressed alarms (which are hidden by default), do the
following:
- Click the More icon at the upper right of the Alarms page.
- Select Show All Alarms from the displayed list.
-
To exclude suppressed alarms from the list, do the following:
- Click the More icon at the upper right of the Alarms page.
- Select Hide suppressed alarms from the displayed list.
Acknowledging an alarm
- Acknowledge the alarm permanently
- Acknowledge the alarm temporarily, after which the alarm will return to its unacknowledged state.

- Find the alarm in the list using the sorting and filtering controls.
- At the right side of the row, click the Table row actions button.
- Select Acknowledge from the list.
-
Optionally, you can choose to acknowledge the alarm only temporarily by doing
either of the following:
- Click the drop-down control and select one of the standard periods displayed.
- Click the drop-down control, then click Custom, and in the resulting window select a date and time for the acknowledgement to expire.
- Click Acknowledge to complete the acknowledgement of the alarm.
Acknowledge multiple alarms
- Acknowledge the alarms permanently
- Acknowledge the alarms temporarily, after which all of the selected alarms will return to their unacknowledged state.
- Use the sorting and filtering controls to display the necessary set of alarms in the list.
-
Select all of the alarms you want to acknowledge by checking the box at the
left edge of the list. Click the check box again to unselect any alarm.
Note: To select all alarms in the list, check the check box in the title row. Click the check box again to unselect all alarms in the list.Note: The number of alarms you have selected, as well as the total number of alarms, is indicated at the lower right of the Alarms page.
- At the upper right of the Alarms page, click the Table settings & actions button.
- Select Acknowledge from the list.
-
Optionally, you can choose to acknowledge the alarm only temporarily by doing
either of the following:
- Click the drop-down control and select one of the standard periods displayed.
- Click the drop-down control, then click Custom, and in the resulting window select a date and time for the acknowledgement to expire.
- Click Acknowledge to complete the acknowledgement of the selected alarms.
Deleting a single alarm
- Find the alarm in the list using the sorting and filtering controls.
- At the right side of the row, click the Table row actions button.
-
Select Delete from the list.
Note: The Delete option is not displayed for an alarm that has not been cleared.
- Click Confirm to complete the acknowledgement.
Deleting multiple alarms
- Use the sorting and filtering controls to display the necessary set of alarms in the list.
-
Select all of the alarms you want to delete by checking the box at the left
edge of the list. Click the check box again to unselect any alarm.
Note: To select all alarms in the list, check the check box in the title row. Click the check box again to unselect all alarms in the list.Note: The number of alarms you have selected, as well as the total number of alarms, is indicated at the lower right of the Alarms page.
- At the upper right of the Alarms page, click the Table settings & actions button.
- Select Delete from the list.
-
Click Confirm to complete the acknowledgement for all
alarms.
Note: If some of the alarms you selected were not eligible for deletion, only those that were eligible are deleted by this operation. Ineligible alarms are not deleted. No error message displays in this case.
Suppress a single alarm
- Suppress the alarm permanently
- Suppress the alarm temporarily, after which the alarm will return to its unsuppressed state.

- Find the alarm in the list using the sorting and filtering controls.
- At the right side of the row, click the Table row actions button.
- Select Suppress from the list.
-
Optionally, you can choose to suppress the alarm only temporarily by doing
either of the following:
- Click the drop-down control and select one of the standard periods displayed.
- Click the drop-down control, then click Custom, and in the resulting window select a date and time for the suppression to expire.
-
Click Confirm to complete the alarm suppression.
Note: By default, suppressed alarms are not displayed in the alarms list. Unless you have selected to show all alarms, suppressing an alarm causes it to vanish from the alarms list.
Suppressing multiple alarms
- Suppress the alarms permanently
- Suppress the alarms temporarily, after which all of the selected alarms will return to their unsuppressed state.
- Use the sorting and filtering controls to display the necessary set of alarms in the list.
-
Select all of the alarms you want to delete by checking the box at the left
edge of the list. Click the check box again to unselect any alarm.
Note: To select all alarms in the list, check the check box in the title row. Click the check box again to unselect all alarms in the list.Note: The number of alarms you have selected, as well as the total number of alarms, is indicated at the lower right of the Alarms page.
- At the upper right of the Alarms page, click the Table settings & actions button.
- Select Suppress from the list.
-
Optionally, you can choose to suppress the alarm only temporarily by doing
either of the following:
- Click the drop-down control and select one of the standard periods displayed.
- Click the drop-down control, then click Custom, and in the resulting window select a date and time for the suppression to expire.
-
Click Confirm to complete the suppression for all
alarms.
Note: By default, suppressed alarms are not displayed in the alarms list. Unless you have selected to show all alarms, suppressing alarms causes them to vanish from the alarms list.
Viewing alarm history
- Find the alarm in the list using the sorting and filtering controls.
- At the right side of the row, click the Table row actions button.
-
Select History from the list.
EDA opens the Alarm History window, which shows all events pertaining to the selected alarm including the following details:
- Cleared (yes/no)
- Last change date/time
- Probable cause
- Remedial action