Alarms
In the Fabric Services System, alarms arise when a managed object enters an undesirable state; for example, if a managed node goes out of service, an associated alarm is raised.
Each alarm belongs to one of the following categories:
- Communication
- Configuration
- Environment
- Equipment
- Operational
You can work with alarms using:
- the Alarms panel on the dashboard, which summarizes current alarms.
- the Alarms List page, which you can use to view and manage individual alarms.
- the policy manager, which you can use to customize the severity level for specific types of alarm or to suppress alarms of a specific type entirely.
Alarm states
In the Fabric Services System, an alarm can adopt the following states:
- Acknowledged: An Acknowledged alarm still displays in the Alarms List page. When you view details for the individual alarm, its state displays as Acknowledged, along with any note you added while acknowledging it. You can use the Acknowledged state as the basis for filtering or sorting the alarm list.
- Closed: A Closed alarm still displays in the Alarms List page. This state can be the basis for filtering the alarms included in the list. Closing an alarm does not resolve the condition that caused the alarm to be raised in the first place.
- Cleared: An alarm is Cleared when the condition that raised the alarm has been resolved. Unlike Acknowledged and Closed, the Cleared state cannot be assigned manually by a Fabric Services System operator. Only the device or devices that raised the original alarm can determine that the condition is resolved and report the alarm as cleared.
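The distinction between operator-assigned states and the device-driven Cleared state can be sketched as a small data model. This is an illustrative sketch only, not the Fabric Services System implementation: the class, field, and method names (`Alarm`, `OperatorState`, `device_reports_cleared`, `active_alarms`) and the sample alarm type strings are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class OperatorState(Enum):
    """States that an operator can assign manually."""
    ACKNOWLEDGED = "Acknowledged"
    CLOSED = "Closed"


@dataclass
class Alarm:
    alarm_type: str                                  # hypothetical identifier
    cleared: bool = False                            # set only by the reporting device
    operator_state: Optional[OperatorState] = None   # set only by an operator
    note: str = ""

    def acknowledge(self, note: str = "") -> None:
        # Acknowledging records the operator state and an optional note;
        # it does not resolve or clear the underlying condition.
        self.operator_state = OperatorState.ACKNOWLEDGED
        self.note = note

    def close(self) -> None:
        # Closing likewise leaves the underlying condition untouched.
        self.operator_state = OperatorState.CLOSED

    def device_reports_cleared(self) -> None:
        # Models the device reporting that the condition is resolved.
        self.cleared = True


def active_alarms(alarms: List[Alarm]) -> List[Alarm]:
    """Return alarms that have not been cleared (the 'active' alarms)."""
    return [a for a in alarms if not a.cleared]
```

In this model, acknowledging or closing an alarm never changes `cleared`, which mirrors the rule that an operator cannot assign the Cleared state.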
Displaying alarms
To view and manage alarms with the Alarms List page:
- From the main menu, select Alarms List.
The alarm list displays, showing all active alarms (where "active" refers to alarms that have not been cleared).
Note: Cleared alarms are not included in this list because the "Cleared" filter is set to "False" by default. To view cleared alarms in this list, clear that filter.
Note: A set of default columns displays in the Alarms List view:
- Severity
- Alarm type
- Node name
- Resource name
- Cleared
- Occurrence
- Last Raised
There are other columns available to show more information about each alarm. You can add or remove columns from any list.
- To view details about an alarm and its state:
- Select an alarm in the list.
- At the right edge of the row, click to display the action list, then select State Details.
- Click the ALARM STATE tab to view details about the alarm's severity, a description of the alarm, and the time it was raised.
- Click the OPERATOR STATE tab to view the state assigned by the operator to address the alarm (either Acknowledged or Closed).
- When you are finished, click CLOSE to return to the Alarms List page.
- To acknowledge an alarm:
Acknowledging an alarm marks it as received, but does not clear the alarm from the alarm list.
- Select an alarm in the list.
- At the right edge of the row, click to display the action list, then select Acknowledge.
- Optionally, enter any comments about the acknowledgement in the Additional Info field.
- Click SAVE.
The alarm is marked as Acknowledged (but not Closed).
- To close an alarm:
Closing an alarm prevents the alarm from appearing in the Alarms List page, but does not resolve the condition that raised the alarm in the first place.
Customizing an alarm severity level
A policy affects only alarms of the targeted type that are raised after the policy takes effect; it does not retroactively modify existing alarms of the same type.
Each policy can include a start time and an end time; these are boundaries on the time of day during which the policy applies. An alarm raised outside these boundaries has its default severity instead of the severity level defined by the policy. If no start and end times are defined, the policy is always active.
You can also use a policy to suppress an alarm entirely while the policy is in effect.
A policy can select the alarms it applies to in one of two ways:
- by key value, which allows you to trigger a policy based on the name of the object (node, fabric, intent, or region) affected by the alarm.
- by alarm category and type, to apply the policy regardless of the object affected.
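The time-of-day boundaries and the suppression option described above can be illustrated with a short sketch. It is a simplified model under stated assumptions, not the product's logic: the `Policy` class, the `effective_severity` function, and the severity names used in the usage note are hypothetical. The source also does not say how a window that crosses midnight is treated, so that branch is an assumption.

```python
from dataclasses import dataclass
from datetime import time
from typing import Optional


@dataclass
class Policy:
    severity: Optional[str] = None   # customized severity, if any
    suppress: bool = False           # suppress matching alarms entirely
    start: Optional[time] = None     # time-of-day boundaries (optional)
    end: Optional[time] = None

    def is_active(self, raised_at: time) -> bool:
        # With no start and end times, the policy is always active.
        if self.start is None or self.end is None:
            return True
        if self.start <= self.end:
            return self.start <= raised_at <= self.end
        # Assumption: a window such as 22:00-06:00 wraps past midnight.
        return raised_at >= self.start or raised_at <= self.end


def effective_severity(default: str, policy: Policy, raised_at: time) -> Optional[str]:
    """Return the severity an alarm would carry, or None if it is suppressed."""
    if not policy.is_active(raised_at):
        return default               # outside the window: default severity
    if policy.suppress:
        return None                  # suppressed while the policy is in effect
    return policy.severity or default
```

For example, with a policy of `Policy(severity="minor", start=time(8, 0), end=time(18, 0))`, an alarm whose default severity is `"critical"` is reported as `"minor"` when raised at 09:30 and as `"critical"` when raised at 22:00.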
To customize an alarm's severity:
- From the main menu, select Policies.
- Click + CREATE A POLICY.
- Set the Name and Description fields for the policy.
- In the Policy Definition panel, set the Start Time and End Time fields.
An alarm that would be affected by this policy uses the customized severity level only if it is raised during this period. If it is raised outside this period, it uses the default severity level.
- Do one of the following:
- With the Key Value toggle enabled, do the following:
- In the Policy Definition panel, do the following:
- Configure the way this policy modifies alarms:
- Click CREATE.
Third-party tool access to Fabric Services System alarms
You can configure the system to give third-party tools access to Fabric Services System alarms, so that operators can monitor and operate their network with their own operational tool sets. The Fabric Services System exposes raised alarms to third-party tools through a Kafka message bus. The system publishes all generated alarms on a Kafka topic to which an external system can subscribe.
The Kafka broker used for this topic accepts only SSL connections from external systems. An external client must authenticate before it can subscribe to a topic. The Kafka broker allows the external client only to subscribe to the topic, not to publish to it.
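As a rough illustration of an authenticated, SSL-only subscription, the following sketch builds consumer settings for the third-party kafka-python library. Several points are assumptions rather than documented facts: the SCRAM-SHA-512 mechanism is inferred from the sample `update-kafka.sh` output later in this section, the topic name and CA file path must be supplied by you, and the requirement that the consumer group ID start with the configured `groupprefix` is a guess based on the parameter name. The function names are hypothetical.

```python
def consumer_settings(host, port, user, password, group_prefix, ca_file):
    """Build keyword arguments for kafka-python's KafkaConsumer."""
    return {
        "bootstrap_servers": f"{host}:{port}",
        "security_protocol": "SASL_SSL",       # SSL transport plus authentication
        "sasl_mechanism": "SCRAM-SHA-512",     # assumption, see lead-in
        "sasl_plain_username": user,
        "sasl_plain_password": password,
        "ssl_cafile": ca_file,                 # CA that signed the broker certificate
        # Assumption: group IDs are expected to begin with the configured prefix.
        "group_id": f"{group_prefix}-monitor",
    }


def consume(topic, settings):
    """Yield raw (protobuf-encoded) alarm payloads from the given topic."""
    # Imported lazily so the settings helper works without the library installed.
    from kafka import KafkaConsumer  # third-party: pip install kafka-python

    consumer = KafkaConsumer(topic, **settings)
    for record in consumer:
        yield record.value  # bytes; decode with the published .proto definition
```

A client would call `consume()` with the alarm topic name and the settings built from the `kafkaconfig` values; because the broker permits subscription only, no producer is created.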
Alarm messages
The alarm messages that are published to the Kafka topic are in Protocol Buffer (protobuf) format. For an example, see Appendix B: Protobuf file message format.
From the Fabric Services System, you can obtain this file using the following REST call:
https://fss.domain.tld/rest/alarmmgr/fss_alarmexternal.proto
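A minimal sketch of retrieving the file with the Python standard library follows. The helper names are hypothetical, and the sketch assumes that the endpoint can be fetched with a plain HTTPS GET and that the server certificate is trusted by the local system; whether the call requires authentication is not stated here.

```python
import urllib.request


def proto_url(fss_host: str) -> str:
    """Build the REST URL for the alarm protobuf definition."""
    return f"https://{fss_host}/rest/alarmmgr/fss_alarmexternal.proto"


def download_proto(fss_host: str, dest: str = "fss_alarmexternal.proto") -> str:
    """Download the .proto file and return the local path it was saved to."""
    # Assumption: an unauthenticated GET is sufficient (see lead-in).
    with urllib.request.urlopen(proto_url(fss_host)) as resp, open(dest, "wb") as out:
        out.write(resp.read())
    return dest
```

After downloading, the file can be compiled for a client language with the standard protobuf compiler, for example `protoc --python_out=. fss_alarmexternal.proto`.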
Configuration
The settings that enable third-party tools to access Fabric Services System alarms are configured during the Fabric Services System application installation; for more information, see “Editing the installation configuration file” in the Fabric Services System Software Installation Guide.
After the Fabric Services System application has been installed, you can update the settings as described in Updating configuration for the external Kafka service.
Updating configuration for the external Kafka service
- You must perform this procedure during a maintenance window.
- All external connections must be closed before executing this procedure; you can initiate the connections again after you have completed this procedure.
- HTTPS must be enabled on the Fabric Services System.
- Update the sample-input.json file.
The parameters are in the kafkaconfig sub-section of the fss section.
Currently, you can only change the setting for the maxConnections parameter. This parameter specifies the maximum number of clients that can connect to the Kafka service; the maximum value for this parameter is 10.
"fss": {
    "dhcpnode": "fss-node01",
    "dhcpinterface": "192.0.2.11/24",
    "ztpaddress": "192.0.2.11",
    "httpsenabled": true,
    "certificate": "/root/certs/fss-tls.crt",
    "privatekey": "/root/certs/fss-tls.key",
    "domainhost": "myhost.mydomain.com",
    "kafkaconfig": {
        "port": "32425",
        "groupprefix": "mygrp",
        "user": "myuser",
        "password": "mypasswd",
        "maxConnections": 2
    }
}
[root@fss-deployer ~]# diff updated-input.json input.json
< "maxConnections": 3
---
> "maxConnections": 2
- Run the fss-install.sh script to update the system configuration.
The fss-install.sh script is available in the /root/bin directory.
[root@fss-deployer ~]# /root/bin/fss-install.sh configure updated-input.json
WARNING: truststore not configured
Timesync service is running on 10.254.45.123 Time difference is 0 seconds
Timesync service is running on 10.254.44.123 Time difference is 0 seconds
Timesync service is running on 10.254.43.123 Time difference is -1 seconds
Timesync service is running on 10.254.42.123 Time difference is 0 seconds
Timesync service is running on 10.254.41.123 Time difference is 0 seconds
Timesync service is running on 10.254.40.123 Time difference is 0 seconds
Maximum time difference between nodes 1 seconds
WARNING: Storage related disks will be wiped clean during install, data will be lost.
Please verify that correct disks are referred in the input configuration.
- Update the Kafka service.
[root@fss-deployer ~]# /root/bin/update-kafka.sh
Kafka will be updated with the current config.
release "kafka" uninstalled
Using User certificates for the cluster
secret "kafka-fss-cluster-ca-cert" deleted
secret/kafka-fss-clients-ca-cert created
secret/kafka-fss-cluster-ca-cert created
secret "kafka-fss-cluster-ca" deleted
secret/kafka-fss-cluster-ca created
secret/kafka-fss-clients-ca created
secret/kafka-fss-cluster-ca-cert labeled
secret/kafka-fss-clients-ca-cert labeled
secret/kafka-fss-cluster-ca labeled
secret/kafka-fss-clients-ca labeled
secret/kafka-fss-cluster-ca-cert annotated
secret/kafka-fss-clients-ca-cert annotated
secret/kafka-fss-cluster-ca annotated
secret/kafka-fss-clients-ca annotated
NAME: kafka
LAST DEPLOYED: Fri Mar 31 05:03:05 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
Fri Mar 31 05:03:07 UTC 2023 Start: Checking Kafka pods status
Fri Mar 31 05:03:07 UTC 2023 wait 800s for kafka cluster to startup
Fri Mar 31 05:03:18 UTC 2023 wait 800s for kafka cluster to startup
Fri Mar 31 05:03:28 UTC 2023 wait 800s for kafka cluster to startup
Fri Mar 31 05:03:39 UTC 2023 wait 800s for kafka cluster to startup
Fri Mar 31 05:03:49 UTC 2023 wait 800s for kafka cluster to startup
Fri Mar 31 05:04:00 UTC 2023 wait 800s for kafka cluster to startup
Fri Mar 31 05:04:10 UTC 2023 wait 800s for kafka cluster to startup
Fri Mar 31 05:04:52 UTC 2023 wait 800s for kafka cluster to startup
Fri Mar 31 05:05:02 UTC 2023 Kafka Operator is up
NAME  READY  STATUS  RESTARTS  AGE
kafka-fss-entity-operator-b6757b664-bvvpq  3/3  Running  0  37s
kafka-fss-kafka-0  1/1  Running  0  71s
kafka-fss-kafka-1  1/1  Running  0  71s
kafka-fss-kafka-2  1/1  Running  0  71s
kafka-fss-zookeeper-0  1/1  Running  0  115s
kafka-fss-zookeeper-1  1/1  Running  0  115s
kafka-fss-zookeeper-2  1/1  Running  0  115s
strimzi-cluster-operator-5bc66cb4f9-dnkcv  1/1  Running  0  12h
NAME  CLUSTER  AUTHENTICATION  AUTHORIZATION  READY
fss-kafka-admin  kafka-fss  scram-sha-512  simple  True
myuser  kafka-fss  scram-sha-512  simple  True
NAME  TYPE  DATA  AGE
default-token-tr6nz  kubernetes.io/service-account-token  3  12h
fss-kafka-admin  Opaque  2  116s
kafka-fss-clients-ca  Opaque  1  2m1s
kafka-fss-clients-ca-cert  Opaque  3  2m3s
kafka-fss-cluster-ca  Opaque  1  2m1s
kafka-fss-cluster-ca-cert  Opaque  3  2m2s
kafka-fss-cluster-operator-certs  Opaque  4  115s
kafka-fss-entity-operator-token-zhz2r  kubernetes.io/service-account-token  3  37s
kafka-fss-entity-topic-operator-certs  Opaque  4  37s
kafka-fss-entity-user-operator-certs  Opaque  4  37s
kafka-fss-kafka-brokers  Opaque  12  71s
kafka-fss-kafka-token-52ddx  kubernetes.io/service-account-token  3  72s
kafka-fss-zookeeper-nodes  Opaque  12  115s
kafka-fss-zookeeper-token-xfvfb  kubernetes.io/service-account-token  3  115s
myuser  Opaque  2  116s
sh.helm.release.v1.kafkaop.v1  helm.sh/release.v1  1  12h
strimzi-cluster-operator-token-q5kkt  kubernetes.io/service-account-token  3  12h
NAME  TYPE  CLUSTER-IP  EXTERNAL-IP  PORT(S)  AGE
kafka-fss-kafka-0  NodePort  10.233.61.12  <none>  9094:31239/TCP  72s
kafka-fss-kafka-1  NodePort  10.233.5.58  <none>  9094:30182/TCP  72s
kafka-fss-kafka-2  NodePort  10.233.56.222  <none>  9094:30026/TCP  72s
kafka-fss-kafka-bootstrap  ClusterIP  10.233.34.56  <none>  9091/TCP,9092/TCP,9093/TCP  72s
kafka-fss-kafka-brokers  ClusterIP  None  <none>  9090/TCP,9091/TCP,9092/TCP,9093/TCP  72s
kafka-fss-kafka-external-bootstrap  NodePort  10.233.45.207  <none>  9094:32425/TCP  72s
kafka-fss-zookeeper-client  ClusterIP  10.233.21.201  <none>  2181/TCP  115s
kafka-fss-zookeeper-nodes  ClusterIP  None  <none>  2181/TCP,2888/TCP,3888/TCP  115s
- Wait for the Fabric Services System application to stabilize.
Convergence may take some time. During this period, pods are known to fail and restart. The system is stable when all the pods are in Running state.
[root@fss-deployer ~]# export KUBECONFIG=/var/lib/fss/config.fss
[root@fss-deployer ~]# kubectl get pods
NAME  READY  STATUS  RESTARTS  AGE
fss-logs-fluent-bit-56t99  1/1  Running  0  12h
fss-logs-fluent-bit-d94x2  1/1  Running  0  12h
fss-logs-fluent-bit-hbvzt  1/1  Running  0  12h
fss-logs-fluent-bit-q7f6g  1/1  Running  0  12h
fss-logs-fluent-bit-r5tr4  1/1  Running  0  12h
fss-logs-fluent-bit-tmldd  1/1  Running  0  12h
prod-ds-apiserver-88fcd7cd7-lhmhh  1/1  Running  0  12h
prod-ds-cli-7cfd7664db-6xhk5  1/1  Running  0  12h
prod-ds-docker-registry-5b467bbf67-4lh2z  1/1  Running  0  12h
prod-ds-imgsvc-deploy-5f99648577-fjfdg  1/1  Running  0  12h
prod-fss-alarmmgr-78fd576464-2tfl9  1/1  Running  1 (2m19s ago)  12h
prod-fss-auth-6c99d44ccb-tnt8t  1/1  Running  1 (3m20s ago)  12h
prod-fss-catalog-54cb57645-s6mj7  1/1  Running  1 (2m50s ago)  12h
prod-fss-cfggen-6dfc6d8ccb-rjmxt  1/1  Running  1 (2m49s ago)  12h
prod-fss-cfgsync-78df54976f-nqrcm  1/1  Running  0  12h
prod-fss-connect-58c98db7d4-x4w5g  1/1  Running  1 (3m18s ago)  12h
prod-fss-da-0  1/1  Running  1 (2m20s ago)  12h
prod-fss-da-1  1/1  Running  1 (2m20s ago)  12h
prod-fss-da-2  1/1  Running  1 (2m20s ago)  12h
prod-fss-da-3  1/1  Running  1 (2m20s ago)  12h
prod-fss-da-4  1/1  Running  1 (2m18s ago)  12h
prod-fss-da-5  1/1  Running  1 (2m48s ago)  12h
prod-fss-da-6  1/1  Running  1 (2m18s ago)  12h
prod-fss-da-7  1/1  Running  1 (2m18s ago)  12h
prod-fss-deviationmgr-acl-7d8d878d66-jc48z  1/1  Running  0  12h
prod-fss-deviationmgr-bfd-5f6bcf7d-xsq46  1/1  Running  0  12h
prod-fss-deviationmgr-interface-5f7fdcfc6c-fpk48  1/1  Running  0  12h
prod-fss-deviationmgr-netinst-c7d5648d7-z9mdp  1/1  Running  0  12h
prod-fss-deviationmgr-platform-6d9c574bb9-l4cb7  1/1  Running  0  12h
prod-fss-deviationmgr-qos-5b99fcc7d9-977r6  1/1  Running  0  12h
prod-fss-deviationmgr-routingpolicy-775f49b66-qnqrj  1/1  Running  0  12h
prod-fss-deviationmgr-system-557bbbc75f-rjknq  1/1  Running  0  12h
prod-fss-dhcp-5bc95b6966-kzd2n  1/1  Running  0  12h
prod-fss-dhcp6-69d8785d64-l4qdk  1/1  Running  0  12h
prod-fss-digitalsandbox-5c44679f86-4bp8p  1/1  Running  1 (2m50s ago)  12h
prod-fss-filemgr-65c6799996-ggl27  1/1  Running  0  12h
prod-fss-imagemgr-fd97fc4fb-6w8t4  1/1  Running  1 (2m50s ago)  12h
prod-fss-intentmgr-64f97dc466-ftjgm  1/1  Running  1 (2m20s ago)  12h
prod-fss-inventory-6f84769f46-w8h97  1/1  Running  1 (3m18s ago)  12h
prod-fss-labelmgr-847575b8c6-4m8xj  1/1  Running  1 (3m19s ago)  12h
prod-fss-maintmgr-7f599dd5db-fqk29  1/1  Running  1 (2m20s ago)  12h
prod-fss-mgmtstack-79c67c585c-pk2nv  1/1  Running  1 (2m20s ago)  12h
prod-fss-oper-da-0  1/1  Running  1 (2m20s ago)  12h
prod-fss-oper-da-1  1/1  Running  1 (2m20s ago)  12h
prod-fss-oper-da-2  1/1  Running  1 (2m20s ago)  12h
prod-fss-oper-da-3  1/1  Running  1 (2m20s ago)  12h
prod-fss-oper-da-4  1/1  Running  1 (2m19s ago)  12h
prod-fss-oper-da-5  1/1  Running  1 (2m18s ago)  12h
prod-fss-oper-da-6  1/1  Running  1 (2m18s ago)  12h
prod-fss-oper-da-7  1/1  Running  1 (2m18s ago)  12h
prod-fss-oper-topomgr-6b848bbcf7-5z8c9  1/1  Running  1 (2m19s ago)  12h
prod-fss-protocolmgr-776bdf59c7-zvfl2  1/1  Running  0  12h
prod-fss-topomgr-5dd97997b8-jw8rk  1/1  Running  1 (2m19s ago)  12h
prod-fss-transaction-79bdb7d78d-lxwpp  1/1  Running  1 (2m50s ago)  12h
prod-fss-version-767b859c96-t2v5w  1/1  Running  1 (2m20s ago)  12h
prod-fss-web-5c94fd7455-l4sfz  1/1  Running  1 (2m20s ago)  12h
prod-fss-workloadmgr-7b8f44b95d-f8cv6  1/1  Running  1 (3m19s ago)  12h
prod-fss-ztp-86cbf5cdc-xtx9q  1/1  Running  1 (2m49s ago)  12h
prod-keycloak-0  1/1  Running  0  12h
prod-mongodb-arbiter-0  1/1  Running  0  12h
prod-mongodb-primary-0  1/1  Running  0  12h
prod-mongodb-secondary-0  1/1  Running  0  12h
prod-neo4j-core-0  1/1  Running  0  12h
prod-postgresql-0  1/1  Running  0  12h
prod-sftpserver-77cd8696d5-fxswn  1/1  Running  0  12h
[root@6node-deployer-vm ~]#
- Initiate external connections from Kafka clients.
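Two parts of the procedure above lend themselves to light scripting: changing maxConnections in the input file, and checking that every pod is Running. The following sketch shows one possible approach. The function names are hypothetical, the lower bound of 1 on maxConnections is an assumption (only the maximum of 10 is stated), and the pod check relies only on the column layout shown in the sample `kubectl get pods` output.

```python
import json

MAX_KAFKA_CLIENTS = 10  # upper bound stated for maxConnections


def set_max_connections(config: dict, value: int) -> dict:
    """Return a copy of the input configuration with maxConnections changed."""
    # Assumption: at least one client connection must be allowed.
    if not 1 <= value <= MAX_KAFKA_CLIENTS:
        raise ValueError(f"maxConnections must be between 1 and {MAX_KAFKA_CLIENTS}")
    updated = json.loads(json.dumps(config))  # deep copy through JSON
    updated["fss"]["kafkaconfig"]["maxConnections"] = value
    return updated


def pods_not_running(kubectl_output: str) -> list:
    """List pod names whose STATUS column is anything other than Running."""
    lines = kubectl_output.strip().splitlines()[1:]  # skip the header row
    bad = []
    for line in lines:
        fields = line.split()
        # Columns: NAME READY STATUS RESTARTS AGE
        if len(fields) >= 3 and fields[2] != "Running":
            bad.append(fields[0])
    return bad
```

A wrapper script could load the input file with `json.load`, write the result of `set_max_connections` to updated-input.json, and later poll `kubectl get pods` until `pods_not_running` returns an empty list before reconnecting external clients.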