Health monitoring

The Fabric Services System includes a health monitoring service that allows you to monitor the state of the system and its component software services. You can use this service to ensure that the system is functioning correctly, and to alert you to components that are encountering issues.

The Fabric Services System stores its health metrics in Prometheus, a monitoring system and metric format that is widely used within the Kubernetes ecosystem. Using this standard format makes it easier to find, monitor, and interpret the telemetry provided.

The health monitoring system currently retrieves the following metrics:

Fabric Services System API metrics
- fss_request_duration_seconds
- fss_request_duration_seconds_count
- fss_request_size_bytes_sum
- fss_request_size_bytes_count
- fss_requests_total
- fss_response_size_bytes_sum
- fss_response_size_bytes_count
Fabric Services System Apps, Pods, and other Golang Apps metrics
- go_gc_duration_seconds
- go_gc_duration_seconds_count
- go_gc_duration_seconds_sum
- go_goroutines
- go_info
- go_memstats_alloc_bytes
- go_memstats_alloc_bytes_total
- go_memstats_buck_hash_sys_bytes
- go_memstats_frees_total
- go_memstats_gc_cpu_fraction
- go_memstats_gc_sys_bytes
- go_memstats_heap_alloc_bytes
- go_memstats_heap_idle_bytes
- go_memstats_heap_inuse_bytes
- go_memstats_heap_objects
- go_memstats_heap_released_bytes
- go_memstats_heap_sys_bytes
- go_memstats_last_gc_time_seconds
- go_memstats_lookups_total
- go_memstats_mallocs_total
- go_memstats_mcache_inuse_bytes
- go_memstats_mcache_sys_bytes
- go_memstats_mspan_inuse_bytes
- go_memstats_mspan_sys_bytes
- go_memstats_next_gc_bytes
- go_memstats_other_sys_bytes
- go_memstats_stack_inuse_bytes
- go_memstats_stack_sys_bytes
- go_memstats_sys_bytes
- go_threads
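
Several of the metrics listed above come in _sum and _count pairs. As a minimal sketch, assuming these pairs follow the usual Prometheus convention (a cumulative total and a cumulative number of observations), the average value per observation over a time window can be expressed as the ratio of the two rates. The helper name average_query and the five-minute window are illustrative only and are not part of the Fabric Services System.

```python
def average_query(metric_base: str, window: str = "5m") -> str:
    """Return a PromQL expression for the mean observation over `window`.

    Assumes `<metric_base>_sum` and `<metric_base>_count` both exist and
    are cumulative counters, as is conventional for Prometheus summaries
    and histograms.
    """
    return (
        f"rate({metric_base}_sum[{window}])"
        f" / rate({metric_base}_count[{window}])"
    )


# Average API request size in bytes over the last five minutes.
print(average_query("fss_request_size_bytes"))
# prints: rate(fss_request_size_bytes_sum[5m]) / rate(fss_request_size_bytes_count[5m])
```

The resulting string can be submitted as a query to any tool that speaks the Prometheus API.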

Deployment and configuration

By default, the health monitoring and telemetry feature is disabled and the Prometheus server and the node exporters are not deployed.

A configuration file is available during deployment of the Fabric Services System which allows you to:

- Enable health monitoring and telemetry.
- Enable or disable worker node monitoring. Node monitoring is disabled by default; when it is enabled, Prometheus is configured to collect node metrics and telemetry.
- Configure a scrape interval, which determines how often telemetry is gathered from all endpoints and stored. The default value is one minute.
- Configure the retention time, expressed as a number of hours followed by "h", for example "3h". The default retention time is one hour.
- Configure user authentication information. This allows you to manage the users, and their passwords, that have read access to the Prometheus service.
  Note: Passwords must be provided as bcrypt hashes. Use any standard method to generate a bcrypt hash from the plain-text password.
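
As one possible way to produce and sanity-check such a hash, the sketch below uses Python. It assumes the third-party bcrypt package is installed for hash generation; the function names looks_like_bcrypt and make_bcrypt_hash and the cost factor of 12 are illustrative choices, not requirements stated by the Fabric Services System. The format check relies only on the standard library and verifies the shape of a value, not that it matches a given password.

```python
import re

# A bcrypt hash has the shape $2a$/$2b$/$2y$, a two-digit cost factor, and
# 53 characters (22 for the salt, 31 for the digest) from [./A-Za-z0-9].
_BCRYPT_PATTERN = re.compile(r"\$2[aby]\$(\d{2})\$[./A-Za-z0-9]{53}")


def looks_like_bcrypt(value: str) -> bool:
    """Return True if `value` has the shape of a bcrypt hash."""
    match = _BCRYPT_PATTERN.fullmatch(value)
    return match is not None and 4 <= int(match.group(1)) <= 31


def make_bcrypt_hash(password: str, rounds: int = 12) -> str:
    """Hash `password` with the third-party `bcrypt` package."""
    import bcrypt  # imported lazily so the checker above works without it

    salt = bcrypt.gensalt(rounds=rounds)
    return bcrypt.hashpw(password.encode("utf-8"), salt).decode("ascii")
```

Calling make_bcrypt_hash("my-password") returns a string that can be pasted, in quotes, as the value for a user name; looks_like_bcrypt can be used to catch a plain-text password that was entered by mistake.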

The configuration is part of the user_values.yaml file that can be found in the /root folder of the Deployer VM. This file is used to configure some advanced features in the platform, and may contain additional parameters that impact other features.

For the Health Monitoring feature, the following example contains all the possible settings as described above with example values:

prometheus:
  enabled: true
  serverFiles:
    fssconfig.yml:
      basic_auth_users:
        fss-user: '$2a$12$d81J/Hadc/rb2eiOQYN0T.wCYvSi29RiQ2Ql3JR9dcmUtt5l/39i.'
        extrauser: '$2b$12$bjTtOtoB5WVhC5KAmtRbTOLE2PB5HKqHfLWacytVGnAqlyWcSU1Ry'
  server:
    global:
      scrape_interval: 30s
    retention: 3h
  prometheus-node-exporter:
    enabled: true

Note: The following constraints apply to the health monitoring service:

- Prometheus data is not backed up during a backup of the Fabric Services System.
- The system stores a maximum of 2 GB of data.
- The minimum scrape interval is 30 seconds.
- The maximum retention time is 3 hours.
- The system supports a maximum of five users.
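
The scrape interval, retention, and user-count constraints above can be checked before the configuration is applied. The following is a minimal sketch, assuming the content under the prometheus key of user_values.yaml has already been loaded into a Python dictionary with the same key layout as the example. The names to_seconds and check_constraints are hypothetical, only single-unit durations such as "30s", "1m", or "3h" are handled, and the 2 GB storage limit is not checked.

```python
import re

MIN_SCRAPE_SECONDS = 30            # minimum scrape interval
MAX_RETENTION_SECONDS = 3 * 3600   # maximum retention time
MAX_USERS = 5                      # maximum number of users

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def to_seconds(duration: str) -> int:
    """Convert a single-unit duration such as '30s', '1m' or '3h'."""
    match = re.fullmatch(r"(\d+)([smh])", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def check_constraints(prometheus: dict) -> list:
    """Return a list of constraint violations (empty if there are none)."""
    problems = []
    server = prometheus.get("server", {})
    # Fall back to the documented defaults when a value is not set.
    scrape = server.get("global", {}).get("scrape_interval", "1m")
    retention = server.get("retention", "1h")
    users = (
        prometheus.get("serverFiles", {})
        .get("fssconfig.yml", {})
        .get("basic_auth_users", {})
    )
    if to_seconds(scrape) < MIN_SCRAPE_SECONDS:
        problems.append(f"scrape_interval {scrape} is below 30 seconds")
    if to_seconds(retention) > MAX_RETENTION_SECONDS:
        problems.append(f"retention {retention} exceeds 3 hours")
    if len(users) > MAX_USERS:
        problems.append(f"{len(users)} users configured, maximum is five")
    return problems


# Hypothetical values chosen to trigger one violation.
sample = {
    "server": {"global": {"scrape_interval": "15s"}, "retention": "2h"},
    "serverFiles": {
        "fssconfig.yml": {"basic_auth_users": {"fss-user": "<bcrypt hash>"}}
    },
}
print(check_constraints(sample))
# prints: ['scrape_interval 15s is below 30 seconds']
```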

Enabling health monitoring during installation

After deploying the Deployer VM and the Fabric Services System node VMs, but before running the installation process, add the appropriate entries to the /root/user_values.yaml file as described in Deployment and configuration.

The configuration is applied during the regular installation process.

Accessing the Prometheus metrics

Retrieve the metrics gathered by Prometheus only through the standard Prometheus API, using any tool that supports this API. Direct access to the Prometheus UI is not supported.

To access the health monitoring metrics, direct the tool to the https://fss.domain.tld/prometheus URI, where "fss.domain.tld" is replaced with the FQDN of the Fabric Services System deployment.
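
As an illustration of what such a tool does, the sketch below builds an instant query against the standard Prometheus HTTP API using only the Python standard library. It assumes that the API is exposed under the /prometheus URI at the usual /api/v1/query path, and that the users configured for the Prometheus service authenticate with HTTP basic authentication. The function names and the credentials shown are placeholders.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_query_request(base_url, promql, user, password):
    """Build a GET request for the Prometheus instant-query endpoint."""
    params = urllib.parse.urlencode({"query": promql})
    url = base_url.rstrip("/") + "/api/v1/query?" + params
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    headers = {"Authorization": "Basic " + token.decode("ascii")}
    return urllib.request.Request(url, headers=headers)


def run_query(request, timeout=10):
    """Send the request and return the list of results from the reply."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body.get('error')}")
    return body["data"]["result"]


# Placeholder FQDN and credentials; replace them with real values.
request = build_query_request(
    "https://fss.domain.tld/prometheus",
    "fss_requests_total",
    "fss-user",
    "plain-text-password",
)
print(request.full_url)
# prints: https://fss.domain.tld/prometheus/api/v1/query?query=fss_requests_total
```

Passing the request to run_query(request) performs the HTTPS call and returns the matching series. Note that the plain-text password is sent in the request, whereas the configuration file holds only its bcrypt hash.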

Updating the health monitoring configuration after installation

You can change the configuration of the health monitoring feature after installation. This includes enabling or disabling aspects of the feature, and changing the passwords for users.

Follow these steps to change the health monitoring configuration:

Update the /root/user_values.yaml file with the new configuration.

Execute the following command to ensure that the Deployer VM has the latest configuration information for your deployment. The command takes the same input JSON file that was used during installation. The example includes the expected output.

# /root/bin/fss-upgrade.sh configure input.json
    Timesync service is running on 192.0.2.201  Time difference is -1 seconds
    Timesync service is running on 192.0.2.202  Time difference is -1 seconds
    Timesync service is running on 192.0.2.203  Time difference is -1 seconds
    Timesync service is running on 192.0.2.204  Time difference is -1 seconds
    Timesync service is running on 192.0.2.205  Time difference is -1 seconds
    Timesync service is running on 192.0.2.206  Time difference is -1 seconds
  Maximum time difference between nodes 1 seconds
  Successfully configured. Please run /root/bin/fss-upgrade.sh discover

Execute the following command to apply the configuration to the running application. When prompted, confirm that the new values must be configured (this is treated as an "upgrade"). The example includes the expected output.

# /root/bin/fss-upgrade.sh upgrade
NAME         STATUS   ROLES                  AGE   VERSION
fss-node01   Ready    control-plane,master   12d   v1.23.1
fss-node02   Ready    control-plane,master   12d   v1.23.1
fss-node03   Ready    control-plane,master   12d   v1.23.1
fss-node04   Ready    <none>                 12d   v1.23.1
fss-node05   Ready    <none>                 12d   v1.23.1
fss-node06   Ready    <none>                 12d   v1.23.1
FSS will be upgraded from fss-FSS_23_4_B1-charts-v23.4.1-18 to fss-FSS_23_4_B1-charts v23.4.1-18 : Are you sure [YyNn]? Y
Upgrade in progress...
Upgrading fss-logs
fss-logs release discovered: fluent-bit-0.20.9 ; Deployer packages fss-logs release:  fluent-bit 0.20.9
fss-logs upgrade not required
Upgrading traefik and ingress routes
traefik release discovered: traefik-21.0.0 ; Deployer packages traefik release:  traefik 21.0.0
traefik upgrade not required
Upgrading kafka and kafkaop if required
kafkaop release discovered: strimzi-kafka-operator-0.31.0 ; Deployer packages kafkaop release:  strimzi-kafka-operator 0.31.0
kafka release discovered: fss-strimzi-kafka-0.1.8 ; Deployer packages kafka release:  fss-strimzi-kafka 0.1.8
kafka and kafkaop upgrade not required
Release "prod" has been upgraded. Happy Helming!
NAME: prod
LAST DEPLOYED: Wed May 24 22:04:36 2023
NAMESPACE: default
STATUS: deployed
REVISION: 4
NOTES:
   Checking for FSS pods
   All FSS pods are running
   Checking for FSS digitalsandbox pods
   FSS digital sandbox pods are running
   Checking for digitalsandbox pods
   Digital sandbox pods are running

   FSS is ready, you can access FSS using https://fss.domain.tld