Health monitoring

The Fabric Services System includes a health monitoring service that allows you to monitor the state of the system and its component software services. You can use this service to ensure that the system is functioning correctly, and to alert you to components that are encountering issues.

The Fabric Services System uses Prometheus to store these metrics; this is a common approach, and the Prometheus format is well known within the Kubernetes ecosystem. This standard format is intended to make it easier to find information, and to monitor and interpret the telemetry provided.

The health monitoring system currently retrieves the following metrics:
  • Fabric Services System API metrics
    • fss_request_duration_seconds
    • fss_request_duration_seconds_count
    • fss_request_size_bytes_sum
    • fss_request_size_bytes_count
    • fss_requests_total
    • fss_response_size_bytes_sum
    • fss_response_size_bytes_count
  • Fabric Services System Apps, Pods, and other Golang Apps metrics
    • go_gc_duration_seconds
    • go_gc_duration_seconds_count
    • go_gc_duration_seconds_sum
    • go_goroutines
    • go_info
    • go_memstats_alloc_bytes
    • go_memstats_alloc_bytes_total
    • go_memstats_buck_hash_sys_bytes
    • go_memstats_frees_total
    • go_memstats_gc_cpu_fraction
    • go_memstats_gc_sys_bytes
    • go_memstats_heap_alloc_bytes
    • go_memstats_heap_idle_bytes
    • go_memstats_heap_inuse_bytes
    • go_memstats_heap_objects
    • go_memstats_heap_released_bytes
    • go_memstats_heap_sys_bytes
    • go_memstats_last_gc_time_seconds
    • go_memstats_lookups_total
    • go_memstats_mallocs_total
    • go_memstats_mcache_inuse_bytes
    • go_memstats_mcache_sys_bytes
    • go_memstats_mspan_inuse_bytes
    • go_memstats_mspan_sys_bytes
    • go_memstats_next_gc_bytes
    • go_memstats_other_sys_bytes
    • go_memstats_stack_inuse_bytes
    • go_memstats_stack_sys_bytes
    • go_memstats_sys_bytes
    • go_threads
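
These metrics are exposed in the standard Prometheus text exposition format. The following is a representative (not verbatim) sample of what a scrape of one of these endpoints returns; the exact HELP strings and values vary by service:

```
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 42
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 7.562288e+06
```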

Deployment and configuration

By default, the health monitoring and telemetry feature is disabled and the Prometheus server and the node exporters are not deployed.

A configuration file is available during deployment of the Fabric Services System which allows you to:

  • Enable the health monitoring and telemetry
  • Enable or disable worker node monitoring. Node monitoring is disabled by default; when enabled, Prometheus is configured to collect node metrics and telemetry.
  • Configure a scrape interval, which determines how often the telemetry is gathered from all the endpoints and stored. The default value is one minute.
  • Configure the retention time. This is expressed in time notation with an hours suffix (h), for example "3h". The default retention time is one hour.
  • Configure user authentication information. This allows you to manage the users and passwords of the Prometheus service; these users have read access.
    Note: Passwords must be provided as a bcrypt hash. Use any of the standard methods that exist to generate bcrypt hashes from plain text.
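
A bcrypt hash has a fixed shape: a `$2a$`, `$2b$`, or `$2y$` prefix, a two-digit cost factor, and 53 characters of encoded salt and digest, for 60 characters in total. The following stdlib-only Python sketch (the `looks_like_bcrypt` helper is hypothetical, shown for illustration) can serve as a sanity check before adding a hash to the configuration:

```python
import re

# A bcrypt hash is 60 characters: $2a$/$2b$/$2y$ prefix, a two-digit
# cost factor, then 53 base64-like characters of salt plus digest.
BCRYPT_RE = re.compile(r"^\$2[aby]\$\d{2}\$[./A-Za-z0-9]{53}$")

def looks_like_bcrypt(value: str) -> bool:
    """Return True if value has the shape of a bcrypt hash."""
    return BCRYPT_RE.fullmatch(value) is not None

# The hash from the example configuration passes the check;
# a plain-text password does not.
print(looks_like_bcrypt(
    "$2a$12$d81J/Hadc/rb2eiOQYN0T.wCYvSi29RiQ2Ql3JR9dcmUtt5l/39i."))  # → True
print(looks_like_bcrypt("my-plain-password"))  # → False
```

Hashes themselves can be generated with common tools such as `htpasswd -nbB` from the Apache httpd utilities, or with any bcrypt library.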

The configuration is part of the user_values.yaml file that can be found in the /root folder of the Deployer VM. This file is used to configure some advanced features in the platform, and may contain additional parameters that impact other features.

For the Health Monitoring feature, the following example contains all the possible settings as described above with example values:

prometheus:
  enabled: true
  serverFiles:
    fssconfig.yml:
      basic_auth_users:
        fss-user: '$2a$12$d81J/Hadc/rb2eiOQYN0T.wCYvSi29RiQ2Ql3JR9dcmUtt5l/39i.'
        extrauser: '$2b$12$bjTtOtoB5WVhC5KAmtRbTOLE2PB5HKqHfLWacytVGnAqlyWcSU1Ry'
  server:
    global:
      scrape_interval: 30s
    retention: 3h
  prometheus-node-exporter:
    enabled: true
Note: The following constraints apply to the health monitoring service:
  • Prometheus data is not backed up during a backup of the Fabric Services System.
  • The system stores a maximum of 2 GB of data.
  • The minimum scrape interval is 30 seconds.
  • The maximum retention time is 3 hours.
  • The system supports a maximum of five users.
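
The user-tunable constraints above can be checked mechanically before applying a configuration. The sketch below uses hypothetical helper names and assumes interval and retention values in the simple notation used by the example configuration ("30s", "3h"):

```python
import re

def parse_seconds(value: str) -> int:
    """Parse a simple Prometheus-style duration such as '30s', '1m', or '3h'."""
    match = re.fullmatch(r"(\d+)([smh])", value)
    if not match:
        raise ValueError(f"unsupported duration: {value!r}")
    number, unit = int(match.group(1)), match.group(2)
    return number * {"s": 1, "m": 60, "h": 3600}[unit]

def check_limits(scrape_interval: str, retention: str, users: dict) -> list:
    """Return a list of constraint violations (empty when the config is valid)."""
    problems = []
    if parse_seconds(scrape_interval) < 30:
        problems.append("scrape_interval below the 30-second minimum")
    if parse_seconds(retention) > 3 * 3600:
        problems.append("retention above the 3-hour maximum")
    if len(users) > 5:
        problems.append("more than five basic_auth_users")
    return problems

print(check_limits("30s", "3h", {"fss-user": "<bcrypt-hash>"}))  # → []
print(check_limits("10s", "6h", {}))  # two violations
```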

Enabling health monitoring during installation

After deploying the Deployer VM and deploying the different Fabric Services System node VMs, but before running the actual installation process, create the appropriate entries in the /root/user_values.yaml file as described in Deployment and configuration.

During the regular installation process, the configuration will be applied.

Accessing the Prometheus metrics

All of the metrics gathered by Prometheus can be retrieved through the standard Prometheus API, using any tool of your choice that supports this API.

To access the health monitoring metrics, direct the tool to the https://fss.domain.tld/prometheus URI, where "fss.domain.tld" is replaced with the FQDN of the Fabric Services System deployment.
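
For example, the instant-query endpoint of the standard Prometheus HTTP API (`/api/v1/query`) can be called with basic-auth credentials. A minimal Python sketch, assuming the placeholder FQDN "fss.domain.tld" and hypothetical credentials; the request itself is left commented out, since reachability depends on your deployment:

```python
import base64
import urllib.parse
import urllib.request

def build_query_url(base: str, promql: str) -> str:
    """Build an instant-query URL for the standard Prometheus HTTP API."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": promql})

url = build_query_url("https://fss.domain.tld/prometheus", "go_goroutines")
print(url)  # → https://fss.domain.tld/prometheus/api/v1/query?query=go_goroutines

# Hypothetical credentials; the user must exist in basic_auth_users.
request = urllib.request.Request(url)
token = base64.b64encode(b"fss-user:my-password").decode()
request.add_header("Authorization", f"Basic {token}")
# response = urllib.request.urlopen(request)  # uncomment on a live system
# print(response.read())
```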

Updating the health monitoring configuration after installation

You can change the configuration of the health monitoring feature after installation. This includes enabling or disabling aspects of the feature, and changing the passwords for users.

Follow these steps to change the health monitoring configuration:

  1. Update the /root/user_values.yaml file with the new configuration.
  2. Execute the following command to make sure the Deployer VM has the latest configuration information of your deployment. The command also uses the input JSON file that was used during installation. The example shows the expected output.
    # /root/bin/fss-upgrade.sh configure input.json
        Timesync service is running on 192.0.2.201  Time difference is -1 seconds
        Timesync service is running on 192.0.2.202  Time difference is -1 seconds
        Timesync service is running on 192.0.2.203  Time difference is -1 seconds
        Timesync service is running on 192.0.2.204  Time difference is -1 seconds
        Timesync service is running on 192.0.2.205  Time difference is -1 seconds
        Timesync service is running on 192.0.2.206  Time difference is -1 seconds
      Maximum time difference between nodes 1 seconds
      Successfully configured. Please run /root/bin/fss-upgrade.sh discover
    
  3. Execute the following command to update the configuration in the actual application. You must confirm that the new values are to be configured (the operation is treated as an upgrade). The example shows the expected output.
    # /root/bin/fss-upgrade.sh upgrade
    NAME         STATUS   ROLES                  AGE   VERSION
    fss-node01   Ready    control-plane,master   12d   v1.23.1
    fss-node02   Ready    control-plane,master   12d   v1.23.1
    fss-node03   Ready    control-plane,master   12d   v1.23.1
    fss-node04   Ready    <none>                 12d   v1.23.1
    fss-node05   Ready    <none>                 12d   v1.23.1
    fss-node06   Ready    <none>                 12d   v1.23.1
    FSS will be upgraded from fss-FSS_23_4_B1-charts-v23.4.1-18 to fss-FSS_23_4_B1-charts v23.4.1-18 : Are you sure [YyNn]? Y
    Upgrade in progress...
    Upgrading fss-logs
    fss-logs release discovered: fluent-bit-0.20.9 ; Deployer packages fss-logs release:  fluent-bit 0.20.9
    fss-logs upgrade not required
    Upgrading traefik and ingress routes
    traefik release discovered: traefik-21.0.0 ; Deployer packages traefik release:  traefik 21.0.0
    traefik upgrade not required
    Upgrading kafka and kafkaop if required
    kafkaop release discovered: strimzi-kafka-operator-0.31.0 ; Deployer packages kafkaop release:  strimzi-kafka-operator 0.31.0
    kafka release discovered: fss-strimzi-kafka-0.1.8 ; Deployer packages kafka release:  fss-strimzi-kafka 0.1.8
    kafka and kafkaop upgrade not required
    Release "prod" has been upgraded. Happy Helming!
    NAME: prod
    LAST DEPLOYED: Wed May 24 22:04:36 2023
    NAMESPACE: default
    STATUS: deployed
    REVISION: 4
    NOTES:
       Checking for FSS pods
       All FSS pods are running
       Checking for FSS digitalsandbox pods
       FSS digital sandbox pods are running
       Checking for digitalsandbox pods
       Digital sandbox pods are running
    
       FSS is ready, you can access FSS using https://fss.domain.tld