Health monitoring
The Fabric Services System includes a health monitoring service that allows you to monitor the state of the system and its component software services. You can use this service to ensure that the system is functioning correctly, and to alert you to components that are encountering issues.
The Fabric Services System stores its health metrics in Prometheus, a common approach that uses a well-known format within the Kubernetes ecosystem. This standard format is intended to make it easier to find information, and to monitor and interpret the telemetry provided. The following metrics are available:
- Fabric Services System API metrics
  - fss_request_duration_seconds
  - fss_request_duration_seconds_count
  - fss_request_size_bytes_sum
  - fss_request_size_bytes_count
  - fss_requests_total
  - fss_response_size_bytes_sum
  - fss_response_size_bytes_count
- Fabric Services System Apps, Pods, and other Golang Apps metrics
  - go_gc_duration_seconds
  - go_gc_duration_seconds_count
  - go_gc_duration_seconds_sum
  - go_goroutines
  - go_info
  - go_memstats_alloc_bytes
  - go_memstats_alloc_bytes_total
  - go_memstats_buck_hash_sys_bytes
  - go_memstats_frees_total
  - go_memstats_gc_cpu_fraction
  - go_memstats_gc_sys_bytes
  - go_memstats_heap_alloc_bytes
  - go_memstats_heap_idle_bytes
  - go_memstats_heap_inuse_bytes
  - go_memstats_heap_objects
  - go_memstats_heap_released_bytes
  - go_memstats_heap_sys_bytes
  - go_memstats_last_gc_time_seconds
  - go_memstats_lookups_total
  - go_memstats_mallocs_total
  - go_memstats_mcache_inuse_bytes
  - go_memstats_mcache_sys_bytes
  - go_memstats_mspan_inuse_bytes
  - go_memstats_mspan_sys_bytes
  - go_memstats_next_gc_bytes
  - go_memstats_other_sys_bytes
  - go_memstats_stack_inuse_bytes
  - go_memstats_stack_sys_bytes
  - go_memstats_sys_bytes
  - go_threads
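Several of the listed metrics come in _sum and _count pairs, which is the usual Prometheus convention for deriving an average. The following Python sketch is illustrative only: it assumes you have already retrieved two samples of such a pair (for example the request size sum and count) at two points in time, and the function name and sample values are hypothetical.

```python
from typing import Optional


def mean_over_window(sum_then: float, count_then: float,
                     sum_now: float, count_now: float) -> Optional[float]:
    """Average value per observation between two samples of a _sum/_count pair.

    Returns None when no new observations were recorded or when a counter
    appears to have been reset (a value went backwards).
    """
    delta_sum = sum_now - sum_then
    delta_count = count_now - count_then
    if delta_count <= 0 or delta_sum < 0:
        return None
    return delta_sum / delta_count


# Hypothetical samples: 10 more requests carrying 3000 more bytes in total.
average_bytes = mean_over_window(1000.0, 10, 4000.0, 20)
```

Working with the difference between two samples, rather than the raw totals, gives the average for the chosen window instead of the average since the process started.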
Deployment and configuration
By default, the health monitoring and telemetry feature is disabled and the Prometheus server and the node exporters are not deployed.
A configuration file is available during deployment of the Fabric Services System which allows you to:
- Enable health monitoring and telemetry.
- Enable or disable worker node monitoring. Node monitoring is disabled by default; when it is enabled, Prometheus is configured to collect node metrics and telemetry.
- Configure a scrape interval, which determines how often the telemetry is gathered from all the endpoints and stored. The default value is one minute.
- Configure retention time. The value is expressed as a number followed by h for hours, for example "3h". The default retention time is one hour.
- Configure user authentication information. This allows you to manage the users and passwords of the Prometheus service that have Read access. Note: Passwords must be provided as a bcrypt hash. Use any of the standard methods that exist to generate bcrypt hashes from text.
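Because passwords must be supplied as bcrypt hashes, it can help to sanity-check a value before placing it in the configuration file. The following Python sketch only checks that a string has the general shape of a bcrypt hash (a version prefix such as $2a$ or $2b$, a two-digit cost, and 53 characters of salt and digest). It does not generate or verify hashes, and the function name is hypothetical.

```python
import re

# Shape of a bcrypt hash: $2a$, $2b$ or $2y$, a two-digit cost, then
# 22 salt characters and 31 digest characters from the bcrypt alphabet.
BCRYPT_SHAPE = re.compile(r"\$2[aby]\$\d{2}\$[./A-Za-z0-9]{53}")


def looks_like_bcrypt(value: str) -> bool:
    """Return True if the whole string has the layout of a bcrypt hash."""
    return BCRYPT_SHAPE.fullmatch(value) is not None


# A plain-text password is rejected by the shape check.
plain_text_ok = looks_like_bcrypt("my-plain-password")
```

A check like this catches the most common mistake, pasting a plain-text password, but a string that passes is not guaranteed to be a hash of the password you intended.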
The configuration is part of the user_values.yaml file that can be found in the /root folder of the Deployer VM. This file is used to configure some advanced features in the platform, and may contain additional parameters that impact other features.
For the health monitoring feature, the following example shows all of the settings described above, with example values:
prometheus:
  enabled: true
  serverFiles:
    fssconfig.yml:
      basic_auth_users:
        fss-user: '$2a$12$d81J/Hadc/rb2eiOQYN0T.wCYvSi29RiQ2Ql3JR9dcmUtt5l/39i.'
        extrauser: '$2b$12$bjTtOtoB5WVhC5KAmtRbTOLE2PB5HKqHfLWacytVGnAqlyWcSU1Ry'
  server:
    global:
      scrape_interval: 30s
    retention: 6h
  prometheus-node-exporter:
    enabled: true
The following restrictions apply:
- Prometheus data is not backed up during a backup of the Fabric Services System.
- The system stores a maximum of 2 GB of data.
- The minimum scrape interval is 30 seconds.
- The maximum retention time is 3 hours.
- The system supports a maximum of five users.
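The limits on scrape interval, retention time, and number of users can be checked before the file is applied. The following Python sketch is one possible way to do that, under stated assumptions: the function names are hypothetical, the values are passed in by hand rather than read from user_values.yaml, and only simple durations such as "30s", "5m", or "3h" are understood.

```python
import re

SECONDS_PER_UNIT = {"s": 1, "m": 60, "h": 3600}


def to_seconds(duration: str) -> int:
    """Convert a simple duration such as '30s' or '3h' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * SECONDS_PER_UNIT[match.group(2)]


def check_limits(scrape_interval: str, retention: str, users: list) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if to_seconds(scrape_interval) < 30:
        problems.append("scrape interval is below the 30 second minimum")
    if to_seconds(retention) > 3 * 3600:
        problems.append("retention time exceeds the 3 hour maximum")
    if len(users) > 5:
        problems.append("more than five users are configured")
    return problems


# Hypothetical values that stay within every limit.
problems = check_limits("1m", "3h", ["fss-user", "extrauser"])
```

Returning a list of problems rather than raising on the first one lets you report every violation in a single pass.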
Enabling health monitoring during installation
After deploying the Deployer VM and the Fabric Services System node VMs, but before running the actual installation process, create the appropriate entries in the /root/user_values.yaml file as described in Deployment and configuration.
The configuration is applied during the regular installation process.
Accessing the Prometheus metrics
Retrieve the metrics gathered by Prometheus only through the standard Prometheus API, using any tool of your choice that supports this API. Direct access to the Prometheus UI is not supported.
To access the health monitoring metrics, direct the tool to the https://fss.domain.tld/prometheus URI, where "fss.domain.tld" is replaced with the FQDN of the Fabric Services System deployment.
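As an illustration of using the Prometheus API from a tool, the following Python sketch builds an instant-query request for one of the listed metrics. It rests on assumptions that are not confirmed here: that the standard Prometheus query endpoint /api/v1/query is reachable under the /prometheus URI, and that the configured users authenticate with HTTP basic authentication. The function name, user name, and password are placeholders.

```python
import base64
import urllib.parse
import urllib.request


def build_query_request(fqdn: str, user: str, password: str,
                        promql: str) -> urllib.request.Request:
    """Build (but do not send) an instant-query request for a PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    # Assumption: the Prometheus HTTP API is served below the /prometheus prefix.
    url = f"https://{fqdn}/prometheus/api/v1/query?{query}"
    # Assumption: the service expects HTTP basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


# Placeholder FQDN and credentials; replace them with those of your deployment.
request = build_query_request("fss.domain.tld", "fss-user",
                              "example-password", "fss_requests_total")
```

To send the request, pass it to urllib.request.urlopen and decode the JSON response body. Note that the password sent here is the plain-text password, not the bcrypt hash stored in the configuration.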
Updating the health monitoring configuration after installation
You can change the configuration of the health monitoring feature after installation. This includes enabling or disabling aspects of the feature, and changing the passwords for users.
Follow these steps to change the health monitoring configuration:
- Update the /root/user_values.yaml file with the new configuration.
- Execute the following command to make sure that the Deployer VM has the latest configuration information for your deployment. The command uses the same input JSON file that was used during installation. The example includes the expected output.
# /root/bin/fss-upgrade.sh configure input.json
Timesync service is running on 192.0.2.201 Time difference is -1 seconds
Timesync service is running on 192.0.2.202 Time difference is -1 seconds
Timesync service is running on 192.0.2.203 Time difference is -1 seconds
Timesync service is running on 192.0.2.204 Time difference is -1 seconds
Timesync service is running on 192.0.2.205 Time difference is -1 seconds
Timesync service is running on 192.0.2.206 Time difference is -1 seconds
Maximum time difference between nodes 1 seconds
Successfully configured. Please run /root/bin/fss-upgrade.sh discover
- Execute the following command to apply the new configuration to the running application. You must confirm that the new values are to be applied, which is treated as an "upgrade". The example includes the expected output.
# /root/bin/fss-upgrade.sh upgrade
NAME         STATUS   ROLES                  AGE   VERSION
fss-node01   Ready    control-plane,master   12d   v1.23.1
fss-node02   Ready    control-plane,master   12d   v1.23.1
fss-node03   Ready    control-plane,master   12d   v1.23.1
fss-node04   Ready    <none>                 12d   v1.23.1
fss-node05   Ready    <none>                 12d   v1.23.1
fss-node06   Ready    <none>                 12d   v1.23.1
FSS will be upgraded from fss-FSS_23_4_B1-charts-v23.4.1-18 to fss-FSS_23_4_B1-charts v23.4.1-18 : Are you sure [YyNn]? Y
Upgrade in progress...
Upgrading fss-logs
fss-logs release discovered: fluent-bit-0.20.9 ; Deployer packages fss-logs release: fluent-bit 0.20.9
fss-logs upgrade not required
Upgrading traefik and ingress routes
traefik release discovered: traefik-21.0.0 ; Deployer packages traefik release: traefik 21.0.0
traefik upgrade not required
Upgrading kafka and kafkaop if required
kafkaop release discovered: strimzi-kafka-operator-0.31.0 ; Deployer packages kafkaop release: strimzi-kafka-operator 0.31.0
kafka release discovered: fss-strimzi-kafka-0.1.8 ; Deployer packages kafka release: fss-strimzi-kafka 0.1.8
kafka and kafkaop upgrade not required
Release "prod" has been upgraded. Happy Helming!
NAME: prod
LAST DEPLOYED: Wed May 24 22:04:36 2023
NAMESPACE: default
STATUS: deployed
REVISION: 4
NOTES:
Checking for FSS pods
All FSS pods are running
Checking for FSS digitalsandbox pods
FSS digital sandbox pods are running
Checking for digitalsandbox pods
Digital sandbox pods are running
FSS is ready, you can access FSS using https://fss.domain.tld