How do baselines and anomalies work?
A baseline provides the logic for collecting baseline statistics and for detecting anomalies.
When you create baselines, you specify a resource or group of resources to collect a set of statistics over a defined time window. Information collected during that window is used to calculate a data point for the baseline. You also define a season, which is the length of time statistics need to be measured to assess trends. For example, to assess network traffic, you could set up a 15-minute window with a one-week season, which provides values calculated every 15 minutes over a one-week period.
Creating a baseline also creates a baseline subscription to collect the required data.
Note: Baseline subscriptions and telemetry subscriptions are separate. A baseline cannot be generated from data collected by a telemetry subscription.
On-demand NFM-P statistics cannot be used to create baselines.
Components
A baseline consists of the following components. The components appear in the Create and Edit forms.
Baselines are created on a per-resource basis. A resource is an entity that can collect the desired statistics. In the Create Baselines form, you configure the required parameters and choose the resources that collect the statistics.
If the NE is managed using MDM, configuration of a baseline initiates statistics collection. If the NE is managed by NFM-P, statistics collection must be configured on the NFM-P and the resource must already be collecting the desired statistics for a baseline to be created.
Note: Baseline Analytics is different from the NSP Analytics application.
Baseline Analytics provides near-real-time baseline and anomaly detection from telemetry counters, for example, received octets for the /telemetry:base/interfaces/interface telemetry type.
The Analytics application computes a baseline for data configured for reporting, for example, utilization and throughput for a port in a Port LAG Details report, or bandwidth and data for an application group in a Router Level Usage Summary report with Baseline. See the Analytics Report Catalog and the NSP User Guide for more information about Analytics.
Baseline Analytics data storage
Baseline data is stored in Postgres unless an auxiliary database is enabled, in which case all collected data is stored in the auxiliary database.
The following data is stored:
- statistics data collected during the configured window; see General parameters
By default, data is stored in Postgres for 35 days and in the auxiliary database for 90 days. These values can be changed using the RESTCONF API or by updating the age-out policy; see How do I edit an age-out policy?.
General
The general parameters include the following:
- Description: an optional description, which can be used for filtering on the Baselines view.
- Collection Interval: for NEs managed using MDM, the interval at which to collect the statistics, for example, every 30 seconds. For NEs managed by the NFM-P, this value is ignored and statistics are collected according to the settings configured in the NFM-P.
- Season: the length of time statistics must be collected for a pattern to be seen. For example, network traffic can be expected to repeat on a weekly basis.
- Window Duration: the size of the data bucket for telemetry calculation. For example, a counter calculates the change between the first and last values taken during the window. The calculation used depends on the counter type parameter in the Filters & Counters panel.
- Admin State and Training Status: these parameters are enabled by default when a baseline is created and can be changed in the Edit form.
  - If the Admin State of a baseline is Enabled, NSP is monitoring the statistics.
  - If the Training Status is Active, NSP is incorporating new information into the baseline’s model. If the Training Status is Paused, future anomalies are detected against the expected values that have already been calculated.
    If you are monitoring error counters, such as packet loss, you can pause learning after a season with no errors, which sets the expected number of errors to zero, while continuing to monitor.
For example, if you create a baseline and set the Collection Interval to 30 seconds, the Season to 1 week, and the Window Duration to 15 minutes, the baseline subscription collects the statistics values every 30 seconds, calculates a baseline data point every 15 minutes, and assesses trends based on one week of data.
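To make the interaction of these three parameters concrete, here is a minimal sketch of the arithmetic, using the values from the example above; the variable names are illustrative and are not NSP identifiers:

```python
# Window/season arithmetic for the example baseline above.
# All names are illustrative; none are NSP API identifiers.

collection_interval_s = 30           # one raw sample every 30 seconds (MDM-managed NE)
window_duration_s = 15 * 60          # one baseline data point per 15-minute window
season_s = 7 * 24 * 60 * 60          # trends are assessed over a one-week season

samples_per_window = window_duration_s // collection_interval_s   # 30 samples
windows_per_season = season_s // window_duration_s                # 672 data points

print(samples_per_window, windows_per_season)  # 30 672
```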
Filters & Counters
The Filters & Counters parameters declare the telemetry values to be collected, the counter types, and the resources of interest.
When a telemetry type is selected, the COUNTERS button becomes available.
You can configure one of the following counter types (a code sketch illustrating all three follows the list):
- Counter: a counter takes input counter values and calculates the change in value over the window. For example, if the counter represents the number of transmitted octets and windows are 15 minutes, a counter baseline value is the number of octets transmitted over 15 minutes.
- Gauge: a gauge takes input values and calculates the mean value over the window. For example, if the input value is octets per second over a 30-second period and windows are 15 minutes, a gauge baseline value is the mean octets per second over 15 minutes.
  An example of a gauge is CPU usage: it is a bounded value between 1 and 100%.
- Sampled: a sampled baseline takes sampled values and calculates the sample mean over the window. Sampled values represent the value at the exact time the sample was taken, not an accumulation since the last sample. For example, if CPU % is sampled every 2 minutes and windows are 15 minutes, a sampled baseline value is the mean of the samples collected during the 15 minutes.
  An example of a sampled value is latency.
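As referenced above, here is a minimal sketch of how the three counter types treat the raw values collected during one window; the function names are illustrative, not NSP APIs, and `samples` is assumed to be a time-ordered list of raw values from a single window:

```python
# Illustrative per-window calculations for the three counter types.

def counter_value(samples):
    # Counter: change between the first and last values taken during the window,
    # e.g. cumulative transmitted-octets readings.
    return samples[-1] - samples[0]

def gauge_value(samples):
    # Gauge: mean of the input values over the window, where each input is
    # already a rate or level (e.g. octets per second).
    return sum(samples) / len(samples)

def sampled_value(samples):
    # Sampled: mean of point-in-time readings (e.g. latency, CPU %);
    # each sample is the value at the instant it was taken.
    return sum(samples) / len(samples)

# Cumulative octet-counter readings taken during a 15-minute window:
octets = [1_000, 5_000, 12_000, 20_000]       # truncated for brevity
print(counter_value(octets))                  # 19000 octets in this window
```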
Configure an object filter as needed to filter the available resources; see How do object filters work?.
When at least one counter is added and a counter type is specified, the VERIFY RESOURCES button becomes available.
Detectors
A detector defines the rules for anomaly detection. A detector rule provides an acceptable range of expected values; if a measured value falls outside the range, it is marked as anomalous.
Anomaly detection is optional.
A detector rule is composed of the following:
- algorithm: the formula used to compare the expected and measured values
- evaluate what: value, rate, or bandwidth; the measured values may be converted to a rate or bandwidth to perform the evaluation
The comparison and threshold parameters define the range of acceptable values. For example, a rule could state that a value with an absolute Z-score greater than 2 is an anomaly.
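As a concrete illustration of that example rule, the following sketch flags a window value whose absolute Z-score exceeds 2; the expected value and standard deviation would come from the baseline model, and all names here are illustrative:

```python
def is_anomalous(measured, expected_mean, stddev, threshold=2.0):
    # Absolute Z-score comparison: anomalous if |z| exceeds the threshold.
    z = (measured - expected_mean) / stddev
    return abs(z) > threshold

# Baseline expects about 10 000 octets per window with a stddev of 1 500:
print(is_anomalous(14_500, 10_000, 1_500))   # True:  |z| = 3.0
print(is_anomalous(11_000, 10_000, 1_500))   # False: |z| ~= 0.67
```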
Algorithms
You can define a rule based on an algorithm.
The following algorithms are suitable for most purposes:
- The Z-score (number of standard deviations) of the measured value against the expected values. In this case, the expected value is the mean.
  Formula: (measured - expected) / stddev
- The absolute value of the Z-score of the measured value against the expected values. In this case, the expected value is the mean.
  Formula: |(measured - expected) / stddev|
The Z-score algorithms are useful because they incorporate the standard deviation: in addition to recording how far the current value is from the mean, the algorithm also factors in the variability of the values. This can be very important when deciding if a value is anomalous. If your values are highly variable, that is, the standard deviation is high, it is important to choose a Z-score algorithm.
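The effect is easy to see with made-up numbers: two resources whose measured values are the same distance from the mean can have very different Z-scores, so only the resource with stable values is flagged:

```python
deviation = 500           # measured - expected, identical for both resources

stable_stddev = 100       # resource whose values vary very little
noisy_stddev = 1_000      # resource whose values are highly variable

print(deviation / stable_stddev)   # 5.0  -> far outside normal variation
print(deviation / noisy_stddev)    # 0.5  -> well within normal variation
```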
You can also use one of the following (a code sketch covering all of the formulas in this section follows the list):
- The relative difference using the absolute value of the arithmetic mean of the measured and expected values.
  Formula: |measured - expected| / (|measured + expected| * 0.5)
  This algorithm could be suitable if the standard deviation is very small, that is, if there is very little variation in the values.
- The relative change (including the positive or negative sign) between the measured and expected values.
  Formula: (measured - expected) / |expected|
- The absolute relative change (with no sign) between the measured and expected values.
  Formula: |measured - expected| / |expected|
- The change between the measured and expected values over the absolute value of their arithmetic mean.
  Formula: (measured - expected) / (|measured + expected| * 0.5)
- A score that becomes more sensitive as the measured or expected value approaches +/-100. This detector algorithm works well with percentages, although it may have use with other types of values.
  Formula: (sign(measured - expected) * (|measured - expected| + max(measured, expected))) / 200
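As referenced above, the formulas translate directly into code; this is a minimal sketch in which the function names are descriptive labels chosen for this example, not the identifiers NSP uses:

```python
def _sign(x):
    # -1, 0, or 1, matching the sign() used in the last formula.
    return (x > 0) - (x < 0)

def z_score(measured, expected, stddev):
    return (measured - expected) / stddev

def abs_z_score(measured, expected, stddev):
    return abs((measured - expected) / stddev)

def relative_difference(measured, expected):
    return abs(measured - expected) / (abs(measured + expected) * 0.5)

def relative_change(measured, expected):
    return (measured - expected) / abs(expected)

def abs_relative_change(measured, expected):
    return abs(measured - expected) / abs(expected)

def mean_relative_change(measured, expected):
    return (measured - expected) / (abs(measured + expected) * 0.5)

def percentage_score(measured, expected):
    # Grows more sensitive as measured or expected approaches +/-100.
    return (_sign(measured - expected)
            * (abs(measured - expected) + max(measured, expected)) / 200)
```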