Scaling guidelines for service assurance tests
Scheduled tests (STM)
NFM-P provides the ability to generate, manage and schedule STM tests within the network. This section provides guidelines that can be used to determine the extent to which STM tests can be scheduled and launched within a network.
Several factors influence NFM-P's ability to concurrently manage and schedule a large number of tests. NFM-P keeps track of how many tests are running concurrently so that it can limit the initiation of tests and the processing of results, and keep them from interfering with the system's other functions.
To understand the STM guidelines, the following terminology is required:
Elemental Test: An OAM test, such as an LSP ping, that is sent to a router
Elemental Test Result: An OAM test result received from a network element
Accounting file Test: An OAM test that is initiated in the default manner, but whose test results are retrieved from the network element via FTP on a periodic basis.
Test Policy: A definition or configuration that tells NFM-P the specifics about how to generate a test. A test policy can contain multiple test definitions. The policies are used by test suites.
Test Suite: A collection of elemental tests that can be assigned to a specific schedule. Tests can be placed in three defined sections within a test suite: First Run, Generated, and Last Run. The tests are executed in order by these sections. The execution order of tests within the First Run and Last Run sections can be configured to be parallel or sequential. The tests in the Generated section are run by the system as concurrently as possible. If the Generated section contains tests from several different test definitions, all the tests belonging to one definition are executed before the tests of the next definition begin. Within a definition, the system attempts to execute the tests as concurrently as possible. This is important to note, because a test suite containing a large number of tests in the Generated section (or in First Run/Last Run sections set to parallel) may tax the system. Part of the increased stress that concurrent tests place on the system results from the greater amount of resources needed to initiate, wait for, and process many tests concurrently. In addition, tests that cause a large amount of data to be returned from the routers place increased demands on NFM-P.
Schedule: A start time that can have a test suite or test suites assigned to it to produce scheduled tasks. When the schedule's start time is reached, the suite or suites assigned to it will commence. The schedule may be set to continuously repeat after a configurable period of time.
Scheduled Task: An instance of a test suite assigned to a schedule
Non-NE Schedulable STM Tests: NFM-P provides the ability to execute and process results for non-NE schedulable tests. Non-NE schedulable tests are elemental tests that are not persistently defined on network elements; rather, these tests are defined/configured from NFM-P per test execution. Elemental test results from non-NE schedulable tests are always regular (SNMP mediated) and share the same scale limits/considerations as regular scheduled STM tests.
Table 5-21: Maximum number of STM elemental test results
| NFM-P platform | Maximum regular STM elemental test results (SNMP mediated, schedulable/non-NE schedulable) in a 15-minute period | Maximum accounting file STM elemental test results in a 15-minute period with results stored in the NFM-P database, or in the NFM-P database and using logToFile | Maximum accounting file STM elemental test results in a 15-minute period using logToFile only |
|---|---|---|---|
| Distributed NFM-P configuration with minimum 8 CPU core NFM-P server | 15 000 | 1 500 000 (note 1) | 1 500 000 (note 1) |
| Distributed NFM-P configuration. NOTE: It may be possible to achieve higher numbers depending on the NFM-P server activity and hardware platform | 6 000 | 22 500 | 60 000 |
| Minimum supported collocated NFM-P configuration. NOTE: It may be possible to achieve higher numbers depending on the NFM-P server activity and hardware platform | 3 000 | 1 500 | 15 000 |
Notes:
Guidelines for maximizing STM test execution
By default, NFM-P only allows test suites with a combined weight of 80 000 to execute concurrently. The test suite weights are shown in the NFM-P GUI's Test Suites List window. If too many tests start at the same time, the system exceeds this limit and the affected tests are skipped. Ensuring the successful execution of as many STM tests as possible requires planning the schedules, the contents, and the configuration of the test suites. The following guidelines will assist in maximizing the number of tests that can be executed on your system:
- When configuring tests or test policies, do not configure more packets (probes) than necessary, as they increase the weight of the test suite.
- Test suites with a smaller weight will typically complete more quickly, and allow other test suites to execute concurrently. The weight of the test suite is determined by the number of tests in the test suite, and the number of probes that are executed by each test. See Table 5-22, OAM test weight for test weight per test type.
- Assign the time-out of the test suite in such a way that if one of the test results has not been received, it can be considered missed or failed without stopping other test suites from executing.
- Rather than scheduling a test suite to execute all tests on one network element, execute tests on multiple network elements to allow the network elements to handle the tests concurrently. This allows the test suite results to be received from the network elements and processed by NFM-P more quickly, which frees up available system weight sooner.
- Rather than scheduling a test suite to run sequentially, consider duplicating the test suite and running the two test suites on alternating schedules. This allows each test suite time to complete or time out before the same test suite is executed again. Remember that this may cause double the system weight to be consumed until the alternate test suite has completed.
- Create test suites that contain fewer than 200 elemental tests. This way you can initiate the tests at different times by assigning the test suites to different schedules, which gives greater control over how many tests are initiated or in progress at any given time.
- Prioritize which tests you wish to perform, and manually execute the test suite to determine how long it takes in your network. Use that duration, with some added buffer time, to help determine how much time to leave between schedules or repetitions of a schedule and how to configure the test suite time-out.
- Configure the test suite time-out to take effect before the same test suite is scheduled to run again. If the previous run has neither completed nor timed out by then, the test suite will not execute at that scheduled time.
- NFM-P database backups can impact the performance of STM tests.
Table 5-22: OAM test weight
| Test type | Weight |
|---|---|
| Regular Elemental STM Test | 10 per Test Packet |
| Accounting File Elemental STM Test | 1 |
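The weights in Table 5-22 and the default concurrency limit of 80 000 can be combined into a quick planning check. The following is a minimal sketch, not NFM-P code: the `ElementalTest` class and the function names are hypothetical, and treating the limit as inclusive (`<=`) is an assumption, since this section does not say whether a combined weight of exactly 80 000 is accepted.

```python
from dataclasses import dataclass

DEFAULT_CONCURRENT_WEIGHT_LIMIT = 80_000  # default combined weight quoted in this section
REGULAR_WEIGHT_PER_PACKET = 10            # Table 5-22: regular elemental STM test
ACCOUNTING_FILE_WEIGHT = 1                # Table 5-22: accounting file elemental STM test


@dataclass
class ElementalTest:
    """Hypothetical planning model of one elemental test in a test suite."""
    packets: int = 1               # number of test packets (probes)
    accounting_file: bool = False  # True for an accounting file STM test


def elemental_weight(test: ElementalTest) -> int:
    """Weight of one elemental test according to Table 5-22."""
    if test.accounting_file:
        return ACCOUNTING_FILE_WEIGHT
    return REGULAR_WEIGHT_PER_PACKET * test.packets


def suite_weight(tests) -> int:
    """Weight of a test suite: the sum of its elemental test weights."""
    return sum(elemental_weight(t) for t in tests)


def fits_concurrently(suite_weights, limit=DEFAULT_CONCURRENT_WEIGHT_LIMIT) -> bool:
    """Assumption: suites may run together while their combined weight <= limit."""
    return sum(suite_weights) <= limit


# 100 regular tests with 5 packets each, and 100 accounting file tests.
regular_suite = [ElementalTest(packets=5) for _ in range(100)]
accounting_suite = [ElementalTest(accounting_file=True) for _ in range(100)]

print(suite_weight(regular_suite))      # 100 * 5 * 10 = 5000
print(suite_weight(accounting_suite))   # 100 * 1 = 100
print(fits_concurrently([suite_weight(regular_suite)] * 16))  # 80 000 in total
print(fits_concurrently([suite_weight(regular_suite)] * 17))  # 85 000 in total
```

Under these assumptions, reducing the packet count per test is the most direct way to lower the weight of a regular test suite, which matches the first guideline in the list above.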
Accounting file STM test configuration
Accounting file collection of STM test results requires 7750 SR and 7450 ESS network elements running version 7.0 R4 or later. To take advantage of accounting file STM test execution, the test policy must be configured to be NE schedulable with “Accounting file” selected. This produces STM tests that are executed on the network element, while the test results are collected by the NFM-P server by way of an accounting file, in a similar way to accounting statistics. Accounting file STM test results are collected by the NFM-P server only.
NFM-P supports the use of logToFile for file accounting STM results. When logToFile is the only method used for results, the number of tests that can be executed per 15-minute interval is increased. See Table 5-21, Maximum number of STM elemental test results for specific scaling limits. The logToFile method for file accounting STM results supports a maximum of two JMS clients.
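As a rough illustration of how the limits in Table 5-21 might be used during planning, the sketch below stores the table values in a dictionary and reports the remaining headroom for a planned number of results per 15-minute period. The dictionary keys and the `headroom` function are hypothetical names chosen for this example; only the numeric limits come from the table.

```python
# Limits from Table 5-21, per 15-minute period. Key names are illustrative only.
STM_LIMITS_PER_15_MIN = {
    "distributed_8_core": {
        "regular": 15_000,
        "accounting_db": 1_500_000,
        "accounting_logtofile_only": 1_500_000,
    },
    "distributed": {
        "regular": 6_000,
        "accounting_db": 22_500,
        "accounting_logtofile_only": 60_000,
    },
    "collocated_minimum": {
        "regular": 3_000,
        "accounting_db": 1_500,
        "accounting_logtofile_only": 15_000,
    },
}


def headroom(platform: str, mode: str, planned_results: int) -> int:
    """Remaining results per 15 minutes; a negative value means the plan exceeds the limit."""
    return STM_LIMITS_PER_15_MIN[platform][mode] - planned_results


# 40 000 accounting file results per 15 minutes on a distributed configuration:
print(headroom("distributed", "accounting_logtofile_only", 40_000))  # within the limit
print(headroom("distributed", "accounting_db", 40_000))              # over the limit
```

With these inputs the same load fits when logToFile is the only result method but exceeds the limit when results are also stored in the NFM-P database.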
Examples of STM test configuration
The following examples describe the configuration of STM tests on different network configurations.
Example 1:
Assume there is a network with 400 LSPs and that the objective is to perform LSP pings on each LSP as frequently as possible. The following steps are to be followed:
- Create 4 test suites, each containing 100 elemental LSP ping tests.
- Execute each test suite one at a time and record how long each one takes to complete. Assume that the longest time for executing one of the test suites is 5 minutes.
- Create a schedule that is ongoing and has a frequency of 10 minutes. This is double the time taken by the longest test suite and ensures that each test suite completes before it is executed again. Assign this schedule to the 4 test suites.
- Monitor the test suite results to ensure that they are completing. If the tests are not completing (for example, they are marked as “skipped”), increase the frequency time value of the schedule.

In the above case, there are 400 elemental tests configured to be executed every 10 minutes.
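The arithmetic behind Example 1 can be sketched as follows. This is only an illustration: the measured durations are made-up values whose maximum is the 5 minutes assumed in the example, the doubling factor follows the "doubles the time taken for the longest test suite" rule, and the function names are hypothetical. The results estimate also assumes one result per elemental test per run.

```python
def repeat_interval_minutes(measured_durations, safety_factor=2):
    """Schedule frequency: the longest measured run time multiplied by a safety factor."""
    return safety_factor * max(measured_durations)


def results_per_period(suites, tests_per_suite, interval_min, period_min=15):
    """Average elemental test results per period, assuming one result per test per run."""
    return suites * tests_per_suite * period_min / interval_min


# Hypothetical run times, in minutes, recorded for the 4 test suites.
durations = [4, 5, 3, 5]

interval = repeat_interval_minutes(durations)
print(interval)                               # 2 * 5 = 10 minutes
print(results_per_period(4, 100, interval))   # 400 tests * 15 / 10 = 600.0
```

The second figure can be compared with the regular STM limits per 15-minute period in Table 5-21 for the platform in use.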
Example 2:
Assume there are eight test suites (T1, T2, T3, T4, T5, T6, T7 and T8), each containing 50 elemental tests. Assume the test suites individually take 5 minutes to run. Also, assume the objective is to schedule them so that the guideline of having fewer than 200 concurrently running elemental tests is respected.
The recommended approach for scheduling these test suites is as follows:
- Test suites T1, T2, T3, T4 can be scheduled on the hour and repeated every 10 minutes
- Test suites T5, T6, T7, T8 can be scheduled on the hour + 5 minutes and repeated every 10 minutes
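The effect of staggering the two groups can be checked with a small simulation. The sketch below is a simplified model, not NFM-P behavior: it assumes every run takes exactly 5 minutes, works at one-minute granularity, and uses a hypothetical tuple layout for each test suite.

```python
def peak_concurrent_tests(suites, horizon_min=60):
    """Peak number of elemental tests running at once.

    suites: list of (first_start_min, repeat_min, duration_min, test_count).
    """
    peak = 0
    for minute in range(horizon_min):
        running = 0
        for first_start, repeat, duration, tests in suites:
            started = minute >= first_start
            # A suite is running during the first `duration` minutes of each repeat cycle.
            if started and (minute - first_start) % repeat < duration:
                running += tests
        peak = max(peak, running)
    return peak


# T1..T4 start on the hour, T5..T8 start 5 minutes later; all repeat every 10 minutes.
staggered = [(0, 10, 5, 50)] * 4 + [(5, 10, 5, 50)] * 4
# For comparison: all eight suites started together on the hour.
together = [(0, 10, 5, 50)] * 8

print(peak_concurrent_tests(staggered))  # 4 suites * 50 tests at any one time
print(peak_concurrent_tests(together))   # 8 suites * 50 tests at the same time
```

In this model the staggered schedule never has more than four test suites (200 elemental tests) in progress, whereas starting all eight together doubles the peak.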
Factors impacting the number of elemental tests that can be executed in a given time frame
The following factors can impact the number of elemental tests that can be executed during a given time frame:
- The type of tests being executed. Each type of elemental test takes a different amount of time to complete (for example, a simple LSP ping of an LSP that spans only two routers may take less than 2 seconds; an MTU ping could take many minutes).
- The amount of data that is generated/updated by the test within the network elements. NFM-P has to obtain this information and store it in the NFM-P database. The quantity of data depends on the type of tests being performed and the configuration of the objects on which the tests are performed.
- The number of test suites scheduled at or around the same time.
- The number of routers over which the tests are being executed. In general, a large number of tests on a single router can be expected to take longer than the same number of tests distributed over many routers.
- An NFM-P database backup may temporarily reduce the system's ability to write test results into the database.
- The station used to perform the tests dictates how many physical resources NFM-P can dedicate to executing elemental tests. On the minimum supported station (collocated NFM-P server and NFM-P database on a single server), the number of concurrent tests must be limited to 3 000.
Possible consequences of exceeding the capacity of the system to perform tests
NFM-P will exhibit the following symptoms if the number of scheduled tests exceeds the system’s capacity:
- Skipped tests - If a test suite is still in progress at the time that its schedule triggers again, that scheduled task is marked as skipped and the test suite is not attempted again until the next scheduled time.
- Failed tests (time-out) - Tests may time out and be marked as failed. If a test takes more than 15 minutes, it may be purged from an internal current test list. For example, a test may be successfully sent to a router but the system does not receive any results for 15 minutes. The system marks the test as failed and purges its expectation of receiving a result. However, the system could still receive the results from the router later and update its result for the test to success.
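The skipped-test symptom described above can be illustrated with a toy model of a repeating schedule. This is a sketch under simplifying assumptions (a fixed run duration and a single test suite per schedule); the function name and return values are hypothetical and do not represent NFM-P internals.

```python
def simulate_schedule(interval_min, run_duration_min, triggers):
    """Return 'run' or 'skipped' for each schedule trigger of one test suite."""
    outcomes = []
    busy_until = 0  # minute at which the run in progress finishes
    for n in range(triggers):
        now = n * interval_min
        if now < busy_until:
            # The previous run is still in progress, so this scheduled task is skipped.
            outcomes.append("skipped")
        else:
            outcomes.append("run")
            busy_until = now + run_duration_min
    return outcomes


print(simulate_schedule(interval_min=10, run_duration_min=5, triggers=4))
print(simulate_schedule(interval_min=10, run_duration_min=12, triggers=4))
```

In the second call the suite needs 12 minutes but is triggered every 10, so every other trigger is skipped and the suite effectively runs every 20 minutes.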
Disk space requirements for STM test results
STM test results are stored in the tablespace DB partition. The STM database partitions start with a total size of 300 MB of disk space. When the number of test results to retain is configured at its maximum of 20 000 000, the disk space requirement for the STM tests may increase by up to 80 GB, so a larger tablespace partition should be considered.
The maximum number of test results stored in the database reflects the sum of the aggregate results, test results, and probe results.
Running 10 tests with 1 probe each versus 1 test with 10 probes consumes the same amount of disk space.
When using logToFile for accounting file STM test results, the maximum time-to-live on the disk is 24 hours. At the maximum collection rate of 1 500 000 test results per 15 minutes, the storage requirement on the NFM-P server in the xml_output directory is 600 GB per JMS client. The storage requirement is doubled if using the maximum number of JMS clients for file accounting STM results. The disk storage requirements can be decreased by using the compress option for logToFile, but this results in increased CPU utilization on the NFM-P server.
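The logToFile figures above can be related to each other with some simple arithmetic. The sketch below only recombines numbers quoted in this section; the per-result size is an implied average derived from them (using decimal gigabytes), not a documented value.

```python
RESULTS_PER_INTERVAL = 1_500_000  # maximum collection rate per 15 minutes
INTERVAL_MIN = 15
TTL_HOURS = 24                    # maximum time-to-live on disk
GB_PER_JMS_CLIENT = 600           # xml_output storage per JMS client
MAX_JMS_CLIENTS = 2               # maximum JMS clients for file accounting STM results

# Number of 15-minute intervals kept on disk at any time.
intervals_on_disk = TTL_HOURS * 60 // INTERVAL_MIN

# Test results on disk at the maximum rate, per JMS client.
results_on_disk = RESULTS_PER_INTERVAL * intervals_on_disk

# Storage with the maximum number of JMS clients.
total_gb = GB_PER_JMS_CLIENT * MAX_JMS_CLIENTS

# Implied average size of one result (assumption: 1 GB = 10**9 bytes).
bytes_per_result = GB_PER_JMS_CLIENT * 10**9 / results_on_disk

print(intervals_on_disk)        # 96
print(results_on_disk)          # 144000000
print(total_gb)                 # 1200
print(round(bytes_per_result))  # 4167
```

On these assumptions, each stored result averages roughly 4 kB uncompressed, which gives a sense of how much the compress option could save at lower collection rates.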
Scaling guidelines for OAM PM test results
See the NFM-P Classic Management User Guide for details on OAM PM test configuration and result retrieval.
The quantity of resources allocated to the retrieval and processing of OAM PM test results within the NFM-P server is set at installation time and depends on the number of CPUs available to the NFM-P server software. The number of CPUs available to the NFM-P server depends on the number of CPUs on the station and on whether the NFM-P database software is collocated with the NFM-P server software on the same station.
The following tables provide the maximum number of OAM PM test results that can be retrieved and processed by the NFM-P server or NFM-P statistics auxiliary in various configurations.
Table 5-23: Maximum number of OAM PM test results processed by an NFM-P server
| Number of CPU cores on the NFM-P server | Maximum number of OAM PM test results per 15-minute interval: collocated configuration | Maximum number of OAM PM test results per 15-minute interval: distributed configuration |
|---|---|---|
| 6 | 100 000 | 200 000 |
| 8 or greater | 200 000 | 400 000 |
Table 5-24: Maximum number of OAM PM test results processed by an NFM-P statistics auxiliary
Number of active NFM-P statistics auxiliaries |
Maximum number of OAM PM test results per 15-minute interval | ||||||
---|---|---|---|---|---|---|---|
OAM PM test result collection with NFM-P database |
OAM PM test result collection with single auxiliary database |
OAM PM test result collection with three+ auxiliary database cluster |
logToFile only | ||||
8 CPU cores, 32 GB RAM |
12 CPU cores, 32 GB RAM |
8 CPU cores, 32 GB RAM |
12 CPU cores, 32 GB RAM |
12 CPU cores, 32 GB RAM | |||
1 |
10 000 000 |
10 000 000 |
5 000 000 |
20 000 000 |
20 000 000 | ||
2 |
10 000 000 |
10 000 000 |
5 000 000 |
40 000 000 |
40 000 000 | ||
3 |
10 000 000 |
10 000 000 |
5 000 000 |
60 000 000 |
60 000 000 |
The table below shows the retention that is achievable depending upon the total number of test results to retain and the database used to retain the records:
Table 5-25: Maximum OAM PM test result retention
| Database to retain records | Total number of OAM PM test results to be stored in the database | Maximum number of retention intervals |
|---|---|---|
| NFM-P database | <40M | 672 |
| NFM-P database | >40M | 96 |
| NFM-P auxiliary database | N/A | 35 040 |
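To get a feel for the retention figures in Table 5-25, they can be converted to elapsed time. The sketch below assumes that one retention interval equals one 15-minute collection interval; this section does not state that explicitly, so treat the interval length and the `retention_span` helper as assumptions made for illustration.

```python
from datetime import timedelta

# Assumption: a retention interval is one 15-minute collection interval.
RETENTION_INTERVAL = timedelta(minutes=15)


def retention_span(intervals: int) -> timedelta:
    """Elapsed time covered by a given number of retention intervals."""
    return intervals * RETENTION_INTERVAL


for label, intervals in [
    ("NFM-P database, <40M results", 672),
    ("NFM-P database, >40M results", 96),
    ("NFM-P auxiliary database", 35_040),
]:
    print(f"{label}: {retention_span(intervals).days} day(s)")
```

Under that assumption the three rows correspond to 7 days, 1 day, and 365 days of retained results respectively.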