Scaling guidelines for service assurance tests
Scheduled tests (STM)
NFM-P provides the ability to generate, manage and schedule STM tests within the network. This section provides guidelines for determining the extent to which STM tests can be scheduled and launched within a network.
A number of factors influence NFM-P’s ability to concurrently manage and schedule a large number of tests. NFM-P keeps track of how many tests are running concurrently so that it can limit the initiation of tests, and the processing of their results, without interfering with the system’s other functions.
To understand the STM guidelines, the following terminology is required:
Elemental Test: An OAM test to be sent to a router such as an LSP ping
Elemental Test Result: An OAM test result received from a network element
Accounting file Test: An OAM test that is initiated in the default manner, but whose results are retrieved from the network element via FTP on a periodic basis.
Test Policy: A definition or configuration that tells NFM-P the specifics about how to generate a test. A test policy can contain multiple test definitions. The policies are used by test suites.
Test Suite: A collection of elemental tests that can be assigned to a specific schedule. A test suite has three defined sections in which tests can be placed: First Run, Generated and Last Run, and the tests are executed in that order by section. The execution order of tests within the First Run and Last Run sections can be configured as parallel or sequential. Tests in the Generated section are run by the system as concurrently as possible. If the Generated section contains tests from several different test definitions, all the tests belonging to one definition are executed before the tests of the next definition begin; within a definition, the system attempts to execute the tests as concurrently as possible. This is important to note, as a test suite containing a large number of tests in the Generated section (or in the First Run/Last Run sections set to parallel) may tax the system: initiating, waiting for and processing many tests concurrently consumes greater amounts of system resources. In addition, tests that cause a large amount of data to be returned from the routers place increased demands on NFM-P.
Schedule: A start time that can have a test suite or test suites assigned to it to produce scheduled tasks. When the schedule's start time is reached, the suite or suites assigned to it will commence. The schedule may be set to continuously repeat after a configurable period of time.
Scheduled Task: An instance of a test suite assigned to a schedule
Non-NE Schedulable STM Tests: NFM-P provides the ability to execute and process results for non-NE schedulable tests. Non-NE schedulable tests are elemental tests that are not persistently defined on network elements; rather, they are defined and configured from NFM-P for each test execution. Elemental test results from non-NE schedulable tests are always regular (SNMP mediated) and share the same scale limits and considerations as regular scheduled STM tests.
Table 5-25: Maximum number of STM elemental test results

| NFM-P platform | Maximum regular STM elemental test results (SNMP mediated schedulable/non-NE schedulable) in a 15-minute period | Maximum accounting file STM elemental test results in a 15-minute period with results stored in the NFM-P database, or in the NFM-P database and using logToFile | Maximum accounting file STM elemental test results in a 15-minute period using logToFile only |
| --- | --- | --- | --- |
| Distributed NFM-P configuration with minimum 8 CPU Core NFM-P server | 15 000 | 1 500 000 ¹ | 1 500 000 ¹ |
| Distributed NFM-P configuration (NOTE: it may be possible to achieve higher numbers depending on the NFM-P server activity and hardware platform) | 6000 | 22 500 | 60 000 |
| Minimum supported collocated NFM-P configuration (NOTE: it may be possible to achieve higher numbers depending on the NFM-P server activity and hardware platform) | 3000 | 1500 | 15 000 |
Notes:
Guidelines for maximizing STM test execution
By default, NFM-P allows test suites with a combined weight of up to 80 000 to execute concurrently. Test suite weights are shown in the NFM-P GUI’s Test Suites List window. Starting too many tests at the same time causes the system to exceed this limit, and the excess test suites are skipped. Ensuring the successful execution of as many STM tests as possible requires planning the schedules, the contents, and the configuration of the test suites. The following guidelines will assist in maximizing the number of tests that can be executed on your system:
- When configuring tests or test policies, do not configure more packets (probes) than necessary, as additional packets increase the weight of the test suite.
- Test suites with a smaller weight typically complete more quickly and allow other test suites to execute concurrently. The weight of a test suite is determined by the number of tests it contains and the number of probes executed by each test. See Table 5-26, OAM test weight for the weight per test type.
- Assign the test suite time-out so that a test whose result has not been received can be considered missed or failed without stopping other test suites from executing.
- Rather than scheduling a test suite to execute all tests on one network element, spread the tests across multiple network elements so that the network elements can handle them concurrently. The test suite results are then returned and processed by NFM-P sooner, freeing up available system weight more quickly.
- Rather than scheduling a test suite to run sequentially, consider duplicating the test suite and running the two copies on alternating schedules. This gives each test suite time to complete or time out before the same test suite is executed again. Note that this may consume double the system weight until the alternate test suite completes.
- Create test suites that contain fewer than 200 elemental tests. You can then initiate the tests at different times by assigning the test suites to different schedules, giving greater control over how many tests are initiated or in progress at any given time.
- Prioritize which tests to perform by manually executing each test suite to determine how long it takes in your network. Use that duration, with some added buffer time, to determine how much time to leave between schedules or repetitions of a schedule, and how to configure the test suite time-out.
- Configure the test suite time-out so that it takes effect before the same test suite is scheduled to run again; a suite that has not completed or timed out by its next scheduled start will not execute.
- NFM-P database backups can impact the performance of STM tests.
Table 5-26: OAM test weight

| Test type | Weight |
| --- | --- |
| Regular Elemental STM Test | 10 per Test Packet |
| Accounting File Elemental STM Test | 1 |
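The weight arithmetic above can be sketched as a quick planning check. This is an illustrative calculation only, not an NFM-P API; the function names and the example suite layout are assumptions, while the per-test weights and the 80 000 default budget come from the table and text above.

```python
# Illustrative check of combined test suite weight against the default
# concurrent weight budget of 80 000 (weights from Table 5-26).

REGULAR_WEIGHT_PER_PACKET = 10   # regular elemental STM test
ACCOUNTING_FILE_WEIGHT = 1       # accounting file elemental STM test
DEFAULT_WEIGHT_BUDGET = 80_000   # default concurrent weight limit

def suite_weight(regular_tests: int, packets_per_test: int,
                 accounting_tests: int = 0) -> int:
    """Weight of one test suite: regular tests contribute 10 per packet,
    accounting file tests contribute 1 each."""
    return (regular_tests * packets_per_test * REGULAR_WEIGHT_PER_PACKET
            + accounting_tests * ACCOUNTING_FILE_WEIGHT)

def fits_budget(suite_weights) -> bool:
    """True if the suites can execute concurrently under the default limit."""
    return sum(suite_weights) <= DEFAULT_WEIGHT_BUDGET

# 4 suites of 100 regular tests with 5 packets each:
# each suite weighs 100 * 5 * 10 = 5000, for a combined weight of 20 000.
weights = [suite_weight(100, 5) for _ in range(4)]
print(sum(weights), fits_budget(weights))  # 20000 True
```

A calculation like this makes it easy to see why reducing the probe count per test (the first guideline above) directly reduces the weight a suite consumes.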
Accounting file STM test configuration
Accounting file collection of STM test results requires 7750 SR and 7450 ESS network elements at release 7.0 R4 or later. To take advantage of accounting file STM test execution, the test policy must be configured as NE schedulable with “Accounting file” selected. The resulting STM tests are executed on the network element, while the test results are collected by the NFM-P server by way of an accounting file, in a manner similar to accounting statistics. Accounting file STM test results are collected by the NFM-P server only.
NFM-P supports the use of logToFile for accounting file STM results. When logToFile is the only destination for results, the number of tests that can be executed per 15-minute interval is increased; see Table 5-25, Maximum number of STM elemental test results for the specific scaling limits. The logToFile method for accounting file STM results supports a maximum of two JMS clients.
Examples of STM test configuration
The following examples describe the configuration of STM tests on different network configurations.
Example 1:
Assume there is a network with 400 LSPs and that the objective is to perform LSP pings on each LSP as frequently as possible. Follow these steps:

- Create 4 test suites, each containing 100 elemental LSP ping tests.
- One at a time, execute each test suite and record how long each takes to complete. Assume the longest time for one of the test suites is 5 minutes.
- Create an ongoing schedule with a frequency of 15 minutes. This triples the time taken by the longest test suite and ensures that each suite completes before it is executed again. Assign the 4 test suites to this schedule.
- Monitor the test suite results to ensure they are completing. If tests are not completing (for example, they are marked as “skipped”), increase the schedule’s frequency interval.

In this case, 400 elemental tests are configured to be executed in each schedule interval.
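The interval arithmetic behind Example 1 can be sketched as follows. The helper name and the buffer factor are illustrative assumptions; a factor of 2 gives the minimum comfortable interval, and a larger factor (as the example uses) gives more headroom.

```python
# Illustrative sketch: derive a safe schedule interval from the measured
# longest test suite run time, then compute the resulting test volume.
import math

def schedule_interval(longest_run_min: float, buffer_factor: float = 2.0) -> int:
    """Smallest whole-minute interval that preserves the chosen buffer."""
    return math.ceil(longest_run_min * buffer_factor)

suites, tests_per_suite = 4, 100
longest_run = 5  # minutes, measured by running each suite once

interval = schedule_interval(longest_run)        # 10 minutes with a 2x buffer
tests_per_interval = suites * tests_per_suite    # 400 tests per interval
print(interval, tests_per_interval)
```

If suites start getting skipped in practice, the fix is the same as in the monitoring step above: increase the interval rather than the buffer calculation.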
Example 2:
Assume there are eight test suites (T1, T2, T3, T4, T5, T6, T7 and T8), each containing 50 elemental tests, and that each test suite individually takes 5 minutes to run. The objective is to schedule them so that the guideline of having fewer than about 200 concurrently running elemental tests is respected.
The recommended approach for scheduling these tests suites is as follows:
- Test suites T1, T2, T3 and T4 can be scheduled on the hour, repeating every 10 minutes.
- Test suites T5, T6, T7 and T8 can be scheduled on the hour + 5 minutes, repeating every 10 minutes.
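The staggering in Example 2 can be sketched as two offset schedules. The helper name is illustrative; the offsets, repeat interval and suite sizes come from the example.

```python
# Illustrative sketch of Example 2: two schedules offset by 5 minutes,
# each repeating every 10 minutes, so only one group of 4 suites
# (4 x 50 = 200 tests) starts at any one time.

def start_offsets(offset_min: int, repeat_min: int, horizon_min: int):
    """Start times (minutes past the hour) within the horizon."""
    return list(range(offset_min, horizon_min, repeat_min))

group_a = ["T1", "T2", "T3", "T4"]  # on the hour, every 10 minutes
group_b = ["T5", "T6", "T7", "T8"]  # hour + 5 minutes, every 10 minutes

print(start_offsets(0, 10, 30))   # [0, 10, 20]
print(start_offsets(5, 10, 30))   # [5, 15, 25]

# Because each suite finishes in about 5 minutes, group A's suites have
# completed (or timed out) before group B starts, and vice versa.
print(len(group_a) * 50)          # 200
```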
Factors impacting the number of elemental tests that can be executed in a given time frame
The following factors can impact the number of elemental tests that can be executed during a given time frame:
- The type of tests being executed. Each type of elemental test takes a different amount of time to complete (for example, a simple LSP ping of an LSP that spans only two routers may take less than 2 seconds, while an MTU ping could take many minutes).
- The amount of data generated or updated by the test within the network elements. NFM-P must obtain this information and store it in the NFM-P database. The quantity of data depends on the type of tests being performed and the configuration of the objects on which the tests are performed.
- The number of test suites scheduled at or around the same time.
- The number of routers over which the tests are being executed. In general, a large number of tests on a single router can be expected to take longer than the same number of tests distributed over many routers.
- An NFM-P database backup may temporarily reduce the system’s ability to write test results into the database.
- The station used to perform the tests dictates how many physical resources NFM-P can dedicate to executing elemental tests. On the minimum supported station (collocated NFM-P server and NFM-P database on a single station), the number of concurrent tests must be limited to 3 000.
Possible consequences of exceeding the capacity of the system to perform tests
NFM-P will exhibit the following symptoms if the number of scheduled tests exceeds the system’s capacity:
- Skipped tests: if a test suite is still in progress when its schedule triggers again, that scheduled task is marked as skipped, and the test suite is not attempted again until the next scheduled time.
- Failed tests (time-out): tests may time out and be marked as failed. A test that takes more than 15 minutes may be purged from an internal current-test list. For example, a test may be successfully sent to a router while the system receives no results for 15 minutes; the system then marks the test as failed and purges its expectation of receiving a result. However, the system could later still receive the results from the router and update the test’s result to success.
Disk space requirements for STM test results
STM test results are stored in the tablespace DB partition. The STM database partitions start with a total size of 300 MB of disk space. When the number of test results to retain is configured at its maximum of 20 000 000, the disk space requirement for the STM tests may increase by up to 80 GB, so a larger tablespace partition should be considered.
The maximum number of test results stored in the database reflects the sum of the aggregate results, test results, and probe results.
Running 10 tests with 1 probe each versus 1 test with 10 probes consumes the same amount of disk space.
When using logToFile for accounting file STM test results, the maximum time-to-live on disk is 24 hours. At the maximum collection rate of 1 500 000 test results per 15 minutes, the storage requirement on the NFM-P server in the xml_output directory is 600 GB per JMS client; this requirement doubles if the maximum number of JMS clients for accounting file STM results is used. The disk storage requirement can be decreased by using the compress option for logToFile, at the cost of increased CPU utilization on the NFM-P server.
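The disk figures above can be reproduced with back-of-the-envelope arithmetic. The helper name is illustrative; the 80 GB / 20 000 000 result figure, the 1 500 000 results per 15 minutes rate, the 24-hour time-to-live and the 600 GB per JMS client figure all come from the text, and linear scaling between those points is an assumption.

```python
# Illustrative sketch of the STM result storage arithmetic.

DB_GB_FOR_MAX_RESULTS = 80       # stated growth at the maximum retention
MAX_DB_RESULTS = 20_000_000      # maximum configurable stored results

def db_space_gb(stored_results: int) -> float:
    """Extra tablespace needed, scaled linearly from the documented maximum."""
    return stored_results * DB_GB_FOR_MAX_RESULTS / MAX_DB_RESULTS

print(db_space_gb(20_000_000))   # 80.0 GB at the maximum

# logToFile: 1 500 000 results / 15 min, kept on disk for 24 hours
intervals_per_day = 24 * 60 // 15            # 96 collection intervals
results_buffered = 1_500_000 * intervals_per_day
print(results_buffered)                      # 144000000 results on disk

LOG_TO_FILE_GB_PER_CLIENT = 600
print(LOG_TO_FILE_GB_PER_CLIENT * 2)         # 1200 GB with 2 JMS clients
```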
Scaling guidelines for OAM PM test results
See the NSP NFM-P Classic Management User Guide for details on OAM PM test configuration and result retrieval.
The quantity of resources allocated to the retrieval and processing of OAM PM test results within the NFM-P server is set at installation time and depends on the number of CPUs available to the NFM-P server software. The number of CPUs available to the NFM-P server depends on the number of CPUs on the station and on whether the NFM-P database software is collocated with the NFM-P server software on the same station.
The following tables provide the maximum number of OAM PM test results that can be retrieved and processed by the NFM-P server or NFM-P statistics auxiliary in various configurations.
Table 5-27: Maximum number of OAM PM test results processed by an NFM-P server

| Number of CPU cores on the NFM-P server | Maximum OAM PM test results per 15-minute interval (collocated configuration) | Maximum OAM PM test results per 15-minute interval (distributed configuration) |
| --- | --- | --- |
| 6 | 100 000 | 200 000 |
| 8 or greater | 200 000 | 400 000 |
Table 5-28: Maximum number of OAM PM test results processed by an NFM-P statistics auxiliary

All values are the maximum number of OAM PM test results per 15-minute interval, by result collection method and statistics auxiliary platform.

| Number of active NFM-P statistics auxiliaries | Collection with NFM-P database (8 CPU cores, 32 GB RAM) | Collection with NFM-P database (12 CPU cores, 32 GB RAM) | Collection with single auxiliary database (8 CPU cores, 32 GB RAM) | Collection with three+ auxiliary database cluster (12 CPU cores, 32 GB RAM) | logToFile only (12 CPU cores, 32 GB RAM) |
| --- | --- | --- | --- | --- | --- |
| 1 | 10 000 000 | 10 000 000 | 5 000 000 | 20 000 000 | 20 000 000 |
| 2 | 10 000 000 | 10 000 000 | 5 000 000 | 40 000 000 | 40 000 000 |
| 3 | 10 000 000 | 10 000 000 | 5 000 000 | 60 000 000 | 60 000 000 |
The table below shows the retention that is achievable depending upon the total number of test results to retain and the database used to retain the records:
Table 5-29: Maximum OAM PM test result retention

| Database to retain records | Total number of OAM PM test results to be stored in the database | Maximum number of retention intervals |
| --- | --- | --- |
| NFM-P database | <40M | 672 |
| NFM-P database | >40M | 96 |
| NSP auxiliary database | N/A | 35,040 |
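The retention figures above can be converted into wall-clock time under the assumption (an interpretation, not stated explicitly in the table) that one retention interval equals one 15-minute collection interval:

```python
# Illustrative conversion of retention intervals to days, assuming
# 15-minute intervals.

def retention_days(intervals: int, interval_min: int = 15) -> float:
    return intervals * interval_min / (60 * 24)

print(retention_days(672))     # 7.0   (NFM-P database, <40M results)
print(retention_days(96))      # 1.0   (NFM-P database, >40M results)
print(retention_days(35_040))  # 365.0 (NSP auxiliary database)
```

Under this assumption, the NFM-P database retains roughly a week of results below the 40M threshold, a day above it, and the NSP auxiliary database retains about a year.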