The troubleshooting process

Identifying network performance issues

The troubleshooting process identifies and resolves performance issues related to a network service or component. A performance issue can result in service degradation or in a complete network failure.

The first step in problem resolution is to identify the problem. A problem can be identified from an alarm received from a network component, an analysis of network capacity and performance data, or a customer problem report.

The personnel responsible for troubleshooting the problem must:

Network maintenance

The most effective method to prevent problems is to schedule and perform routine maintenance on your network. Major networking problems often start as minor performance issues. See the NSP System Administrator Guide for more information about how to perform routine maintenance on your network.

Troubleshooting problem-solving model

An effective troubleshooting problem-solving model includes the following tasks:

  1. Establish a performance baseline.

  2. Categorize the problem.

  3. Identify the root cause of the problem.

  4. Plan corrective action and resolve the problem.

  5. Verify the solution to the problem.

See Process to troubleshoot a problem in the NSP for information about how the problem-solving model aligns with using the NSP to troubleshoot a network or network management problem.

Establish a performance baseline

You must have a thorough knowledge of your network and how it operates under normal conditions to troubleshoot problems effectively. This knowledge facilitates the identification of fault conditions in your network. You must establish and maintain baseline information for your network and services. The maintenance of the baseline information is critical because a network is not a static environment.

See the NSP System Administrator Guide for more information on how to generate NSP system baseline information.
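
As an illustration of how baseline information can help identify fault conditions, the following minimal Python sketch summarizes utilization samples collected under normal conditions and flags a new reading that falls far outside that range. It does not use any NSP interface. The function names, the sample values, and the three-standard-deviation tolerance are assumptions chosen for the example.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Return (mean, standard deviation) for samples taken under normal conditions."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a baseline")
    return mean(samples), stdev(samples)


def deviates_from_baseline(value, baseline, tolerance=3.0):
    """Return True if value is more than `tolerance` standard deviations from the mean.

    The tolerance of 3.0 is an illustrative assumption, not a recommended setting.
    """
    avg, spread = baseline
    return abs(value - avg) > tolerance * spread


# Hypothetical link utilization samples (percent) recorded during normal operation.
normal_utilization = [41.0, 43.5, 39.8, 42.2, 40.5, 44.1, 41.7]
baseline = build_baseline(normal_utilization)

print(deviates_from_baseline(42.0, baseline))  # reading close to the baseline
print(deviates_from_baseline(85.0, baseline))  # reading far above the baseline
```

Because a network is not a static environment, the baseline in a sketch like this would need to be rebuilt from fresh samples at regular intervals.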

Categorize the problem

When you categorize a problem, you must differentiate between total failures and problems that result in a degradation in performance. For example, the failure of an access switch results in a total failure for a customer who has one DS3 link into a network. A core router that operates at over 80% average utilization can start to discard packets, which results in a degradation of performance for services that use the device. Performance degradations exhibit different symptoms from total failures and may not generate alarms or significant network events.
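
The distinction between a total failure and a performance degradation can be sketched as a simple rule. The Python example below is illustrative only: the function name, the category labels, and the inputs are hypothetical, and the 80% threshold is taken from the core router example above rather than from any product default.

```python
def categorize_problem(device_reachable, avg_utilization_pct, threshold_pct=80.0):
    """Classify a device state as a total failure, a degradation, or normal.

    device_reachable: False when the device or link is down entirely.
    avg_utilization_pct: average utilization of the device, in percent.
    threshold_pct: utilization above which packet discards become likely
                   (assumed here, following the 80% example in the text).
    """
    if not device_reachable:
        return "total failure"
    if avg_utilization_pct > threshold_pct:
        return "performance degradation"
    return "normal"


# An access switch that is down is a total failure for its single-homed customer.
print(categorize_problem(device_reachable=False, avg_utilization_pct=0.0))

# A core router that is up but heavily loaded degrades the services that use it.
print(categorize_problem(device_reachable=True, avg_utilization_pct=86.5))
```

A rule like this only covers cases that produce measurable data. As noted above, a degradation may not generate an alarm, so utilization statistics have to be collected separately.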

Multiple problems can simultaneously occur and create related or unique symptoms. Detailed information about the symptoms that are associated with the problem helps the NOC or engineering operational staff diagnose and fix the problem. The following information can help you assess the scope of the problem:

  • alarm files

  • error logs

  • network statistics

  • network analyzer traces

  • output of CLI show commands

  • accounting logs

  • customer problem reports

Use the following guidelines to help you categorize the problem:

Identify the root cause of the problem

A symptom of a problem can be the result of more than one network issue. You can resolve multiple, related problems by resolving the root cause of the problem.

Use the following guidelines to help you implement a systematic approach to resolve the root cause of the problem:

Plan corrective action and resolve the problem

The corrective action required to resolve a problem depends on the problem type. The problem severity and associated QoS commitments affect the approach to resolving the problem. You must balance the risk of creating further service interruptions against restoring service in the shortest possible time.

To perform the corrective action:

  1. Document each step of the corrective action.

  2. Test the corrective action.

  3. Use the CLI to verify behavior changes in each step.

  4. Apply the corrective action to the live network.

  5. Test to verify that the corrective action resolved the problem.

Verify the solution to the problem

You must make sure that the corrective action associated with the resolution of the problem did not introduce new symptoms in your network. If new symptoms are detected, or if the problem has only recently been mitigated, you need to repeat the troubleshooting process.
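
One way to make this check concrete is to compare the symptoms recorded before the corrective action with those observed afterward. The Python sketch below is a hypothetical illustration: the function name and symptom strings are invented for the example, and treating any persisting symptom as a sign that the problem was only mitigated is an assumption.

```python
def verify_solution(symptoms_before, symptoms_after):
    """Compare symptom sets recorded before and after the corrective action."""
    before, after = set(symptoms_before), set(symptoms_after)
    new = after - before        # symptoms introduced since the corrective action
    remaining = after & before  # original symptoms that are still present
    return {
        "new": sorted(new),
        "remaining": sorted(remaining),
        # Repeat the troubleshooting process if anything is new or unresolved.
        "repeat_troubleshooting": bool(new or remaining),
    }


# Hypothetical symptom descriptions.
before = ["packet discards on core router", "high latency for service A"]
after = ["high latency for service A", "route flapping on peer link"]

result = verify_solution(before, after)
print(result["new"])
print(result["remaining"])
print(result["repeat_troubleshooting"])
```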

Checklist for identifying problems

When a problem is identified in the network management domain, track and store data to use for troubleshooting purposes:

During troubleshooting: