Troubleshooting NSP cluster issues

Purpose

The commands in this section help you diagnose and resolve NSP deployment and installation issues by providing detailed status information about the deployed components.

Pod troubleshooting

This topic covers troubleshooting the pods that are part of the NSP cluster.

Retrieve a list of pods

Enter the following command to view a list of pods in the NSP cluster:

kubectl get pods -A ↵
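
To isolate problem pods, you can filter the list using standard kubectl options; for example, the following lists only the pods that are not in the Running phase:

kubectl get pods -A --field-selector=status.phase!=Running ↵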

Retrieve pod information

Enter the following to view information about a specific pod:

kubectl describe pod pod_name ↵

where pod_name is the name of the pod to view
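
The pod namespace is shown in the kubectl get pods -A output. If the pod is not in the default namespace, include the namespace in the command; in the following example, namespace is a placeholder for the actual namespace name:

kubectl describe pod pod_name -n namespace ↵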

The command output includes many parameters, including any events associated with the pod. For example:

Type     Reason            Age        From               Message
----     ------            ----       ----               -------
Warning  FailedScheduling  <unknown>  default-scheduler  0/1 nodes are available: 1 Insufficient memory.
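
To review only the events for a pod, without the rest of the describe output, you can query the events directly; the following is a generic kubectl example in which namespace is a placeholder:

kubectl get events -n namespace --field-selector involvedObject.name=pod_name ↵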

Recover pods

Enter the following command to recover a pod:

kubectl delete pod pod_name ↵

where pod_name is the name of the pod to recover

The pod is automatically redeployed. You can use this command to recover a pod that is in an error state.
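
To confirm that the pod is redeployed, you can watch the pod list until the new pod reaches the Running state; for example:

kubectl get pods -A -w ↵

Press Ctrl+C to stop watching the pod list.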

Recover executor pods

Some NSP applications use executor and driver pods.

An executor pod name has the following format:

app_name-instance-exec-executor_ID

where

app_name is the application name

instance is the pod instance ID

executor_ID is a number that identifies the executor instance

Enter the following to recover an executor pod, where app_name is the application name:

kubectl delete pod app_name-driver ↵

The driver pod is automatically redeployed, thereby recovering any associated errored executor pods.
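
To identify the driver and executor pods before or after the recovery, you can filter the pod list by name; the following is an illustrative example based on the pod-name formats described above:

kubectl get pods -A | grep -E 'driver|exec' ↵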

NSP cluster member troubleshooting

This topic describes troubleshooting the members of an NSP cluster.

Retrieve a list of members

Enter the following to list the NSP cluster members:

kubectl get nodes ↵
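
For additional member details in the list view, such as internal IP addresses and the OS, kernel, and container runtime versions, you can use the wide output format:

kubectl get nodes -o wide ↵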

Retrieve member information

Enter the following to view information about a specific member:

kubectl describe nodes node_name ↵

where node_name is the name of the member to view

The command output includes detailed member information, such as the member conditions, resource capacity, and allocated resources.
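
To view only the member resource usage from the describe output, you can extract the relevant section; for example, the following shows the Allocated resources summary, where the number of lines shown is illustrative:

kubectl describe nodes node_name | grep -A 10 "Allocated resources" ↵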

MDM server troubleshooting

This topic describes troubleshooting MDM instances.

Note: The NSP system must be operational before these operations can be performed.

Retrieve detailed information about MDM servers

From the NSP deployer host software directory, enter the following to show the MDM server roles, the number of NEs managed using MDM, and which MDM server is hosting which NE:

# tools/mdm/bin/server-load.bash --user username --pass password --detail

where

username is the NSP username

password is the NSP password

The command output includes information such as the following:

{
    "mdmInstanceInfos": [
      {
           "name": "mdm-server-0",
           "ipAddress": "mdm-server-0.mdm-server-svc-headless.default.svc.cluster.local",
           "grpcPort": 30000,
           "status": "Up",
           "neCount": 0,
           "neIds": null,
           "active": false,
           "groupIds": [1, 2]
      },
      {
           "name": "mdm-server-1",
           "ipAddress": "mdm-server-1.mdm-server-svc-headless.default.svc.cluster.local",
           "grpcPort": 30000,
           "status": "Up",
           "neCount": 2,
           "neIds": ["1.1.1.1", "1.1.1.2"],
           "active": true,
           "groupId": 1
      },
      {
           "name": "mdm-server-2",
           "ipAddress": "mdm-server-2.mdm-server-svc-headless.default.svc.cluster.local",
           "grpcPort": 30000,
           "status": "Up",
           "neCount": 2,
           "neIds": ["1.1.1.3", "1.1.1.4"],
           "active": true,
           "groupId": 2
      }
    ]
}
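
The MDM server instances run as pods in the NSP cluster, as indicated by the ipAddress values in the output. If an instance is reported as down, you can check the state of the corresponding pods; for example:

kubectl get pods -A | grep mdm-server ↵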

Rebalance NE load on MDM servers

From the NSP deployer host software directory, enter the following to rebalance the NE load on the MDM servers:

# tools/mdm/bin/server-load.bash --user username --pass password --rebalance

where

username is the NSP username

password is the NSP password
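
After the rebalance completes, you can verify the new NE distribution by re-running the command with the --detail option, as described earlier in this topic:

# tools/mdm/bin/server-load.bash --user username --pass password --detail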

Disk performance tests

This topic describes NSP disk tests for collecting performance metrics such as throughput and latency.
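
The tests use the open-source fio utility, which must be available on the station being tested. If fio is not installed, it can typically be added from the OS package repositories; for example, as the root user on a RHEL-based station:

yum install fio ↵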

Verify disk performance for etcd

As the root user, enter the following:

mkdir /var/lib/test ↵

fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/test --size=22m --bs=3200 --name=mytest ↵

The command produces output like the following:

Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1)
mytest: (groupid=0, jobs=1): err= 0: pid=40944: Mon Jun 15 10:23:23 2020
  write: IOPS=7574, BW=16.6MiB/s (17.4MB/s)(21.0MiB/1324msec)
    clat (usec): min=4, max=261, avg= 9.50, stdev= 4.11
     lat (usec): min=4, max=262, avg= 9.67, stdev= 4.12
    clat percentiles (nsec):
     |  1.00th=[ 5536],  5.00th=[ 5728], 10.00th=[ 5920], 20.00th=[ 6176],
     | 30.00th=[ 7584], 40.00th=[ 8896], 50.00th=[ 9408], 60.00th=[ 9792],
     | 70.00th=[10432], 80.00th=[11584], 90.00th=[12864], 95.00th=[14528],
     | 99.00th=[20352], 99.50th=[23168], 99.90th=[28800], 99.95th=[42752],
     | 99.99th=[60672]
   bw (  KiB/s): min=16868, max=17258, per=100.00%, avg=17063.00, stdev=275.77, samples=2
   iops        : min= 7510, max= 7684, avg=7597.00, stdev=123.04, samples=2
  lat (usec)   : 10=64.21%, 20=34.68%, 50=1.08%, 100=0.02%, 500=0.01%

In the second block of output, which is shown below, the 99th-percentile durations must be less than 10 ms. In this block, each duration is less than 1 ms.

fsync/fdatasync/sync_file_range:
    sync (usec): min=39, max=1174, avg=120.71, stdev=63.89
    sync percentiles (usec):
     |  1.00th=[   42],  5.00th=[   45], 10.00th=[   46], 20.00th=[   48],
     | 30.00th=[   52], 40.00th=[   71], 50.00th=[  153], 60.00th=[  159],
     | 70.00th=[  167], 80.00th=[  178], 90.00th=[  192], 95.00th=[  206],
     | 99.00th=[  229], 99.50th=[  239], 99.90th=[  355], 99.95th=[  416],
     | 99.99th=[  445]
  cpu          : usr=2.95%, sys=29.93%, ctx=15663, majf=0, minf=35
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
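
The test directory and its contents are not removed automatically. After you review the results, you can remove them; the following assumes that /var/lib/test is used only for this test:

rm -rf /var/lib/test ↵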

Verify disk performance for NSP

Enter the following as the root user in the /opt/nsp directory; the command runs an I/O test job named 'test' that creates a file called random_read_write.fio in the directory:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=50 ↵

The command produces output like the following:

test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.7
Starting 1 process
test: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=22.1MiB/s,w=22.2MiB/s][r=5645,w=5674 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=32439: Mon Sep 21 10:25:11 2020
  read: IOPS=6301, BW=24.6MiB/s (25.8MB/s)(2049MiB/83252msec)
   bw (  KiB/s): min=13824, max=39088, per=99.57%, avg=25098.60, stdev=5316.27, samples=166
   iops        : min= 3456, max= 9772, avg=6274.49, stdev=1329.11, samples=166
  write: IOPS=6293, BW=24.6MiB/s (25.8MB/s)(2047MiB/83252msec)
   bw (  KiB/s): min=13464, max=40024, per=99.56%, avg=25062.73, stdev=5334.65, samples=166
   iops        : min= 3366, max=10006, avg=6265.57, stdev=1333.67, samples=166
  cpu          : usr=5.13%, sys=18.63%, ctx=202387, majf=0, minf=26
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=524625,523951,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=24.6MiB/s (25.8MB/s), 24.6MiB/s-24.6MiB/s (25.8MB/s-25.8MB/s), io=2049MiB (2149MB), run=83252-83252msec
  WRITE: bw=24.6MiB/s (25.8MB/s), 24.6MiB/s-24.6MiB/s (25.8MB/s-25.8MB/s), io=2047MiB (2146MB), run=83252-83252msec

Disk stats (read/write):
  vda: ios=523989/526042, merge=0/2218, ticks=3346204/1622070, in_queue=4658999, util=96.06%
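
The test file created by the command is not removed automatically. After you review the results, you can delete it from the /opt/nsp directory:

rm -f /opt/nsp/random_read_write.fio ↵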