Troubleshooting NSP cluster issues
Purpose
The commands in this section help you diagnose and resolve NSP deployment and installation problems by showing detailed status information for the deployed components.
Pod troubleshooting
This topic describes troubleshooting the pods in an NSP cluster.
Retrieve a list of pods
Enter the following command to view a list of pods in the NSP cluster:
# kubectl get pods -A ↵
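If the list is long, you can optionally restrict it to pods that are not in the Running phase; the following filter is a general kubectl option rather than an NSP-specific command:
# kubectl get pods -A --field-selector=status.phase!=Running ↵
Note: Pods that have completed normally report the Succeeded phase and are also listed by this filter.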
Retrieve pod information
Enter the following to view information about a specific pod:
# kubectl describe pod pod_name ↵
where pod_name is the name of the pod to view
The command output includes many parameters, among them any events associated with the pod; for example:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler 0/1 nodes are available: 1 Insufficient memory.
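To view only the events for a specific pod, you can optionally use a standard kubectl event query such as the following, where pod_name is the name of the pod; the query is a general kubectl example and is not NSP-specific:
# kubectl get events -A --field-selector involvedObject.name=pod_name ↵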
Recover pods
Enter the following command to recover a pod:
# kubectl delete pod pod_name ↵
where pod_name is the name of the pod
The pod is automatically redeployed. You can use this command to recover a pod that is in an error state.
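To optionally confirm that the pod is redeployed, you can watch the pod list until the replacement pod reaches the Running state; -w is the standard kubectl watch option:
# kubectl get pods -A -w ↵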
Recover executor pods
Some NSP applications use executor and driver pods.
An executor pod name has the following format:
app_name-instance-exec-executor_ID
where
app_name is the application name
instance is the pod instance ID
executor_ID is a number that identifies the executor instance
Enter the following to recover an executor pod, where app_name is the application name:
# kubectl delete pod app_name-driver ↵
The driver pod is automatically redeployed, which recovers any associated executor pods that are in an error state.
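To optionally list the driver and executor pods before or after the recovery, you can filter the pod list by name; the following pattern is an illustration only and assumes the naming format described above:
# kubectl get pods -A | grep -E 'driver|exec' ↵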
NSP cluster member troubleshooting
This topic describes troubleshooting the members of an NSP cluster.
Retrieve a list of members
Enter the following to list the NSP cluster members:
# kubectl get nodes ↵
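For additional member detail such as internal IP addresses, OS images, and kubelet versions, you can optionally use the standard wide output format:
# kubectl get nodes -o wide ↵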
Retrieve member information
Enter the following to view information about a specific member:
# kubectl describe nodes node_name ↵
where node_name is the name of the member to view
The command output includes member information such as the following:
- member conditions; for example:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Wed, 30 Sep 2020 12:19:23 -0400 Wed, 30 Sep 2020 12:19:23 -0400 CalicoIsUp Calico is running on this node
- member resource capacity; for example:
Capacity:
cpu: 24
ephemeral-storage: 67092472Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 64381888Ki
pods: 110
- running pods on the member; for example:
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
default nginx-ingress-controller-8fj7s 100m (0%) 12 (37%) 500Mi (0%) 1000Mi (0%) 7h9m
default nspos-app1-tomcat-8597d67787-wdgxd 5100m (16%) 12100m (38%) 17230Mi (13%) 17230Mi (13%) 7h10m
default nspos-neo4j-core-default-1 2050m (6%) 2050m (6%) 2650Mi (2%) 2650Mi (2%) 7h10m
default nspos-postgresql-primary-0 6050m (19%) 6050m (19%) 1290Mi (1%) 1290Mi (1%) 7h9m
- resources allocated to the member; for example:
Resource Requests Limits
-------- -------- ------
cpu 22870m (71%) 41150m (129%)
memory 42120228Ki (32%) 44290630912 (33%)
ephemeral-storage 0 (0%) 0 (0%)
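To optionally view the current CPU and memory usage of each member, rather than the requested amounts, you can use the standard kubectl top command; the command assumes that the cluster metrics service is available:
# kubectl top nodes ↵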
MDM server troubleshooting
This topic describes troubleshooting MDM instances.
Note: The NSP system must be operational before these operations can be performed.
Retrieve detailed information about MDM servers
From the NSP deployer host software directory, enter the following to show the MDM server roles, the number of NEs managed using MDM, and which MDM server is hosting which NE:
# tools/mdm/bin/server-load.bash --user username --pass password --detail ↵
where
username is the NSP username
password is the NSP password
The command output includes information such as the following:
{
  "mdmInstanceInfos": [
    {
      "name": "mdm-server-0",
      "ipAddress": "mdm-server-0.mdm-server-svc-headless.default.svc.cluster.local",
      "grpcPort": 30000,
      "status": "Up",
      "neCount": 0,
      "neIds": null,
      "active": false,
      "groupIds": [1, 2]
    },
    {
      "name": "mdm-server-1",
      "ipAddress": "mdm-server-1.mdm-server-svc-headless.default.svc.cluster.local",
      "grpcPort": 30000,
      "status": "Up",
      "neCount": 2,
      "neIds": ["1.1.1.1", "1.1.1.2"],
      "active": true,
      "groupId": 1
    },
    {
      "name": "mdm-server-2",
      "ipAddress": "mdm-server-2.mdm-server-svc-headless.default.svc.cluster.local",
      "grpcPort": 30000,
      "status": "Up",
      "neCount": 2,
      "neIds": ["1.1.1.3", "1.1.1.4"],
      "active": true,
      "groupId": 2
    }
  ]
}
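If you require only a summary of the NE distribution, you can optionally filter the output; the following example assumes that the jq utility is installed and that the command output is valid JSON:
# tools/mdm/bin/server-load.bash --user username --pass password --detail | jq '.mdmInstanceInfos[] | {name, neCount, active}' ↵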
Rebalance NE load on MDM servers
From the NSP deployer host software directory, enter the following to rebalance the NE load on the MDM servers:
# tools/mdm/bin/server-load.bash --user username --pass password --rebalance ↵
where
username is the NSP username
password is the NSP password
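To optionally verify the result, rerun the command with the --detail option, as described earlier in this topic, and confirm that the neCount values are evenly distributed among the active MDM servers:
# tools/mdm/bin/server-load.bash --user username --pass password --detail ↵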
Disk performance tests
This topic describes NSP disk tests for collecting performance metrics such as throughput and latency measurements.
Verify disk performance for etcd
As the root user, enter the following:
# mkdir /var/lib/test ↵
# fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/test --size=22m --bs=3200 --name=mytest ↵
The command produces output like the following:
Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1)
mytest: (groupid=0, jobs=1): err= 0: pid=40944: Mon Jun 15 10:23:23 2020
write: IOPS=7574, BW=16.6MiB/s (17.4MB/s)(21.0MiB/1324msec)
clat (usec): min=4, max=261, avg= 9.50, stdev= 4.11
lat (usec): min=4, max=262, avg= 9.67, stdev= 4.12
clat percentiles (nsec):
| 1.00th=[ 5536], 5.00th=[ 5728], 10.00th=[ 5920], 20.00th=[ 6176],
| 30.00th=[ 7584], 40.00th=[ 8896], 50.00th=[ 9408], 60.00th=[ 9792],
| 70.00th=[10432], 80.00th=[11584], 90.00th=[12864], 95.00th=[14528],
| 99.00th=[20352], 99.50th=[23168], 99.90th=[28800], 99.95th=[42752],
| 99.99th=[60672]
bw ( KiB/s): min=16868, max=17258, per=100.00%, avg=17063.00, stdev=275.77, samples=2
iops : min= 7510, max= 7684, avg=7597.00, stdev=123.04, samples=2
lat (usec) : 10=64.21%, 20=34.68%, 50=1.08%, 100=0.02%, 500=0.01%
In the second block of output, which is shown below, the 99th percentile fsync/fdatasync durations must be less than 10 ms. In this example, each reported duration is less than 1 ms.
fsync/fdatasync/sync_file_range:
sync (usec): min=39, max=1174, avg=120.71, stdev=63.89
sync percentiles (usec):
| 1.00th=[ 42], 5.00th=[ 45], 10.00th=[ 46], 20.00th=[ 48],
| 30.00th=[ 52], 40.00th=[ 71], 50.00th=[ 153], 60.00th=[ 159],
| 70.00th=[ 167], 80.00th=[ 178], 90.00th=[ 192], 95.00th=[ 206],
| 99.00th=[ 229], 99.50th=[ 239], 99.90th=[ 355], 99.95th=[ 416],
| 99.99th=[ 445]
cpu : usr=2.95%, sys=29.93%, ctx=15663, majf=0, minf=35
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
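The test writes its data under /var/lib/test. After you record the results, you can optionally remove the test directory; the following command assumes that /var/lib/test is used only for this test:
# rm -rf /var/lib/test ↵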
Verify disk performance for NSP
Enter the following as the root user in the /opt/nsp directory to run a random read/write test that creates a data file named random_read_write.fio in the directory:
# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=50 ↵
The command produces output like the following:
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.7
Starting 1 process
test: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=22.1MiB/s,w=22.2MiB/s][r=5645,w=5674 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=32439: Mon Sep 21 10:25:11 2020
read: IOPS=6301, BW=24.6MiB/s (25.8MB/s)(2049MiB/83252msec)
bw ( KiB/s): min=13824, max=39088, per=99.57%, avg=25098.60, stdev=5316.27, samples=166
iops : min= 3456, max= 9772, avg=6274.49, stdev=1329.11, samples=166
write: IOPS=6293, BW=24.6MiB/s (25.8MB/s)(2047MiB/83252msec)
bw ( KiB/s): min=13464, max=40024, per=99.56%, avg=25062.73, stdev=5334.65, samples=166
iops : min= 3366, max=10006, avg=6265.57, stdev=1333.67, samples=166
cpu : usr=5.13%, sys=18.63%, ctx=202387, majf=0, minf=26
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=524625,523951,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=24.6MiB/s (25.8MB/s), 24.6MiB/s-24.6MiB/s (25.8MB/s-25.8MB/s), io=2049MiB (2149MB), run=83252-83252msec
WRITE: bw=24.6MiB/s (25.8MB/s), 24.6MiB/s-24.6MiB/s (25.8MB/s-25.8MB/s), io=2047MiB (2146MB), run=83252-83252msec
Disk stats (read/write):
vda: ios=523989/526042, merge=0/2218, ticks=3346204/1622070, in_queue=4658999, util=96.06%
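The test leaves a 4-Gbyte data file named random_read_write.fio in the /opt/nsp directory. After you record the results, you can optionally remove the file:
# rm -f /opt/nsp/random_read_write.fio ↵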