Troubleshooting NSP cluster issues
Purpose
The commands in this section help you diagnose and resolve NSP deployment and installation problems by showing detailed status information for the deployed components.
Pod troubleshooting
This topic describes troubleshooting the pods in an NSP cluster.
Retrieve a list of pods
Enter the following command to view a list of pods in the NSP cluster:
# kubectl get pods -A ↵
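If the list is long, you can optionally restrict it to pods that are not in the Running phase; the following filter is a general kubectl option rather than an NSP-specific command:
# kubectl get pods -A --field-selector=status.phase!=Running ↵
Note: Pods that have completed normally report the Succeeded phase and are also listed by this filter.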
Retrieve pod information
Enter the following to view information about a specific pod:
# kubectl describe pod pod_name ↵
where pod_name is the name of the pod to view
The command output includes many parameters, among them any events associated with the pod; for example:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler 0/1 nodes are available: 1 Insufficient memory.
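To view only the events for a specific pod, you can optionally use a standard kubectl event query such as the following, where pod_name is the name of the pod; the query is a general kubectl example and is not NSP-specific:
# kubectl get events -A --field-selector involvedObject.name=pod_name ↵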
Recover pods
Enter the following command to recover a pod:
# kubectl delete pod pod_name ↵
where pod_name is the name of the pod
The pod is automatically redeployed. You can use this command to recover a pod that is in an error state.
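To optionally confirm that the pod is redeployed, you can watch the pod list until the replacement pod reaches the Running state; -w is the standard kubectl watch option:
# kubectl get pods -A -w ↵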
Recover executor pods
Some NSP applications use executor and driver pods.
An executor pod name has the following format:
app_name-instance-exec-executor_ID
where
app_name is the application name
instance is the pod instance ID
executor_ID is a number that identifies the executor instance
Enter the following to recover an executor pod, where app_name is the application name:
# kubectl delete pod app_name-driver ↵
The driver pod is automatically redeployed, which recovers any associated executor pods that are in an error state.
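To optionally list the driver and executor pods before or after the recovery, you can filter the pod list by name; the following pattern is an illustration only and assumes the naming format described above:
# kubectl get pods -A | grep -E 'driver|exec' ↵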
NSP cluster member troubleshooting
This topic describes troubleshooting the members of an NSP cluster.
Retrieve a list of members
Enter the following to list the NSP cluster members:
# kubectl get nodes ↵
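For additional member detail such as internal IP addresses, OS images, and kubelet versions, you can optionally use the standard wide output format:
# kubectl get nodes -o wide ↵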
Retrieve member information
Enter the following to view information about a specific member:
# kubectl describe nodes node_name ↵
where node_name is the name of the member to view
The command output includes member information such as the following:
- member conditions; for example:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Wed, 30 Sep 2020 12:19:23 -0400 Wed, 30 Sep 2020 12:19:23 -0400 CalicoIsUp Calico is running on this node
- member resource capacity; for example:
Capacity:
cpu: 24
ephemeral-storage: 67092472Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 64381888Ki
pods: 110
- running pods on the member; for example:
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
default nginx-ingress-controller-8fj7s 100m (0%) 12 (37%) 500Mi (0%) 1000Mi (0%) 7h9m
default nspos-app1-tomcat-8597d67787-wdgxd 5100m (16%) 12100m (38%) 17230Mi (13%) 17230Mi (13%) 7h10m
default nspos-neo4j-core-default-1 2050m (6%) 2050m (6%) 2650Mi (2%) 2650Mi (2%) 7h10m
default nspos-postgresql-primary-0 6050m (19%) 6050m (19%) 1290Mi (1%) 1290Mi (1%) 7h9m
- resources allocated to the member; for example:
Resource Requests Limits
-------- -------- ------
cpu 22870m (71%) 41150m (129%)
memory 42120228Ki (32%) 44290630912 (33%)
ephemeral-storage 0 (0%) 0 (0%)
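To optionally view the current CPU and memory usage of each member, rather than the requested amounts, you can use the standard kubectl top command; the command assumes that the cluster metrics service is available:
# kubectl top nodes ↵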
MDM server troubleshooting
This topic describes troubleshooting MDM instances.
Note: The NSP system must be operational before these operations can be performed.
Retrieve detailed information about MDM servers
From the NSP deployer host software directory, enter the following to show the MDM server roles, the number of NEs managed using MDM, and which MDM server is hosting which NE:
# tools/mdm/bin/server-load.bash --user username --pass password --detail ↵
where
username is the NSP username
password is the NSP password
The command output includes information such as the following:
{
  "mdmInstanceInfos": [
    {
      "name": "mdm-server-0",
      "ipAddress": "mdm-server-0.mdm-server-svc-headless.default.svc.cluster.local",
      "grpcPort": 30000,
      "status": "Up",
      "neCount": 0,
      "neIds": null,
      "active": false,
      "groupIds": [1, 2]
    },
    {
      "name": "mdm-server-1",
      "ipAddress": "mdm-server-1.mdm-server-svc-headless.default.svc.cluster.local",
      "grpcPort": 30000,
      "status": "Up",
      "neCount": 2,
      "neIds": ["1.1.1.1", "1.1.1.2"],
      "active": true,
      "groupId": 1
    },
    {
      "name": "mdm-server-2",
      "ipAddress": "mdm-server-2.mdm-server-svc-headless.default.svc.cluster.local",
      "grpcPort": 30000,
      "status": "Up",
      "neCount": 2,
      "neIds": ["1.1.1.3", "1.1.1.4"],
      "active": true,
      "groupId": 2
    }
  ]
}
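If you require only a summary of the NE distribution, you can optionally filter the output; the following example assumes that the jq utility is installed and that the command output is valid JSON:
# tools/mdm/bin/server-load.bash --user username --pass password --detail | jq '.mdmInstanceInfos[] | {name, neCount, active}' ↵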
Rebalance NE load on MDM servers
From the NSP deployer host software directory, enter the following to rebalance the NE load on the MDM servers:
# tools/mdm/bin/server-load.bash --user username --pass password --rebalance ↵
where
username is the NSP username
password is the NSP password
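To optionally verify the result, rerun the command with the --detail option, as described earlier in this topic, and confirm that the neCount values are evenly distributed among the active MDM servers:
# tools/mdm/bin/server-load.bash --user username --pass password --detail ↵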
Disk performance tests
This topic describes NSP disk tests for collecting performance metrics such as throughput and latency measurements.
Verify disk performance for etcd
As the root user, enter the following:
# mkdir /var/lib/test ↵
# fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/test --size=22m --bs=3200 --name=mytest ↵
The command produces output like the following:
Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1)
mytest: (groupid=0, jobs=1): err= 0: pid=40944: Mon Jun 15 10:23:23 2020
write: IOPS=7574, BW=16.6MiB/s (17.4MB/s)(21.0MiB/1324msec)
clat (usec): min=4, max=261, avg= 9.50, stdev= 4.11
lat (usec): min=4, max=262, avg= 9.67, stdev= 4.12
clat percentiles (nsec):
| 1.00th=[ 5536], 5.00th=[ 5728], 10.00th=[ 5920], 20.00th=[ 6176],
| 30.00th=[ 7584], 40.00th=[ 8896], 50.00th=[ 9408], 60.00th=[ 9792],
| 70.00th=[10432], 80.00th=[11584], 90.00th=[12864], 95.00th=[14528],
| 99.00th=[20352], 99.50th=[23168], 99.90th=[28800], 99.95th=[42752],
| 99.99th=[60672]
bw ( KiB/s): min=16868, max=17258, per=100.00%, avg=17063.00, stdev=275.77, samples=2
iops : min= 7510, max= 7684, avg=7597.00, stdev=123.04, samples=2
lat (usec) : 10=64.21%, 20=34.68%, 50=1.08%, 100=0.02%, 500=0.01%
In the second block of output, which is shown below, the 99th percentile fsync/fdatasync durations must be less than 10 ms. In this example, each reported duration is less than 1 ms.
fsync/fdatasync/sync_file_range:
sync (usec): min=39, max=1174, avg=120.71, stdev=63.89
sync percentiles (usec):
| 1.00th=[ 42], 5.00th=[ 45], 10.00th=[ 46], 20.00th=[ 48],
| 30.00th=[ 52], 40.00th=[ 71], 50.00th=[ 153], 60.00th=[ 159],
| 70.00th=[ 167], 80.00th=[ 178], 90.00th=[ 192], 95.00th=[ 206],
| 99.00th=[ 229], 99.50th=[ 239], 99.90th=[ 355], 99.95th=[ 416],
| 99.99th=[ 445]
cpu : usr=2.95%, sys=29.93%, ctx=15663, majf=0, minf=35
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
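The test writes its data under /var/lib/test. After you record the results, you can optionally remove the test directory; the following command assumes that /var/lib/test is used only for this test:
# rm -rf /var/lib/test ↵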
Verify disk performance for NSP
Enter the following as the root user in the /opt/nsp directory to run a random read/write test that creates a data file named random_read_write.fio in the directory:
# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=50 ↵
The command produces output like the following:
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.7
Starting 1 process
test: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=22.1MiB/s,w=22.2MiB/s][r=5645,w=5674 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=32439: Mon Sep 21 10:25:11 2020
read: IOPS=6301, BW=24.6MiB/s (25.8MB/s)(2049MiB/83252msec)
bw ( KiB/s): min=13824, max=39088, per=99.57%, avg=25098.60, stdev=5316.27, samples=166
iops : min= 3456, max= 9772, avg=6274.49, stdev=1329.11, samples=166
write: IOPS=6293, BW=24.6MiB/s (25.8MB/s)(2047MiB/83252msec)
bw ( KiB/s): min=13464, max=40024, per=99.56%, avg=25062.73, stdev=5334.65, samples=166
iops : min= 3366, max=10006, avg=6265.57, stdev=1333.67, samples=166
cpu : usr=5.13%, sys=18.63%, ctx=202387, majf=0, minf=26
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=524625,523951,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=24.6MiB/s (25.8MB/s), 24.6MiB/s-24.6MiB/s (25.8MB/s-25.8MB/s), io=2049MiB (2149MB), run=83252-83252msec
WRITE: bw=24.6MiB/s (25.8MB/s), 24.6MiB/s-24.6MiB/s (25.8MB/s-25.8MB/s), io=2047MiB (2146MB), run=83252-83252msec
Disk stats (read/write):
vda: ios=523989/526042, merge=0/2218, ticks=3346204/1622070, in_queue=4658999, util=96.06%
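The test leaves a 4-Gbyte data file named random_read_write.fio in the /opt/nsp directory. After you record the results, you can optionally remove the file:
# rm -f /opt/nsp/random_read_write.fio ↵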