Evaluating failed or slow workflow executions

Purpose

This article shows you how to evaluate failed or slow workflow executions and troubleshoot the source of workflow errors.

Parent workflows and sub-workflows

Workflows may call other workflows as part of their execution. Both the parent workflow and the sub-workflow appear in the Workflow Execution list.

If a workflow execution’s Executed by parameter is blank, the workflow was executed by another workflow, as shown in the following figure.

Note the Created field for these example executions. The LSO_7x50_Backup_MD_Mixed sub-workflow execution was created at the same time as the LSO_7x50_Backup workflow execution, so LSO_7x50_Backup may be the parent workflow that created this failed execution.

Start your troubleshooting with a parent workflow to ensure that you see all the relevant information.
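The matching described above can be sketched in Python. This is only an illustration of the reasoning, not an NSP feature: the records are typed in by hand from the Workflow Execution list, and the field names, the `admin` user, the timestamps, and the five-second tolerance are all assumptions. A matching Created time only suggests a parent; it does not prove one.

```python
from datetime import datetime, timedelta

# Hypothetical records copied by hand from the Workflow Execution list.
# Only the two LSO_7x50 workflow names come from the example above.
executions = [
    {"workflow": "LSO_7x50_Backup", "executed_by": "admin",
     "created": datetime(2024, 1, 1, 0, 0, 0)},
    {"workflow": "LSO_7x50_Backup_MD_Mixed", "executed_by": "",
     "created": datetime(2024, 1, 1, 0, 0, 1)},
    {"workflow": "Other_Workflow", "executed_by": "admin",
     "created": datetime(2024, 1, 1, 3, 0, 0)},
]


def candidate_parents(sub, all_executions, tolerance=timedelta(seconds=5)):
    """Return executions created shortly before (or at the same time as) sub."""
    return [
        e for e in all_executions
        if e is not sub
        and timedelta(0) <= sub["created"] - e["created"] <= tolerance
    ]


for execution in executions:
    # A blank Executed by value means another workflow started this execution.
    if not execution["executed_by"]:
        parents = candidate_parents(execution, executions)
        print(execution["workflow"], "<-", [p["workflow"] for p in parents])
```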

Check the information page of a successful execution

If you have an example of a successful execution of the workflow, it can help you narrow your search for the source of issues with slow or failing workflows.


1	Double-click on a successful workflow execution. The Execution info page displays.
2	Choose Flow from the Info drop-down. The Flow diagram shows the sequence of tasks performed when the workflow was executed. Hover over the icons on each task for more information on the task type. In this example, `decideMode` is a message action, and `runBackupOnMdNode` is a sub-workflow.
3	Expand the panel at the right of the screen for further details. Click Run Time to see the run time for each task. Select a task and click Action Executions to see the list of actions executed by the task. In this example, `runBackupOnMdNode` accounts for most of the workflow’s run time, at 16 seconds. This gives you a baseline of expected behavior to compare against an execution that may be running slowly. The list of actions also shows whether the workflow is transferring files, calling APIs, or communicating with other applications. Problems with any of these could be the cause of a slowdown or failure.
4	Double-click on a sub-workflow task to see the execution status. Double-click on the execution status to open the info page for the sub-workflow execution in a new tab, and investigate the actions and tasks executed by the sub-workflow.

End of steps
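Once you have run times from a successful execution, you can compare them against a slow execution. The following is a minimal sketch, assuming you have copied the per-task run times from the Run Time panel into two dictionaries. Only the 16-second value for `runBackupOnMdNode` comes from the example above; the other numbers, the `slower_tasks` helper, and the factor of 2 are illustrative assumptions.

```python
def slower_tasks(baseline, observed, factor=2.0):
    """Return tasks whose observed run time exceeds factor x the baseline."""
    flagged = {}
    for task, expected in baseline.items():
        actual = observed.get(task)
        if actual is not None and actual > factor * expected:
            flagged[task] = (expected, actual)
    return flagged


# Run times in seconds. The baseline is the successful execution;
# the observed values are hypothetical numbers for a slow execution.
baseline = {"decideMode": 1.0, "runBackupOnMdNode": 16.0}
observed = {"decideMode": 1.2, "runBackupOnMdNode": 95.0}

print(slower_tasks(baseline, observed))
```

A task that is flagged here is a reasonable place to start looking at Action Executions in the slow execution.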

Check the process of a slow workflow


1	Double-click on a workflow execution, and select Tasks from the Info drop-down. The Tasks list shows the time stamps when each task was created, and each task’s run time. The Created column shows the time since the task was created. Hover over a time in the Created column to see the precise time of creation.
2	Check for delays in the sequence of tasks. For example, if one task was created at midnight and had a run time of 2 seconds, the following task should be created at 2 seconds after midnight.
3	If there are delays, NSP may be experiencing slowness due to memory usage. When database usage is high, it takes longer than expected to query the database for the next action. If you are experiencing these delays, your cleanup policy may need to be adjusted. For more information, see the NSP Network Automation Guide.

End of steps
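The delay check in these steps is simple arithmetic: a task's Created time plus its run time gives the time at which the next task would be expected to start. The sketch below applies that rule to hand-copied values from the Tasks list. The task names, timestamps, and one-second threshold are hypothetical, and the sketch assumes the tasks run strictly one after another; tasks that run in parallel would need different handling.

```python
from datetime import datetime, timedelta


def find_gaps(tasks, threshold=timedelta(seconds=1)):
    """Return (previous, next, gap) for tasks created later than expected."""
    ordered = sorted(tasks, key=lambda t: t["created"])
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        # The next task is expected when the previous one finishes.
        expected = prev["created"] + prev["run_time"]
        gap = nxt["created"] - expected
        if gap > threshold:
            gaps.append((prev["name"], nxt["name"], gap))
    return gaps


# Hypothetical values copied from the Tasks list of one execution.
tasks = [
    {"name": "task_a", "created": datetime(2024, 1, 1, 0, 0, 0),
     "run_time": timedelta(seconds=2)},
    {"name": "task_b", "created": datetime(2024, 1, 1, 0, 0, 2),
     "run_time": timedelta(seconds=3)},
    {"name": "task_c", "created": datetime(2024, 1, 1, 0, 0, 20),
     "run_time": timedelta(seconds=1)},
]

for prev_name, next_name, gap in find_gaps(tasks):
    print(f"{next_name} started {gap.total_seconds():.0f} s after {prev_name} finished")
```

In this made-up data, `task_b` starts exactly when `task_a` finishes, while `task_c` starts 15 seconds after `task_b` finishes, which is the kind of delay the steps describe.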

Check concurrency

If a workflow is experiencing slowdowns or API errors, check for tasks with loops and ensure the concurrency is set correctly. If NSP is creating too many actions at one time, the workflow database could be impacted, causing slowdowns, or APIs could be overwhelmed with too many simultaneous calls.


1	From the Workflows page, double-click on a workflow to open the Info page. Choose Definition from the Info drop-down.
2	In the YAML panel, search for a task with a `with-items` statement. The `with-items` statement determines the number of times the task will initiate the action.
3	Verify that any task with a `with-items` statement also has a `concurrency` property set. The `concurrency` property sets a limit on the number of times the task will create the action concurrently. For example, if the `concurrency` is set to 1, the task will create the action once and wait for it to complete before executing it again.
4	Ensure that the concurrency value and the size of the `with-items` loop are appropriate for the task. For example, if the task is an API call, ensure that the concurrency is low enough to prevent the resource from being overwhelmed with simultaneous API calls. Remember that this workflow may not be the only entity calling the API at the time you execute it.

End of steps
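For a large definition, the check in these steps can be automated. The sketch below assumes the YAML definition has already been parsed into a Python dictionary, for example with a YAML library; the parsing step is not shown. The workflow name, task names, and the concurrency value of 5 are hypothetical, and the layout follows the usual Mistral v2 structure of a workflow containing a `tasks` mapping.

```python
def tasks_missing_concurrency(definition):
    """Return names of tasks that use with-items but set no concurrency."""
    missing = []
    for wf_name, wf_body in definition.items():
        if not isinstance(wf_body, dict):
            continue  # skip top-level scalars such as the version key
        for task_name, task in wf_body.get("tasks", {}).items():
            if "with-items" in task and "concurrency" not in task:
                missing.append(f"{wf_name}.{task_name}")
    return missing


# Hypothetical definition, shown as the dictionary a YAML parser would return.
definition = {
    "version": "2.0",
    "example_workflow": {
        "tasks": {
            "backupNodes": {
                "with-items": "node in <% $.nodes %>",
                "action": "std.echo output=<% $.node %>",
                "concurrency": 5,
            },
            "notifyNodes": {
                "with-items": "node in <% $.nodes %>",
                "action": "std.echo output=<% $.node %>",
            },
        }
    },
}

print(tasks_missing_concurrency(definition))
```

The sketch only reports whether `concurrency` is present. Whether a given value is appropriate still depends on the resource being called, as described in step 4.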

Check task output

If a workflow is experiencing slowdowns or getting stuck in a Running state, resource usage may be impacted by large amounts of output.


1	From the Workflows page, double-click on a workflow to open the Info page. Choose Definition from the Info drop-down.
2	In the YAML panel, search for the `output` statements for the tasks and the workflow itself, as applicable.
3	Ensure that output for all tasks is as minimal as possible. This minimizes the database usage of each task execution and helps to prevent resource overload. For more information, see the Mistral documentation and the Best Practices section in the Network Automation tutorial on the Network Developer Portal.

End of steps
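To decide which tasks to trim first, it can help to rank task outputs by size. The sketch below is one rough way to do that, assuming you have copied each task's output from an execution into a dictionary. The `collectData` task and all of the data are hypothetical, and the JSON-serialized size is only an approximation of what is actually stored in the database.

```python
import json


def output_sizes(task_outputs):
    """Return (task, JSON size in bytes) pairs, largest output first."""
    sizes = {
        name: len(json.dumps(output).encode("utf-8"))
        for name, output in task_outputs.items()
    }
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)


# Hypothetical outputs copied from a workflow execution.
task_outputs = {
    "decideMode": {"mode": "MD"},
    "collectData": {"nodes": [f"node-{i}" for i in range(500)]},
}

for name, size in output_sizes(task_outputs):
    print(f"{name}: {size} bytes")
```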

Heartbeat errors

A Heartbeat not received error occurs when a workflow attempts to contact an API or NSP and no response is received within ten minutes. Check logs to verify the source of the error.

Check the logs for the Mistral executor to see whether responses were received from the other entity.

Check logs for the RabbitMQ messaging bus to see whether there was an interruption to monitoring, which may have caused Mistral to miss a response.

End of steps