gNOI

gRPC Network Operations Interface (gNOI) defines a set of gRPC-based services for executing operational commands on network devices. The individual RPCs and messages that perform the gNOI operations required on the node are defined at the following location: https://github.com/openconfig/gnoi. This repository also stores the various per-service protos in subdirectories.

SR Linux supports the following gNOI services:

  • gNOI OS service
  • gNOI FactoryReset service
  • gNOI File service
  • gNOI System service
  • gNOI Healthz service

gNOI OS service

The gNOI OS service provides an interface to install an OS package on a target node. SR Linux supports the gNOI OS service on both the active and standby CPMs (referred to as supervisors in gNOI).

To perform the OS installation, the client progresses through the following three gNOI OS RPCs:

  • Install RPC
  • Activate RPC
  • Verify RPC

The protos used to define the OS service were pulled from the following hash: https://github.com/openconfig/gnoi/commit/93cdd9ae9f35d8b4bc1599d0a727b294faeca352.

Install RPC

The Install RPC transfers the OS package to the target node. The target node first attempts to copy the specified OS package between the CPMs before it accepts the transfer from the client.

To refer to the OS version, SR Linux uses the same string in gNOI as the one used with ZTP (version string-build number), for example: v22.11.1-010. The download folder for the OS is located at /var/run/srlinux/gnoi. To confirm that the transferred OS package is valid and bootable before installation, the platform performs a hash check against the md5sum embedded in the .bin file.

On a dual CPM node, only the active CPM runs the gNOI service. The Install RPC transfers the OS to the active CPM.
Note: SR Linux does not support the standby_supervisor option. On a dual CPM node, the transferred image is synced automatically to the standby CPM using ZTP.

One Install RPC is required for each CPM. Concurrent Install RPCs are not allowed on the same target node.

Install RPC structure

rpc Install(stream InstallRequest) returns (stream InstallResponse);
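
The following Go sketch shows one way a client might drive the Install RPC. It assumes the Go bindings generated from the openconfig/gnoi protos at the referenced hash, a TLS-encrypted gRPC connection (conn) to the target (a connection sketch appears in the gNOI configuration section), and an OS package on the local disk; the function name, chunk size, and paths are illustrative.

import (
    "context"
    "fmt"
    "io"
    "os"

    ospb "github.com/openconfig/gnoi/os"
    "google.golang.org/grpc"
)

// installOS transfers an OS package and waits for the target to validate it.
func installOS(ctx context.Context, conn *grpc.ClientConn, version, imagePath string) error {
    stream, err := ospb.NewOSClient(conn).Install(ctx)
    if err != nil {
        return err
    }
    // Announce the version to be transferred, for example "v22.11.1-010".
    if err := stream.Send(&ospb.InstallRequest{Request: &ospb.InstallRequest_TransferRequest{
        TransferRequest: &ospb.TransferRequest{Version: version}}}); err != nil {
        return err
    }
    resp, err := stream.Recv()
    if err != nil {
        return err
    }
    if resp.GetValidated() != nil {
        return nil // the target already has this version
    }
    // Otherwise the target is expected to answer TransferReady; stream the package.
    f, err := os.Open(imagePath)
    if err != nil {
        return err
    }
    defer f.Close()
    buf := make([]byte, 64*1024)
    for {
        n, rerr := f.Read(buf)
        if n > 0 {
            if err := stream.Send(&ospb.InstallRequest{
                Request: &ospb.InstallRequest_TransferContent{TransferContent: buf[:n]}}); err != nil {
                return err
            }
        }
        if rerr == io.EOF {
            break
        }
        if rerr != nil {
            return rerr
        }
    }
    // End the transfer, then wait for Validated (progress updates may arrive first).
    if err := stream.Send(&ospb.InstallRequest{
        Request: &ospb.InstallRequest_TransferEnd{TransferEnd: &ospb.TransferEnd{}}}); err != nil {
        return err
    }
    for {
        resp, err := stream.Recv()
        if err != nil {
            return err
        }
        if v := resp.GetValidated(); v != nil {
            fmt.Printf("validated version %s\n", v.GetVersion())
            return nil
        }
        if e := resp.GetInstallError(); e != nil {
            return fmt.Errorf("install error: %v", e)
        }
    }
}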

Activate RPC

The Activate RPC sets the requested OS version for the target node to use at the next reboot. It also reboots the target node if the no_reboot flag is not set.

Note: If the requested image fails to boot, SR Linux cannot attempt to boot a secondary image. In this case, the system can revert to the rescue image.
On a dual CPM node, if you perform this RPC on the active CPM, it triggers a switchover to the standby CPM before rebooting the previously active CPM.

Activate RPC structure

rpc Activate(ActivateRequest) returns (ActivateResponse);
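
A minimal Go sketch of the Activate call, under the same assumptions as the Install sketch (generated gnoi bindings, an established connection conn); the choice to set no_reboot is illustrative.

import (
    "context"

    ospb "github.com/openconfig/gnoi/os"
    "google.golang.org/grpc"
)

// activateOS selects the version to boot at the next reboot. With NoReboot
// set to false (or omitted), the target reboots as part of the RPC.
func activateOS(ctx context.Context, conn *grpc.ClientConn, version string) error {
    _, err := ospb.NewOSClient(conn).Activate(ctx, &ospb.ActivateRequest{
        Version:  version, // for example "v22.11.1-010"
        NoReboot: true,
    })
    return err
}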

Verify RPC

The Verify RPC checks the OS version running on the target node. The client can call this RPC multiple times while the target node boots until the activation is successful.

Note: The activation_fail_message is not supported because if the target node does not boot, it remains in a failure state and does not revert to a previous version of the OS.

Verify RPC structure

rpc Verify(VerifyRequest) returns (VerifyResponse);
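
A Go sketch that polls Verify until the expected version is reported, again assuming the generated gnoi bindings and an established connection; the poll interval is arbitrary.

import (
    "context"
    "time"

    ospb "github.com/openconfig/gnoi/os"
    "google.golang.org/grpc"
)

// waitForVersion calls Verify repeatedly while the target boots.
func waitForVersion(ctx context.Context, conn *grpc.ClientConn, want string) error {
    c := ospb.NewOSClient(conn)
    for {
        // Verify can fail transiently while the target is rebooting.
        if resp, err := c.Verify(ctx, &ospb.VerifyRequest{}); err == nil && resp.GetVersion() == want {
            return nil
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(10 * time.Second):
        }
    }
}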

gNOI FactoryReset service

The FactoryReset service enables gNOI clients to reset a target node to boot using a golden image and to optionally format persistent storage.

One of the practical applications of this service is the ability to factory reset a device before performing Zero Touch Provisioning (ZTP).

SR Linux supports the following gNOI FactoryReset RPC:
  • Start RPC

The protos used to define the FactoryReset service were pulled from version v0.1.0.

Start RPC

The Start RPC allows the client to instruct the target node to immediately clean all existing state data (including storage, configuration, logs, certificates, and licenses) and boot using the OS image configured as the golden image. The golden image is the image that the device resets to when a factory reset is performed. You can set the golden image using the tools system boot golden-image command. To view the available images to select from, use the tools system boot available-images command. If the golden image is not set, the system boots using the current OS image.

The Start RPC supports optional flags to:

  • roll back to the OS configured as the golden image
  • zero-fill any state data saved in persistent storage

If the golden image is configured and the factory_os flag is set to true, SR Linux resets to the golden image. If the factory_os flag is omitted or set to false, SR Linux boots using the current running image, but all existing state data is cleaned.

If any optional flags are set but not supported, the target node returns a gRPC Status message with code INVALID_ARGUMENT with the details value set to the appropriate ResetError message.

Start RPC structure

rpc Start(StartRequest) returns (StartResponse);
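
The following Go sketch issues a Start request that rolls back to the golden image and zero-fills persistent storage. It assumes the generated gnoi bindings and an established connection; field usage follows the factory_reset proto at the referenced version.

import (
    "context"
    "fmt"

    frpb "github.com/openconfig/gnoi/factory_reset"
    "google.golang.org/grpc"
)

// factoryReset requests an immediate reset to the golden image with storage zero-fill.
func factoryReset(ctx context.Context, conn *grpc.ClientConn) error {
    resp, err := frpb.NewFactoryResetClient(conn).Start(ctx, &frpb.StartRequest{
        FactoryOs: true, // boot the image configured as the golden image
        ZeroFill:  true, // zero-fill state data saved in persistent storage
    })
    if err != nil {
        return err
    }
    if resetErr := resp.GetResetError(); resetErr != nil {
        return fmt.Errorf("factory reset rejected: %v", resetErr)
    }
    return nil
}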

gNOI File service

The gNOI File service allows the client to transfer files to and from the target node. The main use for this service is extracting debugging information through the transfer of system logs and core files.

SR Linux supports the following gNOI File RPCs:

  • Get RPC
  • Put RPC
  • Stat RPC
  • Remove RPC

Note: The TransferToRemote RPC is not supported.

The protos used to define the gNOI File service were pulled from the following hash: https://github.com/openconfig/gnoi/commit/93cdd9ae9f35d8b4bc1599d0a727b294faeca352.

Get RPC

The Get RPC reads and streams the contents of a file from a target node to the client using sequential messages, and sends a final message containing the hash of the streamed data before closing the stream.

The target node returns an error if:

  • An error occurs while reading the file.
  • The file does not exist.

Get RPC structure

rpc Get(GetRequest) returns (stream GetResponse) {}
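
A Go sketch that retrieves a remote file and writes it locally, under the same assumptions as the earlier sketches; the remote path is illustrative and the final hash message is noted but not verified here.

import (
    "context"
    "io"
    "os"

    fpb "github.com/openconfig/gnoi/file"
    "google.golang.org/grpc"
)

// getFile streams a file from the target, for example a log file, to local disk.
func getFile(ctx context.Context, conn *grpc.ClientConn, remote, local string) error {
    stream, err := fpb.NewFileClient(conn).Get(ctx, &fpb.GetRequest{RemoteFile: remote})
    if err != nil {
        return err
    }
    out, err := os.Create(local)
    if err != nil {
        return err
    }
    defer out.Close()
    for {
        resp, err := stream.Recv()
        if err == io.EOF {
            return nil
        }
        if err != nil {
            return err
        }
        // The last message carries the hash of the streamed data instead of
        // contents; a complete client should verify it against the written file.
        if _, err := out.Write(resp.GetContents()); err != nil {
            return err
        }
    }
}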

Put RPC

The Put RPC streams data to the target node and writes the data to a file. The client streams the file using sequential messages. The initial message contains information about the filename and permissions. The final message includes the hash of the streamed data.

The target node returns an error if:

  • An error occurs while writing the data.
  • The location does not exist.

Put RPC structure

rpc Put(stream PutRequest) returns (PutResponse) {}
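
A Go sketch of the Put flow (open, contents, hash), assuming the generated gnoi bindings and an established connection; the single-chunk transfer, permissions, and MD5 hash method are illustrative choices.

import (
    "context"
    "crypto/md5"
    "os"

    fpb "github.com/openconfig/gnoi/file"
    tpb "github.com/openconfig/gnoi/types"
    "google.golang.org/grpc"
)

// putFile uploads a local file to the target in one contents chunk.
func putFile(ctx context.Context, conn *grpc.ClientConn, local, remote string) error {
    data, err := os.ReadFile(local)
    if err != nil {
        return err
    }
    stream, err := fpb.NewFileClient(conn).Put(ctx)
    if err != nil {
        return err
    }
    // Initial message: remote file name and permissions.
    if err := stream.Send(&fpb.PutRequest{Request: &fpb.PutRequest_Open{
        Open: &fpb.PutRequest_Details{RemoteFile: remote, Permissions: 0644}}}); err != nil {
        return err
    }
    // File contents (large files should be split into several messages).
    if err := stream.Send(&fpb.PutRequest{
        Request: &fpb.PutRequest_Contents{Contents: data}}); err != nil {
        return err
    }
    // Final message: hash of the streamed data.
    sum := md5.Sum(data)
    if err := stream.Send(&fpb.PutRequest{Request: &fpb.PutRequest_Hash{
        Hash: &tpb.HashType{Method: tpb.HashType_MD5, Hash: sum[:]}}}); err != nil {
        return err
    }
    _, err = stream.CloseAndRecv()
    return err
}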

Stat RPC

The Stat RPC returns metadata about files on the target node.

If the path specified in the StatRequest references a directory, the StatResponse returns the metadata for all files and folders, including the parent directory. If the path references a direct path to a file, the StatResponse returns metadata for the specified file only.

The target node returns an error if:
  • The file does not exist.
  • An error occurs while accessing the metadata.

Stat RPC structure

rpc Stat(StatRequest) returns (StatResponse) {}

Remove RPC

The Remove RPC removes the specified file from the target node.

The target node returns an error if:

  • An error occurs during the remove operation (for example, permission denied).
  • The file does not exist.
  • The path references a directory instead of a file.

Remove RPC structure

rpc Remove(RemoveRequest) returns (RemoveResponse) {}

gNOI System service

The gNOI System service defines an interface that allows a client to perform operational tasks on target network nodes. SR Linux supports the following gNOI System RPCs:

  • Ping RPC
  • Traceroute RPC
  • Time RPC
  • SwitchControlProcessor RPC
  • Reboot RPC
  • CancelReboot RPC
  • RebootStatus RPC
  • KillProcess RPC

The protos used to define the gNOI System service were pulled from the following hash: https://github.com/openconfig/gnoi/commit/93cdd9ae9f35d8b4bc1599d0a727b294faeca352.

Ping RPC

The Ping RPC allows the client to execute the ping command on the target node. The target node streams the results back to the client. Some targets do not stream any results until all results have been collected. If the RPC does not specify a packet count, the ping operation uses a default of five packets.

Note:
  • The Ping RPC does not currently support specification of a network-instance. The ping is executed in the network-instance where the gNMI server is running.
  • SR Linux does not support setting the interval field in the PingRequest to -1 (flood ping).

Ping RPC structure

rpc Ping(PingRequest) returns (stream PingResponse) {}
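
A Go sketch that runs a ping and prints each streamed response, assuming the generated gnoi bindings and an established connection; the destination and packet count are illustrative.

import (
    "context"
    "fmt"
    "io"

    spb "github.com/openconfig/gnoi/system"
    "google.golang.org/grpc"
)

// ping executes a ping on the target and reads the streamed responses.
func ping(ctx context.Context, conn *grpc.ClientConn, destination string) error {
    stream, err := spb.NewSystemClient(conn).Ping(ctx, &spb.PingRequest{
        Destination: destination,
        Count:       4, // omit to use the default of five packets
    })
    if err != nil {
        return err
    }
    for {
        resp, err := stream.Recv()
        if err == io.EOF {
            return nil
        }
        if err != nil {
            return err
        }
        fmt.Println(resp)
    }
}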

Traceroute RPC

The Traceroute RPC allows the client to execute the traceroute command on the target node. The target node streams the results back to the client. Some targets do not stream any results until all results have been collected. If the RPC does not specify a hop count, the traceroute operation uses a default of 30 hops.

Note:
  • The Traceroute RPC does not currently support specification of a network-instance. The traceroute is executed in the network-instance where the gRPC server is running.
  • In the TracerouteRequest, SR Linux does not support the TCP and UDP enum values for the l4protocol field. Only ICMP is supported.
  • In the TracerouteResponse, SR Linux does not support the mpls and as_path fields.

Traceroute RPC structure

rpc Traceroute(TracerouteRequest) returns (stream TracerouteResponse) {}

Time RPC

The Time RPC returns the current time on the target node. It is typically used to test whether the target is currently responding.

Time RPC structure

rpc Time(TimeRequest) returns (TimeResponse) {}

SwitchControlProcessor RPC

The SwitchControlProcessor RPC switches the active control processing module (CPM) on the target node to the control slot (A or B) that is specified in the request message.

SwitchControlProcessor RPC structure

rpc SwitchControlProcessor(SwitchControlProcessorRequest)
    returns (SwitchControlProcessorResponse) {}

Reboot RPC

The Reboot RPC allows the client to reboot a target node, either immediately or at some time in the future. It triggers the reboot of the entire chassis. It also supports specification of a reboot method (for example, cold or warm reboot); however, if the target node does not support the specified reboot method, the Reboot RPC fails.

Note: SR Linux supports only the cold reboot method, and does not support rebooting of subcomponents.

If a reboot is pending on the active control processor, the service rejects all other reboot requests.

Reboot RPC structure

rpc Reboot(RebootRequest) returns (RebootResponse) {}
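
A Go sketch requesting an immediate cold reboot of the chassis, under the same assumptions as the other System sketches; the message text is illustrative.

import (
    "context"

    spb "github.com/openconfig/gnoi/system"
    "google.golang.org/grpc"
)

// rebootNow asks the target to cold reboot immediately.
func rebootNow(ctx context.Context, conn *grpc.ClientConn) error {
    _, err := spb.NewSystemClient(conn).Reboot(ctx, &spb.RebootRequest{
        Method:  spb.RebootMethod_COLD, // SR Linux supports only the cold method
        Delay:   0,                     // delay in nanoseconds; 0 reboots immediately
        Message: "planned maintenance",
    })
    return err
}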

CancelReboot RPC

The CancelReboot RPC allows the client to cancel any pending reboot requests on the target node.

Note: SR Linux does not support canceling a reboot for a subcomponent.

CancelReboot RPC structure

rpc CancelReboot(CancelRebootRequest) returns (CancelRebootResponse) {}

RebootStatus RPC

The RebootStatus RPC allows the client to query the status of a reboot on the target node.

Note: SR Linux does not support querying on a single component at a time.

RebootStatus RPC structure

rpc RebootStatus(RebootStatusRequest) returns (RebootStatusResponse) {}

KillProcess RPC

The KillProcess RPC allows a client to kill an OS process and optionally restart it on the target node.

To specify the process to kill, the name field in the RPC must match the application name used in the tools system app-management application <name> command.

Mapping of termination signals to SR Linux commands

The KillProcess RPC termination signals map to SR Linux commands as follows:

Table 1. Mapping of termination signals to SR Linux commands

Termination signal   SR Linux command   Command if restart field is true
SIGNAL_TERM          stop               restart (runs a warm or cold restart,
                                        based on what is supported)
SIGNAL_KILL          kill               restart cold
SIGNAL_HUP           reload             -

KillProcess RPC structure

rpc KillProcess(KillProcessRequest) returns (KillProcessResponse) {}
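
A Go sketch that gracefully restarts an application by name, assuming the generated gnoi bindings and an established connection; the application name is illustrative and must match the name used with tools system app-management application <name>.

import (
    "context"

    spb "github.com/openconfig/gnoi/system"
    "google.golang.org/grpc"
)

// restartApp sends SIGNAL_TERM with restart set, which maps to a stop followed
// by a restart of the named application (see Table 1).
func restartApp(ctx context.Context, conn *grpc.ClientConn, name string) error {
    _, err := spb.NewSystemClient(conn).KillProcess(ctx, &spb.KillProcessRequest{
        Name:    name,
        Signal:  spb.KillProcessRequest_SIGNAL_TERM,
        Restart: true,
    })
    return err
}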

gNOI Healthz service

To align with general design principles of distributed systems, the gNOI Healthz service allows system components to report their own health.

Debug commands can display details about the health and state of a component. The Healthz service exposes these interfaces as queryable endpoints. In doing so, it allows clients to validate the health of components and, if unhealthy, gather device-specific data to help triage or reproduce issues.

The Healthz service allows a client to initiate health checks on a target node using the Check RPC. Alternatively, the target node can self-initiate the check and report the results to the client.

A client can then use the List or Get RPC to retrieve health events associated with the affected component and subcomponents. These health events are included in ComponentStatus messages and can be helpful to debug or further root cause the reported fault.

As part of the event response, the List or Get RPC can identify specific artifacts associated with the event, which the client can then retrieve using the Artifact RPC. The client can also call the Acknowledge RPC to acknowledge the retrieval of an event (corresponding to a series of artifacts). By default, acknowledged events are no longer included in the list of events.

The SR Linux components that support some degree of Healthz are as follows (listed in native schema):

  • .platform.control{}
  • .platform.linecard{}
  • .platform.chassis
  • .platform.fan-tray{}
  • .platform.power-supply{}
  • .platform.fabric{}
  • .interface{}.transceiver

This includes all control, linecard, and fabric modules, along with power supplies and fans, individual transceivers and the chassis itself. Software components, such as routing protocol daemons, are not yet supported with the gNOI Healthz service.

SR Linux supports the following gNOI Healthz RPCs:

  • Check RPC
  • Get RPC
  • List RPC
  • Acknowledge RPC
  • Artifact RPC

SR Linux uses v1.3.0 of the gNOI Healthz service protos, pulled from the following hash: https://github.com/openconfig/gnoi/blob/4f5cb0885a26a52f9c30acc236d307192c665bd8/healthz/healthz.proto.

Collection of artifacts

The workflow for collecting gNOI Healthz artifacts is as follows:

  • When a system component becomes unhealthy, the system transmits health state information via telemetry that indicates the healthz/state/status of the component has transitioned to UNHEALTHY.
  • When the client observes the transition to UNHEALTHY, it can call the Get or List RPCs to collect the events that occurred on the component.
  • As the collection of some artifacts can be service impacting, all artifacts are not always automatically collected for an event. In this case, the client can call the Check RPC to collect the additional service impacting artifacts. This provides an opportunity to coordinate the collection of these artifacts when the operational risk of doing so is minimized (for example, by first removing traffic from the target node). To refer to a previously reported event, the Check RPC request must populate the event_id field with the ID reported in a prior Get or List response. After this Check RPC call, the client can call the Get or List RPCs to obtain the additional artifacts collected for the specified event.
  • If a component returns to a healthy status, the system sends updated telemetry information to ensure that the external clients are updated about the current health status, even if the clients make no additional Healthz calls to the system.

Healthz events persistence

Healthz events created for the components are written to disk and persist across restarts of the service or software components.

SR Linux saves Healthz events in the /etc/opt/srlinux/gnoi/healthz/events directory, and rotates the event files to prevent overflow of the partition size. The rotation limit is set to 10 MB for all events combined.

The creation and storage of Healthz events and artifacts can be interrupted by an unexpected CPM switchover. For this reason, defer any user-initiated CPM switchover while the system is still processing Healthz events and artifacts.

Healthz events are written to disk at one-minute intervals to mitigate high disk pressure from frequently changing events (for example, a flapping interface).

gNMI component paths

The Healthz service works in conjunction with telemetry streamed via gNMI. The system can stream OpenConfig or native YANG paths for a specific component when the component becomes unhealthy.

To maintain Healthz parameters, SR Linux includes a healthz container for each of the supported components. For example, the following container maintains Healthz data for the control module:
augment /srl-platform:platform/srl-platform-control:control:
    +--ro healthz
        +--ro status? enumeration
        +--ro last-unhealthy? srl-comm:date-and-time-delta
        +--ro unhealthy-count? srl-comm:zero-based-counter64

When the Healthz service references a gNMI path (gnoi.types.Path), it specifies the complete path to a component, for example: /components/component[name=FOO].

Check RPC

The Check RPC allows a client to execute a set of validations against a component. As with other Healthz operations, the component is specified using its gNMI path.

The Check RPC produces a Healthz ComponentStatus message, which contains a list of the artifacts generated from the validation process.

While the system can initiate health checks itself, these checks are limited to operations that do not impact the device functionality. Checks that are potentially service impacting require use of the Check RPC.

Note: Nokia recommends implementing command authorization to restrict these operations and prevent unauthorized Check RPCs from running during normal operations.

The CheckRequest message includes an optional event_id field. When populated, this field directs the system to perform the check for a prior event. In this case, the device collects artifacts that were not collected automatically when the event occurred (to prevent service impacts). The collected artifacts are returned in the artifact list for the event in subsequent Get or List RPC calls.

A CheckRequest for a previous event_id does not overwrite previous artifacts that were collected at the time of the event.

Check RPC structure

rpc Check(CheckRequest) returns (CheckResponse) {}
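
A Go sketch that runs a Check against a component and then lists its health events, assuming the generated gnoi bindings and an established connection. The component path follows the /components/component[name=...] form described above; the component name is illustrative.

import (
    "context"
    "fmt"

    hpb "github.com/openconfig/gnoi/healthz"
    tpb "github.com/openconfig/gnoi/types"
    "google.golang.org/grpc"
)

// checkComponent validates a component and retrieves its health events.
func checkComponent(ctx context.Context, conn *grpc.ClientConn, component string) error {
    path := &tpb.Path{Elem: []*tpb.PathElem{
        {Name: "components"},
        {Name: "component", Key: map[string]string{"name": component}},
    }}
    c := hpb.NewHealthzClient(conn)
    // To collect the remaining artifacts for a previously reported event,
    // populate EventId in the CheckRequest instead of running a fresh check.
    checkResp, err := c.Check(ctx, &hpb.CheckRequest{Path: path})
    if err != nil {
        return err
    }
    fmt.Println(checkResp)
    listResp, err := c.List(ctx, &hpb.ListRequest{Path: path})
    if err != nil {
        return err
    }
    fmt.Println(listResp)
    return nil
}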

Get RPC

After a health check, the client can use the Get (or List) RPC to retrieve the health events that are associated with a component.

The Get RPC retrieves the latest health event for the specified component. Each event consists of a collection of data that you can use to debug or root cause the fault. Unlike the List RPC, the Get RPC returns only the latest event.

The GetResponse returns a ComponentStatus message that corresponds to the latest health event for the component and each of its subcomponents. As a result, the Get RPC can return multiple ComponentStatus messages for a single component.

Each ComponentStatus message includes a set of ArtifactHeader messages that correspond to the health event, and provide identifiers and types for the artifacts returned by the system. All artifacts listed within the same ComponentStatus message share the same acknowledgement state and expiry time.

When a client invokes a Get RPC on a path, this action is not recorded as an event for this path and no health checks are performed.

Get RPC structure

rpc Get(GetRequest) returns (GetResponse) {}

List RPC

As an alternative to the Get RPC, the client can use the List RPC to retrieve not just the latest but all health events for the specified component and its subcomponents. Similar to the Get RPC, the List RPC also returns a series of ComponentStatus messages, which have the same semantics as those returned by the Get RPC.

By default, events that are already acknowledged are not returned.

List RPC structure

rpc List(ListRequest) returns (ListResponse) {}

Acknowledge RPC

A client can use the Acknowledge RPC to indicate to the target node that the client retrieved a particular (component, event) tuple. To ensure that Healthz artifact storage does not cause resource exhaustion, SR Linux can remove saved artifacts, starting with acknowledged artifacts first.

Acknowledge RPC structure

rpc Acknowledge(AcknowledgeRequest) returns (AcknowledgeResponse) {}

Artifact RPC

The Artifact RPC allows a client to retrieve specific artifacts that are related to an event that the target node reported in a prior List or Get response.

Because these artifacts can be large, the Artifact RPC is implemented as a server-side streaming RPC. The Artifact RPC ensures that a target node sends these potentially large artifacts only when the client explicitly requests them.

Artifacts can be core files, dumps of state (info from state on the specified component), or other log files. The collection of info from state artifacts results in the capture of any failure reasons from either the oper-reason or oper-down-reason fields.

The client can acknowledge a retrieved event corresponding to a series of artifacts. Acknowledged events are no longer returned in the list of events by default.

Events persist across restarts of the system or its hardware and software components, and they are removed only for resource management purposes. SR Linux can use the acknowledged status to remove artifacts that are no longer relevant and, if necessary, remove artifacts that are not yet acknowledged.

Artifact RPC structure

rpc Artifact(ArtifactRequest) returns (stream ArtifactResponse) {}
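
A Go sketch that downloads a single artifact reported in an earlier Get or List response, under the same assumptions as the other Healthz sketches; the artifact ID is taken from an ArtifactHeader and the local file name is illustrative.

import (
    "context"
    "io"
    "os"

    hpb "github.com/openconfig/gnoi/healthz"
    "google.golang.org/grpc"
)

// fetchArtifact streams one artifact from the target and saves it locally.
func fetchArtifact(ctx context.Context, conn *grpc.ClientConn, artifactID, local string) error {
    stream, err := hpb.NewHealthzClient(conn).Artifact(ctx, &hpb.ArtifactRequest{Id: artifactID})
    if err != nil {
        return err
    }
    out, err := os.Create(local)
    if err != nil {
        return err
    }
    defer out.Close()
    for {
        resp, err := stream.Recv()
        if err == io.EOF {
            return nil
        }
        if err != nil {
            return err
        }
        // Header and trailer messages frame the transfer; the raw content
        // arrives in the bytes field.
        if _, err := out.Write(resp.GetBytes()); err != nil {
            return err
        }
    }
}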

gNOI Healthz CLI commands

To allow the gNOI Healthz events to be cleared using the CLI or by a gNMI or JSON-RPC client, SR Linux supports the following CLI command:

tools system grpc-server testing gnoi healthz [<component>] clear

You can omit the component value to clear Healthz events for all components, or you can use one of the following parameters to clear events for the specified component only:

  • chassis: chassis component
  • control slot <id>: control module component
  • fabric slot <id>: fabric module component
  • fan-tray id <id>: fan component
  • linecard slot <id>: line card component
  • power-supply id <id>: power supply component
  • transceiver interface <name>: transceiver component

gNOI configuration

SR Linux supports gNOI services using the gRPC server configuration. To enable gNOI support, enable the gRPC server.

The session between the gNOI client and SR Linux must be encrypted using TLS.

See the "Management servers" chapter in the SR Linux Configuration Basics Guide for information about how to configure the gRPC server.

Configuring gNOI services

As part of the gRPC server configuration, you can also specify which individual gNOI services to enable.

To enable gNOI services, use the system grpc-server <network-instance> services command.

Enable gNOI services

# info system grpc-server mgmt

    system {
        grpc-server mgmt {
            admin-state enable
            timeout 7200
            rate-limit 1500
            session-limit 20
            metadata-authentication true
            yang-models native
            tls-profile tls-profile-1
            network-instance mgmt
            port 50052
            oper-state up
            services [
                gnoi.packet_link_qualification
            ]
            source-address [
                ::
            ]
            gnmi {
                commit-confirmed-timeout 0
                commit-save false
                include-defaults-in-config-only-responses false
            }
            unix-socket {
                admin-state disable
                socket-path ""
            }
        }
    }