Connection monitoring and error recovery
Overview
A JMS client must monitor and, if required, recover from the following:
- loss of connectivity
- loss of events
- removal of a subscription
The scenarios can occur separately or in conjunction. If a client has a durable subscription, a loss of connectivity may occur without causing events to be lost. Event loss can occur under heavy load without a connectivity loss; the client is notified using a JmsMissedEvents message. A subscription removal can accompany a connectivity loss, for example, after a main server activity switch or restart.
How a client manages each scenario depends on the specific needs of the OSS application and the expectations of its customers. For example, for some applications, a loss of connectivity should always result in the OSS application reconnecting to the NFM-P. In other situations, the desired behavior may depend on the cause of the connectivity loss; for example, if an NFM-P administrator intentionally disconnects an OSS, an immediate reconnection may not be desirable.
How to recover from lost events also depends on the OSS. If an OSS uses JMS events to maintain a local information store that mirrors some data set in the NFM-P—such as a network inventory or a list of current alarms—then a loss of events results in the OSS being out of sync with the NFM-P.
The following scenarios could be used to determine whether the OSS needs to resync the database of inventory and alarm information:
- If a durable OSS client disconnects but remains subscribed to the NFM-P, and on reconnection no JmsMissedEvents message is received and the sysStartTime is unchanged, a resync of inventory and alarm information is not necessary, because all events were queued on the NFM-P to await the OSS reconnection.
- If a disconnected durable subscription is removed from an NFM-P main server, a JmsMissedEvents message is generated when the client reconnects using the same client ID. This indicates to the OSS that events have been missed and that a resync of inventory and alarm information is required.
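The two scenarios above can be reduced to a single check on reconnection. The following is a minimal sketch, not an NFM-P API: the class and method names are hypothetical, and it assumes that the OSS has already recorded the sysStartTime before the disconnect and read it again after reconnecting. How the value is retrieved and its representation as a long are assumptions.

```java
/** Hypothetical helper that decides whether a reconnecting durable client must resync. */
final class ResyncDecision {
    private ResyncDecision() {}

    /**
     * @param missedEventsReceived  true if a JmsMissedEvents message arrived after reconnecting
     * @param lastKnownSysStartTime sysStartTime recorded before the disconnect (assumed numeric)
     * @param currentSysStartTime   sysStartTime observed after reconnecting
     * @return true if inventory and alarm information should be resynchronized
     */
    static boolean needsResync(boolean missedEventsReceived,
                               long lastKnownSysStartTime,
                               long currentSysStartTime) {
        // A changed sysStartTime is treated as a sign that queued events cannot be relied on.
        boolean sysStartTimeChanged = lastKnownSysStartTime != currentSysStartTime;
        return missedEventsReceived || sysStartTimeChanged;
    }
}
```

A resync is skipped only when both conditions from the first scenario hold: no JmsMissedEvents message and an unchanged sysStartTime.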
When an OSS detects a lack of synchronization, the recovery scenarios include:
- immediately resynchronizing with the NFM-P, such as by retrieving inventory information or the latest alarm list through the XML API
- notifying the OSS administrator of the problem, and allowing the administrator to select a recovery approach, for example, an immediate resync or a scheduled resync for a later time
When you implement a recovery procedure, consider that events continue to occur in the network while an OSS is busy resynchronizing or populating its database for the first time. For this reason, an OSS must process events while it is resynchronizing, and may need to recover from error conditions that require it to reconnect or restart the resync.
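One possible way to keep processing events during a resync is to hold incoming events in memory and apply them, in arrival order, after the resync completes. The sketch below illustrates that approach only; the class name, the use of String as the event type, and the decision to buffer rather than apply events immediately are all assumptions, not NFM-P requirements.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

/** Hypothetical processor that defers events received while a resync is running. */
final class ResyncAwareProcessor {
    private final Consumer<String> applyEvent;           // updates the local information store
    private final Queue<String> pending = new ArrayDeque<>();
    private boolean resyncInProgress;

    ResyncAwareProcessor(Consumer<String> applyEvent) {
        this.applyEvent = applyEvent;
    }

    /** Call before starting to retrieve inventory or alarms through the XML API. */
    synchronized void beginResync() {
        resyncInProgress = true;
    }

    /** Call for every received event; events are held back while a resync is running. */
    synchronized void onEvent(String event) {
        if (resyncInProgress) {
            pending.add(event);
        } else {
            applyEvent.accept(event);
        }
    }

    /** Call when the resync has finished; replays held events in arrival order. */
    synchronized void endResync() {
        resyncInProgress = false;
        String event;
        while ((event = pending.poll()) != null) {
            applyEvent.accept(event);
        }
    }
}
```

The buffer in this sketch is unbounded; a production client would need a size limit and a policy for restarting the resync when the limit is reached.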
JMS exceptions
You must handle JMS exceptions if you have a JMS connection to an NFM-P server. JMS internally monitors client connections to the JMS server and throws an exception if the connection between the JMS server and a client is lost. To monitor exceptions, implement the javax.jms.ExceptionListener interface on the JMS client. The interface contains a call-back method, onException, that is invoked for all exceptions. A typical implementation of the call-back method attempts to reconnect and possibly takes another action, such as generating an event.
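The reconnection logic that a call-back method might delegate to can be sketched without any JMS dependency. The example below is an illustration under assumptions: the class name, the fixed retry delay, and the attempt limit are hypothetical, and the connectAttempt supplier stands in for whatever code the OSS uses to re-create its JMS connection and subscription. In a real client, reconnect would typically be called from onException(JMSException), preferably on a separate thread.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical bounded-retry policy for re-establishing a lost JMS connection. */
final class ReconnectPolicy {
    private final int maxAttempts;
    private final long delayMillis;

    ReconnectPolicy(int maxAttempts, long delayMillis) {
        if (maxAttempts < 1 || delayMillis < 0) {
            throw new IllegalArgumentException("maxAttempts must be >= 1 and delayMillis >= 0");
        }
        this.maxAttempts = maxAttempts;
        this.delayMillis = delayMillis;
    }

    /**
     * Runs connectAttempt until it returns true or the attempt limit is reached.
     *
     * @return the number of the attempt that succeeded, or -1 if none succeeded
     */
    int reconnect(BooleanSupplier connectAttempt) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (connectAttempt.getAsBoolean()) {
                return attempt;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);            // wait before the next attempt
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();   // preserve the interrupt and give up
                    return -1;
                }
            }
        }
        return -1;
    }
}
```

A return value of -1 is the point at which the OSS could take the other action mentioned above, such as generating an event for the administrator.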
Monitoring for incoming events
In addition to handling exceptions, an OSS must monitor incoming events to ensure that the JMS connection is active. A KeepAliveEvent is published at approximately 30-second intervals to each of the JMS topics even when no other messages are sent.
KeepAliveEvents may be received ahead of other event types.
If no events, including KeepAliveEvents, are received within a reasonable time period, for example several consecutive keep-alive intervals, you must investigate the status of the NFM-P server.
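A simple way to detect a silent connection is to record the arrival time of every event and compare it with a timeout. The following sketch assumes a timeout chosen by the OSS; the 90-second value in the usage note (three keep-alive intervals) is an illustrative assumption, not an NFM-P recommendation, and the class name is hypothetical. The clock is injected so that the logic can be exercised without waiting.

```java
import java.util.function.LongSupplier;

/** Hypothetical watchdog that flags a JMS connection when no event arrives in time. */
final class KeepAliveWatchdog {
    private final LongSupplier clockMillis;   // source of the current time in milliseconds
    private final long timeoutMillis;         // maximum tolerated silence
    private volatile long lastEventMillis;

    KeepAliveWatchdog(LongSupplier clockMillis, long timeoutMillis) {
        this.clockMillis = clockMillis;
        this.timeoutMillis = timeoutMillis;
        this.lastEventMillis = clockMillis.getAsLong();
    }

    /** Call for every received event, including each KeepAliveEvent. */
    void onEventReceived() {
        lastEventMillis = clockMillis.getAsLong();
    }

    /** Returns true when the silence has lasted longer than the timeout. */
    boolean isConnectionSuspect() {
        return clockMillis.getAsLong() - lastEventMillis > timeoutMillis;
    }
}
```

For example, new KeepAliveWatchdog(System::currentTimeMillis, 90_000L) could be polled periodically by a scheduled task, which then triggers the investigation described above.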
XML API session termination
You can use the NFM-P GUI client to close and remove a durable subscription when you no longer require a durable client. See the procedure to disconnect an XML API JMS client connection or remove a durable subscription in the NSP System Administrator Guide for more information about removing durable subscriptions.
The XML API uses the following JMS event to notify OSS client applications of a session termination:
TerminateClientSession
The TerminateClientSession event indicates that a client JMS session is about to be closed. When the message is received, the client must clean up the disconnected session. Any additional session termination behavior depends on the requirements of your OSS client application; for example, you can configure the OSS client application to close.
Missed events
Both durable and non-durable subscriptions can be used by OSS clients that cannot tolerate losing events without taking corrective action. However, in certain situations, events can be missed and subscribers are notified with a JmsMissedEvents message.
JMS messages are queued by the NFM-P until they are acknowledged by the JMS client. If the JMS message queue overflows, the following occurs for both durable and non-durable JMS subscriptions:
- Connected non-durable subscribers
  - A JmsMissedEvents message is added to the queue. See Figure 4-18, JmsMissedEvents message example for an example.
- Non-connected non-durable subscribers
- Non-connected durable subscribers
The following are example scenarios in which messages may be missed:
- The OSS client is slow, the message rate is high, or there is high network latency.
- A client with an active subscription disconnects, resulting in messages being queued until the client reconnects.
The following table lists and describes the alarms that are raised against JMS clients.
Table 4-8: JMS client alarms
Alarm | Description |
---|---|
JMSDurableClientReset | Raised when a durable JMS client is reset as the result of a JMS server restart or activity switch |
JMSClientMessagesRemoved | Raised when a JMS client has messages removed after exceeding the configured message limit. This applies to OSS durable subscribers only. |
JMSDurableClientUnsubscribed | Raised when a durable JMS client is automatically unsubscribed. This occurs when a disconnected durable client exceeds the configured message limit. |
JmsMissedEvents message
The NFM-P sends a JmsMissedEvents message to indicate that events have been lost, allowing subscribers to detect missed events. The JmsMissedEvents message is a StateChangeEvent, as shown in the following figure.
Figure 4-18: JmsMissedEvents message example
<SOAP:Envelope xmlns:SOAP="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP:Header>
    <header xmlns="xmlapi_1.0">
      <eventName>JmsMissedEvents</eventName>
      <MTOSI_osTime>1222891102168</MTOSI_osTime>
      <ALA_clientId>JMS_client@n</ALA_clientId>
      <MTOSI_NTType>ALA_OTHER</MTOSI_NTType>
      <MTOSI_objectType>StateChangeEvent</MTOSI_objectType>
      <ALA_category>GENERAL</ALA_category>
      <ALA_isVessel>false</ALA_isVessel>
      <ALA_allomorphic/>
      <ALA_eventName>JmsMissedEvents</ALA_eventName>
      <MTOSI_objectName/>
      <ALA_OLC>0</ALA_OLC>
    </header>
  </SOAP:Header>
  <SOAP:Body>
    <jms xmlns="xmlapi_1.0">
      <stateChangeEvent>
        <eventName>JmsMissedEvents</eventName>
        <state>jmsMissedEvents</state>
      </stateChangeEvent>
    </jms>
  </SOAP:Body>
</SOAP:Envelope>
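A client can recognize this message by inspecting the eventName element. The sketch below uses only the XML parser that ships with the JDK and assumes that the OSS has already extracted the message body as a String; the class and method names are hypothetical. The element name and the xmlapi_1.0 namespace are taken from the figure above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Hypothetical helper that checks whether an event message is a JmsMissedEvents message. */
final class MissedEventsDetector {
    private static final String XMLAPI_NS = "xmlapi_1.0";   // namespace used in the figure

    private MissedEventsDetector() {}

    static boolean isJmsMissedEvents(String soapXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(soapXml.getBytes(StandardCharsets.UTF_8)));
            // eventName appears in both the header and the body of the example message.
            NodeList names = doc.getElementsByTagNameNS(XMLAPI_NS, "eventName");
            for (int i = 0; i < names.getLength(); i++) {
                if ("JmsMissedEvents".equals(names.item(i).getTextContent().trim())) {
                    return true;
                }
            }
            return false;
        } catch (Exception e) {
            throw new IllegalArgumentException("Message is not parseable XML", e);
        }
    }
}
```

When the method returns true, the OSS would start whichever recovery action it has chosen for missed events.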
Recovery from missed events
Clients must be able to recognize when events are missed and take the appropriate recovery action. For example, in an OSS application that uses events to maintain inventory information, the recovery action could include resynchronizing the inventory with the NFM-P through the XML API, either immediately or at a time selected by the OSS administrator.
Although an OSS application must have measures in place to recover from missed events, the OSS application can implement prevention methods. The following are examples of ways an OSS application can prevent missed events:
- Use restrictive filtering wherever possible. See JMS message filtering for more information.
- Minimize the time that an OSS is subscribed but disconnected from the NFM-P, in which case events are queued but not processed.
- Queue messages internally to facilitate the handling of message bursts.
- Use the DUPS_OK_ACKNOWLEDGE acknowledgment mode to facilitate the handling of network latency. See Acknowledgment modes in JMS subscriptions for more information.
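Internal queuing, as suggested in the list above, can be sketched with a bounded in-memory queue that separates message receipt from message processing. The class name, the capacity, and the use of String as the message type are assumptions made for illustration; the idea is that the JMS listener thread only enqueues, so that messages are received quickly, while a worker thread does the slower processing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical bounded buffer between the JMS listener thread and a worker thread. */
final class EventBuffer {
    private final BlockingQueue<String> queue;

    EventBuffer(int capacity) {
        this.queue = new LinkedBlockingQueue<>(capacity);
    }

    /**
     * Called from the JMS listener thread. Never blocks.
     *
     * @return false if the buffer is full and the message was not stored
     */
    boolean enqueue(String message) {
        return queue.offer(message);
    }

    /**
     * Called from a worker thread. Hands up to max buffered messages to the handler,
     * in arrival order, and returns the number of messages handled.
     */
    int drain(Consumer<String> handler, int max) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, max);
        batch.forEach(handler);
        return batch.size();
    }
}
```

A false return from enqueue means that the OSS itself has dropped an event, which it would need to treat in the same way as a JmsMissedEvents notification.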