Event Management Overview | Practical Service Level Management: Delivering High-Quality Web-Based Services

Event management defines a set of functions that are applied to the alert stream to identify those alerts associated with actual or potential service disruptions. Further actions are initiated after the event manager identifies the important alerts.

Alerts arrive in different forms, as determined by the collectors and specific product implementations. They include the following:

Simple Network Management Protocol (SNMP) traps, which are sent mainly by network infrastructure elements, although elements in other infrastructures (such as the server infrastructure) also use SNMP
Alerts from passive and active collectors using vendor-specific protocols
Alerts from other elements using vendor-specific protocols
Alerts triggered by the arrival and transformation of an extensible markup language (XML) document
Alerts generated by the management system itself

The following subsections discuss how alerts are triggered, the need to transport alerts reliably from their origin to the central event manager, and the need for the event manager to handle the fact that some alerts are more important than others.

Alert Triggers

Baselines, thresholds, and internal failures are the usual triggers of alerts from element instrumentation. Threshold alerts can be triggered when a threshold is crossed in either direction (see Figure 4-2 in Chapter 4). Baselines represent a normal operating range of measurements. Alerts are triggered when the monitored variable is moving toward the edge of the envelope or has moved outside that envelope.
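The two trigger styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `ThresholdMonitor` and `baseline_status`, and the `margin` parameter that defines how close to the envelope boundary counts as "near the edge", are hypothetical and do not come from any particular product.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class ThresholdMonitor:
    """Reports each time a measurement crosses the threshold, in either direction."""
    threshold: float

    def crossings(self, samples: Iterable[float]) -> List[Tuple[int, str]]:
        alerts = []
        previous_above = None  # unknown until the first sample is seen
        for index, value in enumerate(samples):
            above = value > self.threshold
            # Alert only on a change of side, not on every sample above the line.
            if previous_above is not None and above != previous_above:
                alerts.append((index, "rising" if above else "falling"))
            previous_above = above
        return alerts


def baseline_status(value: float, low: float, high: float, margin: float = 0.1) -> str:
    """Classify a value against a baseline envelope [low, high].

    `margin` (an assumption of this sketch) is the fraction of the envelope
    width that is treated as the edge of the envelope.
    """
    if value < low or value > high:
        return "outside"
    edge = (high - low) * margin
    if value < low + edge or value > high - edge:
        return "near_edge"
    return "normal"
```

A real baseline would normally be recalculated from historical measurements; here the envelope is simply passed in as fixed `low` and `high` values.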

Alerts are also generated when there are internal failures, such as with a disk system, an application, or an interface. An element might be able to report certain failures itself. For example, a server that is still operating after an application fails can easily report that failure.

Other failure alerts are indirect, often because of a failure that prevented the element from reporting its own problems. For example, a central instrumentation management portal monitors a set of collectors with a heartbeat exchange. If a collector doesn't respond, an alert is generated, noting that it might have failed.
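The heartbeat check described above might look roughly like the following sketch. The class name `HeartbeatMonitor`, the timeout value, and the use of caller-supplied timestamps are all assumptions made for illustration; a production portal would also need to handle clock drift and network partitions.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat from each collector and flags silent ones."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_seen: Dict[str, float] = {}

    def record_heartbeat(self, collector: str, now: float) -> None:
        # Called whenever a collector answers the heartbeat exchange.
        self.last_seen[collector] = now

    def suspected_failures(self, now: float) -> List[str]:
        # A silent collector *might* have failed; the network path could
        # also be at fault, so this is an indirect alert, not a diagnosis.
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

Timestamps are passed in explicitly (rather than read from the system clock) so that the logic is easy to exercise with known values.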

Internally generated alerts are used to integrate and coordinate event management operations and to activate other management system components. As seen in Figure 5-1, the event manager activates other functional areas, such as fault-management or performance-management tools. At this point, the event management system has organized an alert stream into a set of actions based on the alerts that are generated.

Figure 5-1. Event Management

The alert volume can be substantial; several large organizations that I have spoken with recently have tens of thousands of element alerts daily, while their service alerts number in the mid-hundreds. There are usually more element alerts than service alerts because many element problems do not affect service quality when there is sufficient redundancy.

Automated processing of alerts is needed to identify those requiring immediate action from the management system. High alert volumes and more complex sorting criteria can overwhelm human staff.

Reliable Alert Transport

Several reliability issues must be considered when receiving alerts. Many protocols, such as SNMP, typically use unreliable transport (SNMP usually runs over UDP); there's no assurance that alerts will arrive at the central event manager. If the event manager is regularly polling for information or for the presence of a heartbeat, missing poll responses will be noted and the next poll may get the information. If the event manager is simply listening for alerts, without checking for problems with the instrumentation, it won't know that an alert has been lost.

One common solution to the problem of missing alerts is to have remotely located aggregators that are in proximity to the source of the alerts. If they use a reasonably error-free communications channel to connect to the alert sources, the aggregators will receive almost all alerts correctly.

To push measurement alert information reliably into the enterprise's event manager from the remote aggregators, it is necessary to avoid using unreliable transport. This can be difficult if the alerting system uses industry-standard SNMP traps; normal SNMP uses unreliable transport.

One way of reliably transporting SNMP is illustrated by the web-performance measurement service, Keynote Systems, which uses industry-standard SNMP traps to push its measurement alert information into event managers from Tivoli, HP OpenView, Micromuse Netcool, and other major management systems. Keynote places a small appliance next to the enterprise's event manager, inside the enterprise's firewall. That appliance connects across the Internet to the Keynote system using once-a-minute, outgoing, secure, reliable connections. Retrieved alerts are then signaled with SNMP traps from the Keynote appliance to the management system that's only a few feet away; there's little chance of losing the alert.

Keynote also offers direct plug-in into some event managers, such as Unicenter/TNG; in those cases, software is installed into the event manager to communicate directly with the Keynote systems using reliable, secure transport and XML. Either method (local appliance or direct plug-in) can be used to improve the reliability of alert transport.
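The general idea behind reliable alert transport, holding each alert until the receiver confirms it, can be sketched as a store-and-forward queue. This is a generic illustration and not a description of how Keynote or any other vendor implements its transport; `StoreAndForwardQueue` and the `send` callback (which is assumed to return True only when the receiver has acknowledged the alert) are hypothetical names.

```python
from collections import OrderedDict
from typing import Callable


class StoreAndForwardQueue:
    """Keeps each alert until the receiver acknowledges it."""

    def __init__(self, send: Callable[[int, str], bool]) -> None:
        self._send = send  # assumed to return True only on acknowledgment
        self._pending: "OrderedDict[int, str]" = OrderedDict()
        self._next_id = 0

    def enqueue(self, alert: str) -> int:
        alert_id = self._next_id
        self._next_id += 1
        self._pending[alert_id] = alert
        return alert_id

    def flush(self) -> int:
        """Try to deliver pending alerts in order; return how many were acknowledged."""
        delivered = 0
        for alert_id in list(self._pending):
            if not self._send(alert_id, self._pending[alert_id]):
                break  # keep this alert and everything after it for the next flush
            del self._pending[alert_id]
            delivered += 1
        return delivered

    def pending_count(self) -> int:
        return len(self._pending)
```

An aggregator built this way would call `flush()` periodically; an alert leaves the queue only after it has been acknowledged, so a failed connection delays alerts but does not lose them.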

Alert Management

The alert stream contains information of differing value for managing services. Not every alert requires further attention. As an example, an alert reporting a slow response time for a single customer might not indicate a problem by itself. A single slow transaction can be caused by temporary server congestion, lost packets, or a routing change. No further attention is needed as long as the percentage of completed transactions with acceptable response times is very high.
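As a rough sketch of this reasoning, the following Python computes the fraction of transactions with acceptable response times and flags a problem only when that fraction drops too low. The function names, the response-time limit, and the 95 percent default are assumptions chosen for illustration, not values from any SLA.

```python
from typing import Sequence


def acceptable_fraction(response_times: Sequence[float], limit_seconds: float) -> float:
    """Fraction of transactions that completed within the response-time limit."""
    if not response_times:
        return 1.0  # no transactions observed, so nothing is out of bounds
    ok = sum(1 for t in response_times if t <= limit_seconds)
    return ok / len(response_times)


def needs_attention(
    response_times: Sequence[float],
    limit_seconds: float,
    required_fraction: float = 0.95,  # hypothetical target, not from the text
) -> bool:
    """True only when too few transactions meet the limit."""
    return acceptable_fraction(response_times, limit_seconds) < required_fraction
```

With 99 fast transactions and one slow one, the slow transaction is recorded but `needs_attention` remains False, which matches the single-customer example above.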

The managed environment is highly dynamic and the instrumentation can create artifacts, which are false indications of the actual situation (false positives); they need to be eliminated before a false diagnosis causes further disruptions to staff and operations. In fact, responding to artifacts wastes staff time because subsequent measurements usually reveal no problem at all.
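One simple way to reduce artifacts, offered here only as an example technique, is to require that a violation persist across several consecutive measurements before it is treated as real. The function name `confirmed_alerts` and the default of three confirmations are assumptions of this sketch.

```python
from typing import Iterable, List


def confirmed_alerts(violations: Iterable[bool], confirmations: int = 3) -> List[int]:
    """Return the sample indexes at which a violation has persisted for
    `confirmations` consecutive measurements.

    Isolated violations that clear on the next measurement are treated as
    artifacts and never reported.
    """
    confirmed = []
    streak = 0
    for index, violated in enumerate(violations):
        streak = streak + 1 if violated else 0
        if streak == confirmations:  # report once, when the streak is first confirmed
            confirmed.append(index)
    return confirmed
```

The trade-off is a delay: a genuine problem is reported only after the confirming measurements arrive, so the confirmation count should be small relative to the measurement interval.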

The event manager organizes the remaining alerts after the artifacts have been removed from consideration. The appropriate action depends on the overall operational context. For example, a measurement that exceeds a warning threshold requires different attention than a measurement indicating noncompliance with a Service Level Agreement (SLA).

Alerts also have different business impacts that affect subsequent management decisions. A disruption that affects revenues and business relationships should draw more attention than a slight slowing of internal e-mail.
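Ordering alerts by importance can be sketched with a priority queue. The scoring rule used here (severity multiplied by business impact) and the numeric scales are purely illustrative assumptions; an actual event manager would apply whatever policy the organization defines.

```python
import heapq
from dataclasses import dataclass, field
from typing import List


@dataclass(order=True)
class PrioritizedAlert:
    priority: int                      # lower value is handled first
    sequence: int                      # tie-breaker: earlier alerts first
    description: str = field(compare=False)


class AlertQueue:
    """Hands out alerts in order of combined severity and business impact."""

    def __init__(self) -> None:
        self._heap: List[PrioritizedAlert] = []
        self._counter = 0

    def push(self, description: str, severity: int, business_impact: int) -> None:
        # heapq is a min-heap, so negate the score to pop the largest first.
        priority = -(severity * business_impact)
        heapq.heappush(self._heap, PrioritizedAlert(priority, self._counter, description))
        self._counter += 1

    def pop(self) -> str:
        return heapq.heappop(self._heap).description
```

For example, an SLA violation on a revenue-generating service would be pushed with high severity and impact and would therefore be popped before a slight slowing of internal e-mail.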

Refer again to Figure 5-1, which illustrates event management functions. Starting at the top are the event management inputs, which are either internally generated alerts or those from the service or element instrumentation.

The event management functions are shown within the rectangle in the middle. Functions such as artifact reduction, filtering, and correlation are applied to any alert.

The event management system identifies the events that require further action. The next step depends on the event. Some events activate a fault management tool while others launch a performance management tool. Events can also trigger a billing subsystem, page an administrator, generate a report, or initiate other functions.
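The final step, routing each event to the function it should activate, can be sketched as a dispatch table. The event types and handler names below are hypothetical; they stand in for real fault management, performance management, or paging integrations.

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], str]


class EventDispatcher:
    """Routes each event to every handler registered for its type."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def register(self, event_type: str, handler: Handler) -> None:
        self._handlers.setdefault(event_type, []).append(handler)

    def dispatch(self, event: dict) -> List[str]:
        # Events with no registered handler produce no actions.
        return [handler(event) for handler in self._handlers.get(event["type"], [])]


def start_fault_tool(event: dict) -> str:
    return f"fault tool started for {event['source']}"


def page_administrator(event: dict) -> str:
    return f"administrator paged about {event['source']}"
```

Registering both handlers for a hypothetical "fault" event type means that one event both launches the fault tool and pages an administrator, while other event types can be wired to different functions.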