Instrumentation Design for Service Monitoring | Practical Service Level Management: Delivering High-Quality Web-Based Services

In this section, you learn about the actual process of collecting the required information. To monitor service behaviors, you need to select the appropriate demarcation points and monitoring techniques.

Demarcation Points

Collectors are deployed most effectively by selecting the appropriate demarcation points, usually a boundary between organizations or infrastructures. The enterprise-service provider interface is an example of an organization demarcation point. Collectors positioned at each demarcation point measure the delay across the provider network as well as within parts of the enterprise network structure. They can then provide end-to-end service quality measurements; additional placements break the measurements into specific domains.

The collectors in Figure 4-5 are placed at demarcation points. Moving from right to left, they separate the following:

The local delays found in the desktop and local infrastructure
The service provider delays
The delays from the provider edge to the remote server

Figure 4-5. Measurement Demarcation Points

The desktop (or wireless phone or PDA) collector measures the entire round-trip delay for any transaction initiated from that location. No further measurements are needed unless the delay exceeds specifications in the SLA.

For situations in which the desktop is beyond the control of the enterprise (for example, a web site serving the general public), or for situations where a disinterested third party is needed, measurement services, such as those offered by Keynote Systems, can be used.

The other demarcation points are used to identify the likely cause of the delay so that staff members are properly assigned without wasting additional time and interrupting other activities.

As an example of the use of demarcation points, consider that measuring the round-trip delay between the desktop and the edge of the service provider network isolates the delay associated with the local infrastructure. Tracking the round trip between the edges of the service provider network measures the delay introduced by the provider. Finally, measuring a transaction from the collector closest to the server tracks the server delays.
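The isolation described above comes down to simple arithmetic on the readings from each collector. The following Python sketch is illustrative only: the function name, the segment labels, and the sample values (in milliseconds) are assumptions made for this example and do not come from any particular product.

```python
def attribute_delay(end_to_end_ms, local_ms, provider_ms, server_ms):
    """Name the segment contributing the most delay.

    Each argument is a round-trip reading from one demarcation point.
    Also returns the part of the end-to-end delay that no segment explains.
    """
    segments = {
        "local infrastructure": local_ms,
        "service provider": provider_ms,
        "server side": server_ms,
    }
    worst = max(segments, key=segments.get)
    unexplained_ms = end_to_end_ms - sum(segments.values())
    return worst, unexplained_ms


# Hypothetical readings: the desktop collector saw 480 ms end to end.
worst, unexplained = attribute_delay(
    end_to_end_ms=480.0, local_ms=35.0, provider_ms=310.0, server_ms=120.0
)
print(worst, unexplained)  # service provider 15.0
```

With these sample readings, the provider segment dominates, so the provider is the first party to contact rather than the local network staff.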

Passive and Active Monitoring Techniques

Collectors track service quality with passive and active measurements. Passive measurements are usually made from client desktops or customer access devices, using instrumentation or activity logging on a client that "consumes" the service. The results are widely variable and can be difficult to normalize. However, passive measurements will accurately reflect the user experience and can be important as the last resort for detecting compliance problems.

Active measurements, in contrast, are consistent and thus easier to use for tracking performance trends. The active measurements are also proactive, detecting potential non-compliance before passive approaches can. A combination of both types is most effective in exploiting the strengths of each approach while minimizing the shortcomings.

Passive Collection

The most common collectors use passive collection. In other words, they gather only the information that flows by. For example, a desktop collector tracks user activity as it occurs and keeps a record of specific transactions and their completion times. Passive collectors can be relatively simple and can consume minimal resources. They use no additional bandwidth, but they can generate large volumes of data. They are good for detailed data collection and for reactive management, such as forwarding an alarm when a problem is detected.

Placing a collector in a desktop is a common form of passive collection and measurement. One of the first to offer desktop instrumentation was VitalSigns, which became the Lucent Technologies VitalSuite product line after acquisition. The collector usually intercepts traffic flowing between the desktop and the network and measures round-trip delay while tracking the applications and subtransactions actually being used. The information usually is stored at the desktop until it is passed to the management system for further analysis and processing. A real-time alert is forwarded whenever the response time exceeds a predefined threshold value.
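The behavior of such a desktop collector can be sketched in a few lines of Python. This is a minimal illustration, not the design of VitalSuite or any other product; the class name, method names, and threshold value are all assumptions. It records each observed transaction locally, raises an alert when the response time exceeds the threshold, and hands the stored records over when the management system asks for them.

```python
from dataclasses import dataclass, field


@dataclass
class PassiveCollector:
    """Hypothetical desktop collector that only records what flows by."""
    threshold_s: float
    records: list = field(default_factory=list)
    alerts: list = field(default_factory=list)

    def observe(self, application, started_s, finished_s):
        """Record one user transaction and alert if it was too slow."""
        elapsed = finished_s - started_s
        self.records.append((application, elapsed))
        if elapsed > self.threshold_s:
            # A real collector would forward this to the management system.
            self.alerts.append((application, elapsed))
        return elapsed

    def flush(self):
        """Return the locally stored records and clear the local store."""
        batch, self.records = self.records, []
        return batch


collector = PassiveCollector(threshold_s=2.0)
collector.observe("mail", started_s=10.0, finished_s=10.5)
collector.observe("orders", started_s=20.0, finished_s=23.5)
print(collector.alerts)  # [('orders', 3.5)]
```

Note that the collector generates no traffic of its own; it sees a problem only when a user happens to run a slow transaction.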

Active Collection

Active collection, in contrast, uses active agents to generate network and application activity for management measurement purposes. An active approach is proactive because it exercises networks and services and evaluates their behavior rather than waiting for a passive collector to detect a problem. Periodic active measurements can therefore detect problems earlier than the passive approach. Active measurements probe behavior even in the middle of the night; they do not depend on user actions to highlight a problem.

Virtual transaction (or synthetic transaction) is the commonly used term for describing active measurements. There can be a range of virtual transactions for measuring performance and for detecting service-related problems. Some examples, in order of complexity, include the following:

Pinging to verify network connectivity and basic system response
Activating a service by checking for service availability and access
Initiating specific transactions by testing specific operations such as sending a message, retrieving a web page, or buying a product
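The simplest end of this range can be sketched as follows. Because a true ping uses ICMP and typically needs elevated privileges, this Python example substitutes a TCP connection attempt as the connectivity check; that substitution, the function name, and the default timeout are assumptions made for illustration.

```python
import socket
import time


def check_connectivity(host, port, timeout_s=3.0):
    """Try to open a TCP connection and time the attempt.

    Returns (reachable, elapsed_seconds). This stands in for a ping;
    it verifies only that something is listening on the port.
    """
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            reachable = True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        reachable = False
    return reachable, time.perf_counter() - start
```

A check like this confirms basic reachability only. It says nothing about whether the service behind the port is working, which is why the more complex virtual transactions in the list are needed.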

Virtual transactions should match the actual business processes being measured so that administrators can view the measurements with confidence. Virtual transactions are of limited value if they don't match the actual business processes. Using a simple database query in a virtual transaction doesn't illuminate potential problems when the actual business processes are making multiple queries and activating other processes.

Checking for correct operation is essential once a virtual transaction extends beyond the simple ping. For example, a web server might return a "page not found" message quickly. Using that measurement to route more traffic to that (apparently) lightly loaded server only compounds the problem. As another example, a virtual transaction for ordering a product must verify that appropriate information is placed correctly in forms, that the credit card authorization worked, and that a confirming message was sent.
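The "page not found" case can be made concrete with a small Python sketch. Here `fetch` is a hypothetical callable supplied by the caller that performs the transaction and returns a status code and a response body; the function name, the result fields, and the sample responses are likewise assumptions. The point is that a fast response counts as good only if it is also the correct response.

```python
import time


def run_virtual_transaction(fetch, expected_text, max_seconds):
    """Time one virtual transaction and verify that its result is correct.

    fetch: hypothetical callable returning (status_code, body).
    """
    start = time.perf_counter()
    status, body = fetch()
    elapsed = time.perf_counter() - start
    correct = status == 200 and expected_text in body
    return {
        "elapsed_s": elapsed,
        "correct": correct,
        # A quick error page must never be reported as a good measurement.
        "acceptable": correct and elapsed <= max_seconds,
    }


# Stubbed responses stand in for a real server in this sketch.
fast_error = run_virtual_transaction(
    lambda: (404, "page not found"), "Order confirmed", max_seconds=5.0
)
good = run_virtual_transaction(
    lambda: (200, "Order confirmed: #1"), "Order confirmed", max_seconds=5.0
)
print(fast_error["acceptable"], good["acceptable"])  # False True
```

A full ordering transaction would chain several such checks, one for each step named above: form contents, credit card authorization, and the confirming message.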

Active agents are usually used as a proxy for a set of local desktops. Therefore, they must be carefully placed and configured so that they accurately reflect the user experience. The virtual transactions they use must match the actual transactions of the local desktops and they must access the same services so that the traffic flows over the same areas of the network. When the Internet is involved and there are thousands of external customers, a measurement service, such as that offered by Keynote Systems, can perform virtual transactions from the same backbones and geographic locations as the customers.

Active agents consume network and application resources. Therefore, they must be constrained through measurement policies defining the virtual transactions to use and the frequency. Other policy parameters define acceptable values so that trip wires are activated.
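A measurement policy of this kind might be represented as in the following Python sketch. The field names and the sample values are assumptions chosen for illustration; real products define their own policy formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeasurementPolicy:
    """Hypothetical policy constraining one active agent."""
    transaction: str       # which virtual transaction to run
    interval_s: int        # how often to run it, in seconds
    max_response_s: float  # acceptable value; slower results trip the wire

    def runs_per_day(self):
        """Number of virtual transactions this policy generates each day."""
        return 86400 // self.interval_s

    def trips(self, response_s):
        """True when a measurement falls outside the acceptable value."""
        return response_s > self.max_response_s


policy = MeasurementPolicy("retrieve home page", interval_s=300, max_response_s=4.0)
print(policy.runs_per_day())  # 288
print(policy.trips(5.2))      # True
```

Computing the daily run count makes the resource cost of a policy visible before it is deployed: halving the interval doubles the load that the agent places on the network and the application.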

Highly dynamic environments that frequently create new transactions or modify current ones add to the administrative burden. Administrators must develop new virtual transactions or modify their current set. This entails taking the time to understand the transactions, modeling the steps, determining successful outcomes, and measuring parameters.

Trade-Offs Between Passive and Active Collection

Active approaches offer advantages over passive collection in that they allow proactive responses. For instance, a virtual transaction can detect a failed server or an application that is not available, but passive techniques can indicate only that no server traffic has been detected.

Consistency is also a significant difference between the two approaches; a virtual transaction always exercises the same functions in the same way each time. As such, baselining is simpler because the only changes between successive transactions will be due to network, server, or content-delivery delays. Documentation about normal responses and trends is easier to build and monitor.
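Because each virtual transaction is identical, a baseline can be built from simple statistics over earlier measurements. The Python sketch below uses the mean and standard deviation and flags any sample more than three standard deviations from the mean; that rule, the function names, and the sample values are assumptions for illustration, not a method prescribed here.

```python
from statistics import mean, stdev


def build_baseline(samples_s):
    """Summarize past response times as (mean, standard deviation)."""
    return mean(samples_s), stdev(samples_s)


def deviates(sample_s, baseline, k=3.0):
    """True when a new measurement is more than k deviations from the mean."""
    avg, sd = baseline
    return abs(sample_s - avg) > k * sd


# Hypothetical response times, in seconds, from the same virtual transaction.
baseline = build_baseline([1.0, 1.1, 0.9, 1.0, 1.0])
print(deviates(1.05, baseline))  # False
print(deviates(2.0, baseline))   # True
```

The same calculation applied to passive measurements would be far less useful, because the variation between user transactions would widen the deviation and hide real changes.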

A passive approach can also track Web downloads by timing the duration of HTTP GET operations. However, the difficulty arises when, for example, one link brings back three lines of text and the next one brings in complex graphics. Getting an accurate understanding of actual performance with such variation is therefore more difficult and requires much more processing.

Active agents must be used carefully because they consume network and application resources with each usage. Large numbers of active agents initiating complex transactions on a frequent basis can degrade performance and interfere with legitimate activities. In contrast, passive approaches don't add any traffic to the system.

Hybrid Systems

A combination of active and passive agents offers optimum instrumentation coverage. The passive agents collect information on the actual transactions and their performance, and the active agents proactively find problems and build accurate baselines. This maximizes the information quality while minimizing the resource impacts of virtual transactions.

An instrumentation system for tracking service behaviors can be conceptualized as a new layer that sits above the element instrumentation. The services layer uses its own monitoring tools and techniques for measuring and tracking service-level metrics. Integrating the information from both layers is discussed in Chapter 5.