Event Correlation

Event Correlation is a technique for making sense of a large number of events and pinpointing the few events that are really important in that mass of information. It has been notably used in Telecommunications and Industrial Process Control since the 1970s, in Network Management and Systems Management since the 1980s, in Service Level Management and Event-Based Systems since the 1990s, and in Business Activity Monitoring since the early 2000s.

In Network Management, Systems Management and Service Level Management, Event Correlation usually takes place inside the Management Platform. In ITIL parlance, Event Correlation is part of Support Management.

Event Correlation is implemented by a piece of software known as the Event Correlator. This tool is automatically fed with events originating from managed elements, monitoring tools or the Trouble Ticket System. Each event captures something special (from the event source standpoint) that happened in the domain of interest to the Event Correlator (e.g., the reboot of a device, a Service-Level Objective that is not met for a given customer, or the CPU of an e-business server that is used at 100% for over 15 minutes).

An event may convey an alarm or report an incident (which explains why Event Correlation used to be called Alarm Correlation), but not necessarily. It may also report that a situation has returned to normal, or simply convey information (e.g., policy P has been updated on device D). The "severity" of an event is an indication, given by the event source to the event destination, of the priority that the event should receive during processing.

Upon receiving events, the Event Correlator discards those that it deems irrelevant. Next, it merges duplicate events and aggregates events that together tell the same story. Finally, the Event Correlator performs Root Cause Analysis to identify, through dependency analysis, which events can be explained by a single one (the root cause).

At this stage, the Event Correlator is left with at most a handful of events that need to be acted upon. Strictly speaking, Event Correlation ends here. However, the term is often used loosely: the Event Correlators found on the market (e.g., in Network Management) can also include problem-solving capabilities, so that they can trigger corrective actions or further investigations automatically. Such functionality is not covered here.

Event correlation can be decomposed into four steps:
* Event Filtering
* Event Aggregation
* Event Masking
* Root Cause Analysis

Event Filtering

Event Filtering consists in discarding events that are deemed to be irrelevant by the Event Correlator. For instance, a number of bottom-of-the-range devices are difficult to configure and occasionally send events of no interest to a centralized management platform (e.g., printer P needs A4 paper in tray 1). Another example is the filtering of informational or debugging events by an Event Correlator that is only interested in availability and faults.
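
The filtering step described above can be illustrated with a minimal Python sketch. The `Event` fields, the category names and the `RELEVANT_CATEGORIES` set are assumptions made for this example only; real Event Correlators define their own event formats and filtering rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str    # hypothetical: name of the managed element
    category: str  # hypothetical: "fault", "availability", "info", "debug"
    message: str


# Assumption: this correlator is only interested in availability and faults.
RELEVANT_CATEGORIES = {"fault", "availability"}


def filter_events(events):
    """Keep only the events whose category is relevant to the correlator."""
    return [e for e in events if e.category in RELEVANT_CATEGORIES]


events = [
    Event("printer-P", "info", "needs A4 paper in tray 1"),
    Event("router-R", "fault", "interface down"),
    Event("server-S", "debug", "heartbeat trace"),
]
kept = filter_events(events)  # only the router fault is kept
```

Filtering on a single category field is the simplest possible rule; a filter could equally match on source, severity or message content.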

Event Aggregation

Event Aggregation (also known as Event De-duplication) consists in merging duplicates of the same event. Such duplicates may be caused by network instability (e.g., the same event is sent twice by the source because the first instance was not acknowledged sufficiently quickly, but both instances eventually reach the event destination). Another example is temporal aggregation, when the same event is sent over and over again by the source until the problem is solved.
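
As a rough illustration of de-duplication and temporal aggregation, the following Python sketch merges events that share the same source and message and arrive within a time window of the previous occurrence. The tuple layout, the `AggregatedEvent` class and the 60-second default window are assumptions chosen for the example, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass
class AggregatedEvent:
    source: str
    message: str
    first_seen: float
    last_seen: float
    count: int = 1


def aggregate_events(events, window_seconds=60.0):
    """Merge repeated events into one record per occurrence.

    `events` is an iterable of (timestamp, source, message) tuples,
    assumed to be sorted by timestamp. Two events are duplicates when
    they share source and message and the gap between them is at most
    `window_seconds`.
    """
    result = []
    latest = {}  # (source, message) -> most recent AggregatedEvent
    for timestamp, source, message in events:
        key = (source, message)
        current = latest.get(key)
        if current is not None and timestamp - current.last_seen <= window_seconds:
            # Duplicate or repetition: extend the existing record.
            current.last_seen = timestamp
            current.count += 1
        else:
            # First occurrence, or a recurrence after a quiet period.
            current = AggregatedEvent(source, message, timestamp, timestamp)
            latest[key] = current
            result.append(current)
    return result


raw = [
    (0.0, "router-R", "link down"),
    (0.5, "router-R", "link down"),    # retransmitted duplicate
    (30.0, "router-R", "link down"),   # repeated until the problem is solved
    (45.0, "server-S", "disk full"),
    (500.0, "router-R", "link down"),  # outside the window: new occurrence
]
merged = aggregate_events(raw)
```

Keeping a count and the first and last timestamps preserves the information that the event was repeated, while presenting a single record to later steps.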

Event Masking

Event Masking (also known as Topological Masking in Network Management) consists in ignoring events pertaining to systems that are downstream of a failed system. For example, servers that are downstream of a crashed router will fail availability polling; the resulting events can be masked, because the router failure already accounts for them.
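
Topological masking can be sketched in Python as follows, assuming a simple tree-shaped topology in which every node has at most one upstream neighbour. The `UPSTREAM` map and the node names are hypothetical, and every event in the input is assumed to report a failure.

```python
# Hypothetical topology: each node maps to its upstream neighbour
# (None means the node has no upstream dependency). Assumed acyclic.
UPSTREAM = {
    "router-R": None,
    "switch-W": "router-R",
    "server-A": "switch-W",
    "server-B": "switch-W",
    "server-C": None,
}


def is_masked(node, failed_nodes, upstream=UPSTREAM):
    """Return True if any node upstream of `node` has failed."""
    parent = upstream.get(node)
    while parent is not None:
        if parent in failed_nodes:
            return True
        parent = upstream.get(parent)
    return False


def mask_events(events, upstream=UPSTREAM):
    """Drop failure events for nodes located downstream of a failed node.

    `events` is a list of (node, message) tuples, all reporting failures.
    """
    failed = {node for node, _ in events}
    return [(n, m) for n, m in events if not is_masked(n, failed, upstream)]


events = [
    ("router-R", "crashed"),
    ("server-A", "availability poll failed"),
    ("server-B", "availability poll failed"),
    ("server-C", "availability poll failed"),
]
visible = mask_events(events)
```

In this example the events for server-A and server-B are masked because both sit behind the crashed router, whereas server-C is unrelated to the router and its event is kept.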

Root Cause Analysis

Root Cause Analysis is the last and most complex step of Event Correlation. It consists in analyzing dependencies between events, based on a model of the environment and dependency graphs, to detect whether some events can be explained by others. For example, if database D runs on server S and this server gets durably overloaded (CPU used at 100% for a long time), the event “the SLA for database D is no longer fulfilled” can be explained by the event “Server S is durably overloaded”.
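
The dependency analysis can be illustrated with a deliberately simplified Python sketch. It assumes a static dependency graph (`DEPENDS_ON`, hypothetical) and treats an event as explained whenever something it depends on, directly or transitively, also has an active event. Production correlators rely on much richer models of the environment; this is only one possible reading of the example above.

```python
# Hypothetical dependency graph: entity -> entities it directly depends on.
DEPENDS_ON = {
    "webapp-W": ["database-D"],
    "database-D": ["server-S"],
    "server-S": [],
    "server-T": [],
}


def transitive_dependencies(entity, depends_on):
    """Return every entity that `entity` depends on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(entity, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
    return seen


def root_cause_analysis(events, depends_on=DEPENDS_ON):
    """Split events into root causes and explained symptoms.

    `events` maps an entity to its event message. Returns a pair:
    the sorted list of root-cause entities, and a dict mapping each
    symptom entity to the sorted root causes that explain it.
    """
    active = set(events)
    causes = {
        entity: transitive_dependencies(entity, depends_on) & active
        for entity in active
    }
    # A root cause is an active entity with no active dependency.
    roots = sorted(e for e in active if not causes[e])
    symptoms = {
        e: sorted(c for c in causes[e] if c in roots)
        for e in sorted(active)
        if causes[e]
    }
    return roots, symptoms


events = {
    "database-D": "the SLA for database D is no longer fulfilled",
    "server-S": "Server S is durably overloaded",
    "server-T": "fan failure",
}
roots, symptoms = root_cause_analysis(events)
```

Here the database event is attributed to the overloaded server S, while the unrelated event on server-T remains a root cause in its own right.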

Role of Event Correlation in Integrated Management

The point of Integrated Management is to integrate the management of networks, systems and IT services in organizations. The Event Correlator plays a key role in this integration, for only there do network, system and service events come together. For instance, this is where the failure of a service can be ascribed to a specific failure in the underlying IT infrastructure.

Most Event Correlators can receive events from Trouble Ticket Systems. However, only some of them are currently able to notify Trouble Ticket Systems when a problem is solved, which partly explains why Service Desks find it difficult to stay up to date. The integration of management in organizations requires communication between the Event Correlator and the Trouble Ticket System to work both ways.

References

* M. Hasan, B. Sugla and R. Viswanathan, "A Conceptual Framework for Network Management Event Correlation and Filtering Systems", in "Proc. 6th IFIP/IEEE International Symposium on Integrated Network Management (IM 1999)", Boston, MA, USA, May 1999, pp. 233–246.
* H.G. Hegering, S. Abeck and B. Neumair, "Integrated Management of Networked Systems", Morgan Kaufmann, 1998.
* G. Jakobson and M. Weissman, "Alarm Correlation", "IEEE Network", Vol. 7, No. 6, pp. 52–59, November 1993.
* S. Kliger, S. Yemini, Y. Yemini, D. Ohsie and S. Stolfo, "A Coding Approach to Event Correlation", in "Proc. 4th IEEE/IFIP International Symposium on Integrated Network Management (ISINM 1995)", Santa Barbara, CA, USA, May 1995, pp. 266–277.
* J.P. Martin-Flatin, G. Jakobson and L. Lewis, "Event Correlation in Integrated Management: Lessons Learned and Outlook", "Journal of Network and Systems Management", Vol. 15, No. 4, December 2007.
* M. Sloman (Ed.), "Network and Distributed Systems Management", Addison-Wesley, 1994.

See also

* Root Cause Analysis
* Complex Event Processing
* Network Management
* Systems Management
* Service Level Management
* Business Activity Monitoring
* Trouble Ticket System
* Incident Management


Wikimedia Foundation. 2010.
