01 Jul
4 Most Critical Aspects of Event Life Cycle Management
Since gen-E’s inception over 30 years ago, we’ve seen several interesting approaches to managing events in an operations environment. While it’s always important to keep your users in mind, we’ve found that successful event management requires a clear concept of event life cycle management, regardless of whether you’re in the NOC, Call Center, IT Operations, or Operations Management.
To gain a clear picture of the event life cycle, we’ve found that it’s most helpful to have a standard process. This standard process should begin with documenting a description of each stage or state that an event passes through during its life cycle, from the first appearance to the time it’s deleted. We recommend using a flow charting tool to contextually capture and define each event state.
Through hundreds of implementations, we’ve found that understanding and defining the functions below in your event life cycle management process leads to the most successful outcomes.
A critical step in the process is understanding which event categories are important enough to generate tickets/incidents. For example, consider whether your organization will ticket only events of major or critical severity, and whether it will skip ticketing for events from flapping conditions or from conditions that are able to self-clear.
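A ticketing policy like the one described above can be written down as a small rule. The following is a minimal sketch only: the `Event` fields, the severity names, and the `should_ticket` function are assumptions made for illustration, not part of any particular product.

```python
from dataclasses import dataclass

# Hypothetical severity names; substitute the scale your tooling uses.
TICKETABLE_SEVERITIES = {"major", "critical"}


@dataclass
class Event:
    source: str
    severity: str
    flapping: bool = False       # condition keeps toggling between states
    self_clearing: bool = False  # condition is expected to clear on its own


def should_ticket(event: Event) -> bool:
    """Return True when the event qualifies for a ticket under this example policy."""
    if event.severity.lower() not in TICKETABLE_SEVERITIES:
        return False
    # Skip events that are flapping or that are expected to self-clear.
    if event.flapping or event.self_clearing:
        return False
    return True
```

Keeping the policy in one function makes it easy to review with the teams that receive the tickets and to adjust as the criteria change.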
Automatic and Ad Hoc Event Escalation and De-escalation
Thought should be given to how the operational impact of an event is determined, as well as how an event will be re-prioritized given a shift in the impact or if the event doesn’t follow the standard procedure. This may include manually setting thresholds, or using a machine learning tool such as gen-E’s OpsCenter AIOps to automatically determine normal behavior and detect anomalies. Automatic notification options regarding these shifts should be considered to maximize efficiency and drive towards proactive event management.
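As a rough illustration of automatic re-prioritization, the sketch below escalates an event by one level when a metric falls outside a simple statistical baseline and de-escalates it otherwise. The severity scale, the function names, and the three-standard-deviation rule are assumptions chosen for the example; this is not how any specific AIOps product determines normal behavior.

```python
from statistics import mean, stdev

# Hypothetical severity scale, lowest to highest.
SEVERITY_ORDER = ["minor", "major", "critical"]


def is_anomalous(value: float, history: list, k: float = 3.0) -> bool:
    """Flag a value that lies more than k standard deviations from the historical mean."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) > k * sigma


def reprioritize(severity: str, value: float, history: list, k: float = 3.0) -> str:
    """Move the severity up one level on an anomaly, down one level otherwise."""
    idx = SEVERITY_ORDER.index(severity)
    if is_anomalous(value, history, k):
        idx = min(idx + 1, len(SEVERITY_ORDER) - 1)
    else:
        idx = max(idx - 1, 0)
    return SEVERITY_ORDER[idx]
```

A manually set threshold would replace `is_anomalous` with a fixed comparison; the re-prioritization step stays the same either way.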
With ever-growing volumes of events, it’s desirable to have the ability to automatically group events based on multiple categories, increasing efficiency when wading through a sea of red. It’s also desirable to create role-based grouping. For example, an event grouping by application name would be relevant to the application owner, while a different group would be interested in seeing events grouped by geography or incident ID.
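Role-based grouping can be sketched as a mapping from a role to the event field that role cares about. The role names and event fields below are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical mapping from a role to the field its view is grouped by.
ROLE_GROUPING = {
    "application_owner": "application",
    "regional_noc": "geography",
    "incident_manager": "incident_id",
}


def group_events(events: list, key: str) -> dict:
    """Group event dicts by the given field; events missing the field go under 'unknown'."""
    groups = defaultdict(list)
    for event in events:
        groups[event.get(key, "unknown")].append(event)
    return dict(groups)


def view_for_role(events: list, role: str) -> dict:
    """Return the events grouped the way the given role wants to see them."""
    return group_events(events, ROLE_GROUPING[role])
```

The same list of events then yields a different view per role, for example `view_for_role(events, "application_owner")` versus `view_for_role(events, "regional_noc")`.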
We all know that data lurks in systems external to your department that could be hugely beneficial to managing events more effectively. When evaluating event management practices, consider all of the information that would make you more successful, and which systems contain information that could be leveraged to enrich events, such as device owners, customer contact information, network layer related information, application dependencies, or service information. Integrations between your event management system and these external systems are not only possible, but relatively simple when working with a partner with extensive experience in these systems.
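At its simplest, enrichment is a lookup against an external source keyed on something the event already carries. The sketch below assumes an in-memory dictionary standing in for that external system; the `DEVICE_DIRECTORY` contents and field names are made up for illustration.

```python
# Hypothetical stand-in for an external system such as a CMDB or CRM export.
DEVICE_DIRECTORY = {
    "core-sw-01": {
        "owner": "network-team",
        "customer_contact": "noc@example.com",
        "service": "payments",
    },
}


def enrich_event(event: dict, directory: dict) -> dict:
    """Return a copy of the event with any known context for its device added."""
    context = directory.get(event.get("device"), {})
    # Fields already present on the event take precedence over looked-up context.
    return {**context, **event}
```

In practice the lookup would call the external system or a cached copy of its data, but the merge step looks much the same.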
Further, when events are used to generate tickets/incidents into a ticketing application or CMDB, the following items should be considered:
- An event and the incident that it generates should be linked by a common key.
- Events should be automatically updated based on updates to the incident. Typically, once an incident is created, the NOC will not close or de-prioritize the event before the incident is resolved, but updates from the incident side should automatically be propagated to the event.
- Incident creation should be automatically suppressed based on predetermined criteria, such as a maintenance window in which a change request that results in CI downtime is being performed on the CI the events are received from.
- Events from a CI that is undergoing a service-disrupting change request should be automatically suppressed for the duration of the downtime due to the change, and automatically unsuppressed when the change window expires or when a notification that the change was completed is received.
- Since a planned change that generates a service disruption can impact multiple CIs (such as patching a set of servers or updating the OS on a set of network devices), it is desirable to have the ability to automatically suppress all events coming from the impacted CIs for the duration of the change window or until a notification that the change was completed is received. A follow-up automation should handle problems that continue after the change window has passed, for example by simply unsuppressing qualifying events going forward.
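The considerations in this list can be sketched in a few lines: a change window that covers several CIs and ends early when a completion notice arrives, and an incident update that is propagated to its event through a common key. All class, field, and function names here are assumptions for illustration, not the API of any ticketing system or CMDB.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ChangeWindow:
    ci_ids: frozenset                        # every CI impacted by the planned change
    start: datetime
    end: datetime
    completed_at: Optional[datetime] = None  # set when a completion notice is received

    def covers(self, ci_id: str, when: datetime) -> bool:
        """True if events from ci_id at time `when` fall inside the active window."""
        if ci_id not in self.ci_ids:
            return False
        # The window closes at its scheduled end or at completion, whichever is earlier.
        effective_end = self.end
        if self.completed_at is not None:
            effective_end = min(self.end, self.completed_at)
        return self.start <= when < effective_end


def is_suppressed(ci_id: str, when: datetime, windows: list) -> bool:
    """Suppress the event if any change window covers its CI at that time."""
    return any(window.covers(ci_id, when) for window in windows)


def apply_incident_update(events_by_key: dict, incident: dict) -> None:
    """Propagate an incident's status to its linked event via the common key."""
    event = events_by_key.get(incident["event_key"])
    if event is not None:
        event["incident_status"] = incident["status"]
```

Because suppression is evaluated per event against the window, events that arrive after the window expires or after the completion notice are unsuppressed automatically, and a follow-up job can pick up any problems that persist.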
Hopefully this blog post provided some food for thought that will help you improve how events are handled in your environment. As always, we’re here to lend a hand and share additional thoughts about how you can improve performance without undergoing a complete system overhaul.