Major Incident Management
The basics of a good incident management practice are relatively simple. When something fails or there’s a deviation from acceptable performance, you react, get things back up and running, and try to provide good customer service along the way. While this may be an oversimplification of the incident management practice, the point is that the basics are straightforward and relatively easy to understand. In fact, most organizations have some level of success in handling the basics of the incident management practice. But even in the best organizations, there are times when things go really wrong. And those times are often during a major incident.
Defining a Major Incident
One of the things often stressed in an ITIL course is the special way in which we need to handle major incidents. Since a major incident is sure to have high impact, high urgency, and high visibility, it makes sense that we should have a structured plan or procedure to deal with major incidents. But one of the most basic failures in major incident management has to do with defining exactly what a major incident is to your organization.
There are two sides to this common issue. In the first case, when our major incident definition is too subjective, those involved with the incident may struggle to decide whether a specific incident is truly a major incident. Stakeholders often fear the repercussions of declaring a major incident only to find out a short time later that the incident wasn’t what they initially thought it was. Another common fear is failing to declare a major incident when one is warranted, thus delaying the proper response. To counter the problems with a subjective definition, many organizations try to create a very tight and objective definition of a major incident. And this leads to a different set of challenges.
In the second case, we have a tight, objectively defined set of criteria for what qualifies as a major incident. Often, these criteria involve certain customers, systems, service types, or percentages of a user base that are impacted. In this environment, it is common for true major incidents to be declared late, or never declared at all, because our objective criteria weren’t met by an issue we didn’t anticipate. The crux of the issue is that we don’t know what we don’t know. In other words, it’s very difficult to anticipate every way a service could fail or a customer could be impacted by the failure of a service. And there will almost always be major incidents that we didn’t expect to be major incidents.
The best major incident definitions strike a balance between a set of objective criteria and room for some level of subjectivity. While we want to avoid the “I’ll know it when I see it” definition mentality, the truth is that often we do know it when we see it. Subjectivity is not always bad. After all, many of the best thinkers rely heavily on their own intuition.
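One way to picture this balance is a check that applies objective criteria first and then allows a judgment-based escalation, provided the responder records a reason. The following is a minimal sketch, not a recommended policy: the `Incident` fields, the `is_major` function, and the 25 percent threshold are all illustrative assumptions, and each organization would choose its own criteria.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Incident:
    """Hypothetical incident record; the field names are illustrative."""
    affected_user_pct: float           # share of the user base affected, 0-100
    critical_service: bool             # touches a service on a critical list
    responder_escalated: bool = False  # a responder judges it to be major
    escalation_reason: str = ""        # required when escalating on judgment


# Illustrative threshold only; each organization would set its own.
USER_PCT_THRESHOLD = 25.0


def is_major(incident: Incident) -> Tuple[bool, str]:
    """Return (decision, reason) from objective rules plus a judgment override."""
    if incident.critical_service:
        return True, "objective: critical service impacted"
    if incident.affected_user_pct >= USER_PCT_THRESHOLD:
        return True, "objective: user impact threshold met"
    # Subjective path: allowed, but only when the responder records why.
    reason = incident.escalation_reason.strip()
    if incident.responder_escalated and reason:
        return True, f"judgment: {reason}"
    return False, "no criteria met"
```

Requiring a recorded reason keeps the subjective path open for incidents nobody anticipated while still leaving a trail that can be reviewed later.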
Perhaps the biggest challenge in defining a major incident actually comes from getting a good impact assessment. When assessing impact, it’s important to consider the known impacts, likely impacts, secondary impacts, and non-impacts.
Known impacts should weigh heavily in our impact assessment. These impacts can be observed, experienced, or defined in a tangible way. Known impacts are derived from empirical data. The impact assessment should really begin here.
Likely impacts are predicted or extrapolated from known impacts. When we consider the empirical data from known impacts, we are often able to identify other impacts that probably follow from the known impact. For example, if a service desk is suddenly inundated with phone calls about the failure of a particular service, it may be likely that all customers are impacted even if the service desk doesn’t have empirical data from every customer. Likely impacts are based on probability and should be validated as the situation permits. When a likely impact is validated through empirical evidence, it should be considered a known impact.
Secondary impacts are impacts by extension. In other words, what is the broader effect of the impact on other processes, services, and systems? Consider the simple example of database latency for a connected pricing and inventory system. A bar code is scanned and a request is sent to a database for a price lookup. This normally completes in a fraction of a second. Because of some as-yet-unexplained incident, the lookup now takes 10 seconds. At the grocery store, the checkout lines are growing. Each scan leads to a further delay. Customers become frustrated. Shoppers abandon their carts, filled with groceries, in front of the store near the registers. Store personnel then need to deal with the abandoned carts, the lost revenue, and the sluggish checkout process. Determining secondary impacts requires an understanding of how services are used in a broader context and what dependencies exist.
Non-impacts are one of the most ignored but useful data points in incident management. A non-impact is something that could have been impacted, or is closely related to something that was impacted, but is not itself impacted. This information helps us set good boundaries during our impact assessment and is useful for mapping impact, which is a more advanced topic. Non-impacts should be validated by empirical data when possible. Non-impacts are very useful in building a causal hypothesis and during cause testing and validation.
When all of these impact types are considered during impact assessment, it is much easier to determine if an incident truly is a major incident.
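The four impact types can be tracked as a simple record during the assessment. The sketch below is one possible shape, assuming hypothetical names (`ImpactType`, `Impact`, `ImpactAssessment`). It reuses the grocery store example; the evidence strings and the online ordering non-impact are invented purely for illustration. The one rule it encodes comes from the description above: a likely impact that is validated with empirical evidence becomes a known impact.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ImpactType(Enum):
    KNOWN = "known"
    LIKELY = "likely"
    SECONDARY = "secondary"
    NON_IMPACT = "non-impact"


@dataclass
class Impact:
    description: str
    kind: ImpactType
    evidence: str = ""  # empirical evidence, if any has been gathered


@dataclass
class ImpactAssessment:
    impacts: List[Impact] = field(default_factory=list)

    def add(self, description: str, kind: ImpactType, evidence: str = "") -> Impact:
        impact = Impact(description, kind, evidence)
        self.impacts.append(impact)
        return impact

    def validate(self, impact: Impact, evidence: str) -> None:
        """Record evidence; a validated likely impact becomes a known impact."""
        impact.evidence = evidence
        if impact.kind is ImpactType.LIKELY:
            impact.kind = ImpactType.KNOWN

    def of_kind(self, kind: ImpactType) -> List[str]:
        return [i.description for i in self.impacts if i.kind is kind]


# Illustrative data only, loosely based on the checkout example.
assessment = ImpactAssessment()
assessment.add("Price lookups take 10 seconds", ImpactType.KNOWN, "register timing")
all_stores = assessment.add("All stores affected", ImpactType.LIKELY)
assessment.add("Abandoned carts and lost revenue", ImpactType.SECONDARY)
assessment.add("Online ordering unaffected", ImpactType.NON_IMPACT, "test order placed")

# Later, evidence arrives that confirms the likely impact.
assessment.validate(all_stores, "reports received from every region")
```

Keeping non-impacts in the same record as the other types makes the boundaries of the incident visible alongside what is actually affected.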
Invoking the Major Incident Procedure
Among the most critical decisions in managing and responding to major incidents are who can declare an incident to be a major incident and what invoking the procedure actually means. The declaration should be made by an appropriate stakeholder who carries enough authority in the organization to get the attention of those who need to be involved in major incident activities.
While every organization is different, there are a few common considerations that should be made in the major incident procedure. Your procedure should account for key roles and responsibilities, communication management, technical response, and evidence preservation.
Because of the chaos often associated with a major incident, the roles and responsibilities should be clearly defined in your procedure. At a minimum, consider the following roles:
- A major incident manager, who ensures compliance with the procedure, assigns people to other roles, and coordinates the incident response.
- A scribe, who is responsible for keeping a historical timeline of all troubleshooting, communications, updates, and other activity during the major incident.
- A communications manager, who is responsible for drafting, reviewing, and coordinating internal and external communications related to the major incident.
- A technical manager, who is responsible for assembling and coordinating the technical resources to address the major incident.
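If your procedure is supported by tooling, a small check for unfilled roles can help in the first chaotic minutes. This is a minimal sketch under that assumption; the `Role` enum and `unfilled_roles` function are hypothetical names that simply mirror the four roles listed above.

```python
from enum import Enum
from typing import Dict, List


class Role(Enum):
    MAJOR_INCIDENT_MANAGER = "major incident manager"
    SCRIBE = "scribe"
    COMMUNICATIONS_MANAGER = "communications manager"
    TECHNICAL_MANAGER = "technical manager"


def unfilled_roles(assignments: Dict[Role, str]) -> List[Role]:
    """Return roles with no person assigned, in the order they are declared."""
    return [role for role in Role if not assignments.get(role, "").strip()]
```

Calling `unfilled_roles` with a partially filled mapping returns the roles the major incident manager still needs to assign.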
Managing communications is one of the most critical aspects of major incident management. Internal and external stakeholders will likely want constant updates. It is best to have a predetermined update or notification schedule. When notifications do not provide quality information, stakeholders are more likely to seek information from alternate contacts, which could disrupt recovery efforts.
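A predetermined schedule can be as simple as a fixed interval counted from the time the major incident is declared. The sketch below assumes that approach; the `update_schedule` function and the 30-minute interval in the usage note are illustrative choices, not a recommendation.

```python
from datetime import datetime, timedelta
from typing import List


def update_schedule(declared_at: datetime, interval_minutes: int, count: int) -> List[datetime]:
    """Return the times of the next `count` updates at a fixed interval."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    if count < 0:
        raise ValueError("count must not be negative")
    step = timedelta(minutes=interval_minutes)
    # The first update is due one full interval after the declaration.
    return [declared_at + step * n for n in range(1, count + 1)]
```

For example, `update_schedule(datetime(2024, 1, 1, 9, 0), 30, 3)` yields update times of 09:30, 10:00, and 10:30 on that day. Publishing these times in the first notification tells stakeholders when to expect the next update.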
Every effort should be made to ensure the proper technical response to the major incident. This will often require a dedicated technical conference bridge exclusively for the technical troubleshooting. While management personnel and customers may want access to this bridge, it should be tightly controlled to facilitate the best technical response to the incident.
Major incidents benefit from good evidence preservation. Sometimes there are legislative or regulatory reasons to preserve evidence. Even when these reasons do not exist, good evidence preservation allows for a proper retrospective to be conducted later. The best way to prevent recurrence and to improve is to gain a good understanding of exactly what happened and why it happened. Good analysis is always beneficial for improving the organization.
Major incident management is a much deeper subject than can be exhaustively covered in one article. For more information, feel free to contact me directly. If you’d like a template to help you build your major incident management procedure, reach out and I’ll send you one free of charge.