Anatomy of a Major-Incident Postmortem
A major system is down for the second time in as many weeks, customer orders aren’t being processed, and operations can’t provide an estimated recovery time. Sound familiar?
As a CIO, one of the most important steps you can take to prevent recurring outages or major incidents is to conduct rigorous, constructively focused postmortems.
A well-designed postmortem process can be used to develop comprehensive IT action plans and serve as a powerful building block in launching an overall service improvement program, which may also involve implementation of a best practice framework such as ITIL (Information Technology Infrastructure Library).
ITIL describes the process framework for Incident Management and Problem Management, both of which play key roles in minimizing user downtime. Although the ITIL framework endorses an incident postmortem process, it does not provide a detailed framework for it.
The following is a proven strategy for developing and implementing a postmortem process:
Probing for Contributing Causes
IT organizations are generally effective at assessing a system failure to identify a single root cause, such as a hardware failure or a missing security patch. An action plan to implement the missing patch, for example, should reduce the future risk from this particular type of failure.
In this case, however, the true root cause may remain unresolved because the effectiveness and reliability of the patch management process have not been investigated for gaps. By delving deeper into all of the contributing causes of a major incident, we may uncover a great deal of additional, highly valuable information.
The postmortem review is designed to probe for those other factors that contribute to impact and downtime.
The process looks at the full chronology of events that make up the incident life cycle, including factors such as change management processes, cross-group communications, training, documentation and human error.
A good review meeting asks such questions as:
Writing off an outage to a single, high-level root cause without this further analysis is like a coroner skipping the autopsy and listing as cause of death: “hit by a bus.”
Like an autopsy, a thorough postmortem review should look at all likely causes of a failure, including the organizational behaviors that may contribute to the failure and delays in resolution. Only this more comprehensive analysis will lead to an understanding of the often complex relationship between people, processes and events that come into play before and during a system failure.
Structuring the Incident Review Process
IT organizations are unlikely to conduct an effective, detailed analysis of contributing causes of a major incident without strong executive sponsorship and a dedicated process owner.
Due to the cross-organizational nature of the major incident review process and the required commitment of time and resources, successful implementation of a major incident postmortem process must start with sponsorship from the CIO and the senior IT management team.
Once sponsorship has been secured, the first step is to create a problem manager role and establish a problem review board to serve as a process development group. With the problem manager serving as the lead, the review board must first set the criteria for when an incident rises to the level of “major.”
ITIL defines major incidents as “those for which the degree of impact is extreme” and “for which the timescale of disruption — to even a relatively small percentage of users becomes excessive …”.
As a general rule of thumb, the following definition works well as a starting point for many organizations: Whenever a service impact occurs on a critical business system and extends for more than an hour.
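The rule of thumb above can be written down as a simple check, which helps when the criteria need to be applied consistently across teams. The following is only a minimal sketch: the function name, its parameters, and the idea of flagging critical systems with a boolean are illustrative assumptions, and the one-hour threshold is just the starting point suggested above.

```python
from datetime import datetime, timedelta

# Assumption: the one-hour threshold is the starting point suggested above;
# a review board would tune it for its own organization.
MAJOR_INCIDENT_THRESHOLD = timedelta(hours=1)


def is_major_incident(system_is_critical: bool,
                      impact_start: datetime,
                      impact_end: datetime,
                      threshold: timedelta = MAJOR_INCIDENT_THRESHOLD) -> bool:
    """Return True when a critical system was impacted for longer than the threshold."""
    if impact_end < impact_start:
        raise ValueError("impact_end must not precede impact_start")
    return system_is_critical and (impact_end - impact_start) > threshold


# Hypothetical timestamps, for illustration only.
start = datetime(2024, 3, 4, 9, 0)
print(is_major_incident(True, start, start + timedelta(minutes=90)))  # True
print(is_major_incident(True, start, start + timedelta(minutes=45)))  # False
print(is_major_incident(False, start, start + timedelta(hours=3)))    # False
```

Keeping the threshold as a parameter lets the review board revise the definition without touching the logic.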
The next step for the problem review board is to document the incident review process, meeting guidelines, and templates for capturing incident chronologies. Charter the problem review board to meet within three to five business days following a major incident.
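As a small illustration of the scheduling target, the sketch below computes the earliest and latest review dates for a given incident date. It assumes that business days are Monday through Friday, that public holidays are ignored, and that the count starts from the day of the incident; the function names are hypothetical.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Advance `days` business days from `start`, skipping Saturdays and Sundays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def review_window(incident_date: date) -> tuple[date, date]:
    """Earliest and latest review dates: three and five business days after the incident."""
    return add_business_days(incident_date, 3), add_business_days(incident_date, 5)


# Hypothetical incident on Thursday, 7 March 2024.
earliest, latest = review_window(date(2024, 3, 7))
print(earliest, latest)  # 2024-03-12 2024-03-14
```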
The problem manager is responsible for scheduling and managing the review meetings, capturing action items, and developing and executing service improvement plans.
It sends a powerful message if the CIO or another senior IT staff member volunteers to serve as executive chairperson and attends review meetings whenever possible. Executive involvement serves as a behavior model for the organization and reinforces the importance of the board’s role and the organization’s commitment to improving service.
Other attendees should include those technical managers and staff who were involved in the chronology of the outage.
Focusing the Incident Review Process
The postmortem process will fail miserably if the problem review board is used as a forum to identify the person or organization at fault. Although it’s tempting to place blame (especially if software vendors, contractors, or outsource vendors are involved), it will be impossible to gather all the pertinent facts when the people involved are focused on covering their tracks.
The problem manager, serving as chairperson, must ensure that fact-finding is objective, positive, and focused on the offending processes. To be successful, the meeting must stay focused on actions that could or should be taken going forward to reduce risk and recurrence.
Capturing the Incident Chronology
Ideally, an incident ticketing system serves as the repository for capturing information about the incident, and the ticket serves as the outage history. For smaller organizations that may not have a ticketing system, an email summary of events built from shift logs and participant notes can work just as well to facilitate the review.
Regardless of the method used, the importance of a good, high-level summary cannot be overstated. It serves as the instrument for zeroing in on the key questions underlying each review meeting:
Building the Action Plan
Analyzing the chronology yields a comprehensive action plan, which is then documented for follow-up. While some of the underlying causes may remain unknown at the time of the meeting, these can be captured as open action items to be closed when final research is completed.
An action item matrix that captures each action, the person assigned, and a due date for follow-up keeps the plan on track and, in turn, reduces future risk.
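An action item matrix can live in a spreadsheet or a ticketing system, but its shape is simple enough to sketch in a few lines. The example below is an illustration under stated assumptions, not a prescribed format: the `ActionItem` fields mirror the three columns named above, while the `closed_on` field, the overdue check, and the sample entries are hypothetical additions.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    action: str
    owner: str
    due: date
    closed_on: Optional[date] = None  # None while the item is still open

    def is_overdue(self, today: date) -> bool:
        """An item is overdue when it is still open and its due date has passed."""
        return self.closed_on is None and today > self.due


def overdue_items(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Return open items past their due date, earliest due date first."""
    return sorted((i for i in items if i.is_overdue(today)), key=lambda i: i.due)


# Illustrative entries only; not taken from a real incident.
matrix = [
    ActionItem("Audit patch management process for gaps", "Problem manager", date(2024, 3, 15)),
    ActionItem("Update escalation contact list", "Service desk lead", date(2024, 3, 20),
               closed_on=date(2024, 3, 18)),
    ActionItem("Research unresolved failover cause", "Platform team", date(2024, 3, 29)),
]

for item in overdue_items(matrix, today=date(2024, 3, 22)):
    print(f"{item.due}  {item.owner}: {item.action}")
```

Causes still unknown at the time of the meeting fit the same structure: they are simply open items whose action is the remaining research.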
The Postmortem in a Service Management Culture
In rolling out the process, reinforce with the organization from the outset that a postmortem process has only one goal: to drive service improvement.
An overview of the process, management expectations and the goal of the reviews should be discussed with all staff members before the first meeting. That goal should then be restated to participants at the start of each meeting.
In IT environments today, where systems typically underlie the organization’s most critical business functions, and where a system failure can mean revenue impact, missed business commitments and customer dissatisfaction, instituting a postmortem review process can be a positive early step on the road to establishing a service management culture.
Brian Corrington is the President and CEO of Codesic Consulting, an IT consultancy headquartered in Kirkland, Washington. Corrington has over twenty years of experience managing enterprise-scale IT projects, building technology practices, and developing strategic customer relationships.