Why Crises Can Be Good for IT
Simon couldn't believe it. Once again, the new, expensive system that was put in place over a year ago was on the fritz. The reporting engine, which normally produced reports automatically in about 20 seconds, was taking more than 60 minutes per report. Even worse, some reports were never coming out.
Nothing had changed; no patches had been installed. Everything was working fine, and all of a sudden performance slowed to a crawl for no apparent reason. On top of that, reactive measures that had worked before were no longer working: verifying memory consumption, recycling services, even rebooting servers.
A problem escalation team (PET) was set up to address the issue. Simon, a project manager for this multibillion-dollar, global company, headed the team. System administrators and DBAs were brought on board to monitor SQL queries, CPU usage, memory usage and so on. After two days the problem had not been resolved, so vendors were brought in as well: the database manufacturer, the vendor who delivered the application, and the vendor of the reporting engine. Various methodical diagnostic procedures began.
One of the procedures involved producing a report interactively instead of automatically, to see how much time it took to run. During one of those tests, Simon made a seemingly innocuous comment: "I wonder why we aren't seeing the X's in the red squares for that report." There was a brief moment of silence, but the comment was dismissed because everything else looked correct, and the report was produced in less than 20 seconds.
On the fourth day, an external consultant with expertise in the reporting engine was called upon. While looking over the test results, someone from the PET team recalled Simon's comment. The consultant looked deeper at the results and, sure enough, the missing X was the root cause of the problem. It was fixed quickly, the fix was uploaded to the production servers, and since then everything has been working smoothly.
From the moment the incident began until the time it was resolved, many members of the PET team were on conference calls and Live Meeting sessions for more than ten hours at a time. Yet during all this time nobody complained loudly, and in the end the problem was resolved relatively quickly.
Crises such as these are good for IT because they allow you to test the quality and strength of your teams and procedures. With proper resolution, they also help improve your systems infrastructure.
To get through a crisis, the following guidelines can help:
Focus on the business: During the system outage, there was little time and energy spent dealing with personal issues. According to Simon, "The most important thing is to get the system up and running and service the business." The reporting system oversees operations for over 70 different sites across the U.S. "Our job is to keep the business running. We impact a $1 billion business."
Expertise comes first: People were included on the PET team based on their expertise and skills, not on their job description or their project assignments. Some of the people present on the calls would normally not have worked on resolving the issue. People who had moved on to other projects were asked to drop what they were doing in order to help out. They did so gracefully.
Everybody contributes: There is always a person on-call who expects to be interrupted at any time to resolve system outages. However, that person is not responsible for fixing everything. If there are people with more appropriate skills, they can be brought on as needed.
Maintain continuous communication: While working to correct the situation, managers on the PET team were also responding simultaneously to instant messages and emails from site operators. Never did the PET team stop responding to the users' requests even though at times it seemed like nothing was moving forward. In parallel, executives were kept informed by way of progress reports issued during the day.
No blaming: According to Simon, "We don't typically have a blame culture. When things are going wrong, that is definitely not acceptable culture." People are held accountable for their mistakes but there is no finger pointing during the crisis. Post-mortems and root-cause analysis only occur once the incident has been resolved and the business is back to normal.
Keep a smile: Good humor and laughter were present at all times. Even in the thick of things, someone was always willing to crack a joke and get a chuckle from all the participants. None of the jokes were at the expense of other people.
Experience helps: People on the calls had been around long enough to know that it takes time to resolve issues. Although there was pressure to resolve the issue, there were no threats or calls for it to be resolved by a specific date or time.
Get outside help: In the end, the person who resolved the issue was not intimately familiar with the system. That distance allowed him to look at things from a different perspective and eventually find the cause and provide a cure. Sometimes you just need an external point of view to see what the real problem is.
The biggest benefit from crises can be the personal satisfaction of a job well done. For some members of the PET team, there was a feeling that a thorny issue had finally been resolved. For others, it was a change of pace from the usual, sometimes boring, administrative tasks. It was a chance to pull together as a team and face a challenge. Simon states, "If you talk to most of the guys involved, they probably enjoyed part of it. They didn't enjoy the inconvenience of missing out on other things. But did they actually enjoy the process? Yeah."
As for the late nights and early mornings spent on the phone trying to get the business back on its feet, Simon laughs about it: "As I always say, I'll sleep when I'm dead."
Laurent Duperval is the president of Duperval Consulting, which helps individuals and companies improve people-focused communication processes. He may be reached at firstname.lastname@example.org or 514-902-0186.