The phrase “Houston, we have a problem” became part of our cultural lexicon in April 1970. The Apollo 13 mission was en route to the Moon when an oxygen tank exploded. Flight director Gene Kranz canceled the Moon landing, and over the next few days engineers, flight surgeons, and other members of Mission Control worked feverishly to bring the three Apollo 13 astronauts (Jim Lovell, Jack Swigert, and Fred Haise) safely back to Earth.
Imagine what would’ve happened if no one had been sitting in Mission Control when the explosion occurred: the incident would have gone unnoticed, and the crew would almost certainly have perished. Fortunately, this was not the case. A representative for every key system in the command module and the lunar module (LM) was on hand in Mission Control, ready to react in the event of a major incident.
In the aftermath of the explosion, the astronauts were forced to power down the command module and retreat to the LM to save energy. Back in Mission Control, engineers monitored the LM’s systems, while the flight surgeons monitored the astronauts’ vital signs. Meanwhile, other engineers were busy developing a workaround solution to the problem of removing carbon dioxide from the LM so the astronauts could breathe safely. All of these resources working together made the hope for a successful return a reality.
So, what, exactly, do NASA and space exploration have to do with ITSM? In most organizations, IT has two relationships with the business it supports: first, it creates or facilitates the creation of new services, and second, it provides ongoing support for those services. The same can be said of the scientists, engineers, and contractors who created the systems that enabled man to travel into space and walk on the Moon. However, getting to the Moon and landing safely was just the first task; taking off and returning to Earth was the second task. Back at Mission Control, the support staff were on constant alert during every Moon mission. They were hyperfocused on the systems they’d created, and they provided top-level support to ensure that the systems performed as designed.
The same principle applies to ITSM. The level of focus should vary based on the criticality of the services to the business (not every NASA mission was a Moon mission, after all). Some services can be unavailable (under maintenance, perhaps) during a specific time frame without a significant impact on the business. Others are so critical that their unavailability can have a catastrophic impact. Whether a service or time frame is critical will vary based on the business or industry.
What is your business’s “moon-shot” event? If you’re a retailer, then Black Friday or Cyber Monday may be a day when a loss of service could create a substantial financial loss for your company; this would be a bad day for the point-of-sale systems to go down. If you’re a pizza chain, then Super Bowl Sunday may be your biggest day of the year; this would be a bad day for your online ordering system to suffer a degradation of responsiveness. If you’re a large university, move-in week might be a key event; with thousands of new students connecting to the network and accessing course and campus resources, resolving issues quickly is crucial.
IT provides services to the business, and ITSM enables IT to react accordingly when those services don’t perform as designed. It’s important that IT and the business are on the same page when it comes to key events, and that they approach those events with the same sense of urgency. This requires a high degree of collaboration, and a great way to collaborate with the business is to create a war room. A war room is a type of Mission Control; by creating a war room and staffing it with key experts, you’ll be able to react quickly when major (or even minor) incidents occur that disrupt the normal flow of business.
To determine who should participate in your war room, consider all of the different components that make up the critical systems used to support the business. For example, if you’re supporting a website that must be responsive, you may have representatives from the site development team, the database administration team, and the server infrastructure teams. You may also consider staffing the war room with a situation manager who can coordinate communication and remediation activities when incidents occur. Just as the Apollo 13 mission required the strong leadership of Gene Kranz, an ITSM war room also requires strong leadership to be successful.
Your IT department may already have a 24×7 operations team that monitors critical systems, so it might be hard to convince senior management to pull resources away from other activities to staff the war room (especially if those resources are more senior—and likely more expensive). To determine if a war room is even necessary, the following factors should be taken into account: First, consider the processes you already have in place for resolving major incidents during the normal course of daily operations. Does the resolution of these incidents require escalation to more senior IT resources? Are these resources located on- or off-site? If they’re off-site, how long does it take to reach them? Do they have access to the information they need to resolve major incidents? Second, consider your mean time to resolution for major incidents and the cost to the business on a minute-by-minute basis. During critical events, how much does the cost of an outage multiply? Armed with the answers to these questions, you’ll have the information you need to acquire the necessary resources to create a war room.
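The second factor lends itself to a back-of-the-envelope calculation. The following is a minimal Python sketch, not a prescribed method: the function names, the simple linear cost model, and every figure in the usage example are hypothetical assumptions chosen purely for illustration.

```python
def outage_cost(mttr_minutes, cost_per_minute, event_multiplier=1):
    """Estimated cost of one major incident.

    mttr_minutes     -- mean time to resolution, in minutes
    cost_per_minute  -- cost to the business per minute on a normal day
    event_multiplier -- how much that cost multiplies during a critical
                        event (assumption: 1 means a normal day)
    """
    return mttr_minutes * cost_per_minute * event_multiplier


def war_room_net_benefit(expected_incidents, mttr_without, mttr_with,
                         cost_per_minute, event_multiplier, staffing_cost):
    """Outage cost avoided by a war room, minus the cost of staffing it.

    Assumes the war room shortens resolution time from mttr_without to
    mttr_with and does not change the number of incidents.
    """
    cost_without = expected_incidents * outage_cost(
        mttr_without, cost_per_minute, event_multiplier)
    cost_with = expected_incidents * outage_cost(
        mttr_with, cost_per_minute, event_multiplier)
    return cost_without - cost_with - staffing_cost


if __name__ == "__main__":
    # Hypothetical figures: two major incidents expected during the event,
    # a 90-minute MTTR today versus 30 minutes with experts in the room,
    # 1,000 per minute on a normal day, five times that during the event,
    # and 50,000 to staff the war room.
    benefit = war_room_net_benefit(
        expected_incidents=2,
        mttr_without=90,
        mttr_with=30,
        cost_per_minute=1000,
        event_multiplier=5,
        staffing_cost=50000,
    )
    print(f"Estimated net benefit of the war room: {benefit}")
```

With these made-up inputs, the estimate comes to 550,000. A result at or below zero would suggest that, under the same assumptions, your existing escalation process is sufficient. The real value of the exercise is that it forces you to put numbers on your mean time to resolution and your cost per minute before you make the case to senior management.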
In general, war rooms should be planned events that are in alignment with critical business events. Sometimes, however, you may need to pull together an unplanned war room to resolve a specific problem. In the Apollo 13 crisis, for example, expert engineers were brought together to build a new air filtration system made out of materials available to the astronauts in the command module and LM. The astronauts were running out of breathable air, so the engineers had a limited amount of time to find a solution. The same type of need, albeit a less dramatic one, can arise in business situations. If a critical event is on the horizon and a problem must be remediated quickly before that event, then it makes sense to gather all the necessary resources into a war room for the express purpose of solving that particular problem. For example, if it’s discovered one week before Cyber Monday that your retail website will not be responsive under high traffic, it might be necessary to create a war room to make the appropriate modifications to the site in advance. In this example, the war room may bring together application designers and developers, server engineers, and database administrators.
For planned war rooms, you must start your preparations early, well in advance of any events you’ve identified that may require additional resources. In addition to determining who will be present in the war room, you also have to determine what they’ll be doing. You must also ensure that there are monitoring systems in place to quickly alert participants to any unforeseen abnormal conditions.
If you discover that you require multiple war rooms throughout the year, you may want to consider instituting a virtual war room. Rather than bringing geographically distributed participants on-site, which can be cost-prohibitive, among other things, bring them together virtually. In a virtual war room, participants are actively monitoring systems and situations, albeit remotely, and any incident will be met with a quick response. This requires communication channels to be set up in advance, such as conference lines or multiuser chat sessions, but it can work especially well when support is required after hours or on the weekend. (To be clear, this type of war room goes beyond normal “on-call” activity, in which resources may first have to be contacted by phone, delaying your response to the event.)
If your organization is large, your war room participants may not be involved in day-to-day ITSM activities. Because they may not be familiar with your ITSM tools or your regular support procedures, this is a great opportunity to not only familiarize them with normal support processes, like incident management, but also show them how the decisions they make in their project work (e.g., the design and creation of systems) can have an impact on the production environment.
Aside from the immediate benefit of reducing the impact of incidents during crucial periods, there are several long-term advantages to staffing a war room. First, you may develop improved monitoring capabilities that are reusable for normal operations. Second, you can increase the knowledge and capabilities of the incident management staff by having regular interaction between all participants. It’s important that the resources involved in daily ITSM activities, whether they’re on the service desk or in your operations center, are involved in all war room activities, especially if you’re staffing it with more experienced resources.
Imagine the possibilities: an overnight operator might learn new ways to monitor a critical database system; a service desk technician might learn new techniques for troubleshooting and resolving issues with custom software. Both the increase in monitoring capabilities and the increased capabilities of resources can have a dramatic effect on incident resolution in the long term. Finally, the most important people you can ask to participate in the war room may be nontraditional IT resources from the business you support. They can help draft communications, assess the impact of incidents, or make recommendations for the remediation of incidents.
You may not be in the business of putting men on the Moon, and a business-critical incident may not involve the potential for loss of life, but by applying some of the principles and lessons learned from the space program to ITSM in the form of a war room, you can position your business to achieve the greatest success.
Philip DeYoung has more than twenty years of experience in the IT industry. He currently serves as senior production support manager for store systems at GameStop, the world’s largest video game retailer. Philip manages the store service desk, ensuring that all systems remain operational so there are no disruptions to store sales. Philip currently holds an Intermediate-level ITIL certification in Operational Support and Analysis, as well as his HDI Support Center Manager certification.