Damage control (n.): Minimizing negative impact when something bad happens.
When I ask my clients about their problem management processes, I often get a curious response: “We don’t really have the time to focus on problems because we’re so busy putting out fires. Once we knock down the fires that flare up, we’ll be able to devote some time to problem management.” A response like that tells me the client probably doesn’t understand incident or problem management. If they wait until they “knock down the fires that flare up,” they’ll probably never start doing any problem management. The real irony is that problem management actually eliminates the fires.
Those fires are actually recurring incidents: incidents that repeat, often many times. Why do they repeat? Because we use incident management to deal with them, and incident management only treats the symptoms of the incident, not the causes. If you only treat the symptoms, the incident will continue to flare up.
It’s like getting a toothache and treating it with ibuprofen. The pain will subside, but the effect is only temporary. You’ll have to take more ibuprofen when the pain returns. This cycle will continue until you eventually make an appointment with your dentist to determine the cause of the pain. Investigating the cause of the pain associated with the toothache is problem management; treating the pain with ibuprofen is incident management.
Incident and problem management can be confusing concepts, so let’s break the concepts down and examine their key features.
Process Ownership
A process is a structured set of activities designed to accomplish specific goals or objectives. Both incident management and problem management are processes.
For the purpose of accountability, every process requires an owner, someone who fully understands the process and is accountable for everything that happens within it. The appointed owner needs to be high enough in the organization to champion the process and have sufficient credibility and authority to direct the work of others who may not report to them.
Both incident and problem management follow a chronological set of activities designed to resolve incidents and problems, and each uses specific techniques to achieve its objectives.
Incident Management
According to ITIL, an incident is defined as “an unplanned interruption to an IT service or a reduction in the quality of an IT service.” We use incident management to remediate incidents.
The purpose of incident management is “to restore normal service operation as quickly as possible,” thus minimizing any negative impact on the business. Overall, we’re trying to maintain agreed-upon levels of service quality and availability so the business can successfully achieve its goals. These agreed-upon levels of service are usually written into the service level agreements we make with our customers.
In order to restore a service as quickly as possible, we must restrict ourselves to treating the symptoms. If we try to find the cause of the incident, this incident might last much longer, which, in turn, might have a greater negative impact on the business. The downside of treating the symptoms is that we run the risk of recurrence.
Problem Management
ITIL defines a problem as “the underlying cause of one or more incidents.” The purpose of problem management is “to manage the lifecycle of all problems from first identification through further investigation, documentation, and eventual removal.”
Consider the possible cost savings any time you eliminate a recurring incident. Each time the incident recurs, it generates one or more calls to your service desk. Each phone call could cost $15–25, not to mention the cost of lost productivity for the end users affected by the incident. And those are regular incidents; imagine the potential cost of a major incident (i.e., an incident with a much greater impact or urgency than a normal incident).
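To make the arithmetic concrete, here is a minimal sketch in Python. The $15–25 per-call range comes from the text above; the function name, the recurrence count, and the calls per recurrence are hypothetical figures chosen only for illustration, and the estimate ignores lost end-user productivity.

```python
def recurring_incident_cost(recurrences, calls_per_recurrence,
                            cost_per_call_low=15.0, cost_per_call_high=25.0):
    """Return a (low, high) estimate of service desk call costs.

    Only the cost of the calls is counted; lost productivity is not.
    """
    total_calls = recurrences * calls_per_recurrence
    return total_calls * cost_per_call_low, total_calls * cost_per_call_high


# Hypothetical example: an incident that recurs 12 times a year
# and generates 8 calls to the service desk each time.
low, high = recurring_incident_cost(recurrences=12, calls_per_recurrence=8)
print(f"Estimated call cost: ${low:,.0f} to ${high:,.0f} per year")
```

With these assumed numbers, a single recurring incident accounts for 96 calls a year, which is the cost a permanent fix would remove.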
Problem management gives us the tools we need to diagnose the root cause of incidents, correct the errors we discover, initiate the implementation of a permanent fix, and collect information about problems and workarounds. Workarounds are methods or techniques we can use to temporarily reduce or eliminate the impact of an incident or a problem until the permanent fix can be implemented. Incident management can then reuse these workarounds as part of an activity called incident matching.
Here’s how incident matching works: The service desk agent searches the known error database (KEDB), trying to match the symptoms of the incident being investigated with the symptoms of previous incidents captured in the KEDB. Owned, managed, and maintained by problem management, the KEDB records workarounds that were used to overcome previous incidents. If incident matching is successful, the service desk agent will apply the workaround to the new incident. Incident matching improves the effectiveness of the incident management process, increasing the speed of resolution and reducing downtime and lost productivity.
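Incident matching can be sketched in a few lines of Python. This is only an illustration, not how any particular ITSM tool does it: the `KnownError` record, the `match_incident` function, the sample KEDB entries, and the 0.5 overlap threshold are all assumptions, and the symptom comparison is a simple set-overlap (Jaccard) score rather than a real search engine.

```python
from dataclasses import dataclass


@dataclass
class KnownError:
    """A hypothetical KEDB entry: symptoms seen before and the workaround used."""
    error_id: str
    symptoms: set
    workaround: str


def match_incident(incident_symptoms, kedb, min_overlap=0.5):
    """Return the known error whose symptoms best overlap the incident's, or None."""
    wanted = {s.lower() for s in incident_symptoms}
    best, best_score = None, 0.0
    for record in kedb:
        known = {s.lower() for s in record.symptoms}
        union = wanted | known
        # Jaccard similarity: shared symptoms divided by all symptoms seen.
        score = len(wanted & known) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = record, score
    return best if best_score >= min_overlap else None


# Hypothetical KEDB contents.
kedb = [
    KnownError("KE-001", {"vpn disconnects", "slow login"}, "Restart the VPN client"),
    KnownError("KE-002", {"printer offline", "print queue stuck"}, "Clear the print spooler queue"),
]

match = match_incident({"Print queue stuck", "printer offline"}, kedb)
if match:
    print(f"Apply workaround from {match.error_id}: {match.workaround}")
```

When no record overlaps enough, the function returns `None`, which corresponds to an incident that has to be diagnosed from scratch.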
In addition to developing workarounds, problem management also investigates the root causes of incidents. There are several triggers for initiating root cause analysis; these triggers can be either proactive or reactive.
Here are some of the reactive triggers:
- Event monitoring tools detect conditions of a fault and automatically alert problem management to those conditions.
- An incident recurs and is noticed by the service desk or other technical staff.
- A major incident occurs and problem management is notified. The problem investigation can run parallel to the incident investigation, or it can be done after the incident has been resolved.
- A third-party supplier alerts the service desk to a problem with its hardware, software, or service.
Here are some of the proactive triggers:
- Trend analysis of previous incident records reveals patterns that indicate a fault. (This type of analysis should be done on a regular basis, as it can prevent recurring incidents.)
- Stakeholders undertake activities to improve the quality of a service, using problem investigations to identify other improvement actions that might be appropriate.
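The trend analysis listed among the proactive triggers can start very simply. The sketch below, in Python, counts past incidents by service and category and flags any combination that has occurred at least a given number of times. The record layout, the field names, and the threshold of three are assumptions made for illustration; real incident records would come from your service desk tool.

```python
from collections import Counter


def recurring_candidates(incidents, threshold=3):
    """Return (service, category) pairs seen at least `threshold` times, most frequent first."""
    counts = Counter((i["service"], i["category"]) for i in incidents)
    return [(key, n) for key, n in counts.most_common() if n >= threshold]


# Hypothetical incident history.
incidents = [
    {"service": "email", "category": "mailbox full"},
    {"service": "vpn", "category": "disconnect"},
    {"service": "email", "category": "mailbox full"},
    {"service": "email", "category": "mailbox full"},
    {"service": "vpn", "category": "disconnect"},
]

for (service, category), count in recurring_candidates(incidents):
    print(f"{service} / {category}: {count} incidents, consider opening a problem record")
```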
Once the problem management process owner decides to start an investigation, whether reactively or proactively, she should open a problem record and begin logging pertinent information such as user, service, hardware, and software details; priority; categorization details; a description of the issue; cross-references between other incidents and the incident being investigated; and all steps taken during the investigation and diagnosis process. When the problem has been categorized, the process owner should then engage a subject matter expert (SME) to conduct the investigation.
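As a rough illustration, the information listed above could be captured in a structure like the following Python sketch. The class and field names are hypothetical and do not correspond to any particular ITSM tool's schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProblemRecord:
    """A hypothetical problem record holding the details described in the text."""
    problem_id: str
    description: str
    priority: int
    category: str
    affected_users: list = field(default_factory=list)
    services: list = field(default_factory=list)
    hardware: list = field(default_factory=list)
    software: list = field(default_factory=list)
    related_incidents: list = field(default_factory=list)
    investigation_log: list = field(default_factory=list)

    def log_step(self, note):
        """Append a timestamped note so every investigative step is recorded."""
        self.investigation_log.append((datetime.now(timezone.utc), note))


# Hypothetical usage.
record = ProblemRecord(
    problem_id="PRB-1",
    description="Print queue stalls on the second-floor printer",
    priority=2,
    category="hardware/printing",
    related_incidents=["INC-101", "INC-117"],
)
record.log_step("Assigned to printing SME for investigation")
```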
Investigative Techniques
Problem management uses a number of investigative techniques. Here are a few of the most common ones.
Brainstorming
To get some insight into the problem being investigated, it can be helpful to conduct a brainstorming session involving relevant stakeholders and SMEs. The facilitator should share the details of the problem with the group and then ask the participants to share their ideas about potential causes and build on each other’s ideas without criticism or judgment. The facilitator should jot down each idea and the contributor’s initials on a flip chart or whiteboard, leaving room after every idea to add details later. This part of the meeting should progress quickly; the goal is to capture as many ideas as possible in a relatively short time.
Once the flow of ideas begins to subside, the facilitator should then go back to the beginning of the list and ask each of the original contributors to explain their ideas. This is the time for details and discussion. Through the discussions, many of the ideas will be eliminated and only the more logical and probable causes will remain.
Ishikawa Diagrams
Kaoru Ishikawa was a leader in Japanese quality control, and he developed a unique way of diagramming causes and effects during brainstorming sessions. The diagram is often referred to as a fishbone diagram because of its shape.
The problem under investigation should be documented in the head of the fish. While brainstorming, ideas should be grouped into broad categories in the bones of the fish. Once the diagram is complete, the facilitator should ask participants to rank the top causes based on their knowledge and experience. Problem management should then begin investigating or testing the top-ranked causes to isolate the true cause.
Pain Value Analysis
If the number of problems requiring investigation exceeds the time available for those investigations, you will need to prioritize them and decide which problems to address first. This may require an in-depth analysis to determine the level of pain being felt by the business and/or end users. You may want to create a mathematical formula to measure the pain based on the number of people affected, the length of the outage or downtime, or the cost of the outage (if that can be calculated). You could also have some of your key end users look at the list to help you effectively prioritize your investigations to meet their needs and the needs of the business.
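One possible formula, sketched in Python, converts people affected and downtime into a cost of lost person-hours and adds any directly calculated outage cost. The formula itself, the function names, the $30 hourly rate, and the sample problems are all assumptions for illustration; your organization would choose its own inputs and weighting.

```python
def pain_value(people_affected, downtime_hours, cost_per_person_hour, outage_cost=0.0):
    """Estimate pain as lost person-hours priced at an hourly rate, plus direct outage cost."""
    return people_affected * downtime_hours * cost_per_person_hour + outage_cost


def rank_problems(problems):
    """Return the problems sorted from most to least painful."""
    return sorted(problems, key=lambda p: p["pain"], reverse=True)


RATE = 30.0  # hypothetical cost of one lost person-hour

# Hypothetical problems awaiting investigation.
problems = [
    {"id": "PRB-1", "pain": pain_value(200, 0.5, RATE)},
    {"id": "PRB-2", "pain": pain_value(10, 4, RATE)},
    {"id": "PRB-3", "pain": pain_value(50, 1, RATE, outage_cost=500)},
]

for p in rank_problems(problems):
    print(p["id"], p["pain"])
```

A score like this is only a starting point; reviewing the ranked list with key end users, as suggested above, helps catch pain the numbers miss.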
Kepner-Tregoe
Developed by Charles Kepner and Benjamin Tregoe, this technique involves the following steps:
- Define the Problem. The precise definition should explicitly identify how the problem deviates from the norm (or agreed-upon service levels, if you have a service level agreement).
- Describe the Problem. The problem should be described in terms of its identity, location, time, and size.
  - What’s not functioning to specification?
  - Where is this happening?
  - When did this start?
  - How big is the problem?
  - How many people/parts/services are affected?
  The answers to these questions will reveal what’s wrong. Compare the affected product/service to something that’s similar but working correctly, and then search for relevant differences between the two.
- Identify Possible Causes. The list of differences identified in the previous step should reveal possible causes.
- Test the Most Probable Cause. Each of the possible causes must be assessed to determine which could be causing all of the problem’s symptoms.
- Verify the True Cause. The final step involves testing the remaining possible causes to verify which is the true cause (e.g., by implementing a change or replacing a part).
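The comparison at the heart of the technique, setting the affected item beside a similar one that works and listing what differs, can be sketched in Python as a simple attribute diff. The function name and the sample attributes are hypothetical; in practice the attributes would come from your configuration records.

```python
def relevant_differences(affected, working):
    """Return attributes whose values differ, as {name: (affected_value, working_value)}."""
    keys = sorted(set(affected) | set(working))
    return {
        k: (affected.get(k), working.get(k))
        for k in keys
        if affected.get(k) != working.get(k)
    }


# Hypothetical workstations: one failing, one similar but working correctly.
affected = {"os_patch": "2024-05", "driver": "v2.1", "site": "Denver"}
working = {"os_patch": "2024-05", "driver": "v2.0", "site": "Denver"}

for name, (bad, good) in relevant_differences(affected, working).items():
    print(f"{name}: affected={bad}, working={good}")
```

Each difference is only a possible cause; it still has to be tested against all of the problem's symptoms and verified, as the last two steps describe.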
Once the root cause has been identified, the SME should develop and implement a temporary workaround, record that information in the KEDB, and notify the incident manager of the results of the investigation. Be sure to direct incident management to the KEDB for the workaround, in case more incidents occur before the permanent fix can be applied.
Permanently “Fixing” the Problem
To apply a permanent “fix” to a problem’s root cause, problem management needs to fill out a Request for Change and submit it to change management. Change management will then evaluate the recommended solution and decide whether to approve the implementation of the change.
Once you begin permanently eliminating incidents, you can devote even more time to problem management. In the long term, you’ll experience fewer incidents and disruptions, increase cost savings and productivity, and deliver higher levels of service quality and availability.
Jim McKennan is an experienced senior ITSM consultant with a solid background in managing support teams and delivering ITIL training. Possessing the highest level of ITIL certification currently available (ITIL Expert), Jim is committed to delivering effective consulting and educational services to clients based upon an in-depth knowledge of and experience with ITSM best practices. Jim is a writer and blogger, and he’s a regular speaker at local, regional, and national events.