In the March/April issue of SupportWorld, Jim McKennan provided an overview of some of the activities related to problem management, including incident matching and root cause analysis. In this article, I will build upon that foundation by providing additional insights into the problem management process.
The Scope of Problem Management
The scope of problem management includes two aspects: reactive problem management and proactive problem management. Reactive problem management focuses on solving problems in response to one or more incidents as they occur; proactive problem management focuses on identifying and solving problems and known errors that might otherwise be missed, thereby preventing future incidents. Typically, organizations that are new to problem management initially channel their energies and resources toward reactive problem management. As organizations mature, their focus should shift to proactive problem management, which reduces the likelihood of potential outages and minimizes the impact of problems when they occur. However, few organizations make this shift because they find it hard to quantify the benefits of a process that can be perceived as fixing potential problems, not real ones.
Trending analysis of incident tickets should be one of the first proactive steps undertaken by organizations, as it can be accomplished by anyone with good analytical skills (that is, it doesn’t require technical resources, which may already be stretched). The analyst should be looking for repeat incidents with the same category, affected service, configuration item (CI), cause, or resolution. Mining your knowledge base can also help identify potential problems. Other sources of potential problems include:
- Known errors identified by application development release and deployment teams
- Reports generated by application or system software (system or activity logs)
- Service/supplier review meetings
The potential problems identified from these analyses should then be forwarded to problem analysts for further investigation.
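As a rough illustration of the trending analysis described above, the following Python sketch counts incidents that share the same category and configuration item and flags the combinations that repeat. The ticket IDs, field names (`category`, `ci`), sample values, and the threshold are all assumptions made for this example; a real service desk tool will have its own export format and its own definition of a "repeat."

```python
from collections import Counter

# Hypothetical incident tickets; IDs, field names, and values are made up for illustration.
INCIDENTS = [
    {"id": "INC001", "category": "email", "ci": "mail-server-01"},
    {"id": "INC002", "category": "email", "ci": "mail-server-01"},
    {"id": "INC003", "category": "network", "ci": "switch-07"},
    {"id": "INC004", "category": "email", "ci": "mail-server-01"},
    {"id": "INC005", "category": "printing", "ci": "print-queue-02"},
    {"id": "INC006", "category": "network", "ci": "switch-07"},
]


def find_repeat_incidents(tickets, fields=("category", "ci"), threshold=3):
    """Return (key, count) pairs for field combinations seen at least `threshold` times.

    The result is ordered from most to least frequent, so the strongest
    candidates for problem investigation come first.
    """
    counts = Counter(tuple(ticket.get(f) for f in fields) for ticket in tickets)
    return [(key, n) for key, n in counts.most_common() if n >= threshold]


if __name__ == "__main__":
    # Flag any category/CI pair that shows up in two or more incidents.
    for key, n in find_repeat_incidents(INCIDENTS, threshold=2):
        print(dict(zip(("category", "ci"), key)), "->", n, "incidents")
```

The same function can be pointed at other fields mentioned above (affected service, cause, or resolution) by changing the `fields` argument.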
Problem Management Activities
Whether your organization favors reactive or proactive problem management, the activities associated with problem management can be grouped into four major categories.
Detection and categorization
These activities focus on identifying, logging, and classifying problems, and they’re similar to the initial activities performed by incident management. Correctly logging and categorizing incidents and problems is essential as it facilitates more effective incident and problem matching; indicates the business impact, which can subsequently be used to determine the priority of the problem and decide whether to proceed with problem investigation and diagnosis; helps to ensure that the appropriate resources are assigned to the problem; and helps accurately identify trends for proactive problem management. Some of the common fields or data that are captured when logging problems are:
- Problem source
- Assignee
- Priority
- User(s) affected
- Service(s) affected
- Location(s) affected
- Suspected IT component(s) at fault
- Date and time initially logged
- Incident or event trigger details (linkages)
- Details of all diagnostic or attempted recovery actions taken
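For teams that keep problem records in a homegrown tool or a script, the fields listed above could be modeled along the following lines. This is only a sketch: the class name, attribute names, types, and the numeric priority scale are assumptions, not a schema prescribed by ITIL or by any particular product.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class ProblemRecord:
    """Illustrative problem record; attribute names mirror the fields listed above."""
    source: str                      # problem source, e.g. "trend analysis"
    assignee: str
    priority: int                    # assumed scale: 1 = highest
    users_affected: List[str]
    services_affected: List[str]
    locations_affected: List[str]
    suspected_components: List[str]  # suspected IT component(s) at fault
    logged_at: datetime              # date and time initially logged
    linked_incidents: List[str] = field(default_factory=list)    # incident or event triggers
    diagnostic_actions: List[str] = field(default_factory=list)  # diagnostic or recovery actions taken


if __name__ == "__main__":
    record = ProblemRecord(
        source="trend analysis",
        assignee="tier 2 analyst",
        priority=2,
        users_affected=["finance department"],
        services_affected=["email"],
        locations_affected=["head office"],
        suspected_components=["mail-server-01"],
        logged_at=datetime(2014, 5, 1, 9, 30),
    )
    # Linkages and diagnostic notes accumulate as the investigation proceeds.
    record.linked_incidents.append("INC001")
    record.diagnostic_actions.append("Restarted mail transport service")
    print(record)
```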
Investigation and diagnosis
It’s during investigation and diagnosis that incident matching and root cause analysis occur. These activities focus on identifying root causes and transforming problems into known errors (from an ITIL perspective, you have a known error when you’ve identified the root cause and a workaround). Root causes can generally be classified into four major categories: physical causes (components failed), system errors (software failed), human causes (someone did something wrong or failed to do something they should have), and organizational causes (a process, policy, or procedure is in error).
The six major activities associated with investigation and diagnosis include:
- Defining the problem in terms of what, where, when, and significance
- Updating the known error record
- Collecting data that supports or points to the causal factors that created the problem
- Analyzing the data and identifying possible causes
- Identifying the root cause (i.e., a cause to which you can attach a solution)
- Documenting the analysis done to reach the conclusion
Problem resolution
The focus here is on identifying, approving, applying, and validating permanent fixes to problems and known errors. There are two major activities associated with problem resolution: solution identification and solution implementation.
Solution identification includes all of the actions taken to determine a permanent solution to a problem or known error:
- Researching and identifying possible solutions
- Choosing a solution
- Obtaining approval to proceed with the development of the proposed solution
- Developing the proposed solution
- Testing the proposed solution
- Submitting a request for change to change management for approval to implement the identified solution
- Determining problem-prevention actions to take
Solution implementation includes all of the actions taken to approve, implement, and validate the proposed solution to the problem or known error:
- Obtaining approval to implement the proposed solution
- Implementing the proposed solution
- Verifying that the solution corrected the error
- Executing problem-prevention activities
- Updating the knowledge base or known error database with resolution information
Problem closure
These activities focus on closing problems, known errors, and related incidents with updated and reusable information. The activities associated with problem closure include:
- Verifying the problem and known error records are updated, correct, and complete
- Closing the problem or known error records when the change has been implemented and the solution has been verified
- Updating the status of related open incidents at the time of problem and known error record closure
- Conducting a postimplementation review for capturing lessons that can be applied to future problems
There’s one more activity associated with problem management, although it’s conducted less frequently: the major problem review. These reviews are similar to major incident reviews, except they’re performed on problems where the impact was significant enough that management decides to review the process, the actions taken, and the tools. The objective is to identify what went well, what didn’t go well, what could be improved for the future, and how the organization can prevent recurrence, with the overarching goal of improving future outcomes.
Problem Management Roles
There are three primary roles involved in problem management: process owner, problem manager, and problem analyst.
Process Owner
The process owner is typically a senior-level manager within the IT organization who has complete accountability and responsibility for the problem management process. In essence, these individuals own and maintain the problem management process. They provide input into the process design and scope and approve the final product. They’re responsible for defining appropriate critical success factors (CSFs) and key performance indicators (KPIs) for measuring the process, and they’re responsible for reviewing and approving the process documentation that’s to be used throughout the problem management process.
Problem Manager
The problem manager is responsible for the day-to-day operation of the problem management process. In many organizations, the problem manager is also known as the process manager, the problem coordinator, or the problem queue manager. Some of the problem manager’s responsibilities include addressing and resolving issues with process operation and execution, monitoring the progress of problems and known errors, ensuring target resolution times are met, and reviewing and approving proposed workarounds.
Problem Analyst
Problem analysts are members of the support groups who are assigned problems. In most IT organizations, these are tier 2 and tier 3 personnel. They’re responsible for receiving and working assigned problem records, identifying and documenting the root cause of problem records through the use of root cause analysis techniques, and identifying workarounds. They are also responsible for coordinating or facilitating the testing and resolution of problems and known errors by submitting requests for change to change management. Problem analysts need to be analytical, able to perform trend analysis, and have good problem-solving skills. Additionally, they should receive training on root cause analysis techniques.
Ideally, problem management staff are separate from incident management staff. If resources are shared between incident and problem management, incident management will get the lion’s share of the attention (resources and time), at the expense of problem management. In smaller organizations, this separation may not be possible; even then, the activities themselves should at a minimum be kept separate.
Problem Management Metrics
It’s important to measure both the effectiveness and the efficiency of the problem management process. Common CSFs and KPIs for problem management include:
- CSF: Improving service quality
  - KPI: An increase in the percentage of proactive changes submitted by problem management
  - KPI: A reduction in the number of incidents over time
- CSF: Minimizing the impact of problems
  - KPI: An increase in first call resolution through the use of workarounds
  - KPI: A reduction in the average time to implement fixes
- CSF: Resolving problems effectively and efficiently
  - KPI: A reduction in the backlog of open problems
  - KPI: An increase in the number of problems that met or exceeded their target resolution times
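Two of these KPIs, the backlog of open problems and the share of problems resolved within their target times, are straightforward to compute from problem records. The sketch below shows one way to do so in Python; the record layout (`opened`, `resolved`, `target_hours`), the sample data, and the choice to express the second KPI as a percentage of resolved problems are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical problem records; "resolved" is None while a problem is still open.
PROBLEMS = [
    {"id": "PRB001", "opened": datetime(2014, 1, 1, 9), "resolved": datetime(2014, 1, 2, 9), "target_hours": 48},
    {"id": "PRB002", "opened": datetime(2014, 1, 3, 9), "resolved": datetime(2014, 1, 8, 9), "target_hours": 72},
    {"id": "PRB003", "opened": datetime(2014, 1, 5, 9), "resolved": None, "target_hours": 48},
    {"id": "PRB004", "opened": datetime(2014, 1, 6, 9), "resolved": datetime(2014, 1, 7, 9), "target_hours": 24},
]


def open_backlog(problems):
    """Count problems that have not yet been resolved."""
    return sum(1 for p in problems if p["resolved"] is None)


def pct_met_target(problems):
    """Percentage of resolved problems that were resolved within their target time."""
    resolved = [p for p in problems if p["resolved"] is not None]
    if not resolved:
        return 0.0
    met = sum(
        1
        for p in resolved
        if (p["resolved"] - p["opened"]).total_seconds() <= p["target_hours"] * 3600
    )
    return 100.0 * met / len(resolved)


if __name__ == "__main__":
    print("Open backlog:", open_backlog(PROBLEMS))
    print("Met target resolution time: %.1f%%" % pct_met_target(PROBLEMS))
```

Because the KPIs are phrased as trends (a reduction, an increase), these figures only become meaningful when they are calculated for successive reporting periods and compared.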
The Keys to Success
There are a number of keys to a successful problem management process. First, make sure you have the support of senior IT leadership. Second, establish a clear vision and purpose (what you’re trying to accomplish and why), as well as a clearly defined, documented, and communicated process. Third, make sure you have an effective incident management process, and clearly define the relationship between incident and problem management. This entails defining roles and filling those roles with the right people who have the right skills (including root cause analysis). Finally, make sure you have well-defined CSFs and KPIs, and a clear management reporting process. If you do all of these things, you’ll be well on your way to having an effective problem management process.
Buff Scott III has more than thirty years of experience in the IT industry. He’s a versatile leader with extensive management experience, and he’s an accredited ITIL v3 Expert, ITIL Trainer, and HDI Faculty member. Among his many other skills and accomplishments, Buff’s been designing and implementing ITIL processes since 2001, and he specializes in business and IT process reengineering. He currently delivers the HDI Problem Management Professional course, developed in partnership with Propoint Solutions, Inc.
Buff will be delivering the HDI Problem Management Professional pre-conference workshop at FUSION 14 (Pre-5).