Date Published January 19, 2023 - Last Updated January 20, 2023
Look at the incident management practice in any organization, and you’ll find it hasn’t changed much in the last 25 years. The focus of many organizations is to keep improving the practice by offering self-service capabilities, knowledge, and effective response to incidents. Problem management is where the permanent recovery, resolution, and prevention of future known issues occur.
But what if we didn’t separate the two? What if part of incident management was the proactive resolution of an incident?
The problem with incident management
Consider the employee experience with incident management today. They run into a problem: perhaps they can't print or save a document, or they hit the ubiquitous blue screen. They call the service desk or log a ticket, and eventually (within one to two days), someone reaches out to them.
They spend time troubleshooting the issue. If they need a repair, the part arrives in a day or two, and they wait while someone installs it. If it needs replacement, they’re told to request a new machine, which requires approval and takes two weeks to provision, while they limp along.
Think about this experience and the amount of downtime and resolution time they have invested in an activity that takes them away from doing their job. The technician may say, “We met our SLA,” yet the employee is dissatisfied.
This is the disconnect Corporate IT faces today. We think the user should be happy if we do all of this within the service level agreement, but we don’t even consider their lost productivity time. We even look great on paper because we met our SLA.
Incident management, as a practice, doesn't consider the user experience of the intake and resolution processes. Meanwhile, the user may be unable to bill for their time, complete a sale, or work with a patient while their equipment malfunctions. This is why experience-level agreements are starting to replace service-level agreements. However, even here, if you agree to a particular experience, you need the technical ability to deliver it. Observability is the practice that provides that ability.
Observability and proactive incident management
According to Wikipedia, observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. The concept derives from work in control theory done by R. E. Kálmán in the 1960s.
Observability combines monitoring data with logs, traces, other metrics, and configuration information to provide a complete picture of a device or application's health. With observability, the data provided by monitoring systems, the application or device's logs, and information about its expected configuration can be compared automatically using AI and machine learning algorithms. The tools used to support the practice are instrumented to respond automatically when an anomaly presents itself. This response can be automated via patch and configuration management tools and monitoring systems, or a ticket can be automatically opened and assigned to a technician to investigate. With observability, errors are detected before a user becomes aware of a problem and, in many cases, resolved in the background. Organizations can decide how to handle exceptions, either by provisioning and delivering a replacement or by scheduling a convenient time for a technician to work on the affected equipment or software.
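The loop described above can be sketched in a few lines of code. This is a minimal illustration, not a real platform API: the names `Device`, `KNOWN_FIXES`, and the remediation strings are all hypothetical, standing in for whatever a monitoring system and patch/configuration tool would actually provide. The idea is simply: compare observed telemetry against the expected state, silently fix anomalies the tooling already knows how to handle, and ticket the rest.

```python
# Hypothetical sketch of the observability loop: compare a device's
# reported state against its expected state, auto-remediate known
# anomalies, and open a ticket for anything unrecognized.
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    expected: dict   # expected configuration / healthy values
    observed: dict   # latest state from monitoring, logs, traces

# Anomalies we know how to fix automatically, e.g. via a patch/config tool.
KNOWN_FIXES = {
    "agent_version": "push latest agent",
    "disk_free_pct": "run cleanup job",
}

def detect_anomalies(device: Device) -> list:
    """Return the keys where observed state deviates from expected state."""
    return [k for k, v in device.expected.items()
            if device.observed.get(k) != v]

def handle(device: Device) -> list:
    """Resolve known anomalies in the background; ticket anything else."""
    actions = []
    for key in detect_anomalies(device):
        if key in KNOWN_FIXES:
            actions.append(f"auto-remediate {key}: {KNOWN_FIXES[key]}")
        else:
            actions.append(f"open ticket for {key} on {device.name}")
    return actions

laptop = Device(
    name="laptop-042",
    expected={"agent_version": "5.2", "disk_free_pct": "ok", "bios": "1.9"},
    observed={"agent_version": "5.1", "disk_free_pct": "ok", "bios": "1.7"},
)
for action in handle(laptop):
    print(action)
```

In this example the outdated agent is a known anomaly and is remediated silently, while the BIOS drift has no automated fix and becomes a ticket, before the user ever notices either problem.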
To get to this level, an organization needs established and mature event management, incident management, and configuration management practices, as the tools used to deliver observability rely on all of these being present. Observability is only possible with a robust configuration management practice and database: much of the information needed depends on knowing the expected configuration of every device, the applications that run on it, and the networks and other peripherals that interact with it. Once this is in place, the work needed to optimize monitoring and incident management practices can be performed as part of adopting an observability practice and implementing the tools required to achieve it.
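To make the configuration-management dependency concrete, here is an illustrative sketch (not a real CMDB schema) of the kind of expected-configuration record the paragraph above describes: a device, the applications that run on it, and the peripherals and network it interacts with. The field names and the completeness check are assumptions for illustration; the point is that drift detection is impossible unless a record like this exists and is populated for each device.

```python
# Illustrative expected-configuration record, as a CMDB might hold it.
# Observability tooling can only compare "observed vs. expected" if the
# expected side is actually recorded and complete.
from dataclasses import dataclass

@dataclass
class CmdbRecord:
    device: str
    os_build: str
    applications: dict   # app name -> expected version
    peripherals: list    # docks, printers, etc. the device interacts with
    network_segment: str

def is_observable(record: CmdbRecord) -> bool:
    """A device can only be watched for drift if its expected state is complete."""
    return bool(record.os_build and record.applications and record.network_segment)

rec = CmdbRecord(
    device="laptop-042",
    os_build="22H2",
    applications={"crm-client": "3.4", "vpn": "2.1"},
    peripherals=["dock-7", "printer-hq-2"],
    network_segment="corp-wifi",
)
print(is_observable(rec))
```

A record with a missing OS build or an empty application list would fail the check, which is exactly the readiness gap that makes "adding tools blindly" wasteful.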
The business value of changing our practice
Observability is not for everyone. Because the practice relies heavily on instrumentation, it can be expensive, so establish readiness before blindly adding tools. However, for those organizations that have a good configuration management practice or are working to achieve one, there is a good deal of business value in engaging in observability:
As DevOps practitioners know, observability engineering helps ensure acceptable performance for applications, getting information about potential bottlenecks and errors to developers before they impact the business. This is more difficult in distributed cloud environments, making a case for software observability.
Observability makes it easier to detect and manage security vulnerabilities and intrusions. The asset and configuration management work needed to achieve observability is also foundational for security vulnerability management, and using observability tools and practices enhances security incident response. Thus, much of the business value and funding needed to achieve observability can be linked to improved security management. In many ways, improvements to incident management are just another way to use the tools and data. Fewer noticeable incidents and service interruptions provide a return by protecting business value streams, and hence the organization's revenue.
The tools needed for observability can be found in many Remote Monitoring and Management (RMM) platforms and combined/integrated with service management suites or their workflow automation engines. Managed Service Providers have used them for years, but perhaps it’s time to bring them into corporate IT support.
Phyllis Drucker is a professional speaker and writer with EZ2BGR8