Revolutionizing the Incident Management Practice

by Phyllis Drucker
Date Published July 8, 2024 - Last Updated January 7, 2025

How good is your incident management practice? Does it look much like it did ten years ago?

IT service management frameworks have yet to change how incident management is performed beyond making it more efficient.

For over 25 years, incident management has looked like this:

A major system experiences a failure, and technicians jump into action. A major incident is declared. Regular conference calls take place until service is restored. A major incident report is produced for review and potential follow-up through the problem management process to prevent future occurrences and decrease failures.

A caller contacts the service desk with an issue. Initial troubleshooting might resolve it. If a hardware failure is suspected or the service desk can’t fix it, the issue is escalated, and the tier 2 tech has 2-3 days to restore service. In this scenario, the best incident management practices restore service within the SLAs (service level agreements) that IT established with the business. Still, end users are not satisfied. This is called the watermelon effect: green on the outside, red when you cut it open. What do users want? Fewer interruptions and no downtime.

The Future of Incident Management is Now!

For organizations with mature incident management practices that still cannot achieve high levels of customer satisfaction, technology has caught up and is ready to manage incidents proactively.

AI and predictive analytics can help in two ways:

Identifying failures or equipment that needs attention before they impact the end user

Offering solutions to end users via a corporate instant messaging platform or portal

Both reduce the downtime the user experiences, leading to corporate savings, better employee morale, and improved retention.

An ITIC survey indicates that one hour of downtime costs 44% of respondents over $1 million.

A case study by Tata Steel indicates that employee satisfaction and retention are directly related to productivity, suggesting a link between system performance and employee productivity.

What Causes Most Major Incidents?

System-wide failures typically result from one of three causes: human error, hardware failures, or application/database changes.

Issues with personal devices also have only a few common causes: hardware failures, outdated software versions, missing patches or security fixes, outdated virus signature files, and user error or lack of knowledge.

Incident Management of the Future

The future of incident management is realized when organizations stop working manually and leverage automation to lower failure rates and provide support. Let’s look at three steps to achieve this.

Step 1: Lower Human Error with Automation

DevOps practices have helped drive the popularity of automation in software development, testing, and deployment. These tools and practices help prevent incidents caused by application changes and human error as follows:

Automated application testing ensures repeatable and accurate testing of application changes before deployment.

Consistent, accurate deployment reduces human errors such as missed deployment steps and poor coordination.

Application monitoring and automated rollback verify that the application runs correctly after deployment and roll it back immediately when an issue is identified.
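The monitoring-and-rollback idea can be sketched in a few lines of Python. This is a minimal illustration only, not a real deployment tool: the function name deploy_with_rollback and the deploy, rollback, and health-check callables are all hypothetical stand-ins for whatever your pipeline actually provides.

```python
from typing import Callable, Sequence


def deploy_with_rollback(
    deploy: Callable[[], None],
    rollback: Callable[[], None],
    health_checks: Sequence[Callable[[], bool]],
) -> bool:
    """Deploy, run each health check, and roll back on the first failure.

    Returns True if the deployment was kept, False if it was rolled back.
    All three arguments are hypothetical hooks supplied by the caller.
    """
    deploy()
    for check in health_checks:
        try:
            healthy = check()
        except Exception:
            # A check that crashes is treated the same as a failed check.
            healthy = False
        if not healthy:
            rollback()
            return False
    return True


if __name__ == "__main__":
    kept = deploy_with_rollback(
        deploy=lambda: print("deploying new version"),
        rollback=lambda: print("rolling back to previous version"),
        health_checks=[lambda: True, lambda: False],  # second check fails
    )
    print("deployment kept:", kept)
```

In practice the health checks would query real monitoring data, and the decision to roll back would usually allow for a short observation window rather than a single pass.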

Step 2: Monitor and Address Issues Before They Impact Users

Automated device management enables IT teams to automate everyday activities, allowing proactive incident management across the entire computing environment and ensuring every device is managed consistently.

A robust device management application enables and supports the automation of:

Device inventory and authorization

Device configuration standards

Patch management and compliance

Device security

Authorization to connect to a network and use computing resources

Proactive management of device errors
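As one illustration of the list above, patch compliance can be reduced to comparing each device's installed patches against a required baseline. The sketch below assumes a simple in-memory inventory; the Device class, the find_noncompliant function, and the patch identifiers are made up for the example and do not reflect any particular device management product.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class Device:
    # Hypothetical inventory record: a hostname plus installed patch IDs.
    hostname: str
    installed_patches: Set[str] = field(default_factory=set)


def find_noncompliant(
    devices: Iterable[Device], required_patches: Set[str]
) -> Dict[str, List[str]]:
    """Return {hostname: sorted missing patches} for devices missing any."""
    report: Dict[str, List[str]] = {}
    for device in devices:
        missing = required_patches - device.installed_patches
        if missing:
            report[device.hostname] = sorted(missing)
    return report


if __name__ == "__main__":
    inventory = [
        Device("laptop-01", {"PATCH-001", "PATCH-002"}),
        Device("laptop-02", {"PATCH-001"}),
    ]
    baseline = {"PATCH-001", "PATCH-002"}
    for host, missing in find_noncompliant(inventory, baseline).items():
        print(host, "is missing", ", ".join(missing))
```

A real tool would feed a report like this into automated remediation, so the missing patches are installed before the user ever notices a problem.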

Step 3: AI Helps End-Users

The combination of device management, monitoring, and AI is the sweet spot that helps prevent incidents by enabling automation to detect anomalies and attempt resolution based on pre-programmed policies and scripts. This feature allows IT to scale support to manage devices proactively down to the user level.
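To make "detect anomalies and attempt resolution based on pre-programmed policies" concrete, here is a small sketch. It swaps in a basic z-score test where a real product would use its own AI or analytics, and the metric names, remediation action names, and threshold are all assumptions chosen for illustration.

```python
from statistics import mean, stdev
from typing import Optional, Sequence

# Hypothetical policy table: metric name -> remediation script to run.
REMEDIATION_POLICIES = {
    "disk_used_pct": "clear_temp_files",
    "memory_used_pct": "restart_service",
}


def is_anomalous(
    history: Sequence[float], latest: float, threshold: float = 3.0
) -> bool:
    """Flag a reading more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough data to judge
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is unusual.
        return latest != history[0]
    return abs(latest - mean(history)) / sigma > threshold


def choose_action(
    metric: str, history: Sequence[float], latest: float
) -> Optional[str]:
    """Return the remediation to attempt, or None if the reading looks normal."""
    if not is_anomalous(history, latest):
        return None
    # Fall back to a human when no policy covers the metric.
    return REMEDIATION_POLICIES.get(metric, "open_ticket")


if __name__ == "__main__":
    disk_history = [50, 52, 51, 49, 50]
    print(choose_action("disk_used_pct", disk_history, 95))
```

The design point is the fallback: automation attempts the fixes it has a policy for, and anything it does not recognize still lands with a person.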

The final step toward a modern incident management practice is ensuring end users can get support when an issue prevents them from working. Conversational and generative AI solutions can draw on incident tickets with detailed resolution information, manuals, other internal documentation, and knowledge bases to provide automated first-tier support.

Organizations should work on implementing the following capabilities:

Integration between the service portal chat and corporate IM platforms (like Teams or Slack) and the ITSM ticketing system that enables users to chat with a support channel in plain language.
This should provide answers to questions and simple instructions on how to work around an error.
It should also allow users to enter requests for goods and services or log a ticket if they can’t fix their problem using the information provided.

Robust knowledge bases and internally hosted generative AI platforms that enable AI to search all documentation repositories and prior tickets for solutions to problems or answers to questions.
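The idea of searching prior tickets for solutions can be illustrated with a deliberately simple sketch. It uses plain keyword overlap rather than generative AI, so treat it as a stand-in for the retrieval step only; the ticket fields ("summary", "resolution") and the function names are assumptions made for the example.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_resolutions(
    query: str, tickets: List[Dict[str, str]], top_n: int = 3
) -> List[Dict[str, str]]:
    """Rank prior tickets by how many words their summary shares with the query."""
    query_words = tokenize(query)
    scored = []
    for ticket in tickets:
        overlap = len(query_words & tokenize(ticket["summary"]))
        if overlap:
            scored.append((overlap, ticket))
    # Highest overlap first; ties keep their original order (sort is stable).
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ticket for _, ticket in scored[:top_n]]


if __name__ == "__main__":
    prior_tickets = [
        {"summary": "VPN client fails to connect",
         "resolution": "Reinstall the VPN profile"},
        {"summary": "Printer offline",
         "resolution": "Restart the print spooler"},
    ]
    for match in search_resolutions("cannot connect to vpn", prior_tickets):
        print(match["summary"], "->", match["resolution"])
```

A production setup would replace the keyword match with semantic search and let the AI platform summarize the best matches, but the flow is the same: a plain-language question goes in, and known resolutions come out.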