How to Reduce Incident MTTR (Mean Time to Repair)
What specific areas are you focusing on to reduce Mean Time to Repair (MTTR) of Major Incidents and increase Mean Time between Failures (MTBF) at your company?
Reducing Mean Time to Repair (MTTR) of Major Incidents and increasing Mean Time between Failures (MTBF) is critical. Reducing MTTR shortens service disruptions, which limits the loss of sales revenue and productivity. Increasing MTBF improves the uptime and availability of your services.
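To make these metrics concrete, here is a minimal sketch of how MTTR, MTBF and availability could be calculated from a list of incidents. The incident timestamps, the reporting period and the function name `reliability_metrics` are hypothetical, and the sketch assumes MTTR is total downtime divided by the number of incidents, MTBF is total uptime divided by the number of incidents, and MTBSI is the sum of the two.

```python
from datetime import datetime, timedelta

# Hypothetical major incidents as (service interrupted, service restored).
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 11, 0)),     # 2 hours
    (datetime(2024, 1, 12, 14, 0), datetime(2024, 1, 12, 15, 0)),  # 1 hour
    (datetime(2024, 1, 25, 22, 0), datetime(2024, 1, 26, 1, 0)),   # 3 hours
]

# Hypothetical reporting period: 30 days, or 720 hours.
period_start = datetime(2024, 1, 1)
period_end = datetime(2024, 1, 31)


def reliability_metrics(incidents, period_start, period_end):
    """Return MTTR, MTBF and MTBSI in hours, plus availability as a fraction."""
    one_hour = timedelta(hours=1)
    total_hours = (period_end - period_start) / one_hour
    downtime_hours = sum((end - start) / one_hour for start, end in incidents)
    count = len(incidents)

    mttr = downtime_hours / count                  # average time to repair
    mtbf = (total_hours - downtime_hours) / count  # average uptime between failures
    mtbsi = mtbf + mttr                            # average time between interruptions
    return {
        "mttr_hours": mttr,
        "mtbf_hours": mtbf,
        "mtbsi_hours": mtbsi,
        "availability": mtbf / mtbsi,
    }


metrics = reliability_metrics(incidents, period_start, period_end)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With this sample data, three incidents totalling 6 hours of downtime in a 720-hour period give an MTTR of 2 hours and an MTBF of 238 hours. Shortening the repairs raises availability, and so does spacing the failures further apart.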
Mean Time Between System Interruptions (MTBSI) can be broken into several key segments. Understanding each segment and the improvement areas it needs is important for improving the capability of the IT infrastructure, services and supporting organization that enable the business to satisfy its business objectives.
Occurrence – When an issue with a configuration item (CI) or system actually starts. A high percentage of the time, this is related to a change to the CI or system.
MTTR occurrence improvement areas to focus on are:
- Change Management Risk Assessment Calculator – It is important to update the Change Risk Assessment Calculator with questions that more accurately identify changes that are at very high or high risk of failing and causing a service-interrupting incident. The risk assessment calculator is not intended to replace “human” scrutiny but to help focus greater attention on the changes that pose the greatest risks.
- High Risk Change Implementation Plans – Improve Change Management rigor for high risk changes using data-driven solutions. This includes identifying and maintaining a fragile CI / system list, maturing the accountabilities and responsibilities of the change implementation coordinator, and improving post-change validation rules, post-change event monitoring resumption rules, and incident monitoring rules.
- Forward Schedule of Change Dashboard – If your change ticketing application supports it, build a dynamic High-Risk Change Dashboard. The dashboard will display the real-time status of pending, in-progress, breached, and completed high risk changes for the current date.
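As an illustration of the risk assessment calculator idea, the following is a minimal sketch of a weighted questionnaire. The questions, weights and score thresholds are all hypothetical; in practice they would be tuned against your own history of failed changes.

```python
# Hypothetical yes/no questions and weights. Tune these against the
# changes that actually caused incidents in your environment.
RISK_QUESTIONS = {
    "touches_fragile_ci": 4,
    "no_tested_backout_plan": 3,
    "implemented_in_business_hours": 2,
    "first_time_change": 2,
    "multiple_teams_involved": 1,
}


def assess_change_risk(answers):
    """Return (score, risk level) for a dict of question -> True/False."""
    score = sum(
        weight
        for question, weight in RISK_QUESTIONS.items()
        if answers.get(question, False)
    )
    # Hypothetical thresholds for the risk levels.
    if score >= 8:
        level = "very high"
    elif score >= 5:
        level = "high"
    elif score >= 3:
        level = "medium"
    else:
        level = "low"
    return score, level


change = {
    "touches_fragile_ci": True,
    "no_tested_backout_plan": True,
    "first_time_change": True,
}
print(assess_change_risk(change))
```

A change scored as high or very high would then be routed for extra human review and shown on the high risk change dashboard, rather than being approved automatically.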
Detection – This is when event monitoring, support teams, or a user detects the issue with a configuration item or system. A mature IT support organization will identify a high percentage of issues through event monitoring and support teams rather than through reports from end users.
MTTR detection improvement areas to focus on are:
- Improve Service Desk Incident Trending – Quickly identify support trends and focus valuable Information Technology resources on targeted business process improvement. To trend incidents properly, you need a well-thought-out Incident Management category scheme.
- Event Monitoring – Basic monitoring consists of watching for spikes in system resources such as CPU utilization, memory use, and network response. As your event monitoring becomes more advanced, it will focus on business transactions to discover and correct issues before they significantly affect your users. As events occur, your monitoring system will generate incident tickets for the impacted CI based on data-driven rules.
- Defining CMDB CI Relationships – It is important not only to populate your CMDB with configuration items and the relationships between them, but also to identify the support team that incidents for each configuration item should be assigned to.
- Implement Incident Alert and Contact Management – To notify business users, support teams and management of the status of a major incident impacting a business service, it is important to ensure your incident alerts reach their intended targets. To reduce incident Mean Time to Restore Service, you must invest in an automated contact and alert management system. Many ticketing applications, such as ServiceNow, offer this as a module. You will be able to define automated escalation rules, manage on-call and time-away scheduling, and automatically process self-managed alert subscriptions to drive a reduction in mean time to respond.
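The event monitoring and CMDB points above can be sketched together: a threshold rule decides whether an event becomes an incident, and the CI record decides which support team receives it. This is only an illustrative sketch. The `CMDB` contents, the `THRESHOLDS` values, the `IncidentTicket` fields and the fallback to the Service Desk are all assumptions, not the behavior of any particular monitoring or ticketing product.

```python
from dataclasses import dataclass

# Hypothetical CMDB extract: each CI lists its owning support team and the
# business services that depend on it.
CMDB = {
    "db-prod-01": {"support_team": "Database Ops", "services": ["Online Ordering"]},
    "web-prod-02": {
        "support_team": "Web Platform",
        "services": ["Online Ordering", "Customer Portal"],
    },
}

# Hypothetical thresholds that turn a monitoring event into an incident.
THRESHOLDS = {"cpu_percent": 90, "memory_percent": 85, "response_ms": 2000}


@dataclass
class IncidentTicket:
    ci: str
    metric: str
    value: float
    assigned_team: str
    impacted_services: list


def evaluate_event(ci, metric, value):
    """Return an IncidentTicket if the event breaches its threshold, else None."""
    threshold = THRESHOLDS.get(metric)
    if threshold is None or value <= threshold:
        return None
    record = CMDB.get(ci, {})
    return IncidentTicket(
        ci=ci,
        metric=metric,
        value=value,
        # CIs missing from the CMDB fall back to the Service Desk for triage.
        assigned_team=record.get("support_team", "Service Desk"),
        impacted_services=record.get("services", []),
    )


print(evaluate_event("db-prod-01", "cpu_percent", 97))
print(evaluate_event("db-prod-01", "cpu_percent", 50))
```

The fallback shows why CI relationships matter: a ticket for a CI that is missing from the CMDB lands in a general queue and has to be triaged by hand, which adds time before the right team starts work.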
Diagnosis and Repair – Diagnosis is when the initial IT support team is trying to understand what the incident is, triage its priority, and assign it to the correct resources to resolve the issue. Repair is the set of recovery actions that return the configuration item to a normal state.
MTTR diagnosis and repair improvement areas to focus on are:
- Incident Category Scheme – Proper classification of an issue when a Help Desk ticket is created enables the Help Desk agent to sort the issue into support buckets. These buckets allow relevant knowledge to be presented to the agent while providing support, enable proper routing of escalated tickets, and allow trend reporting by ticket type.
- Incident Priority levels – Due to IT support resource constraints, not all incidents can be worked on simultaneously. Incident tickets will need to be prioritized based on impact and urgency. Incident impact is the potential financial, brand or security damage caused by the incident on the business organization before it can be resolved. Urgency is how quickly incident resolution is required.
- Incident Manager Recovery Runbooks / Decision Trees – A runbook or decision tree can be very valuable for a major incident management team made up largely of generalists. Runbooks or decision trees can be built by a service SME and manager before an incident occurs, giving the incident management team valuable actions to take in the first 30 minutes while the experts are joining the bridge.
- 24/7 Persistent Chat Collaboration Room – When an incident occurs, it is critical to collaborate quickly with resources to determine how to diagnose and repair the system. With support resources spread out across a building, city or even country, companies need a collaboration tool beyond just an email chain or audio bridge call. A 24/7 persistent chat collaboration room allows resources from management, operations, development, storage, platform, network, and other areas to hold real-time discussions. It also lets resources joining the discussion review the persistent chat history, share documents, display recovery step timelines, instantly take roll calls of the current participants and see who is speaking or chatting, and record the entire recovery event for a post incident review.
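The impact and urgency prioritization described in the list above is commonly implemented as a lookup matrix. Here is a minimal sketch, assuming three levels for each dimension and five priority levels. The matrix values, the `P1` to `P5` labels and the sample tickets are hypothetical.

```python
# Hypothetical 3x3 matrix keyed by (impact, urgency). P1 is worked first.
PRIORITY_MATRIX = {
    ("high", "high"): "P1",
    ("high", "medium"): "P2",
    ("high", "low"): "P3",
    ("medium", "high"): "P2",
    ("medium", "medium"): "P3",
    ("medium", "low"): "P4",
    ("low", "high"): "P3",
    ("low", "medium"): "P4",
    ("low", "low"): "P5",
}


def incident_priority(impact, urgency):
    """Look up the priority for an impact/urgency pair (case-insensitive)."""
    return PRIORITY_MATRIX[(impact.lower(), urgency.lower())]


def work_queue(incidents):
    """Order incidents so the highest priority is worked first."""
    return sorted(
        incidents,
        key=lambda inc: incident_priority(inc["impact"], inc["urgency"]),
    )


# Hypothetical open tickets.
open_incidents = [
    {"id": "INC-101", "impact": "low", "urgency": "medium"},
    {"id": "INC-102", "impact": "high", "urgency": "high"},
    {"id": "INC-103", "impact": "medium", "urgency": "high"},
]
for inc in work_queue(open_incidents):
    print(inc["id"], incident_priority(inc["impact"], inc["urgency"]))
```

Keeping the matrix as data rather than nested conditions makes it easy for the incident process owner to review and adjust without touching the routing logic.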
Recovery, Restoration and Closure – Recovery is when a configuration item has returned to a normal state. The overall business service, made up of one or more configuration items, may or may not be recovered at this point. Restoration is the point when the actual business service has been recovered and the end users are able to use the service successfully. Closure occurs after the service is available to the user and the recovery teams have validated that the service is stable and not at risk of immediate recurrence.
MTTR Recovery, Restoration and Closure improvement areas to focus on are:
- Incident Resolution Category Scheme – Initial incident categories focus on what monitoring or the customer sees and experiences as an issue. Resolution categories let the incident owner classify the incident by how it was actually fixed, using everything learned while recovering the system. This is important for troubleshooting future incidents.
- Post Incident Review (PIR) – A post incident review is an evaluation of the response to and recovery from a major incident. The review identifies what went well and where response and recovery processes can be improved to reduce MTTR. It also finalizes the capture of the incident data for root cause analysis by problem management.
- Root Cause Analysis – Determine what happened, why it happened and what to do to reduce the likelihood that it will happen again. This process involves collecting the data, identifying all potential causes, determining the root cause, and, if possible, implementing a fix to eliminate the problem.
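To see which of the segments described in this article deserves attention first, a post incident review can break a major incident into the time spent between stages. The sketch below assumes you record one timestamp per stage, with the diagnosis and repair timestamps marking when each activity was completed. The stage names follow this article, while the timeline values and the function name `segment_minutes` are hypothetical.

```python
from datetime import datetime

# Lifecycle stages in the order used in this article.
STAGES = [
    "occurrence",
    "detection",
    "diagnosis",
    "repair",
    "recovery",
    "restoration",
    "closure",
]


def segment_minutes(timeline):
    """Return minutes elapsed between each pair of consecutive stages."""
    segments = {}
    for earlier, later in zip(STAGES, STAGES[1:]):
        elapsed = timeline[later] - timeline[earlier]
        segments[f"{earlier} -> {later}"] = elapsed.total_seconds() / 60
    return segments


# Hypothetical timeline for one major incident.
timeline = {
    "occurrence": datetime(2024, 3, 5, 2, 0),
    "detection": datetime(2024, 3, 5, 2, 25),
    "diagnosis": datetime(2024, 3, 5, 3, 10),
    "repair": datetime(2024, 3, 5, 3, 40),
    "recovery": datetime(2024, 3, 5, 3, 45),
    "restoration": datetime(2024, 3, 5, 4, 0),
    "closure": datetime(2024, 3, 5, 4, 30),
}

segments = segment_minutes(timeline)
for name, minutes in segments.items():
    print(f"{name}: {minutes:.0f} min")

# The longest segment is a candidate for the next improvement effort.
longest = max(segments, key=segments.get)
print("Longest segment:", longest)
```

In this sample the longest segment is the time between detection and completed diagnosis, which would point toward the category scheme, runbook and collaboration improvements rather than toward monitoring.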
To improve the capability of the IT infrastructure, services and supporting organization that enable the business to satisfy its business objectives, reducing Mean Time to Repair of Major Incidents and increasing Mean Time between Failures is critical. Share the specific actions that you are focusing on to reduce Mean Time to Repair of Major Incidents and increase Mean Time between Failures at your company.