Designing a major incident management process is critical to protecting a company from significant financial loss. An extended service outage can also tarnish its reputation and harm its customers. A fully optimized major incident process will leverage live monitoring, predictive analytics and real-time alerting to proactively avoid service outages or significantly reduce Mean Time to Repair (MTTR) when an outage occurs. Unfortunately, most companies currently have a reactive or ad-hoc process. The major incident management process should be based on industry best practices. Procedures should be standardized and continuously improved.
What is an Incident?
ITIL defines an incident as an unplanned interruption to, or reduction in the quality of, an IT service. The goal of having an established incident management process is to return the service to normal functionality quickly while minimizing the impact to the business. Normal operation of an IT service is defined in Service Level Agreements (SLAs). In addition, there may be other agreements between the business and IT operations that define normal functioning.
What is Major Incident Management?
A major incident is an incident that demands a response and a level of resource engagement well beyond the routine incident management process. Therefore, a major incident management procedure should be designed to coordinate the response and accelerate the recovery process to return the IT service to a normal state as quickly as possible. Typically, a major incident is assigned a critical priority based on an incident priority matrix of impact and urgency. In some cases, a major incident may instead be assigned a high priority.
Incident priority schemes typically have four levels.
Major Incident Management Lifecycle
Occurrence – This is when an issue with a configuration item or system actually starts. A high percentage of the time, it is related to a change to the configuration item or system.
Detection – This is when event monitoring, support teams, or a user detects the issue with a configuration item or system. A mature IT support organization will identify a high percentage of issues through event monitoring and support teams rather than through reports from end users.
Diagnosis and Repair – Diagnosis is when the initial IT support team works to understand what the incident is, triage its priority, and assign it to the correct resources for resolution. Repair is the set of recovery actions that return the configuration item to a normal state.
Recovery, Restoration and Closure – Recovery is when a configuration item has returned to a normal state. The overall business service, made up of one or more configuration items, may or may not be recovered at this point. Restoration is the point when the actual business service has been recovered and the end users are able to use the service successfully. Closure occurs after the service is available to the user. To close the incident, recovery teams must validate that the service is stable and not at risk of immediate recurrence.
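One way to make the lifecycle measurable is to record a timestamp at each stage and report the time spent between stages. The sketch below assumes a hypothetical incident record with one timestamp per stage; the field names and sample times are illustrative and not taken from any particular ticketing tool.

```python
from datetime import datetime

# Lifecycle stages in the order described above.
STAGES = ["occurrence", "detection", "recovery", "restoration", "closure"]

# Hypothetical timestamps for a single major incident.
timeline = {
    "occurrence": datetime(2024, 1, 3, 9, 0),     # issue actually starts
    "detection": datetime(2024, 1, 3, 9, 20),     # monitoring or a user notices it
    "recovery": datetime(2024, 1, 3, 10, 30),     # configuration item back to normal
    "restoration": datetime(2024, 1, 3, 10, 50),  # business service usable again
    "closure": datetime(2024, 1, 3, 12, 0),       # stability validated, ticket closed
}


def segment_minutes(timeline):
    """Return minutes elapsed between each pair of consecutive lifecycle stages."""
    durations = {}
    for earlier, later in zip(STAGES, STAGES[1:]):
        elapsed = timeline[later] - timeline[earlier]
        durations[f"{earlier} to {later}"] = int(elapsed.total_seconds() // 60)
    return durations


segments = segment_minutes(timeline)
for name, minutes in segments.items():
    print(f"{name}: {minutes} min")
```

Reporting the segments separately shows whether time is being lost before detection, during diagnosis and repair, or between component recovery and full service restoration.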
Major Incident Management Best Practices
What specific areas are you focusing on to improve stability and availability in your environment by reducing the frequency and duration of major incidents? Reducing the Mean Time to Restore Service (MTRS) of major incidents and increasing Mean Time Between Failures (MTBF) are both critical. Reducing MTRS shortens service disruptions, which avoids lost sales revenue and productivity. Increasing MTBF improves the uptime of your services. There are key best practices for each segment of the major incident lifecycle. Understanding each one is important for improving the capability of the IT infrastructure, services and supporting organization that enable the business to meet its objectives.
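As a rough illustration of how these two metrics can be calculated, the sketch below uses a made-up list of outage start and end times over an assumed one-month reporting period. MTRS is taken as total downtime divided by the number of incidents, and MTBF as total uptime divided by the number of incidents; organizations sometimes define these slightly differently, so treat the formulas as one common convention.

```python
from datetime import datetime, timedelta

# Hypothetical major incidents: (service disrupted at, service restored at).
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 11, 0)),     # 2 hour outage
    (datetime(2024, 1, 12, 14, 0), datetime(2024, 1, 12, 15, 0)),  # 1 hour outage
    (datetime(2024, 1, 25, 22, 0), datetime(2024, 1, 26, 1, 0)),   # 3 hour outage
]

# Assumed reporting period: January 2024.
period_start = datetime(2024, 1, 1)
period_end = datetime(2024, 2, 1)


def total_downtime(records):
    """Sum of all outage durations."""
    return sum((end - start for start, end in records), timedelta())


def mean_time_to_restore(records):
    """MTRS: total downtime divided by the number of incidents."""
    return total_downtime(records) / len(records)


def mean_time_between_failures(records, start, end):
    """MTBF: total uptime in the period divided by the number of incidents."""
    uptime = (end - start) - total_downtime(records)
    return uptime / len(records)


mtrs = mean_time_to_restore(incidents)
mtbf = mean_time_between_failures(incidents, period_start, period_end)
availability = mtbf / (mtbf + mtrs)  # fraction of the period the service was up

print(f"MTRS: {mtrs}, MTBF: {mtbf}, availability: {availability:.4%}")
```

With this sample data the outages total 6 hours across 3 incidents in a 744-hour month, so MTRS is 2 hours and MTBF is 246 hours.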
Major Incident Lifecycle – Occurrence
Occurrence runs from when an issue with a configuration item or IT system starts until the time it is detected. To reduce the frequency of major incidents, you must study how to keep fully functioning IT services from failing. A high percentage of the time, failure is related to a change to the configuration item or IT system. Introducing additional rigor to the change management process for higher-risk changes will reduce major incident occurrence.
Major Incident Lifecycle – Occurrence Recommendations
Change Management Risk Assessment Calculator – It is important to update the change risk assessment calculator with more appropriate risk questions. Appropriate risk questions will more accurately identify changes that have a high or very high risk of failing. Additional scrutiny of high-risk changes may reduce the risk of causing a service-interrupting incident. The risk assessment calculator is not intended to replace human scrutiny, but it will help change coordinators focus greater attention on the changes that pose the greatest risks.
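A risk assessment calculator of this kind can be as simple as a weighted questionnaire. The sketch below is only an illustration: the questions, weights and score cut-offs are hypothetical and would need to be tuned against your own change failure history.

```python
# Hypothetical risk questions and weights (assumptions, not a standard).
RISK_QUESTIONS = {
    "touches_fragile_ci": 4,       # change affects a known fragile CI or service
    "no_tested_backout_plan": 3,   # no rehearsed rollback procedure
    "first_time_change": 2,        # this type of change has not been done before
    "during_business_hours": 2,    # implemented while users are active
    "multiple_teams_involved": 1,  # needs coordination across teams
}

# Assumed score cut-offs, checked from highest to lowest.
RISK_LEVELS = [(8, "very high"), (5, "high"), (3, "medium"), (0, "low")]


def assess_change_risk(answers):
    """Return (score, level) for a dict mapping question name to True/False."""
    score = sum(
        weight
        for question, weight in RISK_QUESTIONS.items()
        if answers.get(question, False)
    )
    level = next(name for minimum, name in RISK_LEVELS if score >= minimum)
    return score, level


risky_change = {
    "touches_fragile_ci": True,
    "no_tested_backout_plan": True,
    "during_business_hours": True,
}
routine_change = {"multiple_teams_involved": True}

print(assess_change_risk(risky_change))
print(assess_change_risk(routine_change))
```

The score only flags which changes deserve closer review by a change coordinator; it does not approve or reject anything on its own.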
High-Risk Change Implementation Plans – Improve the rigor of change management for high-risk changes by using data-driven solutions when planning implementations. By ensuring your change implementation plans follow industry and department best practices, you should improve your percentage of successful changes. Simply stated, when changes are successful, major incident frequency is reduced.
- Identify and maintain a list of fragile configuration items and IT services.
- Mature change implementation coordinator accountabilities and responsibilities.
- Improve post-change testing and validation rules.
- Ensure post-change event monitoring resumption is correctly timed.
Forward Schedule of Change Dashboard – If your change ticketing application supports it, build a dynamic high-risk change dashboard. The dashboard will display the real-time status of pending, in-progress, breached, and completed high-risk changes for the current date. Everyone should be aware of the status of high-risk changes. If IT staff are aware of a change in progress and an issue is reported to the Help Desk, the two can be correlated immediately. If an issue is
Major Incident Lifecycle – Detection
Detection is when event monitoring, IT support teams, or a user detects an issue with a configuration item or IT service. Once an issue is detected, an incident is logged. A mature IT support organization will identify a high percentage of incidents through event monitoring and IT support teams rather than through reports from end users. Early detection of issues will significantly reduce the duration of a major incident.
Major Incident Lifecycle – Detection Recommendations
Improve Service Desk Incident Trending – Major incidents have a high impact on your customers. It is very important to quickly identify support ticket trends. If an unusually large number of lower-priority incidents is discovered, they should be grouped into a higher-priority incident based on the increased impact. Once you have a higher-priority incident, resources can be focused on it. To trend incidents properly, you need a well-thought-out help desk incident category scheme.
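The trending idea can be sketched as a count of recent lower-priority tickets per category. The threshold of five tickets, the 30-minute window, the priority labels and the ticket fields below are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

# Assumed rule: 5 or more lower-priority tickets in one category
# within 30 minutes are treated as a trend worth escalating.
TREND_THRESHOLD = 5
TREND_WINDOW = timedelta(minutes=30)
LOWER_PRIORITIES = ("low", "medium")


def find_trending_categories(tickets, now):
    """Return the categories whose recent lower-priority ticket count meets the threshold.

    Each ticket is a dict with 'category', 'priority' and 'opened' keys.
    """
    recent = Counter(
        t["category"]
        for t in tickets
        if t["priority"] in LOWER_PRIORITIES and now - t["opened"] <= TREND_WINDOW
    )
    return sorted(c for c, count in recent.items() if count >= TREND_THRESHOLD)


# Hypothetical ticket queue: six recent login tickets, two printing tickets.
now = datetime(2024, 1, 3, 9, 30)
tickets = [
    {"category": "email/login", "priority": "low", "opened": now - timedelta(minutes=m)}
    for m in (2, 5, 9, 14, 20, 28)
] + [
    {"category": "printing", "priority": "low", "opened": now - timedelta(minutes=m)}
    for m in (3, 40)
]

trending = find_trending_categories(tickets, now)
print(trending)
```

A category returned here would be a candidate for a single higher-priority parent incident with the individual tickets linked to it.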
Event Monitoring – Basic monitoring consists of watching for spikes in system resources such as CPU utilization, memory use, and network response. Support staff can investigate resource levels that stay above predetermined thresholds for an extended duration. As your event monitoring becomes more advanced, it should focus on errors in business and system transactions. By discovering errors in these transactions, issues can be corrected before they significantly affect your users. As events occur, your monitoring system will generate incident tickets for the impacted CI based on data-driven rules.
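The "above a threshold for an extended duration" rule is what separates a sustained problem from a brief spike. A minimal sketch, assuming one CPU sample per minute and a made-up rule of 90% for five consecutive samples:

```python
def sustained_breach(samples, threshold, min_consecutive):
    """Return True if at least `min_consecutive` samples in a row exceed `threshold`."""
    run = 0
    for value in samples:
        # Extend the current run on a breach, otherwise start counting again.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical CPU utilization readings (percent), one per minute.
brief_spikes = [40, 95, 42, 41, 96, 43, 40]
sustained_load = [40, 91, 93, 95, 97, 94, 60]

# Assumed alert rule: CPU above 90% for 5 consecutive minutes.
print(sustained_breach(brief_spikes, threshold=90, min_consecutive=5))
print(sustained_breach(sustained_load, threshold=90, min_consecutive=5))
```

Only the second series would raise an alert, which keeps short-lived spikes from generating incident tickets.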
Defining CMDB CI Relationships – IT services are made up of configuration items. It is important to associate configuration items with the IT services they support. Similarly, IT services should be associated with the support teams their incidents should be assigned to. Then, when a configuration item has a fault, you know which IT service is impacted, and the proper resolver team can be engaged with the incident.
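These relationships amount to two lookups: configuration item to IT services, and IT service to support team. The sketch below uses invented CI, service and team names in plain dictionaries; a real CMDB would hold the same relationships as records.

```python
# Hypothetical CMDB extract; all names are illustrative only.
CI_TO_SERVICES = {
    "db-server-01": ["Online Ordering", "Customer Portal"],
    "web-server-07": ["Customer Portal"],
    "mail-gateway-02": ["Email"],
}
SERVICE_TO_TEAM = {
    "Online Ordering": "E-Commerce Support",
    "Customer Portal": "Web Platform Support",
    "Email": "Messaging Support",
}


def impact_of(ci_name):
    """Return {service: resolver team} for every IT service that depends on the CI."""
    return {
        service: SERVICE_TO_TEAM[service]
        for service in CI_TO_SERVICES.get(ci_name, [])
    }


# A fault on the shared database server affects two services and two teams.
print(impact_of("db-server-01"))
```

A CI with no recorded relationships returns an empty result, which is itself a useful signal that the CMDB needs attention.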
Implement Incident Alert and Contact Management – Notifying business users, support teams and management of the status of a major incident impacting a business service is critical. It is important to ensure your incident alerts reach their intended targets in a timely manner. To reduce incident Mean Time to Restore Service, you must invest in an automated contact and alert management system. Many ticketing applications, such as ServiceNow, offer this as a module. You will be able to define automated escalation rules, manage on-call and time-away scheduling, and automatically process self-managed alert subscriptions to drive a reduction in mean time to respond.
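An automated escalation rule can be pictured as a list of delays and contacts. The policy below is a hypothetical example; the roles and timings are assumptions and do not reflect how any specific alerting product is configured.

```python
# Assumed escalation policy: (minutes without acknowledgement, who to notify).
ESCALATION_POLICY = [
    (0, "primary on-call engineer"),
    (10, "secondary on-call engineer"),
    (20, "team manager"),
    (30, "incident manager"),
]


def contacts_to_notify(minutes_unacknowledged):
    """Return everyone who should have been alerted by this point, in order."""
    return [
        contact
        for delay, contact in ESCALATION_POLICY
        if minutes_unacknowledged >= delay
    ]


# After 25 minutes with no acknowledgement, three levels have been alerted.
print(contacts_to_notify(25))
```

Keeping the policy as data rather than logic makes it easy to review and adjust the timings after each post incident review.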
Major Incident Lifecycle – Diagnosis and Repair
Diagnosis is when the initial IT support team triages the configuration item fault. The first-level support team will attempt to fix the issue. If the team is not able to fix it, they categorize the incident, validate the priority, and escalate the incident to the correct resources for resolution. Repair is the set of actions that return the configuration item to a normal state. Since IT services are made up of one or more configuration items, repairing a configuration item may not completely resolve the IT service incident.
Incident Ticket Classification Scheme – Properly classifying an issue when a Help Desk ticket is created enables the Help Desk agent to sort the issue into support buckets. These buckets allow relevant knowledge to be presented to the Help Desk agent while providing support, enable proper routing of escalated tickets, and allow trend reporting by ticket type. Ticket categories can also be used to identify mission-critical services. If an incident is raised against a mission-critical service, the priority can be elevated.
Incident Priority Levels – Due to IT support resource constraints, not all incidents can be worked on simultaneously. Incident tickets need to be prioritized based on impact and urgency. Impact is the potential financial, brand or security damage the incident could cause to the business before it is resolved. Urgency is how quickly incident resolution is required.
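A priority matrix of impact and urgency can be expressed as a simple lookup table. The three-by-three matrix and the four priority labels below are assumptions chosen for illustration; your own matrix may use different scales and cell values.

```python
# Assumed matrix: keys are (impact, urgency), values are one of four priorities.
PRIORITY_MATRIX = {
    ("high", "high"): "critical",
    ("high", "medium"): "high",
    ("high", "low"): "medium",
    ("medium", "high"): "high",
    ("medium", "medium"): "medium",
    ("medium", "low"): "low",
    ("low", "high"): "medium",
    ("low", "medium"): "low",
    ("low", "low"): "low",
}


def incident_priority(impact, urgency):
    """Look up the priority for a given impact and urgency rating."""
    return PRIORITY_MATRIX[(impact, urgency)]


# A widespread outage that must be fixed immediately lands in the top level.
print(incident_priority("high", "high"))
print(incident_priority("medium", "low"))
```

Because the result comes from a fixed table, two agents rating the same impact and urgency will always arrive at the same priority.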
Incident Manager Recovery Runbooks / Decision Trees – A runbook or decision tree can be very valuable for a major incident management team made up of generalists. A service SME and manager can build runbooks or decision trees before an incident occurs, giving the incident management team valuable actions to take in the first 30 minutes while the experts are joining the bridge.
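A decision tree like this can be captured as nested data that an incident manager, or a simple tool, walks one answer at a time. The questions and actions below are invented for a hypothetical service and are only meant to show the shape of such a tree.

```python
# Hypothetical runbook written as a yes/no decision tree.
# Each node is either a question with "yes"/"no" branches or a final action.
RUNBOOK = {
    "question": "Was a change implemented on this service in the last 24 hours?",
    "yes": {"action": "Engage the change implementer and evaluate backing out the change."},
    "no": {
        "question": "Are monitoring alerts open on a supporting configuration item?",
        "yes": {"action": "Engage the resolver team that owns the alerting configuration item."},
        "no": {"action": "Open the bridge and page the service SME."},
    },
}


def next_step(tree, answers):
    """Follow the given 'yes'/'no' answers and return the action or next question."""
    node = tree
    for answer in answers:
        node = node[answer]
    # A leaf holds an action; any other node holds the next question to ask.
    return node.get("action", node.get("question"))


print(next_step(RUNBOOK, []))
print(next_step(RUNBOOK, ["no", "yes"]))
```

With no answers supplied the function returns the first question, so the same tree can drive an interactive prompt during the first minutes of an incident.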
24/7 Persistent Chat Collaboration Room – When an incident occurs, it is critical to collaborate quickly with resources to determine how to diagnose and repair the system. With support resources spread out across a building, city or even country, companies need a collaboration tool that goes beyond an email chain or audio bridge call. A 24/7 persistent chat collaboration room allows resources from management, operations, development, storage, platform, network, and other areas to hold real-time discussions. It also lets resources joining the discussion review the persistent chat history, share documents, display recovery step timelines, instantly take roll calls of the current participants and see who is speaking or chatting, and record the entire recovery event for a post incident review.
Major Incident Lifecycle – Recovery, Restoration and Closure
Recovery is the segment in which a configuration item is returned to a normal state. The overall business IT service, made up of one or more configuration items, may or may not be recovered at this point. Restoration is the point when the actual business service has been recovered and the end users are able to use the service successfully. Closure occurs after the service is available to the user and the recovery teams have validated that the service is stable and not at risk of immediate recurrence.
Incident Resolution Category Scheme – Initial incident categories focus on what monitoring or the customer sees and experiences as an issue. Resolution categories let the incident owner classify the incident by how it was actually fixed, using everything learned while recovering the system. This is important for troubleshooting future incidents.
Root Cause Analysis – Determine what happened, why it happened and what to do to reduce the likelihood that it will happen again. This process involves collecting the data, identifying all potential causes, determining the root cause, and, if possible, implementing a fix to eliminate the problem.
Post Incident Review (PIR) – A post incident review is an evaluation of the response to and recovery from a major incident. It identifies what went well and where response and recovery processes can be improved to reduce MTRS. It also finalizes the capture of the incident data for root cause analysis by problem management.