One of the standard metrics for measuring the performance of our incident management practice has always been mean time to repair (MTTR). But I wonder if this metric really gives you actionable information—information that will help you provide a better service to your business.
Think of it this way: Do your customers really care how long it takes you to resolve a low-priority incident that may be impacting one person, or that may have a perfectly acceptable workaround? I would venture that the answer is ‘no’. Business users care about critical incidents: those that impact their ability to serve their customers and hit the bottom line through lost profit or damaged reputation.
In fact, I find that another metric, the critical incident response time (CIRT), is a far better way to measure performance.
Drawbacks of MTTR
Mean time to repair reporting generally looks at all incidents that come to the service desk, giving the average time it takes to resolve these. This number will include:
- Critical incidents that are given high priority
- Incidents that are worked on only when someone has time to look at them
In some organizations, your MTTR number also includes automatically logged events that are resolved in seconds.
Another point to remember: MTTR only looks at incidents that have been resolved; it gives no recognition to long-standing incidents languishing in your queue. And when those incidents finally are closed, their long resolution times skew the average.
In my opinion, all this extra noise makes MTTR virtually meaningless. It makes little sense to provide this metric to the business, as it's neither actionable nor relevant. If the MTTR figure you give the business includes incidents they don't care about, you're giving an inaccurate impression of how your incident practice is performing.
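To see how that noise piles up, here is a minimal sketch, in Python, of the naive calculation most MTTR reports boil down to. The incident list and its field names are made up purely for illustration; the point is that every resolved ticket, whatever its priority and however it was closed, lands in one average.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: priority, how the ticket was resolved, and timestamps.
incidents = [
    {"priority": "critical", "auto_resolved": False,
     "opened": datetime(2024, 3, 1, 9, 0), "resolved": datetime(2024, 3, 1, 13, 0)},      # 4-hour outage
    {"priority": "low", "auto_resolved": False,
     "opened": datetime(2024, 3, 1, 10, 0), "resolved": datetime(2024, 3, 8, 10, 0)},      # week-old nuisance ticket
    {"priority": "low", "auto_resolved": True,
     "opened": datetime(2024, 3, 1, 11, 0), "resolved": datetime(2024, 3, 1, 11, 0, 30)},  # auto-closed event, 30 seconds
]

def naive_mttr(records):
    """The classic MTTR report: average resolution time across every resolved incident."""
    durations = [r["resolved"] - r["opened"] for r in records if r["resolved"] is not None]
    return sum(durations, timedelta()) / len(durations)

# One number that blends a real outage, a stale ticket, and an auto-closed event.
print(naive_mttr(incidents))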
What is CIRT?
Critical incident response time (CIRT) is a far more accurate way to evaluate how well your incident practice is performing because it concentrates on incidents that have actual impact on the business.
CIRT is interested in how long it takes you to respond to business-impacting incidents, rather than how long it takes you to resolve them. CIRT measurements remove the extraneous noise from your reporting. This can be accomplished by taking a few relatively easy steps (sketched in code after the list below):
- Removing all low-urgency incidents from your reporting. These are unlikely to be having a real impact on business performance.
- Removing any incidents that were auto-resolved. Incidents, or events, that are resolved without human intervention are unlikely to have caused any noticeable business impact.
- Removing incidents that are resolved within 2 minutes. These are very unlikely to have had any noticeable impact on the business.
- Setting business rules for incidents that sit in the queue for an extended time. Incidents that have been sitting for a long time without resolution are very unlikely to be having significant business impact; if they were, you would be hearing about it!
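Here is a minimal sketch of those filters, reusing the hypothetical record structure from the earlier MTTR example. The two-minute cutoff comes from the list above; the 30-day rule for stale incidents and the `responded` timestamp (when the response team engaged) are assumptions you would replace with your own business rules and tooling fields.

```python
from datetime import datetime, timedelta

QUICK_FIX_CUTOFF = timedelta(minutes=2)   # rule 3: ignore anything resolved within 2 minutes
STALE_CUTOFF = timedelta(days=30)         # rule 4: assumed business rule for long-sitting incidents

def cirt_population(records, now):
    """Apply the four filtering rules to the raw incident list."""
    kept = []
    for r in records:
        if r["priority"] == "low":                  # rule 1: drop low-urgency incidents
            continue
        if r["auto_resolved"]:                      # rule 2: drop auto-resolved events
            continue
        resolved = r.get("resolved")
        if resolved and resolved - r["opened"] <= QUICK_FIX_CUTOFF:
            continue                                # rule 3: drop quick fixes
        if not resolved and now - r["opened"] > STALE_CUTOFF:
            continue                                # rule 4: drop long-sitting, unresolved incidents
        kept.append(r)
    return kept

def critical_incident_response_time(records):
    """CIRT: average time from an incident being opened to the response team engaging."""
    waits = [r["responded"] - r["opened"] for r in records if r.get("responded")]
    return sum(waits, timedelta()) / len(waits) if waits else None
```

The key difference from the MTTR sketch is the timestamp being averaged: when the team responded, not when the incident was resolved.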
Applying these rules to the incidents that are included in your CIRT measurements ensures that you:
- Provide accurate and actionable metrics that will measure real operations performance
- Contribute to positive business transformation
Measuring for business need, not IT need
The important measures for business-critical incidents relate to how long it takes to identify, locate, and engage the right people, and to initiate the response. The quicker this can be done, the better.
Unfortunately, a recent report from Fintech Futures indicates that it takes, on average, over an hour for companies to assemble all the appropriate resources to start work on a critical incident. Only 30% were able to assemble the personnel and resources in less than an hour.
These numbers matter to the business. Ultimately, reducing response times will improve your downtime statistics and improve customer perception of your critical incident processes.
Rather than giving an average time to resolve for these incidents, which varies dramatically in complex IT environments, it is more important to measure process efficiency by examining how long it takes for your team to initiate a response. You have complete control over this response time, and you can take actionable steps to improve it.
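As a sketch of what that measurement can look like in practice (the stage names and the one-hour target here are assumptions, not a standard), you can timestamp each step of mobilization and report only the elapsed time you control:

```python
from datetime import datetime, timedelta

# Hypothetical mobilization timeline for a single critical incident.
incident = {
    "detected":     datetime(2024, 3, 1, 9, 0),
    "team_engaged": datetime(2024, 3, 1, 9, 40),   # right people identified, located, and engaged
    "work_started": datetime(2024, 3, 1, 9, 55),   # response initiated
}

response_time = incident["work_started"] - incident["detected"]
print(f"Time to initiate response: {response_time}")   # 0:55:00

# An assumed one-hour target, in the spirit of the figure quoted above.
TARGET = timedelta(hours=1)
print("Within target" if response_time <= TARGET else "Missed target")
```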
Best practices for ITSM reporting
Historically, ITSM professionals have loved reports. We like to see how well we are doing, we want to measure improvements, and we must present our accomplishments to our management. Too often, however, we grab the low-hanging fruit of reporting options, using these metrics to justify our value to the business.
But, a word of caution: before you change what you report to the business, sit down and talk to the business units. Ensure that you understand what is truly important to them. Don’t simply send them reports to make yourself feel better! Don’t report on something that you aren’t going to action or have no ability to change. If your report won’t lead to improvement, it is pointless: you waste time producing it, and they waste time reading it.
My view is that the CIRT report has far greater value to the business than a standard MTTR report, but talk to your business first. Explain what the new report tells them, and start sending it only if they agree they would find value in it.
Once you do start tracking CIRT, make sure you have a plan to improve your metrics. How can you bring your teams together faster? Where are the bottlenecks in the process? What isn’t working? Always use your reporting to create real improvements, not just a token you send to the business each month.