In the world of technology and software development, you are always trying out something new—only to test it again. Engineers learn from their mistakes and use them to grow their skillsets and improve processes. But some mistakes, like a major network or infrastructure failure, are less forgiving. The result of these unintended problems is a thing of nightmares.

Fortunately, a systematic approach is available that helps engineers and developers trace a problem back to where it began and discover what went wrong: root cause analysis (RCA).

RCA also integrates seamlessly with AIOps, enhancing predictive analysis and automated resolution in IT environments.

In this article, we’ll look at RCA in IT environments, including:

Defining RCA and why it might be necessary
Exploring RCA strategies, including the 5 whys
Understanding the many benefits of RCA

What is root cause analysis?

Root cause analysis (RCA) is a systematic process for finding and identifying the root cause of a problem or event.

RCA rests on the idea that a truly effective system requires more than putting out fires all day. That’s why RCA starts with figuring out how, where, and why the issue appeared. Then it goes further: RCA acts on that answer to prevent the issue from happening again.

Originating in the field of aeronautical engineering, this method is now applied in virtually every industry, but with particular focus and benefits in software development. Finding the root cause of a software or infrastructure problem is a highly effective, quality engineering technique that many industries already mandate in their governance.

Root cause analysis is considered a reactive management approach. In the ITIL® framework for service management, for instance, incident management is a reactive move where you’re responding to a critical incident. Problem management, on the other hand, is a proactive approach wherein you’re seeking out problems to address. (Learn more in Incident Management vs Problem Management.)

Why is root cause analysis necessary?

RCA delivers a wide range of advantages (detailed below), but it is especially beneficial in the continuously changing environment of software development and information technology for several reasons:

Focuses on cause, not symptoms: RCA pinpoints the factors that contribute to the problem or event, helping to find the actual cause of the problem as opposed to just fixing resulting symptoms. Its depth also helps to avoid singling out one issue over others for a quick fix.
Reduces cost and time: By catching problems early, RCA significantly reduces cost and time spent, enabling developers to maintain an agile development environment and drive process improvement.
Improves system reliability: Delving into the root cause of issues enhances the reliability and performance of IT systems.
Promotes proactive problem management: RCA identifies potential issues before they escalate, allowing for proactive measures and reducing the likelihood of major disruptions.
Encourages a culture of learning: It fosters a culture of continuous improvement and learning within IT teams, improving their problem-solving skills.
Enhances customer satisfaction: Stable and reliable services, achieved through effective RCA, lead to higher customer satisfaction and trust.
Optimizes resource utilization: By avoiding repetitive troubleshooting and temporary fixes, RCA is cost-effective in the long run.
Facilitates risk mitigation: RCA aids in identifying and mitigating risks, thereby reducing the severity and frequency of incidents or system failures.

Although performing root cause analysis might feel time-consuming, the opportunity to eliminate or mitigate risks and root causes is undeniably worthwhile.

RCA principles

Some of the basic principles of RCA can help organizations ensure they are following the correct methodology:

Focusing on corrective measures of root causes is more effective than simply treating the symptoms of a problem or event.
Effective RCA is accomplished through a systematic process with evidence-backed conclusions.
There is usually more than one root cause for a problem or event.
The focus of RCA, via problem identification, is WHY the event occurred—not who made the error.

How to perform root cause analysis

The specific map of root cause analysis may look slightly different across organizations and industries. But here are the most common steps, in order, to perform RCA:

Let’s look at these steps in detail.

Define the problem. When a problem or event arises, your first move is to identify and isolate all suspected parts of the problem. This keeps the problem from spreading while you investigate.
Gather data. Once you find the problem, compile all data and evidence related to the specific issue to begin understanding what might be the cause.
Identify any contributing issues. You might have hands-on experience or stories from others that indicate any additional issues.
Determine root cause. Here’s where your root cause analysis really occurs. You can use a variety of RCA techniques (detailed below). Each technique helps you search for small clues that may reveal the root cause, allowing the person or team to correctly identify what went wrong.
Implement the solution. Determining the root cause will likely indicate one or several solutions. You might be able to implement the solution right away. Or, the solution might require some additional work. Either way, RCA isn’t done until you’ve implemented a solution.
Document actions taken. After you’ve identified and solved the root problem, document the problem and the overall resolution so that future engineers can use it as a resource.
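The six steps above can be tracked in a small record so that no step is skipped. The following is only a minimal sketch: the `RcaRecord` class, its method names, and the sample problem are hypothetical and are not part of any standard RCA tool.

```python
from dataclasses import dataclass, field

# The six steps from the list above, in order.
RCA_STEPS = (
    "Define the problem",
    "Gather data",
    "Identify any contributing issues",
    "Determine root cause",
    "Implement the solution",
    "Document actions taken",
)


@dataclass
class RcaRecord:
    """Hypothetical record that tracks one RCA through the six steps."""

    problem: str
    notes: dict = field(default_factory=dict)  # step name -> what was found or done

    def next_step(self):
        """Return the first step not yet recorded, or None when all are done."""
        for step in RCA_STEPS:
            if step not in self.notes:
                return step
        return None

    def complete(self, step: str, note: str) -> None:
        """Record a step, refusing to skip ahead of the expected order."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"Expected step {expected!r}, got {step!r}")
        self.notes[step] = note

    def is_done(self) -> bool:
        """An RCA counts as done only when every step has been recorded."""
        return self.next_step() is None


# Illustrative usage with made-up incident details.
record = RcaRecord(problem="Checkout API returns HTTP 500 errors")
record.complete("Define the problem", "Isolated to the payment service")
record.complete("Gather data", "Collected service logs and deploy history")
print(record.next_step())
print(record.is_done())
```

Enforcing the order in `complete` reflects the point made above: the analysis is not finished until a solution has been implemented and documented.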

Even if you don’t expect the problem to occur again, plan as if it will.

Remember, for RCA to be effective, the team must recognize that processes cause problems, not people. Pointing fingers and placing blame on specific workers will not solve anything.

(Learn more about the importance of a blameless culture when performing an incident postmortem—the final step of your root cause analysis.)

Methods for root cause analysis

You can perform RCA using a variety of techniques. We highlight several well-known RCA techniques below. Use the technique that fits your specific situation. Here’s a simple distinction among four of them:

Five whys analysis is good for initial troubleshooting.
Ishikawa (fishbone) diagrams are helpful for identifying all possible root causes for a situation.
Pareto charts help you prioritize which root causes should be addressed first, based on how often each identified root cause occurs.
Scatter plots are helpful in situations where you can identify and collect data on fluctuating variables that are related to the problem you are studying.

Take a look at these options and consider which might be best for your situation:

The 5 whys analysis

One of the easiest and most common tools for conducting a root cause analysis is the “five whys” method. Mimicking curious children, the five whys method literally suggests that you ask “why?” five times in a row in order to identify the root cause of any process or problem.

Five whys analysis is effective because it is easy to use for solving problems where there is a single root cause.

Although the method seems explicit enough, this approach is still meant to be flexible depending on the scenario. Sometimes five whys will be enough. Other times, you’ll need to ask why a few more times. You could also use additional techniques to identify the root cause.

To begin this method, follow this outline:

Write down the specific problem that needs to be fixed, describing it completely.
Ask why the problem happened. Write the answer below.
If your first question did not find the root cause, ask why again and write that answer down.
Continue this process until the team agrees you’ve identified the root cause of the problem.

(See the five whys in action with a simple RCA example, below.)

Pareto charts

Pareto charts identify the most significant factor among a large set of factors causing a problem or event. A Pareto chart is a combined bar and line chart, where the factors are plotted as bars arranged in descending order. The chart is accompanied by a line graph showing the cumulative totals of each factor, left to right.
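The numbers behind a Pareto chart are straightforward to compute: count each factor, sort in descending order, and keep a running cumulative percentage. Here is a minimal sketch using only the Python standard library. The incident data and the `pareto_table` helper are made up for illustration.

```python
from collections import Counter

# Hypothetical incident log: the root cause assigned to each of ten incidents.
incident_causes = [
    "config error", "config error", "config error", "config error", "config error",
    "expired certificate", "expired certificate", "expired certificate",
    "disk full",
    "network timeout",
]


def pareto_table(causes):
    """Return (cause, count, cumulative_percent) rows, largest count first."""
    counts = Counter(causes).most_common()  # sorted by count, descending
    total = sum(count for _, count in counts)
    rows = []
    running = 0
    for cause, count in counts:
        running += count
        rows.append((cause, count, round(100 * running / total, 1)))
    return rows


# Text version of the chart: bars are the counts, the last column is the cumulative line.
for cause, count, cumulative_pct in pareto_table(incident_causes):
    print(f"{cause:<20} {'#' * count:<6} {cumulative_pct:>5}%")
```

In this made-up data set, the first two causes account for 80% of incidents, which is the kind of concentration a Pareto chart is meant to expose.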


Ishikawa diagrams

You might know the Ishikawa diagram by other names: the fishbone, the herringbone, the cause-and-effect, and, our favorite, the Fishikawa diagram.

The Ishikawa diagram is a great visualization tool for brainstorming and discovering multiple root causes. It is shaped like a fish skeleton, with the head on the right and the possible causes shown as fishbones to the left.
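Before drawing the diagram, it can help to capture the brainstorm as plain data: one entry per main bone, each holding its candidate causes. The sketch below is only illustrative. The problem, the category names, the causes, and the `render_fishbone` helper are all assumptions chosen for the example.

```python
# Hypothetical brainstorm for one problem; categories and causes are illustrative.
problem = "Nightly backup job fails"
causes_by_category = {
    "People": ["On-call runbook out of date"],
    "Process": ["No alert on failed job", "Manual credential rotation"],
    "Technology": ["Backup volume near capacity"],
}


def render_fishbone(problem, causes_by_category):
    """Return a text outline: categories are main bones, causes are small bones."""
    lines = []
    for category, causes in causes_by_category.items():
        lines.append(category)
        for cause in causes:
            lines.append(f"  - {cause}")
    lines.append(f"=> {problem}")  # the "head" of the fish
    return "\n".join(lines)


print(render_fishbone(problem, causes_by_category))
```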

Scatter diagrams (plots)

Scatter diagrams, or scatter plots, graph pairs of numerical data, often together with a fitted regression line, to reveal relationships between two variables. This is helpful to identify problems and events that occur because of fluctuating measurements, such as capacity issues that happen when server traffic increases.
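To make the idea concrete, the sketch below computes the two numbers usually read off a scatter plot: the Pearson correlation coefficient and a least-squares trend line. It uses only the standard library, and the traffic and latency figures are invented for illustration.

```python
from math import sqrt

# Hypothetical paired observations: request rate and response time.
requests_per_sec = [100, 200, 300, 400, 500]
response_ms = [120, 150, 170, 210, 250]


def pearson_and_fit(xs, ys):
    """Return (r, slope, intercept) for paired samples using least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    r = sxy / sqrt(sxx * syy)           # strength of the linear relationship
    slope = sxy / sxx                   # change in y per unit of x
    intercept = mean_y - slope * mean_x
    return r, slope, intercept


r, slope, intercept = pearson_and_fit(requests_per_sec, response_ms)
print(f"correlation r = {r:.3f}")
print(f"trend line: response_ms = {slope:.2f} * requests_per_sec + {intercept:.1f}")
```

A correlation close to 1 suggests that response time rises with traffic in this sample. It does not prove causation, so treat it as a lead to investigate rather than a confirmed root cause.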

(Learn how to create your own scatter plots using Matplotlib.)

FMEA (failure mode and effects analysis)

FMEA is a systematic, step-by-step approach for identifying all possible failures in a design, a manufacturing or assembly process, or a product or service. It’s particularly valuable for preemptively addressing potential failures and enhancing reliability. FMEA focuses on identifying failure modes, their causes, and their effects, enabling teams to prioritize the risks and implement effective control measures. This process is instrumental in improving safety, increasing customer satisfaction, and reducing costs by catching issues early in the development cycle.
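One common way to prioritize failure modes in FMEA is the risk priority number (RPN): the product of severity, occurrence, and detection ratings. The article does not prescribe a scoring scheme, so the 1 to 10 scales, the `FailureMode` class, and the sample ratings below are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """One row of a hypothetical FMEA worksheet, rated on assumed 1-10 scales."""

    name: str
    severity: int    # 1 = negligible effect, 10 = catastrophic
    occurrence: int  # 1 = rare, 10 = almost inevitable
    detection: int   # 1 = almost certain to be detected, 10 = nearly undetectable

    @property
    def rpn(self) -> int:
        """Risk priority number: higher means address sooner."""
        return self.severity * self.occurrence * self.detection


# Made-up failure modes and ratings.
modes = [
    FailureMode("Log disk fills up", severity=4, occurrence=6, detection=2),
    FailureMode("Database failover does not trigger", severity=9, occurrence=2, detection=6),
    FailureMode("TLS certificate expires", severity=7, occurrence=4, detection=3),
]

# Rank the failure modes from highest to lowest risk.
ranked = sorted(modes, key=lambda mode: mode.rpn, reverse=True)
for mode in ranked:
    print(f"{mode.rpn:>4}  {mode.name}")
```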

Fault tree analysis (FTA)

FTA is a top-down, deductive analytical method used to identify and analyze the potential causes of system failures. Starting with a known problem (or “top event”), FTA uses logic diagrams to map out the various intersecting paths that could lead to the failure. This approach is essential for understanding complex systems, where multiple factors may interact to cause a failure. FTA is widely employed in safety engineering and reliability engineering to anticipate potential problems, thus aiding in the development of more reliable and safer systems.
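The logic diagram in a fault tree is built from AND and OR gates over basic events, so a small tree can be evaluated in a few lines. This is a minimal sketch: the gate helpers, the `occurs` function, and the example tree are hypothetical, and real FTA tools usually also attach probabilities to each event.

```python
def or_gate(*children):
    """The output event occurs if ANY child event occurs."""
    return ("OR", children)


def and_gate(*children):
    """The output event occurs only if ALL child events occur."""
    return ("AND", children)


def occurs(node, basic_events):
    """Evaluate the tree top-down; leaves are names looked up in basic_events."""
    if isinstance(node, str):
        return basic_events[node]
    gate, children = node
    results = [occurs(child, basic_events) for child in children]
    return any(results) if gate == "OR" else all(results)


# Hypothetical top event "service outage": the database fails,
# OR both the primary and the backup web server fail together.
service_outage = or_gate(
    "database down",
    and_gate("primary web server down", "backup web server down"),
)

state = {
    "database down": False,
    "primary web server down": True,
    "backup web server down": False,
}
print(occurs(service_outage, state))  # the backup is still up, so no outage
```

Flipping `"backup web server down"` to `True` makes the top event occur, which shows how the tree captures failures that only arise when several factors combine.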

RCA example using five whys analysis

Here is a simple five whys analysis where we try to determine why a computer is not turning on. At each step, we ask why the computer is not turning on. We gather data as we follow the power flow, until we finally determine that the power strip the computer is plugged into is turned off.

Here’s what the user has reported: Their desktop computer is not turning on. The monitor is turned on, but the user does not hear the computer fan running, and there are no power lights.

Root cause analysis using the 5 Whys to troubleshoot a computer that won't turn on

Resolution: Technician turned on the surge protector and the computer came back on again.
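The same chain of questions can be written down as data. In the sketch below, the intermediate answers are assumptions made up to illustrate following the power flow. Only the starting problem and the final cause (the power strip is off) come from the example above.

```python
# Hypothetical chain: each statement maps to the answer to "why?".
why_chain = {
    "Computer does not turn on": "Computer is not receiving power",
    "Computer is not receiving power": "Power strip is not supplying power",
    "Power strip is not supplying power": "Power strip is switched off",
}


def five_whys(problem, why_chain, max_whys=5):
    """Follow 'why?' answers from the problem until no deeper answer is known."""
    trail = [problem]
    # len(trail) - 1 is the number of whys asked so far.
    while trail[-1] in why_chain and len(trail) - 1 < max_whys:
        trail.append(why_chain[trail[-1]])
    return trail


trail = five_whys("Computer does not turn on", why_chain)
for depth, statement in enumerate(trail):
    label = "Problem" if depth == 0 else f"Why #{depth}"
    print(f"{label}: {statement}")
print("Root cause:", trail[-1])
```

Here the chain ends after three whys rather than five, which matches the earlier point that the number of questions is flexible.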

Benefits of root cause analysis

The main benefit of root cause analysis is obvious: identifying problems so you can solve them. RCA offers plenty more benefits that help to solidify its usefulness and importance in the tech environment.

Solve real-world problems

When the right employees get the right RCA and resolution training, you’ll execute correct processes and solve common business problems.

Lower costs

When you catch problems quickly, you reduce the likelihood that those problems will turn into major incidents, especially when RCA is used to support an agile environment. RCA saves valuable employee time and helps ensure the organization doesn’t incur fines or other compromises.

Make the workplace safer

Employee safety is vital, and root cause analysis provides added peace of mind. By quickly and effectively investigating any safety incidents, you can put solutions into place to prevent anything similar from happening again down the line.

Implement effective, long-lasting solutions

When you follow RCA all the way through to final documentation, you focus on long-term prevention. It also shows that your organization prioritizes solutions, not speedy workarounds.

This forward thinking enables companies to become proactive and productive.

Resolve technical debt, strengthen code base

An RCA may show the problem is broken code due to technical debt. If the problem occurred due to changed business requirements, code development compromises, poor coding practices, or software entropy, the real solution may be refactoring rather than patching. Refactoring realigns your code with desired business outcomes, eliminates technical debt, and brings it up to current standards for future agile deployments.

Limitations of RCA

To effectively implement RCA, it’s essential to understand its limitations. While RCA is a powerful tool for problem-solving and prevention in IT and other industries, it's important to recognize its boundaries and the challenges that can arise. This awareness not only helps in applying RCA more effectively, but also in integrating it with other strategies for a more comprehensive approach to problem-solving.

RCA is primarily a diagnostic process. It focuses on identifying the underlying causes of problems or incidents to prevent their recurrence. However, like any diagnostic method, it has its constraints:

Time-consuming: Comprehensive RCA can be a lengthy process, delaying immediate corrective actions.
Complexity in large systems: In highly complex systems, identifying a single root cause can be challenging.
Subject to bias: RCA outcomes can be influenced by individual or team biases.
Not a panacea: RCA may not address systemic issues or external factors beyond the organization's control.
Resource-intensive: Effective RCA often requires significant resources such as tools and skilled personnel.

By understanding these limitations, IT professionals can better navigate the RCA process, ensuring a more balanced and effective approach to problem-solving. It's also crucial to combine RCA with other methodologies and insights, forming a multifaceted strategy that addresses not just the “what” and “why” of problems, but also the “how” of future prevention and improvement.

Effective RCA saves more than money

Creating a robust root cause analysis process takes time and effort in the initial stages, but it is an investment whose returns extend far beyond the expense. The skills learned during the RCA process carry over to almost every other problem or field and foster an attitude of continuous improvement, and even innovation.

This culture will surely permeate your organization for the better.

Root cause analysis examples by industry

RCA is a versatile tool applied across various industries, each with unique challenges and requirements. Its adaptability and effectiveness in identifying the underlying causes of issues make it an invaluable technique in diverse settings. From healthcare to retail, RCA provides critical insights that drive improvements, enhance efficiency, and prevent future problems.

Each industry is faced with unique challenges:

Healthcare: In healthcare, RCA is used to understand patient safety incidents or equipment failures.
Financial services: Financial institutions use RCA to analyze system failures or security breaches impacting transactions.
Manufacturing: RCA in manufacturing often focuses on production line errors or equipment malfunctions.
IT: In IT, RCA helps in troubleshooting network outages, software bugs, or security incidents.
Retail: Retailers use RCA to address supply chain disruptions or customer service issues.

By examining root cause analysis examples from different sectors, we can gain a deeper appreciation of its versatility and effectiveness in solving complex problems and driving continuous improvement in diverse organizational contexts.

Root cause analysis and AIOps

RCA in the context of AIOps represents a powerful combination for IT environments, offering advanced tools for predictive analysis and automated issue resolution.

AIOps leverages artificial intelligence and machine learning (AI/ML) and big data analytics to enhance IT operations, and when combined with RCA, it provides a powerful mechanism for identifying, as well as predicting and preventing, IT issues.

AIOps enhances RCA by automating data collection and analysis, which allows for faster identification of root causes in real time. This integration leads to more proactive and predictive IT management. For instance, AIOps can analyze patterns and anomalies across vast datasets, detecting potential issues before they escalate into major problems. This predictive capability is crucial for maintaining system health and ensuring uninterrupted service delivery.
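As a rough illustration of automated anomaly detection, the sketch below flags metric samples that sit far from the mean using a z-score. This is a deliberately simple stand-in, not how any particular AIOps product works. The `find_anomalies` helper, the threshold of 3, and the CPU samples are assumptions for the example.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=3.0):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all samples identical, nothing stands out
    return [
        (index, value)
        for index, value in enumerate(values)
        if abs(value - mu) / sigma > threshold
    ]


# Hypothetical CPU utilization samples (percent), one per minute.
cpu_percent = [41, 43, 40, 42, 44, 41, 39, 42, 43, 40, 41, 95, 42, 40]

for index, value in find_anomalies(cpu_percent):
    print(f"minute {index}: {value}% looks anomalous")
```

A flagged sample is a starting point for RCA, not a conclusion: it tells the team where to begin asking why.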

Moreover, AIOps enables more sophisticated RCA by handling complex, multi-layered IT environments where traditional RCA might struggle. It sifts through the noise of vast data sets to pinpoint accurate root causes, reducing the time IT teams spend on troubleshooting and increasing their efficiency.

The combination of RCA and AIOps represents a shift from reactive to proactive and predictive IT management. It helps resolve current issues while also anticipating and preventing future disruptions. This proactive approach is essential for businesses that rely on IT infrastructure for their critical operations, as it minimizes downtime and ensures a more stable and reliable IT environment.

By integrating RCA with AIOps, organizations can harness the power of advanced analytics and AI to transform their IT operations, making them more resilient, efficient, and aligned with business objectives.


