Organizations depend on the availability of the IT services that power their daily business operations. Most organizations use technology to deliver services to end-users and customers, and could not do so without a functional IT infrastructure. With the proliferation of cloud computing, many organizations rely on third-party cloud vendors to operate and deliver IT infrastructure services. While vendors promise adequate reliability, for example through Service Level Agreements (SLAs) that guarantee availability 99.999 percent of the time, outages of cloud services are a stark reality of the modern enterprise IT industry. Even Amazon Web Services, the largest cloud vendor and pioneer of on-demand, subscription-based networked IT infrastructure, has had its fair share of cloud outages.
A cloud outage is the period during which a cloud infrastructure service is unavailable for use. Unavailability may also mean that the service performs below the agreed SLA metrics. For instance, an incident that only partially impacts a data center may still require the vendor to perform maintenance and restoration work. Until the service is fully restored to the agreed SLA standards, the end-user may experience it as downtime.
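To make the definition concrete, the following minimal sketch converts an availability target such as the 99.999 percent figure mentioned earlier into a yearly downtime budget, and checks a set of incidents against it. The function names, the 365-day year and the incident durations are illustrative assumptions, not part of any vendor's SLA terms.

```python
from typing import Sequence

MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def allowed_downtime_minutes(availability_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime budget implied by an availability target over a period."""
    return period_minutes * (1 - availability_percent / 100)


def measured_availability_percent(outage_minutes: Sequence[float],
                                  period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Availability actually delivered, given the length of each outage."""
    return 100 * (1 - sum(outage_minutes) / period_minutes)


budget = allowed_downtime_minutes(99.999)
print(f"99.999% allows about {budget:.2f} minutes of downtime per year")

# Hypothetical incidents: a 4-minute full outage and a 30-minute degradation
# that fell below the SLA metrics and is therefore counted as downtime.
delivered = measured_availability_percent([4, 30])
print(f"Delivered availability: {delivered:.4f}%")
print("SLA met" if delivered >= 99.999 else "SLA missed")
```

Counting the degraded period as downtime follows the definition above: the service is treated as unavailable until it is back within the agreed SLA metrics.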
Cloud outages may result from a range of causes, both within and beyond the control of a cloud vendor. The following list briefly highlights the issues that cloud vendors take into consideration to ensure that the service consistently meets its SLAs:
Cloud computing allows business organizations to invest resources in product development and innovation instead of keeping the infrastructure alive. The sheer scale of a modern cloud datacenter, together with pressing internal and external threats, makes it virtually impossible to eliminate the possibility of a cloud outage. For this reason, customers of cloud computing must understand both the reality and the unpredictability of cloud outages. While it is possible to mitigate many of the issues that lead to an outage, cloud systems also suffer from “unknown unknowns”: issues that vendors do not know they do not know about. This unpredictability has to be met with the acknowledgement that a cloud outage will happen, and with corrective measures that reduce its impact. For some issues, it may be more cost-effective or less complex to accept an outage than to invest in mitigation efforts.
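The last point can be illustrated with a simple expected-cost comparison. This is a rough sketch under stated assumptions: the function names are hypothetical, the probabilities and dollar figures are invented for illustration, and a real decision would also weigh complexity, reputation and contractual penalties.

```python
def expected_outage_cost(probability: float,
                         duration_hours: float,
                         cost_per_hour: float) -> float:
    """Expected yearly loss from one class of outage."""
    return probability * duration_hours * cost_per_hour


def worth_mitigating(probability: float,
                     duration_hours: float,
                     cost_per_hour: float,
                     mitigation_cost: float,
                     residual_probability: float = 0.0) -> bool:
    """True if the expected loss avoided exceeds the cost of mitigation."""
    before = expected_outage_cost(probability, duration_hours, cost_per_hour)
    after = expected_outage_cost(residual_probability, duration_hours, cost_per_hour)
    return (before - after) > mitigation_cost


# Hypothetical rare, short outage: cheaper to accept than to mitigate.
print(worth_mitigating(probability=0.02, duration_hours=3,
                       cost_per_hour=10_000, mitigation_cost=5_000))

# Hypothetical frequent, long outage: mitigation pays for itself.
print(worth_mitigating(probability=0.5, duration_hours=8,
                       cost_per_hour=20_000, mitigation_cost=30_000,
                       residual_probability=0.05))
```

In the first case the expected loss is well below the mitigation cost, so accepting the outage is the cheaper option. In the second, the avoided loss outweighs the investment.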
For instance, datacenter resources generate a deluge of service incident alerts that point to possible future technical issues. Cloud vendors rely on advanced machine learning capabilities and automation technologies to identify the most impactful red flags and perform proactive maintenance on a small, isolated section of the infrastructure related to the root cause. The alerting mechanism is set to a threshold that may allow some risky alerts to go under the radar, triggering action only when the impact on the service is large enough. This tradeoff is tuned to reduce the number of incident triggers that force frequent maintenance, while operating at an acceptable level of risk with a low probability of impact. In the real world, however, this risk calculation may be inaccurate, or the accepted risk may combine with other risks and lead to an eventual cloud outage.
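The threshold tradeoff described above can be sketched in a few lines. The `Alert` type, the 0-to-1 `risk_score` and the sample alerts are assumptions made for illustration; real vendor alerting pipelines are considerably more involved.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alert:
    source: str
    risk_score: float  # hypothetical score between 0 and 1, e.g. from a model


def alerts_to_act_on(alerts: List[Alert], threshold: float) -> List[Alert]:
    """Alerts at or above the threshold trigger maintenance."""
    return [a for a in alerts if a.risk_score >= threshold]


def alerts_ignored(alerts: List[Alert], threshold: float) -> List[Alert]:
    """Alerts below the threshold go under the radar."""
    return [a for a in alerts if a.risk_score < threshold]


alerts = [
    Alert("disk-array-12", 0.35),
    Alert("power-unit-3", 0.65),
    Alert("core-switch-1", 0.90),
]

for threshold in (0.5, 0.8):
    acted = [a.source for a in alerts_to_act_on(alerts, threshold)]
    ignored = [a.source for a in alerts_ignored(alerts, threshold)]
    print(f"threshold={threshold}: act on {acted}, ignore {ignored}")
```

Raising the threshold from 0.5 to 0.8 means fewer maintenance triggers, but the hypothetical `power-unit-3` alert is no longer acted on. That is the accepted risk the paragraph describes.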
Customers of cloud services should consider a similar tradeoff when investing in a cloud solution. If an outage of a certain duration is not acceptable for healthy business operations, it may be appropriate to invest in high-availability SLAs. Similarly, customers may need additional monitoring, visibility and control capabilities to ensure that a possible cloud outage has the least possible impact on their business.