“The cloud is just a computer in Reston with a bad power supply.” This was the highlight of programmer Andy Hunt’s tweet following an AWS outage on August 31, 2019. The incident led to the loss of his organization’s data and affected other companies including Reddit. (Luckily for Hunt, he had backups elsewhere.)
The root cause? The failure of Amazon backup generators following a power outage at its AWS US-EAST-1 data center in Northern Virginia, leading to some EC2 instances and EBS volumes incurring hardware damage, and subsequently causing irrecoverable data loss—every organization’s nightmare.
Availability is at the heart of IT service management: more than any other factor, it determines the value customers get from an IT service. It is also one of the three pillars of information security in the CIA triad (confidentiality, integrity, and availability). It is therefore understandable that customers like Andy cause a commotion when availability isn’t treated with the care it deserves, particularly if the service provider is slow and unclear in communicating the incident and its resolution efforts.
To avoid a repeat of an incident like Amazon’s, and of the resolution and communication problems that followed, let’s look at availability management and the essential role of the availability manager.
Understanding availability management
According to ITIL® 4, availability is the ability of an IT service or other configuration item to perform its agreed function when required. So, if you can’t log in to Facebook or download your emails or access your Salesforce dashboard, your immediate reaction is to deem that service unavailable.
The purpose of availability management is to ensure that services deliver agreed levels of availability to meet the needs of customers and users. The more critical a service is to the customer, the more the company should invest in its availability. We gain insights regarding the bare minimum of what comprises availability management from the ISO/IEC 20000 standard:
- Assessing and documenting risks to service availability at regular intervals
- Determining and documenting service availability requirements and targets, by considering relevant business requirements, service requirements, SLAs, and risks
- Monitoring and recording service availability results and comparing to targets
- Investigating and addressing instances of unplanned non-availability
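The monitoring activity in this list (recording availability results and comparing them to targets) can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the `ServicePeriod` record, its field names, and the sample figures are all hypothetical, and it assumes availability is reported as the percentage of agreed service time during which the service was up.

```python
from dataclasses import dataclass


@dataclass
class ServicePeriod:
    """One reporting period for a service (hypothetical record layout)."""
    service: str
    period_minutes: float    # agreed service time in the period
    downtime_minutes: float  # unplanned non-availability recorded in the period
    target_percent: float    # availability target, e.g. taken from an SLA


def availability_percent(period: ServicePeriod) -> float:
    """Percentage of agreed service time during which the service was available."""
    if period.period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    uptime = period.period_minutes - period.downtime_minutes
    return 100.0 * uptime / period.period_minutes


def meets_target(period: ServicePeriod) -> bool:
    """True when the measured availability reaches the agreed target."""
    return availability_percent(period) >= period.target_percent


if __name__ == "__main__":
    # Illustrative figures only: a 30-day month with one 4-hour outage.
    month = ServicePeriod("email", 30 * 24 * 60, 4 * 60, 99.9)
    print(f"{month.service}: {availability_percent(month):.3f}% "
          f"(target {month.target_percent}%, met: {meets_target(month)})")
```

A period that misses its target would then feed the last activity in the list: investigating and addressing the unplanned non-availability.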
Availability management works hand-in-hand with other practices such as architecture, change and configuration, release and deployment, and incident and problem management in order to ensure that elements such as capacity, continuity, and security are designed, built, deployed and managed effectively across the life of the service and its underlying infrastructure and components. A holistic view is required as there are countless availability risks in the ITSM domain, such as expired certificates, poorly planned configuration changes, human error, and vendor-related failures, among others.
Monitoring and measurement of availability must consider both the component view (through events and alerts) and the customer view (based on complaints and usage patterns). The success of availability management at a service level will be measured by two main metrics:
- Mean time to restore service (MTRS): How quickly your company addresses non-availability, e.g. 4 hours
- Mean time between failures (MTBF): The average time a service runs between episodes of non-availability, e.g. six months (two failures a year)
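Both metrics can be derived from a simple outage log. The sketch below is one possible way to do it, under stated assumptions: outages are recorded as hypothetical `(lost, restored)` timestamp pairs, MTBF is taken as total uptime in the observation window divided by the number of failures, and the ratio MTBF / (MTBF + MTRS) is used as a common approximation of long-run availability. That ratio is a widely used rule of thumb rather than something the standards quoted above define.

```python
from datetime import datetime
from typing import List, Tuple

# (service lost, service restored); a hypothetical log format
Outage = Tuple[datetime, datetime]


def total_downtime_hours(outages: List[Outage]) -> float:
    """Sum of all outage durations, in hours."""
    return sum((end - start).total_seconds() for start, end in outages) / 3600


def mtrs_hours(outages: List[Outage]) -> float:
    """Mean time to restore service: average outage duration."""
    if not outages:
        raise ValueError("at least one outage is required")
    return total_downtime_hours(outages) / len(outages)


def mtbf_hours(outages: List[Outage], window_start: datetime,
               window_end: datetime) -> float:
    """Mean time between failures: uptime in the window per failure."""
    if not outages:
        raise ValueError("at least one outage is required")
    window = (window_end - window_start).total_seconds() / 3600
    return (window - total_downtime_hours(outages)) / len(outages)


def estimated_availability(mtbf: float, mtrs: float) -> float:
    """Approximate long-run availability as MTBF / (MTBF + MTRS)."""
    return mtbf / (mtbf + mtrs)


if __name__ == "__main__":
    # Illustrative data: two 4-hour outages during 2019.
    log = [
        (datetime(2019, 3, 1, 0, 0), datetime(2019, 3, 1, 4, 0)),
        (datetime(2019, 8, 31, 10, 0), datetime(2019, 8, 31, 14, 0)),
    ]
    mtrs = mtrs_hours(log)
    mtbf = mtbf_hours(log, datetime(2019, 1, 1), datetime(2020, 1, 1))
    print(mtrs, mtbf, estimated_availability(mtbf, mtrs))
```

With this sample log, MTRS works out to 4 hours and MTBF to 4,376 hours, which matches the "4 hours" and "twice a year" examples in the list above.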
The focus of availability management has shifted from designing systems that are fault tolerant (addressing MTBF) towards designing systems that recover quickly (addressing MTRS). This shift has brought forward concepts such as antifragile software, which thrives on volatility and surprise. Techniques such as auto scaling, microservices, and chaos engineering are now quite prevalent in this area.
The Availability Manager role
While Availability Manager is not a prominent job title today (though organizations do still recruit for the role), managing availability is part and parcel of ITSM environments, particularly those of an operational nature.
Interestingly, the European e-Competence Framework (e-CF) does not list ‘Availability’ in any title of its 40 reference dimensions or in the 30 European ICT Professional Role Profiles. A quick search, however, reveals that availability knowledge is required in several roles and activities:
- Architecture design
- Problem management
- Information security strategy development
- Information security management
- The data administrator role
- The DevOps expert role
Whether you’re a solution architect, software developer, systems administrator, or service desk support specialist, availability management will always be critical to your KPIs or OKRs. An excellent example is the site reliability engineer (SRE): availability is among the role’s top elements as it is essential to protecting, providing, and progressing software and systems.
Availability manager tasks and responsibilities
To get an idea of expectations for your Availability Manager, SFIA 7 defines three availability management responsibility levels, categorized under Delivery and Operation (sub-category: Service Design). These are examples of higher responsibility, so an availability manager for these levels would be in leadership and/or have significant expertise:
Availability management: Level 4
- Contributes to the availability management process and its operation and performs defined availability management tasks.
- Analyzes service and component availability, reliability, maintainability and serviceability.
- Ensures that services and components meet and continue to meet all agreed performance targets and service levels.
- Implements arrangements for disaster recovery and documents recovery procedures.
- Conducts testing of recovery procedures.
Availability management: Level 5
- Provides advice, assistance, and leadership associated with the planning, design, and improvement of service and component availability, including the investigation of all breaches of availability targets and service non-availability, with the instigation of remedial activities.
- Plans arrangements for disaster recovery together with supporting processes and manages the testing of such plans.
Availability management: Level 6
- Sets policy and develops strategies, plans, and processes for the design, monitoring, measurement, maintenance, reporting and continuous improvement of service and component availability, including the development and implementation of new availability techniques and methods.
BMC supports availability management
For clear and actionable availability management that aligns with your company’s IT service and operations management, it is critical to implement the right strategy. The most successful strategies are supported by the right tools that meet your company’s needs.
BMC has a large suite of the most innovative ITSM and ITOM products, including the only end-to-end ITSM and ITOM platform: BMC Helix. For more information on how BMC can help you manage availability within your autonomous enterprise platform, contact BMC today.