YOW! CTO Summit 2019 Melbourne - Error Budgets and Learning From Incidents

# Learning from Incidents - Andrew Hatch & Rolling Out Error Budgets - John Viner

The presentations by Andrew Hatch from Seek and John Viner from ZenDesk both related to errors, outages and system failures, so I have elected to group them together and combine my thoughts on this topic.

Due to the scale of both Seek and ZenDesk, Andrew and John have encountered many of the issues I have seen before, but they have had to treat them as a priority because of the size and complexity of both their technical environments and their organisations.

In an Agile and DevOps environment, the teams that build a product or feature are expected to own it. This is done with a view to improving customer value and to keeping the complexity any individual has to deal with at a sustainable level. But it has a negative aspect as well. Like organisations that are siloed by technical skill or job function, the larger organisation can become siloed by product; developers and teams don’t always know who their customer is as more consumers throughout the organisation integrate with the product or feature they offer. This leads to the localisation of technical knowledge, which in turn leads to the localisation of incident handling and site reliability knowledge.

During these two presentations John focussed on how to manage the risk associated with an outage, while Andrew focussed on the impact of an outage. Using some of the advice they presented, I will walk through the process in roughly chronological order relative to an outage occurring.

John talked about the implementation of targets, service level indicators (SLIs), service level objectives (SLOs) and the resulting error budget. For context: a target is a simple measurement, such as the availability of a system, the response time of a system, or the total processing time of an item; an SLI is the number of good events divided by the total number of events; an SLO is the desired value of an SLI over time, which could be an uptime measure, the percentage of requests within an acceptable SLI, or the percentage of items processed within the SLI; and the resulting error budget is the complement of the SLO, that is, 1 minus the SLO. These should not be confused with service level agreements (SLAs), which are promises made to customers about the availability and/or speed of a system.

To provide a concrete example: if a target response time of 200ms is set, 10,000 requests are received and 9,900 of these are within 200ms, we can calculate our SLI as 9,900 divided by 10,000, giving us 99%. If the SLO is 98%, then we have exceeded our SLO. We can also calculate our error budget by subtracting our SLO from 1, so in this case we have an error budget of 2%. By looking at our SLI, we can see that half of the error budget has been consumed. For clarity around SLAs, an SLA of 96% may be attached to this indicator, meaning that should the service breach the agreement, the customer will be compensated. The internal SLO will usually be higher than the customer-facing SLA: this provides a margin of error, and as an organisation we want to exceed expectations, not just meet them.
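To make the arithmetic explicit, here is a minimal Python sketch of the example above; the figures come from the example itself, but the helper names and structure are my own illustration rather than anything from the presentation.

```python
# A minimal sketch of the SLI / SLO / error budget arithmetic from the example
# above. The figures (10,000 requests, 9,900 within the 200ms target, a 98% SLO
# and a 96% SLA) come from the example; the helper names are illustrative only.

def sli(good_events: int, total_events: int) -> float:
    """Service level indicator: good events divided by total events."""
    return good_events / total_events

def error_budget(slo: float) -> float:
    """Error budget: the fraction of events allowed to be bad (1 - SLO)."""
    return 1.0 - slo

def budget_consumed(current_sli: float, slo: float) -> float:
    """Fraction of the error budget already used up."""
    return (1.0 - current_sli) / error_budget(slo)

current_sli = sli(good_events=9_900, total_events=10_000)  # 0.99
slo = 0.98                                                 # internal objective
sla = 0.96                                                 # customer-facing agreement

print(f"SLI:             {current_sli:.2%}")                        # 99.00%
print(f"Error budget:    {error_budget(slo):.2%}")                  # 2.00%
print(f"Budget consumed: {budget_consumed(current_sli, slo):.2%}")  # 50.00%
print(f"SLA breached:    {current_sli < sla}")                      # False
```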

When determining the SLIs, SLOs and error budgets, the system has to be viewed from a customer perspective. The items that provide the most value for customers should have the most stringent SLIs and SLOs; for example an outage on an authentication system that prevents people from accessing the service would have a much greater negative impact than an outage on an asynchronous search index updater.

In the organisations I have worked at, almost all have had a central location for API specifications. This enables developers (and sometimes customers) to easily find the available services and interfaces. If such a system is used, it is a perfect place to also publish the SLIs, SLOs and error budgets. Publishing these values alongside the interface specifications helps remind people that systems are fallible, errors will occur, and any system that is developed should be capable of coping with them; this leads to a resilient system architecture and ultimately improves the reliability of all the components in a system.
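As a purely hypothetical illustration of what such published entries might look like (the field names, URLs and values are invented, not taken from either talk), the catalogue could pair each interface specification with its reliability targets, with the customer-critical service carrying the more stringent objective:

```python
# Hypothetical service-catalogue entries pairing interface specifications with
# published reliability targets. All names, URLs and values are invented for
# illustration; the customer-critical service gets the tighter SLO.
service_catalogue = [
    {
        "service": "authentication",
        "spec_url": "https://specs.example.internal/authentication/openapi.yaml",
        "sli": "successful logins completed within 300ms / total login attempts",
        "slo": 0.999,            # an outage here blocks every customer
        "error_budget": 0.001,   # 1 - SLO
    },
    {
        "service": "search-index-updater",
        "spec_url": "https://specs.example.internal/search-index-updater/openapi.yaml",
        "sli": "documents indexed within 60s / total documents submitted",
        "slo": 0.95,             # asynchronous and retryable, so a looser objective
        "error_budget": 0.05,    # 1 - SLO
    },
]
```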

So, why are we promoting acceptable outage levels? Simply because we know it is impossible to have a system that is 100% reliable; couple this with our need to keep innovating and improving, and the constant change this requires, and the likelihood of something unexpected occurring only increases. By implementing and promoting an error budget we acknowledge that outages and service degradation are unavoidable, we set acceptable limits on them, and we remove the fear that is associated with these issues. With an error budget we are also working to find an objective balance between the reliability that is inherent in a stable system and the risk of losing reliability when changes are made to it.

As a developer, even after more than 25 years of coding, I still get nervous when a feature is released to production. It doesn’t matter how much testing has been performed or how many times my code has been reviewed; I worry that if something goes wrong when it is deployed, it could have an unintended consequence and cause an outage. By accepting that this will sometimes occur, and by being prepared to take action to rectify the problem, the fear is reduced. The reduction in the fear of degradation or outage increases my confidence in implementing new features or making changes with the goal of improving the system.

With the implementation of an error budget, and an acknowledgement that sometimes things will go wrong, I can focus on how to reduce the mean time to recovery (MTTR), accepting that the mean time between failures (MTBF) is less important. I know a number of people will be horrified at the thought that the MTBF is of little importance, but in the typical SaaS organisation the length of continuous uptime provides little value to the customer. If we were building aircraft, the MTBF would be a life or death matter (just look at the recent Boeing 737 MAX issues), but most SaaS services aren’t putting lives at risk. Customer value is delivered through the total usable time of a system in any given period. If we have an MTBF of 2 years and an MTTR of 8 hours, then every 2 years there will be a full business day where the system is unavailable; a situation like this will have a huge impact on customers and will rapidly drive them away. But if we have an MTBF of 7 days and an MTTR of 5 minutes, we will have far less impact on the customers’ real and perceived value.
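To put rough numbers on that comparison (my own back-of-the-envelope sketch, not something from the talk), the standard steady-state approximation availability = MTBF / (MTBF + MTTR) gives:

```python
# Back-of-the-envelope comparison of the two scenarios above: steady-state
# availability and expected downtime over two years.
# availability = MTBF / (MTBF + MTTR)

HOURS_PER_YEAR = 24 * 365

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

def downtime(period_hours: float, mtbf_hours: float, mttr_hours: float) -> float:
    """Expected hours of downtime within the given period."""
    return period_hours * (1 - availability(mtbf_hours, mttr_hours))

two_years = 2 * HOURS_PER_YEAR

scenarios = {
    "MTBF 2 years, MTTR 8 hours":  (2 * HOURS_PER_YEAR, 8),
    "MTBF 7 days, MTTR 5 minutes": (7 * 24, 5 / 60),
}

for name, (mtbf, mttr) in scenarios.items():
    print(f"{name}: availability {availability(mtbf, mttr):.4%}, "
          f"~{downtime(two_years, mtbf, mttr):.1f}h down over 2 years")
```

Interestingly, the total downtime over two years is roughly similar in both scenarios (around eight to nine hours), but a series of five-minute interruptions does far less damage to real and perceived customer value than a full business day offline.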

So far, I haven’t touched on what happens when an incident occurs and some of the error budget is consumed. It’s important to remember that having an error budget doesn’t negate the need to assess each service degradation or outage. If anything, it is more important, because there is now an objective measure of what is acceptable and what is not, and we must strive to stay within those limits. As part of using error budgets, I believe every incident that consumes some of the error budget should be assessed in the same way as an incident would be if no error budget existed. This leads into aspects of the presentation by Andrew and how to ensure that incidents are a learning experience and benefit the entire organisation.

When reviewing an incident, many organisations will focus on the people and what individuals could have done to prevent it. If this focus is maintained, people will become protective and will seek ways to avoid reporting incidents. Instead of focussing on the people, an incident post-mortem needs to balance the technical causes with the ways people can be better equipped to counteract those risks.

In many cases an incident isn’t the result of a single failure. If you’ve watched as many episodes of Air Crash Investigations as I have, you will be familiar with most incidents being caused by a chain of events; the same is true of IT systems. A well-architected system will have resilience built in, it will be able to cope with isolated failures, and it will take a number of concurrent failures for an incident to have a significant impact. To minimise the prospect of concurrent failures, each isolated incident should be assessed in terms of impact, contributing factors and potential ways to prevent it from recurring. In the case of a significant incident, the same basic process and assessments will be used, but across a much wider section of the product, potentially involving systems that were not affected, to find what they had done differently to prevent impact or increase resilience.

When assessing an incident, focus should first be placed on supporting those who were impacted. Ensuring the health and well-being of the staff involved in resolving the incident will help them to be open about potential causes, and will also help them to be better prepared the next time an incident occurs.

During the assessment, Seek uses a technical staff member to facilitate the post-mortem and a product staff member to act as a scribe; this allows input from both the technical and customer perspectives in a process that is often tech-focussed. During the assessment it is important to acknowledge that the complexity of the system will always increase: adding resilience to a system adds additional layers of complexity and more locations for errors to appear, and adding features increases the amount of communication required between systems and deepens the dependency tree. As complexity increases, unintended feedback loops will appear. By acknowledging these factors and working to identify all the contributing factors in an incident, a holistic approach can be taken to find a solution. The knowledge gained from the incident can also be shared throughout the organisation so others are able to make allowances for the contributing factors and can assess whether the solutions found are applicable to their features.

So far, I’ve covered the benefits of error budgets and how to handle incidents and the consumption of error budgets, but I still need to touch on what to do when an SLO is at risk (the error budget is almost consumed) and what to do when an SLO is broken (the error budget is fully consumed).

When an error budget is at risk, the team will need to prioritise the resilience of the system over new feature development. This doesn’t mean that feature development should stop, but it should be reduced to allow time for improving the existing system. If the error budget has been consumed due to the unavailability of external services, then action must be taken to reduce the dependencies on those services; if it has been consumed due to issues within the system itself, then ways to increase its reliability should be found.

If the error budget is completely depleted, then the team must place even more importance on improving the resilience of the system. In this case it is likely that development of new features will cease for a short period of time (perhaps one or two sprints) to focus on getting back to an acceptable level of error.
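As a sketch of how a team might encode these two states, here is a hypothetical policy check; the 75% "at risk" threshold and the wording of the recommendations are my own invention, not something either speaker prescribed.

```python
# Hypothetical mapping from error budget consumption to a delivery stance.
# The 75% "at risk" threshold is an arbitrary illustrative choice.

def delivery_stance(budget_consumed: float, at_risk_threshold: float = 0.75) -> str:
    """Return a coarse recommendation based on how much of the error budget is gone."""
    if budget_consumed >= 1.0:
        return "budget exhausted: pause new features for a sprint or two, focus on resilience"
    if budget_consumed >= at_risk_threshold:
        return "budget at risk: reduce feature work, prioritise reliability improvements"
    return "within budget: continue normal feature development"

print(delivery_stance(0.50))  # within budget
print(delivery_stance(0.80))  # budget at risk
print(delivery_stance(1.10))  # budget exhausted
```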

By implementing error budgets with appropriate measures and limits, ensuring that incidents are assessed and learnings are both implemented and shared, acknowledging that external dependencies and internal functionality are fallible, reducing the MTTR and by ensuring focus is maintained on delivering value, it is possible to reduce stress on staff, increase system resilience and ultimately deliver better value to customers.