Introduction
Across engineering organisations, service levels are an important way to establish the expected reliability and consistency of services. In this document, we’ll cover some of the various terms in use, how to pick the right things to create service levels for, and how to appropriately set those levels.
We’ll start with a few terms, and then look into them in more detail:
- Service Level Indicator (SLI)
These are particular qualities of your service that you want to measure. For a REST service, you might be interested in latency and uptime; for a data storage service, you might be interested in durability; for a customer success function, you may be interested in ticket response time.
- Service Level Objective (SLO)
This is where you put numbers to your SLIs. For example, for a latency SLI, your objective may be to deliver 99% of requests in under 50ms. A customer success team may guarantee an initial response to a client within 1 hour during UK business hours.
- Service Level Agreement (SLA)
This is a commercial agreement between two entities. It normally defines the contractual obligations of a vendor to provide a service to its clients.
As a term, SLA gets used a lot, so we’ll address that one briefly first.
Service Level Agreement (SLA)
For lots of people, the term SLA is the one most often encountered. For technical services, you’ll often see it refer to the uptime of a service, such as 99.9% or 99.99%, and I expect that an uptime SLA quoting a number of nines is the type that people in engineering roles will be most familiar with.
SLAs establish the relationship between a vendor and an external customer - they frame the expectations a customer can have of the vendor’s service they are purchasing, which allows the customer to make business decisions based on those expectations. SLAs are only really useful when there is some kind of financial or contractual penalty for failing to meet them.
SLAs should not be used to establish relationships between services within a company - these are more effectively served by SLOs. The crucial difference here is that the internal service levels within the company must be stricter than those offered to a customer as an SLA. This allows an engineering team to address a breach in their internal service levels before the SLA with the client is breached. It should be self-evident that guarantees offered to a client as an SLA should also form part of the SLOs to which internal team(s) commit.
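As a minimal sketch of how that ordering plays out in practice (the 99.9%/99.95% thresholds and the classify helper are illustrative assumptions, not values from any particular tool):

```python
# Illustrative only: an internal SLO set tighter than the external SLA gives
# the team an internal breach signal before the contract is actually at risk.

EXTERNAL_SLA = 0.999    # availability promised to customers contractually
INTERNAL_SLO = 0.9995   # stricter target the engineering team commits to

def classify(measured_availability: float) -> str:
    """Compare a measured availability against both thresholds."""
    if measured_availability < EXTERNAL_SLA:
        return "SLA breached - contractual penalties may apply"
    if measured_availability < INTERNAL_SLO:
        return "internal SLO breached - act now, before the SLA is threatened"
    return "within internal SLO"

# A month at 99.93% is fine contractually, but the team should already be reacting.
print(classify(0.9993))
```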
Service Level Indicator (SLI)
SLIs describe the inherent qualities of a service, against which guarantees are being made. Some examples follow:
- For the extremely common situation where a service is provided over a RESTful web API, typical SLIs would include latency, error rate, uptime and throughput (see the measurement sketch after this list).
- A data archival system might provide guarantees around durability and retrieval time.
- A data pipeline service might choose to go with throughput and some measurement of data traversal time (how long it takes for a piece of data to move through the pipeline).
- A data collection team might use an SLI of update frequency to guarantee their data is always the latest available.
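As a concrete sketch of the first example above, here is one way two REST-style SLIs might be computed from raw request records (the field names and sample data are made up for illustration):

```python
# Hypothetical sketch: deriving an error-rate SLI and a p99 latency SLI from
# a batch of request records. Record layout is illustrative, not prescriptive.

from statistics import quantiles

requests = [
    {"status": 200, "latency_ms": 38},
    {"status": 200, "latency_ms": 45},
    {"status": 500, "latency_ms": 61},
    {"status": 200, "latency_ms": 41},
]

# Error rate: fraction of requests that returned a server error.
error_rate = sum(r["status"] >= 500 for r in requests) / len(requests)

# quantiles(..., n=100) returns the 1st..99th percentiles; the last is p99.
p99_latency = quantiles([r["latency_ms"] for r in requests], n=100)[98]

print(f"error rate: {error_rate:.2%}, p99 latency: {p99_latency:.1f} ms")
```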
Choosing your SLIs
There are some general rules to take into consideration when choosing your SLIs:
- Go with the qualities of your service that the customers (internal or external) care about. Users of a high-frequency trading platform are going to really care about latency, for example. Perhaps they are less concerned with the durability of historical data within the service.
- Don’t pick metrics which are easily encompassed by a broader SLI that is more directly relevant to customer experience. Making assurances around the response time of an internal database is not useful if the customer experience is defined by the broader metric of the overall latency of the system.
- Don’t pick too many metrics. If you’re coming from a position of having no SLIs at all, picking a core few will allow you to focus on the top priorities without becoming overloaded.
Service Level Objective (SLO)
SLOs are where guarantees are attached to the metrics that are being tracked with SLIs. Some examples:
- 99.99% uptime guarantee on a REST API
- 99.9999% durability of a piece of data in a data storage system
- Sanctions list updates within 60 minutes of being published
While these might seem OK on the surface, they are incredibly ambiguous. Let’s take the uptime guarantee above and go into a bit more detail:
- How do we define the “up” in uptime?
- 99.99% allows for roughly 53 minutes of downtime per year. If all of that happens in one chunk, uptime for that week is actually around 99.5%. Is this acceptable? (The sketch below works through the numbers.)
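The arithmetic behind that second point is worth making explicit. A minimal sketch, assuming a 365.25-day year:

```python
# How much downtime does a given uptime percentage allow per year, and what
# does burning that whole budget in a single incident do to one week's uptime?

MINUTES_PER_YEAR = 365.25 * 24 * 60
MINUTES_PER_WEEK = 7 * 24 * 60

for slo in (0.999, 0.9999):
    budget = (1 - slo) * MINUTES_PER_YEAR   # yearly downtime allowance
    worst_week = 1 - budget / MINUTES_PER_WEEK
    print(f"{slo:.2%}: {budget:.0f} min/year; one-chunk outage -> "
          f"{worst_week:.2%} uptime that week")
```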
We might decide that the following slightly more wordy statement gives a bit more clarity to this SLO:
99.99% uptime across the calendar month, where uptime means the platform is available to 90% of clients.
This can, of course, be further clarified ad nauseam, but the goal here is to set customer expectations with an appropriate level of detail. We do this so that the customer understands precisely what we expect to be able to deliver, and can build their own systems to take this into account. The point is to keep the customer happy, not to let a service provider argue that it was “technically” right when an SLA/SLO query follows a breach.
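To show how such a statement could be made measurable, here is one hypothetical interpretation: a minute counts as “up” when at least 90% of probed clients succeed, and monthly uptime is the fraction of up minutes. The probe data layout is an assumption, not a prescribed format:

```python
# Hypothetical measurement of the clarified SLO: per-minute, per-client probes.

def monthly_uptime(minutes: list[list[bool]]) -> float:
    """minutes[i] holds one success/failure probe result per client."""
    up = sum(1 for probes in minutes if sum(probes) / len(probes) >= 0.90)
    return up / len(minutes)

# Three sample minutes: all clients up, 2 of 3 up (counts as down), all up.
print(f"{monthly_uptime([[True] * 3, [True, True, False], [True] * 3]):.2%}")
```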
Google Cloud even has a team of CREs - Customer Reliability Engineers - who work to make sure that customers of their services build their own systems with Google Cloud’s SLAs taken into account. It is good practice to work with customers on this issue.
"We offer an SLA of 99.5% uptime in a calendar month. When our service to you goes offline, how are you working around that issue?"
Failing to engage with customers on the reality of your service levels, and on the impact those levels have when they are not taken into account in the design of a consuming service, leads to unhappy customers - whether internal or external!
Dialling in your SLOs (and SLAs)
It can be tempting to aim really high with your SLx commitments. But because you’re a responsible engineering team that takes these things seriously, how you approach building and providing your service will differ significantly depending on whether you attach a 99.99% uptime guarantee to it or a 99.9% one. That single extra nine is a 10x difference in allowed downtime!
Trying to reduce your downtime by 10x will increase the cost of your infrastructure. It will affect the role distribution in your team (more people focused on reliability than on building new features), and it should affect your hiring plans. It will affect your deployment velocity and how quickly you can iterate on the service. These are not trivial implications.
Picking your service level guarantees affects your team because you take your commitments seriously. Set your guarantees at the levels you need in order to deliver your service to the required standard, and no higher.
Conclusion
The point of adopting service levels is not to produce a service which never goes down. Such services don’t exist, and in trying to engineer one you inevitably create a brittle system. So instead of trying to push the MTBF (mean time between failures) ever higher, focus on driving down the MTTR (mean time to recovery), so that when your services do go down you are prepared, practised, and able to fix them quickly.
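The trade-off follows from the standard availability relationship, availability = MTBF / (MTBF + MTTR): cutting recovery time lifts availability even when failures are no rarer. A quick sketch with made-up numbers:

```python
# Availability = MTBF / (MTBF + MTTR). Same failure rate, faster recovery.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

print(f"{availability(720, 4.0):.4%}")   # roughly monthly failures, 4h recovery
print(f"{availability(720, 0.5):.4%}")   # same failure rate, 30-minute recovery
```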
Service levels are there to set expectations on some key qualities of the service, both within a team and to its internal or external customers. They should be picked carefully, taken seriously, and tracked, and they should shape the type of work being done within a team.
Author: Oliver Butterfield