Based on different interactions within the organisation on the subject of SLOs, I have come up with what I hope is a simple framework to structure thinking around them.
What is an SLO?
A Service Level Objective is a measurable reliability target for a service: a statement like "99.9% of requests succeed within 300ms, over a rolling 30-day window" that you are prepared to commit to and be measured against.
What this is
This framework is designed to help you get SLOs defined quickly within an organisation that is not yet used to using them to structure its approach to engineering. It's a higher-level framework and is only here to get you off to the races quickly, but my mantra has always been: "Minds change slow, computers change fast", so getting our minds made up quickly is more than half the battle already won.
It can be scary to think about signing your team up for commitments based on measurements, but doing so is absolutely critical to getting on top of incidents, improving our services and analysing what's really important.
I have broken this approach down into two-ish questions service owners can ask themselves, or the "2 Qs". The neat thing is that once SLOs are set up correctly, you should be able to work backwards from any other service owner's SLOs and infer their answers to these questions.
Q1
"What are we here to do?"
If I deleted your services from our estate right now, what would happen? What do you provide to other teams, customers and so on that means your service isn't redundant? Why are you building a new service, and what is it supposed to do once you're finished?
Examples include:
- "We provide the client-facing API for access to services."
- "We provide customers with a way to update the monitoring status of entities."
- "We provide other services with a way to retrieve structured data on entities in our graph database."
- "We tame data scraped from the web in a raw format and insert it into our graph database."
- "We provide the container runtime environment for all our services to run."
- "We provide a Kafka-compatible, production-ready distributed event store and stream processing platform."
- "We provide an Observability solution to all engineering teams in the company."
That's right - this includes the Observability team!
When considering this first question, do not worry so much about your downstream dependencies. Leave worrying about whether Kubernetes is working to the team responsible for running it.
“The chief task in life is simply this: to identify and separate matters so that I can say clearly to myself which are externals not under my control, and which have to do with the choices I actually control. Where then do I look for good and evil? Not to uncontrollable externals, but within myself to the choices that are my own…” — Epictetus, Discourses, 2.5.4–5
Concentrate on what you can control - leave the things you can't to someone else's SLOs!
Q2
"At what point are we letting others down?"
Based on your answer to the first question, you need to ask yourself "At what point will we be letting others down?". Now that you know what you are here to do at a high level, you can go into specifics about exactly what that means for customers or other teams.
If customers need your services to respond quickly, at what point is latency "too high"? If other teams need you to provide accurate data, at what point does a failure to deliver it, or data that is no longer fresh enough, start to impact other people or services?
Examples include:
- "We must respond to a customer request at least as fast as our customers are currently used to."
- "We must provide data within 30ms of it being processed."
- "We must update a Kafka topic within 24 hours of the customer requesting a monitor status change on an entity."
- "We must ensure at least 3 pods of this service are running at all times to provide enough capacity for the traffic loads we see."
- "We must ensure that the production Kafka endpoints our teams rely on do not go offline for longer than 4 seconds during our 3 scheduled maintenance windows."
While the specific numbers in the above examples are fabricated, the overall idea is to start to drill down. If you are lucky enough to be doing this exercise for an existing service, you can work backwards from your existing P1 alert thresholds, which encode institutional knowledge about your current expectations, and once SLOs are in place they can start to drive those thresholds in future.
If you're building something brand new, you might find it helpful to think about where you would set these thresholds and go from there.
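To make this concrete, here is a minimal sketch, in Python with entirely invented names and numbers, of how one of the answers above (the "data within 30ms" freshness example) could be expressed as a measurable SLI and compared against a target. It shows the shape of the idea rather than any real service's figures.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """A single objective: a target ratio of good events over a rolling window."""
    name: str
    target: float       # e.g. 0.999 means 99.9% of events must be "good"
    window_days: int    # rolling evaluation window


def sli(good_events: int, total_events: int) -> float:
    """An SLI is just the ratio of good events to all events."""
    return good_events / total_events if total_events else 1.0


# Invented numbers for the "provide data within 30ms" example above:
# a "good" event is a record delivered within 30ms of being processed.
freshness = Slo(name="entity-data-freshness-30ms", target=0.999, window_days=30)

good, total = 9_985_000, 10_000_000
current = sli(good, total)

print(f"SLI {current:.4%} vs target {freshness.target:.4%}")
if current < freshness.target:
    print("We are letting others down: time to look at where the budget went.")
else:
    print("Within objective.")
```

The same "good events over total events" pattern works just as well for latency, availability, or capacity-style answers.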
Bonus Question
"Which other Service Owners do I need to speak to?"
Well done! If you've made it this far, you now know how to think about SLOs for your own services, and where you want to tweak the specifics before presenting your answers to the wider engineering team and committing to their delivery. You may already be prepped to configure things in our alert systems and get incident management set up properly.
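As a hedged aside on the alerting piece: a common way to let SLOs drive alert thresholds is burn-rate alerting, paging when the current error rate would consume a given slice of the error budget too quickly. The sketch below is a generic illustration of that arithmetic with invented policy numbers (2% of a 30-day budget in one hour, a figure often quoted in SRE guidance), not a description of our alerting configuration.

```python
def burn_rate_error_threshold(slo_target: float, alert_window_hours: float,
                              budget_fraction: float,
                              slo_window_hours: float = 30 * 24) -> float:
    """Error rate at which `budget_fraction` of the whole error budget
    would be consumed within `alert_window_hours`."""
    error_budget = 1.0 - slo_target                                   # e.g. 0.1% of requests
    burn_rate = (budget_fraction * slo_window_hours) / alert_window_hours
    return burn_rate * error_budget


# Invented policy: page if we'd burn 2% of a 30-day budget inside one hour.
threshold = burn_rate_error_threshold(slo_target=0.999,
                                      alert_window_hours=1,
                                      budget_fraction=0.02)
print(f"Page when the 1-hour error rate exceeds {threshold:.2%}")    # ~1.44%
```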
However, it's highly likely that your SLOs won't be written in a vacuum, and your services certainly don't exist in one when they're providing whatever your answer to question 1 was. So the bonus third question is about speaking to the teams who run the services you depend on, to see whether their answers to the above can help you tweak your own.
This shouldn't delay your own work on answering the "2 Qs", but it can serve as a final checkpoint to ensure you're set up for success. You may even need to go to your Director from here to clarify things together with your upstream/downstream service owners, and you can use your answers to the first two questions to explain clearly where the situation is ambiguous.
Wrapping Up...
If you're struggling with SLOs, don't!
SLOs are there to codify a lot of things we already do, or would expect to do as a growing company that runs services at scale. They're not there to hold you to unrealistic standards but to help clearly highlight where we have room to breathe and where we really need to focus our efforts to gain some additional room.
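One way to picture that "room to breathe" is as an error budget: the amount of unreliability your SLO target still allows before you're letting anyone down. A tiny illustrative calculation, again with invented numbers:

```python
def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget left in the current window.
    1.0 means an untouched budget (room to breathe);
    0.0 or below means the budget is spent (focus efforts here)."""
    allowed_bad = (1.0 - slo_target) * total
    actual_bad = total - good
    return 1.0 - (actual_bad / allowed_bad) if allowed_bad else 0.0


# Invented month of traffic measured against a 99.9% objective.
print(f"{error_budget_remaining(0.999, good=9_993_000, total=10_000_000):.0%} of the budget left")
```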
Author: Adam Wilson