I am a Site Reliability Engineer (SRE) at Comply Advantage (CA). In November 2022, my fellow SREs and I attended a Google-hosted conference on a principled approach to SRE. The event provided a wealth of knowledge about why SRE teams undertake the work we do, namely maintaining and improving reliability, and why reliability matters so much to our customers. In this blog, I will cover some of the key insights we came away with.



Reliability is your most important product feature


A statement repeated throughout the conference was that reliability is the most important feature of your product. Though this is an incendiary and somewhat myopic view (most tech companies are built around a symbiotic relationship between Sales, Product, and Tech), there is value in the statement. No matter how quickly we roll out new features to our customers (our feature velocity), there is little value gained unless the service we provide is available; in other words, unless it has high reliability.

Reliability from a customer perspective


Reliability is a cornerstone of any service. However, what does it actually mean to our customers? Would they care about every metric we gather? No. They care about the characteristics of the service that matter to them. Therefore, to offer customers the most value, we should focus on customer-driven metrics. Monitoring such metrics gives SRE and engineering teams a clear way to capture the value of the work they undertake, because it demonstrates the benefit delivered directly to the customer. That clarity is a great tool for achieving buy-in from product owners and leadership.
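
To make this concrete, here is a minimal sketch of one customer-driven metric: the availability of a hypothetical endpoint, measured as the proportion of customer requests served successfully. The names and numbers are illustrative only, not taken from our actual monitoring stack.

    from dataclasses import dataclass

    @dataclass
    class RequestWindow:
        total_requests: int       # all requests customers sent in the window
        successful_requests: int  # requests served without error, within the latency target

    def availability_sli(window: RequestWindow) -> float:
        """Fraction of customer requests that were 'good' (0.0 to 1.0)."""
        if window.total_requests == 0:
            return 1.0  # no traffic means no customer-visible failures
        return window.successful_requests / window.total_requests

    # Example: 1,000,000 requests, 999,400 of them good -> 99.94% availability
    print(f"{availability_sli(RequestWindow(1_000_000, 999_400)):.2%}")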

The redundancy of production freezes


Though this is a common practice in many organisations (including, at one time, our own), it is actually considered a bad practice by Google, as it encourages the following behaviours:

  • A rush to get releases out before the freeze, bringing forward deadlines.
  • A bottleneck of releases immediately after the freeze is lifted.

The biggest risk to a system's reliability is change, and these behaviours work in tandem to compress a large amount of change into a small time frame, inviting substantial risk to reliability. Moreover, implementing a production freeze at all demonstrates a lack of confidence in your CI/CD processes and your ability to remediate. If the CI/CD and remediation automation is indeed lacking, the risk to the system is exacerbated further when the freeze is lifted.

To address this, Google suggested reserving capacity to undertake non-feature work alongside feature work, using concepts such as error budgets to maintain a healthy tension between the two. By undertaking this work, you build confidence over time in your CI/CD and automated remediation, and that confidence removes the need for production freezes. Consequently, in the longer term, the initial capacity cost of carrying out non-feature work is repaid by no longer having periods when features, and therefore value, cannot be delivered to customers at all.
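
Google did not prescribe a particular implementation, but as a rough sketch of the error-budget idea (assuming a simple request-based SLO; the figures and function below are illustrative only), the budget is just the number of failures the SLO still permits:

    def error_budget(slo_target: float, total_requests: int, failed_requests: int):
        """Return (allowed_failures, fraction_of_budget_consumed) for the SLO window."""
        allowed_failures = total_requests * (1 - slo_target)  # the error budget
        consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
        return allowed_failures, consumed

    # Example: a 99.9% SLO over 10,000,000 requests allows 10,000 failed requests.
    allowed, consumed = error_budget(0.999, 10_000_000, 4_000)
    print(f"budget: {allowed:.0f} failures, consumed: {consumed:.0%}")

While the budget has headroom, capacity can go to features; once it is spent, the remaining capacity shifts to reliability work. This is the healthy tension referred to above.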

A blameless culture is a good culture


When incidents occur, individuals should not be blamed. The question is not who caused the incident, but what circumstances led to it. This gives everyone involved the psychological safety needed to undertake a Root Cause Analysis (RCA) faster and more comprehensively, which in turn greatly reduces the risk of the same incident occurring again.



How the above insights have been applied here at CA


In the time since the conference, CA's Tech function has made great strides in implementing many of the best practices mentioned previously in this blog.

We have completely refreshed our already strong post-mortem process, doubling down on our use of incident.io and adopting more of its AI and automation features to enrich our findings and reduce toil during and after incidents.

We have also completely overhauled our incident criteria; we now base them on customer impact, using Service-Level Indicators (SLIs) and Service-Level Objectives (SLOs). Over the last few months we have introduced automated alerting and incident creation based on these constructs, which has greatly improved our awareness, response, and prevention of potential outages.
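
A simplified sketch of what such alerting looks like is below: compare how fast the error budget is being consumed (the burn rate) against a threshold, and raise an incident when it is burning too quickly. The threshold value and the final print statement stand in for real alerting rules and incident tooling rather than our actual configuration.

    def burn_rate(observed_error_rate: float, slo_target: float) -> float:
        """How many times faster than allowed the error budget is being consumed."""
        allowed_error_rate = 1 - slo_target
        return observed_error_rate / allowed_error_rate

    def should_page(observed_error_rate: float, slo_target: float, threshold: float = 14.4) -> bool:
        # 14.4x is a commonly cited fast-burn threshold (2% of a 30-day budget in one hour);
        # the value used in practice depends on the window and the service.
        return burn_rate(observed_error_rate, slo_target) >= threshold

    if should_page(observed_error_rate=0.02, slo_target=0.999):
        print("Burn rate exceeded threshold: page on-call and open an incident")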

Lastly, we are looking to incorporate error budgets across our Tech function in the near future, to further align with a principled approach to reliability across our portfolio of services.

Conclusion


The conference was a great learning experience that better informed my day-to-day role; I learnt why the work I do matters, and it broadened my perspective to take our customers' needs into account.

Moreover, the experience demonstrated that CA was already in a strong position regarding reliability, and that we already followed several of the practices Google considers best practice. I arrived at this conclusion by observing the following here at CA:

1. We already had a strong post-mortem process
2. We already exhibited and encouraged a blameless culture
3. We already had strong leadership buy-in for non-feature work
4. We already had a great collection of individuals throughout CA keen to always better themselves and the product

We have maintained and built upon the above since the conference. However, we knew we could do more; driven by our core values, one of which is kaizen (the drive for continuous improvement), we have aligned ourselves further with a principled approach to reliability, as highlighted in the previous section.

In closing, although reliability will always be a journey, and we at CA will always strive to improve, we are well-positioned to handle any challenges we may face now and in the future on the road to our goal: five nines (99.999%) uptime, which leaves roughly five minutes of downtime per year.

Author: Chris Todd