What can Security teams learn from Site Reliability Engineering and Service Level Objectives?

I was fortunate enough in a previous role to work with some excellent engineers focused on Site Reliability Engineering (SRE). Along the way I became familiar with the relationship between Service Level Agreements (SLAs), Service Level Indicators (SLIs), and Service Level Objectives (SLOs), and with the benefits it brings.

SLA - A contractual agreement between a business and its customers. If the business falls short of this agreement, there could be financial consequences.
SLI - A metric that can be captured to determine how a service is performing. Common examples are error rates, response time, and uptime.
SLO - A target value for an SLI. SLOs can be used as the basis for SLAs, and they rely on SLIs to track and monitor performance against the target.

SLOs provide engineers with an Error Budget: the amount of unreliability the target allows. By monitoring SLOs, engineering teams know when improving the reliability of their systems needs to become the priority.

Example

We’ll use Amazon S3 as an example, and imagine what this might look like within the AWS team.

SLA - Guarantee of 99.9% uptime each month
SLI - Uptime metrics for the service
SLO - The uptime target that the uptime metrics are tracked and measured against
Error budget - Approximately 40 minutes of downtime each month (0.1% of a 30-day month is 43.2 minutes)

The SLO in this example can be used for monitoring and alerting. When Amazon S3 hits 20 minutes of downtime for the month, roughly half of the error budget, an AWS engineer might look at what’s causing the degraded service and choose to take action. It’s also a good indicator of the health of a system over time, and of whether downtime is at an acceptable level. For a customer using Amazon S3, the agreement is that it may be down for around 40 minutes a month.

It’s worth pointing out that I know nothing of the inner workings of AWS. I’ve simply used S3 to illustrate how SLOs might work in practice.
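
To make the arithmetic concrete, here’s a minimal Python sketch of the error budget calculation for this example. It assumes a 30-day month, and the function names are my own illustrative choices rather than anything AWS uses.

```python
def error_budget_minutes(slo_target: float, days_in_period: int = 30) -> float:
    """Minutes of downtime allowed by an availability target such as 0.999."""
    total_minutes = days_in_period * 24 * 60
    return total_minutes * (1 - slo_target)


def budget_consumed(downtime_minutes: float, slo_target: float,
                    days_in_period: int = 30) -> float:
    """Fraction of the error budget already spent (1.0 means fully spent)."""
    return downtime_minutes / error_budget_minutes(slo_target, days_in_period)


# 99.9% uptime over an assumed 30-day month
budget = error_budget_minutes(0.999)
used = budget_consumed(20, 0.999)
print(f"Error budget: {budget:.1f} minutes")
print(f"Budget spent after 20 minutes of downtime: {used:.0%}")
```

Running the same functions with a different target, such as 0.99, shows how quickly the budget grows or shrinks as the target changes.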

What can we take from this approach?

We don’t need to reinvent the wheel. We can simply leverage the great work SRE has done with SLOs. Here’s how we might slightly extend the scope and start using them for security:

SLA - A contractual agreement between a business and its customers. If the business falls short of this agreement, there could be financial consequences.
SLI - Service performance aligns with availability in the CIA triad, but we can go further and measure security vulnerabilities, misconfigurations, network exposure, etc.
SLO - A target based on risk tolerance and regulatory requirements. It can be used as the basis for SLAs, and it relies on SLIs to track and monitor performance against the target.

As you can see, it’s not too dissimilar to the original descriptions I provided for SLAs, SLIs, and SLOs. The SLA is exactly the same as before.

Example

Let’s envision a team within a business that owns the checkout process on an e-commerce website. There are strict regulatory requirements due to payment processing, and customers also need to be able to make payments securely.

SLA - Guarantee of 0 critical vulnerabilities 99% of the time each month
SLI - Vulnerability metrics from our CNAPP for the checkout application
SLO - The critical vulnerability target for the checkout application, which the vulnerability metrics are tracked and measured against
Error budget - Approximately 7 hours each month with an open critical vulnerability (1% of a 30-day month is 7.2 hours), so critical vulnerabilities must be remediated within that time
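
The same idea can be sketched in Python for the checkout example. This is an illustration under assumptions: a 30-day month, made-up timestamps, and hypothetical function names. It doesn’t call any real CNAPP API. It only assumes that for each critical vulnerability you can export when it was detected and when it was remediated.

```python
from datetime import datetime, timedelta


def exposure_time(windows):
    """Total time with at least one critical vulnerability open.

    `windows` is a list of (detected, remediated) datetime pairs.
    Overlapping windows are merged so shared time is counted once.
    """
    total = timedelta()
    current_start = current_end = None
    for start, end in sorted(windows):
        if current_end is None or start > current_end:
            # Close the previous merged window before starting a new one
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def security_error_budget(slo_target=0.99, days_in_period=30):
    """Time per period we can tolerate having a critical vulnerability open."""
    return timedelta(days=days_in_period) * (1 - slo_target)


# Made-up data: two overlapping findings and one separate finding
windows = [
    (datetime(2024, 3, 3, 9, 0), datetime(2024, 3, 3, 12, 0)),
    (datetime(2024, 3, 3, 11, 0), datetime(2024, 3, 3, 13, 0)),
    (datetime(2024, 3, 17, 22, 0), datetime(2024, 3, 18, 0, 30)),
]

exposed = exposure_time(windows)
budget = security_error_budget()
print(f"Exposed: {exposed}, budget: {budget}, within budget: {exposed <= budget}")
```

Merging overlapping windows matters here because the target is about time with zero critical vulnerabilities. Two vulnerabilities open during the same hour spend one hour of the budget, not two.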

Conclusion

SLOs feel like the perfect tool for measuring the security posture of our applications, while taking into account the various contexts and risk profiles of different services and applications. We don’t need a high-level security score that covers everything, and we don’t need to apply the same rules across all applications. We can create SLOs for the smallest applications or for our biggest systems, and engineering teams can manage and track their own risk appetite and codify their risk tolerance through error budgets.
