SLI and SLO - Overview

Introduction

Site reliability engineering (SRE) was introduced as a method to maintain and enhance the reliability of software systems. SREs are responsible for monitoring the reliability of these systems, identifying issues, and resolving them to ensure that customers receive a high-quality product that performs as intended.

To measure reliability, SREs use service-level objectives (SLOs), which establish benchmarks for evaluating whether a system meets its performance standards.

This approach helps project managers and SREs manage error budgets, assess the impacts of any breaches in SLOs, and shape the service-level agreements (SLAs) that are promised to customers.

What is an SLI?

Service-level indicators (SLIs) are quantifiable metrics used to assess the quality of a software service. By monitoring SLIs over time and comparing them against SLOs, organizations gain insight into their actual performance. These indicators are essential for tracking the organization's progress and verifying that both SLOs and SLAs are met, which helps maintain reliability and customer satisfaction.

What is an SLO?

An SLO is an internal goal that a company sets to ensure the reliability of its software. It is the primary component of an SLA. SLOs are tracked by monitoring the service-level indicators (SLIs), the KPIs defined by the company, over a specific period.

Why SLIs and SLOs?

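SLAs are the commitments an organization makes to its customers. To meet its SLAs, the organization must first achieve its SLOs, and it can only tell whether an SLO is being achieved by measuring the underlying SLIs. Typically, SLOs are set stricter than the SLAs, so that an internal target is missed, and acted on, before a customer commitment is broken.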
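
To make the gap between an SLO and an SLA concrete, the sketch below converts two availability targets into allowed downtime over a 30-day window. The 99.5% SLA figure, the 30-day window, and the function name are assumptions chosen for illustration; they are not taken from any Site24x7 setting.

```python
def allowed_downtime_minutes(target_pct, window_days=30):
    """Minutes of downtime a given availability target tolerates in the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - target_pct / 100)

# Hypothetical targets: a customer-facing SLA and a stricter internal SLO.
sla_minutes = allowed_downtime_minutes(99.5)  # roughly 216 minutes
slo_minutes = allowed_downtime_minutes(99.9)  # roughly 43 minutes

# Safety margin between missing the internal goal and breaching the SLA.
margin_minutes = sla_minutes - slo_minutes

print(f"SLA allows {sla_minutes:.1f} min, SLO allows {slo_minutes:.1f} min, "
      f"margin {margin_minutes:.1f} min")
```

Under these assumed numbers, a team that misses the SLO still has close to three hours of margin before the SLA itself is at risk.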

Use case

Let's look at a real-world example to see why monitoring SLOs is necessary.  

Scenario: E-learning platform availability

Zylker is an e-learning platform that offers a variety of videos and online sessions to students, its primary users.

To meet the organization's objectives, Zylker must first set the SLIs, which are the metrics (i.e., KPIs) necessary for measuring the system's performance. The SLIs are defined below:

SLIs:

  • Availability: Percentage of video start attempts that succeed.
  • Latency: Time the video takes to load after the user initiates play.
  • Buffering: Percentage of total playback time spent buffering.
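
As a rough illustration of how these three SLIs could be derived from raw playback data, here is a minimal sketch. The `PlaybackSession` record, its field names, and the sample values are all hypothetical; a real monitoring tool collects and aggregates this data in its own way.

```python
from dataclasses import dataclass

@dataclass
class PlaybackSession:
    """One attempted video playback (hypothetical record shape)."""
    started_ok: bool          # did the video start without an error?
    load_seconds: float       # time from pressing play to the video loading
    playback_seconds: float   # total playback time, buffering included
    buffering_seconds: float  # part of the playback time spent buffering

def compute_slis(sessions):
    """Return availability (%), average load time (s), and buffering ratio (%)."""
    if not sessions:
        raise ValueError("at least one session is required")
    started = [s for s in sessions if s.started_ok]
    availability = 100.0 * len(started) / len(sessions)
    # Latency and buffering are only measured for sessions that actually started.
    avg_load = (sum(s.load_seconds for s in started) / len(started)
                if started else float("nan"))
    total_playback = sum(s.playback_seconds for s in started)
    buffering = (100.0 * sum(s.buffering_seconds for s in started) / total_playback
                 if total_playback else 0.0)
    return {
        "availability_pct": availability,
        "avg_load_seconds": avg_load,
        "buffering_pct": buffering,
    }

# Made-up sample data: three successful starts and one failed start.
sessions = [
    PlaybackSession(True, 2.0, 600.0, 3.0),
    PlaybackSession(True, 4.0, 400.0, 2.0),
    PlaybackSession(False, 0.0, 0.0, 0.0),
    PlaybackSession(True, 3.0, 1000.0, 5.0),
]
slis = compute_slis(sessions)
print(slis)
```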

The SLOs are defined based on the SLIs.

SLOs:

  • Availability: Ensure 99.9% of videos start successfully without errors.
  • Latency: Maintain an average video load time of under five seconds.
  • Buffering: Keep the buffering ratio below 1% of total playback time.
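
The SLO targets listed above can be expressed as simple comparisons against measured SLI values. The sketch below is one possible way to do that; the dictionary layout, the key names, and the measured values are assumptions for illustration only.

```python
import operator

# Targets from the SLO list: at least 99.9% availability, an average load
# time under five seconds, and a buffering ratio below 1%.
SLO_TARGETS = {
    "availability_pct": (operator.ge, 99.9),
    "avg_load_seconds": (operator.lt, 5.0),
    "buffering_pct": (operator.lt, 1.0),
}

def evaluate_slos(measured):
    """Return a dict mapping each SLI name to True (SLO met) or False (breached)."""
    return {
        name: compare(measured[name], target)
        for name, (compare, target) in SLO_TARGETS.items()
    }

# Hypothetical measurements for one evaluation period.
measured = {"availability_pct": 99.95, "avg_load_seconds": 5.4, "buffering_pct": 0.6}
results = evaluate_slos(measured)

for name, met in results.items():
    print(f"{name}: {'met' if met else 'breached'}")
```

With these made-up measurements, the availability and buffering objectives are met while the latency objective is breached, because the average load time exceeds five seconds.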

How do the SLOs work?

  • Monitor the platform and measure user metrics to track SLIs over a specific period. For this scenario, videos must start successfully, begin playing within five seconds, and spend less than 1% of playback time buffering.
  • Evaluate any issues by identifying outages and network failures. This allows Zylker to pinpoint flaws and implement necessary solutions.
  • Analyze the results of the SLIs and suggest adjustments along with faster recovery plans in the event of outages. This proactive approach helps the e-learning platform avoid future disruptions.

When Zylker achieves the SLOs, the SLA with the students is also honored. To ensure this process is managed effectively, Site24x7 provides a reliable tool for monitoring real-world SLIs, which aids in reaching the SLOs and fulfilling commitments to customers. The error budget and burn rate can be calculated using actual data, allowing Zylker to determine whether services have met or breached the SLOs. By continuously monitoring the SLOs, the SREs can take proactive measures to maintain service quality.
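
The error budget and burn rate mentioned above follow from simple arithmetic. The sketch below uses the commonly cited definitions (error budget as the share of events allowed to fail, burn rate as the observed error rate divided by the allowed error rate); the event counts are made up, and the exact formulas a given monitoring product applies may differ.

```python
def error_budget(slo_target_pct, total_events):
    """Number of failed events the SLO tolerates in the evaluation window."""
    return (1 - slo_target_pct / 100) * total_events

def burn_rate(slo_target_pct, failed_events, total_events):
    """Observed error rate divided by the error rate the SLO allows.

    A value above 1 means the budget is being spent faster than the SLO allows.
    """
    allowed_error_rate = 1 - slo_target_pct / 100
    observed_error_rate = failed_events / total_events
    return observed_error_rate / allowed_error_rate

# Hypothetical figures for Zylker's 99.9% availability SLO.
total_starts = 200_000
failed_starts = 150

budget = error_budget(99.9, total_starts)            # about 200 failed starts allowed
consumed_pct = 100 * failed_starts / budget          # about 75% of the budget used
rate = burn_rate(99.9, failed_starts, total_starts)  # about 0.75

print(f"budget: {budget:.0f} failures, consumed: {consumed_pct:.0f}%, burn rate: {rate:.2f}")
```

In this example the burn rate stays below 1, so the SLO holds for the window. A sustained burn rate above 1 would exhaust the budget before the window ends and signal a breach.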

Benefits of SLOs

SLOs are essential for several reasons. They:

  • Identify underperforming areas to help SREs prioritize critical issues and ensure a seamless user experience.
  • Establish shared performance expectations for informed decision-making.
  • Serve as benchmarks for tracking progress and refining SLO parameters.
  • Define the acceptable margin for errors within established SLOs.
  • Monitor error budget consumption to prevent SLO breaches.
  • Leverage Zia’s insights to proactively update and enhance SLIs.

SLO index

Explore key SLO concepts, learn how to set up and monitor SLOs, analyze performance metrics, and generate reports to ensure service reliability using the resources below.

Help page | How it helps
Adding an SLO | Learn how to create and configure an SLO, define objectives, and set up monitoring for service reliability.
Performance metrics of an SLO | Analyze real-time and historical SLO performance, track trends, and assess compliance with set targets.
Threshold and availability | Define performance thresholds, measure availability, and ensure services meet reliability expectations.
SLO reports | Generate detailed SLO reports, interpret key metrics, and track compliance over time.
SLO dashboard | Get an overview of all your SLOs in a single dashboard with graphical insights and real-time monitoring.
Understanding SLO concepts | Understand essential SLO-related terms, including SLI, error budget, burn rate, and compliance percentage.
