
Essential Concepts Every Site Reliability Engineer Should Know, Explained with Simple Examples

Updated: Nov 27, 2025

Site Reliability Engineering (SRE) blends software engineering and systems administration to build scalable, reliable software systems. If you are stepping into this field or want to strengthen your foundation, understanding key concepts such as throughput, latency, SLA, SLO, and SLI is essential. These terms guide how you measure, maintain, and improve system reliability.


This post breaks down these concepts with clear definitions and practical examples to help you grasp their importance and application in your daily work.




Throughput: How Much Work Gets Done

Throughput measures the amount of work a system completes in a given time. For example, in a web service, throughput could be the number of requests processed per second.


Example:

Imagine a ticket booking website. If it processes 500 bookings every minute, its throughput is 500 bookings/minute. If throughput drops below the rate at which bookings arrive, requests start to queue, and users might experience delays or failures when booking tickets.


Throughput helps you understand system capacity and whether it meets demand.
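
To make the booking example concrete, here is a minimal sketch of computing throughput from completion timestamps. The function name, the window arguments, and the sample data are assumptions made for illustration, not part of any particular monitoring tool.

```python
from datetime import datetime, timedelta

def throughput_per_minute(completed_at, window_start, window_end):
    """Return units of work completed per minute within [window_start, window_end)."""
    window_minutes = (window_end - window_start).total_seconds() / 60
    if window_minutes <= 0:
        raise ValueError("window_end must be after window_start")
    # Count only the completions that fall inside the window.
    in_window = sum(1 for t in completed_at if window_start <= t < window_end)
    return in_window / window_minutes

# Hypothetical data: 1,000 bookings spread evenly over two minutes.
start = datetime(2025, 1, 1, 12, 0, 0)
bookings = [start + timedelta(milliseconds=120 * i) for i in range(1000)]
rate = throughput_per_minute(bookings, start, start + timedelta(minutes=2))
print(f"{rate:.0f} bookings/minute")
```

Comparing this rate with the rate at which requests arrive is one simple way to see whether capacity keeps up with demand.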



Latency: How Fast Work Gets Done

Latency is the time it takes for a system to respond to a request. It measures the delay from the moment a user sends a request until they receive a response.


Example:

If a user clicks “search” on an e-commerce site and the results appear in 200 milliseconds, that 200 ms is the latency. Lower latency means faster responses and better user experience.


High latency can frustrate users and indicate bottlenecks in your system.
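
As a rough illustration, latency can be measured by timing a call and then summarized across many requests. The sketch below uses a nearest-rank percentile. The percentile summary, the helper names, and the sample values are assumptions added here, not something the definitions above prescribe.

```python
import math
import time

def timed_call(func, *args, **kwargs):
    """Run func and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds for ten search requests.
latencies_ms = [120, 150, 180, 200, 210, 230, 250, 280, 300, 900]
p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
print(f"p50={p50} ms, p90={p90} ms")
```

A single slow request (900 ms here) barely moves the median, which is why high percentiles are often tracked alongside it.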


SLA: Service Level Agreement

An SLA is a formal agreement between a service provider and its users that defines expected service performance and availability.


Example:

A cloud storage provider might promise 99.9% uptime per month. This means the service can be down for no more than about 43 minutes per month (0.1% of a 30-day month is 43.2 minutes). If the provider fails to meet this, they may owe compensation or face penalties.


SLAs set clear expectations and legal commitments.
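
The downtime allowance behind an uptime percentage is simple arithmetic. The sketch below assumes a 30-day month, and the function name is made up for illustration.

```python
def allowed_downtime_minutes(availability_pct, period_days=30):
    """Minutes of downtime permitted by an availability target over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_pct) / 100

# Compare a few common targets over an assumed 30-day month.
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows {minutes:.1f} minutes of downtime")
```

For 99.9% this gives roughly 43.2 minutes, and each additional nine cuts the allowance by a factor of ten.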



SLO: Service Level Objective

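An SLO is a specific, measurable target for one aspect of a service, such as availability or latency. Teams usually set SLOs at least as strict as the SLA they support, so that a problem shows up internally before the contractual commitment is breached.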


Example:

Alongside the 99.9% uptime SLA, an SLO might specify that 99.95% of requests should complete in under 300 ms. This gives the engineering team a measurable goal to maintain.


SLOs guide daily operations and improvements.



SLI: Service Level Indicator

SLIs are the actual measurements used to evaluate whether the service meets its SLOs.


Example:

If your SLO is 99.95% of requests under 300 ms latency, the SLI is the percentage of requests that actually meet this latency during a monitoring period.


SLIs provide data to track service health and identify issues.
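
A minimal sketch of the latency SLI from the example, assuming you already have one latency value per request for the monitoring period. The function names and the sample counts are hypothetical.

```python
def latency_sli(latencies_ms, threshold_ms):
    """Percentage of requests that completed in under threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests in the monitoring period")
    good = sum(1 for value in latencies_ms if value < threshold_ms)
    return 100 * good / len(latencies_ms)

def meets_slo(sli_pct, slo_target_pct):
    """True when the measured SLI is at or above the SLO target."""
    return sli_pct >= slo_target_pct

# Hypothetical period: 9,996 fast requests and 4 slow ones.
latencies = [120] * 9996 + [450] * 4
sli = latency_sli(latencies, threshold_ms=300)
print(f"SLI = {sli:.2f}%, SLO met: {meets_slo(sli, 99.95)}")
```

With two more slow requests in the same period, the SLI would fall below 99.95% and the SLO would be missed.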



How These Concepts Work Together

Think of SLAs as the contract, SLOs as the goals, and SLIs as the measurements. Throughput and latency are key metrics that feed into SLIs.


For example, a video streaming service might have:


  • SLA: 99.9% uptime monthly

  • SLO: 99.9% of video streams start within 2 seconds

  • SLI: Percentage of streams starting within 2 seconds measured every day

  • Throughput: Number of streams served per minute

  • Latency: Time to start streaming after user clicks play


By monitoring SLIs built on latency and throughput, the team can tell whether it is meeting its SLOs and, in turn, upholding the SLA.


Other Important Terms Every SRE Should Know


Error Budget

An error budget is the amount of downtime or failure that a reliability target (an SLO, or the SLA it supports) still permits: 100% minus the target. It balances reliability with innovation by letting teams know how much risk they can take.


Example:

If your SLA allows 0.1% downtime, your error budget is that 0.1%. If you use up your error budget quickly, you focus on fixing issues rather than releasing new features.
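
One way to track how much of the budget is left is sketched below. It counts the budget in failed requests rather than minutes of downtime, which is an assumption made for this example, and the function name and figures are hypothetical.

```python
def error_budget_remaining(target_pct, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_failures = total_requests * (100 - target_pct) / 100
    if allowed_failures <= 0:
        raise ValueError("a 100% target or an empty period leaves no budget to measure")
    return 1 - failed_requests / allowed_failures

# Hypothetical month: 1,000,000 requests against a 99.9% target,
# so about 1,000 failures are allowed; 250 have occurred so far.
remaining = error_budget_remaining(99.9, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget remains")
```

A value near zero, or below it, is the signal to slow releases and spend time on reliability work.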


Monitoring and Alerting


Monitoring collects SLIs and other metrics continuously. Alerting notifies the team when metrics cross thresholds, signaling potential problems.


Example:

If latency exceeds 300 ms for more than 5 minutes, an alert triggers so engineers can investigate.
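
The rule in this example can be sketched as a check for a sustained breach, so that a single slow sample does not page anyone. This is an illustrative assumption about how such a rule might be evaluated, not the behavior of any specific alerting product. The sample format, a list of (timestamp in seconds, latency in ms) pairs sorted by time, is also assumed.

```python
def sustained_breach(samples, threshold_ms=300, min_duration_s=300):
    """Return True if latency stayed above threshold_ms for more than min_duration_s.

    samples: list of (timestamp_seconds, latency_ms) tuples, sorted by timestamp.
    """
    breach_start = None
    for timestamp, latency_ms in samples:
        if latency_ms > threshold_ms:
            if breach_start is None:
                breach_start = timestamp  # first sample of a new breach
            if timestamp - breach_start > min_duration_s:
                return True
        else:
            breach_start = None  # latency recovered, reset the timer
    return False

# Hypothetical one-minute samples: seven consecutive slow readings.
slow_run = [(60 * i, 350) for i in range(7)]
print(sustained_breach(slow_run))
```

Requiring the breach to persist trades a few minutes of detection delay for fewer false alarms.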


Incident Management


Incident management is the process of responding to outages or degradations and resolving them quickly to restore service.


Example:

When a database crashes, the incident response team follows a playbook to fix it and communicate status updates.



Practical Example: Applying These Concepts

Imagine you manage an online food delivery app. Your SLA promises 99.9% uptime. You set an SLO that 99% of orders should be confirmed within 5 seconds. You measure SLIs like order confirmation latency and throughput (orders processed per minute).


If monitoring shows latency creeping above 5 seconds, you get alerts. You check throughput to see if the system is overloaded. If the error budget is nearly used up, you pause new feature releases and focus on fixing the latency issue.
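
The reasoning above can be sketched as a small triage function. Everything here is a hypothetical illustration: the capacity figure, the 10% cutoff for "nearly used up", and the action strings are assumptions, and a real team would tune them to its own system.

```python
def triage(p99_confirm_s, orders_per_min, capacity_per_min, budget_remaining):
    """Suggest next steps for the food delivery example.

    p99_confirm_s: 99th percentile order confirmation time in seconds.
    budget_remaining: fraction of the error budget left, from 0.0 to 1.0.
    """
    actions = []
    if p99_confirm_s > 5:  # the SLO: 99% of orders confirmed within 5 seconds
        actions.append("alert: order confirmation latency above 5 s")
        if orders_per_min >= capacity_per_min:
            actions.append("investigate overload: throughput at or above capacity")
        else:
            actions.append("investigate other bottlenecks: throughput below capacity")
    if budget_remaining < 0.10:  # assumed cutoff for a nearly spent budget
        actions.append("pause feature releases and prioritize reliability work")
    return actions

for step in triage(p99_confirm_s=6.2, orders_per_min=1200,
                   capacity_per_min=1000, budget_remaining=0.05):
    print(step)
```

Keeping the decision in one place makes the policy easy to review, even if the real response is carried out by people following a playbook.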


This approach keeps your service reliable and your customers happy, and it gives Site Reliability Engineers a clear way to keep services up.



Understanding these core concepts helps you build and maintain systems that users trust. By measuring the right things and setting clear goals, you can balance reliability with innovation and respond effectively when problems arise.


