
Site reliability engineering (SRE) is a set of practices that combines software engineering and operations to create highly reliable and scalable software systems. There are a few essential algorithms that every SRE should know, or will end up learning, when building highly available applications or scaling business-critical systems. Below is a non-exhaustive list of the algorithms you are most likely to need as an SRE.

Load Balancing Algorithms

Load balancing algorithms are used to distribute incoming network traffic across multiple servers or services to avoid overload and ensure high availability. Common load balancing algorithms include round-robin, least connections, and hashing (on the client IP, a header, a session ID, etc.). These algorithms often have a weighted variant as well.
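
To make the three strategies above concrete, here is a minimal Python sketch. The class and function names (RoundRobin, LeastConnections, hash_pick) are made up for illustration and do not come from any particular load balancer; weighting and health checks are left out.

```python
import hashlib
import itertools


class RoundRobin:
    """Hand out servers in a fixed rotation."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Pick the server with the fewest in-flight requests."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # Ties go to the server that was listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1


def hash_pick(servers, key):
    """Map a key (client IP, header value, session ID) to a stable server."""
    digest = hashlib.sha256(key.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]
```

The hash variant keeps a given client on the same server as long as the server list does not change, which is why it is often used for session affinity.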

Distributed Consensus Algorithms

Distributed consensus algorithms are used to ensure that multiple nodes in a distributed system agree on a common state, even in the presence of failures or network partitions. You will often hear about the distributed consensus algorithms Paxos and Raft. For example, Apache Cassandra uses Paxos for its lightweight transactions, Apache ZooKeeper uses Zab, which is very similar to Paxos, and CockroachDB uses Raft.

Sharding And Partitioning Algorithms

Sharding and partitioning are techniques used in distributed systems to improve performance, scalability, and availability. The main difference between sharding and partitioning is that sharding involves dividing data into shards and distributing them across multiple nodes to improve scalability and availability. In contrast, partitioning involves dividing data into smaller partitions based on specific criteria to improve parallel processing and performance.

Sharding algorithms are a critical component of site reliability engineering for distributed systems that must scale horizontally and handle large amounts of data. The choice of sharding algorithm will depend on the specific requirements and characteristics of the system being built or managed. Most software, if not all, will use a combination of sharding algorithms. The most common algorithms include the following:

Range Sharding and Partitioning

With range partitioning, data is partitioned based on a specific range of values, such as IDs or timestamps. Each node in the distributed system is responsible for a particular range of values. Range partitioning is easy to implement and can work well when data is naturally sorted or clustered based on specific attributes. Apache Hadoop is a distributed system that provides a framework for processing large datasets. It uses range sharding to partition data across nodes in a cluster based on a specific range of values. Additionally, range partitioning is a prevalent feature of SQL databases, where it lets portions of an individual table be stored separately on a single node's file system. MySQL, PostgreSQL, Apache Cassandra, and Oracle Database all support range partitioning.
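
As a rough illustration of the lookup described above, the sketch below routes a numeric ID to a node with a binary search over range boundaries. The boundaries and node names are hypothetical values chosen for the example.

```python
import bisect

# Hypothetical, exclusive upper bounds of each range, in ascending order.
BOUNDS = [1000, 2000, 3000]
# One node per range; the last node takes every ID at or above the last bound.
NODES = ["node-a", "node-b", "node-c", "node-d"]


def node_for(record_id):
    """Return the node responsible for the range that contains record_id."""
    return NODES[bisect.bisect_right(BOUNDS, record_id)]


# node_for(999)  -> "node-a"
# node_for(1000) -> "node-b"
# node_for(3000) -> "node-d"
```

Because neighbouring IDs land on the same node, range scans stay cheap, but a burst of sequential writes (for example, recent timestamps) can turn one node into a hot spot.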

Directory-Based Sharding

With Directory-Based Sharding, a lookup table maps keys to specific nodes in the distributed system. Directory-based sharding can be more flexible than other sharding algorithms, but it requires additional infrastructure to maintain the directory and handle routing requests. Apache HBase uses a region server to keep track of the location of data shards and routes queries to the appropriate shard based on the location of the requested data. Similarly, MongoDB uses mongos as a query router.
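
A directory can be as simple as a key-to-node map. The sketch below assumes an in-memory dictionary and a hypothetical ShardDirectory class; a real directory would need to be replicated and highly available, since every request depends on it.

```python
class ShardDirectory:
    """Lookup table that records which node holds each key."""

    def __init__(self):
        self._location = {}

    def assign(self, key, node):
        self._location[key] = node

    def lookup(self, key):
        # Raises KeyError for keys that were never assigned.
        return self._location[key]

    def move(self, key, new_node):
        """Relocate a single key without touching any other key."""
        if key not in self._location:
            raise KeyError(key)
        self._location[key] = new_node


directory = ShardDirectory()
directory.assign("tenant-42", "node-a")
directory.move("tenant-42", "node-b")  # only this directory entry changes
```

This is where the flexibility comes from: rebalancing is a matter of copying the data and updating one entry.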

Hash-Based Sharding

With hash-based sharding, a hash function maps keys to a fixed set of nodes in the distributed system. Each node is responsible for a subset of the hash space. Hash-based sharding is simple to implement and can work well when data is evenly distributed across keys. However, adding a new node to a hash-based sharded system requires re-sharding your data, because most keys will then map to a different node.
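
The re-sharding cost is easy to see with a small experiment. This sketch assumes the common "hash modulo number of nodes" scheme and made-up key names; it counts how many keys change shard when a cluster grows from four to five nodes.

```python
import hashlib


def shard_for(key, num_nodes):
    """Map a key to a shard index with hash(key) mod num_nodes."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


keys = [f"user-{i}" for i in range(10_000)]

# Count the keys whose shard changes when a fifth node is added.
moved = sum(shard_for(key, 4) != shard_for(key, 5) for key in keys)
print(f"{moved} of {len(keys)} keys would have to move")
```

With a uniform hash, a key keeps its shard only when the hash gives the same remainder modulo 4 and modulo 5, which happens for about one key in five. The remaining majority has to be migrated.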

Consistent Hashing

Consistent hashing is an improved version of hash-based sharding. In this algorithm, a hash function maps keys onto a ring that holds a fixed set of virtual nodes. Each physical node in the distributed system is responsible for a subset of the virtual nodes. When a key needs to be stored or retrieved, the hash function maps it to a virtual node, which in turn maps to a physical node in the system. Consistent hashing allows nodes to be added or removed without requiring the entire system to be re-sharded. For example, while memcached is not a distributed cache solution on its own, the PHP Memcached client library implements consistent hashing across multiple memcached servers.
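
Here is a minimal sketch of such a ring, assuming SHA-256 for positions and 100 virtual nodes per physical node. The HashRing class and its parameters are illustrative and are not taken from any client library.

```python
import bisect
import hashlib


def _position(value):
    """Place a string on the ring as a 64-bit integer."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, physical node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each physical node owns `vnodes` points spread around the ring.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_position(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def node_for(self, key):
        # First virtual node at or after the key's position, wrapping around.
        idx = bisect.bisect_left(self._ring, (_position(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["cache-1", "cache-2", "cache-3", "cache-4"])
owner_before = ring.node_for("user-123")
ring.add("cache-5")
owner_after = ring.node_for("user-123")  # unchanged unless cache-5 took the key
```

When a node joins, the only keys that move are the ones it takes over from its neighbours on the ring; everything else stays where it was.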

Failure Detection Algorithms

Failure detection algorithms are used to monitor the health of nodes in a distributed system and detect failures promptly. Examples of failure detection algorithms include heartbeat, ping, and quorum-based algorithms. I often compare the heartbeat and ping approaches to a push vs. pull method. For example, with the heartbeat method, a server shows that it is available by periodically sending a message to all the other servers. In contrast, a ping requires a server to send a message to each of the other servers individually and check whether they respond.
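
The heartbeat (push) side can be sketched as a table of "last seen" timestamps plus a timeout. The HeartbeatDetector class below is a hypothetical illustration; the clock is injectable only so the behaviour can be exercised without waiting.

```python
import time


class HeartbeatDetector:
    """Suspect any node whose last heartbeat is older than timeout_s."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_seen = {}

    def heartbeat(self, node):
        """Record that a heartbeat message from `node` just arrived."""
        self._last_seen[node] = self._clock()

    def suspected(self):
        """Return the nodes that have been silent for too long."""
        now = self._clock()
        return sorted(
            node
            for node, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )
```

Choosing the timeout is the hard part: too short and slow-but-healthy nodes get flagged, too long and real failures go unnoticed for a while.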

Monitoring and Alerting Algorithms

Monitoring and alerting algorithms are used to collect and analyze system metrics and generate alerts when thresholds are exceeded or anomalies are detected. Examples of monitoring and alerting algorithms include anomaly detection, outlier detection, and moving averages. You can find more examples in my post on Top 5 Machine Learning and Self-Healing Techniques used by SRE.
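
As one possible way to combine a moving average with outlier detection, the sketch below flags a sample that sits more than a few standard deviations away from the mean of a sliding window. The window size and threshold are arbitrary assumptions, not recommended values.

```python
from collections import deque
from statistics import mean, pstdev


class MovingAverageDetector:
    """Flag samples that are far from the mean of the last `window` samples."""

    def __init__(self, window=20, threshold=3.0):
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value):
        """Return True if `value` looks anomalous, then add it to the window."""
        anomalous = False
        # Only judge once the window is full, to avoid noisy early alerts.
        if len(self.samples) == self.samples.maxlen:
            mu = mean(self.samples)
            sigma = pstdev(self.samples)
            anomalous = sigma > 0 and abs(value - mu) > self.threshold * sigma
        self.samples.append(value)
        return anomalous
```

A production alerting rule would usually also require the condition to hold for some duration before paging anyone.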

Performance Optimization Algorithms

Performance optimization algorithms are used to improve the performance and scalability of a system by reducing latency, improving throughput, and minimizing resource usage. Examples of performance optimization algorithms include caching, query optimization, sharding, and throttling (using rate-limiting or load-shedding techniques).
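
Rate limiting is often implemented with a token bucket. The following is a minimal single-threaded sketch under that assumption; the TokenBucket class, its parameters, and the injectable clock are illustrative choices.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._updated = clock()

    def allow(self, cost=1.0):
        """Return True and consume tokens if the request may proceed."""
        now = self._clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = now - self._updated
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._updated = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A caller that gets False back can reject the request (load shedding) or queue it, depending on what the system can tolerate.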

Resilience Patterns

Regarding reliability, software engineering design patterns are as important as actual algorithms. Resilience patterns are techniques used to build robust and fault-tolerant systems that recover quickly from failures or service disruptions and keep functioning while those failures are ongoing. Some of those techniques are essential to reduce your blast radius in case of a service failure. The field has been heavily influenced by the Hystrix library, developed by Netflix. Open-source software implementing such patterns includes Apache Camel, Istio, Envoy Proxy, Resilience4j, and Kubernetes. Examples of resilience patterns and algorithms include the following:

Circuit Breakers

Circuit breakers are a pattern for detecting and responding to failures in distributed systems. When a service or resource fails repeatedly, the circuit breaker trips and prevents further requests from being sent to the failing service. This approach helps to reduce the load on the failing service and allows the system to degrade gracefully until the service recovers.
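
A minimal sketch of the idea follows. It assumes a simple consecutive-failure counter and a fixed cool-down, after which one trial call is let through (often called the half-open state). The CircuitBreaker and CircuitOpenError names are hypothetical, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            tripped = self._failures >= self.failure_threshold
            if self._opened_at is not None or tripped:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get an immediate CircuitOpenError and can serve a fallback, such as cached data or a reduced feature set.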

Bulkheads

Bulkheads have long been used in shipbuilding to keep a flooded compartment from sinking the whole vessel. Similarly, in software engineering, this pattern is used for isolating resources or dependencies to help contain failures in a distributed system. By breaking up a system into multiple compartments, called bulkheads, failures can be isolated to a specific group of resources or services without affecting the rest of the system. This technique helps prevent cascading failures and ensures that the system remains functional even if one component fails.
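
One common way to build a compartment is to cap the number of concurrent calls to a dependency. The sketch below does this with a semaphore; the Bulkhead and BulkheadFullError names are made up, and rejecting immediately (rather than queueing) is an assumption of this example.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the compartment has no free slot."""


class Bulkhead:
    """Limit how many calls to one dependency may run at the same time."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject right away instead of letting callers pile up.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("compartment is full; rejecting call")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

With one Bulkhead per downstream dependency, a slow dependency can exhaust only its own slots instead of every worker thread in the process.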

Retry

Retry patterns are a technique for automatically retrying failed requests or tasks in a distributed system. Implementations often use a linear or exponential backoff algorithm. The latter increases the waiting time between retries up to a maximum backoff time. This approach helps to prevent overwhelming the system with retries and ensures that the system has time to recover from transient failures or network disruptions.
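
The exponential variant can be sketched in a few lines. The retry helper and its parameter names are hypothetical; the sleep function is injectable so the delays can be inspected without actually waiting.

```python
import time


def retry(func, max_attempts=5, base_delay_s=0.1, max_delay_s=5.0,
          sleep=time.sleep):
    """Call func until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # 1x, 2x, 4x, ... the base delay, capped at max_delay_s.
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(delay)
```

In practice you would retry only errors known to be transient, and only operations that are safe to repeat.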

Timeouts

Timeouts are a technique for preventing requests from taking too long to complete. By setting a maximum time limit for a request, timeouts keep requests from getting stuck and blocking other requests from being processed. For example, if a request takes too long to complete, it can be canceled or retried, helping to ensure that the system remains responsive and available.
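
As a generic illustration, the sketch below bounds how long a caller waits for a function by running it in a worker thread. The call_with_timeout helper is hypothetical. Note the limitation stated in the comments: Python cannot forcibly stop the worker thread, so the timeout frees the caller but not the work itself.

```python
import concurrent.futures
import time


def call_with_timeout(func, timeout_s, *args, **kwargs):
    """Return func's result, or raise a timeout error after timeout_s seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        # Raises concurrent.futures.TimeoutError if the result is not ready.
        return future.result(timeout=timeout_s)
    finally:
        # Do not block on the worker; a timed-out call keeps running in
        # the background until it finishes on its own.
        executor.shutdown(wait=False)


try:
    call_with_timeout(time.sleep, 0.05, 0.3)  # a 0.3 s task with a 0.05 s budget
except concurrent.futures.TimeoutError:
    print("request timed out; cancel, retry, or fall back")
```

For network calls, it is usually better to rely on the timeout options of the client library or socket, since those actually abort the underlying operation.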

Graceful Shutdown

Graceful shutdown is a technique for shutting down a system or service in a controlled and predictable way. By stopping the intake of new work first, engineers can ensure that any in-flight requests or tasks are completed before the system or service is shut down. This step helps prevent data loss and ensures the system remains stable and available during the shutdown process.
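
The sketch below shows the two steps on a small in-process worker: stop accepting new tasks, then drain what is already queued before exiting. The Worker class is a hypothetical example; in a real service, shutdown() would typically be triggered by a termination signal such as SIGTERM.

```python
import queue
import threading

_STOP = object()  # sentinel that marks the end of the queue


class Worker:
    def __init__(self):
        self._tasks = queue.Queue()
        self._lock = threading.Lock()
        self._accepting = True
        self.completed = []
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def submit(self, task):
        with self._lock:
            if not self._accepting:
                raise RuntimeError("shutting down; not accepting new work")
            self._tasks.put(task)

    def _run(self):
        while True:
            task = self._tasks.get()
            if task is _STOP:
                return  # everything queued before the sentinel has run
            self.completed.append(task())

    def shutdown(self, timeout_s=5.0):
        """Stop intake, drain queued tasks, and report whether draining finished."""
        with self._lock:
            if self._accepting:
                self._accepting = False
                self._tasks.put(_STOP)
        self._thread.join(timeout_s)
        return not self._thread.is_alive()
```

The lock makes sure no task can slip into the queue behind the sentinel, and the join timeout bounds how long the process waits before giving up on the drain.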

Conclusion

While tailoring algorithms to specific systems is crucial, SREs need to go beyond that. Building truly resilient systems demands a socio-technical perspective, as explored in my blog post on Resilience Engineering. By understanding underlying organizational dynamics and human factors, alongside technical tools, SREs can craft systems that truly withstand challenges.