Recent and repeated incidents at Boeing with the 737 MAX serve as stark reminders that system quality and reliability are inherently socio-technical problems. An organization pushing unreasonable timelines cannot simultaneously expect top-tier quality or reliability, and when companies drastically slash costs, deficiencies often emerge in their products. The efficiency–thoroughness trade-off (ETTO) principle effectively captures this dynamic.
Additionally, unexpected failures are inevitable even with the best engineers and system architecture. The real questions become: how resilient are your systems and organization in the face of such failures? And are you prepared to learn and adapt through adversity?
Resilience Engineering embodies the proactive design and management of systems, aiming to anticipate, contain, and learn from disruptions. It fosters adaptive capabilities that ensure persistent functionality and drive continuous improvement amidst evolving challenges.
In today’s ever-evolving landscape of interconnected systems, resilience engineering stands out as a beacon of proactive response: not just weathering storms, but anticipating them, containing them, and extracting insights that propel continuous improvement.
In this blog post, I explore resilience engineering, breaking down its principles and uncovering its practical uses, ranging from the basics of Failure Mode and Effects Analysis (FMEA) to chaos engineering techniques and cell architectures.
FMEA: The Foundation and its Limitations
FMEA, or Failure Mode and Effects Analysis, traces its roots back to the 1940s when the U.S. military initially developed it to identify and prioritize potential failure modes in military hardware. Since then, it has evolved into a widely adopted methodology across various industries, serving as a foundational tool in risk management and reliability engineering.
FMEA is a process that methodically examines potential failure modes and their impacts, prioritizing them according to severity, frequency, and detectability in order to mitigate risks proactively.
The five steps of FMEA
The original FMEA methodology, often referenced in academic settings and older industry standards, consists of five steps. It focuses on identifying, analyzing, and prioritizing potential failure modes within a system. The five steps are:
Define the system or process to be analyzed: Identify the specific system or process you are evaluating for potential failures. This could be an entire product, a subsystem, or a specific operational procedure.
Identify potential failure modes: Brainstorm and list all the possible ways each component or step within the system could fail. Consider different failure mechanisms like component malfunction, operational error, environmental factors, etc.
Assess the effects of each failure mode: For each identified failure mode, analyze the potential consequences on the system’s functionality, safety, performance, or other relevant aspects. Consider severity levels and cascading effects.
Assign severity, occurrence, and detection rankings: Assign numerical scores (usually on a scale of 1 to 10) to each failure mode based on its:
- Severity: How critical is the impact of the failure?
- Occurrence: How likely is the failure to occur?
- Detection: How easily can the failure be detected before it causes significant problems?
Calculate the Risk Priority Number (RPN) and prioritize actions: Multiply the severity, occurrence, and detection scores for each failure mode to get the Risk Priority Number (RPN). A higher RPN indicates a higher priority for mitigation actions. Focus on addressing the failure modes with the highest RPNs first.
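The scoring and prioritization steps above can be sketched in a few lines of Python. The failure modes and scores below are invented purely for illustration; real FMEA rankings come from domain experts using calibrated scales:

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (almost certain)
    detection: int   # 1 (easily detected) .. 10 (practically undetectable)

    @property
    def rpn(self) -> int:
        # Risk Priority Number = Severity x Occurrence x Detection
        return self.severity * self.occurrence * self.detection

# Hypothetical failure modes for a web service
modes = [
    FailureMode("database connection pool exhaustion", severity=8, occurrence=4, detection=3),
    FailureMode("stale cache served after deploy", severity=4, occurrence=6, detection=7),
    FailureMode("TLS certificate expiry", severity=9, occurrence=2, detection=2),
]

# Address the highest-RPN failure modes first
for m in sorted(modes, key=lambda m: m.rpn, reverse=True):
    print(f"{m.name}: RPN={m.rpn}")
```

Note how the multiplication can surprise: the "stale cache" mode outranks the more severe database failure because it is both likelier and harder to detect, which is exactly the kind of insight the RPN is meant to surface.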
📎 Remarks
- FMEA is often conducted iteratively, meaning you may need to revisit earlier steps as you gain more information and refine your analysis.
- There are variations and additional steps based on specific FMEA methodologies and industry standards. For example, the Automotive Industry Action Group (AIAG) and the German Association of the Automotive Industry (VDA) expand on the traditional FMEA with the AIAG-VDA FMEA in seven steps:
- Planning & Preparation
- Structure Analysis
- Function Analysis
- Failure Analysis (same as step 2 in traditional FMEA)
- Risk Analysis (same as step 3 in traditional FMEA)
- Optimization
- Results Documentation
Notable Success
One of the notable success stories of FMEA comes from NASA’s Apollo program, where it was instrumental in ensuring the safety and reliability of spacecraft systems. By meticulously analyzing potential failure modes and their effects, NASA engineers were able to anticipate and mitigate risks, ultimately contributing to the success of the Apollo missions and the safe return of astronauts to Earth.
FMEA is crucial in identifying vulnerabilities and enhancing system reliability in modern industries. For example, in the automotive sector, manufacturers utilize FMEA to assess potential failure modes in vehicle components, leading to improved safety standards and customer satisfaction.
Limitations
Despite its long history and proven benefits, FMEA is not without its limitations. The Deepwater Horizon oil spill serves as a sobering reminder of the potential consequences when critical failure modes are overlooked or inadequately addressed. Even with the extensive use of FMEA in the oil drilling process, critical factors like improper well cementing, faulty design, and lack of redundancy went unnoticed, contributing to the disaster. This underscores the need for complementary strategies beyond FMEA.
Prior to the 737 MAX incidents, Boeing was widely recognized for its leadership in engineering rigor, safety, and quality standards. However, investigations into these tragic events revealed significant concerns regarding processes, risk assessment, and design decisions surrounding the development of the aircraft. Such issues clearly persist at Boeing, which needs to fix how its organization works and return to a culture of safety and resilience. The Harvard Business School Working Knowledge blog has a great post on Boeing’s eroded culture and Why Boeing’s Problems with the 737 MAX Began More Than 25 Years Ago.
While FMEA is a valuable tool for risk management, it does have some downsides you should be aware of:
Limited Scope:
- Focuses on single components: FMEA primarily analyzes individual components and their failure modes, potentially overlooking systemic weaknesses or interactions between components.
- Static analysis: It assumes a fixed system design and may not capture dynamic changes or emerging threats.
Subjectivity and Dependence on Expertise:
- Reliance on experience: The effectiveness of FMEA heavily relies on the analyst’s experience and skill in identifying potential failures and assigning risk scores. This can lead to subjective judgements and inconsistencies.
- Limited ability to predict complex failures: FMEA might not account for unforeseen interactions or rare events, especially in complex systems.
Time and Resource Limitations:
- Can be time-consuming and resource-intensive: Especially for large and complex systems, conducting a thorough FMEA can be very demanding on time and resources.
- Focus on high-risk areas may neglect others: Prioritizing high-risk areas can be beneficial, but it might lead to neglecting potential failures in less critical areas.
Potential to miss cascading effects:
- Focus on direct consequences: FMEA might not fully capture the cascading effects of a failure, where one issue triggers a chain reaction of subsequent failures.
- Limited analysis of common cause failures: Common cause failures, where multiple elements fail due to a single root cause, can be challenging to predict and address through FMEA.
Can be bureaucratic and checklist-driven:
- Overemphasis on documentation: Focusing too much on completing the FMEA documentation can overshadow the actual risk assessment and corrective action planning.
- Loss of insights: Rigidly following a checklist approach can hinder creative problem-solving and deeper understanding of risk factors.
Overall, FMEA remains a valuable tool, but it’s essential to acknowledge its limitations and combine it with other risk management approaches for a comprehensive understanding and mitigation of system vulnerabilities.
Chaos Engineering: Embracing Uncertainty
While FMEA provides a structured foundation for risk identification, chaos engineering takes a radically different approach by proactively injecting controlled chaos into systems to expose weaknesses and build resilience. Instead of passively analyzing potential failures, chaos engineering actively tests them in live environments, mimicking real-world disruptions and observing the system’s response. This paradigm shift emerged from the need to ensure the reliability and scalability of increasingly complex, distributed systems, particularly in cloud computing.
😀 Fun Fact
In 1983, while developing MacWrite and MacPaint for the first Apple Macintosh, Steve Capps pioneered an early form of chaos engineering with “Monkey.” This playful desk accessory injected chaos into the user interface, simulating a mischievous monkey pounding randomly on the keyboard and mouse at high speeds. While automated testing was scarce due to limited memory, “Monkey” was a valuable debugging tool that triggered unexpected user interactions and exposed hidden errors for programmers to address. This early experiment helped build resilient software for the nascent Macintosh and foreshadowed the rise of formalized chaos engineering practices we see today. - Monkey Lives
I view Chaos Engineering as a complementary practice that builds upon FMEA’s foundation. Both methodologies share a core principle: identifying hypotheses and then running experiments to test them. While FMEA helps us proactively assess potential failure scenarios, Chaos Engineering takes it further by actively testing those scenarios in live environments. This allows us to identify weaknesses and observe how the system responds to real-world disruptions, ultimately building more resilient systems. While FMEA primarily serves as a risk assessment technique, and Chaos Engineering focuses on practical testing, they can be used jointly to create a comprehensive approach to system reliability.
Moreover, Chaos Engineering often incorporates concepts like circuit breakers, bulkheads, and other resilience patterns, which act as safeguards to prevent system-wide failures. These mechanisms enable the system to isolate and contain failures, ensuring they do not cascade and affect the entire system’s performance.
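As a rough sketch of one such safeguard, here is a minimal circuit breaker: after a few consecutive failures it "opens" and fails fast instead of hammering a struggling dependency, then allows a trial call after a cooldown. This is a simplified illustration of the pattern, not a production implementation; hardened versions exist in libraries such as resilience4j or pybreaker:

```python
import time

class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; retries after `reset_timeout` seconds."""

    def __init__(self, max_failures: int = 3, reset_timeout: float = 30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Fail fast: protect the dependency and the caller's latency budget
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call through

        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

The key design choice is that an open circuit converts slow, cascading failures into immediate, cheap ones, containing the blast radius while the downstream service recovers.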
Companies like Netflix, Etsy, and Amazon have become pioneers in using chaos engineering tools like Chaos Monkey, Gremlin, AWS Fault Injection Service, and Chaos Mesh. These tools simulate various failure scenarios, such as server outages, network interruptions, and resource spikes, allowing engineers to identify and rectify vulnerabilities before they impact actual users.
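The core loop of a chaos experiment — state a steady-state hypothesis, inject a fault, and observe whether the hypothesis still holds — can be sketched in miniature. The fault injector below is a toy stand-in for tools like Gremlin or AWS Fault Injection Service, and all names here are invented for the example:

```python
import random

def flaky(fn, failure_rate: float = 0.2, seed=None):
    """Wrap a dependency so it randomly fails, simulating e.g. network timeouts."""
    rng = random.Random(seed)
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise TimeoutError("injected fault: simulated network timeout")
        return fn(*args, **kwargs)
    return wrapper

def fetch_user(user_id):
    # Hypothetical dependency (e.g. a user-profile service call)
    return {"id": user_id, "name": "demo"}

def fetch_user_resilient(user_id, attempts: int = 3):
    """System under test: retries plus a degraded fallback keep faults contained."""
    unreliable = flaky(fetch_user, failure_rate=0.5, seed=user_id)
    for _ in range(attempts):
        try:
            return unreliable(user_id)
        except TimeoutError:
            continue
    return {"id": user_id, "name": None}  # degraded but valid fallback

# Hypothesis: every caller still gets a valid response despite 50% injected faults
results = [fetch_user_resilient(i) for i in range(100)]
assert all(r["id"] == i for i, r in enumerate(results))
```

Real chaos experiments follow the same shape but run against live infrastructure with careful blast-radius limits and abort conditions.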
However, it’s crucial to acknowledge the potential downsides and ethical considerations surrounding chaos engineering when operated in a live production environment. Unintended consequences like data loss or service disruptions can occur, even during controlled experiments. Balancing the need for rigorous testing with minimizing user impact requires careful planning and ethical considerations. Obtaining informed consent from users and ensuring transparency about potential disruptions are vital factors in responsible implementation.
Despite these challenges, chaos engineering is a powerful tool for building inherently resilient systems capable of weathering real-world storms. By proactively embracing uncertainty and testing for failure, companies can move beyond reactive firefighting and ensure a seamless user experience even amidst unexpected disruptions. The future of chaos engineering promises even more advanced tools and methodologies, ultimately empowering organizations to thrive in an increasingly unpredictable world.
Cell-Based Architecture: Building Systems for Resilience
Chaos Engineering shines a light on vulnerabilities and cascading failures, revealing the fragility of interconnected systems. But how do we design systems that minimize the impact of failures, reducing their “blast radius” and ensuring overall system resilience? This is where cell-based architecture, inspired by the robust compartmentalization of biological cells, takes center stage.
💡 Cell-based architecture vs bulkhead patterns
Cell-based architecture and bulkhead patterns share the goal of building resilient systems through modularity and isolation. While cell-based architecture acts as a broader philosophy emphasizing resilience across diverse types of systems, bulkhead patterns offer a more specific implementation technique often used in software development to isolate services and prevent cascading failures. Both prioritize independent, fault-tolerant units but differ in scope and granularity, with cell-based approaches favoring larger functional units and bulkheads focusing on finer-grained resource or service isolation. Ultimately, the choice between them depends on your specific system needs and desired level of granularity.
Imagine a software application broken down into independent microservices, each encapsulated with its own resources and fault tolerance mechanisms. This is the essence of cell-based architecture applied to microservices. Each service operates independently, minimizing cascading failures and allowing for individual updates and scaling. Similarly, consider a power grid segmented into isolated cells (a.k.a. microgrids), enabling localized troubleshooting and minimizing widespread outages. In autonomous vehicles, redundant systems like multiple sensors and control units offer backup mechanisms, enhancing safety and reliability.
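A core mechanic of cell-based architecture is deterministically partitioning traffic so each customer lands in exactly one cell; a cell failure then affects only that cell's share of customers. A minimal routing sketch, with illustrative cell names and counts:

```python
import hashlib

CELLS = ["cell-1", "cell-2", "cell-3", "cell-4"]

def route_to_cell(customer_id: str) -> str:
    """A stable hash pins each customer to one cell across all requests."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]

# If cell-2 fails, only the customers routed to it are affected:
# roughly a 1/len(CELLS) blast radius instead of a full outage.
affected = [c for c in (f"customer-{i}" for i in range(1000))
            if route_to_cell(c) == "cell-2"]
print(f"{len(affected)} of 1000 customers in the blast radius")
```

The stability of the routing function matters as much as the partitioning itself: if customers bounced between cells, a single cell failure could touch everyone over time.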
Cell-based architecture offers several advantages:
- Minimized Blast Radius: Failures are contained within their respective cells, reducing the scope of disruption and ensuring other system components continue functioning.
- Enhanced Resilience: Redundancy within each cell provides backup mechanisms, allowing the system to gracefully withstand component failures.
- Independent Scalability: Cells can be scaled individually based on specific needs, promoting flexibility and efficiency.
- Agile Development: Changes can be made to individual cells without affecting the whole system, accelerating innovation and iteration.
However, cell-based architecture also presents challenges:
- Increased Complexity: Managing numerous independent units requires robust monitoring, governance, and security practices.
- Tracing Issues: The isolation principle can make it difficult to track problems across cells, demanding a strong observability infrastructure.
- Not a One-Size-Fits-All: This approach is not universally applicable. Careful consideration of system needs, trade-offs, and implementation complexity is crucial.
Cell-based architecture offers a powerful approach to building systems with limited blast radius, promoting resilience and adaptability. However, its suitability depends on specific context and careful implementation.
Remember, building resilient systems is an ongoing journey. Cell-based architecture can be a valuable tool on this path, but its effectiveness requires thoughtful application and continuous evaluation.
Beyond the technology
Resilience Assessment Grid (RAG)
Beyond the technology, how can you measure and understand the resilience of your systems? For that purpose, Erik Hollnagel introduced the Resilience Assessment Grid (RAG) in 2015, stating that:
A system cannot be resilient, but a system can have a potential for resilient performance. A system is said to perform in a manner that is resilient when it sustains required operations under both expected and unexpected conditions by adjusting its functioning prior to, during, or following events (changes, disturbances, and opportunities). Whereas current safety management (Safety-I) focuses on reducing the number of adverse outcomes by preventing adverse events, Resilience Engineering (RE) looks for ways to enhance the ability of systems to succeed under varying conditions (Safety-II). It is therefore necessary to understand what this ability really means, since it clearly is not satisfactory just to call it ‘resilience’.
In other words, the Safety-I paradigm emphasizes understanding and preventing failures, while Resilience Engineering (RE) or Safety-II paradigm focuses on understanding how systems succeed under varying conditions and adapt to unexpected situations (see Safety-I and Safety-II).
Additionally, a typical Safety-I mindset is to attribute an incident to “human error,” which often masks deeper systemic issues within the system or workplace.
The RAG, which has been used extensively in healthcare settings, is a framework developed in Resilience Engineering to assess a system’s potentials for resilient performance. It focuses on four key abilities:
- The potential to Respond: Knowing what to do and adjusting actions based on changes, disturbances, and opportunities.
- The potential to Monitor: Being aware of what affects the system’s performance, both internally and externally.
- The potential to Learn: Analyzing experiences to improve future responses and decision-making.
- The potential to Anticipate: Predicting potential disruptions, demands, and opportunities to prepare proactively.
Instead of measuring a single, elusive “resilience,” the RAG framework dives into a system’s ability to respond, monitor, learn, and anticipate. Using customizable questions and visualized results (like radar charts), it identifies areas for improvement and tracks progress over time, guiding targeted interventions for a more resilient system.
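As a sketch of how such an assessment might be scored, a RAG questionnaire can collect Likert-style answers grouped by the four potentials and average them into the axes of a radar chart. The questions and scores below are invented for illustration; real RAG questionnaires are tailored to the organization being assessed:

```python
from statistics import mean

# Answers on a 1 (missing) .. 5 (excellent) scale, grouped by the four potentials.
# All questions and scores here are hypothetical examples.
rag_answers = {
    "respond":    {"Is there an up-to-date incident response playbook?": 4,
                   "Can responders adjust plans mid-incident?": 3},
    "monitor":    {"Are leading indicators tracked, not just failures?": 2,
                   "Is external context (vendors, demand) monitored?": 3},
    "learn":      {"Are post-incident reviews blameless and regular?": 5,
                   "Do lessons learned actually change practice?": 3},
    "anticipate": {"Are future demands and threats modeled?": 2,
                   "Is anticipation someone's explicit responsibility?": 1},
}

# One aggregate score per ability -> one axis per ability on the radar chart
scores = {ability: mean(answers.values()) for ability, answers in rag_answers.items()}
for ability, score in scores.items():
    print(f"{ability:>10}: {score:.1f} / 5")
```

Repeating the assessment over time turns the radar chart into a trend line, showing whether targeted interventions are actually moving the needle on each potential.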
Unlike approaches focused solely on failure prevention, the RAG promotes a comprehensive understanding of resilience. Its shift away from blaming individuals helps identify systemic root causes and guides targeted improvement efforts, and its flexible framework allows comparisons across different systems and groups, facilitating effective interventions and strengthening resilience across your organization.
While the RAG provides a valuable assessment tool, it requires tailoring to each specific context, and its subjective results are open to interpretation and bias. Additionally, it cannot guarantee future resilience; it offers a snapshot of current capabilities. Despite these limitations, the RAG’s focus on core functionalities empowers us to understand and improve the adaptability of complex systems.
Building Resilient Systems Together
Collaboration is the key to effective resilience engineering. It’s not just about designing robust systems; it’s about building a shared understanding across teams and disciplines. Open communication, fostered through regular design discussions, incident reviews, and post-mortems, allows us to learn from both successes and failures. Analyzing both what went right and what went wrong is crucial for building resilient systems.
Cross-functional teams bring diverse perspectives to the table, ensuring resilience considerations are woven into every stage of a system’s life cycle, from design and development to operations. Imagine incorporating security experts into early design phases, involving operations teams in development discussions – this diverse blend of expertise strengthens the fabric of resilience.
Starting small is key. Begin by selecting a specific process or system and gradually integrate resilience practices like chaos engineering and fault injection testing. Automation tools can be your allies, streamlining tasks like monitoring, testing, and remediation.
👉 Resilience is a journey, not a destination.
The future of resilience engineering is brimming with innovative possibilities. Artificial intelligence (AI) and machine learning (ML) will play a crucial role in building more intelligent and adaptive systems. Imagine AI-powered systems predicting and mitigating failures before they occur, or ML algorithms continuously learning and adapting resilience strategies based on real-time data. However, we must approach these advancements with caution. Ethical considerations around AI development and bias must be addressed, and overreliance on these technologies without human oversight can introduce new vulnerabilities.
The evolving landscape of threats, from sophisticated cyberattacks to the complexities of climate change, demands constant learning and adaptation. By staying vigilant, fostering collaboration across disciplines, and embracing new technologies responsibly, we can build systems that not only survive but thrive in an increasingly complex and unpredictable world. Let’s share best practices, learn from each other, and build a future where resilience is not just a goal, but a deeply ingrained practice.