Why Some Systems Keep Working When Things Go Wrong
A practical model for building adaptive, operationally resilient systems.
Posted December 5, 2025 Reviewed by Monica Vilhauer Ph.D.
Key points
- Resilient systems rely on both redundant and alternative paths when their primary plan fails.
- Spare capacity alone is insufficient; systems must also have the ability and authority to change course.
- Experience matters; systems adapt better when they practice operating under degraded conditions.
- Operational resilience emerges from building capacity, ability, and experience in changing course.
Operational resilience is the ability of a system to keep carrying out a critical function in the face of friction, partial failure, or adversity. Operationally fragile systems break when conditions change or parts fail, while operationally resilient systems find a way to keep going despite the altered (and typically worse) conditions they find themselves in.
An intensive care unit shows operational resilience when it continues to provide exceptional levels of critical care despite budget cuts, staff turnover, and equipment failures. A group of satellites handling communications for a space mission shows operational resilience when it dynamically redistributes signal pathways to adjust for interference or the loss of a single satellite.
In this post, we’ll look at three aspects of operationally resilient systems. Capacity describes the alternatives a system can engage when its primary path is blocked. Ability describes whether a system is actually able to use these alternatives. Experience describes how much practice the system has at using these alternatives in the real world. Together, these concepts form the “ACE Model” for exploring operational resilience and give a practical way to understand why some systems bend while others break.
Building Capacity to Adjust
The capacity for a system to function despite a setback comes in two main forms: redundant paths and alternative paths. Redundancy is the capability to do the same thing in the same way despite a setback. If one train fails a maintenance check in a city transit system, a spare train can be brought online from a backup pool to run the same route. Alternative paths, by comparison, provide the capacity to reach the same (or a similar) end point in an entirely different way. Continuing the transit example, if a whole track needs to be closed for repairs, high-capacity systems could reroute passengers through other tracks (or use buses) to keep the same stations connected. Operationally resilient systems tend to be either “robust” (lots of redundancy) or “agile” (lots of ability to pivot to alternate routes), but often express components of both.
Importantly, relying on spare capacity alone to maintain operational resilience can easily result in cascading failures as the workload through a system gets shunted onto a smaller number of working components. For example, a hospital might have three dialysis machines that each run at 70% capacity. If one fails and the other two are called upon to pick up the slack, each would need to run at 105% of its capacity to cover the same workload (3 × 70% = 210%, split across two machines), and this load redistribution can cause second-order failures.
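The dialysis example can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes the total workload stays constant and is split evenly among the machines that are still working.

```python
def redistributed_load(machines: int, utilization: float, failed: int) -> float:
    """Per-machine utilization after `failed` machines drop out.

    Assumes (for illustration) that total workload is unchanged and is
    shared evenly by the surviving machines.
    """
    survivors = machines - failed
    if survivors <= 0:
        raise ValueError("no working machines left to absorb the load")
    total_load = machines * utilization
    return total_load / survivors


def max_safe_utilization(machines: int, failed: int) -> float:
    """Highest normal utilization at which the survivors stay at or below 100%."""
    if not 0 <= failed < machines:
        raise ValueError("failed must be between 0 and machines - 1")
    return (machines - failed) / machines


# Three machines at 70%, one fails: each survivor must carry 105%.
after = redistributed_load(machines=3, utilization=0.70, failed=1)
print(f"Each surviving machine runs at {after:.0%} of capacity")

# To tolerate one failure, three machines should normally run at or below about 67%.
ceiling = max_safe_utilization(machines=3, failed=1)
print(f"Safe normal utilization: {ceiling:.0%}")
```

Under these assumptions, a system that must survive the loss of one of three components needs to keep each one at or below roughly two-thirds of its capacity in normal operation.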
Ensuring the Ability to Change
Systems with high levels of spare capacity still require the ability to change configurations to make use of that capacity. Ability in this sense is the bridge between theoretical capacity and actual capacity. For example, a power company may operate primarily on nuclear energy but have a coal-fired plant available as a backup option. On paper, this may look like spare capacity, but many of the skills required to run an older coal-fired plant are different from the ones required for a nuclear reactor. Transitioning between the systems may be challenging, costly, or time-consuming, limiting or even eliminating the company’s ability to be truly operationally resilient.
A critical but often overlooked feature of ability is authority: a system may lose operational resilience if it has usable capacity but not the authority to transition to a different mode of operation. In the emergency department, patients can be brought to critical areas by almost any staff member who feels they are sick, so nearly everyone has the authority to change the first part of the triage line. However, opening or closing entire pods of treatment areas to meet shifting demand patterns is typically a decision only a few individuals have the authority to make. The system has the ability to change configurations, but the individuals present may still lack the authority to activate that change.
Practicing Failure to Build Experience
Emergency action plans typically overestimate the ability of systems to handle sudden and chaotic changes. When these plans are written down without serious testing and practice, they are fantasies at best, and actively dangerous at worst. Even systems with spare capacity and the ability and authority to use it can fail to be operationally resilient if they do not proactively practice that resilience.
While it’s true that you can’t necessarily identify all the ways in which a system could fail, it is generally possible to envision particularly likely (or particularly catastrophic) failure states, and to have your system practice working against them. For example, hospitals frequently run “disaster drills” that simulate everything from earthquakes to widespread computer failures. By simulating the delivery of medical care under degraded and suboptimal conditions, hospital teams can check their capacity, assess their ability to actually use alternate tools, and make sure their on-paper plans match the reality of their current situations. These drills are excellent, for example, at finding out that an old plan has not been updated to match new construction, or that the door to crucial backup equipment auto-locks at night. Operationally resilient hospitals take advantage of this “cheap learning” to modify and improve protocols and physical resources, expanding the space in which they can keep performing and succeeding.
Using the ACE Model
Ultimately, operational resilience is a dynamic feature of a system, not a static one. Maintaining it is an ongoing process, and using the ACE Model can help make that happen. To put it into action for a system you’re working with, start by choosing a challenge or a way the system can potentially fail.
ACE is a more memorable acronym than CAE, but the easiest starting point is to look first at your system’s capacity to handle that failure. If you had to face it today, how would your system naturally respond? What capacity do you currently have to buffer that shock?
Once you have identified redundant and alternative paths forward, look at your system’s ability to use those paths. For each option, ask yourself if your system really has the ability to change in that particular way. Do you have the necessary skills and authority to make that change happen?
Finally, check when your system last practiced or simulated maintaining operational resilience in that particular way. If it has been a while, or if other parts of your system have changed substantially, it’s probably time to practice again.
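For readers who like to keep this kind of review in a structured form, the walkthrough above can be sketched as a small checklist in Python. Everything here is an illustrative assumption rather than part of the ACE Model itself: the `Path` record, the `ace_gaps` function, the example paths, and especially the 365-day threshold for when practice counts as stale.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Path:
    """One fallback option for a chosen failure (hypothetical record)."""
    name: str
    kind: str                       # "redundant" or "alternative"
    has_skills: bool                # Ability: can the team actually run it?
    has_authority: bool             # Ability: can someone present activate it?
    last_practiced: Optional[date] = None  # Experience: last drill or real use


def ace_gaps(path: Path, today: date, max_age_days: int = 365) -> List[str]:
    """List the ability and experience gaps for one path.

    The 365-day default is an arbitrary assumption; choose a threshold
    that fits how quickly your own system changes.
    """
    gaps = []
    if not path.has_skills:
        gaps.append("ability: missing skills")
    if not path.has_authority:
        gaps.append("ability: missing authority")
    if path.last_practiced is None:
        gaps.append("experience: never practiced")
    elif (today - path.last_practiced).days > max_age_days:
        gaps.append("experience: practice is stale")
    return gaps


# Capacity: the paths available if a transit track closes (made-up data).
today = date(2025, 12, 5)
spare_train = Path("spare train", "redundant", True, True, date(2025, 10, 1))
bus_bridge = Path("bus bridge", "alternative", True, False, None)

for path in (spare_train, bus_bridge):
    print(path.name, "->", ace_gaps(path, today) or "no gaps found")
```

An empty list of paths is itself a capacity gap, and a path with no gaps today still needs to be rechecked whenever the surrounding system changes.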
