As businesses increasingly migrate their applications and infrastructure to the cloud, ensuring resiliency becomes paramount to maintain service availability and minimize disruptions. Resilient cloud architectures are designed to withstand failures and recover quickly from disruptions, offering enhanced reliability and performance.
This article provides an overview of resiliency patterns and trade-offs that can help architects build efficient and robust cloud solutions.
In this article we’ll go over five resilience patterns and the trade-offs to consider when designing your workloads. To fulfill your business resilience requirements, you should take into account the following key factors:
By doing so, you can achieve different levels of resiliency and effectively determine the most suitable architecture that aligns with your requirements.
Note: These patterns are not mutually exclusive, so you can combine two or more of them.
Resiliency, as defined by the AWS Well-Architected Framework, refers to the ability of a system to recover from failures and continue operating as expected.
A resilient workload can recover when stressed by load (more requests for service), attacks (either accidental, through a bug, or deliberate), or the failure of any of its components.
P1 is an architectural pattern that enhances resilience by using Availability Zones (AZs). It uses multiple AZs within a single AWS Region so that your application can withstand disruptions at the AZ level.
An application runs on a single EC2 instance managed by an Auto Scaling group that uses health checks to manage its instances. If an AZ becomes unavailable, Auto Scaling detects the unhealthy instance and replaces it with a new instance in another available AZ.
Low-business-impact applications with lower resiliency requirements, such as internal employee applications.
P1 has a low impact on all factors. It softens an application availability disruption, but at the expense of recovery time: in the event of an AZ outage, end users lose access to the application until new resources are provisioned in a different AZ. This is known as bimodal behavior.
P2 uses multiple EC2 instances across multiple Availability Zones (AZs) within a region to increase resilience, while using static stability to prevent bimodal behavior.
Statically stable workloads operate in one mode regardless of changes in the operating environment.
In practice, you pre-provision enough instances in each AZ to handle the full workload if one AZ fails, and then use Elastic Load Balancing or Route 53 health checks to shift load away from the unavailable instances to the healthy ones in the remaining AZs.
One of the advantages of using a static stability approach is that it simplifies the recovery process during a disruption due to the pre-provisioned capacity of resources.
An application runs on multiple instances managed by an Auto Scaling group across multiple AZs. When one AZ fails, the application continues operating because the Elastic Load Balancer shifts end users' traffic to the working AZs.
Customer-facing websites that have a lower tolerance for downtime.
Adopting the P2 approach results in higher reliability, because end users do not experience application downtime as they would with P1. However, P1 has a lower infrastructure cost, because you provision less compute capacity and rely on launching new instances in the case of a failure.
But for large-scale failures (such as an Availability Zone failure) the P1 approach is less effective because it relies on reacting to impairments as they happen, rather than being prepared for those impairments before they happen.
Therefore, to determine the most suitable solution for your workload, it is essential to balance reliability and cost requirements. If your application can support the P2 approach, then increasing the number of Availability Zones used within a region can reduce the additional compute cost, as you provision less spare capacity in each AZ.
For example, if you use two AZs, you should provision enough EC2 instances that the unaffected AZ can handle 100% of the workload.
If you use three AZs, you should provision enough EC2 instances that the two unaffected AZs can together handle 100% of the workload.
This means that you only have to provision 150% of your capacity across three AZs, compared with 200% across two AZs, which reduces your costs.
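The arithmetic above generalizes to any number of AZs. The following is a minimal sketch, assuming load is spread evenly across AZs and that only one AZ fails at a time; the function name `static_stability_capacity` and the peak of 12 instances are illustrative and not taken from AWS documentation.

```python
import math


def static_stability_capacity(num_azs: int, peak_instances: int) -> dict:
    """Size each AZ so the surviving AZs carry peak load after losing one AZ."""
    if num_azs < 2:
        raise ValueError("static stability needs at least two AZs")
    if peak_instances < 1:
        raise ValueError("peak_instances must be positive")
    # After one AZ fails, the remaining (num_azs - 1) AZs share the full peak.
    per_az = math.ceil(peak_instances / (num_azs - 1))
    total = per_az * num_azs
    return {
        "per_az": per_az,
        "total": total,
        "percent_of_peak": round(100 * total / peak_instances),
    }


if __name__ == "__main__":
    # Hypothetical workload that needs 12 instances at peak.
    for azs in (2, 3, 4):
        print(azs, static_stability_capacity(azs, peak_instances=12))
```

With a peak of 12 instances this gives 200% of peak for two AZs and 150% for three, matching the figures in the text. Rounding up per AZ means the percentage can be slightly higher when the peak does not divide evenly.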
P3 uses a multi-region pattern to increase functional resilience by distributing different critical applications in multiple regions.
Regional service disruptions are rare, but implementing this pattern helps ensure that your end users retain access to business-critical services during a disruption.
An organization provides its services across multiple digital channels (e.g., online website, mobile application). Each digital channel is deployed in a different region.
If digital channel #1 (e.g., the mobile application) is disrupted, end users can still consume the service through an alternate digital channel (channel #2, the online website).
Business-critical services that are distributed via multiple digital channels.
P3 addresses the risk of a regional service disruption affecting multiple applications simultaneously.
Running an application portfolio across multiple regions requires extensive operational planning and management. Isolated functional elements might still rely on shared downstream systems and data sources deployed in a single region, so some disruptions may still occur, but a region-wide event should affect a smaller surface area.
P4 uses multi-AZ deployments in multiple regions together with active/passive strategies that enable the workload to recover from disaster events.
When selecting your disaster recovery (DR) strategy, you must weigh the benefits of lower RTO (recovery time objective) and RPO (recovery point objective) vs the costs of implementing and operating a strategy.
The Pilot Light and Warm Standby strategies offer a good balance of benefits and cost.
The Pilot Light pattern is suitable for applications with recovery time objectives (RTO) and recovery point objectives (RPO) in the range of tens of minutes. In this pattern, data is continuously replicated, and the application infrastructure is pre-provisioned in the disaster recovery (DR) Region. The main focus is cost optimization, as the application infrastructure remains switched off and is only activated during the restoration event.
The Warm Standby pattern offers a notable improvement in restore times compared to the Pilot Light approach by maintaining application availability in the disaster recovery (DR) Region, albeit at a reduced capacity. During a DR event, the application infrastructure can be automatically scaled up with minimal manual intervention. When implemented correctly, this pattern can achieve recovery time objectives (RTO) and recovery point objectives (RPO) within minutes.
Both the Pilot Light and Warm Standby strategies replicate data from the primary region to data resources in the recovery region, and these data resources are ready to serve requests. In addition to replication, both strategies require you to create continuous backups in the recovery region. In a disaster caused by human action, data can be deleted or corrupted, and replication will copy the bad data. Backups are necessary so that you can get back to the last known good state.
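To make the difference between replication and backup concrete, here is a toy in-memory sketch. The `ReplicatedStore` class is hypothetical and stands in for any replicated data service; it does not model a specific AWS product.

```python
import copy


class ReplicatedStore:
    """Toy primary/replica pair with point-in-time backups (illustration only)."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.backups = []  # (label, snapshot) tuples, oldest first

    def write(self, key, value):
        self.primary[key] = value
        self._replicate()

    def delete(self, key):
        self.primary.pop(key, None)
        self._replicate()

    def _replicate(self):
        # Replication mirrors every change, including mistakes.
        self.replica = copy.deepcopy(self.primary)

    def take_backup(self, label):
        self.backups.append((label, copy.deepcopy(self.primary)))

    def restore(self, label):
        # Search newest first and restore the matching snapshot.
        for name, snapshot in reversed(self.backups):
            if name == label:
                self.primary = copy.deepcopy(snapshot)
                self._replicate()
                return
        raise KeyError(label)


store = ReplicatedStore()
store.write("order-1", "paid")
store.take_backup("nightly")
store.delete("order-1")                      # accidental deletion by an operator
replica_after_mistake = dict(store.replica)  # the replica lost the record too
store.restore("nightly")                     # back to the last known good state
```

After the accidental delete, the replica is as empty as the primary, so failing over would not help. Only the backup brings the record back.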
Resources used for the workload infrastructure are deployed in the recovery region for both strategies and will require additional actions to become production ready.
As required for all active/passive strategies, both require a means to route traffic to the primary Region, and then fail over to the recovery Region when recovering from a disaster.
Business-critical services that have a very low tolerance for disruption
P4 addresses regional service disruptions while simultaneously reducing mitigation costs. Regional disaster recovery (DR) patterns introduce increased complexity due to the synchronization of infrastructure changes across multiple Regions. Testing resilience also becomes significantly more intricate, including the simulation of regional disruptions. However, employing Infrastructure as Code for automated deployments can alleviate these challenges.
In case of disaster, both Pilot Light and Warm Standby strategies offer the capability to limit data loss (RPO). Both offer sufficient RTO performance that enables you to limit downtime. Between these two strategies, you have a choice of optimizing for RTO or for cost.
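The RTO-versus-cost choice can be expressed as a small selection rule. This is only a sketch: the RTO figures (30 and 5 minutes) and the relative cost ranks are assumptions chosen to reflect "tens of minutes" versus "minutes", not published AWS numbers, and `cheapest_strategy` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DrStrategy:
    name: str
    typical_rto_minutes: float  # illustrative figure, not a guarantee
    relative_cost: int          # 1 = cheapest; used for ordering only


# Assumed values for illustration.
STRATEGIES = [
    DrStrategy("pilot_light", typical_rto_minutes=30, relative_cost=1),
    DrStrategy("warm_standby", typical_rto_minutes=5, relative_cost=2),
]


def cheapest_strategy(target_rto_minutes: float) -> Optional[DrStrategy]:
    """Return the lowest-cost strategy whose typical RTO meets the target."""
    candidates = [
        s for s in STRATEGIES if s.typical_rto_minutes <= target_rto_minutes
    ]
    # default=None covers targets that neither strategy can meet.
    return min(candidates, key=lambda s: s.relative_cost, default=None)


if __name__ == "__main__":
    for target in (60, 10, 1):
        choice = cheapest_strategy(target)
        print(target, choice.name if choice else "neither strategy fits")
```

A result of `None` signals that the target is tighter than either active/passive strategy is assumed to deliver, which is the point at which a more expensive pattern would need to be considered.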
P5, multi-region active/active disaster recovery, involves running multiple instances of an application simultaneously across different geographically dispersed sites or regions. In this setup, all sites are actively serving live traffic and handling user requests, providing continuous availability and workload distribution. The multi-region active/active strategy gives you the lowest RTO (recovery time objective) and RPO (recovery point objective). However, this must be weighed against the potential cost and complexity of operating active stacks in multiple sites.
Multi-region active/active disaster recovery involves running identical instances of an application across multiple geographically dispersed sites. Traffic is distributed among these sites through load balancing, and data synchronization ensures consistency. In case of a site failure, failover mechanisms redirect traffic to the available sites, ensuring uninterrupted service, while regular testing validates the effectiveness of the setup.
Applications that have zero tolerance for disruption
P5 addresses the disruption of a regional service by investing additional costs and complexity to achieve a near-zero recovery time objective (RTO). Multi-active deployments, which involve multiple collaborating applications, are generally complex and require asynchronous data replication across regions, impacting data consistency. Operating this pattern requires a high level of process maturity, and it is advisable for customers to progress towards it gradually by initially adopting the deployment patterns described earlier.
Stream.Security Architectural Standards can help you understand how well you have implemented these resilience principles.
For example, Stream.Security can alert on the following:
Ensuring that AWS CloudTrail trails are enabled for all supported AWS regions increases the visibility of the API activity in your AWS account for security and management purposes. Applying CloudTrail trails to all AWS regions has multiple advantages, such as receiving log files from all regions in a single S3 bucket and a single CloudWatch Logs log group, managing trail configuration for all AWS regions from one location, and recording API calls in regions that are not used often in order to detect unusual activity.
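A check along these lines can be sketched against the shape of a CloudTrail `DescribeTrails` response, which lists trails under `trailList` with an `IsMultiRegionTrail` flag. The sample response below is made up, and the helper name `multi_region_trails` is hypothetical; this is not how Stream.Security implements its rule.

```python
def multi_region_trails(describe_trails_response: dict) -> list:
    """Return the names of trails that apply to all regions."""
    trails = describe_trails_response.get("trailList", [])
    return [t["Name"] for t in trails if t.get("IsMultiRegionTrail")]


# Hypothetical response, trimmed to the fields used here.
sample_response = {
    "trailList": [
        {"Name": "org-trail", "HomeRegion": "us-east-1", "IsMultiRegionTrail": True},
        {"Name": "legacy-trail", "HomeRegion": "eu-west-1", "IsMultiRegionTrail": False},
    ]
}

compliant = multi_region_trails(sample_response)
print(compliant or "No multi-region trail found")
```

An empty result would be the condition worth alerting on. Whether a trail is actively logging is a separate status and is not covered by this sketch.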
Ensuring Database Migration Service (DMS) replication instances are using Multi-AZ deployment configurations provides High Availability (HA) through automatic failover to standby replicas in the event of a failure such as an Availability Zone (AZ) outage, an internal hardware or network outage, a software failure or in case of a planned maintenance session.
Enabling Cross-Zone Load Balancing simplifies the deployment and management of applications that operate across multiple subnets located in different Availability Zones (AZs), while also improving fault tolerance and keeping traffic flow consistent. With this feature enabled, each load balancer node distributes requests across the registered targets in all enabled AZs rather than only in its own AZ. Combined with health checks, this lets the load balancer act as a traffic guard in the event of an AZ failure due to a network outage or power loss: it stops routing requests to the unhealthy zone and redirects them to the other available zone(s).
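The effect on traffic distribution is easiest to see with uneven target counts. The sketch below assumes that client traffic is split evenly across one load balancer node per AZ and that every AZ has at least one healthy target; `per_target_share` is an illustrative helper, not an AWS API.

```python
def per_target_share(targets_per_az: dict, cross_zone: bool) -> dict:
    """Return the fraction of total traffic that each target in an AZ receives."""
    if not targets_per_az or any(n < 1 for n in targets_per_az.values()):
        raise ValueError("every AZ needs at least one healthy target")
    if cross_zone:
        # Every node spreads requests over all targets in all AZs.
        total_targets = sum(targets_per_az.values())
        return {az: 1 / total_targets for az in targets_per_az}
    # Each node only reaches targets in its own AZ.
    node_share = 1 / len(targets_per_az)
    return {az: node_share / n for az, n in targets_per_az.items()}


if __name__ == "__main__":
    fleet = {"az-a": 2, "az-b": 8}  # hypothetical target counts
    print("cross-zone off:", per_target_share(fleet, cross_zone=False))
    print("cross-zone on: ", per_target_share(fleet, cross_zone=True))
```

With two targets in one AZ and eight in the other, each target in the smaller AZ handles 25% of traffic when cross-zone is off, against 6.25% for each target in the larger AZ. With cross-zone on, every target handles 10%.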
NAT gateways in each Availability Zone are implemented with redundancy. However, if you have resources in multiple Availability Zones and they share one NAT gateway, resources in the other Availability Zones lose internet access when the NAT gateway's Availability Zone is down. To create an Availability Zone-independent architecture, create a NAT gateway in each Availability Zone and configure your routing so that resources use the NAT gateway in the same Availability Zone. The NAT gateway enables outgoing internet connectivity for a private subnet, so to achieve high availability you need a NAT gateway in every Availability Zone in which you have created private subnets.
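One way to detect a shared NAT gateway is to compare the AZ of each private subnet with the AZ of the NAT gateway its default route points to. The sketch below works on plain dictionaries; the IDs, AZ names, and the helper `cross_az_nat_dependencies` are all hypothetical, and gathering this inventory from your account is left out.

```python
def cross_az_nat_dependencies(
    subnet_az: dict, nat_az: dict, default_route_nat: dict
) -> list:
    """List private subnets whose default route uses a NAT gateway in another AZ.

    subnet_az:         subnet id -> AZ of the private subnet
    nat_az:            NAT gateway id -> AZ of the NAT gateway
    default_route_nat: subnet id -> NAT gateway id used for 0.0.0.0/0
    """
    findings = []
    for subnet, az in sorted(subnet_az.items()):
        nat = default_route_nat.get(subnet)
        if nat is None:
            continue  # no NAT default route, so nothing to compare
        if nat_az.get(nat) != az:
            findings.append((subnet, az, nat, nat_az.get(nat)))
    return findings


# Hypothetical inventory: two private subnets share one NAT gateway.
subnets = {"subnet-a": "us-east-1a", "subnet-b": "us-east-1b"}
nats = {"nat-1": "us-east-1a"}
routes = {"subnet-a": "nat-1", "subnet-b": "nat-1"}

findings = cross_az_nat_dependencies(subnets, nats, routes)
for subnet, az, nat, other_az in findings:
    print(f"{subnet} ({az}) depends on {nat} in {other_az}")
```

Here `subnet-b` is flagged because it would lose outbound connectivity if `us-east-1a` failed. Adding a second NAT gateway in `us-east-1b` and routing `subnet-b` through it clears the finding.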
AWS offers five resilience patterns that provide organizations with a range of options to enhance the reliability and robustness of their cloud architectures. Each pattern comes with its own trade-offs, benefits, and suitability for specific use cases. By understanding these patterns and their characteristics, businesses can make informed decisions on which pattern aligns best with their requirements.
Stream.Security delivers the only cloud detection and response solution that SecOps teams can trust. Born in the cloud, Stream’s Cloud Twin solution enables real-time cloud threat and exposure modeling to accelerate response in today’s highly dynamic cloud enterprise environments. By using the Stream Security platform, SecOps teams gain unparalleled visibility and can pinpoint exposures and threats by understanding the past, present, and future of their cloud infrastructure. The AI-assisted platform helps to determine attack paths and blast radius across all elements of the cloud infrastructure, eliminating gaps and accelerating MTTR by streamlining investigations and reducing knowledge gaps, while maximizing team productivity and limiting burnout.