Simulating AZ Failure
Overview
This repeatable experiment simulates an Availability Zone (AZ) failure, demonstrating the resilience of your application when faced with significant infrastructure disruptions. By leveraging AWS Fault Injection Simulator (FIS) and additional AWS services, we'll test how well your system maintains functionality when an entire AZ becomes unavailable.
Setting up the Experiment
Retrieve the Auto Scaling Group (ASG) name associated with your EKS cluster and create the FIS experiment template to simulate the AZ failure:
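The steps above can be sketched as follows. This is an illustrative sketch, not the workshop's exact script: the cluster tag value (`eks-workshop`) and the template file name (`fis-azfailure-template.json`) are assumptions you should adjust to your environment.

```shell
# Sketch only: tag value and file name below are assumptions, adjust to your setup.

# Find the Auto Scaling Group backing the EKS cluster's managed node group,
# filtering on the eks:cluster-name tag that EKS applies to node group ASGs.
ASG_NAME=$(aws autoscaling describe-auto-scaling-groups \
  --query "AutoScalingGroups[? Tags[? Key=='eks:cluster-name' && Value=='eks-workshop']].AutoScalingGroupName" \
  --output text)
echo "ASG: ${ASG_NAME}"

# Create the FIS experiment template from a JSON definition that targets the
# ASG and one AZ (the JSON file is assumed to exist and reference ${ASG_NAME}).
aws fis create-experiment-template \
  --cli-input-json file://fis-azfailure-template.json
```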
Running the Experiment
Execute the FIS experiment to simulate the AZ failure:
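A minimal sketch of starting the experiment and watching pod placement, assuming `EXPERIMENT_TEMPLATE_ID` is set from the previous step; the 8-minute timeout, `ui` namespace, and 30-second polling interval mirror the workshop's output but are otherwise illustrative:

```shell
# Sketch: EXPERIMENT_TEMPLATE_ID must be set from the created template.
EXPERIMENT_ID=$(aws fis start-experiment \
  --experiment-template-id "${EXPERIMENT_TEMPLATE_ID}" \
  --query 'experiment.id' --output text)
echo "Started experiment ${EXPERIMENT_ID}"

# Watch pod distribution per AZ for ~8 minutes (values are illustrative).
timeout 480s bash -c 'while true; do
  for az in us-west-2a us-west-2b us-west-2c; do
    echo "------${az}------"
    # List nodes in this AZ via the standard topology label.
    for node in $(kubectl get nodes -l topology.kubernetes.io/zone=${az} -o name); do
      name=${node##*/}
      echo "${name}:"
      # Show ui pods scheduled on this node.
      kubectl get pods -n ui --no-headers \
        --field-selector spec.nodeName=${name} 2>&1
    done
  done
  sleep 30
done'
```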
------us-west-2a------
ip-10-42-100-4.us-west-2.compute.internal:
ui-6dfb84cf67-h57sp 1/1 Running 0 12m
ui-6dfb84cf67-h87h8 1/1 Running 0 12m
ip-10-42-111-144.us-west-2.compute.internal:
ui-6dfb84cf67-4xvmc 1/1 Running 0 11m
ui-6dfb84cf67-crl2s 1/1 Running 0 6m23s
------us-west-2b------
ip-10-42-141-243.us-west-2.compute.internal:
No resources found in ui namespace.
ip-10-42-150-255.us-west-2.compute.internal:
No resources found in ui namespace.
------us-west-2c------
ip-10-42-164-250.us-west-2.compute.internal:
ui-6dfb84cf67-fl4hk 1/1 Running 0 11m
ui-6dfb84cf67-mptkw 1/1 Running 0 11m
ui-6dfb84cf67-zxnts 1/1 Running 0 6m27s
ip-10-42-178-108.us-west-2.compute.internal:
ui-6dfb84cf67-8vmcz 1/1 Running 0 6m28s
ui-6dfb84cf67-wknc5 1/1 Running 0 12m
This starts the experiment and monitors the distribution and status of pods across nodes and AZs for 8 minutes, showing the immediate impact of the simulated AZ failure.
During the experiment, you should observe the following sequence of events:
- After about 3 minutes, one AZ will become unavailable.
- The Synthetics Canary alarm will change state to In Alarm.
- Around 4 minutes after the experiment starts, you will see pods reappearing in the other AZs.
- After the experiment completes, at roughly the 7-minute mark, the AZ is marked healthy again, and replacement EC2 instances are launched by an EC2 Auto Scaling action, bringing the number of instances in each AZ back to 2.

Throughout this time, the retail store URL will remain available, demonstrating how resilient EKS is to AZ failures.
To verify node and pod redistribution, you can run:
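For example, the standard kubectl views below show which AZ each node belongs to and where the `ui` pods landed (the `ui` namespace matches the output earlier in this section):

```shell
# Show nodes with their AZ via the standard topology label.
kubectl get nodes -L topology.kubernetes.io/zone

# Show ui pods with the node each one is scheduled on.
kubectl get pods -n ui -o wide
```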
Post-Experiment Verification
After the experiment, verify that your application remains operational despite the simulated AZ failure:
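A sketch of this check, assuming the application is exposed through an Ingress named `ui` in the `ui` namespace (both names are assumptions for this workshop environment):

```shell
# Sketch: ingress name and namespace are assumptions for this environment.
UI_URL=$(kubectl get ingress -n ui ui \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
echo "Waiting for ${UI_URL}..."

# Poll until the load balancer responds with a successful HTTP status.
until curl -fsS -o /dev/null "http://${UI_URL}"; do sleep 5; done
echo "You can now access http://${UI_URL}"
```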
Waiting for k8s-ui-ui-5ddc3ba496-721427594.us-west-2.elb.amazonaws.com...
You can now access http://k8s-ui-ui-5ddc3ba496-721427594.us-west-2.elb.amazonaws.com
This step confirms the effectiveness of your Kubernetes cluster's high availability configuration and its ability to maintain service continuity during significant infrastructure disruptions.
Conclusion
The AZ failure simulation represents a critical test of your EKS cluster's resilience and your application's high availability design. Through this experiment, you've gained valuable insights into:
- The effectiveness of your multi-AZ deployment strategy
- Kubernetes' ability to reschedule pods across remaining healthy AZs
- The impact of an AZ failure on your application's performance and availability
- The efficiency of your monitoring and alerting systems in detecting and responding to major infrastructure issues
Key takeaways from this experiment include:
- The importance of distributing your workload across multiple AZs
- The value of proper resource allocation and pod anti-affinity rules
- The need for robust monitoring and alerting systems that can quickly detect AZ-level issues
- The effectiveness of your disaster recovery and business continuity plans
By regularly conducting such experiments, you can:
- Identify potential weaknesses in your infrastructure and application architecture
- Refine your incident response procedures
- Build confidence in your system's ability to withstand major failures
- Continuously improve your application's resilience and reliability
Remember, true resilience comes not just from surviving such failures, but from maintaining performance and user experience even in the face of significant infrastructure disruptions. Use the insights gained from this experiment to further enhance your application's fault tolerance and ensure seamless operations across all scenarios.