Simulating Partial Node Failure with FIS
AWS Fault Injection Simulator (FIS) Overview
AWS Fault Injection Simulator (FIS) is a fully managed service that enables you to perform controlled fault injection experiments on your AWS workloads. FIS allows you to simulate various failure scenarios, which is crucial for:
- Validating high availability configurations
- Testing auto-scaling and self-healing capabilities
- Identifying potential single points of failure
- Improving incident response procedures
By using FIS, you can:
- Discover hidden bugs and performance bottlenecks
- Observe how your systems behave under stress
- Implement and validate automated recovery procedures
- Conduct repeatable experiments to ensure consistent behavior
In our FIS experiment, we'll simulate a partial node failure in our EKS cluster and observe how our application responds, providing practical insights into building resilient systems.
For more information, see the AWS Fault Injection Simulator documentation.
Experiment Details
This experiment differs from the previous manual node failure simulation in several ways:
- Automated execution: FIS manages the experiment, allowing for more controlled and repeatable tests compared to the manual script execution in the previous experiment.
- Partial failure: Instead of simulating a complete failure of a single node, FIS allows us to simulate a partial failure across multiple nodes. This provides a more nuanced and realistic failure scenario.
- Scale: FIS allows us to target multiple nodes simultaneously. This allows us to test the resilience of our application at a larger scale compared to the single-node failure in the manual experiment.
- Precision: We can specify exact percentages of instances to terminate, giving us fine-grained control over the experiment. This level of control wasn't possible in the manual experiment, where we were limited to terminating entire nodes.
- Minimal disruption: The FIS experiment is designed to maintain service availability throughout the test, whereas the manual node failure might have caused temporary disruptions to the retail store's accessibility.
These differences allow for a more comprehensive and realistic test of our application's resilience to failures, while maintaining better control over the experiment parameters. In this experiment, FIS will terminate 66% of the instances in two node groups, simulating a significant partial failure of our cluster. Like the previous experiments, this experiment is also repeatable.
Creating the Node Failure Experiment
Create a new AWS FIS experiment template to simulate the partial node failure:
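Below is a minimal sketch using the AWS CLI. The node group names, account ID, and IAM role ARN are placeholders for this environment; substitute your own values. EKS managed node groups tag their instances with `eks:nodegroup-name`, which is what the targets match on here:

```bash
# Define an experiment template that terminates 66% of the instances
# in each of two node groups. The tag values and roleArn are assumptions
# for this environment -- replace them with your own.
cat > node-failure-template.json <<'EOF'
{
  "description": "Terminate 66% of instances in two EKS node groups",
  "targets": {
    "NodeGroup1": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "eks:nodegroup-name": "nodegroup-1" },
      "selectionMode": "PERCENT(66)"
    },
    "NodeGroup2": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "eks:nodegroup-name": "nodegroup-2" },
      "selectionMode": "PERCENT(66)"
    }
  },
  "actions": {
    "TerminateNodeGroup1": {
      "actionId": "aws:ec2:terminate-instances",
      "targets": { "Instances": "NodeGroup1" }
    },
    "TerminateNodeGroup2": {
      "actionId": "aws:ec2:terminate-instances",
      "targets": { "Instances": "NodeGroup2" }
    }
  },
  "stopConditions": [{ "source": "none" }],
  "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role"
}
EOF

# Create the template and capture its ID for the next step
export TEMPLATE_ID=$(aws fis create-experiment-template \
  --cli-input-json file://node-failure-template.json \
  --query 'experimentTemplate.id' --output text)
```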
Running the Experiment
Execute the FIS experiment to simulate the node failure and monitor the response:
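A sketch of the start-and-watch step, assuming the retail store UI runs as a `ui` deployment in the `ui` namespace:

```bash
# Start the experiment from the template created above
export EXPERIMENT_ID=$(aws fis start-experiment \
  --experiment-template-id $TEMPLATE_ID \
  --query 'experiment.id' --output text)

# Watch the ui pods for roughly 4 minutes as nodes are terminated,
# pods are rescheduled, and replacement nodes come online
timeout 240 watch -n 10 'kubectl get pods -n ui -o wide'
```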
```
------us-west-2a------
ip-10-42-127-82.us-west-2.compute.internal:
ui-6dfb84cf67-s6kw4 1/1 Running 0 2m16s
ui-6dfb84cf67-vwk4x 1/1 Running 0 4m54s
------us-west-2b------
------us-west-2c------
ip-10-42-180-16.us-west-2.compute.internal:
ui-6dfb84cf67-29xtf 1/1 Running 0 79s
ui-6dfb84cf67-68hbw 1/1 Running 0 79s
ui-6dfb84cf67-plv9f 1/1 Running 0 79s
```
This command triggers the node failure and monitors the pods for 4 minutes, allowing you to observe how the cluster responds to losing a significant portion of its capacity.
During the experiment, you should observe the following:
- After about 1 minute, you'll see one or more nodes disappear from the list, representing the simulated partial node failure.
- Over the next 2 minutes, you'll notice pods being rescheduled and redistributed to the remaining healthy nodes.
- Shortly after, you'll see new nodes coming online to replace the terminated ones.
Unlike the manual node failure simulation, your retail store URL should remain operational throughout the experiment.
To verify nodes and re-balance pods, you can run:
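A minimal sketch, again assuming the `ui` deployment in the `ui` namespace; restarting the deployment lets the scheduler spread pods evenly across the recovered nodes:

```bash
# Confirm the replacement nodes have joined and are Ready
kubectl get nodes

# Restart the ui deployment so its pods are redistributed
# across all healthy nodes
kubectl rollout restart deployment/ui -n ui
kubectl rollout status deployment/ui -n ui --timeout=180s
```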
Verifying Retail Store Availability
Ensure that your retail store application remains operational throughout the partial node failure. Use the following command to check its availability:
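One way to check, assuming the UI is exposed through an ingress named `ui` in the `ui` namespace (adjust the resource name and namespace to your environment):

```bash
# Look up the load balancer hostname for the UI, then poll it
# until it responds successfully
export UI_URL=$(kubectl get ingress -n ui ui \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
echo "Waiting for $UI_URL..."
until curl --silent --fail --output /dev/null "http://$UI_URL"; do
  sleep 10
done
echo "You can now access http://$UI_URL"
```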
```
Waiting for k8s-ui-ui-5ddc3ba496-721427594.us-west-2.elb.amazonaws.com...
You can now access http://k8s-ui-ui-5ddc3ba496-721427594.us-west-2.elb.amazonaws.com
```
The retail store URL may take up to 10 minutes to become fully operational.
Despite the partial node failure, the retail store should continue to serve traffic, demonstrating the resilience of your deployment setup.
Conclusion
This partial node failure simulation using AWS FIS demonstrates several key aspects of your Kubernetes cluster's resilience:
- Automatic detection of node failures by Kubernetes
- Swift rescheduling of pods from failed nodes to healthy ones
- The cluster's ability to maintain service availability during significant infrastructure disruptions
- Auto-scaling capabilities to replace failed nodes
Key takeaways from this experiment:
- The importance of distributing your workload across multiple nodes and availability zones
- The value of having appropriate resource requests and limits set for your pods
- The effectiveness of Kubernetes' self-healing mechanisms
- The need for robust monitoring and alerting systems to detect and respond to node failures
By leveraging AWS FIS for such experiments, you gain several advantages:
- Repeatability: You can run this experiment multiple times to ensure consistent behavior.
- Automation: FIS allows you to schedule regular resilience tests, ensuring your system maintains its fault-tolerant capabilities over time.
- Comprehensive testing: You can create more complex scenarios involving multiple AWS services to test your entire application stack.
- Controlled chaos: FIS provides a safe, managed environment for conducting chaos engineering experiments without risking unintended damage to your production systems.
Regular execution of such experiments helps build confidence in your system's resilience and provides valuable insights for continuous improvement of your architecture and operational procedures.