Test Resiliency Using Application Failure Injection

6.1 Web server failure injection

This experiment uses FIS to simulate a critical failure of the web server process running on the EC2 instances.

In Chaos Engineering we always start with a hypothesis. For this experiment the hypothesis is:

Hypothesis: If the server process on a single instance is killed, then availability will not be impacted.

  1. [Optional] Before starting, view the deployment state machine in the AWS Step Functions console to verify that the deployment has reached the stage where you can start testing:
    • single region: WaitForWebApp shows completed (green)
    • multi region: WaitForWebApp1 shows completed (green)

6.1.1 Create experiment template

  1. Navigate to the FIS console at https://console.aws.amazon.com/fis and click Experiment templates in the left pane.
    • Troubleshooting: If the screen is blank, select the US East (Ohio) region.

  2. Click Create experiment template to define the type of failure you want to inject.

  3. For Description, enter "Experiment template for application resiliency testing". For Name, enter "App-resiliency-testing". For IAM role, select WALab-FIS-role.

  4. Scroll down to Actions and click Add action.

  5. Enter "kill-webserver" for Name. Under Action type, select aws:ssm:send-command/AWSFIS-Run-Kill-Process. Under documentParameters, enter {"ProcessName":"python3","Signal":"SIGKILL"}. For duration, select Minutes and enter 2 in the text box next to it. Click Save.

  6. Scroll down to Targets and click Edit next to Instances-Target-1 (aws:ec2:instance).

  7. Under Target method, select Resource tags and filters. For Selection mode, select Count and enter 1 under Number of resources. This ensures that FIS kills the web server on only one instance.

  8. Scroll down to Resource tags and click Add new tag. Enter "Workshop" for Key and "AWSWellArchitectedReliability300-ResiliencyofEC2RDSandS3" for Value. This is the same tag that is applied to the EC2 instances used in this lab.

  9. For Resource filters, click Add new filter. Enter "State.Name" for Attribute path and "running" for Values. This ensures that FIS targets a running instance. Click Save.

  10. Under Stop condition, you can optionally select CloudWatch alarms that stop the experiment when certain thresholds are met. For this lab, leave this blank.

  11. Click Create experiment template.

  12. In the warning pop-up, confirm that you want to create the experiment template without a stop condition by entering create in the text box. Click Create experiment template.
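
For readers who prefer to script this, the console steps above can be sketched as a Python dictionary shaped like the request that the FIS CreateExperimentTemplate API accepts. This is a minimal sketch, not the lab's official method. Assumptions: the account ID in the example role ARN is a placeholder, the region defaults to us-east-2 (US East (Ohio)), the console Name field is stored as a Name tag, and the function name build_experiment_template is made up for this illustration.

```python
import json
import uuid


def build_experiment_template(role_arn, region="us-east-2", alarm_arn=None):
    """Return keyword arguments that mirror the console steps above."""
    # "none" corresponds to leaving Stop condition blank in the console.
    # Passing a CloudWatch alarm ARN (hypothetical here) adds a stop condition.
    if alarm_arn:
        stop_conditions = [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}]
    else:
        stop_conditions = [{"source": "none"}]

    return {
        "clientToken": str(uuid.uuid4()),
        "description": "Experiment template for application resiliency testing",
        "roleArn": role_arn,
        "stopConditions": stop_conditions,
        # Assumption: the console's Name field is stored as a Name tag.
        "tags": {"Name": "App-resiliency-testing"},
        "targets": {
            "Instances-Target-1": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {
                    "Workshop": "AWSWellArchitectedReliability300-ResiliencyofEC2RDSandS3"
                },
                # Only consider instances that are currently running.
                "filters": [{"path": "State.Name", "values": ["running"]}],
                # Count = 1: affect exactly one matching instance.
                "selectionMode": "COUNT(1)",
            }
        },
        "actions": {
            "kill-webserver": {
                "actionId": "aws:ssm:send-command",
                "parameters": {
                    "documentArn": f"arn:aws:ssm:{region}::document/AWSFIS-Run-Kill-Process",
                    # documentParameters is passed as a JSON string.
                    "documentParameters": json.dumps(
                        {"ProcessName": "python3", "Signal": "SIGKILL"}
                    ),
                    "duration": "PT2M",  # 2 minutes, ISO 8601 duration
                },
                "targets": {"Instances": "Instances-Target-1"},
            }
        },
    }
```

If boto3 is installed and credentials are configured, the result could be passed to boto3.client("fis").create_experiment_template(**template). Building the dictionary separately makes it easy to inspect the request before anything is created.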

6.1.2 Run the experiment

  1. Click Experiment templates in the menu on the left.

  2. Select the experiment template App-resiliency-testing and click Actions. Select Start experiment.

  3. Optionally, add a tag to the experiment.

  4. Click Start experiment.

  5. In the pop-up, type start and click Start experiment.
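
The same run can be started and watched from a script. The sketch below assumes an FIS client object with start_experiment and get_experiment methods, as provided by boto3.client("fis"). The function name run_experiment, the polling interval, and the poll limit are choices made for this illustration, and the template ID you pass in is whatever the console shows for App-resiliency-testing.

```python
import time
import uuid

# Statuses after which an experiment no longer changes state.
TERMINAL_STATES = {"completed", "stopped", "failed"}


def run_experiment(fis_client, template_id, poll_seconds=10, max_polls=60):
    """Start an experiment from a template and poll until it finishes.

    Returns (experiment_id, last_seen_status). If max_polls is exhausted,
    the last status seen is returned even if it is not terminal.
    """
    started = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]
    status = started["experiment"]["state"]["status"]

    for _ in range(max_polls):
        if status in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
        response = fis_client.get_experiment(id=experiment_id)
        status = response["experiment"]["state"]["status"]

    return experiment_id, status
```

Passing the client in as an argument keeps the function easy to try out with a stand-in object before pointing it at a real account.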

6.2 System response to web server failure

The instances launched as part of this lab run simple Python web servers. This experiment uses AWS Systems Manager to run a command on the selected instance. In this workshop, the command is the AWSFIS-Run-Kill-Process document, which kills the named process. When the experiment runs, the python3 web server process is killed on one of the instances, and that instance can no longer handle requests. Watch how the service responds, and note how AWS services help maintain availability. Check whether there is any loss of availability and, if so, how long it lasts.
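
To get a feel for what the kill action does to the process, the following local sketch starts a throwaway Python child process and sends it SIGKILL. It does not touch AWS and is not part of the lab. It assumes a POSIX system (Linux or macOS), and the function name is made up for this illustration. SIGKILL cannot be caught or ignored, so the process gets no chance to shut down cleanly.

```python
import signal
import subprocess
import sys


def kill_child_with_sigkill():
    """Start a sleeping Python child, SIGKILL it, and return its exit code."""
    # Stand-in for the web server: a child process that would otherwise
    # keep running for a minute.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    child.send_signal(signal.SIGKILL)
    # On POSIX, a process ended by a signal reports the negative signal number.
    return child.wait(timeout=10)


if __name__ == "__main__":
    print("exit code:", kill_child_with_sigkill())
```

Because the process ends immediately, recovery has to come from outside it, which is what the rest of this section observes.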

6.2.1 System availability

Refresh the service website several times. Note the following:

  • Website remains available
  • The remaining two EC2 instances are handling all the requests (as per the displayed instance_id)
  • Also note the availability_zone value when you refresh. You can see that requests are being handled by the EC2 instances in only two Availability Zones, while the EC2 instance in the third zone is being replaced
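
Refreshing by hand can also be approximated with a small polling script. This is a minimal sketch using only the Python standard library. The helper names probe and measure_availability are made up for this illustration, and the URL you pass to probe is your own service website address. It reports the fraction of successful checks and the longest run of consecutive failures.

```python
import time
import urllib.request


def probe(url, timeout=5):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers URLError, HTTPError, refused connections, and timeouts.
        return False


def measure_availability(check, attempts=60, interval_seconds=1.0):
    """Call check() repeatedly.

    Returns (fraction of successful checks, longest run of consecutive failures).
    """
    successes = 0
    longest_outage = 0
    current_outage = 0
    for _ in range(attempts):
        if check():
            successes += 1
            current_outage = 0
        else:
            current_outage += 1
            longest_outage = max(longest_outage, current_outage)
        time.sleep(interval_seconds)
    return successes / attempts, longest_outage
```

A typical call would be measure_availability(lambda: probe(website_url)), started shortly before the experiment. If the hypothesis holds, the fraction stays at 1.0 and the longest outage is 0.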

This can also be verified by viewing the canary run data.

  • Go to the AWS CloudFormation console at https://console.aws.amazon.com/cloudformation
  • Click the WebServersforResiliencyTesting stack
  • Click the Outputs tab
  • Open the URL for WorkloadAvailability in a new window
  • Canary runs continue to succeed, confirming that the website is available

Load balancing and Auto Scaling work here much the way they did for the EC2 failure injection experiment.

6.3 Web server failure injection - conclusion

In this section, you simulated an application-level failure in which the web server process running the application was killed using FIS and SSM. Although there was no infrastructure failure, your workload was able to detect and correct the issue by replacing the EC2 instance. Deploying multiple servers behind Elastic Load Balancing enables a service to withstand the loss of a server with no disruption to availability, because user traffic is automatically routed to the healthy servers. Amazon EC2 Auto Scaling ensures that unhealthy hosts are removed and replaced with healthy ones to maintain high availability.

Our hypothesis is confirmed:

Hypothesis: If the server process on a single instance is killed, then availability will not be impacted.