Test Resiliency Using Availability Zone (AZ) Failure Injection

7.1 AZ failure injection

This failure injection will simulate a critical problem with one of the three AWS Availability Zones (AZs) used by your service. AWS Availability Zones are powerful tools for helping build highly available applications. If an application is partitioned across AZs, companies are better isolated and protected from issues such as lightning strikes, tornadoes, earthquakes and more.

In Chaos Engineering we always start with a hypothesis. For this experiment the hypothesis is:

Hypothesis: If an entire Availability Zone dies, then availability will not be impacted

  1. Go to the RDS Dashboard in the AWS Console at http://console.aws.amazon.com/rds and note which Availability Zone the AWS RDS primary DB instance is in.

    • Note: If you previously ran the RDS Failure Injection test, you must wait until the console shows the AZs for the primary and standby instances as swapped, before running this test
    • A good way to run the AZ failure injection is first in an AZ other than this - we’ll call this Scenario 1
    • Then try it again in the same AZ as the AWS RDS primary DB instance - we’ll call this Scenario 2
    • Taking down two out of the three AZs this way is an unlikely use case, however it will show how AWS systems work to maintain service integrity despite extreme circumstances.
    • And executing this way illustrates the impact and response under the two different scenarios.
  2. To simulate failure of an AZ, select one of the Availability Zones used by your service (us-east-2a, us-east-2b, or us-east-2c) as <az>

    • For scenario 1 select an AZ that is neither primary nor secondary for your RDS DB instance. Given the following RDS console you would choose us-east-2c
    • For scenario 2 select the AZ that is primary for your RDS DB instance. Given the following RDS console you would choose us-east-2b

    DBConfigurationShort

  3. use your VPC ID as <vpc-id>

  4. Select one (and only one) of the scripts/programs below. (choose the language that you setup your environment for).

    LanguageCommand
    Bash./fail_az.sh <az> <vpc-id>
    Pythonpython3 fail_az.py <vpc-id> <az>
    Javajava -jar app-resiliency-1.0.jar AZ <vpc-id> <az>
    C#.\AppResiliency AZ <vpc-id> <az>
    PowerShell.\fail_az.ps1 <az> <vpc-id>
  5. The specific output will vary based on the command used.

    • Note whether an RDS failover was initiated. This would be the case if you selected the AZ containing the AWS RDS primary DB instance

7.2 System response to AZ failure

Watch how the service responds. Note how AWS systems help maintain service availability. Test if there is any non-availability, and if so then how long.

7.2.1 System availability

Refresh the service website several times

  • Scenario 1: If you selected an AZ not containing the AWS RDS primary DB instance then you should see uninterrupted availability
  • Scenario 2: If you selected the AZ containing the AWS RDS primary DB instance, then an availability loss similar to what you saw with RDS fault injection testing will occur.

Verify your observations by going through the canary run data:

  • Go to the AWS CloudFormation console at https://console.aws.amazon.com/cloudformation
  • click on the WebServersforResiliencyTesting stack
  • click on the Outputs tab
  • Open the URL for WorkloadAvailability in a new window
  • You will see that canary runs are failing because the website is not available.

7.2.2 Scenario 1 - Load balancer and web server tiers

This scenario is similar to the EC2 failure injection test because there is only one EC2 server per AZ in our architecture. Look at the same screens as you did before, for that test:

One difference from the EC2 failure test that you will observe is that auto scaling will not replace the EC2 instance in the same AZ as the one that was terminated. Auto scaling attempts to balance the requested three EC2 instances across the remaining two AZs.

7.2.3 Scenario 2 - Load balancer, web server, and data tiers

This scenario is similar to a combination of the RDS failure injection along with EC2 failure injection. In addition to the EC2 related screens look at the Amazon RDS console, navigate to your DB screen and observe the following tabs:

  • Configuration
  • Monitoring
  • Logs & Events

7.2.4 AZ failure injection - conclusion

This similarity between scenario 1 and the EC2 failure test, and between scenario 2 and the RDS failure test is illustrative of how an AZ failure impacts your system. The resources in that AZ will have no or limited availability. With the strong partitioning and isolation between Availability Zones however, resources in the other AZs continue to provide your service with needed functionality. Scenario 1 results in loss of the load balancer and web server capabilities in one AZ, while Scenario 2 adds to that the additional loss of the data tier. By ensuring that every tier of your system is in multiple AZs, you create a partitioned architecture resilient to failure.

Our hypothesis is confirmed:

Hypothesis: If an entire Availability Zone dies, then availability will not be impacted

7.2.5 AZ failure recovery

This step is optional. To simulate the AZ returning to health do the following:

  1. Go to the Auto Scaling Group console
  2. Select the WebServersforResiliencyTesting auto scaling group
  3. Actions » Edit
  4. In the Subnet field add any ResiliencyVPC-PrivateSubnets that are missing (there should be three total) and Save
  5. Go to the Network ACL console
  6. Look at the NACL entries for the VPC called ResiliencyVPC
  7. For any of these NACLs that are not Default do the following
    1. Select the NACL
    2. Actions » Edit subnet associations
    3. Uncheck all boxes and click Edit
    4. Actions » Delete network ACL
  • Note how the auto scaling redistributes the EC2 servers across the availability zones