This failure injection will simulate a critical failure of the Amazon RDS DB instance.
Before starting, view the deployment machine in the AWS Step Functions console to verify the deployment has reached the stage where you can start testing:
WaitForMultiAZDB shows completed (green)
CheckRDSRRStatus1 shows completed (green)
Before you initiate the failure simulation, refresh the service website several times. Every time the image is loaded, the website writes a record to the Amazon RDS database.
Click the link labeled "click here to go to other page" to see the latest ten entries in the Amazon RDS database.
Go to the RDS Dashboard in the AWS Console at http://console.aws.amazon.com/rds
From the RDS dashboard
Look at the configured values. Note the following:
To fail over the RDS instance, use the VPC ID as the command line argument, replacing <vpc-id> in one (and only one) of the scripts/programs below. Choose the language that you set up your environment for.
The specific output will vary based on the command used, but it will include some indication that your Amazon RDS database is being failed over:
Failing over mdk29lg78789zt
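For orientation, a forced failover can be sketched in Python as shown here. This is a minimal sketch, not the lab's own script: the function names are hypothetical, and it assumes an RDS client object such as the one returned by `boto3.client("rds")`, whose `describe_db_instances` and `reboot_db_instance` calls it relies on. It also ignores pagination, so it only inspects the first page of DB instances.

```python
def find_multi_az_instance(rds_client, vpc_id):
    """Return the ID of the first Multi-AZ DB instance in vpc_id, or None.

    Hypothetical helper: looks only at the first page of results.
    """
    response = rds_client.describe_db_instances()
    for db in response["DBInstances"]:
        subnet_group = db.get("DBSubnetGroup", {})
        # Forced failover only applies to Multi-AZ instances.
        if subnet_group.get("VpcId") == vpc_id and db.get("MultiAZ"):
            return db["DBInstanceIdentifier"]
    return None


def fail_over(rds_client, vpc_id):
    """Reboot the Multi-AZ instance in vpc_id with a forced failover."""
    db_id = find_multi_az_instance(rds_client, vpc_id)
    if db_id is None:
        raise LookupError(f"No Multi-AZ DB instance found in {vpc_id}")
    print(f"Failing over {db_id}")
    # ForceFailover=True makes the standby take over as primary.
    rds_client.reboot_db_instance(DBInstanceIdentifier=db_id, ForceFailover=True)
    return db_id
```

With boto3 installed and credentials configured, you would call `fail_over(boto3.client("rds"), "<vpc-id>")`, substituting your own VPC ID.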
Watch how the service responds. Note how AWS systems help maintain service availability. Check whether the service becomes unavailable and, if so, for how long.
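If you would rather measure the outage than refresh the browser by hand, a small poller can do it. The sketch below is an illustration only: the function names, the five-second interval, and the rule that anything other than an HTTP 200 counts as "down" are assumptions, and the URL is whatever your service website address is.

```python
import http.client
import time
import urllib.request


def probe(url, timeout=5.0):
    """Return True if url answers with HTTP 200 within timeout seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def longest_outage(samples):
    """Longest gap in seconds between a first failed probe and the next success.

    samples is a time-ordered list of (timestamp_seconds, is_up) pairs.
    An outage still in progress at the end of the list is not counted.
    """
    longest = 0.0
    outage_start = None
    for timestamp, is_up in samples:
        if not is_up and outage_start is None:
            outage_start = timestamp
        elif is_up and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    return longest


def watch(url, duration=600.0, interval=5.0):
    """Probe url every interval seconds for duration seconds."""
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), probe(url)))
        time.sleep(interval)
    return longest_outage(samples)
```

Start `watch` with your website URL just before you run the failover script. The measured outage is only as precise as the polling interval.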
The website is not available. Some errors you might see reported:
Continue on to the next steps, periodically returning to attempt to refresh the website.
Refresh and note the values of the Info field. It will ultimately return to Available when the failover is complete.
Note the AZs for the primary and standby instances. They have swapped, as the standby has now taken over primary responsibility and the former primary has been restarted. (After RDS failover it can take several minutes for the console to update as shown below, even though the failover itself has already completed.)
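The same information can be read from the API rather than the console. The following is a minimal sketch, assuming an RDS client such as `boto3.client("rds")`. The helper names are hypothetical; `DBInstanceStatus`, `AvailabilityZone` and `SecondaryAvailabilityZone` are fields of the `describe_db_instances` response.

```python
def placement(rds_client, db_instance_id):
    """Return the status and AZ placement of one DB instance."""
    response = rds_client.describe_db_instances(DBInstanceIdentifier=db_instance_id)
    db = response["DBInstances"][0]
    return {
        "status": db["DBInstanceStatus"],
        "primary_az": db["AvailabilityZone"],
        # Only present for Multi-AZ deployments.
        "standby_az": db.get("SecondaryAvailabilityZone"),
    }


def azs_swapped(before, after):
    """True if primary and standby AZs traded places between two snapshots."""
    return (
        before["primary_az"] == after["standby_az"]
        and before["standby_az"] == after["primary_az"]
    )
```

Call `placement` once before the failover and again after the status returns to available, then compare the two results with `azs_swapped`.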
From the AWS RDS console, click on the Logs & events tab and scroll down to Recent events. You should see entries like those below. In this case failover took less than a minute.
Mon, 14 Oct 2019 19:53:37 GMT - Multi-AZ instance failover started.
Mon, 14 Oct 2019 19:53:45 GMT - DB instance restarted
Mon, 14 Oct 2019 19:54:21 GMT - Multi-AZ instance failover completed
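To put a number on "less than a minute", subtract the "failover started" timestamp from the "failover completed" timestamp. The sketch below does this for the three events listed above. The function name is hypothetical, and matching events by message text is an assumption based on those sample entries.

```python
from email.utils import parsedate_to_datetime

# The sample events from the console, as (timestamp, message) pairs.
EVENTS = [
    ("Mon, 14 Oct 2019 19:53:37 GMT", "Multi-AZ instance failover started."),
    ("Mon, 14 Oct 2019 19:53:45 GMT", "DB instance restarted"),
    ("Mon, 14 Oct 2019 19:54:21 GMT", "Multi-AZ instance failover completed"),
]


def failover_seconds(events):
    """Seconds between the 'failover started' and 'failover completed' events."""
    started = completed = None
    for stamp, message in events:
        if "failover started" in message:
            started = parsedate_to_datetime(stamp)
        elif "failover completed" in message:
            completed = parsedate_to_datetime(stamp)
    if started is None or completed is None:
        raise ValueError("events must include failover start and completion")
    return (completed - started).total_seconds()


print(failover_seconds(EVENTS))  # 44.0
```

For these sample events the failover took 44 seconds.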
From the AWS RDS console, click on the Monitoring tab and look at DB connections.
As the failover happens, none of the three existing servers can connect to the DB.
AWS Auto Scaling detects this (any server not returning an HTTP 200 status is deemed unhealthy) and replaces the three EC2 instances with new ones that establish new connections to the new RDS primary instance.
The graph shows an unavailability period of about four minutes until at least one DB connection is re-established.
In this section you reduce the unavailability time from four minutes to under one minute.
This part of the RDS failure simulation is optional. If you are running this lab as part of a live workshop, then you may want to skip this and come back to it later.
You observed before that failover of the RDS instance itself takes under one minute. However, the servers you are running are configured such that they cannot recognize that the IP address for the RDS instance DNS name has changed from the primary to the standby. Availability is only regained once the servers fail to reach the primary, are marked unhealthy, and are then replaced. This accounts for the four-minute delay. In this part of the lab you will update the server code to be more resilient to RDS failover. The new code can recognize underlying changes in the IP address for the RDS instance DNS name.
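One common way to get this behavior is to drop a failed connection and open a fresh one by hostname, so that the RDS endpoint name is resolved again and can point at the new primary. The sketch below illustrates that idea and is not the lab's updated server code. `ReconnectingClient`, its parameters, and the `connect` callable are hypothetical, and the exception types to retry on depend on your database driver.

```python
import time


class ReconnectingClient:
    """Run operations on a connection, reconnecting by hostname on failure."""

    def __init__(self, connect, retries=3, delay=1.0,
                 retry_on=(OSError,), sleep=time.sleep):
        # connect: zero-argument callable that opens a NEW connection using
        # the DB endpoint hostname (not a cached IP address).
        self._connect = connect
        self._retries = retries
        self._delay = delay
        self._retry_on = retry_on  # driver-specific errors can be added here
        self._sleep = sleep
        self._conn = None

    def _drop(self):
        """Discard the current connection so the next attempt reconnects."""
        close = getattr(self._conn, "close", None)
        if callable(close):
            try:
                close()
            except Exception:
                pass  # the connection is already broken; nothing to salvage
        self._conn = None

    def run(self, operation):
        """Call operation(connection), retrying on a fresh connection."""
        last_error = None
        for attempt in range(self._retries + 1):
            try:
                if self._conn is None:
                    self._conn = self._connect()
                return operation(self._conn)
            except self._retry_on as error:
                last_error = error
                self._drop()
                if attempt < self._retries:
                    self._sleep(self._delay)
        raise last_error
```

With a client like this, a server would ride through the failover by retrying for a few seconds instead of failing its health check and waiting to be replaced.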
Use either the Express Steps or Detailed Steps below:
Now repeat the RDS failure injection steps on this page, starting with 5.1 RDS failure injection.
Learn more: After the lab see High Availability (Multi-AZ) for Amazon RDS for more details on high availability and failover support for DB instances using Multi-AZ deployments.
High Availability (Multi-AZ) for Amazon RDS
The primary DB instance switches over automatically to the standby replica if any of the following conditions occur:
- An Availability Zone outage
- The primary DB instance fails
- The DB instance’s server type is changed
- The operating system of the DB instance is undergoing software patching
- A manual failover of the DB instance was initiated using Reboot with failover