Now that dependency monitoring is in place using CloudWatch Metrics and CloudWatch alarms, the last piece of the “puzzle” is to track events related to the external service effectively so that relevant stakeholders are aware of the status of resolution. Alarms and notifications are good for alerting teams to potential issues; tracking the event itself, however, helps coordinate the efforts towards resolution. AWS Systems Manager OpsCenter can be used to achieve this. An OpsItem can be created to track an event and show its current status at a glance, and can help answer questions such as: What is the severity of the event? What resources are affected? What is the status of the event? Are there other, similar events?
Automating the creation of an OpsItem, coupled with alarms and notifications, allows teams to triage events quickly and leads to faster, more organized resolution.
This process can be automated by using a Lambda function to create an OpsItem every time the dependency alarm goes into an In alarm state.
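As a rough illustration of what such a Lambda function might look like, the Python sketch below parses the CloudWatch alarm notification that SNS delivers and turns it into the arguments for the Systems Manager `create_ops_item` call. It is not the function used in this lab: the `BUCKET_ARN` environment variable, the `Source` value, the severity, and the category are assumptions chosen for the example, and only the title is taken from this walkthrough.

```python
import json
import os


def build_ops_item_params(sns_event, bucket_arn):
    """Build keyword arguments for ssm.create_ops_item from an SNS-delivered alarm.

    Returns None when the alarm is not in the ALARM state, so that OK or
    INSUFFICIENT_DATA notifications do not create OpsItems.
    """
    # SNS wraps the CloudWatch alarm notification as a JSON string.
    alarm = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    if alarm.get("NewStateValue") != "ALARM":
        return None

    reason = alarm.get("NewStateReason", "Dependency alarm triggered")
    return {
        "Title": "S3 Data Writes failing",
        "Description": reason[:2048],       # keep within the description length limit
        "Source": "DependencyAlarm",        # assumed label, not defined by the lab
        "Severity": "2",                    # assumed severity
        "Category": "Availability",         # assumed category
        "OperationalData": {
            # The reserved /aws/resources key populates "Related resources".
            "/aws/resources": {
                "Value": json.dumps([{"arn": bucket_arn}]),
                "Type": "SearchableString",
            },
            "alarmName": {
                "Value": alarm.get("AlarmName", "unknown"),
                "Type": "SearchableString",
            },
        },
    }


def lambda_handler(event, context):
    # boto3 is imported here so the helper above can be exercised without the AWS SDK.
    import boto3

    # BUCKET_ARN is a hypothetical environment variable for this sketch.
    params = build_ops_item_params(event, os.environ["BUCKET_ARN"])
    if params is None:
        return {"created": False}
    response = boto3.client("ssm").create_ops_item(**params)
    return {"created": True, "opsItemId": response["OpsItemId"]}
```

Keeping the parameter construction separate from the AWS call makes it possible to check the OpsItem contents locally before deploying the function. The function's execution role would also need permission to call `ssm:CreateOpsItem`.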
Go to the Amazon SNS console at https://console.aws.amazon.com/sns/v3 and click on Topics
Click on the SNS Topic that was created as part of this lab -
Scroll down to the Subscriptions section and click on Create subscription
On the Create subscription page, make the following changes:
Click on Create subscription
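The console steps above can also be scripted. The sketch below assumes the subscription uses the AWS Lambda protocol with the OpsItem-creating function as its endpoint; the topic and function ARNs, the statement ID, and the helper name are placeholders for illustration. Unlike the console, the API does not grant SNS permission to invoke the function automatically, so the sketch adds that permission first.

```python
def subscribe_lambda_to_topic(sns_client, lambda_client, topic_arn, function_arn):
    """Allow an SNS topic to invoke a Lambda function, then subscribe the function."""
    # Grant the SNS service permission to invoke the function for this topic only.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId="AllowInvokeFromDependencyTopic",  # hypothetical statement ID
        Action="lambda:InvokeFunction",
        Principal="sns.amazonaws.com",
        SourceArn=topic_arn,
    )
    # Create the subscription; Lambda endpoints need no confirmation step.
    response = sns_client.subscribe(
        TopicArn=topic_arn,
        Protocol="lambda",
        Endpoint=function_arn,
        ReturnSubscriptionArn=True,
    )
    return response["SubscriptionArn"]


# Example usage (requires boto3 and AWS credentials; both ARNs are placeholders):
#   import boto3
#   subscribe_lambda_to_topic(
#       boto3.client("sns"),
#       boto3.client("lambda"),
#       "arn:aws:sns:us-east-1:123456789012:example-topic",
#       "arn:aws:lambda:us-east-1:123456789012:function:example-function",
#   )
```

Running the helper twice with the same statement ID causes `add_permission` to raise a conflict error, so a real script would either catch that error or use a unique ID.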
To test this, follow the instructions in the previous section on testing a fail condition by deleting the default route. This time, when the alarm goes into an In alarm state, an OpsItem will be created in OpsCenter, in addition to the notification being sent to the email address specified.
Go to the AWS Systems Manager console at https://console.aws.amazon.com/systems-manager and click on OpsCenter
Click on the OpsItems tab, search by Title, select contains, and enter the value as
S3 Data Writes
Click on the OpsItem that has been created with the title S3 Data Writes failing
Expand the OpsItem details section by clicking on the triangle next to it, and view the information available there such as severity, category, etc.
Scroll down to see the Related resources section, which in this case lists the S3 bucket to which the writes are failing
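The same lookup can be done outside the console. The following sketch mirrors the "Title contains" search above using the Systems Manager `describe_ops_items` API; the helper name and the fields it returns are choices made for this example, and it assumes a boto3 SSM client with credentials for the account used in the lab.

```python
def find_ops_items_by_title(ssm_client, title_fragment):
    """Return summaries of OpsItems whose title contains the given text."""
    request = {
        "OpsItemFilters": [
            {"Key": "Title", "Values": [title_fragment], "Operator": "Contains"}
        ]
    }
    found = []
    while True:
        page = ssm_client.describe_ops_items(**request)
        for summary in page.get("OpsItemSummaries", []):
            found.append(
                {
                    "id": summary.get("OpsItemId"),
                    "title": summary.get("Title"),
                    "status": summary.get("Status"),
                    "severity": summary.get("Severity"),
                    "category": summary.get("Category"),
                }
            )
        # Keep requesting pages until the service stops returning a token.
        token = page.get("NextToken")
        if not token:
            return found
        request["NextToken"] = token


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   for item in find_ops_items_by_title(boto3.client("ssm"), "S3 Data Writes"):
#       print(item)
```

Passing the client in as an argument keeps the helper easy to test with a stub and lets the caller decide which Region and credentials to use.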
The event can now be tracked efficiently using the OpsItem, and remediation work can be better coordinated. Additionally, you can choose to execute the pre-created runbooks listed under the Runbooks section to automate the remediation. You can also create custom runbooks depending on the type of event.