How to Protect Your Application Against an AWS Outage
Early in the morning on November 25th, 2020, the day before Thanksgiving, reports started circulating of issues with popular consumer applications and sites. What began as intermittent availability problems soon escalated to full-blown unavailability. We turned to Amazon CloudWatch, a monitoring and management service that provides data and actionable insights for applications and resources. But CloudWatch wasn’t loading. And then the realization hit: this was not an issue we could resolve. Businesses around the world were, like Gavant, scrambling to uncover the root cause. That cause was an AWS outage in Amazon’s most popular region, US-EAST-1. Not long after, AWS confirmed as much, reporting an issue with an underlying service that crippled part of the US-EAST-1 region. I will not bore you with the full details; suffice to say, the issues were not fully resolved until the following morning.
Coping with an AWS Outage
There was a feeling of helplessness. Applications that purely ran in US-EAST-1 were at a standstill. Even applications that were built to withstand an availability-zone outage could not function. Suddenly, it was no longer a theoretical consideration to protect against. And everyone was asking the same questions: How do we avoid this in the future? How can we ensure that an AWS outage in a region will not bring our application down? Can we have a recovery plan to bring up the application in another region in minutes?
In this article we will explore ways to prepare for a major AWS outage and what you can do to ensure your data is backed up and always available. It’s not just about data, but also about design. If your application isn’t built to run in a stateless manner and you can’t quickly spin up a new environment, having your data won’t suffice in the short term. With some basic practices in place, you’ll have the option to quickly swap to another region.
Architecture
One of the most important parts of a dynamically scalable and deployable application is ensuring the code is stateless. What this means is that the code that handles your main logic doesn’t store any data in it. Leave that to the data tiers, like the database or an S3 bucket. If the application code stores data on the machine it’s running on, then the machine becomes stateful. And that means it needs the same considerations as a database, making backups and deployments unnecessarily complicated.
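To make the idea concrete, here is a minimal Python sketch of a stateless upload handler. The `ObjectStore` interface, the `InMemoryStore` stand-in and the `handle_upload` function are all hypothetical names invented for this illustration; in a real application the store would wrap something like an S3 bucket.

```python
from typing import Dict, Protocol


class ObjectStore(Protocol):
    """Anything that can persist bytes under a key (hypothetical interface)."""

    def put(self, key: str, data: bytes) -> None: ...


class InMemoryStore:
    """Illustration-only stand-in for a real data tier such as an S3 bucket."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data


def handle_upload(store: ObjectStore, user_id: str, filename: str, data: bytes) -> str:
    """Persist an upload in the data tier and return its key.

    Nothing is written to local disk or kept in module-level variables,
    so any instance, in any region, can serve the next request.
    """
    key = f"uploads/{user_id}/{filename}"
    store.put(key, data)
    return key
```

Because the handler receives its store as an argument, the machine running it holds nothing worth backing up and can be replaced at will.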
Infrastructure as Code
The next key piece of a quick changeover is ensuring the infrastructure can be quickly “spun up” in another region. The last thing you want is to set up all the different services by hand in a high-stress situation. This can lead to costly mistakes and security vulnerabilities. By using CloudFormation, not only will you be able to spin an environment up quickly, but you will also know that it was built from the same template as the original. The template contains all the little pieces of configuration that make your application work, which ensures no important component is forgotten.
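As a rough sketch of what that looks like in practice, the Python below defines a deliberately tiny CloudFormation template (a single S3 bucket) and a helper that creates a stack from it. The template contents, the bucket naming scheme and the `deploy_stack` helper are assumptions made for illustration; `cfn_client` is assumed to be a boto3 CloudFormation client bound to the region you want to deploy into.

```python
import json

# Illustrative template: one S3 bucket whose name includes the region,
# so the same template can be deployed to several regions without clashes.
TEMPLATE = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Parameters": {"Environment": {"Type": "String"}},
    "Resources": {
        "UploadsBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "BucketName": {
                    "Fn::Sub": "example-uploads-${Environment}-${AWS::Region}"
                }
            },
        }
    },
}


def deploy_stack(cfn_client, stack_name: str, environment: str) -> dict:
    """Create the stack in whichever region cfn_client is bound to."""
    return cfn_client.create_stack(
        StackName=stack_name,
        TemplateBody=json.dumps(TEMPLATE),
        Parameters=[
            {"ParameterKey": "Environment", "ParameterValue": environment}
        ],
    )
```

With boto3 installed, failing over would then be a matter of calling `deploy_stack(boto3.client("cloudformation", region_name="us-west-2"), "my-app", "prod")`, where the stack name and region are placeholders.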
Database
The most important piece of the application is the database. Even if you can deploy your stateless application to another region, it won’t matter if the data isn’t available in that region. There are a couple of ways to approach this: a good way and the right way.
The good way is simple and cheap, and it can be serviceable if you don’t need real-time data. With database snapshots, you can automatically copy a nightly snapshot of your DB to another region. This is a cheap and easy way to back up your data in another region, but the copy is only as current as your last nightly snapshot. If users spent a whole day generating new data, it would be lost in a cut-over. Not ideal. Still, it’s better than starting with nothing.
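One way to script the snapshot copy is sketched below in Python. The `copy_snapshot_to_region` helper is a hypothetical name; `rds_client` is assumed to be a boto3 RDS client created in the destination region, and the snapshot identifiers are placeholders you would supply.

```python
def copy_snapshot_to_region(
    rds_client, snapshot_arn: str, target_id: str, source_region: str
) -> dict:
    """Copy an RDS snapshot into the region rds_client is bound to.

    For a cross-region copy the source snapshot is identified by its ARN.
    Encrypted snapshots additionally need a KmsKeyId valid in the
    destination region, which this sketch leaves out.
    """
    return rds_client.copy_db_snapshot(
        SourceDBSnapshotIdentifier=snapshot_arn,
        TargetDBSnapshotIdentifier=target_id,
        SourceRegion=source_region,
        CopyTags=True,
    )
```

Running this on a nightly schedule, for example from a scheduled Lambda function, keeps a copy that is at most about a day old in the second region.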
The right way is to use Relational Database Service (RDS) read replicas. In AWS you can create a read replica in another region. As the main database is updated, the changes are replicated, in near real time, to the replica in the other region. Perfect! Now if the primary region becomes unavailable, you can promote the replica to a standalone, writable database and spin up the infrastructure in the region where a copy of the data already resides.
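The two halves of this approach, creating the replica ahead of time and promoting it during a failover, might look roughly like the following Python. Both helper names are hypothetical, and `rds_client` is assumed to be a boto3 RDS client created in the failover region.

```python
def create_cross_region_replica(
    rds_client, source_instance_arn: str, replica_id: str, source_region: str
) -> dict:
    """Create a read replica in the region rds_client is bound to.

    The source instance lives in another region, so it is referenced by ARN.
    """
    return rds_client.create_db_instance_read_replica(
        DBInstanceIdentifier=replica_id,
        SourceDBInstanceIdentifier=source_instance_arn,
        SourceRegion=source_region,
    )


def promote_replica(rds_client, replica_id: str) -> dict:
    """Turn the replica into a standalone, writable instance during failover."""
    return rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)
```

Promotion is a one-way step: once promoted, the instance stops receiving changes from the original database, so it is something to do only when you have decided to cut over.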
EC2
An EC2 instance is made up of two parts: the image and the volumes. The image is the configuration of your machine. The volumes are the underlying data store. When you create an Amazon Machine Image (AMI), it includes at least one volume. So, if you want to back up your entire machine, you will want to create an AMI. If you’re looking to back up only the data, you’ll instead create a snapshot of the specific volume. In either scenario you can set up an Amazon Data Lifecycle Manager policy that takes backups on a schedule and copies them to another region. The lifecycle manager is fully configurable, with options like when to perform the backup, how many backups to keep, and which tags to keep. Check out the Data Lifecycle Manager documentation for the full list of options.
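For a sense of what such a policy contains, here is a hedged Python sketch that builds the policy details for a daily volume snapshot with a cross-region copy. The function name, schedule name, times and retention values are all illustrative assumptions, and the field names reflect my understanding of the Data Lifecycle Manager API, so verify them against the current documentation before relying on them.

```python
def build_dlm_policy_details(target_region: str, tag_key: str, tag_value: str) -> dict:
    """Policy details: snapshot tagged volumes daily and copy them elsewhere."""
    return {
        "ResourceTypes": ["VOLUME"],
        # Only volumes carrying this tag are backed up.
        "TargetTags": [{"Key": tag_key, "Value": tag_value}],
        "Schedules": [
            {
                "Name": "daily-with-cross-region-copy",
                "CopyTags": True,
                # When to perform the backup (03:00 UTC, every 24 hours).
                "CreateRule": {"Interval": 24, "IntervalUnit": "HOURS", "Times": ["03:00"]},
                # How many backups to keep in the source region.
                "RetainRule": {"Count": 7},
                "CrossRegionCopyRules": [
                    {
                        "Target": target_region,
                        "Encrypted": False,
                        "CopyTags": True,
                        # How long to keep each copy in the target region.
                        "RetainRule": {"Interval": 7, "IntervalUnit": "DAYS"},
                    }
                ],
            }
        ],
    }
```

With boto3, the resulting dictionary would be passed as `PolicyDetails` to the DLM client’s `create_lifecycle_policy` call, together with an execution role ARN, a description and a state such as `"ENABLED"`.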
S3
Many applications store user-uploaded data or reference files in S3. Fortunately, AWS really simplifies this. Navigate to your bucket in the AWS Console and configure a replication rule to automatically copy the bucket’s contents to a bucket in another region. Replication requires versioning to be enabled on both buckets. Done. As you add files to the bucket, they’re copied asynchronously to the other bucket, typically within minutes. Super simple. Super easy. It’s a no-brainer to enable.
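If you would rather script the rule than click through the console, a minimal Python sketch follows. The helper names and the rule ID are made up for illustration; `s3_client` is assumed to be a boto3 S3 client, `role_arn` an IAM role that S3 can assume for replication, and the destination bucket is assumed to exist already with versioning enabled.

```python
def build_replication_config(role_arn: str, destination_bucket: str) -> dict:
    """Replication configuration that copies every new object to one bucket."""
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "replicate-everything",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # an empty filter matches all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


def enable_replication(
    s3_client, source_bucket: str, role_arn: str, destination_bucket: str
) -> dict:
    """Turn on versioning for the source bucket, then attach the rule."""
    s3_client.put_bucket_versioning(
        Bucket=source_bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
    return s3_client.put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration=build_replication_config(role_arn, destination_bucket),
    )
```

Deletes are deliberately not replicated in this sketch, so removing an object in the source bucket leaves the copy in the other region untouched.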
Cognit – oh no
If you use Amazon Cognito, brace yourself for bad news. Unfortunately, Cognito does not support cross-region backup out of the box. The best solution at the time of this writing is to use Lambda triggers. When a new signup happens or a user requests a password reset, you can have Cognito invoke a Lambda function. From there, you can configure that Lambda to write the user to a pool in another region. It’s not ideal, but it’ll get the job done.
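A trigger along those lines could be sketched in Python as follows. The helper names are hypothetical, `cognito_client` is assumed to be a boto3 `cognito-idp` client bound to the backup region, and the event shape assumed here is the one Cognito passes to a post-confirmation trigger.

```python
def to_attribute_list(user_attributes: dict) -> list:
    """Convert the trigger's attribute map to the list form the API expects.

    The generated "sub" and internal "cognito:*" entries are skipped
    because the backup pool assigns its own values for those.
    """
    return [
        {"Name": name, "Value": value}
        for name, value in user_attributes.items()
        if name != "sub" and not name.startswith("cognito:")
    ]


def mirror_user(cognito_client, backup_pool_id: str, event: dict) -> dict:
    """Create the newly confirmed user in the backup pool, without a password."""
    cognito_client.admin_create_user(
        UserPoolId=backup_pool_id,
        Username=event["userName"],
        UserAttributes=to_attribute_list(event["request"]["userAttributes"]),
        MessageAction="SUPPRESS",  # do not send an invitation message
    )
    # Cognito expects a trigger to hand the event back.
    return event
```

The actual Lambda handler would be a thin wrapper that builds the client and calls `mirror_user` with the ID of your backup pool.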
However, if you have an existing pool, this is where it gets rough. You have no way of accessing a user’s password. So, the best you can do is write a script that queries the pool for all users and creates them in a pool in another region without passwords. This means when you cut over, your users would have to perform a password reset before being able to log in again. Until AWS builds in a way to back up your pool automatically, use with caution. As a side note, Cognito was one of the services that went down in the AWS outage mentioned at the top of this article. When authentication and token validation go offline, almost everything stops working.
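The script described above might look something like this Python sketch. The function names are hypothetical; both clients are assumed to be boto3 `cognito-idp` clients, one per region, and error handling for users that already exist in the target pool is left out.

```python
def iter_users(cognito_client, pool_id: str):
    """Yield every user in the pool, following pagination tokens."""
    token = None
    while True:
        kwargs = {"UserPoolId": pool_id}
        if token:
            kwargs["PaginationToken"] = token
        page = cognito_client.list_users(**kwargs)
        yield from page.get("Users", [])
        token = page.get("PaginationToken")
        if not token:
            break


def copy_users(source_client, source_pool_id: str, target_client, target_pool_id: str) -> int:
    """Recreate each user, minus the password, in the target pool."""
    copied = 0
    for user in iter_users(source_client, source_pool_id):
        # "sub" is generated per pool, so it is not carried over.
        attributes = [a for a in user.get("Attributes", []) if a["Name"] != "sub"]
        target_client.admin_create_user(
            UserPoolId=target_pool_id,
            Username=user["Username"],
            UserAttributes=attributes,
            MessageAction="SUPPRESS",  # do not email every user during the copy
        )
        copied += 1
    return copied
```

Users copied this way have no usable password in the target pool, which is why the password reset after a cut-over cannot be avoided.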
Other Considerations
Do you use Amazon Simple Email Service (SES)? If you do, I’m sure you’re familiar with the process of enabling your account for production. By default, your account is in a sandbox where you can only send emails to and from verified email addresses and domains. While this is fine for testing, you need to go into production mode to send emails to customers in a live environment. What you might not realize is that this is region specific. In other words, enabling production mode in US-EAST-1 does not automatically enable production mode in US-WEST-1. So, if you fail over to another region, emails will not send properly. You should make sure all your failover regions are properly set up for running in production.
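A quick way to audit this ahead of time is sketched below in Python. The function name is hypothetical, and `make_sesv2_client` is assumed to be a small factory that returns a boto3 `sesv2` client for a given region.

```python
def regions_still_in_sandbox(make_sesv2_client, regions: list) -> list:
    """Return the regions where SES production access is not yet enabled."""
    return [
        region
        for region in regions
        if not make_sesv2_client(region).get_account().get("ProductionAccessEnabled", False)
    ]
```

With boto3 installed, the factory could be as simple as `lambda region: boto3.client("sesv2", region_name=region)`, and a non-empty result tells you which failover regions still need a production access request.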
AWS Backup – The Rest
We’ve covered a good number of services here, but there are so many more in AWS. CloudFormation has you covered for almost any stateless service. For many others that store data, like EFS, EBS or DynamoDB, AWS has a solution for each as well. While a relatively new service, AWS Backup is designed to help manage data backup. When you take snapshots of RDS instances, you can use AWS Backup to schedule copying them to other regions. Because the data isn’t replicated in real time, this approach is better suited to data redundancy than to instant failover. The list of services AWS Backup integrates with is growing, so be sure to keep an eye on it.
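As an illustration, a backup plan with a cross-region copy could be described in Python like this. The function name, plan name, rule name, schedule and retention periods are placeholder assumptions; the destination is assumed to be the ARN of a backup vault that already exists in the other region.

```python
def build_backup_plan(source_vault: str, destination_vault_arn: str) -> dict:
    """Daily backup plan whose recovery points are copied to another vault."""
    return {
        "BackupPlanName": "daily-cross-region",
        "Rules": [
            {
                "RuleName": "daily",
                "TargetBackupVaultName": source_vault,
                "ScheduleExpression": "cron(0 5 * * ? *)",  # every day at 05:00 UTC
                "Lifecycle": {"DeleteAfterDays": 35},
                "CopyActions": [
                    {
                        "DestinationBackupVaultArn": destination_vault_arn,
                        "Lifecycle": {"DeleteAfterDays": 35},
                    }
                ],
            }
        ],
    }
```

With boto3, the dictionary would be passed as `BackupPlan` to the Backup client’s `create_backup_plan` call, and the resources to protect are then attached to the plan with a separate backup selection.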
Lessons
While an AWS outage is uncommon, there’s no worse feeling than having your application down with nothing you can do about it. By leveraging tools like CloudFormation for quick and accurate deployment of your stateless application in another region, you can get back on your feet quickly. And by redundantly backing up your data in real time across regions, not only do you ensure you have the latest data available at a moment’s notice, but you also know you have an extra disaster recovery location. These strategies are easier to implement than you might think, and knowing that you can recover on your own is peace of mind that no business should go without.