Recently, we released PasteThePlan.com, which runs on top of Amazon Web Services. PasteThePlan.com allows you to share SQL Server query plans with others. Behind the scenes, we’re using DynamoDB (a NoSQL database) for record data and S3 (a file storage service) to store text data.
Since we’re a bunch of data freaks, we wanted to make sure that our data and files are properly backed up. I set out to create a script that would back up DynamoDB to a file and copy the data in S3 to Azure. The reasoning for saving our backups into a different cloud provider is pretty straightforward. First, we wanted to keep the data in a separate cloud account from the application. We didn’t make the same mistakes that Code Spaces did. Second, I wanted to kick the tires of Azure a bit. Heck, why not?
I figured this script would take me a day to write and a morning to deploy. In the end, it took four days to write and deploy. So here are some lessons that I learned the hard way from trying to bang out this backup code.
Keep Your Backups In a Different Account (Or Cloud)
This is why we’re backing up to Azure in the first place. Some of you are probably thinking, “Why not just save the data to a different S3 bucket?” I’m guessing you didn’t read the Code Spaces link. OK, I’ll give you the tl;dr. Code Spaces was a code-hosting service that was completely based in AWS. A hacker gained access to their AWS control panel and deleted all of their data and configurations. In a period of twelve hours, the hacker deleted the company.
The big takeaway here is that your backups should at least be in a different AWS account. We opted to go with a completely different cloud. Backups in the cloud are extremely important, but where you put those backups might be even more important.
Keep Code Small and Simple
Here’s where I got myself in trouble. I wrote the script to run in AWS Lambda, as planned, in one day. It read the DynamoDB data, wrote it to a file in S3, then took the files in S3 and sent them to Azure. It even did full and incremental backups. All of the testing with the development data was successful. It worked beautifully.
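For context, the first version looked roughly like this. This is a simplified sketch, not the actual script: the post never shows its code, so I’m assuming Python with boto3, and the table name, bucket, and helper names are made up for illustration.

```python
import json


def scan_table(table_name):
    """Read every item from a DynamoDB table, following pagination."""
    import boto3  # deferred so the pure helpers below can run without AWS
    table = boto3.resource("dynamodb").Table(table_name)
    items = []
    response = table.scan()
    items.extend(response["Items"])
    # scan() returns at most 1 MB per call; keep paging until done
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response["Items"])
    return items


def serialize_items(items):
    """Serialize items as JSON Lines: one record per line in the backup file."""
    return "\n".join(json.dumps(item, default=str) for item in items)


def backup_table_to_s3(table_name, bucket, key):
    """Dump the whole table into a single S3 object."""
    import boto3
    body = serialize_items(scan_table(table_name))
    boto3.client("s3").put_object(Bucket=bucket, Key=key,
                                  Body=body.encode("utf-8"))
    # ...and the monolith then looped over every S3 object and pushed each
    # one to Azure in the same invocation -- the part that fell over against
    # the production dataset.
```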
But when the code was run against the production data, which is a much larger dataset, we started to get timeout failures from Azure. Debugging the problem in the Lambda was difficult because I had written the code in such a way that finding the root of the problem wasn’t easy. After trying to get the monolithic script to work, I decided to rethink what I was doing. I was thinking too much like how I would script moving files from my local machine to an external hard drive. The cloud doesn’t work that way. I had to rewrite it in a more cloud-like manner.
Plan For Failures
The main problem with the script was that some of the files being moved to Azure were timing out. If just one of the thousand or so files failed for any reason, the entire script failed. This is extremely inflexible. I didn’t write any logic to retry the file in case something went wrong. With traditional on-premises programming, you didn’t have to worry about the file system failing or the network going down. If it did, we had bigger problems to deal with. But in the cloud, we have to expect these kinds of failures. So now I needed an easy way to retry a file. As it turns out, AWS has a solution for that.
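Even before reaching for an AWS service, the simplest form of “plan for failures” is a retry with backoff around the flaky call. Here’s a minimal sketch (my own illustration, not the code we shipped) that you could wrap around an Azure upload:

```python
import time


def with_retries(operation, attempts=3, base_delay=1.0):
    """Run operation(); on failure, wait and retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries -- let the caller decide what to do
            time.sleep(base_delay * (2 ** attempt))  # wait 1s, 2s, 4s, ...
```

The key point is that one timed-out file raises after its own retries instead of silently killing the whole run, so the rest of the files still get copied.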
Use Cloud Services Where it Makes Sense
While looking at the process with more cloud-based thinking, it occurred to me that my script needed to be two different processes. The first would back up DynamoDB; the second would send files from S3 to Azure.
The glue between the two processes would be Amazon’s Simple Queue Service (SQS). The first process, after the backup completed, loads a message for each DynamoDB record into SQS. The second process then reads SQS, grabs the file from S3, and sends it to Azure. Now the problem was how to start the second process. I could have used Simple Notification Service to start the Lambda function, but I opted for a simpler solution. I created a CloudWatch schedule that checked the queue every minute. Now if a file fails, for any reason, the message won’t be removed from the queue and it will be processed on the next run.
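In code, the two halves end up something like this. Again, a hedged sketch assuming Python and boto3; the queue URL, bucket, and the `copy_to_azure` callback are hypothetical stand-ins for our actual pieces:

```python
import json


def make_message(bucket, key):
    """Build the queue message body describing one S3 object to copy."""
    return json.dumps({"bucket": bucket, "key": key})


def enqueue_backups(queue_url, bucket, keys):
    """Process 1: after the DynamoDB backup, queue one message per file."""
    import boto3  # deferred so make_message stays testable without AWS
    sqs = boto3.client("sqs")
    for key in keys:
        sqs.send_message(QueueUrl=queue_url,
                         MessageBody=make_message(bucket, key))


def process_queue(queue_url, copy_to_azure):
    """Process 2, run by the CloudWatch schedule every minute.

    The message is deleted only after a successful copy, so a file that
    fails for any reason stays on the queue and is retried on the next run.
    """
    import boto3
    sqs = boto3.client("sqs")
    response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
    for msg in response.get("Messages", []):
        body = json.loads(msg["Body"])
        copy_to_azure(body["bucket"], body["key"])  # raises on failure
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=msg["ReceiptHandle"])
```

The nice side effect of this design is that SQS’s visibility timeout does the retry bookkeeping for you: an unacknowledged message simply reappears.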
Test Yourself Before You Wreck Yourself
This is a tough one. Testing code destined for the cloud is still harder than it should be. I figured that since this little script was fairly straightforward, I wouldn’t need to write automated tests for it. That turned out to be a mistake. If I had written unit tests for this script first, I would have realized that my thinking was flawed and I would have to take a different path. Unit tests would have saved me a couple of days. The bottom line is that unit tests are worth your time.
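The trick is to keep the decision-making logic pure so it can be tested without touching AWS at all. As an illustration (the helper name and record shape here are invented, not from our actual script), the filter that decides what an incremental backup should include is exactly the kind of thing worth unit testing:

```python
def items_changed_since(items, cutoff):
    """Pick the records an incremental backup should include: anything
    modified at or after the cutoff (ISO 8601 timestamps sort as strings)."""
    return [item for item in items if item["updated_at"] >= cutoff]


def test_incremental_picks_only_new_items():
    items = [
        {"id": "a", "updated_at": "2016-09-01T00:00:00"},
        {"id": "b", "updated_at": "2016-09-15T12:30:00"},
    ]
    result = items_changed_since(items, "2016-09-10T00:00:00")
    assert [item["id"] for item in result] == ["b"]
```

A test like this runs in milliseconds on your laptop, long before the code ever sees Lambda.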
Brent says: careful readers might notice that we’re creating a lot of vendor lock-in: the work Richie’s doing is tied to specific services from a specific vendor. While sure, you can get queueing services, function-as-a-service, and NoSQL databases from AWS, Google, and Microsoft, the code isn’t portable today. So why back up the data to somebody else’s cloud if we can’t run the apps over there too? In this case, it’s about disaster recovery: Amazon could lose data permanently (and has), and we just need a way to put the data back later in case it disappears.