Lessons Learned From Working in the Clouds

Recently, we released PasteThePlan.com, which runs on top of Amazon Web Services. PasteThePlan.com allows you to share SQL Server query plans with others. Behind the scenes, we’re using DynamoDB (a NoSQL database) for record data and S3 (a file storage service) for text data.

Since we’re a bunch of data freaks, we wanted to make sure that our data and files are properly backed up. I set out to create a script that would back up DynamoDB to a file and copy the data in S3 to Azure. The reasoning for saving our backups to a different cloud provider is pretty straightforward. First, we wanted to keep the data in a separate cloud account from the application; we didn’t want to make the same mistakes that Code Spaces did. Second, I wanted to kick the tires of Azure a bit. Heck, why not?

I figured this script would take me a day to write and a morning to deploy. In the end, it took four days to write and deploy. So here are some lessons I learned the hard way while trying to bang out this backup code.

Be Redundant

This is why we’re backing up to Azure in the first place. Some of you are probably thinking, “Why not just save the data to a different S3 bucket?” I’m guessing you didn’t read the Code Spaces link. OK, here’s the tl;dr: Code Spaces was a code-hosting service that was based entirely in AWS. A hacker gained access to their AWS control panel and deleted all of their data and configurations. In a period of twelve hours, the hacker deleted the company.

The big takeaway here is that your backups should at least be in a different AWS account. We opted to go with a completely different cloud. Backups in the cloud are extremely important, but where you put those backups might be even more important.

Keep Code Small and Simple

Here’s where I got myself into trouble. I wrote the script to run in AWS Lambda, as planned, in one day. It read the DynamoDB data, wrote it to a file in S3, then took the files in S3 and sent them to Azure. It even did full and incremental backups. All of the testing with the development data was successful. It worked beautifully.
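Roughly, the all-in-one design looked like this. This is a minimal sketch rather than the actual script; the table, bucket, and container names are illustrative assumptions:

```python
# A minimal sketch of the monolithic design (table, bucket, and container
# names are illustrative assumptions, not the real ones).
import json
import os

import boto3
from azure.storage.blob import BlobServiceClient

dynamodb = boto3.client("dynamodb")
s3 = boto3.client("s3")
azure = BlobServiceClient.from_connection_string(os.environ["AZURE_CONN_STR"])
container = azure.get_container_client("backups")

def handler(event, context):
    # Step 1: dump the DynamoDB table to a JSON file in S3
    # (pagination omitted for brevity).
    items = dynamodb.scan(TableName="pasteThePlan")["Items"]
    s3.put_object(Bucket="backup-bucket", Key="dynamodb-backup.json",
                  Body=json.dumps(items).encode())

    # Step 2: copy every S3 object to Azure, one by one. A single Azure
    # timeout anywhere in this loop fails the entire run, and there's no
    # way to retry just the file that failed.
    for obj in s3.list_objects_v2(Bucket="plans-bucket").get("Contents", []):
        body = s3.get_object(Bucket="plans-bucket", Key=obj["Key"])["Body"].read()
        container.upload_blob(name=obj["Key"], data=body, overwrite=True)
```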

But when the code ran against the production data, which is a much larger dataset, we started to get timeout failures from Azure. Debugging the problem in the Lambda was difficult because I had written the code in such a way that finding the root of the problem wasn’t easy. After trying to get the monolithic script to work, I decided to rethink what I was doing. I had been thinking about the job the way I’d script moving files from my local machine to an external hard drive. The cloud doesn’t work that way. I had to rewrite it in a more cloud-like manner.

Plan For Failures

The main problem with the script was that some of the files being moved to Azure were timing out, so if just one of the thousand or so files failed for any reason, the entire script failed. This is extremely inflexible. I hadn’t written any logic to retry a file if something went wrong. With traditional on-premises programming, you don’t have to worry about the file system failing or the network going down; if it does, you have bigger problems to deal with. But in the cloud, we have to expect these kinds of failures. So now I needed an easy way to retry a file. As it turns out, AWS has a solution for that.

Use Cloud Services Where It Makes Sense

Looking at the process with more cloud-oriented thinking, it occurred to me that my script needed to be two different processes. The first would back up DynamoDB; the second would send files from S3 to Azure.
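Sketched out, the first process looks something like this (table and bucket names are illustrative, and the full vs. incremental logic is omitted for brevity):

```python
# Process one (sketch): back up the DynamoDB table to a dated JSON file in S3.
import datetime
import json

import boto3

dynamodb = boto3.client("dynamodb")
s3 = boto3.client("s3")

def backup_handler(event, context):
    # Scan the whole table, following pagination.
    items, kwargs = [], {"TableName": "pasteThePlan"}
    while True:
        page = dynamodb.scan(**kwargs)
        items.extend(page["Items"])
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

    # Write one dated backup file per run.
    key = "backups/{:%Y-%m-%d}.json".format(datetime.date.today())
    s3.put_object(Bucket="backup-bucket", Key=key,
                  Body=json.dumps(items).encode())
```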

The glue between the two processes would be Amazon’s Simple Queue Service (SQS). The first process, after the backup completed, loads a message for each DynamoDB record into SQS. The second process then reads SQS, grabs the file from S3, and sends it to Azure. Now the problem was how to start the second process. I could have used Simple Notification Service to start the Lambda function, but I opted for a simpler solution: a CloudWatch schedule that checks the queue every minute. Now if a file fails for any reason, the message isn’t removed from the queue, and it gets processed again on the next run.
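Sketched out, the pieces fit together something like this. The queue URL, bucket, and container names are assumptions, and the real script may differ:

```python
# Process two (sketch): a Lambda fired by a CloudWatch schedule every minute.
# It drains the queue and copies each named S3 object to Azure Blob Storage.
import os

import boto3
from azure.storage.blob import BlobServiceClient

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
azure = BlobServiceClient.from_connection_string(os.environ["AZURE_CONN_STR"])
container = azure.get_container_client("backups")
QUEUE_URL = os.environ["BACKUP_QUEUE_URL"]

def enqueue(keys):
    # Called by process one after the backup completes: one message per file.
    for key in keys:
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=key)

def copy_handler(event, context):
    while True:
        messages = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10).get("Messages", [])
        if not messages:
            break
        for msg in messages:
            key = msg["Body"]  # each message names one S3 object
            try:
                body = s3.get_object(Bucket="plans-bucket", Key=key)["Body"].read()
                container.upload_blob(name=key, data=body, overwrite=True)
            except Exception:
                continue  # leave the message in the queue; it retries next run
            # Delete only after a successful upload.
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
```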

Test Yourself Before You Wreck Yourself

This is a tough one. Testing code destined for the cloud is still harder than it should be. I figured that since this little script was fairly straightforward, I wouldn’t need to write automated tests for it. That turned out to be a mistake. If I had written unit tests for this script first, I would have realized that my thinking was flawed and that I needed to take a different path. Unit tests would have saved me a couple of days. The bottom line: unit tests are worth your time.
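For example, pulling the decision logic out into pure functions makes it testable without touching AWS at all. A minimal pytest-style sketch, using a hypothetical helper rather than anything from the real script:

```python
# Sketch: test the "what should happen" logic in isolation from AWS.
def keys_to_enqueue(s3_listing):
    """Given an S3 list_objects_v2 response, return the object keys to copy."""
    return [obj["Key"] for obj in s3_listing.get("Contents", [])]

def test_empty_bucket_enqueues_nothing():
    assert keys_to_enqueue({}) == []

def test_every_listed_key_is_enqueued():
    listing = {"Contents": [{"Key": "plan1.json"}, {"Key": "plan2.json"}]}
    assert keys_to_enqueue(listing) == ["plan1.json", "plan2.json"]
```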

 

Brent says: careful readers might notice that we’re creating a lot of vendor lock-in: the work Richie’s doing is tied to specific services from a specific vendor. Sure, you can get queueing services, function-as-a-service, and NoSQL databases from AWS, Google, and Microsoft, but the code isn’t portable today. So why back up the data to somebody else’s cloud if we can’t run the apps over there too? In this case, it’s about disaster recovery: Amazon could permanently lose data (and has), and we just need a way to put the data back later in case it disappears.


9 Comments

  • Interesting stuff. More posts like this, please!

  • Hi Richie, can you elaborate on how unit testing would have helped design this better?

  • Very nice topic, plenty of things to consider. Thanks!

  • This is interesting, thank you for the post! I had given the cloud providers more credit for their ability to recover or restore data from accidental or malicious deletion. Speaking of which… have you done a restore? Until you have tested a successful restore of that data from Azure to AWS, you still do not have a valid backup (following the same paradigm as SQL Server backups). Also, what kind of data transfer rates are you seeing? I assume there’s a data size limit beyond which it’s not practical to use this (or any similar) method, i.e. at some point the data transfer speed will prevent meeting a given RTO.

    • Well, these backups are a little different. We’re just backing up zip and JSON files. To restore, we would have to move data from Azure Blob Storage to AWS S3. Then we would need to load the DynamoDB backup JSON file into DynamoDB, and that could easily be done via the command line. We could do a dry run into a test environment just to make sure everything is OK, but this is a tool we make available for free. It’s not the core of our business, just a side project that we decided to release. So if we’re not testing restores weekly, I think we’ll be fine.

      We will be fine, right Brent?

  • Devon Leann Ramirez
    November 4, 2016 1:06 pm

    “Test Yourself Before You Wreck Yourself” is now on my cube whiteboard. Thank you, good sir! Interesting story about Code Spaces, too!

  • Is there a GUI like SSMS for querying and managing DynamoDB?

