Notes on Scalability

We all hope that we’re going to succeed beyond our wildest expectations. Startups long for multi-billion-dollar IPOs or scaling to hundreds, or even thousands, of servers. Every hosting provider is touting how its new cloud offering will help us scale up to unheard-of heights. I’ve built things up and torn them down a few times over my career.

Build it to Break

Everything you make is going to break; plan for it.

Whenever possible, design the individual layers of an application to operate independently and redundantly. Start with two of everything – web servers, application servers, even database servers. Once you accept that everything can and will fail, you’ll be a lot happier with your environment, especially when something goes wrong. Well-designed applications are built to fail. Architects accept that failure is inevitable. It’s not something that we want to consider, but it’s something that we have to consider.

Distributed architecture patterns help move workloads out across many autonomous servers. Load balancers and web farms help us manage failure at the application server level. In the database world, we can manage failure with clustering, mirroring, and read-only replicas. Not everything has to be duplicated, but we have to be aware of what can fail and how we will respond.
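To make the idea of redundant servers concrete, here is a minimal sketch of failover across duplicate backends. Everything in it is an assumption made for illustration: the function names, the use of `ConnectionError` to signal a dead server, and the toy "servers" themselves. A real load balancer also does health checks, timeouts, and retries, none of which are shown here.

```python
from typing import Callable, Sequence


class AllBackendsFailed(Exception):
    """Raised when every redundant backend has failed."""


def call_with_failover(backends: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each redundant backend in order and return the first successful reply."""
    errors = []
    for backend in backends:
        try:
            return backend(request)
        except ConnectionError as exc:
            # This backend is down; remember why and move on to the next one.
            errors.append(exc)
    raise AllBackendsFailed(f"{len(errors)} backend(s) failed for request {request!r}")


# Hypothetical stand-ins for two web servers behind a load balancer.
def broken_server(request: str) -> str:
    raise ConnectionError("server A is unreachable")


def healthy_server(request: str) -> str:
    return f"handled {request}"


# The first server fails, so the request is answered by the second one.
reply = call_with_failover([broken_server, healthy_server], "GET /")
```

With only one server in the list, the same failure would surface to the user, which is the argument for starting with two of everything.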

Everything is a Feature

As Jeff Atwood has famously said, performance is a feature. The main thrust of Jeff’s article is that making an application fast is a decision that you make during development. Along the same lines, it’s a conscious decision to make an application fault tolerant.

Every decision has a trade-off. Viewing the entire application as a series of trade-offs leads to a better understanding of how the application will function in the real world. The difference between being able to scale up and being able to scale out can often come down to understanding key decisions that were made early on.

Scale Out, Not Up

This isn’t as axiomatic as it sounds. Consider this: cloud platforms like Azure and AWS are at their most flexible when we can dynamically add servers in response to demand. To effectively scale out means that we need also to be able to scale back in.

Adding capacity is usually easy in the application tier: just add more servers. What happens when we need to scale the database? The current trend is to buy a faster server with faster disks and more memory. This process keeps repeating itself. Hopefully your demand for faster servers will grow no faster than the pace of hardware innovation. There are other problems with scaling up. As performance increases, hardware gets more expensive for smaller and smaller gains. The difference in cost between the fastest CPU and the second-fastest CPU is much larger than the performance gained – scaling up often comes at a tremendous cost.

The flip side to scaling up is scaling out. In a scale-out environment, extra commodity servers are added to handle additional load. One of the easiest ways to scale out the database is to use read-only replica servers to scale out reads. Writes are handled on a master server because scaling out writes can get painful. But what if you need to scale out writes? Thankfully, there are many techniques available for horizontally scaling the database layer – features can be broken into distinct data silos, metadata can be replicated between all servers while line-of-business data is sharded, or automated techniques like SQL Azure’s federations can be used.
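Two of these techniques, read-only replicas and sharding, can be sketched in a few lines. This is only an illustration under stated assumptions: the class and function names, the server names, round-robin replica selection, and hash-modulo sharding are all choices made for the example, not how any particular database or SQL Azure federations work.

```python
import hashlib
import itertools
from typing import Sequence


class ReplicatedRouter:
    """Send writes to the master and spread reads across read-only replicas."""

    def __init__(self, master: str, replicas: Sequence[str]) -> None:
        if not replicas:
            raise ValueError("at least one replica is required")
        self.master = master
        # Round-robin over the replicas so that reads are spread evenly.
        self._replicas = itertools.cycle(replicas)

    def route(self, is_write: bool) -> str:
        return self.master if is_write else next(self._replicas)


def shard_for(key: str, shards: Sequence[str]) -> str:
    """Pick a shard from a stable hash of the key, so a key always maps to the same shard."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


# Hypothetical server names, used only for this example.
router = ReplicatedRouter("master-db", ["replica-1", "replica-2"])
write_target = router.route(is_write=True)
read_targets = [router.route(is_write=False) for _ in range(3)]

shards = ["shard-0", "shard-1", "shard-2", "shard-3"]
customer_shard = shard_for("customer:42", shards)
```

The sketch also hints at why scaling out writes is the painful part: with hash-modulo sharding, changing the number of shards moves most keys to a different shard, so adding or removing a shard means migrating data.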

The most important thing to keep in mind is that it’s just as important to be able to contract as it is to expand. As a business grows, it’s easiest to keep purchasing additional servers in response to load. Purchasing more hardware is faster and usually cheaper than tuning code. Once the application reaches a certain level of maturity, it’s important to tune the application to run on fewer resources. Less hardware equates to less maintenance. Less hardware means less cost. Nobody wants to face the other possibility, either – the business may shrink. A user base may erode. A business’s ability to respond to changing costs can be the difference between a successful medium-size business and a failed large business.
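The expand-and-contract idea can be written down as a single sizing rule that works in both directions. The sketch below is an assumption-heavy illustration: the function name, the 25% headroom, the floor of two servers, the ceiling of twenty, and the load figures are all hypothetical, and real autoscalers add cooldown periods and smoothing that are not shown.

```python
import math


def desired_servers(
    current_load: float,
    capacity_per_server: float,
    min_servers: int = 2,
    max_servers: int = 20,
    headroom: float = 0.25,
) -> int:
    """Return how many servers the current load calls for, growing or shrinking as needed."""
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    # Size for the load plus some spare headroom, then round up to whole servers.
    needed = math.ceil(current_load * (1 + headroom) / capacity_per_server)
    # Never drop below the redundancy floor or grow past the budget ceiling.
    return max(min_servers, min(max_servers, needed))


# Hypothetical numbers: each server handles 100 requests per second.
busy = desired_servers(current_load=900, capacity_per_server=100)   # 12 servers
quiet = desired_servers(current_load=150, capacity_per_server=100)  # 2 servers
idle = desired_servers(current_load=0, capacity_per_server=100)     # 2 servers (the floor)
spike = desired_servers(current_load=5000, capacity_per_server=100) # 20 servers (the ceiling)
```

The same function that asks for twelve servers during a busy period asks for two when demand falls, which only helps if the application can actually run on two.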

Buy More Storage

In addition to scaling out your servers, scale out your storage. If you have the opportunity to buy either a few huge disks or a large number of small, fast disks, give serious thought to buying the small, fast disks. A large number of small, fast drives will be able to respond rapidly to I/O requests. More disks working in concert means that less data needs to be read off each disk.

The trick here is that modern databases are capable of spreading a workload across multiple database files and multiple disks. If multiple files/disks/spindles/logical drives are involved in a query, then it’s possible to read data from disk even faster than if only one very large disk were involved. The principle of scaling out vs. scaling up applies even at the level of scaling your storage – many small disks are typically going to be faster than a few large ones.
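A bit of back-of-the-envelope arithmetic shows why. The sketch below assumes an idealised case: the data is spread evenly across the disks, every disk reads in parallel, and nothing else (controller, bus, CPU) is the bottleneck. The function name and all of the throughput figures are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def scan_seconds(data_mb: float, disk_count: int, mb_per_second_per_disk: float) -> float:
    """Idealised time to read data_mb when it is spread evenly over disk_count disks."""
    if disk_count < 1:
        raise ValueError("disk_count must be at least 1")
    if mb_per_second_per_disk <= 0:
        raise ValueError("mb_per_second_per_disk must be positive")
    # Each disk only has to read its own slice of the data.
    per_disk_mb = data_mb / disk_count
    # The disks work in parallel, so the scan takes as long as one slice takes.
    return per_disk_mb / mb_per_second_per_disk


# Hypothetical figures: a 12,000 MB scan on one large disk vs. eight smaller ones.
one_large_disk = scan_seconds(12000, disk_count=1, mb_per_second_per_disk=200)    # 60.0 seconds
eight_small_disks = scan_seconds(12000, disk_count=8, mb_per_second_per_disk=150) # 10.0 seconds
```

Even though each small disk in this example is slower than the large one, eight of them finish the scan in a sixth of the time because each reads only 1,500 MB. Real workloads will fall short of this ideal, but the direction of the result is the point.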

You’re Going to Do It Wrong

No matter how smart or experienced your team is, be prepared to make mistakes. There are very few hard and fast implementation guidelines about scaling the business. Be prepared to rapidly iterate through multiple ideas before finding the right mix of techniques and technologies that work well. You may get it right on the first try. It may take a number of attempts to get it right. But, in every case, be prepared to revisit ideas.

On that note, be prepared to rewrite the core of your application as you scale. Twitter was originally built with Ruby on Rails. Over time, they reimplemented different parts of the application with different tools. Twitter’s willingness to rewrite core components of its infrastructure helped lead to its current level of success.

Don’t be afraid to change everything.

15 Comments

  • > Scale Out, Not Up

    Good article, with one item that I think is especially important: scale the right way for your project. Some projects are especially easy to scale up, rather than out.

    It is sometimes more effective to simply pop in RAM and SSDs than trying to do sharding. This is especially true for your writable database that needs ACID.

    • Thanks for the kind words, I’m glad you liked the article.

      I couldn’t agree more about throwing a visit to Newegg or Fry’s at the problem – RAM and SSDs can solve many problems. I would hope that people will try to scale their database servers up long before they go down the route to scale them out. That can be a painful path, especially if you need ACID across the whole dataset.

    • That’s my first thought exactly when reading this post. Most people will never actually have to scale out the database. This is even more true in the era of enterprise class SSDs like Fusion-io.
      Think about scaling out and prepare a plan for it, but don’t lose sleep over it. I have infrastructure where a single powerful database (with the Fusion-io SSD mentioned above) doesn’t cut it – but you will not find yourself in that place overnight (unless you are working at a place like Facebook or Google, of course 🙂 ).

  • Atit Vakharia
    January 10, 2012 3:53 pm

    Great article.

  • Great article, really enjoyed reading this. Being prepared to rewrite core functionality and make it better tends to be a stumbling block. There is an anxiety about breaking something that kinda works because it kinda works.

    ~Edafe~

  • I couldn’t understand this point, “To effectively scale out means that we need also to be able to scale back in.”

    Does it mean, “To effectively scale out means that we need also to be able to scale back in… by removing unused/unnecessary cloud instances (servers)”? Is that what you meant? Please clarify.

    Loved your notes overall. Very helpful and insightful, thanks!

    • You hit the nail right on the head when you added “by removing unused/unnecessary instances”. Sometimes you’ll be scaling with physical servers and sometimes with virtual servers. Either way, you need to be ready to reduce your physical infrastructure when load/demand decreases. Nobody wants to pay for servers they aren’t using.

      I’m glad you found it helpful and insightful.

  • Excellent article that really hit home. I’m involved in a project right now helping redesign one of our core products for scalability and high availability. I was pleasantly surprised to see that you covered many of the ideas I had been thinking about or pushing for.
