Before I start with this sordid tale of low scalability, I want to thank the guys at Phusion for openly discussing the challenges they’re having with Union Station. They deserve applause and hugs for being transparent with their users.
Today, they wrote about their scaling issues. That article deserves a good read, but I’m going to cherry-pick a few sentences out for closer examination.
“Traditional RDBMSes are very hard to write-scale across multiple servers and typically require sharding at the application level.”
They’re completely right here – scaling out writes is indeed hard, especially without a dedicated database administrator. Most RDBMSes have replication methods that allow multiple masters to write simultaneously, but these approaches don’t fare well with schemaless databases like Phusion wanted. SQL Server’s replication tools haven’t served me well when I’ve had to undergo frequent schema changes. Some readers will argue that the right answer is to pick a schema and run with it, but let’s set that aside for now. Phusion chose MongoDB for its ease of scale in these situations.
“In extreme scenarios our cluster would end up looking like this.”
Their final goal was to have three MongoDB shards. Unfortunately, they didn’t understand that the Internet is an extreme scenario. If your beta isn’t private, you can’t scale with good intentions. Instead, they went live with:
“We started with a single dedicated server with an 8-core Intel i7 CPU, 8 GB of RAM, 2×750 GB harddisks in RAID-1 configuration and a 100 Mbit network connection. This server hosted all components in the above picture on the same machine. We explicitly chose not to use several smaller, virtualized machines in the beginning of the beta period for efficiency reasons: our experience with virtualization is that they impose significant overhead, especially in the area of disk I/O.”
The holy trinity of CPU, memory, and storage can be tough to balance. For a new database server running a new application, which one needs the most? In the immortal words of Wesley Snipes, always bet on
black memory. A lot of memory can make up for slow storage, but a lot of slow storage can’t make up for insufficient memory. When running your database server – even in a private beta – 750GB SATA drives in a mirrored pair is not web scale.
“Update: we didn’t plan on running on a single server forever. The plan was to run on a single server for a week or two, see whether people are interested in Union Station, and if so add more servers for high availability etc. That’s why we launched it as a beta and not as a final.”
Unfortunately, if you go live with just one database server and all your eggs in two SATA drives, you’re going to have a really tough time catching up. As Phusion discovered, it’s really hard to get a lot of data off SATA drives quickly, especially when there’s a database involved. They ran into a few issues while trying to bring more servers into the mix. They would have been better served by starting with three virtual servers, each running a shard, and then separating those virtual machines onto different hosts (and different storage) as they grew. Instead, when faced with getting over 100GB of user data within 12 hours, they:
“During the afternoon we started ordering 2 additional servers with 24 GB RAM and 2×1500 GB hard disks each, which were provisioned within several hours.”
Now things really start to go wrong. They’ve gotten 100GB of user data within 12 hours, and they’re moving to a series of boxes with more SATA drives. They still only used two slow SATA drives, but don’t worry, they had a plan for better performance:
“We setup these new harddisks in RAID-0 instead of RAID-1 this time for better write performance…”
Whoa. For the part-time DBAs in the crowd, RAID 0 has absolutely zero redundancy – if you lose any one drive, you lose all the data in that array. Now we’re talking about some serious configuration mistakes based on some very short-term bets. Phusion had been overwhelmed with the popularity of the new application, and decided to put this “beta” app on servers with no protection whatsoever – presumably to save money. This makes their scaling challenge even worse, because sooner or later they’ll need to migrate all of this data again to yet another set of servers (this time with some redundancy like RAID 10). Two SATA drives, even in RAID 0, simply can’t handle serious IO throughput.
“RAID-0 does mean that if one disk fails we lose pretty much all data. We take care of this by making separate backups.”
Pretty much? No, all data is gone, period. When they say they’ll make separate backups, I have a hard time taking this seriously when their IO subsystems can’t even handle the load of the existing end user writes, let alone a separate set of processes reading from those servers.
“And of course RAID-0 is not a silver bullet for increasing disk speed but it does help a little, and all tweaks and optimizations add up.”
Even better, try just getting enough hard drives in the first place. If two SATA drives can’t keep up in one server, then more servers with two SATA drives aren’t going to cut it, especially since you have to get the data off the existing box. I can’t imagine deploying a web-facing database server with less than 6 drives in a RAID 10 array. You have to scale exponentially, not in small leaps.
“We’ve learned not to underestimate the amount of activity our users generate.”
I just don’t believe they’ve learned that lesson when they continue to grow in 2-drive increments, and even worse, in RAID 0. I wish these guys the best of luck, but they need to take radical steps quickly. Consider implementing local mirrored SSDs in each server like Fusion-IO or OCZ PCI Express drives. That would help rapidly get the data off the existing dangerous RAID 0 arrays and give them the throughput they need to focus on users and code, not a pyramid of precarious SATA drives.
(Jumping on a plane for a few hours, but had to let this post out. Will moderate comments when I get back around 5PM Eastern.)