RAID 0 SATA with 2 Drives: It’s Web Scale!

Before I start with this sordid tale of low scalability, I want to thank the guys at Phusion for openly discussing the challenges they’re having with Union Station.  They deserve applause and hugs for being transparent with their users.

Today, they wrote about their scaling issues.  That article deserves a good read, but I’m going to cherry-pick a few sentences out for closer examination.

“Traditional RDBMSes are very hard to write-scale across multiple servers and typically require sharding at the application level.”

They’re completely right here – scaling out writes is indeed hard, especially without a dedicated database administrator.  Most RDBMSes have replication methods that allow multiple masters to write simultaneously, but those approaches don’t fit the schemaless model Phusion wanted.  SQL Server’s replication tools haven’t served me well when I’ve had to make frequent schema changes.  Some readers will argue that the right answer is to pick a schema and run with it, but let’s set that aside for now.  Phusion chose MongoDB for its ease of scaling in these situations.
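For readers who haven’t done it, “sharding at the application level” means the application itself decides which database server owns each record – usually by hashing a shard key.  Here’s a minimal sketch of the idea (server names and shard count are invented for illustration):

```python
import hashlib

# Hypothetical shard servers - names invented for illustration.
SHARDS = [
    "db-shard-0.example.com",
    "db-shard-1.example.com",
    "db-shard-2.example.com",
]

def shard_for(user_id):
    """Route a user to a shard by hashing the shard key."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every query for a given user has to go through this routing function,
# and cross-shard joins and transactions become the application's problem.
print(shard_for("alice"))
```

That last comment is the painful part: once the app owns the routing, rebalancing data across shards is your code’s problem too, which is exactly the work MongoDB’s built-in sharding promises to take off your plate.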

“In extreme scenarios our cluster would end up looking like this.”

Their final goal was to have three MongoDB shards.  Unfortunately, they didn’t understand that the Internet is an extreme scenario.  If your beta isn’t private, you can’t scale with good intentions.  Instead, they went live with:

“We started with a single dedicated server with an 8-core Intel i7 CPU, 8 GB of RAM, 2×750 GB harddisks in RAID-1 configuration and a 100 Mbit network connection. This server hosted all components in the above picture on the same machine. We explicitly chose not to use several smaller, virtualized machines in the beginning of the beta period for efficiency reasons: our experience with virtualization is that they impose significant overhead, especially in the area of disk I/O.”

The holy trinity of CPU, memory, and storage can be tough to balance.  For a new database server running a new application, which one needs the most?  In the immortal words of Wesley Snipes, always bet on black memory.  A lot of memory can make up for slow storage, but a lot of slow storage can’t make up for insufficient memory.  When running your database server – even in a private beta – a mirrored pair of 750GB SATA drives is not web scale.

“Update: we didn’t plan on running on a single server forever. The plan was to run on a single server for a week or two, see whether people are interested in Union Station, and if so add more servers for high availability etc. That’s why we launched it as a beta and not as a final.”

Unfortunately, if you go live with just one database server and all your eggs in two SATA drives, you’re going to have a really tough time catching up.  As Phusion discovered, it’s really hard to get a lot of data off SATA drives quickly, especially when there’s a database involved.  They ran into a few issues while trying to bring more servers into the mix.  They would have been better served by starting with three virtual servers, each running a shard, and then separating those virtual machines onto different hosts (and different storage) as they grew.  Instead, when faced with getting over 100GB of user data within 12 hours, they:

“During the afternoon we started ordering 2 additional servers with 24 GB RAM and 2×1500 GB hard disks each, which were provisioned within several hours.”

Now things really start to go wrong.  They’ve gotten 100GB of user data within 12 hours, and they’re moving to a series of boxes with more SATA drives.  They still only used two slow SATA drives, but don’t worry, they had a plan for better performance:

“We setup these new harddisks in RAID-0 instead of RAID-1 this time for better write performance…”

Whoa.  For the part-time DBAs in the crowd, RAID 0 has absolutely zero redundancy – if you lose any one drive, you lose all the data in that array.  Now we’re talking about some serious configuration mistakes based on some very short-term bets.  Phusion had been overwhelmed with the popularity of the new application, and decided to put this “beta” app on servers with no protection whatsoever – presumably to save money.  This makes their scaling challenge even worse, because sooner or later they’ll need to migrate all of this data again to yet another set of servers (this time with some redundancy like RAID 10).  Two SATA drives, even in RAID 0, simply can’t handle serious IO throughput.
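A quick probability sketch makes the gamble concrete.  Assume a ~3% annual failure rate per SATA drive – my round number for illustration, not a benchmark of their hardware:

```python
AFR = 0.03  # assumed annual failure rate of a single SATA drive (~3%)

# RAID 0: losing EITHER drive destroys the whole array.
raid0_loss = 1 - (1 - AFR) ** 2

# RAID 1: data is lost only if BOTH drives fail
# (ignoring the rebuild window, which makes the real odds slightly worse).
raid1_loss = AFR ** 2

print(f"RAID 0 annual data-loss odds: {raid0_loss:.1%}")  # worse than one bare drive
print(f"RAID 1 annual data-loss odds: {raid1_loss:.2%}")
```

RAID 0 doesn’t just give up redundancy – it makes total loss roughly twice as likely as running a single bare drive, while RAID 1 drives the odds down by orders of magnitude.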

“RAID-0 does mean that if one disk fails we lose pretty much all data. We take care of this by making separate backups.”

Pretty much?  No, all data is gone, period.  When they say they’ll make separate backups, I have a hard time taking this seriously when their IO subsystems can’t even handle the load of the existing end user writes, let alone a separate set of processes reading from those servers.
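Some back-of-the-envelope arithmetic shows why backups and migrations off SATA hurt so much.  These are my assumptions (one 7200RPM spindle, 8KB database pages), not Phusion’s measured numbers:

```python
DATA_GB = 100       # roughly what they collected in ~12 hours
SEQ_MB_S = 100.0    # optimistic sequential throughput of one 7200RPM SATA spindle
RAND_IOPS = 100.0   # typical random IOPS of that same spindle
PAGE_KB = 8.0       # assumed database page size

mb = DATA_GB * 1024

# An ideal streaming copy vs. reading the same data as random database pages.
seq_hours = mb / SEQ_MB_S / 3600
rand_hours = mb / (RAND_IOPS * PAGE_KB / 1024) / 3600

print(f"sequential copy:  ~{seq_hours:.1f} hours")
print(f"random-read copy: ~{rand_hours:.0f} hours")
```

A pure streaming copy looks easy – well under an hour – but a live database doesn’t hand you its data sequentially, and at random-IO rates the same 100GB takes over a day, all while user writes are fighting for the same two spindles.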

“And of course RAID-0 is not a silver bullet for increasing disk speed but it does help a little, and all tweaks and optimizations add up.”

Even better, try just getting enough hard drives in the first place.  If two SATA drives can’t keep up in one server, then more servers with two SATA drives aren’t going to cut it, especially since you have to get the data off the existing box.  I can’t imagine deploying a web-facing database server with less than six drives in a RAID 10 array.  You have to scale in big leaps, not small increments.
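Rough IOPS math backs that up.  Assume ~100 random IOPS per SATA spindle and the classic mirroring write penalty of 2 – rules of thumb, not benchmarks of their boxes:

```python
SPINDLE_IOPS = 100  # assumed random IOPS for one 7200RPM SATA drive

def raid0_iops(drives):
    # Pure striping: reads and writes both scale with drive count,
    # but any single drive failure loses the whole array.
    return {"read": drives * SPINDLE_IOPS, "write": drives * SPINDLE_IOPS}

def raid10_iops(drives):
    # Reads can hit every spindle; each write lands on a mirrored pair.
    return {"read": drives * SPINDLE_IOPS, "write": drives * SPINDLE_IOPS // 2}

print("2-drive RAID 0: ", raid0_iops(2))   # zero redundancy
print("6-drive RAID 10:", raid10_iops(6))  # survives a drive loss
```

By this rough math the six-drive RAID 10 array triples read throughput and still beats the RAID 0 pair on writes – and it keeps running when a drive dies.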

“We’ve learned not to underestimate the amount of activity our users generate.”

I just don’t believe they’ve learned that lesson when they continue to grow in 2-drive increments, and even worse, in RAID 0.  I wish these guys the best of luck, but they need to take radical steps quickly.  Consider implementing local mirrored SSDs in each server like Fusion-IO or OCZ PCI Express drives.  That would help rapidly get the data off the existing dangerous RAID 0 arrays and give them the throughput they need to focus on users and code, not a pyramid of precarious SATA drives.

(Jumping on a plane for a few hours, but had to let this post out.  Will moderate comments when I get back around 5PM Eastern.)

  1. Hi Brent. I don’t have much time right now because I need to leave in a few minutes, but let me start by thanking you for analyzing our post and giving more constructive feedback than on Twitter. I understand that you’re mainly writing this response from the point of view of a DBA, but I ask that you also try to consider our viewpoint as a business, and that not everything is purely about technicalities.

    “Their final goal was to have three MongoDB shards.”

    That’s not what we meant with that figure. We meant to illustrate that our shards can scale infinitely to N shards, and that we will add however many shards as necessary to handle the data, however much that might be. But yeah, I suppose it’s easy to misunderstand what we meant.

    “Unfortunately, they didn’t understand that the Internet is an extreme scenario. If your beta isn’t private, you can’t scale with good intentions.”

    A launch can go two ways: either few people are interested and you risk wasting money, or you underestimate the interest and get scaling problems soon. We based our initial hardware on an educated guess of what the interest would be and it turned out our guess was off. It’s as simple as that.

    “For a new database server running a new application, which one needs the most? In the immortal words of Wesley Snipes, always bet on black memory. [...] When running your database server – even in a private beta – 750GB SATA drives in a mirrored pair is not web scale.”

    We already knew that. We didn’t choose this hardware combination, it was one of the standard combinations our hosting provider gave us. Provisioning a custom configuration with lots of RAM takes a long time.

    As for your point about “web scale”, we didn’t expect we’d *need* web scale on day one.

    I really have to go now and I’ve only read about 25% of your post, but I want to make it clear that we are very well aware of the issues you mention and that any scaling problems we had in the beginning were not the result of ignorance but the result of a wrong estimation of the initial traffic we would get. Of course in the ideal world we’d get the very best hardware and configuration right from the get-go.

    • Hongli – thanks for your comments. I’m not seeing anything that changes my mind here, especially due to two points. You asked me to see things from the business angle – the problem is that the business is asking for something impossible. 2-magnetic-drive SATA RAID 1 arrays don’t scale, period. It doesn’t matter if that’s all the business can afford, because it doesn’t work. If the business could only afford a paper clip and a bar napkin, I wouldn’t be okay with using that for data storage either. There are some technical requirements that aren’t negotiable, and you’ve learned that now.

      Second, when you say you based your “initial hardware on an educated guess of what the interest would be and it turned out our guess was off” – that’s not entirely true. You bought your additional servers with SATA RAID 0 2-drive arrays after you already knew the interest in your product. At that point, you’re not making educated guesses – you’re making decisions based entirely on what you want to pay, not what you actually need. You have to be able to see that in hindsight at least, even if you weren’t able to make that call in the heat of the moment.

      • Actually we got the 2 SATA RAID-0 drives because it was the best thing our hosting provider could get us within 6 hours. We are well aware that they’re not going to solve any and all problems and that faster setups exist. Our current infrastructure is not permanent.

        If your criticism lies in that we don’t have the perfectly scalable setup, then yes you are totally right. But it’s not without reason and it’s definitely not permanent.

        That said, the current setup handles the load very, very well.

      • Brent,

        Thanks for the comment, just wanted to add a few things here that might elaborate a bit on our rationale for having gone this way. It might not change your view on things, but that’s okay; I feel I’m at least obligated to elaborate a bit for completeness’ sake.

        As mentioned before, getting provisioned custom hardware would’ve taken _at least_ a week. The machines we’re working with right now, however, could be provisioned to us within hours. Seeing as we’re in a free beta (perhaps alpha was a better label) and we wanted to get back up as soon as possible, in the heat of the moment we opted for the lesser of two evils, being fully aware of the implications you’ve underlined in your blog post in advance. Making separate backups, however doubtful this might sound to you after reading our decisions, would be the way to go for us for now. Most importantly, however – and this is something I believe where most of the confusion is coming from – we’ve never said that this will be the final setup (and rest assured it isn’t). It’s just the best that we could do with the limited options that were presented to us by our host on such short notice for the beta.

        As mentioned by caseyf, the data being collected at this point is marked as “discardable”. We’ve made mention of this from the very beginning of the beta and our beta testers should be fully aware of that. “But don’t you care about the best fault tolerance/redundancy?” you might be wondering, why yes, we do care, but we also care about getting the features for our product right this early in the beta first. The latter requires us to iterate as quickly as possible and once these are in place, we’ll put our entire focus and resources on getting this as fault-tolerant as possible with the appropriate hardware and mechanisms.

        You mention that it’s a decision made after we went down; indeed, it was. It was to get back online within a day instead of having to wait at least a week. The latter would’ve severely hurt us in the process of iterating on features fast based on user feedback. “But why not get on something like AWS etc…?” Well, as things are currently looking, we’ll need on the order of many TBs to be able to cope with this in a free beta.

        The traction gained in a free beta does not equate to 1:1 paid conversion; in fact, freemium services traditionally have a retention rate of 1-5% if we may believe the HN stories. Putting several thousand dollars a month up front for e.g. AWS or specialized hardware without knowing whether this will be a viable business to begin with (remember, free beta traction is not representative of when people have to actually start to pay) didn’t make sense to us. It will make sense for us, however, once we’ve got the features right and we get a good gauge of the people who are willing to pay.

        • Ninh – it’s one thing to say that your customers’ data can be trashed without warning because you’re using RAID 0. It’s another thing entirely to say that you’re willing to keep rebuilding your infrastructure every time there’s a drive failure. The value isn’t just in the customers’ data – even if that’s worthless, your time is still worth something, and that’s where RAID 1 comes in. By using RAID 0, you’re saying that your time in infrastructure protection isn’t worth anything either, and I find that hard to believe.

          • We’re not willing to rebuild our infrastructure every time there’s a drive failure. In the final paid version we will have database replicas in place with IP failover so that a disk failure will result in quick automatic recovery of the cluster. We do not consider RAID-1 a valid backup strategy. Better to have replicas and backups on different servers that stay on standby.

      • “Pretty much? No, all data is gone, period.”

        Not all data is stored on the RAID-0 disks.

        “When they say they’ll make separate backups, I have a hard time taking this seriously when their IO subsystems can’t even handle the load of the existing end user writes, let alone a separate set of processes reading from those servers.”

        Database replication happens in real-time and doesn’t require additional disk I/O from the replication source.

        And the current IO subsystems *can* handle the load. After the capacity expansion everything has been running very, very smoothly. The RAID-0 setup is considerably faster than the original RAID-1 setup.

        • “Database replication happens in real-time and doesn’t require additional disk I/O from the replication source.”

          Unless of course the replication target lags behind because its disks can’t handle the speed, or network failure, etc etc. Just saying this for completeness’s sake so that you know we are aware of the edge cases.

  2. Pretty scary stuff. Having lots of RAM can help cover up an inadequate I/O subsystem to an extent, but having just two spindles obviously will not cut it. A single socket Core i7 server is limited to 24GB of RAM, so that is not going to last them very long either.

  3. I don’t have much to say about the choices made after the initial problems (other than it looks like they are going to be in for more surprises), but as a beta user who was probably among those sending an unanticipated amount of traffic, I have a few notes:

    * Not sure if it is clear to readers, but all of the data is pretty worthless. The only thing that isn’t monitoring/performance data is my username, password, and a few other config things. Since they are in beta (alpha?) and nobody is paying, they probably should have said at the beginning: “we may have to throw away your data from time to time. Thanks for helping us test.”

    * That initial machine wasn’t really server-class at all (processor, memory, disk) given its roles, and it looks like they are seriously pinching pennies at this point. That’s understandable. It’s surprising considering our modern-day definition of beta, but their post said that they are still discovering whether this service is a viable business. Maybe “beta” was the wrong label.

    * SSDs might have been a good emergency fix but with the *massive* amount of data that they are going to be dealing with (judging by what is shown in the app) and the amount of churn, SSD is probably not a possibility for them long term.

    If I were them, my biggest concern would be storing terabytes of data in Mongo… but what do I know.

    On a more positive note, I did try this out on one app server node and I’m most happy that all of these problems had zero impact on my application. I’m not sure how the collection and reporting works but I’m not surprised. Their product does have the advantage of being part of Passenger and outside of the application itself (unlike New Relic, etc) and the Phusion guys certainly have proven their programming chops.

    Given the nature of the service, I’d say that downtime or even losing data is much less harmful to the future of the business than impacting a customer’s app would have been.

  4. Brent,

    Not sure how you got onto this story but, as a bit of background, the Phusion guys are some of the smartest guys I’ve come across in the Ruby community (and that bar is crazy, crazy high). Nothing in your post will be news to them.

    I think they genuinely just put it up to see if there was *any* interest, got flooded, and went looking for a way to cope until they could get a better setup online. It’s more of an ‘alpha’ than a ‘beta’ in my view, but that’s the point of the label, right? You’re saying ‘this thing may not work 100%’.

    Having said all that, I totally agree about the OCZ Revo drives; low-ish cost and amazing performance. If you go with PCI SSD you can relegate spinning disk to local backup / fallback.

    Joe

  5. Good post, and lots of stuff that I’ve jumped into as the first full-time DBA for the company I’m at… been here just over 9 months now. I started with a primary intranet DB server that had 3GB of RAM, a 200GB RAID 5 for data (4 disks), and a 100GB RAID 1 log drive. We’ve just moved the last database off that server and onto a Cisco box with 32GB of RAM, attached to a NetApp FAS3100 series filer with a number of disk shelves.

    Even better was their WSS3.0 pilot installation that ended up being taken over by the business for every day use. That’s a Web/App server and a DB backend that has 2GB of RAM and 4 disks, two mirrored for OS and two mirrored for DB and Logs. That server now has an External USB drive plugged in just to store nightly DB backups. That just cut over last weekend to a three tier MOSS 2010 install with the new DB server the same set up as the above DB server.

    It took me about a month solid of gathering facts and baselines, making charts, growth planning, risk mitigation, etc. to get management to spend the money needed.

    One final note, reading “schemaless” is worse than saying it!

  6. Brent,

    You’re right – it’s fantastic that Phusion shared this information with the world. Kudos to them. Software development and application engineering are insanely complex endeavors – with so many angles to consider and get right.

    But in this case, Phusion’s use of a NoSQL solution to go ‘web scale’ REALLY highlights my biggest problem or beef with NoSQL. Which is that too many organizations turn to it because Relational Databases just aren’t ‘powerful enough’ to handle their loads.

    Only, what’s really clear from this example is that had Phusion used a relational database they NEVER would have given it enough horsepower to meet their needs. Yeah, they probably would have put decent RAM and processors in place… then potentially bitched about the ‘relational’ overhead and how crappy RDBMSes are when it comes to keeping up with load. Why? Because throwing two cheap/crappy drives at a time at massive IO requirements, as they’ve done here, showcases how many organizations don’t correctly address capacity planning needs – simply because developers tend to focus more on code than on the storage/hardware/capacity requirements to handle that code.

    And again, custom software and solutions are complex and hard to get right. I’m not faulting or criticizing Phusion’s engineers. Just pointing out that when creating their solution it’s obvious that they never sat down and figured out what kinds of rates of data they’d be pulling in, and how those rates/requirements would translate into disk utilization and workload characterization in terms of mapping the number of spindles/buses that they’d need to meet load requirements. That, and using RAID-0 (and SATA drives for intensive writes) shows, again, a fundamental lack of storage understanding and experience. Yet, I’m SURE that Phusion chose NoSQL because RDBMSes can’t scale or handle large workloads – when it’s clear that had they used a relational storage engine in this case they never would have given it the correct storage capabilities to meet load requirements.

    • I wish people would drop the “web scale” point. We never claimed to be going for or needed or expected web scale. We chose MongoDB because:
      1) we expected that eventually we might get a lot of data, so better choose a system that can be easily sharded, and
      2) because our data is structured in such a way that a schemaless database works better.

      “OMG MySQL is not web scale and MongoDB is!!!!!111” has never been our reason for choosing MongoDB. Please don’t assume that everybody uses NoSQL databases for the wrong reasons.

      “Yeah, they probably would have put decent RAM and Processors in place… then potentially bitched about the ‘relational’ overhead and how crappy RDBMses are when it comes to keeping up with load.”

      No, we wouldn’t have. It’s not the RDBMS’s fault that it can’t keep up with the load, it’s the hardware’s. The only thing I could have blamed RDBMSes for is that they don’t allow easy sharding without application-level changes. RDBMSes have their place and we don’t store everything in MongoDB.

      “Why, because throwing two cheap/crappy drives at a time at massive IO requirements as they’ve done here showcases how many organizations don’t correctly address capacity planning needs”

      Again, our current setup is not permanent, it’s just the best our hosting provider could give us within 6 hours. As load grows, time goes by and as we’re nearing the final paid product, we definitely have plans to put better and safer infrastructure in place.

      Furthermore, we didn’t expect the I/O requirements would be that high. We did some tests and we estimated lower I/O usage in the beginning.

      “That, and using RAID-0 (and SATA drives for intensive writes) shows, again, a fundamental lack storage understanding and experience.”

      This dead horse called RAID-0 has been beaten enough. If you look at the updated architecture overview image you’ll see that we *have* accounted for backups and replicas to save us from RAID-0 failures.

  7. Hm, I do not share your opinion, Brent. It is a logical decision to risk crashing under load in order to save money. The two cases “extreme success” and “moderate success and below” have to be evaluated; that is all there is to it. Few products ever become successful enough to ever need more than a commodity server. Where is your blog hosted, Brent? Surely not on SSDs, and rightly so.

    • Tobi – my blog is hosted using RAID 1, not RAID 0. Not sure what you’re trying to say there – I wouldn’t use any hosted service that didn’t provide protection from a hard drive failure.

      • You don’t have separate backups outside of the RAID-1 copy? What do you do if your motherboard explodes and takes both harddisks with it?

        • RAID is not backups – it’s an availability solution. Drives are the most common failure points for servers. In the event of a drive failure, you don’t want to suddenly have an overwhelming amount of load refocused onto your other replica. It will have to handle double the user load since the other replica went down, PLUS it will have to repopulate the new drives in the other replica. You’re talking about a sudden 3-5x load, and if that other box is only RAID 0 too, you’re playing a dangerous game. You’re hoping you won’t have another drive failure on that replica during a 3-5x load period – not a good gamble at all.

          • Now I finally understand your point about RAID-0. People weren’t very helpful by just pointing out “a fundamental lack of understanding of RAID-0”. However, I know about this, and said problem can be solved by having multiple replicas, which is a good idea for HA anyway – I much prefer replicas over RAID for HA.

  8. It’s interesting to read the original post and the responses in the comments here.

    Given the failure rate of drives in production (see Failure Trends in a Large Disk Drive Population), throwing RAID 0 at data is downright dangerous. Likewise, expecting anything but mediocre performance from SATA is just silly.

    One of the biggest issues right now is that there are few guidelines for scaling MongoDB once you’ve exhausted the bounds of a single box. The ‘best’ advice that I’ve come across is to add more RAM until the machine can’t handle it, then you add more shards. The problem is, of course, that you have to pull data off of one shard and push it to another before things balance out. Moving data off of SATA is just plain slow.

    It seems to me that this is a problem of hardware provisioning: the initial hardware was too cheap and new hardware is even worse. Redundancy with MongoDB can be easily solved by using replica sets in addition to sharding. I’m interested in why they didn’t even have a single replica in place to give them some level of redundancy.

    All in all, none of this is surprising. The hardware is underpowered. Architecturally, things seem sound. I’m more than a bit worried that the filesystem choice 1) came so late in the game and 2) was expected to solve many problems. Given some of the database-related research into Linux filesystem performance, I’m surprised that they started out with something other than XFS. I’m aware I linked to a PostgreSQL blog post, but databases can only write data in a few different ways; MongoDB is no exception to this.

    • “I’m interested in why they didn’t even have a single replica in place to give them some level of redundancy.”

      Because it was day 1 of the beta and we expected few sign ups in the beginning. A beta is not guaranteed to work correctly or reliably, hence the label. For the final version we definitely have plans to put replicas in place. See the updated architecture image.

      “and 2) was expected to solve many problems”

      I don’t know where you got the impression that we expected the filesystem to solve many problems. We don’t. But that doesn’t mean a proper filesystem choice can’t help a little.

  9. Pingback: Brent Ozar – RAID 0 SATA with 2 Drives: It’s Web Scale! – Yostivanich

  10. Great info, guys. In your opinion, what specs should they use for a server? Is an i7 OK for so much traffic?

    • CPU isn’t the problem, disk speed and RAM are. We have a lot of CPU to spare. That said the current cluster is handling the current traffic very well.

  11. Ok, give them a break. (snark) This from a guy who’s talking about Web Scale and SQL Server in the same post? (/snark)
    First, saying that someone “at web scale” needs RAID is like telling Google that they need RAID. You may need RAID if you can’t accomplish data redundancy ACROSS SERVERS, which clearly is what they’re aiming at with Mongo. Sorry, DBA guy. You’re just wrong on this one.
    Six disks in RAID 10 would only be 50% faster than two disks in RAID 0. RAID 0 is exactly twice as fast for both reads and writes as two disks minus a minuscule amount for overhead. RAID 10 on ten disks means that half your disks are backups, and half are RAID-0, and overhead would be higher for RAID-10 on six disks than RAID-0 on two. Thus, it’s effectively three disks on RAID 0, not two, and you might actually end up with some bus contention on six disks unless you have a high-end (expensive) controller. So from a speed perspective, you’re right, RAID-10 on 6 would be better, but the cost differential is usually astronomical. I bet they could get three servers with two drives in RAID-0 for the same price as one server with six disks in RAID-10 at almost any decent hosting provider and get extra cores and RAM to boot.
    I have dozens of servers in two datacenters running exactly this configuration. It’s cheap and I can fail between servers. You want to know something funny? When I first started doing this, I was insanely careful and put all kinds of failover systems in place. Since then, I’ve had exactly one drive failure in the last four years. Of course, when it failed, it took down the whole box, but things failed over so smoothly that not a single customer ever noticed.
    Bottom line: their solution is definitely viable in production and IMO your post was more than a wee bit rude and offensive to some guys (who I don’t know) who are just out there getting things done.

    • “So from a speed perspective, you’re right, RAID-10 on 6 would be better, but the cost differential is usually astronomical.” – yes, and so is the speed difference. You get what you pay for. Imagine that.

      “I bet they could get three servers with two drives in RAID-0 for the same price as one server with six disks in RAID-10 at almost any decent hosting provider and get extra cores and RAM to boot.” – That hasn’t been my experience, and additional servers don’t make management easier – they make it harder.

      “I have dozens of servers in two datacenters running exactly this configuration.” – Wow, so we’re definitely going to agree to disagree there. If you’re enough of a fan of RAID 0 to use it in dozens of servers, we’re definitely on different pages.

      “IMO your post was more than a wee bit rude and offensive to some guys (who I don’t know) who are just out there getting things done.” – They had to take their site down, dude. Since that’s your definition of getting things done, suddenly I understand much more about your use of RAID 0. In that case, go ahead and keep getting ‘er done.
