Everyone needs a distributed database right? Wrong. It’s easy to get so excited about trying out a new technology that you close your eyes to the problem you’re trying to solve. I came across a question on Stack Overflow where the poster was asking what kind of distributed database they needed. Rather than jump right into the fray and say “You don’t need it,” I took a step back and asked myself “What’s the real question here?”
Understand Your Requirements
We all hope that some day our business will be as successful as Google or Facebook or whatever the next big thing is. Those business got that way by first concentrating on doing something incredibly well and then growing as needed. Nobody reasonable wakes up and says “We’re going to build this incredibly robust application and here’s how we’re going to do it.” The internet is full of articles about how different companies have dealt with their problems of scaling. Do your job well and worry about success when it happens.
In the StackOverflow question, Ted (we’ll call this person Ted) was asking which distributed database they should use to scale their system. They gave some vague requirements for data throughput, but left out why the system needed to be distributed across multiple servers.
This triggered my buzzword detector. I think distributed databases are incredibly cool, but they have a place; Ted’s requirements didn’t match at all. without much explanation for why a distributed database would be important here, it was hard to even refute the argument about using a distributed database.
Distributed databases have some operational advantages – they tend to be more robust and tolerant of equipment failures, but that’s based on certain configuration details like using multiple server replicas. RDBMSes, out of the box, aren’t distributed across multiple servers, but there are a lot of features that have been built to make it possible to replicate data across data centers or to shard the database across multiple servers. The closest thing to a business requirement was that the database needed to be free; open source would be nice, too.
Understand Your Hardware
Business requirements only matter so much. Eventually somebody has to write data to a disk. Once that drive head starts moving, the best designed software won’t matter if the underlying hardware can’t keep up with the load. That’s why it’s important to know what your hardware is capable of handling. It’s just as important to know what your application is capable of producing
Ted needed a system that would be handling less than 100 transactions per second and would probably end up writing data at a rate of around 400 kilobytes per second. Neither of these requirements are show stoppers. Assuming that the server was going to be writing at a constant rate, the amount of data generated and kept would be around 10 terabytes of data a year. While it’s nothing to scoff at, it’s not an unheard of data generate rate. The thing is, almost any off the shelf database software can handle these kinds of load. Almost any off the shelf server can handle this kind of data throughput.
The requirement to handle ~100 requests per second at around 4 kilobytes per record isn’t a matter of choosing a database product, it’s a matter of designing a storage solution that can handle the ongoing needs of the business. When SAN space can be purchased from around $15,000 per terabyte, 10TB per year becomes a minor budget line item for all but the most cash strapped startup.
Understand Your Data
There was one feature I left out until now. It’s important to understand how your data will be used. Graph databases excel at helping users explore the relationship between different data points. Relational databases make it easy to build incredibly flexible, good enough solutions to most problems. Key value databases make it possible to do single key lookups in near constant time. The way that data is read limits the playing field.
Ted mentioned that almost of the data lookups were going to be by primary key lookups. If this were the only requirement for reading data, this problem could be solved by any database. Then he threw in a little hook – there would be some joins in the data. In the world of databases once you use the j-word, your options get very limited very quickly. You have to start thinking about querying patterns, potential optimizations, and the trade offs of read vs write optimizations.
If you do need joins, you can take one of two approaches – let the database do it, or write it yourself. Neither approach is difficult (and one is certainly easier than the other), but they’re both feasible – heck the hard part has already been done for you: someone else came up with the algorithm.
Justice Potter Stewart in an attempt to classify obscenity said “I shall not today attempt further to define the kinds of material I understand to be embraced… but I know it when I see it.” Right now, there’s no good definition of what makes data into Big Data. Some people say that you’ve hit Big Data when you can no longer predict query performance. Some people use hard and fast numbers of data volume in bytes, in data churn rate, or in the massively parallel nature of the database. There’s no right or wrong answer and Big Data is something that varies from organization to organization.
It’s important to understand what problem you’re trying to solve, understand the volume of data, and understand how the data is going to be used before making the final selection. There are many ways to store data.
What would I have done in this situation? Taking into account that I know SQL Server well, I would use SQL Server. SQL Server can perform admirably as a glorified key value store. B+trees are pretty quick in most use cases and they balance many of the problems of simultaneously reading and writing data to provide a good enough solution to the problem (with great management tools on top). When business users demand better querying capability, it’s easy enough to start adding non-clustered indexes on top of the solution.
Thought I would add a comparison chart recently introduced on the twittersphere that includes PostgreSQL (free and open source) and SQL Server:
I do agree with the main point of your article, though. Leave the buzzwords behind and use requirements/data to be the drivers for these decisions.
Yeah, I’d say it depends on the amount of data and the type of data–if it’s just key values, and not an Amazonian amount of them, yeah, SQL should be fine. Something like Redis (or Dynamodb, if he is willing to pay) would be better if he does need flexible scalability.
I happen to use a database, Vertica, for data warehousing which is distributed from the ground up. It does give a lot of advantages. The biggest one that it has that if you need to scale up you just add more nodes.
I’m sure with traditional relational databases (“Old SQL”) getting advantages out of a distributed system is much more painful. But if you have a database that has it built into the design from the ground up, it allows you to scale easily when you need to.