When you start developing a new application how do you pick the database back end? Most people pick what they know/what’s already installed on their system: the tried and true relational database. Let’s face it: nobody is getting fired for using a relational database. They’re safe, well understood, and there’s probably one running in the datacenter right now.
Here’s the catch: if you’re using a relational database without looking at anything else, you’re potentially missing out. Not all databases work well for all workloads. Microsoft SQL Server users are already aware of this; there are two distinct database engines – one for transactional and one for analytical workloads (SQL Server Analysis Services).
When you start a new project, or even add a substantial feature, you should ask yourself [questions about your data]. Here are a few sample questions:
- Why am I storing this data?
- How will my users query the data?
- How will my users use this data?
The Relational Database
There are a lot of reasons to use a relational database. Relational databases became the de facto choice for databases for a number of reasons. In addition to being based on sound mathematical theory and principles, relational databases make it easy to search for specific information, read a select number of columns, and understand structure by querying the metadata stored in the relational database itself.
The self-describing nature of a relational database provides additional benefits. If you’ve created a relational database with referential integrity and schema constraints you are assured that every record in the database is valid. By enforcing data integrity and validity at the data level, you are assured that any data in the database is always correct.
The adoption of SQL as a standard in the mid-1980s finalized the victory of the relational database for the next 25 years. Using SQL it became easy to create ad hoc queries that the original database developers had never dreamed of. The self-describing nature of relational databases combined with SQL’s relatively simple syntax was hoped to make it easy for savvy business users to write their own reports.
In short, relational databases make it easy to know that all data is always correct, query data in many ways, and use it in even more ways. Will all of these benefits, you’d think that people would have no need for a database other than a relational database.
Document databases different from relational databases primarily because of how they store data. Relational databases are based on [relational theory]. While databases differ from relational theory, the important thing to remember is that relational database structure data as rows in tables. Document databases store documents in collections. A document closely resembles a set of nested attributes or, if you’re more like me, you might think of it as a relatively complete object graph. Rather than break an application entity out in to many parts (order header and line items) you store application entities as logical units.
The upside to this is that document databases all developers to create software that reads and writes data in a way that is natural for that particular application. When an order is placed, the order information is saved as a logical and physical unit in the database. When that order is read out during order fulfillment, one order record is read by the order fulfillment application. That record contains all of the information needed.
Unlike relational databases, document databases do not have a restriction that all rows contain the same number of columns. We should store similar objects in the same collection, but there’s no mandate that says the objects have to be exactly the same. The upside of this is that we only need to verify that data is correct at the time it is written. Our database will always contain correct data, but the meaning of “correct” has changed slightly. We can go back and look at historical records and know that any record was valid when it was written. One of the more daunting tasks with a relational database is migrating data to conform to a new schema.
While data flexibility is important, document databases may make it difficult to perform complex queries. Document databases typically do not support what many database developers have come to think of as standard operations. There are no joins or projections. Instead it’s a requirement to move querying logic into the application tier.
Database developers will find the following query to be a familiar way to locate users who have never placed an order.
FROM users u
LEFT JOIN orders o WHERE u.user_id = o.user_id
WHERE o.order_id IS NULL;
With a document database a naive approach might be to write queries that retrieve all users and orders and subsequently merge the list of results. A more practical approach is to cache a list of order IDs within the user object to improve look up performance. This seems like a horrible idea to many proponents of relational thinking, but it allows for rapid lookups of data and is considered to be an acceptable substitute for joins in a document database. Finding the users who have never placed an order becomes as simple as looking for users without an orders property. Some document databases support secondary indexes, making it possible to improve lookups.
Document databases are a great fit for situations where an entire object graph will always be retrieved as a single unit. Additionally, document databases make it very easy to model data where most records have a similar core of functionality but some differences may exist between records.
Key/value stores are simple data stores. Data is identified by a key when it is stored and that key is used to retrieve data at some point in the future. While key/value stores have existing for a long time, they have gained popularity in recent years.
Many data operations can be reduced to simple operations based on primary key and do not require additional complex querying and manipulation. In addition, key/value stores lend themselves well to being distributed across many commodity hardware nodes. A great deal has been written about using key/value stores. [Amazon’s Dynamo] is an example of a well documented and much discussed key/value store. Other examples include Apache Cassandra, Riak, and Voldemort.
Key/value stores typically only offer three data access methods:
delete. This means that joins and sorting must be moved out to client applications. The data store’s only responsibility is to serve data as quickly as possible.
Of course, if key/value stores did nothing apart from serve data by primary key, they wouldn’t be terribly popular. What other features do they offer to make the desirable for production use?
It is very easy to scale a key/value store beyond a single server. By increasing the number of available servers, each server in the cluster is responsible for a smaller amount of data. By distributing data, it’s possible to get faster throughput and better data durability than is possible with a monolithic server.
Many key/value stores use a technique known as consistent hashing to divvy up the key space. Using consistent hashing means we can divide our key space into many chunks and distribute responsibility for those many chunks across many servers. Think of it like this: when you go to register in person at an event the alphabet has frequently been divided up into sections at separate tables. Splitting up responsibility for check ins across the alphabet means that, in theory every attendee can be served faster by having multiple volunteers sign them in. Likewise, we can spread responsibility for different keys across different servers and spread the load evenly.
Data is replicated across many servers. Replicating data has several advantages over having a single monolithic, robust, data store. When data is stored on multiple servers the failure of any single server is not catastrophic; data can still be read and written while the outage is solved.
Hinted handoff mechanisms make it easy to handle writing during a server outage. If a server is not available to write data, other servers will pick up the load until the original server (or a replacement) is available again. Writes will be streamed to the server responsible for the data once it comes back online. Much like replication, hinted handoff is a mechanism that helps a distributed key/value store cope with the failure of individual server.
Many distributed databases use a master server to coordinate activity and route traffic. Master/coordinator servers create single points of failure as well as singular bottlenecks in a system. Many distributed key/value databases bypass this problem by using a heterogeneous design that makes all nodes equal. Any server can perform the duties of any other server and communication is accomplished via gossip protocols.
The previous features add resiliency and fault tolerance to key/value data stores. Combining these features makes it possible for any node to serve data from any other node, survive data center problems, and survive hardware failures.
Column-oriented databases store and process data by column rather than row. Although commonly seen in business intelligence, analytics, and decision support systems, column-oriented databases are also seeing use in wide table databases that many have sparse columns, multi-dimensional maps, or be distributed across many nodes. The advantage of a column-oriented approach is that data does not need to be consumed as an entire row – only the necessary columns need to be read from disk.
Column-oriented databases have been around for a long time; both Sybase IQ and Vertica are incumbents, and SQL Server Apollo is Microsoft’s upcoming column store, slated for release in SQL Server Denali. Google’s Bigtable, Apache HBase, and Apache Cassandra are newer entrants into this field and are the subject of this discussion. Bigtable, HBase, and Cassandra are different from existing products in this field: these three systems allow for an unlimited number of columns to be defined and categorized into column families. They also provide additional data model and scalability features.
I have to speak in generalities and concepts here since there are implementation differences between the various column-oriented databases.
Data Model – Row Keys
A row in a column-oriented database is identified by a row key of an arbitrary length. Instead of using system generated keys (GUIDs or sequential integers), column-oriented databases use strings of arbitrary length. It’s up to application developers to create logical key naming schemes. By forcing developers to choose logical key naming schemes, data locality can be guaranteed (assuming keys are ordered).
The original Bigtable white paper mentions using row keys based on the full URL of a page with domain names reversed. For example
com.stackoverflow.blog. Because data is sorted by row key, this scheme makes sure that all data from Stack Overflow is stored in the same location on disk.
Data Model – Columns & Column Families
Column families are an arbitrary grouping of columns. Data in a column family is stored together on disk. It’s a best practice to make sure that all of the column in a column family will be read using similar access patterns.
Column families must be defined during schema definition. By contrast, columns can be defined on the fly while the database is running. This is possible because column families in a column-oriented database are sparse by default; if there are no columns within a column family for a given row key, no data is stored. It’s important to note that different rows don’t need to contain the same number of columns.
Column-oriented databases don’t natively support secondary indexes. Data is written in row key order. However there is no rule that data can’t be written in multiple locations. Disk space is cheap, CPU and I/O to maintain indexes is not.
The lack of secondary indexes may seem like a huge limitation, however it frees application developers from having to worry about how indexes might be maintained across multiple distributed servers in a cluster. Instead, developers can worry about writing and storing data the same way that it needs to be queried.
N.B. Cassandra has secondary indexes as of Cassandra 0.7
Picking the Right Database
Ultimately, picking the right database depends on workload, expertise, and future plans. It’s worth considering one of many options before settling on a relational database or one of many other databases. They all serve different purposes and fill different niches. The decision to store your data one way will have far reaching implications about how data is written, retrieved, and analyzed.