Tag Archive: nosql

CorrugatedIron v0.1.2 – Now with options!

Corrugated Iron, the C# library to working with Riak, isn’t terribly old – OJ and I released the first version two and a half weeks ago. In the mean time, we’ve fixed a few bugs but haven’t pushed out major enhancements. The minor changes that we’ve pushed out, though, have been in direct response to user feedback. This next version is no exception.

When we first put the library together, we made some very broad assumptions about how people were using the library. We assumed that everyone would be using Inversion of Control containers to do configuration – that turned out to be wrong. We assumed that everyone would not be using load balancers – that turned out to be wrong (fix coming in v0.2). And we assumed that everyone would be using JSON to serialize their data – once again we were wrong.

I like being wrong because it gives me the chance to re-think what I’m doing and learn something new. In this case, a user was looking for examples using a different way to serialize data. I provided some sample code and it only halfway worked; OJ and I had made assumptions about how users would be reading and writing data. Thankfully the developer pointed out the exact code that was in error and I was able to quickly turn around a patch and get it pushed out the door this evening. Once we identified the issue, the turnaround time was about 17 hours. Not too bad for a few weeks on the job.

To get the latest version of CorrugatedIron, head on over to the download page and grab the latest version.

Jeremiah Peschka

Jeremiah Peschka has worked as a database and emerging technology expert at Quest Software where he researched new trends and technologies in the world of data storage. Over the course of his career he’s worked with companies across many industries as a system administrator, developer, and DBA. He’s been involved with all aspects of application development and deployment. He likes cheesecake, coffee, and ice cream.

More Posts - Website

Follow Me:
TwitterFacebook

Which Database Is Right for Me?

When you start developing a new application how do you pick the database back end? Most people pick what they know/what’s already installed on their system: the tried and true relational database. Let’s face it: nobody is getting fired for using a relational database. They’re safe, well understood, and there’s probably one running in the datacenter right now.

Here’s the catch: if you’re using a relational database without looking at anything else, you’re potentially missing out. Not all databases work well for all workloads. Microsoft SQL Server users are already aware of this; there are two distinct database engines – one for transactional and one for analytical workloads (SQL Server Analysis Services).

When you start a new project, or even add a substantial feature, you should ask yourself [questions about your data][1]. Here are a few sample questions:

  • Why am I storing this data?
  • How will my users query the data?
  • How will my users use this data?

The Relational Database

There are a lot of reasons to use a relational database. Relational databases became the de facto choice for databases for a number of reasons. In addition to being based on sound mathematical theory and principles, relational databases make it easy to search for specific information, read a select number of columns, and understand structure by querying the metadata stored in the relational database itself.

The self-describing nature of a relational database provides additional benefits. If you’ve created a relational database with referential integrity and schema constraints you are assured that every record in the database is valid. By enforcing data integrity and validity at the data level, you are assured that any data in the database is always correct.

The adoption of SQL as a standard in the mid-1980s finalized the victory of the relational database for the next 25 years. Using SQL it became easy to create ad hoc queries that the original database developers had never dreamed of. The self-describing nature of relational databases combined with SQL’s relatively simple syntax was hoped to make it easy for savvy business users to write their own reports.

In short, relational databases make it easy to know that all data is always correct, query data in many ways, and use it in even more ways. Will all of these benefits, you’d think that people would have no need for a database other than a relational database.

Document Databases

Document databases different from relational databases primarily because of how they store data. Relational databases are based on [relational theory][2]. While databases differ from relational theory, the important thing to remember is that relational database structure data as rows in tables. Document databases store documents in collections. A document closely resembles a set of nested attributes or, if you’re more like me, you might think of it as a relatively complete object graph. Rather than break an application entity out in to many parts (order header and line items) you store application entities as logical units.

The upside to this is that document databases all developers to create software that reads and writes data in a way that is natural for that particular application. When an order is placed, the order information is saved as a logical and physical unit in the database. When that order is read out during order fulfillment, one order record is read by the order fulfillment application. That record contains all of the information needed.

Unlike relational databases, document databases do not have a restriction that all rows contain the same number of columns. We should store similar objects in the same collection, but there’s no mandate that says the objects have to be exactly the same. The upside of this is that we only need to verify that data is correct at the time it is written. Our database will always contain correct data, but the meaning of “correct” has changed slightly. We can go back and look at historical records and know that any record was valid when it was written. One of the more daunting tasks with a relational database is migrating data to conform to a new schema.

While data flexibility is important, document databases may make it difficult to perform complex queries. Document databases typically do not support what many database developers have come to think of as standard operations. There are no joins or projections. Instead it’s a requirement to move querying logic into the application tier.

Database developers will find the following query to be a familiar way to locate users who have never placed an order.

SELECT  u.*
FROM    users u
        LEFT JOIN orders o WHERE u.user_id = o.user_id
WHERE   o.order_id IS NULL;

With a document database a naive approach might be to write queries that retrieve all users and orders and subsequently merge the list of results. A more practical approach is to cache a list of order IDs within the user object to improve look up performance. This seems like a horrible idea to many proponents of relational thinking, but it allows for rapid lookups of data and is considered to be an acceptable substitute for joins in a document database. Finding the users who have never placed an order becomes as simple as looking for users without an orders property. Some document databases support secondary indexes, making it possible to improve lookups.

Document databases are a great fit for situations where an entire object graph will always be retrieved as a single unit. Additionally, document databases make it very easy to model data where most records have a similar core of functionality but some differences may exist between records.

Key/Value Store

Key/value stores are simple data stores. Data is identified by a key when it is stored and that key is used to retrieve data at some point in the future. While key/value stores have existing for a long time, they have gained popularity in recent years.

Many data operations can be reduced to simple operations based on primary key and do not require additional complex querying and manipulation. In addition, key/value stores lend themselves well to being distributed across many commodity hardware nodes. A great deal has been written about using key/value stores. [Amazon's Dynamo][2] is an example of a well documented and much discussed key/value store. Other examples include Apache Cassandra, Riak, and Voldemort.

Key/value stores typically only offer three data access methods: get, put, and delete. This means that joins and sorting must be moved out to client applications. The data store’s only responsibility is to serve data as quickly as possible.

Of course, if key/value stores did nothing apart from serve data by primary key, they wouldn’t be terribly popular. What other features do they offer to make the desirable for production use?

Distributed

It is very easy to scale a key/value store beyond a single server. By increasing the number of available servers, each server in the cluster is responsible for a smaller amount of data. By distributing data, it’s possible to get faster throughput and better data durability than is possible with a monolithic server.

Partitioning

Many key/value stores use a technique known as consistent hashing to divvy up the key space. Using consistent hashing means we can divide our key space into many chunks and distribute responsibility for those many chunks across many servers. Think of it like this: when you go to register in person at an event the alphabet has frequently been divided up into sections at separate tables. Splitting up responsibility for check ins across the alphabet means that, in theory every attendee can be served faster by having multiple volunteers sign them in. Likewise, we can spread responsibility for different keys across different servers and spread the load evenly.

Replication

Data is replicated across many servers. Replicating data has several advantages over having a single monolithic, robust, data store. When data is stored on multiple servers the failure of any single server is not catastrophic; data can still be read and written while the outage is solved.

Hinted Handoff

Hinted handoff mechanisms make it easy to handle writing during a server outage. If a server is not available to write data, other servers will pick up the load until the original server (or a replacement) is available again. Writes will be streamed to the server responsible for the data once it comes back online. Much like replication, hinted handoff is a mechanism that helps a distributed key/value store cope with the failure of individual server.

Masterless

Many distributed databases use a master server to coordinate activity and route traffic. Master/coordinator servers create single points of failure as well as singular bottlenecks in a system. Many distributed key/value databases bypass this problem by using a heterogeneous design that makes all nodes equal. Any server can perform the duties of any other server and communication is accomplished via gossip protocols.

Resiliency

The previous features add resiliency and fault tolerance to key/value data stores. Combining these features makes it possible for any node to serve data from any other node, survive data center problems, and survive hardware failures.

Column-Oriented Databases

Column-oriented databases store and process data by column rather than row. Although commonly seen in business intelligence, analytics, and decision support systems, column-oriented databases are also seeing use in wide table databases that many have sparse columns, multi-dimensional maps, or be distributed across many nodes. The advantage of a column-oriented approach is that data does not need to be consumed as an entire row – only the necessary columns need to be read from disk.

Column-oriented databases have been around for a long time; both Sybase IQ and Vertica are incumbents, and SQL Server Apollo is Microsoft’s upcoming column store, slated for release in SQL Server Denali. Google’s Bigtable, Apache HBase, and Apache Cassandra are newer entrants into this field and are the subject of this discussion. Bigtable, HBase, and Cassandra are different from existing products in this field: these three systems allow for an unlimited number of columns to be defined and categorized into column families. They also provide additional data model and scalability features.

I have to speak in generalities and concepts here since there are implementation differences between the various column-oriented databases.

Data Model – Row Keys

A row in a column-oriented database is identified by a row key of an arbitrary length. Instead of using system generated keys (GUIDs or sequential integers), column-oriented databases use strings of arbitrary length. It’s up to application developers to create logical key naming schemes. By forcing developers to choose logical key naming schemes, data locality can be guaranteed (assuming keys are ordered).

The original Bigtable white paper mentions using row keys based on the full URL of a page with domain names reversed. For example www.stackoverflow.com becomes com.stackoverflow.www and blog.stackoverflow.com becomes com.stackoverflow.blog. Because data is sorted by row key, this scheme makes sure that all data from Stack Overflow is stored in the same location on disk.

Data Model – Columns & Column Families

Column families are an arbitrary grouping of columns. Data in a column family is stored together on disk. It’s a best practice to make sure that all of the column in a column family will be read using similar access patterns.

Column families must be defined during schema definition. By contrast, columns can be defined on the fly while the database is running. This is possible because column families in a column-oriented database are sparse by default; if there are no columns within a column family for a given row key, no data is stored. It’s important to note that different rows don’t need to contain the same number of columns.

Indexing

Column-oriented databases don’t natively support secondary indexes. Data is written in row key order. However there is no rule that data can’t be written in multiple locations. Disk space is cheap, CPU and I/O to maintain indexes is not.

The lack of secondary indexes may seem like a huge limitation, however it frees application developers from having to worry about how indexes might be maintained across multiple distributed servers in a cluster. Instead, developers can worry about writing and storing data the same way that it needs to be queried.

N.B. Cassandra has secondary indexes as of Cassandra 0.7

Picking the Right Database

Ultimately, picking the right database depends on workload, expertise, and future plans. It’s worth considering one of many options before settling on a relational database or one of many other databases. They all serve different purposes and fill different niches. The decision to store your data one way will have far reaching implications about how data is written, retrieved, and analyzed.


Resources

Jeremiah Peschka

Jeremiah Peschka has worked as a database and emerging technology expert at Quest Software where he researched new trends and technologies in the world of data storage. Over the course of his career he’s worked with companies across many industries as a system administrator, developer, and DBA. He’s been involved with all aspects of application development and deployment. He likes cheesecake, coffee, and ice cream.

More Posts - Website

Follow Me:
TwitterFacebook

NoSQL Basics for Database Administrators

Most SQL Server DBAs don’t even like MySQL, let alone NoSQL.  They cringe when they hear people say things like, “I don’t need transactions” or “The data will eventually be consistent” or “Indexes will be supported in the next version.” SQL Server and Oracle people are constantly trying to out-feature each other, and the abject lack of features in the NoSQL world comes as a complete shock to both parties.

Thing is, NoSQL and cloud databases do fill some very real business needs.  Today, I’m going to give you a peek into the NoSQL/cloud-database world. You can be a better database professional if you understand what these tools are trying to achieve, who wants to use them, and how your database compares with theirs.

Let’s start at the beginning: who needs this stuff? To answer the question, let’s look at a few common projects that are probably happening right now in your company with conventional databases.

Use Case #1: Logging High-Volume, Low-Value Data

Most companies have a web filtering system that makes sure employees aren’t surfing dangerous sites during business hours. They want to reduce their Internet connection needs by eliminating streamed music or video, and they want to avoid sexual harassment lawsuits caused by employees surfing for porn at work. The filtering system maintains two databases: configuration data such as employees, departments, and blacklisted web sites, and monitoring data that covers who surfed for what, and when. The config data has to be perfect – we can’t have our configuration data disappearing, but that isn’t too big of a problem because it’s a fairly small amount of records.

I can tell when someone's been surfing dirty web sites.

I can always tell who surfs the dirty sites.

The monitoring data is important, but not all that important. We want to capture as much history as possible, and we’re going to be inserting data fast and furious, but it wouldn’t kill us if we missed a few rows. Odds are someone won’t surf for illicit material just once – they’ll do it repeatedly over time. We don’t make money on this data (although we could make an argument that it helps us avoid risk) so the CIO isn’t prepared to spend a fortune archiving Johnny’s Google searches. The only time we query this data is when HR wants to check Johnny’s web activity – we don’t need to run fancy reports to see the average web use by department.

This system could live without fancy transactional backups. We just need to do a full backup periodically – even once a week or per month would be fine if it saved us a lot of money. When it goes down, we don’t need to restore it quickly – or maybe not even at all, if HR doesn’t ask for a report. We just need to set up a new database to start taking inserts again.

What if we had a way to store multiple terabytes of this data on commodity hardware that would have a very low cost, pretty high availability, but might not give us very good reporting options or transactional consistency?

Use Case #2: One-Time, One-Way, Big-Load Analysis

When a company’s data gets large enough, they start asking questions about what the data might reveal. I’ve worked with PhD’s and analysts who say things like, “I need a copy of the data warehouse to solve a problem I’m working on. I’ve got a hunch about the numbers, and I’ll need to do a week or two of slicing and dicing to figure things out.” It’s just one or two rocket scientists working on the data for a relatively short amount of time.

They don’t want to synchronize the company’s incoming data with this research dataset, and they don’t need to back it up. They just want you to hand over the data, and then you won’t hear from them again. If it goes down, that’s okay – it doesn’t affect sales. Sometimes they use their own SQL Servers, sometimes analytic tools, sometimes its plain old Excel or Access. They won’t write production reports on this data – they’re just going to dive into specific queries and analytics, and then write PowerPoint decks or white papers on the results.

They might do similar projects like this over and over with different datasets from different areas in the company, but they use fresh sets of data every time. They could use a completely different toolset each time without training worries – these are rocket scientists, after all.

What if you could give them a tool that could handle rapid loads of completely different data each time, perform wild and crazy queries, but it might not use traditional query tools or database languages? These users aren’t proficient with SQL queries or RDBMS technologies anyway – they could use anything as long as it’s cheap and scales well.

Why Not Use SQL Server or Oracle?

Mozilla Labs Test Pilot Project

Mozilla Labs Test Pilot Project

I bet you’ve run into projects just like these in your career, and until now, you’ve solved them with the only tool you really know well – relational databases like SQL Server.  SQL Server, when properly configured and managed, doesn’t blink at the requirements of most NoSQL projects.  Take the Mozilla Test Pilot experiment:

  • Expected amount of data stored per experiment: 1.2 TB
  • Expected peak traffic: approximately 75 GB per hour for two 8 hour periods following the conclusion of an experiment window.  This two day period will result in collection of approximately 90% of the total data.
  • Remain highly available under load
  • Provide necessary validation and security constraints to prevent bad data from polluting the experiment or damaging the application

Here’s my personal take on it – NoSQL doesn’t mean NoSQLServer, but rather NoMySQL.  MySQL doesn’t scale as easily to these levels, whereas it’s not much of an event anymore for a single-node SQL Server to scale to multiple terabytes.  I’ve worked with plenty of SQL Server data warehouse systems that have much larger volumes and loads of data than this.  In the last month, I’ve worked with two in particular that ran on less than $50k of commodity hardware.

It’s not that SQL Server or Oracle can’t scale to these levels.

It’s that NoSQL and cloud databases have a few crucial differences about how they scale.

NoSQL Has a Different Cost Model

If you’re doing multi-terabyte databases, you probably need SQL Server Enterprise Edition, which is licensed in one of two ways:

  • $27,495 per CPU socket for unlimited users, or
  • $8,592 per server plus $164 per user (unless the user already has a SQL CAL)

When you start a SQL Server project, you need to have an accurate CPU count and stick to it.  If you decide halfway into the project that you need a lot more power to get the job done, you have to buy more licenses.  These licenses stick around, and that long-term overhead becomes a business problem.  Businesses don’t like to see expensive software licenses not in use when a project isn’t doing number-crunching.

In addition, even just four sockets of Enterprise at $35-120k is a lot of money that could be spent on hardware – and therefore performance – instead.  Even if SQL Server was 50% faster than a competing solution, that advantage is negated when the user spends 50% more on hardware instead to get the same speed at a lower overall cost.  Tough sell.

The advantage of buying an industry standard database with a long history (whether it be DB2, Oracle, SQL Server, or Sybase) is that it’s relatively easy to find staff and software that interact with that database.  NoSQL and cloud databases, on the other hand, come with no up-front licensing – but because there’s no established industry standard, it’s very tough to find experienced staff off the street.

An early build of MulletDB

An early build of MulletDB

How do I hire experienced NoSQL-savvy-staff when the very NoSQL databases out there keep changing?  Here’s a sampling of announcements from one 24-hour period:

NoSQL solutions have different costs.  Immature platforms suffer from a lack of documentation, tooling, and commonly-needed features, thereby driving implementation costs up.

But is that really a problem?  There’s an abundance of documentation, tools, and features for Microsoft SQL Server, but I almost never see it implemented correctly. Just because there’s a manual doesn’t mean anybody’s going to read it or heed it, and just because there’s a tool doesn’t mean users will use it appropriately.  It’s easy for me to hire SQL Server people off the street, but that doesn’t mean that they’re good, either.

The best NoSQL solutions solve the human cost problem by building in the things you need in order to scale, store data, and consume data.  It’s not all that hard, either – the scenarios I described in the beginning of the post had one thing in common: a very limited number of features.  Some NoSQL and cloud database users need to insert a lot, and they need to run very specific queries, but that’s about it. This is where open source products can really shine – they need even less features than MySQL has.

NoSQL Makes Smart Compromises to Achieve Scale

Right from the start, cloud databases and NoSQL projects aimed to scale out. Whether you’re using pizza boxes or virtual servers in the cloud, you can get more capacity by throwing more of them into your resource pool and letting the platform spread the load around.  The Visual Guide to NoSQL Systems illustrates the compromises – between consistency, availability, and partition tolerance, you can pick any two. If every client needs to see an identical copy of the data, there’s an app for that. If you can tolerate different query results in exchange for higher availability, there’s plenty of choices for that too.

MongoDB’s feature set is a good example.  They tout:

  • Auto-Sharding – Scale horizontally without compromising functionality.
  • Replication & High Availability – Mirror across LANs and WANs for scale and peace of mind.
  • GridFS – Store files of any size without complicating your stack.

If I was a business manager, I’d take those bullet points straight to my database team and ask, “Can our database do this?  If not, why don’t we try MongoDB for our next project?”  If the use case matches one of the two I described earlier, it might be a better fit than SQL Server.

No, you can’t have your cake and eat it too, and NoSQL systems don’t support anything even remotely near SQL Server’s feature set.  Forget things like joins, stored procedures, triggers, spatial indexes, CLR in the database, foreign keys, and so on.  As a performance tuner, though, that list makes me grin – because for years, I’ve been telling programmers to keep their CPU-intensive work out of the database server if you want to scale.  The NoSQL camp understands the importance of that.  These systems are about storing data – period.

No Licensing Revenue Means No Buzzword Features

My poker face, let me show you it

From my SQL Server 2008 R2 review

In my SQL Server 2008 R2 review, I lamented that the last couple of versions of SQL Server have focused heavily on Business Intelligence because it’s a sexy buzz-phrase that sells licenses. I imagine Microsoft marketing thinks of customers as saying, “We’re a business and we need intelligence – get me some of that SQL Server!” Don’t get me wrong – I’m all for BI – but sometimes it feels like buzzwords get emphasized over basics, like working IntelliSense for all versions of SQL.

Since NoSQL tools are built by the users, they’re fat-free. If nobody needs it, it doesn’t get built. On the other hand, if one very determined person needs it badly, that person builds it or pays someone else to build it.

This leads to some interesting design decisions. Many of the NoSQL tools have abandoned any sense of schema – they want to store chunks of data, but that data layout can change at any time with no rhyme or reason. Take our web filtering example – if we design our application to suddenly start saving more information, like the browser agent or the remote site’s web server type, we can do that without a problem. When we pull the data back out, we need to handle whatever kind of data we find, but we’re not worried about joining this data to another table. It’s just flat web history data without relationships.

No Licensing Revenue Means Different Development Models

Five years ago, Microsoft wanted to conquer the VLDB market, so they poured money into making SQL Server scale better.  SQL Server 2005 took on the “SQL Server can’t scale” challenge and knocked it out of the park.  We got things like partitions for fast data loads, DMVs for easy management, and online piecemeal restore for quicker disaster recovery.  Microsoft’s business decision to go after large databases paid off for SQL Server DBAs everywhere because we could use our skills on progressively larger databases.

When you pick a NoSQL or cloud database provider, the first question to ask yourself is, “Who will be developing this thing next year?”  If the platform doesn’t have a really strong developer infrastructure, and your company runs into a data storage problem, the answer might be you.  I dunno about you, but I’m pretty good at administering databases – but absolutely useless at coding them.

That means long-term customers should either:

  • Choose a database with a commercial company supporting the development
  • Have coders on staff who start getting familiar with the project’s source code
  • Consider sponsoring development of the database.  Sponsored open source development is a foreign concept to most businesses, but I’ve seen it work well even in small dollar amounts.

On the other hand, short-term customers have a completely different decisionmaking process.  Consider the second use case I described – repeated one-time, one-way analysis projects.  In that case, who cares if the database platform is still being developed a year from now?  Your database team is probably already used to repeatedly learning new systems for data import/export, and they’re skilled at techniques, not just technologies.  They’ll be comfortable switching to another platform if development dies off or if another more attractive platform comes out.  So how do you pick the most attractive platform?

How I Would Choose a NoSQL or Cloud Database

Ignore throughput performance benchmarks. The market and the tools are changing way too fast.  By the time a well-written benchmark comes out, it’s outdated in this rapidly evolving environment.  Poorly-done benchmarks can come out much quicker – but they’re not good for making decisions.

Focus on developer performance. How fast can you get started deploying and storing data in it?  Shaving days off the development time will probably outweigh the computing gains, especially since you’re going to be throwing your most skilled people at these problems.

Ask the team who will do the work. If you’re a manager on the project rather than a hands-on coder, let the hands-on people decide.  If they have personal connections to someone else using a particular tool, they might be able to get faster answers or code help.

Prioritize difficult-to-code features that work right now. If you need your reporting team to be able to write SQL-style queries against the data, only pick platforms that offer that support today.  If transaction support is important to you, rule out any solutions without it.  Don’t gamble on these challenging features possibly coming out later, because it’s tough to adapt these into existing databases.

Prefer systems that tolerate a node failure. If the database is stored across a dozen cheap pizza box servers, one of the servers is going to go down.  Your data loads and queries should still run while you provision a new pizza box and add it to the system.  Ideally, you want this systems management function to be performed by your infrastructure team, not the database people.  The closer this system comes to being an invisible managed appliance, the lower it will keep your manpower costs.  This feature is extremely difficult to add to a database system – just ask Microsoft, who still haven’t done it yet for SQL Server, and believe me, we’ve all been asking.

I want to end this with a recommendation for a specific NoSQL or cloud database solution, but I can’t bring myself to do it.  After all, the market is still evolving so fast, and I just told you to ask your team who will do the work.  If you’re looking to play around with one, though, I’d recommend trying MongoDB.  It’s the closest to a conventional database while still meeting most of the criteria I’d want for a scaling solution, and there’s even an easy-to-play-with online tutorial so you don’t have to install anything.

But don’t do it from your work computer.

You don’t want SQL Server to know you were cheating on her.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

More Posts - Website

Follow Me:
TwitterFacebookLinkedInGoogle PlusYouTube

Switching to Cassandra

NoSQL is a good idea with a bad name.  You wouldn’t put on your NoJeans to drive to work in your NoFord. But put away the crappy label for a minute and check out what the NoSQL movement has going for it, because they’ve got some good points.

In the Microsoft Certified Master training, I’ve been stunned to learn some of the things that go on under the covers of SQL Server.  The good thing is that MSSQL offers a ton of functionality few of us actually use, but the drawback is that there’s a lot of overengineered plumbing.

I train database administrators around the world, and if there’s one thing I see over and over, it’s that people just don’t use the features built into the product.  They’re not using the built-in index tools that can make their servers faster.  They’re not checking fragmentation.  Hell, half the time they’re not even checking data integrity or doing backups.

So it occurred to me – maybe these DBAs aren’t on something – they’re onto something.

Traditional relational database management systems like Microsoft SQL Server have inherent weaknesses that prevent them from scaling to handle real load.  All these features they’ve built in over the years have added up to real overhead.  During several sessions, I heard phrases like “just adds 2% overhead” and “only uses 1-2% CPU.”  Add up all these features, and it’s clear: SQL Server 2008 is the Windows Vista of databases.  It’s a miracle queries ever complete.

And I’ve had enough.

NoSQL DBA

NoSQL DBA

NoSQL databases don’t have transactions, but frankly, that’s fine.  You know what kind of transactions I want?  Cash transactions.  I want a brand new house on an episode of Cribs and a bathroom I can play baseball in.  I can’t get there working with SQL Server because none of the cool new startups are using it.

I’m tired of working for suit-and-tie companies that tout their long histories and big balance sheets.  I wanna roam the halls of companies like Digg, Reddit, and Twitter – companies chock full of venture capital.  Those guys are all distracted by how the database software is free, and they haven’t thought through the process of hiring good DBAs for a database platform that got compiled last Thursday.  Presto, I can charge ginormous fees.

After doing a lot of research, I’ve settled on Cassandra, the NoSQL database created by Facebook.  It does have some weaknesses – for example, you can’t add tables without restarting – but I think that’s an advantage.  I won’t have to do nearly as many schema deployments if the developers know the servers will have to get restarted, and that means less work for me.

Most NoSQL databases have what’s called eventual consistency – sooner or later, all of the nodes will catch up and have the same data.  I’m fine with that.  What’s the rush?  The inserts, updates, and deletes will probably be caught up before these guys even have a business plan.  DBAs have such a big issue with integrity, but I work for a software vendor, remember?  I sold my integrity already.

I, for one, welcome our new Database 2.0 overlords.  Or 4.1. Whatever.

Update 4/2: since you’re reading this late, you should probably be aware that it was originally published on April 1.  You may also be interested in my post from 4/1/2009 about becoming a Natural Keyboard MVP.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

More Posts - Website

Follow Me:
TwitterFacebookLinkedInGoogle PlusYouTube