Production DBA

NoSQL Basics for Database Administrators

By Brent Ozar · May 26, 2010 · 37 comments

Most SQL Server DBAs don’t even like MySQL, let alone NoSQL. They cringe when they hear people say things like, “I don’t need transactions” or “The data will eventually be consistent” or “Indexes will be supported in the next version.” SQL Server and Oracle people are constantly trying to out-feature each other, and the abject lack of features in the NoSQL world comes as a complete shock to both parties.

Thing is, NoSQL and cloud databases do fill some very real business needs. Today, I’m going to give you a peek into the NoSQL/cloud-database world. You can be a better database professional if you understand what these tools are trying to achieve, who wants to use them, and how your database compares with theirs.

Let’s start at the beginning: who needs this stuff? To answer the question, let’s look at a few common projects that are probably happening right now in your company with conventional databases.

Use Case #1: Logging High-Volume, Low-Value Data

Most companies have a web filtering system that makes sure employees aren’t surfing dangerous sites during business hours. They want to reduce their Internet connection needs by eliminating streamed music or video, and they want to avoid sexual harassment lawsuits caused by employees surfing for porn at work. The filtering system maintains two databases: configuration data such as employees, departments, and not-allowed web sites, and monitoring data that covers who surfed for what, and when. The config data has to be perfect – we can’t have our configuration data disappearing, but that isn’t too big of a problem because it’s a fairly small amount of records.

I can tell when someone's been surfing dirty web sites. — I can always tell who surfs the dirty sites.

The monitoring data is important, but not all that important. We want to capture as much history as possible, and we’re going to be inserting data fast and furious, but it wouldn’t kill us if we missed a few rows. Odds are someone won’t surf for illicit material just once – they’ll do it repeatedly over time. We don’t make money on this data (although we could make an argument that it helps us avoid risk) so the CIO isn’t prepared to spend a fortune archiving Johnny’s Google searches. The only time we query this data is when HR wants to check Johnny’s web activity – we don’t need to run fancy reports to see the average web use by department.

This system could live without fancy transactional backups. We just need to do a full backup periodically – even once a week or per month would be fine if it saved us a lot of money. When it goes down, we don’t need to restore it quickly – or maybe not even at all, if HR doesn’t ask for a report. We just need to set up a new database to start taking inserts again.

What if we had a way to store multiple terabytes of this data on commodity hardware that would have a very low cost, pretty high availability, but might not give us very good reporting options or transactional consistency?

Use Case #2: One-Time, One-Way, Big-Load Analysis

When a company’s data gets large enough, they start asking questions about what the data might reveal. I’ve worked with PhD’s and analysts who say things like, “I need a copy of the data warehouse to solve a problem I’m working on. I’ve got a hunch about the numbers, and I’ll need to do a week or two of slicing and dicing to figure things out.” It’s just one or two rocket scientists working on the data for a relatively short amount of time.

They don’t want to synchronize the company’s incoming data with this research dataset, and they don’t need to back it up. They just want you to hand over the data, and then you won’t hear from them again. If it goes down, that’s okay – it doesn’t affect sales. Sometimes they use their own SQL Servers, sometimes analytic tools, sometimes its plain old Excel or Access. They won’t write production reports on this data – they’re just going to dive into specific queries and analytics, and then write PowerPoint decks or white papers on the results.

They might do similar projects like this over and over with different datasets from different areas in the company, but they use fresh sets of data every time. They could use a completely different toolset each time without training worries – these are rocket scientists, after all.

What if you could give them a tool that could handle rapid loads of completely different data each time, perform wild and crazy queries, but it might not use traditional query tools or database languages? These users aren’t proficient with SQL queries or RDBMS technologies anyway – they could use anything as long as it’s cheap and scales well.

Why Not Use SQL Server or Oracle?

I bet you’ve run into projects just like these in your career, and until now, you’ve solved them with the only tool you really know well – relational databases like SQL Server. SQL Server, when properly configured and managed, doesn’t blink at the requirements of most NoSQL projects. Take the Mozilla Test Pilot experiment:

Expected amount of data stored per experiment: 1.2 TB
Expected peak traffic: approximately 75 GB per hour for two 8 hour periods following the conclusion of an experiment window. This two day period will result in collection of approximately 90% of the total data.
Remain highly available under load
Provide necessary validation and security constraints to prevent bad data from polluting the experiment or damaging the application

Here’s my personal take on it – NoSQL doesn’t mean NoSQLServer, but rather NoMySQL. MySQL doesn’t scale as easily to these levels, whereas it’s not much of an event anymore for a single-node SQL Server to scale to multiple terabytes. I’ve worked with plenty of SQL Server data warehouse systems that have much larger volumes and loads of data than this. In the last month, I’ve worked with two in particular that ran on less than $50k of commodity hardware.

It’s not that SQL Server or Oracle can’t scale to these levels.

It’s that NoSQL and cloud databases have a few crucial differences about how they scale.

NoSQL Has a Different Cost Model

If you’re doing multi-terabyte databases, you probably need SQL Server Enterprise Edition, which is licensed in one of two ways:

$27,495 per CPU socket for unlimited users, or
$8,592 per server plus $164 per user (unless the user already has a SQL CAL)

When you start a SQL Server project, you need to have an accurate CPU count and stick to it. If you decide halfway into the project that you need a lot more power to get the job done, you have to buy more licenses. These licenses stick around, and that long-term overhead becomes a business problem. Businesses don’t like to see expensive software licenses not in use when a project isn’t doing number-crunching.

In addition, even just four sockets of Enterprise at $35-120k is a lot of money that could be spent on hardware – and therefore performance – instead. Even if SQL Server was 50% faster than a competing solution, that advantage is negated when the user spends 50% more on hardware instead to get the same speed at a lower overall cost. Tough sell.

The advantage of buying an industry standard database with a long history (whether it be DB2, Oracle, SQL Server, or Sybase) is that it’s relatively easy to find staff and software that interact with that database. NoSQL and cloud databases, on the other hand, come with no up-front licensing – but because there’s no established industry standard, it’s very tough to find experienced staff off the street.

How do I hire experienced NoSQL-savvy-staff when the very NoSQL databases out there keep changing? Here’s a sampling of announcements from one 24-hour period:

NoSQL solutions have different costs. Immature platforms suffer from a lack of documentation, tooling, and commonly-needed features, thereby driving implementation costs up.

But is that really a problem? There’s an abundance of documentation, tools, and features for Microsoft SQL Server, but I almost never see it implemented correctly. Just because there’s a manual doesn’t mean anybody’s going to read it or heed it, and just because there’s a tool doesn’t mean users will use it appropriately. It’s easy for me to hire SQL Server people off the street, but that doesn’t mean that they’re good, either.

The best NoSQL solutions solve the human cost problem by building in the things you need in order to scale, store data, and consume data. It’s not all that hard, either – the scenarios I described in the beginning of the post had one thing in common: a very limited number of features. Some NoSQL and cloud database users need to insert a lot, and they need to run very specific queries, but that’s about it. This is where open source products can really shine – they need even less features than MySQL has.

NoSQL Makes Smart Compromises to Achieve Scale

Right from the start, cloud databases and NoSQL projects aimed to scale out. Whether you’re using pizza boxes or virtual servers in the cloud, you can get more capacity by throwing more of them into your resource pool and letting the platform spread the load around. The Visual Guide to NoSQL Systems illustrates the compromises – between consistency, availability, and partition tolerance, you can pick any two. If every client needs to see an identical copy of the data, there’s an app for that. If you can tolerate different query results in exchange for higher availability, there’s plenty of choices for that too.

MongoDB’s feature set is a good example. They tout:

Auto-Sharding – Scale horizontally without compromising functionality.
Replication & High Availability – Mirror across LANs and WANs for scale and peace of mind.
GridFS – Store files of any size without complicating your stack.

If I was a business manager, I’d take those bullet points straight to my database team and ask, “Can our database do this? If not, why don’t we try MongoDB for our next project?” If the use case matches one of the two I described earlier, it might be a better fit than SQL Server.

No, you can’t have your cake and eat it too, and NoSQL systems don’t support anything even remotely near SQL Server’s feature set. Forget things like joins, stored procedures, triggers, spatial indexes, CLR in the database, foreign keys, and so on. As a performance tuner, though, that list makes me grin – because for years, I’ve been telling programmers to keep their CPU-intensive work out of the database server if you want to scale. The NoSQL camp understands the importance of that. These systems are about storing data – period.

No Licensing Revenue Means No Buzzword Features

My poker face, let me show you it — From my SQL Server 2008 R2 review

In my SQL Server 2008 R2 review, I lamented that the last couple of versions of SQL Server have focused heavily on Business Intelligence because it’s a sexy buzz-phrase that sells licenses. I imagine Microsoft marketing thinks of customers as saying, “We’re a business and we need intelligence – get me some of that SQL Server!” Don’t get me wrong – I’m all for BI – but sometimes it feels like buzzwords get emphasized over basics, like working IntelliSense for all versions of SQL.

Since NoSQL tools are built by the users, they’re fat-free. If nobody needs it, it doesn’t get built. On the other hand, if one very determined person needs it badly, that person builds it or pays someone else to build it.

This leads to some interesting design decisions. Many of the NoSQL tools have abandoned any sense of schema – they want to store chunks of data, but that data layout can change at any time with no rhyme or reason. Take our web filtering example – if we design our application to suddenly start saving more information, like the browser agent or the remote site’s web server type, we can do that without a problem. When we pull the data back out, we need to handle whatever kind of data we find, but we’re not worried about joining this data to another table. It’s just flat web history data without relationships.

No Licensing Revenue Means Different Development Models

Five years ago, Microsoft wanted to conquer the VLDB market, so they poured money into making SQL Server scale better. SQL Server 2005 took on the “SQL Server can’t scale” challenge and knocked it out of the park. We got things like partitions for fast data loads, DMVs for easy management, and online piecemeal restore for quicker disaster recovery. Microsoft’s business decision to go after large databases paid off for SQL Server DBAs everywhere because we could use our skills on progressively larger databases.

When you pick a NoSQL or cloud database provider, the first question to ask yourself is, “Who will be developing this thing next year?” If the platform doesn’t have a really strong developer infrastructure, and your company runs into a data storage problem, the answer might be you. I dunno about you, but I’m pretty good at administering databases – but absolutely useless at coding them.

That means long-term customers should either:

Choose a database with a commercial company supporting the development
Have coders on staff who start getting familiar with the project’s source code
Consider sponsoring development of the database. Sponsored open source development is a foreign concept to most businesses, but I’ve seen it work well even in small dollar amounts.

On the other hand, short-term customers have a completely different decisionmaking process. Consider the second use case I described – repeated one-time, one-way analysis projects. In that case, who cares if the database platform is still being developed a year from now? Your database team is probably already used to repeatedly learning new systems for data import/export, and they’re skilled at techniques, not just technologies. They’ll be comfortable switching to another platform if development dies off or if another more attractive platform comes out. So how do you pick the most attractive platform?

How I Would Choose a NoSQL or Cloud Database

Ignore throughput performance benchmarks. The market and the tools are changing way too fast. By the time a well-written benchmark comes out, it’s outdated in this rapidly evolving environment. Poorly-done benchmarks can come out much quicker – but they’re not good for making decisions.

Focus on developer performance. How fast can you get started deploying and storing data in it? Shaving days off the development time will probably outweigh the computing gains, especially since you’re going to be throwing your most skilled people at these problems.

Ask the team who will do the work. If you’re a manager on the project rather than a hands-on coder, let the hands-on people decide. If they have personal connections to someone else using a particular tool, they might be able to get faster answers or code help.

Prioritize difficult-to-code features that work right now. If you need your reporting team to be able to write SQL-style queries against the data, only pick platforms that offer that support today. If transaction support is important to you, rule out any solutions without it. Don’t gamble on these challenging features possibly coming out later, because it’s tough to adapt these into existing databases.

Prefer systems that tolerate a node failure. If the database is stored across a dozen cheap pizza box servers, one of the servers is going to go down. Your data loads and queries should still run while you provision a new pizza box and add it to the system. Ideally, you want this systems management function to be performed by your infrastructure team, not the database people. The closer this system comes to being an invisible managed appliance, the lower it will keep your manpower costs. This feature is extremely difficult to add to a database system – just ask Microsoft, who still haven’t done it yet for SQL Server, and believe me, we’ve all been asking.

I want to end this with a recommendation for a specific NoSQL or cloud database solution, but I can’t bring myself to do it. After all, the market is still evolving so fast, and I just told you to ask your team who will do the work. If you’re looking to play around with one, though, I’d recommend trying MongoDB. It’s the closest to a conventional database while still meeting most of the criteria I’d want for a scaling solution, and there’s even an easy-to-play-with online tutorial so you don’t have to install anything.

But don’t do it from your work computer.

You don’t want SQL Server to know you were cheating on her.

Free, 3× a week

Get my new posts by email

Three posts a week, plus a Monday roundup of the best database news from around the web.

37 comments

Michael J Swart

May 26, 2010 at 7:46 am

Awesome post.
(But nervous about clicking the link that with the text “Dangerous Sites”)

Reply
1. Brent Ozar
  
  May 26, 2010 at 8:26 am
  
  Hahaha, that’s one of my favorite Achewood comics. Sums up my life doing tech support for friends and family.
  
  Reply
Bob Pusateri

May 26, 2010 at 8:37 am

Most Excellent. I’m in awe of MulletDB…

Reply
1. Brent Ozar
  
  May 26, 2010 at 8:37 am
  
  Isn’t that a funny idea? I love the name – it’s absolutely perfect.
  
  Reply
David Stein

May 26, 2010 at 9:38 am

Great article Brent, very comprehensive. However, isn’t Case #2 perfect for PowerPivot and might not require a DB solution at all?

Reply
1. Brent Ozar
  
  May 26, 2010 at 9:40 am
  
  Well, the problem with PowerPivot is that it’s still in Excel, and it’s tough to do really CPU-intensive calculations in Excel. You’re limited to the power of what one machine can calculate. Other users can’t live-query your PowerPivot data without giving them Excel either, whereas these NoSQL solutions can be queried by multiple users simultaneously as they do analysis.
  
  Reply
  1. David Stein
    
    May 26, 2010 at 9:48 am
    
    I was under the impression that you could accomplish both of your points using SharePoint, but I don’t have experience with it yet. Am I incorrect?
    
    Reply
    1. Brent Ozar
      
      May 26, 2010 at 9:50 am
      
      Sorta kinda, but not exactly. You can publish your work to SharePoint for other people to consume it, but when they want to run calculations, it’s limited to the power they’ve got on one particular machine. The NoSQL and cloud database crowd can quickly throw a ton of horsepower at a problem and use multiple machines to solve it at once – whereas we don’t have that scale-out option when the data’s in SQL Server or Excel.
      
      There are some roll-your-own solutions for work distribution if you want to build something with Service Broker, for example, but it’s still not an easy way to scale out.
      
      Reply
Sean

May 26, 2010 at 9:52 am

Interesting and thought provoking article as always Brent. It also really hits the spot for me because I am currently investigating using NoSQL or Hybrid systems like Kdb or FastObjects for our company.

I would ideally like to recommend SQL but I just do not know enough about a SQL database to know if it is suitable for handling potentially many terrabytes of high performance time series data. I know about StreamInsight and although it looks like it could handle the processing of incoming data, this still leaves us of the problem of storing and data mining huge amount of time series data. Your thoughts on this would be highly appreciated.

Reply
1. Brent Ozar
  
  May 26, 2010 at 9:53 am
  
  Sure, SQL Server is great at that. I regularly work with clients who have >10 terabytes in a single SQL Server database and use it for querying and data mining.
  
  Reply
  1. Sean
    
    May 26, 2010 at 10:00 am
    
    I was told that before my time the company had tested SQL and found out that SQL could not keep up with the flow of information coming in. I personally find this hard to believe as we are limited to internet speeds, but it does put yet another obstacle in any justification of SQL.
    
    Surely MS SQL could keep up with even a 100Mb/s internet line and still store the data in a database with some basic verification?
    
    Reply
    1. Brent Ozar
      
      May 26, 2010 at 10:02 am
      
      Yep, it can handle incoming data a lot faster than that:
      
      http://msdn.microsoft.com/en-us/library/dd537533.aspx
      
      Reply
Rob

May 26, 2010 at 6:38 pm

A little while ago you asked your readers for some suggestions on where you should develop your skills.

Amazon’s RDS product is currently for MySql only, but there is an ‘Engine’ field that I would love to someday say ‘MSSQL’. I think you should get on the phone with the Amazon Web Services folk and get them to hire you to help implement this. Having a robust scalable fully-featured (unlike SqlAzure) database engine alongside products such as SimpleDB and EC2 would rock.

Reply
1. Brent Ozar
  
  May 26, 2010 at 9:04 pm
  
  Rob – interesting, but I don’t have the skillset to build database tools. I just administer them. 😉
  
  Reply
Brad Nicholson

May 26, 2010 at 8:20 pm

A lot of the use cases you describe for NoSQL could easily be filled by Postgres. I encourage everyone to looking for MySQL alternatives to look into Postgres. I have been running Postgres in 24x7x365 high throughput environments for a number of years and it is a fantastic database.

It’s a free database that cares about your data (durability is excellent). It has the a lot of features a DBA would look for in a database, adheres fairly closely to the SQL standard, its fast and it scales well.

There is a vibrant user and developer community, excellent documentation. Development is not driven by company marketing teams. The license model ensures that it will remain free.

Reply
1. Brent Ozar
  
  May 26, 2010 at 9:03 pm
  
  Brad – yes, but how does Postgres solve the scaling problem for multiple servers? For example, if I need to scale to a 100 terabyte database, how does Postgres scale out to multiple writeable servers and perform sharding?
  
  Reply
  1. Brad Nicholson
    
    May 26, 2010 at 10:34 pm
    
    It currently does not handle scaling out on it’s own (which is why I mentioned some, but not all of your use cases could be handled).
    
    There are a number of projects that handle various problem sets for scaling though. Skytools is Skype’s framework for scaling out their systems. GridSQL is for distributed shared-nothing datawarehoses and datamarts.
    
    For scaling writes, there is a multi-master project under development – which is fork in the code but under the same license. It won’t support a full feature set until later this year, but the initial results impressive to say the least.
    
    I’m not implying that there is not case for NoSQL – I believe in the right tool for the right job. In the case of a 100TB DB that needs write scaling, I would likely say that Postgres is the wrong tool at this time.
    
    What I do think is that it sits in an interesting market segment in that it is a free product that can deliver on both use case #1 and #2 you mention would work just fine under Postgres, and shouldn’t induce cringes from SQL Server DBA’s because it behaves like an actual database.
    
    Reply
    1. Brent Ozar
      
      May 27, 2010 at 5:21 am
      
      OK, thanks!
      
      Reply
David Saiote

May 28, 2010 at 11:57 pm

MulletDB! Yee-haa! xD

Reply
Josh Feierman

May 29, 2010 at 12:17 pm

Great post Brent! While I’m a huge SQL Server fan I am also a big open source user so it’s nice that someone can see both sides of the equation.

Right now I’m working on a project that has some very ambitious data size goals and involves using PerfMon to collect and store data across many, many servers in our environment. The volumes seem pretty large to me: ~50,000 rows per second insertion and eventual database size in excess of 10TB. Now, because this is an internal only use and won’t generate revenue, the appetite for spending money on costly features like Enterprise edition or high end storage are pretty much non-existent. While I’m pretty confident we can tweak and tune enough to get it going with what we have (a 4 core VM with Standard Ed), your post has at least stuck the idea in my head about considering less traditional alternatives. Certainly the data is not critical, though I suppose one could argue that clean performance data is a very important thing to have as a DBA. What do you think?

Reply
1. Brent Ozar
  
  May 29, 2010 at 6:37 pm
  
  Josh – interesting question. Do you really need to insert that data directly into SQL Server? I’d probably argue that the data should be landed in flat files first and then periodically loaded into SQL using SSIS. SSIS could also be used to aggregate the data into less granular chunks for reporting.
  
  Reply
  1. Josh Feierman
    
    May 31, 2010 at 9:48 pm
    
    Good point Brent, and that is actually the direction we are trending towards: having the data collectors spit into flat files which are then processed and archived by SSIS at scheduled intervals. Currently we’re using the built-in direct-to-sql functionality of PerfMon, but experience is showing it scales rather… well, OK, it really doesn’t scale at all, unless you want to have one database for every two or three servers you’re running collectors against. Hopefully with the approach you suggested we can avoid those staging DBs and just load the cleansed data right into the warehouse where it belongs. Considering that this may well end up being the largest database we have in house, I’ll be curious to see just what we can get out of our little VM. Thanks again for the suggestion.
    
    Reply
    1. Brent Ozar
      
      June 1, 2010 at 8:30 am
      
      Yeah, the direct-to-SQL functionality is a pretty bad idea for Perfmon. I’d advise going to flat files first, especially since Perfmon supports CSVs, and then periodically pulling those into SQL Server via SSIS. Performance is fantastic that way.
      
      Reply
Ron Dunn

May 30, 2010 at 11:17 pm

I like your writings on storage, LOVED your recent truncation post, but did you just turn into the love child of Dare Obasanjo and Michael Stonebraker?

This “NoSQL” looks like just another doomed iteration of object databases, perhaps better named “AnythingButSQL”. This is jump-the-shark stuff in enterprise computing.

Please don’t scare me any more 🙂

Ron.

Reply
1. Brent Ozar
  
  May 31, 2010 at 6:22 am
  
  And you have an @Microsoft.com email address. I can see why this might sound scary. 😉 I bet you’re just jealous that I didn’t include Azure, right? Yuk yuk yuk…
  
  Reply
R Thomas

June 1, 2010 at 11:45 am

Great article. I’ve been a database guy for 15 years. One of my major professional pet peaves is folks who are married to a tool for the simple reason that it’s the only one they are comfortable with. I’ve made my living for the past decade off SQL Server. Doesn’t mean I wouldn’t use NoSQL in a heartbeat if it was shown to me to be a better tool for the job.

Reply
1. Brent Ozar
  
  June 1, 2010 at 12:49 pm
  
  Thanks, glad you liked it!
  
  Reply
Stephen Hirsch

June 3, 2010 at 4:02 pm

Sounds like a flat file would work just as well.

Reply
Lester

July 8, 2010 at 5:46 pm

haha! she(SQLServer) just saw me picking at your article.

great read!

Reply