NoSQL Basics for Database Administrators


Most SQL Server DBAs don’t even like MySQL, let alone NoSQL.  They cringe when they hear people say things like, “I don’t need transactions” or “The data will eventually be consistent” or “Indexes will be supported in the next version.” SQL Server and Oracle people are constantly trying to out-feature each other, and the abject lack of features in the NoSQL world comes as a complete shock to both parties.

Thing is, NoSQL and cloud databases do fill some very real business needs.  Today, I’m going to give you a peek into the NoSQL/cloud-database world. You can be a better database professional if you understand what these tools are trying to achieve, who wants to use them, and how your database compares with theirs.

Let’s start at the beginning: who needs this stuff? To answer the question, let’s look at a few common projects that are probably happening right now in your company with conventional databases.

Use Case #1: Logging High-Volume, Low-Value Data

Most companies have a web filtering system that makes sure employees aren’t surfing dangerous sites during business hours. They want to reduce their Internet connection needs by eliminating streamed music or video, and they want to avoid sexual harassment lawsuits caused by employees surfing for porn at work. The filtering system maintains two databases: configuration data such as employees, departments, and blocked web sites, and monitoring data that covers who surfed for what, and when. The config data has to be perfect – we can’t have our configuration data disappearing – but that isn’t too big of a problem because it’s a fairly small number of records.

I can always tell who surfs the dirty sites.

The monitoring data is important, but not all that important. We want to capture as much history as possible, and we’re going to be inserting data fast and furious, but it wouldn’t kill us if we missed a few rows. Odds are someone won’t surf for illicit material just once – they’ll do it repeatedly over time. We don’t make money on this data (although we could make an argument that it helps us avoid risk) so the CIO isn’t prepared to spend a fortune archiving Johnny’s Google searches. The only time we query this data is when HR wants to check Johnny’s web activity – we don’t need to run fancy reports to see the average web use by department.

This system could live without fancy transactional backups. We just need to do a full backup periodically – even once a week or once a month would be fine if it saved us a lot of money. When it goes down, we don’t need to restore it quickly – or maybe not even at all, if HR doesn’t ask for a report. We just need to set up a new database to start taking inserts again.

What if we had a way to store multiple terabytes of this data on commodity hardware that would have a very low cost, pretty high availability, but might not give us very good reporting options or transactional consistency?
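
To make this concrete, here’s what that monitoring log could look like in a document store – a minimal sketch using MongoDB’s Python driver, where the database, collection, and field names are all made up for the example:

```python
# A minimal sketch of fire-and-forget logging with pymongo.
# Database, collection, and field names are invented for illustration.
from datetime import datetime, timezone

from pymongo import DESCENDING, MongoClient, WriteConcern

client = MongoClient("mongodb://localhost:27017")
db = client["webfilter"]

# w=0 skips the write acknowledgement - losing the occasional row is acceptable here.
hits = db.get_collection("monitoring", write_concern=WriteConcern(w=0))

hits.insert_one({
    "employee": "johnny",
    "department": "accounting",
    "url": "http://example.com/definitely-not-work-related",
    "blocked": True,
    "at": datetime.now(timezone.utc),
})

# The only query we ever run: HR asks for one employee's history.
for hit in db["monitoring"].find({"employee": "johnny"}).sort("at", DESCENDING):
    print(hit["at"], hit["url"])
```

The point isn’t the syntax – it’s that the insert path has almost no ceremony, and we’ve consciously traded durability for speed and cost.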

Use Case #2: One-Time, One-Way, Big-Load Analysis

When a company’s data gets large enough, they start asking questions about what the data might reveal. I’ve worked with PhDs and analysts who say things like, “I need a copy of the data warehouse to solve a problem I’m working on. I’ve got a hunch about the numbers, and I’ll need to do a week or two of slicing and dicing to figure things out.” It’s just one or two rocket scientists working on the data for a relatively short amount of time.

They don’t want to synchronize the company’s incoming data with this research dataset, and they don’t need to back it up. They just want you to hand over the data, and then you won’t hear from them again. If it goes down, that’s okay – it doesn’t affect sales. Sometimes they use their own SQL Servers, sometimes analytic tools, sometimes it’s plain old Excel or Access. They won’t write production reports on this data – they’re just going to dive into specific queries and analytics, and then write PowerPoint decks or white papers on the results.

They might do similar projects like this over and over with different datasets from different areas in the company, but they use fresh sets of data every time. They could use a completely different toolset each time without training worries – these are rocket scientists, after all.

What if you could give them a tool that could handle rapid loads of completely different data each time and perform wild and crazy queries, but might not use traditional query tools or database languages? These users aren’t proficient with SQL queries or RDBMS technologies anyway – they could use anything as long as it’s cheap and scales well.

Why Not Use SQL Server or Oracle?

Mozilla Labs Test Pilot Project

I bet you’ve run into projects just like these in your career, and until now, you’ve solved them with the only tool you really know well – relational databases like SQL Server.  SQL Server, when properly configured and managed, doesn’t blink at the requirements of most NoSQL projects.  Take the Mozilla Test Pilot experiment:

  • Expected amount of data stored per experiment: 1.2 TB
  • Expected peak traffic: approximately 75 GB per hour for two 8-hour periods following the conclusion of an experiment window.  This two-day period will result in collection of approximately 90% of the total data.
  • Remain highly available under load
  • Provide necessary validation and security constraints to prevent bad data from polluting the experiment or damaging the application

Here’s my personal take on it – NoSQL doesn’t mean NoSQLServer, but rather NoMySQL.  MySQL doesn’t scale as easily to these levels, whereas it’s not much of an event anymore for a single-node SQL Server to scale to multiple terabytes.  I’ve worked with plenty of SQL Server data warehouse systems that handle much larger data volumes and load rates than this.  In the last month, I’ve worked with two in particular that ran on less than $50k of commodity hardware.

It’s not that SQL Server or Oracle can’t scale to these levels.

It’s that NoSQL and cloud databases have a few crucial differences in how they scale.

NoSQL Has a Different Cost Model

If you’re doing multi-terabyte databases, you probably need SQL Server Enterprise Edition, which is licensed in one of two ways:

  • $27,495 per CPU socket for unlimited users, or
  • $8,592 per server plus $164 per user (unless the user already has a SQL CAL)

When you start a SQL Server project, you need to have an accurate CPU socket count and stick to it.  If you decide halfway into the project that you need a lot more power to get the job done, you have to buy more licenses.  These licenses stick around, and that long-term overhead becomes a business problem.  Businesses don’t like to see expensive software licenses sitting unused when a project isn’t doing number-crunching.

In addition, even just four sockets of Enterprise – anywhere from $35k to $120k depending on whether you license per server plus CALs or per socket – is a lot of money that could be spent on hardware, and therefore performance, instead.  Even if SQL Server were 50% faster than a competing solution, that advantage is negated when the other guy takes the money he saved on licensing, spends it on extra hardware, and gets the same speed at a lower overall cost.  Tough sell.
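
Here’s the back-of-the-envelope math behind that range, using the list prices above – a rough sketch only, since real quotes depend on your volume agreements and Software Assurance:

```python
# Rough SQL Server Enterprise licensing math using the list prices above.
PER_SOCKET = 27495
PER_SERVER = 8592
PER_CAL = 164

def per_socket_cost(sockets):
    return sockets * PER_SOCKET

def server_plus_cal_cost(servers, users_needing_cals):
    return servers * PER_SERVER + users_needing_cals * PER_CAL

# One four-socket server licensed per socket: about $110k.
print(per_socket_cost(4))            # 109980
# The same box licensed per server, if roughly 160 users need new CALs: about $35k.
print(server_plus_cal_cost(1, 160))  # 34832
```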

The advantage of buying an industry standard database with a long history (whether it be DB2, Oracle, SQL Server, or Sybase) is that it’s relatively easy to find staff and software that interact with that database.  NoSQL and cloud databases, on the other hand, come with no up-front licensing – but because there’s no established industry standard, it’s very tough to find experienced staff off the street.

An early build of MulletDB

How do I hire experienced, NoSQL-savvy staff when the very NoSQL databases out there keep changing?  Any given 24-hour period brings a fresh batch of announcements.

NoSQL solutions have different costs.  Immature platforms suffer from a lack of documentation, tooling, and commonly-needed features, thereby driving implementation costs up.

But is that really a problem?  There’s an abundance of documentation, tools, and features for Microsoft SQL Server, but I almost never see it implemented correctly. Just because there’s a manual doesn’t mean anybody’s going to read it or heed it, and just because there’s a tool doesn’t mean users will use it appropriately.  It’s easy for me to hire SQL Server people off the street, but that doesn’t mean that they’re good, either.

The best NoSQL solutions solve the human cost problem by building in the things you need in order to scale, store data, and consume data.  It’s not all that hard, either – the scenarios I described at the beginning of the post had one thing in common: they need only a very limited set of features.  Some NoSQL and cloud database users need to insert a lot of data and run a few very specific queries, but that’s about it. This is where open source products can really shine – they need even fewer features than MySQL has.

NoSQL Makes Smart Compromises to Achieve Scale

Right from the start, cloud databases and NoSQL projects aimed to scale out. Whether you’re using pizza boxes or virtual servers in the cloud, you can get more capacity by throwing more of them into your resource pool and letting the platform spread the load around.  The Visual Guide to NoSQL Systems illustrates the compromises – between consistency, availability, and partition tolerance, you can pick any two. If every client needs to see an identical copy of the data, there’s an app for that. If you can tolerate different query results in exchange for higher availability, there are plenty of choices for that, too.

MongoDB’s feature set is a good example.  They tout:

  • Auto-Sharding – Scale horizontally without compromising functionality.
  • Replication & High Availability – Mirror across LANs and WANs for scale and peace of mind.
  • GridFS – Store files of any size without complicating your stack.

If I was a business manager, I’d take those bullet points straight to my database team and ask, “Can our database do this?  If not, why don’t we try MongoDB for our next project?”  If the use case matches one of the two I described earlier, it might be a better fit than SQL Server.
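
For the curious, here’s roughly what that auto-sharding bullet point looks like from code – a sketch that assumes you’ve already stood up a sharded cluster with a mongos router, and that reuses the made-up database and collection names from the web filtering example:

```python
# Sketch: turning on sharding for a collection via pymongo.
# Assumes an existing sharded cluster; host and database names are placeholders.
from pymongo import MongoClient

# Connect to the mongos router, not to an individual shard.
client = MongoClient("mongodb://mongos-host:27017")

# Allow the database to be sharded...
client.admin.command("enableSharding", "webfilter")

# ...then spread the monitoring collection across shards, keyed by employee.
client.admin.command("shardCollection", "webfilter.monitoring",
                     key={"employee": 1})
```

From then on the application still talks to the router, and the cluster worries about which shard owns each chunk of the key range.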

No, you can’t have your cake and eat it too, and NoSQL systems don’t support anything even remotely near SQL Server’s feature set.  Forget things like joins, stored procedures, triggers, spatial indexes, CLR in the database, foreign keys, and so on.  As a performance tuner, though, that list makes me grin – because for years, I’ve been telling programmers to keep their CPU-intensive work out of the database server if they want to scale.  The NoSQL camp understands the importance of that.  These systems are about storing data – period.

No Licensing Revenue Means No Buzzword Features

My poker face, let me show you it

In my SQL Server 2008 R2 review, I lamented that the last couple of versions of SQL Server have focused heavily on Business Intelligence because it’s a sexy buzz-phrase that sells licenses. I imagine Microsoft marketing thinks of customers as saying, “We’re a business and we need intelligence – get me some of that SQL Server!” Don’t get me wrong – I’m all for BI – but sometimes it feels like buzzwords get emphasized over basics, like working IntelliSense for all versions of SQL.

Since NoSQL tools are built by the users, they’re fat-free. If nobody needs it, it doesn’t get built. On the other hand, if one very determined person needs it badly, that person builds it or pays someone else to build it.

This leads to some interesting design decisions. Many of the NoSQL tools have abandoned any sense of schema – they want to store chunks of data, but that data layout can change at any time with no rhyme or reason. Take our web filtering example – if we design our application to suddenly start saving more information, like the browser agent or the remote site’s web server type, we can do that without a problem. When we pull the data back out, we need to handle whatever kind of data we find, but we’re not worried about joining this data to another table. It’s just flat web history data without relationships.
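
Here’s a small sketch of that flexibility (field names invented for the example): two differently-shaped documents land in the same MongoDB collection, and the reader just copes with whatever fields it finds:

```python
# Sketch: one collection happily stores old-shape and new-shape documents.
from pymongo import MongoClient

hits = MongoClient("mongodb://localhost:27017")["webfilter"]["monitoring"]

# The original shape of our web history records...
hits.insert_one({"employee": "johnny", "url": "http://example.com/cats"})

# ...and the richer shape we started saving later - no ALTER TABLE required.
hits.insert_one({
    "employee": "johnny",
    "url": "http://example.com/cats",
    "browser_agent": "Mozilla/5.0",
    "server_type": "Apache",
})

# The reader handles whichever fields happen to be present.
for hit in hits.find({"employee": "johnny"}):
    print(hit["url"], hit.get("browser_agent", "(not recorded)"))
```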

No Licensing Revenue Means Different Development Models

Five years ago, Microsoft wanted to conquer the VLDB market, so they poured money into making SQL Server scale better.  SQL Server 2005 took on the “SQL Server can’t scale” challenge and knocked it out of the park.  We got things like partitions for fast data loads, DMVs for easy management, and online piecemeal restore for quicker disaster recovery.  Microsoft’s business decision to go after large databases paid off for SQL Server DBAs everywhere because we could use our skills on progressively larger databases.

When you pick a NoSQL or cloud database provider, the first question to ask yourself is, “Who will be developing this thing next year?”  If the platform doesn’t have a really strong developer infrastructure, and your company runs into a data storage problem, the answer might be you.  I dunno about you, but I’m pretty good at administering databases – and absolutely useless at coding them.

That means long-term customers should either:

  • Choose a database with a commercial company supporting the development
  • Have coders on staff who start getting familiar with the project’s source code
  • Consider sponsoring development of the database.  Sponsored open source development is a foreign concept to most businesses, but I’ve seen it work well even in small dollar amounts.

On the other hand, short-term customers have a completely different decision-making process.  Consider the second use case I described – repeated one-time, one-way analysis projects.  In that case, who cares if the database platform is still being developed a year from now?  Your database team is probably already used to repeatedly learning new systems for data import/export, and they’re skilled at techniques, not just technologies.  They’ll be comfortable switching to another platform if development dies off or if another more attractive platform comes out.  So how do you pick the most attractive platform?

How I Would Choose a NoSQL or Cloud Database

Ignore throughput performance benchmarks. The market and the tools are changing way too fast.  By the time a well-written benchmark comes out, it’s already outdated.  Poorly done benchmarks can come out much faster – but they’re no good for making decisions.

Focus on developer performance. How fast can you get started deploying and storing data in it?  Shaving days off the development time will probably outweigh the computing gains, especially since you’re going to be throwing your most skilled people at these problems.

Ask the team who will do the work. If you’re a manager on the project rather than a hands-on coder, let the hands-on people decide.  If they have personal connections to someone else using a particular tool, they might be able to get faster answers or code help.

Prioritize difficult-to-code features that work right now. If you need your reporting team to be able to write SQL-style queries against the data, only pick platforms that offer that support today.  If transaction support is important to you, rule out any solutions without it.  Don’t gamble on these challenging features possibly coming out later, because it’s tough to retrofit them into an existing database.

Prefer systems that tolerate a node failure. If the database is stored across a dozen cheap pizza box servers, one of those servers is going to go down.  Your data loads and queries should still run while you provision a new pizza box and add it to the system.  Ideally, you want this systems-management function to be handled by your infrastructure team, not the database people.  The closer the system comes to being an invisible managed appliance, the lower your manpower costs will be.  This feature is extremely difficult to add to a database system – just ask Microsoft, who still hasn’t done it for SQL Server, and believe me, we’ve all been asking.
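
As one hedged illustration of what “tolerates a node failure” means from the client’s side, here’s how an application might connect to a MongoDB replica set – the host names and the set name are placeholders:

```python
# Sketch: a client that keeps running through a node failure in a replica set.
# Host names and the replica set name "rs0" are placeholders.
from pymongo import MongoClient, ReadPreference

client = MongoClient(
    "mongodb://node1:27017,node2:27017,node3:27017/?replicaSet=rs0",
    serverSelectionTimeoutMS=5000,  # how long to hunt for a usable member
)

# Reads can fall back to a secondary while a dead member gets replaced.
hits = client["webfilter"].get_collection(
    "monitoring", read_preference=ReadPreference.SECONDARY_PREFERRED
)

# If the primary dies, the survivors elect a new one and the driver reconnects
# on its own - this query doesn't need to know or care which box answered.
print(hits.count_documents({}))
```

That’s the appliance-like behavior you’re shopping for: failover is the platform’s problem, not your application’s.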

I want to end this with a recommendation for a specific NoSQL or cloud database solution, but I can’t bring myself to do it.  After all, the market is still evolving so fast, and I just told you to ask your team who will do the work.  If you’re looking to play around with one, though, I’d recommend trying MongoDB.  It’s the closest to a conventional database while still meeting most of the criteria I’d want for a scaling solution, and there’s even an easy-to-play-with online tutorial so you don’t have to install anything.

But don’t do it from your work computer.

You don’t want SQL Server to know you were cheating on her.


Comments

  • Awesome post.
    (But nervous about clicking the link with the text “Dangerous Sites”)

  • Most Excellent. I’m in awe of MulletDB…

  • David Stein
    May 26, 2010 9:38 am

    Great article Brent, very comprehensive. However, isn’t Case #2 perfect for PowerPivot, and might it not require a DB solution at all?

    • Well, the problem with PowerPivot is that it’s still in Excel, and it’s tough to do really CPU-intensive calculations in Excel. You’re limited to the power of what one machine can calculate. Other users can’t live-query your PowerPivot data without giving them Excel either, whereas these NoSQL solutions can be queried by multiple users simultaneously as they do analysis.

      • David Stein
        May 26, 2010 9:48 am

        I was under the impression that you could accomplish both of your points using SharePoint, but I don’t have experience with it yet. Am I incorrect?

        • Sorta kinda, but not exactly. You can publish your work to SharePoint for other people to consume it, but when they want to run calculations, it’s limited to the power they’ve got on one particular machine. The NoSQL and cloud database crowd can quickly throw a ton of horsepower at a problem and use multiple machines to solve it at once – whereas we don’t have that scale-out option when the data’s in SQL Server or Excel.

          There are some roll-your-own solutions for work distribution if you want to build something with Service Broker, for example, but it’s still not an easy way to scale out.

  • Interesting and thought-provoking article as always, Brent. It also really hits the spot for me because I am currently investigating using NoSQL or hybrid systems like Kdb or FastObjects for our company.

    I would ideally like to recommend SQL but I just do not know enough about a SQL database to know if it is suitable for handling potentially many terabytes of high performance time series data. I know about StreamInsight and although it looks like it could handle the processing of incoming data, this still leaves us with the problem of storing and data mining huge amounts of time series data. Your thoughts on this would be highly appreciated.

    • Sure, SQL Server is great at that. I regularly work with clients who have >10 terabytes in a single SQL Server database and use it for querying and data mining.

      • I was told that before my time the company had tested SQL and found out that SQL could not keep up with the flow of information coming in. I personally find this hard to believe as we are limited to internet speeds, but it does put yet another obstacle in any justification of SQL.

        Surely MS SQL could keep up with even a 100Mb/s internet line and still store the data in a database with some basic verification?

  • A little while ago you asked your readers for some suggestions on where you should develop your skills.

    Amazon’s RDS product is currently for MySQL only, but there is an ‘Engine’ field that I would love to someday say ‘MSSQL’. I think you should get on the phone with the Amazon Web Services folks and get them to hire you to help implement this. Having a robust, scalable, fully-featured (unlike SQL Azure) database engine alongside products such as SimpleDB and EC2 would rock.

  • Brad Nicholson
    May 26, 2010 8:20 pm

    A lot of the use cases you describe for NoSQL could easily be filled by Postgres. I encourage everyone looking for MySQL alternatives to look into Postgres. I have been running Postgres in 24x7x365 high-throughput environments for a number of years and it is a fantastic database.

    It’s a free database that cares about your data (durability is excellent). It has a lot of the features a DBA would look for in a database, adheres fairly closely to the SQL standard, it’s fast, and it scales well.

    There is a vibrant user and developer community and excellent documentation. Development is not driven by company marketing teams. The license model ensures that it will remain free.

    • Brad – yes, but how does Postgres solve the scaling problem for multiple servers? For example, if I need to scale to a 100 terabyte database, how does Postgres scale out to multiple writeable servers and perform sharding?

      • Brad Nicholson
        May 26, 2010 10:34 pm

        It currently does not handle scaling out on its own (which is why I mentioned that some, but not all, of your use cases could be handled).

        There are a number of projects that handle various problem sets for scaling, though. Skytools is Skype’s framework for scaling out their systems. GridSQL is for distributed shared-nothing data warehouses and data marts.

        For scaling writes, there is a multi-master project under development – a fork of the code, but under the same license. It won’t support a full feature set until later this year, but the initial results are impressive, to say the least.

        I’m not implying that there is no case for NoSQL – I believe in the right tool for the right job. In the case of a 100TB DB that needs write scaling, I would likely say that Postgres is the wrong tool at this time.

        What I do think is that Postgres sits in an interesting market segment as a free product: both use case #1 and use case #2 you mention would work just fine under it, and it shouldn’t induce cringes from SQL Server DBAs because it behaves like an actual database.

  • David Saiote
    May 28, 2010 11:57 pm

    MulletDB! Yee-haa! xD

  • Great post Brent! While I’m a huge SQL Server fan I am also a big open source user so it’s nice that someone can see both sides of the equation.

    Right now I’m working on a project that has some very ambitious data size goals and involves using PerfMon to collect and store data across many, many servers in our environment. The volumes seem pretty large to me: ~50,000 rows per second insertion and eventual database size in excess of 10TB. Now, because this is an internal-only use and won’t generate revenue, the appetite for spending money on costly features like Enterprise Edition or high-end storage is pretty much non-existent. While I’m pretty confident we can tweak and tune enough to get it going with what we have (a 4-core VM with Standard Edition), your post has at least stuck the idea in my head about considering less traditional alternatives. Certainly the data is not critical, though I suppose one could argue that clean performance data is a very important thing to have as a DBA. What do you think?

    • Josh – interesting question. Do you really need to insert that data directly into SQL Server? I’d probably argue that the data should be landed in flat files first and then periodically loaded into SQL using SSIS. SSIS could also be used to aggregate the data into less granular chunks for reporting.

      • Good point Brent, and that is actually the direction we are trending towards: having the data collectors spit into flat files which are then processed and archived by SSIS at scheduled intervals. Currently we’re using the built-in direct-to-sql functionality of PerfMon, but experience is showing it scales rather… well, OK, it really doesn’t scale at all, unless you want to have one database for every two or three servers you’re running collectors against. Hopefully with the approach you suggested we can avoid those staging DBs and just load the cleansed data right into the warehouse where it belongs. Considering that this may well end up being the largest database we have in house, I’ll be curious to see just what we can get out of our little VM. Thanks again for the suggestion.

        • Yeah, the direct-to-SQL functionality is a pretty bad idea for Perfmon. I’d advise going to flat files first, especially since Perfmon supports CSVs, and then periodically pulling those into SQL Server via SSIS. Performance is fantastic that way.
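
          If it helps, here’s a bare-bones sketch of that load step – the server, table, and file names are placeholders, and SSIS would accomplish the same thing with more moving parts:

          ```python
          # Bare-bones sketch: pull a Perfmon CSV into SQL Server on a schedule.
          # Server, database, table, and file names are placeholders.
          import pyodbc

          conn = pyodbc.connect(
              "DRIVER={SQL Server};"
              "SERVER=myserver;DATABASE=PerfWarehouse;Trusted_Connection=yes;"
          )
          cursor = conn.cursor()

          # BULK INSERT reads the file server-side, so the path is relative to the
          # SQL Server box, not the machine running this script.
          cursor.execute(r"""
              BULK INSERT dbo.PerfmonStaging
              FROM 'C:\PerfLogs\webserver01_20100526.csv'
              WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');
          """)
          conn.commit()
          ```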

  • I like your writings on storage, LOVED your recent truncation post, but did you just turn into the love child of Dare Obasanjo and Michael Stonebraker?

    This “NoSQL” looks like just another doomed iteration of object databases, perhaps better named “AnythingButSQL”. This is jump-the-shark stuff in enterprise computing.

    Please don’t scare me any more 🙂

    Ron.

    • And you have an @Microsoft.com email address. I can see why this might sound scary. 😉 I bet you’re just jealous that I didn’t include Azure, right? Yuk yuk yuk…

  • Great article. I’ve been a database guy for 15 years. One of my major professional pet peeves is folks who are married to a tool for the simple reason that it’s the only one they are comfortable with. I’ve made my living for the past decade off SQL Server. Doesn’t mean I wouldn’t use NoSQL in a heartbeat if it was shown to me to be a better tool for the job.

  • Stephen Hirsch
    June 3, 2010 4:02 pm

    Sounds like a flat file would work just as well.

  • haha! she (SQL Server) just saw me picking at your article.

    great read!

