Things Every DBA Should Know (But Most Don’t)

What if your boss walked up right now and asked you, “How many database servers do we have?”

The classic poster

Would you start sputtering, “Uh, do you mean production, or dev? Do multiple instances on a cluster count? And by the way, who’s asking?”

Or would you confidently answer with an exact, unwavering number?

When I was a DBA, I thought I had a pretty good handle on my environment. I did proactive work like building a database server support matrix, setting up failsafe operators and alerts so I’d get notified when things broke, and even attended development planning meetings so I could see what was coming down the pike. Now that I’m a consultant, though, I realize that I wasn’t really armed for the kinds of questions business users like to throw at DBAs from out of nowhere.

How many database servers do we have?

When the business asks this question, it’s only the beginning. Here’s what they really want to know:

  • How many production servers/clusters do we have?
  • How many instances running on how many servers?
  • How many databases do we have?
  • How many applications are we supporting on each database server?

The snark starts

To start the process of surveying a Microsoft SQL Server environment, check out the Microsoft Assessment and Planning Toolkit. This free tool has a sinister purpose – setting up for a licensing audit – but it’s also really helpful to get a scan of SQL Servers buried under desks or hidden on application servers. It helps you find out about servers you’d otherwise never hear of – until they crash, at which point it’s too late.

The best DBA teams track not just the number of instances and databases, but the overall data size in the environment and the growth per week. I love being able to tell business executives, “I’m managing 10.4 terabytes right now and it’s growing an average of 3% per week.” They understand growth metrics, and it shows them that I take my job as seriously as they take company sales and expenses.

Which servers are no longer under support?

Most companies are harboring a few old database servers that soldier on just fine despite being long out of support. Database administrators often take this question personally and start responding with whines about not being able to move an old vendor application off SQL Server 2000, but take the emotion out.

Farm Gold and Level Up

The Microsoft Product Lifecycle Search is a little cryptic, but

  • SQL Server 2000 – mainstream support is long gone (2008), but you can still get extended support through April 2013.
  • SQL Server 2005 – if you thought 2000 was hard to kill, 2005 is going to be worse. It’s got enough DBA-friendly features like DMVs and database mirroring that it might just stick around forever – certainly long after April 2016 when extended support ends.
  • SQL Server 2008 – we’re looking at least through January 2019 for extended support.
  • SQL Server 2008R2 – as with 2008, at least through January 2019.
  • SQL Server 2012 – servers you deploy today are safe through July 2022, or 2023 for Express Edition.

If you’ve been pushing the business to get off SQL Server 2005 for support purposes, you should find a different approach before management finds the above links.

If that database server went down, what would you do first?

Enough with the Posters Already

The business doesn’t really care how you answer this question: they just want to hear confidence. They want to hear an immediate, well-thought-out-answer that says to the world, “I’m a data professional, and I’m the person you call to the helm when the database server hits the rocks.”

Here’s the kind of answer the business wants to hear: “I’ve got a script to take a snapshot of the database mirror in our DR datacenter and check the time of the most recent record in the sales table. I tell my manager that if we have to fail over to DR, we’re going to lose X minutes of data. Then I start at the top of our troubleshooting checklist, which is posted on the wall in my cube so I can find it fast when trouble strikes.”

Confidence is a key part of being a good DBA. Don’t just take it from me – take it from world-renowned database expert Jack Palance:

Want to Learn More?  Watch the Webcast.

Video – How to Test Availability Groups in SQL Server 2012

Want a jump-start on testing the hottest scale-out feature in SQL Server 2012? Kendra Little will tell you why it’s critical that you configure your own lab, and what you need to get your test environment set up. She’ll also detail how to avoid some gotchas that could cost you hours of frustration. She’ll demonstrate the steps you need to enable the Availability Group feature in SQL Server 2012 and how to create and test your first Availability Group. This half-hour session is appropriate for DBAs with one or more years of experience with SQL Server:

Links mentioned in the video:

For more videos like this:

SQL Server AlwaysOn Availability Groups in AWS

Many companies are investigating a move into cloud hosting. With the recent release of SQL Server 2012, these very same companies are also looking at a migration to SQL Server 2012 to take advantage of readable secondaries. Combining Amazon Web Services and SQL Server 2012 seems like a perfect match. Unfortunately, there are some gotchas, like limited support for clustering, to keep in mind as you consider deploying SQL Server 2012 AlwaysOn Availability Groups in AWS.

What’s Wrong With Clustering?

AlwaysOn Availability Groups don’t require shared storage, but they do require Windows Server Failover Clustering. Let’s be clear: there’s nothing wrong with clustering.

There’s a white paper from Amazon Web Services that suggests clustering won’t work in AWS because clustering requires support for multicast networking. This changed with Windows Server 2008 – multicast networking is no longer part of Windows Server clustering. For those who don’t know, multicast networking is delivering one packet from the source computer to multiple destination computers. Multicast relies on the intermediate hardware making the decision when to make copies of the packets to send to the destinations and, in general, relies on the hardware making that decision at the latest possible moment. Because so much of the AWS infrastructure is virtual, a lot of these decisions become trickier for the intervening networking gear (which might also be completely virtual). Ultimately, this is why you’d want to avoid multicast networking in a purely virtual environment. Since Windows Server 2008 and newer don’t use multicast networking, it’s not an issue.

It’s Not You, It’s… You

If multicast networking isn’t what stops clustering from working in AWS, what does?

Every cluster has a cluster access point – a name and IP address used to administer the cluster. The cluster access point is a virtual IP address, when you set up the cluster access point, Windows creates a virtual IP address that is used to provide a way to access the active node in the cluster. This isn’t strictly necessary for SQL Server clustering, but it does make life easier when managing the underlying Windows Server Failover Cluster.

In addition to each server’s individual IP address, the cluster access point requires its own IP address. This is where things start to get sticky. When you assign static IP addresses in AWS, you don’t assign the IP address at the server level – you assign the static IP address to a virtual ethernet adapter and attach the ethernet adapter to your EC2 instance. Each of these Elastic Network Interfaces (ENI) can only be assigned one IP address, for now.

Creating an Availability Group, Is It Possible?

Since it isn’t possible to create virtual IP addresses it would be easy to conclude that it isn’t possible to create an availability group – during availability group set up SQL Server has to get in touch with the cluster control point. There’s a way around this limitation – along as you create the availability group on the current host server of the cluster, then you can create the availability group.

It is possible to create an availability group, however it isn’t possible to create a TCP listener for the same reason that it isn’t possible to create the cluster control point – it isn’t possible to have more than one IP address per ENI in AWS. This doesn’t mean that AlwaysOn Availability Groups are a non-starter in AWS, but it does mean that you won’t have access to the full AlwaysOn Availability Group functionality until later this year when Amazon introduces multiple IP addresses per ENI.

Faking Out the TCP Listener

Here’s what we know so far:

  1. It is possible to set up a Windows Server Failover Cluster in AWS.
  2. It is possible to set up AlwaysOn Availability Groups on our Failover Cluster in AWS.
  3. It is not possible to configure a Cluster Control Point or TCP Listener.

It is possible, though, to get much of the same benefit of the TCP Listener without using the SQL Server’s built-in functionality. A load balancer like HAProxy can be used to create groups of readable secondaries for different applications. True, your applications won’t be able to take advantage of the other TCP Listener features like specifying read only intent in the connection strings, but it will be possible to create multiple connections in HAProxy that will work the same as connections specifying read only intent. HAProxy has the ability to use sticky sessions, monitor connections, and detect when servers are offline and route traffic away from the dead server. People have been doing this with MySQL for years.

Where Do We Go From Here?

Teams using AWS who need SQL Server 2012 Always On Availability Groups can start making use of these features right now. All of the benefits of SQL Server 2012 Always On Availability Groups won’t be available, but it will be possible to use multiple readable secondaries for scale out reads. Many of the features of Always On Availability Groups can be configured using a load balancer like HAProxy for now and configured using native functionality once multiple IP addresses can be attached to a single ENI. With a little work and some patience, you’ll be able to use this feature right now and be ready for when AWS supports virtual IP addresses.

More SQL Server AlwaysOn Resources

If you want to learn more about how to design, test, and manage this slick feature that offers high availability, disaster recovery, and scale-out reads, visit our AlwaysOn Availability Groups resources page.

SQL Server Virtualization Q&A

In my recent “Why Is My Virtual SQL Server So Slow?” webcast, we got a lot of great questions in chat.  Here’s some of the highlights:

Question: “Why would I virtualize a 4-core server and turn it into a 2-core VM?  Doesn’t performance go down?”

If you’ve got an existing physical box, and your only goal is to increase performance, then I probably wouldn’t virtualize.  However, it’s pretty rare that a company is only looking for more speed – usually they’re looking for lower costs, less heat in the datacenter, less power consumption, and so forth.

Question: “Is there any way to collect VMWare counters from within TSQL to capture those values for those specific counters and capture into a SQL table? I can already capture the SQL counters, but cannot find a way to collect the VMware counters?”

Not that I’m aware of, but I have to confess that I don’t look too hard for that kind of thing.  I want more than just the VMware counters – I want counters like Physical Disk Avg Sec/Read, and those aren’t available easily inside T-SQL either.  Instead, I use Perfmon to capture it as explained in my Perfmon tutorial post, or a third party SQL Server performance monitoring utility.

Question: “Hi Brent. Great presentation! How to deal with the arrogant SA who refuses to believe there is any impact to SQL from virtualizing? This guy even said that Microsoft now suggests that you virtualize SQL Server!”

Yes, take HP’s Database Consolidation Appliance developed in partnership with Microsoft.  Virtualization is a key part of that solution.  Yes, depending on how you architect it, you could take a huge performance hit – but I can say the same thing about physical boxes.  If I saddle a physical box with just 3 drives in a RAID 5 configuration, and I put the OS, SQL binaries, data files, log files, and TempDB on those drives, there’s often a big performance impact.  I’ve seen plenty of virtual servers suffering from similar design bottlenecks, like 30 virtual servers sharing 15 hard drives in a RAID 5 config.  The key is to design and provision appropriately.

Question: “Regarding Virtualization and SAN… is there a recommendation for setting block size on SQL Server 2008?”

There’s a few related settings here including NTFS allocation unit size, RAID stripe size, and partition offset.  Check with your SAN vendor’s documentation to see what’s right for you.  In most cases for OLTP databases, you’re in decent shape with 64K NTFS allocation unit size and RAID stripe size, and a 1mb partition offset.

Question: “I’m using Brand X storage.  Is that a problem?”

There was a lot of brand bashing in chat about a particular SAN vendor, and I’m disappointed by that.  I’ve seen crappy performance out of every single vendor out there – but it’s rarely due to the storage itself.  Instead, most of the time it’s us meatbags who make bad implementation decisions and cripple storage performance.  It’s a poor musician that blames his instrument.  (Unless your instrument is a software-based SAN that runs on commodity servers using crappy storage, in which case, it’s a tin can banjo, and you should blame it.)

Question: “Brent, would you ever recommend running a production SQL database on a VM?”

Yes.  I’ve been doing it since 2006, and I’ve got clients doing it.

Question: “Are virtual servers more secure than physical ones?”

I don’t think so, but I don’t think any kind of servers are inherently secure.

Question: “Can SQL perform well having thin provisioning on the SAN?”

If the primary bottleneck isn’t storage waits due to expanding the drives, yes.  For example, if you’ve got a 100GB database that’s fairly stable at 100GB, but you’ve thin provisioned the storage to grow to 500GB if necessary, then thin provisioning wouldn’t be an impact.

Question: “We are running a SQL cluster in a VMware environment. Any thoughts to this?”

Generally, I don’t like doing this.  When a business wants to do it, I ask if the databases involved are truly mission-critical.  Do they absolutely, positively, have to be up all the time, with as little downtime as possible?  If so, let’s put them on a physical cluster.  Troubleshooting clusters is hairy enough – there’s already enough finger-pointing between the OS, app, and hardware guys.  Introducing virtualization makes troubleshooting just that much more complex, and that’s an area where I want less complexity.  If, on the other hand, the database isn’t truly mission-critical, then I’d ask why we’re clustering it.

Question: “Why would you NOT want a dedicated server for your database? Does anybody think they have CPU or disk cycles lying around that they can’t use and want to share?”

So are you saying you’re running all of your databases on 8-socket monsters with solid state drives, and you’ve got 100% CPU use?  In that case, you shouldn’t virtualize.  Otherwise, you’re probably not using the latest and greatest technology on every SQL Server to begin with, so you’ve already got some headroom.  With SQL Server 2012 Enterprise Edition licensing coming in at $7,000 per core, I find that most businesses who aren’t running 100% CPU use are starting to ask tough questions to their DBAs.

My Best Practices for Virtualizing SQL Server on VMware

I’ve got a 3-hour video training session on how to manage SQL Server in VMware. Check it out now!

SQL Server 2012 Certification Changes Video

In 2000, we worked on our MCSE and MCDBA certifications. In 2005, Microsoft changed things up with the MCITP and MCM programs. Things are about to change again – we’ve got the new MCSA, MCSE, and MCSM certifications. Join Kendra Little to learn more in this 30-minute webcast:

Learn More About Changes in Microsoft Certifications

Links from the webcast, and answers to some Q&A:

One insightful viewer asked about the differences between the two MCSE exams for Data Platform. They are titled “Developing Microsoft SQL Server 2012 Databases” and “Designing Solutions for SQL Server 2012.” I recommended checking out the skills measured for the exams (here and here). I re-reviewed these after the webcast and the “Developing” exam really isn’t about what we would typically think of as developer tasks– it includes things like index maintenance, automation of backups and jobs, and interpreting perfmon data as well as what we’d typically think of as developer tasks. Similarly, the “Designing” test also contains a mix of topics between what we have treated as two disciplines in the past.

Writing Better Conference Abstracts and Presentations

The conference season is coming – let’s get you a speaker slot.  Before you submit an abstract, do your research.  Here’s some of my past posts about presentations and abstracts:

How to Write a Conference Abstract – the real goals of a session abstract is to get the right people into your room and keep the wrong people out.  If you write a misleading or vague abstract, attendees will be furious, and your evaluation scores will show it.  Learn how to craft an abstract that’ll put the right butts in your seats.

How to Pick Blog & Presentation Topics – we have a tendency to write the abstracts for sessions we’d like to attend ourselves, but that’s completely wrong.

How to Get Readers to Pay Attention – hit hard with the first sentence, tell ’em something they already know, and for God’s sake, check your spelling.

And yes, a lot more people came in.

Buck Woody warming up the room before a session

Who’s Your Target Audience? – you’re not writing to impress other presenters.  You’re writing to impress your attendees.

How to Rehearse a Presentation – one of the things that shows an audience you really know your stuff is how you handle the transitions between slides.  PowerPoint Presenter View helps a lot here.

What Makes a Good Conference Session? – killer presentations don’t have a magic formula with a certain number of demos or slides.  In fact, you might not need either.

Dealing with Presentation Criticism – I’ve bombed, and you probably have too. Before you submit an abstract, reread your past conference feedback to do a better job.

How to Deliver a Killer Technical Presentation – my favorite start-to-finish post with tons of tips.

How I APPROACHED 2012’s Conference Season

Last year, I wrote my sp_Blitz® session with a few specific goals:

  1. I wanted to get into the top 10 sessions for the 3rd year in a row
  2. I wanted to give attendees a Steve Jobs “one more thing” moment
  3. I wanted everybody to leave the session eager to run a ready-to-go script

I pulled it off, and you know how it goes – it’s time to raise the bar again.  Here’s my goals for this year:

  1. Get into the top 10 again
  2. Get everybody to run a script as soon as possible
  3. Get attendees to pass on the script to as many people as possible
  4. Pack whatever room they put me in

That means I need to build a session around a script that will have really wide appeal and make a big difference in their jobs.  I need to make the session a train-the-trainer session, too – I can’t just teach them what the script is doing, but I have to equip them so they can pass this knowledge onto their friends.  It’s not enough just to give them a slide deck, because many/most attendees don’t feel comfortable presenting.  I need to make the session viral, with as low of an effort as possible on the attendees’ parts.

Sounds like a lot of work – and it is.  Not every session needs to go to that level of planning and detail – but it helps to go in with a set of goals.  What is your session trying to achieve for you, and for your attendees?  Let’s really make our mark this year!

Maintenance Plans and Roombas Suck – in a Good Way

Every now and then, I hear an experienced DBA say, “SQL Server maintenance plans for backups, indexes, and statistics?  What, are you kidding me?  Nobody uses maintenance plans because they suck.”

They say maintenance plans are so inflexible because they insist on defragging or rebuilding every index, every time, regardless of how little the table has changed.  Or maybe they complain about problems with scheduling, or how they saw somebody use a DBCC SHRINKDB maintenance plan once.

Well, I use maintenance plans all the time.

I even suggest ’em to my clients.

In a perfect world, every DBA would have an unlimited amount of time to get the absolute best solution to every problem.  We’d spend endless time tuning T-SQL, perfecting plans, and daddying databases.  We’d drive home from work in our Porsche 911s to our mansion on the hill.

Back here in the real world, we have a limited amount of time to work each day.  We need to implement solutions that anybody can troubleshoot and maintain with a bare minimum of knowledge.  Sure, you might be the only DBA in the shop today, but wouldn’t it be nice to get a junior person hired, and then immediately be able to pass work off to ’em?  I’ve seen shops where the lead DBA is still personally managing their homegrown maintenance scripts – the DBA doesn’t trust anyone else to touch this convoluted spaghetti code.

In the real world, we usually have two, or sometimes three, problems:

  1. The servers need to be maintained
  2. We need to save DBA time
  3. We need to save server time

Everybody has problem #1, but often problem #3 is overrated.  With careful use of maintenance plans, we can solve problems #1 and #2 easily.  Could the maintenance be done faster with custom scripts?  Yep – but when those scripts fail – and they will fail – we’ll make problem #2 worse, not better.

Maintenance plans aren’t perfect, but they require less troubleshooting time than your homemade scripts.  I know, you think your scripts are perfectly bulletproof and would never possibly have a bug, but time after time, I do SQL Server health checks and find some amazing problems.  Some highlights have included:

  • A t-log backup script that failed every night at 11pm without sending a warning
  • A 1mb database with 100GB of transaction logs because people thought it was in simple mode and their script skipped it by name
  • Databases that had never been backed up because the script used the wrong variable name for the database
  • DBCC jobs that completed successfully in under a second and nobody noticed.  They were using sp_msforeachdb, which is notoriously unreliable.

Maintenance plans have a killer advantage over custom maintenance scripts because they don’t have your bugs.  Let’s face it: we’re just not that good at writing code.  If we were, we’d be developers, not DBAs.  They do have bugs – but the bugs tend to be in bad implementation, like doing SHRINKDB, and those same problems pop up in custom maintenance scripts too.  If you’re smart enough to write a maintenance script, then you’re smart enough to use maintenance plans correctly.  (And conversely, if you’re not smart enough to use maintenance plans correctly, maybe you should ease up on those scripts, eh?)

I, for one, welcome our new fresh-smelling overlords

Maintenance plans are like my iRobot.  Their Scooba is a floor-cleaning robot that sprays cleaning solution on my wood floors, scrubs it (a little), and vacuums the liquid and dirt up.  It doesn’t do a great job – in fact, it does a pretty crappy job, leaving a damp trail of cleaning solution behind.  At first this bothered me, and I searched the web looking for ways to get it to do a perfect job and leave a dry floor behind.  No dice – everybody complains about the dampness.

Thing is, J-Lo (that’s my name for her, because she’s On the Floor) does a better job of floor maintenance than I can do in my spare time.  Once a week, I fill up J-Lo’s tank with cleaning solution, put in her battery, and turn her on.  I watch her do her thing while I write blog posts (and she’s doing it right now – I know you want to watch, but this is a private show.)  The cleaning solution evaporates, and I’m left with a cleaner apartment than I’d have had otherwise.

If you insist on using crafty scripts to do your database maintenance, use Ola Hallengren’s free scripts.  They’re powerful and he even updates ’em frequently – just this week, he added support for the new preferred-replica-backup DMVs for SQL Server 2012’s AlwaysOn Availability Groups.

That’s the curse and blessing of scripts – even good scripts like Ola’s require maintenance because new versions come out from time to time.  If you don’t have the time to maintain and debug scripts, use maintenance plans.  Take all that time you save, and keep reading blogs to fix problems that actually bother your end users.

Breaking News: Changes to Microsoft SQL Server Certifications

Microsoft Learning published new information today which changes the certification landscape in a big way.

Think you understand how to get certified? You’ve got a set of new terms and new rules to learn!

I’ll give you my take on the high points here. Join us next Tuesday in our webcast to get the full scoop.

Certification Names and Paths are Changing

This one’s going to cause a whole lot of chatter. The biggest problem I think there will be? So many certification names are changing that it’s a little tricky to even get your head around it. If you’ve been in the industry a while, you may be surprised at the re-use of some old and familiar acronyms.

Getting in the Game: Become an MCSA

In the current round of certifications you first work toward becoming an MCTS— the TS stands for “Technical Specialist.”

In the new certification world, you will first work toward becoming an MCSA— or a Microsoft Certified Solutions Associate. To achieve this first step, you’ll take three tests.

Advancing Your Career: Get the MCSE

Did you see that? I totally just said MCSE! That’s an older acronym that stands for “Microsoft Certified Systems Engineer.” In our new certification world, the “Microsoft Certified Solutions Expert” is the second level of certification.

The MCSE is replacing the current MCITPs (“IT Professionals”) in Database Administration and Development.

Here’s the part I want to make sure is clear here: the MCITP level is being combined. In order to get an MCSE in SQL Server, you need to pass exams for developing and designing database solutions. This is a big shift! In the new world we’ll have fewer certifications, and the trend is to have them cover more topic areas.

Become a Master: The MCSM

Once you have your MCSE, you’re ready for the Masters level. There’s only slight changes here so far— the name is now Microsoft Certified Solutions Master.

No More Versions!

Get ready for a big one: these new certifications don’t all have “SQL Server 2012” in their names. That’s on purpose, the certifications are not versioned.

Instead, certifications may cover multiple versions of the product.

What Do these Changes Mean?

Personally, I love these changes, particularly the move away from versioning. If someone holds a certification in SQL Server 2000, does that mean they remember anything about SQL Server 2000? (Trust me, that’s no small accomplishment!)

Let’s be clear: these changes do make certification more challenging for DBAs who don’t work actively with development, and who work in businesses that are slow to upgrade to new technologies. However, the changes also challenge everyone to stay current and to broaden their skills. That’s the best bet for hiring managers.

Interested in learning more? Check out the Microsoft Learning FAQ for information on exam availability, upgrades, and more.

DBA Darwin Awards: T-SQL Edition Webcast

Working in the SQL Server emergency room, I see a lot of self-inflicted performance wounds. That query looks good, and it looks like it’s using an index – but why does it take so long? The query runs great half the time, but the rest of the time performance is cut off at the knees, and the only fix is to restart the server. Learn the top 3 T-SQL traumas I see in my practice:


Hash Partitioning, SQL Server, and Scaling Writes

At first glance, SQL Server’s partitioning seems like it should be an easy way to solve problems inserting data into busy tables. If the writes are spread across many partitions it only makes sense that we can avoid write hot spots in SQL Server, right? The truth is much more complicated than it appears.

Asking the Right Questions

A reader commented on Kendra’s post How To Decide if You Should Use Table Partitioning asking “what are the plusses and minuses to using Hash partitioning on OLTP tables with large numbers of inserts on a daily basis?” The pros are pretty easy to see – SQL Server will spread the writes out across many partitions which, in turn, allows you to spread the writes out within multiple file groups in each partition. It seems obvious that this will solve problems with allocation map contention under write heavy workloads.

If the only problem that you’re trying to solve is eliminating write contention in a filegroup, you can improve SQL Server performance by adding multiple files to that file group, then rebuild indexes. After the index rebuild, data will be distributed evenly between the different files. Since each file has its own set of allocation map pages, you’ll see less contention across the different allocation map pages. The upside of this approach is that you don’t have to make any changes apart from adding extra data files to the filegroup. SQL Server’s proportional fill algorithm should take care of the rest.

Of course things are never as simple as that, right?

Even if you manage to add enough data files to handle our data file write throughput, the transaction log will still be the biggest bottleneck. You can push SQL Server write performance a long way, especially if you follow the advice in these SQLBits videos from Thomas Kejser, but eventually you will hit the limits of a single transaction log. If the log can’t keep up, the number of data files won’t matter: you won’t be able to write any faster than you can write to your transaction log.

This is why it’s so important to ask “What’s the problem you’re trying to solve?”

Write Throughput & Multi-Tenancy

The original poster asked about the pros and cons of using hash partitioning in an OLTP system to deal with a large number of inserts. Hash partitioning uses some kind of hash function to evenly distribute data between a finite number of buckets. While a hash function will solve the problem of distributing data, hashing won’t make it possible to distribute writes to multiple log files – SQL Server only allows for one log file per database. Sure, you can add more disks behind the log file to make it respond faster but there is only so much you can do. Limitations are imposed on storage throughput simply because you can only fit so many HBAs in a server chassis before you run out of PCI slots and you can only buy so much Infiniband before you run out of budget.

When the transaction log becomes the limiting factor in a database’s write performance, it’s time to split the database into multiple databases. This is where things get difficult: where is the decision made to partition data across databases?

In multi-tenant systems it’s relatively easy to partition data across databases – each client’s data moves to its own database and any global metadata can either live in a central database or can be replicated to each client database. Partitioning the data by client ID is the natural way to split out the data. Most, if not all, queries will be limited to a single client and the application can handle any cross client joins that administrative users need to run.

The upside of this approach is that you can spread your data across multiple servers and handle performance problems on each server as needed. The downside is that using a hash function with a fixed number of servers means that adding another server means repartitioning all of the data and shuffling it between servers. Needless to say, this can take a long time.

Write Throughput & Single Tenancy

Let’s pretend that you don’t have a lot of different clients. You have one big client and you’re streaming a lot of sensor readings from multiple sensors into a single database. Things become more difficult when data is being streamed into a single tenant application – say you’re streaming sensor data into a single application database. Sometimes it’s not so easy to partition this data. There are a few ways to handle the load and none of them are great.

One of the most obvious ways to handle massive write concurrency in a single tenant system is to find the a level of the application where writes can be partitioned – in this example write partitioning could happen at the sensor level, sensors could be grouped, or the hash might be found by converting the current time to seconds and reversing the digits to bring more randomness the hash function output. This will help spread the load across multiple servers but it won’t make any easier to query the data. You need to do something other than spread the data randomly across some random number of servers.

Writing Faster Without A Database

Assuming that our goal is to handle more write throughput, the easiest thing to do is remove the database from the equation altogether. After all, if we aren’t writing to a database then we can’t run into issues with log throughput or allocation map contention. We can’t actually get rid of the database, but we can remove it from directly causing performance problems.

The first thing to do is add several cache servers in front of our database server. Ideally you want to have multiple cache servers for redundancy. Since the cache servers are nothing more than boxes loaded full of memory it shouldn’t be expensive to find servers that fit the bill. Before any data is committed to the database it needs to be written into the cache. It might be difficult to modify application patterns to work with the cache, but it can be done. 37signals redesigned how they handled caching page elements in their Basecamp product do use tiers of cache. The same ideas can be applied to your application – write data at the most granular level possible in your cache and cascade calculations throughout your cache.

The next step is to add a queuing mechanism to handle writes into your database. After you write to the cache, throw your writes into a durable queue. RabbitMQ is an example of a queue that can be clustered, is highly available, and can be backed by a durable store. By adding a durable queue alongside the cache it is possible to scale reads (the cache) independently of writes (the queue) without having to worry as much about performance in the underlying database.

Processing writes is a matter of setting up additional processes to read data from the queue, process the instructions, and write data to the database. This could happen by pulling off batches of tens or hundreds or even thousands of instructions at a time, processing them, and then inserting them into the database in one transaction and hopefully improving performance by avoiding a lot of small inserts. You still have to handle the write load into the database and handling scalability issues with the transaction log so this strategy relies on being able to find a performance point where it is advantageous to medium to large size batches instead of many small transactions. At some point, you might even have to start partitioning your data across multiple servers underneath the caches and queues, but your application won’t need to be as aware of that data split taking place. Caching and queuing hide that implementation pain from the middle tiers.

The Answer

There’s no correct answer. Pushing writes into a partitioned table means having to deal with the pain points of SQL Server’s table partitioning. Splitting writes out among multiple databases or servers can make it difficult to query your data. Implementing caches and queues adds operational complexity and requires expertise that some teams may not have.

The more flexibility you require, the more difficult it becomes to handle additional writes. At some point it might even become necessary to find ways to move the write heavy portions of your application out of SQL Server and into another application that does a better job of handling distributed writes.

Always ask yourself what problem you’re trying to solve and make sure that you’re solving it in the best way. Bolting something like consistent hashing on top of SQL Server is going to be painful, at best. But there may be a different way to solve the problem that’s less painful, operationally simpler, or a little bit of both.

Interested in SQL Server Table Partitioning Instead?

Before you start designing a SQL Server table partitioning strategy, or if you’re trying to troubleshoot why your partitioned tables aren’t working as fast as you expect, check out our SQL Server table partitioning resources page.