
Notes on Scalability

15 Comments

We all hope that we’re going to succeed beyond our wildest expectations. Startups long for multi-billion dollar IPOs or scaling to hundreds, or even thousands, of servers. Every hosting provider is touting how their new cloud offering will help us scale up to unheard of heights. I’ve built things up and torn them down a few times over my career.

Build it to Break

Everything you make is going to break – plan for it.

Whenever possible, design the individual layers of an application to operate independently and redundantly. Start with two of everything – web servers, application servers, even database servers. Once you realize that everything can and will fail, you’ll be a lot happier with your environment, especially when something goes wrong. Well-designed applications are built to fail. Architects accept that failure is inevitable. It’s not something that we want to consider, but it’s something that we have to consider.

Distributed architecture patterns help spread workloads out across many autonomous servers. Load balancers and web farms help us manage failure at the application server level. In the database world, we can manage failure with clustering, mirroring, and read-only replicas. Not every component has to be duplicated, but we have to be aware of what can fail and how we’ll respond.

Everything is a Feature

As Jeff Atwood has famously said, performance is a feature. The main thrust of Jeff’s article is that making an application fast is a decision that you make during development. Along the same lines, it’s a conscious decision to make an application fault tolerant.

Every decision has a trade-off. Viewing the entire application as a series of trade-offs leads to a better understanding of how the application will function in the real world. The difference between being able to scale up and being able to scale out can often come down to understanding key decisions that were made early on.

Scale Out, Not Up

This isn’t as axiomatic as it sounds. Consider this: cloud platforms like Azure and AWS are at their most flexible when we can dynamically add servers in response to demand. To scale out effectively, we also need to be able to scale back in.

Additional capacity is usually added in the application tier: just add more servers. What happens when we need to scale the database? The current trend is to buy a faster server with faster disks and more memory. This process keeps repeating itself. Hopefully your demand for new servers will grow at a pace that is less than or equal to the pace of innovation. There are other problems with scaling up. As performance increases, hardware gets more expensive for smaller and smaller gains. The difference in cost between the fastest CPU and the second fastest CPU is much larger than the difference in performance gained – scaling up often comes at a tremendous cost.


The flip side to scaling up is scaling out. In a scale out environment, extra commodity servers are added to handle additional capacity. One of the easiest ways to scale out the database is to use read-only replica servers to provide scale out reads. Writes are handled on a master server because scaling out writes can get painful. But what if you need to scale out writes? Thankfully, there are many techniques available for horizontally scaling the database layer – features can be broken into distinct data silos, metadata can be replicated between all servers while line-of-business data is sharded, or automated techniques like SQL Azure’s federations can be used.

The most important thing to keep in mind is that it’s just as important to be able to contract as it is to expand. As a business grows, it’s easiest to keep purchasing additional servers in response to load. Purchasing more hardware is faster and usually cheaper than tuning code. Once the application reaches a certain level of maturity, though, it’s important to tune it to run on fewer resources. Less hardware equates to less maintenance. Less hardware means less cost. There’s another possibility nobody wants to face, too – the business may shrink. A user base may erode. A business’s ability to respond to changing costs can be the difference between a successful medium-sized business and a failed large business.

Buy More Storage

In addition to scaling out your servers, scale out your storage. If you have the opportunity to buy a few huge disks or a large number of small, fast disks, give serious thought to buying the small, fast disks. A large number of small, fast drives is going to be able to respond rapidly to I/O requests. More disks working in concert means that less data needs to be read off of each disk.

The trick here is that modern databases are capable of spreading a workload across multiple database files and multiple disks. If multiple files/disks/spindles/logical drives are involved in a query, then it’s possible to read data from disk even faster than if only one very large disk were involved. The principle of scaling out vs. scaling up applies even at the level of scaling your storage – more disks are typically going to be faster than large disks.
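As a sketch of how this looks in practice – database, filegroup, and path names here are placeholders, not part of the original post – spreading a filegroup across several drives lets the engine fan reads out over more spindles:

```sql
-- Sketch: spread data across multiple files on separate physical drives.
-- 'MyDatabase' and the drive paths are placeholders for your environment.
ALTER DATABASE MyDatabase ADD FILEGROUP Data2;

ALTER DATABASE MyDatabase
    ADD FILE
    (
        NAME = N'MyDatabase_Data2',
        FILENAME = N'E:\Data\MyDatabase_2.ndf',  -- a different drive than the first file
        SIZE = 10GB,
        FILEGROWTH = 1GB
    )
    TO FILEGROUP Data2;
```

SQL Server stripes allocations across the files in a filegroup proportionally, so queries against objects in that filegroup can issue I/O against both drives at once.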

You’re Going to Do It Wrong

No matter how smart or experienced your team is, be prepared to make mistakes. There are very few hard and fast implementation guidelines about scaling the business. Be prepared to rapidly iterate through multiple ideas before finding the right mix of techniques and technologies that work well. You may get it right on the first try. It may take a number of attempts to get it right. But, in every case, be prepared to revisit ideas.

On that note, be prepared to re-write the core of your application as you scale. Twitter was originally built with Ruby on Rails. Over time they implemented different parts of the application with different tools. Twitter’s willingness to re-write core components of their infrastructure led them to their current levels of success.

Don’t be afraid to change everything.


What 2012 Work Scares You?

Professional Development
94 Comments

Right now, this very moment, there’s an ugly piece of work tapping on your shoulder – and I don’t mean your boss.

Or bacon. Bacon works too.

You agreed to do it months ago during the 2012 planning sessions, but you never really expected January 2012 to arrive.  Now that it’s here, you’re starting to get nervous because you’ve never done it before and you’re not quite sure where to begin.

I’d like to help.

I believe I can’t be successful unless other people around me are successful, and that means you.  I also believe that if there’s anything out there I haven’t done before, I bet either somebody I know has done it, or a few other readers have.  Between all of us, we can figure it out.  In the year 2012, there’s no need for you to feel alone going up against anything.  Whether it’s implementing replication, fixing a performance problem, getting your blog back on track, losing weight, or beating cancer, we’re here for each other.

Leave an anonymous comment describing the work that scares you.  We (and by we I mean the internet) will talk about our first time doing it, what we wish we’d have known when we did it, and our favorite getting-started resources.


My Easy New Year’s Resolutions (Guest Post)

Hi!  I’m your SQL Server.  I know you don’t usually listen to me, so I’ve decided to take a drastic step and find you where you spend all your time – Brent’s blog.  Seriously, you need to spend less time on the intertubes and more time on the error logs.  You’re lucky this post is about me, not about you, or else I’d also have to divulge the fact that you installed Adobe Flash on my desktop.  Whoops, I just did.

Resolution 1: Keep in Touch with Loved Ones

Bad things have been happening to me all year long, but I’ve been keeping it secret from you.  It’s time you and I were more honest with each other.  To do this, I want you to write your name in my heart – and by that I mean, set yourself up as my operator.  The below script sets you up as an operator and sends you alert emails whenever bad things happen.  Before you run it, make these two changes:

  • Change YourEmailAddress@Hotmail.com to your actual email
  • Change 8005551212@cingularme.com to your actual phone’s email address (look up your provider)
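The original script isn’t reproduced here; a minimal sketch of what it does – creating an operator and one sample alert via msdb’s Agent procedures (operator and alert names are placeholders) – might look like:

```sql
-- Hypothetical sketch of the operator/alert setup script.
-- Swap in your real email and carrier addresses per the two bullets above.
USE msdb;
GO

-- Register yourself as an operator; multiple addresses are semicolon-separated.
EXEC dbo.sp_add_operator
    @name = N'The DBA',
    @enabled = 1,
    @email_address = N'YourEmailAddress@Hotmail.com;8005551212@cingularme.com';
GO

-- One alert per high-severity error level; severity 16 shown as an example.
EXEC dbo.sp_add_alert
    @name = N'Severity 016',
    @message_id = 0,
    @severity = 16,
    @delay_between_responses = 60,
    @include_event_description_in = 1;

-- Wire the alert to the operator (notification method 1 = email).
EXEC dbo.sp_add_notification
    @alert_name = N'Severity 016',
    @operator_name = N'The DBA',
    @notification_method = 1;
GO
```

The real script repeats the alert/notification pair for each severity level worth waking up for (typically 16 through 25).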

Resolution 2: Do a Data Detox Diet

Over the years, I’ve eaten a lot of bad stuff.  It’s time to do a little cleansing ritual to purge myself.  People have dumped in heaps of temporary backup tables that never got queried again.  This query looks for heaps (tables without clustered indexes) that haven’t been accessed since the last server restart (or database restore/attach).  Run it in my largest or most important databases:
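The original query isn’t reproduced here, but a sketch of such a heap-hunting query – using sys.indexes (heaps have index_id = 0) and sys.dm_db_index_usage_stats – might look like:

```sql
-- Hypothetical sketch: find heaps with no recorded reads since the last
-- restart/restore/attach. Run in the database you want to detox.
SELECT  s.name AS schema_name,
        o.name AS table_name
FROM    sys.objects o
JOIN    sys.schemas s ON o.schema_id = s.schema_id
JOIN    sys.indexes i ON o.object_id = i.object_id
                     AND i.index_id = 0           -- index_id 0 = heap
LEFT JOIN sys.dm_db_index_usage_stats u
        ON  u.database_id = DB_ID()
        AND u.object_id  = o.object_id
        AND u.index_id   = i.index_id
WHERE   o.type = 'U'                              -- user tables only
  AND   COALESCE(u.user_seeks, 0)
      + COALESCE(u.user_scans, 0)
      + COALESCE(u.user_lookups, 0) = 0           -- never read since restart
ORDER BY s.name, o.name;
```

Remember that usage stats reset on restart, so "never read" only means "not read recently" – which is why the post suggests renaming suspects before dropping them.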

These heaps are slowing down my backup times, my DBCC times, and my 100 meter sprint times.  We can make some quick judgment calls based on table names – if they’re names like Sales_Backup2009BeforeDeployment, it’s probably a table someone made once as a backup – and then forgot about it.  We could either rename them temporarily and then drop ’em in a few days, or just drop ’em outright.  Standard warning about deleting data, blah blah blah.  (Hey, my resolution wasn’t to be more cautious.)

Resolution 3: Lose Waits

Users love me, and they show it to me by feeding me huge dinners of data. I love my life, don’t get me wrong, but sometimes I see those new guys with their solid state drives and their 512GB of memory, and I think, wow, they must be able to handle anything. I’ll never be that well-endowed, but there’s still some easy things I can do to get in better shape. Right now, I’m in heart attack territory, and it’s time to fix that.

I’ll start by working out with Brent Ozar and Buck Woody, watching this video where Brent is dressed up like Richard Simmons. It’s an oldie but a goodie, and it’s better than being seen in public with a ShakeWeight.

Now, weren’t those resolutions easy?  What, we’re not done yet?  You haven’t run the scripts?  What the hell, you lazy meatbag?  I can’t pull this off by myself – I need you to do ’em.  What do you think this is, the cloud?  And wasn’t one of your resolutions this year to make sure the boss doesn’t think the cloud is better than us?  Hop to it before we lose our jobs.


The 9 Letters That Get DBAs Fired

I’m proud to say that I’ve only lost one job in my life.  I was a grocery store bagger when the store came under pressure to increase their percentage of minority workers.  They laid off all of the Caucasian baggers, but I’m not bitter.  I actually loved bagging groceries, and if I hadn’t been laid off, I’d still be the guy asking you “Paper or plastic?”

I probably didn’t deserve to lose that job, but I’ve deserved to lose others.  I’ve just been lucky.  Through my career, I’ve seen a lot of database administrators go through some dicey situations that cost them their jobs.  Maybe not right away – but once these situations happen, it’s only a matter of time until management finds a legally defensible reason to swing the axe.  These situations often involve just nine letters.

The First Three Letters That Get You Fired: RPO

Recovery Point Objective (RPO) measures how much data you’d lose if your server went down and you brought up the secondary server.  It’s measured in time, not megabytes.  The less time you have, the more frequently you have to get the data to a secondary location.

In theory, if you’re doing transaction log backups every 5 minutes to a network share, you’ve got a 5-minute RPO.
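That 5-minute RPO usually rests on an Agent job running something like the following every 5 minutes (database name and share path are placeholders):

```sql
-- Sketch: a timestamped transaction log backup to a network share.
-- 'MyDatabase' and the UNC path are placeholders for your environment.
DECLARE @file nvarchar(260) =
      N'\\BackupServer\SQLBackups\MyDatabase_'
    + CONVERT(nvarchar(8), GETDATE(), 112)                    -- yyyymmdd
    + N'_'
    + REPLACE(CONVERT(nvarchar(8), GETDATE(), 108), ':', '')  -- hhmmss
    + N'.trn';

BACKUP LOG MyDatabase
    TO DISK = @file
    WITH CHECKSUM, COMPRESSION;
```

The job itself is the weak link: if it stops succeeding, your real RPO silently grows with every missed run.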

In theory, “in theory” and “in practice” are the same.  In practice, they are different.

The Gremlin of Disaster

In practice, you’re unlucky, and your transaction log backups will fail before the engine itself does.  I know this because it’s happened to me and my clients.  In one case, the volume for backups filled up and the transaction log backups stopped working.  Since the log backups didn’t run, the transaction logs didn’t clear, and they continued to grow.  The Gremlin of Disaster arrived, sowing corruption in his wake.  By the time the dust cleared, we’d lost over an hour of data – permanently.

If you’re backing up to a local drive and the RAID controller fails, check out this helpful article.  You’ll be needing it, because with a failed RAID controller, you can’t access any of the data on the local drives.  Before you try pulling the hard drives out of the failed server and putting them into another server, check out this article.  After all, you’re gambling – gambling that the RAID controller firmware matches between servers and that it’ll be able to read and rebuild the newly inserted drives.  Guess wrong, and it will format the drives, thereby erasing your backups.

Savvy readers will detect a sense of acerbic humor in the post so far – even more biting than my usual pieces.  Brace yourself, because it’s going to get worse.

The Next Three Letters: RTO

Recovery Time Objective (RTO) is how much time you’ve got to get your server’s act together.  Some companies measure it from the time you learn about the problem; others measure from the time the service first becomes unavailable.

The Pinto of Peril

The clock’s starting point matters a lot.  When things go wrong, SQL Server Agent might just email the failsafe operator – if you’ve gone to the trouble of setting one up and configuring alerting.  Even then, you probably won’t notice the email because you’ll be drunk at the pub having a nutritious dinner with your lovely family.  An hour later, when you realize that your cell phone is set on vibrate, you’ll panic, run to the car, fire up your laptop, and try connecting over your phone’s 1-bar internet connection.  Should we start the clock when the server started failing, when the email was sent to you, when you saw the email, or when you finally connected to the VPN?  If the business wants to start the clock when the server stops, you have to account for the meatbag reaction time, and with most of us, there’s a pretty long delay.  (If you’re the kind of DBA who sets up rules to move automated alerts into a separate email folder, you’re off to a bad start.)

If our idea of disaster protection is restoring the databases onto another server, we might assume that the RTO just needs to cover the length of time to do a full restore.  Not so fast, Speedy McPinto – here are a few things you’ll need to do to get the server online:

  • Provision enough space to do the restore – do you know how much space you need on each drive?
  • Make sure it’s on the right version/build of SQL Server – sometimes we have to install service packs, cumulative updates, or hotfixes to get production to work right.  There’s no guarantee our standby box has those same binaries.
  • Get the sp_configure settings right – if production had a certain maxdop, for example, it might fail terribly on the new hardware.
  • Get security working correctly – you did script out the logins and passwords, didn’t you?
  • Configure TempDB correctly – I’ve seen people restore production databases onto dev environments where TempDB was just configured with a few gigabytes of storage, and it instantly started growing when handling production load.  TempDB ran out of space, and the server went Pete Tong.
  • Restore the database – and hope to hell it restores correctly the very first time.
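The first and last items on that list can at least be rehearsed with a couple of standard commands – paths, database, and logical file names below are placeholders:

```sql
-- Sketch: rehearse the restore before disaster strikes.
-- All names and paths are placeholders for your environment.

-- Step 1: see which files are in the backup and how big they are,
-- so you know how much space each drive needs.
RESTORE FILELISTONLY
    FROM DISK = N'\\BackupServer\SQLBackups\MyDatabase.bak';

-- Step 2: restore, relocating files onto the standby server's drives.
RESTORE DATABASE MyDatabase
    FROM DISK = N'\\BackupServer\SQLBackups\MyDatabase.bak'
    WITH MOVE N'MyDatabase_Data' TO N'D:\Data\MyDatabase.mdf',
         MOVE N'MyDatabase_Log'  TO N'L:\Logs\MyDatabase.ldf',
         CHECKSUM,
         RECOVERY;
```

None of this covers the builds, sp_configure settings, logins, or TempDB items on the list – those have to be staged and documented separately.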

The less time you have, the more tasks you have to perform ahead of time, and the more you have to document, test, and rehearse the steps you’re going to perform at the moment of disaster.

The Most Important Three Letters: CYA

Right now, all of your users believe your RPO and RTO are zero.  They believe you’ve got everything under control, and that when disaster strikes, you’re going to instantly bring up another server with no data loss whatsoever.  We both know the truth, though – you spend your time at work checking blogs, not backups, so we’ve got some work to do.

You have to put it in writing – and that’s where our worksheet comes in:

HA Worksheet Page 1 (from our free First Responder Kit)

There are four columns: high availability, disaster recovery, corruption, and “oops” queries.

For each column, you have to define the maximum amount of data you’re willing to lose, and the amount of downtime you’re willing to have. Start by filling out the current state. (If you’re not familiar with calculating your current RPO/RTO, that’s where our DBA’s Guide to HA/DR class comes in.)

I know, you’re going to think the business says “ZERO!” but that’s where the next page of the worksheet comes in:

HA Worksheet Page 2 (from our free First Responder Kit)

Down the left side, choose how much data loss you’re willing to tolerate. Across the top, choose your downtime. Where they meet, that’s the cost.

When the business users understand the costs, they’re able to make better decisions about RPO and RTO. But it all starts with getting them in writing.


Silent CPU Killers

10 Comments

It’s always important to make sure that SQL Server is running at peak efficiency – nobody wants to waste resources. In the past we’ve concentrated on ways to reduce I/O – either network I/O or disk I/O. The increasing number of cores available in modern CPUs meant that CPU tuning wasn’t as much of a concern. With the licensing changes coming in SQL Server 2012, tuning SQL Server for CPU should be at the front of every DBA’s and developer’s mind.

Old or Missing Statistics

Statistics help SQL Server determine the best way to access data. When we have accurate statistics, the database is able to examine the statistics and make good guesses about how much data will be returned by a given query. The statistics are then used to determine the most appropriate join or data access method (seek vs scan, for instance). The problem with statistics is that they’re only good for a short period of time. Statistics are recorded as a histogram that describes how data is distributed in a table or index. If the histogram is inaccurate, then SQL Server has no reliable way of knowing how much data is present.

Looking at the AdventureWorks sample database, we can see there’s an index on the SalesOrderHeader table, IX_SalesOrderHeader_OrderDate. If we use DBCC SHOW_STATISTICS to look at the index, we can see that the most recent value for RANGE_HI_KEY is ‘2008-07-31 00:00:00’. What happens when we query for data that’s outside of that range? SQL Server assumes that there’s only 1 row present, no matter how many rows there actually are. There could be 0 rows, 5 rows, or 500,000 rows. SQL Server doesn’t have any statistics about the data, so it cannot make an accurate decision.
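You can reproduce this check yourself against AdventureWorks:

```sql
-- Inspect the statistics histogram for the OrderDate index.
-- The last RANGE_HI_KEY row in the histogram output is the highest
-- key value the statistics know about.
USE AdventureWorks;
GO

DBCC SHOW_STATISTICS ('Sales.SalesOrderHeader', IX_SalesOrderHeader_OrderDate);
```

Any query filtering on an OrderDate beyond that last RANGE_HI_KEY gets the 1-row guess described above.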

Old or missing statistics can cause huge performance problems because SQL Server will have to make a bad guess (1 row) regardless of how much data is present. SQL Server may choose an inappropriate join type and perform a large number of additional reads; a nested loop join may be used when a merge or hash join would be more appropriate.

The solution is to make sure that statistics are kept up to date.

  1. Understand data change rates. If data is loaded as part of regular jobs, make sure that index maintenance is performed on a regular basis and that statistics are updated during index maintenance.
  2. Understand the type of data change. If data is written at the end of a table, there’s a chance it won’t be included in the statistics histogram (especially in the case of something like an order date). If a lot of data coming into the system is time sensitive, regular statistics updates will be necessary to keep things performing well.
  3. Make sure that AUTO_UPDATE_STATISTICS is enabled for all databases. Even if your jobs to update statistics fail to run, having AUTO_UPDATE_STATISTICS enabled will make sure that SQL Server updates your statistics as the data changes. Statistics will not be updated as often as we’d like, but it is better than nothing.
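The three steps above boil down to a handful of commands – a sketch, using AdventureWorks table and database names as stand-ins:

```sql
-- Sketch: keeping statistics fresh. Table/database names are placeholders.

-- 1/2. After a data load, refresh statistics on the affected table:
UPDATE STATISTICS Sales.SalesOrderHeader WITH FULLSCAN;

-- Or sweep the whole database for statistics that need refreshing:
EXEC sp_updatestats;

-- 3. Make sure auto-update is on as a safety net:
ALTER DATABASE AdventureWorks SET AUTO_UPDATE_STATISTICS ON;
```

FULLSCAN reads every row and gives the most accurate histogram, at the price of more I/O during maintenance; sampled updates are cheaper but coarser.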

Ordering Data

The ORDER BY clause exists for a good reason, right? Users don’t just need to get data back from the database, they want to get it back in a specific order. While that’s true, ordering data has a definite cost if the data isn’t already in the correct order. Take a look at this query:
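The original query isn’t reproduced here; a representative AdventureWorks stand-in that forces an expensive sort might look like:

```sql
-- Hypothetical stand-in for the original query (not reproduced here).
-- Ordering on a column with no supporting index forces a Sort operator.
SELECT  soh.SalesOrderID,
        soh.OrderDate,
        soh.TotalDue
FROM    Sales.SalesOrderHeader AS soh
ORDER BY soh.TotalDue DESC;
```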

When we look at the actual execution plan, sorting the results takes up 79% of the cost of the query. While this is a simple query, it’s easy to see how sorting can chew up a lot of CPU in SQL Server.

There’s no easy solution. However, many programming languages have the ability to use functional programming techniques to sort data in memory. The .NET framework makes it even easier by providing datatypes like the DataTable and DataSet. Both of these datatypes make it easy to sort data in memory in an application server. The idea here is that CPU is cheap on a Windows Server – we only pay to license the server, not each socket or CPU core. It is also much easier to horizontally scale the middle tier of an application than it is to horizontally scale SQL Server.

Functions

Functions are one of the biggest offenders when it comes to wasting SQL Server’s CPU time. The tricky part about functions is that they seem like an easy way to encapsulate and re-use functionality. While functions accomplish this goal, there are other aspects of functions that make them a bad choice for many activities.

Simple, single-statement table valued functions can easily be inlined by SQL Server. That is to say, SQL Server is able to include the function’s execution plan in the calling execution plan, where it can be optimized with the rest of the query. This can potentially lead to additional optimizations within the function. Multi-statement table valued functions and scalar functions won’t be optimized much of the time. Instead, SQL Server will repeatedly execute the underlying function – once for every row in the result set. Over very large result sets this can result in a considerable amount of CPU utilization. It’s often best to replace functions like this with inlined query logic. For a look at the behavior of functions in SQL Server, take a gander at SQL Server Functions: The Basics.
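The difference is easiest to see in the two function shapes side by side – a sketch against AdventureWorks, with hypothetical function names:

```sql
-- Inline table valued function (single statement): the optimizer can
-- fold this definition directly into the calling query's plan.
CREATE FUNCTION dbo.OrdersForCustomer_Inline (@CustomerID int)
RETURNS TABLE
AS
RETURN
(
    SELECT SalesOrderID, OrderDate, TotalDue
    FROM   Sales.SalesOrderHeader
    WHERE  CustomerID = @CustomerID
);
GO

-- Multi-statement version: same logic, but optimized separately with a
-- fixed 1-row cardinality estimate, and materialized into a table variable.
CREATE FUNCTION dbo.OrdersForCustomer_Multi (@CustomerID int)
RETURNS @Orders TABLE (SalesOrderID int, OrderDate datetime, TotalDue money)
AS
BEGIN
    INSERT @Orders
    SELECT SalesOrderID, OrderDate, TotalDue
    FROM   Sales.SalesOrderHeader
    WHERE  CustomerID = @CustomerID;
    RETURN;
END
GO
```

Both return identical results, but only the inline form lets the optimizer see through the function when joining it to other tables.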

Another downside to functions in SQL Server is that they frequently don’t get good statistics from the optimizer. When a table valued function returns data, SQL Server has no way to know (at compile time) how much data will be returned from the function. Instead it makes the assumption that there is only 1 row present in the result set from the TVF. This can result in poor join choice as well as bad data lookup decisions, just as if there were incorrect or missing statistics on an index or table.

Summing It Up

Keeping statistics up to date, moving ordering out of the database, and eliminating costly functions will go a long way to keeping your CPU utilization low. There are, of course, many other ways to combat high CPU, but these are some of the biggest culprits and some of the items that can be most easily fixed by database administrators and developers alike.


sp_Blitz® Updated to v5

SQL Server
6 Comments

Happy holidays! Santa SQL comes bringing tidings of cheer with a new sp_Blitz® – but also doom and gloom about configuration problems in your servers.  Here are some of the recent contributions from readers:

  • John Miner suggested tweaking checkid 48 and 56, the untrusted constraints and keys, to look for is_not_for_replication = 0 too.  This filters out constraints/keys that are only used for replication and don’t need to be trusted.
  • Ned Otter caught a bug in the URL for check 7, startup stored procs.
  • Scott (Anon) recommended using SUSER_SNAME(0x01) instead of ‘sa’ when checking for job ownership, database ownership, etc.
  • Martin Schmidt caught a bug in checkid 1 and contributed code to catch databases that had never been backed up.
  • Added parameter for @CheckProcedureCache.  When set to 0, we skip the checks that are typically the slowest on servers with lots of memory.  I’m defaulting this to 0 so more users can get results back faster.
  • Andreas Schubert caught a typo in the explanations for checks 15-17.
  • K. Brian Kelley added checkid 57 for SQL Agent jobs set to start automatically on startup.
  • Added parameter for @CheckUserDatabaseObjects.  When set to 0, we skip the checks that are typically the slowest on large servers, the user database schema checks for things like triggers, hypothetical indexes, untrusted constraints, etc.
  • David Tolbert caught a bug in checkid 2.  If some backups had failed or been aborted, we raised a false alarm about no transaction log backups.
  • Fixed more bugs in checking for SQL Server 2005. (I need more 2005 VMs!)
  • Ali Razeghi added checkid 55 looking for databases owned by <> SA.
  • Fixed bugs in checking for SQL Server 2005 (leading % signs)

Whew!  And there’s more to come – I’ve got another half-dozen improvements queued up that also require new web pages, so those will take a little while longer.

If you’ve already downloaded sp_Blitz®, you can run master.dbo.sp_BlitzUpdate to fetch the latest version from The Clouds.  Enjoy!


Consulting Lines: SQL Server Needs a Dog

Consulting Lines
30 Comments

SQL Server isn’t lonely, but it needs a dog.  In today’s Consulting Lines series post, I’ll share my very favorite consulting line of all – one that makes people laugh while illustrating a point.

The Situation: Bob the Button Basher

My Blitz script helps me quickly spot SQL Server settings that have been changed from the default values.  I always work with the client’s staff in the room, and when I see a changed setting, I turn to them and ask:

Me: “Hey, Bob, I notice that max degree of parallelism has been set to 1.  The default is zero.  Can you tell me why that was set to 1?”  (The key here is to not say anything whatsoever about what the setting does – let Bob explain it to you.)

Bob: “Yeah, I read a blog saying that all OLTP servers should have maxdop set to 1.”

Me: “Ah, yeah, I’ve read those too.  Did we measure the before and after impact of the change?”

Bob: “No, it just seemed like a good setting to change.”

Me: “Gotcha.  Funny thing there – SQL Server is like an airplane.  Modern airplanes are so advanced that there’s only two things in the cockpit: there’s the pilot, and there’s the dog.  It’s the pilot’s job to feed the dog, and it’s the dog’s job to bite the pilot if he touches anything.”

Bob: “HA HA HO HO”

Me: “The problem with SQL Server is that it doesn’t have a dog.”

Bob: “Wait, what?”

Me: “SQL Server has a lot of knobs in the cockpit, and there’s a lot of advice out on the web that tells us to push all kinds of buttons to go faster.  Thing is, SQL Server ships out of the box with a pretty good set of defaults.  Let’s use those defaults until we have a really good reason to change something, and when we do, we need to document what we change so we know how it improved performance or reliability.”

What That Line Does

SQL Server really does have a lot of knobs, but every setting – every single one of them, including the defaults – has drawbacks.  I salute Microsoft for doing a really good job of weighing the pros and cons and coming up with a solid set of defaults.

In my line of work, I flip a lot of knobs, but for every one I touch, I try to explain thoroughly what I’m doing.  I may casually say, “This server needs Optimize for Ad Hoc Workloads enabled,” but when I do it, I also explain what I’m seeing about that particular server’s workload that justifies the change.

Jonathan Kehayias (Blog@SQLPoolBoy) recently wrote an excellent article about Lock Pages in Memory, one of those settings people just love playing with.  Before he posted the article, we had a great discussion about why we have differing opinions on that recommendation.  I understand why he recommends using it as a default, but the reason I don’t recommend it is illustrated beautifully by the fourth comment on the post:

“With LPIM, SQLOS will back down its usage to your ‘min server memory’ value. At that point Windows can try and hard trim the other processes, and potentially page the process requiring the 30GB memory out, or if it can’t respond efficiently this way, you may face a crash due to Out Of Memory for the OS.” – Jonathan

Ouch.  I’m not a big fan of giving a default recommendation that might crash the OS.  There’s also a bug in SQL Server 2008 R2 that causes corruption with LPIM.  I understand why Microsoft doesn’t enable Lock Pages in Memory by default, and I agree with them – it’s just not a safe button to push if you aren’t intimately familiar with the consequences and how to monitor for them.

This consulting line helps establish whether the other person understands the real ramifications of pushing buttons.  If they’re blindly bashing buttons based on blogs, it helps them understand the risks.

What Happens Next

The success of this line depends on your ability to dive deep and explain the concepts.  You have to be able to explain the pros and cons of both settings – the default setting and the one they chose.  Jonathan’s post on LPIM is a great example – if you can be comfortable explaining a topic to that level of detail, then you should feel comfortable flipping that switch.  (If, on the other hand, you’re not even comfortable reading a post with that level of detail, then you shouldn’t be flipping that switch.)

Bob: “So should we leave maxdop set to 1, or go back to the default?  We have no CXPACKET waits, so it must be working.”

Me: “Well, that’s true – no queries are going parallel. But parallelism isn’t always a bad thing – sometimes we want to throw lots of CPU power at queries.  Right now, we’ve got queries that aren’t running as fast as they could because they’re not getting the CPU they need.”

Bob: “But the server isn’t CPU-bound – CPU % is usually under 10%.”

Me: “That’s because you’re measuring CPU use as a total, not per thread.  We’ve only got a few active queries at a time, but we’ve got a four-socket server with 10 cores per processor, and hyperthreading is turned on.  Our CPUs are sitting around bored.  If we let some queries go parallel, we’ll be able to use those CPUs more effectively.  Based on what I’m seeing in your procedure cache, I recommend setting maxdop to 4 and cost threshold for parallelism to 100.  Here’s the evidence that supports it, and here’s how we’ll measure the success of that change…”
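That recommendation, expressed as a script – a sketch only; the values 4 and 100 came from analyzing this particular server’s procedure cache, not from a universal default:

```sql
-- Sketch: the change recommended above for THIS workload.
-- Measure before and after, and document why you changed it.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

EXEC sp_configure 'max degree of parallelism', 4;
EXEC sp_configure 'cost threshold for parallelism', 100;
RECONFIGURE;
```

Both settings take effect immediately without a restart, which also makes them easy to roll back if the measurements don’t support the change.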

More of My Favorite Consulting Lines


Perk Up Your Career with the SQL Server Troubleshooting Checklist

10 Comments

When an application is offline or performing so badly that the users are complaining, what do you turn to?

When your cellphone says a major incident is ongoing and you can’t get to a computer, where do you send the team at the office?

When you’re putting out fires with whatever you can find, how do you record what steps you’ve taken and what information you’ve gathered?

You’re Only As Strong As Your Team

One major reason to put together a troubleshooting checklist is simple: you can’t always be working.

SQL Server Troubleshooting Checklist
What Would Brent Ozar PLF Do? (PDF)

If you’re the only person who knows how to use SQL Server in your office, you need to train someone. They don’t need to be an expert, they just need to be able to run through your checklist successfully and gather information. Find someone sensible, practical, and with a steady hand who you can practice the steps with, and tailor the checklist so it makes sense to your secondary.

If you work with a large team of SQL Server experts, you need a troubleshooting checklist just as much. As we all gain experience we all develop different ways of doing things. We each focus on different things and may interpret things differently.

The troubleshooting checklist gives your team consistency: it gives everyone a base process to make sure the major areas are covered.

You’re Only As Smart As Your Documentation

The other major reason you want a troubleshooting checklist is that all the rules change when things get really bad.

When a company is losing money, it’s hard to do things in a logical fashion. You’ve got four people at your desk asking all manner of questions from “we just bounced the Heffalamps services, is it OK now?” to “are the Wuzzles impacted by this problem?” Your boss’ boss keeps poking their head around your monitor saying, “I’m just checking in.” You’ve got 500 emails from your monitoring.

When this happens, you remember about 75% of the basic stuff to check. It’s incredibly easy to miss some obvious things, though. After all, you’ve needed to go to the bathroom for about forty-five minutes and still can’t leave your desk.

The troubleshooting checklist helps here in three ways:

  • You always hit a consistent list of things that are important to your business.
  • If someone misses a step when following up on a problem, you have a place to add a step to ensure the mistake doesn’t happen again. The troubleshooting checklist gives you a way to correct human error – that’s the secret to de-personalizing an embarrassing mistake and instead showing you’re a professional who follows up on errors and is in control of their process.
  • You have a place to record information which others can read. This helps you clear out that crowd from behind your desk! Save the checklist in a place where they can read it, and let them know they can see updates on your progress if they let you get going.

To Do: Give Our Checklist To Your Manager

There’s one person who really wants you to have a troubleshooting checklist, but they think of it in slightly different terms: an ‘Incident Management Response Process’.

Your manager would love to have a predictable response to problems. This helps them understand – and explain to others – the value you add to the company. It also helps them justify having someone who can be your backup when you’re not available. And it helps your manager show that you’re running an organized production environment with defined processes for keeping applications available.

Here’s how to handle this. Download our SQL Server Troubleshooting Checklist and give it a read. Think about a couple of things you’d customize for your environment.

Then show the checklist to your manager and say, “I think having a basic process like this would be helpful for our team. I’d like to lead a project customizing it for our applications. What do you think?”

According to Penelope Trunk, the path to promotion is shortest in December.

Now there’s a holiday score.

Check Out the Troubleshooting Checklist

Download the free SQL Server Troubleshooting checklist (PDF).


Coping with Change Control Video

1 Comment

Managing change is part of a DBA’s job. If you aren’t managing database changes effectively, though, disaster can strike. Jeremiah Peschka has worked with good and bad change management practices and even brought some awful ones under control. In this session, he introduces you to tools, tips, and techniques that will help you effectively handle changes in the database. If you need to deploy changes to the database, this session is for you.

This is a three-pronged attack covering techniques for gauging the scope of a change, tracking code and configuration changes, and coping with the inevitable 3:00 AM production hotfix. It’s aimed at development teams and DBAs who deploy both configuration and code changes to production servers.

https://www.youtube.com/watch?v=OthGRqibDqc

Change Control Scripts & Resources

Coping With Change Control T-SQL Scripts

 


My Thoughts on SQL Server 2012’s Licensing Changes

Licensing
67 Comments

A few weeks ago, Microsoft announced that SQL Server 2012 will no longer be licensed by the CPU socket, and will instead be licensed by the core.  You can read more at Microsoft’s recap.

After contemplating the changes and discussing them with client companies, I think of this as a few related announcements.

SQL Server is getting more expensive overall.

This shouldn’t come as a surprise to anybody: good things get more expensive over time, and crappy things get cheaper.  This change simply doesn’t affect me because I’ve never paid a single penny for SQL Server myself anyway, and I don’t plan to start now.  If I had to pay my own money for a database platform, I’d have switched to PostgreSQL long ago.

When I talked to my clients about this change, they all had one of two reactions:

  • “OMG, THIS IS UNBELIEVABLE!”  These people were more worried about their individual careers than their budgets: DBA managers wanted to know if SQL Server would start waning in popularity, thereby reducing the value of their skills.  (If the high salaries of Oracle DBAs are anything to go by, bring on the licensing price increases.)
  • “No big deal, but no new 2012 projects next year.”  These people understood that they were already licensed for their existing servers anyway, and only their ongoing maintenance fees would be impacted.  However, they immediately crossed Availability Groups off their list of 2012 projects, and that really sucks for me.  I’d been so excited about the ability to scale out with multiple read-only replicas, but the pricing just makes this a no-go for most of my clients.  Between the licensing changes and the traditional hesitation to deploy before Service Pack 1, 2012 is dead in the water for them.

Nobody decided to throw SQL Server out of the shop altogether, but some of them did start asking tough questions about their future projects.  It’s really hard to hire a good production SQL Server DBA right now (email us your resume if you’re looking in the Chicago, Los Angeles, or Portland areas, but no remote workers) so many of our clients are running understaffed.  One client said to me, “As long as I’m relying on outsiders for my database administration, what difference does it make whether it’s MSSQL, Oracle, DB2, Postgres, or the cloud?”

SQL Server used to be seen as the middle ground between expensive-but-awesome Oracle and free-but-limited open source.  Those days are gone – SQL’s pricing is higher, and open source platforms have gotten pretty darned good.

Limits kick in quickly on Standard Edition.

Standard Edition’s limits haven’t really changed significantly – but today’s hardware has, and Standard Edition isn’t keeping up.  If you’re struggling with application performance on Standard and you survive by throwing hardware at the problem, your options run out once you hit 16 cores and 64GB of memory.  At that point, you have to throw hardware and licensing at the problem by upgrading to Enterprise Edition.

I don’t think 16 CPU cores is really all that limiting.  CPU-intensive SQL Server queries tend to be the easiest ones for me to tune, and I’ve always argued that database servers aren’t app servers anyway.  The core-based licensing change just gives me more ammo to tell developers not to do string processing in the database server.

While 10-core Xeon CPUs are already available, they’re not typically deployed in 2-socket configurations.  You can technically buy a 4-socket box like the HP DL580 and populate it with two 10-core CPUs, but that config just doesn’t make sense for Standard Edition due to the high cost.  With Intel’s upcoming tick/tock roadmap, the next couple of Xeon families are still slated to be in the 6-10 core range, so I don’t think the 16-core limitation is going to have a dramatic impact in 2012/2013.

The 64GB memory limitation, on the other hand, is frustratingly small given today’s memory prices.  32GB of server memory runs around $1,000, and memory can hide a lot of sins.  Standard Edition just doesn’t let you hide sins – you’re forced to spend manpower to keep tuning applications, and unfortunately, that’s not an option with third party applications.  I do a lot of work with independent software vendors (ISVs), and they’re frustrated that their customers can’t just buy $1,000 worth of memory to get awesome performance quickly.

There’s an opening for DBAs who love performance tuning.

It’s just you and me here, so let’s be honest: I make money when companies are in pain, and SQL Server’s licensing changes will inflict some pain.  Companies can’t just throw more CPU or memory at the problem anymore without writing a big check to Microsoft for additional licensing.  As a consultant, I can say, “I’ll fix that problem for less money than Enterprise Edition costs, let alone the cost of a server with more CPU sockets.”

If Microsoft had raised the 64GB memory limit, then companies could afford to mask problems longer by throwing memory at the problem.  They can’t, so I win.

The bummer for me is that I just can’t make the math work for scaling out with Availability Groups now.  I was really, really hoping for a lower-tier license cost for the read-only replicas (after all, they only support a subset of the work) but the cost to run a 5-node Availability Group is staggering.  Even if we use just 2-socket, 6-core servers across the board, that’s $412,440 of licensing – simply unthinkable, even if we throw in discounts.  It kills me, because I’m a huge fan of this feature, but right now companies have to be under tremendous pain in order to write a check that large to scale one application.  It’s just easier to tune SQL Server, and that’s where I come in.
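Here’s the math behind that number, sketched in a few lines of Python: 5 nodes × 2 sockets × 6 cores is 60 cores, and the total implies the widely reported Enterprise Edition list price of about $6,874 per core (check current pricing before quoting anyone):

```python
# Back-of-the-envelope licensing math for a 5-node Availability Group
# under per-core licensing.  The per-core price is the announced
# Enterprise Edition list price at the time of writing.
NODES = 5
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 6
PRICE_PER_CORE = 6874  # USD, Enterprise Edition list price

total_cores = NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
total_license_cost = total_cores * PRICE_PER_CORE

print(f"{total_cores} cores -> ${total_license_cost:,}")  # 60 cores -> $412,440
```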

It’s hard to be upset about that.


What the Users Want: Ad Hoc Reporting

SQL Server
10 Comments

Eventually we all run into a situation that makes us cringe: the users want to design their own reports. If this were something as simple as giving users a color picker and some visual layout tools, that would be one thing. Most of the time it’s more complex. Business users know their problem domain very well; they want to be able to view widget sales for the last three quarters broken up by sales person, location, and four other factors that you couldn’t possibly dream of, but you know you have the data for. What are the best options? How do you get the job done?

Dynamic SQL

It’s no secret that I’m a fan of dynamic SQL, and when you’re building ad hoc reports, dynamic SQL can be the quickest way to get the job done. For those of you unfamiliar with the term, dynamic SQL is SQL that is built in the database in response to inputs from some outside source. Dynamic SQL is frequently used to choose between two separate code paths. For example:
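(The original snippet isn’t reproduced in this copy; as a stand-in, here’s a minimal Python sketch of the idea – the table names and the 90-day cutoff are hypothetical:)

```python
from datetime import datetime, timedelta

# Hypothetical sketch: pick the table to query based on the age of the
# data, then build the SQL string.  In real code, pass values like @AsOf
# as parameters rather than splicing them into the string.
def build_sales_query(as_of: datetime) -> str:
    if datetime.now() - as_of > timedelta(days=90):
        table = "dbo.SalesArchive"   # older data lives in the archive table
    else:
        table = "dbo.SalesCurrent"   # recent data stays in the hot table
    return f"SELECT OrderId, Total FROM {table} WHERE SaleDate >= @AsOf;"
```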

Depending on the age of the data, the code reads from one of two tables. The idea, of course, is to keep our implementation invisible to the user. That alone, though, doesn’t tell us how we can use dynamic SQL to respond to ad hoc queries from users.

To use dynamic SQL to respond to user reports, the reports must meet some pretty strict requirements – we need to be able to base all of our ad hoc reports on a stable core query. “Stable core query” is just my way of saying “all of these reports join to the same base set of tables.” This makes it easy to dynamically add join conditions to the end of the FROM clause. In effect, we’re just shoving strings together in the database to get data back to the users.
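The string-shoving pattern looks something like this (sketched in Python rather than T-SQL, with made-up table and column names – the point is the whitelist, never raw user input):

```python
# Hypothetical sketch of the "stable core query" pattern: every report
# starts from the same base FROM clause, and user choices append joins
# from a fixed whitelist.
CORE_QUERY = ("SELECT s.OrderId, s.Total FROM dbo.Sales AS s "
              "JOIN dbo.SalesPeople AS sp ON sp.Id = s.SalesPersonId")

OPTIONAL_JOINS = {
    "location": " JOIN dbo.Locations AS l ON l.Id = s.LocationId",
    "product":  " JOIN dbo.Products AS p ON p.Id = s.ProductId",
}

def build_report_query(requested: list[str]) -> str:
    sql = CORE_QUERY
    for option in requested:
        sql += OPTIONAL_JOINS[option]  # a KeyError rejects unknown options
    return sql + ";"
```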

Unfortunately, dynamic SQL doesn’t always produce the best results. It’s difficult to build true ad hoc querying capability with dynamic SQL. You can give users a reasonable facsimile of ad hoc querying by providing many canned choices, but creating this SQL can be time-consuming and error-prone, and it ultimately becomes difficult to maintain. The more options users have, the more complex the T-SQL becomes. Clearly this pattern doesn’t work for large applications.

Lucene and SOLR

Apache Lucene is an open source search engine written in Java. SOLR adds functionality like document indexing (PDF, Word documents, etc.), database integration, geospatial search, and faceted search. Both products are free and scale independently of SQL Server. Another benefit is that Lucene and SOLR can be fed documents to index in many formats – XML, JSON, PDF, or Word docs are all acceptable. Developers merely have to point the documents at SOLR and they will be indexed.

If you think this sounds an awful lot like full text search, you’d be right. Lucene and SOLR share a lot with SQL Server’s full text search feature. One of the big differences is that they exist outside of the database. Another difference is that you can query SOLR directly and return all of the fields that you want without having to read from tables in a database.

What’s more, SOLR has a very fluent search syntax that makes it possible to run many types of complex queries. SOLR’s rich syntax makes it easy to construct complex queries to match user needs without having to worry about writing SQL that can perform the underlying search. Finding sales data for the third quarter of the last five years where the sale contained headphones would look like this: items_sold:headphones sale_date:[NOW-5YEARS/DAY TO *] quarter:3. It’s certainly not something that business users could write, but neither is SQL.
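An application would typically URL-encode that query and send it to SOLR’s select handler over HTTP. A minimal sketch (the host, port, and core name here are hypothetical):

```python
from urllib.parse import urlencode

# Hypothetical sketch: turn the headphone-sales search from the text
# into a SOLR select URL.  Only q is the query itself; wt and rows ask
# for JSON output and cap the result count.
query = "items_sold:headphones sale_date:[NOW-5YEARS/DAY TO *] quarter:3"
params = urlencode({"q": query, "wt": "json", "rows": 50})
url = f"http://solr.example.com:8983/solr/sales/select?{params}"
```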

OLAP

I’m not a huge fan of OLAP databases because I fear all things I don’t understand.

Okay, that last paragraph is a lie. SQL Server Analysis Services is a fantastic way to build crazy reports, the likes of which the world has never seen. I hadn’t even thought of this mechanism until I talked with a client recently and they told me how they deliver ad hoc reporting. New or modified data is marked with a timestamp. The data is periodically fed into an Analysis Services cube – right now, once a day. The users get ad hoc reporting capabilities through the cube. The data currently isn’t real time enough, but cube processing is fast enough on the current hardware that the frequency could be increased dramatically.

Here’s the best part: many different tools can be pointed at the cube to make this happen. Right now, users are generating reports with Excel and pivot tables, but nothing says that additional front ends can’t be stood up. Excel, Power Pivot, and SQL Server Reporting Services all make it easy for users to create their own reports and get to data in ways that the business can use.

What’s Your Secret?

There are as many ways to implement ad hoc reporting as there are developers implementing it. What are some of the ways you’ve seen ad hoc reporting implemented? Sound off in the comments.


Read-Only View Permissions and Roles in VMware vSphere

SQL Server, Virtualization
11 Comments

When Ronald Reagan said, “Trust, but verify,” he was talking about DBAs working with their VMware sysadmins.

For the record, he didn't really trust the Soviets either.
"Trust, but verify." Ronald Reagan

As a consultant, I get to see a lot of SQL Server implementations – both successes and failures.  The successes have one thing in common: transparency.  The database administrators, VMware admins, and storage admins have clear, open discussions about the way their respective systems are configured.  They give read-only access to other teams so everyone can double-check to make sure everything is working well.  After all, these teams share a common goal – fast, reliable applications for end users.

You can easily get read-only access to the VMware vSphere Client, the tool VMware admins use to manage your virtualization environment.  Here’s how:

1. Walk over to your sysadmins and show them this blog post.

This is the hardest step in the entire process.

The rest of the steps are simple and predictable: click a button, get a result.  This step, however, involves complex, unpredictable meatbags.  Soft skills are the hardest to master, and finessing this is outside the scope of this particular article.  Start by brushing up with my Consulting Lines series.

You need to walk over there because you’re asking for access into their domain.  This implies that you don’t trust them.  You do, but like Reagan said, you just need to verify.

2. Open VMware vSphere, Home, Hosts and Clusters.

The vSphere Client is a thick front end that connects to Virtual Center, a service that keeps an eye on your hosts and guests.  In the vSphere Client, click Home, Hosts and Clusters.  Here’s a screenshot of what you’ll see:

Hopefully yours won't say Lab.
VMware vSphere Client

On the left side, click on the thing you want access to – which brings up the first question for your VMware admins: what do you need read-only access to?  Your VMware sysadmins may have built a separate cluster just for SQL Server use, or your VMs may be intermingled with other virtual machines.

On the right side, click the Permissions tab, right-click, and click Add Permissions.

3. Add the DBA domain group.

In the Add Permissions popup, click Add.  In the dropdown for domain, choose your Active Directory domain, and in the Users and Groups dropdown, click Show Groups First.  Your screen will look like this:

Active Directory Groups in VMware

In my lab, I’ve created an Active Directory group called “Database Administrators” that includes me, Jeremiah, Kendra, and Tim.  That way, anytime I need to grant additional permissions for the DBAs (like when I build a new SQL VM that they’ll need to administer), I can just use that domain group instead of individual user accounts.

Click your own DBA group’s name, click Add, and click OK.  On the right side of the Assign Permissions window, the Read-Only role will be selected by default, and that’s fine – click OK.  Presto, you’re in.

If you click on child objects in the left tree (like one of your SQL VMs) and click the Permissions tab, you’ll notice that your read-only permissions have propagated to this object too.

4. Install the VMware vSphere client on your desktop.

The client is a free download from VMware.com as part of other products, but the easiest way to get it is to simply point your web browser at one of your in-house VMware hosts.  In my lab, here’s what that looks like:

Downloading vSphere From Your VMware Host

Click the Download vSphere Client link and the installer will come flying through the tubes.  Yes, the installer includes Visual J#, and yes, I laugh at that too.

5. Launch the vSphere Client and start poking around.

After the installation finishes, double-click the VMware vSphere Client icon.  You’ll need to enter the name or IP address of your Virtual Center server – ask your sysadmins for that.  Check the box labeled “Use Windows session credentials,” and thanks to the magic of Active Directory, you’ll be poking around in virtualization in no time.  (Actually, it takes about 30 seconds to launch the dang thing.)

Want to Learn More?

Check out Virtualization, Storage, and Hardware for the DBA. It’s our 5-hour training video series explaining how to buy the right hardware, configure it, and set up SQL Server for the best performance and reliability. Here’s a preview:

https://www.youtube.com/watch?v=S058-S9IeyM

Buy it now.


SQL Server App Store Reviews

Humor
21 Comments

If SQL Server cost $4.99 in the App Store, what would the reviews look like?

Not as good as the Access app – **
By B. Analyst, December 6, 2011
I don’t understand what all the fuss is about.  The GUI in this app sucks.  It’s hard to build reports, and it’s overpriced.  $4.99?!? Save your money and get the free Access app instead.

Secret Performance Hints – *****
By JuniorHacker, December 5, 2011
Go into Settings and check the box for Priority Boost.  If you’re running low on space, check the Auto-Shrink box for each database.  When you need to beat the final level, hit up-up-down-down-left-right-left-right-B-A-Start.

No Twitter integration? – *
By Teenagur, December 5, 2011
C’mon, Microsoft, get with it, this app is so Web 1.0, I can’t share my favorite tables with my friends, I can’t post pictures of my schema on Twitter, WTF?!?!??!

THIS GAME SUXXXX – *
By l337, December 4, 2011
adventureworks is boring why anybody pay for apps? stupid

Best Database Ever – *****
By MVP99, December 4, 2011
I’ve built hundreds of rock-solid solutions with this app. My customers love me, and it keeps my bills paid. Thanks, Microsoft!

ADS ARE DECEPTIVE – *
By SingleGuy18, December 3, 2011
App description said it had highly available bits and models on tables, but there’s no pictures of hot chicks or anything, just a bunch of lines and squares.  Wasted that $5.

Good, but needs better graphics – ***
By Reports4Me, December 2, 2011
The graphics in this haven’t really changed since v7.0, and there’s no easter eggs.  It needs like a hot coffee mod or something.

Works for me – *****
By I. Ben-Gan, December 1, 2011
Works mostly as advertised, although it’s not clear whether some behaviors are a bug or a feature.

HELP!!! ASAP!!! – ***
By panic42, November 30, 2011
MY PHONE CRASHED AND I NEED MY DATA. CAN’T FIND MY BACKUPS. PLEASE CALL ME!!!


SQL Server First Responders Kit Video

7 Comments

When your SQL Server is critically injured you need a first responder kit to help you diagnose the problem and apply emergency aid. In this session, Kendra Little introduces you to invaluable tools and techniques for triaging an emergency. If you have one year’s experience with database administration, this half-hour session will set you up to triage like a pro.

Want a list of links referenced in the video? Scroll on down!

https://www.youtube.com/watch?v=V8JEenuz1SM

Links from the video:


Brent’s 2012 Conference Schedule

#SQLPass
5 Comments

I feel like the luckiest guy in the world: I get to take part in six major SQL Server events in 2012.  Here’s where I’m going and why:

January 26-Feb 4: SQLCruise Miami

SQLCruise is the highlight of my year.  This year, we’ve transitioned SQLCruise completely to Tim Ford – he owns the event start to finish.  I’m really excited because I get to sit back and enjoy the event a little more, and Tim gets to do what he truly shines at: bringing good people together to learn and bond.

I’m doing three all-new in-depth sessions on SQLCruise Miami: Procedure Cache BI, Scaling SQL with Solid State, and SAN Features for Senior DBAs.  All three are targeted at DBAs who have to make applications go faster.  You need 3-5 years of experience with SQL Server, and you should already be familiar with the basics of reading execution plans and designing indexes.

I like SQLCruise because it’s a limited number of people together for an extended period of time.  I get to know every single attendee, hear about their challenges at work, and have plenty of one-on-one time for design questions.  Some attendees have started bringing their laptops armed with a SQL Server instance, a copy of their production database, and their toughest queries or design problems.  I love discussing this stuff for hours over margaritas.

Trust me, this is the cheapest way to get the most honest consulting, and it’s not just from the presenters – it’s also from your fellow attendees.  Learn more at SQLCruise.com.

Feb 28-Mar 2: MVP Summit Redmond and SQLSaturday Redmond

The annual international MVP shindig is the coolest benefit of being a Microsoft MVP.  We have to pay for our flights to Seattle, but once we’re there, Microsoft picks up pretty much everything else.  We get insider NDA sessions, and given the timing of the event below, I’m hoping we’ll get insider content on SQL Server 2012 at the MVP Summit this year.  The training really isn’t my favorite part – I just like getting the chance to spend more time with my favorite community and Microsoft people.  There’s no substitute for face time with these folks.

The public isn’t invited to the MVP Summit, but the good news is that the MVP Summit will bring a ton of great speakers to SQLSaturday Redmond!  This might be the best SQLSaturday all year.  I try to work in SQLSaturdays whenever it matches my travel schedule, and this one fits perfectly.  I bet lots of out-of-town MVPs will make a similar decision.

March 25-31: SQL Connections Las Vegas

This Connections is billing itself as a SQL Server 2012 launch event, and it just so happens that the conference date exactly lines up with the next SQLBits conference too.  Does this mean we’ll see SQL Server 2012 released here?  Who knows – it’s tough for conference planners and software vendors to get all the stars to line up simultaneously.

Connections feels like the most training-oriented of the conferences: it focuses on quality, not quantity.  There are fewer simultaneous sessions, but the presenters tend to be polished professionals who are accustomed to being onstage.  (Presenters get free registration, hotel, and a fee per session.)  It lacks the rowdy, party, community feeling of the PASS Summit, but there’s clearly an audience for a more traditional learning event.  I like both events for different reasons – I meet more friends at the Summit, but I make more consulting business connections at Connections.

I’m doing a few sessions at Connections, but the official itinerary isn’t out yet.  The only one I can announce for sure is my all-day Virtualization and SAN Basics for DBAs session, which will be a $399 post-con session on March 30th.  Registration is open now.

May 24-June 2: SQLCruise Alaska

I’m doing the same presentations that I did on SQLCruise Miami.  Neither cruise is more junior or senior than the other – it just boils down to whether you’d like to see the islands or the icebergs.

There’s one big difference, though: in Alaska, I teamed up with Jeremiah, Kendra, and a friend of ours to get the Garden Villa suite.  SQLCruisers will all get to hang out with us here:

Look like fun?  Book yourself a cheap inside room (they’re as low as $679 per person right now) and register for SQLCruise.

Fall: SQL Connections Las Vegas

Sometime in the fall, I’ll be returning to Vegas for another round of Connections.  The dates, exact location, and session agendas are still to be determined.

November 6-9: PASS Summit Seattle

The PASS Summit is the biggest event of the year in terms of quantity: thousands of SQL Server professionals from all over the world gather together to talk shop.  It’s a zoo of nonstop activity, a whirlwind of tweeting and meeting that goes by in a blur.  I love getting the chance to meet up with some of my best friends for the briefest of moments.

There are at least a dozen sessions going on simultaneously at any given time, but as big as the event is, there’s fierce competition for speaking spots.  Speakers get free registration (roughly a $1,500 value), so speakers from all over the world vie for spots.  The call for sessions hasn’t gone out yet, so I have no idea whether I’ll be accepted, but I’d pay out of my own pocket to go even if I didn’t get the chance to speak.

And I Wish I Could Hit More!

The good news – heck, the GREAT news – is that I’ll be hitting so many major conferences next year.  I pinch myself when I think about this because it just seems too good to be true.

The bad news is that I’d love to do more: SQLBits, TechEd, a few CodeCamps, storage conferences, and VMworld.  But with just the events above, I’m already away from home for six weeks.  To keep a good work/life balance, I try to travel only one week per month, and that leaves me just six more weeks of travel in 2012.  Since I like to eat and pay rent, I gotta use those other six weeks for client projects to make moolah. <sigh>  Decisions, decisions.


3 SQL Server Book Review Short Takes

Book Reviews
13 Comments

Here’s what I’ve been reading lately on the SQL Server front.  I’d like to thank all three authors and publishers for providing these books for review.

Securing SQL Server by Denny Cherry
Paperback | Kindle

Denny’s a jack of all trades, and a Microsoft Certified Master of SQL Server to boot.  I’ve had the pleasure of knowing Denny for a few years, and he’s on the short list of people who I’d trust with my server password list.  (I wouldn’t trust him with my laptop password, though, because he’s got a wicked sense of humor.)  When Denny first told me he was writing a security book by himself, I asked him why – it seemed like such a dry topic.  Amazingly, Denny was able to bring this material to life in a friendly, readable way.

It’s easy to recommend this book for companies that store financial, medical, or personally identifiable data in their databases, but going beyond that, I think every company with a multi-DBA team should have a copy of this book on their bookshelf.  A lot of my clients ask questions that are answered well in this book, and this book is way cheaper than a HIPAA/PCI/SOX audit.

If you’re a solo DBA at a shop, I probably wouldn’t recommend this book, though.  The topics covered in this book take time to implement, and they’re usually beyond the scope of what a single person has the time to do when they’re firefighting.  Yes, I wish all our databases were secure, but solo DBAs probably need to start with more basic security concepts such as how to configure service accounts, and that’s not covered here.  This focuses on senior DBA level knowledge.

Troubleshooting SQL Server by Jonathan Kehayias and Ted Kreuger
Paperback

Jonathan and Ted are forum monsters: they’re constantly patrolling user questions looking for ways they can help.  As a result, they’re well-versed in the typical problems DBAs face and the best ways to solve those problems.  They’ve compiled the results into a book.  The book reads like a very polished forum answer: if you’ve got a CPU problem, turn to Chapter 3.  You’ll learn what tools to use to diagnose the issue, the most common problems, and the best solutions.

Jonathan was one of the coauthors on Professional SQL Server 2008 Internals and Troubleshooting (Paperback | Kindle), and at first glance, it might sound like those two books are similar.  This book is different because it takes a problem-and-solution approach, whereas our Pro 2008 Internals teaches the reader about internals first, then explains troubleshooting tools you can use to look under the hood.  I’d say that Troubleshooting SQL Server is a faster approach to getting ‘er done, and I’m adding it to my list of recommended books for SQL Server DBAs.

SQL Server 2008 R2 Administration Cookbook by Satya Shyam K Jayanty
Paperback | Kindle

Lemme just start out by saying that I don’t like posting negative reviews.  I first received a review copy of this, marked it up, and emailed the publisher and author with a list of questions.  Despite a long dialog, I never got answers to the technical issues, so I tossed the book in the closet.  I wouldn’t have posted this review, but I noticed a disturbing number of five-star reviews for the book on Amazon.  It wouldn’t be fair for me to let people spend their money on this, because by buying it and following the advice, they’re hurting their SQL Servers.

Here’s a few examples of the advice:

  • P335 – in a section on designing maintenance tasks in a mission-critical environment, the reader is instructed to check maintenance plans for reorganize indexes, rebuild indexes, and update statistics.  This is the worst approach possible for a mission-critical environment: it will do all three tasks against every index in the database, every time.  If we absolutely had to use this approach, we could accomplish the same thing by simply rebuilding all indexes, but again, even that is a bad idea in a mission-critical environment.  It even tells the reader to create separate schedules, but doesn’t say how or when these tasks should run.
  • P361 – “Place transaction logs on RAID5 rather than RAID1+0 for sequential workloads.”  Not true, as Microsoft documented in Table 1A of this doc.
  • P360 – in a section on building a scalable server: “…edition-wise it must be either DataCenter or Enterprise edition to host production data.” What, Standard isn’t good enough for production?
  • P70 – in the wait stats troubleshooting query, useless wait stats aren’t filtered out.  The example screenshot shows that the server is bottlenecked by SQLTRACE_INCREMENTAL_FLUSH_SLEEP, XE_TIMER_EVENT, and FT_IFTS_SCHEDULER_IDLE_WAIT, none of which are wait events that a DBA should spend time investigating.  Contrast this with the well-thought-out, production-tested queries in Troubleshooting SQL Server, and it’s pretty disappointing.
  • P360 – “Configure the storage and memory L2 cache per CPU has a minimum of 2 MB to improve memory.”  There are so many grammatical and technical errors here I don’t even know where to begin.
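For contrast, here’s a minimal sketch of the kind of filtering a wait stats query should do. The exclusion list below is illustrative, not exhaustive – a production-ready query would filter out many more idle and background wait types:

```sql
-- Sketch: top waits by total wait time, with a few benign wait types excluded.
SELECT TOP 10
       wait_type,
       wait_time_ms,
       waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (
      'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
      'XE_TIMER_EVENT',
      'FT_IFTS_SCHEDULER_IDLE_WAIT',
      'LAZYWRITER_SLEEP',
      'BROKER_TASK_STOP',
      'SQLTRACE_BUFFER_FLUSH')
ORDER BY wait_time_ms DESC;
```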

To top it all off, many of the book’s sample queries simply don’t work – keywords are mashed together with no spacing, like page 69’s “SELECTTOP” and “CROSSAPPLYsys”.  Ouch.

Needless to say, I don’t recommend adding this book to your collection, but if you’re itching for holiday reading, check out my recommended SQL Server book list.


Three Things That Rock About SQL Server

3 Comments

When something remarkable is around for a while, it becomes easy to take it for granted.

Today, I’ll share three awesome things about SQL Server with tips to give you an edge when using each one.

#1. The Free Online Documentation is Mind Blowing

SQL Server is a huge product. We’ve got the database engine, Analysis Services for cubes, Integration Services to process and transform data, and an increasing set of other complex products for working with data, all documented in Books Online.

Tip: When using Books Online, check the Community Content at the bottom of the page first for any errors or questions others have raised— this can save you time and trouble if an important detail is missing or in error. Check other versions of the same page using the link at the head of the topic, and always read critically. Even encyclopedias can’t be perfect!

Microsoft’s documentation doesn’t stop with Books Online. Microsoft also publishes in-depth whitepapers for SQL Server. I recommend you periodically review the list to make sure you know what’s available, then set aside time to read the topics which are most useful to you.

We also have sites where community members publish high quality documentation for SQL Server. The SQL Server community is so vast that it’s impossible to mention them all. Three of my favorite sites for technical documentation are:

Tip:  When you see unusual behavior in SQL Server, search Microsoft Connect. This is where users report bugs and suggestions for the product. Make sure you log in to search: external search engines typically won’t lead you to what you need to find.

#2. Lots of Ways to Find Out “What’s Happening Now?”

We have many ways to check out what’s currently happening in SQL Server. The instrumentation for the product is very well developed and gives a lot of options to see what’s processing at any given time.

The top four ways I check to see what’s happening are by using:

  1. Dynamic Management Objects – We can find out an awful lot with queries— and more with each version of the product;
  2. SQL Trace – Our old friend, sometimes used with SQL Server Profiler;
  3. Extended Events – The new, leaner and meaner hotness for tracing: SQL 2008 and higher;
  4. Perfmon Counters – Windows and SQL Server specific counters.
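As a quick sketch of the first technique, a query like this shows what’s executing right now (the column choice here is just a starting point):

```sql
-- Sketch: currently executing requests and their query text.
SELECT r.session_id,
       r.status,
       r.command,
       r.wait_type,
       t.text AS query_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID;  -- skip our own session
```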

Each of these techniques has its own strengths, and Extended Events is becoming increasingly powerful. Together, all of these methods provide a vast array of information about what’s happening in and around SQL Server.

Gathering data with each technique has its own cost. It takes experimenting and research to know what you can get away with and where you need to hold back.

Tip: When it comes to finding out what’s going on in production, treat your investigation like a database change. That means you should test your method against another environment, even if you can’t reproduce the issue there. Always think about how your method of investigating may be impacted by increased load, and make sure you have a way to monitor its impact.

#3. Tools, Tools, Tools

SQL Server ships with rich, user-friendly tools.

SQL Server Management Studio covers a lot of ground with different features. It helps us navigate a given installation and also develop Transact-SQL. It has built-in reports to help describe what’s going on at instance and database levels.

It has sqlcmd mode if you prefer a different kind of scripting and want to interact more with Windows. You can use Object Explorer Details to select multiple items like Windows Explorer. We can view Graphical Execution Plans. Or, we can just run Transact-SQL statements in multiple windows.

Tip: SSMS lets you choose to generate a script for almost every action you want to do rather than just executing it through the GUI. This feature is just plain awesome. Even if you choose to execute a change through the GUI, I recommend you always script out the command and save it off as a record of your change.

Want more SSMS tips? Check out Jes Schultz Borland’s blog post Tips and Tricks to Make SQL Server Management Studio Awesome.

What’s Cool About These Things Together

SQL Server may be complicated, but we have a lot of ways to work with it and to learn more about it.

As the SQL Server product grows, the tools and increasingly improved instrumentation allow us to understand the product more deeply. Books Online is just a launch board – the tools and the methods we have to see what’s going on in SQL Server help us take off from there.

Together with community members who want to share knowledge, this creates an interested, invested group of technologists who write about what they’re learning. And that’s something we really shouldn’t take for granted.


Virtualization and SAN Basics for DBAs Video

SQL Server
4 Comments

These two technologies can make a very big — and very bad — difference in how your SQL Server performs. Wouldn’t it be great if you could get the real, honest lowdown from a virtualization administrator, a SAN administrator, and a DBA? Wouldn’t it be even better if one person had done all three, and could give you the pros and cons of each point of view?

In this one-hour session, I explain how virtualization changes CPU, memory, and monitoring, and I show how to get specific recommendations for your make & model of SAN:

https://www.youtube.com/watch?v=V38bzhrfTHU

The links I discuss in the video are BrentOzar.com/go/san and BrentOzar.com/go/virtual.


Third Normal Form is Snake Oil

13 Comments

Step right up, ladies and gentlemen, and I will sell you the solution to all of your database needs. That’s right, it’s Doctor Codd’s Third Normal Form, guaranteed to cure all ailments of the schemata, pragmata, and performata. Doctor Codd’s Form will make your data performant and compressive. Accept no substitutes or imitators; Doctor Boyce’s lamentable attempts cannot soothe your aches and pains like Doctor Codd’s Third Normal Form. Why, with just a simple application of Doctor Codd’s Third Normal Form, thrice daily, you’ll be jumping around to the tune of normal forms and transactions in no time!

Sound Familiar?

Anyone pushing a single idea is pushing snake oil, plain and simple. They’re selling you a warm and fuzzy feeling that you’ll make your problems go away by following their simple prescriptions. Deviation from the remedy will, of course, result in problems, failure, and potentially phlebitis.

Can I Cure You?

No, I can’t. Well, I can, but I’m not going to. Not yet, at least. You need to pay attention, first.

The Forms, Both Magnificent and Normal, Are Not A Panacea

Slavish adherence to normalization is bad for your health. There are as many reasons to not normalize data as there are reasons to normalize your data. Don’t believe me? What if I asked you to design a database to persist items that we might sell in a store?

It’s easy, at first, to design an items table with a few columns to describe the main properties that we want to persist about an item in our store. Problems begin when different departments in the store need to save different properties, and different parts of our IT systems need different views of the data. While adding a column is trivial on small databases, adding a column in a large database is decidedly non-trivial. Eventually the design boils down to an items table and an item_properties table, and at that point the database becomes nearly impossible to query reasonably.
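To make that concrete, here’s a hypothetical sketch of where that design ends up (all table and property names invented for illustration). Note how even fetching two properties per item forces a pivot:

```sql
-- Hypothetical end state: one generic property table for everything.
CREATE TABLE items (
    item_id INT PRIMARY KEY,
    name    VARCHAR(100)
);
CREATE TABLE item_properties (
    item_id        INT REFERENCES items (item_id),
    property_name  VARCHAR(50),
    property_value VARCHAR(255)
);

-- Every report now has to pivot rows back into the columns we wanted all along.
SELECT i.name,
       MAX(CASE WHEN p.property_name = 'weight'     THEN p.property_value END) AS weight,
       MAX(CASE WHEN p.property_name = 'ship_class' THEN p.property_value END) AS ship_class
FROM items AS i
JOIN item_properties AS p ON p.item_id = i.item_id
GROUP BY i.name;
```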

A Solution Most Surprising

We can solve this problem a few ways, but with Microsoft’s Hadoop announcements, it makes sense to look at what the non-relational world can offer. HBase is a real-time column-oriented database that runs on top of Hadoop.

HBase is helpful for modeling dynamic properties because of its flexible data model. While HBase does have tables, rows, and columns, there are some powerful differences. HBase’s columns are split up into column families – these are logical groupings of columns. Columns can be added on the fly once a column family has been created.

Jumping back to our example, instead of modeling an items and item_properties table, we can create an items table and create column families to store properties specific to a department or for a common purpose. Rather than create many tables, we can add a shipping_info column family, an accounting column family, and a sales_promotion column family. Over time this flexible data model can be used to populate reporting tables in an enterprise data warehouse. Rather than focus initial efforts on building a robust general purpose schema in an RDBMS, it’s easy to create a flexible schema in HBase and pull out the data we need for reporting at a later time.
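In the HBase shell, that model looks something like this sketch (names carried over from the example above; this assumes a running HBase cluster):

```
# Create the table with one column family per department.
create 'items', 'shipping_info', 'accounting', 'sales_promotion'

# Columns appear on the fly within a family - no ALTER TABLE needed.
put 'items', 'sku-1001', 'shipping_info:weight', '2.5'
put 'items', 'sku-1001', 'accounting:cost_center', 'retail-07'

# Read back everything stored for that row key.
get 'items', 'sku-1001'
```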

A Final Commentary on Data

Denormalization doesn’t have to be a dirty word. There are many reasons to denormalize data. Ultimately, the process of shredding data apart should depend not on blind adherence to the principles of normalization but on the needs of the applications that consume the data. If you have a log file processing application, does it make sense to read log files from disk into a relational database? Shredding every log entry into multiple columns doesn’t make sense when log files are only infrequently processed and used to produce aggregations.

Even when you eventually need to query the log file data, there are tools suited to performing SQL-like operations across flat files. Hive provides a SQL-like querying layer on top of the Hadoop framework, making it possible to run bulk queries across large volumes of data stored in flat files and spread across many servers.
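As a rough sketch (the table name and HDFS path are made up), Hive can lay a schema over raw log files and query them in place:

```sql
-- Hypothetical: expose raw web logs sitting in HDFS as a queryable table.
-- EXTERNAL means Hive reads the files where they are; nothing is loaded or copied.
CREATE EXTERNAL TABLE web_logs (log_line STRING)
LOCATION '/data/logs/web/';

-- Run an aggregation across every file under that path.
SELECT COUNT(*) AS error_count
FROM web_logs
WHERE log_line LIKE '%ERROR%';
```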

Know how data is used; know the problem that the business wants to solve. Let the principle of consumption drive the structure of your information. You will thank me, some day, for freeing you from the false rigor of normalization.


Local Backups Are The New Offsite Backups

Backup and Recovery
3 Comments

Years ago, Microsoft’s Jim Gray said that memory is the new disk, and disk is the new tape.  (You can read more in his 2006 “Tape is Dead, Disk is Tape, Flash is Disk” PowerPoint presentation or the Google Quick View.)  Many of the things he predicted have come to pass, and it’s time we explore this a little more because it has so many implications for today’s database administrator.

When I first started in database administration, I did regular fire drill restores because I was paranoid.  (Chalk that up to dropping a table when I was a teenager working with SCO XENIX.  Damn, I feel old.)  As often as I could, I’d restore databases from production over to our development servers and time how long they took.  That way, I could report up to my managers and say, “If we lose production, it’ll take X hours to restore onto dev and go live.”  Every now and then – never as often as we should – we shipped backups offsite to the boss’s house.  As I moved up in life, I started working for companies that sent backups offsite via storage services.

My primary method of recovery was to restore a local backup first (which took a known length of time), but if things really went to hell in a handbasket, I could call the offsite storage service and get them to courier a backup over.  I had no idea how long that would take, and when it arrived, it’d take another unknown time to read the backup off tape and get it onto the live servers.

In today’s multi-terabyte world, most of us don’t test our restores as often as we should.

Heck, most of us don’t even do backups as often as we should.  We have clients who have even abandoned the thought of doing multi-terabyte data warehouse backups because it’s faster to reload the data from source, and because they’ve got schema changes they’ve always wanted to make along the way anyway.

Local Disk is the New Tape

The HP DL580 G7 holds up to 2TB of memory, but even if you cheap out and only get 1TB of memory, you’ll pay under $50k for that memory.  While that might sound expensive at first, keep in mind that SQL Server Enterprise Edition is roughly $30k per socket, so you’ll spend $120k just licensing the four CPU sockets in that DL580.  The memory isn’t the biggest expense in that server – 1TB of memory is cheaper than the SQL Server licensing.  If you need four CPU sockets worth of power, you’re probably going to be caching a great deal of your database in memory.

That is, as long as the server’s up.

Two ugly things happen when that server restarts.  First, it needs to check the memory upon booting.  One of our clients with a 1TB memory system reports that it takes over 15 minutes for the server to boot up into Windows.  That’s some pretty ugly downtime if you need to do Windows patches or firmware upgrades, and this is why clustering starts to be a no-brainer when you go past the 256GB memory mark.  It’s just so much faster to have the second node already up and running – you lose much less time.

Second – and clustering doesn’t fix this – is that your server is starting from scratch with nothing cached in memory whatsoever.  The data’s all sitting on disk.  Even with whiz-bang solid state drives, it takes a long time to read 1TB of data off disk to warm up SQL Server’s cache.

One of our performance-freak clients had us write scripts to do a SELECT * across every table just to make sure all of the data would be brought into cache after a restart.  And if you’d like to attempt a similar exercise, keep in mind that you’ll also need to use index hints for every index in the database, because just doing a SELECT * will only hit the clustered index.  If you’ve got filtered indexes, you’ll need to use a WHERE clause that matches each index’s filter, or else use DMV tricks that will hit the filtered index.
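A hedged sketch of that warm-up approach (the table and index names here are invented; in practice you’d generate the statements from sys.indexes):

```sql
-- Touch the clustered index by scanning the whole table.
SELECT COUNT_BIG(*) FROM dbo.Orders WITH (INDEX(PK_Orders));

-- Touch each nonclustered index with an explicit hint.
SELECT COUNT_BIG(*) FROM dbo.Orders WITH (INDEX(IX_Orders_CustomerID));

-- A filtered index needs a WHERE clause matching its filter,
-- or the hinted plan can't be produced.
SELECT COUNT_BIG(*) FROM dbo.Orders WITH (INDEX(IX_Orders_Open))
WHERE OrderStatus = 'open';
```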

For these kinds of companies, a SQL Server restart is a serious outage-causing event.  Their systems will perform below par for up to an hour as the database server recovers.  To a manager, this is akin to restoring from backup – it’s an event they don’t want to see more than once a quarter.

Caching Servers are the New Memory

Database administrators think of SQL Server’s memory as the cache.  That is completely and utterly wrong.  The database is the persistence layer – the place things go where they need to be permanently saved somewhere.

The first place your applications should be querying isn’t the database: it’s a caching server tier like Redis, Memcached, Couchbase (aka Membase), and Windows AppFabric Caching.  Only if your apps can’t find the data in cache should they ask the database, and then they should store the database results in the cache to lessen load on the database.  The fastest query is the one you never make.

Don’t think the caching tier doesn’t need to be reliable just because it doesn’t have original data in it.  When this caching tier goes down, your database server will feel it.  As your loads grow, your app can get to the point where the database tier can’t survive the load if the cache servers are completely flushed, and you have to build cache-server-warming code.  That’s why I consider caching to be the new buffer pool – it’s the database’s first line of defense, long before queries hit the database server’s buffer pool.

When You’re Performance Tuning Queries, Zoom Out

When someone brings me a query to tune, the first question I ask is, “Why are we running this query?”  It’s not just that there may be a faster way to get the data, but there might be a reason not to run the query at all.

Are we repeatedly running the same get-configuration-data query over and over?  Maybe that should be cached inside the application itself.

Are thousands of users asking for their dynamic home page over and over?  Maybe we should use a cache tier like Redis that every web and app server can query.

It can take your developers months to fully implement and test a good caching solution like Redis, so think like Jim Gray.  The time to prepare for tomorrow’s massive architecture changes is now.  Start asking the caching question so your developers can start building out your new buffer pool.