
Finding One Problem Query With Extended Events

It’s easy to find problem queries on your SQL Server after they’ve happened. You can open your favorite monitoring tool or look in the plan cache. Those options are great for finding things after they’ve happened, but what if you want to find problems as they happen?

Server Side Trace

You could set up a SQL Server Server Side Trace. This is an acceptable solution if you’re using SQL Server 2008 or earlier. There are a few problems with this approach.

Problem the first – the SQL Server Profiler UI is less than beautiful. The GUI is somewhat cumbersome and the scripting interface is hilariously bad. To make matters worse – capturing only a single query requires string-based filtering. There’s a plan handle property available in Profiler, but it’s not available for T-SQL or stored procedures.

Something appears to be missing

As much as I like string matching in T-SQL (hint: I don’t), I think it’s fair to say that Profiler/Server Side Trace are best avoided unless you’re using SQL Server 2008 or earlier.

Extended Events

Extended events are nowhere near as hard as you’d think. Trust me on this. How do I know? I set up an Extended Events session to grab a particular query.

Here’s what you do:

  1. Create a crappy stored procedure. Use your imagination.
  2. Run the stored procedure a lot.
  3. Grab the query_hash for that stored procedure. The easiest way I’ve found to do this is to use our plan cache query to identify it. My query_hash is 0x2B42702322C10821. What’s yours?
  4. Create an extended events session to capture that specific query hash.

Step 4 isn’t as easy as it sounds. Astute readers will note that I never said this was easy, just that it’s not as hard as you would think.

Even though the query_hash is stored as a hexadecimal in SQL Server, Extended Events would prefer to see an unsigned 64-bit integer. To get this magic number, take your query_hash and cast it as either a BIGINT or a DECIMAL(20,0):
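Here’s a rough sketch, using the example hash from step 3 – substitute your own:

/* Cast the query_hash to a BIGINT to get the number the Extended Events
   predicate wants. 0x2B42702322C10821 is just the example hash from above. */
SELECT CAST(0x2B42702322C10821 AS BIGINT) AS magic_number;

/* If your hash casts to a negative BIGINT, get the unsigned equivalent as a
   DECIMAL(20,0) by adding 2^64: */
SELECT CAST(CAST(0x2B42702322C10821 AS BIGINT) AS DECIMAL(20,0))
       + CASE WHEN CAST(0x2B42702322C10821 AS BIGINT) < 0
              THEN 18446744073709551616 ELSE 0 END AS magic_number;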

Once you convert the query_hash to a magic_number, create your Extended Events session. I chose to capture the sp_statement_completed and sql_statement_completed events and a number of server level events like time, app name, and client host name. Here’s what it would look like capturing this to a file:
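This is a sketch rather than the exact session I used – the session name, file path, and the query_hash value in the predicate are placeholders for you to swap out:

CREATE EVENT SESSION [problem_query] ON SERVER
ADD EVENT sqlserver.sp_statement_completed (
    ACTION (sqlserver.client_app_name, sqlserver.client_hostname,
            sqlserver.database_name, sqlserver.sql_text)
    WHERE ([sqlserver].[query_hash] = (1234567890123456789.)) /* paste your magic number here */
),
ADD EVENT sqlserver.sql_statement_completed (
    ACTION (sqlserver.client_app_name, sqlserver.client_hostname,
            sqlserver.database_name, sqlserver.sql_text)
    WHERE ([sqlserver].[query_hash] = (1234567890123456789.)) /* paste your magic number here */
)
ADD TARGET package0.event_file (
    SET filename = N'S:\XEvents\problem_query.xel' /* placeholder path */
)
WITH (MAX_DISPATCH_LATENCY = 5 SECONDS, STARTUP_STATE = OFF);
/* Every event carries a timestamp automatically, so there's no separate "time" action to add. */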

Once you have that up and running, you should be able to start the session, if it isn’t already started, by running:
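A one-liner, using the placeholder session name from the sketch above:

ALTER EVENT SESSION [problem_query] ON SERVER STATE = START;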

Just like that, you’ll be capturing your terrible queries to disk where you can mine the extended events files for gold and glory at your own convenience. If you want to query it, it’d look something like this:
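Here’s a rough version that assumes the placeholder file path from the session above – it reads the .xel files back and shreds out a few useful columns:

WITH events AS (
    SELECT CAST(event_data AS XML) AS event_xml
    FROM sys.fn_xe_file_target_read_file(N'S:\XEvents\problem_query*.xel', NULL, NULL, NULL)
)
SELECT  event_xml.value('(/event/@name)[1]', 'varchar(60)') AS event_name,
        event_xml.value('(/event/@timestamp)[1]', 'datetime2') AS event_time,
        event_xml.value('(/event/action[@name="client_hostname"]/value)[1]', 'nvarchar(256)') AS client_hostname,
        event_xml.value('(/event/action[@name="client_app_name"]/value)[1]', 'nvarchar(256)') AS client_app_name,
        event_xml.value('(/event/data[@name="duration"]/value)[1]', 'bigint') AS duration,
        event_xml.value('(/event/action[@name="sql_text"]/value)[1]', 'nvarchar(max)') AS sql_text
FROM events;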

Cleaning Up After Yourself

Once you’re done watching a specific query or queries, make sure you clean up after yourself. There’s no reason to add extra load to SQL Server when you aren’t watching. Make sure to stop and remove your Extended Events session:
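Using the placeholder session name from earlier, that’s two statements:

ALTER EVENT SESSION [problem_query] ON SERVER STATE = STOP;
DROP EVENT SESSION [problem_query] ON SERVER;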

Augment Your Techniques

This won’t replace any other techniques that you’re using to find poorly performing queries. However, you can use this technique to find problem queries as they’re happening.

Different monitoring techniques provide you with a different view of the data. Our plan cache query shows you a historical record of what’s been going on, but it can only show so much. If a query has been run 14,400 times over the last 24 hours, does that mean it’s been run 10 times a minute for that entire period or is there a period of activity where that query gets run 1,440 times a minute for 10 minutes straight?

This isn’t going to replace the need for other tools you have at your disposal. It’s just another way to get the job done.


Rollback: What happens when you KILL a session?

SQL Server

It’s Friday afternoon. You, the DBA, are looking forward to a long weekend. You’re going to walk out the door in 30 minutes and enjoy some family time. The phone rings and you pick it up. On the other end, Joe Developer says, “There’s a query that’s been running for a long time. Can you kill it?”

You log into SSMS and use sp_who2 to confirm that the user has an open transaction. You issue the (infamous) KILL command.
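Something along these lines, with a made-up session id:

EXEC sp_who2;   /* find the offending session */
KILL 52;        /* 52 is a hypothetical session id - use the one sp_who2 showed you */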

Twenty-five minutes later, you’re still sitting there, waiting for the transaction to roll back. You’re sweating. You really want to leave for the weekend. Another ten minutes pass, and finally, it completes.

What happened? Why did SQL Server take so long to kill the process? What happens in a rollback situation?

What Happens in a Transaction

When a transaction is being committed, each step is written into the transaction log. Each value that is being inserted, updated, or deleted is recorded so that if there is a failure before the log is written to the data file, the appropriate rollback or roll forward can take place. The transaction may take and release locks on tables or indexes.

Want to see this in action? You can use the undocumented fn_dblog command to view the entries in the log.
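For example – fn_dblog is undocumented and unsupported, so save this for a test system:

/* Both parameters are start and end LSNs; NULL, NULL returns the active portion of the log */
SELECT [Current LSN], Operation, Context, [Transaction ID], AllocUnitName
FROM fn_dblog(NULL, NULL);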

What Happens in a Rollback

When we issue a KILL, every step of the transaction that has been written to the transaction log must be undone. The object the transaction was changing must be put back to the state it was in prior to the change. Was a record inserted? It needs to be deleted. Was a record updated? The previous value must be restored. Was a record deleted? It needs to be added back in. You can use fn_dblog to see this in action, too.

If SQL Server is simply undoing what was once done, why does it sometimes take longer?

If the transaction had to take a lock on a row or table to perform the operation, it will have to take that lock again – and other processes might be using that object now. Also, consider that a rollback is mostly single-threaded. If the transaction ran using four cores initially, and the rollback is now only using one, it will take more time.

Try going down backwards, hopping on one foot. I dare you!

Imagine this: you’ve decided to climb a tower with 10,000 stairs. You get to stair 9,999 and decide you don’t want to complete the climb. You can’t wish yourself to the bottom step – you have to go back down. But now, you’re tired – and have to do this single-threaded. Imagine having to hop down the stairs backwards, on one foot.

Tracking Progress

You are able to track the progress of a rollback. After you issue a KILL command, run KILL [Session ID] WITH STATUSONLY. This will show you approximately how much time is remaining in the rollback – but other processes in the database could affect that. Consider the information here a guideline, not a hard-and-fast time.
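For example, with a hypothetical session id:

KILL 52 WITH STATUSONLY;   /* shows estimated rollback completion and time remaining */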

Another thing to keep in mind: what you might not have known when you issued that KILL on Friday at 4:30 pm is that the transaction was going to finish at 4:32 pm. But, you had to wait until 5:05 for the rollback to complete. Always be cautious when using this command!

Solving the root cause

A one-time KILL issued to solve an “oops” may not need further investigation. But if you find yourself having to issue these commands for the same application, at the same time each week, it’s time to dig deeper. You want to work with the owners of the application or process to determine why this keeps happening. If it’s a long-standing process or transaction, what’s changed to make it take so long now? Perhaps you’re doing a mass insert or update, and could break it into chunks. If it’s a new process, look at what else is going on at that time. Could the statements be written more efficiently? Could it be scheduled for another day or time?


When Do You Fire the DBA?

SQL Server

I know a lot of DBAs, but it’s really, really rare that I’ve seen one get fired. I don’t think it’s ever happened during one of my consulting engagements, and I’ve seen some incredibly horrific database disasters (not to mention a whole lotta ugly near-misses).

So I asked Twitter:

Not Telling Management About Problems

That’s fair – I can understand if someone’s databases have problems that they don’t know about, because we can’t all know about every risk. But if you do know about a risk and you don’t inform management, that’s bad. Which leads to…

Doing Dirty Deeds

https://twitter.com/buckwoody/status/438678817058865152

What if someone came to you and offered you fifty thousand bucks for a copy of your company’s database backups? Or more realistically, just asked you to up their login permissions to see more databases as a favor?

The devil doesn’t come wearing red horns and a cape. He comes as a friend who needs a favor and will pay you handsomely for it.

Being Untrainable

Not Having Backups

This one’s interesting because it happens a lot. No, seriously, it’s the first thing I check when I start a consulting engagement, and maybe 1/4 to 1/2 of the time, the backups aren’t working the way the DBA assumed.

After all, think about your own databases – you’re just assuming the backups worked okay last night because you didn’t get an email. If someone walked into your cube right now and checked every database for a clean (non-corrupt) backup that made it to tape or offsite, are you ready to bet your job on it? Especially if other teams are involved in sweeping the backups off to tape?

I’m much more tolerant of this mistake now because I see it so often. DBAs get distracted by performance issues because that’s what users complain about. Nobody complains about missing backups until it’s too late.

Making a Fatal Mistake

I love those last two words from Allan – stated consequences. If the company makes it clear ahead of time that certain mistakes are unforgivable, then yep, that can be a recipe for firing. If you’re in that kind of shop, you’d be wise to print out that list of unforgivable mistakes in a really large font and stick it to your wall near your monitor as a constant reminder.

Not Being Customer-Friendly

Over on DBAreactions, I make a lot of bad jokes about developers, SAN admins, network admins, sysadmins, and your momma.

The reality, though, is that I love these people because they’re struggling just like us DBAs are. They’re working to get better, and I have to help them get better as part of my own struggles. As much as we’d like to think we DBAs know everything about everybody else’s jobs, often our attitudes create a problem that’s a bigger liability than our perceived knowledge.

Not Improving Despite Guidance

https://twitter.com/buckwoody/status/438678939234750464

Buck sums up the real thing DBAs need to be aware of.

Ideally, you build your own learning plan to up your game and keep your data safeguarded. You sharpen your own knives, and you train for the day that you have to respond to the unthinkable outage.

Less-than-ideally, your managers notice that your knives are those plastic ones you get in coach class on airplanes, and they write you a plan as part of your annual review. You make progress on it with that jump start, and you keep treading water in your career.

Or not-ideally-at-all, you put your company’s data at risk, they take you to Human Resources, and you sign off on a very urgent plan to get your learn on.

So, go get your learn on before the company notices.


Finding Blocked Processes and Deadlocks using SQL Server Extended Events

A lot of folks would have you think that Extended Events need to be complicated and involve copious amounts of XML shredding and throwing things across the office. I’m here to tell you that it doesn’t have to be so bad.

Collecting Blocked Process Reports and Deadlocks Using Extended Events

When you want to find blocking, you probably turn to the blocked process report. You mess around with profiler on your SQL Server 2012 box. You probably feel a little bit dirty for clunking around in that old interface, but it gets the job done.

There’s a better way… Well, there is at least a less awful way: Extended Events.
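Something like this gets you started – the session name and file path are placeholders, not gospel:

CREATE EVENT SESSION [blocked_process] ON SERVER
ADD EVENT sqlserver.blocked_process_report,
ADD EVENT sqlserver.xml_deadlock_report
ADD TARGET package0.event_file (
    SET filename = N'S:\XEvents\blocked_process.xel' /* placeholder path */
)
WITH (MAX_DISPATCH_LATENCY = 5 SECONDS, STARTUP_STATE = ON);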

With that, you’ve created an Extended Events session to grab blocked processes and deadlocks. Why both? The blocked process report makes use of the deadlock detector. Since large amounts of blocking are frequently synonymous with deadlocking, it makes sense to grab both at the same time. There are a few other things we’ll need to do to make sure you can collect blocked processes:
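The big one is the blocked process threshold – SQL Server won’t produce a blocked process report at all until you tell it how many seconds of blocking are worth reporting on. Roughly:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
/* Report any block lasting longer than 5 seconds - pick a threshold that makes sense for you */
EXEC sp_configure 'blocked process threshold', 5;
RECONFIGURE;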

At this point, you’ll be collecting the blocked process report with Extended Events. There’s no Profiler session to set up – just start and stop the Extended Events session at your leisure.

Reading the Blocked Process Report from Extended Events

We’re saving the blocked process report to disk using Extended Events. Now what?

We need to get that blocked process data out of the Extended Events files and somewhere that we can better analyze it.
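Something like the following does it – a sketch that assumes the placeholder file path from the session above:

WITH events AS (
    SELECT CAST(event_data AS XML) AS event_xml
    FROM sys.fn_xe_file_target_read_file(N'S:\XEvents\blocked_process*.xel', NULL, NULL, NULL)
)
SELECT  event_xml.value('(/event/@timestamp)[1]', 'datetime2') AS event_time,
        event_xml.value('(/event/data[@name="duration"]/value)[1]', 'bigint') AS duration,
        bpr.value('(blocked-process/process/@spid)[1]', 'int') AS blocked_spid,
        bpr.value('(blocked-process/process/@hostname)[1]', 'nvarchar(128)') AS blocked_host,
        bpr.value('(blocked-process/process/@clientapp)[1]', 'nvarchar(256)') AS blocked_app,
        bpr.value('(blocked-process/process/inputbuf)[1]', 'nvarchar(max)') AS blocked_inputbuf,
        bpr.value('(blocking-process/process/@spid)[1]', 'int') AS blocking_spid,
        bpr.value('(blocking-process/process/inputbuf)[1]', 'nvarchar(max)') AS blocking_inputbuf
FROM events
CROSS APPLY event_xml.nodes('/event/data[@name="blocked_process"]/value/blocked-process-report') AS x (bpr)
WHERE event_xml.value('(/event/@name)[1]', 'varchar(60)') = 'blocked_process_report';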

In this query, you read from an Extended Events session that’s being saved to disk and perform XML shredding to get client information. It isn’t a pretty query, but it does the job very well.

Viewing the Extended Events Deadlock Graphs

Extended Events deadlock graphs use a slightly different XML schema than what SSMS expects, so when you try to open one you’ll see an error along the lines of “There is an error in XML document”. For folks using SQL Server 2012 and earlier, you can either parse the XML by hand or use SQL Sentry Plan Explorer.
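To pull the deadlock graphs out of the same files, a query along these lines works (same placeholder path as before) – save the XML off as a .xdl file if you want to hand it to a viewer:

SELECT  event_xml.value('(/event/@timestamp)[1]', 'datetime2') AS event_time,
        event_xml.query('/event/data[@name="xml_report"]/value/deadlock') AS deadlock_graph
FROM (
    SELECT CAST(event_data AS XML) AS event_xml
    FROM sys.fn_xe_file_target_read_file(N'S:\XEvents\blocked_process*.xel', NULL, NULL, NULL)
) AS events
WHERE event_xml.value('(/event/@name)[1]', 'varchar(60)') = 'xml_deadlock_report';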

Viewing the Extended Events Blocked Process Report

But what about the blocked process report? After all, your users are complaining about blocking, right?

Michael J. Swart has created tools to view the blocked process report. It’d be awesome if you could use it, but Michael’s blocked process report viewer uses the output of a server side trace to read blocking information. These Extended Events files are different enough that you can’t use them outright. You can, however, create a table that will let you use the blocked process report viewer:
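One hedged approach: stage the reports in a table shaped like server side trace output (event class 137 is the trace’s Blocked process report event), then point the viewer at that table – check Michael’s documentation for the exact parameters his procedure expects:

CREATE TABLE dbo.blocked_process_reports (
    EndTime    DATETIME,
    TextData   XML,
    EventClass INT DEFAULT (137)  /* the trace event class for the blocked process report */
);

INSERT dbo.blocked_process_reports (EndTime, TextData)
SELECT  event_xml.value('(/event/@timestamp)[1]', 'datetime2'),
        event_xml.query('/event/data[@name="blocked_process"]/value/blocked-process-report')
FROM (
    SELECT CAST(event_data AS XML) AS event_xml
    FROM sys.fn_xe_file_target_read_file(N'S:\XEvents\blocked_process*.xel', NULL, NULL, NULL)
) AS events
WHERE event_xml.value('(/event/@name)[1]', 'varchar(60)') = 'blocked_process_report';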

While you still have to read the XML yourself, this will give you a view into how deep the blocking hierarchies can go. Collecting this data with Extended Events means that you won’t have to sit at your desk, running queries and waiting for blocking to occur.

Extended Events – Not That Hard

Extended Events aren’t difficult to use. They provide a wealth of information about SQL Server and make it easier to collect information from complex or difficult to diagnose scenarios. You really can collect as much or as little information as you want from SQL Server. When you get started, the vast majority of your work will be spent either looking up Extended Events to use or formatting the output of the queries into something meaningful.


T-SQL Tuesday: Stop Tuning with Wait Stats Percentages

For this month’s T-SQL Tuesday topic, Michael J. Swart told us to argue against a popular opinion, and man, is that right up my alley. I’ve told you to stop worrying about index fragmentation, stop backing up your logs once an hour, and to stop running $14k of licensing on $3k of hardware.

You’re probably checking SQL Server wait stats periodically to find your SQL Server’s bottleneck – and that’s a good thing. Instead of checking Perfmon counters and trying to figure out what’s the slowest part of your server, at least wait statistics tell you what SQL Server has been waiting on while running queries.

But it all falls apart when you say, “67% of the time, SQL Server is waiting on ____, so we should focus there.”

We need to understand the difference between latency and throughput.

Explaining Latency and Throughput with Shipping Companies

If we ran an online store and wanted to measure how fast a delivery service works, we could call them to pick up a single envelope, and then measure the amount of time it took to arrive at its destination. We could then say, “That service can deliver exactly one package per day. If our business is going to grow, we’re gonna have to find a delivery service that can ship faster, because we need to move more than one package per day.” We could then focus all our efforts on trying local courier services, or spreading our load across multiple shipping companies.

Ernie arrives at her destination.

But we would be morons.

Instead, we need to put MORE packages out on our doorstep and call the delivery service to come get them. They’ll send a truck, pick up all the packages, and deliver them to various destinations. As we try to ship more and more packages, we’ll probably need to upgrade to a loading dock, or maybe even multiple loading docks, and set up an arrangement with our shipping company to send more trucks simultaneously.

Latency is the length of time it takes to deliver a single package.

Throughput is the number of packages they can deliver per day.

Just because our business is waiting overnight for a single package to be delivered doesn’t mean we have to go finding a new shipping company. It’s completely normal. We need to keep pushing our business to figure out where the breaking point is. Are packages piling up at the door because the shipping company only has a single small cargo van? Sure, that’s the point at which we worry.

How This Relates to SQL Server Wait Stats

In a recent load testing engagement, the DBA told me, “We’ve only got a few end users hitting the system, and we’re already seeing 80-90% of our waits in PAGEIOLATCH. The data file storage simply isn’t able to keep up.”

We switched from using his wait stats script to sp_BlitzFirst® instead, which shows the cumulative amount of time spent waiting on each wait type. In any given 5-second span, the server was spending less than a second waiting on PAGEIOLATCH. Furthermore, the average wait time was less than 5 milliseconds each time – indicating that the storage was responding fairly quickly to each request.

The server was sitting idle, and the DBA was reading wait stats incorrectly. Sure, the majority of time spent waiting was due to storage, but there just wasn’t much time spent waiting, period.

“Crank up the load,” I said. “Quadruple the amount of work you’re throwing at the server.”

Everybody in the room looked at me like I was crazy, but they agreed – and the SQL Server still didn’t flinch. We kept upping and upping the load, and finally we did find a breaking point, but it wasn’t storage. Just as you can pile up a lot of boxes in front of your house and the shipping company will pick them all up to deliver them in the same amount of time, the SQL Server’s storage kept right on delivering every result within 5-6 milliseconds.

The Moral of the Story

When using wait stats for monitoring, make sure you’re looking at the total number of seconds spent waiting per second on the clock. If you sample waits for 5 seconds on a 16-core server, don’t freak out about 5 seconds worth of wait. Each core can have multiple queries piled up, all waiting on different resources, so even 15-20 seconds of wait time during a 5-second period may not indicate problems.


The Worst Database User

SQL Server

“Can you tell me a little about this code?” I asked, keeping my best poker face on.

“I’m really proud of it. It gathers a whole bunch of stuff from accounting, human resources, manufacturing, you name it. I want to be able to track how we’re doing over time.” He nearly hovered in his chair with excitement, eagerly pointing out the different places he was gathering data from. “It’s like having my finger on the pulse of what matters most to the company.”

“That sounds really cool. And there’s some really fine-grained stuff in here – it looks like you really are taking the pulse of the company. How often do you run these queries?”

“It depends – some of them every day, some of them every minute because I need better sampling on those.”

I started to ease into the tough questions. “Every minute, wow. Are you concerned about the overhead?”

“No no, they’re really small queries, only took me a few minutes to write. And it’s not like I’m doing SELECT *,” he laughed.

“Oh, good. And did you index the tables involved? Sometimes adding indexes can really help ease the load of frequent queries.”

He paused for a second. “I never thought about that – but I can’t, right? I can’t add indexes on that kind of thing, can I?” Suddenly he realized he was in unsafe territory.

“No, I guess not. I wonder why that is. But who cares – it’s gotta be worth the cost to get such valuable information, right?”

Your job isn’t to gather metrics you won’t use.

“Oh yeah! You bet.”

“So how often do you query that data?”

He paused again. “Well, not very often. I’m pretty busy.”

“Did you query them today?”

“No.”

“Yesterday, or last week, maybe? Like before you called me in?”

His enthusiasm was gone, and he eyed me with suspicion.

“You’re way too busy to query this data, aren’t you? You’re busy putting out fires, fixing people’s mistakes, and keeping the lights on. You don’t really need this data gathered every minute, across all of your servers, do you?”

See, that user was the DBA, and when I showed him the top resource-using queries in the plan cache, they were all his monitoring queries. Queries that hit tables he couldn’t possibly index, because they weren’t actually tables at all – they were dynamic management views that SQL Server had to compute each time he accessed them. Queries that faithfully stored away data in tables he’d never bother to review.

And now you know the rest of the story.


Need High Availability for SQL Server? Use Windows Server 2012.

SQL Server DBAs are slow to adopt Windows Server 2012 and Server 2012 R2. I frequently find teams planning new implementations of SQL Server 2012 on Windows Server 2008 R2.

There seem to be two main reasons for this:

  1. DBAs want to change a minimal number of factors, and are slow to trust a new version of the OS. Windows 2008R2 feels “safe.”
  2. Windows Server 2012 was hard to love when it first came out. With no Start menu, Windows 2012 was more baffling than exciting to most production DBAs.

But these reasons don’t hold up– and staying on an old OS isn’t doing you any favors. If you’re planning a SQL Server 2012 (or upcoming 2014) installation, you should also be using Windows Server 2012 or Server 2012 R2.

Windows Server 2008R2 isn’t “Safer” or “Simpler” than Server 2012 R2

Let’s break down that first concern above. The essence of it is that staying on Windows Server 2008R2 seems less risky than using a new OS.

But let’s look at that closely. The most recent service pack for Windows 2008R2 was released in March 2011— that was Service Pack 1. (I bet I’m not the only person who has searched in vain for SP2 repeatedly, thinking I must have just missed it.)

Lots of fixes have been released since Service Pack 1. Here’s a few highlights:

Which patches you need varies by your configuration, and this is just the tip of the iceberg.

When you’re using high availability solutions, the list of patches you need to consider just gets bigger– and the risks get higher. In this post Brent describes a cluster-specific bug whose fix was released after Stack Exchange experienced Availability Group outages due to a clustering issue.

Although Windows 2008R2 has been out for a while, managing an Availability Group or a Windows Failover Cluster on it isn’t simpler– it’s actually more complicated! You’ve got more patches you need to be aware of and manage, and you still need to keep an eye out for new fixes as they’re released.

Failover Clustering has huge new features in quorum

Windows Failover Clustering is increasingly critical to high availability in SQL Server– it’s the underpinning for clustered SQL Server instances and for AlwaysOn Availability Groups.

Don’t get me wrong– Windows 2008 R2 has some pretty nice features when it comes to setting up a cluster, particularly compared to older versions of Windows. I love the Cluster Validation Wizard. I find the Failover Cluster Manager snap-in to be pretty intuitive and easy to manage.

But Failover Clusters have grown up significantly– particularly when it comes to your options with making sure that your applications stay online. And having more ways to keep applications online is probably the exact reason that you’re using features involving clustering!

Dynamic Quorum and Vote Assignment – Windows Server 2012

Windows Server 2012 introduced “Dynamic Quorum”. Dynamic Quorum lets the cluster add and remove votes for nodes if they shut down or fail. This makes it more likely for the cluster to maintain a majority if you have multiple sequential failures.

Windows Server 2012 also lets you easily customize quorum configuration and remove votes from some nodes if you’d like to. To remove node votes with Windows Server 2008R2, you’d have to know to install KB 2494036 and then reconfigure quorum.

Bonus: as of Windows Server 2012, the Windows Failover Clustering feature is part of Windows Standard Edition.

Dynamic Witness, Tiebreakers – Windows Server 2012 R2

Who’s got the votes? Much easier to see in Server 2012 R2.

Windows Server 2012 R2 introduces even more quorum features. The Dynamic Witness feature lets the cluster decide when it will or won’t give a vote to the witness you have configured.

This means that if you have multiple sequential failures, the cluster has more options and is smarter about keeping the right number of votes available. It also means that the guidance about creating a witness is simpler: since the vote can be managed dynamically, you should always create a witness.

Windows 2012 R2 has more features, too– the cluster is smarter about what happens when things start to come back online after you force quorum, and it has options for specifying a tiebreaker if you have multiple failures at the same time.

And it’s a simple thing, but I love it the most: in 2012 R2 you can see what’s happening with dynamic votes for nodes by using the Failover Cluster Manager (no PowerShell commands required). This makes testing different failure scenarios much more straightforward as you can quickly and easily see how many votes there are, and who’s got them.

Oh, and they fixed that Start menu thing, too.

You can’t easily migrate your Windows 2008R2 Cluster to Windows 2012+

Right clicking on the Start Menu in Windows Server 2012 R2

The final reason I wouldn’t plan a new highly available environment on Windows 2008R2 is simply that I don’t want to be stuck there. You might think, “We’ll start out on a known OS and then upgrade soon” — but that’s easier said than done.

If you’re on a Windows 2008R2 cluster and you want to upgrade, you will need to create a new, separate failover cluster using Windows Server 2012, and then migrate to it. That requires separate storage, separate servers, and a careful migration plan. It’s absolutely doable, but it’s not simple, cheap, or easy. (Note that the built-in “Migrate a Cluster” wizard doesn’t support SQL Server installs for Windows 2008R2 -> Windows 2012.)

What about SQL Server 2008? Or 2005?

If you’re doing a new server buildout, you may decide to install SQL Server 2008 R2 on a Windows 2012 / 2012 R2 failover cluster. That’s supported, but to get installation to work you may need to use a slipstream installation and take a few other steps. (I think slipstreaming is great, but budget extra time to get everything tested and working.)

If you’re thinking about running SQL Server 2005 on Windows 2012 or Windows 2012 R2, that’s just not supported by Microsoft.

It’s Time to Embrace Our New OS Overlord

I really like Windows 2008 R2. I’ve used it for years, it’s very familiar, and I feel like I know its quirks. But I also recognize that it’s got a lot of baggage. It’s not really simpler or easier to manage.

The improvements to Windows Failover Clustering features in later versions of the operating system should be very compelling to anyone considering clustering SQL Server or using Availability Groups. It’s not always perfect and it’s certainly complicated, but it’s much better than what you’ve got in Windows 2008R2.

It’s time to start planning your new SQL Server implementations with Windows 2012 and higher.


Top Classic Country Songs for SQL Server DBAs

Humor

“I Shot a Server in Reno (just to watch it die)” Johnny Cash

“AlwaysOn on My Mind” Willie Nelson

“WASD, Don’t Take Your Single Thread to Town” Kenny Rogers

“Transactions in the (Replication) Stream” Kenny Rogers and Dolly Parton

“Stand by Your Maintenance Plan” Tammy Wynette

“Ring Buffer of Fire” Johnny Cash

“Your Deadlockin’ Heart” Hank Williams, Sr.

“Friends in Low Latency Places” Garth Brooks

“Mommas Don’t Let Your Babies Grow Up to Be Virtual Clusters” Willie Nelson and Waylon Jennings (cover)

“Crazy” Patsy Cline (Let’s face it, there’s no improving on this one.)

“Here’s a Quarter (Call Someone Who Understands Oracle)” Travis Tritt

“I Wish You Could Have Turned My Head (And Left My Indexes Alone)” Oak Ridge Boys

“I Love the Way You Optimize Me” John Michael Montgomery

“The Night the Lights Went Out in the Datacenter” Reba McEntire

“All My Transactions Take Place in Texas” George Strait

“She’s In Love With the Feature” Trisha Yearwood

“Could I Have This Lock” Anne Murray


How to Use Partitioning to Make kCura Relativity Faster

kCura Relativity

kCura Relativity is an e-discovery program used by law firms to find evidence quickly. I’ve blogged about performance tuning Relativity, and today I’m going to go a little deeper to explain why DBAs have to be aware of Relativity database contents.

In Relativity, every workspace (case) lives in its own SQL Server database. That one database houses:

  • Document metadata – where the document was found, what type of document it is
  • Extracted text from each document – the content of emails, spreadsheets, files
  • Document tagging and highlighting – things the lawyers discovered about the documents and noted for later review
  • Workspace configuration – permissions data about who’s allowed to see what documents
  • Auditing trails – who’s searched for what terms, what documents they’ve looked at, and what changes they made

For performance tuners like me, that last one is kinda interesting. I totally understand that we have to capture every activity in Relativity and log it to a table, but log-sourced data has different performance and recoverability requirements than other e-discovery data.

The AuditRecord table is append-only. We don’t go back and modify AuditRecord data – we just constantly add to it, tacking on data at the end of the table. This means it has different backup requirements – I certainly don’t want to be doing a full backup on this table every day, repeatedly backing up the exact same data over and over and over when it will never change.

It will never get deleted. We have to keep this data throughout the life of the case because the client may need to go back through time to see who did what during the case. This means it will keep growing and growing, making backups tougher over time.

In big cases, it can dominate the database. I’ve seen instances where the AuditRecord table consumed more than half of the database size – meaning in a 1TB database, 500GB of the contents are AuditRecord. This means backups will take twice as long.

In the event of a disaster, I don’t need it right away. In our 1TB workspace example, I would prefer to restore the 500GB workspace data first, let the lawyers in to do document review, and then take my time restoring a separate 500GB AuditRecordArchive database.

Splitting AuditRecord Into an Archive Database

In order to pull this off, I need two separate databases:

  1. Workspace database – the normal EDDS12345 database for workspace 12345. Inside this database, I have all the normal Relativity tables, but only a small AuditRecord table with the current audits – say the last 7 days. As people review documents, I’d log that data into this AuditRecord table.
  2. AuditRecordArchive database – say, EDDS12345_AuditRecordArchive. In here, I have one AuditRecordArchive table that has all of the AuditRecord data more than 1 week old.

Once I’ve set this up, then I need a job that sweeps the EDDS12345.AuditRecord data into the archive database once per week. Immediately after that job finishes, then I do a full backup and DBCC of EDDS12345_AuditRecordArchive – and then I don’t need to back it up again until the next sweep.

If I want to get really fancy, I don’t do daily DBCCs or index maintenance against that AuditRecordArchive database either. If anything goes wrong with it, like database corruption, I just restore to last week’s full backup and I’m off to the races. This means less downtime for database maintenance.

Great News! kCura Supports This

It’s called “partitioning the AuditRecord table”, and your kCura support contacts can walk you through it. It doesn’t involve SQL Server partitioned tables at all – they just call it partitioning because it’s the same basic concept, only done with application-level code.

However, I don’t recommend doing this by default across all your databases. This technique is going to instantly double the number of databases you have and make your management much more complex. However, I do recommend reviewing your largest workspaces to see if AuditRecord is consuming half or more of the database space. If so, consider partitioning their AuditRecord tables to get faster backups, database maintenance jobs, and restores.

At the risk of sounding like a fanboy, this is one of the reasons I love working with the kCura folks. They really care about database performance, they take suggestions like this, and they implement it in a way that makes a real difference for customers.

This is also why database administrators need to:

  1. Understand the real business purpose of the biggest tables in their databases
  2. Build working, productive relationships with their software vendors
  3. Come up with creative approaches to ease SQL Server pains
  4. Help the vendors implement these approaches in software

Back Up Transaction Logs Every Minute. Yes, Really.

Right now, your transaction log backups are set up to run every 15 minutes, every hour, or maybe every few hours.

It’s time for a reality check. Figure out what internal customer owns the data inside the database, and walk into their office. Ask them how much data they’re comfortable losing.

You’re probably going to have to repeat the question a couple of times because they won’t believe these words are coming out of your mouth. After they eventually recover, they’ll blurt out words like “NONE” and “ZERO” and “ZOMG”. Give them a quick overview of how much it costs to have a no-data-loss system – typically in the millions of dollars – and then ask for another answer.

They’re going to ask for a list of options and their costs. Forget the fine-grained details for now – let’s just assume one of your solutions involves transaction log backups.

The big hand is on the network cable, and the little hand is on the power cable.

Is it more expensive to run your log backups:

  • A – Every hour
  • B – Every 15 minutes
  • C – Every 5 minutes
  • D – Every minute

It’s a trick question – they all cost the same.

“BUT MORE LOG BACKUPS MEAN MORE SLOWDOWNS!”

Nope – it actually means less overhead. To keep the numbers easy, let’s say you generate 60GB of transaction log data per hour. Would you rather back up 1GB per minute in small chunks, or kick the server in the junk every hour trying to back up 60GB all at once? The users would definitely notice the latter – the server will slow to a crawl while it handles that firehose of log data for a sustained period.

“BUT MORE LOG BACKUPS MEAN MORE FILES!”

Yes, but if you think you’re really going to manually restore log backups through the GUI, one at a time, you haven’t been through a real mission-critical crash. Long before disaster strikes, you need to be familiar with a script-based method to automatically restore all of the log files in a directory. Test it, get comfortable with it, and then when the outage hits, you can let your script restore everything ASAP, no clicks required.

“BUT MY LOG BACKUP JOBS CAN’T FINISH IN A MINUTE!”

If you have too many databases and your jobs can’t keep up, it’s time to start breaking up the log backups into multiple jobs. Some of my clients use two jobs, for example – one to back up all odd-numbered databases by the database ID in sys.databases, and another one to back up all the evens. The log backups don’t stress out their servers by any means, and users don’t notice the impact of two simultaneous log backups in two different databases.
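If it helps to picture the split, here’s a rough sketch that just generates the commands for the odd-numbered databases (flip the modulo check to 0 for the even job) – the backup path is a placeholder:

SELECT  N'BACKUP LOG ' + QUOTENAME(name)
      + N' TO DISK = N''X:\LogBackups\' + name + N'_'
      + REPLACE(REPLACE(CONVERT(VARCHAR(19), GETDATE(), 120), ':', ''), ' ', '_')
      + N'.trn'';' AS backup_command
FROM sys.databases
WHERE recovery_model_desc <> 'SIMPLE'
  AND state_desc = 'ONLINE'
  AND database_id > 4          /* skip the system databases */
  AND database_id % 2 = 1;     /* odd-numbered databases only */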

“BUT SURELY YOU CAN’T BE SERIOUS!”

I’m just as serious as you were when you randomly picked 15 minutes or 1 hour as the log backup schedule. The point is that it’s not for the geeks to decide – it’s for the customer to decide. Ask your business user about the business needs for their data, and then give them a menu of options to meet that goal.

You think they’re all going to insist that all data is mission-critical, but you’d be surprised. Once they see the costs involved with log shipping, AlwaysOn Availability Groups, SAN replication, and so forth, they’re able to make better informed decisions about which data really is priceless, and which data we could really re-enter at minimal cost.

But until you ask them this question, they assume you’re never going to lose data.


Alaska has a SQL Server User Group!

Alaska is on the PASS map!

I’m really stoked to share the news: Alaska (my home state) finally has a PASS Chapter of its own! The group just got started last December, and was officially welcomed into the PASS organization at the end of January. While they don’t have a Facebook or Twitter account yet, they do have a website and a YouTube channel.

The group meets every month and draws about twenty attendees, according to Chapter Leader Greg Burns. (Greg also runs the local SharePoint user group, which has about eighty members.) The audience is a mix of DBAs and developers, mostly.

Curious. Why would I mention the audience?

Because Greg is running a PASS Chapter for the first time, he could use a lot of help. He’s looking for speakers — remote or in-person — to present at upcoming meetings. If you’re interested in presenting to the group remotely, or just looking for an excuse to visit by far the largest state in the union…[prolonged eye contact with Texas]…just drop Greg a line at AlaskaSQL(at)gmail.com.

But wait, there’s more! If you’re a current or former PASS Chapter leader, you probably have some great tips on how to structure meetings, build membership, advertise your group, line up sponsors, and other things it takes to grow a user group. Rather than flood Greg’s inbox with your collective wisdom, let’s assemble them here in the comments so they’re all in one place. I can think of no better way to welcome Alaska to the SQL Server community than to show them how much we help each other.


Extended Events for Tracking Code Improvements (video)

SQL Server

Ever wanted to prove that your code or index changes had a positive impact? In this 30 minute video, Kendra shows you how to use Extended Events to measure a production workload– and how you can aggregate the results to prove your changes made a difference.

Want to skip straight to the demos? Here’s a quick timeline:

  • Demo 1: How I generate a fake workload – 2:45
  • Demo 2: Taking an initial benchmark with Extended Events – 4:45
  • Demo 3: Aggregating the results of the Extended Events Trace – 10:52
  • Demo 4: Taking a second benchmark with Extended Events and aggregating the results – 18:05

Want a sample of the type of Extended Events trace run in the demo? Here’s a sample script. This isn’t any rocket science on my part: I generated the script with the Extended Events wizard and then told it to script it out. (Try it yourself!)
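If you just want the flavor of it, here’s a minimal sketch of that kind of session – not the exact script from the video, and the session name and file path are placeholders:

CREATE EVENT SESSION [code_improvement_benchmark] ON SERVER
ADD EVENT sqlserver.rpc_completed (
    ACTION (sqlserver.client_app_name, sqlserver.database_name, sqlserver.sql_text)
),
ADD EVENT sqlserver.sql_batch_completed (
    ACTION (sqlserver.client_app_name, sqlserver.database_name, sqlserver.sql_text)
)
ADD TARGET package0.event_file (
    SET filename = N'S:\XEvents\code_improvement_benchmark.xel' /* placeholder path */
)
WITH (MAX_DISPATCH_LATENCY = 5 SECONDS);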


The Evolution of SQL Server Instrumentation

SQL Server

One of my favorite things about SQL Server is the instrumentation. It’s incredibly good at giving you information about what’s going on in the database engine and query optimizer.

Want to know why your server is slow? Need to know what made it go offline? You have a lot of tools available.

This hasn’t always been the case. One of the toughest things about the old SQL Server 2000 is the lack of instrumentation. The tools that help you improve SQL Server have all been added in the last 10 years.

The evolution of SQL Server instrumentation

The cool thing is that the instrumentation gets better and better with each version of SQL Server– new features get new Dynamic Management Views and functions. And sometimes we get a whole new feature to help us get insight into SQL Server.

If you’ve still got SQL Server 2000 instances where performance and availability is important, explain to your management how limited the tools are to diagnose and repair issues. SQL Server’s come an awful long way!


San Diego Class Recap

SQL Server

We had a lot of fun this month teaching SQL Server classes at the San Diego Harbor Hilton. (We wrote about how we picked the cities earlier.) Even when we teach, we learn, and then we share what we learned, so here we are.

San Diego Harbor Hilton

We like to try new things with our products and services, and this year’s classes brought a few firsts for us. Let’s recap what’s new and how it worked.

We tried new class titles and module groupings. This year, we’re running a 2-day class on How to Be a Senior DBA and a 3-day class on SQL Server Performance Troubleshooting. We wanted to find out if we’d gotten the module mixes right.

To learn, we surveyed the bejeezus out of the attendees at the start. How did they rate their skills at each module before we started, and which modules were they looking forward to the most? Then at lunchtime and end-of-day, we gave them more surveys to find out if they would recommend each module to others.

We learned right from the first day – it was obvious from the looking-forward-to question that the attendees weren’t interested in one particular module, so we gave them the option to switch it out for another module instead. That went really well, and we’ll adapt the Chicago and Philadelphia itineraries to include that change by default.

Be Your Neighbor’s Consultant

We added a lot of attendee interaction. One of the best ways to learn something is to teach it to someone else. In the high availability architecture module, we ran a 5-minute portion where you had to describe your app’s RPO/RTO needs to the person sitting next to you, and they had to recommend the right HA/DR infrastructure.

Great side effect: this got the attendees out of their shell right from the start! Before the very first bio break, each attendee knew at least one other person’s SQL Server challenges and felt comfortable talking about it.

Kendra Discusses the Outage Post-Mortem

We added optional after-hours group lab exercises. For example, on the first night of the How to Be a Senior DBA class, we broke attendees up into teams and handed them a production outage to fix. We watched via the projector as each team worked on their servers.

After the outage finished, the team conducted a post-mortem meeting just as they would (maybe?) in real-life, explaining what caused the outage, how they fixed it, and how they would ensure it didn’t happen again. Prizes went to the best solvers and best explainers.

Jeremiah Teaching

We used attendee questions to improve the training. While the training takes place, one of us sits in the back of the room and transcribes every attendee question.

Now that the training is over, we’ll revisit the slides and demos to see where we can build those answers right into the training to keep making each delivery better. (We do this at our conference sessions, user group sessions, and webcasts too.)

Granted, we can’t just add material like crazy – otherwise we’d run way long every day – but sometimes it makes sense to take out portions of training material that isn’t working well, add in other portions, or reorder slides to introduce answers in a way that makes more sense for attendees. We have some slide decks we’ve been working on for several years, and gradually improving this way the whole time.

Brent Teaching with Hardware

We added hands-on hardware for illustration. One of my favorite teaching icons says that if you want to teach students an abstract topic, give them something to put their hands on. I brought motherboards, memory, hard drives, and cables in to teach about how wait stats work, and as various queries ran onscreen, we talked about the physical holdup.

This class was the first time I’d tried this approach in person, and this is a great example of a session I’ll need to tweak over time. The time budgeting failed miserably – I didn’t get to go as deep as I wanted because I tried to cover too many things in too short of time. Now the fun part starts – I get to revisit the presentation, the attendee questions, and the subject matter, and figure out how I can impart maximum learning in 60 minutes.

Kendra Teaching

We added prerequisite training videos and resources. To make sure everybody was on the same page when class started, each attendee got homework when they enrolled. They got a login for our training video system that gave them access to a group of videos we picked just for them.

When the in-person class started, their training login suddenly had access to all of the scripts and resources for the class, too. We use that same delivery system to handle all of our resources, and we love how it works.

We even extended it based on what we learned in the first class. One of the Senior DBA students suggested that we compile a list of learning resources mentioned during the class, build a PDF of the links, and add that PDF to the training site, too. Great idea! Plus, since we’ll be teaching these same classes in Chicago and Philadelphia later this year, we can even start with that same resource list because many of them will be applicable.

San Diego (the Hilton is on the left)

See, we learn a lot even when we’re teaching. We have such a great time with this, and we can’t wait to keep sharing what we learn. If it sounds like fun, join us in Chicago or Philly later this year.


How to count the number of rows in a table in SQL Server

SQL Server

“How many rows exist in a table?”

It seems like such an innocent request. It isn’t too hard to get this information out of SQL Server. But before you open SSMS and whip out a quick query, understand that there are multiple methods to get this information out of SQL Server – and none of them are perfect!

COUNT(*) or COUNT(1)

The seemingly obvious way to get the count of rows from the table is to use the COUNT function. There are two common ways to do this – COUNT(*) and COUNT(1). Let’s look at COUNT(*) first.
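The table name below is a placeholder – these tests ran against a table with roughly 31 million rows, and SET STATISTICS IO ON lets you see the reads for yourself:

SET STATISTICS IO ON;

SELECT COUNT(*) AS row_count
FROM dbo.BigTable;   /* placeholder - use your own large table */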

The STATISTICS IO output of this query shows that SQL Server is doing a lot of work! Over 100,000 logical reads, physical reads, and even read-ahead reads need to be done to satisfy this query.

Looking at the execution plan, we can see an Index Scan returning over 31 million rows. This means that SQL Server is reading every row in the index, then aggregating and counting the value – finally ending up with our result set. The cost of this query? 123.910000.


The query results: 31,263,601 rows.


Now, let’s look at the behavior of COUNT(1).
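Same placeholder table as before:

SET STATISTICS IO ON;

SELECT COUNT(1) AS row_count
FROM dbo.BigTable;   /* placeholder - use your own large table */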

We can see from STATISTICS IO that we have a large number of logical reads – over 100,000.

The execution plan again shows an index scan returning over 31 million rows for processing. The query cost is the same, 123.910000.


The results here are the same – 31,263,601 rows.


The benefit of using COUNT is that it is an accurate indicator of exactly how many rows exist in the table at the time query processing begins. However, as the table is scanned, locks are being held. This means that other queries that need to access this table have to wait in line. This might be acceptable on an occasional basis, but I frequently see applications issuing these types of queries hundreds or thousands of times per minute.

sys.tables + sys.indexes + sys.partitions

We can join several SQL Server catalog views to count the rows in a table or index, also. sys.tables will return objects that are user-defined tables; sys.indexes returns a row for each index of the table; and sys.partitions returns a row for each partition in the table or index. I am going to query for the table ID, name, and count of rows in all partitions.
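A sketch of that query, again with a placeholder table name:

SELECT  t.object_id,
        t.name,
        SUM(p.rows) AS row_count
FROM sys.tables AS t
JOIN sys.indexes AS i
    ON  t.object_id = i.object_id
JOIN sys.partitions AS p
    ON  i.object_id = p.object_id
    AND i.index_id  = p.index_id
WHERE t.name = N'BigTable'     /* placeholder name */
  AND i.index_id IN (0, 1)     /* heap or clustered index only, so rows aren't counted once per nonclustered index */
GROUP BY t.object_id, t.name;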

The output of STATISTICS IO here shows far fewer reads – 15 logical reads total.

The execution plan is more complex, but much less work – the query cost here is 0.0341384.


The results of the query are also the same – 31,263,601.


The benefits of using this method are that the query is much more efficient, and it doesn’t lock the table you need the count of rows for.

However, you need to be cautious when counting the rows on a table that is frequently inserted into or deleted from. The TechNet documentation for sys.partitions.rows says it indicates the “approximate number of rows for this partition”. How approximate? That information isn’t documented. Understand, though, that if you use this method, you potentially sacrifice up-to-the-moment accuracy for performance.

sys.dm_db_partition_stats

A third option is to use the dynamic management view sys.dm_db_partition_stats. This returns one row per partition for an index.
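Along these lines, with the same placeholder table name:

SELECT  OBJECT_NAME(object_id) AS table_name,
        SUM(row_count) AS row_count
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID(N'dbo.BigTable')   /* placeholder name */
  AND index_id IN (0, 1)                       /* heap or clustered index only */
GROUP BY object_id;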

The STATISTICS IO output of this query is even lower – this time, only two logical reads are performed.

The execution plan is less complex than our second example involving the three system views. This query also has a lower cost – 0.0146517.


The query results are the same as the previous examples – 31,263,601 rows.


Using this DMV has the same benefits as the system views – fewer logical reads and no locking of the target table. The query is also simpler to write, involving only one object.

But again, the TechNet documentation for sys.dm_db_partition_stats.row_count says it is “the approximate number of rows in the partition”, and when this information is updated is not documented. Here, you are also potentially sacrificing accuracy for performance.

Time to do some digging

The questions that you need to work with the business to answer are, “How up-to-date must the row count be? What is the business purpose? How often do you insert into or delete from that table, and how often do you count the rows?” If the accuracy of the row count is crucial, work to reduce the amount of updates done to the table. If performance is more important, and the row count could be approximate, use one of the system views.

Wanna learn more tricks for free?

Check out our free T-SQL Level Up online class – we guarantee it’s the best T-SQL training trailer you’ve ever seen:

Learn more and take the T-SQL course now.


Why Your Slow SQL Server Doesn’t Need a SQL Consultant (or Does It?)

SQL Server

Your SQL Server is slow – but should you call us in? Jeremiah and Brent had a throw-down the other day, and we figured we’d capture it here on the blog.

Brent Says You Do, and Here’s Why

5. You’ve been blindly throwing money at the problem without results. You’re on a first name basis with your local SAN salesperson. Your server racks are glistening with blue and green lights. But somehow, when users hit your app, they’re not as happy as your sysadmins – or your SAN salesperson. Before you burn another stack of Benjamins, it’s time to get an outside opinion.


4. You’re doing something for the first time. Sure, you’re pretty good at Googling your way out of trouble, but the company’s starting to make riskier and riskier gambles on data storage. Maybe you should talk to someone who’s faced this problem before.

3. You’re relying on the manual. I love Microsoft as much as the next guy – especially now that they brought out a new ergonomic keyboard – but Books Online doesn’t tell the whole truth. When Microsoft unveils a new feature, they talk about all the positives, but they don’t always disclose the drawbacks. Get a reality check before you bet the farm on PowerFilePivotProRT, and hear what our other clients are doing to accomplish the same goal.

2. You need answers faster. We get together on a Monday, and by end of day Wednesday, you’ve got a prioritized action plan showing you how to make the pain go away by the end of the week. You get the knowledge and confidence to keep going without expensive long-term consultants. You’re really close – you just need our 3-day SQL Critical Care® to unlock the tips and tricks to make it work.

1. Developers can get back to adding features. Your real business isn’t SQL Server administration – it’s adding features to your app to make your end user happier. Bring us in, get the answers, and get back to work.

Jeremiah Says You Don’t, and Here’s Why

5. You’re probably blindly throwing money at the problem without results. Unless a consultant is willing to provide a list of happy customers, there’s no way to verify that they know something. Heck, even if they do provide a list of happy customers, you have no way of knowing that Luanne in IT management isn’t really someone’s Aunt Mavis.


4. Best practices aren’t universal. Every situation is different and the most complicated scenarios require a deep understanding of business goals, features, and SLAs. Consultants can help you understand best practices, but you’re the only person who knows what’s right in your environment. If you’re doing something for the first time and your Google-fu is running out, you can’t expect much help from outside.

3. Peer pressure shouldn’t change your decisions. We jokingly call this “design by Hacker News”. Just because a startup, our clients, or your next door neighbor are doing something, that doesn’t mean it’s the right fit for you. For many application feature decisions, it’s easy to build two or three prototype solutions and decide on the best one.

2. You need performance yesterday. Rather than wait a few weeks for a reputable consultant to show up, have you considered buying memory? If you’re on SQL Server Standard Edition and you have less than 64GB of memory, just buy more RAM. If you’re on EE and you have less RAM than you have data, why not max out that server with 16GB DIMMs; they’re cheap and you can solve most code sins with memory. Heck, we even recommend buying memory as one of the first ways to solve problems quickly.

1. Developers: understand your features. While developers should be adding features, they also need to understand the consequences of those features. Some functionality that’s possible in SQL Server requires an understanding of how to write queries to take advantage of those features – filtered indexes, indexed views, and ColumnStore indexes immediately spring to mind. The best way to understand a feature is to get in the database, make it work, and then make it work fast.


PREEMPTIVE_OS_WRITEFILEGATHER and Instant File Initialization (IFI)

SQL Server

The other day I was running a script I’d written to expand out some tables in the AdventureWorks sample database for some testing. The script was taking a long time, and at first I thought I was just spoiled because I usually run my tests on my laptop, whose storage is backed by SSDs. In this case I was running my test against a SQL Server instance up in AWS.

After a delicious fifteen minute coffee break, I realized something was wrong– it STILL wasn’t done. I grabbed sp_BlitzFirst® to do a quick check and see if my storage was really THAT slow.

I ran sp_BlitzFirst® with a 60 second sample, like this:
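The parameter names here are the ones from the publicly available version of the procedure:

EXEC sp_BlitzFirst @Seconds = 60, @ExpertMode = 1;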

Here’s what I saw (click for a larger version):

Lack of Instant File Initialization

PREEMPTIVE_OS_WRITEFILEGATHER means slow file growths

In a 60 second sample, I’d been waiting for PREEMPTIVE_OS_WRITEFILEGATHER for 36 seconds. Whoops!

The script I was running was making tables larger. The database was out of room, and was growing the data file in small autogrowth increments. Each of these automatic growths was stopping to “zero out” the files, because I hadn’t enabled Instant File Initialization in SQL Server.

How I enabled Instant File Initialization

My test instance is set up a bit uniquely – it uses UNC paths for the SQL Server databases. This is supported in SQL Server 2012, and I was testing it out in a Windows Failover Cluster. (If you use shares for files, you need to be very careful about redundancy and availability of course, but that’s a topic for another post.)

I’d forgotten to enable Instant File Initialization for the SQL Server Service on the server that hosted the file shares! Wherever Windows will actually grow the files, the SQL Server Service account needs the “Perform Volume Maintenance Tasks” right.

Using the Local Security Policy to Grant “Perform Volume Maintenance Tasks”

Here’s how to grant the rights:

First, open the Local Security Policy. I usually do this by running the command secpol.msc, but you can also find it in the “Administrative Tools” menu.

Opening the Local Security Policy

Then navigate to Local Policies -> User Rights Assignment -> Perform Volume Maintenance Tasks.

Perform Volume Maintenance Tasks

Add the SQL Server Service account. The change will take effect the next time you restart SQL Server.

Grant IFI to the SQL Server Service Account

So, did it improve my performance?

Heck yeah, it did! On the first run I got tired of waiting, killed my expansion script after 20 minutes, and just dealt with the rollback. After enabling Instant File Initialization on the file server that holds the shares where my database files live, the expansion script finished in four minutes and 48 seconds. Shazam!
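
One last note: on newer builds (SQL Server 2016 SP1 and later, plus later 2012 and 2014 service packs, if memory serves), you can confirm the right was picked up without waiting for a file growth:

    -- Y means the engine can skip zeroing out new data file space.
    SELECT servicename, instant_file_initialization_enabled
    FROM sys.dm_server_services;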


Using SQL Server’s Table Valued Parameters

SQL Server
36 Comments

Table valued parameters (TVPs) are nothing new – they were introduced in SQL Server 2008. TVPs are a great way to get data into SQL Server, and yet very few developers use them.

Getting Started with TVPs

In order to get started, we need to create a user defined type. This type is necessary so we have a well defined interface between SQL Server and the outside world – you can’t just pass a bunch of junk in as a parameter to SQL Server… well, you could, but that’s called XML.

Let’s create that user defined type:
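
The exact definition isn’t shown here, so assume a sketch where the type pairs a salesperson with a territory (the column names are my stand-ins):

    CREATE TYPE dbo.SalesPersonTerritory AS TABLE
    (
        SalesPersonID INT NOT NULL,
        TerritoryID   INT NOT NULL
    );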

Now we can create variables using the dbo.SalesPersonTerritory type whenever we want. It’s just this easy:
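
Something like this, sticking with the stand-in columns from above:

    DECLARE @Assignments dbo.SalesPersonTerritory;

    INSERT INTO @Assignments (SalesPersonID, TerritoryID)
    VALUES (274, 1), (275, 4);

    SELECT SalesPersonID, TerritoryID
    FROM @Assignments;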

Using Table Valued Parameters With Stored Procedures

Having a special table type is really convenient, but it doesn’t help if you can’t use it, right? Let’s assume that you’re calling a stored procedure and you’re sick of joining lists of strings on the client and then splitting them apart on the server. What you really need is the ability to pass a fully fledged table across the wire and into SQL Server.
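
On the SQL Server side, the stored procedure just declares a parameter of the table type – note the required READONLY keyword. Here’s a sketch using the stand-in type from above (the procedure name is made up):

    CREATE PROCEDURE dbo.AssignTerritories
        @Assignments dbo.SalesPersonTerritory READONLY
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Treat the TVP like any other rowset: join it, filter it, insert from it.
        SELECT SalesPersonID, TerritoryID
        FROM @Assignments;
    END;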

This is relatively easy to accomplish. From C#, the code would look something like this:

That’s really all there is to it.

C# Tips for Table Valued Parameters

There are a few other things that developers can do to make their lives easier when working with table valued parameters. A DataTable, just like a table in SQL Server, should have types declared for all columns. While it’s easy enough for developers to create these tables on the fly, that won’t cut it in production code – hand-writing that boilerplate over and over invites mistakes.

To make life easier, developers can create helper methods in their application for working with table valued parameters – a dedicated chunk of code that instantiates a DataTable whose columns match up with the table valued parameter’s definition.

Seasoned developers will even create methods that let them drop a list of objects straight into an appropriate DataTable. There are many different ways to build convenience methods into code and make it easy for development teams to work with TVPs.

For simple data type matching, this works well. If you’re dealing with more complex data types, you’ll want to check out SQL-CLR Type Mapping to make sure you get the right data type. You’ll notice that some data types (varchar, for instance) have no direct equivalent in the .NET Framework. Sometimes you just have to lose some fidelity – make wise decisions, it gets crazy out there.

These same techniques can be used with a string of ad hoc SQL, too. TVPs aren’t limited to stored procedures; they can be used anywhere you’re executing parameterized code.

Gotchas of TVPs

There are two big gotchas with TVPs.

First: the table variable that comes in as a table valued parameter cannot be changed. You’re stuck with whatever values show up. No inserts, updates, or deletes can be applied.

Second: table valued parameters are still table variables – they get terrible cardinality estimates.

We can get around both of these problems with the same technique – copy the contents of the TVP into a temp table. Although it adds an extra step, I’ve found that copying the contents of the TVP into a temporary table leads to better execution plans and much less confusion during development and troubleshooting.
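
Continuing the sketch from earlier, the copy is just one extra statement at the top of the procedure (again, all names are stand-ins):

    CREATE PROCEDURE dbo.AssignTerritories_TempTable
        @Assignments dbo.SalesPersonTerritory READONLY
    AS
    BEGIN
        SET NOCOUNT ON;

        -- The temp table gets real statistics and can be modified freely.
        SELECT SalesPersonID, TerritoryID
        INTO #Assignments
        FROM @Assignments;

        -- Do the real work against #Assignments from here on out.
        SELECT SalesPersonID, TerritoryID
        FROM #Assignments;
    END;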

Summary – Using Table Valued Parameters isn’t Hard.

TVPs aren’t difficult to use. They’re just different and require a different mindset. By using TVPs, developers can pass many rows into a stored procedure and build far more complex logic and behavior than is possible using only single-value parameters.


Statistics Matter on Temp Tables, Too

SQL Server, TempDB
18 Comments

Temp tables are like real tables, just a little trickier.

When you’re starting out writing TSQL, it’s easy to want to do all your work in a single query. You learn about derived sub-queries, CROSS APPLY statements, and common table expressions. Suddenly, each of your queries is so complex that you hardly know what you’re doing.

But complex TSQL can be quite fragile. Small mis-estimations in parts of the plan can cause things to go very much awry in other regions of the plan.

Temp tables can help performance!

One treatment for slow, complex TSQL is to break it into multiple steps. You create a temp table, populate it, and reference it in future queries. This can lead to improved estimates and reliability, because temporary tables can have column and index related statistics. Those statistics help the optimizer estimate how many rows will be returned at different points along the way.

But this tuning technique isn’t foolproof. It’s easy to forget that statistics can be tricky – just like with real tables. Let’s take a look and prove that statistics issues from “normal” tables also matter on temp tables.

Out of date statistics on temp tables can hurt performance

The first step is to load up a temp table with some rows:
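
The original load script isn’t reproduced here; a stand-in version (my table and column names) that produces the same row counts looks like this:

    CREATE TABLE #TempStats
    (
        Id        INT IDENTITY(1,1) PRIMARY KEY,
        TempValue INT NOT NULL
    );

    -- 100K rows where TempValue = 1
    INSERT #TempStats (TempValue)
    SELECT TOP (100000) 1
    FROM sys.all_columns AS a
    CROSS JOIN sys.all_columns AS b;

    -- 10K rows where TempValue = 2
    INSERT #TempStats (TempValue)
    SELECT TOP (10000) 2
    FROM sys.all_columns AS a
    CROSS JOIN sys.all_columns AS b;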

We now have:

  • 100K rows where TempValue=1
  • 10K rows where TempValue=2

Now let’s say I query the temp table in some way. My where clause is for a specific TempValue. The act of running this query causes a column level statistic to be created on the TempValue column:
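
Something along these lines, using the stand-in names:

    -- Filtering on TempValue auto-creates a column statistic on it.
    SELECT COUNT(*)
    FROM #TempStats
    WHERE TempValue = 2;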

Correct Statistics Estimate-Query1

We can see the statistic, too. This code is modified from Books Online and uses a DMV available in SQL Server 2012 SP1 and higher:
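
The exact query isn’t shown here; a version adapted from that Books Online sample and pointed at my stand-in temp table looks roughly like this (the DMV is sys.dm_db_stats_properties, and the tempdb qualification is there because that’s where temp tables and their statistics actually live):

    SELECT s.name AS stats_name,
           s.auto_created,
           sp.last_updated,
           sp.rows,
           sp.rows_sampled,
           sp.modification_counter
    FROM tempdb.sys.stats AS s
    CROSS APPLY tempdb.sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
    WHERE s.object_id = OBJECT_ID('tempdb..#TempStats');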

Statistics Super Fresh Not Modified

I continue on, and insert 11K rows for TempValue=3:
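
Again with stand-in names – note that 11K modifications isn’t enough to cross the automatic statistics update threshold (roughly 500 rows plus 20% of the table at the time the statistic was built):

    -- 11K rows where TempValue = 3
    INSERT #TempStats (TempValue)
    SELECT TOP (11000) 3
    FROM sys.all_columns AS a
    CROSS JOIN sys.all_columns AS b;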

At this point I have:

  • 100K rows where TempValue=1
  • 10K rows where TempValue=2
  • 11K rows where TempValue=3

Now I run another query which wants to look ONLY at the 11K rows where TempValue=3. SQL Server completely mis-guesses how many rows will be returned – instead of 11K rows, it guesses that I’ll just get a SINGLE row:

Incorrect Statistics Estimate-Query2

Looking at the statistics, we can see why – the statistic that was created when I ran my first SELECT query wasn’t triggered to update:

Statistics-Modification Counter

Let’s look at the histogram

When we look at the histogram for the statistic, it confirms that it doesn’t know about any rows with a TempValue higher than two:
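
DBCC SHOW_STATISTICS makes this easy to see for yourself – plug in whatever name the auto-created statistic got on your system (it will look like _WA_Sys_…, and the earlier statistics query shows it):

    -- Replace the second argument with the _WA_Sys_... name from the query above.
    -- The highest RANGE_HI_KEY in the histogram is 2, so the optimizer
    -- has no idea the TempValue = 3 rows exist yet.
    DBCC SHOW_STATISTICS ('tempdb..#TempStats', '_WA_Sys_00000002_XXXXXXXX')
        WITH HISTOGRAM;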

Click me to see a bigger version

Bad Statistics can lead to bad optimization

If I were joining from this temp table to other large tables, that skewed estimate could ripple through the rest of the plan – and it could cause big problems. The optimizer might select inefficient joins, not allocate enough memory, or choose not to go parallel if estimates are skewed too low.

Temp tables are a great tool to use, but remember a few key rules:

  • Column level statistics will be created for you when you run queries using the temp table
  • If you modify the temp table in multiple steps, those statistics may get out of date

In complex procedures, manually creating and updating statistics on columns or indexes can save you time and improve execution plan quality.
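
In this example, one statement after the second insert would have given the optimizer a fighting chance (stand-in names again):

    -- Refresh every statistic on the temp table after the big data change,
    -- so the TempValue = 3 query gets a sane row estimate.
    UPDATE STATISTICS #TempStats;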

Could Trace Flag 2371 Help?

If you find that you have a lot of existing code that has this problem, Trace Flag 2371 may help. This trace flag is available as of SQL Server 2008 R2 SP1, and it changes the default threshold for when statistics automatically update.

The net effect of the trace flag is to cause statistics updates to happen more frequently on large tables. You must turn it on server wide for it to take effect — it doesn’t work for individual sessions.
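
If you do decide to test it, it goes on globally – either as a startup parameter or with DBCC TRACEON, where the -1 means server-wide:

    -- Enable server-wide until the next restart.
    DBCC TRACEON (2371, -1);

    -- See which trace flags are currently active.
    DBCC TRACESTATUS (-1);

    -- To survive restarts, add -T2371 as a startup parameter in
    -- SQL Server Configuration Manager instead.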

I’m not a huge fan of enabling any trace flag unless you’re solving a specific problem and documenting it. This trace flag is not very widely used in my experience, so flipping it on puts you in a small subset of people. If you start hitting wacky issues and need to resolve them, the trace flag suddenly gives you a whole new dimension of complexity. So please don’t think of this trace flag as “preventive medicine” – only use it if you don’t have other good ways to solve the problem.


What If You Knew The Server Would Crash Today?

SQL Server
7 Comments

This morning you woke up with a terrible premonition – you were absolutely sure your most important production database server was going to crash today.

What would you do?

Step 0: stop for coffee, because it’s gonna be a long day.

Here’s a quick list of places to start:

  1. Identify where the most recent backups are kept (there’s a quick sketch of this check after the list)
  2. Make sure the backup schedule and frequency lines up with what the business wants
  3. Make sure the backups are getting offsite
  4. Make sure the monitoring software is watching this server, and sending me alerts
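
For the first item, a quick check against msdb shows when the most recent full backup of each database finished and where it was written. It’s only a sketch – msdb only knows about backups taken through this instance, and databases that have never been backed up won’t show up at all:

    -- Most recent full backup per database, with its destination.
    SELECT b.database_name,
           b.backup_finish_date,
           bmf.physical_device_name
    FROM msdb.dbo.backupset AS b
    JOIN msdb.dbo.backupmediafamily AS bmf
        ON bmf.media_set_id = b.media_set_id
    WHERE b.type = 'D'   -- D = full database backup
      AND b.backup_finish_date =
          (SELECT MAX(b2.backup_finish_date)
           FROM msdb.dbo.backupset AS b2
           WHERE b2.database_name = b.database_name
             AND b2.type = 'D')
    ORDER BY b.backup_finish_date;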

When was the last time you made a list like this and checked it twice? Do it this week, and schedule yourself a recurring task in your favorite to-do software to check this every quarter.

Because you’re not going to get these visions.