Getting JSON out of SQL Server

SQL Server 2016 brings support for JSON. For those of you who don’t know, JSON is JavaScript Object Notation – it’s a way of representing application level objects as a string. Imagine XML without anywhere near as much line noise, and you’re pretty much there.

Our Sample Document

In this example, we want to retrieve orders and their line items from the AdventureWorks database. We need to get the data out in a specific format for fast processing. In the end, our data should look something like:

{ "orders" : 
    [ /* an array of orders */
        { "order" : 
            { "number" : "123456" ,
                /* more stuff goes here */
                "items" : 
                    [
                        { "item": { /* item info goes here */ } }
                    ]
            }
        }
    ]
}

Holy cow, that’s a lot of stuff just to represent an order and its line items.

Here’s our starter query:

SELECT  h.SalesOrderNumber AS [order.number],
        h.DueDate AS [order.due_date],
        h.ShipDate AS [order.ship_date],
        h.PurchaseOrderNumber AS [order.po_number],
        h.TotalDue AS [order.total_due],
        h.Freight AS [order.freight],
        d.ProductID AS [item.product_id],
        d.OrderQty AS [item.quantity],
        d.UnitPrice AS [item.unit_price],
        d.UnitPriceDiscount AS [item.discount]
FROM    Sales.SalesOrderHeader h
        JOIN Sales.SalesOrderDetail d ON h.SalesOrderID = d.SalesOrderID
WHERE   h.SalesOrderNumber = 'SO43659'

First Attempt

The documentation hints that we might just be able to use the `FOR JSON AUTO` option to automatically format our JSON. What could go wrong?
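Presumably, that first attempt is just our starter query with FOR JSON AUTO bolted onto the end – a quick sketch of what we ran, where only the last line is new:

SELECT  h.SalesOrderNumber AS [order.number],
        h.DueDate AS [order.due_date],
        h.ShipDate AS [order.ship_date],
        h.PurchaseOrderNumber AS [order.po_number],
        h.TotalDue AS [order.total_due],
        h.Freight AS [order.freight],
        d.ProductID AS [item.product_id],
        d.OrderQty AS [item.quantity],
        d.UnitPrice AS [item.unit_price],
        d.UnitPriceDiscount AS [item.discount]
FROM    Sales.SalesOrderHeader h
        JOIN Sales.SalesOrderDetail d ON h.SalesOrderID = d.SalesOrderID
WHERE   h.SalesOrderNumber = 'SO43659'
FOR JSON AUTO -- let SQL Server figure out the JSON structure on its own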

Syntax is hard

Our results aren’t what we want at all! The structure is close, but not quite what we’re looking for.

Getting the Structure Right

We are getting all the data we want, but we’re not getting the structure that we want. Our requirements say that we need an `orders` array, and each `order` should be a separate item in that array. An order’s line items should also be stored in a separate array.

The square brackets make an array in JSON. The curly brackets denote an object.

Using the ROOT() option will nest our results in a root object. In this case, we can use ROOT('orders') to create a basic orders object to hold our list of orders.
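To see what ROOT() does on its own, here’s a trivial sketch – the column name x is made up purely for illustration, and the expected output is shown in the comment:

SELECT 1 AS x
FOR JSON PATH, ROOT('orders')
-- should return: {"orders":[{"x":1}]}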

Making each item part of an array inside the order is going to be trickier. We can still use FOR JSON AUTO to get the job done:

SELECT  h.SalesOrderNumber AS [order.number],
        h.DueDate AS [order.due_date],
        h.ShipDate AS [order.ship_date],
        h.PurchaseOrderNumber AS [order.po_number],
        h.TotalDue AS [order.total_due],
        h.Freight AS [order.freight],
        (SELECT d.ProductID AS [item.product_id],
                d.OrderQty AS [item.quantity],
                d.UnitPrice AS [item.unit_price],
                d.UnitPriceDiscount AS [item.discount]
         FROM   Sales.SalesOrderDetail d 
         WHERE  h.SalesOrderID = d.SalesOrderID 
         FOR    JSON AUTO) AS items
FROM    Sales.SalesOrderHeader h
WHERE   h.SalesOrderNumber = 'SO43659'
FOR JSON AUTO, ROOT('orders')

We’re almost right!

So wrong I can taste it.

Well, that’s … something. To get the output we want, we’re going to have to use a correlated subquery – fans of generating XML in SQL Server may remember jumping through hoops to get the right structure out of SQL Server.

We’re on the right track, but what we really need is better formatting.

The Best Solution, For Now

Since FOR JSON AUTO isn’t generating the JSON output that we want, we can go down a manual route and use FOR JSON PATH. If this is starting to sound like XML, you’re absolutely right.

After attaching a debugger to SQL Server, Paul White shows that the JSON writer is using the old XML code under the hood.

Anyway, the SQL to generate the right JSON document:

SELECT  h.SalesOrderNumber AS [order.number],
        h.DueDate AS [order.due_date],
        h.ShipDate AS [order.ship_date],
        h.PurchaseOrderNumber AS [order.po_number],
        h.TotalDue AS [order.total_due],
        h.Freight AS [order.freight],
        (SELECT d.ProductID AS [item.product_id],
                d.OrderQty AS [item.quantity],
                d.UnitPrice AS [item.unit_price],
                d.UnitPriceDiscount AS [item.discount]
         FROM Sales.SalesOrderDetail d 
         WHERE h.SalesOrderID = d.SalesOrderID
         FOR JSON PATH
        ) AS [order.items]
FROM    Sales.SalesOrderHeader h
WHERE   h.SalesOrderNumber = 'SO43659'
FOR JSON PATH, ROOT('orders')

Check out the correct (and formatted) results!

Getting JSON out of SQL Server

It’s easy to get incorrectly formatted JSON out of SQL Server. If you do need to get well-formatted JSON out of SQL Server, be prepared to put in a lot of work coercing your query into the right shape. You may find that you need to write a large number of nested subqueries, too.

Fake moustaches, barbecue, and SQL Server.

Must be Dell DBA Days:

Join us live as we performance tune – and break – SQL Servers:

  • Thursday Morning: Finding Your Slow SQL Server’s Bottlenecks in Record Time with Wait Stats
  • Thursday Afternoon: How to Prove Hardware is the Problem
  • Friday Morning: SQL Server High Availability, Disaster Recovery, and Licensing Explained
  • Friday Afternoon: Watch SQL Server Break and Explode

Register now and join in.

Is leading an index with a BIT column always bad?

“Throughout history, slow queries are the normal condition of man. Indexes which permit this norm to be exceeded — here and there, now and then — are the work of an extremely small minority, frequently despised, often condemned, and almost always opposed by all right-thinking people who don’t think bit columns are selective enough to lead index keys. Whenever this tiny minority is kept from creating indexes, or (as sometimes happens) is driven out of a SCRUM meeting, the end users then slip back into abject query performance.

This is known as “Business Intelligence.”

–Bobby Q. Heinekens

Fake quotes and people aside

Let’s look at a scenario where you have a BIT column that’s fairly selective, and perhaps the rest of your predicates are ranged. This isn’t so out of the ordinary, especially because people like to know when stuff happened and how many times it happened.

“How many lunches have I eaten today?”

“Where did that bear learn to drive?”

“Am I being slowly disassembled on a molecular level by millions of tiny black holes?”

Yes, China or Florida, and Probably!

So let’s figure this out quick

;WITH E1(N) AS (
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL  ),                          
E2(N) AS (SELECT NULL FROM E1 a, E1 b, E1 c, E1 d, E1 e, E1 f, E1 g, E1 h, E1 i, E1 j),
Numbers AS (SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N FROM E2)
SELECT  
        IDENTITY (BIGINT, 1,1) AS [ID] , 
        ABS(CHECKSUM(NEWID()) / 100000000) + 1 AS [CustomerID],
        ISNULL(CONVERT(DATE, DATEADD(MINUTE, -[N].[N], GETDATE())), '1900-01-01') AS [OrderDate],
        CASE WHEN [N].[N] % 19 = 0 THEN 1 ELSE 0 END AS [isBit]
INTO [NotAshleyMadisonData]
FROM    [Numbers] [N]
ORDER BY [OrderDate]
;

ALTER TABLE [NotAshleyMadisonData] ADD CONSTRAINT [PK_NotAshleyMadisonData] PRIMARY KEY CLUSTERED ([ID]) WITH (FILLFACTOR = 100)

CREATE NONCLUSTERED INDEX [IX_BITFIRST] ON [dbo].[NotAshleyMadisonData]
([isBit], [OrderDate])

CREATE NONCLUSTERED INDEX [IX_DATEFIRST] ON [dbo].[NotAshleyMadisonData]
([OrderDate], [isBit])

Here’s one table with one million rows in it. Since it’s random, if you run this on your own it may turn out a little different for you, but I’m sure you can adapt. You are, after all, wearing the largest available diaper size.

I’ve also gone ahead and created two indexes (neither one filtered!) to avoid the appearance of impropriety. The first one goes against the oft-chanted mantra of not leading your index with a BIT column. The other complies with your thumb-addled rules of index creation where your more unique column comes first, though not with the opposing rule to put equality predicates before range predicates in the key.

Only 52,631 rows out of a million have a BIT value of 1. And with the exception of the first and last date values, each date has 75 or 76 rows where the BIT column is 1.
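If you want to sanity-check that distribution on your own copy (the exact numbers will drift a little, since the data is random), here’s a quick sketch:

-- how many rows have isBit = 1 vs 0?
SELECT [isBit], COUNT_BIG(*) AS [Rows]
FROM [dbo].[NotAshleyMadisonData]
GROUP BY [isBit];

-- how many isBit = 1 rows per date?
SELECT [OrderDate], COUNT_BIG(*) AS [BitOneRows]
FROM [dbo].[NotAshleyMadisonData]
WHERE [isBit] = 1
GROUP BY [OrderDate]
ORDER BY [OrderDate];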

If you had to do this in your head, which would you do first? Find all the BIT = 1 rows, and then only count occurrences from the desired range? Or would you find your start and end dates and then count all the BIT = 1 values?

(Hint: it doesn’t matter, you’re not the query engine. Unless you’re Paul White. Then maybe you are. Has anyone seen them in the same room together?)

Images of Query Plans

SELECT COUNT_BIG(*) AS [Records]
FROM [dbo].[NotAshleyMadisonData] AS [namd]
WHERE [namd].[isBit] = 1
AND [namd].[OrderDate] BETWEEN '2013-09-25' AND '2015-08-20' --All dates

SELECT COUNT_BIG(*) AS [Records]
FROM [dbo].[NotAshleyMadisonData] AS [namd]
WHERE [namd].[isBit] = 1
AND [namd].[OrderDate] BETWEEN '2013-09-25' AND '2014-09-01' --About half the dates

SELECT COUNT_BIG(*) AS [Records]
FROM [dbo].[NotAshleyMadisonData] AS [namd]
WHERE [namd].[OrderDate] BETWEEN '2013-09-25' AND '2014-09-01'
AND [namd].[isBit] = 1 -- Flipping them doesn't change anything

SELECT COUNT_BIG(*) AS [Records]
FROM [dbo].[NotAshleyMadisonData] AS [namd]
WHERE [namd].[OrderDate] = '2013-09-26' --It's not until here that the other index gets used
AND [namd].[isBit] = 1

Peter Godwin will never play my birthday party.

Put on your recap cap

This is another case where knowing the data, and knowing the query patterns in your environment is important. It’s easy to overlook or reject obvious things when you’re bogged down by dogma.

The stamp of approval for an idea shouldn’t come from blogs, forums, white papers, or hash tags. It should come from how much something helps in your environment.

Thanks for reading!

New and Improved: Relaunching Our Online Training – With a Daily Deals Sale

About two years ago, we launched our online training classes, and you loved ’em. We sold thousands and it became a pretty big part of our business.

The New Hotness

We’ve launched our all-new online training video store with features like:

  • Comments – so you can ask questions on each video module
  • More powerful e-commerce – VAT, group discounts, Paypal, PDF receipts
  • HTML-based modules – so we can include code samples, images, and posts
  • Better analytics – to understand which modules you like, and which ones to tweak
  • Course reviews – we’ve had thousands of buyers, and it’s time to let them speak
  • Better coupons – including getting a discount for liking us or tweeting (and boy, Black Friday is gonna be fun this year)

To celebrate the launch, we’re running a set of daily deals:

  • Today (Monday 8/24) – The Everything Bundle: $599 – a bundle of all of our online video courses, normally $1,553, for just $599. Only 50 available. Gone!
  • Tues (8/25) – Chicago SQL Server Performance Troubleshooting class seats – normally $3,195, just $995. Only 3 available. Sold out!
  • Weds (8/26) – $99 SQLSaturday Pittsburgh Pre-Con seats – normally $199, this one-day Developer’s Guide to SQL Server Performance is half off. 25 available. Gone!
  • Thurs (8/27) – Mystery Pocket Bundles – we’ll post a handful of $29-$99 deals with secret unknown contents. On sale now. Sold out!
  • Fri (8/28) – all $299 courses 50% off – for the first 50 buyers of each course.

Go check it out!

Back up every database on the production server. No exceptions.

Before our consulting engagements start, we send clients a couple of queries to run. One of them outputs a list of databases, and the last backup date for each.
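A minimal sketch of that kind of query – not our exact script, and this one only considers full backups recorded in msdb’s history – looks something like:

SELECT  d.name AS [database_name],
        MAX(b.backup_finish_date) AS [last_full_backup] -- NULL means no full backup on record
FROM    sys.databases d
LEFT JOIN msdb.dbo.backupset b
        ON  b.database_name = d.name
        AND b.type = 'D' -- 'D' = full database backup
GROUP BY d.name
ORDER BY [last_full_backup];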

You wouldn’t believe how often this query produces surprises like:

  • “Wait – your script must be wrong. I swear we’re backing things up. We’ll fix that.”
  • “Whoa – it turns out we only run full backups monthly, and differentials weekly, but if we add a new database in the middle of the month, it doesn’t get backed up for weeks. We’ll fix that.”
  • “It’s okay – we don’t need to back up that database because it doesn’t have anything useful.”

The last one is actually the scariest one because often we discover real production data in databases that aren’t supposed to be used.

For example, at one client, there were two databases: AppV3 and AppV2. AppV2 hadn’t been backed up in months, and the architect assured me it was a really old version of the app that was just kept around for checking data every now and then.

However, as we started the engagement, we discovered that the login queries were still hitting AppV2 and storing data in there about user profiles. If they’d have lost that database, their users would have been destroyed, and their app would be down.

If the database is important enough to live on the production database server, then it needs to be considered a production-quality database, and it needs to get backed up. Full stop. Don’t gamble that no one is putting data in it.

How Do You Manage DBAs? Part 2: Training Budgets

Your DBA wants to get better at what they do, but … they don’t know exactly what it is they do, or how to get the business to pay for it. Here’s how.

My I Need to Learn List – man, where do I even begin – by Luis Llerena

Encourage them to start an “I Need to Learn” List – maybe as a text file, in a notebook, or in a shared spreadsheet.

When they’re working on help desk tickets, server outages, or architectures, and they want to learn something, they open up the list and add:

  • The problem they’re trying to solve (and be specific here)
  • Who needs them to solve it (departments, VPs, staff names)
  • How much time or money they spent working on the solution

That last one is important because your team is probably always going to find a solution even if they don’t understand the problem well. They’re going to reboot the box to make the problem go away, or apply some script from an Internet stranger, and they’re not going to feel terribly confident about what they just did. As far as the business is concerned, Mission Accomplished for now – and that’s fine. The “I Need to Learn” List is about knowing where these gaps were and what they cost at the time.

The DBA is never allowed to add a line “because I’m curious” or “because it sounds awesome” or “because all the cool kids are doing it.” This is about real business pains, because that’s where money comes from.

How to Turn the “I Need to Learn” List Into Money

Once a quarter, meet with your DBA and go through the list.

Look for a common theme. There’s probably one area of trouble that pops up over and over, like clusters or query tuning.

This forms your business case – you can say things like:

“This quarter, we’ve spent 35 hours of outage troubleshooting on the sales team’s SharePoint cluster and the BI team’s data warehouse. We need to cut that troubleshooting time down, so I need $5k to send Ann to DBA training for that topic.”

“This quarter, Andy has spent 80 hours tuning slow queries for the developers. He’s run out of easy buttons, and he doesn’t feel confident with the advanced problems they’re hitting now. We need to invest $5k in training him to make that tuning time more effective going forward.”

This suddenly makes it easier for management to open wallets, and you can even go to these other departments to get them to fork out training money as well. (Hey, look, I gotta be optimistic here.)

Microsoft SQL Server DBA Training Options

We’re lucky in the SQL Server community because we’ve got so many options. Here are the most popular ones – note that the price estimates include travel, but not human hourly time:

Blogs – free, but unorganized – you can learn a lot by regularly reading blogs, but they don’t help you start from zero and go to hero in a short amount of time. The articles aren’t focused on a clear progression. This option is best for DBAs who have a long time to learn and no urgent requirements.

Books – $50, but slow and not interactive – for some reason, people seem to think that buying a book will make them an expert. I’ve tried keeping them on my bookshelf for months, but that just doesn’t do the trick. You have to allocate a lot of time to actually read the book, and most human beings can’t read a technical book start to finish in one week while actually absorbing the material. You read some, get up, do other things, come back, read some more, absorb some of it, and so forth. Technical books often take weeks to fully absorb.

Training videos – $200-$300, ready when you are, but require focus – these cover a technical topic in a logical order, start to finish, broken up in easy-to-digest modules. You can spend 30-60 minutes a day absorbing a topic, tackle a few demos, and take a quiz to see how they’re progressing. The material is ready whenever the DBA is ready to start learning the topic, and you can learn it as fast as you can dedicate the time.

SQLSaturdays – $250-$500, but not often, and topics aren’t organized – these awesome one-day events take place all over the world, once a year in major cities. They’re run by volunteers, and often you can see great sessions – all free. Don’t just wait for them to occur in your city though – at this price, it can make sense to drive to a regional event. Thing is, the sessions are relatively short, and you can’t learn a topic start-to-finish at these events. Again, best for DBAs who have a long time to learn and no urgent requirements.

Pre-Conference Sessions – $500-$1,000 – not often, but fantastic deal – on the days before & after major conferences, speakers put on one-day classes at a bargain price (typically $200-$500). This is a good way to learn a topic in an organized way, although you can only go so deep in one day. You don’t have to attend the entire conference to buy a pre-con seat, which can sometimes make this a heck of a bargain. Conferences with pre-con and post-con sessions include the PASS Summit, SQL Intersection, SQLbits in the UK, and SQL Server Live 360.

Full Conferences – $2,000-$4,000, but topics aren’t organized – these are 2-4 days of material, broken up into 60-75 minute sessions, but the sessions are all over the map. You shouldn’t expect to learn a single subject start to finish here, but rather learn a variety of things about the current and next version of SQL Server. That list of conferences above is a good starting point.

Training classes – $3,000-$5,000, in-depth coverage of a topic – if you need to make a serious skills investment, set aside a week of your calendar, put someone else on the on-call duty, and go to a training class to focus for a week. This gives you the most interactive learning time because you can ask plenty of questions – to both the instructors and your fellow attendees – and maximizes your chance of fixing a serious gap on your “I Need to Learn” list.

Make the DBA Prove the Money was Well-Spent

No matter which training method you and the DBA choose together, give them a homework assignment.

They need to write a Book Report – summarize what they learned, in their own words, and how it relates to the pain points they wrote down in their “I Need to Learn” List. Have them talk about how they would have handled each of those past situations differently, knowing what they know now.

Telling them about this homework ahead of time is critical.

It keeps them focused on why you’re sending them to training – not to buddy around or get drunk with strangers, but to learn things to relieve real business pains.

And that’s how you get the training budget renewed.

Performance Benefits of Unique Indexes

sql server loves unique indexes

Why? Because it’s lazy. Just like you. If you had to spend all day flipping pages around, you’d probably be even lazier. Thank Codd someone figured out how to make a computer do it. There’s some code below, along with some screen shots, but…

TL;DR

SQL is generally pretty happy to get good information about the data it’s holding onto for you. If you know something will be unique, let it know. It will make better plan choices, and certain operations will be supported more efficiently than if you make it futz around looking for repeats in unique data.

There is some impact on inserts and updates as the constraint is checked, but generally it’s negligible, especially when compared to the performance gains you can get from select queries.

So, without further ado!

Q: What was the last thing the Medic said to the Heavy?

A: Demoooooooooo!

We’ll start off by creating four tables. Two with unique clustered indexes, and two with non-unique clustered indexes, that are half the size. I’m just going with simple joins here, since they seem like a pretty approachable subject to most people who are writing queries and creating indexes. I hope.

USE [tempdb]

/*
The drivers
*/
;WITH E1(N) AS (
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL  ),                          
E2(N) AS (SELECT NULL FROM E1 a, E1 b, E1 c, E1 d, E1 e, E1 f, E1 g, E1 h, E1 i, E1 j),
Numbers AS (SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N FROM E2)
SELECT 
ISNULL([N].[N], 0) AS ID,
ISNULL(CONVERT(DATE, DATEADD(SECOND, [N].[N], GETDATE())),     '1900-01-01') AS [OrderDate],
ISNULL(SUBSTRING(CONVERT(VARCHAR(255), NEWID()), 0, 9) , 'AAAAAAAA') AS [PO]
INTO UniqueCL
FROM [Numbers] N

ALTER TABLE UniqueCL ADD CONSTRAINT [PK_UniqueCL] PRIMARY KEY CLUSTERED ([ID]) WITH (FILLFACTOR = 100)


;WITH E1(N) AS (
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL  ),                          
E2(N) AS (SELECT NULL FROM E1 a, E1 b, E1 c, E1 d, E1 e, E1 f, E1 g, E1 h, E1 i, E1 j),
Numbers AS (SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N FROM E2)
SELECT 
ISNULL([N].[N], 0) AS ID,
ISNULL(CONVERT(DATE, DATEADD(SECOND, [N].[N], GETDATE())),     '1900-01-01') AS [OrderDate],
ISNULL(SUBSTRING(CONVERT(VARCHAR(255), NEWID()), 0, 9) , 'AAAAAAAA') AS [PO]
INTO NonUniqueCL
FROM [Numbers] N

CREATE CLUSTERED INDEX [CLIX_NonUnique] ON dbo.NonUniqueCL ([ID]) WITH (FILLFACTOR = 100)


/*
The joiners
*/

;WITH E1(N) AS (
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL  ),                          
E2(N) AS (SELECT NULL FROM E1 a, E1 b, E1 c, E1 d, E1 e, E1 f, E1 g, E1 h, E1 i, E1 j),
Numbers AS (SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N FROM E2)
SELECT 
ISNULL([N].[N], 0) AS ID,
ISNULL(CONVERT(DATE, DATEADD(SECOND, [N].[N], GETDATE())),     '1900-01-01') AS [OrderDate],
ISNULL(SUBSTRING(CONVERT(VARCHAR(255), NEWID()), 0, 9) , 'AAAAAAAA') AS [PO]
INTO UniqueJoin
FROM [Numbers] N
WHERE [N] < 5000001

ALTER TABLE UniqueJoin ADD CONSTRAINT [PK_UniqueJoin] PRIMARY KEY CLUSTERED ([ID], [OrderDate]) WITH (FILLFACTOR = 100)

;WITH E1(N) AS (
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL  ),                          
E2(N) AS (SELECT NULL FROM E1 a, E1 b, E1 c, E1 d, E1 e, E1 f, E1 g, E1 h, E1 i, E1 j),
Numbers AS (SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N FROM E2)
SELECT 
ISNULL([N].[N], 0) AS ID,
ISNULL(CONVERT(DATE, DATEADD(SECOND, [N].[N], GETDATE())),     '1900-01-01') AS [OrderDate],
ISNULL(SUBSTRING(CONVERT(VARCHAR(255), NEWID()), 0, 9) , 'AAAAAAAA') AS [PO]
INTO NonUniqueJoin
FROM [Numbers] N
WHERE [N] < 5000001

CREATE CLUSTERED INDEX [CLIX_NonUnique] ON dbo.NonUniqueJoin ([ID], [OrderDate]) WITH (FILLFACTOR = 100)

Now that we have our setup, let’s look at a couple queries. I’ll be returning the results to a variable so we don’t sit around waiting for SSMS to display a bunch of uselessness.

DECLARE @ID BIGINT;

SELECT  @ID = [uc].[ID]
FROM    [dbo].[UniqueCL] AS [uc]
JOIN    [dbo].[UniqueJoin] AS [uj]
ON      [uj].[ID] = [uc].[ID]
WHERE   [uc].[ID] % 2 = 0
ORDER BY [uc].[ID];

GO  

DECLARE @ID BIGINT;

SELECT  @ID = [nuc].[ID]
FROM    [dbo].[NonUniqueCL] AS [nuc]
JOIN    [dbo].[NonUniqueJoin] AS [nuj]
ON      [nuj].[ID] = [nuc].[ID]
WHERE   [nuc].[ID] % 2 = 0
ORDER BY [nuc].[ID];

GO

What does SQL do with these?

Ugly as a river dolphin, that one.

Not only does the query for the unique indexes choose a much nicer merge join, it doesn’t even consider going parallel. The batch cost is about 1/3, and the sort is fully supported.

The query against non-unique tables requires a sizable memory grant, to boot.

Looking at the STATISTICS TIME and IO output, there’s not much difference in logical reads, but you can see the non-unique query used all four cores available on my laptop (4 scans, 1 coordinator thread), and there’s a worktable and workfile for the hash join. Overall CPU time is much higher, though there’s only ever about 100ms difference in elapsed time over a number of consecutive runs.

Table 'UniqueJoin'. Scan count 1, logical reads 3969, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'UniqueCL'. Scan count 1, logical reads 3968, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 266 ms, elapsed time = 264 ms.

Table 'NonUniqueCL'. Scan count 5, logical reads 4264, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'NonUniqueJoin'. Scan count 5, logical reads 4264, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 1186 ms, elapsed time = 353 ms.

fair fight

So, obviously going parallel threw some funk on the floor. If we force a MAXDOP of one on the non-unique query, what happens?
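Forcing that is presumably just a MAXDOP hint tacked onto the non-unique query – something like this sketch:

DECLARE @ID BIGINT;

SELECT  @ID = [nuc].[ID]
FROM    [dbo].[NonUniqueCL] AS [nuc]
JOIN    [dbo].[NonUniqueJoin] AS [nuj]
ON      [nuj].[ID] = [nuc].[ID]
WHERE   [nuc].[ID] % 2 = 0
ORDER BY [nuc].[ID]
OPTION  (MAXDOP 1); -- keep this one single threaded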

You Get Nothing! You Lose! Good Day, Sir!

Yep. Same thing, just single threaded this time. The plan looks a little nicer, sure, but now the non-unique part is up to 85% of the batch cost, from, you know, that other number. You’re not gonna make me say it. This is a family-friendly blog.

Going back to TIME and IO, the only noticeable change is in CPU time for the non-unique query. Still needed a memory grant, still has an expensive sort.

Table 'UniqueJoin'. Scan count 1, logical reads 3969, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'UniqueCL'. Scan count 1, logical reads 3968, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 265 ms, elapsed time = 264 ms.

Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'NonUniqueJoin'. Scan count 1, logical reads 4218, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'NonUniqueCL'. Scan count 1, logical reads 4218, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 766 ms, elapsed time = 807 ms.

just one index

The nice thing is that a little uniqueness goes a long way. If we join the unique table to the non-unique join table, we end up with nearly identical plans.
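That mixed query would look something like this sketch, reusing the same pattern as before:

DECLARE @ID BIGINT;

SELECT  @ID = [uc].[ID]
FROM    [dbo].[UniqueCL] AS [uc]          -- unique clustered index
JOIN    [dbo].[NonUniqueJoin] AS [nuj]    -- non-unique clustered index
ON      [nuj].[ID] = [uc].[ID]
WHERE   [uc].[ID] % 2 = 0
ORDER BY [uc].[ID];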

You’re such a special flower.

Table 'UniqueJoin'. Scan count 1, logical reads 3969, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'UniqueCL'. Scan count 1, logical reads 3968, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 265 ms,  elapsed time = 267 ms.

Table 'NonUniqueJoin'. Scan count 1, logical reads 4218, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'UniqueCL'. Scan count 1, logical reads 3968, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 266 ms,  elapsed time = 263 ms.

done and done

So, you made it to the end. Congratulations. I hope your boss didn’t walk by too many times.

By the way, the year is 2050, the Cubs still haven’t won the world series, and a horrible race of extraterrestrials have taken over the Earth and are using humans as slaves to mine gold. Wait, no, that’s something else.

But! Hey! Brains! You have more of them now, if any of this was enlightening to you. If you spaced out and just realized the page stopped scrolling, here’s a recap:

  • Unique indexes: SQL likes ’em
  • You will generally see better plans when the optimizer isn’t concerned with duplicate values
  • There’s not a ton of downside to using them where possible
  • Even one unique index can make a lot of difference, when joined with a non-unique index.

As an aside, this was all tested on SQL Server 2014. An exercise for Dear Reader: if you have SQL Server 2012, look at the tempdb spills that occur on the sort and hash operations for the non-unique indexes. I’m not including them here because it’s a bit of a detour. It’s probably not the most compelling reason to upgrade, but it’s something to consider — tempdb is way less eager to write to disk these days!

Thanks for reading!

Brent says: I always wanted proof that unique clustered indexes made for better execution plans!

Clustered Index key columns in Nonclustered Indexes

clustered indexes are fundamental

And I’m not just saying that because Kendra is my spiritual adviser!

They are not a copy of the table – they are the table, ordered by the column(s) you choose as the key. It could be one. It could be a few. It could be a GUID! But that’s for another time. A long time from now. When I’ve raised an army, in accordance with ancient prophecy.

What I’d like to focus on is another oft-neglected consideration when indexing:

columns in the clustering key will be in all of your nonclustered indexes

“How do you know?”
“Can I join your army?”
“Why does it do that?”

You may be asking yourself all of those questions. I’ll answer one of them with a demo, and one of them with “no”.

D-D-D-DEMO FACE

USE [tempdb]

IF OBJECT_ID('tempdb..ClusterKeyColumnsTest') IS NOT NULL
DROP TABLE [tempdb]..[ClusterKeyColumnsTest]

;WITH E1(N) AS (
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL 
    SELECT NULL  ),                          
E2(N) AS (SELECT NULL FROM E1 a, E1 b, E1 c, E1 d, E1 e, E1 f, E1 g, E1 h, E1 i, E1 j),
Numbers AS (SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N FROM E2)
SELECT  
        IDENTITY (BIGINT, 1,1) AS [ID] ,  
        ISNULL(CONVERT(DATE, DATEADD(HOUR, -[N].[N], GETDATE()    )),  '1900-01-01') AS [OrderDate] ,
        ISNULL(CONVERT(DATE, DATEADD(HOUR, -[N].[N], GETDATE() + 1)), '1900-01-01') AS [ProcessDate] ,
        ISNULL(CONVERT(DATE, DATEADD(HOUR, -[N].[N], GETDATE() + 3)), '1900-01-01') AS [ShipDate] ,
        REPLICATE(CAST(NEWID() AS NVARCHAR(MAX)), CEILING(RAND() * 10)) AS [DumbGUID] 
INTO [ClusterKeyColumnsTest]
FROM    [Numbers] [N]
ORDER BY [N] DESC;

ALTER TABLE [ClusterKeyColumnsTest] ADD CONSTRAINT [PK_ClusterKeyColumnsTest] PRIMARY KEY CLUSTERED ([ID]) WITH (FILLFACTOR = 100)

CREATE NONCLUSTERED INDEX [IX_ALLDATES] ON [dbo].[ClusterKeyColumnsTest] ([OrderDate], [ProcessDate], [ShipDate]) WITH (FILLFACTOR = 100)

That code will create a pretty rudimentary table of random data. It was between 3-4 GB on my system.

You can see a Clustered Index was created on the ID column, and a Nonclustered Index was created on the three date columns. The DumbGUID column was, of course, neglected as a key column in both. Poor GUID.

Running these queries, the Nonclustered Index on the date columns will be used, because SQL does this smart thing where it takes page count into consideration when choosing an index to use.

SELECT [ckct].[ID]
FROM [dbo].[ClusterKeyColumnsTest] AS [ckct]

SELECT [ckct].[ID]
FROM [dbo].[ClusterKeyColumnsTest] AS [ckct]
WHERE [ckct].[OrderDate] = '2014-12-18'

SELECT [ckct].[ID], [ckct].[DumbGUID]
FROM [dbo].[ClusterKeyColumnsTest] AS [ckct]
WHERE [ckct].[OrderDate] = '2014-12-18'

Notice that the only time a Key Lookup was needed is for the last query, where we also select the DumbGUID column. You’d think it would be needed in all three, since the ID column isn’t explicitly named in the key or as an include in the Nonclustered Index.

Key Lookups really are really expensive and you should really fix them. Really.

sp_BlitzIndex® to the rescue

If you find yourself trying to figure out indexes, Kendra’s sp_BlitzIndex® stored procedure is invaluable. It can do this cool thing where it shows you SECRET columns!

Since I’ve already ruined the surprise, let’s look at the indexes on our test table.

EXEC [master].[dbo].[sp_BlitzIndex]
    @DatabaseName = N'tempdb' , 
    @SchemaName = N'dbo' , 
    @TableName = N'ClusterKeyColumnsTest'

Here’s the output. The detail it gives you on index columns and datatypes is really awesome. You can see the ID column is part of the Nonclustered Index, even though it isn’t named in the definition.

Ooh, shapes!

one step beyond

Run the code to drop and recreate our test table, but this time add these indexes below instead of the original ones.

ALTER TABLE [ClusterKeyColumnsTest] ADD CONSTRAINT [PK_ClusterKeyColumnsTest] PRIMARY KEY CLUSTERED ([ID], [OrderDate]) WITH (FILLFACTOR = 100)

CREATE NONCLUSTERED INDEX [IX_ALLDATES] ON [dbo].[ClusterKeyColumnsTest] ([ProcessDate], [ShipDate]) 
INCLUDE([DumbGUID])
WITH (FILLFACTOR = 100)

Running the same three queries as before, our plans change only slightly. The Key Lookup is gone, and the statement cost per batch has evened out.

You too can be a hero by getting rid of Key Lookups.

But notice that, for the second query, where we’re searching on the OrderDate column, we’re still scanning the Nonclustered Index.

We moved that out of the Nonclustered Index and used it as part of the Clustering Key the second time around. What gives?

Running sp_BlitzIndex® the same as before, OrderDate is now a secret column in the Nonclustered Index.

Fun Boy Three does the best version of Our Lips are Sealed, BTW.

did we learn anything?

Sure did!

  1. SQL ‘hides’ the columns from the Key of the Clustered Index in Nonclustered Indexes
  2. Since those columns are part of the index, you don’t need to include them in the definition
  3. Secret columns in Nonclustered Indexes can be used to avoid Key Lookups AND satisfy WHERE clause searches!

Well, at least until all indexes are ColumnStore Indexes.

Kendra says: One of the most common questions I get is whether there’s a penalty or downside if you list the columns from the clustered index in your nonclustered index. You can stop worrying: there’s no penalty.

Why Nobody Ever Patches Their SQL Servers

I saw a DBA.StackExchange.com question and had to laugh out loud:

“Is there a cool way of performing CU updates for SQL Server on hundreds of machines?”

Patching got me like

No, and it has nothing to do with technology.

  • Which servers are mission-critical 24/7 and can’t go down?
  • Which servers can only be taken down in specific time windows?
  • Which servers have dependencies between each other, like database mirroring and AlwaysOn Availability Group replicas?
  • Which servers have automatic failover mechanisms (like clusters) where you can patch the standby node first, then fail over once, and patch the primary without having to fail back?
  • Which servers have vendor apps that required a specific hotfix that may not be included in the cumulative update you’re about to apply?
  • Which servers are running long batch jobs like data warehouse loads or backups that would take a long time to restart if you took it down in the middle of its operations?
  • Which servers have pending sp_configure changes that will take effect and surprise you when you restart?
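
That last one, at least, is easy to check before you patch – a quick sketch that lists any options whose configured value doesn’t match the value currently in use:

SELECT  [name],
        CAST([value] AS INT) AS [configured_value],
        CAST([value_in_use] AS INT) AS [running_value]
FROM    sys.configurations
WHERE   CAST([value] AS INT) <> CAST([value_in_use] AS INT); -- pending changes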

Patching is hard work. Seriously.

Getting the last good DBCC CHECKDB date

whether it’s a new job or a new (old) server

If you’re reading this, you’ve at some point in your career stared at a server, for the first time, with great trepidation.

The smarter you are, the greater your trepidation is. The 2nd century mathematician Trepidatius the Wimpy had an equation that described this, but he only applied it to leaving his hut.

So the first thing you check is backups. Miraculously, someone is at least taking FULL backups. The logs and diffs are another story, but that’s why you’re getting paid. If your DBA checklist looks like mine, the next box down is seeing if someone has ever run DBCC CHECKDB to find corruption.

BUT HOW?

Since it’s my favorite test data set, I’ll use the StackOverflow database.

Ready?

DBCC DBINFO('StackOverflow') WITH TABLERESULTS

That’s it. But the output is a nightmare. It’s about 80 lines of stuff you will probably never care about. Around line 50 is what you’re looking for.
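If you’d rather not eyeball 80 rows of output, you can dump the table results into a temp table and pull out just the row you care about – a minimal sketch, assuming the field name dbi_dbccLastKnownGood that current builds use:

CREATE TABLE #dbinfo
    (ParentObject VARCHAR(255), [Object] VARCHAR(255), Field VARCHAR(255), [Value] VARCHAR(255));

INSERT #dbinfo
EXEC ('DBCC DBINFO(''StackOverflow'') WITH TABLERESULTS');

SELECT [Value] AS [LastKnownGoodCheckDB]
FROM #dbinfo
WHERE Field = 'dbi_dbccLastKnownGood';

DROP TABLE #dbinfo;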

Hi, I’m nonsense.

And this is probably what you’ll see! A date of 1900-01-01 etc. That means never. If you run DBCC CHECKDB on the database, perhaps like so:

DBCC CHECKDB('StackOverflow') WITH NO_INFOMSGS, ALL_ERRORMSGS

And then re-run the DBCC DBINFO command, our date is now updated to current:

LOOK HOW MUCH FUN WE’RE HAVING

IS THIS THE ONLY WAY?

Of course not. But if you need a quick solution, there it is. The only catch is that the date also updates if you run DBCC CHECKDB with the PHYSICAL_ONLY option, which skips the logical consistency checks that a full run of DBCC CHECKDB does.

If you’re a smarty pants, and you’re using Ola Hallengren’s maintenance scripts, you can check the CommandLog table it creates to [drumroll] log commands, see when DBCC CHECKDB was last run, and even how long it took.

If you’re doing something else, some guy named Brent wrote a stored procedure called sp_Blitz® that will tell you if any of the databases on your server have not had a consistency check run in the last two weeks. It will also tell you everything else wrong with that new server. It was one of my favorite things in the world, back when I had a real job.

If you like this stuff, and you want to get better at it, we’ll show you how! Join us for our upcoming Senior DBA Class in crime-free Chicago.

Kendra says: Ever been confused by those weird messages about CHECKDB in the SQL Server log when your instance starts up, especially when they show a really old date? Fun fact: it’s actually looking up the last CHECKDB run date for the database.
