i like questions!
Because I don’t always like talking to myself, and robots are sometimes murderous and insane. So when this one came up in the comments of my previous post, I thought it would make a good follow-up blog. It’s a really good question, and definitely one I found myself asking quite a bit when I first started writing queries.
the fancy pants answer
Is that this isn’t a processing order issue, but rather a set processing issue. When you group by two columns, the resulting set makes ordering by a column outside of that set ambiguous.
Think about it. If two columns of values are spread out over a range of unique IDs in a third column, which ID belongs to which grouped set if you group by the two other columns? The highest ID? The lowest ID? Something in between?
demonstrate my syntax
A quick example using a take on FizzBuzz. We’ll call it NYHCBuzz.
;WITH E1(N) AS
(
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL
    SELECT NULL
),
E2(N) AS
(
    SELECT NULL
    FROM E1 a, E1 b, E1 c, E1 d, E1 e, E1 f, E1 g, E1 h, E1 i, E1 j
),
Numbers AS
(
    SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
    FROM E2
)
SELECT IDENTITY(BIGINT, 1, 1) AS [ID]
     , ISNULL(CONVERT(DATE, DATEADD(MINUTE, -[N].[N], GETDATE())), '1900-01-01') AS [SomeDate]
     , CASE WHEN n.[N] % 15 = 0 THEN 'Sheer Terror'
            WHEN n.[N] % 5 = 0 THEN 'Madball'
            WHEN n.[N] % 3 = 0 THEN 'Skarhead'
            ELSE 'Warzone'
       END AS [NYHCBuzz]
INTO [LowerEastSide]
FROM [Numbers] [N]
ORDER BY [N] DESC;

SELECT *
FROM [dbo].[LowerEastSide] AS [les]
ORDER BY [les].[ID];
And here’s a little snippet of what you end up with. This hopefully makes my point a bit clearer. You can’t order by both ID 11 and ID 26, for example; they’re part of the same group, and they’re buried in a range with hundreds of other values.
Ordering by the ID column wouldn’t make sense. Which ID would each grouped set be associated with?
Who knows? Not even SQL.
Running this will get you a big fat error.
SELECT [les].[SomeDate], [les].[NYHCBuzz], COUNT_BIG(*) AS [Records]
FROM [dbo].[LowerEastSide] AS [les]
GROUP BY [les].[SomeDate], [les].[NYHCBuzz]
ORDER BY [les].[ID];
Msg 8127, Level 16, State 1, Line 26
Column "dbo.LowerEastSide.ID" is invalid in the ORDER BY clause because it is not contained in either an aggregate function or the GROUP BY clause.
So what can you do? You can’t group by the ID column; it’s unique per row. You’d just get every row back with a count of 1. That’s dumb and ugly.
You can work around it by telling SQL how to order within each group, with a query like this.
SELECT [les].[SomeDate], [les].[NYHCBuzz], COUNT_BIG(*) AS [Records]
FROM [dbo].[LowerEastSide] AS [les]
GROUP BY [les].[SomeDate], [les].[NYHCBuzz]
ORDER BY MIN([les].[ID]); --Look at me
Adding the MIN() function to the ORDER BY tells SQL to take the lowest ID per grouped set for ordering. If you’re curious as to how that helps, look at the results with this query.
SELECT MIN([les].[ID]) AS [MinID]
     , [les].[SomeDate]
     , [les].[NYHCBuzz]
     , COUNT_BIG(*) AS [Records]
FROM [dbo].[LowerEastSide] AS [les]
GROUP BY [les].[SomeDate], [les].[NYHCBuzz]
ORDER BY MIN([les].[ID]);
You can see pretty easily what SQL is doing here. For every grouped set of dates and names, you’re also grabbing the minimum ID value. The aggregation works to provide each set with a related value to also order by.
You can use any aggregate function of your choosing, really. I don’t know if any of these have a good business use case, but comparing how both the output ordering and the returned ID values change is an instructive way to see exactly how aggregation gives each grouped set a value to order by.
SELECT MIN([les].[ID]) AS [MinID]
     , [les].[SomeDate]
     , [les].[NYHCBuzz]
     , COUNT_BIG(*) AS [Records]
FROM [dbo].[LowerEastSide] AS [les]
GROUP BY [les].[SomeDate], [les].[NYHCBuzz]
ORDER BY MIN([les].[ID]);

SELECT MAX([les].[ID]) AS [MaxID]
     , [les].[SomeDate]
     , [les].[NYHCBuzz]
     , COUNT_BIG(*) AS [Records]
FROM [dbo].[LowerEastSide] AS [les]
GROUP BY [les].[SomeDate], [les].[NYHCBuzz]
ORDER BY MAX([les].[ID]);

SELECT AVG([les].[ID]) AS [AvgID]
     , [les].[SomeDate]
     , [les].[NYHCBuzz]
     , COUNT_BIG(*) AS [Records]
FROM [dbo].[LowerEastSide] AS [les]
GROUP BY [les].[SomeDate], [les].[NYHCBuzz]
ORDER BY AVG([les].[ID]);

SELECT SUM([les].[ID]) AS [SumID]
     , [les].[SomeDate]
     , [les].[NYHCBuzz]
     , COUNT_BIG(*) AS [Records]
FROM [dbo].[LowerEastSide] AS [les]
GROUP BY [les].[SomeDate], [les].[NYHCBuzz]
ORDER BY SUM([les].[ID]);
We cover all sorts of cool stuff like this (and more!) at Advanced Querying and Indexing. There’s even food. And prizes. And almost no one ever self-immolates, so that’s a big plus.
For the last few years, we’ve hosted a Tuesday webcast to talk SQL Server.
We’re giving it a vacation in August, and then starting September 2, Office Hours is coming back on Wednesdays.
This is easier for us because of the way our SQL Critical Care® service works. We’re typically working with clients interactively on Monday and Tuesday, and then Wednesday we take a break to write the findings recap. It’s easier for us to take a midday break on Wednesdays.
Here’s a sampling of some of the questions we’ve answered at Office Hours:
- What are your recommendations for service accounts? Unique domain accounts for each instance/service, shared domain accounts, or something else?
- Can you explain why one database can be added to AG with just a full backup and another would need a transaction log backup as well?
- Working on my first AG setup in 2012. I don’t quite understand how the listener works with DNS. Does the listener name get registered with each AG IP?
- We’re looking into Amazon Web Services cloud with SQL Server, any thoughts or tips or things to be aware of?
- Ever ran into an instance where database gets detached automatically after server restart?
- What would you recommend as MAXDOP for a server with 80 cores?
To pick our brains and learn from other questions/answers, register for Office Hours here.
This sounds really cheesy, but I’m honestly excited to be presenting again this year at kCura Relativity Fest 2015.
Here’s what I’ll be talking about:
How to Check Your SQL Server’s Health
The Abstract: You’re a system or database administrator responsible for the uptime and performance of Relativity’s SQL Servers, but you’ve never received professional training on SQL Server and you need to figure out if it’s safe. I’ve built up an array of free tools over the 15-plus years I’ve been working with SQL Server, and I’ll show you how to use them in this demo-packed session.
Why I’m Presenting It: For the last few years, kCura has sent me around to dozens of the biggest Relativity shops around to do a one-day SQL Critical Care®. I want to teach you that exact same process – and trust me, it’s way easier than you think. I’m not going to teach you how to put fires out – but I *am* going to teach you how to tell in just a few minutes if your server is on fire or not, and whether it’s a small trash can fire – or a big, blazing three-alarm monster.
Buying the Right Hardware for SQL Server
The Abstract: Ever wonder if there was a simple list of instructions for picking out the perfect infrastructure for Relativity at your organization? Wonder if you should use mirroring, replication, clustering, or AlwaysOn Availability Groups? Are you better off with solid state or shared storage? Is virtualization an option? Microsoft MVP Brent Ozar has worked with dozens of Relativity customers, and he’ll simplify your infrastructure options in this one-hour session.
Why I’m Presenting It: I see the same situation over and over again: your Relativity infrastructure started out small – just one SQL Server in the corner that you didn’t worry too much about – but man, has that thing grown. It’s become really important to the business, and people are stuffing data in there like it’s some kind of Chipotle data burrito. You need to buy exactly one replacement environment, and it’s gotta be right the first time. You don’t buy a lot of SQL Servers, so you want simple independent advice that you can take home to management. I’m going to hand you a few sketched-out options with budget ranges.
Want to Learn More About Relativity Fest?
During this period we will have limited access to our email. We’ll still have full access to Twitter, Facebook, and Instagram. We apologize in advance about that, because we’re going to be posting a lot of photos about this year’s company retreat.
This year, we’re spending a week on the Oregon coast – specifically, Manzanita, an adorable little town in Tillamook County – population 598. We’ve rented a couple of houses for us, our spouses, and our dogs. (In the past, we’ve done our retreat on a cruise ship, and a couple of times in a beachfront house in Cabo.)
For immediate assistance, please contact Microsoft Support at 1-800-642-7676.
i’ve seen this happen
Especially if you’ve just inherited a database, or started using a vendor application. This can also be the result of inexperienced developers having free rein over index design.
Unless you’re running regular health checks on your indexes with something like our sp_BlitzIndex® tool, you might not catch immediately that you have a heap of HEAPs in your database.
You may be even further flummoxed upon finding that someone thoughtfully created Primary Keys with Nonclustered Indexes on them, yet no Clustered Indexes. Unless the original developer is still around, the intent may not be clear.
Using this code snippet, you can quickly identify tables that were created with Nonclustered Indexes on the Primary Key, and no Clustered Index. Another way to spot this potential issue might be looking for RID lookups, or Table Scans in your Plan Cache. Wide Nonclustered Indexes may also be present to compensate for the lack of a good Clustered Index.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

SELECT QUOTENAME(SCHEMA_NAME([t].[schema_id])) + '.' + QUOTENAME([t].[name]) AS [Table]
     , QUOTENAME(OBJECT_NAME([kc].[object_id])) AS [IndexName]
     , CAST((SUM([a].[total_pages]) * 8 / 1024.0) AS DECIMAL(18, 2)) AS [IndexSizeMB]
FROM [sys].[tables] [t]
INNER JOIN [sys].[indexes] [i]
    ON [t].[object_id] = [i].[object_id]
INNER JOIN [sys].[partitions] [p]
    ON [i].[object_id] = [p].[object_id]
    AND [i].[index_id] = [p].[index_id]
INNER JOIN [sys].[allocation_units] [a]
    ON [a].[container_id] = CASE WHEN [a].[type] IN (1, 3) THEN [p].[hobt_id]
                                 WHEN [a].[type] = 2 THEN [p].[partition_id]
                            END
INNER JOIN [sys].[key_constraints] AS [kc]
    ON [t].[object_id] = [kc].[parent_object_id]
WHERE [i].[name] IS NOT NULL
    AND OBJECTPROPERTY([kc].[object_id], 'CnstIsNonclustKey') = 1 --Unique Constraint or Primary Key can qualify
    AND OBJECTPROPERTY([t].[object_id], 'TableHasClustIndex') = 0 --Make sure there's no Clustered Index
    AND OBJECTPROPERTY([t].[object_id], 'TableHasPrimaryKey') = 1 --Make sure it has a Primary Key and it's not just a Unique Constraint
    AND OBJECTPROPERTY([t].[object_id], 'IsUserTable') = 1 --Make sure it's a user table because whatever, why not? We've come this far
GROUP BY [t].[schema_id]
       , [t].[name]
       , OBJECT_NAME([kc].[object_id])
ORDER BY SUM([a].[total_pages]) * 8 / 1024.0 DESC;
There are times when heaps are a valid choice
ETL or staging tables are the most common examples of when raw insert throughput is necessary, and a Clustered Index may not be.
But when tables are designed to be queried against and lack a Clustered Index, it’s usually a problem to be fixed.
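If the table really should have had a clustered index all along, one common fix is to rebuild the Primary Key as a clustered one. A minimal sketch, using a hypothetical table and constraint name (dbo.Heapster, PK_Heapster) rather than anything from a real system:

```sql
-- Hypothetical names for illustration. Check for foreign keys that
-- reference the PK before dropping it, and expect size-of-data work:
-- recreating the constraint as CLUSTERED rewrites the whole heap.
ALTER TABLE [dbo].[Heapster] DROP CONSTRAINT [PK_Heapster];

ALTER TABLE [dbo].[Heapster]
ADD CONSTRAINT [PK_Heapster] PRIMARY KEY CLUSTERED ([ID]);
```

On a big table this takes locks and log space, so test it somewhere safe first.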
Brent says: and remember, kids, SQL Server won’t suggest a clustered index.
Kendra says: if you think this probably didn’t happen to you, that’s a good sign you should doublecheck.
During some testing with SQL Server 2014’s new cardinality estimator, I noticed something fun: the new CE can give you different index recommendations than the old one.
I’m using the public Stack Overflow database export, and I’m running this Jon Skeet comparison query from Data.StackExchange.com. (Note that it has something a little tricky at the top – it’s using a local variable for the @UserId, which itself makes for a different execution plan. When literals are used in the query, the behavior is different, but that’s another story for another blog post.)
First, here are the two different execution plans, both of which do about 1 million logical reads:
It’s a really subtle difference in the plans – at first glance, just looks like the 2014 CE removed a single operator – but the big difference is in the number of estimated rows returned:
- Old CE estimated 82 rows returned
- New CE estimated 352,216 rows returned
In actuality, 166 rows get returned with this particular input variable – the new CE is just flat out making bad guesses on this data.
Here are the different index recommendations:
CREATE NONCLUSTERED INDEX NewCE_OwnerUserId_Includes
ON [dbo].[Posts] ([OwnerUserId])
INCLUDE ([ParentId], [Score]);

CREATE NONCLUSTERED INDEX OldCE_OwnerUserId_PostTypeId_Includes
ON [dbo].[Posts] ([OwnerUserId], [PostTypeId])
INCLUDE ([ParentId], [Score]);
And when I run sp_BlitzIndex® after doing a little load testing, both missing index recommendations show up in the DMVs:
But surely the new CE’s recommendation is better, right? We’ll create just the one it recommends, and the resulting execution plan does 57k logical reads. Both the new CE and the old CE produce an identical plan, albeit with wildly different row count estimates (old says 83, new says 37,423, actual is 166):
HAHAHA, now the new CE agrees that it needs the index recommended by the old CE in the first place. So let’s remove the new CE’s recommendation, and only create the old CE’s recommended index. Both the old and new CE choose to use it:
And even better, the old CE’s recommendation results in only 175 logical reads.
So what’s the takeaway? If you’re relying on the execution plan’s missing index recommendations for fast performance tuning, you’re not going to get the best results – no matter which cardinality estimator you’re using. With 2014, the recommendations are different, not necessarily better.
The real keys are knowing how to do it yourself, and we teach those in the Advanced Querying and Indexing 5-day in-person class next month.
Holy cow, people, we have a lot of free stuff coming your way. Here’s just some of the highlights:
- Digging Into THREADPOOL Waits
- How Do You Fix a Slow TempDB?
- Advanced Shared Storage
- How to Prove Hardware is the Problem
- SQL Server High Availability Options Explained
- Watch SQL Server Break and Explode
- The Art of Stress-Free Database Administration
- Prove It! Collecting The Right Performance Metrics
- What’s New in SQL Server 2016
To sign up for ’em all, just put in your contact info here. You don’t even have to check any boxes. Learnin’ doesn’t get much easier than that. These webcasts won’t be recorded, so be there or be square.
you can’t do that on management studio
Recently, while working with a client, I did something in a query that they were mystified by. I didn’t think much of it, but I thought it might be useful to you, dear readers, as well. Along with an explanation.
Here’s a sample query that takes advantage of the same type of trick, but with a few extra bats and worms added in to illustrate a larger point.
SELECT TOP 100
       [b].[Name]
     , [b].[UserId]
     , COUNT_BIG(*) AS [Coconuts]
FROM [dbo].[Badges] AS [b]
WHERE [b].[Date] >= '2014-01-01 00:00:00.003'
GROUP BY [b].[Name], [b].[UserId]
HAVING COUNT_BIG(*) > 100
ORDER BY [Coconuts] DESC;
can you dig it?
What I did was order by the alias of the COUNT_BIG(*) column, Coconuts.
What they didn’t understand was why that’s legal, but filtering on that alias wouldn’t be legal. A more familiar scenario might be using ROW_NUMBER(); you can ORDER BY it, but not filter on it in the WHERE clause to limit result sets to the TOP N per set. You would have to get an intermediate result in a CTE or temp table and then filter.
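That ROW_NUMBER() workaround looks something like this sketch, using hypothetical column choices against the same Badges table: the numbering gets a name inside a CTE, and the outer query is then free to filter on it.

```sql
-- Hypothetical example: the 3 most recent badges per user.
-- ROW_NUMBER() can't be filtered in the WHERE clause of the same
-- query level, so we name it in a CTE and filter one level up.
WITH Numbered AS
(
    SELECT [b].[UserId]
         , [b].[Name]
         , ROW_NUMBER() OVER (PARTITION BY [b].[UserId]
                              ORDER BY [b].[Date] DESC) AS [rn]
    FROM [dbo].[Badges] AS [b]
)
SELECT [UserId], [Name]
FROM [Numbered]
WHERE [rn] <= 3;
```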
When SQL goes to figure out what to do with all this, it doesn’t look at it in the order you typed it. The logical processing order is a bit more like this:

1. FROM
2. ON
3. JOIN
4. WHERE
5. GROUP BY
6. WITH CUBE/ROLLUP
7. HAVING
8. SELECT
9. DISTINCT
10. ORDER BY
11. TOP
and that’s how babies get made
Since the ORDER BY is processed after the SELECT list, ORDER BY can use a column aliased there. You can’t do that in the WHERE clause because it gets processed before SQL does its fancy footwork to get you some stuff to look at.
Here are some examples of what happens when you try to move the alias to different parts of the query.
Using it in the HAVING filter:
SELECT TOP 100
       [b].[Name]
     , [b].[UserId]
     , COUNT_BIG(*) AS [Coconuts]
FROM [dbo].[Badges] AS [b]
WHERE [b].[Date] >= '2014-01-01 00:00:00.003'
GROUP BY [b].[Name], [b].[UserId]
HAVING [Coconuts] > 100
ORDER BY [Coconuts] DESC;
And again using it in the WHERE clause:
SELECT TOP 100
       [b].[Name]
     , [b].[UserId]
     , COUNT_BIG(*) AS [Coconuts]
FROM [dbo].[Badges] AS [b]
WHERE [b].[Date] >= '2014-01-01 00:00:00.003'
    AND [Coconuts] > 100
GROUP BY [b].[Name], [b].[UserId]
ORDER BY [Coconuts] DESC;
Both result in the same error, give or take a line number. Coconuts is not reference-able at either of these points.
Msg 207, Level 16, State 1, Line 33
Invalid column name 'Coconuts'.
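If you really do want to filter on the alias, wrapping the aggregate in a CTE (or derived table) makes Coconuts a real column by the time the outer WHERE clause runs. A sketch of the same query restructured that way:

```sql
-- The inner query finishes its GROUP BY and SELECT first,
-- so the outer query sees [Coconuts] as an ordinary column.
WITH Counted AS
(
    SELECT [b].[Name]
         , [b].[UserId]
         , COUNT_BIG(*) AS [Coconuts]
    FROM [dbo].[Badges] AS [b]
    WHERE [b].[Date] >= '2014-01-01 00:00:00.003'
    GROUP BY [b].[Name], [b].[UserId]
)
SELECT TOP 100 *
FROM [Counted]
WHERE [Coconuts] > 100
ORDER BY [Coconuts] DESC;
```

For this particular query it’s equivalent to the HAVING COUNT_BIG(*) > 100 version; the CTE approach just generalizes to things HAVING can’t do, like ROW_NUMBER().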
I’ve touched a lot of SQL Servers over the years. After my share of trips to HR for inappropriate touching, here are the questions I ask first these days:
- Is this in production now?
- If this goes down, what apps go down with it?
- When those apps go down, is there potential for loss of life or money?
- How sensitive are these apps to temporary slowdowns?
- When was the last successful backup?
- When was the last successful restore test?
- Is everyone okay losing data back to the last successful backup?
- When was the last successful clean corruption test?
- Do we have a development or staging environment where I can test my changes first?
- Is there any documentation for why the server was configured this way?
- What changes am I not allowed to make?
- Who can test that my changes fixed the problem?
- Who can test that the apps still work as designed, and that my changes didn’t have unintended side effects?
Before we get started…
I know. I know. BOL. It’s documented.
They even show you how to rename a table.
You just forget.
And as with most simple mistakes, fixing them is… Weird.
Here’s what happened to me recently, when I was working on a table swapping demo.
CREATE TABLE dbo.Whatever (i INT);

SELECT *
FROM dbo.[Whatever] AS [w];
Here’s where I was going to rename it, and then make another table with the INT column as a BIGINT.
EXEC [sys].[sp_rename]
    @objname = N'dbo.Whatever',
    @newname = N'dbo.Whatever_New';
Which worked, except…
Like most people who make mistakes, I decided to revisit the documentation afterwards. And, yeah, you don’t specify schema in the new object name.
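In other words, the call that would have worked: schema-qualify the old name, and pass only the bare name as the new one.

```sql
-- @objname takes the schema; @newname is just the new object name.
EXEC [sys].[sp_rename]
    @objname = N'dbo.Whatever',
    @newname = N'Whatever_New';
```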
So, now that you all know you’re smarter than me, how would you fix it?
I’ll spare you the trial and error:
EXEC [sys].[sp_rename]
    @objname = N'dbo.[dbo.Whatever_New]',
    @newname = N'Whatever_New';
There were quite a few different arrangements of brackets and schema prefixes leading up to this.
I hope this post saves someone a little time.
Brent says: dear reader, please use this as part of an April Fool’s prank. Just not in production.