If your query doesn’t have an ORDER BY clause,
you can’t reliably predict the order of your results over time.
Sure, it’s going to look predictable at first, but down the road, as things change – the indexes, the table, the server’s configuration, the size of your data – you can end up with some ugly surprises.
Let’s start with something simple: we’ll do a SELECT from the Stack Overflow Users table. The clustering key on that table is Id, the identity number that starts at one and goes up to a bajillion. When we do this SELECT, the data comes back in the order of the clustered index:
But if someone creates an index on DisplayName and Location, then suddenly SQL Server will choose to use that index rather than the clustered index:
Here’s the execution plan as proof:
Why did SQL Server choose to use that index even though it didn’t need to have DisplayName and Location sorted? Because that index is the narrowest/smallest copy of the data that SQL Server needs to fetch. Let’s examine the index sizes with sp_BlitzIndex:
The clustered index of the table (CX/PK) has 8.9M rows, and it’s 1.1GB in size.
The nonclustered index on DisplayName/Location also has all 8.9M rows, but it’s only 368MB in size. If you have to scan the entire thing in order to get your query results, why not scan the smallest object, which will get done more quickly? And that’s exactly what SQL Server chose to do.
“Yeah, but my query has a WHERE clause.”
Okay, now that we’ve got an index on DisplayName, Location, try running a query that looks for a particular DisplayName. The results come back in DisplayName order:
And the execution plan uses the index:
But now if you try for a different username, they’re not sorted by name at all:
Because SQL Server decided that there are more Alexes, so it made more sense to do a clustered index scan rather than the seek + key lookup:
These are simple examples, and I could go on.
Even in these really simple cases, you can’t guarantee that SQL Server will always use the copy of the data that you expect. Over the last few weeks, I’ve encountered a lot more complex examples:
- Someone dropping an index that was used by a query
- Someone turned on Forced Parameterization, which changed SQL Server’s row estimates for a query plan, causing it to make different index choices
- Someone changed the Compatibility Level of a database, introducing a newer Cardinality Estimator, which resulted in a Different Plan
If you need the data to be ordered tomorrow, when you’re not looking, just put an ORDER BY in there. Your grandchildren will thank you when they have to troubleshoot your query.
Yep. I’ve been burned by this. I remember in the deep dark distant past (SQL Server 2000?) where you could have a ORDER BY clause in a view (without a TOP clause). Now you have to explicitly ORDER BY when SELECTing from the view. The way it works today makes much more sense. You don’t want SQL Server constrained to doing the sort when the view is later used for other purposes than originally intended.
Parallelism also; if you’ve got a parallel query each block can return id’s individually, and it’ll even look fine. you can end up with:
There’s also nothing stopping sql server from returning pages from its page cache first, then going to disk for the rest of the pages.
Andrew – absolutely correct!
As a historic note, the first SQL engines were built on top of sequential filesystems. There was a pretty good chance that base tables were already sorted in those files. My favorite myth from those days was that if you did a GROUP BY, you could always depend on the results at being sorted on the grouping columns. This was because the grouping was done was done by sorting the un-grouped data and going through it sequentially. Later products actually follow the standards definition of grouping by using hashing. In particular, Teradata would return the group with the smallest number of rows first. This was because it’s hash bucket would fill up first.
Joe – hahaha, wow! That’s wild. Internals used to really mean internals!
Good stuff! When I try to explain this, I like to say that resultsets without ORDER BY clauses are sets. They appear in an unpredictable order. That’s like a random order but worse. “Random order” kinda sorta implies that each time the query runs it will return a different order every time, which means you have a chance of catching the problem when you’re testing something.
“Unpredictable” means it returns the same order every time, until it doesn’t. And, because the actual order is based on stuff like “how many Alexes” the database contains, the order won’t change until the table has been in use for a while and the programmers are off doing something else.
It’s HARD to get programmers, used to procedural programming, to wrap their heads around this unpredictability. It would be GREAT to have some sort of plugin to run on test systems that would actually randomize the order of resultsets where the order isn’t declared.
I remember having this exact discussion with colleagues. They thought I was talking nonsense because “it’s in order on the screen in query analyzer”.
My favorite “bug” related to this is in SSMS in Mirror Monitor. The SSMS Mirror Monitor GUI for displaying history just displays what is returned by sp_dbmmonitorresults, and since sp_dbmmonitorresults doesn’t have an ORDER BY clause, the results are . . . interesting. If you ask for the past 2 hours, you’ll get stuff displayed in a reasonable order. But if you ask for everything, the query in the stored procedure goes parallel, and you end up with chunks of rows that are sorted, but the chunks aren’t in any logical sequence. Which pretty much means the GUI is useless, and instead you have to roll your own query to view the data.
Gotta love msdb! From missing indexes to poorly thought out table designs (times stored in ints as hhmmss), I sometimes think Microsoft intended msdb to be a poster child of worst practices. To paraphrase Despair.com, “It could be that the purpose of msdb is only to serve as a warning to others.”
When it comes time to return a result set, my developers will add an order by, and then complain the result times are slow. I ask, why are you ordering it? They say, because that’s how we want to display the results.
Why not just order it in your code?
Depending on how big your shop is, you would probably pass the results from the database server to a report server and do the sorting and formatting and prepping all the pretty stuff on a dedicated machine that set up for just that purpose.