Query Exercise: Find Recent Superstars
For this week’s Query Exercise, we’re working with the Stack Overflow database, and our business users have asked us to find the new superstars. They’re looking for the top 1000 users who were created in the last 90 days, who have a reputation higher than 50 points, from highest reputation to lowest.
In your Stack Overflow database, create just these two indexes:
Transact-SQL
|
1 2 3 4 |
EXEC DropIndexes; GO CREATE INDEX CreationDate ON dbo.Users(CreationDate); CREATE INDEX Reputation ON dbo.Users(Reputation); |
And then write the query. Now, because you’re all using different copies of the Stack Overflow database, with different end dates, find the most recently created user in your database:
Transact-SQL
|
1 |
SELECT TOP 1 CreationDate FROM dbo.Users ORDER BY Id DESC; |
Use that date as the finish date on your query, and write a query to find the top 1000 users created in the last 90 days (ending with the date you just found), who have a reputation higher than 50 points.
Your challenge has a few parts:
- Write the query.
- How does your query perform? Compare the number of logical reads your query does, versus what it’d take to scan the whole table once.
- If your query does more logical reads than a table scan, can you get it to do less, without changing the indexes on the table?
Post your solutions in the comments, and feel free to ask other folks questions about their solutions. If your solution involves code and you want it to look good, paste it as a Github Gist, and then include the Gist link in your comment. Then, check out my answers & discussion.
Related

Hi! I’m Brent Ozar.
I make Microsoft SQL Server go faster. I love teaching, travel, cars, and laughing. I’m based out of Las Vegas. He/him. I teach SQL Server training classes, or if you haven’t got time for the pain, I’m available for consulting too.
Get Free SQL Stuff
"*" indicates required fields

44 Comments. Leave new
In database StackOverflow2013 after clean restor from backup I get non-identical results of the following queries. Is it an error in the data or am I doing something wrong?
SELECT TOP 1 * FROM dbo.Users ORDER BY Id DESC
SELECT TOP 1 * FROM dbo.Users ORDER BY CreationDate DESC;
I’ll rephrase your question: “Should Id and CreationDate always go up together?” Not necessarily – as you’ll see if you dig deeper into the data. That’s a separate question from today’s homework post though.
Well, I assume my query has some code, so here’s the Gist link, looking forward for the feedback.
https://gist.github.com/peterkruis/ddccb033b523a755f5d543269a5796aa
Good work! I like how you identified a way to separate the query processing into two steps.
A simple and awesome solution.
1. Done
2. Horrible – the amount of logical reads is equal to 14 table scans
3. Yup – add a temp table to the mix and we’re down to 7525 total logical reads – https://gist.github.com/VladDBA/fb4ec8c1248be4c2fa399659dc9523f3
Hmm, can you post the query plan you’re getting for the first part of your query? On mine at least, I’m getting a table scan for the query populating the temp table. I’m guessing – just guessing – that you’re using an index that isn’t part of the challenge’s requirements.
Sure.
Here’s the plan – https://www.brentozar.com/pastetheplan/?id=HJSbnGH2p
Might want to check your compatibility level. 😉
It was 150, so I switched now to 140 and flushed the plan cache, but the query is still using both indexes and not doing a table scan.
Version Info:
Microsoft SQL Server 2019 (RTM-CU23) (KB5030333) – 15.0.4335.1 (X64)
Sep 21 2023 17:28:44
Copyright (C) 2019 Microsoft Corporation
Developer Edition (64-bit) on Windows 10 Pro 10.0 (Build 22631: )
I’m not sure if there’s anything else I might be missing.
Whoa, that’s odd – your compat level in the query plan is showing 70: CardinalityEstimationModelVersion=”70″
I figured it out. I had Legacy CE set to ON.
Set it to OFF and now my plan sucks. ?
Back to the drawing board.
Along the lines of Peter:
===================
;
with cte_Rep as ( — 2960 logical reads
select id
from users
where Reputation > 50
)
, cte_Date as ( — 1981 logical reads
select id ,CreationDate
from users
where CreationDate > ‘20220305’
–order by CreationDate desc
)
, cte as ( — 4941 logical reads if we don’t care about getting the complete Users record
select top 1000 d.Id
from cte_Date d
join cte_Rep r on (d.Id = r.id)
order by CreationDate desc
)
select u.* — 9012 total logical reads to get the complete Users record
from Users u
join cte on (u.id = cte.id)
==============================================================
Plan:
https://www.brentozar.com/pastetheplan/?id=rJHaCMSnp
No code to post, as mine was similar to others. However, although my version did similar logical reads, it used a bit more CPU and execution cycles since I used RANK() instead of TOP to get the 1000 best reputations. The reason I used RANK() is that (in my version of Stack Overflow at least) there is a tie for last place, and I think that other last-place user should be included. So my query returns 1001 rows instead of 1000. Maybe a nit-pick, maybe not what the business users wanted, but I think it’s more accurate.
Jake – great point about the ties!
My solution involved querying the Users table twice, returning a limited set of columns, to get the benefit of the two indexes then joining the two together and performing the ORDER BY and then finally joining back to the users table to get the columns for display purposes.
https://gist.github.com/steveearle86/a58478291795b788ec47aa0991d3ea2f
Steve – there you go! Good work.
SELECT TOP 1000 Users.Id
FROM dbo.Users
WHERE Users.Id > (SELECT TOP 1 U2.Id FROM Users U2 WHERE U2.CreationDate > DATEADD(DAY, -90, ‘2010-11-09 22:17:33.317’))
AND Users.Reputation > 50;
Alex – so, are we returning the same data that the developers were asking for?
— if you do not hard code the date
SELECT TOP 1000 Users.Id
FROM dbo.Users
WHERE Users.Id > (SELECT TOP 1 U2.Id FROM Users U2 WHERE U2.CreationDate >
DATEADD(DAY, -90, (SELECT MAX(U3.CreationDate) FROM dbo.Users U3)))
AND Users.Reputation > 50;
Hey Brent,
your question was very interesting and I am actually chilling a bit. Here comes my idea for the solution.
SET STATISTICS IO, TIME ON;
GO
WITH date_limit
AS
(
/* Define the time range for the filter on CreationDate */
SELECT TOP (1)
DATEADD(DAY, -90, CreationDate) AS start_date,
CreationDate AS end_date
FROM dbo.Users
ORDER BY
Id DESC
),
E
AS
(
/* Now we filter the top 1000 order by CreationDate */
SELECT TOP (1000)
*
FROM dbo.Users,
date_limit AS dl
WHERE Reputation > 50
AND CreationDate >= dl.start_date
AND CreationDate <= dl.end_date
ORDER BY
CreationDate DESC
)
/* To sort the final result by reputation */
SELECT * FROM e
ORDER BY
e.Reputation DESC
/* Note: 9130 only to check where the pushdown jumps in! */
OPTION (RECOMPILE, QUERYTRACEON 9130);
GO
The query stats (IO / TIME):
(1000 rows affected)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, …
Table 'Users'. Scan count 2, logical reads 22732, physical reads 0, …
SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 144 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
And finally the plan…
https://www.brentozar.com/pastetheplan/?id=ByaNnAr3T
PS: I'm with StackOverflow2013 (running in 160 CompLevel)
Fullscan: 45.184 pages
Solution: 22.732 pages
Best from Germany, Uwe
Sorry – I’m wrong with the exercise. – to much chilling…
Let’s see wether I can do it better…
Hahaha, no worries – I think you can do a lot better than that. You’re a sharp fella. 😉
I’m using the large [StackOverflow] database with compat level 140.
For my solution I first get the minimum user ID for the users created within the date range and store that to a variable: this uses a seek on the new [CreationDate] index. I then use that variable in a second query that does a backwards index seek on the new [Reputation] index where only users from our minimum ID are included.
In total there are 5131 logical reads versus a full table with 143667 logical reads. Here’s the plan: https://www.brentozar.com/pastetheplan/?id=rJzQmyI26
A few of observations. Firstly, a MIN is used to avoid the situation where the creation date is not aligned with the ascending ID. Second, the plan works for any date but take longer the earlier the date. Finally, without a FORCESEEK the first query performs a clustered index scan taking 135139 logical reads, which is not much less than the full scan.
The final point is interesting because there are good statistics from the [CreationDate] index, and the date being used as the predicate is a scalar value since the result of the DATEADD can be derived by the optimiser. If I count the number of users within the date range the new [CreationDate] index is used whereas if I get the MIN or MAX value for the ID – which is readily available in the [CreationDate] index – a clustered index scan is used which is odd. Changing to compat level 150 makes no difference. See https://www.brentozar.com/pastetheplan/?id=By6vYJLna
Damian – that’s creative, but unfortunately it returns invalid data. If you query through the Users table, you’ll find that Id and CreationDate don’t always go up together. There are high-Id users with old CreationDates, and low-Id users with newer CreationDates.
Ah – quite right, 555 users are included but have an earlier creation date, my fault for not checking.
That’s easily corrected by adding the filter on the date predicate on the second query without much change to the plan, a small increase up to 5152 reads. https://www.brentozar.com/pastetheplan/?id=r1Ob1QIha
Changing this to apply a tie-break for users with the same reputation changes this to a clustered index seek with over 7500 reads.
Damian – there you go, good work!
SELECT TOP 1000 u.Id
FROM dbo.Users AS u
WHERE u.CreationDate >= (SELECT TOP(1) DATEADD(DAY, -90, u.CreationDate) FROM dbo.Users AS u ORDER BY u.CreationDate DESC)
AND u.Reputation > 50
ORDER BY u.Reputation DESC
Plan:
https://www.brentozar.com/pastetheplan/?id=HJfUpGU3T
Performance: ~6,7k reads vs. 45,4k reads for a full table scan
To go further down you would e.g. need to add an INCLUDE(Reputation) to the CreationDate-Index which would drop the reads to ~750
Thomas – and does that return the same information that our developers needed?
I think yes, since I didn’t made the mistakes (where you wrote the same) to filter for Users.Id > xxx (which would assume that it is guaranteed that there are no out-of-order created users).
Of course you could add the u.Reputation to the output list, if the analysts needs it (isn’t 100% clear), this wouldn’t change the execution plan. And I don’t see anything that says, that they want more / additional columns. If they really want, you would have to put my statement into a subquery and join the dbo.Users to this sub select on sub.Id = u.Id. This would increase the reads by ~3k (3 pages read for each of the 1000 Users in a nested lookup).
Another a bit unclear point is the CreationDate itself, depending on the real requirement you may want to put a DATETRUNC(DAY, …) around the DATEADD(), so when my CreationDate is at some day 23:59:58 it would not just show 90 * 24 but 90 * 24 + 23:59:58 hours of created users. And as always you could use > instead of >= when it is not 100% clear defined.
Generally speaking, people want more than an Id, though. You’re probably going to want to show things like DisplayName, Location, etc.
select top 1000 u3.*
from users as u1
inner join users as u2 on u1.Id = u2.Id
inner join users as u3 on u2.id = u3.id
where u1.CreationDate > dateadd( day , -90 , ‘2013-09-14 23:06:39.187’ )
and u2.Reputation > 50
order by u2.Reputation desc
Full Scan is 91334 while the query above drops to 6115
Christos – I LOVE THIS SOLUTION SO MUCH because it’s so creative. On mine (2018-06 copy), it does 41,004 logical reads, whereas a full table scan is 143,667. Good work! Other solutions get less logical reads, but yours is so awesome. It shows that you understand what the engine is doing, and how to work around it.
SET STATISTICS IO ON;
GO
DECLARE @LastDate DATETIME = CAST(DATEADD(dd, -90,
(SELECT TOP 1 CreationDate FROM dbo.Users ORDER BY Id DESC)) AS DATE);
SELECT @LastDate;
/*
SELECT TOP 1000 u.CreationDate, u.Reputation, *
FROM dbo.Users u
WHERE u.CreationDate > @LastDate AND
u.Reputation > 50
ORDER BY u.Reputation DESC;
*/
WITH LastCreatedUsers AS
(
SELECT u.CreationDate, u.id
FROM dbo.Users u WITH(INDEX(CreationDate))
WHERE u.CreationDate > @LastDate
–AND
–u.Reputation > 50
–ORDER BY u.Reputation DESC;
), TopFiftyUsers AS
(
SELECT u.Reputation, u.Id
FROM dbo.Users u WITH(INDEX(Reputation))
WHERE u.Reputation > 50
–ORDER BY u.Reputation DESC;
)
SELECT TOP 1000 l.CreationDate, t.Reputation, t.Id
FROM LastCreatedUsers l
INNER JOIN TopFiftyUsers t ON t.Id = l.Id
ORDER BY t.Reputation DESC
/*Logical reads*/
/*
(1000 rows affected)
Table ‘Users’. Scan count 10, logical reads 2693, physical reads 0
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0,
Table ‘Workfile’. Scan count 0, logical reads 0, physical reads 0,
*/
And does that return the same data as the query we’re trying to tune?
Hi Brent,
yes, this is my original query, do I miss something?
SELECT TOP 1000 u.CreationDate, u.Reputation, u.Id
FROM dbo.Users u
WHERE u.CreationDate > @LastDate AND
u.Reputation > 50
ORDER BY u.Reputation DESC;
Table ‘Users’. Scan count 1, logical reads 1917442
Yes, does that seem like it gives folks enough information? Might they want other things in their result set, like the user’s name, location, etc?
you are right Brent, let’s say that the scope was not clear enough for me 🙂
Solution:
SET STATISTICS IO ON;
GO
DECLARE @LastDate DATETIME = CAST(DATEADD(dd, -90,
(SELECT TOP 1 CreationDate FROM dbo.Users ORDER BY Id DESC)) AS DATE);
SELECT @LastDate;
/*
SELECT TOP 1000 u.CreationDate, u.Reputation, *
FROM dbo.Users u
WHERE u.CreationDate > @LastDate AND
u.Reputation > 50
ORDER BY u.Reputation DESC;
*/
WITH LastCreatedUsers AS
(
SELECT u.CreationDate, u.id
FROM dbo.Users u WITH(INDEX(CreationDate))
WHERE u.CreationDate > @LastDate
–AND
–u.Reputation > 50
–ORDER BY u.Reputation DESC;
), TopFiftyUsers AS
(
SELECT u.Reputation, u.Id
FROM dbo.Users u WITH(INDEX(Reputation))
WHERE u.Reputation > 50
–ORDER BY u.Reputation DESC;
)
SELECT TOP 1000 l.CreationDate, t.Reputation, t.Id, u.DisplayName, u.Location, u.Age
FROM LastCreatedUsers l
INNER JOIN TopFiftyUsers t ON t.Id = l.Id
INNER JOIN Users u on u.Id = t.Id
ORDER BY t.Reputation DESC
/*result*/
/*
Table ‘Users’. Scan count 10, logical reads 10849
*/
There we go! Much better, nice work!
Quote:
“Now, because you’re all using different copies of the Stack Overflow database, with different end dates, find the most recently created user in your database:
SELECT TOP 1 CreationDate FROM dbo.Users ORDER BY Id DESC;
/* 2013-09-14 23:06:39.187 */”
Unquote
/* Lets compare that to a ORDER BY CreationDate */
SELECT TOP 1 CreationDate FROM dbo.Users ORDER BY CreationDate DESC;
/* 2013-12-31 23:59:23.147
Bad Brent, providing us with a bogus query :o)
*/
So I normally I would go with the more purist query of ordering by CreationDate instead of Id. But to comparison reasons I used the Id.
My solution(s):
https://gist.github.com/Montro1981/00b25e7eb27b5ddc72e168330085c651
Heh heh heh, good work! Yeah, you hit the roadblocks I expected folks to hit. Good work.
1.Write the query.
select top 1000 displayname,location,CreationDate,reputation
from dbo.users
where CreationDate > (select top 1 DATEADD(dd, -90,creationdate) FROM dbo.Users ORDER BY Id DESC)
and reputation > 50
order by reputation desc
2. How does your query perform? Compare the number of logical reads your query does, versus what it’d take to scan the whole table once.
It does 91336 logical reads with the indexes. 45187 without the indexes
3. If your query does more logical reads than a table scan, can you get it to do less, without changing the indexes on the table?
No and ran out of time (and talent). If I remove the top 1000 clause, i get the same number of reads as the table scan, so then I tried to experiment with using rownum instead to see if this makes a difference but struggling to get the same result set.
Good work! Yeah, you hit the problems I expected folks to hit. Stay tuned for the solutions post, or check out the comments from the other readers who worked on it as well.
[…] query exercise for this week was to write a query to find users created in the last 90 days, with a reputation higher than 50 […]