Query Exercise: Find Recent Superstars

Query Exercises
44 Comments

For this week’s Query Exercise, we’re working with the Stack Overflow database, and our business users have asked us to find the new superstars. They’re looking for the top 1000 users who were created in the last 90 days, who have a reputation higher than 50 points, from highest reputation to lowest.

In your Stack Overflow database, create just these two indexes:

And then write the query. Now, because you’re all using different copies of the Stack Overflow database, with different end dates, find the most recently created user in your database:

Use that date as the finish date on your query, and write a query to find the top 1000 users created in the last 90 days (ending with the date you just found), who have a reputation higher than 50 points.

Your challenge has a few parts:

  1. Write the query.
  2. How does your query perform? Compare the number of logical reads your query does, versus what it’d take to scan the whole table once.
  3. If your query does more logical reads than a table scan, can you get it to do less, without changing the indexes on the table?

Post your solutions in the comments, and feel free to ask other folks questions about their solutions. If your solution involves code and you want it to look good, paste it as a Github Gist, and then include the Gist link in your comment. Then, check out my answers & discussion.

Previous Post
Finding Long Values Faster: Answers & Discussion
Next Post
[Video] Office Hours: Hot Tub Edition

44 Comments. Leave new

  • In database StackOverflow2013 after clean restor from backup I get non-identical results of the following queries. Is it an error in the data or am I doing something wrong?

    SELECT TOP 1 * FROM dbo.Users ORDER BY Id DESC
    SELECT TOP 1 * FROM dbo.Users ORDER BY CreationDate DESC;

    Reply
    • I’ll rephrase your question: “Should Id and CreationDate always go up together?” Not necessarily – as you’ll see if you dig deeper into the data. That’s a separate question from today’s homework post though.

      Reply
  • Well, I assume my query has some code, so here’s the Gist link, looking forward for the feedback.

    https://gist.github.com/peterkruis/ddccb033b523a755f5d543269a5796aa

    Reply
  • 1. Done
    2. Horrible – the amount of logical reads is equal to 14 table scans
    3. Yup – add a temp table to the mix and we’re down to 7525 total logical reads – https://gist.github.com/VladDBA/fb4ec8c1248be4c2fa399659dc9523f3

    Reply
    • Hmm, can you post the query plan you’re getting for the first part of your query? On mine at least, I’m getting a table scan for the query populating the temp table. I’m guessing – just guessing – that you’re using an index that isn’t part of the challenge’s requirements.

      Reply
  • Jordan Tenjeras
    February 22, 2024 7:28 pm

    Along the lines of Peter:
    ===================
    ;
    with cte_Rep as ( — 2960 logical reads
    select id
    from users
    where Reputation > 50
    )
    , cte_Date as ( — 1981 logical reads
    select id ,CreationDate
    from users
    where CreationDate > ‘20220305’
    –order by CreationDate desc
    )
    , cte as ( — 4941 logical reads if we don’t care about getting the complete Users record
    select top 1000 d.Id
    from cte_Date d
    join cte_Rep r on (d.Id = r.id)
    order by CreationDate desc
    )
    select u.* — 9012 total logical reads to get the complete Users record
    from Users u
    join cte on (u.id = cte.id)
    ==============================================================
    Plan:
    https://www.brentozar.com/pastetheplan/?id=rJHaCMSnp

    Reply
  • No code to post, as mine was similar to others. However, although my version did similar logical reads, it used a bit more CPU and execution cycles since I used RANK() instead of TOP to get the 1000 best reputations. The reason I used RANK() is that (in my version of Stack Overflow at least) there is a tie for last place, and I think that other last-place user should be included. So my query returns 1001 rows instead of 1000. Maybe a nit-pick, maybe not what the business users wanted, but I think it’s more accurate.

    Reply
  • My solution involved querying the Users table twice, returning a limited set of columns, to get the benefit of the two indexes then joining the two together and performing the ORDER BY and then finally joining back to the users table to get the columns for display purposes.

    https://gist.github.com/steveearle86/a58478291795b788ec47aa0991d3ea2f

    Reply
  • SELECT TOP 1000 Users.Id
    FROM dbo.Users
    WHERE Users.Id > (SELECT TOP 1 U2.Id FROM Users U2 WHERE U2.CreationDate > DATEADD(DAY, -90, ‘2010-11-09 22:17:33.317’))
    AND Users.Reputation > 50;

    Reply
  • — if you do not hard code the date
    SELECT TOP 1000 Users.Id
    FROM dbo.Users
    WHERE Users.Id > (SELECT TOP 1 U2.Id FROM Users U2 WHERE U2.CreationDate >
    DATEADD(DAY, -90, (SELECT MAX(U3.CreationDate) FROM dbo.Users U3)))
    AND Users.Reputation > 50;

    Reply
  • Hey Brent,

    your question was very interesting and I am actually chilling a bit. Here comes my idea for the solution.

    SET STATISTICS IO, TIME ON;
    GO

    WITH date_limit
    AS
    (
    /* Define the time range for the filter on CreationDate */
    SELECT TOP (1)
    DATEADD(DAY, -90, CreationDate) AS start_date,
    CreationDate AS end_date
    FROM dbo.Users
    ORDER BY
    Id DESC
    ),
    E
    AS
    (
    /* Now we filter the top 1000 order by CreationDate */
    SELECT TOP (1000)
    *
    FROM dbo.Users,
    date_limit AS dl
    WHERE Reputation > 50
    AND CreationDate >= dl.start_date
    AND CreationDate <= dl.end_date
    ORDER BY
    CreationDate DESC
    )
    /* To sort the final result by reputation */
    SELECT * FROM e
    ORDER BY
    e.Reputation DESC
    /* Note: 9130 only to check where the pushdown jumps in! */
    OPTION (RECOMPILE, QUERYTRACEON 9130);
    GO

    The query stats (IO / TIME):
    (1000 rows affected)
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, …
    Table 'Users'. Scan count 2, logical reads 22732, physical reads 0, …

    SQL Server Execution Times:
    CPU time = 16 ms, elapsed time = 144 ms.
    SQL Server parse and compile time:
    CPU time = 0 ms, elapsed time = 0 ms.

    And finally the plan…

    https://www.brentozar.com/pastetheplan/?id=ByaNnAr3T

    PS: I'm with StackOverflow2013 (running in 160 CompLevel)
    Fullscan: 45.184 pages
    Solution: 22.732 pages

    Best from Germany, Uwe

    Reply
  • I’m using the large [StackOverflow] database with compat level 140.

    For my solution I first get the minimum user ID for the users created within the date range and store that to a variable: this uses a seek on the new [CreationDate] index. I then use that variable in a second query that does a backwards index seek on the new [Reputation] index where only users from our minimum ID are included.

    In total there are 5131 logical reads versus a full table with 143667 logical reads. Here’s the plan: https://www.brentozar.com/pastetheplan/?id=rJzQmyI26

    A few of observations. Firstly, a MIN is used to avoid the situation where the creation date is not aligned with the ascending ID. Second, the plan works for any date but take longer the earlier the date. Finally, without a FORCESEEK the first query performs a clustered index scan taking 135139 logical reads, which is not much less than the full scan.

    The final point is interesting because there are good statistics from the [CreationDate] index, and the date being used as the predicate is a scalar value since the result of the DATEADD can be derived by the optimiser. If I count the number of users within the date range the new [CreationDate] index is used whereas if I get the MIN or MAX value for the ID – which is readily available in the [CreationDate] index – a clustered index scan is used which is odd. Changing to compat level 150 makes no difference. See https://www.brentozar.com/pastetheplan/?id=By6vYJLna

    Reply
    • Damian – that’s creative, but unfortunately it returns invalid data. If you query through the Users table, you’ll find that Id and CreationDate don’t always go up together. There are high-Id users with old CreationDates, and low-Id users with newer CreationDates.

      Reply
  • SELECT TOP 1000 u.Id
    FROM dbo.Users AS u
    WHERE u.CreationDate >= (SELECT TOP(1) DATEADD(DAY, -90, u.CreationDate) FROM dbo.Users AS u ORDER BY u.CreationDate DESC)
    AND u.Reputation > 50
    ORDER BY u.Reputation DESC

    Plan:
    https://www.brentozar.com/pastetheplan/?id=HJfUpGU3T
    Performance: ~6,7k reads vs. 45,4k reads for a full table scan

    To go further down you would e.g. need to add an INCLUDE(Reputation) to the CreationDate-Index which would drop the reads to ~750

    Reply
    • Thomas – and does that return the same information that our developers needed?

      Reply
      • I think yes, since I didn’t made the mistakes (where you wrote the same) to filter for Users.Id > xxx (which would assume that it is guaranteed that there are no out-of-order created users).

        Of course you could add the u.Reputation to the output list, if the analysts needs it (isn’t 100% clear), this wouldn’t change the execution plan. And I don’t see anything that says, that they want more / additional columns. If they really want, you would have to put my statement into a subquery and join the dbo.Users to this sub select on sub.Id = u.Id. This would increase the reads by ~3k (3 pages read for each of the 1000 Users in a nested lookup).

        Another a bit unclear point is the CreationDate itself, depending on the real requirement you may want to put a DATETRUNC(DAY, …) around the DATEADD(), so when my CreationDate is at some day 23:59:58 it would not just show 90 * 24 but 90 * 24 + 23:59:58 hours of created users. And as always you could use > instead of >= when it is not 100% clear defined.

        Reply
  • Christos Bellos
    February 23, 2024 1:48 pm

    select top 1000 u3.*
    from users as u1
    inner join users as u2 on u1.Id = u2.Id
    inner join users as u3 on u2.id = u3.id
    where u1.CreationDate > dateadd( day , -90 , ‘2013-09-14 23:06:39.187’ )
    and u2.Reputation > 50
    order by u2.Reputation desc

    Full Scan is 91334 while the query above drops to 6115

    Reply
    • Christos – I LOVE THIS SOLUTION SO MUCH because it’s so creative. On mine (2018-06 copy), it does 41,004 logical reads, whereas a full table scan is 143,667. Good work! Other solutions get less logical reads, but yours is so awesome. It shows that you understand what the engine is doing, and how to work around it.

      Reply
  • IalsoWantAFerrari
    February 23, 2024 3:03 pm

    SET STATISTICS IO ON;
    GO

    DECLARE @LastDate DATETIME = CAST(DATEADD(dd, -90,
    (SELECT TOP 1 CreationDate FROM dbo.Users ORDER BY Id DESC)) AS DATE);

    SELECT @LastDate;
    /*
    SELECT TOP 1000 u.CreationDate, u.Reputation, *
    FROM dbo.Users u
    WHERE u.CreationDate > @LastDate AND
    u.Reputation > 50
    ORDER BY u.Reputation DESC;
    */

    WITH LastCreatedUsers AS
    (
    SELECT u.CreationDate, u.id
    FROM dbo.Users u WITH(INDEX(CreationDate))
    WHERE u.CreationDate > @LastDate
    –AND
    –u.Reputation > 50
    –ORDER BY u.Reputation DESC;
    ), TopFiftyUsers AS
    (
    SELECT u.Reputation, u.Id
    FROM dbo.Users u WITH(INDEX(Reputation))
    WHERE u.Reputation > 50
    –ORDER BY u.Reputation DESC;
    )

    SELECT TOP 1000 l.CreationDate, t.Reputation, t.Id
    FROM LastCreatedUsers l
    INNER JOIN TopFiftyUsers t ON t.Id = l.Id
    ORDER BY t.Reputation DESC

    /*Logical reads*/
    /*
    (1000 rows affected)
    Table ‘Users’. Scan count 10, logical reads 2693, physical reads 0
    Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0,
    Table ‘Workfile’. Scan count 0, logical reads 0, physical reads 0,
    */

    Reply
    • And does that return the same data as the query we’re trying to tune?

      Reply
      • IalsoWantAFerrari
        February 23, 2024 3:18 pm

        Hi Brent,
        yes, this is my original query, do I miss something?

        SELECT TOP 1000 u.CreationDate, u.Reputation, u.Id
        FROM dbo.Users u
        WHERE u.CreationDate > @LastDate AND
        u.Reputation > 50
        ORDER BY u.Reputation DESC;

        Table ‘Users’. Scan count 1, logical reads 1917442

        Reply
  • IalsoWantAFerrari
    February 23, 2024 3:25 pm

    you are right Brent, let’s say that the scope was not clear enough for me 🙂
    Solution:
    SET STATISTICS IO ON;
    GO

    DECLARE @LastDate DATETIME = CAST(DATEADD(dd, -90,
    (SELECT TOP 1 CreationDate FROM dbo.Users ORDER BY Id DESC)) AS DATE);

    SELECT @LastDate;
    /*
    SELECT TOP 1000 u.CreationDate, u.Reputation, *
    FROM dbo.Users u
    WHERE u.CreationDate > @LastDate AND
    u.Reputation > 50
    ORDER BY u.Reputation DESC;
    */

    WITH LastCreatedUsers AS
    (
    SELECT u.CreationDate, u.id
    FROM dbo.Users u WITH(INDEX(CreationDate))
    WHERE u.CreationDate > @LastDate
    –AND
    –u.Reputation > 50
    –ORDER BY u.Reputation DESC;
    ), TopFiftyUsers AS
    (
    SELECT u.Reputation, u.Id
    FROM dbo.Users u WITH(INDEX(Reputation))
    WHERE u.Reputation > 50
    –ORDER BY u.Reputation DESC;
    )

    SELECT TOP 1000 l.CreationDate, t.Reputation, t.Id, u.DisplayName, u.Location, u.Age
    FROM LastCreatedUsers l
    INNER JOIN TopFiftyUsers t ON t.Id = l.Id
    INNER JOIN Users u on u.Id = t.Id
    ORDER BY t.Reputation DESC

    /*result*/
    /*
    Table ‘Users’. Scan count 10, logical reads 10849
    */

    Reply
  • Quote:
    “Now, because you’re all using different copies of the Stack Overflow database, with different end dates, find the most recently created user in your database:
    SELECT TOP 1 CreationDate FROM dbo.Users ORDER BY Id DESC;
    /* 2013-09-14 23:06:39.187 */”
    Unquote

    /* Lets compare that to a ORDER BY CreationDate */
    SELECT TOP 1 CreationDate FROM dbo.Users ORDER BY CreationDate DESC;
    /* 2013-12-31 23:59:23.147
    Bad Brent, providing us with a bogus query :o)
    */

    So I normally I would go with the more purist query of ordering by CreationDate instead of Id. But to comparison reasons I used the Id.

    My solution(s):
    https://gist.github.com/Montro1981/00b25e7eb27b5ddc72e168330085c651

    Reply
  • 1.Write the query.
    select top 1000 displayname,location,CreationDate,reputation
    from dbo.users
    where CreationDate > (select top 1 DATEADD(dd, -90,creationdate) FROM dbo.Users ORDER BY Id DESC)
    and reputation > 50
    order by reputation desc

    2. How does your query perform? Compare the number of logical reads your query does, versus what it’d take to scan the whole table once.

    It does 91336 logical reads with the indexes. 45187 without the indexes

    3. If your query does more logical reads than a table scan, can you get it to do less, without changing the indexes on the table?

    No and ran out of time (and talent). If I remove the top 1000 clause, i get the same number of reads as the table scan, so then I tried to experiment with using rownum instead to see if this makes a difference but struggling to get the same result set.

    Reply
    • Good work! Yeah, you hit the problems I expected folks to hit. Stay tuned for the solutions post, or check out the comments from the other readers who worked on it as well.

      Reply
  • […] query exercise for this week was to write a query to find users created in the last 90 days, with a reputation higher than 50 […]

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.