Query Exercise: Find Tagged Questions Faster.

Query Exercises
42 Comments

For this week’s Query Exercise, we’re going to start with an existing query that has performance problems from time to time, depending on how it’s used.

This query’s goal is to find the top-voted Stack Overflow questions for any given tag – for example, here are the top-voted questions tagged “sql-server”.

What are tags? I’m glad you asked. When you’re looking at a question at Stack Overflow, you’ll see a list of tags for that question:

The tags are at the bottom of that screenshot: sql, sql-server, t-sql, sql-update, dml. A question can have up to 5 tags. To see how they’re stored in the Stack Overflow database, let’s look at the posts table, which is where questions & answers are stored:

There’s a “Tags” column:

So to find the questions tagged sql-server, our analyst wrote this query:

And our DBA helped by creating this index:

Right now, the actual execution plan is fairly simple. SQL Server scans that index backwards, going from the top-scoring posts down, checking each one’s tags to see if it contains <sql-server>.

Your exercises for this week are:

  1. What kinds of tags will perform worse than others for this query?
  2. Could you change the query to perform better?
  3. Could you change the indexes to perform better, without changing the table structure?

You can post your answers in this blog post’s comments, and discuss each others’ ideas. We’ll revisit your answers in this post. Have fun!

Previous Post
Finding the Best Time for Maintenance: Answers & Discussion
Next Post
What PowerBI Questions Do You Have for Eugene Meidinger?

42 Comments. Leave new

  • Henrik Staun Poulsen
    January 25, 2024 7:39 pm

    I’ll hazard a guess or two:
    1) any obscure tag will require most of the table read, hence slower than frequently used tags.

    2) One could build a new table; PostTags with PostId and Tag. Rewrite query to use that table too. Keep the table updated with a trigger, or change the application to do that too.

    3) Does a Non-Clustered Columns Store index count as changes to the table structure?
    It does not count in my book, and I have created a few of those.

    Reply
    • Great work as always, sir! I would say that a non-clustered columnstore index doesn’t count as changes to table structure. The followup question though: will it be faster? Is CS good at string searching?

      Reply
      • Henrik Staun Poulsen
        January 25, 2024 8:24 pm

        I have been surprised of how small the index was.
        Hence it does not take long to read into memory.
        “My” system has not been CPU bound, so I do not know about CPU cycles.

        If you add a NCCI, then any new columns added to the table are NOT included in NCCI.
        You must rebuild it yourself.
        I have been thinking of a database trigger to warn me, but never gotten round to it.

        Reply
      • Brian Boodman
        January 26, 2024 2:58 pm

        > The followup question though: will [non-clustered columnstore index] be faster? Is CS good at string searching?

        My thoughts before empirical testing:
        Your reply seems to be hinting that an NCCI won’t help and I’ve often been unpleasantly surprised by how expensive a like clause is. Adding a NCCI makes it faster to read in the tag column, but does nothing to speed up the “like” clause, which probably dominates the query time. So, I’d expect the NCCI to have similar performance.

        Testing as follows:
        1) Use mini database (sorry)
        2) Use ‘%%’ to force match failure.
        3) Doing a run and ignoring the results to force measurements to be against a warm cache
        Inappropriately testing on the mini database and intentionally using a non-existent tag to force worst-case behavior:

        With Score_Tags index: CPU time = 2468 ms, elapsed time = 676 ms.
        With NCCI: CPU time = 2672 ms, elapsed time = 984 ms.

        Oh, dear. I can recover the old performance with with(index(Score_Tags)), but by default the query favors the NCCI and runs slower.

        I even tried forcefully clearing the cache before each run, but the NCCI approach was still slower. Mind you, this is definitely not the greatest place to be using an NCCI. NCCI shines with warehousing type queries. While the 3.7M posts doesn’t subceed Microsofts guidance if 1M rows, it’s still a sub-second OLTP, which is very much not the place where NCCI shines. Even so, I was expecting the NCCI’s overhead to be negligible, which very much wasn’t the case.

        Reply
  • RollbackIsSingleThreaded
    January 25, 2024 9:42 pm

    Hi Brent!

    I apologize if my response is not aligned with your expectations.

    In the newer version of the StackOverflow database, there are two distinct tables named ‘tags’ and ‘posttags.’

    In previous versions of StackOverflow, querying a question tagged with ‘Brent’ would result in significant slowdowns. I encountered a similar challenge with a large log table in production, and the solution involved implementing a full-text index, which worked like magic.

    Thank you for generously sharing your knowledge! Your input is greatly appreciated.

    Reply
    • @RollbackIsSingleThreaded – let’s focus on answering the questions in the post. Thanks!

      Reply
      • RollbackIsSingleThreaded
        January 26, 2024 6:23 am

        Alright, though the “post” table utilized in the query doesn’t even adhere to the first normal form (1NF).
        Thanks!

        Reply
        • sometimes you have to work with the stuff you have, because changing it afterwards would result in a 6 month project for a 20 man team to make the app and all the reports and existing queries work with the new normalized version of it…

          Or it is a vendor product and you mustn’t touch any of the tables, because it would break their upgrade scripts and you lose the warranty.

          Reply
        • And the 1st Normal form is the most important one for performance!

          Reply
  • JennyFromDBLock
    January 26, 2024 1:12 am

    Short term lurker, first time poster, here is my attempt.

    1. Longer Tags/Strings, resulting in more data the DB engine needs to search/scan through.

    2. Introduce another WHERE predicate, eg Score > 0 (assuming there are no scores of 0) to better utilise the index due to index ordering being on Score then Tags, but this may come at a cost of higher CPU to evaluate the extra predicate…?

    In the Select only return Id, Score and Tags to eliminate the key lookup… query now performs faster (but may no longer meet the business requirements)

    3. Change the index to be on the predicate only (Tags) and leave off the Score, resulting in a smaller index to search. The Score will be retrieved as part of the lookup on the clustered key index to return all the columns, so can be ordered off that.

    Alternatively you could have a index on Tags, and INCLUDE every other column (except the primary key) to eliminate the key-lookup on the execution plan.

    Another solution, drop the index and create and use a full text search index instead (CONTAINS keyword).

    Reply
    • Jenny – welcome to the loud crowd!

      1. About longer tags/strings – are you thinking long tags that we have to search THROUGH, or long tags that we’re looking FOR?

      2. I love the idea of adding another predicate, that’s great!

      3. Check the tags in the picture – might those present a problem for full text search?

      Reply
      • JennyFromDBLock
        January 26, 2024 11:29 pm

        Thanks Brent.

        1. Was thinking that the rows that have longer sets of up to 5 tags, would mean that it takes more time to search through.

        3. What about adding a filter on the index? eg WHERE tags IS NOT NULL?

        Does the index order matter here? I would assume an index on Tags first, Score second would perform better than the current one on Score first, Tags second based on the predicate being on tags only.

        Reply
        • 1 – Gotcha!

          3 – The filter is a great idea! The question would be, what percentage of the rows have tags not null. And then, if the rows are null, think about how that relates back to your idea in #1.

          Does index matter – hey, that’s why I give y’all the databases as open source. Go check it out to find out!

          Reply
  • 1) Tags used on only 10 posts, where one of the posts is (tied for) lowest Score. That makes SQL Server scan through all the index rows rather than stopping after finding the first 10 that match, and you’re still doing all 10 key lookups to the clustered index.

    2) Normalize by creating a Tags table of distinct tag values (2013 version has over 36,000 distinct tags), plus a PostsTags “junction” table with just PostID and TagID, and a clustered composite PK on those. To be really fast, you could copy the Score value from Posts into PostsTags, then create an index on (TagID, Score); since the clustering key includes PostID, that value is included the NC index and it becomes covering for the query. Normally, with that many tag values and the string length, I’d create an Identity column on Tags as PK, but if you made the string value the PK, that would push the string into PostsTags as TagID and eliminate the need to join to Tags in the query. Definitely use data compression on the PostsTags indexes at that point.

    3) Full Text Search. This post says so – https://www.brentozar.com/archive/2010/06/sargable-why-string-is-slow/

    I’ve got to admit, I tried creating an XML type computed column with this (50/50 chance the post software hoses the less-than and greater-than symbols)
    ALTER TABLE dbo.Posts
    ADD TagsXML AS CONVERT(xml, ” + REPLACE(SUBSTRING(Tags,2, LEN(Tags)-2), ‘><','’) + ”);
    and then using XQuery to parse out the tags. It works, but it’s more work for SQL than the original query. I have a hunch that creating an XML index on that column would help, but apparently you can’t put an XML index on a computed XML column, at least in SQL2016. But after trying to once again teach myself XQuery from the SQL docs to do this, I’m convinced that why madness lies. If someone else can get that to perform, more power to you.

    Reply
    • 1) almost – correct anwere woulb be tags with less then 10 findings, since it has to scan the whole table until it can stop, while with exact 10 matches it could theoretical stop after the very first page when all 10 were used within a few minutes (maybe double posts because of a big server lag).

      Reply
    • DW – good answer on #1, and Thomas’s clarification to your answer is even better there.

      On #2 – that’s good, but it’s not changing the query – it’s changing the entire database to match the query. That’s pretty tough.

      On #3 – I love the idea, and I think you’re on a great track there. I bet it could work, but yeah, it’ll take some heavy lifting. I like it.

      Reply
  • Ayup, some of that XML got hosed. The empty ” are are opening and closing a Tags parent and Tag child element, and the internal >< is replaced with closing one Tag child and opening another.

    Reply
  • 1) What kinds of tags will perform worse than others for this query?
    Because creating this index, is the same as creating a new table which only has the Score and the Tags columns, I’d say Tags longer in lenght will performe worse
    then other tags, as they will need more effort from SQL SERVER engine to be searched/scanned

    2) Could you change the query to perform better?
    Before adjusting the query, I would firstly change the DB structure a little bit.
    Here’s what I would do:
    1 – create a new table with Tags Descriptions and Tag Id’s (Naturally, DB and front end would need adjstments to keep this table updated)
    2 – Use that table to rewrite the query, making use of DENSE_RANK

    3) Could you change the indexes to perform better, without changing the table structure?
    Without changing the table structure, I’d say best aproach would be using columnstore index: https://learn.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview?view=sql-server-ver16

    Reply
    • ST – hmm, you’ll probably want to test your ideas on a copy of the Stack Overflow database. I wanna be gentle here, but you’re on the wrong track altogether.

      Reply
      • Thanks for the feeback Brent.
        I’ll do that once I get a chance, and hopefully I will try and learn something from it.

        Reply
  • Could this be an option?

    STATISTICS TIME, IO ON;

    ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 8;

    CREATE NONCLUSTERED INDEX [NCI_Tags] ON [Posts] ([Tags])
    WITH (ONLINE=ON, DATA_COMPRESSION=PAGE);

    CREATE NONCLUSTERED INDEX [NCI_Tags_Score_Include] ON [Posts] ([Tags], [Score] DESC)
    INCLUDE([Id],[AcceptedAnswerId],[AnswerCount],[Body],[ClosedDate],[CommentCount],[CommunityOwnedDate],[CreationDate],[FavoriteCount],[LastActivityDate],[LastEditDate],[LastEditorDisplayName],[LastEditorUserId],[OwnerUserId],[ParentId],[PostTypeId],[Title],[ViewCount])
    WITH (ONLINE=ON, DATA_COMPRESSION=PAGE);

    SELECT *
    FROM [Posts]
    WHERE Tags IN (
    SELECT DISTINCT(Tags)
    FROM [Posts]
    WHERE Tags LIKE ‘%%’)
    ORDER BY Score DESC

    Reply
    • RHC – you tell me! Try it and find out, comparing the logical reads, CPU, and query duration before/after. (Also, just to be clear, you didn’t answer the questions in the post.)

      Reply
  • To answer the first question: as stated by others it has to scan the whole table for seldom used tags.

    Question 2 and 3:
    Very hard task when you really want high performance and must not change the structure of dbo.Posts. It is possible, that a FULLTEXT index may help (you may need to create extra stopwords etc.), but since nobody (including myself) really uses it, I can’t say much about it.

    So I decided to use an indexed view instead, which gives me top performance (about 20-960 reads). The index created by the dbo on the score and tags columns can be dropped again in this case.
    Alternatively you could add computed columns to the dbo.Posts using similar code as in my first view (but you need a CASE around the tag_1 too and check for NULL), but this would be a forbitten change to the dbo.Posts table 🙂

    —————————
    USE [StackOverFlow2013]
    GO

    GO
    CREATE OR ALTER VIEW dbo.v_post_tags
    WITH SCHEMABINDING
    /*
    manually splits the up to 5 tags into separate columns

    View is very ugly, since it is an indexed view and you can’t neither use CROSS APLLY (for STRING_SPLIT()) nor UNION ALL nor CTE’s nor subselects nor PIVOT in an indexed view, so it repeats it often self

    After changing the view, you have to recreate the indexes (will be automatically dropped):
    CREATE UNIQUE CLUSTERED INDEX idx_v_post_tags_1 ON dbo.v_post_tags (tag_1, Score, Id)
    CREATE NONCLUSTERED INDEX idx_v_post_tags_2 ON dbo.v_post_tags (tag_2, Score, Id)
    CREATE NONCLUSTERED INDEX idx_v_post_tags_3 ON dbo.v_post_tags (tag_3, Score, Id)
    CREATE NONCLUSTERED INDEX idx_v_post_tags_4 ON dbo.v_post_tags (tag_4, Score, Id)
    CREATE NONCLUSTERED INDEX idx_v_post_tags_5 ON dbo.v_post_tags (tag_5, Score, Id)

    */
    AS
    SELECT id, p.Tags, p.Score
    , SUBSTRING(p.Tags, 2, CHARINDEX(‘>’, p.Tags) – 2) AS tag_1
    , CASE WHEN p.Tags LIKE ‘<%<%’, p.Tags) + 2
    , CHARINDEX(‘>’, SUBSTRING(p.Tags, CHARINDEX(‘>’, p.Tags) + 2, 150)) – 1
    )
    ELSE NULL — no 2nd tag
    END AS tag_2
    , CASE WHEN p.Tags LIKE ‘<%<%’, p.Tags) – 2) — Tag_1
    , SUBSTRING(p.Tags
    , CHARINDEX(‘>’, p.Tags) + 2
    , CHARINDEX(‘>’, SUBSTRING(p.Tags, CHARINDEX(‘>’, p.Tags) + 2, 150)) – 1
    ) — tag_2
    ) ) + 4
    , CHARINDEX(‘>’,
    SUBSTRING(p.Tags
    , LEN(CONCAT_WS(‘>’, p.Tags) – 2) — Tag_1
    , SUBSTRING(p.Tags
    , CHARINDEX(‘>’, p.Tags) + 2
    , CHARINDEX(‘>’, SUBSTRING(p.Tags, CHARINDEX(‘>’, p.Tags) + 2, 150)) – 1
    ) — tag_2
    ) ) + 4
    , 150)
    ) -1
    )
    ELSE NULL — no 3rd tag
    END AS tag_3
    , CASE WHEN p.Tags LIKE ‘<%<%<%<%<%' — 5 tags
    THEN REVERSE(SUBSTRING(REVERSE(p.Tags)
    , CHARINDEX('<', REVERSE(p.Tags)) + 2
    , CHARINDEX('<', SUBSTRING(REVERSE(p.Tags), CHARINDEX('<', REVERSE(p.Tags)) + 2, 150)) – 1
    ) ) — then reversed code from tag_2
    WHEN p.Tags LIKE '<%<%<%<%' — 4 tags
    THEN REVERSE(SUBSTRING(REVERSE(p.Tags), 2, CHARINDEX('<', REVERSE(p.Tags)) – 2)) — then reversed code from tag_1
    ELSE NULL
    END AS tag_4
    , CASE WHEN p.Tags LIKE '<%<%<%<%<%' — 5 tags
    THEN REVERSE(SUBSTRING(REVERSE(p.Tags), 2, CHARINDEX('<', REVERSE(p.Tags)) – 2)) — then reversed code from tag_1
    ELSE NULL
    END AS tag_5
    FROM dbo.Posts AS p
    WHERE p.Tags IS NOT NULL AND p.Tags LIKE '’
    GO

    — create the indexes on the view — duration ~ 45 sec for the StackOverFlow2013 database on my laptop
    — you can’t filter indexes on views for e.g. tag_2 IS NOT NULL
    CREATE UNIQUE CLUSTERED INDEX idx_v_post_tags_1 ON dbo.v_post_tags (tag_1, Score, Id); — there MUST be an clustered index on the view
    CREATE NONCLUSTERED INDEX idx_v_post_tags_2 ON dbo.v_post_tags (tag_2, Score, Id)
    CREATE NONCLUSTERED INDEX idx_v_post_tags_3 ON dbo.v_post_tags (tag_3, Score, Id)
    CREATE NONCLUSTERED INDEX idx_v_post_tags_4 ON dbo.v_post_tags (tag_4, Score, Id)
    CREATE NONCLUSTERED INDEX idx_v_post_tags_5 ON dbo.v_post_tags (tag_5, Score, Id)

    GO
    CREATE OR ALTER VIEW dbo.v_post_query_tags AS
    — view that UNIONs ALL 5 tags to be able to easily query it without having to use ” in (tag_1, tag_2, tag_3, tag_4, tag_5)
    SELECT vpt.id, vpt.Score, vpt.tag_1 AS tag
    FROM dbo.v_post_tags AS vpt WITH (NOEXPAND) — without the NOEXPAND it wouldn’t use the index
    UNION ALL
    SELECT vpt.id, vpt.Score, vpt.tag_2 AS tag
    FROM dbo.v_post_tags AS vpt WITH (NOEXPAND)
    WHERE vpt.tag_2 IS NOT NULL
    UNION ALL
    SELECT vpt.id, vpt.Score, vpt.tag_3 AS tag
    FROM dbo.v_post_tags AS vpt WITH (NOEXPAND)
    WHERE vpt.tag_3 IS NOT NULL
    UNION ALL
    SELECT vpt.id, vpt.Score, vpt.tag_4 AS tag
    FROM dbo.v_post_tags AS vpt WITH (NOEXPAND)
    WHERE vpt.tag_4 IS NOT NULL
    UNION ALL
    SELECT vpt.id, vpt.Score, vpt.tag_5 AS tag
    FROM dbo.v_post_tags AS vpt WITH (NOEXPAND)
    WHERE vpt.tag_5 IS NOT NULL
    ;
    GO

    SELECT TOP 10 * FROM dbo.v_post_query_tags AS vpq WHERE vpq.tag = ‘ora-31011’ ORDER BY vpq.Score DESC — one occurence – 22 reads
    SELECT TOP 10 * FROM dbo.v_post_query_tags AS vpq WHERE vpq.tag = ‘iphone-sdk-4.0.1’ ORDER BY vpq.Score DESC — 10 occurences – 26 reads
    SELECT TOP 10 * FROM dbo.v_post_query_tags AS vpq WHERE vpq.tag = ‘sql-server’ ORDER BY vpq.Score DESC — 75k occurences – 946 reads

    — no noticable delay; 53 additional reads in v_post_tags
    UPDATE dbo.Posts SET Tags = Tags + ” WHERE Id = 3571036

    Reply
  • […] Query Exercise: Find Tagged Questions Faster. (Brent Ozar) […]

    Reply
  • If we generalize this problem to “searching string data for an arbitrary substring” in can maybe be solved by creating a trigram index.
    Here is a very detailed tutorial on how to do that:
    https://sqlperformance.com/2017/09/sql-performance/sql-server-trigram-wildcard-search
    Postgres has a module:
    https://www.postgresql.org/docs/current/pgtrgm.html

    Reply
  • #1 Infrequently used tags will be hardest to find.
    #2
    – Don’t use SELECT *. Instead, use SELECT Id, Body, Score
    – The Tags column is NVARCHAR, so use N’%%’ instead.
    – Consider caching the results. Depends on how often the query is used.
    #3 Instead of indexing on (Score, Tags), index on (Score) INCLUDE (Tags)
    – I tried indexing on Score DESC but it didn’t reduce logical reads or cpu time.
    – Because the exercise calls for finding questions, I tried indexing on PostTypeId
    and adding PostTypeId to the WHERE clause but that was detrimental.

    Reply
    • Martin – bingo on #1!

      #2 – you’ll want to test those assumptions to see if they make a difference here. For example, is SELECT * any worse on overhead? You might be surprised.

      #3 – did that index make a significant difference on reads, cpu, or duration?

      Reply
      • #3 with INCLUDE (tags) instead indexing the tags may prevent some resorting, when you add / remove tags. Since you can’t really seek for the tag (with the given query) in the index it would be the better option from maintainance perspective (performance should be the same). Particularly in the lower score ranges where 100k rows per score but with different tags exists.

        On the other hand the whole index is a maintainance nightmare, since there will always posts up- and downvoted, so it has to shift rows around in the index the whole time (e.g. when the score goes up from 1 to 2).

        Reply
      • – SELECT Id, Body, Score instead of SELECT * saves an almost-perceptible number of physical reads and over a millisecond of elapsed time. You’re right. I’m surprised.
        – I expected an index on (Score, PostTypeId) to improve query performance. Instead, reads increased from 500 to 2,000 (when searching for ). (4M without any index.)

        I appreciate TF’s comment on index maintenance nightmare, and I like the indexed view solution. I think my coworkers would get on my case for the complexity of it, but that’s more of a personnel issue.

        Thanks.

        Reply
  • Quite interested in Brent’s answer at the end of the week.

    /*
    1. What kinds of tags will perform worse than others for this query?
    Since the LIKE it introduced with a double percent-sign ‘%%’, it becomes non-sargable and the Engine will have to scan every row in the index to see the contents of the Tags column to find the correct rows up to the number in the TOP.
    When seaching for tags that are not used often, like ” that has been used once in the SO2013 database, will perform worse as the whole index will be scanned.
    To illustrate, seaching for scans 17488 rows (0.1%) with 546 logical reads, and seaching for scans 17142169 rows (100.0%) with 75555 logical reads.
    */

    /*Fact finding query to discover all distinct Tags*/
    SELECT v.[value], COUNT(*)
    FROM dbo.Posts
    CROSS APPLY string_split(REPLACE(Tags, N’>;<'), N';') AS v
    WHERE Tags IS NOT NULL
    GROUP BY v.[value]
    ORDER BY COUNT(*) ASC;
    GO

    DROP INDEX IF EXISTS Score_Tags ON dbo.Posts;
    CREATE INDEX Score_Tags ON dbo.Posts(Score, Tags);
    GO

    SET STATISTICS IO ON;

    SELECT TOP 100 *
    FROM dbo.Posts
    WHERE Tags LIKE '%%’
    ORDER BY Score DESC;
    –546 logical reads

    SELECT TOP 100 *
    FROM dbo.Posts
    WHERE Tags LIKE ‘%%’
    ORDER BY Score DESC;
    –75555 logical reads

    SET STATISTICS IO OFF;

    /*
    The more a certain tag is used, the quicker the query will become.

    2. Could you change the query to perform better?
    Let’s try: There a quite a few rows where there are no tags at all, so we can start by filtering those out.
    */
    SET STATISTICS IO ON;

    SELECT TOP 100 *
    FROM dbo.Posts
    WHERE Tags LIKE ‘%%’
    AND Tags IS NOT NULL
    ORDER BY Score DESC;
    –546 logical reads

    SELECT TOP 100 *
    FROM dbo.Posts
    WHERE Tags LIKE ‘%%’
    AND Tags IS NOT NULL
    ORDER BY Score DESC;
    –75555 logical reads

    SET STATISTICS IO OFF;

    /*
    Okay, the query didn’t benefit from the WHERE clause, because the leading column in the index is Score and we need to scan the whole index anyways.
    So the index created by our lovely DBA isn’t helping us in this case, and flipping the fields around will not help as well, as it will force SORTs into our plans because of the ORDER BY.
    Let’s make an extra table for storing the tags, however if we want to go this route we need to modify the app code as well to use this table, unless we do some shenanigans with the posts table and views…
    */

    DROP TABLE IF EXISTS dbo.PostTags;

    CREATE TABLE dbo.PostTags
    (
    PostId int NOT NULL
    ,Tag nvarchar(50) NOT NULL
    ,PRIMARY KEY CLUSTERED (Tag ASC, PostId ASC)
    );

    INSERT INTO dbo.PostTags
    SELECT PostId = p.Id, Tag = v.[value]
    FROM dbo.Posts AS p
    CROSS APPLY string_split(REPLACE(p.Tags, N’>;<'), N';') AS v
    WHERE Tags IS NOT NULL;

    SET STATISTICS IO ON;

    SELECT TOP 100 *
    FROM dbo.Posts AS p
    WHERE EXISTS
    (
    SELECT 1/0
    FROM dbo.PostTags AS pt
    WHERE pt.Tag = N'’
    AND pt.PostId = p.Id
    )
    ORDER BY p.Score DESC;
    –73111 + 104852 logical reads

    SELECT TOP 100 *
    FROM dbo.Posts AS p
    WHERE EXISTS
    (
    SELECT 1/0
    FROM dbo.PostTags AS pt
    WHERE pt.Tag = N”
    AND pt.PostId = p.Id
    )
    ORDER BY p.Score DESC;
    –6 + 8 logical reads

    SET STATISTICS IO OFF;

    DROP TABLE IF EXISTS dbo.PostTags;
    GO

    /*
    Okay, that fixed the outliner query, but made the non-outliner query so much worse.

    So let’s modify the with the score field this time … More denormalization \o/
    */

    DROP TABLE IF EXISTS dbo.PostTags;

    CREATE TABLE dbo.PostTags
    (
    PostId int NOT NULL
    ,Tag nvarchar(50) NOT NULL
    ,Score int NOT NULL
    ,PRIMARY KEY CLUSTERED (Score ASC, Tag ASC, PostId ASC)
    );

    INSERT INTO dbo.PostTags
    SELECT PostId = p.Id, Tag = v.[value], Score = p.Score
    FROM dbo.Posts AS p
    CROSS APPLY string_split(REPLACE(p.Tags, N’>;<'), N';') AS v
    WHERE Tags IS NOT NULL;

    CREATE INDEX idx_PostTags ON dbo.PostTags (Tag ASC);

    SET STATISTICS IO ON;

    SELECT TOP 100 *
    FROM dbo.Posts AS p
    CROSS APPLY
    (
    SELECT pt.Score
    FROM dbo.PostTags AS pt
    WHERE pt.Tag = N'’
    AND pt.PostId = p.Id
    ) AS pt
    ORDER BY pt.Score DESC;
    –415 + 4 logical reads

    SELECT TOP 100 *
    FROM dbo.Posts AS p
    CROSS APPLY
    (
    SELECT pt.Score
    FROM dbo.PostTags AS pt
    WHERE pt.Tag = N”
    AND pt.PostId = p.Id
    ) AS pt
    ORDER BY p.Score DESC;
    –5 + 5 logical reads

    SET STATISTICS IO OFF;

    DROP TABLE IF EXISTS dbo.PostTags;
    GO

    /*
    That fixed up both the outliner and the non-outliner query, with … quite a “few” database changes that will also require app changes to handle.
    The business users will be pleased because great response times, but the application team will not like this solution as it is not very fault tolerate and requires significate changes to the app code.
    The database team might not like the extra denormalization, but since it is mostly the application code to handle it, they will go “it’s not our problem”.

    3. Could you change the indexes to perform better, without changing the table structure?
    A simple solution would be to filter the index made by our DBA
    */

    DROP INDEX IF EXISTS Score_Tags ON dbo.Posts;
    CREATE INDEX Score_Tags ON dbo.Posts(Score, Tags) WHERE Tags IS NOT NULL;
    /*Note: Could also be PostTypeId = 1*/

    SET STATISTICS IO ON;

    SELECT TOP 100 *
    FROM dbo.Posts
    WHERE Tags LIKE ‘%%’
    AND Tags IS NOT NULL
    ORDER BY Score DESC;
    –1817 logical reads

    SELECT TOP 100 *
    FROM dbo.Posts
    WHERE Tags LIKE ‘%%’
    AND Tags IS NOT NULL
    ORDER BY Score DESC;
    –56228 logical reads

    SET STATISTICS IO OFF;
    /*
    That knocked some logical read off from the query, but nothing substantial

    Time to get creative, let’s try an indexed view where we split the tags column into individual parts.
    Thanks Thomas Franz for the general idea.
    */
    DROP VIEW IF EXISTS dbo.view_PostTags
    GO

    CREATE OR ALTER VIEW dbo.view_PostTags
    WITH SCHEMABINDING
    AS
    SELECT p.Id, p.Tags, p.Score
    ,Tag1 = SUBSTRING(p.Tags, 1, CHARINDEX(N’>’, p.Tags, 1))
    ,Tag2 = CASE
    WHEN LEN(p.Tags) – LEN(REPLACE(p.Tags, N'<', N'')) = 2
    THEN REVERSE(SUBSTRING(REVERSE(p.Tags), 1, CHARINDEX(N'<', REVERSE(p.Tags), 1)))
    WHEN LEN(p.Tags) – LEN(REPLACE(p.Tags, N' 2
    THEN SUBSTRING(p.Tags, CHARINDEX(N’>’, p.Tags, 1) + 1, CHARINDEX(N’>’, SUBSTRING(p.Tags, CHARINDEX(N’>’, p.Tags, 1) + 1, 150)))
    ELSE NULL
    END
    ,Tag3 = CASE
    WHEN LEN(p.Tags) – LEN(REPLACE(p.Tags, N'<', N'')) = 3
    THEN REVERSE(SUBSTRING(REVERSE(p.Tags), 1, CHARINDEX(N'<', REVERSE(p.Tags), 1)))
    WHEN LEN(p.Tags) – LEN(REPLACE(p.Tags, N' 3
    THEN SUBSTRING(
    SUBSTRING(p.Tags, CHARINDEX(N’>’, SUBSTRING(p.Tags, CHARINDEX(N’>’, p.Tags, 1) + 1, 150), 1) + CHARINDEX(N’>’, p.Tags, 1) + 1, 150)
    ,1
    ,CHARINDEX(N’>’, SUBSTRING(p.Tags, CHARINDEX(N’>’, SUBSTRING(p.Tags, CHARINDEX(N’>’, p.Tags, 1) + 1, 150), 1) + CHARINDEX(N’>’, p.Tags, 1) + 1, 150), 1)
    )
    ELSE NULL
    END
    ,Tag4 = CASE
    WHEN LEN(p.Tags) – LEN(REPLACE(p.Tags, N'<', N'')) = 4
    THEN REVERSE(SUBSTRING(REVERSE(p.Tags), 1, CHARINDEX(N'<', REVERSE(p.Tags), 1)))
    WHEN LEN(p.Tags) – LEN(REPLACE(p.Tags, N' 4
    THEN REVERSE(SUBSTRING(REVERSE(p.Tags), CHARINDEX(N'<', REVERSE(p.Tags), 1) + 1, CHARINDEX(N'<', SUBSTRING(REVERSE(p.Tags), CHARINDEX(N'<', REVERSE(p.Tags), 1) + 1, 150))))
    ELSE NULL
    END
    ,Tag5 = CASE
    WHEN LEN(p.Tags) – LEN(REPLACE(p.Tags, N'<', N'')) = 5
    THEN REVERSE(SUBSTRING(REVERSE(p.Tags), 1, CHARINDEX(N'<', REVERSE(p.Tags), 1)))
    ELSE NULL
    END
    FROM dbo.Posts AS p
    WHERE p.Tags IS NOT NULL;
    GO

    CREATE UNIQUE CLUSTERED INDEX idx_view_PostTags_Tag1 ON view_PostTags (Tag1, Score, Id);
    CREATE NONCLUSTERED INDEX idx_view_PostTags_Tag2 ON view_PostTags (Tag2, Score, Id);
    CREATE NONCLUSTERED INDEX idx_view_PostTags_Tag3 ON view_PostTags (Tag3, Score, Id);
    CREATE NONCLUSTERED INDEX idx_view_PostTags_Tag4 ON view_PostTags (Tag4, Score, Id);
    CREATE NONCLUSTERED INDEX idx_view_PostTags_Tag5 ON view_PostTags (Tag5, Score, Id);
    GO

    DROP VIEW IF EXISTS dbo.view_TagsPerPost
    GO

    CREATE OR ALTER VIEW dbo.view_TagsPerPost
    AS
    SELECT
    Id, Score, Tag = Tag1
    FROM dbo.view_PostTags WITH (NOEXPAND)
    UNION ALL
    SELECT
    Id, Score, Tag = Tag2
    FROM dbo.view_PostTags WITH (NOEXPAND)
    WHERE Tag2 IS NOT NULL
    UNION ALL
    SELECT
    Id, Score, Tag = Tag3
    FROM dbo.view_PostTags WITH (NOEXPAND)
    WHERE Tag3 IS NOT NULL
    UNION ALL
    SELECT
    Id, Score, Tag = Tag4
    FROM dbo.view_PostTags WITH (NOEXPAND)
    WHERE Tag4 IS NOT NULL
    UNION ALL
    SELECT
    Id, Score, Tag = Tag5
    FROM dbo.view_PostTags WITH (NOEXPAND)
    WHERE Tag5 IS NOT NULL
    GO

    SET STATISTICS IO ON;

    SELECT TOP 100 p.*
    FROM dbo.view_TagsPerPost AS tpp
    INNER JOIN dbo.Posts AS p
    ON p.Id = tpp.Id
    WHERE tpp.Tag = '’
    ORDER BY tpp.Score DESC;
    –1139 + 20807 logical reads

    SELECT TOP 100 *
    FROM dbo.view_TagsPerPost AS tpp
    INNER JOIN dbo.Posts AS p
    ON p.Id = tpp.Id
    WHERE tpp.Tag = ”
    ORDER BY tpp.Score DESC;
    –5 + 24 logical reads

    SET STATISTICS IO OFF;

    DROP VIEW IF EXISTS dbo.view_PostTags
    GO

    DROP VIEW IF EXISTS dbo.view_TagsPerPost
    GO

    /*
    That fixed up both the outliner and the non-outliner query, with some database changes that will the app will need to do nothing with. However the query from the app will need to changed to use the view.
    The business users will be pleased because great response times.
    For the application team it is a small and manageable change, but the ways tags are recorded can’t be easily changed in the future.
    However the database team might oppose the solution, as the view is mostly a database solution and they might be the team to maintain it.
    */

    Reply
    • Great the blog doesn’t like \ in the text

      I used the tags sql-server and progress-reports in my tests

      Reply
    • After reading the site Vedran Galic linked (and it was in Brent ‘s mail of yesterday) I gave the trigram solution a try.
      Summary:
      It makes finding often used tags slower from the overhead from generating and searching the trigrams table (about 10X in miliseconds).
      However, finding the outliers is way faster though (110X faster).

      LIKE %sql-server%: 103 ms
      LIKE %progress-reports%: 10161 ms
      Trigram search %sql-server%: 1013 ms
      Trigram search %progress-reports%: 89 ms

      In my opinion, the decrease in performance while trigram searching for often used tags would be acceptable as finding the outliers is way faster.

      Wall of code incoming:

      CREATE OR ALTER FUNCTION dbo.GenerateTagTrigrams
      (
      @tags nvarchar(150)
      )
      RETURNS TABLE WITH SCHEMABINDING
      AS
      RETURN
      (
      WITH Ten AS
      (
      SELECT Ten = Digit.Ten, Weigth = COUNT(Digit.Ten) OVER ()
      FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)) AS Digit (Ten)
      ), Hundred AS
      (
      SELECT Hundred = Ten.Ten + (Ten.Weigth * Hundred.Ten), Weigth = Ten.Weigth * Ten.Weigth
      FROM Ten AS Ten CROSS APPLY Ten AS Hundred
      ), ThreeHundred AS
      (
      SELECT ThreeHundred = (Digit.Three * ThreeHundred.Weigth) + ThreeHundred.Hundred + 1
      FROM (VALUES (0), (1), (2)) AS Digit (Three)
      CROSS APPLY Hundred AS ThreeHundred
      ), TagTrigrams AS
      (
      SELECT TOP (CASE WHEN LEN(@tags) > 2 THEN LEN(@tags) – 2 ELSE 0 END)
      Trigram = SUBSTRING(@tags, threeHund.ThreeHundred, 3)
      FROM ThreeHundred AS threeHund
      ORDER BY threeHund.ThreeHundred
      )
      SELECT DISTINCT t.trigram
      FROM TagTrigrams AS t
      WHERE t.trigram COLLATE Latin1_General_BIN2 NOT LIKE ‘%[^A-Z0-9a-z]%’
      );
      GO

      DROP TABLE IF EXISTS dbo.TagsTrigrams;

      CREATE TABLE dbo.TagsTrigrams
      (
      PostId int NOT NULL
      ,Trigram nchar(3) NOT NULL
      );

      INSERT INTO dbo.TagsTrigrams WITH (TABLOCKX)
      (
      PostId
      ,Trigram
      )
      SELECT
      p.Id
      ,t.Trigram
      FROM dbo.Posts AS p
      CROSS APPLY dbo.GenerateTagTrigrams (tags) AS t;
      GO

      ALTER TABLE dbo.TagsTrigrams ADD CONSTRAINT pk_TagsTrigrams PRIMARY KEY CLUSTERED (Trigram, PostId);
      GO

      CREATE OR ALTER VIEW dbo.TagsTrigramCounts
      WITH SCHEMABINDING AS
      SELECT tt.Trigram, NumberOfOccurrences = COUNT_BIG(*)
      FROM dbo.TagsTrigrams AS tt
      GROUP BY tt.Trigram;
      GO

      CREATE UNIQUE CLUSTERED INDEX idx_TagsTrigramCounts ON dbo.TagsTrigramCounts (Trigram);
      GO

      CREATE OR ALTER FUNCTION dbo.PostTag_GetBestTrigrams
      (
      @searchTag nvarchar(50)
      )
      RETURNS TABLE WITH SCHEMABINDING
      AS
      RETURN
      (
      SELECT
      TagTrigram1 = MAX(CASE WHEN bestTrigrams.TrigramOrder = 1 THEN bestTrigrams.Trigram END)
      ,TagTrigram2 = MAX(CASE WHEN bestTrigrams.TrigramOrder = 2 THEN bestTrigrams.Trigram END)
      ,TagTrigram3 = MAX(CASE WHEN bestTrigrams.TrigramOrder = 3 THEN bestTrigrams.Trigram END)
      FROM
      (
      SELECT TOP 3
      TrigramOrder = ROW_NUMBER() OVER (ORDER BY ttc.NumberOfOccurrences ASC), st.Trigram
      FROM dbo.GenerateTagTrigrams(@searchTag) AS st
      INNER JOIN dbo.TagsTrigramCounts AS ttc WITH (NOEXPAND) ON ttc.Trigram = st.Trigram
      ORDER BY ttc.NumberOfOccurrences ASC
      ) AS bestTrigrams
      );
      GO

      CREATE OR ALTER FUNCTION dbo.GetTagTrigramPostIDs
      (
      @TagTrigram1 nchar(3)
      ,@TagTrigram2 nchar(3)
      ,@TagTrigram3 nchar(3)
      )
      RETURNS @IDs TABLE
      (
      PostId int PRIMARY KEY
      ) WITH SCHEMABINDING AS
      BEGIN /* Begin function body */
      IF @TagTrigram1 IS NOT NULL
      BEGIN
      IF @TagTrigram2 IS NOT NULL
      BEGIN
      IF @TagTrigram3 IS NOT NULL
      BEGIN
      INSERT @IDs (PostId)
      SELECT tt1.PostId
      FROM dbo.TagsTrigrams AS tt1
      WHERE tt1.Trigram = @TagTrigram1
      INTERSECT
      SELECT tt2.PostId
      FROM dbo.TagsTrigrams AS tt2
      WHERE tt2.Trigram = @TagTrigram2
      INTERSECT
      SELECT tt3.PostId
      FROM dbo.TagsTrigrams AS tt3
      WHERE tt3.Trigram = @TagTrigram3
      OPTION (MERGE JOIN);
      END;
      ELSE
      BEGIN
      INSERT @IDs (PostId)
      SELECT tt1.PostId
      FROM dbo.TagsTrigrams AS tt1
      WHERE tt1.Trigram = @TagTrigram1
      INTERSECT
      SELECT tt2.PostId
      FROM dbo.TagsTrigrams AS tt2
      WHERE tt2.Trigram = @TagTrigram2
      OPTION (MERGE JOIN);
      END;
      END;
      ELSE
      BEGIN
      INSERT @IDs (PostId)
      SELECT tt1.PostId
      FROM dbo.TagsTrigrams AS tt1
      WHERE tt1.Trigram = @TagTrigram1
      END;
      END;
      RETURN;
      END; /* End function body */
      GO

      CREATE OR ALTER FUNCTION dbo.PostTags_TrigramSearch
      (
      @Tag nvarchar(150)
      )
      RETURNS TABLE WITH SCHEMABINDING
      AS
      RETURN
      SELECT Posts.Id, Posts.AcceptedAnswerId, Posts.AnswerCount, Posts.Body, Posts.ClosedDate, Posts.CommentCount, Posts.CommunityOwnedDate, Posts.CreationDate, Posts.FavoriteCount, Posts.LastActivityDate, Posts.LastEditDate, Posts.LastEditorDisplayName, Posts.LastEditorUserId, Posts.OwnerUserId, Posts.ParentId, Posts.PostTypeId, Posts.Score, Posts.Tags, Posts.Title, Posts.ViewCount
      FROM dbo.PostTag_GetBestTrigrams(@Tag) AS ptt
      CROSS APPLY
      (
      SELECT p.Id, p.AcceptedAnswerId, p.AnswerCount, p.Body, p.ClosedDate, p.CommentCount, p.CommunityOwnedDate, p.CreationDate, p.FavoriteCount, p.LastActivityDate, p.LastEditDate, p.LastEditorDisplayName, p.LastEditorUserId, p.OwnerUserId, p.ParentId, p.PostTypeId, p.Score, p.Tags, p.Title, p.ViewCount
      FROM dbo.GetTagTrigramPostIDs (ptt.TagTrigram1, ptt.TagTrigram2, ptt.TagTrigram3) AS trigrams
      INNER JOIN dbo.Posts AS p
      ON p.id = trigrams.PostId
      WHERE ptt.TagTrigram1 IS NOT NULL
      AND p.Tags LIKE @Tag
      UNION ALL
      SELECT p.Id, p.AcceptedAnswerId, p.AnswerCount, p.Body, p.ClosedDate, p.CommentCount, p.CommunityOwnedDate, p.CreationDate, p.FavoriteCount, p.LastActivityDate, p.LastEditDate, p.LastEditorDisplayName, p.LastEditorUserId, p.OwnerUserId, p.ParentId, p.PostTypeId, p.Score, p.Tags, p.Title, p.ViewCount
      FROM dbo.Posts AS p
      WHERE ptt.TagTrigram1 IS NULL
      AND p.Tags LIKE @Tag
      ) AS Posts;
      GO

      SET NOCOUNT ON;

      DECLARE
      @StartTime datetime2
      ,@ElapsedTime int;

      SET @StartTime = SYSUTCDATETIME();

      SELECT TOP 100 *
      FROM dbo.Posts
      WHERE Tags LIKE ‘%sql-server%’
      ORDER BY Score DESC;

      SET @ElapsedTime = DATEDIFF(MILLISECOND, @StartTime, SYSUTCDATETIME());
      PRINT ‘LIKE %sql-server%: ‘ + CAST(@ElapsedTime AS nvarchar(4000)) + N’ ms’;
      SET @StartTime = SYSUTCDATETIME();

      SELECT TOP 100 *
      FROM dbo.Posts
      WHERE Tags LIKE ‘%progress-reports%’
      ORDER BY Score DESC;

      SET @ElapsedTime = DATEDIFF(MILLISECOND, @StartTime, SYSUTCDATETIME());
      PRINT ‘LIKE %progress-reports%: ‘ + CAST(@ElapsedTime AS nvarchar(4000)) + N’ ms’;
      SET @StartTime = SYSUTCDATETIME();

      SELECT TOP 100 *
      FROM dbo.PostTags_TrigramSearch(N’%sql-server%’)
      ORDER BY Score DESC;

      SET @ElapsedTime = DATEDIFF(MILLISECOND, @StartTime, SYSUTCDATETIME());
      PRINT ‘Trigram search %sql-server%: ‘ + CAST(@ElapsedTime AS nvarchar(4000)) + N’ ms’;
      SET @StartTime = SYSUTCDATETIME();

      SELECT TOP 100 *
      FROM dbo.PostTags_TrigramSearch(N’%progress-reports%’)
      ORDER BY Score DESC;

      SET @ElapsedTime = DATEDIFF(MILLISECOND, @StartTime, SYSUTCDATETIME());
      PRINT ‘Trigram search %progress-reports%: ‘ + CAST(@ElapsedTime AS nvarchar(4000)) + N’ ms’;

      SET NOCOUNT OFF;

      Reply
      • Holy mackerel, Martin, you went to work!

        You may wanna check out Github Gists: https://gist.github.com You can paste code snippets there, and plus update ’em over time. I spent some time last weekend trying to work Gists into the comments here, but I couldn’t make it work without a pretty big change to the blog, so I held off for now. I think I’m gonna have to do something similar though because the query challenge thing is so much fun.

        I tried to write the challenge to say, “You can’t change the tables structurally,” and I actually had trigrams in mind when I was writing that. The answer post won’t mention trigrams by name either, but the followup challenge is going to be, “How would you fix this if you could change the tables structurally?” And trigrams are a good answer there.

        I still like this solution here though because we’re technically not CHANGING the existing tables, hahaha, just adding new ones.

        Reply
        • Thanks for the compliment Brent.
          I noticed the “Faster Text Search” section in your email yesterday and I thought to give it a try, it is fun to learn new techniques.

          I subscribed to replies and I had to laugh at this top line: “Brace yourself: someone on the Internet said something. It’s probably wrong, too.”

          Reply
    • Reply
  • […] Your query exercise was to take this Stack Overflow query to find the top-voted questions for any given tag: […]

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.