Query Exercise: Find Tagged Questions Faster.

Last Updated 2 months ago

For this week’s Query Exercise, we’re going to start with an existing query that has performance problems from time to time, depending on how it’s used.

This query’s goal is to find the top-voted Stack Overflow questions for any given tag – for example, here are the top-voted questions tagged “sql-server”.

What are tags? I’m glad you asked. When you’re looking at a question at Stack Overflow, you’ll see a list of tags for that question:

The tags are at the bottom of that screenshot: sql, sql-server, t-sql, sql-update, dml. A question can have up to 5 tags. To see how they’re stored in the Stack Overflow database, let’s look at the posts table, which is where questions & answers are stored:

SELECT TOP 100 * FROM dbo.Posts;

1	SELECT TOP 100 * FROM dbo.Posts;

There’s a “Tags” column:

So to find the questions tagged sql-server, our analyst wrote this query:

SELECT TOP 100 *
FROM dbo.Posts
WHERE Tags LIKE '%<sql-server>%'
ORDER BY Score DESC;

SELECT TOP 100 *

FROM dbo.Posts

WHERE Tags LIKE '%<sql-server>%'

ORDER BY Score DESC;

And our DBA helped by creating this index:

CREATE INDEX Score_Tags ON dbo.Posts(Score, Tags);

1	CREATE INDEX Score_Tags ON dbo.Posts(Score, Tags);

Right now, the actual execution plan is fairly simple. SQL Server scans that index backwards, going from the top-scoring posts down, checking each one’s tags to see if it contains <sql-server>.

Your exercises for this week are:

What kinds of tags will perform worse than others for this query?
Could you change the query to perform better?
Could you change the indexes to perform better, without changing the table structure?

You can post your answers in this blog post’s comments, and discuss each others’ ideas. We’ll revisit your answers in this post. Have fun!

Finding the Best Time for Maintenance: Answers & Discussion

What PowerBI Questions Do You Have for Eugene Meidinger?

42 Comments. Leave new

Henrik Staun Poulsen
January 25, 2024 7:39 pm

I’ll hazard a guess or two:
1) any obscure tag will require most of the table read, hence slower than frequently used tags.

2) One could build a new table; PostTags with PostId and Tag. Rewrite query to use that table too. Keep the table updated with a trigger, or change the application to do that too.

3) Does a Non-Clustered Columns Store index count as changes to the table structure?
It does not count in my book, and I have created a few of those.

Reply
- Brent Ozar
  January 25, 2024 7:41 pm
  
  Great work as always, sir! I would say that a non-clustered columnstore index doesn’t count as changes to table structure. The followup question though: will it be faster? Is CS good at string searching?
  
  Reply
  - Henrik Staun Poulsen
    January 25, 2024 8:24 pm
    
    I have been surprised of how small the index was.
    Hence it does not take long to read into memory.
    “My” system has not been CPU bound, so I do not know about CPU cycles.
    
    If you add a NCCI, then any new columns added to the table are NOT included in NCCI.
    You must rebuild it yourself.
    I have been thinking of a database trigger to warn me, but never gotten round to it.
    
    Reply
  - Brian Boodman
    January 26, 2024 2:58 pm
    
    > The followup question though: will [non-clustered columnstore index] be faster? Is CS good at string searching?
    
    My thoughts before empirical testing:
    Your reply seems to be hinting that an NCCI won’t help and I’ve often been unpleasantly surprised by how expensive a like clause is. Adding a NCCI makes it faster to read in the tag column, but does nothing to speed up the “like” clause, which probably dominates the query time. So, I’d expect the NCCI to have similar performance.
    
    Testing as follows:
    1) Use mini database (sorry)
    2) Use ‘%%’ to force match failure.
    3) Doing a run and ignoring the results to force measurements to be against a warm cache
    Inappropriately testing on the mini database and intentionally using a non-existent tag to force worst-case behavior:
    
    With Score_Tags index: CPU time = 2468 ms, elapsed time = 676 ms.
    With NCCI: CPU time = 2672 ms, elapsed time = 984 ms.
    
    Oh, dear. I can recover the old performance with with(index(Score_Tags)), but by default the query favors the NCCI and runs slower.
    
    I even tried forcefully clearing the cache before each run, but the NCCI approach was still slower. Mind you, this is definitely not the greatest place to be using an NCCI. NCCI shines with warehousing type queries. While the 3.7M posts doesn’t subceed Microsofts guidance if 1M rows, it’s still a sub-second OLTP, which is very much not the place where NCCI shines. Even so, I was expecting the NCCI’s overhead to be negligible, which very much wasn’t the case.
    
    Reply
    - Brian Boodman
      January 26, 2024 3:02 pm
      
      #2 in “Testing as follows” was corrupted by the post software’s html-stripping. To be clear, I forced match failure by using a tag that didn’t exist.
      
      Reply
    - Brent Ozar
      January 26, 2024 3:02 pm
      
      Brian – bingo, and I explain why in my Fundamentals of Columnstore class.
      
      Reply
RollbackIsSingleThreaded
January 25, 2024 9:42 pm

Hi Brent!

I apologize if my response is not aligned with your expectations.

In the newer version of the StackOverflow database, there are two distinct tables named ‘tags’ and ‘posttags.’

In previous versions of StackOverflow, querying a question tagged with ‘Brent’ would result in significant slowdowns. I encountered a similar challenge with a large log table in production, and the solution involved implementing a full-text index, which worked like magic.

Thank you for generously sharing your knowledge! Your input is greatly appreciated.

Reply
- Brent Ozar
  January 25, 2024 10:01 pm
  
  @RollbackIsSingleThreaded – let’s focus on answering the questions in the post. Thanks!
  
  Reply
  - RollbackIsSingleThreaded
    January 26, 2024 6:23 am
    
    Alright, though the “post” table utilized in the query doesn’t even adhere to the first normal form (1NF).
    Thanks!
    
    Reply
    - Thomas Franz
      January 26, 2024 11:51 am
      
      sometimes you have to work with the stuff you have, because changing it afterwards would result in a 6 month project for a 20 man team to make the app and all the reports and existing queries work with the new normalized version of it…
      
      Or it is a vendor product and you mustn’t touch any of the tables, because it would break their upgrade scripts and you lose the warranty.
      
      Reply
    - drsql
      January 26, 2024 2:37 pm
      
      And the 1st Normal form is the most important one for performance!
      
      Reply
JennyFromDBLock
January 26, 2024 1:12 am

Short term lurker, first time poster, here is my attempt.

1. Longer Tags/Strings, resulting in more data the DB engine needs to search/scan through.

2. Introduce another WHERE predicate, eg Score > 0 (assuming there are no scores of 0) to better utilise the index due to index ordering being on Score then Tags, but this may come at a cost of higher CPU to evaluate the extra predicate…?

In the Select only return Id, Score and Tags to eliminate the key lookup… query now performs faster (but may no longer meet the business requirements)

3. Change the index to be on the predicate only (Tags) and leave off the Score, resulting in a smaller index to search. The Score will be retrieved as part of the lookup on the clustered key index to return all the columns, so can be ordered off that.

Alternatively you could have a index on Tags, and INCLUDE every other column (except the primary key) to eliminate the key-lookup on the execution plan.

Another solution, drop the index and create and use a full text search index instead (CONTAINS keyword).

Reply
- Brent Ozar
  January 26, 2024 1:38 pm
  
  Jenny – welcome to the loud crowd!
  
  1. About longer tags/strings – are you thinking long tags that we have to search THROUGH, or long tags that we’re looking FOR?
  
  2. I love the idea of adding another predicate, that’s great!
  
  3. Check the tags in the picture – might those present a problem for full text search?
  
  Reply
  - JennyFromDBLock
    January 26, 2024 11:29 pm
    
    Thanks Brent.
    
    1. Was thinking that the rows that have longer sets of up to 5 tags, would mean that it takes more time to search through.
    
    3. What about adding a filter on the index? eg WHERE tags IS NOT NULL?
    
    Does the index order matter here? I would assume an index on Tags first, Score second would perform better than the current one on Score first, Tags second based on the predicate being on tags only.
    
    Reply
    - Brent Ozar
      January 27, 2024 1:57 pm
      
      1 – Gotcha!
      
      3 – The filter is a great idea! The question would be, what percentage of the rows have tags not null. And then, if the rows are null, think about how that relates back to your idea in #1.
      
      Does index matter – hey, that’s why I give y’all the databases as open source. Go check it out to find out!
      
      Reply
DW
January 26, 2024 1:22 am

1) Tags used on only 10 posts, where one of the posts is (tied for) lowest Score. That makes SQL Server scan through all the index rows rather than stopping after finding the first 10 that match, and you’re still doing all 10 key lookups to the clustered index.

2) Normalize by creating a Tags table of distinct tag values (2013 version has over 36,000 distinct tags), plus a PostsTags “junction” table with just PostID and TagID, and a clustered composite PK on those. To be really fast, you could copy the Score value from Posts into PostsTags, then create an index on (TagID, Score); since the clustering key includes PostID, that value is included the NC index and it becomes covering for the query. Normally, with that many tag values and the string length, I’d create an Identity column on Tags as PK, but if you made the string value the PK, that would push the string into PostsTags as TagID and eliminate the need to join to Tags in the query. Definitely use data compression on the PostsTags indexes at that point.

3) Full Text Search. This post says so – https://www.brentozar.com/archive/2010/06/sargable-why-string-is-slow/

I’ve got to admit, I tried creating an XML type computed column with this (50/50 chance the post software hoses the less-than and greater-than symbols)
ALTER TABLE dbo.Posts
ADD TagsXML AS CONVERT(xml, ” + REPLACE(SUBSTRING(Tags,2, LEN(Tags)-2), ‘><','’) + ”);
and then using XQuery to parse out the tags. It works, but it’s more work for SQL than the original query. I have a hunch that creating an XML index on that column would help, but apparently you can’t put an XML index on a computed XML column, at least in SQL2016. But after trying to once again teach myself XQuery from the SQL docs to do this, I’m convinced that why madness lies. If someone else can get that to perform, more power to you.

Reply
- Thomas Franz
  January 26, 2024 12:07 pm
  
  1) almost – correct anwere woulb be tags with less then 10 findings, since it has to scan the whole table until it can stop, while with exact 10 matches it could theoretical stop after the very first page when all 10 were used within a few minutes (maybe double posts because of a big server lag).
  
  Reply
  - Brent Ozar
    January 26, 2024 1:39 pm
    
    Thomas – bingo, you nailed it with that clarification!
    
    Reply
- Brent Ozar
  January 26, 2024 1:41 pm
  
  DW – good answer on #1, and Thomas’s clarification to your answer is even better there.
  
  On #2 – that’s good, but it’s not changing the query – it’s changing the entire database to match the query. That’s pretty tough.
  
  On #3 – I love the idea, and I think you’re on a great track there. I bet it could work, but yeah, it’ll take some heavy lifting. I like it.
  
  Reply
DW
January 26, 2024 1:27 am

Ayup, some of that XML got hosed. The empty ” are are opening and closing a Tags parent and Tag child element, and the internal >< is replaced with closing one Tag child and opening another.

Reply
ST
January 26, 2024 4:37 am

1) What kinds of tags will perform worse than others for this query?
Because creating this index, is the same as creating a new table which only has the Score and the Tags columns, I’d say Tags longer in lenght will performe worse
then other tags, as they will need more effort from SQL SERVER engine to be searched/scanned

2) Could you change the query to perform better?
Before adjusting the query, I would firstly change the DB structure a little bit.
Here’s what I would do:
1 – create a new table with Tags Descriptions and Tag Id’s (Naturally, DB and front end would need adjstments to keep this table updated)
2 – Use that table to rewrite the query, making use of DENSE_RANK

3) Could you change the indexes to perform better, without changing the table structure?
Without changing the table structure, I’d say best aproach would be using columnstore index: https://learn.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview?view=sql-server-ver16

Reply
- Brent Ozar
  January 26, 2024 1:41 pm
  
  ST – hmm, you’ll probably want to test your ideas on a copy of the Stack Overflow database. I wanna be gentle here, but you’re on the wrong track altogether.
  
  Reply
  - ST
    January 26, 2024 3:22 pm
    
    Thanks for the feeback Brent.
    I’ll do that once I get a chance, and hopefully I will try and learn something from it.
    
    Reply
RHC
January 26, 2024 11:13 am

Could this be an option?

STATISTICS TIME, IO ON;

ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 8;

CREATE NONCLUSTERED INDEX [NCI_Tags] ON [Posts] ([Tags])
WITH (ONLINE=ON, DATA_COMPRESSION=PAGE);

CREATE NONCLUSTERED INDEX [NCI_Tags_Score_Include] ON [Posts] ([Tags], [Score] DESC)
INCLUDE([Id],[AcceptedAnswerId],[AnswerCount],[Body],[ClosedDate],[CommentCount],[CommunityOwnedDate],[CreationDate],[FavoriteCount],[LastActivityDate],[LastEditDate],[LastEditorDisplayName],[LastEditorUserId],[OwnerUserId],[ParentId],[PostTypeId],[Title],[ViewCount])
WITH (ONLINE=ON, DATA_COMPRESSION=PAGE);

SELECT *
FROM [Posts]
WHERE Tags IN (
SELECT DISTINCT(Tags)
FROM [Posts]
WHERE Tags LIKE ‘%%’)
ORDER BY Score DESC

Reply
- Brent Ozar
  January 26, 2024 1:42 pm
  
  RHC – you tell me! Try it and find out, comparing the logical reads, CPU, and query duration before/after. (Also, just to be clear, you didn’t answer the questions in the post.)
  
  Reply
Thomas Franz
January 26, 2024 11:48 am

To answer the first question: as stated by others it has to scan the whole table for seldom used tags.

Question 2 and 3:
Very hard task when you really want high performance and must not change the structure of dbo.Posts. It is possible, that a FULLTEXT index may help (you may need to create extra stopwords etc.), but since nobody (including myself) really uses it, I can’t say much about it.

So I decided to use an indexed view instead, which gives me top performance (about 20-960 reads). The index created by the dbo on the score and tags columns can be dropped again in this case.
Alternatively you could add computed columns to the dbo.Posts using similar code as in my first view (but you need a CASE around the tag_1 too and check for NULL), but this would be a forbitten change to the dbo.Posts table 🙂

—————————
USE [StackOverFlow2013]
GO

GO
CREATE OR ALTER VIEW dbo.v_post_tags
WITH SCHEMABINDING
/*
manually splits the up to 5 tags into separate columns

View is very ugly, since it is an indexed view and you can’t neither use CROSS APLLY (for STRING_SPLIT()) nor UNION ALL nor CTE’s nor subselects nor PIVOT in an indexed view, so it repeats it often self

After changing the view, you have to recreate the indexes (will be automatically dropped):
CREATE UNIQUE CLUSTERED INDEX idx_v_post_tags_1 ON dbo.v_post_tags (tag_1, Score, Id)
CREATE NONCLUSTERED INDEX idx_v_post_tags_2 ON dbo.v_post_tags (tag_2, Score, Id)
CREATE NONCLUSTERED INDEX idx_v_post_tags_3 ON dbo.v_post_tags (tag_3, Score, Id)
CREATE NONCLUSTERED INDEX idx_v_post_tags_4 ON dbo.v_post_tags (tag_4, Score, Id)
CREATE NONCLUSTERED INDEX idx_v_post_tags_5 ON dbo.v_post_tags (tag_5, Score, Id)

*/
AS
SELECT id, p.Tags, p.Score
, SUBSTRING(p.Tags, 2, CHARINDEX(‘>’, p.Tags) – 2) AS tag_1
, CASE WHEN p.Tags LIKE ‘<%<%’, p.Tags) + 2
, CHARINDEX(‘>’, SUBSTRING(p.Tags, CHARINDEX(‘>’, p.Tags) + 2, 150)) – 1
)
ELSE NULL — no 2nd tag
END AS tag_2
, CASE WHEN p.Tags LIKE ‘<%<%’, p.Tags) – 2) — Tag_1
, SUBSTRING(p.Tags
, CHARINDEX(‘>’, p.Tags) + 2
, CHARINDEX(‘>’, SUBSTRING(p.Tags, CHARINDEX(‘>’, p.Tags) + 2, 150)) – 1
) — tag_2
) ) + 4
, CHARINDEX(‘>’,
SUBSTRING(p.Tags
, LEN(CONCAT_WS(‘>’, p.Tags) – 2) — Tag_1
, SUBSTRING(p.Tags
, CHARINDEX(‘>’, p.Tags) + 2
, CHARINDEX(‘>’, SUBSTRING(p.Tags, CHARINDEX(‘>’, p.Tags) + 2, 150)) – 1
) — tag_2
) ) + 4
, 150)
) -1
)
ELSE NULL — no 3rd tag
END AS tag_3
, CASE WHEN p.Tags LIKE ‘<%<%<%<%<%' — 5 tags
THEN REVERSE(SUBSTRING(REVERSE(p.Tags)
, CHARINDEX('<', REVERSE(p.Tags)) + 2
, CHARINDEX('<', SUBSTRING(REVERSE(p.Tags), CHARINDEX('<', REVERSE(p.Tags)) + 2, 150)) – 1
) ) — then reversed code from tag_2
WHEN p.Tags LIKE '<%<%<%<%' — 4 tags
THEN REVERSE(SUBSTRING(REVERSE(p.Tags), 2, CHARINDEX('<', REVERSE(p.Tags)) – 2)) — then reversed code from tag_1
ELSE NULL
END AS tag_4
, CASE WHEN p.Tags LIKE '<%<%<%<%<%' — 5 tags
THEN REVERSE(SUBSTRING(REVERSE(p.Tags), 2, CHARINDEX('<', REVERSE(p.Tags)) – 2)) — then reversed code from tag_1
ELSE NULL
END AS tag_5
FROM dbo.Posts AS p
WHERE p.Tags IS NOT NULL AND p.Tags LIKE '’
GO

— create the indexes on the view — duration ~ 45 sec for the StackOverFlow2013 database on my laptop
— you can’t filter indexes on views for e.g. tag_2 IS NOT NULL
CREATE UNIQUE CLUSTERED INDEX idx_v_post_tags_1 ON dbo.v_post_tags (tag_1, Score, Id); — there MUST be an clustered index on the view
CREATE NONCLUSTERED INDEX idx_v_post_tags_2 ON dbo.v_post_tags (tag_2, Score, Id)
CREATE NONCLUSTERED INDEX idx_v_post_tags_3 ON dbo.v_post_tags (tag_3, Score, Id)
CREATE NONCLUSTERED INDEX idx_v_post_tags_4 ON dbo.v_post_tags (tag_4, Score, Id)
CREATE NONCLUSTERED INDEX idx_v_post_tags_5 ON dbo.v_post_tags (tag_5, Score, Id)

GO
CREATE OR ALTER VIEW dbo.v_post_query_tags AS
— view that UNIONs ALL 5 tags to be able to easily query it without having to use ” in (tag_1, tag_2, tag_3, tag_4, tag_5)
SELECT vpt.id, vpt.Score, vpt.tag_1 AS tag
FROM dbo.v_post_tags AS vpt WITH (NOEXPAND) — without the NOEXPAND it wouldn’t use the index
UNION ALL
SELECT vpt.id, vpt.Score, vpt.tag_2 AS tag
FROM dbo.v_post_tags AS vpt WITH (NOEXPAND)
WHERE vpt.tag_2 IS NOT NULL
UNION ALL
SELECT vpt.id, vpt.Score, vpt.tag_3 AS tag
FROM dbo.v_post_tags AS vpt WITH (NOEXPAND)
WHERE vpt.tag_3 IS NOT NULL
UNION ALL
SELECT vpt.id, vpt.Score, vpt.tag_4 AS tag
FROM dbo.v_post_tags AS vpt WITH (NOEXPAND)
WHERE vpt.tag_4 IS NOT NULL
UNION ALL
SELECT vpt.id, vpt.Score, vpt.tag_5 AS tag
FROM dbo.v_post_tags AS vpt WITH (NOEXPAND)
WHERE vpt.tag_5 IS NOT NULL
;
GO

SELECT TOP 10 * FROM dbo.v_post_query_tags AS vpq WHERE vpq.tag = ‘ora-31011’ ORDER BY vpq.Score DESC — one occurence – 22 reads
SELECT TOP 10 * FROM dbo.v_post_query_tags AS vpq WHERE vpq.tag = ‘iphone-sdk-4.0.1’ ORDER BY vpq.Score DESC — 10 occurences – 26 reads
SELECT TOP 10 * FROM dbo.v_post_query_tags AS vpq WHERE vpq.tag = ‘sql-server’ ORDER BY vpq.Score DESC — 75k occurences – 946 reads

— no noticable delay; 53 additional reads in v_post_tags
UPDATE dbo.Posts SET Tags = Tags + ” WHERE Id = 3571036

Reply
- Brent Ozar
  January 26, 2024 1:44 pm
  
  Thomas – absolutely fantastic work! Great job. I love the creativity on it, and it makes total sense here. An indexed view would be fine since it’s essentially just an index on the table, and we’re not doing any grouping against other Posts rows.
  
  Reply
- Thomas Franz
  February 2, 2024 12:55 pm
  
  I put the whole script on Githup Gits for better readability (well formated there):
  https://gist.github.com/samot1/ba98a0838eddcbb225ef5bce13e9ec1c
  
  Reply
Dew Drop – January 26, 2024 (#4115) – Morning Dew by Alvin Ashcraft
January 26, 2024 11:57 am

[…] Query Exercise: Find Tagged Questions Faster. (Brent Ozar) […]

Reply
Vedran Galic
January 26, 2024 1:19 pm

If we generalize this problem to “searching string data for an arbitrary substring” in can maybe be solved by creating a trigram index.
Here is a very detailed tutorial on how to do that:
https://sqlperformance.com/2017/09/sql-performance/sql-server-trigram-wildcard-search
Postgres has a module:
https://www.postgresql.org/docs/current/pgtrgm.html

Reply
- Brent Ozar
  January 26, 2024 1:44 pm
  
  Vedran – make sure to answer the 3 questions in the post. This comment isn’t answering any of ’em. Thanks!
  
  Reply
Martin
January 26, 2024 1:52 pm

#1 Infrequently used tags will be hardest to find.
#2
– Don’t use SELECT *. Instead, use SELECT Id, Body, Score
– The Tags column is NVARCHAR, so use N’%%’ instead.
– Consider caching the results. Depends on how often the query is used.
#3 Instead of indexing on (Score, Tags), index on (Score) INCLUDE (Tags)
– I tried indexing on Score DESC but it didn’t reduce logical reads or cpu time.
– Because the exercise calls for finding questions, I tried indexing on PostTypeId
and adding PostTypeId to the WHERE clause but that was detrimental.

Reply
- Brent Ozar
  January 26, 2024 1:57 pm
  
  Martin – bingo on #1!
  
  #2 – you’ll want to test those assumptions to see if they make a difference here. For example, is SELECT * any worse on overhead? You might be surprised.
  
  #3 – did that index make a significant difference on reads, cpu, or duration?
  
  Reply
  - Thomas Franz
    January 26, 2024 3:12 pm
    
    #3 with INCLUDE (tags) instead indexing the tags may prevent some resorting, when you add / remove tags. Since you can’t really seek for the tag (with the given query) in the index it would be the better option from maintainance perspective (performance should be the same). Particularly in the lower score ranges where 100k rows per score but with different tags exists.
    
    On the other hand the whole index is a maintainance nightmare, since there will always posts up- and downvoted, so it has to shift rows around in the index the whole time (e.g. when the score goes up from 1 to 2).
    
    Reply
  - Martin
    January 26, 2024 7:17 pm
    
    – SELECT Id, Body, Score instead of SELECT * saves an almost-perceptible number of physical reads and over a millisecond of elapsed time. You’re right. I’m surprised.
    – I expected an index on (Score, PostTypeId) to improve query performance. Instead, reads increased from 500 to 2,000 (when searching for ). (4M without any index.)
    
    I appreciate TF’s comment on index maintenance nightmare, and I like the indexed view solution. I think my coworkers would get on my case for the complexity of it, but that’s more of a personnel issue.
    
    Thanks.
    
    Reply
Montro1981
January 29, 2024 2:58 pm

Quite interested in Brent’s answer at the end of the week.

/*
1. What kinds of tags will perform worse than others for this query?
Since the LIKE it introduced with a double percent-sign ‘%%’, it becomes non-sargable and the Engine will have to scan every row in the index to see the contents of the Tags column to find the correct rows up to the number in the TOP.
When seaching for tags that are not used often, like ” that has been used once in the SO2013 database, will perform worse as the whole index will be scanned.
To illustrate, seaching for scans 17488 rows (0.1%) with 546 logical reads, and seaching for scans 17142169 rows (100.0%) with 75555 logical reads.
*/

/*Fact finding query to discover all distinct Tags*/
SELECT v.[value], COUNT(*)
FROM dbo.Posts
CROSS APPLY string_split(REPLACE(Tags, N’>;<'), N';') AS v
WHERE Tags IS NOT NULL
GROUP BY v.[value]
ORDER BY COUNT(*) ASC;
GO

DROP INDEX IF EXISTS Score_Tags ON dbo.Posts;
CREATE INDEX Score_Tags ON dbo.Posts(Score, Tags);
GO

SET STATISTICS IO ON;

SELECT TOP 100 *
FROM dbo.Posts
WHERE Tags LIKE '%%’
ORDER BY Score DESC;
–546 logical reads

SELECT TOP 100 *
FROM dbo.Posts
WHERE Tags LIKE ‘%%’
ORDER BY Score DESC;
–75555 logical reads

SET STATISTICS IO OFF;

/*
The more a certain tag is used, the quicker the query will become.

2. Could you change the query to perform better?
Let’s try: There a quite a few rows where there are no tags at all, so we can start by filtering those out.
*/
SET STATISTICS IO ON;

SELECT TOP 100 *
FROM dbo.Posts
WHERE Tags LIKE ‘%%’
AND Tags IS NOT NULL
ORDER BY Score DESC;
–546 logical reads

SELECT TOP 100 *
FROM dbo.Posts
WHERE Tags LIKE ‘%%’
AND Tags IS NOT NULL
ORDER BY Score DESC;
–75555 logical reads

SET STATISTICS IO OFF;

/*
Okay, the query didn’t benefit from the WHERE clause, because the leading column in the index is Score and we need to scan the whole index anyways.
So the index created by our lovely DBA isn’t helping us in this case, and flipping the fields around will not help as well, as it will force SORTs into our plans because of the ORDER BY.
Let’s make an extra table for storing the tags, however if we want to go this route we need to modify the app code as well to use this table, unless we do some shenanigans with the posts table and views…
*/

DROP TABLE IF EXISTS dbo.PostTags;

CREATE TABLE dbo.PostTags
(
PostId int NOT NULL
,Tag nvarchar(50) NOT NULL
,PRIMARY KEY CLUSTERED (Tag ASC, PostId ASC)
);

INSERT INTO dbo.PostTags
SELECT PostId = p.Id, Tag = v.[value]
FROM dbo.Posts AS p
CROSS APPLY string_split(REPLACE(p.Tags, N’>;<'), N';') AS v
WHERE Tags IS NOT NULL;

SET STATISTICS IO ON;

SELECT TOP 100 *
FROM dbo.Posts AS p
WHERE EXISTS
(
SELECT 1/0
FROM dbo.PostTags AS pt
WHERE pt.Tag = N'’
AND pt.PostId = p.Id
)
ORDER BY p.Score DESC;
–73111 + 104852 logical reads

SELECT TOP 100 *
FROM dbo.Posts AS p
WHERE EXISTS
(
SELECT 1/0
FROM dbo.PostTags AS pt
WHERE pt.Tag = N”
AND pt.PostId = p.Id
)
ORDER BY p.Score DESC;
–6 + 8 logical reads

SET STATISTICS IO OFF;

DROP TABLE IF EXISTS dbo.PostTags;
GO

/*
Okay, that fixed the outliner query, but made the non-outliner query so much worse.

So let’s modify the with the score field this time … More denormalization \o/
*/

DROP TABLE IF EXISTS dbo.PostTags;

CREATE TABLE dbo.PostTags
(
PostId int NOT NULL
,Tag nvarchar(50) NOT NULL
,Score int NOT NULL
,PRIMARY KEY CLUSTERED (Score ASC, Tag ASC, PostId ASC)
);

INSERT INTO dbo.PostTags
SELECT PostId = p.Id, Tag = v.[value], Score = p.Score
FROM dbo.Posts AS p
CROSS APPLY string_split(REPLACE(p.Tags, N’>;<'), N';') AS v
WHERE Tags IS NOT NULL;

CREATE INDEX idx_PostTags ON dbo.PostTags (Tag ASC);

SET STATISTICS IO ON;

SELECT TOP 100 *
FROM dbo.Posts AS p
CROSS APPLY
(
SELECT pt.Score
FROM dbo.PostTags AS pt
WHERE pt.Tag = N'’
AND pt.PostId = p.Id
) AS pt
ORDER BY pt.Score DESC;
–415 + 4 logical reads

SELECT TOP 100 *
FROM dbo.Posts AS p
CROSS APPLY
(
SELECT pt.Score
FROM dbo.PostTags AS pt
WHERE pt.Tag = N”
AND pt.PostId = p.Id
) AS pt
ORDER BY p.Score DESC;
–5 + 5 logical reads

SET STATISTICS IO OFF;

DROP TABLE IF EXISTS dbo.PostTags;
GO

/*
That fixed up both the outliner and the non-outliner query, with … quite a “few” database changes that will also require app changes to handle.
The business users will be pleased because great response times, but the application team will not like this solution as it is not very fault tolerate and requires significate changes to the app code.
The database team might not like the extra denormalization, but since it is mostly the application code to handle it, they will go “it’s not our problem”.

3. Could you change the indexes to perform better, without changing the table structure?
A simple solution would be to filter the index made by our DBA
*/

DROP INDEX IF EXISTS Score_Tags ON dbo.Posts;
CREATE INDEX Score_Tags ON dbo.Posts(Score, Tags) WHERE Tags IS NOT NULL;
/*Note: Could also be PostTypeId = 1*/

SET STATISTICS IO ON;

SELECT TOP 100 *
FROM dbo.Posts
WHERE Tags LIKE ‘%%’
AND Tags IS NOT NULL
ORDER BY Score DESC;
–1817 logical reads

SELECT TOP 100 *
FROM dbo.Posts
WHERE Tags LIKE ‘%%’
AND Tags IS NOT NULL
ORDER BY Score DESC;
–56228 logical reads

SET STATISTICS IO OFF;
/*
That knocked some logical read off from the query, but nothing substantial

Time to get creative, let’s try an indexed view where we split the tags column into individual parts.
Thanks Thomas Franz for the general idea.
*/
DROP VIEW IF EXISTS dbo.view_PostTags
GO

CREATE OR ALTER VIEW dbo.view_PostTags
WITH SCHEMABINDING
AS
SELECT p.Id, p.Tags, p.Score
,Tag1 = SUBSTRING(p.Tags, 1, CHARINDEX(N’>’, p.Tags, 1))
,Tag2 = CASE
WHEN LEN(p.Tags) – LEN(REPLACE(p.Tags, N'<', N'')) = 2
THEN REVERSE(SUBSTRING(REVERSE(p.Tags), 1, CHARINDEX(N'<', REVERSE(p.Tags), 1)))
WHEN LEN(p.Tags) – LEN(REPLACE(p.Tags, N' 2
THEN SUBSTRING(p.Tags, CHARINDEX(N’>’, p.Tags, 1) + 1, CHARINDEX(N’>’, SUBSTRING(p.Tags, CHARINDEX(N’>’, p.Tags, 1) + 1, 150)))
ELSE NULL
END
,Tag3 = CASE
WHEN LEN(p.Tags) – LEN(REPLACE(p.Tags, N'<', N'')) = 3
THEN REVERSE(SUBSTRING(REVERSE(p.Tags), 1, CHARINDEX(N'<', REVERSE(p.Tags), 1)))
WHEN LEN(p.Tags) – LEN(REPLACE(p.Tags, N' 3
THEN SUBSTRING(
SUBSTRING(p.Tags, CHARINDEX(N’>’, SUBSTRING(p.Tags, CHARINDEX(N’>’, p.Tags, 1) + 1, 150), 1) + CHARINDEX(N’>’, p.Tags, 1) + 1, 150)
,1
,CHARINDEX(N’>’, SUBSTRING(p.Tags, CHARINDEX(N’>’, SUBSTRING(p.Tags, CHARINDEX(N’>’, p.Tags, 1) + 1, 150), 1) + CHARINDEX(N’>’, p.Tags, 1) + 1, 150), 1)
)
ELSE NULL
END
,Tag4 = CASE
WHEN LEN(p.Tags) – LEN(REPLACE(p.Tags, N'<', N'')) = 4
THEN REVERSE(SUBSTRING(REVERSE(p.Tags), 1, CHARINDEX(N'<', REVERSE(p.Tags), 1)))
WHEN LEN(p.Tags) – LEN(REPLACE(p.Tags, N' 4
THEN REVERSE(SUBSTRING(REVERSE(p.Tags), CHARINDEX(N'<', REVERSE(p.Tags), 1) + 1, CHARINDEX(N'<', SUBSTRING(REVERSE(p.Tags), CHARINDEX(N'<', REVERSE(p.Tags), 1) + 1, 150))))
ELSE NULL
END
,Tag5 = CASE
WHEN LEN(p.Tags) – LEN(REPLACE(p.Tags, N'<', N'')) = 5
THEN REVERSE(SUBSTRING(REVERSE(p.Tags), 1, CHARINDEX(N'<', REVERSE(p.Tags), 1)))
ELSE NULL
END
FROM dbo.Posts AS p
WHERE p.Tags IS NOT NULL;
GO

CREATE UNIQUE CLUSTERED INDEX idx_view_PostTags_Tag1 ON view_PostTags (Tag1, Score, Id);
CREATE NONCLUSTERED INDEX idx_view_PostTags_Tag2 ON view_PostTags (Tag2, Score, Id);
CREATE NONCLUSTERED INDEX idx_view_PostTags_Tag3 ON view_PostTags (Tag3, Score, Id);
CREATE NONCLUSTERED INDEX idx_view_PostTags_Tag4 ON view_PostTags (Tag4, Score, Id);
CREATE NONCLUSTERED INDEX idx_view_PostTags_Tag5 ON view_PostTags (Tag5, Score, Id);
GO

DROP VIEW IF EXISTS dbo.view_TagsPerPost
GO

CREATE OR ALTER VIEW dbo.view_TagsPerPost
AS
SELECT
Id, Score, Tag = Tag1
FROM dbo.view_PostTags WITH (NOEXPAND)
UNION ALL
SELECT
Id, Score, Tag = Tag2
FROM dbo.view_PostTags WITH (NOEXPAND)
WHERE Tag2 IS NOT NULL
UNION ALL
SELECT
Id, Score, Tag = Tag3
FROM dbo.view_PostTags WITH (NOEXPAND)
WHERE Tag3 IS NOT NULL
UNION ALL
SELECT
Id, Score, Tag = Tag4
FROM dbo.view_PostTags WITH (NOEXPAND)
WHERE Tag4 IS NOT NULL
UNION ALL
SELECT
Id, Score, Tag = Tag5
FROM dbo.view_PostTags WITH (NOEXPAND)
WHERE Tag5 IS NOT NULL
GO

SET STATISTICS IO ON;

SELECT TOP 100 p.*
FROM dbo.view_TagsPerPost AS tpp
INNER JOIN dbo.Posts AS p
ON p.Id = tpp.Id
WHERE tpp.Tag = '’
ORDER BY tpp.Score DESC;
–1139 + 20807 logical reads

SELECT TOP 100 *
FROM dbo.view_TagsPerPost AS tpp
INNER JOIN dbo.Posts AS p
ON p.Id = tpp.Id
WHERE tpp.Tag = ”
ORDER BY tpp.Score DESC;
–5 + 24 logical reads

SET STATISTICS IO OFF;

DROP VIEW IF EXISTS dbo.view_PostTags
GO

DROP VIEW IF EXISTS dbo.view_TagsPerPost
GO

/*
That fixed up both the outliner and the non-outliner query, with some database changes that will the app will need to do nothing with. However the query from the app will need to changed to use the view.
The business users will be pleased because great response times.
For the application team it is a small and manageable change, but the ways tags are recorded can’t be easily changed in the future.
However the database team might oppose the solution, as the view is mostly a database solution and they might be the team to maintain it.
*/

Reply
- Montro1981
  January 29, 2024 3:00 pm
  
  Great the blog doesn’t like \ in the text
  
  I used the tags sql-server and progress-reports in my tests
  
  Reply
- Montro1981
  January 30, 2024 2:10 pm
  
  After reading the site Vedran Galic linked (and it was in Brent ‘s mail of yesterday) I gave the trigram solution a try.
  Summary:
  It makes finding often used tags slower from the overhead from generating and searching the trigrams table (about 10X in miliseconds).
  However, finding the outliers is way faster though (110X faster).
  
  LIKE %sql-server%: 103 ms
  LIKE %progress-reports%: 10161 ms
  Trigram search %sql-server%: 1013 ms
  Trigram search %progress-reports%: 89 ms
  
  In my opinion, the decrease in performance while trigram searching for often used tags would be acceptable as finding the outliers is way faster.
  
  Wall of code incoming:
  
  CREATE OR ALTER FUNCTION dbo.GenerateTagTrigrams
  (
  @tags nvarchar(150)
  )
  RETURNS TABLE WITH SCHEMABINDING
  AS
  RETURN
  (
  WITH Ten AS
  (
  SELECT Ten = Digit.Ten, Weigth = COUNT(Digit.Ten) OVER ()
  FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)) AS Digit (Ten)
  ), Hundred AS
  (
  SELECT Hundred = Ten.Ten + (Ten.Weigth * Hundred.Ten), Weigth = Ten.Weigth * Ten.Weigth
  FROM Ten AS Ten CROSS APPLY Ten AS Hundred
  ), ThreeHundred AS
  (
  SELECT ThreeHundred = (Digit.Three * ThreeHundred.Weigth) + ThreeHundred.Hundred + 1
  FROM (VALUES (0), (1), (2)) AS Digit (Three)
  CROSS APPLY Hundred AS ThreeHundred
  ), TagTrigrams AS
  (
  SELECT TOP (CASE WHEN LEN(@tags) > 2 THEN LEN(@tags) – 2 ELSE 0 END)
  Trigram = SUBSTRING(@tags, threeHund.ThreeHundred, 3)
  FROM ThreeHundred AS threeHund
  ORDER BY threeHund.ThreeHundred
  )
  SELECT DISTINCT t.trigram
  FROM TagTrigrams AS t
  WHERE t.trigram COLLATE Latin1_General_BIN2 NOT LIKE ‘%[^A-Z0-9a-z]%’
  );
  GO
  
  DROP TABLE IF EXISTS dbo.TagsTrigrams;
  
  CREATE TABLE dbo.TagsTrigrams
  (
  PostId int NOT NULL
  ,Trigram nchar(3) NOT NULL
  );
  
  INSERT INTO dbo.TagsTrigrams WITH (TABLOCKX)
  (
  PostId
  ,Trigram
  )
  SELECT
  p.Id
  ,t.Trigram
  FROM dbo.Posts AS p
  CROSS APPLY dbo.GenerateTagTrigrams (tags) AS t;
  GO
  
  ALTER TABLE dbo.TagsTrigrams ADD CONSTRAINT pk_TagsTrigrams PRIMARY KEY CLUSTERED (Trigram, PostId);
  GO
  
  CREATE OR ALTER VIEW dbo.TagsTrigramCounts
  WITH SCHEMABINDING AS
  SELECT tt.Trigram, NumberOfOccurrences = COUNT_BIG(*)
  FROM dbo.TagsTrigrams AS tt
  GROUP BY tt.Trigram;
  GO
  
  CREATE UNIQUE CLUSTERED INDEX idx_TagsTrigramCounts ON dbo.TagsTrigramCounts (Trigram);
  GO
  
  CREATE OR ALTER FUNCTION dbo.PostTag_GetBestTrigrams
  (
  @searchTag nvarchar(50)
  )
  RETURNS TABLE WITH SCHEMABINDING
  AS
  RETURN
  (
  SELECT
  TagTrigram1 = MAX(CASE WHEN bestTrigrams.TrigramOrder = 1 THEN bestTrigrams.Trigram END)
  ,TagTrigram2 = MAX(CASE WHEN bestTrigrams.TrigramOrder = 2 THEN bestTrigrams.Trigram END)
  ,TagTrigram3 = MAX(CASE WHEN bestTrigrams.TrigramOrder = 3 THEN bestTrigrams.Trigram END)
  FROM
  (
  SELECT TOP 3
  TrigramOrder = ROW_NUMBER() OVER (ORDER BY ttc.NumberOfOccurrences ASC), st.Trigram
  FROM dbo.GenerateTagTrigrams(@searchTag) AS st
  INNER JOIN dbo.TagsTrigramCounts AS ttc WITH (NOEXPAND) ON ttc.Trigram = st.Trigram
  ORDER BY ttc.NumberOfOccurrences ASC
  ) AS bestTrigrams
  );
  GO
  
  CREATE OR ALTER FUNCTION dbo.GetTagTrigramPostIDs
  (
  @TagTrigram1 nchar(3)
  ,@TagTrigram2 nchar(3)
  ,@TagTrigram3 nchar(3)
  )
  RETURNS @IDs TABLE
  (
  PostId int PRIMARY KEY
  ) WITH SCHEMABINDING AS
  BEGIN /* Begin function body */
  IF @TagTrigram1 IS NOT NULL
  BEGIN
  IF @TagTrigram2 IS NOT NULL
  BEGIN
  IF @TagTrigram3 IS NOT NULL
  BEGIN
  INSERT @IDs (PostId)
  SELECT tt1.PostId
  FROM dbo.TagsTrigrams AS tt1
  WHERE tt1.Trigram = @TagTrigram1
  INTERSECT
  SELECT tt2.PostId
  FROM dbo.TagsTrigrams AS tt2
  WHERE tt2.Trigram = @TagTrigram2
  INTERSECT
  SELECT tt3.PostId
  FROM dbo.TagsTrigrams AS tt3
  WHERE tt3.Trigram = @TagTrigram3
  OPTION (MERGE JOIN);
  END;
  ELSE
  BEGIN
  INSERT @IDs (PostId)
  SELECT tt1.PostId
  FROM dbo.TagsTrigrams AS tt1
  WHERE tt1.Trigram = @TagTrigram1
  INTERSECT
  SELECT tt2.PostId
  FROM dbo.TagsTrigrams AS tt2
  WHERE tt2.Trigram = @TagTrigram2
  OPTION (MERGE JOIN);
  END;
  END;
  ELSE
  BEGIN
  INSERT @IDs (PostId)
  SELECT tt1.PostId
  FROM dbo.TagsTrigrams AS tt1
  WHERE tt1.Trigram = @TagTrigram1
  END;
  END;
  RETURN;
  END; /* End function body */
  GO
  
  CREATE OR ALTER FUNCTION dbo.PostTags_TrigramSearch
  (
  @Tag nvarchar(150)
  )
  RETURNS TABLE WITH SCHEMABINDING
  AS
  RETURN
  SELECT Posts.Id, Posts.AcceptedAnswerId, Posts.AnswerCount, Posts.Body, Posts.ClosedDate, Posts.CommentCount, Posts.CommunityOwnedDate, Posts.CreationDate, Posts.FavoriteCount, Posts.LastActivityDate, Posts.LastEditDate, Posts.LastEditorDisplayName, Posts.LastEditorUserId, Posts.OwnerUserId, Posts.ParentId, Posts.PostTypeId, Posts.Score, Posts.Tags, Posts.Title, Posts.ViewCount
  FROM dbo.PostTag_GetBestTrigrams(@Tag) AS ptt
  CROSS APPLY
  (
  SELECT p.Id, p.AcceptedAnswerId, p.AnswerCount, p.Body, p.ClosedDate, p.CommentCount, p.CommunityOwnedDate, p.CreationDate, p.FavoriteCount, p.LastActivityDate, p.LastEditDate, p.LastEditorDisplayName, p.LastEditorUserId, p.OwnerUserId, p.ParentId, p.PostTypeId, p.Score, p.Tags, p.Title, p.ViewCount
  FROM dbo.GetTagTrigramPostIDs (ptt.TagTrigram1, ptt.TagTrigram2, ptt.TagTrigram3) AS trigrams
  INNER JOIN dbo.Posts AS p
  ON p.id = trigrams.PostId
  WHERE ptt.TagTrigram1 IS NOT NULL
  AND p.Tags LIKE @Tag
  UNION ALL
  SELECT p.Id, p.AcceptedAnswerId, p.AnswerCount, p.Body, p.ClosedDate, p.CommentCount, p.CommunityOwnedDate, p.CreationDate, p.FavoriteCount, p.LastActivityDate, p.LastEditDate, p.LastEditorDisplayName, p.LastEditorUserId, p.OwnerUserId, p.ParentId, p.PostTypeId, p.Score, p.Tags, p.Title, p.ViewCount
  FROM dbo.Posts AS p
  WHERE ptt.TagTrigram1 IS NULL
  AND p.Tags LIKE @Tag
  ) AS Posts;
  GO
  
  SET NOCOUNT ON;
  
  DECLARE
  @StartTime datetime2
  ,@ElapsedTime int;
  
  SET @StartTime = SYSUTCDATETIME();
  
  SELECT TOP 100 *
  FROM dbo.Posts
  WHERE Tags LIKE ‘%sql-server%’
  ORDER BY Score DESC;
  
  SET @ElapsedTime = DATEDIFF(MILLISECOND, @StartTime, SYSUTCDATETIME());
  PRINT ‘LIKE %sql-server%: ‘ + CAST(@ElapsedTime AS nvarchar(4000)) + N’ ms’;
  SET @StartTime = SYSUTCDATETIME();
  
  SELECT TOP 100 *
  FROM dbo.Posts
  WHERE Tags LIKE ‘%progress-reports%’
  ORDER BY Score DESC;
  
  SET @ElapsedTime = DATEDIFF(MILLISECOND, @StartTime, SYSUTCDATETIME());
  PRINT ‘LIKE %progress-reports%: ‘ + CAST(@ElapsedTime AS nvarchar(4000)) + N’ ms’;
  SET @StartTime = SYSUTCDATETIME();
  
  SELECT TOP 100 *
  FROM dbo.PostTags_TrigramSearch(N’%sql-server%’)
  ORDER BY Score DESC;
  
  SET @ElapsedTime = DATEDIFF(MILLISECOND, @StartTime, SYSUTCDATETIME());
  PRINT ‘Trigram search %sql-server%: ‘ + CAST(@ElapsedTime AS nvarchar(4000)) + N’ ms’;
  SET @StartTime = SYSUTCDATETIME();
  
  SELECT TOP 100 *
  FROM dbo.PostTags_TrigramSearch(N’%progress-reports%’)
  ORDER BY Score DESC;
  
  SET @ElapsedTime = DATEDIFF(MILLISECOND, @StartTime, SYSUTCDATETIME());
  PRINT ‘Trigram search %progress-reports%: ‘ + CAST(@ElapsedTime AS nvarchar(4000)) + N’ ms’;
  
  SET NOCOUNT OFF;
  
  Reply
  - Brent Ozar
    January 30, 2024 2:23 pm
    
    Holy mackerel, Martin, you went to work!
    
    You may wanna check out Github Gists: https://gist.github.com You can paste code snippets there, and plus update ’em over time. I spent some time last weekend trying to work Gists into the comments here, but I couldn’t make it work without a pretty big change to the blog, so I held off for now. I think I’m gonna have to do something similar though because the query challenge thing is so much fun.
    
    I tried to write the challenge to say, “You can’t change the tables structurally,” and I actually had trigrams in mind when I was writing that. The answer post won’t mention trigrams by name either, but the followup challenge is going to be, “How would you fix this if you could change the tables structurally?” And trigrams are a good answer there.
    
    I still like this solution here though because we’re technically not CHANGING the existing tables, hahaha, just adding new ones.
    
    Reply
    - Montro1981
      January 30, 2024 3:16 pm
      
      Thanks for the compliment Brent.
      I noticed the “Faster Text Search” section in your email yesterday and I thought to give it a try, it is fun to learn new techniques.
      
      I subscribed to replies and I had to laugh at this top line: “Brace yourself: someone on the Internet said something. It’s probably wrong, too.”
      
      Reply
- Montro1981
  January 31, 2024 10:04 am
  
  Made a gist out of my initial answer:
  
  https://gist.github.com/Montro1981/c26bb6c7ca0992416f6c351e9db44da9
  
  And the Trigrams version
  
  https://gist.github.com/Montro1981/ab10d261ac5cee1096583539d1605f4c
  
  Reply
Finding Tagged Questions Faster: Answers & Discussion - Brent Ozar Unlimited®
January 31, 2024 1:15 pm

[…] Your query exercise was to take this Stack Overflow query to find the top-voted questions for any given tag: […]

Reply