Blog

In the last Query Exercise discussion, we hit a performance wall when we were trying to quickly find questions with a specific tag. We threw up our hands and said it’s time to make database changes, but we had a few restrictions:

We have a lot of existing code that uses the Posts.Tags column
That code does selects, inserts, and updates
We want our changes to be backwards-compatible: old code still has to work
We need to give developers an option for slow queries, letting them change those slow queries to get faster performance if they want it

The core of our problem is the structure of the Tags column:

It’s one long string, but the tag we’re looking for could be anywhere in the string. So instead of doing this:

WHERE Tags LIKE '%<sql-server>%'

1	WHERE Tags LIKE '%<sql-server>%'

We need a way to dive bomb into specific rows without that leading % sign. We need to be able to index the contents of Tags so that our queries look more like this:

WHERE Tags = '<sql-server>'

1	WHERE Tags = '<sql-server>'

But we can’t just change the queries to use =, because a question can have multiple tags.

First, design a child table to normalize Post.Tags.

That one phrase right there – “a question can have multiple tags” – is our sign that we have a data modeling problem. We really need a child table, more like this:

CREATE TABLE dbo.PostTags
	(PostId INT, Tag NVARCHAR(50));

CREATE INDEX Tag_PostId 
	ON dbo.PostTags(Tag, PostId);

CREATE TABLE dbo.PostTags

(PostId INT, Tag NVARCHAR(50));

CREATE INDEX Tag_PostId

ON dbo.PostTags(Tag, PostId);

Any given Post.Id could have up to 5 rows in the PostTags table.

For example, Post.Id 4 has Tags = “<c#><winforms><type-conversion><decimal><opacity>”. That post would have 5 rows in the PostTags table instead, one for each tag.

Then, if we had queries that we wanted to go faster, we could modify them to point to the new indexed child table:

SELECT TOP 100 p.*
FROM dbo.PostTags pt
	INNER JOIN dbo.Posts p ON pt.PostId = p.Id
WHERE pt.Tag = '<sql-server>'
ORDER BY p.Score DESC;

SELECT TOP 100 p.*

FROM dbo.PostTags pt

INNER JOIN dbo.Posts p ON pt.PostId = p.Id

WHERE pt.Tag = '<sql-server>'

ORDER BY p.Score DESC;

And they’d get nice, quick index seeks.

For bonus points, you might consider storing the Tags in a separate normalized structure like this:

CREATE TABLE dbo.Tags
	(Id INT IDENTITY(1,1),
	Tag NVARCHAR(50) PRIMARY KEY CLUSTERED);

CREATE TABLE dbo.PostTags
	(PostId INT, TagId INT);

CREATE UNIQUE CLUSTERED INDEX Tag_PostId 
	ON dbo.PostTags(TagId, PostId);

CREATE TABLE dbo.Tags

(Id INT IDENTITY(1,1),

Tag NVARCHAR(50) PRIMARY KEY CLUSTERED);

CREATE TABLE dbo.PostTags

(PostId INT, TagId INT);

CREATE UNIQUE CLUSTERED INDEX Tag_PostId

ON dbo.PostTags(TagId, PostId);

So that we only have to store the full string contents of the Tag (‘sql-server’) just once instead of on every PostTags row. In the comments, readers will have vigorous arm-wrestling matches trying to decide the primary key and clustered index structures of these tables. I’ll just point out that the only reason we’re really building these tables is to make queries of Posts.Tags go faster, so in your table design, consider these tables to be more like indexes, less like fancypants table designs.

Second, we gotta keep the new tables up to date.

In a perfect world, when Posts rows are deleted/updated/inserted, I would have the app code also maintain the PostTags rows. However, in the real world, I don’t always have the luxury of changing all the app code at once.

The answer: triggers.

STOP SCREAMING, it’s actually fine here. We just need to be careful about a few things:

Only touch the PostTags tables when the contents of Post.Tags actually changed
Insert new rows in dbo.Tags when a tag doesn’t exist yet – because right now, there aren’t any constraints on the contents of the Posts.Tags column. The app could just insert a new Posts row with wild new tags that we’ve never seen before, and it’s gonna be up to our trigger to handle that.
Handle multi-row transactions – because that’s the silent bug I find in most triggers, especially when they use the built-in UPDATE() function

Once we’ve written our triggers, we’ll want to backfill the new tables in a one-time process. I actually like enabling the triggers first, then backfilling the existing data, because it lets me use gradual batch processes to populate all the existing Posts, while the triggers handle ongoing changes while my batch stuff runs.

Here’s a proof of concept that I whipped up in my secret lab:

DROP TABLE IF EXISTS dbo.Tags;
DROP TABLE IF EXISTS dbo.PostTags;
GO
CREATE TABLE dbo.Tags
	(Id INT IDENTITY(1,1),
	TagName NVARCHAR(150) PRIMARY KEY CLUSTERED);

CREATE NONCLUSTERED INDEX Id ON dbo.Tags(Id);

CREATE TABLE dbo.PostTags
	(PostId INT, TagId INT);

CREATE UNIQUE CLUSTERED INDEX Tag_PostId 
	ON dbo.PostTags(TagId, PostId);

CREATE NONCLUSTERED INDEX PostId_TagId
	ON dbo.PostTags(PostId, TagId)
GO

/* Populate Tags with distinct list of existing tags */
INSERT INTO dbo.Tags (TagName)
	SELECT DISTINCT(REPLACE(t.value, '>', '')) AS TagName
	FROM dbo.Posts p
	CROSS APPLY STRING_SPLIT(p.Tags, '<') t
	WHERE t.value <> ''
	  AND NOT EXISTS(SELECT * FROM dbo.Tags tE WHERE tE.TagName = REPLACE(t.value, '>', ''))
GO
CREATE OR ALTER TRIGGER dbo.Posts_Tags
	ON dbo.Posts AFTER DELETE, UPDATE, INSERT AS
BEGIN
	SET NOCOUNT ON;

	/* Remove PostTags for deleted rows
		(rows that exist in deleted, but not inserted virtual tables) */
	DELETE dbo.PostTags
	FROM deleted d
		LEFT OUTER JOIN inserted i ON d.Id = i.Id
		INNER JOIN dbo.PostTags pt ON d.Id = pt.PostId
		WHERE i.Id IS NULL;

	/* Add any new tags that don't exist */
	INSERT INTO dbo.Tags(TagName)
		SELECT DISTINCT(REPLACE(t.value, '>', '')) AS TagName
		FROM inserted p
		CROSS APPLY STRING_SPLIT(p.Tags, '<') t
		WHERE t.value <> ''
		  AND NOT EXISTS(SELECT * FROM dbo.Tags tE WHERE tE.TagName = REPLACE(t.value, '>', ''))
		
	/* Add PostTags that exist in inserted */
	INSERT INTO dbo.PostTags(PostId, TagId)
		SELECT p.Id AS PostId, t.Id AS TagId
		FROM inserted p
		CROSS APPLY STRING_SPLIT(p.Tags, '<') pt
		INNER JOIN dbo.Tags t ON REPLACE(pt.value, '>', '') = t.TagName
		LEFT OUTER JOIN dbo.PostTags ptE ON p.Id = ptE.PostId AND t.Id = ptE.TagId
		WHERE pt.value <> ''
		  AND ptE.PostId IS NULL;
		
	/* Remove PostTags that have been edited out */
	DELETE pt
	FROM inserted i
	INNER JOIN dbo.PostTags pt ON i.Id = pt.PostId	 /* No join requirement for correct tags */
	INNER JOIN dbo.Tags t ON pt.TagId = t.Id
	WHERE t.TagName NOT IN (SELECT REPLACE(value, '>', '') FROM STRING_SPLIT(i.Tags, '<'))

END
GO

/* Backfill existing rows, takes a few minutes */
INSERT INTO dbo.PostTags(PostId, TagId)
	SELECT p.Id AS PostId, t.Id AS TagId
	FROM dbo.Posts p
	CROSS APPLY STRING_SPLIT(p.Tags, '<') pt
	INNER JOIN dbo.Tags t ON REPLACE(pt.value, '>', '') = t.TagName
	LEFT OUTER JOIN dbo.PostTags ptE ON p.Id = ptE.PostId AND t.Id = ptE.PostId
	WHERE pt.value <> ''
	  AND ptE.PostId IS NULL;
GO

DROP TABLE IF EXISTS dbo.Tags;

DROP TABLE IF EXISTS dbo.PostTags;

CREATE TABLE dbo.Tags

(Id INT IDENTITY(1,1),

TagName NVARCHAR(150) PRIMARY KEY CLUSTERED);

CREATE NONCLUSTERED INDEX Id ON dbo.Tags(Id);

CREATE TABLE dbo.PostTags

(PostId INT, TagId INT);

CREATE UNIQUE CLUSTERED INDEX Tag_PostId

ON dbo.PostTags(TagId, PostId);

CREATE NONCLUSTERED INDEX PostId_TagId

ON dbo.PostTags(PostId, TagId)

/* Populate Tags with distinct list of existing tags */

INSERT INTO dbo.Tags (TagName)

SELECT DISTINCT(REPLACE(t.value, '>', '')) AS TagName

FROM dbo.Posts p

CROSS APPLY STRING_SPLIT(p.Tags, '<') t

WHERE t.value <> ''

AND NOT EXISTS(SELECT * FROM dbo.Tags tE WHERE tE.TagName = REPLACE(t.value, '>', ''))

CREATE OR ALTER TRIGGER dbo.Posts_Tags

ON dbo.Posts AFTER DELETE, UPDATE, INSERT AS

BEGIN

SET NOCOUNT ON;

/* Remove PostTags for deleted rows

(rows that exist in deleted, but not inserted virtual tables) */

DELETE dbo.PostTags

FROM deleted d

LEFT OUTER JOIN inserted i ON d.Id = i.Id

INNER JOIN dbo.PostTags pt ON d.Id = pt.PostId

WHERE i.Id IS NULL;

/* Add any new tags that don't exist */

INSERT INTO dbo.Tags(TagName)

SELECT DISTINCT(REPLACE(t.value, '>', '')) AS TagName

FROM inserted p

CROSS APPLY STRING_SPLIT(p.Tags, '<') t

WHERE t.value <> ''

AND NOT EXISTS(SELECT * FROM dbo.Tags tE WHERE tE.TagName = REPLACE(t.value, '>', ''))

/* Add PostTags that exist in inserted */

INSERT INTO dbo.PostTags(PostId, TagId)

SELECT p.Id AS PostId, t.Id AS TagId

FROM inserted p

CROSS APPLY STRING_SPLIT(p.Tags, '<') pt

INNER JOIN dbo.Tags t ON REPLACE(pt.value, '>', '') = t.TagName

LEFT OUTER JOIN dbo.PostTags ptE ON p.Id = ptE.PostId AND t.Id = ptE.TagId

WHERE pt.value <> ''

AND ptE.PostId IS NULL;

/* Remove PostTags that have been edited out */

DELETE pt

FROM inserted i

INNER JOIN dbo.PostTags pt ON i.Id = pt.PostId /* No join requirement for correct tags */

INNER JOIN dbo.Tags t ON pt.TagId = t.Id

WHERE t.TagName NOT IN (SELECT REPLACE(value, '>', '') FROM STRING_SPLIT(i.Tags, '<'))

END

/* Backfill existing rows, takes a few minutes */

INSERT INTO dbo.PostTags(PostId, TagId)

SELECT p.Id AS PostId, t.Id AS TagId

FROM dbo.Posts p

CROSS APPLY STRING_SPLIT(p.Tags, '<') pt

INNER JOIN dbo.Tags t ON REPLACE(pt.value, '>', '') = t.TagName

LEFT OUTER JOIN dbo.PostTags ptE ON p.Id = ptE.PostId AND t.Id = ptE.PostId

WHERE pt.value <> ''

AND ptE.PostId IS NULL;

Standard disclaimers apply there: I only tested for a few minutes, didn’t put any error handling in, etc. It should be enough to get you started when you need to solve problems like this, though.

Finally, change the slowest queries to use the new tables.

Once the new child tables are populated, we can present the solution to our developers like this:

“Hi! We’ve heard your complaints about queries being slow when you’re looking for a specific tag. Good news! We’ve got a new table structure that you can use, but it’s entirely up to you when you want to migrate over to it. You can change any queries, at any time, and use either the old or new table structures. Here’s an example of a slow query:”

SELECT TOP 100 *
FROM dbo.Posts
WHERE Tags LIKE '%<sql-server>%'
ORDER BY Score DESC;

SELECT TOP 100 *

FROM dbo.Posts

WHERE Tags LIKE '%<sql-server>%'

ORDER BY Score DESC;

“And here’s the faster version using the new tables:”

SELECT TOP 100 p.*
FROM dbo.Posts p
INNER JOIN dbo.PostTags pt ON p.Id = pt.PostId
INNER JOIN dbo.Tags t ON pt.TagId = t.Id
WHERE t.TagName = N'<sql-server>'
ORDER BY Score DESC;

SELECT TOP 100 p.*

FROM dbo.Posts p

INNER JOIN dbo.PostTags pt ON p.Id = pt.PostId

INNER JOIN dbo.Tags t ON pt.TagId = t.Id

WHERE t.TagName = N'<sql-server>'

ORDER BY Score DESC;

“When changing your code, a few things to be aware of: use equality searches on T.TagName, not like searches, and don’t use % signs in your search. Follow those guidelines, and your queries will typically see a 99% reduction in duration. If you don’t see at least a 90% reduction, send the before & after query to a member of the DBA team, and we’ll help you with the query changes.”

I like this approach because it lets the existing code work as-is, and lets developers opt into faster performance whenever they have time to edit queries. If we can’t just use a magic button to make everything faster, then this is the next best thing.

Bonus Points for Thomas Franz

In the comments on the challenge, Thomas Franz took things to the next level when discussing a solution similar to mine above. Thomas wrote:

Problem with your first approach: when you write direct into the dbo.PostTags table, it will not be reflected in dbo.Posts.Tags (you are still using this column).

And one part of the goal was, that the devs should be able to use faster queries (on the new structure) and in my opinion this does include INSERT/UPDATE/DELETE too (e.g. deleting the tag from all Posts (if there would be any).

And of course you are saving the tags twice (disk space / memory) and on the long term it is almost guaranteed, that it will differ (maybe because someone disabled the trigger for a short maintenance or there are some edge cases …).

So Thomas wrote an even more ambitious solution: he turned Posts into a view, and broke out the underlying normalized data into separate tables. He used triggers on the view to push deletes/updates/inserts (DUIs) into the right places. He also only stored the normalized data once, ensuring its accuracy and eliminating the chance that it’d get out of sync. I love it!

#TSQL2sday Invitation: Describe the Most Recent Issue You Closed.

Last Updated December 30, 2023

Professional Development

T-SQL Tuesday is a monthly blogothon where we get together and write about a different topic. I’m hosting this month’s episode.

Your readers wonder what kinds of jobs are out there in the database world, what exactly it is that you do, and what your daily grind is like. While it’d be cool to cover all of that, let’s start with something simple.

Your mission for this week: write a blog post about the last ticket you closed, and schedule it for next Tuesday, February 13.

It doesn’t have to be T-SQL. T-SQL Tuesday has evolved to cover all kinds of data topics.

The task/issue doesn’t have to be indicative of your overall career. Our database jobs cause us to do all kinds of oddball things through the day. Go into your ticket system, help desk system, list of Github issues, or task list right now, look at the last task you checked off, and blog about that.

Don’t include company specifics or anything that might get you in trouble. Just talk in general terms about:

Why the task was created (an error popped up, a user had a problem, your boss had an idea, whatever)
General terms about work you had, what online resources you found helpful, how long it took
How often that kind of task pops up in your queue

The rules:

Your blog post has to go live on Tuesday, February 13
Include the T-SQL Tuesday logo, and if somebody clicks on the logo, it has to link back to this post
Include a link to this blog post, or after you publish it, leave a comment here so I can track you down for the roundup blog post next week
If you want to host a future T-SQL Tuesday episode, leave a comment on this post

Office Hours: PowerBI & Fabric Q&A with Eugene Meidinger

Last Updated January 31, 2024

Videos

Today’s Office Hours has a different topic. Eugene Meidinger dropped by to answer your business intelligence questions.

00:00 Start
01:35 Tim: Can Power BI fully replace SSRS? We are looking to migrate our SSRS on-premise report server to a Cloud Service and Power BI is being touted as the solution. Are there any gotchas with moving SSRS to Azure Power BI? Thanks Eugene and Brent.
03:44 James: Now that Fabric has the workspace concept from Power BI, how do you determine when you need a new workspace? Do you base it on security or topic or ?
05:29 SQLCheese: What’s the best place to look for getting started using Power BI?
06:56 DeploymentNewb: What’s the best way to deploy reports with changing data sources? Like for multiple customers that all need the same report.
08:08 Catherine: How do I go about identifying slow performing measures and query steps when I inherit a massive PBI report to achieve some quick wins?
12:00 April O’N14:43eil: We provide a software that creates pay slips in PDF format and send it to employees. It’s all on SSRS. Can Power BI provide the same feature (produce PDF per user), send it and ideally Row-level Security for dashboards in a scenario of ~20.000 users? What about license costs?
14:43 Malte: What is the best practice to deal with 2-3 large dimensions (~10mio rows)? Unfortunately my company uses matrixes alot with lots of details. The large dimensions seem to be a bottleneck.
16:10 James: I personally have not had a lot of success with ChatGPT when it comes to generating DAX code. Do you have libraries of DAX expressions which you usually use and could share?
20:00 When Eugene quit his job
21:55 westside: When is it best to create DAX measures in Power BI versus creating measures in SSAS tabular? Is there a rule of thumb? Thanks

Who’s Hiring in the Microsoft Data Platform Community? February 2024 Edition

Last Updated January 5, 2024

Who's Hiring

Is your company hiring for a database position as of February 2024? Do you wanna work with the kinds of people who read this blog? Let’s set up some rapid networking here.

If your company is hiring, leave a comment. The rules:

Your comment must include the job title, and either a link to the full job description, or the text of it. It doesn’t have to be a SQL Server DBA job, but it does have to be related to databases. (We get a pretty broad readership here – it can be any database.)
An email address to send resumes, or a link to the application process – if I were you, I’d put an email address because you may want to know that applicants are readers here, because they might be more qualified than the applicants you regularly get.
Please state the location and include REMOTE and/or VISA when that sort of candidate is welcome. When remote work is not an option, include ONSITE.
Please only post if you personally are part of the hiring company—no recruiting firms or job boards. Only one post per company. If it isn’t a household name, please explain what your company does.
Commenters: please don’t reply to job posts to complain about something. It’s off topic here.
Readers: please only email if you are personally interested in the job.

If your comment isn’t relevant or smells fishy, I’ll delete it. If you have questions about why your comment got deleted, or how to maximize the effectiveness of your comment, contact me.

Each month, I publish a new post in the Who’s Hiring category here so y’all can get the latest opportunities.

Finding Tagged Questions Faster: Answers & Discussion

Last Updated February 23, 2024

Your query exercise was to take this Stack Overflow query to find the top-voted questions for any given tag:

SELECT TOP 100 *
FROM dbo.Posts
WHERE Tags LIKE '%<sql-server>%'
ORDER BY Score DESC;

SELECT TOP 100 *

FROM dbo.Posts

WHERE Tags LIKE '%<sql-server>%'

ORDER BY Score DESC;

That’s currently using this index in its execution plan:

CREATE INDEX Score_Tags ON dbo.Posts(Score, Tags);

1	CREATE INDEX Score_Tags ON dbo.Posts(Score, Tags);

And answer 3 questions:

What kinds of tags will perform worse than others for this query?
Could you change the query to perform better?
Could you change the indexes to perform better, without changing the table structure?

Q1: What tags will have problems?

The current execution plan is scanning the Score_Tags index backwards, from highest-scoring posts down. As soon as it finds 100 posts that match the tag we’re looking for, boom, the query can stop scanning.

That works really well for common tags like sql-server, but the rarer of a tag we’re looking for, the slower the query is going to get. It takes longer to scan through the entire table, hoping to find 100 rows that match, checking the contents of each Tags, because searching strings in SQL Server is expensive.

For example, let’s look for a relatively recent tag, sql-server-2022:

SELECT TOP 100 *
FROM dbo.Posts
WHERE Tags LIKE '%<sql-server-2022>%'
ORDER BY Score DESC;

SELECT TOP 100 *

FROM dbo.Posts

WHERE Tags LIKE '%<sql-server-2022>%'

ORDER BY Score DESC;

The execution plan is the same as searching for sql-server, but because there aren’t even 100 2022-specific questions yet, the query takes forever to run.

The less data there is for a tag, the longer the query will take to run. So the answer here is that rare tags will have problems. (Isn’t that wild? Usually we think of ‘big data’ as being more problematic for query performance, but here it’s the opposite problem.)

Q2: Could we improve the query?

Database admins often have a knee-jerk reaction to seeing “SELECT *” and they scream that it’s bad for performance. We could modify the query so that it returns less columns, and let’s try an extreme example of that. Let’s only return in columns included in our Score_Tags index:

SET STATISTICS IO ON;

/* "Bad" */
SELECT TOP 100 *
FROM dbo.Posts
WHERE Tags LIKE '%<sql-server>%'
ORDER BY Score DESC;

/* "Good" */
SELECT TOP 100 Score, Tags, Id
FROM dbo.Posts
WHERE Tags LIKE '%<sql-server>%'
ORDER BY Score DESC;

SET STATISTICS IO ON;

/* "Bad" */

SELECT TOP 100 *

FROM dbo.Posts

WHERE Tags LIKE '%<sql-server>%'

ORDER BY Score DESC;

/* "Good" */

SELECT TOP 100 Score, Tags, Id

FROM dbo.Posts

WHERE Tags LIKE '%<sql-server>%'

ORDER BY Score DESC;

The “good” query’s actual execution plan has one less operator – no key lookup – but in the case of the sql-server tag, the duration difference is like 1 millisecond. Nobody cares. Yes, it’s technically less logical reads:

But the difference isn’t really groundbreaking. Besides, if we even just add one column in that isn’t in the index, the key lookup is back, and we’re back to the exact same number of logical reads. Like Erik recently blogged, you can’t partially fix key lookups.

Besides, if we switch to the sql-server-2022 tag that runs slowly, both the “bad” and “good” versions of the query do the exact same number of logical reads anyway. They scan the whole index. The problem in their query plans isn’t the key lookup – it’s the index scan.

So I would argue that:

For common tags, people aren’t concerned with performance because the query’s already fast enough
For uncommon tags, where performance is terrible, query changes don’t help

There is one way you can cheat, and I’m all about cheating if it results in faster performance. In the challenge post’s comments, JennyFromDBLock suggested adding an additional filter to the query – like, only look for posts with a Score higher than a certain number. The more selective that filter is, the less rows we have to examine. For example, if we only look for high scores:

SELECT TOP 100 *
FROM dbo.Posts
WHERE Tags LIKE '%<sql-server>%'
	AND Score > 1000
ORDER BY Score DESC;

SELECT TOP 100 *

FROM dbo.Posts

WHERE Tags LIKE '%<sql-server>%'

AND Score > 1000

ORDER BY Score DESC;

The actual execution plan now has a seek rather than a scan. The index is on Score, Tags, which means we can seek to 1001, and read less rows. This technique is even effective for rare tags – but again, only as long as we use a selective filter on Score.

I can’t always tell users, “Hey, just look for different rows, and your query will go faster,” but if you can do that, then this technique works.

Q3: Could we tune indexes without changing the table structure?

I have to put in the second half of that sentence because I wanted to keep the challenge realistic to short-term real-world work. In the real world, we’re not usually allowed to point, shriek, and demand application code changes.

First, for regular nonclustered indexes, some readers suggested adding different indexes, like putting just Tags in the key:

CREATE INDEX Tags ON dbo.Posts(Tags) INCLUDE (Score);

1	CREATE INDEX Tags ON dbo.Posts(Tags) INCLUDE (Score);

But since our query isn’t sargable, that makes no difference on logical reads. We have no idea whether the <sql-server> tag is at the beginning of the Tags string, or the end, so the rows we’re looking for could be anywhere. And now, since the index isn’t organized by Score, we have to find all of the sql-server tagged questions, and sort all of them by score. That query plan really blows and takes forever.

Another suggestion was to only index on Score, and just put Tags in the includes:

CREATE INDEX Score ON dbo.Posts(Score) INCLUDE (Tags);

1	CREATE INDEX Score ON dbo.Posts(Score) INCLUDE (Tags);

The idea there is that the index will require less maintenance since Tags doesn’t need to be sorted. Sure – but that doesn’t make our query any faster, and we’re trying to get a faster query here.

Going even farther out of the box, a couple of readers suggested adding a nonclustered columnstore index like this:

CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI
	ON dbo.Posts(Score, Tags);

1 2	CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI ON dbo.Posts(Score, Tags);

But that actually makes performance dramatically worse. I explain why in my Fundamentals of Columnstore class.

So regular nonclustered indexes aren’t gonna cut it. How about a full text index? Let’s look at the table screenshot from the challenge blog post:

Oof. Things like .net-3.5 are going to be a problem since SQL Server’s full text indexes ignore dashes. Alright, that’s out the door – let’s keep looking.

If you look at the screenshot with your special eyes, you’ll see ~~your brand~~ some rows don’t have Tags at all. However, our index on Score_Tags includes all rows of the table, whether they have null or populated Tags. For proof, check out sp_BlitzIndex and compare the sizes of the indexes, and you’ll see that the clustered index and the Score_Tags index both have the same number of rows.

Let’s see how much of the table has null Tags:

SELECT SUM(1) AS rows,
	SUM(CASE WHEN Tags IS NULL THEN 1 ELSE 0 END) AS tags_null
	FROM dbo.Posts;

SELECT SUM(1) AS rows,

SUM(CASE WHEN Tags IS NULL THEN 1 ELSE 0 END) AS tags_null

FROM dbo.Posts;

Results:

More than half of the table’s rows can’t possibly match our search. Wouldn’t it be cool if we could remove those from the index? Indeed, we can, with a filtered index.

CREATE INDEX Score_Tags_Filtered
	ON dbo.Posts(Score, Tags)
	WHERE Tags IS NOT NULL;

CREATE INDEX Score_Tags_Filtered

ON dbo.Posts(Score, Tags)

WHERE Tags IS NOT NULL;

Now, if we look for a rare tag, like sql-server-2022, we can avoid scanning those null rows:

SELECT TOP 100 *
FROM dbo.Posts
WHERE Tags LIKE '%<sql-server-2022>%'
ORDER BY Score DESC;

SELECT TOP 100 *
FROM dbo.Posts WITH (INDEX = Score_Tags)
WHERE Tags LIKE '%<sql-server-2022>%'
ORDER BY Score DESC;

SELECT TOP 100 *

FROM dbo.Posts

WHERE Tags LIKE '%<sql-server-2022>%'

ORDER BY Score DESC;

SELECT TOP 100 *

FROM dbo.Posts WITH (INDEX = Score_Tags)

WHERE Tags LIKE '%<sql-server-2022>%'

ORDER BY Score DESC;

In those two queries, the top query lets SQL Server choose the index. The bottom one has an index hint saying, “SQL Server, I want you to use the old non-filtered index.” Here are their actual execution plans, and here’s the difference in logical reads:

The filtered index is good for about 25% less reads, and query runtime drops from 35.1 seconds to… 33.5 seconds.

That’s, uh, not good enough. So technically I’d say the answer to Q3 is that yes, we can improve performance with an index, but we’re only talking about percentage improvements, not order-of-magnitude improvements. We’re gonna need a bigger boat.

Finally, Thomas Franz had a really ambitious idea: build an indexed view that normalizes the contents of the Tags column, splitting it into Tag_1, Tag_2, Tag_3, Tag_4, and Tag_5 automatically. It was hard work to build that solution out, but It absolutely does work. I’m mentioning this last because it requires changes to the query and changes to the indexing, but I still totally consider it valid and I love it.

Sometimes, we gotta change the database.

This is a great example of the kind of performance problem you hit as data size grows over time. In the beginning, when tables are small, table design is less critical. As your application grows over time:

There are more users, running more queries, simultaneously
The size of the data they’re querying grows, so each query becomes more resource-intensive
The distribution of data changes, and you’re more likely to have outliers with unique performance challenges
The shortcuts and mistakes from the application’s beginning come back to haunt you

And that’s your next Query Exercise:
how would you change this design?

Keep in mind that:

We have a lot of existing code that uses the Posts.Tags column
That code does selects, inserts, and updates
We want our changes to be backwards-compatible: old code still has to work
We need to give developers an option for slow queries, letting them change those slow queries to get faster performance if they want it

That’s your challenge for this week! Post your answer in the comments (if you’re using code, try pasting the code in a Github Gist), discuss the ideas of others, and we’ll check back next week with another post with the thoughts and my own solution. Have fun!

Free Video Training: Estimation and Statistics by Dave Ballantyne

Last Updated January 30, 2024

Statistics, Videos

It’s hard to find good 400-500 level SQL Server training at any price, let alone free. Dave Ballantyne’s past sessions from SQLBits are great examples of those.

Today, I’m going to review Dave’s most recent session on Estimation and Statistics. It’s from 2015, but it’s still very relevant today.

If you want to jump past the intro and bio, the technical stuff starts at about 2 minutes 15 seconds in.

Dave starts by explaining how building a query plan is a lot like building a driving route from point A to point B. I love that analogy because:

There are a lot of options, and exploring all the possible routes is computationally intensive
We have to make some assumptions about how the roads and traffic are going to look
Along the way, things may go wrong, but at that point, it’s too late for us to replan our route from scratch

A few time marks if you want to jump ahead in the video recording:

You can skip the intros by jumping to about 2m:15sec into the video.
If you’ve watched How to Think Like the Engine, and you want to jump into how the statistics histogram is decoded, jump to 12m:30sec into the video.
If you already know how to interpret a single histogram, and you want to know how SQL Server uses multiple histograms on different columns, jump to 21m in.
When you hit the brief part on parameter sniffing and want to skip past it, jump to 43m:19sec. The parameter sniffing part is probably not new to regular readers here. At 43m:19sec, the next topic starts, the Plan Skyline problem.
To learn about optimization’s ascending key problem, jump to 55m in

If you like training like this, consider attending the March conference either virtually, or in-person outside of London. Registration is open now, and you can use discount code Brent5 to save 5%. Here’s the spot to look for during registration:

As of this writing, there are literally 10 Early Bird seats left, so hustle!

What PowerBI Questions Do You Have for Eugene Meidinger?

Last Updated January 29, 2024

Data Science

Eugene Meidinger, aka SQLGene.com, makes Power BI concepts easy and accessible. He’s got a free introduction to Power BI performance tuning here. People come to him with questions like:

What kind of tools are there for performance tuning Power BI?
Where should I do my logic: SQL, Power Query, or DAX?
What are some best practices for designing a data model?
What is the best trick for speeding up Power Query?
How do you get started improving Power Performance?

He’s joining me for Office Hours on Wednesday to take your Power BI performance tuning questions.

Post your questions here, and upvote the ones you’d like to see him answer.

On Wednesday at 2PM Eastern, 11AM Pacific, join us on my Twitch channel to watch live. Here’s a calendar invite so you can block it out on your calendar. If you can’t join live, no worries, the recording will go up on my YouTube channel later.

I’ve turned off comments on this post to make sure folks don’t accidentally post their questions here – post ’em over on PollGab. See you Weds!

Query Exercise: Find Tagged Questions Faster.

Last Updated February 23, 2024

For this week’s Query Exercise, we’re going to start with an existing query that has performance problems from time to time, depending on how it’s used.

This query’s goal is to find the top-voted Stack Overflow questions for any given tag – for example, here are the top-voted questions tagged “sql-server”.

What are tags? I’m glad you asked. When you’re looking at a question at Stack Overflow, you’ll see a list of tags for that question:

The tags are at the bottom of that screenshot: sql, sql-server, t-sql, sql-update, dml. A question can have up to 5 tags. To see how they’re stored in the Stack Overflow database, let’s look at the posts table, which is where questions & answers are stored:

SELECT TOP 100 * FROM dbo.Posts;

1	SELECT TOP 100 * FROM dbo.Posts;

There’s a “Tags” column:

So to find the questions tagged sql-server, our analyst wrote this query:

SELECT TOP 100 *
FROM dbo.Posts
WHERE Tags LIKE '%<sql-server>%'
ORDER BY Score DESC;

SELECT TOP 100 *

FROM dbo.Posts

WHERE Tags LIKE '%<sql-server>%'

ORDER BY Score DESC;

And our DBA helped by creating this index:

CREATE INDEX Score_Tags ON dbo.Posts(Score, Tags);

1	CREATE INDEX Score_Tags ON dbo.Posts(Score, Tags);

Right now, the actual execution plan is fairly simple. SQL Server scans that index backwards, going from the top-scoring posts down, checking each one’s tags to see if it contains <sql-server>.

Your exercises for this week are:

What kinds of tags will perform worse than others for this query?
Could you change the query to perform better?
Could you change the indexes to perform better, without changing the table structure?

You can post your answers in this blog post’s comments, and discuss each others’ ideas. We’ll revisit your answers in this post. Have fun!

Finding the Best Time for Maintenance: Answers & Discussion

Last Updated January 22, 2024

Your Query Exercise was to find the best time to do database maintenance by querying the Users table, looking for a one-hour window with the lowest number of created users. We kept this one pretty simple by looking at the data in just one table, and we didn’t hassle with time zones. We just wanted the 3 lowest-load hours, in this format:

Mondays, 1:00-1:59
Saturdays, 23:00-23:59
Sundays, 2:00-2:59

The comments on this one were wild! Lots of different approaches, but they all boiled down to two basic styles.

The easy (but not necessarily right) way:
group by date_part

One way to do this is to use the built-in DATEPART function to group users by weekday and hour of their creation dates, like this:

SELECT DATEPART(WEEKDAY, CreationDate) as weekday, 
	   DATEPART(HOUR, CreationDate) as hour, 
	   COUNT(*) as users_created
FROM dbo.Users u
WHERE u.CreationDate >= '2017-11-05'
  AND u.CreationDate < '2017-12-03'
GROUP BY DATEPART(WEEKDAY, CreationDate), DATEPART(HOUR, CreationDate)
ORDER BY COUNT(*);

SELECT DATEPART(WEEKDAY, CreationDate) as weekday,

DATEPART(HOUR, CreationDate) as hour,

COUNT(*) as users_created

FROM dbo.Users u

WHERE u.CreationDate >= '2017-11-05'

AND u.CreationDate < '2017-12-03'

GROUP BY DATEPART(WEEKDAY, CreationDate), DATEPART(HOUR, CreationDate)

ORDER BY COUNT(*);

And at first glance, that looks like it works:

For the database I’m using (the 2018-06 Stack Overflow database) and the date ranges I picked, it looks like late Sunday night, early Monday morning would be the winners here, so my answer might be:

Monday, 02:00-02:59
Monday, 00:00-00:59
Sunday, 00:00-00:59

If that was your solution, you’d be in fine company: ChatGPT 4 suggested a similar solution. It even added a delightful caveat:

Note that DATEPART(dw, CreationDate) returns an integer representing the day of the week, where the specific value depends on the SET DATEFIRST setting of your SQL Server instance. Typically, 1 represents Sunday, but this can vary. Similarly, DATEPART(hour, CreationDate) returns the hour of the day based on a 24-hour clock.

That’s one of the things I love about using ChatGPT for coding tasks like that – it can add valuable context that saves me time from checking the documentation.

That grouping approach has a bug, though.

That technique only works as long as our historical data is populated for every hour of every day. What if we’ve been taking the web servers down every Sunday at 16:00-17:00? There wouldn’t be any users created in that date range, and thus, no rows would show up in the result set for that query!

To illustrate it, let’s query the table looking for days when no users were created during particular hours:

SELECT TOP 10 
	DATEPART(YEAR, CreationDate) AS CreationDate_Year, 
	DATEPART(MONTH, CreationDate) AS CreationDate_Month,
	DATEPART(DAY, CreationDate) AS CreationDate_DAY,
	COUNT(DISTINCT DATEPART(HOUR, CreationDate)) AS CreationDate_Hours
FROM dbo.Users
GROUP BY DATEPART(YEAR, CreationDate), DATEPART(MONTH, CreationDate), DATEPART(DAY, CreationDate)
ORDER BY COUNT(DISTINCT DATEPART(HOUR, CreationDate));

SELECT TOP 10

DATEPART(YEAR, CreationDate) AS CreationDate_Year,

DATEPART(MONTH, CreationDate) AS CreationDate_Month,

DATEPART(DAY, CreationDate) AS CreationDate_DAY,

COUNT(DISTINCT DATEPART(HOUR, CreationDate)) AS CreationDate_Hours

FROM dbo.Users

GROUP BY DATEPART(YEAR, CreationDate), DATEPART(MONTH, CreationDate), DATEPART(DAY, CreationDate)

ORDER BY COUNT(DISTINCT DATEPART(HOUR, CreationDate));

The results in the 2018-06 database show that we’ve had several days where we only created users in some (but not all) hours of the day:

And if that was a repeating trend in the data, we’d have a problem.

The Safer Technique:
generate a list of days/times first.

In pseudocode:

SELECT weekday, hour, 
  (SELECT COUNT(*) of users_created in that weekday and hour)
FROM list_of_weekdays_and_hours

SELECT weekday, hour,

(SELECT COUNT(*) of users_created in that weekday and hour)

FROM list_of_weekdays_and_hours

That way, if we have any weekday/hour combinations with no users created, we’ll still have rows for them in our result set. Starting with compatibility level 160 (SQL Server 2022), one option for this is the generate_series function. In older versions, you’d probably use a numbers table, aka tally table.

We need two sets of rows:

weekdays: a set of rows 1 through 7
hours: a set of rows 0 through 23

So we’ll call generate_series twice:

SELECT weekdays.value AS weekday, hours.value AS hour
FROM GENERATE_SERIES(1,7) AS weekdays
CROSS JOIN GENERATE_SERIES(0,23) AS hours;

SELECT weekdays.value AS weekday, hours.value AS hour

FROM GENERATE_SERIES(1,7) AS weekdays

CROSS JOIN GENERATE_SERIES(0,23) AS hours;

To get a list of days and hours:

Then, let’s take our first version of this query, use that as a CTE, and left outer join to it:

WITH UserCreations AS (
	SELECT DATEPART(WEEKDAY, CreationDate) as weekday, 
		   DATEPART(HOUR, CreationDate) as hour, 
		   COUNT(*) as users_created
	FROM dbo.Users u
	WHERE u.CreationDate >= '2017-11-05'
	  AND u.CreationDate < '2017-12-03'
	GROUP BY DATEPART(WEEKDAY, CreationDate), DATEPART(HOUR, CreationDate)
)
SELECT weekdays.value AS weekday, hours.value AS hour,
	COALESCE(UserCreations.users_created, 0) AS users_created
FROM GENERATE_SERIES(1,7) AS weekdays
CROSS JOIN GENERATE_SERIES(0,23) AS hours
LEFT OUTER JOIN UserCreations ON weekdays.value = UserCreations.weekday
AND hours.value = UserCreations.hour
ORDER BY COALESCE(UserCreations.users_created, 0);

WITH UserCreations AS (

SELECT DATEPART(WEEKDAY, CreationDate) as weekday,

DATEPART(HOUR, CreationDate) as hour,

COUNT(*) as users_created

FROM dbo.Users u

WHERE u.CreationDate >= '2017-11-05'

AND u.CreationDate < '2017-12-03'

GROUP BY DATEPART(WEEKDAY, CreationDate), DATEPART(HOUR, CreationDate)

)

SELECT weekdays.value AS weekday, hours.value AS hour,

COALESCE(UserCreations.users_created, 0) AS users_created

FROM GENERATE_SERIES(1,7) AS weekdays

CROSS JOIN GENERATE_SERIES(0,23) AS hours

LEFT OUTER JOIN UserCreations ON weekdays.value = UserCreations.weekday

AND hours.value = UserCreations.hour

ORDER BY COALESCE(UserCreations.users_created, 0);

That way, the GENERATE_SERIES calls will make sure rows exist in our result set – even if no users were created in that date range.

Hope you enjoyed it! For more challenges like this, check out the Query Exercises page.

Save 10% on SQLBits Registration with BRENTO10.

Last Updated February 23, 2024

SQLBits

Going to the SQLBits conference in March?

Whether you’re registering in-person or attending virtually, save 10% on your ticket with discount code BRENTO10. During the checkout process, look for this:

And save yourself some quick bucks. The screen will refresh after you hit Apply, so make a note of the price before & after. You’re welcome!

2024-02-1: Bits early bird pricing ended, so they upgraded my coupon from 5% to 10% since you’re now paying full price for the tickets overall. Yay!

[Video] Office Hours: The Question is Not the Problem Edition

Last Updated January 8, 2024

Videos

I answer your top-voted Azure SQL DB and SQL Server questions from https://pollgab.com/room/brento and lose my temper about what you’re doing.

Here’s what we covered:

00:00 Start
00:40 Mattia: Has bulk logged recovery model ever got you across the finish line? If so, could you share a story?
01:41 ForEachDBA: My manager wants to reduce cpu from db servers as cpu usage under 50% most of the time, also sp_BlitzFirst doesn’t show cpu top waits. The question:how to measure effectively the impact of reducing cpu? What are the key metrics to watch? Plan cache durations stats is good approach?
02:56 Ataollah S: Will we see any new features added to SQL 2022 in 2024?
03:31 Henrique Almeida: As a recommendation, MS suggests that we use the maximum for MAXDOP being 8.We changed it to 2 and I could see that there was a considerable decrease. give your opinion on the relationship between MAXDOP x CXPACKET, whether they are proportional or inversely proportional.
04:42 Bonnie: Is it ok / safe to simultaneously run sp_blitzcache and sp_blitzindex on a busy boxed SQL Server? sp_blitzindex sometimes takes much longer to complete than sp_blitzcache.
05:45 Owl Kitty: What is the danger zone for Batch Requests per Second that we should be on the lookout for?
06:16 TimelyDBA: My friend is configuring alerts based on severity in a shop requiring the server instance hardened using the center of internet security standards which recommends never using SQL server for mail services. Have you alternative pathways to harmonize security and delivery for such?
07:02 Nazanin B: What are the common issues you see when applications connect to SQL Server over a WAN connection vs a LAN connection? Does compression help here?
07:38 MyTeaGotCold: How can I quality assure SQL code? Other languages have reasonable error handling, good linters, and unit testing that doesn’t need to be hacked in.
08:49 Bruno: What is your opinion of the new Azure Boost feature to improve VM network and storage performance? Game changer for SQL VM?
09:04 Bahareh A.: Is column store index maintenance for large tables hard on RPO when log shipping is involved?
09:47 Rudy: SQL assessment API is complaining about high page splits in our canned SQL instance? How do we identity which tables are experiencing this? Is this a fools errand?
10:37 TimelyDBA: Who is the brent ozar for SQL server always on AG and other HA/DR strategies?
12:55 Jen: What are the best / worse use cases you have seen for sp_invoke_external_rest_endpoint to call external API’s inside of SQL Server?
13:56 Bonnie: sp_BlitzFirst is reporting a priority 40 finding for Forwarded Fetches/Sec High: TempDB Object for boxed SQL. When is this bad / concerning?
14:14 UnInstallAllScripts: How install scripts like InstallAllScripts.sql gets update?
15:18 Manimal: Does it ever make since to run boxed SQL Enterprise on a 24 core or less cloud VM? Do you see this much in the wild?
15:48 Persephone: Which vendors do you foresee doing cool AI integrations with SQL Server? Which integrations are you looking forward to?
17:14 Rom: What are the top customer issues you see when resizing an Azure SQL VM up or down?
17:49 Dom: Hi Brent, Have you ever taugh of offering your training/blog in french (for Canada/Europe people who doesn’t understand english that much)
18:13 Pytzamarama: Happy 2024! Brent I will rephrase a badly asked past question of mine: When we update our customers we drop create all procedures and triggers. Does this affect performance and how. Thank you!
19:50 Zappy: Hi Brent, as far as know, a statistic update shouldn’t cause a plan recompilation if a table wasn’t modified. But recently I think that I have seen a plan recompilation on an unmodified table simply caused by a stat update with a different SamplingPercent. Is this possible ?

Query Exercise: Find the Best Time for Maintenance

Last Updated January 24, 2024

If we’ve gotta take the database down for maintenance – perhaps a version upgrade, perhaps upgrading our own code, maybe scaling up the hardware – when’s the best time to do it?

For this week’s query exercise, the business has asked us to identify 3 1-hour periods with the least user activity, in this format:

Mondays, 1:00-1:59
Saturdays, 23:00-23:59
Sundays, 2:00-2:59

Pop open the Stack Overflow database, and for the sake of simplicity for this exercise:

We’re just going to look at the users table. We’ll judge load by times when the fewest number of users are created, judging by the users.creationdate column.
We’re not dealing with time zones.
We’re only looking for one-hour time ranges that start with 00 and end with 59, like 1:00-1:59. We’re using a little shorthand there and not typing out the seconds, but the point is that we don’t want you saying “the best time is 2:15:05 to 3:14:04.” That’s delightfully ambitious – but let’s keep it simple here. Not everyone has as much free time as you, bud.

Pick a 28-day window so that you’re looking at several weeks of trending activity, and come up with your one-hour preferences. (I expect folks to come up with widely different answers depending on which 28-day window they pick, and which copy of the database they’re using. That’s okay! The goal is to write the query, and then check your logic against the gotchas that I discuss next week.)

The performance of your query isn’t a concern here – this isn’t a query we’ll be running on a regular basis. We just need to know the quietest times in the database.

You can post your answers in this blog post’s comments, and discuss each others’ ideas. We’ll revisit your answers in this post. Have fun!

Find Posts with the Wrong CommentCount: Answers & Discussion

Last Updated January 14, 2024

Your Query Exercise was to find denormalization accuracy problems: checking the accuracy of a reporting column, Posts.CommentCount. There were two parts: finding the top 100 most problematic Posts with the biggest variances, and thinking about a long term solution to keep the CommentCount accuracy as high as practical.

Question 1: Finding the Problematic Posts

Your first attempt might have looked something like this:

SELECT TOP 100 p.Id, 
	p.CommentCount, COUNT(distinct c.Id) as comments_actual,
	p.CommentCount - COUNT(distinct c.Id) as comments_variance
FROM dbo.Posts p
INNER JOIN dbo.Comments c ON p.Id = c.PostId
GROUP BY p.id, p.CommentCount
HAVING p.CommentCount <> COUNT(distinct c.Id)
ORDER BY p.CommentCount - COUNT(distinct c.Id) DESC;

SELECT TOP 100 p.Id,

p.CommentCount, COUNT(distinct c.Id) as comments_actual,

p.CommentCount - COUNT(distinct c.Id) as comments_variance

FROM dbo.Posts p

INNER JOIN dbo.Comments c ON p.Id = c.PostId

GROUP BY p.id, p.CommentCount

HAVING p.CommentCount <> COUNT(distinct c.Id)

ORDER BY p.CommentCount - COUNT(distinct c.Id) DESC;

But there are 2 problems with that. First, because it’s an inner join, it only finds Posts with Comments. What if the CommentCount is off because there are no matching rows in the Comments table at all? We’re gonna need a left outer join instead.

Second, that approach focuses on variances where p.CommentCount is HIGHER than the actual number of comments – but not the other way around. In this case, those variances are fairly small, only off by a couple comments:

Your next approach might have been two different queries, unioned together, with one sorting for higher p.CommentCounts, and the other sorting for higher actual comments. However, given the size of these tables, that’d be a performance nightmare, making multiple passes through the same tables.

The slick way to do it is to sort by the absolute value (ABS) of the variance. That way, if I have a list with both large positive AND negative values, those large values will hang out together in the result list.

Our new query looks a lot more complicated due to needing to handle situations with no rows in the Comments table (thus the COALESCE) and the ABS to sort large values together:

SELECT TOP 100 p.Id, 
	p.CommentCount, COUNT(distinct c.Id) as comments_actual,
	ABS(p.CommentCount - COALESCE(COUNT(distinct c.Id), 0)) as comments_variance
FROM dbo.Posts p
LEFT OUTER JOIN dbo.Comments c ON p.Id = c.PostId
GROUP BY p.id, p.CommentCount
HAVING p.CommentCount <> COALESCE(COUNT(distinct c.Id), 0)
ORDER BY ABS(p.CommentCount - COALESCE(COUNT(distinct c.Id), 0)) DESC;

SELECT TOP 100 p.Id,

p.CommentCount, COUNT(distinct c.Id) as comments_actual,

ABS(p.CommentCount - COALESCE(COUNT(distinct c.Id), 0)) as comments_variance

FROM dbo.Posts p

LEFT OUTER JOIN dbo.Comments c ON p.Id = c.PostId

GROUP BY p.id, p.CommentCount

HAVING p.CommentCount <> COALESCE(COUNT(distinct c.Id), 0)

ORDER BY ABS(p.CommentCount - COALESCE(COUNT(distinct c.Id), 0)) DESC;

Our result set looks quite different now!

Now we’ve found issues where there are no comments at all (which the inner join missed), and variances in both directions. (I especially love the symmetry of rows 10 & 11 in this screenshot to illustrate that problem.)

Alright, part 1 done. We’ve identified that we do indeed have a variety of data problems here. In the comments on the challenge blog post, there was a lot of discussion about creating indexes. I’d say that for Question 1, seeing the scope of the problem, you shouldn’t put in indexes just to solve that problem. It’s a one-time thing, checking to see if we have something to fix or not.

Now, what should we do long term?

Question 2: Solving It Long Term

Your homework assignment specifically skipped writing a one-time update statement. I asked you to focus on how we should keep that Posts.CommentCount column as up to date as practical, reminding you that it wasn’t a transactional system like bank balances.

When we wanna keep data up to date between two tables, we have 3 basic options:

Synchronously, done by the database server itself – which would make insert transactions longer when people add new comments, plus add workload to the database server to coordinate that transaction. This would typically be accomplished with a trigger on the Comments table so that as soon as rows were deleted, updated, or inserted, the Posts.CommentCount could be updated at the same time, and always in sync. ChatGPT 4 generated me a trigger to do this. (At first glance, I don’t think it handles multi-row deletes or updates correctly, but I didn’t test that – just found it amusing. Whenever I have a coding task to do these days, I tend to ask ChatGPT to give it a shot, just to see how it does.)
Asynchronously, done by the database server – add a trigger to the comments table, but rather than updating Posts.CommentCount right away, simply add a row to a queue table indicating that the post.id needs its comment count updated. Later, circle back with an async job to run the commentcount updates only for the posts listed in the queue table.
Asynchronously, done by the app – instead of a trigger, have the front end app (or service layer) add a record to something whenever it inserts a comment. That something doesn’t even have to be a relational database – it could be a queue service or a key/value store. Then, have a service check that queue later and run update queries only for the specific posts listed in the queue.

The advantage of the async approach is that we only have to update a posts row once every X minutes, no matter how many comments have been added to it. If comments come in fast & furious in the first, say, hour of a post’s existence, then we can reduce that workload with a queue.

In fact, if you check out the above screenshot with variances, you’ll notice that the largest variances all seem to have a similar range of ids, and they’re all in the high range – some of the more recently inserted posts, basically. That would seem to suggest an asynchronous process.

Your long term solution might have been just that! You might have recommended a periodic job, say hourly, to update the posts.commentcount column. Depending on a few things, that might be fine:

If we’re willing to let the data be out of sync on newly created posts
If we don’t have too many posts columns to update (you saw how slow these queries were – imagine how bad it’d be if we had a dozen similar reporting columns to keep in sync)
If the regular process doesn’t cause blocking problems when it runs for a long time
If we have enough database server horsepower to get ‘er done without too much impact to end users

On the other hand, you could also choose the synchronous approach: every time we insert a new comment row, update the posts.commentcount row for the post we’re commenting on. This could either be done with a trigger on comments, or with an added update statement on whatever app code is inserting the comments. A big advantage of the sync approach is that the reporting data is always up to date. Business users like that.

For the sync approach, there are two ways to update the Posts.CommentCount number:

Either count the exact number of Comments in the Comments table at the time of the data change. This is easier to code because it’s just a quick sum on the Comments table, but it does involve actually querying the Comments table, which introduces more work and more locking.
Or simply increment the Posts.CommentCount number +1 or -1, depending on the data change. That one’s faster because there’s less locking involved on the Comments table, but it can be trickier to get the logic right.

In a perfect world, I’m a fan of keeping the database simple for scalability purposes: start with the synchronous approach, done in app code. When you hit a performance wall where that stops scaling, and if this particular process is the bottleneck holding you back from going faster, only then do something more complicated like an async batch job, after making sure the business users are okay with the data being out of sync.

If you choose a synchronous solution, you’ll probably need indexes to support the queries to make them fast enough. If you choose an async solution, especially if you offload it to a readable replica, then it’s easier to get away without adding more indexes when the database is already over-indexed.

Something Fun to Think About

For those of you following along with the Postgres version of this homework, you might have noticed that in the original homework assignment posts, when I queried the top 100 posts just to show demo data, I got very different results.

Here’s SQL Server’s top 100:

Note that in this case, it’s sorted by Id. Typically, in SQL Server, you’ll see the first 100 rows sorted by their clustering key. That’s certainly not always the case, and in beginning performance tuning sessions, I have to emphasize to SQL Server people, “Seriously, order isn’t guaranteed unless you ask for it.”

Here’s Postgres’s top 100:

Postgres really doesn’t give you an order by unless you ask for it, and if you haven’t used Postgres before, the reason will rock your world.

How’d you do? What’d you think?

I hope you had fun, enjoyed this exercise, and learned a little something. If you’ve got comments or questions, hit the comment section below. (Keep the comments focused on this exercise, though – we’re not covering the details of the Postgres export just yet. Only so many things I can cover per blog post, hahaha, and this thing’s already way long.)

We’re Starting to Automate Testing for the First Responder Kit.

Last Updated January 10, 2024

First Responder Kit

Brent says: It’s really funny how the world works. I wrote what you see below in December 2023 and scheduled it for publication on January 16th (today), hoping to get help from the community to get this process started. On January 9, before the post went live, David Wiseman submitted a pull request with the first steps! More on that at the end of the post. I’m leaving the post as-written in Dec 2023 because it’s still useful.

It’s a new year. You tell me if the state of the art has changed.

The biggest challenge in maintaining the First Responder Kit is testing people’s code. Different people submit code changes all the time, and I need to quickly assess:

If their code will work on all current versions of SQL Server, Amazon RDS, and Azure SQL DB Managed Instances
If their code works on case sensitive instances
If their code works when some databases aren’t in an expected health state (offline, inaccessible for security reasons, restoring, etc)
If their changes still have the intended good effects of the original code – for example, if we’re testing for databases with auto-shrink on, are we still catching that?

It’s especially hard with the First Responder Kit because the queries use SQL Server’s own system objects as a source.

Writing queries for multiple SQL Server versions is hard.

Here’s a very simplified example from a recently failed pull request:

SELECT database_id, object_id, index_id,
    row_lock_count, row_lock_wait_count,
    version_generated_inrow, version_generated_offrow
FROM sys.dm_db_index_operational_stats(1,NULL,NULL,NULL);

SELECT database_id, object_id, index_id,

row_lock_count, row_lock_wait_count,

version_generated_inrow, version_generated_offrow

FROM sys.dm_db_index_operational_stats(1,NULL,NULL,NULL);

That pull request Worked On My Machine™, but it failed on SQL Server 2016 because the version_generated_inrow and version_generated_offrow columns didn’t exist back then.

Sure, you could check that if you read the sys.dm_db_index_operational_stats documentation, which shows which columns are available in which version. On that page, do a control-F and search for “version”, which will take you to the part of the page that shows those two columns, and the version of SQL Server that introduced them. Go ahead, I’ll wait.

What’s that you say?

Those columns aren’t listed in the documentation, and many aren’t?

Ah well you see, that’s because Microsoft’s documentation is open source now. It’s up to kind folks like you and I to make pull requests with updates to the documentation, explaining to Microsoft what columns in their DMVs have changed, and what the contents of those columns are. We can’t expect Microsoft to do that for us, you know. They only made $151,597,000,000 of profit last year. That’s not revenue, that’s profit, after they paid all their expenses. I know that seems like a big number to you, and that with so much leftover change, they’d be able to keep their documentation up to date, but as the Microsoft folks say, that’s a big ask.

Just to put that in perspective, that’s $17,305,593 per hour. $4,807 per second.

So forgive me for not making that pull request to fix the documentation. I’ll leave that to you, dear reader. While you work on that, I’ll get back to the problem at hand.

Testing that multiple versions is even harder.

To test this simplified T-SQL:

SELECT database_id, object_id, index_id,
    row_lock_count, row_lock_wait_count,
    version_generated_inrow, version_generated_offrow
FROM sys.dm_db_index_operational_stats(1,NULL,NULL,NULL);

SELECT database_id, object_id, index_id,

row_lock_count, row_lock_wait_count,

version_generated_inrow, version_generated_offrow

FROM sys.dm_db_index_operational_stats(1,NULL,NULL,NULL);

We would need to somehow:

Fake the contents of a read-only system DMV (because you can’t delete/update/insert data in that DMV.)
Fake the contents of different versions of SQL Server, which have different columns, and aren’t even documented.
Keep those tests up to date, which is monstrously challenging given the cloud versions like AWS RDS and Azure SQL DB MI, which can and do change over time, including system stored procedures and their RDS equivalents.

Realistically, that’s not going to happen. So instead, it would make more sense to do automated testing. Each time a pull request comes in:

Spin up instances of SQL Server 2014, 2016, 2017, 2019, and 2022.
Run the First Responder Kit’s creation scripts (the new pull request’s version), ensuring that the scripts are at least created without error.
Run a series of predefined tests of sp_Blitz & friends, ensuring that the scripts run without error and produce the expected result sets.
And then perhaps the hardest one, repeat those tests on AWS RDS and Azure SQL DB MI. Those are tougher because the spin-up times are much, much longer, and we get charged while they’re up.

That’s hard – but let’s take the first step.

Several years ago, I approached community members & consultants who were well-known for SQL Server CI/CD work. I explained this challenge, and I asked, “How much would you charge me to build a solution for this? You can open source the whole thing, or whatever parts are useful to you.”

Back then, no one would even give me a number to solve that problem – the effort in designing the solution alone was too high. Given that years have passed, and the state of the art might be different now, you tell me in the comments: what would the first steps look like? How long would they take? If it’s doable today, I’d definitely be interested in funding it.

Because manual testing blows, which means we don’t do it as well as we should, and we rely on community testers. Anything I can do to lessen the work of the community on stuff like this, I’m down!

Update 2024-01-09: David Wiseman took the first step!

Back to that first foreshadowing paragraph of the post: I’m thrilled to see that David Wiseman actually took the first step. He used Github Actions so that when someone opens a pull request:

A GitHub runner deploys a clean instance of SQL Server 2017 in a Linux Docker container
The First Responder Kit scripts are deployed
Basic automated tests are run with PowerShell/Pester

To see how it’s done, check out the changed files in pull request 3419.

I love this as a first step because like David wrote in his Github issue, in just 2 minutes, this verifies that:

The scripts deploy & run without errors on a case-sensitive 2017 Linux instance – which would actually fail more than you might think, just because a lot of folks don’t use case sensitive instances or Linux
The expected number of row sets are returned with the expected number of columns – this actually would have caught pull requests in the past where folks left in debug results

[Video] Office Hours: Ask Me Anything About the Microsoft Data Platform

Last Updated January 4, 2024

Videos

Forgive the hoarse voice – I’m recovering from something COVID-ish. I went through your top-voted questions from https://pollgab.com/room/brento and did my best:

Here’s what we covered:

00:00 Start
03:03 DBANoob: We have EF code that creates randomly generated temp table names every time it runs, while the rest of the logic remains the same, giving thousands of single use plans. I’ve tried warning the devs about this, but no luck. Any classes cover this? Any fixes I can try on my end?
03:35 Dom: Hi Brent, you often mention San backup for VLDB. I googled but wasn’t able to find information about how we could do PIT recovery with San Snapshot backup. Is it possible ? Thanks ! P.s. Big fan of your work !
04:14 Bruno: Is it safe to install the microsoft ace ole db provider on SQL Server so that SSIS packages targeting excel files will function?
04:59 MyTeaGotCold: Do you still send out the 6-Month DBA Training Plan on Wednesdays? I don’t think I’ve seen one in months.
05:13 KyleDevDBA: Hi Brent, You recently mentioned Cercle for when you need to focus. Does it ever have the opposite effect and you end up staring at the beautiful scenery for an hour?
05:58 Pepé Le Pew: What are the top smells that a given sql agent job is really application logic that should be moved elsewhere?
07:05 Scooby: Is that Beany in your mastering slide deck?
07:24 Ezra: How do you perform automated tests with the first responder kit?
08:22 Persephone: How would you rate ChatGPT’s ability to explain what a complicated piece of TSQL code is doing? Will this end the need for users to document their TSQL code?
09:17 Darth Insidious: Inline TSQL comments using — syntax instead of /* are insidious since the extracted single line TSQL can’t be auto formatted for multi-line. Do you know of any good tools that can strip the — comments and replace with /* comments ?
09:56 Montro1981: Hi Brent, Happy New Year to you and your loved ones. Are there any things to look out for or “gotchas” when a database has both RCSI and Snapshot Isolation enabled?
10:47 RacerX: Do the parallel racing stripes in estimated query plan operator indicate that parallelism will also be used in actual query plan operator as well?
11:25 Guðmundur: Hierarchyid appears to be a CLR data type. Do you see any common performance issues with this data type on large tables?
12:27 Bruno: Thanks for all the tooling / training you provide for the SQL community. Looking forward to future training.
12:40 DBA_Todd: Hey Brent! When enabling Accelerated Database Recovery, is it best to change the location of the PVS to a different filegroup? If so, why, and should I put it on separate storage from the primary filegroup? I have a 4TB db running SQL 2022 on Azure VM.
14:59 Prisha: What are the best methodologies for adhoc copying a single (hundred million rows) SQL table from production to development?
15:29 Giannis M: sp_Blitz is reporting system db’s located on C drive (master, model, msdb). Is it low or high risk to move these DB’s?
17:05 kansas4444: Hi Brent, What’s your opinion on _UTF8 collation to replace nvarchar by varchar ? I’ve translations tables where the PK is composed of the culture code and the parent key. Copying the table with a varchar column in UTF8 reduced its size from 10 GB to 4 GB for ~4 million rows.
17:59 Sigrún: We update stats nightly but even still there are some tables that will auto update stats during the next day due to high frequency of activity. Do you have guidance as to when you should you should disable auto update stats for a given table and do it manually?
18:53 The Blame Bus Driver: Is there a good way to know who last modified a given SQL agent job?

Query Exercise: Find Posts with the Wrong CommentCount

Last Updated January 24, 2024

For 2024, I’m trying something new: weekly homework challenges! For this week’s query challenge, we’re going to do some data validation. Pop open the Stack Overflow database and check out the CommentCount column on the Posts table:

In theory, that’s a fast way to check how many Comments a particular Post has, without querying the Comments table to find them all. For example, if I want to see the Comments for Post.Id 13, I might write:

SELECT * FROM dbo.Comments WHERE PostId = 13;

1	SELECT * FROM dbo.Comments WHERE PostId = 13;

And presto, I get the 6 comments, matching the CommentCount = 6 value in the above screenshot:

But… I’m a DBA, and you know what DBA stands for: Doesn’t Believe Anyone. So our query exercise for this week is to:

Write a query that finds the top 100 problematic Posts with the biggest CommentCount variance versus the actual number of comments on that post. This is a little tricky for two reasons: the Posts.CommentCount could be off in either direction, AND these tables are ginormous, so your queries are going to take a while. You’ll want to put in some thought before you just go hit execute.
Think about a long term solution to keep that Posts.CommentCount as up to date as practical. We’re not worried about a one-time update – we’re more concerned with making sure the data is relatively accurate. It doesn’t have to be transactionally consistent here – these aren’t bank balances, after all.

Have fun, and once you’ve given it a shot, read this post with solutions.

Find Foreign Key Problems: Answers & Discussion

Last Updated January 24, 2024

https://stackoverflow.com/questions/34/

Your Query Exercise was to find foreign key problems: data that might stop us from successfully implementing foreign keys in the Stack Overflow database.

Question 1: Write queries to find the problematic data.

Montro1981’s thorough answer broke the challenge into two parts: first, check to see if there is ANY bad data, and if so, check to see how pervasive it is. What I really like about this query below is that it recognizes that:

Each table, and each column, might have different foreign key problems
Some of those problems might be big, and some might be small
Some might have only happened a long time ago, and some might still be ongoing

/* Let's retrieve some metadata about the Posts */
WITH PostsMetadata
AS (SELECT Number = COUNT(*),
           FirstDate = MIN(p.CreationDate),
           LastDate = MAX(p.CreationDate)
    FROM dbo.Posts AS p
   ),
     /* Let's retrieve data about the Posts without users */
     PostsWithoutValidUser
AS (SELECT NumberOfIssues = COUNT(*),
           FirstOccurance = MIN(p.CreationDate),
           LastOccurance = MAX(p.CreationDate)
    FROM dbo.Posts AS p
    WHERE NOT EXISTS
    (
        SELECT 1 / 0 FROM dbo.Users AS u WHERE p.OwnerUserId = u.Id
    )
   ) ,
     /* Let's retrieve some metadata about the Comments */
     CommentsMetadata
AS (SELECT Number = COUNT(*),
           FirstDate = MIN(c.CreationDate),
           LastDate = MAX(c.CreationDate)
    FROM dbo.Comments AS c
   ) ,
     /* Let's retrieve data about the Comments without users */
     CommentsWithoutValidUser
AS (SELECT NumberOfIssues = COUNT(*),
           FirstOccurance = MIN(c.CreationDate),
           LastOccurance = MAX(c.CreationDate)
    FROM dbo.Comments AS c
    WHERE NOT EXISTS
    (
        SELECT 1 / 0 FROM dbo.Users AS u WHERE c.UserId = u.Id
    )
   ) ,
     /* Let's retrieve data about the Posts without posts */
     CommentsWithoutValidPosts
AS (SELECT NumberOfIssues = COUNT(*),
           FirstOccurance = MIN(c.CreationDate),
           LastOccurance = MAX(c.CreationDate)
    FROM dbo.Comments AS c
    WHERE NOT EXISTS
    (
        SELECT 1 / 0 FROM dbo.Posts AS p WHERE c.PostId = p.Id
    )
   )
SELECT Issue = 'PostsWithoutUser',
       pMeta.Number,
       pMeta.FirstDate,
       pMeta.LastDate,
       pNoUser.NumberOfIssues,
       pNoUser.FirstOccurance,
       pNoUser.LastOccurance,
       PercentageOfPosts = CAST(ROUND(
                                         CAST(pNoUser.NumberOfIssues AS decimal(14, 2))
                                         / CAST(pMeta.Number AS decimal(14, 2)) * 100,
                                         2
                                     ) AS decimal(14, 2))
FROM PostsMetadata AS pMeta
    CROSS APPLY PostsWithoutValidUser AS pNoUser
UNION ALL
SELECT Issue = 'CommentsWithoutUser',
       cMeta.Number,
       cMeta.FirstDate,
       cMeta.LastDate,
       cNoUser.NumberOfIssues,
       cNoUser.FirstOccurance,
       cNoUser.LastOccurance,
       PercentageOfComments = CAST(ROUND(
                                            CAST(cNoUser.NumberOfIssues AS decimal(14, 2))
                                            / CAST(cMeta.Number AS decimal(14, 2)) * 100,
                                            2
                                        ) AS decimal(14, 2))
FROM CommentsMetadata AS cMeta
    CROSS APPLY CommentsWithoutValidUser AS cNoUser
UNION ALL
SELECT Issue = 'CommentsWithoutPost',
       cMeta.Number,
       cMeta.FirstDate,
       cMeta.LastDate,
       cNoPost.NumberOfIssues,
       cNoPost.FirstOccurance,
       cNoPost.LastOccurance,
       PercentageOfComments = CAST(ROUND(
                                            CAST(cNoPost.NumberOfIssues AS decimal(14, 2))
                                            / CAST(cMeta.Number AS decimal(14, 2)) * 100,
                                            2
                                        ) AS decimal(14, 2))
FROM CommentsMetadata AS cMeta
    CROSS APPLY CommentsWithoutValidPosts AS cNoPost;

/* Let's retrieve some metadata about the Posts */

WITH PostsMetadata

AS (SELECT Number = COUNT(*),

FirstDate = MIN(p.CreationDate),

LastDate = MAX(p.CreationDate)

FROM dbo.Posts AS p

/* Let's retrieve data about the Posts without users */

PostsWithoutValidUser

AS (SELECT NumberOfIssues = COUNT(*),

FirstOccurance = MIN(p.CreationDate),

LastOccurance = MAX(p.CreationDate)

FROM dbo.Posts AS p

WHERE NOT EXISTS

(

SELECT 1 / 0 FROM dbo.Users AS u WHERE p.OwnerUserId = u.Id

)

) ,

/* Let's retrieve some metadata about the Comments */

CommentsMetadata

AS (SELECT Number = COUNT(*),

FirstDate = MIN(c.CreationDate),

LastDate = MAX(c.CreationDate)

FROM dbo.Comments AS c

) ,

/* Let's retrieve data about the Comments without users */

CommentsWithoutValidUser

AS (SELECT NumberOfIssues = COUNT(*),

FirstOccurance = MIN(c.CreationDate),

LastOccurance = MAX(c.CreationDate)

FROM dbo.Comments AS c

WHERE NOT EXISTS

(

SELECT 1 / 0 FROM dbo.Users AS u WHERE c.UserId = u.Id

)

) ,

/* Let's retrieve data about the Posts without posts */

CommentsWithoutValidPosts

AS (SELECT NumberOfIssues = COUNT(*),

FirstOccurance = MIN(c.CreationDate),

LastOccurance = MAX(c.CreationDate)

FROM dbo.Comments AS c

WHERE NOT EXISTS

(

SELECT 1 / 0 FROM dbo.Posts AS p WHERE c.PostId = p.Id

)

SELECT Issue = 'PostsWithoutUser',

pMeta.Number,

pMeta.FirstDate,

pMeta.LastDate,

pNoUser.NumberOfIssues,

pNoUser.FirstOccurance,

pNoUser.LastOccurance,

PercentageOfPosts = CAST(ROUND(

CAST(pNoUser.NumberOfIssues AS decimal(14, 2))

/ CAST(pMeta.Number AS decimal(14, 2)) * 100,

) AS decimal(14, 2))

FROM PostsMetadata AS pMeta

CROSS APPLY PostsWithoutValidUser AS pNoUser

UNION ALL

SELECT Issue = 'CommentsWithoutUser',

cMeta.Number,

cMeta.FirstDate,

cMeta.LastDate,

cNoUser.NumberOfIssues,

cNoUser.FirstOccurance,

cNoUser.LastOccurance,

PercentageOfComments = CAST(ROUND(

CAST(cNoUser.NumberOfIssues AS decimal(14, 2))

/ CAST(cMeta.Number AS decimal(14, 2)) * 100,

) AS decimal(14, 2))

FROM CommentsMetadata AS cMeta

CROSS APPLY CommentsWithoutValidUser AS cNoUser

UNION ALL

SELECT Issue = 'CommentsWithoutPost',

cMeta.Number,

cMeta.FirstDate,

cMeta.LastDate,

cNoPost.NumberOfIssues,

cNoPost.FirstOccurance,

cNoPost.LastOccurance,

PercentageOfComments = CAST(ROUND(

CAST(cNoPost.NumberOfIssues AS decimal(14, 2))

/ CAST(cMeta.Number AS decimal(14, 2)) * 100,

) AS decimal(14, 2))

FROM CommentsMetadata AS cMeta

CROSS APPLY CommentsWithoutValidPosts AS cNoPost;

Heres what the results look like for the 2018-06 training version of the Stack Overflow database, and of course they’ll be different depending on which version you use:

We can see that the first two issues (Posts & Comments with invalid UserIds) is pretty rampant, whereas comments without valid PostIds is much less frequent. However, all 3 of the foreign key candidates are still having problems right up til the end of the data set (2018-06), which means whatever’s causing this “bad” data is still happening.

Question 2: What might have caused the data problems?

Let’s examine the posts rows that don’t have matching users rows. You can click on this to zoom in if you like:

We’re trying to join over to the users table using the owneruserid column, but the OwnerUserId is 0 for these rows, and apparently there’s no Users.Id 0. That 0 sounds like it might be a hard-coded magic value.

What are some reasons that a post might not have a valid owneruserid?

Perhaps users are allowed to delete their accounts, but Stack Overflow has chosen to keep their questions & answers around for posterity.
Perhaps there were bugs in transaction handling code. Perhaps there was a single stored procedure that was supposed to add a user and add their first post, but something went wrong, and only the post got inserted.
Perhaps there was database corruption.
Perhaps someone ran a delete on the users table with an incorrect where clause. (C’mon, we’ve all been there.)
Or multiple of the above, especially for an app that’s been around for 15+ years.

I’ve seen all of these in various systems over time, but given what I know from Stack (I used to work with ’em), the first option is the likely cause for most (but not all) of the problematic rows we’re seeing here.

Pull up the site to see how it’s handling the data problem.

Normally, when you’re looking at a question or answer on StackOverflow.com, you see the person who wrote it. Let’s take a look at what the web site shows for this post.

One of the things I love about StackOverflow.com is their URL handling. If you know a row’s id, you can pull it up on the site. We’re in the posts table, which is where questions & answers are stored, and let’s say we wanna see id 34, the one about “How to unload a ByteArray”. To see posts.id = 34, put this into your browser’s address bar:

Click that, and Stack magically redirects you to:

https://stackoverflow.com/questions/34/how-to-unload-a-bytearray-using-actionscript-3

Isn’t that cool? It works for both questions & answers, too. For example, to see posts.id 18:

https://stackoverflow.com/questions/18/

Which redirects you to:

https://stackoverflow.com/questions/17/binary-data-in-mysql/18#18

Because posts.id 18 is actually an answer for question 17. More about that join structure some other day – for now, let’s go back to the problem at hand, and we’ll look at posts.id 34 about unloading a byte array.

At the bottom right, Stack shows who asked the question. It says “asked Aug 1, 2008 at 12:30, maclema.”

Wait – how does Stack know the user’s name?

How are they getting the name ‘maclema’ if there’s no working join over to the users table? In the SQL Server versions of the Stack Overflow database I’ve provided in the past, there’s no OwnerDisplayName column, but there’s actually one in the current XML data dumps. To see it, let’s look at the Postgres version of the StackOverflow data dump, which Richie built with a newer tool that exports every column. (More on that in another post.)

Here’s the Posts table in that database:

And look over at the far right column:

Ah-ha! It looks like Stack de-normalized the data a little. Perhaps, when a user deletes their account, Stack goes through all of the posts with their owneruserid, and sets the ownerdisplayname value for historical purposes. Then, they can set the posts.owneruserid to null, and delete the users row.

Just for curiosity’s sake, let’s look at a few posts rows where the owneruserid is NOT null:

It kinda matches our hypothesis: they might not be populating the ownerdisplayname column until a user deletes their row. Some of the rows do have an ownerdisplayname populated though. This lines up with my real-world experience with decade-old production databases: the data is all over the place, in varying levels of explanations and quality. Perhaps Stack’s handling of deleted user accounts changed over time, too.

Question #3: any changes we want to make?

The last part of your query exercise was to think about any changes you might want to make to the app, processes, or database.

I’ll be honest with you, dear reader: I laid that as a trap for professional database administrators. When a DBA sees a database without foreign key referential integrity implemented, their first knee-jerk reaction is usually, “WE HAVE TO PUT IN FOREIGN KEYS RITE NAO TO FIX THIS PROBLEM!!1!!ONE!”

I agree with that in the beginning of an application’s lifecycle, but now that we’ve got a 15-year-old database with a ton of data that violates foreign key constraints, what are we supposed to do?

Could/should we add a hard-coded “Unknown User” row, and link the data to that? You’ll often see this with a “magic number” users row with an id of, say, 0 or -1. If we take that approach, we also have to modify our users-deletion process to assign posts.owneruserid to that magic value, or we have to implement triggers on tables to do that work. There are performance and concurrency issues associated with putting that work on the database tier, though.

In Martin’s excellent answer, another problem with the magic-number approach came up. “Magic numbers are bad because they’re pretty much always undocumented, they make the code hard to read and because new team members aren’t aware of them.” (I feel Martin’s pain, but in my experience, everything in the application is undocumented, so… there’s that.)

Could/should we set columns to null? If a User or Post is deleted, should we set all the referring columns to be null? And what should do that work, the app or a trigger? Foreign keys could be implemented in that case, but… our reports might not produce the data we expect if we’re doing inner joins. (Inner joins would work with the magic number approach above.)

Could we delete the data that doesn’t have valid relationships? If we did that, we would lose questions & answers on the site, stuff that’s valuable and helpful to looking for solutions. The business might be okay with that, though – that’s a business question, not a data professional’s question.

Even if we do either of those, we still can’t just implement foreign keys safely. We would need to do more investigation to find out if there are other causes for the missing users rows.

How’d you do? What’d you think?

Announcing the 2024 Data Professional Salary Survey Results.

Last Updated January 4, 2024

Salary

This is the 8th year now that we’ve been running our annual Data Professional Salary Survey, and I was really curious to see what the results would hold this year. How would inflation and layoffs impact the database world? Download the raw data here and slice & dice it to see what’s important to you. Here’s what I found.

First, without filtering the data at all, salaries are up, but response counts continue to drop:

If we filter for just United States folks whose primary database is SQL Server or Azure SQL DB, the salary numbers are higher, and are still continuing to rise:

Because this blog’s primary readership is SQL Server folks, I wouldn’t use the survey to draw conclusions about any other platform. The number of responses for other platforms is really low:

So with that in mind, for the rest of this post, I’m going to focus on only SQL Server & Azure SQL DB folks. What are your career plans for 2024?

Most respondents intend to stay in the same employer, in the same role. Folks who are planning to make a change also happen to be getting paid less – and that’s probably not a coincidence, heh. If you’re thinking about changing roles, you’re probably interested in who’s bringing home the cheddar:

Normally I wouldn’t draw conclusions from just 4 respondents, but I think it’s safe to say that data scientists are in such high demand that they command higher pay. (However, it’s also harder to get a data scientist job than most of the rest of the jobs in this list.)

Another way to make more money is to go independent:

This marks the first year of the survey where female pay is actually higher than male! The response rate is pretty skewed, but it always has been:

Download the raw data here, and hope this data is useful to you when you have salary and career planning discussions with your manager. Here’s to you getting another raise in 2024!

Query Exercise: Find Foreign Key Problems

Last Updated January 10, 2024