Find Posts with the Wrong CommentCount: Answers & Discussion

Your Query Exercise was to find denormalization accuracy problems: checking the accuracy of a reporting column, Posts.CommentCount. There were two parts: finding the 100 most problematic Posts – the ones with the biggest variances – and thinking about a long-term solution to keep CommentCount as accurate as practical.

Question 1: Finding the Problematic Posts

Your first attempt might have looked something like this:
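Perhaps a sketch along these lines – this isn’t anyone’s exact query, just a reconstruction assuming the Stack Overflow database’s dbo.Posts and dbo.Comments tables:

```sql
-- A typical first attempt: join Posts to Comments, count the comments,
-- and sort by the variance between the stored and actual counts.
SELECT TOP 100
       p.Id,
       p.CommentCount,
       COUNT(c.Id) AS ActualCommentCount,
       p.CommentCount - COUNT(c.Id) AS Variance
FROM dbo.Posts p
INNER JOIN dbo.Comments c ON c.PostId = p.Id
GROUP BY p.Id, p.CommentCount
ORDER BY p.CommentCount - COUNT(c.Id) DESC;
```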

But there are two problems with that. First, because it’s an inner join, it only finds Posts with Comments. What if the CommentCount is off because there are no matching rows in the Comments table at all? We’re gonna need a left outer join instead.

Second, that approach focuses on variances where p.CommentCount is HIGHER than the actual number of comments – but not the other way around. In this case, those variances are fairly small, only off by a couple of comments.

Your next approach might have been two different queries, unioned together, with one sorting for higher p.CommentCounts, and the other sorting for higher actual comments. However, given the size of these tables, that’d be a performance nightmare, making multiple passes through the same tables.

The slick way to do it is to sort by the absolute value (ABS) of the variance. That way, if I have a list with both large positive AND negative values, those large values will hang out together in the result list.

Our new query looks a lot more complicated due to needing to handle situations with no rows in the Comments table (thus the COALESCE) and the ABS to sort large values together:
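Something along these lines – again a sketch, so your column list and variance math may have differed:

```sql
-- Left outer join so posts with zero comments still show up,
-- COALESCE to turn the missing counts into zeroes,
-- and ABS so big variances in either direction sort together.
SELECT TOP 100
       p.Id,
       p.CommentCount,
       COALESCE(ca.ActualCommentCount, 0) AS ActualCommentCount,
       p.CommentCount - COALESCE(ca.ActualCommentCount, 0) AS Variance
FROM dbo.Posts p
LEFT OUTER JOIN
     (SELECT PostId, COUNT(*) AS ActualCommentCount
        FROM dbo.Comments
       GROUP BY PostId) ca ON ca.PostId = p.Id
WHERE p.CommentCount <> COALESCE(ca.ActualCommentCount, 0)
ORDER BY ABS(p.CommentCount - COALESCE(ca.ActualCommentCount, 0)) DESC;
```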

Our result set looks quite different now!

Now we’ve found issues where there are no comments at all (which the inner join missed), and variances in both directions. (I especially love the symmetry of rows 10 & 11 in the result set to illustrate that problem.)

Alright, part 1 done. We’ve identified that we do indeed have a variety of data problems here. In the comments on the challenge blog post, there was a lot of discussion about creating indexes. I’d say that for Question 1, seeing the scope of the problem, you shouldn’t put in indexes just to solve that problem. It’s a one-time thing, checking to see if we have something to fix or not.

Now, what should we do long term?

Question 2: Solving It Long Term

Your homework assignment specifically skipped writing a one-time update statement. I asked you to focus on how we should keep that Posts.CommentCount column as up to date as practical, reminding you that it wasn’t a transactional system like bank balances.

When we wanna keep data up to date between two tables, we have three basic options:

  • Synchronously, done by the database server itself – which would make insert transactions longer when people add new comments, plus add workload to the database server to coordinate that transaction. This would typically be accomplished with a trigger on the Comments table so that as soon as rows were deleted, updated, or inserted, the Posts.CommentCount could be updated at the same time, and always stay in sync. ChatGPT 4 generated a trigger to do this for me. (At first glance, I don’t think it handles multi-row deletes or updates correctly, but I didn’t test that – just found it amusing. Whenever I have a coding task to do these days, I tend to ask ChatGPT to give it a shot, just to see how it does.) There’s a sketch of a multi-row-safe version after this list.
  • Asynchronously, done by the database server – add a trigger to the Comments table, but rather than updating Posts.CommentCount right away, simply add a row to a queue table indicating that the post’s Id needs its comment count updated. Later, circle back with an async job to run the comment count updates only for the posts listed in the queue table. (There’s a sketch of this a little further down, too.)
  • Asynchronously, done by the app – instead of a trigger, have the front end app (or service layer) add a record to something whenever it inserts a comment. That something doesn’t even have to be a relational database – it could be a queue service or a key/value store. Then, have a service check that queue later and run update queries only for the specific posts listed in the queue.
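To make the first, synchronous option concrete, here’s a hedged sketch of a trigger that handles multi-row changes – my example, not ChatGPT’s version, and not production code:

```sql
-- Recalculate the count for every post touched by the statement, using
-- the inserted/deleted pseudo-tables so multi-row inserts, updates, and
-- deletes are all handled.
CREATE OR ALTER TRIGGER dbo.Comments_MaintainCommentCount
ON dbo.Comments
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE p
    SET CommentCount = (SELECT COUNT(*)
                          FROM dbo.Comments c
                         WHERE c.PostId = p.Id)
    FROM dbo.Posts p
    WHERE p.Id IN (SELECT PostId FROM inserted
                   UNION
                   SELECT PostId FROM deleted);
END;
```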

The advantage of the async approach is that we only have to update a posts row once every X minutes, no matter how many comments have been added to it. If comments come in fast & furious in the first, say, hour of a post’s existence, then we can reduce that workload with a queue.
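And here’s a hedged sketch of the trigger-plus-queue-table option from the second bullet above – the queue table’s name and shape are my assumptions:

```sql
-- The queue: just enough to remember which posts need a recount.
CREATE TABLE dbo.CommentCountQueue (
    PostId   INT       NOT NULL,
    QueuedAt DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
);
GO

-- The trigger stays tiny: record which posts changed, nothing else.
CREATE OR ALTER TRIGGER dbo.Comments_QueueCommentCount
ON dbo.Comments
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.CommentCountQueue (PostId)
    SELECT PostId FROM inserted
    UNION
    SELECT PostId FROM deleted;
END;
GO

-- The async job, run every X minutes: fix only the queued posts.
-- (In real life, you'd delete only the queue rows you actually
-- processed, so rows queued mid-update don't get lost.)
UPDATE p
SET CommentCount = (SELECT COUNT(*)
                      FROM dbo.Comments c
                     WHERE c.PostId = p.Id)
FROM dbo.Posts p
WHERE p.Id IN (SELECT DISTINCT PostId FROM dbo.CommentCountQueue);

DELETE FROM dbo.CommentCountQueue;
```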

In fact, if you check out the variance results above, you’ll notice that the largest variances all cluster in a similar, high range of Ids – some of the more recently inserted posts, basically. That would seem to suggest an asynchronous process.

Your long term solution might have been just that! You might have recommended a periodic job, say hourly, to update the posts.commentcount column – there’s a sketch of one after this list. Depending on a few things, that might be fine:

  • If we’re willing to let the data be out of sync on newly created posts
  • If we don’t have too many posts columns to update (you saw how slow these queries were – imagine how bad it’d be if we had a dozen similar reporting columns to keep in sync)
  • If the regular process doesn’t cause blocking problems when it runs for a long time
  • If we have enough database server horsepower to get ‘er done without too much impact to end users
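Here’s roughly what such a job might look like – the one-day window is an assumption, on the theory that older posts have mostly settled down:

```sql
-- Hedged sketch of an hourly touch-up job: recount only recent posts,
-- and only touch the rows that are actually wrong.
UPDATE p
SET CommentCount = COALESCE(ca.ActualCommentCount, 0)
FROM dbo.Posts p
LEFT OUTER JOIN
     (SELECT PostId, COUNT(*) AS ActualCommentCount
        FROM dbo.Comments
       GROUP BY PostId) ca ON ca.PostId = p.Id
WHERE p.CreationDate >= DATEADD(DAY, -1, SYSUTCDATETIME())
  AND p.CommentCount <> COALESCE(ca.ActualCommentCount, 0);
```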

On the other hand, you could also choose the synchronous approach: every time we insert a new comment row, update the posts.commentcount row for the post we’re commenting on. This could either be done with a trigger on comments, or with an added update statement on whatever app code is inserting the comments. A big advantage of the sync approach is that the reporting data is always up to date. Business users like that.

For the sync approach, there are two ways to update the Posts.CommentCount number:

  • Either count the exact number of Comments in the Comments table at the time of the data change. This is easier to code because it’s just a quick count against the Comments table, but it does involve actually querying the Comments table, which introduces more work and more locking.
  • Or simply increment the Posts.CommentCount number +1 or -1, depending on the data change. That one’s faster because there’s less locking involved on the Comments table, but it can be trickier to get the logic right – see the sketch after this list.
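To show what “trickier” means, here’s a hedged sketch of the increment flavor inside an AFTER trigger – it has to net out the change per post rather than blindly adding 1:

```sql
-- Compute the net comment-count change per post from the
-- inserted/deleted pseudo-tables (only valid inside a trigger), then
-- apply it without ever re-counting dbo.Comments.
UPDATE p
SET CommentCount = p.CommentCount + d.Delta
FROM dbo.Posts p
JOIN (SELECT PostId, SUM(Delta) AS Delta
        FROM (SELECT PostId,  1 AS Delta FROM inserted
              UNION ALL
              SELECT PostId, -1 AS Delta FROM deleted) changes
       GROUP BY PostId) d ON d.PostId = p.Id
WHERE d.Delta <> 0;
```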

In a perfect world, I’m a fan of keeping the database simple for scalability purposes: start with the synchronous approach, done in app code. When you hit a performance wall where that stops scaling, and if this particular process is the bottleneck holding you back from going faster, only then do something more complicated like an async batch job, after making sure the business users are okay with the data being out of sync.
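If you go the app-code route, the comment-insert path might look roughly like this – the parameter names and the Comments columns here are assumptions:

```sql
-- Hedged sketch: the app wraps the insert and the increment in one
-- transaction so the count can't drift if either statement fails.
BEGIN TRANSACTION;

    INSERT INTO dbo.Comments (PostId, Text, CreationDate, UserId)
    VALUES (@PostId, @Text, SYSUTCDATETIME(), @UserId);

    UPDATE dbo.Posts
    SET CommentCount = CommentCount + 1
    WHERE Id = @PostId;

COMMIT;
```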

If you choose a synchronous solution, you’ll probably need indexes to support the queries to make them fast enough. If you choose an async solution, especially if you offload it to a readable replica, then it’s easier to get away without adding more indexes when the database is already over-indexed.
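For example, if the Comments table doesn’t already have an index leading on PostId, the recount queries above would probably want something like this – an assumption, not a blanket recommendation:

```sql
-- Supports counting comments per post without scanning the whole table.
CREATE INDEX IX_Comments_PostId ON dbo.Comments (PostId);
```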

Something Fun to Think About

For those of you following along with the Postgres version of this homework, you might have noticed that in the original homework assignment posts, when I queried the top 100 posts just to show demo data, I got very different results.
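Both demo queries were just unordered “give me 100 rows” selects – presumably something like this on each platform:

```sql
-- SQL Server:
SELECT TOP 100 * FROM dbo.Posts;

-- Postgres:
SELECT * FROM posts LIMIT 100;
```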

Here’s SQL Server’s top 100:

Note that in this case, it’s sorted by Id. Typically, in SQL Server, you’ll see the first 100 rows sorted by their clustering key. That’s certainly not always the case, and in beginning performance tuning sessions, I have to emphasize to SQL Server people, “Seriously, order isn’t guaranteed unless you ask for it.”

Here’s Postgres’s top 100:

Postgres really doesn’t give you an order by unless you ask for it, and if you haven’t used Postgres before, the reason will rock your world.

How’d you do? What’d you think?

I hope you had fun, enjoyed this exercise, and learned a little something. If you’ve got comments or questions, hit the comment section below. (Keep the comments focused on this exercise, though – we’re not covering the details of the Postgres export just yet. Only so many things I can cover per blog post, hahaha, and this thing’s already way long.)


4 Comments

  • This was a fun exercise. My query was close but missed the null counts in the comments table. I appreciate the homework – I am always looking to advance my skillset! Thanks!!

    • Cool, glad you liked it! I totally missed the null counts in the comments table in my first answer too! This is part of the reason that I love this format – I learn stuff from the readers’ comments.

  • Why not just use an Indexed View? This allows a GROUP BY with COUNT_BIG and is maintained synchronously, more efficiently than a trigger.
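(For reference, a minimal sketch of the indexed-view idea being described here – the object names are assumed:)

```sql
-- Schemabound view with COUNT_BIG, plus a unique clustered index,
-- which is what SQL Server requires to materialize the view.
CREATE VIEW dbo.PostCommentCounts
WITH SCHEMABINDING
AS
SELECT PostId, COUNT_BIG(*) AS CommentCount
FROM dbo.Comments
GROUP BY PostId;
GO

CREATE UNIQUE CLUSTERED INDEX CIX_PostCommentCounts
ON dbo.PostCommentCounts (PostId);
```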

    • Because an indexed view has to read (and lock) ALL of the comments for the post involved. It basically runs a group-by query every time there’s an insert/update/delete, which can lead to blocking problems at scale.

      I’m not saying indexed views are bad by any means – but they can have much higher overhead than it appears at first glance. (We cover that in my Mastering Index Tuning course.)
