Blog

Free Video Training: Estimation and Statistics by Dave Ballantyne

Statistics, Videos
3 Comments

It’s hard to find good 400-500 level SQL Server training at any price, let alone free. Dave Ballantyne’s past sessions from SQLBits are great examples of those.

Today, I’m going to review Dave’s most recent session on Estimation and Statistics. It’s from 2015, but it’s still very relevant today.

If you want to jump past the intro and bio, the technical stuff starts at about 2 minutes 15 seconds in.

Dave starts by explaining how building a query plan is a lot like building a driving route from point A to point B. I love that analogy because:

  • There are a lot of options, and exploring all the possible routes is computationally intensive
  • We have to make some assumptions about how the roads and traffic are going to look
  • Along the way, things may go wrong, but at that point, it’s too late for us to replan our route from scratch

A few time marks if you want to jump ahead in the video recording:

  • You can skip the intros by jumping to about 2m:15sec into the video.
  • If you’ve watched How to Think Like the Engine, and you want to jump into how the statistics histogram is decoded, jump to 12m:30sec into the video.
  • If you already know how to interpret a single histogram, and you want to know how SQL Server uses multiple histograms on different columns, jump to 21m in.
  • When you hit the brief part on parameter sniffing and want to skip past it, jump to 43m:19sec. The parameter sniffing part is probably not new to regular readers here. At 43m:19sec, the next topic starts, the Plan Skyline problem.
  • To learn about optimization’s ascending key problem, jump to 55m in

If you like training like this, consider attending the March conference either virtually, or in-person outside of London. Registration is open now, and you can use discount code Brent5 to save 5%. Here’s the spot to look for during registration:

As of this writing, there are literally 10 Early Bird seats left, so hustle!


What PowerBI Questions Do You Have for Eugene Meidinger?

Data Science

Eugene Meidinger, aka SQLGene.com, makes Power BI concepts easy and accessible. He’s got a free introduction to Power BI performance tuning here. People come to him with questions like:

Eugene Meidinger

  • What kind of tools are there for performance tuning Power BI?
  • Where should I do my logic: SQL, Power Query, or DAX?
  • What are some best practices for designing a data model?
  • What is the best trick for speeding up Power Query?
  • How do you get started improving Power Performance?

He’s joining me for Office Hours on Wednesday to take your Power BI performance tuning questions.

Post your questions here, and upvote the ones you’d like to see him answer.

On Wednesday at 2PM Eastern, 11AM Pacific, join us on my Twitch channel to watch live. Here’s a calendar invite so you can block it out on your calendar. If you can’t join live, no worries, the recording will go up on my YouTube channel later.

I’ve turned off comments on this post to make sure folks don’t accidentally post their questions here – post ’em over on PollGab. See you Weds!


Query Exercise: Find Tagged Questions Faster.

Query Exercises
42 Comments

For this week’s Query Exercise, we’re going to start with an existing query that has performance problems from time to time, depending on how it’s used.

This query’s goal is to find the top-voted Stack Overflow questions for any given tag – for example, here are the top-voted questions tagged “sql-server”.

What are tags? I’m glad you asked. When you’re looking at a question at Stack Overflow, you’ll see a list of tags for that question:

The tags are at the bottom of that screenshot: sql, sql-server, t-sql, sql-update, dml. A question can have up to 5 tags. To see how they’re stored in the Stack Overflow database, let’s look at the posts table, which is where questions & answers are stored:

There’s a “Tags” column:

So to find the questions tagged sql-server, our analyst wrote this query:

And our DBA helped by creating this index:

Right now, the actual execution plan is fairly simple. SQL Server scans that index backwards, going from the top-scoring posts down, checking each one’s tags to see if it contains <sql-server>.

Your exercises for this week are:

  1. What kinds of tags will perform worse than others for this query?
  2. Could you change the query to perform better?
  3. Could you change the indexes to perform better, without changing the table structure?

You can post your answers in this blog post’s comments, and discuss each others’ ideas. We’ll revisit your answers in this post. Have fun!


Finding the Best Time for Maintenance: Answers & Discussion

Query Exercise Answers
3 Comments

Your Query Exercise was to find the best time to do database maintenance by querying the Users table, looking for a one-hour window with the lowest number of created users. We kept this one pretty simple by looking at the data in just one table, and we didn’t hassle with time zones. We just wanted the 3 lowest-load hours, in this format:

  • Mondays, 1:00-1:59
  • Saturdays, 23:00-23:59
  • Sundays, 2:00-2:59

The comments on this one were wild! Lots of different approaches, but they all boiled down to two basic styles.

The easy (but not necessarily right) way:
group by date_part

One way to do this is to use the built-in DATEPART function to group users by weekday and hour of their creation dates, like this:

And at first glance, that looks like it works:

For the database I’m using (the 2018-06 Stack Overflow database) and the date ranges I picked, it looks like late Sunday night, early Monday morning would be the winners here, so my answer might be:

  • Monday, 02:00-02:59
  • Monday, 00:00-00:59
  • Sunday, 00:00-00:59

If that was your solution, you’d be in fine company: ChatGPT 4 suggested a similar solution. It even added a delightful caveat:

Note that DATEPART(dw, CreationDate) returns an integer representing the day of the week, where the specific value depends on the SET DATEFIRST setting of your SQL Server instance. Typically, 1 represents Sunday, but this can vary. Similarly, DATEPART(hour, CreationDate) returns the hour of the day based on a 24-hour clock.

That’s one of the things I love about using ChatGPT for coding tasks like that – it can add valuable context that saves me time from checking the documentation.

That grouping approach has a bug, though.

That technique only works as long as our historical data is populated for every hour of every day. What if we’ve been taking the web servers down every Sunday at 16:00-17:00? There wouldn’t be any users created in that date range, and thus, no rows would show up in the result set for that query!

To illustrate it, let’s query the table looking for days when no users were created during particular hours:

The results in the 2018-06 database show that we’ve had several days where we only created users in some (but not all) hours of the day:

And if that was a repeating trend in the data, we’d have a problem.

The Safer Technique:
generate a list of days/times first.

In pseudocode:

That way, if we have any weekday/hour combinations with no users created, we’ll still have rows for them in our result set. Starting with compatibility level 160 (SQL Server 2022), one option for this is the generate_series function. In older versions, you’d probably use a numbers table, aka tally table.

We need two sets of rows:

  • weekdays: a set of rows 1 through 7
  • hours: a set of rows 0 through 23

So we’ll call generate_series twice:

To get a list of days and hours:

Then, let’s take our first version of this query, use that as a CTE, and left outer join to it:

That way, the GENERATE_SERIES calls will make sure rows exist in our result set – even if no users were created in that date range.

Hope you enjoyed it! For more challenges like this, check out the Query Exercises page.


[Video] Office Hours: The Question is Not the Problem Edition

Videos
0

I answer your top-voted Azure SQL DB and SQL Server questions from https://pollgab.com/room/brento and lose my temper about what you’re doing.

Office Hours: The Question Isn't The Problem Edition

Here’s what we covered:

  • 00:00 Start
  • 00:40 Mattia: Has bulk logged recovery model ever got you across the finish line? If so, could you share a story?
  • 01:41 ForEachDBA: My manager wants to reduce cpu from db servers as cpu usage under 50% most of the time, also sp_BlitzFirst doesn’t show cpu top waits. The question:how to measure effectively the impact of reducing cpu? What are the key metrics to watch? Plan cache durations stats is good approach?
  • 02:56 Ataollah S: Will we see any new features added to SQL 2022 in 2024?
  • 03:31 Henrique Almeida: As a recommendation, MS suggests that we use the maximum for MAXDOP being 8.We changed it to 2 and I could see that there was a considerable decrease. give your opinion on the relationship between MAXDOP x CXPACKET, whether they are proportional or inversely proportional.
  • 04:42 Bonnie: Is it ok / safe to simultaneously run sp_blitzcache and sp_blitzindex on a busy boxed SQL Server? sp_blitzindex sometimes takes much longer to complete than sp_blitzcache.
  • 05:45 Owl Kitty: What is the danger zone for Batch Requests per Second that we should be on the lookout for?
  • 06:16 TimelyDBA: My friend is configuring alerts based on severity in a shop requiring the server instance hardened using the center of internet security standards which recommends never using SQL server for mail services. Have you alternative pathways to harmonize security and delivery for such?
  • 07:02 Nazanin B: What are the common issues you see when applications connect to SQL Server over a WAN connection vs a LAN connection? Does compression help here?
  • 07:38 MyTeaGotCold: How can I quality assure SQL code? Other languages have reasonable error handling, good linters, and unit testing that doesn’t need to be hacked in.
  • 08:49 Bruno: What is your opinion of the new Azure Boost feature to improve VM network and storage performance? Game changer for SQL VM?
  • 09:04 Bahareh A.: Is column store index maintenance for large tables hard on RPO when log shipping is involved?
  • 09:47 Rudy: SQL assessment API is complaining about high page splits in our canned SQL instance? How do we identity which tables are experiencing this? Is this a fools errand?
  • 10:37 TimelyDBA: Who is the brent ozar for SQL server always on AG and other HA/DR strategies?
  • 12:55 Jen: What are the best / worse use cases you have seen for sp_invoke_external_rest_endpoint to call external API’s inside of SQL Server?
  • 13:56 Bonnie: sp_BlitzFirst is reporting a priority 40 finding for Forwarded Fetches/Sec High: TempDB Object for boxed SQL. When is this bad / concerning?
  • 14:14 UnInstallAllScripts: How install scripts like InstallAllScripts.sql gets update?
  • 15:18 Manimal: Does it ever make since to run boxed SQL Enterprise on a 24 core or less cloud VM? Do you see this much in the wild?
  • 15:48 Persephone: Which vendors do you foresee doing cool AI integrations with SQL Server? Which integrations are you looking forward to?
  • 17:14 Rom: What are the top customer issues you see when resizing an Azure SQL VM up or down?
  • 17:49 Dom: Hi Brent, Have you ever taugh of offering your training/blog in french (for Canada/Europe people who doesn’t understand english that much)
  • 18:13 Pytzamarama: Happy 2024! Brent I will rephrase a badly asked past question of mine: When we update our customers we drop create all procedures and triggers. Does this affect performance and how. Thank you!
  • 19:50 Zappy: Hi Brent, as far as know, a statistic update shouldn’t cause a plan recompilation if a table wasn’t modified. But recently I think that I have seen a plan recompilation on an unmodified table simply caused by a stat update with a different SamplingPercent. Is this possible ?

Query Exercise: Find the Best Time for Maintenance

Query Exercises
46 Comments

If we’ve gotta take the database down for maintenance – perhaps a version upgrade, perhaps upgrading our own code, maybe scaling up the hardware – when’s the best time to do it?

For this week’s query exercise, the business has asked us to identify 3 1-hour periods with the least user activity, in this format:

  • Mondays, 1:00-1:59
  • Saturdays, 23:00-23:59
  • Sundays, 2:00-2:59

Pop open the Stack Overflow database, and for the sake of simplicity for this exercise:

  • We’re just going to look at the users table. We’ll judge load by times when the fewest number of users are created, judging by the users.creationdate column.
  • We’re not dealing with time zones.
  • We’re only looking for one-hour time ranges that start with 00 and end with 59, like 1:00-1:59. We’re using a little shorthand there and not typing out the seconds, but the point is that we don’t want you saying “the best time is 2:15:05 to 3:14:04.” That’s delightfully ambitious – but let’s keep it simple here. Not everyone has as much free time as you, bud.

Pick a 28-day window so that you’re looking at several weeks of trending activity, and come up with your one-hour preferences. (I expect folks to come up with widely different answers depending on which 28-day window they pick, and which copy of the database they’re using. That’s okay! The goal is to write the query, and then check your logic against the gotchas that I discuss next week.)

The performance of your query isn’t a concern here – this isn’t a query we’ll be running on a regular basis. We just need to know the quietest times in the database.

You can post your answers in this blog post’s comments, and discuss each others’ ideas. We’ll revisit your answers in this post. Have fun!


Find Posts with the Wrong CommentCount: Answers & Discussion

Query Exercise Answers
4 Comments

Your Query Exercise was to find denormalization accuracy problems: checking the accuracy of a reporting column, Posts.CommentCount. There were two parts: finding the top 100 most problematic Posts with the biggest variances, and thinking about a long term solution to keep the CommentCount accuracy as high as practical.

Question 1: Finding the Problematic Posts

Your first attempt might have looked something like this:

But there are 2 problems with that. First, because it’s an inner join, it only finds Posts with Comments. What if the CommentCount is off because there are no matching rows in the Comments table at all? We’re gonna need a left outer join instead.

Second, that approach focuses on variances where p.CommentCount is HIGHER than the actual number of comments – but not the other way around. In this case, those variances are fairly small, only off by a couple comments:

Your next approach might have been two different queries, unioned together, with one sorting for higher p.CommentCounts, and the other sorting for higher actual comments. However, given the size of these tables, that’d be a performance nightmare, making multiple passes through the same tables.

The slick way to do it is to sort by the absolute value (ABS) of the variance. That way, if I have a list with both large positive AND negative values, those large values will hang out together in the result list.

Our new query looks a lot more complicated due to needing to handle situations with no rows in the Comments table (thus the COALESCE) and the ABS to sort large values together:

Our result set looks quite different now!

Now we’ve found issues where there are no comments at all (which the inner join missed), and variances in both directions. (I especially love the symmetry of rows 10 & 11 in this screenshot to illustrate that problem.)

Alright, part 1 done. We’ve identified that we do indeed have a variety of data problems here. In the comments on the challenge blog post, there was a lot of discussion about creating indexes. I’d say that for Question 1, seeing the scope of the problem, you shouldn’t put in indexes just to solve that problem. It’s a one-time thing, checking to see if we have something to fix or not.

Now, what should we do long term?

Question 2: Solving It Long Term

Your homework assignment specifically skipped writing a one-time update statement. I asked you to focus on how we should keep that Posts.CommentCount column as up to date as practical, reminding you that it wasn’t a transactional system like bank balances.

When we wanna keep data up to date between two tables, we have 3 basic options:

  • Synchronously, done by the database server itself – which would make insert transactions longer when people add new comments, plus add workload to the database server to coordinate that transaction. This would typically be accomplished with a trigger on the Comments table so that as soon as rows were deleted, updated, or inserted, the Posts.CommentCount could be updated at the same time, and always in sync. ChatGPT 4 generated me a trigger to do this. (At first glance, I don’t think it handles multi-row deletes or updates correctly, but I didn’t test that – just found it amusing. Whenever I have a coding task to do these days, I tend to ask ChatGPT to give it a shot, just to see how it does.)
  • Asynchronously, done by the database server – add a trigger to the comments table, but rather than updating Posts.CommentCount right away, simply add a row to a queue table indicating that the post.id needs its comment count updated. Later, circle back with an async job to run the commentcount updates only for the posts listed in the queue table.
  • Asynchronously, done by the app – instead of a trigger, have the front end app (or service layer) add a record to something whenever it inserts a comment. That something doesn’t even have to be a relational database – it could be a queue service or a key/value store. Then, have a service check that queue later and run update queries only for the specific posts listed in the queue.

The advantage of the async approach is that we only have to update a posts row once every X minutes, no matter how many comments have been added to it. If comments come in fast & furious in the first, say, hour of a post’s existence, then we can reduce that workload with a queue.

In fact, if you check out the above screenshot with variances, you’ll notice that the largest variances all seem to have a similar range of ids, and they’re all in the high range – some of the more recently inserted posts, basically. That would seem to suggest an asynchronous process.

Your long term solution might have been just that! You might have recommended a periodic job, say hourly, to update the posts.commentcount column. Depending on a few things, that might be fine:

  • If we’re willing to let the data be out of sync on newly created posts
  • If we don’t have too many posts columns to update (you saw how slow these queries were – imagine how bad it’d be if we had a dozen similar reporting columns to keep in sync)
  • If the regular process doesn’t cause blocking problems when it runs for a long time
  • If we have enough database server horsepower to get ‘er done without too much impact to end users

On the other hand, you could also choose the synchronous approach: every time we insert a new comment row, update the posts.commentcount row for the post we’re commenting on. This could either be done with a trigger on comments, or with an added update statement on whatever app code is inserting the comments. A big advantage of the sync approach is that the reporting data is always up to date. Business users like that.

For the sync approach, there are two ways to update the Posts.CommentCount number:

  • Either count the exact number of Comments in the Comments table at the time of the data change. This is easier to code because it’s just a quick sum on the Comments table, but it does involve actually querying the Comments table, which introduces more work and more locking.
  • Or simply increment the Posts.CommentCount number +1 or -1, depending on the data change. That one’s faster because there’s less locking involved on the Comments table, but it can be trickier to get the logic right.

In a perfect world, I’m a fan of keeping the database simple for scalability purposes: start with the synchronous approach, done in app code. When you hit a performance wall where that stops scaling, and if this particular process is the bottleneck holding you back from going faster, only then do something more complicated like an async batch job, after making sure the business users are okay with the data being out of sync.

If you choose a synchronous solution, you’ll probably need indexes to support the queries to make them fast enough. If you choose an async solution, especially if you offload it to a readable replica, then it’s easier to get away without adding more indexes when the database is already over-indexed.

Something Fun to Think About

For those of you following along with the Postgres version of this homework, you might have noticed that in the original homework assignment posts, when I queried the top 100 posts just to show demo data, I got very different results.

Here’s SQL Server’s top 100:

Note that in this case, it’s sorted by Id. Typically, in SQL Server, you’ll see the first 100 rows sorted by their clustering key. That’s certainly not always the case, and in beginning performance tuning sessions, I have to emphasize to SQL Server people, “Seriously, order isn’t guaranteed unless you ask for it.”

Here’s Postgres’s top 100:

Postgres really doesn’t give you an order by unless you ask for it, and if you haven’t used Postgres before, the reason will rock your world.

How’d you do? What’d you think?

I hope you had fun, enjoyed this exercise, and learned a little something. If you’ve got comments or questions, hit the comment section below. (Keep the comments focused on this exercise, though – we’re not covering the details of the Postgres export just yet. Only so many things I can cover per blog post, hahaha, and this thing’s already way long.)


We’re Starting to Automate Testing for the First Responder Kit.

First Responder Kit
7 Comments

Brent says: It’s really funny how the world works. I wrote what you see below in December 2023 and scheduled it for publication on January 16th (today), hoping to get help from the community to get this process started. On January 9, before the post went live, David Wiseman submitted a pull request with the first steps! More on that at the end of the post. I’m leaving the post as-written in Dec 2023 because it’s still useful.


It’s a new year. You tell me if the state of the art has changed.

The biggest challenge in maintaining the First Responder Kit is testing people’s code. Different people submit code changes all the time, and I need to quickly assess:

  • If their code will work on all current versions of SQL Server, Amazon RDS, and Azure SQL DB Managed Instances
  • If their code works on case sensitive instances
  • If their code works when some databases aren’t in an expected health state (offline, inaccessible for security reasons, restoring, etc)
  • If their changes still have the intended good effects of the original code – for example, if we’re testing for databases with auto-shrink on, are we still catching that?

It’s especially hard with the First Responder Kit because the queries use SQL Server’s own system objects as a source.

Writing queries for multiple SQL Server versions is hard.

Here’s a very simplified example from a recently failed pull request:

That pull request Worked On My Machine™, but it failed on SQL Server 2016 because the version_generated_inrow and version_generated_offrow columns didn’t exist back then.

Sure, you could check that if you read the sys.dm_db_index_operational_stats documentation, which shows which columns are available in which version. On that page, do a control-F and search for “version”, which will take you to the part of the page that shows those two columns, and the version of SQL Server that introduced them. Go ahead, I’ll wait.

What’s that you say?

Those columns aren’t listed in the documentation, and many aren’t?

Ah well you see, that’s because Microsoft’s documentation is open source now. It’s up to kind folks like you and I to make pull requests with updates to the documentation, explaining to Microsoft what columns in their DMVs have changed, and what the contents of those columns are. We can’t expect Microsoft to do that for us, you know. They only made $151,597,000,000 of profit last year. That’s not revenue, that’s profit, after they paid all their expenses. I know that seems like a big number to you, and that with so much leftover change, they’d be able to keep their documentation up to date, but as the Microsoft folks say, that’s a big ask.

Just to put that in perspective, that’s $17,305,593 per hour. $4,807 per second.

So forgive me for not making that pull request to fix the documentation. I’ll leave that to you, dear reader. While you work on that, I’ll get back to the problem at hand.

Testing that multiple versions is even harder.

To test this simplified T-SQL:

We would need to somehow:

  • Fake the contents of a read-only system DMV (because you can’t delete/update/insert data in that DMV.)
  • Fake the contents of different versions of SQL Server, which have different columns, and aren’t even documented.
  • Keep those tests up to date, which is monstrously challenging given the cloud versions like AWS RDS and Azure SQL DB MI, which can and do change over time, including system stored procedures and their RDS equivalents.

Realistically, that’s not going to happen. So instead, it would make more sense to do automated testing. Each time a pull request comes in:

  • Spin up instances of SQL Server 2014, 2016, 2017, 2019, and 2022.
  • Run the First Responder Kit’s creation scripts (the new pull request’s version), ensuring that the scripts are at least created without error.
  • Run a series of predefined tests of sp_Blitz & friends, ensuring that the scripts run without error and produce the expected result sets.
  • And then perhaps the hardest one, repeat those tests on AWS RDS and Azure SQL DB MI. Those are tougher because the spin-up times are much, much longer, and we get charged while they’re up.

That’s hard – but let’s take the first step.

Several years ago, I approached community members & consultants who were well-known for SQL Server CI/CD work. I explained this challenge, and I asked, “How much would you charge me to build a solution for this? You can open source the whole thing, or whatever parts are useful to you.”

Back then, no one would even give me a number to solve that problem – the effort in designing the solution alone was too high. Given that years have passed, and the state of the art might be different now, you tell me in the comments: what would the first steps look like? How long would they take? If it’s doable today, I’d definitely be interested in funding it.

Because manual testing blows, which means we don’t do it as well as we should, and we rely on community testers. Anything I can do to lessen the work of the community on stuff like this, I’m down!

Update 2024-01-09: David Wiseman took the first step!

Back to that first foreshadowing paragraph of the post: I’m thrilled to see that David Wiseman actually took the first step. He used Github Actions so that when someone opens a pull request:

  • A GitHub runner deploys a clean instance of SQL Server 2017 in a Linux Docker container
  • The First Responder Kit scripts are deployed
  • Basic automated tests are run with PowerShell/Pester

To see how it’s done, check out the changed files in pull request 3419.

love this as a first step because like David wrote in his Github issue, in just 2 minutes, this verifies that:

  • The scripts deploy & run without errors on a case-sensitive 2017 Linux instance – which would actually fail more than you might think, just because a lot of folks don’t use case sensitive instances or Linux
  • The expected number of row sets are returned with the expected number of columns – this actually would have caught pull requests in the past where folks left in debug results

[Video] Office Hours: Ask Me Anything About the Microsoft Data Platform

Videos
0

Forgive the hoarse voice – I’m recovering from something COVID-ish. I went through your top-voted questions from https://pollgab.com/room/brento and did my best:

Office Hours: Ask Me Anything About Azure and SQL Server

Here’s what we covered:

  • 00:00 Start
  • 03:03 DBANoob: We have EF code that creates randomly generated temp table names every time it runs, while the rest of the logic remains the same, giving thousands of single use plans. I’ve tried warning the devs about this, but no luck. Any classes cover this? Any fixes I can try on my end?
  • 03:35 Dom: Hi Brent, you often mention San backup for VLDB. I googled but wasn’t able to find information about how we could do PIT recovery with San Snapshot backup. Is it possible ? Thanks ! P.s. Big fan of your work !
  • 04:14 Bruno: Is it safe to install the microsoft ace ole db provider on SQL Server so that SSIS packages targeting excel files will function?
  • 04:59 MyTeaGotCold: Do you still send out the 6-Month DBA Training Plan on Wednesdays? I don’t think I’ve seen one in months.
  • 05:13 KyleDevDBA: Hi Brent, You recently mentioned Cercle for when you need to focus. Does it ever have the opposite effect and you end up staring at the beautiful scenery for an hour?
  • 05:58 Pepé Le Pew: What are the top smells that a given sql agent job is really application logic that should be moved elsewhere?
  • 07:05 Scooby: Is that Beany in your mastering slide deck?
  • 07:24 Ezra: How do you perform automated tests with the first responder kit?
  • 08:22 Persephone: How would you rate ChatGPT’s ability to explain what a complicated piece of TSQL code is doing? Will this end the need for users to document their TSQL code?
  • 09:17 Darth Insidious: Inline TSQL comments using — syntax instead of /* are insidious since the extracted single line TSQL can’t be auto formatted for multi-line. Do you know of any good tools that can strip the — comments and replace with /* comments ?
  • 09:56 Montro1981: Hi Brent, Happy New Year to you and your loved ones. Are there any things to look out for or “gotchas” when a database has both RCSI and Snapshot Isolation enabled?
  • 10:47 RacerX: Do the parallel racing stripes in estimated query plan operator indicate that parallelism will also be used in actual query plan operator as well?
  • 11:25 Guðmundur: Hierarchyid appears to be a CLR data type. Do you see any common performance issues with this data type on large tables?
  • 12:27 Bruno: Thanks for all the tooling / training you provide for the SQL community. Looking forward to future training.
  • 12:40 DBA_Todd: Hey Brent! When enabling Accelerated Database Recovery, is it best to change the location of the PVS to a different filegroup? If so, why, and should I put it on separate storage from the primary filegroup? I have a 4TB db running SQL 2022 on Azure VM.
  • 14:59 Prisha: What are the best methodologies for adhoc copying a single (hundred million rows) SQL table from production to development?
  • 15:29 Giannis M: sp_Blitz is reporting system db’s located on C drive (master, model, msdb). Is it low or high risk to move these DB’s?
  • 17:05 kansas4444: Hi Brent, What’s your opinion on _UTF8 collation to replace nvarchar by varchar ? I’ve translations tables where the PK is composed of the culture code and the parent key. Copying the table with a varchar column in UTF8 reduced its size from 10 GB to 4 GB for ~4 million rows.
  • 17:59 Sigrún: We update stats nightly but even still there are some tables that will auto update stats during the next day due to high frequency of activity. Do you have guidance as to when you should you should disable auto update stats for a given table and do it manually?
  • 18:53 The Blame Bus Driver: Is there a good way to know who last modified a given SQL agent job?

Query Exercise: Find Posts with the Wrong CommentCount

Query Exercises
31 Comments

For 2024, I’m trying something new: weekly homework challenges! For this week’s query challenge, we’re going to do some data validation. Pop open the Stack Overflow database and check out the CommentCount column on the Posts table:

In theory, that’s a fast way to check how many Comments a particular Post has, without querying the Comments table to find them all. For example, if I want to see the Comments for Post.Id 13, I might write:

And presto, I get the 6 comments, matching the CommentCount = 6 value in the above screenshot:

But… I’m a DBA, and you know what DBA stands for: Doesn’t Believe Anyone. So our query exercise for this week is to:

  1. Write a query that finds the top 100 problematic Posts with the biggest CommentCount variance versus the actual number of comments on that post. This is a little tricky for two reasons: the Posts.CommentCount could be off in either direction, AND these tables are ginormous, so your queries are going to take a while. You’ll want to put in some thought before you just go hit execute.
  2. Think about a long term solution to keep that Posts.CommentCount as up to date as practical. We’re not worried about a one-time update – we’re more concerned with making sure the data is relatively accurate. It doesn’t have to be transactionally consistent here – these aren’t bank balances, after all.

Have fun, and once you’ve given it a shot, read this post with solutions.


Find Foreign Key Problems: Answers & Discussion

Your Query Exercise was to find foreign key problems: data that might stop us from successfully implementing foreign keys in the Stack Overflow database.

Question 1: Write queries to find the problematic data.

Montro1981’s thorough answer broke the challenge into two parts: first, check to see if there is ANY bad data, and if so, check to see how pervasive it is. What I really like about this query below is that it recognizes that:

  • Each table, and each column, might have different foreign key problems
  • Some of those problems might be big, and some might be small
  • Some might have only happened a long time ago, and some might still be ongoing

Heres what the results look like for the 2018-06 training version of the Stack Overflow database, and of course they’ll be different depending on which version you use:

We can see that the first two issues (Posts & Comments with invalid UserIds) is pretty rampant, whereas comments without valid PostIds is much less frequent. However, all 3 of the foreign key candidates are still having problems right up til the end of the data set (2018-06), which means whatever’s causing this “bad” data is still happening.

Question 2: What might have caused the data problems?

Let’s examine the posts rows that don’t have matching users rows. You can click on this to zoom in if you like:

We’re trying to join over to the users table using the owneruserid column, but the OwnerUserId is 0 for these rows, and apparently there’s no Users.Id 0. That 0 sounds like it might be a hard-coded magic value.

What are some reasons that a post might not have a valid owneruserid?

  • Perhaps users are allowed to delete their accounts, but Stack Overflow has chosen to keep their questions & answers around for posterity.
  • Perhaps there were bugs in transaction handling code. Perhaps there was a single stored procedure that was supposed to add a user and add their first post, but something went wrong, and only the post got inserted.
  • Perhaps there was database corruption.
  • Perhaps someone ran a delete on the users table with an incorrect where clause. (C’mon, we’ve all been there.)
  • Or multiple of the above, especially for an app that’s been around for 15+ years.

I’ve seen all of these in various systems over time, but given what I know from Stack (I used to work with ’em), the first option is the likely cause for most (but not all) of the problematic rows we’re seeing here.

Pull up the site to see how it’s handling the data problem.

Normally, when you’re looking at a question or answer on StackOverflow.com, you see the person who wrote it. Let’s take a look at what the web site shows for this post.

One of the things I love about StackOverflow.com is their URL handling. If you know a row’s id, you can pull it up on the site. We’re in the posts table, which is where questions & answers are stored, and let’s say we wanna see id 34, the one about “How to unload a ByteArray”. To see posts.id = 34, put this into your browser’s address bar:

https://stackoverflow.com/questions/34/

Click that, and Stack magically redirects you to:

https://stackoverflow.com/questions/34/how-to-unload-a-bytearray-using-actionscript-3

Isn’t that cool? It works for both questions & answers, too. For example, to see posts.id 18:

https://stackoverflow.com/questions/18/

Which redirects you to:

https://stackoverflow.com/questions/17/binary-data-in-mysql/18#18

Because posts.id 18 is actually an answer for question 17. More about that join structure some other day – for now, let’s go back to the problem at hand, and we’ll look at posts.id 34 about unloading a byte array.

At the bottom right, Stack shows who asked the question. It says “asked Aug 1, 2008 at 12:30, maclema.”

Wait – how does Stack know the user’s name?

How are they getting the name ‘maclema’ if there’s no working join over to the users table? In the SQL Server versions of the Stack Overflow database I’ve provided in the past, there’s no OwnerDisplayName column, but there’s actually one in the current XML data dumps. To see it, let’s look at another version of the database that includes the OwnerDisplayName in the Posts table:

And look over at the far right column:

Ah-ha! It looks like Stack de-normalized the data a little. Perhaps, when a user deletes their account, Stack goes through all of the posts with their owneruserid, and sets the ownerdisplayname value for historical purposes. Then, they can set the posts.owneruserid to null, and delete the users row.

Just for curiosity’s sake, let’s look at a few posts rows where the owneruserid is NOT null:

It kinda matches our hypothesis: they might not be populating the ownerdisplayname column until a user deletes their row. Some of the rows do have an ownerdisplayname populated though. This lines up with my real-world experience with decade-old production databases: the data is all over the place, in varying levels of explanations and quality. Perhaps Stack’s handling of deleted user accounts changed over time, too.

Question #3: any changes we want to make?

The last part of your query exercise was to think about any changes you might want to make to the app, processes, or database.

I’ll be honest with you, dear reader: I laid that as a trap for professional database administrators. When a DBA sees a database without foreign key referential integrity implemented, their first knee-jerk reaction is usually, “WE HAVE TO PUT IN FOREIGN KEYS RITE NAO TO FIX THIS PROBLEM!!1!!ONE!”

I agree with that in the beginning of an application’s lifecycle, but now that we’ve got a 15-year-old database with a ton of data that violates foreign key constraints, what are we supposed to do?

Could/should we add a hard-coded “Unknown User” row, and link the data to that? You’ll often see this with a “magic number” users row with an id of, say, 0 or -1. If we take that approach, we also have to modify our users-deletion process to assign posts.owneruserid to that magic value, or we have to implement triggers on tables to do that work. There are performance and concurrency issues associated with putting that work on the database tier, though.

In Martin’s excellent answer, another problem with the magic-number approach came up. “Magic numbers are bad because they’re pretty much always undocumented, they make the code hard to read and because new team members aren’t aware of them.” (I feel Martin’s pain, but in my experience, everything in the application is undocumented, so… there’s that.)

Could/should we set columns to null? If a User or Post is deleted, should we set all the referring columns to be null? And what should do that work, the app or a trigger? Foreign keys could be implemented in that case, but… our reports might not produce the data we expect if we’re doing inner joins. (Inner joins would work with the magic number approach above.)

Could we delete the data that doesn’t have valid relationships? If we did that, we would lose questions & answers on the site, stuff that’s valuable and helpful to looking for solutions. The business might be okay with that, though – that’s a business question, not a data professional’s question.

Even if we do either of those, we still can’t just implement foreign keys safely. We would need to do more investigation to find out if there are other causes for the missing users rows.

How’d you do? What’d you think?

I hope you had fun, enjoyed this exercise, and learned a little something. If you’ve got comments or questions, hit the comment section below. (Keep the comments focused on this exercise, though – we’re not covering the details of the Postgres export just yet. Only so many things I can cover per blog post, hahaha, and this thing’s already way long.)


Announcing the 2024 Data Professional Salary Survey Results.

Salary
13 Comments

This is the 8th year now that we’ve been running our annual Data Professional Salary Survey, and I was really curious to see what the results would hold this year. How would inflation and layoffs impact the database world? Download the raw data here and slice & dice it to see what’s important to you. Here’s what I found.

First, without filtering the data at all, salaries are up, but response counts continue to drop:

If we filter for just United States folks whose primary database is SQL Server or Azure SQL DB, the salary numbers are higher, and are still continuing to rise:

Because this blog’s primary readership is SQL Server folks, I wouldn’t use the survey to draw conclusions about any other platform. The number of responses for other platforms is really low:

So with that in mind, for the rest of this post, I’m going to focus on only SQL Server & Azure SQL DB folks. What are your career plans for 2024?

Most respondents intend to stay in the same employer, in the same role. Folks who are planning to make a change also happen to be getting paid less – and that’s probably not a coincidence, heh. If you’re thinking about changing roles, you’re probably interested in who’s bringing home the cheddar:

Normally I wouldn’t draw conclusions from just 4 respondents, but I think it’s safe to say that data scientists are in such high demand that they command higher pay. (However, it’s also harder to get a data scientist job than most of the rest of the jobs in this list.)

Another way to make more money is to go independent:

This marks the first year of the survey where female pay is actually higher than male! The response rate is pretty skewed, but it always has been:

Download the raw data here, and hope this data is useful to you when you have salary and career planning discussions with your manager. Here’s to you getting another raise in 2024!


Query Exercise: Find Foreign Key Problems

Query Exercises
21 Comments

For 2024, I’m trying something new: weekly homework challenges! For this week, let’s say we’ve decided to implement foreign keys, and we need to find data that’s going to violate our desired keys.

Foreign keys, according to AIWe’re going to use the Stack Overflow database, and we’ll focus on these 3 tables:

  • dbo.Users table: with Id column as its primary key
  • dbo.Posts table: with OwnerUserId column noting which Users.Id wrote the post
  • dbo.Comments table: with UserId column noting which Users.Id wrote the comment, and PostId column noting which Posts.Id is being commented on

Before we attempt to implement foreign keys, we need to find data which might violate the foreign key relationships. Are there any:

  • Posts rows whose OwnerUserId does not match up with a valid Users.Id
  • Comments rows whose UserId doesn’t match up with a valid Users.Id
  • Comments rows whose PostId doesn’t match up with a valid Posts.Id
  • And to make your task easier, let’s focus on just the first 100K rows in each table (rows with an Id <= 100000) to see whether or not foreign keys make sense for this database

Your query exercise has a few parts:

  1. Write one or more queries to find these answers as quickly as possible with low load on the database.
  2. Given what you find, hypothesize about what might have caused the foreign key problems.
  3. Given what you learned, are there any changes you want to make to the app, processes, or database?

You can post your answers in this blog post’s comments, and discuss each others’ ideas. We discuss the challenges & techniques in the next post. Have fun!


[Video] Office Hours: Oddball Questions Edition

Videos
0

The last Office Hours of 2023 featured some oddball questions from https://pollgab.com/room/brento. Not bad, just … odd.

Office Hours: Oddball Questions Edition

Here’s what we covered:

  • 00:00 Start
  • 03:04 TheMooneyFlyer: Hey Brent, how do you work on optimizing sp that performs insert/update/delete? Does putting the exec within a begin tran / rollback is a good option?
  • 06:06 MyTeaGotCold: If a table is empty and I absolutely know that nobody else is using it, should I always insert in to it WITH (TABLOCK)? What if it’s a temp table?
  • 07:09 Tonia S: Have you tried the mock DBA interviews with ChatGPT? Very realistic?
  • 09:11 ChompingBits: What “Best Practice” pays your bills the most? I’m thinking DBAs with superstitions they follow that cause issues in newer versions, but if you’ve got suggestions that almost no one follows so you come in and clean up in an afternoon, I’d like to hear that too.
  • 11:33 Nardole: What are the top SSIS issues you see with your clients? Anything performance related?
  • 11:45 Philo: Is windows paging of sqlserver always bad? What are the top issues you see with windows paging?
  • 12:00 OnSiteDBA: You mentioned that shops that go multi-terabyte need to do snapshots backups instead of the native backups given the reduced restore times typical of snapshots. How does one handle possible inconsistencies in MSSQL configurations with data and log files in different volumes?
  • 12:36 crushingtempdb: Hi Brent! I am troubleshooting some tempdb issues on Azure SQL Database; When I read the documentation; mo cores=mo tempdb, we’re told; When I run0 select Sum (max_size)/1024.0/1024.0 FROM tempdb.sys.database_files WHERE type_desc = ‘ROWS’ it doesn’t match. Thoughts?
  • 13:45 Karthik: Have you ever had to run the windows debugger against sqlserver.exe? What was the scenario?
  • 17:42 Bonnie: What are your best and worst observed times for manually scaling azure SQL VM to a higher SKU?
  • 19:36 MyTeaGotCold: How should I manage an effort to refactor away from Hungarian notation (e.g. “sp_” prefixes)? Even when I win the battle on clarity and performance, I lose it on the fear that the changes will break something.

Who’s Hiring in the Microsoft Data Platform Community? January 2024 Edition

Who's Hiring
9 Comments

Is your company hiring for a database position as of January 2024? Do you wanna work with the kinds of people who read this blog? Let’s set up some rapid networking here.

If your company is hiring, leave a comment. The rules:

  • Your comment must include the job title, and either a link to the full job description, or the text of it. It doesn’t have to be a SQL Server DBA job, but it does have to be related to databases. (We get a pretty broad readership here – it can be any database.)
  • An email address to send resumes, or a link to the application process – if I were you, I’d put an email address because you may want to know that applicants are readers here, because they might be more qualified than the applicants you regularly get.
  • Please state the location and include REMOTE and/or VISA when that sort of candidate is welcome. When remote work is not an option, include ONSITE.
  • Please only post if you personally are part of the hiring company—no recruiting firms or job boards. Only one post per company. If it isn’t a household name, please explain what your company does.
  • Commenters: please don’t reply to job posts to complain about something. It’s off topic here.
  • Readers: please only email if you are personally interested in the job.

If your comment isn’t relevant or smells fishy, I’ll delete it. If you have questions about why your comment got deleted, or how to maximize the effectiveness of your comment, contact me.

Each month, I publish a new post in the Who’s Hiring category here so y’all can get the latest opportunities.


[Video] Office Hours: Holiday Speed Round Edition

Videos
1 Comment

Most of the https://pollgab.com/room/brento questions from this episode had really short and sweet answers. Let’s take a break for the holidays and knock ’em out:

Office Hours: Holiday Speed Round Edition

Here’s what we covered this episode:

  • 00:00 Start
  • 03:29 TheyBlameMe: Have you ever made it across the finish line for a customer by changing the way their application connects to SQL Server (i.e. OLE DB vs ODBC driver)?
  • 03:51 Sleepless in Seattle: How often, if ever, do you use partially contained databases?
  • 04:36 GUIDKeysWasteAndFlushBufferCache: With a GUID PK & CX on a large table, most new rows will land on 8K pages containing only old, obsolete data. Almost every new row (that will be accessed again soon) has a whole 8K page in the buffer cache! Buffer cache will be mostly old, obsolete data. Am I missing something?
  • 06:26 Gökçe: What should you do when you notice the third party vendor app is using NOLOCK like butter all over their TSQL queries? Have you experienced this?
  • 06:59 Björgvin B.: Entity Framework is spamming the SQL plan cache with many duplicate plans. How do you deal with this issue? Use sp’s instead?
  • 07:22 Anastasios: Do you have any good stories where SQL Server was blamed for slow app performance but turned out to be something completely unrelated to SQL Server?
  • 08:27 MyTeaGotCold: Has your opinion on agent jobs changed in the past 5 years? AWS lambda and its kin seem to be replacing the Task Scheduler.
  • 09:13 Stooky Sue: As SQL migrates from on-prem to the cloud, do you still see separation of duties between DBA and Storage Admins?
  • 10:12 bagllop: What would you suggest my friend to use instead of linked servers? SSIS? OpenQuery? Or is there some new fancy stuff he could use?
  • 11:41 Rena: Is adding foreign keys just to get join elimination ever advised?
  • 12:28 Persephone: What scripts do you like to use for Columnstore Index maint?
  • 12:52 Encino Man: Is there a good way to determine if a given query from C# is using the density vector or histogram for cardinality estimates?
  • 13:40 Stooky Bill: For SQL Server, Microsoft recommends a minimum target for page life expectancy of 300 seconds. Is this still a good recommendation or does it need updating for modern times?
  • 14:10 DBADany: Hey Brent, in a working environment with multiple DBAs having sysadmin permissions, are you aware of anyway we could audit who restarted SQL Server? Apparently SQL Audit only tracks the time for stop/start itself but no details of hostname or IP from who did it? Thanks
  • 15:15 Huyang: What is your opinion of Azure NetApp Files and Silk cloud ISCSI SAN for hosting Azure SQL VM data files?
  • 15:26 OnSiteDBA: I have heard you mention PostgreSQL as a cheaper RDBMS alternative to MSSQL, but hardly mention MySQL. Is there a reason you hardly mention MySQL? notable performance hitches, feature-specific limitations etc.
  • 16:46 KyleDevDBA: Hi Brent, Do you have a favorite wait type to fix and why?
  • 17:12 Renzi: What are your pros / cons of app consistent snapshot backups vs crash consistent snapshot backups for multi TB boxed SQL db’s?
  • 18:29 Stooky Bill: What are your thoughts on the TOP operator short circuiting parallelism? Good thing or bad thing?
  • 20:07 ChompingBits: Why is it that the Microsoft owned apps are always the worst offenders with SQL issues? Sharepoint and SCCM have longstanding known issues with deadlocks. SCOM is a beast that spawns all kind of GUID named agents jobs. SCORCH doesn’t support availability groups.
  • 20:57 Argyris P.: Do you have any good ways to find all large (billions of rows) static tables in a boxed SQL Server instance?
  • 21:38 Alex: Huge fan. What’s your opinion on azure database fleet and should I aim to replace my elastic pools with this new feature or is it another gimmick?
  • 22:32 Pytzamarama: Hi Brent! When we update our customers databases (a lot of them are still on SQL Server 2008R2) for a new app version, we drop/create all procedures and triggers. How does this impact performance? Thanx
  • 24:12 Toymaker: Is there ever value in perf testing queries cold vs hot (I.e. DBCC DROPCLEANBUFFERS, DBCC FREEPROCCACHE)?
  • 24:37 WouldPrefer3rdNormalForm: Have you ever seen or heard of a ceiling/threshold for XML columns, beyond which performance craters? We are scared by two large XML columns (and their indexes). They are currently performing adequately, but comprise about about 40% of our 1.5TB database and continue to grow…
  • 25:49 Kane Baden: Was literally typing a question around “what do you think are the main questions around picking MSSQL vs. Postgres…” while watching your last upload when you spelled it out for the last question on that SaaS question. So I figured I should let you know thanks instead! Thanks!
  • 26:10 Bonnie: How do you run sp_blitzcache to target a temporary stored procedures for analysis?
  • 26:35 bottomless: Azure SQL Database Serverless is supposed to cost less but at the end of the month it costs more than DTU: our SaaS has routines and jobs that wake up the database. From you experience have you aver seen a successful use of Azure SQL Database Serverless?

Updated First Responder Kit and Consultant Toolkit for December 2023

New this month: more work on letting sp_Blitz run with limited permissions, nicer Markdown output, sp_BlitzLock compatibility with Managed Instances, and more.

How I Use the First Responder Kit
Wanna watch me use it? Take the class.

To get the new version:

Consultant Toolkit Changes

There are two changes to the spreadsheet in this release. In the “Plans Duplicated” and “Plans by Query Hash” tabs, we’ve added new columns for Compile Time MS, Compile CPU MS, and Compile Memory KB. These values are for the one specific plan you’re looking at in the row, not the total amount for that duplicated plan. It gives you a rough idea of how resource-intensive these queries are each time they run. (Thanks Erik Darling.)

If you’ve customized your query_manifest.json and/or your spreadsheet

  • And you don’t want the new columns, then you can simply copy your customized query_manifest.json and/or spreadsheet over ours just like you normally do. Nothing will error out – you just won’t have the new columns.
  • And you do want the new columns, then you’ll need to merge our query_manifest.json changes for queries 50 & 82, and copy the tabs “Plans Duplicated” and “Plans by Query Hash” over those tabs in your customized spreadsheet.

sp_Blitz Changes

  • Enhancement: way prettier output when @OutputType = ‘markdown’. (#3401, thanks Mike Scalise.)
  • Enhancement: simpler, more intuitive checks for Instant File Initialization. (#3362 and #3409, thanks Montro1981.)
  • Enhancement: if we had to skip checks because you didn’t have enough permissions, we now warn you about that. (#3376, thanks Montro1981.)
  • Fix: continued work on detecting msdb permissions. (#3377, thanks Montro1981.)

sp_BlitzCache Changes

sp_BlitzIndex Changes

  • Fix: @Debug = 1 was skipping a character of the dynamic SQL. (#3406, thanks Per Scheffer and Montro1981.)
  • Fix: added drop-if-exists statement for temp tables to make it easier to run parts of sp_BlitzIndex ad-hoc, outside of a stored proc. (#3383, thanks Chad Baldwin.)

sp_BlitzLock Changes

sp_BlitzQueryStore Changes

sp_ineachdb Changes

  • Enhancement: new @is_ag_writeable_copy parameter to only run queries against those databases. (#3399, thanks Douglas Taft.)

For Support

When you have questions about how the tools work, talk with the community in the #FirstResponderKit Slack channel. Be patient: it’s staffed by volunteers with day jobs. If it’s your first time in the community Slack, get started here.

When you find a bug or want something changed, read the contributing.md file.

When you have a question about what the scripts found, first make sure you read the “More Details” URL for any warning you find. We put a lot of work into documentation, and we wouldn’t want someone to yell at you to go read the fine manual. After that, when you’ve still got questions about how something works in SQL Server, post a question at DBA.StackExchange.com and the community (that includes me!) will help. Include exact errors and any applicable screenshots, your SQL Server version number (including the build #), and the version of the tool you’re working with.


[Video] Working on First Responder Kit Pull Requests

Videos
0

For this month’s First Responder Kit releases, I worked through most of the pull requests live on my Twitch channel. If you wanna get a glimpse of what it’s like being an open source maintainer, this is a good behind-the-scenes look:

Working on First Responder Kit Pull Requests, Part 1

And part 2, with more PRs:

Working on First Responder Kit Pull Requests, Part 2

The Annual Data Professional Salary Survey Closes This Week!

Salary
2 Comments

Take the Data Professional Salary Survey now. The survey has closed.

The 2020s have been tough: a pandemic, a recession, layoffs, and inflation. Inflation makes things particularly tricky because your costs for everything have risen a lot in the last year, but at the same time… has your salary? What about your peers? You’re in a tough position because it’s hard to ask for more money when there are layoffs everywhere. I feel you.

So it’s time for our annual salary survey to find out what data professionals make. You fill out the data, we open source the whole thing, and you can analyze the data to spot trends and do a better job of negotiating your own salary.

The anonymous survey closes Sunday, Jan 1. The results are completely open source, and shared with the community for your analysis. (You can analyze ’em now mid-flight, but I’d wait until the final results come in. I’ll combine them into a single spreadsheet with the past results, and publish those on January 9th.)

Thanks for your help in giving everybody in the community a better chance to talk honestly with their managers about salary.