
Building SQL ConstantCare®: 10% of you have Priority Boost on.

SQL ConstantCare

One of my favorite things about my job is being able to take the pulse of how real companies are managing their databases. I don’t wanna be locked in an ivory tower, preaching down to you, dear reader, about how you should be doing five million things a day – when in reality, you’re struggling to get an hour of real work done because you’ve got so many meetings.

But sometimes I wonder – am I out of touch? When I was a DBA, I remember struggling with backups and corruption checking – has that gotten easier? Have today’s DBAs started using more automation and tools to protect their estate? Is Transparent Data Encryption catching on? Did DBAs start using Extended Events for monitoring when I wasn’t looking?

And it’s important because I wanna build the right training material and scripts for our customers. I see a problem trending, I want to be able to give people the right information to fix the problem, fast.

When we launched SQL ConstantCare®, I was excited to see what the data would reveal. 86 users have opted into public data sharing for 285 servers hosting 11,521 databases.

Here’s some of the interesting things we’ve learned so far.

SQL Server adoption is still slow.

It’s spring 2018, but SQL Server 2017 still has less adoption than SQL Server 2008, let alone 2008R2. I’m really curious to see how this progresses as we move towards the end of support for 2008 and 2008R2 next year.

SQL Servers by version

In terms of edition, I went in with no expectations – I really have no idea what our readership looks like, and it’s interesting to see the numbers:

SQL Servers by edition

Over 50% of us had basic backup issues.

58% of us had at least one database (37% of databases overall) that hadn’t had a full backup in the last week. Now, this is a little tricky: in the initial round of collection, I noticed a trend that a lot of people would add a development server first, then look at the email advice to decide whether they wanted to add more servers.

However, I also noticed a trend amongst the replies – paraphrasing:

“You know, I’d forgotten about those databases. We restored that a while ago to get some data out of it, and then I guess I forgot to delete it. I’ll go delete those now.”

Similarly, 38% of all servers had databases in full recovery model, but weren’t doing transaction log backups on them.
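If you want to check that on your own servers, here’s a minimal sketch – not the actual SQL ConstantCare® check, and it assumes your msdb backup history hasn’t been purged:

```sql
-- Sketch only: databases in FULL recovery with no log backup
-- in the last 24 hours (or no log backup on record at all).
SELECT d.name,
       MAX(b.backup_finish_date) AS last_log_backup
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
    ON b.database_name = d.name
   AND b.type = 'L'                      -- 'L' = transaction log backup
WHERE d.recovery_model_desc = 'FULL'
GROUP BY d.name
HAVING MAX(b.backup_finish_date) IS NULL
    OR MAX(b.backup_finish_date) < DATEADD(HOUR, -24, GETDATE());
```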

Between dropping databases that shouldn’t be around (shout out to the multiple folks that had AdventureWorks in production) and suddenly clearing away unnecessarily giant log files, I can see how the database size tends to drop quickly on servers within the first few days of setting up SQL ConstantCare. (We’ll do an ROI study on that over time.)

We have a mixed record on corruption checking.

93% of all databases had a CHECKDB in the last 2 weeks! That’s awesome!

However, things were a lot worse when it came to enabling checksums for page verification. 5% of databases didn’t have checksums turned on, which sounds small, but it was spread across 54% of the users. Look at the person sitting next to you: either this issue affects you, or it affects them. One of you needs to buckle up.
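Checking for (and fixing) this one is quick – here’s a minimal sketch, with a placeholder database name you’d swap for your own:

```sql
-- Which databases aren't using CHECKSUM page verification?
SELECT name, page_verify_option_desc
FROM sys.databases
WHERE page_verify_option_desc <> 'CHECKSUM';

-- Turning it on is an online change, but remember: existing pages
-- only get a checksum the next time they're written back to disk.
ALTER DATABASE [YourDatabase] SET PAGE_VERIFY CHECKSUM;
```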

We’re still not patching.

  • 16% of customers are running a completely unsupported build of 2008 or newer (meaning they haven’t applied a service pack in a few years)
  • 10% of customers are running builds with known corruption or security escalation bugs
  • Hardly anyone is patched for Meltdown/Spectre

As a teacher and consultant, I gotta think hard about that. I don’t have easy answers. It’s not like I can just build a presentation and magically get the business to agree to outage windows.

Long term, I’m thinking of it as a data problem: can I tie people’s server problems to a specific CU that has a fix for their issues? That won’t be on the horizon for the product in 2018, but it’s an interesting long term challenge.

37% of us are experiencing poison waits.

When RESOURCE_SEMAPHORE, RESOURCE_SEMAPHORE_QUERY_COMPILE, and THREADPOOL strike, it can feel like your SQL Server service is completely locked up – even though you can remote desktop into the base OS and it responds just fine.

When I talk to training classes about that, I’ve been saying that most of you will be able to go your entire career without having to troubleshoot those issues. Turns out I’m completely wrong, and I need to start talking about these more often, like blogging about how to recognize the symptoms even if you’re not monitoring wait stats.
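A starting point for recognizing the symptoms yourself – a minimal sketch against the wait stats DMV, keeping in mind the counts are cumulative since the last restart (or since the stats were cleared):

```sql
-- Have the poison waits ever struck this server?
SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms / 1000.0 AS wait_time_seconds
FROM sys.dm_os_wait_stats
WHERE wait_type IN (N'RESOURCE_SEMAPHORE',
                    N'RESOURCE_SEMAPHORE_QUERY_COMPILE',
                    N'THREADPOOL')
  AND waiting_tasks_count > 0;
```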

On a related note, 16% of us have had memory dumps recently. My old advice was to install the SSMS memory dump upload utility, but Microsoft shut that down – likely due to security issues around GDPR, since memory dumps can include PII. Before I write new advice there, I’m going to dig deeper into the data – such as whether the dumps correlate with specific builds and versions – to improve my advice.

We still have priority boost turned on.

Thanks, SSMS

It’s hard for me to believe that Microsoft still exposes this as an option in SSMS 17.6. Users should be protected from themselves and from really bad Internet advice – this should be deprecated, burned, and pushed to the bottom of the ocean.

Here’s the real kicker, though: ten percent of us have a server with Priority Boost on.

Another way to think of it: when you’re in a user group session with 40 other people, 4 of them have Priority Boost on. Or maybe you, and 3 other people.
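Checking it takes seconds. Here’s a sketch – note that Priority Boost is an advanced option, and turning it off only takes effect after a restart of the SQL Server service:

```sql
-- Is Priority Boost on? value_in_use = 1 means yes.
SELECT name, value_in_use
FROM sys.configurations
WHERE name = N'priority boost';

-- Turning it off (requires a service restart to take effect):
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure N'priority boost', 0;
RECONFIGURE;
```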

The exciting thing is that we can track what happens to wait stats as people turn that feature off, and then prove if it made things better or worse. (Over 20% of us have either auto-close or auto-shrink enabled on databases, too, but thankfully it’s confined to about 11% of our servers.)

I’m barely scratching the surface here of what we’re learning. As the data grows, I’m really looking forward to showing users comparison data of how they rank related to other shops, how their database health stacks up, and the easy stuff they can do to get better.

Read more of my SQL ConstantCare posts, or sign up now.


The Many Mysteries of Merge Joins

Not A Single Picture Of A Zipper

Image humbly borrowed from https://70srichard.wordpress.com/2014/12/17/beverly-hills-cop/

There are some interesting things about Merge Joins, and Merge Join plans that I figured I’d blog about.

Merge joins have at least one interesting attribute, and may add some weird stuff to your query plans.

It’s not that I think they’re bad, but they can be tricky.

Lots of people see a Merge Join and are somewhere between grateful (that it’s not a Hash Join) and curious (as to why it’s not Nested Loops).

Oh, you exotic Merge Join.

E pluribus unum

In a “good” Merge Join, the join operator in the query plan will have the Many to Many: False attribute.

The optimizer knows this because the Primary Key on Id (though a unique index or constraint offers similar assurances) is distinct for each value in the Users table.

Having one unique input gives you a one to many Merge Join.
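The demo queries didn’t survive the trip here, but a sketch against the Stack Overflow sample database shows the shape – the hint just forces the join type for demo purposes:

```sql
-- Users.Id is the clustered primary key, so that input is guaranteed
-- unique, and the join comes out Many to Many: False.
SELECT u.Id, p.Score
FROM dbo.Users AS u
JOIN dbo.Posts AS p
    ON p.OwnerUserId = u.Id
OPTION (MERGE JOIN);
```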

Simple as a pimple

The statistics TIME and IO profile for this query is about like so:

Not too shabby for one meeeeeeeeeeeeeeeeeeeellion rows.

E pluribus pluribus

In a “bad” Merge Join, that attribute will be True.

Hamburger Lady

Why does this happen, and why is it bad?

It happens most commonly when there are, or may be, duplicates on both sides of the join. It can also happen when the outer input has duplicates, but quite often the optimizer will rewrite the JOIN order to put that input on the inside to avoid having to use a worktable, etc. Thanks to Adam and Kalen for nudging me in the comments to clarify this part.

Internally, the Merge Join will spin up a work table (similar to how a Hash Join operates), and work out the duplicate situation.
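A sketch of the “bad” version, again against the Stack Overflow sample database – neither input is unique, so you get the worktable:

```sql
-- Join columns that can repeat on both sides: the plan shows
-- Many to Many: True, and a worktable gets spun up in tempdb.
SELECT COUNT(*) AS records
FROM dbo.Posts AS p
JOIN dbo.Votes AS v
    ON v.UserId = p.OwnerUserId   -- duplicates possible on both sides
OPTION (MERGE JOIN);
```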

The stats TIME and IO profile of this plan looks like so:

If we were to, say, choose that serial Merge Join plan in the compilation of a stored procedure that became the victim of parameter sniffing, we could run into trouble.

Yes, that is seven minutes and forty seconds. A little over half of one metric cigarette break.

Head to head, the large merge (many to many) is costed higher than the smaller merge (one to many). But you won’t see that in a parameter sniffing situation.

You’ll only see the lower costed Merge Join version of the plan.

TELL’EM LARGE MERGE SENT YA

The optimizer will sometimes try to protect itself from such hijinks.

Aggregation Ruling The Nation

In some cases, the optimizer may inject an aggregation into one side of the join ahead of time to distinctify the data. It doesn’t need to do both — we only need one distinct input for the many to many attribute to be false.

I haven’t seen a situation where both inputs get aggregated merely to support the Merge Join, but it might happen if you ask for other aggregations on the join column.

It can use a Stream Aggregate, which would generally make more sense, since both the Stream Aggregate and the Merge join require sorted data.

Sense and Sensibility

Under less reasonable circumstances, you may get a Hash Match Aggregate. This plan has the additional misfortune of needing to re-order the data for the Merge Join. Teehee.

Hash Gang

If you see this, something has truly gone wrong in your life. This query was heavily under the hint-fluence.

A Sort Is A Sort

Much more common is seeing a Sort injected into a plan to support one or more downstream operators that require sorted input (Merge Joins, Stream Aggregates, Windowing Functions).

For instance, a query like this:

May give you a plan like this:

Sort early, Sort often

In this case, the Sort happens early on to support the Window Function. It also aids the Stream Aggregate, but whatever. Once data is sorted, the optimizer tends to not un-sort it.

The Sort in the next example will be needed when there’s no index on the join column, or where the join column is not the leading column in the index (there’s an If here, which we’ll get to).

If this is my index on the Votes table, data is sorted by PostId, and then UserId

When my query is just a simple join, like this:

My query plan will look like this, with a big honkin’ Sort in it:

Is kill

On the other hand, if my query looks like this:

My WHERE clause filters the leading column to a single PostId (one Post can be voted on by many Users), so the UserId column will already be sorted for that single PostId value.

We won’t need to physically sort data coming out of that.
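Sketching both cases with a hypothetical index (whether you actually get a Merge Join still depends on the optimizer, of course):

```sql
-- Hypothetical index, sorted by PostId, then UserId:
CREATE INDEX ix_PostId_UserId ON dbo.Votes (PostId, UserId);

-- Joining on UserId alone: rows aren't in UserId order overall,
-- so the plan needs a Sort ahead of the Merge Join...
SELECT v.UserId
FROM dbo.Votes AS v
JOIN dbo.Users AS u ON u.Id = v.UserId;

-- ...but filtered to a single PostId (value is made up here),
-- UserId is already in order within that value: no Sort needed.
SELECT v.UserId
FROM dbo.Votes AS v
JOIN dbo.Users AS u ON u.Id = v.UserId
WHERE v.PostId = 194812;
```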

Like mustard

Out In The Street, They Call It Merge Join

I hope you learned some stuff that you can use when troubleshooting, or trying to understand a Merge Join plan.

This is one of many places that the optimizer may inject a Sort into a plan that you didn’t ask for.

Thanks for reading!

ZIPPER FREE!


Building SQL ConstantCare®: The Minimum Viable Product (MVP)

SQL ConstantCare

When you wanna build something, how early do you let the public in? Where do you draw the lines with the features you absolutely have to have – versus the stuff you just want? Code is never really done.

The lines are so blurry these days with labels like Alpha, Beta, Private Preview, Public Preview, Early Access, Limited Availability, Regional Availability, General Availability, etc. Companies wanna get something in your hands as quickly as possible so they can start learning from what you like and what you use. What’s the first thing you actually ship?

MVP: Minimum Viable Product

The Lean Startup by Eric Ries

In Eric Ries’ excellent book The Lean Startup, he talks about how your company should rapidly iterate along:

  1. Release something as quickly as possible
  2. Measure how customers use it
  3. Learn lessons from those measurements, and then
  4. Go back to step 1

The first time you hit step #1, that’s called the Minimum Viable Product (MVP). Your MVP doesn’t even have to be an app or online service – it could be a manual process, or it could even be just a signup form for a service that doesn’t exist yet. In a sense, you could think of sp_Blitz as the MVP: a script that people could run and get advice about their server.

If you were going to build a system to give people advice about their servers, here’s a few ways you could do it:

  1. Totally Manual Process: you build a list of queries, they copy/paste the queries into SSMS, copy/paste the results into Excel, email them to you, you analyze the data manually, and you manually email them a reply.
  2. Automatic Collection, Manual Analysis: you build an application that runs a bunch of queries and dumps the data into files (say Excel). The user emails you the files, and then you analyze them manually, and tell them what to do.
  3. Automatic Collection, Automatic Analysis: you build an app that runs queries, sends the data to you, and then robots analyze the data and send recommendations to the end user.

Obviously, #3 is a hell of a lot harder than #1, and it ain’t minimum.

You’re probably looking at what SQL ConstantCare® does and thinking, “Wait, he screwed up – he jumped straight to #3.” Well, we’d already done #1 early in our consulting practice, and #2 was our SQL Critical Care® consulting service. (Richie had built a data collection app to make our process faster.)

There’s a lot of gray area in #3, though. We made a lot of brutal decisions about what we would do in order to get the MVP out the door.

What we skipped

We wanted to collect data, put it into a database, and send you emails with advice – but everything about the process was up for debate. If we could put something in your hands faster by making some tough decisions, then we made ’em.

  • “Mute” links – I wanna make it as easy as possible for you to permanently mute servers, databases, or specific alerts that you don’t care about. Soon, the emails will have mute links, but for now, we’re having folks just hit reply and tell us what they want muted, and we mute ’em on the back end.
  • Self-updating app – really wanted this for v1 because I figured we’d be iterating fast over the collector & queries, but no go. The components are there, but you have to run ConstantCare.exe yourself manually if you want to get the update. Has to do with permissions gotchas with Squirrel, the updating tool we used.
  • Windows Service (rather than scheduled task) – an always-running service would let this thing work better in a data center environment, but I’m also hesitant to deal with the support gotchas involved with an always-on self-updating service.
  • Team hierarchy – later, I’d love to designate different people for different groups of servers, or different levels of alerts (like production DBA vs developers.) For now, if you want different teams to get emails for different servers, you’ll need to install different instances of the collector.
  • Interactive web site with your data & recommendations – didn’t wanna hassle with logins for now.
  • Shareable anonymized reports – I’d love to give you the ability to pass your DMV data on to your software vendor or consultant and say, “Here’s what we’re up against – you take a look and tell me what you think.”
  • AG/DR-aware checks – if you have an AG, I’d like to be able to identify backup history across all the nodes and tell you if you have gaps in coverage. (As long as we’re getting backup data across all replicas, this should be doable – just takes more query work.)
  • Troubleshooting tools – sure, it’d be nice to have a dashboard showing where all the incoming files are at in processing, but it doesn’t make sense to build something like that when our processes are changing so fast during the MVP.
  • Automated wait stats analysis – for example, “Your server is waiting on storage, but it’s not because the storage is bad – you just need to change max memory because it’s set incorrectly. Change it to ___, here’s how, and here’s how safe (or unsafe) it is.” Right now, I’m doing this manually for customers, building a manual process of what I’m thinking as I do it.

What we shipped

Users install a desktop app that sends us diagnostic data once a day. Here’s a video showing how you install it – warning, it ain’t pretty:

There’s no GUI, there’s no wizard to add servers for you, and there’s no input validation if you screw up your email address or server names. There’s a lot of places where things can go wrong. I’d love to have an entirely graphical setup process that sweeps the network, suggests the SQL Servers that you would want to monitor, and guesses the right time zones for each server.

I would also like a pony. (Not really. Ponies smell bad. Except this pony.)

But you get what I mean – you go to war with the army you have, or in our case, to market with the installer we have, not the installer we want. Besides, this is a one-time experience for users – the more important experience was getting the emails that were valuable and actionable.

The emails – that’s where the value comes in. Right now, Lambda functions analyze the data and send us emails that look like this:

Erik needs to do a better job of backups

They’re the same emails that you’ll eventually get directly – but for now, we’re watching them ourselves, making sure the functions are returning the right recommendations, and then manually sending guided recommendations directly to the customers.

Then for some customers with special situations, or where the automated emails aren’t quite the right fit, we send manual emails like this (sanitized customer email sent to myself):

Sent by Mañuel Llabor

Is that a lot of work? Sure it is – early adopters are basically getting a screamingly cheap deal on our personalized attention via email. If you’re going to automate something big, though, you’ve gotta start by making sure you can do it manually first.

How the MVP is scaling so far

The line is the number of servers we’re analyzing (from customers willing to share their numbers – more on that in the upcoming GDPR post), and the bars are the terabytes of data on those servers. The jump yesterday was the first day of the marketing launch:

Servers and terabytes of data over time

Yes, the terabytes of data have gone down more than once as folks have suddenly realized, “Whoa, someone restored a bunch of databases onto a production server, and we forgot to get rid of them. They’ve just been making our maintenance windows take longer every night.” My favorites are when we’ve found AdventureWorks on production servers.

Those server numbers may not look big – we could have easily processed 156 servers’ worth of diagnostic data with a conventional Windows app running in a VM – but check out these hosting costs:

Amazon Billing Explorer

We spent ~$650 last month on hosting, and this month is projected to come in around $1k. That includes development environments, by the way. This is inexpensive enough that I could have afforded to absorb it for even just a handful of customers if it hadn’t caught on.

As more customers start to diagnose more servers, serverless really pays off. When new customers join, they seem to wait a day or two to set up collection, then they set up collection for just 1-2 servers, see the results, and then suddenly go, “Whoa, I should add a few other servers in too.” We added 49 new customers yesterday, so I wouldn’t be surprised if we were monitoring a total of around 300-500 servers by the end of the week.

In preparation for that, Richie’s been putting a lot of work into tuning data ingestion and processing the rules quickly, staying ahead of the Lambda function timeouts and the growing data set.

As we go, we’re learning stuff, tweaking the system, and figuring out what things users can fix on their own and which ones they need more help with. It’s even driving the blog posts I write – for example, when I see a problem on several servers in a row, and I have the same discussion with customers about it a few times, that means I need to write a post to link folks to. The results are posts like “But I don’t need to back up that database” and Why Multiple Plans for One Query are Bad.

What’s on the roadmap next

Now that we’ve got the MVP out, we’ve been working on:

  • Quality checking on the automated emails – we’re pretty close to the point where we’ll start letting the automated emails go straight out, but then once users have fixed the first round of problems (like no backups, obvious server misconfigurations, etc) then the human intervention will kick in later. Right now, with ~50 active paying customers and ~150 SQL Servers, it’s easy enough that I can still keep an eye on this manually, but we’ll be switching over soon.
  • Back-patting rules – tying together a recommendation we made, a customer’s successful application of that recommendation, and the difference it made in health or performance metrics.
  • Not breaking the build – seriously, I have a nearly 100% failure rate on my commits. Richie has to be gritting his teeth by now every time he sees one of my pull requests.
  • Wait stats trending – right now I’m manually trending wait stats, then emailing customers an analysis. We’ll need something more scalable as we go to 1,000 servers and beyond. When you build an MVP, though, you gotta do things that don’t scale.

Read more of my SQL ConstantCare posts, or sign up now.


Building SQL ConstantCare®: What Data We Collect (and What We Avoid)

SQL ConstantCare

Next up in my series of behind-the-scenes posts about our new SQL ConstantCare®, let’s talk about the kinds of SQL Server diagnostic data we’re gathering, why, and what data we’re steering clear of for now. (Psst – read through to the end for a huge, crazy discount today, too.)

We – you and I – are paranoid.

We’re data professionals, first and foremost. During my consulting intro calls, I’m really proud to say that we don’t need to get a VPN connection, remote access, or a SQL Server account. You know how it is – DBA, Default Blame Acceptor – and these servers are already in bad shape. The last thing folks want is some consultant going rogue and “fixing” something, so we make it clear from the get-go that won’t happen.

Because we’re so paranoid, my first thought when deciding to collect diagnostic data was, “I don’t wanna end up in the news.” I bet you feel the same way. Therefore, we thought about how we could give users as much valuable insight as possible from as little data as possible.

I drew a big, clear box around queries, query plans, and statistics and said, “We’re not going here in v1.” That data can produce awesome performance recommendations, but it often contains personally identifiable information. I know some people are willing to share that data on a plan-by-plan basis – the referral-marketing success of PasteThePlan proves it. I’d even bet a lot of people are willing to share a lot of their plans on a regular basis as long as they’re getting valuable performance advice. However, I think we can still provide a lot of value without having that data, and we have plenty of Lean Startup lessons to learn along the way first. For now, rather than automating that part of the analysis, we simply tell customers in their advice, “If you’re able to tune queries or indexes to fix this, here’s the exact parameters to use with sp_BlitzCache and sp_BlitzIndex to find the root cause of this problem.”

We started with only the data needed to run the most urgent sp_Blitz checks.

To pick the data we’d gather at first, Richie took the list of sp_Blitz checks from priority 1 to 100 – the most urgent stuff – and came up with the list of data requirements to be able to send emails like this:

You’d be surprised how many real SQL ConstantCare emails start like this.

To see the queries we run to get that data, look at the queries ConstantCare.exe runs by going to Windows Explorer and opening the folder:

One file per query

The way it works is that ConstantCare.exe connects to the SQL Server, runs these queries, writes the results to an encrypted zip file, and uploads it to a private bucket in Amazon S3.

At this point, you might be thinking, “Why not just run sp_Blitz on the server and harvest the results?” We wanted to design checks that worked with data over time – for example, after we find auto-shrink enabled, and after you turn it back off, how are your wait stats affected? (And yes, we’ve already found over a dozen servers with auto-shrink enabled, hahaha.)

Exposing the queries directly to customers also has a side benefit: later this year, they’ll be able to save override versions of the queries. This way, if there are some parts of the diagnostic data that you don’t wanna send in, you can override that portion of the query (returning nulls, masked data, or just no rows at all). This presents a little more design and support work for us since missing rows might trigger a false alarm – for example, if no backup data comes in, we don’t wanna say, “OMG, you aren’t backing up your databases!”
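A hypothetical override might look like this – same column shape as the stock query, but zero rows, so nothing leaves the building:

```sql
-- Hypothetical override for a backup-history collection query:
-- keep the columns the importer expects, return no rows.
SELECT database_name      = CAST(NULL AS sysname),
       backup_finish_date = CAST(NULL AS datetime)
WHERE 1 = 0;
```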

Serverless architecture means per-database checks are harder.

We hit a few challenges early:

  • In the past, we’ve used sp_MSforeachdb, but that doesn’t work in Azure SQL DB (and we wanted eventual compatibility with that, although not necessarily in v1)
  • People have a crazy number of databases per server (think thousands)
  • People who have a crazy number of databases also seem to have a crazy number of indexes per database (think thousands per database, times thousands of servers)
  • And they also have a crazy amount of backup job executions (especially if they took my advice about backup frequency)
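One way around sp_MSforeachdb on a regular instance is to build the per-database statements yourself from sys.databases – a sketch (in Azure SQL DB itself, a connection only sees one database anyway, so that loop would live in the collector, not in T-SQL):

```sql
-- Stand-in for sp_MSforeachdb: one statement per online database.
DECLARE @sql nvarchar(max) = N'';

SELECT @sql += N'SELECT ''' + REPLACE(name, '''', '''''') + N''' AS database_name,
       COUNT(*) AS index_count
FROM ' + QUOTENAME(name) + N'.sys.indexes;
'
FROM sys.databases
WHERE state_desc = N'ONLINE';

EXEC sys.sp_executesql @sql;
```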

Moving large files around in the cloud isn’t a big deal, but processing them is, especially with serverless. AWS Lambda functions have a 5-minute timeout, and we use Lambda functions for stuff like:

  • Importing the client’s JSON data into the database – in the first version of the intake process, we were processing incoming files line-by-line – and hoowee, did that not work with the crazy-number scenarios. After hitting timeout walls, Richie later changed it to pull the whole JSON file into the database as a set, plus we held the index analysis back for a later version.
  • Analyzing the data to build email content – just like your homegrown queries look at SQL Server’s DMVs to figure out what’s broken, we’re using Lambda functions to run database queries to look at your DMV data. It’s just that the data is now stored in a central database on our end. Imagine a database server that holds DMV data from lots of customers – we’re gonna run into performance issues as we gain clients and retain more database history. If your homegrown query takes over 5 minutes to run, you’re the only one who cares – but if our AWS Lambda functions run past the 5-minute limit and time out, you won’t get your report, and I care about that.
Our SQL ConstantCare Slack channel before performance tuning
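Our back end isn’t necessarily SQL Server, but the row-by-row vs. set-based difference translates anywhere. Here’s a T-SQL sketch (SQL Server 2016+) of shredding a whole JSON payload in one statement:

```sql
-- Set-based JSON intake sketch: pull the whole payload in at once
-- with OPENJSON instead of processing it line by line.
DECLARE @json nvarchar(max) = N'[
  {"name": "master", "recovery": "SIMPLE"},
  {"name": "Sales",  "recovery": "FULL"}
]';

SELECT j.name, j.recovery
FROM OPENJSON(@json)
     WITH (name     nvarchar(128) '$.name',
           recovery nvarchar(60)  '$.recovery') AS j;
```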

Watching function runtimes and timeouts has probably been one of the more interesting challenges so far. We have a company Slack channel for SQL ConstantCare® where status updates go and we talk about what we’re doing about ’em. We laugh a lot in there. Also, swearing and gifs.

The whole mission and design of SQL ConstantCare® means these timeout errors are less of a concern – it’s a mentoring product, not a real time monitoring product, and users don’t expect to get an email instantly after their data is sent in. (Plus, data is sent in daily, but for now, analysis & mentoring emails only go out weekly.) The database size is less of a concern, too – during the initial Minimum Viable Product (MVP) phase of the project, we could even delete and redeploy the database from scratch whenever we want, and the product still worked fine.

Read more of my SQL ConstantCare posts, or sign up now.


[Video] Office Hours 2018/3/28 (With Transcriptions)

This week, Erik and Richie discuss what to do first at a new job that never had a DBA before, basic vs. advanced DBA skills, tempdb files, downgrading SQL Server versions, a database architecture issue, tools for monitoring SQL Server instances, other areas to focus on as a DBA besides HADR and query tuning, SQL ConstantCare®, and database backups.

Here’s the video on YouTube:

You can register to attend next week’s Office Hours, or subscribe to our podcast to listen on the go.

Enjoy the Podcast?

Don’t miss an episode, subscribe via iTunes, Stitcher or RSS.
Leave us a review in iTunes

Office Hours – 3-28-18

 

What should you do on your first day on the job?

Erik Darling: Let’s see, I’m going to try this name, but I’m not going to do well at it, Seybou – I like that you have B-O-U in your name; that’s Brent Ozar Unlimited – asks, “What should you do first when you start a new job at a company that never had a DBA before?” My man, backups, backups, backups, backups, backups, backups – make sure that the backups are in order before you do a damn other thing. If you don’t have those, like, it doesn’t matter if you’re checking for CHECKDB, checking for corruption, it doesn’t matter what else you’re doing in your life, if you don’t have those backups, you are missing the biggest piece of the DBA puzzle. What do you think, Richie?

Richie Rump: Oh no, that’s exactly right. So in fact, we don’t even do consulting gigs unless we have backups from the client. And we actually go in and check and make sure that they have valid backups for everything. And there’s been times where we’ve actually cancelled gigs because they haven’t had valid backups. So we’re practicing what we preach; that’s all I’m saying.

Erik Darling: Not even like, you know, the situation where it’s like, they’ve never taken a backup. It’s like, if their backups are like more than a week or ten days old, we’re still like, you have to take one like now because we don’t know what was in that – we don’t know what was going on the last time you did anything. I feel like a nerd now, everyone can see my EpiPen. I went to get my allergy shot yesterday and I have to bring this thing with me to get my allergy shot, or else they won’t give it to me because I might, I guess, die in the office.

Richie Rump: That kind of reminds me of the old bear skit from SNL, you know, that’s my fourth heart attack this week.

Erik Darling: That was a good one. That was, like, one of the last funny things SNL ever did.

Richie Rump: No, no, no they’ve been doing some pretty good stuff recently. I haven’t been watching it live but I’ve been watching it on digital and they – Bill Hader was on this week and he broke like five times; it was great. He just kept breaking. He was like, “I’m not a cast member anymore. I don’t have to worry about my job. I’m going to go ahead and break.” Cracking up in the middle of his stuff.

Erik Darling: I don’t know, unless he’s trying to come back on SNL. But to get back to Seybou’s question a little bit, so after backups, of course, CHECKDB and making sure you have corruption checks running, after that, everything else is kind of less important as business priorities. So make sure you’re meeting RPO and RTO goals, then after that, make sure you’re meeting whatever SLAs you have with customers, and then after that you can start focusing on stuff like performance, whether it’s query or index tuning.

Please, feel free – I mean, it’s free, give sp_Blitz a run. Head over to our first responder kit, go grab a recent copy of sp_Blitz, give it a run, see if there’s any surprises. Best way to do things, I think; it’s neatly packaged up for you.

Richie Rump: We have this new product that we’ve just been slowly rolling out called ConstantCare. And ConstantCare will actually go in and do all of that stuff for you. It will go in – send us the data every day and we’ll send you an email. Right now we’re once a week; eventually, it will be once a day and it will actually tell you, “Hey, this one backup didn’t run,” or it will tell you go do this or go do that. So that is an option and you don’t need to worry about running scripts or anything like that, you do one install and it does it all for you. So obviously, it’s something near and dear to my heart, because it’s my life, but yes, since we’re now getting the word out, I can talk about it.

Erik Darling: Finally – like, a thing that you've been working on almost since you first started here, and everyone's just like, what the hell is Richie doing? Well, now we know.

Richie Rump: What is he doing?

Erik Darling: Just Paste The Plan, really? That’s it?

Richie Rump: I know, and he hasn’t done anything in like a year, what’s up with that?

Erik Darling: And it barely works – just kidding, it works fine.

 

What separates a junior vs senior DBA?

Erik Darling: Anyway, let's see, "I'm currently a BI developer but I'm interested in becoming a DBA as a future goal. What aspects of DBA do you consider basic versus advanced, i.e. DBA versus senior DBA? For example, would you put replication HADR in the basic or advanced group?" Well, there's a basic and advanced level of knowledge with anything, so what I would do is forget about replication immediately. Don't pursue that. It's really not the future – it's not an HADR feature, for one, and it's not really the direction that anyone's going to be heading in if they need HADR. As SQL Server progresses, and Availability Groups – and now Basic Availability Groups on Standard Edition – get, you know, more robust and less brittle, that's going to be the pretty obvious HADR solution for just about everyone moving forward.

But like I said, there's a basic and advanced level of knowledge with anything, you know. Knowing how to create an index is a whole lot different from knowing how to tune an index. The same thing goes for query writing, you know, and knowing how to write a query is a lot different from knowing how to tune a query. So it really depends where you see yourself as a DBA.

The days of the Swiss Army Knife DBA are slowly drying up. If you want to really be senior at anything, you have to specialize like a son of a gun, and you have to get really good at something. There are simply too many products, too many features, too many quirks, and too many other things for someone to really have a senior level of knowledge across the board. I'm sure there are some people who can get there, but they're people who have had those building blocks for years now. Take someone like Bob Ward, who can troubleshoot an Availability Group, tune a query, figure out indexes, you know, run a debugger, do stuff in Extended Events. There are obviously renaissance men out there who can do all that, but one of my absolute heroes in the world, Paul White, probably knows little about Availability Groups but knows the optimizer intimately.

So if you want to be a senior DBA, you have to specialize, you have to figure out what you’re passionate about as a DBA. If it’s the HADR stuff then you have your direction there. If it’s the query and index tuning then that’s your direction for that.

Richie Rump: Yeah, and now, since this is a future goal and you’re looking into becoming a DBA, then you still need to understand the concepts so you could actually talk intelligently about them. So if you’re going for a job and you’re not focusing on replication but they’re doing replication, you at least want to have a conversation about that and understand why replication isn’t the future and why this other technology is. So, you know, don’t avoid it but don’t go in depth if it doesn’t interest you.

Erik Darling: Obviously, you know, DBAs need a certain level of knowledge about little things, even if they’re not going to specialize in it. Like, if you’re going to call yourself a DBA, you should at least know how to take a backup and do a restore without Googling too much. But, you know, if that’s not your primary goal as a DBA – if your goal is not the backup and restore infrastructure person then obviously you don’t need to know every single in and out about buffer counts and max transfer sizes and all that other stuff. So really pick the route that you want to take as a DBA, or, you know, it seems like a pretty solid decision as a BI developer now. Take your time in figuring out what you really like about the DBA role and then pursue that in a seniorly way, I guess. I don’t know how else to say that.

Richie Rump: No, I think it’s good.

 

Do I need more than one TempDB data file?

Erik Darling: Good, I got passing grades from Richie so I’m ready to ask the next one. Steve Malcolm, who we’ve been talking to via email, is still confused about tempdb files. Yes, Steve, you need your NDF files, those are secondary data files. Since this is tempdb, you’re going to want to have more than one data file.

Steve sent us an email yesterday asking why he had multiple tempdb log files, but it turned out they were both data files. Now, something that's changed with SQL Server 2016 and 2017 setup – prior versions used to just give you the one tempdb data file – is that during the setup process, a screen comes up that looks at how many CPUs you have in your system and gives you one tempdb data file per core, up to, I think, like eight, and then it stops there.

The bottom line of that is – like the link I sent over yesterday about configuring tempdb says – configuring multiple data files on tempdb is generally a wise thing to do. SQL Server can use those files in a parallel manner; it can write to a whole bunch of them at once, which can reduce contention and all sorts of other stuff if you have a busy tempdb. The nice thing about 2016 and 2017 is they made the behavior of trace flags 1117 and 1118 the default, so you don't have to turn those on anymore. No, you don't. Alright, let me catch my breath for a second here.
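As a rough sketch of what Erik is describing: you can check how many tempdb data files you have, and add more if needed. The file path and sizes below are purely illustrative examples, not recommendations for your server:

```sql
-- How many tempdb files do we have, and of what type?
SELECT name, physical_name, type_desc, size
FROM tempdb.sys.database_files;

-- Hypothetical example: add a second data file, sized to match the first.
-- The path and sizes here are made up; use your own drive and sizing.
ALTER DATABASE tempdb
ADD FILE (
    NAME = tempdev2,
    FILENAME = 'T:\TempDB\tempdev2.ndf',
    SIZE = 8GB,
    FILEGROWTH = 1GB
);
```

Keep the data files equally sized so SQL Server's proportional fill spreads activity evenly across them.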

Richie Rump: I don't know what you just said, but it sounded good to me, but, you know, again, I'm not [crosstalk] that that guy is… But then again, I couldn't get, you know, you to deploy a serverless application either, so.

Erik Darling: No, no, and then, you know, that's the wonderful thing – Brent can find people who can do the little bits and bobs of exactly what he needs. He has you if he needs to do some crazy serverless development task that I would have no hope of even figuring out step one of. He has me and Tara for the consulting and the DBA type stuff. He's got things pretty solid; at least I hope he does, anyway. If he doesn't, I don't know…

Richie Rump: What do we need Brent for? I think a lot of things, actually.

Erik Darling: He’s like dad; we bring him along because he buys stuff, right.

Richie Rump: Yeah, yeah that’s right, yeah.

Erik Darling: Pays for the movie tickets – he’s a nice guy. He’s fun – he’s fun.

 

I need to downgrade Enterprise to Standard…

Erik Darling: Let’s see – oh boy, Doug says, “I have a situation…” Not our old Doug. “I have a situation where I need to downgrade SQL 2016 Enterprise to Standard. If I maintain copies of master, model and MSDB and after uninstalling…” Oh boy, jeez. You know, I’ve never had to go through that process. I would highly, strongly, firmly suggest that you try that somewhere because I don’t even know that you would be able to use those copies.

Richie Rump: But it’s the same version, it’s just different editions. I would – yeah, definitely test it out. I’m thinking it probably would work. I don’t know.

Erik Darling: I would want to test that out. I would be concerned about all sorts of things with that. I wouldn’t want to rely on that for a downgrade.

Richie Rump: The first thing that came to mind is that you would just move all your objects over, SSIS or something, you would move the data. But it’s the same version, just different editions, so maybe…

Erik Darling: Maybe, but here's what I would want to do instead: I would want to side-install. I wouldn't want to uninstall the first one completely; I would want to side-install Standard Edition. And this is actually a good use for PowerShell – the dbatools.io people have written all sorts of cmdlets to, like, copy settings and stuff from one server to another. It's not quite desired state configuration, but if you need to migrate a bunch of stuff over all at once, it might be a good option. There are a whole bunch of cmdlets over there that can do that and other cool things.

So head over to dbatools.io, side-install Standard Edition and then script out whatever stuff you want to move over to there. I wouldn’t rely on the uninstall-reinstall and like either have SQL 2016 pick up on the new files or try to like restore stuff over. That just sounds like a nightmare to me. If you do that, you have absolutely no back-out clause. You are stuck. You are hosed with whatever setup you get yourself into. If you do the side install and you just script out the stuff that you need to move over, I think that puts you in a much safer place, even if you have to…
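A sketch of the dbatools approach Erik is describing – the server names here are made up, and you should check docs.dbatools.io for the current cmdlet parameters before relying on this:

```powershell
# Install dbatools from the PowerShell Gallery
Install-Module dbatools

# Hypothetical names: OLDBOX is the Enterprise instance,
# OLDBOX\STD is the side-installed Standard Edition instance.
Copy-DbaSpConfigure -Source OLDBOX -Destination OLDBOX\STD
Copy-DbaLogin       -Source OLDBOX -Destination OLDBOX\STD
Copy-DbaAgentJob    -Source OLDBOX -Destination OLDBOX\STD

# Move the user databases over via backup/restore through a share
Copy-DbaDatabase -Source OLDBOX -Destination OLDBOX\STD `
    -AllDatabases -BackupRestore -SharedPath \\fileshare\migration
```

The big advantage over uninstall-and-reinstall is exactly the back-out story: the original instance stays intact until you've verified everything works on the new one.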

Richie Rump: And if this is a virtual machine, then just spin up a new virtual machine and just do it that way. But wow, good luck because that sounds like a rough one.

Erik Darling: Yeah, that’s not fun at all; the stuff that does not make me want to have a real job ever again.

 

Erik Darling: Alright, Pablo asks an interesting architectural question. “I have a view that gets three billion rows to make queries there.” I don’t know what that means exactly. “Those rows are across 40 plus databases making unions. Is this structure okay? May I have less databases or just one to avoid the unions?” This is generally the kind of thing that one would want to avoid.

Richie Rump: Yeah, we’d consider that not a best practice.

Erik Darling: Yeah, it's a little bit more than one could diagnose without looking at things, you know, in a 30-minute free webcast, but generally that's not a design pattern that we would want to put folks into. If you're doing, like, the one-database-per-client thing, I'm totally on board with that. I think that's a much better pattern than the giant database teeming with everybody, even if they're separated by schema or whatever other crazy things. I think that's a much smarter way to do things.

If you really need to query data across all the databases, perhaps a single database that you pull data into, rather than having to execute queries across all of them at once, would make more sense. Like, you could have just a data-dump database and have SSIS move stuff in incrementally over the course of whatever day or period you need. Then again, if what you have written isn't slow or isn't causing any problems, don't change a thing until it causes problems.

Richie Rump: Yeah, when I think about joining three billion rows in a union…

Erik Darling: That might slow things down.

Richie Rump: Yeah, I don't know what you're trying to do, but trying to do that with a view would give me the caution to say, wait, what are we trying to do, for one, and is there a better way of doing it than in a view? Maybe there's a different way we could go about this. I'm assuming that because you have 40 different databases, it's the same schema but all for different customers, and those are tough architectures, especially when you're trying to join them all together and pull data off of them.

Typically, you want to put them all into one database, but there are different solutions for different things that you're trying to do. So three billion rows across all these different databases, trying to get them synced together and all that – that's tough as well. So that's a tough one, but yeah, it all depends on what you're trying to do and what data you're trying to [inaudible] from and what information.

 

What do you use to monitor SQL Server?

Erik Darling: Cool, alright, Sree asks a question. And I swear to god, this is a question, not a plant so that we can talk about ConstantCare again. It is in the transcript; I'm looking at it right now. My name is not Sree and as far as I know, neither is Richie's. "What kind of monitoring tools do you use to monitor SQL Server instances? Do you have any thoughts on the Grafana dashboard and alerting?" Well, no I don't, and it's not because I think it's bad; it's just because I've never used it or seen it. I don't really know.

So Brent's been posting some great posts lately about our pseudo-monitoring tool. But like he says, it's more of a mentoring tool. So you have, you know, a good set of vendors out there – SentryOne, Quest Spotlight, people like that – who have a cool monitoring tool with a good dashboard that gives you pretty similar metrics across a whole bunch of different areas in SQL Server. But the direction we went is a little bit different, because with any monitoring tool, you need to A, figure out how to use the monitoring tool, B, figure out how to dig into it deeper, and then C, learn how to solve the problems that the monitoring tool is showing you.

The direction we’re going with it is, like Brent calls it, a mentoring tool. So what we’re going to do is we’re going to get all the data that a monitoring tool would, we’re going to analyze it, we’re going to do that work for you and then we’re going to send you an email that tells you which buttons to push and which things to do to try to solve those problems. So it’s a little bit different take on monitoring. There’s no dashboard involved. There is no – you don’t get to, you know, look at uptime, downtime, green lights, red lights, you know, charts and graphs and all that stuff. But you also don’t have to learn how to read all those charts and graphs and do all the work behind the scenes.

Richie Rump: Yeah, some of those metrics always made my eyes just cross. Oh, it’s red, is that bad? Is that really a problem? Having all those wait stats, is that a problem? What does that actually do? And then you have to figure out, okay, well what does it actually do? And then you have to do more investigation – is that actually a problem for our server?

Erik Darling: A lot of monitoring tools will just show you whatever is high. So it's like, if some completely meaningless metric just happens to spike up, like buffer cache hit ratio or, like, I don't know, what's another weird one? I don't know, stuff like that. Stuff just pops up and it's like, oh my god, freak-out. And you're like, what does that mean? And you're like, I don't know, is it bad? How many page latches do I have? I don't know, what's going on? You just get so confused staring at all these different numbers that are just changing all the time, and you have no idea if they're good, bad or ugly.

Richie Rump: Yeah, and you know, maybe you had all these extremely high page splits and, oh my gosh, we have a problem, and yet you did a huge load into the database the day before and now you’re like, oh I’ve got a problem. Oh, but wait, there was a load that went in…

Erik Darling: Fun fact, SQL Server counts new page creations as a page split; ha-ha. No one’s looking into that. You insert a million rows, it’s like, oh no, I have a million page splits, dear god, what am I going to do? Funny…

 

What areas should a DBA focus on?

Erik Darling: Chris asks kind of an interesting question, "What are other areas to focus on as a DBA besides HADR and query tuning?" Well, I guess dev ops might be a thing.

Richie Rump: Starting to.

Erik Darling: I know thankfully little about dev ops so far, so I’ll let Richie expand on that one a little bit.

Richie Rump: Oh thanks, sir, so much. I’ve worked on a couple of dev ops teams. Essentially, it’s the idea of…

Erik Darling: Are you a scrum master?

Richie Rump: I am a Certified Scrum Master, oddly enough. I have my Agile certification, which means I can read, kind of. It was the dumbest test ever. It was like, the answers were in the questions, and it's like, you think some Microsoft tests are bad – woo, Certified Scrum Master was even worse. But essentially, it's the idea of merging your development and your operations together. So as opposed to throwing things over a wall and letting the DBAs handle it, the DBA works very closely with the team, or is usually on the development team, and they kind of work together on these things.

So your dev ops person would be, kind of, a merge of a database developer and a DBA into one bundle. They typically would know lots of scripting languages and, kind of, handle things together. They do things like dashboards and monitoring and make sure everybody understands the state of the system and things like that. So yeah, I mean, it's an option of a way to go. Someone like myself, I could be considered a database developer. I consider myself a data-veloper; you have to pay me if you want to use that. But essentially…

Erik Darling: Copyright it.

Richie Rump: Exactly, so I use [crosstalk]… But essentially, I stand in the middle between the development team and the data team and I make sure all that stuff in the middle goes very fast, whether it's using ORMs or software or whatever that is. So I kind of have to do both the software and the data side. Not a lot of us guys around, so I hear it's a growth industry. You may want to look into that.

Erik Darling: Yeah, software is a big thing apparently; I never would have noticed.

Richie Rump: You know, not a lot of us software developers have opinions. I need you to know that. We're really easy-going guys. Oh, you like that? Oh, I like Visual Basic, isn't that great?

Erik Darling: Yeah, let’s high-five and hug and let’s like talk about it civilly over lunch. Let’s not talk about, like, the weird version to version issues that we have.

Richie Rump: Yeah, let's meet on Stack Overflow. I've got nothing but great things to say about your question.

Erik Darling: Yeah, it’ll be fun. I love the way you handle nulls. How fast can you serialize that array? I don’t know…

 

Do I need ConstantCare if I have a monitoring app?

Erik Darling: Alright, Daryl asks a sort of follow-up question to the monitoring thing. He says, "I already have a monitoring service 24/7. What would be the point of ConstantCare? Would I want to keep my 24/7 service?" It's not a replacement for a 24/7 service. This is something that collects data once a day, though right now it checks in with you once a week. Like Richie said, it will eventually move to once a day, where we look at your server, we try to see what's good, bad, and ugly about it, and we send you an email so you know. We say, hey, this is what we saw on your server, these are the things going on, these are the things we think you should fix, and here's how to fix it, with links to training videos and stuff like that. So it's not a replacement for a 24/7 service – that's going to give you, like, up, down, server's-on-fire type stuff. What ConstantCare is going to do is take a look at a server over time, figure out if it's doing better or worse, and show you how you can make it do better.

Richie Rump: Yeah, I like the way he put it: it's a mentoring tool. And some of the feedback we've gotten from some people is, like, I'm a solo DBA and it's like I have a co-worker sitting next to me that I can kind of bounce ideas off of, because you have a certain level of access to us as well through the service. It's new, it's different. I would suggest reading the posts that Brent is putting on the site every Monday. I think there's three of them out already. I've been way deep in the code, so even when I read some of that stuff, it's like, oh okay, yeah, that's right – and I've been working on it for a year. So if you want to talk more about this service and development and what it takes and all that stuff, then I'm your guy. But if you want to know about what it can do for you and things like that, those posts are phenomenal.

Erik Darling: There’s a whole bunch of new ones coming out.

Richie Rump: Every Monday, I think, they’re going to be dropping, right.

Erik Darling: Yeah, for a while anyway.

 

If I need to shrink my database…

Erik Darling: So we're going to finish up. Tom asks, "Is it okay to do a full backup, then switch the database to recovery mode simple…" – you're lucky Tara's not here to hear you say recovery mode – "run DBCC SHRINKFILE, and then switch the recovery model back to full?" Bad news: your log backup chain is broken until you take a full backup after you switch it back to full, so I don't know. Don't shrink your databases – just please stop doing it. It's a bad idea. It's not a good idea. Like, since 2009 it's been a known bad idea. Don't do it anymore. Please stop.
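For the record – and Erik's advice stands: don't make a habit of this – the sequence Tom describes, with the step that fixes the broken backup chain tacked on the end, looks roughly like this (MyDB and the file name are placeholders):

```sql
-- Switching to SIMPLE breaks the log backup chain immediately.
ALTER DATABASE MyDB SET RECOVERY SIMPLE;

-- Shrink the log file to a target size in MB (logical file name varies).
DBCC SHRINKFILE (MyDB_log, 1024);

-- Flipping back to FULL is not enough by itself:
ALTER DATABASE MyDB SET RECOVERY FULL;

-- Log backups will fail until a new full (or differential) backup
-- restarts the chain, so take one right away.
BACKUP DATABASE MyDB TO DISK = N'X:\Backup\MyDB.bak';
```

Until that final backup completes, the database behaves as if it were still in simple recovery for point-in-time restore purposes.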

 

Can I use SETUP to change editions?

Richie Rump: Actually, before we drop, Stephanie had a suggestion saying that, “Setup.EXE has an option to downgrade editions,” which I did not know.

Erik Darling: I’ve never seen that either. Congratulations Stephanie.

Richie Rump: So maybe it’s as easy as just running setup.EXE again and just say downgrade this sucker.

Erik Darling: There you go. Ooh, but that would be interesting if they’re using Enterprise features. So if you were using TDE, you wouldn’t be able to downgrade.

Richie Rump: Yeah, TDE or – well, I guess not even partitioning now, because that's free too, right?

Erik Darling: SP1, yeah, 2016 SP1. But there’s a whole bunch of stuff that’s still not, like, available to everyone. Alright, well we should go now; get back to work. Richie has cloud stuff to do. I think I have other things to do too, so…

Richie Rump: I’ve got database tuning.

Erik Darling: Wahoo, you do more of that than I do these days.

Richie Rump: Yeah, it’s all the bad code that I’ve written.

Erik Darling: Ah well, someone’s got to write it. Alright folks, thanks for showing up. We will see you next week; goodbye.


“But I don’t need to back up that database.”

You’ve got this database that you swear doesn’t need to be backed up, like:

  • Read-only databases
  • Schema-only databases with no data in ’em
  • Archive databases that haven’t changed in years
  • Maybe even system databases like master/model/msdb

And you think you’re going to save some time in your maintenance windows by skipping backups and corruption checks on those databases.

Lemme tell you a few stories. Everything’s been abstracted to protect the not-so-innocent.

The Successful Failed Deployment – RetailCo needed a software package to manage their inventory, so they bought one from SoftwareInc. RetailCo set up the server and did the backups, but SoftwareInc’s installer absolutely required SA permissions, so they got it. Over time, one of SoftwareInc’s deployment scripts had a bug: it didn’t change the default database, so they accidentally deployed some of their tables in the master database. The app continued to work perfectly – but real user data was going into master. RetailCo wasn’t backing up the master database, figuring they’d just move the user databases over to another server. You can guess what happened when the server crashed.

The Mostly Read-Only Database – The data warehouse archiving strategy involved one database per year, with databases going back over a decade. The older databases were marked read-only, so the DBA figured he didn’t need to back them up often. After a migration to a set of brand-new, way-faster SQL Servers, user complaints started coming in about incorrect query results. Turns out each database had views with tax logic in them, and every time the developers did a release, they’d alter the archive databases to make them writable, change the views, and then alter them back to read-only. No big deal – just apply the same view changes to the archive databases again, right? Well, the project manager was white as a sheet and said, “I’m afraid we need to talk about how we’ve been doing something similar to correct product categories for the accounting team.”

The Truly Read-Only Database – The DBA was absolutely positive no one could write to the archive database because he’d done a great job of locking down permissions. He backed up the monster 10TB archive database once, and then faithfully ran CHECKDB on it as often as he could – not too often, given the size. Years later, the CHECKDB failed – because after all, the SAN can still write to the database regardless, and sometimes, storage gets a mind of its own. The DBA said no problem, I’ll just restore it from backup – and then found out that backups over two years old had been deleted as part of the company’s compliance projects. (Yes, in some private companies, folks actually have an obligation to delete data to stay in compliance.)

The Priceless Schema – A SaaS company came up with a creative version management solution. Customer data lived in one database – call it DataDB – but stored procedures & views lived in another, call it LogicDB. With every release, the app built a brand new LogicDB database with all the new stored procedures, views, etc – like LogicDBv1, LogicDBv2, LogicDBv3, etc. When a SaaS customer migrated to a newer version, the support team just changed the default database for that customer’s SQL login, and presto, it started using the new objects. Seemed brilliant at the time, letting customers live on different version levels while calling the same stored procedure names. The DBA didn’t bother backing up the Logic databases – there had been hundreds built up over time, and they could all be regenerated by source control, right? And the DBA had been using a script to sync logins from production to disaster recovery, but she didn’t think about syncing their default databases, too (and the support team was constantly changing them based on what version of the app they were on.) As a result, when they failed over to DR, it was more D, less R.

Brent’s rule: on a production server,
every database is a production database.

Don’t take the initiative to cut corners around data. Your first obligations are to serve and protect the data. Start there first, not with “time-saving” or “cost-cutting” ideas that don’t really accomplish either.


First Responder Kit Release: Driving Miss Data

Ugh. Bungabase slow.

When Grog try count meat, Glug eat meat.

Then Grog get hungry and Glug try count meat.

Bungalock.

Need big head make bungabase fast.

You big head?

You can download the updated FirstResponderKit.zip here.

sp_Blitz Improvements

#1459: A while back, Brent blogged about implicit transactions. In the comments, a smart commenter named Dan pointed out a way to find that with a couple of DMVs. If you’d like that information to be available in more places, please consider voting for my Connect item. This check was also added, in different ways, to sp_BlitzWho and sp_BlitzFirst.
#1483: Add CPU speed to the power settings check.
#1481: Brent made stuff compatible with Managed Instances. Hopefully that holds up.
#1465: Someone had a problem in German. This is an infrequent occurrence.

sp_BlitzCache Improvements

#1516: Added support for correctly displaying Unicode database names.
#1489: We had a whole heck of a lot of checks running. It’s not that they weren’t useful, it’s just that they created a lot of noise, and may not all have been immediately actionable. All that XML parsing was starting to slow things down. I took a bunch of them and changed the code so they only execute if you run it with @ExpertMode = 1.

sp_BlitzFirst Improvements

#1459: See note in sp_Blitz

sp_BlitzIndex Improvements

#1513: Added support for correctly displaying Unicode database names.
#1509: The column names column in the #statistics table is now NVARCHAR(MAX). This was reported to us in the Slack channel! Thanks to Sanket for letting us know.

sp_BlitzWho Improvements

#1459: See note in sp_Blitz

sp_DatabaseRestore Improvements

Nothing this time around

sp_BlitzBackups Improvements

Nothing this time around

sp_BlitzQueryStore Improvements

Nothing this time around

sp_AllNightLog and sp_AllNightLog_Setup Improvements

Nothing this time around

sp_foreachdb Improvements

Nothing this time around

PowerBI

Nothing this time around

sp_BlitzLock

#1500: Column Store deadlocks generate slightly different XML. We now account for those.

#1469: Fixes an issue where some deadlocks were only counted once. Thanks to @HolisticDeveloper!

For Support

When you have questions about how the tools work, talk with the community in the #FirstResponderKit Slack channel. If you need a free invite, hit SQLslack.com. Be patient – it’s staffed with volunteers who have day jobs, heh.
When you find a bug or want something changed, read the contributing.md file.

When you have a question about what the scripts found, first make sure you read the “More Details” URL for any warning you find. We put a lot of work into documentation, and we wouldn’t want someone to yell at you to go read the fine manual. After that, when you’ve still got questions about how something works in SQL Server, post a question at DBA.StackExchange.com and the community (that includes us!) will help. Include exact errors and any applicable screenshots, your SQL Server version number (including the build #), and the version of the tool you’re working with.

You can download the updated FirstResponderKit.zip here.


Hash Join Memory Grant Factors

Buskets

Much like Sorts, Hash Joins require some amount of memory to operate efficiently – without spilling, or spilling too much.

And to a similar degree, the number of rows and columns passed to the Hashing operator matter where the memory grant is concerned. This doesn’t mean Hashing is bad, but you may need to take some extra steps when tuning queries that use them.

The reasons are pretty obvious when you think about the context of a Hash operation, whether it’s a join or aggregation.

  1. All rows from the build side have to arrive at the operator (in parallel plans, usually after a bitmap filter)
  2. The hashing function gets applied to join or grouping columns
  3. In a join, hashed values from the probe side are checked against the hash table built from the build side
  4. In some cases, the actual values need to be checked as a residual

During all that nonsense, all the columns that you SELECT get dragged along for the ride.

Here’s a quick example!

This query doesn’t return any rows, because Jon Skeet hadn’t hit 1 million rep in the data dump I’m using (Stack Overflow 2010).
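The query appeared as a screenshot in the original post; a sketch of what it likely looked like, using the Stack Overflow database's Users and Posts tables (exact columns are an assumption):

```sql
SELECT u.DisplayName, p.Score
FROM dbo.Users AS u
JOIN dbo.Posts AS p
    ON p.OwnerUserId = u.Id
WHERE u.Reputation >= 1000000;  -- nobody qualifies in the 2010 dump
```

With no helpful indexes on the join columns, the optimizer picks a Hash Join, and the grant gets sized for the build side's estimated rows and columns even though zero rows come back.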

Despite that, the query asks for about 7 MB of memory to run. This seems to be the lowest memory grant I could get the optimizer to ask for.

Hashtastic

If we drop the Reputation filter down a bit so some rows get returned, the memory grant stays the same.

That’s why I’m calling 7MB the “base” grant here — that, and if I drop the Reputation filter lower to allow more people in, the grant will go up.

Creepin and creepin and creepin

But we can also get a grant higher than the base by requesting more columns.

ARF ARF

This is more easily accomplished by selecting string data. Again, just like with Sorts, we don't need to actually sort by the string data for the memory grant to go up; we just need to make it pass through a memory-consuming operator.
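A sketch of that variation – same query shape as before, but now dragging a couple of nvarchar columns through the hash join (column names from the Stack Overflow schema; an illustrative example, not the post's exact query):

```sql
SELECT u.DisplayName, u.Location, p.Score  -- string columns ride along
FROM dbo.Users AS u
JOIN dbo.Posts AS p
    ON p.OwnerUserId = u.Id
WHERE u.Reputation >= 1000000;
```

The join predicate and row counts are unchanged; only the payload columns grew, and that alone is enough to push the grant above the base.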

Thanks for reading!

Brent says: you remember how, in the beginning of your career, some old crusty DBA told you to avoid SELECT *? Turns out they were right.


Office Hours Guest Instructor Month

Company News
0
Andy Leonard

In our weekly Office Hours, we hang out and answer questions from you, dear reader.

In April, we’re trying something new: guest hosts. If you’ve got a topical question or something you want to ask a specific person, now’s your chance:

Register now.


Missing Index Impact and Join Type

Indexing
8 Comments

Just Another Way

No matter how you delve into missing index requests — whether it’s the plan level, DMV analysis, or (forgive me for saying it), DTA, the requests will generally be the same.

They’ll prioritize equality predicates, the columns may or may not be in the right order, the columns may or may not be in the right part of the index, and the impact…

Oh, that impact.

It’s all just a cry for help, anyway.

Like a teenager watching anime and buying intricate parasols.

Salted Grains

If I run these three queries with different join types:
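
(The original queries appeared as a screenshot; this is a sketch of their shape, with the date predicate assumed:)

```sql
SELECT p.Score
FROM dbo.Users AS u
INNER LOOP JOIN dbo.Posts AS p ON p.OwnerUserId = u.Id
WHERE u.CreationDate >= '20131201';

SELECT p.Score
FROM dbo.Users AS u
INNER MERGE JOIN dbo.Posts AS p ON p.OwnerUserId = u.Id
WHERE u.CreationDate >= '20131201';

SELECT p.Score
FROM dbo.Users AS u
INNER HASH JOIN dbo.Posts AS p ON p.OwnerUserId = u.Id
WHERE u.CreationDate >= '20131201';
```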

They’re all going to ask for the same missing index:

Kinda weird already, that a join column is an INCLUDE, but hey.

What’s even weirder is that they have diminishing estimated impacts based on join type.

  • Loop: 99.1954%
  • Merge: 42.7795%
  • Hash: 28.7901%

It gets a bit stranger if I force parallelism!

Now the estimated impacts look like this:

  • Loop: 98.9569%
  • Merge: 29.1982%
  • Hash: 44.2455%

The Hash and Merge join impacts have just about changed places.

The funny thing is…

I totally agree.

As far as indexes go, that’s a crappy index for the Merge and Hash Join plans. But no better one is being offered, not even sneakily.

For the nested loops plans, it’s super easy to grab the UserIds for that date, and dig into the Posts table for just those.

For the merge join plans, it’s less helpful. While it’s nice that we can easily filter the date predicate, we still have to order our data for the merge join. There are further complications in the parallel version.

For the hash join plans, it’s a similar situation. We need to create a hash table on UserId. Having UserId in the leaf of an index on Date doesn’t help us much; at least, not as much as it would if it were in the key.

In short, both the merge and hash join plans have cost-increasing operators thrown into the mix that the index, as requested, just wouldn’t help with.

Thanks for reading!

Brent says: in the Mastering Index Tuning class, these types of examples are why I tell students that you should look at one-key-column index suggestions carefully. In most cases, SQL Server really needs one (or more) of the included fields to be in the key. The hard part is figuring out which one(s) without looking at the execution plans.


Building SQL ConstantCare®: Product, Packaging, and Pricing

SQL ConstantCare
8 Comments

Our new SQL ConstantCare® is our first paid online service – it tells you what you should do next on your SQL Servers. I’ve written about the vision and the architecture design, and this week I’m continuing to share the behind-the-scenes planning.

When you buy it, how exactly should we charge you? There’s a few options:

  • Priced per-server (say, each server is $X per month)
  • Priced per-user
  • Priced per-tier (maybe basic reports are $X, and more in-depth reports are $Y, or maybe it’s based on the frequency of reports)
  • A combination of the above (like $X per user for the first Y servers, then $Z for unlimited)
  • Freemium – free for up to X servers or users (and freemium could be mixed with either per-server or per-person pricing)

I ruled out per-server pricing early on because I wanted users to be excited to use this on all of their servers. I’ve talked to a lot of monitoring software customers who said things like, “We have 1 license of our monitoring software, and we move it around from server to server based on what’s having problems.” Screw that – I need to get data on your server when it’s healthy, before the problems strike, so I can identify things like good execution plans before they go awry.

I kinda wanted to try freemium. Because we built it with cost-effective serverless technologies, we could really afford to process the data for tens of thousands of SQL Servers for free. However, that’s only the processing – not doing tech support or improving the service. Those cost money, and I wasn’t sure I’d be able to convert enough free customers to a high enough premium tier to pay for that support & development. (Remember, I’ve got the ISV background, and I know how expensive those departments can be.)

One could also argue that there’s a real value in getting performance & health data on all of the world’s SQL Servers, and then using that data to identify the right customers for consulting and training. It might even make us a strong acquisition target because we’d have data that other ISVs, consulting companies, and cloud hosts might want. However, companies probably don’t want to acquire a list of customers who aren’t even willing to pay $10 to make their SQL Servers faster or more reliable. And worse, I didn’t wanna be acquired out of desperation because we’d run out of money, and that seems to be a recurring theme with companies that get acquihired.

Besides, in a way, I could kinda do limited-time-freemium by doing a Free Server Friday, running sales, giving folks a free trial for a week, or even using it as part of our annual salary survey. One of the things I loved about WooCommerce’s pricing & coupon handling is that there’s a lot of flexibility.

I settled on simple per-user pricing for the MVP (Minimum Viable Product – a concept in Lean Startups, something I’ll be talking about later.) As we add features down the road, we can add additional tiers like Enterprise, ISV (who needs to watch their customer servers), or Consultant (yes, I want freelancers to be able to use this to understand what’s happening on their clients’ servers.)

Then I had to decide what was included.

The beauty of controlling your own services and content is that you can figure out how you want to bundle them together for the best customer value. We do consulting, training, and now online services – so what was the right way to sell the new service? Should it be standalone, or bundled with:

We might just let you pay with these
  1. Access to our video training library (because the emails might point you there, telling you in more details about the task you need to perform on a server)
  2. Access to our live online classes
  3. Access to our consultants – could be private emails, or via our Slack channel, or comments on your PasteThePlan plans, or X hours of WebEx time per month – which could be expensive for us, but as Paul Graham suggests, try things that don’t scale
  4. Access to entirely new stuff – like perhaps deeper levels of analysis for PasteThePlan – I ruled this out because I didn’t want to spend any more development time than I had to, at least not yet

So say for example we give you access to #1, but we charge for #3 – here’s what one of the emails might look like:

You’re currently not running CHECKDB, and based on your data volumes and your low server horsepower, I don’t think you’re going to be able to start. For now, you should set up a weekly CHECKDB job with the PHYSICAL_ONLY option. Here’s a video explaining how to do it, and if you want to learn more about the mechanics behind my recommendation, here’s a 1-hour video that goes even deeper. If you’re not confident in how this works, click here to buy & schedule a 30-minute call with the next available consultant.
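
The homework task in that sample email boils down to one command, scheduled as a weekly Agent job (a sketch; the database name is a placeholder):

```sql
-- PHYSICAL_ONLY skips the deeper logical checks, so it finishes
-- faster on servers without the horsepower for a full CHECKDB.
DBCC CHECKDB (N'YourDatabase') WITH PHYSICAL_ONLY, NO_INFOMSGS;
```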

Or, if I had a much higher price and included a set number of consultant calls per month, that last sentence might look like this:

Got questions about any of your homework tasks this week? Click here to schedule a 30-minute session with the next available consultant. You have 3 remaining free sessions this month.

For the first version, I decided to keep it simple: email advice, plus access to all of our self-paced recorded video training. In the future, we might add in a Pro level with bundled pricing for access to our consultants, or even refer folks to other consulting firms that are good fits for the problems they’re facing.

The packaging and pricing affected our other products, too.

Oh you’ll know when the marketing launch hits, trust me

We’re taking a relaxed approach to launching this, just kinda putting it out there on the site. We figured we’d learn a few quick lessons about the install process, troubleshooting, email quality, market interest, and how it’d affect our other products. (And we did!)

Our recorded videos used to be available at various monthly subscription prices. We started by throwing all that out and boiling down to just one product, SQL ConstantCare®. This way, as you’re going through the video training, you can get tips on what modules you should be watching that will be most relevant to your own servers. (We have so much video training that people often ask what they should watch first – this solves that problem.) This helped us learn if search visitors were interested in buying the full-blown SQL ConstantCare® product, or if we still needed to sell the videos separately.

For our live training classes, the corporate purchasing department often buys the tickets for the students. Those purchasing folks get twitchy when they see a subscription renewal notice in their cart, like they’re going to get charged again later – so for now, we’re not including SQL ConstantCare® in the live class purchases. I think we’re going to soon, though – I’ll just tweak the checkout process so that they get one year of SQL ConstantCare® with no warnings about renewal.

There was a consulting change right away I didn’t expect: sales prospects started asking, “Which service is right for me, SQL Critical Care® or SQL ConstantCare®?” The way I’ve answered it is that our 3-day SQL Critical Care® is like an emergency room for SQL Server where we work together face to face to solve an urgent problem. SQL ConstantCare® is more like ongoing therapy where I send you one email per week with advice over a long period of time. I’ll need to work on the web site to make that decision easier for folks.

In subsequent posts in this series, I’ll talk about drawing the line for the Minimum Viable Product (MVP), what we learned from the first few rounds of users, how we’re aiming for GDPR compliance, and more. Next week: what we decided to collect in v1 – and what we avoided.


[Video] Office Hours 2018/3/21 (With Transcriptions)

This week, Brent, Erik, and Richie discuss replication, SQL source control with Git, installing service packs and cumulative updates in SQL Server, query tuning, write errors, log backups, Azure managed instances, third-party backup software, and DBA porn.

Here’s the video on YouTube:

You can register to attend next week’s Office Hours, or subscribe to our podcast to listen on the go.

Enjoy the Podcast?

Don’t miss an episode, subscribe via iTunes, Stitcher or RSS.
Leave us a review in iTunes

Office Hours – 3-21-18

 

Erik’s speaking at SQL Saturday NYC

Erik Darling: I know that we’re due to start soon but I wanted to just let anyone watching know that I’ve been accepted to present at SQL Saturday in New York City on May 19th. So if anyone is, I guess, anyone from the tri-state area wants to come on down and watch me babble about stuff, I’ll be there. One session I know that got selected was Query Optimization Explained in 15 Demos. I might have a second one since they said that there are still some open slots, so I don’t know.

Brent Ozar: Awesome.

Erik Darling: So I’ll be doing at least one there. So I look forward to maybe seeing all of you from a distance.

Brent Ozar: And we’re also always ready – or we’re always interested – to hear what people would want to hear us talk about too. So if you’ve got ideas for sessions you’d want to hear us give, feel free to put that into the questions as well.

Erik Darling: Replication. Fixing replication disasters.

Brent Ozar: Damn it…

Erik Darling: What stinks is you’re going to say that and people are going to say, like, a lot of HADR topics are going to come up and I am just gasless when it comes to HADR topics. I have nothing to say about failover clusters anymore. They exist; I’m sorry.

 

I have replication and Always On Availability Groups…

Brent Ozar: Speaking of which, so RimJim starts with a question. He says, “Hello, I have a replication question. We have Always On set up and the primary is also the publisher. We don’t have an AG installed. If I want to remove the main publisher…” None of us in here know. This is the funniest thing about doing the replication. One of the slides [inaudible] talks about going to dba.stackexchange.com for posting multi-paragraph questions, and we just tend to avoid replication like the plague; really because it is the plague.

Erik Darling: It is. It is a pox upon your servers. It’s such, like, an outdated technology at this point. There’s just so many replacements for it that are easier to deal with and are less brittle and break less.

Brent Ozar: And it’s not like it’s bad. I get that people go, oh this works for me and I want to keep using it. That’s cool, just you get that, as you’ve tried to ask questions because the thing breaks all the time and you’re trying to get changes made to it and nobody’s around to answer, you go – man, go look at Amazon.com – go look at the books on replication and then look and see what the authors are doing now; it’s not replication. They’ve all gone somewhere else.

Erik Darling: If you need, like, a primer in some troubleshooting stuff, Redgate has a free PDF about replication, but it’s old. It’s from 2009 or 2010 or something. Like, no one’s keeping up to date on that. Like, replication just hasn’t changed where there’s like a constant need for replication experts to keep their skills up to date. It’s like all this is the same problems for the last 15 years.

Richie Rump: Replication does not bring the boys to the yard.

Brent Ozar: It keeps the boys from the yard. RimJim follows up with, “The reason we have replication is – we don’t care… “Because the subscriber, we have custom scripts that do not replicate all the delete commands.” You can do that with SSIS too. You can do that with stuff like continuously copying data from one place to another. You’re right in that this is what it was made for, that’s a good thing, but it’s just painful. It sucks. Richie, I don’t think he wants to ask questions.

Richie Rump: No, no, no I was waiting for the green light here.

Brent Ozar: Green light, go for it.

 

How can I get started with source control?

Richie Rump: Alright, so Anon would like to hear more about SQL source control with Git for a total newbie.

Erik Darling: There you go, Richie.

Richie Rump: No, you guys are the SQL experts. Please, go ahead. Talk to me about Git; I’d love to hear what you have to say.

Erik Darling: So, like, the Venn diagram of SQL expertise, and then like Git and source control expertise – if you can picture my hands as circles and then picture the Venn diagram slowly expanding as you add these words into the equation, they are far apart.

Richie Rump: And once you get Pluto, keep going.

Brent Ozar: Pretty far. My thought was, I couldn’t get into it until Jeremiah kind of forced me into it. He’s like, look here, let me show you how it works. And I think I would go – instead of coming at it from a SQL source control angle, I would go to your developers. Talk to your developers because they’re already really comfortable with Git, GitHub, TFS, whatever kind of distributed source control that you want to go use.

Richie Rump: Yeah, and there’s different levels, right. Because you could just use Git and just throw things in there and everything’s fine, but when you start working with other people – the Git workflow and there’s different ways of doing it and branching and merging – it can get pretty confusing pretty quick. So I understand why, you know, all these DBAs who are now forced to use Git, they’re really confused and they don’t want to deal with it. I understand it.

Erik Darling: I mean, if you wanted to give someone a pillow-landing for Git, you could try using GitHub Desktop, which, kind of, GUI-izes a lot of the stuff that you have to do. That’s what I use because I’m soft. So it helps in that regard because I can do stuff and I can fix stuff. I’m still a little gun-shy when it comes to fixing a conflict on a merge or something, but for the most part – for the scripts we use, I know them well enough. I have enough domain knowledge about the scripts where it’s not awful for me to use the GUI interface for it. If you wanted me to use a command line, like actually use Git, I would run screaming. [crosstalk]

Richie Rump: Yeah, even I don’t use the command line. Visual Studio Code has a GitLens hook-in – plug-in – and it works great. Sourcetree – I use Sourcetree for visualizations so – I mean, I could use the command line just like anyone else but I don’t need it.

Erik Darling: No, good stuff.

 

Should I stop SQL before installing CUs?

Richie Rump: So Don asks, “What are the best steps to installing Service Packs and Cumulative Updates on SQL Server; change to single user, stop agent service, stop service?” Blah, blah, blah, go on, go on, go on.

Brent Ozar: I’m a huge fan of – like if you have any high availability – patch the secondary first. Do it during the week, when you’re sober, like between nine and five. Once you’ve done it a few times, you get used to what you’re able to do without a reboot and what actually requires a reboot. But then that way, if the thing’s all patched up and ready to go in, it’s just a secondary. Then you can failover to it during your outage window; make sure that everything’s okay.

And in that scenario, the whole thing about stopping the service and agent and all that is much less relevant. You just don’t have to worry about it as much. If you only have one server and it’s up and you have to do the patching live, oh my god, it just makes me nervous. Like, you can do it but the whole concept of stopping agent and stopping the service and all that is a giant pain in the rear. I’ll tell people, if they want me to performance tune and get it right to that level, just give me a secondary instead.

Erik Darling: The other thing you’re looking at is cutting off access from the application so that no users can try to sneak in and hook in, or whatever. You know, if it is just a standalone instance, you really have no choice but to ask for whatever maintenance window you think you need to install and then rollback if the install, something goes awfully bad with it.

 

How can I change this scan to a seek?

Richie Rump: Okay, so [Tishal] – if I mispronounced your name, I’m so sorry. I write code all day, I’m… “I have a TOP 1 ORDER BY descending query that has an index on it, but it does an index scan and reads about eight million records. sp_BlitzCache shows parallel warnings; no missing indexes. It’s one of the most expensive procs in terms of CPU. What can be done to change it to an index seek? Could the cast be a problem?”

Erik Darling: There is a cast function in it, like…

Brent Ozar: There’s casting UTC reads…

Erik Darling: I mean, that’s not going to change much.

Brent Ozar: Yeah, I have bad feelings. Everything about this gives me bad feelings.

Richie Rump: Why? What are the code smells?

Erik Darling: I don’t know. What I want to do is see the query plan. I want you to take the XML and put it on PasteThePlan and then, if you want like super detailed help with it – because the amount of information you’re giving me just isn’t enough to tune a query with. When you say there’s an index on it, I’m like, okay, there’s an index on lots of stuff. Post it in dba.stackexchange.com – what a nice time for that slide to pop up. I don’t have anything to do for the rest of the day, so if you post it on there, I’d be happy to take a look and see if there’s anything…

Richie Rump: Well I have some issues that you could help me with, Erik.

Erik Darling: Okay [crosstalk]. Just try it in Postgres, Richie. Write a unit test.

Brent Ozar: The one thing that does stand out in there is you’re like, why can’t I get an index seek? SQL Server is trying to compare two different fields, because you’ve got two different fields in there, I-date and B-date, and either of them could be less than some number. My guess is that SQL Server is going, this is going to match a large percentage of the table; why bother doing index seeks on it? You could try rewriting it as a pair of unions so that one searches for I-date and one searches for B-date, but it’s not going to be a small change.
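
A sketch of that rewrite – the table and column names here are guesses from the audio:

```sql
-- Two seekable predicates UNIONed together instead of one OR
-- (or a function) spanning both columns.
SELECT TOP (1) x.Id, x.FilterDate
FROM
(
    SELECT t.Id, t.IDate AS FilterDate
    FROM dbo.SomeTable AS t
    WHERE t.IDate < @Cutoff
    UNION ALL
    SELECT t.Id, t.BDate
    FROM dbo.SomeTable AS t
    WHERE t.BDate < @Cutoff
) AS x
ORDER BY x.FilterDate DESC;
```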

 

Should I partition to fix blocking problems?

Richie Rump: Okay, so Pablo asked, “Ola, amigos…” Ola Dora, [speaking Spanish]…

Erik Darling: It would be racist if anyone else did that. Richie gets to do it, using his Spanish privilege against us.

Richie Rump: The guy who is third-generation Mexican American and speaks a little Spanish who lives in Miami. It’s my burden to bear. “I have a third-party app that is currently making a huge amount of transactions per second on a particular table. It is getting heavily locked and the app is giving timeouts. Do you think that some kind of partitioning in the table may help the locking problem? Or maybe another suggestion?”

Erik Darling: No, not partitioning anyway. Just because if you’re not getting adequate chunks of data eliminated with the indexes you have now, I think trying to add partition elimination into the mix is not going to make that any simpler or give you any extra guarantees that it’s going to work.

Brent Ozar: And always define what huge is, because numbers mean different things to people. Is it like 10,000 batch requests per second, 50, 100,000 batch requests per second? You know, give me a rough idea there. Plus, too, if you’re running into lock escalation, when you say huge numbers, that immediately makes me think, if you’re tripping up past 5000 rows locked – because we’ll end up doing index scans instead of seeks – partitioning it probably isn’t going to help too much because you’re still going to escalate pretty quickly when you’re locking thousands of rows.

Richie Rump: Or when you’re popping and locking.

Brent Ozar: When you’re popping and locking. You did say, “Do you have another suggestion?” I’d say, depending on what the primary – if it’s readers blocking writers or vice-versa, RCSI might be a good fit; read committed snapshot isolation. If you go to brentozar.com/go/rcsi, we have posts about that. It’s a setting that you can just flip a switch and magically, blocking starts to disappear. There’s just so many gotchas with it that that’s why we have the whole page about it.
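
The switch Brent mentions looks like this (a sketch; the database name is a placeholder, and it needs a moment without other active connections to take effect):

```sql
-- Row versions get written to tempdb from here on out,
-- so watch tempdb throughput and size.
ALTER DATABASE YourDatabase SET READ_COMMITTED_SNAPSHOT ON;
```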

Richie Rump: Well, like tempdb would be the biggest one, right?

Brent Ozar: Yeah, the tempdb throughput and size. If someone leaves a BEGIN TRAN open, their tempdb can just start growing like crazy. Richie, you’re going to kill me, but your microphone’s doing the thing again, the static-y thing. I know, right. Hold on – so that’s that. We’re all playing chicky. Go ahead…

Richie Rump: Is it better now?

Erik Darling: Maybe.

 

We have this problem with Windows failing to write…

Brent Ozar: Mike Chubbs asks, “We’re having this error that our Windows folks have no idea where to look at. The requested resource is in use when I try to write. During a write, it offset…” Blah, blah, blah, blah, “In file.” Two words – except I guess it’s really one word – antivirus. So it’s typically a filter driver; something that’s intercepting writes. Antivirus grabbed a lock on something. A filter driver – so you’ve got some kind of filter driver that’s trying to do compression or encryption or snapshots. Those are all common causes for that.

Erik Darling: I know that Symantec – or however you pronounce it – is kind of notorious for that sort of thing. So check and see if you have Symantec running anywhere. I know it’s going to sound like the awful, awful thing to say, but if you run Xperf, you can usually find the filter drivers that are using a whole bunch of CPU and doing stuff and figure out which ones are, you know, messing with your files. Good old ETW trace.

Brent Ozar: Nice. Richie, you want to test your microphone?

Richie Rump: Am I back? Hello.

Brent Ozar: You are back. You sound great.

Richie Rump: Okay, I need to stop using Chrome. I think Chrome is my problem.

Erik Darling: You have a Chrome problem?

Richie Rump: I may – I make everything chrome…

Brent Ozar: Justin Bieber.

 

Can a log backup cause a stack dump?

Richie Rump: Alright, so Gordon has a question. “Is it possible for log backup to have an exception access violation and subsequent stack-dump?”

Brent Ozar: Sure, corruption. Corruption will do that. If it tries to back up corrupt data, that will totally do it. I would run CHECKDB for starters. Anytime you have any kind of error running backups, I would probably be reaching for CHECKDB.

Erik Darling: I’m trying to think, that isn’t like guaranteed to be a CHECKDB thing. It could be an underlying – is there a different disk that you can try taking the log backup to, just to make sure it’s not a pure hardware issue. That might be something to do if the CHECKDB comes back clean.

Brent Ozar: And if you’re backing up to a network, try local. If you’re backing up local, try the network.

Erik Darling: And, you know, it might be a situation where, like, you have to add a second log file and get rid of the first one because it’s on a bad bunch of storage or something, too.

Brent Ozar: I should have also thrown out too, weirdo backup software, I’ve got a rash of calls in lately going, I use such and such backup software, and when we look at their data, they’re not doing backups. I’m like, I don’t know what that thing does. But generally, if you paid less than $100 for your backup software, it’s probably not doing what you want it to do. Richie laughs – it’s true. Big customers have emailed in like three times during sales prospects and they’re like, I use blank, and I go to the website and I’ve never heard of it before. I’m like, $49 with a 90-day trial?

Erik Darling: You just can’t trust software that costs less than $100.

Richie Rump: My copy of DataGrip cost more than that.

Erik Darling: I was bummed that didn’t work for me. The connection string worked first time in Postgres – [crosstalk]…

Richie Rump: But there’s something going on there. You just have to dig into it a little bit. Since you’re not doing anything for today, you could…

Erik Darling: That’s not true. I have to answer that person’s question about the query.

Richie Rump: Oh okay. I mean, that’s Robert’s question, right, who wants to know how to troubleshoot an OLEDB wait type.

Erik Darling: Ooh, ole.

Brent Ozar: It’s early for Cinco de Mayo. Linked servers is one way to do it. I’d say, hold up though, what makes you think OLEDB is the problem? Because often, people will run wait stat scripts like sp_BlitzFirst – we’ll do the same thing. We list OLEDB because sometimes it actually can be bad, it’s just super-rare. If it’s one to one with clock-time – like if your server’s been up for 1000 hours and you have 1000 hours’ worth of OLEDB waits, it’s harmless. It’s a monitoring tool like SQL Sentry, Quest Spotlight, IDERA SQLDM. He says, “I get it in the monitoring script.” There’s so many. I get it in the monitoring script too.

Erik Darling: Don’t we all? Like Brent said, remote queries will cause OLEDB, DBCC CHECKDB – it will register OLEDB waits – it just might not be that big a deal, unless you’re running CHECKDB 24/7, just in a loop.

Brent Ozar: Oh god, he follows up with, “I’m moving 180,000 rows via linked server.” Yep, well, that will do it.

Erik Darling: We found the problem.

Brent Ozar: He says, “The waits go to two seconds and then drop.” Yep, that’s exactly what it does, is reset every two seconds. It still could be waiting on OLEDB, but if you’re trying to move 180,000 rows faster, I usually keep a USB floppy drive or something like that around. You know, a CD-ROM maybe might be faster than trying to move that kind of data around.

Richie Rump: Or SSIS?

Brent Ozar: SSIS – that’s a good one.

Erik Darling: Or, you know, just not moving 180,000 rows at once, you know. Move like 1000 – 5000 rows at once; something that’s a little bit easier on your servers and on your network and all that other stuff.
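
A sketch of that batching approach – the table, column, and linked server names are placeholders:

```sql
DECLARE @BatchSize INT = 5000;
WHILE 1 = 1
BEGIN
    -- Copy the next chunk that hasn't landed at the destination yet.
    INSERT INTO Target.dbo.Orders (Id, OrderDate, Total)
    SELECT TOP (@BatchSize) s.Id, s.OrderDate, s.Total
    FROM LinkedServer.Source.dbo.Orders AS s
    WHERE NOT EXISTS
    (
        SELECT 1 FROM Target.dbo.Orders AS t WHERE t.Id = s.Id
    );

    IF @@ROWCOUNT < @BatchSize BREAK;  -- last (partial or empty) batch
END;
```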

 

Any opinions about DACPAC deployment?

Richie Rump: Yep, okay. So, Brian has a question. Somewhat of a follow up to the SQL and Git source control question, “Have any of you worked with, or have any strong opinions about, DACPAC/SSDT-based development and deployment?”

Brent Ozar: Yeah, so I tried it when they first came out. It sounded like a really interesting thing. And I try to keep optimistic about, hey, Microsoft’s got a new technology. I should play around with it. And the problem for me with DACPAC-based deployments was, if I wanted to make a schema change to a large table, it was really weird how it worked and it generated a ton of log data. So it seemed like it worked across small databases. And I’m going to use, as a number for small, less than 5GB of data. But it totally fell apart at like 50GB of data. The deployment was just brutal. Plus, dropping columns was horrific.

I found that I could accidentally drop columns during the deployment and it was way too easy to go and do. So I was burned out on it within the first six months. I kept trying it and would go, ah it’s just not working for me. I haven’t heard from anyone else that they swear by it, so I don’t know.

Erik Darling: A lot of people who have to do that sort of thing just use Redgate SQL Compare and have much better luck with that.

Richie Rump: Yeah, I prefer SQL Compare. I am not a DACPAC person whatsoever. You just don’t have a lot of control, from what I’ve used of it, in those deployments. At least with SQL Compare, I could say, here’s A, here’s B and give me the deltas and then I have a script and I can review it and whatnot, as opposed to, I’m just throwing this thing out there and it’s going to do something and hopefully it will work. I don’t know.

Erik Darling: So I’m going to give Redgate SQL Compare the best accolade that I can give a piece of software. It was so easy, I could use it.

Brent Ozar: I’d agree. The wizards are nice, the side-by-side comparison thing.

Erik Darling: It highlights where things are different.

Richie Rump: Does that work for Postgres? Because lord knows I could have used it yesterday.

Erik Darling: Why don’t you use – what was it you told me about? KDiff3.

Richie Rump: KDiff is essentially just for Git, for comparing text files. It’s not going to go into your database and say, okay, this is this and this is that. So what my problem was, yesterday, I did a deployment to Postgres and I had some scripts. And as I usually do, I’ll go ahead and put all these updates into a different script and I’ll keep track of it that way, if it’s gone into production or not.

I didn’t update the scripts from some of the updates that I had made in source control, and so I went ahead and deployed an old version of updates. And that’s when Brent saw all of this red and started laughing at me and things like that. But I fixed it. I fixed it.

 

Why do we need backup software?

Richie Rump: So Steve has a question, “Why backup software? Can’t we do this with SQL Server tools?”

Brent Ozar: My thought, from working – used to work for Quest, and so I was always worried about that question when people would come and ask, what do you need LiteSpeed for? Can’t we just do a native backup? The compression options, the encryption options, the monitoring, the log shipping GUI, all that kind of thing is phenomenal with third party tools.

Now, the ability to do integration with storage snapshots – there’s all kinds of – smart differentials, or one of my favorites, object level recovery. If you’re doing backups with LiteSpeed, you can pluck an individual object out of the backup. Because, newsflash, people don’t call you to restore a 10TB database. They call you to restore one table out of the 10TB database, and then you don’t have any space to go do it. You have to shuffle it around from one place to another. Man, third party backup tools make that kind of thing way easier.

If you don’t have – I may make a generalization – if all your databases are under 1TB and you’ve got, say, less than 100 servers, then you could probably get by with native. But as you go past 1TB, man, the third-party stuff is phenomenal.

Erik Darling: I look forward to object level restore when people in Azure start having to restore large things, and Microsoft all of a sudden realizes, oh yeah, maybe it’s not so much fun to babysit a 25TB restore when I just have this 50,000-row table that I need to get back.

Brent Ozar: And especially if they’re paying SLA refunds to people when it all starts to take three days.

Erik Darling: Another cool thing about tools like LiteSpeed, beyond what Brent said with the object level restore, is transaction reading. So again, what no one ever tells you – all sorts of like great DBA porn – no one ever tells you to restore just one table. No one tells you to restore that table just to the last transaction log backup. There’s always, like, you know, we’re taking hourly transaction log backups and something happened earlier this morning, and you’re like, cool. I don’t know what that was. You don’t know when that was and you don’t want to have to go through that crazy restore process to get things back up to here and then realize you went too far and have to start the whole process over again. You really just want to get to the point where something bad happened and stop.
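The stop-right-before-the-bad-thing restore Erik describes looks like this with native commands (file names and the timestamp are made up):

```sql
-- Restore the last full backup without recovering,
-- then roll transaction logs forward and stop just before the mistake
RESTORE DATABASE [YourDatabase]
FROM DISK = N'D:\Backups\YourDatabase_Full.bak'
WITH NORECOVERY, REPLACE;

RESTORE LOG [YourDatabase]
FROM DISK = N'D:\Backups\YourDatabase_0900.trn'
WITH NORECOVERY, STOPAT = '2018-04-17 08:55:00';

-- Bring the database online at that point in time
RESTORE DATABASE [YourDatabase] WITH RECOVERY;
```

The catch, as Erik says, is knowing what timestamp to put in STOPAT – guess too far and you start the whole dance over.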

Richie Rump: So is DBA porn like, look at the IOPS on that server?

Erik Darling: Yeah, stuff like that.

Brent Ozar: Scan paged, PFS pages.

Richie Rump: Look at the waits on that guy…

Brent Ozar: Or when you go into DBA cubicles, it’s always the PerfMon counter poster up on the wall.

Erik Darling: It’s so true.

Brent Ozar: It is. That’s what we put. That’s exactly it.

 

What’s up with Azure Managed Instances?

Richie Rump: So you guys were playing with a new service that Azure just released. Why don’t you tell everyone about that and, kind of, what you found out about it?

Brent Ozar: So, Azure managed instances – this is kind of, I’m going to use the term replacement – but I think it’s the thing that people are going to light up in excitement for over Azure SQLDB because you get cross-database queries, you get easier backups and restores, you can log ship up to it, in a weird kind of way, if you want to restore full backups up in Azure Managed Instances. It’s compatible with more features than Azure SQLDB supports. You can query DMVs, you can go into the master database, you can put stuff in the master database for that matter, if you want.

It’s relatively affordable – pretty cheap when you think about it – and you can buy it in eight, 16, or 24 cores. You don’t have to worry about memory. You don’t really have to worry about storage, even; it’s not provisioned by price or anything – there’s just a slider for how much storage space you need.

Erik Darling: Does storage size tie into storage speed?

Brent Ozar: No, there’s just a size thing and that’s it. So it’s pretty impressive. Easy to set up. Some of the things that we found were that there’s all kinds of new DMVs. There’s these new feature switches where it looks like they’ll be able to turn things on and off more gradually. There were thousands of features in the feature switches table. It’s pretty neat stuff. It gives me excitement. It’s DBA porn, I suppose you could say.

There was an issue that I caught around database corruption where you could corrupt master in a way that your entire Managed Instance would be unrecoverable. I have every confidence they’re going to fix that before go-live. But it’s like anything else: all software has bugs.

Erik Darling: I mean, this is brand new when we looked at it, too. The stuff that we found isn’t stuff that is going to be there forever or stuff that you have to be eternally vigilant about. Like, there’s a whole bunch of stuff in SQL agent that doesn’t work. Like, you can’t call jobs that use command shell. You can’t call jobs that use PowerShell. You can’t call SSIS job steps from it. There’s no maintenance plans in there. You can’t use Ola Hallengren’s scripts. There’s stuff that doesn’t work yet; all stuff coming down the pipe. The T-SQL things are different between managed instances and real live T-SQL, but you know, it will get there. I have a lot of confidence in this so far. I think they’re doing a lot of things right.

Brent Ozar: Man, for a preview, it was freaking phenomenal. Just awesome.

Erik Darling: My biggest gripe was – sorry, go ahead…

Brent Ozar: Your biggest gripe was what?

Erik Darling: The memory didn’t go up when the core count went up, but again, previews.

Brent Ozar: Alright, well thanks everybody for hanging out with us this week and we’ll see y’all next week on Office Hours; Adios.


So You Wanna Debug SQL Server Part 1

SQL Server
5 Comments

Start here!

This is a gentle introduction to getting set up and started.

I’m nowhere near as good at this as I want to be, but I’ve been inspired by a couple wizards: Bob the Ward and Paul the White.

The first thing you need to do is go download WinDbg (or the debugger of your choice — I use this one to emulate my heroes).

If you want a lot of reference material, there’s a Stack Overflow post with a list of resources, like cheat sheets and tutorials.

I should probably read those, huh?

Debuggery

Once you’ve got things installed, you’ll wanna crack the correct tool open. For most of you, that’s gonna be the x64 version.

Summer of 69

The important thing when opening it up is to right-click and run as administrator.

I’m run as bro

If you choose the wrong bits, or you don’t run as administrator, debugging won’t work.

Symbols of the Bug

The next thing you’re gonna need is some symbols.

I’m lazy, so I’m going to create a folder on my C: drive called “symbols” — then I’m gonna use that to point WinDbg to, so it can store the public symbols that Microsoft lets us plebeians use.

If anyone wants to send me the private symbols, I won’t tell.

Pick a winner
COMMANDS

The command in here is srv*c:\symbols*http://msdl.microsoft.com/download/symbols — you’ll need to change the c:\symbols part to wherever you chose to create your symbols folder.

Action Man

Now comes the fun part! Especially if you’re not on a Mac. Or Linux. This blog post has probably been awful for you, if you are.

Let’s attach this thing! The easiest way is to either hit F6, or go to File > Attach to a process.

Hello beautiful

I find it convenient to sort by Executable and then scroll down to SQL Server. There are fancy command line ways to do this as well, but this is the lowest barrier to entry.

Now, remember when I said you had to choose the right bits and run as admin?

This is why! If you don’t see Session information here, or you get access denied/request can’t be completed errors when you try to attach, you did something wrong.

If you did everything right, the debugger will attach, and you’ll get a screen full of nonsense.

That nonsense will be a list of .dlls that you can get symbols for. Symbols are kind of like translators. You may not need them, but hey.

Free stuff.

To get symbols for a dll just type something like x sqllang! or x sqlmin!

If you’ve done things right, your symbols folder will start to fill up.

Eet’s a seeekrit

 

Closedown

That was just enough to be dangerous!

I’ll be covering some of the basics of setting breakpoints and whatnot down the line. In the meantime, poke around on your own.

Explore the space.

There’s almost no shortage of EULAs you can break and stuff you can find poking around in here.

Thanks for reading!


Table Valued Parameters: Unexpected Parameter Sniffing

Like Table Variables, Kinda

Jeremiah wrote about them a few years ago. I always get asked about them while poking fun at Table Variables, so I thought I’d provide some detail and a post to point people to.

There are some interesting differences between them, namely around how cardinality is estimated in different situations.

Table Valued Parameters are often used as a replacement for passing in a CSV list of “things” to then parse, usually with some God awful function. This helps by passing in the list in table-format, so you don’t have to do any additional processing. TVPs don’t have the exact same problems that Table Variables do, but there are some things you have to be aware of.
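If you haven’t seen the pattern, a minimal sketch looks like this (all names here are hypothetical, not from the post):

```sql
-- Define a table type once...
CREATE TYPE dbo.IntList AS TABLE (Id INT NOT NULL PRIMARY KEY);
GO

-- ...then accept it as a parameter (TVPs must be declared READONLY)
CREATE PROCEDURE dbo.GetThingsByIds
    @Ids dbo.IntList READONLY
AS
BEGIN
    SELECT t.*
    FROM dbo.Things AS t
    INNER JOIN @Ids AS i ON i.Id = t.Id;
END;
```

The caller fills a DataTable (or similar) and passes it in as one structured parameter – no string splitting required.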

I’m going to go over how they’re different, and how a trace flag can “help”.

I know you’re gonna ask: “Should I just dump the contents out to a temp table?”

And I’m gonna tell you: “That’s a whole ‘nother post.”

What We’re Testing

Since I usually care about performance (I guess. Whatever.), I want to look at them with and without indexes, like so:

Mr. Swart points out in the comments:

The name of the primary key in your example turns out to be PK_#A954D2B_3214EC07D876E774 which gets in the way of plan forcing, a super-useful emergency technique.

Here’s the stored procedure we’ll be calling:

Now, I’m going to square with you, dear reader. I cannot write C#. If it were me in that scene from Swordfish, we’d all be dead.

Or whatever the plot line was there. Hard to tell.

I had to borrow most of my code from this post by The Only Canadian Who Answers My Emails.

Mine looks a little different, but just so it will Work On My Computer®.

And also so I can test data sets of different data sizes. I had to make the array and loop look like this:

By the end of the blog post, you’ll be glad I didn’t introduce much more variation in there.

Results

I have another confession to make: I used Extended Events again. It still burns a little. At least there was no PowerShell though.

First Thing: There’s a fixed estimate of 100 rows for inserts

This is regardless of how many rows actually end up in there. This is the insert from the application to the TVP, so we’re clear.

Jump, jump
Maybe that’s not so great…
Second Thing: They don’t get a fixed estimate like Table Variables

Table Variables (unless you Recompile, or use a Trace Flag), will sport either a 1 or 100 row estimate, depending on which version of the Cardinality Estimator you use. The old version guesses 1 row, the new guesses 100 rows.

Table Valued Parameters get sniffed, just like… parameters! When the first plan in the cache is a 100k row TVP, we keep that number until recompilation occurs.

Nice guess…
… Be a shame if something happened to it
Third Thing: Non-join cardinality estimates behave like local variables (fixed estimates)

And I am not a very good graphic artist.

Hoowee.

Without the PK/CX, we get 10% for equality, 30% for inequality, and 9% for two inequalities.

This is detailed by Itzik Ben-Gan here.

Fourth Thing: This didn’t get much better with a PK/CX

The only estimate that improved was the direct equality. The rest stayed the same.

Captain Obvious, et al.
Fifth Thing: The index didn’t change estimates between runs

We still got the “parameter sniffing” behavior, where whatever row count was inserted first dictated cardinality estimates until I recompiled.

Apologies, no pretty pictures here.

Sixth Thing: Trace Flag 2453 definitely did help

With or without the index, turning it on improved our cardinality estimates, but had little effect on standardizing the capitalization of keywords.

I didn’t have to recompile here, because there was enough of a difference between 100 and 100,000 rows being inserted to trigger a recompilation and accurate estimate.

Pictures don’t do this part justice either.

Go Away

Table Valued Parameters offer some nice improvements over Table Variables (I know, I know, there are times they’re good…).

But they’re also something you need to be careful with. Sniffed parameters here can cause plan quality issues just like regular parameters do.

Boohoo.

In this plan, the 100 row estimate led to a spill. To be fair, even accurate estimates can result in spills. It’s just hard to fix an accurate estimate 😀

  • The initial insert to the TVP is still serialized (just like table variables), so use caution if you’re inserting a lot of rows
  • There are no column level statistics, just like Table Variables, even with indexes (this leads to fixed predicate estimations)
  • Trace Flag 2453 can improve estimates at the cost of recompiles
  • The fixed estimate of 100 rows for each insert may not be ideal if you’re inserting a lot of rows

Thanks for reading!


SQL Server 2017 CU5: Finding Problems With Parallelism

Snow Is Temporary

Cumulative Update 5 dropped for SQL Server 2017. Hooray.

Poor Taryn.

These new servicing models sure do make me glad I don’t have a real job.

I’m also happy because I get to write weird queries to test interesting new functionality.

This SQL Server update improves troubleshooting parallelism-related (exchange) deadlocks by adding the following enhancements:

  • The exchange_spill xEvent reports how much data parallelism operators spill per thread, in the event field named worktable_physical_writes. Each thread may report an exchange_spill xEvent multiple times.

  • The DMVs sys.dm_exec_query_stats, sys.dm_exec_procedure_stats and sys.dm_exec_trigger_stats now include the data that is spilled by parallelism operators, in the columns total_spills, last_spills, max_spills, and min_spills.

  • A showplan warning is reported at run time if there are parallelism spills. This warning is rendered in a showplan XML attribute (<ExchangeSpillDetails WritesToTempDb="spill_amount" />).

There’s some other cool stuff in there too, but I’m only talking about exchange spills here.
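To poke at the new DMV columns yourself, a quick check (my own query, assuming you’re on a recent enough CU) might look like:

```sql
-- Find cached queries whose parallelism operators have spilled
SELECT TOP (10)
       qs.total_spills, qs.last_spills, qs.min_spills, qs.max_spills,
       SUBSTRING(st.text, 1, 200) AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE qs.total_spills > 0
ORDER BY qs.total_spills DESC;
```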

How Do I Spill In Parallel?

The easiest way to encourage spills in a parallel plan is to force lots of order preserving operators — stuff like Merge Joins and Stream Aggregates — to occur around Exchange operators.

Exchange operators can Distribute, Redistribute, and Gather parallel streams of data. When they’re not required to keep data in order, things are a lot easier.

When they are, you run into problems.

Hoowee!

In this case, the Exchange operators on either side of the Merge Join spill.

Exchange Spill Extended Event

I set up an extended event to watch this query, which runs for a touch over 9 minutes. The results are somewhat interesting.
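Something like this captures the new event (session and file names are my own choices):

```sql
-- Watch for exchange spills and write them to an event file
CREATE EVENT SESSION [exchange_spills] ON SERVER
ADD EVENT sqlserver.exchange_spill
ADD TARGET package0.event_file (SET filename = N'exchange_spills');
GO
ALTER EVENT SESSION [exchange_spills] ON SERVER STATE = START;
```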

THIRTEEN?

Of the 19 exchange events, only five admit to physically writing to a worktable, for a total of 13 writes.

Over 9 minutes.

Lesson learned: it doesn’t take many exchange spills to make a parallel query really slow.

Exchange Spill DMV Data

sp_BlitzCache will grab these columns if you’re on 2017 CU3+. Otherwise… Well, nothing really.

Please upgrade.

IANAGD

The DMVs agree! There were only 13 spills here. Happy Days — this is included in here. I’m not sure that this was always the case…

Query Plan (Actual Only)

Spills aren’t tracked in estimated or cached plans. Unfortunately, SSMS 17.6 doesn’t show the writes to tempdb; however, they do exist in the XML.

Zilchin’

Perhaps the next version will display this in the tool tips.

Lcuky Nmummber ss

But the plan shows us 13 total spills again, so… If it’s wrong, it’s wrong everywhere. That’s a start.

For the record, I don’t think it’s wrong. I’m just surprised that it only takes 13 writes to make a parallel query crap the bed.

Thanks for reading!

An update, and a confession

This was not the optimizer’s first choice for a plan. No, no, my sweet and innocent readers. I had to throw an army of hints at this query to get it to happen.

I had to force it to run parallel with a certain number of cores, I had to force choosing merge joins, and I had to throw a Trace Flag That Shall Not Be Numbered at it.

I also had to choose tables where I knew a NOT EXISTS would be problemaaaaaaaaaatic (NPR voice) because of data distributions.
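Roughly, the hinted shape looked like this (table names are hypothetical, and the trace flag is deliberately left out since the post doesn’t name it):

```sql
-- Force merge joins and pin the degree of parallelism;
-- an unnamed trace flag did the rest of the dirty work
SELECT b.Id
FROM dbo.big_table AS b
WHERE NOT EXISTS (SELECT 1/0 FROM dbo.small_table AS s WHERE s.Id = b.Id)
OPTION (MERGE JOIN, MAXDOP 4);
```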

Like I tell myself when I go to the gym: it takes a lot of work to look this mediocre.


Why Multiple Plans for One Query Are Bad

I’m going to demo this using the Stack Overflow public database. We’ll use the Users table – which has exactly what you think it has, everyone who’s asked/answered/commented at StackOverflow.com.

I need to search for people by their DisplayName, so I’ve created an index on that:

And now I’m going to search for a couple of different people – me, and the lady in the meat dress – and then examine what’s in my plan cache:
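The setup and searches look something like this (the index name and selected columns are my guesses at the demo, not the original code):

```sql
CREATE INDEX IX_DisplayName ON dbo.Users (DisplayName);
GO
-- Two ad hoc queries that differ only in the literal value
SELECT Id, DisplayName, Reputation
FROM dbo.Users
WHERE DisplayName = N'Brent Ozar';

SELECT Id, DisplayName, Reputation
FROM dbo.Users
WHERE DisplayName = N'Lady Gaga';
GO
EXEC sp_BlitzCache;
```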

Here’s the results:

The only place where my reputation is higher than hers

Hey, whaddya know, Lady Gaga uses Stack Overflow too! We’re practically meant for each other.

But let’s zoom in a little on that last result set, the output of sp_BlitzCache:

Two queries, two plans

SQL Server built and cached two query plans.

This has a few interesting problems:

  • It built an execution plan for each one of them – which meant the query took a little longer to finish
  • It cached each execution plan separately – meaning it takes up more memory
  • Each plan could be different – in cases where the different name has a different number of estimated rows, SQL Server might choose to use (or avoid) an index

With just 2 queries, who cares? But if your app is sending in the same query thousands of times, each with different parameters, this can add up to more CPU time, more memory used for caching plans, and less memory used for caching data.

Our tools warn you about this in a few different ways:

  • sp_Blitz warns you that you have a large number of plans for a single query, and that it’s time to dig deeper by looking at the plan cache
  • sp_BlitzCache shows a warning for queries that have multiple plans (indicating that the query you’re looking at might just be a bad one out of several). You can do EXEC sp_BlitzCache @SortOrder = 'query hash' to find the queries to focus on, too.
  • SQL ConstantCare® suggests forced parameterization when we see that you have a ton of these over time, and can’t keep plans in the cache as a result

You could fix this by changing the application so that it uses parameterized SQL instead of strings. Run this query to tell your developers which queries are involved:
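The full query isn’t reproduced here, but a simplified sketch of the idea – grouping the plan cache by query_hash to find the worst repeat offenders – looks like:

```sql
-- Count how many separate cached entries share the same query hash
SELECT TOP (10)
       qs.query_hash,
       COUNT_BIG(DISTINCT qs.plan_handle) AS cached_plans,
       SUM(qs.total_logical_reads) AS total_reads,
       SUM(qs.total_worker_time) / 1000 AS total_cpu_ms
FROM sys.dm_exec_query_stats AS qs
GROUP BY qs.query_hash
HAVING COUNT_BIG(DISTINCT qs.plan_handle) >= 100
ORDER BY cached_plans DESC;
```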

That gives you the top 10 most duplicated queries in cache, plus for each one, 10 sample texts, plans, and a more-info query for sp_BlitzCache to let you slice & dice them by reads, CPU, etc. Note that the “Total” numbers like Total_Reads and Total_CPU_ms are for ALL of the different executions of the query text, not just the one line you’re looking at.

I’m filtering for only queries that have at least 100 duplicate entries in the plan cache.

Click to zoom

And then when they say, “Sorry, we can’t fix those,” keep reading.

Optimize for Ad Hoc does not fix this.

When turned on, this server-level setting tells SQL Server to expect a lot of different queries that will never be seen again. That means:

  • SQL Server still compiles every string, every time it sees it
  • It just doesn’t cache that query’s plan until it sees the query a second time (which it almost never will, because you’re sending in a different string every time)

So your CPU is still high – you’ve just saved some memory, but not a lot. This isn’t a full fix.

Forced Parameterization fixes this.

If you right-click on a database, click Properties, Options, and scroll to the Miscellaneous section, you’ll see Parameterization. The default is Simple, but you can also choose Forced.

Chuck Norris has the option of Brute Force

Setting it to Forced takes effect instantly, doesn’t require a restart, and then has a different behavior.
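If you’d rather script it than click through the GUI, the T-SQL equivalent is:

```sql
-- Turn on forced parameterization for one database (takes effect instantly)
ALTER DATABASE [StackOverflow] SET PARAMETERIZATION FORCED;

-- And to switch back to the default:
ALTER DATABASE [StackOverflow] SET PARAMETERIZATION SIMPLE;
```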

If I run the same queries again, here’s the new output:

sp_BlitzCache showing forced parameterization

SQL Server takes a little more time with each incoming query, turns the literals into variables, and then checks to see if there’s an execution plan already compiled for it. That means:

  • Faster query runtime because we can skip compiling a full plan for it
  • Less memory wasted on duplicate plans being cached in memory
  • Easier to spot performance issues, because the same query now gets grouped together more easily in tools like sp_BlitzCache

Things to know about Forced Parameterization:

  • It’s set at the database level, and needs to be set in the database where users are running queries.
  • It could theoretically be a performance drag – if all of your queries had literals, but they really were totally different queries, this could slow things down. I only recommend using this tool to fix a problem, not to proactively prevent a problem.
  • When enabled, plans do get reused – which means you may suddenly have parameter sniffing issues that you didn’t have before (because before, every query got its own hand-crafted plan.)

So when should you use Forced Parameterization?

  • When our tools are alerting you about a high number of plans for a single query (like, say, 10,000 or more)
  • You can’t fix that query to be parameterized
  • You want to reduce CPU usage and increase memory available to cache data
  • You’re comfortable troubleshooting parameter sniffing issues that may arise with that query

For more on the topic, check out Eitan Blumin’s post Too Many Plans for the Same Query Hash.


If You Can’t Index It, It’s Probably Not SARGable

Thumbs Rule OK?

When writing queries, it’s really common to want to take shortcuts to express some logic.

I know, it “works fine”, and the results are “right”, and you have “more important” things to do.

But take a minute here for future you.

Just think of reading this post as a continuation of the infinite thumb-scroll that your life has become.

And Examples!

For instance, it’s a lot faster for you to write ISNULL(somecol, 0) = 0 or YEAR(somedate) = 2018.

In some ways it may even seem intuitive to write queries that express simple equalities like this rather than somecol = 0 OR somecol IS NULL.

The problem is that this isn’t how indexes store, or have their data retrieved, and you can’t create indexes like this:

CREATE INDEX ix_nope ON dbo.zilch (ISNULL(nada, 0))

Hence the title: If you can’t index it, it’s probably not SARGable.

Practically?

I’ve tried to point this out with functions like ISNULL and DATEDIFF.

Using the DATEDIFF example, if you write a query that does this: WHERE DATEDIFF(DAY, day1, day2) = 90

And you’ve got an index like this: CREATE INDEX ix_dates ON dbo.orders (day1, day2)

Nothing about the index is tracking how many days apart day1 and day2 are. Nor minutes, nor hours, nor milliseconds.

It’s just data ordered first by day1, then by day2. Multi-column statistics also don’t track this. It’s up to you to define the data points you need to retrieve for your data.
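One workaround (using the hypothetical orders table from above): persist the calculation as a computed column, then index that. DATEDIFF is deterministic, so this is indexable:

```sql
-- Make the "days apart" value a real, indexable piece of data
ALTER TABLE dbo.orders ADD days_apart AS DATEDIFF(DAY, day1, day2);
CREATE INDEX ix_days_apart ON dbo.orders (days_apart);

-- Now this predicate can seek instead of scanning and computing per row
SELECT o.day1, o.day2
FROM dbo.orders AS o
WHERE o.days_apart = 90;
```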

The same goes for many other built in system functions, but there’s no warning sign on the doc pages that says stuff like:

  • “Hey, this is only here to make presenting data easier.”
  • “The optimizer can’t do that, Dave.”
  • “Only do this if you want to keep champagne flowing for consultants.”

It’s fairly well-documented what happens when you use these on-the-fly calculations as predicates: Nothing good.

For all the hand-wringing there is about index scans, there’s an equal amount of poorly written queries that have no hope of ever doing a seek.

Mass Appeal!

As development teams mature and start writing queries beyond simple selects, they may add non-SARGable constructs in other places.

CTEs, Derived Tables, and non-indexed Views are easy examples. The backing query results aren’t materialized anywhere, so they’re not good candidates for predicates.

For example, this query:

Putting the COALESCE inside the CTE doesn’t magically make a new set of physical data — it’s just another expression. It’s really no different than if you did it directly in the WHERE clause without the CTE.

Going back to the title, you can’t index a CTE or a derived table. Abstracting expressions in there doesn’t persist them.

They won’t be SARGable here, either.

Unawares?

If you think they’re bad there, wait until you start trying to order data by expressions.

Even with a perfectly fine index thrown into the mix…

We’re forced to sort everything by hand.

Or whatever your computer uses.

Mine uses hands.

Tiny hands.

This isn’t your friend

This is the kind of query that really makes your CPU fans kick in.

The kicker is that the optimizer may decide to inject a sort into your plan that you didn’t ask for.

Is There An Exit?

When you’re writing queries to be reliably fast, take a close look at what you’re expecting the optimizer to do.

Indexes can go a long way, but they’re not cure-alls. They still have to contain data the way you’re trying to use the data.

If they don’t, you might need to look at temp tables, computed columns, denormalizing, or lookup/junction tables to set data up the way your queries use it.

Thanks for reading!


Azure SQL Managed Instances and Database Corruption

Disclaimer: I love Azure SQL Managed Instances (MIs) so far. They’re really cool, and I’m having a fun time playing with them. I’m about to describe a bug, and I’m sure this bug will be fixed soon, so it shouldn’t stop you from previewing Managed Instances. This post is not “MIs are t3h suxxors” – it’s just making sure you understand that this is new technology, and if you’re going to bet your business on new technology, you need to test it by pushing the limits. That’s what we do here. This isn’t the kind of blog that mindlessly parrots press releases: we’re real world users, so we beat the hell out of technology before we recommend it to customers. I’m sharing this to explain the kind of work we do, and why we put so much effort into getting tools like sp_Blitz so you can get the right alerts at the right time. Now, on with the show.

Corruption has always been kind of a gray area in Azure’s Platform-as-a-Service offerings. Yes, Microsoft manages your backups, and they can recover the entire database to a point in time for you, but it’s not clear what happens if you want to attempt corruption repair for specific objects. It’s not like you get access to the log backups and can do a page level restore.

(I can hear the cloud zealots screaming already, “There’s no corruption in the cloud!” Buckle up, watch, and learn.)

So while working on getting sp_Blitz fully compatible with MIs, I found it pretty interesting that Microsoft isn’t running CHECKDB for you. When sp_Blitz returned an alert about CHECKDB not being run, I thought it was a bug in sp_Blitz until I dug deeper:

Lazy robots are slacking already

When they say the robots are taking your job, I guess they’re doing exactly the same kind of job you would do. Don’t lie: I’ve seen a lot of your servers, and you suck at corruption detection and recovery too. The robot is taking your exact job. Probably going to take smoke breaks even though he doesn’t smoke, just like you.

Note that neither maintenance plans nor Ola Hallengren’s maintenance scripts will work in MIs yet either: MI’s SQL Agent doesn’t support those kinds of scheduled tasks yet. (Props to Erik for catching that one.)

MIs might have a first line of defense in Automatic Page Repair. With mirroring and AGs, when the primary detects corruption, it’ll ask the secondary replicas for a clean copy of the page, repair it on the fly, and log it in sys.dm_hadr_auto_page_repair or sys.dm_db_mirroring_auto_page_repair. So I wondered, if the new HA/DR fabric solution isn’t showing replicas in the usual AG replica DMVs, do we get automatic page repair? I bet we do, because the startup error log shows this:

Databases in an Availability Group

Note that even model and msdb are in that list, and in other parts of the log, master is shown as replicatedmaster. That’s kinda awesome – it’d be amazing if we get automatic page repair for system databases, too.

Side note – even with automatic page repair, if SQL Server writes trash data into the database, you end up with corrupt copies of the data everywhere. Classic examples include the 2012/2014 online index rebuild bug, the 2008-2014 UPDATE/NOLOCK bug, the 2008 compressed partitioned table bug, the 2008-2012 table variable bug, etc. These aren’t common issues by any means, and they’re all fixed today. However, for every bug, there was a range of time where it was active in the field and wasn’t fixed, and that’s what I’m worried about. No software is bug-free – so what happens when we hit a bug?

Let’s find out.

Testing Corruption of the Master Database

Corruption happens. It’s just a fact of life – storage is gonna fail. Microsoft’s SLAs for storage only give you 3-4 9’s, and there’s nothing in there about never losing your data. Nothing against Azure, either – I’ve lost entire VMs in AWS due to storage corruption.

So let’s demo it. Normally, this kind of thing might be hard to do, but at the moment, DBCC WRITEPAGE is enabled (although I expect that to change before MIs hit General Availability). I used Erik’s notorious sp_GoAheadAndFireMe to purposely corrupt the master database (not TempDB, which it normally targets – I modified it to work against master instead), ran it, and in less than ten seconds, the entire instance went unresponsive.

For the first 3-4 minutes, login attempts would fail:

No soup for you

From 5-20 minutes, the connection started, but:

“Yes, support? Here’s my tracing ID. Eff…six…three…”

After half an hour, Object Explorer still couldn’t connect, but at least I could connect a query window to run queries, so I checked msdb.dbo.suspect_pages, and it was empty. However, running CHECKDB against master caused my connection to drop:

I sent the MI details off to Microsoft for analysis, and they’re working on a fix.

So what did we learn?

In traditional DR technologies like log shipping, database mirroring, and Always On Availability Groups, the fact that the master database isn’t replicated can actually be a good thing. When you fail over your user databases and only your user databases, you’re able to escape from system-database-level problems (or in the case of mirroring and log shipping, escape cluster-level problems.)

When you go with MIs, you’re gambling that Microsoft’s administration-as-a-service is going to be more reliable than your own administrators (with some tradeoffs around expense as well.) I think in the vast majority of cases, that’s a really, really, really safe bet. It’s not a guaranteed bet, though, as we see here.

Everybody has work to do. You need to run CHECKDB on your MIs, Microsoft needs to improve compatibility to let you do it more easily, and when corruption occurs, Microsoft needs to do a better job of recovering – or give you some kind of plan B.
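Until the Agent limitations clear up, running CHECKDB yourself means kicking it off manually (or from an outside scheduler):

```sql
-- Manual integrity check; database name is a placeholder
DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```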

Because think for a second: what’s your disaster recovery plan for a Managed Instance failure?


Building SQL ConstantCare®: The Serverless Architecture

Our new SQL ConstantCare® is an online service that analyzes your SQL Server’s diagnostic data, then gives you personalized advice on what you should do to make your database app faster and more reliable. I blogged about the vision last week, and today let’s talk about the things that drove our architecture decisions way back in 2015.

I’ve worked for software vendors who had to deploy and support applications running on customer desktops. It’s an absolute nightmare – you would not believe the crazy things people try to do with your app.

So when I started thinking about how to deliver server recommendations, and how I’d deliver the application as a software vendor, I wanted as little code on the client’s machines as possible. Ideally, none, but querying SQL Server directly from a web browser doesn’t seem doable yet. Sooner or later, Atwood’s Law will kick in and it’ll be written in JavaScript, but not today, Satan, not today.

In the really big picture of the full version:

  1. You’d collect data from SQL Server. (This was gonna require a local app.)
  2. We’d analyze the data and build a list of tasks for you to do.
  3. We’d tell you what those tasks were.
  4. We’d monitor your progress on accomplishing those tasks, and the difference it made.

Steps 2-4 weren’t synchronous – they were going to take time. Some problems would be completely obvious (“hey, you have slow file growths due to repeated file shrinks, turn off this specific maintenance job step”) but others might get flagged for human intervention. As we improved the code for steps 2-4, more and more scenarios would be handled automatically – and this is the part of the software that we expected to rapidly iterate over. We didn’t wanna deploy this kind of logic on-premises every time we got smarter. Steps 2-4 were gonna be in the cloud.

The cloud also made sense for performance and scale requirements. I didn’t expect a large number of people to enroll right away, but I wanted to plan for the future – especially a future where I could drive the per-user and per-server costs lower. I wanted the processing cost to basically drop to free, thereby enabling me to do fun stuff like a “Free Server Friday” where anyone could send in data.

In a future like that, what kind of loads might I have to deal with? Think users times the number of servers:

  • Our mailing list has ~100K people, but let’s say just 1% of them enrolled.
  • For server count, in our annual salary survey, the median number of servers managed is 20, but let’s say they send in data about 5 of those servers.
  • That means we’d be processing tons of diagnostic data for 5,000 servers at a time (and if we did freemium or free days, possibly much more)

Scale-out, queue-based processing was a requirement. The scale-out portion might be able to wait until v2, but queue-based was an absolute must.

This seemed like the perfect app for serverless architecture design.

Wait, what’s serverless architecture?

Serverless doesn’t mean there are no servers.
Serverless just means the servers aren’t your problem.

Serverless, also known as function-as-a-service, means that you write small units of code (functions) that get triggered whenever events happen (like when a file lands in a folder). Mike Roberts’ Serverless Architectures article is a good place to start learning more.
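To make that concrete, here’s a minimal sketch of what an event-triggered function looks like – the shape AWS Lambda hands you when a file lands in an S3 bucket. The bucket and file names are made up for illustration; this is the pattern, not our actual code.

```python
# Hypothetical AWS Lambda handler: AWS invokes this function whenever a
# file lands in a configured S3 bucket, passing the event as a dict.
# (A sketch of the trigger pattern only - not the SQL ConstantCare code.)
def handler(event, context):
    results = []
    # An S3 event can batch multiple records; each one names a bucket/key.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        results.append(f"received s3://{bucket}/{key}")
    return results
```

You never start a server or poll for work – the platform calls your function when the event fires, and you only pay while it runs.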

We wouldn’t have to build, troubleshoot, and patch our own servers, nor would we have to worry about performance capacity. As we got more incoming data, AWS would just run our functions on more servers. You pay by the millisecond that your code runs. Even if that cost was relatively high, it still called to me as a tiny business owner because I flat out couldn’t afford to replicate the support and sysadmin infrastructure that would be required for conventional software-as-a-service. If I took a risk on serverless, and spent more up front on development, I might be able to build something that scaled more easily later.

And serverless isn’t expensive, by the way – your first million AWS Lambda requests per month are free, and $0.20 per million requests thereafter. Pricing gets a little tricky as you configure the memory your function needs, but even still, at the kinds of scale we’re talking about, it’s way, way cheaper than buying a single server, let alone hiring a sysadmin.

Richie’s early process sketch

When designing a serverless app, you think about tiny services and queues. In SQL ConstantCare®, that means:

  • The client runs the collector app, which pushes data into a cloud file service (Amazon S3). At this point, the synchronous client-facing work is done, and the rest happens asynchronously via functions and queues.
  • As files arrive, S3 would automatically add related records to a queue for processing.
  • Functions would launch for each file, importing it into a database.
  • As each function performed operations on the incoming files, like adding them to a database or checking data for business logic rules, they’d add records to the next queues.
  • Eventually a queue entry would trigger a function to send an email with the list of tasks for the client.

If we got overwhelmed with files, fine – they’d chug along asynchronously. If a function broke, fine – we’d troubleshoot it asynchronously without clients seeing errors. Granted, their emails might sometimes take longer than others to process – but who cares? This is mentoring, not real time monitoring.

The combination of what I wanted to build, plus the brand-new serverless thing, just seemed like the absolute perfect match.

Even Richie’s documentation looks awesome

But serverless was one hell of a risk.

The process and architecture diagrams were chock full of icons for brand new AWS products. To give you some perspective on when the decision was made, and how risky it was:

  • 2014-Nov – AWS Lambda announced
  • 2015-Nov – I swill the serverless Kool-Aid
  • 2016-Feb – Google Cloud Functions goes into private early release
  • 2016-March – Richie starts working on PasteThePlan, our first serverless project – using AWS because it was the only game in town that we could access
  • 2016-March – Microsoft Azure Functions goes into private early release
  • 2016-Aug – Mike Roberts’ fantastic introduction to serverless appears
  • 2016-Sept – PasteThePlan goes live, Richie starts working on SQL ConstantCare – at the time, we just called it “the service” because we didn’t have a brand yet. Based on the AWS Lambda experience with PasteThePlan, we were hooked, and continued moving forward with that. We did switch to the Serverless framework, though.
  • 2016-Nov – Microsoft Azure Functions goes into general availability
  • 2017-March – Google Cloud Functions goes into general availability

To put it another way, I decided to go serverless before there was any competition in the market. If AWS Lambda had died, or if Google or Microsoft unveiled something dramatically better, or even if AWS themselves offered something way better, I ran the risk of flushing a lot of money down the toilet.

To reduce risk, hire brilliant people and trust them.

When we hired notorious dataveloper Richie Rump (@Jorriss), I said something to the effect of, “Here’s what I want to build, I want us to use serverless to build it, and here’s a list of reasons why. However, you’re going to be in a Wild West of uncertainty because the tools are so new – so if at any point you think we need to switch back to conventional architectures, I’m fine with that. It’s your call.” I knew development would take longer – much, much longer – but because there wasn’t really anything like this in the market, and I didn’t think there would be soon, I could afford to take some time.

PasteThePlan.com

Richie built PasteThePlan with serverless architecture first as a learning experience, and wrote about that here. It worked out extremely well – the hosting costs were dirt cheap, and support was even cheaper. For example, when the Meltdown/Spectre attacks hit, the AWS update was straightforward: they fixed everything automatically without customers lifting a finger.

The platform wasn’t the only risk, either: because I could only afford to hire one developer, I had all my eggs in Richie’s basket. If he got hit by the lotto, got a better job, or just got tired of working for me, the project would be set way the hell back. All I could do was keep giving Richie whatever he needed to make the project successful. That, and take him aside from time to time and tell him how good his basket looked.

In 2018, looking back, using serverless for SQL ConstantCare® feels like it was a smart decision. We got lucky – Lambda caught on, AWS kept investing in it, the tooling got better, and no new competitors emerged in our market. Sure, there’s some survivor bias here: we might have been able to go live earlier, cheaper, by building it in a more conventional web app. However, I think as our marketing ramps up over the summer, we’ll be glad we built it to handle bursty demands. (We’re only diagnosing 76 servers a day right now, for perspective – I’m not marketing it hard yet, just letting people find their way in and we’re learning as adoption grows.)

In upcoming posts, I’ll talk about what data we chose to collect, where we store it, what we’re learning from it, how we bundled and priced the services, and more. If you’ve got questions, feel free to ask away – in these behind-the-scenes posts, I try to share the kinds of business decisions I’d find interesting as a reader. Next up: the product, packaging, and pricing.