Blog

Availability Groups Bug with Indexed Views

The short story: if you drop an index on an indexed view, queries on the Always On Availability Groups replicas that point to that indexed view will cause memory dumps.

The long story: to reproduce – AND DO NOT DO THIS IN PRODUCTION – on the primary, create a couple of tables, an indexed view, and hold my beer:

Over on the secondary, run a query against that view, and look at the execution plan, making sure it’s actually using the index on the view:

Successful

Then, over on the primary, drop the index – but not the view:

On the secondary, run your query again, and – uh oh:

The Kill State

What happened was that the secondary tried to use an execution plan that pointed to the index on the indexed view – but the index no longer existed.

The secondary’s error log shows a memory dump, always a good time:

Mama always said you were exceptional

And I know what you’re thinking – we just need to recompile the query to build a new execution plan that doesn’t include the index on the view:

Sad Trombone (which happens to be the official state song of the Kill State)

And no, sp_recompile and DBCC FREEPROCCACHE don’t work either. Even getting an estimated execution plan for anything related to the view causes a memory dump.

The only way to stop the continuous memory dumps is to free the system caches:

Microsoft is now aware of it, and working on a fix for a future cumulative update.

In Azure SQL DB, the problem manifests itself differently: instead of repeated memory dumps, the secondary replica restarts itself after the first query fails. Subsequent queries don’t because the server restart “fixed” the system cache issue. At first that sounds awesome – until you realize that the secondary replica can also be a primary replica for other databases that happen to be running on it.

I wanna give a big shout out to the nice folks at iQmetrix, especially Chris B. and Patrick H. They got SQL ConstantCare® alerts about a flurry of memory dumps on one of their replicas, so they opened a support case, kept us in the loop as the case progressed, and let us share the results with you to hopefully prevent more heartbreak until the patch comes out.


Parameter Fluid Optimization

SQL Server
0

Optimizer Studies

I don’t have a real job, which means that if something strikes my fancy, like staring at a spot on the wall for an hour, I can do it.

Occasionally things are a bit more inspired, and so I go poking around what SMRT PPL might be up to.

All you nice people out there paying $7k a core for Enterprise Edition make really cool research, especially into databases, possible.

There’s all sorts of cool stuff in there that has been surfacing in the product lately, from plan progression to the coLUmn ____St0r3! improvements to, well, click around a little.

Modern Plans

The general direction of the optimizer is not only towards automation, but also towards adaptation. Hence all the work in 2017 for Adaptive Joins, Memory Grant Feedback, and Plan Correction.

One of those research papers looks especially interesting to me, despite being named with the intent of dissuading readership:

Leveraging Re-costing for Online Optimization of Parameterized Queries with Guarantees

Art School Confidential

If you don’t have an advanced math degree or a couple hours of spare time, just read the abstract:

Parametric query optimization (PQO) deals with the problem
of finding and reusing a relatively small number of plans
that can achieve good plan quality across multiple instances
of a parameterized query. An ideal solution to PQO would
process query instances online and ensure (a) tight, bounded
cost sub-optimality for each instance, (b) low optimization
overheads, and (c) only a small number of plans need to be
stored. Existing solutions to online PQO however, fall short
on at least one of the above metrics. We propose a plan recosting

based approach that enables us to perform well on
all three metrics. We empirically show the effectiveness of
our technique on industry benchmark and real-world query
workloads with our modified version of the Microsoft SQL
Server query optimizer.

Hang The DJ

If you’ve been using SQL Server for a while, you’ve probably had to ask the question “why is my query suddenly slow?”, you’ve likely come across the term Parameter Sniffing, and you’ve maybe even been sent directly to a Summer Forest.

Will this fix every parameter sniffing problem? I don’t know!

Like I always say — if performance was never good, there’s no automatic fix for it (yet!). You could go from having one bad plan to having ten bad plans.

You’ll still have the same underlying plan quality issues to address, with no magical fixes in sight.

“lol why is max degree of q-bits set to 1 trillion?”

I love having this kind of stuff to look forward to. People ask if I’m scared that X feature will put X line of work out of business.

My answer is always the same: I’d be a lot more scared about going out of business if there were suddenly no new features to learn.

Thanks for reading!

Brent says: With open source databases getting better, and Platform-as-a-Service options like Google Cloud SQL for Postgres and Amazon Aurora catching on, Microsoft has to keep innovating to stay ahead. They have a neat advantage: the money they invest in database development pays off both in their Azure SQL DB hosting, and in their SQL Server boxed product offering. It’s really cool to see them recognize that and to continue investing research money into features nobody else has yet.


Hey, That’s Not My Sort!

Understand Your Plan

Mr. Optimizer does the rock n roll hoochie koo with Mrs. Optimizer, c. 1953.

When reading query plans, you may sometimes see a sort when you didn’t explicitly ask for data to be sorted.

Sometimes they show up to support order-preserving operators like stream aggregates, merge joins, or segments. Other times, they may show up to help optimize an operation that would otherwise rely on random I/O.

Random I/O is estimated to be much more expensive than sequential I/O. This comes from the optimizer being a bit long in the tooth, and not being hip to storage that doesn’t have design features in common with a record player.

This can be pretty confusing sometimes, so I’m going to highlight five queries and their plans where the optimizer has injected a sort operator, when we haven’t asked for any particular ordering.

The point isn’t whether the Sort is good or bad, or if you need to index to support the ordering that that optimizer has inflicted upon your plan, it’s just to show you why they sometimes show up.

Plan One: The Optimized Key Lookup

I’ll say it one more time in case you’re playing catch up: clustered index key columns are implicitly (or inherently, if you’re feeling rhymey) present in all of your nonclustered indexes. Where they’re stored in the index depends on uniqueness, but if your nonclustered index leads with columns that aren’t your clustered index keys, they’ll like fall out of order in the index pages.

When the optimizer chooses a key lookup plan, it effectively joins the nonclustered index to the clustered index using the clustered index key columns in both.

In some circumstances, the optimizer may decide to sort the output of the nonclustered index to reduce the random I/O involved in the key lookup.

It Just Runs Faster

This is what happened here. The Id column in the Posts table is the PK/CX. The nonclustered index that got used for our query looks like this.

Clearly the Id column isn’t in the definition.

Plan Two: The Sort To Support A Stream Aggregate

The Optimizer has chosen a Stream Aggregate over a Hash Aggregate. Unfortunately (using the same nonclustered index as before), the OwnerUserId column isn’t ordered in any way. It’s only an included column in the leaf level of the index, and not in any particular order, and doesn’t exist in the intermediate pages in a seekable fashion.

 

Islands In The Stream

Where Hash Match aggregates can take data in any order (this doesn’t necessarily make them better, but hang on), Stream Aggregates need everything in order first.

This is a trade off, though. For a Hash Match aggregate, it behaves a bit like a Hash Join in that all the rows have to arrive at the operator before hashing can begin. With a Stream Aggregate, if the data isn’t in index order, it needs to be sorted. The Sort behaves similarly to the Hash Join (huzzah! The trade off!), and all rows have to get to the Sort operator before sorting can begin.

The other thing that sorting and hashing have in common is they’re both memory consuming operators. This is what all the fuss about memory grants is about.

Plan Three: The Sort To Support A Merge Join

Just like a Stream Aggregate, Merge Joins need sorted input. There’s some sense to this plan, in that with two Sorts, you can deduplicate some amount of data prior to sorting. There’s also that Stream Aggregate after the Merge Join, which needs sorted input. If the Sort were after the Merge Join (for some strange reason), there could be a whole lot more data to sort.

 

Wild, huh?

Traffic Jam

The Hash Match is applied to one side of the join to reduce the number of rows to sort and join.

Why not both sides?

Good question.

Plan Four: The Distinct Sort

This one is a little more obvious, but I decided to include it because I think you’re all nice people who deserve mostly complete blog posts.

I hope someday you find them.

Ha ha haaaadhefoiwsoihrgjergopjh.

This wine is only okay.

If you use DISTINCT or GROUP BY in your queries, the optimizer may choose to implement them with a Distinct Sort.

Why not a Sort and then a Stream Aggregate?

Good question.

Maybe it’s both?

Plan Five: The Windowing Function

I know, you think I’m cheating here. You think I’m a big fat cheaty cheater. Liar liar plans on fire.

There’s an order by here, but the Sort has almost nothing to do with it. All of these queries use the same index as before. The order by for CreationDate is fully supported.

You’re a good time

Or at least it would be, if we didn’t have to Partition By OwnerUserId first.

Partition By sorts the data, too. This is why indexing correctly can be important.

The optimizer doesn’t really have any alternatives. In many of the other plans, things could have been done differently. This one is all you.

Or me.

Where were we?

Put On A Happy Face

Now when you see a mysterious Sort operator appear in your query plans, you’ll have some more clues as to why.

Hopefully this helps you out, and helps you make a decision about how to interpret and tune things where necessary.

Remember that the only cure for a SARGable Sort (a Sort that’s not on an Expression) is an index that supports it. Adding or rearranging indexes may help (query rewrites may also help, sometimes…).

Thanks for reading!


Book Review: Inside SQL Server 6.5

Book Reviews
44 Comments
Books Offline

30 years ago, back in 1988, Microsoft, Ashton-Tate, and Sybase got together to start building SQL Server 1.0 for OS/2.

I wanted to take a trip down memory lane to celebrate, but I didn’t wanna go back quite that far. I wanted to go back to what would probably be fairly recognizable – say, over 20 years ago: SQL Server 6.5 as of 1996.

I picked up half a dozen used books about SQL Server 6.5, then spent a delightful weekend reading them. Seriously delightful – lemme tell you just how into it I was. Erika and I eat all weekend meals out at restaurants, but she saw me so happily curled up in my chair reading that she insisted on going out and getting tacos for us just so I wouldn’t have to get up. I was having that good of a time READING BOOKS ABOUT SQL SERVER 6.5. (Also, Erika is amazing. Moving on.)

To bring you that same fun, I wanna share with you a few pages from Inside SQL Server 6.5 by Ron Soukup, one of the fathers of SQL Server. (And please do NOT pay full price for that. Used ones pop up for sale all the time, or support your local Goodwill or used book store. It’s not like it really matters which one of these 6.5 books you buy, hahaha – they’re all a hoot to read.)

This book was written long before Microsoft Press descended into bland marketing material. Back then, Ron could still pepper the book with all kinds of personal opinions and stories, bringing the material to life. Forgive the pictures – my scanner’s got a big line going down the middle, and I’ll be doggone if I’m going to buy even MORE old technology!

Chapter 1: a little business history to warm you up

Now, let’s move on to choosing the right hardware:

Installation time: picking between processor families

That’s right: SQL Server used to support Intel x86, MIPS R4000, DEC Alpha, and Motorola PowerPC. When I first read that, I thought, “Wow, I forgot how much of a Wild West hardware used to be.” But you know what? Today’s really no different – cloud vendors are the new processor choices. (And just like you, dear reader, don’t usually make the decision about the cloud vendor, you wouldn’t have been the one making the hardware decision either. That was done by big executives on golf courses as they talked to vendors.)

Once you’ve picked hardware, now it’s time to fill that box up with memory:

“I’d recommend start with enough memory to give SQL Server at least a 10-MB cache.”

Choosing storage is important too, of course:

You don’t have to worry about the CD-ROMs and tape drives being on the same SCSI channel. They don’t hog bandwidth.

As antiquated as this stuff sounds, DBAs are still dealing with the same problems today – it’s just that the problem has different names now. The work you have to do today to get maximum throughput from Azure VMs is really no different than what these people were monkeying around with 20 years ago. It’s just that instead of 6 drives per SCSI channel, now you’re dealing with how many Premium Disks you have to stripe together to saturate the network throughput of a given VM type. Same math, different names.

Then, how are we going to license our 1996 server?

The old server + CAL model

Hold on, hold on, that looks cheap, but it’s not as cheap as you think. When you translate it into today’s dollars due to infl….uh, actually, yeah, that’s still pretty cheap.

Alright, moving on. It’s purchased – let’s start installing and configuring it. Take memory, for example:

How to set max memory

Twenty years ago, way back in the dark ages, we had to use formulas in order to tell SQL Server – this massively powerful database system – how much memory it was allowed to use.

Today…we still do.

Today, software that costs $7,000 per core still can’t suggest a default amount of memory to take. These are the things that really blow me away as I read these books – sure, the terms have changed, but so much of the work we’re doing today is still exactly the same. I keep hearing about these magical robots that are going to take our job – so are we going to have to tell the robots what their max memory is, too? Or set their MAXDOP so it isn’t unlimited by default? (That’s how the world is going to end, by the way: Microsoft is going to bring out a robot, tell us to set its MAXDOP as part of the unboxing process, and a bunch of people are just going to open the box, let it breed, and we’re all going to die. Seriously, Microsoft, nobody reads the manual. Try not to kill us.)

Moving on – let’s create a database. Back in SQL Server 6.5, you had to create a “device” (not a file) on disk, then put the database inside the device:

It creates a basket, and then it puts the lotion in the basket

And I can almost hear you giggling, ha ha ho ho, Grandpa Database Administrator was so backwater, but stop for a second to think about what you do when you build a brand new, state of the art Azure Managed Instance.

You have to specify how big the storage is going to be.

For a database that doesn’t exist yet.

And when it runs out of space, you have to deal with it – just like Grandpa did.

Not so fancy now, are you, kiddo? Well, I’ve got good news: in this book, Ron gives us a vision into The Future:

The Future

THE FUTURE IS HERE, RON, AND IT’S AWESOME, BUT IT’S NOT ALL THAT DIFFERENT

Alright, moving on – let’s start storing some data:

Clustered indexes

Twenty years ago, you could buy a book off the shelf – just one book – and it gave you the history of the database system, licensing, purchasing, configuration, internals, how to write a query:

Working with CASE statements

That part about the comments down at the bottom is kinda funny – there was a 64KB limit for stored proc comments. However, “neither of these limitations is much of a worry.”

Oh, Ron, the future is terrible. Microsoft lifted that limit, and people build stored procedures that make Shakespeare look like an essayist.

Performance tuning checklist

That performance tuning checklist is still viable today. I could totally see this pinned up on a DBA’s cubicle wall. I’ll be tickled pink when the robots master any of this stuff. Take it. Please. Get it under control.

This performance tuning snippet intrigues me:

Joins managed in groups of 4 tables

I’ve heard this “groups of 4” thing repeatedly from different people, and I wonder if this is where it got its start. I’ve noticed that on really complex queries (20+ joins), if most of the filtering is done on tables far down the join list, I get a massive speed boost by moving those to the first 4 tables joined, and filtering on them as early as possible. Just seems to shape which query plan SQL Server considers first. It’d be neat to know that it actually is a hard solid rule of 4 though.

The “Watching the optimizer’s decision process” section is mesmerizing, seeing how they used to work through this stuff:

Hello, trace flags
What the trace flags produced

Man, Grandpa had it rough. Moving on.

Priority Boost: bad idea for 20+ years

They’ve been telling you Priority Boost is a bad idea for over TWENTY YEARS, but I still find it in a disturbing number of servers.

Whew. What did I learn from this monster book? Well, Ron could have finished writing this book, jumped into a time machine, walked into any modern database shop, and his 1990s skills would have translated pretty well into what we’re doing today. The data is bigger, the servers are faster, and the costs are way higher – but the basic plumbing is all still the same. SQL Server has made some things easier – but it’s added a whole lot more toys in the box, and those toys still aren’t easy to manage.

Reading these books, I’m more convinced than ever that I picked the absolute perfect career. I love working with databases – really, really love it – and there’s always so much more to learn. Part of me wishes I could have gotten started earlier to see more of the original genesis, but the rest of me knows that so much awesome stuff keeps coming into view, and I wouldn’t wanna miss that either.

If you liked this one, holler – I bought a few others and read ’em for fun, and I can share those as well if there’s interest.


I’m speaking at SQL Saturday NYC

SQL Server
5 Comments

No, Not Brent

Just me. And a few dozen other people, on May 19 (2018).

I have sessions on the query optimizer, and all the stuff I think is cool in SQL Server 2017.

Away Days

You know, I get it. The weather is just finally getting nice in these parts.

No one wants to spend a Saturday inside thinking about their job. But free training days like this are priceless.

In a lot of ways, I consider the first SQL Saturday I went to back in 2012 to be a deciding factor in sticking with it as a career.

That and I won a book. I still have the book.

I’d give it away as a prize, but it’s about SQL Server 2012.

Is that thing even supported anymore?

Location, Location, Location

Besides, look how nice Times Square is.

Go here to register for FREE.


First Responder Kit Release: Darn May Showers

There’s usually something that I think is funny in this space. Instead, I’m going to extend my already extensive day of civic duty by letting you know something interesting about jury duty.

 

Sometimes, if you show up when they ask you to, they can make you come back for a second day if they don’t pick enough people to fill a a jury on the first day.

 

They’ll also be kind enough to give you absolutely no notice and be entirely inflexible about letting you reschedule.

 

Ah, the joys of not having to worry about customer satisfaction.

 

Not like us. We care about what each and every one of you thinks about these scripts, for free.

 

(strums guitar)

sp_Blitz Improvements

#1537: In older versions of SQL Server, if you ran into CMEMTHREAD waits, you’d flip on Trace Flag 8048 at startup and your problems would likely go away. On 2016+, that’s the default behavior. But apparently you can still run into CMEMTHREAD. We tried to give you better guidance, but Bob Ward wouldn’t let us put his phone number in there. Can’t imagine why. Oh well. Hi Bob!
#1552: Ever go to fix a typo and make another typo? Well, that typo is fixed now. Fixed for good.

sp_BlitzCache Improvements

#1522: We sort the second result set more prettier (according to Brent) now. It has good Lo Mein or whatever.
#1525: Busy Loops wording is more consistent. It used to be “frequently executed operators”, which was downright confusing at times.
#1528@RichBenner is a much more forgiving entity than I am. We’ll now let you get away with not patching 2017 past CU3 without throwing vicious errors.
#1543: With 2016 SP2 getting released, a bunch of the XML attributes that used to only be in 2017 got back ported. Stuff like tempdb spills and row goal information. You should totally upgrade. I’ll be your best friend.

sp_BlitzFirst Improvements

#1545: Eagle Eyed Professional Man With Standard @goochjj noticed that some of the dynamic SQL in sp_BlitzFirst was adding single use plans to the cache. He fixed that with some rather cunning parameterization.
#1388#1389#1396@EmanueleMeazzo did a metric t-rex load of work rewriting the views and view creation syntax so they’ll be faster. This has been in the works for a while, but totally worth it. We salute your patience!

sp_BlitzIndex Improvements

#1532: The taller and better looking version of me, @amtwo, spotted some wonkiness with the way we displayed data types for indexes with descending orders. Namely that uh, we weren’t showing them. Case expressions are hard sometimes. You just never want them to end.

sp_BlitzWho Improvements

#1527@TaraKizer noticed that sometimes wait times were wrong for parallel queries. This has been corrected with gusto.
#1531: A lot of times when you’re troubleshooting parameter sniffing, you can see the bad plan running. Getting the plan handle to knock it out of the cache is a pain. There’s right clicking and XML. Or at least there used to be. Now we surface it in @ExpertMode.

sp_BlitzQueryStore Improvements

#1544: I had made a bunch of changes in BlitzCache to put some checks in @ExpertMode and tidy up URLs, etc. They’ve all been added here so output should be aligned. There were also some behind the scenes changes to make sure we’re only diagnosing parameter sniffing for multi-use plans, and we’re not flagging unmatched indexes in trivial plans.

sp_DatabaseRestore Improvements

Nothing this time around

sp_BlitzBackups Improvements

Nothing this time around

sp_AllNightLog and sp_AllNightLog_Setup Improvements

Nothing this time around

sp_foreachdb Improvements

Nothing this time around

PowerBI

Nothing this time around

sp_BlitzLock

Nothing this time around

For Support

When you have questions about how the tools work, talk with the community in the #FirstResponderKit Slack channel. If you need a free invite, hit SQLslack.com. Be patient – it’s staffed with volunteers who have day jobs, heh.
When you find a bug or want something changed, read the contributing.md file.

When you have a question about what the scripts found, first make sure you read the “More Details” URL for any warning you find. We put a lot of work into documentation, and we wouldn’t want someone to yell at you to go read the fine manual. After that, when you’ve still got questions about how something works in SQL Server, post a question at DBA.StackExchange.com and the community (that includes us!) will help. Include exact errors and any applicable screenshots, your SQL Server version number (including the build #), and the version of the tool you’re working with.

You can download the updated FirstResponderKit.zip here.


Building SQL ConstantCare®: Refining Our Recommendations

SQL ConstantCare
6 Comments

One of the fun things about building SQL ConstantCare® as a cloud-based service is that we can iterate fast, making improvements based on the data we find. We can assess the population as a whole, figure out what people are doing well, and look more closely at the rules they’re struggling with.

What checks are everyone passing?

Let’s look at a sample day from last week when 562 servers sent in data (out of the users who agreed to share their data to improve recommendations):

  • No one was using the -x startup flag
  • No one had known-dangerous third party modules installed
  • No one was running a 32-bit SQL Server
  • No one had high memory usage for in-memory OLTP (Hekaton)
  • No one’s plan cache had been erased recently (aside from server restarts)
  • No one needed to restart to fix a TempDB file configuration mismatch in the DMVs
  • Everyone on 2016 Standard Edition had patched to SP1 to get the free features

That’s not to say no one was having these problems when they first started ConstantCare – the subject of which rules people are fixing the fastest is the topic of another upcoming post.

What recommendations are people struggling with the most?

  • 276 servers need to enable remote access to the DAC
  • 197 servers have databases with no CHECKDB in the last couple of weeks
  • 135 servers have over 10,000 plans cached for a single query
  • 132 servers have databases with transaction logs larger than their total data file sizes
  • 89 servers have databases with no recent full backups
  • 79 servers have databases in full recovery model, but no recent log backups (kinda ties into the 132 with big log files, too)
  • 78 servers have databases without checksum page verification turned on
  • 70 servers have max memory at the default (unlimited) or larger than their physical memory
  • 69 servers have user databases on the C drive (and 60 have TempDB on C)

As I look at that list, I ask myself:

  • Are the recommendations right? Are the thresholds valid, or do I need to tweak some?
  • Why aren’t people doing those tasks?
  • What can I do to make those tasks easier?

Lemme share 2 examples of how we’re adapting the emails based on your data.

Making the DAC recommendation easier

Take the top one, enabling remote access to the DAC. I believe the recommendation is right, and I don’t think that many people are pushing back due to valid security concerns. (Hardly anybody is muting that recommendation.) However, it sounds dangerous the way it’s written, and the call to action was kinda vague: go read this blog post and watch this video to learn how to fix the problem. It sounds like it’s going to take work on your part – when in reality, it’d be really simple. I was just making it sound harder than it is.

What I really needed to do was rewrite the guidance to get faster adoption. Here’s the new one:


Enable the Remote DAC – When poison wait issues like threadpool strike, you can fix them by connecting to the Dedicated Admin Connection (DAC). However, by default, you can only do this by remote desktopping into the box – something I’m not a big fan of doing. Run the below command, and the next time SQL Server restarts, you’ll be able to access the DAC remotely. More information about the DAC.


And in case you’re curious, yes, we can adapt the T-SQL in the recommendation based on whether it’s safe to run in their environment, like if they have any reconfigurations pending or if they don’t have show-advanced-options turned on. (To do that, I found myself building dynamic SQL in PostgreSQL. Man, my life is weird.)

That new recommendation will go into production soon, and we’ll see how that affects completion rates. I really take that stuff seriously – if your servers aren’t getting better, then we’re not doing our job and earning our keep.

Changing thresholds for plan cache guidance

The remote DAC recommendation make sense, and I just needed to tune the advice to make it easier to follow. However, what about the 135 servers with over 10,000 plans cached for a single query? Fixing that is much harder: changing apps to send in parameterized queries can take a long time, and forced parameterization comes with some risks.

For this one, I stepped back and asked – “Am I really sure this is a problem for all of these servers? Or are we maybe triggering that recommendation a little too casually?”

Upon further investigation, I realized that most of these servers had no memory pressure. Many of ’em were outright bored – with wait time ratios below 1.0, and in many cases below 0.1! To pick an example: if a server has 256GB RAM, spends less than 10 minutes per hour waiting on resources, and has days of queries in the plan cache, is it really a problem if 10,000 plans are cached for a single query? I mean, sure, it’s bad, but in the grand scheme of things, how bad is it? Especially when you’ve got lots of servers and lots of tasks to perform?

Some customers even emailed in the sources of these queries – they included backup software, monitoring tools, and even sp_BlitzFirst’s logging query! Granted, with open source tools, smart folks like Gooch can check in improvements to mitigate the problem – but realistically, is your backup vendor going to react to issues like this? Probably not.

Therefore, we’re editing that rule’s thresholds so it only fires when you’re under serious memory pressure, and you can’t keep much in the plan cache, and there are tens of thousands of queries for a single plan in the cache. We’ll also consider it when it looks like your server is under heavy CPU load due to high compilations per second.

Analyzing customer data from SQL ConstantCare® is my favorite part of my job right now. We’re building a new set of realistic, doable performance recommendations that make a measurable difference in servers – cutting the fluff, and focusing on the big bang for the buck tasks. I want you to be able to make as few changes as possible, and see as big of a difference as possible. We’re learning more every week.


[Video] Office Hours 2018/4/25 (With Transcriptions)

This week, Brent, Erik, Tara, and Richie to discuss cloud server disaster recovery, Availability Groups troubleshooting, SQL Server deadlocks, slow tempdb latency, migrating to Azure, SQL Server 2017 gotchas, how to set up your dev environment when looking for a job, best allocation unit size for disks hosting data files and good replacement for linked servers.

Here’s the video on YouTube:

Office Hours Webcast - 2018/4/25

You can register to attend next week’s Office Hours, or subscribe to our podcast to listen on the go.

Enjoy the Podcast?

Don’t miss an episode, subscribe via iTunes, Stitcher or RSS.
Leave us a review in iTunes

Office Hours – 4-25-18

 

Does Azure have a good hybrid DR option?

Brent Ozar: Alright, let’s see here. Anon says, “Do y’all know if Azure has a good hybrid DR option? We’re rolling onto SQL 2014 on-premises and I was just told we’re going to need disaster recovery.” I like that, we were told we were going to need disaster recovery. “Apparently, we don’t have any and it’s just me, so I thought maybe Azure might have something better than physical servers.”

Erik Darling: I love just picturing this conversation where it’s like, this is going to be a disaster…

Brent Ozar: We need to be able to recover from this disaster. I don’t know about Azure – we actually wrote whitepapers on a technique that works across all cloud providers. I’ll point way over to the – oh, I got the wrap around thing… So Tara, about that whitepaper that you wrote for Google

Tara Kizer: It’s a cheap solution, but it is a very valid solution. So sending your backups to the cloud into a storage bucket and then when the time comes, spinning up a VM and getting your system restored over there. Just tack the SQL Server component and you really can get your system up and running fairly fast. But just make sure you understand what you need to do to get your application server, web servers. Make sure you’re setting your source code to the cloud also because having your database backups is not enough to get your site back up and running.

Erik Darling: Yeah, and if your company already has, like, pretty tightly defined RPO and RTO goals, you need to make sure that there’s a separate discussion around what they’re going to be for DR, especially if you use the  cheaper DR option where you’re just sending the backups up because oftentimes it’s going to be outside what your current windows are. It’s just not going to be as immediate or as fast as manual work involved to bring stuff online.

Tara Kizer: And you can still use the whitepaper, you know, if your RPO and RTO is lower, just make sure you’re not only sending your backups to a storage bucket. Also, have the restores happening; the VM set up and the restores happening.

 

My app server has these errors in the event log…

Brent Ozar: Rakesh says, “I’ve got multi-subnet Always On Availability Group. If I have JDBC and it goes to connect to the listener, is it going to attempt to get all of the IPs?” Hold on a sec – let me finish answering. He also says, “My application is working fine but the log file is filled with errors.” That’s like a three-part question. Okay, let me sum up and rephrase. So he’s using Always On Availability Groups and he’s using the option where it will automatically try to connect to all of the servers at once. Is it a problem that logins are failing on the servers that aren’t the primary?

Tara Kizer: I’m confused by the question because he’s asking if it will connect to all IPs. And that’s going to be dependent upon if your database driver supports multi-subnet failover. It says you’re using JDBC, but it doesn’t provide the version number. JDBC 4.0 does offer the multi-subnet failover equals true option. So make sure you’re on JDBC 4.0. And I would hope you are, since that’s been around for a very long time; you shouldn’t be on older stuff. And then your connection string needs to include multi-submit failover equals true into it. I wonder if you’re getting 50% success for the – is it just the secondary servers that have the errors?

Brent Ozar: He didn’t say, but I would bet it is.

Tara Kizer: Yeah, because I don’t know that the errors on the secondary – I guess that would be trying to connect to it if you don’t have multi failover subnet in there. So check your driver version and check your connection string. Multi-subnet failover equals false is the default. So you have to turn it on. And Microsoft actually recommends that you go ahead and enable this feature even if you aren’t using a multi-subnet configuration to indicate there’s some performance benefits just using Availability Groups in a single subnet.

Brent Ozar: Oh, this is interesting. He follows up with, “We’re using JDBC 6.2…” Okay, that’s good. “We’re getting 100% success, but the errors are in the application log, indicating connection failure.” The application event log in the SQL Server or the app servers? If it’s the app servers, I don’t think we care. We’re database administrators.

Tara Kizer: I don’t even have access to those servers.

Brent Ozar: Yeah, Rakesh says it’s in the application server. Yeah, then don’t worry about it. As far as I’m concerned, that’s an application problem.

Tara Kizer: Filter them out.

 

Where can I learn about deadlocks?

Brent Ozar: Teschal asks, “Where can I get good information on SQL Server deadlocks?”

Tara Kizer: Good information…

Erik Darling: Good information is tough.

Tara Kizer: Jonathan Keheyias would be who I’d think of. He loves deadlock troubleshooting. He might be the only person in the entire world that loves it. They’re frustrating. I’m just like, have you enabled RCSI? Because that’s where I’m stuck.

Brent Ozar: It is hard. Erik actually wrote a tool around it too.

Erik Darling: Oh god, yeah. I was drunk on a plane with Brent, so if there are any errors, don’t take them personally. It’s in the First Responder Kit, so if you head over to firstresponderkit.org there’s a stored procedure in there called sp_BlitzLock, which is I guess a little bit misleading because it’s actually to look at deadlocks. You can have it look at the system health extended events session. It only works on 2012 plus because the deadlock definition changed a whole bunch in the XML and documents, so I’m just kind of crap out of luck on that. but it will look at either the system health extended events session, or if you have a custom extended events session that capture deadlocks, you can pass that in as one of the parameter variables or whatever. It will give you back all sorts of cool parsed out information about which queries did what and when and all sorts of stuff.

Brent Ozar: It’s really awesome. I even like it just for regular locks, not even deadlocks because, half of the time, there’s some deadlocking involved, it’s just not big and it gives you a place to go start. Since Kehayias is really hard to spell, I’ve got it up on the screen there as well. He’s got several posts about it. The Simple Talk one is fantastic.

Richie Rump: You’re IEing again.

Brent Ozar: I am. I’ve got to put Chrome on this thing. It’s so terrible. It’s so bad.

Erik Darling: You got to use IE no matter what because it’s the browser you don’t use. There’s no history in it. There’s no bookmarks. It’s like the only public-facing safe browser you can do is that and you can only Bing things. Otherwise, no, avoid chrome in public.

Tara Kizer: I have a little bit of a funny story, since we don’t have a whole lot of questions in the queue anyway. So in February, I was trying to get a permit to a very popular place and we went to the website a couple of minutes before 7AM when it was supposed to go live. And in Chrome, refresh, control F5, and it would not pull up the reserve button. So you know what, I’d probably a caching issue, even though I’d tried all the tricks. So I pulled up Edge, you know, Microsoft Edge was on this computer and sure enough, it worked over there so I was able to complete my permit and get this extremely hard permit because I hadn’t gone to the website yet in Edge. A lot of people encountered it and it would work from their phone but not their desktop.

Brent Ozar: See, I would have thought have using my phone before I would have thought of using Edge.

Tara Kizer: I was panicking too.

Erik Darling: That happens to me a lot with Microsoft sites. Like, I’ll open them in chrome and it will say like error, connection refused or like connection broken, and then I open it in Edge and it comes right up. I’m like, okay come on, guys.

Tara Kizer: See, I don’t even think about it for that; I just don’t get to go to that page today, I guess.

Brent Ozar: I get it with Books Online and Tacknet stuff all the time; you have to log in to see this. Screw you.

Tara Kizer: Wow, I’ve never encountered that, huh.

 

I added TempDB files, and my latency went down…

Brent Ozar: Let’s see, Adrian says – he has a follow up on last week’s slow tempdb latency. He had an eight virtual core server. It had eight data files; nine seconds latency on each tempdb data file. Now that he only has one tempdb data file, the latency is only 12 milliseconds; no more sad trombone. I have a hunch that it’s not related to the number of files. I have a hunch that something else is going on. It’s possible, I just think…

Erik Darling: It seems like the opposite of the way things usually go.

Brent Ozar: It’s possible. It is possible. There’s so many…

Tara Kizer: [crosstalk] they had to restart SQL Server, so if I wonder if something else happened there.

Brent Ozar: yeah, that’s what I’m thinking too, like patching or just a parameter sniffing issue.

 

My company wants to migrate from SQL Server to CosmosDB…

Brent Ozar: Phil says, “My company wants to migrate SQL Server up to…” What the…

Erik Darling: No.

Brent Ozar: God bless you, Phil, this is a good question. “My company wants to migrate some SQL Server databases to Azure Cosmos DB in order to go faster. They use the XML data type, so they see it as a silver bullet. Are there any metrics you know of to evaluate RDBMS to DocumentDB’s suitability?

Erik Darling: How is Richie still standing?

Tara Kizer: Just shaking his head.

Richie Rump: I mean, if you’re using XML as a document and you’re storing it in a relational database, you’re missing the point, right. So you need to shred all that out and put that into a relational manner. You’re using a relational database as a document data store. So you’re going to get bad performance from that even though you can get XML indexes and all the other crazy stuff on top of it. You’re going to get better performance from a document database when you’re using documents. That’s just kind of the way it is. Now, if there’s any metrics or anything I know; no. You’ve just got to put a test set and say here’s one and here’s the other and then go for it, but then the document database is always going to be faster for documents.

Brent Ozar: The other thing is, Azure Cosmos DB has nothing that SQL Server has. It doesn’t have any of the fancy-pants stuff that SQL Server does. Transaction logging – you get snapshot point in time backups like once every four hours and that’s it.

The thing that I would usually ask companies too is, okay, so you’re going to rewrite the data layer, right, because you now have a totally different layer going over to Cosmos DB. Of course, it’s going to be faster because you’re not going to be so drunk this time when you go and rewrite it. You’re like, oh we never should have built it that way and we never should have built this, and it’s going to be faster when you rewrite it.

Richie Rump: Yeah, you know, or you could take a look at it and say do I have relational data? Am I storing it in XML because somebody, like Brent said, drunk that day and whatnot and we just stored it in XML because we’re lazy or somebody was lazy? But is this relational data – I think that question needs to be asked first before you go and ask what’s my database going to be. Well, what kind of data do I have? Then you can answer the database question.

 

Have you used @@dbts?

Brent Ozar: Pablo is asking us an interesting one, “Have any of y’all used @@DBTS, the timestamp?”

Tara Kizer: No – I’m like, what is that?

Brent Ozar: No, I’ve never used this either. None of us – Pablo, we won’t be able to give you a good answer, but what I would do is ask it on dba.stackexchange.com. I’ve never seen that either, so…

Erik Darling: Just to follow up a little bit on Phil’s thing, using XML in SQL Server religiously is like throwing your database the hardest knuckleball you can possibly throw it. Whatever’s in that document is like kind of unknown. Like, getting the certain paths and the tributes and evaluating them is just a nightmare. This is nothing that the optimizer is good at or was built for and there’s no special rules built in to, like, make XML better for that. So if you’re querying XML in SQL Server without XML indexes or without parsing it out first, yeah, it’s going to suck all day long.

Brent Ozar: Phil follows up with, “Yes, XML does suck.”

Erik Darling: JSON’s not any better.

Richie Rump: It’s not XML; it’s just the data format. I mean, SQL Server just wasn’t designed for XML. It’s a CSV – it’s just any – it’s a data format. SQL Server was designed for relational data. You put in rows and columns and then you can do other things and make it super fast. But when you don’t do that and you just throw data at a row and a column and say here go and search for something, it doesn’t like it. And it doesn’t matter if it’s XML or JSON; it’s using the same things behind the scenes.

Brent Ozar: Files, all kinds of stuff.

Erik Darling: XQuery is hard.

Brent Ozar: Says the guy who writes XQuery.

Erik Darling: I know, it’s hard. It’s not fun.

Richie Rump: You’re insane.

 

Any new gotchas with 2017?

Brent Ozar: Steve says, “I’m going to go install a new instance of SQL Server 2017 and it’s just a dev instance. Are there any new gotchas with 2017?”

Tara Kizer: What version is he coming from? 2012 or older? Because the new gotcha from 2014 up is the new cardinality estimator, which can cause some performance degradation for some queries. So just be aware of that if you’re coming from an older version or you are using a compatibility level that is 2012 or older.

 

Is it okay to force legacy cardinality estimation?

Brent Ozar Which segues beautifully into – Jordan asks, “I ran into the new 2016 cardinality estimator issue with a script. I have several tables joined with two views.” We’re off to a good start, “It took less than two minutes on 2008 R2.” You were happy with that? “Less than two minutes on 2008 R2 and over 15 minutes on 2016. If I remove either view in the join, it goes back to two minutes. My solution right now is to use a forced legacy cardinality estimation. Do you have any concerns?

Erik Darling: Yeah, but not with the hint. I’m concerned with everything you’re doing that’s leading up to the hint. I hear joining to views and in my head I immediately see views within views within views that are joined to other views. And I know that the problem is in that process – it’s like the code that Brent always talks about that’s like 90 years old that no one wants to touch because if you move a comma it breaks. So I feel for you on that, but my concern is not with you hinting the old cardinality estimator.

Brent Ozar: Jordan followed up and he said, “You know, if we’re honest, the real problem is that it’s a select and a cursor, which is the real tyrant.”

Erik Darling: Is there a trigger and a function? What kind of sadistic bingo are you playing?

Tara Kizer: Non SARGable process, yes.

Richie Rump: Please tell me the database is less than 1GB.

Erik Darling: By the way, it’s in AdventureWorks.

Brent Ozar: He gives an industry, but I’m not going to repeat that on here just in case – because the thing’s recorded, of course, and put in publically, but yes. So does it give us concern? If it makes it go fast, obviously you specialize in duct taping things together in order to get past the next hump so this should be perfectly comfortable with you. It’s normal.

 

I have this 8-socket, 1-core-per-socket VM…

Brent Ozar: Let’s see, Robert says, “I have a VM configured with eight sockets and one core per socket.”

Erik Darling: Boo…

Tara Kizer: Well, that’s unusual.

Brent Ozar: And why boo?

Tara Kizer: You know, I’m not going to answer that, you know I won’t.

Erik Darling: So, why boo is – I mean, just immediately, whenever I see anyone who has done this, it’s always on Standard Edition and they always have cores and or memory offline. I don’t have any particular gripe with the setup outside of that. Like, as long as it fits within your licensing and everything’s online and available, I kind of don’t care. But immediately I jump to, you know, you’re on Standard Edition and two of your cores are completely offline and maybe memory – four of them are offline, sorry and maybe even memory offline as well; I don’t know.

Tara Kizer: Run sp_Blitz. At the bottom of sp_Blitz, there’s a section that will let you know if some of them are offline or not.

Richie Rump: ConstantCare would do the same thing.

Brent Ozar: It sure would. Richie and I have caught a lot of people with cores and memory offline and that’s one of my favorite things to email people. Check this out; you server’s about to go twice as fast as soon as you – it’s incredible.

Tara Kizer: It’s in the hardware numa config section in sp_Blitz, just to give a hint.

 

What’s the best way to set up a lab?

Erik Darling: Niraj says, “I’m not working and I’m looking for a job. What’s the best way to set up my environment to practice until I get a job?” The only answer I have for that is set up your environment for the job you want to get. If you want to do, like, perf tuning stuff, set up whatever version of SQL Server Developer Edition you have available to you. 2016 and 2017 are totally free. Download a great big honking database like Stack Overflow and do your practice and setup stuff in there. If you’re looking for SSIS or SSRS work or just like plain old DBA work then install whatever tools you need.

If you’re looking to do, like, infrastructure or prod stuff, you know, use Hyper-V because it’s free as all get-out and you can create VMs, you know, primary domain controllers, make AGs, log shipping, mirroring, whatever. Set up your environment for the job you want. That’s the best advice I have.

Brent Ozar: I like that. There’s also the Microsoft Virtual Labs too. Open up Internet Explorer – brace yourselves, everybody. So Microsoft marketing wants to get you to get the latest certifications and play around with the new technology. So they do their own labs online that you can go through – I’m not going through all those. So there’s a bunch of self-paced labs on here that you can go through…

Erik Darling: What has Stretch Database come up first?

Brent Ozar: Marketing is like, come on, we got to get this pig to fly.

Erik Darling: Use it, someone…

Brent Ozar: But the thing that I adore about these are they’re all free and you don’t have to use them to do what the labs want you to do. They’re just VMs, so you could use them for anything you want. You could set up database mirroring, you could set up log shipping. There is a gotcha…

Erik Darling: Mine Bitcoin…

Brent Ozar: The Gotcha is that there’s no copy paste and there kind of silo-ed off from an internet perspective. So it makes it really hard to go download the stuff you want. But as long as what you want is included in the boxed product, it’s easy enough to play around inside there and totally free. I want to say they go away after two hours or something like that…

Erik Darling: Yeah, they have a time limit on them.

 

Any overhead in cross-database stored procs?

Brent Ozar: Michael asks, “Is there any overhead from calling a procedure in one database from another database? Like, if I’m in AdventureWorks and I go call something in Northwind?”

Erik Darling: Like, performance like finding the procedure perspective? It’s minimal.

Tara Kizer: As long as it’s not on another server using linked servers. It sounds like it’s on the same instance.

Brent Ozar: Then he says, “Is there any significant overhead on using one procedure to call another? We’re keeping legacy procedures in an old location while reorganizing where the current prod procs live.”

Erik Darling: No, but it can make your plan cache really confusing because whatever stored procedure calls substored procedures will get all of the resource use attributed to it. So if you have a stored procedure that calls like three other procs and you run the two – let’s say, I don’t know, let’s go crazy, sp_BlitzCache, then you see that calling stored procedure way up at the top of the list, just keep in mind that all three of those – whatever all three of those stored procs do is going to be attributed to that calling proc. So it can make things a little confusing, but there’s generally nothing crazy insane that’s going to go wrong with it.

 

What’s the best NTFS allocation unit size?

Brent Ozar: And then Pablo asks, “What would be the best allocation unit size for my disks where I host my data files when I format my drives?”

Erik Darling: We’re here from 2003, I guess [crosstalk] cursors and… Whatever your vendor documentation says. If you’re using local disks then – I can’t even remember the last time I even bothered thinking about that. like, when I first started here and people were still using SQL Server 2003, we would, you know, look at disk allocation and…

Brent Ozar: Windows Server…

Erik Darling: Windows Server 2003, right. But yeah, we used to look at that way back then. There were some, like, boss command, but I can’t remember – most people on SANs or whatever else, it’s whatever the vendor docs say.

Brent Ozar: Because half the time, they’re using a different format under the covers anyway. Like, NetApp, it’s all 4K underneath.

 

What is a good replacement for linked servers?

Brent Ozar: Oh, Kevin. Kevin asks, “So, what is a good replacement for linked servers?”

Erik Darling: Who’s that bearded guy? That guy with the beard that was supposed to be here that has a good replacement for linked servers; what’s his name?

Brent Ozar: Oh, yes…

Erik Darling: What’s that guy?

Brent Ozar: Andy Leonard.

Erik Darling: Andy Leonard, that’s right. What’s that thing he uses? What’s that thing [crosstalk 0:20:21.2]?

Richie Rump: Reporting services or something like that?

Erik Darling: Almost…

Richie Rump: That was it; DTS packages.

Erik Darling: [crosstalk] SSIS. SSIS is a great replacement for linked servers because there’s no linking because it’s just a thing that sits there and moves data around.

Tara Kizer: I mean, it also depends on what the linked servers are doing. If you’re just using it to grab data bulk type stuff, then SSIS. But if it’s not a bulk type thing then your application should be smart enough to go to another server to grab data and then you join the data together at the application side.

Richie Rump: Depends how much data we’re talking about here and how many transactions are involved and are we using an ORM to get all this data?

Erik Darling: Or are we?

Brent Ozar: Oh my goodness, we…

Erik Darling: Can you edit that out later? Pavel, can you blank that section? Put a do-over on that.

Brent Ozar: Ah, Kevin says, “It’s researchers pulling data into their working databases.”

Erik Darling” Oh, just use Access. I’m just kidding.

Brent Ozar: That’s different. Yeah, in that case, SSIS is magical. Like, I’d be like totally, teach them how to start transforming the data, put it into to different style. SSIS is great for that. I would just make sure that they – and this is not a plug for Andy’s class, but it’s just a plug for any kind of education. Make sure that they know vaguely what they’re doing, otherwise they’re going to hook SSIS up to production and suck all the tables down every time. They’re not going to do any change detection; they’re just going to try and pull all the data down every time.

 

Ah, that’s why the other TempDB files aren’t getting used

Brent Ozar: Adrian says – uh-oh, this is a follow up on his tempdb, I think. He says, “I’m using SentryOne to look at data patterns as well and we can see that all the I/Os only flow into the primary data file. Now, I will do a restart and revert if there’s anything insightful I can find.”

Tara Kizer: He’s the one mentioned the nine-second waits and – he said he hasn’t even restarted the server, but they don’t even take effect until you restart the SQL Servers at least. So I’m not sure what’s going on.

Erik Darling: Yeah, he mentioned something earlier about 1MB auto-growths and stuff too. So I wonder if you don’t have instant file initialization turned on and your disks are, like, not able to keep up with the growths and everything, like lots of little growths.

Brent Ozar: Or the files are all different sizes. Like you added a bunch more files and they’re all really tiny. They all need to be exactly the same size. SQL Server will do proportional fill and work really heavily on the busy one.

Erik Darling: Yeah, whatever the biggest file is will kind of black hole all the stuff going on in there.

Brent Ozar: Perfect, well that’s it for this week. Thanks, everybody, for hanging out and we will see y’all next week on Office Hours; adios, everybody.


How to Delete Just Some Rows from a Really Big Table: Fast Ordered Deletes

Say you’ve got a table with millions or billions of rows, and you need to delete some rows. Deleting ALL of them is fast and easy – just do TRUNCATE TABLE – but things get much harder when you need to delete a small percentage of them, say 5%.

It’s especially painful if you need to do regular archiving jobs, like deleting the oldest 30 days of data from a table with 10 years of data in it.

The trick is making a view that contains the top, say, 1,000 rows that you want to delete:

Make sure that there’s an index to support your view:

And then deleting from the view, not the table:

This lets you nibble off deletes in faster, smaller chunks, all while avoiding ugly table locks. Just keep running the DELETE statement until no rows are left that match. It won’t necessarily be faster overall than just taking one lock and calling it a day, but it’ll be much more concurrency-friendly.

A similar method is to use a CTE:

Wanna see Fast Ordered Deletes in action? No? Then just copy/paste my code, put it straight into production like you always do, and get back to work. For the rest of you, keep reading.

Demoing Fast Ordered Deletes

To demo this technique, I’m going to use the cloud setup for our Mastering Query Tuning classes:

  • An 8-core, 60GB RAM VM with the data & log files on ephemeral (fast) SSD
  • The Stack Overflow public database as of 2017-Aug
  • The dbo.Comments table – which has 60M rows, 20GB in the clustered index
  • I’ve created 5 nonclustered indexes that total about 5GB of space (to make the deletes a little tougher and more like real-world tables)

The Comments table has a CreationDate field, and let’s say I need to delete the oldest comments – we’re going to delete all the ones from 2008 and 2009:

Comments by year

2008 & 2009 had a total of 1,387,218 comments – but that’s only about 2.3% of the table’s overall rows.

First, the plain ol’ DELETE.

I could try just deleting them outright:

It takes 39 seconds. Here’s what the actual execution plan (PasteThePlan) looks like:

DELETE dbo.Comments WHERE CreationDate < ‘2010-01-01’

It’s what we call a “wide” execution plan, something I first heard from Bart Duncan’s post and then later Paul White explained in much more detail. Because we’re deleting so many rows, SQL Server does a bunch of sorting, and those sorts even end up spilling to TempDB.

Plus, it’s taking a big table lock as it works. That’s no good, especially on big tables.

If you can get away with a 39-second table lock and activity in TempDB, the plain ol’ DELETE technique is fine. But let’s pretend you’re working in a mission-critical environment where a 39-second table lock is out of the question, and you need a faster background technique.

Demoing Fast Ordered Deletes

Like we talked about at the start of this odyssey, create a view:

Make sure that there’s an index to support your view:

And then deleting from the view, not the table:

It runs nearly instantly (because we’ve got an index to support it), and here’s the plan:

Fast ordered deletes plan

At first, it looks the same as the plain DELETE plan, but look closer, and there’s something missing:

Just like me with the tequila – no spills

There’s no yellow bangs because there’s fewer sort operators and they’re not spilling to disk. Similarly, the memory grant on this query is way lower:

  • Plain DELETE memory grant: 118MB (only 64MB of which gets used, but it spills to disk anyway because not every operator can leverage the full grant – you can learn more about grant fractions from Joe Obbish)
  • Fast Ordered Delete memory grant: 1.8MB (only 472KB of which got used)

The grants are lower because we’re handling less data, which is also evidenced by the STATISTICS IO output:

  • Plain DELETE logical reads: 25,022,799 on the Comments table (plus another 4.1M on the worktables)
  • Fast Ordered Delete logical reads: 24,732 on the Comments table, plus 2K on the worktables – but that’s with me using TOP 1,000 in the view. If I change it to TOP 10,000, then the reads jump to 209,163. Still way better than 25,022,799 though, but it brings up a good point….

If you need to do this regularly, tune it.

You can play around with:

  • The number of rows in the view (say, 1K, 5K, 10K, etc, keeping in mind the lock escalation threshold)
  • The delay time between deletions

That way you can find the sweet spot for your own deletes based on your server’s horsepower, concurrency demands from other queries (some of which might be trying to take table locks themselves), the amount of data you need to delete, etc. Use the techniques Michael J. Swart describes in Take Care When Scripting Batches.

For more learning on this topic, read Microsoft SQLCat on Fast Ordered Deletes – Wayback machine copy because Microsoft deleted a lot of pages during one of their annual corporate shuffles. You can tell it’s old because…MySpace, yeah.


How Computed Columns Can Cause Blocking

The short story: when you add a computed column that references another table, like with a scalar user-defined function, you can end up causing concurrency problems even when people didn’t really want to go see that other table, and that table is locked by someone else.

Here’s my query:

Here’s what I occasionally see when the query runs, using sp_BlitzWho:

Block Party

I’m selecting data from users.

It’s being blocked by an insert into Badges.

There’s weird code running preventing my query from finishing.

What’s The Problem?

Someone had tried to be clever. Looking at the code running, if you’ve been practicing SQL Server for a while, usually means one thing.

A Scalar Valued Function was running!

In this case, here’s what it looked like:

Someone had added that function as a computed column to the Users table:

Abhorrent

I know, I know. Bad, right? Who would ever do this, right?

We get some funny emails.

Now, in the past, I’ve written about Scalar Functions from a performance and parallelism angle here, here, and here.

I can feel the question coming: Can’t I just persist that computed column?

Nope.

Msg 4934, Level 16, State 3, Line 19
Computed column ‘BadgeCount’ in table ‘Users’ cannot be persisted because the column does user or system data access.

Or surely we can index it.

Nope.

Msg 2709, Level 16, State 1, Line 21
Column ‘BadgeCount’ in table ‘dbo.Users’ cannot be used in an index or statistics or as a partition key because it does user or system data access.

Without being able to do either of those things, our function does a couple nasty things to our query

  1. Executes for each row returned to grab the count of badges
  2. Forces it to run serial

Unfortunately, workarounds for the parallelism issue aren’t applicable here, since we can’t persist it.

You Can Guess…

This made all sorts of things like concurrency, locking, and blocking very tricky to figure out.

There’s an expectation that when someone writes a single-table query, their only concern should be that table. If we were to add additional joins, or additional aggregating functions as computed columns to other tables, we’d be making things even worse.

Hiding code like that in a function, and then hiding the function in a computed column may seem like a nice trick, but it doesn’t help performance, and it doesn’t make issues any more clear when you take all these new found SQL skills and run off to your Brand! New! Job! Leaving them as an exercise to the [next person].

And the workarounds weren’t any more pleasant.

  • We could have added a trigger on the Badges table to update the Users table whenever someone got a Badge added.
  • We could have had a process run every so often to recalculate Badge counts
  • We could have made an indexed view to pre-aggregate the data

There’s nothing wrong with any of those in theory, but they’d require a lot of extra development and testing.

Something that hadn’t been done with the computed column.

Thanks for reading!


Concurrency Week: How Entity Framework and NHibernate Can Cause Poison RESOURCE_SEMAPHORE Waits

Development, Memory Grants
6 Comments

I’ve already blogged about my dislike for ORMs from a production DBA performance tuning standpoint only. I get that they’re useful to developers. I get it. But I’m focused on performance.

A quick recap of what I don’t like about ORMs from that other blog post:

  • Implicit conversions due to nvarchar variables vs varchar columns
  • Queries grab every single column, regardless if they are needed or not
    • Expensive key lookups
    • Very wide missing index recommendations
  • The resulting query is hard to read

Though I did mention that they grab every single column, I didn’t mention that this can cause large memory grants which can lead to RESOURCE_SEMAPHORE waits.

RESOURCE_SEMAPHORE waits occur when SQL Server runs out of available memory to run queries.

ORM-style query: Selecting all the columns

Let’s look at an example.

This query grabs all columns from Posts, PostLinks, Users and Comments. In the normal Stack Overflow database without any nonclustered indexes, this query wants an almost 10GB memory grant just to run.

That’s not fair, though – your database indexes are surely very finely tuned – so we’ll give SQL Server a fighting chance with a few helpful indexes to reduce its workload.

However, even with supporting indexes, the query still gets a 325MB memory grant on my server. I’ve certainly seen worse, but it’s enough to simulate the issue. (Brent says: note that the query doesn’t even have an ORDER BY in it, which is my favorite way to get giant memory grants – it would be even worse!)

Nobody notices a 325MB memory grant when it runs by itself – so to show concurrency, I ran the query through SQLQueryStress using 10 iterations and 50 threads.

After it had been running for a few seconds, I checked sp_BlitzFirst.

Look at all those RESOURCE_SEMAPHORE waits.

Some queries are able to run, but many are waiting on memory to become available so that they can START.

This was a 30-second sample, but it too was struggling to run so we got a smaller sample in the Wait Stats section of the sp_BlitzFirst output.

Notice that the average waiting time on RESOURCE_SEMAPHORE is 8784.2 milliseconds. That’s 8.7 seconds. That’s a long time to wait to even start running the query.

Minus the Posts.Body column

Let’s look at the same query but without the Posts.Body column, which is nvarchar(max).

This one gets a 234MB memory grant. That’s 91MB less than the query that includes Posts.Body.

Repeat the test minus the Posts.Body column.

Still seeing RESOURCE_SEMAPHORE waits, but the average waiting time has dropped to 4.5 seconds. Queries don’t have to wait as long in order to get the memory they need to run.

Minus all big columns

Now let’s take a look at the query without the big columns: Posts.Body, Users.AboutMe and Comments.Text columns. Posts.Body and Users.AboutMe are nvarchar(max); Comments.Text is nvarchar(700).

This one gets just a 36MB memory grant.

Repeat the test minus the Posts.Body, Users.AboutMe and Comments.Text columns.

No more RESOURCE_SEMAPHORE waits! Queries no longer have to wait to get the memory they need to run – which also means more memory available for caching data pages and execution plans.

It’s not all good news though. Notice that it says 88 for “Seconds Sample”. I specified 30 seconds for sp_BlitzFirst. I now had a CPU bottleneck. Even sp_BlitzFirst was suffering.

Sometimes when we resolve a bottleneck, it moves the bottleneck to another area. If this were a real-world scenario, I would figure out if there were any other columns that weren’t needed to be returned and then add covering indexes as there are expensive key lookups in the execution plan.

What if I need to return big columns?

Sometimes you need to return really big columns. That’s okay. Return those columns ONLY when you need to. Don’t let your ORM return every single column in each of the tables in the query. Spend the time to return only the columns that are needed for the specific query. Richie shows how to do this in Entity Framework.

Recap

Let’s add RESOURCE_SEMAPHORE waits to the list of why I don’t like ORMs from a production DBA performance tuning standpoint:

  • Implicit conversions due to nvarchar variables vs varchar columns
  • Queries grab every single column, regardless if they are needed or not
    • Expensive key lookups
    • Very wide missing index recommendations
    • Leads to RESOURCE_SEMAPHORE waits
  • The resulting query is hard to read

In the wild

Have you seen RESOURCE_SEMAPHORE waits in the wild? Were they due to big columns or something else? How did you resolve it?


Concurrency Week: Can SELECTs Win Deadlocks?

Deadlocks
0

Yes, especially when they’re part of a larger transaction.

In this example, I’m going to use a database in Read Committed Snapshot Isolation (RCSI), my favorite isolation level for building new applications from the ground up. Optimistic concurrency (aka MVCC) helps avoid a lot of blocking issues – but not all of them.

In a database with RCSI enabled, start by creating two tables:

Then start a transaction that does a lot of work, but don’t commit:

Now start a new tab or window in SSMS – we’re going to start a separate transaction in the same database, but working on Table #2 instead. We’ll call this Window #2 (because it’s 5AM when I’m writing this, and the caffeine hasn’t kicked in yet.)

That query finishes instantly because it’s a metadata-only change in modern versions (even if you set a default, as Remus explains.) Both transactions are getting along just fine right now, not blocking each other, because they’re working on different tables.

Now, let’s cause a deadlock.

In Window #2, where you did just a little bit of work, try to do much more work – but monkey with the table that’s already locked by Window #1:

Aaaaaand nothing happens – he’s blocked because Window #1 already has a lock on those rows. He won’t make progress, and he doesn’t time out. He just sits there waiting patiently.

Now switch over to Window #1 and run a select:

At first, nothing happens – because he’s blocked by Window #2. When Window #2 added a column to Table2, that schema change is a big deal, and it even blocks SELECTs. After all, the SELECT can’t have some of his rows come back without the new column, and some of them with it.

Within 5 seconds, the SELECT wins.

SQL Server’s deadlock monitor wakes up, sees that we have a Mexican standoff. Neither window can make progress until the other window gives up. SQL Server looks at which session would be the easiest to roll back, and kills that session’s work.

Window #2’s UPDATE sees:

Window #1’s SELECT won because in his transaction, he’d already done a lot of work. Window #2 hadn’t done that much – the alter table was a tiny metadata-only change. Sure, he was trying to run an update that would do a lot of work, but he hadn’t done it yet, so he was trivially easy to roll back.

In the past, I’ve often said, “In RCSI, readers don’t block writers, and writers don’t block readers” – but that’s not exactly true, as this demo indicates. However, I’m…still gonna say that, because it’s mostly true. The problem is that the OTHER writing we’d done previously in the transaction is what killed us here.


Building SQL ConstantCare®: Why We Avoided Graphs

When we started building SQL ConstantCare®, one of my early decisions was to focus on a text-based interface. I wanted to just simply tell you what I’d do in your shoes. If users wanted to see more evidence behind the recommendations, I’d show it – but otherwise, I’d hold off and let them ask.

The experience has been really cool – for the most part, folks have just said, “This is really useful, actionable stuff!” It’s also been really neat to watch the number of alerts go down on peoples’ servers over time as they work through the most important easy-to-fix issues, then move on to tougher tasks.

Customers mention graphs –
but only because they’ve been misled.

But from time to time, the subject of graphs comes up – but not in the way you’d expect. It comes in the form of customers saying, “Whoa, I didn’t know that – nothing shows as abnormal or high in my monitoring tool.” Last week, when I was explaining to a customer what their server’s real problem was, they had an ah-ha moment. They forwarded over a screenshot from their monitoring tool to help explain why they hadn’t been able to get to root cause analysis:

Monitoring tool’s wait stats graph

Ouch. The poor user – there was absolutely no way they were going to get anything actionable out of a graph like that. Here’s why:

The units of measure don’t match. One axis is in seconds, the other axis is in hours (or days, depending on how you look at it.) At the very least, they should be in the same time scale – so the reader can say things like, “In 4 hours, SQL Server spent 2.5 hours waiting on stuff.”

Auto-scaling makes interpretation much harder. The graphs automatically re-adjust their axis to make sure that you always have peaks and valleys. If you look at 5 different servers, they all have 5 different sizes for their peak. The only way you can understand if a server is working hard is to jump back and forth between different servers, saying to yourself, “Alright, this one’s peak is at 80,000, but this other one’s peak is at 800,000, so I guess this other one is worse off.”

Poison wait types aren’t called out. Some wait types like THREADPOOL, RESOURCE_SEMAPHORE, and PREEMPTIVE_DEBUG are truly catastrophic even when they’re in small numbers, and they indicate issues you need to work on first. In the case of the bar graph above, the customer was facing THREADPOOL poison waits – the SQL Server service seemed locked up and unresponsive – but they looked in their monitoring tool and nothing showed up as problematic. (And no, those times when you see the huge spikes? That’s not when their performance emergency was happening.)

When wait types are called out, they mislead the reader. Classic example from the above product – when the user breaks out the list of wait types, they see which queries are having the wait, not the ones causing the wait. When a query is waiting on THREADPOOL, it’s not because there’s something bad about that query. The problem is the other queries that are consuming so many other worker threads – often unnecessarily. The poor end user is led down the exact wrong path.

Finding the real root cause with that kind of tool is like trying to find a needle in a haystack – except the needle is yellow, and all the hay has been cut down to needle sizes. You’re gonna get pricked.

All of the above is fixable – sure, you could design a monitoring system with graphs that led you to the right solution instead of actively obfuscating it away from you. Heck, even just the act of me posting this blog post is probably going to influence a developer somewhere to say, “Whoa, we need to make that graph more understandable” – and hopefully they’ll get that fixed.

Instead, we focused on simple task-based emails: we analyze the metrics, and we tell you in plain English what you should do. You can see a couple examples of ’em over at the product page.

I do watch one chart with extreme happiness, though: last week, we blew past 600 servers. That’s one chart where I look forward to automatic scale adjustments.

SQL ConstantCare population as of 2018-04-19

Want us to help you decipher all your database’s confusing metrics and mentor you into faster, more reliable SQL Servers? Join the 300+ folks who’ve signed up already.


[Video] Office Hours 2018/4/18 (With Transcriptions) with Special Guest Drew Furgiuele

Videos
0

This week Drew Furgiuele joins Brent and Richie to discuss patching, job interview suggestions, using PowerShell to solve various SQL Server issues, roles and responsibilities of a SQL or a database architect, enabling query store on production servers, SARGability, moving SSRS from the main production server, tempdb issues, Availability Groups troubleshooting, Drew’s upcoming Powershell class, and more.

Here’s the video on YouTube:

Office Hours Webcast - 2018/4/18 with Special Guest Drew Furgiuele

You can register to attend next week’s Office Hours, or subscribe to our podcast to listen on the go.

Enjoy the Podcast?

Don’t miss an episode, subscribe via iTunes, Stitcher or RSS.
Leave us a review in iTunes

Office Hours – 4-18-18

 

Should I apply patches to all the nodes at the same time?

Brent Ozar: Richard says, “When I’m doing monthly Windows updates on a failover cluster, should I update each node and then wait for a few days, then update the next one, or should I just update everything all at exactly the same time?” Drew, how do you guys deal with patching over there?

Drew Furgiuele: You know, we’ve gone through all kinds of different ways of administering patches, like, at least where I work. I mean, we used to do WSUS and we tried System Center. Now we use a third-party solution. But if I can stomach the downtime, I like to do it all when I can, just because if, god forbid, I do have to failover – because we had an instance one time where some patches got applied to a SQL Server, system level patches and SQL level patches, which we like to do separate. But for whatever reason, they all went at the same time and we were noticing that the failover wasn’t behaving very well because it was taking a long time for it to start up, or it wouldn’t start up because the system databases got patched to a higher version, even though it was all SQL Server 2008R2.

So we got into this weird situation where, like, we were running on two different CU levels and it got really dicey. So if we patch in a window where I know I can take the extra little bit of downtime with the failover, that’s what I like to do.

Brent Ozar: I’m one of those where I’ve been burned enough by updates. I’ll usually try and do like half of them on one weekend and then wait and see if they bake out and shake out and then do the half the next weekend. But it makes your workload higher not lower. It depends on how much downtime you’re allowed to take too.

Drew Furgiuele: Right, it’s like you’ve got to have two windows to do it at that point, unless you do the failover ahead of time and depending on how tolerant your apps or your business processes are to that – I mean, I think it really depends on the person.

 

I have an interview tomorrow. Any suggestions?

Brent Ozar: Niraj has an interesting question and it’s awfully open-ended.  Niraj says, “I have an in-person interview tomorrow morning; any suggestions?”

Richie Rump: Show up.

Drew Furgiuele: Yeah, if you’re not early, you’re late is what I would say.

Brent Ozar: Brush your teeth. Don’t wear a suit unless you’re interviewing at a financial company. That advice is dead and gone. I think that’s…

Richie Rump: Good call. I mean, talk to the HR person and say, well what should I wear? What’s the proper attire for your environment?

Brent Ozar: It was good that Richie asked us on ours because I said a leather clown suit and he said formal or informal.

Richie Rump: I’m like, what color…

Drew Furgiuele: I think it also really helps to know, like where you’re applying, what’s the business do, what market are they in, just so you have some base understanding about what you’re getting into.

Brent Ozar: And you can Google for – usually, if you Google for their HR department, they’ll have pictures – for larger companies, they have pictures of the office and you can see what people wear. You can see what kind of environment it is.

 

How should I keep users synced between AG replicas?

Brent Ozar: Next up, Augusto asks, “I would like to know the best practice…” You came to the wrong room, “In order to keep users in an Always On environment synced between the principal and any replicas.”

Drew Furgiuele: Well, I mean there’s lots of different ways to do that, right. I mean, you can use good old-fashioned T-SQL scripting. You can set up automation through, of course, PowerShell, but look at user principals that exist on both servers, do a compare, find out what’s missing. I don’t really have a lot of AGs that I administer. In all that I’ve done with it, that’s how I’ve approached it is I just have a basic compare of what’s there. And with PowerShell, if you’re able to create custom objects and read in those permissions, you can do things like compare object and see where the differences lie.

Brent Ozar: Instead of dumping them all and redoing them every time.

 

I’m getting 15-second I/O warning errors…

Brent Ozar: Oh, Pablo says, “I’m getting I/O requests that take longer than 15 seconds.” Notorious – I keep saying I’m going to write a blog post about that. His principal wait in SQL Server is write log waiting to write to the transaction log and his applications are getting timeouts expired. “Can PowerShell help me look for disk issues?”

Drew Furgiuele: You know, it’s funny…

Brent Ozar: Not really…

Drew Furgiuele: Not really, like, there’s some code out there that can help you look at different wait statistics, but it’s not going to be able to give you a smoking gun, just like any other kind of wait statistics query might do. You can use that to aggregate this stuff over time so you can see if it’s time-based or if you want to collect it and look at it over historical periods of time if you don’t have access to a fancy expensive monitoring tool.

But as far as helping alleviate that stuff, I’m going to leave it to the real experts on this webcam.

Brent Ozar: Unfortunately they didn’t show up today… I think you nailed it with you know that your top wait is write log and you’re getting 15-second I/O warning waits. I don’t know how much more conclusive you can be from the SQL Server side. Now is when I would turn around and play with – CrystalDiskMark is a utility that will help you just quickly run a speed test. And you run CrystalDiskMark against your laptop, you run it against the SQL Server C drive, you run it against where the data and log files live and it will get you just a quick sanity check on does my storage suck completely or is it vaguely in line with a consumer grade laptop.

Most of the time, when I see 15-second I/O warning errors, if I even plug a USB into my laptop I can get faster throughput on a USB thumb drive than I can get from the storage. It really happens a lot under virtualization.

 

How does Drew use PowerShell to manage storage?

Brent Ozar: I’m going to ask a related question though because when we talk about PowerShell and storage – Drew, I’ve heard you talk about what you do with your snapshots in Pure Storage. So talk a little bit about how you use PowerShell with that.

Drew Furgiuele: Yeah, so where I work, we have a ton of different non-production environments and one thing we got in the habit of doing was we have a set schedule where we refresh our non-production environments and before we had access to fancy Pure Storage – free plug for them because they’re great – we were doing traditional restores and it was like, alright, it’s Friday afternoon, we’re going to shut down int and dev and con or whatever. We’re going to do the refreshes, they’re going to run for however many hours to restore those databases and then hopefully when we come in Monday morning, everything worked and we’ll reapply permissions.

So when we switch to the Pure Storage array, they have the concept of volume based snapshots. So you can set up automation around that because it has a complete PowerShell API tied to it where you can connect to it, set up scripting, to say take a new snap, you know, present it to this host, mount it on this disk and then attach these databases, apply permissions, any other kind of post scripts you need to run to prep your databases for non-production.

And that’s what we do now. And our restores went from taking an entire weekend, kind of worst case, obviously, now I can do basic on-demand refreshes that take less than a minute and for multi-terabyte databases in multiple servers. It’s fantastic. And I know that other storage providers also have that, but I’ve seen Pure do it and they just do it right, like, big fan.

Brent Ozar: And I think – you kind of have to be a new vendor. You have to come up with all new stuff and go, alright I’m going to have new approaches as to how I’m going to do this and embrace things like PowerShell.

 

What does a database architect do?

Brent Ozar: Sree asks – and this is a good one for Richie – Sree asks, “What are the roles and responsibilities of a SQL architect and or a database architect?”

Richie Rump: Ooh, well we really make a lot of coffee, we, I don’t know, have a meeting or two and we go home. I mean, what else do we do?

Brent Ozar: I wouldn’t say anything about documentation.

Richie Rump: What the hell is that? Mainly, database architects will get around – talk to the business a lot. So they know a lot about the business and how the business works because they need to understand, really, the intricacies of how everything works so they can properly model it in the database. So that is, in essence, what a data architect does; understands the business enough to know how to properly model into a database where it’s fast, it’s efficient and it does what the business needs it to do. So in the days of NoSQL now, that’s kind of changed a little bit because a lot of these NoSQL databases, you don’t need models anymore, they’re all just kind of thrown in there.

So you can kind of move very willy-nilly. That has its own set of challenges as is, but mostly that’s kind of what we do. We understand how the, kind of, gotchas of a database engine and what kind of modeling we need to do. I remember, there was one project that I was doing that I had one of my guys help with a lot of the architecture and he did a fantastic job. I mean, he really knew his stuff and we pulled out database patterns and we did some other stuff and [thaw fluid] in there and the database was pristine. It was great except when we threw the ORM on top of it. It crapped out a log and it just made everything super slow, of which we had to change the architecture of the database so that the ORM would be faster and just work right.

So a lot of times, we do our job as database architects really super well. Sometimes too well, where the application itself doesn’t keep up with what we’re trying to do. So there’s a balance there and there’s a give and trade to kind of everything we do as database architects. And what if we do this? Well, there’s something over here – and there’s always that. we’re trying to find that middle ground when we’re doing that. I mean, I guess I could go for another hour; are you okay with that? Start pulling books out and…

Brent Ozar: Well I was going to say – like for example, I know we’ve had customers come to us and say, “You have a data architect in the form of Richie; can you just come in and architect a database for us, like tell us what we need?” I’m like, well there’s so much back and forth conversation with the business and understanding what the application is. You want that guy on your team for good, you know. If you bring in a consultant for it, you’re basically writing them checks for the rest of your life because they’re going to build up all this knowledge about your business and you’re going to wish you had that in-house instead of being held hostage by an outsider.

Richie Rump: You mentioned documentation too and that’s a big part of it as well. A lot of – from what I understand, a lot of database architects, all they do is come out with the ERD and that’s it. So you have a data model and that’s all you get. But I’ve always found that the best data modelers also do a lot of business documentation, whether that’s business process flows or whatever that is. So all that is down, so then you could actually say, hey here’s my business flow and here’s how it maps into my data flows, and it all kind of goes that way.

Drew Furgiuele: Database architect doesn’t really get thrown around a lot where I work, but we have a lot of people who refer to themselves as data modelers. I guess it kind of works in the same vain. And ironically, the same person that helped model a lot of the stuff that we do is also the person who helped me embraced ORMs. So John Kruger, if you’re listening, you take credit for that.

Brent Ozar: And that always comes up on the blog. I’ve been talking about writing a blog post for a while too. There’s people who bash ORMs all the time. There was a blog post on Hacker News recently that Richie and I were looking at and people come out in droves, like ORMs suck. Everyone should know exactly how to write T-SQL. I’m like, we don’t have time, you know, you’ve got to ship products, you’ve got to ship features.

Richie Rump: And the other thing is that who’s writing the SQL? When I was a manager, that was my big question, who’s writing the SQL and do the people that will be writing the SQL, do they know SQL? Whether that’s the DBA, maybe it comes from the administration and doesn’t have a strong SQL language background, or if it’s developers who don’t have a strong database background. So who does that leave, really? Who’s writing the SQL that’s tight and does get to what it needs.

How many times, Brent, have we gone to clients and they’re missing where clauses? I mean, just the base level simple things and they’re missing all this stuff. So ORMs kind of prevents a lot of that. Now, as opposed to me understanding SQL, I just have to understand my language and I can focus more on business stuff. Now we’re talking about benefits and tradeoffs there. ORMs inherently, they’re not bad; it’s how they’re used that’s bad.

Drew Furgiuele: I mean, it’s like nuclear power; it can be used for good and evil and you don’t want to get any on you.

Brent Ozar: I thought for sure where Richie was going to go was it depends on the quality of the person writing the SQL because I’ve had several check-ins into our own products where I’ve written queries for the wrong language. I’m like, oh this is perfect T-SQL – it doesn’t even work in Postgres.

Drew Furgiuele: I have a whiteboard that I keep track of how many times you break the build and announce it via Twitter.

Richie Rump: Well we have the opposite; how many times Brent didn’t break the build and that’s when it’s announced.

Brent Ozar: I think one. I think I’ve had one build or two builds go through.

Richie Rump: Full disclosure, it really is not Brent’s fault. It’s not…

 

What’s the impact of Query Store on performance?

Brent Ozar: Turge asks, “Do y’all have any experience with the performance impact of enabling Query Store on production servers?” Drew, have y’all turned on Query Store on yours?

Drew Furgiuele: You know, we haven’t – it’s not something we’ve fully embraced yet because we have such a mish-mash of SQL versions. We have a couple of 2016 instances. We’re in the process of thinking about upgrading some to 2017, but we don’t have any kind of plan to do that right now. I’d like to, but it’s not something we’ve turned on. I mean, I know there’s a lot of people that have done a lot of good case studies about what you should or shouldn’t do. I’m not one of them, but I’d love to turn it on and watch plans evolve over time because we do have some pretty in-depth stored procedures for stuff that run and I would love to be able to track that stuff.

Brent Ozar: I’ve seen – what Microsoft initially touted was something like a 1% to 3% overhead. And the way I always feel about it is it’s a lot like a car’s dashboard. If you want to know how fast you’re going, you need a speedometer. And the weight of a dashboard and a speedometer and all that slows your car down a little. But if you don’t have a dashboard, you don’t really know how fast you’re going. So people are just like, it feels fast or it feels slow. So if you need to know which queries are performing better or worse, then you want to turn this on. I would just say, make sure that you’re on a recent cumulative update of either 2016 or 2017 because they fixed a lot of bugs in Query Store over the last three, four, five months.

Drew Furgiuele: And part of that too is with the monitoring tools we have, we can actually see degradation in similar statements over time. So it seemed to me like it’d be kind of redundant, but at the same time, I know what it has and what it offers.

Brent Ozar: Y’all use SQL Sentry or what do you use?

Drew Furgiuele: Yeah, we’re Sentry One customers and they have the ability to kind of see different statements. And they put on a nice little chart about how they’re performing over time.

Brent Ozar: And it works with any supported SQL Server version. It’s not like it’s just 2016 or newer.

Drew Furgiuele: Right.

Brent Ozar: That was my feeling with Query Store is it’s wonderful for people who don’t have a monitoring tool already and that also are on 2016. So if you fall into those two categories, 2016 or 2017. It’s a smaller audience. Most of the people who needed that kind of monitoring already got it with third-party tools.

 

How should I set a variable to the beginning of a day?

Brent Ozar: Let’s see – the last one in the queue – if anybody has other questions, otherwise we’ll close the webcast over this one, so feel free to get your questions in, otherwise we’ll bid you adieu. Colin says, “There are 100 ways to set a date time variable to the beginning of the day…” Oh, I love that. He has a method in there using convert date time, flipping it over into a date field. “This way seems super simple and straightforward to me. What are the downsides?” The big one on that one is SARGability.

SQL Server may look at that function and go, “I have no idea what the hell is going to come out of this.” So it also depends on whether or not you’re seeking for it, whether you’re just rendering it as report output; all kinds of things. If I’m going to set it as part of a search, like in a where clause, I’m usually going to set a variable either outside in my app somewhere and then set it that way, so that when SQL Server builds the execution plan, it sees a parameter with the exact date that you’re looking for. Otherwise, if you set an internal variable into your code, you can end up using the density vector, which will get you wacko statistics for how many rows SQL Server is going to come back with.

 

How hard is it to move SSRS?

Brent Ozar: Let’s see – anon says, “How difficult is it to move SSRS off your main production server? Assume I have no help, no third-party tools and can have little or no downtime.” Damn, wow.

Drew Furgiuele: All I know about SSRS is you want to backup your keys. That’s all I know. Like, that’s the only way you can move stuff is to backup your keys from SSRS. I think as long as you do that, you can pretty much do whatever you want.

Brent Ozar: I would say, if anything, you put it behind a load balancer too, so you put – if you have any kind of F5 or any kind of load balancing type tool, that you first start by pointing everyone at the load balancer and then put your existing 2008 behind the load balancer, just so that that way, whenever it comes time for the cutover, you can do that with little to no work. Plus later, you can add in multiple reporting server VMs if you want to for high availability too.

Drew Furgiuele: Yeah, and the bigger question is – I don’t think you lose anything going from 2008 R2 to 2014, but I have no idea. That would be the other thing; what’s going to break?

Brent Ozar: And I’d also ask what you’re going to gain too. SSRS isn’t terribly resource intensive. It doesn’t use that much memory. It uses some but it’s not as bad as, like, integration services or analysis services, which are total pigs. Not that they’re bad; they’re great applications. I’m just saying they use a lot of resources. But make sure that there’s a bang for the buck on the other side too, just that you’re going to get something out of it.

Richie Rump: Yeah, I mean, if you’re going to use new hardware on it, you might as well move the main production server over there rather than SSRS. But if you’re moving to like a small VM or something like that, no, don’t do that; that’s no good.

Drew Furgiuele: And ultimately, it’s where the queries run, right. So like, you know, it doesn’t matter where SSRS runs; it’s going to connect to an instance and run a big SELECT*. It doesn’t matter where that happens.

Brent Ozar: Anon adds on, “Our team’s split and the reporting team is now in a different organization.” Ah politics. In that case, it’s their problem. You go, you stand up the server whenever you want and I’ll give you the data.

Brent Ozar: Marci says, “We have always run reporting services analysis services on their own servers and over the years we’ve gone from 2008 to 2014 without anything breaking and the report builder version is always improved with the latest, so just stand it up and see.”

Brent Ozar: Brian also adds, “I just restored my supporting services databases to the new instance and then fixed the data course pointers prior to going live.” They keys is the other one to keep in mind there; any kind of encryption keys on SSRS.

 

How many TempDB files should we use on a VM?

Brent Ozar: Adrian says, “We use VMware and we’re trying to figure out whether or not we should use one data file per socket for tempdb or one data file per core – like how many tempdb files do I have?” And he says at the end, “I have super high latency.” So I think he’s thinking that more files are going to help him in terms of latency. Drew, when y’all set up new tempdb data files, how many users per server?

Drew Furgiuele: Well we’re a VMware shop and I typically go, if my server has eight CPUs or eight VCPUs or less, I tend to go eight files. If it goes any bigger than that, I kind of see diminishing returns, but that’s just my experience. I haven’t really seen any tempdb contention over that. Now granted, like, we don’t have any super powerful VM SQL Servers, although we did just get some new hardware. I think our biggest virtualized CPU in a cluster is probably eight VCPUs per. But even then, if it was bigger, and maybe it is a little bit bigger than that, I still think I only went eight. And I think I saw that advice somewhere.

Brent Ozar: Yeah, that’s in our setup checklist too. I’m fine with – some people will tell you four, some people will tell you eight, either of those numbers is fine. And in terms of more files helping you with file latency, that’s probably not going to happen. I would go check CrystalDiskMark. Just got run CrystalDiskMark for a quick rule of them test against your I/O and run it against your desktop and compare to see the two differences. Then prepare to open sadtrombone.com.

 

Where are Drew’s scripts for his GroupBy presentation?

Brent Ozar: Teschal says, “Drew, do you have the scripts that you presented on GroupBy? I watched but I couldn’t find the scripts.”

Drew Furgiuele: Yeah, so man, it’s been a while since I presented on GroupBy, but my GitHub – it’s a ridiculous last name, so the easiest way to get there – it has my last name in the URL. If you head over to my website, which is port1433.com – over on the right-hand side there’s a little Git Hub icon, on the right, a little octocat. If you click on that guy, that will take you to my repositories and any publically available code that I’ve shared has been there. So you click that little guy – those icons could admittedly be a little bit bigger, but I’m not the designer. So if you click there, that will take you over to my repositories. And like I said, any publically available code that I’ve ever done is on there. Richie, you should really upgrade IE…

Rocky, Drew’s partner in crime

Brent Ozar: And who is this adorable little creature here with the bowtie on?

Drew Furgiuele: That’s Rocky, who is actually behaving himself right now. Usually, he’s down here all up in my business, but that’s Rocky. He’s my little monster.

Richie Rump: Are you using DotNetNuke for your website?

Drew Furgiuele: Nah, WordPress.

Richie Rump: I’m trying to find that guy who’s always using DotNetNuke. I’ll still look.

 

 

SadTrombone.com actually exists

Brent Ozar: And then Michael Tilley says, “Holy crap, there actually is a sadtrombone.com.” Of course, there is.

Drew Furgiuele: There’s also the auto-play version, which is the one you need to send people.

Brent Ozar: Yes, why does that not work? Google Chrome, it’s trying to plug – I hate Internet Explorer so badly. So anyway, any-who, back over to the PowerPoint, put that backup and go back to the questions list. That teaches me for trying to surf the web on a crappy version of internet explorer.

 

Why isn’t my AG working correctly?

Brent Ozar: Mark asks, “I’m adding a database to an existing Availability Group. For some reason, the secondary files are not going to the same drive and location as the primary – they’re going to a default folder. Do you have any idea why this is happening?” Drew, do y’all use Availability Groups?

Drew Furgiuele: We don’t because we don’t use Enterprise Edition. Although I know we can use basic AGs now, but we don’t use them. We’re a traditional Failover Cluster group right now, although we are getting to that point where we’re exploring a lot of new DR stuff for this coming year and we’re going to start moving to that stuff with some remotely available stuff. But that’s a weird question. And forgive me, but is that, I think you said the default directory when you install SQL, right?

Brent Ozar: Yeah, I bet money that when you’re restoring the databases without using the with move commands to move those databases somewhere else. If you’re restoring the databases manually as opposed to going through the GUI or using direct seeding, which will just automatically do it for you. If you do manual restores, you have to specify where you want the data and log files to go. And the folder structure needs to be identical across all of your replicas, otherwise, whenever you add a data file or log file on the primary, it will break if that folder doesn’t exist, drive letter or folder doesn’t exist over on the secondaries; your replication will stop working.

Mark follows up with, “We’re using the GUI to do it automatically and the folder structure is identical.” In that case, I would actually post it, but I would post with screenshots. Do as many screenshots as you can on dba.stackexchange.com because you might have found a bug. I don’t think you did, because I’ve done this a bunch of times myself, but it’s possible that you found a bug and other people will see it when you go over to dba.stackexchange.com.

Update: in this week’s 2017 Cumulative Update 6, Microsoft released a patch that might solve this problem, too.

 

Drew – what’s your PowerShell class about?

Well, thanks, everybody for hanging out with us this week. Drew, before we go, we should say your upcoming training class – tell us a little bit about what you’re teaching and who’s the target audience for it.

Drew Furgiuele: Yeah, so this will be kind of the third go-round of this class. It’s PowerShell for DBAs, hosted by the fine folks at Brent Ozar Unlimited, and it’s targeted at people who have a desire to learn PowerShell and how it can be used to administer things like SQL Server. So it’s a two-day class. For signing up and joining it, you get access to a VM for both days and we start out at the base level. We talk about how to do simple things like declaring variables in PowerShell and doing basic scripting actions and then day two, we move into how you connect to SQL Server, how you can script things out, how you can automate things. Really a lot of fun to present. It’s been really good. People seem to like it to hopefully you guys, if you have desire to learn that stuff, you can now learn with me.

Brent Ozar: Absolutely. Marci says, “Do you have a link for the class?” Yes, if you go to brentozar.com and right on the homepage there’s a list of classes. You can see Drew’s shiny happy face smiling. When’s the next one, like June I think?

Drew Furgiuele: It’s like the last week of June, I think.

Brent Ozar: Sounds right. Alright, well thanks everybody for hanging out with us this week and we will see y’all next week on Office Hours. Adios, everybody.


Does an Index on Just The Clustering Key Ever Make Sense?

Indexing
10 Comments

Say you’ve got a table – in this case, I’m using the Stack Overflow Posts table, which holds all of the questions and answers:

StackOverflow.dbo.Posts

See the clustered index at the bottom? It’s on the Id field, which is an identity – starts at one and goes up to a bajillion.

So it would be really stupid to do this, right?

Why would someone ever create yet another index on Id when you already have the clustered index? I’ll show you why:

Here are the execution plans – all 3 of which use our Id index:

Scantastic

The performance differences are pretty eye-opening:

  • Clustered index scan: 50 seconds, 7,736,170 logical reads
  • Nonclustered index scan: 0.5 seconds, 36,767 logical reads

When SQL Server has to scan the table, it says, “Is there a faster object that I could scan instead? Is there another copy of the table I could use that has less data in it?” Indeed, our nonclustered index just on Id is way smaller, as sp_BlitzIndex shows:

sp_BlitzIndex analyzing index sizes

The clustered index (row 2) is 70GB because it has all the fields, whereas the nonclustered index on Id is a tiny 285MB in comparison. Turns out all those big string fields take up a lot of space on the clustered index. Who knew?

My favorite query plan of the 3, though, is the count WHERE Body IS NOT NULL. You’d think SQL Server would have to go check the Body field, right? But no – the Body field was specified as non-nullable (scroll back up to check the table definition) so SQL Server can simply ignore it. Ahh, the sweet benefits of the right null-vs-not-null table definitions.

The index on dbo.Posts.Id can even be used to speed up foreign key creation on other tables. Illustrating this is just a little trickier since foreign key creations don’t show execution plans (nor do they show up in the plan cache.) To show it, I restarted my instance, created a foreign key between Comments.PostId and Posts.Id, and then ran sp_BlitzCache. (You could also probably show this with Extended Events, but I’m doing this fast, dear reader.)

Score another one for our nonclustered index

I’m not saying that those queries are a good idea – I’m just explaining why someone might have created that index, and what might backfire when you drop it. I’m also not saying you can never drop it, either – I’m just warning you to keep an eye out for what might slow down.

And by the way, if this comes as a surprise, I bet you haven’t seen my free course, How to Think Like the SQL Server Engine. It teaches you the differences between clustered & nonclustered indexes – something everybody thinks they know until I start asking them questions.


So You Wanna Debug SQL Server Part 2

SQL Server
2 Comments

Birth Control

Debugging, and, heck, even learning about debugging, is a time-consuming and often unhelpful process when it comes to tracking down issues in SQL Server. It is arduous and tedious, and yields little chance of making you any friends. So why write about it?

Mostly because no one else is willing to — at least not in the context of SQL Server. And, let’s face it, “vibrant social life” has been used to describe me exactly zero times Kelvin.

In Part One, I talked about the basics of getting WinDbg set up, and attaching it to your local (please for the love of all that’s from Islay don’t attach it to a production server) instance. The next logical thing to do is get interacting with SQL Server.

Point Break

It’s customary to yell I AM AN FBI AGENT whenever you set a break point in WinDbg. This guarantees you success, long life, and ten cent wings wherever you dine. The break point commands we care about right now are:

  • bp: set a break point
  • bu: set a break point on a pattern
  • bl: list break points
  • bc: clear break points

After you follow instructions below to set a break point, you should use these to examine and undo it. Heck, even set your own.

But where do you set a break point, and what happens when you hit one?

To answer those questions, we need to go back to the x command — x means we want to examine available symbols.

I say it really doesn’t matter where I put my finger

When I first open up WinDbg, I like to get symbols for a few different code paths

  • sqlmin
  • sqllang
  • sqldk
  • sqlos
  • sqlservr

Under sqlmin is where most of the query related stuff lives. To make your life a little easier, you can output window contents to a text file!

Someone will come along and move it anyway

That lets you come back to the output and search through it a little easier. So if you choose a log file, and then run x /n sqlmin!, you should end up with a text file of all the public break points sorted by name. Fair warning: this can be slow. There are a lot of them.

I’m gonna save you a little bit of time, though. A lot of the interesting stuff that happens when a query runs is under sqlmin!CQScan, so if you run x /n sqlmin!CQScan*, you’ll get just that subset of break points.

Why do we care about this? Because when setting your first breakpoints, it helps to set one that you can trigger early on in a query to get used to what it looks like.

Almost nothing occurs sooner in the break point path than a TOP expression. Knowing that, we can narrow things down even further, to x sqlmin!CQScanTop*, or even x sqlmin!CQScanTopNew*.

Setting an early break point based on TOP would look something like this: bp sqlmin!CQScanTopNew::Open, and we can trigger that with an easy query.

Which, if you’ve done most things right, will get you some output that looks something like this!

Computers are hard.

Here’s where you’re gonna start to hate the debugger.

Tedium ad nausea

Once you hit that break point, your query is just gonna sit there. You now have the ability to step through every public instruction that the query calls while it runs. This is, not surprisingly, a lot of instructions.

You can hit F5 to end debugging and pass control back to the program, or hit F10 to start stepping through instructions.

This should give you enough to play with until next time, when I’ll talk about tracing calls, and how to find SQL Server exposing dirty little secrets.

Thanks for reading!


Index Key Column Order And Supporting Sorts

Whatever Man

When tuning queries that need to sort large amounts of data, sometimes it makes sense to stick the ordering elements as the leading key column(s) in your index. This allows SQL Server to easily sort your data by that column, and then access other key and included columns to satisfy other parts of the query, whether they’re joins, predicates, or selected columns.

That doesn’t mean that indexes with the sorting element first is always the best idea. Depending on how your query looks, you can sometimes put sorting elements at the end of an index definition and still have your sort be supported.

Pick It Up

Let’s start with this index:

If we run queries like this, the ORDER BY is unsupported by the index because it’s not the leading key column.

Slap happy

But if my query looks like this, it is supported.

Limited Yay

With direct equalities on each of the columns in the key of the index, we can sort by the last column in the index.

Unfortunately, this doesn’t work if we…

  • Have an incomplete WHERE clause
  • Our WHERE clause has inequalities (ranges) in it

By incomplete WHERE clause, I mean one that doesn’t utilize all of our index key columns, and by inequalities I mean >, >=, <, <=, and <>.

That means that both of these will result in a Sort operator:

Along with all of these:

It would be nice if the only requirement were that the key column prior to the sorting element use an equality predicate, but that’s not so either.

For instance, this will still result in a Sort.

Crud

But you do have some flexibility with other queries and other sort orders. For example, these will both result in the simple TOP plan that we saw earlier in the post. Again, the predicates need to be equalities to work.

Stone Cold Wrappin’

I think this is an interesting example of how much index definitions can matter, and how a good index for one variation of a query may not be so great for another variation. There’s probably some allegory about knowing your data or something in here, too.

Thanks for reading!


Parking Garages And Predicates

Hotels on Wheels

Parking garages and predicates have a “lot” in common.

GET IT?

No no but seriously. They do.

If you’ve ever lost your ticket, forgotten where you’ve parked, or just driven around hopelessly waiting to see an empty spot, you probably wish there were a better way to do things.

Maybe even a data-driven way of doing things.

Informal poll: Is anyone reading this blog data-driven?

If You Had To Index That…

Say you wanted to quickly find your car. An index on CarId would be awesome, if you know what it is.

If you got a parking slip with your Space Number on it, that’s sort of helpful. But then you need to figure out which floor and lane you’re in.

On the other end, if you need an empty space, an index on IsEmpty sounds pretty optimal.

Let’s say you’re Seinfelding

I don’t mean that, I mean this.

If Jerry, et al. knew where their car was, they’d have been able to walk directly to it. That’s like a Seek predicate.

Molon Labe

If they knew which floor and lane they were on, they’d have been able to start Seeking in the right place, and then double check each car. That’s like a Seek predicate with a Residual predicate.

ULLO JOHN GOT A NEW MOTOR

Likewise, if they know the lane and spot number, they could have done this:

This is where you start pressing the door unlock button and waiting to hear a chirp

Of course, they didn’t know anything, so they had to scan the entire joint. It looked something like this:

That’s a lot of walking

I don’t have the patience to paste that many thinking faces in.

There’s much to consider here. But I don’t even have a license, so let’s look at things in a Real Table®

Seek and Not Seek Walk Into A Bar

Using the miniaturized version of the StackOverflow database, I’m gonna focus in on the Users table.

The only thing on it is a Primary Key/Clustered index on the Id column. A query like this can use that quite easily.

Seeky Kid

This query only requires three reads to locate the page it needs and return a single row.

If we’re talking parking lots, a seek would look something like this.

The PK/CX is considerably less efficient for this query, taking nearly 7400 logical reads to return the same single row.

Hit Scan

It’s Not Seek vs Scan Though

What I’m interested in is the Seek Predicate vs Predicate. For instance, this query returns the same results, with the same 3 logical reads as the first query.

Double up
  • A predicate by itself is like having to walk through the entire parking garage trying to find your car.
  • A seek predicate by itself is like being able to start your search where your car is.
  • A seek predicate with a residual predicate is like walking along along the right lane and having to check parking space numbers.

What Makes For A Seek Predicate vs a Residual Predicate?

Mostly placement in the index, but SARGability plays an important role as well.

Let’s create an index and look at a couple slightly different queries.

This index leads with reputation, and also has Id as a key column. DisplayName is an included column, which puts it in just about the same place as Reputation was when we were only working with the clustered index.

The first query plan looks like this, seeking to the right key, and then evaluating the predicate on Reputation. This one is using the clustered index.

Party!

The second query uses the nonclustered index, but supports multiple seek keys.

Seeky Monkey

When I mentioned that SARGability plays a part, that’s because if you do something this stupid, neither predicate will be seekable.

Just don’t

Bumming Around

While I have your attention, and before I run DropIndexes, let’s look at how included columns work a little bit.

Right now, DisplayName is included in our nonclustered index. All three of these queries can use our index.

All of them will do the same number of logical reads (1279).

All three of them will have our LIKE condition as a predicate — we can’t seek here because it’s not in the key of the nonclustered index.

But only the first query, where the string search doesn’t lead with a wildcard will register a missing index.

In other words, a lack of SARGability with leading wild card searches shut down missing index requests.

Chimney Sweep

If we add the missing index request, the reason becomes pretty apparent.

I did not seek that coming

The index really does only help the first query. The other two queries do the same number of logical reads as before. Adding this index doesn’t change much for them.

Beat It

This is a small step towards understanding seek and residual predicates, and how indexes support searches.

The thing to keep in mind is that you can really only seek into an index when a SARGable predicate hits the leading key column. Multi-key indexes can support multiple seek keys.

If you’re only searching secondary (or beyond) columns, you’re only able to scan with a regular predicate. The same goes for included columns.

Thanks for reading!


Building SQL ConstantCare®: The Database

SQL ConstantCare
10 Comments

I bet, dear reader, that this is the post you’ve been really curious about in my behind-the-scenes posts.

Back in the post about the data we collect, you may remember that our initial goal was to analyze high-priority issues like missing backups or corruption check, priority boost on, offline NUMA nodes, etc. To do that, we take data straight out of SQL Server’s DMVs, and you can see those DMV queries in %localappdata%\ConstantCare\Current\Resources\Queries. ConstantCare.exe runs those queries, exports the data to one JSON file per query, zips them, encrypts the zip file, and uploads it to us.

The armchair architect league might wanna query those files directly, or load them into an on-demand database like Redshift, or put them in a document database. Thing is, we wanted to leverage spare consultant time from me, Erik, and Tara in order to build diagnostic queries across your DMV data, and…writing queries for those platforms is, uh, challenging at best (NSFW.) If you imagine a Venn diagram of people who know SQL Server troubleshooting, and people who can write SQL query, it’s a pretty good overlap. Now imagine that same diagram, but for people who know SQL Server troubleshooting, and people who can query a NoSQL data store. Yeah. Good luck with that.

So call us old-fashioned, but we wanted to put the data into a relational database.

We ruled out Microsoft SQL Server and Azure SQL DB.

I know, I know. The Microsoft geek in you – and me – wants us to use SQL Server on the back end, but here’s the deal: SQL Server is not self-managing or cheap. (After all, you and I make very good livings keeping SQL Server up and running, right?) Whenever possible in this product, I wanted to use something inexpensive that somebody else managed for us.

Azure SQL DB might seem like a valid contender, but like I wrote in an earlier post, Microsoft didn’t have a function-as-a-service platform available a couple years ago when we started building SQL ConstantCare. That would mean running our app code in AWS, and the database in Microsoft Azure. As Jeremiah once said, “Heeeeellloooo, laaaateeeennncccyyyy.” Even worse, until this year, Azure SQL DB didn’t even support VNETs, which made security management much, much harder for code that runs in AWS Lambda functions.

Today, if we were going to start building it all over again, Azure SQL DB would be a valid contender. It still has one big gotcha: cost. It’s twice as expensive as the leading brand:

 What you get for… Azure SQL DB Brand A
Around $200/mo 1 core, 7GB RAM 2 cores, 15GB RAM
Around $1,700/mo 8 cores, 56GB RAM 16 cores, 112GB RAM

So, uh, what is this Brand A, you ask?

Meet AWS RDS Aurora.

Aurora is Amazon’s custom version of MySQL and PostgreSQL. Here’s a 2-minute marketing video targeted at managers, but it paints a pretty good picture as to why the small business guy in me was sold:

Introduction to Amazon Aurora - Relational Database Built for the Cloud - AWS

And this one-hour technical video explains why the DBA in me was sold. (Don’t bother going past 40 minutes – it switches to Q&A.) In this video, when he says “PostgreSQL,” he’s talking about the conventional boxed product (like SQL Server) that you manage yourself. When he says “Aurora,” he’s referring to this new cool implementation of PostgreSQL (kinda like how Azure SQL DB is for Microsoft.)

AWS re:Invent 2017: Deep Dive on the Amazon Aurora PostgreSQL-compatible Edition (DAT402)

You don’t need to know anything about PostgreSQL for that video – your SQL Server knowledge will let you keep up. It moves fast: within the first 5 minutes, he explains some of the automatic failover magic baked in with the Amazon S3 storage integration. In the next 5, he digs into why they don’t have to screw around with checkpoints. In 10-15 minutes, he covers how it even affects inserts into a B-Tree.

My favorite part about Aurora is that from the front end, it’s just plain old Postgres. That means we’re not locked in – if AWS discontinues Aurora or starts charging too much for it, we can move to self-hosted MySQL or Postgres (or RDS PostgreSQL, too.)

Aurora offers fancy stuff like readable replicas (up to 15, across up to 3 Availability Zones), automatic continuous point-in-time recovery going back up to 35 days, encryption, auditing, and easy scalability. We’ve already used the quick scalability, too: we started with an r4.large with 2 cores, 15GB RAM, $0.29/hour (roughly $200/mo) and upsized it to handle the growing user loads last week:

SQL ConstantCare user base, servers and TB of data they host

(Note: that terabyte number isn’t how much data WE have, hahaha. We’re only using about 35GB at the moment. And we’re up over 500 servers, it’s just that not every server sends in data every day. More on that in another post.)

Depending on which flavor you pick, you either get MySQL compatibility or Postgres compatibility, but not both simultaneously, so you have to make an architecture decision early on. We probably could have been fine with either, but we went with Postgres. We liked its maturity and feature set more at the time.

Designing a Database for DMV Data

In your database at work, you focus on the user tables, and the system tables are kind of an afterthought. They’re just something that you use in order to get your job done better/faster/cheaper. Most of us probably don’t put much thought into how the DMVs/DMFs are designed, whether or not they’re normalized, and how they’re joined together.

In this project, it’s the opposite: the DMVs are all we care about! But rather than redesigning them to be stored more efficiently or more normalized, Richie just took ’em exactly as-is. Each table has a few columns that aren’t in your SQL Server:

  • ID – identity, primary key, clustered. Different names per table, so in sys_configurations, the ID is sys_configurations_id.
  • connection_id – corresponds to one point in time for one SQL Server instance. Each time you upload data for a server, a connection is made. You’re welcome.
  • user_id – matches up to a customer. Yeah, technically, a connection_id can only belong to one customer, but Richie went a little paranoid so we could double join for security purposes.
  • creation_date – when the row was added. (Yeah, that’s not really normalized either, because you’d think that we could store that at a higher level, but I’m not gonna tell you all our magic secrets in this blog post, alright? Also, this was probably a mistake.)

Here’s how it looks:

sys.configurations, cloud style

Obviously, that’s not SQL Server Management Studio or SQL Operations Studio – those are SQL Server tools only. For Postgres, the most commonly used tool is pgAdmin, and it’s alright, but just alright. After trying a few different tools, I fell in love with Postico, shown above. I’m a Mac user, and Postico feels like a database tool that was designed for Mac database developers, by Mac database developers. It’s not pretty, by any means, just very minimalist. It gets out of the way.

The first cool part about having DMV data in a database like this is that I could take a lot of my existing diagnostic query knowledge and port it straight over to this new database. When I’m doing research for a customer to figure out why they’re having an issue, I don’t have to stop and think about how to query the data – it’s just like querying their SQL Server’s DMVs.

The second part – and this, dear reader, still just blows my mind – is that if I wanna see what changed compared to yesterday, I can simply join to yesterday’s connection_id for this server and find out. Wanna see if someone changed a config setting? Wanna see if this is a new problem, or it’s been persisting through time? Wanna see if a metric has been getting worse or better? It’s all right there in the database. Long term, I can even imagine us giving you an export of this data so you can query it yourself too.

Wanna learn more? Read my other posts about building SQL ConstantCare®, or check out the product.


[Video] Office Hours 2018/4/11 (With Transcriptions) with Special Guest Edwin M Sarmiento

This week @EdwinMSarmiento joins Brent, Tara, Erik, and Richie to discuss high availability disaster recovery, migrating from on-premises server to Amazon’s RDS, AlwaysON Availability Groups, auto-growth issues, rebuilding vs dropping/re-creating indexes, log shipping, using Service Broker with Always On Availability Groups, updating stats issues, VLF counts, measuring transaction log usage, upgrading versions, and much more.

Here’s the video on YouTube:

Office Hours Webcast - 2018/4/11 with Special Guest Edwin M Sarmiento

You can register to attend next week’s Office Hours, or subscribe to our podcast to listen on the go.

Enjoy the Podcast?

Don’t miss an episode, subscribe via iTunes, Stitcher or RSS.
Leave us a review in iTunes

Office Hours – 4-11-18

 

What’s the easiest way to migrate to RDS?

Brent Ozar: Let’s see – we might as well go ahead and get into the technical questions. We’ll start with – Richard has an interesting one. He says, “What’s the easiest and fastest way to migrate from on-premises SQL Server up to Amazon’s RDS with as small cutover window as possible?” He says, “Availability Groups isn’t supported. Log backups aren’t available to us, being deduped and volted by third-party software. Is there anything that I can do other than transactional replication?” Wow, I don’t think he has any options.

Edwin Sarmiento: And somebody said transactional replication. Like, eurgh.

Brent Ozar: I don’t think you have an option there. That is pretty rough because RDS is like a sealed box. You get database mirroring, but that’s only inside of RDS; you don’t get to mirror into it so that’s tough.

Edwin Sarmiento: And there are capabilities that only exist in RDS, like for instance, somebody was asking about doing backups, like Native SQL Server backups within RDS. There are stored procedures that only exist in RDS, but they’re not in the Native SQL Server on-premises version.

 

What are all the ways Always On can fail?

Brent Ozar: Let’s see – next up, Brian says, “Edwin, do you have a punch-list to speak of all the failure scenarios that you test with Always On?” he doesn’t say Failover Clusters or Availability Groups, but he says, “If you could tell me what will break, it will make it easier to deal with when it does.”

The Senior DBA’s Guide to Always On Availability Groups

Edwin Sarmiento: What’s really interesting is someone on a Microsoft forum asked the same question, like what would cause an automatic failover? Which is kind of interesting because there are so many things – which was also followed up by, I’m not really concerned about what would make Failover Clustering cause an automatic failover. And the gist of it is Availability Groups and Failover Cluster instances, I mean, if you’re looking at high availability, they still run on top of Windows Server Failover Cluster. So whatever applies to Failover Clustering applies to Always On Availability Groups and Failover Clustered instances. There’s just so many of them.

Brent Ozar: There’s not three…

Tara Kizer: When I attended – we all attended the PASS conference in 2016. One of the sessions I went to was Availability Group Monitoring. I’d been supporting Availability Groups for three years by the time I went there, but it was interesting. It was by Microsoft and they had condensed it down to an hour, but I think they said it was more like a four-hour topic; that’s just how much monitoring is really needed. What are the failures that can happen and, you know – I don’t remember what I learned, honestly, but it was a good session.

Erik Darling: Not much in an hour…

Richie Rump: Great session…

Brent Ozar: Sometimes it’s good just to be scared.

 

How should I configure quorum for a 6-node cluster?

Brent Ozar: Another Brian, a different Brian follows up with, “How would you configure quorum settings when I have six nodes and I’ve got three nodes in one region and three nodes in another region? The three in the primary, I guess, are sync and the three in the other region are async replicas.”

Edwin Sarmiento: It’s interesting when somebody says, how do you, because I don’t know because I don’t know what your business requirements are. And it’s interesting, whenever I tell customers that, they’re kind of like, I only want an answer. But the reality is, what is your ultimate objective? What’s your recovery objective? What’s your service level agreement? Because telling me that you’ve got three nodes in one data center and another three nodes in another data center, does not tell me anything about what your recovery objectives are and what your service level agreements are.

Now, let’s assume that you don’t want automatic failover between data centers. If you don’t want automatic failover within data centers, you’ve got an even number of nodes; three in one data center and three on another data center. You want a tie-breaker. So you’ve got node majority, you’ve got file share, you’ve got disk witness, you’ve got cloud witness in Windows Server 2016; what would you want to use? I’m a pragmatic person. I’m very practical. I’ll just use File Share Witness and host it in the data center that I’m trying to protect, mainly because you don’t want to be spending so much on an extra vote within the cluster. So there’s – again, you always need to go back to, why would you want to do that?

 

We have an app that shrinks the database every hour…

Brent Ozar: So let’s see. Next up we have – Tammy asks, “We have a third-party app that’s shrinking a database every hour…” Off to a good start. “Then our monitoring application alerts about log space problems for that database. I grow the files but they were promptly shrunk again in the next hour. How would you handle this situation?”

Tara Kizer: I mean, if you’re stuck with this application and you can’t make them change it, I wouldn’t be doing the manual growth. Let it auto-grow and shrink back down, I guess. Also, maybe modify your monitoring application so it doesn’t alert to this guy. I mean, most monitoring applications allow you to do custom alerts or to change the rule.

Erik Darling: That’s rough.

Tara Kizer: I would be hating life though, having to support that third-party application. I bet you he uses the SA account and does all sorts of bad stuff.

Erik Darling: Is there any justification from the app provider about why they do that or is there any talking to them? Or is it just completely gone and you’re just stuck supporting it?

Tara Kizer: I would bet that this is a legacy application, you know, that’s been around for a long time, doing things back – it wasn’t ever recommended to do this in SQL Server 2000. It was probably more common to do this type of thing in older versions.

Erik Darling: Depending on the command that it runs, what I might do is add a second log file, cap the size of the first log file really small so it never tries to shrink it and just let the other file grow and do whatever it needs to do. Because then, if it’s just trying to shrink this one thing, if it does anything, there’s nothing in there to shrink. Make it like 1MB; what’s it going to do? Have a second log file – because I’m willing to bet, the way it’s calling is not, like shrink database. I willing to bet it’s doing shrink file and it’s targeting whatever the…

Brent Ozar: The default log number…

Erik Darling: Yeah. I bet, if you added a second log file and just let that thing do whatever it wanted to do, you’d be fine. That’s what I’d do. Screw those guys.

Brent Ozar: That’s so much better than my bad answer. My bad answer was going to be, if you know what time it happens, you could set up a logon trigger, so whenever they go to log on, the app goes to log out and [crosstalk]…

Richie Rump: That was my idea, yeah.

Brent Ozar: Chase says, “I want to know what Edwin and Brent are chatting about.” No, we just both have the same bad ideas about exactly what we were going to do.

 

 

Should I rebuild indexes, or drop/recreate them?

Brent Ozar: Let’s see, next up, Pablo asks, “What would the advice be between either rebuilding indexes or dropping them and recreating them? Which one should I do?”

Erik Darling: Neither.

Brent Ozar: And why?

Erik Darling: Leave it alone. Well, look, dropping and recreating indexes is a hell of a gig, right because you have all the pain of the rebuild. Because you have to drop the index, so it means you have to take out the locks to drop it. And then after it’s dropped, people can get in there and do crap and they can just do things without an index, which is awful for them. It’s awful for end users. At least if you just rebuild the index, if it’s online, users can get in and do stuff during that time anyway. It doesn’t really buy you anything to drop and recreate or rebuild. It’s actually kind of worse because then you’re dealing with the, you know, the drop, which is offline and taking the locks and then the create, which you know, if that’s not online then you’re doing that and they’re doing things without a supporting index.

But Brent has a great session at GroupBy – Great Session Brent – Is it why defragmenting indexes isn’t helping or why rebuilding indexes isn’t helping?

Brent Ozar: Yeah…

Erik Darling: So I would go watch that and make sure that you’re rebuilding indexes for the right reasons, like, you’re not just doing it at like 5%, 30% because someone told you to do that 15 years ago. Make sure that, you know, you’re targeting indexes that it’s sensible to do that for. When people tell DBAs to just rebuild or reorg at certain levels, no one ever tells them, hey, you know, if it takes you two hours, three hours, four hours, eight hours to do index maintenance, are you actually saving that much time when your queries run?

Are you saving two, four, six, eight hours of query runtime by doing this? Are you saving all the resources that you expend CPU, I/O memory-wise when your queries run that you’re expending when you do these rebuilds and reorgs and stuff. Make sure that you’re actually getting something out of them aside from pointing saying, “No look, they’re not fragmented.” It’s not really going to help you.

Edwin Sarmiento: And you’ve also got to look at what else is in that database. I had a customer as a question about – hey, we’re thinking about defragmenting or rebuilding indexes but it’s on an Availability Group. I mean, that impacted the performance instead of doing their index rebuild and defrag. Look, I can’t really help you that much unless you provide me – I know I’m trying to segue into providing additional information, but how many indexes are there, how long does it take you to defrag versus rebuild and what’s the overall impact on a performance when client applications are connecting to the primary?

So you have to think about the other things that come into the picture when you’re trying to do a simple – well, you might be thinking it’s just a simple rebuild or drop and rebuild or defrag, but there are other implications on the database.

Tara Kizer: I’ve had to completely disable index maintenance on an Availability Group, even switching – so it was two or three asynchronous replicas, one synchronous replica, and even switching that sync replica to async, I still was having issues. Besides the latency, lots of blocking. It was a very high-performance environment and a lot of batch requests per second. You have 5000 to 10,000 at normal times, higher at peak times. And a lot of people have had to completely disable index maintenance. And is it really needed, as Erik was saying? Still kept my update stats job once or twice a day – that’s where the performance benefits of the index maintenance is, typically because you update statistics with an index rebuild.

 

How many instances can I enroll in SQL ConstantCare®?

Brent Ozar: Daryl asks, “How many of my instances my I enroll in ConstantCare?” right now it’s unlimited. I could see, in the future, doing something like a 100 instance cap, but throw all you want in there.

 

Can you pause log shipping?

Brent Ozar: Anon asks, “Can you pause log shipping? My company does a weird thing where they take a new server back for a day or two before they do a final turnover so any databases would be out of date. We’re using simple recovery.”

Tara Kizer: You can’t use simple recovery model with log shipping. Log shipping to shipping transaction logs, so you’d have to switch over to bulk log or full recovery model in the first place. You can pause it, you just have the log backups running and not deleting them and one or two days later, the copies will just proceed in the restores. It might take a while to get up to date, depending upon how much work it has to do and do you have to start over with a restore, you know, restore database and then apply the log chain?

It also depends on how much disk space you’re able to keep online. I’ve got some clients who only keep 24 hours of backups on disk and then maybe some other longer term storage. Make sure you have the disk space to be able to have all the transaction log backups that you need to get it back up to date.

Edwin Sarmiento: That’s probably why he said the company’s doing weird things.

 

Is Microsoft about to charge for SSMS?

Brent Ozar: Phil asks an odd one. He says, “Microsoft moved SSMS into a separate standalone product. Rumors are afoot that it’s about to be licensed separately. Has anybody heard anything?” Dude, where do you hear those rumors from?

Erik Darling: SQL Operation Studio is licensed differently than SQL Server Management Studio…

Brent Ozar: In that it’s a different license but they’re both free. Yeah, it’s tough.

 

Do you recommend using Service Broker with Always On?

Brent Ozar: Anna asks, “Edwin, do you recommend using Service Broker with Always On?

Edwin Sarmiento: It’s an interesting question because, you know when people ask you, do you recommend, the first question to ask is, is it supported? If it is supported, then I would recommend it based on, of course, your workload. A common question, aside from service brokers, DTC. Would you recommend DTC with Availability Groups. And of course, unless it is fully supported, it’s not something that I would recommend. Quote en quote working does not mean supported. So I can’t remember exactly what the support matrix is for service broker for 2016 – by the way, what version are you using? So I know it does work but there’s some limitations on what is supported and what isn’t supported.

 

 

How high can VLFs go?

Brent Ozar: Next up, Turge, I believe it is, says, “How high can VLF counts go before it starts causing problems and what kind of problems will you experience when it starts getting high?”

Tara Kizer: Here’s what I know; 75,000 VLFs is not a good thing. In SQL Server 2005 days, back in 2007, 2008, the company I worked for did Microsoft security patching every single month, did reboots, we had high-performance systems, Failover Cluster instances and I had lots of servers to patch. SQL Servers – patch the other guys, you know, so I would restart the box after the patches were applied and I would verify in management studio that the instance would come online after it finished the reboot, but I wouldn’t expand the database history in object explorer.

So I moved onto the next server and a few minutes later, the admin team said, hey we can’t connect to the database; I don’t think it’s online. So I expanded the database history in object explorer and boom, it said in recovery for this database that had five lines of availability requirements. We were in a maintenance window, but it was bad. It had been probably 15, 20 minutes since I did the reboot when I started looking into this, you know. This is not normal, so I opened up the SQL Server error log and it’s phase one of three and – I don’t remember what the percentage was but I was like, oh my goodness, what is going on here?

I didn’t know about the VLFs at the time and luckily this database came online, you know, completed crash recovery after 45 minutes, and this was on a Saturday I believe. And so my plan was to contact Microsoft. I was going to contact Microsoft that night, but it came online before I could make that call. Because it could take like an hour even for a … to get someone on the call.

So, Monday morning, I contacted Microsoft and they said, run DBCC LOG_INFO info on it, and it returned 75,000 rows. He said this is a virtual log file issue; you have too many log files. And so that’s when I learned about VLFs and 75,000 of them is not a good thing. So it impacts fast recovery time of the database, it impacts your restores too and in some cases it can impact performance. So I can’t tell you what number to target. So if you have, say, a 300GB log file, you could have less than 1000 VLFs. I know that just because I fix log files to reduce the number of VLFs. You just need to make sure they’re not too big and not too small and your auto-growth setting is correct. Don’t use a percentage or a very small number.

Erik Darling: The flipside of that coin from Tara – I had, I forget how big the log file was, but there were very few VLFs but they were all gigantic. And so that took me 24 hours to recover from because it’s reading to, like, hundreds of gigs for each one. I was like, come on…

 

How do Availability Groups work internally?

Brent Ozar: Brian asks, “I would like to know the nuts and bolts of how this works. So what’s the best document or video that would give me the best deep dive into how Always On actually works internally?”

Edwin Sarmiento: That is a very interesting question. No documentation from Microsoft would actually explain the nuts and bolts. So here’s the thing, and I’m pretty sure you would agree with me in this; they’re all over. Like, read one documentation, it explains one thing. Read another documentation, it explains one thing; it’s just all over. And I know it sounds like a bit of a plug, I simplify that in the Always On Availability Groups training class that I’m doing. It’s all visual. You would see how everything works under the covers. How the transaction log records or log blocks are getting replicated from the primary to the secondary, what the wait stats are, what the wait types are; everything simplified.

Now, I know that’s kind of a rounded up version of it, but here’s the deal. It’s really complicated if you just simply look at the documentation. I’m trying to simplify it as much as I can from some of the articles and the blog posts that I write.

Brent Ozar: And also, there’ no money in Microsoft building that documentation for the public. Like they just don’t have an incentive to dedicate human beings to do that for – because it would take months for them to do it across all kinds of languages. Anything Microsoft does, they want it to be perfect across all these different standards and all this. And why, when half the time, it changes every version. They’re like, sorry we need those documentations for developers.

Tara Kizer: And just attend Edwin’s training class on Availability Groups. That gives you a really deep dive into the product.

Edwin Sarmiento: What’s really interesting is I see this across all software vendors. I was just looking at some of the documentation for Microsoft PowerShell for AWS. Same thing. They’re all over – I think it’s just a thing with software companies. They just are not really good at writing documentation.

 

Richie’s bored.

Brent Ozar: Richie is so bored. He’s like, I’m done with this.

Richie Rump: [crosstalk] I’m looking at ConstantCare files coming in. Y’all keep going because some people enjoy this stuff…

Tara Kizer: He’s waiting for one of his cats to arrive so he can play with it.

Erik Darling: It’s hit or miss. The other week when it was just me and Richie, there we like 70 questions about Git and I was like, sweet; I’m going to take a nap over here. I’m just going to sit.

 

How can I measure transaction log usage over time?

Brent Ozar: Gordon asks, “What’s a good way to measure how much transaction log usage we’re doing over time? Like how much of our transaction log we’re generating or backing up.”

Edwin Sarmiento: I do have a blog post on that, but I don’t recommend that you do it in production. So basically… Yes.

Brent Ozar: That’s my favorite kind of answer.

Edwin Sarmiento: So the blog that I wrote about – I can’t remember exactly what the blog post was all about, but basically the gist of it, with sample scripts, is you take your log backups and read your log backups using DBCC FN_DBLOG. And part of that – what I did was I came up with an Excel spreadsheet that would tally your transaction log generation throughout, you know, a month, a week. But again, don’t do in on production. Read your log backups. And again, dump those results into a table and come up with an analytical graph on the amount of transactional log records generated over time. But this is not just your inserts, updates and deletes. That also includes your index maintenance. It generates a lot of transaction log records that that would factor into that graph. So you might want to pick and choose, like, your inserts, updates and delete statements as part of the analysis.

 

How do you reduce VLFs and prevent problems?

Brent Ozar: Next up, Michael says, “Tara, if you do have a database with a high number of VLFs, how do you eliminate them and how do you prevent them from returning?”

Tara Kizer: It’s easy – once fixed, it should never come back for that specific database. So you shrink it as far down as you can, say 1MB or 5MB; a really small number. You may need to do some transaction log backups as you’re doing those shrinks. Do this in the maintenance window, but I’ve certainly done this during the day on very high critical databases, just because I don’t want to take a maintenance window for this task. So shrink it all the way down and then you’re going to grow it back out to that original size.

Don’t just grow it back out part of the way or let it auto-grow from there. Grow it back out to the space that it needed before you started doing this task. Don’t grow it out all in one, otherwise you’ll end up with the issue Erik had. And don’t grow out in small chunks because you’ll end up with too many VLFs. So if I had, say, my original file as 100GB, I might just grow it out in, say, 4000GB chunks and I don’t ever do the 4GB chunks because I ran into a bug back on SQL Server 2005 and it was horrific. It’s been fixed, so on newer versions, you wouldn’t run into it. But to this day, I will not do increments of 4GB, so 4GB, 8GB. I’ll do 4000MB or 8000MB. I will not do the 4GB increment.

And then just keep growing it out until you reach the original file size. And don’t forget to change your auto-growth setting. So on a 100GB log file, I’ll probably do a 1Gb auto-growth; maybe 512MB. You don’t want too small, you don’t want it too big, but fixing the auto-growth is how you prevent this issue from coming back.

Edwin Sarmiento: I can’t remember what version of SQL Server has that alert for if you reach 10,000 VLFs, I think, or 1000 VLFs, it will write an error in the SQL Server error log.

Tara Kizer: [crosstalk] I was going to say, it definitely wasn’t there in SQL Server 2005; I would have noticed it. I was a good DBA.

Erik Darling: [crosstalk] Then you got another job here and everything went down.

Brent Ozar: Now she spends all her time painting mountains on the background. [crosstalk] Marci says, “One other thing to add, when you do Tara’s fix, to also do a checkpoint before you go and do this stuff too.”

Tara Kizer: Yes.

 

How should I migrate 2012 AGs to 2017?

Brent Ozar: Pablo says he’s on 2012 Always On Availability Groups and he says, “What are the general steps to migrate toward  2017?”

Edwin Sarmiento: So, general steps will depend on the maintenance window. You could do a kind of rolling upgrade scenario where you’re adding another replica in your Availability Group that is running a higher version. Say you’re on 2012, maybe 2017, and then do a failover. But that’s a one-way street, which is why you really need to do a lot of testing prior to doing that because once you’ve failed-over to 2017, you can’t go back to any of the lower versions.

That being said, that’s the fastest way to do it. You can do a lot of different things because, number one, if you’re on 2012, I’m guessing you’re still on Windows Server 2008R2 or Windows Server 2012 R2, at most. I’m guessing you also want to upgrade your operating system. So you really have to plan well because you don’t want to be upgrading to SQL Server 2017 and then a couple of months later you’re upgrading your operating system to the higher version. So if I were you, wait until the whole process of upgrading your hardware and OS is part of the plan, so that way, you don’t have to repeat everything. Of course, as a consultant, I would rather have everything done again because I get paid twice, but I’m not like that. I’d rather do it once; I’m lazy. I’d rather do it once. Plan accordingly. You’re saving your company a lot of resources, a lot of time and a lot of money.

 

Are SSPI handshake alerts a problem?

Brent Ozar: Next up is Chase, who asks, “I’m getting a lot of alerts regarding SSPI handshake failed. I could just tweak my monitoring system to ignore it, but should I worry about this?”

Tara Kizer: That is an error message I automatically dismiss. It can be fixed with SPNs. I think this is an SPN issue, right…

Erik Darling: SPN or whatever delegation in the directory account.

Edwin Sarmiento: But you also have to think about why is something causing that and, when I was doing operational DB, I usually ignore that. But then at one point, I came to realize, if it’s throwing something like that, most of the time, we just think it’s active directory or it could be an SPN thing, but it could be worse. And I was in a troubleshooting call with Microsoft a couple of years ago where I’m seeing a lot of these incidents. It turns out that the reason why it’s getting a lot of that events was the fact that SQL Server could not contact active directory. And of course, I kind of passed the ball to the active directory guys without realizing that the main controller was actually sitting on a VM with a single virtual … I know you kind of know where I’m going with this. And every single machine on the network is hitting that thing. So it’s not just SQL Server that’s getting affected, it’s everything on the network. So I told them, look, maybe it’s about time that you moved this VM to a physical machine because everything’s hitting the same VM.

 

How does SQL ConstantCare® upload to AWS?

Brent Ozar: Landon asks a question for Richie. He says, “Is ConstantCare using a secure file transfer method to upload the data to AWS?”

Erik Darling: Carrier pigeons.

Richie Rump: Bit by bit.

Brent Ozar: One bit at a time.

Richie Rump: Yes we are and it’s because Brent made me. I think in initial testing, we were just going to use HTTPS and be done with it, and then Brent made me throw some encryption on it. Everything up to the cloud, once it gets up to the cloud, we decrypt and reprocess the files.

 

How can I automate AG patching when I have SSIS?

Brent Ozar: There we go. So let’s see, the last one we’ll take is [Saket]. [Saket] asks, “How can we automate SQL patching in an Availability Group when SSIS DB is in there as well?”

Tara Kizer: It’s not supposed to be in there, right? That guy is not supported on an Availability Group, I don’t think.

Edwin Sarmiento: So when this thing first came out – because SSIS DB was introduced in SQL Server 2012, there were a lot of challenges because for one, you wouldn’t want SSIS DB to be in there unless you know, for a fact, that you’re going to put all your SSIS packages in there instead of MSDB or maybe file share for SSIS. It became a bit more challenging because every single database has to be in the Availability Group [reporting] – SSIS in the Availability Group. Come patching time, it just screws thing up. You would have to remove the SSIS DB from the Availability Group, patch the nodes and then add it all back, which was why he asked that question.

You can automate that process, either through a combination of PowerShell and T-SQL. It all depends on what your patch management processes are. Are you using System Center for your patch management process, are you using Group Policies for it? It becomes a bit more challenging, but you can, again, use a combination of PowerShell plus T-SQL to remove the database, patch the node and add a database back into the Availability Group.

Tara Kizer: It sounds like a lot of work. I mean, it’s not supported, right. I mean, why put it in an Availability Group if it’s not supported?

Edwin Sarmiento: That I would have to ask the product team for clarification.

Tara Kizer: At my last job, we had Availability Groups and we had SSIS packages and were very excited about SSIS DB and all of its features. And so we had our Availability Groups for the user databases, non-Microsoft stuff, and then we had a separate virtual machine that had SSIS DB that was not in an Availability Group. We did snapshots on that guy like every five minutes. So it still had some HA features in place, it just wasn’t in an Availability Group.

Brent Ozar: Alright, well that’s everything we’ve got time for this week. Thanks, Edwin for coming out…

Edwin Sarmiento: Thanks for having me.

 

Who’s the right person to attend Edwin’s upcoming courses?

The Senior DBA’s Field Guide to Failover Clustered Instances

Brent Ozar: Answering questions – absolutely awesome. Do you want to say a few words about who the right person is to come to your course coming up on Failover Clusters and Availability Groups?

Edwin Sarmiento: Well it’s interesting because I’m also looking at the trends. First one we did last September, there were more senior DBAs who attended. The one we did last December were IT managers, which is very interesting; IT managers and sysadmins who are getting started with SQL Server. And so if you’re a senior DBA, or even if you’re a sysadmin or systems engineer who are now responsible for managing SQL Server, particularly Availability Groups, or if you’re an IT manager who just needs to understand this. Or maybe you’re an IT manager who needs to fill in because you’re still waiting for a new senior DBA to be hired. This is the right course for you because I try to simplify, like I said, I try to simplify everything that involves Always On Availability Groups and the new one we came up with is the Failover Clustered instances because they’re not dead. They’re still out there in the field and I’m pretty sure that you still need to manage them until you decide to move to Availability Groups or even to the cloud.

Brent Ozar: Perfect. Thank you, sir. I look forward to the class and see everybody next week at Office Hours. Adios, everybody.

Edwin Sarmiento: Thanks for having me.