Blog

sp_AllNightLog: Poll Vaulting

Carry on

It turns out that the only thing harder than checking for new databases restored to a SQL Server, is checking a folder for a backup of a database that doesn’t exist on another SQL Server.

These are both part of what sp_AllNightLog has to do.

The other components, which use workers and a queue to backup and restore databases, were fairly easy to write. The important code had already been written.

Don’t thank me.

Seriously, don’t. I didn’t write any of that.

Prince Poll

Polling on the “log shipping primary” basically just looks for new databases in sys.databases, and dumps them into our queue table to work on. This is easy enough.

On the “log shipping secondary”, we had to do some ballet.

Yes, it involves xp_cmdshell. If you’re horrified by this, feel free to stop reading and resume telling people that the sky is falling.

We do some checking to make sure that the path exists, and isn’t empty.

When we find folders for the first time, we have to restore the initial full backup. We insert some special canary values so our restore workers know this is the case.

After that, we just keep piling on log backups. This is the right thing to do.

This is the part of the code that calls sp_DatabaseRestore — and believe me, I’m so glad I didn’t have to write any of this.

Poll Exemplar

Some of the ‘interesting’ parts of the restore polling code are in here. The backup polling code is literally just an intermittent check on sys.databases. If that makes your pants move, you can look at the code on your own time, creep.

First, we grab the path from a configuration table that gets populated when you run setup. Don’t worry if you mess this up, you can update it.

Assuming that all works out, we set up our xp_cmdshell command, and insert that to a table variable.

I know, I know. but they kinda deserve each other.

It’s important to use /b in the dir command, so you don’t end up with an unparseable blob of directory information.

We also do some path checking here to make sure to make sure something usable comes back. The results when that’s not the case are pretty specific — there’s one NULL row, or you get a message like ‘The system cannot find the path specified’ or ‘File Not Found’.

Then we take the folder list from our table variable, and insert anything into the restore worker table that isn’t there already. Ah, the joys of NOT EXISTS.

Road trip!

Down the line, I do need to add a slightly more recursive check to make sure any ‘new’ database folder has a subdirectory called FULL in it, and perhaps a bit to delete rows from the worker table if restore attempts come up empty. We’ll see how much of a problem certain scenarios turn out to be.

Thanks for reading!


What Do You Think About ORMs?

Development
26 Comments

I was recently asked what I thought about ORMs (Entity Framework, NHibernate, Dapper, etc) while we were looking at implicit conversion warnings in execution plans.

Before I answered the question, I let them know that my answer is based on being a production DBA and the numerous performance problems I have had to troubleshoot that were caused by improper usage of an ORM.

What I don’t like about ORMs:

  • Implicit conversions due to nvarchar variables vs varchar columns
  • Queries grab every single column, regardless if they are needed or not
    • Expensive key lookups
    • Very wide missing index recommendations
  • The resulting query is hard to read

I first came across the implicit conversion problem 10 years ago and blogged about it (back then I didn’t realize the culprit was the ORM and instead blamed the database driver). I had been battling very high CPU utilization for a few weeks and finally figured out it was due to an implicit conversion that caused the plan to scan a very large index. Why did it take several weeks to figure this out? I’ll have to chalk that up to inexperience back then. I’m a much better performance tuner these days.

Based on the servers that I see as a consultant, implicit conversions is a very common problem. Sometimes it’s due to the ORM, but sometimes it’s due to a data type mismatch in the join conditions (Table1.ColumnA is varchar, Table2.ColumnA is nvarchar) or in the stored procedure (@var1 is nvarchar and is being compared to a varchar column). Avoid the data type mismatch issue by being diligent with the data types.

If you are using an ORM, be sure you know about the implicit conversion issue and how to work around it. Like I said, I’m a production DBA and not a developer. I can’t tell you how to fix it, I can tell you why it’s slow and how to find queries with this issue.

To help figure out if you are experiencing implicit conversions, look at the Warnings column when you run sp_BlitzCache. For more information, check this out. It links to a good article about implicit conversions.

This is not an anti-ORM blog post. This is the “I’m on-call 24×7 and am sick of being woken up at 2am because a query generated by an ORM is causing a severe performance problem” blog post.

Brent says: It’s not that ORMs are inherently bad – I’m usually for anything that gets products to market faster – but the problems hit when you scale the app without putting additional work into performance along the way. ORMs are like any other technical debt: as the app becomes popular, you’re gonna need to pay that debt back down sooner or later.


Last Season’s Performance Tuning Techniques: Slides & Demos

#SQLPass, Development
6 Comments

Right now, Erik and I are presenting at the 24 Hours of PASS. We’re talking about Last Season’s Performance Tuning Techniques:

24HOP Summit Preview Last Seasons Performance Tuning Techniques

Wanna play along with us as we show how your performance skills might be a little out of date? Here’s the demos and the slides:

https://www.slideshare.net/BrentOzar/last-seasons-performance-tuning-techniques

Fill Factor: Doing the Page Splits

You can do this one in any database, but you’ll want to do it on a server with very low load. If anybody else is doing any deletes/updates/inserts at all, it’s going to skew your numbers.

Questions to think about:

  • What does a page split really mean?
  • Is there such a thing as a good or a bad page split?
  • How do you know which ones you’re having?
  • Would setting fill factor have prevented that page split?

Missing Indexes

This one requires the Stack Overflow demo database. If you don’t already have a copy of that, don’t try to download it live during the session – it’s too big (~15GB torrent, then expands to a ~100GB SQL Server database.)

Get the estimated execution plan for this:

And then ask yourself:

  • What index am I told to create?
  • Does that index make sense?
  • Is there anything SQL Server isn’t telling me?

Now try the estimated plan for this:

And ask yourself those same questions.

BEGIN TRAN ERIK

CTEs

Sometimes a CTE won’t change anything at all. This is the case with simple predicates.

CTEs don’t materialize results. What do you think this is, Oracle?

If you join a CTE to itself, you’ll run the CTE query again.

Thankfully, nested CTEs don’t exhibit the same problem.

CTEs and derived tables will behave similarly as far as performance and query plans go.

One difference is that you can’t reference a derived table more than once, where you can do that with CTEs.

CTEs are cool though. You can filter on things on the outside that you can’t filter on the inside.

Functions

This query runs without a function and finishes pretty quickly.

If we turn that string aggregation expression into a scalar valued function…

Now we can crap up all our queries effortlessly.

Checking on query performance with sp_BlitzQueryStore…

Computed columns with Scalar Valued Functions in them will be similarly crappy.

Crappiness doesn’t depend on whether or not we select the computed column. It’s there no matter what.

Let’s see what happens when we add in a check constraint based on a UDF.

What to look for in XE: executions of function after inserting rows. Executions after selecting data.

Using an inline TVF makes things faster for the query, but we can’t use it in a computed column. Other downsides: inline TVFs aren’t tracked in DMVs (2016 has a function_stats DMV that doesn’t catch them).

This is the same in Query Store.

Temp Tables or Table Variables

Table variable modifications are forced to run serially.

They’re also not guaranteed to be in memory. Backed by temp objects which may spill to disk.

Bad estimates may prevent parallel plans from happening when they should have.

Does recompiling always make things better?

Temp tables generally work better!

Do indexes change anything?

ROLLBACK ERIK

Are your performance skills out of fashion?

If you learned things during the webcast, and you’re starting to question your taste, have no fear: we’re here to help. We’re doing an all-day pre-con class before the PASS Summit called Expert Performance Tuning for SQL Server 2016 & 2017. We specifically designed it to update your performance skills for today – and a lot of the techniques are even useful on currently patched versions of 2012 & 2014, too. Learn more and register for the pre-con.


sp_BlitzCache: Eventual Compatibility With Azure

Better than nothing

Yesterday it was announced that global temp tables were available in public preview for Azure. That means that unmodified versions of sp_BlitzCache (which uses a global temp table to hold a whole bunch of information) will be compatible up there.

This is a nice addition for Azure, and an even nicer addition for me.

Why global?

I don’t know! You’d have to go bug Jeremiah about that one. One reason that I kept it the way it is, though, is ease of troubleshooting.

See, when there’s a results bug, or when you want to add a new check, it’s really easy to run the proc, then select data from the global temp table. It also makes things super easy if you want to work on a portion of the code without running all of it. With a global temp table, you can easily shred out the XML for examination (just like we do in the proc) to work on XQuery. Trust me, you’re gonna need to work on XQuery.

Advil

If you’re using the public preview and want to give sp_BlitzCache a spin, head on over to our GitHub repo to download it. That’s where you should go if you run into any funny business while testing, too. Blog comments are the devil for that stuff.

Thanks for reading!


SQL 2016 SP1 Shows You Wait Stats in Execution Plans. Or Does It? [UPDATED]

SQL Server 2016 Service Pack 1 brought us performance tuners all kinds of shiny goodies – like the ability to see wait stats inside a query’s actual execution plan. I was really excited when this first came out, but I keep seeing some real oddities.

Let’s keep this one really simple: I’m using the Stack Overflow public database, and any version will do. Start by purposely kneecapping your SQL Server, restricting it to just 1GB of memory:

After that finishes, pop open another window, give your SQL Server amnesia, and start sp_Blitz for a 60-second monitoring span:

Then start another window, turn on actual execution plans, and run this simple query which scans the Posts table – which is quite too large to fit into 1GB of RAM:

Assuming this is the only query running at the time, you would expect that the server-level wait stats would match the execution plan’s waits. Here’s what the query plan shows:

Query-level wait stats

That sounds amazing! Our server only had 706 milliseconds of wait time altogether, none of which was spent waiting on storage! Our storage must be super-blazing fast, and there’s practically no way to tune it to wait less than 706 milliseconds, right?

Just to show that I’ve got nothing up my sleeve, here’s the waits in the XML plan:

Wait wait don’t tell me

Your first clue that these aren’t quite accurate is the fact that this query went parallel, but no CXPACKET waits are shown.

Your second clue: this query took a full 39 seconds to run, and CPU wasn’t 100% the whole time. Here’s a quick shot of Task Manager just to prove that point:

Task Manager

So what was SQL Server waiting on? Server-level wait stats tell a very different (and more accurate) story:

sp_BlitzFirst @ExpertMode = 1, @Seconds = 60

Ah-ha! We waited a heck of a long time for PAGEIOLATCH_SH, which means reading data pages from a data file. And while we’re looking – that “actual” execution plan didn’t mention anything about CXPACKET waits either.

Can you see wait stats in query plans? Sure. They’re just not accurate – yet, at least – so don’t go making any complex tuning decisions based on those. Keep your eye on this Connect item to know when it’s fixed.

Cue the sad trombone

Update 2017/10/05: Bad news.

Microsoft marked the bug as “Won’t Fix” and wrote:

This results in PAGEIOLATCH_* wait type and other IO waits not being tracked per session, because by-design these are SOS waits that are tracked at the server level. Potentially tracking individual IO requests per thread and session also has performance considerations that would impact query execution.

So no, 2016 doesn’t really show wait stats in query plans.

To me, this is worse than not showing any wait stats at all because the waits aren’t anywhere near accurate. Some poor performance tuner who hasn’t read this post is going to start analyzing their query’s wait stats in the plan, do all kinds of work to “fix” it, and not understand why the query’s not getting faster.

Ouch.

Update 2017/10/05 Part II – there might be more hope! Pedro Lopes reopened the Connect item for discussion.


SQL Server 2017 Release Date: October 19, 2017?

SQL Server 2017
5 Comments

Eagle-eyed @NikoNeugebauer (who you should definitely be following on Twitter) noticed something interesting in one of Lenovo’s recent TPC-H benchmark results.

The benchmark PDF shows:

Possible SQL Server 2017 release date at bottom right

Now this isn’t necessarily accurate – someone could have just stuffed in a placeholder date, or maybe SQL Server 2017 will be available earlier but the hardware is delayed until 2017/10/19, or maybe Lenovo’s trolling us.

Also, note that in the pricing section, SQL Server Enterprise Edition comes in at around $7k per core, indicating that there’s no big revolution coming in the pricing department.

I don’t consider this an official announcement by any means, so the Guess the SQL Server 2017 Release Date Contest is still on.

In related news, the SQL Server 2017 Release Candidate 1 (RC1) downloads are available now. Enjoy!

Update 2017/09/25: Looks like the release date was pretty doggone close, about two weeks off. The real date will be October 2.


[Video] Office Hours 2017/07/12 (With Transcriptions)

This week, Brent, Richie, Erik, and Tara discuss partitioning tables, native backups vs 3rd party agent backups, page life expectancy, query tuning, deadlocks, drawbacks of RCSI, triggers, replication, Always On Availability Groups, and forest fires.

Here’s the video on YouTube:

Office Hours Webcast - 2017/07/12

You can register to attend next week’s Office Hours, or subscribe to our podcast to listen on the go.

Enjoy the Podcast?

Don’t miss an episode, subscribe via iTunes, Stitcher or RSS.
Leave us a review in iTunes

Office Hours 2017-7-12

 

At what size do tables benefit from partitioning?

Brent Ozar: Richard says, “Is there a rule of thumb for a number of rows that would make you want to partition a table? I don’t need archiving, I don’t need sliding window loads; I just have a 230 million row table with about 45GB on disk. Is there some rule of thumb when I should consider partitioning?”

Erik Darling: Not if you don’t need those things.

Brent Ozar: What would make you think you need it?

Erik Darling: Needing sliding windows and archiving. Because it’s not going to help for performance, it’s not – it’s going to make things more complicated for you [crosstalk]

Richie Rump: Partition elimination?

Erik Darling: Yeah, if you get it, if you’re real lucky.

Richie Rump: Yeah, well you would have to use that partition key in order to get that.

Erik Darling: Yeah, so if your queries don’t match your partitioning key then it’s no good.

Richie Rump: Yeah you’re screwed, sorry.

Brent Ozar: And it’s becoming worse instead of better. SQL Server has to reassemble those results again; it’s tough. In terms of, is that a large table or not either, I don’t think it’s large for SQL Server anymore. I mean, the rule of thumb for VLDBs was always 1TB or one billion rows in a single table. So in the hundreds of millions, I mean, it feels intimidating, but remember, you’re always going to be working with the biggest servers you ever worked with. Your career’s always going to progress forwards until you have that one accident where you drop tables, drop the database, then you temporarily get fired but then you get another job…

Tara Kizer: I was okay when I did that…

Brent Ozar: Did you blame it on anybody else or?

Tara Kizer: No, no, no I always take – I always raise my hand.

Brent Ozar: What was the first worst memory when you dropped a table or dropped a database?

Tara Kizer: Oh goodness, the first? It would have been about 2002 or

2003 when I’d just started at Qualcomm, that’s the company that had Las Vegas DR site. And we did a planned failover to Las Vegas, they did it one or two times per year and we were production out there for a couple of weeks at least. And this was back in SQL Server 2000 days, so we were using log shipping and after we were done with the failover our process was to now drop log shipping and set it back up in the other direction. So what I would do is I would drop the database at the other site because it’s in a recovery state. So I would just drop and start the research. Now, you don’t have to do it that way, but that’s what I did. And I accidentally dropped the production, the one we had failed over to, because it was now production. So, you know, we were in a maintenance window, so it wasn’t a huge deal, you know. And back then databases weren’t big anyway, but it didn’t take me long to do a restore. But that system, that database had been online after the failover while we were working on all these other issues with other databases, other applications.

Erik Darling: It figures, that’s like the one time that a dropped database works on the first try. [crosstalk] no you can’t do it… What does it matter, SQL’s just like yeah, go ahead.

Brent Ozar: Yeah, it’s the same thing with SSMS letting you click twice slowly on a database name and just rename a database. Some things should be a little harder than that.

Erik Darling: Bad news, everyone has to use tilde now. That’s the first thing I hit instead of hitting escape.

 

Should I put dev/test/prod all on one VM?

Brent Ozar: Let’s see here. J.H. says, “What are your thoughts on having one SQL Server VM serving as all three dev test in prod, and then just use three databases, one named dev, one named test, one named with prod and just use different logins? What do you think about that?”

Tara Kizer: Oh my god, no. No, no, no, no, no. What are you doing? So this is going to be on the same SQL Server instance – you get a QA person who’s got elevated access in the test environment, and suddenly they brand this massive query and, you know, hogged up all the CPU and memory. I mean, wow, no. [crosstalk]

Erik Darling: Yeah, the size of that box would have to be so gigantic for me to be comfortable with all three environments sharing that it would probably just outdo any gains you got from it.

Brent Ozar: And your licensing would suck then at that point, because you were doing Standard or Enterprise on there, as opposed to Development Edition, which is free these days.

Tara Kizer: J.H. says it’s a VM, so why are they sharing? Don’t you have other servers?

Brent Ozar: He says, as a follow-up, “It’s very small DBs for one specific team”.

Tara Kizer: I don’t care. Prod is prod. Everything else can go on the same box, I don’t care. Put dev, test, stage – well maybe not a load test environment, maybe not stage, the ones that should mimic production. But all those other smaller databases that are not prod, they’re on one same server, I don’t care about that.

Erik Darling: Someone monkeys around with an agent job or something else in there, that’s prod affecting too.

Brent Ozar: Richie, I assume that – Richie does all the development here and manages all the databases – even with us as all database administrators, we use separate dev and test and prod environments, I assume?

Richie Rump: Yes, we sure do, absolutely. [crosstalk] SQL side, we do that.

 

Should I use Ola’s scripts or Unitrends?

Brent Ozar: Let’s see, next up Doug says – not Doug Lane, but another Doug; although hi Doug Lane, wherever you are. Doug says, “Currently using Ola’s scripts to manage backups for about 20 instances of SQL Server. Our operations manager is pushing something else, Unitrends Enterprise Backup. What are your thoughts on native backups versus somebody else’s third party agent stuff?”

Tara Kizer: I don’t like it. I especially don’t like the fact that – Unitrends Enterprise Backup, I’ve never even heard of that. I mean, a lot of clients aren’t using native backups, maybe they’re using other things… but we’ve at least heard of the technology that they’re using. This is – I’ve never heard of this. I would say no.

Brent Ozar: I think it’s CA. I think it’s Computer Associates, I’m not100% positive. Like it’s one of those mainframe-y companies…

Tara Kizer: Yeah, but if no one else is using it in the SQL Server world, or hardly anyone is and you’re not seeing any information out there about it, I would not go that way.

Brent Ozar: And what are some of the reasons why you don’t like third party backup things that are outside of the DBA instead of native backups?

Tara Kizer: So at one of my jobs, every few years they would ask, can we move to, say, NetBackup for the backups. Don’t do native backups – because before, we were doing native backups and then sweep those to take with NetBackup. But they wanted to get rid of the native backups for SQL Server because the Oracle side was using NetBackup. So we did this whole proof of concept and stuff, and it worked fine, but as a DBA, I don’t want to have to go to some other tool. I’m not even the one managing that software over there, you know.

In the companies I’m at, you know, we’ve got sysadmin teams, we have SQL Server team, you know, everyone’s different. So, I have to go to a sysadmin to tell them what to select for the restore and point in time recovery and all these things. So it just complicates things for a DBA when it comes time to do a restore. And when it’s time to do a restore, that’s a very critical task.

Brent Ozar: yeah, log shipping makes those things less useful to me because I want access to those log backups. If I need to reinitialize a database mirror or Always On availability groups, those things don’t help me very much. So huge fan of – when the folks say, I want to use Unitrends, NetBackups, whatever, I’m like sure, go backup the file share where I write my backups too; you can totally use the agent there.

 

Why is PLE different across 2 NUMA nodes?

Brent Ozar: Gordon says, “I’m seeing a large difference between page life expectancy on two nodes of a two CPU environment.” He’s got numa, so he sees two different tracks for page life expectancy. “Shouldn’t they be the same?”

Tara Kizer: Gosh, I remember there’s an article about this topic, that if they are not the same then there’s an issue.

Brent Ozar: Kehayias Paul Randal has one about it; we’ll have to put it in the show notes. But essentially, you can have different workloads happening across the two different nodes, I just wouldn’t focus on PLE and I’d jump out and look at wait stats instead. Because PLE can be in the toilet and your server may not be waiting on memory, you know, may not be waiting on storage, in other words reading data pages from a data file. Your PLE can also be fantastic and yet the server performance can be in the toilet; like you have locking problems or blocking issues. So I just wouldn’t use PLE that much anymore.

 

Why is a query faster on one AG replica?

Brent Ozar: Let’s see, Matt says something that’s going to be near and dear to Tara. “We are going to move from a single SQL Server to an availability group. The new Ag servers have more memory and CPU than the old server, but when I’m testing the same query on both environments, the AG servers are eight seconds slower. Where would I look to see why a query is faster on one server than another?”

Tara Kizer: Who wants to do it?

Brent Ozar: Erik, you haven’t answered a question in a while.

Erik Darling: There’s lots of stuff to look at. The first place I would look is the execution plan.

Brent Ozar: And what do you look for – like how do you compare to execution plans?

Erik Darling: Me? Oh god, I’m a horrible klutz with – I always just open stuff up in plan explorer. So you could look there – let’s see… I really thought the AG thing was going straight to Tara [crosstalk]

Tara Kizer: I mean, this isn’t an issue with the AG though.

Erik Darling: Alright, alright. So I would look at settings on there, see if maybe MAXDOP or cost threshold is different, but that might show up in the execution plan. Let’s see what else, what else would be good? I don’t know…

Tara Kizer: [crosstalk] waiting on, maybe it’s waiting on something different on this server than on the other server.

Erik Darling: My first instinct is just to check the execution plan…

Tara Kizer: Exactly, this sounds like just a bad execution plan parameter sniffing issue, but you know, parallelism settings, maybe check out the wait stats. Not maybe, do check out the wait stats, but maybe the wait stats is going to show HADR sync commit has a very big waiting time as compared to system up time. And if that’s the case, maybe the two sync commits to the other server. If you’re using a synchronous commit replica from the AG, maybe it’s adding a lot of overhead because you do not have blazing fast storage and blazing fast networking; those are critical for synchronous commit replicas. So wait stats and then what is the query waiting on. Look at sp_whoisactive on top of the execution plan stuff.

Erik Darling: Or sp_BlitzWho, sp_BlitzWho is okay too.

Tara Kizer: I’m just so used to whoisactive, you know. Something so rock solid for so many years and this other tool comes along. You know, I don’t need to switch to another tool. I like BlitzFirst, don’t get me wrong, I just, you know, for current activity I’m going to whoisactive. Sorry.

Erik Darling: No loyalty

Tara Kizer: the loyalty is to the product I’ve been using all these years.

Erik Darling: How much does Adam pay you?

Brent Ozar: Boos… The other thing is, of course – don’t forget the obvious thing is different data across the two different servers too if you haven’t restored that AG server in a while. Also, if you wanted to share the plan with other people, I don’t know how sensitive the query is but you can also use Paste the Plan. If you go to PastethePlan.com, you can paste in your execution plan and then you can give both of those links in a Stack Overflow question, you can post it on SQLTeam, SQL Server Central and say hey, help me figure out what the differences are between these execution plans so you can see which one’s better or worse.

 

Can I help my developers reduce deadlocks?

Brent Ozar: J.H. asks, “I had a developer recently experience deadlocks, and he asked if anything can be done on my end as a database administrator. I played pin the blame on the developer, but is there anything else I can do in order to make these deadlocks disappear?”

Tara Kizer: Well Brent, last week you mentioned how deadlocks was like your thing that you needed to work on, the thing that you’re weak on, and then you said, you know, as soon as you get to that deadlock graph, you mentioned that’s – you know what to do. I can’t read deadlock graphs.

Brent Ozar: I love – I mean, there’s two parts to it. One is troubleshooting why the deadlock is happening and then the other part is doing something about it. And the place that I would start is, often tables have no indexes, they have a clustered index and that’s it. Imagine the phone book, the white pages of the phone book that grandpa kept on top of the fridge. Well if I told you, go through the white pages and update everyone whose first name is Brent, you’d end up doing lock across that entire phone book, finding all the Brents and doing the updates. Whereas if you had the right non-clustered index on first name, you’d be able to seek directly to the Brents, you’d be able to know how many rows are going to come back. Some of this is a database administrator’s job, assuming that database administrators are the one in the shop fixing the indexes. You could also play around with read committed snapshot isolation.

Now, this isn’t bulletproof. If two people want to lock the same row, you’re still screwed, but if you go to BrentOzar.com/go/RCSI, as in remote controlled Subaru Impreza. If you go to BrenOzar.com/go/RCSI, Kendra Little has a great write up on everything you need to think about, how RCSI makes your database queries faster and what you need to tell your developers in order to implement RCSI. Start with indexes; RCSI is probably the next level up.

Richie Rump: Yeah that’s funny, as a developer, I know very little about deadlocks. I know that when it happens I go straight to the indexes, right, and it typically fixes my problem. And in fact, anything I do new that’s greenfield, RCSI is the first thing that’s turned on. Any new database come on, RCSI – and if there’s a problem that we need to fix and turn off or whatever, which I’ve never had happen to me, then we can handle it there. But it’s usually a good default to turn on RCSI.

Erik Darling: I’d say if you’re in a position where changing the indexes or changing a setting like RCSI isn’t possible, I would have the developers just build in retrial logic, so that if they hit a deadlock they catch that error and retry their transaction. Like wait for a second, wait for half a second and then retry.

Richie Rump: Still not fixing the deadlock…

Erik Darling: No, but at least you’re trying it again so you’re not missing the transaction.

Brent Ozar: Which may be fine. I mean, in some shops it’s totally okay to just retry [inaudible] behind the scenes.

Brent Ozar: Robert says, “Are there any drawbacks of RCSI?” The big classic one is that some yo-yo begins a transaction on Friday, locks his workstation and then goes home for the weekend, and then tempdb explodes in terms of size. So long running transactions is the big ugly one. You can learn more about it though at BrentOzar.com/go/RCSI. Got huge write ups in there.

 

Have you done cross-database queries in Azure?

Brent Ozar: Wilfred says, “Do you folks have any experience doing cross-database queries in Azure?” I assume he means Azure SQLDB. The answer on that is no because it sucks. You have to predefine your tables on both sides, or on whatever side you’re going to do a linked server query to. Like I have to define what the remote table’s going to look like as an external data source, and that’s just a brittle slow-down for most of my developers. Developers just want to go database name.dbo.table name, and having to predefine the table makes things a little wonky.

 

Can I redirect queries between AG replicas?

Brent Ozar: Kelly asks, “With a read only secondary in an availability group, is there a way to redirect queries back to the primary when the secondary stops responding?’ You know, her “sandbox environment, some of the mount points for the secondary went down” – not laughing at you, I’m laughing with you – “and my read only connections for my reports went along with it.”

Tara Kizer: it’s the read only routing list, so you have that set up to go back to itself if the other replica is down. So it’s read only routing list. And make sure that the read only routing URL is correct for it.

Erik Darling: Is that available in all availability group versions, or is that…

Tara Kizer: I mean, I don’t know about basic availability groups – well there’s no readable secondary, so no. yes, it’s available since 2012, definitely. There’s no GUI for it, so it’s easy to miss. I missed that when I first set up availability groups four years ago, and read only routing was not working.

Brent Ozar: The part that sucks is, you’ve got to make sure your developers understand application timeouts too. The connection timeouts may need to be set longer, because if they have short timeouts, like 30 seconds, it may take 30 seconds for the connection driver to try the first server that’s down and then go retry the second. It also won’t fix queries that are in flight. Once the query starts, if all of a sudden the mount points disappear under the secondary, that query is toast. You just have to build in some kind of retrial logic, but leave it to the consultants to say, build retrial logic into all your application queries.

Richie Rump: Oh it’s so easy, yeah, let’s go ahead and do that… Just do it, developer.

Erik Darling: I have supported entire applications that were built on retrial logic, so it can’t be that hard.

Richie Rump: It takes time, money and effort, that’s all.

Brent Ozar: And retrofitting it into an existing application is a nightmare; it’s just a nightmare.

Richie Rump: yeah, and typically when you say use retrial logic, that is our scenario, right? It’s not something we’re building new, because the new stuff is probably going to be in the cloud, which already has automatic retries. Thanks, guys.

 

Any free tools to develop triggers?

Brent Ozar: J.H. says, “Do y’all know of any free tools to help develop and troubleshoot triggers?’

Tara Kizer: No.

Richie Rump: The delete trigger command, I don’t know…

Brent Ozar: there was – I just saw this hit recently, and I’m going to go look to make sure it’s not a joke… Hold on a second [crosstalk] It had the feel to it. Yeah, so if you go to Amazon, Thomas LaRock put together a book recently, an ebook called The Trouble With Database Triggers. I have not read it, as I had to go look and make sure it wasn’t a joke, but it’s – oh, it’s 15 pages. 15 pages, okay, well it might be a starting point then. [crosstalk] 15 pages, that sounds real, okay, that’s legit. But I remember writing my first stored procedure, and I remember being a developer and going to the book store and there was actually a book on stored procedures, and I was like, this is awesome, this is fantastic. It’s actually harder than it looks, so I thought maybe there was going to be one on triggers, but it’s 15 pages.

Erik Darling: Aaron Bertrand though does have a pretty good talk, I believe it’s SQLBits, about triggers that I would recommend. If you’re getting into the business of triggers, give that a watch. It’s about – it’s like an hour or so; not the rest of your life.

Brent Ozar: Somebody said a couple of weeks ago, they were like, you know it’s amazing when you guys go and just open the URL on the webcast so we can go see what you’re talking about. [crosstalk]

Erik Darling: No I mean don’t use the site search on SQLBits.

Brent Ozar: Yeah, no it’s bad. So SQLBits is a conference where they record all their sessions, and it’s totally available for free. If you go all the way down to the bottom, they have previous sessions and previous speakers. So go to previous speakers and then look for Aaron… First on the list, look at that. Mother optimized, mother, parents, father optimize for the SQLBits list…

Erik Darling: they should have just put his name in quotes…

Brent Ozar: Ooh, five ways to write more effective triggers. And then you can download the slide deck or watch the video right there inside the browser.

Erik Darling: I’d watch the video, Aaron’s a handsome feller.

Brent Ozar: Oh does it actually have the camera? It does, wow.

Erik Darling: I think he’s in a kilt too, so it’s like double your pleasure.

Brent Ozar: I would play it now but then we would lose the entire audience and that would be the end of it. Forget watching these people, we’ll just go watch Aaron Bertrand in a kilt.

 

Do I need to stop replication to patch?

Brent Ozar: Let’s see here, Philip says, “Do I need to stop replication to do a cumulative update on either server? My assumption is that it will pick up and restart and reboot without recreating replication.”

Tara Kizer: The answer is definitely no. you’re free to go ahead and install it and it will pick up again. You should not receive any errors from replication. Usually, the errors for replication have to do with somebody manually modifying data on the subscriber and now, say, a delete isn’t working, so it’s causing replication to be in a broken state. So yes, you’re good to go there without doing anything to replication.

Brent Ozar: Am I the only one who heard Tara say you should never get any errors from replication? [crosstalk]

Tara Kizer: Reboot.

 

How do maintenance tasks work in Availability Groups?

Brent Ozar: Richard says, “In an availability group, do the maintenance tasks move across with availability groups? Meaning like updating stats and rebuilding indexes.”

Erik Darling: They sure do…

Tara Kizer: Absolutely not. The only thing that moves over when you have a failover of an availability group is those databases. That group has failed over, only those user databases inside the group has failed over. Nothing else, no system database information, so you’re missing jobs, missing linked servers, you’re missing server level prints. You’re missing a bunch of stuff and so it’s your job, as a DBA who has set up availability groups, to understand these and make sure that other server has a copy of those objects. And you could copy those jobs over and have all the jobs check if they’re a primary replica, and if they are then run the jobs step, the rest of the job step. If they’re not, then do no other work.

So your job is to copy those jobs over and make them AG aware by having an IF statement inside the jobs step to check, am I a primary replica. The alternative solution, because people don’t like having jobs, you know – if you’ve modify a job on one server; you’re going to have to remember to modify the job on the other server. So some companies, they’ve chosen a third server to be their job scheduler where their job steps are pointing to the availability group listener name. So, it is already AG aware, it doesn’t matter which node, primary…

Erik Darling: See I interpreted that question a little bit differently. I thought he meant if the effects of the jobs went across from one to the other. That’s why I said sure do, because I was – you’ve been in the spot where you’ve had to turn off index rebuilds altogether with an AG [crosstalk]…

Tara Kizer: My client this week, I’m telling them to do that too. [crosstalk] I mean, they’re having tons of issues with rebuilding indexes and they have an availability group.

 

What’s the best way to put 6,000 databases in an AG?

Erik Darling: Speaking of tons of issues with an availability group, I see a question from a feller named Jeff.

Brent Ozar: I was holding Jeff’s one until the last, but alright, let’s tackle it. Jeff says – and it sounds like Erik is excited to do this as well. Jeff says, “What is the best way to do Always On availability groups with an instance with 6000-plus databases?”

Erik Darling: With log shipping.

Tara Kizer: Do not go there. Do not use availability groups for this many databases.

Brent Ozar: Talk about the problems with it.

Erik Darling: Which one, I’ll start with thread pool – oh I didn’t know who you were pointing at…

Brent Ozar: No, you were excited for this question…

Erik Darling: You had a non-deterministic point, I couldn’t help it, I didn’t know where that was going. So the main problem that you’re going to run into there is keeping all those databases synced up, because the processor that you would need, to support 6000 threads, I don’t think has been invented yet. And you’re going to run into all sorts of issues with thread pool and keeping all those databases in sync is going to require worker threads to work on that, and then you’re going to have nothing for user queries to run. So, that’s going to be tremendously painful, you’re looking at breaking that out into probably at least six to eight servers before you even have a manageable amount of thread usage.

Brent Ozar: The worker thread count is tied into the number of cores in your processor, but you could also override it. There’s a setting called max worker threads, why wouldn’t you just change that?

Erik Darling: Because when worker threads are constantly context switching like that, you can run into performance issues waiting for a worker thread to pick back up on a query or pick back up on synchronizing a database. So you could lose synchronization process or it could like time out, or you could lose – queries would degrade performance OIs waiting to get back on these treads that are spread all around.

Brent Ozar: To learn more about that, there’s a great post by Warwick Rudd on – it used to be Simple Talk – Always On availability groups, what not to do when adding databases. And he goes in detail into the worker thread exhaustion issue. So you said log shipping instead, so what would drive you towards log shipping and why?

Erik Darling: the fact that we wrote a couple of stored procedures that do log shipping pretty well at scale like that. So, I mean, just for log shipping in general, it would just be that it takes a lot less from CPU usage and from a performance point of view to back up logs than it does to constantly keep data flowing from one database to another. So you could have – even if you just used standard log shipping with log backups, you know, sort of paste, not like every job starting every five minutes, because that might cause you some issues as well, but if you, you know, spackle the jobs so that they kind of start at different times, you could have pretty good luck getting log shipping to work in a reasonable manner.

I mean, 6000 databases is even pushing it with log shipping for me, but AGs is just off the table immediately.

Tara Kizer: I’d use log shipping for disaster recovery and then I’d use a failover cluster instance for the HA portion.

Erik Darling: the old FCI and log shipping…

Tara Kizer: Yeah, but all they’re using the AG is for HA then I would not use log shipping. I’d use log shipping for DR.

Erik Darling: Yeah.

Brent Ozar: We joke around all the time that a lot of times clients will come to us and want advice on high availability in disaster recover, and sometimes we feel bad saying failover clustering and log shipping, because it’s such a good default answer. It just works for so many things, here you go, failover clustering and log shipping, here you go, failover clustering and log shipping… But it works really well. At 6000 databases, there is something you still need to be aware of; that’s startup time. Startup time can take a long time to make all of the databases available. There’s another great post, and I can’t remember who this one’s by and it will take me forever to find it, but it’s the order in which databases start up. It’s based on the database ID. You can easily run into a scenario where the first hundreds of databases are online but the last hundreds are not.

When I have the page up for sp_allnightlog, just because this is what Erik and I have been working on for the last couple of weeks for a client gig, and I have been continuously singing Lionel Richie for like the last month. Just every time I pop open this code…

Richie Rump: So do you owe him royalty rights for naming this all night log?

Brent Ozar: I can only imagine, especially every time I’ve sung it. He’d probably pay me to stop singing it at this point.

Richie Rump: I just still see Chris Tucker and Fifth Element, yeah, I go right there.

Erik Darling: Yeah, that was me as well [crosstalk].

 

Should I directly query a 3rd party database, or use their API?

Brent Ozar: Samantha says, “Our developers want data from a vendor-supplied app.” So it’s a third party app that they’re not supposed to touch and they get to it through API calls. “In the spirit of collaboration, I let them know we could do availability groups with a read only replica. This way, they won’t have to wait on the vendor to write new APIs, we can just query their database directly. However, they want to write indexes for their own lookups. Should I do transactional replication instead of an availability group? Should I even do this at all?”

Tara Kizer: I like her thought process. I mean, moving those queries over to another server so you’re not impacting your OLTP database, where all the important stuff is occurring. But if they need very specific indexes for their queries that you don’t want on that production database, you can’t use availability groups. As you know, it’s a copy of your database. But then you’re stuck with transactional replication, and I don’t like supporting transactional replication. You support it for so long and the troubleshooting is ridiculous and it takes a long time to set up on a sizable database any time you have to reinitialize it, for whatever reason.

Brent Ozar: Richie, go ahead because I think I know where you’re going and, Richie, I’m with you.

Richie Rump: Okay, so you have this API that sits in front of a database which the vendor manages. Then you’re going to want to get behind that database and write your own queries. So what happens when the new version comes out, they modify the database and now your application goes splat, because you’re reading stuff that’s not there or stuff that’s been added and now your data has changed and you haven’t realized it because they’ve changed the key in the background or something crazy like that.

If you’re willing to jump into that, just know that you’re going to have some maintenance stuff that’s going to happen with a new release, and you’re not going to know what that is because the vendors kept it hidden because it’s in a database and they have to give you the API which they publish. And they’re probably not going to give you a schema, which they publish, which they probably won’t publish to you. So it’s tough. If I had to do this, I probably would put my own API in front of it and manage it that way as opposed to the developers having access directly to the database, and now I have all these queries going around touching the database, as opposed to one central point where all you do is change the API and it’s done.

Erik Darling: Cache it in Mungo, as they say.

Richie Rump: Or elastic search…

Brent Ozar: Instant technical debt.

 

Fast answers to several questions

Brent Ozar: Then we’ve got like six seconds, I’m going to rip through the answers really fast just so that we can nail them all before we bail out. Let’s see here, Jeff says, “Don’t only active AG replications take worker threads?” That’s true, but say that a secondary goes down or restarts and you’ve been doing index rebuilds over on the primary; immediately all of that data is going to try flow through to the secondary, and immediately the primary falls over.

Brent Ozar: Michael says, “Any gotchas with setting up an on-prem linked server to an Azure SQL database?” Yes, security. Gotcha on that one, that one’s a little tricky.

Brent Ozar: Oh there was another one that was epic. “What about moving those 6000 databases to Azure SQLDB?” Cha-ching, hello. Just remember that you can buy elastic pools of databases and then provision capacity by the pools, but there’s no pool for 6000 databases. So now you’re going to have to start micromanaging across all of those. You may save money on staffing, but you may end up spending a whole lot more money on capacity planning and performance management as well.

Erik darling: That goes the same for RDS as well, just because RDS has a 30 database per server limit. So what’s 6000 divided by 30?

Richie Rump: A lot.

Erik Darling: I’m not a math guy, so I don’t have an answer.

Brent Ozar: 200?

Erik Darling: 2000 maybe? I don’t know.

Brent Ozar: Alright, well thanks everybody for hanging out with us and we will see you guys next week on Office Hours. Adios.


Sliding Scale Parallelism: Why Stop At DOP?

SQL Server
11 Comments

This is a thought experiment

Just fair warning, you’re probably not going to learn anything new here. This also isn’t hidden in a CTP of SQL Server 2017, and as far as I know, it’s not a consideration outside of this post. Seriously, lawyers, I know nothing.

I was just thinking (while sober, mind you), in a what-if scenario, parallelism settings weren’t so two dimensional. When you hit X you get X. I know, that’s vastly oversimplified, and there’s all sorts of stuff to figure out DOP and parallel query placement. That’s fine.

But let’s say the knobs you had to turn were

  • MAXDOP — duh
  • Cost Threshold for Parallelism — duh again
  • Starting DOP — The DOP a query would get when it breaks CTFP
  • DOP increment — The number of cores to add per CTFP increment
  • CTFP increment — The cost increment at which more DOP is assigned

I know, it sounds weird, but bear with me.

What if you set CTFP to 50, and you’re cool with a query that costs 50 Query Bucks going parallel, but you don’t want it chewing up DOP 8.

Granted, these costs are estimates, and the estimates are wonky as all get-out sometimes.

But we’re already basing the decision to go parallel on wonky costing. Why not have more control over it?

Why not say, at CTFP 80 you can have 4 DOP, at 120 you can have 6 DOP, and for anything over 200 you can have 8 DOP?

It’s certainly an interesting conversation, especially with SQL Server licensing.

What’s licensing got to do with it?

When you opt in to Enterprise Edition, you pay $7k per core. Most people out there do the logical thing, and buy the smallest number of the fastest cores they can to support their workload.

When you start running AGs, or bringing on users, thread count becomes more of a consideration.

You don’t get infinite threads. When you run out of them, you run into THREADPOOL. This is not the idyllic swimming pool at a Club Med. This is the pool you get stuck in right before ending up in a PSA for gasoline huffing.

Looking at the chart for Max Worker Threads: at 32 cores (that’s about 225k in licensing bucks), you only get 960 threads. That means you can run 120 simultaneous parallel operations. Not 120 simultaneous queries, 120 simultaneous operations. Remember, a parallel query can have mulitple branches, and each of those branches will get DOP threads. So if you have a query that does four joins, it could get DOP 8 * 4 threads. That’s 32 threads.

One way of controlling thread consumption is tuning parallelism settings. This would be a couple additional knobs to turn for folks running into issues without sacrificing performance for more costly queries, and without having to go through code and add in MAXDOP hints. After all, if you run into a parameter sniffing issue in a stored procedure, the cost you get for one query could be terribly, wildly different from another query. Those MAXDOP hints could come back to haunt you.

This opens up a more complicated can of worms: is DOP influencing cost estimation now a recompile reason?

It used to be.

Stay Adaptive, Pony Boy

With SQL Server’s new Adaptive Join technology, this could be another consideration along with join type. Rows are a part of cost estimation, and if we’re deciding join types on the fly based on row thresholds, certainly we could apply additional DOP logic as well.

Thanks for reading!

P.S. Oracle doesn’t have this, so Microsoft could totally one up them here to make up for the fact that they don’t have a cool floating super fast yacht.


What I Love About the Stack Overflow Database

Development
8 Comments

I’ve been using the Stack Overflow public database for my performance tuning demos for a few years now, and I just keep finding more reasons to love it.

It’s got a few simple tables with a lot of data. The main ones are Users, Posts (questions & answers), Comments, Votes, and Badges, ranging from 1GB to 90GB (as of 2017/06). The database is small enough that you can work with it on a modern laptop, yet large enough that even simple queries can turn into performance problems. (I hate seeing AdventureWorks demos of “bad queries” that run in less than a second. That simply isn’t what you do for a living, so it’s useless for training.)

The data means something to data professionals and developers. After all, you’ve been using StackOverflow.com for years. It’s kinda fun to query for your own data while we’re doing demos.

The data has real-world data distribution for lots of data types. Take the Users table, for example. There’s datetimes, integers, short strings (DisplayName and Location), and long strings (AboutMe), all with real-world representative data. There’s even gotchas with Unicode, nulls, and data you can’t trust (like Age has some surprises.)

The Users table (click to zoom)

This is so much better than expanding small databases artificially, like scripts that try to expand AdventureWorks to a much larger size. This is real data with real data distributions – with plenty of opportunities for parameter sniffing examples.

The tables have easy-to-understand relationships. Every table has an identity Id key, and other tables can point to it. For example, in the Comments table:

  • PostId links to Posts.Id, which is either a question or an answer
  • UserId links to Users.Id, the person who posted the comment

The relationships even give you some fun for complexity: in the Posts table, the ParentId field links back to Posts.Id. See, both questions and answers are stored in the Posts table, so for answers, their ParentId links back to the question’s Posts.Id.

There’s a web front end. When I want to explain how the Users and Badges tables are related to each other, I can simply open my badges page on Stack, talk about how a user can earn badges, and earn them multiple times. I can walk the students through writing the exact query in SQL Server that will produce the results shown on the web page.

There’s a repository of sample queries. At Data.StackExchange.com, people have written thousands of queries against a recently restored copy of Stack Overflow’s databases. They’re real queries written by real people, which means they have real-world performance anti-patterns that are fun to unearth and tune.

It’s licensed under Creative Commons Attribution-ShareAlike. This means you’re allowed to share, copy, redistribute, remix, transform, and build upon the material for any purpose, even commercially as long as you give appropriate credit and share your work under the same license.

This means we can use it in our community training events – like our upcoming 24 Hours of PASS session, Last Season’s Performance Tuning Techniques. Register for it, go get the Stack Overflow database, and you’ll be able to follow along during our demos. We’ll be posting the scripts as a blog post here right at the start of the session. Let’s have some fun!


New Online Course: The Fundamentals of Database Administration

Remember that very first day when you officially became a Database Administrator? Remember how the Senior DBA sat you down and gave you clear, specific instructions on how to handle SQL Server?

No?

You don’t remember that?

That’s because it rarely happens. You just happened to be the one standing closest to the database server when it broke, so you eventually became The Accidental DBA. You didn’t get sent to a training class, and even if you did, it was one of those ones where the instructor just reads out of a book – but they’ve never actually been a DBA.

In this class, we’re gathering some of our favorite fundamental material – things we wished we’d have been taught when we first started monkeying around with SQL Server. DBA subscribers can get started now, and we’ve got more videos on the way in this class too.


New 2-day Class: Performance Tuning By Example

One of my most popular sessions is Watch Brent Tune Queries. I’ve updated it every year, tuning different queries, showing new tools like SentryOne’s Plan Explorer, StatisticsParser.com, SSMS’s plan comparison, and SSMS 2017’s new scenario evaluation tool. One hour is never enough – there’s so many tricks and tips I want to show in action, walking people through how I use them in a tuning scenario.

“Wait a minute, I have to actually do the work? What kind of class is this?”

So I’ve got a new online class: Performance Tuning By Example. The 2-day class consists of four lab scenarios:

  1. Finding the SQL Server’s bottleneck, and the queries causing it
  2. Fixing a slow SQL Server only using indexes
  3. Fixing a slow SQL Server only by tuning queries, not touching indexes
  4. Fixing both queries and indexes at once

During each scenario, we’ll:

  • Start with a 30-minute lecture explaining the tools and techniques you’ll use in this scenario.
  • You spend one hour working through the lab, figuring out the problem and implementing your solutions. (You get your own cloud VM with the Stack Overflow database, running the scenario workload.
  • Then I share my screen for an hour, working through the exact same problem, explaining what I’m seeing, what I think about it, and what actions I’m taking to fix it.
  • Finally, 30-minute Q&A where you can share your ideas on how you might fix it, and I may take over your desktop to check your work.
  • Then, on to the next lab!

This is not an introductory course. You should already have spent time going through our online performance training videos covering wait stats, index design, statistics, query plan analysis, and tuning, or been to my 4-day Performance Tuning class.

Learn more and register now.


[Video] Office Hours 2017/07/05 (With Transcriptions)

This week, Brent and Tara discuss separating data and logs on separate drives, statistics time, scalar UDFs, licensing, encryption, gathering SQL error logs and agent job history, replication, upcoming training courses from Brent Ozar Unlimited, and what DBA topics they struggle with the most.

Here’s the video on YouTube:

Office Hours Webcast - 2017/07/05

You can register to attend next week’s Office Hours, or subscribe to our podcast to listen on the go.

Enjoy the Podcast?

Don’t miss an episode, subscribe via iTunes, Stitcher or RSS.
Leave us a review in iTunes

Office Hours Webcast – 2017-7-5

 

Should data and logs be separated?

Brent Ozar: Jake says, “Hello, my name is Jake. With solid state disks, should the data and logs be separated on different drives?” Oh boy, this is a fun, passionate religion type of thing. I’ll tackle this one, just because I had a blog post about it. There are two different reasons – Tara’s going to start making the popcorn. There are two different reasons why you would do this: one is for reliability and one is for performance. Now, in terms of reliability, the general logic goes, that if you lose that data array, you can still do a tail of the log backup. So the gotcha there is, this is only true if these are both fairly reliable drives that don’t ever fail. But of course, any array is going to fail sooner or later, it’s just a matter of time, and since each of these is a single point of failure, even if you put in a RAID array, the entire array will fail from time to time.

If you have two single points of failure, either of which can bring the entire SQL Server down, you want to start doing some basic math on how often these things will actually fail and whether you’re making the system more reliable or less reliable.

The ironic thing is, because they’re both single points of failure and say that each one fails once a year, just to make the math simple, by using two sets of arrays or two volumes instead of one, now you’re at two annual failures instead of one. Then, when the data volume fails by itself, people go, well look, because of this extra reliability, now I can do a tail of a log backup. That’s only true if the log array is still up. And if you’d have put your data files on that array, guess what? Your whole server would still be up. You injected this less reliable thing by yourself, so just make sure that you’re actually making the system more reliable instead of less reliable.

The second part is, in terms of performance; I’m going to assume, Jake, that you have a limited number of SSDs. Let’s say that you’ve got eight of them, just to make the finger math easy. If you have eight drives, how many drives are you going to use for the log file and how many are you going to use for the data file? You need to know how to carve those up, and it’s hard to guess when you’re dealing with a brand new SQL Server that you’ve never worked with before, and you are robbing performance from one of those.

Say that, just to keep the math simple again, you take six of the drives for the data array and two of the drives for the log array, is that going to be enough? When you do inserts, updates, and deletes, SQL Server has to harden that to the transaction log file immediately. It writes the data pages later asynchronously in the background. Instead of having eight drives to shove transaction log data down onto, now you’ve only got two. Are you sure that that’s better? And, are you sure you’re even facing transaction log bottlenecks? Is write log even a problem for your SQL Server?

The problem is that sketching all of this out is very hard and takes some time and is very easy to get wrong. I’ve seen people go in and say, well I need six drives for the data files and two drives for the log files, and then not have enough performance for, say, the log files. Once you’ve baked this thing, in order to get it right, you have to dump it and start again, say dump your arrays and redo them as, say, four and four or five and three. So if you’re just talking about general purpose stuff and you have a fixed number of drives, it’s a lot harder than it looks. What I’d back up and say, what’s the problem you’re trying to solve? Like, what’s the server’s primary wait type, and then go from there.

 

How long can I leave AG replicas suspended?

Brent Ozar: Gordon says, “Is there a limit on how long I can leave databases in an AG secondary replica suspended after which they won’t be able to re-sync?” Oh, that’s a great question.

Tara Kizer: That’s dependent upon your transaction log. All that stuff is being saved on the primary replica’s transaction log and it’s waiting for that other secondary replica to become available so it can send it over. So, if you have unlimited space on the drive where this transaction log is, then it could be indefinite, but no one has unlimited space. So it just depends on your volume of records going into the transaction log and how big is that transaction log file, how often are you doing transaction log backups – well that won’t even matter for this, since you can’t clear the stuff out – and then how much space you have on the driver mount point where the transaction log exists, if you have auto growth in place.

I’ve had systems where we’ve had planned maintenance at, say, the disaster recovery site, and the site was supposed to be down for four hours. I knew most of my systems would be okay, I knew one system wouldn’t. and sometimes they would, say, be down for like a day or two because they would do major maintenance out of the DR side, because, you know, it’s DR and it’s not production, and I would have to kill – it would be mirroring at the time, but same thing; it’s the same technology basically for this aspect of it. It just depends, it just depends on your system.

Brent Ozar: And it’s not just the primary’s log file either, it’s all the replica’s log files. So if you have several replicas in your data center, they are all growing because SQL Server doesn’t know which one’s going to be the primary at the time that that replica comes back up.

Tara Kizer: True.

Brent Ozar: I actually say – in the Senior DBA Class, I say whenever you lose a replica, set an alarm, like a timer for every 30 minutes, and go back and look at your log files to see how much space you have left.

 

How should I gather error log & Agent job history?

Brent Ozar: Richard says, “If the database administrators are not allowed to create databases or jobs on the server because the vendor doesn’t allow it, what’s the best way to gather SQL error log and agent job history?” That’s interesting…

Tara Kizer: Well, I mean, can’t create jobs, but do you have access to task scheduler on the box, maybe not? Any server that you actually have control over has a Windows scheduler, task scheduler you could use, and you could reach out to that SQL Server via a linked server to gather this data.

Brent Ozar: Yeah, I like that. Think of it the same way a monitoring server would; like monitoring servers are c# apps, Java, whatever. They’re just connecting to SQL Server and running queries.

Tara Kizer: Yeah, there’s probably some PowerShell command let that you could use to collect this data already.

Brent Ozar: Oh there probably is. Probably – DBAtools.io probably has a script to go grab that stuff, that’s a good point. And the same techniques that monitoring vendors use too like SentryOne. I remember when I first looked at them, they didn’t even create a database or any stored procs or anything. They did everything in tempdb just so that they could go query stuff. I should – I have so many bad ideas. I’m like, then another thing… Because there’s not a lot of questions in the queue, so I’m like alright, here’s another thing that you could do is you could create – just open a session in tempdb and just WHILE one equals one, wait for 30 seconds, go get stuff and then send an email or an alert whenever something happens. That’s so duct taped; don’t ever tell your friends that I said that.

 

Are there new training videos coming?

Brent Ozar: So James asks, “Are there new training videos coming on BrentOzar.com? If so, when? I love the training courses.” As a matter of fact, yes. Erik just recorded a bunch of DBA fundamentals classes. So the DBA subscribers have new ones coming on CPU settings, which is like cost threshold, MAXDOP, priority boost, tempdb, CHECKDB, how you do CHECKDB if you have a database administrator, don’t have a database administrator and like a dozen more all together. So the first set will drop next Tuesday, and those are for people who are either on the enterprise plan or the DBA subscriptions plan.

On the performance side, I’m working on a set on how you run our scripts, sp_Blitzindex, BlitzCache, et cetera. Those should be out in August. Also, a brand new two-day class coming, Performance Tuning by Example, were you get a cloud VM and then I walk through a lecture for 30 minutes. Here’s the problem that you’re going to be facing or here’s a symptom that users are complaining about, you go jump into your cloud VM, you troubleshoot what the problem is, write the fix – you have one hour to do that, and then you jump back together with me for an hour and you watch me solve the exact same problem. Then we get to do Q&A about how our solutions worked. So that’s a new two-day class, and I’m doing that in August and September.

 

What does SET STATISTICS TIME include?

Brent Ozar: Forest says he’s “Looking at set statistics time. Is CPU time only the amount of time that threads were in a running state, or does it ever include any sorts of waits?” Oh, Tara, do you ever use statistics time?

Tara Kizer: Yeah, all the time. Anytime I’m looking to get set stats I/O on, I always have a comma time-wise. It’s just my habit for typing. I don’t usually need the time information, but that’s just what I type, and so I end up getting both.

Brent Ozar: Yeah, it is only going to be the time that is spent actually doing work. The easy way to test this is to just do a wait for, you know, run wait for 30 seconds and then at the end of it, of statistics time, you’ll see a relatively low amount of CPU time, even though there were wait for waits. He says, “How does virtualization affect this?” It doesn’t affect waits versus CPU burn, but virtualization can affect, if the VM is ready to consume cores or ready to consume CPU time, that the hypervisor isn’t ready to provide that CPU load, either because another guest is burning it or because the clocks have gotten far enough out of sync.

 

What DBA topic do you struggle with?

Brent Ozar: John says, “What DBA topic do you struggle with the most?” Tara, how about you?

Tara Kizer: I don’t know, staying court I guess, you know, keeping up to date with newer versions when you’re at a company that’s not willing to upgrade or, you know, upgrades may be next year. Especially being consultants, we’re dependent upon our own learning and then what version our clients are running. I’ve only had a couple of clients that have had SQL 2016, I don’t know how – 2017 obviously isn’t out yet, but I would imagine the adoption rate is even slower than 2016.

Brent Ozar: Man, for me it was deadlocks, oh my – I’m really good at solving deadlocks once I get the graph, but having to hassle with Extended Events to get a, you know, block process monitor stuff for deadlock or whatever. So I’m like – I had to slay that dragon this last week, so I’m like okay, it’s really not that bad, extended events isn’t terrible once you get in there, but XML sucks. God, I hate XML.

 

Can I denormalize replication subscribers?

Brent Ozar: John says, “If I set up replication for reporting, can I thin add de-normalize tables to my reporting database without affecting replication?”

Tara Kizer: I guess this would be for me.

Brent Ozar: I know the answer too…

Tara Kizer: So if you’re de-normalizing the tables that you’re replicating to, you’re going to have to update the replication stored procedures; like sp_MS, INS, UPD, you know, the actual stored procedures that are used for the inserts, updates, and deletes. So if you’re changing the structure of the destination tables, you do have to modify the replication stored procedures. But you’re free to do whatever you want in that database outside of those objects that you replicate. You could have more indexes, more stored procedures, more tables. But the actual tables that you’re replicating to, SQL Server needs to know what to do with them, so you have to put those in the replication stored procedures.

 

Should I worry about LCK waits on a distributor?

Brent Ozar: Gordon says, “I’m seeing high LCK waits on a server acting purely as a replication distributor. Are these likely to be a problem on the distributor, publisher or subscriber?”

Tara Kizer: Well, if the locks are happening on the distributor then the issue is on the distributor; I mean, the locks aren’t coming from another server. It could be replication blocking some other process. I’ve experienced blocking on the distributor, and in that specific environment, it was due to Informatica. That was what was being used for the data warehouse and Informatica does not – there’s a publication but no subscription, and so it reads the replicated rows out of the distribution database directly rather than subscribing to it. So it would create blocking and we had to – I forget what the solution we had to do was. But other things that can cause blocking in the distributor is the distribution cleanup job. Oftentimes that job can get really far behind and manual cleanup is needed. I’ve spent too many hours manually cleaning up data in there; so that job can also cause issues. So you have to see who’s blocking who. Log sp_whoisactive to a table and you should be able to track it down.

Brent Ozar: yeah, and if you haven’t done that before, Tara’s got a great blog post on how to log sp_whoisactive to a table.

 

Are you presenting at Microsoft Ignite?

Brent Ozar: Michael says, “Are you presenting at the Ignite conference in the fall?” So here’s the deal with the Ignite conference; it’s a really big Microsoft conference. There’s like 30,000 people who attend, and they only have one or two SQL Server tracks. It’s not really big for a SQL Server – so I used to present at TechEd and then Ignite, when they first brought it out, but I’m like, I really want to see more SQL Server stuff, or if I see other tracks, I want to see completely different tracks.

 

Should I get more sockets, or more cores?

Brent Ozar: Next one, “For a data warehouse workload, should we go for more sockets and less cores, or less sockets and more cores?”

Tara Kizer: I don’t know, I mean I don’t know specifically about a data warehouse workload, but I’ve – that’s a question – when working with sysadmins on new hardware that we’re going to be purchasing, that’s something that we’ve struggled with, and I don’t have a good answer. I didn’t even support data warehouse workloads, it was all OLTP…

Brent Ozar: My first thought is just how much memory you can get. Typically servers with more sockets can handle more RAM. With Dell 730s, they can go up to 1.5TB of RAM, if I remember right, but only 1.5 in a two-socket box, and you may want more than that in a data warehouse type environment, especially if you have slow storage.

 

Why are scalar UDFs slower on one server?

Brent Ozar: Eric says – not our Erik but a different Eric. “A user has a query that is using scalar UDFs, and he’s complaining about how the execution time is slower on production versus tests on other databases. Is this unusual? I’ve tried to steer him away from UDFs.”

Tara Kizer: I mean the problem with scalar UDFs is that it’s row-by-row processing. So if he compares the execution time between production and tests, I would bet that test does not have a copy of the production data. I mean, he’s running the same query maybe, maybe the production database has more rows in it. I don’t know if running sp_recompile on a scalar UDF – if that would be helpful or not? But he needs to compare the execution plans of the two systems, just to see if there’s a different execution plan because then you could just recompile, say, the query or the stored procedure. But if we’re specifically talking about the scalar UDFs, I would bet it’s a production size issue versus test.

Brent Ozar: I love that answer because I wouldn’t have come up with that. When you’re looking at the execution plans, know that with a scalar UDF, it won’t show in the actual plan. It will show on the estimated plan, just not the actual plan. The other thing, since scalar UDFs go single threaded, or serial, any slight difference in CPU can make a huge difference in overall run time. So if production is, say, only 1.8g gigahertz CPUs and test is 3 gigahertz CPUs, you can see a dramatic difference there in terms of run time. Man, I hate saying this, but you could blow the plan cache and then run sp_BlitzCache, you know, immediately after the test query ran, and you’ll see the number of time that the function ran. So that will help give you evidence on Tara’s thing that says that production has so much more data than the test systems.

 

Could setting MAXDOP make things worse?

Brent Ozar: J.H. says, “We never set MAXDOP, but recently a specific user has been getting frequent CXPACKET waits. If I set MAXDOP to a value, it could slow down other running queries, what should I do?”

Tara Kizer: For CXPACKET, your first stop shopping is to make sure your MAXDOP is not set to the default and your cost threshold for parallelism is not set to the default. After that you’re talking about looking at your queries, seeing if you can re-factor the queries to make them more efficient, indexes, write indexes in place, maybe table design. But if you’ve never set MAXDOP, you’re not setting your servers to best practices. You do not want to leave it at the default unless you have two processors maybe, or maybe four? I’d say two processors or less, that’s the only time I’d leave it on default value. So you need to set MAXDOP and cost threshold for parallelism, both of those.

Brent Ozar: And is it theoretically possible that setting MAXDOP is going to make some queries go slower?

Tara Kizer: Yeah…

Brent Ozar: Yeah, anything you change can make queries go slower.

 

Any special concerns about SQL Server in AWS?

Brent Ozar: Let’s see, J.H. next asks, “What are things to consider, or how does AWS SQL Server handle jobs, linked servers or any other things when migrating to AWS’s cloud?” Well, there’s two kinds of servers in the cloud in AWS, and really in anybody’s platforms. One is infrastructure as a service, which is just VMs. It’s the same VMs that you know and love dearly. The other one is RDS, or Amazon’s Relational Database Services, where they manage things for you. They manage the backups, they manage the restores, they manage things like linked server configuration. That is entirely different, and you’ll want to hot AWS’s documentation. While you’re in there, if you’re moving to AWS RDS, do a search on our site, or go on our site and click Blog and then Clod Amazon, because Tara recently wrote a blog post about restoring databases on AWS RDS. How did that go? Explain how that experience went.

Tara Kizer: Well I mean, I’ve been a DBA for a very long time; about 20 years at this point, and I’m not used to having to click through a wizard that’s not a management studio or enterprise manager wizard, you know. It’s unusual being a DBA clicking through a website to do a system restore. I mean, it took forever, and this was a smallish VM, but it probably was due to the 200-gigabyte disk that I used. It took like an hour to do a restore and my database was saying – I think it was like 10GB, but I did have a 200GB disk, and so it restores the entire image, not just the database. So if you have multiple databases where you’re talking about all of those being restored – it was just different being a production DBA. I think as a sysadmin, it might be easier to do the cloud type stuff because they’re not used to doing SQL Server work anyway, so you might as well just use these tools.

 

Does SQL Server licensing come in 2s or 4s?

Brent Ozar: So let’s see here. Next up, Ben says, “Can you only license SQL Server in multiples of four, right?” No, it’s weird. The minimum core count that you can license a SQL Server with is four cores, you actually buy them in two-packs. So you need two – and I can’t say this with a straight face – two two-packs in order to license any SQL Server, but then you can go in two-packs from there, so four, six, eight, 10, 12, whatever.

 

Should I separate data & logs on a VM?

Brent Ozar: [Kaushek] says, “Is it still recommended to separate data and log and tempdb files to their own disk drive on a VM, or do you recommend keeping everything on a single drive?” We actually answered this earlier, it was the first question we took. If you want, go to BrentOzar.com and click Blog up top, we recently blogged about this too, in the last week or so.

 

Are SQL Server connections encrypted by default?

Brent Ozar: J.H. says, “Is it correct that since SQL Server 2005 all connections are encrypted by default, and even if SQL Server doesn’t have a real signed certificate, it’s going to use its own self-signed cert?” So this is tricky…

Tara Kizer: Do you actually know this answer?

Brent Ozar: I do know the answer, and the only reason I know the answer is because I sat in Andre Melancia’s GroupBy session on hacking SQL Server.

Hacking SQL Server by André Melancia

So I totally cheated. And he demoed this; you can go to GroupBy.org, this is my open source community conference where SQL Server people, or really any database people, can go to submit any topic. Everything’s free, you get to vote to pick who does sessions on what. So there’s one called Hacking SQL Server by Andre Melancia, and he actually steps through this. The connection is encrypted for your username and password authentication, but as soon as that’s done, the rest of your packets go through unencrypted. So your results go across the wire unencrypted. Andre actually shows a man in the middle attack where he changes data coming back from a SELECT statement. So you SELECT out one thing but the data you see is entirely different because it’s not encrypted by default.

 

I get thousands of deadlocks per day…

Brent Ozar: And the last question we’ll take is from Samantha. Samantha says, “Speaking of deadlocks, I have an application which I’ve been able to chart between 300 and 6000 deadlocks per day…” Great. “It’s always on the same delete statement supplied by multiple polars.” She’s got a database doing about 2000 transactions per second. “I’ve tried to ask them to pull less, but since the end user isn’t noticing any lost data, they don’t see the problem. What should I do?”

Tara Kizer: Are the deadlocks a problem? I mean she just said the users aren’t noticing it, you know. Is there any problem? When I’ve had deadlocks at my last job where users were not noticing a problem because it was all back-end processes that were deadlocking, it was more a performance issue of getting all this work done. And then just nuisance from the DBA team, you know, open up the error logs, as we always did have the deadlock trace flag to post the data to the error log. So that’s not causing any issue, then so what that it’s happening so frequently. Samantha, you said it’s always on the same delete statement. I wonder, is it two delete statements? Is it always the exact same delete statement and its two different polars are sending the exact delete, and are they deleting the same row or maybe it’s locking the page? I would want to know what the other process is doing.

Brent Ozar: yeah, and if the developers have retry logic built into their app, you know, when they get a deadlock they just retry it again, that’s not terrible, it’s not so bad. If it was a 1000-line-long stored procedure where they were doing all kinds of work and right at the end it was the deadlock victim, yeah, the rollbacks would probably suck. But if it’s just one-line deletes, it’s not necessarily so bad. Alright, well thanks everybody for hanging out with us today and we’ll see you guys next week on Office Hours. Adios…


I’m on the new SQLPlayer Podcast

Company News
2 Comments
Me, Damian Widera, and Kamil Nowinski in Wroclaw

On my Poland trip, I sat down with Kamil Nowinski and Damian Widera to talk about how I got started with presenting, why I don’t use PowerShell (but you should), how sp_Blitz ended up on Github, my work/life balance, the Microsoft Certified Master exam, and a lot more.

The whole thing is fully transcribed, so you can either read it or listen to it, whichever you prefer.

Head on over, and if you get the chance, I highly recommend both speaking & attending at SQLday in Poland. It was a ton of fun, and really well-organized. (Also, it turns out I love Polish food!)


sp_AllNightLog: ¿Por que los queues?

Building stuff

I sometimes really hate coming up with ideas, and much prefer someone to just say “I want this” so I can go out into the world and figure out how to do that. Occasionally though, I realize that I’m not going to get all that much direction. That’s what happened with sp_BlitzQueryStore.

It also happened with some of the internals of sp_AllNightLog. Now, this is mostly a wrapper for two already-brilliant stored procedures: sp_DatabaseRestore, and dbo.DatabaseBackup. It has a few different bits of functionality that I’ll write more about, but two bits in particular both work off a queue table: backups and restores. I’m going to explain a little bit how the backup portion works, and why we chose to do things this way.

Prior to this, I’d never used a queue table, or written any code that used queues. I’d always been told (and I think rightly) that queues don’t belong in the database. I joked with Brent about using Service Broker.

He didn’t laugh.

Now that I’m out of the hospital

Here’s why a queue made sense in this scenario

We do:

  • Need backups and restores to happen as soon as possible
  • Add databases frequently
  • Want a mostly static number of Agent jobs

We don’t:

  • Want to rebalance Agent jobs to make sure they all finish within RPO
  • Add Agent jobs if we can’t rebalance
  • Have a job per database

Given the set of needs, a queue was the only thing that makes sense.

The pseudo code looks something like this:

We have 10 ‘workers’ (Agent jobs)
Each job runs in a loop with a WAITFOR to throttle it
When it activates, it checks the queue table for either a database that hasn’t had a log backup in the last 30 seconds (this is configurable), or a database that has never had a log backup

When it finds one, it locks the record (there’s a begin tran, and some locking hints), updates a couple fields: one that will allow other workers to skip over it: is_started, and a date started.
Takes the log backup (this is what calls Ola’s proc)
Marks the backup as completed, mark the backup as not started (sets is_started back to 0), and marks a completion time

There’s some other mesh and lace in there to handle errors and failures. Each job also has a 10 second run schedule behind it, so that if it fails miserably, it will try to restart again pretty quickly.

Now, I know what you’re thinking. Because I said “queue table”, you’re picturing some awful deadlock-ridden hotspot, inserting a row every time something needs to be done.

Non, non, non! We are not the dummies!

cheek2cheek

Every database gets 1 row, and only 1 row. We use Locking Magick® to serialize row access, and a pretty simple query for workers to figure out which database needs a backing up. It’s nifty, I promise. It’s not like those other queue tables that have given you the runaround and broken your heart before. When you create a database, a polling job adds it to the worker table, and it gets picked up the next time a worker activates.

Four queue

There’s a really cool feature in Ola’s procs — @ChangeBackupType. What this will do, is if a backup job comes along to do a diff or log, and it finds that it can’t because there’s never been a full backup, it will change the backup type to a full. It was really important to us that any job where this happened didn’t impede other backups finishing.

That means that if we kept trying to have a set number of jobs, and trying to rebalance them, some number of databases could get held up if we added a large database to the server that had never had a full backup here.

Picture an Agent job responsible for backing up 199 databases. We add another to it, Database200, and it needs a full. If that full takes 5-10 minutes, the other 199 databases don’t have a log backup start until that’s done. That breaks RPO.

Since each worker is only responsible for one backup, and one database at a time, the other 9 workers can still work on other log backups. This gives us a higher degree of safety, and gives us a better shot at staying with RPO.

I hope this clarifies a bit why we chose to use a queue in this scenario.

Thanks for reading!


New Stack Overflow Public Database Available (2017-06)

Stack Overflow
5 Comments

Nick Craver and the kind folks at Stack Overflow publish their data export periodically with your questions, answers, comments, user info, and more. It’s available as an XML data dump, which I then take and import into SQL Server for teaching performance tuning.

You can download the 16GB torrent (magnet), which gives you a series of 7Zip files that you can extract to produce a 118GB SQL Server 2008 database. You can then attach it to any 2008-2017 SQL Server.

Stack Overflow
The place that saves your job

The data goes up to 2017/06/11 and includes:

  • Badges – 23M rows, 1.1GB data
  • Comments – 58.2M rows, 18.5GB data
  • Posts – 36.1M rows, 90GB data, 15.5GB off which is off-row text data. This table holds questions & answers, so the Body NVARCHAR(MAX) field can get pretty big.
  • PostLinks – 4.2M rows, 0.1GB
  • Users – 7.3M rows, 1GB
  • Votes – 128.4M rows, 4.5GB

To learn more:

  • BrentOzar.com/go/querystack – my page about the SQL Server export with more info about how I produce the database.
  • Data.StackExchange.com – a web-based SSMS where you can run your own queries against a recently restored copy of the Stack databases, or run other folks’ queries.
  • Watch Brent Tune Queries – free sessions where I take different queries from Data.StackExchange.com and tune it live.
  • How to Think Like the Engine – free videos where I show the Users table to explain clustered indexes, nonclustered indexes, statistics, sargability, and more.

A Better Way To Select Star

Mindless Self Promotion

I liked writing this blog post so much that I wrote an entire presentation on it. If you’d like to see it at GroupBy, click the link and vote.

Update: You can watch the video for this session here.

Much has been written about this

It’s probably one of the lowest hanging items a performance tuner can deal with.

Don’t need all those columns?

Don’t select them.

But what if you do need them?

You’re left with pretty grim choices.

  1. Make a really wide nonclustered index: (some key columns) include (every other column)
  2. Rearrange your existing clustered index — maybe the wrong key column was chosen to begin with
  3. Create a narrow nonclustered index on just the (some key columns) and then hope that by some stroke of luck, no one ever conjures up a WHERE clause that pushes the optimzer past the Key Lookup tipping point and into the clustered index scan zone

Assuming that you don’t find any of those palatable: I’m 100% with you.

What if there’s another way?

Query Godmother

Using the Stack Overflow database (duh) as an example, let’s check out the Users table.

Not the worst, but…

There are some columns in there I’d be unhappy to index even as includes, especially AboutMe which is a MAX.

Right now, we have this query, and it has a cost of 156.8 Query Buckaroos.

Here’s the CPU and I/O profile from SET STATISTICS TIME, IO ON — I’ve abridged all the I/O output in the post so you don’t have to read that a bunch of things did 0 reads. If it looks a little funny to you, that’s why.

Table 'Posts'. Scan count 7, logical reads 25187
Table 'Users'. Scan count 7, logical reads 80834
Table 'Worktable'. Scan count 0, logical reads 0


SQL Server Execution Times:
CPU time = 4297 ms, elapsed time = 2324 ms.

The thing of it is, we created this nonclustered index

But it gets no use, sort of like my willpower.

Resist temptation!

Let’s pretend we care

This query only returns a couple thousand rows, so you’d think the optimizer would choose a key lookup plan.

A quick check forcing our index leaves us scratching our collective heads — this query has a ‘much’ higher cost, at 275.6 Query Bucks, but finishes much faster.

Here’s the stats output:

Table 'Users'. Scan count 7, logical reads 5479
Table 'Posts'. Scan count 7, logical reads 25279
Table 'Worktable'. Scan count 0, logical reads 0
Table 'Worktable'. Scan count 0, logical reads 0

SQL Server Execution Times:
CPU time = 532 ms, elapsed time = 109 ms.

Here’s the query plan:

Doodles.

Recap so far: When we force our thoughtful nonclustered index, we get a more expensive plan that runs in 100 milliseconds vs 2.3 seconds.

But we hate hints

That key lookup plan might not always be awesome. Like I mentioned, this query only returns a couple thousand rows. If we expand our search, we may not want to do that key lookup a whole bunch of times. If we keep forcing the index, queries that return more rows will necessitate more key lookups, and that can really slow things down.

Study your math, kids

Organics

So how on earth do we get the optimizer to choose our nonclustered index, have it make sense to do so, and not do row-by-row Key Lookups when it shouldn’t?

One option is to use a CTE, or common table expression for those of you who have word quotas to fill.

How do we do? Here are the stats output results:

Table 'Users'. Scan count 7, logical reads 8340
Table 'Posts'. Scan count 7, logical reads 25279
Table 'Workfile'. Scan count 0, logical reads 0
Table 'Worktable'. Scan count 0, logical reads 0
Table 'Worktable'. Scan count 0, logical reads 0

SQL Server Execution Times:
CPU time = 577 ms, elapsed time = 109 ms.

We have very similar metrics to when we force the index with a hint. How does the plan look?

SOME

This could use some explanation, here! Why is this better?

We use two narrow nonclustered indexes to do our early joins and predicate filtering. Even though in the plan for the original query, the predicates on CreationDate and Reputation are easily pushed to the clustered index scan, they aren’t key columns there. That means we read every row in the CX and filter along the way. Using the narrow nonclustered index, read far fewer pages (5.2 million rows vs 283k rows).

The results of this join are passed on and finally joined back to the clustered index on Users. This gives us our display level columns, but only for the rows we need to show you.

We’re not dragging them around all throughout the query. This is a far more efficient use of, well everything.

I know what you’re thinking, though. Didn’t you just replace a Key Lookup with a Nested Loops Join?

YEAH I DID!

But check it out, the Nested Loops Join is a smarty pants, executes 6 times, and grabs about 407 rows per iteration. Remember our Key Lookup executed 1506 times to get 1506 rows. That’s, like… One. One at a time.

Bully for you

This can also be extended to Cross Apply, because of course it can.

This results in the same plan and the same stats output. No need to rehash it all.

Different, but valid

If the logical needs of your query change, it may be simpler to express things with EXISTS, but the same basic idea works.

The stats output is close enough to the CTE and Cross Apply plans. The query plan is a touch different, though. A Hash Match Aggregate on the Id column is inserted after the initial join.

BODY ONCE TOLD ME

Shut up already

While I’m not a fan of SELECT * queries, I realize that they may be necessary sometimes. I mean, why have all those columns if you’re not gonna show’em to anyone?

If you can, try to cut out unnecessary columns from queries. Richie has a good post about doing that with EF over here.

If you can’t, you can always defer the pain of scanning the clustered index until you’ve cut results down to a more reasonable bunch of rows, and you can do it in a way where you don’t have to rely on the optimizer choosing a Key Lookup plan, or forever forcing one where it might not be helpful.

Thanks for reading!


[Video] Office Hours 2017/06/28 (With Transcriptions)

This week, Erik, Tara, and Richie discuss performance and load testing tools, sysadmin update SQL server restart, defragmenting column store indexes, Amazon Redshift, installing Microsoft updates, Always On Availability Groups, Redgate SQL Clone, as well as their thoughts on the current DBA job market.

Here’s the video on YouTube:

Office Hours Webcast - 2017/06/28

You can register to attend next week’s Office Hours, or subscribe to our podcast to listen on the go.

Enjoy the Podcast?

Don’t miss an episode, subscribe via iTunes, Stitcher or RSS.
Leave us a review in iTunes

Office Hours Webcast – 2017-6-28

 

Can you recommend a load testing tool?

Erik Darling: Gordon asks, “can you recommend a performance or load testing tool?” Can I ask a follow-up question? I that free or how much money are you willing to spend on that? Tara, you have some load testing – you don’t have load testing, you used to crack the whip on a whole team of people who did load testing; tell us about that…

Tara Kizer: Yeah, see the last three jobs, I’ve worked for companies that have a dedicated load testing team. So yeah, we spent money on the tools, we used Enterprise level tools. I don’t remember all of the tool names, but one of them was LoadRunner. I have used some free tools on the side when I couldn’t get testing time in the performance environment – Adam Machanic’s SQLQueryStress tools. You know, an easy one for a DBA to generate some load on the system because these tools are complicated. So it’s not like I could just jump in and run a test, I had to wait for the performance team. One of the tools that my last job used was Visual Studio. I’m not exactly sure what they did in there, but definitely one of the applications used Visual Studio for the load testing tool.

Erik Darling: Dell Benchmark Factory is another cool one, but these tools are expensive because that stuff is complicated and you’ve got to set up, you know, reads and writes and all sorts of other stuff going on. You’ve got to turn a whole bunch of weird knobs and – actually, that’s one of those jobs, I think that would be too hard for me.

Tara Kizer: It seems interesting…

Erik Darling: Oh yeah, it seems interesting but I think that I would just get lost in the minutiae of it. Like, it would … have this many users doing this, and this many users doing that; there are too many options there for me. I would need someone to keep me in line, I wouldn’t be able to just do that on my own.

 

What’s a good book for learning OLAP?

Erik Darling: Let’s move on then. Bill asks, “what is a good book or resource for learning OLAP and OLAP cubes?” I have no idea, I’ve never done any such thing. We tend to stick to mostly core SQL engine stuff here. We don’t really do a whole lot with OLAP or analysis services or any of that crazy stuff. So I actually don’t have a good answer there…

Richie Rump: I’ve spent my career avoiding OLAP, so yeah.

Tara Kizer: I once installed analysis services, but that’s about it.

Erik Darling: I’ve never even heard anyone say, oh I read this great book on OLAP cubes recently.

Tara Kizer: There probably are some great resources out there, we’re just not the right people to ask. Maybe post a question on Twitter, use the SQL help hashtag and I’m sure someone could give you an answer there.

 

How do you stop Windows updates from restarting SQL Server?

Erik Darling: Alright Tara, here’s one that’s right up your alley. So Brent gets this right, now it’s my turn to screw it up – Michaela asks, “how do you stop sysadmin windows updates from restarting SQL Server?”

Tara Kizer: Why do I get to answer this, just because my company has had monthly patches?

Erik Darling: [crosstalk] DBA since you were 12, you could answer…

Tara Kizer: No, I did graduate from college before – and I did not go to college early. I did see that, I look – I am older than you, Erik. I’ve worked for companies that had a sysadmin team and we used the System Center SCCM tool to manage all that stuff, and you just uncheck the box if you don’t want it to reboot. Or maybe we would just have it download the updates to the server, and then we would manually install them and reboot them. So it just depended upon the environment; there were some environments that we would have it installed, reboot the whole thing. Some environments, we needed more control of it. Other environments we needed to, you know, make sure the failovers were done correctly because we were using multiple instances on some clusters. I don’t know, as far as – if you’re just using Windows updates and scheduling that way, I’m not sure. SCCM is very, very powerful tool, and expensive again, Enterprise level product.

Erik Darling: Yeah, you might even be able to set something in group policy that would prevent that from happening, but I am not nearly anywhere near as good at group policy as I should be to tell you exactly how to do that.

 

What’s the best practice for maintaining columnstore indexes?

Erik Darling: Let’s see here, Dorian asks, “what are some best practices for defrag and update stats for columnstore indexes?” You know, it’s so new, I don’t think there are best practices yet.

Tara Kizer: I have no idea…

Erik Darling: I do know that if you have [crosstalk]…

Tara Kizer: Do you really need to do a defrag on columnstore indexes?

Erik Darling: Well sometimes, if you have uncompressed row groups or, I believe, if you have deleted records that haven’t been removed from row groups yet, you might need to rebuild them, but I think generally I would tend to treat them like regular indexes and just kind of leave them alone. You know, update stats maybe, if you do a data load, but I think with columnstore the amount of data you’re loading is usually significant enough to, you know, trigger an update stats the next time the query runs.

Dorian, what I would do is I would head over to NikoPort – let me grab that URL for you. Oh look, it’s the first thing that popped up when I typed in Nik…

Tara Kizer: You know you’ve gone to is a few times.

Erik Darling: I go to it once a week when I send people. So Niko Neugebauer is a really smart guy who blogs a ton about columnstore. I’m sure that if you dig through his, now 107 part blog series on columnstore indexes, you will find something about defragmenting.

Tara Kizer: He’s the one that presents that Office Hours? Sorry, not Office Hours, GroupBy?

Erik Darling: Yeah, he did GroupBy, he does PASS and everything else. I mean he is a hit on columnstore.

 

How good is SQL Server’s audit functionality?

Erik Darling: Alright, let’s move that on and see here. Jeez, I’m going to say this is a no, but Wes asks, “have you used audit tracking functionality in SQL Server, what are your views on it?

Tara Kizer: I have not. Anytime I start thinking about auditing, I’m wondering about performance at that point. I mean, yeah people want to audit things, but at the cost of performance?

Erik Darling: yeah, that’s one of those things that I go right to third party vendors for, because they’re usually much more cognizant of hurting performance than Microsoft is sometimes.

 

What’s a good resource to learn COALESCE?

Erik Darling: Robert asks, “do you know a good resource to learn more about the COALESCE function?” Well, I’m curious what your confusion…

Richie Rump: Books Online works pretty well.

Tara Kizer: That’s everything you need to know.

Erik Darling: Yeah, that’s sort of everything you need to know. I wonder what in particular, to you, what’s unclear about the COALESCE function. Is it the difference between that and IS NULL? What is your question about COALESCE? What did you see that led you down this strange road?

 

Can I add another node to a cluster with different storage?

Erik Darling: Let’s see here, Gordon asks, “is it possible to add another node to a cluster but specify different storage?”

Tara Kizer: Well I mean, it needs to be shared storage. You can point to multiple SANs in one cluster, but they both have to look at them.

Erik Darling: I’m going to go out on a limb and say that if you point to different storage, failover is going to be tricky. Can you imagine?

Tara Kizer: Yeah, I mean, in an Availability Group cluster, I mean, when people start talking about nodes and cluster, I’m assuming they’re talking about a failover cluster instance. Those terminologies still exist in availability groups, but I mean…

Erik Darling: Just no one talks about availability groups using that.

Tara Kizer: Exactly, yeah. They still are nodes in a cluster; people just don’t use those terms.

Erik Darling: Until Allan Hirt comes around and spanks them, then we know. Alright, so Gordon, you might want to clarify that a little bit.

 

 

Are there any new training videos coming soon?

Erik Darling: Let’s see here, James asks if, “there are any new training videos coming soon?” Yes… Shocking.

Tara Kizer: Maybe like one million from Erik. He did like 100 in a day or something.

Erik Darling: They were short. 17, 18 videos that are all sort of like more – not like performance, like DBA focused stuff. Like to-do list stuff, like setting up database mail and Remote DAC and how to use things and how to set up index and statistics maintenance with Ola Hallengren. So there’s lots of good stuff in there. But they are shorter videos, they’re not long videos like T-SQL Level Up Two.

 

How big is Brent Ozar Unlimited?

Erik Darling: There are, what, four people who work at Brent Ozar, Michaela asks. How many people working? Well, four plus Erika.

Tara Kizer: She’s part of the company though, right?

Erik Darling: She is, but she’s not as visible as the rest of us.

Tara Kizer: Not as visible, yeah. So five, four visible [crosstalk]

Erik Darling: Hand of the free market.

Tara Kizer: We don’t even get to see her in the company chat room hardly ever. Like I don’t know, twice a month…

Erik Darling: I know, like once in a while she’ll pop in on the Friday before a long weekend and tell everyone to go drink. This is why you’re popular.

Tara Kizer: I love when she does it because I’m three hours behind you guys, so I get off really early.

Erik Darling: I know, it’s like 7am for you. Lucky you, lucky, lucky, lucky you.

 

Should I shut down SQL Server before rebooting?

Erik Darling: Alright, let’s see here, “Deborah asks, “interested in your opinion of rebooting Windows server without first shutting down SQL Server.”

Tara Kizer: I have rebooted hundreds of servers, maybe thousands of server; I don’t ever shut down SQL Server first. This is a planned maintenance, I may talk to the web ops team or whoever it is we’re calling these days and tell them to shut down the applications. That way it can go through a graceful shutdown. But as far as SQL Server, SQL Server in Windows handles that; I’ve never had a problem.

Erik Darling: Yeah, me either. I’ve also just – like, I mean because it goes through the same shutdown process either way.

Tara Kizer: Exactly, the only thing that I would be worried about as far as SQL Server goes is if you have an index maintenance job currently running when you reboot, to make sure you have those jobs – especially I’m talking about the index rebuild job – make sure that occurs at an hour that you will not be rebooting servers manually, because I have, in a maintenance window – we had an index rebuild occurring for a 7TB database and I didn’t know it was running and I restarted SQL Server as part of our planned maintenance… Woo let me tell you, that was painful.

Erik Darling: Yeah, that’s rough. [crosstalk]

Tara Kizer: It was offline for a while. Yeah, I don’t remember exactly what the error log said, but we were watching it go through the different phases and the percentages. I think it was down for, I don’t know, [crosstalk] for like three or four hours or something like that. I mean, you know, there should have been like a two minute failover or whatever, because that would have been a clustered server.

Erik Darling: Should have been.

Tara Kizer: I rescheduled that job until a later time when I would not be awake to do planned restarts.

 

Can I connect to Redshift with SSIS?

Erik Darling: Alright, James asks, “do you guys know anything about Redshift AWS Data Warehouse? Does it connect via SSIS?” Richie, have you done anything with that? I know you’re not an SSIS man, but I know…

Richie Rump: I like Redshift a lot. I do not know if the SSIS connects directly to it. I’m assuming somebody’s working on something with that. Redshift is pretty cool, I dig it a lot, but as far as SSIS connecting to it, yeah, I’m kind of out of the SSIS game, thank god. I don’t want to deal with that anymore.

Erik Darling: Thank god indeed.

Tara Kizer: You mean you’re not going to be taking Andy Leonard’s new SSIS training with us?

Richie Rump: I did Andy Leonard’s training, I’ve done it. Yeah, he was doing SQL Saturday, Atlanta, did a pre-con. And I’m like, whatever – because I was on a contract where I needed to do SSIS, and I’m like whatever’s in that dude’s head needs to be in my head. So my plane landed a little bit late, so I walked in and it had already been going for an hour. And I walk across by Andy and he goes, hey Richie, how you doing? Have a seat.  And I’m like, he knows my name?

Erik Darling: That’s pretty funny. See I would do that training but it’s advanced SSIS training, and I am the least advanced SSIS user you could possibly find. You would have to like grab a fetus to find someone with less SSIS experience.

 

Have you ever seen an Availability Group with…

Erik Darling: Alright, let’s see here. Paul asks a bit of an odd question. “SQL Server 2016 Always On Availability Groups, can you provide a real-world example of a multiple AGs with multiple SQL instances using two or more nodes?”

Tara Kizer: Multiple AGs with multiple SQL… Yes, it wasn’t 2016, but it doesn’t matter. It was 2012 at my last job and we had a 14 node cluster that had, gosh, it would have been 14 SQL Server instances. It was virtual machines and we did one instance per server, and then the AGs spanned at most, back then it was 2012, at most five server, because you could only have four secondary replicas, and the AGs – it was to copy data to DR, it was to copy data to a failover server and it was to copy data to reporting systems. So we had AGs everywhere in this 14 node cluster. I think I’ve answered his question, I’m not quite clear.

Erik Darling: It sounded good to me.

 

Should I install cumulative updates right away?

Erik Darling: Kaushik, I hope I got that right, Kaushik? You got me on that one; my name’s probably tough for you. “Can I install cumulative updates in production as soon as Microsoft releases them, or wait a certain period?”

Tara Kizer: Yes, please do. That way, you can test it for the world, for us, because most of us are going to wait. Depending upon the bug fix, I don’t install the current one unless it’s been out for like a month or three months. I’m not talking – I’m talking the hotfix as a cumulative update. Now that’s my answer for SQL Server. Now for the Windows type stuff, the past three companies, they’ve all been installing those, you know, a few days after they get released, the security patches, the security updates at least. Those are really big in security, but SQL Server, I don’t install them as soon as they come up, that’s for sure, because there are bugs in them. There can be, I should say.

Richie Rump: Yeah, but there’s a difference between a hotfix and your cumulative updates.

Tara Kizer: No, I mean as far as SQL Server, they are the same. So you have your service packs and then you have your CUs, and another word for a CU is a hotfix.

Richie Rump: Okay, because it was my understanding that you had your hotfix and then the CUs bundled all the hotfixes together.

Tara Kizer: I mean, yeah I guess, but you don’t get the hotfixes separately, the hot fixes…

Erik Darling: Unless you’re real special. [crosstalk]

Tara Kizer: Maybe you get something special, yeah. I have done that.

Erik Darling: Yeah, I mean, so my advice there would be: don’t run out and install it immediately in production. This is something you want to roll in dev, proc, QA first, make sure everyone’s, you know – because not only can you run into bugs just from installing the service pack, but there can be bugs within the service pack that are specific to what you’re doing. So it’s something that you want to pretty thoroughly kick the tires on before you, you know, go rolling that up to prod. Another thing is that Microsoft has kind of changed the way stuff gets released, and cumulative updates now are going to be much more common and service packs are going to be much more rare. So you’re going to have to trust in the cumulative updates a little bit more than you used to in the past, but still make sure that you’re testing that, especially if you’re using AGs or anything else that’s an HADR component of SQL Server because stuff can go real wonky.

 

Should I change from FCIs to AGs?

Erik Darling: Let’s see what else is going on… So Tara, let’s talk more about AGs. “I hear you guys referring to Always On…”

Tara Kizer: What’s the name, real quick?

Erik Darling: Chris Adaline. Because the PITA, which we all know is an acronym [crosstalk]…

Richie Rump: Pita bread, that’s what it means…

Erik Darling: Yes, enjoy your humus, sir. “Is it worth looking at? We’re using clustering now but there’s no storage or site redundancy with it.”

Tara Kizer: I like availability groups, I love them, I’ve had great success with failover cluster instances. I was, at the time when we were looking at availability groups, we were using failover cluster instances for high availability, we were using asynchronous database mirroring for disaster recovery and log shipping to the pass as well, and we were also using transactional replication for reporting, the reporting system; so three different features for one system. Add in availability groups and you can get rid of those three features and it provided everything we needed for HADR and reporting.

So I really like them, but it does take clustering knowledge, it’s going to take a lot of your time, troubleshooting is hard. You need to understand what – if you’re going to be running reports on the readable secondary, you need to understand statistics and the indexes are from the, you know, main system. So there’s a lot of things that you need to know and you need to make sure that you have understanding of clustering and quorum and votes that you don’t cause unexpected outages. There are GUIs for availability groups, and so people have set them up and caused production outages because they did not understand them, whereas failover cluster instance, it’s very complex to do, and so you usually have a lot of knowledge when you’ve set up a failover cluster instance.

Erik Darling: Well, I’ve set them up so I don’t know… Maybe right, you may be wrong. I think it really depends on your level of experience and your comfort with SQL Server. If you are very comfortable with failover clusters and you’re not terribly comfortable with SQL Server or – so you haven’t used, like, mirroring or anything of that nature, I would stick with the basics. I would just go with log shipping or something simple that is easy to non-lifetime DBAs to wrap their heads around, Tara… so if you haven’t really done much with availability or stuff like that, I would say keep it simple with log shipping. Mirroring I’m a little bit meh on, just because, you know, it is deprecated, but it’s still pretty cool.

Tara Kizer: Yeah, I would want to know first, what problem are you having with clustering right now that you’re looking to replace? I wasn’t looking to replace failover cluster instance, I was looking to replace transactional replication and it just so happened that it also replaced HADR features for us. So that was my goal.

Erik Darling: So I think another topic worth covering on this, just because it is sort of in the same neighborhood as this, is geo-replicated clusters. Because you can extend a cluster to DR, but that’s…

Tara Kizer: Who the heck does that? I know people do, but who are these corporations that can afford that. When I worked at, you know, my LinkedIn profile’s public… When I worked at Qualcomm, every two years someone would ask if we could do this, and they also want to know about being able to write at both datacenters. Like okay, $2 million, send it over our way and we’ll get this set up.

Erik Darling: No problem, we’ll have it done this weekend.

Tara Kizer: SAN vendors were always extremely excited every time the discussion came up, ooh wow… We can retire early.

 

Have you used Red Gate SQL Clone?

Erik Darling: Oh man, that’s fun stuff. Dorian asks, “have you used Red Gate SQL Clone? What do you think about it in dev and QA environments?” I haven’t touched that one yet.

Tara Kizer: I was a big Redgate fan, they must not have had that when I used to use it.

Erik Darling: Yeah, it’s kind of on the newer side. I really like a whole bunch of Redgate tools but I haven’t used that one.

Tara Kizer: I would imagine it’s great, since everything else from Redgate, and the other companies, they’re great too – the big ones at least.

Richie Rump: Yeah, I was on the beta [crosstalk]; never opened it up.

Tara Kizer: there’s also the new DBCC CLONEDATABASE for SQL 2016 that comes with the product.

Erik Darling: That’s for 2014 now too, that came out in some service pack or CU… Or it may have come out there first actually, I forget. Either way, DBCC CLONEDATABASE is another sort of way to do that, which makes a statistics only copy of your database that you can stick on – I have a blog post up about that with some warnings about security and about not rebuilding indexes, because it will clear out all those stats that you just cloned. Hurray.

 

Is it true that UDFs are single-threaded?

Erik Darling: Alright, let’s see here. We still have so many hours left, oh my God. Paul asks, “is it totally true that UDFs are single threaded and if so, when using a stored procedure, is the whole stored procedure restricted to single thread?” Well yes, it is true – well actually it’s funny, because the UDF can go parallel, but the query that calls it is forced to run serially. That only causes the query that calls it to run serially, that doesn’t cause the entire, what do you call it, stored procedure to run serially, just whatever query it’s a part of. So that’s about that.

 

Who does the character drawings?

Erik Darling: Wes asks who does our character drawings? It’s a fella named Eric Larson, he has a website, Eric Larson artwork – he spells Erik wrong but Larson correctly. That’s who does them, and he’s wonderful. He just drew Tara’s new devil horns character…

Tara Kizer: Yes, I’m very excited about that one.

Erik Darling: I would be too, I need to get me a new character. I need to get some action shots, I’m all…

Tara Kizer: Yeah, you’ve been here long enough, you should have a few by now. I finally got my first. Richie’s always had his action shot since the very beginning, I think.

Erik Darling: yeah right, lucky Richie.

 

Have you used Rubrik?

Erik Darling: Let’s see here, “have you used a tool called Rubrik for SQL backups? If so, do you recommend it? What kind of testing do I need to do before I start using it?” I’ve never heard of that one, anyone?

Tara Kizer: I feel like that one came into our help account too, I think, and the answer was no, we haven’t used it, so didn’t answer it.

Richie Rump: I’ve used a tool called Rubik’s…

Erik Darling: Funny guy over here. So what I would say about that is, I would put it on the same level as any other SQL database backup tool. Make sure that you’re not taking dirty snaps, make sure that it quiesces drive activity, make sure that your testing restores. Also, keep an eye on the error log, because you want to make sure that it’s not – keep an eye on the error log, there will be I/O frozen and I/O thawing messages, and I would keep an eye on the error log to make sure that it’s not taking, you know, more than like a second or a couple of seconds when it goes to quiesce the drives. If you see big gaps in there where I/O is frozen for databases and then, like you know, 10, 20, 30 – I’ve seen up to two minutes later I/O gets thawed for a drive; you have a problem. You don’t know if it’s part of the VSS snapshotting process or if it’s part of the data copying process, but that’s what I would do.

Tara Kizer: I had a recent client, I forget what product they were using, but it was doing the – it was quiescing, freezing the I/O, unfreezing it. I think that only took, I say only, but let’s just say five seconds, but it would cause such a slowdown for the rest of the day because that system had such high write throughput that it just started gaining like a backlog. You could see the system performing fine, and then once it happened it would take, you know, an hour before the system was running well again. They were trying to do that process very frequently. I know Microsoft recommends not to do it any more frequent than every four hours.

Erik Darling: Oh yeah, because – was that the write log people or was that…

Tara Kizer: Something like that, maybe.

Erik Darling: Because I remember hearing about the write log things [crosstalk] that’s brutal. So I just got a notification code. [crosstalk]

 

Have you used 2016’s data masking?

Erik Darling: Alright, let’s see, what do we have next here? “What do you think about the new SQL Server 2016 data masking and Informatica dynamic data masking?” I’ve never touched…

Tara Kizer: They say Informatica, then I’m out. I’ve supported [crosstalk] Informatica for SQL Server and it was not fun.

Erik Darling: It’s funny, when I first started working here, a buddy of mine was working at Harvard doing a  whole bunch of ETL with Informatica tools, and he was like yeah, can you do any testing on that? I was like, no. [crosstalk]

Tara Kizer: And the way Informatica collects the data on SQL Server, it adds a transaction replication publication, no subscription. So Informatica connects directly to the distribution database and figures out what data it needs. So that took a while to get used to. It just caused all sorts of problems for us.

 

How’s the current DBA job market?

Erik Darling: Alright, let’s see here. Ben asks an interesting question; at least I think it’s interesting. “What are your thoughts on the current DBA job market? What are the best places to find opportunities? Is there a particular website you like or where do you g when you need to find a job?”

Tara Kizer: I don’t know, I don’t know what I would do these days, but I know from LinkedIn I get a lot of job opportunity emails. Not just recruiters, but LinkedIn automatically sending out messages, and at least for the San Diego market, it has slowed down as far as the number of database positions out there. I just look at the list real quick and these days it’s been sysadmin type stuff that’s coming through, whereas about a year ago, there were several database administrator jobs here as well as database engineer, database developer type. I don’t know what the markets are across the rest of the cities of the United States and stuff, but it definitely has changed here, and it was a booming market here about a year ago.

Erik Darling: Not so booming anymore?

Tara Kizer: I mean, just based upon LinkedIn notifications and then the less recruiters sending me emails. I get a kick out of those emails. I mean, I post them in the company chat room to get a kick out of them too. It’s interesting information.

Richie Rump: [crosstalk] Kind of developer stuff…

Tara Kizer: You are a developer…

Richie Rump: I know, but it just seems that developers – it just hasn’t stopped…

Tara Kizer: Always, gotcha. [crosstalk] there are more developers than there are DBAs at a company, usually. I mean, there’s usually – large companies, a heck of a lot more developers than there are DBAs.

Erik Darling: It’s like a wise man once said, the world needs ditch diggers too.

Richie Rump: Yes.

Erik Darling: And with that, we have reached our time limit and the end of the questions. Thank you so much for joining us, we will see you again next week. See you then.


Why Is Tracking Restores So Hard?

Phantom of the Nopera

Let’s say you have a server. Let’s go one step further and say SQL Server is running on it. Crazy, I know.

But hey, we specialize in crazy, here.

If you want to track CREATE, ALTER, or DROP DATABASE commands, that’s easy enough with DDL triggers.

But guess what? None of those fire when you RESTORE a database. Weird, right?

Other triggers?

Astute SQL-ers may attempt to add a trigger to the restorehistory table over in msdb. It’s in the dbo schema, which might make you hopeful. We all know triggers in that pesky sys schema don’t do a darn thing.

You guessed it, restores get tracked there. So there’s, like, something inside SQL telling it when a restore happens.

Guess what, though? A trigger on that table won’t fire. Not FOR INSERT, not AFTER INSERT, and not nothin’ in between.

Extended Events

There’s even an event! The backup_restore_progress_trace tracks this stuff. There’s even some cool information in there.

Restoregasm Addict

Skip a few…

She’s always at it

Exciting stuff, right? But… that’s a lot of extra info. And, wait… How does this help us? How do we take that and fire a notification?

I’m here to assure you, dear readers, that is is possible.

The proof is in this THIRTY-FOUR PAGE PDF.

Nothing against pages, the number 34, PDFs, or Jason Strate, of course. But if you need a 34 page manual for something, the barrier to entry is, well, 34 pages high. I’m bailing on that with a quickness you only see on Maury when someone is not the father.

Polling

That leaves you with rather ugly options. You can write a query that checks in with sys.databases (or even that restore history DMV) to look for new stuff being added. But that means you’re likely firing up an Agent job, or adding application code to do that, and then start doing ~something~ with the new database.

That’s pretty lousy. If you agree (and of course you do, that’s why you’re here!), there’s a Connect Item filed by Aaron Bertrand, that was unfortunately closed almost as soon as it was opened, back in 2010.

Maybe if enough of you kind folks vote on it, it will be reopened.

Thanks for reading!

Brent says: if you read yesterday’s post about the new sp_AllNightLog, you can probably connect the dots. We needed to be able to track when new databases are created, and that was fairly easy, but tracking restores into new database names was much trickier.


Introducing sp_AllNightLog: Log Shipping at Scale, Open Source

In our Faux PaaS project, we need a backup plan – or rather, a restore plan.

On each SQL Server instance, clients can create as many databases as they want, anytime they want, with no human intervention. We need those databases covered by disaster recovery as quickly as practical.

SQL Server’s newer disaster recovery options – Always On Availability Groups and async database mirroring – have a few drawbacks. They require configuration for each new database, and they can hit worker thread exhaustion as you grow to hundreds or thousands of protected databases.

That’s where old-school log shipping comes in.

The basic concept: backing up and restoring databases

Log shipping (Books Online) isn’t just a built-in feature – it’s more of a technique that’s been around forever. In essence:

  • The primary takes a log backup every X minutes
  • The secondary watches that file share, and restores any new log backups that show up

SQL Server has their implementation of log shipping built right into the product, and it’s great – having stood the test of time for over a decade. (When’s the last time you saw a Cumulative Update that fixed a log shipping bug?) Normally, given the choice between reusing Microsoft’s code versus writing my own, I’ll take theirs every time.

Native log shipping isn’t perfect, though. We had a few ambitious requirements that SQL Server’s implementation didn’t quite meet:

  • Zero setup for newly added databases
  • Zero communication between the primary and secondary servers
  • High throughput to keep up with hundreds or thousands of databases, but without hundreds or thousands of jobs running simultaneously
  • Open source the whole thing so you can find our bugs

To make log shipping scale, we built sp_AllNightLog.

On the primary server, multiple Agent jobs take backups simultaneously across lots of databases using Ola Hallengren’s proven DatabaseBackup proc. You control the backup frequency and the number of Agent jobs to balance recoverability against performance overhead.

On the restoring server, same thing, but with sp_DatabaseRestore – multiple Agent jobs watch folders for incoming files, and restore ’em. Just like with regular log shipping, you can control the job schedules, so you can use the servers for read-only reporting workloads during the day, then catch ’em back up to current overnight.

You can install the restore side of it on multiple servers, in multiple locations, too. Here’s a sketch-out of a more advanced implementation similar to one we’re doing for the Faux PaaS project:

Disaster recovery using cloud file storage

To learn more:

I do expect to see lots of bugs in here for now – this hasn’t gone into production anywhere yet, only into our testing labs, so I’m sure we’re going to discover neat stuff over the next few weeks. If you choose to put it in production, I’d highly recommend watching the Github repo (click the Watch icon at the top right when you’re logged in) to get emails as people find bugs.

Erik Darling blogs kCura Relativity
Erik just likes them for the free coffee cups

We’d like to thank kCura for making this open source project possible, and thank Google Compute Engine for helping us lay the foundation with sp_DatabaseRestore. I’m a huge believer in partnering with clients to not just make their own projects happen, but help you, dear reader, by contributing this work back into the open source community.

Wanna work with us on fun projects like this? kCura is hiring a Senior DBA, and it’s a heck of a fun place to work.


First Responder Kit Release: Weather Is Here, Wish You Were Wonderful

ONE TWO THREE FOUR FIVE SIX SEVEN WHO DO WE APPRECIATE?

No one.

Appreciate doesn’t rhyme with seven.

Special thanks to @JohnKNess, @jsetliffe, @rabryst, @gdoddsy and @SOlsonFAC for various levels of contribution. I’ve made a $10,000 donation in each of your names to the Erik Darling Home For Little Wanderers. Very thoughtful of me.

You can download the updated FirstResponderKit.zip here.

sp_Blitz Improvements

  • #878 The Irascible Mr. @JohnKNess fixed some things so maintenance plans and jobs that shrink things and are enabled have a higher priority than jobs that are disabled, taking into account last runtime. Pretty cool! Now we can all point and laugh at your shrinkage.

sp_BlitzCache Improvements

  • #968 Adds detection and warning for Index Spools. Only the Eager kind. Why? Paul White said so. Fun fact: He was in a band called The Lazy Spools in high school. I had the XQuery written to add the check for Eagerness, but then my XML Dad Mikael Eriksson suggested an alternative. So I went with it just because. Special thanks to #SQLNEWBLOGGER Joe Obbish for suggesting reasonable thresholds, here.

sp_BlitzFirst Improvements

  • #928 PLE may go up, and it may go down. We were only looking for one of those directions. Now we’re looking in both of those directions. Thanks to @TaraKizer for pointing that out!

sp_BlitzIndex Improvements

  • #937 If you restarted your server a year ago and created a table a month ago, missing index requests for it may be terribly underrepresented. Like me at AA meetings. Now we take table creation date into account when determining missing index benefit.

sp_BlitzWho Improvements

  • #895 Adds session level wait stats for 2016+ using sys.dm_exec_session_wait_stats to grab the top 3 waits with >5ms time accumulated.
  • #942 Adds a @Debug option to print out dynamic SQL. Helpful if you’re me.
  • #926 Adds tempdb allocations! So you can see how horrible that is now, too.
  • #921 Adds Resource Governor information, for both of you crazy kids out there using Resource Governor. Thanks to @djlaney3 for the idea and code submission!

sp_DatabaseRestore Improvements

  • This underwent a bit of reconstructive surgery to bring it inline with our ‘coding standards’ (read: I had to look busy one day, so I rearranged a bunch of stuff and made sure all our strings were Unicode). No behavior was harmed during this change.

sp_BlitzBackups

  • #906 Fixes an issue with dynamic SQL around the encrypted backup check. Thanks to @jsetliffe
  • #909 Clarified language around backup compression and supported versions. Thanks to @rabryst for showing us the error of our loose-tongued ways.
  • #911 Corrects math to determine if Diffs are big or not. Thanks to @gdoddsy and @SOlsonFAC for teaching me math.

sp_BlitzQueryStore

  • #934 Fixes some weirdness when attempting to search for a specific stored procedure by name
  • #933 We are compatible with Azure! At least according to @dcurlewis. I don’t have Azure, so I can’t check. He seems trustworthy, though.
  • #968 Same as sp_BlitzCache

sp_AllNightLog and sp_AllNightLog_Setup

  • These are new! They’re part of our FauxPaaS project to do our own version of Log Shipping ~AT SCALE~. Since these are net-new, check out the documentation and other posts about them to learn more. I could go on for a month, here.

Install_Core_Blitz and Install-All-Procs

  • We’re trying something new! Because we hate opening multiple .sql files as much as you do (because why in the blue ox would SSMS decide to spawn a new window FOR EVERY SINGLE SCRIPT YOU OPEN??), we made it easy. Now you can just open one big script. We use a couple simple copycommands to do this: copy /b /y sp_Blitz*.sql Install_Core_Blitz.sql and copy /b /y *.sql Install-All-Procs.sql, which you can use and modify to create your own version, omitting whatever scripts you don’t want. Lucky you!

You can download the updated FirstResponderKit.zip here.

Thanks for reading!