This week, Brent and Tara discuss separating data and logs on separate drives, statistics time, scalar UDFs, licensing, encryption, gathering SQL error logs and agent job history, replication, upcoming training courses from Brent Ozar Unlimited, and what DBA topics they struggle with the most.
If you prefer to listen to the audio:
Enjoy the Podcast?
Office Hours Webcast – 2017-7-5
Should data and logs be separated?
Brent Ozar: Jake says, “Hello, my name is Jake. With solid state disks, should the data and logs be separated on different drives?” Oh boy, this is a fun, passionate religion type of thing. I’ll tackle this one, just because I had a blog post about it. There are two different reasons – Tara’s going to start making the popcorn. There are two different reasons why you would do this: one is for reliability and one is for performance. Now, in terms of reliability, the general logic goes, that if you lose that data array, you can still do a tail of the log backup. So the gotcha there is, this is only true if these are both fairly reliable drives that don’t ever fail. But of course, any array is going to fail sooner or later, it’s just a matter of time, and since each of these is a single point of failure, even if you put in a RAID array, the entire array will fail from time to time.
If you have two single points of failure, either of which can bring the entire SQL Server down, you want to start doing some basic math on how often these things will actually fail and whether you’re making the system more reliable or less reliable.
The ironic thing is, because they’re both single points of failure and say that each one fails once a year, just to make the math simple, by using two sets of arrays or two volumes instead of one, now you’re at two annual failures instead of one. Then, when the data volume fails by itself, people go, well look, because of this extra reliability, now I can do a tail of a log backup. That’s only true if the log array is still up. And if you’d have put your data files on that array, guess what? Your whole server would still be up. You injected this less reliable thing by yourself, so just make sure that you’re actually making the system more reliable instead of less reliable.
The second part is, in terms of performance; I’m going to assume, Jake, that you have a limited number of SSDs. Let’s say that you’ve got eight of them, just to make the finger math easy. If you have eight drives, how many drives are you going to use for the log file and how many are you going to use for the data file? You need to know how to carve those up, and it’s hard to guess when you’re dealing with a brand new SQL Server that you’ve never worked with before, and you are robbing performance from one of those.
Say that, just to keep the math simple again, you take six of the drives for the data array and two of the drives for the log array, is that going to be enough? When you do inserts, updates, and deletes, SQL Server has to harden that to the transaction log file immediately. It writes the data pages later asynchronously in the background. Instead of having eight drives to shove transaction log data down onto, now you’ve only got two. Are you sure that that’s better? And, are you sure you’re even facing transaction log bottlenecks? Is write log even a problem for your SQL Server?
The problem is that sketching all of this out is very hard and takes some time and is very easy to get wrong. I’ve seen people go in and say, well I need six drives for the data files and two drives for the log files, and then not have enough performance for, say, the log files. Once you’ve baked this thing, in order to get it right, you have to dump it and start again, say dump your arrays and redo them as, say, four and four or five and three. So if you’re just talking about general purpose stuff and you have a fixed number of drives, it’s a lot harder than it looks. What I’d back up and say, what’s the problem you’re trying to solve? Like, what’s the server’s primary wait type, and then go from there.
How long can I leave AG replicas suspended?
Brent Ozar: Gordon says, “Is there a limit on how long I can leave databases in an AG secondary replica suspended after which they won’t be able to re-sync?” Oh, that’s a great question.
Tara Kizer: That’s dependent upon your transaction log. All that stuff is being saved on the primary replica’s transaction log and it’s waiting for that other secondary replica to become available so it can send it over. So, if you have unlimited space on the drive where this transaction log is, then it could be indefinite, but no one has unlimited space. So it just depends on your volume of records going into the transaction log and how big is that transaction log file, how often are you doing transaction log backups – well that won’t even matter for this, since you can’t clear the stuff out – and then how much space you have on the driver mount point where the transaction log exists, if you have auto growth in place.
I’ve had systems where we’ve had planned maintenance at, say, the disaster recovery site, and the site was supposed to be down for four hours. I knew most of my systems would be okay, I knew one system wouldn’t. and sometimes they would, say, be down for like a day or two because they would do major maintenance out of the DR side, because, you know, it’s DR and it’s not production, and I would have to kill – it would be mirroring at the time, but same thing; it’s the same technology basically for this aspect of it. It just depends, it just depends on your system.
Brent Ozar: And it’s not just the primary’s log file either, it’s all the replica’s log files. So if you have several replicas in your data center, they are all growing because SQL Server doesn’t know which one’s going to be the primary at the time that that replica comes back up.
Tara Kizer: True.
Brent Ozar: I actually say – in the Senior DBA Class, I say whenever you lose a replica, set an alarm, like a timer for every 30 minutes, and go back and look at your log files to see how much space you have left.
How should I gather error log & Agent job history?
Brent Ozar: Richard says, “If the database administrators are not allowed to create databases or jobs on the server because the vendor doesn’t allow it, what’s the best way to gather SQL error log and agent job history?” That’s interesting…
Tara Kizer: Well, I mean, can’t create jobs, but do you have access to task scheduler on the box, maybe not? Any server that you actually have control over has a Windows scheduler, task scheduler you could use, and you could reach out to that SQL Server via a linked server to gather this data.
Brent Ozar: Yeah, I like that. Think of it the same way a monitoring server would; like monitoring servers are c# apps, Java, whatever. They’re just connecting to SQL Server and running queries.
Tara Kizer: Yeah, there’s probably some PowerShell command let that you could use to collect this data already.
Brent Ozar: Oh there probably is. Probably – DBAtools.io probably has a script to go grab that stuff, that’s a good point. And the same techniques that monitoring vendors use too like SentryOne. I remember when I first looked at them, they didn’t even create a database or any stored procs or anything. They did everything in tempdb just so that they could go query stuff. I should – I have so many bad ideas. I’m like, then another thing… Because there’s not a lot of questions in the queue, so I’m like alright, here’s another thing that you could do is you could create – just open a session in tempdb and just WHILE one equals one, wait for 30 seconds, go get stuff and then send an email or an alert whenever something happens. That’s so duct taped; don’t ever tell your friends that I said that.
Are there new training videos coming?
Brent Ozar: So James asks, “Are there new training videos coming on BrentOzar.com? If so, when? I love the training courses.” As a matter of fact, yes. Erik just recorded a bunch of DBA fundamentals classes. So the DBA subscribers have new ones coming on CPU settings, which is like cost threshold, MAXDOP, priority boost, tempdb, CHECKDB, how you do CHECKDB if you have a database administrator, don’t have a database administrator and like a dozen more all together. So the first set will drop next Tuesday, and those are for people who are either on the enterprise plan or the DBA subscriptions plan.
On the performance side, I’m working on a set on how you run our scripts, sp_Blitzindex, BlitzCache, et cetera. Those should be out in August. Also, a brand new two-day class coming, Performance Tuning by Example, were you get a cloud VM and then I walk through a lecture for 30 minutes. Here’s the problem that you’re going to be facing or here’s a symptom that users are complaining about, you go jump into your cloud VM, you troubleshoot what the problem is, write the fix – you have one hour to do that, and then you jump back together with me for an hour and you watch me solve the exact same problem. Then we get to do Q&A about how our solutions worked. So that’s a new two-day class, and I’m doing that in August and September.
What does SET STATISTICS TIME include?
Brent Ozar: Forest says he’s “Looking at set statistics time. Is CPU time only the amount of time that threads were in a running state, or does it ever include any sorts of waits?” Oh, Tara, do you ever use statistics time?
Tara Kizer: Yeah, all the time. Anytime I’m looking to get set stats I/O on, I always have a comma time-wise. It’s just my habit for typing. I don’t usually need the time information, but that’s just what I type, and so I end up getting both.
Brent Ozar: Yeah, it is only going to be the time that is spent actually doing work. The easy way to test this is to just do a wait for, you know, run wait for 30 seconds and then at the end of it, of statistics time, you’ll see a relatively low amount of CPU time, even though there were wait for waits. He says, “How does virtualization affect this?” It doesn’t affect waits versus CPU burn, but virtualization can affect, if the VM is ready to consume cores or ready to consume CPU time, that the hypervisor isn’t ready to provide that CPU load, either because another guest is burning it or because the clocks have gotten far enough out of sync.
What DBA topic do you struggle with?
Brent Ozar: John says, “What DBA topic do you struggle with the most?” Tara, how about you?
Tara Kizer: I don’t know, staying court I guess, you know, keeping up to date with newer versions when you’re at a company that’s not willing to upgrade or, you know, upgrades may be next year. Especially being consultants, we’re dependent upon our own learning and then what version our clients are running. I’ve only had a couple of clients that have had SQL 2016, I don’t know how – 2017 obviously isn’t out yet, but I would imagine the adoption rate is even slower than 2016.
Brent Ozar: Man, for me it was deadlocks, oh my – I’m really good at solving deadlocks once I get the graph, but having to hassle with Extended Events to get a, you know, block process monitor stuff for deadlock or whatever. So I’m like – I had to slay that dragon this last week, so I’m like okay, it’s really not that bad, extended events isn’t terrible once you get in there, but XML sucks. God, I hate XML.
Can I denormalize replication subscribers?
Brent Ozar: John says, “If I set up replication for reporting, can I thin add de-normalize tables to my reporting database without affecting replication?”
Tara Kizer: I guess this would be for me.
Brent Ozar: I know the answer too…
Tara Kizer: So if you’re de-normalizing the tables that you’re replicating to, you’re going to have to update the replication stored procedures; like sp_MS, INS, UPD, you know, the actual stored procedures that are used for the inserts, updates, and deletes. So if you’re changing the structure of the destination tables, you do have to modify the replication stored procedures. But you’re free to do whatever you want in that database outside of those objects that you replicate. You could have more indexes, more stored procedures, more tables. But the actual tables that you’re replicating to, SQL Server needs to know what to do with them, so you have to put those in the replication stored procedures.
Should I worry about LCK waits on a distributor?
Brent Ozar: Gordon says, “I’m seeing high LCK waits on a server acting purely as a replication distributor. Are these likely to be a problem on the distributor, publisher or subscriber?”
Tara Kizer: Well, if the locks are happening on the distributor then the issue is on the distributor; I mean, the locks aren’t coming from another server. It could be replication blocking some other process. I’ve experienced blocking on the distributor, and in that specific environment, it was due to Informatica. That was what was being used for the data warehouse and Informatica does not – there’s a publication but no subscription, and so it reads the replicated rows out of the distribution database directly rather than subscribing to it. So it would create blocking and we had to – I forget what the solution we had to do was. But other things that can cause blocking in the distributor is the distribution cleanup job. Oftentimes that job can get really far behind and manual cleanup is needed. I’ve spent too many hours manually cleaning up data in there; so that job can also cause issues. So you have to see who’s blocking who. Log sp_whoisactive to a table and you should be able to track it down.
Brent Ozar: yeah, and if you haven’t done that before, Tara’s got a great blog post on how to log sp_whoisactive to a table.
Are you presenting at Microsoft Ignite?
Brent Ozar: Michael says, “Are you presenting at the Ignite conference in the fall?” So here’s the deal with the Ignite conference; it’s a really big Microsoft conference. There’s like 30,000 people who attend, and they only have one or two SQL Server tracks. It’s not really big for a SQL Server – so I used to present at TechEd and then Ignite, when they first brought it out, but I’m like, I really want to see more SQL Server stuff, or if I see other tracks, I want to see completely different tracks.
Should I get more sockets, or more cores?
Brent Ozar: Next one, “For a data warehouse workload, should we go for more sockets and less cores, or less sockets and more cores?”
Tara Kizer: I don’t know, I mean I don’t know specifically about a data warehouse workload, but I’ve – that’s a question – when working with sysadmins on new hardware that we’re going to be purchasing, that’s something that we’ve struggled with, and I don’t have a good answer. I didn’t even support data warehouse workloads, it was all OLTP…
Brent Ozar: My first thought is just how much memory you can get. Typically servers with more sockets can handle more RAM. With Dell 730s, they can go up to 1.5TB of RAM, if I remember right, but only 1.5 in a two-socket box, and you may want more than that in a data warehouse type environment, especially if you have slow storage.
Why are scalar UDFs slower on one server?
Brent Ozar: Eric says – not our Erik but a different Eric. “A user has a query that is using scalar UDFs, and he’s complaining about how the execution time is slower on production versus tests on other databases. Is this unusual? I’ve tried to steer him away from UDFs.”
Tara Kizer: I mean the problem with scalar UDFs is that it’s row-by-row processing. So if he compares the execution time between production and tests, I would bet that test does not have a copy of the production data. I mean, he’s running the same query maybe, maybe the production database has more rows in it. I don’t know if running sp_recompile on a scalar UDF – if that would be helpful or not? But he needs to compare the execution plans of the two systems, just to see if there’s a different execution plan because then you could just recompile, say, the query or the stored procedure. But if we’re specifically talking about the scalar UDFs, I would bet it’s a production size issue versus test.
Brent Ozar: I love that answer because I wouldn’t have come up with that. When you’re looking at the execution plans, know that with a scalar UDF, it won’t show in the actual plan. It will show on the estimated plan, just not the actual plan. The other thing, since scalar UDFs go single threaded, or serial, any slight difference in CPU can make a huge difference in overall run time. So if production is, say, only 1.8g gigahertz CPUs and test is 3 gigahertz CPUs, you can see a dramatic difference there in terms of run time. Man, I hate saying this, but you could blow the plan cache and then run sp_BlitzCache, you know, immediately after the test query ran, and you’ll see the number of time that the function ran. So that will help give you evidence on Tara’s thing that says that production has so much more data than the test systems.
Could setting MAXDOP make things worse?
Brent Ozar: J.H. says, “We never set MAXDOP, but recently a specific user has been getting frequent CXPACKET waits. If I set MAXDOP to a value, it could slow down other running queries, what should I do?”
Tara Kizer: For CXPACKET, your first stop shopping is to make sure your MAXDOP is not set to the default and your cost threshold for parallelism is not set to the default. After that you’re talking about looking at your queries, seeing if you can re-factor the queries to make them more efficient, indexes, write indexes in place, maybe table design. But if you’ve never set MAXDOP, you’re not setting your servers to best practices. You do not want to leave it at the default unless you have two processors maybe, or maybe four? I’d say two processors or less, that’s the only time I’d leave it on default value. So you need to set MAXDOP and cost threshold for parallelism, both of those.
Brent Ozar: And is it theoretically possible that setting MAXDOP is going to make some queries go slower?
Tara Kizer: Yeah…
Brent Ozar: Yeah, anything you change can make queries go slower.
Any special concerns about SQL Server in AWS?
Brent Ozar: Let’s see, J.H. next asks, “What are things to consider, or how does AWS SQL Server handle jobs, linked servers or any other things when migrating to AWS’s cloud?” Well, there’s two kinds of servers in the cloud in AWS, and really in anybody’s platforms. One is infrastructure as a service, which is just VMs. It’s the same VMs that you know and love dearly. The other one is RDS, or Amazon’s Relational Database Services, where they manage things for you. They manage the backups, they manage the restores, they manage things like linked server configuration. That is entirely different, and you’ll want to hot AWS’s documentation. While you’re in there, if you’re moving to AWS RDS, do a search on our site, or go on our site and click Blog and then Clod Amazon, because Tara recently wrote a blog post about restoring databases on AWS RDS. How did that go? Explain how that experience went.
Tara Kizer: Well I mean, I’ve been a DBA for a very long time; about 20 years at this point, and I’m not used to having to click through a wizard that’s not a management studio or enterprise manager wizard, you know. It’s unusual being a DBA clicking through a website to do a system restore. I mean, it took forever, and this was a smallish VM, but it probably was due to the 200-gigabyte disk that I used. It took like an hour to do a restore and my database was saying – I think it was like 10GB, but I did have a 200GB disk, and so it restores the entire image, not just the database. So if you have multiple databases where you’re talking about all of those being restored – it was just different being a production DBA. I think as a sysadmin, it might be easier to do the cloud type stuff because they’re not used to doing SQL Server work anyway, so you might as well just use these tools.
Does SQL Server licensing come in 2s or 4s?
Brent Ozar: So let’s see here. Next up, Ben says, “Can you only license SQL Server in multiples of four, right?” No, it’s weird. The minimum core count that you can license a SQL Server with is four cores, you actually buy them in two-packs. So you need two – and I can’t say this with a straight face – two two-packs in order to license any SQL Server, but then you can go in two-packs from there, so four, six, eight, 10, 12, whatever.
Should I separate data & logs on a VM?
Brent Ozar: [Kaushek] says, “Is it still recommended to separate data and log and tempdb files to their own disk drive on a VM, or do you recommend keeping everything on a single drive?” We actually answered this earlier, it was the first question we took. If you want, go to BrentOzar.com and click Blog up top, we recently blogged about this too, in the last week or so.
Are SQL Server connections encrypted by default?
Brent Ozar: J.H. says, “Is it correct that since SQL Server 2005 all connections are encrypted by default, and even if SQL Server doesn’t have a real signed certificate, it’s going to use its own self-signed cert?” So this is tricky…
Tara Kizer: Do you actually know this answer?
Brent Ozar: I do know the answer, and the only reason I know the answer is because I sat in Andre Melancia’s GroupBy session on hacking SQL Server.
So I totally cheated. And he demoed this; you can go to GroupBy.org, this is my open source community conference where SQL Server people, or really any database people, can go to submit any topic. Everything’s free, you get to vote to pick who does sessions on what. So there’s one called Hacking SQL Server by Andre Melancia, and he actually steps through this. The connection is encrypted for your username and password authentication, but as soon as that’s done, the rest of your packets go through unencrypted. So your results go across the wire unencrypted. Andre actually shows a man in the middle attack where he changes data coming back from a SELECT statement. So you SELECT out one thing but the data you see is entirely different because it’s not encrypted by default.
I get thousands of deadlocks per day…
Brent Ozar: And the last question we’ll take is from Samantha. Samantha says, “Speaking of deadlocks, I have an application which I’ve been able to chart between 300 and 6000 deadlocks per day…” Great. “It’s always on the same delete statement supplied by multiple polars.” She’s got a database doing about 2000 transactions per second. “I’ve tried to ask them to pull less, but since the end user isn’t noticing any lost data, they don’t see the problem. What should I do?”
Tara Kizer: Are the deadlocks a problem? I mean she just said the users aren’t noticing it, you know. Is there any problem? When I’ve had deadlocks at my last job where users were not noticing a problem because it was all back-end processes that were deadlocking, it was more a performance issue of getting all this work done. And then just nuisance from the DBA team, you know, open up the error logs, as we always did have the deadlock trace flag to post the data to the error log. So that’s not causing any issue, then so what that it’s happening so frequently. Samantha, you said it’s always on the same delete statement. I wonder, is it two delete statements? Is it always the exact same delete statement and its two different polars are sending the exact delete, and are they deleting the same row or maybe it’s locking the page? I would want to know what the other process is doing.
Brent Ozar: yeah, and if the developers have retry logic built into their app, you know, when they get a deadlock they just retry it again, that’s not terrible, it’s not so bad. If it was a 1000-line-long stored procedure where they were doing all kinds of work and right at the end it was the deadlock victim, yeah, the rollbacks would probably suck. But if it’s just one-line deletes, it’s not necessarily so bad. Alright, well thanks everybody for hanging out with us today and we’ll see you guys next week on Office Hours. Adios…