This week @EdwinMSarmiento joins Brent, Tara, Erik, and Richie to discuss high availability disaster recovery, migrating from on-premises server to Amazon’s RDS, AlwaysON Availability Groups, auto-growth issues, rebuilding vs dropping/re-creating indexes, log shipping, using Service Broker with Always On Availability Groups, updating stats issues, VLF counts, measuring transaction log usage, upgrading versions, and much more.
Enjoy the Podcast?
Office Hours – 4-11-18
What’s the easiest way to migrate to RDS?
Brent Ozar: Let’s see – we might as well go ahead and get into the technical questions. We’ll start with – Richard has an interesting one. He says, “What’s the easiest and fastest way to migrate from on-premises SQL Server up to Amazon’s RDS with as small cutover window as possible?” He says, “Availability Groups isn’t supported. Log backups aren’t available to us, being deduped and volted by third-party software. Is there anything that I can do other than transactional replication?” Wow, I don’t think he has any options.
Edwin Sarmiento: And somebody said transactional replication. Like, eurgh.
Brent Ozar: I don’t think you have an option there. That is pretty rough because RDS is like a sealed box. You get database mirroring, but that’s only inside of RDS; you don’t get to mirror into it so that’s tough.
Edwin Sarmiento: And there are capabilities that only exist in RDS, like for instance, somebody was asking about doing backups, like Native SQL Server backups within RDS. There are stored procedures that only exist in RDS, but they’re not in the Native SQL Server on-premises version.
What are all the ways Always On can fail?
Brent Ozar: Let’s see – next up, Brian says, “Edwin, do you have a punch-list to speak of all the failure scenarios that you test with Always On?” he doesn’t say Failover Clusters or Availability Groups, but he says, “If you could tell me what will break, it will make it easier to deal with when it does.”
Edwin Sarmiento: What’s really interesting is someone on a Microsoft forum asked the same question, like what would cause an automatic failover? Which is kind of interesting because there are so many things – which was also followed up by, I’m not really concerned about what would make Failover Clustering cause an automatic failover. And the gist of it is Availability Groups and Failover Cluster instances, I mean, if you’re looking at high availability, they still run on top of Windows Server Failover Cluster. So whatever applies to Failover Clustering applies to Always On Availability Groups and Failover Clustered instances. There’s just so many of them.
Brent Ozar: There’s not three…
Tara Kizer: When I attended – we all attended the PASS conference in 2016. One of the sessions I went to was Availability Group Monitoring. I’d been supporting Availability Groups for three years by the time I went there, but it was interesting. It was by Microsoft and they had condensed it down to an hour, but I think they said it was more like a four-hour topic; that’s just how much monitoring is really needed. What are the failures that can happen and, you know – I don’t remember what I learned, honestly, but it was a good session.
Erik Darling: Not much in an hour…
Richie Rump: Great session…
Brent Ozar: Sometimes it’s good just to be scared.
How should I configure quorum for a 6-node cluster?
Brent Ozar: Another Brian, a different Brian follows up with, “How would you configure quorum settings when I have six nodes and I’ve got three nodes in one region and three nodes in another region? The three in the primary, I guess, are sync and the three in the other region are async replicas.”
Edwin Sarmiento: It’s interesting when somebody says, how do you, because I don’t know because I don’t know what your business requirements are. And it’s interesting, whenever I tell customers that, they’re kind of like, I only want an answer. But the reality is, what is your ultimate objective? What’s your recovery objective? What’s your service level agreement? Because telling me that you’ve got three nodes in one data center and another three nodes in another data center, does not tell me anything about what your recovery objectives are and what your service level agreements are.
Now, let’s assume that you don’t want automatic failover between data centers. If you don’t want automatic failover within data centers, you’ve got an even number of nodes; three in one data center and three on another data center. You want a tie-breaker. So you’ve got node majority, you’ve got file share, you’ve got disk witness, you’ve got cloud witness in Windows Server 2016; what would you want to use? I’m a pragmatic person. I’m very practical. I’ll just use File Share Witness and host it in the data center that I’m trying to protect, mainly because you don’t want to be spending so much on an extra vote within the cluster. So there’s – again, you always need to go back to, why would you want to do that?
We have an app that shrinks the database every hour…
Brent Ozar: So let’s see. Next up we have – Tammy asks, “We have a third-party app that’s shrinking a database every hour…” Off to a good start. “Then our monitoring application alerts about log space problems for that database. I grow the files but they were promptly shrunk again in the next hour. How would you handle this situation?”
Tara Kizer: I mean, if you’re stuck with this application and you can’t make them change it, I wouldn’t be doing the manual growth. Let it auto-grow and shrink back down, I guess. Also, maybe modify your monitoring application so it doesn’t alert to this guy. I mean, most monitoring applications allow you to do custom alerts or to change the rule.
Erik Darling: That’s rough.
Tara Kizer: I would be hating life though, having to support that third-party application. I bet you he uses the SA account and does all sorts of bad stuff.
Erik Darling: Is there any justification from the app provider about why they do that or is there any talking to them? Or is it just completely gone and you’re just stuck supporting it?
Tara Kizer: I would bet that this is a legacy application, you know, that’s been around for a long time, doing things back – it wasn’t ever recommended to do this in SQL Server 2000. It was probably more common to do this type of thing in older versions.
Erik Darling: Depending on the command that it runs, what I might do is add a second log file, cap the size of the first log file really small so it never tries to shrink it and just let the other file grow and do whatever it needs to do. Because then, if it’s just trying to shrink this one thing, if it does anything, there’s nothing in there to shrink. Make it like 1MB; what’s it going to do? Have a second log file – because I’m willing to bet, the way it’s calling is not, like shrink database. I willing to bet it’s doing shrink file and it’s targeting whatever the…
Brent Ozar: The default log number…
Erik Darling: Yeah. I bet, if you added a second log file and just let that thing do whatever it wanted to do, you’d be fine. That’s what I’d do. Screw those guys.
Brent Ozar: That’s so much better than my bad answer. My bad answer was going to be, if you know what time it happens, you could set up a logon trigger, so whenever they go to log on, the app goes to log out and [crosstalk]…
Richie Rump: That was my idea, yeah.
Brent Ozar: Chase says, “I want to know what Edwin and Brent are chatting about.” No, we just both have the same bad ideas about exactly what we were going to do.
Should I rebuild indexes, or drop/recreate them?
Brent Ozar: Let’s see, next up, Pablo asks, “What would the advice be between either rebuilding indexes or dropping them and recreating them? Which one should I do?”
Erik Darling: Neither.
Brent Ozar: And why?
Erik Darling: Leave it alone. Well, look, dropping and recreating indexes is a hell of a gig, right because you have all the pain of the rebuild. Because you have to drop the index, so it means you have to take out the locks to drop it. And then after it’s dropped, people can get in there and do crap and they can just do things without an index, which is awful for them. It’s awful for end users. At least if you just rebuild the index, if it’s online, users can get in and do stuff during that time anyway. It doesn’t really buy you anything to drop and recreate or rebuild. It’s actually kind of worse because then you’re dealing with the, you know, the drop, which is offline and taking the locks and then the create, which you know, if that’s not online then you’re doing that and they’re doing things without a supporting index.
Brent Ozar: Yeah…
Erik Darling: So I would go watch that and make sure that you’re rebuilding indexes for the right reasons, like, you’re not just doing it at like 5%, 30% because someone told you to do that 15 years ago. Make sure that, you know, you’re targeting indexes that it’s sensible to do that for. When people tell DBAs to just rebuild or reorg at certain levels, no one ever tells them, hey, you know, if it takes you two hours, three hours, four hours, eight hours to do index maintenance, are you actually saving that much time when your queries run?
Are you saving two, four, six, eight hours of query runtime by doing this? Are you saving all the resources that you expend CPU, I/O memory-wise when your queries run that you’re expending when you do these rebuilds and reorgs and stuff. Make sure that you’re actually getting something out of them aside from pointing saying, “No look, they’re not fragmented.” It’s not really going to help you.
Edwin Sarmiento: And you’ve also got to look at what else is in that database. I had a customer as a question about – hey, we’re thinking about defragmenting or rebuilding indexes but it’s on an Availability Group. I mean, that impacted the performance instead of doing their index rebuild and defrag. Look, I can’t really help you that much unless you provide me – I know I’m trying to segue into providing additional information, but how many indexes are there, how long does it take you to defrag versus rebuild and what’s the overall impact on a performance when client applications are connecting to the primary?
So you have to think about the other things that come into the picture when you’re trying to do a simple – well, you might be thinking it’s just a simple rebuild or drop and rebuild or defrag, but there are other implications on the database.
Tara Kizer: I’ve had to completely disable index maintenance on an Availability Group, even switching – so it was two or three asynchronous replicas, one synchronous replica, and even switching that sync replica to async, I still was having issues. Besides the latency, lots of blocking. It was a very high-performance environment and a lot of batch requests per second. You have 5000 to 10,000 at normal times, higher at peak times. And a lot of people have had to completely disable index maintenance. And is it really needed, as Erik was saying? Still kept my update stats job once or twice a day – that’s where the performance benefits of the index maintenance is, typically because you update statistics with an index rebuild.
How many instances can I enroll in SQL ConstantCare®?
Brent Ozar: Daryl asks, “How many of my instances my I enroll in ConstantCare?” right now it’s unlimited. I could see, in the future, doing something like a 100 instance cap, but throw all you want in there.
Can you pause log shipping?
Brent Ozar: Anon asks, “Can you pause log shipping? My company does a weird thing where they take a new server back for a day or two before they do a final turnover so any databases would be out of date. We’re using simple recovery.”
Tara Kizer: You can’t use simple recovery model with log shipping. Log shipping to shipping transaction logs, so you’d have to switch over to bulk log or full recovery model in the first place. You can pause it, you just have the log backups running and not deleting them and one or two days later, the copies will just proceed in the restores. It might take a while to get up to date, depending upon how much work it has to do and do you have to start over with a restore, you know, restore database and then apply the log chain?
It also depends on how much disk space you’re able to keep online. I’ve got some clients who only keep 24 hours of backups on disk and then maybe some other longer term storage. Make sure you have the disk space to be able to have all the transaction log backups that you need to get it back up to date.
Edwin Sarmiento: That’s probably why he said the company’s doing weird things.
Is Microsoft about to charge for SSMS?
Brent Ozar: Phil asks an odd one. He says, “Microsoft moved SSMS into a separate standalone product. Rumors are afoot that it’s about to be licensed separately. Has anybody heard anything?” Dude, where do you hear those rumors from?
Erik Darling: SQL Operation Studio is licensed differently than SQL Server Management Studio…
Brent Ozar: In that it’s a different license but they’re both free. Yeah, it’s tough.
Do you recommend using Service Broker with Always On?
Brent Ozar: Anna asks, “Edwin, do you recommend using Service Broker with Always On?
Edwin Sarmiento: It’s an interesting question because, you know when people ask you, do you recommend, the first question to ask is, is it supported? If it is supported, then I would recommend it based on, of course, your workload. A common question, aside from service brokers, DTC. Would you recommend DTC with Availability Groups. And of course, unless it is fully supported, it’s not something that I would recommend. Quote en quote working does not mean supported. So I can’t remember exactly what the support matrix is for service broker for 2016 – by the way, what version are you using? So I know it does work but there’s some limitations on what is supported and what isn’t supported.
How high can VLFs go?
Brent Ozar: Next up, Turge, I believe it is, says, “How high can VLF counts go before it starts causing problems and what kind of problems will you experience when it starts getting high?”
Tara Kizer: Here’s what I know; 75,000 VLFs is not a good thing. In SQL Server 2005 days, back in 2007, 2008, the company I worked for did Microsoft security patching every single month, did reboots, we had high-performance systems, Failover Cluster instances and I had lots of servers to patch. SQL Servers – patch the other guys, you know, so I would restart the box after the patches were applied and I would verify in management studio that the instance would come online after it finished the reboot, but I wouldn’t expand the database history in object explorer.
So I moved onto the next server and a few minutes later, the admin team said, hey we can’t connect to the database; I don’t think it’s online. So I expanded the database history in object explorer and boom, it said in recovery for this database that had five lines of availability requirements. We were in a maintenance window, but it was bad. It had been probably 15, 20 minutes since I did the reboot when I started looking into this, you know. This is not normal, so I opened up the SQL Server error log and it’s phase one of three and – I don’t remember what the percentage was but I was like, oh my goodness, what is going on here?
I didn’t know about the VLFs at the time and luckily this database came online, you know, completed crash recovery after 45 minutes, and this was on a Saturday I believe. And so my plan was to contact Microsoft. I was going to contact Microsoft that night, but it came online before I could make that call. Because it could take like an hour even for a … to get someone on the call.
So, Monday morning, I contacted Microsoft and they said, run DBCC LOG_INFO info on it, and it returned 75,000 rows. He said this is a virtual log file issue; you have too many log files. And so that’s when I learned about VLFs and 75,000 of them is not a good thing. So it impacts fast recovery time of the database, it impacts your restores too and in some cases it can impact performance. So I can’t tell you what number to target. So if you have, say, a 300GB log file, you could have less than 1000 VLFs. I know that just because I fix log files to reduce the number of VLFs. You just need to make sure they’re not too big and not too small and your auto-growth setting is correct. Don’t use a percentage or a very small number.
Erik Darling: The flipside of that coin from Tara – I had, I forget how big the log file was, but there were very few VLFs but they were all gigantic. And so that took me 24 hours to recover from because it’s reading to, like, hundreds of gigs for each one. I was like, come on…
How do Availability Groups work internally?
Brent Ozar: Brian asks, “I would like to know the nuts and bolts of how this works. So what’s the best document or video that would give me the best deep dive into how Always On actually works internally?”
Edwin Sarmiento: That is a very interesting question. No documentation from Microsoft would actually explain the nuts and bolts. So here’s the thing, and I’m pretty sure you would agree with me in this; they’re all over. Like, read one documentation, it explains one thing. Read another documentation, it explains one thing; it’s just all over. And I know it sounds like a bit of a plug, I simplify that in the Always On Availability Groups training class that I’m doing. It’s all visual. You would see how everything works under the covers. How the transaction log records or log blocks are getting replicated from the primary to the secondary, what the wait stats are, what the wait types are; everything simplified.
Now, I know that’s kind of a rounded up version of it, but here’s the deal. It’s really complicated if you just simply look at the documentation. I’m trying to simplify it as much as I can from some of the articles and the blog posts that I write.
Brent Ozar: And also, there’ no money in Microsoft building that documentation for the public. Like they just don’t have an incentive to dedicate human beings to do that for – because it would take months for them to do it across all kinds of languages. Anything Microsoft does, they want it to be perfect across all these different standards and all this. And why, when half the time, it changes every version. They’re like, sorry we need those documentations for developers.
Tara Kizer: And just attend Edwin’s training class on Availability Groups. That gives you a really deep dive into the product.
Edwin Sarmiento: What’s really interesting is I see this across all software vendors. I was just looking at some of the documentation for Microsoft PowerShell for AWS. Same thing. They’re all over – I think it’s just a thing with software companies. They just are not really good at writing documentation.
Brent Ozar: Richie is so bored. He’s like, I’m done with this.
Richie Rump: [crosstalk] I’m looking at ConstantCare files coming in. Y’all keep going because some people enjoy this stuff…
Tara Kizer: He’s waiting for one of his cats to arrive so he can play with it.
Erik Darling: It’s hit or miss. The other week when it was just me and Richie, there we like 70 questions about Git and I was like, sweet; I’m going to take a nap over here. I’m just going to sit.
How can I measure transaction log usage over time?
Brent Ozar: Gordon asks, “What’s a good way to measure how much transaction log usage we’re doing over time? Like how much of our transaction log we’re generating or backing up.”
Edwin Sarmiento: I do have a blog post on that, but I don’t recommend that you do it in production. So basically… Yes.
Brent Ozar: That’s my favorite kind of answer.
Edwin Sarmiento: So the blog that I wrote about – I can’t remember exactly what the blog post was all about, but basically the gist of it, with sample scripts, is you take your log backups and read your log backups using DBCC FN_DBLOG. And part of that – what I did was I came up with an Excel spreadsheet that would tally your transaction log generation throughout, you know, a month, a week. But again, don’t do in on production. Read your log backups. And again, dump those results into a table and come up with an analytical graph on the amount of transactional log records generated over time. But this is not just your inserts, updates and deletes. That also includes your index maintenance. It generates a lot of transaction log records that that would factor into that graph. So you might want to pick and choose, like, your inserts, updates and delete statements as part of the analysis.
How do you reduce VLFs and prevent problems?
Brent Ozar: Next up, Michael says, “Tara, if you do have a database with a high number of VLFs, how do you eliminate them and how do you prevent them from returning?”
Tara Kizer: It’s easy – once fixed, it should never come back for that specific database. So you shrink it as far down as you can, say 1MB or 5MB; a really small number. You may need to do some transaction log backups as you’re doing those shrinks. Do this in the maintenance window, but I’ve certainly done this during the day on very high critical databases, just because I don’t want to take a maintenance window for this task. So shrink it all the way down and then you’re going to grow it back out to that original size.
Don’t just grow it back out part of the way or let it auto-grow from there. Grow it back out to the space that it needed before you started doing this task. Don’t grow it out all in one, otherwise you’ll end up with the issue Erik had. And don’t grow out in small chunks because you’ll end up with too many VLFs. So if I had, say, my original file as 100GB, I might just grow it out in, say, 4000GB chunks and I don’t ever do the 4GB chunks because I ran into a bug back on SQL Server 2005 and it was horrific. It’s been fixed, so on newer versions, you wouldn’t run into it. But to this day, I will not do increments of 4GB, so 4GB, 8GB. I’ll do 4000MB or 8000MB. I will not do the 4GB increment.
And then just keep growing it out until you reach the original file size. And don’t forget to change your auto-growth setting. So on a 100GB log file, I’ll probably do a 1Gb auto-growth; maybe 512MB. You don’t want too small, you don’t want it too big, but fixing the auto-growth is how you prevent this issue from coming back.
Edwin Sarmiento: I can’t remember what version of SQL Server has that alert for if you reach 10,000 VLFs, I think, or 1000 VLFs, it will write an error in the SQL Server error log.
Tara Kizer: [crosstalk] I was going to say, it definitely wasn’t there in SQL Server 2005; I would have noticed it. I was a good DBA.
Erik Darling: [crosstalk] Then you got another job here and everything went down.
Brent Ozar: Now she spends all her time painting mountains on the background. [crosstalk] Marci says, “One other thing to add, when you do Tara’s fix, to also do a checkpoint before you go and do this stuff too.”
Tara Kizer: Yes.
How should I migrate 2012 AGs to 2017?
Brent Ozar: Pablo says he’s on 2012 Always On Availability Groups and he says, “What are the general steps to migrate toward 2017?”
Edwin Sarmiento: So, general steps will depend on the maintenance window. You could do a kind of rolling upgrade scenario where you’re adding another replica in your Availability Group that is running a higher version. Say you’re on 2012, maybe 2017, and then do a failover. But that’s a one-way street, which is why you really need to do a lot of testing prior to doing that because once you’ve failed-over to 2017, you can’t go back to any of the lower versions.
That being said, that’s the fastest way to do it. You can do a lot of different things because, number one, if you’re on 2012, I’m guessing you’re still on Windows Server 2008R2 or Windows Server 2012 R2, at most. I’m guessing you also want to upgrade your operating system. So you really have to plan well because you don’t want to be upgrading to SQL Server 2017 and then a couple of months later you’re upgrading your operating system to the higher version. So if I were you, wait until the whole process of upgrading your hardware and OS is part of the plan, so that way, you don’t have to repeat everything. Of course, as a consultant, I would rather have everything done again because I get paid twice, but I’m not like that. I’d rather do it once; I’m lazy. I’d rather do it once. Plan accordingly. You’re saving your company a lot of resources, a lot of time and a lot of money.
Are SSPI handshake alerts a problem?
Brent Ozar: Next up is Chase, who asks, “I’m getting a lot of alerts regarding SSPI handshake failed. I could just tweak my monitoring system to ignore it, but should I worry about this?”
Tara Kizer: That is an error message I automatically dismiss. It can be fixed with SPNs. I think this is an SPN issue, right…
Erik Darling: SPN or whatever delegation in the directory account.
Edwin Sarmiento: But you also have to think about why is something causing that and, when I was doing operational DB, I usually ignore that. But then at one point, I came to realize, if it’s throwing something like that, most of the time, we just think it’s active directory or it could be an SPN thing, but it could be worse. And I was in a troubleshooting call with Microsoft a couple of years ago where I’m seeing a lot of these incidents. It turns out that the reason why it’s getting a lot of that events was the fact that SQL Server could not contact active directory. And of course, I kind of passed the ball to the active directory guys without realizing that the main controller was actually sitting on a VM with a single virtual … I know you kind of know where I’m going with this. And every single machine on the network is hitting that thing. So it’s not just SQL Server that’s getting affected, it’s everything on the network. So I told them, look, maybe it’s about time that you moved this VM to a physical machine because everything’s hitting the same VM.
How does SQL ConstantCare® upload to AWS?
Brent Ozar: Landon asks a question for Richie. He says, “Is ConstantCare using a secure file transfer method to upload the data to AWS?”
Erik Darling: Carrier pigeons.
Richie Rump: Bit by bit.
Brent Ozar: One bit at a time.
Richie Rump: Yes we are and it’s because Brent made me. I think in initial testing, we were just going to use HTTPS and be done with it, and then Brent made me throw some encryption on it. Everything up to the cloud, once it gets up to the cloud, we decrypt and reprocess the files.
How can I automate AG patching when I have SSIS?
Brent Ozar: There we go. So let’s see, the last one we’ll take is [Saket]. [Saket] asks, “How can we automate SQL patching in an Availability Group when SSIS DB is in there as well?”
Tara Kizer: It’s not supposed to be in there, right? That guy is not supported on an Availability Group, I don’t think.
Edwin Sarmiento: So when this thing first came out – because SSIS DB was introduced in SQL Server 2012, there were a lot of challenges because for one, you wouldn’t want SSIS DB to be in there unless you know, for a fact, that you’re going to put all your SSIS packages in there instead of MSDB or maybe file share for SSIS. It became a bit more challenging because every single database has to be in the Availability Group [reporting] – SSIS in the Availability Group. Come patching time, it just screws thing up. You would have to remove the SSIS DB from the Availability Group, patch the nodes and then add it all back, which was why he asked that question.
You can automate that process, either through a combination of PowerShell and T-SQL. It all depends on what your patch management processes are. Are you using System Center for your patch management process, are you using Group Policies for it? It becomes a bit more challenging, but you can, again, use a combination of PowerShell plus T-SQL to remove the database, patch the node and add a database back into the Availability Group.
Tara Kizer: It sounds like a lot of work. I mean, it’s not supported, right. I mean, why put it in an Availability Group if it’s not supported?
Edwin Sarmiento: That I would have to ask the product team for clarification.
Tara Kizer: At my last job, we had Availability Groups and we had SSIS packages and were very excited about SSIS DB and all of its features. And so we had our Availability Groups for the user databases, non-Microsoft stuff, and then we had a separate virtual machine that had SSIS DB that was not in an Availability Group. We did snapshots on that guy like every five minutes. So it still had some HA features in place, it just wasn’t in an Availability Group.
Brent Ozar: Alright, well that’s everything we’ve got time for this week. Thanks, Edwin for coming out…
Edwin Sarmiento: Thanks for having me.
Who’s the right person to attend Edwin’s upcoming courses?
Brent Ozar: Answering questions – absolutely awesome. Do you want to say a few words about who the right person is to come to your course coming up on Failover Clusters and Availability Groups?
Edwin Sarmiento: Well it’s interesting because I’m also looking at the trends. First one we did last September, there were more senior DBAs who attended. The one we did last December were IT managers, which is very interesting; IT managers and sysadmins who are getting started with SQL Server. And so if you’re a senior DBA, or even if you’re a sysadmin or systems engineer who are now responsible for managing SQL Server, particularly Availability Groups, or if you’re an IT manager who just needs to understand this. Or maybe you’re an IT manager who needs to fill in because you’re still waiting for a new senior DBA to be hired. This is the right course for you because I try to simplify, like I said, I try to simplify everything that involves Always On Availability Groups and the new one we came up with is the Failover Clustered instances because they’re not dead. They’re still out there in the field and I’m pretty sure that you still need to manage them until you decide to move to Availability Groups or even to the cloud.
Brent Ozar: Perfect. Thank you, sir. I look forward to the class and see everybody next week at Office Hours. Adios, everybody.
Edwin Sarmiento: Thanks for having me.