Note: this is coming to you from the past due to technical difficulties.
This week, Brent and Erik discuss Availability Group issues, database diagram tools, AWS vs Azure, diagnosing async io waits, troubleshooting deadlocks, extended events, using same login credentials for everyone in the company, adjusting and more.
If you prefer to listen to the audio:
Enjoy the Podcast?
Office Hours Webcast 9-13-17
Why is my AG database log file growing?
Brent Ozar: Leah asks, “I need help; I am out of ideas…” Boy, are we the right people to ask. She says, “I have multiple availability groups using SQL Server 2016. Every database is in its own availability group. However, I have one database that has a full transaction log file and it’s growing. It says the log reuse wait description says availability replica, but all replicas show everything is healthy. What can I do, other than remove AGs or throwing the database in simple recovery model?”
Erik Darling: Kendra has a great post. Kendra, about why your availability group transaction log can be full without you having messed anything up.
Brent Ozar: Let’s see if we can find it. Let’s see, Kendra Little transaction log – is it on her site or ours?
Erik Darling: No, it’s on hers.
Brent Ozar: LittleKendra.com… For those of you listening to the podcast, you can’t see that we’re typing. Why is my transaction log growing in my availability group… Oh, she has an episode on it; fantastic, on her IDERA SQL DBA podcast. Oh, it’s nice. She doesn’t transcribe the whole thing, but you can either get it in video or in audio. She does a great job on the podcasts there. You should also totally subscribe to her podcast while you’re there too.
Off the top of my head, I can’t remember what would cause that. Obviously, a long-running transaction, but that would be different, it wouldn’t show availability group in the list there.
Erik Darling: Stuff like running out of disk space on the replica – like if your transaction log is full over there and it can’t grow anymore to receive new transactions, it’ll be stuck. And then – I had another good one – like if stuff gets – it can’t clear out row versioning or – no, that was tempdb…
Brent Ozer: I was thinking, if you added a file on an invalid drive letter, you know, where the secondary doesn’t have it, then the replication will stop working. I would also check and see, sometimes you think that things are healthy when you look at the availability group dashboard, but it’s not; because the availability group dashboard stops updating when you run into worker thread exhaustion. You might actually not be in sync over on the other side. If this is a Standard Edition AG, you’re not supposed to query the secondaries; you know, it’s not readable on the secondaries. If this is in Enterprise though, I would actually go in and look at the tables on the secondary, just to make sure the data’s actually up to date, because it may not be.
What’s a good database diagram tool?
Brent Ozar: James says, “I have a 1.2TB database and I need a database diagram. Do you recommend any good third-party tools?”
Erik Darling: Oh, this is a Richie question, because Richie actually knows several of these. I can’t remember any off the top of my head that I like.
Brent Ozar: I hear people say Erwin, ER/Studio, Toad Data Modeler. Those are the three that I hear the most often. I don’t know anything about pros and cons to any of them, but I know they’re expensive; usually like three, four, five thousand bucks.
Erik Darling: I mean, James, not to punt again, if you search back through our episodes about database diagrams, there’s specifically one where Richie talked about several different that he had used over the years that he had various experiences with. So, I would search back through Office Hours episodes and see if you can find searching for database diagrams because Richie had a great answer about them.
Brent Ozar: Or – so this isn’t a good one to post on Stack Exchange, because they’re like never post-shopping list questions; but the person I’d ask is Lewis Davidson. Lewis Davidson has a blog over on SQLBlog.com. If you search for SQLBlog Lewis Davidson, I bet his contact info is over there; and he wrote the book on relational data modeling for SQL Server. Nice guy, has an unbelievable sense of humor. Super dry, like he’s talking and you don’t even realize the number of jokes he’s dropping in there.
Should I use AWS or Azure?
Brent Ozar: Next up, Brent – not me but the other Brent; I shouldn’t use last names – says his, “Company has mandated that we’re moving to the cloud. Now they’re counting on me to come up with which service to use; AWS or Azure. What should I consider?” Oh, that’s good.
So, you have to work with all of them because we go through and do things like builder days. What do you think about – if you were going to go pick one, what were the kinds of things you’d think about?
Erik Darling: Oh, which sales person can fight the best. No, but I think what you should do and the way you should decide which cloud service you want to pick is you should go to the Azure portal and try to figure it out.
Brent Ozar: That is just mean.
Erik Darling; I know, but the good thing is you get a quick answer out of that.
Brent Ozar: Yeah, come up with a list of like four or five things that you need to do; spin up a new VM, take a snapshot backup of it – I’m trying to think of the other common things that I would do. Networking security, holy cow. Spin up two VMs in two different areas of the US, or internationally, wherever you want to do it. One on the East Coast, one on the West, and then get them to talk to each other, ideally without putting them both on the internet. Holy cow, that’s a pain in the rear…
Erik Darling: I hope you like subnets.
Brent Ozar: And I don’t want to sound like I’m shilling for anybody, but this is one of the things actually that Google does really well. Any of your VMs can talk to any of your other VMs, regardless of what data center they’re in. You don’t have to deal with any of the complex VPN stuff be default. The other thing I would ask of your developers is, are there any services that are ion one cloud that you’re thinking about using that are not in other clouds?
So, for example, Microsoft will try to pull you in with stuff like Azure Data Warehouse or Azure Data Lakes that are proprietary to them. Amazon will try to pull you in with things like Redshift, S3, you know, services that they use that other people don’t. Google will try to pull you in with their Kubernetes type stuff, but just ask with your developers to see which ones they use.
Erik Darling: Another thing I’d want to do is try to find instance sizes that are sort of commensurate to the machines that I have running now and check out pricing there and see how – just see what’s available that you could probably get a close match on. Because one thing that you’re going to find in the cloud – Brent talked about networking, I’m just going to talk about networking speed. For the instance sizes you want, it might make sense for the – they might be a close match to the kind of machines you’re running now. The networking and the disks that you get might not be all that hot.
Brent Ozar: When you go through and look at pricing in comparison too, I would start here; EC2instances.info. Now this is only for Amazon, obviously. There’s similar kinds of sites for Google and Azure. I think it’s easier to find pricing for either Google or Azure. For example, with Google, there’s no predefined machine types that really hone you in. you get a lot more knobs and buttons that you can play with for sizing. But in here, I love this EC2 instances comparison, where you can choose what kind of licensing you want, whether or not you’re worried about how many compute units you get, whether or not you’re doing IPB6; all kinds of fun stuff. And then you can sort by that. So you can say, I want VMs with at least 60GB of RAM and at least eight cores, and then I can sort by pricing; how much is Windows pricing on demand? All kinds of neat stuff.
Erik Darling: But what’s really important to pay attention to is that network performance column, because not all of those are so great. So like up to 10GB, high, doesn’t even do all that high. And then you have moderate and low… Be careful.
Brent Ozar: They really are slow. They are desperately slow; same thing with storage throughput. If you go to columns and, say, EBS optimized throughput, you can see a lot of these are just dog slow. If you run CrystalDiskMark and compare it to some of these, man, your data would be better off on a USB thumb drive.
Erik Darling: If you need really large sizes and you’re deciding to go with Azure then you might want to hold out for Azure Managed Instances, because those are pretty beefy sizes and…
Brent Ozar: It’s wonderful; so Azure SQLDB, it doesn’t sound like you’re interested in, but Azure Managed Instances – so what are the kinds of things that would draw you to managed instances that SQLDB doesn’t have?
Erik Darling: Just size and some of the AG stuff I guess.
Brent Ozar: Cross database queries is a big one.
Where can I see past GroupBy sessions?
Brent Ozar: Kevin asks, “I couldn’t watch all of the GroupBy sessions, where can I see them now, please?” I like that he asked please at the end of it. Go to GroupBy.org and there’s a big thing up at the top that says “Watch past sessions”. That’s exactly where you should click. All totally free too.
Should I have different SIDs in development?
Brent Ozar: Amy says, “Is it okay to have the same login SID for databases that are on different SQL Servers; production, backup, test and dev?” Oh wow, I never thought about that.
Erik Darling: Me either…
Brent Ozar: Yeah, it’ll be fine. After all, you’re going to probably restore – oh god, I’m going to get in so much trouble for saying this – what you’re probably going to do in many environments is you’re going to restore from production over to other places like test and dev. And then you get stabbed by some security auditor who goes, “Developers should never have production data.” This is the way the world works.
Erik darling: It’s the agile way.
Brent Ozar: Amy says, “Yes I am, thank you.” I figured someone would stab me for that.
How should I diagnose ASYNC_NETWORK_IO?
Brent Ozar: Samuel asks, “What are the next steps in diagnosing async network I/O waits when I’ve confirmed that networking is not the issue?”
Erik Darling: Doesn’t it sound like it should be the network?
Brent Ozar: Yes – so many [crosstalk]
Erik Darling: Oh, that’s good. I would start with checking your SQL Server code and see if you’re running any scalar functions or cursors in there that might be sending results to your application one row at a time. The next thing I would check is your actual application code and see if they’re doing any processing on a dataset that comes in, in like a four-reach loop; where they’ll take a row and do some processing and then another row and do some processing. Because typically what async network I/O is, is when SQL is basically shooting data at a client and the client’s not saying that it has all the data because it’s doing a bunch of stuff client side.
Brent Ozar: You could run sp_whoisactive. Go run sp_whoisactive and look at that wait stat column. In there, it will show which queries are waiting for async network I/O; and then you can go back to those developers too. It shows you – I love sp_whoisactive so much. It shows you which machine it’s coming from, all kinds of fun stuff.
Should I hint my isolation level?
Brent Ozar: Next up, J.H. says, “A developer was experiencing deadlocks and I recommended to do their changes, inserts updates and deletes, in smaller batches. Now that did turn out helping, but we’re still experiencing some. Do you recommend locking isolation hints?”
Erik Darling: Not necessarily.
Brent Ozar: What would you start with first if you were facing deadlocks?
Erik Darling: Well, I think it would depend on where the deadlocks are coming from, a bit. Usually if it’s a couple of stored procs, I would go and look at those, and I would maybe look for missing index requests, and I would maybe look at smarter ways to index for updates. Because a lot of – a reason that I come across that happens quite a bit for when deadlocks occur is that there’s an update or a delete. And there’s a where clause for it, but the where clause doesn’t have an appropriate index. So, SQL ends up escalating all sorts of locks from row to page locks or to table locks, and you end up with these kinds of strict locks that get held for the duration of the modification, and then anything else trying to do work around it kind of gets bonked. So, I would just make sure, for my modification queries, updates and deletes, there’s a where clause that I’m kind of appropriately indexed for that.
Brent Ozar: I’d also say too, make sure that you don’t have too many indexes. A shop I was at yesterday had 60 to 70 indexes per table. And so whenever someone wanted to change data, we were having to grab locks across all kinds of indexes because every index included all kinds of fields from the table, fields that changed all the time; so you had to do way more locks than necessary. It usually seems like – I often joke with people that you should aim for five or less indexes per table and five or less fields per index. It’s not that that’s a magic number, it’s just that the more indexes you start to have, the more problems you end up with; stuff like blocking and deadlocks. And the less indexes that you have, the more you run into that problem too.
Erik Darling: So, one really good and quick way to get yourself a diagnosis on this is during or after – probably after one of those deadlocking things occurs, run sp_BlitzIndex and see if you get warnings for aggressive indexes. Aggressive indexes will point out indexes that have had significant row or page locks and how long they’ve had them for. So, you can figure out which indexes have these long locks being held on them and which ones might be the reason for the deadlocks occurring.
Brent Ozar: I remember the first time Kendra showed me that you could get locking stuff out of the index DMVs. And I was just like, witchcraft, how is this even possible? I used to try and focus too much on what are the queries involved in the blocking and the deadlocking, but man, you don’t even need it once you know which indexes are involved. Then it’s easier to look at the tables and see, how’s our indexing strategy on those?
Can you show me Extended Events?
(Note: you’ll need to watch the YouTube recording to see the demo in action.)
Brent Ozar: Don says, “Good morning. I tend to use SQL Profiler to do an audit on queries or find the queries that are getting called from a web page. The developers won’t tell me and I’ve got to try to figure it out. With SQL Profiler being deprecated, is there a specific extended event that I can use to do this? Or how should I use the new way?”
Erik Darling: Don, I have such good news for you. Assuming that you’re on a fairly modern version of SQL Server and using a fairly modern version of SSMS, when you go into extended events and you go into sessions – when you right click on sessions and hit new session, one of your options under the general tab is you can use a template. And there are a whole bunch of templates in there that are equivalent to Profiler. So, whatever you would normally fire up for Profiler, you can probably find a pretty close match for Profiler equivalence. So, you can do anything that you have in there, and then it gives you the definition of the extended events session once you go onto other parts of it; and you can see which events are tracked. And so, you could even narrow stuff down and pull stuff out and concentrate on stuff that you’re familiar with in Profiler.
So that’s what I would recommend. If you’re using Profiler to do something specific now like that, start with the templates there because they make your life a lot easier. You can get a lot of commensurate information up front from across those.
Brent Ozar: Session name – I’m like why can’t I do this demo? And then you can hit the script up thing…
Erik Darling: Don’t be afraid to hit the configure button and kind of poke through, because there’s going to be a whole bunch of extra columns and filters and stuff that you might want to – I have a car being stolen behind me – but there’s all sorts of stuff that you might want to pull out or add into this session that you saw on Profiler that you’re not seeing just off the bat in the [inaudible].
Brent Ozar: I mean, watched live data isn’t always completely accurate, it doesn’t show things exactly when it comes in…
Erik Darling: Sometimes it crashes management studio.
Brent Ozar: So, there’s a lot of gotchas with it. It just is nowhere near as bad as it used to be. The other thing I’ll say too is, if you don’t want to learn it, you don’t have to. Like you can still continue to use Profiler. It’s heavier, it isn’t nearly as cool and won’t gather all kinds of cool stuff; but it still works if you only need to use it part time, like once a month or once a quarter, it still works fine, and it’s still in SQL Server 2017.
Erik Darling: Go back to management studio real quick.
Brent Ozar: Oh man, you’re not my real dad.
Erik Darling: Come on, we’ve got so much time to kill. So, if you want to get results and watch live data or end the data file faster, just turn max disk patch latency down to a lower number. Right now, it’s at 30 seconds, so it could take 30 seconds for stuff to go in. you could change that to one or two seconds, which is what I do sometimes in demos to get the data into the file faster; which isn’t necessarily a good idea on a production server, but if you just want to see the data faster, that’s a knob you can tune to see it faster.
Brent Ozar: There’s also, as long as we’re going through this, allow event loss. So, you might look, as you’re going through the wizard, oh I shouldn’t allow any loss of events whatsoever; no, don’t do that. Let SQL Server drop some events because it’s not mission critical and it will be faster that way.
Everybody knows the SA password. What do I do?
Brent Ozar: M.M. says – I assume that’s not Eminem. Times are hard in the rap business…
Erik Darling: I heard that new album, man…
Brent Ozar: He goes, alright I give up, I’m going to go into databases instead. “The company I’m working for uses the exact same ID and password for…” Oh, now I see why you went anonymous with your name. “Uses the exact same ID and password for the SQL proxy, command shell credential, owner for all jobs. All the developers know the password, my developers work in production. They log in SSMS as that ID in order to create jobs. I think this is bad and I want to change it. Is it okay for the proxy and credential to be the same?” Man, that would be the least of my worries.
Erik Darling: As long as all of management is aware of it and they are okay with it, I mean – maybe let’s not bark up that tree. If you wanted to, you know, write them a nice email, maybe have a brown bag meeting where you talk about all the potential security problems that presents. Being a DBA and being a developer and handling data like this is a trust position. So, if you don’t necessarily trust everyone accessing the environment, you may not want them to have access to all of your customer data or whatever data it is coming in. If it’s medical data especially, that could be rough. But I would make sure that everyone is aware of it and aware of the potential downsides of it. One thing that I will caution against though is shooting off the handle and changing a password, because you could make a bunch of stuff break like that.
Brent Ozar: One thing I’ll throw in there, I have up on the screen this old blog post I wrote – I say old, I didn’t realize it was only a year and a half ago. How to talk people out of the SA account. I give a graceful way to do it and then in option two I give the really ugly way to do it. It involves, after all, if everyone knows the SA password, or whatever password, you could go in and change things in the production database and play stupid. “I don’t know who dropped that table, it could have been anybody. Man, it’s a shame we can’t figure it out, but with all you knowing the password, I guess there’s just no way we’ll ever know.” “Well you’ve got to stop people from doing that.” “I can’t, my hands are tied. I’ll see you again next week. I predict we’ll have an outage on Thursday at two.”
Erik Darling: All your queries are going to run really funny in about five minutes. You know what he could do? That stored procedure that we wrote at the last DBA Days, Chaos Slop. Change his settings.
Brent Ozar: Oh god, that was funny. Describe what that does.
Erik Darling: So, the nice way to put it is that at random it will – if you schedule it as an agent job, every time it runs it will assign random values to a whole bunch of different, important to performance, variables. Like max memory, cost threshold for parallelism, MAXDOP – so you could just get wild variances in execution plans and, you know, querying a plan cache. All sorts of bad stuff happens when you start changing that stuff on the fly. And Brent left a process running overnight while Chaos Slow was also running. So just the wait stats were awful, it felt like me on the cruise ship where I was just… It was inspired by the Amazon Chaos Monkey thing that randomly goes and breaks things.
Brent Ozar: It’s amazing Dell never let us into their data center. That was pretty spectacular.
What’s your favorite data visualization tool?
Brent Ozar: Wes asks, “Have you ever used any data visualization tools and if so what are your favorites?”
Erik Darling: Excel…
Brent Ozar: That’s exactly what I was going to say. It’s easy, it’s cheap and it produces results that I can manage. But I don’t know anything about visualization beyond that though. And if you’ve ever seen any of my reports, you’ll understand why. “You can see by this 3D pie chart filled with Pac Men…
How do I fail over to DR with AGs?
Brent Ozar: Sri says, “In a multi-site Always On Availability Group…” I like how you spelled it out absolutely perfectly there, SRI, you’re already ahead of most shops. “I have an even number of nodes. I’ve got four nodes and I’ve got a quorum file share in my primary data center. What happens when my primary data center dies? Like if I have a flood, a disaster? Will we be able to manually bring up our DR site?”
Erik Darling: take it away, Brent.
Brent Ozar: Alright, so assuming – it’s easier almost, with the camera with my number of fingers. So, I have two nodes in each data center and then I also have the extra file share witness over in my primary data center. When you lose the majority of voters all at once, the two that are still remaining over in the DR data center, they don’t know that they aren’t just left alone in the cold. Like maybe their internet connection died and they can’t see the majority; so, they will gracefully shut down their cluster services. Now, you can force them online. There’s a set of steps that are documented in Books Online about forcing the cluster online, fixing quorum, then forcing SQL Server to start the AG with allowed data losses on. But this is super dangerous and the checklist to do this is like a page and a half long in Books Online, and every one of the lines links off to other articles in Books Online. It is something that you want to rehearse the Bajeesus out of before you go and do it the first time.
The thing that really worried me when folks do this is, whenever you failover to that other data center, when you decide to pull the trigger, there is no easy way to get the data back from that other data center. The simple way to think about it is order IDs. Say you’ve got a table with an identity field, starts at one and goes up to a bajillion. In your primary data center, y9ou had order ID number 100, and that one hadn’t made it over to DR yet. Well, DR was only up at, say, order ID 90. When we failover and we start inserting into the orders table, we’re going to be using that order ID 100 again, and I have to be able to figure out are these two different orders or are they the same order where fields have been changed? Tools like Redgate’s data compare and schema compare will help you identify that there’s differences, but not which one is right or whether it’s two separate rows or one row. So, it’s super, super dangerous. It’s something that you actually want to step through as a company before you go live and understand what the ramifications are.
Erik Darling: See that was a much better answer than what I was going to give. I was going to say maybe you should think about log shipping.
Brent Ozar: Yeah, so much easier. Because you lose the primary data center, the first think you have to do when you go to failover to DR is fix the cluster. Before you can even get started you have to fix the cluster to bring that online. With log shipping, who cares? You just go live over on the other side.
Erik Darling: This is part of the reason why I have such a hard time finding HA and DR stuff sexy, is because it’s like nope, everything is lost. It’s like a dude saying he’s not wearing any underwear; you’re like yeah, who cares?
Brent Ozar: It’s amazing how I look at it and I go, if you do it wrong, it’s way less reliable instead of more reliable. It’s very easy to get less reliable. Now either I just lost internet or Erik did, one of the two. Let me see – yeah, I’m still pulling up internet pages, so it must just be Erik’s frozen in time.
Let’s see, SRI says thank you and Vlad says you are visible. Okay, cool.
How do I catch people using the DAC?
Brent Ozar: So Samuel says, “I have an issue of a process logging into my SQL Server using the DAC overnight, the dedicated admin connection. Can I run a trace, like an extended events or profiler query to capture this user? You actually don’t need to run a trace or an extended events type thing. The easy way to do it is, just log sp_whoisactive to a table. If you Google for – we’ll go ahead and do it real quick on ours – if you search for log sp_whoisactive to table, Tara Kizer has a great post explaining how you go about doing this. there’s two on here. Part two is the better one that goes – I say better one, it’s just newer. We go into more details there. So, you can set this up as a job to run every five minutes, every ten minutes, however often that you want. Then there’s also a parameter that you may want to play around with, show sleeping SPIDs, I believe it is.
Let’s go fire it open just to see if I’ve got this right – sp_whoisactive. Show sleeping SPIDs I think, equals one. Look at that, I even remember. So, this will catch even if someone just logs into the DAC and doesn’t do anything. If you wanted to get more specific about which query is which, like which query is using the DAC, you could go into more detail through things like sys.dm_exec_sessions. But I would just do this because it’s a great technique to understand how to use for all kinds of reasons. I know a lot of shops – a lot is probably a stretch. I know some shops who will set up a job to log sp_whoisactive to a table, they’ll set up an agent job to do it and schedule, like every minute. And they’ll only enable it when they know they’re going to have or they’re having problems; they’ll teach their helpdesk staff to go enable that job.
So, since you know roughly when that’s happening, you could simply schedule the job to run during those windows, and you’ll be able to catch the culprit pretty quickly.
Are there drawbacks to changing MAXDOP?
Brent Ozar: Finally, the last question we’ll do today, J.H. says, “I recall seeing a YouTube clip of Grant Fritchey saying maybe not always change the MAXDOP is the default. Is there a time when it changes?” Yes, absolutely. So first off I’m going to throw a link up here on the page so you all can see it. If you go to BrentOzar.com/go/cxpacket – and I’ll copy paste that so I can paste that in again whenever the page refreshes… I’ve got a video on there where I explain what’s going on with MAXDOP, what’s going on with cost threshold. And I haven’t seen that clip from Grant, but I can tell you, because I can probably agree with it, there are probably cases where you don’t want to throttle down how many cores your query gets. However, the gotcha with that is it’s usually talking about fairly small servers. So, think like eight to 12 cores as opposed to 16, 32, 64 cores. it’s extremely rare that on a 16-core server or a 32-core server that you actually want one query going parallel across all of those threads. Because remember, it’s not like just one query is going to go parallel. Odds are you’re going to have a bunch of queries going parallel and it can cause a problem.
The typical scenarios where you want to go with a larger MAXDOP are things like data warehouses, where you have a really repeatable load every day and then you have repeatable nightly scheduled reports that come in and do big huge awful ugly queries. What’s cool about data warehouses is you can change your MAXDOP over time, say run experiments from one week to the next to figure out where your sweet spot on MAXDOP is. This is one of the experiments we ran at Dell DBA Days. Both Erik and Doug had run experiments on this. One of them was an experiment on CHECKDB MAXDOP, for example, where we found that the sweet spot for some workloads or some databases was closer to 14 or 18 in terms of MAXDOP.
Alright, well that sums up today’s office hours. Thanks everybody for hanging out. If you have a question that we didn’t get to, feel free to go over to DBA.StackExchange.com. We’re over there all the time posting answers on that, and we’ll see you guys over there. See you all next week at Office Hours.