This week, Erik, Tara, and Richie discuss tempdb etiquette for developers, elastic search vs full-text search, server CPU spiking issues, source control in a database project, querying transaction logs, and more.
Office Hours Webcast – 2017-02-15
How can I discourage people from using TempDB?
Erik Darling: There are some long questions in here today.
Richie Rump: If you have a short question, get them in now.
Erik Darling: I’m going to ask the short questions and in my head I’m going to be paraphrasing the long questions. I’m going to try not to “Brent” anyone. Let’s see here—there isn’t a short question, I lied. Here’s one about tempdb. Brian asks if there are any suggestions to discourage developers from abusing tempdb and to teach them proper tempdb etiquette, so they’re not just filter-lessly, WHERE-clause-lessly, join-lessly dumping tables of values into tempdb and then indexing them and going crazy. So what would you do to teach developers the proper use of tempdb?
Tara Kizer: He references a script that the developers asked to performance tune. It sounds like it’s a hefty script. It’s just really long and needs to do all sorts of stuff. That’s the type of script you probably need to be using temp tables. It’s hard to really answer the specific question, but as far as massive abuse of tempdb, I don’t really discourage developers from using tempdb. So what if it gets big?
Erik Darling: I have a similar take on that. So what if tempdb gets big? But there are uses for temp tables that are better than others. A lot of people will fall back to temp tables because when they do their joins with regular tables, they suck. They have like 20 joins or they have to do all this awful joining together. There’s like no way to index for them because there’s all sorts of different sorts in like every single column. I understand why people dump stuff off to tempdb. My main pet peeve is when they just like blindly dump data in there without filtering it down to a kind of reasonable result set first. That’s my big pet peeve, is when there’s like some SELECT * INTO FROM table, no WHERE clause, nothing else like that. Anyway.
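To make that pet peeve concrete, here is a sketch contrasting the two patterns. The table and column names are made up for illustration:

```sql
-- The pet peeve: dumping an entire table into tempdb, no WHERE clause.
SELECT *
INTO   #AllTheOrders
FROM   dbo.Orders;

-- Friendlier to tempdb: filter down to the rows and columns the script
-- actually needs before the data ever lands in a temp table.
SELECT OrderId, CustomerId, OrderDate, TotalDue
INTO   #RecentOrders
FROM   dbo.Orders
WHERE  OrderDate >= DATEADD(MONTH, -3, GETDATE());
```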
Richie Rump: The other thing in that question is a few hundred gigs almost maxed out your drive, maybe you need to get a bigger drive.
Tara Kizer: I’ve supported a tempdb that was 500 gigabytes normally. This was a mount point so it was dedicated tempdb. We got a storage alert that we were running out of space. I was the on-call DBA and checked it. It was a business user running a query. This was production data but this was a read-only replica for availability groups. It wasn’t the writeable side, just the readable side, where business users could run these gigantic queries. In order to complete this specific query that caused it to grow past 500 gigabytes in size, we had to tell them they need to start breaking up that query. It was this massive, single SELECT query. All sorts of stuff going on. I don’t know why it was using tempdb, because it wasn’t a temp table. Maybe it was an ORDER BY, I forget. I don’t remember what the actual script was, but that is what was needed for this system. Yeah, you can break apart your queries to do smaller work at a time so a single transaction doesn’t cause it to grow that big, but sometimes the instance just needs a large tempdb. It really depends on your workload.
Erik Darling: If you’re dealing with terabyte-plus size databases, eventually you’re going to need a terabyte-plus size tempdb. That’s just the way things go, and not even just for queries. You do maintenance on a big enough table and SQL goes to work in tempdb—there are tons of questions about it. There’s even a specific error you get when you try to sort something in tempdb and there’s not enough space and the file grows and it goes … it’s not fun. Tempdb should be sized in reaction to the size of your data, not to how big you generically think tempdb should be.
Erik Darling: Nick asks, “What is the thing being clustered or not when creating an index?” The key.
Tara Kizer: Just the key. It’s sorted.
Erik Darling: Yeah.
Tara Kizer: Physically sorted that way.
Erik Darling: It’s really quite boring. Whatever columns you choose to be the key of your clustered index, SQL sorts the clustered index by those columns and then at the very leaf level of the index, it has all the rest of the columns in the table hanging out down there, ready to be referenced.
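A quick sketch of what Erik is describing, using a hypothetical table:

```sql
CREATE TABLE dbo.Users
(
    UserId      INT IDENTITY(1,1) NOT NULL,
    DisplayName NVARCHAR(100)     NOT NULL,
    CreatedDate DATETIME2         NOT NULL
);

-- The key is just the column(s) named here: SQL Server sorts the table
-- by UserId, and DisplayName and CreatedDate ride along at the leaf
-- level of the index, ready to be referenced.
CREATE CLUSTERED INDEX CX_Users_UserId
    ON dbo.Users (UserId);
```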
Richie Rump: Right. Most of the time it’s your primary key but it doesn’t have to be your primary key. It doesn’t have to be. You want to think about how the queries are being written and how you’re reading data from it and maybe you’ll have a better primary clustered index. Maybe.
Erik Darling: I think questions like this come up because different database platforms have different but closely related terminology. In Oracle there’s a thing called a cluster index, where you can actually cluster tables together with an index so that values you want to join together will be closer together on disk. I’m not an Oracle expert, but from the brief time I spent reading about Oracle, I was like, “This is exciting.” That’s what I got from it. Think of it what you will.
How can I determine the overhead of full text indexing?
Erik Darling: Here’s an interesting one. By interesting I mean odd. “My devs want to implement full text indexing. How can I predetermine what the overhead of this will be? Is there some metric that can tie the amount of data in the indexes to the drain on system resources?” Not that I know of.
Tara Kizer: I don’t think so. To answer it you have to have a load test environment where you do a synthetic load and you measure it yourself.
Erik Darling: Yeah, this is one of those things where full text indexing is going to be a little different for everyone depending on how you use it. It’s really hard to sort of crystal-ball what the load would be on your system. Unfortunately, what I would say, and probably what Richie is itching to say, is that if you want to use full text indexing, you should be looking at a different tool altogether. Richie, take it away.
Richie Rump: Yeah, full text indexing has been around a long time and hasn’t really been touched in a long time. During that period, a lot of tools have come along like Apache Solr and Elasticsearch—Elasticsearch is probably the one everyone gravitates to—that do a much better job of searching text than full text search does. There are always these weird, odd little things that full text search does that you don’t really expect it to do. Elasticsearch is so much better at searching text-type stuff. That’s what it was designed to do, in a way that SQL Server isn’t. It would be interesting to see if SQL Server wants to jump into that in the future like it did with the Hekaton stuff, but right now we’re not hearing anything about that. I say go do a prototype on Elasticsearch. You’ll feel much better about yourself. You’ll be able to give the users a lot more results and do a lot more with it than you will with full text.
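If you want to try that prototype, a minimal Elasticsearch smoke test can be done with curl alone. This assumes a node listening on localhost:9200; the index and field names are made up:

```shell
# Index a document (refresh=true makes it immediately searchable),
# then run a full-text match query against it.
curl -X POST "localhost:9200/articles/_doc/1?refresh=true" \
     -H 'Content-Type: application/json' \
     -d '{"title": "Sizing tempdb", "body": "How to size tempdb for large workloads"}'

curl -X GET "localhost:9200/articles/_search" \
     -H 'Content-Type: application/json' \
     -d '{"query": {"match": {"body": "tempdb sizing"}}}'
```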
Erik Darling: I like Elasticsearch a lot. I agree with Richie on that.
What should I do with my career?
Erik Darling: Thomas asks, this is a similar question to one we’ve gotten before. “I’m trying to decide where my career should go. I have no development skills.” Good.
Richie Rump: Time to learn, bud.
Erik Darling: He wants to know, “Should I go into cloud, security, data science…?” Talk about falling off on the end there. [Laughter]
Erik Darling: Pick something else random. Should I go into seashell collecting?
Richie Rump: I hear bus driving is pretty good these days.
Erik Darling: Yeah, why not? So, take it away, folks. What would you do if you were young Thomas?
Tara Kizer: It’s confusing. Do you want to improve your DBA skills, or are you looking to move your career down a different path? You talk about cloud and security and data science. Maybe cloud, we’ll do a little bit of that as DBAs, if our company is using cloud stuff. But security, no. Data science, not for a DBA. What are you trying to do? Are you trying to go down a different path, or do you want to remain a DBA?
Erik Darling: Security, I guess, to me anyway, seems like a pretty limited field for databases. Like once you get past securing perimeter stuff and once you get into, “How am I going to manage security on the database?” You read a couple of chapters from BOL about users and logins and roles and permissions and stuff. That’s kind of database security in a nutshell. You figure out what port you’re going to run it on and then what? So I don’t know if security is what I would go into if I were a database guy, unless I was really into security. But if you’re going to go really into security, then be prepared to step away from the database. There’s a much wider world out there than just database security.
Richie Rump: Data science. I think data science comes up because it’s at the top of all these “hottest technology jobs” or “hottest jobs ever” lists. Data science, woo! Until you meet an actual data scientist, and not someone who just calls themselves a data scientist because maybe they read a book or two, you don’t really understand what level they’re at. I was lucky enough to work for a few hours with a real data scientist. My brain started leaking out of my ears. Essentially, they’re statisticians. They’re statisticians who understand the technology side of it as well. This particular individual had a PhD in statistics. Those are the types of individuals who go and excel in data science. They’re very heavy on the stats aspect of it. People call themselves data scientists now and it’s just a marketing term. I mean, am I a programmer or am I a developer? Am I an app developer? It’s the same thing, and in five years they’ll call it something else. But that is the kind of background I see when I see a real data scientist. That’s the type of training you would be looking into: a PhD in statistics.
Erik Darling: If you want to read some DBA-friendly writing on data science stuff, Buck Woody has a blog, Backyard Data Science, where he posts occasionally. I don’t really know what his blogging schedule is. I see it pop up once in a while in my RSS feed, but I don’t think he blogs regularly-regularly there. But his posts are always good and entertaining, and there’s enough of a back catalog for you to read at this point. So that’s Backyard Data Science, Buck Woody’s blog. Really smart, funny guy. Good stuff on there. If you want a DBA-friendly-ish intro to data science stuff, I would head over there and read. I’d probably try not to let my boss see me doing that, though, because then he’d know I’m trying to switch jobs.
Richie Rump: Yeah, Buck is the best.
Erik Darling: Buck is the best.
Should I be using OPENROWSET for this?
Erik Darling: Joe asks—oh boy. “Can you recommend a replacement for OPENROWSET for putting the output of a stored procedure into a table variable or temp table?” Why is OPENROWSET not working for you? What is happening?
Tara Kizer: Why are you using OPENROWSET just to capture the output of a stored procedure, though? Just use INSERT INTO … EXEC stored procedure. Is this for a linked server? What are you using OPENROWSET for?
Erik Darling: There are different uses for it. I was using it in some test functionality for BlitzFirst, to get BlitzFirst to run for the duration of some ad hoc SQL or a stored procedure, because with OPENROWSET I can pretty easily dump stuff into a temp table to relieve the server of having to return all the data. So a follow-up question: what about OPENROWSET isn’t working, and what exactly are you trying to do with these rows?
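For reference, the OPENROWSET-free pattern Tara mentions looks like this. The procedure and column names are hypothetical:

```sql
-- Create a temp table whose shape matches the procedure's result set,
-- then capture the output directly.
CREATE TABLE #Results
(
    UserId      INT,
    DisplayName NVARCHAR(100)
);

INSERT INTO #Results (UserId, DisplayName)
EXEC dbo.usp_GetActiveUsers;
```

One caveat: INSERT … EXEC statements can’t be nested, which is one reason people reach for OPENROWSET in the first place.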
How important is physical database design when I have SSDs?
Erik Darling: Here’s one from Matthew. I think Richie is going to have something on this. “How important is physical database design in an era of SSDs, SANs, etc.?”
Richie Rump: I typically still do the physical design. Mainly it’s because a lot of the tools still kind of force you to do a physical design to get it out. A lot of times, if you’re designing for multiple databases, it’s helpful to do your logical design and then your physical design, so you can output the logical design into multiple databases. Don’t think of physical design as “I’m trying to make things faster.” I’ve never looked at it that way. It’s a way to get your indexes in and things like that, stuff that doesn’t really fit well into your logical design. If you want to read more about that, Louis Davidson’s Pro SQL Server Relational Database Design and Implementation is a good book on it. So physical design doesn’t necessarily mean I’ve got to care about what’s going on in the hardware aspect of it. It’s how it’s physically going to be implemented in the database, whatever platform you’re running.
Erik Darling: A lot of people used to make a much bigger deal about separating data and log files and putting certain indexes on certain filegroups off somewhere else, which made a lot more sense when storage was direct attached. For SSDs, if they’re direct attached, sure, you can still see some difference there. On a SAN, it just doesn’t make a lick of difference. Not a single drop. Tara, what say you, ye?
Tara Kizer: The question is “in an era of SSDs and SANs.” A lot of companies don’t have those even though they’ve been around for a lot of years. So physical databases are still really important to a lot of companies, your hardware aspect, your disks. We’ve got a lot of clients that have slow I/O issues. They’re just running 15k disks, no SSDs. A lot of them don’t have SANs. Physical databases, as far as hardware goes, is really important to those types of companies because they’re not getting it right, that’s for sure.
Richie Rump: Yeah, I haven’t seen a server like that in many years, maybe a couple of decades. It’s been a very long time. In fact, I’d just go ahead and spin up a cluster in Amazon and, poof, I don’t have to deal with any of that stuff either.
Erik Darling: That’s not exactly true because you can pick different kinds of storage up in the cloud.
Richie Rump: Yes, but [inaudible] stuff I’m using right now.
Erik Darling: Of course you do.
Tara Kizer: Erik and I know how bad the disks are in the cloud, the default ones.
Richie Rump: Yeah, they’re terrible.
Tara Kizer: You can’t do anything.
Erik Darling: No, it’s like just sludge, man. Sludge.
Richie Rump: Well since you’ll be using this new system, I’m going to definitely choose the slow disks for your stuff.
Erik Darling: It’s okay because I’m going to choose a box with the most RAM on it so I don’t have to deal with that.
Tara Kizer: There you go.
Erik Darling: Stick my thumb in your eye.
Why does our SQL Server’s CPU go up at 9:30pm?
Erik Darling: Daniel has a question. He starts his question with, “I have a question.” Threw me off my whole game. “For some reason the CPU on our production server has been spiking regularly every night around 9:30 for the past week. I have Redgate’s monitor.” Okay. “I look at the timeframe and it is not reporting any heavy queries running. Is this a virtualization issue? The top ten queries at that time are running in less than 5 seconds each.” Anyone? Virtualized?
Tara Kizer: I had a client where we couldn’t figure out why the CPU was spiking. It was a virtual environment. Brent said to take a look at Perfmon and add the counter for CPU utilization, and there’s another one further down below, I think it was like virtual CPU. That would have told you what the CPU utilization was on the virtual host, not just your guest. He was saying that there would be a discrepancy there. The issue ended up being on the host, not the actual VM.
Erik Darling: This is one of those horrible downsides of VMs that no one really talks about, especially when you put your SQL Server out there. You have VM admins, who are not DBAs, who will shuffle a whole bunch of other guests onto your host if they need to, for no particular reason, or they have automated load balancing set up, and you deal with the noisy neighbor. So unless you have reservations set up for your VM, where you get to hang on to your CPUs and your memory and all that stuff, you can run into some issues. It may not be your SQL Server’s fault. So outside of SQL Server, I would yell at my VM admins and see what they’re doing. See if they’re shuffling people onto my host or whatever. In a lot of VM environments, if they are putting SQL Server on them and they’re hosting multiple SQL Servers, those hosts will be kind of like holy ground. They will not let other kinds of servers live on those. They will be for SQL Servers only, and those SQL Servers will take up an even portion of the resources on there, like either half and half or in quarters. I would take a closer look at my VM setup for that.
Tara Kizer: I would wonder, the CPU is spiking, but do you know that it’s the SQL Server .exe process? Have you gone down to that level in Perfmon to see exactly which process is the one that’s spiking? If it truly is SQL Server, Redgate’s monitor and all of those other monitoring tools aren’t looking at every single query that runs during a timeframe. They’re sampling it. Every few seconds they take a look at what’s running. So maybe you don’t have any long-running queries, but maybe you have a spike in the number of queries that are running. Maybe even just run a server-side trace or an extended events session and gather everything for a couple of minutes. Don’t run it for a very long time, just maybe five minutes or so. That way you can collect the data, put it into a table, and then order by the CPU column and see what happened there.
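A short-lived Extended Events session along the lines Tara describes might look like this. The session name, file path, and CPU threshold here are just examples; adjust them to taste:

```sql
CREATE EVENT SESSION [CpuSpike] ON SERVER
ADD EVENT sqlserver.sql_statement_completed
(
    ACTION (sqlserver.sql_text, sqlserver.database_name)
    WHERE cpu_time > 100000  -- cpu_time is reported in microseconds
)
ADD TARGET package0.event_file
(
    SET filename = N'C:\Temp\CpuSpike.xel'
);

-- Start it shortly before 9:30, let it run for a few minutes, then stop it
-- and read the .xel file to see which statements burned the CPU.
ALTER EVENT SESSION [CpuSpike] ON SERVER STATE = START;
ALTER EVENT SESSION [CpuSpike] ON SERVER STATE = STOP;
```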
Erik Darling: Works for me, man.
Do I need a data lake?
Erik Darling: Eric Swiggum, I wish I had an answer for you but I don’t know. “I had a developer come to me saying they need a data lake now but they seem to be confused about the concept, as am I, and I’m not sure if the implementation will be successful.” Aside from coming up with a POC and testing it, I don’t have any good advice. I don’t know if you need a data lake either.
Tara Kizer: If they’re confused about it also, why are they asking for it? I’d say if you don’t know what it is, you don’t get to use it.
Richie Rump: There will be a test at the end of the webinar.
Erik Darling: If I were you I’d make them watch a bunch of those Jason movies and see if they still want a data lake.
Should I add developer skills to my resume?
Erik Darling: Thomas follows up. He says, “I work mostly on the infrastructure side right now. I’m looking to expand beyond that.” He likes all the tech and it’s hard picking one thing. “Seems like most DBA jobs are developer DBA rather than production DBA work. That’s why I’m looking to get some focus on that side.”
Tara Kizer: I wonder what industry you’re in that you’re mostly seeing development DBA stuff—or not industry—what city you’re in, because there are a ton of production DBA jobs out there when I get the LinkedIn emails and stuff. I don’t really see a whole lot of developer DBA jobs here. Sometimes it will be named a little bit differently, but for the most part I’ve only seen production DBA jobs.
Erik Darling: When you look at job postings like that, try to take them with a grain of salt. Really read the job description. If you send in your resume and you go in for an interview, I would really prod the people about what exactly their expectations of a DBA are, because they might expect you to develop something that is not a performance thing. They may not expect you to develop queries that are going to be running for clients or whatever; they may expect you to develop some other process, like if they want to set up partitioning and automate it. They may be looking at something like that, and that sort of thing could be production DBA work because, of course, partitioning is not a performance feature.
Should I consider the cloud for DR?
Tara Kizer: I like Graham’s question.
Erik Darling: Yeah, go for it. You read that one.
Tara Kizer: “As DBA, I’ve asked about having a cloud DR strategy and in the past I’ve been told it’s too expensive. Our area has been experiencing a natural disaster.” I wonder if this is the spillway that’s having issues in northern California—I forget the city name—Oroville. It’s near Sacramento. I think it’s Oroville. But anyways. “I brought up exploring a cloud DR strategy and was told the same thing, too expensive.” I’m kind of surprised you’re getting that answer for cloud servers. “By the way, we have no secondary center and on-site backups. If my org can’t grasp the need for a DR strategy now, will they ever?” If there’s a natural disaster happening near you and they aren’t freaking out already, if you’re in one of those towns that could possibly get flooded, I mean, they probably just don’t want to spend the money. I’ve always worked at companies that had DR sites. These days, with the cloud solutions out there, you have a much cheaper option than having your own DR site and servers and managing all that. We’ve got a white paper coming up, and I know we’re allowed to talk about this now, that we wrote for Google about copying your backups into Google’s cloud, into a storage bucket. You don’t have a server up, but if you ever had a natural disaster, you could spin up a VM in the cloud, copy those files down to that server, and get up and running on that. That would be a pretty cheap solution because you’re only keeping your files in the cloud. You’re not hosting the server until you actually need it.
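A minimal sketch of that backups-only approach, assuming the Google Cloud SDK is installed and the bucket already exists (bucket and path names are made up for illustration):

```shell
# Copy local SQL Server backups up to a Google Cloud Storage bucket.
gsutil -m cp /backups/*.bak gs://example-dr-backups/sqlserver/

# In a disaster, spin up a VM and pull the files back down to restore.
gsutil -m cp "gs://example-dr-backups/sqlserver/*.bak" /restore/
```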
Richie Rump: Yeah, maybe when they lose everything, their minds will change. That’s probably what’s going to have to end up happening. When the whole building gets wiped out, and all of a sudden all they have is a backup that maybe you took, that’s probably when they’ll say, “You know, maybe we should get something that we could bring up a little faster than a couple of months.”
Tara Kizer: Do you at least have your backups being copied to tape and going off-site, Iron Mountain, that type of thing?
Erik Darling: It doesn’t sound like it.
Tara Kizer: I would hope something is going offsite.
Erik Darling: I think if I were the DBA in that position I would be totally fine with that as long as I had it in writing.
Tara Kizer: Yeah.
Erik Darling: As long as I was like, “This is what I told you guys however many months back and you said no because it was too expensive. Now we’re in this position. I can’t do anything for you.” As long as I had that in writing I would be totally cool with being like, “I have one less computer to manage.”
Tara Kizer: Talk to them about RPO and RTO goals. A lot of companies have those numbers for on-site issues. It may be they have the same RPO and RTO goals for disaster recovery but show them what the actual reality is of those RPO/RTO. Your goals are not the same thing as your current state. Let them know how long it would take to bring up production if a natural disaster occurred and maybe you’d have total data loss if your backups aren’t going anywhere.
Richie Rump: Yeah, along with that, I used to be a project manager and I always talked to my sponsors about risk. I used to have like a punch board of, hey, if this risk occurs, what’s the severity, and what’s the likelihood that it’s going to occur: high, medium, or low. Then, how would we fix it? Having these lists and sharing them with people and saying, “These are our risks: high, medium, and low. There’s a high risk for this to happen,” or maybe it’s a low risk but it will cost a lot of effort to get this thing back up and running. Having those types of things in front of people will actually make them go, “Oh, I didn’t know that was a problem, because I never thought about it, because I’m the VP of whatever and frankly servers aren’t my deal.” So keeping that kind of list was really helpful when I was a PM.
What’s the best branching strategy for database source control?
Erik Darling: Richie, I’m sorry, you just got done talking but I’m going to pick on you again. Matthew asks, “What is the best branching strategy for source control in a database project? We are using VSO front end and a Git back end.”
Tara Kizer: What’s VSO?
Richie Rump: Visual Studio Online.
Tara Kizer: Online? Okay.
Richie Rump: It depends. I think when you talk about branching with databases, I still haven’t found a database strategy that I like. Code is a lot easier but databases are a little bit harder mainly because of the deployment aspect of it than anything else. I’d say go for it. Pick one, whether you’re using something like the Git strategy or whatever and see how it works. Whatever sticks with your team is usually the one that I would stick with. Pick one, try it for a few weeks. If it doesn’t work, go for something else that kind of works with your team. I think branching is more what works with your team than actually what works technically. So just try something with your team. Talk it over with them, get some feedback. If it works, stick with it. If it doesn’t, go somewhere else.
Erik Darling: Yeah, I think it’s a bit more of a cultural thing than a tech thing. But what do I know? I’m not a developer.
Does full recovery model slow down performance?
Erik Darling: J.H. has an interesting question, “Does putting a database in full recovery model slow down performance?”
Tara Kizer: He said model!
Richie Rump: It’s catching on.
Erik Darling: “And are there any native methods to query the transaction log?” Yes and yes. That’s my answer. Tara, what do you say?
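On the second half of the question, the closest thing to a native way to read the log is the undocumented (and unsupported) sys.fn_dblog function:

```sql
-- Peek at the active portion of the current database's transaction log.
-- NULL, NULL means from the earliest available LSN to the latest.
SELECT TOP (100)
       [Current LSN],
       [Operation],
       [Transaction ID],
       [AllocUnitName]
FROM   sys.fn_dblog(NULL, NULL);
```

It’s handy for poking around, but since it’s undocumented, don’t build production processes on it.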
Tara Kizer: Does it really affect performance though?
Erik Darling: Inserts, updates, and deletes.
Tara Kizer: Even if you use simple recovery model, it’s still writing those transactions to the transaction log. You’re not getting rid of that step. They still get written to the transaction log; it’s just what happens at the end of the transaction that differs. If you’re using simple recovery model, that transaction is then cleared from the transaction log. So, yeah, you do have to do log backups for full recovery or bulk-logged, but does the recovery model really slow down performance?
Erik Darling: Yeah, so one thing that a lot of people might be getting in simple recovery without knowing it is minimal logging.
Tara Kizer: Right.
Erik Darling: Minimal logging is pretty magical for inserts. So if you’re getting minimal logging sort of by accident, if it’s like Tuesday on Mars and everything else is aligned for you and you haphazardly get it, then, yes, it can totally speed up inserts like crazy. But for other stuff, not really.
Tara Kizer: What version did that start out in?
Erik Darling: Minimal logging? Oh, god, 2005 maybe?
Tara Kizer: Oh really? Okay.
Erik Darling: It’s been around forever. So like your database has to be in simple or bulk-logged. And this is probably why—you probably never saw much of it because you’re Miss Full Recovery Model.
Tara Kizer: Yeah, I was. Only non-prod use. I never even used bulk-logged because your RPO is destroyed with bulk-logged.
Erik Darling: Bulk-logged is like a weird joke.
Tara Kizer: Yeah, I knew about minimal logging as far as truncate and bulk inserts and things like that, but are you saying that the INSERT INTO T-SQL command can use minimal logging?
Erik Darling: Yeah.
Tara Kizer: Okay, I’ll look into that.
Erik Darling: …sometimes you have to use a trace flag.
Tara Kizer: Okay.
Erik Darling: It only works on tables with like an empty clustered index and no non-clustered indexes, or a heap…
Tara Kizer: So a lot of rules.
Erik Darling: If you’re accidentally getting it, it’s really helpful. Sometimes even when you’re doing everything right it doesn’t work. Minimal logging. It’s one of those things.
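Pulling together the conditions Erik lists, a minimally logged insert might be shaped like this. The table names are illustrative; the target needs to be empty, the database in simple or bulk-logged recovery, and the TABLOCK hint is what signals the bulk-load path (the trace flag Erik alludes to relaxed some of these rules on older versions):

```sql
-- Target: an empty heap or empty clustered-index table with no
-- nonclustered indexes, database in SIMPLE or BULK_LOGGED recovery.
INSERT INTO dbo.StagingOrders WITH (TABLOCK)
        (OrderId, CustomerId, OrderDate)
SELECT  OrderId, CustomerId, OrderDate
FROM    dbo.Orders;
```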
Tara Kizer: Michael asked what RPO is, because I kept saying RPO and RTO. RPO is your recovery point objective, that’s how much data your company can lose. RTO is your recovery time objective, basically how long you can be down. How long is it going to take you to restore, fail over, or whatever it is.
Richie Rump: I’m pretty sure we have a page on that.
Tara Kizer: We do.
Erik Darling: Like a billion of them. Billions of pages. All right, thank you for joining us. We’ll see you next week. Goodbye, everyone. Adios.