Here at Brent Ozar Unlimited
We have a proud tradition of not blaming index fragmentation for everything. There are points you should deal with it, but they’re probably not 5% and 30% and 1000 pages. But that’s not what this blog post is about. I’m hoping to clarify why we’re more interested in up to date statistics, and also why statistics outside of indexes aren’t really the most helpful thing. If they were, we’d all just create statistics and every query would magically blaze from select to offset without a care in the world.
Statistics: It’s What’s Inside That Counts
Statistics are what SQL Server uses to figure out how many rows to expect from accessing a data structure. You can do some things that fundamentally break this, like using functions in joins or where clauses, using local variables or optimize for unknown, using table variables without recompile hints, and a sniffed parameter can just allow SQL to run a query without guessing at all. It already guessed once. No tag backs. As you may imagine, this is important information for SQL to have for running your queries optimally.
Indexes: Pride and Vanity
Indexes of the nonclustered variety contain subsets of your table or view’s data. Clustered ones are your table or view data ordered by the clustering key(s). Focusing on the nonclustered variety, they’re the “and the [band name]” to your clustered index’s “[lead singer name]”, and they’re great for providing SQL with a more appropriate data structure for your query.
If you don’t need to select all the columns, and you do need to filter, join, order by, or aggregate a column not in the key of the clustered index, nonclustered indexes get a solo after the chorus. Give the drummer some.
Nonclustered indexes will, under normal circumstances, get a statistics object created with rolling density information for the key columns going left to right, and a histogram on only the leftmost column.
With that out of the way
Why do we care more about statistics being updated than indexes being fragmented? Largely, because reading pages with some empty space from a fragmented index is oh-so-very-rarely the root cause of a performance issue. Especially if those pages are already in memory. Out of date statistics can allow SQL to continue to make some really bad guesses, and keep giving you a lousy execution plan no matter which way you tune your query.
The bigger your table gets, the worse the problem gets. Prior to 2016, if you don’t turn on Trace Flag 2371, about 20% of your table’s rows need to change before an automatic statistics update kicks in. For a 100 million row table, this can be a long ways off. Poor cardinality estimates here can really sink you. Rebuilding indexes for a 100 million row table is a B-U-M-M-E-R.
Log Shipping? Mirroring? Availability Group? Good luck with that.
Crappy server? Low memory? Slow disk? Dead man walking.
You may ultimately spend more time and expend more server resources defragmenting indexes than your queries will spend reading extra pages from fragmented indexes. Rebuilding or reorganizing large indexes can be a special kind of brutal.
Consider the process
Read a bunch of index pages with sys.dm_db_index_physical_stats to figure out if there’s fragmentation, reorganize or rebuild based on feedback.
- Reorganize is online but single threaded and can take FOREVER on big tables, especially if you’re compacting LOBs.
- Rebuild is offline and single threaded in Standard, online and potentially parallel in Enterprise, but you better hope you’re patched up so you don’t corrupt anything.
Is that worth it? Every night? For every index on every table in every user database? Only if you can prove it.
The one upside to Rebuilding is that it also updates statistics with a full scan. Think about this the next time you say something like “rebuilding the index fixed the problem”, you may have an epiphany on the way.
Wait for it.
Wait for it.
Wait for it.
Statistics with no indexes
SQL Server doesn’t make easy work of getting information about your Statistics, or finding out which statistics get used. Even at the query level, you have to use a spate of Trace Flags to find out what gets loaded and looked at. Cached plans don’t fare much better.
No wonder everyone cares about indexes and their fragmentation. Microsoft has made information about them easy and abundant, while Statistics are kept hidden in the basement next to piles of soft bones and a bowl of hot blood.
Head rush moment: SQL may use information from histograms outside of the index it chooses for cardinality estimation.
Back to earth: If you just create a bunch of statistics instead of indexes, you’re (at best) using your Clustered Index for everything (which is still bad), or you’re using a HEAP for everything (which is usually worse). You’re still generally better off creating good indexes for your workload. They’ll get statistics objects created and associated with them, and if SQL thinks another column is interesting, it will create a single column statistics object for it, as long as you haven’t turned off auto create stats.
Sure, you can put on your black cloak and goat mask and create some multi-column or filtered statistics, but in the words of a wise man (Doug), you end up with more stats to maintain and understanding query behavior gets more difficult.
Filtered statistics suffer from a problem where they don’t automatically update based on the filtered row count, but rather the table row count. Imagine you have a 100 million row table, and your filtered index is on 1 million rows. All million of those rows might change, but the statistics on that index won’t. 1 million is not 20% of 100 million. You’ll have to update the statistics manually, or rebuild the filtered index.
Multi-column statistics are hardly a replacement for a multi-column index, and it’s not like you get an enhanced histogram that includes the second column. It’s just like a normal histogram. All you get is the density information for the columns you throw in there. Boo hiss.
Moral of the story (B-B-B-B-B-B-BENNY WITHOUT THE JETS)
Indexes are a really important factor for performance, but index fragmentation very rarely is. Statistics are super helpful when they’re not out of date, and getting them up to date is much easier on your server’s nerves. Though not perfect, I’d rather take my chances here. Updating statistics can also cause a bad execution plan to get flushed out of the cache. On their own they can sometimes help queries, but you should only end up here after you’ve really tuned your indexes.
Unless you can establish a metric that makes nightly index defragmentation worthwhile, don’t jump to it as the default. Try just updating statistics. You may find that nothing at all changes, and you now have many extra hours a night to do maintenance. Like run DBCC CHECKDB. If you think index fragmentation is a performance problem, try corruption sometime. That’s not what Brent meant when he said “the fastest query is one you never make.”
Thanks for reading!
TDE is one of those things!
You either need it, and quickly learn how many things it plays the devil with, or you don’t need it, and there but for the grace of God go you. Off you go, with your compressed backups, your instant file initialization, your simple restore processes. Sod off, junior.
But Microsoft maybe listened or figured out something by accident. I don’t know which one yet, but they seem excited about it! And I am too! If you read this blog post that’s probably also being monitored closely by Chris Hansen, you’ll see why.
Backup compression now works with TDE
Cool! Great! Everyone encrypt your data and compress your backups. It’s fun. I promise.
Not satisfied with a few meek and meager data points, I set out to see if increasing Max Transfer Size also increased the degree of compression. Why? This paragraph.
It is important to know that while backing up a TDE-enable database, the compression will kick in ONLY if MAXTRANSFERSIZE is specified in the BACKUP command. Moreover, the value of MAXTRANSFERSIZE must be greater than 65536 (64 KB). The minimum value of the MAXTRANSFERSIZE parameter is 65536, and if you specify MAXTRANSFERSIZE = 65536 in the BACKUP command, then compression will not kick in. It must be “greater than” 65536. In fact, 65537 will do just good. It is recommended that you determine your optimum MAXTRANSFERSIZE through testing, based on your workload and storage subsystem. The default value of MAXTRANSFERSIZE for most devices is 1 MB, however, if you rely on the default, and skip specifying MAXTRANSFERSIZE explicitly in your BACKUP command, compression will be skipped.
It left things open ended for me. Unfortunately for me, it doesn’t help. Fortunately for you, you don’t have to wonder about it.
Check out the exxxtra large screen cap below, and we’ll talk about a few points.
First, the database without a Max Transfer Size at the bottom was a full backup I took with compression, before applying TDE. It took a little longer because I actually backed it up to disk. All of the looped backups I took after TDE was enabled, and Max Transfer Size was set, were backed up to NUL. This was going to take long enough to process without backing up to Hyper-V VM disks and blah blah blah.
The second backup up, just like the blog man said, no compression happens when you specify 65536 as the Max Transfer Size.
You can see pretty well that the difference between compressed backup sizes with and without TDE is negligible.
The most interesting part to me was the plateau in how long each backup took after a certain Max Transfer Size. Right around the 1MB mark, it hits the low 380s, and never strays very far from there afterwards. I could have tried other stuff to make this go faster, but my main interest was testing compression levels.
There you have it
Max Transfer Size doesn’t impact compression levels, but it can help duration. If you want to keep playing with switches, you can throw in Buffer Count, and try striping backups across multiple files to ‘parallelize’ output.
Thanks for reading!
From the Mailbag
In another post I did on Direct Seeding, reader Bryan Aubuchon asked if it plays nicely with TDE. I’ll be honest with you, TDE is one of the last things I test interoperability with. It’s annoying that it breaks Instant File Initialization, and mucks up backup compression. But I totally get the need for it, so I do eventually get to it.
The TL;DR here
Is that if you encrypt a database that’s already taking part in a Direct Seeding relationship, everything is fine. If you already have an encrypted database that you want to add to your Availability Group, Direct Seeding has a tough time with it.
I don’t think this is an outright attempt to push people to AlwaysEncrypted, because it has a lot of limitations.
Let’s walk through this
Because I love reader sanity checks, here we go. Microsoft tells you how to add a database encrypted with TDE to an existing Availability Group here.
That all sounds good! So let’s follow directions. We need a database! We also need a password, and a certificate. Alright, we can do this. We’re competent adults.
/*Create databse on acceptable path to all Replicas*/ CREATE DATABASE EncryptedCrap ON PRIMARY ( NAME = 'EncryptedCrap', FILENAME = 'E:\Crap\EncryptedCrap.mdf') LOG ON ( NAME = 'EncryptedCrap_log', FILENAME = 'E:\Crap\EncryptedCrap_log.ldf'); /*Create key*/ CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'GreatP0stBrent!' GO /*Create cert*/ CREATE CERTIFICATE EncryptedCrapCert WITH SUBJECT = 'If you can read this I probably got fired.'
Alright, cool. We did that. Now we have to get all up in our database and scramble its bits.
/*Get into database*/ USE EncryptedCrap GO /*Create database encryption key*/ CREATE DATABASE ENCRYPTION KEY WITH ALGORITHM = AES_128 ENCRYPTION BY SERVER CERTIFICATE EncryptedCrapCert GO /*Turn encryption on*/ ALTER DATABASE EncryptedCrap SET ENCRYPTION ON
SQLCMD Appreciation Header
Few things in life will make you appreciate SQLCMD mode like working with Availability Groups. You can keep your PowerShell. $.hove-it; I’m with SQLCMD.
Stick with me through the next part. You may have to do this someday.
/*Back into master*/ USE master GO /*Backup cert to fileshare*/ BACKUP CERTIFICATE EncryptedCrapCert TO FILE = '\\Sqldc01\sqlcl1-fsw\NothingImportant\EncryptedCrap.cer' WITH PRIVATE KEY (FILE = '\\Sqldc01\sqlcl1-fsw\NothingImportant\EncryptedCrap.pvk' , ENCRYPTION BY PASSWORD = 'GreatP0stBrent!' ) GO :CONNECT SQLVM02\AGNODE2 USE master GO /*Set up password*/ CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'GreatP0stBrent!' GO /*Restore cert from share*/ CREATE CERTIFICATE EncryptedCrapCert FROM FILE = '\\Sqldc01\sqlcl1-fsw\NothingImportant\EncryptedCrap.cer' WITH PRIVATE KEY (FILE = '\\Sqldc01\sqlcl1-fsw\NothingImportant\EncryptedCrap.pvk', DECRYPTION BY PASSWORD = 'GreatP0stBrent!'); GO :CONNECT SQLVM03\AGNODE3 USE master GO /*Set up password*/ CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'GreatP0stBrent!' GO /*Restore cert from share*/ CREATE CERTIFICATE EncryptedCrapCert FROM FILE = '\\Sqldc01\sqlcl1-fsw\NothingImportant\EncryptedCrap.cer' WITH PRIVATE KEY (FILE = '\\Sqldc01\sqlcl1-fsw\NothingImportant\EncryptedCrap.pvk', DECRYPTION BY PASSWORD = 'GreatP0stBrent!'); GO :CONNECT SQLVM01\AGNODE1 USE master GO ALTER AVAILABILITY GROUP SQLAG01 ADD DATABASE EncryptedCrap GO
What did we do?
Exactly what we did. We backed up our certificate to a network share, created a private key for it, and then on two replicas we created master passwords, and created certificates using the backup of our certificate from the primary. We did this in one SSMS window. Magical. Then we added our encrypted database to the Availability Group.
If this database weren’t encrypted, everything would probably go just fine. I say probably because, you know, computers are just the worst.
But because it is encrypted, we get some errors. On our Primary Replica, we get normal startup messages, and then messages about things failing with a transient error. Not sure what a transient error is. It forgot to tie its shoelaces before running to jump on that freight car.
On our Replicas, we get a different set of messages. Backup failures. Database doesn’t exist. More transient errors. This time you left an open can of pork beans by the barrel fire.
Over in our Extended Events session that tracks automatic seeding, we get an error code! searching for it doesn’t really turn up much. New features. Good luck with them.
One bright, shiny star of error message-y goodness shows up in our Physical Seeding Extended Event session. Look at all those potentially helpful failure codes! An individual could get a lot of useful information from those.
If only you weren’t being laughed at by the Gods of HA/DR. Some of the physical_seeding Extended Events have values here, but none of the automatic seeding ones do.
As of now
I don’t have a work around for this. The alternatives are to decrypt, and then re-encrypt your database after you add it, or add it the old fashioned way. Maybe something will change in the future, but as of now, these don’t appear to be compatible.
I’ve opened a Connect Item about this. I’d appreciate votes of the upward variety, if you feel so inclined.
Thanks for reading!
Let’s face it, most people don’t have just one SQL Server
How many they tell Microsoft they have is another matter, but let the record show that I don’t condone licensing dishonesty. But going one step further, most places… Well, they’re ‘lucky’ if they have one DBA, never mind a team.
Everyone else: Give me your network people, your sysadmin, your huddled SAN group yearning to breathe free, the wretched refuse of your teeming developers.
Doing things on one server is aggravating enough. Doing things on a bunch of servers is even worse. Given some of today’s HA/DR features (I’m looking at you, Availability Groups, with your lack of a mechanism to sync anything outside of user databases. Rude.) people are more and more likely to have lots of SQL Servers that they need to tend to.
Sometimes just keeping track of them is impossible. If you’re one guy with 20 servers, have fun scrolling through the connection list in SSMS trying to remember which one is which. Because people name things well, right? Here’s SQLVM27\Instance1, SQLVM27\Instance2, SQLVM27\Instance3, and that old legacy accounting database is around here somewhere.
Register it and forget it
But don’t actually forget it. If you forget it and it goes offline, people will look at you funny. Turns out people don’t like offline servers much.
So what’s someone to do with all these servers? Register them! Hidden deep in the View menu of SSMS is the Registered Servers window
It will look pretty barren at first, just an empty folder. But you’ll fill it up quick, I’m sure. Can never have enough servers around, you know.
It’s pretty easy to populate, you can right click on the Local Server Group folder, or on servers you’re connected to in Object Explorer.
Either way, you get the same dialog box to add a server in. You can give it a friendly name if you want! Maybe WIN03-SQL05\Misc doesn’t tell a good story.
And if you hip and hop over to the Connection Properties tab, you can set all sorts of nifty stuff up. The biggest one for me was to give different types of servers different colored tabs that the bottom of SSMS is highlighted with. It’s the one you’re probably looking at now that’s a putrid yellow-ish color and tells you you’re connected and that your query has been executing for three hours. Reassuring. Anyway, I’d use this to differentiate dev from prod servers. Just make sure to choose light colors, because the black text doesn’t show up on dark colors too well.
Another piece of advice here is not to mix servers on different major (and sometimes minor) versions. The reason is that this feature gives you the ability to query multiple servers at once. If you’re looking at DMVs, they can have different columns in them, and you’ll just get an error. Even a simple query to sys.databases will throw you a bonk between 2012 and 2014.
Even if you’re running 2008R2, there are some pretty big differences in DMVs between SP1 and SP3. Microsoft has been known to change stuff in CUs (I’m looking at you, Extended Events).
On the plus side, you can use your multi-server connection to SELECT @@VERSION to help you decide how you should group them. If they have something better in common, like participating in Log Shipping, Mirroring, an AG, etc., all the better.
But my favorite thing, because I was a devotee to the Blitz line of stored procedures even before I got paid to like them, was that I could install them on ALL OF MY SERVERS AT ONCE! This was especially useful when updates came out. You know what it’s like to put a stored proc on 20 servers one at a time? Geeeeeet outta here!
Check that out. It’s on both of my servers. At once. That means simultaneously, FYI. If you have a DBA or Admin database that you keep on all your servers to hold your fancy pants scripts and tools, this is an awesome way to make sure they all have the latest and greatest.
You’re already better at your job
Even though this feature came out in 2008, I hardly see anyone using it. I found it really helpful comparing indexes and query plans across app servers that held different client data across them. It also exposes far less than Linked Servers; you need to worry less about access and level of privilege.
Just don’t forget to export your list if you change laptops!
Thanks for reading!
One year ago today!
Well, sort of one year ago. Who knows when this thing will get published? Only God and Brent. And part of the calendar year fell during a leap year, which was just plain cruel, like a two part Buffy episode where the second part is the first episode of the next season DAMN YOU JOSS WHEDON!
Anyway, I started working here, annoying you guys with blog posts, giving Doug rub-downs in between video takes, and walking Ernie when the Ozar family was too full of illegal caviar and albino truffles to move. I also started running down the clock on being able to work with my favorite piece of software again. You can probably guess.
Seriously, I love this thing. Not just because many of the databases I worked with under the software were hundreds of gigs, on up to 9 terabytes, but because the people behind the software really do care about the product. The customer support is aces (Hello, Pod One), and the developers are super helpful and responsive.
Plus, it’s just plain interesting. You give lawyers this crazy interface that lets them build just about any search query they can dream of, including some really, really bad ones, and see how SQL Server reacts.
If you’re a DBA who has read the execution plan of a Relativity query where saved searches reference saved searches that reference saved searches that… you get the point! I feel your pain.
It’s not just the hardware
You can’t fix every saved search and workflow immediately, which makes right-sizing hardware super important, but that’s not the only thing. Every case is different, and they often need custom indexes.
If you’re a DBA who has watched performance tank because some new search suddenly started scanning the clustered index of your 50 million row, one terabyte Documents table with a wildcard search on Email Body and Email Subject and Email Sender and Email Recipients and Email Metadata for ‘insert super common word here’, I feel your pain.
Best in Service
My other favorite part about Relativity is that they have standards. Not last call for alcohol standards, either. They keep you, as a DBA, honest. No backups? Ding. No DBCC CHECKDB? Ding. Perf in the tank? Ding.
The challenges that you’re presented with at scale are immense. You have 100 terabytes of data and you need to check it for corruption weekly. How’s that gonna work?
Index and statistics maintenance can be super important, too. Fragmentation may not matter if you’re reading 1500 pages instead of 1000 pages, but it can sure as heck matter when you’re reading 1,500,000 pages rather than 1,000,000 pages. And all those ascending keys? SQL Server is gonna make some real bad judgement calls on those, especially prior to 2014.
It’s a wild, wild life
You have users bulk loading thousands to millions of documents, users updating records during review, and then searches running on top of all that.
I am thrilled to be able to work with my favorite product again. If you’re experiencing Relativity pains, drop us a line.
Thanks for reading!
Brent says: For a refresher on this app and how we work with it, check out our past posts on The SQL Server Components of kCura Relativity, Performance Tuning kCura Relativity, Using Partitioning to Make kCura Relativity Faster, and Tiering Relativity Databases.
As of CTP 3.3, it’s the same behavior as Trace Flag 2371, in 2008 R2 SP1 and onward. That basically means that the bigger your table is, the fewer rows need to be modified before an automatic statistics update occurs.
Remember in Mortal Kombat, when getting to fight Reptile made you cool? Then Mortal Kombat 2 came out, and he was a playable character, and everyone would call your wins cheap if you picked him? That’s sort of what this reminds me of. If you’re new to SQL, you probably won’t appreciate the differences these Trace Flags make. If you’ve been using it for a while, you’ll probably start sentences with “back in my day, we had to add startup parameters…” and chuff off to write miserably long blog posts about unique indexes.
As of 02/23/2016, it sounds like Trace Flag 8048 is also enabled by default in 2016. See quote about soft NUMA at the link.
I ran tests on some fairly large tables. I tried to run them on tables from 100 million to 1 billion rows, but I blew out the data drive of our AWS instance. So, uh, if you have a bigger server to test stuff out on, be my guest.
The basic concept was:
- Load a bunch data into a table
- Update it 1000 rows at a time (I know, I know, but updating less than that took FOREVER)
- Run a query against it to invalidate stats
- If they reset, add a bunch more data and start over
What I ended up with was, well…
Here’s an abridged version of 10-20 million and 30-40 million rows, and how many modifications they took before a stats update occurred. If you follow the PercentMod column down, the returns diminish a bit the higher up you get. I’m not saying that I’d prefer to wait for 20% + 500 rows to modify, by any stretch. My only point here is that there’s not a set percentage to point to.
And, because you’re probably wondering, turning on Trace Flag 2371 in 2016 doesn’t make any difference. Here’s what 10-100 million look like, in 10 million row chunks.
If you can guess which side TF 2371 was on for, I’ll give you one merlin dollhairs.
This improvement is certainly welcome as a default, though it’s not all that ‘new’. My 2014 instance comes up with the same thresholds with 2371 enabled. Unless you’re working with pretty big tables, or used to managing statistics updates on your own, you likely won’t even notice the change.
Thanks for reading!
Brent says: It’s kinda like Microsoft is treating trace flags as alpha/beta tests for new features now. That’s right in line with how .com startups use feature flags.
A long time ago, I had to actually do stuff to databases. One thing I had to do was move data files around. Maybe some knucklehead had put system databases on the C: drive, or a LUN was filling up, or we got a new LUN. You know, whatever. Natural curiosity oft leads one to the internet. If one does not succumb to food and cats, one may find useful information. Or anonymous message boards. Sort of a toss up. What I found was this article. Weird, right? 2009. Brent said to use ALTER DATABASE. It’s new and pretty and smart people do it. What Brent didn’t do was explain how it’s done. Or link to how it’s done. I felt cold and alone. Abandoned. Afraid. “Great post, Brent”, I said sarcastically, and set out to figure out how to work this magic on my own.
I turned to BOL, the destination of all self-loathing people. If you scroll down to the bottom, way down at the bottom, the syntax is there. Of course, moving system databases is a horse of a different color. But hopefully you don’t need that one. For user databases, it’s rather more simple:
- Alter the file metadata to the new path
- Set the database offline
- Physically move the file
- Set the database back online
Run ALTER DATABASE with the new location. We’re moving the data file. If we were moving the log file, it would probably end in “_log” or something. You can find all this information in sys.master_files, except where you’re moving the file to. Just don’t actually move it to C:\Whatever. You may run into problems later. Also, you need the filename. If you don’t include it, SQL won’t complain until you try to set the database back online. Yay!
ALTER DATABASE [Sample] MODIFY FILE ( NAME = Sample, FILENAME = 'C:\Whatever\Sample.mdf' );
This is the part that you need to think through. People have to be cool with the database being offline while you move the physical file. This is not a seamless transition. If you’re moving large enough databases, you may want to consider an alternate method, like Mirroring or Log Shipping. They take more work, but you get the whole near-zero-downtime thing out of it. You may want to stage a mock file move to test LUN to LUN copy speeds. See how many GB you can move per minute. That way you’ll at least be able to estimate how long the outage will last. Assuming all that is cool, go ahead and take the database offline.
ALTER DATABASE [Sample] SET OFFLINE;
Now you gotta hurry up and get that file moved. How you do that is up to you. You may prefer to just use Windows Explorer, since it has a status bar, and tells you copy speeds. Good stuff to know if people ask for updates, right? Just to fill space, here’s a PowerShell command. I still hate PowerShell.
Move-Item -Path "D:\Data\Sample.mdf" -Destination "C:\Whatever" -Force
Once that finishes, put your database back online.
ALTER DATABASE [Sample] SET ONLINE;
If you find yourself having to do this often, or if you have to migrate a group of databases, it’s probably worth scripting out.
There you have it
It’s that easy to do. Just make sure you have adequate backups, in case something goes wrong. I take no responsibility for what happens to your data files when they copy across your SAN, or anywhere else.
Thanks for reading!
There’s an old DBA saying…
May you already have a backup restored
A half hour before your boss knows there’s corruption
What? There’s no such thing as old DBA sayings? Well, maybe if you all said something other than “no” once in a while, you’d be more quotable. Hmpf.
Anyway, this is a serious question! And there are a lot of things to consider
- Do I have a different RTO for corruption?
- What’s my backup retention policy?
- How much data do I have?
- How long are my maintenance windows?
- Do I have a server I can offload checks to?
Recovery Time Objectification
When you’re setting these numbers with management, you need to make them aware that certain forms of corruption are more serious than others, and may take longer to recover from. If system tables or clustered indexes become corrupt, you’re potentially looking at a much more invasive procedure than if a nonclustered index gets a little wonky — something you can disable and rebuild pretty easily.
Either way, you’re looking at an RTO of at least how long it takes you to restore your largest database, assuming the corruption isn’t present in your most recent full backup. That’s why backup checksums are important. They’re not a replacement for regular consistency checks by any means, but they can provide an early warning for some types of page corruption, if you have page verification turned on, and your page is assigned a checksum.
If you use a 3rd party backup tool that doesn’t allow you to use the backup checksum option, stop using it. Seriously, that’s garbage. And turn on Trace Flag 3023 until you find a replacement that does.
Notice I’m not talking about RPO here. But there’s a simple equation you can do: the shorter your RTO for corruption, the longer your RPO. It’s real easy to run repair with allow data loss immediately. The amount of data you lose in doing so is ¯\_(?)_/¯
Which is why you need to carefully consider…
The shorter the period of time you keep backups, the more often you need to run DBCC CHECKDB. If you keep data for two weeks, weekly is a good starting point. If you take weekly fulls, you should consider running your DBCC checks before those happen. A corrupt backup doesn’t help you worth a lick. Garbage backup, garbage restore. If your data only goes back two weeks, and your corruption goes back a month, best of luck with your job search.
Of course, keeping backups around for a long time is physically impossible depending on…
How much data YOU have
The more you have, the harder it is to check it all. It’s not like these checks are a lightweight process. They chew up CPU, memory, disk I/O, and tempdb. They don’t cause blocking, the way a lot of people think they do, because they take the equivalent of a database snapshot to perform the checks on. It’s transactionally consistent, meaning the check is as good as your database was when the check started.
You can make things a little easier by running with the PHYSICAL ONLY option, but you lose out on some of the logical checks. The more complicated process is to break DBCC checks into pieces and run them a little every night. This is harder, but you stand a better chance of getting everything checked.
Especially if you have terabytes and terabytes of data, and really a short…
Are you 24×7? Do you have nights or weekends to do this stuff? Are you juggling maintenance items alongside data loads, reports, or other internal tasks? Your server may have a different database for different customer locations, which means you have a revolving maintenance window for each zone (think North America, Europe, APAC, etc.), so at best you’re just spreading the pain around.
Or you could start…
This is my absolute favorite. Sure, it can be a bear to script out yourself. Automating rotating backups and restores can be a nightmare; so many different servers with different drive letters.
Dell LiteSpeed has been automating this process since at least version 7.4, and it’s not like it costs a lot. For sure, it doesn’t cost more than you losing a bunch of data to corruption. If you’re the kind of shop that has trouble with in-place DBCC checks, it’s totally worth the price of admission.
But what about you?
Tell me how you tackle DBCC checks in the comments. You can answer the questions at the beginning of the post, or ask your own questions. Part of my job is to help you keep your job.
Thanks for reading!
Brent says: if you’re using NetApp SAN snapshots, they’ve also got great tooling to offload corruption checks to your DR site. Licensing gotchas may apply – for both SQL Server and NetApp writeable snaps.
Most of you are going to hate this
And TL;DR, there’s a script at the end of the post. But like The Monster At The End Of This Book, it’s worth it not to skip the middle.
There are about a billion but-what-ifs that could come into play. I can’t possibly answer all of those for you. But that’s not the point of this post, anyway! If you’re in a special circumstance, using some fancy features, or doing something utterly deranged to your database, this isn’t the post, or script, for you.
I mean really, unless the size of your log file is causing you some dramatic pain, leave it alone. You should probably go invent cold fusion if log file size is the worst issue in your database. Congratulations.
This is also a lousy place to ask me if you can shrink your log file. I have no idea how or why it got that size. There’s free space now because you’re using FULL recovery model and you took a log backup, or you’re in SIMPLE and your database hit a CHECKPOINT. No magic there. It may very well grow to that size again, so shrinking it could be a really dumb idea.
So what’s the point? Lots of people ask me this question: clients, Office Hours attendees, random passerby on the street who recognize me (even without my Robot). I usually give them the same answer and explanation, unless I have ample evidence that their fancy and/or deranged ways require a different estimate.
From the ivory tower
A good STARTING POINT for your log file is twice the size of the largest index in your database, or 25% of the database size. Whichever is larger.
If the largest object in your database is larger than 25% of your database, you are likely running some type of maintenance. Index rebuilds require the size of the object being rebuilt in log space. I usually rule of thumb twice that space, in case you’re doing anything else while you’re doing that maintenance, like ETL, reports, dragging data to and fro, purging data, whatever. If you’re only ever reorganizing the largest object, you may not need all that space. Are you sure you’re ONLY ever reorganizing that? I’ll wait.
But 25% seems so random!
Well, kinda. but you’re here for a starting point. If you’re not Super DBA and taking baselines and trending your database file sizes over time, random is better than nothing. It buys you some leeway, too.
- If you miss a log backup (maintenance plans got you down?)
- If you’re not taking frequent enough log backups (can I interest you in RPO/RTO insurance?)
- If you run other long/large transactions (SSIS won’t save you)
You’ll have a fair amount of room to do your dirty work. Most sane and rational people consider this to be a positive thing.
But what if my log file still grows?
Well, then you found out you need a bigger log file. Or you need to take log backups more frequently. Perhaps those hourly log backups aren’t working out as you planned, hm?
And if your log file never grows, you’ll look really smart. And you’ll never have to wait for your log file to expand. They don’t benefit from Instant File Initialization the way data files do.
Show me the script already
It’s all right under here. Don’t forget to change the USE statement. All sizes are in GB. If your database is smaller than 1GB, you’re one of those lucky DBAs who can take vacations and stuff. Go do that. Life is short.
If your database is under 1GB, and your log file is over 1GB, start taking log backups. I’m pretty sure you’re not.
USE [StackOverflow] --You'll probably want to use your own database here --Unless you work at Stack Overflow --No, I'm not writing this to loop through all of your databases ; WITH [log_size] AS ( SELECT TOP 1 SCHEMA_NAME([t].[schema_id]) AS [schema_name] , [t].[name] AS [table_name] , [i].[name] , [p].[rows] AS [row_count] , CAST(( SUM([a].[total_pages]) * 8. ) / 1024. / 1024. AS DECIMAL(18, 2)) AS [index_total_space_gb] , ( SUM([a].[total_pages]) * 8 ) / 1024 / 1024 * 2 AS [largest_index_times_two_(gb)] , ( SELECT ( SUM([mf].[size]) * 8 ) / 1024 / 1024 FROM [sys].[master_files] AS [mf] WHERE [mf].[database_id] = DB_ID() ) AS [database_size_(gb)] , ( SELECT CAST(( SUM([mf].[size]) * 8 ) / 1024 / 1024 AS INT) FROM [sys].[master_files] AS [mf] WHERE [mf].[database_id] = DB_ID() AND [mf].[type_desc] = 'LOG' ) AS [current_log_size_(gb)] , ( SELECT CAST(( SUM([mf].[size]) * 8 ) / 1024 / 1024 * .25 AS INT) FROM [sys].[master_files] AS [mf] WHERE [mf].[database_id] = DB_ID() AND [mf].[type_desc] = 'ROWS' ) AS [25%_of_database_(gb)] FROM [sys].[tables] [t] INNER JOIN [sys].[indexes] [i] ON [t].[object_id] = [i].[object_id] INNER JOIN [sys].[partitions] [p] ON [i].[object_id] = [p].[object_id] AND [i].[index_id] = [p].[index_id] INNER JOIN [sys].[allocation_units] [a] ON [p].[partition_id] = [a].[container_id] WHERE [t].[is_ms_shipped] = 0 GROUP BY SCHEMA_NAME([t].[schema_id]) , [t].[name] , [i].[name] , [p].[rows] ORDER BY [index_total_space_gb] DESC) SELECT * , CASE WHEN [ls].[largest_index_times_two_(gb)] > [ls].[25%_of_database_(gb)] THEN [ls].[largest_index_times_two_(gb)] ELSE [ls].[25%_of_database_(gb)] END AS [maybe_this_is_a_good_log_size(gb)] FROM [log_size] AS [ls] OPTION ( RECOMPILE );
I’ve set up Mirroring about a billion times
I’m not bragging about that. I’d rather say that I set up a billion AGs, and not one of them ever failed. But then I’d be lying to you; those things fail like government programs. One thing I’d never done, though, is set up Mirroring with a Witness. I never wanted automatic failover, because it’s only one database at a time. If for some reason one database out of all that I had mirrored ever turned Ramblin’ Man and failed over to another server, there would understandably be some application consternation. Not to mention any maintenance and internal operations. They don’t react well to sudden database unavailability.
Of course, doing anything for the first time is horrible. Just ask my second wife.
Here’s where things got awkward
I have my databases! This is my top secret development environment. Stack Overflow is in an AG, and I had set up two other Mirrors: one synch and one asynch. I wanted to have a variety of setups to test some scripts against.
Alright, let’s set up Mirroring…
This is so easy. Seriously. Why doesn’t everyone do this? Why do you complicate your short, short lives with Availability Groups? Are they AlwaysOn? Are they Always On? WHO KNOWS? Not even Microsoft.
HIGH FIVES ALL ARO-
This is the error text:
The ALTER DATABASE command could not be sent to the remote server instance 'TCP://ORACLEDB.darling.com:5022'. The database mirroring configuration was not changed. Verify that the server is connected, and try again.
Alright, that’s silly. I used the GUI. Instead of going to bed I’ll spend some time checking all my VM network settings. BRB.
I’m back. They were all correct. I could ping and telnet and set up linked servers and RDP. What in the name of Shub-Niggurath is going on with this thing?
I can even see the Endpoint! So close, and yet so far~~
Where are we now?
This is a good time for a quick recap
- Mirroring is up and running synchronously
- The endpoint is configured on the witness
- We get an error when we try to connect the witness
TO THE ERROR LOG!
Well whaddya know? That’s a really good clue. Encryption and stuff. There’s no compatible algorithm. Ain’t that somethin’? You’d think that Microsoft would be cool about setting up the same kind of encryption across all the different Endpoints, if using different encryption would cause the setup to fail. Right guys? Heh. Right? Hey, hello?
Alright, let’s see what I need to be a matchmaker.
Since we have them both scripted out already, let’s just drop and re-create the Witness Endpoint with the right encryption algorithm.
That did not result in a forest fire. I’m hopeful. Sort of. It’s been a long night and I think I can see tomorrow from here.
Meanwhile, back on the Primary…
It worked! Now I have a Witness, and I can shut all my VMs down. That was so much fun.
What did we learn?
Microsoft hates you and doesn’t want you to sleep. Just kidding. Mostly. But seriously, why would they do that?
It mostly goes to show that it’s always a smart idea to use that little script button at the top of (most) GUIs in SSMS. Who knows what kind of foolishness you’ll find? A little reading can save you a lot of time troubleshooting errors that make you feel insane.
Thanks for reading!