Why You Shouldn’t Upgrade SQL Server

Stop. Before you think about upgrading your existing SQL Server to the latest version, think it through.

In retrospect, putting a new Prius engine in the Grand Prix wasn’t my best idea.

Your hardware is old. Even if it’s just a couple of years old, the latest advances in processors, memory, and storage mean it’s practically an antique.

Your Windows is old, too. Over the years, you installed a bunch of crap that you probably shouldn’t have put on the production box. You could uninstall it, but even that will leave leftovers all over the place.

You’ve learned a lot since the first time. When you first installed it, you didn’t know that you shouldn’t install extra services you don’t need, or that you should use 4 TempDB data files, or that you should format the drives with 64K NTFS allocation units. Now that you know about really good setup checklists, you wish you could do it all over again – and now is your chance.
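
If you’re rebuilding anyway, the checklist items are easy to script ahead of time. Here’s a minimal sketch of adding the extra TempDB data files – the drive letter, paths, and sizes are made up, so size them equally and to fit your own drives:

-- Hypothetical paths and sizes: keep all TempDB data files the same size
-- so SQL Server's proportional fill spreads work evenly across them.
ALTER DATABASE tempdb
    ADD FILE (NAME = tempdev2, FILENAME = 'T:\TempDB\tempdb2.ndf', SIZE = 8GB, FILEGROWTH = 1GB);
ALTER DATABASE tempdb
    ADD FILE (NAME = tempdev3, FILENAME = 'T:\TempDB\tempdb3.ndf', SIZE = 8GB, FILEGROWTH = 1GB);
ALTER DATABASE tempdb
    ADD FILE (NAME = tempdev4, FILENAME = 'T:\TempDB\tempdb4.ndf', SIZE = 8GB, FILEGROWTH = 1GB);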

Patching takes forever. You need to bring both Windows and SQL Server completely up to date, but it can take minutes or hours of downtime to get all the right patches applied. And then there’s the BIOS and firmware, too. Instead of guessing how long the old box will take to shore up, just build a new one and get it perfect.

You need to test the new Cardinality Estimator. For several versions, your query plans haven’t changed when you upgraded, but those days are over. SQL Server 2014 brings dramatic changes to the Cardinality Estimator, and you need to check your queries against it to prevent a 100%-CPU-surprise on go-live day.
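
One hedged way to do that testing: restore a copy of the database onto a SQL Server 2014 test box, switch the compatibility level to 120 to turn on the new Cardinality Estimator, and compare plans for your ugliest queries. Trace flag 9481 lets a single query fall back to the legacy estimator for an apples-to-apples comparison. The database and table names below are placeholders:

-- On a restored test copy: compatibility level 120 enables the new CE.
ALTER DATABASE YourDatabase SET COMPATIBILITY_LEVEL = 120;
GO
-- Run a suspect query under the legacy CE for comparison (requires sysadmin):
SELECT COUNT(*)
FROM dbo.YourBigTable
OPTION (QUERYTRACEON 9481);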

You need a plan B. Patching and upgrading has a serious potential for failure. In theory, you could restore the OS backups and try again, but have you ever tested that? What happens if it fails, and you have to try it all on the fly? What’s your recovery time objective?

Kendra says: It’s funny, in-place upgrades always worked fine for me, unless it was on a production server.

Watch Brent Tune Queries [Video]

Ever wonder how somebody else does it? Watch over my shoulder for this beautifully recorded one-hour session at SQLRally Nordic in Copenhagen from a couple weeks ago:

The resources, scripts, and my Be Creepy process are all free too. Enjoy, and check out all of the session recordings for even more free learnin’.

Getting BLOBS out of the Database

Raise your hand if you’re storing BLOBs in the database.

I get to say that a lot during our training. Every time I say it, nearly every hand in the room goes up. Some hands go up faster than others, but eventually nearly every hand is up.

It’s a design that happens far more often than it should, but it does happen.

Why Store BLOBs in the Database?

People put binary data in the database because they need the data to be point in time consistent with the rest of the database. It’s not enough to save space in the database if you can’t recover the file to a moment in time.

Think about this scenario:

  1. 8:00AM – Brent saves a new contract.
  2. 9:00AM – Kendra modifies the contract and saves it.
  3. 9:30AM – Jeremiah modifies the contract and saves it.

If the contract is being stored inside the database, we can recover to any point in time and have the appropriate version of the document. It may not be the most current version of the contract, but it’s the version of the document that’s consistent with the rest of the database.

Why Not Use the Filesystem?

File systems are great. They do an excellent job of storing files and organizing them into folders. File systems don’t do a great job of being point in time consistent with a relational database. There’s no transaction log to help us roll back writes that are in flight.

It’s a lot of work to get a full database backup and a file system backup to be remotely close to the same point in time. Restoring a database to a point in time is easy. Restoring a file system to a point in time is close to impossible.

Why Not Use an Appliance?

There’s a third option available – some kind of appliance that sits between the database and the file system. The appliance should manage file metadata and provide all access to the files in the file system.

Commercial databases ship with features that sound similar. SQL Server has a FILESTREAM data type and Oracle has both a BFILE and ORD data type. Both of these types let the database interact with files in the file system. But they still have a problem – you’re stuck managing data through the database. Let’s be clear: this is not an appliance.

Content Addressable Storage (CAS) is a mature technology. The idea behind CAS is that a hardware device handles the metadata for a given incarnation of a file. A developer sends a file into the CAS appliance and the CAS appliance returns a pointer to the file. Whenever the file changes, a new copy is created and a new handle is returned to the developer. Files can’t be modified, so anything stored in the database can only point to the right version of the file.

We can combine this with a database pretty easily. Instead of storing a file path in the database, we store the handle that we get back from the CAS system.
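
In schema terms, that’s a tiny change. Here’s a rough sketch – the table and column names are made up for illustration:

-- Hypothetical table: the database stores only the immutable CAS handle,
-- not the document bytes and not a mutable file path.
CREATE TABLE dbo.ContractDocuments (
    ContractDocumentId INT IDENTITY(1,1) PRIMARY KEY,
    ContractId         INT           NOT NULL,
    CasHandle          VARCHAR(128)  NOT NULL,  -- handle returned by the CAS appliance
    SavedAt            DATETIME2     NOT NULL DEFAULT SYSUTCDATETIME()
);

Because each handle points at exactly one immutable version of the file, restoring the database to 9:15AM brings back the handle for Kendra’s 9:00AM save – and the CAS still has that exact copy.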

How Does CAS Solve the Problem?

The main reason people store BLOBs in the database is so they can get blobs that are consistent with a point in time in the database. By using a storage device that cannot be modified (the CAS), we can make sure that the location we’ve stored in the database is always the right location – there’s no way to tamper with the files that we’re storing, so whatever gets stored in the database is correct.

There’s overhead to this approach – old data may never get cleared out. Typically, though, CAS systems store data on large, slow disks. There’s little need for the throughput we pay for in a relational database storage system. Do those cat pictures really need to be stored on RAID 10 SSDs? Moving BLOB storage outside of the relational database will free up resources for serving queries. Picking the right way to store your BLOB data will make it easier to scale your system.

Kendra says: Finding the right storage for large objects is a huge architectural decision that impacts performance and availability. Choose wisely!

Brent says: Want your SQL Server’s DBCCs and backups to run faster? This can help a lot.

Doug says: “It’s a lot of work to get a full database backup and a file system backup to be remotely close to the same point in time.” -> This is a major drawback that’s easily overlooked. Make sure everyone’s okay with that possibility when choosing the file system for BLOB data.

Ten Ways to Tell if Your SQL Server is a Clown Car

Commuting to work.

Sometimes we pile just a few too many duties onto our servers. They start to become a clown car – the classic joke where dozens of clowns come pouring out of an impossibly small car. (The physics behind it are actually really interesting.)

So how can you tell if your SQL Server has crossed the line into clown car territory? It’s easy:

10. You have both production and development databases in it.

9. Everyone runs SSMS and Visual Studio by remote desktopping into the server.

8. You’ve enabled remote desktop services for more people to log in simultaneously.

7. It has the engine, SSRS, SSAS, and SSIS all installed and running, but less than 32GB of RAM.

6. It has multiple instances of each of those services running. (Bonus points for different versions.)

5. Application servers point to a file share hosted by this database server.

4. You’re monitoring it, and the monitoring software’s repository also lives on the same server you’re monitoring.

3. You built a spreadsheet to track your affinity masking, Resource Governor, and max memory settings.

2. When you say “the server,” you don’t have to clarify because everyone knows it’s your only one.

1. You’re reading this blog on the server right now.

Kendra says: If you care about performance, you’ve got to start pulling clowns out of that tiny car.

SQL Server Common Table Expressions

Common table expressions are a feature of SQL that lets a developer create a query that can be referenced multiple times. This feature gives developers another tool to add flexibility or just to simplify code.

Why Common Table Expression?

Why is it called a Common Table Expression (CTE)? Because that’s what the documentation calls it!

No, seriously, that’s it. Different databases have different terms for this feature. That’s what we call it in SQL Server.

What does a CTE do?

A CTE effectively creates a temporary view that a developer can reference multiple times in the underlying query. You can think of the CTE as if it were an inline view.

Here’s a slight re-write of a StackExchange query to find interesting and unanswered questions.

WITH unanswered AS (
    SELECT p.Id
    FROM dbo.Posts p
    WHERE (SELECT COUNT(*)
           FROM dbo.Posts a
           WHERE a.ParentId = p.Id
                 AND a.Score > 0) = 0
          AND p.CommunityOwnedDate IS NULL
          AND p.ClosedDate IS NULL
          AND p.ParentId IS NULL
          AND p.AcceptedAnswerId IS NULL
)
SELECT TOP 2000 unanswered.Id AS [Post Link] ,
       (SUM(t.[Count])
           / 10.0 + u.Reputation
           / 200.0 + p.Score * 100)
       AS Weight
FROM   unanswered
       JOIN dbo.Posts p ON unanswered.Id = p.Id
       JOIN dbo.PostTags pt ON pt.PostId = unanswered.Id
       JOIN dbo.Tags t ON t.Id = pt.TagId
       JOIN dbo.Users u ON u.Id = p.OwnerUserId
GROUP BY unanswered.Id, u.Reputation, p.Score
ORDER BY Weight DESC ;

The CTE, lines 1 – 12, effectively creates a temporary view that we can use throughout the rest of the query. You can also think of it in the same way that you’d think of a derived table (a join to a subquery).

CTEs make it possible to simplify code, express complex ideas more concisely, or just write code quickly without having to worry as much about structure. They’re a great feature of the SQL language that many people overlook.
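
Because a CTE can be referenced more than once in the same statement, it’s also handy when a query needs to compare a set against itself. Here’s a quick sketch against the same StackOverflow-style schema:

WITH rep AS (
    SELECT Id, Reputation
    FROM dbo.Users
)
SELECT r.Id, r.Reputation
FROM rep r
WHERE r.Reputation > (SELECT AVG(1.0 * Reputation) FROM rep);  -- the CTE is referenced twice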

If you want to learn more about CTEs, check out our upcoming Advanced Querying and Indexing class in both Chicago and Portland. We’ll cover the basics of CTEs, pitfalls, and performance tuning options.

How to Measure SQL Server Workloads: Wait Time per Core per Second

When I ask you how fast your car is, there are two common metrics for answers:

  • How fast it can go
  • How much horsepower or torque its engine can deliver

If I ask you how hard your car is working right now, you’d probably answer with:

  • Its current speed (in MPH or KPH)
  • Its current engine workload (in RPM)

Conveniently, these two numbers are shown front and center on most car dashboards:

Tachometer and speedometer

For SQL Server, those numbers are:

Current speed: Batch Requests per Second – the number of queries your server is currently handling. It’s available via Perfmon counter, and it’s on the dashboard of most monitoring software.

Current workload: Wait Time per Core per Second – wait stats are how we measure what SQL Server is waiting on, and for how long. You can’t tune wait stats using percentages – instead, you have to look at the total amount of time involved.

Compare these two five-second moments in the life of a SQL Server:

Workload Comparison

It’s clear that in Workload A, nothing is going on. Sure, 100% of our waits are on storage, but we don’t really have a storage problem. If Workload A was a car, it would be rolling along at 1mph, waiting for the driver to hit the gas pedal.

In workload B, there’s much more going on – but do we have a performance problem yet? That’s not quite as clear. To get the real picture, I have to tell you how many cores the server has.

Let’s Use Wait Time Per Core Per Second

Waits over time don’t really give you a clear picture – you have to divide the workload by the number of cores you have. Let’s say they’re both running on a VM with 8 virtual cores:

Workload comparison with 8 vCPUs each

The VM on the left isn’t doing any work. It’s coasting.

On the VM on the right, for every second on the clock, each of its cores is spending 0.4 seconds (400 milliseconds) waiting on something (disk, memory, locks, etc.). While that might sound significant at first, it’s still really not doing any hard work. It’s not unusual for each core to spend several seconds per second waiting on different resources – and that’s fine, because SQL Server’s schedulers are good at switching over and running other queries that don’t need to wait.

For example, say you fire off a big SELECT query that scans a huge table, and we have to go get a lot of data from disk to accomplish it. That query’s tasks can wait for dozens (or hundreds or thousands) of milliseconds while other queries get their work done.

That’s why to measure SQL Server workloads, I’m proposing a new metric: Wait Time per Core per Second. That one metric, by itself, is a lot like the tachometer on a car’s dashboard. It’s not a measurement of how powerful your car is, but it does tell you how hard your car is working at this moment in time.
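
If you want to eyeball that math yourself, here’s a rough sketch (this is not sp_AskBrent’s exact code): sample the server’s total waits a few seconds apart, then divide the difference by the number of cores and the number of seconds.

DECLARE @waits_start BIGINT, @waits_end BIGINT, @cores INT, @seconds INT = 5;

SELECT @cores = COUNT(*)
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';

-- A real implementation would filter out benign waits (SLEEP_TASK, BROKER_*, etc.) first.
SELECT @waits_start = SUM(wait_time_ms) FROM sys.dm_os_wait_stats;
WAITFOR DELAY '00:00:05';
SELECT @waits_end = SUM(wait_time_ms) FROM sys.dm_os_wait_stats;

SELECT (@waits_end - @waits_start) / 1000.0 / @cores / @seconds
       AS [Wait Time per Core per Second];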

sp_AskBrent® now gives you this metric.

We’ve added new key metrics to the default output, including Batch Requests per Second, and Wait Time per Core per Second:

sp_AskBrent v13 with Wait Time per Core per Second

Note the new output down around priority 250-251 – it helps you get a fast idea of whether the SQL Server is working hard right now, or hardly working.

You can download our First Responder Kit with sp_AskBrent® now, and in our training classes, I show you more details on how I use it to triage real-world performance emergencies.

sp_AskBrent v14 adds CPU % Utilization

When your SQL Server is having performance problems, sp_AskBrent® checks a bunch of common trouble spots. In this week’s new version, it also checks sys.dm_os_ring_buffers for the most recent CPU utilization report from SQL Server and returns it.

Here’s what it looks like when the server isn’t under load – note that “No Problems Found” means the rest of the alerts are just information about the SQL Server:

sp_AskBrent on low-load server

And here’s a heavily loaded server:

sp_AskBrent on a server under high CPU loads

I haven’t been blogging about this tool much, but I’ve become quite proud of it over the last few months.

Parameters include:

@Seconds = 5 – you can run it for longer periods like 60 seconds to see the server’s health during that time range. It’s really useful when you’re tuning a query or doing a presentation demo – fire it off on a 60-second span, go run your query or workload in another window, and then come back to sp_AskBrent to see what the effects were. Especially useful when combined with…

@ExpertMode = 1 – returns more result tables including wait stats, Perfmon counters, and my personal favorite, file stats – which shows you how much your workload read and wrote from your data/log files and TempDB:

sp_AskBrent file stats with expert mode on

@OutputDatabaseName, @OutputSchemaName, @OutputTableName – you can log results to tables. Some folks are running sp_AskBrent® every 5 minutes in a SQL Agent job, dumping the data to a table, so that they can go back in time and trend what happened on the server. This got better in last month’s v13, which also added the ability to output the file/Perfmon/wait stats details to individual tables too.
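
Putting those parameters together – the output database and table names below are just placeholders:

-- Watch a 60-second window with the extra wait stats / Perfmon / file stats result sets:
EXEC dbo.sp_AskBrent @Seconds = 60, @ExpertMode = 1;

-- Or log each run to a table, say from a SQL Agent job that runs every 5 minutes:
EXEC dbo.sp_AskBrent
    @OutputDatabaseName = 'DBAtools',
    @OutputSchemaName   = 'dbo',
    @OutputTableName    = 'AskBrentResults';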

To get started, download our First Responder Kit with our latest scripts, posters, and e-books.

Quiz: Are You the Next Brent Ozar Unlimited Consultant?

Hey SQL Server DBAs — we’re hiring!

Here’s a quick five-question quiz. Give yourself one point for every ‘yes’:

  1. Have you been a database administrator for a few years?
  2. Does planning how to keep an application reliable and safe from disasters sound like fun?
  3. Do you love helping people make SQL Server go faster?
  4. Would you like to have a cartoon character of yourself?
  5. Can you see yourself dropping into our Office Hours sessions occasionally to help people for free?

If you got five points, you just might be the next Brent Ozar Unlimited employee!

Here’s what to do next:

Read more about the job here to find out what we’re looking for and what benefits we offer.

Then tell us about you!

We plan to accept applications through Friday, March 6, but don’t wait too long to submit — life moves fast sometimes.

Update March 7 – applications are closed, and we’re interviewing candidates. Thanks!

AlwaysOn Availability Groups Quiz and FAQ [Video]

Think AlwaysOn Availability Groups are right for your environment? Take my 6-question quiz to find out:

That’s from our in-person training classes, where we cover what DBAs and developers need to do for a successful AG implementation.

In those classes, here’s some of the questions I get the most often:

Q: How much network bandwidth will I need?

For a really rough estimate, sum up the amount of uncompressed transaction log backups that you generate in a 24-hour period. You’ll need to push that amount of data per day across the wire. Things get trickier when you have multiple replicas – the primary pushes changes out to all replicas, so if you’ve got 3 replicas in your DR site, you’ll need 3x the network throughput. Calculating burst requirements is much more difficult – but at least this helps you get started.
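
One rough way to get that number is to sum your log backup history out of msdb – a sketch, assuming you take regular log backups and keep at least 24 hours of history (backup_size is the uncompressed size):

SELECT SUM(backup_size) / 1024.0 / 1024.0 AS LogBackupMBPerDay
FROM msdb.dbo.backupset
WHERE type = 'L'   -- transaction log backups only
  AND backup_start_date >= DATEADD(HOUR, -24, GETDATE());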

Q: What’s the performance overhead of a synchronous replica?

From the primary replica, ping the secondary, and see how long (in milliseconds) the response takes. Then run load tests on the secondary’s transaction log drive and see how long writes take. That’s the minimum additional time that will be added to each transaction on the primary. To reduce the impact, make sure your network is low-latency and your transaction log drive writes are fast.
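
To put a number on the log-write side, one hedged option is sys.dm_io_virtual_file_stats. Keep in mind these counters are cumulative since startup, so for a load test, sample before and after and compare:

SELECT DB_NAME(vfs.database_id) AS database_name,
       vfs.num_of_writes,
       1.0 * vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) vfs
JOIN sys.master_files mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id = vfs.file_id
WHERE mf.type_desc = 'LOG';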

Q: How far behind will my asynchronous replica be?

The faster your network and your servers are, and the less transactional activity you have, the more up-to-date each replica will be. I’ve seen setups where the replicas are indistinguishable from the primary. However, I’ve also seen cases with underpowered replicas, slow wide area network connections, and heavy log activity (like index maintenance) where the replicas were several minutes behind.
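
To see where you stand right now, ask the primary how much log each secondary still has to receive and redo – a sketch, with the queue sizes reported in KB:

SELECT ar.replica_server_name,
       DB_NAME(drs.database_id)  AS database_name,
       drs.log_send_queue_size   AS log_send_queue_kb,
       drs.redo_queue_size       AS redo_queue_kb,
       drs.last_commit_time
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
  ON ar.replica_id = drs.replica_id;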

Q: What’s the difference between AGs in SQL 2012 and SQL 2014?

SQL Server 2014’s biggest improvement is that the replica’s databases stay visible when the primary drops offline – as long as the underlying cluster is still up and running. If I have one primary and four secondary replicas, and I lose just my primary, the secondaries are still online servicing read-only queries. (Now, you may have difficulties connecting to them unless you’re using the secondary’s name, but that’s another story.) Back in SQL 2012, when the primary dropped offline, all of the secondaries’ copies immediately dropped offline – breaking all read-only reporting queries.

Q: How do I monitor AlwaysOn Availability Groups?

That’s rather challenging right now. Uptime monitoring means knowing if the listener is accepting writeable connections, if it’s correctly routing read-only requests to other servers, if all read-only replicas are up and running, if load is distributed between replicas the way you want, and how far each replica is running behind. Performance monitoring is even tougher – each replica has its own statistics and execution plans, so queries can run at totally different speeds on identical replicas. There’s not really a good answer to this question right now.

Q: How does licensing work with AlwaysOn Availability Groups in SQL 2012 and 2014?

All replicas have to have Enterprise Edition. If you run queries, backups, or DBCCs on a replica, you have to license it. For every server licensed with Software Assurance, you get one standby replica for free – but only as long as it’s truly standby, and you’re not doing queries, backups, or DBCCs on it.

Q: Can I use AlwaysOn Availability Groups with Standard Edition?

Not at this time, but it’s certainly something folks have been asking for since database mirroring has been deprecated.

Q: Do AlwaysOn AGs require shared storage or a SAN?

No, you can use local storage, like cheap SSDs.

Q: Do Availability Groups require a Windows cluster?

Yes, they’re built atop Windows failover clustering. This is the same Windows feature that also enables failover clustered instances of SQL Server, but you don’t have to run a failover clustered instance in order to use AlwaysOn Availability Groups.

Q: Do I need a shared quorum disk for my cluster?

No, and check out Kendra’s video on failover cluster quorum to learn your options.

Q: What version of Windows do I need for AlwaysOn AGs?

We highly recommend Windows Server 2012R2, and here’s why.

Q: Can I have different indexes or tables on my replicas?

No, the replica database contents will be exactly the same as the primary.

Q: If I fail over to an asynchronous replica, and it’s behind, how do I sync up changes after the original primary comes back online?

That is left as an exercise for the reader. When I go through an AG design with a team, we talk about the work required to merge the two databases together. If it’s complex (like lots of parent/child tables with identity fields, and no update datestamp field on the tables), then management agrees to a certain amount of data loss upon failover. For example, “If less than fifteen minutes of data is involved, we’re just going to walk away from it.” Then we build a project plan for what it would take to actually recover >15 minutes of data, and management decides whether they want to build that tool ahead of time, or wait until disaster strikes.

Kendra says:  I wish I’d had this list when I started learning about AGs. I was also surprised to learn that crazy things can happen in an AG when you create indexes. Check out more on that here.

9 Ways to Lose Your Data

Every time someone tells me, “This database is mission critical – we can’t have data loss or downtime,” I just smile and shake my head. Technology is seriously difficult.

To illustrate, here’s a collection of client stories from the last few years:

  1. The DBCC CHECKDB job ran every week just like it was supposed to – but it failed due to corruption every week. No one got email alerts because the SQL Agent mail was no longer valid – internal email server changes meant the mail was just piling up in SQL Server. CHECKDB had been failing for three years, longer than the backups were kept. Data was permanently lost.
  2. The DBA configured his backups to write to a file share. The sysadmins never understood they were supposed to back up that file share. When the DBA asked for a restore, he was surprised to find there were no backups.
  3. Three SQL Servers were all replicating data to each other. When I asked the DBA where the backups were run, he looked at one server, then another, then the third. He sheepishly admitted – in front of his manager – that there were no backups done anywhere.
  4. The DBA set up full backups daily, plus log backups of all databases in full recovery mode. Later, she put a few databases into simple recovery mode in order to fix an issue. She forgot to put them back into full recovery mode. When problems struck and she needed to recover a database, she lost all data back to the prior full backup.
  5. The SQL Server ran out of space on the C drive. During emergency troubleshooting, someone deleted a bunch of BAK files. The server started up, but databases were offline and corrupt. Turned out the user databases were on the C drive, as were all of the backups – the very backups that were just deleted to free up space.
  6. The DBA started getting odd corruption errors on one of his servers, then more, and quickly all of them. The SAN admin had flashed the storage with new firmware – which had a bug. The DBA was writing his backups to that same SAN, and sure enough, some of the corrupt databases had corrupt backups too.
  7. The admin wanted to restore the production databases onto another server. He tried, but it kept saying the files were in use. He stopped the SQL Server service, deleted the files, started it again, and finally his restore worked – but his phone lit up. Turned out he’d remote desktopped into the wrong server – he was on production.
  8. The developer did a deployment on Friday afternoon, tested it, and went home. However, the deployment had an UPDATE script that wrote invalid data to a percentage of the rows in a critical table. Over the weekend, people worked in that table, putting in valid data and changing some invalid data. On Monday, by the time the team figured out what had happened, the data in the table was a total mess – with real valuable data that shouldn’t be restored over.
  9. The team built a complex AlwaysOn Availability Groups infrastructure for really reliable databases, and full backups every night, and log backups every 4 hours. When someone accidentally issued a delete statement, that delete was instantly replicated across to the other replica. They could still recover back to a point in time – but it meant serious downtime for the AG since you can’t restore databases in an AG. While they debated what to do, more valid data went into their production database – meaning they could no longer simply do a point-in-time restore.

Each of these teams thought their data was safe.

They were wrong.

Kendra says: OK, I’ve got to admit it – I’ve been part of teams where we’ve fallen victim to more than one of these scenarios. And I work on teams with smart people! This can happen to you. Look at these stories and your own environment slowly and carefully.
