How to Measure SQL Server Workloads: Wait Time per Core per Second

When I ask you how fast your car is, there’s a two common metrics for answers:

  • How fast it can go
  • How much horsepower or torque its engine can deliver

If I ask you how hard your car is working right now, you’d probably answer with:

  • Its current speed (in MPH or KPH)
  • Its current engine workload (in RPM)

Conveniently, these two numbers are shown front and center on most car dashboards:

Tachometer and speedometer

Tachometer and speedometer

For SQL Server, those numbers are:

Current speed: Batch Requests per Second – the number of queries your server is currently handling. It’s available via Perfmon counter, and it’s on the dashboard of most monitoring software.

Wait Time per Core per Secondwait stats is the technique of measuring how much SQL Server is waiting on. You can’t tune wait stats using percentages – instead, you have to look at the total amount of time involved.

Compare these two five-second moments in the life of a SQL Server:

Workload Comparison

Workload Comparison

It’s clear that in Workload A, nothing is going on. Sure, 100% of our waits are on storage, but we don’t really have a storage problem. If Workload A was a car, it would be rolling along at 1mph, waiting for the driver to hit the gas pedal.

In workload B, there’s much more going on – but do we have a performance problem yet? That’s not quite as clear. To get the real picture, I have to tell you how many cores the server has.

Let’s Use Wait Time Per Core Per Second

Waits over time don’t really give you a clear picture – you have to divide the workload by the number of cores you have. Let’s say they’re both running on a VM with 8 virtual cores:

Workload comparison with 8 vCPUs each

Workload comparison with 8 vCPUs each

The VM on the left isn’t doing any work. It’s coasting.

On the VM on the right, for every second on the clock, each of its cores is spending .4 seconds (400 milliseconds) waiting on something (disk, memory, locks, etc.) While that might sound significant at first, it’s still really not doing any hard work. It’s not unusual for each core to spend several seconds per second waiting on different resources – and that’s fine, because SQL Server’s schedulers are good at switching over and running other queries that don’t need to wait.

For example, say you fire off a big SELECT query that scans a huge table, and we have to go get a lot of data from disk to accomplish it. That query’s tasks can wait for dozens (or hundreds or thousands) of milliseconds while other queries get their work done.

That’s why to measure SQL Server workloads, I’m proposing a new metric: Wait Time per Core per Second. That one metric, by itself, is a lot like the tachometer on a car’s dashboard. It’s not a measurement of how powerful your car is, but it does tell you how hard your car is working at this moment in time.

sp_AskBrent® now gives you this metric.

We’ve added new key metrics to the default output, including Batch Requests per Second, and Wait Time per Core per Second:

sp_AskBrent v13 with Wait Time per Core per Second

sp_AskBrent v13 with Wait Time per Core per Second

Note the new output down around priority 250-251 – it helps you get a fast idea of whether the SQL Server is working hard right now, or hardly working.

You can download our First Responder Kit with sp_AskBrent® now, and in our training classes, I show you more details on how I use it to triage real-world performance emergencies.

sp_AskBrent v14 adds CPU % Utilization

When your SQL Server is having performance problems, sp_AskBrent® checks a bunch of common trouble spots. In this week’s new version, it also checks sys.dm_os_ring_buffers for the most recent CPU utilization report from SQL Server and returns it.

Here’s what it looks like when the server isn’t under load – note that “No Problems Found” means the rest of the alerts are just information about the SQL Server:

sp_AskBrent on low-load server

sp_AskBrent on low-load server

And here’s a heavily loaded server:

sp_AskBrent on a server under high CPU loads

sp_AskBrent on a server under high CPU loads

I haven’t been blogging about this tool much, but I’ve become quite proud of it over the last few months.

Parameters include:

@Seconds = 5 – you can run it for longer periods like 60 seconds to see the server’s health during that time range. It’s really useful when you’re tuning a query or doing a presentation demo – fire it off on a 60-second span, go run your query or workload in another window, and then come back to sp_AskBrent to see what the effects were. Especially useful when combined with…

@ExpertMode = 1 – returns more result tables including wait stats, Perfmon counters, and my personal favorite, file stats – which shows you how much your workload read and wrote from your data/log files and TempDB:

sp_AskBrent file stats with expert mode on

sp_AskBrent file stats with expert mode on

@OutputDatabaseName, @OutputSchemaName, @OutputTableName – you can log results to tables. Some folks are running sp_AskBrent® every 5 minutes in a SQL Agent job, dumping the data to a table, so that they can go back in time and trend what happened on the server. This got better in last month’s v13, which also added the ability to output the file/Perfmon/wait stats details to individual tables too.

To get started, download our First Responder Kit with our latest scripts, posters, and e-books.

AlwaysOn Availability Groups Quiz and FAQ [Video]

Think AlwaysOn Availability Groups are right for your environment? Take my 6-question quiz to find out:

That’s from our in-person training classes, where we cover what DBAs and developers need to do for a successful AG implementation.

In those classes, here’s some of the questions I get the most often:

Q: How much network bandwidth will I need?

For a really rough estimate, sum up the amount of uncompressed transaction log backups that you generate in a 24-hour period. You’ll need to push that amount of data per day across the wire. Things get trickier when you have multiple replicas – the primary pushes changes out to all replicas, so if you’ve got 3 replicas in your DR site, you’ll need 3x the network throughput. Calculating burst requirements is much more difficult – but at least this helps you get started.

Q: What’s the performance overhead of a synchronous replica?

From the primary replica, ping the secondary, and see how long (in milliseconds) the response takes. Then run load tests on the secondary’s transaction log drive and see how long writes take. That’s the minimum additional time that will be added to each transaction on the primary. To reduce the impact, make sure your network is low-latency and your transaction log drive writes are fast.

Q: How far behind will my asynchronous replica be?

The faster your network and your servers are, and the less transactional activity you have, the more up-to-date each replica will be. I’ve seen setups where the replicas are indistinguishable from the primary. However, I’ve also seen cases with underpowered replicas, slow wide area network connections, and heavy log activity (like index maintenance) where the replicas were several minutes behind.

Q: What’s the difference between AGs in SQL 2012 and SQL 2014?

SQL Server 2014’s biggest improvement is that the replica’s databases stay visible when the primary drops offline – as long as the underlying cluster is still up and running. If I have one primary and four secondary replicas, and I lose just my primary, the secondaries are still online servicing read-only queries. (Now, you may have difficulties connecting to them unless you’re using the secondary’s name, but that’s another story.) Back in SQL 2012, when the primary dropped offline, all of the secondaries’ copies immediately dropped offline – breaking all read-only reporting queries.

Q: How do I monitor AlwaysOn Availability Groups?

That’s rather challenging right now. Uptime monitoring means knowing if the listener is accepting writeable connections, if it’s correctly routing read-only requests to other servers, if all read-only replicas are up and running, if load is distributed between replicas the way you want, and how far each replica is running behind. Performance monitoring is even tougher – each replica has its own statistics and execution plans, so queries can run at totally different speeds on identical replicas. There’s not really a good answer to this question right now.

Q: How does licensing work with AlwaysOn Availability Groups in SQL 2012 and 2014?

All replicas have to have Enterprise Edition. If you run queries, backups, or DBCCs on a replica, you have to license it. For every server licensed with Software Assurance, you get one standby replica for free – but only as long as it’s truly standby, and you’re not doing queries, backups, or DBCCs on it.

Q: Can I use AlwaysOn Availability Groups with Standard Edition?

Not at this time, but it’s certainly something folks have been asking for since database mirroring has been deprecated.

Q: Do AlwaysOn AGs require shared storage or a SAN?

No, you can use local storage, like cheap SSDs.

Q: Do Availability Groups require a Windows cluster?

Yes, they’re built atop Windows failover clustering. This is the same Windows feature that also enables failover clustered instances of SQL Server, but you don’t have to run a failover clustered instance in order to use AlwaysOn Availability Groups.

Q: Do I need a shared quorum disk for my cluster?

No, and check out Kendra’s video on failover cluster quorum to learn your options.

Q: What version of Windows do I need for AlwaysOn AGs?

We highly recommend Windows Server 2012R2, and here’s why.

Q: Can I have different indexes or tables on my replicas?

No, the replica database contents will be exactly the same as the primary.

Q: If I fail over to an asynchronous replica, and it’s behind, how do I sync up changes after the original primary comes back online?

That is left as an exercise for the reader. When I go through an AG design with a team, we talk about the work required to merge the two databases together. If it’s complex (like lots of parent/child tables with identity fields, and no update datestamp field on the tables), then management agrees to a certain amount of data loss upon failover. For example, “If we’re under fifteen minutes of data is involved, we’re just going to walk away from it.” Then we build a project plan for what it would take to actually recover >15 minutes of data, and management decides whether they want to build that tool ahead of time, or wait until disaster strikes.

Kendra says:  I wish I’d had this list when I started learning about AGs. I was also surprised to learn that crazy things can happen in an AG when you create indexes. Check out more on that here.

Woohoo! We Won a Red Gate Tribal Award: Best Blog of 2014 (Again)

For the past two years, Red Gate’s Tribal Awards have let you, the SQL Server community, nominate and vote on their favorite community voices. Here’s the 2014 winners:

This is the second year in a row that we’ve won the Best Blog award, and there’s three sets of people I want to thank.

First, I want to thank you, dear reader, for giving us the award. We toil away over a hot keyboard to bring you all kinds of useful stuff in a humorous way. It’s great to know that you enjoy our work.

Second, I want to thank my coworkers. Whenever they finish a blog post, they send it to the rest of the team for feedback, and I learn something in every single blog post I read. This job is so much fun.

Finally, I want to thank the other Tribal Award winners and nominees. This community is so uplifting, sharing, and generous. I’m proud to be a member of the SQL Server community.

More @redgate Tribal Awards! This year we won Best Blog again, plus Best Twitterer.

A photo posted by Brent Ozar (@brento) on

9 Ways to Lose Your Data

Every time someone tells me, “This database is mission critical – we can’t have data loss or downtime,” I just smile and shake my head. Technology is seriously difficult.

To illustrate, here’s a collection of client stories from the last few years:

  1. The DBCC CHECKDB job ran every week just like it was supposed to – but it failed due to corruption every week. No one got email alerts because the SQL Agent mail was no longer valid – internal email server changes meant the mail was just piling up in SQL Server. CHECKDB had been failing for three years, longer than the backups were kept. Data was permanently lost.
  2. The DBA configured his backups to write to a file share. The sysadmins never understood they were supposed to back up that file share. When the DBA asked for a restore, he was surprised to find there were no backups.
  3. Three SQL Servers were all replicating data to each other. When I asked the DBA where the backups were run, he looked at one server, then another, then the third. He sheepishly admitted – in front of his manager – that there were no backups done anywhere.
  4. The DBA set up full backups daily, plus log backups of all databases in full recovery mode. Later, she put a few databases into simple recovery mode in order to fix an issue. She forgot to put them back into full recovery mode. When problems struck and she needed to recover a database, she lost all data back to the prior full backup.
  5. The SQL Server ran out of space on the C drive. During emergency troubleshooting, someone deleted a bunch of BAK files. The server started up, but databases were offline and corrupt. Turned out the user databases were on the C drive, as were all of the backups – the very backups that were just deleted to free up space.
  6. The DBA started getting odd corruption errors on one of his servers, then more, and quickly all of them. The SAN admin had flashed the storage with new firmware – which had a bug. The DBA was writing his backups to that same SAN, and sure enough, some of the corrupt databases had corrupt backups too.
  7. The admin wanted to restore the production databases onto another server. He tried, but it kept saying the files were in use. He stopped the SQL Server service, deleted the files, started it again, and finally his restore worked – but his phone lit up. Turned out he’d remote desktopped into the wrong server – he was on production.
  8. The developer did a deployment on Friday afternoon, tested it, and went home. However, the deployment had an UPDATE script that wrote invalid data to a percentage of the rows in a critical table. Over the weekend, people worked in that table, putting in valid data and changing some invalid data. On Monday, by the time the team figured out what had happened, the data in the table was a total mess – with real valuable data that shouldn’t be restored over.
  9. The team built a complex AlwaysOn Availability Groups infrastructure for really reliable databases, and full backups every night, and log backups every 4 hours. When someone accidentally issued a delete statement, that delete was instantly replicated across to the other replica. They could still recover back to a point in time – but it meant serious downtime for the AG since you can’t restore databases in an AG. While they debated what to do, more valid data went into their production database – meaning they could no longer simply do a point-in-time restore.

Each of these teams thought their data was safe.

They were wrong.

Kendra says: OK, I’ve got to admit it– I’ve been part of teams where we’ve fallen victim to more than one of these scenarios. And I work on teams with smart people! This can happen to you, look at these stories and your own environment slowly and carefully.

The Basics of Storage Tiering [Video]

In our Senior DBA Class of 2015 session in Denver last week, one of my sessions explained storage tiering, snapshots, and replication using a table of bloggers. Here’s the first ten minutes of it:

To learn more, here’s additional resources:

How Many Databases Can I Put on One SQL Server?

I’ve seen servers with thousands of databases on a single SQL Server, and it works. Sure, opening the databases list in SQL Server Management Studio is painful, and a lot of third party monitoring tools fall over, but it’s not so bad once you know how to work around these issues.

But there’s two issues you can’t work around: our old archenemies, RPO and RTO. A quick reminder:

  • Recovery Point Objective (RPO) – if the server went down right now, where’s the point at which you could recover the data? It’s measured in time, like 1 second of data loss, 1 minute, 1 hour, etc.
  • Recovery Time Objective (RTO) – how long will it take you to get the server back online and functional? It’s also measured in time.

Say you’ve got a single stand-alone SQL Server. You’re doing full backups once a day, and those backups take six hours to complete.

The business comes along and says:

  • You can’t lose more than 15 minutes of data (RPO)
  • The server can’t be down for more than 1 hour (RTO)

In that case, knowing that your backups take six hours, that means your restores are likely going to take at least six hours. If you lose this server (due to hardware failure, storage corruption, Windows patch gone bad, etc), you’re not going to be able to get the databases restored in time.

RPO and RTO Worksheet

RPO and RTO Worksheet

You have three options: tune your backups and restores to make them go faster, put less data on each SQL Server, or implement a standby server that’s ready to go when disaster strikes.

The real answer to how many databases (and how much data) you can put into SQL Server isn’t a software limitation: it’s a business limitation. Get the RPO and RTO from the business first, put it into our HA/DR planning worksheet, and that will help you figure out if you’ll be able to recover in time.

Kendra says: If you’re asking this question, that’s your first sign that you may have a problem.

Announcing Row-Level Security in Azure SQL Database

Your users probably shouldn’t be able to view all of the data.

You might have regional sales managers who should only see sales for their region, or human resource staff who should be able to see employee details but not salaries.

This is usually challenging with databases.

How We’ve Historically Built Row-Level Security in SQL Server

We modify application code to pass in the user’s name or group name as part of their query, like this:

SELECT * FROM dbo.vwSalesHeader WHERE SalesVisibleTo = ‘MaryJane’

MaryJane doesn’t have to be a Windows or SQL authentication account – it can be anything we want to use as a security identifier. Some apps use a UserID number from their Users table, for example.

Then the vwSalesHeader view joins our SalesHeader table out to several other security tables where user names or roles are listed. The SalesHeader table might have a StoreID field, and then we’d have other tables that listed which users could see which StoreIDs.

This approach typically works well (enough) in reporting applications where we can guarantee all access is done through views. However, it isn’t as effective when users and applications are accessing the tables directly – and that’s where the database server needs to handle row-level security.

Serious Security: Enforcing It at the Server Level

If you create database logins for every end user, and the end user is authenticated in the database, then some database platforms can perform row-level security directly against tables.

The really cool part of this approach is that you don’t need to modify your applications – the apps don’t have to access the data via stored procedures or views in order to get row-level security. If you try to read or modify a row (or in some cases, even a column) that you’re not allowed to, the database simply stops you.

PostgreSQL added support for this via Row Security Policies recently in 9.5, and Oracle’s had Oracle Label Security since at least 10g. In all vendor implementations, the database needs a map between users, roles, tables, and rows. PostgreSQL does this with row security policies that let you build any query you want as the check expression. Oracle builds hierarchical security in at the data label level with compartments and groups. Both are radically different approaches, so there’s not really a concern about how Microsoft’s implementation adheres to standards.

How Azure SQL Database Does It

Books Online explains that the v12 preview will let you:

  1. Create a security predicate function to do the security check
  2. Create a security policy on a table that points to your new security function
  3. Enforce that security policy based on whoever’s logged in – without changing their queries

That’s awesome. Instead of this:

SELECT * FROM dbo.vwSalesHeader WHERE SalesVisibleTo = ‘MaryJane’

Your apps can just select from the table directly (or views still if they want):

SELECT * FROM dbo.SalesHeader

And SQL automatically applies the security policy.

That’s my favorite kind of new feature – one that can be introduced without app changes. It’s absolutely wonderful if you’re letting end users connect directly to the database with PowerBI tools like Excel.

I’m not going to write about the full implementation T-SQL here (function and policy creation) because we’re still ahead of the release date, and you can expect these details to change through the course of the release previews.

Server-Level Security Drawbacks

Most modern web and reporting applications use connection pooling with a single database login for all users, and no concern for “execute as user” statements. It’s challenging to run every query in the security context of the end user – especially if you have end users who aren’t in your actual security system. (Think public end users who don’t have Active Directory accounts.) The number of connections from your web/app tier may skyrocket, although most of those connections will end up being idle or disconnected over time. The alternative is to build in dynamic “execute as user” statements in your data access logic, and that’s nowhere near as trivial as it looks.

Plus, enforcing security at the database server level requires adding a security definition field to every secured table. If you’re not allowed to modify the tables, this approach won’t work. I don’t see this as a serious drawback because it’s still less work than modifying your application to work entirely on views and stored procedures.

Performance will suffer no matter how you implement row-level security. In the big picture, I don’t see this as a drawback because you shouldn’t implement it unless you need it, and if you need it, you should be willing to amp up your hardware requirements in order to pay for the additional business logic requirements. There’s no free lunch.

Silent security generates support calls. Anytime the users know that the system is influencing their query results (be it through Resource Governor, dirty reads, simultaneous loads & queries, or row-level security) then they’re going to think their reporting data is wrong/slow/unavailable because your infrastructure is at fault. If this is a concern for you, you can enable auditing of the security policies, but keep in mind that now you’re talking about even more performance impact.

And of course, there’s the obvious gotcha of having to create database users for these roles. In a perfect on-premise world, you don’t really want to create those in SQL Server – instead, you create groups in Active Directory and then in SQL Server. Let your help desk team manage the group memberships of individual users, and only rarely make changes to the groups in SQL Server. (But still, if you’re constantly adding/editing sales regions, and your queries need to be region-aware, you’re going to be constantly making security changes in production, QA, and dev servers.)

In summary, I think row-level security is one of the coolest new engine features I’ve seen in quite a while, and it’s in Azure SQL Database first. It solves a real business pain, and continues to bring Microsoft’s databases closer to feature parity with Oracle and PostgreSQL.

Are You Getting the Benefits of Virtualization?

Here’s some of the reasons companies usually virtualize their SQL Servers:

  1. Cost savings on hardware
  2. Cost savings on Windows OS licensing
  3. Cost savings on SQL Server licensing
  4. Protect against the failure of a single hardware element
  5. Leverage extended features for Disaster Recovery
  6. Automatic load balancing across multiple hosts
  7. Easier hardware replacement/migration

When we perform a SQL Critical Care® on a virtualized SQL Server, we often ask, “Are we actually getting those benefits?”

1. Cost savings on hardware – do you find yourself putting one SQL Server guest on each host, isolating them to make sure they get the performance they need? If so, you’re not actually saving money on hardware.

2. Cost savings on Windows OS licensing – as a standard, some companies license all their virtualization hosts with Windows Server Datacenter Edition in order to get unlimited virtualization rights. However, if you’re only running one guest per host (or just a few), then you’re not saving money here either.

3. Cost savings on SQL Server licensing – for this one, you’ve gotta do a little bit harder work. Add up the licensing you’re spending now, and look at what it would take to run similar instances on bare metal hardware. Keep in mind that you can still buy dual-socket, quad-core servers that are insanely powerful (768GB RAM, dozens of SSDs), thereby keeping your SQL licensing lower.

We're going to need your parents to sign your report card.

We’re going to need your parents to sign your report card.

4. Protect against the failure of a single hardware element – on the free versions of most hypervisors, you don’t get automatic failover protection. You can manually start up a guest on another host with some human intervention. Is that enough for the business, or are they assuming it’ll all happen automatically with only a minute or two of downtime – even when you’re not around? Or even worse, do you not have enough hardware horsepower to start up your biggest SQL Server guest somewhere else if its host fails? Or, heaven forbid, are you using local SSDs with virtualization, thereby missing the entire ability to move guests around?

5. Leverage extended features for Disaster Recovery – VMware and Hyper-V have killer features (and third-party app extensions) that make it easy to replicate a guest from one site to another. Are you using those, or have you given up because SQL Server’s data change rates are too high, and your network can’t keep up?

6. Automatic load balancing across multiple hosts – VMware’s Distributed Resource Scheduler (DRS) will automatically shuffle VMs around between hosts based on resource utilization. It’s an amazing way to react to performance issues with less human intervention. You should be using it.

7. Easier hardware replacement/migration – because SQL Server licensing is priced by the CPU core, and it’s super expensive, many shops choose to improve their virtualization host hardware annually. Whenever they need more capacity in their VMware or Hyper-V clusters, they drop in a couple of new hosts, vMotion or LiveMigrate the most expensive per-core guests over to those hosts (thereby taking advantage of today’s faster processors), and then give everybody else the hand-me-downs. It’s easy to do even live during the daytime. However, some shops are still running their SQL Servers on CPUs that might get featured on Antiques Roadshow.

If you’re not leveraging at least some of these virtualization features, and you don’t plan to…then what was the point of virtualizing to begin with? Jump on in – the water’s fine!

Urgent AlwaysOn Availability Groups Bug

If you’re using AGs, don’t apply these patches:

  • SQL 2012 SP2 CU3
  • SQL 2012 SP2 CU4
  • SQL 2014 CU5

until you read this Microsoft post about a breaking bug.

Your AG may stop synchronizing due to blocking between user queries and a system session. The fix is to disable automatic failover, restart the primary, and enable automatic failover again.

Carry on.