Monthly Archives: October 2007

Select A Candidate

Minnesota Public Radio has a simple, straightforward survey that asks you one page of questions, and determines your closest match in the upcoming elections.  It gives you a single list of all matching candidates in order of their agreement with your positions, regardless of their political affiliation.

http://minnesota.publicradio.org/projects/ongoing/select_a_candidate/

Turns out I’m a McCain man, with Joe Biden (a Democrat!) coming in second.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

More Posts - Website

Follow Me:
TwitterFacebookLinkedInGoogle PlusYouTube

Moving to Houston in February

This week, Erika got her official class date for air traffic control training.  She reports for duty in Oklahoma City, OK on November 8th.  We’ll be driving the Jeep cross-country in the first week of November so she can have a car there.  I’ll fly back to Miami Beach and stay here through the end of her training in February.  At that point, we’ll both move to Houston, where she’ll be stationed.

I’ll miss a lot of things about Miami, especially Miami Beach: driving home on Friday afternoons past the cruise ships, parking the car, and getting around by foot all weekend.  Living here makes jogging easy, as evidenced by the mounting numbers on my NikePlus web site gadget.  My marathon training is progressing along nicely – did 7 miles today.

I will not miss Miami’s absurd housing costs.  Erika and I ran the numbers repeatedly, and there was just no way we could afford to buy a place here.  I’ve seen a couple of recent transplant workers buy hilariously overpriced half-million-dollar houses that would barely fetch $150k in the Houston market, but salaries between the two cities seem identical.

After being in Miami for so long, Houston looks dirt cheap.  I’ve been surfing the Houston MLS and drooling over lofts.  See sample loft #1 and sample loft #2.  We’ve been furiously punching our calculators to see if we can make it work, and it all depends on how my job goes.

I’m working with the folks at Southern Wine to see if I can telecommute from Houston.  My fingers are crossed, but the company doesn’t exactly embrace that concept – yet.  I’m not giving up easily because I’ve done telecommuting for years, and I know how to make it work.  Worst case scenario, I end up doing consulting for them.  I wasn’t eager to start consulting again after my horrible gig with Adjoined, but if I was calling the shots on the client relationship, I think I’d have a different perspective.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

More Posts - Website

Follow Me:
TwitterFacebookLinkedInGoogle PlusYouTube

SQL Server load testing: synchronized tests versus random ones

Our QA department uses Mercury TestDirector to do load testing of our .NET web apps.  The basic idea is that TestDirector runs through a set of URLs to load test an application, and it can gradually scale up the number of simultaneous sessions until the datacenter is filled with smoke.  When the fire alarms go off, subtract one from the number of sessions, and that’s the maximum load the application can sustain.  (Okay, not really, but you get the idea.)

I don’t know the internals of how TestDirector does its thing, but while the QA department does their load testing, I capture my own Perfmon statistics and gather a SQL trace of all queries with duration > 1500 milliseconds.  Granted, the maximum application performance would be higher if I wasn’t running the trace, but I like being able to see behind the scenes to know what queries are causing the worst load.

In the trace, I noticed that different QA testers set up their tests differently.  Most of the time, the users were randomized: all of the sessions were active simultaneously, but they weren’t all doing exactly the same thing at exactly the same time.  While some users were logging in, other users would be printing a report, and other users would be doing data entry.  At first, I thought that was a good way to do load testing because it gave a picture of what the user activity levels would look like in real life.

Later (and I suspect, accidentally) they set up a test so that each simultaneous user was doing exactly the same thing at exactly the same time.  Ten users would hit the login page simultaneously, then they would all hit the menu page, then they would all do the same function at the same millisecond.

The load test results from that synchronized testing were dramatically slower in parts.  The synchronized tests pointed out which parts of the app had concurrency problems, which ones caused high CPU loads on the web server, which ones hit the SQL Server hardest, etc.

When I analyzed the load test results, I looked at the following metrics from Perfmon side by side with my SQL trace results, both sorted by date/time order:

High CPU use on the web server – during periods of very high CPU use on the web front end, I looked at the queries in the trace.  I noticed that a login stored procedure showed up with a low number of reads, low number of writes, and low CPU load, but the duration was very long, usually over 30 seconds.  When I ran that same stored procedure from Management Studio, it always finished in under a second.  I even tried running it over and over in several loops in different sessions, and it still always finished fast.  I looked at the results of the stored proc, and it was returning 7,000 lines of XML.  I deduced that the web front end was having difficulty parsing and processing that XML, and it wasn’t able to digest the results fast enough.  When I pointed this out to the app team, they researched and discovered that the test user’s access wasn’t configured correctly.

Heavy blocking on the SQL Server – when SQL reported lots of locks, I correlated that to the queries that were running at the time.  A stored procedure was doing “while” loops to find unused ID numbers in a table, and was doing the whole thing inside a transaction.  I showed them how to rewrite the query with a single update statement, thereby avoiding the loops and transaction.

High CPU use on the SQL Server – for a brief time in the load test, SQL was maxing out CPU power.  SQL CPU use is always my first concern, because if a program goes live with high CPU use, then the only way to fix it once it goes live is to throw very expensive hardware and licensing at it.  I checked what queries were running at the time, and it turned out that a stored procedure was building dynamic SQL and not caching execution plans.  Because the query was pretty complex, it was taking a lot of time to recompile.  I explained how to write it in a way that the execution plans could be cached.

In summary: I didn’t notice these things when the tests were randomized, because when they weren’t happening in clusters, they disappeared into the background noise.  When they ran simultaneously in large quantities, it was much easier to point out concurrency and scaling problems.  Database administrators that work with a QA team should ask QA to run their tests both ways.  (The QA folks seem to appreciate it, just knowing that someone is analyzing the results this closely.)

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

More Posts - Website

Follow Me:
TwitterFacebookLinkedInGoogle PlusYouTube

SQL Server Backup Best Practices

I’ve been backing up SQL Servers for almost a decade now, and it’s time to share the lessons I’ve learned. All of this is my own opinion – your mileage may vary – but I’ll try to explain the reasoning behind the choices I make. I won’t address log shipping or snapshots this time around.

Never back up databases to local disk.

If the SQL Server crashes, especially due to a hardware problem or a severe OS problem, the local drives may not be available. In the event of a hardware disaster, I’ve been able to point to my junior guy and say, “Go build a new server and start doing restores from the network share while I try to resuscitate this server.”

Back up databases to a fileshare, then back the share up to tape.

Tape drives these days are fast enough that the vendors like to say DBAs should go straight to tape, and they’re technically right: tape backup & restore speed is not a bottleneck. However, there’s a limited number of drives available. When the DBA needs a restore right away, the tape drives aren’t necessarily sitting idle. At our shop, if multiple people need to do simultaneous restores or backups, there can be a lag time of hours. Disk backups, on the other hand, are always available.

Cost justify the network share with lower licensing costs & simpler backups.

The SAN & backup admins will need cost justification for a new dedicated array just for SQL backups. Tell them that it pays for itself by eliminating the need for backup agents on each SQL Server, plus it simplifies their lives because they can just have one backup policy. Back up everything on that network share once a day, and get it offsite ASAP. They don’t have to worry about what servers are on what schedules or when the peak loads are – they just have one easy task to back up that network share.

Back up to a different SAN if possible.

This one’s not a reality for all shops, but it’s saved my bacon, so I have to mention it. We had our production SAN go down for an entire weekend. I was able to build new database servers using local disk & another SAN, and then restore databases & transaction logs from the network share. The one reason it succeeded was because my backup array was on a different SAN than the production SAN.

My sweet spot for the backup array is raid 10 SATA.

Depending on the backup windows, multiple database servers may be writing backups to that same share simultaneously. Raid 10 gives better write performance than raid 5, and while I’d love to have fiber channel backup drives, it’s cost-prohibitive. Raid 10 on SATA gets me the best balance of cost versus backup throughput. I’ll write a separate article about how to monitor performance metrics during backups to determine where the bottleneck is. Before I started benchmarking, I’d expected my bottleneck to be my gig network cards, but it turned out to be a raid 5 SATA array.

Backup agents like NetBackup and Backup Exec mean giving up scheduling control.

I steer clear of backup agents like Backup Exec, NetBackup and TSM because the schedules are generally dictated by the backup guys instead of the database administrators. I like having the control in the hands of the DBAs because they’re quicker to react to SQL Server problems. For example, if the nightly ETL jobs fail on a data warehouse, the DBAs can quickly pause the backup schedule or restore the databases without input from anybody else. I want my backup guys to concentrate on getting the data to tape and getting it offsite rather than worrying about application problems.

Do regular fire drill rebuilds and restores.

At least once a quarter, do a restore from tape. When the DBA team is separate from the backup administrator, just tell the backup admin that it’s a real restore need, like a developer lost something and needs it back from tape. Restore it.

Build a standalone restore testbed.

Stay buddies with the system administrators. Ask them to keep an eye out for leftover end-of-life servers in the shop that have enough internal drive space to do fire drill restores. I acquired a pretty slow server with a huge number of internal drive bays, picked up a bunch of big drives off Ebay, and presto, I have a restore testbed. One of these days, I’ll automate the fire drill restore process so that each week it restores off the network share, but I haven’t quite gotten that ambitious yet. For now, I do the restores manually.

Keep management informed on restore time estimates.

Every quarter, I look at the backup sizes & speeds, and estimate how long it will take to restore a server from tape. I turn that into a written report to management, and for each server, I give cost estimates to decrease that outage window. Sometimes the answer is third-party backup compression software, sometimes it’s more CPU power for the server, sometimes it’s adding memory, etc. The primary goal is to give them enough information to make a good business decision about whether it’s worth the money. The secondary goal is to CYA: when a server is down, and a restore is taking longer than the business owners would like, then it’s a great time to pull out last quarter’s report and say, “See, I told you this would take an hour to restore, and you said it wasn’t worth the $5,000 to shrink down. Should we reconsider that decision now?”

Trust no one.

Get an external FireWire or USB drive. Once a month, bring it into the office, hook it up to the server hosting the network share, and copy the entire backup share to that drive. It’ll take hours, and that’s OK. Take it home, or in today’s security-conscious world, take it to the bank and put it in a safe deposit box. Nobody should ever need to do this, but sometimes even the best backup administrators make mistakes.

More of My Backup Articles

Here’s more posts about SQL Server database backup technology:

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

More Posts - Website

Follow Me:
TwitterFacebookLinkedInGoogle PlusYouTube