
SQL performance tuning: it’s about training too


When an app needs to get faster, developers and SQL Server database administrators should work together – it shouldn’t just be the DBA working alone in the dark.  If the DBA takes the slow code, goes into his cube, tweaks it (and the indexes) and comes back out with a faster system, that fixes the problem once – but it doesn’t fix the long-term problem.

For the best long-term results, a DBA should pick a member of the application team and cross-train them on database tuning skills.  Do performance tuning sessions together, walking the developer through the analysis.  That way, they see the performance problems with their own eyes, they learn to recognize signs of problematic code, and they know how to avoid writing it in the first place.

Perfect example: a few weeks ago, a couple of our developers rewrote a slow user-defined function from scratch.  They then tested it in their development environment and were surprised to find that it was slower than the live version.  Another emergency came up, and now that I’ve got some time to tune it for them, they don’t want me to bother.  They don’t want to waste my time tuning something that they’re probably going to rewrite again anyway.

I told them to give me the code anyway, so I can at least review it and see what made it slower than they expected.  Even if this code never sees the light of day, it’ll be a good training exercise.


Behind the times or ahead of the curve?


I have this strange personality quirk – I can’t decide whether I’m behind the times or ahead of the curve.  I found out about Dinner With The Band, an HDTV podcast where a music-savvy chef (Sam Mason) cooks for a band, and then they play a couple songs in his NYC apartment.  It’s a slick mashup between cooking and music, two of my favorite things.  Anyway, they seem to have started and stopped filming episodes back in 2007.  The fact that I just now found out about them, man, ouch, that’s behind the times.

But then I started researching my favorite band off the show, Tokyo Police Club, and I went to subscribe to their blog:

First Subscriber!

YES!  YES!  I’m the first subscriber!  That means I’m ahead of the curve again!  Woohoo!


SQL Backup Software: Part 4 – Features Still Missing

Backup and Recovery

In this installment of my multi-part series about SQL Server backup compression software, I’m going to talk about some features that database administrators might find useful, but that don’t yet exist in the current generation of backup software.

In-Depth Backup & Restore Progress Monitoring

On multi-terabyte data warehouses – especially those running new backup compression software that the DBA may not be familiar with yet – DBAs want to know how the backup is coming along. None of the products on the market does a particularly good job of providing a centralized console with up-to-the-minute statuses across a farm of servers. Everybody has pieces of the picture, but no one shows the whole thing at once. Statistics I’d like to see include:

  • Status of the restore (is it initializing the data & log files or actually restoring data – very important distinction for large data warehouses)
  • Estimated time to completion
  • Compression ratio
  • Throughput rate (especially compared to this server’s historical average or the datacenter’s average)
  • Overlapping jobs (full backups that have run into the transaction log backup window, or jobs that won’t finish in a given amount of time)
  • Trend alerting (jobs that are taking longer than normal, jobs getting much lower throughput than normal)

The good news is that by querying each vendor’s activity database directly, an ambitious DBA with time on their hands can produce these reports themselves. The bad news is that few of us have that much time on our hands! Well, I do, but I use it to write my blog, heh.
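Since every vendor’s activity database has its own schema, I can’t show a one-size-fits-all query, but here’s a rough sketch of the kind of trend report I mean, run against SQL Server’s own msdb backup history instead.  Swap in the vendor’s tables to get compression ratios and job-level detail.

    -- Backup duration and throughput trends from native msdb history
    -- (a sketch only - each vendor's activity database will look different)
    SELECT TOP 50
           bs.database_name,
           bs.backup_start_date,
           DATEDIFF(ss, bs.backup_start_date, bs.backup_finish_date) AS duration_seconds,
           bs.backup_size / 1048576.0 AS backup_size_mb,
           bs.backup_size / 1048576.0
             / NULLIF(DATEDIFF(ss, bs.backup_start_date, bs.backup_finish_date), 0)
             AS throughput_mb_per_sec
    FROM   msdb.dbo.backupset bs
    WHERE  bs.type = 'D'      -- full backups only
    ORDER BY bs.backup_start_date DESC;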

Ability to Send Emails During Restores

I’m spoiled by Symantec/Veritas NetBackup, which will automatically email the system administrators whenever a restore finishes, whether it succeeded or failed. This feature helps me empower my junior DBAs: while I want them to do their own restores, I also want to know who’s taking advantage of the service – and when I need to jump in and help with a failed restore.

Ideally, I would like the ability to enter an email address during the restore process and get an email when the restore finishes. Even better would be an option to get an email at each 10% of progress. My data warehouse restore took several hours, and I wish I didn’t have to keep logging in to check on its progress.
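In the meantime, there’s at least a way to peek at a running restore from any query window instead of babysitting the GUI.  This is just a sketch (it assumes SQL 2005 or later), and wiring it up to Database Mail for the email piece is left as an exercise for the reader.

    -- Check progress of any running backup or restore
    SELECT r.session_id,
           r.command,
           r.percent_complete,
           DATEADD(ms, r.estimated_completion_time, GETDATE()) AS estimated_finish_time
    FROM   sys.dm_exec_requests r
    WHERE  r.command IN ('BACKUP DATABASE', 'RESTORE DATABASE', 'BACKUP LOG', 'RESTORE LOG');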

Diagnostic Reporting & Recommendations

The backup software should be able to pinpoint the current backup bottleneck. Some examples might be:

  • CPU load – if the CPU is too heavily loaded, the backup software may be waiting on CPU time to compress the backup. It would alert the DBA and suggest a lower compression ratio.
  • Source disk throughput – if the source drives aren’t responding quickly enough, and if CPU load is low, the backup software could suggest a higher backup compression ratio. That way, the backups could be compressed smaller without affecting the backup times or the server response times.
  • Network throughput – if the backup target is a UNC path, and if the network bandwidth is saturated on the database server, the software could suggest adding network cards for better performance or using a higher compression ratio.
  • Target disk throughput – as with our source disk throughput scenario, if the target drives are too slow, the software could suggest higher compression ratios.

This sort of reporting could be an add-on license, or be used as a cross-selling tool for the vendor’s SQL Server performance monitoring software. In Quest’s case, the combination of Quest LiteSpeed for backup and Quest Spotlight for performance monitoring means that the two products have all of the information necessary to make these configuration recommendations.

Restore Source Selection

When it’s time to restore databases, Idera SQLsafe does a great job of integrating history across servers. When selecting a source for the restore, SQLsafe lets the DBA pick from a list of source servers and databases. Idera also includes a button to fetch the most recent backup of that source database including all matching transaction logs.

Quest LiteSpeed does a better job of picking source files off disk, though: their file-picking wizard shows all machines in the network (not just SQL Servers) and lets users drill into each machine’s file shares. LiteSpeed accesses the restore files via UNC path. This beats Idera’s UNC selection method, which requires the end user to manually type in the full UNC path including file name.

The best restore source selection would combine Idera’s and Quest’s methods.

Even better, I’d like to see a true disaster recovery wizard. Instead of specifying a single database to restore, I’d like to be able to restore an entire server: fetch the latest full & transaction logs for all databases on a given server, and restore all of them to another target server.

Automatically Skip T-Logs on Simple Mode Databases

Our developers and junior DBAs routinely create new databases without advance warning. I have to know that those databases will get backed up without human intervention, and the only way to do that with today’s backup software is to use a single maintenance plan for all user databases. If we set up individual maintenance plans for different groups of databases, any new database wouldn’t be backed up until someone manually added it to the right maintenance plan – and that’s not an acceptable risk for us.

However, on a single server we’ll have some databases in full recovery mode and some in simple, depending on their purposes. For example, our systems monitoring databases (like Ipswitch WhatsUp) stay in simple mode because we don’t mind losing a day of transactional history.

The problem comes in with that single maintenance plan: today’s backup software isn’t smart enough to skip transaction log backups on databases in simple mode. It tries to do a t-log backup anyway, and then throws an error because the backup can’t be performed.

One solution would be to have the maintenance plans automatically skip t-log backups for databases in simple mode. Another solution would be to use policy-based management that defines a default policy for any newly created databases, and then lets me move simple-mode databases into their own policy.
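Until the vendors build that in, here’s a minimal sketch of the workaround: a T-SQL job step that only backs up the log for databases in full or bulk-logged recovery.  The UNC path is made up for the example.

    DECLARE @BackupPath nvarchar(260), @sql nvarchar(max);
    SET @BackupPath = N'\\backupserver\sqlbackups\';   -- hypothetical path

    -- Build one BACKUP LOG statement per database that isn't in simple mode
    SELECT @sql = COALESCE(@sql, N'') +
           N'BACKUP LOG [' + name + N'] TO DISK = N''' + @BackupPath + name +
           N'_' + CONVERT(nvarchar(8), GETDATE(), 112) + N'.trn'';' + CHAR(13)
    FROM   sys.databases
    WHERE  recovery_model_desc <> 'SIMPLE'   -- skip simple-mode databases
      AND  database_id > 4                   -- skip system databases
      AND  state_desc = 'ONLINE';

    IF @sql IS NOT NULL
        EXEC sp_executesql @sql;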

Take Corrective Action Inside the Management Console

Both Idera SQLsafe v4.6 and Quest LiteSpeed v4.8 have dashboards that show recent activity, like which databases haven’t been successfully backed up in a given time range.

That’s a great start, but from that same screen, I want to take corrective action (like starting a backup or scheduling one) and diagnostic actions (like viewing the error logs or the event logs). If something is color-coded red to indicate trouble, then it should have a right-click mechanism to correct the problem.

Load-Balanced Scheduling

A while back, I got the nickname of Six Million Dollar Man. Somebody said that if they followed all of my suggestions, our products would take six million dollars to deliver. Here’s a good example of my wacko ideas.

For most of our production servers, we do full backups every day and transaction log backups every 15 minutes. Here’s the thing, though: the exact time of the full backup doesn’t really matter to me – I just want it to run when the server is at its lowest load point, and I don’t always know when that is.

To make that easier, I want a report that shows the critical hardware metrics for a server so that I can pick the best time for a backup window. It should look at source disk load, target disk load, network throughput, CPU load and the size of the databases, and then suggest the best time window for the full backups.

As if that wasn’t tough enough, I want it to be aware of the backups that are already running, and highlight that load or subtract it from the report. Meaning, if there’s a set of full backups running from 4pm to 6pm, then don’t just show high server load during that window and say it’s the wrong time to run a backup. The backup is creating the load!

And finally, I would want to balance the backup load across multiple servers. I don’t want to try to back up two large data warehouses at the same time to the same file share, or two blades in the same chassis since that would max out my network throughput. I would want the ability to say that full backups should avoid overlapping.

(Pipe dream, right? I know.)

Summary

Some of my suggestions are pretty expensive to implement, but they give database administrators an idea of the kinds of features missing in the market now. This list shouldn’t stop a DBA from buying backup compression software – there’s still a huge ROI for backup compression, and I’ll talk about that ROI in my next blog post justifying the purchase price. (I’m writing these a little out of order!)

Continue Reading Justifying The Cost of SQL Server Backup Software


What’s the Largest Database You’ve Worked With?

When hiring a production DBA, I start by asking them, “In terms of file size, what’s the largest database you’ve worked with?”

In a perfect world, we’d only hire very experienced candidates, people who’ve worked with larger databases than we’re dealing with. In reality, though, we can’t always afford those candidates – if they’re even available at all. For some positions (like a junior production DBA with a senior already on staff), we can live with a lower level of experience – someone who can grow into the position. In other positions (like a senior lead), we can’t afford to have them learning on the job.

Non-DBA readers: to find the answer to this question, remote desktop into the shop’s largest SQL Server and do a search for files with MDF or NDF extensions. Add all of the file sizes together, and make a note of the largest one. For example, the total size might be 200 gigs, with the largest file being around 100 gigs.
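If you’d rather not go spelunking through the file system, a quick query (SQL 2005 and up) gets the same numbers – size here is reported in 8KB pages, so divide by 128 for megabytes:

    SELECT SUM(size) / 128 AS total_data_file_mb,
           MAX(size) / 128 AS largest_data_file_mb
    FROM   sys.master_files
    WHERE  type_desc = 'ROWS';   -- MDF/NDF data files only, no transaction logs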

I generally classify servers into these vague size groups, going by total size of all databases on the server:

  • Under 50 gigs – these servers almost run themselves once the basic maintenance plans are set up. The Windows system admins manage the SQL Server without training, and management takes less than an hour per week. Everything is installed on the C drive without serious performance issues.
  • 50-250 gigs – The system admins encounter some performance bottlenecks. Initial setup should be done by a DBA, and basic maintenance plans may require some planning and regular attention. Requires some performance tuning, but can be learned by a new DBA. Management takes 4-8 hours per week once the server is set up correctly, but the initial configuration repairs (when a DBA takes over for the first time) may take a few weeks.
  • 250GB-1TB – Requires a full-time database administrator on staff, although not one DBA per server. Backups must be done with specialized backup compression software, and index and statistics updates may run for hours. Performance tuning should be done only by staff with experience. Management takes 4-16 hours per week once the server is set up by a trained DBA, but initial configuration repairs can take a month or more.
  • Over 1TB – Even the simplest tasks require experience from other servers in this tier.

Database administrators have a tough time upgrading their skills from one tier to the next without hands-on experience. A DBA can manage more servers of the same tier without training, but the larger the server gets, the more planning and training it takes to do a good job on maintenance.

DBAs will also rarely step down in size, because pay goes along with the size of the environment. If the candidate says they’re used to working with terabyte-sized data warehouses, but they’re willing to take a job managing a couple of 100gb servers, start asking why.

If the company can’t find or can’t afford a DBA with experience on their size of database servers, there are three workarounds:

Training from an experienced local mentor.

Ask each candidate a tough question: “If you ran into a problem you couldn’t solve, do you have anybody in town that you could call in to help?” They might have a contact, someone they’ve worked with before, who would be willing to lend a hand. If not, there are always consultants, but their skills can be tough to judge too. Expect a junior candidate to need mentor help for a few days per month when upgrading their skills from one tier to the next.

Buy software to make the job easier.

Companies like Quest, Idera and Red Gate provide a lot of tools to automate database administration. They build experience into the tools, making it easier for junior DBAs to pinpoint performance bottlenecks, solve index and statistics problems, and automate manual tasks. Sure, the software costs money, but the upside is that a junior DBA with good software tools can sometimes outperform a senior DBA who does everything by hand. (That excludes the >250gb tier, though – get an experienced pro for that tier.) Another advantage is that these tools stick around even when the staff turns over, so the investment keeps paying off over time.

Send the DBA to a training school or camp.

I have to be honest: I hate this option. The trainers are all too often “book experts” who don’t have real-world experience. The school curriculum moves at a fast pace, spending a set amount of time on each of many subject areas that the junior DBA may not need help with. Instead of spending a week at a boot camp covering a zillion topics, the junior DBA would get much more help from a week of targeted, hands-on time with an experienced pro looking at their system’s specific problems. That way, the junior learns exactly what’s applicable to the problems their company is facing, not canned solutions for things they won’t encounter for months or years.

More DBA Career Articles

  • Moving from Help Desk to DBA – a reader asked how to do it, and I gave a few ways to get started.
  • Development DBA or Production DBA? – job duties are different for these two DBA roles.  Developers become one kind of DBA, and network administrators or sysadmins become a different kind.  I explain why.
  • Recommended Books for DBAs – the books that should be on your shopping list.
  • Ask for a List of Servers – DBA candidates need to ask as many questions as they answer during the interview.
  • Are you a Junior or Senior DBA? – Sometimes it’s hard to tell, but I explain how to gauge DBA experience by the size of databases you’ve worked with.
  • So You Wanna Be a Rock & Roll Star – Part 1 and Part 2 – wanna know what it takes to have “SQL Server Expert” on your business card?  I explain.
  • Becoming a DBA – my list of articles about database administration as a career.

SQL 2008 release date pushed back


Remember all the hoopla when SQL 2005 came out? Microsoft swore we’d get more frequent releases because the 5-year lag after SQL 2000 meant it wasn’t worth paying for Software Assurance on SQL Server. They promised we’d get new stuff faster.

Fast forward to today’s announcement:

“Microsoft is excited to deliver a feature complete CTP during the Heroes Happen Here launch wave and a release candidate (RC) in Q2 calendar year 2008, with final Release to manufacturing (RTM) of SQL Server 2008 expected in Q3.”

I literally just finished installing our first SQL 2008 development (non-lab) box and I was scheduling a meeting with our developers to encourage them to migrate their apps. Thank goodness they announced this before I held the meeting – I’d have looked like an idiot.

Update: the SQL Server 2008 release date is still scheduled for August.  I moderated a great webcast on SQL Server 2008’s new features with Kevin Kline, Hilary Cotter and Geoff Hiten talking about what’s coming.


Dell EMC AX150i review

Storage

The iSCSI EMC AX150i is resold by Dell, and refurbished versions are available pretty inexpensively through the Dell Outlet. We just picked up our second AX150i with 12 500gb SATA drives for around $6,000 total, or about $1,000 per raw terabyte. It’s a great deal for lab or development storage, but there are a couple of gotchas that I haven’t found in other reviews.

The AX150’s Drive Pool Setup Limitations

The AX150i stores its operating system on the first four drives (bays 0 through 3 – disks 1 through 4 in the numbering I’ll use below). Those 4 drives therefore have less available capacity than the other 8 drives in the AX150i. Those drives cannot be used as hot spares, either, because of their specialized roles.

This breaks the AX150’s drives into the following categories:

  • Disks 1-4: some space consumed by OS, can’t be hot spares
  • Disks 5-11: full amount of free space
  • Disk 12: hot spare (could be any disk number, but I’m picking 12 for this example)

If the user tries to create a disk pool, and tries to mix some of disks 1-4 with disks 5-11 (like creating a 6-disk array with disks 1, 2, 3, 4, 5, 6), the AX150 throws out a stern warning. It will allow the configuration, but the warning is pretty ugly.

So if the user can’t mix drives from those 3 groups (OS, empty, hot spare), that basically leaves the following possibilities:

  • One pool with disks 1, 2, 3, 4 – can either be raid 5 (roughly 1.5tb capacity) or raid 10 (1tb capacity)
  • One pool with disks 5, 6, 7, 8, 9, 10, 11 – an awkward total of 7 disks, meaning raid 5 (3tb capacity) or only use 6 disks in a raid 10 and get 1.5tb capacity plus an extra hot spare
  • One hot spare (or two, if disks 5-10 were used as a raid 10)

Total config space for raid 5 is 4.5tb, and for raid 10 is 2.5tb. That space quantity is fine, but the hard-coded setup limitations on which disks can be grouped together are a pain – especially when we’re only dealing with 12 drives.

The AX150i Has No Load Balancing

The AX150i has two iSCSI ports, and it does support multipathing with the stock Microsoft iSCSI initiator. I’ve successfully set up both of the iSCSI ports on the same switch, set up two network cards on a Windows server with the MS iSCSI initiator, and then pulled various network cables out during I/O activity. Sure enough, the drives fail over without loss of connectivity or data. To me, that’s an astounding bargain at the sub-$10k price point.

However, with one Windows server and one AX150i, I haven’t been able to break the roughly 90MB/sec bottleneck – meaning, I don’t get more than one network card of throughput. I’ve tried multiple disk pools, multiple drives, and different drives on different network ports, and I haven’t been able to figure out how to get it to use two full network ports.

This is not a huge problem at this price point – I’m thankful enough that I’ve even got failover in an array this inexpensive. However, consumers need to be aware that two iSCSI ports don’t mean two 1gb network cards at 100% saturation.

It’s Not Expandable

Dell/EMC don’t claim this model is expandable, but this needs to be emphasized to the prospective buyer. Shops with two or more AX150’s can’t combine them, and can’t migrate data from one AX150 to another. If there’s a small amount of free space on two AX150’s, it can’t be combined to create a single LUN.

Again, not a big problem at this great price point, but something to be aware of. This feature is available at higher price points from companies like EqualLogic (now owned by Dell) and LeftHand Networks.

And Yet: A Big Winner At This Price Point

Shops that haven’t invested in shared storage yet can get their feet wet with an AX150 without a big capital commitment. I’d recommend it for a shop that isn’t sure whether it’ll go with iSCSI or fiber down the road, or that just isn’t sure about shared storage, period. This class of storage is cheaper to get into – and out of – than a LeftHand chassis. An AX150 can be had for under $10k, whereas the LeftHands and EqualLogics run at least twice that much. Granted, they offer twice as many features and much better scalability.

We took the approach that these are for development sandbox use only, never production. Our VMware lab farm is hooked up to an AX150, and so is my SQL Server data warehouse testbed. It’s a testament to their ease of use that I’m always tempted to slap a production database on one just because we’ve got the extra space.


SQL Backup Software: Part 3 – New Features for Backup/Restore

Backup and Recovery

In the last decade, Microsoft has brought SQL Server a long way – adding things like .NET code, data encryption, partitioning, and a darned slick management user interface. But one of the things that hasn’t changed is the process of backing up and restoring data. We still back up to a file, and we still restore the entire database. The process is clunky, usually manually managed, and no new backup/restore features have arrived because Microsoft has been focusing on things like the BI stack. (I can’t exactly complain about that!)

Other vendors, though, have introduced several new features that give database administrators new tools in their backup & recovery arsenal. I’m going to talk about these features in general rather than how a specific vendor implements them, because I want to cover a lot of ground, and not every feature is easy to show in a single screenshot. Here we go:

Object Level Recovery – restoring single objects and rows

The most urgent restore requests seem to be the smallest ones: someone dropped a single table, someone deleted a key record, someone hosed up a stored procedure.  Everyone expects the DBA to be able to recover objects at a moment’s notice, but SQL Server’s native backups don’t make that easy.  We have to restore the whole database, then pluck out the specific objects we want, and copy them over to the online database.  What a hassle!  Today’s third party backup software automates all of that by browsing inside the backup file, letting us pick specific objects, and asking us where to restore them.  When testing this capability in backup software, look for these abilities:

  • Restore multiple objects at once
  • Select objects without reading the entire backup file (time-consuming)
  • Script capabilities (so that you can create a single restore script, then reuse it for the exact objects you want without browsing through the entire backup file)
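For contrast, here’s roughly what the manual, native-backup version of an object-level restore looks like.  The file names, logical file names and table are all made up, but the hassle is real: restore a full copy, pluck the object out, then clean up after yourself.

    -- Restore the whole database to a scratch copy...
    RESTORE DATABASE SalesCopy
        FROM DISK = N'\\backupserver\sqlbackups\Sales_Full.bak'
        WITH MOVE 'Sales_Data' TO N'E:\MSSQL\Data\SalesCopy.mdf',
             MOVE 'Sales_Log'  TO N'F:\MSSQL\Logs\SalesCopy.ldf',
             STATS = 10;

    -- ...pluck the dropped table back out...
    SELECT *
    INTO   Sales.dbo.OrderDetails_Recovered
    FROM   SalesCopy.dbo.OrderDetails;

    -- ...and clean up the scratch copy.
    DROP DATABASE SalesCopy;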

Central dashboard to manage all SQL Servers

A DBA that manages more than ten servers wants to spend less time checking their backups and restores. Vendors solved this problem with simple, intuitive dashboards that show whether any backups were missed in the last, say, 24 hours, or whether any database hasn’t been backed up in a certain time range. The DBA can see at a glance that all of the servers are protected. Contrast this with the native SQL Server backups, where there’s no graphical way to glance at a server and know that all of the databases are backed up. Some things to watch for:

  • Manage servers without an agent – useful for non-licensed servers
  • Central database of managed servers – so you don’t have to maintain the server list on every DBA’s workstation

Detailed progress status for currently running backups

I’ve taken this one for granted since I started using Idera SQLsafe and Quest LiteSpeed, and whenever I have to deal with a plain SQL Server, I can’t believe I lived without it. At a glance in the management UI of either program, I can see exactly how much has been backed up so far. Sounds crazy, but when the DBA is dealing with a 2 terabyte data warehouse, he doesn’t want to wait around to find out if data is getting out of the server. Look for:

  • Status monitoring from anywhere – on any DBA’s workstation, they should be able to see status updates for any server’s running backups

Encrypted database backups

For public companies and medical companies, this just sells itself. Database administrators who don’t have backup compression software can get this budgeted just by selling it as the way to encrypt database backups.

I also wrote a tech brief for Quest called “10 Things DBAs Probably Don’t Know LiteSpeed Can Do.” It requires registration, but hey, it’s free. (I guess that means it’s not one of the best things in life, eh?)

Continue Reading Features missing from today’s SQL Server backup software.


Microsoft Technology Center: The DBA Experience


Our contacts at Microsoft offered us the services of a Microsoft Technology Center to do a two-week lab exercise on our data warehouse. We’d kept piling one application after another on our data warehouse SQL Server until the poor thing was plain out of maintenance windows and couldn’t keep up, and we needed info from the pros to keep scaling it up.

Before my visit to the MTC, I tried searching the internet for other people’s experiences with the labs, but I came up empty-handed. I resolved to write up a recap after the trip was over so that other DBAs heading to an MTC engagement would understand how it works, what to expect, and things to look out for.

What A Microsoft Technology Center Is

For us, the Microsoft Technology Center was really all about two things: getting insight from Microsoft experts and experimenting with different system-wide architecture options.

Microsoft staffs the labs with gurus who have extensive knowledge of MS products. When we ran into trouble with the testing functionality of Visual Studio, our MS guy quickly summoned the relevant expert. We got tricky questions answered in a matter of minutes instead of having to reinvent the wheel. As a result, we could make a lot of progress in a short amount of time.

Walking around the MTC, I was struck by how all of the place’s value lies in the technical people and the gear. The office could have been anywhere in the world, just a typical New Horizons-looking place with conference rooms and whiteboards. The building was completely worthless – until the people walked in, and suddenly it was priceless. Out of nowhere, I understood the Microsoft “People Ready” ads.

A minor drawback, though, is that the experts aren’t always available at a moment’s notice. They may be in other meetings, working with other clients at the lab, or gone for the day – or they may not have experience in that one particular aspect of the product.

The MTC datacenter is stocked with relevant hardware ahead of time for the client’s lab engagement. They did a great job of simulating our environment: while it wasn’t an exact match, it was more than close enough to simulate what we were doing in production. We were able to experiment with adding more memory, changing SAN configurations, changing SQL settings and other things we couldn’t possibly do in rapid fire succession in our production environments.

What An MTC Isn’t

The MTC doesn’t have every piece of software, particularly non-Microsoft software. I needed to bring a copy of our data warehouse, and they had Quest LiteSpeed available to do database restores. However, they couldn’t have copies of our reporting software (BusinessObjects), our ETL software (IBM DataStage) or our load testing software (Mercury LoadRunner) due to licensing restrictions. We were warned about this ahead of time, and we could have rented or borrowed trial licensing from our vendors, but we would still have had to build that complete environment in another location from scratch. That kind of setup turns a two-week lab engagement into a four-week project with more risk and more planning. That fact alone made the Microsoft BI stack more appealing.

Since we couldn’t use our reporting, ETL or load test software, we had to have Plan B’s for those, and that brings in the next problem: the Microsoft Technology Center is not the place to learn new products. For example, we needed to be able to load test an IIS application without our normal Mercury LoadRunner, so we used Microsoft Visual Studio Team Test. VSTT met all of our needs and was a lot of fun to work with, but it came with a one week learning curve: we didn’t get a useful benchmark out of it until the fifth business day of the lab. Our QA staff had to rebuild their use case tests with a new product, we had to interpret IIS errors we’d never seen before, and we couldn’t compare the lab environment test results from VSTT with the production test results from LoadRunner.

I walked away with a lot of valuable QA experience and a great impression of Visual Studio Team Test, but looking back, that’s probably not the best thing for a database administrator to learn, which brings up the next weakness.

The Microsoft Technology Center only has the people the client sends to the remote lab, not everyone back at the office. During the lab setup, I wish we could have had at least one of our own developers and a QA staffer on hand. These people would have gained much more from the lab setup experience, the load testing and the code mitigation ideas. We would have had lower setup time, faster test iteration times, and faster mitigations. Instead, we had to do a lot of conference calls back and forth with the home office to get everybody on the same page and get status updates on the work.

The MTC also isn’t a silver bullet for application problems: during testing, I uncovered a few horrifying bugs that we just couldn’t fix fast enough to learn from the lab. We wanted to optimize one particular set of nightly processes, but while reviewing the code, I found business logic errors that required a rewrite of a major stored procedure. Since that stored proc represented the vast majority of the nightly load, we couldn’t take full advantage of the lab experience for that one objective – the code mitigation would have to wait until we returned to the office and met with our entire staff.

What We Gained From Our Experience at the MTC

On one poster at the MTC, a client quote said that they got frank, honest and valuable technical feedback from the Microsoft staff. That one thing sums up the biggest benefit in my view. Client managers, client technical staff and Microsoft staff can sit in a conference room, hash out a process, and be honest about what’s worked and what hasn’t worked at other clients.

For example, I’d talked to our BI manager in the past and emphasized that we couldn’t run ETL loads during business hours without a major impact to the end users. The system would still work, but reporting performance would degrade significantly. I couldn’t put a price tag on hearing that same opinion from the Microsoft side: here was the vendor telling us that no, you can’t do that the way your system is engineered now – but if you want to, it can be done with well-designed modifications to the ETL process.

At the Microsoft Technology Center, these kinds of recommendations and opinions carry more weight because they come from independent advisors, people who have a genuine interest in seeing the product succeed. They won’t overpromise something that the product can’t do, and they know the signs of a project that will fail.

Who To Send to an MTC Engagement

When planning an MTC lab session, a company should send the staff who will do the best job of listening to Microsoft, implementing Microsoft’s recommendations, and then conveying the lessons learned to the rest of the staff.

Notice that I didn’t say to send the best developers or the best DBAs.

The lab isn’t about being extremely good at what you do: the lab is about being a good listener, giving the right answers to company-specific questions, and helping Microsoft work together with the company to deliver an improved implementation.

I think our company sent the right mix of people (although more would always be better), but sitting through the sessions, I can see how that would easily go wrong. During the first couple of days, our main MS lab contact said the same thing several times: “I know this is going to be a tough conversation, but we need to talk about doing this process a different way.”

I responded the same way every time: “This is not going to be a tough conversation. We’re not here because we’re doing things right – we’re here because we need help and guidance! Tell us what to do.”

I can totally see that conversation going different ways, though, because as we related Microsoft’s plans back to our home office over the phone, we had some ugly talks. Some folks can be pretty entrenched in the ways they’ve always done things, and they’re not receptive to new ideas.

At the same time, I wouldn’t recommend sending new staff, either: send the staff with the most possible experience with the company’s internal applications and the most in-depth knowledge of how the company does business. For example, one performance improvement that was tossed around briefly was to disable our indexes during ETL, and then rebuild them after ETL finished. Because I’m familiar with how our ETL runs across different time zones, I was able to explain why it wouldn’t work for our business.

Send a project manager, or designate one person to be the PM for the lab. That person is accountable for making sure the lab stays on track, that it meets objectives, and that the staff back at the home office deliver on urgent needs that come up as a part of the lab engagement. I initially thought a PM was a crazy addition to a technical lab, but it was a great idea.

What To Ask Microsoft Beforehand

Find out exactly who will be involved from Microsoft’s end, the extent of their involvement, and their schedules. Our particular engagement was arranged at the last minute, and as a result, we didn’t get the quantity of Microsoft staff that we’d expected. The onsite experts had already been booked for other client engagements, and when we ran into problems, we couldn’t always get the help we needed. Getting the exact list of MS people who will be in each day’s activities helps to set the right expectations as to how much work will be done by the client, versus how much will be done by Microsoft.

Ask Microsoft to set up VPN access ahead of time to the lab for offsite team members who can’t go to the MTC. These team members, like people who can’t leave the office, will still be a big part of the lab engagement and they’ll need to contribute in order for the lab to keep moving forward. We had difficulties with our corporate firewall blocking VPN access to the MTC lab, and to get around it, I had to resort to installing the LogMeIn.com client on all of our MTC lab machines.

Consider a Microsoft SQL Server Health Check First

You can save a lot of time & money by doing a Microsoft SQL Server Health Check beforehand.  Microsoft sends a SQL Server expert to your site, observes your hardware & configuration in your environment, and delivers a set of recommendations.


At Microsoft’s Technology Center in Chicago


[Photo: Chicago Christmas lights in the snow]

Carlos and I flew into Chicago midday yesterday, and a winter storm came in right after us.

We’re at Microsoft’s Technology Center on Wacker Drive, and our work area on the 23rd floor has a wall of windows with a great view of the city.  We watched the sky get darker in the afternoon as the snow marched in.  After work, our dinner table looked out on the river and the city streets as the white stuff accumulated.  What a peaceful, relaxing way to spend the holidays – visiting, that is, not living here or commuting in it.

This morning, as I walked the couple of blocks from the hotel to the MTC, I had a great time looking around and shooting pictures.  Fresh snow is gorgeous before everybody (like, uh, me) trudges through it and leaves dirty footprints and slush.  The poor folks who have to come in an hour from now, well, that’s another story.

Our visit to the MTC comes right before deploying a new in-house application.  We’ve struggled with some performance problems in the SQL and the code – nothing we couldn’t solve with time, but we haven’t been able to get the dedicated hardware time or the dedicated staff time.  Our Microsoft contacts saw what was going on and responded by offering us time up at the MTC with their gurus.

That solved the dedicated hardware time, and we thought it would solve the dedicated staff time by forcing our guys to sit in a conference room thousands of miles away from day-to-day production needs, just focusing on getting the job done.  Not so much – we still had dueling teams of execs trying to get us to prioritize one project over another, even with us flying here to this standalone environment!  Funny.

Yesterday, I focused on getting a new VMware environment up (part of the deliverables are to show the performance difference between a virtual app server and a physical one), building a few application servers, and whiteboarding the ETL process with the Microsoft guys.  Today, we’re shooting for performance benchmarks on our from-scratch environment.


SQL Backup Software: Part 2 – Quad Cores are Changing the Game

Backup and Recovery

In my last post, Why Native SQL Backups Suck, I talked about the weaknesses of native SQL Server backups. Today, I’m going to extend that a little by talking about one of the greatest surprises for DBAs in recent history: the advent of dirt-cheap multi-core processors that don’t cost extra for SQL licensing.

How SQL Server Licensing Is Affected by Quad Cores

Microsoft SQL Server is licensed by the CPU socket, not the core, so it costs the same to license a single-core CPU as it does a quad-core CPU. I’ve used that logic to convince executives to upgrade older single-core database servers to new multi-core hardware, because they can often pay for the server hardware via license savings. Licensing a brand-new 2-socket quad-core box costs half as much as licensing a 4-socket box with single cores, and the license savings completely pay for the cost of the new server.

Most of the time, quad-core CPUs aren’t really a compelling feature for database administrators, because SQL Server hits I/O bottlenecks far more often than CPU bottlenecks. We pour money into drives and HBAs, and we preach the benefits of raid 10, but we don’t spend a lot of time comparing processors in great detail. I/O is the big bottleneck. This is especially true during the backup window. Backing up a SQL Server database consists of reading a lot of information from drives, and then writing that same information to another set of drives (either local or across the network).

So during the backup window, we have all these extra cores sitting around idle with nothing to do.

Let’s Do Something With Those Cores!

Why not use that extra idle CPU power to compress the data before we send it out to be written?

The users won’t notice because they’re already waiting on I/O anyway, especially during backup windows when we’re taxing the I/O subsystems.

If we dedicate this extra CPU power to data compression, we now have smaller amounts of data being sent out for writes. Our backup size gets smaller, which in turn decreases our I/O load! In effect, we’re trading CPU power for I/O power. The more CPU power we have for data compression, the more I/O we free up.

The equation gets interesting when we start to relate how much I/O speed we buy with each additional processor core. Going from a single-core CPU to a quad-core CPU enables a massive amount of backup compression power, which means much less data needs to be written to disk. If less data is being written to the backup target, then we have two options: our backup windows become shorter, or we can use cheaper/slower disks.

Using Backup Compression To Save Money

Choosing the latter method means that the shiny new quad-core database server may pay for itself. I’ve been able to say, “You need more drives for your new project? I’ll sell you my raid 10 of high-end, 73gb 15k SAN spindles because I’m downsizing to a raid 5 SATA array.” Trading off those expensive drives enabled me to buy more quad-core database servers, which could compress the backup files better, and I could live with the SATA drives as a backup target. My backup time window stayed the same, and I gained faster CPU power outside of my backup window because I had more cores.

Cheap quad-core processors enable a database administrator to trade CPU power for I/O speed in the backup window – but only when using those newfound cores to actively compress the backup data. SQL Server 2000 & 2005 can’t natively do that, and that’s where backup compression software comes in.

The same quad-core power works in our favor at restore time, too. During restores, the SQL Server has to read from the backup file and then write those objects out to disk. With backup compression software, the server does less file reads from the backup file because the backup is smaller. This means faster restores with less I/O bottlenecking, and fast restore times are important to a DBA’s career success. The faster we can restore a database in an emergency, the better we look.

Old Servers Trickle Down to Dev & QA

This pays off in another (albeit obscure) way: development & QA servers. At our shop, we’re constantly replacing big, multi-cpu (but single-core) servers with smaller quad-core servers. As a result, we have a lot of 4-way and 8-way servers lying around that are relatively expensive to license in production. They make absolutely perfect development & QA SQL Servers, though, since SQL Server Developer Edition isn’t licensed by the socket, but at a flat rate. I’ve been able to take these 8-way servers by saying, “No one else can afford to license these for their applications, but I can use them for development.” Then, those 8 cores pay off in faster restores from our production database. I’m able to refresh development & QA environments in shorter windows because I can decompress the backups faster than I could on a smaller server.

If faster backup & restore windows were the only tricks available in backup compression software, those alone would be a great ROI story, but there’s more. In the next part of my series, New Features for SQL Backup and Restore, we’ll look at ways backup software vendors are able to jump through hoops that native backups can’t.

Continue Reading New Features for SQL Server Backups


SQL Server Backup Software: Part 1 – Why Native SQL Backups Suck

Backup and Recovery

Before we start looking at SQL Server backup compression software, we need to spend a few minutes looking at the weaknesses of the native SQL Server backup process. In order to judge the fixes, we have to know what’s broken.

Native SQL Server backups take the same disk space as the data.

When we back up 100gb of data with a native backup, we’ll end up with a 100gb backup file. If a database has 100gb allocated but it’s half empty (a lot of unused space in the files), then the backup size will be roughly 50gb – the size of the actual data.

Large amounts of data take a long time to write to disk.

The slowest thing in the backup process is usually writing the backup file, whether it’s over the network or to local disk. Reads are typically faster than writes, so unless the database is under heavy transactional load at the time of the backup, the reads won’t be the bottleneck. As a result, the more data that has to get written to disk, the longer the backup will take.

We could alleviate that by purchasing faster and faster arrays for our backup targets, but that gets pretty expensive. Our managers start to ask why the DBA’s fastest raid array is being used for backups instead of the live data!

Large amounts of data take a REALLY long time to push over a network to a DR site.

This affects log shipping or just plain copying backup files over the WAN. Compressing the data as little as 25% cuts transmission times by that same amount, and cuts the amount of bandwidth required to replicate the application data. In a large enterprise where multiple applications are competing for the same WAN bandwidth pipe, other teams will ask why the SQL DBA can’t compress their data before sending it over the wire.

We can work around that problem by installing WAN optimization hardware like a Cisco WAAS appliance, but these have their own drawbacks. They must be installed on both ends of the network (the primary datacenter and the DR site), require a lot of management overhead, and they’re expensive. Really expensive.

Another workaround is to compress the backup files with something like WinZip after the backup has finished, but that’s a manual process that has to be automated by the DBA, actively managed, and adds a lag time for the compression before the data can be sent offsite.

SQL Management Studio doesn’t come with reports about the backup process.

Business folks like to say, “You get what you measure.” The idea is that if you start numerically measuring something in an objective way, that number will start to improve simply because you’re focusing on it and talking about it with others. SQL Server native backups are something of a black box: there’s no quick report to show how long backups are taking per database, how often they’re failing, and how long it would take to do a full restore in the event of an emergency.

I find it hilariously ironic that my job as a database administrator revolves around storing precise metrics for others, enabling them to do dashboards and reports, but SQL’s native backup system doesn’t offer any kind of dashboard or report to show its own backup & restore times and successes. SQL 2005 SP2 started to offer some database performance reports inside of SQL Server Management Studio, but they still don’t address the backup/restore metrics.

An ambitious DBA could build their own reports, but they have to manually consolidate data from all of their database servers across the enterprise and keep it in sync. Whew – I get tired just thinking about that. (I should probably subtitle my blog as “The Lazy DBA” come to think of it.)
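For the do-it-yourself crowd, here’s the kind of starting point I mean – one server at a time, which is exactly the problem:

    -- When was each database last backed up, if ever?
    SELECT d.name AS database_name,
           MAX(bs.backup_finish_date) AS last_full_backup
    FROM   master.sys.databases d
    LEFT OUTER JOIN msdb.dbo.backupset bs
           ON bs.database_name = d.name
          AND bs.type = 'D'   -- full backups
    WHERE  d.name <> 'tempdb'
    GROUP BY d.name
    ORDER BY last_full_backup;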

Cross-server restores are a ton of manual work.

If the DBA wants to bring a development server up to the most current production backup, including transaction logs, they either have to write a complicated script to parse through a list of available backups, or they have to do a lot of manual restores by picking files.
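Here’s a sketch of the kind of legwork involved: pulling the most recent full backup file for a database, plus every log backup taken since, out of msdb history on the production server. The database name is made up, and turning this list into actual RESTORE statements (with the right WITH MOVE clauses for the target server) is still on the DBA.

    DECLARE @db sysname;
    SET @db = N'Sales';    -- hypothetical database name

    SELECT bs.backup_finish_date,
           bs.type,                     -- D = full, L = log
           bmf.physical_device_name     -- the file to restore from
    FROM   msdb.dbo.backupset bs
    JOIN   msdb.dbo.backupmediafamily bmf
           ON bmf.media_set_id = bs.media_set_id
    WHERE  bs.database_name = @db
      AND  bs.backup_finish_date >=
           (SELECT MAX(backup_finish_date)
            FROM   msdb.dbo.backupset
            WHERE  database_name = @db
              AND  type = 'D')
    ORDER BY bs.backup_finish_date;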

Even worse, in the event of a disaster, the database administrator has to scramble through directories looking for t-logs, writing restore scripts, and hoping they work. Really good DBA’s plan this scenario out and test it often, but let’s be honest: most of us don’t have that much time. We write our DRP scripts once, test them when we have to, and cross our fingers the rest of the time.

That frustrates me because the restore process has been the same since I started doing database administration back in 1999 as a network admin. For years, I looked for the most reliable restore scripts I could find, played with them, spent time tweaking them, and wasted a lot of time. In my mind, this is something that should be completely integrated with the shipping version of SQL Server just because it’s so central to a DBA’s job.

Enough Backup Problems: Show Me Solutions!

So now we’ve seen some of the weaknesses in SQL Server 2005’s native backups. In my next couple of blog posts, I’ll talk about how third party backup compression software gets around these obstacles and offers better features with more capabilities.

Continue Reading with Part 2: Quad Cores are Changing the Game


SQL Performance Tuning: Estimating Percentage Improvements


When I’m doing performance tuning on an application in the early stages of its lifecycle (or any app that’s never had DBA attention before), I end up with a ton of recommendations within the first day of performance tuning.  The resulting to-do list can seem overwhelming to project managers and developers, so I include one of the following two sentences as a part of each recommendation:

  • This change will improve performance by a percentage, or
  • This change will improve performance by an order of magnitude

I know, I know, that phrase doesn’t come up too often, so it helps to check out Wikipedia’s definition of order of magnitude:

“Orders of magnitude are generally used to make very approximate comparisons. If two numbers differ by one order of magnitude, one is about ten times larger than the other.”

So the two sentences translate into:

  • This change will improve performance by 10-90%, or
  • This change will improve performance by 10-100x

I could use those latter two sentences instead of the “percentage versus order of magnitude” sentences, but those latter sentences make me sound like I’m taking wild, uneducated guesses.  In reality, sure, I am taking guesses, but they’re informed ones – I’m just not putting a lot of time into precisely categorizing the improvements.

Jeff Atwood’s excellent Coding Horror blog has a two-part post about estimation that should be required reading for every DBA.  Part 1 is a quiz, and Part 2 explains the answers.

So why am I leaving so much gray area in my recommendations?  Why break suggestions into such widely varied categories?  Is it smart to lump a 10% improvement in with an 80% improvement?  In the early stages of performance tuning, yes, because the DBA can’t necessarily predict which changes will be the easiest to implement, or the order in which they’ll be implemented.  When a database administrator first looks at an application, queries, stored procedures and database schema, some things pop out right away as massive opportunities for performance gains.  These recommendations are so instrumental to application performance that they often have wide-ranging impacts across the entire app.  After those changes are made, everything speeds up so much that the other recommendations have even less of an impact than they might have originally had.

For example, in a project I’m currently tuning, I found that the three largest tables in a database (which had ten times more records than all of the remaining tables combined) were constantly queried by a single field, the equivalent of a DivisionID integer field.  All of the queries hitting those tables included the DivisionID, and the application frequently did huge update statements that affected all records with a single DivisionID number.  Partitioning those three tables by DivisionID and putting each DivisionID on its own set of disks would result in a staggering performance improvement and a tremendous increase in concurrent nightly ETL processing, since more divisions could run simultaneously.
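To give an idea of the shape of that change – this is not the actual code from the project; the table, filegroup and boundary names here are invented, and table partitioning requires Enterprise Edition:

    -- One filegroup (and one set of disks) per division
    CREATE PARTITION FUNCTION pf_DivisionID (int)
        AS RANGE LEFT FOR VALUES (1, 2, 3, 4);

    CREATE PARTITION SCHEME ps_DivisionID
        AS PARTITION pf_DivisionID
        TO (FG_Division1, FG_Division2, FG_Division3, FG_Division4, FG_Division5);

    -- Rebuild the big table's clustered index on the partition scheme so each
    -- DivisionID lands in its own filegroup
    CREATE CLUSTERED INDEX CX_FactSales
        ON dbo.FactSales (DivisionID, SalesDate)
        ON ps_DivisionID (DivisionID);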

I made other performance recommendations as well, but frankly, if the developers implemented every other recommendation except the partitioning, they would still have been struggling with their nightly windows, and the implementation time would have put the project way behind on their deadlines.  On the other hand, if they just implemented the partitioning, they would sail through their nightly windows and make their project delivery deadline.  That’s the definition of an “order of magnitude improvement.”


Sunday Buffet at The Lady & Sons


As part of our road trip this week to Oklahoma City, we stopped at Paula Deen’s The Lady and Sons in Savannah. For those of you unfamiliar with The Food Network, Paula Deen and her two sons are food celebrities, great people with a great story.

The restaurant doesn’t take advance reservations: instead, hopeful diners start lining up in front of the restaurant, waiting for the hostess to arrive and begin taking names for the day’s seatings. On Sundays, the hostess arrives at 9:30 AM, and the buffet opens at 11 AM. We took our place in line around 8 AM, and we were sixth in line. By 9 AM, the line stretched down the block, and by 9:30, it was down to the next block. We gave the hostess our name, and left to do some window shopping and photography.

At 11 AM, an unbelievably loud woman came out with a clipboard and yelled instructions to the crowd of maybe a hundred people. No bullhorn, no drama, just huge pipes. She explained that the restaurant had seating on the first and third floors (with steam tables on both floors), but that the elevators only carried 15 people at a time, so we should be patient while she called out a few names at a time. That process might sound unfriendly, but the environment was so jovial and amusing that everybody had a great time.

Erika and I took our seats at a third-floor table, placed our drink orders, and headed for the buffet. The steam tables were much smaller than I’d expected, with maybe a dozen choices in all, but the staff kept all of the foods replenished quickly. I’ll cover the items one at a time.

Macaroni and cheese – this was, hands down, the very best macaroni and cheese I’ve ever put in my mouth. In fact, this shouldn’t even be called macaroni and cheese. There should be a different culinary term for this masterpiece, because it’s in a league of its own. I think they thicken it with eggs, because it has a bit of a loose-egg feel to it like the eggs in Pad Thai. When I went back for my second plate at the buffet, there was only one thing on it. That’s right – macaroni and cheese. I have resolved to track down this recipe and reproduce it, and then eat it every day for the rest of my life. Okay, maybe not.

Fried chicken – I’ve read reviews of The Lady and Sons fried chicken before, and they were right – it’s good. It’s not the life-changing experience of the macaroni, but it’s good. I will say that it’s the best fried chicken I’ve had off a steam table.

Mashed potatoes – Erika said it best when she said, “I’ve never tasted butter before in cooking, but I taste the butter in this.” Creamy texture, perfect spices, great stuff. I wasn’t as impressed with the gravy.

Roast beef – mmmm, juicy.

Everything was ever-so-slightly salty. If I didn’t point it out, you probably wouldn’t notice, and I only caught it because I’d read other reviews prior to our arrival. They could back off the salt just a tiny, teeny, wee bit, but it didn’t detract from the food. I don’t think Erika caught it.

Biscuit & hoe cake – the hoe cake is basically a pancake, but denser and with a more mealy texture. Good, but I gotta be honest – these take up space in a stomach, and that precious space should be saved for macaroni and cheese.

I didn’t try the greens, the grilled chicken, salads, or desserts. I wanted to, but I couldn’t do it in good conscience. I’m still training for the Disney marathon in January, and it’s hard to gorge myself when I’ve got ten-mile runs on the weekends.

Some of the reviews I’ve read said that Paula’s buffet is just a buffet, just like any other Southern buffet. I beg to differ, and I know how to illustrate it. Erika and I stopped several times at Cracker Barrels during the course of our road trip, and we went there for dinner the same day that we visited The Lady & Sons. Just to check, I ordered some of the same foods we’d had at Paula’s, and wow, what a difference. Paula’s food is famous for a reason – she makes ordinary food amazing.

I resisted the urge to pick up a t-shirt from the Paula Deen store, but its tagline deserves repeating here: “I’m Your Cook, Not Your Doctor.”


SQL Server load testing: synchronized tests versus random ones

Load Testing
7 Comments

Our QA department uses Mercury TestDirector to do load testing of our .NET web apps.  The basic idea is that TestDirector runs through a set of URLs to load test an application, and it can gradually scale up the number of simultaneous sessions until the datacenter is filled with smoke.  When the fire alarms go off, subtract one from the number of sessions, and that’s the maximum load the application can sustain.  (Okay, not really, but you get the idea.)

I don’t know the internals of how TestDirector does its thing, but while the QA department does their load testing, I capture my own Perfmon statistics and gather a SQL trace of all queries with duration > 1500 milliseconds.  Granted, the maximum application performance would be higher if I wasn’t running the trace, but I like being able to see behind the scenes to know what queries are causing the worst load.
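
For the curious, here’s the general shape of that server-side trace – a minimal sketch, not the exact script I run, and the output path is made up. Keep in mind that SQL Server 2005 reports the Duration column in microseconds, so a 1500 millisecond cutoff is 1,500,000:
DECLARE @TraceID int, @maxfilesize bigint, @on bit, @duration bigint;
SET @maxfilesize = 100;     -- MB per trace file
SET @on = 1;
SET @duration = 1500000;    -- 1500 ms expressed in microseconds

-- Create the trace (path is hypothetical; SQL adds the .trc extension itself)
EXEC sp_trace_create @TraceID OUTPUT, 0, N'C:\Traces\LoadTest', @maxfilesize, NULL;

-- Event 12 = SQL:BatchCompleted; columns: 1 TextData, 13 Duration, 14 StartTime, 16 Reads, 17 Writes, 18 CPU
EXEC sp_trace_setevent @TraceID, 12, 1, @on;
EXEC sp_trace_setevent @TraceID, 12, 13, @on;
EXEC sp_trace_setevent @TraceID, 12, 14, @on;
EXEC sp_trace_setevent @TraceID, 12, 16, @on;
EXEC sp_trace_setevent @TraceID, 12, 17, @on;
EXEC sp_trace_setevent @TraceID, 12, 18, @on;

-- Only keep batches where Duration (column 13) >= 1,500,000 microseconds
EXEC sp_trace_setfilter @TraceID, 13, 0, 4, @duration;

-- Start the trace
EXEC sp_trace_setstatus @TraceID, 1;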

In the trace, I noticed that different QA testers set up their tests differently.  Most of the time, the users were randomized: all of the sessions were active simultaneously, but they weren’t all doing exactly the same thing at exactly the same time.  While some users were logging in, other users would be printing a report, and other users would be doing data entry.  At first, I thought that was a good way to do load testing because it gave a picture of what the user activity levels would look like in real life.

Later (and I suspect, accidentally) they set up a test so that each simultaneous user was doing exactly the same thing at exactly the same time.  Ten users would hit the login page simultaneously, then they would all hit the menu page, then they would all do the same function at the same millisecond.

The app was dramatically slower in places under that synchronized testing. The synchronized tests pointed out which parts of the app had concurrency problems, which ones caused high CPU loads on the web server, which ones hit the SQL Server hardest, etc.

When I analyzed the load test results, I looked at the following metrics from Perfmon side by side with my SQL trace results, both sorted by date/time order:

High CPU use on the web server – during periods of very high CPU use on the web front end, I looked at the queries in the trace.  I noticed that a login stored procedure showed up with a low number of reads, low number of writes, and low CPU load, but the duration was very long, usually over 30 seconds.  When I ran that same stored procedure from Management Studio, it always finished in under a second.  I even tried running it over and over in several loops in different sessions, and it still always finished fast.  I looked at the results of the stored proc, and it was returning 7,000 lines of XML.  I deduced that the web front end was having difficulty parsing and processing that XML, and it wasn’t able to digest the results fast enough.  When I pointed this out to the app team, they researched and discovered that the test user’s access wasn’t configured correctly.

Heavy blocking on the SQL Server – when SQL reported lots of locks, I correlated that to the queries that were running at the time.  A stored procedure was doing “while” loops to find unused ID numbers in a table, and was doing the whole thing inside a transaction.  I showed them how to rewrite the query with a single update statement, thereby avoiding the loops and transaction.
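
I won’t reprint their code here, but the fix is the classic queue-table pattern: claim a row in one atomic statement instead of hunting for one in a loop. A simplified sketch with made-up table and column names:
-- One set-based statement claims an unused ID and returns it - no WHILE loop and no
-- explicit transaction required. READPAST lets concurrent sessions skip rows that
-- another session has already locked instead of blocking on them.
UPDATE TOP (1) dbo.IDPool WITH (UPDLOCK, READPAST)
SET InUse = 1
OUTPUT inserted.IDNumber
WHERE InUse = 0;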

High CPU use on the SQL Server – for a brief time in the load test, SQL was maxing out CPU power.  SQL CPU use is always my first concern, because if a program goes live with high CPU use, then the only way to fix it once it goes live is to throw very expensive hardware and licensing at it.  I checked what queries were running at the time, and it turned out that a stored procedure was building dynamic SQL and not caching execution plans.  Because the query was pretty complex, it was taking a lot of time to recompile.  I explained how to write it in a way that the execution plans could be cached.
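
The rewrite boils down to parameterizing the dynamic SQL with sp_executesql instead of concatenating the values into the string. A stripped-down illustration (the Orders table here is hypothetical, not the actual app code):
DECLARE @CustomerID int, @sql nvarchar(max);
SET @CustomerID = 42;

-- Concatenating the value into the string forces a fresh compile for every distinct
-- value; passing it as a parameter lets SQL Server cache and reuse a single plan.
SET @sql = N'SELECT OrderID, OrderDate, TotalDue
             FROM dbo.Orders
             WHERE CustomerID = @CustomerID';

EXEC sp_executesql @sql, N'@CustomerID int', @CustomerID = @CustomerID;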

In summary: I didn’t notice these things when the tests were randomized, because when they weren’t happening in clusters, they disappeared into the background noise.  When they ran simultaneously in large quantities, it was much easier to point out concurrency and scaling problems.  Database administrators that work with a QA team should ask QA to run their tests both ways.  (The QA folks seem to appreciate it, just knowing that someone is analyzing the results this closely.)


SQL Server Backup Best Practices

I’ve been backing up SQL Servers for almost a decade now, and it’s time to share the lessons I’ve learned. All of this is my own opinion – your mileage may vary – but I’ll try to explain the reasoning behind the choices I make. I won’t address log shipping or snapshots this time around.

Never back up databases to local disk.

If the SQL Server crashes, especially due to a hardware problem or a severe OS problem, the local drives may not be available. In the event of a hardware disaster, I’ve been able to point to my junior guy and say, “Go build a new server and start doing restores from the network share while I try to resuscitate this server.”

Back up databases to a fileshare, then back the share up to tape.

Tape drives these days are fast enough that the vendors like to say DBAs should go straight to tape, and they’re technically right: tape backup & restore speed is not a bottleneck. However, there’s a limited number of drives available. When the DBA needs a restore right away, the tape drives aren’t necessarily sitting idle. At our shop, if multiple people need to do simultaneous restores or backups, there can be a lag time of hours. Disk backups, on the other hand, are always available.
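
The backup command itself is nothing fancy – just point it at the UNC path. The share name below is made up, and the SQL Server service account needs write permissions on it:
BACKUP DATABASE SalesDW
TO DISK = N'\\BackupServer\SQLBackups\SalesDW\SalesDW_Full.bak'
WITH INIT, CHECKSUM, STATS = 10;   -- CHECKSUM validates page checksums and adds a backup checksum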

Cost justify the network share with lower licensing costs & simpler backups.

The SAN & backup admins will need cost justification for a new dedicated array just for SQL backups. Tell them that it pays for itself by eliminating the need for backup agents on each SQL Server, plus it simplifies their lives because they can just have one backup policy. Back up everything on that network share once a day, and get it offsite ASAP. They don’t have to worry about what servers are on what schedules or when the peak loads are – they just have one easy task to back up that network share.

Back up to a different SAN if possible.

This one’s not a reality for all shops, but it’s saved my bacon, so I have to mention it. We had our production SAN go down for an entire weekend. I was able to build new database servers using local disk & another SAN, and then restore databases & transaction logs from the network share. The one reason it succeeded was because my backup array was on a different SAN than the production SAN.

My sweet spot for the backup array is RAID 10 SATA.

Depending on the backup windows, multiple database servers may be writing backups to that same share simultaneously. RAID 10 gives better write performance than RAID 5, and while I’d love to have fiber channel backup drives, it’s cost-prohibitive. RAID 10 on SATA gets me the best balance of cost versus backup throughput. I’ll write a separate article about how to monitor performance metrics during backups to determine where the bottleneck is. Before I started benchmarking, I’d expected my bottleneck to be my gigabit network cards, but it turned out to be a RAID 5 SATA array.

Backup agents like NetBackup and Backup Exec mean giving up scheduling control.

I steer clear of backup agents like Backup Exec, NetBackup and TSM because the schedules are generally dictated by the backup guys instead of the database administrators. I like having the control in the hands of the DBAs because they’re quicker to react to SQL Server problems. For example, if the nightly ETL jobs fail on a data warehouse, the DBAs can quickly pause the backup schedule or restore the databases without input from anybody else. I want my backup guys to concentrate on getting the data to tape and getting it offsite rather than worrying about application problems.

Do regular fire drill rebuilds and restores.

At least once a quarter, do a restore from tape. When the DBA team is separate from the backup administrator, just tell the backup admin that it’s a real restore need, like a developer lost something and needs it back from tape. Restore it.

Build a standalone restore testbed.

Stay buddies with the system administrators. Ask them to keep an eye out for leftover end-of-life servers in the shop that have enough internal drive space to do fire drill restores. I acquired a pretty slow server with a huge number of internal drive bays, picked up a bunch of big drives off eBay, and presto, I have a restore testbed. One of these days, I’ll automate the fire drill restore process so that each week it restores off the network share, but I haven’t quite gotten that ambitious yet. For now, I do the restores manually.
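
When I do get around to automating it, the script will be a plain restore off the share, something along these lines (paths and logical file names below are made up – check the output of RESTORE FILELISTONLY for the real ones):
-- See what's inside the backup file
RESTORE FILELISTONLY
FROM DISK = N'\\BackupServer\SQLBackups\SalesDW\SalesDW_Full.bak';

-- Restore onto the testbed's local drives
RESTORE DATABASE SalesDW
FROM DISK = N'\\BackupServer\SQLBackups\SalesDW\SalesDW_Full.bak'
WITH MOVE 'SalesDW_Data' TO 'D:\SQLData\SalesDW.mdf',
     MOVE 'SalesDW_Log'  TO 'D:\SQLLogs\SalesDW_log.ldf',
     REPLACE, STATS = 10;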

Keep management informed on restore time estimates.

Every quarter, I look at the backup sizes & speeds, and estimate how long it will take to restore a server from tape. I turn that into a written report to management, and for each server, I give cost estimates to decrease that outage window. Sometimes the answer is third-party backup compression software, sometimes it’s more CPU power for the server, sometimes it’s adding memory, etc. The primary goal is to give them enough information to make a good business decision about whether it’s worth the money. The secondary goal is to CYA: when a server is down, and a restore is taking longer than the business owners would like, then it’s a great time to pull out last quarter’s report and say, “See, I told you this would take an hour to restore, and you said it wasn’t worth the $5,000 to shrink down. Should we reconsider that decision now?”
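
The raw numbers for that report come out of msdb. Here’s a rough sketch of one way to pull them – it assumes native backups (third-party tools log their history differently) and just reports each full backup’s size and throughput over the last quarter, since restore speed roughly tracks backup speed:
SELECT database_name,
       backup_finish_date,
       backup_size / 1048576.0 AS backup_mb,
       backup_size / 1048576.0
           / NULLIF(DATEDIFF(ss, backup_start_date, backup_finish_date), 0) AS mb_per_sec
FROM msdb.dbo.backupset
WHERE type = 'D'    -- full backups only
  AND backup_finish_date > DATEADD(dd, -90, GETDATE())
ORDER BY database_name, backup_finish_date DESC;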

Trust no one.

Get an external FireWire or USB drive. Once a month, bring it into the office, hook it up to the server hosting the network share, and copy the entire backup share to that drive. It’ll take hours, and that’s OK. Take it home, or in today’s security-conscious world, take it to the bank and put it in a safe deposit box. Nobody should ever need to do this, but sometimes even the best backup administrators make mistakes.


PASS Summit 2007: Day Three Notes

#SQLPass
2 Comments

Bill Baker: Keynote on Office PerformancePoint 2007

Bill’s a fantastic speaker, and this is an interesting product, but our organization’s BI isn’t mature enough to take on this product yet. The product has a lot of potential; in a nutshell, it combines the flexibility and agility of Excel with the data safety & consistency of SQL Server BI.

Sitting in seminars the last few days, though, has gotten me really excited about the direction of the Microsoft BI stack. When our company is ready to take it to the next level, Microsoft has some great tools to help us get there. First I gotta get Analysis Services in place.

Linchi Shea: SQL Server on SANs

I picked this seminar before I realized I’m already subscribed to Linchi’s blog. He does a great job in the blog of illustrating SAN performance issues with charts and tests, and this seminar was like his blog’s greatest hits. His recommendations included:

Measure the performance of each piece of the storage I/O path using MBps, IOps, and latency. That will make it easier to identify the bottlenecks. Use IOMeter or SQLIO.exe to quantify each I/O path with those metrics. Then follow up with repeated measurements over time to watch for shifting SAN performance.

Linchi’s slide on “Create a balanced storage I/O path (6)” clearly showed how I/O bottlenecks can occur at each point of the SAN infrastructure. 2Gb cards mean 256MB/sec throughput max.

Linchi likes synchronous storage-based replication simply because it’s easier to manage large scale replication jobs than it is to manage large scale database mirroring. Heavy I/O servers may not work out because of the added latency, but for most servers he likes it. Note that he’s using fiber to connect the datacenters, which keeps latency low.

A lot of the data applied to EMC, but did not apply as much to IBM, whose SANs carve out LUNs differently. I can’t say how much it applied to other SANs, though.

He recommended sticking with basic disks instead of dynamic disks, and noted that dynamic disks are not supported in a MS failover cluster. He hasn’t seen a performance difference between the two, and he said in-place LUN growth is not quite there yet.

He explained performance tuning as a continual, iterative process cycle: measure I/O, identify the bottleneck, improve or remove the bottleneck, and start the process again.

An EMC guy in the audience kept interrupting and clarifying things. Note to seminar attendees employed by a vendor: if you would like to talk, get your own vendor seminar. Note to other seminar attendees: when someone talks, look at their badge, because they’re not always innocent bystanders.

Phil Hildebrand: Hash Partitioning and Scaling the OLTP Database

Phil spent the first 45 minutes walking through the basics of how SQL 2005 automated partitioning works and how to implement it. This was a snoozer for me because I’ve got partitioning in production already, but I was clearly in the minority. The rest of the attendees asked lots of basic partitioning questions, and they asked the kinds of questions that showed they were grasping the concept.

One of them caught on quickly that in select statements, SQL 2005 only eliminates partitions when the partition key is used as a part of the WHERE clause. Otherwise, it always scans every partition. Phil sighed and agreed that this is a huge problem, and he hopes SQL 2008 does a better job of partition elimination. I wholeheartedly agree there. He pointed out that you still get a performance gain on insert/update/delete statements, since they only lock a single partition.
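
In other words, given a hypothetical Orders table partitioned on OrderDate, only the first of these two queries gets partition elimination in SQL 2005:
-- Partition key in the WHERE clause: only the matching date-range partitions are touched
SELECT OrderID, CustomerID, TotalDue
FROM dbo.Orders
WHERE OrderDate >= '20071001' AND OrderDate < '20071101'
  AND CustomerID = 42;

-- No partition key in the WHERE clause: SQL 2005 checks every partition
SELECT OrderID, CustomerID, TotalDue
FROM dbo.Orders
WHERE CustomerID = 42;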

The best part for me was the last 15 minutes, where he walked through hash partitioning, and I got this huge “AH!” look on my face.

Imagine a Customer table with an identity field for CustomerID. You don’t want to pre-create partitions ahead of time for all of the possible customer ranges. Instead, you just want to reduce load by lumping customers into, say, 10 partitions.

The solution: add a persisted computed column that contains your own hash function, like modulus 10 (CustomerID % 10). Every customer falls into one of ten buckets (0-9), instantly slicing your table volume by 10. Wow. I wish I’d thought of that! This reduces insert/update contention because you can load balance the data across 10 partitions, preferably on 10 different LUNs.
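
Here’s a rough sketch of the idea as I jotted it down – my own object names, not Phil’s demo code, and everything lands on PRIMARY just to keep the example short (in real life you’d spread the buckets across filegroups on separate LUNs):
-- Ten hash buckets (0-9) driven by a persisted computed column
CREATE PARTITION FUNCTION pfCustomerHash (int)
    AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6, 7, 8);

CREATE PARTITION SCHEME psCustomerHash
    AS PARTITION pfCustomerHash ALL TO ([PRIMARY]);

CREATE TABLE dbo.Customer (
    CustomerID   int IDENTITY(1, 1) NOT NULL,
    CustomerName varchar(100) NOT NULL,
    HashBucket   AS (CustomerID % 10) PERSISTED NOT NULL
) ON psCustomerHash (HashBucket);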

The biggest drawback is the partition elimination problem here too: SQL will continue to scan all of the partitions unless your where clause specifically states the persisted computed column value. Makes me want to go play with 2008 just to see if the partition elimination is better.

Another drawback would be the difficulty of changing the partition scheme later. Changes would require modifying the persisted computed field, recalculating the data for all of the customers, and moving their records onto the new filegroups. Ouch. Regardless, it’s a neat idea with interesting applications.

And there you have it!

Now for the tough part: poring over all of the notes from the sessions I attended, figuring out what to implement first, and even worse, examining the slides from the sessions I couldn’t attend due to conflicts!  What a great experience.


PASS Summit 2007: Day Two Notes

#SQLPass
4 Comments

Tom Casey: Keynote

Tom covered basic SQL Server marketing slides about the BI stack. Yawn. Good presenter, but couldn’t overcome the dry material. Then he handed it off to a Dell woman to announce that they’ll be selling packaged BI solutions with hardware, software, storage, etc all in one bundle. They’re targeting customers in the 1-4 terabyte range with pricing around $100k per terabyte.

I don’t understand the target customer for Dell’s packaged BI systems. I think of the decision tree like this: if a company is large enough to spend $100k per terabyte on a BI solution, aren’t they large enough to already have a preferred x86 hardware vendor, and probably a SAN? Why not just buy hardware from your existing preferred vendors? Surely you wouldn’t choose Dell because of their BI services expertise….

Russell Christopher suggests that companies will be attracted because Dell’s answered the question of what hardware configuration is required for a 1/2/4tb BI project. I’m not so sure – I haven’t seen two 1tb BI projects go the same way.

Anyway, regardless of my dim-witted analysis, this package would have been a hard sell coming out of anybody’s mouth, but it was an especially tough sell from this particular presenter. She was a nervous wreck: she kept rehashing the same “I came from the dark side, not SQL Server” joke, and her slides didn’t match her speech. She would advance a slide only to say, “Oh, yeah, I already mentioned that.” Argh. Evidently Dell saves money on their marketing staff and passes that savings on to us.

Keynotes should be reserved for the very best speakers. Tom was fine, but it all went south when she picked up the mike. I bailed just before her presentation finished, and I heard bad things from other attendees as well.

Jerry Foster: Plan Cache Analysis in SQL Server 2005

In contrast to the Dell presenter, Jerry lit up the room as soon as he got started. This guy was born to teach. This was the first SQL seminar I’ve seen where people burst into spontaneous applause during one of his demos, even before he got to a conclusion.

Jerry and the crew at Plexus Online built a slick system to interpret the dynamic management views for the query plan cache. In a nutshell, his queries make it easy to see where load is coming from down to the batch and statement level, all without running a resource-expensive trace on the production servers.

About five minutes into the session, I knew I wouldn’t bother taking notes because I’d print out the slides and pore over them anyway. I downloaded his code the minute he put the URL up on the projector, and I’m going to spend a week going through it. He didn’t give out the source code for his front end, and I’m torn between building my own in classic ASP (miserable language, but I know it well enough to be dangerous) versus SSRS 2008 (which I don’t know at all, but might do well to learn.)

I’m not even going to try to touch base on everything Jerry discussed. Hard-core database engine DBAs owe it to themselves to go get his samples and pore over them.

I got chuckles out of some of the audience members’ questions, though. One of them started picking out differences between memory allocations on 32-bit versus 64-bit servers, trying to find out how much memory over 4gb his 32-bit servers could use for the plan cache. Hey, buddy, if you have to ask that question, then you need to upgrade to 64-bit. And if your database server isn’t capable of upgrading to 64-bit SQL, but you’re sitting in a seminar about caching, then you’ve got your priorities all wrong.

SQL Customer Advisory Team: Building Highly Available SQL Server Implementations

Going in, I thought this would be more technical, but it turned out to be a fairly high-level comparison of the newer HA technologies: database mirroring, peer-to-peer replication, log shipping and just a tiny bit on clustering. I didn’t learn much because I’d already researched the bejeezus out of these options, but their lessons-learned stuff bears some repeating here for people who haven’t done the homework.

Peer to peer replication and database mirroring have one good advantage over log shipping: the backup server (replication partner or mirror) can have a read-only copy of the database for query purposes. The CAT guys didn’t mention that the act of doing this means you have to pay licensing on the backup server; if you use it for purely disaster recovery reasons, you don’t have to license it.

Mirroring & log shipping should be done with Active Directory group security instead of SQL logins. Companies that frequently create SQL logins and modify their passwords will run into problems during disaster recovery, because the SQL logins aren’t synced between servers using mirroring or log shipping. If you strictly use AD groups for access permissions, then no user info is stored in the system databases, and you won’t have to worry about syncing the individual users.

Syncing SQL agent jobs, alerts, SSIS packages and maintenance plans is also a headache when doing disaster recovery planning, because those aren’t synced automatically either.

When doing database mirroring, remember that databases fail over individually, not the whole server at once. If your application uses multiple databases, you don’t want to have the failovers occur automatically, because a single database might fail over without the others, and timing would be important.

Monitor resources when mirroring more than 10 databases on an instance. That 10 number is flexible, just a rough guesstimate. (That scared me because I mirror more than 10 already.) Due to the way the mirror server handles writes, it may incur significantly higher I/O than the principal server.

In the event of a disaster, break mirroring quickly if there’s a chance the log files may fill up before the principal server comes back online.

When planning database mirroring, carefully analyze the log backup volume over time. The maintenance process of rebuilding indexes will add a lot of log volume, and you want that to happen during a very low activity window so that the mirroring logs don’t get too far behind. They’ve seen index rebuilds cause asynchronous mirroring to get over 2gb behind in less than 10 minutes.
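
One easy way to eyeball that log volume over time is to total up the transaction log backup sizes by hour out of msdb – a quick sketch, assuming native log backups and a made-up database name:
SELECT CONVERT(char(13), backup_start_date, 120) AS backup_hour,   -- yyyy-mm-dd hh
       SUM(backup_size) / 1048576.0 AS log_backup_mb
FROM msdb.dbo.backupset
WHERE type = 'L'                         -- transaction log backups
  AND database_name = 'MyMirroredDB'
GROUP BY CONVERT(char(13), backup_start_date, 120)
ORDER BY backup_hour;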

They talked through a rather esoteric DR setup: two servers in the primary datacenter doing synchronous database mirroring with each other, and then a third server in the disaster recovery datacenter with log shipping. That struck me as ridiculous because I’d have three possible database server names, which would be a configuration nightmare on the client side. Anyway, to get that scenario working requires manual setup and scripting, because log shipping has to be set up on both mirrored servers, and it can’t be done with the GUI.

Jason Carlson: Rich Report Design with SSRS 2008

I don’t know Jack about SSRS, but I figured I’d better sit in on this seminar after a midday conference call suggested that we might be doing it in-house.

The SSRS report design process is pretty much all new from the ground up with SSRS 2008, which makes me glad I didn’t put time into learning SSRS 2005. (Yay, procrastination!) The new design tool will be completely integrated into Visual Studio after CTP5, with a second non-VS designer with a Vista/Office 12 feel inspired by PowerPoint. The non-VS designer will support server mode (instead of just working locally), whereas the VS designer will only work when paired with an SSRS server.

Microsoft acquired the rights to a lot of Dundas chart code a few months ago. Dundas circular and linear gauges are coming in CTP6, but maps may not make it to RTM.

The chart setup is much more drag & drop than it’s been in previous releases (they say, and the crowd oohed in approval). Coders can click right in the chart to change the legend, title, locations, etc., much like Excel. As you’re doing chart setup in pop-up dialogs, the charts update in the background instantly. As a user of Excel for over ten years, I wasn’t quite as impressed as the developer members in the audience, but that’s okay – it just means I picked the right time to start poking around in SSRS.

I left about halfway through this presentation because I got some bummer news via email about a project, and wanted to do some damage control.

Tomorrow’s going to be tough – I’m stymied as to which sessions to attend. There’s some pretty good stuff out there with scheduling conflicts. I’m staying overnight on Friday, so I’ll be able to stay through the last session: troubleshooting connectivity errors. Sounds boring, but we’re having those issues at work, so I’ll be tuned in.

On to my Day Three notes.


PASS Summit 2007: Day One Notes

#SQLPass
2 Comments

My notes on the sessions I’ve attended so far:

Ted Kummert: Keynote on SQL 2008

I’d already seen most of this PowerPoint deck during a South Florida SQL meeting at Microsoft, but the demos kicked butt.  Finally, we have Intellisense in Management Studio!  Woohoo!  There was other fun stuff, but I’ve already forgotten it.

One odd note – they slipped in a slide that said the easiest transition to SQL 2008 will be to move to SQL 2005 now.  The exact wording strongly hinted to SQL 2000 users that they need to get with the program, or that maybe we won’t be able to migrate directly to SQL 2008.  That could be a problem – I’d planned on skipping SQL 2005 for the majority of our third party apps like Project, Sharepoint, Mercury Quality Center, etc.  SQL 2008’s transparent database mirroring is a big selling point, and I’ll put the time into migrating programs to it just for that alone, but it’s a tougher sell on 2005’s not-so-transparent mirroring that requires application changes.

Kalen Delaney: The Hidden Costs of Concurrency

Kalen explained the differences between optimistic & pessimistic concurrency: in a nutshell, pessimistic concurrency’s problem is that it blocks other queries that try to read what you’re updating. Both optimistic and pessimistic concurrency modes will hold locks on what they’re trying to update, but pessimistic will cause blocks to select statements as well.

The solution, switching from the default pessimistic concurrency to optimistic concurrency, means that snapshot isolation must be enabled. That causes SQL Server to store versions of updated/inserted rows in TempDB, so every update/insert incurs a lot more I/O. Reads are slower too, because if a process needs to read an older version of a record, that version will be fetched out of TempDB.

The sudden increased use of TempDB means that a database administrator’s job becomes much more difficult: they have to carefully watch the row versioning storage amounts in TempDB and size their files appropriately. Run out of space in TempDB for versions, and everything will grind to a halt. Perfmon counters are available under SQLServer:Transactions for the version store load, like checking to see how much space is being used for the version store as opposed to user TempDB objects.

She made the case for using Read Committed Snapshot Isolation instead of what she calls “full” Snapshot Isolation because it requires less overhead.
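
For reference, flipping a database to RCSI and keeping tabs on the version store looks something like this (the database name is hypothetical, and the ALTER needs the database quiesced, so it’s an off-hours change):
ALTER DATABASE SalesDW SET READ_COMMITTED_SNAPSHOT ON;

-- The version store counters live under the SQLServer:Transactions Perfmon object
SELECT counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Transactions%'
  AND counter_name LIKE 'Version%';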

I’d love to enable RCSI on my data warehouse, but I’d need to put some time and testing into validating the TempDB speed & size first. Not a low-hanging fruit.

Kalen did a great job of conveying difficult concepts with simple demos, and I would recommend her sessions and her books to highly technical DBAs. This was one of those sessions that I really wanted to come to PASS to see, because this stuff is pretty esoteric.

SQL CAT: Optimizing OLTP: Lessons Learned from MySpace.com

This one was a bit of a surprise, because Hala Al-Adwan from MySpace showed up to talk about the challenges she’s had with SQL 2005. She’s worked closely with the Microsoft SQL Customer Advisory Team on managing her 275+ SQL Servers. That’s right, 275 HP DL585 4-way dual-cores with 32-64GB of RAM. They’re adding 3 SQL Servers a day. Humina humina. And you think you have replication problems.

They use SQL 2005 Standard Edition x64, and the drawback of Standard Edition x64 is that it won’t lock pages in memory. Enterprise does, and Standard 32-bit does, but Standard x64 doesn’t.

MySpace ran into connection denials because the same bits of code that manage IO also manage connections. This was an eye-opener for me, because I’ve heard rumblings from my data warehouse users that they’ve recently had connection denials during random times of the day. As soon as I get back to the office, I’ll be doing some in-depth performance monitoring.

As part of that problem, the SQL Customer Advisory Team recommended that transaction log writes need to happen in 2 milliseconds or less. It’s good to monitor wait queues, but for t-log drives, low wait queues isn’t enough.

They highly recommended having a clustered index on every single table regardless of size. They gave great charts and statistics about the impact of adding a clustered index: it improved speed under all circumstances, even inserts. I took a camera phone picture of that slide just to send to my BI manager because he still questions that one.
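
If you want to see how far your own shop is from that recommendation, a quick way to list the heaps (tables with no clustered index) is:
SELECT OBJECT_NAME(object_id) AS table_name
FROM sys.indexes
WHERE index_id = 0                                   -- index_id 0 = heap
  AND OBJECTPROPERTY(object_id, 'IsMSShipped') = 0   -- skip system objects
ORDER BY table_name;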

They’ve seen a problem with hotspots when inserting into tables with an identity field, but only with insert rates of >4500 inserts/second and >100 concurrent users. The fix is in the August cumulative update for SQL 2005, which I’ll be promptly applying on my dev & QA servers when I get back, because I’m pretty sure we’ve seen that issue with our ETL nightly loads. We run a lot of loads concurrently into the same tables.

The one shocker: the query optimizer will only do index seeks down to the level of the index where the user uses an inequality clause. That’s a mouthful, so lemme say it another way. Imagine a table with columns Col1, Col2, Col3 and Col4, all integers. It has an index on those columns, in order. A user writes this query:

SELECT * FROM MyTable WHERE Col1 = 100 AND Col2 > 350 AND Col3 = 135 AND Col4 = 151

The query optimizer will not look at all four fields on the index – only Col1 and Col2. Col2 has an inequality clause (less than, greater than, other than, etc) and that’s where the optimizer draws the line. Whoa. Gotta look into that.

Eric Hanson: Data Warehousing with SQL 2008

Eric specializes in the database engine, so he focused on the engine improvements for warehousing. I was most interested in the star join query plan improvements, and those alone would be worth the upgrade process. We use a lot of star join queries, and the query plans he showed looked exactly like the kinds of problems we’re having: inability to section out partitions, inability to filter down large subsets of data, etc.

He mentioned a new MERGE syntax for T-SQL that will do “upserts” – automatic insert/update/deletes, perfect for nightly ETL work.
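
The syntax could still shift before RTM, but the upsert pattern he described looks roughly like this (the table names are mine, not his):
MERGE dbo.DimCustomer AS target
USING staging.Customer AS source
    ON target.CustomerID = source.CustomerID
WHEN MATCHED THEN
    UPDATE SET target.CustomerName = source.CustomerName,
               target.City = source.City
WHEN NOT MATCHED THEN
    INSERT (CustomerID, CustomerName, City)
    VALUES (source.CustomerID, source.CustomerName, source.City);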

He also covered a benefit of using indexed views for aggregates. Data warehouses use aggregate tables either by populating them manually as a part of the nightly ETL process, or by using indexed views. Indexed views have an advantage I hadn’t thought of: they will automatically be used in query plans whenever the engine detects that they’d speed up a query. That’s not the case with the aggregate tables we build manually, because the engine has no idea that those tables relate to our fact tables.
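
A bare-bones example of what that looks like, using a hypothetical fact table – indexed views need SCHEMABINDING and a COUNT_BIG(*), and the automatic plan matching is an Enterprise Edition feature:
CREATE VIEW dbo.vSalesByDay
WITH SCHEMABINDING
AS
SELECT OrderDate,
       SUM(SalesAmount) AS TotalSales,   -- assumes SalesAmount is NOT NULL, a requirement for indexed views
       COUNT_BIG(*) AS RowCnt
FROM dbo.FactSales
GROUP BY OrderDate;
GO

-- The unique clustered index is what actually materializes the view
CREATE UNIQUE CLUSTERED INDEX IX_vSalesByDay ON dbo.vSalesByDay (OrderDate);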

SQL 2008 will include database compression and backup compression, two things I’d heard previously but I’ll mention here in case anybody hasn’t heard it. An attendee asked Eric if this would phase out third party compression utilities like Quest Litespeed and Idera SQLsafe, and Eric danced around it well. He said SQL 2008’s compression will be a first version, and the third party utilities may be more mature with more features, so they may still be worth the money. Good answer.

Bill Ramos: Performance Monitoring Using Management Studio

SQL Server 2008 also steps on the toes of third-party performance monitoring products by collecting the same types of runtime statistics DBAs pay extra for now. Bill demoed a very-very-green version of the code, and it has a long way to go but it shows potential.

I was surprised that Microsoft would let this part of the product start up at this point in the SQL lifecycle. Microsoft just got done sticking a fork in Microsoft Operations Manager 2005, and the whole tagline around System Center 2007 is that admins should be monitoring whole applications, not just single servers. For example, I don’t just want to know that my SQL Server is having problems – I want to know what applications are affected, and how that affects the uptime of my entire application.

Performance Studio, on the other hand, was very server-centric and application-ignorant.

Plus, why implement this whole thing in the SQL stack? Why not include this inside of System Center? I don’t get it.

This was the one seminar I actually left early. I just didn’t see a long-term future for this piece of the product, at least not in its current incarnation. I was a little disappointed because I’d expected something else – not quite sure what.

At that point, I went back to the hotel room for dinner (Quizno’s, blah) and an early bedtime.  Lots of great stuff on the agenda tomorrow.

On to my Day Two Notes


In Denver for SQLpass 2007 Summit

#SQLPass
0 Comments

I splurged and forked over $180 for a first-class upgrade for my flight from Miami to Denver. It’s been years since I’ve flown, and it was scheduled to be a four-hour flight, so it seemed like a good – albeit decadent – option. Turned out to be a lucky gamble: the plane was stuck on the runway for an hour while the crew worked on a communications problem, and I relaxed in my big leather seat. The only drawback: the worn-out recliner insisted on gradually, slowly, imperceptibly leaning itself back. Every ten minutes, I had to reset it back up to the upright position. Maybe it was trying to tell me something, but I’m not an easy sleeper on flights, even in the expensive seats.

By the time we were at cruising altitude and the flight attendant came around with the champagne and warm nuts, I’d forgotten that the five hours of comfort worked out to about the same cost as a new iPod Nano.

Speaking of Apple products, this was also the first time that I’d turned off my iPhone since I bought it. Nice. That thing is just rock-solid stable, especially compared to my old Cingular 8125 (HTC Wizard) that required constant reboots.

Once in Denver in my rental Nissan Xterra (free upgrade, woohoo!), I remembered two things about the West: height and distance. Even in the dark, it’s obvious how much vast, undeveloped open space looms around Denver. Being in Miami for a couple of years made me forget what that looks like.

For updates and eye candy, keep an eye on my Flickr photos.