I first heard this in the blogosphere and thought it was a joke, but sure enough, the CTPs of SQL Server 2008 won't start on February 29th.
That’s a little frustrating, but it’s a good example of why I’m not letting developers use SQL 2008 yet at our shop. Miles to go before we sleep…
In today’s SQL newsletter from SQL-Server-Performance.com I was horrified to find the following tip:
When you specify a server in an ADO.NET connection string, use the server’s IP address, not the server’s DNS name. By using an IP address instead of a DNS name, name resolution does not have to occur, reducing the amount of time it takes for a connection to be made. A server’s IP address can be used to specify either a default or named instance of a server running SQL Server. If you are running a cluster, use the virtual SQL Server IP address. [7.0, 2000, 2005]
As a guy who's lived through more than one IP address renumbering disaster, please believe me when I say you should never do that. IP addresses can and do change. Server names change too, but when they do, it's easy to create a DNS alias that points the old name at the new server. There's no equivalent trick for an IP address: you can't make an old IP point to a server in a different subnet, especially if the server is being physically moved.
Disaster recovery is a great example, and offsite performance testing is another. There are times when we'll airlift an entire application and drop it somewhere else. Everything will be the same, but the IPs are different – and bam, the application fails. Programmers have to be called in to track down every possible place a connection string could be hiding in their app.
Furthermore, applications don’t query DNS every time they query the server. They have a DNS cache, and they will only look up the DNS name periodically – and that time is usually measured in hours or days, not seconds or minutes.
If your application is so blazingly fast that a 1-second DNS lookup once per day is a bottleneck for you, then I salute you, and you might consider using IPs in your connection string. However, the rest of us need to use DNS names.
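To make the comparison concrete, here's a sketch of the two styles of ADO.NET connection string – the server name, IP address, and database below are made-up examples:

Server=SQLPROD01.mycompany.com;Database=Sales;Integrated Security=SSPI;
Server=192.168.10.25;Database=Sales;Integrated Security=SSPI;

The first one survives a server move with a one-line DNS change; the second one means hunting down and editing every copy of the string.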
Normally I like SQL-Server-Performance.com and I recommend it for beginners, so this one surprised me, and I had to point it out.
IBM Director is a decent tool for managing firmware & driver versions, but it doesn't have a nice report showing the firmware & driver versions for things like the RSA, RAID card, network cards, etc. After I couldn't find a similar one online, I wrote the T-SQL query below to list the most commonly needed versions & IP addresses. If you store your Director data in something other than SQL Server, you'll need to modify the query to match your database platform.
You'll also need to modify it to match your table owner – we named our service account IBMdirector, so our tables are all prefixed with that, but you may need to use dbo instead.
It will produce multiple result rows for a single machine if you have multiple RAID cards, multiple network cards, etc. – but you can filter that out based on your needs.
SELECT pc.MACHINE_TYPE
, obj.LABEL AS ServerName
, fwBios.Version AS BiosVersion
, COALESCE(fwDiag1.Version, fwDiag1.BuildNumber) AS DiagnosticsVersion
, fwBMC.Version AS BMCFirmwareVersion
, fwRSA.Name AS RSAFirmwareName
, fwRSA.BuildNumber AS RSAFirmwareBuildNumber
, rsaIP.MPIPAddress AS RSAipAddress
, rsaIP.MPSubnetMask AS RSASubnetMask
, rsaIP.MPGatewayIP AS RSAGatewayIP
, raid.Model AS RaidModel
, raid.BiosVersion AS RaidBiosVersion
, raid.FirmwareVersion AS RaidFirmwareVersion
, raid.DriverVersion AS RaidDriverVersion
, drvIPMI.Vendor_Name AS IPMIDriverVendorName
, drvIPMI.Version AS IPMIDriverVersion
, drvAPCI.Vendor_Name AS ACPIDriverVendorName
, drvAPCI.Version AS ACPIDriverVersion
, drvRSA.Vendor_Name AS RSADriverVendorName
, drvRSA.Version AS RSADriverVersion
, rsaDriver.REVISION AS RSADriverRevision
, drvQLogic.Vendor_Name AS QLogicDriverVendorName
, drvQLogic.Version AS QLogicDriverVersion
, drvSAV.Vendor_Name AS SAVDriverVendorName
, drvSAV.Version AS SAVDriverVersion
, drvVGA.Vendor_Name AS VGADriverVendorName
, drvVGA.Version AS VGADriverVersion
, drvBroadcom.Vendor_Name AS BroadcomDriverVendorName
, drvBroadcom.Version AS BroadcomDriverVersion
, dirAgent.Version AS DirectorAgentVersion
, (SELECT TOP 1 Firmware FROM IBMdirector.UMS_SRDISKDRIVES fwHD WHERE fwHD.MANAGED_OBJ_ID = obj.MANAGED_OBJ_ID ORDER BY fwHD.Firmware) AS HardDriveOldestFirmwareVersion
, (SELECT TOP 1 FirmwareVersion FROM IBMdirector.TWG_FIBRE_ADAPTER fwHD WHERE fwHD.MANAGED_OBJ_ID = obj.MANAGED_OBJ_ID ORDER BY fwHD.FirmwareVersion) AS HBAOldestFirmwareVersion
, (SELECT TOP 1 DriverVersion FROM IBMdirector.TWG_FIBRE_ADAPTER fwHD WHERE fwHD.MANAGED_OBJ_ID = obj.MANAGED_OBJ_ID ORDER BY fwHD.DriverVersion) AS HBAOldestDriverVersion
FROM IBMdirector.TWG_MANAGED_OBJECT obj
INNER JOIN IBMdirector.TWG_PHYS_CONTAINER_GLOBAL pc ON obj.MANAGED_OBJ_ID = pc.MANAGED_OBJ_ID
LEFT OUTER JOIN IBMdirector.UMS_FIRMWARE fwBios ON obj.MANAGED_OBJ_ID = fwBios.MANAGED_OBJ_ID AND fwBios.Name LIKE 'Default System BIOS%'
LEFT OUTER JOIN IBMdirector.UMS_FIRMWARE fwDiag1 ON obj.MANAGED_OBJ_ID = fwDiag1.MANAGED_OBJ_ID AND fwDiag1.Name LIKE 'IBM Diagnostics%'
LEFT OUTER JOIN IBMdirector.UMS_FIRMWARE fwBMC ON obj.MANAGED_OBJ_ID = fwBMC.MANAGED_OBJ_ID AND fwBMC.Name LIKE '%BMC%'
LEFT OUTER JOIN IBMdirector.UMS_FIRMWARE fwRSA ON obj.MANAGED_OBJ_ID = fwRSA.MANAGED_OBJ_ID AND (fwRSA.NAME LIKE '%Systems%' OR fwRSA.NAME LIKE '%Remote%')
LEFT OUTER JOIN IBMdirector.TWG_DEV_DRIVER drvIPMI ON obj.MANAGED_OBJ_ID = drvIPMI.MANAGED_OBJ_ID AND drvIPMI.DRIVER_ID LIKE 'IPMI%'
LEFT OUTER JOIN IBMdirector.TWG_DEV_DRIVER drvAPCI ON obj.MANAGED_OBJ_ID = drvAPCI.MANAGED_OBJ_ID AND drvAPCI.DRIVER_NAME LIKE 'IBM Active PCI Alert%'
LEFT OUTER JOIN IBMdirector.TWG_DEV_DRIVER drvRSA ON obj.MANAGED_OBJ_ID = drvRSA.MANAGED_OBJ_ID AND drvRSA.DRIVER_NAME LIKE 'IBM RSA%'
LEFT OUTER JOIN IBMdirector.TWG_DEV_DRIVER drvQLogic ON obj.MANAGED_OBJ_ID = drvQLogic.MANAGED_OBJ_ID AND drvQLogic.DRIVER_NAME LIKE 'QLogic Fibre%'
LEFT OUTER JOIN IBMdirector.TWG_DEV_DRIVER drvSAV ON obj.MANAGED_OBJ_ID = drvSAV.MANAGED_OBJ_ID AND drvSAV.DRIVER_NAME LIKE 'SAVRTPEL%'
LEFT OUTER JOIN IBMdirector.TWG_DEV_DRIVER drvVGA ON obj.MANAGED_OBJ_ID = drvVGA.MANAGED_OBJ_ID AND drvVGA.DRIVER_NAME LIKE 'ATI Radeon%'
LEFT OUTER JOIN IBMdirector.TWG_DEV_DRIVER drvBroadcom ON obj.MANAGED_OBJ_ID = drvBroadcom.MANAGED_OBJ_ID AND drvBroadcom.DRIVER_NAME LIKE 'Broadcom%'
LEFT OUTER JOIN IBMdirector.RSA_MP rsaIP ON obj.MANAGED_OBJ_ID = rsaIP.MANAGED_OBJ_ID
LEFT OUTER JOIN IBMdirector.MP_VPD rsaDriver ON obj.MANAGED_OBJ_ID = rsaDriver.MANAGED_OBJ_ID AND rsaDriver.VPD_TYPE = 'Device Driver'
LEFT OUTER JOIN IBMdirector.UMS_UMS dirAgent ON obj.MANAGED_OBJ_ID = dirAgent.MANAGED_OBJ_ID
LEFT OUTER JOIN IBMdirector.UMS_SRCONTROLLER raid ON obj.MANAGED_OBJ_ID = raid.MANAGED_OBJ_ID
ORDER BY pc.MACHINE_TYPE, obj.LABEL
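To look at a single machine, you could add a filter just above the ORDER BY line – the label here is a made-up example:

WHERE obj.LABEL = 'WEBSERVER01'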
I’ve talked about how a SQL backup compression program can help database administrators, but now it’s time to convince the boss. We have to sell management on why it’s worth the money.
First, a few words about my approach to database administration, because I think it’ll help DBAs in general. Being a database administrator is a lot like running a company: our users are buying a product (a reliable place to put their data), and they pay us (our budget) for that service. Getting more money in the budget isn’t a matter of begging our managers: it’s a matter of convincing our users that they can get a better product from us if they want to spend more money.
I never go to my IT manager and say, “Don, I need ten grand for this.” He doesn’t care whether backups take 4 hours or 1 hour because it doesn’t make a difference to him. Instead, I have to go to the stakeholders – the people who really care about backup windows and backup sizes.
Data Warehouses: Putting a Price on the Nightly Window
We have a 1tb data warehouse that's growing like a weed in a compost pile. The business wants to store more data from more sources, the users want to design cooler reports with more analytics, and the ETL guys want to build more aggregate tables to speed up reporting queries. All of these things have one thing in common: they require more time in the nightly load window.
At the start of each data warehouse expansion project or each new reporting project, I ask the developers a question: how much longer is this going to make your nightly window, and how do you plan to compensate for that? It’s like when the government adds a new department: sometimes they offset it by reducing expenditures somewhere else. The data warehouse developers can add more nightly load processing, but they have to compensate for it by making another part of the process go faster. If they can’t find that part to optimize, then I can sell them more time by reducing the nightly backup window – but that costs money.
The first step is usually to add backup compression software because it’s the lowest-hanging fruit. When the project manager looks at the cost of backup compression software versus the cost of refining the difficult ETL process, the backup compression software is a no-brainer.
Later in the process, reducing the backup window even further can be a justification for switching to SAN snapshot backups, and I'm implementing those now on the data warehouse I've been mentioning. In March, we'll be moving the data warehouse storage from an IBM DS4800 to an IBM N-Series (a rebadged NetApp) and using snapshots for nearly instantaneous backups, effectively eliminating our backup window. Of course, most SQL Server setups can't afford that capability – and that's where backup compression is the next best thing.
Time to Restore: Putting a Price on Recovery Time
The smaller the backup file, the faster it is to read off disk or tape. Granted, there is CPU overhead involved in decompressing the backup, but generally speaking, smaller database backup files restore faster than larger ones. In a disaster where the server must be rebuilt from scratch, every extra hour of restore time carries a price.
In the case of our 2-terabyte SAP BW data warehouse, it's the difference between fitting the database on a single LTO3 tape versus spanning multiple tapes. Reading from a single tape means less time spent swapping and seeking. Granted, for a data warehouse, time isn't always money.
For our sales ordering system, however, time is definitely money. We have a tight window every day when our entire sales force places their orders, and the database simply can't go down. For high availability, we use database mirroring to our disaster recovery site. That ensures the database can always take orders, but the DR site has less bandwidth than our primary site. If the primary server crashes, we have to restore it as fast as possible to get our redundancy back and run at full speed again. I can't get 100+ gb of native database backups across the wire from DR to our primary site fast enough, so instead I'd have to ship tapes from DR – an overnight delay. With backup compression software, however, I can restore across the wire and bring the primary database server back in a matter of hours instead of a day later.
Database administrators should be performing regular fire drill restores onto new hardware. The next time the shop does a fire drill restore, time it. Take the end result time, and ask management if that time lag is acceptable. If it’s not, ask how much that long window is costing the business, and there’s the justification to get backup compression software.
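If you want a crude way to time one, a plain RESTORE with progress stats works – the database name, logical file names, and paths below are made-up examples, and SQL Server prints the elapsed time and throughput when the restore finishes:

RESTORE DATABASE SalesFireDrill
FROM DISK = 'E:\Backups\Sales_Full.bak'
WITH MOVE 'Sales_Data' TO 'E:\FireDrill\Sales.mdf',
MOVE 'Sales_Log' TO 'E:\FireDrill\Sales_log.ldf',
STATS = 10;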
Refreshing Dev & QA Faster: Putting a Price on Developer Time
Developers like to work with the freshest data possible. Heck, if they could, they’d do all of their development on the production box itself. They don’t want to be down for hours or days while the development & QA servers are refreshed from the production backups.
This is especially true for serious production issues that can’t be replicated in development or QA. Sometimes an emergency comes up that requires immediate testing and remediation, and the issue isn’t seen in any environment other than production. The developers need a copy of the production data as fast as possible in order to test their fix, and every minute counts.
Ask the development teams if they'd be interested in being able to refresh even their largest databases on demand – over lunchtime instead of overnight. If the current restores take hours, and they could be done in under an hour with backup compression, then this is a real option. If their departments are willing to fork over the expense of backup compression licenses for their development & QA servers, the systems teams may be willing to pick up the remaining production licenses as a team effort.
In Summary: Look For Your Users’ Pain Points
Hopefully I've given you some ideas about how to look at the problems your users face and offer them solutions with the tools at your disposal. Sometimes we SQL Server database administrators focus too much on our own problems and struggle to get the budget dollars to fix them – when in reality, we should be looking at how those issues affect our users, and remembering that the users have their own budget money to spend.
In my last article about the C-Class chassis, I talked a little about the interconnect switches, and today I’m going to dive deeper.
As with most servers these days, each single-height HP blade like the BL460c includes an onboard dual-port network card. The difference between standalone and blade servers starts here: those two ports connect to two different interconnect bays (bays 1 & 2). They are hard-wired to these bays, each of which must contain a network switch. With standalone servers, a network admin can cable both of the server's network cards to the same switch (HP, Cisco, etc.), possibly utilizing empty ports on an existing switch. With a blade infrastructure, however, the admins need to buy at least a pair of switches for each blade enclosure.
We use two Cisco switches in those interconnect bays, each of which routes to a different core Cisco switch. We tie the whole thing together with the free HP network teaming software included with the blades, which means either of the interconnect bay switches can go down without taking down the blade's networking. This approach does have weaknesses, but I'll talk about those in an upcoming article about the HP Virtual Connect infrastructure.
A blade cannot live by two ports alone, so the BL460c includes 2 mezzanine card bays. The two mezzanine bays are HP’s version of an internal PCI Express slot designed for the tiny blade form factor, and they accommodate a variety of mezzanine cards including dual-port and quad-port network cards, dual-port SAN HBAs, and even Infiniband. This makes even the small BL460c well-suited for a variety of low to mid-range database server duties, especially for iSCSI shops. At our shop, some of the database server setups include:
- Standalone high-performance OLTP server – one mezzanine bay holds a SAN HBA for storage, the other bay is empty
- Clustered high-performance OLTP server – one bay has a SAN HBA, and the other bay has a dual-port network card
- Standalone iSCSI OLTP server – one bay has a multipurpose iSCSI network card
The more connectivity a blade needs, the more switchgear needs to be involved. A typical C7000 chassis configuration might look like this:
- Interconnect bay #1: a Cisco 3020 network switch
- Interconnect bay #2: a Cisco 3020 network switch (for redundancy)
- Interconnect bay #3: a Brocade SAN switch
- Interconnect bay #4: a Brocade SAN switch (for redundancy)
- Interconnect bay #5: a Cisco 3020 network switch (for 4-port NICs, especially for VMware)
- Interconnect bay #6: a Cisco 3020 network switch (for redundancy on the 4-port NICs)
The Problem with Lots of Switches
Having six additional switches for every 16 servers (or even fewer servers, depending on whether the shop uses full-height blades) presents some problems.
To me, the beauty of blades is their reduced complexity: it's all about making deployments easier, more predictable, and faster. Adding more switchgear doesn't eliminate that simplicity, but it certainly doesn't help. I still have to put in a ticket for our networking staff to set up the Cisco switches, and I can't double-check their work. The only way I find out the setup wasn't right is when I put in the new blade and it can't communicate on all of its network cards. I don't have to put in a ticket for the SAN admin because – well, because it's me – but the other Windows admins have to wait for me to configure their SAN connections. In all, this can add days of lag time to a new blade setup, and that takes the shine off the simplicity of blades.
This is made more frustrating by the fact that most of our blade configurations fall into just a few types: VMware hosts with specific VLAN, iSCSI and SAN needs, SQL Servers with SAN needs, and plain vanilla servers on one subnet. This should be a cookie-cutter setup job, but because the setup is done by multiple teams, there are lag times, misunderstandings, and finger-pointing when something goes wrong.
The fact that there’s a growing army of switches makes the initial configuration that much more difficult: we have to be extremely explicit with the network staff. Where we could easily just specify Core Switch A or Core Switch B before, now we have to specify which blade chassis we’re working with, which bay the switch is in, and so forth. Plus, when we hire new network administrators, they’re not always familiar with blade switches, so we have to walk them through the datacenter to explain how the different switches uplink to the core switches.
More switches also mean more firmware administration headaches. That's another six switches we have to keep on synchronized firmware versions. For example, we recently ordered a new chassis with Brocade switches, and they arrived with a newer version of firmware. Thankfully we caught that before we plugged them into our infrastructure, because that firmware version was not compatible with the other switch firmware versions in our fabric.
Another problem with this sudden growth of switches is that some management software is licensed by the switch port, regardless of whether that switch port is actively in use. We license our SAN path management software by the switch port, and the instant we plug in another pair of Brocades, we have to license that software for the additional switch ports. In some of our C7000s, we only have half of the servers connected to the SAN – meaning we’re paying licensing for more switch ports than we’d use.
The Limits of 2 Mezzanine Cards with Conventional Switches
The two mezzanine card slots are the BL460c’s first weakness as a database server: it can’t get seriously high throughput with conventional switches.
Most midrange Fibre Channel SANs don't have true multipathing for their arrays. Each LUN (drive letter) is sent through a single 4gb HBA until that path fails, at which point it fails over to the other HBA. For SQL Servers, especially data warehouses, this presents a bandwidth problem.
We did a Microsoft Technology Center lab for our data warehouse in the winter of 2007, and one of the findings was that we were hitting a SAN throughput bottleneck. We were using two 4gb HBAs with IBM’s RDAC failover multipathing, which does not truly load balance between HBAs. The recommendation was to switch to at least 4 HBAs – something we couldn’t do with the BL460c blades. Granted, we weren’t running a data warehouse on a BL460c, but my point is that it shouldn’t be done for performance reasons.
The same thing holds true with iSCSI, especially when using just 1gb switches. Since each pair of network cards is divided between two Cisco switches, we’ve been unable to get 2gb of combined throughput at any given time even when using the vendor’s multipathing software. We got an eval system from LeftHand Networks hoping it would resolve that issue, but the onsite tech agreed that it just couldn’t be done if the two network cards were connected to two different Cisco switches.
Summary: A Problem, but There’s a Solution…
These problems haven’t slowed our adoption of C-Class blades with conventional switchgear – the switches were an inconvenience that we can get around.
There’s also a solution to most of these problems: the HP Virtual Connect system. More about that in the next article in the series.
When developers and SQL Server database administrators work together to make an app faster, it shouldn't just be the DBA working alone in the dark. If the DBA takes the slow code, goes into his cube, tweaks it (and the indexes), and comes back out with a faster system, that fixes the problem once – but it doesn't fix the long-term problem.
For the best long-term results, a DBA should pick a member of the application team and cross-train them on database tuning skills. Do performance tuning sessions together, walking the developer through the analysis. That way, they see the performance problems with their own eyes, they learn to recognize signs of problematic code, and they know how to avoid writing it in the first place.
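For instance, in one of those joint sessions, you might flip on the runtime statistics together before running the slow code – the stored procedure name here is just a hypothetical stand-in:

SET STATISTICS IO ON;    -- reports logical reads per table, great for spotting missing indexes
SET STATISTICS TIME ON;  -- reports CPU and elapsed time
EXEC dbo.usp_SlowSearch; -- hypothetical procedure the developer wrote

Watching the logical read counts drop as the code gets tuned is exactly the kind of feedback that sticks with a developer.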
Perfect example: a few weeks ago, a couple of our developers rewrote a slow user-defined function from scratch. They then tested it in their development environment and were surprised to find that it was slower than the live version. Another emergency came up, and now that I’ve got some time to tune it for them, they don’t want me to bother. They don’t want to waste my time tuning something that they’re probably going to rewrite again anyway.
I told them to give me the code anyway, and at least I can review it to see what made it slower than they’d expect. Even if this code will never see the light of day, it’ll at least be a good training exercise.
But probably not the way you think.
I made it to Houston on Sunday after more hours behind the U-Haul wheel than I’d care to remember. No damage, everything went well, pretty boring.
Erika and I are now in an extended stay hotel. We moved in here temporarily while we looked for an apartment, and we'll be moving into it on Saturday. It's in downtown Houston – a funky new place with hardwood floors, tall ceilings, and a fun little patio where Ernie can lounge.
If all goes well, I’ll have broadband internet on Saturday night, at which point I’ll be able to start posting some of my experiences with the HP Virtual Connect ethernet solution. It’s a pretty nifty setup.
At the moment, though, I'm updating the blog because I can't get any work done – our network guys took down our entire switching infrastructure, apparently by accident. It has nothing to do with the HP gear, though.
Kevin Kline put together a great paper on the new features in SQL Server 2008 called “Worth the Wait: Why SQL Server 2008 is Great” and it’s a good read. It’s the best DBA-focused paper I’ve seen on the topic. People have asked me what’s coming, and this is probably the best handout to use to explain it.
I have this strange personality quirk – I can’t decide whether I’m behind the times or ahead of the curve. I found out about Dinner With The Band, an HDTV podcast where a music-savvy chef (Sam Mason) cooks for a band, and then they play a couple songs in his NYC apartment. It’s a slick mashup between cooking and music, two of my favorite things. Anyway, they seem to have started and stopped filming episodes back in 2007. The fact that I just now found out about them, man, ouch, that’s behind the times.
But then I started researching my favorite band off the show, Tokyo Police Club, and I went to subscribe to their blog:
YES! YES! I’m the first subscriber! That means I’m ahead of the curve again! Woohoo!