
During Dell DBA Days 2016, we got to hang out with Doug Bernhardt of Dell’s storage team. He’s the guy who publishes Dell’s SQL Server documentation, builds their Fast Track reference architectures, and gets to play with all kinds of cool toys.
Doug emailed us and asked:
Is there any solutions type of info (or any info in general) on either Dell or EMC SAN and SQL Server that you think would be helpful to you, your customers, or the SQL community in general? Anything that would help make better decisions about architecture, design, purchasing, or optimization of any of the Dell EMC gear?
And I figured there’s only one person I should ask – and that’s you, dear reader. So what documentation on storage would you like to see?
35 Comments. Leave new
If the hardware vendor could provide a dashboard of sorts that would allow the SAN Admin and the Database Admin to agree, “Yeah, we have a storage problem.” That would be most helpful. I am watching that fight right now, no gloves, not pretty.
This x 1000.
Second that!
Measuring/finding latency. If I run perfmon and see latency in the read/sec or write/sec, what are some ways to check similar counter from the storage side. Acknowledging that there are usually layers between: vm and network. Any insight into evaluating the entire picture would be helpful.
Along the same lines as Mr. Treasure. If I think I have a storage issue, what information should I be collecting and how should I present it to my storage team?
Apart from quickstart/common troubleshooting guides, I’d like to see something formal around discrepancies between what Windows/SQL sees, and what the storage sees.
My use case here is that we would see 200ms latency on the OS, and 20ms on the LUN. It was a virtualized environment, using Paravirtual adapter etc.
The hidden component here was queue depth. It was 200+ in Windows, but only 64 on the LUN. VMWare has a default value of a queue depth of 64 for guests, which was throttling our workload.
We got to the bottom of that one, but some troubleshooting when there are layers between your workload and the drives would also be really useful (Filter drivers, virtualization, MDS switches etc.)
This is great, and I think I might be in the same situation. Perfmon shows >200ms latency spikes on our SQL boxes (ESXi 6 VMs) but our SAN admin says things look fine from his end. How did you know (and prove) your storage adapter queue depth was overwhelmed? Is there a way to tell what the controller’s QD is from the guest OS?
I’m in this camp too. Figuring out whether the problem is in the SAN (Dell), the server NIC/drivers (Dell), VMWare, or the network in the middle somewhere (Some Dell, some other vendors) seems very difficult without specific guidance.
My favorites are intermittent 833 errors (“SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete…”)
Agreed. I see this error constantly, san admins refuse to acknowledge the issue. Would be nice to have.
Second that 🙂
Back In the bad old days when I was an Oracle DBA we spent a lot of time arguing about raid levels and spreading files over magnetic Frisbees. These days I’m quite removed from the hardware. We run VM’s on VMware and storage is on a SAN which I have never ever seen let alone touched. How much does Raid and disk arrangement really matter to SQL these days? Windows just sees it as a volume. Should I just leave it to IT and treat it as a black box? Or is there scope for me to make significant improvements through tuning. configuration.
Running multi-Tbyte databases in a VM is a challenge. I wish Dell/EMC would publish something I can use when I talk to the SAN Admin that will convince them that provisioning me one, “the SAN’s really fast” 20TB partition is not really a solution. Either that or show me why it is a good solution.
From the architecture perspective:
I’m the virtualization\storage guy, so I can give you my perspective.
If your systems aren’t architected correctly, then you will have performance issues. We are in the middle of re-architecting our SQL VMs, not due to horrible I/O or complaints but due to “it could be better” and learning as we all adapt to SQL in a virtual environment.
If DBAs say they see latency on the DB pointing to a disk, then it’s our job to make sure the systems are architected correctly and minimize the latency not eliminate it. The DBAs just see the numbers, it is our job to find an explanation and work together to find out where compromises can be made.
Here are some resources.
SQL on VMware.
http://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/solutions/sql-server-on-vmware-best-practices-guide.pdf
Figure 6.16, 6.17 and 6.18 in the Pearson link below is the best example we have put together and have been building our architecture based on these. Not strictly on these designs because every environment is different, especially with server side caching (vFRC) on the data files and temp files, our I/O testing yielded far better results.
http://www.pearsonitcertification.com/articles/article.aspx?p=2240989&seqNum=3
vFRC
http://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/vmware-vfrc-performance-vsphere55-white-paper.pdf
Also there is a great video from vmworld 2015 on the SQL virtualized topic. At 18:02 there is a slide that says “Strict Best Practice” then a few minutes later there is a “Realistic Disk layout”. This is the compromise I am referring to, the design and speeds will all be based on your design, environment factors, and frankly compromises.
https://www.youtube.com/watch?v=yqJXA1aL4YI
Keep in mind not every SQL server in your environment has to be architected exactly the same, again compromises based on workload, environment and feasibility.
Let me know if you would like to see my own diagrams of the SQL servers we are working on.
It would definitely be worth doing a series on the differences between new SSD all-flash SAN’s and old skinny types.
A lot of the old rules about different raid types and separating log and data go out the window. We use all flash DELL Compellents and the rules go out the window with 1-3ms latency at 100000 64k IOPS.
Snapshot and de duplication at the SAN level are also really interesting and evolving with solid state.
This feels like asking what I don’t know. It depends on the scope that you want to write for, But start with a general over view of what you need to consider when scoping/provisioning/ordering a system and then start zooming in from there.
Others on this thread obviously have more experience with this than I, and may find the first one or two parts to be old rope, but to others up to date information on storage options and setup, shriven of the usual big brand marketing speak that gets in the way of comprehension, would be very valuable.
The DW reference architectures that Dell publishes are great, but I’d like to see ’em for OLTP, too. “Best OLTP box you can buy from Dell for $10k and under 1Tb”, best for $20k & 4Tb, etc.
I like your thinking. the DWFT program is a Microsoft program. I’ve been trying to get them to do an OLTP one, but they haven’t been willing to fund it.
Long time listener, first time caller. Would be interested to hear about SQL performance on hyperconverged platforms like VxRail / VSAN and tuning strategies where virtualised storage is making many of the decisions for you in terms of block optimisation, hotspotting, etc
Howdy sir! Good to see you here. I happened to be at my email when this came in and went, “Heeeeeey, I know that guy….”
Hah! Good to see you hear also ;). If you come to town at some point let’s grab Charaka and get some beers.
There’s lots of great sounding suggestions in here ( and several that I don’t even understand). It would be really cool if we could up vote/like the suggestions like on FB or SE,
Pete – come on, now. It’s our blog. All of the comments are likable. 😉
touché 😉
I wanted to “like that”… but couldn’t…
Oh no wait.. that was me..
Thanks for all the great feedback so far. I am listening! Look for follow-up blog posts, webinars, or white papers to address these comments in the near future.
Hi Doug,
I did some digging on here, the Dell site, & Google for the follow-ups & I may have missed them, but I haven’t hit on anything yet. Where would be the best place to start looking?
Thanks for your time!
There wasn’t much follow-up because I didn’t receive many questions or requests. Feel free to contact me directly for any questions. Doug.Bernhardt@dell.com
Let the storage team tell us the real story. There should be AUTO logging of storage related issues we should be able to get for any database, by hour, day, weekly basis. Make is easily available . We don’t have to HUNT for this info. Hate watching perfmon to find storage issues.
This way, you get to know what you have built is helping the customer or not.
Hello Brent, let’s say that SQL will have a high I/O > 500 IOPS, what do you think could be the best option ? – attached SAN storage or SSD?
My laptop does over 500 IOPS. It does over 5000 for that matter.
So, basically we can say that when we talk about 500 IOPS SSD is not needed at all, we are
killing flies with cannonballs … ?
Yep.
ok, Sorry for bother, but how can I justify that fact ? what tests do you advise me to do ?, in thise case , supos we are talking about a new db design and I dont have servers etc..
Alex – unfortunately, personal system design is outside of the scope of what we can do fast in a blog post comment. If you’d like to hire us for consulting, click Consulting at the top of the page, or head to https://dba.stackexchange.com if you’d like free advice from the public. Thanks!