#DellDBADays 2016: What Would You Do with Unlimited Hardware?

Last Updated 7 years ago

Last August, we got the team together in person for Dell DBA Days. We ran all kinds of interesting experiments with SQL Server, and shared the results with you via live webcasts.

https://www.youtube.com/watch?v=Gn43sOLrcVs

You can watch our recorded episodes from last year – I’d highly recommend the last one, Watch SQL Server Break and Explode. Erik showed how to make a SQL Server crash instantly and reboot. Kendra demonstrated what happens when you run thousands of databases in an Availability Group. Doug and I yanked hard drives out of a server one by one to show how RAID controllers react.

This August, we’re heading out to Round Rock again – and you can be a part of it. What experiments would you like to see us run on SQL Server 2016? We’ve got all the hardware a DBA could want, and the only limit is your imagination.

If we pick your idea (and we may pick more than one!), we’ll give you a free Everything Bundle, plus credit you on air during the webcasts. Leave your idea in the comments – let’s see what you’d do if you were let loose in the Dell data center.

Update – let’s focus on experiments where you can actually learn something helpful. Think about what we could test in a lab that might change the way you administer databases, like whether TempDB still really needs 8 files in the year 2016, or what the impact of Transparent Data Encryption might be on a particular workload. We don’t need your help coming up ways to set SQL Server on fire. 😉

SQL Server 2016 and the Internet: Forced Updates, Phoning Home

Is your SAN’s cache killing tempdb?

94 Comments. Leave new

Jason A
June 2, 2016 6:15 am

Not sure if Dell has a server that would survive something like this, but show a processor failure. While the server is humming along, yank out a CPU and see what happens…

Or, test failure of RAM by removing a stick of RAM (or bank, depending on the server and how it’s installed.) Especially with something like RAM mirroring configured on the server.

Yes, I’m a destructive little gremlin…

Reply
- Brent Ozar
  June 2, 2016 6:18 am
  
  Jason – hahaha, well, I’ll save you some time: if you yank a CPU, Windows will crash.
  
  To simulate RAM failure, you actually don’t want to rip a stick of RAM out – you just want to put a bad stick of RAM in there. If it’s ECC RAM, the parity takes care of some errors, but if it’s widespread faults in the chip, the memory will be taken offline at the next reboot.
  
  Reply
  - Jason A
    June 2, 2016 7:50 am
    
    I believe some Dell servers can be configured to “mirror” their RAM, similar to a RAID-1 drive configuration, so that if a stick / bank of RAM fails, the server keeps chugging along. I’ve got a PowerEdge R900 that shows that option in the BIOS, but I want all the RAM so I’m not using it.
    
    ECC can handle bad sticks, or bad chips, or the one weird fluke of a single bit being flipped, but it’s not going to help against a full on failure of a stick (bad solder causing a chip to fall off sort of thing.)
    
    Can you tell I spent a good bit of time on the hardware side?
    (IIRC, on the CPU thing, at one time I believe there were some servers you could “hot-add” processors too, which is what I was thinking of.)
    
    Reply
    - Brent Ozar
      June 2, 2016 7:52 am
      
      Jason – yeah, I’ve just never seen anyone burn half their RAM on mirroring. It’s one of those things that probably only five people in the world use, heh.
      
      Reply
      - Jason A
        June 2, 2016 9:05 am
        
        Got to admit though, it’d be fun to watch someone yank out a bank of RAM from a server while it’s running…
        🙂
      - Marcin S
        June 3, 2016 4:37 am
        
        More than 5. But in world of real systems, not toys we’re using. Mainframes come to mind.
Erik C.
June 2, 2016 7:34 am

Run the same recursive CTE (or stored procedure) that uses scalar and tvf functions, which are utilizing data from larger data sets (50 million rows+) in 3 Different databases on the same instance in AG (SQL Server 2016), 1 database using 2012 compatibility, 1 using 2014 compatibility and 1 using 2016 compatibility. To add more gasoline to the fire, try running 50, 100 or 1000 concurrent users on each database. It’ll be fun to see what the CE does in each scenario.

Reply
Morden Kain
June 2, 2016 11:31 am

I know that SpeedStep is supposed to work to keep the CPU from frying. However, under full load, remove a heatsink from a CPU to simulate a cooling failure or a CPU “gone wild”.

Another I can think of, hot swap a PCI card to see if “Plug-and-Play” can really work with hot swapping. Yes, I know that you are not supposed to do this, and this kind of “Plug-and-Play” is not what Microsoft had in mind. If I recall though, there are some server motherboards out there that allow you to do this very thing. There is a story involving me and a Junior Sys Admin and this exact scenario (the Junior Sys Admin had the balls to tell me that Windows would recognize it because the NIC was Plug-and-Play).

Reply
- Brent Ozar
  June 2, 2016 11:37 am
  
  Morden – ooo, that’s interesting about removing the heat sink, muhahaha…
  
  Reply
Matt K
June 2, 2016 11:35 am

Test the scalability of a Central Management Server. Attach 100 instances, create alerting and test some drift management. Attach 2,000 instances and see what overhead there is, what things flow properly and what things break. Increase to 5,000 instances, and see how it scales.

Reply
- Brent Ozar
  June 2, 2016 11:36 am
  
  Matt – we actually wouldn’t recommend a CMS for that. At scale, you want to use a different solution like third party monitoring software.
  
  Reply
Alex
June 2, 2016 11:41 am

Not really DBA specific, but I would like to see some series of vSAN experiments. The interwebs are thin on good benchmarks comparing vSAN to a SAN.

I would like to see *normal running activity* performance comparisons

Also see how performance is impacted when things go bad (ie. ripping out HDs / banks) / Comparing SAN vs vSAN in that way.

Not sure if you guys could pull that off though

Reply
- Brent Ozar
  June 2, 2016 11:43 am
  
  Alex – yeah, we’re going to take a pass on that just because we wouldn’t be qualified to do performance tuning on a vSAN.
  
  Reply
  - Alex Byrd
    June 2, 2016 11:55 am
    
    I haven’t really seen anyone that is (at least no one that’s as exposed as you + the crew here), so I think there’s some new ground to be cracked, but I understand.
    
    Reply
    - Brent Ozar
      June 2, 2016 12:03 pm
      
      Yeah, sometimes it’s an empty market for a reason. 😉
      
      Reply
Katherine Villyard
June 2, 2016 11:44 am

1. Create a failover cluster. Fail it over gracefully. Randomly reboot the member servers. Remove/hide the quorum drive. Remove/hide the shared storage. Put them back, reboot it, and tell the cluster that it’s a poor baby and it can have its storage back. Pat them on their poor metal cases. Verify that all is well.

2. Create an availability group. Fail it over gracefully once or twice. Randomly reboot a few. Do cruel things to their network connections. Give them their network back, and tell the various AG members that they’re poor babies. Pat them on their poor metal cases. Verify that all is well.

I should probably do a software one, too. Hmm.

3. Create terrible views using non-sargable predicates. Create terrible views of the terrible views joining the two views in improbable ways, perhaps left-joined on unindexed varchar(max) fields. Create terrible queries that join the terrible views of terrible views in improbable ways and use non-sargable predicates. (If you have a nice short indexed key, create views that expand that key to a varchar field and join on that instead of the key.) Ask the optimizer what it thinks it should do. See if you can make the optimizer stop weeping in the corner and generate reasonable plans through query tuning and indexing. Apologize to the optimizer for being so mean to it. Offer to buy it more RAM to make up for it.

Reply
- Brent Ozar
  June 2, 2016 11:49 am
  
  I’m sensing a theme here, hahaha – everybody wants to see us beat the bejeezus out of SQL Servers!
  
  Reply
  - Katherine Villyard
    June 2, 2016 11:55 am
    
    At least I suggested apologizing to the poor things afterwards! 😉
    
    Reply
Fred Shope
June 2, 2016 11:46 am

Two words: Chaos. Monkey. (http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html)
(yes it’s for AWS, and no I don’t know if anyone has converted it to run on-premise hardware or virtual, but it’s been out there since 2012 so it had to be converted since then, right?)
I’ve always wanted to turn this loose in a datacenter running various loads and try to keep things running, like Netflix does.

Maybe that explains why the sysadmin keeps turning my account off…

Reply
- Brent Ozar
  June 2, 2016 11:48 am
  
  HA! Yeah, could be like whack-a-mole but for infrastructure.
  
  Reply
- Katherine Villyard
  June 2, 2016 11:55 am
  
  *disables your account* 😉
  
  Reply
Greg Wahl
June 2, 2016 11:52 am

Create a routine that causes the database to grow so that it causes the drive space to run out or to cause the transaction log to get so large that it fills up the drive. Will SQL Server just stop and not commit the transaction or will it just pause waiting for more drive space?

Reply
- Brent Ozar
  June 2, 2016 12:04 pm
  
  Greg – that’s actually really easy to see on your own local machine – you can just cap file sizes at, say, 10mb and repro that in a matter of minutes.
  
  Reply
Matthew Iskra
June 2, 2016 12:09 pm

So many horrible ideas to abuse the database(s). I am impressed.

My suggestion is less an abuse of the database as which is better: Full backup + log file restore or Commvault (or equivalent). You would need a Commvault setup, but it would be interesting to see what SQL Server does when you keep stopping the server and reverting it to previous points in time, memory and disk wise. SQL Server might have issues with the sudden time-warps from its point of view. I wish I had the cool toys to play with to test that.

Oh, and joining views on unindexed VARCHAR(MAX) just makes me shudder.

Reply
- Brent Ozar
  June 2, 2016 12:09 pm
  
  Matthew – what action would you take based on that experiment? How would it change the way you manage SQL Server?
  
  Reply
Ricardo.
June 2, 2016 12:13 pm

Parellize everything, use all stuff, DRC,Linked servers, Tabular and multidimensional mode, stress with massive workloads, use Hibernate n times (especially trying to do developers happy, using paginators ), (developers favorite: Index all columns of all tables) kill those disc, eat all RAM, burn that proccessor , jus lets us watch how SQL Server burnn ,Polybase against Hadoop cluster, and of course all virtualized =), jobs against jobs, scheduling reports, and backups all at same time.. Just like owners said,” why things are not working”…

P.D. Please put some softly music ( winners playlist )

Reply
Russ Mulvaney
June 2, 2016 12:15 pm

Just how many DBs (each with some concurrent users) can you put on a single instance before performance starts to degrade?

Reply
- Brent Ozar
  June 2, 2016 12:16 pm
  
  Russ – well, you can degrade performance with even a single database as long as you have enough users, right? Heck, I can do it with a single query from a single user. Hello, CHECKDB. 😀
  
  Reply
  - Russ Mulvaney
    June 2, 2016 12:20 pm
    
    DUH
    
    Reply
Morden Kain
June 2, 2016 12:40 pm

How about this… Put a bunch of data into In-Memory tables, then have RAM failures all over the place (ie: gracefully remove (no yanking) of RAM modules/banks) to see how well things recover. Or put memory pressure on the RAM to push the In-Memory tables down to the point that they are no longer “in memory”.

Can also see how well things go with SAN intermittent communications (ie: Oh I have a SAN drive, oh it is gone… back again.. gone) on large databases.

Reply
ck
June 2, 2016 12:50 pm

I was thinking about what other features of their interesting hardware you can exploit, and it would be interesting to see things like forcefully corrupting pages in memory from heat or cosmic rays, though I dont know if you could get a particle emitter to use on their ram.

You might be able to reproduce that by just having bad sectors that would be still in use, but you would probably need to not use ECC to get this behavior with that type of issue.

It would also be neat to see things that would happen in extremis, for instance a basic DDOS against your SQL Server’s resources, whether it be bad authenticated user (hi ops team) or a large amount of resource being brought to bear and scaling it up as fast as you can (black friday.)

I would love to be able to do some sort of mini-aws like scenarios with more power and control; show how you can spin up instances for pre-prod, staging, and prod from the same configuration with parameters.

Try and make a replication chain and see how long you can go without the source and destination straying too far, experiment with star and ring formations. Imagine you have 10,000 destination SQL Servers, what would be the most expedient way to keep them up to day (this probably takes AGs out of the mix).

Reply
- Brent Ozar
  June 2, 2016 12:52 pm
  
  CK – seriously, what would you use any of this knowledge for?
  
  Reply
Ian C
June 2, 2016 12:54 pm

I’d be interested in seeing comparisons between different CPU’s in terms of SQL workloads – which will relate to decision making between balancing hardware selection over SQL Server licensing.

One of the things I would want to play with would be determine the benchmark (HammerDB?) differences between faster CPU’s with less cores as compared to slower CPU’s with more cores. How well would each CPU scale in terms of simultaneous user activity – for example compare workloads between processors and see the through put for 10, 100, 1000, etc simultaneous workloads. At what point does performance start to be affected for a 4/6 core processor as compared to a 12/18/24 core processor.

I’d also want to see adding a virtual (VMware/Hyper-V) layer in there as well. If I run simultaneous work loads on multiple VM’s at which point is it better to have more cores or faster cores. If a host machine has faster/less cores processor at which point does the overall/average throughput for all VM’s start declining and the host becomes too saturated.

Also just for fun I’d play with NUMA. If I had SQL Server configured to use 4 or 8 cores (processor affinity), what is the performance difference of those cores were on 1 processor versus spanning 2? What kind of impact is that for both faster cores versus slower cores.

Reply
- Brent Ozar
  June 2, 2016 12:55 pm
  
  Ian – innnteresting! I like your suggestions.
  
  Reply
  - Ian C
    June 2, 2016 1:48 pm
    
    Since I have been preaching over the last couple of years that we may be better off balancing licensing costs with these faster CPU’s it would be good to actually have some better testing\data to back that up. Especially since anytime I get new hardware to test I don’t have the luxury of comparing just those processors. Its always our new systems I get which I usually campare to the older systems in place. This isn’t a like for like comparison obviously since there is different RAM, storage, chip sets, etc. It isn’t a true comparison of the choice in CPU.
    
    One other thought I had over lunch too was to do a similar test as the NUMA one but with hyper threading. Thinking of Jeremiahs article about hyper threading awhile back (https://facility9.com/2016/02/cores-is-cores/) it would be interesting to see what is the impact on SQL workload (both physical/virtual) when using two threads on a single core as compared to two threads on two separate cores.
    
    Reply
    - Brent Ozar
      June 2, 2016 1:58 pm
      
      Ooo, hyperthreading, yeah, that’s a good one.
      
      Reply
Jason A
June 2, 2016 1:10 pm

After reading your update, here’s something that might be worth trying.

Set up a FCI (2 servers, shared storage, heartbeat network, etc,) and see what happens when the heartbeat network slows to a crawl, say from 1Gbit/s to 10Mbit/s (or less.) Then do the same to one of the “outside world” NICs (no clustered NICs, that’s cheating!)

Maybe slow the HB network way down, then fail the primary server, how long does it take to recognize the failure and start failover, vs when the HB network is running full speed.

Reply
Dennis
June 2, 2016 1:13 pm

Use some type of VLDB greater than 4TB, have 1TB+ of RAM, and 3+TB of buffer pool extension on SSD. Fill the buffer pool extension, run some heavy duty queries that use the BPE and then kill the SSD in in windows so SQL loses it. I’d love to see the impact as it happens.

Reply
- Brent Ozar
  June 2, 2016 1:59 pm
  
  Dennis – ah, it actually works fine. If the BPE drive disappears, SQL Server keeps right on trucking – just switches to hitting the data file storage instead of BPE.
  
  Reply
  - Dennis
    June 2, 2016 2:03 pm
    
    I know sql would handle it, but I want to see the impact in terms of perf counters, on the fly. Then turn it back on to see how quickly the BPE fills up. It would be a good selling point to say here is what type of performance we can get with BPE vs none.
    
    Reply
    - Brent Ozar
      June 2, 2016 2:09 pm
      
      Dennis – you probably wanna start here: https://ozar.me/2014/06/fact-checking-24-hours-of-pass/
      
      Reply
      - Dennis
        June 2, 2016 2:14 pm
        
        Thanks! on a side note, fusion io boards are ridiculous in price.
      - Brent Ozar
        June 2, 2016 2:15 pm
        
        Dennis – yeah, but check out Intel’s. They’re equally as fast, and waaaaay cheaper:
        
        http://amzn.to/1PofElu
  - Nik
    June 2, 2016 7:05 pm
    
    How about seeing what happens if you pull the BPE volume when you have a memory-optimized table that is large enough to require both available memory and BPE?
    
    Does everything keep on chugging? Does the database containing the memory-optimized table become suspect? Or does something even more terrifying occur?
    
    Reply
    - Brent Ozar
      June 2, 2016 7:46 pm
      
      Memory-optimized tables don’t use BPE.
      
      Reply
Katherine Villyard
June 2, 2016 1:30 pm

Okay, if my HA suggestions aren’t quite practical enough (although I think they’re not bad), how about experimenting with different storage layouts? Like, where’s the best place to put system DBs, really? Microsoft says to separate MDFs and LDFs (but yes, I’m not sure that matters if you have enough databases on the same storage); does that make a difference on slow vs. fast storage? Do you really need multiple TempDB files if TempDB is on SSDs?

Actually, I’d really like to see you play with that last question.

Reply
- Brent Ozar
  June 2, 2016 1:59 pm
  
  Katherine – innnnteresting, we could totally do that.
  
  Reply
- Erik C.
  June 3, 2016 7:31 am
  
  To piggy back on the initial suggestion, it would be nice to see the above suggestion run for both low and high workloads. What would happen if ram or drives were added or taken away during the load testing?
  
  Reply
Todd Kleinhans
June 2, 2016 1:32 pm

How about a TPC-E or TPC-H throw down between SQL2014 and SQL2016? I don’t think Dell hasn’t put up any SQL benchmark results on TPC for a while so maybe re-run an old test for them but on SQL2016. Compare apples to apples. Or pecan pie to pecan pie since y’all will be in Texas.

Reply
- Brent Ozar
  June 2, 2016 1:58 pm
  
  Todd – yeah, the trick is that 2016’s licensing forbids publishing benchmarks without Microsoft’s involvement.
  
  Reply
Arnold Garcia
June 2, 2016 1:55 pm

I would like to see Always On Encryption in SQL Server 2016 on Dell’s hardware. Set up a SQL Server and separate IIS Server. Have some 100 GB+ databases. Show performance data from the client to the web server (web server accessing the SQL Server) until the web server returns the requests data. Show “No Encryption”. Then turn on “Always On Encryption – Randomized.” Finally change it to “Always On Encryption – Deterministic.” Then show replication between on-premise to Azure SQL Server using these various encryption settings.

Reply
- Brent Ozar
  June 2, 2016 1:58 pm
  
  Arnold – AlwaysEncrypted is a separate feature, unrelated to Always On Availability Groups. AlwaysEncrypted requires changes to the application layer in order to work. It has no effect on Always On Availability Groups.
  
  Reply
  - Arnold Garcia
    June 2, 2016 2:05 pm
    
    Boo…how about running SQL Server 2016 on Dell Hardware and the running SQL Server 2016 in a Docker Container on the same Dell Hardware. Gathering performance data or tips on how to minimize any performance differences.
    
    Reply
    - Brent Ozar
      June 2, 2016 2:10 pm
      
      Arnold – interesting, but we don’t use Docker here yet, so we wouldn’t be good candidates to give that advice.
      
      Reply
MetalDBA
June 2, 2016 2:51 pm

How about getting a backup that takes about an hour straight up (no compression, no multiple files, no maxtransferrate, no juiced up hardware etc) and then run a contest among all of you to see who can make it the fastest. No rules, use all available sql server commands/tricks, throw all kinds of hardware at it(storage, memory, cpu). Who can make that backup run in 15 minutes, 5 minutes, 1 minute?!?!

Reply
- Brent Ozar
  June 2, 2016 3:10 pm
  
  MetalDBA – that’s a great answer! And I actually wrote up a session for that already, so you’re gonna win by default there. 😀 Nicely done.
  
  Reply
Dennis
June 2, 2016 3:15 pm

While having a case with Microsoft, one of the analyst mentioned “There might actually be a ‘too much memory’!”.
The reasoning behind was: SQL Server uses internal tables to maintain the usage of all the 8K pages in memory and at some point there are so many entries SQL Server runs into a timeout while searching through them and starts handing out memory without a proper strategy.

The obvious solution: Usage of “Large Page Allocation”, though the startup time of SQL Server increases.
So what if that is not an option? Does the problem actually exists with 1,2 or even 5 or 10TB of RAM.
Back than we used 1.5 TB of RAM and we had some problems, but it wasn’t investigated before I left the company.

Reply
- Brent Ozar
  June 2, 2016 3:16 pm
  
  Dennis – since there’s such an easy fix (using large pages), I’d just use that.
  
  Reply
  - Dennis
    June 2, 2016 3:46 pm
    
    If it just would have been that easy of a fix back then – from the point of view of the business. 😉
    I guess they still have their problems and are complaining everything is too slow. 😀
    
    Just would have been nice to know,whether the problem indeed exists or not.
    
    Reply
Eric C. Singer
June 2, 2016 4:28 pm

Given that services like azure and aws limit how much capacity you get per disk (azure being 1tb) it would be interesting to see storage spaces striped volumes vs db partitioning. Meaning which is a better architecture for performance, scalability and manageability. Perhaps limit memory to force disk performance.

For a cool factor, I would say use all nvme drives instead of disk.

Second idea, perhaps use a monster r930 running VMware with databases spread across multiple vm’s vs. all physical r930 and using resource governor to deal with load. Also observing the impacts of an un resource manged physical server vs. a vm that goes nuts.

Reply
Rosco
June 2, 2016 9:36 pm

It would be interesting to see SQL Server disk IO throughput tests on VMware. Perform testing with default settings then adjust PVSCSI queue depth, Windows queue depth, datastore queue depth, and storage adapter depth. There’s not a lot of information in the SQL Server community about how queue depth can impact SQL Server performance.

Reply
- Brent Ozar
  June 3, 2016 6:24 am
  
  Rosco – generally, if you’re hitting IO so hard that you genuinely care about storage settings like queue depth, you’re better off in a physical box. This way, you can offload TempDB work to local solid state storage, for example.
  
  Reply
Cody Konior
June 2, 2016 11:14 pm

Thanks for making those older videos available. I remember feeling liked I missed out when it happened ages ago.

There is a test I’d like to suggest. Can you put SQL 2014 into an AG with some massive, fragmented databases on Fusion IO cards, and trigger a load test while running maintenance on them, and monitor what happens to the AG synchronisation at the same time?

What I heard from an MCM was that in these expensive environments that SQL 2014 has internal threading and synchronisation issues when the underlying SSDs become *too* fast, and also that this is made all the more worse by AGs, and that the index maintenance will cause massive storms that may impact the application in unpredictable ways (I hope I’m interpreting this correctly).

I don’t think I’d have the opportunity to confirm this on a Fusion IO card any time soon. SQL 2016 allegedly has lots of internal enhancements in this area so you might even compare both of them?

Thanks.

Reply
- Brent Ozar
  June 3, 2016 6:26 am
  
  Cody – when you’re doing AGs (or log shipping, or mirroring), you want to back down on index maintenance. Regardless of the storage, you don’t want to pile up a bunch of traffic moving between replicas that doesn’t really have a business value.
  
  Reply
  - Cody
    June 3, 2016 10:43 pm
    
    Yes Brent it’s just that this is said to cause weird internal issues with SQL because of the speed of the Fusion IO cards. That’s the experiment 🙂
    
    Reply
    - Brent Ozar
      June 4, 2016 4:37 am
      
      Cody – yeah, the short answer is that that got fixed in SQL Server 2016. Check out Jimmy May’s slide deck on it for more details:
      
      http://www.sqlsaturday.com/493/Sessions/Details.aspx?sid=45723
      
      Reply
Mickel Reemer
June 3, 2016 1:17 am

First of all, I am jealous! I would love to have unlimited hardware at my fingertips and then play around with endless possibilities.

I have not had the chance yet to install a 2016 copy. I am highly interested in the effects of masking data in relation to query performance; especially when it comes to using (n)varchar(max) and (n)text datatypes (it appears that (n)text is still supported in 2016). I work with very large tables and would want to use this new 2016 feature.

This may or may not be something you would want to try on your field trip (after all, at Dell you’d have the room to store a billion records in a table), but I definitely look forward to your insights some time.

Reply
- Brent Ozar
  June 3, 2016 6:26 am
  
  Mickel – before doing masking, I’d make sure to read the Books Online limitations very carefully. There are a long list of scenarios where the masking fails and you can reverse engineer the data in the tables.
  
  Reply
Richard Elmer
June 3, 2016 3:30 am

1) I’d like to see the team build the best performant SQL but offset by price. Bang for your buck kind of thing and show use with the same queries on each etc.

2) Show us how good or bad a fusion IO card is running in SQL, compared to spending the same amount on other hardware.

Reply
Gareth
June 3, 2016 3:43 am

Not as destructive as the other suggestions but how about stopping a SQL Server instance, deleting the LDF file (its only a log file after all, and the drive is nearly full), then restart the SQL instance.

Had this happen once, where I got a call to say SQL had just stopped. Took a long time before the user admitted what they had done,

Reply
mmiike
June 3, 2016 4:28 am

i have a couple i’d love to see:

Storage wars

SMB vs. iSCSI: sure, we’ve read the whitepapers and known SQL Server supports hosting database files on UNC paths for some time now, but is anybody actually doing this? Grab the latest and greatest compute and networking components and build an SMB3 file cluster Jose Barreto would be proud of. Run the same workload against copies of the same database, one hosted on SMB3 cluster of the future, one hosted on a typical SAN. Bonus points if you wire one port on each of two nics to each of the storage systems. For SMB3, compare presenting Remote VHDs as locally attached to hyper-v virtual machine running sql vs. SQL seeing direct UNC paths to database files.
Double bonus points if you compare running the whole thing in a blade chassis vs. stacked and racked.

RAM Jam

1.) find a very sort-y, spill-y workload that on a 1TB dataset but that otherwise doesn’t require/benefit from enterprise features (merry-go-round scans, for example). run the same workload on

a.) a server running standard edition with RAMDISKs for TempDB (and possibly even the data files) can be done without 3rd party software using loopback iscsi with ramdisk VHD (or perhaps use VMWare ramdisk)
b.) a server running enterprise with enough RAM to hold the entire dataset in Buffer Pool but with typical SSD for tempdb.

compare other operations to see if there are situations where standard plus creative use of lots of ram beats enterprise edition (for example, test restores, run checkdbs, etc.). get creative with CPU affining/masking Cores that Standard couldn’t otherwise see to see if you can beat a “default” enterprise setup.

Reply
- mmiike
  June 3, 2016 4:34 am
  
  as an addendum: run through some of the items on the “it just runs faster” list for SQL2016 and find situations where *SQL2016 standard* beats SQL2014 Enterprise on the same hardware/workload. For this test, use “sane”/”standard” configs, i.e. none of the ramdisk shenanigans alluded to above (which i’d still like to see).
  
  Reply
  - Brent Ozar
    June 3, 2016 6:33 am
    
    Mike – you are a winner right there. Erik and I had been talking yesterday about testing that exact thing. I was really surprised nobody mentioned it earlier – Microsoft keeps beating the “it just runs faster” drum, that’s the first thing I wanna test, hahaha.
    
    There’s a few ways we could do it: we could do a synthetic workload (like take the top 20 user queries from http://data.stackexchange.com/stackoverflow/queries as a representative sample), or we could do TPCs, or we could engineer tests specifically to check some of the “it just runs faster” pieces like DBCC.
    
    (Microsoft employee readers: yes, I know that the 2016 EULA doesn’t allow publishing of benchmarks, so we wouldn’t show the actual TPC numbers.)
    
    Reply
- Brent Ozar
  June 3, 2016 6:31 am
  
  Mike – good to see you, sir! About SMB3 – I haven’t seen anybody doing it yet because you’ve gotta have separate networks for storage traffic vs regular network traffic. Otherwise, when you do a big table scan, a backup, index rebuild, etc, you can saturate the single network, heartbeats fail, monitoring alarms go off, etc. This would actually be a fun test to run – doing the network split the right way – and see how smoking fast we can actually get SMB3 to go.
  
  I doubt we’ll do that one only because of bang-for-the-buck – I think separate networks for storage is still a ways off (even though SAN guys have been doing it for decades).
  
  Now, the RAM jam, that is VERY interesting and doable. I’d been thinking about a line graph with a Y axis of batch requests per second, and an X axis of RAM as a percentage of database size. (Stack is a 100GB database, so 50GB = 50% of the database size is RAM, 100%, 150%, 200%.) I bet there’s a very clear hockey stick effect as you get to the point where the database is cached in ram, and then TempDB effectively lives in RAM, and then TempDB can *actually* be in RAM with a RAM disk, and so forth.
  
  Reply
E.S.
June 3, 2016 8:14 am

Yank out the memory, while running a complex query. Will SQL auto adjust or just out right crash?

Inactive an AD account or shut down the domain controller for a SQL server that uses AD account?

Reply
Morden Kain
June 3, 2016 3:01 pm

Not sure if this has been tried (or mentioned), though there are many mentions of bits and pieces… Test if In Memory vs RAMDrive table. Curious if they would be the same. Keep the log file up on that RAMDrive as well. IIRC, In Memory tables log everything to physical HDDs. Backups would be taken every 5, 10 or 15 minutes. This would require at least 2TB of RAM… a 1TB RAMDrive, and the other for 1TB for system usage (including SQL Server).

Reply
- Brent Ozar
  June 3, 2016 3:10 pm
  
  Morden – would you actually run a live production system on a RAM drive? Can you elaborate more on that scenario?
  
  Reply
  - Morden Kain
    June 3, 2016 3:56 pm
    
    Yes, you can use the RAM Drive for a staging area where data you do not necessarily care too much about can reside or that is refreshed from a different area.
    
    For instance, we pull across data from a separate department everyday. The data resides in a completely different area and this data is used locally within the department. Transactions to the data are processed off hours (can anyone say Gov’ment). The department we pull the data from charges for the data, and the amount of bandwidth consumed.
    
    The use of In Memory tables has the gotcha of needing to ensure the bucket size is “just right”. We do not have to worry too much about a regular table that resides on a RAM Drive.
    
    I threw the backup into the mix in the event someone wanted to do this on a setup where the data mattered.
    
    Reply
    - Brent Ozar
      June 3, 2016 3:58 pm
      
      Morden – OK, in that case, stick with the RAM drive. You won’t have to make changes to your code, won’t have to worry about what datatypes match, etc.
      
      Reply
      - Morden Kain
        June 3, 2016 4:06 pm
        
        Well crap, I wanted to see the performance difference. {sigh}
Pmpjr
June 3, 2016 9:00 pm

I like hadr and I haven’t read the above so if this is in there then my apologies.

Setup each rack with a DAG site, consider each rack a continent. Show the fancy new readable secondaries and scale out for workloads or migrations. Then power down the room, all of it. Consider a global or business wide disaster of proportions not ever hopefully seen. Power it back up and see which setups lived and died.

Reply
Richard Douglas
June 6, 2016 11:54 am

Start a fire and see if you can cook some bacon before the fire prevention methods kick in.Then see how long it takes to bring all the “HA” systems back online that were in the same data center 🙂

In all seriousness, it would be interesting to see what would cope better with all the air being sucked out of a room at high speed, traditional media or SSD’s.
An experiment like that would dictate what people buy from now on and form the basis of Disaster Recovery plans. Can’t see Dell agreeing though.

Reply
- Brent Ozar
  June 6, 2016 12:31 pm
  
  HAHAHAHA
  
  Reply
Radu
June 6, 2016 12:12 pm

I would really like to see something like this as I am toying with the idea.

DWH workload min 5+ TB, more like 10TB+
2 node failover SQL 2016 EE cluster (lot of ram, cpu etc), 1 instance, each node 2×2 10G mellanox or 1 x 2 40G mellanox, tempdb storage on local PCCIE SSD (~2T), db storage on (RoCE, SMB 3.1 shares on)

windows 2016 storage spaces direct 5-8 nodes with 9x4TB hdd 3x1TB SSD (dell r730xd with PCIE SSD?)
2×2 mellanox 10G or 1x 2 40 G mellanox, RoCE, RSS, SMB3.1

2x mellanox 10G, 40G switches

. Spead the db files over 4+ volumes + 1 for log,

do some heavy testing simulating
– ETL (insert,update, merge etc)
– selects over big set of data, column store storage too.

I’ve tested this in a VM lab enviroment and worked ok for it … but I would really like to know if this is a viable solution.
My main problem is that, usually when I connect my DWH servers in the shared enterprise SAN I get angry emails from the storage admins or I get throttled down. Luckily I could install some local PCIE SSDs for the tempdbs.

Reply
- Brent Ozar
  June 6, 2016 12:31 pm
  
  Radu – dumb question: why wouldn’t it be a viable solution?
  
  Reply
  - Radu
    June 7, 2016 9:49 am
    
    I’m thinking to use/propose such a config but it is quite new tech. and I think not many have tested this with real hardware. (I am a contractor and can’t really afford to buy such much hardware for a PoC)
    
    My concerns are directed of using SQL over SOFS/SMB/RoCE.
    – How is the performance in bandwidth / latency ?
    – How is the write performance on SOFS/SS Direct from the point of view of SQL loads. (latency? on 3 copy volumes)
    – How much bandwidth is required for the interconnect on the SOFS/SS Direct Cluster.
    
    All this could work wonderful but I have 0 experience out of my VM lab setup.
    
    Reply
    - Brent Ozar
      June 7, 2016 10:42 am
      
      Radu – all of those are really dependent on your workloads. You’ll need to capture and replay your own workloads to answer that one. (It’s no different than switching storage back ends or buying a new server.)
      
      Reply
Daniel
June 6, 2016 12:29 pm

Could you do some work on multi-subnet failover cluster instances and Always ON Availability Groups across data centers and share your learning? I bet Dell has more than one data center. Find out which features work smoothly with multisubnet clusters and which ones do not, (i.e replication, log shipping, etc..)

Take some numbers on latency added by having SAN Replication in the backend in the case of Multisubnet FCI vs Always ON AG synchronous mode across the same data centers.

Reply
- Brent Ozar
  June 6, 2016 12:44 pm
  
  Daniel – the choice of HA/DR methods really boils down to your RPO and RTO, not performance. Check out the HA and DR worksheet:
  
  https://www.brentozar.com/sql/sql-server-failover-cluster/
  
  If the business says they need to fail over groups of databases automatically in less than 1 minute with 0 data loss, for example, then your choices really boil down for you.
  
  Reply
Stuart Ainsworth
June 8, 2016 9:45 am

I’m sensing a lot of “Go Big and Blow Up” themes; I’d like you to consider going the other way. I’m really interested in microservices these days, or the dynamic creation of compute-on-demand resources (think Docker). How easily can a micro-instance of SQL Server be spun up and managed? How about thousands of them?

I’m sure the licensing model would be atrocious for this (unless you used express edition).

Reply
Chad Tether
June 15, 2016 10:45 am

Sorry, if this has already been said but I was a bit late in commenting.

However, I would really love to see the multi-threaded log writer put to the test in SQL 2016. Tested in comparison to SQL 2014 where it was still single threaded and tested against some blazing fast disks as well as slower disk to see the differences.

Would be great to see how different workloads perform in comparison as well as different options like delayed durability or minimally logged transactions.

Reply
Joe L.
June 17, 2016 3:59 pm

Look at log file and data file growth when doing Online index rebuilds with SORT_IN_TEMPDB ON \ OFF while running Insert \ Update \ Delete against the database.

Reply
Konstantin
July 25, 2017 3:26 pm

Hi, guys.

What happens with video links – now all not working (redirect to quest.com site 404):
Dell DBA days the Brent Ozar Unlimited 2015 – https://software.dell.com/event/webcast-series-dell-dba-days-the-brent-ozar-unlimited-show890450
Dell DBA days the Brent Ozar Unlimited 2016 – https://software.dell.com/event/live-webcast-series-dell-dba-days-the-brent-ozar-unlimited-show-the-se8112821
Do you have backup copy?

Reply
- Brent Ozar
  July 26, 2017 6:47 am
  
  No, sorry.
  
  Reply