The StackOverflow XML Data Dump was recently updated with 2016-03 data, so I’ve updated our torrent of the SQL Server database version of the Stack Overflow data dump.
Fun facts about the database and its real-world-ness:
- 95GB in size
- 29,499,660 posts spanning 2008-07-31 to 2016-03-06
- 5,277,831 users spanning ages from -972 to 96 (just like real world data, you can’t trust it)
- 46,306,538 comments (227 of which have the F-bomb)
- Every table has a clustered key on an Id identity field, and has relationships to other tables’ Ids (again, much more real-world-ish)
- Lots of lumpy data distribution and sizes, making it fun for parameter sniffing demos
- Case-sensitive collation (because if you’re going to share scripts online, you want to get used to testing them on case sensitive servers – this stuff exists out in the real world)
- 1,305% cooler than AdventureWorks
Here’s how I built the torrent:
In our AWS lab, we have an m4.large (2 cores, 8GB RAM) VM with SQL Server 2005. We use that for testing behaviors – even though 2005 isn’t supported anymore, sometimes it’s helpful to hop in and see how things used to work.
I still use 2005 to create the dump because I want as many folks as possible to be able to use it. (This is the same reason I don’t make the database smaller with table compression – that’s an Enterprise Edition feature, and not everybody can use that.) You can attach this database to a SQL 2005, 2008, 2008R2, 2012, or 2014 instance and it’s immediately usable. Keep in mind, though, that it attaches at the 2005 (90) compatibility level. If you want 2014’s new cardinality estimator, you’ll need to set your compatibility level to 2014 (120) after you attach the database.
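As a sketch, the attach-and-bump process looks like this – the file paths below are assumptions, so point them at wherever you extracted the files:

```sql
-- Sketch only: file paths are assumptions; adjust to where you
-- extracted the .mdf and .ldf files.
CREATE DATABASE StackOverflow ON
    (FILENAME = N'D:\Data\StackOverflow.mdf'),
    (FILENAME = N'D:\Data\StackOverflow_log.ldf')
FOR ATTACH;
GO

-- Optional: on a 2014 instance, raise the compatibility level to 120
-- to get the new cardinality estimator.
ALTER DATABASE StackOverflow SET COMPATIBILITY_LEVEL = 120;
```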
I downloaded the Stack Exchange data dump on that 2005 VM. It’s a little confusing because the Archive.org page says it was uploaded on 1/21/2014, but that’s just the first date the file was published. The top update date of March 1, 2016 is the current version you’ll get if you use the download links at the top right of the page.
To make the import run faster, I shut the VM down, then changed its instance type to the largest m4 available – an m4.10xlarge (Deca Extra Large, if you will) with 40 cores and 160GB RAM for $4.91/hour – and booted it back up. (Don’t forget to revisit your SQL Server’s max memory, MAXDOP, and TempDB settings when you make changes like this.)
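If you do resize an instance like this, here’s a hedged sketch of the kind of settings check I mean – the numbers are example values for a 40-core, 160GB box, not recommendations for your workload:

```sql
-- Example values only: tune these for your own hardware and workload.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Leave memory headroom for the OS on a 160GB box.
EXEC sp_configure 'max server memory (MB)', 150000;

-- Cap parallelism; 40 cores will happily run away without a limit.
EXEC sp_configure 'max degree of parallelism', 8;
RECONFIGURE;
```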
I created an empty StackOverflow database, then fired up the Stack Overflow Data Dump Importer (SODDI), an open source tool that reads the XML data dump files and does batch inserts into a SQL Server database. I pasted in a connection string pointing to my SQL Server – ConnectionStrings.com makes this easy – and off it went.
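The connection string itself is a standard SqlClient one – something along these lines, where the server name and auth details are assumptions you’d swap for your own:

```
Data Source=localhost;Initial Catalog=StackOverflow;Integrated Security=True;
```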
The import finished in about 25 minutes, although it turns out the extra cores didn’t really help here – SODDI is single-threaded per import file.
After SODDI finished, I stopped the SQL Server service so I could access the ~95GB of data and log files directly, then compressed them with 7-Zip set to ultra compression across 32 cores – and this time the CPU usage told a different story.
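You can drive that compression step from the command line, too – here’s a sketch using 7-Zip’s documented switches, where the install path and file names are assumptions:

```
REM -mx=9 is ultra compression, -mmt=32 spreads the work across 32 threads.
"C:\Program Files\7-Zip\7z.exe" a -t7z -mx=9 -mmt=32 StackOverflow.7z StackOverflow.mdf StackOverflow_log.ldf
```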
After creating the 7z file, I shut down the EC2 VM and adjusted it back down to m4.large. I created a torrent with uTorrent, then hopped over to my Whatbox. Whatbox sells seedboxes – virtual machines that stay online and seed your torrent for you. They’re relatively inexpensive – around $10-$30/mo depending on the plan – and I just go for unlimited traffic to make sure the database is always available.
To double-check my work, I fired up my home BitTorrent client, downloaded the torrent, extracted it, and attached the database in my home lab. Presto, working 95GB StackOverflow database.
Now, you can go grab our torrent of the SQL Server database version of the Stack Overflow data dump. Enjoy!
Thanks for describing the process… It’s really useful.
On another note, I’m constantly amazed at the progress of storage. Your photo contains flash drives that can store the same data as 1.3 million floppy disks.
And the bulk of those flash drives seems to be dictated by the shape of a USB port.
Michael – yeah, it’s incredible how fast that moves forward! And these aren’t even expensive 1-2TB drives – they’re the el-cheapo $30 drives from Amazon. Insane.
Have you ever considered getting a 4-6 port USB hub and building a USB thumb drive RAID array? I remember seeing an article about someone building a RAID array of 5.25″ floppy drives on an Apple II.
Wayne – no, but I do use something kinda similar – a Blackmagic Design Thunderbolt Multidock. It takes one Thunderbolt cable, and has bays for 4 2.5″ SATA drives. That gives me ~4TB of insanely fast capacity in a RAID 0. (I don’t need redundancy for my temporary VMs.)
And just think – one day we’ll have flash drives (or some kind of drive, who knows what they’ll be called) that can store the same data as 1.3 million of those flash drives in the picture 🙂
Thanks for the explanation and the torrent update.
Scott – you bet!
Thank you very much for posting this. (Downloading the torrent now.) Whatbox is a pretty good service – I’ve been using them for the last 3 years with very few problems.
Amazing work, Brent! Downloading! I really like the idea of 160GB RAM for $4.91/hour.
I’m going to try out this database with Power BI – let’s see how Power BI reacts. Will keep you posted!