#SQLPASS Keynote Day 3 Liveblog

I missed the first 10 minutes of the keynote today when Kevin Kline (blog, @KEKline) was recognized for his amazing decade of service to PASS.

I was not present for the next 30 minutes of the Dell keynote, but I did not miss it.

</rimshot>

I’m starting this blog entry now, but don’t refresh for a few minutes.  This is not the keynote you’re looking for.

9:18AM – TGIF – Thank God It’s Finished.  Dr. David DeWitt, Technical Fellow at Microsoft, is taking the stage to talk through a deck called “From 1 to 1000 MIPS.”

9:20 – “Tweet if you think SQL* would be a better name than SQL Server 2008 R2 Parallel Data Warehouse Edition.”

9:22 – He’s diving in extremely deeply based on attendee feedback.  If only Dell would listen that well.

9:23 – DeWitt says database system specialization is inevitable over the next 10 years, as evidenced by Azure, the traditional engine, and Project Madison all heading in different directions. Our skills carry over across them, but they work in different ways. He’s making a disclaimer, though, that this is an academic talk, not a product chalk talk for Microsoft. It doesn’t indicate future direction.

[Photo: Dr. David DeWitt]

9:25 – Taking us back through time to 1980, when computers had 1 MIPS and 1 KB of memory.  Today, the typical RDBMS has 1,000-2,000x the performance based on the components, plus 10,000x the storage capacity.  In 1985, the fastest relational databases were running 10-100 transactions per second.  Today, SQL Server on a modest Intel box can do 25,000 TPS (TPC-B), but we need 330 drives in order to keep 1 processor busy.  CPUs and disk are totally out of whack in terms of performance.
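
(Quick sanity check on that 330-drive figure. The per-transaction I/O count and per-drive IOPS below are my own ballpark assumptions, not numbers from the deck, but they land in the right neighborhood:)

```python
# Back-of-envelope sketch of DeWitt's CPU-vs-disk imbalance.
# The I/O and IOPS numbers are era-appropriate assumptions of mine,
# not figures from the slides.

tps = 25_000          # transactions/sec on a modest box (from the talk)
ios_per_txn = 2       # assumed random I/Os per TPC-B-style transaction
drive_iops = 150      # assumed random IOPS for a single 2009-era disk

drives_needed = tps * ios_per_txn / drive_iops
print(f"~{drives_needed:.0f} drives to keep one CPU busy")  # ~333
```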

9:30 – Disk capacity has grown 10,000x over the last 30 years, but if you consider the metric “transfer bandwidth per byte,” things are totally different.  Think in terms of towns with a water tower – they have a ton of capacity, but if they try to get that water through a garden hose, they’re screwed.  That’s where we are with disk.
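
To make the water-tower math concrete, here’s a rough sketch using ballpark drive specs of my own choosing, not figures from the slides:

```python
# Rough "water tower" math: capacity grew far faster than the hose.
# Both eras' drive specs are ballpark assumptions for illustration.

cap_1980_mb, bw_1980_mbs = 80, 1.2        # ~80 MB drive, ~1.2 MB/s transfer (assumed)
cap_2009_mb, bw_2009_mbs = 1_000_000, 80  # ~1 TB drive, ~80 MB/s transfer (assumed)

print(f"1980: full-drive scan in {cap_1980_mb / bw_1980_mbs / 60:.1f} minutes")
print(f"2009: full-drive scan in {cap_2009_mb / bw_2009_mbs / 3600:.1f} hours")
```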

9:32 – In 1980, sequential transfers through storage were 5 times faster than random transfers.  In 2009, sequential is now 33 times faster than random.  That means you can’t afford to move the heads.  It forces databases to avoid random access and use clustered indexes wherever possible in order to get the best performance.
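
Here’s a toy calculation of that penalty. The seek and transfer numbers are my own assumptions, but they land right around the ~33x ratio he quoted:

```python
# Why the engine avoids moving the heads: effective throughput when a scan
# degrades into random 8 KB reads. Drive numbers are assumed, not DeWitt's.

page_kb = 8
seek_plus_rotate_ms = 3.0   # assumed head-repositioning cost per random read
transfer_mbs = 80.0         # assumed sequential transfer rate

sequential = transfer_mbs                                    # MB/s, heads never move
random = (page_kb / 1024) / ((seek_plus_rotate_ms / 1000)
                             + (page_kb / 1024) / transfer_mbs)
print(f"sequential: {sequential:.0f} MB/s, random: {random:.1f} MB/s "
      f"({sequential / random:.0f}x slower)")
```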

9:33 – In 1980, it took systems 6-10 processor cycles to access memory.  In 2009, a typical Core 2 Duo takes 200 cycles to access memory.

9:36 – DeWitt is covering how bottlenecks like resource stalls, memory stalls, computation, and branch-misprediction stalls affect DBMS performance on scan-bound and join-bound queries.  I want a pre-con from this guy next year.  He’s explaining that databases end up working around L1 and L2 cache stalls.  For non-sysadmins out there, L1/L2 cache is memory located on the CPU itself.  It’s fairly expensive, and there’s not much of it on each CPU.  When you’re buying SQL Server CPUs, this number does make a difference.
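
For a rough feel of why those stalls matter, here’s a toy two-level cache model. The 200-cycle DRAM figure is from the talk; the cache latencies and miss rates are my own assumptions:

```python
# Toy model of why cache misses dominate: average cycles per memory reference.
# memory_cycles comes from the talk; the rest are illustrative assumptions.

l1_hit_cycles = 2        # assumed L1 latency
l2_hit_cycles = 15       # assumed L2 latency
memory_cycles = 200      # DRAM latency quoted in the talk

def avg_access_cycles(l1_miss_rate, l2_miss_rate):
    # Average cycles per reference under a simple two-level cache model.
    return (l1_hit_cycles
            + l1_miss_rate * (l2_hit_cycles
                              + l2_miss_rate * memory_cycles))

print(avg_access_cycles(0.05, 0.20))  # cache-friendly scan: ~4.75 cycles
print(avg_access_cycles(0.30, 0.60))  # pointer-chasing join: ~42.5 cycles
```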

[Slide: Sysadmin Slide Pr0n]

9:37 – DeWitt’s slides are just fantastic.  I can’t possibly do them justice with explanations here – it’s a great example of how a picture (or a chart) is worth a thousand words.  He’s explaining how data migrates from disk to memory to L2 cache to L1 cache.  Best animations ever.  He clearly put a LOT of time into these slides, and the value to the attendees is enormous.  I guarantee almost everyone in the audience is learning something.

9:44 – Until now in the presentation, everything has focused on row-based storage databases.  He’s about to transition into columnar databases like what’s going on behind the scenes in PowerPivot in Excel 2010, which explains why it works so well on commodity hardware like laptops and desktops.

[Slide: Column Store Tables]

9:48 – DeWitt’s explaining how columnar databases can pull more data through L2 and L1 cache with each single page read off the disk.  A single disk IO can pull thousands of values of a particular column from an 8k page instead of just a handful.  More fantastic slides.
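
A quick back-of-envelope version of that point, using a made-up row width:

```python
# How many useful values one 8 KB page yields in each layout.
# Row width and column width are assumptions for illustration.

page_bytes = 8 * 1024
row_bytes = 200        # assumed average row width in a row store
col_bytes = 4          # a 4-byte int column in a column store

rows_per_page = page_bytes // row_bytes     # 40 rows -> 40 values of any one column
vals_per_page = page_bytes // col_bytes     # 2048 values of the column you asked for
print(rows_per_page, vals_per_page)         # 40 vs 2048 -> ~50x more per I/O
```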

9:52 – Storing tables in columns can dramatically reduce execution times of some data warehouse queries, but SELECT * is never faster, since you need all of the fields on every row anyway.  The fewer fields you grab, the more columnar storage pays off.  This design improves CPU performance – that’s a little tricky, though, because it really means the CPU will be busy more of the time.  It won’t be idling around waiting for data anymore.
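
As a toy illustration of how projection width drives I/O in an idealized column store (the table shape is made up, and a SELECT * would read the same total bytes either way):

```python
# Bytes scanned for a query touching 2 of 50 equal-width columns.
# The table dimensions are a made-up example.

rows, n_cols, col_bytes = 10_000_000, 50, 8
row_store_bytes = rows * n_cols * col_bytes          # must read whole rows
col_store_bytes = rows * 2 * col_bytes               # read only the 2 columns
print(f"row store: {row_store_bytes / 2**30:.1f} GiB, "
      f"column store: {col_store_bytes / 2**30:.2f} GiB")
```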

9:55 – “You all are just out there tweeting anyway.”  Talked about his history as a professor – he banned laptops because people were just surfing the web.  These days, however, the students are out there tweeting and conversing about the material live.

9:56 – He’s explaining the different representation models you can use for columnar data storage.  Kind of esoteric for most of us, but it’s really interesting to see how it’s going to work under the hood.

10:01 – One model of columnar storage doesn’t even store row IDs, just column values.  I’m not doing this justice in short sentences, but the database engine itself is responsible for figuring out which values belong to which records.  You can’t update this, because it’s stored in order, but it’s fast as hell.  Perfect for read-only data warehouses.
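
Here’s a minimal sketch of the idea, a purely positional store where a row is just “the i-th value of every column” (toy data, my own example):

```python
# Minimal sketch of a positional column store: no row IDs stored,
# a row is reconstructed by lining the columns up by position.

names  = ["Alice", "Bob", "Carol"]   # column 1, stored in insertion order
states = ["WA", "OR", "WA"]          # column 2, same order
sales  = [120, 75, 200]              # column 3, same order

def fetch_row(i):
    # The engine pairs up the i-th entry of every column to rebuild the record.
    return (names[i], states[i], sales[i])

print(fetch_row(1))  # ('Bob', 'OR', 75)
# Updating in place would break the shared ordering, hence read-only.
```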

10:03 – Since CPUs have gotten 1,000x faster and disk is only 65x faster, it makes sense to do a ton of compression on these types of warehouses.  He uses a typical rule of thumb that you’ll get 10x reduction in table size with a column store as opposed to a 3x reduction in a row store.  The engine can use that extra space to store multiple sets of the same data in different sort orders.
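
The arithmetic behind that trade-off, using the 10x and 3x rules of thumb from the talk (the raw table size is a made-up example):

```python
# Why 10x column compression buys "free" extra sort orders.
# The 10x and 3x ratios are the rules of thumb from the talk.

raw_gb = 300
row_store_gb = raw_gb / 3    # ~100 GB after typical row compression
col_store_gb = raw_gb / 10   # ~30 GB after typical column compression

copies = int(row_store_gb // col_store_gb)
print(f"{copies} differently-sorted copies fit in one row store's footprint")  # 3
```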

10:05 – Explaining run-length encoding and how it gives such great compression in columnar databases because so many values are identical.  You can have thousands of customers in the same state, for example, and you can eliminate all of the redundant state values that you would have stored in that column.  Explaining bit-vector encoding, and that one soared over my head.  Explaining dictionary encoding, and that’s near and dear to my heart.  You can take multiple occurrences of long strings (city, name, item, etc.) and replace them with a couple of bits.  Commonly used when data isn’t sorted in that order.
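
Toy versions of two of those encodings, just to show the shape of the idea (illustrative sketches, not the engine’s actual implementation):

```python
# Toy run-length and dictionary encoders for column values.
from itertools import groupby

def rle_encode(values):
    # Run-length encoding: collapse runs of identical values to (value, count).
    return [(v, len(list(g))) for v, g in groupby(values)]

def dict_encode(values):
    # Dictionary encoding: replace long strings with small integer codes.
    codes = {v: i for i, v in enumerate(dict.fromkeys(values))}
    return codes, [codes[v] for v in values]

states = ["WA", "WA", "WA", "OR", "OR", "WA"]
print(rle_encode(states))            # [('WA', 3), ('OR', 2), ('WA', 1)]
print(dict_encode(["Seattle", "Portland", "Seattle"]))
# ({'Seattle': 0, 'Portland': 1}, [0, 1, 0])
```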

10:08 – Column-stored databases aren’t all unicorns and rainbows.  Table scans are particularly nasty, and he’s covering techniques that can be used to mitigate the performance penalties.  I love this kind of academic talk because he’s not saying, “Look, here’s a weakness of Parallel Data Warehouse Edition, and here’s all the ways we thought of working around it,” but you can read between the lines.  It’s jaw-droppingly impressive to see the amount of thought that goes into the storage engine.

10:14 – He’s still explaining ways they can work around execution plan weaknesses to get the best performance.  I’m not covering it here because it’s pretty narrowly focused, and doesn’t have a real-world impact on 99% of us yet.  Grant Fritchey, however, is paying close attention for his next query plan tuning book, no doubt.

10:16 – Watching this keynote, it’s crystal-clear to me why Project Madison (Parallel Data Warehouse Supercalifragilisticexpialidocious Edition) makes sense for data warehousing (big tables, infrequent loads) as opposed to OLTP systems.

Key point takeaway slides:

[Slide: Key Points 1]
[Slide: Key Points 2]
[Slide: Key Points 3]