ColumnStore Indexes And Recursive CTEs

Alone Together

When I think about SQL Server features, I often picture a high school cafeteria.

AGs are making fun of Mirroring, index rebuilds are walking around taking everyone’s lunch, dta is making a glue sandwich, and no one knows who Service Broker is even though they’ve been going to school together since Kindergarten.

At the artsy table are two oft-misunderstood and unpopular features: ColumnStore indexes and Recursive CTEs. The oldest person on Stack Overflow, Joe Obbish, recently blogged about a deficiency in ColumnStore indexes when it came to doing string aggregation. When someone says “X is bad at Y”, my first reaction is generally to try to find other things X is bad at, and kicking X while it’s down.

Lace Up Your Boots

I know what you’re thinking: WHY WOULD YOU EVER DO THIS?

Well, ladies and gentlemen, that’s what consultants specialize in.

Stuff no one else would ever do.

There’s a really interesting relationship in the Stack Overflow database both within the Posts table, and connecting the User and Comment tables. You can build up pretty interesting hierarchies with them.

The Posts table has both questions and answers in it. You can tell them apart using the PostTypeId column, and you can also look at the ParentId column. If a post is an answer to a question, it’ll have the Id of the question populated.

Likewise, the Comments table connects comments to posts and answers, and both tables track the Ids of users. This comes in Handy for grabbing DisplayName from the Users table.

Having spent an amount of time only a consultant can looking at stuff in the database, I know that post 184618 currently (2016-03 data dump) has the most answers in the Posts table, at 518. So that’ll be our performance bar.

Gangland

Let’s start by indexing and querying rowstore tables.

Here are our awesome nonclustered indexes.

And here’s our extra fun recursive query.

With my indexes in place, here are my stats Time and IO results. I’ve lopped off all the zero-read operations. for readability.

I even have what most would consider a reasonable execution plan for such a query.

Methinks the lady doth compute too many scalars.

Inflection Point

With that in mind, let’s hop on over to my ColumnStore version of the Stack Overflow data dump. All of the tables here either have a clustered ColumnStore index on them, or a nonclustered one if they have columns with datatypes that make that impossible (Users and Posts both have columns that are MAX types, which makes CS puke).

Running the same query, here are the results of stats Time and IO. What you’re looking at are the results of the second run with a warm cache. The first run took about 4 minutes and did a bunch of physical reads, and that didn’t seem fair.

If you’re wondering what those milliseconds are in minutes, it’s 3:17. So uh, there’s our first problem.

And here’s the query plan, which bears some interpretive dance. It looks a whole heck of a lot like the previous plan. Prior plan? One of those. Except two index spools are introduced.

We’re a lot alike, you and I

Funny things you should know about Index Spools: They’re single threaded, and I’m going to go out on a limb and say SQL doesn’t build ColumnStore Index Spools. They cache data in tempdb. When they’re Eager, they cache all the rows coming to them, and when they’re Lazy they only cache rows as required. This makes Lazy Spools much more benign than Eager Spools (there’s a Lazy Spool in both of the query plans).

They also make up the majority of the work our query is doing here in ColumnStore land. There are other obvious problems with it, like it running in Row mode rather than Batch mode. I stuck the query plan in Paste The Plan for anyone interested in poking around more.

I’m not getting any prettier, here

Stuff like this is part of the reason why we’ve started warning about Index Spools in sp_BlitzCache and sp_BlitzQueryStore. We started warning about ColumnStore indexes operating in Row Mode a while back because it’s basically a death knell for performance.

This post should have a point, right? ColumnStore! Not good at everything. Pretty spiffy for aggregations, but there are plenty of times when traditional row store indexes will out-perform them.

For more information about Spools, check out these links:

Fabiano Amorim: Part 1 and Part 2

Paul White: Part 1, Part 2, Part 3

Rob Farley: Part 1

A most excellent dba.stackexchange.com question, in which Joe Obbish gets prescribed some Spool Softener

Thanks for reading!

, , ,
Previous Post
Other People’s Blog Posts I Talk About the Most
Next Post
SQL Server 2017: Less xp_cmdshell?

7 Comments. Leave new

  • Adding OPTION (IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX) is useful when you want to compare performance of your queries on a table with a NC columnstore index.

    • Erik Darling
      July 27, 2017 11:39 am

      Yeah, unfortunately there’s not a similar hint for clustered ColumnStore indexes.

  • “AGs are making fun of Mirroring, index rebuilds are walking around taking everyone’s lunch, dta is making a glue sandwich, and no one knows who Service Broker is even though they’ve been going to school together since Kindergarten.”

    this is quite possibly the best sentence ever to appear on this blog. tears of joy, sir. tears of joy.

    plus, i now have a solution to a problem which plagues me on a near-daily basis: next time i can’t put a name to a face it’s: “Hey hey, Service Broker! Long time no see!”

  • I was just working on this very issue this week. It helped me identify what was going on with a recursive CTE that was utterly failing for me. Thanks Erik. The timing was perfect.

Menu
{"cart_token":"","hash":"","cart_data":""}