Window Functions and Cruel Defaults

Last Updated March 12, 2018

My First Post Here…

Well, my first technical post, was about how the default index creation method is OFFLINE. If you want that sweet, sweet Enterpri$e Edition ONLINE goodness, you need to specify it. It’s been a while since that one; almost six months to the day. So here’s another one!

But Window Functions Are Awesome

Heck yeah they are. And how. Boy howdy. Etc. You get the point. I’m enthusiastic. What can be cruel about them? Glad you asked!

Window Functions, according to the almighty ANSI Standard, have two ways of framing data: RANGE and ROWS. Without getting into the implementation differences between the ANSI Standard and Microsoft’s versions, or any performance differences between the two, there’s a funny difference in how they handle aggregations when ordered by non-unique values. A simple example using the Stack Overflow database follows.

SELECT OwnerUserId,
 CAST(CreationDate AS DATE) AS DumbedDownDate,
 Score,
 SUM(Score) OVER ( ORDER BY CAST(CreationDate AS DATE)) AS Not_Specified,
 SUM(Score) OVER ( ORDER BY CAST(CreationDate AS DATE)RANGE UNBOUNDED PRECEDING ) AS Range_Specified,
 SUM(Score) OVER ( ORDER BY CAST(CreationDate AS DATE)ROWS UNBOUNDED PRECEDING ) AS Rows_Specified
FROM dbo.Posts
WHERE OwnerUserId = 1
AND CAST(CreationDate AS DATE) BETWEEN '2008-08-01' AND '2008-08-31'
ORDER BY CAST(CreationDate AS DATE);

1

2

3

4

5

6

7

8

9

10

SELECT OwnerUserId,

CAST(CreationDate AS DATE) AS DumbedDownDate,

Score,

SUM(Score) OVER ( ORDER BY CAST(CreationDate AS DATE)) AS Not_Specified,

SUM(Score) OVER ( ORDER BY CAST(CreationDate AS DATE)RANGE UNBOUNDED PRECEDING ) AS Range_Specified,

SUM(Score) OVER ( ORDER BY CAST(CreationDate AS DATE)ROWS UNBOUNDED PRECEDING ) AS Rows_Specified

FROM dbo.Posts

WHERE OwnerUserId = 1

AND CAST(CreationDate AS DATE) BETWEEN '2008-08-01' AND '2008-08-31'

ORDER BY CAST(CreationDate AS DATE);

For the month of August, Year of Our Codd 2008, we’re getting a running total of the score for posts by UserId 1. Who is UserId 1? I’ll never tell. But back to the syntax! In the first SUM, we’re not specifying anything, for the next two we’re specifying RANGE and then ROWS. Why? REASONS! And why am I casting the CreateDate column as a date? MORE REASONS!

Before you scroll down, think for a second:

If I don’t specify RANGE or ROWS, which will SQL Server use?
If I left the CreateDate column as DATETIME, what ~~eff~~ ~~aff~~ difference would it make to the output?

Do you see a pattern forming here?

When we don’t specify RANGE or ROWS, well, SQL Server is nice enough to pick RANGE for us. “Nice”.

Deep breaths, Erik. Deep breaths.

You should also notice the difference in how each different method aggregates data. When the ordering column has duplicates, RANGE, and by extension, the default method, will SUM all the values for that group at once. When ROWS is specified as the framing method, you see the running total that most people are after.

And, of course, if all the values were unique, they’d do the same thing.

SELECT OwnerUserId,
 CreationDate,
 Score,
 SUM(Score) OVER ( ORDER BY CreationDate ) AS Not_Specified,
 SUM(Score) OVER ( ORDER BY CreationDate RANGE UNBOUNDED PRECEDING ) AS Range_Specified,
 SUM(Score) OVER ( ORDER BY CreationDate ROWS UNBOUNDED PRECEDING ) AS Rows_Specified
FROM dbo.Posts
WHERE OwnerUserId = 1
AND CAST(CreationDate AS DATE) BETWEEN '2008-08-01' AND '2008-08-31'
ORDER BY CreationDate;

1

2

3

4

5

6

7

8

9

10

SELECT OwnerUserId,

CreationDate,

Score,

SUM(Score) OVER ( ORDER BY CreationDate ) AS Not_Specified,

SUM(Score) OVER ( ORDER BY CreationDate RANGE UNBOUNDED PRECEDING ) AS Range_Specified,

SUM(Score) OVER ( ORDER BY CreationDate ROWS UNBOUNDED PRECEDING ) AS Rows_Specified

FROM dbo.Posts

WHERE OwnerUserId = 1

AND CAST(CreationDate AS DATE) BETWEEN '2008-08-01' AND '2008-08-31'

ORDER BY CreationDate;

Wrap. It. Up.

This one is pretty self explanatory. If you’re lucky enough to be on SQL Server 2012 or greater, and you’re using Window Functions to their full T-SQL potential, it’s was easier to calculate running totals. Just be careful how you write your code.

If you like this sort of stuff, Check out Doug’s new video series, T-SQL Level Up. There are next to zero fart jokes in it.

Mike
September 29, 2015 11:13 am

” it’s was easier to calculate running totals”

I’m pretty sure you wanted to say ” it’s way easier to calculate running totals”.

Good article.

Cheers.

Reply
Koen Verbeeck
September 30, 2015 2:20 am

Aside from the change in behavior, there’s also a very serious performance impact when RANGE is used instead of ROWS.

Worst.Default.Ever.

Reply
Thomas Franz
September 30, 2015 2:35 am

Are there performance differences between RANGE and ROWS?

I remember a training course where us was told that RANGE will use spills to tempdb and ROWS not.

Following to statistics I/O RANGE produces several reads on worktable while ROWs has a worktable but shows me zero reads (don’t know if it means that there are really zero reads on worktable or if the statistics are lying).

This is particularly true, if you
CREATE INDEX idx_post_dummy ON dbo.Posts (OwnerUserId, CreationDate) INCLUDE (Score)

Reply
- Koen Verbeeck
  September 30, 2015 4:31 am
  
  Yes, that’s true. RANGE needs a worktable in tempdb, while ROWS can use (under certain conditions) a specialized in-memory table (hence the big performance improvement).
  
  That’s also why you see a worktable but with 0 reads I guess.
  
  Reply
Richie Lee
November 5, 2015 11:18 am

First time I’ve read through an article on windows functions that I have actually understood the concept. Thanks and nicely done!

Reply