Windowing functions are a great way of letting a row be aware of what’s going on around it – not just in the current row, but in all its neighbors. I like to know what’s going on with my neighbors too, so that’s why I’m a couple of blocks away from my own home looking into the home of actually the basement of David and Angela Matthews. I like to see what they’re up to. They’re home. She just saw me and she’s calling the cops. I think this would be a good time to leave.
Of all the windowing functions out there, row number is probably the most familiar to you for a couple of reasons. Number one, it’s been around since SQL Server 2005. It’s actually one of the oldest windowing functions SQL Server has. The other reason is it’s enormously popular. At one time or another, pretty much every developer has a need for ranking or row numbering a results set.
Up until 2005, we had to roll our own code in order to get that output. Now the row number function simply puts it out for us, it’s built into SQL Server, so everyone that sees it understands exactly how it works and exactly what it does. Speaking of how it works, let’s take a look at the syntax of it.
We’ve got the row number function followed by the over clause which establishes the scope of the row number and then here we have Partition By RollDate. What we’re telling it is we want to reset that counter every time we move to a new RollDate. We’ll count one up to the last row on a certain date, say January 1st. Then when we hit January 2nd, [00:02:00] we’ll start that counter over again and so on, and so forth.
Then we have the Order By clause which tells us that we want that row number to be applied in the order of the RollID. Let’s take a look at how this would work then. We’ve got RollID number one which happened to be January 1st, Player ID rolled a 20-sided die and got a 5. That was roll number for the day.
RollID number four was the second roll of the day. That’s because RollIDs number two and three took place in between, but here we’re just looking at the 20-sided dice. This is an instance where we can’t trust the built in ID if we want to limit the results set at all. If we only want to look at 20-sided dice, we’re going to create gaps in the data and then we have to smooth over with those gaps by using the row number function.
Pretty simple. Pretty straightforward. If we wanted to, we could add more columns to the Partition By RollDates. We could see the RollDate and the Player ID cause the row number to reset, so let’s take a look at what that would look like. Before I execute it, let’s go down here. We had 21 rolls on January 1st. I’ll run that. Of those 21 rolls that we saw before, 5 came from player one, 3 came from player two, 5 from player 3, and 8 from player 4.
Pretty straightforward how it works. Partition By just says when to reset the counter, so in this case we’re telling it reset the counter with every RollDate or Player ID change.
Let’s look at the Sum window function. Now in our query here we’ve got select the RollDate, Player ID, DieSides, DieRoll, and we want [00:04:00] to add this, the Sum of the DieRoll. Now Sum is different from row number and rank in that it’s not a ranking window function. It’s an aggregation window function. What makes Sum goofy here is that we’re selecting other columns, but we’re not saying anything in the over clause, we’re not saying Partition By, and nowhere in our query is a Group By.
Now if you look at this the way that you look at a traditional Sum function, you might think that’s going to break because we can’t possibly have a Sum and other columns without grouping by those other columns. This is where you have to think along a different line with window functions. No matter what you scope a windowing function with, it won’t affect any of the other columns, it won’t affect the number of rows that you get back, it won’t change anything. All it will apply to is that one column.
Let’s take a look at the Sum of the DieRoll. If we don’t specify any Partition By, we’ll get the grand total. What we can do with the Sum window function is scope it down. We can partition it by the RollDate and we can partition it by the RollDate and the Player ID. We can get the grand total of all the rolls, we can get the total of all the rolls for that day, and we can get the total of all the rolls for that day, for that player. No matter what we do with any of these window functions, we will not change the number of rows that we’re going to get back and we won’t affect any of the other columns.
I’ll run this again. We can see that for January 1st, our grand total is unchanged. For January 1st, the day sums up to 203 and for that player the sum is 37. Player one rolled [00:06:00] a combined 37, player two rolled a combined 25, 3 rolled a 56, and player four rolled an 85.
Now if we look at this sum number, the day total and the player day total, these aren’t really useful numbers. They don’t tell us much. They demonstrate syntax, but there is no real intelligence that you can glean from this. Let’s move onto a function that might tell us a little bit more about what’s going on.
Since Sum isn’t all that helpful, let’s look at average which might be a little bit better for analytical purposes in this case. We’ve got the same query, select RollDate, Player ID, DieSide, DieRoll and we want to look at the average DieRoll partitioned first by the RollDate to tell us the average roll for that day regardless of who rolled the die. Then the average roll partitioned by the RollDate and the player, so we can see what a player’s average was for any given day.
It starts to get a little more interesting and a little more fun when we look at numbers this way, so let’s check it out. Player ID number one on January 1st rolled an average of seven. Player two rolled an average of eight. Player three was a little bit luckier, got 11 and player 4 got 10. Then we move over to the next day. The cycle continues so on and so forth.
If we scroll down here, we can see that player two had a pretty rough day on January 3rd rolling a 20-sided die. Their average is only six. If we scroll down a little farther, we’ll probably find someone- Player four had a pretty lucky day there, 14, 14, 14, and 20. Nice.
Using these window functions, again, pretty simple syntax. All you have to do is take the function and then the over clause to specify [00:08:00] how you want to scope it down. You can say you just want it to apply to a date, player ID, and so on, and so forth, however you like.
You can get some really useful information very quickly and you don’t have to go through the trouble of doing Group By because these are applied to every single row. Again, we don’t cut down on the number of rows that we get back. We simply add intelligence to the rows we already have. Let’s take a look at some other window functions now.
With SQL Server 2012, we’ve got a bunch of new windowing functions that allow us to take census of our results set. We can do that from any position, any row in the results set. Before we had these functions, if you ask a given row what’s going on around it, how does it compare to others except for the row number [00:09:00] and rank, it really wouldn’t know.
Now we’ve got these special functions like Lag and Lead which are enormously helpful. You’ll see how in just a moment. Our query has the RollDate, Player ID, DieSides, DieRoll and then I’m going to use both the LAG and LEAD function. LAG simply looks back over its shoulder. LEAD looks ahead.
In this instance, I want to know about what? I want to know about the DieRoll, so I’m going to say LAG and then in parenthesis the DieRoll. I want to know about the DieRoll that is behind me. How many rows behind me? I’m going to say one. Then with the over clause I will say I need to make sure that these are ordered, so I always get the same results back every time. I’ll order by the RollDate.
Then I’ll label that column the Previous Roll. The same with LEAD. I’ll say I’ll look ahead. I want to know about the DieRoll that’s in front of me. How many rows in front of me? One. [00:10:00] I’ll order it by the RollDate, so I know that my results are consistent and then I’ll say that will be the next roll. If I execute it, it will look exactly like you think it will.
We’ll get the RollDate, the Player ID, DieSides, DieRoll. Now in that first roll there wasn’t a previous roll, so that’s going to come back as null. In the next column over I’m asking what’s the roll that took place right in front of me, the next one down the list. If my DieRoll here is five, I want to know what the next roll is. It’s going to be one. I can actually find that out in the current row. Pretty cool, huh?
I can look as far ahead or as far back as I want to go. I can target four rows ahead, ten rows ahead, doesn’t matter. It’s really neat that you can take a certain row and ask it about other rows in the results set and it will know the answer.
If you wanted to do something, say, totally unrelated to our cross posing cursors database, say, you needed to do this period last year, this period last month, any sort of period to period comparison. This is fantastic for doing that because you can look a certain number of quarters back or a certain number of months, or a certain number of years, whatever the case.
Now you’ll notice that the roll previous was null because no value existed. There is a little trick that you can do with LAG and LEAD to get rid of that null value. I’m going to go ahead and insert it here. I’m going to say I want to know about the DieRoll that happened one roll previous to mine. If there isn’t one, I’m going to say it was zero. I’m going to assign a default value in case the expression returns a null in there.
Now I actually have a different value. If you need to handle that null somehow, if null won’t be good enough in the results set itself, you can always assign a default value if [00:12:00] no value is returned by that expression.
Now you might be asking yourself, “Okay, so I can LAG and I can LEAD a certain number of rows, but what if I don’t want a certain number of rows in one direction or the other? I want to know the first or the last value.” That could be any number of rows in between. Well, you’re in luck because there’s a windowing function for that. Let’s take a look.
LAG and LEAD will allow us to look a fixed number of rows in either direction, but FIRST_VALUE and LAST_VALUE will allow us to look all the way to the front and all the way to the back regardless of how many rows might exist in between. We have the same query here, RollDate, Player ID, DieSides, DieRoll, and we want to look at the very first DieRoll. We’ll order these by RollDate and we’ll look at the very last DieRoll ordered by the RollDate.
We’ll go ahead and run this. What we should see is the roll FIRST_VALUE across the entire results set should be the same and it should be five. Is it a match? Sure is. Our first roll in the entire results set was a five and so everybody knows we’re onboard with that. It’s a five. The very last roll in the results set should be 19. Let’s go take a look.
That’s not right. That’s not 19. What happened? Now our first roll is being recorded correctly across the entire results set, but the last one is wrong. What’s up with that? Well in order for me to explain what’s going on, we’re going to have to move on from this and then come back to it. We’re going to move on to look at another window function clause that [00:14:00] will help explain what’s going on with this FIRST_VALUE, LAST_VALUE.
Let’s step out and look at the Rows Between clause. We’re returning back to the Sum windowing function and we’re going to add a little wrinkle this time. Remember that the over clause determines the scope of how that windowing function is applied, so we can do Partition By, we can do Order By. We already know that, but there’s a third one that we want to add, the Rows Between.
Rows Between simply says I want you to look a certain distance back in terms of row count or a certain distance forward. We have three choices really. We have unbounded which means go as far as you can go in that direction. We have current row which obviously means stop here or start here. Then we have an integer. We can say a fixed number of rows in one direction or the other.
Let’s take a look at our roll running total. Now if you’ve ever had to do a running total in T-SQL, it’s complicated to roll your own code for that. This is going to make that a lot easier because we can simply say we want to look a certain number of rows back, sort of like a rolling 12 or some sort of financial report like that would have on it.
In this case, we want to look at the Rows Between Unbounded Preceding And Current Row as our roll running total. We want to get a running total across the entire results set. Then the next Sum function that we put in says Rows Between 3 Preceding, so we want to look 3 rows back and start there, and end the sum at the current row. That will be the sum of the last four including the one that we’re looking at right now.
Let’s execute this code and you can get an idea of how that’s going to look in practice. Starting with row one, we roll a five. Our [00:16:00] running total is five and the last four is five. The next roll is a one, so we increment these up. The next roll is a five, that brings us to 11. The next roll is a 19, that brings us to 30. The fifth roll is a seven, so our running total will increment up to 37 now. The sum of the last four will only be the sum of 7 plus 19, plus 5, plus 1, four rows. When we do that, we get 32.
It’s actually pretty simple. You can do Unbounded Preceding and Unbounded Following which is look as far as you can forward. You can do current row and Unbounded Following. If you need to count up something in all the rows that have yet to come from the perspective of the current row you’re in, you can do that too.
Now that we’ve explained how this Rows Between clause works, let’s go back to the FIRST_VALUE and LAST_VALUE, and see if we can figure out how to make that behave the way we expect it to behave as opposed to the way Microsoft says it should behave.
We’re back to our FIRST_VALUE, LAST_VALUE query and we’re trying to figure out how it is we can make last value behave the way that we expect it to because right now it’s not doing that Let’s look at the results again.
We want FIRST_VALUE and LAST_VALUE to match up with the very first record and the very last record. We don’t want to Partition By anything. Right now FIRST_VALUE is doing exactly what we’d expect it to. The trouble is LAST_VALUE isn’t. What’s happening with LAST_VALUE is it’s actually treating our Order By as a Partition By.
You’ll notice as soon as the date changes, the LAST_VALUE changes. Now it’s January 2nd and our LAST_VALUE is one. Of course, [00:18:00] the LAST_VALUE on January 2nd is one. That’s what’s going on and we can further confirm this by adding Player ID to the Order By clause. Let’s see what happens to that roll last column.
The last roll by player one is a seven. The last roll by player two, coincidentally, is a seven. The last roll by player three is a two. Yes, that’s indeed what’s happening. The Order By gets misconstrued, in my opinion anyway, as an Partition By. We don’t want that, but the problem is we can’t add a Partition By to open things up. We have to figure out a way of working with it the way it is with the Order By because Order By is required for FIRST_VALUE and LAST_VALUE.
The way that we can fix this up is using what we just learned about the rows between. What we’ll put in here is Rows Between Unbounded Preceding and Unbounded Following because right now we’re getting this goofy Partition By behavior that involves Unbounded Preceding and current row. We don’t want that. We want it to look at the very first row and the very last row.
Let’s add this Rows Between Unbounded Preceding and Unbounded Following and see what happens. What we really want to see, and we’ll double-check this, is ten. We want to see ten all up and down that column the way we see five up and down the first roll column.
Let’s run this and see if that fix will fix it. Indeed it does. If you ever use LAST_VALUE and you want to look across the entire results set, you don’t want to Partition By anything, make sure that you add Rows Between Unbounded Preceding and Unbounding Following. Otherwise, you will get weird results. You will get stuff that is Partition By or Order By. You do not want to deal with that. Let’s scan all the way down. Yeah, five and ten top to bottom. [00:20:00] Perfect.
All right. FIRST and LAST_VALUE very useful. You just have to know that LAST_VALUE is a little bit messed up compared to the way that you and I might expect it to behave.
- 01. Prologue and Script Files
- 02. Thinking in Sets (5m)
- 03. Number and Date Tables (10m)
- 04. Case Expressions (17m)
- 05. Computed Columns (10m)
- 07. Windowing Functions – Performance (8m)
- 08. Replacing Cursors, Part 1 (6m)
- 09. Replacing Cursors, Part 2 (5m)
- 10. Replacing Cursors, Part 3 (7m)
- 11. Let the Adventure Begin! (9m)
- 12. About Doug (2m)