Aggregations and grouping can be a pain in the rear for the novice SQL developer. Way back in 2003, the ANSI/ISO standards people figured this out and added windows and ranking functions to the standard. In 2005, Microsoft added a few of these functions (ROW_NUMBER()
, RANK()
, DENSE_RANK()
, and NTILE()
) as well as the OVER()
clause so we could whet our appetites. The combination of window and ranking functionality has been used and abused to add all kinds of functionality to applications. SQL Server Denali takes it to the next step and adds a whole slew of additional functionality to window and rank and analyze data in without having to find a BI expert.
OVER() Defining Your Windows
I’m sitting in my office right now, when I look out the window I can see a shrub, part of the neighbor’s car, and the grass in my lawn. I can also see my cat lazily napping. If I move my chair over a little bit, the view out of the window changes and I can see the driveway and the road. Changing where my chair is changes the view I can see of the outside. Changing the definition of a window in T-SQL changes the view of the data for a function in the query.
PARTITION BY and ORDER BY
Much like GROUP BY
in a query, PARTITION BY
lets us divide up the results in partitions. The difference is that PARTITION BY
acts locally – on a single function – instead of on the entire query.
1 2 3 4 |
SELECT LastName , FirstName , ROW_NUMBER() OVER (PARTITION BY LastName ORDER BY FirstName) AS position FROM Person.Person ; |
Looking at the results from this query, we can see that the ROW_NUMBER()
function gives us the number of a row – in FirstName
order – in the results when it’s grouped up by LastName
. Or, to put it differently, it gives us the relative position of the row defined by the window of LastName
.
A window is a way of defining a local scope for a function. Instead of using convoluted logic, subqueries, and temporary tables, it’s easy to definite local scope for queries to do something useful.
Finding Other Rows with LAG and LEAD
It’s easy to produce queries based on relatively fixed windows, but what if you need to know what’s going on just before and just after the current row. Surely there must be a way to do something like that.
SQL Server 2012 brings support for the LAG
and LEAD
functions. The LAG
function can be used to show a previous row. LAG
accepts two additional optional parameters, the number of rows to go back a default value to use, just in case a NULL
is found.
Let’s look at employee performance in the AdventureWorks2008R2
database by comparing their current performance with previous performance. This query, without the LAG
function requires creating some sort of temporary table and joining to it once over each of last_month
, three_months_ago
, six_months_ago
, nine_months_ago
and twelve_months_ago
. It might end up looking something like this (if we used a CTE instead of a temporary table):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 |
WITH sales AS ( SELECT h.SalesPersonID , ROW_NUMBER() OVER (PARTITION BY h.SalesPersonID ORDER BY DATEPART(yyyy, h.OrderDate), DATEPART(mm, h.OrderDate)) AS rn , DATEPART(yyyy, h.OrderDate) AS [Year], DATEPART(mm, h.OrderDate) AS [Month], SUM(h.SubTotal) AS TerritoryTotal FROM Sales.SalesOrderHeader AS h WHERE SalesPersonID IS NOT NULL GROUP BY SalesPersonID , DATEPART(yyyy, OrderDate) , DATEPART(mm, OrderDate) ) SELECT s.SalesPersonID, s.[Year], s.[Month], s.TerritoryTotal, s1.TerritoryTotal AS last_month , s3.TerritoryTotal AS three_months_ago , s6.TerritoryTotal AS six_months_ago , s9.TerritoryTotal AS nine_months_ago , s12.TerritoryTotal AS twelve_months_ago FROM sales AS s LEFT OUTER JOIN sales AS s1 ON s.SalesPersonID = s1.SalesPersonID AND s.rn = s1.rn + 1 LEFT OUTER JOIN sales AS s3 ON s.SalesPersonID = s3.SalesPersonID AND s.rn = s3.rn + 3 LEFT OUTER JOIN sales AS s6 ON s.SalesPersonID = s6.SalesPersonID AND s.rn = s6.rn + 6 LEFT OUTER JOIN sales AS s9 ON s.SalesPersonID = s9.SalesPersonID AND s.rn = s9.rn + 9 LEFT OUTER JOIN sales AS s12 ON s.SalesPersonID = s12.SalesPersonID AND s.rn = s12.rn + 12 |
Although this query is clear enough, it requires a considerable amount of disk access and performs 6 logical scans and 4116 logical reads.
Re-writing this query to use the LAG
function makes the query easier to read, easier to understand, and much faster for SQL Server to run. It only requires 1 logical scan and 686 logical reads.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
SELECT SalesPersonID , [Year] , [Month] , TerritoryTotal , LAG(TerritoryTotal) OVER (PARTITION BY SalesPersonID ORDER BY [Year], [Month]) AS last_month , LAG(TerritoryTotal, 3, 0.00) OVER (PARTITION BY SalesPersonID ORDER BY [Year], [Month]) AS three_months_ago , LAG(TerritoryTotal, 6, 0.00) OVER (PARTITION BY SalesPersonID ORDER BY [Year], [Month]) AS six_months_ago , LAG(TerritoryTotal, 9, 0.00) OVER (PARTITION BY SalesPersonID ORDER BY [Year], [Month]) AS nine_months_ago , LAG(TerritoryTotal, 12, 0.00) OVER (PARTITION BY SalesPersonID ORDER BY [Year], [Month]) AS twelve_months_ago FROM ( SELECT SalesPersonID, DATEPART(yyyy, OrderDate) AS [Year] , DATEPART(MONTH, OrderDate) AS [Month] , SUM(SubTotal) AS TerritoryTotal FROM Sales.SalesOrderHeader GROUP BY SalesPersonID, DATEPART(yyyy, OrderDate), DATEPART(month, OrderDate) ) AS x WHERE SalesPersonID IS NOT NULL ORDER BY SalesPersonID, [Year], [Month]; |
While this isn’t the most interesting report, it’s very common in many fields to compare current performance to previous performance. Stocks and 401(k) portfolios do this. My utility bills all show the last year of history as well as what I’m being charged this month.
Relative Aggregates Using ROWS BETWEEN
Moving a window relative to the current row is pretty interesting, but LAG
and LEAD
can only do so much. What if you need to show a three month average? What if that three month average needs to be centered on the current month?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
SELECT SalesPersonID, the_year, the_month, TerritoryTotal, AVG(TerritoryTotal) OVER (PARTITION BY SalesPersonID ORDER BY the_year, the_month ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS three_month_average , AVG(TerritoryTotal) OVER (PARTITION BY SalesPersonID ORDER BY the_year, the_month ROWS BETWEEN 6 PRECEDING AND 6 FOLLOWING) AS year_average FROM ( SELECT SalesPersonID, DATEPART(yyyy, OrderDate) AS the_year, DATEPART(MONTH, OrderDate) AS the_month, DATEPART(QUARTER, OrderDate) AS the_quarter, SUM(SubTotal) AS TerritoryTotal FROM Sales.SalesOrderHeader GROUP BY SalesPersonID, DATEPART(yyyy, OrderDate), DATEPART(month, OrderDate), DATEPART(QUARTER, OrderDate) ) AS x ; |
This query adds in something called the frame clause (ROWS BETWEEN start AND end
). The frame clause makes it easy to define a sliding window for the function being used.
There are a few ways to use the frame clause to change how the window is computed. As you saw in the previous query it’s possible to state the number of rows before and after the current row. That’s simple enough. It’s also possible to define a range of rows to use. One of the most first uses that developers will find for frame clauses is to produce a running total:
1 2 3 4 5 6 7 8 |
SELECT soh.SalesPersonID, soh.OrderDate, soh.SubTotal , SUM(soh.SubTotal) OVER (PARTITION BY SalesPersonID ORDER BY soh.OrderDate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS total_sales_to_date FROM Sales.SalesOrderHeader AS soh WHERE soh.SalesPersonID IS NOT NULL |
In the past, developers would have had to resort to trickery with messy self-joins to get running totals. Thankfully, it’s now possible to define a frame for the window and use it compute aggregates over a range. I did resort to a bit of trickery to demonstrate the running total. The default value for a frame clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
, so it’s not that impressive and can be achieved with T-SQL that looks like any other window function.
The ranges in a frame clause are all relative to the current partition of the window function. UNBOUNDED PRECEDING
will get all rows from the start of the frame, and the start of the frame is relative to whatever has been defined in the OVER
clause. UNBOUNDED FOLLOWING
will get all rows from the current row to the end of the frame.
Brand New Analytic Functions
There are a few other functions that are being added to T-SQL in SQL Server 2012. I haven’t mentioned them yet because they’re statistical functions for computing distribution and median information. This is about the extent that I’ll mention them, but here they are:
CUME_DIST()
– calculates the position of a value relative to a group of valuesPERCENT_RANK()
– calculates the percent rank of a value in a group of valuesPERCENTILE_CONT(X)
– looks at the percent rank of values in a group until it finds one greater than or equal to X.PERCENTILE_DISC(X)
– looks at the cumulative distribution of values in a group until it finds one greater than or equal to X
I don’t do a lot of analytical work, and certainly nothing where I’m going to be computing cumulative distributions, but it’s good to know that these functions are available. A lot of people do need to perform this functionality and they’ve resorted to using T-SQL trickery or SQL-CLR to get the same results when they could have just had these functions in the database.
These analytical functions are a bit different than the other functions that I’ve mentioned so far. First of all, they all don’t require an OVER()
clause. Because these are analytical functions, you might want to perform the analysis over the entire query, so there’s no explicit need to apply a window and frame to the function.
The other thing that makes these functions different is that they allow a WITHIN GROUP
specification. WITHIN GROUP
is new in SQL Server 2012 and it applies an ordering to the analytical function’s actions but not to the results.
Here’s an example using all four to demonstrate how they work on the data in AdventureWorks, and to show how poorly those poor sales clerks get paid:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
SELECT p.LastName , p.FirstName , e.OrganizationLevel, eph.Rate * 2080 AS Salary , PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY eph.Rate * 2080 DESC) OVER (PARTITION BY e.OrganizationLevel) AS MedianContinuous , PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY eph.Rate * 2080 DESC) OVER (PARTITION BY e.OrganizationLevel) AS MedianDiscrete , CUME_DIST() OVER (PARTITION BY e.OrganizationLevel ORDER BY eph.Rate * 2080 DESC) AS CUME_DIST, PERCENT_RANK() OVER (PARTITION BY e.OrganizationLevel ORDER BY eph.Rate * 2080 DESC) AS PERCENT_RANK FROM HumanResources.Employee AS e JOIN Person.Person AS p ON e.BusinessEntityID = p.BusinessEntityID LEFT JOIN HumanResources.EmployeePayHistory AS eph ON eph.BusinessEntityID = e.BusinessEntityID GROUP BY p.LastName, p.FirstName, e.OrganizationLevel, eph.Rate ORDER BY OrganizationLevel ASC, Salary DESC ; |
Mike, I have good news: you’re getting a corner office.
SQL Server Windowing Functions Training Video
For more information, watch this free 20-minute training video on SQL Server 2012’s new windowing functions:
Summing it up
SQL Server 2012 adds many new features, not the least of which are the new window functions, frame clauses, and analytic functions. When you take each of these features on their own, they add powerful functionality to SQL Server. When you combine these three T-SQL enhancements, they give database developer tremendous power and flexibility to quickly develop reports that analyze data and deliver the results without needing extra features, tools, or products.
13 Comments. Leave new
Very useful! Thank you very much.
Nice details of the upcoming new functions in denali. Thank you.
Thanks! These features will be tremendous for database analytics.
Can’t wait to see the query plans!
I had to stop at the picture of the view outside your office. It even made my cubicle walls cry.
Ok I kept going, great article, thanks 🙂
Hi,
Great post! one small correction –
the default for the window range is not ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
but RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
the difference is that when using RANGE, CURRENT ROW means al the rows with the same value in the ORDER BY columns.
this script can demostrate the difference:
create table T1(ID INT)
insert into T1(ID)
VALUES(1),(2),(2),(3),(4),(4)
SELECT ID,COUNT(*) OVER (ORDER BY ID)
FROM T1
SELECT ID,COUNT(*) OVER (ORDER BY ID rows between UNBOUNDED Preceding and CURRENT ROW)
FROM T1
SELECT ID,COUNT(*) OVER (ORDER BY ID range between UNBOUNDED Preceding and CURRENT ROW)
FROM T1
Good catch, thanks!