Blog

Aggregations and grouping can be a pain in the rear for the novice SQL developer. Way back in 2003, the ANSI/ISO standards people figured this out and added windows and ranking functions to the standard. In 2005, Microsoft added a few of these functions (ROW_NUMBER(), RANK(), DENSE_RANK(), and NTILE()) as well as the OVER() clause so we could whet our appetites. The combination of window and ranking functionality has been used and abused to add all kinds of functionality to applications. SQL Server Denali takes it to the next step and adds a whole slew of additional functionality to window and rank and analyze data in without having to find a BI expert.

OVER() Defining Your Windows

I’m sitting in my office right now, when I look out the window I can see a shrub, part of the neighbor’s car, and the grass in my lawn. I can also see my cat lazily napping. If I move my chair over a little bit, the view out of the window changes and I can see the driveway and the road. Changing where my chair is changes the view I can see of the outside. Changing the definition of a window in T-SQL changes the view of the data for a function in the query.

AVG(view) OVER (PARTITION BY chair)

PARTITION BY and ORDER BY

Much like GROUP BY in a query, PARTITION BY lets us divide up the results in partitions. The difference is that PARTITION BY acts locally – on a single function – instead of on the entire query.

SELECT  LastName ,
        FirstName ,
        ROW_NUMBER() OVER (PARTITION BY LastName ORDER BY FirstName) AS position
FROM    Person.Person ;

We're all counting on you

Looking at the results from this query, we can see that the ROW_NUMBER() function gives us the number of a row – in FirstName order – in the results when it’s grouped up by LastName. Or, to put it differently, it gives us the relative position of the row defined by the window of LastName.

A window is a way of defining a local scope for a function. Instead of using convoluted logic, subqueries, and temporary tables, it’s easy to definite local scope for queries to do something useful.

Finding Other Rows with LAG and LEAD

It’s easy to produce queries based on relatively fixed windows, but what if you need to know what’s going on just before and just after the current row. Surely there must be a way to do something like that.

SQL Server 2012 brings support for the LAG and LEAD functions. The LAG function can be used to show a previous row. LAG accepts two additional optional parameters, the number of rows to go back a default value to use, just in case a NULL is found.

Let’s look at employee performance in the AdventureWorks2008R2 database by comparing their current performance with previous performance. This query, without the LAG function requires creating some sort of temporary table and joining to it once over each of last_month, three_months_ago, six_months_ago, nine_months_ago and twelve_months_ago. It might end up looking something like this (if we used a CTE instead of a temporary table):

WITH sales AS (
    SELECT  h.SalesPersonID ,
            ROW_NUMBER() OVER (PARTITION BY h.SalesPersonID
                               ORDER BY     DATEPART(yyyy, h.OrderDate),
                                            DATEPART(mm, h.OrderDate)) AS rn ,
            DATEPART(yyyy, h.OrderDate) AS [Year],
            DATEPART(mm, h.OrderDate) AS [Month],
            SUM(h.SubTotal) AS TerritoryTotal
    FROM    Sales.SalesOrderHeader AS h
    WHERE   SalesPersonID IS NOT NULL
    GROUP BY SalesPersonID ,
            DATEPART(yyyy, OrderDate) ,
            DATEPART(mm, OrderDate)
)
SELECT  s.SalesPersonID,
        s.[Year],
        s.[Month],
        s.TerritoryTotal,
        s1.TerritoryTotal AS last_month ,
        s3.TerritoryTotal AS three_months_ago ,
        s6.TerritoryTotal AS six_months_ago ,
        s9.TerritoryTotal AS nine_months_ago ,
        s12.TerritoryTotal AS twelve_months_ago
FROM    sales AS s
        LEFT OUTER JOIN sales AS s1 ON s.SalesPersonID = s1.SalesPersonID
                                       AND s.rn = s1.rn + 1
        LEFT OUTER JOIN sales AS s3 ON s.SalesPersonID = s3.SalesPersonID
                                       AND s.rn = s3.rn + 3
        LEFT OUTER JOIN sales AS s6 ON s.SalesPersonID = s6.SalesPersonID
                                       AND s.rn = s6.rn + 6
        LEFT OUTER JOIN sales AS s9 ON s.SalesPersonID = s9.SalesPersonID
                                       AND s.rn = s9.rn + 9
        LEFT OUTER JOIN sales AS s12 ON s.SalesPersonID = s12.SalesPersonID
                                       AND s.rn = s12.rn + 12

Although this query is clear enough, it requires a considerable amount of disk access and performs 6 logical scans and 4116 logical reads.

Re-writing this query to use the LAG function makes the query easier to read, easier to understand, and much faster for SQL Server to run. It only requires 1 logical scan and 686 logical reads.

SELECT  SalesPersonID ,
        [Year] ,
        [Month] ,
        TerritoryTotal ,
        LAG(TerritoryTotal) OVER (PARTITION BY SalesPersonID
                                  ORDER BY [Year], [Month]) AS last_month ,
        LAG(TerritoryTotal, 3, 0.00) OVER (PARTITION BY SalesPersonID
                                           ORDER BY [Year], [Month]) AS three_months_ago ,
        LAG(TerritoryTotal, 6, 0.00) OVER (PARTITION BY SalesPersonID
                                           ORDER BY [Year], [Month]) AS six_months_ago ,
        LAG(TerritoryTotal, 9, 0.00) OVER (PARTITION BY SalesPersonID
                                           ORDER BY [Year], [Month]) AS nine_months_ago ,
        LAG(TerritoryTotal, 12, 0.00) OVER (PARTITION BY SalesPersonID
                                            ORDER BY [Year], [Month]) AS twelve_months_ago
FROM (
        SELECT    SalesPersonID,
                DATEPART(yyyy, OrderDate) AS [Year] ,
                DATEPART(MONTH, OrderDate) AS [Month] ,
                SUM(SubTotal) AS TerritoryTotal
        FROM    Sales.SalesOrderHeader
        GROUP BY SalesPersonID,
                DATEPART(yyyy, OrderDate),
                DATEPART(month, OrderDate)
) AS x
WHERE    SalesPersonID IS NOT NULL
ORDER BY SalesPersonID,
        [Year],
        [Month];

While this isn’t the most interesting report, it’s very common in many fields to compare current performance to previous performance. Stocks and 401(k) portfolios do this. My utility bills all show the last year of history as well as what I’m being charged this month.

Your sales are down, Ted.

Relative Aggregates Using ROWS BETWEEN

Moving a window relative to the current row is pretty interesting, but LAG and LEAD can only do so much. What if you need to show a three month average? What if that three month average needs to be centered on the current month?

SELECT  SalesPersonID,
        the_year,
        the_month,
        TerritoryTotal,
        AVG(TerritoryTotal) OVER (PARTITION BY SalesPersonID
                                  ORDER BY the_year, the_month
                                  ROWS BETWEEN 1 PRECEDING
                                           AND 1 FOLLOWING) AS three_month_average ,
        AVG(TerritoryTotal) OVER (PARTITION BY SalesPersonID
                                  ORDER BY the_year, the_month
                                  ROWS BETWEEN 6 PRECEDING
                                           AND 6 FOLLOWING) AS year_average
FROM (
        SELECT    SalesPersonID,
                DATEPART(yyyy, OrderDate) AS the_year,
                DATEPART(MONTH, OrderDate) AS the_month,
                DATEPART(QUARTER, OrderDate) AS the_quarter,
                SUM(SubTotal) AS TerritoryTotal
        FROM    Sales.SalesOrderHeader
        GROUP BY SalesPersonID,
                DATEPART(yyyy, OrderDate),
                DATEPART(month, OrderDate),
                DATEPART(QUARTER, OrderDate)
) AS x ;

This query adds in something called the frame clause (ROWS BETWEEN start AND end). The frame clause makes it easy to define a sliding window for the function being used.

Historically speaking, you never were very good.

There are a few ways to use the frame clause to change how the window is computed. As you saw in the previous query it’s possible to state the number of rows before and after the current row. That’s simple enough. It’s also possible to define a range of rows to use. One of the most first uses that developers will find for frame clauses is to produce a running total:

SELECT  soh.SalesPersonID, soh.OrderDate,
        soh.SubTotal ,
        SUM(soh.SubTotal) OVER (PARTITION BY SalesPersonID
                                ORDER BY soh.OrderDate
                                ROWS BETWEEN UNBOUNDED PRECEDING
                                         AND CURRENT ROW) AS total_sales_to_date
FROM    Sales.SalesOrderHeader AS soh
WHERE   soh.SalesPersonID IS NOT NULL

Run faster. Work harder. Sell. Sell. Sell.

In the past, developers would have had to resort to trickery with messy self-joins to get running totals. Thankfully, it’s now possible to define a frame for the window and use it compute aggregates over a range. I did resort to a bit of trickery to demonstrate the running total. The default value for a frame clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so it’s not that impressive and can be achieved with T-SQL that looks like any other window function.

The ranges in a frame clause are all relative to the current partition of the window function. UNBOUNDED PRECEDING will get all rows from the start of the frame, and the start of the frame is relative to whatever has been defined in the OVER clause. UNBOUNDED FOLLOWING will get all rows from the current row to the end of the frame.

Brand New Analytic Functions

There are a few other functions that are being added to T-SQL in SQL Server 2012. I haven’t mentioned them yet because they’re statistical functions for computing distribution and median information. This is about the extent that I’ll mention them, but here they are:

  • CUME_DIST() – calculates the position of a value relative to a group of values
  • PERCENT_RANK() – calculates the percent rank of a value in a group of values
  • PERCENTILE_CONT(X) – looks at the percent rank of values in a group until it finds one greater than or equal to X.
  • PERCENTILE_DISC(X) – looks at the cumulative distribution of values in a group until it finds one greater than or equal to X

I don’t do a lot of analytical work, and certainly nothing where I’m going to be computing cumulative distributions, but it’s good to know that these functions are available. A lot of people do need to perform this functionality and they’ve resorted to using T-SQL trickery or SQL-CLR to get the same results when they could have just had these functions in the database.

These analytical functions are a bit different than the other functions that I’ve mentioned so far. First of all, they all don’t require an OVER() clause. Because these are analytical functions, you might want to perform the analysis over the entire query, so there’s no explicit need to apply a window and frame to the function.

The other thing that makes these functions different is that they allow a WITHIN GROUP specification. WITHIN GROUP is new in SQL Server 2012 and it applies an ordering to the analytical function’s actions but not to the results.

Here’s an example using all four to demonstrate how they work on the data in AdventureWorks, and to show how poorly those poor sales clerks get paid:

SELECT  p.LastName ,
        p.FirstName ,
        e.OrganizationLevel,
        eph.Rate * 2080 AS Salary ,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY eph.Rate * 2080 DESC)
                                     OVER (PARTITION BY e.OrganizationLevel) AS MedianContinuous ,
        PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY eph.Rate * 2080 DESC)
                                     OVER (PARTITION BY e.OrganizationLevel) AS MedianDiscrete ,
        CUME_DIST() OVER (PARTITION BY e.OrganizationLevel
                          ORDER BY eph.Rate * 2080 DESC) AS CUME_DIST,
        PERCENT_RANK() OVER (PARTITION BY e.OrganizationLevel
                             ORDER BY eph.Rate * 2080 DESC) AS PERCENT_RANK
FROM    HumanResources.Employee AS e
        JOIN Person.Person AS p ON e.BusinessEntityID = p.BusinessEntityID
        LEFT JOIN HumanResources.EmployeePayHistory AS eph ON eph.BusinessEntityID = e.BusinessEntityID
GROUP BY p.LastName,
        p.FirstName,
        e.OrganizationLevel,
        eph.Rate
ORDER BY OrganizationLevel ASC,
        Salary DESC ;

Mike, I have good news: you're getting a corner office.

SQL Server Windowing Functions Training Video

For more information, watch this free 20-minute training video on SQL Server 2012′s new windowing functions:

Summing it up

SQL Server 2012 adds many new features, not the least of which are the new window functions, frame clauses, and analytic functions. When you take each of these features on their own, they add powerful functionality to SQL Server. When you combine these three T-SQL enhancements, they give database developer tremendous power and flexibility to quickly develop reports that analyze data and deliver the results without needing extra features, tools, or products.

Training Classes

If you’ve come here from a training class, or you’re just curious, you can download several additional code samples as a standalone T-SQL file or as an SSMS project.

↑ Back to top
  1. Very useful! Thank you very much.

  2. Nice details of the upcoming new functions in denali. Thank you.

  3. Thanks! These features will be tremendous for database analytics.

  4. Pingback: SSIS Junkie : Debunking Kimball Effective Dates part 2 – Windowing Functions

  5. Can’t wait to see the query plans!

  6. I had to stop at the picture of the view outside your office. It even made my cubicle walls cry.

    Ok I kept going, great article, thanks :)

  7. Hi,

    Great post! one small correction –
    the default for the window range is not ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    but RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
    the difference is that when using RANGE, CURRENT ROW means al the rows with the same value in the ORDER BY columns.
    this script can demostrate the difference:

    create table T1(ID INT)
    insert into T1(ID)
    VALUES(1),(2),(2),(3),(4),(4)

    SELECT ID,COUNT(*) OVER (ORDER BY ID)
    FROM T1

    SELECT ID,COUNT(*) OVER (ORDER BY ID rows between UNBOUNDED Preceding and CURRENT ROW)
    FROM T1

    SELECT ID,COUNT(*) OVER (ORDER BY ID range between UNBOUNDED Preceding and CURRENT ROW)
    FROM T1

  8. Pingback: Video: How to Use the Denali Windowing Functions | Brent Ozar PLF | Brent Ozar PLF

  9. Pingback: LAST_VALUE() Analytic Function–SQL Server 2012 « Sam Vanga's

  10. Pingback: Something for the Weekend – SQL Server Links 15/07/11

  11. Pingback: SQL2012 Windowing Functions In The Data Warehouse–1. Dimensions : IN2BI

  12. Pingback: SQL2012 Windowing Functions In The Data Warehouse–2. Reporting : IN2BI

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

css.php