Friday, 6 December 2013

Sql server windows functions


With SQL Server 2005, SQL Server introduced some of SQL's  window functions, that apply, not to the full set, but a partitioned 'window'. Although the ROW_NUMBER, RANK, NTILE and DENSE_RANK bring great power to TSQL, the full versatility will not be available until SQL Server delivers the full implementation. As usual, Robert Sheldon explains all.


With the release of SQL Server 2005, Transact-SQL included limited support for a subset of set functions referred to as window functions. A set function is one that can be applied to a set of rows. By extension, a window function is one that can be applied to a partitioned set of rows (known as a window) in order to rank or aggregate values in that partition. Window functions have nothing to do with Microsoft Windows, but they do provide a way to work with data grouped into logical windows.
You implement window functions as part of a query’s SELECT expression. The window function can be a ranking function (described below) or aggregate function such as SUM or AVG. The function is then followed by an OVER clause that determines how data is partitioned or ordered. The way in which the OVER clause is implemented depends in part on whether you’re using a ranking window function or aggregate window function. The following syntax shows how both types of functions—along with their supporting OVER clause—are implemented in SQL Server:
<over_clause>::=
ranking_function <over_clause_ranking> |
  aggregate_function <over_clause_aggregate>

<over_clause_ranking>::=
  OVER([PARTITION BY column, ... [n]] <order_by_clause>)

<over_clause_aggregate>::=
  OVER([PARTITION BY column, ... [n]])
As the syntax indicates, the OVER clause, when used with a ranking function, includes an optional PARTITION BY clause and a required ORDER BY clause. However, when used with an aggregate function, the OVER clause must include a PARTITION BY clause but cannot include an ORDER BY clause. The components that make up each type of clause and the differences between them will become clearer as we work through the article.
To that end, I’ve created several examples based on the AdventureWorks2008 database (implemented on a local instance of SQL Server 2008). If you want to run the examples against the AdventureWorks database in SQL Server 2005, be sure to change the database name wherever it’s mentioned in the code samples. In addition, you must change the name of the BusinessEntityID column to SalesPersonID where applicable.
Working with Ranking Window Functions
As the name suggests, ranking functions let you rank the rows in your result set based on specified values in those rows. SQL Server supports four ranking functions:
  • ROW_NUMBER: Assigns a sequential number to each row in the result set.
  • RANK: Ranks each row in the result set. If values in the ranking column are the same, they receive the same rank. However, the next number in the ranking sequence is skipped.
  • DENSE_RANK: Ranks each row in the result set. If values in the ranking column are the same, they receive the same rank. The next number in the ranking sequence is then used to rank the row or rows that follow.
  • NTILE: Divides the result set into the number of groups specified as an argument to the function. A group number is then assigned to each row identifying which group the row belongs to.
All ranking functions begin with the number 1 when assigning values and rank the data based on the column specified in the mandatory ORDER BY clause. As you saw in the syntax above, for each ranking function, you must include an OVER clause that contains the ORDER BY clause. In the ORDER BY clause, you identify the column on which to base the ranking. You also specify whether the rows should be sorted in ascending or descending order.
In the following SELECT statement, I use all four ranking functions to rank data based on the SalesLastYear column:
USE AdventureWorks2008;
GO
SELECT
  BusinessEntityID AS SalesID,
  FirstName + ' ' + LastName AS FullName,
  SalesLastYear,
  ROW_NUMBER() OVER(ORDER BY SalesLastYear ASC) AS RowNumber,
  RANK() OVER(ORDER BY SalesLastYear ASC) AS SalesRank,
  DENSE_RANK() OVER(ORDER BY SalesLastYear ASC) AS DenseRank,
  NTILE(4) OVER(ORDER BY SalesLastYear ASC) AS NTileRank
FROM
  Sales.vSalesPerson;
Notice that each ranking function is used as part of a column expression in the SELECT list and that only the NTILE function takes an argument (4), which in this case identifies the number of groups into which to divide the result set.
Following the function is the OVER clause. In each case, the OVER clause includes an ORDER BY clause that specifies the SalesLastYear column. In addition, the clause indicates that the sorting order is ascending (ASC). If you want the rows to be sorted in descending order, you must use DESC. The following table shows the results returned by the SELECT statement:
SalesID
FullName
SalesLastYear
RowNumber
SalesRank
DenseRank
NTileRank
274
Stephen Jiang
0.00
1
1
1
1
284
Tete Mensa-Annan
0.00
2
1
1
1
285
Syed Abbas
0.00
3
1
1
1
287
Amy Alberts
0.00
4
1
1
1
288
Rachel Valdez
1307949.7917
5
5
2
1
283
David Campbell
1371635.3158
6
6
3
2
276
Linda Mitchell
1439156.0291
7
7
4
2
278
Garrett Vargas
1620276.8966
8
8
5
2
289
Jae Pak
1635823.3967
9
9
6
2
275
Michael Blythe
1750406.4785
10
10
7
3
279
Tsvi Reiter
1849640.9418
11
11
8
3
280
Pamela Ansman-Wolfe
1927059.178
12
12
9
3
277
Jillian Carson
1997186.2037
13
13
10
3
282
José Saraiva
2038234.6549
14
14
11
4
281
Shu Ito
2073505.9999
15
15
12
4
286
Lynn Tsoflias
2278548.9776
16
16
13
4
290
Ranjit Varkey Chudukatil
2396539.7601
17
17
14
4
Each function ranks the results based on the values in SalesLastYear column (sorted in ascending order). The ROW_NUMBER function assigns a sequential number to each row according to that order. Because the result set returns 17 rows, the function assigns the values 1 through 17 to those rows, starting with the first SalesLastYear value (0.00).
The RANK function also starts with the number 1. However, because the first four rows are the same value (0.00), each row is ranked the same. The next value in SalesLastYear (1307949.7917) is located in the fifth row of the result set, so it receives a rank of 5. The function skips 2 through 4 because 1 is used four times. In addition, because none of the remaining values are repeated, the rest of the result set is ranked sequentially from 6 through 17.
The DENSE_RANK function, like the RANK function, also assigns the number 1 to the first four rows. However, on the fifth row, the function assigns a number 2, rather than skipping 5, and then ranks the rest of the result set sequentially. Consequently, the function assigns only the numbers 1 through 14 because 1 is repeated four times.
The NTILE function works differently from the other functions. Because 4 is specified as the argument to the function, the result set is divided into four groups. That division is based on the total number of rows divided by the number in the argument. That means, in this case, the first group includes five rows and the other three groups include four. The numbers 1 through 4 are assigned to the rows based on this division.
As you’ll recall from the syntax, the OVER clause can also include a PARTITION BY clause, which identifies how to partition the data into windows. In the following example, I include a PARTITION BY clause for each rank function in order to partition the data based on the values in the TerritoryGroup column:
USE AdventureWorks2008;
GO
SELECT
  BusinessEntityID AS SalesID,
  TerritoryGroup,
  SalesLastYear,
  ROW_NUMBER() OVER(PARTITION BY TerritoryGroup
    ORDER BY SalesLastYear ASC) AS RowNumber,
  RANK() OVER(PARTITION BY TerritoryGroup
    ORDER BY SalesLastYear ASC) AS SalesRank,
  DENSE_RANK() OVER(PARTITION BY TerritoryGroup
    ORDER BY SalesLastYear ASC) AS DenseRank,
  NTILE(2) OVER(PARTITION BY TerritoryGroup
    ORDER BY SalesLastYear ASC) AS NTileRank
FROM
  Sales.vSalesPerson;
This example is nearly identical to the preceding example except for the addition of the PARTITION BY clause to each OVER clause. (I also changed the NTILE argument from 4 to 2.) The result set is now grouped according to values in the TerritoryGroup column. The ranking functions are then applied to each group, as shown in the following example:
SalesID
TerritoryGroup
SalesLastYear
RowNumber
SalesRank
DenseRank
NTileRank
274
NULL
0.00
1
1
1
1
285
NULL
0.00
2
1
1
1
287
NULL
0.00
3
1
1
2
288
Europe
1307949.7917
1
1
1
1
289
Europe
1635823.3967
2
2
2
1
290
Europe
2396539.7601
3
3
3
2
284
North America
0.00
1
1
1
1
283
North America
1371635.3158
2
2
2
1
276
North America
1439156.0291
3
3
3
1
278
North America
1620276.8966
4
4
4
1
275
North America
1750406.4785
5
5
5
1
279
North America
1849640.9418
6
6
6
2
280
North America
1927059.178
7
7
7
2
277
North America
1997186.2037
8
8
8
2
282
North America
2038234.6549
9
9
9
2
281
North America
2073505.9999
10
10
10
2
286
Pacific
2278548.9776
1
1
1
1
Because the TerritoryGroup column contains NULL values, those values form their own group, and the ranking functions are applied to that group. So ROW_NUMER assigns the values of 1 through 3 to the NULL rows, RANK and DENSE_RANK assign a value of 1 to each of the three rows (because the SalesLastYear values are the same), and NTILE splits the three rows into two groups, with two rows in the first group and one row in the second. This process is then repeated for each group, regardless of the number of rows.
Working with Aggregate Window Functions
In addition to applying ranking functions to partitioned data, you can also apply aggregate functions. The primary difference between the two—in terms of how you implement the functions—is that you cannot use the ORDER BY clause in the OVER clause associated with an aggregate function. A more important difference, however, is in the result set. An aggregate function is applied to all rows within the partition, rather than individual rows, with regard to calculating the aggregated data.
Let’s look at an example to demonstrate how this works. In the following SELECT statement, I partition the result set by the TerritoryName column and then apply the COUNT, SUM, and AVG aggregate functions to the SalesLastYear values in each partition:
USE AdventureWorks2008;
GO
SELECT
  TerritoryGroup,
  TerritoryName,
  SalesLastYear,
  COUNT(SalesLastYear) OVER(PARTITION BY TerritoryName) AS SalesCnt,
  SUM(SalesLastYear) OVER(PARTITION BY TerritoryName) AS SalesTtl,
  AVG(SalesLastYear) OVER(PARTITION BY TerritoryName) AS SalesAvg
FROM
  Sales.vSalesPerson
WHERE
  TerritoryGroup IS NOT NULL;
As you can see, I pass the SalesLastYear column in as an argument to the aggregate functions. The functions will then be applied to each partition, as shown in the following results:
TerritoryGroup
TerritoryName
SalesLastYear
SalesCnt
SalesTtl
SalesAvg
Pacific
Australia
2278548.9776
1
2278548.9776
2278548.9776
North America
Canada
1620276.8966
2
3658511.5515
1829255.7757
North America
Canada
2038234.6549
2
3658511.5515
1829255.7757
North America
Central
1997186.2037
1
1997186.2037
1997186.2037
Europe
France
2396539.7601
1
2396539.7601
2396539.7601
Europe
Germany
1307949.7917
1
1307949.7917
1307949.7917
North America
Northeast
1750406.4785
1
1750406.4785
1750406.4785
North America
Northwest
1927059.178
3
3298694.4938
1099564.8312
North America
Northwest
1371635.3158
3
3298694.4938
1099564.8312
North America
Northwest
0.00
3
3298694.4938
1099564.8312
North America
Southeast
1849640.9418
1
1849640.9418
1849640.9418
North America
Southwest
1439156.0291
2
3512662.029
1756331.0145
North America
Southwest
2073505.9999
2
3512662.029
1756331.0145
Europe
United Kingdom
1635823.3967
1
1635823.3967
1635823.3967
The result set is, you'll have noticed, grouped by the TerritoryName values. The aggregate functions are then applied to each row of the group, based on all the rows in the partition. For example, the Northwest territory group includes three rows. The SalesLastYear values in those rows range from 0.00 to 1927059.178. Not surprisingly, the COUNT function returns a value of 3. Notice, however, that this value appears in all three rows. The same is true for the SUM and AVG functions (3298694.4938 and 1099564.8312, respectively), with regard to the values being repeated in each row.
Although this is a lot of repetitive data, the structure of the SELECT statement itself is still much simpler than it would be if you could not take advantage of the aggregate window functions. However, I think the real strength in using window aggregate functions is when you include one of the functions in a larger column expression.
For instance, in the following example I’ve included an expression in the SELECT list that divides the SalesLastYear value by the total sales for each territory:
USE AdventureWorks2008;
GO
SELECT
  FirstName + ' ' + LastName AS FullName,
  TerritoryName,
  SalesLastYear,
  COUNT(SalesLastYear) OVER(PARTITION BY TerritoryName) AS SalesCnt,
  SUM(SalesLastYear) OVER(PARTITION BY TerritoryName) AS SalesTtl,
  AVG(SalesLastYear) OVER(PARTITION BY TerritoryName) AS SalesAvg,
  SalesLastYear / SUM(SalesLastYear)
    OVER(PARTITION BY TerritoryName) AS SalesPct
FROM
  Sales.vSalesPerson
WHERE
  TerritoryGroup IS NOT NULL;
Notice that I use the SUM aggregate function to calculate SalesLastYear values based on the TerritoryName partitions. That way, I can determine what percentage of a territory’s sales an individual salesperson generates, as shown in the following results:
FullName
TerritoryName
SalesLastYear
SalesCnt
SalesTtl
SalesAvg
SalesPct
Lynn Tsoflias
Australia
2278548.9776
1
2278548.9776
2278548.9776
1.00
Garrett Vargas
Canada
1620276.8966
2
3658511.5515
1829255.7757
0.4428
José Saraiva
Canada
2038234.6549
2
3658511.5515
1829255.7757
0.5571
Jillian Carson
Central
1997186.2037
1
1997186.2037
1997186.2037
1.00
Ranjit Varkey Chudukatil
France
2396539.7601
1
2396539.7601
2396539.7601
1.00
Rachel Valdez
Germany
1307949.7917
1
1307949.7917
1307949.7917
1.00
Michael Blythe
Northeast
1750406.4785
1
1750406.4785
1750406.4785
1.00
Pamela Ansman-Wolfe
Northwest
1927059.178
3
3298694.4938
1099564.8312
0.5841
David Campbell
Northwest
1371635.3158
3
3298694.4938
1099564.8312
0.4158
Tete Mensa-Annan
Northwest
0.00
3
3298694.4938
1099564.8312
0.00
Tsvi Reiter
Southeast
1849640.9418
1
1849640.9418
1849640.9418
1.00
Linda Mitchell
Southwest
1439156.0291
2
3512662.029
1756331.0145
0.4097
Shu Ito
Southwest
2073505.9999
2
3512662.029
1756331.0145
0.5902
Jae Pak
United Kingdom
1635823.3967
1
1635823.3967
1635823.3967
1.00
As you can see, the results include a percentage of sales for each salesperson. For example, in the Southwest group, Linda Mitchell generated about 41% of the sales, and Shu Ito about 59%. On the other hand, Lynn Tsoflias in the Australia group generated 100% of the sales because she’s the only one in the group. Although the result set also shows the COUNT, SUM, and AVG totals, I’ve included them only to verify the data. Ultimately, you can create your statements to include only the calculated data you need.
Moving Forward
At the beginning of the article, I mentioned that SQL Server includes only limited support for window functions, which in this case are the ranking and aggregate functions. However, the window functions feature set, as defined in the ANSI SQL standards, is far more extensive than what SQL Server currently supports. Here’s a sample of the type of functionality that Microsoft has yet to implement:
  • Support for the ORDER BY clause when working with aggregate functions
  • Implementation of the DISTINCT clause when working with aggregate functions
  • Implementation of the ROWS and RANGE framing functions to help define the window frame type
  • Implementation of the LAG and LEAD offset functions to compare values based on relative positions
  • Implementation of the FIRST_VALUE and LAST_VALUE offset functions to work with ranked rows
  • Implementation of the WINDOW clause to reuse the window definition
A number of SQL Server developers have requested that Microsoft include these and other window function enhancements in their next release of SQL Server. For many, extending this feature set is one of their top priorities for the next implementation of Transact-SQL. Until then, we must be satisfied with the limited functionality that’s currently supported, which includes the ability to rank and aggregate partitioned data. That said, the features I’ve shown you can be useful tools when you need to return this type of data. For more information on how to implement window functions, particularly the OVER clause, check out the topic “OVER Clause (Transact-SQL)” in SQL Server Books Online.