MukaramRocks: December 2013

With SQL Server 2005, SQL Server introduced some of SQL's window functions, that apply, not to the full set, but a partitioned 'window'. Although the ROW_NUMBER, RANK, NTILE and DENSE_RANK bring great power to TSQL, the full versatility will not be available until SQL Server delivers the full implementation. As usual, Robert Sheldon explains all.

With the release of SQL Server 2005, Transact-SQL included limited support for a subset of set functions referred to as window functions. A set function is one that can be applied to a set of rows. By extension, a window function is one that can be applied to a partitioned set of rows (known as a window) in order to rank or aggregate values in that partition. Window functions have nothing to do with Microsoft Windows, but they do provide a way to work with data grouped into logical windows.

You implement window functions as part of a query’s SELECT expression. The window function can be a ranking function (described below) or aggregate function such as SUM or AVG. The function is then followed by an OVER clause that determines how data is partitioned or ordered. The way in which the OVER clause is implemented depends in part on whether you’re using a ranking window function or aggregate window function. The following syntax shows how both types of functions—along with their supporting OVER clause—are implemented in SQL Server:

<over_clause>::=

ranking_function <over_clause_ranking> |

aggregate_function <over_clause_aggregate>

<over_clause_ranking>::=

OVER([PARTITION BY column, ... [n]] <order_by_clause>)

<over_clause_aggregate>::=

OVER([PARTITION BY column, ... [n]])

As the syntax indicates, the OVER clause, when used with a ranking function, includes an optional PARTITION BY clause and a required ORDER BY clause. However, when used with an aggregate function, the OVER clause must include a PARTITION BY clause but cannot include an ORDER BY clause. The components that make up each type of clause and the differences between them will become clearer as we work through the article.

To that end, I’ve created several examples based on the AdventureWorks2008 database (implemented on a local instance of SQL Server 2008). If you want to run the examples against the AdventureWorks database in SQL Server 2005, be sure to change the database name wherever it’s mentioned in the code samples. In addition, you must change the name of the BusinessEntityID column to SalesPersonID where applicable.

Working with Ranking Window Functions

As the name suggests, ranking functions let you rank the rows in your result set based on specified values in those rows. SQL Server supports four ranking functions:

ROW_NUMBER: Assigns a sequential number to each row in the result set.
RANK: Ranks each row in the result set. If values in the ranking column are the same, they receive the same rank. However, the next number in the ranking sequence is skipped.
DENSE_RANK: Ranks each row in the result set. If values in the ranking column are the same, they receive the same rank. The next number in the ranking sequence is then used to rank the row or rows that follow.
NTILE: Divides the result set into the number of groups specified as an argument to the function. A group number is then assigned to each row identifying which group the row belongs to.

All ranking functions begin with the number 1 when assigning values and rank the data based on the column specified in the mandatory ORDER BY clause. As you saw in the syntax above, for each ranking function, you must include an OVER clause that contains the ORDER BY clause. In the ORDER BY clause, you identify the column on which to base the ranking. You also specify whether the rows should be sorted in ascending or descending order.

In the following SELECT statement, I use all four ranking functions to rank data based on the SalesLastYear column:

USE AdventureWorks2008;

SELECT

BusinessEntityID AS SalesID,

FirstName + ' ' + LastName AS FullName,

SalesLastYear,

ROW_NUMBER() OVER(ORDER BY SalesLastYear ASC) AS RowNumber,

RANK() OVER(ORDER BY SalesLastYear ASC) AS SalesRank,

DENSE_RANK() OVER(ORDER BY SalesLastYear ASC) AS DenseRank,

NTILE(4) OVER(ORDER BY SalesLastYear ASC) AS NTileRank

FROM

Sales.vSalesPerson;

Notice that each ranking function is used as part of a column expression in the SELECT list and that only the NTILE function takes an argument (4), which in this case identifies the number of groups into which to divide the result set.

Following the function is the OVER clause. In each case, the OVER clause includes an ORDER BY clause that specifies the SalesLastYear column. In addition, the clause indicates that the sorting order is ascending (ASC). If you want the rows to be sorted in descending order, you must use DESC. The following table shows the results returned by the SELECT statement:

SalesID	FullName	SalesLastYear	RowNumber	SalesRank	DenseRank	NTileRank
274	Stephen Jiang	0.00	1	1	1	1
284	Tete Mensa-Annan	0.00	2	1	1	1
285	Syed Abbas	0.00	3	1	1	1
287	Amy Alberts	0.00	4	1	1	1
288	Rachel Valdez	1307949.7917	5	5	2	1
283	David Campbell	1371635.3158	6	6	3	2
276	Linda Mitchell	1439156.0291	7	7	4	2
278	Garrett Vargas	1620276.8966	8	8	5	2
289	Jae Pak	1635823.3967	9	9	6	2
275	Michael Blythe	1750406.4785	10	10	7	3
279	Tsvi Reiter	1849640.9418	11	11	8	3
280	Pamela Ansman-Wolfe	1927059.178	12	12	9	3
277	Jillian Carson	1997186.2037	13	13	10	3
282	José Saraiva	2038234.6549	14	14	11	4
281	Shu Ito	2073505.9999	15	15	12	4
286	Lynn Tsoflias	2278548.9776	16	16	13	4
290	Ranjit Varkey Chudukatil	2396539.7601	17	17	14	4

Each function ranks the results based on the values in SalesLastYear column (sorted in ascending order). The ROW_NUMBER function assigns a sequential number to each row according to that order. Because the result set returns 17 rows, the function assigns the values 1 through 17 to those rows, starting with the first SalesLastYear value (0.00).

The RANK function also starts with the number 1. However, because the first four rows are the same value (0.00), each row is ranked the same. The next value in SalesLastYear (1307949.7917) is located in the fifth row of the result set, so it receives a rank of 5. The function skips 2 through 4 because 1 is used four times. In addition, because none of the remaining values are repeated, the rest of the result set is ranked sequentially from 6 through 17.

The DENSE_RANK function, like the RANK function, also assigns the number 1 to the first four rows. However, on the fifth row, the function assigns a number 2, rather than skipping 5, and then ranks the rest of the result set sequentially. Consequently, the function assigns only the numbers 1 through 14 because 1 is repeated four times.

The NTILE function works differently from the other functions. Because 4 is specified as the argument to the function, the result set is divided into four groups. That division is based on the total number of rows divided by the number in the argument. That means, in this case, the first group includes five rows and the other three groups include four. The numbers 1 through 4 are assigned to the rows based on this division.

As you’ll recall from the syntax, the OVER clause can also include a PARTITION BY clause, which identifies how to partition the data into windows. In the following example, I include a PARTITION BY clause for each rank function in order to partition the data based on the values in the TerritoryGroup column:

USE AdventureWorks2008;

SELECT

BusinessEntityID AS SalesID,

TerritoryGroup,

SalesLastYear,

ROW_NUMBER() OVER(PARTITION BY TerritoryGroup

ORDER BY SalesLastYear ASC) AS RowNumber,

RANK() OVER(PARTITION BY TerritoryGroup

ORDER BY SalesLastYear ASC) AS SalesRank,

DENSE_RANK() OVER(PARTITION BY TerritoryGroup

ORDER BY SalesLastYear ASC) AS DenseRank,

NTILE(2) OVER(PARTITION BY TerritoryGroup

ORDER BY SalesLastYear ASC) AS NTileRank

FROM

Sales.vSalesPerson;

This example is nearly identical to the preceding example except for the addition of the PARTITION BY clause to each OVER clause. (I also changed the NTILE argument from 4 to 2.) The result set is now grouped according to values in the TerritoryGroup column. The ranking functions are then applied to each group, as shown in the following example:

SalesID	TerritoryGroup	SalesLastYear	RowNumber	SalesRank	DenseRank	NTileRank
274	NULL	0.00	1	1	1	1
285	NULL	0.00	2	1	1	1
287	NULL	0.00	3	1	1	2
288	Europe	1307949.7917	1	1	1	1
289	Europe	1635823.3967	2	2	2	1
290	Europe	2396539.7601	3	3	3	2
284	North America	0.00	1	1	1	1
283	North America	1371635.3158	2	2	2	1
276	North America	1439156.0291	3	3	3	1
278	North America	1620276.8966	4	4	4	1
275	North America	1750406.4785	5	5	5	1
279	North America	1849640.9418	6	6	6	2
280	North America	1927059.178	7	7	7	2
277	North America	1997186.2037	8	8	8	2
282	North America	2038234.6549	9	9	9	2
281	North America	2073505.9999	10	10	10	2
286	Pacific	2278548.9776	1	1	1	1

Because the TerritoryGroup column contains NULL values, those values form their own group, and the ranking functions are applied to that group. So ROW_NUMER assigns the values of 1 through 3 to the NULL rows, RANK and DENSE_RANK assign a value of 1 to each of the three rows (because the SalesLastYear values are the same), and NTILE splits the three rows into two groups, with two rows in the first group and one row in the second. This process is then repeated for each group, regardless of the number of rows.

Working with Aggregate Window Functions

In addition to applying ranking functions to partitioned data, you can also apply aggregate functions. The primary difference between the two—in terms of how you implement the functions—is that you cannot use the ORDER BY clause in the OVER clause associated with an aggregate function. A more important difference, however, is in the result set. An aggregate function is applied to all rows within the partition, rather than individual rows, with regard to calculating the aggregated data.

Let’s look at an example to demonstrate how this works. In the following SELECT statement, I partition the result set by the TerritoryName column and then apply the COUNT, SUM, and AVG aggregate functions to the SalesLastYear values in each partition:

USE AdventureWorks2008;

SELECT

TerritoryGroup,

TerritoryName,

SalesLastYear,

COUNT(SalesLastYear) OVER(PARTITION BY TerritoryName) AS SalesCnt,

SUM(SalesLastYear) OVER(PARTITION BY TerritoryName) AS SalesTtl,

AVG(SalesLastYear) OVER(PARTITION BY TerritoryName) AS SalesAvg

FROM

Sales.vSalesPerson

WHERE

TerritoryGroup IS NOT NULL;

As you can see, I pass the SalesLastYear column in as an argument to the aggregate functions. The functions will then be applied to each partition, as shown in the following results:

TerritoryGroup	TerritoryName	SalesLastYear	SalesCnt	SalesTtl	SalesAvg
Pacific	Australia	2278548.9776	1	2278548.9776	2278548.9776
North America	Canada	1620276.8966	2	3658511.5515	1829255.7757
North America	Canada	2038234.6549	2	3658511.5515	1829255.7757
North America	Central	1997186.2037	1	1997186.2037	1997186.2037
Europe	France	2396539.7601	1	2396539.7601	2396539.7601
Europe	Germany	1307949.7917	1	1307949.7917	1307949.7917
North America	Northeast	1750406.4785	1	1750406.4785	1750406.4785
North America	Northwest	1927059.178	3	3298694.4938	1099564.8312
North America	Northwest	1371635.3158	3	3298694.4938	1099564.8312
North America	Northwest	0.00	3	3298694.4938	1099564.8312
North America	Southeast	1849640.9418	1	1849640.9418	1849640.9418
North America	Southwest	1439156.0291	2	3512662.029	1756331.0145
North America	Southwest	2073505.9999	2	3512662.029	1756331.0145
Europe	United Kingdom	1635823.3967	1	1635823.3967	1635823.3967

The result set is, you'll have noticed, grouped by the TerritoryName values. The aggregate functions are then applied to each row of the group, based on all the rows in the partition. For example, the Northwest territory group includes three rows. The SalesLastYear values in those rows range from 0.00 to 1927059.178. Not surprisingly, the COUNT function returns a value of 3. Notice, however, that this value appears in all three rows. The same is true for the SUM and AVG functions (3298694.4938 and 1099564.8312, respectively), with regard to the values being repeated in each row.

Although this is a lot of repetitive data, the structure of the SELECT statement itself is still much simpler than it would be if you could not take advantage of the aggregate window functions. However, I think the real strength in using window aggregate functions is when you include one of the functions in a larger column expression.

For instance, in the following example I’ve included an expression in the SELECT list that divides the SalesLastYear value by the total sales for each territory:

USE AdventureWorks2008;

SELECT

FirstName + ' ' + LastName AS FullName,

TerritoryName,

SalesLastYear,

COUNT(SalesLastYear) OVER(PARTITION BY TerritoryName) AS SalesCnt,

SUM(SalesLastYear) OVER(PARTITION BY TerritoryName) AS SalesTtl,

AVG(SalesLastYear) OVER(PARTITION BY TerritoryName) AS SalesAvg,

SalesLastYear / SUM(SalesLastYear)

OVER(PARTITION BY TerritoryName) AS SalesPct

FROM

Sales.vSalesPerson

WHERE

TerritoryGroup IS NOT NULL;

Notice that I use the SUM aggregate function to calculate SalesLastYear values based on the TerritoryName partitions. That way, I can determine what percentage of a territory’s sales an individual salesperson generates, as shown in the following results:

FullName	TerritoryName	SalesLastYear	SalesCnt	SalesTtl	SalesAvg	SalesPct
Lynn Tsoflias	Australia	2278548.9776	1	2278548.9776	2278548.9776	1.00
Garrett Vargas	Canada	1620276.8966	2	3658511.5515	1829255.7757	0.4428
José Saraiva	Canada	2038234.6549	2	3658511.5515	1829255.7757	0.5571
Jillian Carson	Central	1997186.2037	1	1997186.2037	1997186.2037	1.00
Ranjit Varkey Chudukatil	France	2396539.7601	1	2396539.7601	2396539.7601	1.00
Rachel Valdez	Germany	1307949.7917	1	1307949.7917	1307949.7917	1.00
Michael Blythe	Northeast	1750406.4785	1	1750406.4785	1750406.4785	1.00
Pamela Ansman-Wolfe	Northwest	1927059.178	3	3298694.4938	1099564.8312	0.5841
David Campbell	Northwest	1371635.3158	3	3298694.4938	1099564.8312	0.4158
Tete Mensa-Annan	Northwest	0.00	3	3298694.4938	1099564.8312	0.00
Tsvi Reiter	Southeast	1849640.9418	1	1849640.9418	1849640.9418	1.00
Linda Mitchell	Southwest	1439156.0291	2	3512662.029	1756331.0145	0.4097
Shu Ito	Southwest	2073505.9999	2	3512662.029	1756331.0145	0.5902
Jae Pak	United Kingdom	1635823.3967	1	1635823.3967	1635823.3967	1.00

As you can see, the results include a percentage of sales for each salesperson. For example, in the Southwest group, Linda Mitchell generated about 41% of the sales, and Shu Ito about 59%. On the other hand, Lynn Tsoflias in the Australia group generated 100% of the sales because she’s the only one in the group. Although the result set also shows the COUNT, SUM, and AVG totals, I’ve included them only to verify the data. Ultimately, you can create your statements to include only the calculated data you need.

Moving Forward

At the beginning of the article, I mentioned that SQL Server includes only limited support for window functions, which in this case are the ranking and aggregate functions. However, the window functions feature set, as defined in the ANSI SQL standards, is far more extensive than what SQL Server currently supports. Here’s a sample of the type of functionality that Microsoft has yet to implement:

Support for the ORDER BY clause when working with aggregate functions
Implementation of the DISTINCT clause when working with aggregate functions
Implementation of the ROWS and RANGE framing functions to help define the window frame type
Implementation of the LAG and LEAD offset functions to compare values based on relative positions
Implementation of the FIRST_VALUE and LAST_VALUE offset functions to work with ranked rows
Implementation of the WINDOW clause to reuse the window definition

A number of SQL Server developers have requested that Microsoft include these and other window function enhancements in their next release of SQL Server. For many, extending this feature set is one of their top priorities for the next implementation of Transact-SQL. Until then, we must be satisfied with the limited functionality that’s currently supported, which includes the ability to rank and aggregate partitioned data. That said, the features I’ve shown you can be useful tools when you need to return this type of data. For more information on how to implement window functions, particularly the OVER clause, check out the topic “OVER Clause (Transact-SQL)” in SQL Server Books Online.

MukaramRocks

Friday, 6 December 2013

Sql server windows functions