GROUPING SETS (and ROLLUP & CUBE)

As part of my preparation for the 70-761 exam I’ve been trying to get a full understanding of some of the T-SQL I’m less familiar with. One example of this is the GROUPING SETS clause. There are also ROLLUP and CUBE clauses but I believe these are deprecated and GROUPING SETS can do everything ROLLUP and CUBE can do but with much more control.

The best way to explain how GROUPING SETS works is to show some example code. I’ll show how it compares to ROLLUP and CUBE in case you need to update any code to convert it to using GROUPING SETS.

So lets create some test data. I’m just creating a simple table containing a few purchases made by a few customers…

DROP TABLE IF EXISTS dbo.GroupingSetsDemo
GO

CREATE TABLE dbo.GroupingSetsDemo
 (
 Id INT NOT NULL IDENTITY(1,1),
 CustomerID INT NOT NULL,
 PurchaseDate DATETIME2(0) NOT NULL,
 PurchaseItem VARCHAR(100) NOT NULL,
 Amount DECIMAL(10,2) NOT NULL,
 );
 GO

INSERT INTO dbo.GroupingSetsDemo
 (
 CustomerID,
 PurchaseDate,
 PurchaseItem,
 Amount
 )
 VALUES
 (1, '20171106 12:34', 'Rugby Ball', 14.99),
 (1, '20180307 09:54', 'Frying Pan', 21.50),
 (1, '20180307 10:02', 'Vacuum Cleaner', 259.990),
 (2, '20171106 15:26', 'Orchid', 13.62),
 (2, '20180228 18:47', 'Book', 7.99),
 (2, '20180301 08:02', 'Snakes and Ladders', 8.00),
 (2, '20180306 14:32', 'Pillows', 20.00),
 (2, '20180306 14:32', 'Washing Machine', 359.99),
 (2, '20180307 23:15', 'Jeans', 59.99),
 (3, '20180301 09:27', 'Rucksack', 13.99),
 (3, '20180301 09:27', 'Scarf', 12.50);
 GO

SELECT *
 FROM dbo.GroupingSetsDemo;
 GO

GS1

If we want to see the total value of the purchases for each customer we can just group by the CustomerID and sum up the Amount values as follows…

SELECT CustomerID, SUM(Amount) AS TotalAmount
FROM dbo.GroupingSetsDemo WFD
GROUP BY CustomerID

GS2

As well a grouping by columns we can group by what are called GROUPING SETS. This example just shows how we can use GROUPING SETS to mimic the previous query to get total sales by customer…

SELECT CustomerID, SUM(Amount) AS TotalAmount
 FROM dbo.GroupingSetsDemo WFD
 GROUP BY
 GROUPING SETS ((CustomerID));

GS2

So in this case we’ve just said to group by a single grouping set for CustomerID.

The really useful stuff starts to happen when we have more than one grouping set. One useful example is where you want to create a total value in the same data set. To do this we just need to add an additional grouping set telling SQL Server to also group over all columns to give us the total amount for all customers…

SELECT CustomerID, SUM(Amount) AS TotalAmount
FROM dbo.GroupingSetsDemo WFD
GROUP BY 
 GROUPING SETS ((CustomerID), ());

GS3

Please note, that you still need to make sure all columns in the SELECT clause appear in at least one grouping set. So we can’t just include the total group set from the above query. If we try, we get the following error as expected…

SELECT CustomerID, SUM(Amount) AS TotalAmount
FROM dbo.GroupingSetsDemo WFD
GROUP BY 
 GROUPING SETS (());
Msg 8120, Level 16, State 1, Line 56
 Column 'dbo.GroupingSetsDemo.CustomerID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

More interesting things start to happen when we want to group over multiple columns. Let’s bring the purchase year into our query as follows..

SELECT CustomerID, YEAR(WFD.PurchaseDate) AS PurchaseYear, SUM(Amount) AS TotalAmount
FROM dbo.GroupingSetsDemo WFD
GROUP BY 
 GROUPING SETS ((CustomerID, YEAR(WFD.PurchaseDate)));

GS4

We can now add more grouping sets to this query, for example, to give total amount for each customer…

SELECT CustomerID, YEAR(WFD.PurchaseDate) AS PurchaseYear, SUM(Amount) AS TotalAmount
FROM dbo.GroupingSetsDemo WFD
GROUP BY 
 GROUPING SETS (
    (CustomerID, YEAR(WFD.PurchaseDate)),
    (CustomerID)
 );

GS5

… or for each year…

SELECT CustomerID, YEAR(WFD.PurchaseDate) AS PurchaseYear, SUM(Amount) AS TotalAmount
FROM dbo.GroupingSetsDemo WFD
GROUP BY 
 GROUPING SETS (
    (CustomerID, YEAR(WFD.PurchaseDate)),
    (YEAR(WFD.PurchaseDate))
 );

GS6

We can also combine everything to give us all possible total values

SELECT CustomerID, YEAR(WFD.PurchaseDate) AS PurchaseYear, SUM(Amount) AS TotalAmount
FROM dbo.GroupingSetsDemo WFD
GROUP BY 
 GROUPING SETS (
    (CustomerID, YEAR(WFD.PurchaseDate)),
    (CustomerID),
    (YEAR(WFD.PurchaseDate)),
    ()
 );

GS7

As we can see in the previous example, whenever we are adding a grouping set and getting a total row appear, one or more of the values returned can be NULL.  However, what if one of the values being grouped on is NULL.  How do we differential between the original NULL and the NULL created as part of the grouping set.

In the following example one of the PurchaseItem values is NULL….

DROP TABLE IF EXISTS dbo.GroupingSetsDemoNULL

CREATE TABLE dbo.GroupingSetsDemoNULL
(
  Id INT NOT NULL IDENTITY(1,1),
  CustomerID INT NOT NULL,
  PurchaseItem VARCHAR(100) NULL,
  Amount DECIMAL(10,2) NOT NULL,
);
GO

INSERT INTO dbo.GroupingSetsDemoNULL
(
  CustomerID,
  PurchaseItem,
  Amount
)
VALUES
  (1, 'Rugby Ball', 10),
  (1, 'Frying Pan', 20),
  (1, NULL, 30);
GO

SELECT *
FROM dbo.GroupingSetsDemoNULL;

GS11

Let’s say we want to give a total per customer and per item, and a grand total.  We need to create three grouping sets as follows…

SELECT CustomerID, PurchaseItem, SUM(Amount) AS TotalAmount
FROM dbo.GroupingSetsDemoNULL WFD
GROUP BY 
  GROUPING SETS (
    (CustomerID),
    (PurchaseItem),
    ()
  );

GS12

As we can see, we’ve got two rows where the CustomerID and PurchaseItems are NULL.  How do we know which one of these is the aggregation of the NULL PurchaseItem acorss all customer and which one is the grand total?  In this example looking at the amount we can see the that row 4 with the value of 60 is the grand total but we can also use the GROUPING_ID  to show this explicitly.

The GROUPING_ID function is a scalar function that returns a BIT.  We pass in a particular column name and this returns a flag saying if each row is part of the aggregation for that column or not.

In this example we’ve got two columns we are aggregating over so let’s use the GROUPING_ID function on both of these as follows…

SELECT CustomerID, PurchaseItem, SUM(Amount) AS TotalAmount,
     GROUPING_ID(CustomerID) AS CustomerIDGrouped, GROUPING_ID(PurchaseItem) AS PurchaseItemGrouped
FROM dbo.GroupingSetsDemoNULL WFD
GROUP BY 
  GROUPING SETS (
    (CustomerID),
    (PurchaseItem),
    ()
  );

GS13

As we can see row 1 is grouped on the CustomerID only and is the aggregation of the NULL PurchaseItem for all customers and row 4 is the grand total aggregated across all customers and items.

As mentioned above, the only requirement is that each non-aggregated column appears in at least one grouping set so we can return just the total for each customer and for each year as follows…

SELECT CustomerID, YEAR(WFD.PurchaseDate) AS PurchaseYear, SUM(Amount) AS TotalAmount
FROM dbo.GroupingSetsDemo WFD
GROUP BY 
 GROUPING SETS (
    (CustomerID),
    (YEAR(WFD.PurchaseDate))
 );

GS8

So what about ROLLUP and CUBE (but please remember these are deprecated)?

CUBE just means that the query is grouped by every possible grouping set, i.e. every combination of non-aggregated columns are used as the grouping sets. The syntax for using CUBE is as follows…

SELECT CustomerID, YEAR(WFD.PurchaseDate) AS PurchaseYear, SUM(Amount) AS TotalAmount
FROM dbo.GroupingSetsDemo WFD
GROUP BY YEAR(WFD.PurchaseDate), WFD.CustomerID
WITH CUBE
ORDER BY WFD.CustomerID, PurchaseYear

GS

This is the same as using the following grouping sets…

SELECT CustomerID, YEAR(WFD.PurchaseDate) AS PurchaseYear, SUM(Amount) AS TotalAmount
FROM dbo.GroupingSetsDemo WFD
GROUP BY
 GROUPING SETS (
    (YEAR(WFD.PurchaseDate), WFD.CustomerID),
    (WFD.CustomerID),
    (YEAR(WFD.PurchaseDate)),
    ()
 )
ORDER BY WFD.CustomerID, PurchaseYear

ROLLUP is slightly different and it depends on the order of the columns in the GROUP BY clause. It works by creating a grouping set across all columns in the GROUP BY clause, then creating additional sets across all columns except the last and each time dropping the last column off the end until it’s just left with the empty group. So if we’re grouping over three columns it will create a set for (1,2,3) then for (1,2) then for (1) and finally for (). Here’s an example…

SELECT CustomerID, YEAR(WFD.PurchaseDate) AS PurchaseYear, SUM(Amount) AS TotalAmount
FROM dbo.GroupingSetsDemo WFD
GROUP BY WFD.CustomerID, YEAR(WFD.PurchaseDate)
WITH ROLLUP

GS9

So this is grouping over both customer and year, then by just customer and finally over all columns. This is the same as creating the following grouping sets…

SELECT CustomerID, YEAR(WFD.PurchaseDate) AS PurchaseYear, SUM(Amount) AS TotalAmount
FROM dbo.GroupingSetsDemo WFD
GROUP BY
 GROUPING SETS (
    (CustomerID, YEAR(WFD.PurchaseDate)),
    (CustomerID),
    ()
 )

As mentioned, the order of the columns in the GROUP BY clause makes a difference and switching them from WFD.CustomerID, YEAR(WFD.PurchaseDate) to YEAR(WFD.PurchaseDate), WFD.CustomerID gives different results…

SELECT CustomerID, YEAR(WFD.PurchaseDate) AS PurchaseYear, SUM(Amount) AS TotalAmount
FROM dbo.GroupingSetsDemo WFD
GROUP BY YEAR(WFD.PurchaseDate), WFD.CustomerID
WITH ROLLUP

GS10

This time we’re grouping by year and customer, then by year, and finally by all columns.