70-761 Exam Prep: Miscellaneous T-SQL – Part 1

I’m currently preparing to take the exam 70-761 – Querying Data with Transact-SQL and have been going through various resources including Itzik Ben-Gan’s exam guidebook which is certainly worth reading if you’re planning on taking this exam.

There are quite a few smaller bits of T-SQL functionality that I’m not that familiar with that I’m going to demonstrate in this and further posts. It’s a bit of a random selection but hopefully it will help me remember them and might be of use to others as well.

Contents

  1. ALL ANY SOME with subqueries
  2. TOP (x) WITH TIES
  3. Truncate partition
  4. TRY_CAST, TRY_CONVERT, TRY_PARSE

1. ALL ANY SOME with subqueries

This is something that I’ve never seen anyone use anywhere but I’ve had questions come up on this in two different exams!

We use the ALL, ANY and SOME keywords when comparing a value against the rows returned by a subquery.  For example, if we have two tables, the first containing the numbers 1 to 5 and the second containing 3 to 5, we can use ALL to return all rows from the first table that are less than ALL the values in the second table.  The only values that are less than all the values in the second table are 1 and 2.  We can also use ANY or SOME (they both mean the same thing) to return all numbers that are less than any of the numbers in the second table.  1 and 2 are less than all of them so they get returned, 3 is less than 4 and 5 so that gets returned, and 4 is less than 5 so that gets returned.

Here’s the code, so it should make more sense looking at this…

DROP TABLE IF EXISTS dbo.AllAnySome;
GO

CREATE TABLE dbo.AllAnySome (SomeNumber INT);
GO

INSERT INTO dbo.AllAnySome (SomeNumber)
VALUES (1), (2), (3), (4), (5);
GO

DROP TABLE IF EXISTS dbo.AllAnySomeLookup;
GO

CREATE TABLE dbo.AllAnySomeLookup (SomeNumber INT);
GO

INSERT INTO dbo.AllAnySomeLookup (SomeNumber)
VALUES (3), (4), (5);
GO

SELECT *
FROM dbo.AllAnySome
WHERE SomeNumber < ALL (SELECT * FROM dbo.AllAnySomeLookup);
GO

AAS1

SELECT *
FROM dbo.AllAnySome
WHERE SomeNumber < SOME (SELECT * FROM dbo.AllAnySomeLookup);
GO

AAS2

SELECT *
FROM dbo.AllAnySome
WHERE SomeNumber < ANY (SELECT * FROM dbo.AllAnySomeLookup);
GO

AAS2

The query above using ALL returns all the rows from dbo.AllAnySome that are less than the minimum value in dbo.AllAnySomeLookup, so it’s equivalent to…

SELECT *
FROM dbo.AllAnySome
WHERE SomeNumber < (SELECT MIN(SomeNumber) FROM dbo.AllAnySomeLookup);
GO

We can also do the following…

SELECT *
FROM dbo.AllAnySome
WHERE SomeNumber = ALL (SELECT * FROM dbo.AllAnySomeLookup);
GO

AAS3

This query actually returns no rows: a value only satisfies = ALL if it equals every value returned by the subquery, and nothing in dbo.AllAnySome equals 3, 4 and 5 at the same time.  Using = ANY or = SOME instead is equivalent to using the IN keyword…

SELECT *
FROM dbo.AllAnySome
WHERE SomeNumber IN (SELECT * FROM dbo.AllAnySomeLookup);
GO

AAS3

This returns the rows with values 3, 4 and 5, which is exactly what = ANY or = SOME would give us, but not what = ALL gives.
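Just to make the equivalence explicit, here’s a quick sketch using = ANY against the same tables; = SOME behaves identically and both return the same three rows as the IN query…

SELECT *
FROM dbo.AllAnySome
WHERE SomeNumber = ANY (SELECT SomeNumber FROM dbo.AllAnySomeLookup);
GO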

2. TOP (x) WITH TIES

You can specify TOP (x) immediately after the SELECT keyword to restrict the result set to x rows.  Here’s an example where we’ve got a table with 12 rows and we’re just asking for the top 10…

DROP TABLE IF EXISTS dbo.TiesTest;

CREATE TABLE dbo.TiesTest (Id INT);

INSERT INTO dbo.TiesTest (Id)
VALUES (1), (1), (1),
 (2), (2), (2),
 (3), (3), (3),
 (4), (4), (4);

SELECT TOP (10) *
FROM dbo.TiesTest
ORDER BY Id;

TopTies1

As we can see this query has returned all the rows with values 1, 2 and 3 but only one of the three rows with value 4.  If we wanted the TOP 10 rows but also wanted to include any other rows outside of the top 10 with the same values we can use the WITH TIES clause, meaning any subsequent rows where the value ties with the last value will be included.

SELECT TOP (10) WITH TIES *
FROM dbo.TiesTest
ORDER BY Id

TopTies2

As we can see we’ve now got 12 rows returned instead of the 10 as the values in rows 11 and 12 tie with the value in row 10.

One last thing to note is we must use ORDER BY if we’re using WITH TIES.
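As a quick sketch (error message paraphrased from memory rather than copied from SSMS), running WITH TIES without an ORDER BY fails because there’s nothing to define what a tie is…

SELECT TOP (10) WITH TIES *
FROM dbo.TiesTest;
-- Fails: the TOP ... WITH TIES clause is not allowed without a corresponding ORDER BY clause.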

3. TRUNCATE PARTITION

In SQL Server 2016 we now have the ability to truncate a single partition.  Prior to this we could only truncate the entire table.  Here’s an example of how to truncate a single partition…

CREATE PARTITION FUNCTION PF1 (DATETIME2(0))
AS RANGE RIGHT FOR VALUES ('20171001', '20171101', '20171201', '20180101', 
 '20180201', '20180301', '20180401');
GO

CREATE PARTITION SCHEME PS1
AS PARTITION PF1 ALL TO ([PRIMARY]);
GO

CREATE TABLE dbo.PartitionedTable
(
 Id INT IDENTITY(1,1) NOT NULL,
 FirstName VARCHAR(100) NOT NULL,
 LastName VARCHAR(100) NOT NULL,
 DateOfBirth DATETIME2(0) NOT NULL,
 Address VARCHAR(400) NOT NULL,
 RegisteredDate DATETIME2(0) NOT NULL
) ON PS1(RegisteredDate);

INSERT INTO dbo.PartitionedTable (FirstName, LastName, DateOfBirth, Address, RegisteredDate)
VALUES
('Angela', 'Bennett', '1985-06-03', '1 The Road, This Town, AA1 2BB', '20171015'),
('Colin', 'Denton', '1952-12-05', '12 The Road, This Town, AA1 2BB', '20171125'),
('Eleanor', 'Francis', '1988-07-23', '9 The Road, This Town, AA1 2BB', '20171214'),
('Gerald', 'Harris', '1985-07-25', '102 The Road, This Town, AA1 2BB', '20180109'),
('Iris', 'Jones', '1947-10-18', '93 The Road, This Town, AA1 2BB', '20180206'),
('Kirk', 'Lennox', '2001-04-19', '2 The Road, This Town, AA1 2BB', '20180328');


SELECT *
FROM dbo.PartitionedTable;

TruncPart1

So now we’ve got a partitioned table with a single row on each of several partitions.

SELECT *
FROM sys.partitions
WHERE object_id = OBJECT_ID('dbo.PartitionedTable');

TruncPart2

As we can see there is a row on partitions 2 to 7.
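If we want to double-check which partition each row sits on before truncating anything, the $PARTITION function can be run against our partition function PF1 — a quick sketch…

SELECT *, $PARTITION.PF1(RegisteredDate) AS PartitionNumber
FROM dbo.PartitionedTable;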

We can now truncate partitions 2, 3 and 4, which contain the rows for Angela Bennett, Colin Denton and Eleanor Francis, as follows…

TRUNCATE TABLE dbo.PartitionedTable WITH (PARTITIONS (2 TO 4));

We can now see that the data has been removed from the table and the partitions now have no rows…

SELECT *
FROM dbo.PartitionedTable;

TruncPart3

SELECT *
FROM sys.partitions
WHERE object_id = OBJECT_ID('dbo.PartitionedTable')

TruncPart4

4. TRY_CAST, TRY_CONVERT, TRY_PARSE

Let’s say, for example, that we have a fairly simple process that loads data from one table to another.  We get a feed of the source data into a table from a third party and it contains date values stored as text strings.  Our target table needs to store the date as a DATE data type rather than a string so we explicitly convert it into a DATE when loading into the target.  We’re just using an insert statement to load the data but the data from the supplier is sometimes invalid and causes the whole batch to fail.  We want a way to allow the valid data to go through (and possibly store the failed data somewhere else).  Let’s set up some test data to demo this…

CREATE TABLE T_Source
(
 ItsADate VARCHAR(20)
);

INSERT INTO T_Source (ItsADate)
VALUES 
('20180203'),
('20180204'),
('ThisIsntADate'),
('20180205');

SELECT *
FROM T_Source;

misc1

As we can see the third row does not contain a valid date.

Let’s now create our target table and try and load the source data in…

CREATE TABLE T_Target
(
 DefinitelyADate DATE
);

INSERT INTO dbo.T_Target (DefinitelyADate)
SELECT CAST(ItsADate AS DATE)
FROM dbo.T_Source;

Msg 241, Level 16, State 1, Line 32
Conversion failed when converting date and/or time from character string.

So because of the one invalid row, the other valid rows do not get loaded.  Instead of CAST we can use TRY_CAST, which returns NULL if the value can’t be cast rather than raising an error.

SELECT TRY_CAST(ItsADate AS DATE)
FROM dbo.T_Source

misc2

We probably wouldn’t want to load NULL in but we could do some filtering or maybe just use ISDATE() = 1.
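For example, here’s a sketch of a filtered load that only takes the rows where the conversion succeeds (the failed rows could be diverted to an error table in a similar way)…

INSERT INTO dbo.T_Target (DefinitelyADate)
SELECT CAST(ItsADate AS DATE)
FROM dbo.T_Source
WHERE TRY_CAST(ItsADate AS DATE) IS NOT NULL;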

CONVERT and PARSE have “TRY” equivalents…

SELECT TRY_CONVERT(DATE, 'simon')
SELECT TRY_PARSE('99 March 2018' AS DATE USING 'en-gb')

These two queries just return NULL without throwing an error.
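Just to show the success path as well, here’s a sketch where the same functions are given valid input and return proper dates rather than NULL…

SELECT TRY_CONVERT(DATE, '20180313');                     -- returns 2018-03-13
SELECT TRY_PARSE('13 March 2018' AS DATE USING 'en-gb');  -- returns 2018-03-13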

 


Window Functions – PART 3: Offset Functions

As part of my preparation for the 70-761 exam I’ve been trying to get a full understanding of the different types of window functions which are as follows:

  1. Ranking
  2. Aggregate
  3. Offset

I know the ranking functions pretty well but haven’t used the aggregate and offset functions at all really.

This third and final post demonstrates the offset functions.

I’ve been going through the official exam guidebook by Itzik Ben-Gan and he explains how the window functions work very well and the whole book is certainly worth reading if you use T-SQL at all.

With every window function you use the OVER clause. This is the part that specifies the “window” you are ranking or aggregating over.

The Offset Functions

For the demonstrations I’m going to use the same data set I used in the ranking function demo and create a table with 11 rows that contain purchases for 3 customers as follows…

CREATE TABLE dbo.WindowFunctionDemo
(
   Id INT NOT NULL IDENTITY(1,1),
   CustomerID INT NOT NULL,
   PurchaseDate DATETIME2(0) NOT NULL,
   PurchaseItem VARCHAR(100) NOT NULL,
   Amount DECIMAL(10,2) NOT NULL,
);
GO

INSERT INTO dbo.WindowFunctionDemo
(
   CustomerID,
   PurchaseDate,
   PurchaseItem,
   Amount
)
VALUES
 (1, '20180306 12:34', 'Rugby Ball', 14.99),
 (1, '20180307 09:54', 'Frying Pan', 21.50),
 (1, '20180307 10:02', 'Vacuum Cleaner', 259.990),
 (2, '20180225 15:26', 'Orchid', 13.62),
 (2, '20180228 18:47', 'Book', 7.99),
 (2, '20180301 08:02', 'Snakes and Ladders', 8.00),
 (2, '20180306 14:32', 'Pillows', 20.00),
 (2, '20180306 14:32', 'Washing Machine', 359.99),
 (2, '20180307 23:15', 'Jeans', 59.99),
 (3, '20180301 09:27', 'Rucksack', 13.99),
 (3, '20180301 09:27', 'Scarf', 12.50);
GO

SELECT *
FROM dbo.WindowFunctionDemo;
GO

Windows Ranking 1

The offset window functions allow you to return data from other rows based on their distance away from the current row.  For example, we might want to return all of the purchases made today but in the same row return the value of the previous purchase that the customer made.

The LAG and LEAD window functions allow you to do this: LAG allows you to access previous rows and LEAD allows you to access following rows.

The data set must be ordered so the functions can find previous and following rows correctly.

The following example returns all the data in the table along with the amount of the previous and next purchases each customer has made…

SELECT *,
 LAG(Amount) OVER (PARTITION BY CustomerID ORDER BY PurchaseDate) AS PreviousAmount,
 LEAD(Amount) OVER (PARTITION BY CustomerID ORDER BY PurchaseDate) AS NextAmount
FROM dbo.WindowFunctionDemo;
GO

LAG LEAD 1

You can pass a second parameter into the LAG and LEAD functions to access rows further away than the ones immediately either side of the current row.  This example returns the Amount value from the previous order and from the order before that…

SELECT *,
   LAG(Amount) OVER (PARTITION BY CustomerID ORDER BY PurchaseDate) AS PreviousAmount,
   LAG(Amount, 2) OVER (PARTITION BY CustomerID ORDER BY PurchaseDate) AS SecondPreviousAmount
FROM dbo.WindowFunctionDemo;
GO

LAG LEAD 2

If not supplied, a default value of 1 is used as this parameter so

LAG(Amount)

is the same as

LAG(Amount,1)

There is also a third parameter that can be passed in that will convert NULLs to a default value…

SELECT *,
   LAG(Amount, 1, -100) OVER (PARTITION BY CustomerID ORDER BY PurchaseDate) AS PreviousAmount,
   LAG(Amount, 2, -200) OVER (PARTITION BY CustomerID ORDER BY PurchaseDate) AS SecondPreviousAmount
FROM dbo.WindowFunctionDemo;
GO

LAG LEAD 3

There are also FIRST_VALUE and LAST_VALUE offset functions that return the first and last values in the window frame.

This example returns the Amount value of the first purchase for each customer…

SELECT *,
 FIRST_VALUE(Amount) OVER (PARTITION BY CustomerID ORDER BY PurchaseDate) AS FirstPurchaseAmount
FROM dbo.WindowFunctionDemo;
GO

LAG LEAD 4

We can do something similar to find the last value in the window frame… (however there is something you need to be careful with here!!!)

SELECT *,
   LAST_VALUE(Amount) OVER (PARTITION BY CustomerID ORDER BY PurchaseDate) AS LatestPurchaseAmount
FROM dbo.WindowFunctionDemo;
GO

LAG LEAD 5

As you can see this is not showing the results we might expect.  We might expect to see 259.99 in the LatestPurchaseAmount for all rows for CustomerID = 1.  If you read my post on the aggregate window functions you might remember that the default window frame is…

RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

This means that for each row the window frame, i.e. the data available to the offset function, is all previous rows up to and including the current row.  As we’re ordering by PurchaseDate, the window frame for the first purchase by customer 1 of the Rugby Ball on 2018-03-06 12:34:00 only contains that single row.  The window frame is restricting access to the rows for later purchases.

All we need to do to get the required result is make sure we’re using the correct window frame, i.e. make sure the frame includes the latest purchase for each customer.  We can do this as follows…

SELECT *,
   LAST_VALUE(Amount) OVER (
                     PARTITION BY CustomerID 
                     ORDER BY PurchaseDate
                     ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
 ) AS LatestPurchaseAmount
FROM dbo.WindowFunctionDemo;
GO

LAG LEAD 6

We could also use ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING which spreads the window frame across the whole partition.
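Here’s a sketch of that whole-partition version for comparison; it should give the same LatestPurchaseAmount as the corrected query above because every row’s frame now spans its entire partition…

SELECT *,
   LAST_VALUE(Amount) OVER (
                     PARTITION BY CustomerID
                     ORDER BY PurchaseDate
                     ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
 ) AS LatestPurchaseAmount
FROM dbo.WindowFunctionDemo;
GO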

Window Functions – PART 2: Aggregate Functions

As part of my preparation for the 70-761 exam I’ve been trying to get a full understanding of the different types of window functions which are as follows:

  1. Ranking
  2. Aggregate
  3. Offset

I know the ranking functions pretty well but haven’t used the aggregate and offset functions at all really.

This second post demonstrates the aggregate functions.

I’ve been going through the official exam guidebook by Itzik Ben-Gan and he explains how the window functions work very well and the whole book is certainly worth reading if you use T-SQL at all.

With every window function you use the OVER clause. This is the part that specifies the “window” you are ranking or aggregating over.

The Aggregate Functions

For the demonstrations I’m going to use the same data set I used in the ranking function demo and create a table with 11 rows that contain purchases for 3 customers as follows…

CREATE TABLE dbo.WindowFunctionDemo
(
   Id INT NOT NULL IDENTITY(1,1),
   CustomerID INT NOT NULL,
   PurchaseDate DATETIME2(0) NOT NULL,
   PurchaseItem VARCHAR(100) NOT NULL,
   Amount DECIMAL(10,2) NOT NULL,
);
GO

INSERT INTO dbo.WindowFunctionDemo
(
   CustomerID,
   PurchaseDate,
   PurchaseItem,
   Amount
)
VALUES
 (1, '20180306 12:34', 'Rugby Ball', 14.99),
 (1, '20180307 09:54', 'Frying Pan', 21.50),
 (1, '20180307 10:02', 'Vacuum Cleaner', 259.990),
 (2, '20180225 15:26', 'Orchid', 13.62),
 (2, '20180228 18:47', 'Book', 7.99),
 (2, '20180301 08:02', 'Snakes and Ladders', 8.00),
 (2, '20180306 14:32', 'Pillows', 20.00),
 (2, '20180306 14:32', 'Washing Machine', 359.99),
 (2, '20180307 23:15', 'Jeans', 59.99),
 (3, '20180301 09:27', 'Rucksack', 13.99),
 (3, '20180301 09:27', 'Scarf', 12.50);
GO

SELECT *
FROM dbo.WindowFunctionDemo;
GO

Windows Ranking 1

So what if we now wanted to see what percentage each purchase is of the total amount? Firstly, we’d need to sum all the amounts, so let’s just add that to our column list as follows…

SELECT *, SUM(Amount) AS TotalSales
FROM dbo.WindowFunctionDemo;
GO

Msg 8120, Level 16, State 1, Line 42
Column 'dbo.WindowFunctionDemo.Id' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

As I’m sure you’re already aware, we can’t add an aggregate function without a GROUP BY clause if non-aggregated columns also exist.

We could remove all the other columns from the SELECT list and just return the total amount and perhaps store that value in a variable…

SELECT SUM(Amount) AS TotalSales
FROM dbo.WindowFunctionDemo;
GO

Windows Ranking 10
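Here’s a quick sketch of the variable approach mentioned above — grab the grand total once and then use it in a second query (the variable name is just for illustration)…

DECLARE @TotalSales DECIMAL(10,2);

SELECT @TotalSales = SUM(Amount)
FROM dbo.WindowFunctionDemo;

SELECT *, CAST(Amount / @TotalSales * 100 AS DECIMAL(10,2)) AS GrandTotalPercent
FROM dbo.WindowFunctionDemo;
GO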

We can include some columns along with the aggregate: for example, grouping by CustomerID gives the total per customer, but without something like GROUPING SETS or ROLLUP (sketched after the results below) we can’t also see the grand total of all purchases in this query.

SELECT CustomerID, SUM(Amount) AS TotalSales
FROM dbo.WindowFunctionDemo
GROUP BY CustomerID
GO

Windows Ranking 11
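As an aside, here’s a sketch of the ROLLUP approach mentioned above; it adds an extra row with a NULL CustomerID holding the grand total, but it still doesn’t put the grand total against every detail row…

SELECT CustomerID, SUM(Amount) AS TotalSales
FROM dbo.WindowFunctionDemo
GROUP BY ROLLUP (CustomerID);
GO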

However, the OVER clause does allow us to do this. The great thing about this is it allows us to return all of the detail available in all the other columns along with the aggregated amount. With this query we can see the total per customer included as an additional column in the full data set…

SELECT
   *, 
   SUM(Amount) OVER (PARTITION BY CustomerID) AS TotalSalesByCustomer
FROM dbo.WindowFunctionDemo;
GO

Windows Ranking 12

Unlike the ranking functions the ORDER BY clause isn’t compulsory and you can see we’ve just partitioned by the CustomerID to give the sum per customer. In fact we don’t even need to do this and can aggregate over all rows by not supplying anything in the brackets as follows…

SELECT
   *,
   SUM(Amount) OVER (PARTITION BY CustomerID) AS TotalSalesByCustomer,
   SUM(Amount) OVER () AS GrandTotal
FROM dbo.WindowFunctionDemo;
GO

Windows Ranking 13

So now we have a way of getting the grand total at the row level, we can use it to calculate the percentage of the total amount for each purchase as follows…

SELECT
   *,
   SUM(Amount) OVER (PARTITION BY CustomerID) AS TotalSalesByCustomer,
   SUM(Amount) OVER () AS GrandTotal,
   CAST(Amount / SUM(Amount) OVER () * 100 AS DECIMAL(10,2)) AS GrandTotalPercent
FROM dbo.WindowFunctionDemo;
GO

Windows Ranking 14

As well as optionally including the ORDER BY and/or PARTITION BY clauses we can include something called a window frame. This allows us to specify a frame of rows or values, relative to the current row, that the aggregation is taken over. It can only be used if the data set is ordered, i.e. the ORDER BY clause must be used.

We can use this to do things like generate running totals as follows…

SELECT
   *,
   SUM(Amount) OVER (ORDER BY Amount 
                     ROWS BETWEEN UNBOUNDED PRECEDING 
                     AND CURRENT ROW) AS RunningTotal
FROM dbo.WindowFunctionDemo;
GO

Windows Ranking 15

In this example we’ve ordered the data by the Amount and for each row we’re saying to sum the Amount values from the first row (UNBOUNDED PRECEDING) to the CURRENT ROW ordered by the Amount. For example, for the row for the “Scarf” purchase we’re saying to sum the Amount for the “Book”, “Snakes and Ladders”, and “Scarf”.

As well as UNBOUNDED PRECEDING meaning the first row we can also use UNBOUNDED FOLLOWING to mean the last row. For example…

SELECT
   *,
   SUM(Amount) OVER (ORDER BY Amount 
                     ROWS BETWEEN CURRENT ROW 
                     AND UNBOUNDED FOLLOWING) AS RunningTotal
FROM dbo.WindowFunctionDemo
ORDER BY Amount;
GO

Windows Ranking 17

This has basically just reversed how the running total is calculated.

We can set the window frame to include all rows as follows which is the same as giving the grand total over all rows.

SELECT
   *,
   SUM(Amount) OVER (ORDER BY Amount 
                     ROWS BETWEEN UNBOUNDED PRECEDING 
                     AND UNBOUNDED FOLLOWING) AS RunningTotal
FROM dbo.WindowFunctionDemo;
GO

Windows Ranking 16

As well as using UNBOUNDED to mean the ends of the data set we can specify the number of rows the window frame should cover. This next example shows the average of each purchase and the previous two purchases for that customer, ordered by PurchaseDate…

SELECT
   *,
   AVG(Amount) OVER (PARTITION BY CustomerID 
                     ORDER BY PurchaseDate 
                     ROWS BETWEEN 2 PRECEDING 
                     AND CURRENT ROW) AS RunningTotal
FROM dbo.WindowFunctionDemo
ORDER BY PurchaseDate;
GO

Windows Ranking 18

There is also an option to have the window frame cover the values in the rows rather than the rows themselves. To do this we just replace ROWS with RANGE. In order to show the difference I’m going to insert two new rows with an Amount of 7.99, meaning we’ve now got 3 rows with the same Amount…

INSERT INTO dbo.WindowFunctionDemo (CustomerID, PurchaseDate, PurchaseItem, Amount)
VALUES (2, '20180313 18:41', 'The BFG', 7.99),
(2, '20180313 18:41', 'Matilda', 7.99);

If we run our Running Total query from before we can still see it works as expected…

SELECT
   *,
   SUM(Amount) OVER (ORDER BY Amount
                     ROWS BETWEEN UNBOUNDED PRECEDING
                     AND CURRENT ROW) AS RunningTotal
FROM dbo.WindowFunctionDemo;
GO

Windows Ranking 19

However, if we use RANGE instead of ROWS we can see the difference…

SELECT
   *,
   SUM(Amount) OVER (ORDER BY Amount 
                     RANGE BETWEEN UNBOUNDED PRECEDING
                     AND CURRENT ROW) AS RunningTotal
FROM dbo.WindowFunctionDemo;
GO

Windows Ranking 20

Instead of treating each of the three rows separately it’s now treating each of the distinct values as the “current row” and all three get grouped together.

When using an aggregate window function with an ORDER BY clause the default window frame is actually:

RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

For example the following returns the same as the previous query…

SELECT
   *,
   SUM(Amount) OVER (ORDER BY Amount) AS RunningTotal
FROM dbo.WindowFunctionDemo;
GO

Windows Ranking 21

If you’re getting unexpected results in your aggregations it might be because you need to specify ROWS in your OVER clause instead of the default RANGE.

Window Functions – PART 1: Ranking Functions

As part of my preparation for the 70-761 exam I’ve been trying to get a full understanding of the different types of window functions which are as follows:

  1. Ranking
  2. Aggregate
  3. Offset

I know the ranking functions pretty well but haven’t used the aggregate and offset functions at all really.

This first post is just going to give a description of how the ranking functions work and I’ll try and go into more detail on the other two in further posts.

I’ve been going through the official exam guidebook by Itzik Ben-Gan and he explains how the window functions work very well and the whole book is certainly worth reading if you use T-SQL at all.

With every window function you use the OVER clause. This is the part that specifies the “window” you are ranking or aggregating over.

The Ranking Functions

There are four ranking functions and to demonstrate how they work I’m going to create a table with 11 rows in that contain purchases for 3 customers as follows…

CREATE TABLE dbo.WindowFunctionDemo
(
   Id INT NOT NULL IDENTITY(1,1),
   CustomerID INT NOT NULL,
   PurchaseDate DATETIME2(0) NOT NULL,
   PurchaseItem VARCHAR(100) NOT NULL,
   Amount DECIMAL(10,2) NOT NULL,
);
GO

INSERT INTO dbo.WindowFunctionDemo
(
   CustomerID,
   PurchaseDate,
   PurchaseItem,
   Amount
)
VALUES
(1, '20180306 12:34', 'Rugby Ball', 14.99),
(1, '20180307 09:54', 'Frying Pan', 21.50),
(1, '20180307 10:02', 'Vacuum Cleaner', 259.990),
(2, '20180225 15:26', 'Orchid', 13.62),
(2, '20180228 18:47', 'Book', 7.99),
(2, '20180301 08:02', 'Snakes and Ladders', 8.00),
(2, '20180306 14:32', 'Pillows', 20.00),
(2, '20180306 14:32', 'Washing Machine', 359.99),
(2, '20180307 23:15', 'Jeans', 59.99),
(3, '20180301 09:27', 'Rucksack', 13.99),
(3, '20180301 09:27', 'Scarf', 12.50);
GO

SELECT *
FROM dbo.WindowFunctionDemo;
GO

Windows Ranking 1

All the ranking functions require an ORDER BY clause as part of the OVER clause: this is because the rows must have an order to allow a ranking to be applied.
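As a quick sketch (error text paraphrased), leaving the ORDER BY out of the OVER clause isn’t allowed for a ranking function…

SELECT ROW_NUMBER() OVER () AS RowNumber, *
FROM dbo.WindowFunctionDemo;
-- Fails: the ROW_NUMBER function must have an OVER clause with ORDER BY.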

The first one I’m going to show is the ROW_NUMBER function that just returns a sequential BIGINT number in the order specified in the OVER clause.

This first example just orders by the Id column so we end up with the ROW_NUMBER function returning exactly the same value that is in the Id column.

SELECT ROW_NUMBER() OVER (ORDER BY Id) AS RowNumber, *
FROM dbo.WindowFunctionDemo;
GO

Windows Ranking 2

We can obviously order by any column or columns: this example orders by Amount descending…

SELECT ROW_NUMBER() OVER (ORDER BY Amount DESC) AS RowNumber, *
FROM dbo.WindowFunctionDemo;
GO

Windows Ranking 3

We can also include a PARTITION BY clause to the OVER clause which means that the ROW_NUMBER function is applied across each of the partitions independently. For example, we can add a partition on the CustomerID column to the statement above so it now ranks each customer by the amount as follows…

SELECT ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY Amount DESC) AS RowNumber, *
FROM dbo.WindowFunctionDemo;
GO

Windows Ranking 4

We can see from the data set above that the row with a RowNumber of 1 is the most expensive purchase each customer has made. So what if we just want to return those rows? Let’s try filtering on them…

SELECT ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY Amount DESC) AS RowNumber, *
FROM dbo.WindowFunctionDemo
WHERE ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY Amount DESC) = 1;
GO

Msg 4108, Level 15, State 1, Line 58
Windowed functions can only appear in the SELECT or ORDER BY clauses.

As we can see from the error above we can only use the window functions in the SELECT and ORDER BY clauses. This is because of the order SQL Server processes each part of a query. The order is:

  1. FROM
  2. ON
  3. JOIN
  4. WHERE
  5. GROUP BY
  6. WITH CUBE or WITH ROLLUP
  7. HAVING
  8. SELECT
  9. DISTINCT
  10. ORDER BY
  11. TOP

The window functions are calculated as part of the SELECT phase so they are not available for use in any of the preceding phases. One way to get the required results is to use a table expression such as a CTE as follows…

WITH Orders AS (
   SELECT ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY Amount DESC) AS RowNumber, *
   FROM dbo.WindowFunctionDemo
)
SELECT *
FROM Orders
WHERE RowNumber = 1;

Windows Ranking 5

The second ranking function I’m going to demo is the RANK function. The syntax is the same as ROW_NUMBER and again we can use the ORDER BY and PARTITION BY clauses in the OVER clause. The difference between RANK and ROW_NUMBER is that RANK assigns a rank based on the value in the row and gives rows with the same value the same rank. For example, ordering by PurchaseDate with no partitioning we can see the two pairs of duplicate PurchaseDates have been given the same rank: these are the rows with ranks 4 and 7.

SELECT RANK() OVER (ORDER BY PurchaseDate) AS Ranker, *
FROM dbo.WindowFunctionDemo;
GO

Windows Ranking 6

As we can see there are two rows with a rank of 4 and they both have the same PurchaseDate value. We can also see that the next value in the ranking sequence is 6. If there were 3 values then the next one would have been 7.

We can use the third ranking function, DENSE_RANK to not skip over the ranking values as follows…

SELECT DENSE_RANK() OVER (ORDER BY PurchaseDate) AS Ranker, *
FROM dbo.WindowFunctionDemo;
GO

Windows Ranking 7

The fourth and final ranking function is the NTILE function which just splits a data set into n parts as equally as possible. For example, we can split our purchases data set into 5 parts as follows:

SELECT NTILE(5) OVER (ORDER BY Amount) AS Ranker, *
FROM dbo.WindowFunctionDemo;
GO

Windows Ranking 8

Because the 11 rows in the data set can’t be split into 5 equal parts, the first bucket ends up with 3 rows and the rest with 2 rows. The extra remainder rows are always allocated starting from the first bucket, so if there were 12 rows we’d have ended up with 3 rows in each of the first two buckets.

For the final example, here’s everything together…

SELECT
   ROW_NUMBER() OVER (ORDER BY PurchaseDate) AS RowNumber,
   RANK() OVER (ORDER BY PurchaseDate) AS Ranker,
   DENSE_RANK() OVER (ORDER BY PurchaseDate) AS DenseRanker,
   NTILE(5) OVER (ORDER BY PurchaseDate) AS NTil,
   *
FROM dbo.WindowFunctionDemo;
GO

Windows Ranking 9

JSON – Part 1: FOR JSON

JSON support was introduced in SQL Server 2016 and is part of the 70-761 – Querying Data with Transact-SQL exam.

Some of it works in a similar way to the XML functionality that has been around for a long time, for example, there is a FOR JSON clause that returns data in JSON format in a similar way FOR XML returns data in XML format. There are other functions available for reading and processing JSON data and I’ll cover some of those in future posts as part of my prep for exam 70-761.

Full details of JSON in SQL Server can be found here.

FOR JSON

Let’s create a table containing a few different data types and see how the FOR JSON clause works.

DROP TABLE IF EXISTS dbo.ForJsonTest1;
GO

CREATE TABLE dbo.ForJsonTest1
(
    IntColumn INT NOT NULL,
    BitColumn BIT NOT NULL,
    DecColumn DECIMAL(5,2) NOT NULL,
    FixedCharColumn CHAR(5) NOT NULL,
    VariableCharColumn VARCHAR(10) NOT NULL,
    DateColumn DATE NOT NULL,
    DateTimeColumn DATETIME2(7) NOT NULL,
    BinaryColumn VARBINARY(10) NOT NULL,
    MoneyColumn MONEY NOT NULL,
    UniqueIDColumn UNIQUEIDENTIFIER NOT NULL,
    NullableColumn VARCHAR(10) NULL
);
GO

INSERT INTO dbo.ForJsonTest1
(
    IntColumn,
    BitColumn,
    DecColumn,
    FixedCharColumn,
    VariableCharColumn,
    DateColumn,
    DateTimeColumn,
    BinaryColumn,
    MoneyColumn,
    UniqueIDColumn,
    NullableColumn
)
VALUES
(
    100,
    1,
    12.34,
    'AB',
    'ABCD',
    '20180301',
    '20180302 15:12',
    123,
    12.34,
    NEWID(),
    NULL
),
(
    250,
    0,
    54.21,
    'QWERT',
    'QWERTY',
    '20180302',
    '20180303 11:12',
    123,
    12.34,
    NEWID(),
    'NotNull'
);
GO

SELECT *
FROM dbo.ForJsonTest1;
GO

Here’s what the inserted data looks like in the table…

JSON1

AUTO

We can now add the FOR JSON clause which has two different options. The simplest one is AUTO which works as follows…

SELECT *
FROM dbo.ForJsonTest1
FOR JSON AUTO
GO

This returns the JSON as a single value (displayed as a clickable link in the SSMS results grid)…

JSON2

… and when formatted it looks like this…

[
 {
 "IntColumn": 100,
 "BitColumn": true,
 "DecColumn": 12.34,
 "FixedCharColumn": "AB ",
 "VariableCharColumn": "ABCD",
 "DateColumn": "2018-03-01",
 "DateTimeColumn": "2018-03-02T15:12:00",
 "BinaryColumn": "AAAAew==",
 "MoneyColumn": 12.34,
 "UniqueIDColumn": "7A53DDE9-C97F-4384-AB1B-D23BA5AB3ED5"
 },
 {
 "IntColumn": 250,
 "BitColumn": false,
 "DecColumn": 54.21,
 "FixedCharColumn": "QWERT",
 "VariableCharColumn": "QWERTY",
 "DateColumn": "2018-03-02",
 "DateTimeColumn": "2018-03-03T11:12:00",
 "BinaryColumn": "AAAAew==",
 "MoneyColumn": 12.34,
 "UniqueIDColumn": "F36DBD9F-9B01-4C15-B562-BB0B9DF6A689",
 "NullableColumn": "NotNull"
 }
]

Certain SQL Server data types get converted to certain JSON data types and details of each data type can be found here.

We can add a root to the JSON similar to how the XML root works…

SELECT *
FROM dbo.ForJsonTest1
FOR JSON AUTO, ROOT('JSONTesting');
GO
{
 "JSONTesting": [
 {
 "IntColumn": 100,
 "BitColumn": true,
 "DecColumn": 12.34,
 "FixedCharColumn": "AB ",
 "VariableCharColumn": "ABCD",
 "DateColumn": "2018-03-01",
 "DateTimeColumn": "2018-03-02T15:12:00",
 "BinaryColumn": "AAAAew==",
 "MoneyColumn": 12.34,
 "UniqueIDColumn": "7A53DDE9-C97F-4384-AB1B-D23BA5AB3ED5"
 },
 {
 "IntColumn": 250,
 "BitColumn": false,
 "DecColumn": 54.21,
 "FixedCharColumn": "QWERT",
 "VariableCharColumn": "QWERTY",
 "DateColumn": "2018-03-02",
 "DateTimeColumn": "2018-03-03T11:12:00",
 "BinaryColumn": "AAAAew==",
 "MoneyColumn": 12.34,
 "UniqueIDColumn": "F36DBD9F-9B01-4C15-B562-BB0B9DF6A689",
 "NullableColumn": "NotNull"
 }
 ]
}

We can include null values as follows so the NullableColumn is always returned…

SELECT *
FROM dbo.ForJsonTest1
FOR JSON AUTO, INCLUDE_NULL_VALUES;
GO
[
 {
 "IntColumn": 100,
 "BitColumn": true,
 "DecColumn": 12.34,
 "FixedCharColumn": "AB ",
 "VariableCharColumn": "ABCD",
 "DateColumn": "2018-03-01",
 "DateTimeColumn": "2018-03-02T15:12:00",
 "BinaryColumn": "AAAAew==",
 "MoneyColumn": 12.34,
 "UniqueIDColumn": "7A53DDE9-C97F-4384-AB1B-D23BA5AB3ED5",
 "NullableColumn": null
 },
 {
 "IntColumn": 250,
 "BitColumn": false,
 "DecColumn": 54.21,
 "FixedCharColumn": "QWERT",
 "VariableCharColumn": "QWERTY",
 "DateColumn": "2018-03-02",
 "DateTimeColumn": "2018-03-03T11:12:00",
 "BinaryColumn": "AAAAew==",
 "MoneyColumn": 12.34,
 "UniqueIDColumn": "F36DBD9F-9B01-4C15-B562-BB0B9DF6A689",
 "NullableColumn": "NotNull"
 }
]

We can use the FOR JSON clause when joining tables together and we get nested JSON as follows…

DROP TABLE IF EXISTS dbo.Joining;
GO

CREATE TABLE dbo.Joining
(
    IntColumn INT NOT NULL,
    SomeText VARCHAR(100) NOT NULL
);
GO

INSERT INTO dbo.Joining (IntColumn, SomeText)
VALUES
(100, 'AAAAA'),
(100, 'BBBBB'),
(100, 'CCCCC');
GO
SELECT
 F.IntColumn,
 F.BitColumn,
 F.DecColumn,
 F.FixedCharColumn,
 F.VariableCharColumn,
 F.DateColumn,
 F.DateTimeColumn,
 F.BinaryColumn,
 F.MoneyColumn,
 F.UniqueIDColumn,
 F.NullableColumn,
 J.SomeText
FROM dbo.ForJsonTest1 F
LEFT JOIN dbo.Joining J ON J.IntColumn = F.IntColumn
FOR JSON AUTO;
GO

[
 {
 "IntColumn": 100,
 "BitColumn": true,
 "DecColumn": 12.34,
 "FixedCharColumn": "AB ",
 "VariableCharColumn": "ABCD",
 "DateColumn": "2018-03-01",
 "DateTimeColumn": "2018-03-02T15:12:00",
 "BinaryColumn": "AAAAew==",
 "MoneyColumn": 12.34,
 "UniqueIDColumn": "7A53DDE9-C97F-4384-AB1B-D23BA5AB3ED5",
 "J": [
   {"SomeText": "AAAAA"},
   {"SomeText": "BBBBB"},
   {"SomeText": "CCCCC"}
 ]
 },
 {
 "IntColumn": 250,
 "BitColumn": false,
 "DecColumn": 54.21,
 "FixedCharColumn": "QWERT",
 "VariableCharColumn": "QWERTY",
 "DateColumn": "2018-03-02",
 "DateTimeColumn": "2018-03-03T11:12:00",
 "BinaryColumn": "AAAAew==",
 "MoneyColumn": 12.34,
 "UniqueIDColumn": "F36DBD9F-9B01-4C15-B562-BB0B9DF6A689",
 "NullableColumn": "NotNull",
 "J": [
   {}
 ]
 }
]

PATH

We get much more control using PATH instead of AUTO and can easily nest JSON using aliases separated with dots as follows.

SELECT
   IntColumn AS 'Numbers.IntColumn',
   BitColumn AS 'Numbers.BitColumn',
   DecColumn AS 'Numbers.DecColumn',
   MoneyColumn AS 'Numbers.MoneyColumn',
   FixedCharColumn AS 'Strings.FixedCharColumn',
   VariableCharColumn AS 'Strings.VariableCharColumn',
   NullableColumn AS 'Strings.NullableColumn',
   DateColumn AS 'Dates.DateColumn',
   DateTimeColumn AS 'Dates.DateTimeColumn',
   BinaryColumn AS 'Boolean.BinaryColumn',
   UniqueIDColumn AS 'Others.UniqueIDColumn'
FROM dbo.ForJsonTest1
FOR JSON PATH;
GO

[
 {
 "Numbers": {
   "IntColumn": 100,
   "BitColumn": true,
   "DecColumn": 12.34,
   "MoneyColumn": 12.34
 },
 "Strings": {
   "FixedCharColumn": "AB ",
   "VariableCharColumn": "ABCD"
 },
 "Dates": {
   "DateColumn": "2018-03-01",
   "DateTimeColumn": "2018-03-02T15:12:00"
 },
 "Boolean": {
   "BinaryColumn": "AAAAew=="
 },
 "Others": {
   "UniqueIDColumn": "7A53DDE9-C97F-4384-AB1B-D23BA5AB3ED5"
 }
 },
 {
 "Numbers": {
   "IntColumn": 250,
   "BitColumn": false,
   "DecColumn": 54.21,
   "MoneyColumn": 12.34
 },
 "Strings": {
   "FixedCharColumn": "QWERT",
   "VariableCharColumn": "QWERTY",
   "NullableColumn": "NotNull"
 },
 "Dates": {
   "DateColumn": "2018-03-02",
   "DateTimeColumn": "2018-03-03T11:12:00"
 },
 "Boolean": {
   "BinaryColumn": "AAAAew=="
 },
 "Others": {
   "UniqueIDColumn": "F36DBD9F-9B01-4C15-B562-BB0B9DF6A689"
 }
 }
]

Please note, you need to make sure the columns that belong to the same nested JSON object are next to each other in the SELECT list. The following query generates an error because the Numbers columns and Strings columns are not grouped together…

SELECT
 IntColumn AS 'Numbers.IntColumn',
 BitColumn AS 'Numbers.BitColumn',
 DecColumn AS 'Numbers.DecColumn',
 FixedCharColumn AS 'Strings.FixedCharColumn',
 VariableCharColumn AS 'Strings.VariableCharColumn',
 DateColumn AS 'Dates.DateColumn',
 DateTimeColumn AS 'Dates.DateTimeColumn',
 BinaryColumn AS 'Boolean.BinaryColumn',
 MoneyColumn AS 'Numbers.MoneyColumn',
 UniqueIDColumn AS 'Others.UniqueIDColumn',
 NullableColumn AS 'Strings.NullableColumn'
FROM dbo.ForJsonTest1
FOR JSON PATH;
GO

Msg 13601, Level 16, State 1, Line 77
Property 'Numbers.MoneyColumn' cannot be generated in JSON output due to a conflict with another column name or alias. Use different names and aliases for each column in SELECT list.

Return Max (or Min) Value Across Different Columns in Result Set

Here’s a nice way to return the minimum or maximum value across multiple columns.  Obviously, it’s easy to work out the min or max value for all rows in a single column but it’s a bit trickier to find the min or max value across multiple columns for all rows.

You can use the VALUES clause in a sub-query to do this.  For example, if you have an account table with different dates for actions like first game play, first purchase etc, you can pull out the maximum date of all these account actions as follows:

CREATE TABLE dbo.AccountActions
(
    AccountId INT NOT NULL IDENTITY(1,1),
    FullName VARCHAR(100) NOT NULL,
    AccountOpenedDate DATETIME2(0) NOT NULL,
    FirstPurchasedDate DATETIME2(0) NULL,
    FirstGamePlayDate DATETIME2(0) NULL,
    FirstMultiPlayerDate DATETIME2(0) NULL
);
GO

INSERT INTO dbo.AccountActions (FullName, AccountOpenedDate, FirstGamePlayDate, FirstMultiPlayerDate, FirstPurchasedDate)
VALUES 
('Adam Bennett', '20180301 17:06', '20180301 17:08', NULL, NULL),
('Charlie Dawson', '20180301 20:54', '20180301 21:32', NULL, '20180301 21:17'),
('Eric Franks', '20180301 09:41', '20180302 06:38', '20180303 20:12', NULL),
('Gina Harris', '20180303 23:19', '20180303 23:21', '20180304 06:52', '20180303 23:20');
GO

SELECT
    AccountId,
    FullName,
    (
        SELECT MAX(AllDates.TheDate)
        FROM (
            VALUES (AccountOpenedDate), (FirstGamePlayDate), (FirstMultiPlayerDate), (FirstPurchasedDate)
        ) AS AllDates (TheDate)
    ) AS MaxDate
FROM dbo.AccountActions;
GO
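And since the title mentions Min as well, the same pattern with MIN gives the earliest of the four dates per account (the aggregate ignores the NULLs) — a quick sketch…

SELECT
    AccountId,
    FullName,
    (
        SELECT MIN(AllDates.TheDate)
        FROM (
            VALUES (AccountOpenedDate), (FirstGamePlayDate), (FirstMultiPlayerDate), (FirstPurchasedDate)
        ) AS AllDates (TheDate)
    ) AS MinDate
FROM dbo.AccountActions;
GO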

Dynamic Data Masking

As part of my prep for exam 70-761 (Querying Data with Transact-SQL) I’m writing about a few of the new features available in SQL Server 2016 that weren’t included in the 70-461 exam I took a few years ago based on SQL Server 2012.

Dynamic Data Masking is one of these new features and can be used to restrict certain users from viewing sensitive data. With the General Data Protection Regulation (GDPR) coming into force on 25th May 2018 this may be something that companies can use to ensure that people who don’t need it cannot see certain personally identifiable information (PII). The data is not actually stored any differently, it is just masked upon read.

There are four functions available to mask the data: default, random, partial and email. I’ll go into more details on each of these below but first I’m going to quickly demonstrate the default function so I can then talk about permissions.

I’m going to connect to my SimonBlog database, which is at SQL Server 2016 compatibility level, and create a new table that contains PII.

USE SimonBlog
GO

CREATE SCHEMA DDM;
GO

CREATE TABLE DDM.MaskingDemo1
(
 Id INT NOT NULL IDENTITY(1,1),
 FirstName VARCHAR(50) NOT NULL,
 MiddleName VARCHAR(50) NULL,
 LastName VARCHAR(50) NOT NULL,
 Email VARCHAR(500) NULL,
 Address1 VARCHAR(100) NULL,
 Address2 VARCHAR(100) NULL,
 PostCode VARCHAR(10) NULL,
 DateOfBirth DATE NOT NULL,
 Salary INT NULL
);
GO

INSERT INTO DDM.MaskingDemo1 (FirstName, MiddleName, LastName, Email, Address1, Address2, PostCode, DateOfBirth, Salary)
VALUES
('Anna', NULL, 'Testing', 'anna.learning@data.co.uk', '1 My Road', 'My Town', 'QQ1 4ZZ', '1985-07-04', 20000),
('Matilda', 'Mary', 'Test', 'matilda.mary@test.com', '100 The Street', 'Big City', 'AQ123ER', '1965-12-01', 50000);
GO

I can now view this data as follows

SELECT *
FROM DDM.MaskingDemo1

DDM1

I’m now going to add the default masking function on the FirstName column as follows

ALTER TABLE DDM.MaskingDemo1
  ALTER COLUMN FirstName 
  ADD MASKED WITH (FUNCTION='default()');

It’s as simple as that and now I can talk about how the permissions work with DDM.

Permissions

The masking is solely based on permissions. I’m sysadmin on my SimonBlog database so the masking is never going to be displayed for me, as sysadmin effectively holds the UNMASK permission on all databases. If I select all the data from the table above I don’t see any masking. So to demonstrate how the masking works I’m going to create a new SQL user as follows:

USE master
GO

CREATE LOGIN DDMTestUser WITH PASSWORD = 'DDMTest1'
GO

USE SimonBlog
GO

CREATE USER DDMTestUser FROM LOGIN DDMTestUser;
GO

GRANT CONNECT ON DATABASE::SimonBlog TO DDMTestUser;
GO

GRANT SELECT ON OBJECT::DDM.MaskingDemo1 TO DDMTestUser;
GO

If I now run the following…

EXECUTE AS USER = 'DDMTestUser';
SELECT *
FROM DDM.MaskingDemo1;
REVERT;

… I get the following results

DDM2

You can see that the default mask on the FirstName column has replaced all text with 4 x characters regardless of the length of the original text value. The 7 character Matilda has been replaced by xxxx.

If it’s decided that the user DDMTestUser needs to be able to see the PII we can remove the mask as follows…

GRANT UNMASK TO DDMTestUser;
GO

If I now run the following I can see all the data…

EXECUTE AS USER = 'DDMTestUser';
SELECT *
FROM DDM.MaskingDemo1;
REVERT;

DDM1

As you can see the unmask permission is a database wide permission and it’s all or nothing, i.e. all masking is applied on all tables and columns within a database when reading data unless you have this unmask permission which means everything is available.

To turn it back off again I can run the following…

REVOKE UNMASK TO DDMTestUser;
GO

Now let’s give DDMTestUser the extra permissions needed to create a new table. We’ll create the table with one column using the default mask and then insert some data.

GRANT CREATE TABLE ON DATABASE::SimonBlog TO DDMTestUser;
GO

GRANT ALTER, SELECT, INSERT, UPDATE, DELETE ON SCHEMA::DDM TO DDMTestUser;
GO

EXECUTE AS USER = 'DDMTestUser';
CREATE TABLE DDM.MaskingDemo2
(
  SomeText VARCHAR(100) MASKED WITH (FUNCTION='default()'),
  SomeMoreText VARCHAR(100)
);
GO

INSERT INTO DDM.MaskingDemo2 (SomeText, SomeMoreText)
VALUES ('This is a fairly long piece of text', 'And here is a short one');

SELECT *
FROM DDM.MaskingDemo2;
GO

REVERT;

DDM3

So even though DDMTestUser inserted the unmasked data, when trying to read it they only see the mask because they do not have the unmask permission.

There is a second column in this table that is unmasked so let’s now try and mask this one…

EXECUTE AS USER = 'DDMTestUser';
ALTER TABLE DDM.MaskingDemo2
  ALTER COLUMN SomeMoreText 
  ADD MASKED WITH (FUNCTION='default()');
REVERT;

Msg 15247, Level 16, State 5, Line 51
User does not have permission to perform this action.

A user can create a new table with masked columns using their standard table-creation permissions, but adding a mask to a column on an existing table requires the ALTER ANY MASK permission (along with ALTER permission on the table, which DDMTestUser already has). It can be granted as follows…

GRANT ALTER ANY MASK ON DATABASE::SimonBlog TO DDMTestUser;
GO

DDMTestUser can now add the mask to the SomeMoreText column

EXECUTE AS USER = 'DDMTestUser';
ALTER TABLE DDM.MaskingDemo2
  ALTER COLUMN SomeMoreText 
  ADD MASKED WITH (FUNCTION='default()');
GO
SELECT *
FROM DDM.MaskingDemo2;
GO
REVERT;

DDM4

As we saw above a user can add data to a masked column but will not be able to see the data when reading without the unmask permission. The same applies for updating existing data. If I run the following and then try to read the data I still see the mask.

EXECUTE AS USER = 'DDMTestUser';
UPDATE DDM.MaskingDemo2
SET SomeText = 'this is some even longer text than was here before';
GO
SELECT *
FROM DDM.MaskingDemo2;
REVERT;
GO

DDM4

That’s a lot of the permissions covered so now let’s look at the differences between the 4 different masking functions.

Masking Functions

Default

We’ve covered a few examples of the default mask above where the text data in the FirstName column is masked with xxxx, but what about where the value is fewer than 4 characters? If I add some nonsense data with character lengths of 1, 2 and 3 we can see how this looks masked.

INSERT INTO DDM.MaskingDemo1 (FirstName, LastName, DateOfBirth)
VALUES ('a', 'x', '19900502'), ('ab', 'xy', '19800603'), ('abc', 'xyz', '19701208')
GO

The documentation says that the mask can be fewer than four Xs if the text is less than 4 characters, but this doesn’t seem to be the case.

Unmasked…

DDM5

Masked…

DDM6

It looks like text data types are all masked with xxxx, so let’s now add the default mask to the DateOfBirth and Salary columns…

ALTER TABLE DDM.MaskingDemo1
 ALTER COLUMN DateOfBirth 
 ADD MASKED WITH (FUNCTION='default()');

ALTER TABLE DDM.MaskingDemo1
 ALTER COLUMN Salary 
 ADD MASKED WITH (FUNCTION='default()');

If we then view the data we see the following…

DDM7

The date type columns are displayed as 1900-01-01 and numeric columns are displayed as 0. I’ve not included them here but binary and bit columns are also displayed as 0. The Salary column is the first one we’ve seen with NULL values and we can see that these values are unaffected by the mask.

Random

The random function is only available on numeric data types and you specify the start and end of the range that the random number should fall within. Let’s add it to the Salary column with a range of 1 to 100.

ALTER TABLE DDM.MaskingDemo1
  ALTER COLUMN Salary 
  ADD MASKED WITH (FUNCTION='random(1,100)');

The Salary column already had the default mask applied to it and the statement above has replaced that with the random mask. I didn’t have to remove the default mask before adding the random mask.

If I query the table twice we can see that different random values are shown in the Salary column

DDM8

DDM9

Partial

The partial function is for text data types and gives you the most control over the mask. With this function you specify how many of the leading and trailing characters you want to display and what mask you want to show in between. For example, you could say I want to see the first 2 characters, then 4 asterisk characters, then the last 3 characters. Let’s try this on the Address1 column in the DDM.MaskingDemo1 table

ALTER TABLE DDM.MaskingDemo1
 ALTER COLUMN Address1 
 ADD MASKED WITH (FUNCTION='partial(2,"****",3)');

If we then query the table we see the following

DDM10

If we look at the second row with the Address1 value of 100 The Street we can see it’s taken the first 2 characters 10 and the last 3 characters eet and stuffed **** in between to give 10****eet.

Let’s see how a similar mask would work on shorter data by adding the following mask to the LastName column.

ALTER TABLE DDM.MaskingDemo1
 ALTER COLUMN LastName 
 ADD MASKED WITH (FUNCTION='partial(1,"%",1)');

At most this will return 3 characters: the first and last with % in between.

DDM11

The first two LastName values are 7 and 4 characters long and these have been masked as expected. The last was xyz, which is 3 characters, and has just resulted in the y being replaced by the %. The third and fourth LastName values are 1 and 2 characters respectively, and if the value is shorter than the total number of characters in the mask pattern (in this case 3) then the leading and trailing characters are not included and the mask is the only thing returned.

As mentioned above the partial function is only available on character data types so let’s see what happens when we try to add it to the DateOfBirth column

ALTER TABLE DDM.MaskingDemo1
  ALTER COLUMN DateOfBirth 
  ADD MASKED WITH (FUNCTION='partial(1,"%",1)');

Msg 16003, Level 16, State 0, Line 65
The data type of column 'DateOfBirth' does not support data masking function 'partial'.

Email

The email mask is also only available on character data types and will display the first character followed by XXX@XXXX.com

So if I add this to the Email column we see the following

ALTER TABLE DDM.MaskingDemo1
  ALTER COLUMN Email 
  ADD MASKED WITH (FUNCTION='email()');

DDM12

We can actually add the email mask to any character data type so let’s add it to the Address2 column.

ALTER TABLE DDM.MaskingDemo1
  ALTER COLUMN Address2 
  ADD MASKED WITH (FUNCTION='email()');

DDM13

As you can see the data doesn’t actually have to be an email and the email mask is basically just a specific version of the partial mask. We can replicate it as follows

ALTER TABLE DDM.MaskingDemo1
  ALTER COLUMN Email 
  ADD MASKED WITH (FUNCTION='partial(1,"XXX@XXXX.com",0)');

DDM14

We can also replicate the default mask on character data types using the partial mask as follows

ALTER TABLE DDM.MaskingDemo1
  ALTER COLUMN FirstName 
  ADD MASKED WITH (FUNCTION='partial(0,"xxxx",0)');

Removing a mask

Removing the mask from a column can be done as follows

ALTER TABLE DDM.MaskingDemo1
  ALTER COLUMN Address2 DROP MASKED;

Now the Address2 column is no longer masked

DDM16

SELECT INTO and Computed Columns

If we add a computed column based on a masked column then that computed value is also masked. I’ve created 5 different computed columns based on the FirstName, MiddleName and LastName columns. It seems that if any concatenation is applied in the computed column then the default mask is always used, so by just appending abc to the LastName column, which has the partial mask, the data is returned as xxxx. This is shown in the FullName4 column.

ALTER TABLE DDM.MaskingDemo1
  ADD FullName1 AS (FirstName + ' ' + ISNULL(MiddleName, '') + ' ' + LastName),
  FullName2 AS (ISNULL(MiddleName, '') + ' ' + LastName + ' ' + FirstName),
  FullName3 AS (LastName + ' ' + FirstName + ' ' + ISNULL(MiddleName, '')),
  FullName4 AS (LastName + 'abc'),
  FullName5 AS (LastName)

DDM17

If we view the data with an account with the unmask permission we can see the computed columns as expected

DDM18

A similar thing happens when loading data from one table to another. If the source table is masked and the user loading the data into the destination table does not have the unmask permission then the masked data is loaded into the destination and can never be unmasked. We can demo this with the SELECT INTO statement. If I run this as a user without unmask

EXECUTE AS USER = 'DDMTestUser';
SELECT *
INTO #masking
FROM DDM.MaskingDemo1;
REVERT;

… then select from #masking with a user with unmask permission we can see the following

SELECT *
FROM DDM.MaskingDemo1;

SELECT *
FROM #masking;

DDM19

Finally

To find out what columns in a database are masked you can run the following query using the sys.masked_columns view

SELECT
  SCHEMA_NAME(o.schema_id),
  OBJECT_NAME(o.object_id),
  mc.name,
  mc.masking_function
FROM sys.masked_columns mc
JOIN sys.objects o ON o.object_id = mc.object_id
ORDER BY
  SCHEMA_NAME(o.schema_id),
  OBJECT_NAME(o.object_id),
  mc.column_id;

DDM15