Dynamic Data Masking

As part of my prep for exam 70-761 (Querying Data with Transact-SQL) I’m writing about a few of the new features available in SQL Server 2016 that weren’t included in the 70-461 exam I took a few years ago based on SQL Server 2012.

Dynamic Data Masking is one of these new features and can be used to restrict certain users from viewing sensitive data. With the General Data Protection Regulation (GDPR) coming into force on 25th May 2018, this may be something companies can use to ensure that people who don’t need to see certain personally identifiable information (PII) cannot see it. The data is not actually stored any differently; it is just masked on read.

There are four functions available to mask the data: default, random, partial and email. I’ll go into more details on each of these below but first I’m going to quickly demonstrate the default function so I can then talk about permissions.
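For quick reference, here’s a minimal sketch of the column-level syntax for all four functions on a hypothetical table (dbo.MaskingSyntaxDemo is just an illustration and isn’t used in the rest of this post):

CREATE TABLE dbo.MaskingSyntaxDemo
(
 SomeText VARCHAR(100) MASKED WITH (FUNCTION='default()') NULL, -- strings masked as xxxx
 SomeNumber INT MASKED WITH (FUNCTION='random(1,100)') NULL, -- replaced with a random value in the range
 SomeCode VARCHAR(20) MASKED WITH (FUNCTION='partial(2,"***",1)') NULL, -- first 2 and last 1 characters kept
 SomeEmail VARCHAR(500) MASKED WITH (FUNCTION='email()') NULL -- first character then XXX@XXXX.com
);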

I’m going to connect to my SimonBlog database, which is at SQL Server 2016 compatibility level, and create a new table that contains PII.

USE SimonBlog
GO

CREATE SCHEMA DDM;
GO

CREATE TABLE DDM.MaskingDemo1
(
 Id INT NOT NULL IDENTITY(1,1),
 FirstName VARCHAR(50) NOT NULL,
 MiddleName VARCHAR(50) NULL,
 LastName VARCHAR(50) NOT NULL,
 Email VARCHAR(500) NULL,
 Address1 VARCHAR(100) NULL,
 Address2 VARCHAR(100) NULL,
 PostCode VARCHAR(10) NULL,
 DateOfBirth DATE NOT NULL,
 Salary INT NULL
);
GO

INSERT INTO DDM.MaskingDemo1
VALUES
('Anna', NULL, 'Testing', 'anna.learning@data.co.uk', '1 My Road', 'My Town', 'QQ1 4ZZ', '1985-07-04', 20000),
('Matilda', 'Mary', 'Test', 'matilda.mary@test.com', '100 The Street', 'Big City', 'AQ123ER', '1965-12-01', 50000);
GO

I can now view this data as follows

SELECT *
FROM DDM.MaskingDemo1

DDM1

I’m now going to add the default masking function on the FirstName column as follows

ALTER TABLE DDM.MaskingDemo1
  ALTER COLUMN FirstName 
  ADD MASKED WITH (FUNCTION='default()');

It’s as simple as that and now I can talk about how the permissions work with DDM.

Permissions

The masking is driven solely by permissions. I’m a sysadmin on my instance and members of the sysadmin role effectively hold the unmask permission in every database, so the masking is never going to be displayed for me. If I select all data from the table above I don’t see any masking. So to demonstrate how the masking works I’m going to create a new SQL user as follows:

USE master
GO

CREATE LOGIN DDMTestUser WITH PASSWORD = 'DDMTest1'
GO

USE SimonBlog
GO

CREATE USER DDMTestUser FROM LOGIN DDMTestUser;
GO

GRANT CONNECT ON DATABASE::SimonBlog TO DDMTestUser;
GO

GRANT SELECT ON OBJECT::DDM.MaskingDemo1 TO DDMTestUser;
GO

If I now run the following…

EXECUTE AS USER = 'DDMTestUser';
SELECT *
FROM DDM.MaskingDemo1;
REVERT;

… I get the following results

DDM2

You can see that the default mask on the FirstName column has replaced all text with four x characters regardless of the length of the original value. The seven-character Matilda has been replaced by xxxx.

If it’s decided that the user DDMTestUser needs to be able to see the PII we can grant them the unmask permission as follows…

GRANT UNMASK TO DDMTestUser;
GO

If I now run the following I can see all the data…

EXECUTE AS USER = 'DDMTestUser';
SELECT *
FROM DDM.MaskingDemo1;
REVERT;

DDM1

As you can see the unmask permission is a database-wide permission and it’s all or nothing: every mask on every table and column in the database is applied when reading data unless you have the unmask permission, in which case everything is visible.

To turn it back off again I can run the following…

REVOKE UNMASK TO DDMTestUser;
GO

Now let’s give DDMTestUser extra permissions to create a new table. We’ll create the table with one column using the default mask and then insert some data.

GRANT CREATE TABLE ON DATABASE::SimonBlog TO DDMTestUser;
GO

GRANT ALTER, SELECT, INSERT, UPDATE, DELETE ON SCHEMA::DDM TO DDMTestUser;
GO

EXECUTE AS USER = 'DDMTestUser';
CREATE TABLE DDM.MaskingDemo2
(
  SomeText VARCHAR(100) MASKED WITH (FUNCTION='default()'),
  SomeMoreText VARCHAR(100)
);
GO

INSERT INTO DDM.MaskingDemo2 (SomeText, SomeMoreText)
VALUES ('This is a fairly long piece of text', 'And here is a short one');

SELECT *
FROM DDM.MaskingDemo2;
GO

REVERT;

DDM3

So even though DDMTestUser inserted the unmasked data, when trying to read it they only see the mask because they do not have the unmask permission.

There is a second column in this table that is unmasked, so let’s now try to mask that one too.

EXECUTE AS USER = 'DDMTestUser';
ALTER TABLE DDM.MaskingDemo2
  ALTER COLUMN SomeMoreText 
  ADD MASKED WITH (FUNCTION='default()');
REVERT;

Msg 15247, Level 16, State 5, Line 51
User does not have permission to perform this action.

A user can create a new table containing masked columns without any extra masking permission, but to add a mask to a column on an existing table they need the ALTER ANY MASK permission on the database, which can be granted as follows…

GRANT ALTER ANY MASK ON DATABASE::SimonBlog TO DDMTestUser;
GO

DDMTestUser can now add the mask to the SomeMoreText column

EXECUTE AS USER = 'DDMTestUser';
ALTER TABLE DDM.MaskingDemo2
  ALTER COLUMN SomeMoreText 
  ADD MASKED WITH (FUNCTION='default()');
GO
SELECT *
FROM DDM.MaskingDemo2;
GO
REVERT;

DDM4

As we saw above a user can add data to a masked column but will not be able to see the data when reading without the unmask permission. The same applies for updating existing data. If I run the following and then try to read the data I still see the mask.

EXECUTE AS USER = 'DDMTestUser';
UPDATE DDM.MaskingDemo2
SET SomeText = 'this is some even longer text than was here before';
GO
SELECT *
FROM DDM.MaskingDemo2;
REVERT;
GO

DDM4

That’s a lot of the permissions covered so now let’s look at the differences between the 4 different masking functions.

Masking Functions

Default

We’ve covered a few examples of the default mask above where the text data in the FirstName column is masked with xxxx, but what about where the value is fewer than 4 characters? If I add some nonsense data with character lengths of 1, 2 and 3 we can see how this looks when masked.

INSERT INTO DDM.MaskingDemo1 (FirstName, LastName, DateOfBirth)
VALUES ('a', 'x', '19900502'), ('ab', 'xy', '19800603'), ('abc', 'xyz', '19701208')
GO

The documentation says that the mask can be fewer than four x characters if the value is shorter than 4 characters, but this doesn’t seem to be the case.

Unmasked…

DDM5

Masked…

DDM6

It looks like text data types are all masked with xxxx so let’s now add the default mask to the DateOfBirth and Salary columns

ALTER TABLE DDM.MaskingDemo1
 ALTER COLUMN DateOfBirth 
 ADD MASKED WITH (FUNCTION='default()');

ALTER TABLE DDM.MaskingDemo1
 ALTER COLUMN Salary 
 ADD MASKED WITH (FUNCTION='default()');

If we then view the data we see the following…

DDM7

Date columns are displayed as 1900-01-01 and numeric columns are displayed as 0. I’ve not included them here but binary and bit columns are also displayed as 0. The Salary column is the first one we’ve seen containing NULLs and we can see that NULL values are left as NULL by the mask.
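As an illustrative sketch of the default mask on non-string types (DDM.MaskingTypesDemo is a hypothetical table, not part of the demo above, with the documented default values in the comments):

CREATE TABLE DDM.MaskingTypesDemo
(
 SomeBit BIT MASKED WITH (FUNCTION='default()') NULL, -- masked as 0
 SomeBinary VARBINARY(10) MASKED WITH (FUNCTION='default()') NULL, -- masked as a single byte 0
 SomeDate DATE MASKED WITH (FUNCTION='default()') NULL, -- masked as 1900-01-01
 SomeAmount DECIMAL(10,2) MASKED WITH (FUNCTION='default()') NULL -- masked as 0
);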

Random

The random function is only available on numeric data types and you specify the lower and upper bounds of the range that the displayed random number should fall within. Let’s add it to the Salary column with a range of 1 to 100.

ALTER TABLE DDM.MaskingDemo1
  ALTER COLUMN Salary 
  ADD MASKED WITH (FUNCTION='random(1,100)');

The Salary column already had the default mask applied to it and the statement above has replaced that with the random mask. I didn’t have to remove the default mask before adding the random mask.

If I query the table twice we can see that different random values are shown in the Salary column

DDM8

DDM9

Partial

The partial function is for text data types and gives you the most control over the mask. With this function you specify how many leading and trailing characters you want to display and the padding string to show in between. For example, you could show the first 2 characters, then 4 asterisks, then the last 3 characters. Let’s try this on the Address1 column in the DDM.MaskingDemo1 table

ALTER TABLE DDM.MaskingDemo1
 ALTER COLUMN Address1 
 ADD MASKED WITH (FUNCTION='partial(2,"****",3)');

If we then query the table we see the following

DDM10

If we look at the second row with the Address1 value of 100 The Street we can see it’s taken the first 2 characters 10 and the last 3 characters eet and stuffed **** in between to give 10****eet.

Let’s see how a similar mask would work on shorter data by adding the following mask to the LastName column.

ALTER TABLE DDM.MaskingDemo1
 ALTER COLUMN LastName 
 ADD MASKED WITH (FUNCTION='partial(1,"%",1)');

At most this will return 3 characters: the first and last with % in between.

DDM11

The first two LastName values are 7 and 4 characters long and these have been masked as expected. The last was xyz, which is 3 characters, and has just resulted in the y being replaced by the %. The third and fourth LastName values are 1 and 2 characters respectively; if the value’s length is less than the length of the masked output (in this case 3) then the leading and trailing characters are not shown and the mask is the only thing returned.

As mentioned above the partial function is only available on character data types so let’s see what happens when we try to add it to the DateOfBirth column

ALTER TABLE DDM.MaskingDemo1
  ALTER COLUMN DateOfBirth 
  ADD MASKED WITH (FUNCTION='partial(1,"%",1)');

Msg 16003, Level 16, State 0, Line 65
The data type of column 'DateOfBirth' does not support data masking function 'partial'.

Email

The email mask is also only available on character data types and will display the first character followed by XXX@XXXX.com

So if I add this to the Email column we see the following

ALTER TABLE DDM.MaskingDemo1
  ALTER COLUMN Email 
  ADD MASKED WITH (FUNCTION='email()');

DDM12

We can actually add the email mask to any character data type so let’s add it to the Address2 column.

ALTER TABLE DDM.MaskingDemo1
  ALTER COLUMN Address2 
  ADD MASKED WITH (FUNCTION='email()');

DDM13

As you can see the data doesn’t actually have to be an email and the email mask is basically just a specific version of the partial mask. We can replicate it as follows

ALTER TABLE DDM.MaskingDemo1
  ALTER COLUMN Email 
  ADD MASKED WITH (FUNCTION='partial(1,"XXX@XXXX.com",0)');

DDM14

We can also replicate the default mask on character data types using the partial mask as follows

ALTER TABLE DDM.MaskingDemo1
  ALTER COLUMN FirstName 
  ADD MASKED WITH (FUNCTION='partial(0,"xxxx",0)');

Removing a mask

Removing the mask from a column can be done as follows

ALTER TABLE DDM.MaskingDemo1
  ALTER COLUMN Address2 DROP MASKED;

Now the Address2 column is no longer masked

DDM16

SELECT INTO and Computed Columns

If we add a computed column based on a masked column then the computed value is also masked. I’ve created 5 different computed columns based on the FirstName, MiddleName and LastName columns. It seems that if any concatenation is applied in the computed column then the default mask is always used, so just appending abc to the LastName column, which has the partial mask, results in the value being returned as xxxx. This is shown in the FullName4 column.

ALTER TABLE DDM.MaskingDemo1
  ADD FullName1 AS (FirstName + ' ' + ISNULL(MiddleName, '') + ' ' + LastName),
  FullName2 AS (ISNULL(MiddleName, '') + ' ' + LastName + ' ' + FirstName),
  FullName3 AS (LastName + ' ' + FirstName + ' ' + ISNULL(MiddleName, '')),
  FullName4 AS (LastName + 'abc'),
  FullName5 AS (LastName)

DDM17

If we view the data with an account with the unmask permission we can see the computed columns as expected

DDM18

A similar thing happens when loading data from one table to another. If the source table is masked and the user loading the data into the destination table does not have the unmask permission then the masked data is loaded into the destination and can never be unmasked. We can demo this with the SELECT INTO statement. If I run this as a user without unmask

EXECUTE AS USER = 'DDMTestUser';
SELECT *
INTO #masking
FROM DDM.MaskingDemo1;
REVERT;

… then select from #masking with a user with unmask permission we can see the following

SELECT *
FROM DDM.MaskingDemo1;

SELECT *
FROM #masking;

DDM19

Finally

To find out what columns in a database are masked you can run the following query using the sys.masked_columns view

SELECT
  SCHEMA_NAME(o.schema_id),
  OBJECT_NAME(o.object_id),
  mc.name,
  mc.masking_function
FROM sys.masked_columns mc
JOIN sys.objects o ON o.object_id = mc.object_id
ORDER BY
  SCHEMA_NAME(o.schema_id),
  OBJECT_NAME(o.object_id),
  mc.column_id;

DDM15


Paging in T-SQL

As part of my prep for the 70-461 exam I learnt how to use the new paging functionality that was made available in SQL Server 2012. This uses the OFFSET and FETCH clauses, which I’ve never used since and have completely forgotten how they work. So I’m going to write out a demonstration using them to refresh my memory for my prep for exam 70-761. Please note, this should not be confused with the FETCH statement available when using cursors, although the concept is the same.

It turns out there is not much to this so here’s how it works:

You have to use the ORDER BY clause and then follow it with OFFSET, specifying the number of rows to skip. Finally you can optionally use FETCH to return only a certain number of rows. If you don’t use FETCH you will get back all the remaining rows after the OFFSET.

For a demo I’ll populate a table with numbers 1 to 50 as follows:

CREATE TABLE dbo.OffestFetchDemo (Id INT);

WITH cte
AS ( 
 SELECT 1 AS n
 UNION ALL
 SELECT n + 1
 FROM cte
 WHERE n < 50
)
INSERT INTO dbo.OffestFetchDemo (Id)
SELECT n
FROM cte;

Next I can run the following to return only 10 rows, starting from row 6

SELECT *
FROM dbo.OffestFetchDemo
ORDER BY Id
OFFSET 5 ROWS FETCH NEXT 10 ROWS ONLY;

OF1

You can choose not to skip any rows by using an offset of zero. This will return the first 5 rows.

SELECT *
FROM dbo.OffestFetchDemo
ORDER BY Id
OFFSET 0 ROWS FETCH NEXT 5 ROWS ONLY;

OF2

As mentioned you can omit the FETCH and just return rows 46 onward as follows.

SELECT *
FROM dbo.OffestFetchDemo
ORDER BY Id
OFFSET 45 ROWS;

OF3

You can also return everything by using an offset of zero and omitting the FETCH (not that you would ever want to do this)

SELECT *
FROM dbo.OffestFetchDemo
ORDER BY Id
OFFSET 0 ROWS;

The OFFSET FETCH syntax is a bit flexible in that you can swap ROWS with ROW and NEXT with FIRST, so

SELECT *
FROM dbo.OffestFetchDemo
ORDER BY Id
OFFSET 5 ROWS FETCH NEXT 10 ROWS ONLY;

is the same as

SELECT *
FROM dbo.OffestFetchDemo
ORDER BY Id
OFFSET 5 ROW FETCH FIRST 10 ROW ONLY;
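In practice OFFSET-FETCH is usually driven by paging parameters. Here’s a minimal sketch against the same demo table, assuming hypothetical @PageNumber and @PageSize variables (not part of the original demo):

DECLARE @PageNumber INT = 2; -- hypothetical paging parameters
DECLARE @PageSize INT = 10;

SELECT *
FROM dbo.OffestFetchDemo
ORDER BY Id
OFFSET (@PageNumber - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;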


Indexed Views and Blocking

I recently came across a situation in SQL Server where one process was blocking another and there seemed to be absolutely no link between the two tables in question. It took a little while to work out what was causing the blocking but I’ve managed to recreate what I found with a simpler example.

We’d identified a bug where a MERGE statement into a table was now taking a very long time to complete but the part I couldn’t immediately understand was why this long running merge statement was blocking an update on another seemingly unrelated table.

Here’s something similar to what I found using sp_whoisactive

01

The blocking query (this is equivalent to the MERGE statement I was seeing on my live environment) is as follows

UPDATE dbo.Address
SET PostCode = 'AA11AA'
WHERE AddressID = 1;

and the query being blocked is

UPDATE dbo.Person
SET Name = 'Simon'
WHERE PersonID = 1;

In my database there is no direct link between the Person and Address tables, e.g. no direct foreign keys from one table to another, and there are no explicit transactions being used – so there isn’t an update to the Person table in the same transaction as the update to the Address table, each will just commit once completed.

The block was caused by an indexed view that references both of the tables involved in the two updates.

The bug has since been fixed and the MERGE statement now runs very quickly, so the block doesn’t last long, but it’s certainly something I will be more wary of when using indexed views. For example, if you need to run a large update on a table within an indexed view, such as a backfill of data, you might inadvertently block writes to one or more other tables within that same indexed view.

If it’s not obvious why SQL Server has to do this, below are details of what is actually happening during the block.


To replicate the issue I ran the following code to create a few tables and put an indexed view over the top of them.  This is just something simple that links people to addresses in a many-to-many relationship.

CREATE TABLE dbo.Person
( 
  PersonID INT CONSTRAINT Person_PersonID PRIMARY KEY NOT NULL IDENTITY(1,1), 
  Name VARCHAR(100), 
  DOB DATETIME2(0)
);
GO

CREATE TABLE dbo.AddressType
( 
  AddressTypeID INT CONSTRAINT AddressType_AddressTypeID PRIMARY KEY NOT NULL IDENTITY(1,1), 
  AddressType VARCHAR(100)
);
GO

CREATE TABLE dbo.Address
(
  AddressID INT CONSTRAINT Address_AddressID PRIMARY KEY NOT NULL IDENTITY(1,1), 
  AddressLine1 VARCHAR(100), 
  AddressLine2 VARCHAR(100), 
  PostCode VARCHAR(10), 
  AddressTypeID INT CONSTRAINT FK_Address_AddressTypeID FOREIGN KEY REFERENCES dbo.AddressType (AddressTypeID)
);
GO
CREATE TABLE dbo.PersonAddress
(
  PersonID INT CONSTRAINT FK_PersonAddress_PersonID FOREIGN KEY REFERENCES dbo.Person (PersonID), 
  AddressID INT CONSTRAINT FK_PersonAddress_AddressID FOREIGN KEY REFERENCES dbo.Address (AddressID)
);
GO
CREATE VIEW dbo.PeopleAddresses
WITH SCHEMABINDING
AS
SELECT P.PersonID, P.Name, P.DOB, A.AddressLine1, A.PostCode, T.AddressType
FROM dbo.PersonAddress PA
JOIN dbo.Person P ON P.PersonID = PA.PersonID
JOIN dbo.Address A ON A.AddressID = PA.AddressID
JOIN dbo.AddressType T ON T.AddressTypeID = A.AddressTypeID;
GO
CREATE UNIQUE CLUSTERED INDEX IX_PeopleAddresses ON dbo.PeopleAddresses (PersonID, Name, DOB, AddressLine1, PostCode, AddressType);
GO

I’ve then put a couple of rows in the AddressType table and one row in each of the other tables as follows

INSERT INTO dbo.AddressType (AddressType)
VALUES ('Home'), ('Work');
GO

INSERT INTO dbo.Address (AddressLine1, AddressLine2, PostCode, AddressTypeID)
VALUES ('AddressLine1', 'AddressLine2', 'PostCode', 1);
GO

INSERT INTO dbo.Person (Name, DOB)
VALUES ('My Name', '20000101');
GO

INSERT INTO dbo.PersonAddress (AddressID, PersonID)
SELECT PersonID, PersonID
FROM dbo.Person;
GO

Next I’m going to update the PostCode in the Address table for our single row.  To mimic a long running update I am actually explicitly beginning a transaction.

BEGIN TRAN

UPDATE dbo.Address
SET PostCode = 'AA11AA'
WHERE AddressID = 1;

As this transaction is still active we can check which locks have been taken using the following query, which gives the results below (the SPID of my query is 54).

SELECT
  ISNULL(OBJECT_NAME(p.object_id),o.name) AS ObjectName, 
  resource_type, 
  resource_subtype, 
  resource_description, 
  resource_associated_entity_id, 
  request_mode, 
  request_type, 
  request_status
FROM sys.dm_tran_locks l
LEFT JOIN sys.partitions p ON p.hobt_id = l.resource_associated_entity_id
LEFT JOIN sys.objects o ON o.object_id = l.resource_associated_entity_id
WHERE request_session_id = 54
ORDER BY resource_type, ObjectName

02

As expected we’ve got a shared lock (S) on the database and the second row is the exclusive lock (X) on the row we’re updating in the Address table. This lock also means we get intent exclusive locks (IX) on the page and the table. SQL Server takes these intent locks for two reasons. Firstly, if the number of locks taken on individual rows breaches the threshold, the lock gets escalated to the table, and the intent lock stops that escalation being blocked. Secondly, for performance: without it, another process trying to update the table would have to check every row for existing locks, whereas the intent lock tells that process there is a lock somewhere below so it has to wait. For more details on locking, blocking and isolation levels see the links below.

As well as the expected locks on the Address table we’ve also got the X and IX locks on the PeopleAddresses indexed view which means that we would be blocked reading from PeopleAddresses (unless using the read uncommitted isolation level).

So if we now try and update the Person table in a different process as follows

UPDATE dbo.Person
SET Name = 'Simon'
WHERE PersonID = 1;

this query just hangs and we see the same blocking in sp_whoisactive as shown at the top of this post.

These are the locks held by the hanging query for the update on the Person table

03

An exclusive lock (X) has been granted on the row in the Person table but the process is also requesting a shared lock (S) on the row in the Address table.  This row already has the X lock from SPID 54 so the Person update is blocked.

In this example we had one row in Address and one in Person and they were connected using the PersonAddress table. But what would happen if we tried to update a Person record that isn’t linked to the Address?

Let’s add a new Person, a new Address and link them together.

INSERT INTO dbo.Address (AddressLine1, AddressLine2, PostCode, AddressTypeID)
VALUES ('AddressLine1', 'AddressLine2', 'PostCode', 2);
GO

INSERT INTO dbo.Person (Name, DOB)
VALUES ('My Name', '20000101');
GO

INSERT INTO dbo.PersonAddress (AddressID, PersonID)
VALUES (2,2);
GO

So now let’s mimic the long running lock on the Address table for the first address again

UPDATE dbo.Address
SET PostCode = 'AA11AA'
WHERE AddressID = 1;

and then try and update the Person record for the second person row

UPDATE dbo.Person
SET Name = 'Simon'
WHERE PersonID = 2;

This works without any blocking because there is no lock directly on the Person table and the corresponding row in the indexed view is not locked.

If we run both UPDATE statements within explicit transactions we can see which locks are taken. SPID 54 is the Address update and 56 is the Person update.

04

As you can see, no resource is locked by both SPIDs with the exception of the IX locks on the PeopleAddresses PAGE (1:309096). However, this does not cause blocking as IX locks are compatible with each other.

This just shows what happens when row-level locks are taken, but how does lock escalation affect things? Let’s insert lots more data and see.

INSERT INTO dbo.Address (AddressLine1, AddressLine2, PostCode, AddressTypeID)
SELECT AddressLine1, AddressLine2, PostCode, AddressTypeID
FROM dbo.Address;
GO 20

INSERT INTO dbo.Person (Name, DOB)
SELECT Name, DOB
FROM dbo.Person;
GO 20

INSERT INTO dbo.PersonAddress (AddressID, PersonID)
SELECT PersonID, PersonID
FROM dbo.Person
WHERE PersonID > 2

This means we’ve now got 2,097,152 rows in each table

Now let’s try similar queries to before but updating 100,000 rows in the Address table in the first query

UPDATE dbo.Address
SET PostCode = 'AA11AA'
WHERE AddressID <= 100000;

05

So many row level locks were taken for this update that SQL Server has decided it’s better to escalate the lock to the whole table.  So rather than exclusive (X) locks on the rows (KEY) we now have exclusive locks on both the Address table and PeopleAddresses indexed view.  So now an update on any single row in the Person table, even one not linked to one of the 100,000 Address rows being updated, is blocked. For example, the following would be blocked.

UPDATE dbo.Person
SET Name = 'Simon'
WHERE PersonID = 2000000;

So finally, what happens if the first query updates a column that is not included in the indexed view’s select list but is part of one of the tables in the view?

I purposely left the AddressLine2 column from the Address table out of the PeopleAddresses indexed view.

If the first update just updates this AddressLine2 column

UPDATE dbo.Address
SET AddressLine2 = 'New Address 1'
WHERE AddressID = 1;

no lock is placed on the PeopleAddresses indexed view at all

06

This means there are only locks on the Address table itself and any updates to the Person table would not be blocked.

For more about locking, blocking and isolation levels see my posts from a few years ago…

Locking, Blocking and Isolation Levels

Key-Range Locks in the Serializable Isolation Level and using sys.dm_tran_locks and DBCC PAGE to see what’s locked


Using $PARTITION to find the last time an event occurred

In a previous post I showed how you can use the $PARTITION system function to force your query to only look at one or more partitions and today I have a semi-practical example of how to use this.

We have a very big table that holds all our financial transactions along with an integer indicating what type of transaction each one is. I wanted to find out when we last recorded a certain transaction type, as I think it might have been removed quite some time ago.

This could obviously be done as follows:

SELECT MAX(T.MyDate)
FROM dbo.MyTable T (NOLOCK)
WHERE T.TransTypeId = X

and depending on how your table is structured and indexed this may be the best way to do it.

Our table is partitioned by month on our date column so I can use $PARTITION to search each partition, starting with the most recent, and search for the last date TransTypeID = X was used.  When I find a value we can stop.  This can be done as follows:

DECLARE @Counter INT = 200 -- this should be the maximum partition ID on our table
DECLARE @MaxDate DATETIME2(0);

WHILE @Counter > 0
BEGIN

  RAISERROR('%d', 0,1,@Counter) WITH NOWAIT;

  SELECT @MaxDate = MAX(T.MyDate)
  FROM dbo.MyTable T
  WHERE T.TransTypeId = X
  AND $partition.PF_MyTable(MyDate) = @Counter

  IF @MaxDate IS NOT NULL
  BEGIN
    SELECT @MaxDate, @Counter
    BREAK;
  END

  SET @Counter -= 1;

END;
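Rather than hard-coding 200 as the starting value for @Counter, the maximum partition number could be looked up first. Here’s a minimal sketch using sys.partitions (assuming the same dbo.MyTable):

-- find the highest partition number on the table to use as the loop's starting point
SELECT @Counter = MAX(partition_number)
FROM sys.partitions
WHERE object_id = OBJECT_ID('dbo.MyTable');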

If TransTypeID X has occurred recently then this code won’t need to loop through many partitions but if it last occurred in partition 1 then this is probably going to be much slower.  It all depends on how your table is set up but I think it’s always nice to have multiple ways to potentially speed up a query.

More Partitioning

In a previous post I wrote an introduction to partitioning with a simple demo. In this post I’m going to use a larger data set to compare the performance of partitioned and non-partitioned tables. I’ll also show that you need to make sure any queries where you want to use partition elimination have the correct search terms.

Firstly, I’ll create a new partition function and scheme that will partition across every month in 2016.

CREATE PARTITION FUNCTION BlogPartFunc1 (DATETIME2(0))
AS RANGE RIGHT FOR VALUES ('20160101', '20160201', '20160301', '20160401', 
 '20160501', '20160601', '20160701', '20160801', 
 '20160901', '20161001', '20161101', '20161201' );
GO
CREATE PARTITION SCHEME BlogPartScheme1
AS PARTITION BlogPartFunc1 ALL TO ([PRIMARY]);
GO

Next I’ll create two tables…

CREATE TABLE dbo.LargeVolumeNonPartitioned
(
 Id INT NOT NULL IDENTITY(1,1),
 SomeDate DATETIME2(0) NOT NULL,
 SomePadding1 NVARCHAR(4000) NOT NULL,
 CONSTRAINT PK_LargeVolumeNonPartitioned PRIMARY KEY CLUSTERED (Id, SomeDate)
) ON [PRIMARY];
GO
CREATE TABLE dbo.LargeVolumePartitioned
(
 Id INT NOT NULL IDENTITY(1,1),
 SomeDate DATETIME2(0) NOT NULL,
 SomePadding1 NVARCHAR(4000) NOT NULL,
 CONSTRAINT PK_LargeVolumePartitioned PRIMARY KEY CLUSTERED (Id, SomeDate)
) ON BlogPartScheme1 (SomeDate);
GO

This has created two tables with the same columns, one partitioned and one not, both with a clustered key on the Id and SomeDate columns. Note that I’ve purposely not put the SomeDate column as the first column in the clustered index, as doing so would bring the performance of querying the non-partitioned table close to that of the partitioned table when filtering on a date range. I’ll explain more about this further down. There is also a SomePadding1 column where we can insert a large amount of text data; this ensures we don’t get too many rows stored on the same data page and highlights the performance difference even further.

I will run the following to stick some data in the two tables

SET NOCOUNT ON
DECLARE @DateTime DATETIME2(0) = '20151201'

WHILE @DateTime < '20170101'
BEGIN

 INSERT INTO dbo.LargeVolumePartitioned (SomeDate, SomePadding1)
 SELECT @DateTime, REPLICATE('A', 1000);

 INSERT INTO dbo.LargeVolumeNonPartitioned (SomeDate, SomePadding1)
 SELECT @DateTime, REPLICATE('A', 1000);

 SET @DateTime = DATEADD(MINUTE, 1, @DateTime)

END
GO

This will add a row for every minute between 01/12/2015 and 01/01/2017.  This takes a few minutes to run so feel free to cut it down if you’re using it.

This gives us 571,680 rows in each table, 1440 for each day and spread over just under 145,000 data pages

If we run a query to return all rows for 01/01/2016 from both tables in the same batch we obviously get the same data returned but we get a slightly different execution plan and different info returned by setting statistics IO and time on.

SET STATISTICS IO ON
SET STATISTICS TIME ON

SELECT *
FROM dbo.LargeVolumeNonPartitioned LVP
WHERE LVP.SomeDate >= CAST('20160101' AS DATETIME2(0))
AND LVP.SomeDate < CAST('20160102' AS DATETIME2(0))

SELECT *
FROM dbo.LargeVolumePartitioned LVP
WHERE LVP.SomeDate >= CAST('20160101' AS DATETIME2(0))
AND LVP.SomeDate < CAST('20160102' AS DATETIME2(0))

We get the following in the Messages tab from statistics IO and time

(1440 row(s) affected)
Table 'LargeVolumeNonPartitioned'. Scan count 5, logical reads 162373, physical reads 0, read-ahead reads 79043, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
 CPU time = 374 ms, elapsed time = 9401 ms.
SQL Server parse and compile time: 
 CPU time = 0 ms, elapsed time = 0 ms.

(1440 row(s) affected)
Table 'LargeVolumePartitioned'. Scan count 1, logical reads 11219, physical reads 0, read-ahead reads 264, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
 CPU time = 16 ms, elapsed time = 243 ms.
SQL Server parse and compile time: 
 CPU time = 0 ms, elapsed time = 0 ms.

There is obviously a large difference in the page reads, and the elapsed query time drops from over 9 seconds to 0.243 seconds.

Here are the execution plans for both

01

If we look at the properties of the Clustered Index Scan on the partitioned table we see that we’ve only accessed 1 partition (Actual Partition Count) and that partition 2 is the one that’s been accessed (Actual Partitions Accessed), i.e. the optimiser has used partition elimination to just look at the subset of data in that single partition and scan only that. The Clustered Index Scan on the non-partitioned table, on the other hand, has had to scan the entire table.

02

As I mentioned above I purposely didn’t put the SomeDate column as the first column in the clustered index. If I had, a clustered index seek operation would probably have been performed instead of the scans shown in the execution plans above. This means that on the non-partitioned table the query engine has had to read every page in the table. On the partitioned table, the optimiser has already told the query engine to just use partition 2 so it only has to do a full scan of that subset of data.

So, like everything in SQL Server, whether or not you partition a table depends on how you want to store your data and how you read and write it. If you have a large table and several of your queries are causing long clustered index scans you might benefit from trying to partition the table (or your clustered index needs to be changed or new non-clustered indexes added).

We have a table with a composite clustered key spread across 6 different columns. We query this table in our code in lots of different ways and it was best to have a particular column as the first column in the clustered index as most queries used this one. However, we have a date column that stores the date the event for that row of data occurred. This is part of the clustered key and we have partitioned the table on this column by month. This allows us to quickly reduce the data set to the current month or last 6 months in lots of other queries and then perform additional filtering on the smaller data sets.

One thing that you do have to be wary of is ensuring your query allows partition elimination to occur. In the example above the SomeDate column is a DATETIME2(0) data type. You might have noticed that when filtering my query on the SomeDate column I explicitly cast the textual date values to a DATETIME2(0) data type. If I hadn’t done this we would have seen the following results in the statistics IO and the Clustered Index Scan operation

SELECT *
FROM dbo.LargeVolumePartitioned LVP
WHERE LVP.SomeDate >= '20160101'
AND LVP.SomeDate < '20160102'

(1440 row(s) affected)
Table 'LargeVolumePartitioned'. Scan count 13, logical reads 143680, physical reads 0, read-ahead reads 132006, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
 CPU time = 483 ms, elapsed time = 2157 ms.
SQL Server parse and compile time: 
 CPU time = 0 ms, elapsed time = 0 ms.

03

The page reads have gone up from 11,219 to 143,680 and the Clustered Index Scan has now accessed all 13 partitions.

Even if we cast the date value to a DATETIME2 but with a different precision we still don’t get partition elimination.

SELECT *
FROM dbo.LargeVolumePartitioned LVP
WHERE LVP.SomeDate >= CAST('20160101' AS DATETIME2(1))
AND LVP.SomeDate < CAST('20160102' AS DATETIME2(1))

04

The same is true for other data types such as INT, BIGINT etc.  It’s something you need to be wary of when querying partitioned tables.

One last thing I want to demonstrate is a way of forcing the query engine to just look at one or more partitions. We can do this with the $PARTITION system function. We can use this to work out which partition a certain value sits on, and once we know that we can use it to filter our data.

Firstly, let’s use it to show how many rows exist on each partition as follows…

SELECT $PARTITION.BlogPartFunc1(SomeDate) AS PartitionNumber, COUNT(*) as RowCount
FROM dbo.LargeVolumePartitioned LVP
GROUP BY $PARTITION.BlogPartFunc1(SomeDate)
ORDER BY 1

10

Here PartitionNumber = 1 is all the data before 01/01/2016, PartitionNumber = 2 is all the data in January 2016 and so on. As you’ll remember we’ve got a row for every minute between 01/12/2015 and 01/01/2017 and you can see that all the months with 31 days have 44,640 rows, all the months with 30 days have 43,200 and February (PartitionNumber = 3) has 41,760 rows.

The query above comparing the partitioned and non-partitioned tables was looking at data just on 01/01/2016.  We know that this data sits on PartitionNumber = 2 but we can confirm this as follows…

SELECT $PARTITION.BlogPartFunc1('20160101');

11

Let’s say we want to return all the data for February.  We know this is on partition 3 so we can run the following…

SELECT *
FROM dbo.LargeVolumePartitioned LVP
WHERE $PARTITION.BlogPartFunc1(SomeDate) = 3

12

If we check the properties of the Clustered Index Scan operator in the execution plan we can see that only partition 3 was accessed.

13

Partitioning is a powerful optimisation tool and is something that definitely should be considered at design time, particularly on large tables where you might want to filter in several different ways but are usually only interested in the latest data.

There is one final part of partitioning that I want to demonstrate and this is Partition Switching which I will show in the next post.


Introduction to Partitioning

Introduction

Table and index partitioning is something that has been around for a while and it’s something we use a lot on some of our large tables, particularly where we only really want to access the newest data and can do without having to think about older data. There are two parts to partitioning that I think are very useful. One is the ability to switch partitions between tables. This is handy, for example, if you have a “live” table and an “archive” table. You may only want to keep the last 12 months’ worth of data in your “live” table but still keep the older data in the “archive” table for various reasons including auditing. You might want a process that archives the oldest month of data from the “live” table to the “archive” table. Without partitioning you’ll likely have to copy the data from the “live” table into the “archive” table and delete the “live” data. This means that all that data gets moved from one bunch of data pages to another. If your “live” and “archive” tables are both partitioned by month then you can just switch the oldest partition from “live” to “archive”. This leaves the data in the same pages and just updates the metadata that says those pages now belong to the “archive” table. I’m not going to demo partition switching in full here but there is some more detail if you are interested: https://technet.microsoft.com/en-us/library/ms191160(v=sql.105).aspx
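The switch itself is a single metadata-only statement. As a minimal sketch, assuming hypothetical dbo.LiveTable and dbo.ArchiveTable tables with identical structure, indexes and partitioning (these names are just for illustration):

-- move the oldest month (partition 1) out of the "live" table into the "archive" table
ALTER TABLE dbo.LiveTable
SWITCH PARTITION 1 TO dbo.ArchiveTable PARTITION 1;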

The second useful part is the bit I use all the time and this is a performance improvement where we only touch a small number of the partitions in a table rather than the whole table. This is called “partition elimination” and is where the optimiser works out it doesn’t have to bother with certain partitions so eliminates them from the execution plan. An example of this is a bunch of reporting tables we have behind a website that calls a stored proc to read the report data. The reports are all displayed by month and you choose the month you want to view from a drop-down list at the top. These reporting tables are partitioned by month and only one partition is ever accessed at a time.

You can store each partition on a different filegroup so you can have data that is accessed more frequently on disks with better I/O and put stuff rarely touched on less efficient disks.  In the following demo I’m showing you how to put the partitions on different filegroups but the main point for this is to show the performance improvement with partition elimination.

Demo

I’ve already got a database “SimonBlog” with a single PRIMARY filegroup.  I’m now going to add two new filegroups and files.

USE master
GO

ALTER DATABASE SimonBlog
ADD FILEGROUP BlogFG1;
GO

ALTER DATABASE SimonBlog
ADD FILEGROUP BlogFG2;
GO

ALTER DATABASE SimonBlog 
ADD FILE 
(
 NAME = BlogFG1F1,
 FILENAME = 'D:\Data\MSSQL11.MSSQLSERVER\MSSQL\DATA\BlogFG1F1.ndf',
 SIZE = 50MB,
 MAXSIZE = 1000MB,
 FILEGROWTH = 50MB
)
TO FILEGROUP BlogFG1;
GO

ALTER DATABASE SimonBlog 
ADD FILE 
(
 NAME = BlogFG2F1,
 FILENAME = 'D:\Data\MSSQL11.MSSQLSERVER\MSSQL\DATA\BlogFG2F1.ndf',
 SIZE = 50MB,
 MAXSIZE = 1000MB,
 FILEGROWTH = 50MB
)
TO FILEGROUP BlogFG2;
GO

Before we can create a partitioned table we need to create a partition function and a scheme that maps the partitions to one or more filegroups. The partition function is the thing that specifies how you want your table partitioned and it can be used by more than one table.

So let’s create a partition function that partitions the first quarter of this year into different months. The partitioning data type is DATETIME2(0), which means we can only use this partition function on a table with a DATETIME2(0) column.

USE SimonBlog
GO

CREATE PARTITION FUNCTION BlogPartFunc (DATETIME2(0))
AS RANGE RIGHT FOR VALUES ('20160101', '20160201', '20160301');

We can see this partition function and its values in the following DMVs

SELECT *
FROM sys.partition_functions;

partitioning01

SELECT *
FROM sys.partition_range_values;

partitioning02

Even though we’ve only specified three boundaries, ‘20160101’, ‘20160201’ and ‘20160301’, we will actually have four possible partitions: first, everything up to ‘20160101’; second, everything from ‘20160101’ up to ‘20160201’; third, everything from ‘20160201’ up to ‘20160301’; and fourth, everything from ‘20160301’ onward.

The “RIGHT” keyword in the “…AS RANGE RIGHT FOR VALUES…” part can either be “LEFT” or “RIGHT” and specifies whether the boundary value falls in the left partition or the right. In this case we’ve used “RIGHT” and this means the first partition is everything up to but not including ‘20160101’ and the second partition is everything from ‘20160101’ onward up to the next boundary, i.e. the boundary value of ‘20160101’ is included in the second partition. If we used “LEFT” then ‘20160101’ would be in the first partition, i.e. it would be everything up to and including ‘20160101’.
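For comparison, the LEFT version of the same function would be declared as follows (BlogPartFuncLeft is just for illustration and isn’t used in the rest of this post); here ‘20160101’ would fall into the first partition:

CREATE PARTITION FUNCTION BlogPartFuncLeft (DATETIME2(0))
AS RANGE LEFT FOR VALUES ('20160101', '20160201', '20160301');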

So anyway, we have four possible partitions and now we need to map these to one or more filegroups.  We can map all partitions to one filegroup in the following way

CREATE PARTITION SCHEME BlogPartScheme
AS PARTITION BlogPartFunc ALL TO ([PRIMARY]);

or we can specify where each of the four partitions goes

CREATE PARTITION SCHEME BlogPartScheme
AS PARTITION BlogPartFunc TO (BlogFG1, [PRIMARY], BlogFG2, [PRIMARY]);

This puts the first partition on the filegroup BlogFG1, the third partition on BlogFG2 and the second and fourth both on PRIMARY – you probably wouldn’t do this but this just highlights that it’s completely flexible.

We can now create our partitioned table.  We just have to specify that the table is being created on our partition scheme and what column is going to be used for partitioning.

CREATE TABLE dbo.BlogPartitionTest
(
 Id INT NOT NULL IDENTITY(1,1),
 SomeDate DATETIME2(0) NOT NULL
) ON BlogPartScheme (SomeDate);

You might notice that this table is currently a heap. There is a requirement that the partitioning column must be part of any unique index on the table, so I couldn’t use the Id column on its own as a PRIMARY KEY. If I try to create a new table on the same partition scheme with a PRIMARY KEY on just the Id column I get an error

CREATE TABLE dbo.BlogPartitionTest2
(
 Id INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
 SomeDate DATETIME2(0) NOT NULL
) ON BlogPartScheme (SomeDate);

Msg 1908, Level 16, State 1, Line 1
Column 'SomeDate' is partitioning column of the index 'PK__BlogPartitionTes__5EA4867D'. Partition columns for a unique index must be a subset of the index key.
Msg 1750, Level 16, State 0, Line 1
Could not create constraint. See previous errors.

The table doesn’t have to be a heap; it can have a clustered index, the index just can’t be unique unless the partitioning column is included. For example, we can add a non-unique clustered index to the first table

CREATE CLUSTERED INDEX IX_BlogPartitionTest ON dbo.BlogPartitionTest (Id)
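Alternatively, the table could keep a unique clustered primary key as long as the partitioning column is part of the key. Here’s a minimal sketch (BlogPartitionTest3 is a hypothetical table, just for illustration):

CREATE TABLE dbo.BlogPartitionTest3
(
 Id INT NOT NULL IDENTITY(1,1),
 SomeDate DATETIME2(0) NOT NULL,
 CONSTRAINT PK_BlogPartitionTest3 PRIMARY KEY CLUSTERED (Id, SomeDate)
) ON BlogPartScheme (SomeDate);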

We can now query another DMV to view our four partitions

SELECT *
FROM sys.partitions
WHERE OBJECT_NAME(OBJECT_ID) = 'BlogPartitionTest'

partitioning03

Please note, if you query sys.partitions with no WHERE clause you might be surprised to see several rows returned.  This DMV considers every table to contain at least one partition, so all non-partitioned tables are actually tables with a single partition.

Now we can insert some data across the various partitions and see where it all sits

INSERT INTO dbo.BlogPartitionTest(SomeDate)
VALUES ('20151203'),
('20160101'),
('20160131'),
('20160215'),
('20160302');

So we’ve inserted five rows. Looking at the dates it should be obvious that the first row will be on the first partition, the second and third rows will be on the second partition, the fourth row will be on the third partition and finally the fifth row on the last partition. If we re-query sys.partitions we can see the row counts confirm this

SELECT *
FROM sys.partitions
WHERE OBJECT_NAME(OBJECT_ID) = 'BlogPartitionTest'

partitioning04

Now we can query the table and try to pull back the row for ‘20160101’ as follows

SELECT *
FROM dbo.BlogPartitionTest
WHERE SomeDate = CAST('20160101' AS DATETIME2(0))

partitioning05

If we include the execution plan we can view the properties of the data retrieval operation and confirm that partition elimination has been used and only one partition has been accessed

partitioning06

partitioning07

The “Actual Partition Count” tells us that only 1 partition was accessed and “Actual Partitions Accessed” tells us that partition 2 was the only one accessed.

If we run a select over the whole table and view the properties of the data retrieval operation we see the following

partitioning08

As expected we can see that all 4 partitions were accessed.

This demo only contains a very small amount of data but in the next post I’ll show another demo with a larger volume of data and compare the performance of partitioned versus non-partitioned tables. I’ll also show that you need to make sure any queries where you want partition elimination to occur use the correct search terms.

Error when trying to create SSISDB catalog: An error occurred during Service Master Key decryption

I’ve just installed SSIS on my PC and tried to create the SSISDB catalog in SSMS as shown below

01

When I entered a password and clicked OK I got the following error:

An error occurred during Service Master Key decryption. Changed database context to ‘SSISDB’. (.Net SqlClient Data Provider).

To fix this I had to run the following

ALTER SERVICE MASTER KEY FORCE REGENERATE