Introduction to Partitioning

Introduction

Table and index partitioning is something that has been around for a while and it’s something we use a lot on some of our large tables, particularly where we only really want to access the newest data and can do without having to think about older data.  There are two parts to partitioning that I think are very useful.  One is the ability to switch partitions between tables.  This is handy, for example, if you have a “live” table and an “archive” table.  You may only want to keep the last 12 months worth of data in your “live” table but still keep the older data in the “archive” table for various reasons including auditing.  You might want a process that archives the oldest month of data from the “live” table to the “archive” table.  Without partitioning you’ll likely have to copy the data from the “live” table into the “archive” table and delete the “live” data.  This means that all that data gets moved from one bunch of data pages to another.  If your “live” and “archive” tables are both partitioned by month then you can just switch the oldest partition from “live” to “archive”.  This leaves the data in the same pages and just updates that metadata that says those pages belong to the “archive” table now.  I’m not going to demo partition switching here but here is some more detail if you are interested: https://technet.microsoft.com/en-us/library/ms191160(v=sql.105).aspx

The second useful part is the bit I use all the time and this is a performance improvement where we only touch a small number of the partitions in a table rather than the whole table.  This is called “partition elimination” and is where the optimiser works out it doesn’t have to bother with certain partitions so eliminates them from the execution plan.  An example of this is a bunch of reporting tables we have with a website that calls a stored proc to read the report data.  The reports are all displayed by month and you chose the month you want to view from a drop down list at the top.  These reporting tables are partitioned by month and only one partition is ever accessed at a time.

You can store each partition on a different filegroup so you can have data that is accessed more frequently on disks with better I/O and put stuff rarely touched on less efficient disks.  In the following demo I’m showing you how to put the partitions on different filegroups but the main point for this is to show the performance improvement with partition elimination.

Demo

I’ve already got a database “SimonBlog” with a single PRIMARY filegroup.  I’m now going to add two new filegroups and files.

USE master
GO

ALTER DATABASE SimonBlog
ADD FILEGROUP BlogFG1;
GO

ALTER DATABASE SimonBlog
ADD FILEGROUP BlogFG2;
GO

ALTER DATABASE SimonBlog 
ADD FILE 
(
 NAME = BlogFG1F1,
 FILENAME = 'D:\Data\MSSQL11.MSSQLSERVER\MSSQL\DATA\BlogFG1F1.ndf',
 SIZE = 50MB,
 MAXSIZE = 1000MB,
 FILEGROWTH = 50MB
)
TO FILEGROUP BlogFG1;
GO

ALTER DATABASE SimonBlog 
ADD FILE 
(
 NAME = BlogFG2F1,
 FILENAME = 'D:\Data\MSSQL11.MSSQLSERVER\MSSQL\DATA\BlogFG2F1.ndf',
 SIZE = 50MB,
 MAXSIZE = 1000MB,
 FILEGROWTH = 50MB
)
TO FILEGROUP BlogFG2;
GO

Before we can create a partitioned table we need to create a partition function and a scheme that maps the partitions to one of more filegroups.  The partition function is the thing that specifies how you want your table partitioned and this can be used by more than one table.

So let’s create a partition function that partitions the first quarter of this year into different months.  The partitioning data type is DATETIME2(0) which means we can only use this partition on a table with a DATETIME2(0) data type column.

USE SimonBlog
GO

CREATE PARTITION FUNCTION BlogPartFunc (DATETIME2(0))
AS RANGE RIGHT FOR VALUES ('20160101', '20160201', '20160301');

We can see this partition function and it’s values in the following DMVs

SELECT *
FROM sys.partition_functions;

partitioning01

SELECT *
FROM sys.partition_range_values;

partitioning02

Even though we’ve only specified three  boundaries, ‘20160101’, ‘20160201’ and ‘20160301’, we will actually have four possible partitions: first, everything up to ‘20160101’; second, everything from ‘20160101’ up to ‘20160201; third, everything from ‘20160201’ up to ‘20160301’: and fourth, everything from ‘20160301 onward.

The “RIGHT” keyword in the “…AS RANGE RIGHT FOR VALUES…” part can either be “LEFT” or “RIGHT” and specifies whether the boundary value is in the left partition or the right.  In this case we’ve used “RIGHT” and this means the first partition is everything up to but not including ‘20160101’ and the second partition is everything after and including ‘20160101’, i.e, the boundary value of ‘20160101’ is included in the second partition.  If we used “LEFT” then ‘20160101’ would be in the first partition, i.e. it would be everything up to and including ‘20160101’.

So anyway, we have four possible partitions and now we need to map these to one or more filegroups.  We can map all partitions to one filegroup in the following way

CREATE PARTITION SCHEME BlogPartScheme
AS PARTITION BlogPartFunc ALL TO ([PRIMARY]);

or we can specify where each of the four partitions goes

CREATE PARTITION SCHEME BlogPartScheme
AS PARTITION BlogPartFunc TO (BlogFG1, [PRIMARY], BlogFG2, [PRIMARY]);

This puts the first partition on the filegroup BlogFG1, the third partition on BlogFG2 and the second and fourth both on PRIMARY – you probably wouldn’t do this but this just highlights that it’s completely flexible.

We can now create our partitioned table.  We just have to specify that the table is being created on our partition scheme and what column is going to be used for partitioning.

CREATE TABLE dbo.BlogPartitionTest
(
 Id INT NOT NULL IDENTITY(1,1),
 SomeDate DATETIME2(0) NOT NULL
) ON BlogPartScheme (SomeDate);

You might notice that this table is currently a heap.  There is a requirement where the partitioning column must be part of any unique index on the table so I couldn’t use the Id column as a PRIMARY KEY.  If I tried to create a new table on the same partition scheme with a PRIMARY KEY on the Id column then I get an error

CREATE TABLE dbo.BlogPartitionTest2
(
 Id INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
 SomeDate DATETIME2(0) NOT NULL
) ON BlogPartScheme (SomeDate);

Msg 1908, Level 16, State 1, Line 1
Column 'SomeDate' is partitioning column of the index 'PK__BlogPartitionTes__5EA4867D'. Partition columns for a unique index must be a subset of the index key.
Msg 1750, Level 16, State 0, Line 1
Could not create constraint. See previous errors.

The table doesn’t have to be a heap, it can be a clustered index, it just can’t be unique unless the partitioning column is included.  For example we can add a clustered index to the first table

CREATE CLUSTERED INDEX IX_BlogPartitionTest ON dbo.BlogPartitionTest (Id)

We can now query another DMV to view our four partitions

SELECT *
FROM sys.partitions
WHERE OBJECT_NAME(OBJECT_ID) = 'BlogPartitionTest'

partitioning03

Please note, if you query sys.partitions with no WHERE clause you might be surprised to see several rows returned.  This DMV considers every table to contain at least one partition, so all non-partitioned tables are actually tables with a single partition.

Now we can insert some data across the various partitions and see where it all sits

INSERT INTO dbo.BlogPartitionTest(SomeDate)
VALUES ('20151203'),
('20160101'),
('20160131'),
('20160215'),
('20160302');

So we’ve inserted five rows.  Looking at the dates it should be obvious that the first rows will be on the first partition, the second and third rows will be on the second partition, the fourth row will be on the third partition and finally the fifth row on the last partition. If we re-query sys.partitions we can see the rows counts confirm this

SELECT *
FROM sys.partitions
WHERE OBJECT_NAME(OBJECT_ID) = 'BlogPartitionTest'

partitioning04

Now we can query the table and try to pull back the row for ‘20160101’ as follows

SELECT *
FROM dbo.BlogPartitionTest
WHERE SomeDate = CAST('20160101' AS DATETIME2(0))

partitioning05

If we include the execution plan we can view the properties of the data retrieval operation and confirm that partition elimination has been used and only one partition has been accessed

partitioning06

partitioning07

The “Actual Partition Count” tells us that on 1 partition was accessed and “Actual Partitions Accessed” tells us that partition 2 was the only one accessed.

If we run a select over the whole table and view the properties of the data retrieval operation we see the following

partitioning08

As expected we can see that all 4 partitions were accessed.

This demo only contains a very small amount of data but in the next post I’ll show another demo with a larger volume of data and compare performance of partitioned versus non-partitioned.  I’ll also show that you need make sure any queries that you want to use partition elimination use the correct search terms.