More Partitioning

In a previous post I wrote an introduction to partitioning with a simple demo.  In this post I'm going to use a larger data set to compare the performance of partitioned and non-partitioned tables.  I'll also show that you need to make sure any queries where you want to use partition elimination have correctly typed search terms.

Firstly, I’ll create a new partition function and scheme that will partition across every month in 2016.

CREATE PARTITION FUNCTION BlogPartFunc1 (DATETIME2(0))
AS RANGE RIGHT FOR VALUES ('20160101', '20160201', '20160301', '20160401', 
 '20160501', '20160601', '20160701', '20160801', 
 '20160901', '20161001', '20161101', '20161201' );
GO
CREATE PARTITION SCHEME BlogPartScheme1
AS PARTITION BlogPartFunc1 ALL TO ([PRIMARY]);
GO
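If you want to sanity check the boundary values that have just been created, you should be able to see them in the partition metadata.  Here's a minimal sketch using the sys.partition_functions and sys.partition_range_values catalog views:

SELECT pf.name AS PartitionFunction,
       prv.boundary_id,
       prv.value AS BoundaryValue
FROM sys.partition_functions AS pf
JOIN sys.partition_range_values AS prv
    ON prv.function_id = pf.function_id
WHERE pf.name = 'BlogPartFunc1'
ORDER BY prv.boundary_id;

As this is a RANGE RIGHT function, each boundary value is the first date that belongs to the partition on its right.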

Next I’ll create two tables…

CREATE TABLE dbo.LargeVolumeNonPartitioned
(
 Id INT NOT NULL IDENTITY(1,1),
 SomeDate DATETIME2(0) NOT NULL,
 SomePadding1 NVARCHAR(4000) NOT NULL,
 CONSTRAINT PK_LargeVolumeNonPartitioned PRIMARY KEY CLUSTERED (Id, SomeDate)
) ON [PRIMARY];
GO
CREATE TABLE dbo.LargeVolumePartitioned
(
 Id INT NOT NULL IDENTITY(1,1),
 SomeDate DATETIME2(0) NOT NULL,
 SomePadding1 NVARCHAR(4000) NOT NULL,
 CONSTRAINT PK_LargeVolumePartitioned PRIMARY KEY CLUSTERED (Id, SomeDate)
) ON BlogPartScheme1 (SomeDate);
GO

This has created two tables with the same columns, one partitioned and one not, both with a clustered key on the Id and SomeDate columns.  Note that I've purposely not put the SomeDate column first in the clustered index, as doing that alone would bring the performance of querying the non-partitioned table close to that of the partitioned table when filtering on a date range.  I'll explain more about this further down.  There is also a SomePadding1 column where we can insert a large amount of text data; this ensures we don't get too many rows stored on the same data page and highlights the performance difference even further.
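If you want to double check which of the two tables actually ended up on the partition scheme, something like the following should show it (a quick sketch against the sys.indexes and sys.data_spaces catalog views):

SELECT OBJECT_NAME(i.object_id) AS TableName,
       ds.name AS DataSpaceName,
       ds.type_desc
FROM sys.indexes AS i
JOIN sys.data_spaces AS ds
    ON ds.data_space_id = i.data_space_id
WHERE i.object_id IN (OBJECT_ID('dbo.LargeVolumePartitioned'), OBJECT_ID('dbo.LargeVolumeNonPartitioned'))
AND i.index_id = 1;

The partitioned table should report a type_desc of PARTITION_SCHEME and the other ROWS_FILEGROUP.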

I will run the following to stick some data in the two tables

SET NOCOUNT ON
DECLARE @DateTime DATETIME2(0) = '20151201'

WHILE @DateTime < '20170101'
BEGIN

 INSERT INTO dbo.LargeVolumePartitioned (SomeDate, SomePadding1)
 SELECT @DateTime, REPLICATE('A', 1000);

 INSERT INTO dbo.LargeVolumeNonPartitioned (SomeDate, SomePadding1)
 SELECT @DateTime, REPLICATE('A', 1000);

 SET @DateTime = DATEADD(MINUTE, 1, @DateTime)

END
GO

This will add a row for every minute between 01/12/2015 and 01/01/2017.  This takes a few minutes to run so feel free to cut it down if you’re using it.

This gives us 571,680 rows in each table, 1,440 for each day, spread over just under 145,000 data pages.
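If you want to confirm the row and page counts on your own copy, a rough check against the sys.dm_db_partition_stats DMV should do it (the exact page counts will vary a little between environments):

SELECT OBJECT_NAME(ps.object_id) AS TableName,
       SUM(ps.row_count) AS TotalRows,
       SUM(ps.used_page_count) AS UsedPages
FROM sys.dm_db_partition_stats AS ps
WHERE ps.object_id IN (OBJECT_ID('dbo.LargeVolumePartitioned'), OBJECT_ID('dbo.LargeVolumeNonPartitioned'))
AND ps.index_id = 1
GROUP BY ps.object_id;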

If we run a query to return all rows for 01/01/2016 from both tables in the same batch, we obviously get the same data back, but we get a slightly different execution plan for each and different output from SET STATISTICS IO and TIME.

SET STATISTICS IO ON
SET STATISTICS TIME ON

SELECT *
FROM dbo.LargeVolumeNonPartitioned LVP
WHERE LVP.SomeDate >= CAST('20160101' AS DATETIME2(0))
AND LVP.SomeDate < CAST('20160102' AS DATETIME2(0))

SELECT *
FROM dbo.LargeVolumePartitioned LVP
WHERE LVP.SomeDate >= CAST('20160101' AS DATETIME2(0))
AND LVP.SomeDate < CAST('20160102' AS DATETIME2(0))

We get the following in the Messages tab from statistics IO and time

(1440 row(s) affected)
Table 'LargeVolumeNonPartitioned'. Scan count 5, logical reads 162373, physical reads 0, read-ahead reads 79043, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
 CPU time = 374 ms, elapsed time = 9401 ms.
SQL Server parse and compile time: 
 CPU time = 0 ms, elapsed time = 0 ms.

(1440 row(s) affected)
Table 'LargeVolumePartitioned'. Scan count 1, logical reads 11219, physical reads 0, read-ahead reads 264, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
 CPU time = 16 ms, elapsed time = 243 ms.
SQL Server parse and compile time: 
 CPU time = 0 ms, elapsed time = 0 ms.

There is obviously a big difference in the number of page reads, and the elapsed query time drops from over 9 seconds to 0.243 seconds.

Here are the execution plans for both…

[Screenshot: execution plans for the queries against the non-partitioned and partitioned tables]

If we look at the properties of the Clustered Index Scan on the partitioned table we see that we've only accessed 1 partition (Actual Partition Count) and that partition 2 is the one that's been accessed (Actual Partitions Accessed), i.e. the optimiser has used partition elimination to look at just the subset of data in that single partition and scan only that.  The Clustered Index Scan on the non-partitioned table, on the other hand, has had to scan the entire table.

[Screenshot: Clustered Index Scan properties showing Actual Partition Count of 1 and partition 2 accessed]

As I mentioned above, I purposely didn't put the SomeDate column first in the clustered index.  If I had, a clustered index seek would probably have been performed instead of the scans shown in the execution plans above.  As it is, on the non-partitioned table the query engine has had to read every page in the table, whereas on the partitioned table the optimiser has already told the query engine to use only partition 2, so it just has to do a full scan of that subset of data.
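If you want to prove that to yourself, one option is to create a third copy of the table with SomeDate as the leading column in the clustered key and run the same date range query against it; you should then see a Clustered Index Seek instead of a scan.  This isn't part of the main demo and the table name here is just something I've made up:

CREATE TABLE dbo.LargeVolumeDateFirst
(
 Id INT NOT NULL IDENTITY(1,1),
 SomeDate DATETIME2(0) NOT NULL,
 SomePadding1 NVARCHAR(4000) NOT NULL,
 CONSTRAINT PK_LargeVolumeDateFirst PRIMARY KEY CLUSTERED (SomeDate, Id)
) ON [PRIMARY];
GO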

So, like everything in SQL Server, it depends: how you store your data and how you read and write it determines whether or not partitioning the table makes sense.  If you have a large table and several of your queries are causing long clustered index scans you might benefit from trying to partition the table (or your clustered index needs to be changed, or new non-clustered indexes added).

We have a table with a composite clustered key spread across 6 different columns.  We query this table in lots of different ways in our code, and it was best to have a particular column as the first column in the clustered index because most queries filter on it.  However, we also have a date column that stores the date the event for that row occurred.  This is part of the clustered key and we have partitioned the table on it by month.  This allows us to quickly reduce the data set to the current month, or the last 6 months, in lots of other queries and then perform additional filtering on the smaller data set.
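As a rough illustration of that pattern (the table and column names here are hypothetical, not our real schema), the queries tend to look something like this, with the date filter letting the optimiser eliminate all but the most recent partitions before the remaining predicates are applied:

DECLARE @From DATETIME2(0) = DATEADD(MONTH, -6, SYSDATETIME());

SELECT *
FROM dbo.SomeEventTable E
WHERE E.EventDate >= @From -- filter on the partitioning column first, reducing the query to the last 6 months of partitions
AND E.CustomerId = 12345;  -- then the remaining predicates only work against the much smaller data set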

One thing that you do have to be wary of is ensuring your query allows partition elimination to occur.  In the example above the SomeDate column is a DATETIME2(0) data type.  You might have noticed that when filtering my query on the SomeDate column I explicitly cast the textual date values to DATETIME2(0).  If I hadn't done this we would have seen the following results in the statistics IO output and the Clustered Index Scan operator…

SELECT *
FROM dbo.LargeVolumePartitioned LVP
WHERE LVP.SomeDate >= '20160101'
AND LVP.SomeDate < '20160102'

(1440 row(s) affected)
Table 'LargeVolumePartitioned'. Scan count 13, logical reads 143680, physical reads 0, read-ahead reads 132006, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
 CPU time = 483 ms, elapsed time = 2157 ms.
SQL Server parse and compile time: 
 CPU time = 0 ms, elapsed time = 0 ms.

[Screenshot: Clustered Index Scan properties showing all 13 partitions accessed]

The page reads have gone up from 11,219 to 143,680 and the Clustered Index Scan has now accessed all 13 partitions.

Even if we cast the date value to DATETIME2 with a different precision, we still don't get partition elimination.

SELECT *
FROM dbo.LargeVolumePartitioned LVP
WHERE LVP.SomeDate >= CAST('20160101' AS DATETIME2(1))
AND LVP.SomeDate < CAST('20160102' AS DATETIME2(1))

[Screenshot: Clustered Index Scan properties, again showing all 13 partitions accessed when DATETIME2(1) is used]

The same is true for other data types such as INT, BIGINT etc.: if the search value ends up being a different type from the partitioning column you can lose partition elimination, so it's something you need to be wary of when querying partitioned tables.
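One way to avoid the problem altogether is to put the search values into variables (or parameters) declared with exactly the same data type as the partitioning column, so nothing needs to be implicitly converted.  This should behave the same as the explicit CAST version above:

DECLARE @StartDate DATETIME2(0) = '20160101';
DECLARE @EndDate DATETIME2(0) = '20160102';

SELECT *
FROM dbo.LargeVolumePartitioned LVP
WHERE LVP.SomeDate >= @StartDate
AND LVP.SomeDate < @EndDate;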

One last thing I want to demonstrate is a way of forcing the query engine to just look at one or more partitions.  We can do this with the $PARTITION system function.  We can use it to work out which partition a certain value sits on, and once we know that we can use it to filter our data.

Firstly, let's use it to show how many rows exist on each partition as follows…

SELECT $PARTITION.BlogPartFunc1(SomeDate) AS PartitionNumber, COUNT(*) as RowCount
FROM dbo.LargeVolumePartitioned LVP
GROUP BY $PARTITION.BlogPartFunc1(SomeDate)
ORDER BY 1

[Screenshot: row counts per partition]

Here PartitionNumber = 1 is all the data before 01/01/2016, PartitionNumber = 2 is all the data in January 2016 and so on.  As you may remember, we've got a row for every minute between 01/12/2015 and 01/01/2017, and you can see that all the months with 31 days have 44,640 rows, all the months with 30 days have 43,200, and February (PartitionNumber = 3) has 41,760 rows.
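We can get a similar picture straight from the metadata, with the boundary values shown alongside the row counts.  This is a sketch against sys.partitions and sys.partition_range_values; remember that with RANGE RIGHT each boundary value is the lower bound of the partition to its right, which is why partition 1 has no boundary of its own:

SELECT p.partition_number,
       prv.value AS LowerBoundary,
       p.rows
FROM sys.partitions AS p
LEFT JOIN sys.partition_range_values AS prv
    ON prv.function_id = (SELECT function_id FROM sys.partition_functions WHERE name = 'BlogPartFunc1')
    AND prv.boundary_id = p.partition_number - 1
WHERE p.object_id = OBJECT_ID('dbo.LargeVolumePartitioned')
AND p.index_id = 1
ORDER BY p.partition_number;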

The query above comparing the partitioned and non-partitioned tables was looking at data just on 01/01/2016.  We know that this data sits on PartitionNumber = 2 but we can confirm this as follows…

SELECT $PARTITION.BlogPartFunc1('20160101');

[Screenshot: result showing partition number 2]

Let’s say we want to return all the data for February.  We know this is on partition 3 so we can run the following…

SELECT *
FROM dbo.LargeVolumePartitioned LVP
WHERE $PARTITION.BlogPartFunc1(SomeDate) = 3

[Screenshot: rows returned for February 2016]

If we check the properties of the Clustered Index Scan operator in the execution plan we can see that only partition 3 was accessed.

[Screenshot: Clustered Index Scan properties showing only partition 3 accessed]
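The same trick works without hard coding the partition number.  For example, if you always want the partition containing today's data you can pass the current date into $PARTITION on both sides of the comparison (with this demo data today's date falls into the final partition, which holds everything from December 2016 onwards, but against a live table partitioned by month this gives you the current month's slice):

SELECT *
FROM dbo.LargeVolumePartitioned LVP
WHERE $PARTITION.BlogPartFunc1(LVP.SomeDate) = $PARTITION.BlogPartFunc1(SYSDATETIME());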

Partitioning is a powerful optimisation tool and is something that should definitely be considered at design time, particularly on wide tables where you might want to filter in several different ways but are usually only interested in the latest data.

There is one final part of partitioning that I want to demonstrate, Partition Switching, which I will show in the next post.

 
