SSIS Checkpoints

There is a very handy feature available in SSIS 2012 onwards that allows you to restart a package from the point of failure and skip any previously successful tasks. This is particularly useful if you have a few tasks that take a very long time to run that have been successful and a task further down the process chain fails. You don’t want to have to rerun the long tasks again if there is no need to. Adding checkpoints to package can do this. You just need to enable your package to allow checkpoints and then set which tasks (it could be all of them) are points in the package that you can restart from if the package fails further down the line. Here’s an example of a simple process using checkpoints…

Here we have a simple little package that truncates a table, inserts some data and then runs an update on that data. I’m going to force the update task to fail and get it to restart without having to truncate the data and load it in again. If the insert of the data took several minutes or even hours then this would save a lot of time.

First of all just create a test table

USE SimonBlog
GO
CREATE TABLE dbo.SSISCheckPointTest
(
 ID INT NOT NULL IDENTITY(1,1),
 SomeNumber INT NOT NULL
);

Now create a new SSIS package with three Execute SQL Tasks.

1

The first one just truncates the table dbo.SSISCheckPointTest

TRUNCATE TABLE dbo.SSISCheckPointTest

The second one runs the following just to populate the table with some data and then returns a single value which is the number of rows inserted into the table.  This gets written to the SSIS variable @SomeNumber.

INSERT INTO dbo.SSISCheckPointTest
SELECT ROW_NUMBER() OVER (ORDER BY SalesOrderId)
FROM [AdventureWorks2012].Sales.SalesOrderDetail
CROSS JOIN [AdventureWorks2012].[Sales].[SalesPerson]
SELECT @@ROWCOUNT AS NumRows;

The third step runs an update statement that attempts to set the SomeNumber integer value to a character value as follows…

UPDATE dbo.SSISCheckPointTest
SET SomeNumber = 'A'
WHERE ID = 100;

Now lets run the package and the third task will fail

2

As we can see from the Progress tab the insert took 40 secs which was by far the longest part of the whole package.  If this was 40 minutes rather than 40 seconds we wouldn’t want to run the insert part again.  We’d just want to fix the failed step and then run it from there.

Just to confirm we can see that the insert was successful and we have some rows.

3

Now let’s set the package up so that it creates a checkpoint after doing the insert.  Firstly enable it in the package properties by setting the SaveCheckpoints property to True.  Then specify an XML file in the CheckpointFileName property.  This will record when each checkpoint is hit and which ones were successful.  It also stores the state of each variable.  Finally set the CheckpointUsage to IfExists.  This means that if the checkpoint file doesn’t exist then it will create it.  If the package completes successfully then the checkpoint file is deleted so the next time it runs it will start from the beginning.

4

Next we need to specify which tasks will create a checkpoint.  We want to be able to restart the package after the insert task so we will set a checkpoint on that task.  I’m also going to get a checkpoint on the truncate task.  To do this set the FailPackageOnFailure to True.

5

Now let’s run the package again.  As there was previously no checkpoints enabled the package starts from the beginning and fails again on the update task.

6

We can now see that the checkpoint XML file was created as follows. We can see that the SomeNumber variable has a value of 2062389.

7

Now let’s fix the update statement and change it to use the value of the SomeNumber variable in the update.  As you can see below the default value of the SomeNumber variable is 0 so if it get’s set to 2062389 this will prove that the checkpoints also remember the values of variables.

8

9

Now lets rerun the package.

10

As you can see only the update task ran and this time it was successful.  We can also see that the SomeNumber value for the row with ID = 100 has been set to the value in the SomeNumber variable.

11

The checkpoint file has also now been deleted so the next time the package executes it will start from the beginning.

The SEQUENCE object

In SQL Server 2012 the SEQUENCE object was added.  This seems to be to get past some of the deficiencies in the IDENTITY property.

It allows you to use the same sequence across multiple columns in multiple tables whereas the IDENTITY property is linked to a particular column within a particular table.

It allows cycling of the sequence and allows you to create the next sequence number before you insert the data into the table.  With the IDENTITY property you need to insert the data first and then get the newly generated identity value.

To create a new SEQUENCE object just use the following format

CREATE SEQUENCE dbo.SimonTest

This creates a new SEQUENCE object named dbo.SimonTest with default properties, including the default BIGINT data type.

The SEQUENCE object has the following properties

  • INCREMENT BY – the value to increment by.  Equivalent to the y value in the IDENTITY(x,y) property.  The default is 1.
  • MINVALUE – the minimum value allowed.  By default this is this smallest value allowed in the data type.
  • MAXVALUE – the maximum value allowed.  By default this is the largest value allowed in the data type.
  • CYCLE or NOCYCLE – says whether to allow cycling or not.  The default is NOCYCLE.
  • START WITH – the value to start the sequence at.  Equivalent to the x value in the IDENTITY(x,y) property.

Cycling is where the SEQUENCE object will reset itself to the minimum value once it reaches the maximum value – I’m not sure when you would actually use this.  It’s worth noting that if you want your sequence to allow cycling and you only want to include positive numbers then you should set the MINVALUE propery to be 1, otherwise if you’re using an INT data type the first value after cycling will be -2147483648 even if you set the START WITH property to 1.  For example

CREATE SEQUENCE dbo.SimonPositiveCycle AS INT
MINVALUE 1
CYCLE;

will create a new SEQUENCE object with an INT data type.  The START VALUE is by default the same as the MINVALUE so this will this 1 and when it recycles it will go back to 1.

Now we want to start using the SEQUENCE and to do this we run

SELECT NEXT VALUE FOR dbo.SimonPositiveCycle

This will return the value 1.  If we run it again it returns 2 and so on.

You can use the SELECT NEXT VALUE FOR function in multiple places including INSERT and UPDATE statements and default constraints.  To use a SEQUENCE object instead of an IDENTITY property you just create a default constraint on the identity column.

All of the SEQUENCE object properties listed above can be updated using the ALTER SEQUENCE statement.

Finally, you can tell the SEQUENCE object how often it should write the current sequence value to disk rather than just storing it in memory.

ALTER SEQUENCE dbo.SimonPositiveCycle
CACHE 100;

means that it will write to memory 100 times before it writes to the disk.  It’s better for performance to write to memory but there is a risk that you lose the current value if there is a shutdown.

Columnstore Indexes – The Very Basics

Columnstore indexes are briefly mentioned in chapter 10, lesson 1 on the 70-462 exam guide so I don’t think there will be many questions on them in the exam.

There are a new feature in 2012 and can provide performance improvement for data warehouse type queries that have to perform aggregations over large data sets, e.g. a query that groups on a few columns along with an AVG() and SUM() and an ORDER BY.

The columnstore index organises data by columns rather than rows which allows SQL Server to compress repeating data values which, according to the exam guide, is approximately double the compression rate of PAGE compression.

Obviously, like everything else in SQL Server you need to test what works best for your particular scenario.

You can check the query plan to ensure the column store index is being used. There are two modes it can run under: batch and row. Batch mode should be quick but row mode means it is not using the full benefits of the column store index.

The column store index is split into segments of around 1 million rows and the optimiser can use segment elimination in a similar way to how it uses partition elimination in a partitioned table. The segments store the min and max values in their metadata so the optimiser can quickly deduce which segments it needs to use.

If your query uses only a small number columns from a table then SQL Server only gets this column data from disk rather than the entire row. This can significantly reduce disk I/O and memory cache usage.

However if your query uses several columns it might be better to just use row based indexes as SQL Server has to go and get the row data anyway. This is why column store indexes are suitable for data warehouse queries.

The syntax to create the index is as follows:

CREATE NONCLUSTERED COLUMNSTORE INDEX [IX_CS_Test]
ON dbo.table (col1, col2, ...);

However, there are limitations on columnstore indexes: there are several data types that cannot be used; and the table must be read-only.

This second limitation is probably going to be a problem on most data warehouse systems as I expect you’d want to load new data in on a regular basis. There are a few ways around this: disable the index (making the table read/write), load in the new data and re-enable the index (this is probably a good option for data warehouses where data is loaded from OLTP systems at regular intervals but you wouldn’t do this if the table is updated frequently from, say, a client app); use partition switching to add the new data into the read-only table; use. UNION ALL in your query to include an active data set in with the columnstore data and then add the active data into the columnstore later on.

For more information…

http://msdn.microsoft.com/en-us/library/gg492088.aspx

http://blog.sqlauthority.com/2011/10/29/sql-server-fundamentals-of-columnstore-index/