Read from Azure Data Lake using Azure Databricks

I’ve been using Azure Data Lake for a little while now and have been looking at some of the tools used to read, write and analyse the data including Data Lake Analytics using U-SQL and more recently Azure Databricks.

As an ad-hoc analysis tool I think the Databricks notebooks are great and have been able to create a few basic reports using some of our streaming data via an Event Hub. There is loads you can do with Databricks including ETL and we can now execute Python scripts against Databricks clusters using Data Factory.

Some aspects of using Azure Databricks are very easy to get started with, especially using the notebooks, but there were a few things that took a lot longer to get up and running than I first expected.

Some of the documentation is good but there were a few things I had to piece together from various sources and tinker about myself to get it working.

Connecting to the Data Lake is simple but you don’t want connection details hard-coded in your notebooks and storing the sensitive information in the Azure Key Vault is a little more involved.

The following step-by-step guide should allow any complete beginner to get this up and running fairly quickly – although there are quite a few steps and a few places where things can go wrong.

Please note, the Azure Portal Environment changes frequently and these instructions are only accurate at the time of writing.  If things move or are renamed hopefully there is enough info below to work out what is required.  

In order to connect to the Azure Data Lake we can create a credential in Azure Active Directory (AAD) with access to the relevant files and folders. We need a ClientID and a key for this credential and also need a reference to our AAD. We can store these values in Azure Key Vault and use Databricks secrets to access them.

Firstly, let’s look at the data we want to access in the Azure Data Lake.

Log in to the Azure Portal and navigate to the Data Lake Storage and then Data Explorer.


In this example I’ve created a new Data Lake Store named simon and will now upload some speed camera data I’ve mocked up.  This is the data we want to access using Databricks.


If we click on Folder Properties on the root folder in the Data Lake we can see the URL we need to connect to the Data Lake from Databricks.  This is the value in the PATH field, in this case, adl://


Now we’ve got the files in place let’s set up everything we need to securely connect to it with Databricks.

To do this we need to create an App registration in AAD via the Azure Portal.

Navigate to the Azure Active Directory resource and click on App registration in the menu on the left.


Click on New application registration and enter a Name and Sign-on URL (we don’t use the Sign-on URL so this can be anything you like).


Now we need to create a key for this App registration which Databricks can use in its connection to the Data Lake.

Once the App registration is created click on Settings.


Click on the Keys option, enter a new key description and set the expiry date of the key.


Then click on Save and the key value will be displayed.  You need to make a note of this value somewhere as you cannot view this value again.


You’ll also need to make a note of the Application ID of the App Registration as this is also used in the connection (although this one can be obtained again later on if need be).


As I mentioned above we don’t want to hard code these values into our Databricks notebooks or script files so a better option is to store this in the Azure Key Vault.

Navigate to the Azure Key Vault resource and either use an existing appropriate Key Vault or create a new one.


Click on Properties in the menu on the left and make a note of the DNS NAME and RESOURCE ID values of the Key Vault.  These are needed when setting up the Databricks Secret Scope later on.


Click on Secrets in the menu on the left and create a new secret for each of the bits of sensitive data needed in the Databricks connection, which are as follows…

  • ClientID comes from the Application ID of the new App registration
  • Credential comes from the key on the new App registration
  • RefreshURL comes from the Directory ID shown in the Properties of Azure Active Directory and should be in the format https://login.microsoftonline.com/<DirectoryID>/oauth2/token


I’ve also included the URL of the Data Lake in the Simon-ADLS-URL secret.


Next we need to give the App Registration permissions to read the data from the Data Lake.


We need to make sure that all folders from the root down to the files have Execute permissions as an access entry.

In a Production environment it’s likely that there will be some process that loads new files into the Data Lake.  For the folder(s) where the files are loaded we need to set Read and Execute permissions as both an access and a default permission entry on the folders and their children.  This means that any new files added to these folders will be given these same permissions by default.  If this isn’t set up then the new files will cause the Databricks code to fail.

To add these permissions click on Access on the relevant folder.  Then click Add.


Enter the App Registration name in the text box and select it when it is displayed below.


Then select the appropriate permissions as described above.  Once you’ve got far enough down the chain of folders where all sub-folders and files need to be accessed by Databricks then you can choose This folder and all children.


Now we’ve got all our sensitive data stored in Azure Key Vault Secrets and permissions on the Data Lake set up we need to create an Azure Databricks Secret Scope and link it to our Key Vault.

I’m assuming you’ve already managed to set up Azure Databricks, if not, you can do this quite easily via the Azure Portal.

Once you’ve got Databricks set up you’ll connect to it via a URL something like…

As you can see from this URL, my Databricks is running in North Europe so to create a new Secret Scope we need to navigate to…

Once we’ve logged in, the following page is displayed.  Give the scope a name and choose an option for the Manage Principal.  For this demo I’m just using the Standard Pricing Tier for Databricks so I have to choose All Users.  The Creator option is only available in the Premium Tier.

Enter the DNS NAME and RESOURCE ID values from the Properties of the Key Vault and click Create.


We’re finally ready!

Now we can start a Databricks cluster and enter the following in a new Python notebook (the syntax for Scala is very similar)…

client_id = dbutils.secrets.get(scope = "SimonTemp", key = "Simon-ADLS-ClientID")
credential = dbutils.secrets.get(scope = "SimonTemp", key = "Simon-ADLS-Credential")
refresh_url = dbutils.secrets.get(scope = "SimonTemp", key = "Simon-ADLS-Refresh-URL")
adls_url = dbutils.secrets.get(scope = "SimonTemp", key = "Simon-ADLS-URL")

spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", client_id)
spark.conf.set("dfs.adls.oauth2.credential", credential)
spark.conf.set("dfs.adls.oauth2.refresh.url", refresh_url)

This code uses the Databricks Secret Scope named SimonTemp created above to access the four secrets we put in the Azure Key Vault and store them in variables.

We can then use the SparkSession object spark, which is available automatically in an Azure Databricks cluster, and use the client_id, credential and refresh_url variables to set the authentication values required to connect to the Data Lake.

Now we can use the final variable adls_url along with the rest of the path to read the CSV files and show the data as follows…
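A minimal sketch of that read is below. The folder name speed-cameras and the CSV options are my assumptions for illustration; only adls_url and the ambient spark session come from the notebook setup above, so this only runs inside a Databricks notebook.

```python
# Sketch only: assumes a hypothetical "speed-cameras" folder of CSV files
# under the Data Lake root, and the `spark` session Databricks provides.
path = adls_url + "/speed-cameras/*.csv"

df = (spark.read
      .option("header", "true")        # first row contains column names
      .option("inferSchema", "true")   # let Spark guess the data types
      .csv(path))

df.show()
```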


If you’ve managed to follow these fairly long instructions hopefully you’re able to read your data.  However, if your read statement is just hanging then it’s likely you’ve not set the correct values in the key vault.  This happened to me the first time I tried this and I didn’t get any errors in Databricks.  I had to re-enter the values in my key vault secrets to fix it.

If you’ve not set up permissions correctly in the Data Lake then you will receive an error something like… LISTSTATUS failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.).

In this case make sure the App Registration has access from the root folder all the way down to the files you are trying to read.


Archiving old SQL Server data to Azure Data Lake and reading it with Data Lake Analytics

I recently worked on a project where we rebuilt one of our existing systems. This involved one team building a new application and the data engineering team rebuilding the database.

We migrated a lot of the data from the old database into the new but also aggregated up some financial data and added it as a balance transfer from the old system to the new.

The old database has been sitting in read-only state for a long time now and we’d like to get rid of it to recover some space. However, very recently I was asked to show details of how the balance transfer amount was calculated for a particular customer. I was able to read from the read-only old database on this occasion but wanted to come up with a way to archive some of the old database data so that it can be accessed if need be.

The solution I came up with was to export the data we need into our Azure Data Lake Store and then create a U-SQL Project in Visual Studio to hold the queries we run against the Data Lake. This project is then placed in source control for other people to use these queries in the future.

I’m going to use some of the tables in the AdventureWorks2016 database to mimic how I set this up…

An example query…

Let’s assume we’ve built a new version of AdventureWorks2016 and migrated a lot of the data across. However, for certain reasons we didn’t migrate over the order history but once or twice a year we might get a request asking for the order history for a customer. If the AdventureWorks2016 database is still available then we can run something like the following for AccountNumber AW00029794

USE AdventureWorks2016;

SELECT
    P.Name AS ProductName
FROM Sales.SalesOrderHeader SOH
JOIN Sales.SalesOrderDetail SOD ON SOD.SalesOrderID = SOH.SalesOrderID
JOIN Sales.Customer C ON C.CustomerID = SOH.CustomerID
JOIN Production.Product P ON P.ProductID = SOD.ProductID
WHERE C.AccountNumber = 'AW00029794'

This returned 59 rows including these…

Archive to ADLS 01

The following steps show how we can archive some of this data into the Data Lake and then run adhoc queries like the one above against it.

Archive tables

I just used the Import/Export Wizard via SQL Server Management Studio to export the table data as CSV files.

When choosing the destination it’s important to remember what options you chose as this will determine the options you need to include in your U-SQL script to read the data from the Data Lake.

Archive to ADLS 02

I’ve identified the Customer, Product, SalesOrderHeader and SalesOrderDetail tables as the only ones I need to archive so I’ve ended up with 4 CSV files containing all the data from these tables.

Archive to ADLS 03

Upload to Azure Data Lake

Like any Azure resource you need to have an active Azure Subscription (free trials are available) and I’m assuming you have this as well as have provisioned an Azure Data Lake Store (ADLS).

For this demo I’ve got a Data Lake set up called aworks2016archive and a Data Lake Analytics (ADLA) account with the same name linked to this Data Lake.

I now use Visual Studio and the Cloud Explorer for Visual Studio extension to interact with ADLS.

Archive to ADLS 06

This gives me the Cloud Explorer window and I can connect to the ADLS from there

Archive to ADLS 07.JPG

I can open this ADLS in File Explorer and add a new folder named AdventureWorks2016 to store the exported CSV files. I can then right click in that folder and Upload the CSV files…

Archive to ADLS 08

Reading the data from ADLS

I now use Visual Studio and the Azure Data Lake and Stream Analytics Tools extension to interact with ADLS and ADLA and query the data in the CSV files.

Archive to ADLS 04

I create a new U-SQL Project in Visual Studio. This allows me to keep my U-SQL scripts under source control although I can just run these scripts directly in ADLA via the Azure Portal if need be.

Archive to ADLS 05

I then end up with a new project with a single u-sql file named Script.usql

To run the script using the aworks2016archive ADLA account I need to choose it from the drop down. You can run this script against the local connection while you are developing it. I can also choose the number of analytics units (AU) here, up to the maximum set in the ADLA account. For the time being I’m going to set this to 1.

Within U-SQL you have extractors and outputters for reading data and writing results, respectively. There is quite a good course on ADLS and ADLA on Microsoft Virtual Academy called Introducing Azure Data Lake and there are plenty of other resources available so I’m not going to go too much into the U-SQL syntax.

One thing I would like to show is a nice way to get your EXTRACT statement for the files in the ADLS. From Cloud Explorer -> File Explorer I can navigate to the file I want to read and double-click it to open a preview of the file. From here I can choose the Delimiter, Quoting and Encoding to make sure the file is readable.

Archive to ADLS 09

There is also a button named Create EXTRACT Script…

Archive to ADLS 10

The file has the column headers in the first row so I can check the File Has Header Row box and get the correct column names in the script…

Archive to ADLS 11

This tool has made the best guess at what it thinks the data types should be but if the data is sparse for a particular column and there are no values in the preview it quite often just chooses string. I can change the data types in the columns in the data set but I prefer to click Copy Script and paste the code into my u-sql script and change it there.

@input =
    EXTRACT [ProductID] int,
            [Name] string,
            [ProductNumber] string,
            [MakeFlag] bool,
            [FinishedGoodsFlag] bool,
            [Color] string,
            [SafetyStockLevel] int,
            [ReorderPoint] int,
            [StandardCost] int,
            [ListPrice] int,
            [Size] string,
            [SizeUnitMeasureCode] string,
            [WeightUnitMeasureCode] string,
            [Weight] string,
            [DaysToManufacture] int,
            [ProductLine] string,
            [Class] string,
            [Style] string,
            [ProductSubcategoryID] string,
            [ProductModelID] string,
            [SellStartDate] DateTime,
            [SellEndDate] string,
            [DiscontinuedDate] string,
            [rowguid] Guid,
            [ModifiedDate] DateTime
    FROM "adl://" 
    USING Extractors.Csv(skipFirstNRows:1);

In order to know what the data types should be it might be useful to store the schema of the original SQL Server table in the ADLS as well. For the Product table this is as follows…

CREATE TABLE [Production].[Product](
	[ProductID] [int] IDENTITY(1,1) NOT NULL,
	[Name] [dbo].[Name] NOT NULL,
	[ProductNumber] [nvarchar](25) NOT NULL,
	[MakeFlag] [dbo].[Flag] NOT NULL,
	[FinishedGoodsFlag] [dbo].[Flag] NOT NULL,
	[Color] [nvarchar](15) NULL,
	[SafetyStockLevel] [smallint] NOT NULL,
	[ReorderPoint] [smallint] NOT NULL,
	[StandardCost] [money] NOT NULL,
	[ListPrice] [money] NOT NULL,
	[Size] [nvarchar](5) NULL,
	[SizeUnitMeasureCode] [nchar](3) NULL,
	[WeightUnitMeasureCode] [nchar](3) NULL,
	[Weight] [decimal](8, 2) NULL,
	[DaysToManufacture] [int] NOT NULL,
	[ProductLine] [nchar](2) NULL,
	[Class] [nchar](2) NULL,
	[Style] [nchar](2) NULL,
	[ProductSubcategoryID] [int] NULL,
	[ProductModelID] [int] NULL,
	[SellStartDate] [datetime] NOT NULL,
	[SellEndDate] [datetime] NULL,
	[DiscontinuedDate] [datetime] NULL,
	[rowguid] [uniqueidentifier] ROWGUIDCOL  NOT NULL,
	[ModifiedDate] [datetime] NOT NULL,
 CONSTRAINT [PK_Product_ProductID] PRIMARY KEY CLUSTERED
(
	[ProductID] ASC
))

I can use this to correct any incorrect data types…

@input =
    EXTRACT [ProductID] int,
            [Name] string,
            [ProductNumber] string,
            [MakeFlag] bool,
            [FinishedGoodsFlag] bool,
            [Color] string,
            [SafetyStockLevel] int,
            [ReorderPoint] int,
            [StandardCost] decimal,
            [ListPrice] decimal,
            [Size] string,
            [SizeUnitMeasureCode] string,
            [WeightUnitMeasureCode] string,
            [Weight] decimal?,
            [DaysToManufacture] int,
            [ProductLine] string,
            [Class] string,
            [Style] string,
            [ProductSubcategoryID] int?,
            [ProductModelID] int?,
            [SellStartDate] DateTime,
            [SellEndDate] DateTime?,
            [DiscontinuedDate] DateTime?,
            [rowguid] Guid,
            [ModifiedDate] DateTime
    FROM "adl://" 
    USING Extractors.Csv(skipFirstNRows:1);

Notice some of the data types have a question mark after their name. This is to indicate that the field is nullable. Please note, you can’t include this for string data types as string is a reference type in C# and is already nullable.

In this example we’re extracting from a specific file but you can use wildcards in the path within ADLS.

We are using the built-in CSV extractor and specify the option to skip the first row. There are other options available and I’m going to add some of these in even though they are the default values. If we’d checked the Unicode box in the destination on the Import/Export Wizard I would have to specify the Unicode encoding type…

@input =
    EXTRACT [ProductID] int,
            [Name] string,
            [ProductNumber] string,
            [MakeFlag] bool,
            [FinishedGoodsFlag] bool,
            [Color] string,
            [SafetyStockLevel] int,
            [ReorderPoint] int,
            [StandardCost] decimal,
            [ListPrice] decimal,
            [Size] string,
            [SizeUnitMeasureCode] string,
            [WeightUnitMeasureCode] string,
            [Weight] decimal?,
            [DaysToManufacture] int,
            [ProductLine] string,
            [Class] string,
            [Style] string,
            [ProductSubcategoryID] int?,
            [ProductModelID] int?,
            [SellStartDate] DateTime,
            [SellEndDate] DateTime?,
            [DiscontinuedDate] DateTime?,
            [rowguid] Guid,
            [ModifiedDate] DateTime
    FROM "adl://" 
    USING Extractors.Csv(skipFirstNRows:1, encoding:Encoding.UTF8, quoting:true, silent:false);

I’m going to give this variable a more appropriate name instead of @input and then do the same for the other three CSV files…

@customer =
    EXTRACT [CustomerID] int,
            [PersonID] int?,
            [StoreID] int?,
            [TerritoryID] int?,
            [AccountNumber] string,
            [rowguid] Guid,
            [ModifiedDate] DateTime
    FROM "adl://" 
    USING Extractors.Csv(skipFirstNRows:1);

@salesorderheader =
    EXTRACT [SalesOrderID] int,
            [RevisionNumber] int,
            [OrderDate] DateTime,
            [DueDate] DateTime,
            [ShipDate] DateTime?,
            [Status] int,
            [OnlineOrderFlag] bool,
            [SalesOrderNumber] string,
            [PurchaseOrderNumber] string,
            [AccountNumber] string,
            [CustomerID] int,
            [SalesPersonID] int?,
            [TerritoryID] int?,
            [BillToAddressID] int,
            [ShipToAddressID] int,
            [ShipMethodID] int,
            [CreditCardID] int?,
            [CreditCardApprovalCode] string,
            [CurrencyRateID] int?,
            [SubTotal] double,
            [TaxAmt] double,
            [Freight] double,
            [TotalDue] double,
            [Comment] string,
            [rowguid] Guid,
            [ModifiedDate] DateTime
    FROM "adl://" 
    USING Extractors.Csv(skipFirstNRows:1);

@salesorderdetail =
    EXTRACT [SalesOrderID] int,
            [SalesOrderDetailID] int,
            [CarrierTrackingNumber] string,
            [OrderQty] int,
            [ProductID] int,
            [SpecialOfferID] int,
            [UnitPrice] double,
            [UnitPriceDiscount] double,
            [LineTotal] double,
            [rowguid] Guid,
            [ModifiedDate] DateTime
    FROM "adl://" 
    USING Extractors.Csv(skipFirstNRows:1);

Joining data in U-SQL

Now that we’ve got our four EXTRACT statements we can join these all together and create the required data set that we can then output…

@orders =
    SELECT
       p.Name AS ProductName
    FROM @customer AS c
    JOIN @salesorderheader AS soh ON soh.CustomerID == c.CustomerID
    JOIN @salesorderdetail AS sod ON sod.SalesOrderID == soh.SalesOrderID
    JOIN @product AS p ON p.ProductID == sod.ProductID
    WHERE c.AccountNumber == "AW00029794";

There are some differences in this SELECT statement from the T-SQL equivalent. Remember this is U-SQL not T-SQL.

The variables we’re joining together have to have an AS before the alias name and the alias name can’t all be in upper case.

It also uses the C# equality operator with a double equals (single equals is used for assigning values).

The code is also case sensitive and you may be used to a SQL Server collation that is not. So ProductID is different to ProductId.

Double-quotes are used on string data instead of single.

There is a section at the bottom that shows some of the errors I got when first compiling my U-SQL scripts.

Outputting the data

Now that we’ve got the query to get the required data we need to output the results to the ADLS. We can’t see the results in Visual Studio like we would if running a T-SQL query against SQL Server.

OUTPUT @orders
TO "adl://"
USING Outputters.Csv(outputHeader:true);

Here we specify the file we want to output the results to and we’re using the built-in CSV outputter. The only option I’m specifying is that I want the column names included as the first row, but there are other options available.

Now we’ve got our full script we can execute it by clicking the Execute button.

Executing the U-SQL job

The scripts execute within jobs and if the script compiles successfully the job window will be displayed like the following…

Archive to ADLS 12

Here we can see the completed job…

Archive to ADLS 13

We can see the four Extract operations at the top and then some Combine operations which is where we are joining the data together.

One thing to note is the Extract on the Customer.csv file. Our EXTRACT statement didn’t have any filtering; it was only in the SELECT statement that we added the where clause to get only one customer. But if you look at the SV1 EXTRACT operation you can see only 1 row was returned. The compiler knows to push the filter from the SELECT statement down into the EXTRACT operation. Looking at the code you might think the EXTRACT statements work in a similar way to loading data from SELECT statements into temp tables in T-SQL, in which case all the rows would be read in and only filtered afterwards.

As we can see from the SV7 Combine operation there are 59 rows output, which matches the number from the original T-SQL query from the very beginning.

Because we are running queries against the ADLA account we can log in to the Azure portal and see the job history…

Archive to ADLS 14.JPG

As you can see the successful job took 58 secs with 1 AU.  If I double-click the job I can see the details of it including the job graph.  There is a tab named AU analysis that can be used to work out the most efficient number of AUs to use.  We have to find the balance between execution time and cost.

Because the data sets are tiny in this example 1 AU is the best choice.

The output file

The result of the script execution is a file written to ADLS and we can now see that in File Explorer

Archive to ADLS 15

We can view a preview of this file or download it.

Some errors that can be returned

Here are a few of the errors that I’ve received while creating U-SQL scripts along with the cause and how to resolve them.

  • Alias all in upper case


JOIN @salesorderheader AS SOH ON SOH.CustomerID == c.CustomerID


E_CSC_USER_USQLUPPERCASEIDENTIFIER: The identifier SOH is all upper case.


Alias SOH needs to be Soh or soh

JOIN @salesorderheader AS soh ON soh.CustomerID == c.CustomerID
  • Missing AS in alias


JOIN @salesorderheader soh ON soh.CustomerID == c.CustomerID




Add AS before any aliases

JOIN @salesorderheader AS soh ON soh.CustomerID == c.CustomerID
  • Incorrect casing


JOIN @salesorderheader AS soh ON soh.CustomerID == c.CustomerId


E_CSC_USER_QUALIFIEDCOLUMNNOTFOUND: Column 'CustomerId' not found in rowset 'c'.


Make sure all casing is correct

JOIN @salesorderheader AS soh ON soh.CustomerID == c.CustomerID
  • Wrong equality operator used


JOIN @salesorderheader AS soh ON soh.CustomerID = c.CustomerID


E_CSC_USER_JOINEXPRESSIONNOTVALID: The Join Condition soh.CustomerID = c.CustomerID is not supported.


Equality operators need to be double equals

JOIN @salesorderheader AS soh ON soh.CustomerID == c.CustomerID
  • Trying to return results




E_CSC_USER_QUERYSTATEMENTMISSINGASSIGNMENT: Missing assignment for query statement.


You can’t return results directly from a U-SQL script.  You have to assign the query to a variable and output that to a file

@orders =
    SELECT ...

OUTPUT @orders ...


Exam prep for 70-762

Last Updated 19/11/2018

A few months ago I took exam 70-761 and now I’m going to take 70-762 – Developing SQL Databases.

As part of my prep for 70-761 I wrote about a few of the topics less well known to me and they can all be found here – Exam prep for 70-761.

This page covers some of the topics covered in 70-762.

70-761 Exam Prep: Miscellaneous T-SQL – Part 2

I’m currently preparing to take the exam 70-761 – Querying Data with Transact-SQL and have been going through various resources including Itzik Ben-Gan’s exam guidebook which is certainly worth reading if you’re planning on taking this exam.

There are quite a few smaller bits of T-SQL functionality that I’m not that familiar with that I’m going to demonstrate in this and further posts. It’s a bit of a random selection but hopefully it will help me remember them and might be of use to others as well.


  1. Date and Time Functions
  2. IIF (Inline IF Function)
  3. CHOOSE Function
  4. String Functions

1. Date and Time Functions

There are quite a few different Date and Time functions available in SQL Server. There are a few I use regularly, such as GETUTCDATE, DATEPART and DATEADD, but some I hardly use or have never used at all. I can never remember the syntax for most of them so I’m going to give an example of each one along with a bit of detail about each.

Returning current date and time

The following functions all return the current date and time.
The timezone of my PC is BST.

This function is specific to SQL Server and returns a DATETIME value containing the current date and time.

SELECT GETDATE();
2018-04-30 22:13:18.403

This function is specific to SQL Server and returns a DATETIME value containing the current UTC date and time.

SELECT GETUTCDATE();
2018-04-30 21:13:18.403

This function is SQL standard and returns a DATETIME value containing the current date and time.

SELECT CURRENT_TIMESTAMP;
2018-04-30 22:13:18.403

This function is specific to SQL Server and returns a DATETIME2(7) value containing the current date and time.

SELECT SYSDATETIME();
2018-04-30 22:13:18.4048719

This function is specific to SQL Server and returns a DATETIME2(7) value containing the current UTC date and time.

SELECT SYSUTCDATETIME();
2018-04-30 21:13:18.4048719

This function is specific to SQL Server and returns a DATETIMEOFFSET value containing the current date, time and offset.
As mentioned above this is BST which is 1 hour ahead of UTC.

SELECT SYSDATETIMEOFFSET();
2018-04-30 22:13:18.4048719 +01:00

Date Parts

The following two functions, DATEPART and DATENAME, can be used to extract different parts of a supplied date and time.
For example, the following will return the month number from the supplied date or time parameter (please note I’m passing in a VARCHAR value but this is implicitly converted to a DATETIME2 data type).

SELECT DATEPART(MONTH, '20180430');
4

The following will return the minute value from the supplied date time value.

SELECT DATEPART(MINUTE, '20180430 03:23:17');
23

If no time part is included in the date value then it’s assumed to be midnight, which is why the following returns 0.

SELECT DATEPART(MINUTE, '20180430');
0


The following returns the number of days since the 1st Jan.

SELECT DATEPART(DAYOFYEAR, '20180430');
120


The following returns any offset in minutes. The offset is three hours so this returns 180 minutes.

SELECT DATEPART(TZoffset, '20180430 13:30:00 +03:00')

DATEPART always returns an INT value.

We can use the DATENAME function to return textual values for date parts.
For example, the following returns the month name for the supplied date.

SELECT DATENAME(MONTH, '20180430');
April

The DATENAME function is language dependent and will return the value in the language of the user running the query.

We can use the following functions to return date and time values from integer date parts. Each function requires a different number of parameters and all are mandatory in order to build up the returned value correctly.

The following returns a DATE value.

SELECT DATEFROMPARTS(2018, 04, 30);
2018-04-30

The following returns a DATETIME value.

SELECT DATETIMEFROMPARTS(2018,04,30,19,42,34,13)
2018-04-30 19:42:34.013

All 7 parameters in this one are required but we can obviously use zeroes if we don’t have the required level of precision.

SELECT DATETIMEFROMPARTS(2018,04,30,19,42,0,0)
2018-04-30 19:42:00.000

The following returns a DATETIME2 type with an eighth parameter used to specify the precision of the data type.
The precision determines how many digits are allowed in the “fraction” parameter (the seventh one).
For example,

SELECT DATETIME2FROMPARTS(2018,04,30,19,42,34,5,1);
2018-04-30 19:42:34.5

SELECT DATETIME2FROMPARTS(2018,04,30,19,42,34,5,2);
2018-04-30 19:42:34.05

SELECT DATETIME2FROMPARTS(2018,04,30,19,42,34,5,3);
2018-04-30 19:42:34.005

SELECT DATETIME2FROMPARTS(2018,04,30,19,42,34,5,7);
2018-04-30 19:42:34.0000005

If we’re only allowing a precision of 1 we can’t specify 2 digits in the “fraction” parameter.

SELECT DATETIME2FROMPARTS(2018,04,30,19,42,34,15,1);
Msg 289, Level 16, State 5, Line 89
Cannot construct data type datetime2, some of the arguments have values which are not valid.

If we are using a precision of 0, which is just to the nearest second, the “fraction” parameter must be 0 otherwise we get the same error as above.

SELECT DATETIME2FROMPARTS(2018,04,30,19,42,34,5,0);
Msg 289, Level 16, State 5, Line 53
Cannot construct data type datetime2, some of the arguments have values which are not valid.

SELECT DATETIME2FROMPARTS(2018,04,30,19,42,34,0,0);
2018-04-30 19:42:34
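The relationship between those two arguments can be sketched outside SQL: the fraction contributes fraction divided by 10 to the power of the precision seconds, so it must fit in the number of digits the precision allows. A small Python sketch (the function name is mine):

```python
def fraction_seconds(fraction, precision):
    """Seconds contributed by the DATETIME2FROMPARTS fraction argument.

    The fraction must fit in `precision` digits (and so must be 0 when
    the precision is 0), mirroring the Msg 289 errors above.
    """
    if fraction >= 10 ** precision:
        raise ValueError("fraction has more digits than the precision allows")
    return fraction / 10 ** precision

print(fraction_seconds(5, 1))  # 0.5   -> ...42:34.5
print(fraction_seconds(5, 3))  # 0.005 -> ...42:34.005
```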

The following returns a DATETIMEOFFSET value with the offset hour and minute parameters before the precision.

SELECT DATETIMEOFFSETFROMPARTS(2018,4,30,19,42,34,0,2,30,1);
2018-04-30 19:42:34.0 +02:30

The following function returns a DATE value containing the last date in the month of the supplied date.

SELECT EOMONTH('20180315');
2018-03-31

There is an optional offset parameter that can move forward or back the specified number of months

SELECT EOMONTH('20180315', -1);
2018-02-28

SELECT EOMONTH('20180315', 2);
2018-05-31
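The EOMONTH behaviour, including the optional month offset, can be mimicked in Python with the standard library (the function name is mine):

```python
import calendar
from datetime import date

def eomonth(d, offset=0):
    """Last day of the month of `d`, shifted by `offset` months,
    mimicking T-SQL's EOMONTH."""
    months = d.year * 12 + (d.month - 1) + offset  # months since year 0
    year, month = divmod(months, 12)
    month += 1                                     # back to 1-based months
    return date(year, month, calendar.monthrange(year, month)[1])

print(eomonth(date(2018, 3, 15)))      # 2018-03-31
print(eomonth(date(2018, 3, 15), -1))  # 2018-02-28
print(eomonth(date(2018, 3, 15), 2))   # 2018-05-31
```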

There isn’t a “start of month” function but further down I show how to do this using the DATEADD and DATEDIFF functions.


The DATEADD function can be used to add a specified number to the specified date part of the supplied date time value and returns a DATETIME value.
For example, the following adds 10 days to 10th April 2018.

SELECT DATEADD(DAY, 10, '20180410');
2018-04-20 00:00:00.000

The following one takes 1 month from 10th April 2018.

SELECT DATEADD(MONTH, -1, '20180410');
2018-03-10 00:00:00.000


This function compares a certain date part value between two datetime values and returns an INT value.
For example, the following calculates that there are 2 days difference between 1st April and 3rd April 2018.

SELECT DATEDIFF(DAY, '20180401', '20180403');

If the number returned is too large for an INT we can use the DATEDIFF_BIG function, available from SQL Server 2016, to return the value as a BIGINT.
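As an illustration, counting the milliseconds across more than a century overflows an INT, so DATEDIFF raises an error, whereas DATEDIFF_BIG (SQL Server 2016 onwards) succeeds:

```sql
-- DATEDIFF(MILLISECOND, ...) over this range raises an overflow error;
-- DATEDIFF_BIG returns the count as a BIGINT
SELECT DATEDIFF_BIG(MILLISECOND, '19000101', '20180430');
```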


Above we saw we can use the built in EOMONTH function to get the last day of the month for the supplied date.
Here’s a nice little trick to get the first date of the supplied month, in this case the current month. I’m running this on 30th April 2018.

SELECT DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()), 0);
2018-04-01 00:00:00.000

To explain how this works let’s first look at the DATEDIFF part that is passed in as the second parameter in the DATEADD function.
The start date parameter is supplied as zero, which defaults to 1st Jan 1900.
So the following two statements return the same value, which is the number of months since 1st Jan 1900.
I’m running this in April 2018 so this returns 1419.

SELECT DATEDIFF(MONTH, 0, GETDATE());
SELECT DATEDIFF(MONTH, '19000101', GETDATE());
1419

The DATEADD function also uses 0 as the date parameter which again defaults to 1st Jan 1900.
So this function is saying “add 1419 months to 1st Jan 1900” and this gives us 1st April 2018.
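The same DATEADD/DATEDIFF pattern works for other date parts too. A couple of illustrative variations, run against the current date:

```sql
-- first day of the current year
SELECT DATEADD(YEAR, DATEDIFF(YEAR, 0, GETDATE()), 0);

-- start of the current day (strips the time portion)
SELECT DATEADD(DAY, DATEDIFF(DAY, 0, GETDATE()), 0);
```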

Offset functions

We can use the TODATETIMEOFFSET function to add an offset to a date time value. This returns a DATETIMEOFFSET value.

SELECT TODATETIMEOFFSET('20180430 10:00', '+02:00')
2018-04-30 10:00:00.0000000 +02:00

We can use the SWITCHOFFSET function to switch from one time zone to another.
The first parameter has to be a DATETIMEOFFSET value and the second is the offset from UTC.  I’ve run this at 22:13 BST which is UTC + 1.

SELECT GETDATE();
2018-04-30 22:13:18.403

SELECT SWITCHOFFSET(SYSDATETIMEOFFSET(), '+02:00');
2018-04-30 23:13:18.4048719 +02:00

We can also add an offset using “AT TIME ZONE” and explicitly specify the time zone as follows.

SELECT CAST('20180415 17:32' AS DATETIME) AT TIME ZONE 'Central European Standard Time'
2018-04-15 17:32:00.000 +02:00
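AT TIME ZONE can also be applied twice to convert between zones: the first application attaches an offset to the plain datetime, and the second shifts the clock time into the target zone. A sketch:

```sql
-- treat the literal as UTC, then convert it to Central European time
-- (April falls in daylight saving, hence the +02:00 offset)
SELECT CAST('20180415 17:32' AS DATETIME)
       AT TIME ZONE 'UTC'
       AT TIME ZONE 'Central European Standard Time';
-- 2018-04-15 19:32:00.000 +02:00
```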

2. IIF (Inline IF function)

Most programming languages provide some sort of ternary or conditional operator and SQL Server is no different: it provides the inline IF function (IIF).
This function accepts three parameters. The first is a boolean, or an expression that returns a boolean.  If the boolean value is true then the function returns the value supplied in the second parameter; if it is false the function returns the value supplied in the third parameter.

SELECT IIF(1=1, 1, 0)
1

SELECT IIF(1=0, 1, 0)
0

If you are more familiar with the CASE expression, the queries above are equivalent to the following…

SELECT CASE WHEN 1=1 THEN 1 ELSE 0 END;
SELECT CASE WHEN 1=0 THEN 1 ELSE 0 END;
3. CHOOSE Function

The CHOOSE function accepts at least 2 parameters.
The first parameter is the “index” and has to be an INT that says which of the following values should be returned.
The following examples all use four parameters, meaning there are three values available to choose from in the parameters supplied after the “index” parameter.

The following will return the value in the first parameter after the “index” value

SELECT CHOOSE(1,'a','b','c')
a

The following will return the value in the second parameter after the “index” value

SELECT CHOOSE(2,'a','b','c')
b

The following will return the value in the third parameter after the “index” value

SELECT CHOOSE(3,'a','b','c')
c

If we pass an “index” value greater than the number of values to choose from then the function returns NULL.

SELECT CHOOSE(4,'a','b','c')
NULL
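CHOOSE pairs nicely with functions that return small integers, such as DATEPART. For example, mapping a quarter number to a label:

```sql
-- the QUARTER of 30th April 2018 is 2, so this picks the second value
SELECT CHOOSE(DATEPART(QUARTER, '20180430'), 'Q1', 'Q2', 'Q3', 'Q4');
-- Q2
```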

4. String Functions

There are several string functions provided in SQL Server and I’m going to demonstrate some of the ones I find most useful…


The following repeats the string “ab” 5 times…

SELECT REPLICATE('ab', 5);
ababababab

The following reverses the supplied string…

SELECT REVERSE('abcde');
edcba

The STRING_SPLIT table valued function will split a string based on a supplied separator. This is available from SQL Server 2016 onwards.
The following splits the string into a data set of rows using the underscore character as the separator…

SELECT value
FROM STRING_SPLIT('My_Name_Is_Simon', '_')
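The rows come back in a single column named value, so they can be filtered or joined like any other table. For example, consecutive separators produce empty strings, which we can discard:

```sql
-- the double underscore yields an empty string row; filter it out
SELECT value
FROM STRING_SPLIT('My__Name_Is_Simon', '_')
WHERE value <> '';
```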


The STUFF function is one of the most powerful string manipulation tools.
It can be used to insert (or stuff) a string at any point inside another string.
It can be used to replace a certain number of characters at any point within a string and replace them with another string.
The scalar function returns a VARCHAR and accepts four parameters…
1. This is the original string supplied for modification
2. This is the starting point for the string manipulation
3. This is the number of characters in the original string that are going to be deleted starting at the character supplied in parameter 2
4. This is the string that we want to “stuff” into the original string at the character supplied in parameter 2

In the following example the original string is “abcde”. Parameter 2 says we’re going to perform the “stuff” at character 1, i.e. at character “a”.  Parameter 3 says we’re deleting 0 characters from the original string and the final parameter is the string we want to “stuff” in at character 1.
This will insert the character “x” at the beginning of the original string…

SELECT STUFF('abcde',1,0,'x')
xabcde

We can’t use STUFF to add “x” to the end of the string.  We would want to place it as the 6th character in this example but if we try it just returns NULL.

SELECT STUFF('abcde',6,0,'x')
NULL

However, adding a string to the end of another is just simple concatenation.

SELECT 'abcde' + 'x'
SELECT CONCAT('abcde', 'x')
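One difference between the two is worth noting: with default settings, the + operator propagates NULL, whereas CONCAT treats NULL as an empty string.

```sql
SELECT 'abcde' + NULL;         -- NULL (with the default CONCAT_NULL_YIELDS_NULL ON)
SELECT CONCAT('abcde', NULL);  -- abcde
```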

Now let’s try and replace the “c” character with the “x” character using the STUFF function…

SELECT STUFF('abcde', 3, 1, 'x')
axcde

Obviously this is not specifically looking for the “c” character as the REPLACE function would, it’s just replacing the third character in the original string with the string in the fourth parameter.

If we want to replace the second and third characters of the string we can do the following…

SELECT STUFF('abcde', 2, 2, 'x')
axde

This is saying delete the two characters starting at character 2, i.e. “bc”, and insert the string in the fourth parameter, i.e. “x”.

The fourth parameter doesn’t have to be a single character…

SELECT STUFF('abcde', 2, 2, 'alongerstring')
aalongerstringde

The statement won’t error if the third parameter is a bigger number than the number of characters in the original string.
For example, if we know the original string is less than 100 characters long we can run the following to replace everything but the first character…

SELECT STUFF('abcde', 2, 100, 'XXXXXXX')
aXXXXXXX

SELECT STUFF('fghijklmnopq', 2, 100, 'XXXXXXX')
fXXXXXXX

SQL Server now supplies dynamic data masking functions but we could use STUFF to do something similar such as mask all but the last four digits of a credit card.
Card numbers are typically 16 digits long so we could delete the first 12 digits by starting at digit 1 supplying 12 as the third parameter.  We could then replace that deleted part of the string with 12 asterisk characters…

SELECT STUFF('1234567812345678', 1, 12, REPLICATE('*', 12));
************5678

One final thing on STUFF is that it can be used along with FOR XML to concatenate rows of string data into a single string.  Details of this can be found here.
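As a quick sketch of that technique (using an inline VALUES list rather than a real table): FOR XML PATH('') glues the rows together with a leading comma, and STUFF deletes that first comma.

```sql
SELECT STUFF(
    (SELECT ',' + Name
     FROM (VALUES ('Ann'), ('Bob'), ('Cat')) AS t(Name)
     FOR XML PATH('')),
    1, 1, '');
-- Ann,Bob,Cat
```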

The following functions extract a certain number of characters from a supplied string
This returns the first 2 characters on the left…

SELECT LEFT('abcde', 2);
ab

This returns the last 2 characters…

SELECT RIGHT('abcde', 2);
de

This returns the 3 characters starting from character 2…

SELECT SUBSTRING('abcde', 2, 3);
bcd

I’m only going to show one example of the FORMAT function as there is quite a lot to this one. For full information see the MS documentation.
This example will format a supplied UK phone number into a more readable format…

SELECT FORMAT(07777123678, '+44 (0) #### ### ###');
+44 (0) 7777 123 678

Please note, as the value supplied in the first parameter is a number the leading zero is ignored.
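FORMAT is also handy for date values, where a third parameter lets us specify the culture. An illustrative sketch:

```sql
-- 'd' is the short date pattern; the culture controls the ordering
SELECT FORMAT(CAST('20180430' AS DATE), 'd', 'en-GB');
-- 30/04/2018

SELECT FORMAT(CAST('20180430' AS DATE), 'd', 'en-US');
-- 4/30/2018
```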

There are several other string functions available and please see the MS documentation for full details.



Forcing default values on nullable columns

When we add a new column to an existing table that contains data we need to think about whether we need to populate this column for the existing rows.

If the column should not allow NULL then we need to supply a default for the new column (or set it to allow NULL, backfill it, then change to not allow NULL).

We can add a default constraint to a column that allows NULL but the default is not applied unless we specifically say to set it.

Here’s a demonstration of working with defaults on nullable columns.

First let’s create a table as follows…

DROP TABLE IF EXISTS dbo.WithValuesTest;

CREATE TABLE dbo.WithValuesTest (
    FullName VARCHAR(100) NOT NULL
);

INSERT INTO dbo.WithValuesTest (FullName)
VALUES
    ('Leslie Tambourine'),
    ('Linda Pencilcase'),
    ('Sally Carpets'),
    ('Andy Anchovie'),
    ('Ronnie Clotheshorse');

SELECT *
FROM dbo.WithValuesTest;

This gives us the following data…

If we try to add a new not nullable column to this table it errors because there is no data to put into the new column for the existing rows.

ALTER TABLE dbo.WithValuesTest ADD DateOfBirth DATETIME2(0) NOT NULL;
Msg 4901, Level 16, State 1, Line 86

ALTER TABLE only allows columns to be added that…

  1. can contain nulls, or
  2. have a DEFAULT definition specified, or
  3. the column being added is an identity or timestamp column, or
  4. the table must be empty

The column DateOfBirth cannot be added to the non-empty table WithValuesTest because it does not satisfy these conditions.

However, if we specify a default it will work…

ALTER TABLE dbo.WithValuesTest ADD DateOfBirth DATETIME2(0) NOT NULL DEFAULT '19000101';

SELECT *
FROM dbo.WithValuesTest;


To continue let’s drop this new column


Now let’s try to create the same column with the same default but now allow the column to be nullable…

ALTER TABLE dbo.WithValuesTest ADD DateOfBirth DATETIME2(0) NULL DEFAULT '19000101';

SELECT *
FROM dbo.WithValuesTest;


As you can see, unlike when the column didn’t allow NULL, the default value has not been used.

Let’s drop the column again and this time create it with the WITH VALUES clause…

ALTER TABLE dbo.WithValuesTest ADD DateOfBirth DATETIME2(0) NULL DEFAULT '19000101' WITH VALUES;

SELECT *
FROM dbo.WithValuesTest;


To complete the demo let’s now try and add some new rows.

If we don’t include the DateOfBirth column in the insert list then the default is used…

INSERT INTO dbo.WithValuesTest (FullName)
VALUES ('Bernard Seesaw');


Suppose our code needs to insert the date of birth if it’s supplied but should use the default value if it’s not supplied, i.e. if it’s NULL.

If we just try a straight insert using the NULL value then NULL is what is put into the table.

INSERT INTO dbo.WithValuesTest (FullName, DateOfBirth)
VALUES ('Lizzie Onion', '19830402'),
('Terence Artichoke', NULL);


We can specify DEFAULT instead of NULL and it will use the default value on the insert…

INSERT INTO dbo.WithValuesTest (FullName, DateOfBirth)
VALUES ('Mavis Breadbin', DEFAULT);


The DEFAULT value can be used in an UPDATE statement as well…

UPDATE dbo.WithValuesTest
SET DateOfBirth = DEFAULT
WHERE FullName = 'Terence Artichoke';



Partitioning Series

I’ve now written a few different posts about various parts of partitioning so thought I would put them all together here in a single series.

  1. Introduction to Partitioning – this post covers how to set up a table with multiple partitions using a partition scheme and function and briefly touches on partition elimination to help reduce the amount of data being read in a query.
  2. More Partitioning – this covers partition elimination in more detail and includes tips to make sure the optimiser knows it can eliminate certain partitions. It also covers the $PARTITION function.
  3. Using $PARTITION to find the last time an event occurred – this post demonstrates a real life scenario where I’ve been able to use the $PARTITION function to speed up a query.
  4. 70-761 Exam Prep: Miscellaneous T-SQL – Part 1 – section 3 of this post covers truncating partitions which is new functionality available in SQL Server 2016.
  5. Modifying Partitions – this post shows how to add and remove partitions from an existing partition function.