I’ve been using Azure Data Lake for a little while now and have been looking at some of the tools used to read, write and analyse the data including Data Lake Analytics using U-SQL and more recently Azure Databricks.
As an ad-hoc analysis tool I think the Databricks notebooks are great and have been able to create a few basic reports using some of our streaming data via an Event Hub. There is loads you can do with Databricks including ETL and we can now execute Python scripts against Databricks clusters using Data Factory.
Some aspects of using Azure Databricks are very easy to get started with, especially using the notebooks, but there were a few things that took a lot longer to get up and running than I first expected.
Some of the documentation is good but there were a few things I had to piece together from various sources and tinker about myself to get it working.
Connecting to the Data Lake is simple but you don’t want connection details hard-coded in your notebooks and storing the sensitive information in the Azure Key Vault is a little more involved.
The following step-by-step guide should allow any complete beginner to get this up and running fairly quickly – although there are quite a few steps and a few places where things can go wrong.
Please note, the Azure Portal Environment changes frequently and these instructions are only accurate at the time of writing. If things move or are renamed hopefully there is enough info below to work out what is required.
In order to connect to the Azure Data Lake we can create a credential in Azure Active Directory (AAD) with access to the relevant files and folders. We need a ClientID and a key for this credential and also need a reference to our AAD. We can store these values in Azure Key Vault and use Databricks secrets to access them.
Firstly, let’s look at the data we want to access in the Azure Data Lake.
Log in to portal.azure.com and navigate to the Data Lake Storage and then Data Explorer.
In this example I’ve created a new Data Lake Store named simon and will now upload some speed camera data I’ve mocked up. This is the data we want to access using Databricks.
If we click on Folder Properties on the root folder in the Data Lake we can see the URL we need to connect to the Data Lake from Databricks. This is the value in the PATH field, in this case, adl://simon.azuredatalakestore.net
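The URL follows a standard pattern, so if you know the store name you can also construct it yourself. A quick sketch in Python, using the store name from this example:

```python
# Build the ADLS (Gen1) URL from the Data Lake Store name.
# "simon" is the store created earlier in this walkthrough.
store_name = "simon"
adls_url = "adl://{}.azuredatalakestore.net".format(store_name)
print(adls_url)  # adl://simon.azuredatalakestore.net
```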
Now we’ve got the files in place let’s set up everything we need to securely connect to it with Databricks.
To do this we need to create an App registration in AAD via the Azure Portal.
Navigate to the Azure Active Directory resource and click on App registration in the menu on the left.
Click on New application registration and enter a Name and Sign-on URL (we don’t use the Sign-on URL so this can be anything you like).
Now we need to create a key for this App registration which Databricks can use in its connection to the Data Lake.
Once the App registration is created click on Settings.
Click on the Keys option, enter a new key description and set the expiry date of the key.
Then click on Save and the key value will be displayed. You need to make a note of this value somewhere as you cannot view this value again.
You’ll also need to make a note of the Application ID of the App Registration as this is also used in the connection (although this one can be obtained again later on if need be).
As I mentioned above we don’t want to hard code these values into our Databricks notebooks or script files so a better option is to store this in the Azure Key Vault.
Navigate to the Azure Key Vault resource and either use an existing appropriate Key Vault or create a new one.
Click on Properties in the menu on the left and make a note of the DNS NAME and RESOURCE ID values of the Key Vault. These are needed when setting up the Databricks Secret Scope later on.
Click on Secrets in the menu on the left and create a new secret for each of the bits of sensitive data needed in the Databricks connection which are as follows…
ClientID comes from the Application ID of the new App registration
Credential comes from the key created on the new App registration
RefreshURL comes from the DirectoryID shown in the Properties of Azure Active Directory and should be in the format https://login.microsoftonline.com/<DirectoryID>/oauth2/token
I’ve also included the URL of the Data Lake as the Simon-ADLS-URL secret.
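The RefreshURL is just the standard OAuth2 token endpoint with your DirectoryID substituted in. A quick sketch of the format, using a made-up placeholder GUID in place of your real AAD DirectoryID:

```python
# Placeholder GUID -- replace with the DirectoryID from the
# Properties blade of Azure Active Directory.
directory_id = "00000000-0000-0000-0000-000000000000"
refresh_url = "https://login.microsoftonline.com/{}/oauth2/token".format(directory_id)
print(refresh_url)
```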
Next we need to give the App Registration permissions to read the data from the Data Lake.
We need to make sure that all folders from the root down to the files have Execute permissions as an access entry.
In a Production environment it’s likely that there will be some process that loads new files into the Data Lake. For the folder(s) where the files are loaded we need to set the permission as Read, Execute and an access and default permission entry to all folders and their children. This means that any new files added to these folders will be given these same permissions by default. If this isn’t set up then the new files will cause the Databricks code to fail.
To add these permissions click on Access on the relevant folder. Then click Add.
Enter the App Registration name in the text box and select it when it is displayed below.
Then select the appropriate permissions as described above. Once you’ve got far enough down the chain of folders where all sub-folders and files need to be accessed by Databricks then you can choose This folder and all children.
Now we’ve got all our sensitive data stored in Azure Key Vault Secrets and permissions on the Data Lake set up we need to create an Azure Databricks Secret Scope and link it to our Key Vault.
I’m assuming you’ve already managed to set up Azure Databricks, if not, you can do this quite easily via the Azure Portal.
Once you’ve got Databricks set up you’ll connect to it via a URL something like…
As you can see from this URL, my Databricks is running in North Europe so to create a new Secret Scope we need to navigate to…
Once we’ve logged in, the following page is displayed. Give the scope a name and choose an option for the Managed Principal. For this demo I’m just using the Standard Pricing Tier for Databricks so I have to choose All Users. The Creator option is only available in Premium Tier.
Enter the DNS NAME and RESOURCE ID values from the Properties of the Key Vault and click Create.
We’re finally ready!
Now we can start a Databricks cluster and enter the following in a new Python notebook (the syntax for Scala is very similar)
```python
client_id = dbutils.secrets.get(scope = "SimonTemp", key = "Simon-ADLS-ClientID")
credential = dbutils.secrets.get(scope = "SimonTemp", key = "Simon-ADLS-Credential")
refresh_url = dbutils.secrets.get(scope = "SimonTemp", key = "Simon-ADLS-Refresh-URL")
adls_url = dbutils.secrets.get(scope = "SimonTemp", key = "Simon-ADLS-URL")

spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", client_id)
spark.conf.set("dfs.adls.oauth2.credential", credential)
spark.conf.set("dfs.adls.oauth2.refresh.url", refresh_url)
```
This code is using the Databricks Secret Scope named SimonTemp created above to access the four secrets we put in the Azure Key Vault and store them in variables.
We can then use the SparkSession object spark which is available automatically in an Azure Databricks cluster and use the client_id, credential and refresh_url variables to set the authentication values required to connect to the Data Lake.
Now we can use the final variable adls_url along with the rest of the path to read the CSV files and show the data as follows…
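A minimal sketch of that read step is below. The speedcameras folder name is an assumption for the mocked-up speed camera data from earlier, and spark and adls_url come from the notebook code above:

```python
# adls_url would normally come from the Key Vault secret via dbutils;
# hard-coded here so the path construction is visible.
adls_url = "adl://simon.azuredatalakestore.net"

# "speedcameras" is an assumed folder name for the mocked-up data.
csv_path = adls_url + "/speedcameras/*.csv"

# On a Databricks cluster, spark is pre-defined, so reading and
# showing the data would look like:
# df = spark.read.csv(csv_path, header=True, inferSchema=True)
# df.show()
```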
If you’ve managed to follow these fairly long instructions hopefully you’re able to read your data. However, if your read statement is just hanging then it’s likely you’ve not set the correct values in the key vault. This happened to me the first time I tried this and I didn’t get any errors in Databricks. I had to re-enter the values in my key vault secrets to fix it.
If you’ve not set up permissions correctly in the Data Lake then you will receive an error something like…
org.apache.hadoop.security.AccessControlException: LISTSTATUS failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.).
In this case make sure the App Registration has access from the root folder all the way down to the files you are trying to read.