
Databricks CI/CD

In this article we'll review how to implement continuous integration and continuous delivery on Azure Databricks using Azure DevOps. We'll cover how to build and deploy Databricks notebooks, interactive clusters and the libraries for those interactive clusters.

We'll start by organizing the authorization via Service Principals and Databricks Personal Access Tokens, including the setup of service connections in DevOps. Then we'll create a new repo and set up a build. Afterwards we'll create a release pipeline and finish with a test run.

Set up the authorizations and connections

Let's set up a new Databricks PAT (Personal Access Token) that will be used to read information from Databricks. Go to your Databricks Workspace > User Settings > Generate New Token.

Copy the value and go to your Key Vault to add it as a secret (i.e. your Key Vault > Secrets > Generate/Import).

Type a name for your secret and paste the value of your PAT (personal access token).
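
If you prefer scripting this part, a minimal sketch is shown below; it assumes the (legacy) databricks-cli is already installed and authenticated, and it reuses the Key Vault and secret names from this walkthrough, so adjust them to your own.

# Create a 90-day PAT ("devops-ci-cd" is just an example comment) and store it in Key Vault
$pat = (databricks tokens create --lifetime-seconds 7776000 --comment "devops-ci-cd" | ConvertFrom-Json).token_value
az keyvault secret set --vault-name "ivo-akv-dev" --name "ivo-adb-access-token-kvs" --value $pat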

Now let's create a new service principal that will be used to connect the services. Go to App Registrations > New registration.

Type a name for your service principal and click Register.

After the registration completes, go to "Certificates & secrets" > "New client secret", add a description, select an expiration, then click "Add"
(save the value, you'll need it later).

Now go to the subscription or resource group you'd like to use, open "Access Control (IAM)" > Add > Add role assignment. Select the "Contributor" role and the service principal you've created. Click Save.
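
For reference, the service principal, client secret and role assignment can also be created in one go from the CLI. This is only a sketch: the subscription id is a placeholder and "ivo-devops-sp" is the name used later in this article.

# Creates the app registration, service principal, a client secret and the Contributor role assignment
az ad sp create-for-rbac `
    --name "ivo-devops-sp" `
    --role "Contributor" `
    --scopes "/subscriptions/<subscription-id>"
# The output contains appId (Service Principal Id), password (client secret) and tenant,
# which you'll need for the DevOps service connection below.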

Now log in to Azure DevOps, go to your project > Service connections > Create service connection > "Azure Resource Manager" > Next.

Select Service Principal (manual) > Next and fill in the fields below (a scripted alternative follows the list).

  • Environment: Azure Cloud
  • Scope level: Subscription
  • Subscription id: Go to Portal > Search “Subscriptions” > we’ll see the subscription id there
  • Subscription name: same as above
  • Service Principal Id: Go to Portal > Search “App Registrations” > Search your SP name > we’ll see the Application (client) id there
  • Credential: Service principal key
  • Service Principal Key: the secret for the service principal we’ve created above.
  • Tenant Id: Go to Portal > Search “Tenant properties” > we’ll see the Tenant id there
  • Service connection name: write something that’ll make sense to you
  • Description (optional): same as above
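
If you'd rather script the connection, a rough equivalent using the azure-devops CLI extension looks like this; every value is a placeholder for the ones collected above and the connection name is only an example.

# The client secret is passed through this environment variable, not a flag
$env:AZURE_DEVOPS_EXT_AZURE_RM_SERVICE_PRINCIPAL_KEY = "<service-principal-secret>"
az devops service-endpoint azurerm create `
    --name "ivo-devops-sp" `
    --azure-rm-service-principal-id "<application-client-id>" `
    --azure-rm-subscription-id "<subscription-id>" `
    --azure-rm-subscription-name "<subscription-name>" `
    --azure-rm-tenant-id "<tenant-id>" `
    --organization "https://dev.azure.com/<your-org>" `
    --project "<your-project>"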

Now go to your Key Vault > Access policies > + Add Access Policy

Select all secret permissions, select the service principal, click Select and then Add:

Once you're ready, don't forget to click "Save" to commit your changes.
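
The same policy can be assigned from the CLI; a short sketch using the Key Vault name from this article and the service principal's application (client) id as a placeholder:

# Grant the service principal the secret permissions it needs on the vault
az keyvault set-policy `
    --name "ivo-akv-dev" `
    --spn "<application-client-id>" `
    --secret-permissions get list set delete backup restore recover purge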

Go to Artifacts > Create Feed, type a name and click Create. This creates a feed in the Artifacts section, which will make your builds easier to maintain and review (traceability).

Set up Continuous Integration

Create a new Databricks repository by going to Project Settings > Repositories > Create, and add your repo name.

Then click on “Set up build”

Select “Starter Pipeline”

Add the following YAML code:

trigger:
  branches:
    include:
    - main
  paths:
    include:
    - notebooks

pool:
  vmImage: "windows-latest"

steps:

- checkout: self
  path: src
  persistCredentials: true
  clean: true

- task: DeleteFiles@1
  inputs:
    SourceFolder: '$(System.DefaultWorkingDirectory)'
    Contents: '*.yml'

- task: DeleteFiles@1
  inputs:
    SourceFolder: '$(System.DefaultWorkingDirectory)'
    Contents: '*.md'

- task: AzureKeyVault@1
  inputs:
    azureSubscription: 'ivo-devops-sp'
    KeyVaultName: 'ivo-akv-dev'
    SecretsFilter: '*'
    RunAsPreJob: true

  
- task: AzurePowerShell@5
  inputs:
    azureSubscription: 'ivo-devops-sp'
    ScriptType: 'InlineScript'
    Inline: |
      # Install the Databricks CLI and authenticate with the PAT retrieved from Key Vault
      pip3 install databricks-cli
      $env:PYTHONIOENCODING="utf8"
      $env:DATABRICKS_HOST="https://northeurope.azuredatabricks.net"
      $env:DATABRICKS_TOKEN="$(ivo-adb-access-token-kvs)"
      # Collect the ids and names of all clusters in the workspace
      $cluster_id = databricks clusters list --output "JSON" | jq -r '.clusters[].cluster_id'
      $cluster_name = databricks clusters list --output "JSON" | jq -r '.clusters[].cluster_name'
      New-Item -Name "clusters" -ItemType "directory"
      New-Item -Name "libraries" -ItemType "directory"
      # Export each cluster's configuration to clusters/<cluster_name>.json
      cd clusters
      for ($i = 0; $i -lt $cluster_id.length; $i++) {
        $file = $cluster_name[$i] + ".json"
        databricks clusters get --cluster-id $cluster_id[$i] | Set-Content $file
      }
      # Drop exported job-cluster files (their names start with "job-"), we only keep interactive clusters
      Remove-Item job*
      cd ..
      # Export each cluster's library definitions to libraries/<cluster_name>.json
      cd libraries
      for ($i = 0; $i -lt $cluster_id.length; $i++) {
        $file = $cluster_name[$i] + ".json"
        databricks libraries cluster-status --cluster-id $cluster_id[$i] | Set-Content $file
      }
      Remove-Item job*
    errorActionPreference: 'continue'
    azurePowerShellVersion: 'LatestVersion'

- task: PublishPipelineArtifact@1
  inputs:
    targetPath: '$(System.DefaultWorkingDirectory)'
    publishLocation: pipeline
    Artifact: 'drop'


- task: UniversalPackages@0
  displayName: 'Universal publish'
  inputs:
    command: 'publish'
    publishDirectory: '$(Build.ArtifactStagingDirectory)'
    feedsToUsePublish: 'internal'
    vstsFeedPublish: '6ae33e5f-9792-45d4-aa00-d4b805d85dce/daa02d26-986a-4dca-aa8a-ab8bea2fc485'
    vstsFeedPackagePublish: 'drop-ivo-adb'
    versionOption: 'patch'
    packagePublishDescription: 'ver'

  • The trigger will start the build pipeline every time a pull request / push to the main (master) branch is completed, picking up everything inside the notebooks folder.
  • For the pool we're using a regular Windows VM image.
  • The checkout step gets the branch that is currently in use.
  • The delete tasks remove all files with ".yml" and ".md" extensions.
  • The Key Vault task retrieves the secret holding your Databricks PAT.
  • The inline PowerShell task installs and logs in to databricks-cli using the credentials from the Key Vault secret, then exports the cluster configurations and library definitions for each cluster into the clusters and libraries folders respectively (a sketch for testing this part locally follows this list).
  • The Publish Artifact task publishes the artifact within your build.
  • The Universal publish task links your artifact into the Artifact Feeds (change vstsFeedPublish to your Artifact Feed by clicking on the task's settings and selecting your feed from the "Destination Feed" dropdown).
  • As for the authorizations, if you have different values, change "ivo-devops-sp" to your service principal, "ivo-akv-dev" to your Key Vault and "ivo-adb-access-token-kvs" to your Key Vault secret.
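
If you'd like to sanity-check the export logic locally before committing the pipeline, a minimal sketch (assuming Python, jq and the databricks-cli are available, with placeholder workspace URL/PAT values) looks like this:

# Run locally, outside the pipeline; replace the host and token with your own values
pip3 install databricks-cli
$env:DATABRICKS_HOST = "https://northeurope.azuredatabricks.net"
$env:DATABRICKS_TOKEN = "<your-databricks-pat>"
# @() forces an array even when the workspace has a single cluster
$cluster_id = @(databricks clusters list --output "JSON" | jq -r '.clusters[].cluster_id')
$cluster_id                                          # ids of interactive (and job) clusters
databricks clusters get --cluster-id $cluster_id[0]  # full JSON config of the first cluster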

For easier maintenance, you can rename the build and move it to a different folder by selecting the "rename/move" option within the build pipeline properties.

This way, if you have multiple CI/CD setups within one project, it'll be easier to distinguish the different pipelines.

Set up Continuous Delivery

Go to Pipelines > Releases and create a new release pipeline.

Select "Empty job".

Type a stage name (for example "Production") and in the Artifacts section click "+ Add".

Select source type "Build" and choose the build pipeline we created in the CI part of this article. Finish by clicking "Add".

Click on "Continuous deployment trigger", select "Enabled", and rename the pipeline from "New release pipeline" to something that makes sense to you.

If necessary, you can add pre-deployment conditions like a manual approval step, with approvers being either specific persons or anyone from a specific team. Click on "Pre-deployment conditions" > "Pre-deployment approvals" > Enabled > Approvers.

Now go to the Production stage > Add a task to the Agent job > search for and add the Azure Key Vault task.

For the display name you can enter the Key Vault name; for Azure subscription, use the service connection we added above; for Key Vault, select the vault where the secret is located; and in the secrets filter you can leave "*" to get all secrets or pick the specific secret. Check "Make secrets available to whole job". To set up the PAT for the production environment, follow the same steps as we did for the development environment above.
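
Before running the release, you can optionally verify that the service principal can actually read the PAT secret; the ids below are the placeholders used earlier in this article.

# Log in as the service principal and read the secret value
az login --service-principal --username "<application-client-id>" --password "<service-principal-secret>" --tenant "<tenant-id>"
az keyvault secret show --vault-name "ivo-akv-dev" --name "ivo-adb-access-token-kvs" --query "value" -o tsv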

Next, search for “DevOps for Azure Databricks”, “Azure Databricks REST API”, “Databricks Script Deployment Task by Data Thirst” and install them to your organization for free.

After they're installed, search for the "Configure Databricks CLI" task and add it.

After we add it, we need to provide the Workspace URL and the Access Token (which we get from the Key Vault).

Now for the Databricks notebooks deployment task, search for "Deploy Databricks Notebooks".

For "Databricks folder" add "$(System.DefaultWorkingDirectory)/_CI_ADB_ITT/drop/notebooks/Shared" (or, if you don't use "Shared", pick the directory you'll be using, and do the same for "Workspace folder").
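
Under the hood this is roughly equivalent to importing the artifact folder with the databricks-cli; a sketch, assuming the CLI is already configured for the target workspace:

# Recursively import the notebooks from the build artifact into /Shared, overwriting existing ones
databricks workspace import_dir "$(System.DefaultWorkingDirectory)/_CI_ADB_ITT/drop/notebooks/Shared" "/Shared" --overwrite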

Next, search for “databricks deploy cluster” and add it as a task:

Here we'll have to pick the Azure region the Databricks Workspace is located in, set the JSON config file to "$(System.DefaultWorkingDirectory)/_CI_ADB_ITT/drop/clusters/IVO_S_F4.json" (IVO_S_F4 is the name of the cluster I want to deploy), add the "Databricks bearer token" (this is the PAT from the Key Vault) and add a reference name (e.g. DeployCluster1).

Next, search for "Start Databricks Cluster" (we need to start the interactive cluster, because in the next step we'd like to install the libraries on the cluster and we can't do that while the cluster is in an inactive state).

For Cluster ID add “$(DeployCluster1.DatabricksClusterId)”
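
For reference, those two tasks roughly correspond to the following databricks-cli calls. This is only a sketch; the exported config may need read-only fields such as cluster_id and state stripped before the create call accepts it.

# Create the cluster from the exported JSON config, then start it so libraries can be installed
$cluster = databricks clusters create --json-file "$(System.DefaultWorkingDirectory)/_CI_ADB_ITT/drop/clusters/IVO_S_F4.json" | ConvertFrom-Json
databricks clusters start --cluster-id $cluster.cluster_id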

Finally, add an "Azure PowerShell" task with the following inline script to install the PyPI and CRAN libraries on the interactive cluster.

# Read the library definitions exported by the build pipeline
$json = Get-Content -Raw -Path "$(System.DefaultWorkingDirectory)/_CI_ADB_ITT/drop/libraries/IVO_S_F4.json" | Out-String | ConvertFrom-Json
$pypiPackage = $json.library_statuses.library.pypi.package
$cranPackage = $json.library_statuses.library.cran.package

# Write a .databrickscfg in the agent user's home folder so the Databricks CLI can authenticate
cd C:\Users\VssAdministrator

echo "[DEFAULT]
host = https://northeurope.azuredatabricks.net
token = $(ivo-adb-access-token-kvs)" > .databrickscfg

# Install every exported PyPI and CRAN package on the cluster deployed in the previous step
foreach ($pp in $pypiPackage) {databricks libraries install --cluster-id $(DeployCluster1.DatabricksClusterId) --pypi-package $pp}
foreach ($cp in $cranPackage) {databricks libraries install --cluster-id $(DeployCluster1.DatabricksClusterId) --cran-package $cp}
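
If you want to confirm the installs from the same task, you can append a status call at the end of the script; this is optional and reuses the same cluster id variable.

# Prints the status (e.g. INSTALLED / PENDING) of every library on the deployed cluster
databricks libraries cluster-status --cluster-id $(DeployCluster1.DatabricksClusterId)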

If you have a pre-production environment, you can reproduce the steps for it.

CI/CD test run

Let's start by creating a new branch in the Databricks repo. Go to your project's repo, click on "New branch", add a name and click "Create".

Log in to the Databricks Workspace and create a new notebook.

Open the notebook, click on "Revision history" and sync it with Git.

Then click "Save now" to commit your changes to Git.

As for interactive Databricks clusters, I already have a cluster ready for this test, "IVO_S_F4", with both PyPI and CRAN libraries installed in the development environment (ivo-adb-dev).

Now let's create a new pull request from the feature branch to main (master). After completion it'll trigger the CI/CD process.
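
The pull request can also be opened from the CLI with the azure-devops extension; the repository and branch names below are placeholders for this walkthrough.

# Open a PR from the feature branch into main
az repos pr create `
    --repository "<your-databricks-repo>" `
    --source-branch "<your-feature-branch>" `
    --target-branch "main" `
    --title "Add new notebook"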

After completion of the pull request, the build pipeline (the CI part) starts automatically.

We can click on the run to see what's happening: the repository and version, the commit that initiated it, when it ran and how long it took, the artifact (1 published) and the job itself.

Clicking on the artifact (1 published), you can see the clusters, the libraries for the clusters and the notebooks.

Clicking on the job will show the logs of the steps defined in the YML file.

Here we have the Key Vault task, the checkout of the branch, the deletion of the YML/MD files, the PowerShell task that gets the cluster configurations and library definitions, the publishing of the artifact and the linking of the artifact to the Artifacts feed.

If we go to the Artifacts feed, we can see this build again.

It points back to the build pipeline.

Upon build completion, the release pipeline starts and is currently awaiting the pre-deployment approval that we set up earlier. We also receive email notifications about this:

To approve the release, go to Pipelines > Releases, where we can see the history and any pending releases.

Clicking on the stage, we can either approve or reject the release.

Selecting the logs on the release pipeline will show all the steps that are taken. Here we can see that all the steps passed successfully. The artifact was loaded, the Databricks CLI was configured, the notebooks were deployed, the cluster was deployed, the interactive cluster was started and the libraries for the cluster were installed.

Now let's see if this indeed happened in the Databricks Workspace. Logging in to the production environment (ivo-adb-prod), we can see that the cluster was deployed and the libraries were installed.

Going to the Databricks Workspace we can see that the notebook is there.

Opening the notebook, we can see that the code is the same.

Enjoy the CI/CD setup!

Stay Awesome,
Ivelin
