
Automatic creation of directory structure in Data Lake Gen2 container

In this article we’ll review how to automatically create a directory structure in a Data Lake Gen2 container using Azure DevOps Repos and a release pipeline with a custom PowerShell script. We’ll create a new repo for the Data Lake, add the PowerShell script to a separate repo where we store the DevOps scripts (or use it inline), and create a release pipeline that creates the directories upon pushing to the Data Lake repository.

Create Data Lake repository

  1. Create a new data lake repository.
  • Log in to Azure DevOps, open your project and navigate to Project settings > Repositories > Create.
  • Leave the repository type as “Git”, type a name for the repository that makes sense for you (e.g. “data_lake”) and click “Create”.

2. For easier maintenance, create a folder for each container you’d like to update and add a text file containing the directory paths to create (one per line).

3. Add branches for your development and pre-production environments, if you have any (e.g. “dev” and “qlty”).
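As a sketch of step 2, a container folder in the repo might hold a text file (the name “directories.txt” is just an example) listing the paths to create, one per line:

```text
raw/sales/2024
raw/sales/2025
curated/reports
staging
```

Nested paths like “raw/sales/2024” are fine; as we’ll see below, the script creates any missing parent directories automatically.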

Add the PowerShell script for directory creation in Data Lake Gen2 container

Here’s a PowerShell script that creates directories inside a Data Lake Gen2 container. If you have a specific place where you store your DevOps scripts, you can add it there (and, for example, name it “adls2_folder_creation.ps1”), or you can add it as an inline script in the release pipeline itself.

param
(
    [parameter(Mandatory = $true)] [String] $path,
    [parameter(Mandatory = $true)] [String] $subscriptionid,
    [parameter(Mandatory = $true)] [String] $rgname,
    [parameter(Mandatory = $true)] [String] $accname,
    [parameter(Mandatory = $true)] [String] $container
)
# Move to the container's folder inside the repo checkout
Set-Location $path

# Pick the most recently touched file in the folder - the directory list
$textfile = Get-ChildItem -Path $path -File | Sort-Object LastAccessTime -Descending | Select-Object -First 1

# Read the directory paths (one per line) and trim surrounding whitespace
$dirs = Get-Content $textfile.FullName
$dirs = $dirs.Trim()

# Select the subscription and build a storage context for the target account
Set-AzContext -SubscriptionId $subscriptionid
$storageAccount = Get-AzStorageAccount -ResourceGroupName $rgname -AccountName $accname
$ctx = $storageAccount.Context

# Create each directory, skipping blank lines
foreach ($line in $dirs) {
    if ($line -ne "") {
        New-AzDataLakeGen2Item -Context $ctx -FileSystem $container -Path $line -Directory
    }
}

The script takes parameters for the path to the repository folder, the subscription id, the resource group name, the account name and the container name. It picks the latest file from that folder, trims each line, signs in to the storage account context, and creates a directory for each line while skipping any blank rows. If a line is “folder/subfolder/subsubfolder” and none of these directories exist yet, the cmdlet creates the whole path at once (i.e. you don’t have to list each parent directory separately).
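For a quick local test before wiring it into the pipeline, you could invoke the script like this (a sketch: it assumes the Az.Accounts and Az.Storage modules are installed and you’re already signed in via Connect-AzAccount; all parameter values are placeholders):

```powershell
# All values below are hypothetical - replace them with your own
.\adls2_folder_creation.ps1 `
    -path "C:\repos\data_lake\raw" `
    -subscriptionid "00000000-0000-0000-0000-000000000000" `
    -rgname "my-resource-group" `
    -accname "mydatalakeacct" `
    -container "raw"
```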

Create release pipeline for automatic creation upon push to the Data Lake repository

Go to your project’s Pipelines > Releases > New release pipeline.

Select “Empty job”

Rename the stage name and pipeline name for easier identification.

For the first artifact add the data lake’s repository and select the stage’s branch and the latest version from the default branch.

And then enable the continuous deployment trigger.

For the second artifact add the repo where you store your DevOps scripts (but don’t enable any triggers), or skip this step if you want to use the script inline inside the PowerShell task. In my case this repo is named “IvoTalksTech” and it looks like this:

Now go to the stage’s job and search for “Azure PowerShell” task and add it.

Then change the configurations as follows:

  1. Display Name – rename it as you see fit (e.g. Deploy ADLS2 {container_name} directory structure).
  2. Azure Subscription – select/create a service connection to a service principal that has the Contributor role-based access control assignment.
  3. Script type – select “Script File Path” if you have your DevOps scripts stored in a different repo, or alternatively select “Inline script” if you want to paste the script inside the job.
  4. Script path – add the path to the script.
  5. Script arguments
    1. path – path to data_lake repo (incl. the folder for the container).
    2. subscriptionid – your subscription id.
    3. rgname – the name of the resource group.
    4. accname – the name of the Data Lake Gen2 account.
    5. container – the name of the container that is inside of the Data Lake Gen2 account.
  6. ErrorActionPreference – Continue.
  7. Azure PowerShell Version – Latest installed version.
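Putting the arguments from step 5 together, the Script Arguments field might look like this (a sketch: the artifact alias “_data_lake”, the folder “raw” and all other values are examples to adjust to your setup):

```text
-path "$(System.DefaultWorkingDirectory)/_data_lake/raw" -subscriptionid "00000000-0000-0000-0000-000000000000" -rgname "my-resource-group" -accname "mydatalakeacct" -container "raw"
```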

Test run

Add a list of directories to the text file inside the data_lake repo and commit.

The release pipeline automatically starts and completes in less than a minute.

The folders inside the Data Lake Gen2 container are created.
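If you’d rather verify from PowerShell instead of the portal, you can list the container’s contents with the same kind of storage context the script builds (values are placeholders):

```powershell
# Recursively list everything in the container to confirm the directories exist
Get-AzDataLakeGen2ChildItem -Context $ctx -FileSystem "raw" -Recurse
```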

Enjoy!

Stay Awesome,
Ivelin Dochev
