Backup Data Lake Gen2 containers with Data Factory

In this article we’ll review how to create a Data Factory pipeline, linked services, a dataset and a trigger in order to copy files between Data Lakes. Storage accounts can be set up with disaster recovery and replication (and some have soft delete enabled), but none of that covers accidental deletes caused by human error. We’ll be copying the data between the Data Lakes for backup purposes.

Setup Authorizations

For authorizations we can use Managed Identity, Access Keys or Service Principals.
1. For Access Keys – you just need the storage account’s access key. Optionally store it as a secret in Key Vault and create a Key Vault Linked Service in Data Factory.
2. For a Service Principal – you need to create a service principal, create a secret for it and assign it RBAC permissions on the Data Lake. Optionally store the secret in Key Vault and create a Key Vault Linked Service in Data Factory.
3. For Managed Identity – assign the Data Factory’s Managed Identity RBAC permissions on the Data Lake.

I’ve already covered options 1 and 2 in older articles, so now I’m going to show you how this can be done with a Managed Identity.

Go to your Data Lake > Access Control (IAM) > Add Role Assignment

Select Storage Blob Data Contributor and then Next. (This is the data-plane role that grants read/write access to the blobs themselves; the general-purpose Contributor role manages the account but does not by itself grant access to the data.)

Under “Assign access to”, select Managed Identity; for members, pick Data Factory (V2) and your Data Factory, then click Select > Next > Review + assign.

Do the same for the Backup Data Lake.
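The portal steps above boil down to a single role assignment on each storage account. As a sketch, this is roughly the payload the portal submits for you (the subscription, resource group, storage account and principal ID below are all placeholders; the GUID is the built-in role definition ID of Storage Blob Data Contributor):

```python
# Sketch of the role assignment the portal creates for you.
# All subscription/resource names below are placeholders for illustration.
subscription = "00000000-0000-0000-0000-000000000000"
resource_group = "rg-datalake"
storage_account = "mydatalake"

# Scope of the assignment: the Data Lake storage account itself.
scope = (
    f"/subscriptions/{subscription}/resourceGroups/{resource_group}"
    f"/providers/Microsoft.Storage/storageAccounts/{storage_account}"
)

# Built-in role definition ID of "Storage Blob Data Contributor".
role_definition_id = (
    f"/subscriptions/{subscription}/providers/Microsoft.Authorization"
    "/roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe"
)

# Body of the PUT roleAssignments REST call made behind the scenes.
assignment = {
    "properties": {
        "roleDefinitionId": role_definition_id,
        # Object ID of the Data Factory's managed identity (placeholder).
        "principalId": "<data-factory-managed-identity-object-id>",
        # Managed identities are service principals in Azure AD.
        "principalType": "ServicePrincipal",
    }
}
```

The same assignment, scoped to the backup storage account, covers the second Data Lake.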

Setup Data Factory objects

We need to set up several objects in Azure Data Factory Studio.

  1. Linked Services
  2. Datasets
  3. Pipeline
  4. Trigger

For creation of the Linked Services go to:
Manage > Linked services > New > search for “Data Lake” and select Gen2 > Continue

Select Managed Identity, your storage account and click Create.

Do the same for the Backup Data Lake.
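Under the hood, each linked service created in the Studio is a small JSON document. A sketch of what the Managed Identity ADLS Gen2 linked service looks like, built here as a Python dict (the name `SourceDataLake` and the URL are placeholders):

```python
# Sketch of the linked-service JSON that Data Factory Studio generates.
# Name and URL are placeholders; create a second one for the Backup Data Lake.
source_linked_service = {
    "name": "SourceDataLake",
    "properties": {
        # ADLS Gen2 linked services use the AzureBlobFS type.
        "type": "AzureBlobFS",
        "typeProperties": {
            # The dfs endpoint of the storage account.
            "url": "https://mydatalake.dfs.core.windows.net"
            # Note: no keys or secrets here. With the system-assigned
            # Managed Identity, Data Factory authenticates as itself.
        },
    },
}
```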

For creation of Copy Pipeline and Datasets go to:
Author > Pipeline > New Pipeline

Search for the “Copy data” activity and drag it onto the pipeline canvas.

Select Source > Source dataset > New > search for “Data Lake” and pick Gen2 > Continue

Select Binary > Continue

Select the Linked Service pointing to the source Data Lake that we created earlier. For example, if we only want to back up the container named “ivo”, we can enter it in the File path and then select OK.
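The resulting dataset is again just JSON. A sketch of the Binary source dataset with the “ivo” container as its file path (the dataset and linked-service names are placeholders):

```python
# Sketch of the Binary dataset JSON generated by the Studio.
source_dataset = {
    "name": "SourceBinaryDataset",
    "properties": {
        # Binary datasets copy files as-is, without parsing their format.
        "type": "Binary",
        "linkedServiceName": {
            "referenceName": "SourceDataLake",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                # The container (file system) we want to back up.
                "fileSystem": "ivo",
            }
        },
    },
}
```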

Do the same for the Backup Data Lake on the Sink tab (don’t forget to pick the target linked service).

In the Settings section of the Copy activity, we can set a specific number of Data Integration Units and a specific degree of copy parallelism instead of using the Auto option. This can help reduce costs, but it can slow down the copy. If we also want the access control lists saved, we can select ACLs under the “Preserve” setting. If you’re not sure about an option, click the (i) icon next to it for a short description and a link to the Microsoft documentation.
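Those settings end up on the Copy activity itself. A sketch of the activity JSON with fixed DIUs, a fixed degree of parallelism and ACL preservation enabled (the activity and dataset names, and the specific numbers, are placeholders to adjust for your workload):

```python
# Sketch of the Copy activity JSON with explicit performance settings.
copy_activity = {
    "name": "BackupCopy",
    "type": "Copy",
    "inputs": [{"referenceName": "SourceBinaryDataset", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "BackupBinaryDataset", "type": "DatasetReference"}],
    "typeProperties": {
        "source": {"type": "BinarySource"},
        "sink": {"type": "BinarySink"},
        # Fixed values instead of "Auto": cheaper, but potentially slower.
        "dataIntegrationUnits": 4,
        "parallelCopies": 2,
        # Also carry the POSIX access control lists over to the backup.
        "preserve": ["ACL"],
    },
}
```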

Now that we have the pipeline, we should also add a scheduled trigger so it runs automatically on a set timeframe. Go to Add Trigger > New/Edit.

If we want to run this backup every month, we can create a Schedule trigger with Recurrence set to Every 1 Month. Under the advanced recurrence options, set “Month days” to “Last”, pick the time of day (so the runs are consistent), and then click OK.
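The trigger is one more JSON document. A sketch of a monthly schedule trigger firing on the last day of each month at 02:00 UTC (the names and times are placeholders; in the schedule, `-1` in `monthDays` means the last day of the month):

```python
# Sketch of the schedule-trigger JSON behind the Studio's trigger dialog.
trigger = {
    "name": "MonthlyBackupTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Month",
                "interval": 1,
                "startTime": "2024-01-31T02:00:00Z",
                "timeZone": "UTC",
                "schedule": {
                    # -1 = last day of the month; fire at 02:00.
                    "monthDays": [-1],
                    "hours": [2],
                    "minutes": [0],
                },
            }
        },
        # The pipeline(s) this trigger starts.
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "BackupPipeline",
                    "type": "PipelineReference",
                }
            }
        ],
    },
}
```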

Test the pipeline with a dummy run

Go to Trigger > Trigger Now (or Debug)

Then view the pipeline run:

And now we can see that the pipeline copied the files.

Enjoy the Copy Pipeline!

Stay Awesome,
