ADF (Process Files at an SFTP Location)

Move the files in the SFTP location to another folder after they are copied, with a timestamp for the date the file was copied.

Problem: Load the data from CSV files in an SFTP location into Azure Data Warehouse.

The source folder is in an SFTP location (an AWS cloud VM), where a third-party vendor drops the files. We had to move these files to ADLS (Azure Data Lake Storage), read the files from ADLS, and load the data into Azure Data Warehouse.

Once the files are in ADLS, they are processed to read the data and load it into the ADW. Once the data is loaded in the Data Warehouse, the files are moved to a processed folder, in a similar way to what is done in the SFTP location.

Brainstorming the Approach: There are different options available to copy the files from one folder to another:

  1. Use SSIS (the old tool), which is a proven and robust technology.
  2. Use an ADF Copy activity to copy the files from the SFTP location to the ADLS storage, and then use another Copy activity to read the file from the ADLS storage and load the data into the Data Warehouse (it can be any other destination as well).
  3. Use Data Flows to read the data from the SFTP location and then move the files within the SFTP location.
  4. Use Power Automate by creating workflows.

Solution

  1. The idea of SSIS was dropped, since it would require spinning up a VM in the cloud and the client would not agree to bear the cost of another VM.
  2. Power Automate was not considered because we didn't want to expand the number of services being used in the project. Since we were already using ADF for other data migration in the project, we had to leverage ADF for this activity as well.
  3. ADF was the solution decided on: pipelines in ADF will be used to copy the file from the SFTP location to ADLS, and then, using another Copy activity, the file will be moved from one folder in ADLS to another folder in ADLS (see the pipeline sketch after this list).
  4. Data Flows cannot be used to move files; they are used to build data transformations visually, without writing code.

Jump to the Solution:

   The SFTP location was finalised, and the folder structure where the client will be dropping the files was created. The following SFTP folder structure was created:

      File_Name_Folder —> ProcessedFiles

Azure Data Lake folders were also created with the following hierarchy:

     File_Name_LandingFolder >>  ProcessedFolder

The file to be copied will first land in the File_Name_LandingFolder, and once the file is processed it is moved to the ProcessedFolder with the timestamp attached as a postfix to the file name (File_Name_mm_yyyy_dd).
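As a sketch of that move step, the timestamp can be produced with an ADF expression passed into the sink dataset's file name, followed by a Delete activity on the landed file (ADF has no native "move", so copy-then-delete is the usual pattern). The dataset names ds_adls_landing and ds_adls_processed and the fileName parameter are assumptions for illustration:

```json
[
  {
    "name": "CopyToProcessedFolder",
    "type": "Copy",
    "inputs":  [ { "referenceName": "ds_adls_landing", "type": "DatasetReference" } ],
    "outputs": [ {
      "referenceName": "ds_adls_processed",
      "type": "DatasetReference",
      "parameters": {
        "fileName": "@concat('File_Name_', formatDateTime(utcnow(), 'MM_yyyy_dd'), '.csv')"
      }
    } ],
    "typeProperties": {
      "source": { "type": "DelimitedTextSource", "storeSettings": { "type": "AzureBlobFSReadSettings" } },
      "sink":   { "type": "DelimitedTextSink",   "storeSettings": { "type": "AzureBlobFSWriteSettings" } }
    }
  },
  {
    "name": "DeleteLandedFile",
    "type": "Delete",
    "dependsOn": [ { "activity": "CopyToProcessedFolder", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
      "dataset": { "referenceName": "ds_adls_landing", "type": "DatasetReference" },
      "enableLogging": false
    }
  }
]
```

Note the format string: in ADF expressions 'MM' is the month and 'mm' would be minutes, so 'MM_yyyy_dd' matches the File_Name_mm_yyyy_dd convention described above.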

There is an SFTP connector available in ADF. We will create a linked service to connect to the SFTP location. The SFTP connector will use the user name and password of the user created on the SFTP folder to access the files. This user will need full access to the folder: to read, write and delete the files. This needs to be configured before we create the linked service and the activities in ADF. Once the linked service is created, you will be able to browse the folders where the files are located.

  1. Configuring Linked Services

The first linked service is required to move the files from the SFTP location to the BLOB storage.

The second linked service is required to read the data from the BLOB storage.

The third is required to copy the data into the ADW (Dedicated SQL Pool).

A fourth, or as in this case the second linked service, can also be used to move the files from one folder in the BLOB storage to another folder in the BLOB storage.

  1. SFTP Linked Service: The SFTP connector is used to create the linked service. Click on the + sign in Linked services to add a linked service.

Figure 2: Step 1 in creating a linked service is to choose the type of data source location; in this picture it is an SFTP location.

Enter SFTP, select the SFTP icon, and click Continue at the bottom of the window. In the next window, you will fill in the details of the SFTP linked service.

Figure 3: Step 3 in creating the linked service is to add the details related to the linked service.

This will open a new window inside the portal to fill in the details of the linked service. You are required to fill in the information highlighted (orange rectangles) in the above picture:

  - Name: the name of the linked service; this should be a meaningful name from your business perspective.
  - Host: the IP address of the SFTP server.
  - Port: the default port for the SFTP service is 22; if you are using a different port, specify that port.
  - Authentication Type: Basic, SSH Public Key Authentication or Multi-Factor Authentication; in our case we are using Basic authentication.
  - User name: the user name used to log in to the SFTP location to access the files.
  - AKV linked service: the Azure Key Vault linked service used to retrieve the password from the AKV.
  - Secret name: the name of the secret (the password secret) to fetch from the AKV.
  - Secret version: normally left blank; if you need a specific version, provide the version of the secret (this is normally a long string of characters).
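For reference, the JSON behind such an SFTP linked service would look roughly like the sketch below. The names ls_sftp_vendor and ls_akv and the placeholder values are assumptions for illustration:

```json
{
  "name": "ls_sftp_vendor",
  "properties": {
    "type": "Sftp",
    "typeProperties": {
      "host": "<sftp-server-ip>",
      "port": 22,
      "authenticationType": "Basic",
      "userName": "<sftp-user>",
      "password": {
        "type": "AzureKeyVaultSecret",
        "store": { "referenceName": "ls_akv", "type": "LinkedServiceReference" },
        "secretName": "<sftp-password-secret>"
      }
    }
  }
}
```

Keeping the password as an Azure Key Vault secret reference, as this form does, means the credential never sits in the ADF definition itself.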

  2. ADLS Linked Service: The second linked service is used to connect to the BLOB storage. In the same way as we did for the SFTP linked service, click on the + sign in Linked services to add a linked service.

Select Azure Data Lake Storage Gen2 and click Continue. In the next step we will fill in the details for the linked service to connect to the ADLS Gen2 storage (Data Lake Storage).

The first option is the name given to the linked service (give a meaningful name that helps identify the source type and the related business data, e.g. ls_dl_finance).

In the Description section we can give a brief description of the linked service.

The next option is to choose the Integration Runtime (we have chosen the AutoResolveIntegrationRuntime with Managed Virtual Network).

Note: To use a Private Endpoint on the BLOB storage and provide access to ADF, you will need to grant ADF the Storage Blob Data Contributor role; also, you can only use Managed Identity in the linked service, as shown in the picture below.
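A minimal sketch of the resulting ADLS Gen2 linked service JSON, assuming the example name ls_dl_finance, managed identity authentication (only the storage URL is supplied) and the auto-resolve integration runtime in the managed virtual network:

```json
{
  "name": "ls_dl_finance",
  "properties": {
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://<storage-account-name>.dfs.core.windows.net"
    },
    "connectVia": {
      "referenceName": "AutoResolveIntegrationRuntime",
      "type": "IntegrationRuntimeReference"
    }
  }
}
```

Because no account key or service principal is specified, ADF authenticates with its managed identity, which is why the Storage Blob Data Contributor role assignment mentioned in the note is required.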

Now we are ready with two linked services: one to connect to the SFTP location and another to connect to the BLOB storage. Our next step will be to create the pipeline using the activities provided in Azure Data Factory.
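Before wiring up the pipeline, each Copy activity will also need datasets. As a sketch of how the single ADLS linked service can serve both folders, two hypothetical delimited-text datasets (ds_adls_landing and ds_adls_processed, with an assumed container name, and assuming ProcessedFolder sits under the landing folder per the hierarchy above) could look like this:

```json
[
  {
    "name": "ds_adls_landing",
    "properties": {
      "linkedServiceName": { "referenceName": "ls_dl_finance", "type": "LinkedServiceReference" },
      "type": "DelimitedText",
      "typeProperties": {
        "location": {
          "type": "AzureBlobFSLocation",
          "fileSystem": "<container-name>",
          "folderPath": "File_Name_LandingFolder"
        },
        "columnDelimiter": ",",
        "firstRowAsHeader": true
      }
    }
  },
  {
    "name": "ds_adls_processed",
    "properties": {
      "linkedServiceName": { "referenceName": "ls_dl_finance", "type": "LinkedServiceReference" },
      "type": "DelimitedText",
      "parameters": { "fileName": { "type": "String" } },
      "typeProperties": {
        "location": {
          "type": "AzureBlobFSLocation",
          "fileSystem": "<container-name>",
          "folderPath": "File_Name_LandingFolder/ProcessedFolder",
          "fileName": "@dataset().fileName"
        },
        "columnDelimiter": ",",
        "firstRowAsHeader": true
      }
    }
  }
]
```

The timestamped file name shown earlier would then be passed into the fileName parameter of ds_adls_processed by the Copy activity that moves the processed file.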

….To Be Continued
