
Moving Delta Tables to Azure MySQL Flexible Server using ADF Data Flow

In today’s data-driven world, we often need to manage vast amounts of data stored across various platforms and storage layers. Azure Data Factory (ADF) is Azure’s ETL service that lets users build pipelines to orchestrate data processing, including event-driven workflows. In this blog, we will explore how to move a Delta table from Azure Data Lake Storage Gen2 into a table in Azure Database for MySQL Flexible Server, leveraging the Azure Data Factory Data Flow feature.

Overview:

Data Flow is a powerful feature in Azure Data Factory that lets users visually design and execute data transformations without writing code, covering operations such as selecting required fields, mapping, aggregations, joins, and more. With its visual design surface, Data Flow makes it easy to process and transform data for downstream applications.

Now let’s see how the Data Flow feature in Azure Data Factory helps build and orchestrate a flow that moves a Delta table from Azure Data Lake Storage Gen2 into a table in Azure Database for MySQL Flexible Server through a visual interface. Data Flow provides automatic schema mapping and scalable processing for reliable data movement.

Workflow for Orchestrating Delta Table to MySQL Database Flexible Server:

Configuring Source and Applying Transformations in Azure Data Factory Data Flow:

Step 1: Configuring the Source Settings:

In the Azure Data Factory Data Flow, the first step involves configuring the source from which the data will be extracted. In this scenario, the data resides in Azure Data Lake Storage Gen2.

  • Source Type: Choose Inline as the source type; the Delta format is available as an inline dataset in mapping data flows.
  • Inline Dataset Type: Select Delta as the inline dataset type.
  • File Path: Specify the exact folder path within ADLS Gen2 where your Delta table is stored.
  • Schema Import: Data Flow automatically imports the schema of the source (a rough script sketch of this source follows below).
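
For reference, the source configuration above corresponds roughly to a data flow script fragment like the one below. This is a minimal sketch: the column names (id, scheduleDate), the container, and the folder path are illustrative assumptions, not values from the original flow.

    source(output(
            id as integer,
            scheduleDate as timestamp
        ),
        allowSchemaDrift: true,
        validateSchema: false,
        format: 'delta',
        fileSystem: 'datalake-container',
        folderPath: 'delta/schedule_table') ~> DeltaSource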

Step 2: Filter on Scheduled Date:

A Filter transformation is applied on the scheduleDate column; it can be applied on any other column based on your requirements. This step ensures that only relevant data moves downstream, reducing unnecessary data movement.

  • Filter Expression: Filters the records to those that fall on a specific date, fetching only the data scheduled for that day.

Eg: toDate(scheduleDate) == currentDate()
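
In data flow script terms, this filter step might look like the following, continuing from the hypothetical DeltaSource stream sketched above:

    DeltaSource filter(toDate(scheduleDate) == currentDate()) ~> FilterScheduledToday

Only rows whose scheduleDate falls on the current date pass through to the next transformation.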

Step 3: Configuring the Alter Row Transformation:

The Alter Row transformation allows you to manage row-level operations such as inserts, updates, upserts, and deletes, and is useful when you are dealing with incremental data or delta updates.
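
For an upsert-style incremental load, the Alter Row step can simply mark every incoming row as an upsert candidate and let the sink’s key columns decide between insert and update. A minimal sketch, again continuing from the hypothetical FilterScheduledToday stream:

    FilterScheduledToday alterRow(upsertIf(true())) ~> MarkUpsert

If you need finer-grained control, the insertIf, updateIf, and deleteIf conditions can each carry their own expression instead of true().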

Step 4: Configuring the Sink: Loading Data into Azure MySQL Flexible Server:

The Sink is the final step in the Azure Data Factory Data Flow, where the transformed data is loaded into the target destination. In this scenario, our target is Azure Database for MySQL Flexible Server. Configuring the sink correctly ensures that the data flows smoothly and efficiently into the database with high performance.

Key Sink Settings:

  1. Sink Type and Dataset: Select an Azure Database for MySQL dataset that targets the right table.
  2. Settings: Set the update method to Upsert for the incremental scenario and list the correct key columns.
  3. Mapping: Azure Data Factory typically handles schema mapping automatically through auto mapping (a script sketch of this sink follows below).
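
Taken together, the sink described above might surface in data flow script roughly as follows; the key column id and the stream names are assumptions carried over from the earlier sketches, not values from the original pipeline.

    MarkUpsert sink(allowSchemaDrift: true,
        validateSchema: false,
        deletable: false,
        insertable: false,
        updateable: false,
        upsertable: true,
        keys: ['id'],
        skipDuplicateMapInputs: true,
        skipDuplicateMapOutputs: true) ~> MySQLSink

Because only upsertable is enabled and keys lists the primary key, rows marked by the Alter Row step are inserted when the key is new and updated when it already exists.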

Sink optimization plays a vital role, so let’s take a deep dive into the difference between no partitioning and hash partitioning using a dataset of 2,086 records.

No Partitioning Applied:

In this case, the data is processed as a single unit. All 2,086 rows are transferred and processed together without splitting them into smaller partitions.

  • Single-Threaded Execution: When there is no partitioning, Azure Data Factory processes the data in a single thread. It processes the entire dataset as a single block. This means that there is no parallelism, and the entire load is handled in a linear, step-by-step manner. The larger the dataset, the longer it takes to move and process.
  • Memory Bottlenecks: Without partitioning, the entire data is loaded into memory all at once. This can cause memory contention, leading to bottlenecks, especially with large datasets.
  • Resource Contention: Since the pipeline processes the entire dataset as a single partition, that one unit of work ties up the available compute resources, leading to inefficient utilization. The absence of partitioning results in higher resource contention and longer waiting times for data transfer operations.

Hash Partitioning Applied:

In this optimized scenario, partitioning is applied to the data, splitting it into two partitions, allowing the pipeline to process the data in parallel, which leads to a faster and more efficient process.

  • How Hash Partitioning Works: Hash partitioning divides the dataset into smaller chunks based on the hash value of a specific key column. In this case, the data is split into two partitions. Each partition contains part of the total rows, which can then be processed in parallel. This results in faster processing since ADF can execute the partition operations simultaneously, using multiple CPU cores and memory.
  • Parallel Processing: With hash partitioning, ADF processes each partition independently and simultaneously. Since the data is split into two partitions, the processing time is significantly reduced: the sink processing time drops to almost half of the non-partitioned run.
  • Efficient Resource Utilization: Each partition is processed separately, allowing ADF to better utilize memory and CPU resources. Instead of loading all the data into memory at once, each partition gets its own allocated memory and compute resources. This prevents resource bottlenecks and ensures smoother processing. Since two partitions are processed in parallel, the overall resource consumption is more balanced and optimized.
  • Faster Stage Time: The stage time (which measures how long the data stays in the staging phase) was also reduced in the partitioned scenario. The stage time decreased from 5 minutes 35 seconds (in the non-partitioned run) to 3 minutes 14 seconds. This is an indicator of more efficient data staging due to partitioning.
Aspect                     | No Partitioning                       | With Hash Partitioning
Processing Time            | 6m 6s (single-threaded execution)     | 3m 44s (parallel execution)
Memory Usage               | Higher memory consumption             | Balanced memory usage per partition
Resource Contention        | Higher, due to a single large unit    | Lower, as data is processed in chunks
Risk of Memory Bottlenecks | High, especially with larger datasets | Low, due to partitioning
Resource Utilization       | Inefficient single-threaded process   | Efficient with parallel resource use
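
Partitioning itself is configured on the sink’s Optimize tab rather than in the mapping. In data flow script it shows up as a partitionBy clause on the sink; a rough sketch of a two-way hash partition on the assumed id column (illustrative syntax, not copied from the original flow) might look like this:

    MarkUpsert sink(upsertable: true,
        keys: ['id'],
        partitionBy('hash', 2, id)) ~> MySQLSink

With two hash partitions the service can write to MySQL in parallel, which is what produced the roughly halved processing time shown in the table above.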

Note: Stay tuned for our next blog, where we look into another approach for moving the Delta table, simplifying the process further and enhancing performance.

Conclusion:

In this blog, we explored how Azure Data Factory Data Flow efficiently moves a Delta table from Azure Data Lake Storage Gen2 to Azure Database for MySQL Flexible Server, and compared the impact of no partitioning versus hash partitioning on processing time and resource utilization.

For more details, Diggibyte Technologies Pvt Ltd has all the experts you need. Contact us today to embed intelligence into your organization.

Author: Nandini Srinivas
