In today’s fast-paced, data-driven world, managing and transferring vast amounts of data efficiently across different storage platforms is a significant challenge. A common scenario is storing large datasets in Azure Data Lake Gen2 while needing to transfer that data into databases such as MySQL Flexible Server for analytics and reporting.
In this blog, we’ll delve into how Databricks, with its powerful Apache Spark engine, can be used to directly move data from Azure Data Lake Gen2 to MySQL Flexible Server. This method offers greater flexibility, scalability, and control over data transfer compared to visual workflow tools. If you’re looking for a solution that doesn’t rely on traditional tools or visual interfaces and instead focuses on using Databricks for full control over the data transfer process, you’re in the right place.
Overview:
Databricks provides a flexible and powerful platform for transferring and processing large-scale data without relying on traditional visual workflows. Built on Apache Spark, Databricks allows users to write custom logic, apply transformations, and control the entire data transfer process, which makes it ideal for more complex scenarios where you need greater flexibility in managing data pipelines. Now let’s look at how to move a Delta table from Azure Data Lake Gen2 to MySQL Flexible Server using Databricks, focusing on optimizing performance and ensuring smooth data movement. This approach offers scalability and full control, making it a robust solution for handling large datasets efficiently.
For those interested in the Azure Data Factory Data Flow approach to move data from Azure Data Lake Gen2 to MySQL Flexible Server, please refer to this blog post which covers the ADF visual workflow.
Workflow for Orchestrating a Delta Table to MySQL Flexible Server using Databricks:

Steps to Transfer Data Using Databricks:
Step 1: Configure the JDBC Connection:
To connect Databricks to MySQL, we need to configure a JDBC connection string that specifies the MySQL server details, authentication information, and connection properties.
jdbc_hostname = "<MYSQL_HOSTNAME>"
jdbc_port = "<MYSQL_PORT>"
jdbc_database = "<MYSQL_DATABASE>"
jdbc_username = "<MYSQL_USERNAME>"
jdbc_password = "<MYSQL_PASSWORD>"
Connection String:
jdbc_url = f"jdbc:mysql://{jdbc_hostname}:{jdbc_port}/{jdbc_database}?useSSL=true&requireSSL=false&serverTimezone=UTC&rewriteBatchedStatements=true"
Connection properties:
connection_properties = {
    "user": jdbc_username,
    "password": jdbc_password,
    "driver": "com.mysql.cj.jdbc.Driver"
}
This connection string is essential for Databricks to communicate with the MySQL database. Ensure the MySQL server is accessible from the Databricks cluster and that the username and password provided are valid for authentication.
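For production workloads, credentials are best pulled from a Databricks secret scope rather than hard-coded in the notebook. A minimal sketch, assuming a secret scope named mysql-scope with keys mysql-user and mysql-password (both names are hypothetical):
jdbc_username = dbutils.secrets.get(scope="mysql-scope", key="mysql-user")
jdbc_password = dbutils.secrets.get(scope="mysql-scope", key="mysql-password")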
Step 2: Reading Delta Table from Azure Data Lake Gen2
Read the Delta table from Azure Data Lake Gen2 in Databricks using the Delta table path:
df = spark.read.format("delta").load(delta_table_path)
This loads the Delta table stored in Azure Data Lake Gen2 into a Spark DataFrame.
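Here, delta_table_path points to the location of the Delta table in Azure Data Lake Gen2. As an illustrative sketch (the container, storage account, and folder names are placeholders), it could be defined as:
delta_table_path = "abfss://<CONTAINER>@<STORAGE_ACCOUNT>.dfs.core.windows.net/<DELTA_TABLE_FOLDER>"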
Step 3: Partitioning the Data
For better performance with larger datasets, repartitioning the DataFrame before writing it to MySQL can significantly speed up the process. This allows parallel writing across multiple threads and reduces resource contention.
df_part = df.repartition(4)
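As a quick check before choosing a partition count, the DataFrame's current number of partitions can be inspected; each partition is written over its own JDBC connection, so the partition count effectively sets the write parallelism (a minimal sketch, not part of the original pipeline):
num_partitions = df.rdd.getNumPartitions()  # current partition count, i.e. how many parallel JDBC writers Spark would use
print(num_partitions)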
Step 4: Writing Data to MySQL
Finally, use the write.jdbc function to transfer the DataFrame into the target MySQL table. The function below overwrites the specified MySQL table with the contents of the DataFrame.
def write_overwrite_sql_server(df, url, target_table_name, connection_properties):
    df_part = df.repartition(15)  # Partition the DataFrame for better parallelism
    df_part.write.jdbc(url=url, table=target_table_name, mode="overwrite", properties=connection_properties)

write_overwrite_sql_server(df, jdbc_url, "target_table_name", connection_properties)
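If finer control over batching is needed, Spark's JDBC writer also exposes a batchsize option. The sketch below is an alternative, option-based form of the same write (the batch size of 10000 is an arbitrary example value, not a recommendation from the original pipeline):
(df_part.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "target_table_name")
    .option("user", jdbc_username)
    .option("password", jdbc_password)
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("batchsize", 10000)  # number of rows the driver sends per batch
    .mode("overwrite")
    .save())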
Scenario 1: Without Repartitioning & rewriteBatchedStatements=false:
When batching is disabled, the JDBC driver sends each SQL statement individually, leading to multiple round trips to the database.

- Individual Execution: Each SQL statement is sent to the database one at a time. This means the driver makes multiple round trips to the database.
- Increased Latency: Each round trip incurs network latency. This can significantly slow down performance, especially when executing a large number of statements in sequence.
- Higher Resource Consumption: More network packets result in increased consumption of database and network resources. Each packet requires processing, which can lead to higher CPU usage on both the client and server sides.
- Transaction Overhead: Handling transactions can become cumbersome. Each statement might need individual transaction management, making it more complex to roll back or commit changes across multiple operations.
- Slower Throughput: Overall throughput of the application may be reduced due to the overhead associated with executing each statement individually.
Scenario 2: With Repartitioning & rewriteBatchedStatements=true:

- Batch Execution: Multiple SQL statements are grouped together and sent to the database in a single batch. This reduces the number of trips to the server.
- Reduced Latency: Since multiple commands are processed in a single round trip, network latency is significantly reduced. This can lead to faster execution of bulk operations.
- Improved Resource Efficiency: The database can optimize the processing of batched statements, leading to more efficient use of resources. This can result in lower CPU and memory usage on the database server.
- Simplified Transaction Management: Transactions can be handled more cleanly. You can commit or roll back a batch of statements in one go, simplifying error handling and making your code more robust.
- Higher Throughput: Overall application performance improves, especially in scenarios with heavy database interaction, leading to faster processing times and better user experience.
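To make the contrast between the two scenarios concrete, the only change on the connection side is the rewriteBatchedStatements flag in the JDBC URL; the other parameters follow Step 1 (a simplified sketch of the two URL variants):
# Scenario 1: batching disabled, each INSERT is sent in its own round trip
jdbc_url_no_batch = f"jdbc:mysql://{jdbc_hostname}:{jdbc_port}/{jdbc_database}?useSSL=true&requireSSL=false&serverTimezone=UTC&rewriteBatchedStatements=false"
# Scenario 2: batching enabled, the driver rewrites batched INSERTs into multi-row statements
jdbc_url_batched = f"jdbc:mysql://{jdbc_hostname}:{jdbc_port}/{jdbc_database}?useSSL=true&requireSSL=false&serverTimezone=UTC&rewriteBatchedStatements=true"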
Conclusion:
In conclusion, using Databricks to move data from Azure Data Lake Gen2 to MySQL Flexible Server offers a powerful, flexible solution for handling large datasets. By optimizing with repartitioning and rewriteBatchedStatements=true, data transfers become more efficient, reducing latency, resource usage, and execution time. This method provides full control over data pipelines, making it ideal for large-scale operations, while offering greater performance compared to traditional visual workflow tools.
For More Details, Diggibyte Technologies Pvt Ltd has all the experts you need. Contact us Today to embed intelligence into your organization.
Author: Nandini Srinivas