
Unleashing the Power of Synthetic Data with Databricks Labs Data Generator (aka dbldatagen)

The foundation of robust data science projects is high-quality data. However, real-world datasets often suffer from noise, missing values, and limited availability. Synthetic data generation offers a solution by creating artificial data that statistically mirrors real-world data.

Databricks Labs Data Generator is a powerful tool that allows you to generate synthetic data for use in your Databricks notebooks and Spark applications. With this tool, you can create data that follows a specific schema or create a schema on the fly.

This blog post will provide a comprehensive overview of the Databricks Labs Data Generator, including its key features, benefits, and use cases. We will also walk you through the process of creating synthetic data using the tool, so you can start generating your own data in no time.

Why is a Data Generator Required?

There are several reasons why a data generator is a valuable tool for data scientists and data engineers:

  • Generate data at scale: Data Generator can be used to generate large volumes of synthetic data quickly and efficiently. This is essential for tasks such as performance testing and training machine learning models.
  • Create realistic test data: Data Generator can be used to create synthetic data that closely resembles real-world data. This is important for creating realistic test scenarios and avoiding data leakage.
  • Improve data privacy: A Data Generator can be used to generate synthetic data that does not contain any personally identifiable information (PII). This can help comply with data privacy regulations.

Why create a synthetic dataset?

Synthetic datasets are a great way to demonstrate your data product, such as a website or analytics platform, without exposing sensitive information. Users and stakeholders can interact with example data and see meaningful analysis without any privacy concerns.

They are also useful for exploring Machine Learning algorithms, allowing Data Scientists to train models when real data is limited.

Performance testing Data Engineering pipelines is another strong use case for synthetic data: teams can ramp up the volume of data pushed through their infrastructure to identify weaknesses in the design and to benchmark runtimes.

Steps to Use Data Generator

The Databricks Labs Data Generator is a Python library that integrates seamlessly with Spark. It provides a high-level API for defining synthetic data generation specifications. These specifications can be used to generate data for a variety of data types, including numbers, strings, dates, and timestamps.

Here’s a high-level overview:

  1. Define a data generation spec: You describe the data you want to generate using a schema or by providing a list of columns and their data types.
  2. Generate data: The data generator engine takes the data generation spec as input and generates a PySpark DataFrame containing the synthetic data.
  3. Expose data: Once the DataFrame is generated, it can be used with any Spark DataFrame-compatible API to save or persist the data, analyze it, write it to an external database or stream, or use it in the same manner as a regular PySpark DataFrame. A minimal end-to-end sketch of these steps follows.
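Here is a minimal sketch of those three steps, assuming a Databricks notebook where spark is already defined; the spec name, column names, and target table are illustrative:

```python
import dbldatagen as dg

# 1. Define the data generation spec (column names are illustrative)
spec = (
    dg.DataGenerator(spark, name="overview_demo", rows=10000, partitions=4)
    .withColumn("customer_id", "long", minValue=1, maxValue=100000, random=True)
    .withColumn("plan", "string", values=["free", "pro", "enterprise"], random=True)
)

# 2. Generate a PySpark DataFrame from the spec
df = spec.build()

# 3. Expose the data, e.g. by persisting it as a Delta table
df.write.format("delta").mode("overwrite").saveAsTable("demo.synthetic_customers")
```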

Tips and Tricks

Here are some tips and tricks for using the Databricks Labs Data Generator:

  • Use a schema: Whenever possible, it is recommended to use a schema to define the data you want to generate. This will help to ensure that the generated data is consistent and accurate.
  • Leverage the Faker library: Faker is a popular Python library for generating realistic fake data. Combining Faker with Data Generator lets you produce more realistic values for specific columns, such as names, addresses, and companies.
  • Use expressions: Data Generator supports SQL-based expressions for generating data. This allows you to generate more complex data, such as data that depends on the values of other columns.
  • Start small: When first using Data Generator, it is recommended to start by generating a small amount of data. This will help you to verify that the generated data is correct before generating larger datasets.
  • Large Datasets: Instead of creating a giant pile of data all at once, you can break it down into smaller, more manageable chunks. This ‘batch processing’ approach makes handling large datasets much easier.

How to use dbldatagen?

To use dbldatagen, you first need to install it using a %pip install command:
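```
%pip install dbldatagen
```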

Dbldatagen is a Python library. The Databricks Labs Data Generator framework can be used with PySpark 3.1.2 and Python 3.8 or later, which are compatible with Databricks Runtime 10.4 LTS and later releases. For full Unity Catalog support, the official documentation recommends Databricks Runtime 13.2 or later (13.3 LTS or above preferred).

This installs the PyPI package and works in regular notebooks, Delta Live Tables pipeline notebooks, and the Community Edition.


You create a dataset by building a definition for it with a DataGenerator instance, which specifies the rules that control data generation.

Once the DataGenerator specification is created, you call the build method to generate a Spark DataFrame for the data.

The library also provides several standard predefined data sets that can be used as a starting point for generating your synthetic data.

Generating Our First Sample Data

Relative data generation involves creating data values based on predefined lists or ranges. Here’s an example:
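A minimal sketch of such a spec, assuming the same notebook setup as above; the column names and value lists are illustrative:

```python
import dbldatagen as dg

sample_spec = (
    dg.DataGenerator(spark, name="sample_data", rows=1000, partitions=4)
    .withIdOutput()  # include the generated id column in the output
    # values drawn from a predefined list
    .withColumn("device_type", "string",
                values=["phone", "tablet", "laptop"], random=True)
    # values drawn from a numeric range
    .withColumn("reading", "integer", minValue=1, maxValue=100, random=True)
)

sample_df = sample_spec.build()
sample_df.show(5)
```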


You can pass arguments such as spark, rows, and partitions to the DataGenerator() constructor to customize the number of rows and partitions in the generated DataFrame.

Generating Weighted Data

Weighted data generation lets you specify a list of discrete weights for a column’s values, controlling how often each value appears in the generated data. For example:
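A sketch, with illustrative tier values weighted roughly 7:2:1:

```python
import dbldatagen as dg

weighted_spec = (
    dg.DataGenerator(spark, name="weighted_data", rows=1000, partitions=4)
    # weights line up positionally with values:
    # "bronze" ~70%, "silver" ~20%, "gold" ~10% of rows
    .withColumn("tier", "string",
                values=["bronze", "silver", "gold"],
                weights=[7, 2, 1],
                random=True)
)

weighted_df = weighted_spec.build()

# inspect the spread with a group-by and count
weighted_df.groupBy("tier").count().show()
```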


You can see the spread of the data by using a group-by and count, as in the sketch above.

Disadvantage: While the weights option with discrete value lists can be used to introduce skew, it can be awkward to manage for large sets of values. For more details, click here.

Using Third-Party Library Example (Faker)

Faker can be integrated to generate realistic names, addresses, etc.

To use Faker, you first have to install the library:
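```
%pip install Faker
```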

Let’s see an example of generating names and company names using Faker:
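A sketch of that approach; the pool size of 100 values and the column names are illustrative:

```python
import dbldatagen as dg
from faker import Faker

fake = Faker()

# Pre-generate pools of fake values with list comprehensions so that
# each row can pick a different name and company from the pool.
names = [fake.name() for _ in range(100)]
companies = [fake.company() for _ in range(100)]

faker_spec = (
    dg.DataGenerator(spark, name="faker_demo", rows=1000, partitions=4)
    .withColumn("name", "string", values=names, random=True)
    .withColumn("company", "string", values=companies, random=True)
)

faker_df = faker_spec.build()
faker_df.show(5, truncate=False)
```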


You may wonder why we use a list comprehension here rather than calling Faker directly, as in the snippet below.
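A sketch of that direct approach:

```python
# Anti-pattern sketch: fake.name() and fake.company() are evaluated once,
# at spec-definition time, so every row reuses the same two values.
bad_spec = (
    dg.DataGenerator(spark, name="faker_repeat", rows=1000)
    .withColumn("name", "string", values=[fake.name()])
    .withColumn("company", "string", values=[fake.company()])
)
```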

With this syntax, the Faker calls are evaluated only once, when the spec is defined, so we end up with the same name and company name for every row.

Using SQL Expressions to Transform Existing Columns or Generate Random Values

The expr attribute can be used to generate data values from arbitrary SQL expressions, including expressions such as concat that produce text results.

Let’s see an example of generating discounted prices for e-commerce transactions:
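A sketch of such a spec, assuming the same notebook setup; the price ranges and column names are illustrative:

```python
import dbldatagen as dg

txn_spec = (
    dg.DataGenerator(spark, name="transactions", rows=1000, partitions=4)
    # uuid() is a Spark SQL function, evaluated per row
    .withColumn("transaction_id", "string", expr="uuid()")
    .withColumn("original_price", "decimal(10,2)",
                minValue=10.0, maxValue=500.0, random=True)
    .withColumn("discount_rate", "double",
                minValue=0.0, maxValue=0.5, step=0.05, random=True)
    # baseColumn declares the dependency so the input columns exist first
    .withColumn("discounted_price", "decimal(10,2)",
                expr="original_price * (1 - discount_rate)",
                baseColumn=["original_price", "discount_rate"])
)

txn_df = txn_spec.build()
txn_df.show(5, truncate=False)
```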


A unique transaction_id is created for each transaction using the uuid() function, which is a SQL function. Data types can be specified either as SQL type strings or as PySpark types.

The discounted_price column is calculated using the expression "original_price * (1 - discount_rate)", which dynamically computes the discounted price based on the original price and the discount rate for each transaction.

Don’ts of Data Generation

  1. Avoid Large Lists: Using excessively large lists for value generation can degrade performance.
  2. Overcomplicating Expressions: Complex expressions can lead to inefficiencies and errors.
  3. Large numbers of records: Skip list comprehensions (loops that build values in lists) and explore other options to avoid slowdowns.

If you want to explore more about dbldatagen, here is the official documentation: click here

Here is the GitHub link

Conclusion

Dbldatagen is a versatile tool for generating large-scale synthetic data. By understanding its architecture and best practices, users can effectively create data for testing and development purposes. Whether generating relative data, weighted data, or integrating third-party libraries like Faker, dbldatagen provides robust capabilities to meet various data generation needs.

For More Details, Diggibyte Technologies Pvt Ltd has all the experts you need. Contact us Today to embed intelligence into your organization.

Author: Basheer Ahmed
