Unlocking Seamless Data Exchange with Delta Sharing

Historically, data sharing has been severely limited because sharing solutions were tied to a single vendor: both the data provider and the data consumer had to use the same platform to access and share the data. This creates friction between providers and consumers, who naturally run on different platforms.

Existing sharing solutions are too restrictive

Delta Sharing at a Glance:

Delta Sharing, an innovative solution from Databricks, is transforming the way organizations share and exchange data. It offers a simple, secure, and open framework that allows data providers and consumers to collaborate in real time, regardless of the computing platforms they use. Built on top of Delta Lake, an open-source storage layer, Delta Sharing brings enhanced reliability to data lakes. Delta Lake ensures data integrity through ACID transactions (Atomicity, Consistency, Isolation, Durability), offers scalable metadata management, and supports both batch and streaming data processing seamlessly.

Traditional Data-sharing challenges:

A Deep Dive into Delta Sharing:

Delta Sharing’s seamless integration with Unity Catalog empowers data providers to efficiently manage, govern, audit, and track the usage of shared data all in one place. To securely share data, providers are required to register their datasets in Unity Catalog, and the data must be stored in Delta table format.
For recipients, if both the provider and recipient are using Databricks workspaces with Unity Catalog enabled, they can access the shared data directly within the workspace. However, recipients without Unity Catalog can still access the shared data, ensuring flexibility and broader accessibility across different environments.

Delta Sharing

The Pillars of Delta Sharing:

Delta Sharing revolves around three key components that make secure and real-time data exchange possible: Providers, Recipients, and Shares.

Delta Sharing between Unity Catalog-enabled workspaces

Providers:

A Provider is an organization or individual that owns and controls the data they want to share with others. Providers are responsible for:
1. Defining Access: They decide which datasets will be shared, who will have access to them, and the level of access each recipient will get.
2. Registering Data in Unity Catalog: In the case of Databricks, Providers must register their datasets in Unity Catalog (Databricks’ centralized governance platform) and store the data in Delta table format. This allows for secure and compliant data sharing.
3. Managing Permissions: Providers can set granular permissions, such as granting read-only access to specific tables within a Share.
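The provider responsibilities above map onto a handful of Unity Catalog SQL statements. The sketch below builds them in Python, assuming a Databricks notebook where each statement would be run via spark.sql; the share, table, and recipient names are hypothetical.

```python
# Sketch of the provider-side setup. Share, catalog, table, and recipient
# names below are hypothetical placeholders.

def provider_setup_sql(share, catalog, schema, table, recipient):
    """Return the SQL statements a provider would run (e.g. via spark.sql)."""
    return [
        f"CREATE SHARE IF NOT EXISTS {share}",
        f"ALTER SHARE {share} ADD TABLE {catalog}.{schema}.{table}",
        f"CREATE RECIPIENT IF NOT EXISTS {recipient}",
        f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}",
    ]

for stmt in provider_setup_sql("sales_share", "main", "sales", "orders", "partner_co"):
    print(stmt)
    # In a Unity Catalog-enabled notebook, you would run: spark.sql(stmt)
```

The same four steps can also be done through the Databricks UI, as shown later in this post.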

Recipients:

A Recipient is the organization or individual receiving and using the shared data. Recipients can:
  1. Access Shared Data: Depending on the permissions granted by the provider, recipients can directly query or analyze the shared data without needing to replicate it in their environment. If the recipient is using a Databricks workspace that has Unity Catalog enabled, the shared data is accessible within their own workspace.
  2. Workspace or No Workspace: While recipients using Unity Catalog-enabled Databricks workspaces have a seamless experience, even recipients without Unity Catalog can access the data. This flexibility ensures that recipients on different platforms or in different organizations can consume data without having to fully align their infrastructure.
  3. Analyze Data in Real Time: Recipients benefit from real-time access to data, meaning they can view and analyze the latest version of the dataset as it is updated by the provider.

Shares:

A Share is essentially the bundle of permissions and data being shared between a Provider and a Recipient. A Share consists of:
  1. Delta Tables: The datasets (in Delta format) being shared. Each Share can contain one or more Delta tables.
  2. Access Rules: The permissions set by the Provider, defining what data the Recipient can access and the specific rights (e.g., read-only) granted.
  3. Simple to Manage: A Share acts as the intermediary object that the Provider manages, enabling multiple Recipients to access different parts of the data based on the defined permissions.
Shares simplify data sharing, as Providers can manage and update them as needed.

For example:
A Provider can create one Share with access to specific tables for a single department and another Share for external partners, each with its own customized permissions. This flexibility enables efficient collaboration across different teams and organizations while ensuring data governance and security.

There are two main types of Delta Sharing:

  1. Sharing Between Databricks Environments (Databricks to Databricks — D2D) — Both the data provider and the data recipient use Databricks workspaces that have Unity Catalog enabled, so the shared data appears directly in the recipient’s workspace.
  2. Sharing from Databricks to Open Source (Databricks to Open — D2O) — The data recipient can be on any computing platform (or even on Databricks without Unity Catalog) and accesses the data through the open Delta Sharing protocol.

How Delta Sharing Works:

1. Databricks to Databricks (D2D) — sharing between Unity Catalog-enabled workspaces
  • The Provider registers data in the Unity Catalog and defines the access policies.
  • The Provider adds a recipient and creates a Share, specifying the datasets (Delta Tables) and permissions.
Adding New Recipient
  • The Provider assigns the Share to one or more Recipients, allowing them to access the shared data.
Sharing the Data to the Recipient
  • Recipients can then access and analyze the data as per the permissions granted, without having to move or duplicate it.
Accessing the shared Data in workspace2
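Once the Share is assigned, a recipient on a Unity Catalog-enabled workspace can mount it as a catalog and then query its tables like any local table. A minimal sketch, with hypothetical provider, share, and table names:

```python
# Recipient-side sketch (Databricks notebook, Unity Catalog enabled).
# Provider, share, catalog, and table names are hypothetical placeholders.

provider = "acme_provider"
share = "sales_share"
catalog = "shared_sales"

# Mounting the share as a local catalog makes its tables queryable
# alongside the recipient's own Unity Catalog tables:
mount_sql = f"CREATE CATALOG IF NOT EXISTS {catalog} USING SHARE {provider}.{share}"
query_sql = f"SELECT * FROM {catalog}.sales.orders LIMIT 10"

print(mount_sql)
print(query_sql)
# In a notebook: spark.sql(mount_sql); display(spark.sql(query_sql))
```

Note that no data is copied by this step; the catalog is a live view of whatever the provider currently shares.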

2. Databricks to Open Source (Databricks to Open — D2O):

Databricks to Open — D2O

The data provider creates shares and adds tables to them.
The data provider sets access permissions for those who are allowed to access the tables in a share.
When a data recipient wants to access a table, they start by sending a request to the Delta Sharing server.
The Delta Sharing server then verifies whether the recipient has the required permission and determines which data is to be sent. The data itself lives as objects in S3 or another cloud object store. Authentication is done via a unique shared identifier or short-lived URL.
The authorized data recipient can then start reading the data, transferring the objects directly from the cloud object store.
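The request-and-verify flow above can be sketched at the REST level: under the open Delta Sharing protocol, the client’s first call lists the available shares with a GET on the server’s /shares endpoint, authenticating with a bearer token from the credential file. The endpoint and token below are placeholders, not real credentials:

```python
# Sketch of the first protocol call a recipient's client makes.
# Endpoint and token are hypothetical placeholders.

endpoint = "https://sharing.example.com/delta-sharing"
bearer_token = "<token-from-credential-file>"

# The Delta Sharing REST protocol lists available shares via GET {endpoint}/shares
url = f"{endpoint.rstrip('/')}/shares"
headers = {"Authorization": f"Bearer {bearer_token}"}

print(url)
print(headers)
# With the `requests` library: requests.get(url, headers=headers).json()
```

In practice the delta-sharing client library issues these calls for you; the sketch only shows what travels over the wire.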

Open sharing

In this method, the data provider generates a token (delivered via an activation URL) and shares it securely with the recipient. The recipient uses the token for authentication, and through it gets read access to the tables that have been shared with them.

Steps to share the data via Open sharing:

Create and manage shares for Delta Sharing: The data provider creates shares and adds the tables that are to be shared with the data recipient.

Creating a new share

Create recipient: The data provider creates the recipient with whom they need to share the data.

Creating a new Recipient

Create and manage access to the Delta Sharing share: The data provider grants the recipient access to the shares that have been created. In this step, you grant shares to the recipients using the Grant Share icon.

Get the activation link: The data provider sends the activation link to the recipient over a secure channel.

The recipient who receives this activation link can then download a credential file, which the recipient uses to create a secure connection with the data provider and receive the shared data.

Credential file

This credential file can be downloaded only once by the recipient; the activation link is the short-lived URL referred to earlier.
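The credential file itself is a small JSON “profile” that the delta-sharing client reads to find the server and authenticate. A sketch of inspecting one, with placeholder values rather than a real token:

```python
import json

# Example contents of a config.share profile file. The field names follow
# the Delta Sharing profile format; the values here are placeholders.
profile_text = """{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.example.com/delta-sharing",
  "bearerToken": "<token-from-activation-page>",
  "expirationTime": "2025-12-31T00:00:00.0Z"
}"""

profile = json.loads(profile_text)
print(profile["endpoint"])      # the server the client will talk to
print(profile["bearerToken"])   # the short-lived credential used to authenticate
```

Because the file contains a bearer token, it should be handled like a password and never checked into source control.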

After downloading the credential file, you can also access the shared Delta tables from Databricks notebooks by following the steps below.

Step 1: Install the delta-sharing package.

%sh pip install delta-sharing

Step 2: Upload the contents of the credential file to a folder in DBFS.

%scala

// Write the credential file contents to DBFS so the client can read it
dbutils.fs.put("/FileStore/client/config.share", """<content-inside-credential-file>""")

Step 3: Using Python, list the tables in the share.

import delta_sharing

# Create a client pointing at the credential (profile) file uploaded to DBFS
client1 = delta_sharing.SharingClient("/dbfs/FileStore/client/config.share")

# List all tables available under the shares in the credential file
client1.list_all_tables()

This step lists the table names available under the share.

Step 4: Load the content of a desired table as a Spark DataFrame and display it.

# The path format is <profile-file>#<share-name>.<schema-name>.<table-name>
df = delta_sharing.load_as_spark("dbfs:/FileStore/client/config.share#share_name.database_name.table_name")
display(df)

Conclusion:

Delta Sharing is a game-changer for data collaboration, offering secure and efficient sharing. Providers manage access, Recipients consume data seamlessly, and Unity Catalog ensures governance. Its open nature and scalability make it ideal for modern organizations, empowering collaboration.

For More Details, Diggibyte Technologies Pvt Ltd has all the experts you need. Contact us Today to embed intelligence into your organization.

Authors: Harini R, Sanghavi A R
