This blog explores the limitations of Hive Metastore and highlights how Unity Catalog offers superior metadata management, enhanced security, and better data lineage support, making it the preferred choice for Databricks users.
The Hive Metastore is a central repository that stores metadata about Hive tables, databases, columns, partitions, and other schema-related information.
Unity Catalog is a unified governance solution for all data assets in the Databricks Lakehouse. It provides centralized access control and management for various data types across different storage systems.
Availability:
The Hive Metastore is available by default in standard Databricks workspaces.
The Unity Catalog Metastore is not available by default in standard Databricks workspaces.
Scope:
The Hive Metastore is scoped to a single workspace or cluster. Basic permissions are managed with SQL commands, separately for each workspace.
Unity Catalog is multi-workspace, multi-cloud, and regional. It supports multiple catalogs for logical grouping and management of data.
The hierarchy for the Hive Metastore from the top level down to the table structure is as follows:
- Hive Metastore: The central metadata repository that stores all the information about the databases, tables, partitions, columns, and various schema-related objects in Hive.
- Database: A logical namespace in Hive Metastore used to organize and group tables. Each database has a unique name, can contain multiple tables, and is typically used to manage data for a specific business domain or purpose.
- Table: Represents a single structured dataset within a database.
The Hive Metastore manages only tables, views, and partitions within a workspace.
Hierarchy for the Unity Catalog Metastore:
The hierarchy for the Unity Catalog Metastore is structured as follows:
- Metastore: The top-level container, like a Hive Metastore, that holds all the metadata for databases, tables, and other catalog-related objects. Each workspace is linked to a single metastore.
- Catalog: A logical collection of schemas (databases) and their tables. This acts as a high-level grouping that can be used for organizing different areas of the data estate, such as department-specific data collections.
- Schema: Equivalent to a “database” in the traditional sense. It holds a group of related tables and views within a particular catalog.
- Table/View: The lowest-level structure that represents the actual data stored. Tables hold structured data, and views can be created for simplified access patterns or querying.
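To make the difference between the two hierarchies concrete, here is a minimal sketch that builds fully qualified table names for each: two levels (database.table) under the Hive Metastore and three levels (catalog.schema.table) under Unity Catalog. The function names and the sample identifiers (finance, sales, orders) are hypothetical and used only for illustration.

```python
def hive_table_name(database: str, table: str) -> str:
    """Two-level name used with the Hive Metastore: database.table."""
    return f"{database}.{table}"


def unity_table_name(catalog: str, schema: str, table: str) -> str:
    """Three-level name used with Unity Catalog: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


# Hypothetical identifiers, chosen only for this example.
print(hive_table_name("sales", "orders"))              # sales.orders
print(unity_table_name("finance", "sales", "orders"))  # finance.sales.orders
```

The extra catalog level is what lets one metastore hold the same schema and table names for different departments without collisions.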
Storage Access – the Hive Metastore:
- By default, Hive stores tables at the location: /user/hive/warehouse/<database name>/<table name>
- If an external storage location (such as ADLS or S3) is specified, the storage must be mounted to the desired path before use.
Storage Access – the Unity Catalog:
- Unity Catalog gives you the ability to access existing data in storage accounts using storage credentials and external locations. Storage credentials store the managed identity, and external locations define a path to storage along with a reference to the storage credential.
You can use this approach to grant and control access to existing data in cloud storage and to register external tables in the Unity Catalog.
Storage Credentials: Managed Identity
External Locations: Path Location to ADLS or S3
- A storage credential can hold a managed identity or service principal. Using a managed identity has the benefit of allowing Unity Catalog to access storage accounts protected by network rules.
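The relationship between a storage credential and an external location can be sketched as follows. This small Python helper only assembles the SQL text; it assumes the storage credential already exists, and every name in it (raw_zone, example_credential, the abfss URL, the data_engineers group) is hypothetical. The statement shapes follow Databricks SQL as we understand it, so check them against the documentation for your runtime before use.

```python
def create_external_location_sql(name: str, url: str, credential: str) -> str:
    """Build a statement that ties a cloud storage path to a storage credential."""
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {name} "
        f"URL '{url}' "
        f"WITH (STORAGE CREDENTIAL {credential})"
    )


def grant_read_files_sql(location: str, principal: str) -> str:
    """Build a statement that lets a user or group read files under the location."""
    return f"GRANT READ FILES ON EXTERNAL LOCATION {location} TO `{principal}`"


# Hypothetical names and path, for illustration only.
statements = [
    create_external_location_sql(
        "raw_zone",
        "abfss://raw@examplestorage.dfs.core.windows.net/landing",
        "example_credential",
    ),
    grant_read_files_sql("raw_zone", "data_engineers"),
]
for statement in statements:
    print(statement)  # In a notebook, each string could be passed to spark.sql(...)
```

Because the credential is referenced by name, access to the underlying identity never has to be shared with the users who are granted access to the path.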
Data Lineage – the Hive Metastore:
Data Lineage is not inherently supported in Hive Metastore. Users need to manually track the relationships between source and target tables.
Data Lineage – the Unity Catalog:
- You can use Unity Catalog to capture runtime data lineage across queries run on Azure Databricks. Lineage is supported for all languages and is captured down to the column level. Lineage data includes notebooks, jobs, and dashboards related to the query.
- Lineage is aggregated across all workspaces attached to a Unity Catalog metastore. This means that lineage captured in one workspace is visible in any other workspace sharing that metastore. Users must have the correct permissions to view the lineage data.
- Lineage data is retained for 1 year.
- To view lineage, go to the Lineage tab and click See Lineage Graph. Click the Plus icons to explore the data lineage generated by the queries.
Volumes – the Hive Metastore:
Hive Metastore supports structured data, but it lacks the flexibility to handle semi-structured or unstructured data natively.
Volumes – the Unity Catalog:
- Volumes are Unity Catalog objects that enable governance over non-tabular datasets. Volumes represent a logical volume of storage in a cloud object storage location and provide capabilities for accessing, storing, governing, and organizing files.
- While tables provide governance over tabular datasets, volumes add governance over non-tabular datasets.
You can use volumes to store and access files in any format, including structured, semi-structured, and unstructured data.
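Files in a volume are addressed by a path of the form /Volumes/<catalog>/<schema>/<volume>/<path to file>. A minimal sketch of building such a path is shown below; the helper name and the sample identifiers (main, raw, landing) are hypothetical.

```python
def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a /Volumes/<catalog>/<schema>/<volume>/... path for a file in a volume."""
    # Strip stray slashes so callers can pass "images/" or "/images" safely.
    segments = ["Volumes", catalog, schema, volume, *(p.strip("/") for p in parts)]
    return "/" + "/".join(segments)


# Hypothetical catalog, schema, and volume names.
print(volume_path("main", "raw", "landing", "images", "scan_001.png"))
# /Volumes/main/raw/landing/images/scan_001.png
```

The path mirrors the catalog.schema hierarchy used for tables, which is how the same governance model extends to non-tabular files.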
Sensitive Data Management in Hive Metastore:
- Involves implementing practices and configurations that ensure the confidentiality, integrity, and accessibility of sensitive data stored in Hive tables.
- Access Control: Implement RBAC to restrict access to sensitive data based on user roles. This ensures that only authorized personnel can access or manipulate sensitive data. Hive supports column-level access control, allowing administrators to restrict access to specific columns that contain sensitive information.
Sensitive Data Management in Unity Catalog within Databricks:
- Involves implementing security and governance measures for managing sensitive data across different data assets. Unity Catalog provides features that enhance data security, compliance, and collaboration.
- Allows administrators to define fine-grained access controls, enabling users to have permissions at the row and column levels. This means sensitive information can be protected while allowing users to access non-sensitive data.
- Supports dynamic data masking to obfuscate sensitive data in query results based on user roles, ensuring users only see data they are authorized to view.
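As a rough illustration of dynamic masking, the sketch below assembles the two SQL statements typically involved in a Unity Catalog column mask: a masking function that checks group membership, and an ALTER TABLE that attaches it to a column. It assumes the column is of type STRING, and the table, column, function, and group names are hypothetical. The SQL follows the Databricks column mask syntax as we understand it and should be verified against the documentation for your runtime.

```python
def column_mask_sql(table: str, column: str, mask_function: str,
                    allowed_group: str, placeholder: str = "****") -> list[str]:
    """Return [create-function, apply-mask] statements for a STRING column."""
    create_fn = (
        f"CREATE OR REPLACE FUNCTION {mask_function}({column} STRING) "
        f"RETURN CASE WHEN is_account_group_member('{allowed_group}') "
        f"THEN {column} ELSE '{placeholder}' END"
    )
    apply_mask = f"ALTER TABLE {table} ALTER COLUMN {column} SET MASK {mask_function}"
    return [create_fn, apply_mask]


# Hypothetical table, column, function, and group names.
for statement in column_mask_sql("main.hr.employees", "ssn",
                                 "main.hr.ssn_mask", "hr_admins"):
    print(statement)  # In a notebook, each string could be passed to spark.sql(...)
```

With a mask like this in place, members of the allowed group see the real value and everyone else sees the placeholder, without maintaining separate copies of the table.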
How to Determine if Your Databricks Workspace is Using Hive Metastore or Unity Catalog:
1. Check for Unity Catalog in the Databricks UI:
- In the Databricks workspace, go to the Data tab on the left-hand navigation panel.
- Look for “Catalogs” in the UI:
- If you see a list of catalogs along with options for Delta Sharing and External Data, your workspace is using Unity Catalog.
2. Using SQL Commands to Check Metastore Type:
Use SQL queries in a Databricks notebook to check whether Unity Catalog is enabled or if your workspace is using the Hive Metastore.
For Hive Metastore:
- You will see the default catalog as hive_metastore. You can run the SHOW CATALOGS query to check.
- If the only catalog returned is hive_metastore, then your workspace is using the Hive Metastore.
For Unity Catalog:
- If your workspace is using Unity Catalog, running the same query (SHOW CATALOGS) will return multiple catalogs, not just hive_metastore.
Additionally, Unity Catalog allows the creation of new catalogs with a CREATE CATALOG statement:
- If the command runs successfully, Unity Catalog is enabled. If not, your workspace is likely using Hive Metastore.
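The SHOW CATALOGS check described above can be wrapped in a small helper. This is a minimal sketch of the rule stated in this post (only hive_metastore returned means Hive Metastore, anything more means Unity Catalog). The function name is hypothetical, and the rule is a heuristic rather than an official detection method.

```python
def metastore_type(catalogs: list[str]) -> str:
    """Classify a workspace from the catalog names returned by SHOW CATALOGS."""
    names = {name.strip().lower() for name in catalogs}
    if not names:
        return "unknown"  # Nothing returned, so no conclusion can be drawn.
    if names == {"hive_metastore"}:
        return "hive_metastore"
    return "unity_catalog"


# In a Databricks notebook, the list could be obtained with something like:
#   catalogs = [row[0] for row in spark.sql("SHOW CATALOGS").collect()]
print(metastore_type(["hive_metastore"]))                    # hive_metastore
print(metastore_type(["hive_metastore", "main", "system"]))  # unity_catalog
```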
Conclusion:
While Hive Metastore served its purpose as a foundational metadata store for data lakes, it lacks several key features required for modern data governance. Unity Catalog not only addresses these shortcomings but also brings additional functionalities, such as fine-grained access control, built-in data lineage, and support for unstructured data. By offering a unified view of metadata and advanced data management capabilities, Unity Catalog stands out as the ideal solution for organizations looking to streamline their data governance in Databricks.
For more details, Diggibyte Technologies Pvt Ltd has all the experts you need. Contact us today to embed intelligence into your organization.
Author: Chilukamari Durga Bhavani