DELTA behavior in Data Lake
Delta tables store data in a directory structure within the specified file system. Each Delta table consists of multiple files under a single table directory. The “_delta_log” subdirectory contains the transaction log, which records every change made to the Delta table. The data files themselves sit in the table directory (or in partition subdirectories) and hold the table’s data in Parquet, a compressed columnar format suited to efficient storage and processing.
Additionally, Delta tables can use partitioning to organize data files by the values of specific columns, which can improve query performance. Overall, Delta tables provide a structured and optimized way to store and manage data for analytical workloads.
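To make the layout concrete, here is a minimal Python sketch that classifies paths relative to a table root. The file names are made up for illustration, and the rules are a simplification (real tables also contain checkpoint and other auxiliary files), so treat it as a reading aid rather than a description of Delta Lake internals.

```python
from pathlib import PurePosixPath


def classify_delta_path(relative_path: str) -> str:
    """Classify a path relative to the table root (simplified rules)."""
    parts = PurePosixPath(relative_path).parts
    if not parts:
        return "other"
    # Everything under _delta_log belongs to the transaction log.
    if parts[0] == "_delta_log":
        return "transaction_log"
    if parts[-1].endswith(".parquet"):
        # A parent directory named column=value indicates a partitioned table.
        if any("=" in part for part in parts[:-1]):
            return "partitioned_data_file"
        return "data_file"
    return "other"


# Hypothetical paths, for illustration only.
examples = [
    "_delta_log/00000000000000000000.json",
    "part-00000-1a2b.snappy.parquet",
    "country=US/part-00001-3c4d.snappy.parquet",
]
for path in examples:
    print(path, "->", classify_delta_path(path))
```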
Delta Table Versioning in the Data Lake
Figure: Storage structure of a Delta table
Delta Lake keeps multiple versions of table data in the data lake to support features such as time travel. While this improves data integrity and query flexibility, the old data versions also increase storage consumption over time. To address this, Delta Lake provides the VACUUM command, which removes data files that the table no longer references and that are older than the retention threshold.
What is VACUUM?
VACUUM is a Delta Lake command that helps manage storage by removing old files that are no longer needed. It uses Delta’s transaction log to decide which files the table still references, so only unreferenced files are deleted. This reduces storage usage; it is not a query performance optimization. By default, VACUUM keeps unreferenced files for 7 days before removing them, but you can adjust this retention period to suit your needs.
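The selection rule can be sketched in plain Python. This is a simplified model, not Delta Lake's implementation: it assumes we already know each file's last-modified time and the set of files referenced by the current table version, and the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION = timedelta(days=7)  # 7 days = 168 hours


def files_eligible_for_vacuum(files_on_disk, live_files, now,
                              retention=DEFAULT_RETENTION):
    """Return files that are unreferenced AND older than the retention cutoff.

    files_on_disk: dict mapping file path -> last-modified datetime
    live_files:    set of paths referenced by the current table version
    """
    cutoff = now - retention
    return sorted(
        path
        for path, modified in files_on_disk.items()
        if path not in live_files and modified < cutoff
    )


# Hypothetical example data.
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
files = {
    "a.parquet": datetime(2024, 1, 1, tzinfo=timezone.utc),   # still referenced
    "b.parquet": datetime(2024, 1, 1, tzinfo=timezone.utc),   # stale and old
    "c.parquet": datetime(2024, 1, 29, tzinfo=timezone.utc),  # stale but recent
}
print(files_eligible_for_vacuum(files, {"a.parquet"}, now))  # ['b.parquet']
```

Only `b.parquet` qualifies: `a.parquet` is still part of the table, and `c.parquet` falls inside the 7-day window.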
Does VACUUM run automatically?
No. Databricks does not run VACUUM on your tables automatically, so it must be executed regularly. VACUUM is therefore an essential maintenance process for all Delta tables, especially those with frequent updates or rewrites, where the number of stale files can grow quickly and inflate your storage account.
The default retention threshold for removed data files is 7 days. The command below therefore deletes all files that are not referenced by the current state of the transaction log and are older than that threshold.
Command to run VACUUM: VACUUM table_name
If no retention threshold is specified, Delta Lake defaults to a retention period of 7 days (168 hours).
Command to run VACUUM by using the path of the table (the path shown is a placeholder): VACUUM delta.`/path/to/table`
Delta Lake has a safety check that prevents you from running a dangerous VACUUM command: if you request a retention period shorter than the configured threshold (7 days by default), the command fails with an error instead of deleting files.
In Databricks Runtime, if you are sure that no operation on the table runs longer than the retention interval you plan to specify, you can turn off this check by setting spark.databricks.delta.retentionDurationCheck.enabled to false.
You can then run VACUUM with a retention period below the default, as your requirements dictate.
Update Configuration
Command: SET spark.databricks.delta.retentionDurationCheck.enabled = false
Command: VACUUM table_name RETAIN num HOURS
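From a notebook, the same two steps might look like the sketch below. It assumes `spark` is an existing SparkSession with Delta Lake available; the helper names are hypothetical. Restoring the setting afterwards is a design choice of this sketch, so that the safety check is off only for the one statement that needs it.

```python
from contextlib import contextmanager

CHECK_KEY = "spark.databricks.delta.retentionDurationCheck.enabled"


@contextmanager
def retention_check_disabled(spark):
    """Temporarily turn off the retention safety check, then restore it."""
    previous = spark.conf.get(CHECK_KEY, "true")
    spark.conf.set(CHECK_KEY, "false")
    try:
        yield spark
    finally:
        # Put the previous value back even if VACUUM fails.
        spark.conf.set(CHECK_KEY, previous)


def vacuum_with_short_retention(spark, table_name, retain_hours):
    """Run VACUUM below the default retention (assumes a trusted table name)."""
    with retention_check_disabled(spark):
        return spark.sql(f"VACUUM {table_name} RETAIN {retain_hours} HOURS")
```

A call such as `vacuum_with_short_retention(spark, "table_name", 24)` would then keep only the last 24 hours of unreferenced files, which also shortens the time travel window accordingly.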
Examples of Dry Run Commands for VACUUM Operation
If you want to list the Parquet files that would be removed before performing VACUUM, you can use the commands below.
- Here’s the DRY RUN command. It lists the files that would be deleted (up to 1000 files in the display).
Command: VACUUM table_name DRY RUN
- To perform a dry run of the VACUUM command with a specific retention period, you can use the following command.
Command: VACUUM table_name RETAIN num HOURS DRY RUN
The above commands will show what files would be deleted if the VACUUM operation were actually performed.
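The syntax variants above can be generated by a small helper. The following is a sketch with a hypothetical function name; it only builds the statement text, which you could then pass to `spark.sql(...)` in an environment where a SparkSession exists. Because the target is interpolated into SQL, it should only be used with trusted table names and paths.

```python
from typing import Optional


def build_vacuum_sql(target: str,
                     retain_hours: Optional[int] = None,
                     dry_run: bool = False,
                     is_path: bool = False) -> str:
    """Build a VACUUM statement for a table name or a table path."""
    if retain_hours is not None and retain_hours < 0:
        raise ValueError("retain_hours must not be negative")
    # Path-based tables are addressed as delta.`/some/path`.
    table_ref = f"delta.`{target}`" if is_path else target
    parts = ["VACUUM", table_ref]
    if retain_hours is not None:
        parts += ["RETAIN", str(retain_hours), "HOURS"]
    if dry_run:
        parts += ["DRY", "RUN"]
    return " ".join(parts)


print(build_vacuum_sql("table_name"))
# VACUUM table_name
print(build_vacuum_sql("table_name", retain_hours=168, dry_run=True))
# VACUUM table_name RETAIN 168 HOURS DRY RUN
print(build_vacuum_sql("/path/to/table", is_path=True))
# VACUUM delta.`/path/to/table`
```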
Does VACUUM limit the ability to time travel?
Yes. When VACUUM removes a file that a Delta table version needs, you can no longer access that version. Make sure the retention period for your Delta tables is long enough for your time travel needs: once files older than the retention period have been deleted, you cannot go back to the versions that depended on them.
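As a rough way to reason about this, the sketch below estimates which versions should still be readable after a VACUUM. It is a simplified model under stated assumptions, not Delta Lake's actual logic: a version is treated as safe if it is the latest one, or if the commit that replaced it happened inside the retention window (so any files it needs were removed from the table recently enough to be kept). The function name and the commit timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def versions_safe_for_time_travel(commit_times, now,
                                  retention=timedelta(days=7)):
    """commit_times[i] is the commit time of version i, in ascending order."""
    cutoff = now - retention
    last = len(commit_times) - 1
    safe = []
    for version in range(len(commit_times)):
        # The latest version is always readable; an older version is safe
        # if it was superseded no earlier than the cutoff.
        if version == last or commit_times[version + 1] >= cutoff:
            safe.append(version)
    return safe


utc = timezone.utc
commits = [
    datetime(2024, 1, 1, tzinfo=utc),   # version 0
    datetime(2024, 1, 10, tzinfo=utc),  # version 1
    datetime(2024, 1, 28, tzinfo=utc),  # version 2
    datetime(2024, 1, 30, tzinfo=utc),  # version 3
]
print(versions_safe_for_time_travel(commits, datetime(2024, 1, 31, tzinfo=utc)))
# [1, 2, 3]
```

Version 0 was replaced on January 10, well before the 7-day cutoff of January 24, so the files it needs may already have been deleted.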
NOTE: It is not advisable to embed the VACUUM command directly within your codebase. Instead, it is recommended to execute it separately or schedule it as a maintenance pipeline/job according to your specific requirements.
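A separate maintenance job could be as simple as the sketch below. It assumes `spark` is an existing SparkSession and that the table names come from a trusted list you maintain; the function name is hypothetical. Scheduling (for example, as a nightly or weekly job) is left to your orchestration tool.

```python
def vacuum_tables(spark, table_names, retain_hours=168):
    """Run VACUUM on each table and report the outcome per table."""
    results = {}
    for name in table_names:
        try:
            spark.sql(f"VACUUM {name} RETAIN {retain_hours} HOURS")
            results[name] = "ok"
        except Exception as exc:
            # Record the failure and continue with the remaining tables.
            results[name] = f"failed: {exc}"
    return results
```

Collecting failures instead of stopping at the first one means a single problematic table does not block maintenance for the rest.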
CONCLUSION
This blog covered how Delta Lake stores its data, what VACUUM is, how it works on your Delta tables to reduce storage costs, and how it limits time travel.
- The optimal vacuum strategy for your tables depends on your business needs.
- Databricks does not automatically run VACUUM on your tables; it must be run explicitly.
- Delta Lake VACUUM helps save on storage costs; it is not a performance optimization.
Refer to the vacuum documentation for more details on the vacuum commands.
Author: Poluparthi Revathi