Setting Up an Azure Databricks Workspace with VNet Injection Using Terraform

Unlocking Secure Configurations:

In the realm of cloud infrastructure, Terraform takes the lead as a powerful tool for defining and deploying resources. This blog post delves into the process of configuring Azure Databricks, an analytics platform built on Apache Spark, using Terraform. Our focus is on optimizing security through the integration of a Virtual Network (VNet) and two subnets. Join us as we navigate the simplicity and efficiency of Terraform in shaping a secure Azure Databricks environment.

Why Virtual Networks and Subnets?

Integrating Azure Databricks with a Virtual Network ensures a private and controlled environment for your analytics platform. This configuration enables you to manage network traffic, implement security policies, and strengthen overall governance.

Subnets are subdivisions within a Virtual Network that help organize and segment resources. They allow for better network management by grouping similar resources together and implementing security policies within the Virtual Network.

Together, Virtual Networks and Subnets provide a flexible and secure foundation for deploying and managing Azure services, ensuring efficient communication while maintaining isolation and security.

Prerequisites
  • Databricks workspace
  • Virtual network and two subnets
  • Network Security Group
  • Network Security Group Association
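
As a rough sketch of how the building blocks listed above might start out in Terraform, the configuration below declares the azurerm provider, a resource group, a VNet, and an empty Network Security Group. The resource names, the region, the provider version constraint, and the 10.0.0.0/16 address space are illustrative assumptions, not values taken from this post.

```hcl
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0" # assumed version constraint
    }
  }
}

provider "azurerm" {
  features {}
}

# Hypothetical resource group that holds the network and the workspace.
resource "azurerm_resource_group" "this" {
  name     = "rg-databricks-demo" # assumed name
  location = "westeurope"         # assumed region
}

# VNet that the workspace will be injected into.
resource "azurerm_virtual_network" "this" {
  name                = "vnet-databricks-demo" # assumed name
  location            = azurerm_resource_group.this.location
  resource_group_name = azurerm_resource_group.this.name
  address_space       = ["10.0.0.0/16"] # assumed address space
}

# Network Security Group to be associated with both Databricks subnets.
resource "azurerm_network_security_group" "this" {
  name                = "nsg-databricks-demo" # assumed name
  location            = azurerm_resource_group.this.location
  resource_group_name = azurerm_resource_group.this.name
}
```

The subnets, their NSG associations, and the workspace itself would build on these resources.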

Understanding Subnets in Databricks Workspaces

Logical Network Partition:

A subnet serves as a logical division within a broader network, segmenting it into smaller, manageable parts.

Source: Introduction to Information Security, 2014. Retrieved from https://www.sciencedirect.com/

VNET INJECTION

NOTE: The VNet of an existing workspace cannot be replaced, so choose the VNet when the workspace is created.

VNET INJECTION REQUIREMENTS

  • The workspace and the VNet must reside in the same region and the same subscription
  • The VNet address space must be a CIDR block between /16 and /24
  • Several workspaces can share the same VNet, but each workspace needs its own pair of subnets
  • Azure reserves 5 IP addresses in each subnet
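
The address-space rules above can be checked and applied inside Terraform itself. The following is a minimal sketch, assuming a hypothetical input variable named vnet_cidr and a design choice of adding 2 bits to the VNet prefix for each subnet; neither comes from this post. With that choice, a /16 VNet yields /18 subnets and a /24 VNet yields /26 subnets.

```hcl
variable "vnet_cidr" {
  description = "Address space of the VNet used for injection (hypothetical input)."
  type        = string
  default     = "10.0.0.0/16"

  validation {
    # The prefix length must fall between /16 and /24.
    # try() falls back to 0, which fails the check, if the value is not in CIDR form.
    condition = (
      try(tonumber(split("/", var.vnet_cidr)[1]), 0) >= 16 &&
      try(tonumber(split("/", var.vnet_cidr)[1]), 0) <= 24
    )
    error_message = "The VNet address space must use a prefix length between /16 and /24."
  }
}

locals {
  # cidrsubnet(prefix, newbits, netnum) carves equally sized blocks out of the VNet.
  public_subnet_cidr  = cidrsubnet(var.vnet_cidr, 2, 0)
  private_subnet_cidr = cidrsubnet(var.vnet_cidr, 2, 1)

  # Prefix length shared by both subnets (VNet prefix plus the 2 added bits).
  subnet_prefix_length = tonumber(split("/", local.public_subnet_cidr)[1])

  # Azure reserves 5 addresses per subnet, so a /26 leaves 64 - 5 = 59.
  usable_ips_per_subnet = pow(2, 32 - local.subnet_prefix_length) - 5
}

output "usable_ips_per_subnet" {
  value = local.usable_ips_per_subnet
}
```

Because every cluster node consumes an address in each of the two subnets, the usable count per subnet gives an upper bound on the number of nodes the workspace can run.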

Two subnets for each workspace:

  • Host/public subnet
  • Container/private subnet

Databricks Workspace Subnets:

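Azure Databricks requires two subnets dedicated to each workspace, both delegated to the Microsoft.Databricks/workspaces service. Azure subnets span every availability zone in their region, so the two subnets are not tied to separate zones.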
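
The pair of workspace subnets might be declared in Terraform along the following lines. This is a sketch under stated assumptions: the input variables, subnet names, and CIDR ranges are hypothetical, and the VNet and Network Security Group are assumed to exist already and to be passed in by name or ID.

```hcl
variable "resource_group_name" { type = string }       # hypothetical input
variable "virtual_network_name" { type = string }      # hypothetical input
variable "network_security_group_id" { type = string } # hypothetical input

locals {
  # Assumed address ranges; they must fall inside the VNet address space.
  databricks_subnets = {
    public  = "10.0.0.0/24" # host subnet
    private = "10.0.1.0/24" # container subnet
  }
}

# One subnet per map entry, each delegated to Azure Databricks.
resource "azurerm_subnet" "databricks" {
  for_each = local.databricks_subnets

  name                 = "snet-databricks-${each.key}"
  resource_group_name  = var.resource_group_name
  virtual_network_name = var.virtual_network_name
  address_prefixes     = [each.value]

  delegation {
    name = "databricks-delegation"

    service_delegation {
      name = "Microsoft.Databricks/workspaces"
      actions = [
        "Microsoft.Network/virtualNetworks/subnets/join/action",
        "Microsoft.Network/virtualNetworks/subnets/prepareNetworkPolicies/action",
        "Microsoft.Network/virtualNetworks/subnets/unprepareNetworkPolicies/action",
      ]
    }
  }
}

# Attach the same Network Security Group to both subnets.
resource "azurerm_subnet_network_security_group_association" "databricks" {
  for_each = azurerm_subnet.databricks

  subnet_id                 = each.value.id
  network_security_group_id = var.network_security_group_id
}
```

Using for_each over a small map keeps the two subnets identical apart from their name and range; two separate resource blocks would work just as well.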

The two subnets

Container subnet/Private subnet:

Also known as the private subnet, this subnet is designated for communication between the Spark executors and the driver within the Databricks environment.

In a distributed computing environment like Apache Spark, tasks are divided among multiple executors that run on different nodes. The Spark driver, which coordinates the tasks and manages the overall execution, communicates with the executors within this private subnet. Keeping the container subnet isolated from the broader network reduces exposure and protects the exchange of data and commands between Spark components.

Host subnet/public subnet:

Also known as the public subnet, this subnet is used for communication between the Databricks workspace and external Azure services.

Databricks workspaces often need to interact with Azure services such as storage, databases, or other external resources. Placing these interactions in the host subnet allows the Databricks environment to communicate securely with the broader Azure ecosystem while keeping them apart from the private subnet used for Spark internals. This segregation controls the flow of data and commands, so that sensitive Spark-related communication stays separate from external service communication.
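
With both subnets described, the workspace can be pointed at them through the custom_parameters block of the azurerm_databricks_workspace resource. The sketch below assumes the VNet, the two subnets, and their NSG associations already exist and are supplied through hypothetical input variables; the workspace name and the premium SKU are likewise assumptions.

```hcl
# Hypothetical inputs describing networking that already exists.
variable "resource_group_name" { type = string }
variable "location" { type = string }
variable "virtual_network_id" { type = string }
variable "public_subnet_name" { type = string }
variable "private_subnet_name" { type = string }
variable "public_subnet_nsg_association_id" { type = string }
variable "private_subnet_nsg_association_id" { type = string }

resource "azurerm_databricks_workspace" "this" {
  name                = "dbw-vnet-injected-demo" # assumed name
  resource_group_name = var.resource_group_name
  location            = var.location # must match the VNet region
  sku                 = "premium"    # assumed SKU

  # Supplying these parameters is what makes the workspace VNet-injected.
  custom_parameters {
    virtual_network_id  = var.virtual_network_id
    public_subnet_name  = var.public_subnet_name  # host subnet
    private_subnet_name = var.private_subnet_name # container subnet

    public_subnet_network_security_group_association_id  = var.public_subnet_nsg_association_id
    private_subnet_network_security_group_association_id = var.private_subnet_nsg_association_id
  }
}
```

Passing the association IDs rather than the NSG ID lets Terraform create the workspace only after both subnets have their security group attached.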

Conclusion:

The Container subnet focuses on internal Spark communication, providing a secure environment for the coordination between Spark executors and the driver. On the other hand, the Host subnet is exposed to the broader Azure network, facilitating communication between the Databricks workspace and external Azure services. This architectural separation helps in maintaining a well-organized and secure network environment for Databricks on Azure.

Author: Sanghavi A R
