Ensuring Secure Storage of PII Data

In today’s digital world, keeping personal information safe is crucial. This includes names, phone numbers, and social security numbers, known as Personally Identifiable Information (PII). Protecting PII is essential to maintain privacy and comply with laws like GDPR and CCPA. Platforms like Databricks, combined with encryption methods and Python libraries, help keep this information safe and ensure that only authorized people can access it.

What is PII data and data security?

PII stands for personally identifiable information and refers to any data that could identify a specific individual. This includes names, phone numbers, email addresses, social security numbers, and biometric data. 

Data Security: Data security involves safeguarding digital information from unauthorized access, use, disclosure, and modification.

Why securing PII data is crucial

  1. Privacy protection: Securing PII is essential to protect people’s privacy. The data must be protected from unauthorized access, theft & fraud.
  1. Legal Compliance: Laws like GDPR, CCPA & HIPAA require organizations to protect PII data. Failure to comply with these regulations can result in severe penalties, fines & legal liabilities. 

Regulatory Compliance 

  • In Europe – GDPR (General Data Protection Regulation) 
  • In the US (California) – CCPA (California Consumer Privacy Act)
  • In the US – HIPAA (Health Insurance Portability and Accountability Act) 
  • Under both CCPA & GDPR, organizations must be able to identify the data associated with requests to export, update & delete that data. These requests must be processed promptly upon receipt.
  • For CCPA, companies have 45 days to respond to consumer requests, with the possibility of a 45-day extension under certain circumstances.
  • For GDPR, companies have one month to respond to data subject access requests (DSARs), which can be extended by two further months for complex or numerous requests.
  1. Trust and Reputation: Securing PII data maintains trust with customers & stakeholders. A data breach harms reputation, erodes trust & can lead to loss of business. 
  1. Costs of data breaches: Data breaches lead to financial losses from fines, lawsuits & decreased market value while causing operational disruptions like downtime & productivity loss. 

How Lakehouse simplifies compliance 

  1. Reducing copies of PII minimizes the risk of exposure & simplifies data management. 
  1. Quick access to personal info aids in responding to data subject requests promptly. 
  1. Reliable data management enables secure changes, deletions, or exports of PII. 
  1. Transaction logs in Databricks act like a diary, keeping track of every move made with PII data to ensure everything stays above board. 

Features Databricks provides

  1. Data Encryption: Databricks offers encryption capabilities to safeguard data during transit and while at rest, guaranteeing the protection of PII from unauthorized access. 
  1. Access Controls: Role-based access control (RBAC) mechanisms allow organizations to define and enforce access policies, limiting who can view, modify, or access PII data within Databricks. 
  1. Audit Trails and Logging: Databricks keeps a detailed record of who viewed or changed the PII data, helping organizations detect unauthorized access.
  1. Data Governance: Databricks helps organizations follow the rules for handling PII data, ensuring it’s treated right from the beginning to the end. 

How do we protect PII data? 

Encryption in Transit and at Rest with Column-Level Encryption 

  • cryptography is a Python library that provides cryptographic functionality to secure data transmission, storage, and authentication. It includes encryption, decryption, hashing, digital signatures, and more.
  • Fernet is a symmetric, authenticated encryption scheme provided by the cryptography library.

Implementation of Data Encryption & Decryption 

  1. Create the sample Delta table. 
  1. Insert the values with sensitive data. 

View the data in the table. 

  1. Install the cryptography library and import the Fernet class.
  1. Generate a key using the generate_key() function.

The generate_key() function is important for creating cryptographic keys used in encryption and decryption processes. 

Fernet uses the same key for both encryption and decryption, which is called symmetric encryption.

The key generated by Fernet is a bytes string, which is suitable for cryptographic operations. It holds 32 bytes (256 bits) of random data, encoded as URL-safe base64.
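As a minimal sketch of the key generation step, assuming the cryptography package is already installed (for example with `pip install cryptography`), the snippet below creates a key and inspects it. The variable name `key` is only illustrative.

```python
import base64

from cryptography.fernet import Fernet

# Generate a fresh random key; keep it secret and never hard-code it.
key = Fernet.generate_key()

# The key is a bytes object that holds URL-safe base64 text.
print(type(key).__name__)                   # bytes
print(len(key))                             # 44 base64 characters
print(len(base64.urlsafe_b64decode(key)))   # 32 raw bytes (256 bits)
```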

  1. Create a Fernet object from the key

By using this object, we can encrypt & decrypt the data. 

In production, once the key is generated, it should be stored securely in a key vault or a Databricks secret scope.
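The sketch below shows one way the key could be loaded and turned into a Fernet object. The environment variable `PII_FERNET_KEY` is a hypothetical stand-in for a real secret store; in a Databricks notebook the key would more likely be read from a secret scope, and the scope and key names would be your own.

```python
import os

from cryptography.fernet import Fernet

# Hypothetical stand-in for a secret store. In Databricks the value would
# typically come from dbutils.secrets.get(scope="<scope>", key="<key-name>").
os.environ.setdefault("PII_FERNET_KEY", Fernet.generate_key().decode("ascii"))

# Load the stored key and build the object used to encrypt and decrypt.
key = os.environ["PII_FERNET_KEY"].encode("ascii")
fernet = Fernet(key)
```

Keeping the key outside the notebook means the code can be shared without exposing the secret.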

Encrypt Sample Data 

  • To encrypt data, we use the encrypt() method of the Fernet object.
  • We pass the data that needs to be encrypted, as bytes, to encrypt().

The result of encrypting a message is called a Fernet token.

  • Fernet works on binary data, so strings must be converted to bytes before they are encrypted.
  • Fernet guarantees that a message encrypted using it cannot be manipulated or read without the key.

The encrypted data is in bytes form.
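A minimal sketch of the encryption step follows. The SSN is a made-up sample value and the key is generated inline only to keep the example self-contained.

```python
from cryptography.fernet import Fernet

fernet = Fernet(Fernet.generate_key())

ssn = "123-45-6789"  # made-up sample value

# encrypt() expects bytes, so encode the string first.
token = fernet.encrypt(ssn.encode("utf-8"))

print(type(token).__name__)  # bytes
print(token)                 # URL-safe base64 text, different on every run
```

Each call to encrypt() uses a fresh random IV, so encrypting the same value twice gives two different tokens.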

Decrypt the Data 

To decrypt the encrypted data, we can use the same key that was generated earlier. 

  • For authorized users to view the original string of PII data, we can use the decrypt() method of the Fernet object.
  • We pass the Fernet token that needs to be decrypted as an argument to decrypt(), which returns the original data as bytes.

We can decode a bytes datatype into a normal string using the UTF-8 encoding. 
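The decryption step can be sketched as below, again with a made-up sample value. The second part is an illustration of what happens when a different key is used.

```python
from cryptography.fernet import Fernet, InvalidToken

key = Fernet.generate_key()
fernet = Fernet(key)
token = fernet.encrypt("123-45-6789".encode("utf-8"))

# decrypt() returns bytes; decode them to get the original string back.
plaintext = fernet.decrypt(token).decode("utf-8")
print(plaintext)  # 123-45-6789

# A different key cannot read the token: decrypt() raises InvalidToken.
try:
    Fernet(Fernet.generate_key()).decrypt(token)
    wrong_key_rejected = False
except InvalidToken:
    wrong_key_rejected = True
print(wrong_key_rejected)  # True
```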

Register the UDF for Data Encryption and Decryption 

Define UDF to encrypt the data 

  1. Define the Python UDF function named encrypt_data().
  1. Take the key generated with the Fernet module. The same key must be used for every row and later for decryption.
  1. Create the Fernet object from the key.
  1. Convert the input data to bytes using UTF-8 encoding, because Fernet operates on binary data (bytes) rather than text strings.
  1. Encrypt the bytes, which produces the encrypted token, also as bytes.
  1. Decode the token to a string when it needs to be stored in a text-based format (such as JSON or a database).

Register the UDF:  

We can call the registered UDF within the withColumn() function and pass the column containing the PII data that needs to be encrypted. Later, we can drop the column that contains the original PII data.
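The encryption function could look roughly like the sketch below. It is written as plain Python so it runs anywhere; the Spark registration is shown only as comments because it needs a running SparkSession, and the DataFrame `df` and the column names `ssn` and `ssn_encrypt` are assumptions, not taken from the original notebook.

```python
from cryptography.fernet import Fernet

# Assumption: one shared key, created once and loaded from a secret store.
KEY = Fernet.generate_key()


def encrypt_data(plain_text, key=KEY):
    """Return the Fernet token for plain_text as a string (None stays None)."""
    if plain_text is None:
        return None
    fernet = Fernet(key)
    token = fernet.encrypt(plain_text.encode("utf-8"))  # str -> bytes -> token
    return token.decode("utf-8")  # the token is base64 text, safe to store


# In a Spark session the function could be wrapped and applied like this:
#   from pyspark.sql.functions import udf, col
#   from pyspark.sql.types import StringType
#   encrypt_udf = udf(encrypt_data, StringType())
#   df = df.withColumn("ssn_encrypt", encrypt_udf(col("ssn"))).drop("ssn")

print(encrypt_data("123-45-6789"))
```

Handling None explicitly keeps null rows from raising an error inside the UDF.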

Now, the encrypted data can be shared across the workspace, because it cannot be read without the key. Unauthorized users are unable to misuse it as long as the key itself stays protected.

Define UDF to decrypt the data 

Authorized users may need to view the original string for business purposes.

  • We can use the same key to decrypt the encrypted data.

Registering the UDF  

We can see that the original SSN column and the decrypted column (ssn_decrypt) are the same.
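A matching decryption function might be sketched as follows. As before, the Spark part is shown as comments, and `df` and the column name `ssn_encrypt` are assumptions; only `ssn_decrypt` appears in the text above.

```python
from cryptography.fernet import Fernet

# Assumption: the same key that was used for encryption.
KEY = Fernet.generate_key()


def decrypt_data(token_text, key=KEY):
    """Return the original string for a Fernet token stored as text."""
    if token_text is None:
        return None
    fernet = Fernet(key)
    return fernet.decrypt(token_text.encode("utf-8")).decode("utf-8")


# In a Spark session the function could be wrapped and applied like this:
#   from pyspark.sql.functions import udf, col
#   from pyspark.sql.types import StringType
#   decrypt_udf = udf(decrypt_data, StringType())
#   df = df.withColumn("ssn_decrypt", decrypt_udf(col("ssn_encrypt")))

sample_token = Fernet(KEY).encrypt(b"123-45-6789").decode("utf-8")
print(decrypt_data(sample_token))  # 123-45-6789
```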

Generating Fernet Key Using Custom Password 

  1. Import Necessary Libraries and Modules: Import the required libraries, including os, base64, and the cryptography library for cryptographic operations.
  1. Generate a password and salt: Choose a password from which to derive the encryption key and generate a random salt using the os.urandom function. The salt adds randomness to the key derivation process, enhancing security.
  1. Initialize the Key Derivation Function (KDF): Create an instance of the PBKDF2HMAC (Password-Based Key Derivation Function 2 with HMAC) key derivation function with specified parameters such as the hash algorithm (SHA-256), key length, salt, and number of iterations.
  1. Derive the Encryption Key: Use the Key Derivation Function (KDF) to derive the encryption key from the password. Running the password through many hash function iterations makes each guess expensive, so the derived key is more resilient to brute-force attacks.
  1. Secure Encryption with Fernet: Using the derived key, we set up a Fernet object for encrypting data like email addresses into unreadable ciphertext, ensuring security during transmission and storage.

Note: If you encounter the error “AlreadyFinalized: PBKDF2 instances can only be used once” it means you attempted to use a PBKDF2 instance more than once. PBKDF2 is designed for single use only for security reasons. To resolve this error, create a new PBKDF2 instance if you need to derive keys multiple times. 

  1. Decrypted Data: Decryption reverses the encryption process, transforming the ciphertext back into the original plaintext data. 
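The steps above can be sketched as follows. The password, the email address, the 16-byte salt length, and the iteration count are all illustrative assumptions; the helper name `derive_fernet_key` is hypothetical as well.

```python
import base64
import os

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC


def derive_fernet_key(password: bytes, salt: bytes) -> bytes:
    """Derive a Fernet-compatible key with PBKDF2-HMAC-SHA256."""
    # A new PBKDF2HMAC instance per call avoids the AlreadyFinalized error.
    kdf = PBKDF2HMAC(
        algorithm=hashes.SHA256(),
        length=32,           # Fernet needs 32 raw bytes
        salt=salt,
        iterations=480_000,  # assumed value; tune to current guidance
    )
    # Fernet expects the 32 bytes as URL-safe base64.
    return base64.urlsafe_b64encode(kdf.derive(password))


password = b"example-password"  # placeholder; never hard-code real passwords
salt = os.urandom(16)           # keep the salt; it is needed to re-derive the key

key = derive_fernet_key(password, salt)
fernet = Fernet(key)

token = fernet.encrypt(b"jane.doe@example.com")
print(fernet.decrypt(token).decode("utf-8"))  # jane.doe@example.com

# The same password and salt always give the same key.
print(derive_fernet_key(password, salt) == key)  # True
```

The salt does not need to be secret, but it must be saved, because without it the key cannot be derived again for decryption.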

Data Masking 

Data masking is a technique that replaces or hides sensitive data with fictitious, similar, or scrambled values to safeguard it. It ensures that sensitive information is kept private and secure, especially when shared with people or systems that do not require access to the actual data. 

There are various methods of data masking, including substitution, redaction, encryption, and shuffling.

Example: masking the sample credit card numbers except for the last four digits
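As a simple sketch of this kind of masking in plain Python, the hypothetical helper below hides every digit except the last four and leaves separators in place. The card numbers are made-up test values, and the same logic could be wrapped in a Spark UDF.

```python
def mask_card_number(card_number: str, visible: int = 4, mask_char: str = "*") -> str:
    """Mask every digit except the last `visible` ones, keeping separators."""
    total_digits = sum(ch.isdigit() for ch in card_number)
    to_mask = max(total_digits - visible, 0)
    masked = []
    for ch in card_number:
        if ch.isdigit() and to_mask > 0:
            masked.append(mask_char)  # hide this digit
            to_mask -= 1
        else:
            masked.append(ch)         # keep separators and the last digits
    return "".join(masked)


print(mask_card_number("4111-1111-1111-1234"))  # ****-****-****-1234
print(mask_card_number("4111111111111234"))     # ************1234
```

Unlike encryption, this masking is one-way: the original digits cannot be recovered from the masked value.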


Securing personally identifiable information (PII) data is crucial in today’s digital landscape to protect individual privacy and ensure compliance with regulations like GDPR and CCPA. Databricks offers essential features such as data encryption and access controls to safeguard PII data. By encrypting sensitive information using techniques like Fernet encryption, organizations can ensure its confidentiality. Overall, prioritizing PII data security helps maintain trust with customers and avoids legal consequences. 

For More Details, Diggibyte Technologies Pvt Ltd has all the experts you need. Contact us Today to embed intelligence into your organization.

Author: Marripudi Haritha Sumanjali
