Accessing Azure Storage, managing secrets, and working with Delta Lake

Key concepts for accessing Azure Storage, secret scopes, data lakes, and Delta Lake:

1. Accessing the Storage Account

There are multiple methods to access a storage account in Azure. The primary methods include:

Access Key / Account Key:

  • Description: This is the most basic method of accessing Azure storage. The Access Key is generated at the storage account level.
  • Security Level: It provides the broadest access, meaning you can't restrict access at the container or folder level. Any entity with this key has access to all resources in the storage account.
  • Use Case: Suitable for broad, trusted access when you don’t need to restrict permissions for specific containers or folders.
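
For example, a minimal sketch of configuring Spark with an account key in a Databricks notebook might look like the following (the account, container, and key values in angle brackets are placeholders; in practice the key should come from a secret scope, covered below):

# Sketch: authenticate to ADLS Gen2 with the storage account's access key
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    "<access-key>"
)

# Any container in the account is now accessible, for example:
df = spark.read.csv("abfss://<container>@<storage-account>.dfs.core.windows.net/<path>")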

SAS Token (Shared Access Signature):

  • Description: SAS is a more granular method that allows access to specific resources in Azure storage (like containers or folders).
  • Security Level: SAS tokens are generated to give restricted access, and you can specify permissions like read, write, delete, etc.
  • Use Case: Provides more control over what parts of the storage can be accessed (container, folder), and for how long.

Service Principal (OAuth Authentication):

  • Description: This method uses Azure Active Directory (Azure AD) to authenticate users or applications, providing more fine-grained control over who can access storage at the container or folder level.
  • Security Level: This method is more secure because it uses Azure AD identities and roles, which can be controlled via Azure’s access control (IAM).
  • Use Case: This approach allows detailed control and is suitable for applications or services where access needs to be tightly controlled.

2. Secret Scope in Databricks

A secret scope in Databricks lets you manage access credentials securely without hardcoding sensitive keys in your notebooks. There are two main types of secret scopes:

Azure Key Vault-backed Secret Scope:

  • Description: The secret keys are stored in Azure Key Vault, a secure cloud service for storing sensitive information like passwords, API keys, and certificates.
  • Steps:
    1. Create an Azure Key Vault and store the secrets.
    2. Link the Key Vault to a Databricks secret scope by providing the vault's DNS name (URL) and resource ID.
    3. Use these secrets securely in Databricks notebooks without hardcoding the credentials.
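
Once the scope is linked, secrets in the Key Vault can be read in a notebook with dbutils.secrets; a minimal sketch with placeholder scope, secret, and account names:

# Read a secret from the Key Vault-backed scope
storage_key = dbutils.secrets.get(scope='<keyvault-scope-name>', key='<secret-name>')

# The value is redacted if printed, but it can be passed to Spark configuration
spark.conf.set('fs.azure.account.key.<storage-account>.dfs.core.windows.net', storage_key)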

Databricks-backed Secret Scope:

  • Description: The secrets are stored in an encrypted Databricks database. This scope can only be created using the CLI or API, not through the UI.
  • Steps:
    1. Use the CLI to create the scope:
      databricks secrets create-scope --scope <scope-name> --initial-manage-principal users
      
    2. Add secret keys using:
      databricks secrets put --scope <scope-name> --key <key-name>
      
    3. Use the secret in notebooks to access Azure Storage.
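
To confirm that the scope and keys exist, the same (legacy) CLI syntax used above can list them; the scope name is a placeholder:

databricks secrets list-scopes
databricks secrets list --scope <scope-name>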

3. Mounting Azure Storage in Databricks

One common way to access Azure storage in Databricks is to create a mount point. A mount point is a link between your Databricks workspace and a data source such as Azure Blob Storage or ADLS Gen2.

Example using Databricks-backed Secret Scope to mount storage:

dbutils.fs.mount(
    source = 'wasbs://<container>@<storage-account>.blob.core.windows.net',
    mount_point = '/mnt/<mount-point>',
    # the config key is the account-key property for the storage account;
    # its value comes from the secret scope instead of being hardcoded
    extra_configs = {
        'fs.azure.account.key.<storage-account>.blob.core.windows.net':
            dbutils.secrets.get(scope='<scope-name>', key='<key-name>')
    }
)

This uses a secret stored in Databricks to authenticate access to the storage account.
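
Once mounted, the container behaves like a folder under /mnt. A quick sketch of browsing and reading through the mount point (paths and file format are illustrative):

# List the files exposed through the mount point
display(dbutils.fs.ls('/mnt/<mount-point>'))

# Read data through the mount as if it were a regular DBFS path
df = spark.read.format('csv').option('header', 'true').load('/mnt/<mount-point>/<folder>')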

4. SAS Token Authentication for Storage Access

A Shared Access Signature (SAS) allows restricted access to Azure storage resources. You can configure SAS tokens to provide specific permissions and expiration times.

To use SAS in Databricks:

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net", "<SAS-Token>")

This provides temporary, granular access to the storage account.
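
With those settings in place, data can be read directly over the abfss endpoint; a minimal sketch with placeholder paths and a Parquet source assumed for illustration:

# Read from ADLS Gen2 using the SAS token configured above
df = spark.read.parquet('abfss://<container>@<storage-account>.dfs.core.windows.net/<path>')
display(df)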

5. Service Principal Authentication

A Service Principal is an identity created in Azure AD for an application or service to authenticate. It is the most secure way to access storage as it allows role-based access control (RBAC) to define permissions at different levels (e.g., container, directory).

Steps to use Service Principal for authentication:

  1. Create a Service Principal in Azure AD.
  2. Generate a client secret under the application and save it.
  3. Configure Databricks with the Service Principal credentials to authenticate and access the storage.

Example of Service Principal configuration:

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", "<secret>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

This allows accessing storage with fine-grained access control using Azure AD roles.
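
Rather than pasting the client secret into the notebook, it can be pulled from a secret scope first; a sketch assuming the secret was stored under a placeholder key name:

# Fetch the Service Principal's client secret from a secret scope
client_secret = dbutils.secrets.get(scope='<scope-name>', key='<sp-client-secret>')

spark.conf.set(
    "fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net",
    client_secret
)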

6. Data Lake

A Data Lake is a storage system designed to handle vast amounts of raw, unstructured, or semi-structured data. Examples include Amazon S3, ADLS Gen2, and Google Cloud Storage.

  • Advantages:
    • Cost-effective: Cheap storage for large datasets.
    • Scalable: Can scale to store petabytes of data.
    • Versatile: Can store all types of data (structured, semi-structured, unstructured).
  • Challenges:
    • ACID Guarantees: Data lakes typically do not support full ACID properties (Atomicity, Consistency, Isolation, Durability), which can lead to data consistency issues, especially in failure scenarios.
    • Data Quality: No inherent validation of data, which can lead to inconsistent or corrupted data after job failures or concurrent writes.

7. Delta Lake

Delta Lake is an open-source storage layer that provides ACID guarantees on top of data lakes, allowing you to manage data with consistency, reliability, and performance optimizations. It enables features like time travel, schema evolution, and data versioning.

  • Key Features:
    • ACID Transactions: Ensures data consistency during writes, updates, deletes, and concurrent reads/writes.
    • Schema Enforcement: Automatically enforces schema correctness on writes.
    • Time Travel: Allows querying historical versions of data.
    • Updates and Deletes: Delta Lake allows updating and deleting records in a way that maintains ACID properties.
  • Operations:
    • Writes: Delta Lake handles append and overwrite operations gracefully, ensuring consistency even if a job fails during the process.
    • Updates/Deletes: Changes are written to new files, and the old files are logically replaced once the transaction completes successfully (a short sketch follows this list).
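
As a rough illustration, updates and deletes can be issued through the Delta Lake Python API; the table path, column name, and status values below are placeholders:

from delta.tables import DeltaTable

# Bind to an existing Delta table by its storage path
dt = DeltaTable.forPath(spark, '/mnt/<path>/<table>.delta')

# Update rewrites only the affected files and commits the change to the transaction log
dt.update(condition="order_status = 'PENDING'", set={"order_status": "'CLOSED'"})

# Delete marks the old files as removed once the transaction commits
dt.delete("order_status = 'CANCELED'")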

8. Delta Lake Practical Example

  1. Create Database and Delta Table:

    %sql
    create database if not exists magdb;
    -- illustrative schema inferred from the insert example below; adjust to your data
    create table if not exists magdb.ordersdelta (
      order_id string, order_date string, customer_id string, order_status string
    ) using delta location '/mnt/magdb/delta/orders.delta';
    
  2. Insert Data: You can insert data using various methods:

    • Insert Command:
      insert into magdb.ordersdelta values ('111111111', '2025-01-12', '3333333', 'CLOSED')
    • Append Command:
      df.write.mode('append').format('delta').save('/mnt/magdb/delta/orders.delta')
      
  3. Update/Delete Data:

    • Delta Lake handles updates and deletes by creating new files and updating the transaction log, ensuring ACID properties.
  4. Versioning: Delta Lake allows you to use time travel to view previous versions of the data (see the sketch after this list).
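
A brief sketch of inspecting version history and time travel on the table created above:

from delta.tables import DeltaTable

# Inspect the table's transaction history (one row per committed operation)
display(DeltaTable.forName(spark, 'magdb.ordersdelta').history())

# Time travel: read the data as it existed at an earlier version
v0 = spark.read.format('delta').option('versionAsOf', 0).load('/mnt/magdb/delta/orders.delta')
display(v0)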

9. Conclusion

The methods outlined for accessing Azure Storage, managing secrets, and working with Delta Lake provide a secure, scalable, and efficient way to handle large datasets in cloud environments. Delta Lake, with its ACID compliance, schema enforcement, and versioning, overcomes the challenges typically faced in Data Lakes.

