These notes cover Azure Storage, Azure Data Lake Storage Gen2, and Databricks, organized by topic with short examples where possible.
Azure Storage Account Overview
- Durable, Scalable, Secure: Azure Storage keeps data protected and scales to handle large datasets.
  - Example: Data stored in Blob storage is replicated across multiple data centers to prevent loss.
- Services in a Storage Account:
  - Blob/Containers: Stores unstructured data such as images or logs.
    - Example: Save a video file in Blob storage for streaming.
  - Table: NoSQL storage for semi-structured data.
    - Example: Save IoT sensor data as key-value pairs.
  - File Share: Mountable file shares for lift-and-shift scenarios.
    - Example: Store application logs in a shared folder.
  - Queue: Message storage for decoupling application components.
    - Example: Use a queue to pass messages between microservices.
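The queue-based decoupling pattern can be illustrated locally with Python's standard `queue` module; this is only a stand-in sketch for the put/get semantics that Azure Queue Storage exposes through its SDK, not the Azure API itself:

```python
import queue

# Local stand-in for a storage queue: the producer and consumer
# never call each other directly; the queue decouples them.
messages = queue.Queue()

# Producer component: enqueue work items.
for order_id in ("order-1", "order-2", "order-3"):
    messages.put(order_id)

# Consumer component: drain the queue and process each message in FIFO order.
processed = []
while not messages.empty():
    processed.append(messages.get())

print(processed)  # ['order-1', 'order-2', 'order-3']
```

Either side can be scaled or restarted independently, which is the point of putting a queue between microservices.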
Azure Data Lake Storage Gen2
- Key Points:
  - Enables a hierarchical namespace for big data workloads.
  - Supports different access tiers: hot, cool, and archive.
  - Example: Rehydrate data from the archive tier to the hot tier for processing.
- Lifecycle Management:
  - Automates data movement between tiers based on rules.
  - Example: Archive files older than 30 days automatically.
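A lifecycle rule like the 30-day example above is expressed as a JSON management policy on the storage account. A sketch, in which the rule name and the `logs/` prefix are placeholder assumptions:

```json
{
  "rules": [
    {
      "name": "archive-old-logs",
      "enabled": true,
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["logs/"]
        },
        "actions": {
          "baseBlob": {
            "tierToArchive": { "daysAfterModificationGreaterThan": 30 }
          }
        }
      }
    }
  ]
}
```

The policy runs automatically in the background; no compute job is needed to move the blobs between tiers.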
Databricks Overview
- What is Databricks?
  - Databricks is a Spark-based unified analytics platform optimized for the cloud, simplifying tasks like infrastructure setup, upgrades, and security.
  - Example: Set up a Spark cluster in Databricks in minutes with pre-installed libraries.
- Cluster Types:
  - All-Purpose: For interactive use.
  - Job Cluster: Created for scheduled jobs and terminated after execution.
  - High Concurrency: Supports multiple concurrent users.
- Notebook:
  - A collaborative coding environment.
  - Example: Write Python and SQL in the same notebook using the `%python` and `%sql` magic commands.
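For instance, two cells of the same notebook might look like this (a sketch; the file path and the `orders` view name are placeholder assumptions):

```
%python
df = spark.read.csv("/mnt/data/orders.csv", header=True)
df.createOrReplaceTempView("orders")

%sql
SELECT COUNT(*) FROM orders
```

Registering the DataFrame as a temp view is what makes it visible to the SQL cell.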
Databricks File System (DBFS)
- DBFS:
  - A distributed file system layered over scalable storage such as Blob storage or Data Lake Gen2.
  - Example: Use `dbutils.fs.ls('/')` to list files in DBFS.
- Mount Points:
  - Map Azure Storage containers to DBFS paths.
  - Example:

    ```python
    dbutils.fs.mount(
        source="wasbs://container@storageaccount.blob.core.windows.net",
        mount_point="/mnt/mountpoint",
        extra_configs={"fs.azure.account.key.<storageaccount>.blob.core.windows.net": "<access-key>"}
    )
    ```

- File Operations:
  - List files: `dbutils.fs.ls('/mnt/mountpoint')`
  - Read the head of a file: `dbutils.fs.head('/mnt/mountpoint/file.txt')`
Databricks Utilities
- File Utilities (`dbutils.fs`):
  - Create directories: `dbutils.fs.mkdirs('/FileStore/temp')`
  - Move files: `dbutils.fs.mv('/source', '/destination', True)`
- Data Utilities:
  - Summarize data:

    ```python
    df = spark.read.csv("/path/to/file.csv", header=True)
    dbutils.data.summarize(df)
    ```

- Notebook Utilities:
  - Run notebooks in sequence: `dbutils.notebook.run("/ChildNotebook", 60, {"param": "value"})`
- Widgets:
  - Create dynamic UI inputs for notebooks.
  - Example:

    ```python
    dbutils.widgets.dropdown("status", "Closed", ["Open", "Closed"], "Order Status")
    status = dbutils.widgets.get("status")
    df.filter(f"status == '{status}'").show()
    ```
Databricks Architecture
The architecture of Databricks is designed to simplify big data analytics and AI workflows by combining Apache Spark with additional features for collaboration, ease of use, and scalability. It operates on a two-plane architecture:
1. Control Plane
The control plane manages the overall Databricks workspace and is owned by Databricks. It handles things like user interfaces, metadata, and the orchestration of clusters. All these resources are managed under Databricks' subscription.
Key components of the Control Plane:
- Databricks User Interface (UI): The interactive environment for creating notebooks, clusters, and managing jobs.
- Cluster Manager: Ensures clusters are created, terminated, and managed properly.
- DBFS Metadata: Manages information about the files stored in Databricks File System (DBFS).
Note: The control plane does not store your data; it only handles metadata and configuration.
2. Data Plane
The data plane is where the actual data processing happens, and it operates in your Azure subscription. The resources deployed here interact with your storage accounts and data sources securely.
Key components of the Data Plane:
- Virtual Network (VNet): All compute resources like clusters run within the Azure Virtual Network for secure connectivity.
- Network Security Groups (NSG): Used to enforce security policies and restrict traffic between resources.
- Azure Blob Storage/Data Lake Gen2: This is where the actual data resides. Databricks interacts with these storage services using APIs.
- Clusters: Compute resources (Driver & Worker Nodes) used for executing workloads like Spark jobs.
How the Planes Work Together
- When you create a Databricks workspace or manage a cluster, Control Plane resources orchestrate the setup and interaction.
- The Data Plane processes and stores your actual data. For example, when you mount Azure Blob storage, the data is processed in the Data Plane, but its metadata is managed in the Control Plane.
Security and Isolation
Databricks separates the control plane and data plane to ensure security and scalability:
- Data Security: Your data remains in your Azure subscription (Data Plane), and Databricks (Control Plane) only accesses configuration and metadata.
- Cluster Isolation: Clusters run in your own virtual network, ensuring complete control over security and access policies.
Diagram Overview (visualizing this helps):
- Control Plane (Databricks' subscription):
  - Hosts the UI, job scheduler, and cluster manager.
  - Manages metadata and logs.
- Data Plane (your Azure subscription):
  - Runs clusters (driver + workers) in your Virtual Network.
  - Stores data in Azure Blob Storage or Data Lake Gen2.
  - Executes all Spark workloads and job processing.
This architecture provides the scalability of the cloud, simplicity for collaboration, and strong security by keeping your data within your Azure environment while offloading management tasks to Databricks.
Databricks CLI
- Setup:
  - Install Python and pip.
  - Use the Databricks CLI to automate tasks like cluster creation and file management.
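A minimal setup sketch, assuming the pip-installable `databricks-cli` package and a personal access token generated in the workspace:

```shell
# Install the CLI via pip
pip install databricks-cli

# Configure authentication (prompts for the workspace URL and token)
databricks configure --token

# List files in DBFS to verify the connection
databricks fs ls dbfs:/
```

Once configured, the same CLI can script cluster and job management instead of clicking through the UI.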
Azure Databricks vs. Synapse
- Databricks ships the latest Spark versions with its own runtime optimizations, while Synapse supports additional languages such as .NET.