Azure Synapse

 

Azure Synapse Analytics Overview

Azure Synapse Analytics is a unified analytics service combining:

  1. Data Integration: Synapse Pipelines and Mapping Dataflows (built on Azure Data Factory).
  2. Enterprise Data Warehousing.
  3. Big Data Analytics: Using Spark and SQL capabilities.

Why Use a Data Warehouse?

Traditional databases are not well suited to large-scale analysis:

  • They handle day-to-day transactions efficiently, but are not designed for heavy analytical workloads.
  • Data warehouses, by contrast, are optimized for analyzing large volumes of historical data.

Three Key Compute Services:

  1. Serverless SQL Pool: Queries data stored in Azure Data Lake without needing infrastructure setup.
  2. Dedicated SQL Pool: A high-performance data warehouse that uses a distributed query engine.
  3. Apache Spark Pool: Used for fast, in-memory computations on large datasets.

Key Components in Synapse Analytics

1. Serverless SQL Pool

  • Serverless SQL Pool provides a cost-effective way to query data stored in Azure Data Lake. It doesn't require predefined infrastructure, and queries are executed on demand using T-SQL.

    Steps to Use:

    1. Upload Data: Use Synapse Studio to upload files to ADLS Gen2.
    2. Query the Data: Use OPENROWSET to query data without creating tables.
    3. Publish & Run: Execute SQL scripts to retrieve results.

    External Tables vs Normal Tables:

    • External Table: Metadata is stored in Synapse, but the data is in the data lake (ADLS Gen2).
    • Normal Table: Both the data and its metadata are stored in Synapse.
    • Serverless SQL Pool can only create External Tables, as no dedicated storage is provisioned.

Example: Querying data in ADLS Gen2 using OPENROWSET

sql

SELECT *
FROM OPENROWSET(
    BULK 'https://<storage_account_name>.dfs.core.windows.net/<container>/<file_path>',
    FORMAT = 'CSV'
) AS [result];

Steps:

  1. Upload a file to ADLS Gen2.
  2. Query using the OPENROWSET command or create an External Table for structured querying.

Creating an External Table

  1. Create External Data Source:
    • Use a SAS Token for accessing the storage.
  2. Create File Format:
    • Define how the data is stored (e.g., CSV, TSV).
  3. Create External Table:
    • Define the schema for the table, referencing the data source and file format.
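
Steps 1 and 2 might look like the sketch below. This is a minimal example: the master key password, the SAS token, and the names sasCredential, myDataSource, and myFileFormat are placeholders (the last two are reused in the table definition that follows).

sql

-- A master key is required once per database before creating a scoped credential
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong-password>';

-- Step 1: credential + external data source secured with a SAS token
CREATE DATABASE SCOPED CREDENTIAL sasCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<sas-token>';

CREATE EXTERNAL DATA SOURCE myDataSource
WITH (
    LOCATION = 'https://<storage_account_name>.dfs.core.windows.net/<container>',
    CREDENTIAL = sasCredential
);

-- Step 2: file format describing how the files are stored (CSV with a header row here)
CREATE EXTERNAL FILE FORMAT myFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);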

Example SQL:

CREATE EXTERNAL TABLE SalesData
(
    product_id   INT,
    sales_amount DECIMAL
)
WITH
(
    LOCATION = 'path-to-data',
    DATA_SOURCE = myDataSource,
    FILE_FORMAT = myFileFormat
);

Use Cases for Serverless SQL Pool:

  1. Ad-hoc Data Exploration: Quickly query large datasets.
  2. Logical Data Warehouse: Define a relational layer over files in the data lake without moving or copying the data.
  3. Cost-Effective: Pay only for the data processed by your queries; there is no infrastructure to maintain.

2. Dedicated SQL Pool

  • Definition: A fully provisioned data warehouse optimized for parallel query processing, built on a massively parallel processing (MPP) architecture.
  • Architecture:
    • A Control Node distributes queries across Compute Nodes; data is spread over 60 distributions.
    • Fact and Dimension table modeling with Star Schema.

Table Distribution Types (see the example after this list):

  1. Round Robin: Simple distribution, suitable for staging tables.
  2. Hash: Key-based distribution for co-located joins.
  3. Replicate: Broadcast small tables to all nodes.
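
As a rough illustration of the three options, the sketch below creates a hash-distributed fact table, a replicated dimension table, and a round-robin staging table (the table and column names are illustrative, not from any specific dataset).

sql

-- Hash: rows are assigned to one of the 60 distributions based on sale_id
CREATE TABLE FactSales
(
    sale_id      INT,
    product_id   INT,
    sales_amount DECIMAL(18, 2)
)
WITH ( DISTRIBUTION = HASH(sale_id), CLUSTERED COLUMNSTORE INDEX );

-- Replicate: a full copy of the small dimension table is kept on every compute node
CREATE TABLE DimProduct
(
    product_id   INT,
    product_name NVARCHAR(100)
)
WITH ( DISTRIBUTION = REPLICATE );

-- Round Robin: rows are spread evenly with no distribution key, suitable for staging
CREATE TABLE StageSales
(
    sale_id      INT,
    sales_amount DECIMAL(18, 2)
)
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );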

Example: Loading data into Dedicated SQL Pool

sql

-- Load data with PolyBase using CREATE TABLE AS SELECT (CTAS).
-- OPENROWSET is not available in a Dedicated SQL Pool, so this assumes an external
-- table (ext_sales_data here, an illustrative name) has already been defined over the files in ADLS Gen2.
CREATE TABLE sales_fact
WITH
(
    DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT * FROM ext_sales_data;

3. Apache Spark Pool

  • Apache Spark in Synapse is used for large-scale data processing, particularly for complex computations.

    Key Points:

    • Fast Computation: Spark performs in-memory processing, making it suitable for intensive computations.
    • Creating Spark Tables: Spark tables are created and managed through Synapse Studio.
    • Data Processing: Use PySpark to load, process, and store data in tables.

    Example Spark Code:

    python

    # Read the CSV file from ADLS into a DataFrame
    df = spark.read.csv('adls-path', header=True)

    # Save the DataFrame as a Spark table managed through Synapse Studio
    df.write.mode('overwrite').saveAsTable('ProcessedData')

Data Lakehouse Architecture

Combines the strengths of Data Lakes and Data Warehouses:

  1. Handles all data types (structured, semi-structured, unstructured).
  2. Ensures integrated security and governance.
  3. Supports ML, BI, and ACID transactions (Delta Lake).

Example: Delta Lake Query in Spark

python

# PySpark code for reading a Delta Lake table
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/mnt/datalake/path/to/delta_table")
deltaTable.toDF().show()
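
To illustrate the ACID transaction support mentioned above, the sketch below writes a DataFrame as a Delta table and then updates it in place; each operation is recorded as a separate transaction in the Delta log. The path, the df DataFrame, and the sales_amount column are assumptions for the example.

python

# Write a DataFrame (df) as a Delta table (an atomic, transactional write)
df.write.format("delta").mode("overwrite").save("/mnt/datalake/path/to/delta_table")

# Update rows in place; Delta Lake records the change as a new transaction
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/mnt/datalake/path/to/delta_table")
deltaTable.update(
    condition = "sales_amount < 0",   # illustrative filter
    set = { "sales_amount": "0" }
)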


Data Warehouse vs. Data Lake vs. Data Lakehouse

| Feature     | Data Warehouse | Data Lake                     | Data Lakehouse |
| ----------- | -------------- | ----------------------------- | -------------- |
| Data Types  | Structured     | Structured, Semi/Unstructured | All            |
| Cost        | Expensive      | Low                           | Moderate       |
| Governance  | High           | Low                           | High           |
| Scalability | Limited        | High                          | High           |
| BI Support  | Strong         | Weak                          | Strong         |
| ML Support  | Weak           | Strong                        | Strong         |

Key Notes

  • Querying Data in ADLS Gen2:
    • Use OPENROWSET for quick exploration.
    • Create External Tables for repeatable queries.
  • Serverless SQL Pool Use Cases:
    • Cost-effective, on-demand data processing.
    • Simplified data exploration and logical warehousing.




