Azure Synapse Analytics Overview
Azure Synapse Analytics is a unified analytics service combining:
- Data Integration: Synapse Pipelines and Mapping Data Flows (built on Azure Data Factory technology).
- Enterprise Data Warehousing.
- Big Data Analytics: Using Spark and SQL capabilities.
Why Use a Data Warehouse?
Traditional transactional databases are not suited to large-scale analysis:
- They handle day-to-day transactions efficiently but struggle under heavy analytical workloads.
- Data warehouses, by contrast, are optimized for analyzing large volumes of historical data.
Three Key Compute Services:
- Serverless SQL Pool: Queries data stored in Azure Data Lake without needing infrastructure setup.
- Dedicated SQL Pool: A high-performance data warehouse that uses a distributed query engine.
- Apache Spark Pool: Used for fast, in-memory computations on large datasets.
Key Components in Synapse Analytics
1. Serverless SQL Pool
Serverless SQL Pool provides a cost-effective way to query data stored in Azure Data Lake. It doesn't require predefined infrastructure, and queries are executed on demand using T-SQL.
Steps to Use:
- Upload Data: Use Synapse Studio to upload files to ADLS Gen2.
- Query the Data: Use OPENROWSET to query data without creating tables.
- Publish & Run: Execute SQL scripts to retrieve results.
External Tables vs Normal Tables:
- External Table: Metadata is stored in Synapse, but the data is in the data lake (ADLS Gen2).
- Normal Table: Both the data and its metadata are stored inside Synapse.
- Serverless SQL Pool can only create External Tables, as no dedicated storage is provisioned.
Example: Querying data in ADLS Gen2 using OPENROWSET
Steps:
- Upload a file to ADLS Gen2.
- Query using the OPENROWSET command or create an External Table for structured querying.
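A minimal T-SQL sketch of such an ad-hoc query, assuming CSV files in the lake (the storage account, container, and path are hypothetical placeholders):

-- Ad-hoc query over CSV files in ADLS Gen2 (serverless SQL pool)
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/<container>/sales/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS [result];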
Creating an External Table
- Create External Data Source: Use a SAS Token for accessing the storage.
- Create File Format: Define how the data is stored (e.g., CSV, TSV).
- Create External Table: Define the schema for the table, referencing the data source and file format.
Example SQL:
CREATE EXTERNAL TABLE SalesData (
    product_id INT,
    sales_amount DECIMAL
)
WITH (
    LOCATION = 'path-to-data',
    DATA_SOURCE = myDataSource,
    FILE_FORMAT = myFileFormat
);
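For reference, the data source and file format referenced above might be created along these lines. This is a sketch: the credential name, SAS token, and storage URL are placeholders, and a database master key must already exist for the credential step.

-- 1. Credential that holds the SAS token for the storage account
CREATE DATABASE SCOPED CREDENTIAL mySasCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<sas-token>';

-- 2. External data source pointing at the data lake container
CREATE EXTERNAL DATA SOURCE myDataSource
WITH (
    LOCATION = 'https://<storage-account>.dfs.core.windows.net/<container>',
    CREDENTIAL = mySasCredential
);

-- 3. File format describing how the files are stored
CREATE EXTERNAL FILE FORMAT myFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);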
Use Cases for Serverless SQL Pool:
- Ad-hoc Data Exploration: Quickly query large datasets.
- Logical Data Warehouse: Define a relational layer over data in the lake and query it in place, without moving or copying data.
- Cost-Effective: Pay only for the queries run, no infrastructure maintenance.
2. Dedicated SQL Pool
- Definition: A fully provisioned data warehouse optimized for parallel query execution using a massively parallel processing (MPP) architecture.
- Architecture:
- A Control Node distributes each query across Compute Nodes; data is sharded into 60 distributions.
- Fact and Dimension table modeling with Star Schema.
Table Distribution Types:
- Round Robin: Simple distribution, suitable for staging tables.
- Hash: Key-based distribution for co-located joins.
- Replicate: Broadcast small tables to all nodes.
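As a sketch of how the distribution choice shows up in DDL (table and column names below are illustrative, not from the original notes):

-- Large fact table: hash-distribute on the join key for co-located joins
CREATE TABLE dbo.FactSales (
    product_id   INT,
    sales_amount DECIMAL(10, 2)
)
WITH (
    DISTRIBUTION = HASH(product_id),
    CLUSTERED COLUMNSTORE INDEX
);

-- Small dimension table: replicate a full copy to every Compute node
CREATE TABLE dbo.DimProduct (
    product_id   INT,
    product_name NVARCHAR(100)
)
WITH (
    DISTRIBUTION = REPLICATE
);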
Example: Loading data into Dedicated SQL Pool
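One common approach is the COPY statement; a minimal sketch with placeholder table and storage names:

-- Bulk-load CSV files from the data lake into a dedicated SQL pool table
COPY INTO dbo.FactSales
FROM 'https://<storage-account>.dfs.core.windows.net/<container>/sales/*.csv'
WITH (
    FILE_TYPE = 'CSV',
    FIRSTROW = 2  -- skip the header row
);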
3. Apache Spark Pool
Apache Spark in Synapse is used for large-scale data processing, particularly for complex computations.
Key Points:
- Fast Computation: Spark performs in-memory processing, making it suitable for intensive computations.
- Creating Spark Tables: Spark tables are created and managed through Synapse Studio.
- Data Processing: Use PySpark to load, process, and store data in tables.
Example Spark Code:
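A minimal PySpark sketch, assuming a Synapse notebook where the spark session is predefined (the path and table name are hypothetical):

# Load CSV data from ADLS Gen2 into a DataFrame
df = spark.read.csv(
    'abfss://<container>@<storage-account>.dfs.core.windows.net/sales/',
    header=True,
    inferSchema=True
)

# Simple transformation: total sales per product
agg = df.groupBy('product_id').sum('sales_amount')

# Persist the result as a Spark table, visible in Synapse Studio
agg.write.mode('overwrite').saveAsTable('sales_by_product')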
Data Lakehouse Architecture
Combines the strengths of Data Lakes and Data Warehouses:
- Handles all data types (structured, semi-structured, unstructured).
- Ensures integrated security and governance.
- Supports ML, BI, and ACID transactions (Delta Lake).
Example: Delta Lake Query in Spark
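A short PySpark sketch of writing and reading a Delta table, reusing the df from the previous sketch (the path is a placeholder):

# Write the DataFrame as a Delta table (ACID transactions, schema enforcement)
df.write.format('delta').mode('overwrite').save('/delta/sales')

# Read it back and query it with Spark SQL
delta_df = spark.read.format('delta').load('/delta/sales')
delta_df.createOrReplaceTempView('sales_delta')
spark.sql('SELECT COUNT(*) AS row_count FROM sales_delta').show()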
Key Notes
- Querying Data in ADLS Gen2:
- Use OPENROWSET for quick exploration.
- Create External Tables for repeatable queries.
- Serverless SQL Pool Use Cases:
- Cost-effective, on-demand data processing.
- Simplified data exploration and logical warehousing.