Azure Synapse

 

Azure Synapse Analytics Overview

Azure Synapse Analytics is a unified analytics service combining:

  1. Data Integration: Synapse Pipelines and Mapping Dataflows (built on Azure Data Factory).
  2. Enterprise Data Warehousing.
  3. Big Data Analytics: Using Spark and SQL capabilities.

Why Use a Data Warehouse?

Traditional databases are not well suited to large-scale analysis:

  • They handle day-to-day transactions efficiently, but are not designed for heavy analytical workloads.
  • Data warehouses, by contrast, are optimized for analyzing large volumes of historical data.

Three Key Compute Services:

  1. Serverless SQL Pool: Queries data stored in Azure Data Lake without needing infrastructure setup.
  2. Dedicated SQL Pool: A high-performance data warehouse that uses a distributed query engine.
  3. Apache Spark Pool: Used for fast, in-memory computations on large datasets.

Key Components in Synapse Analytics

1. Serverless SQL Pool

  • Serverless SQL Pool provides a cost-effective way to query data stored in Azure Data Lake. It doesn't require predefined infrastructure, and queries are executed on demand using T-SQL.

    Steps to Use:

    1. Upload Data: Use Synapse Studio to upload files to ADLS Gen2.
    2. Query the Data: Use OPENROWSET to query data without creating tables.
    3. Publish & Run: Execute SQL scripts to retrieve results.

    External Tables vs Normal Tables:

    • External Table: Metadata is stored in Synapse, but the data is in the data lake (ADLS Gen2).
    • Normal Table: Both the data and its metadata are stored in Synapse.
    • Serverless SQL Pool can only create External Tables, as no dedicated storage is provisioned.

Example: Querying data in ADLS Gen2 using OPENROWSET

sql

SELECT *
FROM OPENROWSET(
    BULK 'https://<storage_account_name>.dfs.core.windows.net/<container>/<file_path>',
    FORMAT = 'CSV'
) AS [result];

Steps:

  1. Upload a file to ADLS Gen2.
  2. Query using the OPENROWSET command or create an External Table for structured querying.

Creating an External Table

  1. Create External Data Source:
    • Use a SAS Token for accessing the storage.
  2. Create File Format:
    • Define how the data is stored (e.g., CSV, TSV).
  3. Create External Table:
    • Define the schema for the table, referencing the data source and file format.
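
Steps 1 and 2 might look like the sketch below. This is a minimal example: the master key password, the SAS token, and the names sasCredential, myDataSource, and myFileFormat are placeholders (the last two are reused in the table definition that follows).

sql

-- A master key is required once per database before creating a scoped credential
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong-password>';

-- Step 1: credential + external data source secured with a SAS token
CREATE DATABASE SCOPED CREDENTIAL sasCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<sas-token>';

CREATE EXTERNAL DATA SOURCE myDataSource
WITH (
    LOCATION = 'https://<storage_account_name>.dfs.core.windows.net/<container>',
    CREDENTIAL = sasCredential
);

-- Step 2: file format describing how the files are stored (CSV with a header row here)
CREATE EXTERNAL FILE FORMAT myFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);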

Example SQL:

CREATE EXTERNAL TABLE SalesData
(
    product_id   INT,
    sales_amount DECIMAL
)
WITH
(
    LOCATION = 'path-to-data',
    DATA_SOURCE = myDataSource,
    FILE_FORMAT = myFileFormat
);

Use Cases for Serverless SQL Pool:

  1. Ad-hoc Data Exploration: Quickly query large datasets.
  2. Logical Data Warehouse: Define a relational layer over files in the data lake without moving or copying the data.
  3. Cost-Effective: Pay only for the data processed by your queries; there is no infrastructure to maintain.

2. Dedicated SQL Pool

  • Definition: A fully provisioned data warehouse optimized for parallel query processing, built on a massively parallel processing (MPP) architecture.
  • Architecture:
    • A Control Node distributes queries across Compute Nodes; data is spread over 60 distributions.
    • Fact and Dimension table modeling with Star Schema.

Table Distribution Types (see the example after this list):

  1. Round Robin: Simple distribution, suitable for staging tables.
  2. Hash: Key-based distribution for co-located joins.
  3. Replicate: Broadcast small tables to all nodes.
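
As a rough illustration of the three options, the sketch below creates a hash-distributed fact table, a replicated dimension table, and a round-robin staging table (the table and column names are illustrative, not from any specific dataset).

sql

-- Hash: rows are assigned to one of the 60 distributions based on sale_id
CREATE TABLE FactSales
(
    sale_id      INT,
    product_id   INT,
    sales_amount DECIMAL(18, 2)
)
WITH ( DISTRIBUTION = HASH(sale_id), CLUSTERED COLUMNSTORE INDEX );

-- Replicate: a full copy of the small dimension table is kept on every compute node
CREATE TABLE DimProduct
(
    product_id   INT,
    product_name NVARCHAR(100)
)
WITH ( DISTRIBUTION = REPLICATE );

-- Round Robin: rows are spread evenly with no distribution key, suitable for staging
CREATE TABLE StageSales
(
    sale_id      INT,
    sales_amount DECIMAL(18, 2)
)
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );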

Example: Loading data into Dedicated SQL Pool

sql

-- Load data with PolyBase using CREATE TABLE AS SELECT (CTAS).
-- OPENROWSET is not available in a Dedicated SQL Pool, so this assumes an external
-- table (ext_sales_data here, an illustrative name) has already been defined over the files in ADLS Gen2.
CREATE TABLE sales_fact
WITH
(
    DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT * FROM ext_sales_data;

3. Apache Spark Pool

  • Apache Spark in Synapse is used for large-scale data processing, particularly for complex computations.

    Key Points:

    • Fast Computation: Spark performs in-memory processing, making it suitable for intensive computations.
    • Creating Spark Tables: Spark tables are created and managed through Synapse Studio.
    • Data Processing: Use PySpark to load, process, and store data in tables.

    Example Spark Code:

    python

    # Read the CSV file from ADLS into a DataFrame
    df = spark.read.csv('adls-path', header=True)

    # Save the DataFrame as a Spark table managed through Synapse Studio
    df.write.mode('overwrite').saveAsTable('ProcessedData')

Data Lakehouse Architecture

Combines the strengths of Data Lakes and Data Warehouses:

  1. Handles all data types (structured, semi-structured, unstructured).
  2. Ensures integrated security and governance.
  3. Supports ML, BI, and ACID transactions (Delta Lake).

Example: Delta Lake Query in Spark

python

# PySpark code for reading a Delta Lake table
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/mnt/datalake/path/to/delta_table")
deltaTable.toDF().show()
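
To illustrate the ACID transaction support mentioned above, the sketch below writes a DataFrame as a Delta table and then updates it in place; each operation is recorded as a separate transaction in the Delta log. The path, the df DataFrame, and the sales_amount column are assumptions for the example.

python

# Write a DataFrame (df) as a Delta table (an atomic, transactional write)
df.write.format("delta").mode("overwrite").save("/mnt/datalake/path/to/delta_table")

# Update rows in place; Delta Lake records the change as a new transaction
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/mnt/datalake/path/to/delta_table")
deltaTable.update(
    condition = "sales_amount < 0",   # illustrative filter
    set = { "sales_amount": "0" }
)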


Data Warehouse vs. Data Lake vs. Data Lakehouse

| Feature     | Data Warehouse | Data Lake                     | Data Lakehouse |
| ----------- | -------------- | ----------------------------- | -------------- |
| Data Types  | Structured     | Structured, Semi/Unstructured | All            |
| Cost        | Expensive      | Low                           | Moderate       |
| Governance  | High           | Low                           | High           |
| Scalability | Limited        | High                          | High           |
| BI Support  | Strong         | Weak                          | Strong         |
| ML Support  | Weak           | Strong                        | Strong         |

Key Notes

  • Querying Data in ADLS Gen2:
    • Use OPENROWSET for quick exploration.
    • Create External Tables for repeatable queries.
  • Serverless SQL Pool Use Cases:
    • Cost-effective, on-demand data processing.
    • Simplified data exploration and logical warehousing.




