Delta Architecture

 Delta Architecture, also known as Medallion Architecture, is a structured approach used for handling data processing in multiple stages: Bronze, Silver, and Gold. As data moves up through these layers, its quality and value increase, enabling effective analytics and reporting. Let's break down each layer and the concept of Change Data Feed (CDF) involved:

1. Bronze Layer (Raw Data):

  • Data: Raw, unprocessed data from various sources is ingested and stored in this layer.
  • Purpose: It's used as a landing zone for all incoming data, typically before any transformation or cleaning.
  • Data Operations: Data is stored as-is, with basic metadata like filenames and timestamps.

Example:

create table retaildb.orders_bronze
(
    order_id int,
    order_date string,
    customer_id int,
    order_status string,
    filename string,
    createdon timestamp
)
using delta
location “dbfs://FileStore/data/orders_bronze.delta”
partition by(order_status)
TBLPROPERTIES(delta.enableChangeDataFeed = true)

2. Silver Layer (Cleaned & Enhanced Data):

  • Data: The data is cleaned and transformed here by applying business rules, and data quality checks are performed.
  • Purpose: To enhance the data with meaningful attributes (like year, month) and apply preprocessing.
  • Operations: Merge incremental changes from the Bronze layer, clean data, and add new columns or modify existing ones.

Example:

create table retaildb.orders_silver
(
    order_id int,
    order_date string,
    customer_id int,
    order_status string,
    order_year int GENERATED ALWAYS AS (YEAR(order_date)),
    order_month int GENERATED ALWAYS AS (MONTH(order_date)),
    createdon timestamp,
    modifiedon timestamp
)
using delta
location “dbfs://FileStore/data/orders_silver.delta”
partition by(order_status)
TBLPROPERTIES(delta.enableChangeDataFeed = true)

3. Gold Layer (Aggregated & Optimized Data):

  • Data: Data here is highly aggregated and refined to meet business requirements.
  • Purpose: To prepare optimized data for BI and reporting. This layer is used for providing insights.
  • Operations: Aggregated data, business metrics, or any other data transformations needed for business analysis.

Example:

create table retaildb.orders_gold
(
    customer_id int,
    order_status string,
    order_year int,
    Num_orders int
)
using delta
location “dbfs://FileStore/data/orders_gold.delta”
TBLPROPERTIES(delta.enableChangeDataFeed = true)

4. Platinum Layer (Optional, Highly Aggregated Data):

  • Data: Some systems might introduce an additional Platinum layer to store even more aggregated and refined data.
  • Purpose: This layer focuses on very high-quality data, used for advanced business intelligence and analytics.

Data Movement through the Layers:

The data moves from one layer to another as follows:

  1. Bronze Layer: Insert raw data into a folder (e.g., CSV files).
  2. Merge into Bronze Table: Copy data into the Bronze table.
  3. Silver Layer: Apply data transformations and clean the data. Merge data from the Bronze table.
  4. Gold Layer: Perform aggregations based on business needs and store in the Gold table.

Example of moving data:

-- Copy data into Bronze
copy into retaildb.orders_bronze from (
    select order_id::int,
           order_date::string,
           customer_id::int,
           order_status::string,
           INPUT_FILE_NAME() as filename,
           CURRENT_TIMESTAMP() as createdon
    FROM 'dbfs:/FileStore/raw'
) fileformat = CSV format_options('header' = 'true')

-- Merging changes from Bronze to Silver
merge into retaildb.orders_silver tgt
using orders_bronze_changes src
on tgt.order_id = src.order_id
when matched then update set tgt.order_status = src.order_status
when not matched then insert (order_id, order_date, customer_id, order_status, createdon, modifiedon)

Change Data Feed (CDF):

  • Purpose: CDF helps track and merge incremental changes such as inserts, updates, and deletes. It is essential for maintaining a history of changes and for auditing.
  • Use case: Enables tracking changes in data, especially in the Bronze to Silver or Silver to Gold transformation.
  • How to enable:
    • For new tables: Add the property TBLPROPERTIES (delta.enableChangeDataFeed = true).
    • For existing tables: Use the ALTER command to enable it.
    • To enable it for all tables by default: Set the Spark property spark.databricks.delta.properties.defaults.enableChangeDataFeed = true.

Example of merging CDF data into Silver:

create or replace temporary view orders_bronze_changes as
select * from table_changes('retaildb.orders_bronze', 1)
where order_id > 0 and order_status in ('PAYMENT_REVIEW', 'PROCESSING', 'CLOSED');

-- Merging to Silver
merge into retaildb.orders_silver tgt
using orders_bronze_changes src
on tgt.order_id = src.order_id
when matched then update set tgt.order_status = src.order_status
when not matched then insert (order_id, order_date, customer_id, order_status, createdon, modifiedon)

Key Takeaways:

  • Delta Architecture improves data quality as it moves through different layers (Bronze → Silver → Gold).
  • CDF allows for tracking and merging incremental changes (inserts, updates, deletes).
  • The data movement process ensures that raw data in the Bronze layer gets cleaned and transformed as it moves to higher layers, ultimately supporting business intelligence and reporting.
Akash

I am working as a Data Engineer

Post a Comment

Previous Post Next Post