Data Modeling

 

Data Modeling: In-depth Explanation with Examples

1. What is Data Modeling?

Data modeling is the process of structuring data in a way that supports business requirements. It involves defining the types of data, the relationships between them, and any integrity constraints to ensure consistent, accurate, and meaningful data in a system.

Goal: Design a system that stores data efficiently so that users can easily access and manipulate it according to business needs. Data modeling includes identifying tables, columns, and the relationships between them.

2. Types of Systems

  • OLTP Systems (Online Transactional Processing): These systems handle transactional data, focusing on quick, real-time processing of a few records at a time. They are optimized for inserts, updates, and deletes, and follow normalization principles to avoid redundancy. Example: a bank transaction system.
  • OLAP Systems (Online Analytical Processing): These systems analyze large datasets, typically for reporting and business intelligence. They are optimized for read-heavy operations such as complex queries, aggregations, and joins, and often use denormalized structures for faster query performance. Example: a business intelligence reporting system.

3. Types of Consumers

  • Data Analysts generally require data in simpler, denormalized structures to perform quick analysis without needing to write complex queries.
  • Data Engineers handle more complex structures and queries that are optimized for performance and scalability.


4. Normalization

Normalization is a technique used to minimize data redundancy and ensure data integrity. It involves dividing large tables into smaller, manageable ones.

1NF (First Normal Form)

  • Data is atomic (each column contains indivisible values).
  • There is a primary key to uniquely identify rows.
  • No duplicate rows or columns.

    Example:

    OrderID | Product | Quantity
    1       | Phone   | 2
    2       | Laptop  | 1

2NF (Second Normal Form)

  • The table must be in 1NF.
  • All non-key columns must depend on the entire primary key. Example: in a table keyed by (OrderID, ProductID) that also stores ProductName, ProductName depends only on ProductID (a partial dependency), so the table is split into two:
    • Orders (OrderID, ProductID)
    • Products (ProductID, ProductName)

3NF (Third Normal Form)

  • The table must be in 2NF.
  • No transitive dependencies (non-key columns should not depend on other non-key columns). Example: if OrderID determines ProductID and ProductID determines ProductName, then ProductName depends on OrderID only transitively; the dependency is removed by moving product attributes into a separate table, as shown in the sketch below.
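
As a minimal PySpark sketch (the order data below is hypothetical), a denormalized orders table can be split into normalized Orders and Products tables:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("NormalizationExample").getOrCreate()

# Denormalized input: product_name depends only on product_id, not on the full key
orders_raw = spark.createDataFrame(
    [(1, 101, "Phone", 2), (2, 102, "Laptop", 1), (3, 101, "Phone", 1)],
    ["order_id", "product_id", "product_name", "quantity"])

# Split into two tables so each product attribute is stored only once
orders_df = orders_raw.select("order_id", "product_id", "quantity")
products_df = orders_raw.select("product_id", "product_name").dropDuplicates()

orders_df.show()
products_df.show()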

Drawbacks of Data Redundancy

  • Increased storage and costs.
  • Complex updates and maintenance.

5. Modeling a Data Warehouse (DWH)

  • OLTP systems are not well suited for reporting or analysis: their normalized structure forces complex joins, and analytical queries scan large volumes of data.
  • Data warehouses (OLAP) are optimized for reading and analyzing large datasets.

6. Dimensional Modeling

Dimensional modeling is a design technique for structuring data in a data warehouse. It focuses on fact and dimension tables.

  • Fact Table: Contains measurable metrics (e.g., sales amount, order quantity).
  • Dimension Table: Contains descriptive attributes (e.g., customer name, product details).

Let's model a sales system with Fact_Sales as the fact table and Dim_Product, Dim_Customer, and Dim_Date as dimension tables.

from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.appName("DimensionalModeling").getOrCreate()

# Create example Fact and Dimension data
# Each fact row: sale_id, amount, product_id, customer_id, date_id
fact_sales_data = [(1, 1000, 1, 1, 1),
                   (2, 1500, 2, 2, 2),
                   (3, 500, 3, 3, 3)]

dim_product_data = [(1, "iPhone", "Electronics"),
                    (2, "Laptop", "Electronics"),
                    (3, "Table", "Furniture")]

dim_customer_data = [(1, "Alice", "NY"),
                     (2, "Bob", "LA"),
                     (3, "Charlie", "SF")]

dim_date_data = [(1, "2025-01-01"),
                 (2, "2025-01-02"),
                 (3, "2025-01-03")]

# Convert to DataFrames
fact_sales_df = spark.createDataFrame(fact_sales_data, ["sale_id", "amount", "product_id", "customer_id", "date_id"])
dim_product_df = spark.createDataFrame(dim_product_data, ["product_id", "product_name", "category"])
dim_customer_df = spark.createDataFrame(dim_customer_data, ["customer_id", "customer_name", "city"])
dim_date_df = spark.createDataFrame(dim_date_data, ["date_id", "date"])

# Join Fact table with Dimension tables
sales_with_details = fact_sales_df \
    .join(dim_product_df, fact_sales_df.product_id == dim_product_df.product_id) \
    .join(dim_customer_df, fact_sales_df.customer_id == dim_customer_df.customer_id) \
    .join(dim_date_df, fact_sales_df.date_id == dim_date_df.date_id) \
    .select("sale_id", "amount", "product_name", "category", "customer_name", "city", "date")

sales_with_details.show()

Star Schema

  • Fact tables are connected directly to dimension tables. Example: A Sales fact table might connect to Customer, Product, and Time dimension tables.
  • Fact Table: Order Amount, Quantity Sold.
  • Dimension Tables: Product Name, Customer ID, Date.
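
As a quick illustration using the DataFrames built above, a typical star-schema query aggregates a fact measure by a dimension attribute:

from pyspark.sql.functions import sum as spark_sum

# Total sales amount per product category, using the joined view created earlier
sales_by_category = sales_with_details.groupBy("category").agg(spark_sum("amount").alias("total_amount"))

sales_by_category.show()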

Snowflake Schema

  • Dimension tables are normalized into multiple related tables. Example: a Product dimension table can be split into related tables such as Category (as in the code below) or, for book products, Author, Publisher, and Product Line tables connected by keys.
  • This reduces redundancy but increases complexity in querying due to more joins.
  • Example:

    # Snowflake schema: the product dimension is normalized into product and category tables
    dim_category_data = [(1, "Electronics"), (2, "Furniture")]
    dim_product_details_data = [(1, "iPhone", "Smartphone", 1), (2, "Laptop", "Portable", 1), (3, "Table", "Wooden", 2)]
    
    # Create DataFrames (each product row carries a category_id foreign key)
    dim_category_df = spark.createDataFrame(dim_category_data, ["category_id", "category_name"])
    dim_product_details_df = spark.createDataFrame(dim_product_details_data, ["product_id", "product_name", "product_type", "category_id"])
    
    # Join the normalized tables on the category key to rebuild the full product dimension
    dim_product_snowflake_df = dim_product_details_df.join(dim_category_df, "category_id")
    
    dim_product_snowflake_df.show()

Surrogate Key

  • An artificially created key to uniquely identify a row in the dimension table.
    • Benefits: Insulates the data warehouse from changes in the source system and helps manage Slowly Changing Dimensions (SCDs).
  • Example:

    from pyspark.sql.functions import monotonically_increasing_id
    
    # Add surrogate keys to the dimension tables
    # (monotonically_increasing_id() generates unique, but not consecutive, 64-bit IDs)
    dim_product_with_sk = dim_product_df.withColumn("product_sk", monotonically_increasing_id())
    dim_customer_with_sk = dim_customer_df.withColumn("customer_sk", monotonically_increasing_id())
    dim_date_with_sk = dim_date_df.withColumn("date_sk", monotonically_increasing_id())
    
    # Show data with surrogate keys
    dim_product_with_sk.show()
    dim_customer_with_sk.show()
    dim_date_with_sk.show()

Slowly Changing Dimensions (SCD)

  • SCD Type 0: No changes are allowed, and attributes do not change over time (e.g., birth date).
  • SCD Type 1: Overwrite the old value with the new value without keeping history (e.g., address changes).
  • SCD Type 2: Creates a new row for each change, keeping complete history.
  • SCD Type 3: Maintains partial history by adding new columns for the old and new values.

SCD Type 1 (Overwrite):

Data is overwritten with new values. Historical values are not retained.

# Example: SCD Type 1 - overwrite the old value with the new one (no history kept)
from pyspark.sql.functions import when, col, lit

# Alice's city changes; the previous value is simply lost
dim_customer_scd1_df = dim_customer_df.withColumn(
    "city", when(col("customer_name") == "Alice", lit("NewCity")).otherwise(col("city")))

dim_customer_scd1_df.show()

SCD Type 2 (Create New Row):

A new row is created for every change, maintaining historical values.

from pyspark.sql.functions import current_date, lit

# Simulate SCD Type 2: each row carries validity dates; a change inserts a new row
# with a fresh start_date and closes the old row by setting its end_date
dim_customer_scd2_df = dim_customer_df \
    .withColumn("start_date", current_date()) \
    .withColumn("end_date", lit(None).cast("date"))

dim_customer_scd2_df.show()

SCD Type 3 (Partial History):

Store only the most recent change and the previous value.

# Example: SCD Type 3 - keep only the previous value alongside the current one
from pyspark.sql.functions import col, lit

dim_customer_scd3_df = dim_customer_df \
    .withColumn("previous_city", col("city")) \
    .withColumn("city", lit("NewCity"))

dim_customer_scd3_df.show()

Dimensional Modeling Process

  1. Choose the business process (e.g., Sales).
  2. Declare the grain (level of detail needed, e.g., one record per order or per item in an order; see the sketch after this list).
  3. Identify dimensions (e.g., customer, time).
  4. Identify facts (e.g., sales amount, units sold).
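
For step 2, the grain determines how much detail the fact table keeps. A minimal sketch (the order_items data below is hypothetical, reusing the SparkSession created earlier):

# Line-item grain: one row per product within an order (finest level of detail)
order_items = spark.createDataFrame(
    [(1, 1, "iPhone", 1, 1000), (2, 1, "Case", 2, 50), (3, 2, "Laptop", 1, 1500)],
    ["order_item_id", "order_id", "product_name", "quantity", "amount"])

# Order grain: one row per order; line-level detail is aggregated away and cannot be recovered
order_grain = order_items.groupBy("order_id").sum("quantity", "amount")

order_grain.show()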

Fact Table Design and Optimization

Fact tables are typically large because they store transaction data. It's important to choose the correct grain to minimize the data volume and optimize queries.

  • One Big Table (OBT): This concept stores all data in a single table, without separating facts and dimensions. While it simplifies querying (as no joins are required), it leads to redundancy, storage issues, and performance degradation over time.

SCD Type 2 in PySpark

To implement SCD Type 2, you can track changes to data by adding new rows whenever a change occurs. PySpark is typically used for handling large datasets efficiently in a distributed manner.

Example of an SCD Type 2 implementation in PySpark:

  1. Create Source and Target folders in HDFS.
  2. Load the source file (e.g., customer.csv) into a DataFrame.
  3. Transform data to identify changes, and insert the new data with the change history into the target.

By following this structure, you can efficiently manage historical data and keep track of changes over time.
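
A minimal PySpark sketch of this flow follows. The HDFS paths, file names, and the customer schema (customer_id, customer_name, city, start_date, end_date) are assumptions for illustration; dates are kept as strings to match the CSV-typed target.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_date, lit

spark = SparkSession.builder.appName("SCD2Example").getOrCreate()

# Hypothetical source file and target dimension in HDFS
source_df = spark.read.option("header", True).csv("hdfs:///data/source/customer.csv")
target_df = spark.read.option("header", True).csv("hdfs:///data/target/dim_customer")

# Current (open) rows in the target dimension have no end_date
open_rows = target_df.filter(col("end_date").isNull())

# Source rows whose tracked attribute (city) differs from the open target row
changed = source_df.alias("s") \
    .join(open_rows.alias("t"), "customer_id") \
    .filter(col("s.city") != col("t.city")) \
    .select("customer_id", col("s.customer_name").alias("customer_name"), col("s.city").alias("city"))

# Close the old versions of the changed customers
closed = open_rows.join(changed.select("customer_id"), "customer_id") \
    .withColumn("end_date", current_date().cast("string"))

# Insert new versions of the changed customers with a fresh start_date
new_versions = changed \
    .withColumn("start_date", current_date().cast("string")) \
    .withColumn("end_date", lit(None).cast("string"))

# New dimension state = untouched open rows + closed old versions + new versions
unchanged = open_rows.join(changed.select("customer_id"), "customer_id", "left_anti")
dim_customer_new = unchanged.unionByName(closed).unionByName(new_versions)

dim_customer_new.write.mode("overwrite").option("header", True).csv("hdfs:///data/target/dim_customer_new")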

The sections that follow build on these ideas (dimensional modeling, surrogate keys, the snowflake schema, and SCDs) with PySpark-based optimizations for fact tables.

Optimizing Fact Tables:

As fact tables grow, the number of joins can significantly impact query performance. It’s a good practice to pre-process the fact tables for optimization using bucketing or partitioning.

Bucketing:

Bucketing can optimize the performance of joins.

# Example: bucket the fact table by product_id (10 buckets) to optimize joins on that key
fact_sales_df.write.mode("overwrite").bucketBy(10, "product_id").sortBy("product_id").saveAsTable("bucketed_sales")
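
Partitioning is the other common layout optimization. A minimal sketch, assuming the fact table carries a date_id column and writing to a hypothetical path:

# Partition the fact table by date_id so queries filtering on a date read only matching partitions
fact_sales_df.write.mode("overwrite").partitionBy("date_id").parquet("/tmp/fact_sales_partitioned")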

One Big Table (OBT):

The "One Big Table" concept stores all data in a single, large table without separate fact and dimension tables.

Example:

# Example: One Big Table - pre-join the fact table with all dimensions into a single wide table
one_big_table = fact_sales_df \
    .join(dim_product_df, "product_id") \
    .join(dim_customer_df, "customer_id") \
    .join(dim_date_df, "date_id")

one_big_table.show()

Conclusion:

These modeling techniques—fact and dimension tables, surrogate keys, snowflake schema, and slowly changing dimensions—are essential for structuring data efficiently for reporting and analysis. By using techniques like bucketing and partitioning, you can optimize performance for larger datasets, making it easier for business analysts and engineers to work with data.

This set of concepts forms the foundation for building robust data models that enable faster, more efficient querying and reporting.

