Hive

This post covers Transactional Systems vs Analytical Systems and gives a detailed breakdown of Hive, including its features, optimizations, and use cases.


Transactional Systems vs Analytical Systems

| Aspect | Transactional Systems | Analytical Systems |
| --- | --- | --- |
| Type of Data Handled | Day-to-day transactional data | Historical data |
| Operations Performed | Insert, Update, Delete | Read operations to analyze large volumes of data |
| Examples | ATM transactions, e-commerce transactions | Analyzing sales campaign data |
| Best-Suited Systems | RDBMS (e.g., Oracle, MySQL); monolithic systems | Data warehouses (e.g., Teradata); distributed systems |

Apache Hive Overview

Definition

Hive is an open-source, distributed, fault-tolerant data warehouse designed for analytics at scale. It enables querying large datasets stored in distributed storage using Hive Query Language (HQL), which is similar to SQL.

Key Components of Hive

  1. Actual Data: Stored in distributed storage like HDFS, Amazon S3, or Azure Data Lake.
  2. Metadata: Schema information stored in a Metastore Database.

Key Features and Concepts

Why Is Metadata Stored in a Database?

  • Limitations of Data Lakes:
    • Updates are difficult to perform.
    • High throughput but not optimized for low-latency queries.
  • Benefits of Metastore Database:
    • Frequent updates are possible.
    • Faster access to metadata compared to a Data Lake.

Schema on Read vs Schema on Write

| Aspect | Schema on Write (RDBMS) | Schema on Read (Hive) |
| --- | --- | --- |
| Process | Define the table structure first; data is validated while being written into the table. | Data is stored as files; the table structure is applied during query execution. |
| Data Validation | Performed during data insertion. | Performed during data reading. |

Types of Hive Tables

| Table Type | Description | Use Case |
| --- | --- | --- |
| Managed Table | Hive manages both data and metadata. | When Hive is the sole owner of the data. |
| External Table | Hive manages only metadata; data remains in external storage. | When other tools access the data; prevents data loss from accidental table drops. |
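
As a quick illustration of the difference, dropping each table type behaves differently; the table names below are the ones created later under Common Hive Commands:

    DROP TABLE orders_managed_table;   -- removes the metadata AND the underlying data files
    DROP TABLE orders_external_table;  -- removes the metadata only; the files remain in storage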

Hive Optimizations

1. Table-Level Optimization: Partitioning vs Bucketing

  • Partitioning: Divides data into directories (e.g., by order_status).

    • Example:
      CREATE EXTERNAL TABLE orders_partition_table (
          order_id INT, 
          order_date STRING, 
          customer_id INT
      ) PARTITIONED BY (order_status STRING);
      
    • Use Case: Filters like WHERE order_status = 'CLOSED' scan only relevant partitions.
  • Bucketing: Divides data into a fixed number of files by hashing a column (e.g., order_id), for faster joins.

    • Example:
      CREATE TABLE orders_bucket_table (
          order_id INT, 
          order_date STRING, 
          customer_id INT
      ) CLUSTERED BY (order_id) INTO 4 BUCKETS;
      
  • Combined Use: Partitioning followed by bucketing within each partition is possible, but not vice versa (see the sketch below).
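
A minimal sketch of the combined approach (the table name is illustrative): partition by order_status, then bucket each partition's rows by order_id.

    CREATE EXTERNAL TABLE orders_part_bucket_table (
        order_id INT, 
        order_date STRING, 
        customer_id INT
    ) PARTITIONED BY (order_status STRING) 
      CLUSTERED BY (order_id) INTO 4 BUCKETS;

    -- A filter on the partition column prunes directories, while
    -- bucketing on order_id still speeds up joins within each partition.
    SELECT * FROM orders_part_bucket_table WHERE order_status = 'CLOSED';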

2. Query-Level Optimization: Join Optimizations

  • Map-Side Join: Broadcast small tables to all nodes for faster joins.
  • Bucket Map Join: Both tables must be bucketed on the join column, and the bucket count of one table must be a multiple of the other's.
  • Sort-Merge-Bucket (SMB) Join: Both tables must be bucketed and sorted on the join column with the same number of buckets; matching buckets are then merged directly, without building an in-memory hash table.
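
A sketch of the standard Hive properties behind these strategies (these are current Hive setting names, but defaults and exact behavior vary by version):

    SET hive.auto.convert.join=true;                    -- enable map-side (broadcast) joins
    SET hive.mapjoin.smalltable.filesize=25000000;      -- byte threshold for the "small" table
    SET hive.optimize.bucketmapjoin=true;               -- enable bucket map joins
    SET hive.optimize.bucketmapjoin.sortedmerge=true;   -- enable sort-merge-bucket joins
    SET hive.auto.convert.sortmerge.join=true;          -- auto-convert eligible joins to SMB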

Common Hive Commands

Database and Table Management

  • Create Database:
    CREATE DATABASE trendytech_102;
    USE trendytech_102;
    
  • Create Managed Table:
    CREATE TABLE orders_managed_table (
        order_id INT, 
        order_date STRING, 
        customer_id INT, 
        order_status STRING
    ) ROW FORMAT DELIMITED 
      FIELDS TERMINATED BY ',' 
      STORED AS TEXTFILE;
    
  • Create External Table:
    CREATE EXTERNAL TABLE orders_external_table (
        order_id INT, 
        order_date STRING, 
        customer_id INT, 
        order_status STRING
    ) ROW FORMAT DELIMITED 
      FIELDS TERMINATED BY ',' 
      STORED AS TEXTFILE;
    

Load Data

  • Load Data into a Managed Table:
    LOAD DATA INPATH '/path/to/orders.csv' INTO TABLE orders_managed_table;
    
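Note that LOAD DATA INPATH moves the files from their HDFS location into the table's directory. To copy from the local filesystem instead, add the LOCAL keyword (the path is a placeholder):

    LOAD DATA LOCAL INPATH '/path/to/orders.csv' INTO TABLE orders_managed_table;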

Hive Execution Engines

  • Supported engines: MapReduce (mr), Tez (tez), and Spark (spark).
  • Change engine property:
    SET hive.execution.engine = spark;
    

Limitations of Hive

  1. Not optimized for transactional operations (Insert/Update/Delete).
  2. Inserts are time-consuming.
  3. Updates and deletes are supported only with:
    • ORC format
    • Bucketed tables
    • Transactions enabled ('transactional'='true' in TBLPROPERTIES).

Example Use Case: Daily Incremental Data Analysis

  • Solution: Create an External Table pointing to a folder with daily data files.
  • Behavior: Automatically detects new files when querying the table.

Example:

CREATE EXTERNAL TABLE orders_external_table (
    order_id INT,
    order_date STRING,
    customer_id INT,
    order_status STRING
) ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
  STORED AS TEXTFILE
  LOCATION '/path/to/daily_orders/';  -- placeholder: folder that receives the daily files
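
With the LOCATION above, no reload step is needed for incremental data: files dropped into the folder are read on the next query. A minimal check, assuming the placeholder path:

-- After a new day's file lands in /path/to/daily_orders/,
-- the row count reflects it on the very next query:
SELECT COUNT(*) FROM orders_external_table;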

This section covers ACID properties in Hive and how they enable transactional capabilities in a data warehousing tool that traditionally doesn't support in-place modifications (due to the nature of HDFS). Below are the key takeaways and additional context to clarify the concepts:


ACID Properties

ACID (Atomicity, Consistency, Isolation, Durability) defines the requirements for reliable transaction management:

  1. Atomicity: Transactions are all-or-nothing. If one operation fails, the entire transaction rolls back.
  2. Consistency: Transactions preserve database integrity, ensuring transitions between valid states.
  3. Isolation: Concurrent transactions do not interfere with one another.
  4. Durability: Committed changes persist, even after a crash or failure.

Transactional Needs in Hive

Although Hive is not a traditional database, transactional support is required in specific scenarios:

  1. Slowly Changing Dimensions (SCDs):
    • Fact Tables: Frequently updated/inserted transactional data (e.g., orders).
    • Dimension Tables: Infrequent updates (e.g., customer records).
  2. HDFS Constraints: HDFS doesn't natively support in-place updates/deletes. Hive achieves this via Delta Files:
    • Changes (insert/update/delete) are written to delta files.
    • When querying, Hive merges delta files with base files to present updated data.

Configurations for Transactional Tables

To enable transactional capabilities in Hive, configure the following:

SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1;
  • TBLPROPERTIES: Use 'transactional'='true' for transactional tables.
  • Note: Full ACID tables must be managed tables (stored in Hive-controlled directories) in ORC format; insert-only tables relax the file-format requirement, as shown below.

Types of Transactional Tables

  1. Full ACID Transactional Table:

    • Supports inserts, updates, and deletes (see the usage sketch after this list).
    • Example:
      CREATE TABLE orders_trx1 (
          order_id INT, 
          order_date STRING, 
          customer_id INT, 
          order_status STRING
      ) STORED AS ORC 
        TBLPROPERTIES ('transactional'='true');
    • Check with: DESCRIBE FORMATTED orders_trx1;
  2. Insert-Only ACID Table:

    • Supports inserts only; all file formats are supported.
    • Example:
      CREATE TABLE orders_trx2 (
          order_id INT, 
          order_date STRING, 
          customer_id INT, 
          order_status STRING
      ) STORED AS TEXTFILE 
        TBLPROPERTIES ('transactional'='true', 'transactional_properties'='insert_only');
    • Check with: DESCRIBE FORMATTED orders_trx2;
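
A brief usage sketch against the two tables above (the values are illustrative). On the full ACID table, every write produces a new delta directory that readers merge with the base files at query time; on the insert-only table, updates and deletes are rejected:

    INSERT INTO orders_trx1 VALUES (1, '2013-07-25', 11599, 'CLOSED');
    UPDATE orders_trx1 SET order_status = 'COMPLETE' WHERE order_id = 1;
    DELETE FROM orders_trx1 WHERE order_id = 1;

    INSERT INTO orders_trx2 VALUES (2, '2013-07-26', 256, 'PENDING');   -- allowed
    -- UPDATE orders_trx2 SET order_status = 'CLOSED' WHERE order_id = 2;  -- would fail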

Key Features and Considerations

  1. Delta Files and Compaction:

    • Frequent transactions create many small files, leading to metadata bottlenecks.
    • Hive compaction merges small delta files into larger files to improve query performance (see the commands after this list).
  2. Snapshot Isolation:

    • Readers work against a consistent snapshot of the table taken when the query starts, so ongoing writes do not interfere with reads.
  3. MSCK Repair:

    • Syncs Hive's metadata with changes made outside Hive (e.g., partition directories added manually in HDFS):
      spark.sql("MSCK REPAIR TABLE <table_name>")

Hive-Spark Integration

  • Hive Metastore allows seamless integration with Spark.
  • Example Use Case:
    • Create a Hive table for schema management and process data with Spark SQL, avoiding MapReduce delays in Hive queries.
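
For instance, once a table is registered in the Metastore, the same HQL runs unchanged from Spark (via spark.sql(...) or the spark-sql shell) on Spark's engine instead of MapReduce; the database and table names below are from the examples above:

    SELECT order_status, COUNT(*) AS order_count
    FROM trendytech_102.orders_external_table
    GROUP BY order_status;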

This overview highlights how Hive bridges the gap between data warehousing and transactional data needs using ACID compliance and HDFS adaptability. 
