Hive

This post covers Transactional Systems vs Analytical Systems and gives a detailed breakdown of Hive, including its features, optimizations, and use cases.


Transactional Systems vs Analytical Systems

| Aspect | Transactional Systems | Analytical Systems |
| --- | --- | --- |
| Type of Data Handled | Day-to-day transactional data | Historical data |
| Operations Performed | Insert, Update, Delete | Read operations to analyze large volumes of data |
| Examples | ATM transactions, e-commerce transactions | Analyzing sales campaign data |
| Best-Suited Systems | RDBMS (e.g., Oracle, MySQL); monolithic systems | Data warehouses (e.g., Teradata); distributed systems |

Apache Hive Overview

Definition

Hive is an open-source, distributed, fault-tolerant data warehouse designed for analytics at scale. It enables querying large datasets stored in distributed storage using Hive Query Language (HQL), which is similar to SQL.

Key Components of Hive

  1. Actual Data: Stored in distributed storage like HDFS, Amazon S3, or Azure Data Lake.
  2. Metadata: Schema information stored in a Metastore Database.

Key Features and Concepts

Why Is Metadata Stored in a Database?

  • Limitations of Data Lakes:
    • Updates are difficult to perform.
    • High throughput but not optimized for low-latency queries.
  • Benefits of Metastore Database:
    • Frequent updates are possible.
    • Faster access to metadata compared to a Data Lake.

Schema on Read vs Schema on Write

| Aspect | Schema on Write (RDBMS) | Schema on Read (Hive) |
| --- | --- | --- |
| Process | Define the table structure first; data is validated while being written into the table. | Data is stored as files; the table structure is applied during query execution. |
| Data Validation | Performed during data insertion. | Performed during data reading. |

Types of Hive Tables

| Table Type | Description | Use Case |
| --- | --- | --- |
| Managed Table | Hive manages both data and metadata. | When Hive is the sole owner of the data. |
| External Table | Hive manages only metadata; data remains in external storage. | When other tools access the data; prevents data loss from accidental table drops. |
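
As a quick illustration of the difference, dropping each table type behaves differently; the table names below are the ones created later under Common Hive Commands:

    DROP TABLE orders_managed_table;   -- removes the metadata AND the underlying data files
    DROP TABLE orders_external_table;  -- removes the metadata only; the files remain in storage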

Hive Optimizations

1. Table-Level Optimization: Partitioning vs Bucketing

  • Partitioning: Divides data into directories (e.g., by order_status).

    • Example:
      CREATE EXTERNAL TABLE orders_partition_table (
          order_id INT, 
          order_date STRING, 
          customer_id INT
      ) PARTITIONED BY (order_status STRING);
      
    • Use Case: Filters like WHERE order_status = 'CLOSED' scan only relevant partitions.
  • Bucketing: Divides data into a fixed number of files by hashing a column (e.g., order_id), for faster joins.

    • Example:
      CREATE TABLE orders_bucket_table (
          order_id INT, 
          order_date STRING, 
          customer_id INT
      ) CLUSTERED BY (order_id) INTO 4 BUCKETS;
      
  • Combined Use: Partitioning followed by bucketing within each partition is possible, but not vice versa (see the sketch below).
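
A minimal sketch of the combined approach (the table name is illustrative): partition by order_status, then bucket each partition's rows by order_id.

    CREATE EXTERNAL TABLE orders_part_bucket_table (
        order_id INT, 
        order_date STRING, 
        customer_id INT
    ) PARTITIONED BY (order_status STRING) 
      CLUSTERED BY (order_id) INTO 4 BUCKETS;

    -- A filter on the partition column prunes directories, while
    -- bucketing on order_id still speeds up joins within each partition.
    SELECT * FROM orders_part_bucket_table WHERE order_status = 'CLOSED';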

2. Query-Level Optimization: Join Optimizations

  • Map-Side Join: Broadcast small tables to all nodes for faster joins.
  • Bucket Map Join: Both tables must be bucketed on the join column, and the bucket count of one table must be a multiple of the other's.
  • Sort-Merge-Bucket (SMB) Join: Both tables must be bucketed and sorted on the join column with the same number of buckets; matching buckets are then merged directly, without building an in-memory hash table.
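
A sketch of the standard Hive properties behind these strategies (these are current Hive setting names, but defaults and exact behavior vary by version):

    SET hive.auto.convert.join=true;                    -- enable map-side (broadcast) joins
    SET hive.mapjoin.smalltable.filesize=25000000;      -- byte threshold for the "small" table
    SET hive.optimize.bucketmapjoin=true;               -- enable bucket map joins
    SET hive.optimize.bucketmapjoin.sortedmerge=true;   -- enable sort-merge-bucket joins
    SET hive.auto.convert.sortmerge.join=true;          -- auto-convert eligible joins to SMB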

Common Hive Commands

Database and Table Management

  • Create Database:
    CREATE DATABASE trendytech_102;
    USE trendytech_102;
    
  • Create Managed Table:
    CREATE TABLE orders_managed_table (
        order_id INT, 
        order_date STRING, 
        customer_id INT, 
        order_status STRING
    ) ROW FORMAT DELIMITED 
      FIELDS TERMINATED BY ',' 
      STORED AS TEXTFILE;
    
  • Create External Table:
    CREATE EXTERNAL TABLE orders_external_table (
        order_id INT, 
        order_date STRING, 
        customer_id INT, 
        order_status STRING
    ) ROW FORMAT DELIMITED 
      FIELDS TERMINATED BY ',' 
      STORED AS TEXTFILE;
    

Load Data

  • Load Data into a Managed Table:
    LOAD DATA INPATH '/path/to/orders.csv' INTO TABLE orders_managed_table;
    
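Note that LOAD DATA INPATH moves the files from their HDFS location into the table's directory. To copy from the local filesystem instead, add the LOCAL keyword (the path is a placeholder):

    LOAD DATA LOCAL INPATH '/path/to/orders.csv' INTO TABLE orders_managed_table;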

Hive Execution Engines

  • Supported engines: MapReduce (mr), Tez (tez), and Spark (spark).
  • Change engine property:
    SET hive.execution.engine = spark;
    

Limitations of Hive

  1. Not optimized for transactional operations (Insert/Update/Delete).
  2. Inserts are time-consuming.
  3. Updates and deletes are supported only with:
    • ORC format
    • Bucketed tables
    • Transactions enabled ('transactional'='true' in TBLPROPERTIES).

Example Use Case: Daily Incremental Data Analysis

  • Solution: Create an External Table pointing to a folder with daily data files.
  • Behavior: Automatically detects new files when querying the table.

Example:

CREATE EXTERNAL TABLE orders_external_table (
    order_id INT,
    order_date STRING,
    customer_id INT,
    order_status STRING
) ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
  STORED AS TEXTFILE
  LOCATION '/path/to/daily_orders/';  -- placeholder: folder that receives the daily files
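
With the LOCATION above, no reload step is needed for incremental data: files dropped into the folder are read on the next query. A minimal check, assuming the placeholder path:

-- After a new day's file lands in /path/to/daily_orders/,
-- the row count reflects it on the very next query:
SELECT COUNT(*) FROM orders_external_table;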

This section covers ACID properties in Hive and how they enable transactional capabilities in a data warehousing tool that traditionally doesn't support in-place modifications (due to the nature of HDFS). Below are the key takeaways and additional context to clarify the concepts:


ACID Properties

ACID (Atomicity, Consistency, Isolation, Durability) defines the requirements for reliable transaction management:

  1. Atomicity: Transactions are all-or-nothing. If one operation fails, the entire transaction rolls back.
  2. Consistency: Transactions preserve database integrity, ensuring transitions between valid states.
  3. Isolation: Concurrent transactions do not interfere with one another.
  4. Durability: Committed changes persist, even after a crash or failure.

Transactional Needs in Hive

Although Hive is not a traditional database, transactional support is required in specific scenarios:

  1. Slowly Changing Dimensions (SCDs):
    • Fact Tables: Frequently updated/inserted transactional data (e.g., orders).
    • Dimension Tables: Infrequent updates (e.g., customer records).
  2. HDFS Constraints: HDFS doesn't natively support in-place updates/deletes. Hive achieves this via Delta Files:
    • Changes (insert/update/delete) are written to delta files.
    • When querying, Hive merges delta files with base files to present updated data.

Configurations for Transactional Tables

To enable transactional capabilities in Hive, configure the following:

SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1;
  • TBLPROPERTIES: Use 'transactional'='true' for transactional tables.
  • Note: Full ACID tables must be managed tables (stored in Hive-controlled directories) in ORC format; insert-only tables relax the file-format requirement, as shown below.

Types of Transactional Tables

  1. Full ACID Transactional Table:

    • Supports inserts, updates, and deletes (see the usage sketch after this list).
    • Example:
      CREATE TABLE orders_trx1 (
          order_id INT, 
          order_date STRING, 
          customer_id INT, 
          order_status STRING
      ) STORED AS ORC 
        TBLPROPERTIES ('transactional'='true');
    • Check with: DESCRIBE FORMATTED orders_trx1;
  2. Insert-Only ACID Table:

    • Supports inserts only; all file formats are supported.
    • Example:
      CREATE TABLE orders_trx2 (
          order_id INT, 
          order_date STRING, 
          customer_id INT, 
          order_status STRING
      ) STORED AS TEXTFILE 
        TBLPROPERTIES ('transactional'='true', 'transactional_properties'='insert_only');
    • Check with: DESCRIBE FORMATTED orders_trx2;
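
A brief usage sketch against the two tables above (the values are illustrative). On the full ACID table, every write produces a new delta directory that readers merge with the base files at query time; on the insert-only table, updates and deletes are rejected:

    INSERT INTO orders_trx1 VALUES (1, '2013-07-25', 11599, 'CLOSED');
    UPDATE orders_trx1 SET order_status = 'COMPLETE' WHERE order_id = 1;
    DELETE FROM orders_trx1 WHERE order_id = 1;

    INSERT INTO orders_trx2 VALUES (2, '2013-07-26', 256, 'PENDING');   -- allowed
    -- UPDATE orders_trx2 SET order_status = 'CLOSED' WHERE order_id = 2;  -- would fail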

Key Features and Considerations

  1. Delta Files and Compaction:

    • Frequent transactions create many small files, leading to metadata bottlenecks.
    • Hive compaction merges small delta files into larger files to improve query performance (see the commands after this list).
  2. Snapshot Isolation:

    • Readers work against a consistent snapshot of the table taken when the query starts, so ongoing writes do not interfere with reads.
  3. MSCK Repair:

    • Syncs Hive's metadata with changes made outside Hive (e.g., partition directories added manually in HDFS):
      spark.sql("MSCK REPAIR TABLE <table_name>")

Hive-Spark Integration

  • Hive Metastore allows seamless integration with Spark.
  • Example Use Case:
    • Create a Hive table for schema management and process data with Spark SQL, avoiding MapReduce delays in Hive queries.
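
For instance, once a table is registered in the Metastore, the same HQL runs unchanged from Spark (via spark.sql(...) or the spark-sql shell) on Spark's engine instead of MapReduce; the database and table names below are from the examples above:

    SELECT order_status, COUNT(*) AS order_count
    FROM trendytech_102.orders_external_table
    GROUP BY order_status;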

This overview highlights how Hive bridges the gap between data warehousing and transactional data needs using ACID compliance and HDFS adaptability. 
