These notes cover Transactional Systems vs Analytical Systems and give a detailed breakdown of Hive, including its features, optimizations, and use cases.
Transactional Systems vs Analytical Systems
Aspect | Transactional Systems | Analytical Systems |
---|---|---|
Type of Data Handled | Day-to-day transactional data | Historical data |
Operations Performed | Insert, Update, Delete | Read operations to analyze large volumes of data |
Examples | ATM transactions, e-commerce transactions | Analyzing sales campaign data |
Best-Suited Systems | RDBMS (e.g., Oracle, MySQL); Monolithic systems | Data Warehouses (e.g., Teradata); Distributed systems |
Apache Hive Overview
Definition
Hive is an open-source, distributed, fault-tolerant data warehouse designed for analytics at scale. It enables querying large datasets stored in distributed storage using Hive Query Language (HQL), which is similar to SQL.
Key Components of Hive
- Actual Data: Stored in distributed storage like HDFS, Amazon S3, or Azure Data Lake.
- Metadata: Schema information stored in a Metastore Database.
Key Features and Concepts
Why Is Metadata Stored in a Database?
- Limitations of Data Lakes:
- Updates are difficult to perform.
- High throughput but not optimized for low-latency queries.
- Benefits of Metastore Database:
- Frequent updates are possible.
- Faster access to metadata compared to a Data Lake.
Schema on Read vs Schema on Write
Aspect | Schema on Write (RDBMS) | Schema on Read (Hive) |
---|---|---|
Process | Define table structure first; data is validated while being written into the table. | Data is stored as files; table structure is applied during query execution. |
Data Validation | Performed during data insertion. | Performed during data reading. |
Types of Hive Tables
Table Type | Description | Use Case |
---|---|---|
Managed Table | Hive manages both data and metadata. | When Hive is the sole owner of the data. |
External Table | Hive manages only metadata; data remains in external storage. | When other tools access the data; prevents data loss from accidental table drops. |
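To make the difference concrete, here is a minimal sketch (the table name and LOCATION path are hypothetical): dropping an external table removes only its metadata, while dropping a managed table also deletes the underlying files.
-- External table over files that already exist in HDFS (illustrative path)
CREATE EXTERNAL TABLE orders_ext_demo (
  order_id INT,
  order_date STRING,
  customer_id INT,
  order_status STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/retail/orders/';
DROP TABLE orders_ext_demo;  -- removes only the metadata; files under /data/retail/orders/ remain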
Hive Optimizations
1. Table-Level Optimization: Partitioning vs Bucketing
- Partitioning: Divides data into directories (e.g., by order_status).
- Example:
CREATE EXTERNAL TABLE orders_partition_table (
  order_id INT,
  order_date STRING,
  customer_id INT
) PARTITIONED BY (order_status STRING);
- Use Case: Filters like WHERE order_status = 'CLOSED' scan only the relevant partitions.
- Bucketing: Divides data into files (e.g., by order_id) for faster joins.
- Example:
CREATE TABLE orders_bucket_table (
  order_id INT,
  order_date STRING,
  customer_id INT
) CLUSTERED BY (order_id) INTO 4 BUCKETS;
- Combined Use: Partitioning followed by bucketing is possible, but not vice versa.
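A follow-up sketch of loading the partitioned table above with dynamic partitioning (assuming a source such as the orders_external_table defined later in these notes); the two SET properties are the ones commonly needed before a dynamic-partition insert, though defaults vary by Hive version.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
-- Each distinct order_status value ends up in its own partition directory
INSERT INTO TABLE orders_partition_table PARTITION (order_status)
SELECT order_id, order_date, customer_id, order_status
FROM orders_external_table;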
2. Query-Level Optimization: Join Optimizations
- Map-Side Join: Broadcast small tables to all nodes for faster joins.
- Bucket Map Join: Both tables must be bucketed on the join column, and the bucket counts must be multiples of each other.
- Sort-Merge-Bucket (SMB) Join: Requires both tables to be bucketed and sorted on the join column, so matching buckets can be merged directly without extra shuffling.
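For reference, a sketch of the session properties usually associated with these joins; the property names are standard Hive settings, but defaults and exact behavior differ across Hive versions, and the order_items_bucket_table used in the join is hypothetical (assumed CLUSTERED BY (order_id) with a compatible bucket count).
-- Map-side join: broadcast tables smaller than the threshold (in bytes)
SET hive.auto.convert.join = true;
SET hive.mapjoin.smalltable.filesize = 25000000;
-- Bucket map join: both tables bucketed on the join key, bucket counts multiples of each other
SET hive.optimize.bucketmapjoin = true;
-- SMB join: tables bucketed AND sorted on the join key
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.auto.convert.sortmerge.join = true;
SELECT o.order_id, i.product_id
FROM orders_bucket_table o
JOIN order_items_bucket_table i ON o.order_id = i.order_id;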
Common Hive Commands
Database and Table Management
- Create Database:
CREATE DATABASE trendytech_102; USE trendytech_102;
- Create Managed Table:
CREATE TABLE orders_managed_table (
  order_id INT,
  order_date STRING,
  customer_id INT,
  order_status STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
- Create External Table:
CREATE EXTERNAL TABLE orders_external_table (
  order_id INT,
  order_date STRING,
  customer_id INT,
  order_status STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Load Data
- Load Data into a Managed Table:
LOAD DATA INPATH '/path/to/orders.csv' INTO TABLE orders_managed_table;
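LOAD DATA can also target a specific partition of the partitioned table from the optimization section; a small sketch (the path and partition value are illustrative):
LOAD DATA INPATH '/path/to/orders_closed.csv'
INTO TABLE orders_partition_table PARTITION (order_status = 'CLOSED');
-- Use LOAD DATA LOCAL INPATH to load from the local filesystem instead of HDFS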
Hive Execution Engines
- Supported engines: MapReduce, Tez, and Spark.
- Change engine property:
SET hive.execution.engine = spark;
Limitations of Hive
- Not optimized for transactional operations (Insert/Update/Delete).
- Inserts are time-consuming.
- Updates and deletes are supported only with:
- ORC format
- Bucketed tables
- Transactions enabled ('transactional'='true' in TBLPROPERTIES).
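For reference, these are the session/server properties usually cited when enabling ACID in Hive; treat this as a sketch, since the exact set required depends on the Hive version.
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on = true;   -- normally configured on the metastore/HiveServer2 side
SET hive.compactor.worker.threads = 1;    -- normally configured on the metastore/HiveServer2 side
SET hive.enforce.bucketing = true;        -- only needed on older Hive releases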
Example Use Case: Daily Incremental Data Analysis
- Solution: Create an External Table pointing to a folder with daily data files.
- Behavior: New files added to that folder are picked up automatically the next time the table is queried.
Example:
CREATE EXTERNAL TABLE orders_external_table (
order_id INT,
order_date STRING,
customer_id INT,
order_status STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/path/to/daily_orders/';  -- illustrative folder where the daily files land
This second part covers ACID properties in Hive and how they enable transactional capabilities in a data warehousing tool that traditionally does not support in-place modifications (due to the nature of HDFS). Below are key takeaways and additional context to clarify the concepts:
ACID Properties
ACID (Atomicity, Consistency, Isolation, Durability) defines the requirements for reliable transaction management:
- Atomicity: Transactions are all-or-nothing. If one operation fails, the entire transaction rolls back.
- Consistency: Transactions preserve database integrity, ensuring transitions between valid states.
- Isolation: Concurrent transactions do not interfere with one another.
- Durability: Committed changes persist, even after a crash or failure.
Transactional Needs in Hive
Although Hive is not a traditional database, transactional support is required in specific scenarios:
- Slowly Changing Dimensions (SCDs):
- Fact Tables: Frequently updated/inserted transactional data (e.g., orders).
- Dimension Tables: Infrequent updates (e.g., customer records).
- HDFS Constraints: HDFS doesn't natively support in-place updates/deletes. Hive achieves this via Delta Files:
- Changes (insert/update/delete) are written to delta files.
- When querying, Hive merges delta files with base files to present updated data.
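In practice, SCD-style changes against a full-ACID table are often expressed with Hive's MERGE statement (available in newer Hive versions). A minimal sketch, assuming hypothetical customers_dim (an ACID target) and customer_updates tables that share a customer_id key:
MERGE INTO customers_dim AS t
USING customer_updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET customer_city = s.customer_city
WHEN NOT MATCHED THEN
  INSERT VALUES (s.customer_id, s.customer_name, s.customer_city);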
Configurations for Transactional Tables
To enable transactional capabilities in Hive, configure the following:
- TBLPROPERTIES: Use 'transactional'='true' for transactional tables.
- Note: Only managed tables (stored in Hive-controlled directories) and the ORC file format are supported for full ACID tables.
Types of Transactional Tables
Full ACID Transactional Table:
- Supports inserts, updates, and deletes.
- Example: see the sketch below.
- Check with:
DESCRIBE FORMATTED orders_trx1;
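A minimal sketch of how such a table might be declared (the column list mirrors the orders tables above; on older Hive releases the bucketing clause is mandatory, on Hive 3 it is optional):
CREATE TABLE orders_trx1 (
  order_id INT,
  order_date STRING,
  customer_id INT,
  order_status STRING
) CLUSTERED BY (order_id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
-- Once created, the table accepts row-level changes:
UPDATE orders_trx1 SET order_status = 'CLOSED' WHERE order_id = 101;
DELETE FROM orders_trx1 WHERE order_status = 'CANCELED';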
Insert-Only ACID Table:
- Supports inserts only; all file formats are supported.
- Example: see the sketch below.
- Check with:
DESCRIBE FORMATTED orders_trx2;
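A corresponding sketch for an insert-only table; note the extra transactional_properties setting and that a non-ORC format such as text is allowed here (assumes Hive 3.x):
CREATE TABLE orders_trx2 (
  order_id INT,
  order_date STRING,
  customer_id INT,
  order_status STRING
) STORED AS TEXTFILE
TBLPROPERTIES ('transactional' = 'true',
               'transactional_properties' = 'insert_only');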
Key Features and Considerations
Delta Files and Compaction:
- Frequent transactions create many small files, leading to metadata bottlenecks.
- Hive compaction merges small delta files into larger files to improve query performance.
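Compaction normally runs automatically in the background, but it can also be requested manually; a sketch using the orders_trx1 table from the section above:
ALTER TABLE orders_trx1 COMPACT 'minor';  -- merges delta files with each other
ALTER TABLE orders_trx1 COMPACT 'major';  -- rewrites deltas into the base files
SHOW COMPACTIONS;                         -- check compaction progress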
Snapshot Isolation:
- Ensures readers see a consistent snapshot of the table, so queries are not affected by writes that are still in progress.
MSCK Repair:
- Synchronizes the metastore with changes made outside of Hive (e.g., partition directories added directly in HDFS), as shown below:
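The command itself, shown against the partitioned external table from the optimization section (MSCK is mostly used to register partition directories created outside of Hive):
MSCK REPAIR TABLE orders_partition_table;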
Hive-Spark Integration
- Hive Metastore allows seamless integration with Spark.
- Example use case: define a Hive table for schema management and query it with Spark SQL, avoiding the latency of MapReduce-based Hive execution.
This overview highlights how Hive bridges the gap between data warehousing and transactional data needs using ACID compliance and HDFS adaptability.