What is Big Data?
- Definition (per IBM): Data characterized by the 3Vs (Volume, Variety, Velocity), commonly extended to 5Vs with Veracity and Value.
- Key Characteristics (5Vs):
- Volume: Data too large to store or process on a single machine (e.g., social media feeds, sensor streams).
- Variety: Data arrives in many formats (see the sketch after this list):
- Structured: RDBMS, Tables (e.g., MySQL).
- Semi-Structured: JSON, XML, CSV.
- Unstructured: Images, Logs, Videos.
- Velocity: The speed at which new data arrives and must be processed (e.g., a stream of Amazon sales transactions).
- Veracity: Quality/correctness of data.
- Value: Ability to generate business insights from data.
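To make the Variety categories concrete, here is a minimal Python sketch; sqlite3 stands in for an RDBMS such as MySQL, and all names and values are illustrative:

```python
import json
import sqlite3

# Structured: fixed schema enforced by an RDBMS (sqlite3 stands in for MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Alice')")
row = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()

# Semi-structured: self-describing, flexible schema (JSON; XML and CSV are similar).
event = json.loads('{"user_id": 1, "tags": ["login", "mobile"]}')

# Unstructured: raw bytes with no schema (images, logs, videos).
blob = bytes([0xFF, 0xD8, 0xFF])  # e.g., the first bytes of a JPEG image

print(row[0], event["tags"], len(blob))
```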
Need for New Technology Stack (Big Data):
Traditional systems fail because the data is too large for one machine and must be stored and processed across many.
System Types:
- Monolithic: A single high-resource system; scales vertically (e.g., upgrading CPU/RAM), which quickly hits hardware limits.
- Distributed: A cluster of systems; scales horizontally by adding nodes, so capacity can keep growing with demand.
Key Factors for Big Data Systems:
- Storage: Requires distributed storage (e.g., HDFS).
- Processing: Needs distributed processing.
- Scalability: Capacity must grow smoothly as data volume and workload grow.
Overview of Hadoop
Definition: The first widely adopted framework for Big Data; not a single tool but an ecosystem of tools.
Core Components:
- HDFS: Distributed storage.
- MapReduce: Distributed, disk-based processing (now largely obsolete due to complexity; see the word-count sketch after this list).
- YARN: Resource manager that allocates cluster resources (CPU, memory) to jobs.
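To illustrate the MapReduce paradigm (and why it feels verbose even for simple tasks), here is the classic word count expressed as map and reduce functions. This is a local Python simulation of map → shuffle/sort → reduce, not production Hadoop code; with Hadoop Streaming, similar scripts can be plugged into a real cluster.

```python
import sys
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: pairs arrive sorted by key, so consecutive
    # identical words can be summed with groupby.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    pairs = sorted(mapper(sys.stdin))  # simulates the shuffle/sort phase locally
    for word, total in reducer(pairs):
        print(f"{word}\t{total}")
```

Running `echo "big data big ideas" | python wordcount.py` (hypothetical file name) prints per-word counts.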
Ecosystem Technologies:
- Sqoop: Data movement (RDBMS ↔ HDFS).
- Pig: High-level scripting (Pig Latin) for data transformation and cleaning.
- Hive: SQL-like querying interface.
- Oozie: Workflow scheduler (e.g., parallel MapReduce jobs).
- HBase: NoSQL database on top of HDFS for fast random reads and writes.
Challenges:
- MapReduce is slow (every stage goes through disk) and hard to code.
- Steep learning curve for ecosystem tools.
Cloud Computing and Its Advantages
Advantages:
- Scalable: Easy resource scaling.
- Cost-Effective: Pay-as-you-use pricing (operational expense, OpEx, instead of upfront capital expense, CapEx).
- Agility: Quick setup compared to on-premise systems.
- Geo-Distribution: Serving users from servers in multiple regions reduces latency.
- Disaster Recovery: Replicated backups keep services available after failures.
Cloud Types:
- Public: Shared, cost-effective (e.g., AWS, Azure).
- Private: Dedicated for sensitive data.
- Hybrid: Combination of public and private for different data types.
Apache Spark
- Definition: General-purpose, in-memory compute engine.
- Features:
- Replaces MapReduce, not Hadoop as a whole (it still commonly runs on HDFS and YARN).
- 10x-100x faster than MapReduce due to in-memory processing.
- Supports multiple languages: Python (PySpark; see the sketch below), Scala, Java, R.
- Dependencies (Spark provides compute only; storage and cluster management are pluggable):
- Storage: HDFS, S3, ADLS, etc.
- Resource Manager: YARN, Kubernetes, Mesos.
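A minimal PySpark sketch of the same word count as above; the `local[*]` master and the input path are assumptions for a local test. On a cluster, YARN/Kubernetes would allocate resources and the input would live on HDFS/S3/ADLS.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()

counts = (
    spark.sparkContext.textFile("input.txt")  # hypothetical input file
    .flatMap(lambda line: line.split())       # one record per word
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)          # sum counts per word
)
print(counts.collect())
spark.stop()
```

Intermediate results stay in memory across stages, which is where the speed-up over disk-based MapReduce comes from.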
Database vs. Data Warehouse vs. Data Lake
Database:
- For transactional data (OLTP).
- Handles structured data; typically holds only recent, operational data.
- Example: MySQL, Oracle.
Data Warehouse (DWH):
- For analytical data (OLAP; contrasted with OLTP in the sketch after this block).
- Stores historical data in structured format.
- Example: Snowflake, Redshift.
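To contrast the two access patterns, a small sqlite3 sketch (illustrative only; a real OLTP system would be MySQL/Oracle, and OLAP would run on a warehouse like Snowflake or Redshift):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL, day TEXT)")

# OLTP (database): many small, fast reads/writes touching single rows.
conn.execute("INSERT INTO orders VALUES (1, 'Alice', 9.99, '2024-01-01')")
conn.execute("INSERT INTO orders VALUES (2, 'Bob', 24.50, '2024-01-01')")
one = conn.execute("SELECT amount FROM orders WHERE id = 2").fetchone()

# OLAP (warehouse): large scans and aggregations over historical data.
totals = conn.execute("SELECT day, SUM(amount) FROM orders GROUP BY day").fetchall()
print(one, totals)
```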
Data Lake:
- Stores structured, semi-structured, and unstructured data in raw form (see the sketch below).
- Example: Azure Data Lake, Amazon S3.
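A typical data-lake pattern, sketched with PySpark under assumed paths and column names: raw semi-structured JSON lands in object storage and is curated into structured Parquet for analytics. The `s3a://` bucket and columns are illustrative, and reading S3 assumes extra setup (e.g., the hadoop-aws package).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-demo").getOrCreate()

# Raw zone: semi-structured JSON as it arrived (hypothetical path).
raw = spark.read.json("s3a://my-lake/raw/events/")

# Curated zone: impose a schema and store as columnar Parquet for analytics.
curated = raw.select("user_id", "event_type", "ts")  # assumed column names
curated.write.mode("overwrite").parquet("s3a://my-lake/curated/events/")
```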
Tags:
Bigdata