What is Big Data?
- Definition (per IBM): Data characterized by the 3Vs (Volume, Variety, Velocity), commonly extended to 5Vs with Veracity and Value.
- Key Characteristics (5Vs):
- Volume: Data too large to store or process on a single machine (e.g., social media feeds, sensor streams).
- Variety: Data arrives in many formats (see the sketch after this list):
- Structured: RDBMS, Tables (e.g., MySQL).
- Semi-Structured: JSON, XML, CSV.
- Unstructured: Images, Logs, Videos.
- Velocity: The speed at which new data arrives and must be processed (e.g., a stream of Amazon sales transactions).
- Veracity: Quality/correctness of data.
- Value: Ability to generate business insights from data.
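To make the Variety categories concrete, here is a minimal Python sketch; sqlite3 stands in for an RDBMS such as MySQL, and all names and values are illustrative:

```python
import json
import sqlite3

# Structured: fixed schema enforced by an RDBMS (sqlite3 stands in for MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Alice')")
row = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()

# Semi-structured: self-describing, flexible schema (JSON; XML and CSV are similar).
event = json.loads('{"user_id": 1, "tags": ["login", "mobile"]}')

# Unstructured: raw bytes with no schema (images, logs, videos).
blob = bytes([0xFF, 0xD8, 0xFF])  # e.g., the first bytes of a JPEG image

print(row[0], event["tags"], len(blob))
```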
Need for New Technology Stack (Big Data):
Traditional systems fail because the data is too large for one machine and must be stored and processed across many.
System Types:
- Monolithic: A single high-resource system; scales vertically (e.g., upgrading CPU/RAM), which quickly hits hardware limits.
- Distributed: A cluster of systems; scales horizontally by adding nodes, so capacity can keep growing with demand.
Key Factors for Big Data Systems:
- Storage: Requires distributed storage (e.g., HDFS).
- Processing: Needs distributed processing.
- Scalability: Capacity must grow smoothly as data volume and workload grow.
Overview of Hadoop
Definition: The first widely adopted framework for Big Data; not a single tool but an ecosystem of tools.
Core Components:
- HDFS: Distributed storage.
- MapReduce: Distributed, disk-based processing (now largely obsolete due to complexity; see the word-count sketch after this list).
- YARN: Resource manager that allocates cluster resources (CPU, memory) to jobs.
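To illustrate the MapReduce paradigm (and why it feels verbose even for simple tasks), here is the classic word count expressed as map and reduce functions. This is a local Python simulation of map → shuffle/sort → reduce, not production Hadoop code; with Hadoop Streaming, similar scripts can be plugged into a real cluster.

```python
import sys
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: pairs arrive sorted by key, so consecutive
    # identical words can be summed with groupby.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    pairs = sorted(mapper(sys.stdin))  # simulates the shuffle/sort phase locally
    for word, total in reducer(pairs):
        print(f"{word}\t{total}")
```

Running `echo "big data big ideas" | python wordcount.py` (hypothetical file name) prints per-word counts.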
Ecosystem Technologies:
- Sqoop: Data movement (RDBMS ↔ HDFS).
- Pig: High-level scripting (Pig Latin) for data transformation and cleaning.
- Hive: SQL-like querying interface.
- Oozie: Workflow scheduler (e.g., parallel MapReduce jobs).
- HBase: NoSQL database on top of HDFS for fast random reads and writes.
Challenges:
- MapReduce is slow (every stage goes through disk) and hard to code.
- Steep learning curve for ecosystem tools.
Cloud Computing and Its Advantages
Advantages:
- Scalable: Easy resource scaling.
- Cost-Effective: Pay-as-you-use pricing (operational expense, OpEx, instead of upfront capital expense, CapEx).
- Agility: Quick setup compared to on-premise systems.
- Geo-Distribution: Serving users from servers in multiple regions reduces latency.
- Disaster Recovery: Replicated backups keep services available after failures.
Cloud Types:
- Public: Shared, cost-effective (e.g., AWS, Azure).
- Private: Dedicated for sensitive data.
- Hybrid: Combination of public and private for different data types.
Apache Spark
- Definition: General-purpose, in-memory compute engine.
- Features:
- Replaces MapReduce, not Hadoop as a whole (it still commonly runs on HDFS and YARN).
- 10x-100x faster than MapReduce due to in-memory processing.
- Supports multiple languages: Python (PySpark; see the sketch below), Scala, Java, R.
- Dependencies (Spark provides compute only; storage and cluster management are pluggable):
- Storage: HDFS, S3, ADLS, etc.
- Resource Manager: YARN, Kubernetes, Mesos.
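A minimal PySpark sketch of the same word count as above; the `local[*]` master and the input path are assumptions for a local test. On a cluster, YARN/Kubernetes would allocate resources and the input would live on HDFS/S3/ADLS.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()

counts = (
    spark.sparkContext.textFile("input.txt")  # hypothetical input file
    .flatMap(lambda line: line.split())       # one record per word
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)          # sum counts per word
)
print(counts.collect())
spark.stop()
```

Intermediate results stay in memory across stages, which is where the speed-up over disk-based MapReduce comes from.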
Database vs. Data Warehouse vs. Data Lake
Database:
- For transactional data (OLTP).
- Handles structured data; typically holds only recent, operational data.
- Example: MySQL, Oracle.
Data Warehouse (DWH):
- For analytical data (OLAP; contrasted with OLTP in the sketch after this block).
- Stores historical data in structured format.
- Example: Snowflake, Redshift.
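To contrast the two access patterns, a small sqlite3 sketch (illustrative only; a real OLTP system would be MySQL/Oracle, and OLAP would run on a warehouse like Snowflake or Redshift):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL, day TEXT)")

# OLTP (database): many small, fast reads/writes touching single rows.
conn.execute("INSERT INTO orders VALUES (1, 'Alice', 9.99, '2024-01-01')")
conn.execute("INSERT INTO orders VALUES (2, 'Bob', 24.50, '2024-01-01')")
one = conn.execute("SELECT amount FROM orders WHERE id = 2").fetchone()

# OLAP (warehouse): large scans and aggregations over historical data.
totals = conn.execute("SELECT day, SUM(amount) FROM orders GROUP BY day").fetchall()
print(one, totals)
```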
Data Lake:
- Stores structured, semi-structured, and unstructured data in raw form (see the sketch below).
- Example: Azure Data Lake, Amazon S3.
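A typical data-lake pattern, sketched with PySpark under assumed paths and column names: raw semi-structured JSON lands in object storage and is curated into structured Parquet for analytics. The `s3a://` bucket and columns are illustrative, and reading S3 assumes extra setup (e.g., the hadoop-aws package).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-demo").getOrCreate()

# Raw zone: semi-structured JSON as it arrived (hypothetical path).
raw = spark.read.json("s3a://my-lake/raw/events/")

# Curated zone: impose a schema and store as columnar Parquet for analytics.
curated = raw.select("user_id", "event_type", "ts")  # assumed column names
curated.write.mode("overwrite").parquet("s3a://my-lake/curated/events/")
```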
Tags:
Bigdata