1. Big Data


What is Big Data?

  • Definition (per IBM): data characterized by the 3Vs, later extended to 5Vs.
  • Key Characteristics (5Vs):
    1. Volume: Data too large to store or process on a single machine (e.g., social media feeds, sensor data).
    2. Variety: Data in different formats:
      • Structured: RDBMS, Tables (e.g., MySQL).
      • Semi-Structured: JSON, XML, CSV.
      • Unstructured: Images, Logs, Videos.
    3. Velocity: The speed at which new data arrives (e.g., Amazon's continuous stream of sales transactions).
    4. Veracity: Quality/correctness of data.
    5. Value: Ability to generate business insights from data.
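To make the "Variety" distinction concrete, here is a minimal Python sketch showing the same kind of order record in a structured form (a fixed-schema CSV row, like a table in MySQL) and a semi-structured form (JSON that carries its own schema and allows nesting). The record values are made up for illustration.

```python
import csv
import io
import json

# Structured: a CSV row with a fixed schema (like a table in an RDBMS).
csv_text = "order_id,customer,amount\n1001,Asha,250.0\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: JSON embeds field names with the data and allows nesting.
json_text = '{"order_id": 1001, "customer": "Asha", "items": [{"sku": "A1", "qty": 2}]}'
doc = json.loads(json_text)

print(rows[0]["customer"])     # Asha
print(doc["items"][0]["qty"])  # 2
```

Unstructured data (images, videos, free-text logs) has no such schema at all, which is why it needs different storage and processing tools.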

Need for New Technology Stack (Big Data):

  • Traditional systems fail due to size and distributed nature of data.

  • System Types:

    • Monolithic: Single high-resource system, limited scalability, vertical scaling (e.g., upgrading CPU/RAM).
    • Distributed: Cluster of systems, horizontal scaling, true scalability.
  • Key Factors for Big Data Systems:

    1. Storage: Requires distributed storage (e.g., HDFS).
    2. Processing: Needs distributed processing.
    3. Scalability: Handles growing demands effectively.
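The idea behind distributed storage and horizontal scaling can be sketched in a few lines of Python: spread records across nodes by hashing a key, so that adding nodes adds capacity. This is a toy illustration only (the node count and records are hypothetical); real systems like HDFS split files into blocks and replicate them rather than hashing individual records.

```python
# Toy sketch of horizontal scaling: spread records across a cluster
# by hashing a key, so adding nodes adds capacity.
NUM_NODES = 3  # hypothetical cluster size

def node_for(key: str, num_nodes: int = NUM_NODES) -> int:
    # Deterministic hash so the same key always lands on the same node.
    return sum(key.encode()) % num_nodes

cluster = {n: [] for n in range(NUM_NODES)}
for record in ["user:1", "user:2", "user:3", "user:4", "user:5", "user:6"]:
    cluster[node_for(record)].append(record)

# Every record lives on exactly one node; no single machine holds everything.
assert sum(len(v) for v in cluster.values()) == 6
```

Scaling out then means raising the node count and redistributing keys, instead of buying a bigger machine (vertical scaling).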

Overview of Hadoop

  • Definition: The first widely adopted framework for Big Data; not a single tool but an ecosystem of tools.

  • Core Components:

    1. HDFS: Distributed storage.
    2. MapReduce: Distributed processing (now largely obsolete due to its complexity and slowness).
    3. YARN: Resource manager for allocating cluster resources.
  • Ecosystem Technologies:

    • Sqoop: Data movement (RDBMS ↔ HDFS).
    • Pig: Data cleaning and transformation using Pig Latin scripts.
    • Hive: SQL-like querying interface.
    • Oozie: Workflow scheduler (e.g., parallel MapReduce jobs).
    • HBase: NoSQL database for fast random access.
  • Challenges:

    • MapReduce is slow and hard to code.
    • Steep learning curve for ecosystem tools.
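To see why the MapReduce model is considered verbose, here is a pure-Python sketch of its three phases (map, shuffle, reduce) applied to a word count. This is only the conceptual model, not the Hadoop API; a real Hadoop job adds Java classes, job configuration, and disk I/O between phases, which is the complexity complained about above.

```python
from collections import defaultdict

def map_phase(lines):
    # Emit (word, 1) pairs, like a Hadoop Mapper.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Group values by key, like the framework's shuffle/sort step.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts per word, like a Hadoop Reducer.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big", "data velocity"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'velocity': 1}
```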

Cloud and Advantages

  • Advantages:

    1. Scalable: Easy resource scaling.
    2. Cost-Effective: Pay-as-you-use (OpEx vs. CapEx).
    3. Agility: Quick setup compared to on-premise systems.
    4. Geo-Distribution: Reduces latency by serving users from geographically closer servers.
    5. Disaster Recovery: Backup ensures availability.
  • Cloud Types:

    • Public: Shared, cost-effective (e.g., AWS, Azure).
    • Private: Dedicated for sensitive data.
    • Hybrid: Combination of public and private for different data types.

Apache Spark

  • Definition: General-purpose, in-memory compute engine.
  • Features:
    • Replaces MapReduce (not Hadoop).
    • 10x-100x faster than MapReduce because intermediate results stay in memory instead of being written to disk between steps.
    • Supports multiple languages: Python (PySpark), Scala, Java, R.
  • Dependencies:
    • Storage: HDFS, S3, ADLS, etc.
    • Resource Manager: YARN, Kubernetes, Mesos.
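Spark's appeal is its concise, chained-transformation style over in-memory data. The following is a toy illustration of that programming model only; `ToyRDD` is a made-up class for this sketch, not the real pyspark API (there you would use a `SparkSession` with RDDs or DataFrames). Compare it with the multi-function MapReduce version above: the same word count collapses into one pipeline.

```python
# Toy illustration of Spark's chained, in-memory transformation style.
# ToyRDD is a made-up class, not the real pyspark API.

class ToyRDD:
    def __init__(self, data):
        self.data = list(data)  # kept in memory, like a cached RDD

    def flat_map(self, f):
        return ToyRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return ToyRDD(f(x) for x in self.data)

    def reduce_by_key(self, f):
        acc = {}
        for key, value in self.data:
            acc[key] = f(acc[key], value) if key in acc else value
        return ToyRDD(acc.items())

    def collect(self):
        return list(self.data)

lines = ToyRDD(["big data big", "data spark"])
counts = (lines
          .flat_map(str.split)
          .map(lambda w: (w, 1))
          .reduce_by_key(lambda a, b: a + b)
          .collect())
print(dict(counts))  # {'big': 2, 'data': 2, 'spark': 1}
```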

Database vs. Data Warehouse vs. Data Lake

  1. Database:

    • For transactional data (OLTP).
    • Handles structured data; typically only recent, operational data.
    • Example: MySQL, Oracle.
  2. Data Warehouse (DWH):

    • For analytical data (OLAP).
    • Stores historical data in structured format.
    • Example: Snowflake, Redshift.
  3. Data Lake:

    • Stores structured, semi-structured, and unstructured data.
    • Example: Azure Data Lake, Amazon S3.
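The OLTP vs. OLAP distinction above can be sketched with Python's built-in sqlite3, used here as a stand-in for both kinds of system (a real setup would pair e.g. MySQL for OLTP with Snowflake or Redshift for OLAP); the sales rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL, year INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [(1, "north", 100.0, 2023), (2, "north", 150.0, 2024), (3, "south", 200.0, 2024)],
)

# OLTP-style: look up or update one recent row (a database workload).
row = conn.execute("SELECT amount FROM sales WHERE id = 2").fetchone()

# OLAP-style: aggregate over the full history (a data-warehouse workload).
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

print(row[0])  # 150.0
print(totals)  # [('north', 250.0), ('south', 200.0)]
```

A data lake sits a level below both: it holds the raw files (including semi-structured and unstructured data) from which such tables are built.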
Akash, Data Engineer
