2. HDFS Architecture

1. HDFS Architecture Overview

Master-Slave Architecture:

  • Master Node (NameNode):
    • Stores metadata (e.g., block locations, file permissions).
    • Acts as a central point to manage the file system namespace and regulates access to files.
  • Slave Nodes (DataNodes):
    • Store the actual data in blocks.
    • Periodically send heartbeat signals and block reports to the NameNode.
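The heartbeat mechanism above can be sketched as a minimal Python model (the class name and timeout value are invented for illustration, not Hadoop APIs): the NameNode records the last heartbeat from each DataNode and treats a node as dead once it misses the timeout window.

```python
# Minimal heartbeat-tracking sketch (hypothetical names, not Hadoop's classes).
# The NameNode considers a DataNode dead once no heartbeat has arrived
# within the timeout window.
class HeartbeatMonitor:
    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen: dict[str, float] = {}  # DataNode id -> last heartbeat time

    def heartbeat(self, datanode_id: str, now: float) -> None:
        self.last_seen[datanode_id] = now

    def live_nodes(self, now: float) -> set[str]:
        return {dn for dn, t in self.last_seen.items() if now - t <= self.timeout_s}

mon = HeartbeatMonitor(timeout_s=10)
mon.heartbeat("dn1", now=0)
mon.heartbeat("dn2", now=5)
print(mon.live_nodes(now=12))  # {'dn2'} -- dn1 missed the window and is treated as dead
```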

File Storage in HDFS:

  • Files are split into fixed-size blocks (default: 128 MB since Hadoop 2.x; the older default was 64 MB; the size is configurable per cluster or per file).
  • These blocks are stored across multiple DataNodes.

Example:
A 500 MB file splits into:

  • Block 1: 128 MB
  • Block 2: 128 MB
  • Block 3: 128 MB
  • Block 4: 116 MB
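The split above can be checked with a small sketch (the function name is made up for illustration, not part of HDFS):

```python
# Sketch (not HDFS source): split a file size into HDFS-style fixed-size blocks.
BLOCK_SIZE_MB = 128  # HDFS default block size

def split_into_blocks(file_size_mb: int, block_size_mb: int = BLOCK_SIZE_MB) -> list[int]:
    """Return the sizes of the blocks a file of the given size is split into."""
    full, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([remainder] if remainder else [])

print(split_into_blocks(500))  # [128, 128, 128, 116]
```

The last block holds only the leftover bytes; HDFS does not pad it to the full block size.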

2. Metadata and Block Management

  • Metadata in NameNode:

    • Information like file paths, permissions, and block locations is stored in memory.
    • This allows the NameNode to respond quickly to client requests.
  • Replication Factor:

    • Ensures fault tolerance by storing multiple copies of each block across different DataNodes (default: 3 replicas).
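The in-memory metadata described above can be pictured as two lookup tables; the sketch below is a toy model with invented paths and node names, not Hadoop's actual data structures:

```python
# Toy model of NameNode metadata: file path -> block list, block -> replica nodes.
namespace = {
    "/user/akash/file1.txt": ["blk_1", "blk_2"],
}
block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],  # replication factor 3
    "blk_2": ["dn2", "dn3", "dn4"],
}

def locate(path: str) -> list[list[str]]:
    """Answer a client lookup: which DataNodes hold each block of the file."""
    return [block_locations[b] for b in namespace[path]]

print(locate("/user/akash/file1.txt"))  # [['dn1', 'dn2', 'dn3'], ['dn2', 'dn3', 'dn4']]
```

Because both tables live in RAM, lookups are fast, but every extra block consumes NameNode memory.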

3. NameNode Federation

Introduced to handle scalability challenges in HDFS:

  • Why Needed?
    • As metadata grows, a single NameNode becomes a bottleneck.
    • In large clusters, one NameNode may not handle metadata for all files efficiently.
  • Solution:
    • Federation divides metadata among multiple NameNodes.
    • Each NameNode manages a subset of the namespace, providing better performance and scalability.
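The namespace partitioning can be sketched as a mount table that routes each path to the NameNode owning it, similar in spirit to ViewFs mount tables (the mount points and NameNode names below are hypothetical):

```python
# Federation routing sketch (mount points invented): each NameNode owns a
# disjoint part of the namespace.
MOUNT_TABLE = {
    "/user": "namenode-1",
    "/data": "namenode-2",
    "/tmp": "namenode-3",
}

def route(path: str) -> str:
    """Pick the NameNode responsible for a path by longest matching mount point."""
    matches = [m for m in MOUNT_TABLE if path == m or path.startswith(m + "/")]
    if not matches:
        raise KeyError(f"no mount point for {path}")
    return MOUNT_TABLE[max(matches, key=len)]

print(route("/user/akash/file1.txt"))  # namenode-1
```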

4. Fault Tolerance Mechanisms

DataNode Failure:

  • If a DataNode fails, the blocks it stored can be retrieved from replica nodes.
  • Replication ensures data availability despite hardware failures.

NameNode Failure:

  • The NameNode keeps the namespace and block map in RAM (persisting the FSImage and edit log to disk), so its failure makes the file system unavailable until it is restored.
  • Secondary NameNode:
    • Periodically merges the edit log with the file system image (FSImage) so the NameNode can restart quickly from a recent checkpoint.
    • Despite its name, it is not a real-time replacement for the primary NameNode; automatic failover requires HDFS High Availability with a Standby NameNode.
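The checkpoint merge can be sketched with toy structures (the FSImage and edit-log formats below are invented simplifications, not Hadoop's on-disk formats):

```python
# Simplified checkpoint sketch: apply the edit log to the last FSImage so the
# NameNode can restart from a recent snapshot instead of replaying all edits.
fsimage = {"/a.txt": 1, "/b.txt": 2}  # path -> number of blocks
edit_log = [("create", "/c.txt", 3), ("delete", "/a.txt", None)]

def checkpoint(image: dict, edits: list) -> dict:
    merged = dict(image)
    for op, path, blocks in edits:
        if op == "create":
            merged[path] = blocks
        elif op == "delete":
            merged.pop(path, None)
    return merged

print(checkpoint(fsimage, edit_log))  # {'/b.txt': 2, '/c.txt': 3}
```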

Rack Awareness:

  • Ensures replicas are stored across multiple racks (physical groups of machines in the data center, not geographical regions):
    • Prevents data loss when an entire rack fails (e.g., a rack switch or power failure).
    • Default strategy: place one replica on the writer's rack and the other two replicas on two different nodes of a second rack.
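The default placement can be sketched as follows (the cluster layout and function are hypothetical, not Hadoop's placement policy code):

```python
# Rack-awareness sketch: one replica on the writer's rack, the other two on
# two nodes of a single remote rack, mirroring HDFS's default policy.
RACKS = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"], "rack3": ["dn5", "dn6"]}

def place_replicas(local_rack: str) -> list[str]:
    remote_rack = next(r for r in RACKS if r != local_rack)
    return [RACKS[local_rack][0]] + RACKS[remote_rack][:2]

replicas = place_replicas("rack2")
print(replicas)  # first replica on rack2, two more on a remote rack
```

With this layout, losing any single rack still leaves at least one live replica of every block.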

5. Block Size Considerations

  • Smaller Block Size (< 128 MB):

    • Advantages: Increased parallelism as more blocks are processed simultaneously.
    • Drawbacks: More blocks mean more metadata entries, increasing the NameNode's memory load.
  • Larger Block Size (> 128 MB):

    • Advantages: Reduces metadata burden on the NameNode.
    • Drawbacks: Decreases parallelism, potentially affecting performance.
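A back-of-the-envelope calculation makes the trade-off concrete (the helper function is illustrative, not an HDFS API):

```python
import math

# Sketch: how block size changes the number of blocks -- and therefore both
# the available parallelism and the NameNode's metadata load.
def block_count(file_size_mb: int, block_size_mb: int) -> int:
    return math.ceil(file_size_mb / block_size_mb)

one_tb = 1024 * 1024  # 1 TB expressed in MB
print(block_count(one_tb, 64))   # 16384 blocks: more parallelism, more metadata
print(block_count(one_tb, 128))  # 8192 blocks
print(block_count(one_tb, 256))  # 4096 blocks: less metadata, less parallelism
```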

6. HDFS Commands

Here are commonly used HDFS commands with examples:

File Operations:

  • List files/directories in HDFS:
    hadoop fs -ls /
    
  • Create directories:
    hadoop fs -mkdir -p /user/<username>/dir1/dir2
    
  • Copy file from local to HDFS:
    hadoop fs -put /path/to/localfile /path/in/hdfs
    
  • Copy file from HDFS to local:
    hadoop fs -get /path/in/hdfs /path/to/local
    

7. Cloud-Based Data Lakes vs. HDFS

Feature comparison:

  • Type: HDFS is a distributed file system; ADLS Gen2 / Amazon S3 are object storage.
  • Data format: HDFS stores data as blocks; object stores keep objects (ID, value, metadata).
  • Persistence: HDFS is tightly coupled with compute; object storage is independent of compute.
  • Scalability: HDFS is limited by the cluster setup; object storage scales virtually without limit.
  • Cost efficiency: HDFS is costlier (compute + storage scale together); object storage is cheaper (storage decoupled).
  • Accessibility: HDFS data is accessible within a cluster; object storage is accessible globally.

8. Practical Scenarios

Scenario 1: Reading a File

  • Client requests file1.txt.
  • NameNode looks up metadata for block locations.
  • Client fetches the blocks from the respective DataNodes.
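The three steps above can be sketched end to end (all file, block, and node names below are invented for illustration):

```python
# End-to-end read sketch: the client asks the NameNode for block locations,
# then streams each block directly from one of its DataNodes.
block_map = {"file1.txt": [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn3", "dn1"])]}
datanode_storage = {
    "dn1": {"blk_1": b"hello ", "blk_2": b"world"},
    "dn2": {"blk_1": b"hello "},
    "dn3": {"blk_2": b"world"},
}

def read_file(name: str) -> bytes:
    data = b""
    for block_id, replicas in block_map[name]:          # NameNode metadata lookup
        data += datanode_storage[replicas[0]][block_id]  # fetch block from a DataNode
    return data

print(read_file("file1.txt"))  # b'hello world'
```

Note that file data never flows through the NameNode; it only answers the metadata lookup.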

Scenario 2: Fault Tolerance with Replication

  • A block of file1.txt stored in DN1, DN2, and DN3 is still accessible if DN1 fails.
  • The NameNode automatically re-replicates the missing block to maintain the replication factor.
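The re-replication step can be sketched as follows (node and block names are invented; real HDFS copies the block from a surviving replica rather than conjuring it on the target):

```python
REPLICATION_FACTOR = 3

# Re-replication sketch: after a DataNode fails, the NameNode schedules copies
# of under-replicated blocks onto other DataNodes until each block has 3 replicas.
block_replicas = {"blk_1": {"dn1", "dn2", "dn3"}}
all_datanodes = {"dn1", "dn2", "dn3", "dn4", "dn5"}

def handle_failure(failed: str) -> None:
    all_datanodes.discard(failed)
    for block, replicas in block_replicas.items():
        replicas.discard(failed)
        for target in sorted(all_datanodes - replicas):
            if len(replicas) >= REPLICATION_FACTOR:
                break
            replicas.add(target)  # real HDFS copies data from a surviving replica

handle_failure("dn1")
print(block_replicas["blk_1"])  # three replicas again, none of them on dn1
```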

9. Key Points to Remember

  • HDFS ensures fault tolerance via replication.
  • NameNode Federation addresses scalability.
  • Block size impacts parallelism and metadata storage.
  • Racks provide geographical fault tolerance.
  • Cloud data lakes (e.g., ADLS Gen2) decouple storage and compute for better cost efficiency.


Akash

I am working as a Data Engineer
