2. HDFS Architecture

1. HDFS Architecture Overview

Master-Slave Architecture:

  • Master Node (NameNode):
    • Stores metadata (e.g., block locations, file permissions).
    • Acts as a central point to manage the file system namespace and regulates access to files.
  • Slave Nodes (DataNodes):
    • Store the actual data in blocks.
    • Periodically send heartbeat signals and block reports to the NameNode.
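The heartbeat mechanism above can be sketched as a minimal Python model (the class name and timeout value are invented for illustration, not Hadoop APIs): the NameNode records the last heartbeat from each DataNode and treats a node as dead once it misses the timeout window.

```python
# Minimal heartbeat-tracking sketch (hypothetical names, not Hadoop's classes).
# The NameNode considers a DataNode dead once no heartbeat has arrived
# within the timeout window.
class HeartbeatMonitor:
    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen: dict[str, float] = {}  # DataNode id -> last heartbeat time

    def heartbeat(self, datanode_id: str, now: float) -> None:
        self.last_seen[datanode_id] = now

    def live_nodes(self, now: float) -> set[str]:
        return {dn for dn, t in self.last_seen.items() if now - t <= self.timeout_s}

mon = HeartbeatMonitor(timeout_s=10)
mon.heartbeat("dn1", now=0)
mon.heartbeat("dn2", now=5)
print(mon.live_nodes(now=12))  # {'dn2'} -- dn1 missed the window and is treated as dead
```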

File Storage in HDFS:

  • Files are split into fixed-size blocks (default: 128 MB since Hadoop 2.x; the older default was 64 MB; the size is configurable per cluster or per file).
  • These blocks are stored across multiple DataNodes.

Example:
A 500 MB file splits into:

  • Block 1: 128 MB
  • Block 2: 128 MB
  • Block 3: 128 MB
  • Block 4: 116 MB
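The split above can be checked with a small sketch (the function name is made up for illustration, not part of HDFS):

```python
# Sketch (not HDFS source): split a file size into HDFS-style fixed-size blocks.
BLOCK_SIZE_MB = 128  # HDFS default block size

def split_into_blocks(file_size_mb: int, block_size_mb: int = BLOCK_SIZE_MB) -> list[int]:
    """Return the sizes of the blocks a file of the given size is split into."""
    full, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([remainder] if remainder else [])

print(split_into_blocks(500))  # [128, 128, 128, 116]
```

The last block holds only the leftover bytes; HDFS does not pad it to the full block size.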

2. Metadata and Block Management

  • Metadata in NameNode:

    • Information like file paths, permissions, and block locations is stored in memory.
    • This allows the NameNode to respond quickly to client requests.
  • Replication Factor:

    • Ensures fault tolerance by storing multiple copies of each block across different DataNodes (default: 3 replicas).
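The in-memory metadata described above can be pictured as two lookup tables; the sketch below is a toy model with invented paths and node names, not Hadoop's actual data structures:

```python
# Toy model of NameNode metadata: file path -> block list, block -> replica nodes.
namespace = {
    "/user/akash/file1.txt": ["blk_1", "blk_2"],
}
block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],  # replication factor 3
    "blk_2": ["dn2", "dn3", "dn4"],
}

def locate(path: str) -> list[list[str]]:
    """Answer a client lookup: which DataNodes hold each block of the file."""
    return [block_locations[b] for b in namespace[path]]

print(locate("/user/akash/file1.txt"))  # [['dn1', 'dn2', 'dn3'], ['dn2', 'dn3', 'dn4']]
```

Because both tables live in RAM, lookups are fast, but every extra block consumes NameNode memory.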

3. NameNode Federation

Introduced to handle scalability challenges in HDFS:

  • Why Needed?
    • As metadata grows, a single NameNode becomes a bottleneck.
    • In large clusters, one NameNode may not handle metadata for all files efficiently.
  • Solution:
    • Federation divides metadata among multiple NameNodes.
    • Each NameNode manages a subset of the namespace, providing better performance and scalability.
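The namespace partitioning can be sketched as a mount table that routes each path to the NameNode owning it, similar in spirit to ViewFs mount tables (the mount points and NameNode names below are hypothetical):

```python
# Federation routing sketch (mount points invented): each NameNode owns a
# disjoint part of the namespace.
MOUNT_TABLE = {
    "/user": "namenode-1",
    "/data": "namenode-2",
    "/tmp": "namenode-3",
}

def route(path: str) -> str:
    """Pick the NameNode responsible for a path by longest matching mount point."""
    matches = [m for m in MOUNT_TABLE if path == m or path.startswith(m + "/")]
    if not matches:
        raise KeyError(f"no mount point for {path}")
    return MOUNT_TABLE[max(matches, key=len)]

print(route("/user/akash/file1.txt"))  # namenode-1
```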

4. Fault Tolerance Mechanisms

DataNode Failure:

  • If a DataNode fails, the blocks it stored can be retrieved from replica nodes.
  • Replication ensures data availability despite hardware failures.

NameNode Failure:

  • The NameNode keeps the namespace and block map in RAM (persisting the FSImage and edit log to disk), so its failure makes the file system unavailable until it is restored.
  • Secondary NameNode:
    • Periodically merges the edit log with the file system image (FSImage) so the NameNode can restart quickly from a recent checkpoint.
    • Despite its name, it is not a real-time replacement for the primary NameNode; automatic failover requires HDFS High Availability with a Standby NameNode.
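The checkpoint merge can be sketched with toy structures (the FSImage and edit-log formats below are invented simplifications, not Hadoop's on-disk formats):

```python
# Simplified checkpoint sketch: apply the edit log to the last FSImage so the
# NameNode can restart from a recent snapshot instead of replaying all edits.
fsimage = {"/a.txt": 1, "/b.txt": 2}  # path -> number of blocks
edit_log = [("create", "/c.txt", 3), ("delete", "/a.txt", None)]

def checkpoint(image: dict, edits: list) -> dict:
    merged = dict(image)
    for op, path, blocks in edits:
        if op == "create":
            merged[path] = blocks
        elif op == "delete":
            merged.pop(path, None)
    return merged

print(checkpoint(fsimage, edit_log))  # {'/b.txt': 2, '/c.txt': 3}
```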

Rack Awareness:

  • Ensures replicas are stored across multiple racks (physical groups of machines in the data center, not geographical regions):
    • Prevents data loss when an entire rack fails (e.g., a rack switch or power failure).
    • Default strategy: place one replica on the writer's rack and the other two replicas on two different nodes of a second rack.
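The default placement can be sketched as follows (the cluster layout and function are hypothetical, not Hadoop's placement policy code):

```python
# Rack-awareness sketch: one replica on the writer's rack, the other two on
# two nodes of a single remote rack, mirroring HDFS's default policy.
RACKS = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"], "rack3": ["dn5", "dn6"]}

def place_replicas(local_rack: str) -> list[str]:
    remote_rack = next(r for r in RACKS if r != local_rack)
    return [RACKS[local_rack][0]] + RACKS[remote_rack][:2]

replicas = place_replicas("rack2")
print(replicas)  # first replica on rack2, two more on a remote rack
```

With this layout, losing any single rack still leaves at least one live replica of every block.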

5. Block Size Considerations

  • Smaller Block Size (< 128 MB):

    • Advantages: Increased parallelism as more blocks are processed simultaneously.
    • Drawbacks: More blocks mean more metadata entries, increasing the NameNode's memory load.
  • Larger Block Size (> 128 MB):

    • Advantages: Reduces metadata burden on the NameNode.
    • Drawbacks: Decreases parallelism, potentially affecting performance.
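A back-of-the-envelope calculation makes the trade-off concrete (the helper function is illustrative, not an HDFS API):

```python
import math

# Sketch: how block size changes the number of blocks -- and therefore both
# the available parallelism and the NameNode's metadata load.
def block_count(file_size_mb: int, block_size_mb: int) -> int:
    return math.ceil(file_size_mb / block_size_mb)

one_tb = 1024 * 1024  # 1 TB expressed in MB
print(block_count(one_tb, 64))   # 16384 blocks: more parallelism, more metadata
print(block_count(one_tb, 128))  # 8192 blocks
print(block_count(one_tb, 256))  # 4096 blocks: less metadata, less parallelism
```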

6. HDFS Commands

Here are commonly used HDFS commands with examples:

File Operations:

  • List files/directories in HDFS:
    hadoop fs -ls /
    
  • Create directories:
    hadoop fs -mkdir -p /user/<username>/dir1/dir2
    
  • Copy file from local to HDFS:
    hadoop fs -put /path/to/localfile /path/in/hdfs
    
  • Copy file from HDFS to local:
    hadoop fs -get /path/in/hdfs /path/to/local
    

7. Cloud-Based Data Lakes vs. HDFS

Feature comparison:

  • Type: HDFS is a distributed file system; ADLS Gen2 / Amazon S3 are object storage.
  • Data format: HDFS stores data as blocks; object stores keep objects (ID, value, metadata).
  • Persistence: HDFS is tightly coupled with compute; object storage is independent of compute.
  • Scalability: HDFS is limited by the cluster setup; object storage scales virtually without limit.
  • Cost efficiency: HDFS is costlier (compute + storage scale together); object storage is cheaper (storage decoupled).
  • Accessibility: HDFS data is accessible within a cluster; object storage is accessible globally.

8. Practical Scenarios

Scenario 1: Reading a File

  • Client requests file1.txt.
  • NameNode looks up metadata for block locations.
  • Client fetches the blocks from the respective DataNodes.
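The three steps above can be sketched end to end (all file, block, and node names below are invented for illustration):

```python
# End-to-end read sketch: the client asks the NameNode for block locations,
# then streams each block directly from one of its DataNodes.
block_map = {"file1.txt": [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn3", "dn1"])]}
datanode_storage = {
    "dn1": {"blk_1": b"hello ", "blk_2": b"world"},
    "dn2": {"blk_1": b"hello "},
    "dn3": {"blk_2": b"world"},
}

def read_file(name: str) -> bytes:
    data = b""
    for block_id, replicas in block_map[name]:          # NameNode metadata lookup
        data += datanode_storage[replicas[0]][block_id]  # fetch block from a DataNode
    return data

print(read_file("file1.txt"))  # b'hello world'
```

Note that file data never flows through the NameNode; it only answers the metadata lookup.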

Scenario 2: Fault Tolerance with Replication

  • A block of file1.txt stored in DN1, DN2, and DN3 is still accessible if DN1 fails.
  • The NameNode automatically re-replicates the missing block to maintain the replication factor.
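The re-replication step can be sketched as follows (node and block names are invented; real HDFS copies the block from a surviving replica rather than conjuring it on the target):

```python
REPLICATION_FACTOR = 3

# Re-replication sketch: after a DataNode fails, the NameNode schedules copies
# of under-replicated blocks onto other DataNodes until each block has 3 replicas.
block_replicas = {"blk_1": {"dn1", "dn2", "dn3"}}
all_datanodes = {"dn1", "dn2", "dn3", "dn4", "dn5"}

def handle_failure(failed: str) -> None:
    all_datanodes.discard(failed)
    for block, replicas in block_replicas.items():
        replicas.discard(failed)
        for target in sorted(all_datanodes - replicas):
            if len(replicas) >= REPLICATION_FACTOR:
                break
            replicas.add(target)  # real HDFS copies data from a surviving replica

handle_failure("dn1")
print(block_replicas["blk_1"])  # three replicas again, none of them on dn1
```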

9. Key Points to Remember

  • HDFS ensures fault tolerance via replication.
  • NameNode Federation addresses scalability.
  • Block size impacts parallelism and metadata storage.
  • Racks provide geographical fault tolerance.
  • Cloud data lakes (e.g., ADLS Gen2) decouple storage and compute for better cost efficiency.


Akash

I am working as a Data Engineer
