1. HDFS Architecture Overview
Master-Slave Architecture:
- Master Node (NameNode):
  - Stores metadata (e.g., block locations, file permissions).
  - Acts as the central point that manages the file system namespace and regulates access to files.
- Slave Nodes (DataNodes):
  - Store the actual data in blocks.
  - Periodically send heartbeat signals and block reports to the NameNode.
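The heartbeat mechanism can be illustrated with a small sketch. This is a simplified model, not NameNode code: the function name `find_dead_datanodes`, the timestamps, and the timeout value are all assumptions chosen for illustration.

```python
def find_dead_datanodes(last_heartbeat, now, timeout_seconds):
    """Return the DataNodes whose last heartbeat is older than the timeout.

    last_heartbeat maps a DataNode name to the time (in seconds) at which
    its most recent heartbeat arrived. All values here are illustrative.
    """
    return sorted(
        node
        for node, seen_at in last_heartbeat.items()
        if now - seen_at > timeout_seconds
    )


# Hypothetical cluster state: DN1 has been silent for 900 seconds.
last_heartbeat = {"DN1": 100.0, "DN2": 995.0, "DN3": 998.0}
print(find_dead_datanodes(last_heartbeat, now=1000.0, timeout_seconds=630.0))
# ['DN1']
```

Once a DataNode is considered dead, the NameNode stops directing clients to it and schedules new copies of the blocks it held.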
File Storage in HDFS:
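- Files are split into fixed-size blocks (default: 128 MB in Hadoop 2.x and later, 64 MB in Hadoop 1.x; configurable through the dfs.blocksize property). The last block of a file holds only the remaining data.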
- These blocks are stored across multiple DataNodes.
Example:
A 500 MB file splits into:
- Block 1: 128 MB
- Block 2: 128 MB
- Block 3: 128 MB
- Block 4: 116 MB
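The same arithmetic can be written as a short Python sketch. The helper name `split_into_blocks` is hypothetical; it only reproduces the calculation above and does not call any Hadoop API.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file of this size occupies."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only what is left
    return sizes


print(split_into_blocks(500))  # [128, 128, 128, 116]
```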
2. Metadata and Block Management
- Metadata in NameNode:
  - Information like file paths, permissions, and block locations is stored in memory.
  - This allows the NameNode to respond quickly to client requests.
- Replication Factor:
  - Ensures fault tolerance by storing multiple copies of each block across different DataNodes (default: 3 replicas).
3. NameNode Federation
Introduced to handle scalability challenges in HDFS:
- Why Needed?
  - As metadata grows, a single NameNode becomes a bottleneck.
  - In large clusters, one NameNode may not handle metadata for all files efficiently.
- Solution:
  - Federation divides metadata among multiple NameNodes.
  - Each NameNode manages a subset of the namespace, providing better performance and scalability.
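One way to picture how the namespace is divided is a mount table that maps path prefixes to NameNodes. The sketch below is a conceptual model only: the function `resolve_namenode`, the mount points, and the NameNode names `nn1` to `nn3` are hypothetical, and the longest-prefix rule is an assumption made for the illustration.

```python
def resolve_namenode(path, mount_table):
    """Return the NameNode whose mount point is the longest prefix of path."""
    best = None
    for mount_point, namenode in mount_table.items():
        prefix = mount_point.rstrip("/")
        # Match whole path components only, so /data does not match /database.
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, namenode)
    if best is None:
        raise KeyError(f"no NameNode is responsible for {path}")
    return best[1]


# Hypothetical split of the namespace across three NameNodes.
mount_table = {"/user": "nn1", "/data": "nn2", "/data/logs": "nn3"}
print(resolve_namenode("/user/alice/file1.txt", mount_table))  # nn1
print(resolve_namenode("/data/logs/app.log", mount_table))     # nn3
```

Because each NameNode only keeps the metadata under its own prefixes, adding a NameNode adds metadata capacity without changing how DataNodes store blocks.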
4. Fault Tolerance Mechanisms
DataNode Failure:
- If a DataNode fails, the blocks it stored can be retrieved from replica nodes.
- Replication ensures data availability despite hardware failures.
NameNode Failure:
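- The NameNode serves its metadata from RAM and persists it to disk as the FSImage and edit log. Without a standby NameNode it is a single point of failure, so its failure makes the file system unavailable until it is restarted.
- Secondary NameNode:
  - Performs checkpointing: it periodically merges the edit log with the file system image (FSImage), which keeps the edit log small and speeds up NameNode restarts.
  - Despite its name, it is not a standby or real-time replacement for the primary NameNode.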
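The merge of the edit log into the FSImage can be sketched as replaying logged operations over a snapshot. This is a toy model: the `checkpoint` function, the operation names, and the dictionary layout are assumptions for illustration and do not reflect the real on-disk formats.

```python
def checkpoint(fsimage, edit_log):
    """Apply logged operations to a copy of the image and return the new image.

    fsimage maps a file path to its list of block IDs.
    edit_log is a list of (operation, path, value) tuples.
    """
    new_image = dict(fsimage)
    for op, path, value in edit_log:
        if op == "create":
            new_image[path] = value
        elif op == "delete":
            new_image.pop(path, None)
        elif op == "rename":
            new_image[value] = new_image.pop(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return new_image


fsimage = {"/a.txt": ["blk_1"]}
edit_log = [
    ("create", "/b.txt", ["blk_2", "blk_3"]),
    ("rename", "/a.txt", "/c.txt"),
    ("delete", "/b.txt", None),
]
print(checkpoint(fsimage, edit_log))  # {'/c.txt': ['blk_1']}
```

After a checkpoint the edit log can start empty again, which is why a restart has fewer operations to replay.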
Rack Awareness:
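- Ensures replicas are stored across multiple racks (groups of DataNodes in the same data center that share a network switch):
  - Prevents data loss due to rack failures.
  - Default strategy (replication factor 3): place one replica on a node in the writer's rack and the other two on different nodes in a single remote rack.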
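A minimal sketch of this placement rule follows. It is deterministic for readability (it takes the first eligible remote rack), whereas a real cluster chooses among eligible nodes; the function `place_replicas` and the topology are hypothetical.

```python
def place_replicas(writer_node, topology):
    """Choose three DataNodes: the writer's node plus two nodes on one other rack.

    topology maps a rack name to the list of DataNodes in that rack.
    """
    local_rack = next(
        rack for rack, nodes in topology.items() if writer_node in nodes
    )
    # Simplification: take the first remote rack that has at least two nodes.
    remote_rack = next(
        rack
        for rack, nodes in topology.items()
        if rack != local_rack and len(nodes) >= 2
    )
    second, third = topology[remote_rack][:2]
    return [writer_node, second, third]


topology = {"rack1": ["DN1", "DN2"], "rack2": ["DN3", "DN4"], "rack3": ["DN5"]}
print(place_replicas("DN1", topology))  # ['DN1', 'DN3', 'DN4']
```

With this layout the block survives the loss of either rack, while only one copy has to cross the rack boundary during the write.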
5. Block Size Considerations
- Smaller Block Size (< 128 MB):
  - Advantages: Increased parallelism as more blocks are processed simultaneously.
  - Drawbacks: Overloads the NameNode with metadata entries.
- Larger Block Size (> 128 MB):
  - Advantages: Reduces metadata burden on the NameNode.
  - Drawbacks: Decreases parallelism, potentially affecting performance.
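The trade-off can be made concrete by counting how many blocks (and therefore how many block entries the NameNode must track, and how many units of parallel work exist) a single 1 TB file produces at different block sizes. The helper `block_count` is a hypothetical name used only for this calculation.

```python
def block_count(file_size_mb, block_size_mb):
    """Number of blocks needed for a file, rounding the last block up."""
    return (file_size_mb + block_size_mb - 1) // block_size_mb


one_tb_mb = 1024 * 1024  # 1 TB expressed in MB

for block_size_mb in (64, 128, 256):
    print(f"{block_size_mb} MB blocks: {block_count(one_tb_mb, block_size_mb)}")
# 64 MB blocks: 16384
# 128 MB blocks: 8192
# 256 MB blocks: 4096
```

Halving the block size doubles both the number of blocks available for parallel processing and the number of entries the NameNode has to keep in memory.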
6. HDFS Commands
Here are commonly used HDFS commands with examples:
File Operations:
- List files/directories in HDFS:
hadoop fs -ls /
- Create directories:
hadoop fs -mkdir -p /user/<username>/dir1/dir2
- Copy file from local to HDFS:
hadoop fs -put /path/to/localfile /path/in/hdfs
- Copy file from HDFS to local:
hadoop fs -get /path/in/hdfs /path/to/local
7. Cloud-Based Data Lakes vs. HDFS
Feature | HDFS | ADLS Gen2 / Amazon S3 |
---|---|---|
Type | Distributed File System | Object Storage |
Data Format | Blocks | Objects (ID, Value, Metadata) |
Persistence | Tightly coupled with compute | Independent of compute |
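Scalability | Limited by cluster setup | Virtually unlimited (managed by the provider) |
Cost | Higher (compute and storage scale together) | Lower (storage decoupled from compute) |
Accessibility | Within a cluster | Accessible over the network from anywhere, subject to access control |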
8. Practical Scenarios
Scenario 1: Reading a File
- Client requests file1.txt.
- NameNode looks up metadata for block locations.
- Client fetches the blocks from the respective DataNodes.
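The read path can be modelled in a few lines of Python. This is an in-memory simulation, not the HDFS client API: the `NameNode` class, `read_file`, the block IDs, and the DataNode contents are all hypothetical.

```python
class NameNode:
    """Toy NameNode: knows which DataNodes hold each block, but holds no data."""

    def __init__(self, block_map):
        # file path -> ordered list of (block_id, [DataNode names])
        self.block_map = block_map

    def get_block_locations(self, path):
        return self.block_map[path]


def read_file(path, namenode, datanodes):
    """Ask the NameNode for block locations, then read each block from a DataNode."""
    data = b""
    for block_id, locations in namenode.get_block_locations(path):
        for name in locations:
            store = datanodes.get(name)  # None models an unreachable DataNode
            if store is not None and block_id in store:
                data += store[block_id]
                break
        else:
            raise IOError(f"no live replica for {block_id}")
    return data


datanodes = {
    "DN1": {"blk_1": b"hello "},
    "DN2": {"blk_1": b"hello ", "blk_2": b"world"},
}
namenode = NameNode({"/file1.txt": [("blk_1", ["DN1", "DN2"]), ("blk_2", ["DN2"])]})
print(read_file("/file1.txt", namenode, datanodes))  # b'hello world'
```

Note that file contents never pass through the NameNode; it only answers the metadata lookup.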
Scenario 2: Fault Tolerance with Replication
- A block of file1.txt stored in DN1, DN2, and DN3 is still accessible if DN1 fails.
- NameNode automatically re-replicates the missing block to maintain the replication factor.
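Re-replication after a failure can be sketched as follows. The function `handle_datanode_failure`, the block ID, and the choice of the alphabetically first spare node are assumptions for illustration; an actual cluster also weighs rack placement and node load.

```python
def handle_datanode_failure(failed_node, block_locations, live_nodes,
                            replication_factor=3):
    """Remove the failed node from every block and restore the replica count.

    block_locations maps a block ID to the set of DataNodes holding it.
    """
    for holders in block_locations.values():
        holders.discard(failed_node)
        # Nodes that are alive and do not already hold this block.
        candidates = sorted(
            node for node in live_nodes
            if node != failed_node and node not in holders
        )
        while len(holders) < replication_factor and candidates:
            holders.add(candidates.pop(0))
    return block_locations


block_locations = {"blk_1": {"DN1", "DN2", "DN3"}}
live_nodes = {"DN2", "DN3", "DN4"}
handle_datanode_failure("DN1", block_locations, live_nodes)
print(sorted(block_locations["blk_1"]))  # ['DN2', 'DN3', 'DN4']
```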
9. Key Points to Remember
- HDFS ensures fault tolerance via replication.
- NameNode Federation addresses scalability.
- Block size impacts parallelism and metadata storage.
- Rack awareness protects data against the failure of an entire rack.
- Cloud data lakes (e.g., ADLS Gen2) decouple storage and compute for better cost efficiency.