Dinesh on Java


HDFS Architecture

Hi, in this Hadoop tutorial we will describe the HDFS architecture. HDFS has two main components.

Main Components of HDFS-

NameNode
- master of the system
- maintains and manages the blocks that are present on the DataNodes
- the NameNode is like a “Lamborghini”: a powerful, robust machine, but it is a single point of failure
DataNodes
- slaves which are deployed on each machine and provide the actual storage
- responsible for serving read and write requests from the clients
- DataNodes are like “Ambassador cars”: less powerful than the “Lamborghini”, but they are the actual service providers and run on commodity hardware

HDFS Architecture-

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
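The two mappings described above (file to blocks, block to DataNodes) can be sketched in a few lines of plain Java. This is only a toy model: `ToyNameNode` and its methods are hypothetical names, not Hadoop classes, and the round-robin replica choice is an assumption made for brevity (real HDFS placement is rack-aware).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the NameNode's two mappings; not the real Hadoop classes.
class ToyNameNode {
    // file path -> ordered list of block ids
    private final Map<String, List<Long>> fileToBlocks = new LinkedHashMap<>();
    // block id -> DataNodes holding a replica of that block
    private final Map<Long, List<String>> blockToDataNodes = new LinkedHashMap<>();
    private final List<String> dataNodes;
    private long nextBlockId = 0;

    ToyNameNode(List<String> dataNodes) {
        this.dataNodes = new ArrayList<>(dataNodes);
    }

    // Number of blocks needed for a file of the given length (ceiling division).
    static long blockCount(long fileLength, long blockSize) {
        return (fileLength + blockSize - 1) / blockSize;
    }

    // Registers a file: allocates block ids and picks DataNodes round-robin.
    // Round-robin is an assumption for brevity; real placement is rack-aware.
    List<Long> createFile(String path, long fileLength, long blockSize, int replication) {
        List<Long> blocks = new ArrayList<>();
        long count = blockCount(fileLength, blockSize);
        for (long i = 0; i < count; i++) {
            long id = nextBlockId++;
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication && r < dataNodes.size(); r++) {
                replicas.add(dataNodes.get((int) ((id + r) % dataNodes.size())));
            }
            blockToDataNodes.put(id, replicas);
            blocks.add(id);
        }
        fileToBlocks.put(path, blocks);
        return blocks;
    }

    // Blocks of a file, in order; null if the file is unknown.
    List<Long> blocksOf(String path) {
        return fileToBlocks.get(path);
    }

    // DataNodes that hold a replica of the block; null if the block is unknown.
    List<String> locate(long blockId) {
        return blockToDataNodes.get(blockId);
    }
}
```

For example, with four DataNodes, a 300-byte file and a 128-byte block size, `createFile` allocates three blocks, and `locate` returns three DataNode names for each of them. Note that the sketch stores only metadata: no file contents pass through it.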

Rack - a physical group of machines in the cluster that hosts multiple DataNodes

Client - the application that you use to interact with the NameNode and the DataNodes

The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case.

The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.

NameNode Metadata-

Metadata in memory

The entire metadata is kept in main memory
No demand paging for FS metadata

Types of metadata

List of files
List of blocks for each file
List of DataNodes for each block
File attributes, e.g. access time, replication factor

A transaction log

Records file creations and deletions
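The relationship between the in-memory metadata and the transaction log can be illustrated with a small sketch. Everything here is hypothetical: `ToyNamespace`, the text log format (`CREATE path replication`, `DELETE path`), and the use of the replication factor as the only file attribute are assumptions chosen for illustration, not the real on-disk format. Paths are assumed to contain no spaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy namespace that logs every change before applying it in memory.
class ToyNamespace {
    // path -> replication factor (stand-in for the file attributes)
    final TreeMap<String, Short> files = new TreeMap<>();
    // append-only transaction log, e.g. "CREATE /a 3" or "DELETE /a"
    final List<String> editLog = new ArrayList<>();

    void create(String path, short replication) {
        editLog.add("CREATE " + path + " " + replication);
        files.put(path, replication);
    }

    void delete(String path) {
        editLog.add("DELETE " + path);
        files.remove(path);
    }

    // Rebuilds the in-memory file table by replaying a log from the start.
    // The returned namespace starts with an empty log of its own.
    static ToyNamespace replay(List<String> log) {
        ToyNamespace ns = new ToyNamespace();
        for (String entry : log) {
            String[] parts = entry.split(" ");
            if (parts[0].equals("CREATE")) {
                ns.files.put(parts[1], Short.parseShort(parts[2]));
            } else if (parts[0].equals("DELETE")) {
                ns.files.remove(parts[1]);
            }
        }
        return ns;
    }
}
```

The point of the design is that the in-memory table can always be reconstructed from the log, so lookups never need to touch the disk while changes are still durable.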

Secondary Name Node-

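The Secondary NameNode keeps a backup copy of the NameNode metadata: it periodically merges the NameNode's transaction log into a checkpoint of the namespace. It is not a standby for the NameNode and cannot take over if the NameNode fails; it only stores the NameNode metadata.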
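The periodic checkpoint performed by the Secondary NameNode amounts to folding the pending transaction log entries into a copy of the last saved namespace image. A minimal sketch, assuming a hypothetical text log format (`CREATE path replication`, `DELETE path`) and an image modelled as a map from path to replication factor; `ToyCheckpointer` is not a Hadoop class:

```java
import java.util.List;
import java.util.TreeMap;

// Toy checkpoint: fold the pending edits into a copy of the last image.
class ToyCheckpointer {
    static TreeMap<String, Short> checkpoint(TreeMap<String, Short> image, List<String> edits) {
        // Work on a copy so the previous image stays intact.
        TreeMap<String, Short> merged = new TreeMap<>(image);
        for (String edit : edits) {
            String[] parts = edit.split(" ");
            if (parts[0].equals("CREATE")) {
                merged.put(parts[1], Short.parseShort(parts[2]));
            } else if (parts[0].equals("DELETE")) {
                merged.remove(parts[1]);
            }
        }
        return merged;
    }
}
```

After a checkpoint, the merged image replaces the old one and the log can start again from empty, which keeps the log short and NameNode restarts fast.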

HDFS Client Creates a New File-

When an application reads a file, the HDFS client first asks the NameNode for the list of DataNodes that host replicas of the blocks of the file. The list is sorted by the network topology distance from the client. The client contacts a DataNode directly and requests the transfer of the desired block. When a client writes, it first asks the NameNode to choose DataNodes to host replicas of the first block of the file. The client organizes a pipeline from node-to-node and sends the data. When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block. A new pipeline is organized, and the client sends the further bytes of the file. Choice of DataNodes for each block is likely to be different. The interactions among the client, the NameNode and the DataNodes are illustrated in the following figure.
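Sorting replicas by network topology distance can be sketched as follows. The sketch assumes each node is identified by a hypothetical `/rack/host` string and uses the usual convention of distance 0 for the same host, 2 for the same rack, and 4 for different racks; `ToyTopology` is an illustrative name, not the Hadoop implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ToyTopology {
    // Distance between two "/rack/host" locations:
    // 0 = same host, 2 = same rack, 4 = different racks.
    static int distance(String a, String b) {
        if (a.equals(b)) {
            return 0;
        }
        String rackA = a.substring(0, a.lastIndexOf('/'));
        String rackB = b.substring(0, b.lastIndexOf('/'));
        return rackA.equals(rackB) ? 2 : 4;
    }

    // Returns a copy of the replica list with the closest replica first.
    static List<String> sortByDistance(String client, List<String> replicas) {
        List<String> sorted = new ArrayList<>(replicas);
        sorted.sort(Comparator.comparingInt(r -> distance(client, r)));
        return sorted;
    }
}
```

A client at `/r1/h1` given the replicas `/r2/h5`, `/r1/h2` and `/r1/h1` would try `/r1/h1` first, then `/r1/h2`, then `/r2/h5`.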

Unlike conventional filesystems, HDFS provides an API that exposes the locations of a file's blocks. This allows applications like the MapReduce framework to schedule a task where the data is located, thus improving the read performance. It also allows an application to set the replication factor of a file. By default a file’s replication factor is three. For critical files or files which are accessed very often, having a higher replication factor improves tolerance against faults and increases read bandwidth.
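To show why exposed block locations help a scheduler, here is a minimal sketch of a data-local host choice. In Hadoop itself the locations come from the `FileSystem` API (`getFileBlockLocations`, with `setReplication` for the replication factor); the sketch below does not call Hadoop and simply takes the host lists as input. `ToyScheduler` and `pickHost` are hypothetical names.

```java
import java.util.List;

class ToyScheduler {
    // Picks a host for the task that reads one block: prefer a free host
    // that already stores a replica, otherwise fall back to any free host.
    // Returns null when no host is free.
    static String pickHost(List<String> replicaHosts, List<String> freeHosts) {
        for (String host : freeHosts) {
            if (replicaHosts.contains(host)) {
                return host; // data-local: the block is read from local disk
            }
        }
        return freeHosts.isEmpty() ? null : freeHosts.get(0); // remote read
    }
}
```

With a higher replication factor, `replicaHosts` is longer, so the chance that some free host is data-local goes up. That is one way to see why extra replicas increase read bandwidth.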

Anatomy of File Write by HDFS Client-

1. The HDFS client splits the file data into blocks.
2. The client then contacts the NameNode, and the NameNode tells it which DataNodes to use.
3. The client itself writes the data packets to the DataNodes; the NameNode never writes anything to the DataNodes.
4. The HDFS client writes the data to the DataNodes in a pipeline.
5. Each DataNode returns an ack (acknowledgement) packet back to the HDFS client (a non-posted write; all writes here are asynchronous).
6. The client closes the connection with the DataNodes.
7. The client confirms completion to the NameNode.
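The pipeline and acknowledgement steps above can be simulated in plain Java. This is a simplified, synchronous sketch under stated assumptions: `ToyPipeline` is a hypothetical class, packets are plain strings, and a failed DataNode simply cuts the pipeline (real HDFS rebuilds the pipeline from the remaining nodes).

```java
import java.util.ArrayList;
import java.util.List;

class ToyPipeline {
    // One list of received packets per DataNode in the pipeline.
    final List<List<String>> stores = new ArrayList<>();
    // Index of a DataNode that is down, or -1 if all nodes are healthy.
    int failedNode = -1;

    ToyPipeline(int dataNodes) {
        for (int i = 0; i < dataNodes; i++) {
            stores.add(new ArrayList<String>());
        }
    }

    // Forwards the packet from node to node and returns the number of acks.
    int send(String packet) {
        int acks = 0;
        for (int i = 0; i < stores.size(); i++) {
            if (i == failedNode) {
                break; // the pipeline is cut here; later nodes get nothing
            }
            stores.get(i).add(packet); // node stores the packet and passes it on
            acks++;                    // and acknowledges it
        }
        return acks;
    }

    // Writes all packets; succeeds only if every DataNode acked every packet.
    boolean write(List<String> packets) {
        for (String packet : packets) {
            if (send(packet) != stores.size()) {
                return false;
            }
        }
        return true; // the client may now close and report completion
    }
}
```

The NameNode does not appear in the sketch at all, which mirrors the steps: it only selects the DataNodes and is told about completion, while the data itself flows from the client through the DataNodes.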

Anatomy of File Read by HDFS Client-

1. The user asks the HDFS client to read a file, and the client forwards the request to the NameNode.
2. The NameNode returns the block information: which DataNodes hold the blocks of the file.
3. The client then reads the data directly from the DataNodes.
4. The client reads from the DataNodes directly, not through a pipeline as in a write. Blocks can be fetched from different DataNodes in parallel for fast access, and if one DataNode fails the client can read the same block from another replica.
5. Data is read from every DataNode that holds blocks of the file.
6. When reading is complete, the client closes its connections to the DataNodes.
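The read path, including the fallback to another replica when a DataNode is down, can be sketched as below. For simplicity the sketch reads the blocks one after another rather than in parallel. `ToyReader` and its parameters are hypothetical: block locations, DataNode contents and the set of dead nodes are passed in as plain collections instead of coming from a real NameNode.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

class ToyReader {
    // blockLocations: for each block, in file order, the DataNodes holding a replica.
    // storage: DataNode name -> (block index -> block contents).
    // deadNodes: DataNodes that cannot be reached.
    static String read(List<List<String>> blockLocations,
                       Map<String, Map<Integer, String>> storage,
                       Set<String> deadNodes) {
        StringBuilder file = new StringBuilder();
        for (int block = 0; block < blockLocations.size(); block++) {
            String data = null;
            for (String node : blockLocations.get(block)) {
                if (deadNodes.contains(node)) {
                    continue; // this replica is unreachable, try the next one
                }
                Map<Integer, String> blocksOnNode = storage.get(node);
                if (blocksOnNode != null && blocksOnNode.containsKey(block)) {
                    data = blocksOnNode.get(block);
                    break; // first live replica wins
                }
            }
            if (data == null) {
                throw new IllegalStateException("no live replica for block " + block);
            }
            file.append(data);
        }
        return file.toString();
    }
}
```

As long as at least one replica of every block is on a live DataNode, the file is read completely; only when all replicas of some block are unavailable does the read fail.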

Dinesh Rajput

Dinesh Rajput is the chief editor of a website Dineshonjava, a technical blog dedicated to the Spring and Java technologies. It has a series of articles related to Java technologies. Dinesh has been a Spring enthusiast since 2008 and is a Pivotal Certified Spring Professional, an author of a book Spring 5 Design Pattern, and a blogger. He has more than 10 years of experience with different aspects of Spring and Java design and development. His core expertise lies in the latest version of Spring Framework, Spring Boot, Spring Security, creating REST APIs, Microservice Architecture, Reactive Pattern, Spring AOP, Design Patterns, Struts, Hibernate, Web Services, Spring Batch, Cassandra, MongoDB, and Web Application Design and Architecture. He is currently working as a technology manager at a leading product and web development company. He worked as a developer and tech lead at the Bennett, Coleman & Co. Ltd and was the first developer in his previous company, Paytm. Dinesh is passionate about the latest Java technologies and loves to write technical blogs related to it. He is a very active member of the Java and Spring community on different forums. When it comes to the Spring Framework and Java, Dinesh tops the list!