The JobTracker and TaskTracker come into the picture when a data set needs to be processed. In a Hadoop (1.x) system there are five services always running in the background, called the Hadoop daemon services: 1. NameNode, 2. Secondary NameNode, 3. DataNode, 4. JobTracker, 5. TaskTracker.
Services 1, 2 and 3 (the HDFS daemons) talk to each other, and services 4 and 5 (the MapReduce daemons) talk to each other: the NameNode communicates with the DataNodes, and the JobTracker communicates with the TaskTrackers.
Above the file systems comes the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware file system, the JobTracker knows which node contains the data, and which other machines are nearby.
If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns a separate Java Virtual Machine process, so that the TaskTracker itself does not fail if the running job crashes its JVM. A heartbeat is sent from the TaskTracker to the JobTracker periodically to report its status. The JobTracker and TaskTracker status and information are exposed by Jetty and can be viewed from a web browser.
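Since the JobTracker address and its web UI are configuration-driven, the snippet below is a small, hedged sketch of how a client-side configuration might point at a JobTracker in a Hadoop 1.x setup. The host name and port are placeholders, not values from this article; the web UI ports mentioned in the comments are the Hadoop 1.x defaults.

```java
import org.apache.hadoop.mapred.JobConf;

public class JobTrackerAddressExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();

        // Hypothetical JobTracker address; the default value of this property is
        // "local", which runs jobs in a single local JVM instead of on a cluster.
        conf.set("mapred.job.tracker", "jobtracker.example.com:8021");

        // In Hadoop 1.x the JobTracker's Jetty web UI listens on port 50030 by
        // default, and each TaskTracker's status page on port 50060.
        System.out.println("JobTracker: " + conf.get("mapred.job.tracker"));
    }
}
```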
If the JobTracker failed on Hadoop 0.20 or earlier, all ongoing work was lost. Hadoop version 0.21 added some checkpointing to this process; the JobTracker records what it is up to in the file system. When a JobTracker starts up, it looks for any such data, so that it can restart work from where it left off.
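To make the client-side submission path concrete, here is a minimal driver sketch using the classic `org.apache.hadoop.mapred` API from the Hadoop 0.20/1.x era. `JobClient.runJob` submits the configured job to the JobTracker and blocks until it completes. The `WordCountMapper` and `WordCountReducer` classes are hypothetical names used for illustration; they are sketched after the workflow below.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");

        // Key/value types produced by the reducers (and, here, by the mappers too).
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);

        // Input and output locations in HDFS, taken from the command line.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Submits the job to the JobTracker and waits for it to finish.
        JobClient.runJob(conf);
    }
}
```

Note that the output directory must not already exist in HDFS, otherwise the job submission fails.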
At a high level, a job runs through the following steps:
1. The user copies all input files to the distributed file system (HDFS), using the NameNode's metadata.
2. The user submits the job to the client, which will work on the input files stored in the DataNodes.
3. The client gets information about the input files to be processed from the NameNode.
4. The client creates input splits of all the files for the job.
5. After splitting the files, the client stores the metadata about this job in the DFS.
6. The client then submits the job to the JobTracker.
7. The JobTracker now comes into the picture and initializes the job in its job queue.
8. The JobTracker reads the job files that the client wrote to the DFS.
9. The JobTracker now creates the map and reduce tasks for the job, and the input splits are assigned to the mappers. There are as many mappers as there are input splits; each map task works on an individual split and produces its own output.
10. The TaskTrackers now come into the picture: the JobTracker hands tasks to the TaskTrackers and receives a heartbeat from every TaskTracker to confirm that it is working properly. Each TaskTracker sends this heartbeat to the JobTracker every 3 seconds. If a TaskTracker stops sending heartbeats, the JobTracker waits for a further 30 seconds, after which it considers that TaskTracker dead and updates its metadata about it.
11. The JobTracker picks tasks from the splits.
12. The JobTracker assigns those tasks to the TaskTrackers.
Finally, all TaskTrackers produce their map outputs, which are fed to the reduce tasks; the number of output files matches the number of reducers. Once all reducers have finished, we get the final output.
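To tie the workflow together, here is a minimal word-count sketch of a mapper and a reducer using the same old `org.apache.hadoop.mapred` API. One mapper instance runs per input split (step 9), and the reducers aggregate the mappers' intermediate output into the final result. The class names match the hypothetical driver shown earlier; each class would normally live in its own source file.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Runs once per record of its input split and emits (word, 1) pairs.
class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE);
        }
    }
}

// Called once per distinct word; sums the counts emitted by all mappers.
class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
```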