Detailed explanation of HDFS and MapReduce

M S Dillibabu
11 min read · Mar 2, 2020


Hadoop:

It is a framework used to store and process large datasets in parallel, in a distributed fashion. For storing the datasets we use HDFS (Hadoop Distributed File System), and for processing/retrieving the information stored in HDFS in parallel we use MapReduce.

Now let's dive deep into the HDFS concept.

HDFS (Hadoop Distributed File System):

In HDFS, each file is stored as blocks; a block is the smallest unit of data that the file system stores. From Hadoop 2.0 onwards the default size of these HDFS data blocks is 128 MB, previously it was 64 MB. We can configure the block size as per our requirements by changing the dfs.blocksize property (dfs.block.size in older releases) in hdfs-site.xml.
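Besides editing hdfs-site.xml, the block size can also be overridden from the Java client. The snippet below is a minimal sketch under my own assumptions (that fs.defaultFS points at your cluster, and the path /data/score.txt and the 256 MB value are just examples); files created through this client then use the overridden block size:

```java
// Sketch: override the HDFS block size for files created by this client.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent to the dfs.blocksize property in hdfs-site.xml (value in bytes).
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // 256 MB instead of the 128 MB default
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/data/score.txt"))) { // example path
            out.writeUTF("sample record");
        }
    }
}
```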

Example:

Let us say we want to store a file (e.g., score.txt, 240 MB) in HDFS. First, score.txt is split into two blocks: block-1 stores 128 MB and block-2 stores the remaining 112 MB of score.txt. The 16 MB of unused capacity in block-2 is not wasted, because an HDFS block only occupies as much disk space on the DataNode as the data it actually holds. Each block is distributed across the cluster on the basis of the replication factor. The default replication factor is 3, thus each block is replicated 3 times. The first replica is stored on one DataNode, the second replica is stored on another DataNode within the same rack to minimize cross-rack traffic, and the third is stored on a DataNode in a different rack, ensuring that even if a rack fails the data is not lost. We will talk about rack awareness in a later session.
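As a quick sanity check on those numbers, here is a tiny standalone sketch (plain Java arithmetic, not part of any Hadoop API) that reproduces the block split and the total storage consumed with a replication factor of 3:

```java
// Standalone block arithmetic for the score.txt example (illustrative only).
public class BlockMath {
    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long fileSize = 240 * mb;   // score.txt
        long blockSize = 128 * mb;  // HDFS default block size
        short replication = 3;      // HDFS default replication factor

        long numBlocks = (fileSize + blockSize - 1) / blockSize;     // ceil -> 2 blocks
        long lastBlockSize = fileSize - (numBlocks - 1) * blockSize; // 112 MB
        long totalStored = fileSize * replication;                   // 720 MB across the cluster

        System.out.println("blocks=" + numBlocks
                + ", lastBlock=" + lastBlockSize / mb + " MB"
                + ", totalStoredAcrossCluster=" + totalStored / mb + " MB");
    }
}
```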

Why did HDFS come into existence instead of a normal file system?

In a normal file system, each data block is only about 4 KB, but in HDFS each data block is 128 MB by default. The NameNode stores the metadata of HDFS (details of each and every block). If blocks were only 4 KB, far more block entries would have to be stored in the NameNode, and looking up block details would become a heavy task. For that reason, HDFS was designed with a 128 MB block size, since it has to deal with big data. Installing Hadoop on top of ordinary hardware gives you HDFS, which uses the 128 MB block size automatically. In a normal file system, if a block is not fully filled (i.e., it holds less than 4 KB of data), the remaining free space in that block is not given to other files, so that space is wasted; in HDFS, by contrast, a block only consumes as much disk space as the data it actually holds, so no space is wasted.

Hadoop has five main services:

  1. NameNode
  2. Secondary NameNode
  3. Job Tracker
  4. DataNode
  5. Task Tracker

The first three services (1, 2, 3) are master services and the last two (4, 5) are slave services. Master services can talk to each other, and slave services can talk to each other. NameNode, Secondary NameNode and DataNode (1, 2, 4) are related to HDFS, while Job Tracker and Task Tracker (3, 5) are related to MapReduce.

  1. NameNode:

NameNode is the centerpiece of HDFS and is the master node. It stores only the metadata of HDFS, that is, the directory tree of all files in the file system, and it tracks the files across the cluster. The NameNode does not store the actual data; the data itself is stored in the DataNodes. The NameNode knows the list of blocks and their locations for any given file in HDFS, and with this information it knows how to reconstruct the file from its blocks. The NameNode is so critical that when it is down, the HDFS/Hadoop cluster is inaccessible and considered down; it is a single point of failure in a Hadoop cluster. The NameNode is usually configured with a lot of memory (RAM), because the block locations are held in main memory.
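To make the "NameNode answers metadata queries" idea concrete, here is a minimal sketch using the HDFS Java client (the path /data/score.txt is my own example). The block-location query is answered from the NameNode's in-memory metadata:

```java
// Sketch: ask for the block locations of a file stored in HDFS.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/score.txt")); // example path
        // The NameNode serves this from its in-memory block map.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}
```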

4. DataNode:

DataNode is responsible for storing the actual data in HDFS and is a slave node. The NameNode and DataNodes are in constant communication through heartbeat signals. When a DataNode starts up, it announces itself to the NameNode along with the list of blocks it is responsible for. When a DataNode goes down, it does not affect the availability of data or the cluster: the NameNode arranges re-replication of the blocks that were managed by the unavailable DataNode. A DataNode is usually configured with a lot of hard disk space, because the actual data is stored there.

Here is a sample hardware configuration for the NameNode and DataNode.

NameNode Configuration:

Processors: 2 Quad Core CPUs running @ 2 GHz
RAM: 128 GB
Disk: 6 x 1TB SATA
Network: 10 Gigabit Ethernet

DataNode Configuration:

Processors: 2 Quad Core CPUs running @ 2 GHz
RAM: 64 GB
Disk: 12–24 x 1TB SATA
Network: 10 Gigabit Ethernet

2. Secondary NameNode:

From its name we might assume that the Secondary NameNode is a backup node, but it is not. First, let me give a brief recap of the NameNode.

The NameNode holds the metadata for HDFS, such as block information, file sizes, etc. This information is kept in main memory as well as on disk for persistent storage.

On disk, the information is stored in two different files. They are:

  • Edit log: keeps track of each and every change made to HDFS.
  • Fsimage: stores a snapshot of the file system namespace.

Every change made to HDFS is recorded in the edit log, so that file keeps growing, whereas the size of the fsimage stays the same. This has no impact until we restart the NameNode. When we restart it, the edit log is replayed into the fsimage file, which is then loaded into main memory, and that takes some time. If we restart the cluster after a long time, the downtime will be large because the edit log will have grown considerably. The Secondary NameNode comes into the picture to rescue us from this problem.

The Secondary NameNode periodically fetches the edit log from the NameNode and merges it into the fsimage. The new fsimage is copied back to the NameNode, which uses it on the next restart, reducing the startup time. It is a helper node to the NameNode; to be precise, the Secondary NameNode's whole purpose is to perform checkpointing in HDFS, which helps the NameNode function efficiently. Hence, it is also called the Checkpoint node. The default checkpoint period is 1 hour. Merging the edit log into the fsimage and starting a fresh edit log is what is called a checkpoint.
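The one-hour default comes from the dfs.namenode.checkpoint.period property (expressed in seconds). A minimal sketch, assuming you simply want to see what your client-side configuration resolves it to:

```java
// Sketch: read the checkpoint period used by the Secondary NameNode.
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;

public class CheckpointPeriod {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Default is 3600 seconds (1 hour) if the property is not set anywhere.
        long periodSeconds = conf.getTimeDuration(
                "dfs.namenode.checkpoint.period", 3600, TimeUnit.SECONDS);
        System.out.println("Checkpoint period: " + periodSeconds + " s");
    }
}
```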

JobTracker and TaskTracker are two essential processes involved in MapReduce execution in MRv1 (Hadoop version 1). Both processes are deprecated in MRv2 (Hadoop version 2) and replaced by the ResourceManager, ApplicationMaster and NodeManager daemons.

3. JobTracker:

  1. JobTracker process runs on a separate node and not usually on a DataNode.
  2. JobTracker is an essential Daemon for MapReduce execution in MRv1. It is replaced by ResourceManager/ApplicationMaster in MRv2.
  3. JobTracker receives the requests for MapReduce execution from the client.
  4. JobTracker talks to the NameNode to determine the location of the data.
  5. JobTracker finds the best TaskTracker nodes to execute tasks based on the data locality (proximity of the data) and the available slots to execute a task on a given node.
  6. JobTracker monitors the individual TaskTrackers and submits the overall status of the job back to the client.
  7. JobTracker process is critical to the Hadoop cluster in terms of MapReduce execution.
  8. When the JobTracker is down, HDFS will still be functional but the MapReduce execution can not be started and the existing MapReduce jobs will be halted.

5. TaskTracker:

  1. TaskTracker runs on DataNodes; usually there is one on every DataNode.
  2. TaskTracker is replaced by Node Manager in MRv2.
  3. Mapper and Reducer tasks are executed on DataNodes administered by TaskTrackers.
  4. TaskTrackers will be assigned Mapper and Reducer tasks to execute by JobTracker.
  5. TaskTracker will be in constant communication with the JobTracker signalling the progress of the task in execution.
  6. TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive, JobTracker will assign the task executed by the TaskTracker to another node.

[Figure: read and write mechanism of HDFS]

MapReduce:

In today’s data-driven market, algorithms and applications are collecting data 24/7 about people, processes, systems, and organizations, resulting in huge volumes of data. The challenge, though, is how to process this massive amount of data with speed and efficiency, and without sacrificing meaningful insights.

This is where the MapReduce programming model comes to the rescue. Initially used by Google for analyzing its search results, MapReduce gained massive popularity due to its ability to split and process terabytes of data in parallel, achieving quicker results.

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

MapReduce facilitates concurrent processing by splitting petabytes of data into smaller chunks, and processing them in parallel on Hadoop commodity servers. In the end, it aggregates all the data from multiple servers to return a consolidated output back to the application.

For example, a Hadoop cluster with 20,000 inexpensive commodity servers, each holding a 256 MB block of data, can process around 5 TB of data at the same time (20,000 × 256 MB = 5,120,000 MB, roughly 5 TB). This reduces the processing time compared to the sequential processing of such a large data set.

With MapReduce, rather than sending data to where the application or logic resides, the logic is executed on the server where the data already resides, to expedite processing. Data access and storage is disk-based — the input is usually stored as files containing structured, semi-structured, or unstructured data, and the output is also stored in files.

MapReduce was once the only method through which the data stored in the HDFS could be retrieved, but that is no longer the case. Today, there are other query-based systems such as Hive and Pig that are used to retrieve data from the HDFS using SQL-like statements. However, these usually run along with jobs that are written using the MapReduce model. That’s because MapReduce has unique advantages.

As we know, a MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.

The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs’ component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.

Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract-classes. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.
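To make that concrete, here is a minimal sketch of a job driver using the newer org.apache.hadoop.mapreduce API. The word-count job and the WordCountMapper/WordCountReducer class names are this sketch's own assumptions; sketches for those two classes follow in the Map and Reduce sections below.

```java
// Sketch: a word-count driver that supplies the mapper/reducer and I/O locations.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        // Optional combiner: reuses the reducer because summing counts is
        // commutative and associative (see the Combine and Partition section).
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /data/file.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would package this into a jar and submit it with something like hadoop jar wordcount.jar WordCountDriver /data/file.txt /output/wordcount (the jar name and paths here are placeholders).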

How MapReduce Works

At the crux of MapReduce are two functions: Map and Reduce. They are sequenced one after the other.

  • The Map function takes input from the disk as <key,value> pairs, processes them, and produces another set of intermediate <key,value> pairs as output.
  • The Reduce function also takes inputs as <key,value> pairs, and produces <key,value> pairs as output.

The types of keys and values differ based on the use case. All inputs and outputs are stored in the HDFS. While the map is a mandatory step that filters and transforms the initial data, the reduce function is optional.

<k1, v1> -> Map() -> list(<k2, v2>)
<k2, list(v2)> -> Reduce() -> list(<k3, v3>)

Mappers and Reducers are the Hadoop servers that run the Map and Reduce functions respectively. It doesn’t matter if these are the same or different servers.

Map

The input data is first split into smaller blocks. Each block is then assigned to a mapper for processing.

For example, if a file has 100 records to be processed, 100 mappers can run together to process one record each. Or maybe 50 mappers can run together to process two records each. The Hadoop framework decides how many mappers to use, based on the size of the data to be processed and the memory block available on each mapper server.
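Here is what a mapper can look like in Hadoop's Java API, a minimal word-count sketch of my own (the class name WordCountMapper is an assumption, matching the driver sketch above). It receives (byte offset, line) pairs and emits <word, 1> pairs:

```java
// Sketch: word-count mapper.
// Input key = byte offset of the line (LongWritable), input value = the line (Text);
// output key = a word (Text), output value = the count 1 (IntWritable).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // emit <word, 1> for every word in the line
        }
    }
}
```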

Reduce

After all the mappers complete processing, the framework shuffles and sorts the results before passing them on to the reducers. A reducer cannot start while a mapper is still in progress. All the map output values that have the same key are assigned to a single reducer, which then aggregates the values for that key.
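And the matching reducer sketch (again a word-count assumption, the WordCountReducer referenced by the driver above): after the shuffle and sort, all the counts for one word arrive together as an Iterable and are summed into a single total.

```java
// Sketch: word-count reducer. Sums all the 1s emitted for each word.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result); // emit <word, total count>
    }
}
```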

Combine and Partition

There are two intermediate steps between Map and Reduce.

Combine is an optional process. The combiner is a reducer that runs individually on each mapper server. It reduces the data on each mapper further to a simplified form before passing it downstream.

This makes shuffling and sorting easier as there is less data to work with. Often, the combiner class is set to the reducer class itself, which is valid when the reduce function is commutative and associative (as with summation). However, if needed, the combiner can be a separate class as well.

Partition is the process that assigns each <key, value> pair coming out of the mappers to a particular reducer. It decides how the map output is divided up among the reducers, so that all pairs with the same key end up at the same reducer.

The default partitioner determines the hash value for the key, resulting from the mapper, and assigns a partition based on this hash value. There are as many partitions as there are reducers. So, once the partitioning is complete, the data from each partition is sent to a specific reducer.
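For illustration, here is a partitioner sketch that reproduces that default hash-based behaviour (Hadoop's built-in HashPartitioner works essentially like this; the Text/IntWritable key and value types are assumptions matching the word-count sketches above):

```java
// Sketch: a partitioner that mimics default hash partitioning.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the hash is non-negative, then bucket by reducer count.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

If you actually needed a custom partitioner, it would be wired in with job.setPartitionerClass(HashLikePartitioner.class); by default Hadoop already partitions this way, so this class is purely illustrative.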

Example:

Let us take a file, file.txt, of 500 MB. The default block size in HDFS is 128 MB, so file.txt is divided into four input splits (128 MB, 128 MB, 128 MB, 116 MB) as shown below.

Each of the four input splits is assigned a mapper task (i.e., number of input splits = number of map tasks). We cannot directly set the number of map tasks for a file; it is determined by the number of input splits.

Here the RecordReader reads one line at a time and feeds the mapper class a key-value pair of (byte offset, entire line). The byte offset is the position of the line's first character, counted in bytes from the start of the file; for example, if the first line is "hi how are you\n" (15 bytes including the newline), it gets key 0 and the next line starts at offset 15. How these key-value pairs are generated for the mapper is determined by the input format, and Hadoop provides several, including:

  1. TextInputFormat
  2. KeyValueTextInputFormat
  3. SequenceFileInputFormat
  4. SequenceFileAsTextInputFormat

In this example the input format is TextInputFormat, so the mapper receives (byte offset, line) pairs. The framework then shuffles and sorts the mapper output and passes it to the reducer class, which produces the output file.
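To see how those byte-offset keys advance, here is a tiny standalone sketch (plain Java, not a Hadoop API; the two sample lines are my own):

```java
// Illustrative only: how TextInputFormat-style byte offsets advance line by line.
import java.nio.charset.StandardCharsets;

public class ByteOffsetDemo {
    public static void main(String[] args) {
        String[] lines = { "hi how are you", "i am fine" }; // assumed sample lines
        long offset = 0;
        for (String line : lines) {
            // key = byte offset of the line's first character, value = the line itself
            System.out.println("(" + offset + ", \"" + line + "\")");
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for the '\n'
        }
    }
}
```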

