Big data refers to collections of large datasets, ranging from terabytes to petabytes, that can be structured, semi-structured, or unstructured. Typical examples are social media sites such as Facebook or Twitter, which hold millions of users' posts, photographs, and so on, or search engine data aggregated from many different sources.
The main challenges of big data are storage, analysis, and cost. Hadoop comes into the picture here to address these challenges.
What is Hadoop?
Apache Hadoop is an open-source framework (written in Java) designed for distributed storage and distributed processing on clusters of commodity servers (low-cost hardware).
Hadoop's main benefits:
Scalable – more nodes can be added to increase cluster capacity
Fault-tolerant – failed tasks are automatically re-executed on healthy nodes
Low cost – Hadoop runs on commodity hardware
Core Hadoop has two main components: a storage part called the Hadoop Distributed File System (HDFS) and a processing part called MapReduce.
HDFS is the primary file system used by Hadoop. It follows a master-slave architecture to store and manage data.
The master runs a daemon called the NameNode, which manages the file system metadata. Each slave runs a daemon called the DataNode, which stores the actual data.
A file stored on HDFS is split into multiple blocks (64 MB or 128 MB by default, configurable) that are distributed across the DataNodes. The NameNode maintains the mapping of blocks to DataNodes, while the DataNodes handle the actual read/write operations. By default, HDFS creates 3 replicas of each block for fault tolerance.
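The block-splitting and replication idea can be sketched in a few lines of plain Python. This is a conceptual model only, not real HDFS code: the function names are made up for illustration, and real HDFS uses rack-aware placement rather than simple round-robin.

```python
# Conceptual sketch of how a NameNode tracks blocks: split a file into
# fixed-size blocks, then assign 3 replicas of each block to DataNodes.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common default
REPLICATION = 3                 # HDFS default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes occupies."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin replica placement (real HDFS is rack-aware)."""
    mapping = {}
    for b in range(num_blocks):
        mapping[b] = [datanodes[(b + r) % len(datanodes)]
                      for r in range(replication)]
    return mapping

# A 300 MB file occupies three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
print([b // (1024 * 1024) for b in blocks])        # [128, 128, 44]
mapping = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(mapping[0])                                  # ['dn1', 'dn2', 'dn3']
```

Note that only the last block is smaller than the block size; unlike a regular file system, a 44 MB final block consumes only 44 MB of disk, not a full 128 MB.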
MapReduce is a framework for writing applications that process large datasets in parallel across multiple nodes. MapReduce also follows a master-slave architecture: the master runs a daemon called the JobTracker, and each slave node runs a daemon called the TaskTracker. The JobTracker handles resource management and the scheduling/tracking of jobs across the slaves. The TaskTrackers actually execute the tasks and report their status back to the JobTracker every few seconds.
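The programming model behind MapReduce can be shown without a Hadoop cluster at all. The sketch below is pure Python, assuming nothing beyond the standard library: a map phase emits (key, value) pairs, a shuffle groups values by key, and a reduce phase aggregates each group. Word counting is the classic example.

```python
# Conceptual sketch of the MapReduce model: map -> shuffle -> reduce.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key (done by the framework on a cluster).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values for each key.
    return {key: sum(values) for key, values in groups.items()}

lines = ["Hadoop stores data", "Hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

On a real cluster the map and reduce functions are the only parts a developer writes; the splitting of input, shuffling between nodes, and restarting of failed tasks are handled by the framework.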
Other modules of the Hadoop framework are Hadoop Common (the shared libraries and scripts needed to start Hadoop) and YARN (a resource manager created by separating the processing engine and the resource-management capabilities of MapReduce as it was implemented in Hadoop 1.0).
The ResourceManager acts as the master and runs on a dedicated host (one per cluster), while a NodeManager runs on every host and is responsible for running MR 2.0 jobs. NodeManagers launch and monitor containers; a container represents a bundle of resources (CPU, memory, etc.).
A client contacts the ResourceManager and asks it to run an ApplicationMaster process. Every application running in Hadoop has its own dedicated ApplicationMaster (which oversees the full life cycle of the application) running in a container. The ApplicationMaster can request additional containers from the ResourceManager and, when the work completes, returns the result to the client.
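The flow above can be modelled as a small simulation. All class and method names here are illustrative, not the real YARN API; the point is only to show who talks to whom: the client asks the ResourceManager for a container to host the ApplicationMaster, and the ApplicationMaster then requests worker containers for its tasks.

```python
# Conceptual sketch of the YARN application flow (illustrative names only).

class NodeManager:
    """Runs on every host; launches and monitors containers."""
    def __init__(self, host):
        self.host = host

    def launch_container(self):
        return f"container on {self.host}"

class ResourceManager:
    """One per cluster; hands out containers to applications."""
    def __init__(self, node_managers):
        self.node_managers = node_managers
        self.next_node = 0

    def allocate_container(self):
        # Toy round-robin choice; the real scheduler is pluggable
        # (capacity, fair, FIFO).
        nm = self.node_managers[self.next_node % len(self.node_managers)]
        self.next_node += 1
        return nm.launch_container()

class ApplicationMaster:
    """Oversees one application's life cycle from inside a container."""
    def __init__(self, rm):
        self.rm = rm

    def run(self, num_tasks):
        # Request one worker container per task from the ResourceManager.
        containers = [self.rm.allocate_container() for _ in range(num_tasks)]
        return f"ran {num_tasks} tasks in {len(containers)} containers"

rm = ResourceManager([NodeManager("host1"), NodeManager("host2")])
am_container = rm.allocate_container()  # client asks RM to start the AM
am = ApplicationMaster(rm)
print(am.run(3))  # ran 3 tasks in 3 containers
```

The key design point this illustrates is that in YARN the per-application master is itself an ordinary container, so scheduling logic no longer lives in a single cluster-wide JobTracker as it did in Hadoop 1.0.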