Big Data – the buzzword that has been making the rounds for the last few years! Big Data refers to data sets so voluminous and complex that traditional data-processing applications become inadequate. Such data needs special provisioning for storage, analysis, sharing, and querying. This is where frameworks such as Apache Hadoop and Apache Spark come to your rescue.

What is Apache Hadoop?

Hadoop is an open-source, Java-based framework that lets you store and process humongous amounts of data in parallel using the MapReduce model. The framework consists of multiple modules:

  • HDFS (Hadoop Distributed File System) – stores your data on commodity machines and makes it accessible across the cluster.
  • YARN – the scheduler and resource manager; it manages the resources in the cluster and schedules the various jobs to be executed on it.
  • MapReduce – an implementation of the MapReduce programming model to process and analyze huge data sets.
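The MapReduce model behind these modules can be illustrated without Hadoop at all. The following is a minimal pure-Python sketch of the classic word-count job – map emits (word, 1) pairs, a shuffle step groups them by key, and reduce sums the counts. It mirrors the flow of a real Hadoop job but is not Hadoop API code:

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in an input line."""
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key (Hadoop does this for you)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return (key, sum(values))

lines = ["Hadoop stores big data", "Hadoop processes big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'hadoop': 2, 'stores': 1, 'big': 2, 'data': 2, 'processes': 1}
```

In a real Hadoop job, the map and reduce functions run on different machines and the shuffle happens over the network; the logic, however, is exactly this.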

All these modules are designed with the base assumption that hardware can fail at any point in time, and hence the framework should recover from hardware failures automatically. Apart from this, Hadoop follows these design principles:

  • The system shall manage and heal itself – fault tolerant and self-healing framework.
  • Scale linearly – additional resources should continue to deliver the same per-node performance without extra overhead.
  • Computation should move to data – push the computation to the nodes holding the data and let it run locally. This reduces latency and bandwidth requirements, resulting in faster processing.
  • Modular and extensible.
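To make the "computation moves to data" principle concrete, here is a hypothetical Python sketch of a locality-aware scheduler: given a map of HDFS blocks to the DataNodes holding their replicas, it assigns each task to a node that already stores the block, falling back to an arbitrary node only when no replica exists. Names like `block_locations` are illustrative, not Hadoop APIs:

```python
def schedule(tasks, block_locations, all_nodes):
    """Assign each task (one per HDFS block) to a node holding a replica of its block.

    block_locations: block_id -> list of nodes storing a replica.
    Falls back to an arbitrary node (a 'remote read') if no replica is known.
    """
    assignments = {}
    for block_id in tasks:
        replicas = block_locations.get(block_id, [])
        # Prefer a data-local node: the block never crosses the network.
        assignments[block_id] = replicas[0] if replicas else all_nodes[0]
    return assignments

blocks = {"b1": ["node1", "node2"], "b2": ["node2", "node3"]}
print(schedule(["b1", "b2", "b3"], blocks, ["node1", "node2", "node3"]))
# b1 -> node1 (local), b2 -> node2 (local), b3 -> node1 (remote fallback)
```

Hadoop's real scheduler is rack-aware and also weighs slot availability, but the preference order – node-local, then remote – is the same idea.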

Apache Hadoop uses a master–slave architecture. In a small cluster there is a single master and multiple slaves. The master server runs the JobTracker, TaskTracker, NameNode, and DataNode daemons, whereas a slave runs only a TaskTracker and a DataNode. In large clusters, the master's data can be replicated to a secondary master for redundancy. MapReduce jobs are submitted to the JobTracker on the master, which distributes tasks to the registered nodes in the cluster in such a fashion that each task can be executed against data available locally on that node. (JobTracker and TaskTracker belong to the classic MRv1 engine; in YARN-based Hadoop 2.x, a ResourceManager and per-node NodeManagers take over these roles.) YARN tracks jobs and their progress and keeps distributing work based on resource availability.
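The redundancy in this architecture can also be sketched numerically: HDFS splits a file into fixed-size blocks, and the NameNode records, for each block, which DataNodes hold its replicas. Below is a minimal illustration – the block size and replication factor are the HDFS 2.x defaults, but the round-robin placement is a naive stand-in for HDFS's rack-aware placement policy:

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS 2.x default block size: 128 MB
REPLICATION = 3                 # HDFS default replication factor

def place_blocks(file_size, datanodes):
    """Split a file into blocks and pick REPLICATION nodes for each (naive round-robin)."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)] for r in range(REPLICATION)]
    return placement

# A 300 MB file becomes 3 blocks, each replicated on 3 of the 4 DataNodes.
layout = place_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
print(len(layout))  # 3 blocks
print(layout[0])    # ['dn1', 'dn2', 'dn3']
```

With three replicas per block, any single DataNode can fail and every block is still readable – this is the self-healing, fault-tolerant behavior the design principles above call for.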

Apache Hadoop

All the major cloud providers offer managed Hadoop services. On Azure it is called Azure HDInsight, on AWS you can use Amazon EMR (Elastic MapReduce), and on Google Cloud you can use Cloud Dataproc.

