What is Hadoop? Introduction, Architecture, Ecosystem, Components

Created by danielrodrigues Jun 29, 2020

What is Hadoop?
Apache Hadoop is an open-source software framework for developing big data processing applications that run in a distributed computing environment.
Applications built with Hadoop run on large data sets distributed across clusters of commodity computers. Commodity computers are cheap and widely available, which makes them useful for achieving greater computational power at low cost.
Just as data on a personal computer resides in a local file system, data in Hadoop resides in a distributed file system called the Hadoop Distributed File System (HDFS). The processing model is based on the 'Data Locality' concept, in which the computational logic is sent to the cluster nodes (servers) that contain the data, rather than moving the data to the computation. This computational logic is simply a compiled version of a program written in a high-level language such as Java. Such a program processes data stored in HDFS.
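The data locality idea can be illustrated with a small sketch. This is not Hadoop's actual scheduler; it is a simplified Python model, and the block names, node names, and the `assign_tasks` helper are all hypothetical. It assumes each data block is stored on a couple of nodes and assigns each processing task to a node that already holds that block.

```python
from collections import defaultdict

# Hypothetical cluster layout: block id -> nodes that hold a copy of it.
BLOCK_LOCATIONS = {
    "block-0": ["node-a", "node-b"],
    "block-1": ["node-b", "node-c"],
    "block-2": ["node-c", "node-a"],
}


def assign_tasks(block_locations):
    """Assign one task per block to a node that already stores that block.

    Among the nodes holding a copy, pick the one with the fewest tasks so
    far (ties broken by node name), so work spreads across the cluster.
    """
    load = defaultdict(int)  # node -> number of tasks assigned so far
    assignment = {}          # block id -> node chosen to process it
    for block in sorted(block_locations):
        holders = block_locations[block]
        chosen = min(holders, key=lambda node: (load[node], node))
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


if __name__ == "__main__":
    for block, node in assign_tasks(BLOCK_LOCATIONS).items():
        print(f"{block} -> processed on {node}")
```

Every task in this model runs on a node that already has the data, so no block has to cross the network before processing starts.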

Hadoop EcoSystem and Components
The diagram below shows the various components of the Hadoop ecosystem:

Apache Hadoop consists of two sub-projects:

  1. Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications that run on Hadoop. MapReduce programs can process enormous amounts of data in parallel on large clusters of computation nodes.
  2. HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of Hadoop applications. MapReduce applications consume data from HDFS. HDFS creates multiple replicas of data blocks and distributes them across the nodes in a cluster. The replication makes storage reliable when a node fails, and the distribution lets computations run in parallel, close to the data.
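The MapReduce model described above can be sketched in a few lines. The following is a minimal in-memory illustration in Python, not the Hadoop MapReduce API (which is Java-based); the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical. It assumes the classic word-count task, with each input line standing in for a piece of data that a real cluster would process on a separate node.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    # Here the map calls run one after another in a single process;
    # on a cluster they would run in parallel on different nodes.
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["Hadoop stores data", "Hadoop processes data"]
    print(word_count(sample))
```

Because each map call depends only on its own line and each reduce call only on its own key, both phases can be split across many machines, which is what makes the model suitable for large clusters.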

Although Hadoop is best known for MapReduce and its distributed file system, HDFS, the term is also used for a family of related projects that fall under the umbrella of distributed computing and large-scale data processing. Other Hadoop-related projects at Apache include Hive, HBase, Mahout, Sqoop, Flume, and ZooKeeper.
