Big Data – JIT Chapter-4
JIT #CMIT-5125
Unit-4
Overview of Big Data
Tools and Technology
Admas Abtew
Faculty of Computing and Informatics
Jimma Technology Institute, Jimma University
[email protected]
+251-912499102
Outline
• What is Hadoop
• Hadoop framework-advantages
• The Hadoop Ecosystem
Chapter Four: Big data tools
and technologies
4.1 Introduction to Hadoop
Apache Hadoop is one of the most widely used tools in the Big Data industry.
Hadoop is an open-source framework from Apache that runs on commodity hardware. It is
used to store, process, and analyze Big Data.
Hadoop is written in Java. Apache Hadoop enables parallel processing of data because it works on
multiple machines simultaneously, using a clustered architecture. A cluster is a group of systems
connected via a LAN.
It consists of three parts:
• Hadoop Distributed File System (HDFS) – the storage layer of Hadoop.
• MapReduce – the data processing layer of Hadoop.
• YARN – the resource management layer of Hadoop.
YARN
Apache Hadoop's YARN (Yet Another Resource Negotiator) component is responsible for assigning system
resources to the various applications running in a Hadoop cluster and for scheduling jobs on the
different cluster nodes. YARN is one of the main components of Apache Hadoop.
MapReduce
MapReduce works with two functions: Map and Reduce. The Map function reads a set of
<key, value> pairs from disk, processes them, and emits another set of intermediate
<key, value> pairs.
The Reduce function accepts these intermediate <key, value> pairs as input and returns its
output as <key, value> pairs.
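The Map/Shuffle/Reduce flow above can be sketched with a small, framework-free Python simulation. This is not Hadoop code; the function names (`map_fn`, `reduce_fn`, `run_mapreduce`) are illustrative, and the whole pipeline runs on one machine, whereas real Hadoop distributes the phases across the cluster.

```python
from collections import defaultdict

def map_fn(_, line):
    # Map: one input <key, value> pair -> intermediate <word, 1> pairs.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: <word, [1, 1, ...]> -> a single <word, total> pair.
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle phase: group all intermediate values by their key.
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    # Apply Reduce to each group and collect the final <key, value> pairs.
    return dict(kv for k, vs in sorted(groups.items())
                   for kv in reduce_fn(k, vs))

result = run_mapreduce(enumerate(["big data big tools"]), map_fn, reduce_fn)
print(result)  # {'big': 2, 'data': 1, 'tools': 1}
```

The same word-count logic, written against Hadoop's Java API, would be split into a Mapper class, a Reducer class, and a driver; the data flow is identical.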
Apache Pig
Apache Pig is a high-level platform that lets us write simple scripts in Pig Latin, a
language with somewhat SQL-style syntax. If we don't want to write Java or Python MapReduce
code and are more comfortable with a scripting language, Pig is a good choice.
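As a sketch of what such a script looks like, here is word count in Pig Latin; the input file name `input.txt` is assumed, and the script would be run with Pig against a Hadoop cluster rather than as standalone code:

```pig
-- Word count sketch in Pig Latin ('input.txt' is a hypothetical input file)
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
```

Pig compiles each of these relational steps (LOAD, GROUP, FOREACH) into MapReduce jobs behind the scenes, which is why a few lines of Pig Latin can replace a much longer Java program.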