Big Data – JIT Chapter-4
JIT #CMIT-5125
Unit-4
Overview of Big Data
Tools and Technology
Admas Abtew
Faculty of Computing and Informatics
Jimma Technology Institute, Jimma University
[email protected]
+251-912499102
Outline
• What is Hadoop
• Hadoop framework-advantages
• The Hadoop Ecosystem
Chapter Four: Big data tools
and technologies
4.1 Introduction to Hadoop
Apache Hadoop is one of the most widely used tools in the Big Data industry.
Hadoop is an open-source framework from Apache that runs on commodity hardware. It is
used to store, process, and analyze Big Data.
Hadoop is written in Java. Apache Hadoop enables parallel processing of data because it works on
multiple machines simultaneously, using a clustered architecture. A cluster is a group of systems
connected via a LAN.
It consists of three parts:
• Hadoop Distributed File System (HDFS) – the storage layer of Hadoop.
• MapReduce – the data processing layer of Hadoop.
• YARN – the resource management layer of Hadoop.
YARN
Apache Hadoop's YARN (Yet Another Resource Negotiator) component is responsible for assigning system
resources to the various applications running in a Hadoop cluster and for scheduling jobs on the
different cluster nodes. YARN is one of the main components of Apache Hadoop.
MapReduce
MapReduce works with two functions: Map and Reduce. The Map function reads a set of
<key, value> pairs from disk, processes them, and emits another set of intermediate
<key, value> pairs.
The Reduce function accepts these intermediate <key, value> pairs as input and returns its
output as <key, value> pairs.
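The Map/Shuffle/Reduce flow above can be sketched with a small, framework-free Python simulation. This is not Hadoop code; the function names (`map_fn`, `reduce_fn`, `run_mapreduce`) are illustrative, and the whole pipeline runs on one machine, whereas real Hadoop distributes the phases across the cluster.

```python
from collections import defaultdict

def map_fn(_, line):
    # Map: one input <key, value> pair -> intermediate <word, 1> pairs.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: <word, [1, 1, ...]> -> a single <word, total> pair.
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle phase: group all intermediate values by their key.
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    # Apply Reduce to each group and collect the final <key, value> pairs.
    return dict(kv for k, vs in sorted(groups.items())
                   for kv in reduce_fn(k, vs))

result = run_mapreduce(enumerate(["big data big tools"]), map_fn, reduce_fn)
print(result)  # {'big': 2, 'data': 1, 'tools': 1}
```

The same word-count logic, written against Hadoop's Java API, would be split into a Mapper class, a Reducer class, and a driver; the data flow is identical.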
Apache Pig
Apache Pig is a high-level platform that lets us write simple scripts in Pig Latin, a
language with somewhat SQL-style syntax. If we don't want to write Java or Python MapReduce
code and are more comfortable with a scripting language, Pig is a good choice.
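As a sketch of what such a script looks like, here is word count in Pig Latin; the input file name `input.txt` is assumed, and the script would be run with Pig against a Hadoop cluster rather than as standalone code:

```pig
-- Word count sketch in Pig Latin ('input.txt' is a hypothetical input file)
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
```

Pig compiles each of these relational steps (LOAD, GROUP, FOREACH) into MapReduce jobs behind the scenes, which is why a few lines of Pig Latin can replace a much longer Java program.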