BDA UNIT-3
UNIT-III
Syllabus: Understanding MapReduce Fundamentals and HBase: MapReduce Framework, Techniques to Optimize MapReduce Jobs, Use of MapReduce, Role of HBase in Big Data Processing, Exploring Hive: Introducing Hive, Getting Started with Hive, Hive Services, Data Types in Hive, Built-in Functions in Hive, Hive DDL, Hive DML.
MapReduce Architecture
MapReduce and HDFS are the two major components of Hadoop that make it so powerful and efficient to use. MapReduce is a programming model used for processing large data-sets efficiently, in parallel and in a distributed manner. The data is first split and then combined to produce the final result. MapReduce libraries have been written in many programming languages, with various optimizations. In Hadoop, MapReduce maps each job and then reduces it to equivalent tasks, which lowers the overhead on the cluster network and reduces the processing load. A MapReduce task is mainly divided into two phases: the Map phase and the Reduce phase.
Components of MapReduce Architecture:
1. Client: The MapReduce client is the one who brings the Job to the MapReduce for
processing. There can be multiple clients available that continuously send jobs for
processing to the Hadoop MapReduce Manager.
2. Job: The MapReduce Job is the actual work that the client wants done, which is comprised of many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs obtained after dividing the main job. The results of all the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after processing.
In MapReduce, we have a client. The client submits a job of a particular size to the Hadoop MapReduce Master. The MapReduce master then divides this job into further equivalent job-parts. These job-parts are made available to the Map and Reduce tasks. The Map and Reduce tasks contain the program required by the use-case that the particular company is solving; the developer writes the logic to fulfill that requirement. The input data is fed to the Map task, and the Map generates intermediate key-value pairs as its output. The output of the Map, i.e. these key-value pairs, is then fed to the Reducer, and the final output is stored on HDFS. There can be any number of Map and Reduce tasks made available for processing the data, as the requirement dictates. The Map and Reduce algorithms are written in an optimized way so that time and space complexity are kept to a minimum.
Let’s discuss the MapReduce phases to get a better understanding of its architecture:
The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce phase.
1. Map: As the name suggests, its main use is to map the input data into key-value pairs. The input to the map may itself be a key-value pair, where the key can be an id or address of some kind and the value is the actual data it holds. The Map() function is executed in its memory repository on each of these input key-value pairs and generates intermediate key-value pairs, which work as the input for the Reducer or Reduce() function.
2. Reduce: The intermediate key-value pairs that work as input for the Reducer are shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based on its key, as per the reducer logic written by the developer (a minimal word-count sketch is given below).
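The classic word-count example illustrates both phases. The sketch below is a minimal, self-contained Mapper/Reducer pair for Hadoop's Java MapReduce API; the class names and the whitespace-based tokenization are illustrative choices, not something specified in the text above.

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: for every line of input, emit (word, 1) intermediate key-value pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);             // intermediate key-value pair
        }
    }
}

// Reduce phase: after shuffle and sort, each word arrives with all its counts; sum them.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // final output, written to HDFS
    }
}

A driver class would set these as the mapper and reducer of a Job and submit it to the cluster; the number of map tasks then depends on the input splits, while the number of reduce tasks is configurable.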
How the Job Tracker and the Task Tracker deal with MapReduce:
1. Job Tracker: The Job Tracker manages all the resources and all the jobs across the cluster, and schedules each map task on a Task Tracker running on the same data node as the data, since there can be hundreds of data nodes available in the cluster.
2. Task Tracker: The Task Trackers can be considered the workers that act on the instructions given by the Job Tracker. A Task Tracker is deployed on each node of the cluster and executes the Map and Reduce tasks as instructed by the Job Tracker.
There is also one important component of the MapReduce architecture known as the Job History Server. The Job History Server is a daemon process that saves and stores historical information about tasks and applications; for example, the logs generated during or after job execution are stored on the Job History Server.
Use of MapReduce:
a) Social Media Analytics: MapReduce is used to analyse social media data to find
trends and patterns. This analysis, facilitated by MapReduce, empowers organisations
to make data-driven decisions and tailor their strategies to better engage with their
target audience.
b) Fraud Detection Systems: MapReduce is used to detect fraudulent activities in
financial transactions. By leveraging this technology, organisations can enhance their
fraud detection capabilities, mitigate risks, and safeguard the integrity of economic
systems.
c) Entertainment Industry: MapReduce is used to analyse user preferences and
viewing history to recommend movies and TV shows. By analysing this information,
the industry can deliver personalised recommendations for movies and TV shows,
enhancing user experience and satisfaction.
d) E-commerce Optimisation: MapReduce evaluates consumer buying patterns
based on customers’ interests or historical purchasing patterns. This personalised
approach enhances the overall shopping experience for consumers while improving
the efficiency of e-commerce operations.
e) Data Warehousing: MapReduce is used to process large volumes of data in data
warehousing applications. In this way, organisations can derive actionable insights
from their data, supporting informed decision-making processes across various
business functions.
Apache HBase
Prerequisite: Introduction to Hadoop.
HBase is a data model similar to Google's Bigtable. It is an open-source, distributed database developed by the Apache Software Foundation and written in Java. HBase is an essential part of the Hadoop ecosystem. HBase runs on top of HDFS (Hadoop Distributed File System). It can store massive amounts of data, from terabytes to petabytes. It is column-oriented and horizontally scalable.
Use cases of HBase:
Real-time analytics: HBase is an excellent choice for real-time analytics applications that
require low-latency data access. It provides fast read and write performance and can handle
large amounts of data, making it suitable for real-time data analysis.
Social media applications: HBase is an ideal database for social media applications that
require high scalability and performance. It can handle the large volume of data generated by
social media platforms and provide real-time analytics capabilities.
IoT applications: HBase can be used for Internet of Things (IoT) applications that require
storing and processing large volumes of sensor data. HBase’s scalable architecture and fast
write performance make it a suitable choice for IoT applications that require low-latency data
processing.
Online transaction processing: HBase can be used as an online transaction processing
(OLTP) database, providing high availability, consistency, and low-latency data access.
HBase’s distributed architecture and automatic failover capabilities make it a good fit for
OLTP applications that require high availability.
Ad serving and clickstream analysis: HBase can be used to store and process large volumes
of clickstream data for ad serving and clickstream analysis. HBase’s column-oriented data
storage and indexing capabilities make it a good fit for these types of applications.
Features of HBase –
1. It is linearly scalable across various nodes as well as modularly scalable, as it is divided across various nodes.
2. It provides atomic reads and writes: during one read or write process, all other processes are prevented from performing any read or write operations.
3. It supports Thrift and REST APIs for non-Java front ends, with XML, Protobuf, and binary data encoding options.
4. It supports a Block Cache and Bloom Filters for real-time queries and for high-volume query optimization.
5. It supports exporting metrics with the Hadoop metrics subsystem to files.
6. It is a platform for storing and retrieving data with random access (a minimal client sketch is given below).
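To make the random-access model concrete, the sketch below writes and reads a single cell through the HBase Java client API. The table name employee, the column family info, and the qualifier name are illustrative assumptions; the cluster configuration is expected to come from hbase-site.xml on the classpath.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseQuickStart {
    public static void main(String[] args) throws Exception {
        // Connect using the cluster settings found on the classpath (hbase-site.xml).
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("employee"))) {   // assumed table

            // Write one cell: row key "emp1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("emp1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ravi"));
            table.put(put);

            // Random read by row key -- the access pattern HBase is optimized for.
            Get get = new Get(Bytes.toBytes("emp1"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}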
The Facebook Messenger platform was using Apache Cassandra but shifted from Apache Cassandra to HBase in November 2010. Facebook was trying to build a scalable and robust infrastructure to combine services like messages, email, chat, and SMS into a real-time conversation, which is why HBase was best suited for that purpose.
RDBMS Vs HBase –
1. An RDBMS has a fixed schema, but in HBase we can scale or add columns at run time as well.
2. RDBMS is good for structured data, whereas HBase is good for semi-structured data.
3. RDBMS is optimized for joins, but HBase is not optimized for joins.
Apache HBase is a NoSQL, column-oriented database that is built on top of the Hadoop
ecosystem. It is designed to provide low-latency, high-throughput access to large-scale,
distributed datasets. Here are some of the advantages and disadvantages of using HBase:
Advantages Of Apache HBase:
1. Scalability: HBase can handle extremely large datasets that can be distributed across a
cluster of machines. It is designed to scale horizontally by adding more nodes to the
cluster, which allows it to handle increasingly larger amounts of data.
2. High-performance: HBase is optimized for low-latency, high-throughput access to data.
It uses a distributed architecture that allows it to process large amounts of data in parallel,
which can result in faster query response times.
3. Flexible data model: HBase’s column-oriented data model allows for flexible schema
design and supports sparse datasets. This can make it easier to work with data that has a
variable or evolving schema.
4. Fault tolerance: HBase is designed to be fault-tolerant by replicating data across
multiple nodes in the cluster. This helps ensure that data is not lost in the event of a
hardware or network failure.
Disadvantages Of Apache HBase:
1. Complexity: HBase can be complex to set up and manage. It requires knowledge of the
Hadoop ecosystem and distributed systems concepts, which can be a steep learning curve
for some users.
2. Limited query language: HBase’s query language, HBase Shell, is not as feature-rich as
SQL. This can make it difficult to perform complex queries and analyses.
3. No support for transactions: HBase does not support multi-row transactions, which can make it difficult to maintain data consistency in some use cases.
4. Not suitable for all use cases: HBase is best suited for use cases where high-throughput, low-latency access to large datasets is required. It may not be the best choice for applications that require complex transactions or strong multi-row consistency guarantees.
Apache Hive – Getting Started with HQL: Database Creation and Dropping a Database
A Database is a storage schema that contains multiple tables. Hive databases refer to namespaces of tables. If you don't specify a database name, by default Hive uses its default database for table creation and other purposes. Creating a database allows multiple users to create tables with the same name in different schemas, so that their names do not clash.
So, let’s start our hive shell for performing our tasks with the below command.
hive
See the already existing databases using the below command.
show databases; # this will show the existing databases
Create a new database, say Test, with:
create database Test;
If we again try to create the Test database, Hive will throw an error/warning that a database with the name Test already exists. In general, we don't want an error when the database already exists, so we use the CREATE DATABASE command with the [IF NOT EXISTS] clause, which does not throw any error.
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Example:
CREATE SCHEMA IF NOT EXISTS Test1;
SHOW DATABASES;
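The section heading also mentions dropping a database. As a brief sketch using the Test1 database created above, a database can be removed with DROP DATABASE; the CASCADE keyword is needed only if the database still contains tables:

DROP DATABASE IF EXISTS Test1;
DROP DATABASE IF EXISTS Test1 CASCADE;   -- also drops any tables inside the database
SHOW DATABASES;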
Hive Services
The following are the services provided by Hive:-
o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute
Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is just an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
o Hive MetaStore - It is a central repository that stores all the structure information of the various tables and partitions in the warehouse. It also includes metadata of columns and their type information, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
o Hive Server - It is referred to as Apache Thrift Server. It accepts the request from
different clients and provides it to Hive Driver.
o Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and perform
semantic analysis on the different query blocks and expressions. It converts HiveQL
statements into MapReduce jobs.
o Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of map-reduce tasks and HDFS tasks. In the end, the execution engine executes the incoming tasks in the order of their dependencies.
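As a small usage sketch (the host name and the default HiveServer2 port 10000 are assumptions, not values from the text), the metastore and Thrift server are typically started as services and then queried from a client such as Beeline:

hive --service metastore                  # start the metastore service
hive --service hiveserver2                # start HiveServer2 (the Thrift server)
beeline -u jdbc:hive2://localhost:10000   # connect a client over JDBC/Thrift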
HIVE Data Types
Hive data types are categorized into numeric types, string types, misc types, and complex types.
A list of Hive data types is given below.
Integer Types
Type     Size                    Range
INT      4-byte signed integer   -2,147,483,648 to 2,147,483,647
BIGINT   8-byte signed integer   -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
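As a quick illustration of declaring columns with these types, the sketch below creates the employee_data table used in the next section; the exact column names and the comma-delimited text format are assumptions for illustration.

CREATE TABLE IF NOT EXISTS employee_data (
  Id INT,
  Name STRING,
  Salary BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';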
HiveQL - Functions
Hive provides various in-built functions to perform mathematical and aggregate operations. Here, we are going to execute such functions on the records of the employee_data table:
Example of Functions in Hive
Let's assume the employee_data table has been created (as above) and loaded with employee records.
o Let's see an example to fetch the square root of each employee's salary.
hive> select Id, Name, sqrt(Salary) from employee_data;
Return Type   Operator            Description
DOUBLE        sum(DISTINCT col)   It returns the sum of distinct values.
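For instance, applying this aggregate to the same sample table (assuming the employee_data table above):

hive> select sum(DISTINCT Salary) from employee_data;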