BDA UNIT-3


Big Data Analytics.

UNIT-III
Syllabus: Understanding MapReduce Fundamentals and HBase: MapReduce Framework,
Techniques to Optimize MapReduce Jobs, Use of MapReduce, Role of HBase in Big
Data Processing, Exploring Hive: Introducing Hive, Getting Started with Hive, Hive
Services, Data Types in Hive, Built-in Functions in Hive, Hive DDL, Hive DML.

MapReduce Architecture

MapReduce and HDFS are the two major components of Hadoop that make it so powerful
and efficient to use. MapReduce is a programming model used for processing large data-sets
in parallel in a distributed manner. The data is first split and then combined to produce the
final result. MapReduce libraries have been written in many programming languages, each
with its own optimizations. The purpose of MapReduce in Hadoop is to map each job and
then reduce it to equivalent tasks, which lowers the overhead on the cluster network and the
processing power required. A MapReduce task is mainly divided into two phases: the Map
phase and the Reduce phase.
MapReduce Architecture:

Components of MapReduce Architecture:

1. Client: The MapReduce client is the one that submits the job to MapReduce for
processing. There can be multiple clients that continuously send jobs for processing to
the Hadoop MapReduce Master.
2. Job: The MapReduce job is the actual work that the client wants done, made up of many
smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs obtained after dividing the main job. The results of all
the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after processing.
In MapReduce, a client submits a job of a particular size to the Hadoop MapReduce Master.
The MapReduce Master divides this job into further equivalent job-parts, which are then
made available to the Map and Reduce tasks. The Map and Reduce tasks contain the program
written for the use case that the particular organization is solving; the developer writes the
logic that fulfils the requirement. The input data is fed to the Map task, and the Map
generates intermediate key-value pairs as its output. These key-value pairs are then fed to the
Reducer, and the final output is stored on HDFS. Any number of Map and Reduce tasks can
be made available for processing the data as required. The Map and Reduce algorithms are
written in an optimized way so that time and space complexity are kept to a minimum.
Let’s discuss the MapReduce phases to get a better understanding of its architecture:
The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce phase.
1. Map: As the name suggests, its main use is to map the input data into key-value pairs. The
input to the map may itself be a key-value pair, where the key can be the id of some kind of
address and the value is the actual value that it holds. The Map() function is executed in
its memory repository on each of these input key-value pairs and generates intermediate
key-value pairs, which work as input for the Reducer or Reduce() function.

2. Reduce: The intermediate key-value pairs that work as input for the Reducer are shuffled,
sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based
on its key as per the reducer algorithm written by the developer. (A minimal word-count
sketch follows this list.)
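To make the two phases concrete, the classic word-count program is sketched below using the standard Hadoop Java API (org.apache.hadoop.mapreduce). This is a minimal sketch; the class names WordCountMapper and WordCountReducer are illustrative and not part of the original material.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: for every line of input, emit an intermediate (word, 1) pair.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);                 // intermediate key-value pair
            }
        }
    }

    // Reduce phase: all values for the same word arrive together (after shuffle and sort)
    // and are summed to give the final count.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));    // final (word, total) pair written to HDFS
        }
    }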
How the Job Tracker and the Task Tracker deal with MapReduce:
1. Job Tracker: The work of the Job Tracker is to manage all the resources and all the jobs
across the cluster, and to schedule each map task on a Task Tracker running on the same
data node, since there can be hundreds of data nodes available in the cluster.

2. Task Tracker: The Task Trackers can be considered the actual slaves that work on the
instructions given by the Job Tracker. A Task Tracker is deployed on each of the nodes
available in the cluster and executes the Map and Reduce tasks as instructed by the Job
Tracker.
There is also one important component of the MapReduce architecture known as the Job
History Server. The Job History Server is a daemon process that saves and stores historical
information about the task or application; for example, the logs generated during or after job
execution are stored on the Job History Server.


How Job runs on MapReduce


A MapReduce job can be run with a single method call: submit() on a Job object (you can
also call waitForCompletion(), which submits the job if it hasn't been submitted already and
then waits for it to finish).
Let’s understand the components –
1. Client: Submits the MapReduce job.
2. YARN node manager: Launches and monitors the compute containers on machines in the
cluster.
3. YARN resource manager: Handles the allocation of compute resources across the cluster.
4. MapReduce application master: Coordinates the tasks running the MapReduce job.
5. Distributed Filesystem: Shares job files with the other entities.

How to submit a Job?


The submit() method creates an internal JobSubmitter instance and calls
submitJobInternal() on it. Having submitted the job, waitForCompletion() polls the job's
progress once per second and reports the progress to the console if it has changed since the
last report. When the job completes successfully, the job counters are displayed; otherwise,
the error that caused the job to fail is logged to the console. (A minimal driver sketch appears
after the list below.)
Processes implemented by JobSubmitter for submitting the job:
 It asks the resource manager for a new application ID, which is used as the MapReduce
job ID.
 The output specification of the job is checked. For example, if the output directory has
not been specified or it already exists, the job is not submitted and an error is thrown to
the MapReduce program.
 The input splits for the job are computed. If the splits cannot be computed (for example,
because the input paths do not exist), the job is not submitted and an error is thrown to
the MapReduce program.
 The resources needed to run the job, including the job JAR file, the configuration file,
and the computed input splits, are copied to the shared filesystem in a directory named
after the job ID. The job JAR is copied with a high replication factor, controlled by the
mapreduce.client.submit.file.replication property, so that there are many copies across
the cluster for the node managers to access.
 The job is submitted to the resource manager by calling submitApplication().
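A minimal driver sketch of how such a job is typically configured and submitted; waitForCompletion(true) performs the submission and progress reporting described above. The class name WordCountDriver, the mapper and reducer classes (from the earlier word-count sketch), and the input/output paths are illustrative assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);     // classes from the earlier sketch
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not already exist
            // waitForCompletion() submits the job if it has not been submitted yet,
            // then polls progress once per second and prints it to the console.
            boolean success = job.waitForCompletion(true);
            System.exit(success ? 0 : 1);
        }
    }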
Use of MapReduce
Here are the top 5 uses of MapReduce:

 a) Social Media Analytics: MapReduce is used to analyse social media data to find
trends and patterns. This analysis, facilitated by MapReduce, empowers organisations
to make data-driven decisions and tailor their strategies to better engage with their
target audience.
 b) Fraud Detection Systems: MapReduce is used to detect fraudulent activities in
financial transactions. By leveraging this technology, organisations can enhance their
fraud detection capabilities, mitigate risks, and safeguard the integrity of economic
systems.
 c) Entertainment Industry: MapReduce is used to analyse user preferences and
viewing history to recommend movies and TV shows. By analysing this information,
the industry can deliver personalised recommendations for movies and TV shows,
enhancing user experience and satisfaction.
 d) E-commerce Optimisation: MapReduce evaluates consumer buying patterns
based on customers’ interests or historical purchasing patterns. This personalised
approach enhances the overall shopping experience for consumers while improving
the efficiency of e-commerce operations.
 e) Data Warehousing: MapReduce is used to process large volumes of data in data
warehousing applications. In this way, organisations can derive actionable insights
from their data, supporting informed decision-making processes across various
business functions.

Apache HBase
Prerequisite: Introduction to Hadoop

HBase is a data model similar to Google's Bigtable. It is an open-source, distributed database
developed by the Apache Software Foundation and written in Java. HBase is an essential part
of the Hadoop ecosystem and runs on top of HDFS (Hadoop Distributed File System). It can
store massive amounts of data, from terabytes to petabytes, and it is column-oriented and
horizontally scalable.
Figure – History of HBase

Applications of Apache HBase:

Real-time analytics: HBase is an excellent choice for real-time analytics applications that
require low-latency data access. It provides fast read and write performance and can handle
large amounts of data, making it suitable for real-time data analysis.
Social media applications: HBase is an ideal database for social media applications that
require high scalability and performance. It can handle the large volume of data generated by
social media platforms and provide real-time analytics capabilities.
IoT applications: HBase can be used for Internet of Things (IoT) applications that require
storing and processing large volumes of sensor data. HBase’s scalable architecture and fast
write performance make it a suitable choice for IoT applications that require low-latency data
processing.
Online transaction processing: HBase can be used as an online transaction processing
(OLTP) database, providing high availability, consistency, and low-latency data access.
HBase’s distributed architecture and automatic failover capabilities make it a good fit for
OLTP applications that require high availability.
Ad serving and clickstream analysis: HBase can be used to store and process large volumes
of clickstream data for ad serving and clickstream analysis. HBase’s column-oriented data
storage and indexing capabilities make it a good fit for these types of applications.
Features of HBase –
1. It is linearly scalable across various nodes as well as modularly scalable, as it is divided
across various nodes.

2. HBase provides consistent reads and writes.

3. It provides atomic reads and writes, meaning that during one read or write process, all
other processes are prevented from performing any read or write operations.

4. It provides an easy-to-use Java API for client access (a short client sketch follows this list).

5. It supports Thrift and REST APIs for non-Java front ends, with XML, Protobuf, and binary
data encoding options.

6. It supports a Block Cache and Bloom Filters for real-time queries and for high-volume
query optimization.

7. HBase provides automatic failover support between Region Servers.

8. It supports exporting metrics to files via the Hadoop metrics subsystem.

9. It doesn't enforce relationships within your data.

10. It is a platform for storing and retrieving data with random access.
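As item 4 notes, client access typically goes through the Java API. The sketch below uses the standard HBase client classes (org.apache.hadoop.hbase.client) to write and then read back a single cell. It is a minimal sketch, assuming a reachable cluster and an existing table; the table name "users", column family "info", and row key are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {
                // Write one cell: row key "row1", column family "info", qualifier "name".
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
                table.put(put);
                // Read the same cell back with a random-access Get on the row key.
                Get get = new Get(Bytes.toBytes("row1"));
                Result result = table.get(get);
                byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(value));
            }
        }
    }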

The Facebook Messenger platform was using Apache Cassandra but shifted from Apache
Cassandra to HBase in November 2010. Facebook was trying to build a scalable and robust
infrastructure to handle a set of services like messages, email, chat, and SMS as real-time
conversations, and HBase was best suited for that.

RDBMS Vs HBase –

1. RDBMS is mostly row-oriented, whereas HBase is column-oriented.

2. RDBMS has a fixed schema, but in HBase we can scale or add columns at run time.

3. RDBMS is good for structured data, whereas HBase is good for semi-structured data.

4. RDBMS is optimized for joins, but HBase is not.
Apache HBase is a NoSQL, column-oriented database that is built on top of the Hadoop
ecosystem. It is designed to provide low-latency, high-throughput access to large-scale,
distributed datasets. Here are some of the advantages and disadvantages of using HBase:
Advantages Of Apache HBase:
1. Scalability: HBase can handle extremely large datasets that can be distributed across a
cluster of machines. It is designed to scale horizontally by adding more nodes to the
cluster, which allows it to handle increasingly larger amounts of data.
2. High-performance: HBase is optimized for low-latency, high-throughput access to data.
It uses a distributed architecture that allows it to process large amounts of data in parallel,
which can result in faster query response times.
3. Flexible data model: HBase’s column-oriented data model allows for flexible schema
design and supports sparse datasets. This can make it easier to work with data that has a
variable or evolving schema.
4. Fault tolerance: HBase is designed to be fault-tolerant by replicating data across
multiple nodes in the cluster. This helps ensure that data is not lost in the event of a
hardware or network failure.
Disadvantages Of Apache HBase:
1. Complexity: HBase can be complex to set up and manage. It requires knowledge of the
Hadoop ecosystem and distributed systems concepts, which can be a steep learning curve
for some users.
2. Limited query language: HBase’s query language, HBase Shell, is not as feature-rich as
SQL. This can make it difficult to perform complex queries and analyses.
3. No support for transactions: HBase does not support transactions, which can make it
difficult to maintain data consistency in some use cases.
4. Not suitable for all use cases: HBase is best suited for use cases where high-throughput,
low-latency access to large datasets is required. It may not be the best choice for
applications that require real-time processing or strong consistency guarantees.

Apache Hive – Getting Started With HQL Database Creation And Drop Database

Pre-requisite: Hive 3.1.2 Installation, Hadoop 3.1.2 Installation


HiveQL or HQL is the Hive query language that we use to process or query structured data
on Hive. HQL syntax is very similar to MySQL but has some significant differences. We
will use the hive command, which is a bash shell script, to complete our Hive demo using
the CLI (Command Line Interface). We can easily start the hive shell by simply typing hive
in the terminal. Make sure that the /bin directory of your Hive installation is mentioned in
the .bashrc file. The .bashrc file executes automatically when the user logs into the system,
and all necessary commands mentioned in this script file will run. We can check whether
the /bin directory is on the path by simply opening the file with the command shown below.
sudo gedit ~/.bashrc
If the path is not added, add it so that we can run the hive shell directly from the terminal
without moving to the hive directory. Otherwise, we can start hive manually by moving to
the apache-hive-3.1.2/bin/ directory and running the hive command.
Before working with Hive, make sure that all of your Hadoop daemons are started and
working. We can start all the Hadoop daemons with the commands below.
start-dfs.sh # this will start namenode, datanode and secondary namenode

start-yarn.sh # this will start node manager and resource manager

jps # To check running daemons


Databases In Apache Hive

A database is a storage schema that contains multiple tables. A Hive database refers to a
namespace of tables. If you don't specify a database name, Hive uses its default database
for table creation and other purposes. Creating a database allows multiple users to create
tables with the same name in different schemas so that their names don't clash.
So, let’s start our hive shell for performing our tasks with the below command.
hive
See the already existing databases using the below command.
show databases; # this will show the existing databases

Create Database Syntax:


We can create a database with the command below, but if the database already exists then
Hive will throw an error.
CREATE DATABASE|SCHEMA <database name> # we can use DATABASE or
SCHEMA for creation of a DB
Example:
CREATE DATABASE Test; # create database with name Test
show databases; # this will show the existing databases

If we try to create the Test database again, Hive will throw an error/warning that a database
with the name Test already exists. In general, we don't want to get an error if the database
already exists, so we use the create database command with the [IF NOT EXISTS] clause,
which does not throw any error.
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Example:
CREATE SCHEMA IF NOT EXISTS Test1;

SHOW DATABASES;

Syntax To Drop Existing Databases:


DROP DATABASE <db_name>; or DROP DATABASE IF EXISTS <db_name>; # the IF
EXISTS clause is again used to suppress the error
Example:
DROP DATABASE IF EXISTS Test;
DROP DATABASE Test1;

Now quit the hive shell with the quit command.


quit;

Hive Services
The following are the services provided by Hive:-

o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute
Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI. It
provides a web-based GUI for executing Hive queries and commands.
o Hive MetaStore - It is a central repository that stores all the structural information of
the various tables and partitions in the warehouse. It also includes metadata of columns
and their type information, the serializers and deserializers used to read and write data,
and the corresponding HDFS files where the data is stored.
o Hive Server - It is also referred to as the Apache Thrift Server. It accepts requests from
different clients and provides them to the Hive Driver (a small JDBC client sketch
follows this list).
o Hive Driver - It receives queries from different sources like the web UI, CLI, Thrift,
and the JDBC/ODBC driver. It transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and perform
semantic analysis on the different query blocks and expressions. It converts HiveQL
statements into MapReduce jobs.
o Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG
of map-reduce tasks and HDFS tasks. In the end, the execution engine executes the
incoming tasks in the order of their dependencies.
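Since the Hive Server accepts requests over JDBC (via HiveServer2), a client can submit HiveQL from a Java program. The sketch below is a minimal example assuming HiveServer2 is running on localhost at its default port 10000 and the hive-jdbc driver is on the classpath; the user name is a placeholder.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcClientSketch {
        public static void main(String[] args) throws Exception {
            // Explicitly load the Hive JDBC driver (optional with JDBC 4 auto-loading).
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            String url = "jdbc:hive2://localhost:10000/default";   // HiveServer2 on its default port
            try (Connection con = DriverManager.getConnection(url, "hiveuser", "");
                 Statement stmt = con.createStatement();
                 ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));   // each row holds one database name
                }
            }
        }
    }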
HIVE Data Types

Hive data types are categorized into numeric types, string types, miscellaneous types, and
complex types. A list of Hive data types is given below.

Integer Types
Type       Size                     Range

TINYINT    1-byte signed integer    -128 to 127

SMALLINT   2-byte signed integer    -32,768 to 32,767

INT        4-byte signed integer    -2,147,483,648 to 2,147,483,647

BIGINT     8-byte signed integer    -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

HiveQL - Functions

Hive provides various in-built functions to perform mathematical and aggregate operations.
Here, we are going to execute such functions on the records of the employee table created
below:
Example of Functions in Hive

Let's create a table and load the data into it by using the following steps: -


o Select the database in which we want to create a table.

hive> use hql;

o Create a hive table using the following command:

hive> create table employee_data (Id int, Name string, Salary float)
row format delimited
fields terminated by ',';

o Now, load the data into the table.

hive> load data local inpath '/home/codegyani/hive/emp_details' into table employee_data;

o Let's fetch the loaded data by using the following command:

hive> select * from employee_data;
Now, we discuss mathematical, aggregate, and other in-built functions with the
corresponding examples.

Mathematical Functions in Hive

The commonly used mathematical functions in Hive are:

Return type   Functions                        Description
BIGINT        round(num)                       It returns the BIGINT for the rounded value of DOUBLE num.
BIGINT        floor(num)                       It returns the largest BIGINT that is less than or equal to num.
BIGINT        ceil(num), ceiling(DOUBLE num)   It returns the smallest BIGINT that is greater than or equal to num.
DOUBLE        exp(num)                         It returns the exponential of num.
DOUBLE        ln(num)                          It returns the natural logarithm of num.
DOUBLE        log10(num)                       It returns the base-10 logarithm of num.
DOUBLE        sqrt(num)                        It returns the square root of num.
DOUBLE        abs(num)                         It returns the absolute value of num.
DOUBLE        sin(d)                           It returns the sine of d, in radians.
DOUBLE        asin(d)                          It returns the arcsine of d, in radians.
DOUBLE        cos(d)                           It returns the cosine of d, in radians.
DOUBLE        acos(d)                          It returns the arccosine of d, in radians.
DOUBLE        tan(d)                           It returns the tangent of d, in radians.
DOUBLE        atan(d)                          It returns the arctangent of d, in radians.


Example of Mathematical Functions in Hive

o Let's see an example to fetch the square root of each employee's salary.
hive> select Id, Name, sqrt(Salary) from employee_data;
Aggregate Functions in Hive

In Hive, an aggregate function returns a single value resulting from computation over many
rows. Let's see some commonly used aggregate functions:

Return Type   Operator            Description
BIGINT        count(*)            It returns the count of the number of rows present in the file.
DOUBLE        sum(col)            It returns the sum of the values.
DOUBLE        sum(DISTINCT col)   It returns the sum of the distinct values.
DOUBLE        avg(col)            It returns the average of the values.
DOUBLE        avg(DISTINCT col)   It returns the average of the distinct values.
DOUBLE        min(col)            It compares the values and returns the minimum one from them.
DOUBLE        max(col)            It compares the values and returns the maximum one from them.

Examples of Aggregate Functions in Hive

o Let's see an example to fetch the maximum salary of an employee.


hive> select max(Salary) from employee_data;
Return Type Operator Description

INT length(str) It returns the length of the string.

STRING reverse(str) It returns the string in reverse order.

concat(str1, It returns the concatenation of two or


STRING
str2, ...) more strings.

substr(str, It returns the substring from the string


STRING
start_index) based on the provided starting index.

It returns the substring from the string


substr(str, int
STRING based on the provided starting index
start, int length)
and length.

STRING upper(str) It returns the string in uppercase.

STRING lower(str) It returns the string in lowercase.

It returns the string by removing


STRING trim(str)
whitespaces from both the ends.

It returns the string by removing


STRING ltrim(str)
whitespaces from left-hand side.

It returns the string by removing


TRING rtrim(str)
whitespaces from right-hand side.
o Let's see an example to fetch the minimum
o salary of an employee.
1. hive> select min(Salary) from employee_data;
Other built-in Functions in Hive
The following are some other commonly used in-built functions in the hive: -

Examples of other in-built Functions in Hive

o Let's see an example to fetch the name of each employee in uppercase.


hive> select Id, upper(Name) from employee_data;
o Let's see an example to fetch the name of each employee in lowercase.
hive> select Id, lower(Name) from employee_data;
