
SDL Module - NoSQL Module

Assignment No. 2
Pranav Nayak
17070124045

Q1. What is Hadoop and why is it needed? Discuss its architecture.


Hadoop is an open-source framework for storing and processing huge datasets. It works
by clustering several machines to analyse complex datasets in parallel. Data is stored in
the Hadoop Distributed File System (HDFS). The processing model is based on 'data
locality', in which the computational logic is sent to the cluster nodes that hold the data.
Unlike traditional systems, Hadoop enables multiple types of analytic workloads to run on
the same data, at the same time, at massive scale on industry-standard hardware.

Why do we need Hadoop

Earlier, data scientists were restricted to the datasets that fit on their local machines, yet
they need to work with large volumes of data. With the increase in data and the massive
requirement for analysing it, Big Data and Hadoop provide a common platform for
exploring and analysing the data. With Hadoop, one can write a MapReduce job, a Hive
query or a Pig script, launch it over the full dataset on the cluster, and obtain results.

Data scientists spend most of their data preprocessing effort on acquisition,
transformation, cleanup and feature extraction; this is required to transform raw data into
standardized feature vectors.
Hadoop makes large-scale data preprocessing simple for data scientists. It provides tools
like MapReduce, Pig and Hive for efficiently handling large-scale data.

Hadoop is used for storing and processing big data. Its distributed file system allows
parallel processing and fault tolerance simultaneously. It is used for searching, log
processing, data warehousing, and video and image analysis. The main advantage of
using Hadoop is that it performs analytics quickly on large volumes of unstructured data
and is extremely scalable: new storage capacity can be added simply by adding server
nodes to the Hadoop cluster.

Architecture

Hadoop has a master-slave architecture for data storage and distributed data processing,
using HDFS for storage and the Hadoop MapReduce paradigm for the transformation and
analysis of large datasets. The three important Hadoop components that play a vital role
in the Hadoop architecture are -

• Hadoop Distributed File System (HDFS) – Patterned after the UNIX file system

• Hadoop MapReduce
• Yet Another Resource Negotiator (YARN)

1) Master node: coordinates the cluster and enables parallel processing of the data.
2) Slave node: stores the data and carries out the computations, synchronising its
processes with the NameNode and the JobTracker.
3) NameNode: keeps track of every file and directory used in the HDFS namespace.
4) DataNode: manages the storage of an HDFS node and serves read/write requests for
the blocks it holds.

MapReduce: a parallel programming model for writing distributed applications, devised at
Google for efficient processing of large amounts of data on large clusters.
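
As a concrete illustration of the MapReduce model, the sketch below shows the classic
word-count job written against the Hadoop MapReduce Java API. It is a minimal example,
and the input/output HDFS paths are placeholder assumptions.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in an input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combiner runs the reducer logic locally
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/user/input"));    // hypothetical HDFS paths
    FileOutputFormat.setOutputPath(job, new Path("/user/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}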

HDFS: based on the Google File System (GFS), it provides a distributed file system that is
designed to run on commodity hardware. It is highly fault-tolerant and provides
high-throughput access to application data.
The Hadoop Distributed File System (HDFS) stores the application data and the file
system metadata separately on dedicated servers. NameNode and DataNode are the two
critical components of the HDFS architecture. Application data is stored on servers
referred to as DataNodes, and file system metadata is stored on the server referred to as
the NameNode. HDFS replicates the file content on multiple DataNodes based on the
replication factor to ensure the reliability of the data. The NameNode and DataNodes
communicate with each other using TCP-based protocols. For the Hadoop architecture to
be performance efficient, HDFS must satisfy certain prerequisites –

• All the hard drives should have a high throughput.


• Good network speed to manage intermediate data transfer and block replications.
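
To illustrate how an application talks to HDFS, the minimal sketch below uses the
org.apache.hadoop.fs.FileSystem Java API to write and then read a small file. The
NameNode address, replication factor and file path are assumptions made only for this
example.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000"); // NameNode address (assumed)
    conf.set("dfs.replication", "3");                 // replication factor for new files

    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/user/demo/hello.txt");

      // Write: the client asks the NameNode for target DataNodes,
      // then streams the block data directly to them.
      try (FSDataOutputStream out = fs.create(file, true)) {
        out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
      }

      // Read: the NameNode returns block locations; the client reads
      // the blocks directly from the DataNodes.
      try (FSDataInputStream in = fs.open(file)) {
        IOUtils.copyBytes(in, System.out, 4096, false);
      }
    }
  }
}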

Features of Hadoop

• Suitable for Big Data analysis: Hadoop is designed to handle large volumes of
distributed, unstructured data.
• Scalability: Hadoop clusters can easily be scaled to any extent by adding additional
cluster nodes, allowing for the growth of Big Data.
• Fault tolerance: Hadoop replicates the input data onto other cluster nodes, so even
when a node fails, processing can continue using the replicas stored on the other nodes.

Q2. What is the Hadoop HBase API?

HBase:
HBase is a distributed, column-oriented database built on top of the Hadoop file system. It
is an open-source project and is horizontally scalable.
HBase has a data model similar to Google's Bigtable, designed to provide quick random
access to huge amounts of structured data. It leverages the fault tolerance provided by the
Hadoop Distributed File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to
data in the Hadoop file system.
One can store data in HDFS either directly or through HBase. Data consumers
read/access the data in HDFS randomly using HBase, which sits on top of the Hadoop file
system and provides read and write access.

An API is a set of programming code that enables data transmission between one
software product and another, along with the terms of this data exchange.
The HBase API lets you connect to and interact with HBase programmatically. Since
HBase is written in Java, it provides a Java API to communicate with it (in addition to the
interactive HBase shell), and the Java API is the fastest way to communicate with HBase.
Given below is the Java Admin API that covers the tasks used to manage tables.

Class HBaseAdmin: HBaseAdmin is a class representing the Admin. This class belongs
to the org.apache.hadoop.hbase.client package. Using this class, you can perform the
tasks of an administrator. You can get an instance of Admin
using the Connection.getAdmin() method. It has a list of methods that you can use to
perform administrative tasks such as creating and deleting tables.
Client API: the Java client API for HBase is used to perform CRUD operations on HBase
tables. HBase is written in Java and has a native Java API; therefore it provides
programmatic access to the Data Manipulation Language (DML).

Class HBaseConfiguration

Adds HBase configuration files to a Configuration. This class belongs to
the org.apache.hadoop.hbase package.
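
Putting these pieces together, the following is a minimal sketch (assuming the HBase 2.x
Java client) that uses HBaseConfiguration for the connection settings, the Admin interface
to create a table, and the Table interface for Put/Get. The table and column names simply
mirror the shell examples in Q3 below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath (ZooKeeper quorum etc.).
    Configuration conf = HBaseConfiguration.create();

    try (Connection connection = ConnectionFactory.createConnection(conf);
         Admin admin = connection.getAdmin()) {

      // Admin API: create table 'customers' with column family 'details'.
      TableName name = TableName.valueOf("customers");
      if (!admin.tableExists(name)) {
        admin.createTable(TableDescriptorBuilder.newBuilder(name)
            .setColumnFamily(ColumnFamilyDescriptorBuilder.of("details"))
            .build());
      }

      // Client API: write and read one cell (row '101', details:name).
      try (Table table = connection.getTable(name)) {
        Put put = new Put(Bytes.toBytes("101"));
        put.addColumn(Bytes.toBytes("details"), Bytes.toBytes("name"),
            Bytes.toBytes("Adam"));
        table.put(put);

        Result result = table.get(new Get(Bytes.toBytes("101")));
        String value = Bytes.toString(
            result.getValue(Bytes.toBytes("details"), Bytes.toBytes("name")));
        System.out.println("details:name = " + value);
      }
    }
  }
}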
Q3. CRUD operations in Hadoop

1) Creating a table named 'customers' with column families 'details', 'relatives' and 'accounts':

create 'customers', 'details', 'relatives', 'accounts'

2) Create: inserting a cell is performed by the put command

put 'customers', '101', 'details:name', 'Adam'

3) Read: read operations are performed by the get or scan commands

get 'customers', '102', 'accounts:checking'

scan 'customers', {COLUMNS => ['details:name', 'accounts']}

4) Update: also performed by the put command (a new version of the cell is written)

put 'customers', '101', 'accounts:savings', '20000'

5) Delete: performed by the delete command

delete 'customers', '101', 'accounts:business'

CASE STUDY: HADOOP IN FINANCIAL SECTOR


The finance sector is one of the major users of Hadoop. One of the primary use cases for
Hadoop was risk modelling: helping banks evaluate customers and markets better than
their legacy systems could.

These days we notice that many banks compile separate data warehouses into a single
repository backed by Hadoop for quick and easy analysis. Hadoop clusters are used by
banks to create more accurate risk analysis models for the customers in their portfolios.
Such risk analysis helps banks manage their financial security and offer customized
products and services.

Hadoop has helped the financial sector maintain a better risk record in the aftermath of
the 2008 economic downturn. Before that, every regional branch of a bank maintained a
legacy data warehouse isolated from the global entity. Data such as checking and savings
transactions, home mortgage details, credit card transactions and other financial details of
every customer were restricted to local database systems, due to which banks failed to
paint a comprehensive risk portfolio of their customers.
After the economic recession, most financial institutions and national monetary
associations started maintaining a single Hadoop cluster containing petabytes of financial
data aggregated from multiple enterprise and legacy database systems. Along with
aggregating, banks and financial institutions started pulling in other data sources - such
as customer call records, chat and web logs, email correspondence and others.
When data at this unprecedented scale was analysed with the assistance of Hadoop
MapReduce and techniques like sentiment analysis, text processing, pattern matching
and graph creation, banks were able to identify the same customers across different
sources and perform accurate risk assessment.

This proved that not just banks but any company that has a fragmented and incomplete
picture of its customers can improve its customer satisfaction and revenue by creating a
global data warehouse. Currently, banks as well as government financial institutions use
HDFS and MapReduce for anti-money-laundering practices, asset valuation and portfolio
analysis.

Predicting market behaviour is another classic problem in the financial sector. Different
market sources - such as stock exchanges, banks, revenue and proceeds departments,
and the securities market - hold massive volumes of data themselves, but their
performance is interdependent. Hadoop provides the technological backbone by creating
a platform where data from such sources can be compiled and processed in real time.

Morgan Stanley, with assets of over 350 billion dollars, is one of the world's biggest
financial services organizations. It relies on the Hadoop framework to make industry-
critical investment decisions. Hadoop provides its administrators with scalability and
better results, and can manage petabytes of data, which is not possible with traditional
database systems.

JPMorgan Chase is another financial giant, providing services in more than 100 countries.
Such large commercial banks can leverage big data analytics more effectively by using
frameworks like Hadoop on massive volumes of structured and unstructured data.
JPMorgan Chase has mentioned on various channels that it prefers to use HDFS to
support the exponentially growing volume of data as well as for low-latency processing of
complex unstructured data.
Larry Feinsmith, Managing Director of Information Technology at JPMorgan Chase, said:
“Hadoop's ability to store vast volumes of unstructured data allows the company to collect
and store web logs, transaction data and social media data. Hadoop allows us to store
data that we never stored before. The data is aggregated into a common platform for use
in a range of customer-focused data mining and data analytics tools”.
