CC Becse Unit 4 PDF



UNIT 4
BIG DATA AND ANALYTICS
Big Data, Challenges in Big Data, Hadoop: Definition, Architecture, Cloud file systems: GFS and HDFS, BigTable, HBase and
Dynamo, MapReduce and extensions: Parallel computing, The MapReduce model: Parallel efficiency of MapReduce, Relational
operations using MapReduce, Projects in Hadoop: Hive, HBase, Pig, Oozie, Flume, Sqoop

BIG DATA DEFINITION


Big Data is the term for a collection of data sets so large and complex that it becomes difficult to
process using on-hand database management tools or traditional data processing applications.
"Big data technologies describe a new generation of technologies and architectures, designed to
economically extract value from very large volumes of a wide variety of data, by enabling high-velocity
capture, discovery, and/or analysis."
Characteristics of big data and its role in the current world:
Data defined as Big Data includes machine-generated data from sensor networks, nuclear plants,
X-ray scanning devices, and airplane engines, as well as consumer-driven data from social media.

1) DATA VOLUME:- Volume refers to the size of the dataset. It may be in KB, MB, GB, TB, or PB
based on the type of the application that generates or receives the data. Data volume is characterized by the
amount of data that is generated continuously. Different data types come in different sizes. For example, a
blog text is a few kilobytes; voice calls or videos are a few megabytes; sensor data, machine logs, and
clickstream data can be in gigabytes. The following are some examples of data generated by different
sources.
Machine data: Every machine (device) that we use today from industrial to personal devices can generate a
lot of data. This data includes both usage and behaviors of the owners of these machines. Machine-generated
data is often characterized by a steady pattern of numbers and text, which occurs in a rapid-fire fashion.
There are several examples of machine-generated data; for instance, radio signals, satellites, and mobile
devices all transmit signals.
Application log: Another form of machine-generated data is an application log. Different devices generate
logs at different paces and formats. For example: CT scanners, X-ray machines, body scanners at airports,
airplanes, ships, military equipment, commercial satellites.
Clickstream logs: The usage statistics of the web page are captured in clickstream data. This data type
provides insight into what a user is doing on the web page, and can provide data that is highly useful for
behavior and usability analysis, marketing, and general research.

2) DATA VELOCITY:-
Velocity refers to the speed at which data arrives and the low-latency, real-time speed at which analytics
need to be applied to it. With the advent of Big Data, understanding the velocity of data is extremely
important, because the data has to be analyzed as it is generated. Let us look at some examples of data velocity.

Bamuengine.com
Documented by Prof. K. V. Reddy Asst.Prof at DIEMS

Amazon, Yahoo, and Google


The business models adopted by Amazon, Facebook, Yahoo, and Google became the de facto
business models for most web-based companies. The clickstreams on these web sites amount to
millions of clicks gathered from users every second, producing large volumes of data. This data can be
processed, segmented, and modeled to study population behaviors based on time of day, geography,
advertisement effectiveness, click behavior, and guided navigation response. The velocity of data produced
by user clicks on any website today is a prime example for Big Data velocity.
Sensor data
Another prime example of data velocity comes from a variety of sensors like GPS, mobile devices,
biometric systems, airplane sensors and engines. The data generated from sensor networks can range from a
few gigabytes per second to terabytes per second. For example, a flight from London to New York generates
650 TB of data from the airplane engine sensors. There is a lot of value in reading this information during
the stream processing and post gathering for statistical modeling purposes.
Social media
Social media is another Big Data favorite: different social media sites produce and provide data at
different velocities and in multiple formats. While Twitter posts were originally limited to 140 characters,
Facebook, YouTube, or Flickr can have posts of varying sizes from the same user. Not only is the size of the
post important; understanding how many times it is forwarded or shared and how much follow-on data it
gathers is essential to process the entire data set.

3) DATA VARIETY:-
Variety refers to the various types of the data that can exist, for example, text, audio, video, and photos.
Big Data comes in multiple formats as it ranges from emails to tweets to social media and sensor data. There
is no control over the input data format or the structure of the data. The processing complexity associated
with a variety of formats depends on the availability of appropriate metadata for identifying what is
contained in the actual data. This is critical when we process images, audio, video, and large chunks of text. The platform
requirements for processing new formats are:
● Scalability
● Distributed processing capabilities
● Image processing capabilities
● Graph processing capabilities
● Video and audio processing capabilities

TYPES OF DATA:-
Structured data is characterized by a high degree of organization and is typically the kind of data you see
in relational databases or spreadsheets. Because of its defined structure, it maps easily to one of the standard
data types. It can be searched using standard search algorithms and manipulated in well-defined ways.


Semi-structured data (such as what you might see in log files) is a bit more difficult to understand than
structured data. Normally, this kind of data is stored in the form of text files, where there is some degree of
order. For example, in tab-delimited files the columns are separated by a tab character. So instead of being
able to issue a database query for a certain column and knowing exactly what you're getting back, users
typically need to explicitly assign data types to any data elements extracted from semi-structured data sets.
Unstructured data has none of the advantages of having structure coded into a data set. Its analysis by way
of more traditional approaches is difficult and costly at best, and logistically impossible at worst. Without a
robust set of text analytics tools, it would be extremely tedious to determine any interesting behavior
patterns.
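The point about semi-structured data can be illustrated with a minimal Python sketch. The log layout, the field names and the SCHEMA list below are assumptions made up for this illustration; they are not part of any real log format. The file carries no type information, so the reader has to assign a type to every column explicitly.

```python
import csv
import io
from datetime import datetime

# Hypothetical tab-delimited log: timestamp, user id, URL path, response bytes.
RAW_LOG = (
    "2024-01-05T10:15:00\t42\t/home\t5120\n"
    "2024-01-05T10:15:02\t7\t/cart\t880\n"
)

# The text file has no schema, so the reader supplies one:
# a (column name, conversion function) pair for each tab-separated field.
SCHEMA = [
    ("timestamp", datetime.fromisoformat),
    ("user_id", int),
    ("path", str),
    ("bytes_sent", int),
]


def parse_log(text):
    """Turn tab-delimited text into a list of typed records using SCHEMA."""
    records = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        record = {name: cast(value) for (name, cast), value in zip(SCHEMA, row)}
        records.append(record)
    return records


records = parse_log(RAW_LOG)
total_bytes = sum(r["bytes_sent"] for r in records)
```

With structured data the database would already know that bytes_sent is an integer; here the conversion has to be written by hand before any arithmetic is possible.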

Challenges in Big-Data:
Acquire: Making the most of big data means quickly capturing high volumes of data generated in many
different formats; this is the first and foremost challenge.
Organize: A big data research platform needs to process massive quantities of data—filtering, transforming
and sorting it before loading it into a data warehouse.
Analyze: The infrastructure required for analyzing big data must be able to support deeper analytics such as
statistical analysis and data mining on a wider variety of data types stored in diverse systems; scale to
extreme data volumes; deliver faster response times; and automate decisions based on analytical models.
Scale: With big data you want to be able to scale very rapidly and elastically. Most of the NoSQL solutions
like MongoDB or HBase have their own scaling limitations.
Performance: In an online world where nanosecond delays can cost you sales, big data must move at
extremely high velocities no matter how much you scale or what workloads your database must perform.
Continuous Availability: When you rely on big data to feed your essential, revenue-generating 24/7
business applications, even high availability is not high enough. Your data can never go down. A certain
amount of downtime is built-in to RDBMS and other NoSQL systems.
Data Security: Big data carries some big risks when it contains credit card data, personal ID information
and other sensitive assets. Most NoSQL big data platforms have few if any security mechanisms in place to
safeguard your big data.

HADOOP

DEFINITION: Apache Hadoop is a framework that allows for the distributed processing of large data sets
across clusters of commodity computers using a simple programming model. It is an open-source
software framework written in Java for distributed storage and distributed processing of very large
data sets on computer clusters built from commodity hardware. The core of Apache Hadoop consists of a
storage part (Hadoop Distributed File System (HDFS)) and a processing part (MapReduce).

The base Apache Hadoop framework is composed of the following modules:


● Hadoop Common – contains libraries and utilities needed by other Hadoop modules.
● Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity
machines, providing very high aggregate bandwidth across the cluster.
● Hadoop YARN – a resource-management platform responsible for managing computing resources
in clusters and using them for scheduling of users' applications.
● Hadoop MapReduce – a programming model for large-scale data processing.
Apache Hadoop's MapReduce and HDFS components were inspired by Google papers on
their MapReduce and Google File System. Apache Hadoop is a registered trademark of the Apache Software
Foundation.

Architecture:

A multi-node Hadoop cluster:


A small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a
JobTracker, TaskTracker, NameNode and DataNode. A slave or worker node acts as both a DataNode and
TaskTracker. Hadoop requires Java Runtime Environment (JRE) 1.6 or higher. The standard startup and
shutdown scripts require that Secure Shell (ssh) be set up between nodes in the cluster.
In a larger cluster, the HDFS is managed through a dedicated NameNode server to host the file system
index, and a secondary NameNode that can generate snapshots of the namenode's memory structures, thus
preventing file-system corruption and reducing loss of data. Similarly, a standalone JobTracker server can
manage job scheduling. In clusters where the Hadoop MapReduce engine is deployed against an alternate
file system, the NameNode, secondary NameNode, and DataNode architecture of HDFS are replaced by the
file-system-specific equivalents.
Hadoop distributed file system
The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file-system written
in Java for the Hadoop framework. A Hadoop cluster has nominally a single name node plus a cluster of
data nodes. Each data node serves up blocks of data over the network using a block protocol specific to
HDFS. The file system uses TCP/IP sockets for communication. Clients use remote procedure calls (RPC) to
communicate with each other.

HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines. It
achieves reliability by replicating the data across multiple hosts. With the default replication value 3, data
is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other
to rebalance data, to move copies around, and to keep the replication of data high. HDFS stores a large file
by dividing it into blocks. The default block size is 64 MB in older releases and 128 MB in later releases,
and it is configurable.
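The block and replication arithmetic above can be made concrete with a small Python sketch. It is only an illustration of the description in these notes: the function names and the rack layout are hypothetical, and the real HDFS placement policy considers more factors than this.

```python
import math

BLOCK_SIZE_MB = 64   # default block size quoted in the text
REPLICATION = 3      # default replication value quoted in the text


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to store a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def place_replicas(racks, local_rack):
    """Pick three nodes for one block: two on local_rack, one on another rack.

    racks maps a rack name to the list of node names on that rack
    (a hypothetical layout used only for this sketch).
    """
    local_nodes = racks[local_rack][:2]
    if len(local_nodes) < 2:
        raise ValueError("need at least two nodes on the local rack")
    remote_racks = [name for name in racks if name != local_rack and racks[name]]
    if not remote_racks:
        raise ValueError("need at least one other rack with a node")
    return local_nodes + [racks[remote_racks[0]][0]]


cluster = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5"]}
blocks_for_1gb = block_count(1024)
replica_nodes = place_replicas(cluster, "rack1")
raw_storage_mb = 1024 * REPLICATION
```

A 1 GB file therefore occupies 16 blocks of 64 MB and, with three replicas, about 3 GB of raw disk space across the cluster.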
Understanding HDFS components
HDFS is managed with a master-slave architecture comprising the following components:
• NameNode:
  ● This is the master of the HDFS system. It maintains the directories and files, and manages the blocks that
  are present on the DataNodes.
  ● Only one per Hadoop cluster.
  ● Manages the file system namespace and metadata.
  ● Single point of failure, mitigated by writing state to multiple file systems.
  ● Because it is a single point of failure with large memory requirements, do not use inexpensive
  commodity hardware for this node.
• DataNode:
  ● These are slaves that are deployed on each machine and provide the actual storage. They are responsible for
  serving read and write requests from clients.
  ● Many per Hadoop cluster.
  ● Manages blocks with data and serves them to clients.
  ● Periodically reports to the NameNode the list of blocks it stores.
  ● Use inexpensive commodity hardware for this node.
• Secondary NameNode: HDFS includes a so-called secondary NameNode, a misleading
name that some might incorrectly interpret as a backup NameNode for when the primary NameNode goes
offline. In fact, the secondary NameNode regularly connects with the primary NameNode and builds snapshots
of the primary NameNode's directory information, which the system then saves to local or remote directories.
These checkpointed images can be used to restart a failed primary NameNode without having to replay the
entire journal of file-system actions and then edit the log to create an up-to-date directory structure. Because
the NameNode is the single point for storage and management of metadata, it can become a bottleneck for
supporting a huge number of files, especially a large number of small files. HDFS Federation, a newer
addition, aims to tackle this problem to a certain extent by allowing multiple namespaces served by separate
NameNodes.
Understanding the MapReduce architecture:
MapReduce is also implemented over a master-slave architecture. Classic MapReduce comprises job
submission, job initialization, task assignment, task execution, progress and status update, and job
completion activities, which are mainly managed by the JobTracker node and executed by the
TaskTrackers. A client application submits a job to the JobTracker, and the input is then divided across the
cluster. The JobTracker calculates the number of map and reduce tasks to be run and commands the
TaskTrackers to start executing the job. Each TaskTracker copies the resources to its local machine and
launches a JVM to run the map and reduce programs over the data. Along with this, the TaskTracker
periodically sends updates to the JobTracker; these act as a heartbeat that helps to update the JobID, job
status, and usage of resources.

Understanding MapReduce components


MapReduce is managed with a master-slave architecture comprising the following components:
• JobTracker:
  ● This is the master node of the MapReduce system, which manages the jobs and resources in the
  cluster (TaskTrackers). The JobTracker tries to schedule each map task as close as possible to the actual
  data being processed, i.e. on the TaskTracker that is running on the same DataNode as the underlying block.
  ● One per Hadoop cluster.
  ● Receives job requests submitted by clients.
  ● Schedules and monitors MapReduce jobs on TaskTrackers.
• TaskTracker:
  ● These are the slaves that are deployed on each machine. They are responsible for running the map
  and reduce tasks as instructed by the JobTracker.
  ● Many per Hadoop cluster.
  ● Executes MapReduce operations.
CLOUD FILE SYSTEMS: GFS AND HDFS
The Google File System (GFS) is designed to manage relatively large files using a very large distributed
cluster of commodity servers connected by a high-speed network. It is therefore designed to (a) expect and
tolerate hardware failures, even during the reading or writing of an individual file (since files are expected to
be very large) and (b) support parallel reads, writes and appends by multiple client programs. A common
use case that is efficiently supported is that of many 'producers' appending to the same file in parallel,
which is also being simultaneously read by many parallel 'consumers'. Traditional parallel
databases, on the other hand, do not make similar assumptions regarding the prevalence of failures or the
expectation that failures will occur often even during large computations; as a result they also do not scale as well.
The Hadoop Distributed File System (HDFS) is an open source implementation of the GFS architecture that
is also available on the Amazon EC2 cloud platform.
We refer to both GFS and HDFS as 'cloud file systems.' The architecture of cloud file systems is
illustrated in Figure 10.3. Large files are broken up into 'chunks' (GFS) or 'blocks' (HDFS), which are
themselves large (64 MB being typical). These chunks are stored on commodity (Linux) servers called
Chunk Servers (GFS) or DataNodes (HDFS); further, each chunk is replicated at least three times, both on a
different physical rack and on a different network segment, in anticipation of possible failures of these
components apart from server failures.

When a client program (a 'cloud application') needs to read/write a file, it sends the full path and
offset to the Master (GFS), which sends back meta-data for one (in the case of read) or all (in the case of
write) of the replicas of the chunk where this data is to be found. The client caches such meta-data so that it
need not contact the Master each time. Thereafter the client directly reads data from the designated chunk
server/DataNode. This data is not cached since most reads are large and caching would complicate writes.
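The read path described above can be sketched in a few lines of Python. This is a toy, in-memory model under stated assumptions: the Master and Client classes, their method names and the sample path are hypothetical, there is no networking, and a read is assumed to stay within a single chunk.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as quoted in the text


class Master:
    """Holds only meta-data: which chunk server stores which chunk."""

    def __init__(self, chunk_table):
        # chunk_table maps (path, chunk index) to a chunk server name.
        self.chunk_table = chunk_table
        self.lookups = 0  # counts how often clients had to ask the Master

    def locate(self, path, offset):
        """Return the chunk server holding the chunk that contains offset."""
        self.lookups += 1
        return self.chunk_table[(path, offset // CHUNK_SIZE)]


class Client:
    """Caches meta-data from the Master, then reads from chunk servers directly."""

    def __init__(self, master, servers):
        self.master = master
        self.servers = servers  # server name -> {(path, chunk index): bytes}
        self.cache = {}         # (path, chunk index) -> server name

    def read(self, path, offset, length):
        key = (path, offset // CHUNK_SIZE)
        if key not in self.cache:  # contact the Master only on a cache miss
            self.cache[key] = self.master.locate(path, offset)
        chunk = self.servers[self.cache[key]][key]
        start = offset % CHUNK_SIZE
        return chunk[start:start + length]  # the data itself is not cached


master = Master({("/logs/a", 0): "cs1"})
client = Client(master, {"cs1": {("/logs/a", 0): b"hello world"}})
first = client.read("/logs/a", 0, 5)
second = client.read("/logs/a", 6, 5)
```

Both reads hit the same chunk, so the Master is contacted only once; the second read is served using the cached meta-data.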


Anatomy of HDFS READ OPERATIONS


In case of a write, in particular an append, the client sends only the data to be appended to all the
chunk servers/DataNodes; when they all acknowledge receiving this data it informs a designated 'primary'
chunk server, whose identity it receives (and also caches) from the Master. The primary chunk server
appends its copy of data into the chunk at an offset of its choice; note that this may be beyond the EOF to
account for multiple writers who may be appending to this file simultaneously.
The primary then forwards the request to all other replicas which in turn write the data at the same
offset if possible or return a failure. In case of a failure the primary rewrites the data at possibly another
offset and retries the process. The Master maintains regular contact with each chunk server through
heartbeat messages and in case it detects a failure its meta-data is updated to reflect this, and if required
assigns a new primary for the chunks being served by a failed chunk server. Since clients cache meta-data,
occasionally they will try to connect to failed chunk servers, in which case they update their meta-data from
the master and retry. In [26] it is shown that this architecture efficiently supports multiple parallel readers
and writers. It also supports writing (appending) and reading the same file by parallel sets of writers and
readers while maintaining a consistent view, i.e. each reader always sees the same data regardless of the
replica it happens to read from. Finally, note that computational processes (the 'client' applications above)
run on the same set of servers that files are stored on.

Anatomy of HDFS WRITE OPERATIONS


BIGTABLE & HBASE:


BigTable is a distributed structured storage system built on GFS; Hadoop's HBase is a similar open source
system that uses HDFS. BigTable is accessed by a row key, column key and a timestamp. Each column can
store arbitrary name-value pairs of the form column-family:label, string. The set of possible column
families for a table is fixed when it is created, whereas columns, i.e. labels within the column family, can be
created dynamically at any time. Column families are stored close together in the distributed file system;
thus the BigTable model shares elements of column-oriented databases.
We illustrate these features below through an example. Figure 10.4 illustrates the BigTable data
structure: Each row stores information about a specific sale transaction and the row key is a transaction
identifier. The 'location' column family stores columns relating to where the sale occurred, whereas the
'product' column family stores the actual products sold and their classification. Note that there are two
values for region having different timestamps, possibly because of a reorganization of sales regions.
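The data model just described (row key, column-family:label, timestamp) can be mimicked with nested dictionaries. The following Python sketch is only a toy model: the ToyTable class, the transaction key and the region names are hypothetical and are not taken from Figure 10.4 or from the BigTable or HBase APIs.

```python
from collections import defaultdict


class ToyTable:
    """In-memory imitation of a BigTable-style table."""

    def __init__(self, column_families):
        # Column families are fixed when the table is created.
        self.column_families = frozenset(column_families)
        # row key -> "family:label" -> list of (timestamp, value) versions
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, timestamp, value):
        family, _, _label = column.partition(":")
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        # Labels inside a known family can be created at any time.
        self.rows[row_key][column].append((timestamp, value))

    def get(self, row_key, column, timestamp=None):
        """Return the newest value at or before timestamp (newest overall if None)."""
        versions = self.rows.get(row_key, {}).get(column, [])
        visible = [v for v in versions if timestamp is None or v[0] <= timestamp]
        if not visible:
            return None
        return max(visible, key=lambda v: v[0])[1]


sales = ToyTable(["location", "product"])
sales.put("txn-001", "location:city", 1, "Boston")
sales.put("txn-001", "location:region", 1, "US-East")
sales.put("txn-001", "location:region", 5, "US-North-East")  # after a reorganization
sales.put("txn-001", "product:category", 1, "grocery")

latest_region = sales.get("txn-001", "location:region")
old_region = sales.get("txn-001", "location:region", timestamp=3)
```

Both region values are kept side by side; which one a read returns depends on the timestamp supplied with the request.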

Since data in each column family is stored together, using this data organization results in efficient
data access patterns depending on the nature of analysis: For example, only the location column family may
be read for traditional data-cube based analysis of sales, whereas only the product column family is needed
for, say, market-basket analysis. Thus, the BigTable structure can be used in a manner similar to a
column-oriented database. Figure 10.5 illustrates how BigTable tables are stored on a distributed file system such as
GFS or HDFS. Each table is split into different row ranges, called tablets. Each tablet is managed by a tablet
server that stores each column family for the given row range in a separate distributed file, called an
SSTable. Additionally, a single Metadata table is managed by a meta-data server that is used to locate the
tablets of any user table in response to a read or write request. The Metadata table itself can be large and is
also split into tablets, with the root tablet being special in that it points to the locations of other meta-data
tablets. BigTable and HBase rely on the underlying distributed file systems GFS and HDFS respectively and
therefore also inherit some of the properties of these systems. In particular, large parallel reads and inserts are
efficiently supported, even simultaneously on the same table, unlike a traditional relational database. In
addition, reading all rows for a small number of column families from a large table, such as in aggregation
queries, is efficient in a manner similar to column-oriented databases. Similarly, the consistency properties
of large parallel inserts are stronger than that for parallel random writes, as is pointed out in. Further, writes
can even fail if a few replicas are unable to write even if other replicas are successfully updated.
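Locating the tablet for a given row key amounts to a search over sorted row ranges. The Python sketch below illustrates the idea with a binary search; the TabletIndex class, the split keys and the server names are hypothetical, and the real system uses the multi-level Metadata table described above rather than a single in-memory list.

```python
import bisect


class TabletIndex:
    """Maps a row key to the tablet server responsible for its row range."""

    def __init__(self, split_keys, servers):
        # split_keys are sorted boundaries between consecutive tablets.
        # Tablet i covers split_keys[i-1] <= key < split_keys[i];
        # the first tablet has no lower bound and the last has no upper bound.
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one more server entry than split keys")
        self.split_keys = list(split_keys)
        self.servers = list(servers)

    def locate(self, row_key):
        """Binary search for the tablet whose row range contains row_key."""
        return self.servers[bisect.bisect_right(self.split_keys, row_key)]


index = TabletIndex(split_keys=["g", "p"], servers=["ts1", "ts2", "ts3"])
first_tablet = index.locate("apple")
middle_tablet = index.locate("grape")
last_tablet = index.locate("zebra")
```

Because rows are kept in key order, a scan over a contiguous range of row keys touches only a few adjacent tablets.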


DYNAMO (AMAZON):


Dynamo is a distributed data system that was developed at Amazon and underlies its SimpleDB key-value
pair database. Unlike BigTable, Dynamo was designed specifically for supporting a large volume of
concurrent updates, each of which could be small in size, rather than bulk reads and appends as in the case
of BigTable and GFS. Dynamo's data model is that of simple key-value pairs, and it is expected that
applications read and write such data objects fairly randomly. This model is well suited for many web-based
e-commerce applications that all need to support constructs such as a 'shopping cart.' Dynamo also
replicates data for fault tolerance, but uses distributed object versioning and quorum consistency to enable
writes to succeed without waiting for all replicas to be successfully updated, unlike in the case of GFS.
Managing conflicts, if they arise, is relegated to reads, which are provided enough information to
enable application-dependent resolution. Because of these features, Dynamo does not rely on any underlying
distributed file system and instead directly manages data storage across distributed nodes.
The architecture of Dynamo is illustrated in Figure 10.6. Objects are key-value pairs with arbitrary arrays
of bytes as values. An MD5 hash of the key is used to generate a 128-bit hash value. The range of this hash
function is mapped to a set of virtual nodes arranged in a ring, so each key gets mapped to one virtual node.
The object is replicated at this primary virtual node as well as N − 1 additional virtual nodes (where N is
fixed for a particular Dynamo cluster). Each physical node (server) handles a number of virtual nodes at
distributed positions on the ring so as to continuously distribute load evenly as nodes leave and join the
cluster because of transient failures or network partitions. Notice that the Dynamo architecture is completely
symmetric with each node being equal, unlike the BigTable/GFS architecture that has special master nodes
at both the BigTable and the GFS layer.
A write request on an object is first executed at one of its virtual nodes, which then forwards the request to
all nodes having replicas of the object. Objects are always versioned, so a write merely creates a new version
of the object with its local timestamp (Tx on node X) incremented. Thus the timestamps capture the history of
object updates; versions that are superseded by later versions having a larger vector timestamp are discarded.
For example, two sequential updates at node X would create an object version with vector timestamp [2 0 0],
so an earlier version with timestamp [1 0 0] can be safely discarded. However, if the second write took
place at node Y before the first write had propagated to this replica, it would have a timestamp of [0 1 0]. In
this case even when the first write arrives at Y (and Y's write arrives symmetrically at X), the two versions
[1 0 0] and [0 1 0] would both be maintained and returned to any subsequent read to be resolved using
application-dependent logic. Say this read took place at node Z and was reconciled by the application, which
then further updated the object; the new timestamp for the object would be set to [1 1 1], and as this
supersedes the other versions they would be discarded once this update was propagated to all replicas. We
mention in passing that such vector-timestamp-based ordering of distributed events was first conceived of by
Lamport in [35].
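The vector-timestamp example above can be traced with a short Python sketch. It is a simplified illustration, not Dynamo's implementation: the helper names are hypothetical, timestamps are plain tuples indexed by node (X = 0, Y = 1, Z = 2), and the application-level reconciliation is reduced to taking the component-wise maximum before the reconciling node increments its own entry.

```python
X, Y, Z = 0, 1, 2  # positions of the three nodes in the vector timestamp


def increment(clock, node):
    """Return a copy of clock with the entry for node advanced by one."""
    updated = list(clock)
    updated[node] += 1
    return tuple(updated)


def supersedes(a, b):
    """True if version a is at least as new as version b in every component."""
    return all(x >= y for x, y in zip(a, b))


def surviving_versions(versions):
    """Drop versions superseded by a different version; keep concurrent ones."""
    survivors = []
    for v in versions:
        dominated = any(other != v and supersedes(other, v) for other in versions)
        if not dominated and v not in survivors:
            survivors.append(v)
    return survivors


def reconcile(versions, node):
    """Merge conflicting versions, then record the write at the reconciling node."""
    merged = tuple(max(parts) for parts in zip(*versions))
    return increment(merged, node)


start = (0, 0, 0)
first_at_x = increment(start, X)        # first write at X
second_at_x = increment(first_at_x, X)  # sequential second write at X
sequential = surviving_versions([first_at_x, second_at_x])

write_at_y = increment(start, Y)        # concurrent write at Y
conflict = surviving_versions([first_at_x, write_at_y])

resolved = reconcile(conflict, Z)       # read and reconciliation at Z
final = surviving_versions(conflict + [resolved])
```

The sequential case keeps only the newer version, the concurrent case keeps both, and the reconciled version at Z supersedes the two conflicting ones.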

In Dynamo, write operations are allowed to return even if not all replicas have been updated. However, a
quorum protocol is used to maintain eventual consistency of the replicas when a large number of concurrent
reads and writes take place: Each read operation accesses R replicas and each write ensures propagation to
W replicas; as long as R + W > N the system is said to be quorum consistent [14]. Thus, if we want very
efficient writes, we pay the price of having to read many replicas, and vice versa. In practice Amazon uses
N = 3, with R and W being configurable depending on what is desired; for a high update frequency one uses
W = 1, R = 3, whereas for a high-performance read store W = 3, R = 1 is used.
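The condition R + W > N guarantees that every set of R replicas that is read shares at least one replica with every set of W replicas that was written. The Python sketch below checks this by brute force for small N; the function names are chosen for this illustration only.

```python
from itertools import combinations


def is_quorum_consistent(n, r, w):
    """The quorum condition from the text: R + W > N."""
    return r + w > n


def always_overlaps(n, r, w):
    """Brute force: does every read set of size r meet every write set of size w?"""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


N = 3
write_optimized = is_quorum_consistent(N, r=3, w=1)  # high update frequency
read_optimized = is_quorum_consistent(N, r=1, w=3)   # high-performance read store
too_weak = is_quorum_consistent(N, r=1, w=1)         # a read may miss the latest write

# The arithmetic condition and the brute-force overlap check agree for all R, W.
conditions_agree = all(
    is_quorum_consistent(N, r, w) == always_overlaps(N, r, w)
    for r in range(1, N + 1)
    for w in range(1, N + 1)
)
```

With W = 1 and R = 3 a write touches a single replica but every read contacts all three, so at least one of them holds the latest version; W = 3 and R = 1 is the mirror image.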
Dynamo is able to handle transient failures by passing writes intended for a failed node to another
node temporarily. Such replicas are kept separately and scanned periodically with replicas being sent back to
their intended node as soon as it is found to have revived. Finally, Dynamo can be implemented using
different storage engines at the node level, such as Berkeley DB or even MySQL; Amazon is said to use the
former in production.
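The ring placement described earlier in this section (an MD5 hash of the key, virtual nodes on a ring, N replicas) can be sketched as follows. This is a minimal Python illustration under assumptions: the Ring class, the labelling of virtual nodes as "node#i", the choice of 8 virtual nodes per server and the rule of skipping virtual nodes that belong to an already chosen server are all choices made for this sketch, not details given in these notes.

```python
import bisect
import hashlib


def md5_position(key):
    """Map a string to a position on the ring using a 128-bit MD5 hash."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class Ring:
    def __init__(self, physical_nodes, vnodes_per_node=8, n_replicas=3):
        self.n_replicas = n_replicas
        # Each physical node owns several virtual nodes scattered over the ring.
        self.ring = sorted(
            (md5_position(f"{node}#{i}"), node)
            for node in physical_nodes
            for i in range(vnodes_per_node)
        )
        self.positions = [position for position, _ in self.ring]

    def preference_list(self, key):
        """Walk clockwise from the key's position and collect N distinct servers."""
        start = bisect.bisect_right(self.positions, md5_position(key))
        owners = []
        for step in range(len(self.ring)):
            node = self.ring[(start + step) % len(self.ring)][1]
            if node not in owners:  # assumption: replicas go to distinct servers
                owners.append(node)
            if len(owners) == self.n_replicas:
                break
        return owners


ring = Ring(["A", "B", "C", "D"])
owners = ring.preference_list("cart:alice")
```

Because each server appears at many positions on the ring, removing one server spreads its keys over several neighbours instead of overloading a single one.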
MAPREDUCE AND EXTENSIONS
The MapReduce programming model was developed at Google in the process of implementing large-scale
search and text processing tasks on massive collections of web data stored using BigTable and the GFS
distributed file system. The MapReduce programming model is designed for processing and generating large
volumes of data via massively parallel computations utilizing tens of thousands of processors at a time. The
underlying infrastructure to support this model needs to assume that processors and networks will fail, even
during a particular computation, and build in support for handling such failures while ensuring progress of
the computations being performed.
Hadoop is an open source implementation of the MapReduce model developed at Yahoo, and
presumably also used internally. Hadoop is also available on pre-packaged AMIs in the Amazon EC2 cloud
platform, which has sparked interest in applying the MapReduce model for large-scale, fault-tolerant
computations in other domains, including such applications in the enterprise context.

PARALLEL COMPUTING
Parallel computing has a long history with its origins in scientific computing in the late 60s and early 70s.
Different models of parallel computing have been used based on the nature and evolution of multiprocessor
computer architectures. The shared-memory model assumes that any processor can access any memory
location, but not equally fast. In the distributed memory model each processor can address only its own
memory and communicates with other processors using message passing over the network. In scientific
computing applications for which these models were developed, it was assumed that data would be loaded
from disk at the start of a parallel job and then written back once the computations had been completed, as
scientific tasks were largely compute bound. Over time, parallel computing also began to be applied in the
database arena: with shared storage such as SAN and NAS, database systems supporting shared-memory,
shared-disk and shared-nothing models became available.
The premise of parallel computing is that a task that takes time T should take time T/p if executed
on p processors. In practice, inefficiencies are introduced by distributing the computations such as
(a) The need for synchronization among processors,
(b) Overheads of communication between processors through messages or disk, and
(c) Any imbalance in the distribution of work to processors.
Thus in practice the time Tp to execute on p processors is greater than the ideal T/p (although usually still
less than T), and the parallel efficiency of an algorithm is defined as:
$$\epsilon = \frac{T}{p\,T_p} \qquad (11.1)$$
A scalable parallel implementation is one where:
(a) The parallel efficiency remains constant as the size of data is increased along with a corresponding
increase in processors and
(b) The parallel efficiency increases with the size of data for a fixed number of processors.
We illustrate how parallel efficiency and scalability depend on the algorithm, as well as the nature
of the problem, through an example. Consider a very large collection of documents; say web pages crawled
from the entire internet. The problem is to determine the frequency (i.e., total number of occurrences) of
each word in this collection. Thus, if there are n documents and m distinct words, we wish to determine m
frequencies, one for each word.


Now we compare two approaches to compute these frequencies in parallel using p processors:
(a) Let each processor compute the frequencies for m/p words and
(b) Let each processor compute the frequencies of m words across n/p documents, followed by all the
processors summing their results.
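Approach (b) can be sketched in Python as follows. The sketch runs the p partitions one after another in a single process, so it only shows the structure of the computation (partial counts per partition, followed by a summation), not real parallel execution; the function names and the sample documents are hypothetical.

```python
from collections import Counter


def count_partition(documents):
    """First phase: one processor counts every word in its share of documents."""
    counts = Counter()
    for document in documents:
        counts.update(document.lower().split())
    return counts


def word_frequencies(documents, p):
    """Approach (b): split n documents into p partitions, count, then sum."""
    partitions = [documents[i::p] for i in range(p)]
    # On a real cluster each call below would run on its own processor.
    partial_counts = [count_partition(partition) for partition in partitions]
    total = Counter()
    for partial in partial_counts:  # second phase: add the partial frequencies
        total.update(partial)
    return total


docs = ["the cat sat", "the dog sat", "the end"]
frequencies = word_frequencies(docs, p=2)
```

The result does not depend on p: splitting the same documents over one, two or three partitions gives identical totals, because every word occurrence is read exactly once.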
At first glance it appears that approach (a), where each processor works independently, may be more
efficient than approach (b), where the processors need to communicate with each other to add up all the
frequencies. However, a more careful analysis reveals otherwise. We assume a distributed-memory model
with a shared disk, so that each processor is able to access any document from disk in parallel with no
contention.
Further we assume that the time spent c for reading each word in the document is the same as that of
sending it to another processor via inter-processor communication. On the other hand, the time to add to a
running total of frequencies is negligible as compared to the time spent on a disk read or inter-processor
communication, so we ignore the time taken for arithmetic additions in our analysis.
Finally, assume that each word occurs f times in a document, on average. With these assumptions,
the time for computing all the m frequencies with a single processor is n × m × f × c, since each word needs
to be read approximately f times in each document.
Using approach (a) each processor reads approximately n × m × f words and adds them n × m/p × f
times. Ignoring the time spent in additions, each processor still takes time n × m × f × c, so the parallel
efficiency can be calculated as:
$$\epsilon_a = \frac{n m f c}{p \cdot n m f c} = \frac{1}{p} \qquad (11.2)$$
Since efficiency falls with increasing p the algorithm is not scalable.


On the other hand, using approach (b), each processor performs approximately n/p × m × f reads and the same
number of additions in the first phase, producing p vectors of m partial frequencies, which can be written to
disk in parallel by each processor in time cm. In the second phase these vectors of partial frequencies need to
be added: first, each processor sends p - 1 sub-vectors of size m/p, one to each of the remaining processors,
which takes approximately another cm. Each processor then adds p sub-vectors locally to compute one pth of
the final m-vector of frequencies. The parallel time is therefore about cnmf/p + 2cm, and the parallel
efficiency is computed as:
εb = cnmf / (p × (cnmf/p + 2cm)) = 1 / (1 + 2p/(nf)) ……………………… (11.3)

Since in practice p << nf, the efficiency of approach (b) is higher than that of approach (a), and can even be
close to one. For example, with n = 10,000 documents and f = 10, the condition (11.3) works out to
p << 50,000, so method (b) is efficient (εb ≈ 0.9) even with thousands of processors. The reason is that in the
first approach each processor reads many words that it need not read, resulting in wasted work, whereas in
the second approach every read is useful in that it results in a computation that contributes to the final
answer. Algorithm (b) is also scalable, since εb remains constant as p and n increase in proportion, and
approaches one as n increases for a fixed p.
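
The comparison of the two approaches can be checked numerically. The following is a minimal Python sketch,
assuming the cost model used above (cost c per word read or sent, additions free), under which approach (a)
has efficiency 1/p and approach (b) has efficiency 1/(1 + 2p/(nf)). The function names are illustrative and
do not come from the original text.

```python
def efficiency_a(p: int) -> float:
    """Approach (a): every processor reads all n*m*f words, so efficiency is 1/p."""
    return 1.0 / p


def efficiency_b(n: int, f: float, p: int) -> float:
    """Approach (b): parallel time c*n*m*f/p + 2*c*m gives 1 / (1 + 2p/(n*f))."""
    return 1.0 / (1.0 + 2.0 * p / (n * f))


n, f = 10_000, 10  # documents and average occurrences per word, as in the text
for p in (10, 100, 1_000, 5_000):
    print(f"p={p:>5}  eff_a={efficiency_a(p):.4f}  eff_b={efficiency_b(n, f, p):.4f}")
```

Under these assumptions approach (a) degrades as soon as p grows, while approach (b) stays close to one
until p becomes comparable to nf/2.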

THE MAPREDUCE MODEL


Traditional parallel computing algorithms were developed for systems with a small number of processors,
dozens rather than thousands. So it was safe to assume that processors would not fail during a computation.
At significantly larger scales this assumption breaks down, as was experienced at Google in the course of
having to carry out many large-scale computations similar to the one in our word counting example. The
MapReduce parallel programming abstraction was developed in response to these needs, so that it could be
used by many different parallel applications while leveraging a common underlying fault-tolerant
implementation that was transparent to application developers. Figure 11.1 illustrates MapReduce using the
word counting example where we needed to count the occurrences of each word in a collection of
documents.
MapReduce proceeds in two phases, a distributed 'map' operation followed by a distributed 'reduce'
operation; at each phase a configurable number of M 'mapper' processors and R 'reducer' processors are
assigned to work on the problem (we have used M = 3 and R = 2 in the illustration). The computation is
coordinated by a single master process (not shown in the figure).
A MapReduce implementation of the word counting task proceeds as follows: In the map phase each
mapper reads approximately 1/Mth of the input (in this case documents) from the global file system, using
locations given to it by the master. Each mapper then performs a 'map' operation to compute word
frequencies for its subset of documents. These frequencies are sorted by the words they represent and
written to the local file system of the mapper. In the next phase reducers are each assigned a subset of
words; in our illustration the first reducer is assigned w1 and w2 while the second one handles w3 and w4.
In fact, during the map phase itself each mapper writes one file per reducer, based on the words assigned to
each reducer, and keeps the master informed of these file locations. The master in turn informs the reducers
where the partial counts for their words have been stored on the local files of the respective mappers; the
reducers then make remote procedure call requests to the mappers to fetch these. Each reducer performs a
'reduce' operation that sums up the frequencies for each word, which are finally written back to the GFS.
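
The flow described above can be imitated in a single process. The following is a minimal Python sketch, not
Google's or Hadoop's implementation: the "files" are in-memory dictionaries, the master is implicit, and the
assignment of words to reducers uses a CRC32 hash, which is an assumption made here for illustration. All
function names are hypothetical.

```python
import zlib
from collections import Counter, defaultdict


def partition(word: str, num_reducers: int) -> int:
    """Assign a word to a reducer; a stable hash stands in for the master's assignment."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers


def run_mapper(documents, num_reducers):
    """Count words in this mapper's documents and write one sorted 'file' per reducer."""
    counts = Counter(word for doc in documents for word in doc.split())
    files = [dict() for _ in range(num_reducers)]
    for word in sorted(counts):
        files[partition(word, num_reducers)][word] = counts[word]
    return files  # stands in for files on the mapper's local disk


def run_reducer(reducer_id, mapper_outputs):
    """Fetch this reducer's file from every mapper and sum the partial counts."""
    totals = defaultdict(int)
    for files in mapper_outputs:
        for word, count in files[reducer_id].items():
            totals[word] += count
    return dict(totals)


def word_count(documents, num_mappers=3, num_reducers=2):
    # Each mapper receives roughly 1/M of the input documents.
    splits = [documents[i::num_mappers] for i in range(num_mappers)]
    mapper_outputs = [run_mapper(split, num_reducers) for split in splits]
    result = {}
    for reducer_id in range(num_reducers):
        result.update(run_reducer(reducer_id, mapper_outputs))
    return result


docs = ["w1 w2 w4", "w1 w3", "w2 w2 w4", "w1 w3 w3", "w4 w4", "w2 w3"]
print(sorted(word_count(docs).items()))
```

The result does not depend on the choice of M and R, only on the input documents.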
The MapReduce programming model generalizes the computational structure of the above example.
Each map operation consists of transforming one set of key-value pairs to another:
Map: (k1, v1) → [(k2, v2)] ……………………………… (11.4)
In our example each map operation takes a document indexed by its id and emits a list of word-count pairs
indexed by word-id: (dk, [w1 … wn]) → [(wi, ci)]. The reduce operation groups the results of the map step
using the same key k2 and performs a function f on the list of values that correspond to each key value:
Reduce: (k2, [v2]) → (k2, f([v2])) ………………………. (11.5)

In our example each reduce operation sums the frequency counts for each word:
Reduce: (wi, [ci]) → (wi, Σ ci)
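
The general model, with a map function of the form (k1, v1) → [(k2, v2)] and a reduce function applied to the
list of values of each key k2, can be written as a small sequential driver. This is a minimal Python sketch;
the names map_reduce and count_words are hypothetical, and grouping is done in memory rather than through
distributed files.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Apply map_fn to every (k1, v1), group the emitted pairs by k2, then reduce each group."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return {k2: reduce_fn(values) for k2, values in groups.items()}


def count_words(doc_id, words):
    """Map: (dk, [w1 ... wn]) -> [(wi, ci)], counting words within one document."""
    counts = defaultdict(int)
    for word in words:
        counts[word] += 1
    return list(counts.items())


documents = [("d1", ["w1", "w2", "w1"]), ("d2", ["w2", "w3"])]
frequencies = map_reduce(documents, count_words, sum)  # Reduce: sum the counts per word
print(frequencies)
```

Passing a different pair of functions reuses the same driver for other computations, which is the point of the
abstraction.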

The implementation also generalizes. Each mapper is assigned an input-key range (set of values for k1) on
which map operations need to be performed. The mapper writes results of its map operations to its local disk
in R partitions, each corresponding to the output-key range (values of k2) assigned to a particular reducer,
and informs the master of these locations. Next each reducer fetches these pairs from the respective mappers
and performs reduce operations for each key k2 assigned to it. If a processor fails during the execution, the
master detects this through regular heartbeat communications it maintains with each worker, wherein
updates are also exchanged regarding the status of tasks assigned to workers.
If a mapper fails, then the master reassigns the key-range designated to it to another working node
for re-execution. Note that re-execution is required even if the mapper had completed some of its map
operations, because the results were written to local disk rather than the GFS. On the other hand if a reducer
fails only its remaining tasks (values k2) are reassigned to another node, since the completed tasks would
already have been written to the GFS.
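
The two reassignment rules can be stated compactly in code. The following is a minimal Python sketch of the
master's bookkeeping, assuming a simple heartbeat timeout; the Worker record, the timeout value and the task
names are hypothetical and only illustrate the rules described above.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    role: str              # "mapper" or "reducer"
    last_heartbeat: float  # time of the most recent heartbeat, in seconds
    completed: set = field(default_factory=set)
    pending: set = field(default_factory=set)


def failed_workers(workers, now, timeout):
    """A worker is presumed failed when its last heartbeat is older than the timeout."""
    return [w for w in workers if now - w.last_heartbeat > timeout]


def tasks_to_reassign(worker):
    """Mappers lose completed work too (it was on local disk); reducers lose only pending work."""
    if worker.role == "mapper":
        return worker.completed | worker.pending
    return set(worker.pending)


workers = [
    Worker("m1", "mapper", last_heartbeat=100.0, completed={"k1-a"}, pending={"k1-b"}),
    Worker("m2", "mapper", last_heartbeat=118.0, completed={"k1-c"}),
    Worker("r1", "reducer", last_heartbeat=95.0, completed={"k2-a"}, pending={"k2-b"}),
]
for w in failed_workers(workers, now=120.0, timeout=10.0):
    print(w.name, sorted(tasks_to_reassign(w)))
```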
Finally, heartbeat failure detection can be fooled by a wounded task that has a heartbeat but is
making no progress. Therefore, the master also tracks the overall progress of the computation, and if results
from the last few processors in either phase are excessively delayed, these tasks are duplicated and assigned
to processors that have already completed their work. The master declares the task completed when any one
of the duplicate workers completes.

The MapReduce model is widely applicable to a number of parallel computations, including
database-oriented tasks which we cover later. Indexing large collections is not only important in web search,
but also a critical aspect of handling structured data; so it is important to know that it can be executed
efficiently in parallel using MapReduce. Traditional parallel databases focus on rapid query execution
against data warehouses that are updated infrequently; as a result these systems often do not parallelize
index creation sufficiently well.

PARALLEL EFFICIENCY OF MAPREDUCE


As we have seen earlier, parallel efficiency is impacted by overheads such as synchronization and
communication costs, or load imbalance. The MapReduce master process is able to balance load efficiently
if the number of map and reduce operations is significantly larger than the number of processors. For large
data sets this is usually the case (since an individual map or reduce operation usually deals with a single
document or record). However, communication costs in the distributed file system can be significant,
especially when the volume of data being read, written and transferred between processors is large.

For the purposes of our analysis we assume a general computational task, on a volume of data D,
which takes wD time on a uniprocessor, including the time spent reading data from disk, performing
computations, and writing it back to disk (i.e. we assume that computational complexity is linear in the size
of data). Let c be the time spent reading one unit of data (such as a word) from disk. Further, let us assume
that our computational task can be decomposed into map and reduce stages as follows: First c_m D
computations are performed in the map stage, producing σD data as output. Next the reduce stage performs
c_r σD computations on the output of the map stage, producing σμD data as the final result. Finally, we
assume that our decomposition into map and reduce stages introduces no additional overheads when run
on a single processor, such as having to write intermediate results to disk, and so
wD = cD + c_m D + c_r σD + cσμD ………………………………… (11.6)
Now consider running the decomposed computation on P processors that serve as both mappers and
reducers in the respective phases of a MapReduce-based parallel implementation. As compared to the single
processor case, the additional overhead in a parallel MapReduce implementation is between the map and
reduce phases, where each mapper writes to its local disk, followed by each reducer remotely reading from
the local disk of each mapper. For the purposes of our analysis we shall assume that the time spent reading a
word from a remote disk is also c, i.e. the same as for a local read. Each mapper produces approximately
σD/P data that is written to a local disk (unlike in the uniprocessor case), which takes cσD/P time. Next,
after the map phase, each reducer needs to read its partition of data from each of the P mappers; each
reducer reads approximately one Pth of the data held at each mapper, i.e. σD/P². The entire exchange can be
executed in P steps. Thus the transfer time is cσD/P² × P = cσD/P. The total overhead in the parallel
implementation because of intermediate disk writes and reads is therefore 2cσD/P, so the parallel time is
wD/P + 2cσD/P. We can now compute the parallel efficiency of the MapReduce implementation as:
εMR = wD / (P × (wD/P + 2cσD/P)) = 1 / (1 + 2cσ/w) ………………………… (11.7)

Let us validate (11.7) above for our parallel word counting example discussed in Section 11.1: The volume
of data is D = nmf. We ignore the time spent in adding word counts, so c_r = c_m = 0. We also did not include
the (small) time c × m for writing the final result to disk. So wD = wnmf = cnmf, or w = c. The map phase
produces mp partial counts, so σ = mp/nmf = p/nf. Using (11.7) and c = w we reproduce (11.3) as computed
earlier. It is important to note how εMR depends on σ, the 'compression' in data achieved in the map phase,
and its relation to the number of processors p. To illustrate this dependence, let us recall the definition (11.4)
of a map operation, as applied to the word counting problem, i.e. (dk, [w1 … wn]) → [(wi, ci)]. Each map
operation takes a document as input and emits a partial count for each word in that document alone, rather
than a partial sum across all the documents it sees. In this case the output of the map phase is of size mn (an
m-vector of counts for each document). So σ = mn/nmf = 1/f and the parallel efficiency is 1/(1 + 2/f),
independent of data size or number of processors. Since this efficiency does not approach one as the data
volume grows, such an implementation is not scalable.
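
The dependence of efficiency on σ can be made concrete. The following is a minimal Python sketch, assuming
a parallel time of wD/P plus the 2cσD/P overhead derived above, which gives an efficiency of
1/(1 + 2cσ/w). The function name and the sample values of n, f and p are illustrative.

```python
def mapreduce_efficiency(w: float, c: float, sigma: float) -> float:
    """Efficiency when parallel time is w*D/P + 2*c*sigma*D/P, i.e. 1 / (1 + 2*c*sigma/w)."""
    return 1.0 / (1.0 + 2.0 * c * sigma / w)


n, f, p = 10_000, 10, 5_000
c = w = 1.0  # word counting: additions are ignored, so w equals c

# Partial sums across all documents seen by a mapper: sigma = p / (n * f)
with_partial_sums = mapreduce_efficiency(w, c, sigma=p / (n * f))

# One count vector per document: sigma = 1 / f, regardless of n and p
per_document = mapreduce_efficiency(w, c, sigma=1 / f)

print(f"partial sums per mapper: {with_partial_sums:.3f}")
print(f"counts per document:     {per_document:.3f}")
```

With per-document output the efficiency is stuck at 1/(1 + 2/f) however large the data set becomes, whereas
with per-mapper partial sums it improves as n grows.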

A strict implementation of MapReduce as per the definitions (11.4) and (11.5) does not allow for partial
reduction across all input values seen by a particular mapper, which is what enabled the parallel
implementation of Section 11.1 to be highly efficient and scalable. Therefore, in practice the map phase
usually includes a combine operation in addition to the map, defined as follows:
Combine: (k2, [v2]) → (k2, fc([v2])) …………………………. (11.8)
The function fc is similar to the function f in the reduce operation but is applied only across documents
processed by each mapper, rather than globally. The equivalence of a MapReduce implementation with and
without a combiner step relies on the reduce function f being commutative and associative,
i.e. f(v1, v2, v3) = f(v3, f(v1, v2)).
Finally, recall our definition of a scalable parallel implementation: a MapReduce implementation is scalable
if we are able to achieve an efficiency that approaches one as data volume D grows, and remains constant as
D and P both increase. Using combiners is crucial to achieving scalability in practical MapReduce
implementations: they achieve a high degree of data 'compression' in the map phase, so that σ is
proportional to P/D, which in turn results in scalability due to (11.7).
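
The effect of a combiner can be seen by counting the pairs that leave the map phase. The following is a
minimal Python sketch for the word counting example, assuming two mappers with two documents each; the
function names and the sample documents are hypothetical.

```python
from collections import Counter


def map_document(words):
    """Strict map: emit one (word, count) pair per distinct word of a single document."""
    return list(Counter(words).items())


def combine(pairs):
    """Combiner: sum counts per word across all documents handled by one mapper."""
    combined = Counter()
    for word, count in pairs:
        combined[word] += count
    return list(combined.items())


def reduce_counts(pairs):
    """Reduce: sum all partial counts per word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def run(mapper_inputs, use_combiner):
    shuffled = []  # pairs that must be written and transferred to reducers
    for documents in mapper_inputs:
        pairs = [pair for doc in documents for pair in map_document(doc)]
        if use_combiner:
            pairs = combine(pairs)
        shuffled.extend(pairs)
    return reduce_counts(shuffled), len(shuffled)


mapper_inputs = [
    [["w1", "w2", "w1"], ["w1", "w3"]],  # documents seen by mapper 1
    [["w2", "w2"], ["w1", "w2", "w3"]],  # documents seen by mapper 2
]
plain, plain_pairs = run(mapper_inputs, use_combiner=False)
combined, combined_pairs = run(mapper_inputs, use_combiner=True)
print(plain == combined, plain_pairs, combined_pairs)
```

Because addition is commutative and associative, both runs produce the same totals; only the volume of
intermediate data differs, and the gap widens as each mapper processes more documents.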

RELATIONAL OPERATIONS USING MAPREDUCE


Enterprise applications rely on structured data processing, which over the years has become virtually
synonymous with the relational data model and SQL. Traditional parallel databases have become fairly
sophisticated in automatically generating parallel execution plans for SQL statements. At the same time
these systems lack the scale and fault-tolerance properties of MapReduce implementations, naturally
motivating the quest to execute SQL statements on large data sets using the MapReduce model.
Parallel joins in particular are well studied, and so it is instructive to examine how a relational join
could be executed in parallel using MapReduce. Figure 11.2 illustrates such an example: point-of-sale
transactions taking place at stores (identified by addresses) are stored in a Sales table. A Cities table
captures the addresses that fall within each city. In order to compute the gross sales by city these two tables
need to be joined using SQL as shown in the figure.
The MapReduce implementation works as follows: In the map step, each mapper reads a (random)
subset of records from each input table, Sales and Cities, and segregates each of these by address, i.e. the
reduce key k2 is 'address'. Next each reducer fetches Sales and Cities data for its assigned range of address
values from each mapper, and then performs a local join operation including the aggregation of sale value
and grouping by city. Note that since addresses are randomly assigned to reducers, sales aggregates for any
particular city will still be distributed across reducers. A second MapReduce step is needed to group the
results by city and compute the final sales aggregates.
Parallel SQL implementations usually distribute the smaller table, Cities in this case, to all processors.
As a result, local joins and aggregations can be performed in the first map phase itself, followed by a reduce
phase using city as the key, thus obviating the need for two phases of data exchange.
Naturally there have been efforts at automatically translating SQL-like statements to a map-reduce
framework. Two notable examples are Pig Latin, developed at Yahoo!, and Hive, developed and used at
Facebook. Both of these are open source tools available as part of the Hadoop project, and both leverage the
Hadoop distributed file system HDFS. Figure 11.3 illustrates how the above SQL query can be represented
using the Pig Latin language as well as the HiveQL dialect of SQL. Pig Latin has features of an imperative
language, wherein a programmer specifies a sequence of transformations that each read and write large
distributed files. The Pig Latin compiler generates MapReduce phases by treating each GROUP (or
COGROUP) statement as defining a map-reduce boundary, and pushing remaining statements on either side
into the map or reduce steps.

HiveQL, on the other hand, shares SQL's declarative syntax. Once again though, as in Pig Latin,
each JOIN and GROUP operation defines a map-reduce boundary. As depicted in the figure, the Pig Latin as
well as HiveQL representations of our SQL query translate into two MapReduce phases similar to our
example of Figure 11.2.

Pig Latin is ideal for executing sequences of large-scale data transformations using MapReduce. In the
enterprise context it is well suited for the tasks involved in loading information into a data warehouse.
HiveQL, being more declarative and closer to SQL, is a good candidate for formulating analytical queries on
a large distributed data warehouse.
There has been considerable interest in comparing the performance of MapReduce-based
implementations of SQL queries with that of traditional parallel databases, especially specialized column-
oriented databases tuned for analytical queries. In general, as of this writing, parallel databases are still faster
than available open source implementations of MapReduce (such as Hadoop), for smaller data sizes using
fewer processes where fault tolerance is less critical. MapReduce-based implementations, on the other hand,
are able to handle orders of magnitude larger data using massively parallel clusters in a fault-tolerant
manner. MapReduce is also preferable over traditional databases if data needs to be processed only once and
then discarded: As an example, the time required to load some large data sets into a database is 50 times
greater than the time to both read and perform the required analysis using MapReduce. On the contrary, if
data needs to be stored for a long time, so that queries can be performed against it regularly, a traditional
database wins over MapReduce, at least as of this writing. HadoopDB is an attempt at combining the
advantages of MapReduce and relational databases by using databases locally within nodes while using
MapReduce to coordinate parallel execution. Another example is SQL/MR from Aster Data that enhances a
set of distributed SQL-compliant databases with MapReduce programming constructs. Needless to say,
relational processing using MapReduce is an active research area and many improvements to the available
state of the art are to be expected in the near future.

PIG:
Pig was originally developed at Yahoo Research around 2006 to give researchers an ad-hoc way of
creating and executing map-reduce jobs on very large data sets. In 2007, it was moved into the Apache
Software Foundation.
Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this
platform is called Pig Latin. Pig Latin is a high-level data flow scripting language that enables data
workers to write complex data transformations without knowing Java. Pig Latin can be extended using UDFs
(User Defined Functions), which the user can write in Java, Python, JavaScript, Ruby or Groovy and then
call directly from the language. Pig works with data from many sources, including structured and
unstructured data, and stores the results into the Hadoop Distributed File System (HDFS).
PIG ARCHITECTURE:
The Pig Latin compiler converts the Pig Latin code into executable code. The executable code is in the form
of MapReduce jobs. The sequence of MapReduce programs enables Pig programs to do data processing and
analysis in parallel, leveraging Hadoop MapReduce and HDFS.
Pig programs can run on MapReduce v1 or MapReduce v2 without any code changes, regardless of what
mode your cluster is running. Pig scripts can also run using the Tez API instead. Apache Tez
provides a more efficient execution framework than MapReduce. YARN enables application frameworks
other than MapReduce (like Tez) to run on Hadoop.

Figure: Pig architecture.


Executing Pig
Pig has two modes for executing Pig scripts:
1) Local mode: When you run Pig in local mode, the Pig program runs in the context of a local Java
Virtual Machine, and data access is via the local file system of a single machine. To run Pig in local mode,
type pig -x local in a terminal.
2) MapReduce mode (also known as Hadoop mode): MapReduce mode runs on a Hadoop cluster and converts
the Pig statements into MapReduce code. To run Pig in MapReduce mode, type simply pig or
pig -x mapreduce.

Pig programs can be packaged in three different ways:


✓ Script: This method is nothing more than a file containing Pig Latin commands, identified by the .pig
suffix (FlightData.pig, for example).
✓ Grunt: Grunt acts as a command interpreter where you can interactively enter Pig Latin at the Grunt
command line and immediately see the response.
✓ Embedded: Pig Latin statements can be executed within Java, Python, or JavaScript programs.
PIG DATA TYPES:
Pig's data types can be divided into two categories: scalar types, which contain a single value, and complex
types, which contain other types.
SCALAR TYPES
Table: Atomic data types in Pig Latin
Type         Description
int          Signed 32-bit integer
long         Signed 64-bit integer
float        32-bit floating point
double       64-bit floating point
chararray    Character array (string) in Unicode UTF-8
bytearray    Byte array (binary object)
COMPLEX DATA TYPES:
Tuple, e.g. (12.5,hello world,-2): A tuple is an ordered set of fields. It is most often used as a row in a
relation. It is represented by fields separated by commas, all enclosed by parentheses.
Bag, e.g. {(12.5,hello world,-2),(2.87,bye world,10)}: A bag is an unordered collection of tuples. A relation
is a special kind of bag, sometimes called an outer bag. An inner bag is a bag that is a field within some
complex type. A bag is represented by tuples separated by commas, all enclosed by curly brackets. Tuples
in a bag aren't required to have the same schema or even the same number of fields. It's a good idea for
them to match, though, unless you're handling semistructured or unstructured data.
Map, e.g. [key#value]: A map is a set of key/value pairs. Keys must be unique and must be strings
(chararray). The value can be any type.
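
For readers who know Python, the three complex types have rough counterparts in built-in containers. The
following is only an analogy, not Pig Latin syntax; the helper functions are hypothetical and merely check
the two properties mentioned above (matching tuple sizes in a bag, string keys in a map).

```python
# Tuple: an ordered set of fields, often one row of a relation.
row = (12.5, "hello world", -2)

# Bag: a collection of tuples; a list is used here although a bag has no defined order.
bag = [(12.5, "hello world", -2), (2.87, "bye world", 10)]

# Map: unique string keys, values of any type (including tuples and bags).
record = {"greeting": row, "history": bag}


def same_arity(tuples):
    """True when every tuple in the bag has the same number of fields."""
    return len({len(t) for t in tuples}) <= 1


def has_string_keys(mapping):
    """True when every key is a string, as Pig requires chararray keys."""
    return all(isinstance(key, str) for key in mapping)


print(same_arity(bag), has_string_keys(record))
```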
Input and Output
Before you can do anything of interest, you need to be able to add inputs and outputs to your data flows.
Load: alias = LOAD 'file' [USING function] [AS schema];
Load data from a file into a relation. Uses the PigStorage load function as default unless specified otherwise
with the USING option. The data can be given a schema using the AS option.
Store: STORE alias INTO 'directory' [USING function];
Store data from a relation into a directory. The directory must not exist when this command is executed. Pig
will create the directory and store the relation in files named part-nnnnn in it. Uses the PigStorage store
function as default unless specified otherwise with the USING option.
Dump: DUMP alias;
Display the content of a relation. Used mainly for debugging. The relation should be small enough for
printing on screen. You can apply the LIMIT operation on an alias to make sure it's small enough for
display.
HIVE:
One of the biggest ingredients in the Information Platform built by Jeff Hammerbacher's team at Facebook
in 2007 was Hive, a framework for data warehousing on top of Hadoop. Hive was open sourced in August
2008.
• Hive provides an SQL-oriented query language called HiveQL, or simply HQL.
• Hive saves you from having to write the MapReduce programs yourself.
• Like most SQL dialects, HiveQL does not conform to the ANSI SQL standard, and it differs in various
ways from the familiar SQL dialects provided by Oracle, MySQL, and SQL Server.
• Hive is most suited for data warehouse applications, where relatively static data is analyzed, fast
response times are not required, and the data is not changing rapidly.
• Hive is not a full database, because the design constraints and limitations of Hadoop and HDFS impose
limits on what Hive can do.
• The biggest limitation is that Hive does not provide record-level update, insert, or delete.
HIVE ARCHITECTURE:
CLI: The command line interface to Hive (the shell). This is the default service.
HiveServer: Runs Hive as a server exposing a Thrift service, enabling access from a range of clients written
in different languages. Applications using the Thrift, JDBC, and ODBC connectors need to run a Hive server
to communicate with Hive.
Metastore Database: The metastore is the central repository of Hive metadata.
The Hive Web Interface (HWI): As an alternative to the shell, you might want to try Hive's simple web
interface. Start it using the following commands:
Hive clients: If you run Hive as a server (hive --service hiveserver), then there are a number of different
mechanisms for connecting to it from applications. The relationship between Hive clients and Hive services
is illustrated in the Hive architecture figure.
Thrift Client: The Hive Thrift Client makes it easy to run Hive commands from a wide range of
programming languages. Thrift bindings for Hive are available for C++, Java, PHP, Python, and Ruby. They
can be found in the src/service/src subdirectory in the Hive distribution.
JDBC Driver: Hive provides a Type 4 (pure Java) JDBC driver, defined in the class
org.apache.hadoop.hive.jdbc.HiveDriver. When configured with a JDBC URI of the form
jdbc:hive://host:port/dbname, a Java application will connect to a Hive server running in a separate process
at the given host and port.
ODBC Driver : The Hive ODBC Driver allows applications that support the ODBC protocol to connect to
Hive.

We can create two types of tables in Hive:
Managed Tables: Hive manages both the table definition and the data loaded into the table. If we drop a
managed table, the data is deleted along with the table.
External Tables: Hive manages only the table definition. If we drop an external table, only the table
definition is dropped and the underlying data remains intact.
Hive supports many of the primitive data types you find in relational databases, as well as three
collection data types that are rarely found in relational databases.
Primitive Data Types: Hive supports several sizes of integer and floating-point types, a Boolean type, and
character strings of arbitrary length. Hive v0.8.0 added types for timestamps and binary fields.
Note that Hive does not support "character arrays" (strings) with maximum-allowed lengths, as is
common in other SQL dialects.
Collection Data Types: Hive supports columns that are structs, maps, and arrays. Note that the literal
syntax examples in Table 3-2 are actually calls to built-in functions.
Table 3-2. Collection data types
STRUCT
    Description: Analogous to a C struct or an "object". Fields can be accessed using the "dot" notation.
    For example, if a column name is of type STRUCT {first STRING; last STRING}, then the first name
    field can be referenced using name.first.
    Literal syntax example: struct('John', 'Doe')
MAP
    Description: A collection of key-value tuples, where the fields are accessed using array notation
    (e.g., ['key']). For example, if a column name is of type MAP with key→value pairs 'first'→'John' and
    'last'→'Doe', then the last name can be referenced using name['last'].
    Literal syntax example: map('first', 'John', 'last', 'Doe')
ARRAY
    Description: Ordered sequences of the same type that are indexable using zero-based integers. For
    example, if a column name is of type ARRAY of strings with the value ['John', 'Doe'], then the second
    element can be referenced using name[1].
    Literal syntax example: array('John', 'Doe')
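
The three access patterns have close counterparts in Python, which may help when reading HiveQL
expressions. This is an analogy only, not HiveQL; the Name type is a hypothetical stand-in for the STRUCT
column described above.

```python
from collections import namedtuple

# STRUCT {first STRING; last STRING}: fields accessed with dot notation.
Name = namedtuple("Name", ["first", "last"])
name_struct = Name("John", "Doe")

# MAP: fields accessed by key.
name_map = {"first": "John", "last": "Doe"}

# ARRAY: elements of one type accessed by zero-based index.
name_array = ["John", "Doe"]

print(name_struct.first, name_map["last"], name_array[1])
```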
HBASE:
• The HBase project was started toward the end of 2006 by Chad Walters and Jim Kellerman at
Powerset.
• HBase is the Hadoop database, modeled after Google's Bigtable.
• Chang et al. describe a Bigtable as "a sparse, distributed, persistent multi-dimensional sorted map"
for structured data; the same description applies to HBase.
• Sparse means that many cells in a row may be empty, and empty cells take up no storage.
• Distributed means that the storage of the data is spread out across commodity hardware.
• Persistent means that the data will be saved.
• Multidimensional in HBase means that for a particular cell, there can be multiple versions of
it.
• Sorted map means that each value is obtained by providing a key, and that entries are kept sorted by key.
• The first HBase release was bundled as part of Hadoop 0.15.0 in October 2007.
• In May 2010, HBase graduated from a Hadoop sub-project to become an Apache top-level project.
• Facebook, Twitter, and other leading websites use HBase for their Big Data.
• HBase is a distributed column-oriented database built on top of HDFS.
• Use HBase for random, real-time read/write access to your datasets.
• HBase and its native API are written in Java, but you do not have to use Java to access its API.

HBASE ARCHITECTURE:
In HBase Architecture, we have the Master and multiple region servers consisting of multiple regions.

HBase Master
• Responsible for managing region servers and their locations
– Assigns regions to region servers
– Rebalances regions to accommodate workloads
– Recovers if a region server becomes unavailable
– Uses ZooKeeper, a distributed coordination service
• Doesn't actually store or read data

Region Server
• Each region server is basically a node within your cluster.
• You would typically have at least 3 region servers in your distributed environment.
• A region server contains one or more regions.
• Each region holds a contiguous range of rows of a table; within a region, the data of each column family
is kept in its own store.
• The files are primarily handled by the HRegionServer.
• The HLog is shared between all the stores of an HRegionServer.

WAL
The Write-Ahead Log is kept in the HLog file. If your data exists only in the MemStore and the
system fails before the data is flushed to a file, all of that data is lost. On a large system the chances of
this are high, because you will have hundreds of commodity machines. The WAL prevents data loss if this
should happen.

HRegion:
A region consists of one or more column families with a range of row keys.
The default size of a region is 256 MB.

Data Storage
• Data is stored in files called HFiles/StoreFiles
• An HFile is basically a key-value map
• When data is added, it is written to a log called the Write-Ahead Log (WAL) and is also stored in memory
(MemStore)
• Flush: when the in-memory data exceeds a maximum size, it is flushed to an HFile
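
The write path described above (log first, then memory, then flush to a file) can be modelled in a few lines.
The following is a toy Python sketch, not HBase code: the class and method names are hypothetical, the flush
threshold counts entries rather than bytes, and discarding the log on flush is a simplification.

```python
class Store:
    """Toy model of one store: WAL + MemStore + flushed HFiles."""

    def __init__(self, max_memstore_entries=2):
        self.wal = []        # write-ahead log; assumed to survive a crash
        self.memstore = {}   # in-memory data; lost on a crash
        self.hfiles = []     # flushed, sorted, immutable key-value files
        self.max_memstore_entries = max_memstore_entries

    def put(self, key, value):
        self.wal.append((key, value))  # log the edit first
        self.memstore[key] = value     # then apply it in memory
        if len(self.memstore) > self.max_memstore_entries:
            self.flush()

    def flush(self):
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.wal = []  # simplification: flushed edits no longer need replaying

    def crash_and_recover(self):
        self.memstore = {}           # memory contents are gone
        for key, value in self.wal:  # replay the logged edits
            self.memstore[key] = value

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):  # newest file first
            for k, v in hfile:
                if k == key:
                    return v
        return None


store = Store()
for i in range(1, 5):
    store.put(f"row{i}", f"val{i}")
store.crash_and_recover()
print(store.get("row1"), store.get("row4"), len(store.hfiles))
```

The unflushed row4 is still readable after the simulated crash only because its edit was logged before being
applied in memory.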


HBase Data Model
• Data is stored in tables
• Tables contain rows
– Rows are referenced by a unique key
• The key is an array of bytes.
• Rows are made of columns, which are grouped in column families
• Data is stored in cells
– Identified by row x column-family x column
– A cell's content is also an array of bytes.
HBase Families
• Columns are grouped into families
– Labeled as "family:column"
• Example: "user:first_name"
– A way to organize your data
– Various features are applied to families
HBase Timestamps:
• Cell values are versioned: for each cell multiple versions are kept (3 by default).
• The timestamp is another dimension that identifies your data.
• Timestamps are either assigned by the region server or provided explicitly by the client.
• Versions are stored in decreasing timestamp order.
• The latest version is read first, an optimization for reading the current value.

HBase Cells: (Table, RowKey, Family, Column, Timestamp) → Value

SortedMap<
    RowKey, List<
        SortedMap<
            Column, List<
                Value, Timestamp
            >
        >
    >
>
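
The nested structure above, together with cell versioning, can be mimicked with ordinary dictionaries. The
following is a minimal Python sketch; the class name is hypothetical, keys and values are plain strings
rather than byte arrays, and the limit of 3 versions follows the default mentioned in the text.

```python
class VersionedTable:
    """Toy model: row key -> 'family:column' -> versions, newest first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.rows = {}

    def put(self, row, column, value, timestamp):
        versions = self.rows.setdefault(row, {}).setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # decreasing timestamp order
        del versions[self.max_versions:]                   # drop the oldest versions

    def get(self, row, column, versions=1):
        """Return up to `versions` (timestamp, value) pairs, latest first."""
        return self.rows.get(row, {}).get(column, [])[:versions]

    def scan(self):
        """Yield rows in sorted row-key order."""
        for row in sorted(self.rows):
            yield row, self.rows[row]


table = VersionedTable()
for ts, name in enumerate(["Ann", "Anna", "Anne", "Annie"], start=1):
    table.put("row1", "user:first_name", name, timestamp=ts)
table.put("row0", "user:first_name", "Bob", timestamp=1)

print(table.get("row1", "user:first_name"))
print(table.get("row1", "user:first_name", versions=5))
print([row for row, _ in table.scan()])
```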
CRUD operations in HBase:
These operations cover the basics of HBase's get, put, and delete commands. They are the building blocks
of an HBase application.
Create command: Creates the table with a column family.
hbase(main):001:0> create 'test', 'cf1'
Put command: The put command allows you to put data into HBase. The same command is also used to
update data currently in HBase. Remember that a table can have more than one column family and that each
column family consists of one or more columns.
hbase(main):003:0> put 'test', 'row1', 'cf1', 'val2'
Get command: The get command retrieves data from an HBase table.
hbase(main):003:0> get 'test', 'row1'
Delete command: This deletes a cell from an HBase table. The shell command is lowercase and takes the
table, row, and column.
hbase(main):003:0> delete 'test', 'row1', 'cf1'
Scan operation: Scans could be used for something like counting the occurrences of a hash tag over a given
time period. One thing to remember is that you need to release your scanner instance as soon as you are
done.
hbase(main):004:0> scan 'test'
ROW COLUMN+CELL
row1 column=cf1:, timestamp=1297853125623, value=val2

HBase vs Bigtable:

HBase               Bigtable
Region              Tablet
Region Server       Tablet Server
Write-ahead log     Commit log
HDFS                GFS
Hadoop MapReduce    MapReduce
MemStore            Memtable
HFile               SSTable
ZooKeeper           Chubby

SQOOP:
Big Data storage and analysis tools of the Hadoop ecosystem, such as MapReduce, Hive, HBase, Cassandra, and
Pig, need a way to interact with relational database servers in order to import and export the data
residing in them. Sqoop fills this place in the Hadoop ecosystem by transferring data between relational
database servers and Hadoop's HDFS.
Sqoop: "SQL to Hadoop and Hadoop to SQL"
Sqoop is a command-line interface application for transferring data between relational databases and
Hadoop. Sqoop is used to import data from a relational database management system (RDBMS) such as
MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop
MapReduce, and then export the data back into an RDBMS.
Sqoop automates most of this process, relying on the database to describe the schema for the data to be
imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as
fault tolerance. Microsoft uses a Sqoop-based connector to help transfer data from Microsoft SQL Server
databases to Hadoop. Couchbase, Inc. also provides a Couchbase Server-Hadoop connector by means of
Sqoop.
SQOOP ARCHITECTURE

Sqoop Import
The import tool imports individual tables from RDBMS to HDFS. Each row in a table is treated as a record
in HDFS. All records are stored as text data in text files or as binary data in Avro and Sequence files.
$ sqoop import --connect jdbc:mysql://localhost/userdb --username root --table emp_add -m 1 \
    --target-dir /queryresult
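The parallel operation mentioned above comes from dividing the table among map tasks. With more than one mapper, Sqoop looks up the minimum and maximum of a split column and gives each mapper a slice of that range. The sketch below illustrates the idea only: `split_ranges` and the `id` column are hypothetical names, and this is not Sqoop's own code.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into contiguous chunks.

    Assumes max_id >= min_id and num_mappers >= 1.
    """
    total = max_id - min_id + 1
    num_mappers = min(num_mappers, total)  # never create empty ranges
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for i in range(num_mappers):
        # The first `extra` mappers take one additional key each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


ranges = split_ranges(1, 10, 4)
single = split_ranges(1, 10, 1)
for lo, hi in ranges:
    # Each mapper would run its own bounded query (the `id` column is assumed).
    print(f"SELECT * FROM emp_add WHERE id >= {lo} AND id <= {hi}")
```

With `-m 1`, as in the command above, there is a single range and the whole table is read by one map task.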
Sqoop Export

The export tool exports a set of files from HDFS back to an RDBMS. The target table must already exist in
the database. The files given as input to Sqoop contain records, each of which becomes a row in the table.
Sqoop reads the files and parses them into records using a user-specified delimiter.
$ sqoop export --connect jdbc:mysql://localhost/db --username root --table employee \
    --export-dir /emp/emp_data

FLUME:
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and
moving large amounts of log data from many different sources to a centralized data store. Apache Flume is a
top level project at the Apache Software Foundation. Two release code lines exist, versions 0.9.x and 1.x.
The description here applies to the 1.x code line.
Architecture Of FLUME:
Data flow model
A Flume event is defined as a unit of data flow having a byte payload and an optional set of string
attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an
external source to the next destination (hop). A Flume source consumes events delivered to it by an external
source like a web server. The external source sends events to Flume in a format that is recognized by the
target Flume source. For example, an Avro Flume source can be used to receive Avro events from Avro
clients or other Flume agents in the flow that send events from an Avro sink. When a Flume source receives
an event, it stores it into one or more channels. The channel is a passive store that keeps the event until it's
consumed by a Flume sink. The JDBC channel is one example -- it uses a file system backed embedded
database. The sink removes the event from the channel and puts it into an external repository like HDFS (via
Flume HDFS sink) or forwards it to the Flume source of the next Flume agent (next hop) in the flow. The
source and sink within the given agent run asynchronously with the events staged in the channel.
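The source, channel and sink roles described above can be imitated with a queue and two threads. This is a minimal sketch, not the Flume API: the function names, the `None` sentinel that signals the end of input, and the list standing in for HDFS are all assumptions made for illustration.

```python
import queue
import threading


def source(lines, channel):
    """Turn each external line into an event and store it in the channel."""
    for line in lines:
        # An event is a byte payload plus optional string attributes.
        event = {"headers": {"origin": "web-server"}, "body": line.encode("utf-8")}
        channel.put(event)
    channel.put(None)  # sentinel: no more events (an assumption of this sketch)


def sink(channel, repository):
    """Remove events from the channel and write them to the repository."""
    while True:
        event = channel.get()
        if event is None:
            break
        repository.append(event["body"])


channel = queue.Queue()  # passive store between source and sink
repository = []          # stands in for HDFS or the next agent
lines = ["GET /index.html", "GET /about.html", "POST /login"]

# Source and sink run asynchronously; the channel stages events between them.
producer = threading.Thread(target=source, args=(lines, channel))
consumer = threading.Thread(target=sink, args=(channel, repository))
producer.start()
consumer.start()
producer.join()
consumer.join()
print(repository)
```

Because the channel buffers events, the source never waits for the sink to finish writing, which is the decoupling the text describes.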

Complex flows: Flume allows a user to build multi-hop flows where events travel through multiple agents
before reaching the final destination. It also allows fan-in and fan-out flows, contextual routing and backup
routes (fail-over) for failed hops.
Reliability: The events are staged in a channel on each agent. The events are then delivered to the next
agent or terminal repository (like HDFS) in the flow. The events are removed from a channel only after they
are stored in the channel of the next agent or in the terminal repository. This is how the single-hop message
delivery semantics in Flume provide end-to-end reliability of the flow. Flume uses a transactional approach
to guarantee the reliable delivery of the events. The sources and sinks encapsulate the storage and
retrieval of events, respectively, in a transaction provided by the channel.
This ensures that the set of events is reliably passed from point to point in the flow. In the case of a
multi-hop flow, the sink from the previous hop and the source from the next hop both have their transactions
running to ensure that the data is safely stored in the channel of the next hop.
Recoverability: The events are staged in the channel, which manages recovery from failure. Flume supports
a durable JDBC channel which is backed by a relational database. There's also a memory channel which
simply stores the events in an in-memory queue, which is faster but any events still left in the memory
channel when an agent process dies can't be recovered.
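The rule that an event leaves a channel only after the next hop has stored it can be sketched as a take followed by a commit or a rollback. The class `ToyChannel` and the function `deliver` are hypothetical names used for illustration; this is not Flume's transaction interface.

```python
from collections import deque


class ToyChannel:
    """Toy channel whose take is only final once the transaction commits."""

    def __init__(self):
        self.events = deque()
        self.in_flight = []

    def put(self, event):
        self.events.append(event)

    def take(self):
        # Hand the event out but remember it until commit or rollback.
        event = self.events.popleft()
        self.in_flight.append(event)
        return event

    def commit(self):
        # The next hop has stored the events, so they can be forgotten.
        self.in_flight.clear()

    def rollback(self):
        # Put undelivered events back at the front, in their original order.
        self.events.extendleft(reversed(self.in_flight))
        self.in_flight.clear()


def deliver(channel, next_hop, fail=False):
    """Move one event to the next hop inside a transaction."""
    event = channel.take()
    try:
        if fail:
            raise IOError("next hop unavailable")
        next_hop.append(event)
        channel.commit()
        return True
    except IOError:
        channel.rollback()
        return False


channel = ToyChannel()
channel.put("e1")
channel.put("e2")
next_hop = []

first_try = deliver(channel, next_hop, fail=True)  # event stays in the channel
second_try = deliver(channel, next_hop)            # the same event is delivered
remaining = list(channel.events)
print(first_try, second_try, next_hop, remaining)
```

The failed attempt loses nothing: the event returns to the head of the channel and is delivered on the next attempt. Because this toy channel lives in memory, it would still lose events if the process died, which is the trade-off noted for the memory channel above.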

OOZIE:
Oozie Workflow Overview
Oozie is a server-based workflow engine specialized in running workflow jobs with actions that run
Hadoop Map/Reduce and Pig jobs. Oozie is a Java web application that runs in a Java servlet container.
For the purposes of Oozie, a workflow is a collection of actions (for example Hadoop Map/Reduce jobs or Pig
jobs) arranged in a control dependency DAG (Directed Acyclic Graph). "Control dependency" from one action to
another means that the second action can't run until the first action has completed. In terms of the actions we
can schedule, Oozie supports a wide range of job types, including Pig, Hive, and MapReduce, as well as
jobs coming from Java programs and shell scripts.
An Oozie coordinator job enables us to schedule any workflows we have already created. We
can schedule them to run at specific time intervals, or based on data availability. At an even
higher level, we can create an Oozie bundle job to manage our coordinator jobs. Using a bundle job, we can
easily apply policies to a set of coordinator jobs.
For all three kinds of Oozie jobs (workflow, coordinator, and bundle), we start out by defining them using
individual .xml files, and then we configure them using a combination of properties files and command-line
options.
Writing Oozie workflow definitions
Oozie workflow definitions are written in XML, based on the hPDL (Hadoop Process Definition Language)
schema. This schema is, in turn, based on the XML Process Definition Language (XPDL) schema,
which is a product-independent standard for modeling business process definitions.
Oozie workflow actions start jobs in remote systems (for example Hadoop or Pig). Upon action completion,
the remote system calls back Oozie to notify it that the action has completed. At this point Oozie proceeds
to the next action in the workflow.
Oozie workflows contain control flow nodes and action nodes.

Control flow nodes define the beginning and the end of a workflow (start, end and kill nodes) and provide a
mechanism to control the workflow execution path (decision, fork and join nodes).
Action nodes are the mechanism by which a workflow triggers the execution of a computation/processing
task. Oozie provides support for different types of actions: Hadoop map-reduce, Hadoop file system, Pig,
SSH, HTTP, email and Oozie sub-workflow. Oozie can be extended to support additional types of actions.
To see how this concept looks in practice, Listing 10-1 shows an example of the basic structure
of an Oozie workflow's XML file.
Listing 10-1: A Sample Oozie XML File
<workflow-app name="SampleWorkflow" xmlns="uri:oozie:workflow:0.1">
    <start to="firstJob"/>
    <action name="firstJob">
        <pig>...</pig>
        <ok to="secondJob"/>
        <error to="kill"/>
    </action>
    <action name="secondJob">
        <map-reduce>...</map-reduce>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <end name="end"/>
    <kill name="kill">
        <message>"Killed job."</message>
    </kill>
</workflow-app>
In this example, aside from the start, end, and kill nodes, you have two action nodes. Each action node
represents an application or a command being executed. The next few sections look a bit closer at each node
type.
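One way to see how the transitions in the listing chain together is to parse the XML and follow the start node and each action's ok transition. The sketch below uses Python's standard XML parser. The function name `success_path` is hypothetical, it follows only the success path of the sample workflow, and it is not how the Oozie server itself evaluates a workflow.

```python
import xml.etree.ElementTree as ET

WORKFLOW_XML = """
<workflow-app name="SampleWorkflow" xmlns="uri:oozie:workflow:0.1">
  <start to="firstJob"/>
  <action name="firstJob">
    <pig>...</pig>
    <ok to="secondJob"/>
    <error to="kill"/>
  </action>
  <action name="secondJob">
    <map-reduce>...</map-reduce>
    <ok to="end"/>
    <error to="kill"/>
  </action>
  <end name="end"/>
  <kill name="kill">
    <message>"Killed job."</message>
  </kill>
</workflow-app>
"""

# The elements live in the namespace declared on workflow-app.
NS = {"wf": "uri:oozie:workflow:0.1"}


def success_path(xml_text):
    """Follow the start node and each action's ok transition to the end node."""
    root = ET.fromstring(xml_text.strip())
    actions = {a.get("name"): a for a in root.findall("wf:action", NS)}
    end_name = root.find("wf:end", NS).get("name")
    current = root.find("wf:start", NS).get("to")
    path = []
    while current != end_name:
        path.append(current)
        current = actions[current].find("wf:ok", NS).get("to")
    path.append(end_name)
    return path


path = success_path(WORKFLOW_XML)
print(path)
```

The start node has no name of its own, so the walk begins at the node named in its to attribute and stops when it reaches the node named by the end element.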
Start and end nodes
Each workflow XML file must have one matched pair of start and end nodes. The sole purpose of the start
node is to direct the workflow to the first node, which is done using the to attribute. Because it's the
automatic starting point for the workflow, no name identifier is required.
Action nodes need name identifiers, as the Oozie server uses them to track the current position of the control
flow as well as to specify which action to execute next. The sole purpose of the end node is to provide a
termination point for the workflow. A name identifier is required, but there's no need for a to attribute.
Kill nodes
Oozie workflows can include kill nodes, which are a special kind of node dedicated to handling error
conditions. Kill nodes are optional, and you can define multiple instances of them for cases where you need
specialized handling for different kinds of errors. Action nodes can include error transition tags, which direct
the control flow to the named kill node in case of an error.
You can also direct decision nodes to point to a kill node based on the results of decision predicates, if
needed. Like an end node, a kill node results in the workflow ending, and it does not need a to attribute.
