CC Becse Unit 4 PDF
UNIT 4
BIG DATA AND ANALYTICS
Big Data, Challenges in Big Data, Hadoop: Definition, Architecture, Cloud file systems: GFS and HDFS, BigTable, HBase and
Dynamo, MapReduce and extensions: Parallel computing, The MapReduce model: Parallel efficiency of MapReduce, Relational
operations using MapReduce, Projects in Hadoop: Hive, HBase, Pig, Oozie, Flume, Sqoop
1) DATA VOLUME:- Volume refers to the size of the dataset. It may be in KB, MB, GB, TB, or PB
based on the type of the application that generates or receives the data. Data volume is characterized by the
amount of data that is generated continuously. Different data types come in different sizes. For example, a
blog text is a few kilobytes; voice calls or videos, a few megabytes; sensor data, machine logs, and
clickstream data can be in gigabytes. The following are some examples of data generated by different
sources.
Machine data: Every machine (device) that we use today, from industrial to personal devices, can generate a
lot of data. This data reflects both the usage of these machines and the behavior of their owners. Machine-generated
data is often characterized by a steady pattern of numbers and text, which occurs in a rapid-fire fashion.
There are several sources of machine-generated data; for instance, satellites, radio transmitters, and mobile
devices all transmit signals.
Application log: Another form of machine-generated data is an application log. Different devices generate
logs at different rates and in different formats; for example: CT scanners, X-ray machines, body scanners at airports,
airplanes, ships, military equipment, and commercial satellites.
Clickstream logs: The usage statistics of the web page are captured in clickstream data. This data type
provides insight into what a user is doing on the web page, and can provide data that is highly useful for
behavior and usability analysis, marketing, and general research.
2) DATA VELOCITY:-
Velocity refers to the low-latency, real-time speed at which analytics need to be applied. With the
advent of Big Data, understanding the velocity of data is extremely important, since the value of data often
depends on how quickly it can be analyzed after it is generated.
Bamuengine.com
Documented by Prof. K. V. Reddy Asst.Prof at DIEMS
3) DATA VARIETY:-
Variety refers to the various types of data that can exist, for example, text, audio, video, and photos.
Big Data comes in multiple formats as it ranges from emails to tweets to social media and sensor data. There
is no control over the input data format or the structure of the data. The main processing complexity
associated with a variety of formats lies in ensuring the availability of appropriate metadata for identifying
what is contained in the actual data. This is critical when we process images, audio, video, and large chunks of text. The platform
requirements for processing new formats are:
● Scalability
● Distributed processing capabilities
● Image processing capabilities
● Graph processing capabilities
● Video and audio processing capabilities
TYPES OF DATA:-
Structured data is characterized by a high degree of organization and is typically the kind of data you see
in relational databases or spreadsheets. Because of its defined structure, it maps easily to one of the standard
data types. It can be searched using standard search algorithms and manipulated in well-defined ways.
Semi-structured data (such as what you might see in log files) is a bit more difficult to understand than
structured data. Normally, this kind of data is stored in the form of text files where there is some degree of
order, for example, tab-delimited files, in which columns are separated by a tab character. So instead of being
able to issue a database query for a certain column and know exactly what you're getting back, users
typically need to explicitly assign data types to any data elements extracted from semi-structured data sets.
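For example, a minimal sketch of assigning explicit types to fields extracted from a tab-delimited file; the field names and schema here are hypothetical, chosen only for illustration:

```python
# Parse tab-delimited log lines, explicitly assigning a type to each
# extracted field. The schema below (timestamp, user, bytes) is a
# made-up example, not part of any standard log format.
def parse_line(line, schema):
    values = line.rstrip("\n").split("\t")
    return {name: cast(raw) for (name, cast), raw in zip(schema, values)}

schema = [("timestamp", str), ("user", str), ("bytes", int)]

# The caller, not the file, decides that the third column is an integer.
record = parse_line("2021-03-01T10:15:00\talice\t5120", schema)
```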
Unstructured data has none of the advantages of having structure coded into a data set. Its analysis by way
of more traditional approaches is difficult and costly at best, and logistically impossible at worst. Without a
robust set of text analytics tools, it would be extremely tedious to determine any interesting behavior
patterns.
Challenges in Big-Data:
Acquire: The first and foremost challenge is quickly capturing the high volumes of data generated in many
different formats.
Organize: A big data research platform needs to process massive quantities of data—filtering, transforming
and sorting it before loading it into a data warehouse.
Analyze: The infrastructure required for analyzing big data must be able to support deeper analytics such as
statistical analysis and data mining on a wider variety of data types stored in diverse systems; scale to
extreme data volumes; deliver faster response times; and automate decisions based on analytical models.
Scale: With big data you want to be able to scale very rapidly and elastically. Most of the NoSQL solutions
like MongoDB or HBase have their own scaling limitations.
Performance: In an online world where nanosecond delays can cost you sales, big data must move at
extremely high velocities no matter how much you scale or what workloads your database must perform.
Continuous Availability: When you rely on big data to feed your essential, revenue-generating 24/7
business applications, even high availability is not high enough. Your data can never go down. A certain
amount of downtime is built-in to RDBMS and other NoSQL systems.
Data Security: Big data carries some big risks when it contains credit card data, personal ID information
and other sensitive assets. Most NoSQL big data platforms have few if any security mechanisms in place to
safeguard your big data.
HADOOP
DEFINITION: Apache Hadoop is a framework that allows for the distributed processing of large data sets
across clusters of commodity computers using a simple programming model. Apache Hadoop is an open-
source software framework written in Java for distributed storage and distributed processing of very large
data sets on computer clusters built from commodity hardware. The core of Apache Hadoop consists of a
storage part (Hadoop Distributed File System (HDFS)) and a processing part (MapReduce).
Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity
machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN – a resource-management platform responsible for managing computing resources
in clusters and using them for scheduling of users' applications and
Hadoop MapReduce – a programming model for large scale data processing.
Apache Hadoop's MapReduce and HDFS components were inspired by Google papers on
their MapReduce and Google File System. Apache Hadoop is a registered trademark of the Apache Software
Foundation.
Architecture:
HDFS follows a master-slave architecture: a single NameNode manages the file-system metadata while
multiple DataNodes store the actual data, and these nodes communicate between each other.
HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines. It
achieves reliability by replicating the data across multiple hosts. With the default replication value 3, data
is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other
to rebalance data, to move copies around, and to keep the replication of data high. HDFS stores data by
dividing large files into blocks of 64 MB each by default, though the block size may be configured to other
values such as 128 MB.
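The block arithmetic above can be sketched as a simple calculation (illustrative only, not actual HDFS code):

```python
import math

def hdfs_blocks(file_size_bytes, block_size=64 * 1024 * 1024, replication=3):
    """Number of blocks a file occupies, and total block replicas stored,
    using the default 64 MB block size and replication factor 3."""
    blocks = math.ceil(file_size_bytes / block_size)
    return blocks, blocks * replication

# A 1 GB file splits into 16 blocks of 64 MB; with replication 3,
# the cluster stores 48 block replicas in total.
blocks, replicas = hdfs_blocks(1024 * 1024 * 1024)
```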
Understanding HDFS components
HDFS is managed with the master-slave architecture included with the following components:
• NameNode:
This is the master of the HDFS system. It maintains the directories, files, and manages the blocks that
are present on the DataNodes.
Only one per Hadoop cluster.
Manages the file system namespace and metadata.
A single point of failure, mitigated by writing its state to multiple file systems.
Because it is a single point of failure with large memory requirements, do not use inexpensive
commodity hardware for this node.
• DataNode:
These are slaves that are deployed on each machine and provide actual storage. They are responsible for
serving read-and-write data requests for the clients.
Many per Hadoop cluster.
Manages blocks with data and serves them to clients.
Periodically reports the list of blocks it stores to the NameNode.
Use inexpensive commodity hardware for this node.
• Secondary NameNode: The HDFS file system includes a so-called secondary namenode, a misleading
name that some might incorrectly interpret as a backup namenode for when the primary namenode goes
offline. In fact, the secondary namenode regularly connects with the primary namenode and builds snapshots
of the primary namenode's directory information, which the system then saves to local or remote directories.
These checkpointed images can be used to restart a failed primary namenode without having to replay the
entire journal of file-system actions and then edit the log to create an up-to-date directory structure. Because
the namenode is the single point for storage and management of metadata, it can become a bottleneck when
supporting a huge number of files, especially a large number of small files. HDFS Federation, a new
addition, aims to tackle this problem to a certain extent by allowing multiple namespaces served by separate
namenodes.
Understanding the MapReduce architecture:
MapReduce is also implemented over master-slave architectures. Classic MapReduce contains job
submission, job initialization, task assignment, task execution, progress and status update, and job
completion-related activities, which are mainly managed by the JobTracker node and executed by
TaskTrackers. A client application submits a job to the JobTracker, and the input is divided across the
cluster. The JobTracker then calculates the number of map and reduce tasks to be processed and commands
the TaskTrackers to start executing the job. Each TaskTracker copies the required resources to its local
machine and launches a JVM to run the map and reduce programs over the data. Along with this, the
TaskTracker periodically sends updates to the JobTracker; these can be considered heartbeats that help track
the job ID, job status, and usage of resources.
CLOUD FILE SYSTEMS: GFS AND HDFS
When a client program (a 'cloud application') needs to read/write a file, it sends the full path and
offset to the Master (GFS) which sends back meta-data for one (in the case of read) or all (in the case of
write) of the replicas of the chunk where this data is to be found. The client caches such meta-data so that it
need not contact the Master each time. Thereafter the client directly reads data from the designated chunk
server/DataNode. This data is not cached since most reads are large and caching would complicate writes.
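The read path above can be sketched as follows; the class and method names are hypothetical, chosen only to illustrate client-side caching of chunk metadata:

```python
# A minimal sketch of a GFS/HDFS-style client that caches chunk
# metadata so it need not contact the master on every access.
class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}                  # (path, chunk index) -> replica list

    def locate(self, path, chunk_index):
        key = (path, chunk_index)
        if key not in self.cache:        # contact the master only on a miss
            self.cache[key] = self.master.lookup(path, chunk_index)
        return self.cache[key]           # then read the chunk server directly

class FakeMaster:
    """Stand-in for the metadata master, counting how often it is asked."""
    def __init__(self):
        self.calls = 0
    def lookup(self, path, chunk_index):
        self.calls += 1
        return ["chunkserver-1", "chunkserver-2", "chunkserver-3"]

m = FakeMaster()
c = Client(m)
c.locate("/logs/a", 0)
c.locate("/logs/a", 0)   # second call served from the cache
```

Note that only the metadata is cached; the chunk data itself is not, for the reasons given above.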
BIGTABLE AND HBASE
Since data in each column family is stored together, using this data organization results in efficient
data access patterns depending on the nature of analysis: For example, only the location column family may
be read for traditional data-cube based analysis of sales, whereas only the product column family is needed
for say, market-basket analysis. Thus, the BigTable structure can be used in a manner similar to a column-
oriented database. Figure 10.5 illustrates how BigTable tables are stored on a distributed file system such as
GFS or HDFS. Each table is split into different row ranges, called tablets. Each tablet is managed by a tablet
server that stores each column family for the given row range in a separate distributed file, called an
SSTable. Additionally, a single Metadata table is managed by a meta-data server that is used to locate the
tablets of any user table in response to a read or write request. The Metadata table itself can be large and is
also split into tablets, with the root tablet being special in that it points to the locations of other meta-data
tablets. BigTable and HBase rely on the underlying distributed file systems GFS and HDFS respectively and
therefore also inherit some of the properties of these systems. In particular large parallel reads and inserts are
efficiently supported, even simultaneously on the same table, unlike a traditional relational database. In
particular, reading all rows for a small number of column families from a large table, such as in aggregation
queries, is efficient in a manner similar to column-oriented databases. Similarly, the consistency properties
of large parallel inserts are stronger than that for parallel random writes, as is pointed out in. Further, writes
can even fail if a few replicas are unable to write even if other replicas are successfully updated.
DYNAMO
Dynamo stores multiple versions of each object along with vector timestamps that record, per replica node,
how many writes to the object that node has coordinated. Timestamps produced by successive writes at the
same node are comparable; for example, a second write coordinated by node X carries the timestamp [2 0
0], so an earlier version with timestamp [1 0 0] can be safely discarded. However, if the second write took
place at node Y before the first write had propagated to this replica, it would have a timestamp of [0 1 0]. In
this case even when the first write arrives at Y (and Y's write arrives symmetrically at X), the two versions
[1 0 0] and [0 1 0] would both be maintained and returned to any subsequent read, to be resolved using
application-dependent logic. Say this read took place at node Z and was reconciled by the application, which
then further updated the object; the new timestamp for the object would be set to [1 1 1], and since this
supersedes the other versions they would be discarded once this update was propagated to all replicas. We
mention in passing that such vector-timestamp-based ordering of distributed events was first conceived of by
Lamport in [35].
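The version-reconciliation rule above can be sketched as a simple comparison of vector timestamps (a minimal illustration, not Dynamo's actual implementation):

```python
def dominates(a, b):
    """True if vector timestamp a supersedes b, i.e. a >= b componentwise
    and a != b. If neither dominates, the versions are concurrent and must
    be reconciled by application logic."""
    return all(x >= y for x, y in zip(a, b)) and a != b

# [2 0 0] supersedes [1 0 0], so the older version can be discarded.
# [1 0 0] and [0 1 0] are concurrent: neither dominates, so both versions
# are kept and returned to the next read, as in the example above.
```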
In Dynamo write operations are allowed to return even if all replicas are not updated. However a
quorum protocol is used to maintain eventual consistency of the replicas when a large number of concurrent
reads and writes take place: Each read operation accesses R replicas and each write ensures propagation to
W replicas; as long as R + W > N the system is said to be quorum consistent [14]. Thus, if we want very
efficient writes, we pay the price of having to read many replicas, and vice versa. In practice Amazon uses
N = 3, with R and W configurable depending on what is desired: for a high update frequency one uses
W = 1, R = 3, whereas for a high-performance read store W = 3, R = 1 is used.
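The quorum condition can be illustrated with a one-line check; the settings shown are the Amazon configurations mentioned above:

```python
def quorum_consistent(n, r, w):
    """R + W > N guarantees that every read quorum overlaps every write
    quorum, so a read always sees at least one up-to-date replica."""
    return r + w > n

# Amazon's reported settings with N = 3 replicas:
assert quorum_consistent(3, 3, 1)  # high update frequency: W = 1, R = 3
assert quorum_consistent(3, 1, 3)  # high-performance read store: W = 3, R = 1
assert not quorum_consistent(3, 1, 1)  # R = W = 1 would not be quorum consistent
```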
Dynamo is able to handle transient failures by passing writes intended for a failed node to another
node temporarily. Such replicas are kept separately and scanned periodically with replicas being sent back to
their intended node as soon as it is found to have revived. Finally, Dynamo can be implemented using
different storage engines at the node level, such as Berkeley DB or even MySQL; Amazon is said to use the
former in production.
MAPREDUCE AND EXTENSIONS
The MapReduce programming model was developed at Google in the process of implementing large-scale
search and text processing tasks on massive collections of web data stored using BigTable and the GFS
distributed file system. The MapReduce programming model is designed for processing and generating large
volumes of data via massively parallel computations utilizing tens of thousands of processors at a time. The
underlying infrastructure to support this model needs to assume that processors and networks will fail, even
during a particular computation, and build in support for handling such failures while ensuring progress of
the computations being performed.
Hadoop is an open source implementation of the MapReduce model developed at Yahoo, and
presumably also used internally. Hadoop is also available on pre-packaged AMIs in the Amazon EC2 cloud
platform, which has sparked interest in applying the MapReduce model for large-scale, fault-tolerant
computations in other domains, including such applications in the enterprise context.
PARALLEL COMPUTING
Parallel computing has a long history with its origins in scientific computing in the late 60s and early 70s.
Different models of parallel computing have been used based on the nature and evolution of multiprocessor
computer architectures. The shared-memory model assumes that any processor can access any memory
location, but not equally fast. In the distributed memory model each processor can address only its own
memory and communicates with other processors using message passing over the network. In scientific
computing applications for which these models were developed, it was assumed that data would be loaded
from disk at the start of a parallel job and then written back once the computations had been completed, as
scientific tasks were largely compute bound. Over time, parallel computing also began to be applied in the
database arena; with the advent of storage technologies such as SANs and NAS, database systems supporting
shared-memory, shared-disk, and shared-nothing models became available.
The premise of parallel computing is that a task that takes time T should take time T/p if executed
on p processors. In practice, inefficiencies are introduced by distributing the computations such as
(a) The need for synchronization among processors,
(b) Overheads of communication between processors through messages or disk, and
(c) Any imbalance in the distribution of work to processors.
Thus, in practice the time Tp to execute on p processors is greater than T/p, and the parallel efficiency of an
algorithm is defined as:
€ = T / (p × Tp) ……………………………………………... (11.1)
A scalable parallel implementation is one where:
(a) The parallel efficiency remains constant as the size of data is increased along with a corresponding
increase in processors and
(b) The parallel efficiency increases with the size of data for a fixed number of processors.
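Definition (11.1) can be illustrated with a trivial calculation (the numbers below are illustrative only):

```python
def parallel_efficiency(t_serial, p, t_parallel):
    """Parallel efficiency as defined in (11.1): eps = T / (p * Tp)."""
    return t_serial / (p * t_parallel)

# A perfectly parallel task achieves Tp = T/p and efficiency 1.0;
# synchronization, communication, and load imbalance make Tp larger
# and drive the efficiency below 1.
perfect = parallel_efficiency(100.0, 10, 10.0)    # 1.0
with_overhead = parallel_efficiency(100.0, 10, 12.5)  # 0.8
```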
We illustrate how parallel efficiency and scalability depend on the algorithm, as well as the nature
of the problem, through an example. Consider a very large collection of documents; say web pages crawled
from the entire internet. The problem is to determine the frequency (i.e., total number of occurrences) of
each word in this collection. Thus, if there are n documents and m distinct words, we wish to determine m
frequencies, one for each word.
Now we compare two approaches to compute these frequencies in parallel using p processors:
(a) Let each processor compute the frequencies for m/p words and
(b) Let each processor compute the frequencies of m words across n/p documents, followed by all the
processors summing their results.
At first glance it appears that approach (a), where each processor works independently, may be more
efficient than approach (b), where the processors need to communicate with each other to add up all the
frequencies. However, a more careful analysis reveals otherwise: We assume a distributed-memory model
with a shared disk, so that each processor is able to access any document from disk in parallel with no
contention.
Further, we assume that the time c spent reading each word from disk is the same as that for sending it to
another processor via inter-processor communication. On the other hand, the time to add to a running total of
frequencies is negligible compared to the time spent on a disk read or inter-processor communication, so we
ignore the time taken for arithmetic additions in our analysis.
Finally, assume that each word occurs f times in a document, on average. With these assumptions,
the time for computing all the m frequencies with a single processor is n×m×f ×c, i.e. since each word needs
to be read approximately f times in each document.
Using approach (a), each processor reads approximately n × m × f words (essentially the entire collection)
while computing only m/p of the final frequencies, so the parallel efficiency works out to:
€a = (c × n × m × f) / (p × c × n × m × f) = 1/p ……………………………………………... (11.2)
which falls as p increases, so approach (a) is not scalable. Using approach (b), each processor reads only
n/p × m × f words, after which the m partial counts computed by each processor must be communicated and
summed, giving:
€b = 1 / (1 + 2p/(n × f)) ……………………………………………... (11.3)
Since in practice p << nf, the efficiency of approach (b) is higher than that of approach (a), and can even be
close to one: For example, with n = 10 000 documents and f = 10, the condition in (11.3) works out to p << 50
000, so method (b) is efficient (€b ≈ 0.9) even with thousands of processors. The reason is that in the first
approach each processor is reading many words that it need not read, resulting in wasted work, whereas in
the second approach every read is useful in that it results in a computation that contributes to the final
answer. Algorithm (b) is also scalable, since €b remains constant as p and n both increase, and approaches
one as n increases for a fixed p.
In our example each reduce operation sums the frequency counts for each word: (w, [c1, c2, . . .]) → (w, c1 + c2 + · · · ).
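The word-counting map and reduce operations can be sketched in Python as follows; this is an in-memory simulation of the model, not an actual Hadoop job:

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit a partial count for each word occurrence in one document.
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # Sum the frequency counts for each word.
    return (word, sum(counts))

def mapreduce(docs):
    groups = defaultdict(list)            # shuffle: group values by key
    for doc_id, text in docs.items():
        for word, count in map_fn(doc_id, text):
            groups[word].append(count)
    return dict(reduce_fn(w, cs) for w, cs in groups.items())

freqs = mapreduce({"d1": "big data big", "d2": "data"})
# freqs == {"big": 2, "data": 2}
```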
The implementation also generalizes. Each mapper is assigned an input-key range (set of values for k1) on
which map operations need to be performed. The mapper writes results of its map operations to its local disk
in R partitions, each corresponding to the output-key range (values of k2) assigned to a particular reducer,
and informs the master of these locations. Next each reducer fetches these pairs from the respective mappers
and performs reduce operations for each key k2 assigned to it. If a processor fails during the execution, the
master detects this through regular heartbeat communications it maintains with each worker, wherein
updates are also exchanged regarding the status of tasks assigned to workers.
If a mapper fails, then the master reassigns the key-range designated to it to another working node
for re-execution. Note that re-execution is required even if the mapper had completed some of its map
operations, because the results were written to local disk rather than the GFS. On the other hand if a reducer
fails only its remaining tasks (values k2) are reassigned to another node, since the completed tasks would
already have been written to the GFS.
Finally, heartbeat failure detection can be fooled by a wounded task that has a heartbeat but is
making no progress: Therefore, the master also tracks the overall progress of the computation and if results
from the last few processors in either phase are excessively delayed, these tasks are duplicated and assigned
to processors who have already completed their work. The master declares the task completed when any one
of the duplicate workers completes.
PARALLEL EFFICIENCY OF MAPREDUCE:
As with any parallel implementation, MapReduce introduces overheads that reduce parallel efficiency,
especially when the volume of data being read, written and transferred between processors is large.
For the purposes of our analysis we assume a general computational task, on a volume of data D,
which takes wD time on a uniprocessor, including the time spent reading data from disk, performing
computations, and writing it back to disk (i.e. we assume that computational complexity is linear in the size
of data). Let c be the time spent reading one unit of data (such as a word) from disk. Further, let us assume
that our computational task can be decomposed into map and reduce stages as follows: First cmD
computations are performed in the map stage, producing σD data as output. Next the reduce stage performs
crσD computations on the output of the map stage, producing σμD data as the final result. Finally, we
assume that our decomposition into a map and reduce stages introduces no additional overheads when run
on a single processor, such as having to write intermediate results to disk, and so
wD = cD + cmD + crσD + cσμD………………………………… (11.6)
Now consider running the decomposed computation on P processors that serve as both mappers and
reducers in respective phases of a MapReduce based parallel implementation. As compared to the single
processor case, the additional overhead in a parallel MapReduce implementation is between the map and
reduce phases where each mapper writes to its local disk followed by each reducer remotely reading from
the local disk of each mapper. For the purposes of our analysis we shall assume that the time spent reading a
word from a remote disk is also c, i.e. the same as for a local read. Each mapper produces approximately
σD/P data that is written to a local disk (unlike in the uniprocessor case), which takes cσD/P time. Next,
after the map phase, each reducer needs to read its partition of data from each of the P mappers, fetching
approximately a 1/P-th of the data at each mapper, i.e. σD/P² from each. The entire exchange can be
executed in P steps, so the transfer time is cσD/P² × P = cσD/P. The total overhead in the parallel
implementation because of intermediate disk writes and reads is therefore 2cσD/P. We can now compute the
parallel efficiency of the MapReduce implementation as:
€MR = wD / (wD + 2cσD) = 1 / (1 + 2cσ/w) ……………………………………………... (11.7)
Let us validate (11.7) above for our parallel word counting example discussed in Section 11.1: The volume
of data is D = nmf. We ignore the time spent in adding word counts, so cr = cm = 0. We also did not include
the (small) time for writing the final result to disk. So wD = wnmf = cnmf, or w = c. The map phase
produces mp partial counts, so σ = mp/nmf = p/nf. Using (11.7) and c = w we reproduce (11.3) as computed
earlier. It is important to note how €MR depends on σ, the 'compression' in data achieved in the map phase,
and its relation to the number of processors p. To illustrate this dependence, let us recall the definition (11.4)
of a map operation, as applied to the word counting problem, i.e. (dk, [w1 . . . wn]) → [(wi, ci)]. Each map
operation takes a document as input and emits a partial count for each word in that document alone, rather
than a partial sum across all the documents it sees. In this case the output of the map phase is of size mn (an
m-vector of counts for each document). So σ = mn/nmf = 1/f and the parallel efficiency is 1/(1 + 2/f),
independent of data size or number of processors, which is not scalable.
A strict implementation of MapReduce as per the definitions (11.4) and (11.5) does not allow for partial
reduction across all input values seen by a particular reducer, which is what enabled the parallel
implementation of Section 11.1 to be highly efficient and scalable. Therefore, in practice the map phase
usually includes a combine operation in addition to the map, defined as follows:
Combine: (k2, [v2]) → (k2, fc([v2])). ………………………….(11.8)
The function fc is similar to the function f in the reduce operation but is applied only across documents
processed by each mapper, rather than globally. The equivalence of a MapReduce implementation with and
without a combiner step relies on the reduce function f being commutative and associative,
i.e. f (v1, v2, v3) = f (v3, f (v1, v2)).
Finally, recall our definition of a scalable parallel implementation: A MapReduce implementation is scalable
if we are able to achieve an efficiency that approaches one as data volume D grows, and remains constant as
D and P both increase. Using combiners is crucial to achieving scalability in practical MapReduce
implementations by achieving a high degree of data ‗compression‘ in the map phase, so that σ is
proportional to P/D, which in turn results in scalability due to (11.7).
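The effect of a combiner can be illustrated as follows; this is an in-memory sketch, and a real combiner runs over all documents processed by one mapper rather than a single document:

```python
from collections import Counter

def map_without_combiner(doc):
    # Strict map: one (word, 1) pair per occurrence. With f occurrences
    # per word on average, the intermediate data is as large as the input.
    return [(w, 1) for w in doc.split()]

def map_with_combiner(doc):
    # Combine within the mapper: one (word, count) pair per distinct word,
    # shrinking the intermediate data before the shuffle and so reducing sigma.
    return list(Counter(doc.split()).items())

doc = "to be or not to be"
plain = map_without_combiner(doc)      # 6 pairs, one per occurrence
combined = map_with_combiner(doc)      # 4 pairs: to:2, be:2, or:1, not:1
```

Since summation is commutative and associative, both versions produce the same final counts after the reduce phase; only the volume of intermediate data differs.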
RELATIONAL OPERATIONS USING MAPREDUCE:
Pig and Hive provide higher-level languages implemented over MapReduce for expressing relational
operations, and both leverage the Hadoop distributed file system HDFS. Figure 11.3 illustrates how the above SQL
query can be represented using the Pig Latin language as well as the HiveQL dialect of SQL. Pig Latin has
features of an imperative language, wherein a programmer specifies a sequence of transformations that each
read and write large distributed files. The Pig Latin compiler generates MapReduce phases by treating each
GROUP (or COGROUP) statement as defining a map-reduce boundary, and pushing remaining statements
on either side into the map or reduce steps.
HiveQL, on the other hand, shares SQL's declarative syntax. Once again though, as in Pig Latin,
each JOIN and GROUP operation defines a map-reduce boundary. As depicted in the figure, the Pig Latin as
well as HiveQL representations of our SQL query translate into two MapReduce phases similar to our
example of Figure 11.2.
Pig Latin is ideal for executing sequences of large-scale data transformations using MapReduce. In the
enterprise context it is well suited for the tasks involved in loading information into a data warehouse.
HiveQL, being more declarative and closer to SQL, is a good candidate for formulating analytical queries on
a large distributed data warehouse.
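A GROUP BY aggregation of the kind such queries compile to can be sketched as a map step, a shuffle, and a reduce step; the data and column names below are hypothetical, and this is an in-memory illustration rather than Pig or Hive code:

```python
from collections import defaultdict

def group_by_sum(rows, key_col, value_col):
    """GROUP BY key_col with SUM(value_col), expressed as a map step
    (emit (key, value) pairs), a shuffle (group by key), and a reduce
    step (sum the values for each key)."""
    pairs = [(row[key_col], row[value_col]) for row in rows]   # map
    groups = defaultdict(list)                                 # shuffle
    for k, v in pairs:
        groups[k].append(v)
    return {k: sum(vs) for k, vs in groups.items()}            # reduce

sales = [
    {"city": "Pune", "amount": 10},
    {"city": "Mumbai", "amount": 5},
    {"city": "Pune", "amount": 7},
]
totals = group_by_sum(sales, "city", "amount")
# totals == {"Pune": 17, "Mumbai": 5}
```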
There has been considerable interest in comparing the performance of MapReduce-based
implementations of SQL queries with that of traditional parallel databases, especially specialized column-
oriented databases tuned for analytical queries. In general, as of this writing, parallel databases are still faster
than available open source implementations of MapReduce (such as Hadoop) for smaller data sizes using
fewer processors, where fault tolerance is less critical. MapReduce-based implementations, on the other hand,
are able to handle orders of magnitude larger data using massively parallel clusters in a fault-tolerant
manner. MapReduce is also preferable over traditional databases if data needs to be processed only once and
then discarded: As an example, the time required to load some large data sets into a database is 50 times
greater than the time to both read and perform the required analysis using MapReduce. On the contrary, if
data needs to be stored for a long time, so that queries can be performed against it regularly, a traditional
database wins over MapReduce, at least as of this writing. HadoopDB is an attempt at combining the
advantages of MapReduce and relational databases by using databases locally within nodes while using
MapReduce to coordinate parallel execution. Another example is SQL/MR from Aster Data that enhances a
set of distributed SQL-compliant databases with MapReduce programming constructs. Needless to say,
relational processing using MapReduce is an active research area and many improvements to the available
state of the art are to be expected in the near future.
PIG:
Pig was originally developed at Yahoo Research around 2006 for researchers to have an ad-hoc way of
creating and executing map-reduce jobs on very large data sets. In 2007, it was moved into the Apache
Software Foundation.
Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this
platform is called Pig Latin. Pig Latin is a high level Data Flow scripting language that enables data
workers to write complex data transformations without knowing Java. Pig Latin can be extended using UDF
(User Defined Functions) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then
call directly from the language. Pig works with data from many sources, including structured and
unstructured data, and stores the results in the Hadoop Distributed File System (HDFS).
PIG ARCHITECTURE:
The Pig Latin compiler converts the Pig Latin code into executable code. The executable code is in the form
of MapReduce jobs. The sequence of MapReduce programs enables Pig programs to do data processing and
analysis in parallel, leveraging Hadoop MapReduce and HDFS.
Pig programs can run on MapReduce v1 or MapReduce v2 without any code changes, regardless of which
mode your cluster is running in. However, Pig scripts can also run using the Tez API instead. Apache Tez
provides a more efficient execution framework than MapReduce. YARN enables application frameworks
other than MapReduce (like Tez) to run on Hadoop.
Bamuengine.com
Documented by Prof. K. V. Reddy Asst.Prof at DIEMS
4.22
HIVE:
Hive is a data warehouse system for Hadoop whose SQL dialect, HiveQL, differs in many
ways from the familiar SQL dialects provided by Oracle, MySQL, and SQL Server.
Hive is most suited for data warehouse applications, where relatively static data is analyzed and fast
response times are not required.
Hive is not a full database because the design constraints and limitations of Hadoop and HDFS impose
limits on what Hive can do.
The biggest limitation is that Hive does not provide record-level update, insert, or delete.
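Because HDFS files are write-once, a record-level "update" in Hive is typically emulated by rewriting a whole table or partition. A minimal sketch of that rewrite pattern (plain Python, with a list standing in for a partition's rows; names are illustrative):

```python
def update_partition(rows, key, new_value):
    """Emulate a record-level update the Hive way: read every row of the
    partition, substitute the changed one, and write a complete new copy."""
    return [(k, new_value if k == key else v) for k, v in rows]
```

Changing one row therefore costs a full scan and rewrite of the partition, which is why Hive targets analytical, not transactional, workloads.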
HIVE ARCHITECTURE:
CLI: The command-line interface to Hive (the shell). This is the default service.
HiveServer: Runs Hive as a server exposing a Thrift service, enabling access from a range of clients
written in different languages. Applications using the Thrift, JDBC, and ODBC connectors need to run a
Hive server to communicate with Hive.
Metastore database: The metastore is the central repository of Hive metadata.
The Hive Web Interface (HWI): As an alternative to the shell, you might want to try Hive's simple web
interface, started with the hive --service hwi command.
Hive clients: If you run Hive as a server (hive --service hiveserver), then there are a number of different
mechanisms for connecting to it from applications. The relationship between Hive clients and Hive services
is illustrated in the Hive architecture figure.
Thrift Client: The Hive Thrift Client makes it easy to run Hive commands from a wide range of
programming languages. Thrift bindings for Hive are available for C++, Java, PHP, Python, and Ruby. They
can be found in the src/service/src subdirectory in the Hive distribution.
JDBC Driver: Hive provides a Type 4 (pure Java) JDBC driver, defined in the class
org.apache.hadoop.hive.jdbc.HiveDriver. When configured with a JDBC URI of the form
jdbc:hive://host:port/dbname, a Java application will connect to a Hive server running in a separate process
at the given host and port.
ODBC Driver : The Hive ODBC Driver allows applications that support the ODBC protocol to connect to
Hive.
• Multidimensional in HBase means that a particular cell can hold multiple versions of its value.
• Sorted map basically means that to obtain each value, you must provide a key.
• The first HBase release was bundled as part of Hadoop 0.15.0 in October 2007.
• In May 2010, HBase graduated from a Hadoop sub-project to become an Apache top-level project.
• Facebook, Twitter, and other leading websites use HBase for their Big Data.
• HBase is a distributed, column-oriented database built on top of HDFS.
• Use HBase for random, real-time read/write access to your datasets.
• HBase and its native API are written in Java, but you do not have to use Java to access the API.
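Putting the points above together, HBase's data model is a sorted map of row key → column → timestamped versions. The toy class below sketches that model in plain Python; it is not the real HBase client API.

```python
import bisect

class TinyTable:
    """Toy model of HBase's data model: row -> column -> sorted versions."""
    def __init__(self):
        self.rows = {}   # row key -> {column -> [(timestamp, value), ...]}

    def put(self, row, col, ts, value):
        cell = self.rows.setdefault(row, {}).setdefault(col, [])
        bisect.insort(cell, (ts, value))   # keep versions sorted by timestamp

    def get(self, row, col):
        """Return the newest version, like a default HBase Get."""
        versions = self.rows.get(row, {}).get(col, [])
        return versions[-1][1] if versions else None

    def scan(self):
        """Rows come back in sorted row-key order, as in an HBase scan."""
        return sorted(self.rows)
```

Note how "sorted" and "multidimensional" show up directly: scans return row keys in order, and each cell keeps every timestamped version rather than a single value.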
HBASE ARCHITECTURE:
In HBase Architecture, we have the Master and multiple region servers consisting of multiple regions.
HBase Master
• Responsible for managing region servers and their locations
– Assigns regions to region servers
– Re-balances regions to accommodate workloads
– Recovers if a region server becomes unavailable
– Uses Zookeeper – distributed coordination service
• Doesn't actually store or read data
Region Server
Each region server is basically a node within your cluster.
You would typically have at least 3 region servers in your distributed environment.
A region server contains one or more regions.
It's important to know that, within a region, each store holds the data of ONE column family for a range of rows.
The store files are primarily handled by the HRegionServer.
The HLog is shared between all the stores of an HRegionServer.
WAL
The Write-Ahead Log lives in the HLog file. If your data exists only in the MemStore and the
system fails before the data is flushed to a file, all of that data is lost. On a large system the chances of this are
high, because the cluster is built from hundreds of commodity machines. The WAL exists to prevent data loss if this should
happen.
HRegion:
A region consists of one or more column families with a range of row keys.
The default maximum size of a region is 256 MB.
Data Storage
• Data is stored in files called HFiles/StoreFiles
• HFile is basically a key-value map
• When data is added it's written to a log called Write Ahead Log (WAL) and is also stored in memory
(memstore)
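The write path just described (log to the WAL first, buffer in the memstore, flush to an immutable HFile) can be sketched as follows. This is a simplified model with in-memory lists standing in for files; the class name is illustrative.

```python
class TinyStore:
    """Toy HBase write path: log to WAL, buffer in memstore, flush to HFiles."""
    def __init__(self):
        self.wal = []        # write-ahead log entries (would be the HLog file)
        self.memstore = {}   # in-memory buffer of recent writes
        self.hfiles = []     # flushed, immutable key-value maps (HFiles)

    def put(self, key, value):
        self.wal.append((key, value))   # durable log entry first
        self.memstore[key] = value      # then the in-memory store

    def flush(self):
        # The memstore is written out as one immutable HFile, then cleared.
        self.hfiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}

    def recover(self):
        """After a crash, replay the WAL to rebuild the lost memstore."""
        self.memstore = {}
        for key, value in self.wal:
            self.memstore[key] = value
```

The real system also trims the WAL once its entries are safely flushed; the sketch keeps the whole log for simplicity.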
HBase vs. BigTable:
HBase          Bigtable
Region         Tablet
HDFS           GFS
MemStore       Memtable
HFile          SSTable
ZooKeeper      Chubby
SQOOP:
When Big Data stores and analyzers such as MapReduce, Hive, HBase, Cassandra, Pig, etc. of the
Hadoop ecosystem came into the picture, they required a tool to interact with relational database servers for
importing and exporting the Big Data residing in them. Here, Sqoop occupies a place in the Hadoop
ecosystem to provide feasible interaction between relational database servers and Hadoop's HDFS.
Sqoop: "SQL to Hadoop and Hadoop to SQL"
Sqoop is a command-line interface application for transferring data between relational databases and
Hadoop. Sqoop is used to import data from a relational database management system (RDBMS) such as
MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop
MapReduce, and then export the data back into an RDBMS.
Sqoop automates most of this process, relying on the database to describe the schema for the data to be
imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as
fault tolerance. Microsoft uses a Sqoop-based connector to help transfer data from Microsoft SQL Server
databases to Hadoop. Couchbase, Inc. also provides a Couchbase Server-Hadoop connector by means of
Sqoop.
SQOOP ARCHITECTURE
Sqoop Import
The import tool imports individual tables from RDBMS to HDFS. Each row in a table is treated as a record
in HDFS. All records are stored as text data in text files or as binary data in Avro and Sequence files.
$ sqoop import --connect jdbc:mysql://localhost/userdb --username root --table emp_add --m 1 --target-dir /queryresult
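The --m flag above sets the number of parallel map tasks. Conceptually, Sqoop splits the table's key range into that many slices, one per task (the real tool discovers the minimum and maximum of the split column via SQL). A simplified sketch of that splitting:

```python
def split_key_range(min_key, max_key, num_mappers):
    """Divide the integer key range [min_key, max_key] into num_mappers
    contiguous slices, one per parallel import task."""
    span = max_key - min_key + 1
    size = span // num_mappers
    extra = span % num_mappers      # spread any remainder over early slices
    splits, lo = [], min_key
    for i in range(num_mappers):
        hi = lo + size - 1 + (1 if i < extra else 0)
        splits.append((lo, hi))
        lo = hi + 1
    return splits
```

Each slice becomes a SELECT with a WHERE clause over that key range, so the mappers can import disjoint parts of the table in parallel.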
Sqoop Export
The export tool exports a set of files from HDFS back to an RDBMS. The files given as input to Sqoop
contain records, which become rows in the target table. They are read and parsed into a set of records
according to a user-specified delimiter.
$ sqoop export --connect jdbc:mysql://localhost/db --username root --table employee --export-dir
/emp/emp_data
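The parsing step described above (delimited HDFS text lines turned back into rows) can be sketched as:

```python
def parse_export_file(lines, delimiter=","):
    """Turn delimited HDFS text lines into row tuples, as the export
    tool does before inserting them into the target table."""
    return [tuple(line.rstrip("\n").split(delimiter)) for line in lines]
```

In the real tool the delimiter is configurable (e.g. --input-fields-terminated-by), and the parsed rows are batched into INSERT statements against the RDBMS.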
FLUME:
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and
moving large amounts of log data from many different sources to a centralized data store. Apache Flume is a
top-level project at the Apache Software Foundation. There are currently two release code lines available,
versions 0.9.x and 1.x; this description applies to the 1.x code line.
Architecture Of FLUME:
Data flow model
A Flume event is defined as a unit of data flow having a byte payload and an optional set of string
attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an
external source to the next destination (hop). A Flume source consumes events delivered to it by an external
source like a web server. The external source sends events to Flume in a format that is recognized by the
target Flume source. For example, an Avro Flume source can be used to receive Avro events from Avro
clients or other Flume agents in the flow that send events from an Avro sink. When a Flume source receives
an event, it stores it into one or more channels. The channel is a passive store that keeps the event until it's
consumed by a Flume sink. The JDBC channel is one example -- it uses a file system backed embedded
database. The sink removes the event from the channel and puts it into an external repository like HDFS (via
Flume HDFS sink) or forwards it to the Flume source of the next Flume agent (next hop) in the flow. The
source and sink within the given agent run asynchronously with the events staged in the channel.
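The source → channel → sink flow described above can be sketched with a queue standing in for the channel. The names below are illustrative, not the Flume API.

```python
from collections import deque

class TinyChannel:
    """Passive store between source and sink, like a Flume channel."""
    def __init__(self):
        self.events = deque()
    def put(self, event):
        self.events.append(event)
    def take(self):
        return self.events.popleft() if self.events else None

def source(channel, raw_events):
    """Source: receive external events and stage them in the channel."""
    for e in raw_events:
        channel.put({"headers": {}, "body": e})

def sink(channel, repository):
    """Sink: drain the channel into a terminal store (e.g. HDFS)."""
    while True:
        event = channel.take()
        if event is None:
            break
        repository.append(event["body"])
```

Because the channel decouples the two ends, the source can keep accepting events while the sink drains them at its own pace, which is exactly why Flume's source and sink can run asynchronously.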
Complex flows: Flume allows a user to build multi-hop flows where events travel through multiple agents
before reaching the final destination. It also allows fan-in and fan-out flows, contextual routing and backup
routes (fail-over) for failed hops.
Reliability: The events are staged in a channel on each agent. The events are then delivered to the next
agent or terminal repository (like HDFS) in the flow. The events are removed from a channel only after they
are stored in the channel of the next agent or in the terminal repository. This is how the single-hop message
delivery semantics in Flume provide end-to-end reliability of the flow. Flume uses a transactional approach
to guarantee the reliable delivery of the events. The sources and sinks encapsulate, in a transaction, the
storage and retrieval, respectively, of events placed in or provided by the channel.
This ensures that the set of events are reliably passed from point to point in the flow. In the case of a multi-
hop flow, the sink from the previous hop and the source from the next hop both have their transactions
running to ensure that the data is safely stored in the channel of the next hop.
Recoverability: Events are staged in the channel, which manages recovery from failure. Flume supports
a durable JDBC channel backed by a relational database. There is also a memory channel that simply
stores events in an in-memory queue; it is faster, but any events still left in the memory channel when an
agent process dies cannot be recovered.
OOZIE:
Oozie Workflow Overview
Oozie is a server based Workflow Engine specialized in running workflow jobs with actions that run
Hadoop Map/Reduce and Pig jobs. Oozie is a Java Web-Application that runs in a Java servlet-container.
For the purposes of Oozie, a workflow is a collection of actions (i.e. Hadoop Map/Reduce jobs, Pig jobs)
arranged in a control dependency DAG (Directed Acyclic Graph). "Control dependency" from one action to
another means that the second action can't run until the first action has completed. In terms of the actions we
can schedule, Oozie supports a wide range of job types, including Pig, Hive, and MapReduce, as well as
jobs coming from Java programs and Shell scripts.
An Oozie coordinator job, for example, enables us to schedule any workflows we've already created. We
can schedule them to run based on specific time intervals, or even based on data availability. At an even
higher level, we can create an Oozie bundle job to manage our coordinator jobs. Using a bundle job, you can
easily apply policies against an entire set of coordinator jobs.
For all three kinds of Oozie jobs (workflow, coordinator, and bundle), we start out by defining them using
individual .xml files, and then we configure them using a combination of properties files and command-line
options.
Writing Oozie workflow definitions
Oozie workflow definitions are written in XML, based on the hPDL (Hadoop Process Definition Language)
schema. This particular schema is, in turn, based on the XML Process Definition Language (XPDL) schema,
which is a product independent standard for modeling business process definitions.
Oozie workflow definitions are written in hPDL. Oozie workflow actions start jobs in remote systems (i.e.,
Hadoop, Pig). Upon action completion, the remote system calls back Oozie to signal that the action has
completed, at which point Oozie proceeds to the next action in the workflow.
Oozie workflows contain control flow nodes and action nodes.
Control flow nodes define the beginning and the end of a workflow (start, end and fail nodes) and provide a
mechanism to control the workflow execution path (decision, fork and join nodes).
Action nodes are the mechanism by which a workflow triggers the execution of a computation/processing
task. Oozie provides support for different types of actions: Hadoop map-reduce, Hadoop file system, Pig,
SSH, HTTP, email and Oozie sub-workflow. Oozie can be extended to support additional types of actions.
To see how this concept would look, check out Listing 10-1, which shows an example of the basic structure
of an Oozie workflow's XML file.
Listing 10-1: A Sample Oozie XML File
<workflow-app name="SampleWorkflow" xmlns="uri:oozie:workflow:0.1">
<start to="firstJob"/>
<action name="firstJob">
<pig>...</pig>
<ok to="secondJob"/>
<error to="kill"/>
</action>
<action name="secondJob">
<map-reduce>...</map-reduce>
<ok to="end" />
<error to="kill" />
</action>
<end name="end"/>
<kill name="kill">
<message>"Killed job."</message>
</kill>
</workflow-app>
In this example, aside from the start, end, and kill nodes, you have two action nodes. Each action node
represents an application or a command being executed. The next few sections look a bit closer at each node
type.
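The control flow of Listing 10-1 can be traced with a tiny interpreter: run each action, then follow its ok or error transition until an end or kill node is reached. This plain-Python sketch models only the transitions, not the Oozie engine.

```python
def run_workflow(nodes, start, actions):
    """Walk a workflow DAG. nodes maps an action name to its
    (ok_target, error_target) pair; actions maps each action name to a
    callable that returns True on success."""
    path, current = [], start
    while current not in ("end", "kill"):
        path.append(current)
        ok_target, error_target = nodes[current]
        current = ok_target if actions[current]() else error_target
    path.append(current)
    return path

# Mirrors Listing 10-1: firstJob (Pig) then secondJob (MapReduce).
nodes = {"firstJob": ("secondJob", "kill"),
         "secondJob": ("end", "kill")}
```

If both actions succeed the path is firstJob → secondJob → end; if firstJob fails, control jumps straight to the kill node, just as the error transition in the XML specifies.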
Start and end nodes
Each workflow XML file must have one matched pair of start and end nodes. The sole purpose of the start
node is to direct the workflow to the first node, which is done using the to attribute. Because it's the
automatic starting point for the workflow, no name identifier is required.
Action nodes need name identifiers, as the Oozie server uses them to track the current position of the control
flow as well as to specify which action to execute next. The sole purpose of the end node is to provide a
termination point for the workflow. A name identifier is required, but there's no need for a to attribute.
Kill nodes
Oozie workflows can include kill nodes, which are a special kind of node dedicated to handling error
conditions. Kill nodes are optional, and you can define multiple instances of them for cases where you need
specialized handling for different kinds of errors. Action nodes can include error transition tags, which direct
the control flow to the named kill node in case of an error.
You can also direct decision nodes to point to a kill node based on the results of decision predicates, if
needed. Like an end node, a kill node results in the workflow ending, and it does not need a to attribute.