CH 3
By:
Irandufa Indebu
[email protected]
Outlines
Data Science Infrastructure
Hadoop ecosystem
Introduction
The set of technologies used to do data science varies across
organizations
[Figure: Input, Process, Output]
Typical architecture of Data Science…
This architecture is not just for big-data environments, but for
data environments of all sizes.
these data.
Big data
Big data is a popular term used to describe the exponential growth and
availability of data, both structured and unstructured.
Every day, we create 2.5 quintillion bytes of data — so much that 90%
of the data in the world today has been created in the last two years
alone.
Big data usually includes data sets whose size is beyond the ability
of commonly used software tools to capture, create, manage, and
process within a tolerable elapsed time.
It is a term used to refer to the study and applications of data sets
that are so big and complex that traditional data-processing
application software is inadequate to deal with them.
For scale: a text file is a few kilobytes, a sound file is a few megabytes,
and a full-length movie is a few gigabytes.
4-Dimensions / Characteristics of Big Data (4 V’s)
1. Volume:
More data sources, each producing larger amounts of data, combine to
increase the volume of data that has to be analyzed. This is a major issue
for those looking to put that data to use instead of letting it just disappear.
Petabyte data sets are common these days, and exabyte data sets are not far
away.
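To make the units concrete, here is a minimal sketch that converts a raw byte count into kilobytes through exabytes. It assumes decimal (SI) units, where each step is a factor of 1000; the function name `humanize` is a hypothetical helper, not part of any tool discussed in this chapter.

```python
# Decimal (SI) byte units; each step up is a factor of 1000 (an assumption,
# binary units with a factor of 1024 are also in common use).
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB"]


def humanize(num_bytes: int) -> str:
    """Express num_bytes in the largest unit that keeps the value at 1 or more."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop when the value fits in this unit, or when no larger unit is left.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000


# 2.5 quintillion bytes is 2.5 * 10**18 bytes.
daily_bytes = 25 * 10**17
print(humanize(daily_bytes))   # 2.5 EB
print(humanize(4 * 10**9))     # 4 GB, roughly a full-length movie
```

On this scale, 2.5 quintillion bytes per day is 2.5 exabytes per day.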
2. Velocity:
Traditionally, data was analyzed in batches: one takes a chunk of data,
submits a job to the server, and waits for delivery of the result.
That scheme works when the incoming data rate is slower than the
batch-processing rate and when the result is useful despite the delay.
With the new sources of data such as social and mobile applications,
the batch process breaks down.
The data is now streaming into the server in real time, in a continuous
fashion and the result is only useful if the delay is very short. Data
comes at you at a record or a byte level, not always in bulk.
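The velocity argument can be sketched in a few lines. This is an illustrative toy model, not a real stream processor: the rates, the function names (`backlog_after`, `batch_total`, `streaming_totals`), and the sample numbers are all assumptions made for the example.

```python
from typing import Iterable, Iterator, List


def backlog_after(arrival_rate: int, processing_rate: int, seconds: int) -> int:
    """Records still waiting after `seconds`, given records-per-second rates."""
    backlog = 0
    for _ in range(seconds):
        # The backlog grows by what arrives and shrinks by what is processed.
        backlog = max(0, backlog + arrival_rate - processing_rate)
    return backlog


def batch_total(records: List[int]) -> int:
    """Batch style: collect the whole chunk first, then compute one result."""
    return sum(records)


def streaming_totals(records: Iterable[int]) -> Iterator[int]:
    """Streaming style: emit an updated result as each record arrives."""
    total = 0
    for record in records:
        total += record
        yield total


# Batch keeps up only while data arrives more slowly than it is processed.
print(backlog_after(arrival_rate=80, processing_rate=100, seconds=60))   # 0
print(backlog_after(arrival_rate=120, processing_rate=100, seconds=60))  # 1200

# Same data, two delivery styles: one late answer vs. an answer per record.
print(batch_total([3, 1, 4]))             # 8
print(list(streaming_totals([3, 1, 4])))  # [3, 4, 8]
```

When the arrival rate exceeds the processing rate, the backlog grows without bound, which is the point at which the batch process breaks down.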
3. Variety:
Data used to live in Excel tables and databases; it has since lost that
fixed structure and now arrives in hundreds of formats:
pure text, photo, audio, video, web, GPS data, sensor data, relational
databases, documents, SMS, PDF, Flash, etc.
The variety of data sources continues to increase. It includes:
Internet data (e.g., clickstream, social media, social networking
links)
Primary research (e.g., surveys, experiments, observations)
Secondary research (e.g., competitive and marketplace data,
industry reports, consumer data, business data)
Location data (e.g., mobile device data, geospatial data)
Image data (e.g., video, satellite image, surveillance)
Supply chain data (e.g., EDI, vendor catalogs and pricing, quality
information)
Device data (e.g., sensors, PLCs, RF devices, LIMs, telemetry)
Big Data Vs non Big Data
Today, more and more semi-structured and unstructured data are
coming in.
In non-big-data settings, the data is mostly structured and stored in an RDBMS.
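The difference between structured and semi-structured data can be illustrated with a small sketch. The sample records, field names, and variable names below are invented for the example; only the Python standard library is used.

```python
import csv
import io
import json

# Structured: every row has the same columns, as in an RDBMS table.
table_text = "id,name,city\n1,Abebe,Adama\n2,Sara,Jimma\n"
rows = list(csv.DictReader(io.StringIO(table_text)))

# Semi-structured: each record carries its own, possibly different, fields.
events_text = """
{"user": "abebe", "action": "click", "page": "/home"}
{"user": "sara", "action": "post", "text": "hello", "tags": ["news"]}
"""
events = [json.loads(line) for line in events_text.splitlines() if line.strip()]

# Collect the distinct "shapes" (sets of field names) seen in each source.
structured_columns = {tuple(row.keys()) for row in rows}
event_fields = {tuple(sorted(event.keys())) for event in events}

print(len(structured_columns))  # 1: one fixed schema for all rows
print(len(event_fields))        # 2: the schema varies from record to record
```

A fixed schema is what lets an RDBMS validate and index rows up front; with varying records, the structure has to be discovered when the data is read.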
Flink.
Ingestion Layer
If the ingestion layer of the big data architecture is not well planned, the
entire layer can malfunction, which leads to failure of the big data system.
The layer supporting the storage layer, that is, the physical infrastructure,
is fundamental to the operation and scalability of the big data
architecture.
layer that provides the tools and query languages to access the NoSQL
databases using the HDFS storage file system sitting on top of the Hadoop
physical infrastructure layer.
availability requirements.
Architecture of Big data-Hadoop based…
MapReduce
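As a rough illustration of the MapReduce programming model, the sketch below counts words in three phases (map, shuffle, reduce) inside a single Python process. It is a teaching sketch under stated assumptions, not Hadoop's API: the function names are hypothetical, and a real cluster would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: turn one input line into (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share the same key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "big cluster"]))
# {'big': 2, 'data': 1, 'cluster': 1}
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across many nodes.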
Pig:
with its own language syntax for expressing data analysis programs.
Pig is designed for batch processing of data.
Sqoop:
Apache Sqoop is a tool designed for efficiently transferring bulk data
between Hadoop and structured relational databases.
data warehouse
Hive: Hive is a data warehouse system for Hadoop that facilitates easy
data summarization, ad hoc queries, and the analysis of large datasets
stored in HDFS. It has its own SQL-like query language called Hive
Query Language (HQL), which is used to issue query commands to
Hadoop.
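To give a feel for the kind of summarization query described above, here is a small sketch. Note the substitution: it runs against an in-memory SQLite database from Python's standard library, not against Hive, because a Hive cluster is not assumed to be available. The table `page_views`, its columns, and the sample rows are invented for the example; the query uses only plain SELECT, COUNT, GROUP BY, and ORDER BY, which HQL is also expected to accept.

```python
import sqlite3

# SQLite stands in for Hive here; in Hive the table would live in HDFS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_name TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("abebe", "/home"), ("sara", "/home"), ("sara", "/docs")],
)

# An ad hoc summarization query: how many views did each page receive?
query = """
SELECT page, COUNT(*) AS views
FROM page_views
GROUP BY page
ORDER BY views DESC, page
"""
summary = conn.execute(query).fetchall()
print(summary)  # [('/home', 2), ('/docs', 1)]
conn.close()
```

The appeal of an SQL-like language is that the analyst writes a declarative query like this one, and the system decides how to execute it over the stored data.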
Security Layer
monitor so that there is very low overhead and high parallelism.
Visualization Layer
After processing data sets, the next step is converting the output or
into the traditional ODS, data warehouse, and data marts for further
analysis along with the transaction data, the visualization layers can