
Chapter Three

DATA SCIENCE ECOSYSTEM

By:
Irandufa Indebu

[email protected]
Outline
Data Science Infrastructure

Big data and its challenges

Hadoop ecosystem
Introduction
The set of technologies used to do data science varies across organizations.

The larger the organization, the greater the amount of data being processed, and the greater the amount of data being processed, the greater the complexity of the technology ecosystem supporting the data science activities.

The ecosystem contains tools and components from a number of different software suppliers, processing data in many different formats.
Typical architecture of Data Science

(Diagram: three stacked layers, from bottom to top: Input → Process → Output.)
Typical architecture of Data Science…
This architecture is not just for big-data environments, but for data environments of all sizes.

Thus, we have three layers:

1. Data sources, where all the data in an organization are generated;
2. Data storage, where the data are stored and processed;
3. Applications, where the data are shared with the consumers of these data.
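
To make these three layers concrete, here is a minimal sketch in Python of data flowing through them. The file names, fields, and aggregation logic are all hypothetical, chosen only for illustration.

```python
# Minimal three-layer sketch: data sources -> data storage/processing -> applications.
import csv
import json

def ingest(path):
    """Data sources layer: read raw records from where they are generated."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def process(records):
    """Data storage layer: clean and aggregate the stored records."""
    totals = {}
    for r in records:
        totals[r["region"]] = totals.get(r["region"], 0) + float(r["amount"])
    return totals

def publish(totals, out_path):
    """Applications layer: share results with the consumers of these data."""
    with open(out_path, "w") as f:
        json.dump(totals, f, indent=2)

publish(process(ingest("sales.csv")), "report.json")
```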
Big data

What is Big data?

Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior.

Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or does not fit the structures of traditional database architectures.

Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured.
Big data…
Every day, we create 2.5 quintillion bytes of data — so much that 90%
of the data in the world today has been created in the last two years
alone.

This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This is big data.

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time.
Big data…
It is a term used to refer to the study and application of data sets that are so big and complex that traditional data-processing application software is inadequate to deal with them.

How "Big" is big? See the following:
Google processes 20 PB a day (2008)
Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
CERN's Large Hadron Collider (LHC) generates 15 PB a year
You see how big is big?
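
A quick back-of-the-envelope calculation puts these figures in perspective. The Python sketch below converts the Google figure into smaller units (the choice of binary units is the only assumption):

```python
# How big is 20 PB per day? (binary units: 1 PB = 1024 TB)
PB = 1024**5  # bytes in a petabyte
TB = 1024**4  # bytes in a terabyte

daily = 20 * PB                    # Google's reported daily volume (2008)
print(daily / TB)                  # 20480.0 terabytes per day
print(daily / TB / (24 * 3600))    # ~0.24 terabytes ingested every second
```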
4-Dimensions / Characteristics of Big Data (4 V’s)

1. Volume: the amount of data generated every second. We are now talking about zettabytes and even brontobytes of data.

2. Variety: the different types of data. Data can be structured; however, more than 80% of the world's data is unstructured.

3. Velocity: the speed at which new data is generated, and the ability to access data anywhere, at any time.

4. Veracity: the trustworthiness of the data; the control over its quality and accuracy.
4-Dimensions / Characteristics of Big Data (4 V’s)
1. Volume:

The size of available data has been growing at an increasing rate.

The volume of data is growing: experts predicted that the volume of data in the world would grow to 35 zettabytes by 2020.

That same phenomenon affects every business: their data is growing at the same exponential rate too. This applies to companies and to individuals.

A text file is a few kilobytes, a sound file is a few megabytes, while a full-length movie is a few gigabytes.

4-Dimensions / Characteristics of Big Data (4 V’s)
1. Volume:

Currently, data is generated by employees, partners, and customers. For a growing number of companies, data is also generated by machines.

For example, hundreds of millions of smartphones send a variety of information to the network infrastructure. This data did not exist five years ago.

More sources of data, with a larger size of data, combine to increase the volume of data that has to be analyzed. This is a major issue for those looking to put that data to use instead of letting it just disappear.

Petabyte data sets are common these days, and exabytes are not far away.
4-Dimensions / Characteristics of Big Data (4 V’s)

2. Velocity:

Data is accelerating in the velocity at which it is created and at which it is integrated. We have moved from batch to real-time business.

Initially, companies analyzed data using a batch process: one takes a chunk of data, submits a job to the server, and waits for delivery of the result.

That scheme works when the incoming data rate is slower than the batch-processing rate and when the result is useful despite the delay.
4-Dimensions / Characteristics of Big Data (4 V’s)
2. Velocity:

With new sources of data such as social and mobile applications, the batch process breaks down.

The data now streams into the server in real time, in a continuous fashion, and the result is only useful if the delay is very short. Data comes at you at a record or byte level, not always in bulk.

And the demands of the business have increased as well: from an answer next week to an answer in a minute.

In addition, the world is becoming more instrumented and interconnected. The volume of data streaming off those instruments is exponentially larger than it was even two years ago.
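
The contrast between batch and real-time processing can be sketched in a few lines of Python. The sensor feed and the alert rule below are hypothetical stand-ins for a real data stream:

```python
import time

def sensor_stream(n=1000):
    """A hypothetical feed of sensor readings."""
    for i in range(n):
        yield {"id": i, "value": i % 100, "ts": time.time()}

def batch_average(records):
    """Batch mode: collect a whole chunk first, process it later, accept the delay."""
    records = list(records)
    return sum(r["value"] for r in records) / len(records)

def stream_alerts(stream, threshold=95):
    """Streaming mode: act on each record as it arrives; the delay stays short."""
    for record in stream:
        if record["value"] > threshold:
            print(f"alert: record {record['id']} exceeded {threshold}")

print("batch result:", batch_average(sensor_stream()))
stream_alerts(sensor_stream())
```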
4-Dimensions / Characteristics of Big Data (4 V’s)

3. Variety:

Variety presents an equally difficult challenge. The growth in data sources has fueled the growth in data types; in fact, 80% of the world's data is unstructured, yet most traditional methods apply analytics only to structured information.

From Excel tables and databases, data has lost its rigid structure and gained hundreds of formats:

Pure text, photos, audio, video, web pages, GPS data, sensor data, relational databases, documents, SMS, PDF, Flash, etc.
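
As a small illustration of why variety is hard, the Python sketch below reads three of these formats, each of which needs its own parsing logic (the file names are hypothetical):

```python
import csv
import json

# Structured: fixed rows and columns with a known schema.
with open("sales.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Semi-structured: nested fields whose shape can vary record to record.
with open("events.json") as f:
    events = json.load(f)

# Unstructured: free text; even simple analysis needs custom logic.
with open("reviews.txt") as f:
    words = f.read().split()

print(len(rows), len(events), len(words))
```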
4-Dimensions / Characteristics of Big Data (4 V’s)

3. Variety:
The variety of data sources continues to increase. It includes:
Internet data (e.g., clickstream, social media, social networking links)
Primary research (e.g., surveys, experiments, observations)
Secondary research (e.g., competitive and marketplace data, industry reports, consumer data, business data)
Location data (e.g., mobile device data, geospatial data)
Image data (e.g., video, satellite images, surveillance)
Supply chain data (e.g., EDI, vendor catalogs and pricing, quality information)
Device data (e.g., sensors, PLCs, RF devices, LIMs, telemetry)
Big Data Vs non Big Data
Today, increasingly more semi-structured and unstructured data are coming in.

In non big data settings, the data is usually structured and held in an RDBMS.

In big data, data is retained in a distributed file system instead of on a central server.
Big Data Vs non Big Data…
(Table summarizing the main differences between traditional data and big data.)
Big Data Infrastructure

The traditional (modern) database is incredibly efficient at processing transactional data.

However, in the age of big data, new infrastructure is required to manage all the other forms of data and for longer-term storage of the data.

Hadoop is an open-source platform developed and released by the Apache Software Foundation to process big data.

It is a platform for ingesting and storing large volumes of data in an efficient manner.
Big Data Infrastructure…

In Hadoop, the data are divided up and partitioned in a variety of ways.

These partitions, or portions of data, are spread across the nodes of the Hadoop cluster.

Other big-data processing frameworks include Storm, Spark, and Flink.

All of these frameworks are Apache Software Foundation projects.
Architecture of Big data-Hadoop based
(Diagram: layered Hadoop-based architecture with data sources, ingestion, distributed storage, infrastructure, platform management, security, monitoring, and visualization layers.)
Architecture of Big data-Hadoop based…

Data Sources Layer

Multiple internal and external data feeds are available to enterprises from various sources.

It is important to separate noise from relevant information before feeding these data to the ingestion layer.

Industrial data, social media data, health data, educational data, telecommunication data, government data, etc. are some examples.
Architecture of Big data-Hadoop based…

Ingestion Layer

If the ingestion layer of the big data architecture is not well planned, the entire layer malfunctions, resulting in big data failure.

Distributed (Hadoop) Storage Layer

The storage layer provides storage patterns (communication from the ingestion layer to the storage layer) that are implemented based on the performance, scalability, and availability requirements.
Architecture of Big data-Hadoop based…

Distributed (Hadoop) Storage Layer

A distributed storage system promises fault tolerance, and parallelization enables high-speed distributed processing algorithms to execute over large-scale data.

The Hadoop Distributed File System (HDFS) is the cornerstone of the big data storage layer.

HDFS is a file system designed to store a very large volume of information (terabytes or petabytes) across a large number of machines in a cluster.
Architecture of Big data-Hadoop based…

Distributed (Hadoop) Storage Layer

HDFS uses blocks to store a file or parts of a file, and supports a write-once-read-many model of data access.

The storage layer is usually loaded with data using a batch process.

The integration component of the ingestion layer invokes various mechanisms, like Sqoop, MapReduce jobs, ETL jobs, and others, to upload data to the distributed Hadoop storage layer (DHSL).
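
To illustrate the block model, here is a minimal Python sketch of how a large file could be split into fixed-size blocks and replicated across cluster nodes. The 128 MB block size and replication factor of 3 are common HDFS defaults, but the round-robin placement below is a simplification, not HDFS's actual placement policy:

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default
REPLICATION = 3                 # default HDFS replication factor
NODES = ["node1", "node2", "node3", "node4", "node5"]

def place_blocks(file_size):
    """Split a file into blocks and assign each block to REPLICATION nodes."""
    n_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    ring = itertools.cycle(NODES)           # simplified round-robin placement
    return {b: [next(ring) for _ in range(REPLICATION)] for b in range(n_blocks)}

# A hypothetical 1 GB file -> 8 blocks, each stored on 3 of the 5 nodes.
for block, nodes in place_blocks(1024**3).items():
    print(f"block {block}: {nodes}")
```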
Architecture of Big data-Hadoop based…
Hadoop Infrastructure Layer

This layer supports the storage layer; that is, it is the physical infrastructure. It is fundamental to the operation and scalability of the big data architecture.

Hadoop Platform Management Layer

This is the layer that provides the tools and query languages to access the NoSQL databases, using the HDFS storage file system sitting on top of the Hadoop physical infrastructure layer.

The Hadoop platform management layer accesses data, runs queries, and manages the lower layers using scripting languages like Pig and Hive.
Architecture of Big data-Hadoop based…

Hadoop Platform Management Layer

Various data-access patterns (communication from the platform layer to the storage layer), suitable for different application scenarios, are implemented based on the performance, scalability, and availability requirements.
Architecture of Big data-Hadoop based…

MapReduce

MapReduce is used for efficiently executing a set of functions against a large amount of data in batch mode.

The map component distributes the problem or tasks across a large number of systems and handles the placement of the tasks in a way that distributes the load and manages recovery from failures.

After the distributed computation is completed, another function called reduce combines all the elements back together to provide a result.
Architecture of Big data-Hadoop based…

MapReduce

MapReduce simplifies the creation of processes that analyze large amounts of unstructured and structured data in parallel.

Underlying hardware failures are handled transparently for user applications, providing a reliable and fault-tolerant capability.
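
The classic illustration of MapReduce is a word count. Below is a minimal, single-machine Python sketch of the map, shuffle, and reduce phases; in a real Hadoop job the same two functions would be written against the MapReduce API, and the framework would handle distribution and failure recovery:

```python
from collections import defaultdict

documents = ["big data is big", "data science uses big data"]

def map_phase(docs):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key (the framework does this in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into the final result."""
    return {word: sum(counts) for word, counts in groups.items()}

print(reduce_phase(shuffle(map_phase(documents))))
# {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}
```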

Pig:

Pig is a high-level language (comparable to Perl) for analyzing large datasets, with its own syntax for expressing data analysis programs.
Architecture of Big data-Hadoop based…
Pig:
Pig is designed for batch processing of data.
Sqoop:
Apache Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured relational databases.

Sqoop is an abbreviation of "SQL-to-Hadoop".

It is a command-line tool that enables importing individual tables, specific columns, or entire database files straight into the distributed file system or data warehouse.
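
As a sketch of how Sqoop is typically invoked, the snippet below shells out to the sqoop command line from Python. The JDBC URL, table name, and target directory are hypothetical, and a reachable database plus a configured Hadoop cluster are assumed:

```python
import subprocess

# Hypothetical import: copy the `customers` table from MySQL into HDFS.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",  # hypothetical source database
    "--table", "customers",                    # hypothetical table to import
    "--target-dir", "/data/customers",         # HDFS directory for the output
], check=True)
```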
Architecture of Big data-Hadoop based…

ZooKeeper: a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services, which are very useful for a variety of distributed systems.

Hive: a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large datasets stored in HDFS. It has its own SQL-like query language, called Hive Query Language (HQL), which is used to issue query commands to Hadoop.
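
To show what HQL looks like in practice, the sketch below runs a hypothetical aggregation through the hive command-line client from Python. The table and column names are invented, and a configured Hive installation is assumed:

```python
import subprocess

# Hypothetical HQL query: total sales per region from a table stored in HDFS.
hql = """
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region;
"""

# `hive -e` executes a query string passed on the command line.
subprocess.run(["hive", "-e", hql], check=True)
```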
Architecture of Big data-Hadoop based…

Security Layer

As big data analysis becomes a mainstream functionality for companies, the security of that data becomes a prime concern.

 Proper authorization, encryption, role-based access, and authentication methods have to be applied to the analytics.

Monitoring Layer

The monitoring layer is responsible for keeping watch over the functioning of the big data system without affecting it. Performance is a key parameter to monitor, so that overhead stays very low and parallelism high.
Architecture of Big data-Hadoop based…

Visualization Layer

After processing the data sets, the next step is feeding the output of that processing into visualization tools.

Once the aggregated output of big data Hadoop processing is exported (e.g., via Sqoop) into the traditional ODS, data warehouse, and data marts for further analysis alongside the transaction data, the visualization layer can work on top of this consolidated, aggregated data.


Challenges of Big data
Real implementation is usually the challenge in big data.

These implementation hurdles require immediate attention; left unhandled, they will lead to failed technology implementations and objectionable results. Some challenges include:
 Data storage
 Discovery of knowledge
 Privacy issues
 Data security
 Issues related to big data characteristics
Thanks!
