
Chapter Three

DATA SCIENCE ECOSYSTEM

By:
Irandufa Indebu

[email protected]
Outline
Data Science Infrastructure

Big data and its challenges

Hadoop ecosystem
Introduction
The set of technologies used to do data science varies across organizations.

The larger the organization, the greater the amount of data being processed, and the greater the amount of data being processed, the greater the complexity of the technology ecosystem supporting the data science activities.

The ecosystem contains tools and components from a number of different software suppliers, processing data in many different formats.
Typical architecture of Data Science

(Diagram: three stacked layers, from bottom to top: Input → Process → Output.)
Typical architecture of Data Science…
This architecture is not just for big-data environments, but for data environments of all sizes.

Thus, we have three layers:

1. Data sources, where all the data in an organization are generated;
2. Data storage, where the data are stored and processed;
3. Applications, where the data are shared with the consumers of these data.
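
To make these three layers concrete, here is a minimal sketch in Python of data flowing through them. The file names, fields, and aggregation logic are all hypothetical, chosen only for illustration.

```python
# Minimal three-layer sketch: data sources -> data storage/processing -> applications.
import csv
import json

def ingest(path):
    """Data sources layer: read raw records from where they are generated."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def process(records):
    """Data storage layer: clean and aggregate the stored records."""
    totals = {}
    for r in records:
        totals[r["region"]] = totals.get(r["region"], 0) + float(r["amount"])
    return totals

def publish(totals, out_path):
    """Applications layer: share results with the consumers of these data."""
    with open(out_path, "w") as f:
        json.dump(totals, f, indent=2)

publish(process(ingest("sales.csv")), "report.json")
```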
Big data

What is Big data?

Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior.

Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or does not fit the structures of traditional database architectures.

Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured.
Big data…
Every day, we create 2.5 quintillion bytes of data — so much that 90%
of the data in the world today has been created in the last two years
alone.

This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This is big data.

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time.
Big data…
It is a term used to refer to the study and application of data sets that are so big and complex that traditional data-processing application software is inadequate to deal with them.

How "Big" is big? See the following:
Google processes 20 PB a day (2008)
Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
CERN's Large Hadron Collider (LHC) generates 15 PB a year
You see how big is big?
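
A quick back-of-the-envelope calculation puts these figures in perspective. The Python sketch below converts the Google figure into smaller units (the choice of binary units is the only assumption):

```python
# How big is 20 PB per day? (binary units: 1 PB = 1024 TB)
PB = 1024**5  # bytes in a petabyte
TB = 1024**4  # bytes in a terabyte

daily = 20 * PB                    # Google's reported daily volume (2008)
print(daily / TB)                  # 20480.0 terabytes per day
print(daily / TB / (24 * 3600))    # ~0.24 terabytes ingested every second
```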
4-Dimensions / Characteristics of Big Data (4 V’s)

1. Volume: the amount of data generated every second. We are now talking about zettabytes and even brontobytes of data.

2. Variety: the different types of data. Data can be structured; however, more than 80% of the world's data is unstructured.

3. Velocity: the speed at which new data is generated, and the ability to access data anywhere, at any time.

4. Veracity: the trustworthiness of the data; the control over its quality and accuracy.
4-Dimensions / Characteristics of Big Data (4 V’s)
1. Volume:

The size of available data has been growing at an increasing rate.

The volume of data is growing: experts predicted that the volume of data in the world would grow to 35 zettabytes by 2020.

That same phenomenon affects every business: their data is growing at the same exponential rate too. This applies to companies and to individuals.

A text file is a few kilobytes, a sound file is a few megabytes, while a full-length movie is a few gigabytes.

4-Dimensions / Characteristics of Big Data (4 V’s)
1. Volume:

Currently, data is generated by employees, partners, and customers. For a growing number of companies, data is also generated by machines.

For example, hundreds of millions of smartphones send a variety of information to the network infrastructure. This data did not exist five years ago.

More sources of data, with a larger size of data, combine to increase the volume of data that has to be analyzed. This is a major issue for those looking to put that data to use instead of letting it just disappear.

Petabyte data sets are common these days, and exabytes are not far away.
4-Dimensions / Characteristics of Big Data (4 V’s)

2. Velocity:

Data is accelerating in the velocity at which it is created and at which it is integrated. We have moved from batch to real-time business.

Initially, companies analyzed data using a batch process: one takes a chunk of data, submits a job to the server, and waits for delivery of the result.

That scheme works when the incoming data rate is slower than the batch-processing rate and when the result is useful despite the delay.
4-Dimensions / Characteristics of Big Data (4 V’s)
2. Velocity:

With new sources of data such as social and mobile applications, the batch process breaks down.

The data now streams into the server in real time, in a continuous fashion, and the result is only useful if the delay is very short. Data comes at you at a record or byte level, not always in bulk.

And the demands of the business have increased as well: from an answer next week to an answer in a minute.

In addition, the world is becoming more instrumented and interconnected. The volume of data streaming off those instruments is exponentially larger than it was even two years ago.
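
The contrast between batch and real-time processing can be sketched in a few lines of Python. The sensor feed and the alert rule below are hypothetical stand-ins for a real data stream:

```python
import time

def sensor_stream(n=1000):
    """A hypothetical feed of sensor readings."""
    for i in range(n):
        yield {"id": i, "value": i % 100, "ts": time.time()}

def batch_average(records):
    """Batch mode: collect a whole chunk first, process it later, accept the delay."""
    records = list(records)
    return sum(r["value"] for r in records) / len(records)

def stream_alerts(stream, threshold=95):
    """Streaming mode: act on each record as it arrives; the delay stays short."""
    for record in stream:
        if record["value"] > threshold:
            print(f"alert: record {record['id']} exceeded {threshold}")

print("batch result:", batch_average(sensor_stream()))
stream_alerts(sensor_stream())
```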
4-Dimensions / Characteristics of Big Data (4 V’s)

3. Variety:

Variety presents an equally difficult challenge. The growth in data sources has fueled the growth in data types; in fact, 80% of the world's data is unstructured, yet most traditional methods apply analytics only to structured information.

From Excel tables and databases, data has lost its rigid structure and gained hundreds of formats:

Pure text, photos, audio, video, web pages, GPS data, sensor data, relational databases, documents, SMS, PDF, Flash, etc.
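
As a small illustration of why variety is hard, the Python sketch below reads three of these formats, each of which needs its own parsing logic (the file names are hypothetical):

```python
import csv
import json

# Structured: fixed rows and columns with a known schema.
with open("sales.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Semi-structured: nested fields whose shape can vary record to record.
with open("events.json") as f:
    events = json.load(f)

# Unstructured: free text; even simple analysis needs custom logic.
with open("reviews.txt") as f:
    words = f.read().split()

print(len(rows), len(events), len(words))
```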
4-Dimensions / Characteristics of Big Data (4 V’s)

3. Variety:
The variety of data sources continues to increase. It includes:
Internet data (e.g., clickstream, social media, social networking links)
Primary research (e.g., surveys, experiments, observations)
Secondary research (e.g., competitive and marketplace data, industry reports, consumer data, business data)
Location data (e.g., mobile device data, geospatial data)
Image data (e.g., video, satellite images, surveillance)
Supply chain data (e.g., EDI, vendor catalogs and pricing, quality information)
Device data (e.g., sensors, PLCs, RF devices, LIMs, telemetry)
Big Data Vs non Big Data
Today, increasingly more semi-structured and unstructured data are coming in.

In non big data settings, the data is usually structured and held in an RDBMS.

In big data, data is retained in a distributed file system instead of on a central server.
Big Data Vs non Big Data…
(Table summarizing the main differences between traditional data and big data.)
Big Data Infrastructure

The traditional (modern) database is incredibly efficient at processing transactional data.

However, in the age of big data, new infrastructure is required to manage all the other forms of data and for longer-term storage of the data.

Hadoop is an open-source platform developed and released by the Apache Software Foundation to process big data.

It is a platform for ingesting and storing large volumes of data in an efficient manner.
Big Data Infrastructure…

In Hadoop, the data are divided up and partitioned in a variety of ways.

These partitions, or portions of data, are spread across the nodes of the Hadoop cluster.

Other big-data processing frameworks include Storm, Spark, and Flink.

All of these frameworks are Apache Software Foundation projects.
Architecture of Big data-Hadoop based
(Diagram: layered Hadoop-based architecture with data sources, ingestion, distributed storage, infrastructure, platform management, security, monitoring, and visualization layers.)
Architecture of Big data-Hadoop based…

Data Sources Layer

Multiple internal and external data feeds are available to enterprises from various sources.

It is important to separate noise from relevant information before feeding these data to the ingestion layer.

Industrial data, social media data, health data, educational data, telecommunication data, government data, etc. are some examples.
Architecture of Big data-Hadoop based…

Ingestion Layer

If the ingestion layer of the big data architecture is not well planned, the entire layer malfunctions, resulting in big data failure.

Distributed (Hadoop) Storage Layer

The storage layer provides storage patterns (communication from the ingestion layer to the storage layer) that are implemented based on the performance, scalability, and availability requirements.
Architecture of Big data-Hadoop based…

Distributed (Hadoop) Storage Layer

A distributed storage system promises fault tolerance, and parallelization enables high-speed distributed processing algorithms to execute over large-scale data.

The Hadoop Distributed File System (HDFS) is the cornerstone of the big data storage layer.

HDFS is a file system designed to store a very large volume of information (terabytes or petabytes) across a large number of machines in a cluster.
Architecture of Big data-Hadoop based…

Distributed (Hadoop) Storage Layer

HDFS uses blocks to store a file or parts of a file, and supports a write-once-read-many model of data access.

The storage layer is usually loaded with data using a batch process.

The integration component of the ingestion layer invokes various mechanisms, like Sqoop, MapReduce jobs, ETL jobs, and others, to upload data to the distributed Hadoop storage layer (DHSL).
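
To illustrate the block model, here is a minimal Python sketch of how a large file could be split into fixed-size blocks and replicated across cluster nodes. The 128 MB block size and replication factor of 3 are common HDFS defaults, but the round-robin placement below is a simplification, not HDFS's actual placement policy:

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default
REPLICATION = 3                 # default HDFS replication factor
NODES = ["node1", "node2", "node3", "node4", "node5"]

def place_blocks(file_size):
    """Split a file into blocks and assign each block to REPLICATION nodes."""
    n_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    ring = itertools.cycle(NODES)           # simplified round-robin placement
    return {b: [next(ring) for _ in range(REPLICATION)] for b in range(n_blocks)}

# A hypothetical 1 GB file -> 8 blocks, each stored on 3 of the 5 nodes.
for block, nodes in place_blocks(1024**3).items():
    print(f"block {block}: {nodes}")
```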
Architecture of Big data-Hadoop based…
Hadoop Infrastructure Layer

This layer supports the storage layer; that is, it is the physical infrastructure. It is fundamental to the operation and scalability of the big data architecture.

Hadoop Platform Management Layer

This is the layer that provides the tools and query languages to access the NoSQL databases, using the HDFS storage file system sitting on top of the Hadoop physical infrastructure layer.

The Hadoop platform management layer accesses data, runs queries, and manages the lower layers using scripting languages like Pig and Hive.
Architecture of Big data-Hadoop based…

Hadoop Platform Management Layer

Various data-access patterns (communication from the platform layer to the storage layer), suitable for different application scenarios, are implemented based on the performance, scalability, and availability requirements.
Architecture of Big data-Hadoop based…

MapReduce

MapReduce is used for efficiently executing a set of functions against a large amount of data in batch mode.

The map component distributes the problem or tasks across a large number of systems and handles the placement of the tasks in a way that distributes the load and manages recovery from failures.

After the distributed computation is completed, another function called reduce combines all the elements back together to provide a result.
Architecture of Big data-Hadoop based…

MapReduce

MapReduce simplifies the creation of processes that analyze large amounts of unstructured and structured data in parallel.

Underlying hardware failures are handled transparently for user applications, providing a reliable and fault-tolerant capability.
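
The classic illustration of MapReduce is a word count. Below is a minimal, single-machine Python sketch of the map, shuffle, and reduce phases; in a real Hadoop job the same two functions would be written against the MapReduce API, and the framework would handle distribution and failure recovery:

```python
from collections import defaultdict

documents = ["big data is big", "data science uses big data"]

def map_phase(docs):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key (the framework does this in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into the final result."""
    return {word: sum(counts) for word, counts in groups.items()}

print(reduce_phase(shuffle(map_phase(documents))))
# {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}
```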

Pig:

Pig is a high-level language (comparable to Perl) for analyzing large datasets, with its own syntax for expressing data analysis programs.
Architecture of Big data-Hadoop based…
Pig:
Pig is designed for batch processing of data.
Sqoop:
Apache Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured relational databases.

Sqoop is an abbreviation of "SQL-to-Hadoop".

It is a command-line tool that enables importing individual tables, specific columns, or entire database files straight into the distributed file system or data warehouse.
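
As a sketch of how Sqoop is typically invoked, the snippet below shells out to the sqoop command line from Python. The JDBC URL, table name, and target directory are hypothetical, and a reachable database plus a configured Hadoop cluster are assumed:

```python
import subprocess

# Hypothetical import: copy the `customers` table from MySQL into HDFS.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",  # hypothetical source database
    "--table", "customers",                    # hypothetical table to import
    "--target-dir", "/data/customers",         # HDFS directory for the output
], check=True)
```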
Architecture of Big data-Hadoop based…

ZooKeeper: a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services, which are very useful for a variety of distributed systems.

Hive: a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large datasets stored in HDFS. It has its own SQL-like query language, called Hive Query Language (HQL), which is used to issue query commands to Hadoop.
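
To show what HQL looks like in practice, the sketch below runs a hypothetical aggregation through the hive command-line client from Python. The table and column names are invented, and a configured Hive installation is assumed:

```python
import subprocess

# Hypothetical HQL query: total sales per region from a table stored in HDFS.
hql = """
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region;
"""

# `hive -e` executes a query string passed on the command line.
subprocess.run(["hive", "-e", hql], check=True)
```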
Architecture of Big data-Hadoop based…

Security Layer

As big data analysis becomes a mainstream functionality for companies, the security of that data becomes a prime concern.

 Proper authorization, encryption, role-based access, and authentication methods have to be applied to the analytics.

Monitoring Layer

The monitoring layer is responsible for keeping watch over the functioning of the big data system without affecting it. Performance is a key parameter to monitor, so that overhead stays very low and parallelism high.
Architecture of Big data-Hadoop based…

Visualization Layer

After processing the data sets, the next step is feeding the output of that processing into visualization tools.

Once the aggregated output of big data Hadoop processing is exported (e.g., via Sqoop) into the traditional ODS, data warehouse, and data marts for further analysis alongside the transaction data, the visualization layer can work on top of this consolidated, aggregated data.


Challenges of Big data
Real implementation is usually the challenge in big data.

These implementation hurdles require immediate attention; left unhandled, they will lead to failed technology implementations and objectionable results. Some challenges include:
 Data storage
 Discovery of knowledge
 Privacy issues
 Data security
 Issues related to big data characteristics
Thanks!
