CH 4. The Evolution of Analytic Scalability: Taming The Big Data Tidal Wave
24 May 2012
SNU IDB Lab.
Hyewon Kim
Outline
Introduction
The Convergence of the Analytic and Data Environment
Massively Parallel Processing System (MPP)
Cloud Computing
Grid Computing
MapReduce
Conclusion
Introduction
The amount of data organizations process continues to increase
The Convergence of the Analytic and Data Environment (1/2)
Traditional Analytic Architecture
We had to pull all data together into a separate analytics
environment to do analysis
[Figure: Database 1 through Database 4 and a separate analytic server or PC]
The Convergence of the Analytic and Data Environment (2/2)
Modern In-Database Architecture
The processing stays in the database where the data has been
consolidated
[Figure: Database 1 through Database 4 consolidated into one database, alongside the analytic server or PC]
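The contrast between the two architectures can be sketched in a few lines. This is only an illustration, assuming an in-memory SQLite database as a stand-in for the consolidated database; the sales table and its values are hypothetical.

```python
import sqlite3

# Hypothetical sales table; the names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("west", 20.0), ("east", 5.0), ("west", 1.0)],
)

# Traditional style: pull every row out, then aggregate on the analytic server.
pulled = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    pulled[region] = pulled.get(region, 0.0) + amount

# In-database style: the database aggregates and returns only the summary rows.
in_db = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
```

Both approaches give the same totals, but the in-database query moves only one row per region out of the database instead of every row.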
Massively Parallel Processing (1/3)
What is an MPP Database?
An MPP database breaks the data into independent chunks with
independent disk and CPU
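One common way to break data into independent chunks is to hash a key column and assign each row to a node. The sketch below assumes that scheme; NUM_NODES, node_for, distribute and the customer_id column are hypothetical names, and real MPP products may distribute data differently.

```python
from zlib import crc32

NUM_NODES = 4  # hypothetical number of disk/CPU units


def node_for(key: str) -> int:
    """Map a distribution key to a node number with a stable hash."""
    return crc32(key.encode("utf-8")) % NUM_NODES


def distribute(rows):
    """Assign each row to exactly one node based on its customer_id."""
    nodes = [[] for _ in range(NUM_NODES)]
    for row in rows:
        nodes[node_for(row["customer_id"])].append(row)
    return nodes
```

Rows that share a key always land on the same node, so each node can work on its own chunk without consulting the others.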
Massively Parallel Processing (2/3)
Concurrent Processing
An MPP system allows the different sets of CPU and disk to run the
process concurrently
An MPP system breaks the job into pieces
[Figure: single-threaded process versus parallel process]
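Breaking a job into pieces and running them concurrently can be sketched as follows. This is a toy illustration, not how an MPP database is implemented: a thread pool stands in for the independent sets of CPU and disk, and split and parallel_sum are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, pieces):
    """Cut data into at most `pieces` contiguous chunks."""
    size = max(1, -(-len(data) // pieces))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum(data, pieces=4):
    """Sum each chunk in its own worker, then combine the partial results."""
    chunks = split(data, pieces)
    with ThreadPoolExecutor(max_workers=pieces) as pool:
        partials = list(pool.map(sum, chunks))
    return sum(partials)
```

The single-threaded equivalent is simply sum(data); the parallel version produces the same answer by combining one partial sum per chunk.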
Massively Parallel Processing (3/3)
Others
MPP systems build in redundancy to make recovery easy
Cloud Computing (1/2)
What is Cloud Computing?
A 2009 McKinsey and Company paper¹ characterizes a cloud environment as follows:
– Masks the underlying infrastructure from the user
– Is elastic, scaling on demand
– Is charged on a pay-per-use basis
[1] McKinsey and Company, “Clearing the Air on Cloud Computing,” March 2009.
Cloud Computing (2/2)
Two Types of Cloud Environment
1. Public Cloud
– The services and infrastructure are provided off-site over the internet
– Greatest level of efficiency in shared resources
– Less secure and more vulnerable than private clouds
2. Private Cloud
– Infrastructure operated solely for a single organization
– Offers the same features as a public cloud
– Offer the greatest level of security and control
– Necessary to purchase and own the entire cloud infrastructure
Grid Computing
The federation of computer resources to reach a common goal
– E.g., SETI@Home (Search for Extraterrestrial Intelligence)
An Internet-based public volunteer computing project
MapReduce (1/3)
What is MapReduce?
A parallel programming framework¹
The library takes care of details such as:
– Parallelization
– Fault-tolerance
– Data distribution
– Load balancing
The user supplies two functions: map and reduce
– Map function
Processes a key/value pair to generate a set of intermediate key/value pairs
– Reduce function
Merges all intermediate values associated with the same intermediate key
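The two functions can be illustrated with the classic word-count example. This is a minimal single-process sketch, assuming a hypothetical run_mapreduce driver; a real framework would run many map and reduce tasks across machines and handle the distribution itself.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in the text."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: merge all values that share the same intermediate key."""
    return word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy driver: apply map, group values by intermediate key, apply reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())
```

Calling run_mapreduce with a list of (document id, text) pairs returns a dictionary of word counts; only map_fn and reduce_fn are specific to the problem.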
MapReduce (3/3)
Strengths and Weaknesses
Good for
– Lots of input, intermediate, and output data
– Batch-oriented datasets (ETL: Extract, Transform, Load)
– Cheap to get up and running because it runs on commodity hardware
Bad for
– Fast response time
– Large amounts of shared data
– CPU-intensive operations (as opposed to data-intensive ones)
– NOT a database!
No built-in security
No indexing, No query or process optimizer
No knowledge of other data that exists
Conclusion
These technologies can integrate and work together
– Databases running in the cloud
– Databases including MapReduce functionality
– MapReduce can be run against data sourced from a database
– MapReduce can also run against data in the cloud
[1] https://blogs.oracle.com/datawarehousing/entry/in-database_map-reduce
[2] http://code.google.com/p/cloudmapreduce/
“Cloud MapReduce: A MapReduce Implementation on Top of a Cloud Operating System,” CCGRID 2011, IEEE Computer Society.