A Lightweight Continuous Jobs Mechanism for MapReduce Frameworks
Trong-Tuan Vu
INRIA Lille Nord Europe
Fabrice Huet
INRIA-University of Nice
Model     | Data                      | Frameworks
Batch     | Static                    | Hadoop
Iterative | Dynamic                   | HOP, HaLoop, Twister, PIC
Real-time | Stream (dynamic/fast data)| Amazon S4, Twitter Storm
Canonical workflow
  Push data to the cluster
  Start jobs
  Pull results
  Profit!
Bulk arrival
  The job is submitted only once and runs automatically
  This slightly changes the workflow (see the sketch below)
  While (new data):
    Push, execute, pull, profit!
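As a reference point, here is a minimal sketch of this loop with plain Hadoop, where a fresh job must be submitted for every batch; the BatchLoopDriver class and the waitForNewBatch hook are illustrative names, not part of Hadoop or cHadoop.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BatchLoopDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        while (true) {
            // Hypothetical hook: block until a new batch of data has been pushed to HDFS.
            Path batch = waitForNewBatch();
            Job job = Job.getInstance(conf, "word-count");   // a fresh job per batch
            job.setJarByClass(BatchLoopDriver.class);
            FileInputFormat.addInputPath(job, batch);
            FileOutputFormat.setOutputPath(job, new Path("/results/" + System.currentTimeMillis()));
            job.waitForCompletion(true);                      // execute, then pull the results
        }
    }

    private static Path waitForNewBatch() {
        return new Path("/input/next-batch");                 // placeholder for a real data-arrival hook
    }
}

With a continuous job, the submission moves out of the loop: only the push and pull steps remain per batch.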
Continuous Analysis

Example: Word-Count over data arriving over time
  Batch 1: "Foo Bar"  -> Word-Count: Foo 1, Bar 1
  Batch 2: "What Bar" -> Word-Count: What 1, Bar 1
  Merged result: Foo 1, Bar 2, What 1
Properties
Efficiency
Only process new data, not the whole data set
Correctness
Merging all results computed on intermediate data should give the same result as processing the whole dataset (see the sketch below)
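A minimal sketch of this correctness property for the Word-Count example, using plain in-memory maps; the CountMerge class and merge helper are illustrative, not part of the framework.

import java.util.HashMap;
import java.util.Map;

public class CountMerge {
    // Merge per-batch word counts by summing the counts of identical words.
    static Map<String, Integer> merge(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> out = new HashMap<>(a);
        b.forEach((word, count) -> out.merge(word, count, Integer::sum));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> batch1 = Map.of("Foo", 1, "Bar", 1);   // counts for "Foo Bar"
        Map<String, Integer> batch2 = Map.of("What", 1, "Bar", 1);  // counts for "What Bar"
        // The merged counts are Foo 1, Bar 2, What 1: the same as
        // counting "Foo Bar What Bar" in a single job over the whole dataset.
        System.out.println(merge(batch1, batch2));
    }
}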
Dependencies

Example: Word-2 over the same data
  Batch 1: "Foo Bar"  -> Word-2: no result yet
  Batch 2: "What Bar" -> Word-2: Bar
  The result "Bar" depends on data seen in the previous batch
Different categories of data:
New data
Results
Carried data
Carried data
Example: Word-2
  Result: words which appear at least twice
  Carry: words which appear only once so far
[Dataflow diagram: Map -> Reduce produces results and a Carry; the Carry is re-injected into the Reduce of the next run, which produces a new Carry]
Contribution
CONTINUOUS HADOOP
[Architecture diagram: a Continuous Job and a Continuous JobTracker are added alongside the standard Job and JobTracker, with Tasks still executed by the TaskTrackers; a Continuous NameNode is added alongside the NameNode managing the Data Nodes]
// Word-2 reducer: a word seen fewer than twice is carried instead of emitted.
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) sum += v.get();
    IntWritable result = new IntWritable(sum);
    if (sum < 2) {
        context.carry(key, result);   // carried data (cHadoop extension)
    } else {
        context.write(key, result);   // final result
    }
}
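As shown in the dataflow diagram above, the carried pairs (words seen only once so far) are re-injected into the Reduce phase of the next run together with the new data, so a carried word can be promoted to a result once further occurrences arrive.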
Example SPARQL query:
SELECT ?yr
WHERE {
?journal rdf:type bench:Journal.
?journal dc:title "Journal 1 (1940)"^^xsd:string.
?journal dcterms:issued ?yr
}
Continuous SPARQL

[Diagram: each run translates the query into a Selection Job (Map, Reduce) followed by a Join Job (Map, Reduce); the Join Job emits a Carry that is re-injected into the Join Job of the next run]
[Plot: processing time in hundreds of seconds (0 to 14) for cHadoop vs. Hadoop as the input grows from 20 to 180 million RDF triples]
Experiments on 40 nodes
Conclusion