Hadoop The Definitive Guide 3rd Edition


THIRD EDITION

Hadoop: The Definitive Guide


Tom White
Beijing Cambridge Farnham Köln Sebastopol Tokyo
Hadoop: The Definitive Guide, Third Edition
by Tom White
Revision History for the :
2012-01-27 Early release revision 1
See http://oreilly.com/catalog/errata.csp?isbn=9781449311520 for release details.
ISBN: 978-1-449-31152-0
1327616795
For Eliane, Emilia, and Lottie
Table of Contents
Foreword
Preface
1. Meet Hadoop
Data!
Data Storage and Analysis
Comparison with Other Systems
RDBMS
Grid Computing
Volunteer Computing
A Brief History of Hadoop
Apache Hadoop and the Hadoop Ecosystem
Hadoop Releases
What's Covered in This Book
Compatibility
2. MapReduce
A Weather Dataset
Data Format
Analyzing the Data with Unix Tools
Analyzing the Data with Hadoop
Map and Reduce
Java MapReduce
Scaling Out
Data Flow
Combiner Functions
Running a Distributed MapReduce Job
Hadoop Streaming
Ruby
Python
Hadoop Pipes
Compiling and Running
3. The Hadoop Distributed Filesystem
The Design of HDFS
HDFS Concepts
Blocks
Namenodes and Datanodes
HDFS Federation
HDFS High-Availability
The Command-Line Interface
Basic Filesystem Operations
Hadoop Filesystems
Interfaces
The Java Interface
Reading Data from a Hadoop URL
Reading Data Using the FileSystem API
Writing Data
Directories
Querying the Filesystem
Deleting Data
Data Flow
Anatomy of a File Read
Anatomy of a File Write
Coherency Model
Parallel Copying with distcp
Keeping an HDFS Cluster Balanced
Hadoop Archives
Using Hadoop Archives
Limitations
4. Hadoop I/O
Data Integrity
Data Integrity in HDFS
LocalFileSystem
ChecksumFileSystem
Compression
Codecs
Compression and Input Splits
Using Compression in MapReduce
Serialization
The Writable Interface
Writable Classes
Implementing a Custom Writable
Serialization Frameworks
Avro
File-Based Data Structures
SequenceFile
MapFile
5. Developing a MapReduce Application
The Configuration API
Combining Resources
Variable Expansion
Configuring the Development Environment
Managing Configuration
GenericOptionsParser, Tool, and ToolRunner
Writing a Unit Test
Mapper
Reducer
Running Locally on Test Data
Running a Job in a Local Job Runner
Testing the Driver
Running on a Cluster
Packaging
Launching a Job
The MapReduce Web UI
Retrieving the Results
Debugging a Job
Hadoop Logs
Remote Debugging
Tuning a Job
Profiling Tasks
MapReduce Workflows
Decomposing a Problem into MapReduce Jobs
JobControl
Apache Oozie
6. How MapReduce Works
Anatomy of a MapReduce Job Run
Classic MapReduce (MapReduce 1)
YARN (MapReduce 2)
Failures
Failures in Classic MapReduce
Failures in YARN
Job Scheduling
The Fair Scheduler
The Capacity Scheduler
Shuffle and Sort
The Map Side
The Reduce Side
Configuration Tuning
Task Execution
The Task Execution Environment
Speculative Execution
Output Committers
Task JVM Reuse
Skipping Bad Records
7. MapReduce Types and Formats
MapReduce Types
The Default MapReduce Job
Input Formats
Input Splits and Records
Text Input
Binary Input
Multiple Inputs
Database Input (and Output)
Output Formats
Text Output
Binary Output
Multiple Outputs
Lazy Output
Database Output
8. MapReduce Features
Counters
Built-in Counters
User-Defined Java Counters
User-Defined Streaming Counters
Sorting
Preparation
Partial Sort
Total Sort
Secondary Sort
Joins
Map-Side Joins
Reduce-Side Joins
Side Data Distribution
Using the Job Configuration
Distributed Cache
MapReduce Library Classes
9. Setting Up a Hadoop Cluster
Cluster Specification
Network Topology
Cluster Setup and Installation
Installing Java
Creating a Hadoop User
Installing Hadoop
Testing the Installation
SSH Configuration
Hadoop Configuration
Configuration Management
Environment Settings
Important Hadoop Daemon Properties
Hadoop Daemon Addresses and Ports
Other Hadoop Properties
User Account Creation
YARN Configuration
Important YARN Daemon Properties
YARN Daemon Addresses and Ports
Security
Kerberos and Hadoop
Delegation Tokens
Other Security Enhancements
Benchmarking a Hadoop Cluster
Hadoop Benchmarks
User Jobs
Hadoop in the Cloud
Hadoop on Amazon EC2
10. Administering Hadoop
HDFS
Persistent Data Structures
Safe Mode
Audit Logging
Tools
Monitoring
Logging
Metrics
Java Management Extensions
Maintenance
Routine Administration Procedures
Commissioning and Decommissioning Nodes
Upgrades
11. Pig
Installing and Running Pig
Execution Types
Running Pig Programs
Grunt
Pig Latin Editors
An Example
Generating Examples
Comparison with Databases
Pig Latin
Structure
Statements
Expressions
Types
Schemas
Functions
Macros
User-Defined Functions
A Filter UDF
An Eval UDF
A Load UDF
Data Processing Operators
Loading and Storing Data
Filtering Data
Grouping and Joining Data
Sorting Data
Combining and Splitting Data
Pig in Practice
Parallelism
Parameter Substitution
12. Hive
Installing Hive
The Hive Shell
An Example
Running Hive
Configuring Hive
Hive Services
The Metastore
Comparison with Traditional Databases
Schema on Read Versus Schema on Write
Updates, Transactions, and Indexes
HiveQL
Data Types
Operators and Functions
Tables
Managed Tables and External Tables
Partitions and Buckets
Storage Formats
Importing Data
Altering Tables
Dropping Tables
Querying Data
Sorting and Aggregating
MapReduce Scripts
Joins
Subqueries
Views
User-Defined Functions
Writing a UDF
Writing a UDAF
13. HBase
HBasics
Backdrop
Concepts
Whirlwind Tour of the Data Model
Implementation
Installation
Test Drive
Clients
Java
Avro, REST, and Thrift
Example
Schemas
Loading Data
Web Queries
HBase Versus RDBMS
Successful Service
HBase
Use Case: HBase at Streamy.com
Praxis
Versions
HDFS
UI
Metrics
Schema Design
Counters
Bulk Load
14. ZooKeeper
Installing and Running ZooKeeper
An Example
Group Membership in ZooKeeper
Creating the Group
Joining a Group
Listing Members in a Group
Deleting a Group
The ZooKeeper Service
Data Model
Operations
Implementation
Consistency
Sessions
States
Building Applications with ZooKeeper
A Configuration Service
The Resilient ZooKeeper Application
A Lock Service
More Distributed Data Structures and Protocols
ZooKeeper in Production
Resilience and Performance
Configuration
15. Sqoop
Getting Sqoop
A Sample Import
Generated Code
Additional Serialization Systems
Database Imports: A Deeper Look
Controlling the Import
Imports and Consistency
Direct-mode Imports
Working with Imported Data
Imported Data and Hive
Importing Large Objects
Performing an Export
Exports: A Deeper Look
Exports and Transactionality
Exports and SequenceFiles
16. Case Studies
Hadoop Usage at Last.fm
Last.fm: The Social Music Revolution
Hadoop at Last.fm
Generating Charts with Hadoop
The Track Statistics Program
Summary
Hadoop and Hive at Facebook
Introduction
Hadoop at Facebook
Hypothetical Use Case Studies
Hive
Problems and Future Work
Nutch Search Engine
Background
Data Structures
Selected Examples of Hadoop Data Processing in Nutch
Summary
Log Processing at Rackspace
Requirements/The Problem
Brief History
Choosing Hadoop
Collection and Storage
MapReduce for Logs
Cascading
Fields, Tuples, and Pipes
Operations
Taps, Schemes, and Flows
Cascading in Practice
Flexibility
Hadoop and Cascading at ShareThis
Summary
TeraByte Sort on Apache Hadoop
Using Pig and Wukong to Explore Billion-edge Network Graphs
Measuring Community
Everybody's Talkin' at Me: The Twitter Reply Graph
Symmetric Links
Community Extraction
A. Installing Apache Hadoop
B. Cloudera's Distribution for Hadoop
C. Preparing the NCDC Weather Data
Foreword
Hadoop got its start in Nutch. A few of us were attempting to build an open source
web search engine and having trouble managing computations running on even a
handful of computers. Once Google published its GFS and MapReduce papers, the
route became clear. They'd devised systems to solve precisely the problems we were
having with Nutch. So we started, two of us, half-time, to try to re-create these systems
as a part of Nutch.
We managed to get Nutch limping along on 20 machines, but it soon became clear that
to handle the Web's massive scale, we'd need to run it on thousands of machines and,
moreover, that the job was bigger than two half-time developers could handle.
Around that time, Yahoo! got interested, and quickly put together a team that I joined.
We split off the distributed computing part of Nutch, naming it Hadoop. With the help
of Yahoo!, Hadoop soon grew into a technology that could truly scale to the Web.
In 2006, Tom White started contributing to Hadoop. I already knew Tom through an
excellent article he'd written about Nutch, so I knew he could present complex ideas
in clear prose. I soon learned that he could also develop software that was as pleasant
to read as his prose.
From the beginning, Tom's contributions to Hadoop showed his concern for users and
for the project. Unlike most open source contributors, Tom is not primarily interested
in tweaking the system to better meet his own needs, but rather in making it easier for
anyone to use.
Initially, Tom specialized in making Hadoop run well on Amazon's EC2 and S3 services.
Then he moved on to tackle a wide variety of problems, including improving the
MapReduce APIs, enhancing the website, and devising an object serialization framework.
In all cases, Tom presented his ideas precisely. In short order, Tom earned the
role of Hadoop committer and soon thereafter became a member of the Hadoop Project
Management Committee.
Tom is now a respected senior member of the Hadoop developer community. Though
he's an expert in many technical corners of the project, his specialty is making Hadoop
easier to use and understand.
Given this, I was very pleased when I learned that Tom intended to write a book about
Hadoop. Who could be better qualified? Now you have the opportunity to learn about
Hadoop from a master, not only of the technology, but also of common sense and
plain talk.
Doug Cutting
Shed in the Yard, California
Preface
Martin Gardner, the mathematics and science writer, once said in an interview:
Beyond calculus, I am lost. That was the secret of my column's success. It took me so
long to understand what I was writing about that I knew how to write in a way most
readers would understand.¹
In many ways, this is how I feel about Hadoop. Its inner workings are complex, resting
as they do on a mixture of distributed systems theory, practical engineering, and
common sense. And to the uninitiated, Hadoop can appear alien.
But it doesn't need to be like this. Stripped to its core, the tools that Hadoop provides
for building distributed systems (for data storage, data analysis, and coordination)
are simple. If there's a common theme, it is about raising the level of abstraction: to
create building blocks for programmers who just happen to have lots of data to store,
or lots of data to analyze, or lots of machines to coordinate, and who don't have the
time, the skill, or the inclination to become distributed systems experts to build the
infrastructure to handle it.
With such a simple and generally applicable feature set, it seemed obvious to me when
I started using it that Hadoop deserved to be widely used. However, at the time (in
early 2006), setting up, configuring, and writing programs to use Hadoop was an art.
Things have certainly improved since then: there is more documentation, there are
more examples, and there are thriving mailing lists to go to when you have questions.
And yet the biggest hurdle for newcomers is understanding what this technology is
capable of, where it excels, and how to use it. That is why I wrote this book.
The Apache Hadoop community has come a long way. Over the course of three years,
the Hadoop project has blossomed and spun off half a dozen subprojects. In this time,
the software has made great leaps in performance, reliability, scalability, and
manageability. To gain even wider adoption, however, I believe we need to make Hadoop even
easier to use. This will involve writing more tools; integrating with more systems; and
writing new, improved APIs. I'm looking forward to being a part of this, and I hope
this book will encourage and enable others to do so, too.
1. "The science of fun," Alex Bellos, The Guardian, May 31, 2008, http://www.guardian.co.uk/science/2008/may/31/maths.science.
Administrative Notes
During discussion of a particular Java class in the text, I often omit its package name,
to reduce clutter. If you need to know which package a class is in, you can easily look
it up in Hadoop's Java API documentation for the relevant subproject, linked to from
the Apache Hadoop home page at http://hadoop.apache.org/. Or if you're using an IDE,
its auto-complete mechanism can help.
Similarly, although it deviates from usual style guidelines, program listings that import
multiple classes from the same package may use the asterisk wildcard character to save
space (for example: import org.apache.hadoop.io.*).
The sample programs in this book are available for download from the website that
accompanies this book: http://www.hadoopbook.com/. You will also find instructions
there for obtaining the datasets that are used in examples throughout the book, as well
as further notes for running the programs in the book, and links to updates, additional
resources, and my blog.
What's in This Book?
The rest of this book is organized as follows. Chapter 1 emphasizes the need for Hadoop
and sketches the history of the project. Chapter 2 provides an introduction to
MapReduce. Chapter 3 looks at Hadoop filesystems, and in particular HDFS, in depth.
Chapter 4 covers the fundamentals of I/O in Hadoop: data integrity, compression,
serialization, and file-based data structures.
The next four chapters cover MapReduce in depth. Chapter 5 goes through the practical
steps needed to develop a MapReduce application. Chapter 6 looks at how MapReduce
is implemented in Hadoop, from the point of view of a user. Chapter 7 is about the
MapReduce programming model, and the various data formats that MapReduce can
work with. Chapter 8 is on advanced MapReduce topics, including sorting and joining
data.
Chapters 9 and 10 are for Hadoop administrators, and describe how to set up and
maintain a Hadoop cluster running HDFS and MapReduce.
Later chapters are dedicated to projects that build on Hadoop or are related to it.
Chapters 11 and 12 present Pig and Hive, which are analytics platforms built on HDFS
and MapReduce, whereas Chapters 13, 14, and 15 cover HBase, ZooKeeper, and
Sqoop, respectively.
Finally, Chapter 16 is a collection of case studies contributed by members of the Apache
Hadoop community.
What's New in the Second Edition?
The second edition has two new chapters on Hive and Sqoop (Chapters 12 and 15), a
new section covering Avro (in Chapter 4), an introduction to the new security features
in Hadoop (in Chapter 9), and a new case study on analyzing massive network graphs
using Hadoop (in Chapter 16).
This edition continues to describe the 0.20 release series of Apache Hadoop, since this
was the latest stable release at the time of writing. New features from later releases are
occasionally mentioned in the text, however, with reference to the version that they
were introduced in.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
    Used for program listings, as well as within paragraphs to refer to program elements
    such as variable or function names, databases, data types, environment variables,
    statements, and keywords.
Constant width bold
    Shows commands or other text that should be typed literally by the user.
Constant width italic
    Shows text that should be replaced with user-supplied values or by values
    determined by context.
This icon signifies a tip, suggestion, or general note.
This icon indicates a warning or caution.
Using Code Examples
This book is here to help you get your job done. In general, you may use the code in
this book in your programs and documentation. You do not need to contact us for
permission unless you're reproducing a significant portion of the code. For example,
writing a program that uses several chunks of code from this book does not require
permission. Selling or distributing a CD-ROM of examples from O'Reilly books does
require permission. Answering a question by citing this book and quoting example
code does not require permission. Incorporating a significant amount of example code
from this book into your product's documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title,
author, publisher, and ISBN. For example: "Hadoop: The Definitive Guide, Second
Edition, by Tom White. Copyright 2011 Tom White, 978-1-449-38973-4."
If you feel your use of code examples falls outside fair use or the permission given above,
feel free to contact us at permissions@oreilly.com.
Safari Books Online
Safari Books Online is an on-demand digital library that lets you easily
search over 7,500 technology and creative reference books and videos to
find the answers you need quickly.
With a subscription, you can read any page and watch any video from our library online.
Read books on your cell phone and mobile devices. Access new titles before they are
available for print, and get exclusive access to manuscripts in development and post
feedback for the authors. Copy and paste code samples, organize your favorites,
download chapters, bookmark key sections, create notes, print out pages, and benefit from
tons of other time-saving features.
O'Reilly Media has uploaded this book to the Safari Books Online service. To have full
digital access to this book and others on similar topics from O'Reilly and other
publishers, sign up for free at http://my.safaribooksonline.com.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O'Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at:
http://oreilly.com/catalog/0636920010388/
The author also has a site for this book at:
http://www.hadoopbook.com/
To comment or ask technical questions about this book, send email to:
bookquestions@oreilly.com
For more information about our books, conferences, Resource Centers, and the
O'Reilly Network, see our website at:
http://www.oreilly.com
Acknowledgments
I have relied on many people, both directly and indirectly, in writing this book. I would
like to thank the Hadoop community, from whom I have learned, and continue to learn,
a great deal.
In particular, I would like to thank Michael Stack and Jonathan Gray for writing the
chapter on HBase. Also thanks go to Adrian Woodhead, Marc de Palol, Joydeep Sen
Sarma, Ashish Thusoo, Andrzej Bialecki, Stu Hood, Chris K. Wensel, and Owen
O'Malley for contributing case studies for Chapter 16.
I would like to thank the following reviewers who contributed many helpful suggestions
and improvements to my drafts: Raghu Angadi, Matt Biddulph, Christophe Bisciglia,
Ryan Cox, Devaraj Das, Alex Dorman, Chris Douglas, Alan Gates, Lars George, Patrick
Hunt, Aaron Kimball, Peter Krey, Hairong Kuang, Simon Maxen, Olga Natkovich,
Benjamin Reed, Konstantin Shvachko, Allen Wittenauer, Matei Zaharia, and Philip
Zeyliger. Ajay Anand kept the review process flowing smoothly. Philip ("flip") Kromer
kindly helped me with the NCDC weather dataset featured in the examples in this book.
Special thanks to Owen O'Malley and Arun C. Murthy for explaining the intricacies of
the MapReduce shuffle to me. Any errors that remain are, of course, to be laid at my
door.
For the second edition, I owe a debt of gratitude for the detailed review and feedback
from Jeff Bean, Doug Cutting, Glynn Durham, Alan Gates, Jeff Hammerbacher, Alex
Kozlov, Ken Krugler, Jimmy Lin, Todd Lipcon, Sarah Sproehnle, Vinithra Varadharajan,
and Ian Wrigley, as well as all the readers who submitted errata for the first edition.
I would also like to thank Aaron Kimball for contributing the chapter on Sqoop, and
Philip ("flip") Kromer for the case study on graph processing.
I am particularly grateful to Doug Cutting for his encouragement, support, and
friendship, and for contributing the foreword.
Thanks also go to the many others with whom I have had conversations or email
discussions over the course of writing the book.
Halfway through writing this book, I joined Cloudera, and I want to thank my
colleagues for being incredibly supportive in allowing me the time to write, and to get
it finished promptly.
I am grateful to my editor, Mike Loukides, and his colleagues at O'Reilly for their help
in the preparation of this book. Mike has been there throughout to answer my
questions, to read my first drafts, and to keep me on schedule.
Finally, the writing of this book has been a great deal of work, and I couldn't have done
it without the constant support of my family. My wife, Eliane, not only kept the home
going, but also stepped in to help review, edit, and chase case studies. My daughters,
Emilia and Lottie, have been very understanding, and I'm looking forward to spending
lots more time with all of them.
CHAPTER 1
Meet Hadoop
In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log,
they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for
more systems of computers.
Grace Hopper
Data!
We live in the data age. It's not easy to measure the total volume of data stored
electronically, but an IDC estimate put the size of the "digital universe" at 0.18 zettabytes
in 2006 and forecast a tenfold growth by 2011 to 1.8 zettabytes.¹ A zettabyte is
10^21 bytes, or equivalently one thousand exabytes, one million petabytes, or one billion
terabytes. That's roughly the same order of magnitude as one disk drive for every person
in the world.
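The unit relationships above are easy to check with a few lines of Python. This is only a sketch: it uses decimal (SI) units as the text does, the function names are illustrative, and the world population figure is an assumption of mine (roughly 6.7 billion), not a number taken from the IDC estimate.

```python
# Decimal (SI) byte units, matching the text's definition of a zettabyte as 10**21 bytes.
BYTES_PER_UNIT = {
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
}


def units_per_zettabyte(unit):
    """Return how many of `unit` make up one zettabyte (exact integer arithmetic)."""
    return BYTES_PER_UNIT["zettabyte"] // BYTES_PER_UNIT[unit]


def gigabytes_per_person(total_zettabytes, population):
    """Spread a data volume evenly over a population; result in gigabytes per person."""
    total_bytes = total_zettabytes * BYTES_PER_UNIT["zettabyte"]
    return total_bytes / population / BYTES_PER_UNIT["gigabyte"]


# ASSUMPTION: a world population of about 6.7 billion (not a figure from the text).
ASSUMED_WORLD_POPULATION = 6_700_000_000

if __name__ == "__main__":
    for unit in ("exabyte", "petabyte", "terabyte"):
        print(f"1 zettabyte = {units_per_zettabyte(unit):,} {unit}s")
    share = gigabytes_per_person(1.8, ASSUMED_WORLD_POPULATION)
    print(f"1.8 ZB over the assumed population: about {share:.0f} GB per person")
```

Under that population assumption, the per-person share comes out at a few hundred gigabytes, which is the sense in which the total is comparable to one disk drive each.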
This flood of data is coming from many sources. Consider the following:²
• The New York Stock Exchange generates about one terabyte of new trade data per
  day.
• Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
• Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.
• The Internet Archive stores around 2 petabytes of data, and is growing at a rate of
  20 terabytes per month.
• The Large Hadron Collider near Geneva, Switzerland, will produce about 15
  petabytes of data per year.
1. From Gantz et al., "The Diverse and Exploding Digital Universe," March 2008 (http://www.emc.com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf).
2. http://www.intelligententerprise.com/showArticle.jhtml?articleID=207800705, http://mashable.com/2008/10/15/facebook-10-billion-photos/, http://blog.familytreemagazine.com/insider/Inside-Ancestrycoms-TopSecret-Data-Center.aspx, and http://www.archive.org/about/faqs.php, http://www.interactions.org/cms/?pid=1027032.
So there's a lot of data out there. But you are probably wondering how it affects you.
Most of the data is locked up in the largest web properties (like search engines), or
scientific or financial institutions, isn't it? Does the advent of "Big Data," as it is being
called, affect smaller organizations or individuals?
I argue that it does. Take photos, for example. My wife's grandfather was an avid
photographer, and took photographs throughout his adult life. His entire corpus of
medium format, slide, and 35mm film, when scanned in at high resolution, occupies
around 10 gigabytes. Compare this to the digital photos that my family took in 2008,
which take up about 5 gigabytes of space. My family is producing photographic data
at 35 times the rate my wife's grandfather did, and the rate is increasing every year as
it becomes easier to take more and more photos.
More generally, the digital streams that individuals are producing are growing apace.
Microsoft Research's MyLifeBits project gives a glimpse of archiving of personal
information that may become commonplace in the near future. MyLifeBits was an
experiment where an individual's interactions (phone calls, emails, documents) were
captured electronically and stored for later access. The data gathered included a photo
taken every minute, which resulted in an overall data volume of one gigabyte a month.
When storage costs come down enough to make it feasible to store continuous audio
and video, the data volume for a future MyLifeBits service will be many times that.
The trend is for every individual's data footprint to grow, but perhaps more important,
the amount of data generated by machines will be even greater than that generated by
people. Machine logs, RFID readers, sensor networks, vehicle GPS traces, retail
transactions: all of these contribute to the growing mountain of data.
The volume of data being made publicly available increases every year, too.
Organizations no longer have to merely manage their own data: success in the future will be
dictated to a large extent by their ability to extract value from other organizations' data.
Initiatives such as Public Data Sets on Amazon Web Services, Infochimps.org, and
theinfo.org exist to foster the "information commons," where data can be freely (or in
the case of AWS, for a modest price) shared for anyone to download and analyze.
Mashups between different information sources make for unexpected and hitherto
unimaginable applications.
Take, for example, the Astrometry.net project, which watches the Astrometry group
on Flickr for new photos of the night sky. It analyzes each image and identifies which
part of the sky it is from, as well as any interesting celestial bodies, such as stars or
galaxies. This project shows the kind of things that are possible when data (in this case,
tagged photographic images) is made available and used for something (image analysis)
that was not anticipated by the creator.
It has been said that "More data usually beats better algorithms," which is to say that
for some problems (such as recommending movies or music based on past preferences),
however fiendish your algorithms are, they can often be beaten simply by having more
data (and a less sophisticated algorithm).³
The good news is that Big Data is here. The bad news is that we are struggling to store
and analyze it.
Data Storage and Analysis
The problem is simple: while the storage capacities of hard drives have increased
massively over the years, access speeds (the rate at which data can be read from drives)
have not kept up. One typical drive from 1990 could store 1,370 MB of data and had
a transfer speed of 4.4 MB/s,⁴ so you could read all the data from a full drive in around
five minutes. Over 20 years later, one-terabyte drives are the norm, but the transfer
speed is around 100 MB/s, so it takes more than two and a half hours to read all the
data off the disk.
This is a long time to read all data on a single drive, and writing is even slower. The
obvious way to reduce the time is to read from multiple disks at once. Imagine if we
had 100 drives, each holding one hundredth of the data. Working in parallel, we could
read the data in under two minutes.
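The arithmetic behind these figures can be checked with a short Python sketch. It assumes an idealized sequential read at the quoted transfer speed, ignores seek time and any coordination overhead, and treats one terabyte as 1,000,000 MB; the function and constant names are illustrative and are not part of Hadoop.

```python
def read_time_seconds(capacity_mb, transfer_mb_per_s, drives=1):
    """Seconds to read `capacity_mb` of data split evenly over `drives` disks read in parallel."""
    return capacity_mb / drives / transfer_mb_per_s


# Figures quoted in the text.
DRIVE_1990_MB = 1_370
SPEED_1990_MB_S = 4.4
ONE_TERABYTE_MB = 1_000_000  # ASSUMPTION: decimal terabyte expressed in megabytes
SPEED_MODERN_MB_S = 100

if __name__ == "__main__":
    old = read_time_seconds(DRIVE_1990_MB, SPEED_1990_MB_S)
    new = read_time_seconds(ONE_TERABYTE_MB, SPEED_MODERN_MB_S)
    parallel = read_time_seconds(ONE_TERABYTE_MB, SPEED_MODERN_MB_S, drives=100)
    print(f"1990 drive, full read:      {old / 60:.1f} minutes")
    print(f"1 TB drive, full read:      {new / 3600:.2f} hours")
    print(f"1 TB spread over 100 disks: {parallel:.0f} seconds")
```

The point of the comparison is that capacity has grown by a much larger factor than transfer speed, so only parallel reads bring the scan time back down to minutes.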
Only using one hundredth of a disk may seem wasteful. But we can store one hundred
datasets, each of which is one terabyte, and provide shared access to them. We can
imagine that the users of such a system would be happy to share access in return for
shorter analysis times, and, statistically, that their analysis jobs would be likely to be
spread over time, so they wouldn't interfere with each other too much.
There's more to being able to read and write data in parallel to or from multiple disks,
though.
The first problem to solve is hardware failure: as soon as you start using many pieces
of hardware, the chance that one will fail is fairly high. A common way of avoiding data
loss is through replication: redundant copies of the data are kept by the system so that
in the event of failure, there is another copy available. This is how RAID works, for
instance, although Hadoop's filesystem, the Hadoop Distributed Filesystem (HDFS),
takes a slightly different approach, as you shall see later.
The second problem is that most analysis tasks need to be able to combine the data in
some way; data read from one disk may need to be combined with the data from any
of the other 99 disks. Various distributed systems allow data to be combined from
multiple sources, but doing this correctly is notoriously challenging. MapReduce
provides a programming model that abstracts the problem from disk reads and writes,
transforming it into a computation over sets of keys and values. We will look at the
details of this model in later chapters, but the important point for the present discussion
is that there are two parts to the computation, the map and the reduce, and it's the
interface between the two where the "mixing" occurs. Like HDFS, MapReduce has
built-in reliability.
3. The quote is from Anand Rajaraman writing about the Netflix Challenge
(http://anand.typepad.com/datawocky/2008/03/more-data-usual.html). Alon Halevy, Peter Norvig,
and Fernando Pereira make the same point in "The Unreasonable Effectiveness of Data,"
IEEE Intelligent Systems, March/April 2009.
4. These specifications are for the Seagate ST-41600n.
This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis
system. The storage is provided by HDFS and analysis by MapReduce. There are other
parts to Hadoop, but these capabilities are its kernel.
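To make the two parts of the computation concrete, here is a minimal in-memory sketch
in Python. It is not Hadoop code: the record format (a year and a temperature separated
by a comma), the sample data, and the names map_fn, reduce_fn, and run_job are all
hypothetical, and everything runs in a single process.

```python
from collections import defaultdict

def map_fn(record):
    """Map: turn one input record into zero or more (key, value) pairs."""
    year, temperature = record.split(",")
    yield year, int(temperature)

def reduce_fn(key, values):
    """Reduce: collapse all the values that share a key into one result."""
    return key, max(values)

def run_job(records):
    # The "mixing" step between map and reduce: group every value under its key.
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))

records = ["1949,111", "1950,0", "1950,22", "1949,78", "1950,-11"]
result = run_job(records)
print(result)  # {'1949': 111, '1950': 22}
```

Neither function knows where the records came from or how many there are; the grouping
step is the only place where data from different inputs meets.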
Comparison with Other Systems
The approach taken by MapReduce may seem like a brute-force approach. The premise
is that the entire dataset (or at least a good portion of it) is processed for each query.
But this is its power. MapReduce is a batch query processor, and the ability to run an
ad hoc query against your whole dataset and get the results in a reasonable time is
transformative. It changes the way you think about data, and unlocks data that was
previously archived on tape or disk. It gives people the opportunity to innovate with
data. Questions that took too long to get answered before can now be answered, which
in turn leads to new questions and new insights.
For example, Mailtrust, Rackspace's mail division, used Hadoop for processing email
logs. One ad hoc query they wrote was to find the geographic distribution of their users.
In their words:
This data was so useful that we've scheduled the MapReduce job to run monthly and we
will be using this data to help us decide which Rackspace data centers to place new mail
servers in as we grow.
By bringing several hundred gigabytes of data together and having the tools to analyze
it, the Rackspace engineers were able to gain an understanding of the data that they
otherwise would never have had, and, furthermore, they were able to use what they
had learned to improve the service for their customers. You can read more about how
Rackspace uses Hadoop in Chapter 16.
RDBMS
Why can't we use databases with lots of disks to do large-scale batch analysis? Why is
MapReduce needed?
The answer to these questions comes from another trend in disk drives: seek time is
improving more slowly than transfer rate. Seeking is the process of moving the disk's
head to a particular place on the disk to read or write data. It characterizes the latency
of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth.
If the data access pattern is dominated by seeks, it will take longer to read or write large
portions of the dataset than streaming through it, which operates at the transfer rate.
On the other hand, for updating a small proportion of records in a database, a
traditional B-Tree (the data structure used in relational databases, which is limited by the
rate it can perform seeks) works well. For updating the majority of a database, a B-Tree
is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.
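A rough back-of-the-envelope model shows why the balance tips. The Python sketch below
compares updating a fraction of records in place (one seek per record) with rewriting
the whole dataset sequentially. The 10 ms seek time and 100-byte record size are
assumptions chosen for illustration, not measurements; the 100 MB/s transfer rate is the
figure quoted earlier in the chapter.

```python
SEEK_TIME_S = 0.010            # assumed average seek time (10 ms)
TRANSFER_BYTES_PER_S = 100e6   # 100 MB/s transfer rate
RECORD_BYTES = 100             # assumed record size
DATASET_BYTES = 1e12           # one terabyte

def seek_update_seconds(fraction):
    """Update a fraction of the records in place, paying one seek per record."""
    records = DATASET_BYTES / RECORD_BYTES
    return fraction * records * SEEK_TIME_S

def streaming_rewrite_seconds():
    """Rewrite the whole dataset sequentially: one full read plus one full write."""
    return 2 * DATASET_BYTES / TRANSFER_BYTES_PER_S

for fraction in (0.0001, 0.01, 0.5):
    hours = seek_update_seconds(fraction) / 3600
    print(f"{fraction:>7.2%} of records updated by seeking: {hours:10.1f} hours")
print(f"full streaming rewrite: {streaming_rewrite_seconds() / 3600:.1f} hours")
```

Under these assumptions the streaming rewrite takes the same time no matter how many
records change, while the seek-based cost grows with every record touched. The model
deliberately ignores caching, batching of nearby updates, and the transfer time of each
record, all of which a real database would exploit.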
In many ways, MapReduce can be seen as a complement to an RDBMS. (The differences
between the two systems are shown in Table 1-1.) MapReduce is a good fit for problems
that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc
analysis. An RDBMS is good for point queries or updates, where the dataset has been
indexed to deliver low-latency retrieval and update times of a relatively small amount of
data. MapReduce suits applications where the data is written once, and read many
times, whereas a relational database is good for datasets that are continually updated.
Table 1-1. RDBMS compared to MapReduce
             Traditional RDBMS            MapReduce
Data size    Gigabytes                    Petabytes
Access       Interactive and batch        Batch
Updates      Read and write many times    Write once, read many times
Structure    Static schema                Dynamic schema
Integrity    High                         Low
Scaling      Nonlinear                    Linear
Another difference between MapReduce and an RDBMS is the amount of structure in
the datasets that they operate on. Structured data is data that is organized into entities
that have a defined format, such as XML documents or database tables that conform
to a particular predefined schema. This is the realm of the RDBMS. Semi-structured
data, on the other hand, is looser, and though there may be a schema, it is often ignored,
so it may be used only as a guide to the structure of the data: for example, a spreadsheet,
in which the structure is the grid of cells, although the cells themselves may hold any
form of data. Unstructured data does not have any particular internal structure: for
example, plain text or image data. MapReduce works well on unstructured or
semi-structured data, since it is designed to interpret the data at processing time. In other
words, the input keys and values for MapReduce are not an intrinsic property of the
data, but they are chosen by the person analyzing the data.
Relational data is often normalized to retain its integrity and remove redundancy.
Normalization poses problems for MapReduce, since it makes reading a record a
nonlocal operation, and one of the central assumptions that MapReduce makes is that it
is possible to perform (high-speed) streaming reads and writes.
A web server log is a good example of a set of records that is not normalized (for
example, the client hostnames are specified in full each time, even though the same client
may appear many times), and this is one reason that logfiles of all kinds are particularly
well-suited to analysis with MapReduce.
MapReduce is a linearly scalable programming model. The programmer writes two
functions (a map function and a reduce function), each of which defines a mapping
from one set of key-value pairs to another. These functions are oblivious to the size of
the data or the cluster that they are operating on, so they can be used unchanged for a
small dataset and for a massive one. More important, if you double the size of the input
data, a job will run twice as slow. But if you also double the size of the cluster, a job
will run as fast as the original one. This is not generally true of SQL queries.
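The scaling claim can be written as a one-line idealized model. In the Python sketch
below, job_time and per_node_rate are hypothetical names, and the model assumes that
work divides evenly over the nodes with no fixed overhead, which real clusters only
approximate.

```python
def job_time(input_size, cluster_nodes, per_node_rate=1.0):
    """Idealized job time: work divides evenly over the nodes, with no fixed overhead."""
    return input_size / (cluster_nodes * per_node_rate)

baseline = job_time(input_size=100, cluster_nodes=10)
doubled_input = job_time(input_size=200, cluster_nodes=10)
doubled_both = job_time(input_size=200, cluster_nodes=20)

print(baseline, doubled_input, doubled_both)  # 10.0 20.0 10.0
```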
Over time, however, the differences between relational databases and MapReduce
systems are likely to blur, both as relational databases start incorporating some of the
ideas from MapReduce (such as Aster Data's and Greenplum's databases) and, from
the other direction, as higher-level query languages built on MapReduce (such as Pig
and Hive) make MapReduce systems more approachable to traditional database
programmers.[5]
Grid Computing
The High Performance Computing (HPC) and Grid Computing communities have
been doing large-scale data processing for years, using such APIs as Message Passing
Interface (MPI). Broadly, the approach in HPC is to distribute the work across a cluster
of machines, which access a shared filesystem, hosted by a SAN. This works well for
predominantly compute-intensive jobs, but becomes a problem when nodes need to
access larger data volumes (hundreds of gigabytes, the point at which MapReduce really
starts to shine), since the network bandwidth is the bottleneck and compute nodes
become idle.
5. In January 2007, David J. DeWitt and Michael Stonebraker caused a stir by publishing
"MapReduce: A major step backwards"
(http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards),
in which they criticized MapReduce for being a poor substitute for relational databases.
Many commentators argued that it was a false comparison (see, for example, Mark C.
Chu-Carroll's "Databases are hammers; MapReduce is a screwdriver,"
http://scienceblogs.com/goodmath/2008/01/databases_are_hammers_mapreduc.php), and DeWitt
and Stonebraker followed up with "MapReduce II"
(http://databasecolumn.vertica.com/database-innovation/mapreduce-ii), where they addressed
the main topics brought up by others.
MapReduce tries to collocate the data with the compute node, so data access is fast
since it is local.[6] This feature, known as data locality, is at the heart of MapReduce and
is the reason for its good performance. Recognizing that network bandwidth is the most
precious resource in a data center environment (it is easy to saturate network links by
copying data around), MapReduce implementations go to great lengths to conserve it
by explicitly modelling network topology. Notice that this arrangement does not
preclude high-CPU analyses in MapReduce.
MPI gives great control to the programmer, but requires that he or she explicitly handle
the mechanics of the data flow, exposed via low-level C routines and constructs, such
as sockets, as well as the higher-level algorithm for the analysis. MapReduce operates
only at the higher level: the programmer thinks in terms of functions of key and value
pairs, and the data flow is implicit.
Coordinating the processes in a large-scale distributed computation is a challenge. The
hardest aspect is gracefully handling partial failure (when you don't know if a remote
process has failed or not) and still making progress with the overall computation.
MapReduce spares the programmer from having to think about failure, since the
implementation detects failed map or reduce tasks and reschedules replacements on
machines that are healthy. MapReduce is able to do this since it is a shared-nothing
architecture, meaning that tasks have no dependence on one another. (This is a slight
oversimplification, since the output from mappers is fed to the reducers, but this is
under the control of the MapReduce system; in this case, it needs to take more care
rerunning a failed reducer than rerunning a failed map, since it has to make sure it can
retrieve the necessary map outputs, and if not, regenerate them by running the relevant
maps again.) So from the programmer's point of view, the order in which the tasks run
doesn't matter. By contrast, MPI programs have to explicitly manage their own
checkpointing and recovery, which gives more control to the programmer, but makes them
more difficult to write.
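The idea that independent tasks can simply be rerun elsewhere can be sketched in a few
lines of Python. This is a toy, sequential illustration and not how Hadoop's scheduler
is implemented: the function run_with_retries, the node names, the retry limit of four
attempts, and the use of RuntimeError to signal a failed attempt are all assumptions.

```python
def run_with_retries(tasks, workers, run_task, max_attempts=4):
    """Run independent tasks, rescheduling each failed attempt on the next worker."""
    results = {}
    for task in tasks:
        for attempt in range(max_attempts):
            worker = workers[attempt % len(workers)]
            try:
                results[task] = run_task(task, worker)
                break
            except RuntimeError:
                continue  # treat the attempt as lost and try another machine
        else:
            raise RuntimeError(f"task {task!r} failed {max_attempts} times")
    return results

# Hypothetical cluster in which one machine always fails.
def run_task(task, worker):
    if worker == "node-2":
        raise RuntimeError("node-2 is unhealthy")
    return f"{task} done on {worker}"

workers = ["node-2", "node-1", "node-3"]
results = run_with_retries(["map-0", "map-1", "reduce-0"], workers, run_task)
print(results)
```

Because no task depends on another task's in-memory state, a retry needs nothing more
than the task's name and a healthy machine. The caller never sees the failures on
node-2, which is the sense in which the programmer is spared from thinking about them.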
MapReduce might sound like quite a restrictive programming model, and in a sense it
is: you are limited to key and value types that are related in specified ways, and mappers
and reducers run with very limited coordination between one another (the mappers
pass keys and values to reducers). A natural question to ask is: can you do anything
useful or nontrivial with it?
The answer is yes. MapReduce was invented by engineers at Google as a system for
building production search indexes because they found themselves solving the same
problem over and over again (and MapReduce was inspired by older ideas from the
functional programming, distributed computing, and database communities), but it
has since been used for many other applications in many other industries. It is pleasantly
surprising to see the range of algorithms that can be expressed in MapReduce, from
image analysis, to graph-based problems, to machine learning algorithms.[7] It can't
solve every problem, of course, but it is a general data-processing tool.
You can see a sample of some of the applications that Hadoop has been used for in
Chapter 16.
6. Jim Gray was an early advocate of putting the computation near the data. See "Distributed
Computing Economics," March 2003, http://research.microsoft.com/apps/pubs/default.aspx?id=70001.
Volunteer Computing
When people first hear about Hadoop and MapReduce, they often ask, "How is it
different from SETI@home?" SETI, the Search for Extra-Terrestrial Intelligence, runs
a project called SETI@home in which volunteers donate CPU time from their otherwise
idle computers to analyze radio telescope data for signs of intelligent life outside earth.
SETI@home is the most well-known of many volunteer computing projects; others
include the Great Internet Mersenne Prime Search (to search for large prime numbers)
and Folding@home (to understand protein folding and how it relates to disease).
Volunteer computing projects work by breaking the problem they are trying to
solve into chunks called work units, which are sent to computers around the world to
be analyzed. For example, a SETI@home work unit is about 0.35 MB of radio telescope
data, and takes hours or days to analyze on a typical home computer. When the analysis
is completed, the results are sent back to the server, and the client gets another work
unit. As a precaution to combat cheating, each work unit is sent to three different
machines and needs at least two results to agree to be accepted.
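The two-out-of-three rule is a simple quorum check. The following Python sketch shows
one way to express it; accept_result and the sample result strings are hypothetical and
are not taken from the SETI@home software.

```python
from collections import Counter

def accept_result(results, quorum=2):
    """Return the result reported by at least `quorum` machines, or None if none agree."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count >= quorum else None

print(accept_result(["signal", "signal", "noise"]))  # signal
print(accept_result(["a", "b", "c"]))                # None
```

A work unit whose three results all disagree is simply not accepted; what happens to it
next (for example, being sent out again) is outside the scope of this sketch.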
Although SETI@home may be superficially similar to MapReduce (breaking a problem
into independent pieces to be worked on in parallel), there are some significant
differences. The SETI@home problem is very CPU-intensive, which makes it suitable for
running on hundreds of thousands of computers across the world,[8] since the time to
transfer the work unit is dwarfed by the time to run the computation on it. Volunteers
are donating CPU cycles, not bandwidth.
MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated
hardware running in a single data center with very high aggregate bandwidth
interconnects. By contrast, SETI@home runs a perpetual computation on untrusted
machines on the Internet with highly variable connection speeds and no data locality.
7. Apache Mahout (http://mahout.apache.org/) is a project to build machine learning libraries (such as
classification and clustering algorithms) that run on Hadoop.
8. In January 2008, SETI@home was reported at
http://www.planetary.org/programs/projects/setiathome/setiathome_20080115.html
to be processing 300 gigabytes a day, using 320,000 computers (most of which
are not dedicated to SETI@home; they are used for other things, too).
A Brief History of Hadoop
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used
text search library. Hadoop has its origins in Apache Nutch, an open source web search
engine, itself a part of the Lucene project.
The Origin of the Name Hadoop
The name Hadoop is not an acronym; it's a made-up name. The project's creator, Doug
Cutting, explains how the name came about:
The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and
pronounce, meaningless, and not used elsewhere: those are my naming criteria.
Kids are good at generating such. Googol is a kid's term.
Subprojects and "contrib" modules in Hadoop also tend to have names that are
unrelated to their function, often with an elephant or other animal theme ("Pig," for
example). Smaller components are given more descriptive (and therefore more
mundane) names. This is a good principle, as it means you can generally work out what
something does from its name. For example, the jobtracker[9] keeps track of MapReduce
jobs.
Building a web search engine from scratch was an ambitious goal, for not only is the
software required to crawl and index websites complex to write, but it is also a challenge
to run without a dedicated operations team, since there are so many moving parts. It's
expensive, too: Mike Cafarella and Doug Cutting estimated a system supporting a
1-billion-page index would cost around half a million dollars in hardware, with a
monthly running cost of $30,000.[10] Nevertheless, they believed it was a worthy goal,
as it would open up and ultimately democratize search engine algorithms.
Nutch was started in 2002, and a working crawler and search system quickly emerged.
However, they realized that their architecture wouldn't scale to the billions of pages on
the Web. Help was at hand with the publication of a paper in 2003 that described the
architecture of Google's distributed filesystem, called GFS, which was being used in
production at Google.[11] GFS, or something like it, would solve their storage needs for
the very large files generated as a part of the web crawl and indexing process. In
particular, GFS would free up time being spent on administrative tasks such as managing
storage nodes. In 2004, they set about writing an open source implementation, the
Nutch Distributed Filesystem (NDFS).
9. In this book, we use the lowercase form, "jobtracker," to denote the entity when it's being referred
to generally, and the CamelCase form JobTracker to denote the Java class that implements it.
10. Mike Cafarella and Doug Cutting, "Building Nutch: Open Source Search," ACM Queue, April 2004,
http://queue.acm.org/detail.cfm?id=988408.
11. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System," October 2003,
http://labs.google.com/papers/gfs.html.
In 2004, Google published the paper that introduced MapReduce to the world.[12] Early
in 2005, the Nutch developers had a working MapReduce implementation in Nutch,
and by the middle of that year all the major Nutch algorithms had been ported to run
using MapReduce and NDFS.
NDFS and the MapReduce implementation in Nutch were applicable beyond the realm
of search, and in February 2006 they moved out of Nutch to form an independent
subproject of Lucene called Hadoop. At around the same time, Doug Cutting joined
Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a
system that ran at web scale (see sidebar). This was demonstrated in February 2008
when Yahoo! announced that its production search index was being generated by a
10,000-core Hadoop cluster.[13]
In January 2008, Hadoop was made its own top-level project at Apache, confirming its
success and its diverse, active community. By this time, Hadoop was being used by
many other companies besides Yahoo!, such as Last.fm, Facebook, and the New York
Times. Some applications are covered in the case studies in Chapter 16 and on the
Hadoop wiki.
In one well-publicized feat, the New York Times used Amazon's EC2 compute cloud
to crunch through four terabytes of scanned archives from the paper, converting them
to PDFs for the Web.[14] The processing took less than 24 hours to run using 100
machines, and the project probably wouldn't have been embarked on without the
combination of Amazon's pay-by-the-hour model (which allowed the NYT to access a large
number of machines for a short period) and Hadoop's easy-to-use parallel
programming model.
In April 2008, Hadoop broke a world record to become the fastest system to sort a
terabyte of data. Running on a 910-node cluster, Hadoop sorted one terabyte in 209
seconds (just under 3½ minutes), beating the previous year's winner of 297 seconds
(described in detail in "TeraByte Sort on Apache Hadoop" on page 601). In November
of the same year, Google reported that its MapReduce implementation sorted one
terabyte in 68 seconds.[15] As the first edition of this book was going to press (May 2009),
it was announced that a team at Yahoo! used Hadoop to sort one terabyte in 62 seconds.
12. Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters,"
December 2004, http://labs.google.com/papers/mapreduce.html.
13. "Yahoo! Launches World's Largest Hadoop Production Application," 19 February 2008,
http://developer.yahoo.net/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html.
14. Derek Gottfrid, "Self-service, Prorated Super Computing Fun!" 1 November 2007,
http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/.
15. "Sorting 1PB with MapReduce," 21 November 2008,
http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html.
Hadoop at Yahoo!
Building Internet-scale search engines requires huge amounts of data and therefore
large numbers of machines to process it. Yahoo! Search consists of four primary
components: the Crawler, which downloads pages from web servers; the WebMap, which
builds a graph of the known Web; the Indexer, which builds a reverse index to the best
pages; and the Runtime, which answers users' queries. The WebMap is a graph that
consists of roughly 1 trillion (10^12) edges each representing a web link and 100 billion
(10^11) nodes each representing distinct URLs. Creating and analyzing such a large graph
requires a large number of computers running for many days. In early 2005, the
infrastructure for the WebMap, named Dreadnaught, needed to be redesigned to scale up
to more nodes. Dreadnaught had successfully scaled from 20 to 600 nodes, but required
a complete redesign to scale out further. Dreadnaught is similar to MapReduce in many
ways, but provides more flexibility and less structure. In particular, each fragment in a
Dreadnaught job can send output to each of the fragments in the next stage of the job,
but the sort was all done in library code. In practice, most of the WebMap phases were
pairs that corresponded to MapReduce. Therefore, the WebMap applications would
not require extensive refactoring to fit into MapReduce.
Eric Baldeschwieler (Eric14) created a small team and we started designing and
prototyping a new framework written in C++ modeled after GFS and MapReduce to
replace Dreadnaught. Although the immediate need was for a new framework for
WebMap, it was clear that standardization of the batch platform across Yahoo! Search
was critical and by making the framework general enough to support other users, we
could better leverage investment in the new platform.
At the same time, we were watching Hadoop, which was part of Nutch, and its progress.
In January 2006, Yahoo! hired Doug Cutting, and a month later we decided to abandon
our prototype and adopt Hadoop. The advantage of Hadoop over our prototype and
design was that it was already working with a real application (Nutch) on 20 nodes.
That allowed us to bring up a research cluster two months later and start helping real
customers use the new framework much sooner than we could have otherwise. Another
advantage, of course, was that since Hadoop was already open source, it was easier
(although far from easy!) to get permission from Yahoo!'s legal department to work in
open source. So we set up a 200-node cluster for the researchers in early 2006 and put
the WebMap conversion plans on hold while we supported and improved Hadoop for
the research users.
Here's a quick timeline of how things have progressed:
• 2004: Initial versions of what is now Hadoop Distributed Filesystem and
MapReduce implemented by Doug Cutting and Mike Cafarella.
• December 2005: Nutch ported to the new framework. Hadoop runs reliably on
20 nodes.
• January 2006: Doug Cutting joins Yahoo!.
• February 2006: Apache Hadoop project officially started to support the
standalone development of MapReduce and HDFS.
• February 2006: Adoption of Hadoop by Yahoo! Grid team.
• April 2006: Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
• May 2006: Yahoo! set up a Hadoop research cluster (300 nodes).
• May 2006: Sort benchmark run on 500 nodes in 42 hours (better hardware than
April benchmark).
• October 2006: Research cluster reaches 600 nodes.
• December 2006: Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3
hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
• January 2007: Research cluster reaches 900 nodes.
• April 2007: Research clusters: 2 clusters of 1000 nodes.
• April 2008: Won the 1 terabyte sort benchmark in 209 seconds on 900 nodes.
• October 2008: Loading 10 terabytes of data per day on to research clusters.
• March 2009: 17 clusters with a total of 24,000 nodes.
• April 2009: Won the minute sort by sorting 500 GB in 59 seconds (on 1,400
nodes) and the 100 terabyte sort in 173 minutes (on 3,400 nodes).
Owen O'Malley
Apache Hadoop and the Hadoop Ecosystem
Although Hadoop is best known for MapReduce and its distributed filesystem (HDFS,
renamed from NDFS), the term is also used for a family of related projects that fall
under the umbrella of infrastructure for distributed computing and large-scale data
processing.
All of the core projects covered in this book are hosted by the Apache Software
Foundation, which provides support for a community of open source software projects,
including the original HTTP Server from which it gets its name. As the Hadoop
ecosystem grows, more projects are appearing, not necessarily hosted at Apache, which
provide complementary services to Hadoop, or build on the core to add higher-level
abstractions.
The Hadoop projects that are covered in this book are described briefly here:
Common
A set of components and interfaces for distributed filesystems and general I/O
(serialization, Java RPC, persistent data structures).
Avro
A serialization system for efficient, cross-language RPC, and persistent data
storage.
MapReduce
A distributed data processing model and execution environment that runs on large
clusters of commodity machines.
HDFS
A distributed filesystem that runs on large clusters of commodity machines.
Pig
A data flow language and execution environment for exploring very large datasets.
Pig runs on HDFS and MapReduce clusters.
Hive
A distributed data warehouse. Hive manages data stored in HDFS and provides a
query language based on SQL (and which is translated by the runtime engine to
MapReduce jobs) for querying the data.
HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying
storage, and supports both batch-style computations using MapReduce and point
queries (random reads).
ZooKeeper
A distributed, highly available coordination service. ZooKeeper provides primitives
such as distributed locks that can be used for building distributed applications.
Sqoop
A tool for efficiently moving data between relational databases and HDFS.
Hadoop Releases
Which version of Hadoop should you use? The answer to this question changes over
time, of course, and also depends on the features that you need. Table 1-2
summarizes the high-level features in recent Hadoop release series.
There are a few active release series. The 1.x release series is a continuation of the 0.20
release series, and contains the most stable versions of Hadoop currently available. This
series includes secure Kerberos authentication, which prevents unauthorized access to
Hadoop data (see "Security" on page 323). Almost all production clusters use these
releases, or derived versions (such as commercial distributions).
The 0.22 and 0.23 release series[16] are currently marked as alpha releases (as of early
2012), but this is likely to change by the time you read this as they get more real-world
testing and become more stable (consult the Apache Hadoop releases page for the latest
status). 0.23 includes several major new features:
• A new MapReduce runtime, called MapReduce 2, implemented on a new system
called YARN (Yet Another Resource Negotiator), which is a general resource
management system for running distributed applications. MapReduce 2 replaces the
"classic" runtime in previous releases. It is described in more depth in "YARN
(MapReduce 2)" on page 194.
• HDFS federation, which partitions the HDFS namespace across multiple
namenodes to support clusters with very large numbers of files. See "HDFS
Federation" on page 49.
• HDFS high-availability, which removes the namenode as a single point of failure
by supporting standby namenodes for failover. See "HDFS High-Availability"
on page 50.
16. The numbering will be updated to reflect the fact that they are later versions than 1.x further into their
release cycles.
Table 1-2. Features Supported by Hadoop Release Series
Feature                          1.x       0.22         0.23
Secure authentication            Yes       No           Yes
Old configuration names          Yes       Deprecated   Deprecated
New configuration names          No        Yes          Yes
Old MapReduce API                Yes       Deprecated   Deprecated
New MapReduce API                Partial   Yes          Yes
MapReduce 1 runtime (Classic)    Yes       Yes          No
MapReduce 2 runtime (YARN)       No        No           Yes
HDFS federation                  No        No           Yes
HDFS high-availability           No        No           Planned
Table 1-2 only covers features in HDFS and MapReduce. Other projects in the Hadoop
ecosystem are continually evolving too, and picking a combination of components that
work well together can be a challenge. Thankfully, you don't have to do this work
yourself. The Apache Bigtop project (http://incubator.apache.org/bigtop/) runs
interoperability tests on stacks of Hadoop components, and provides binary packages (RPMs
and Debian packages) for easy installation. There are also commercial vendors offering
Hadoop distributions containing suites of compatible components.
What's Covered in this Book
This book covers all the releases in Table 1-2. In the cases where a feature is only
available in a particular release, it is noted in the text.
The code in this book is written to work against all these release series, except in a small
number of cases, which are explicitly called out. The example code available on the
website has a list of the versions that it was tested against.
Configuration Names
Configuration property names have been changed in the releases after 1.x, in order to
give them a more regular naming structure. For example, the HDFS properties
pertaining to the namenode have been changed to have a dfs.namenode prefix, so
dfs.name.dir has changed to dfs.namenode.name.dir. Similarly, MapReduce properties
have the mapreduce prefix, rather than the older mapred prefix, so mapred.job.name has
changed to mapreduce.job.name.
For properties that exist in version 1.x, the old (deprecated) names are used in this
book, since they will work in all the versions of Hadoop listed here. If you are using a
release after 1.x, you may wish to use the new property names in your configuration
files and code to remove deprecation warnings. A table listing the deprecated property
names and their replacements can be found on the Hadoop website at
http://hadoop.apache.org/common/docs/r0.23.0/hadoop-project-dist/hadoop-common/DeprecatedProperties.html.
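As a small illustration of the renaming, the Python sketch below rewrites a dictionary of
properties from old names to new ones. It is not part of Hadoop: the mapping contains
only the two pairs mentioned in the text, and the helper name modernize and the
property values are hypothetical.

```python
# Old (deprecated) name -> new name. Only the two pairs mentioned in the text;
# the full list is in the deprecated properties table on the Hadoop website.
DEPRECATED_NAMES = {
    "dfs.name.dir": "dfs.namenode.name.dir",
    "mapred.job.name": "mapreduce.job.name",
}

def modernize(properties):
    """Return a copy of `properties` with deprecated keys replaced by their new names."""
    return {DEPRECATED_NAMES.get(name, name): value for name, value in properties.items()}

old_style = {
    "dfs.name.dir": "/data/namenode",   # hypothetical value
    "mapred.job.name": "max-temp",      # hypothetical value
    "io.file.buffer.size": "4096",      # not in the mapping above, so left as-is
}
new_style = modernize(old_style)
print(new_style)
```

Names that are not in the mapping pass through unchanged, so the helper is safe to run
on a configuration that already mixes old and new names.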
MapReduce APIs
Hadoop provides two Java MapReduce APIs, described in more detail in "The old and
the new Java MapReduce APIs" on page 27. This edition of the book uses the new
API, which will work with all versions listed here, except in a few cases where that part
of the new API is not available in the 1.x releases. In these cases the equivalent code
using the old API is available on the book's website.
Compatibility
When moving from one release to another you need to consider the upgrade steps that
are needed. There are several aspects to consider: API compatibility, data compatibility,
and wire compatibility.
API compatibility concerns the contract between user code and the published Hadoop
APIs, such as the Java MapReduce APIs. Major releases (e.g. from 1.x.y to 2.0.0) are
allowed to break API compatibility, so user programs may need to be modified and
recompiled. Minor releases (e.g. from 1.0.x to 1.1.0) and point releases (e.g. from 1.0.1
to 1.0.2) should not break compatibility.[17]
Hadoop uses a classification scheme for API elements to denote their
stability. The above rules for API compatibility cover those elements
that are marked InterfaceStability.Stable. Some elements of the
public Hadoop APIs, however, are marked with the
InterfaceStability.Evolving or InterfaceStability.Unstable annotations (all these
annotations are in the org.apache.hadoop.classification package), which
means they are allowed to break compatibility on minor and point
releases, respectively.
17. Pre-1.0 releases follow the rules for major releases, so a change in version from 0.1.0 to 0.2.0 (say)
constitutes a major release, and therefore may break API compatibility.
Data compatibility concerns persistent data and metadata formats, such as the format
in which the HDFS namenode stores its persistent data. The formats can change across
minor or major releases, but the change is transparent to users since the upgrade will
automatically migrate the data. There may be some restrictions about upgrade paths,
and these are covered in the release notes; for example, it may be necessary to upgrade
via an intermediate release rather than upgrading directly to the later final release in
one step. Hadoop upgrades are discussed in more detail in "Upgrades" on page 360.
Wire compatibility concerns the interoperability between clients and servers via wire
protocols like RPC and HTTP. There are two types of client: external clients (run by
users) and internal clients (run on the cluster as a part of the system, e.g. datanode and
tasktracker daemons). In general, internal clients have to be upgraded in lockstep; an
older version of a tasktracker will not work with a newer jobtracker, for example. In
the future rolling upgrades may be supported, which would allow cluster daemons to
be upgraded in phases, so that the cluster would still be available to external clients
during the upgrade.
For external clients that are run by the user (like a program that reads or writes from
HDFS, or the MapReduce job submission client), the client must have the same major
release number as the server, but is allowed to have a lower minor or point release
number (e.g. client version 1.0.1 will work with server 1.0.2 or 1.1.0, but not with server
2.0.0). Any exception to this rule should be called out in the release notes.
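The external-client rule can be written down as a small predicate. The following is only a sketch in plain Java and is not part of Hadoop: the class and method names are hypothetical, versions are assumed to have exactly the three-part major.minor.point form, and a client that is newer than the server is treated as unsupported because the rule above does not cover that case.

```java
/** Sketch of the external-client wire compatibility rule; not Hadoop code. */
final class WireCompatibility {

    /** Splits "major.minor.point" into three integers (assumed format). */
    static int[] parse(String version) {
        String[] parts = version.split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected major.minor.point: " + version);
        }
        return new int[] {
            Integer.parseInt(parts[0]),
            Integer.parseInt(parts[1]),
            Integer.parseInt(parts[2])
        };
    }

    /** True if an external client at clientVersion may talk to a server at serverVersion. */
    static boolean clientWorksWith(String clientVersion, String serverVersion) {
        int[] c = parse(clientVersion);
        int[] s = parse(serverVersion);
        if (c[0] != s[0]) {
            return false;        // major release numbers must match
        }
        if (c[1] != s[1]) {
            return c[1] < s[1];  // a lower minor release is allowed
        }
        return c[2] <= s[2];     // same minor: point release may be equal or lower
    }

    public static void main(String[] args) {
        System.out.println(clientWorksWith("1.0.1", "1.0.2")); // true
        System.out.println(clientWorksWith("1.0.1", "1.1.0")); // true
        System.out.println(clientWorksWith("1.0.1", "2.0.0")); // false
    }
}
```

The three calls in main() are the cases given in the text. Real deployments should still defer to the release notes, which may list exceptions.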
CHAPTER 2
MapReduce
MapReduce is a programming model for data processing. The model is simple, yet not
too simple to express useful programs in. Hadoop can run MapReduce programs written
in various languages; in this chapter, we shall look at the same program expressed
in Java, Ruby, Python, and C++. Most important, MapReduce programs are inherently
parallel, thus putting very large-scale data analysis into the hands of anyone with
enough machines at their disposal. MapReduce comes into its own for large datasets,
so let's start by looking at one.
A Weather Dataset
For our example, we will write a program that mines weather data. Weather sensors
collecting data every hour at many locations across the globe gather a large volume of
log data, which is a good candidate for analysis with MapReduce, since it is
semi-structured and record-oriented.
Data Format
The data we will use is from the National Climatic Data Center (NCDC,
http://www.ncdc.noaa.gov/). The data is stored using a line-oriented ASCII format, in which each
line is a record. The format supports a rich set of meteorological elements, many of
which are optional or with variable data lengths. For simplicity, we shall focus on the
basic elements, such as temperature, which are always present and are of fixed width.
Example 2-1 shows a sample line with some of the salient fields highlighted. The line
has been split into multiple lines to show each field: in the real file, fields are packed
into one line with no delimiters.
Example 2-1. Format of a National Climate Data Center record
0057
332130 # USAF weather station identifier
99999 # WBAN weather station identifier
19500101 # observation date
0300 # observation time
4
+51317 # latitude (degrees x 1000)
+028783 # longitude (degrees x 1000)
FM-12
+0171 # elevation (meters)
99999
V020
320 # wind direction (degrees)
1 # quality code
N
0072
1
00450 # sky ceiling height (meters)
1 # quality code
C
N
010000 # visibility distance (meters)
1 # quality code
N
9
-0128 # air temperature (degrees Celsius x 10)
1 # quality code
-0139 # dew point temperature (degrees Celsius x 10)
1 # quality code
10268 # atmospheric pressure (hectopascals x 10)
1 # quality code
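To make the "packed into one line with no delimiters" point concrete, the sketch below joins the fields of Example 2-1 back into a single record and reads three of them by character position. It is an illustration in plain Java rather than code from the book: the class and method names are hypothetical, and the offsets (15 to 19 for the year, 87 to 92 for the air temperature, 92 for its quality code) are the ones used by the mapper later in this chapter.

```java
/** Sketch: rebuild the Example 2-1 record and read fixed-width fields from it. */
final class NcdcRecordSketch {

    // The fields of Example 2-1, in order, joined with no delimiters.
    static final String RECORD = String.join("",
        "0057", "332130", "99999", "19500101", "0300", "4",
        "+51317", "+028783", "FM-12", "+0171", "99999", "V020",
        "320", "1", "N", "0072", "1", "00450", "1", "C", "N",
        "010000", "1", "N", "9", "-0128", "1", "-0139", "1", "10268", "1");

    /** The year is the first four characters of the observation date. */
    static String year(String record) {
        return record.substring(15, 19);
    }

    /** Air temperature in tenths of a degree Celsius (sign plus four digits). */
    static int airTemperatureTenths(String record) {
        String field = record.substring(87, 92);
        // Strip a leading plus sign before parsing, as the chapter's mapper does.
        return Integer.parseInt(field.charAt(0) == '+' ? field.substring(1) : field);
    }

    /** Quality code that immediately follows the air temperature. */
    static char airTemperatureQuality(String record) {
        return record.charAt(92);
    }

    public static void main(String[] args) {
        System.out.println(RECORD.length());                // 105
        System.out.println(year(RECORD));                   // 1950
        System.out.println(airTemperatureTenths(RECORD));   // -128
        System.out.println(airTemperatureQuality(RECORD));  // 1
    }
}
```

The value -128 corresponds to -12.8 degrees Celsius, since temperatures are stored scaled by a factor of 10.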
Data files are organized by date and weather station. There is a directory for each year
from 1901 to 2001, each containing a gzipped file for each weather station with its
readings for that year. For example, here are the first entries for 1990:
% ls raw/1990 | head
010010-99999-1990.gz
010014-99999-1990.gz
010015-99999-1990.gz
010016-99999-1990.gz
010017-99999-1990.gz
010030-99999-1990.gz
010040-99999-1990.gz
010080-99999-1990.gz
010100-99999-1990.gz
010150-99999-1990.gz
Since there are tens of thousands of weather stations, the whole dataset is made up of
a large number of relatively small files. It's generally easier and more efficient to process
a smaller number of relatively large files, so the data was preprocessed so that each
year's readings were concatenated into a single file. (The means by which this was
carried out is described in Appendix C.)
Analyzing the Data with Unix Tools
What's the highest recorded global temperature for each year in the dataset? We will
answer this first without using Hadoop, as this information will provide a performance
baseline, as well as a useful means to check our results.
The classic tool for processing line-oriented data is awk. Example 2-2 is a small script
to calculate the maximum temperature for each year.
Example 2-2. A program for finding the maximum recorded temperature by year from NCDC weather
records
#!/usr/bin/env bash
for year in all/*
do
  echo -ne `basename $year .gz`"\t"
  gunzip -c $year | \
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp !=9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done
The script loops through the compressed year files, first printing the year, and then
processing each file using awk. The awk script extracts two fields from the data: the air
temperature and the quality code. The air temperature value is turned into an integer
by adding 0. Next, a test is applied to see if the temperature is valid (the value 9999
signifies a missing value in the NCDC dataset) and if the quality code indicates that the
reading is not suspect or erroneous. If the reading is OK, the value is compared with
the maximum value seen so far, which is updated if a new maximum is found. The
END block is executed after all the lines in the file have been processed, and it prints the
maximum value.
Here is the beginning of a run:
% ./max_temperature.sh
1901 317
1902 244
1903 289
1904 256
1905 283
...
The temperature values in the source file are scaled by a factor of 10, so this works out
as a maximum temperature of 31.7°C for 1901 (there were very few readings at the
beginning of the century, so this is plausible). The complete run for the century took
42 minutes in one run on a single EC2 High-CPU Extra Large Instance.
To speed up the processing, we need to run parts of the program in parallel. In theory,
this is straightforward: we could process different years in different processes, using all
the available hardware threads on a machine. There are a few problems with this,
however.
First, dividing the work into equal-size pieces isn't always easy or obvious. In this case,
the file size for different years varies widely, so some processes will finish much earlier
than others. Even if they pick up further work, the whole run is dominated by the
longest file. A better approach, although one that requires more work, is to split the
input into fixed-size chunks and assign each chunk to a process.
Second, combining the results from independent processes may need further processing.
In this case, the result for each year is independent of other years and may be
combined by concatenating all the results, and sorting by year. If using the fixed-size
chunk approach, the combination is more delicate. For this example, data for a particular
year will typically be split into several chunks, each processed independently.
We'll end up with the maximum temperature for each chunk, so the final step is to
look for the highest of these maximums, for each year.
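The final combining step for the fixed-size chunk approach might look like the following plain Java sketch. It is not from the book: the class name, the method name, and the per-chunk values are made up for illustration, and each partial result is assumed to be a (year, maximum within one chunk) pair.

```java
import java.util.SortedMap;
import java.util.TreeMap;

/** Sketch: merge per-chunk maxima into one maximum per year. */
final class ChunkMaxCombiner {

    /**
     * Each row of partials is {year, maximum temperature seen in one chunk}.
     * The result maps each year to the highest of its chunk maxima, sorted by year.
     */
    static SortedMap<Integer, Integer> combine(int[][] partials) {
        SortedMap<Integer, Integer> best = new TreeMap<>();
        for (int[] partial : partials) {
            // Keep the larger of the stored maximum and this chunk's maximum.
            best.merge(partial[0], partial[1], Integer::max);
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical output of five independent chunk workers.
        int[][] partials = {
            {1901, 289}, {1902, 244}, {1901, 317}, {1902, 150}, {1901, 100}
        };
        System.out.println(combine(partials)); // {1901=317, 1902=244}
    }
}
```

Taking a maximum of maxima gives the same answer regardless of how the input was chunked, which is what makes this particular combination safe.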
Third, you are still limited by the processing capacity of a single machine. If the best
time you can achieve is 20 minutes with the number of processors you have, then that's
it. You can't make it go faster. Also, some datasets grow beyond the capacity of a single
machine. When we start using multiple machines, a whole host of other factors come
into play, mainly falling in the category of coordination and reliability. Who runs the
overall job? How do we deal with failed processes?
So, though it's feasible to parallelize the processing, in practice it's messy. Using a
framework like Hadoop to take care of these issues is a great help.
Analyzing the Data with Hadoop
To take advantage of the parallel processing that Hadoop provides, we need to express
our query as a MapReduce job. After some local, small-scale testing, we will be able to
run it on a cluster of machines.
Map and Reduce
MapReduce works by breaking the processing into two phases: the map phase and the
reduce phase. Each phase has key-value pairs as input and output, the types of which
may be chosen by the programmer. The programmer also specifies two functions: the
map function and the reduce function.
The input to our map phase is the raw NCDC data. We choose a text input format that
gives us each line in the dataset as a text value. The key is the offset of the beginning
of the line from the beginning of the file, but as we have no need for this, we ignore it.
Our map function is simple. We pull out the year and the air temperature, since these
are the only fields we are interested in. In this case, the map function is just a data
preparation phase, setting up the data in such a way that the reducer function can do
its work on it: finding the maximum temperature for each year. The map function is
also a good place to drop bad records: here we filter out temperatures that are missing,
suspect, or erroneous.
To visualize the way the map works, consider the following sample lines of input data
(some unused columns have been dropped to fit the page, indicated by ellipses):
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
These lines are presented to the map function as the key-value pairs:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
The keys are the line offsets within the file, which we ignore in our map function. The
map function merely extracts the year and the air temperature (indicated in bold text),
and emits them as its output (the temperature values have been interpreted as
integers):
(1950, 0)
(1950, 22)
(1950, -11)
(1949, 111)
(1949, 78)
The output from the map function is processed by the MapReduce framework before
being sent to the reduce function. This processing sorts and groups the key-value pairs
by key. So, continuing the example, our reduce function sees the following input:
(1949, [111, 78])
(1950, [0, 22, -11])
Each year appears with a list of all its air temperature readings. All the reduce function
has to do now is iterate through the list and pick up the maximum reading:
(1949, 111)
(1950, 22)
This is the final output: the maximum global temperature recorded in each year.
The whole data flow is illustrated in Figure 2-1. At the bottom of the diagram is a Unix
pipeline, which mimics the whole MapReduce flow, and which we will see again later
in the chapter when we look at Hadoop Streaming.
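The logical flow walked through above (map output, then sort and group by key, then reduce) can be imitated in a few lines of plain Java with no Hadoop involved. This is only a sketch under stated assumptions: the class and method names are hypothetical, the map step is skipped because the sample lines are truncated, and the starting pairs are the five map outputs for those lines (the third line carries the temperature field -0011, that is, -11).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Sketch: the shuffle and reduce steps of the worked example, in memory. */
final class LogicalDataFlowSketch {

    // (year, temperature) pairs emitted by the map function for the five sample lines.
    static final int[][] MAP_OUTPUT = {
        {1950, 0}, {1950, 22}, {1950, -11}, {1949, 111}, {1949, 78}
    };

    /** Sorts and groups the pairs by key, as the framework does between the phases. */
    static SortedMap<Integer, List<Integer>> shuffle(int[][] pairs) {
        SortedMap<Integer, List<Integer>> grouped = new TreeMap<>();
        for (int[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], year -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    /** The reduce step: pick the maximum reading for each year. */
    static SortedMap<Integer, Integer> reduce(SortedMap<Integer, List<Integer>> grouped) {
        SortedMap<Integer, Integer> maxima = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> entry : grouped.entrySet()) {
            maxima.put(entry.getKey(), Collections.max(entry.getValue()));
        }
        return maxima;
    }

    public static void main(String[] args) {
        SortedMap<Integer, List<Integer>> grouped = shuffle(MAP_OUTPUT);
        System.out.println(grouped);         // {1949=[111, 78], 1950=[0, 22, -11]}
        System.out.println(reduce(grouped)); // {1949=111, 1950=22}
    }
}
```

A TreeMap stands in for the framework's sort here; in a real job the grouping happens across machines and the values arrive at the reducer as a stream rather than an in-memory list.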
Java MapReduce
Having run through how the MapReduce program works, the next step is to express it
in code. We need three things: a map function, a reduce function, and some code to
run the job. The map function is represented by the Mapper class, which declares a
map() method for us to override. Example 2-3 shows the implementation of our map method.
Example 2-3. Mapper for maximum temperature example
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}
The Mapper class is a generic type, with four formal type parameters that specify the
input key, input value, output key, and output value types of the map function. For the
present example, the input key is a long integer offset, the input value is a line of text,
the output key is a year, and the output value is an air temperature (an integer). Rather
than use built-in Java types, Hadoop provides its own set of basic types that are
optimized for network serialization. These are found in the org.apache.hadoop.io package.
Here we use LongWritable, which corresponds to a Java Long, Text (like Java String),
and IntWritable (like Java Integer).
Figure 2-1. MapReduce logical data flow
The map() method is passed a key and a value. We convert the Text value containing
the line of input into a Java String, then use its substring() method to extract the
columns we are interested in.
The map() method also provides an instance of Context to write the output to. In this
case, we write the year as a Text object (since we are just using it as a key), and the
temperature is wrapped in an IntWritable. We write an output record only if the
temperature is present and the quality code indicates the temperature reading is OK.
The reduce function is similarly defined using a Reducer, as illustrated in Example 2-4.
Example 2-4. Reducer for maximum temperature example
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values,
      Context context)
      throws IOException, InterruptedException {

    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}
Again, four formal type parameters are used to specify the input and output types, this
time for the reduce function. The input types of the reduce function must match the
output types of the map function: Text and IntWritable. And in this case, the output
types of the reduce function are Text and IntWritable, for a year and its maximum
temperature, which we find by iterating through the temperatures and comparing each
with a record of the highest found so far.
The third piece of code runs the MapReduce job (see Example 2-5).
Example 2-5. Application to find the maximum temperature in the weather dataset
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
A Job object forms the specification of the job. It gives you control over how the job is
run. When we run this job on a Hadoop cluster, we will package the code into a JAR
file (which Hadoop will distribute around the cluster). Rather than explicitly specify
the name of the JAR file, we can pass a class in the Job's setJarByClass() method, which
Hadoop will use to locate the relevant JAR file by looking for the JAR file containing
this class.
Having constructed a Job object, we specify the input and output paths. An input path
is specified by calling the static addInputPath() method on FileInputFormat, and it can
be a single file, a directory (in which case, the input forms all the files in that directory),
or a file pattern. As the name suggests, addInputPath() can be called more than once
to use input from multiple paths.
The output path (of which there is only one) is specified by the static setOutputPath()
method on FileOutputFormat. It specifies a directory where the output files from
the reducer functions are written. The directory shouldn't exist before running the job;
if it does, Hadoop will complain and not run the job. This precaution is to prevent data loss
(it can be very annoying to accidentally overwrite the output of a long job with
another).
Next, we specify the map and reduce types to use via the setMapperClass() and
setReducerClass() methods.
The setOutputKeyClass() and setOutputValueClass() methods control the output types
for the map and the reduce functions, which are often the same, as they are in our case.
If they are different, then the map output types can be set using the methods
setMapOutputKeyClass() and setMapOutputValueClass().
The input types are controlled via the input format, which we have not explicitly set
since we are using the default TextInputFormat.
After setting the classes that define the map and reduce functions, we are ready to run
the job. The waitForCompletion() method on Job submits the job and waits for it to
finish. The method's boolean argument is a verbose flag, so in this case the job writes
information about its progress to the console.
The return value of the waitForCompletion() method is a boolean indicating success
(true) or failure (false), which we translate into the program's exit code of 0 or 1.
A test run
After writing a MapReduce job, it's normal to try it out on a small dataset to flush out
any immediate problems with the code. First install Hadoop in standalone mode
(there are instructions for how to do this in Appendix A). This is the mode in which
Hadoop runs using the local filesystem with a local job runner. Then install and compile
the examples using the instructions on the book's website.
Let's test it on the five-line sample discussed earlier (the output has been slightly
reformatted to fit the page):
% export HADOOP_CLASSPATH=hadoop-examples.jar
% hadoop MaxTemperature input/ncdc/sample.txt output
11/09/15 21:35:14 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobT
racker, sessionId=
11/09/15 21:35:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library fo
r your platform... using builtin-java classes where applicable
11/09/15 21:35:14 WARN mapreduce.JobSubmitter: Use GenericOptionsParser for parsing t
he arguments. Applications should implement Tool for the same.
11/09/15 21:35:14 INFO input.FileInputFormat: Total input paths to process : 1
11/09/15 21:35:14 WARN snappy.LoadSnappy: Snappy native library not loaded
11/09/15 21:35:14 INFO mapreduce.JobSubmitter: number of splits:1
11/09/15 21:35:15 INFO mapreduce.Job: Running job: job_local_0001
11/09/15 21:35:15 INFO mapred.LocalJobRunner: Waiting for map tasks
11/09/15 21:35:15 INFO mapred.LocalJobRunner: Starting task: attempt_local_0001_m_000
000_0
11/09/15 21:35:15 INFO mapred.Task: Using ResourceCalculatorPlugin : null
11/09/15 21:35:15 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
11/09/15 21:35:15 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
11/09/15 21:35:15 INFO mapred.MapTask: soft limit at 83886080
11/09/15 21:35:15 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
11/09/15 21:35:15 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
11/09/15 21:35:15 INFO mapred.LocalJobRunner:
11/09/15 21:35:15 INFO mapred.MapTask: Starting flush of map output
11/09/15 21:35:15 INFO mapred.MapTask: Spilling map output
11/09/15 21:35:15 INFO mapred.MapTask: bufstart = 0; bufend = 45; bufvoid = 104857600
11/09/15 21:35:15 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 2621438
0(104857520); length = 17/6553600
11/09/15 21:35:15 INFO mapred.MapTask: Finished spill 0
11/09/15 21:35:15 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And i
s in the process of commiting
11/09/15 21:35:15 INFO mapred.LocalJobRunner: map
11/09/15 21:35:15 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
11/09/15 21:35:15 INFO mapred.LocalJobRunner: Finishing task: attempt_local_0001_m_00
0000_0
11/09/15 21:35:15 INFO mapred.LocalJobRunner: Map task executor complete.
11/09/15 21:35:15 INFO mapred.Task: Using ResourceCalculatorPlugin : null
11/09/15 21:35:15 INFO mapred.Merger: Merging 1 sorted segments
11/09/15 21:35:15 INFO mapred.Merger: Down to the last merge-pass, with 1 segments le
ft of total size: 50 bytes
11/09/15 21:35:15 INFO mapred.LocalJobRunner:
11/09/15 21:35:15 WARN conf.Configuration: mapred.skip.on is deprecated. Instead, use
mapreduce.job.skiprecords
11/09/15 21:35:15 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And i
s in the process of commiting
11/09/15 21:35:15 INFO mapred.LocalJobRunner:
11/09/15 21:35:15 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to
commit now
11/09/15 21:35:15 INFO output.FileOutputCommitter: Saved output of task 'attempt_loca
l_0001_r_000000_0' to file:/Users/tom/workspace/hadoop-book/output
11/09/15 21:35:15 INFO mapred.LocalJobRunner: reduce > reduce
11/09/15 21:35:15 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
11/09/15 21:35:16 INFO mapreduce.Job: map 100% reduce 100%
11/09/15 21:35:16 INFO mapreduce.Job: Job job_local_0001 completed successfully
11/09/15 21:35:16 INFO mapreduce.Job: Counters: 24
File System Counters
FILE: Number of bytes read=255967
FILE: Number of bytes written=397273
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=5
Map output records=5
Map output bytes=45
Map output materialized bytes=61
Input split bytes=124
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=0
Reduce input records=5
Reduce output records=2
Spilled Records=10
Shuffled Maps =0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=10
Total committed heap usage (bytes)=379723776
File Input Format Counters
Bytes Read=529
File Output Format Counters
Bytes Written=29
When the hadoop command is invoked with a classname as the first argument, it
launches a JVM to run the class. It is more convenient to use hadoop than straight
java since the former adds the Hadoop libraries (and their dependencies) to the
classpath and picks up the Hadoop configuration, too. To add the application classes to the
classpath, we've defined an environment variable called HADOOP_CLASSPATH, which the
hadoop script picks up.
When running in local (standalone) mode, the programs in this book
all assume that you have set the HADOOP_CLASSPATH in this way. The
commands should be run from the directory that the example code is
installed in.
The output from running the job provides some useful information. For example,
we can see that the job was given an ID of job_local_0001, and it ran one map task
and one reduce task (with the IDs attempt_local_0001_m_000000_0 and
attempt_local_0001_r_000000_0). Knowing the job and task IDs can be very useful when
debugging MapReduce jobs.
The last section of the output, titled "Counters," shows the statistics that Hadoop
generates for each job it runs. These are very useful for checking whether the amount
of data processed is what you expected. For example, we can follow the number of
records that went through the system: five map inputs produced five map outputs, then
five reduce inputs in two groups produced two reduce outputs.
The output was written to the output directory, which contains one output file per
reducer. The job had a single reducer, so we find a single file, named part-r-00000:
% cat output/part-r-00000
1949 111
1950 22
This result is the same as when we went through it by hand earlier. We interpret this
as saying that the maximum temperature recorded in 1949 was 11.1°C, and in 1950 it
was 2.2°C.
The old and the new Java MapReduce APIs
The Java MapReduce API used in the previous section was first released in Hadoop
0.20.0. This new API, sometimes referred to as "Context Objects," was designed to
make the API easier to evolve in the future. It is type-incompatible with the old,
however, so applications need to be rewritten to take advantage of it.
The new API is not complete in the 1.x (formerly 0.20) release series, so the old API is
recommended for these releases, despite having been marked as deprecated in the early
0.20 releases. (Understandably, this recommendation caused a lot of confusion so the
deprecation warning was removed from later releases in that series.)
Previous editions of this book were based on 0.20 releases, and used the old API
throughout (although the new API was covered, the code invariably used the old API).
In this edition the new API is used as the primary API, except where mentioned.
However, should you wish to use the old API, you can, since the code for all the examples
in this book is available for the old API on the book's website.[1]
There are several notable differences between the two APIs:
• The new API favors abstract classes over interfaces, since these are easier to evolve.
  For example, you can add a method (with a default implementation) to an abstract
  class without breaking old implementations of the class.[2] For instance, the
  Mapper and Reducer interfaces in the old API are abstract classes in the new API.
• The new API is in the org.apache.hadoop.mapreduce package (and subpackages).
  The old API can still be found in org.apache.hadoop.mapred.
• The new API makes extensive use of context objects that allow the user code to
  communicate with the MapReduce system. The new Context, for example,
  essentially unifies the role of the JobConf, the OutputCollector, and the Reporter from
  the old API.
• In both APIs, key-value record pairs are pushed to the mapper and reducer, but in
  addition, the new API allows both mappers and reducers to control the execution
  flow by overriding the run() method. For example, records can be processed in
  batches, or the execution can be terminated before all the records have been
  processed. In the old API this is possible for mappers by writing a MapRunnable, but no
  equivalent exists for reducers.
• Configuration has been unified. The old API has a special JobConf object for job
  configuration, which is an extension of Hadoop's vanilla Configuration object
  (used for configuring daemons; see "The Configuration API" on page 146). In the
  new API, this distinction is dropped, so job configuration is done through a
  Configuration.
• Job control is performed through the Job class in the new API, rather than the old
  JobClient, which no longer exists in the new API.
• Output files are named slightly differently: in the old API both map and reduce
  outputs are named part-nnnnn, while in the new API map outputs are named
  part-m-nnnnn, and reduce outputs are named part-r-nnnnn (where nnnnn is an integer
  designating the part number, starting from zero).
• User-overridable methods in the new API are declared to throw
  java.lang.InterruptedException. What this means is that you can write your code to be responsive
  to interrupts so that the framework can gracefully cancel long-running operations
  if it needs to.[3]
• In the new API the reduce() method passes values as a java.lang.Iterable, rather
  than a java.util.Iterator (as the old API does). This change makes it easier to
  iterate over the values using Java's for-each loop construct:
  for (VALUEIN value : values) { ... }
1. See Table 1-2 for levels of support of the old and new APIs by Hadoop release.
2. Technically, such a change would almost certainly break implementations that already define a method
with the same signature as the new one, but as the article at
http://wiki.eclipse.org/Evolving_Java-based_APIs#Example_4_-_Adding_an_API_method explains,
for all practical purposes this is treated as a compatible change.
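The last difference in the list is easy to see side by side. The sketch below uses plain Java collections in place of Hadoop's value streams, so the class and method names are hypothetical and nothing here is Hadoop API; it only contrasts the explicit hasNext()/next() loop that an Iterator requires with the for-each loop that an Iterable permits.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Sketch: finding a maximum with an Iterator versus an Iterable. */
final class IterationStyles {

    /** Old-API style: the values arrive as an Iterator. */
    static int maxWithIterator(Iterator<Integer> values) {
        int maxValue = Integer.MIN_VALUE;
        while (values.hasNext()) {
            maxValue = Math.max(maxValue, values.next());
        }
        return maxValue;
    }

    /** New-API style: the values arrive as an Iterable, so for-each works. */
    static int maxWithIterable(Iterable<Integer> values) {
        int maxValue = Integer.MIN_VALUE;
        for (int value : values) {
            maxValue = Math.max(maxValue, value);
        }
        return maxValue;
    }

    public static void main(String[] args) {
        List<Integer> readings = Arrays.asList(0, 22, -11);
        System.out.println(maxWithIterator(readings.iterator())); // 22
        System.out.println(maxWithIterable(readings));            // 22
    }
}
```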
Example 2-6 shows the MaxTemperature application rewritten to use the old API. The
differences are highlighted in bold.
When converting your Mapper and Reducer classes to the new API, don't
forget to change the signature of the map() and reduce() methods to the
new form. Just changing your class to extend the new Mapper or
Reducer classes will not produce a compilation error or warning, since
these classes provide an identity form of the map() or reduce() method
(respectively). Your mapper or reducer code, however, will not be
invoked, which can lead to some hard-to-diagnose errors.
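The pitfall described in this note is ordinary Java overloading, and it can be reproduced without Hadoop. In the sketch below every class is a hypothetical stand-in (BaseMapper plays the role of the new Mapper with its identity map() method, and the parameter types are simplified to strings): a subclass that keeps an old-style parameter list adds an overload instead of an override, so the caller still reaches the inherited identity method. Annotating the method with @Override, as the examples in this chapter do, turns the mistake into a compilation error.

```java
/** Sketch: why an unchanged method signature silently falls back to the identity method. */
final class SignaturePitfall {

    /** Stand-in for a base class that supplies an identity map() method. */
    static class BaseMapper {
        String map(String key, String value, StringBuilder context) {
            return "identity";
        }
    }

    /** Old-style four-argument signature: an overload, so it is never called below. */
    static class ForgetfulMapper extends BaseMapper {
        String map(String key, String value, StringBuilder output, Object reporter) {
            return "custom";
        }
    }

    /** Matching signature: @Override makes the compiler verify the match. */
    static class CorrectMapper extends BaseMapper {
        @Override
        String map(String key, String value, StringBuilder context) {
            return "custom";
        }
    }

    /** Calls map() the way a framework would, through the base type. */
    static String runLikeFramework(BaseMapper mapper) {
        return mapper.map("key", "value", new StringBuilder());
    }

    public static void main(String[] args) {
        System.out.println(runLikeFramework(new ForgetfulMapper())); // identity
        System.out.println(runLikeFramework(new CorrectMapper()));   // custom
    }
}
```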
3. "Dealing with InterruptedException" by Brian Goetz explains this technique in detail.
Example 2-6. Application to find the maximum temperature, using the old MapReduce API
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class OldMaxTemperature {

  static class OldMaxTemperatureMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {

      String line = value.toString();
      String year = line.substring(15, 19);
      int airTemperature;
      if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
        airTemperature = Integer.parseInt(line.substring(88, 92));
      } else {
        airTemperature = Integer.parseInt(line.substring(87, 92));
      }
      String quality = line.substring(92, 93);
      if (airTemperature != MISSING && quality.matches("[01459]")) {
        output.collect(new Text(year), new IntWritable(airTemperature));
      }
    }
  }

  static class OldMaxTemperatureReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {

      int maxValue = Integer.MIN_VALUE;
      while (values.hasNext()) {
        maxValue = Math.max(maxValue, values.next().get());
      }
      output.collect(key, new IntWritable(maxValue));
    }
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: OldMaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    // The class passed here is used to locate the JAR file containing the job.
    JobConf conf = new JobConf(OldMaxTemperature.class);
    conf.setJobName("Max temperature");

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(OldMaxTemperatureMapper.class);
    conf.setReducerClass(OldMaxTemperatureReducer.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}
Scaling Out
You've seen how MapReduce works for small inputs; now it's time to take a bird's-eye
view of the system and look at the data flow for large inputs. For simplicity, the
examples so far have used files on the local filesystem. However, to scale out, we need
to store the data in a distributed filesystem, typically HDFS (which you'll learn about
in the next chapter), to allow Hadoop to move the MapReduce computation to each
machine hosting a part of the data. Let's see how this works.
Data Flow
First, some terminology. A MapReduce job is a unit of work that the client wants to be
performed: it consists of the input data, the MapReduce program, and configuration
information. Hadoop runs the job by dividing it into tasks, of which there are two types:
map tasks and reduce tasks.
There are two types of nodes that control the job execution process: a jobtracker and
a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by
scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress
reports to the jobtracker, which keeps a record of the overall progress of each job. If a
task fails, the jobtracker can reschedule it on a different tasktracker.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input
splits, or just splits. Hadoop creates one map task for each split, which runs the
user-defined map function for each record in the split.
Having many splits means the time taken to process each split is small compared to the
time to process the whole input. So if we are processing the splits in parallel, the
processing is better load-balanced if the splits are small, since a faster machine will be
able to process proportionally more splits over the course of the job than a slower machine.
Even if the machines are identical, failed processes or other jobs running concurrently
make load balancing desirable, and the quality of the load balancing increases as the
splits become more fine-grained.
On the other hand, if splits are too small, then the overhead of managing the splits and
of map task creation begins to dominate the total job execution time. For most jobs, a
good split size tends to be the size of an HDFS block, 64 MB by default, although this
can be changed for the cluster (for all newly created files), or specified when each file
is created.
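The trade-off between few large splits and many small ones can be made concrete with a little arithmetic. The following Python sketch is not part of Hadoop: the function name and the 1 GB input size are illustrative assumptions, and it ignores the finer rules a real input format applies when computing splits. It simply counts how many fixed-size splits, and hence map tasks, an input produces.

```python
import math

MB = 1024 * 1024

def num_splits(input_bytes, split_bytes=64 * MB):
    """Number of fixed-size splits (one map task each) needed to cover the input."""
    if input_bytes <= 0:
        return 0
    return math.ceil(input_bytes / split_bytes)

# A hypothetical 1 GB input with a split size equal to the 64 MB block size
print(num_splits(1024 * MB))            # 16
# The same input cut into 1 MB splits creates far more tasks to manage
print(num_splits(1024 * MB, 1 * MB))    # 1024
```

With 1 MB splits the job has 64 times as many tasks to create and track for the same amount of data, which is the overhead the paragraph above warns about.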
Hadoop does its best to run the map task on a node where the input data resides in
HDFS. This is called the data locality optimization since it doesn't use valuable cluster
bandwidth. Sometimes, however, all three nodes hosting the HDFS block replicas for
a map task's input split are running other map tasks so the job scheduler will look for
a free map slot on a node in the same rack as one of the blocks. Very occasionally even
this is not possible, so an off-rack node is used, which results in an inter-rack network
transfer. The three possibilities are illustrated in Figure 2-2.
Figure 2-2. Data-local (a), rack-local (b), and off-rack (c) map tasks.
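The three placements can be told apart with a simple check. This Python sketch only illustrates the classification; it is not how the job scheduler is implemented, and the host names, rack names, and function name are all hypothetical.

```python
def locality(replica_hosts, task_host, rack_of):
    """Classify a map task placement relative to its split's block replicas."""
    if task_host in replica_hosts:
        return "data-local"
    if rack_of[task_host] in {rack_of[h] for h in replica_hosts}:
        return "rack-local"
    return "off-rack"

# Hypothetical cluster layout: host name -> rack name
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2", "n5": "r3"}
replicas = ["n1", "n3", "n4"]   # the three hosts holding replicas of the block

print(locality(replicas, "n1", rack_of))  # data-local
print(locality(replicas, "n2", rack_of))  # rack-local
print(locality(replicas, "n5", rack_of))  # off-rack
```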
It should now be clear why the optimal split size is the same as the block size: it is the
largest size of input that can be guaranteed to be stored on a single node. If the split
spanned two blocks, it would be unlikely that any HDFS node stored both blocks, so
some of the split would have to be transferred across the network to the node running
the map task, which is clearly less efficient than running the whole map task using local
data.
Map tasks write their output to the local disk, not to HDFS. Why is this? Map output
is intermediate output: it's processed by reduce tasks to produce the final output, and
once the job is complete the map output can be thrown away. So storing it in HDFS,
with replication, would be overkill. If the node running the map task fails before the
map output has been consumed by the reduce task, then Hadoop will automatically
rerun the map task on another node to re-create the map output.
Reduce tasks don't have the advantage of data locality: the input to a single reduce
task is normally the output from all mappers. In the present example, we have a single
reduce task that is fed by all of the map tasks. Therefore, the sorted map outputs have
to be transferred across the network to the node where the reduce task is running, where
they are merged and then passed to the user-defined reduce function. The output of
the reduce is normally stored in HDFS for reliability. As explained in Chapter 3, for
each HDFS block of the reduce output, the first replica is stored on the local node, with
other replicas being stored on off-rack nodes. Thus, writing the reduce output does
consume network bandwidth, but only as much as a normal HDFS write pipeline
consumes.
The whole data flow with a single reduce task is illustrated in Figure 2-3. The dotted
boxes indicate nodes, the light arrows show data transfers on a node, and the heavy
arrows show data transfers between nodes.
Figure 2-3. MapReduce data flow with a single reduce task
The number of reduce tasks is not governed by the size of the input, but is specified
independently. In "The Default MapReduce Job" on page 225, you will see how to
choose the number of reduce tasks for a given job.
When there are multiple reducers, the map tasks partition their output, each creating
one partition for each reduce task. There can be many keys (and their associated values)
in each partition, but the records for any given key are all in a single partition. The
partitioning can be controlled by a user-defined partitioning function, but normally the
default partitioner, which buckets keys using a hash function, works very well.
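Hash partitioning can be sketched in a few lines. The Python below is an illustration rather than Hadoop's code: it takes a hash of the key modulo the number of reduce tasks, which is the general idea behind the default partitioner. The use of zlib.crc32 as the hash, the function name, and the sample records are assumptions made for this sketch.

```python
import zlib

def partition(key, num_reducers):
    """Map a key to a reduce task index in the range [0, num_reducers)."""
    # zlib.crc32 stands in for the key's hash code; it is stable across runs,
    # unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

records = [("1950", 0), ("1949", 111), ("1950", 22), ("1949", 78), ("1950", -11)]
partitions = {}
for key, value in records:
    partitions.setdefault(partition(key, 3), []).append((key, value))

# Every record for a given key lands in exactly one partition
for index in sorted(partitions):
    print(index, partitions[index])
```

Because the partition index depends only on the key, two map tasks that both see the key 1950 send those records to the same reduce task, which is what lets a reducer see every value for its keys.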
The data flow for the general case of multiple reduce tasks is illustrated in Figure 2-4.
This diagram makes it clear why the data flow between map and reduce tasks is
colloquially known as "the shuffle," as each reduce task is fed by many map tasks. The
shuffle is more complicated than this diagram suggests, and tuning it can have a big
impact on job execution time, as you will see in "Shuffle and Sort" on page 205.
Figure 2-4. MapReduce data flow with multiple reduce tasks
Finally, it's also possible to have zero reduce tasks. This can be appropriate when you
don't need the shuffle since the processing can be carried out entirely in parallel (a few
examples are discussed in "NLineInputFormat" on page 245). In this case, the only
off-node data transfer is when the map tasks write to HDFS (see Figure 2-5).
Combiner Functions
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays
to minimize the data transferred between map and reduce tasks. Hadoop allows the
user to specify a combiner function to be run on the map output; the combiner
function's output forms the input to the reduce function. Since the combiner function is an
optimization, Hadoop does not provide a guarantee of how many times it will call it
for a particular map output record, if at all. In other words, calling the combiner
function zero, one, or many times should produce the same output from the reducer.
Figure 2-5. MapReduce data flow with no reduce tasks
The contract for the combiner function constrains the type of function that may be
used. This is best illustrated with an example. Suppose that for the maximum
temperature example, readings for the year 1950 were processed by two maps (because they
were in different splits). Imagine the first map produced the output:
(1950, 0)
(1950, 20)
(1950, 10)
And the second produced:
(1950, 25)
(1950, 15)
The reduce function would be called with a list of all the values:
(1950, [0, 20, 10, 25, 15])
with output:
(1950, 25)
since 25 is the maximum value in the list. We could use a combiner function that, just
like the reduce function, finds the maximum temperature for each map output. The
reduce would then be called with:
(1950, [20, 25])
and the reduce would produce the same output as before. More succinctly, we may
express the function calls on the temperature values in this case as follows:
max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
Not all functions possess this property.[4] For example, if we were calculating mean
temperatures, then we couldn't use the mean as our combiner function, since:
mean(0, 20, 10, 25, 15) = 14
but:
mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
The combiner function doesn't replace the reduce function. (How could it? The reduce
function is still needed to process records with the same key from different maps.) But
it can help cut down the amount of data shuffled between the maps and the reduces,
and for this reason alone it is always worth considering whether you can use a combiner
function in your MapReduce job.
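The max and mean comparison is easy to check directly. The following Python sketch reuses the 1950 readings from the example; the (sum, count) workaround at the end is an addition of this sketch rather than something the text prescribes, and the helper names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

map1 = [0, 20, 10]   # temperatures for 1950 from the first map
map2 = [25, 15]      # temperatures for 1950 from the second map

# max can be applied per map and again in the reducer without changing the result
assert max(map1 + map2) == max(max(map1), max(map2)) == 25

# mean cannot: the mean of per-map means differs from the overall mean
print(mean(map1 + map2))                # 14.0
print(mean([mean(map1), mean(map2)]))   # 15.0

# One possible workaround (an assumption of this sketch): the combiner emits
# partial (sum, count) pairs and the division happens only in the reducer.
def combine(values):
    return (sum(values), len(values))

def reduce_mean(partials):
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

print(reduce_mean([combine(map1), combine(map2)]))  # 14.0
```

Summing the partial sums and counts gives the same answer however the values are grouped, so the combiner can run zero, one, or many times without changing the final mean.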
Specifying a combiner function
Going back to the Java MapReduce program, the combiner function is defined using
the Reducer class, and for this application, it is the same implementation as the reducer
function in MaxTemperatureReducer. The only change we need to make is to set the
combiner class on the Job (see Example 2-7).
Example 2-7. Application to find the maximum temperature, using a combiner function for efficiency
public class MaxTemperatureWithCombiner {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperatureWithCombiner <input path> " +
          "<output path>");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(MaxTemperatureWithCombiner.class);
    job.setJobName("Max temperature");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
4. Functions with this property are called commutative and associative. They are also sometimes referred to
as distributive, such as in the paper "Data Cube: A Relational Aggregation Operator Generalizing Group-
By, Cross-Tab, and Sub-Totals," Gray et al. (1995).
Running a Distributed MapReduce Job
The same program will run, without alteration, on a full dataset. This is the point of
MapReduce: it scales to the size of your data and the size of your hardware. Here's one
data point: on a 10-node EC2 cluster running High-CPU Extra Large Instances, the
program took six minutes to run.[5]
We'll go through the mechanics of running programs on a cluster in Chapter 5.
Hadoop Streaming
Hadoop provides an API to MapReduce that allows you to write your map and reduce
functions in languages other than Java. Hadoop Streaming uses Unix standard streams
as the interface between Hadoop and your program, so you can use any language that
can read standard input and write to standard output to write your MapReduce
program.
Streaming is naturally suited for text processing (although, as of version 0.21.0, it can
handle binary streams, too), and when used in text mode, it has a line-oriented view of
data. Map input data is passed over standard input to your map function, which
processes it line by line and writes lines to standard output. A map output key-value pair
is written as a single tab-delimited line. Input to the reduce function is in the same
format (a tab-separated key-value pair) passed over standard input. The reduce
function reads lines from standard input, which the framework guarantees are sorted by
key, and writes its results to standard output.
Let's illustrate this by rewriting our MapReduce program for finding maximum
temperatures by year in Streaming.
Ruby
The map function can be expressed in Ruby as shown in Example 2-8.
Example 2-8. Map function for maximum temperature in Ruby
#!/usr/bin/env ruby

STDIN.each_line do |line|
  val = line
  year, temp, q = val[15,4], val[87,5], val[92,1]
  puts "#{year}\t#{temp}" if (temp != "+9999" && q =~ /[01459]/)
end
5. This is a factor of seven faster than the serial run on one machine using awk. The main reason it wasn't
proportionately faster is because the input data wasn't evenly partitioned. For convenience, the input
files were gzipped by year, resulting in large files for later years in the dataset, when the number of weather
records was much higher.
The program iterates over lines from standard input by executing a block for each line
from STDIN (a global constant of type IO). The block pulls out the relevant fields from
each input line, and, if the temperature is valid, writes the year and the temperature
separated by a tab character \t to standard output (using puts).
It's worth drawing out a design difference between Streaming and the
Java MapReduce API. The Java API is geared toward processing your
map function one record at a time. The framework calls the map()
method on your Mapper for each record in the input, whereas with
Streaming the map program can decide how to process the input; for
example, it could easily read and process multiple lines at a time since
it's in control of the reading. The user's Java map implementation is
"pushed" records, but it's still possible to consider multiple lines at a
time by accumulating previous lines in an instance variable in the
Mapper.[6] In this case, you need to implement the close() method so that
you know when the last record has been read, so you can finish
processing the last group of lines.
Since the script just operates on standard input and output, it's trivial to test the script
without using Hadoop, simply using Unix pipes:
% cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb
1950 +0000
1950 +0022
1950 -0011
1949 +0111
1949 +0078
The reduce function shown in Example 2-9 is a little more complex.
Example 2-9. Reduce function for maximum temperature in Ruby
#!/usr/bin/env ruby

last_key, max_val = nil, 0
STDIN.each_line do |line|
  key, val = line.split("\t")
  if last_key && last_key != key
    puts "#{last_key}\t#{max_val}"
    last_key, max_val = key, val.to_i
  else
    last_key, max_val = key, [max_val, val.to_i].max
  end
end
puts "#{last_key}\t#{max_val}" if last_key
6. Alternatively, you could use "pull" style processing in the new MapReduce API; see "The
old and the new Java MapReduce APIs" on page 27.
Again, the program iterates over lines from standard input, but this time we have to
store some state as we process each key group. In this case, the keys are the years
emitted by the map script, and we store the last key seen and the maximum temperature
seen so far for that key. The MapReduce framework ensures that the keys are ordered, so
we know that if a key is different from the previous one, we have moved into a new key group.
In contrast to the Java API, where you are provided an iterator over each key group, in
Streaming you have to find key group boundaries in your program.
For each line, we pull out the key and value, then if we've just finished a group (last_key
&& last_key != key), we write the key and the maximum temperature for that group,
separated by a tab character, before resetting the maximum temperature for the new
key. If we haven't just finished a group, we just update the maximum temperature for
the current key.
The last line of the program ensures that a line is written for the last key group in the
input.
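The key-group boundary detection described above can also be delegated to a library routine. The following Python sketch is an alternative formulation, not the book's script: it uses itertools.groupby from the standard library to split sorted input into runs of equal keys. The function name and the sample lines are assumptions for illustration.

```python
from itertools import groupby

def reduce_max(lines):
    """Yield 'key<TAB>max' for each run of consecutive lines sharing a key."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    # groupby starts a new group whenever the key changes, which is the same
    # boundary test the script performs by hand with last_key
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%s" % (key, max(int(value) for _, value in group))

# Sorted map output, as the framework (or Unix sort) would deliver it
sample = ["1949\t+0078\n", "1949\t+0111\n",
          "1950\t+0000\n", "1950\t+0022\n", "1950\t-0011\n"]
for out in reduce_max(sample):
    print(out)
```

Like the hand-written version, this relies on the input being sorted by key, since groupby only merges adjacent lines. It takes each group's maximum from the group's own values rather than from an initial 0, so a group containing only negative temperatures is reported correctly.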
We can now simulate the whole MapReduce pipeline with a Unix pipeline (which is
equivalent to the Unix pipeline shown in Figure 2-1):
% cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb | \
sort | ch02/src/main/ruby/max_temperature_reduce.rb
1949 111
1950 22
The output is the same as the Java program, so the next step is to run it using Hadoop
itself.
The hadoop command doesn't support a Streaming option; instead, you specify the
Streaming JAR file along with the jar option. Options to the Streaming program specify
the input and output paths, and the map and reduce scripts. This is what it looks like:
% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb
When running on a large dataset on a cluster, we should set the combiner, using the
-combiner option.
From release 0.21.0, the combiner can be any Streaming command. For earlier releases,
the combiner had to be written in Java, so as a workaround it was common to do manual
combining in the mapper, without having to resort to Java. In this case, we could change
the mapper to be a pipeline:
% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/all \
-output output \
-mapper "ch02/src/main/ruby/max_temperature_map.rb | sort |
ch02/src/main/ruby/max_temperature_reduce.rb" \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb \
-file ch02/src/main/ruby/max_temperature_map.rb \
-file ch02/src/main/ruby/max_temperature_reduce.rb
Note also the use of -file, which we use when running Streaming programs on the
cluster to ship the scripts to the cluster.
Python
Streaming supports any programming language that can read from standard input, and
write to standard output, so for readers more familiar with Python, here's the same
example again.[7] The map script is in Example 2-10, and the reduce script is in
Example 2-11.
Example 2-10. Map function for maximum temperature in Python
#!/usr/bin/env python

import re
import sys

for line in sys.stdin:
    val = line.strip()
    (year, temp, q) = (val[15:19], val[87:92], val[92:93])
    # Skip missing readings (+9999) and records with a suspect quality code
    if (temp != "+9999" and re.match("[01459]", q)):
        print("%s\t%s" % (year, temp))
Example 2-11. Reduce function for maximum temperature in Python
#!/usr/bin/env python

import sys

(last_key, max_val) = (None, 0)
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:
        # The key changed, so the previous group is complete
        print("%s\t%s" % (last_key, max_val))
        (last_key, max_val) = (key, int(val))
    else:
        (last_key, max_val) = (key, max(max_val, int(val)))

# Write the final group
if last_key:
    print("%s\t%s" % (last_key, max_val))
7. As an alternative to Streaming, Python programmers should consider Dumbo (http://www.last.fm/
dumbo), which makes the Streaming MapReduce interface more Pythonic and easier to use.
We can test the programs and run the job in the same way we did in Ruby. For example,
to run a test:
% cat input/ncdc/sample.txt | ch02/src/main/python/max_temperature_map.py | \
sort | ch02/src/main/python/max_temperature_reduce.py
1949 111
1950 22
Hadoop Pipes
Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike
Streaming, which uses standard input and output to communicate with the map and reduce
code, Pipes uses sockets as the channel over which the tasktracker communicates with
the process running the C++ map or reduce function. JNI is not used.
We'll rewrite the example running through the chapter in C++, and then we'll see how
to run it using Pipes. Example 2-12 shows the source code for the map and reduce
functions in C++.
Example 2-12. Maximum temperature in C++
#include <algorithm>
#include <climits>  // INT_MIN
#include <limits>
#include <stdint.h>
#include <string>

#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

class MaxTemperatureMapper : public HadoopPipes::Mapper {
public:
  MaxTemperatureMapper(HadoopPipes::TaskContext& context) {
  }
  void map(HadoopPipes::MapContext& context) {
    std::string line = context.getInputValue();
    std::string year = line.substr(15, 4);
    std::string airTemperature = line.substr(87, 5);
    std::string q = line.substr(92, 1);
    if (airTemperature != "+9999" &&
        (q == "0" || q == "1" || q == "4" || q == "5" || q == "9")) {
      context.emit(year, airTemperature);
    }
  }
};

class MapTemperatureReducer : public HadoopPipes::Reducer {
public:
  MapTemperatureReducer(HadoopPipes::TaskContext& context) {
  }
  void reduce(HadoopPipes::ReduceContext& context) {
    int maxValue = INT_MIN;
    while (context.nextValue()) {
      maxValue = std::max(maxValue, HadoopUtils::toInt(context.getInputValue()));
    }
    context.emit(context.getInputKey(), HadoopUtils::toString(maxValue));
  }
};

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(HadoopPipes::TemplateFactory<MaxTemperatureMapper,
                              MapTemperatureReducer>());
}
The application links against the Hadoop C++ library, which is a thin wrapper for
communicating with the tasktracker child process. The map and reduce functions are
defined by extending the Mapper and Reducer classes defined in the HadoopPipes
namespace and providing implementations of the map() and reduce() methods in each case.
These methods take a context object (of type MapContext or ReduceContext), which
provides the means for reading input and writing output, as well as accessing job
configuration information via the JobConf class. The processing in this example is very
similar to the Java equivalent.
Unlike the Java interface, keys and values in the C++ interface are byte buffers,
represented as Standard Template Library (STL) strings. This makes the interface simpler,
although it does put a slightly greater burden on the application developer, who has to
convert to and from richer domain-level types. This is evident in
MapTemperatureReducer where we have to convert the input value into an integer (using a
convenience method in HadoopUtils) and then the maximum value back into a string before
it's written out. In some cases, we can save on doing the conversion, such as in
MaxTemperatureMapper where the airTemperature value is never converted to an integer since
it is never processed as a number in the map() method.
The main() method is the application entry point. It calls HadoopPipes::runTask, which
connects to the Java parent process and marshals data to and from the Mapper or
Reducer. The runTask() method is passed a Factory so that it can create instances of the
Mapper or Reducer. Which one it creates is controlled by the Java parent over the socket
connection. There are overloaded template factory methods for setting a combiner,
partitioner, record reader, or record writer.
Compiling and Running
Now we can compile and link our program using the Makefile in Example 2-13.
Example 2-13. Makefile for C++ MapReduce program
CC = g++
CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include

max_temperature: max_temperature.cpp
	$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
	-lhadooputils -lpthread -g -O2 -o $@
The Makefile expects a couple of environment variables to be set. Apart from
HADOOP_INSTALL (which you should already have set if you followed the installation
instructions in Appendix A), you need to define PLATFORM, which specifies the operating
system, architecture, and data model (e.g., 32- or 64-bit). I ran it on a 32-bit Linux
system with the following:
% export PLATFORM=Linux-i386-32
% make
On successful completion, you'll find the max_temperature executable in the current
directory.
To run a Pipes job, we need to run Hadoop in pseudo-distributed mode (where all the
daemons run on the local machine), for which there are setup instructions in
Appendix A. Pipes doesn't run in standalone (local) mode, since it relies on Hadoop's
distributed cache mechanism, which works only when HDFS is running.
With the Hadoop daemons now running, the first step is to copy the executable to
HDFS so that it can be picked up by tasktrackers when they launch map and reduce
tasks:
% hadoop fs -put max_temperature bin/max_temperature
The sample data also needs to be copied from the local filesystem into HDFS:
% hadoop fs -put input/ncdc/sample.txt sample.txt
Now we can run the job. For this, we use the Hadoop pipes command, passing the URI
of the executable in HDFS using the -program argument:
% hadoop pipes \
-D hadoop.pipes.java.recordreader=true \
-D hadoop.pipes.java.recordwriter=true \
-input sample.txt \
-output output \
-program bin/max_temperature
We specify two properties using the -D option: hadoop.pipes.java.recordreader and
hadoop.pipes.java.recordwriter, setting both to true to say that we have not specified
a C++ record reader or writer, but that we want to use the default Java ones (which are
for text input and output). Pipes also allows you to set a Java mapper, reducer,
combiner, or partitioner. In fact, you can have a mixture of Java or C++ classes within
any one job.
The result is the same as the other versions of the same program that we ran.
CHAPTER 3
The Hadoop Distributed Filesystem
When a dataset outgrows the storage capacity of a single physical machine, it becomes
necessary to partition it across a number of separate machines. Filesystems that manage
the storage across a network of machines are called distributed filesystems. Since they
are network-based, all the complications of network programming kick in, thus making
distributed filesystems more complex than regular disk filesystems. For example, one
of the biggest challenges is making the filesystem tolerate node failure without suffering
data loss.
Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop
Distributed Filesystem. (You may sometimes see references to "DFS," informally or in
older documentation or configurations, which is the same thing.) HDFS is Hadoop's
flagship filesystem and is the focus of this chapter, but Hadoop actually has a
general-purpose filesystem abstraction, so we'll see along the way how Hadoop integrates with
other storage systems (such as the local filesystem and Amazon S3).
The Design of HDFS
HDFS is a filesystem designed for storing very large files with streaming data access
patterns, running on clusters of commodity hardware.[1] Let's examine this statement
in more detail:
Very large files
"Very large" in this context means files that are hundreds of megabytes, gigabytes,
or terabytes in size. There are Hadoop clusters running today that store petabytes
of data.[2]
Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a
write-once, read-many-times pattern. A dataset is typically generated or copied
from source, then various analyses are performed on that dataset over time. Each
analysis will involve a large proportion, if not all, of the dataset, so the time to read
the whole dataset is more important than the latency in reading the first record.
Commodity hardware
Hadoop doesn't require expensive, highly reliable hardware to run on. It's designed
to run on clusters of commodity hardware (commonly available hardware available
from multiple vendors[3]) for which the chance of node failure across the cluster is
high, at least for large clusters. HDFS is designed to carry on working without a
noticeable interruption to the user in the face of such failure.
It is also worth examining the applications for which using HDFS does not work so
well. While this may change in the future, these are areas where HDFS is not a good fit
today:
Low-latency data access
Applications that require low-latency access to data, in the tens of milliseconds
range, will not work well with HDFS. Remember, HDFS is optimized for delivering
a high throughput of data, and this may be at the expense of latency. HBase
(Chapter 13) is currently a better choice for low-latency access.
Lots of small files
Since the namenode holds filesystem metadata in memory, the limit to the number
of files in a filesystem is governed by the amount of memory on the namenode. As
a rule of thumb, each file, directory, and block takes about 150 bytes. So, for
example, if you had one million files, each taking one block, you would need at
least 300 MB of memory. While storing millions of files is feasible, billions is
beyond the capability of current hardware.[4]
Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer. Writes are always made at the
end of the file. There is no support for multiple writers, or for modifications at
arbitrary offsets in the file. (These might be supported in the future, but they are
likely to be relatively inefficient.)
1. The architecture of HDFS is described in "The Hadoop Distributed File System" by Konstantin Shvachko,
Hairong Kuang, Sanjay Radia, and Robert Chansler (Proceedings of MSST2010, May 2010, http://
storageconference.org/2010/Papers/MSST/Shvachko.pdf).
2. "Scaling Hadoop to 4000 nodes at Yahoo!," http://developer.yahoo.net/blogs/hadoop/2008/09/scaling
_hadoop_to_4000_nodes_a.html.
3. See Chapter 9 for a typical machine specification.
4. For an in-depth exposition of the scalability limits of HDFS, see Konstantin V. Shvachko's "Scalability
of the Hadoop Distributed File System" (http://developer.yahoo.net/blogs/hadoop/2010/05/scalability_of
_the_hadoop_dist.html) and the companion paper "HDFS Scalability: The limits to growth" (April 2010,
pp. 6-16, http://www.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf) by the same author.
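The rule of thumb for small files given earlier in this section lends itself to a quick estimate. The Python sketch below simply applies the 150 bytes per object figure from the text; the function name is hypothetical, directories are left out for simplicity, and a real namenode's memory use depends on factors this ignores.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb figure quoted in the text

def namenode_bytes(num_files, blocks_per_file=1):
    """Rough namenode memory for file and block objects (directories ignored)."""
    objects = num_files * (1 + blocks_per_file)  # one file object plus its blocks
    return objects * BYTES_PER_OBJECT

print(namenode_bytes(1_000_000) / 1e6)       # 300.0 (MB)
print(namenode_bytes(1_000_000_000) / 1e9)   # 300.0 (GB)
```

One million single-block files come to 300 MB, matching the example in the text, while a billion such files would need on the order of 300 GB for metadata alone.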
HDFS Concepts
Blocks
A disk has a block size, which is the minimum amount of data that it can read or write.
Filesystems for a single disk build on this by dealing with data in blocks, which are an
integral multiple of the disk block size. Filesystem blocks are typically a few kilobytes
in size, while disk blocks are normally 512 bytes. This is generally transparent to the
filesystem user who is simply reading or writing a file of whatever length. However,
there are tools to perform filesystem maintenance, such as df and fsck, that operate on
the filesystem block level.
HDFS, too, has the concept of a block, but it is a much larger unit: 64 MB by default.
Like in a filesystem for a single disk, files in HDFS are broken into block-sized chunks,
which are stored as independent units. Unlike a filesystem for a single disk, a file in
HDFS that is smaller than a single block does not occupy a full block's worth of
underlying storage. When unqualified, the term "block" in this book refers to a block in
HDFS.
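How a file maps onto block-sized chunks can be sketched as follows. This Python fragment only illustrates the arithmetic; the function name and the file sizes are assumptions, and it says nothing about how HDFS itself tracks blocks.

```python
MB = 1024 * 1024

def block_lengths(file_bytes, block_bytes=64 * MB):
    """Lengths of the block-sized chunks a file of the given size is broken into."""
    full, remainder = divmod(file_bytes, block_bytes)
    return [block_bytes] * full + ([remainder] if remainder else [])

print([n // MB for n in block_lengths(150 * MB)])  # [64, 64, 22]
print([n // MB for n in block_lengths(1 * MB)])    # [1]
```

The final chunk holds only the bytes that remain, which reflects the point above that a file smaller than a block does not occupy a full block's worth of storage.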
Why Is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost
of seeks. By making a block large enough, the time to transfer the data from the disk
can be made to be significantly larger than the time to seek to the start of the block.
Thus the time to transfer a large file made of multiple blocks operates at the disk transfer
rate.
A quick calculation shows that if the seek time is around 10 ms, and the transfer rate
is 100 MB/s, then to make the seek time 1% of the transfer time, we need to make the
block size around 100 MB. The default is actually 64 MB, although many HDFS
installations use 128 MB blocks. This figure will continue to be revised upward as transfer
speeds grow with new generations of disk drives.
This argument shouldn't be taken too far, however. Map tasks in MapReduce normally
operate on one block at a time, so if you have too few tasks (fewer than nodes in the
cluster), your jobs will run slower than they could otherwise.
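The quick calculation in this sidebar can be reproduced in a couple of lines. The Python sketch below uses the seek time and transfer rate from the text; the function name and parameter names are illustrative assumptions.

```python
def block_size_mb(seek_ms, transfer_mb_per_s, seek_fraction):
    """Block size (MB) at which seek time is the given fraction of transfer time."""
    transfer_s = (seek_ms / 1000.0) / seek_fraction  # required transfer time
    return transfer_s * transfer_mb_per_s

# 10 ms seek, 100 MB/s transfer, seek time held to 1% of transfer time
print(block_size_mb(10, 100, 0.01))   # 100.0
```

Keeping a 10 ms seek down to 1% of the transfer time means transferring for a full second, which at 100 MB/s is 100 MB of data.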
Having a Llock aLstiaction loi a uistiiLuteu lilesystem Liings seveial Lenelits. The liist
Lenelit is the most oLvious: a lile can Le laigei than any single uisk in the netwoik.
Theie`s nothing that ieguiies the Llocks liom a lile to Le stoieu on the same uisk, so
they can take auvantage ol any ol the uisks in the clustei. In lact, it woulu Le possiLle,
il unusual, to stoie a single lile on an HDFS clustei whose Llocks lilleu all the uisks in
the clustei.
Second, making the unit of abstraction a block rather than a file simplifies the storage
subsystem. Simplicity is something to strive for in all systems, but is especially
important for a distributed system in which the failure modes are so varied. The storage
subsystem deals with blocks, simplifying storage management (since blocks are a fixed
size, it is easy to calculate how many can be stored on a given disk) and eliminating
metadata concerns (blocks are just a chunk of data to be stored; file metadata such as
permissions information does not need to be stored with the blocks, so another system
can handle metadata separately).
Furthermore, blocks fit well with replication for providing fault tolerance and
availability. To insure against corrupted blocks and disk and machine failure, each block is
replicated to a small number of physically separate machines (typically three). If a block
becomes unavailable, a copy can be read from another location in a way that is
transparent to the client. A block that is no longer available due to corruption or machine
failure can be replicated from its alternative locations to other live machines to bring
the replication factor back to the normal level. (See "Data Integrity" on page 83 for
more on guarding against corrupt data.) Similarly, some applications may choose to
set a high replication factor for the blocks in a popular file to spread the read load on
the cluster.
Like its disk filesystem cousin, HDFS's fsck command understands blocks. For
example, running:
% hadoop fsck / -files -blocks
will list the blocks that make up each file in the filesystem. (See also "Filesystem check
(fsck)" on page 345.)
Namenodes and Datanodes
An HDFS cluster has two types of node operating in a master-worker pattern: a
namenode (the master) and a number of datanodes (workers). The namenode manages the
filesystem namespace. It maintains the filesystem tree and the metadata for all the files
and directories in the tree. This information is stored persistently on the local disk in
the form of two files: the namespace image and the edit log. The namenode also knows
the datanodes on which all the blocks for a given file are located; however, it does
not store block locations persistently, since this information is reconstructed from
datanodes when the system starts.
A client accesses the filesystem on behalf of the user by communicating with the
namenode and datanodes. The client presents a POSIX-like filesystem interface, so the user
code does not need to know about the namenode and datanode to function.
Datanodes are the workhorses of the filesystem. They store and retrieve blocks when
they are told to (by clients or the namenode), and they report back to the namenode
periodically with lists of blocks that they are storing.
Without the namenode, the filesystem cannot be used. In fact, if the machine running
the namenode were obliterated, all the files on the filesystem would be lost since there
would be no way of knowing how to reconstruct the files from the blocks on the
datanodes. For this reason, it is important to make the namenode resilient to failure,
and Hadoop provides two mechanisms for this.
The first way is to back up the files that make up the persistent state of the filesystem
metadata. Hadoop can be configured so that the namenode writes its persistent state
to multiple filesystems. These writes are synchronous and atomic. The usual
configuration choice is to write to local disk as well as a remote NFS mount.
It is also possible to run a secondary namenode, which despite its name does not act as
a namenode. Its main role is to periodically merge the namespace image with the edit
log to prevent the edit log from becoming too large. The secondary namenode usually
runs on a separate physical machine, since it requires plenty of CPU and as much
memory as the namenode to perform the merge. It keeps a copy of the merged
namespace image, which can be used in the event of the namenode failing. However, the
state of the secondary namenode lags that of the primary, so in the event of total failure
of the primary, data loss is almost certain. The usual course of action in this case is to
copy the namenode's metadata files that are on NFS to the secondary and run it as the
new primary.
See "The filesystem image and edit log" on page 338 for more details.
HDFS Federation
The namenode keeps a reference to every file and block in the filesystem in memory,
which means that on very large clusters with many files, memory becomes the limiting
factor for scaling (see "How much memory does a namenode need?" on page 306).
HDFS Federation, introduced in the 0.23 release series, allows a cluster to scale by
adding namenodes, each of which manages a portion of the filesystem namespace. For
example, one namenode might manage all the files rooted under /user, say, and a second
namenode might handle files under /share.
Under federation, each namenode manages a namespace volume, which is made up of
the metadata for the namespace, and a block pool containing all the blocks for the files
in the namespace. Namespace volumes are independent of each other, which means
namenodes do not communicate with one another, and furthermore the failure of one
namenode does not affect the availability of the namespaces managed by other
namenodes. Block pool storage is not partitioned, however, so datanodes register with each
namenode in the cluster and store blocks from multiple block pools.
To access a federated HDFS cluster, clients use client-side mount tables to map file
paths to namenodes. This is managed in configuration using the ViewFileSystem, and
viewfs:// URIs.
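The idea of a client-side mount table can be sketched in a few lines of plain Java. This is not how ViewFileSystem is implemented; the class MountTableSketch, its methods, and the namenode URIs are all hypothetical, and the sketch assumes that the longest matching path prefix wins.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A toy mount table; hypothetical names, not the ViewFileSystem implementation.
final class MountTableSketch {
    private final Map<String, String> mounts = new LinkedHashMap<>();

    /** Registers the namenode responsible for everything under pathPrefix. */
    void addMount(String pathPrefix, String namenodeUri) {
        mounts.put(pathPrefix, namenodeUri);
    }

    /** Returns the namenode for the longest mount prefix covering path, or null if none does. */
    String resolve(String path) {
        String bestPrefix = null;
        for (String prefix : mounts.keySet()) {
            boolean covers = path.equals(prefix) || path.startsWith(prefix + "/");
            if (covers && (bestPrefix == null || prefix.length() > bestPrefix.length())) {
                bestPrefix = prefix;
            }
        }
        return bestPrefix == null ? null : mounts.get(bestPrefix);
    }

    public static void main(String[] args) {
        MountTableSketch table = new MountTableSketch();
        table.addMount("/user", "hdfs://namenode1");   // assumed addresses
        table.addMount("/share", "hdfs://namenode2");
        System.out.println(table.resolve("/user/tom/quangle.txt"));
        System.out.println(table.resolve("/share/data"));
    }
}
```

Because the mapping lives entirely on the client, the namenodes themselves need no knowledge of one another, which matches the independence of namespace volumes described above.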
HDFS High-Availability
The combination of replicating namenode metadata on multiple filesystems, and using
the secondary namenode to create checkpoints protects against data loss, but does not
provide high-availability of the filesystem. The namenode is still a single point of
failure (SPOF), since if it did fail, all clients (including MapReduce jobs) would be
unable to read, write, or list files, because the namenode is the sole repository of the
metadata and the file-to-block mapping. In such an event the whole Hadoop system
would effectively be out of service until a new namenode could be brought online.
To recover from a failed namenode in this situation, an administrator starts a new
primary namenode with one of the filesystem metadata replicas, and configures
datanodes and clients to use this new namenode. The new namenode is not able to serve
requests until it has i) loaded its namespace image into memory, ii) replayed its edit
log, and iii) received enough block reports from the datanodes to leave safe mode. On
large clusters with many files and blocks, the time it takes for a namenode to start from
cold can be 30 minutes or more.
The long recovery time is a problem for routine maintenance too. In fact, since
unexpected failure of the namenode is so rare, the case for planned downtime is actually
more important in practice.
The 0.23 release series of Hadoop remedies this situation by adding support for HDFS
high-availability (HA). In this implementation there is a pair of namenodes in an
active-standby configuration. In the event of the failure of the active namenode, the standby
takes over its duties to continue servicing client requests without a significant
interruption. A few architectural changes are needed to allow this to happen:
• The namenodes must use highly-available shared storage to share the edit log. (In
  the initial implementation of HA this will require an NFS filer, but in future releases
  more options will be provided, such as a BookKeeper-based system built on
  ZooKeeper.) When a standby namenode comes up it reads up to the end of the shared
  edit log to synchronize its state with the active namenode, and then continues to
  read new entries as they are written by the active namenode.
• Datanodes must send block reports to both namenodes since the block mappings
  are stored in a namenode's memory, and not on disk.
• Clients must be configured to handle namenode failover, which uses a mechanism
  that is transparent to users.
If the active namenode fails, then the standby can take over very quickly (in a few tens
of seconds) since it has the latest state available in memory: both the latest edit log
entries, and an up-to-date block mapping. The actual observed failover time will be
longer in practice (around a minute or so), since the system needs to be conservative
in deciding that the active namenode has failed.
In the unlikely event of the standby being down when the active fails, the administrator
can still start the standby from cold. This is no worse than the non-HA case, and from
an operational point of view it's an improvement, since the process is a standard
operational procedure built into Hadoop.
Failover and fencing
The transition from the active namenode to the standby is managed by a new entity in
the system called the failover controller. Failover controllers are pluggable, but the first
implementation uses ZooKeeper to ensure that only one namenode is active. Each
namenode runs a lightweight failover controller process whose job it is to monitor its
namenode for failures (using a simple heartbeating mechanism) and trigger a failover
should a namenode fail.
Failover may also be initiated manually by an administrator, in the case of routine
maintenance, for example. This is known as a graceful failover, since the failover
controller arranges an orderly transition for both namenodes to switch roles.
In the case of an ungraceful failover, however, it is impossible to be sure that the failed
namenode has stopped running. For example, a slow network or a network partition
can trigger a failover transition, even though the previously active namenode is still
running, and thinks it is still the active namenode. The HA implementation goes to
great lengths to ensure that the previously active namenode is prevented from doing
any damage and causing corruption, a method known as fencing. The system employs
a range of fencing mechanisms, including killing the namenode's process, revoking its
access to the shared storage directory (typically by using a vendor-specific NFS
command), and disabling its network port via a remote management command. As a last
resort, the previously active namenode can be fenced with a technique rather
graphically known as STONITH, or "shoot the other node in the head," which uses a
specialized power distribution unit to forcibly power down the host machine.
Client failover is handled transparently by the client library. The simplest
implementation uses client-side configuration to control failover. The HDFS URI uses a logical
hostname which is mapped to a pair of namenode addresses (in the configuration file),
and the client library tries each namenode address until the operation succeeds.
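The "try each address until the operation succeeds" behavior can be illustrated with a small, self-contained sketch. It is not the HDFS client library: FailoverClientSketch, callWithFailover, and the addresses are hypothetical, and a real client would also need retry limits, backoff, and care around operations that are not safe to repeat.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// A toy illustration of client-side failover; hypothetical names, not Hadoop code.
final class FailoverClientSketch {
    /** Applies the operation to each address in turn and returns the first successful result. */
    static <T> T callWithFailover(List<String> namenodeAddresses, Function<String, T> operation) {
        RuntimeException lastFailure = null;
        for (String address : namenodeAddresses) {
            try {
                return operation.apply(address);
            } catch (RuntimeException e) {
                lastFailure = e;  // remember the failure and move on to the next namenode
            }
        }
        throw new IllegalStateException("no namenode could serve the request", lastFailure);
    }

    public static void main(String[] args) {
        // Assumed addresses that a logical hostname might map to in configuration.
        List<String> addresses = Arrays.asList("namenode1:8020", "namenode2:8020");
        String result = callWithFailover(addresses, address -> {
            if (address.startsWith("namenode1")) {
                throw new IllegalStateException("connection refused");  // simulate a failed active
            }
            return "listing served by " + address;
        });
        System.out.println(result);
    }
}
```

The caller sees only the final result, which is the sense in which failover is transparent to users.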
The Command-Line Interface
We're going to have a look at HDFS by interacting with it from the command line.
There are many other interfaces to HDFS, but the command line is one of the simplest
and, to many developers, the most familiar.
We are going to run HDFS on one machine, so first follow the instructions for setting
up Hadoop in pseudo-distributed mode in Appendix A. Later you'll see how to run on
a cluster of machines to give us scalability and fault tolerance.
There are two properties that we set in the pseudo-distributed configuration that
deserve further explanation. The first is fs.default.name, set to hdfs://localhost/, which is
used to set a default filesystem for Hadoop. Filesystems are specified by a URI, and
here we have used an hdfs URI to configure Hadoop to use HDFS by default. The HDFS
daemons will use this property to determine the host and port for the HDFS namenode.
We'll be running it on localhost, on the default HDFS port, 8020. And HDFS clients
will use this property to work out where the namenode is running so they can connect
to it.
We set the second property, dfs.replication, to 1 so that HDFS doesn't replicate
filesystem blocks by the default factor of three. When running with a single datanode,
HDFS can't replicate blocks to three datanodes, so it would perpetually warn about
blocks being under-replicated. This setting solves that problem.
Basic Filesystem Operations
The filesystem is ready to be used, and we can do all of the usual filesystem operations
such as reading files, creating directories, moving files, deleting data, and listing
directories. You can type hadoop fs -help to get detailed help on every command.
Start by copying a file from the local filesystem to HDFS:
% hadoop fs -copyFromLocal input/docs/quangle.txt hdfs://localhost/user/tom/quangle.txt
This command invokes Hadoop's filesystem shell command fs, which supports a
number of subcommands; in this case, we are running -copyFromLocal. The local file
quangle.txt is copied to the file /user/tom/quangle.txt on the HDFS instance running on
localhost. In fact, we could have omitted the scheme and host of the URI and picked
up the default, hdfs://localhost, as specified in core-site.xml:
% hadoop fs -copyFromLocal input/docs/quangle.txt /user/tom/quangle.txt
We could also have used a relative path and copied the file to our home directory in
HDFS, which in this case is /user/tom:
% hadoop fs -copyFromLocal input/docs/quangle.txt quangle.txt
Let's copy the file back to the local filesystem and check whether it's the same:
% hadoop fs -copyToLocal quangle.txt quangle.copy.txt
% md5 input/docs/quangle.txt quangle.copy.txt
MD5 (input/docs/quangle.txt) = a16f231da6b05e2ba7a339320e7dacd9
MD5 (quangle.copy.txt) = a16f231da6b05e2ba7a339320e7dacd9
The MD5 digests are the same, showing that the file survived its trip to HDFS and is
back intact.
Finally, let's look at an HDFS file listing. We create a directory first just to see how it
is displayed in the listing:
% hadoop fs -mkdir books
% hadoop fs -ls .
Found 2 items
drwxr-xr-x - tom supergroup 0 2009-04-02 22:41 /user/tom/books
-rw-r--r-- 1 tom supergroup 118 2009-04-02 22:29 /user/tom/quangle.txt
The information returned is very similar to the Unix command ls -l, with a few minor
differences. The first column shows the file mode. The second column is the replication
factor of the file (something a traditional Unix filesystem does not have). Remember
we set the default replication factor in the site-wide configuration to be 1, which is why
we see the same value here. The entry in this column is empty for directories since the
concept of replication does not apply to them: directories are treated as metadata and
stored by the namenode, not the datanodes. The third and fourth columns show the
file owner and group. The fifth column is the size of the file in bytes, or zero for
directories. The sixth and seventh columns are the last modified date and time. Finally, the
eighth column is the absolute name of the file or directory.
File Permissions in HDFS
HDFS has a permissions model for files and directories that is much like POSIX.
There are three types of permission: the read permission (r), the write permission (w),
and the execute permission (x). The read permission is required to read files or list the
contents of a directory. The write permission is required to write a file, or for a directory,
to create or delete files or directories in it. The execute permission is ignored for a file
since you can't execute a file on HDFS (unlike POSIX), and for a directory it is required
to access its children.
Each file and directory has an owner, a group, and a mode. The mode is made up of the
permissions for the user who is the owner, the permissions for the users who are
members of the group, and the permissions for users who are neither the owners nor
members of the group.
By default, a client's identity is determined by the username and groups of the process
it is running in. Because clients are remote, this makes it possible to become an arbitrary
user, simply by creating an account of that name on the remote system. Thus,
permissions should be used only in a cooperative community of users, as a mechanism for
sharing filesystem resources and for avoiding accidental data loss, and not for securing
resources in a hostile environment. (Note, however, that the latest versions of Hadoop
support Kerberos authentication, which removes these restrictions; see "Security"
on page 323.) Despite these limitations, it is worthwhile having permissions
enabled (as it is by default; see the dfs.permissions property), to avoid accidental
modification or deletion of substantial parts of the filesystem, either by users or by automated
tools or programs.
When permissions checking is enabled, the owner permissions are checked if the
client's username matches the owner, and the group permissions are checked if the client
is a member of the group; otherwise, the other permissions are checked.
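The checking order just described can be sketched as follows. The sketch assumes the mode is held as the familiar POSIX-style octal number (for example 0644), which is an assumption of this illustration; the class and method names are hypothetical and this is not Hadoop's own permission-checking code. It also does not model any super-user exemption.

```java
import java.util.Collections;
import java.util.Set;

// Illustrative owner/group/other check; hypothetical names, not Hadoop's implementation.
final class PermissionCheckSketch {
    static final int READ = 4;
    static final int WRITE = 2;
    static final int EXECUTE = 1;

    /**
     * Picks exactly one permission triple (owner, then group, then other)
     * and tests whether it grants the requested access bits.
     */
    static boolean isAllowed(int mode, String owner, String group,
                             String user, Set<String> userGroups, int requested) {
        int bits;
        if (user.equals(owner)) {
            bits = (mode >> 6) & 7;   // owner permissions
        } else if (userGroups.contains(group)) {
            bits = (mode >> 3) & 7;   // group permissions
        } else {
            bits = mode & 7;          // permissions for everyone else
        }
        return (bits & requested) == requested;
    }

    public static void main(String[] args) {
        // -rw-r--r-- tom supergroup, as in the listing of /user/tom/quangle.txt
        int mode = 0644;
        Set<String> noGroups = Collections.<String>emptySet();
        System.out.println(isAllowed(mode, "tom", "supergroup", "tom", noGroups, WRITE));
        System.out.println(isAllowed(mode, "tom", "supergroup", "alice", noGroups, WRITE));
    }
}
```

Note that only one triple is ever consulted: an owner whose own bits deny access is not rescued by more generous group or other bits.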
There is a concept of a super-user, which is the identity of the namenode process.
Permissions checks are not performed for the super-user.
Hadoop Filesystems
Hadoop has an abstract notion of filesystem, of which HDFS is just one
implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem
in Hadoop, and there are several concrete implementations, which are described in
Table 3-1.
Table 3-1. Hadoop filesystems
Filesystem | URI scheme | Java implementation (all under org.apache.hadoop) | Description
Local | file | fs.LocalFileSystem | A filesystem for a locally connected disk with client-side checksums. Use RawLocalFileSystem for a local filesystem with no checksums. See "LocalFileSystem" on page 84.
HDFS | hdfs | hdfs.DistributedFileSystem | Hadoop's distributed filesystem. HDFS is designed to work efficiently in conjunction with MapReduce.
HFTP | hftp | hdfs.HftpFileSystem | A filesystem providing read-only access to HDFS over HTTP. (Despite its name, HFTP has no connection with FTP.) Often used with distcp (see "Parallel Copying with distcp" on page 76) to copy data between HDFS clusters running different versions.
HSFTP | hsftp | hdfs.HsftpFileSystem | A filesystem providing read-only access to HDFS over HTTPS. (Again, this has no connection with FTP.)
WebHDFS | webhdfs | hdfs.web.WebHdfsFileSystem | A filesystem providing secure read-write access to HDFS over HTTP. WebHDFS is intended as a replacement for HFTP and HSFTP.
HAR | har | fs.HarFileSystem | A filesystem layered on another filesystem for archiving files. Hadoop Archives are typically used for archiving files in HDFS to reduce the namenode's memory usage. See "Hadoop Archives" on page 78.
KFS (CloudStore) | kfs | fs.kfs.KosmosFileSystem | CloudStore (formerly Kosmos filesystem) is a distributed filesystem like HDFS or Google's GFS, written in C++. Find more information about it at http://kosmosfs.sourceforge.net/.
FTP | ftp | fs.ftp.FTPFileSystem | A filesystem backed by an FTP server.
S3 (native) | s3n | fs.s3native.NativeS3FileSystem | A filesystem backed by Amazon S3. See http://wiki.apache.org/hadoop/AmazonS3.
S3 (block-based) | s3 | fs.s3.S3FileSystem | A filesystem backed by Amazon S3, which stores files in blocks (much like HDFS) to overcome S3's 5 GB file size limit.
Distributed RAID | hdfs | hdfs.DistributedRaidFileSystem | A "RAID" version of HDFS designed for archival storage. For each file in HDFS, a (smaller) parity file is created, which allows the HDFS replication to be reduced from three to two, which reduces disk usage by 25% to 30%, while keeping the probability of data loss the same. Distributed RAID requires that you run a RaidNode daemon on the cluster.
View | viewfs | viewfs.ViewFileSystem | A client-side mount table for other Hadoop filesystems. Commonly used to create mount points for federated namenodes (see "HDFS Federation" on page 49).
Hadoop provides many interfaces to its filesystems, and it generally uses the URI
scheme to pick the correct filesystem instance to communicate with. For example, the
filesystem shell that we met in the previous section operates with all Hadoop
filesystems. To list the files in the root directory of the local filesystem, type:
% hadoop fs -ls file:///
Although it is possible (and sometimes very convenient) to run MapReduce programs
that access any of these filesystems, when you are processing large volumes of data,
you should choose a distributed filesystem that has the data locality optimization,
notably HDFS (see "Scaling Out" on page 30).
Interfaces
Hadoop is written in Java, and all Hadoop filesystem interactions are mediated through
the Java API. The filesystem shell, for example, is a Java application that uses the Java
FileSystem class to provide filesystem operations. The other filesystem interfaces are
discussed briefly in this section. These interfaces are most commonly used with HDFS,
since the other filesystems in Hadoop typically have existing tools to access the
underlying filesystem (FTP clients for FTP, S3 tools for S3, etc.), but many of them will work
with any Hadoop filesystem.
HTTP
There are two ways of accessing HDFS over HTTP: directly, where the HDFS daemons
serve HTTP requests to clients; and via a proxy (or proxies), which accesses HDFS on
the client's behalf using the usual DistributedFileSystem API. The two ways are
illustrated in Figure 3-1.
Figure 3-1. Accessing HDFS over HTTP directly, and via a bank of HDFS proxies
In the first case, directory listings are served by the namenode's embedded web server
(which runs on port 50070) formatted in XML or JSON, while file data is streamed
from datanodes by their web servers (running on port 50075).
The original direct HTTP interface (HFTP and HSFTP) was read-only, while the new
WebHDFS implementation supports all filesystem operations, including Kerberos
authentication. WebHDFS must be enabled by setting dfs.webhdfs.enabled to true, for
you to be able to use webhdfs URIs.
The second way of accessing HDFS over HTTP relies on one or more standalone proxy
servers. (The proxies are stateless so they can run behind a standard load balancer.) All
traffic to the cluster passes through the proxy. This allows for stricter firewall and
bandwidth limiting policies to be put in place. It's common to use a proxy for transfers
between Hadoop clusters located in different data centers.
The original HDFS proxy (in src/contrib/hdfsproxy) was read-only, and could be
accessed by clients using the HSFTP FileSystem implementation (hsftp URIs). From
release 0.23, there is a new proxy called HttpFS that has read and write capabilities, and
which exposes the same HTTP interface as WebHDFS, so clients can access either using
webhdfs URIs.
The HTTP REST API that WebHDFS exposes is formally defined in a specification, so
it is likely that over time clients in languages other than Java will be written that use it
directly.
C
Hadoop provides a C library called libhdfs that mirrors the Java FileSystem interface
(it was written as a C library for accessing HDFS, but despite its name it can be used
to access any Hadoop filesystem). It works using the Java Native Interface (JNI) to call
a Java filesystem client.
The C API is very similar to the Java one, but it typically lags the Java one, so newer
features may not be supported. You can find the generated documentation for the C
API in the libhdfs/docs/api directory of the Hadoop distribution.
Hadoop comes with prebuilt libhdfs binaries for 32-bit Linux, but for other platforms,
you will need to build them yourself using the instructions at
http://wiki.apache.org/hadoop/LibHDFS.
FUSE
Filesystem in Userspace (FUSE) allows filesystems that are implemented in user space
to be integrated as a Unix filesystem. Hadoop's Fuse-DFS contrib module allows any
Hadoop filesystem (but typically HDFS) to be mounted as a standard filesystem. You
can then use Unix utilities (such as ls and cat) to interact with the filesystem, as well
as POSIX libraries to access the filesystem from any programming language.
Fuse-DFS is implemented in C using libhdfs as the interface to HDFS. Documentation
for compiling and running Fuse-DFS is located in the src/contrib/fuse-dfs directory of
the Hadoop distribution.
The Java Interface
In this section, we dig into Hadoop's FileSystem class: the API for interacting with
one of Hadoop's filesystems.[5] While we focus mainly on the HDFS implementation,
DistributedFileSystem, in general you should strive to write your code against the
FileSystem abstract class, to retain portability across filesystems. This is very useful
when testing your program, for example, since you can rapidly run tests using data
stored on the local filesystem.
Reading Data from a Hadoop URL
One of the simplest ways to read a file from a Hadoop filesystem is by using a
java.net.URL object to open a stream to read the data from. The general idiom is:
InputStream in = null;
try {
  in = new URL("hdfs://host/path").openStream();
  // process in
} finally {
  IOUtils.closeStream(in);
}
There's a little bit more work required to make Java recognize Hadoop's hdfs URL
scheme. This is achieved by calling the setURLStreamHandlerFactory method on URL
with an instance of FsUrlStreamHandlerFactory. This method can only be called once
per JVM, so it is typically executed in a static block. This limitation means that if some
other part of your program (perhaps a third-party component outside your control)
sets a URLStreamHandlerFactory, you won't be able to use this approach for reading data
from Hadoop. The next section discusses an alternative.
5. From release 0.21.0, there is a new filesystem interface called FileContext with better handling of multiple
filesystems (so a single FileContext can resolve multiple filesystem schemes, for example) and a cleaner,
more consistent interface.
Example 3-1 shows a program for displaying files from Hadoop filesystems on standard
output, like the Unix cat command.
Example 3-1. Displaying files from a Hadoop filesystem on standard output using a
URLStreamHandler
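import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
  // Register Hadoop's handler once per JVM so that java.net.URL understands hdfs:// URLs
  static {
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
  }

  public static void main(String[] args) throws Exception {
    InputStream in = null;
    try {
      in = new URL(args[0]).openStream();
      // Copy with a 4096-byte buffer; false means do not close the streams afterward
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}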
We make use of the handy IOUtils class that comes with Hadoop for closing the stream
in the finally clause, and also for copying bytes between the input stream and the
output stream (System.out in this case). The last two arguments to the copyBytes
method are the buffer size used for copying and whether to close the streams when the
copy is complete. We close the input stream ourselves, and System.out doesn't need to
be closed.
Here's a sample run:[6]
% hadoop URLCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
Reading Data Using the FileSystem API
As the previous section explained, sometimes it is impossible to set a
URLStreamHandlerFactory for your application. In this case, you will need to use the FileSystem API
to open an input stream for a file.
A file in a Hadoop filesystem is represented by a Hadoop Path object (and not
a java.io.File object, since its semantics are too closely tied to the local filesystem).
You can think of a Path as a Hadoop filesystem URI, such as
hdfs://localhost/user/tom/quangle.txt.
FileSystem is a general filesystem API, so the first step is to retrieve an instance for the
filesystem we want to use, which is HDFS in this case. There are several static factory methods
for getting a FileSystem instance:
public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user) throws IOException
A Configuration object encapsulates a client or server's configuration, which is set using
configuration files read from the classpath, such as conf/core-site.xml. The first method
returns the default filesystem (as specified in the file conf/core-site.xml, or the default
local filesystem if not specified there). The second uses the given URI's scheme and
authority to determine the filesystem to use, falling back to the default filesystem if no
scheme is specified in the given URI. The third retrieves the filesystem as the given user.
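To see what "scheme" and "authority" mean for the URIs used in this chapter, the following sketch uses only the standard java.net.URI class; it does not call Hadoop, and the helper class UriParts is a hypothetical name introduced for illustration.

```java
import java.net.URI;

// Shows the URI components referred to in the text; UriParts is an illustrative name.
final class UriParts {
    /** Formats the scheme, authority, and path of a URI string. */
    static String describe(String uri) {
        URI parsed = URI.create(uri);
        return "scheme=" + parsed.getScheme()
            + " authority=" + parsed.getAuthority()
            + " path=" + parsed.getPath();
    }

    public static void main(String[] args) {
        // A fully qualified HDFS URI: both scheme and authority are present
        System.out.println(describe("hdfs://localhost/user/tom/quangle.txt"));
        // A bare path: no scheme, which is the case where the default filesystem applies
        System.out.println(describe("/user/tom/quangle.txt"));
    }
}
```

For the bare path, both getScheme() and getAuthority() return null, which corresponds to the case where the second factory method falls back to the default filesystem.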
In some cases, you may want to retrieve a local filesystem instance, in which case you
can use the convenience method, getLocal():
public static LocalFileSystem getLocal(Configuration conf) throws IOException
With a FileSystem instance in hand, we invoke an open() method to get the input stream
for a file:
public FSDataInputStream open(Path f) throws IOException
public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException
The first method uses a default buffer size of 4 K.
Putting this together, we can rewrite Example 3-1 as shown in Example 3-2.
6. The text is from The Quangle Wangle's Hat by Edward Lear.
Example 3-2. Displaying files from a Hadoop filesystem on standard output by using the FileSystem
directly
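import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    // The URI's scheme and authority select the filesystem implementation
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}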
The program runs as follows:
% hadoop FileSystemCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
FSDataInputStream
The open() method on FileSystem actually returns a FSDataInputStream rather than a
standard java.io class. This class is a specialization of java.io.DataInputStream with
support for random access, so you can read from any part of the stream:
package org.apache.hadoop.fs;

public class FSDataInputStream extends DataInputStream
    implements Seekable, PositionedReadable {
  // implementation elided
}
The Seekable interface permits seeking to a position in the file and a query method for
the current offset from the start of the file (getPos()):
public interface Seekable {
  void seek(long pos) throws IOException;
  long getPos() throws IOException;
}
Calling seek() with a position that is greater than the length of the file will result in an
IOException. Unlike the skip() method of java.io.InputStream that positions the
stream at a point later than the current position, seek() can move to an arbitrary,
absolute position in the file.
Example 3-3 is a simple extension of Example 3-2 that writes a file to standard out
twice: after writing it once, it seeks to the start of the file and streams through it once
again.
Example 3-3. Displaying files from a Hadoop filesystem on standard output twice, by using seek
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemDoubleCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
      in.seek(0); // go back to the start of the file
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
Here's the result of running it on a small file:
% hadoop FileSystemDoubleCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
FSDataInputStream also implements the PositionedReadable interface for reading parts
of a file at a given offset:
public interface PositionedReadable {
public int read(long position, byte[] buffer, int offset, int length)
throws IOException;

public void readFully(long position, byte[] buffer, int offset, int length)
throws IOException;

public void readFully(long position, byte[] buffer) throws IOException;
}
The read() method reads up to length bytes from the given position in the file into the
buffer at the given offset in the buffer. The return value is the number of bytes actually
read; callers should check this value, as it may be less than length. The readFully()
methods will read length bytes into the buffer (or buffer.length bytes for the version
that just takes a byte array buffer), unless the end of the file is reached, in which case
an EOFException is thrown.
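The difference between "read up to length bytes" and "read exactly length bytes or fail" can be sketched with a plain java.io.InputStream. This is only an illustration of the contract described above, not Hadoop's implementation; the class name ReadFullySketch is made up, and the stream here is sequential rather than positioned.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

class ReadFullySketch {

    // Keeps calling read() until exactly length bytes have arrived,
    // because a single read() may legally return fewer bytes than requested.
    static void readFully(InputStream in, byte[] buffer, int offset, int length)
            throws IOException {
        int done = 0;
        while (done < length) {
            int n = in.read(buffer, offset + done, length - done);
            if (n < 0) {
                // End of stream reached before the buffer was filled
                throw new EOFException(
                    "stream ended after " + done + " of " + length + " bytes");
            }
            done += n;
        }
    }
}
```

A caller that uses read() directly has to write this loop itself, which is why checking the return value matters.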
All of these methods preserve the current offset in the file and are thread-safe, so they
provide a convenient way to access another part of the file (metadata, perhaps) while
reading the main body of the file. In fact, they are just implemented using the
Seekable interface using the following pattern:
long oldPos = getPos();
try {
seek(position);
// read data
} finally {
seek(oldPos);
}
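The same save, seek, restore idea can be tried outside Hadoop with the JDK's RandomAccessFile. The sketch below is an analogy under that assumption, not Hadoop's code; PositionedReadSketch and readAt are hypothetical names, and unlike the Hadoop methods this helper makes no attempt at thread safety.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class PositionedReadSketch {

    // Reads up to length bytes starting at an absolute file position,
    // then puts the file pointer back where the caller left it.
    static int readAt(RandomAccessFile file, long position,
                      byte[] buffer, int offset, int length) throws IOException {
        long oldPos = file.getFilePointer();
        try {
            file.seek(position);
            return file.read(buffer, offset, length);
        } finally {
            file.seek(oldPos);
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("positioned-read", ".txt");
        tmp.deleteOnExit();
        try (RandomAccessFile file = new RandomAccessFile(tmp, "rw")) {
            file.write("header:body".getBytes(StandardCharsets.UTF_8));
            file.seek(7); // pretend we are partway through streaming the body
            byte[] buf = new byte[6];
            int n = readAt(file, 0, buf, 0, buf.length);
            System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            System.out.println(file.getFilePointer()); // still 7 after the side read
        }
    }
}
```

Each call costs two seeks, which is one reason positioned reads suit occasional lookups better than bulk access.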
Finally, bear in mind that calling seek() is a relatively expensive operation and should
be used sparingly. You should structure your application access patterns to rely on
streaming data (by using MapReduce, for example) rather than performing a large
number of seeks.
Writing Data
The FileSystem class has a number of methods for creating a file. The simplest is the
method that takes a Path object for the file to be created and returns an output stream
to write to:
public FSDataOutputStream create(Path f) throws IOException
There are overloaded versions of this method that allow you to specify whether to
forcibly overwrite existing files, the replication factor of the file, the buffer size to use
when writing the file, the block size for the file, and file permissions.
The create() methods create any parent directories of the file to be
written that don't already exist. Though convenient, this behavior may
be unexpected. If you want the write to fail if the parent directory doesn't
exist, then you should check for the existence of the parent directory
first by calling the exists() method.
There's also an overloaded method for passing a callback interface, Progressable, so
your application can be notified of the progress of the data being written to the
datanodes:
package org.apache.hadoop.util;
public interface Progressable {
public void progress();
}
As an alternative to creating a new file, you can append to an existing file using the
append() method (there are also some other overloaded versions):
public FSDataOutputStream append(Path f) throws IOException
The append operation allows a single writer to modify an already written file by opening
it and writing data from the final offset in the file. With this API, applications that
produce unbounded files, such as logfiles, can write to an existing file after a restart,
for example. The append operation is optional and not implemented by all Hadoop
filesystems. For example, HDFS supports append, but S3 filesystems don't.
Example 3-4 shows how to copy a local file to a Hadoop filesystem. We illustrate
progress by printing a period every time the progress() method is called by Hadoop, which
is after each 64 K packet of data is written to the datanode pipeline. (Note that this
particular behavior is not specified by the API, so it is subject to change in later versions
of Hadoop. The API merely allows you to infer that "something is happening.")
Example 3-4. Copying a local file to a Hadoop filesystem
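import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0];
    String dst = args[1];

    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    OutputStream out = fs.create(new Path(dst), new Progressable() {
      public void progress() {
        System.out.print("."); // one period per progress callback
      }
    });

    // true closes both streams once the copy completes
    IOUtils.copyBytes(in, out, 4096, true);
  }
}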
Typical usage:
% hadoop FileCopyWithProgress input/docs/1400-8.txt hdfs://localhost/user/tom/1400-8.txt
...............
Currently, none of the other Hadoop filesystems call progress() during writes. Progress
is important in MapReduce applications, as you will see in later chapters.
FSDataOutputStream
The create() method on FileSystem returns an FSDataOutputStream, which, like
FSDataInputStream, has a method for querying the current position in the file:
package org.apache.hadoop.fs;
public class FSDataOutputStream extends DataOutputStream implements Syncable {
public long getPos() throws IOException {
// implementation elided
}

// implementation elided
}
However, unlike FSDataInputStream, FSDataOutputStream does not permit seeking. This
is because HDFS allows only sequential writes to an open file or appends to an already
written file. In other words, there is no support for writing to anywhere other than the
end of the file, so there is no value in being able to seek while writing.
Directories
FileSystem provides a method to create a directory:
public boolean mkdirs(Path f) throws IOException
This method creates all of the necessary parent directories if they don't already exist,
just like the java.io.File's mkdirs() method. It returns true if the directory (and all
parent directories) was (were) successfully created.
Often, you don't need to explicitly create a directory, since writing a file by calling
create() will automatically create any parent directories.
Querying the Filesystem
File metadata: FileStatus
An important feature of any filesystem is the ability to navigate its directory structure
and retrieve information about the files and directories that it stores. The FileStatus
class encapsulates filesystem metadata for files and directories, including file length,
block size, replication, modification time, ownership, and permission information.
The method getFileStatus() on FileSystem provides a way of getting a FileStatus
object for a single file or directory. Example 3-5 shows an example of its use.
Example 3-5. Demonstrating file status information
public class ShowFileStatusTest {

private MiniDFSCluster cluster; // use an in-process HDFS cluster for testing
private FileSystem fs;
@Before
public void setUp() throws IOException {
Configuration conf = new Configuration();
if (System.getProperty("test.build.data") == null) {
System.setProperty("test.build.data", "/tmp");
}
cluster = new MiniDFSCluster(conf, 1, true, null);
fs = cluster.getFileSystem();
OutputStream out = fs.create(new Path("/dir/file"));
out.write("content".getBytes("UTF-8"));
out.close();
}

@After
public void tearDown() throws IOException {
if (fs != null) { fs.close(); }
if (cluster != null) { cluster.shutdown(); }
}

@Test(expected = FileNotFoundException.class)
public void throwsFileNotFoundForNonExistentFile() throws IOException {
fs.getFileStatus(new Path("no-such-file"));
}

@Test
public void fileStatusForFile() throws IOException {
Path file = new Path("/dir/file");
FileStatus stat = fs.getFileStatus(file);
assertThat(stat.getPath().toUri().getPath(), is("/dir/file"));
assertThat(stat.isDir(), is(false));
assertThat(stat.getLen(), is(7L));
assertThat(stat.getModificationTime(),
is(lessThanOrEqualTo(System.currentTimeMillis())));
assertThat(stat.getReplication(), is((short) 1));
assertThat(stat.getBlockSize(), is(64 * 1024 * 1024L));
assertThat(stat.getOwner(), is("tom"));
assertThat(stat.getGroup(), is("supergroup"));
assertThat(stat.getPermission().toString(), is("rw-r--r--"));
}

@Test
public void fileStatusForDirectory() throws IOException {
Path dir = new Path("/dir");
FileStatus stat = fs.getFileStatus(dir);
assertThat(stat.getPath().toUri().getPath(), is("/dir"));
assertThat(stat.isDir(), is(true));
assertThat(stat.getLen(), is(0L));
assertThat(stat.getModificationTime(),
is(lessThanOrEqualTo(System.currentTimeMillis())));
assertThat(stat.getReplication(), is((short) 0));
assertThat(stat.getBlockSize(), is(0L));
assertThat(stat.getOwner(), is("tom"));
assertThat(stat.getGroup(), is("supergroup"));
assertThat(stat.getPermission().toString(), is("rwxr-xr-x"));
}

}
If no file or directory exists, a FileNotFoundException is thrown. However, if you are
interested only in the existence of a file or directory, then the exists() method on
FileSystem is more convenient:
public boolean exists(Path f) throws IOException
Listing files
Finding information on a single file or directory is useful, but you also often need to be
able to list the contents of a directory. That's what FileSystem's listStatus() methods
are for:
public FileStatus[] listStatus(Path f) throws IOException
public FileStatus[] listStatus(Path f, PathFilter filter) throws IOException
public FileStatus[] listStatus(Path[] files) throws IOException
public FileStatus[] listStatus(Path[] files, PathFilter filter) throws IOException
When the argument is a file, the simplest variant returns an array of FileStatus objects
of length 1. When the argument is a directory, it returns zero or more FileStatus objects
representing the files and directories contained in the directory.
Overloaded variants allow a PathFilter to be supplied to restrict the files and directories
to match; you will see an example in the section "PathFilter" on page 68. Finally, if you
specify an array of paths, the result is a shortcut for calling the equivalent single-path
listStatus method for each path in turn and accumulating the FileStatus object arrays
in a single array. This can be useful for building up lists of input files to process from
distinct parts of the filesystem tree. Example 3-6 is a simple demonstration of this idea.
Note the use of stat2Paths() in FileUtil for turning an array of FileStatus objects to
an array of Path objects.
Example 3-6. Showing the file statuses for a collection of paths in a Hadoop filesystem
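import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ListStatus {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    Path[] paths = new Path[args.length];
    for (int i = 0; i < paths.length; i++) {
      paths[i] = new Path(args[i]);
    }

    // One call lists every path and concatenates the results
    FileStatus[] status = fs.listStatus(paths);
    Path[] listedPaths = FileUtil.stat2Paths(status);
    for (Path p : listedPaths) {
      System.out.println(p);
    }
  }
}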
We can use this program to find the union of directory listings for a collection of paths:
% hadoop ListStatus hdfs://localhost/ hdfs://localhost/user/tom
hdfs://localhost/user
hdfs://localhost/user/tom/books
hdfs://localhost/user/tom/quangle.txt
File patterns
It is a common requirement to process sets of files in a single operation. For example,
a MapReduce job for log processing might analyze a month's worth of files contained
in a number of directories. Rather than having to enumerate each file and directory to
specify the input, it is convenient to use wildcard characters to match multiple files
with a single expression, an operation that is known as globbing. Hadoop provides two
FileSystem methods for processing globs:
public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException
The globStatus() method returns an array of FileStatus objects whose paths match
the supplied pattern, sorted by path. An optional PathFilter can be specified to restrict
the matches further.
Hadoop supports the same set of glob characters as Unix bash (see Table 3-2).
Table 3-2. Glob characters and their meanings
Glob     Name                      Matches
*        asterisk                  Matches zero or more characters
?        question mark             Matches a single character
[ab]     character class           Matches a single character in the set {a, b}
[^ab]    negated character class   Matches a single character that is not in the set {a, b}
[a-b]    character range           Matches a single character in the (closed) range [a, b], where a is lexicographically less than or equal to b
[^a-b]   negated character range   Matches a single character that is not in the (closed) range [a, b], where a is lexicographically less than or equal to b
{a,b}    alternation               Matches either expression a or b
\c       escaped character         Matches character c when it is a metacharacter
Imagine that logfiles are stored in a directory structure organized hierarchically by
date. So, for example, logfiles for the last day of 2007 would go in a directory
named /2007/12/31. Suppose that the full file listing is:
/2007/12/30
/2007/12/31
/2008/01/01
/2008/01/02
Here are some file globs and their expansions:
Glob                Expansion
/*                  /2007 /2008
/*/*                /2007/12 /2008/01
/*/12/*             /2007/12/30 /2007/12/31
/200?               /2007 /2008
/200[78]            /2007 /2008
/200[7-8]           /2007 /2008
/200[^01234569]     /2007 /2008
/*/*/{31,01}        /2007/12/31 /2008/01/01
/*/*/3{0,1}         /2007/12/30 /2007/12/31
/*/{12/31,01/01}    /2007/12/31 /2008/01/01
PathFilter
Glob patterns are not always powerful enough to describe a set of files you want to
access. For example, it is not generally possible to exclude a particular file using a glob
pattern. The listStatus() and globStatus() methods of FileSystem take an optional
PathFilter, which allows programmatic control over matching:
package org.apache.hadoop.fs;
public interface PathFilter {
boolean accept(Path path);
}
PathFilter is the equivalent of java.io.FileFilter for Path objects rather than File
objects.
Example 3-7 shows a PathFilter for excluding paths that match a regular expression.
Example 3-7. A PathFilter for excluding paths that match a regular expression
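import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class RegexExcludePathFilter implements PathFilter {

  private final String regex;

  public RegexExcludePathFilter(String regex) {
    this.regex = regex;
  }

  public boolean accept(Path path) {
    // Accept only paths that do NOT match the expression
    return !path.toString().matches(regex);
  }
}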
The filter passes only files that don't match the regular expression. We use the filter in
conjunction with a glob that picks out an initial set of files to include: the filter is used
to refine the results. For example:
fs.globStatus(new Path("/2007/*/*"), new RegexExcludePathFilter("^.*/2007/12/31$"))
will expand to /2007/12/30.
Filters can only act on a file's name, as represented by a Path. They can't use a file's
properties, such as creation time, as the basis of the filter. Nevertheless, they can
perform matching that neither glob patterns nor regular expressions can achieve. For
example, if you store files in a directory structure that is laid out by date (like in the
previous section), then you can write a PathFilter to pick out files that fall in a given
date range.
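A minimal sketch of the date-range test such a filter would need is shown below. To keep it runnable with only the JDK, it works on the path string rather than on a Hadoop Path; the class name DateRangePathCheck and the assumed /YYYY/MM/DD layout are illustrative choices, not part of Hadoop. Inside a real PathFilter, the accept(Path) method could pass path.toUri().getPath() to this check.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateRangePathCheck {

    // Assumed layout: /YYYY/MM/DD, optionally followed by more path components
    private static final Pattern DATE_DIR =
        Pattern.compile("^/(\\d{4})/(\\d{2})/(\\d{2})(/.*)?$");

    private final LocalDate start; // inclusive
    private final LocalDate end;   // inclusive

    DateRangePathCheck(LocalDate start, LocalDate end) {
        this.start = start;
        this.end = end;
    }

    // Returns true only for paths whose date directory lies within [start, end].
    boolean accept(String path) {
        Matcher m = DATE_DIR.matcher(path);
        if (!m.matches()) {
            return false;
        }
        try {
            LocalDate date = LocalDate.of(
                Integer.parseInt(m.group(1)),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)));
            return !date.isBefore(start) && !date.isAfter(end);
        } catch (DateTimeException e) {
            return false; // digits present, but not a real calendar date
        }
    }
}
```

With the listing from the previous section and a range of 2007-12-31 to 2008-01-01, this accepts /2007/12/31 and /2008/01/01 and rejects the other two directories, a selection that spans a year boundary and would be awkward to express as a single glob.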
Deleting Data
Use the delete() method on FileSystem to permanently remove files or directories:
public boolean delete(Path f, boolean recursive) throws IOException
If f is a file or an empty directory, then the value of recursive is ignored. A nonempty
directory is only deleted, along with its contents, if recursive is true (otherwise an
IOException is thrown).
Data Flow
Anatomy of a File Read
To get an idea of how data flows between the client interacting with HDFS, the
namenode and the datanodes, consider Figure 3-2, which shows the main sequence of events
when reading a file.
Figure 3-2. A client reading data from HDFS
The client opens the file it wishes to read by calling open() on the FileSystem object,
which for HDFS is an instance of DistributedFileSystem (step 1 in Figure 3-2).
DistributedFileSystem calls the namenode, using RPC, to determine the locations of
the blocks for the first few blocks in the file (step 2). For each block, the namenode
returns the addresses of the datanodes that have a copy of that block. Furthermore, the
datanodes are sorted according to their proximity to the client (according to the
topology of the cluster's network; see "Network Topology and Hadoop" on page 71). If
the client is itself a datanode (in the case of a MapReduce task, for instance), then it
will read from the local datanode, if it hosts a copy of the block (see also Figure 2-2).
The DistributedFileSystem returns an FSDataInputStream (an input stream that
supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps
a DFSInputStream, which manages the datanode and namenode I/O.
The client then calls read() on the stream (step 3). DFSInputStream, which has stored
the datanode addresses for the first few blocks in the file, then connects to the first
(closest) datanode for the first block in the file. Data is streamed from the datanode
back to the client, which calls read() repeatedly on the stream (step 4). When the end
of the block is reached, DFSInputStream will close the connection to the datanode, then
find the best datanode for the next block (step 5). This happens transparently to the
client, which from its point of view is just reading a continuous stream.
Blocks are read in order with the DFSInputStream opening new connections to datanodes
as the client reads through the stream. It will also call the namenode to retrieve the
datanode locations for the next batch of blocks as needed. When the client has finished
reading, it calls close() on the FSDataInputStream (step 6).
During reading, if the DFSInputStream encounters an error while communicating with
a datanode, then it will try the next closest one for that block. It will also remember
datanodes that have failed so that it doesn't needlessly retry them for later blocks. The
DFSInputStream also verifies checksums for the data transferred to it from the datanode.
If a corrupted block is found, it is reported to the namenode before the
DFSInputStream attempts to read a replica of the block from another datanode.
One important aspect of this design is that the client contacts datanodes directly to
retrieve data and is guided by the namenode to the best datanode for each block. This
design allows HDFS to scale to a large number of concurrent clients, since the data
traffic is spread across all the datanodes in the cluster. The namenode meanwhile merely
has to service block location requests (which it stores in memory, making them very
efficient) and does not, for example, serve data, which would quickly become a
bottleneck as the number of clients grew.
Network Topology and Hadoop
What does it mean for two nodes in a local network to be "close" to each other? In the
context of high-volume data processing, the limiting factor is the rate at which we can
transfer data between nodes: bandwidth is a scarce commodity. The idea is to use the
bandwidth between two nodes as a measure of distance.
Rather than measuring bandwidth between nodes, which can be difficult to do in
practice (it requires a quiet cluster, and the number of pairs of nodes in a cluster grows as
the square of the number of nodes), Hadoop takes a simple approach in which the
network is represented as a tree and the distance between two nodes is the sum of their
distances to their closest common ancestor. Levels in the tree are not predefined, but
it is common to have levels that correspond to the data center, the rack, and the node
that a process is running on. The idea is that the bandwidth available for each of the
following scenarios becomes progressively less:
• Processes on the same node
• Different nodes on the same rack
• Nodes on different racks in the same data center
• Nodes in different data centers (see footnote 7)
For example, imagine a node n1 on rack r1 in data center d1. This can be represented
as /d1/r1/n1. Using this notation, here are the distances for the four scenarios:
• distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)
• distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)
• distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)
• distance(/d1/r1/n1, /d2/r3/n1) = 6 (nodes in different data centers)
7. At the time of this writing, Hadoop is not suited for running across data centers.
This is illustrated schematically in Figure 3-3. (Mathematically inclined readers will
notice that this is an example of a distance metric.)
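The tree distance described above can be computed directly from the /datacenter/rack/node notation. The following is a small sketch of that calculation only; NetworkDistanceSketch is a made-up name and this is not Hadoop's own topology code.

```java
class NetworkDistanceSketch {

    // Distance = steps from a up to the closest common ancestor
    //          + steps from b up to the same ancestor.
    static int distance(String a, String b) {
        String[] pa = split(a);
        String[] pb = split(b);
        int common = 0;
        int limit = Math.min(pa.length, pb.length);
        while (common < limit && pa[common].equals(pb[common])) {
            common++;
        }
        return (pa.length - common) + (pb.length - common);
    }

    // "/d1/r1/n1" -> ["d1", "r1", "n1"]
    private static String[] split(String location) {
        String trimmed = location.startsWith("/") ? location.substring(1) : location;
        return trimmed.isEmpty() ? new String[0] : trimmed.split("/");
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n1")); // 6
    }
}
```

Because the result depends only on the length of the shared prefix, the function is symmetric, and a location is at distance zero only from itself.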
Finally, it is important to realize that Hadoop cannot divine your network topology for
you. It needs some help; we'll cover how to configure topology in "Network Topology"
on page 297. By default though, it assumes that the network is flat (a single-level
hierarchy), or in other words, that all nodes are on a single rack in a single data center.
For small clusters, this may actually be the case, and no further configuration is
required.
Figure 3-3. Network distance in Hadoop
Anatomy of a File Write
Next we'll look at how files are written to HDFS. Although quite detailed, it is
instructive to understand the data flow since it clarifies HDFS's coherency model.
The case we're going to consider is the case of creating a new file, writing data to it,
then closing the file. See Figure 3-4.
The client creates the file by calling create() on DistributedFileSystem (step 1 in
Figure 3-4). DistributedFileSystem makes an RPC call to the namenode to create a new
file in the filesystem's namespace, with no blocks associated with it (step 2). The
namenode performs various checks to make sure the file doesn't already exist, and that the
client has the right permissions to create the file. If these checks pass, the namenode
makes a record of the new file; otherwise, file creation fails and the client is thrown an
IOException. The DistributedFileSystem returns an FSDataOutputStream for the client
to start writing data to. Just as in the read case, FSDataOutputStream wraps a
DFSOutputStream, which handles communication with the datanodes and namenode.
As the client writes data (step 3), DFSOutputStream splits it into packets, which it writes
to an internal queue, called the data queue. The data queue is consumed by the
DataStreamer, whose responsibility it is to ask the namenode to allocate new blocks by
picking a list of suitable datanodes to store the replicas. The list of datanodes forms a
pipeline; we'll assume the replication level is three, so there are three nodes in the
pipeline. The DataStreamer streams the packets to the first datanode in the pipeline,
which stores the packet and forwards it to the second datanode in the pipeline.
Similarly, the second datanode stores the packet and forwards it to the third (and last)
datanode in the pipeline (step 4).
Figure 3-4. A client writing data to HDFS
DFSOutputStream also maintains an internal queue of packets that are waiting to be
acknowledged by datanodes, called the ack queue. A packet is removed from the ack
queue only when it has been acknowledged by all the datanodes in the pipeline (step 5).
If a datanode fails while data is being written to it, then the following actions are taken,
which are transparent to the client writing the data. First the pipeline is closed, and any
packets in the ack queue are added to the front of the data queue so that datanodes
that are downstream from the failed node will not miss any packets. The current block
on the good datanodes is given a new identity, which is communicated to the
namenode, so that the partial block on the failed datanode will be deleted if the failed
datanode recovers later on. The failed datanode is removed from the pipeline and the
remainder of the block's data is written to the two good datanodes in the pipeline. The
namenode notices that the block is under-replicated, and it arranges for a further replica
to be created on another node. Subsequent blocks are then treated as normal.
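The interplay of the two queues can be modeled in a few lines. The sketch below is a toy, single-threaded model of the bookkeeping described above, assuming packets are identified by integers; WriteQueuesSketch and its method names are invented for illustration and do not correspond to Hadoop classes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

class WriteQueuesSketch {
    private final Deque<Integer> dataQueue = new ArrayDeque<>();
    private final Deque<Integer> ackQueue = new ArrayDeque<>();

    // The client wrote another packet.
    void enqueue(int packetId) {
        dataQueue.addLast(packetId);
    }

    // The packet is streamed to the pipeline: it leaves the data queue
    // and waits in the ack queue until every datanode has confirmed it.
    Integer sendNext() {
        Integer packet = dataQueue.pollFirst();
        if (packet != null) {
            ackQueue.addLast(packet);
        }
        return packet;
    }

    // All datanodes in the pipeline acknowledged this packet.
    void acknowledge(int packetId) {
        ackQueue.remove(Integer.valueOf(packetId));
    }

    // On pipeline failure, unacknowledged packets return to the FRONT of
    // the data queue, keeping their original order, so they are resent first.
    void onPipelineFailure() {
        Iterator<Integer> newestFirst = ackQueue.descendingIterator();
        while (newestFirst.hasNext()) {
            dataQueue.addFirst(newestFirst.next());
        }
        ackQueue.clear();
    }

    List<Integer> pendingData() {
        return new ArrayList<>(dataQueue);
    }

    List<Integer> pendingAcks() {
        return new ArrayList<>(ackQueue);
    }
}
```

For instance, with packets 1 to 5 queued, three sent, and only packet 1 acknowledged, a failure leaves the data queue holding 2, 3, 4, 5, so nothing that was in flight is lost.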
It's possible, but unlikely, that multiple datanodes fail while a block is being written.
As long as dfs.replication.min replicas (default one) are written, the write will succeed,
and the block will be asynchronously replicated across the cluster until its target
replication factor is reached (dfs.replication, which defaults to three).
When the client has finished writing data, it calls close() on the stream (step 6). This
action flushes all the remaining packets to the datanode pipeline and waits for
acknowledgments before contacting the namenode to signal that the file is complete (step
7). The namenode already knows which blocks the file is made up of (via
DataStreamer asking for block allocations), so it only has to wait for blocks to be minimally
replicated before returning successfully.
Replica Placement
How does the namenode choose which datanodes to store replicas on? There's a
tradeoff between reliability and write bandwidth and read bandwidth here. For example,
placing all replicas on a single node incurs the lowest write bandwidth penalty since
the replication pipeline runs on a single node, but this offers no real redundancy (if the
node fails, the data for that block is lost). Also, the read bandwidth is high for off-rack
reads. At the other extreme, placing replicas in different data centers may maximize
redundancy, but at the cost of bandwidth. Even in the same data center (which is what
all Hadoop clusters to date have run in), there are a variety of placement strategies.
Indeed, Hadoop changed its placement strategy in release 0.17.0 to one that helps keep
a fairly even distribution of blocks across the cluster. (See "balancer" on page 348 for
details on keeping a cluster balanced.) And from 0.21.0, block placement policies are
pluggable.
Hadoop's default strategy is to place the first replica on the same node as the client (for
clients running outside the cluster, a node is chosen at random, although the system
tries not to pick nodes that are too full or too busy). The second replica is placed on a
different rack from the first (off-rack), chosen at random. The third replica is placed on
the same rack as the second, but on a different node chosen at random. Further replicas
are placed on random nodes on the cluster, although the system tries to avoid placing
too many replicas on the same rack.
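The first three choices of this strategy can be sketched as follows. This is a simplified model, not Hadoop's placement policy: it assumes the client runs on a datanode inside the cluster, ignores how full or busy nodes are, and requires at least one other rack with two or more nodes. The names ReplicaPlacementSketch, chooseThree, and rackOf are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;

class ReplicaPlacementSketch {

    // racks maps a rack name to the nodes on that rack.
    // Returns the nodes chosen for replicas one, two, and three, in order.
    static List<String> chooseThree(Map<String, List<String>> racks,
                                    String clientNode, Random random) {
        String firstRack = rackOf(racks, clientNode);
        if (firstRack == null) {
            throw new IllegalArgumentException("client is not a known node");
        }
        // Candidate racks for replicas two and three: any rack other than
        // the client's that can supply two distinct nodes.
        List<String> otherRacks = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : racks.entrySet()) {
            if (!e.getKey().equals(firstRack) && e.getValue().size() >= 2) {
                otherRacks.add(e.getKey());
            }
        }
        if (otherRacks.isEmpty()) {
            throw new IllegalArgumentException("no off-rack location with two nodes");
        }
        String secondRack = otherRacks.get(random.nextInt(otherRacks.size()));
        List<String> candidates = new ArrayList<>(racks.get(secondRack));
        // Second replica: a random node on the off-rack location.
        String second = candidates.remove(random.nextInt(candidates.size()));
        // Third replica: same rack as the second, but a different node.
        String third = candidates.get(random.nextInt(candidates.size()));
        return Arrays.asList(clientNode, second, third);
    }

    // Finds the rack that hosts the given node, or null if there is none.
    static String rackOf(Map<String, List<String>> racks, String node) {
        for (Map.Entry<String, List<String>> e : racks.entrySet()) {
            if (e.getValue().contains(node)) {
                return e.getKey();
            }
        }
        return null;
    }
}
```

Whatever the random choices, the result always spans exactly two racks, with the client's own node holding the first copy.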
Once the replica locations have been chosen, a pipeline is built, taking network
topology into account. For a replication factor of 3, the pipeline might look like Figure 3-5.
Overall, this strategy gives a good balance among reliability (blocks are stored on two
racks), write bandwidth (writes only have to traverse a single network switch), read
performance (there's a choice of two racks to read from), and block distribution across
the cluster (clients only write a single block on the local rack).
Coherency Model
A coherency model for a filesystem describes the data visibility of reads and writes for
a file. HDFS trades off some POSIX requirements for performance, so some operations
may behave differently than you expect them to.
After creating a file, it is visible in the filesystem namespace, as expected:
Path p = new Path("p");
fs.create(p);
assertThat(fs.exists(p), is(true));
However, any content written to the file is not guaranteed to be visible, even if the
stream is flushed. So the file appears to have a length of zero:
Path p = new Path("p");
OutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.flush();
assertThat(fs.getFileStatus(p).getLen(), is(0L));
Once more than a block's worth of data has been written, the first block will be visible
to new readers. This is true of subsequent blocks, too: it is always the current block
being written that is not visible to other readers.
Figure 3-5. A typical replica pipeline
HDFS provides a method for forcing all buffers to be synchronized to the datanodes
via the sync() method on FSDataOutputStream. After a successful return from sync(),
HDFS guarantees that the data written up to that point in the file is persisted and visible
to all new readers (see footnote 8):
Path p = new Path("p");
FSDataOutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.flush();
out.sync();
assertThat(fs.getFileStatus(p).getLen(), is(((long) "content".length())));
This behavior is similar to the fsync system call in POSIX that commits buffered data
for a file descriptor. For example, using the standard Java API to write a local file, we
are guaranteed to see the content after flushing the stream and synchronizing:
FileOutputStream out = new FileOutputStream(localFile);
out.write("content".getBytes("UTF-8"));
out.flush(); // flush to operating system
out.getFD().sync(); // sync to disk
assertThat(localFile.length(), is(((long) "content".length())));
Closing a file in HDFS performs an implicit sync(), too:
Path p = new Path("p");
OutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.close();
assertThat(fs.getFileStatus(p).getLen(), is(((long) "content".length())));
Consequences for application design
This coherency model has implications for the way you design applications. With no
calls to sync(), you should be prepared to lose up to a block of data in the event of
client or system failure. For many applications, this is unacceptable, so you should call
sync() at suitable points, such as after writing a certain number of records or number
of bytes. Though the sync() operation is designed to not unduly tax HDFS, it does have
some overhead, so there is a trade-off between data robustness and throughput. What
is an acceptable trade-off is application-dependent, and suitable values can be selected
after measuring your application's performance with different sync() frequencies.
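One way to experiment with sync frequency is to make the interval a parameter. The sketch below uses the JDK's local-file analog (flush() followed by getFD().sync()) so that it runs without a cluster; for HDFS the same loop would call sync() on the FSDataOutputStream instead. PeriodicSyncSketch, writeRecords, and syncEvery are illustrative names, not part of any Hadoop API.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

class PeriodicSyncSketch {

    // Writes one record per line and forces the data to disk after every
    // syncEvery records. Returns how many explicit syncs were issued.
    static int writeRecords(File file, List<String> records, int syncEvery)
            throws IOException {
        if (syncEvery <= 0) {
            throw new IllegalArgumentException("syncEvery must be positive");
        }
        int syncs = 0;
        try (FileOutputStream out = new FileOutputStream(file)) {
            int sinceLastSync = 0;
            for (String record : records) {
                out.write((record + "\n").getBytes(StandardCharsets.UTF_8));
                if (++sinceLastSync == syncEvery) {
                    out.flush();        // flush to operating system
                    out.getFD().sync(); // sync to disk
                    syncs++;
                    sinceLastSync = 0;
                }
            }
        } // closing the stream hands over any records written since the last sync
        return syncs;
    }
}
```

A smaller syncEvery bounds how many records a crash can lose but issues more syncs; timing the same workload with a few different values is a simple way to find an acceptable setting.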
Parallel Copying with distcp
The HDFS access patterns that we have seen so far focus on single-threaded access. It's
possible to act on a collection of files, by specifying file globs, for example, but for
efficient, parallel processing of these files you would have to write a program yourself.
Hadoop comes with a useful program called distcp for copying large amounts of data
to and from Hadoop filesystems in parallel.
8. From release 0.21.0 sync() is deprecated in favor of hflush(), which only guarantees that new readers
will see all data written to that point, and hsync(), which makes a stronger guarantee that the operating
system has flushed the data to disk (like POSIX fsync), although data may still be in the disk cache.
The canonical use case for distcp is for transferring data between two HDFS clusters.
If the clusters are running identical versions of Hadoop, the hdfs scheme is
appropriate:
% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
This will copy the /foo directory (and its contents) from the first cluster to the /bar
directory on the second cluster, so the second cluster ends up with the directory
structure /bar/foo. If /bar doesn't exist, it will be created first. You can specify multiple
source paths, and all will be copied to the destination. Source paths must be absolute.
By default, distcp will skip files that already exist in the destination, but they can be
overwritten by supplying the -overwrite option. You can also update only files that
have changed using the -update option.
Using either (or both) of -overwrite or -update changes how the source
and destination paths are interpreted. This is best shown with an
example. If we changed a file in the /foo subtree on the first cluster from
the previous example, then we could synchronize the change with the
second cluster by running:
% hadoop distcp -update hdfs://namenode1/foo hdfs://namenode2/bar/foo
The extra trailing /foo subdirectory is needed on the destination, as now
the contents of the source directory are copied to the contents of the
destination directory. (If you are familiar with rsync, you can think of
the -overwrite or -update options as adding an implicit trailing slash to
the source.)
If you are unsure of the effect of a distcp operation, it is a good idea to
try it out on a small test directory tree first.
There are more options to control the behavior of distcp, including ones to preserve file
attributes, ignore failures, and limit the number of files or total data copied. Run it with
no options to see the usage instructions.
distcp is implemented as a MapReduce job where the work of copying is done by the
maps that run in parallel across the cluster. There are no reducers. Each file is copied
by a single map, and distcp tries to give each map approximately the same amount of
data, by bucketing files into roughly equal allocations.
The number of maps is decided as follows. Since it's a good idea to get each map to
copy a reasonable amount of data to minimize overheads in task setup, each map copies
at least 256 MB (unless the total size of the input is less, in which case one map handles
it all). For example, 1 GB of files will be given four map tasks. When the data size is
very large, it becomes necessary to limit the number of maps in order to limit bandwidth
and cluster utilization. By default, the maximum number of maps is 20 per (tasktracker)
cluster node. For example, copying 1,000 GB of files to a 100-node cluster will allocate
2,000 maps (20 per node), so each will copy 512 MB on average. This can be reduced
by specifying the -m argument to distcp. For example, -m 1000 would allocate 1,000
maps, each copying 1 GB on average.
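The arithmetic behind these figures can be written out as a small calculation. The following is a sketch of the rule as stated in the text (at least 256 MB per map, at most 20 maps per node), not the distcp source; the class and method names are hypothetical.

```java
class DistcpMapEstimate {
    static final long BYTES_PER_MAP = 256L * 1024 * 1024; // each map copies at least 256 MB
    static final int MAX_MAPS_PER_NODE = 20;               // default cap per tasktracker node

    /** Number of maps for a copy of totalBytes on a cluster with the given node count. */
    static long numMaps(long totalBytes, int nodes) {
        long bySize = totalBytes / BYTES_PER_MAP;          // floor keeps every map at >= 256 MB
        long cap = (long) MAX_MAPS_PER_NODE * nodes;
        return Math.max(1, Math.min(bySize, cap));         // small inputs still get one map
    }

    /** Average number of bytes each map copies. */
    static long averageBytesPerMap(long totalBytes, long maps) {
        return totalBytes / maps;
    }

    public static void main(String[] args) {
        long gb = 1L << 30;
        long mb = 1L << 20;
        System.out.println(numMaps(1 * gb, 1));                        // prints 4
        long maps = numMaps(1000 * gb, 100);
        System.out.println(maps);                                      // prints 2000
        System.out.println(averageBytesPerMap(1000 * gb, maps) / mb);  // prints 512
        System.out.println(averageBytesPerMap(1000 * gb, 1000) / mb);  // prints 1024 (-m 1000)
    }
}
```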
If you try to use distcp between two HDFS clusters that are running different versions,
the copy will fail if you use the hdfs protocol, since the RPC systems are incompatible.
To remedy this, you can use the read-only HTTP-based HFTP filesystem to read from
the source. The job must run on the destination cluster so that the HDFS RPC versions
are compatible. To repeat the previous example using HFTP:
% hadoop distcp hftp://namenode1:50070/foo hdfs://namenode2/bar
Note that you need to specify the namenode's web port in the source URI. This is
determined by the dfs.http.address property, which defaults to 50070.
Using the newer webhdfs protocol (which replaces hftp), it is possible to use HTTP for
both the source and destination clusters without hitting any wire incompatibility
problems:
% hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar
Another variant is to use an HDFS HTTP proxy as the distcp source or destination,
which has the advantage of being able to set firewall and bandwidth controls (see
"HTTP" on page 55).
Keeping an HDFS Cluster Balanced
When copying data into HDFS, it's important to consider cluster balance. HDFS works
best when the file blocks are evenly spread across the cluster, so you want to ensure
that distcp doesn't disrupt this. Going back to the 1,000 GB example, by specifying -m
1 a single map would do the copy, which (apart from being slow and not using the
cluster resources efficiently) would mean that the first replica of each block would
reside on the node running the map (until the disk filled up). The second and third
replicas would be spread across the cluster, but this one node would be unbalanced.
By having more maps than nodes in the cluster, this problem is avoided. For this
reason, it's best to start by running distcp with the default of 20 maps per node.
However, it's not always possible to prevent a cluster from becoming unbalanced.
Perhaps you want to limit the number of maps so that some of the nodes can be used by
other jobs. In this case, you can use the balancer tool (see "balancer" on page 348) to
subsequently even out the block distribution across the cluster.
Hadoop Archives
HDFS stores small files inefficiently, since each file is stored in a block, and block
metadata is held in memory by the namenode. Thus, a large number of small files can
eat up a lot of memory on the namenode. (Note, however, that small files do not take
up any more disk space than is required to store the raw contents of the file. For
example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not
128 MB.)
Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS
blocks more efficiently, thereby reducing namenode memory usage while still allowing
transparent access to files. In particular, Hadoop Archives can be used as input to
MapReduce.
Using Hadoop Archives
A Hadoop Archive is created from a collection of files using the archive tool. The tool
runs a MapReduce job to process the input files in parallel, so you need a running
MapReduce cluster to use it. Here are some files in HDFS that we would like
to archive:
% hadoop fs -lsr /my/files
-rw-r--r-- 1 tom supergroup 1 2009-04-09 19:13 /my/files/a
drwxr-xr-x - tom supergroup 0 2009-04-09 19:13 /my/files/dir
-rw-r--r-- 1 tom supergroup 1 2009-04-09 19:13 /my/files/dir/b
Now we can run the archive command:
% hadoop archive -archiveName files.har /my/files /my
The first option is the name of the archive, here files.har. HAR files always have
a .har extension, which is mandatory for reasons we shall see later. Next come the files
to put in the archive. Here we are archiving only one source tree, the files in /my/files
in HDFS, but the tool accepts multiple source trees. The final argument is the output
directory for the HAR file. Let's see what the archive has created:
% hadoop fs -ls /my
Found 2 items
drwxr-xr-x - tom supergroup 0 2009-04-09 19:13 /my/files
drwxr-xr-x - tom supergroup 0 2009-04-09 19:13 /my/files.har
% hadoop fs -ls /my/files.har
Found 3 items
-rw-r--r-- 10 tom supergroup 165 2009-04-09 19:13 /my/files.har/_index
-rw-r--r-- 10 tom supergroup 23 2009-04-09 19:13 /my/files.har/_masterindex
-rw-r--r-- 1 tom supergroup 2 2009-04-09 19:13 /my/files.har/part-0
The directory listing shows what a HAR file is made of: two index files and a collection
of part files (just one in this example). The part files contain the contents of a number
of the original files concatenated together, and the indexes make it possible to look up
the part file that an archived file is contained in, and its offset and length. All these
details are hidden from the application, however, which uses the har URI scheme to
interact with HAR files, using a HAR filesystem that is layered on top of the underlying
filesystem (HDFS in this case). The following command recursively lists the files in the
archive:
% hadoop fs -lsr har:///my/files.har
drw-r--r-- - tom supergroup 0 2009-04-09 19:13 /my/files.har/my
drw-r--r-- - tom supergroup 0 2009-04-09 19:13 /my/files.har/my/files
-rw-r--r-- 10 tom supergroup 1 2009-04-09 19:13 /my/files.har/my/files/a
drw-r--r-- - tom supergroup 0 2009-04-09 19:13 /my/files.har/my/files/dir
-rw-r--r-- 10 tom supergroup 1 2009-04-09 19:13 /my/files.har/my/files/dir/b
This is quite straightforward if the filesystem that the HAR file is on is the default
filesystem. On the other hand, if you want to refer to a HAR file on a different filesystem,
then you need to use a different form of the path URI. These two commands
have the same effect, for example:
% hadoop fs -lsr har:///my/files.har/my/files/dir
% hadoop fs -lsr har://hdfs-localhost:8020/my/files.har/my/files/dir
Notice in the second form that the scheme is still har to signify a HAR filesystem, but
the authority is hdfs to specify the underlying filesystem's scheme, followed by a dash
and the HDFS host (localhost) and port (8020). We can now see why HAR files have
to have a .har extension. The HAR filesystem translates the har URI into a URI for the
underlying filesystem, by looking at the authority and path up to and including the
component with the .har extension. In this case, it is hdfs://localhost:8020/my/files.har.
The remaining part of the path is the path of the file in the archive: /my/files/dir.
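The translation just described can be mimicked with java.net.URI. This is an illustrative sketch of the rule in the text, not the HAR filesystem's own code; the class HarUriTranslator is hypothetical, and it handles only the explicit-authority form (it does not resolve har:/// against a default filesystem).

```java
import java.net.URI;

class HarUriTranslator {
    /**
     * Splits a har URI into { URI of the archive on the underlying filesystem,
     * path of the file inside the archive }.
     */
    static String[] translate(String harUri) {
        URI uri = URI.create(harUri);
        String authority = uri.getAuthority();            // e.g. "hdfs-localhost:8020"
        if (!"har".equals(uri.getScheme()) || authority == null) {
            throw new IllegalArgumentException("expected har://scheme-host[:port]/...: " + harUri);
        }
        int dash = authority.indexOf('-');
        if (dash <= 0) {
            throw new IllegalArgumentException("authority lacks scheme-host form: " + authority);
        }
        String underlyingScheme = authority.substring(0, dash);  // "hdfs"
        String hostAndPort = authority.substring(dash + 1);      // "localhost:8020"

        // The archive path runs up to and including the component ending in .har.
        String path = uri.getPath();
        int marker = path.indexOf(".har/");
        int archiveEnd = marker >= 0 ? marker + ".har".length()
                : (path.endsWith(".har") ? path.length() : -1);
        if (archiveEnd < 0) {
            throw new IllegalArgumentException("no .har component in path: " + path);
        }
        String archive = underlyingScheme + "://" + hostAndPort + path.substring(0, archiveEnd);
        String inside = path.substring(archiveEnd);
        return new String[] { archive, inside.isEmpty() ? "/" : inside };
    }

    public static void main(String[] args) {
        String[] parts = translate("har://hdfs-localhost:8020/my/files.har/my/files/dir");
        System.out.println(parts[0]); // prints hdfs://localhost:8020/my/files.har
        System.out.println(parts[1]); // prints /my/files/dir
    }
}
```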
To delete a HAR file, you need to use the recursive form of delete, since from the
underlying filesystem's point of view the HAR file is a directory:
% hadoop fs -rmr /my/files.har
Limitations
There are a few limitations to be aware of with HAR files. Creating an archive creates
a copy of the original files, so you need as much disk space as the files you are archiving
to create the archive (although you can delete the originals once you have created the
archive). There is currently no support for archive compression, although the files that
go into the archive can be compressed (HAR files are like tar files in this respect).
Archives are immutable once they have been created. To add or remove files, you must
re-create the archive. In practice, this is not a problem for files that don't change after
being written, since they can be archived in batches on a regular basis, such as daily or
weekly.
As noted earlier, HAR files can be used as input to MapReduce. However, there is no
archive-aware InputFormat that can pack multiple files into a single MapReduce split,
so processing lots of small files, even in a HAR file, can still be inefficient. "Small files
and CombineFileInputFormat" on page 237 discusses another approach to this
problem.
Finally, if you are hitting namenode memory limits even after taking steps to minimize
the number of small files in the system, then consider using HDFS Federation to scale
the namespace (see "HDFS Federation" on page 49).
CHAPTER 4
Hadoop I/O
Hadoop comes with a set of primitives for data I/O. Some of these are techniques that
are more general than Hadoop, such as data integrity and compression, but deserve
special consideration when dealing with multiterabyte datasets. Others are Hadoop
tools or APIs that form the building blocks for developing distributed systems, such as
serialization frameworks and on-disk data structures.
Data Integrity
Users of Hadoop rightly expect that no data will be lost or corrupted during storage or
processing. However, since every I/O operation on the disk or network carries with it
a small chance of introducing errors into the data that it is reading or writing, when the
volumes of data flowing through the system are as large as the ones Hadoop is capable
of handling, the chance of data corruption occurring is high.
The usual way of detecting corrupted data is by computing a checksum for the data
when it first enters the system, and again whenever it is transmitted across a channel
that is unreliable and hence capable of corrupting the data. The data is deemed to be
corrupt if the newly generated checksum doesn't exactly match the original. This
technique doesn't offer any way to fix the data; it provides only error detection. (And this
is a reason for not using low-end hardware; in particular, be sure to use ECC memory.)
Note that it is possible that it's the checksum that is corrupt, not the data, but this is
very unlikely, since the checksum is much smaller than the data.
A commonly used error-detecting code is CRC-32 (cyclic redundancy check), which
computes a 32-bit integer checksum for input of any size.
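As a minimal illustration of checksum-based error detection, the following sketch uses the JDK's java.util.zip.CRC32 class (not Hadoop's own checksum code) to compute a CRC-32, flip a single bit, and observe the mismatch. The class name Crc32Demo is made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class Crc32Demo {
    /** Returns the CRC-32 of the data as an unsigned 32-bit value held in a long. */
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /** The data is deemed corrupt if its recomputed checksum differs from the original. */
    static boolean isCorrupt(byte[] data, long originalChecksum) {
        return checksum(data) != originalChecksum;
    }

    public static void main(String[] args) {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        long original = checksum(data);
        System.out.printf("%08x%n", original);         // prints cbf43926
        System.out.println(isCorrupt(data, original)); // prints false

        data[4] ^= 0x01;                                // simulate a single flipped bit
        System.out.println(isCorrupt(data, original)); // prints true
    }
}
```

Note that the check only detects the error; nothing in the checksum says which bit changed or how to repair it.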
Data Integrity in HDFS
HDFS transparently checksums all data written to it and by default verifies checksums
when reading data. A separate checksum is created for every io.bytes.per.checksum
bytes of data. The default is 512 bytes, and since a CRC-32 checksum is 4 bytes long,
the storage overhead is less than 1%.
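The per-chunk scheme and its overhead can be sketched as follows. This uses java.util.zip.CRC32 from the JDK and is an illustration of the idea rather than HDFS's implementation; ChunkChecksums and its methods are hypothetical names.

```java
import java.util.zip.CRC32;

class ChunkChecksums {
    static final int CHECKSUM_BYTES = 4; // a CRC-32 value is 4 bytes long

    /** Computes one CRC-32 per chunk of bytesPerChecksum bytes (the last chunk may be shorter). */
    static long[] checksums(byte[] data, int bytesPerChecksum) {
        int chunks = (data.length + bytesPerChecksum - 1) / bytesPerChecksum;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            int offset = i * bytesPerChecksum;
            int length = Math.min(bytesPerChecksum, data.length - offset);
            CRC32 crc = new CRC32();
            crc.update(data, offset, length);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    /** Index of the first chunk whose checksum no longer matches, or -1 if all match. */
    static int firstCorruptChunk(byte[] data, long[] expected, int bytesPerChecksum) {
        long[] actual = checksums(data, bytesPerChecksum);
        for (int i = 0; i < actual.length; i++) {
            if (i >= expected.length || actual[i] != expected[i]) {
                return i;
            }
        }
        return -1;
    }

    /** Storage overhead of the checksums as a percentage of the data size. */
    static double overheadPercent(int bytesPerChecksum) {
        return 100.0 * CHECKSUM_BYTES / bytesPerChecksum;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1300];
        long[] sums = checksums(data, 512);
        System.out.println(sums.length);                        // prints 3
        System.out.println(overheadPercent(512));               // prints 0.78125
        data[600] = 1;                                          // corrupt a byte in the second chunk
        System.out.println(firstCorruptChunk(data, sums, 512)); // prints 1
    }
}
```

Checksumming small chunks rather than whole files means a mismatch localizes the damage to a 512-byte region, at a cost of 4/512, or about 0.78%, extra storage.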
Datanodes are responsible for verifying the data they receive before storing the data
and its checksum. This applies to data that they receive from clients and from other
datanodes during replication. A client writing data sends it to a pipeline of datanodes
(as explained in Chapter 3), and the last datanode in the pipeline verifies the checksum.
If it detects an error, the client receives a ChecksumException, a subclass of
IOException, which it should handle in an application-specific manner, by retrying the
operation, for example.
When clients read data from datanodes, they verify checksums as well, comparing them
with the ones stored at the datanode. Each datanode keeps a persistent log of checksum
verifications, so it knows the last time each of its blocks was verified. When a client
successfully verifies a block, it tells the datanode, which updates its log. Keeping
statistics such as these is valuable in detecting bad disks.
Aside from block verification on client reads, each datanode runs a DataBlockScanner
in a background thread that periodically verifies all the blocks stored on the datanode.
This is to guard against corruption due to "bit rot" in the physical storage media. See
"Datanode block scanner" on page 347 for details on how to access the scanner
reports.
Since HDFS stores replicas of blocks, it can "heal" corrupted blocks by copying one of
the good replicas to produce a new, uncorrupt replica. The way this works is that if a
client detects an error when reading a block, it reports the bad block and the datanode
it was trying to read from to the namenode before throwing a ChecksumException. The
namenode marks the block replica as corrupt, so it doesn't direct clients to it, or try to
copy this replica to another datanode. It then schedules a copy of the block to be
replicated on another datanode, so its replication factor is back at the expected level.
Once this has happened, the corrupt replica is deleted.
It is possible to disable verification of checksums by passing false to the
setVerifyChecksum() method on FileSystem, before using the open() method to read a file. The
same effect is possible from the shell by using the -ignoreCrc option with the -get or
the equivalent -copyToLocal command. This feature is useful if you have a corrupt file
that you want to inspect so you can decide what to do with it. For example, you might
want to see whether it can be salvaged before you delete it.
LocalFileSystem
The Hadoop LocalFileSystem performs client-side checksumming. This means that
when you write a file called filename, the filesystem client transparently creates a hidden
file, .filename.crc, in the same directory containing the checksums for each chunk of
the file. As in HDFS, the chunk size is controlled by the io.bytes.per.checksum property,
which defaults to 512 bytes. The chunk size is stored as metadata in the .crc file, so the
file can be read back correctly even if the setting for the chunk size has changed.
Checksums are verified when the file is read, and if an error is detected,
LocalFileSystem throws a ChecksumException.
Checksums are fairly cheap to compute (in Java, they are implemented in native code),
typically adding a few percent overhead to the time to read or write a file. For most
applications, this is an acceptable price to pay for data integrity. It is, however, possible
to disable checksums, typically when the underlying filesystem supports checksums
natively. This is accomplished by using RawLocalFileSystem in place of
LocalFileSystem. To do this globally in an application, it suffices to remap the
implementation for file URIs by setting the property fs.file.impl to the value
org.apache.hadoop.fs.RawLocalFileSystem. Alternatively, you can directly create a
RawLocalFileSystem instance, which may be useful if you want to disable checksum
verification for only some reads; for example:
Configuration conf = ...
FileSystem fs = new RawLocalFileSystem();
fs.initialize(null, conf);
ChecksumFileSystem
LocalFileSystem uses ChecksumFileSystem to do its work, and this class makes it easy
to add checksumming to other (nonchecksummed) filesystems, as
ChecksumFileSystem is just a wrapper around FileSystem. The general idiom is as follows:
FileSystem rawFs = ...
FileSystem checksummedFs = new ChecksumFileSystem(rawFs);
The underlying filesystem is called the raw filesystem, and may be retrieved using the
getRawFileSystem() method on ChecksumFileSystem. ChecksumFileSystem has a few
more useful methods for working with checksums, such as getChecksumFile() for
getting the path of a checksum file for any file. Check the documentation for the others.
If an error is detected by ChecksumFileSystem when reading a file, it will call its
reportChecksumFailure() method. The default implementation does nothing, but
LocalFileSystem moves the offending file and its checksum to a side directory on the
same device called bad_files. Administrators should periodically check for these bad
files and take action on them.
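The hidden-file naming convention mentioned earlier (a file called filename gets a sibling called .filename.crc) is easy to express in code. The sketch below uses java.nio.file.Path rather than Hadoop's Path class, and the class and method names are hypothetical; it illustrates the convention only and is not the getChecksumFile() implementation.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class ChecksumFileName {
    /** Returns the hidden sibling that would hold the checksums for the given file. */
    static Path checksumFileFor(Path file) {
        return file.resolveSibling("." + file.getFileName() + ".crc");
    }

    /** True if the path looks like a checksum file under this naming convention. */
    static boolean isChecksumFile(Path file) {
        String name = file.getFileName().toString();
        return name.startsWith(".") && name.endsWith(".crc");
    }

    public static void main(String[] args) {
        Path data = Paths.get("/data/logs/part-0");
        Path crc = checksumFileFor(data);
        System.out.println(crc.getFileName());    // prints .part-0.crc
        System.out.println(isChecksumFile(crc));  // prints true
        System.out.println(isChecksumFile(data)); // prints false
    }
}
```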
Compression
File compression brings two major benefits: it reduces the space needed to store files,
and it speeds up data transfer across the network, or to or from disk. When dealing
with large volumes of data, both of these savings can be significant, so it pays to carefully
consider how to use compression in Hadoop.
There are many different compression formats, tools and algorithms, each with
different characteristics. Table 4-1 lists some of the more common ones that can be used
with Hadoop.
Table 4-1. A summary of compression formats
Compression format   Tool    Algorithm   Filename extension   Splittable
DEFLATE [a]          N/A     DEFLATE     .deflate             No
gzip                 gzip    DEFLATE     .gz                  No
bzip2                bzip2   bzip2       .bz2                 Yes
LZO                  lzop    LZO         .lzo                 No [b]
Snappy               N/A     Snappy      .snappy              No
[a] DEFLATE is a compression algorithm whose standard implementation is zlib. There is no commonly available command-line tool for
producing files in DEFLATE format, as gzip is normally used. (Note that the gzip file format is DEFLATE with extra headers and a footer.)
The .deflate filename extension is a Hadoop convention.
[b] However, LZO files are splittable if they have been indexed in a preprocessing step. See page 91.
All compression algorithms exhibit a space/time trade-off: faster compression and
decompression speeds usually come at the expense of smaller space savings. The tools
listed in Table 4-1 typically give some control over this trade-off at compression time
by offering nine different options: -1 means optimize for speed and -9 means optimize
for space. For example, the following command creates a compressed file file.gz using
the fastest compression method:
gzip -1 file
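The same speed-versus-space setting is exposed programmatically by the JDK's DEFLATE implementation. The following sketch compresses the same input at levels 1 and 9 with java.util.zip.Deflater; it is a stand-alone illustration, not part of Hadoop, and the class name and sample data are made up. The resulting sizes depend on the input, so none are quoted here.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class CompressionLevels {
    /** Compresses the input with DEFLATE at the given level (1 = fastest, 9 = smallest). */
    static byte[] deflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decompresses data produced by deflate(). */
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unsupported input; stop rather than spin
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not valid DEFLATE data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Repetitive sample data, loosely resembling log lines. */
    static byte[] sample() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("2009-04-09 19:13 INFO record ").append(i % 17).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] input = sample();
        System.out.println("input bytes:   " + input.length);
        System.out.println("level 1 bytes: " + deflate(input, Deflater.BEST_SPEED).length);
        System.out.println("level 9 bytes: " + deflate(input, Deflater.BEST_COMPRESSION).length);
    }
}
```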
The different tools have very different compression characteristics. Gzip is a
general-purpose compressor, and sits in the middle of the space/time trade-off. Bzip2
compresses more effectively than gzip, but is slower. Bzip2's decompression speed is faster
than its compression speed, but it is still slower than the other formats. LZO and
Snappy, on the other hand, both optimize for speed and are around an order of
magnitude faster than gzip, but compress less effectively. Snappy is also significantly faster
than LZO for decompression. [1]
The "Splittable" column in Table 4-1 indicates whether the compression format
supports splitting; that is, whether you can seek to any point in the stream and start reading
from some point further on. Splittable compression formats are especially suitable for
MapReduce; see "Compression and Input Splits" on page 91 for further discussion.
1. For a comprehensive set of compression benchmarks, https://github.com/ning/jvm-compressor-benchmark
is a good reference for JVM-compatible libraries (includes some native libraries). For
command line tools, see Jeff Gilchrist's Archive Comparison Test at
http://compression.ca/act/act-summary.html.
Codecs
A codec is the implementation of a compression-decompression algorithm. In Hadoop,
a codec is represented by an implementation of the CompressionCodec interface. So, for
example, GzipCodec encapsulates the compression and decompression algorithm for
gzip. Table 4-2 lists the codecs that are available for Hadoop.
Table 4-2. Hadoop compression codecs
Compression format   Hadoop CompressionCodec
DEFLATE              org.apache.hadoop.io.compress.DefaultCodec
gzip                 org.apache.hadoop.io.compress.GzipCodec
bzip2                org.apache.hadoop.io.compress.BZip2Codec
LZO                  com.hadoop.compression.lzo.LzopCodec
Snappy               org.apache.hadoop.io.compress.SnappyCodec
The LZO libraries are GPL-licensed and may not be included in Apache distributions,
so for this reason the Hadoop codecs must be downloaded separately from
http://code.google.com/p/hadoop-gpl-compression/ (or http://github.com/kevinweil/hadoop-lzo,
which includes bugfixes and more tools). The LzopCodec is compatible with the lzop
tool, which is essentially the LZO format with extra headers, and is the one you
normally want. There is also a LzoCodec for the pure LZO format, which uses the
.lzo_deflate filename extension (by analogy with DEFLATE, which is gzip without the
headers).
Compressing and decompressing streams with CompressionCodec
CompressionCodec has two methods that allow you to easily compress or decompress
data. To compress data being written to an output stream, use the
createOutputStream(OutputStream out) method to create a CompressionOutputStream to which you
write your uncompressed data to have it written in compressed form to the underlying
stream. Conversely, to decompress data being read from an input stream, call
createInputStream(InputStream in) to obtain a CompressionInputStream, which allows
you to read uncompressed data from the underlying stream.
CompressionOutputStream and CompressionInputStream are similar to
java.util.zip.DeflaterOutputStream and java.util.zip.DeflaterInputStream, except
that both of the former provide the ability to reset their underlying compressor or
decompressor, which is important for applications that compress sections of the data
stream as separate blocks, such as SequenceFile, described in "SequenceFile"
on page 132.
Example 4-1 illustrates how to use the API to compress data read from standard input
and write it to standard output.
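Example 4-1. A program to compress data read from standard input and write it to standard output
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class StreamCompressor {
  public static void main(String[] args) throws Exception {
    // Fully qualified codec class name, e.g. org.apache.hadoop.io.compress.GzipCodec
    String codecClassname = args[0];
    Class<?> codecClass = Class.forName(codecClassname);
    Configuration conf = new Configuration();
    CompressionCodec codec = (CompressionCodec)
      ReflectionUtils.newInstance(codecClass, conf);

    // Wrap standard output so that everything written to it is compressed
    CompressionOutputStream out = codec.createOutputStream(System.out);
    IOUtils.copyBytes(System.in, out, 4096, false);
    out.finish(); // flush the compressor without closing System.out
  }
}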
The application expects the fully qualified name of the CompressionCodec
implementation as the first command-line argument. We use ReflectionUtils to construct a new
instance of the codec, then obtain a compression wrapper around System.out. Then we
call the utility method copyBytes() on IOUtils to copy the input to the output, which
is compressed by the CompressionOutputStream. Finally, we call finish() on
CompressionOutputStream, which tells the compressor to finish writing to the
compressed stream, but doesn't close the stream. We can try it out with the following
command line, which compresses the string "Text" using the StreamCompressor
program with the GzipCodec, then decompresses it from standard input using gunzip:
% echo "Text" | hadoop StreamCompressor org.apache.hadoop.io.compress.GzipCodec \
| gunzip -
Text
Inferring CompressionCodecs using CompressionCodecFactory
If you are reading a compressed file, you can normally infer the codec to use by looking
at its filename extension. A file ending in .gz can be read with GzipCodec, and so on.
The extension for each compression format is listed in Table 4-1.
CompressionCodecFactory provides a way of mapping a filename extension to a
CompressionCodec using its getCodec() method, which takes a Path object for the file in
question. Example 4-2 shows an application that uses this feature to decompress files.
Example 4-2. A program to decompress a compressed file using a codec inferred from the file's
extension
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class FileDecompressor {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    // Look up the codec from the file's extension
    Path inputPath = new Path(uri);
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(inputPath);
    if (codec == null) {
      System.err.println("No codec found for " + uri);
      System.exit(1);
    }

    // The output name is the input name with the codec's extension removed
    String outputUri =
      CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

    InputStream in = null;
    OutputStream out = null;
    try {
      in = codec.createInputStream(fs.open(inputPath));
      out = fs.create(new Path(outputUri));
      IOUtils.copyBytes(in, out, conf);
    } finally {
      IOUtils.closeStream(in);
      IOUtils.closeStream(out);
    }
  }
}
Once the codec has been found, it is used to strip off the file suffix to form the output
filename (via the removeSuffix() static method of CompressionCodecFactory). In this
way, a file named file.gz is decompressed to file by invoking the program as follows:
% hadoop FileDecompressor file.gz
CompressionCodecFactory finds codecs from a list defined by the
io.compression.codecs configuration property. By default, this lists all the codecs
provided by Hadoop (see Table 4-3), so you would need to alter it only if you have a custom
codec that you wish to register (such as the externally hosted LZO codecs). Each codec
knows its default filename extension, thus permitting CompressionCodecFactory to
search through the registered codecs to find a match for a given extension (if any).
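The extension-matching idea can be sketched without Hadoop on the classpath. The class below is a hypothetical stand-in that maps the extensions from Table 4-1 to the codec class names from Table 4-2 and strips a suffix in the manner the text describes for removeSuffix(); it is not the CompressionCodecFactory implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CodecLookup {
    // Registered codecs keyed by default filename extension (values are class names only).
    private static final Map<String, String> BY_EXTENSION = new LinkedHashMap<>();
    static {
        BY_EXTENSION.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
        BY_EXTENSION.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        BY_EXTENSION.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
    }

    /** Returns the codec class name registered for the file's extension, or null if none matches. */
    static String codecFor(String filename) {
        for (Map.Entry<String, String> entry : BY_EXTENSION.entrySet()) {
            if (filename.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    /** Strips the suffix if present; otherwise returns the filename unchanged. */
    static String removeSuffix(String filename, String suffix) {
        return filename.endsWith(suffix)
                ? filename.substring(0, filename.length() - suffix.length())
                : filename;
    }

    public static void main(String[] args) {
        System.out.println(codecFor("file.gz"));              // prints org.apache.hadoop.io.compress.GzipCodec
        System.out.println(codecFor("file.txt"));             // prints null
        System.out.println(removeSuffix("file.gz", ".gz"));   // prints file
    }
}
```

Registering a custom codec corresponds to adding another entry to the map, which is what extending io.compression.codecs achieves in the real factory.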
Table 4-3. Compression codec properties
Property name:   io.compression.codecs
Type:            comma-separated Class names
Default value:   org.apache.hadoop.io.compress.DefaultCodec,
                 org.apache.hadoop.io.compress.GzipCodec,
                 org.apache.hadoop.io.compress.BZip2Codec
Description:     A list of the CompressionCodec classes for compression/decompression.
Native libraries
For performance, it is preferable to use a native library for compression and
decompression. For example, in one test, using the native gzip libraries reduced
decompression times by up to 50% and compression times by around 10% (compared to
the built-in Java implementation). Table 4-4 shows the availability of Java and native
implementations for each compression format. Not all formats have native
implementations (bzip2, for example), whereas others are only available as a native
implementation (LZO, for example).
Table 4-4. Compression library implementations
Compression format   Java implementation   Native implementation
DEFLATE              Yes                   Yes
gzip                 Yes                   Yes
bzip2                Yes                   No
LZO                  No                    Yes
Hadoop comes with prebuilt native compression libraries for 32- and 64-bit Linux,
which you can find in the lib/native directory. For other platforms, you will need to
compile the libraries yourself, following the instructions on the Hadoop wiki at
http://wiki.apache.org/hadoop/NativeHadoop.
The native libraries are picked up using the Java system property java.library.path.
The hadoop script in the bin directory sets this property for you, but if you don't use
this script, you will need to set the property in your application.
By default, Hadoop looks for native libraries for the platform it is running on, and loads
them automatically if they are found. This means you don't have to change any
configuration settings to use the native libraries. In some circumstances, however, you may
wish to disable use of native libraries, such as when you are debugging a
compression-related problem. You can achieve this by setting the property hadoop.native.lib to
false, which ensures that the built-in Java equivalents will be used (if they are available).
If you are using a native library and you are doing a lot of compression or
decompression in your application, consider using CodecPool, which allows you to
reuse compressors and decompressors, thereby amortizing the cost of creating these
objects.
The code in Example 4-3 shows the API, although in this program, which only creates
a single Compressor, there is really no need to use a pool.
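Example 4-3. A program to compress data read from standard input and write it to standard output
using a pooled compressor
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.util.ReflectionUtils;

public class PooledStreamCompressor {
  public static void main(String[] args) throws Exception {
    String codecClassname = args[0];
    Class<?> codecClass = Class.forName(codecClassname);
    Configuration conf = new Configuration();
    CompressionCodec codec = (CompressionCodec)
      ReflectionUtils.newInstance(codecClass, conf);
    Compressor compressor = null;
    try {
      // Borrow a compressor from the pool instead of creating a new one
      compressor = CodecPool.getCompressor(codec);
      CompressionOutputStream out =
        codec.createOutputStream(System.out, compressor);
      IOUtils.copyBytes(System.in, out, 4096, false);
      out.finish();
    } finally {
      // Always hand the compressor back, even if the copy failed
      CodecPool.returnCompressor(compressor);
    }
  }
}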
We retrieve a Compressor instance from the pool for a given CompressionCodec, which
we use in the codec's overloaded createOutputStream() method. By using a finally
block, we ensure that the compressor is returned to the pool even if there is an
IOException while copying the bytes between the streams.
Compression and Input Splits
When considering how to compress data that will be processed by MapReduce, it is
important to understand whether the compression format supports splitting. Consider
an uncompressed file stored in HDFS whose size is 1 GB. With an HDFS block size of
64 MB, the file will be stored as 16 blocks, and a MapReduce job using this file as input
will create 16 input splits, each processed independently as input to a separate map task.
Imagine now the file is a gzip-compressed file whose compressed size is 1 GB. As before,
HDFS will store the file as 16 blocks. However, creating a split for each block won't
work since it is impossible to start reading at an arbitrary point in the gzip stream, and
therefore impossible for a map task to read its split independently of the others. The
gzip format uses DEFLATE to store the compressed data, and DEFLATE stores data
as a series of compressed blocks. The problem is that the start of each block is not
distinguished in any way that would allow a reader positioned at an arbitrary point in
the stream to advance to the beginning of the next block, thereby synchronizing itself
with the stream. For this reason, gzip does not support splitting.
In this case, MapReduce will do the right thing and not try to split the gzipped file,
since it knows that the input is gzip-compressed (by looking at the filename extension)
and that gzip does not support splitting. This will work, but at the expense of locality:
a single map will process the 16 HDFS blocks, most of which will not be local to the
map. Also, with fewer maps, the job is less granular, and so may take longer to run.
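The effect on the number of map tasks can be written as a one-line calculation. This sketch assumes one split per HDFS block, as in the example above (real input formats can be configured with other split sizes), and the class and method names are hypothetical.

```java
class SplitCount {
    /**
     * Number of input splits for a file: one per block when the format is
     * splittable, otherwise a single split covering the whole file.
     */
    static long numSplits(long fileBytes, long blockBytes, boolean splittable) {
        if (!splittable) {
            return 1;
        }
        return Math.max(1, (fileBytes + blockBytes - 1) / blockBytes); // ceiling division
    }

    public static void main(String[] args) {
        long gb = 1L << 30;
        long mb = 1L << 20;
        System.out.println(numSplits(1 * gb, 64 * mb, true));  // prints 16 (uncompressed)
        System.out.println(numSplits(1 * gb, 64 * mb, false)); // prints 1  (gzip)
    }
}
```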
If the file in our hypothetical example were an LZO file, we would have the same
problem since the underlying compression format does not provide a way for a reader
to synchronize itself with the stream. However, it is possible to preprocess LZO files
using an indexer tool that comes with the Hadoop LZO libraries, which you can obtain
from the site listed in "Codecs" on page 87. The tool builds an index of split points,
effectively making them splittable when the appropriate MapReduce input format is
used.
A bzip2 file, on the other hand, does provide a synchronization marker between blocks
(a 48-bit approximation of pi), so it does support splitting. (Table 4-1 lists whether
each compression format supports splitting.)
Which Compression Format Should I Use?
Which compression format you should use depends on your application. Do you want
to maximize the speed of your application or are you more concerned about keeping
storage costs down? In general, you should try different strategies for your application,
and benchmark them with representative datasets to find the best approach.
For large, unbounded files, like logfiles, the options are:
• Store the files uncompressed.
• Use a compression format that supports splitting, like bzip2 (although bzip2 is
  fairly slow), or one that can be indexed to support splitting, like LZO.
• Split the file into chunks in the application and compress each chunk separately
  using any supported compression format (it doesn't matter whether it is splittable).
  In this case, you should choose the chunk size so that the compressed chunks are
  approximately the size of an HDFS block.
• Use Sequence File, which supports compression and splitting. See "SequenceFile"
  on page 132.
• Use an Avro data file, which supports compression and splitting, just like Sequence
  File, but has the added advantage of being readable and writable from many
  languages, not just Java. See "Avro data files" on page 119.
For large files, you should not use a compression format that does not support splitting
on the whole file, since you lose locality and make MapReduce applications very
inefficient.
For archival purposes, consider the Hadoop archive format (see "Hadoop Archives"
on page 78), although it does not support compression.
Using Compression in MapReduce
As described in "Inferring CompressionCodecs using CompressionCodecFactory"
on page 88, if your input files are compressed, they will be automatically
decompressed as they are read by MapReduce, using the filename extension to determine
the codec to use.
To compress the output of a MapReduce job, in the job configuration, set the
mapred.output.compress property to true and the mapred.output.compression.codec
property to the classname of the compression codec you want to use. Alternatively, you
can use the static convenience methods on FileOutputFormat to set these properties as
shown in Example 4-4.
92 | Chapter 4: Hadoop I/O
Example 4-4. Application to run the maximum temperature job producing compressed output
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureWithCompression {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperatureWithCompression <input path> " +
          "<output path>");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Compress the job output with gzip
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
We run the program over compressed input (which doesn't have to use the same
compression format as the output, although it does in this example) as follows:
% hadoop MaxTemperatureWithCompression input/ncdc/sample.txt.gz output
Each part of the final output is compressed; in this case, there is a single part:
% gunzip -c output/part-r-00000.gz
1949 111
1950 22
If you are emitting sequence files for your output, then you can set the
mapred.output.compression.type property to control the type of compression to use. The default
is RECORD, which compresses individual records. Changing this to BLOCK, which
compresses groups of records, is recommended since it compresses better (see "The
SequenceFile format" on page 138).
There is also a static convenience method on SequenceFileOutputFormat called
setOutputCompressionType() to set this property.
The configuration properties to set compression for MapReduce job outputs are
summarized in Table 4-5. If your MapReduce driver uses the Tool interface (described in
"GenericOptionsParser, Tool, and ToolRunner" on page 151), then you can pass any
of these properties to the program on the command line, which may be more convenient
than modifying your program to hard code the compression to use.
Table 4-5. MapReduce compression properties
Property name                    Type        Default value                               Description
mapred.output.compress           boolean     false                                       Compress outputs.
mapred.output.compression.codec  Class name  org.apache.hadoop.io.compress.DefaultCodec  The compression codec to use for outputs.
mapred.output.compression.type   String      RECORD                                      The type of compression to use for SequenceFile outputs: NONE, RECORD, or BLOCK.
Compressing map output
Even if your MapReduce application reads and writes uncompressed data, it may benefit
from compressing the intermediate output of the map phase. Since the map output
is written to disk and transferred across the network to the reducer nodes, by using a
fast compressor such as LZO or Snappy, you can get performance gains simply because
the volume of data to transfer is reduced. The configuration properties to enable
compression for map outputs and to set the compression format are shown in Table 4-6.
Table 4-6. Map output compression properties
Property name                        Type     Default value                               Description
mapred.compress.map.output           boolean  false                                       Compress map outputs.
mapred.map.output.compression.codec  Class    org.apache.hadoop.io.compress.DefaultCodec  The compression codec to use for map outputs.
Here are the lines to add to enable gzip map output compression in your job:
Configuration conf = new Configuration();
conf.setBoolean("mapred.compress.map.output", true);
conf.setClass("mapred.map.output.compression.codec", GzipCodec.class,
    CompressionCodec.class);
Job job = new Job(conf);
Serialization
Serialization is the process of turning structured objects into a byte stream for
transmission over a network or for writing to persistent storage. Deserialization is the reverse
process of turning a byte stream back into a series of structured objects.
Serialization appears in two quite distinct areas of distributed data processing: for
interprocess communication and for persistent storage.
In Hadoop, interprocess communication between nodes in the system is implemented
using remote procedure calls (RPCs). The RPC protocol uses serialization to render the
message into a binary stream to be sent to the remote node, which then deserializes the
binary stream into the original message. In general, it is desirable that an RPC
serialization format is:
Compact
  A compact format makes the best use of network bandwidth, which is the most
  scarce resource in a data center.
Fast
  Interprocess communication forms the backbone for a distributed system, so it is
  essential that there is as little performance overhead as possible for the serialization
  and deserialization process.
Extensible
  Protocols change over time to meet new requirements, so it should be
  straightforward to evolve the protocol in a controlled manner for clients and
  servers. For example, it should be possible to add a new argument to a method
  call, and have the new servers accept messages in the old format (without the new
  argument) from old clients.
Interoperable
  For some systems, it is desirable to be able to support clients that are written in
  different languages to the server, so the format needs to be designed to make this
  possible.
On the face of it, the data format chosen for persistent storage would have different
requirements from a serialization framework. After all, the lifespan of an RPC is less
than a second, whereas persistent data may be read years after it was written. As it turns
out, the four desirable properties of an RPC's serialization format are also crucial for a
persistent storage format. We want the storage format to be compact (to make efficient
use of storage space), fast (so the overhead in reading or writing terabytes of data is
minimal), extensible (so we can transparently read data written in an older format),
and interoperable (so we can read or write persistent data using different languages).
Hadoop uses its own serialization format, Writables, which is certainly compact and
fast, but not so easy to extend or use from languages other than Java. Since Writables
are central to Hadoop (most MapReduce programs use them for their key and value
types), we look at them in some depth in the next three sections, before looking at
serialization frameworks in general, and then Avro (a serialization system that was
designed to overcome some of the limitations of Writables) in more detail.
The Writable Interface
The Writable interface defines two methods: one for writing its state to a DataOutput
binary stream, and one for reading its state from a DataInput binary stream:
Serialization | 95
package org.apache.hadoop.io;

import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;

public interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}
Let's look at a particular Writable to see what we can do with it. We will use
IntWritable, a wrapper for a Java int. We can create one and set its value using the
set() method:
IntWritable writable = new IntWritable();
writable.set(163);
Equivalently, we can use the constructor that takes the integer value:
IntWritable writable = new IntWritable(163);
To examine the serialized form of the IntWritable, we write a small helper method that
wraps a java.io.ByteArrayOutputStream in a java.io.DataOutputStream (an implementation
of java.io.DataOutput) to capture the bytes in the serialized stream:
public static byte[] serialize(Writable writable) throws IOException {
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  DataOutputStream dataOut = new DataOutputStream(out);
  writable.write(dataOut);
  dataOut.close();
  return out.toByteArray();
}
An integer is written using four bytes (as we see using JUnit 4 assertions):
byte[] bytes = serialize(writable);
assertThat(bytes.length, is(4));
The bytes are written in big-endian order (so the most significant byte is written to the
stream first; this is dictated by the java.io.DataOutput interface), and we can see their
hexadecimal representation by using a method on Hadoop's StringUtils:
assertThat(StringUtils.byteToHexString(bytes), is("000000a3"));
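The same byte layout can be reproduced without Hadoop on the classpath. The sketch below assumes that an IntWritable serializes itself with a single DataOutput.writeInt() call; BigEndianIntSketch and its helper methods are hypothetical names used only for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class BigEndianIntSketch {

    /** Serializes an int the way java.io.DataOutput specifies: four bytes, big-endian. */
    static byte[] writeInt(int value) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DataOutputStream dataOut = new DataOutputStream(out);
            dataOut.writeInt(value);
            dataOut.close();
            return out.toByteArray();
        } catch (IOException e) {
            // An in-memory stream should not fail; rethrow unchecked just in case.
            throw new UncheckedIOException(e);
        }
    }

    /** Lowercase hex rendering, two digits per byte. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 163 is 0xa3, so the most significant three bytes are zero.
        System.out.println(toHex(writeInt(163))); // prints 000000a3
    }
}
```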
Let's try deserialization. Again, we create a helper method to read a Writable object
from a byte array:
public static byte[] deserialize(Writable writable, byte[] bytes)
    throws IOException {
  ByteArrayInputStream in = new ByteArrayInputStream(bytes);
  DataInputStream dataIn = new DataInputStream(in);
  writable.readFields(dataIn);
  dataIn.close();
  return bytes;
}
We construct a new, value-less, IntWritable, then call deserialize() to read from the
output data that we just wrote. Then we check that its value, retrieved using the
get() method, is the original value, 163:
IntWritable newWritable = new IntWritable();
deserialize(newWritable, bytes);
assertThat(newWritable.get(), is(163));
WritableComparable and comparators
IntWritable implements the WritableComparable interface, which is just a subinterface
of the Writable and java.lang.Comparable interfaces:
package org.apache.hadoop.io;

public interface WritableComparable<T> extends Writable, Comparable<T> {
}
Comparison of types is crucial for MapReduce, where there is a sorting phase during
which keys are compared with one another. One optimization that Hadoop provides
is the RawComparator extension of Java's Comparator:
package org.apache.hadoop.io;

import java.util.Comparator;

public interface RawComparator<T> extends Comparator<T> {

  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);

}
This interface permits implementors to compare records read from a stream without
deserializing them into objects, thereby avoiding any overhead of object creation. For
example, the comparator for IntWritables implements the raw compare() method by
reading an integer from each of the byte arrays b1 and b2 and comparing them directly,
from the given start positions (s1 and s2) and lengths (l1 and l2).
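To make the idea concrete, here is a minimal sketch of a raw comparison over serialized integers, written in plain Java rather than against Hadoop's classes. RawIntCompareSketch is a hypothetical name, and the sketch assumes each value was written as a four-byte big-endian int, as described earlier.

```java
import java.nio.ByteBuffer;

final class RawIntCompareSketch {

    /** Reads a four-byte big-endian int that starts at offset s. */
    static int readInt(byte[] b, int s) {
        return ((b[s] & 0xff) << 24)
             | ((b[s + 1] & 0xff) << 16)
             | ((b[s + 2] & 0xff) << 8)
             |  (b[s + 3] & 0xff);
    }

    /**
     * Compares two serialized ints in place. The lengths l1 and l2 are
     * unused here because an int always occupies exactly four bytes.
     */
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    /** Helper for the demo: ByteBuffer writes big-endian by default. */
    static byte[] serialize(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    public static void main(String[] args) {
        byte[] b1 = serialize(163);
        byte[] b2 = serialize(67);
        // Positive result: 163 sorts after 67, and no wrapper objects were created.
        System.out.println(compare(b1, 0, b1.length, b2, 0, b2.length) > 0);
    }
}
```

No object is allocated per comparison, which is where the saving comes from during a sort over many keys.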
WritableComparator is a general-purpose implementation of RawComparator for
WritableComparable classes. It provides two main functions. First, it provides a default
implementation of the raw compare() method that deserializes the objects to be compared
from the stream and invokes the object compare() method. Second, it acts as a
factory for RawComparator instances (that Writable implementations have registered).
For example, to obtain a comparator for IntWritable, we just use:
RawComparator<IntWritable> comparator = WritableComparator.get(IntWritable.class);
The comparator can be used to compare two IntWritable objects:
IntWritable w1 = new IntWritable(163);
IntWritable w2 = new IntWritable(67);
assertThat(comparator.compare(w1, w2), greaterThan(0));
or their serialized representations:
byte[] b1 = serialize(w1);
byte[] b2 = serialize(w2);
assertThat(comparator.compare(b1, 0, b1.length, b2, 0, b2.length),
    greaterThan(0));
Writable Classes
Hadoop comes with a large selection of Writable classes in the org.apache.hadoop.io
package. They form the class hierarchy shown in Figure 4-1.
Writable wrappers for Java primitives
There are Writable wrappers for all the Java primitive types (see Table 4-7) except
char (which can be stored in an IntWritable). All have a get() and a set() method for
retrieving and storing the wrapped value.
Figure 4-1. Writable class hierarchy
Table 4-7. Writable wrapper classes for Java primitives
Java primitive  Writable implementation  Serialized size (bytes)
boolean         BooleanWritable          1
byte            ByteWritable             1
short           ShortWritable            2
int             IntWritable              4
                VIntWritable             1-5
float           FloatWritable            4
long            LongWritable             8
                VLongWritable            1-9
double          DoubleWritable           8
When it comes to encoding integers, there is a choice between the fixed-length formats
(IntWritable and LongWritable) and the variable-length formats (VIntWritable and
VLongWritable). The variable-length formats use only a single byte to encode the value
if it is small enough (between -112 and 127, inclusive); otherwise, they use the first
byte to indicate whether the value is positive or negative, and how many bytes follow.
For example, 163 requires two bytes:
byte[] data = serialize(new VIntWritable(163));
assertThat(StringUtils.byteToHexString(data), is("8fa3"));
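The following sketch re-implements the variable-length scheme in plain Java, with no Hadoop dependency, to show where the marker byte comes from. VLongSketch and its method names are hypothetical, and the marker arithmetic (bases of -112 for positive and -120 for negative values) reflects our reading of the format rather than a copy of Hadoop's code, so treat it as illustrative.

```java
import java.io.ByteArrayOutputStream;

final class VLongSketch {

    /** Encodes a long using a single byte where possible, else a marker byte plus payload. */
    static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (value >= -112 && value <= 127) {
            out.write((int) value);       // small values are stored as themselves
            return out.toByteArray();
        }
        int marker = -112;                // assumed marker base for positive values
        if (value < 0) {
            value = ~value;               // one's complement, so the payload is non-negative
            marker = -120;                // assumed marker base for negative values
        }
        for (long tmp = value; tmp != 0; tmp >>= 8) {
            marker--;                     // one step per payload byte
        }
        out.write(marker);
        int payloadBytes = (marker < -120) ? -(marker + 120) : -(marker + 112);
        for (int idx = payloadBytes; idx != 0; idx--) {
            int shift = (idx - 1) * 8;    // most significant payload byte first
            out.write((int) ((value >> shift) & 0xff));
        }
        return out.toByteArray();
    }

    /** Lowercase hex rendering, two digits per byte. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encode(163)));            // prints 8fa3, as in the assertion above
        System.out.println(encode(127).length);            // prints 1
        System.out.println(encode(Long.MAX_VALUE).length); // prints 9
    }
}
```

For 163, one payload byte is needed, so the marker is -113 (0x8f), followed by 0xa3. The same routine serves int and long values alike, which is consistent with the two variable-length types sharing an encoding.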
How do you choose between a fixed-length and a variable-length encoding? Fixed-length
encodings are good when the distribution of values is fairly uniform across the
whole value space, such as a (well-designed) hash function. Most numeric variables
tend to have nonuniform distributions, and on average the variable-length encoding
will save space. Another advantage of variable-length encodings is that you can switch
from VIntWritable to VLongWritable, since their encodings are actually the same. So by
choosing a variable-length representation, you have room to grow without committing
to an 8-byte long representation from the beginning.
Text
Text is a Writable for UTF-8 sequences. It can be thought of as the Writable equivalent
of java.lang.String. Text is a replacement for the UTF8 class, which was deprecated
because it didn't support strings whose encoding was over 32,767 bytes, and because
it used Java's modified UTF-8.
The Text class uses an int (with a variable-length encoding) to store the number of
bytes in the string encoding, so the maximum value is 2 GB. Furthermore, Text uses
standard UTF-8, which makes it potentially easier to interoperate with other tools that
understand UTF-8.
Indexing. Because of its emphasis on using standard UTF-8, there are some differences
between Text and the Java String class. Indexing for the Text class is in terms of position
in the encoded byte sequence, not the Unicode character in the string, or the Java
char code unit (as it is for String). For ASCII strings, these three concepts of index
position coincide. Here is an example to demonstrate the use of the charAt() method:
Text t = new Text("hadoop");
assertThat(t.getLength(), is(6));
assertThat(t.getBytes().length, is(6));

assertThat(t.charAt(2), is((int) 'd'));
assertThat("Out of bounds", t.charAt(100), is(-1));
Notice that charAt() returns an int representing a Unicode code point, unlike the
String variant that returns a char. Text also has a find() method, which is analogous
to String's indexOf():
Text t = new Text("hadoop");
assertThat("Find a substring", t.find("do"), is(2));
assertThat("Finds first 'o'", t.find("o"), is(3));
assertThat("Finds 'o' from position 4 or later", t.find("o", 4), is(4));
assertThat("No match", t.find("pig"), is(-1));
Unicode. When we start using characters that are encoded with more than a single byte,
the differences between Text and String become clear. Consider the Unicode characters
shown in Table 4-8. [2]
Table 4-8. Unicode characters
Unicode code point   U+0041                  U+00DF                      U+6771                         U+10400
Name                 LATIN CAPITAL LETTER A  LATIN SMALL LETTER SHARP S  N/A (a unified Han ideograph)  DESERET CAPITAL LETTER LONG I
UTF-8 code units     41                      c3 9f                       e6 9d b1                       f0 90 90 80
Java representation  \u0041                  \u00DF                      \u6771                         \uD801\uDC00
All but the last character in the table, U+10400, can be expressed using a single Java
char. U+10400 is a supplementary character and is represented by two Java chars,
known as a surrogate pair. The tests in Example 4-5 show the differences between
String and Text when processing a string of the four characters from Table 4-8.
Example 4-5. Tests showing the differences between the String and Text classes
import static org.hamcrest.CoreMatchers.is;
import static org.junit.Assert.assertThat;

import java.io.UnsupportedEncodingException;
import org.apache.hadoop.io.Text;
import org.junit.Test;

public class StringTextComparisonTest {

  @Test
  public void string() throws UnsupportedEncodingException {

    String s = "\u0041\u00DF\u6771\uD801\uDC00";
    assertThat(s.length(), is(5));
    assertThat(s.getBytes("UTF-8").length, is(10));

    assertThat(s.indexOf("\u0041"), is(0));
    assertThat(s.indexOf("\u00DF"), is(1));
    assertThat(s.indexOf("\u6771"), is(2));
    assertThat(s.indexOf("\uD801\uDC00"), is(3));

    assertThat(s.charAt(0), is('\u0041'));
    assertThat(s.charAt(1), is('\u00DF'));
    assertThat(s.charAt(2), is('\u6771'));
    assertThat(s.charAt(3), is('\uD801'));
    assertThat(s.charAt(4), is('\uDC00'));

    assertThat(s.codePointAt(0), is(0x0041));
    assertThat(s.codePointAt(1), is(0x00DF));
    assertThat(s.codePointAt(2), is(0x6771));
    assertThat(s.codePointAt(3), is(0x10400));
  }

  @Test
  public void text() {

    Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");
    assertThat(t.getLength(), is(10));

    assertThat(t.find("\u0041"), is(0));
    assertThat(t.find("\u00DF"), is(1));
    assertThat(t.find("\u6771"), is(3));
    assertThat(t.find("\uD801\uDC00"), is(6));

    assertThat(t.charAt(0), is(0x0041));
    assertThat(t.charAt(1), is(0x00DF));
    assertThat(t.charAt(3), is(0x6771));
    assertThat(t.charAt(6), is(0x10400));
  }
}
2. This example is based on one from the article "Supplementary Characters in the Java Platform."
The test confirms that the length of a String is the number of char code units it contains
(5, one from each of the first three characters in the string, and a surrogate pair from
the last), whereas the length of a Text object is the number of bytes in its UTF-8 encoding
(10 = 1+2+3+4). Similarly, the indexOf() method in String returns an index in char
code units, and find() for Text is a byte offset.
The charAt() method in String returns the char code unit for the given index, which
in the case of a surrogate pair will not represent a whole Unicode character. The
codePointAt() method, indexed by char code unit, is needed to retrieve a single Unicode
character represented as an int. In fact, the charAt() method in Text is more like the
codePointAt() method than its namesake in String. The only difference is that it is
indexed by byte offset.
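The relationship between char indexes and UTF-8 byte offsets can be checked with the JDK alone. The sketch below uses only java.lang.String; Utf8OffsetSketch and byteOffsets() are hypothetical names, and nothing here calls Hadoop's Text class.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8OffsetSketch {

    /** Returns the UTF-8 byte offset at which each code point of s begins. */
    static int[] byteOffsets(String s) {
        int[] offsets = new int[s.codePointCount(0, s.length())];
        int byteOffset = 0;
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            offsets[n++] = byteOffset;
            // Encode this one code point to find out how many bytes it occupies.
            byteOffset += new String(Character.toChars(cp))
                .getBytes(StandardCharsets.UTF_8).length;
            // Step over one or two chars, depending on whether cp is supplementary.
            i += Character.charCount(cp);
        }
        return offsets;
    }

    public static void main(String[] args) {
        String s = "\u0041\u00DF\u6771\uD801\uDC00";
        System.out.println(s.length());                                // prints 5
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // prints 10
        System.out.println(Arrays.toString(byteOffsets(s)));           // prints [0, 1, 3, 6]
    }
}
```

The byte offsets 0, 1, 3, and 6 are the positions used with find() and charAt() on the Text object in the test, whereas the String indexes for the same characters are 0, 1, 2, and 3.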
Iteration. Iterating over the Unicode characters in Text is complicated by the use of byte
offsets for indexing, since you can't just increment the index. The idiom for iteration
is a little obscure (see Example 4-6): turn the Text object into a java.nio.ByteBuffer,
then repeatedly call the bytesToCodePoint() static method on Text with the buffer. This
method extracts the next code point as an int and updates the position in the buffer.
The end of the string is detected when bytesToCodePoint() returns -1.
Example 4-6. Iterating over the characters in a Text object
import java.nio.ByteBuffer;
import org.apache.hadoop.io.Text;

public class TextIterator {

  public static void main(String[] args) {
    Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");

    // Wrap only the valid bytes; the backing array may be longer than getLength()
    ByteBuffer buf = ByteBuffer.wrap(t.getBytes(), 0, t.getLength());
    int cp;
    while (buf.hasRemaining() && (cp = Text.bytesToCodePoint(buf)) != -1) {
      System.out.println(Integer.toHexString(cp));
    }
  }
}
Running the program prints the code points for the four characters in the string:
% hadoop TextIterator
41
df
6771
10400
Mutability. Another difference with String is that Text is mutable (like all Writable
implementations in Hadoop, except NullWritable, which is a singleton). You can reuse a
Text instance by calling one of the set() methods on it. For example:
Text t = new Text("hadoop");
t.set("pig");
assertThat(t.getLength(), is(3));
assertThat(t.getBytes().length, is(3));
In some situations, the byte array returned by the getBytes() method
may be longer than the length returned by getLength():
Text t = new Text("hadoop");
t.set(new Text("pig"));
assertThat(t.getLength(), is(3));
assertThat("Byte length not shortened", t.getBytes().length,
    is(6));
This shows why it is imperative that you always call getLength() when
calling getBytes(), so you know how much of the byte array is valid data.
Resorting to String. Text doesn't have as rich an API for manipulating strings as
java.lang.String, so in many cases, you need to convert the Text object to a String.
This is done in the usual way, using the toString() method:
assertThat(new Text("hadoop").toString(), is("hadoop"));
BytesWritable
BytesWritable is a wrapper for an array of binary data. Its serialized format is an integer
field (4 bytes) that specifies the number of bytes to follow, followed by the bytes
themselves. For example, the byte array of length two with values 3 and 5 is serialized as a
4-byte integer (00000002) followed by the two bytes from the array (03 and 05):
BytesWritable b = new BytesWritable(new byte[] { 3, 5 });
byte[] bytes = serialize(b);
assertThat(StringUtils.byteToHexString(bytes), is("000000020305"));
BytesWritable is mutable, and its value may be changed by calling its set() method.
As with Text, the size of the byte array returned from the getBytes() method for
BytesWritable (the capacity) may not reflect the actual size of the data stored in the
BytesWritable. You can determine the size of the BytesWritable by calling
getLength(). To demonstrate:
b.setCapacity(11);
assertThat(b.getLength(), is(2));
assertThat(b.getBytes().length, is(11));
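The distinction between capacity and length is easy to get wrong, so here is a small stand-alone sketch of the pattern. GrowOnlyBuffer is a hypothetical class written for illustration; it mimics the behavior described above (a backing array that grows but is not shrunk on reuse) and is not how Hadoop's classes are necessarily implemented.

```java
import java.util.Arrays;

final class GrowOnlyBuffer {
    private byte[] bytes = new byte[0];
    private int length;

    /** Copies src into the buffer, growing the backing array only when needed. */
    void set(byte[] src) {
        if (src.length > bytes.length) {
            bytes = new byte[src.length];
        }
        System.arraycopy(src, 0, bytes, 0, src.length);
        length = src.length;
    }

    /** The raw backing array; only the first getLength() bytes are valid. */
    byte[] getBytes() {
        return bytes;
    }

    int getLength() {
        return length;
    }

    /** A safe copy that contains just the valid bytes. */
    byte[] copyValid() {
        return Arrays.copyOf(bytes, length);
    }

    public static void main(String[] args) {
        GrowOnlyBuffer buf = new GrowOnlyBuffer();
        buf.set(new byte[] {1, 2, 3, 4, 5, 6});
        buf.set(new byte[] {7, 8, 9});
        System.out.println(buf.getLength());                  // prints 3
        System.out.println(buf.getBytes().length);            // prints 6
        System.out.println(Arrays.toString(buf.copyValid())); // prints [7, 8, 9]
    }
}
```

Any code that consumes the result of getBytes() should therefore bound its reads by getLength(), or take a trimmed copy as copyValid() does here.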
NullWritable
NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes
are written to, or read from, the stream. It is used as a placeholder; for example, in
MapReduce, a key or a value can be declared as a NullWritable when you don't need
to use that position; it effectively stores a constant empty value. NullWritable can also
be useful as a key in SequenceFile when you want to store a list of values, as opposed
to key-value pairs. It is an immutable singleton: the instance can be retrieved by calling
NullWritable.get().
ObjectWritable and GenericWritable
ObjectWritable is a general-purpose wrapper for the following: Java primitives, String,
enum, Writable, null, or arrays of any of these types. It is used in Hadoop RPC to marshal
and unmarshal method arguments and return types.
ObjectWritable is useful when a field can be of more than one type: for example, if the
values in a SequenceFile have multiple types, then you can declare the value type as an
ObjectWritable and wrap each type in an ObjectWritable. Being a general-purpose
mechanism, it's fairly wasteful of space since it writes the classname of the wrapped
type every time it is serialized. In cases where the number of types is small and known
ahead of time, this can be improved by having a static array of types, and using the
index into the array as the serialized reference to the type. This is the approach that
GenericWritable takes, and you have to subclass it to specify the types to support.
Writable collections
There are six Writable collection types in the org.apache.hadoop.io package:
ArrayWritable, ArrayPrimitiveWritable, TwoDArrayWritable, MapWritable, SortedMapWritable,
and EnumSetWritable.
ArrayWritable and TwoDArrayWritable are Writable implementations for arrays and
two-dimensional arrays (array of arrays) of Writable instances. All the elements of an
ArrayWritable or a TwoDArrayWritable must be instances of the same class, which is
specified at construction, as follows:
ArrayWritable writable = new ArrayWritable(Text.class);
In contexts where the Writable is defined by type, such as in SequenceFile keys or
values, or as input to MapReduce in general, you need to subclass ArrayWritable (or
TwoDArrayWritable, as appropriate) to set the type statically. For example:
public class TextArrayWritable extends ArrayWritable {
  public TextArrayWritable() {
    super(Text.class);
  }
}
ArrayWritable and TwoDArrayWritable both have get() and set() methods, as well as a
toArray() method, which creates a shallow copy of the array (or 2D array).
ArrayPrimitiveWritable is a wrapper for arrays of Java primitives. The component type
is detected when you call set(), so there is no need to subclass to set the type.
MapWritable and SortedMapWritable are implementations of java.util.Map<Writable,
Writable> and java.util.SortedMap<WritableComparable, Writable>, respectively. The
type of each key and value field is a part of the serialization format for that field. The
type is stored as a single byte that acts as an index into an array of types. The array is
populated with the standard types in the org.apache.hadoop.io package, but custom
Writable types are accommodated, too, by writing a header that encodes the type array
for nonstandard types. As they are implemented, MapWritable and SortedMapWritable
use positive byte values for custom types, so a maximum of 127 distinct nonstandard
Writable classes can be used in any particular MapWritable or SortedMapWritable
instance. Here's a demonstration of using a MapWritable with different types for keys and
values:
MapWritable src = new MapWritable();
src.put(new IntWritable(1), new Text("cat"));
src.put(new VIntWritable(2), new LongWritable(163));

MapWritable dest = new MapWritable();
WritableUtils.cloneInto(dest, src);
assertThat((Text) dest.get(new IntWritable(1)), is(new Text("cat")));
assertThat((LongWritable) dest.get(new VIntWritable(2)),
    is(new LongWritable(163)));
Conspicuous by their absence are Writable collection implementations for sets and
lists. A general set can be emulated by using a MapWritable (or a SortedMapWritable for
a sorted set), with NullWritable values. There is also EnumSetWritable for sets of enum
types. For lists of a single type of Writable, ArrayWritable is adequate, but to store
different types of Writable in a single list, you can use GenericWritable to wrap the
elements in an ArrayWritable. Alternatively, you could write a general ListWritable
using the ideas from MapWritable.
Implementing a Custom Writable
Hadoop comes with a useful set of Writable implementations that serve most purposes;
however, on occasion, you may need to write your own custom implementation. With
a custom Writable, you have full control over the binary representation and the sort
order. Because Writables are at the heart of the MapReduce data path, tuning the binary
representation can have a significant effect on performance. The stock Writable
implementations that come with Hadoop are well-tuned, but for more elaborate
structures, it is often better to create a new Writable type, rather than compose the stock
types.
To demonstrate how to create a custom Writable, we shall write an implementation
that represents a pair of strings, called TextPair. The basic implementation is shown
in Example 4-7.
Example 4-7. A Writable implementation that stores a pair of Text objects
import java.io.*;

import org.apache.hadoop.io.*;

public class TextPair implements WritableComparable<TextPair> {

  private Text first;
  private Text second;

  public TextPair() {
    set(new Text(), new Text());
  }

  public TextPair(String first, String second) {
    set(new Text(first), new Text(second));
  }

  public TextPair(Text first, Text second) {
    set(first, second);
  }

  public void set(Text first, Text second) {
    this.first = first;
    this.second = second;
  }

  public Text getFirst() {
    return first;
  }

  public Text getSecond() {
    return second;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    first.write(out);
    second.write(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    first.readFields(in);
    second.readFields(in);
  }

  @Override
  public int hashCode() {
    return first.hashCode() * 163 + second.hashCode();
  }

  @Override
  public boolean equals(Object o) {
    if (o instanceof TextPair) {
      TextPair tp = (TextPair) o;
      return first.equals(tp.first) && second.equals(tp.second);
    }
    return false;
  }

  @Override
  public String toString() {
    return first + "\t" + second;
  }

  @Override
  public int compareTo(TextPair tp) {
    int cmp = first.compareTo(tp.first);
    if (cmp != 0) {
      return cmp;
    }
    return second.compareTo(tp.second);
  }
}
The first part of the implementation is straightforward: there are two Text instance
variables, first and second, and associated constructors, getters, and setters. All
Writable implementations must have a default constructor so that the MapReduce
framework can instantiate them, then populate their fields by calling readFields().
Writable instances are mutable and often reused, so you should take care to avoid
allocating objects in the write() or readFields() methods.
TextPair's write() method serializes each Text object in turn to the output stream, by
delegating to the Text objects themselves. Similarly, readFields() deserializes the bytes
from the input stream by delegating to each Text object. The DataOutput and
DataInput interfaces have a rich set of methods for serializing and deserializing Java
primitives, so, in general, you have complete control over the wire format of your
Writable object.
Just as you would for any value object you write in Java, you should override the
hashCode(), equals(), and toString() methods from java.lang.Object. The
hashCode() method is used by the HashPartitioner (the default partitioner in MapReduce)
to choose a reduce partition, so you should make sure that you write a good hash
function that mixes well to ensure reduce partitions are of a similar size.
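To see why the quality of hashCode() matters, it helps to look at how a hash code is typically turned into a partition number. The sketch below is plain Java and only models the idea: HashPartitionSketch and partitionFor() are hypothetical names, and the formula (mask off the sign bit, then take the remainder) is our understanding of what a hash-based partitioner does, not code taken from Hadoop.

```java
final class HashPartitionSketch {

    /**
     * Maps a key to a partition in the range [0, numPartitions).
     * Masking with Integer.MAX_VALUE clears the sign bit so that
     * negative hash codes cannot produce a negative partition.
     */
    static int partitionFor(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        int[] counts = new int[numPartitions];
        // Count how 1,000 sample string keys spread across the partitions.
        for (int i = 0; i < 1000; i++) {
            counts[partitionFor("key" + i, numPartitions)]++;
        }
        for (int p = 0; p < numPartitions; p++) {
            System.out.println("partition " + p + ": " + counts[p] + " keys");
        }
    }
}
```

A hash function that clusters its values, for example one that always returns a multiple of the number of reducers, would send every key to the same partition and leave the other reducers idle.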
If you ever plan to use your custom Writable with TextOutputFormat,
then you must implement its toString() method. TextOutputFormat calls
toString() on keys and values for their output representation. For
TextPair, we write the underlying Text objects as strings separated by a tab
character.
TextPair is an implementation of WritableComparable, so it provides an implementation
of the compareTo() method that imposes the ordering you would expect: it sorts by the
first string followed by the second. Notice that TextPair differs from TextArrayWritable
from the previous section (apart from the number of Text objects it can store), since
TextArrayWritable is only a Writable, not a WritableComparable.
Implementing a RawComparator for speed
The code for TextPair in Example 4-7 will work as it stands; however, there is a further
optimization we can make. As explained in "WritableComparable and comparators"
on page 97, when TextPair is being used as a key in MapReduce, it will have to
be deserialized into an object for the compareTo() method to be invoked. What if it were
possible to compare two TextPair objects just by looking at their serialized
representations?
It turns out that we can do this, since TextPair is the concatenation of two Text objects,
and the binary representation of a Text object is a variable-length integer containing
the number of bytes in the UTF-8 representation of the string, followed by the UTF-8
bytes themselves. The trick is to read the initial length, so we know how long the first
Text object's byte representation is; then we can delegate to Text's RawComparator, and
invoke it with the appropriate offsets for the first or second string. Example 4-8 gives
the details (note that this code is nested in the TextPair class).
Example 4-8. A RawComparator for comparing TextPair byte representations
public static class Comparator extends WritableComparator {

  private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();

  public Comparator() {
    super(TextPair.class);
  }

  @Override
  public int compare(byte[] b1, int s1, int l1,
                     byte[] b2, int s2, int l2) {

    try {
      int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
      int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
      int cmp = TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
      if (cmp != 0) {
        return cmp;
      }
      return TEXT_COMPARATOR.compare(b1, s1 + firstL1, l1 - firstL1,
                                     b2, s2 + firstL2, l2 - firstL2);
    } catch (IOException e) {
      throw new IllegalArgumentException(e);
    }
  }
}

static {
  WritableComparator.define(TextPair.class, new Comparator());
}
We actually subclass WritableComparator rather than implement RawComparator directly,
since it provides some convenience methods and default implementations. The
subtle part of this code is calculating firstL1 and firstL2, the lengths of the first
Text field in each byte stream. Each is made up of the length of the variable-length
integer (returned by decodeVIntSize() on WritableUtils) and the value it is encoding
(returned by readVInt()).
The static block registers the raw comparator so that whenever MapReduce sees the
TextPair class, it knows to use the raw comparator as its default comparator.
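The idea of comparing serialized pairs without deserializing them can be tried out in plain Java. The sketch below is a simplification under stated assumptions: it uses a fixed 4-byte length prefix instead of Text's variable-length integer, and the names RawPairCompareSketch, encode, and compareBytes are hypothetical. It is not Hadoop's implementation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: compare two serialized string pairs directly on their bytes.
final class RawPairCompareSketch {

  // Fixed-width length prefix; a simplification of a variable-length integer.
  static final int LENGTH_PREFIX = 4;

  // Encode a pair as [length][UTF-8 bytes][length][UTF-8 bytes].
  static byte[] encode(String first, String second) {
    byte[] f = first.getBytes(StandardCharsets.UTF_8);
    byte[] s = second.getBytes(StandardCharsets.UTF_8);
    return ByteBuffer.allocate(2 * LENGTH_PREFIX + f.length + s.length)
        .putInt(f.length).put(f)
        .putInt(s.length).put(s)
        .array();
  }

  // Lexicographic comparison of two byte ranges, treating bytes as unsigned.
  static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    int n = Math.min(l1, l2);
    for (int i = 0; i < n; i++) {
      int a = b1[s1 + i] & 0xff;
      int b = b2[s2 + i] & 0xff;
      if (a != b) {
        return a - b;
      }
    }
    return l1 - l2;
  }

  // Compare first fields, then second fields, without building any String.
  static int compare(byte[] b1, byte[] b2) {
    int firstLen1 = ByteBuffer.wrap(b1).getInt(0);
    int firstLen2 = ByteBuffer.wrap(b2).getInt(0);
    int cmp = compareBytes(b1, LENGTH_PREFIX, firstLen1, b2, LENGTH_PREFIX, firstLen2);
    if (cmp != 0) {
      return cmp;
    }
    // The second field starts right after the first field's prefix and bytes.
    int off1 = LENGTH_PREFIX + firstLen1;
    int off2 = LENGTH_PREFIX + firstLen2;
    int secondLen1 = ByteBuffer.wrap(b1).getInt(off1);
    int secondLen2 = ByteBuffer.wrap(b2).getInt(off2);
    return compareBytes(b1, off1 + LENGTH_PREFIX, secondLen1,
                        b2, off2 + LENGTH_PREFIX, secondLen2);
  }

  public static void main(String[] args) {
    System.out.println(compare(encode("a", "1"), encode("a", "2")) < 0);
  }
}
```

As in the listing above, the only bookkeeping needed is the length of the first field, which gives the offset at which the second field begins.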
Custom comparators
As we can see with TextPair, writing raw comparators takes some care, since you have
to deal with details at the byte level. It is worth looking at some of the implementations
of Writable in the org.apache.hadoop.io package for further ideas, if you need to write
your own. The utility methods on WritableUtils are very handy, too.
Custom comparators should also be written to be RawComparators, if possible. These
are comparators that implement a different sort order to the natural sort order defined
by the default comparator. Example 4-9 shows a comparator for TextPair, called
FirstComparator, that considers only the first string of the pair. Note that we override the
compare() method that takes objects so both compare() methods have the same
semantics.
We will make use of this comparator in Chapter 8, when we look at joins and secondary
sorting in MapReduce (see "Joins" on page 281).
Example 4-9. A custom RawComparator for comparing the first field of TextPair byte representations
public static class FirstComparator extends WritableComparator {

  private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();

  public FirstComparator() {
    super(TextPair.class);
  }

  @Override
  public int compare(byte[] b1, int s1, int l1,
                     byte[] b2, int s2, int l2) {

    try {
      int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
      int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
      return TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
    } catch (IOException e) {
      throw new IllegalArgumentException(e);
    }
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    if (a instanceof TextPair && b instanceof TextPair) {
      return ((TextPair) a).first.compareTo(((TextPair) b).first);
    }
    return super.compare(a, b);
  }
}
Serialization Frameworks
Although most MapReduce programs use Writable key and value types, this isn't mandated
by the MapReduce API. In fact, any types can be used; the only requirement is
that there be a mechanism that translates to and from a binary representation of each
type.
To support this, Hadoop has an API for pluggable serialization frameworks. A serialization
framework is represented by an implementation of Serialization (in the
org.apache.hadoop.io.serializer package). WritableSerialization, for example, is
the implementation of Serialization for Writable types.
A Serialization defines a mapping from types to Serializer instances (for turning an
object into a byte stream) and Deserializer instances (for turning a byte stream into
an object).
Set the io.serializations property to a comma-separated list of classnames to register
Serialization implementations. Its default value includes
org.apache.hadoop.io.serializer.WritableSerialization and the Avro specific and reflect
serializations, which means that only Writable or Avro objects can be serialized or
deserialized out of the box.
Hadoop includes a class called JavaSerialization that uses Java Object Serialization.
Although it makes it convenient to be able to use standard Java types in MapReduce
programs, like Integer or String, Java Object Serialization is not as efficient as Writables,
so it's not worth making this trade-off (see the sidebar on the next page).
Why Not Use Java Object Serialization?
Java comes with its own serialization mechanism, called Java Object Serialization (often
referred to simply as "Java Serialization"), that is tightly integrated with the language,
so it's natural to ask why this wasn't used in Hadoop. Here's what Doug Cutting said
in response to that question:
    Why didn't I use Serialization when we first started Hadoop? Because it looked
    big and hairy and I thought we needed something lean and mean, where we had
    precise control over exactly how objects are written and read, since that is central
    to Hadoop. With Serialization you can get some control, but you have to fight for
    it.
    The logic for not using RMI was similar. Effective, high-performance inter-process
    communications are critical to Hadoop. I felt like we'd need to precisely control
    how things like connections, timeouts and buffers are handled, and RMI gives you
    little control over those.
The problem is that Java Serialization doesn't meet the criteria for a serialization format
listed earlier: compact, fast, extensible, and interoperable.
Java Serialization is not compact: it writes the classname of each object being written
to the stream (this is true of classes that implement java.io.Serializable or
java.io.Externalizable). Subsequent instances of the same class write a reference handle
to the first occurrence, which occupies only 5 bytes. However, reference handles
don't work well with random access, since the referent class may occur at any point in
the preceding stream; that is, there is state stored in the stream. Even worse, reference
handles play havoc with sorting records in a serialized stream, since the first record of
a particular class is distinguished and must be treated as a special case.
All these problems are avoided by not writing the classname to the stream at all, which
is the approach that Writable takes. This makes the assumption that the client knows
the expected type. The result is that the format is considerably more compact than Java
Serialization, and random access and sorting work as expected since each record is
independent of the others (so there is no stream state).
Java Serialization is a general-purpose mechanism for serializing graphs of objects, so
it necessarily has some overhead for serialization and deserialization operations. What's
more, the deserialization procedure creates a new instance for each object deserialized
from the stream. Writable objects, on the other hand, can be (and often are) reused.
For example, for a MapReduce job, which at its core serializes and deserializes billions
of records of just a handful of different types, the savings gained by not having to allocate
new objects are significant.
In terms of extensibility, Java Serialization has some support for evolving a type, but it
is brittle and hard to use effectively (Writables have no support: the programmer has
to manage them himself).
In principle, other languages could interpret the Java Serialization stream protocol
(defined by the Java Object Serialization Specification), but in practice there are no widely
used implementations in other languages, so it is a Java-only solution. The situation is
the same for Writables.
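The compactness argument in this sidebar is easy to check with the JDK alone. The following sketch compares the number of bytes produced by Java Object Serialization for a boxed Integer with the bytes produced by writing the same value as a raw int through DataOutputStream, which is roughly what a Writable-style format does. The class and method names are hypothetical, and the exact size of the Java-serialized form is left for you to observe rather than asserted here.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Hypothetical sketch comparing serialized sizes; not part of Hadoop.
final class SerializedSizeSketch {

  // Size of the value when written with Java Object Serialization,
  // which includes a stream header and class description.
  static int javaSerializedSize(Object value) {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
        out.writeObject(value);
      }
      return buffer.size();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Size of the value when only its four payload bytes are written.
  static int rawIntSize(int value) {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      try (DataOutputStream out = new DataOutputStream(buffer)) {
        out.writeInt(value);
      }
      return buffer.size();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("Java Serialization: " + javaSerializedSize(Integer.valueOf(163)) + " bytes");
    System.out.println("Raw int:            " + rawIntSize(163) + " bytes");
  }
}
```

The raw form carries no type information at all, which is only workable because the reader is assumed to know the expected type in advance.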
Serialization IDL
There are a number of other serialization frameworks that approach the problem in a
different way: rather than defining types through code, you define them in a
language-neutral, declarative fashion, using an interface description language (IDL). The system
can then generate types for different languages, which is good for interoperability. They
also typically define versioning schemes that make type evolution straightforward.
Hadoop's own Record I/O (found in the org.apache.hadoop.record package) has an
IDL that is compiled into Writable objects, which makes it convenient for generating
types that are compatible with MapReduce. For whatever reason, however, Record
I/O was not widely used, and has been deprecated in favor of Avro.
Apache Thrift and Google Protocol Buffers are both popular serialization frameworks,
and they are commonly used as a format for persistent binary data. There is limited
support for these as MapReduce formats; [3] however, they are used internally in parts
of Hadoop for RPC and data exchange.
In the next section, we look at Avro, an IDL-based serialization framework designed
to work well with large-scale data processing in Hadoop.

3. You can find the latest status for a Thrift Serialization at
https://issues.apache.org/jira/browse/HADOOP-3787, and a Protocol Buffers Serialization at
https://issues.apache.org/jira/browse/HADOOP-3788. Twitter's Elephant Bird project
(http://github.com/kevinweil/elephant-bird) includes tools for working with Protocol
Buffers in Hadoop.

Avro
Apache Avro [4] is a language-neutral data serialization system. The project was created
by Doug Cutting (the creator of Hadoop) to address the major downside of Hadoop
Writables: lack of language portability. Having a data format that can be processed by
many languages (currently C, C++, C#, Java, Python, and Ruby) makes it easier to
share datasets with a wider audience than one tied to a single language. It is also more
future-proof, allowing data to potentially outlive the language used to read and write it.
But why a new data serialization system? Avro has a set of features that, taken together,
differentiate it from other systems like Apache Thrift or Google's Protocol Buffers. [5]
Like these systems and others, Avro data is described using a language-independent
schema. However, unlike some other systems, code generation is optional in Avro,
which means you can read and write data that conforms to a given schema even if your
code has not seen that particular schema before. To achieve this, Avro assumes that
the schema is always present (at both read and write time), which makes for a very
compact encoding, since encoded values do not need to be tagged with a field identifier.

4. Named after the British aircraft manufacturer from the 20th century.
5. Avro also performs favorably compared to other serialization libraries, as the benchmarks at
http://code.google.com/p/thrift-protobuf-compare/ demonstrate.

Avro schemas are usually written in JSON, and data is usually encoded using a binary
format, but there are other options, too. There is a higher-level language called Avro
IDL, for writing schemas in a C-like language that is more familiar to developers. There
is also a JSON-based data encoder, which, being human-readable, is useful for
prototyping and debugging Avro data.
The Avro specification precisely defines the binary format that all implementations must
support. It also specifies many of the other features of Avro that implementations
should support. One area that the specification does not rule on, however, is APIs:
implementations have complete latitude in the API they expose for working with Avro
data, since each one is necessarily language-specific. The fact that there is only one
binary format is significant, since it means the barrier for implementing a new language
binding is lower, and avoids the problem of a combinatorial explosion of languages
and formats, which would harm interoperability.
Avro has rich schema resolution capabilities. Within certain carefully defined
constraints, the schema used to read data need not be identical to the schema that was used
to write the data. This is the mechanism by which Avro supports schema evolution.
For example, a new, optional field may be added to a record by declaring it in the
schema used to read the old data. New and old clients alike will be able to read the old
data, while new clients can write new data that uses the new field. Conversely, if an old
client sees newly encoded data, it will gracefully ignore the new field and carry on
processing as it would have done with old data.
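The behavior described above (fill in a default for a field the writer did not know about, and ignore a field the reader does not know about) can be illustrated with a toy model. The sketch below represents records as plain Java maps and is only an analogy for the rule; it is not Avro's resolution algorithm or API, and the names SchemaResolutionSketch and resolve are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of field-level schema resolution using maps; not the Avro API.
final class SchemaResolutionSketch {

  // readerDefaults maps each field name in the reader's schema to its default
  // value, with null meaning "this field has no default".
  static Map<String, Object> resolve(Map<String, Object> written,
                                     Map<String, Object> readerDefaults) {
    Map<String, Object> result = new LinkedHashMap<>();
    for (Map.Entry<String, Object> field : readerDefaults.entrySet()) {
      String name = field.getKey();
      if (written.containsKey(name)) {
        // The writer supplied this field, so use the written value.
        result.put(name, written.get(name));
      } else if (field.getValue() != null) {
        // The writer did not know this field; fall back to the reader's default.
        result.put(name, field.getValue());
      } else {
        throw new IllegalArgumentException("No value or default for field: " + name);
      }
    }
    // Fields that appear only in the written record are silently dropped.
    return result;
  }

  public static void main(String[] args) {
    Map<String, Object> newReader = new LinkedHashMap<>();
    newReader.put("left", null);
    newReader.put("right", null);
    newReader.put("description", "");

    Map<String, Object> oldRecord = new LinkedHashMap<>();
    oldRecord.put("left", "L");
    oldRecord.put("right", "R");

    System.out.println(resolve(oldRecord, newReader));
  }
}
```

The important property is that both directions work: a new reader can consume old data because the new field has a default, and an old reader can consume new data because unknown fields are skipped.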
Avro specifies an object container format for sequences of objects, similar to Hadoop's
sequence file. An Avro data file has a metadata section where the schema is stored,
which makes the file self-describing. Avro data files support compression and are
splittable, which is crucial for a MapReduce data input format. Furthermore, since Avro
was designed with MapReduce in mind, in the future it will be possible to use Avro to
bring first-class MapReduce APIs (that is, ones that are richer than Streaming, like the
Java API, or C++ Pipes) to languages that speak Avro.
Avro can be used for RPC, too, although this isn't covered here. More information is
in the specification.
Avro data types and schemas
Avro defines a small number of data types, which can be used to build
application-specific data structures by writing schemas. For interoperability, implementations must
support all Avro types.
Avro's primitive types are listed in Table 4-9. Each primitive type may also be specified
using a more verbose form, using the type attribute, such as:
{ "type": "null" }
Table 4-9. Avro primitive types
Type | Description | Schema
null | The absence of a value | "null"
boolean | A binary value | "boolean"
int | 32-bit signed integer | "int"
long | 64-bit signed integer | "long"
float | Single precision (32-bit) IEEE 754 floating-point number | "float"
double | Double precision (64-bit) IEEE 754 floating-point number | "double"
bytes | Sequence of 8-bit unsigned bytes | "bytes"
string | Sequence of Unicode characters | "string"
Avro also defines the complex types listed in Table 4-10, along with a representative
example of a schema of each type.
Table 4-10. Avro complex types
Type: array
Description: An ordered collection of objects. All objects in a particular array must have
the same schema.
Schema example:
{
  "type": "array",
  "items": "long"
}
Type: map
Description: An unordered collection of key-value pairs. Keys must be strings, values may
be any type, although within a particular map all values must have the same schema.
Schema example:
{
  "type": "map",
  "values": "string"
}
Type: record
Description: A collection of named fields of any type.
Schema example:
{
  "type": "record",
  "name": "WeatherRecord",
  "doc": "A weather reading.",
  "fields": [
    {"name": "year", "type": "int"},
    {"name": "temperature", "type": "int"},
    {"name": "stationId", "type": "string"}
  ]
}
Type: enum
Description: A set of named values.
Schema example:
{
  "type": "enum",
  "name": "Cutlery",
  "doc": "An eating utensil.",
  "symbols": ["KNIFE", "FORK", "SPOON"]
}
Type: fixed
Description: A fixed number of 8-bit unsigned bytes.
Schema example:
{
  "type": "fixed",
  "name": "Md5Hash",
  "size": 16
}
Type: union
Description: A union of schemas. A union is represented by a JSON array, where each
element in the array is a schema. Data represented by a union must match one of the
schemas in the union.
Schema example:
[
  "null",
  "string",
  {"type": "map", "values": "string"}
]
Each Avro language API has a representation for each Avro type that is specific to the
language. For example, Avro's double type is represented in C, C++, and Java by a
double, in Python by a float, and in Ruby by a Float.
What's more, there may be more than one representation, or mapping, for a language.
All languages support a dynamic mapping, which can be used even when the schema
is not known ahead of run time. Java calls this the generic mapping.
In addition, the Java and C++ implementations can generate code to represent the data
for an Avro schema. Code generation, which is called the specific mapping in Java, is
an optimization that is useful when you have a copy of the schema before you read or
write data. Generated classes also provide a more domain-oriented API for user code
than generic ones.
Java has a third mapping, the reflect mapping, which maps Avro types onto preexisting
Java types, using reflection. It is slower than the generic and specific mappings, and is
not generally recommended for new applications.
Java's type mappings are shown in Table 4-11. As the table shows, the specific mapping
is the same as the generic one unless otherwise noted (and the reflect one is the same
as the specific one unless noted). The specific mapping only differs from the generic
one for record, enum, and fixed, all of which have generated classes (the name of which
is controlled by the name and optional namespace attribute).
Why don't the Java generic and specific mappings use Java String to
represent an Avro string? The answer is efficiency: the Avro Utf8 type
is mutable, so it may be reused for reading or writing a series of values.
Also, Java String decodes UTF-8 at object construction time, while Avro
Utf8 does it lazily, which can increase performance in some cases.
Utf8 implements Java's java.lang.CharSequence interface, which allows
some interoperability with Java libraries. In other cases it may be
necessary to convert Utf8 instances to String objects by calling its
toString() method.
From Avro 1.6.0 onwards there is an option to have Avro always
perform the conversion to String. There are a couple of ways to achieve
this. The first is to set the avro.java.string property in the schema to
String:
{ "type": "string", "avro.java.string": "String" }
Alternatively, for the specific mapping you can generate classes which
have String-based getters and setters. When using the Avro Maven
plugin this is done by setting the configuration property stringType to
String (the example code has a demonstration of this).
Finally, note that the Java reflect mapping always uses String objects,
since it is designed for Java compatibility, not performance.
Table 4-11. Avro Java type mappings
(A blank cell means the mapping is the same as the one in the column to its left.)
Avro type | Generic Java mapping | Specific Java mapping | Reflect Java mapping
null | null type | |
boolean | boolean | |
int | int | | short or int
long | long | |
float | float | |
double | double | |
bytes | java.nio.ByteBuffer | | Array of byte
string | org.apache.avro.util.Utf8 or java.lang.String | | java.lang.String
array | org.apache.avro.generic.GenericArray | | Array or java.util.Collection
map | java.util.Map | |
record | org.apache.avro.generic.GenericRecord | Generated class implementing org.apache.avro.specific.SpecificRecord. | Arbitrary user class with a zero-argument constructor. All inherited nontransient instance fields are used.
enum | java.lang.String | Generated Java enum | Arbitrary Java enum
fixed | org.apache.avro.generic.GenericFixed | Generated class implementing org.apache.avro.specific.SpecificFixed. | org.apache.avro.generic.GenericFixed
union | java.lang.Object | |
In-memory serialization and deserialization
Avro provides APIs for serialization and deserialization, which are useful when you
want to integrate Avro with an existing system, such as a messaging system where the
framing format is already defined. In other cases, consider using Avro's data file format.
Let's write a Java program to read and write Avro data to and from streams. We'll start
with a simple Avro schema for representing a pair of strings as a record:
{
  "type": "record",
  "name": "StringPair",
  "doc": "A pair of strings.",
  "fields": [
    {"name": "left", "type": "string"},
    {"name": "right", "type": "string"}
  ]
}
If this schema is saved in a file on the classpath called StringPair.avsc (.avsc is the
conventional extension for an Avro schema), then we can load it using the following
two lines of code:
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(getClass().getResourceAsStream("StringPair.avsc"));
We can create an instance of an Avro record using the generic API as follows:
GenericRecord datum = new GenericData.Record(schema);
datum.put("left", "L");
datum.put("right", "R");
Next, we serialize the record to an output stream:
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(datum, encoder);
encoder.flush();
out.close();
There are two important objects here: the DatumWriter and the Encoder. A
DatumWriter translates data objects into the types understood by an Encoder, which the
latter writes to the output stream. Here we are using a GenericDatumWriter, which passes
the fields of GenericRecord to the Encoder. We pass a null to the encoder factory since
we are not reusing a previously constructed encoder here.
In this example only one object is written to the stream, but we could call write() with
more objects before closing the stream if we wanted to.
The GenericDatumWriter needs to be passed the schema since it follows the schema to
determine which values from the data objects to write out. After we have called the
writer's write() method, we flush the encoder, then close the output stream.
We can reverse the process and read the object back from the byte buffer:
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
GenericRecord result = reader.read(null, decoder);
assertThat(result.get("left").toString(), is("L"));
assertThat(result.get("right").toString(), is("R"));
We pass null to the calls to binaryDecoder() and read() since we are not reusing objects
here (the decoder or the record, respectively).
The objects returned by result.get("left") and result.get("right") are of type Utf8,
so we convert them into Java String objects by calling their toString() methods.
Let's look now at the equivalent code using the specific API. We can
generate the StringPair class from the schema file by using Avro's Maven plugin for
compiling schemas. The following is the relevant part of the Maven POM:
<project>
  ...
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.avro</groupId>
        <artifactId>avro-maven-plugin</artifactId>
        <version>${avro.version}</version>
        <executions>
          <execution>
            <id>schemas</id>
            <phase>generate-sources</phase>
            <goals>
              <goal>schema</goal>
            </goals>
            <configuration>
              <includes>
                <include>StringPair.avsc</include>
              </includes>
              <sourceDirectory>src/main/resources</sourceDirectory>
              <outputDirectory>${project.build.directory}/generated-sources/java</outputDirectory>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
  ...
</project>
As an alternative to Maven you can use Avro's Ant task,
org.apache.avro.specific.SchemaTask, or the Avro command line tools [6] to generate
Java code for a schema.
In the code for serializing and deserializing, instead of a GenericRecord we construct a
StringPair instance, which we write to the stream using a SpecificDatumWriter, and
read back using a SpecificDatumReader:
StringPair datum = new StringPair();
datum.left = "L";
datum.right = "R";
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<StringPair> writer =
new SpecificDatumWriter<StringPair>(StringPair.class);
Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(datum, encoder);
encoder.flush();
out.close();

DatumReader<StringPair> reader =
new SpecificDatumReader<StringPair>(StringPair.class);
Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
StringPair result = reader.read(null, decoder);
assertThat(result.left.toString(), is("L"));
assertThat(result.right.toString(), is("R"));
Avro data files
Avro's object container file format is for storing sequences of Avro objects. It is very
similar in design to Hadoop's sequence files, which are described in "SequenceFile"
on page 132. The main difference is that Avro data files are designed to be portable
across languages, so, for example, you can write a file in Python and read it in C (we
will do exactly this in the next section).
A data file has a header containing metadata, including the Avro schema and a sync
marker, followed by a series of (optionally compressed) blocks containing the serialized
Avro objects. Blocks are separated by a sync marker that is unique to the file (the marker
for a particular file is found in the header) and that permits rapid resynchronization
with a block boundary after seeking to an arbitrary point in the file, such as an HDFS
block boundary. Thus, Avro data files are splittable, which makes them amenable to
efficient MapReduce processing.
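To see why a sync marker makes a file splittable, consider a reader that is dropped at an arbitrary byte offset. It can scan forward until it sees the marker and start reading at the block that follows. The sketch below shows that scan over an in-memory byte array; it is a toy under stated assumptions (a two-byte marker and a hypothetical SyncScanSketch class), not Avro's reader, and real files take the marker from the file header.

```java
import java.util.Arrays;

// Toy illustration of resynchronizing on a marker after an arbitrary seek.
final class SyncScanSketch {

  // Returns the offset just past the first occurrence of the marker found at
  // or after 'from', or -1 if no marker remains in the data.
  static int nextBlockStart(byte[] file, byte[] marker, int from) {
    for (int i = Math.max(from, 0); i + marker.length <= file.length; i++) {
      byte[] window = Arrays.copyOfRange(file, i, i + marker.length);
      if (Arrays.equals(window, marker)) {
        return i + marker.length;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    byte[] marker = {9, 9};
    // Two "blocks" of payload bytes, each followed by the marker.
    byte[] file = {1, 2, 9, 9, 3, 4, 9, 9, 5};
    // A split that begins at offset 3 resumes at the next block boundary.
    System.out.println(nextBlockStart(file, marker, 3));
  }
}
```

Because each split can find its own first block boundary independently, separate map tasks can process different regions of the same file without coordinating with each other.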
6. Avro can be downloaded in both source and binary forms from
http://avro.apache.org/releases.html. Get usage instructions for the Avro tools by typing
java -jar avro-tools-*.jar.

Writing Avro objects to a data file is similar to writing to a stream. We use a
DatumWriter, as before, but instead of using an Encoder, we create a DataFileWriter
instance with the DatumWriter. Then we can create a new data file (which, by
convention, has a .avro extension) and append objects to it:
File file = new File("data.avro");
DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter =
new DataFileWriter<GenericRecord>(writer);
dataFileWriter.create(schema, file);
dataFileWriter.append(datum);
dataFileWriter.close();
The objects that we write to the data file must conform to the file's schema, otherwise
an exception will be thrown when we call append().
This example demonstrates writing to a local file (java.io.File in the previous snippet),
but we can write to any java.io.OutputStream by using the overloaded create() method
on DataFileWriter. To write a file to HDFS, for example, get an OutputStream by calling
create() on FileSystem (see "Writing Data" on page 62).
Reading back objects from a data file is similar to the earlier case of reading objects
from an in-memory stream, with one important difference: we don't have to specify a
schema since it is read from the file metadata. Indeed, we can get the schema from the
DataFileReader instance, using getSchema(), and verify that it is the same as the one we
used to write the original object with:
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader =
new DataFileReader<GenericRecord>(file, reader);
assertThat("Schema is the same", schema, is(dataFileReader.getSchema()));
DataFileReader is a regular Java iterator, so we can iterate through its data objects by
calling its hasNext() and next() methods. The following snippet checks that there is
only one record, and that it has the expected field values:
assertThat(dataFileReader.hasNext(), is(true));
GenericRecord result = dataFileReader.next();
assertThat(result.get("left").toString(), is("L"));
assertThat(result.get("right").toString(), is("R"));
assertThat(dataFileReader.hasNext(), is(false));
Rather than using the usual next() method, however, it is preferable to use the
overloaded form that takes an instance of the object to be returned (in this case,
GenericRecord), since it will reuse the object and save allocation and garbage collection
costs for files containing many objects. The following is idiomatic:
GenericRecord record = null;
while (dataFileReader.hasNext()) {
  record = dataFileReader.next(record);
  // process record
}
If object reuse is not important, you can use this shorter form:
for (GenericRecord record : dataFileReader) {
  // process record
}
For the general case of reading a file on a Hadoop file system, use Avro's FsInput to
specify the input file using a Hadoop Path object. DataFileReader actually offers random
access to an Avro data file (via its seek() and sync() methods); however, in many cases,
sequential streaming access is sufficient, for which DataFileStream should be used.
DataFileStream can read from any Java InputStream.
Interoperability
To demonstrate Avro's language interoperability, let's write a data file using one
language (Python) and read it back with another (C).
The program in Example 4-10 reads comma-separated strings from standard
input and writes them as StringPair records to an Avro data file. Like the Java code for
writing a data file, we create a DatumWriter and a DataFileWriter object. Notice that we
have embedded the Avro schema in the code, although we could equally well have read
it from a file.
Python represents Avro records as dictionaries; each line that is read from standard in
is turned into a dict object and appended to the DataFileWriter.
Example 4-10. A Python program for writing Avro record pairs to a data file
import os
import sys
from avro import schema
from avro import io
from avro import datafile

if __name__ == '__main__':
  if len(sys.argv) != 2:
    sys.exit('Usage: %s <data_file>' % sys.argv[0])
  avro_file = sys.argv[1]
  writer = open(avro_file, 'wb')
  datum_writer = io.DatumWriter()
  schema_object = schema.parse("""\
{ "type": "record",
  "name": "StringPair",
  "doc": "A pair of strings.",
  "fields": [
    {"name": "left", "type": "string"},
    {"name": "right", "type": "string"}
  ]
}""")
  dfw = datafile.DataFileWriter(writer, datum_writer, schema_object)
  for line in sys.stdin.readlines():
    # Split each "left,right" input line into its two fields.
    (left, right) = line.strip().split(',')
    dfw.append({'left': left, 'right': right})
  dfw.close()
Before we can run the program, we need to install Avro for Python:
% easy_install avro
To run the program, we specify the name of the file to write output to (pairs.avro) and
send input pairs over standard in, marking the end of file by typing Control-D:
% python avro/src/main/py/write_pairs.py pairs.avro
a,1
c,2
b,3
b,2
^D
Next we'll turn to the C API and write a program to display the contents of
pairs.avro; see Example 4-11. [7]
Example 4-11. A C program for reading Avro record pairs from a data file
#include <avro.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
  if (argc != 2) {
    fprintf(stderr, "Usage: dump_pairs <data_file>\n");
    exit(EXIT_FAILURE);
  }

  const char *avrofile = argv[1];
  avro_schema_error_t error;
  avro_file_reader_t filereader;
  avro_datum_t pair;
  avro_datum_t left;
  avro_datum_t right;
  int rval;
  char *p;

  avro_file_reader(avrofile, &filereader);
  while (1) {
    rval = avro_file_reader_read(filereader, NULL, &pair);
    if (rval) break;
    if (avro_record_get(pair, "left", &left) == 0) {
      avro_string_get(left, &p);
      fprintf(stdout, "%s,", p);
    }
    if (avro_record_get(pair, "right", &right) == 0) {
      avro_string_get(right, &p);
      fprintf(stdout, "%s\n", p);
    }
  }
  avro_file_reader_close(filereader);
  return 0;
}

7. For the general case, the Avro tools JAR file has a tojson command that dumps the contents of an Avro
data file as JSON.

The core of the program does three things:
1. opens a file reader of type avro_file_reader_t by calling Avro's
avro_file_reader function,[8]
2. reads Avro data from the file reader with the avro_file_reader_read function in a
while loop until there are no pairs left (as determined by the return value rval), and
3. closes the file reader with avro_file_reader_close.
The avro_file_reader_read function accepts a schema as its second argument to support
the case where the schema for reading is different from the one used when the file
was written (this is explained in the next section), but we simply pass in NULL, which
tells Avro to use the data file's schema. The third argument is a pointer to an
avro_datum_t object, which is populated with the contents of the next record read from
the file. We unpack the pair structure into its fields by calling avro_record_get, and
then we extract the value of these fields as strings using avro_string_get, which we
print to the console.
Running the program using the output of the Python program prints the original input:
% ./dump_pairs pairs.avro
a,1
c,2
b,3
b,2
We have successfully exchanged complex data between two Avro implementations.
Schema resolution
We can choose to use a different schema for reading the data back (the reader's
schema) from the one we used to write it (the writer's schema). This is a powerful tool,
since it enables schema evolution. To illustrate, consider a new schema for string pairs,
with an added description field:
{
"type": "record",
"name": "StringPair",
"doc": "A pair of strings with an added field.",
"fields": [
{"name": "left", "type": "string"},
{"name": "right", "type": "string"},
{"name": "description", "type": "string", "default": ""}
]
}
8. Avro functions and types have an avro_ prefix and are defined in the avro.h header file.
We can use this schema to read the data we serialized earlier, since, crucially, we have
given the description field a default value (the empty string[9]), which Avro will use
when there is no field defined in the records it is reading. Had we omitted the default
attribute, we would get an error when trying to read the old data.
To make the default value null, rather than the empty string, we would
instead define the description field using a union with the null Avro
type:
{"name": "description", "type": ["null", "string"], "default": null}
When the reader's schema is different from the writer's, we use the constructor for
GenericDatumReader that takes two schema objects, the writer's and the reader's, in that
order:
DatumReader<GenericRecord> reader =
new GenericDatumReader<GenericRecord>(schema, newSchema);
Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
GenericRecord result = reader.read(null, decoder);
assertThat(result.get("left").toString(), is("L"));
assertThat(result.get("right").toString(), is("R"));
assertThat(result.get("description").toString(), is(""));
For data files, which have the writer's schema stored in the metadata, we only need to
specify the reader's schema explicitly, which we can do by passing null for the writer's
schema:
DatumReader<GenericRecord> reader =
new GenericDatumReader<GenericRecord>(null, newSchema);
Another common use of a different reader's schema is to drop fields in a record, an
operation called projection. This is useful when you have records with a large number
of fields and you only want to read some of them. For example, this schema can be
used to get only the right field of a StringPair:
{
"type": "record",
"name": "StringPair",
"doc": "The right field of a pair of strings.",
"fields": [
{"name": "right", "type": "string"}
]
}
9. Default values for fields are encoded using JSON. See the Avro specification for a description of this
encoding for each data type.
The rules for schema resolution have a direct bearing on how schemas may evolve from
one version to the next, and are spelled out in the Avro specification for all Avro types.
A summary of the rules for record evolution from the point of view of readers and
writers (or servers and clients) is presented in Table 4-12.
Table 4-12. Schema resolution of records
New schema | Writer | Reader | Action
Added field | Old | New | The reader uses the default value of the new field, since it is not written by the writer.
Added field | New | Old | The reader does not know about the new field written by the writer, so it is ignored (projection).
Removed field | Old | New | The reader ignores the removed field (projection).
Removed field | New | Old | The removed field is not written by the writer. If the old schema had a default defined for the field, then the reader uses this; otherwise it gets an error. In this case, it is best to update the reader's schema at the same time as, or before, the writer's.
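The record rules above can be sketched in a few lines of Python. This is only an illustration of the rules as summarized in the table, not Avro's implementation: resolve_record and _MISSING are hypothetical names, records are modeled as plain dicts, and type promotion and nested types are ignored.

```python
# Illustrative only: not Avro's implementation. resolve_record and _MISSING
# are hypothetical names introduced for this sketch.
_MISSING = object()  # sentinel, because None (JSON null) is a valid default


def resolve_record(written, reader_fields):
    """Project a record as written onto the fields of the reader's schema.

    written: dict mapping field name to the value the writer stored.
    reader_fields: the "fields" list of the reader's schema.
    """
    result = {}
    for field in reader_fields:
        name = field["name"]
        if name in written:
            # Field known to both sides: take the written value.
            result[name] = written[name]
        else:
            default = field.get("default", _MISSING)
            if default is _MISSING:
                # The reader expects a field that was never written
                # and has no default to fall back on.
                raise ValueError("no value or default for field %r" % name)
            result[name] = default
    # Written fields that the reader does not declare are dropped (projection).
    return result


old_record = {"left": "L", "right": "R"}

new_fields = [
    {"name": "left", "type": "string"},
    {"name": "right", "type": "string"},
    {"name": "description", "type": "string", "default": ""},
]
projection_fields = [{"name": "right", "type": "string"}]

added = resolve_record(old_record, new_fields)
projected = resolve_record(old_record, projection_fields)
print(added)
print(projected)
```

Reading the old record with the new field list fills in the empty description, and reading it with the projection field list keeps only the right field. Removing the default from the description field makes the same call raise an error, which is the last row of the table.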
Sort order
Avro defines a sort order for objects. For most Avro types, the order is the natural one
you would expect; for example, numeric types are ordered by ascending numeric
value. Others are a little more subtle: enums are compared by the order in which the
symbol is defined and not by the value of the symbol string, for instance.
All types except record have preordained rules for their sort order as described in the
Avro specification; they cannot be overridden by the user. For records, however, you
can control the sort order by specifying the order attribute for a field. It takes one of
three values: ascending (the default), descending (to reverse the order), or ignore (so
the field is skipped for comparison purposes).
For example, the following schema (SortedStringPair.avsc) defines an ordering of
StringPair records by the right field in descending order. The left field is ignored for
the purposes of ordering, but it is still present in the projection:
{
"type": "record",
"name": "StringPair",
"doc": "A pair of strings, sorted by right field descending.",
"fields": [
{"name": "left", "type": "string", "order": "ignore"},
{"name": "right", "type": "string", "order": "descending"}
]
}
The record's fields are compared pairwise in the document order of the reader's schema.
Thus, by specifying an appropriate reader's schema, you can impose an arbitrary
ordering on data records. This schema (SwitchedStringPair.avsc) defines a sort order
by the right field, then the left:
{
"type": "record",
"name": "StringPair",
"doc": "A pair of strings, sorted by right then left.",
"fields": [
{"name": "right", "type": "string"},
{"name": "left", "type": "string"}
]
}
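To make the effect of the order attribute and of field order concrete, here is a small Python sketch that sorts the four pairs written earlier using field lists copied from the two schemas. It is a model of the ordering rules only; compare_records and sort_records are hypothetical helper names, and Python's string comparison stands in for Avro's byte-wise string comparison (the two agree for UTF-8 data).

```python
from functools import cmp_to_key


def compare_records(a, b, reader_fields):
    """Compare two records field by field, in the reader schema's field order."""
    for field in reader_fields:
        order = field.get("order", "ascending")
        if order == "ignore":
            continue  # field takes no part in the comparison
        x, y = a[field["name"]], b[field["name"]]
        if x != y:
            result = -1 if x < y else 1
            return -result if order == "descending" else result
    return 0


def sort_records(records, reader_fields):
    key = cmp_to_key(lambda a, b: compare_records(a, b, reader_fields))
    return sorted(records, key=key)  # sorted() is stable, so ties keep input order


pairs = [
    {"left": "a", "right": "1"},
    {"left": "c", "right": "2"},
    {"left": "b", "right": "3"},
    {"left": "b", "right": "2"},
]
# Field list from SortedStringPair.avsc
sorted_fields = [
    {"name": "left", "type": "string", "order": "ignore"},
    {"name": "right", "type": "string", "order": "descending"},
]
# Field list from SwitchedStringPair.avsc
switched_fields = [
    {"name": "right", "type": "string"},
    {"name": "left", "type": "string"},
]

by_right_desc = sort_records(pairs, sorted_fields)
by_right_then_left = sort_records(pairs, switched_fields)
for record in by_right_desc:
    print(record)
```

With the first field list, the two records whose right field is "2" compare as equal, because left is ignored; they stay in their input order only because this sketch uses a stable sort.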
Avro implements efficient binary comparisons. That is to say, Avro does not have to
deserialize binary data into objects to perform the comparison, since it can instead
work directly on the byte streams.[10] In the case of the original StringPair schema (with
no order attributes), for example, Avro implements the binary comparison as follows.
The first field, left, is a UTF-8-encoded string, for which Avro can compare the bytes
lexicographically. If they differ, then the order is determined, and Avro can stop the
comparison there. Otherwise, if the two byte sequences are the same, it compares the
second two (right) fields, again lexicographically at the byte level since the field is
another UTF-8 string.
Notice that this description of a comparison function has exactly the same logic as the
binary comparator we wrote for Writables in "Implementing a RawComparator for
speed" on page 108. The great thing is that Avro provides the comparator for us, so we
don't have to write and maintain this code. It's also easy to change the sort order just
by changing the reader's schema. For the SortedStringPair.avsc or SwitchedStringPair.avsc
schemas, the comparison function Avro uses is essentially the same as the one
just described: the difference is in which fields are considered, the order in which they
are considered, and whether the order is ascending or descending.
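The comparison just described can be sketched in Python for the original StringPair schema. The sketch relies on Avro's binary encoding of a string (a zigzag variable-length long giving the byte length, followed by the UTF-8 bytes) and of a record (its fields, concatenated in order). The function names are hypothetical, and a real comparator would avoid even the slices taken here.

```python
def encode_long(n):
    """Encode an integer as a zigzag variable-length long."""
    n = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes become small codes
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        n >>= 7
    out.append(n)
    return bytes(out)


def decode_long(buf, pos):
    """Decode a zigzag variable-length long; return (value, next position)."""
    shift = 0
    result = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1), pos


def encode_pair(left, right):
    """Encode a StringPair record: each field is a length followed by UTF-8 bytes."""
    out = b""
    for s in (left, right):
        data = s.encode("utf-8")
        out += encode_long(len(data)) + data
    return out


def compare_pairs(a, b):
    """Compare two encoded StringPair records without building objects."""
    pos_a = pos_b = 0
    for _ in range(2):  # the left field, then the right field
        len_a, pos_a = decode_long(a, pos_a)
        len_b, pos_b = decode_long(b, pos_b)
        bytes_a = a[pos_a:pos_a + len_a]
        bytes_b = b[pos_b:pos_b + len_b]
        if bytes_a != bytes_b:
            # Order is decided by this field; stop here.
            return -1 if bytes_a < bytes_b else 1
        pos_a += len_a
        pos_b += len_b
    return 0


print(compare_pairs(encode_pair("a", "1"), encode_pair("b", "0")))
print(compare_pairs(encode_pair("b", "2"), encode_pair("b", "2")))
```

The length prefixes are decoded rather than compared, which matters when the strings differ in length: the pair ("ab", "z") sorts before ("b", "a"), even though its encoded form starts with a larger length byte.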
Later in the chapter we'll use Avro's sorting logic in conjunction with MapReduce to
sort Avro data files in parallel.
Avro MapReduce
Avro provides a number of classes for making it easy to run MapReduce programs on
Avro data. For example, AvroMapper and AvroReducer in the org.apache.avro.mapred
package are specializations of Hadoop's (old style) Mapper and Reducer classes. They
eliminate the key-value distinction for inputs and outputs, since Avro data files are just
a sequence of values. However, intermediate data is still divided into key-value pairs
for the shuffle.
10. A useful consequence of this property is that you can compute an Avro datum's hash code from either
the object or the binary representation (the latter by using the static hashCode() method on BinaryData)
and get the same result in both cases.
Let's rework the MapReduce program for finding the maximum temperature for each
year in the weather dataset, using the Avro MapReduce API. We will represent weather
records using the following schema:
{
"type": "record",
"name": "WeatherRecord",
"doc": "A weather reading.",
"fields": [
{"name": "year", "type": "int"},
{"name": "temperature", "type": "int"},
{"name": "stationId", "type": "string"}
]
}
The program in Example 4-12 reads text input (in the format we saw in earlier chapters),
and writes Avro data files containing weather records as output.
Example 4-12. MapReduce program to find the maximum temperature, creating Avro output
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.AvroReducer;
import org.apache.avro.mapred.AvroUtf8InputFormat;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AvroGenericMaxTemperature extends Configured implements Tool {

  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{" +
      " \"type\": \"record\"," +
      " \"name\": \"WeatherRecord\"," +
      " \"doc\": \"A weather reading.\"," +
      " \"fields\": [" +
      " {\"name\": \"year\", \"type\": \"int\"}," +
      " {\"name\": \"temperature\", \"type\": \"int\"}," +
      " {\"name\": \"stationId\", \"type\": \"string\"}" +
      " ]" +
      "}"
  );

  public static class MaxTemperatureMapper
      extends AvroMapper<Utf8, Pair<Integer, GenericRecord>> {
    private NcdcRecordParser parser = new NcdcRecordParser();
    private GenericRecord record = new GenericData.Record(SCHEMA);

    @Override
    public void map(Utf8 line,
        AvroCollector<Pair<Integer, GenericRecord>> collector,
        Reporter reporter) throws IOException {
      parser.parse(line.toString());
      if (parser.isValidTemperature()) {
        record.put("year", parser.getYearInt());
        record.put("temperature", parser.getAirTemperature());
        record.put("stationId", parser.getStationId());
        collector.collect(
            new Pair<Integer, GenericRecord>(parser.getYearInt(), record));
      }
    }
  }

  public static class MaxTemperatureReducer
      extends AvroReducer<Integer, GenericRecord, GenericRecord> {

    @Override
    public void reduce(Integer key, Iterable<GenericRecord> values,
        AvroCollector<GenericRecord> collector, Reporter reporter)
        throws IOException {
      GenericRecord max = null;
      for (GenericRecord value : values) {
        if (max == null ||
            (Integer) value.get("temperature") > (Integer) max.get("temperature")) {
          // Copy the record, since the iterator reuses the same instance
          max = newWeatherRecord(value);
        }
      }
      collector.collect(max);
    }

    private GenericRecord newWeatherRecord(GenericRecord value) {
      GenericRecord record = new GenericData.Record(SCHEMA);
      record.put("year", value.get("year"));
      record.put("temperature", value.get("temperature"));
      record.put("stationId", value.get("stationId"));
      return record;
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>\n",
          getClass().getSimpleName());
      ToolRunner.printGenericCommandUsage(System.err);
      return -1;
    }

    JobConf conf = new JobConf(getConf(), getClass());
    conf.setJobName("Max temperature");

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    AvroJob.setInputSchema(conf, Schema.create(Schema.Type.STRING));
    AvroJob.setMapOutputSchema(conf,
        Pair.getPairSchema(Schema.create(Schema.Type.INT), SCHEMA));
    AvroJob.setOutputSchema(conf, SCHEMA);

    conf.setInputFormat(AvroUtf8InputFormat.class);
    AvroJob.setMapperClass(conf, MaxTemperatureMapper.class);
    AvroJob.setReducerClass(conf, MaxTemperatureReducer.class);

    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new AvroGenericMaxTemperature(), args);
    System.exit(exitCode);
  }
}
This program uses the generic Avro mapping. This frees us from generating code to
represent records, at the expense of type safety (field names are referred to by string
value, such as "temperature").[11] The schema for weather records is inlined in the code
for convenience (and read into the SCHEMA constant), although in practice it might be
more maintainable to read the schema from a local file in the driver code and pass it to
the mapper and reducer via the Hadoop job configuration. (Techniques for achieving
this are discussed in "Side Data Distribution" on page 287.)
There are a couple of differences from the regular Hadoop MapReduce API. The first
is the use of an org.apache.avro.mapred.Pair to wrap the map output key and value in
MaxTemperatureMapper. (The reason that the org.apache.avro.mapred.AvroMapper
doesn't have a fixed output key and value is so that map-only jobs can emit just values
to Avro data files.) For this MapReduce program the key is the year (an integer), and
the value is the weather record, which is represented by Avro's GenericRecord.
Avro MapReduce does preserve the notion of key-value pairs for the input to the reducer,
however, since this is what comes out of the shuffle, and it unwraps the Pair before
invoking the org.apache.avro.mapred.AvroReducer. The MaxTemperatureReducer iterates
through the records for each key (year) and finds the one with the maximum temperature.
It is necessary to make a copy of the record with the highest temperature
found so far, since the iterator reuses the instance for reasons of efficiency (and only
the fields are updated).
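The need for the copy is easy to reproduce outside Hadoop. The following Python sketch imitates an iterator that hands out one shared object and only updates its fields; reusing_iterator and the two max functions are hypothetical names, and the three readings are made-up sample values.

```python
def reusing_iterator(rows):
    """Yield every row through one shared dict, updating only its fields."""
    shared = {}
    for row in rows:
        shared.update(row)
        yield shared


def max_without_copy(values):
    best = None
    for value in values:
        if best is None or value["temperature"] > best["temperature"]:
            best = value  # keeps a reference to the shared instance
    return best


def max_with_copy(values):
    best = None
    for value in values:
        if best is None or value["temperature"] > best["temperature"]:
            best = dict(value)  # snapshot of the fields at this point
    return best


# Made-up readings for a single year
readings = [
    {"year": 1950, "temperature": 0, "stationId": "011990-99999"},
    {"year": 1950, "temperature": 22, "stationId": "011990-99999"},
    {"year": 1950, "temperature": -11, "stationId": "011990-99999"},
]

wrong = max_without_copy(reusing_iterator(readings))
right = max_with_copy(reusing_iterator(readings))
print(wrong["temperature"], right["temperature"])
```

Without the copy, the "maximum" is simply whatever the shared object holds after the last iteration, because the saved reference and the current value are the same object. Copying the fields, as newWeatherRecord() does in Example 4-12, keeps the reading with the highest temperature.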
The second major difference from regular MapReduce is the use of AvroJob for configuring
the job. AvroJob is a convenience class for specifying the Avro schemas for the
input, map output and final output data. In this program the input schema is an Avro
string, because we are reading from a text file, and the input format is set correspondingly,
to AvroUtf8InputFormat. The map output schema is a pair schema whose key
schema is an Avro int and whose value schema is the weather record schema. The final
output schema is the weather record schema, and the output format is the default,
AvroOutputFormat, which writes to Avro data files.
11. For an example that uses the specific mapping, with generated classes, see the
AvroSpecificMaxTemperature class in the example code.
The following command line shows how to run the program on a small sample dataset:
% hadoop jar avro-examples.jar AvroGenericMaxTemperature \
input/ncdc/sample.txt output
On completion we can look at the output using the Avro tools JAR to render the Avro
data file as JSON, one record per line:
% java -jar $AVRO_HOME/avro-tools-*.jar tojson output/part-00000.avro
{"year":1949,"temperature":111,"stationId":"012650-99999"}
{"year":1950,"temperature":22,"stationId":"011990-99999"}
In this example, we used an AvroMapper and an AvroReducer, but the API supports a
mixture of regular MapReduce mappers and reducers with Avro-specific ones, which
is useful for converting between Avro formats and other formats, such as SequenceFiles.
The documentation for the Avro MapReduce package has details.
Sorting using Avro MapReduce
In this section we use Avro's sort capabilities and combine them with MapReduce to
write a program to sort an Avro data file (Example 4-13).
Example 4-13. A MapReduce program to sort an Avro data file
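import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.AvroReducer;
import org.apache.avro.mapred.Pair;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AvroSort extends Configured implements Tool {

  static class SortMapper<K> extends AvroMapper<K, Pair<K, K>> {
    public void map(K datum, AvroCollector<Pair<K, K>> collector,
        Reporter reporter) throws IOException {
      // Emit the datum as both key and value
      collector.collect(new Pair<K, K>(datum, null, datum, null));
    }
  }

  static class SortReducer<K> extends AvroReducer<K, K, K> {
    public void reduce(K key, Iterable<K> values,
        AvroCollector<K> collector,
        Reporter reporter) throws IOException {
      for (K value : values) {
        collector.collect(value);
      }
    }
  }

  @Override
  public int run(String[] args) throws Exception {

    if (args.length != 3) {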
      System.err.printf(
          "Usage: %s [generic options] <input> <output> <schema-file>\n",
          getClass().getSimpleName());
      ToolRunner.printGenericCommandUsage(System.err);
      return -1;
    }

    String input = args[0];
    String output = args[1];
    String schemaFile = args[2];

    JobConf conf = new JobConf(getConf(), getClass());
    conf.setJobName("Avro sort");

    FileInputFormat.addInputPath(conf, new Path(input));
    FileOutputFormat.setOutputPath(conf, new Path(output));

    Schema schema = new Schema.Parser().parse(new File(schemaFile));
    AvroJob.setInputSchema(conf, schema);
    Schema intermediateSchema = Pair.getPairSchema(schema, schema);
    AvroJob.setMapOutputSchema(conf, intermediateSchema);
    AvroJob.setOutputSchema(conf, schema);

    AvroJob.setMapperClass(conf, SortMapper.class);
    AvroJob.setReducerClass(conf, SortReducer.class);

    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new AvroSort(), args);
    System.exit(exitCode);
  }
}
This program (which uses the generic Avro mapping and hence does not require any
code generation) can sort Avro records of any type, represented in Java by the generic
type parameter K. We choose a value that is the same as the key, so that when the values
are grouped by key we can emit all of the values in the case that more than one of them
share the same key (according to the sorting function), thereby not losing any records.[12]
The mapper simply emits an org.apache.avro.mapred.Pair object with this key
and value. The reducer acts as an identity, passing the values through to the (single-
valued) output, which will get written to an Avro data file.
12. We encounter this idea of duplicating information from the key in the value object again in "Secondary
Sort" on page 276.
The sorting happens in the MapReduce shuffle, and the sort function is determined by
the Avro schema that is passed to the program. Let's use the program to sort the
pairs.avro file created earlier, using the SortedStringPair.avsc schema to sort by the right
field in descending order. First we inspect the input using the Avro tools JAR:
% java -jar $AVRO_HOME/avro-tools-*.jar tojson input/avro/pairs.avro
{"left":"a","right":"1"}
{"left":"c","right":"2"}
{"left":"b","right":"3"}
{"left":"b","right":"2"}
Then we run the sort:
% hadoop jar avro-examples.jar AvroSort input/avro/pairs.avro output \
ch04-avro/src/main/resources/SortedStringPair.avsc
Finally we inspect the output and see that it is sorted correctly.
% java -jar $AVRO_HOME/avro-tools-*.jar tojson output/part-00000.avro
{"left":"b","right":"3"}
{"left":"c","right":"2"}
{"left":"b","right":"2"}
{"left":"a","right":"1"}
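The flow of the sort job can be imitated in plain Python to show why the value duplicates the key. This is a toy, single-process model of the map, shuffle, and reduce steps, not the Hadoop or Avro API: all function names are hypothetical, and right_descending is a stand-in for the ordering that SortedStringPair.avsc defines.

```python
from itertools import groupby


def sort_mapper(datum):
    # Emit the datum as both key and value.
    yield (datum, datum)


def shuffle(pairs, key_func):
    """Sort map output by key, then group values whose keys compare as equal."""
    ordered = sorted(pairs, key=lambda kv: key_func(kv[0]))
    for _, group in groupby(ordered, key=lambda kv: key_func(kv[0])):
        group = list(group)
        yield group[0][0], [value for _, value in group]


def sort_reducer(key, values):
    # Identity: pass every value through.
    for value in values:
        yield value


def run_sort(records, key_func):
    map_output = [kv for record in records for kv in sort_mapper(record)]
    return [out
            for key, values in shuffle(map_output, key_func)
            for out in sort_reducer(key, values)]


def right_descending(pair):
    # Stand-in for SortedStringPair.avsc: right descending, left ignored.
    # Negating the code point works only because each right value here
    # is a single character.
    return -ord(pair[1])


records = [("a", "1"), ("c", "2"), ("b", "3"), ("b", "2")]
output = run_sort(records, right_descending)
groups = list(shuffle([(r, r) for r in records], right_descending))
print(output)
print(len(groups))
```

The two records with a right value of "2" fall into the same group, so the shuffle presents three keys for four records. A reducer that emitted only its key would drop one of them, whereas emitting every value keeps all four. The order of values within a group happens to follow the input here; the text above makes no such promise for a real job.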
Avro MapReduce in other languages
For languages other than Java, there are a few choices for working with Avro data.
AvroAsTextInputFormat is designed to allow Hadoop Streaming programs to read Avro
data files. Each datum in the file is converted to a string, which is the JSON representation
of the datum, or just the raw bytes if the type is Avro bytes. Going the other way,
you can specify AvroTextOutputFormat as the output format of a Streaming job to create
Avro data files with a bytes schema, where each datum is the tab-delimited key-value
pair written from the Streaming output. Both these classes can be found in the
org.apache.avro.mapred package.
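As a rough idea of what the Streaming side might look like, here is a sketch of the core of a Python mapper that consumes StringPair records rendered as JSON and writes tab-delimited key-value lines. The function name map_lines is hypothetical, and the exact layout of each input line (for instance, whether the JSON text is followed by a tab) is an assumption, so the sketch simply strips surrounding whitespace before parsing.

```python
import json


def map_lines(lines):
    """Turn JSON-encoded StringPair lines into tab-delimited key-value lines."""
    for line in lines:
        text = line.strip()  # drop the newline and any trailing tab
        if not text:
            continue
        pair = json.loads(text)
        # Key on the right field, with the left field as the value.
        yield "%s\t%s" % (pair["right"], pair["left"])


sample = ['{"left":"a","right":"1"}\n', '{"left":"c","right":"2"}\t\n']
result = list(map_lines(sample))
for out in result:
    print(out)
```

In an actual Streaming job, the function would be fed sys.stdin and each result printed to standard output.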
For a richer interface than Streaming, Avro provides a connector framework (in the
org.apache.avro.mapred.tether package), which is the Avro analog of Hadoop Pipes.
At the time of writing, there are no bindings for other languages, but a Python implementation
will be available in a future release.
Also worth considering are Pig and Hive, which can both read and write Avro data files
by specifying the appropriate storage formats.
File-Based Data Structures
For some applications, you need a specialized data structure to hold your data. For
doing MapReduce-based processing, putting each blob of binary data into its own file
doesn't scale, so Hadoop developed a number of higher-level containers for these
situations.
SequenceFile
Imagine a logfile, where each log record is a new line of text. If you want to log binary
types, plain text isn't a suitable format. Hadoop's SequenceFile class fits the bill in this
situation, providing a persistent data structure for binary key-value pairs. To use it as
a logfile format, you would choose a key, such as a timestamp represented by a
LongWritable, and the value is a Writable that represents the quantity being logged.
SequenceFiles also work well as containers for smaller files. HDFS and MapReduce
are optimized for large files, so packing files into a SequenceFile makes storing
and processing the smaller files more efficient. ("Processing a whole file as a record"
on page 240 contains a program to pack files into a SequenceFile.[13])
Writing a SequenceFile
To create a SequenceFile, use one of its createWriter() static methods, which returns
a SequenceFile.Writer instance. There are several overloaded versions, but they all
require you to specify a stream to write to (either an FSDataOutputStream or a
FileSystem and Path pairing), a Configuration object, and the key and value types. Optional
arguments include the compression type and codec, a Progressable callback to be informed
of write progress, and a Metadata instance to be stored in the SequenceFile
header.
The keys and values stored in a SequenceFile do not necessarily need to be Writable.
Any types that can be serialized and deserialized by a Serialization may be used.
Once you have a SequenceFile.Writer, you then write key-value pairs, using the
append() method. Then when you've finished, you call the close() method
(SequenceFile.Writer implements java.io.Closeable).
Example 4-14 shows a short program to write some key-value pairs to a SequenceFile,
using the API just described.
Example 4-14. Writing a SequenceFile
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {

  private static final String[] DATA = {
    "One, two, buckle my shoe",
    "Three, four, shut the door",
    "Five, six, pick up sticks",
    "Seven, eight, lay them straight",
    "Nine, ten, a big fat hen"
  };

  public static void main(String[] args) throws IOException {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);

    IntWritable key = new IntWritable();
    Text value = new Text();
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(fs, conf, path,
          key.getClass(), value.getClass());

      for (int i = 0; i < 100; i++) {
        key.set(100 - i);
        value.set(DATA[i % DATA.length]);
        // Print the file position at which this record will start
        System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
        writer.append(key, value);
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}
13. In a similar vein, the blog post "A Million Little Files" by Stuart Sierra includes code for converting a
tar file into a SequenceFile, http://stuartsierra.com/2008/04/24/a-million-little-files.
The keys in the sequence file are integers counting down from 100 to 1, represented as
IntWritable objects. The values are Text objects. Before each record is appended to the
SequenceFile.Writer, we call the getLength() method to discover the current position
in the file. (We will use this information about record boundaries in the next section
when we read the file nonsequentially.) We write the position out to the console, along
with the key and value pairs. The result of running it is shown here:
% hadoop SequenceFileWriteDemo numbers.seq
[128] 100 One, two, buckle my shoe
[173] 99 Three, four, shut the door
[220] 98 Five, six, pick up sticks
[264] 97 Seven, eight, lay them straight
[314] 96 Nine, ten, a big fat hen
[359] 95 One, two, buckle my shoe
[404] 94 Three, four, shut the door
[451] 93 Five, six, pick up sticks
[495] 92 Seven, eight, lay them straight
[545] 91 Nine, ten, a big fat hen
...
[1976] 60 One, two, buckle my shoe
[2021] 59 Three, four, shut the door
[2088] 58 Five, six, pick up sticks
[2132] 57 Seven, eight, lay them straight
[2182] 56 Nine, ten, a big fat hen
...
[4557] 5 One, two, buckle my shoe
[4602] 4 Three, four, shut the door
[4649] 3 Five, six, pick up sticks
[4693] 2 Seven, eight, lay them straight
[4743] 1 Nine, ten, a big fat hen
Reading a SequenceFile
Reading sequence files from beginning to end is a matter of creating an instance of
SequenceFile.Reader and iterating over records by repeatedly invoking one of the
next() methods. Which one you use depends on the serialization framework you are
using. If you are using Writable types, you can use the next() method that takes a key
and a value argument, and reads the next key and value in the stream into these
variables:
public boolean next(Writable key, Writable val)
The return value is true if a key-value pair was read and false if the end of the file has
been reached.
For other, non-Writable serialization frameworks (such as Apache Thrift), you should
use these two methods:
public Object next(Object key) throws IOException
public Object getCurrentValue(Object val) throws IOException
In this case, you need to make sure that the serialization you want to use has been set
in the io.serializations property; see "Serialization Frameworks" on page 110.
If the next() method returns a non-null object, a key-value pair was read from the
stream, and the value can be retrieved using the getCurrentValue() method. Otherwise,
if next() returns null, the end of the file has been reached.
The program in Example 4-15 demonstrates how to read a sequence file that has
Writable keys and values. Note how the types are discovered from the
SequenceFile.Reader via calls to getKeyClass() and getValueClass(), then ReflectionUtils is
used to create an instance for the key and an instance for the value. By using this technique,
the program can be used with any sequence file that has Writable keys and values.
Example 4-15. Reading a SequenceFile
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileReadDemo {

  public static void main(String[] args) throws IOException {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);

    SequenceFile.Reader reader = null;
    try {
      reader = new SequenceFile.Reader(fs, path, conf);
      // Discover the key and value types from the file itself
      Writable key = (Writable)
          ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable)
          ReflectionUtils.newInstance(reader.getValueClass(), conf);
      long position = reader.getPosition();
      while (reader.next(key, value)) {
        String syncSeen = reader.syncSeen() ? "*" : "";
        System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, value);
        position = reader.getPosition(); // beginning of next record
      }
    } finally {
      IOUtils.closeStream(reader);
    }
  }
}
Another feature of the program is that it displays the position of the sync points in the
sequence file. A sync point is a point in the stream that can be used to resynchronize
with a record boundary if the reader is "lost" (for example, after seeking to an arbitrary
position in the stream). Sync points are recorded by SequenceFile.Writer, which inserts
a special entry to mark the sync point every few records as a sequence file is being
written. Such entries are small enough to incur only a modest storage overhead (less
than 1%). Sync points always align with record boundaries.
Running the program in Example 4-15 shows the sync points in the sequence file as
asterisks. The first one occurs at position 2021 (the second one occurs at position 4075,
but is not shown in the output):
% hadoop SequenceFileReadDemo numbers.seq
[128] 100 One, two, buckle my shoe
[173] 99 Three, four, shut the door
[220] 98 Five, six, pick up sticks
[264] 97 Seven, eight, lay them straight
[314] 96 Nine, ten, a big fat hen
[359] 95 One, two, buckle my shoe
[404] 94 Three, four, shut the door
[451] 93 Five, six, pick up sticks
[495] 92 Seven, eight, lay them straight
[545] 91 Nine, ten, a big fat hen
[590] 90 One, two, buckle my shoe
...
[1976] 60 One, two, buckle my shoe
[2021*] 59 Three, four, shut the door
[2088] 58 Five, six, pick up sticks
[2132] 57 Seven, eight, lay them straight
[2182] 56 Nine, ten, a big fat hen
...
[4557] 5 One, two, buckle my shoe
[4602] 4 Three, four, shut the door
[4649] 3 Five, six, pick up sticks
[4693] 2 Seven, eight, lay them straight
[4743] 1 Nine, ten, a big fat hen
There are two ways to seek to a given position in a sequence file. The first is the
seek() method, which positions the reader at the given point in the file. For example,
seeking to a record boundary works as expected:
reader.seek(359);
assertThat(reader.next(key, value), is(true));
assertThat(((IntWritable) key).get(), is(95));
But if the position in the file is not at a record boundary, the reader fails when the
next() method is called:
reader.seek(360);
reader.next(key, value); // fails with IOException
The second way to find a record boundary makes use of sync points. The sync(long
position) method on SequenceFile.Reader positions the reader at the next sync point
after position. (If there are no sync points in the file after this position, then the reader
will be positioned at the end of the file.) Thus, we can call sync() with any position in
the stream (a nonrecord boundary, for example) and the reader will reestablish itself
at the next sync point so reading can continue:
reader.sync(360);
assertThat(reader.getPosition(), is(2021L));
assertThat(reader.next(key, value), is(true));
assertThat(((IntWritable) key).get(), is(59));
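The idea behind sync() can be shown with a toy record file in Python. The layout below (a 4-byte length before each record, and a sync entry made of a length of -1 followed by a 16-byte marker) is a deliberate simplification for illustration and should not be read as the actual SequenceFile format; all names are hypothetical, and the fixed marker is a placeholder.

```python
import struct

SYNC_ESCAPE = struct.pack(">i", -1)  # a "length" that no real record can have
SYNC_MARKER = bytes(range(16))       # placeholder marker for this toy format
SYNC_ENTRY = SYNC_ESCAPE + SYNC_MARKER


def write_records(payloads, sync_every):
    """Return (file bytes, offsets of the sync entries)."""
    buf = bytearray()
    syncs = []
    for i, payload in enumerate(payloads):
        if i and i % sync_every == 0:
            syncs.append(len(buf))
            buf += SYNC_ENTRY
        buf += struct.pack(">i", len(payload)) + payload
    return bytes(buf), syncs


def sync(data, position):
    """Offset of the first sync entry at or after position, else end of file."""
    found = data.find(SYNC_ENTRY, position)
    return len(data) if found < 0 else found


def read_from(data, position):
    """Read records sequentially, starting at a record or sync entry boundary."""
    records = []
    while position < len(data):
        (length,) = struct.unpack_from(">i", data, position)
        position += 4
        if length == -1:
            position += len(SYNC_MARKER)  # skip over the sync marker
            continue
        records.append(data[position:position + length])
        position += length
    return records


payloads = [("rec%d" % i).encode("ascii") for i in range(10)]
data, syncs = write_records(payloads, sync_every=4)

resumed_at = sync(data, 9)  # 9 is in the middle of a record
print(syncs, resumed_at, read_from(data, resumed_at)[0])
```

Starting to read at offset 9 would interpret record bytes as a length and go wrong, much as seek(360) does above. Scanning forward for the marker instead finds a position from which reading can safely resume, at the cost of skipping the records before it.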
SequenceFile.Writer has a method called sync() for inserting a sync
point at the current position in the stream. This is not to be confused
with the identically named but otherwise unrelated sync() method
defined by the Syncable interface for synchronizing buffers to the
underlying device.
Sync points come into their own when using sequence files as input to MapReduce,
since they permit the file to be split, so different portions of it can be processed
independently by separate map tasks. See "SequenceFileInputFormat" on page 247.
Displaying a SequenceFile with the command-line interface
The hadoop fs command has a -text option to display sequence files in textual form.
It looks at a file's magic number so that it can attempt to detect the type of the file and
appropriately convert it to text. It can recognize gzipped files and sequence files;
otherwise, it assumes the input is plain text.
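Detection by magic number amounts to looking at the first few bytes of a file. The following Python sketch shows the idea using two facts about the formats themselves: gzip streams begin with the bytes 0x1f 0x8b, and sequence files begin with the ASCII characters SEQ. The function detect_type is a hypothetical name, and the sketch makes no claim about how the hadoop fs command is implemented.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"
SEQUENCE_FILE_MAGIC = b"SEQ"


def detect_type(header):
    """Guess a file's type from its first few bytes."""
    if header.startswith(GZIP_MAGIC):
        return "gzip"
    if header.startswith(SEQUENCE_FILE_MAGIC):
        return "sequence file"
    return "plain text"  # fallback when no magic number matches


print(detect_type(gzip.compress(b"some text")[:4]))
print(detect_type(b"SEQ\x06"))
print(detect_type(b"100\tOne, two"))
```

A text file that happens to start with the letters SEQ would be misidentified by this sketch, which is the usual trade-off of magic-number detection.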
For sequence files, this command is really useful only if the keys and values have a
meaningful string representation (as defined by the toString() method). Also, if you
have your own key or value classes, then you will need to make sure they are on
Hadoop's classpath.
Running it on the sequence file we created in the previous section gives the following
output:
% hadoop fs -text numbers.seq | head
100 One, two, buckle my shoe
99 Three, four, shut the door
98 Five, six, pick up sticks
97 Seven, eight, lay them straight
96 Nine, ten, a big fat hen
95 One, two, buckle my shoe
94 Three, four, shut the door
93 Five, six, pick up sticks
92 Seven, eight, lay them straight
91 Nine, ten, a big fat hen
Sorting and merging SequenceFiles
The most powerful way of sorting (and merging) one or more sequence files is to use
MapReduce. MapReduce is inherently parallel and will let you specify the number of
reducers to use, which determines the number of output partitions. For example, by
specifying one reducer, you get a single output file. We can use the sort example that
comes with Hadoop by specifying that the input and output are sequence files, and by
setting the key and value types:
% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar sort -r 1 \
-inFormat org.apache.hadoop.mapred.SequenceFileInputFormat \
-outFormat org.apache.hadoop.mapred.SequenceFileOutputFormat \
-outKey org.apache.hadoop.io.IntWritable \
-outValue org.apache.hadoop.io.Text \
numbers.seq sorted
% hadoop fs -text sorted/part-00000 | head
1 Nine, ten, a big fat hen
2 Seven, eight, lay them straight
3 Five, six, pick up sticks
4 Three, four, shut the door
5 One, two, buckle my shoe
6 Nine, ten, a big fat hen
7 Seven, eight, lay them straight
8 Five, six, pick up sticks
9 Three, four, shut the door
10 One, two, buckle my shoe
Sorting is covered in more detail in "Sorting" on page 266.
As an alternative to using MapReduce for sort/merge, there is a SequenceFile.Sorter
class that has a number of sort() and merge() methods. These functions predate
MapReduce and are lower-level functions than MapReduce (for example, to get parallelism,
you need to partition your data manually), so in general MapReduce is the preferred
approach to sort and merge sequence files.
The SequenceFile format
A sequence file consists of a header followed by one or more records (see Figure 4-2).
The first three bytes of a sequence file are the bytes SEQ, which act as a magic number,
followed by a single byte representing the version number. The header contains other
fields including the names of the key and value classes, compression details,
user-defined metadata, and the sync marker. (Full details of the format of these fields
may be found in SequenceFile's documentation and source code.) Recall that the sync
marker is used to allow a reader to synchronize to a record boundary from any position
in the file. Each file has a randomly generated sync marker, whose value is stored in the
header. Sync markers appear between records in the sequence file. They are designed to
incur less than a 1% storage overhead, so they don't necessarily appear between every
pair of records (such is the case for short records).
The internal format of the records depends on whether compression is enabled, and if
it is, whether it is record compression or block compression.
If no compression is enabled (the default), then each record is made up of the record
length (in bytes), the key length, the key, and then the value. The length fields are
written as four-byte integers adhering to the contract of the writeInt() method of
java.io.DataOutput. Keys and values are serialized using the Serialization defined for
the class being written to the sequence file.
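The uncompressed record layout just described can be sketched in plain Java. This is only an illustration and not Hadoop's SequenceFile code: the class and method names are hypothetical, the key and value are taken as already-serialized byte arrays, and it assumes that the record length counts the key bytes plus the value bytes. The file header and sync markers are left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of Hadoop.
final class UncompressedRecordSketch {

  /** Encodes one record as: record length, key length, key bytes, value bytes. */
  static byte[] encode(byte[] key, byte[] value) {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(buffer);
      // Assumption: the record length covers the key and value bytes only.
      out.writeInt(key.length + value.length);
      out.writeInt(key.length);   // four-byte, big-endian, per DataOutput.writeInt()
      out.write(key);
      out.write(value);
      out.flush();
      return buffer.toByteArray();
    } catch (IOException e) {
      // An in-memory stream should not fail, but the API declares IOException.
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] key = {0, 0, 0, 100};  // the int 100 in big-endian form
    byte[] value = "One, two".getBytes(StandardCharsets.UTF_8);
    // Two 4-byte length fields + 4 key bytes + 8 value bytes.
    System.out.println(encode(key, value).length + " bytes"); // prints "20 bytes"
  }
}
```

Because both length fields have a fixed width, a reader can skip a whole record after reading only the first four bytes, which is what makes sequential scanning cheap.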
The format for record compression is almost identical to no compression, except the
value bytes are compressed using the codec defined in the header. Note that keys are
not compressed.
Block compression compresses multiple records at once; it is therefore more compact
than and should generally be preferred over record compression because it has the
opportunity to take advantage of similarities between records. (See Figure 4-3.) Records
are added to a block until it reaches a minimum size in bytes, defined by the
io.seqfile.compress.blocksize property: the default is 1 million bytes. A sync marker
is written before the start of every block. The format of a block is a field indicating the
number of records in the block, followed by four compressed fields: the key lengths,
the keys, the value lengths, and the values.
MapFile
A MapFile is a sorted SequenceFile with an index to permit lookups by key. MapFile can
be thought of as a persistent form of java.util.Map (although it doesn't implement this
interface), which is able to grow beyond the size of a Map that is kept in memory.
Figure 4-2. The internal structure of a sequence file with no compression and record compression
Writing a MapFile
Writing a MapFile is similar to writing a SequenceFile: you create an instance of
MapFile.Writer, then call the append() method to add entries in order. (Attempting to
add entries out of order will result in an IOException.) Keys must be instances of
WritableComparable, and values must be Writable. Contrast this to SequenceFile,
which can use any serialization framework for its entries.
The program in Example 4-16 creates a MapFile, and writes some entries to it. It is very
similar to the program in Example 4-14 for creating a SequenceFile.
Example 4-16. Writing a MapFile
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileWriteDemo {

  private static final String[] DATA = {
    "One, two, buckle my shoe",
    "Three, four, shut the door",
    "Five, six, pick up sticks",
    "Seven, eight, lay them straight",
    "Nine, ten, a big fat hen"
  };

  public static void main(String[] args) throws IOException {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    IntWritable key = new IntWritable();
    Text value = new Text();
    MapFile.Writer writer = null;
    try {
      writer = new MapFile.Writer(conf, fs, uri,
          key.getClass(), value.getClass());

      // Keys 1 to 1024 are appended in ascending order, as MapFile requires
      for (int i = 0; i < 1024; i++) {
        key.set(i + 1);
        value.set(DATA[i % DATA.length]);
        writer.append(key, value);
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}
Figure 4-3. The internal structure of a sequence file with block compression
Let's use this program to build a MapFile:
% hadoop MapFileWriteDemo numbers.map
If we look at the MapFile, we see it's actually a directory containing two files called
data and index:
% ls -l numbers.map
total 104
-rw-r--r-- 1 tom tom 47898 Jul 29 22:06 data
-rw-r--r-- 1 tom tom 251 Jul 29 22:06 index
Both files are SequenceFiles. The data file contains all of the entries, in order:
% hadoop fs -text numbers.map/data | head
1 One, two, buckle my shoe
2 Three, four, shut the door
3 Five, six, pick up sticks
4 Seven, eight, lay them straight
5 Nine, ten, a big fat hen
6 One, two, buckle my shoe
7 Three, four, shut the door
8 Five, six, pick up sticks
9 Seven, eight, lay them straight
10 Nine, ten, a big fat hen
The index file contains a fraction of the keys, and contains a mapping from the key to
that key's offset in the data file:
% hadoop fs -text numbers.map/index
1 128
129 6079
257 12054
385 18030
513 24002
641 29976
769 35947
897 41922
As we can see from the output, by default only every 128th key is included in the index,
although you can change this value either by setting the io.map.index.interval
property or by calling the setIndexInterval() method on the MapFile.Writer instance.
A reason to increase the index interval would be to decrease the amount of memory
that the MapFile needs to store the index. Conversely, you might decrease the interval
to improve the time for random selection (since fewer records need to be skipped on
average) at the expense of memory usage.
Since the index is only a partial index of keys, MapFile is not able to provide methods
to enumerate, or even count, all the keys it contains. The only way to perform these
operations is to read the whole file.
Reading a MapFile
Iterating through the entries in order in a MapFile is similar to the procedure for a
SequenceFile: you create a MapFile.Reader, then call the next() method until it returns
false, signifying that no entry was read because the end of the file was reached:
public boolean next(WritableComparable key, Writable val) throws IOException
A random access lookup can be performed by calling the get() method:
public Writable get(WritableComparable key, Writable val) throws IOException
The return value is used to determine if an entry was found in the MapFile; if it's null,
then no value exists for the given key. If key was found, then the value for that key is
read into val, as well as being returned from the method call.
It might be helpful to understand how this is implemented. Here is a snippet of code
that retrieves an entry for the MapFile we created in the previous section:
Text value = new Text();
reader.get(new IntWritable(496), value);
assertThat(value.toString(), is("One, two, buckle my shoe"));
For this operation, the MapFile.Reader reads the index file into memory (this is cached
so that subsequent random access calls will use the same in-memory index). The reader
then performs a binary search on the in-memory index to find the key in the index that
is less than or equal to the search key, 496. In this example, the index key found is 385,
with value 18030, which is the offset in the data file. Next the reader seeks to this offset
in the data file and reads entries until the key is greater than or equal to the search key,
496. In this case, a match is found and the value is read from the data file. Overall, a
lookup takes a single disk seek and a scan through up to 128 entries on disk. For a
random-access read, this is actually very efficient.
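The index step of this lookup can be sketched in plain Java. The class below is a hypothetical illustration rather than the MapFile.Reader implementation: it hardcodes the index entries printed earlier for numbers.map and shows only the binary search for the largest index key that is less than or equal to the search key. The seek and the sequential scan of the data file are not modeled.

```java
// Hypothetical illustration class; not part of Hadoop.
final class SparseIndexLookupSketch {

  // Index entries copied from the "hadoop fs -text numbers.map/index" output.
  static final int[] INDEX_KEYS = {1, 129, 257, 385, 513, 641, 769, 897};
  static final long[] DATA_OFFSETS = {128, 6079, 12054, 18030, 24002, 29976, 35947, 41922};

  /** Returns the position of the largest index key <= searchKey, or -1 if there is none. */
  static int floorIndex(int[] keys, int searchKey) {
    int low = 0;
    int high = keys.length - 1;
    int result = -1;
    while (low <= high) {
      int mid = (low + high) >>> 1;
      if (keys[mid] <= searchKey) {
        result = mid;      // candidate found; look for a larger one to the right
        low = mid + 1;
      } else {
        high = mid - 1;
      }
    }
    return result;
  }

  /** Returns the data-file offset where the scan for searchKey would start, or -1. */
  static long scanStartOffset(int searchKey) {
    int i = floorIndex(INDEX_KEYS, searchKey);
    return i < 0 ? -1L : DATA_OFFSETS[i];
  }

  public static void main(String[] args) {
    int i = floorIndex(INDEX_KEYS, 496);
    System.out.println(INDEX_KEYS[i] + " -> " + DATA_OFFSETS[i]); // prints "385 -> 18030"
  }
}
```

With an index interval of 128, the scan that follows the seek never has to pass more than 128 entries, which is where the bound quoted above comes from.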
The getClosest() method is like get() except it returns the closest match to the
specified key, rather than returning null on no match. More precisely, if the MapFile
contains the specified key, then that is the entry returned; otherwise, the key in the
MapFile that is immediately after (or before, according to a boolean argument) the
specified key is returned.
A very large MapFile's index can take up a lot of memory. Rather than reindex to change
the index interval, it is possible to load only a fraction of the index keys into memory
when reading the MapFile by setting the io.map.index.skip property. This property is
normally 0, which means no index keys are skipped; a value of 1 means skip one key
for every key in the index (so every other key ends up in the index), 2 means skip two
keys for every key in the index (so one third of the keys end up in the index), and so
on. Larger skip values save memory but at the expense of lookup time, since more
entries have to be scanned on disk, on average.
MapFile variants
Hadoop comes with a few variants on the general key-value MapFile interface:
• SetFile is a specialization of MapFile for storing a set of Writable keys. The keys
  must be added in sorted order.
• ArrayFile is a MapFile where the key is an integer representing the index of the
  element in the array, and the value is a Writable value.
• BloomMapFile is a MapFile which offers a fast version of the get() method, especially
  for sparsely populated files. The implementation uses a dynamic bloom filter for
  testing whether a given key is in the map. The test is very fast since it is in-memory,
  but it has a non-zero probability of false positives, in which case the regular
  get() method is called.
  There are two tuning parameters: io.mapfile.bloom.size for the (approximate)
  number of entries in the map (default 1,048,576), and
  io.mapfile.bloom.error.rate for the desired maximum error rate (default 0.005, which
  is 0.5%).
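The idea behind BloomMapFile's fast get() can be sketched in plain Java. This is a hypothetical illustration and not Hadoop's code: it uses a small fixed-size filter with two ad hoc hash functions rather than a dynamic bloom filter, and an in-memory HashMap stands in for the on-disk MapFile. A negative answer from the filter is definite, so the expensive lookup is skipped; a positive answer may be a false positive, so the regular lookup still runs.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Hadoop.
final class BloomGuardedMapSketch {
  private final BitSet bits;
  private final int numBits;
  // Stands in for the on-disk MapFile, where each lookup would cost a disk seek.
  private final Map<Integer, String> store = new HashMap<>();
  private int storeLookups = 0;

  BloomGuardedMapSketch(int numBits) {
    this.numBits = numBits;
    this.bits = new BitSet(numBits);
  }

  // Two simple multiplicative hashes, chosen only for illustration.
  private int h1(int key) { return Math.floorMod(key * 0x9E3779B1, numBits); }
  private int h2(int key) { return Math.floorMod(Integer.rotateLeft(key, 15) * 0x85EBCA6B + 1, numBits); }

  void put(int key, String value) {
    bits.set(h1(key));
    bits.set(h2(key));
    store.put(key, value);
  }

  /** False means the key is definitely absent; true means it may be present. */
  boolean mightContain(int key) {
    return bits.get(h1(key)) && bits.get(h2(key));
  }

  String get(int key) {
    if (!mightContain(key)) {
      return null;            // definite miss: no lookup in the backing store
    }
    storeLookups++;           // possible hit (or false positive): do the regular lookup
    return store.get(key);
  }

  int storeLookups() { return storeLookups; }
}
```

The filter never reports a stored key as absent, so correctness does not depend on the error rate; the error rate only affects how often the slower lookup is performed for keys that are not there.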
Converting a SequenceFile to a MapFile
One way of looking at a MapFile is as an indexed and sorted SequenceFile. So it's quite
natural to want to be able to convert a SequenceFile into a MapFile. We covered how
to sort a SequenceFile in "Sorting and merging SequenceFiles" on page 138, so here we
look at how to create an index for a SequenceFile. The program in Example 4-17 hinges
around the static utility method fix() on MapFile, which re-creates the index for a
MapFile.
Example 4-17. Re-creating the index for a MapFile
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.SequenceFile;

public class MapFileFixer {
  public static void main(String[] args) throws Exception {
    String mapUri = args[0];

    Configuration conf = new Configuration();

    FileSystem fs = FileSystem.get(URI.create(mapUri), conf);
    Path map = new Path(mapUri);
    Path mapData = new Path(map, MapFile.DATA_FILE_NAME);

    // Get key and value types from data sequence file
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, mapData, conf);
    Class keyClass = reader.getKeyClass();
    Class valueClass = reader.getValueClass();
    reader.close();

    // Create the map file index file
    long entries = MapFile.fix(fs, map, keyClass, valueClass, false, conf);
    System.out.printf("Created MapFile %s with %d entries\n", map, entries);
  }
}
The fix() method is usually used for re-creating corrupted indexes, but since it creates
a new index from scratch, it's exactly what we need here. The recipe is as follows:
1. Sort the sequence file numbers.seq into a new directory called numbers.map that will
become the MapFile. (If the sequence file is already sorted, then you can skip this
step. Instead, copy it to a file numbers.map/data, then go to step 3.)
% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar sort -r 1 \
-inFormat org.apache.hadoop.mapred.SequenceFileInputFormat \
-outFormat org.apache.hadoop.mapred.SequenceFileOutputFormat \
-outKey org.apache.hadoop.io.IntWritable \
-outValue org.apache.hadoop.io.Text \
numbers.seq numbers.map
2. Rename the MapReduce output to be the data file:
% hadoop fs -mv numbers.map/part-00000 numbers.map/data
3. Create the index file:
% hadoop MapFileFixer numbers.map
Created MapFile numbers.map with 100 entries
The MapFile numbers.map now exists and can be used.
CHAPTER 5
Developing a MapReduce Application
In Chapter 2, we introduced the MapReduce model. In this chapter, we look at the
practical aspects of developing a MapReduce application in Hadoop.
Writing a program in MapReduce has a certain flow to it. You start by writing your
map and reduce functions, ideally with unit tests to make sure they do what you expect.
Then you write a driver program to run a job, which can run from your IDE using a
small subset of the data to check that it is working. If it fails, then you can use your
IDE's debugger to find the source of the problem. With this information, you can
expand your unit tests to cover this case and improve your mapper or reducer as
appropriate to handle such input correctly.
When the program runs as expected against the small dataset, you are ready to unleash
it on a cluster. Running against the full dataset is likely to expose some more issues,
which you can fix as before, by expanding your tests and mapper or reducer to handle
the new cases. Debugging failing programs in the cluster is a challenge, so we look at
some common techniques to make it easier.
After the program is working, you may wish to do some tuning, first by running through
some standard checks for making MapReduce programs faster and then by doing task
profiling. Profiling distributed programs is not trivial, but Hadoop has hooks to aid the
process.
Before we start writing a MapReduce program, we need to set up and configure the
development environment. And to do that, we need to learn a bit about how Hadoop
does configuration.
The Configuration API
Components in Hadoop are configured using Hadoop's own configuration API. An
instance of the Configuration class (found in the org.apache.hadoop.conf package)
represents a collection of configuration properties and their values. Each property is
named by a String, and the type of a value may be one of several types, including Java
primitives such as boolean, int, long, float, and other useful types such as String, Class,
java.io.File, and collections of Strings.
Configurations read their properties from resources: XML files with a simple structure
for defining name-value pairs. See Example 5-1.
Example 5-1. A simple configuration file, configuration-1.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>color</name>
<value>yellow</value>
<description>Color</description>
</property>

<property>
<name>size</name>
<value>10</value>
<description>Size</description>
</property>

<property>
<name>weight</name>
<value>heavy</value>
<final>true</final>
<description>Weight</description>
</property>

<property>
<name>size-weight</name>
<value>${size},${weight}</value>
<description>Size and weight</description>
</property>
</configuration>
Assuming this configuration file is in a file called configuration-1.xml, we can access its
properties using a piece of code like this:
Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
assertThat(conf.get("color"), is("yellow"));
assertThat(conf.getInt("size", 0), is(10));
assertThat(conf.get("breadth", "wide"), is("wide"));
There are a couple of things to note: type information is not stored in the XML file;
instead, properties can be interpreted as a given type when they are read. Also, the
get() methods allow you to specify a default value, which is used if the property is not
defined in the XML file, as in the case of breadth here.
Combining Resources
Things get interesting when more than one resource is used to define a configuration.
This is used in Hadoop to separate out the default properties for the system, defined
internally in a file called core-default.xml, from the site-specific overrides, in
core-site.xml. The file in Example 5-2 defines the size and weight properties.
Example 5-2. A second configuration file, configuration-2.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>size</name>
<value>12</value>
</property>

<property>
<name>weight</name>
<value>light</value>
</property>
</configuration>
Resources are added to a Configuration in order:
Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
conf.addResource("configuration-2.xml");
Properties defined in resources that are added later override the earlier definitions. So
the size property takes its value from the second configuration file, configuration-2.xml:
assertThat(conf.getInt("size", 0), is(12));
However, properties that are marked as final cannot be overridden in later definitions.
The weight property is final in the first configuration file, so the attempt to override it
in the second fails, and it takes the value from the first:
assertThat(conf.get("weight"), is("heavy"));
Attempting to override final properties usually indicates a configuration error, so this
results in a warning message being logged to aid diagnosis. Administrators mark
properties as final in the daemon's site files that they don't want users to change in their
client-side configuration files or job submission parameters.
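The override and final rules described in this section can be modeled with a small plain-Java sketch. It is a hypothetical illustration and not Hadoop's Configuration class: resources are passed in as maps rather than parsed from XML, and the class name, method names, and warning text are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of Hadoop.
final class LayeredConfigSketch {
  private final Map<String, String> values = new HashMap<>();
  private final Set<String> finalNames = new HashSet<>();

  /** Merges one resource; finals names the properties that this resource marks as final. */
  void addResource(Map<String, String> properties, Set<String> finals) {
    for (Map.Entry<String, String> e : properties.entrySet()) {
      String name = e.getKey();
      if (finalNames.contains(name)) {
        // A later resource may not override a property already marked final.
        System.err.println("Ignoring attempt to override final property: " + name);
        continue;
      }
      values.put(name, e.getValue());   // later definitions replace earlier ones
      if (finals.contains(name)) {
        finalNames.add(name);
      }
    }
  }

  String get(String name) {
    return values.get(name);
  }

  String get(String name, String defaultValue) {
    String value = values.get(name);
    return value != null ? value : defaultValue;
  }

  public static void main(String[] args) {
    LayeredConfigSketch conf = new LayeredConfigSketch();
    // Values taken from configuration-1.xml and configuration-2.xml.
    conf.addResource(Map.of("color", "yellow", "size", "10", "weight", "heavy"),
        Set.of("weight"));
    conf.addResource(Map.of("size", "12", "weight", "light"), Set.of());
    System.out.println(conf.get("size") + "," + conf.get("weight")); // prints "12,heavy"
  }
}
```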
Variable Expansion
Configuration properties can be defined in terms of other properties, or system
properties. For example, the property size-weight in the first configuration file is defined
as ${size},${weight}, and these properties are expanded using the values found in the
configuration:
assertThat(conf.get("size-weight"), is("12,heavy"));
System properties take priority over properties defined in resource files:
System.setProperty("size", "14");
assertThat(conf.get("size-weight"), is("14,heavy"));
This feature is useful for overriding properties on the command line by using
-Dproperty=value JVM arguments.
Note that while configuration properties can be defined in terms of system properties,
unless system properties are redefined using configuration properties, they are not
accessible through the configuration API. Hence:
System.setProperty("length", "2");
assertThat(conf.get("length"), is((String) null));
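The expansion rules in this section can be sketched in plain Java. This is a hypothetical illustration and not Hadoop's implementation: the configuration and the system properties are both passed in as maps (the second standing in for System.getProperties()), expansion is a single pass, and unresolved references are left as they are, which is an assumption of the sketch.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Hadoop.
final class VariableExpansionSketch {
  private static final Pattern VARIABLE = Pattern.compile("\\$\\{([^}]+)\\}");

  /**
   * Returns the value of a configuration property with ${name} references expanded,
   * or null if the property is not defined in the configuration.
   */
  static String get(String name, Map<String, String> conf, Map<String, String> systemProps) {
    String raw = conf.get(name);
    if (raw == null) {
      return null;   // a system property alone does not define a configuration property
    }
    Matcher m = VARIABLE.matcher(raw);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      String variable = m.group(1);
      // System properties take priority over configuration properties.
      String replacement = systemProps.containsKey(variable)
          ? systemProps.get(variable)
          : conf.get(variable);
      if (replacement == null) {
        replacement = m.group();   // assumption: leave unresolved references untouched
      }
      m.appendReplacement(out, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(out);
    return out.toString();
  }
}
```

Called with a configuration holding size=12, weight=heavy and size-weight=${size},${weight}, the sketch returns 12,heavy when no system properties are set and 14,heavy once the system-property map contains size=14, matching the assertions in the text.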
Configuring the Development Environment
The first step is to download the version of Hadoop that you plan to use and unpack
it on your development machine (this is described in Appendix A). Then, in your
favorite IDE, create a new project and add all the JAR files from the top level of the
unpacked distribution and from the lib directory to the classpath. You will then be able
to compile Java Hadoop programs and run them in local (standalone) mode within the
IDE.
For Eclipse users, there is a plug-in available for browsing HDFS and
launching MapReduce programs. Instructions are available on the
Hadoop wiki at http://wiki.apache.org/hadoop/EclipsePlugIn.
Alternatively, Karmasphere provides Eclipse and NetBeans plug-ins for
developing and running MapReduce jobs and browsing Hadoop
clusters.
Managing Configuration
When developing Hadoop applications, it is common to switch between running the
application locally and running it on a cluster. In fact, you may have several clusters
you work with, or you may have a local "pseudo-distributed" cluster that you like to
test on (a pseudo-distributed cluster is one whose daemons all run on the local machine;
setting up this mode is covered in Appendix A, too).
One way to accommodate these variations is to have Hadoop configuration files
containing the connection settings for each cluster you run against, and specify which one
you are using when you run Hadoop applications or tools. As a matter of best practice,
it's recommended to keep these files outside Hadoop's installation directory tree, as
this makes it easy to switch between Hadoop versions without duplicating or losing
settings.
For the purposes of this book, we assume the existence of a directory called conf that
contains three configuration files: hadoop-local.xml, hadoop-localhost.xml, and
hadoop-cluster.xml (these are available in the example code for this book). Note that
there is nothing special about the names of these files; they are just convenient ways
to package up some configuration settings. (Compare this to Table A-1 in
Appendix A, which sets out the equivalent server-side configurations.)
The hadoop-local.xml file contains the default Hadoop configuration for the default
filesystem and the jobtracker:
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>file:///</value>
</property>

<property>
<name>mapred.job.tracker</name>
<value>local</value>
</property>

</configuration>
The settings in hadoop-localhost.xml point to a namenode and a jobtracker both
running on localhost:
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost/</value>
</property>

<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>

</configuration>
Finally, hadoop-cluster.xml contains details of the cluster's namenode and jobtracker
addresses. In practice, you would name the file after the name of the cluster, rather
than "cluster" as we have here:
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://namenode/</value>
</property>

<property>
<name>mapred.job.tracker</name>
<value>jobtracker:8021</value>
</property>

</configuration>
You can add other configuration properties to these files as needed. For example, if you
wanted to set your Hadoop username for a particular cluster, you could do it in the
appropriate file.
Setting User Identity
The user identity that Hadoop uses for permissions in HDFS is determined by running
the whoami command on the client system. Similarly, the group names are derived from
the output of running groups.
If, however, your Hadoop user identity is different from the name of your user account
on your client machine, then you can explicitly set your Hadoop username and group
names by setting the hadoop.job.ugi property. The username and group names are
specified as a comma-separated list of strings (e.g., preston,directors,inventors would
set the username to preston and the group names to directors and inventors).
You can set the user identity that the HDFS web interface runs as by setting
dfs.web.ugi using the same syntax. By default, it is webuser,webgroup, which is not a
super user, so system files are not accessible through the web interface.
Notice that, by default, there is no authentication with this system. See
"Security" on page 323 for how to use Kerberos authentication with Hadoop.
With this setup, it is easy to use any configuration with the -conf command-line switch.
For example, the following command shows a directory listing on the HDFS server
running in pseudo-distributed mode on localhost:
% hadoop fs -conf conf/hadoop-localhost.xml -ls .
Found 2 items
drwxr-xr-x - tom supergroup 0 2009-04-08 10:32 /user/tom/input
drwxr-xr-x - tom supergroup 0 2009-04-08 13:09 /user/tom/output
If you omit the -conf option, then you pick up the Hadoop configuration in the conf
subdirectory under $HADOOP_INSTALL. Depending on how you set this up, this may be
for a standalone setup or a pseudo-distributed cluster.
Tools that come with Hadoop support the -conf option, but it's also straightforward
to make your programs (such as programs that run MapReduce jobs) support it, too,
using the Tool interface.
GenericOptionsParser, Tool, and ToolRunner
Hadoop comes with a few helper classes for making it easier to run jobs from the
command line. GenericOptionsParser is a class that interprets common Hadoop
command-line options and sets them on a Configuration object for your application to
use as desired. You don't usually use GenericOptionsParser directly, as it's more
convenient to implement the Tool interface and run your application with the
ToolRunner, which uses GenericOptionsParser internally:
public interface Tool extends Configurable {
int run(String [] args) throws Exception;
}
Example 5-3 shows a very simple implementation of Tool, for printing the keys and
values of all the properties in the Tool's Configuration object.
Example 5-3. An example Tool implementation for printing the properties in a Configuration
import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ConfigurationPrinter extends Configured implements Tool {

  static {
    // Pick up the HDFS and MapReduce resources in addition to the core ones
    Configuration.addDefaultResource("hdfs-default.xml");
    Configuration.addDefaultResource("hdfs-site.xml");
    Configuration.addDefaultResource("mapred-default.xml");
    Configuration.addDefaultResource("mapred-site.xml");
  }

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    for (Entry<String, String> entry: conf) {
      System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());
    }
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new ConfigurationPrinter(), args);
    System.exit(exitCode);
  }
}
We make ConfigurationPrinter a subclass of Configured, which is an implementation
of the Configurable interface. All implementations of Tool need to implement
Configurable (since Tool extends it), and subclassing Configured is often the easiest way
to achieve this. The run() method obtains the Configuration using Configurable's
getConf() method and then iterates over it, printing each property to standard output.
The static block makes sure that the HDFS and MapReduce configurations are picked
up in addition to the core ones (which Configuration knows about already).
ConfigurationPrinter's main() method does not invoke its own run() method directly.
Instead, we call ToolRunner's static run() method, which takes care of creating a
Configuration object for the Tool, before calling its run() method. ToolRunner also uses
a GenericOptionsParser to pick up any standard options specified on the command line
and set them on the Configuration instance. We can see the effect of picking up the
properties specified in conf/hadoop-localhost.xml by running the following command:
% hadoop ConfigurationPrinter -conf conf/hadoop-localhost.xml \
| grep mapred.job.tracker=
mapred.job.tracker=localhost:8021
Which Properties Can I Set?
ConfigurationPrinter is a useful tool for telling you what a property is set to in your
environment.
You can also see the default settings for all the public properties in Hadoop by looking
in the docs directory of your Hadoop installation for HTML files called
core-default.html, hdfs-default.html and mapred-default.html. Each property has a
description that explains what it is for and what values it can be set to.
Be aware that some properties have no effect when set in the client configuration. For
example, if in your job submission you set mapred.tasktracker.map.tasks.maximum with
the expectation that it would change the number of task slots for the tasktrackers
running your job, then you would be disappointed, since this property is only honored if
set in the tasktracker's mapred-site.xml file. In general, you can tell the component
where a property should be set by its name, so the fact that
mapred.tasktracker.map.tasks.maximum starts with mapred.tasktracker gives you a clue that it can
be set only for the tasktracker daemon. This is not a hard and fast rule, however, so in
some cases you may need to resort to trial and error, or even reading the source.
We discuss many of Hadoop's most important configuration properties throughout
this book. You can find a configuration property reference on the book's website at
http://www.hadoopbook.com.
GenericOptionsParser also allows you to set individual properties. For example:
% hadoop ConfigurationPrinter -D color=yellow | grep color
color=yellow
The -D option is used to set the configuration property with key color to the value
yellow. Options specified with -D take priority over properties from the configuration
files. This is very useful: you can put defaults into configuration files and then override
them with the -D option as needed. A common example of this is setting the number
of reducers for a MapReduce job via -D mapred.reduce.tasks=n. This will override the
number of reducers set on the cluster or set in any client-side configuration files.
The other options that GenericOptionsParser and ToolRunner support are listed in
Table 5-1. You can find more on Hadoop's configuration API in "The Configuration
API" on page 146.
Do not confuse setting Hadoop properties using the -D
property=value option to GenericOptionsParser (and ToolRunner) with
setting JVM system properties using the -Dproperty=value option to the
java command. The syntax for JVM system properties does not allow
any whitespace between the D and the property name, whereas
GenericOptionsParser requires them to be separated by whitespace.
JVM system properties are retrieved from the java.lang.System class,
whereas Hadoop properties are accessible only from a Configuration
object. So, the following command will print nothing, since the
System class is not used by ConfigurationPrinter:
% hadoop -Dcolor=yellow ConfigurationPrinter | grep color
If you want to be able to set configuration through system properties,
then you need to mirror the system properties of interest in the
configuration file. See "Variable Expansion" on page 148 for further
discussion.
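The way -D options end up overriding file-based settings can be sketched with a toy argument parser in plain Java. It is a hypothetical illustration and not GenericOptionsParser: the class and method names are made up, only the whitespace-separated -D property=value form described above is handled, and the configuration is a plain map that is assumed to have been loaded from resource files beforehand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Hadoop.
final class DashDOptionSketch {

  /** Applies "-D property=value" pairs to conf and returns the remaining arguments. */
  static List<String> apply(String[] args, Map<String, String> conf) {
    List<String> remaining = new ArrayList<>();
    for (int i = 0; i < args.length; i++) {
      boolean isPair = args[i].equals("-D")
          && i + 1 < args.length
          && args[i + 1].indexOf('=') > 0;
      if (isPair) {
        String pair = args[++i];          // consume the property=value argument
        int eq = pair.indexOf('=');
        // Setting the value last means it replaces anything loaded from files.
        conf.put(pair.substring(0, eq), pair.substring(eq + 1));
      } else {
        remaining.add(args[i]);           // left for the application itself
      }
    }
    return remaining;
  }
}
```

Applying the options after the resource files have been loaded is what gives them priority in this sketch; the application then sees only its own arguments in the returned list.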
Table 5-1. GenericOptionsParser and ToolRunner options
Option Description
-D property=value
    Sets the given Hadoop configuration property to the given value. Overrides any default
    or site properties in the configuration, and any properties set via the -conf option.
-conf filename ...
    Adds the given files to the list of resources in the configuration. This is a convenient way
    to set site properties or to set a number of properties at once.
-fs uri
    Sets the default filesystem to the given URI. Shortcut for -D fs.default.name=uri
-jt host:port
    Sets the jobtracker to the given host and port. Shortcut for -D
    mapred.job.tracker=host:port
-files file1,file2,...
    Copies the specified files from the local filesystem (or any filesystem if a scheme is
    specified) to the shared filesystem used by the jobtracker (usually HDFS) and makes
    them available to MapReduce programs in the task's working directory. (See "Distributed
    Cache" on page 288 for more on the distributed cache mechanism for copying files to
    tasktracker machines.)
-archives archive1,archive2,...
    Copies the specified archives from the local filesystem (or any filesystem if a scheme is
    specified) to the shared filesystem used by the jobtracker (usually HDFS), unarchives
    them, and makes them available to MapReduce programs in the task's working
    directory.
-libjars jar1,jar2,...
    Copies the specified JAR files from the local filesystem (or any filesystem if a scheme is
    specified) to the shared filesystem used by the jobtracker (usually HDFS), and adds them
    to the MapReduce task's classpath. This option is a useful way of shipping JAR files that
    a job is dependent on.
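The override order described for -D and -conf can be illustrated without Hadoop. The following is only an analogy, a minimal sketch that uses java.util.Properties rather than Hadoop's Configuration class; the class name LayeredProperties and the three layers (defaults, a -conf file, the command line) are assumptions made for the example.

```java
import java.util.Properties;

// Illustration only: mimics "defaults < -conf file < -D option" precedence
// with java.util.Properties. This is not Hadoop's Configuration class.
class LayeredProperties {

  // Each later layer falls back to the earlier one when a key is absent.
  static Properties build(Properties defaults, Properties confFile,
      Properties commandLine) {
    Properties layer1 = new Properties();
    layer1.putAll(defaults);
    Properties layer2 = new Properties(layer1); // -conf overrides defaults
    layer2.putAll(confFile);
    Properties layer3 = new Properties(layer2); // -D overrides everything
    layer3.putAll(commandLine);
    return layer3;
  }

  public static void main(String[] args) {
    Properties defaults = new Properties();
    defaults.setProperty("mapred.reduce.tasks", "1");
    defaults.setProperty("mapred.job.tracker", "local");

    Properties confFile = new Properties();
    confFile.setProperty("mapred.reduce.tasks", "10");

    Properties commandLine = new Properties();
    commandLine.setProperty("mapred.reduce.tasks", "32");

    Properties effective = build(defaults, confFile, commandLine);
    System.out.println(effective.getProperty("mapred.reduce.tasks")); // 32
    System.out.println(effective.getProperty("mapred.job.tracker"));  // local
  }
}
```

The value given on the command line wins, and a property that is set only in the lowest layer is still visible through the upper ones.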
Writing a Unit Test
The map and reduce functions in MapReduce are easy to test in isolation, which is a
consequence of their functional style. For known inputs, they produce known outputs.
However, since outputs are written to a Context (or an OutputCollector in the old API),
rather than simply being returned from the method call, the Context needs to be
replaced with a mock so that its outputs can be verified. There are several Java mock
object frameworks that can help build mocks; here we use Mockito, which is noted for
its clean syntax, although any mock framework should work just as well.[1]
All of the tests described here can be run from within an IDE.
Mapper
The test for the mapper is shown in Example 5-4.
Example 5-4. Unit test for MaxTemperatureMapper
import static org.mockito.Mockito.*;

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.junit.*;

public class MaxTemperatureMapperTest {

  @Test
  public void processesValidRecord() throws IOException, InterruptedException {
    MaxTemperatureMapper mapper = new MaxTemperatureMapper();

    Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
                                  // Year ^^^^
        "99999V0203201N00261220001CN9999999N9-00111+99999999999");
                              // Temperature ^^^^^
    MaxTemperatureMapper.Context context =
        mock(MaxTemperatureMapper.Context.class);

    mapper.map(null, value, context);

    verify(context).write(new Text("1950"), new IntWritable(-11));
  }
}
[1] See also the MRUnit project (http://incubator.apache.org/mrunit/), which aims to make unit testing
MapReduce programs easier.
The test is very simple: it passes a weather record as input to the mapper, then checks
that the output is the year and temperature reading. The input key is ignored by the mapper,
so we can pass in anything, including null as we do here. To create a mock Context,
we call Mockito's mock() method (a static import), passing the class of the type we want
to mock. Then we invoke the mapper's map() method, which executes the code being
tested. Finally, we verify that the mock object was called with the correct method and
arguments, using Mockito's verify() method (again, statically imported). Here we
verify that Context's write() method was called with a Text object representing the year
(1950) and an IntWritable representing the temperature (-11, that is -1.1°C, since
readings are recorded in tenths of a degree).
Proceeding in a test-driven fashion, we create a Mapper implementation that passes the
test (see Example 5-5). Since we will be evolving the classes in this chapter, each is put
in a different package indicating its version for ease of exposition. For example,
v1.MaxTemperatureMapper is version 1 of MaxTemperatureMapper. In reality, of course, you would
evolve classes without repackaging them.
Example 5-5. First version of a Mapper that passes MaxTemperatureMapperTest
public class MaxTemperatureMapper
  extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature = Integer.parseInt(line.substring(87, 92));
    context.write(new Text(year), new IntWritable(airTemperature));
  }
}
This is a very simple implementation, which pulls the year and temperature fields from
the line and writes them to the Context. Let's add a test for missing values, which in
the raw data are represented by a temperature of +9999:
  @Test
  public void ignoresMissingTemperatureRecord() throws IOException,
      InterruptedException {
    MaxTemperatureMapper mapper = new MaxTemperatureMapper();

    Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
                                  // Year ^^^^
        "99999V0203201N00261220001CN9999999N9+99991+99999999999");
                              // Temperature ^^^^^
    MaxTemperatureMapper.Context context =
        mock(MaxTemperatureMapper.Context.class);

    mapper.map(null, value, context);

    verify(context, never()).write(any(Text.class), any(IntWritable.class));
  }
Since records with missing temperatures should be filtered out, this test uses Mockito
to verify that the write() method on the Context is never called for any Text key or
IntWritable value.
Run against the existing implementation, the new test fails with a NumberFormatException,
as parseInt() cannot parse integers with a leading plus sign, so we fix up the
implementation (version 2) to handle missing values:
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();
    String year = line.substring(15, 19);
    String temp = line.substring(87, 92);
    if (!missing(temp)) {
      int airTemperature = Integer.parseInt(temp);
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }

  private boolean missing(String temp) {
    return temp.equals("+9999");
  }
With the test for the mapper passing, we move on to writing the reducer.
Reducer
The reducer has to find the maximum value for a given key. Here's a simple test for
this feature:
  @Test
  public void returnsMaximumIntegerInValues() throws IOException,
      InterruptedException {
    MaxTemperatureReducer reducer = new MaxTemperatureReducer();

    Text key = new Text("1950");
    List<IntWritable> values = Arrays.asList(
        new IntWritable(10), new IntWritable(5));
    MaxTemperatureReducer.Context context =
        mock(MaxTemperatureReducer.Context.class);

    reducer.reduce(key, values, context);

    verify(context).write(key, new IntWritable(10));
  }
We construct a list of some IntWritable values and then verify that
MaxTemperatureReducer picks the largest. The code in Example 5-6 is for an
implementation of MaxTemperatureReducer that passes the test. Notice that we haven't tested the
case of an empty values iterator, but arguably we don't need to, since MapReduce
would never call the reducer in this case, as every key produced by a mapper has a value.
Example 5-6. Reducer for maximum temperature example
public class MaxTemperatureReducer
  extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values,
      Context context)
      throws IOException, InterruptedException {

    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}
Running Locally on Test Data
Now that we've got the mapper and reducer working on controlled inputs, the next
step is to write a job driver and run it on some test data on a development machine.
Running a Job in a Local Job Runner
Using the Tool interface introduced earlier in the chapter, it's easy to write a driver to
run our MapReduce job for finding the maximum temperature by year (see
MaxTemperatureDriver in Example 5-7).
Example 5-7. Application to find the maximum temperature
public class MaxTemperatureDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>\n",
          getClass().getSimpleName());
      ToolRunner.printGenericCommandUsage(System.err);
      return -1;
    }

    Job job = new Job(getConf(), "Max temperature");
    job.setJarByClass(getClass());

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MaxTemperatureDriver(), args);
    System.exit(exitCode);
  }
}
MaxTemperatureDriver implements the Tool interface, so we get the benefit of being able
to set the options that GenericOptionsParser supports. The run() method constructs
a Job object based on the tool's configuration, which it uses to launch a job. Among the
possible job configuration parameters, we set the input and output file paths, the
mapper, reducer, and combiner classes, and the output types (the input types are determined
by the input format, which defaults to TextInputFormat and has LongWritable keys and
Text values). It's also a good idea to set a name for the job (Max temperature), so that
you can pick it out in the job list during execution and after it has completed. By default,
the name is the name of the JAR file, which is normally not particularly descriptive.
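The driver reuses MaxTemperatureReducer as the combiner. That is safe here because taking a maximum gives the same answer whether it is applied once to all values or first to each map's output and then to the partial results. The following plain-Java sketch shows this property in isolation; the class name MaxCombinerSketch and the sample readings are assumptions for illustration, and no Hadoop classes are involved.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: shows why "max" can be used as both combiner and reducer.
class MaxCombinerSketch {

  // Same loop as the reducer, on plain integers.
  static int max(List<Integer> values) {
    int maxValue = Integer.MIN_VALUE;
    for (int value : values) {
      maxValue = Math.max(maxValue, value);
    }
    return maxValue;
  }

  public static void main(String[] args) {
    // Hypothetical readings for one year, produced by two different map tasks
    List<Integer> firstMapOutput = Arrays.asList(0, 20, 10);
    List<Integer> secondMapOutput = Arrays.asList(25, 15);

    // Without a combiner: the reducer sees every value
    int direct = max(Arrays.asList(0, 20, 10, 25, 15));

    // With a combiner: the reducer sees one partial maximum per map task
    int combined = max(Arrays.asList(max(firstMapOutput), max(secondMapOutput)));

    System.out.println(direct + " " + combined); // 25 25
  }
}
```

A function such as a mean does not have this property, so a reducer that computes one could not be reused as a combiner unchanged.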
Now we can run this application against some local files. Hadoop comes with a local
job runner, a cut-down version of the MapReduce execution engine for running
MapReduce jobs in a single JVM. It's designed for testing and is very convenient for use in
an IDE, since you can run it in a debugger to step through the code in your mapper and
reducer.
The local job runner is only designed for simple testing of MapReduce
programs, so inevitably it differs from the full MapReduce
implementation. The biggest difference is that it can't run more than one reducer.
(It can support the zero reducer case, too.) This is normally not a
problem, as most applications can work with one reducer, although on a
cluster you would choose a larger number to take advantage of
parallelism. The thing to watch out for is that even if you set the number of
reducers to a value over one, the local runner will silently ignore the
setting and use a single reducer.
This limitation may be removed in a future version of Hadoop.
The local job runner is enabled by a configuration setting. Normally,
mapred.job.tracker is a host:port pair to specify the address of the jobtracker, but when
it has the special value of local, the job is run in-process without accessing an external
jobtracker.
From the command line, we can run the driver by typing:
% hadoop v2.MaxTemperatureDriver -conf conf/hadoop-local.xml \
  input/ncdc/micro output
Equivalently, we could use the -fs and -jt options provided by GenericOptionsParser:
% hadoop v2.MaxTemperatureDriver -fs file:/// -jt local input/ncdc/micro output
This command executes MaxTemperatureDriver using input from the local input/ncdc/micro
directory, producing output in the local output directory. Note that although
we've set -fs so we use the local filesystem (file:///), the local job runner will actually
work fine against any filesystem, including HDFS (and it can be handy to do this if you
have a few files that are on HDFS).
When we run the program, it fails and prints the following exception:
java.lang.NumberFormatException: For input string: "+0000"
Fixing the mapper
This exception shows that the map method still can't parse positive temperatures. (If
the stack trace hadn't given us enough information to diagnose the fault, we could run
the test in a local debugger, since it runs in a single JVM.) Earlier, we made it handle
the special case of missing temperature, +9999, but not the general case of any positive
temperature. With more logic going into the mapper, it makes sense to factor out a
parser class to encapsulate the parsing logic; see Example 5-8 (now on version 3).
Example 5-8. A class for parsing weather records in NCDC format
public class NcdcRecordParser {

  private static final int MISSING_TEMPERATURE = 9999;

  private String year;
  private int airTemperature;
  private String quality;

  public void parse(String record) {
    year = record.substring(15, 19);
    String airTemperatureString;
    // Remove leading plus sign as parseInt doesn't like them
    if (record.charAt(87) == '+') {
      airTemperatureString = record.substring(88, 92);
    } else {
      airTemperatureString = record.substring(87, 92);
    }
    airTemperature = Integer.parseInt(airTemperatureString);
    quality = record.substring(92, 93);
  }

  public void parse(Text record) {
    parse(record.toString());
  }

  public boolean isValidTemperature() {
    return airTemperature != MISSING_TEMPERATURE && quality.matches("[01459]");
  }

  public String getYear() {
    return year;
  }

  public int getAirTemperature() {
    return airTemperature;
  }
}
The resulting mapper is much simpler (see Example 5-9). It just calls the parser's
parse() method, which parses the fields of interest from a line of input, checks whether
a valid temperature was found using the isValidTemperature() query method, and if it
was, retrieves the year and the temperature using the getter methods on the parser.
Notice that we also check the quality status field as well as missing temperatures in
isValidTemperature() to filter out poor temperature readings.
Another benefit of creating a parser class is that it makes it easy to write related mappers
for similar jobs without duplicating code. It also gives us the opportunity to write unit
tests directly against the parser, for more targeted testing.
Example 5-9. A Mapper that uses a utility class to parse records
public class MaxTemperatureMapper
  extends Mapper<LongWritable, Text, Text, IntWritable> {

  private NcdcRecordParser parser = new NcdcRecordParser();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    parser.parse(value);
    if (parser.isValidTemperature()) {
      context.write(new Text(parser.getYear()),
          new IntWritable(parser.getAirTemperature()));
    }
  }
}
With these changes, the test passes.
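Because the parsing logic no longer depends on the mapper, it can be exercised with plain strings. The sketch below is a simplified, String-only restatement of the parser's field extraction, written so that it runs without Hadoop on the classpath; the class name NcdcParseSketch and its static methods are assumptions for illustration and are not part of the book's example code. Note also that from Java 7 onward Integer.parseInt() accepts a leading plus sign, so the stripping step is harmless but only strictly needed on older JVMs.

```java
// Illustration only: String-based restatement of the field offsets used by
// NcdcRecordParser, checked against the sample records from the mapper tests.
class NcdcParseSketch {

  static final int MISSING_TEMPERATURE = 9999;

  static String year(String record) {
    return record.substring(15, 19);
  }

  static int airTemperature(String record) {
    // Skip a leading plus sign before parsing, as the parser class does
    int start = record.charAt(87) == '+' ? 88 : 87;
    return Integer.parseInt(record.substring(start, 92));
  }

  static boolean isValidTemperature(String record) {
    String quality = record.substring(92, 93);
    return airTemperature(record) != MISSING_TEMPERATURE
        && quality.matches("[01459]");
  }

  public static void main(String[] args) {
    String valid = "0043011990999991950051518004+68750+023550FM-12+0382"
        + "99999V0203201N00261220001CN9999999N9-00111+99999999999";
    String missing = "0043011990999991950051518004+68750+023550FM-12+0382"
        + "99999V0203201N00261220001CN9999999N9+99991+99999999999";

    System.out.println(year(valid) + " " + airTemperature(valid)
        + " " + isValidTemperature(valid));
    System.out.println(year(missing) + " " + airTemperature(missing)
        + " " + isValidTemperature(missing));
  }
}
```

The first record yields the year 1950 with a valid reading of -11, and the second is rejected because its temperature field holds the missing-value marker.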
Testing the Driver
Apart from the flexible configuration options offered by making your application
implement Tool, you also make it more testable because it allows you to inject an arbitrary
Configuration. You can take advantage of this to write a test that uses a local job runner
to run a job against known input data, which checks that the output is as expected.
There are two approaches to doing this. The first is to use the local job runner and run
the job against a test file on the local filesystem. The code in Example 5-10 gives an
idea of how to do this.
Example 5-10. A test for MaxTemperatureDriver that uses a local, in-process job runner
  @Test
  public void test() throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "file:///");
    conf.set("mapred.job.tracker", "local");

    Path input = new Path("input/ncdc/micro");
    Path output = new Path("output");

    FileSystem fs = FileSystem.getLocal(conf);
    fs.delete(output, true); // delete old output

    MaxTemperatureDriver driver = new MaxTemperatureDriver();
    driver.setConf(conf);

    int exitCode = driver.run(new String[] {
        input.toString(), output.toString() });
    assertThat(exitCode, is(0));

    checkOutput(conf, output);
  }
The test explicitly sets fs.default.name and mapred.job.tracker so it uses the local
filesystem and the local job runner. It then runs the MaxTemperatureDriver via its Tool
interface against a small amount of known data. At the end of the test, the
checkOutput() method is called to compare the actual output with the expected output, line by
line.
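The body of checkOutput() is not listed in this section. As a rough idea of what a line-by-line comparison involves, here is a minimal sketch that works on the local filesystem using java.nio.file only. The class name OutputComparer, the method firstDifference(), and its return convention are assumptions made for this example; they are not the helper from the book's example code, and the Path here is java.nio.file.Path rather than Hadoop's Path class.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: compares expected and actual output line by line.
class OutputComparer {

  // Returns the 1-based number of the first differing line, or -1 if the
  // two lists are identical (same lines, same length).
  static int firstDifference(List<String> expected, List<String> actual) {
    int common = Math.min(expected.size(), actual.size());
    for (int i = 0; i < common; i++) {
      if (!expected.get(i).equals(actual.get(i))) {
        return i + 1;
      }
    }
    return expected.size() == actual.size() ? -1 : common + 1;
  }

  // Convenience overload that reads both files from the local filesystem.
  static int firstDifference(Path expected, Path actual) throws IOException {
    return firstDifference(
        Files.readAllLines(expected, StandardCharsets.UTF_8),
        Files.readAllLines(actual, StandardCharsets.UTF_8));
  }
}
```

A test could then assert that firstDifference() returns -1 for the expected file and the job's part file; reporting the line number rather than a bare boolean makes a failing regression test easier to diagnose.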
The second way of testing the driver is to run it using a "mini-" cluster. Hadoop has a
pair of testing classes, called MiniDFSCluster and MiniMRCluster, which provide a
programmatic way of creating in-process clusters. Unlike the local job runner, these allow
testing against the full HDFS and MapReduce machinery. Bear in mind, too, that
tasktrackers in a mini-cluster launch separate JVMs to run tasks in, which can make
debugging more difficult.
Mini-clusters are used extensively in Hadoop's own automated test suite, but they can
be used for testing user code, too. Hadoop's ClusterMapReduceTestCase abstract class
provides a useful base for writing such a test, handles the details of starting and stopping
the in-process HDFS and MapReduce clusters in its setUp() and tearDown() methods,
and generates a suitable configuration object that is set up to work with them.
Subclasses need only populate data in HDFS (perhaps by copying from a local file), run a
MapReduce job, then confirm the output is as expected. Refer to the
MaxTemperatureDriverMiniTest class in the example code that comes with this book for the listing.
Tests like this serve as regression tests, and are a useful repository of input edge cases
and their expected results. As you encounter more test cases, you can simply add them
to the input file and update the file of expected output accordingly.
Running on a Cluster
Now that we are happy with the program running on a small test dataset, we are ready
to try it on the full dataset on a Hadoop cluster. Chapter 9 covers how to set up a fully
distributed cluster, although you can also work through this section on a
pseudo-distributed cluster.
Packaging
We don't need to make any modifications to the program to run on a cluster rather
than on a single machine, but we do need to package the program as a JAR file to send
to the cluster. This is conveniently achieved using Ant, using a task such as this (you
can find the complete build file in the example code):
<jar
destfile="hadoop-examples.jar" basedir="${classes.dir}"/>
If you have a single job per JAR, then you can specify the main class to run in the JAR
file's manifest. If the main class is not in the manifest, then it must be specified on the
command line (as you will see shortly). Also, any dependent JAR files should be
packaged in a lib subdirectory in the JAR file. (This is analogous to a Java Web application
archive, or WAR file, except in that case the JAR files go in a WEB-INF/lib subdirectory
in the WAR file.)
Launching a Job
To launch the job, we need to run the driver, specifying the cluster that we want to run
the job on with the -conf option (we could equally have used the -fs and -jt options):
% hadoop jar hadoop-examples.jar v3.MaxTemperatureDriver -conf conf/hadoop-cluster.xml \
input/ncdc/all max-temp
The waitForCompletion() method on Job launches the job and polls for progress,
writing a line summarizing the map and reduce's progress whenever either changes. Here's
the output (some lines have been removed for clarity):
09/04/11 08:15:52 INFO mapred.FileInputFormat: Total input paths to process : 101
09/04/11 08:15:53 INFO mapred.JobClient: Running job: job_200904110811_0002
09/04/11 08:15:54 INFO mapred.JobClient: map 0% reduce 0%
09/04/11 08:16:06 INFO mapred.JobClient: map 28% reduce 0%
09/04/11 08:16:07 INFO mapred.JobClient: map 30% reduce 0%
...
09/04/11 08:21:36 INFO mapred.JobClient: map 100% reduce 100%
09/04/11 08:21:38 INFO mapred.JobClient: Job complete: job_200904110811_0002
09/04/11 08:21:38 INFO mapred.JobClient: Counters: 19
09/04/11 08:21:38 INFO mapred.JobClient: Job Counters
09/04/11 08:21:38 INFO mapred.JobClient: Launched reduce tasks=32
09/04/11 08:21:38 INFO mapred.JobClient: Rack-local map tasks=82
09/04/11 08:21:38 INFO mapred.JobClient: Launched map tasks=127
09/04/11 08:21:38 INFO mapred.JobClient: Data-local map tasks=45
09/04/11 08:21:38 INFO mapred.JobClient: FileSystemCounters
09/04/11 08:21:38 INFO mapred.JobClient: FILE_BYTES_READ=12667214
09/04/11 08:21:38 INFO mapred.JobClient: HDFS_BYTES_READ=33485841275
09/04/11 08:21:38 INFO mapred.JobClient: FILE_BYTES_WRITTEN=989397
09/04/11 08:21:38 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=904
09/04/11 08:21:38 INFO mapred.JobClient: Map-Reduce Framework
09/04/11 08:21:38 INFO mapred.JobClient: Reduce input groups=100
09/04/11 08:21:38 INFO mapred.JobClient: Combine output records=4489
09/04/11 08:21:38 INFO mapred.JobClient: Map input records=1209901509
09/04/11 08:21:38 INFO mapred.JobClient: Reduce shuffle bytes=19140
09/04/11 08:21:38 INFO mapred.JobClient: Reduce output records=100
09/04/11 08:21:38 INFO mapred.JobClient: Spilled Records=9481
09/04/11 08:21:38 INFO mapred.JobClient: Map output bytes=10282306995
09/04/11 08:21:38 INFO mapred.JobClient: Map input bytes=274600205558
09/04/11 08:21:38 INFO mapred.JobClient: Combine input records=1142482941
09/04/11 08:21:38 INFO mapred.JobClient: Map output records=1142478555
09/04/11 08:21:38 INFO mapred.JobClient: Reduce input records=103
The output includes more useful information. Before the job starts, its ID is printed:
this is needed whenever you want to refer to the job, in logfiles for example, or when
interrogating it via the hadoop job command. When the job is complete, its statistics
(known as counters) are printed out. These are very useful for confirming that the job
did what you expected. For example, for this job we can see that around 275 GB of
input data was analyzed ("Map input bytes"), read from around 34 GB of compressed
files on HDFS ("HDFS_BYTES_READ"). The input was broken into 101 gzipped files
of reasonable size, so there was no problem with not being able to split them.
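The figures quoted above come straight from the counter values in the job output. As a worked check, the following sketch converts the two byte counters to decimal gigabytes and derives the compression ratio and the fraction of input records that produced map output. The class and method names are assumptions for illustration; only the four counter values are taken from the log.

```java
import java.util.Locale;

// Illustration only: arithmetic on counter values copied from the job output.
class CounterArithmetic {

  // Decimal gigabytes (10^9 bytes), matching the rounding used in the text.
  static double toGigabytes(long bytes) {
    return bytes / 1e9;
  }

  static double ratio(long numerator, long denominator) {
    return (double) numerator / denominator;
  }

  public static void main(String[] args) {
    long mapInputBytes = 274600205558L;   // "Map input bytes"
    long hdfsBytesRead = 33485841275L;    // "HDFS_BYTES_READ"
    long mapInputRecords = 1209901509L;   // "Map input records"
    long mapOutputRecords = 1142478555L;  // "Map output records"

    System.out.printf(Locale.ROOT, "uncompressed input: %.1f GB%n",
        toGigabytes(mapInputBytes));
    System.out.printf(Locale.ROOT, "compressed input:   %.1f GB%n",
        toGigabytes(hdfsBytesRead));
    System.out.printf(Locale.ROOT, "compression ratio:  %.1f to 1%n",
        ratio(mapInputBytes, hdfsBytesRead));
    System.out.printf(Locale.ROOT, "records kept by the mapper: %.1f%%%n",
        100 * ratio(mapOutputRecords, mapInputRecords));
  }
}
```

The gzipped input expands roughly eightfold, and a little over 94% of input records pass the mapper's validity check; the remainder are the missing or poor-quality readings that it filters out.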
Job, Task, and Task Attempt IDs
The format of a job ID is composed of the time that the jobtracker (not the job) started
and an incrementing counter maintained by the jobtracker to uniquely identify the job
to that instance of the jobtracker. So the job with this ID:
job_200904110811_0002
is the second (0002, job IDs are 1-based) job run by the jobtracker which started at
08:11 on April 11, 2009. The counter is formatted with leading zeros to make job IDs
sort nicely, in directory listings, for example. However, when the counter reaches
10000 it is not reset, resulting in longer job IDs (which don't sort so well).
Tasks belong to a job, and their IDs are formed by replacing the job prefix of a job ID
with a task prefix, and adding a suffix to identify the task within the job. For example:
task_200904110811_0002_m_000003
is the fourth (000003, task IDs are 0-based) map (m) task of the job with ID
job_200904110811_0002. The task IDs are created for a job when it is initialized, so they
do not necessarily dictate the order that the tasks will be executed in.
Tasks may be executed more than once, due to failure (see "Task Failure" on page 200)
or speculative execution (see "Speculative Execution" on page 213), so to identify
different instances of a task execution, task attempts
are given unique IDs on the jobtracker. For example:
attempt_200904110811_0002_m_000003_0
is the first (0, attempt IDs are 0-based) attempt at running task
task_200904110811_0002_m_000003. Task attempts are allocated during the job run as
needed, so their ordering represents the order that they were created for tasktrackers
to run.
The final count in the task attempt ID is incremented by 1,000 if the job is restarted
after the jobtracker is restarted and recovers its running jobs (although this behavior is
disabled by default; see "Jobtracker Failure" on page 202).
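The relationship between the three kinds of ID can be made concrete by taking an attempt ID apart. The sketch below does so with a regular expression inferred from the examples shown above; the pattern, the class name TaskAttemptIdSketch, and the shape of the returned array are assumptions for illustration. Real code would normally rely on Hadoop's own ID classes rather than on string matching.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: derives the job ID and task ID from a task attempt ID,
// using a pattern inferred from the examples in the text.
class TaskAttemptIdSketch {

  // attempt_<jobtracker start time>_<job counter>_<m|r>_<task number>_<attempt>
  // The job counter is at least four digits, since it is not reset at 10000.
  private static final Pattern ATTEMPT = Pattern.compile(
      "attempt_(\\d{12})_(\\d{4,})_([mr])_(\\d{6})_(\\d+)");

  // Returns { jobId, taskId, task type ("m" or "r"), attempt number }.
  static String[] parse(String attemptId) {
    Matcher m = ATTEMPT.matcher(attemptId);
    if (!m.matches()) {
      throw new IllegalArgumentException("Not a task attempt ID: " + attemptId);
    }
    String jobId = "job_" + m.group(1) + "_" + m.group(2);
    String taskId = "task_" + m.group(1) + "_" + m.group(2)
        + "_" + m.group(3) + "_" + m.group(4);
    return new String[] { jobId, taskId, m.group(3), m.group(5) };
  }

  public static void main(String[] args) {
    String[] parts = parse("attempt_200904110811_0002_m_000003_0");
    System.out.println(parts[0]); // job_200904110811_0002
    System.out.println(parts[1]); // task_200904110811_0002_m_000003
    System.out.println(parts[2] + " attempt " + parts[3]); // m attempt 0
  }
}
```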
The MapReduce Web UI
Hadoop comes with a web UI for viewing information about your jobs. It is useful for
following a job's progress while it is running, as well as finding job statistics and logs
after the job has completed. You can find the UI at http://jobtracker-host:50030/.
The jobtracker page
A screenshot of the home page is shown in Figure 5-1. The first section of the page gives
details of the Hadoop installation, such as the version number and when it was
compiled, and the current state of the jobtracker (in this case, running), and when it was
started.
Next is a summary of the cluster, which has measures of cluster capacity and utilization.
This shows the number of maps and reduces currently running on the cluster, the total
number of job submissions, the number of tasktracker nodes currently available, and
the cluster's capacity: in terms of the number of map and reduce slots available across
the cluster ("Map Task Capacity" and "Reduce Task Capacity"), and the number of
available slots per node, on average. The number of tasktrackers that have been
blacklisted by the jobtracker is listed as well (blacklisting is discussed in "Tasktracker
Failure" on page 201).
Below the summary, there is a section about the job scheduler that is running (here the
default). You can click through to see job queues.
Further down, we see sections for running, (successfully) completed, and failed jobs.
Each of these sections has a table of jobs, with a row per job that shows the job's ID,
owner, name (as set in the Job constructor or setJobName() method, both of which
internally set the mapred.job.name property) and progress information.
Finally, at the foot of the page, there are links to the jobtracker's logs, and the
jobtracker's history: information on all the jobs that the jobtracker has run. The main view
displays only 100 jobs (configurable via the mapred.jobtracker.completeuserjobs.maximum
property), before consigning them to the history page. Note also that the job
history is persistent, so you can find jobs here from previous runs of the jobtracker.
Figure 5-1. Screenshot of the jobtracker page
Job History
Job history refers to the events and configuration for a completed job. It is retained
whether the job was successful or not, in an attempt to provide interesting information
for the user running a job.
Job history files are stored on the local filesystem of the jobtracker in a history
subdirectory of the logs directory. It is possible to set the location to an arbitrary Hadoop
filesystem via the hadoop.job.history.location property. The jobtracker's history files
are kept for 30 days before being deleted by the system.
A second copy is also stored for the user in the _logs/history subdirectory of the
job's output directory. This location may be overridden by setting
hadoop.job.history.user.location. By setting it to the special value none, no user job
history is saved, although job history is still saved centrally. A user's job history files
are never deleted by the system.
The history log includes job, task, and attempt events, all of which are stored in a
plain-text file. The history for a particular job may be viewed through the web UI, or via the
command line, using hadoop job -history (which you point at the job's output
directory).
The job page
Clicking on a job ID brings you to a page for the job, illustrated in Figure 5-2. At the
top of the page is a summary of the job, with basic information such as job owner and
name, and how long the job has been running for. The job file is the consolidated
configuration file for the job, containing all the properties and their values that were in
effect during the job run. If you are unsure of what a particular property was set to, you
can click through to inspect the file.
While the job is running, you can monitor its progress on this page, which periodically
updates itself. Below the summary is a table that shows the map progress and the reduce
progress. "Num Tasks" shows the total number of map and reduce tasks for this job
(a row for each). The other columns then show the state of these tasks: "Pending"
(waiting to run), "Running," "Complete" (successfully run), "Killed" (tasks that have
failed; this column would be more accurately labeled "Failed"). The final column
shows the total number of failed and killed task attempts for all the map or reduce tasks
for the job (task attempts may be marked as killed if they are a speculative execution
duplicate, if the tasktracker they are running on dies, or if they are killed by a user). See
"Task Failure" on page 200 for background on task failure.
Further down the page, you can find completion graphs for each task that show their
progress graphically. The reduce completion graph is divided into the three phases of
the reduce task: copy (when the map outputs are being transferred to the reduce's
tasktracker), sort (when the reduce inputs are being merged), and reduce (when the
reduce function is being run to produce the final output). The phases are described in
more detail in "Shuffle and Sort" on page 205.
In the middle of the page is a table of job counters. These are dynamically updated
during the job run, and provide another useful window into the job's progress and
general health. There is more information about what these counters mean in
"Built-in Counters" on page 257.
Retrieving the Results
Once the job is finished, there are various ways to retrieve the results. Each reducer
produces one output file, so there are 30 part files named part-r-00000 to
part-r-00029 in the max-temp directory.
As their names suggest, a good way to think of these "part" files is as
parts of the max-temp "file."
If the output is large (which it isn't in this case), then it is important to
have multiple parts so that more than one reducer can work in parallel.
Usually, if a file is in this partitioned form, it can still be used easily
enough: as the input to another MapReduce job, for example. In some
cases, you can exploit the structure of multiple partitions to do a
map-side join, for example ("Map-Side Joins" on page 282), or a MapFile
lookup ("An application: Partitioned MapFile lookups" on page 269).
This job produces a very small amount of output, so it is convenient to copy it from
HDFS to our development machine. The -getmerge option to the hadoop fs command
is useful here, as it gets all the files in the directory specified in the source pattern and
merges them into a single file on the local filesystem:
% hadoop fs -getmerge max-temp max-temp-local
% sort max-temp-local | tail
1991 607
1992 605
1993 567
1994 568
1995 567
1996 561
1997 565
1998 568
1999 568
2000 558
We sorted the output, as the reduce output partitions are unordered (owing to the hash
partition function). Doing a bit of postprocessing of data from MapReduce is very
common, as is feeding it into analysis tools, such as R, a spreadsheet, or even a relational
database.
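To see why the merged output is not in year order, consider how a hash partition function spreads keys over reducers. The following sketch uses the usual arithmetic of a hash partitioner (a non-negative hash code taken modulo the number of partitions) together with the part file naming seen above. It is a sketch under stated assumptions: the class HashPartitionSketch is hypothetical, and String.hashCode() stands in for the hash of Hadoop's Text keys, so the actual assignment of years to part files on a cluster will differ.

```java
import java.util.Locale;

// Illustration only: hash partitioning of year keys across 30 reducers.
// String.hashCode() is a stand-in for the hash used by Hadoop's key types.
class HashPartitionSketch {

  // Non-negative hash code modulo the number of partitions.
  static int partitionFor(String key, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  // Output file name for a reducer, e.g. partition 7 -> part-r-00007.
  static String partFileName(int partition) {
    return String.format(Locale.ROOT, "part-r-%05d", partition);
  }

  public static void main(String[] args) {
    int numReducers = 30;
    for (int year = 1991; year <= 2000; year++) {
      String key = Integer.toString(year);
      System.out.println(key + " -> "
          + partFileName(partitionFor(key, numReducers)));
    }
  }
}
```

Consecutive years generally land in different part files, so concatenating the parts in file order gives no guarantee of key order across files, which is why the merged file is sorted afterwards.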
Figure 5-2. Screenshot of the job page
Another way of retrieving the output if it is small is to use the -cat option to print the
output files to the console:
% hadoop fs -cat max-temp/*
On closer inspection, we see that some of the results don't look plausible. For instance,
the maximum temperature for 1951 (not shown here) is 590°C! How do we find out
what's causing this? Is it corrupt input data or a bug in the program?
Debugging a Job
The time-honored way of debugging programs is via print statements, and this is
certainly possible in Hadoop. However, there are complications to consider: with
programs running on tens, hundreds, or thousands of nodes, how do we find and examine
the output of the debug statements, which may be scattered across these nodes? For
this particular case, where we are looking for (what we think is) an unusual case, we
can use a debug statement to log to standard error, in conjunction with a message to
update the task's status message to prompt us to look in the error log. The web UI
makes this easy, as we will see.
We also create a custom counter to count the total number of records with implausible
temperatures in the whole dataset. This gives us valuable information about how to
deal with the condition: if it turns out to be a common occurrence, then we might
need to learn more about the condition and how to extract the temperature in these
cases, rather than simply dropping the record. In fact, when trying to debug a job, you
should always ask yourself if you can use a counter to get the information you need to
find out what's happening. Even if you need to use logging or a status message, it may
be useful to use a counter to gauge the extent of the problem. (There is more on counters
in "Counters" on page 257.)
If the amount of log data you produce in the course of debugging is large, then you've
got a couple of options. The first is to write the information to the map's output, rather
than to standard error, for analysis and aggregation by the reduce. This approach
usually necessitates structural changes to your program, so start with the other techniques
first. Alternatively, you can write a program (in MapReduce of course) to analyze the
logs produced by your job.
We add our debugging to the mapper (version 4), as opposed to the reducer, as we
want to find out what the source data causing the anomalous output looks like:
public class MaxTemperatureMapper
  extends Mapper<LongWritable, Text, Text, IntWritable> {

  enum Temperature {
    OVER_100
  }

  private NcdcRecordParser parser = new NcdcRecordParser();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    parser.parse(value);
    if (parser.isValidTemperature()) {
      int airTemperature = parser.getAirTemperature();
      if (airTemperature > 1000) {
        System.err.println("Temperature over 100 degrees for input: " + value);
        context.setStatus("Detected possibly corrupt record: see logs.");
        context.getCounter(Temperature.OVER_100).increment(1);
      }
      context.write(new Text(parser.getYear()), new IntWritable(airTemperature));
    }
  }
}
If the temperature is over 100°C (represented by 1000, since temperatures are in tenths
of a degree), we print a line to standard error with the suspect line, as well as updating
the map's status message using the setStatus() method on Context directing us to look
in the log. We also increment a counter, which in Java is represented by a field of an
enum type. In this program, we have defined a single field OVER_100 as a way to count
the number of records with a temperature of over 100°C.
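The threshold check and the enum-keyed counter can be tried outside Hadoop. What follows is a minimal plain-Java sketch and not the Hadoop counter API: the class OverHundredCount, its count() method, and the EnumMap that stands in for the framework's counters are hypothetical names introduced only for illustration.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical stand-in for Hadoop counters: an EnumMap keyed by an enum field.
public class OverHundredCount {

  public enum Temperature {
    OVER_100
  }

  // Temperatures are in tenths of a degree Celsius, so 1000 means 100.0 degrees.
  static final int THRESHOLD_TENTHS = 1000;

  // Counts the readings strictly above the threshold, mirroring the mapper's check.
  public static Map<Temperature, Long> count(int[] readingsInTenths) {
    Map<Temperature, Long> counters = new EnumMap<>(Temperature.class);
    counters.put(Temperature.OVER_100, 0L);
    for (int reading : readingsInTenths) {
      if (reading > THRESHOLD_TENTHS) {
        counters.merge(Temperature.OVER_100, 1L, Long::sum);
      }
    }
    return counters;
  }

  public static void main(String[] args) {
    // Made-up readings; 1000 itself is not counted because the test is "greater than".
    int[] readings = {220, -110, 1000, 5900};
    System.out.println(count(readings));
  }
}
```

In a real job the framework aggregates the per-task counts for us, which is what makes a counter cheaper than trawling through logs.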
With this modification, we recompile the code, re-create the JAR file, then rerun the
job, and while it's running go to the tasks page.
The tasks page
The job page has a number of links for looking at the tasks in a job in more detail. For
example, by clicking on the "map" link, you are brought to a page that lists information
for all of the map tasks on one page. You can also see just the completed tasks. The
screenshot in Figure 5-3 shows a portion of this page for the job run with our debugging
statements. Each row in the table is a task, and it provides such information as the start
and end times for each task, any errors reported back from the tasktracker, and a link
to view the counters for an individual task.
The "Status" column can be helpful for debugging, since it shows a task's latest status
message. Before a task starts, it shows its status as "initializing," then once it starts
reading records it shows the split information for the split it is reading as a filename
with a byte offset and length. You can see the status we set for debugging for task
task_200904110811_0003_m_000044, so let's click through to the logs page to find the
associated debug message. (Notice, too, that there is an extra counter for this task, since
our user counter has a nonzero count for this task.)
The task details page
From the tasks page, you can click on any task to get more information about it. The
task details page, shown in Figure 5-4, shows each task attempt. In this case, there was
one task attempt, which completed successfully. The table provides further useful data,
such as the node the task attempt ran on, and links to task logfiles and counters.
The "Actions" column contains links for killing a task attempt. By default, this is
disabled, making the web UI a read-only interface. Set webinterface.private.actions to
true to enable the actions links.
Figure 5-3. Screenshot of the tasks page
Figure 5-4. Screenshot of the task details page
By setting webinterface.private.actions to true, you also allow anyone
with access to the HDFS web interface to delete files. The dfs.web.ugi
property determines the user that the HDFS web UI runs as, thus
controlling which files may be viewed and deleted.
For map tasks, there is also a section showing which nodes the input split was located
on.
By following one of the links to the logfiles for the successful task attempt (you can see
the last 4 KB or 8 KB of each logfile, or the entire file), we can find the suspect input
record that we logged (the line is wrapped and truncated to fit on the page):
Temperature over 100 degrees for input:
0335999999433181957042302005+37950+139117SAO +0004RJSN V020113590031500703569999994
33201957010100005+35317+139650SAO +000899999V02002359002650076249N004000599+0067...
This record seems to be in a different format to the others. For one thing, there are
spaces in the line, which are not described in the specification.
When the job has finished, we can look at the value of the counter we defined to see
how many records over 100°C there are in the whole dataset. Counters are accessible
via the web UI or the command line:
% hadoop job -counter job_200904110811_0003 'v4.MaxTemperatureMapper$Temperature' \
OVER_100
3
The -counter option takes the job ID, counter group name (which is the fully qualified
classname here), and the counter name (the enum name). There are only three
malformed records in the entire dataset of over a billion records. Throwing out bad records
is standard for many big data problems, although we need to be careful in this case,
since we are looking for an extreme value (the maximum temperature) rather than an
aggregate measure. Still, throwing away three records is probably not going to change
the result.
Handling malformed data
Capturing input data that causes a problem is valuable, as we can use it in a test to
check that the mapper does the right thing:
@Test
public void parsesMalformedTemperature() throws IOException,
    InterruptedException {
  MaxTemperatureMapper mapper = new MaxTemperatureMapper();
  Text value = new Text("0335999999433181957042302005+37950+139117SAO +0004" +
      // Year ^^^^
      "RJSN V02011359003150070356999999433201957010100005+353");
      // Temperature ^^^^^
  MaxTemperatureMapper.Context context =
      mock(MaxTemperatureMapper.Context.class);
  Counter counter = mock(Counter.class);
  when(context.getCounter(MaxTemperatureMapper.Temperature.MALFORMED))
      .thenReturn(counter);

  mapper.map(null, value, context);

  verify(context, never()).write(any(Text.class), any(IntWritable.class));
  verify(counter).increment(1);
}
The record that was causing the problem is of a different format to the other lines we've
seen. Example 5-11 shows a modified program (version 5) using a parser that ignores
each line with a temperature field that does not have a leading sign (plus or minus).
We've also introduced a counter to measure the number of records that we are ignoring
for this reason.
Example 5-11. Mapper for maximum temperature example
public class MaxTemperatureMapper
  extends Mapper<LongWritable, Text, Text, IntWritable> {

  enum Temperature {
    MALFORMED
  }

  private NcdcRecordParser parser = new NcdcRecordParser();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    parser.parse(value);
    if (parser.isValidTemperature()) {
      int airTemperature = parser.getAirTemperature();
      context.write(new Text(parser.getYear()), new IntWritable(airTemperature));
    } else if (parser.isMalformedTemperature()) {
      System.err.println("Ignoring possibly corrupt input: " + value);
      context.getCounter(Temperature.MALFORMED).increment(1);
    }
  }
}
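The mapper above relies on the parser to decide whether a temperature field is malformed, that is, whether it lacks a leading plus or minus sign. One way such a check might look is sketched below in plain Java. This is not the book's NcdcRecordParser: the class SignedTemperatureField and its parse() method are hypothetical, and the five-character field width is an assumption made for the example.

```java
import java.util.OptionalInt;

// Hypothetical helper, not the NcdcRecordParser used by the mapper.
public class SignedTemperatureField {

  // Parses a five-character field such as "+0022" or "-0011" (tenths of a degree).
  // Returns an empty result when the field has no leading sign or is not numeric.
  public static OptionalInt parse(String field) {
    if (field == null || field.length() != 5) {
      return OptionalInt.empty();
    }
    char sign = field.charAt(0);
    if (sign != '+' && sign != '-') {
      return OptionalInt.empty();  // malformed: no leading sign
    }
    String digits = field.substring(1);
    for (int i = 0; i < digits.length(); i++) {
      if (!Character.isDigit(digits.charAt(i))) {
        return OptionalInt.empty();
      }
    }
    int magnitude = Integer.parseInt(digits);
    return OptionalInt.of(sign == '-' ? -magnitude : magnitude);
  }

  public static void main(String[] args) {
    System.out.println(parse("+0022"));  // well-formed, positive
    System.out.println(parse("-0011"));  // well-formed, negative
    System.out.println(parse("V0201"));  // no leading sign, treated as malformed
  }
}
```

Returning an empty result, rather than throwing, lets the caller count the malformed record and carry on, which is the behavior the test above expects of the mapper.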
Hadoop Logs
Hadoop produces logs in various places, for various audiences. These are summarized
in Table 5-2.
Table 5-2. Types of Hadoop logs
Logs | Primary audience | Description | Further information
System daemon logs | Administrators | Each Hadoop daemon produces a logfile (using log4j) and another file that combines standard out and error. Written in the directory defined by the HADOOP_LOG_DIR environment variable. | "System logfiles" on page 307 and "Logging" on page 349.
HDFS audit logs | Administrators | A log of all HDFS requests, turned off by default. Written to the namenode's log, although this is configurable. | "Audit Logging" on page 344.
MapReduce job history logs | Users | A log of the events (such as task completion) that occur in the course of running a job. Saved centrally on the jobtracker, and in the job's output directory in a _logs/history subdirectory. | "Job History" on page 166.
MapReduce task logs | Users | Each tasktracker child process produces a logfile using log4j (called syslog), a file for data sent to standard out (stdout), and a file for standard error (stderr). Written in the userlogs subdirectory of the directory defined by the HADOOP_LOG_DIR environment variable. | This section.
As we have seen in the previous section, MapReduce task logs are accessible through
the web UI, which is the most convenient way to view them. You can also find the
logfiles on the local filesystem of the tasktracker that ran the task attempt, in a directory
named by the task attempt. If task JVM reuse is enabled ("Task JVM Reuse" on
page 216), then each logfile accumulates the logs for the entire JVM run, so
multiple task attempts will be found in each logfile. The web UI hides this by showing
only the portion that is relevant for the task attempt being viewed.
It is straightforward to write to these logfiles. Anything written to standard output, or
standard error, is directed to the relevant logfile. (Of course, in Streaming, standard
output is used for the map or reduce output, so it will not show up in the standard
output log.)
In Java, you can write to the task's syslog file if you wish by using the Apache Commons
Logging API. This is shown in Example 5-12.
Example 5-12. An identity mapper that writes to standard output and also uses the Apache Commons
Logging API
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.mapreduce.Mapper;

public class LoggingIdentityMapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
  extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  private static final Log LOG = LogFactory.getLog(LoggingIdentityMapper.class);

  @Override
  public void map(KEYIN key, VALUEIN value, Context context)
      throws IOException, InterruptedException {
    // Log to stdout file
    System.out.println("Map key: " + key);

    // Log to syslog file
    LOG.info("Map key: " + key);
    if (LOG.isDebugEnabled()) {
      LOG.debug("Map value: " + value);
    }
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}
The default log level is INFO, so DEBUG level messages do not appear in the syslog task
log file. However, sometimes you want to see these messages. To do this, set
mapred.map.child.log.level or mapred.reduce.child.log.level, as appropriate (from
0.22). For example, in this case we could set it for the mapper to see the map values in
the log as follows:
% hadoop jar hadoop-examples.jar LoggingDriver -conf conf/hadoop-cluster.xml \
-D mapred.map.child.log.level=DEBUG input/ncdc/sample.txt logging-out
There are some controls for managing retention and size of task logs. By default, logs
are deleted after a minimum of 24 hours (set using the mapred.userlog.retain.hours
property). You can also set a cap on the maximum size of each logfile using the
mapred.userlog.limit.kb property, which is 0 by default, meaning there is no cap.
Sometimes you may need to debug a problem that you suspect is
occurring in the JVM running a Hadoop command, rather than on the
cluster. You can send DEBUG level logs to the console by using an
invocation like this:
% HADOOP_ROOT_LOGGER=DEBUG,console hadoop fs -text /foo/bar
Remote Debugging
When a task fails and there is not enough information logged to diagnose the error,
you may want to resort to running a debugger for that task. This is hard to arrange
when running the job on a cluster, as you don't know which node is going to process
which part of the input, so you can't set up your debugger ahead of the failure. However,
there are a few other options available:
Reproduce the failure locally
Often the failing task fails consistently on a particular input. You can try to
reproduce the problem locally by downloading the file that the task is failing on and
running the job locally, possibly using a debugger such as Java's VisualVM.
Use JVM debugging options
A common cause of failure is a Java out of memory error in the task JVM. You can
set mapred.child.java.opts to include -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/path/to/dumps to produce a heap dump which can be examined
afterwards with tools like jhat or the Eclipse Memory Analyzer. Note that the JVM
options should be added to the existing memory settings specified by
mapred.child.java.opts. These are explained in more detail in "Memory" on
page 305.
Use task profiling
Java profilers give a lot of insight into the JVM, and Hadoop provides a mechanism
to profile a subset of the tasks in a job. See "Profiling Tasks" on page 177.
Use IsolationRunner
Older versions of Hadoop provided a special task runner called IsolationRunner
that could rerun failed tasks in situ on the cluster. Unfortunately, it is no longer
available in recent versions, but you can track its replacement at
https://issues.apache.org/jira/browse/MAPREDUCE-237.
In some cases it's useful to keep the intermediate files for a failed task attempt for later
inspection, particularly if supplementary dump or profile files are created in the task's
working directory. You can set keep.failed.task.files to true to keep a failed task's
files.
You can keep the intermediate files for successful tasks, too, which may be handy if
you want to examine a task that isn't failing. In this case, set the property
keep.task.files.pattern to a regular expression that matches the IDs of the tasks you
want to keep.
To examine the intermediate files, log into the node that the task failed on and look for
the directory for that task attempt. It will be under one of the local MapReduce
directories, as set by the mapred.local.dir property (covered in more detail in "Important
Hadoop Daemon Properties" on page 309). If this property is a comma-separated list
of directories (to spread load across the physical disks on a machine), then you may
need to look in all of the directories before you find the directory for that particular
task attempt. The task attempt directory is in the following location:
mapred.local.dir/taskTracker/jobcache/job-ID/task-attempt-ID

Tuning a Job
After a job is working, the question many developers ask is, "Can I make it run faster?"
There are a few Hadoop-specific "usual suspects" that are worth checking to see if they
are responsible for a performance problem. You should run through the checklist in
Table 5-3 before you start trying to profile or optimize at the task level.
Table 5-3. Tuning checklist
Area | Best practice | Further information
Number of mappers | How long are your mappers running for? If they are only running for a few seconds on average, then you should see if there's a way to have fewer mappers and make them all run longer, a minute or so, as a rule of thumb. The extent to which this is possible depends on the input format you are using. | "Small files and CombineFileInputFormat" on page 237
Number of reducers | For maximum performance, the number of reducers should be slightly less than the number of reduce slots in the cluster. This allows the reducers to finish in one wave and fully utilizes the cluster during the reduce phase. | "Choosing the Number of Reducers" on page 229
Combiners | Can your job take advantage of a combiner to reduce the amount of data passing through the shuffle? | "Combiner Functions" on page 34
Intermediate compression | Job execution time can almost always benefit from enabling map output compression. | "Compressing map output" on page 94
Custom serialization | If you are using your own custom Writable objects, or custom comparators, then make sure you have implemented RawComparator. | "Implementing a RawComparator for speed" on page 108
Shuffle tweaks | The MapReduce shuffle exposes around a dozen tuning parameters for memory management, which may help you eke out the last bit of performance. | "Configuration Tuning" on page 209
Profiling Tasks
Like debugging, profiling a job running on a distributed system like MapReduce
presents some challenges. Hadoop allows you to profile a fraction of the tasks in a job,
and, as each task completes, pulls down the profile information to your machine for
later analysis with standard profiling tools.
Of course, it's possible, and somewhat easier, to profile a job running in the local job
runner. And provided you can run with enough input data to exercise the map and
reduce tasks, this can be a valuable way of improving the performance of your mappers
and reducers. There are a couple of caveats, however. The local job runner is a very
different environment from a cluster, and the data flow patterns are very different.
Optimizing the CPU performance of your code may be pointless if your MapReduce
job is I/O-bound (as many jobs are). To be sure that any tuning is effective, you should
compare the new execution time with the old running on a real cluster. Even this is
easier said than done, since job execution times can vary due to resource contention
with other jobs and the decisions the scheduler makes to do with task placement. To
get a good idea of job execution time under these circumstances, perform a series of
runs (with and without the change) and check whether any improvement is statistically
significant.
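One common way to compare two series of runs is Welch's t statistic, which allows the two series to have different variances. The following plain-Java sketch is only an illustration: the class RunComparison and its methods are hypothetical, the run times in main() are made up, and the choice of Welch's test is ours rather than something the text prescribes. The sketch computes the statistic only; turning it into a p-value needs the t distribution, which is left to a statistics library. It also assumes at least two runs per series and some variation between runs, since otherwise the divisions below are by zero.

```java
// Hypothetical helper for comparing two series of job execution times.
public class RunComparison {

  public static double mean(double[] xs) {
    double sum = 0.0;
    for (double x : xs) {
      sum += x;
    }
    return sum / xs.length;
  }

  // Unbiased sample variance (divides by n - 1), so it needs at least two runs.
  public static double sampleVariance(double[] xs) {
    double m = mean(xs);
    double sumOfSquares = 0.0;
    for (double x : xs) {
      sumOfSquares += (x - m) * (x - m);
    }
    return sumOfSquares / (xs.length - 1);
  }

  // Welch's t statistic for the difference in means of two independent samples.
  public static double welchT(double[] before, double[] after) {
    double standardError = Math.sqrt(sampleVariance(before) / before.length
        + sampleVariance(after) / after.length);
    return (mean(before) - mean(after)) / standardError;
  }

  public static void main(String[] args) {
    // Made-up execution times in seconds, five runs with and without a change.
    double[] before = {312, 298, 305, 320, 301};
    double[] after = {300, 296, 310, 315, 299};
    System.out.println("t = " + welchT(before, after));
  }
}
```

A statistic close to zero means the difference in means is small relative to the run-to-run noise, which is a hint that the change made no measurable difference.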
It's unfortunately true that some problems (such as excessive memory use) can be
reproduced only on the cluster, and in these cases the ability to profile in situ is
indispensable.
The HPROF profiler
There are a number of configuration properties to control profiling, which are also
exposed via convenience methods on JobConf. The following modification to
MaxTemperatureDriver (version 6) will enable remote HPROF profiling. HPROF is a
profiling tool that comes with the JDK that, although basic, can give valuable
information about a program's CPU and heap usage (see footnote 2):
Configuration conf = getConf();
conf.setBoolean("mapred.task.profile", true);
conf.set("mapred.task.profile.params", "-agentlib:hprof=cpu=samples," +
"heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s");
conf.set("mapred.task.profile.maps", "0-2");
conf.set("mapred.task.profile.reduces", ""); // no reduces
Job job = new Job(conf, "Max temperature");
The first property setting enables profiling, which by default is turned off. (Instead of
using mapred.task.profile you can also use the JobContext.TASK_PROFILE constant in
the new API.)
Next we set the profile parameters, which are the extra command-line arguments to
pass to the task's JVM. (When profiling is enabled, a new JVM is allocated for each
task, even if JVM reuse is turned on; see "Task JVM Reuse" on page 216.) The default
parameters specify the HPROF profiler; here we set an extra HPROF option, depth=6,
to give more stack trace depth than the HPROF default. (Using
JobContext.TASK_PROFILE_PARAMS is equivalent to setting the mapred.task.profile.params
property.)
Finally, we specify which tasks we want to profile. We normally only want profile
information from a few tasks, so we use the properties mapred.task.profile.maps and
mapred.task.profile.reduces to specify the range of (map or reduce) task IDs that we
want profile information for. We've set the maps property to 0-2 (which is actually the
default), which means map tasks with IDs 0, 1, and 2 are profiled. A set of ranges is
permitted, using a notation that allows open ranges. For example, 0-1,4,6- would
specify all tasks except those with IDs 2, 3, and 5. The tasks to profile can also be
controlled using the JobContext.NUM_MAP_PROFILES constant for map tasks, and
JobContext.NUM_REDUCE_PROFILES for reduce tasks.
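To make the range notation concrete, here is a small plain-Java sketch that expands a range string into the task IDs it selects. It is not Hadoop's own parser: the class TaskRanges and its methods are hypothetical, and the sketch handles only the three forms mentioned above (a single ID, a closed range, and an open-ended range), with no validation of malformed input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for range strings such as "0-1,4,6-"; not Hadoop's own class.
public class TaskRanges {

  // Returns the task IDs in [0, taskCount) that the range string selects.
  public static List<Integer> select(String ranges, int taskCount) {
    List<Integer> selected = new ArrayList<>();
    for (int id = 0; id < taskCount; id++) {
      if (matches(ranges, id)) {
        selected.add(id);
      }
    }
    return selected;
  }

  // True if id falls in any comma-separated item: "n", "n-m", or the open range "n-".
  public static boolean matches(String ranges, int id) {
    for (String item : ranges.split(",")) {
      String part = item.trim();
      if (part.isEmpty()) {
        continue;  // an empty string selects nothing
      }
      int dash = part.indexOf('-');
      if (dash < 0) {
        if (Integer.parseInt(part) == id) {
          return true;
        }
      } else {
        int start = Integer.parseInt(part.substring(0, dash));
        String endText = part.substring(dash + 1);
        int end = endText.isEmpty() ? Integer.MAX_VALUE : Integer.parseInt(endText);
        if (id >= start && id <= end) {
          return true;
        }
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // With eight tasks, this selects every ID except 2, 3, and 5.
    System.out.println(select("0-1,4,6-", 8));
  }
}
```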
When we run a job with the modified driver, the profile output turns up at the end of
the job in the directory we launched the job from. Since we are only profiling a few
tasks, we can run the job on a subset of the dataset.
2. HPROF uses byte code insertion to profile your code, so you do not need to recompile your application
with special options to use it. For more information on HPROF, see "HPROF: A Heap/CPU Profiling
Tool in J2SE 5.0," by Kelly O'Hair at
http://java.sun.com/developer/technicalArticles/Programming/HPROF.html.
Here's a snippet of one of the mapper's profile files, which shows the CPU sampling
information:
CPU SAMPLES BEGIN (total = 1002) Sat Apr 11 11:17:52 2009
rank self accum count trace method
1 3.49% 3.49% 35 307969 java.lang.Object.<init>
2 3.39% 6.89% 34 307954 java.lang.Object.<init>
3 3.19% 10.08% 32 307945 java.util.regex.Matcher.<init>
4 3.19% 13.27% 32 307963 java.lang.Object.<init>
5 3.19% 16.47% 32 307973 java.lang.Object.<init>
Cross-referencing the trace number 307973 gives us the stacktrace from the same file:
TRACE 307973: (thread=200001)
java.lang.Object.<init>(Object.java:20)
org.apache.hadoop.io.IntWritable.<init>(IntWritable.java:29)
v5.MaxTemperatureMapper.map(MaxTemperatureMapper.java:30)
v5.MaxTemperatureMapper.map(MaxTemperatureMapper.java:14)
org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:356)
So it looks like the mapper is spending 3% of its time constructing IntWritable objects.
This observation suggests that it might be worth reusing the Writable instances being
output (version 7, see Example 5-13).
Example 5-13. Reusing the Text and IntWritable output objects
public class MaxTemperatureMapper
  extends Mapper<LongWritable, Text, Text, IntWritable> {

  enum Temperature {
    MALFORMED
  }

  private NcdcRecordParser parser = new NcdcRecordParser();
  private Text year = new Text();
  private IntWritable temp = new IntWritable();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    parser.parse(value);
    if (parser.isValidTemperature()) {
      year.set(parser.getYear());
      temp.set(parser.getAirTemperature());
      context.write(year, temp);
    } else if (parser.isMalformedTemperature()) {
      System.err.println("Ignoring possibly corrupt input: " + value);
      context.getCounter(Temperature.MALFORMED).increment(1);
    }
  }
}
However, we know if this is significant only if we can measure an improvement when
running the job over the whole dataset. Running each variant five times on an otherwise
quiet 11-node cluster showed no statistically significant difference in job execution
time. Of course, this result holds only for this particular combination of code, data,
and hardware, so you should perform similar benchmarks to see whether such a change
is significant for your setup.
Other profilers
The mechanism for retrieving profile output is HPROF-specific, so if you use another
profiler you will need to manually retrieve the profiler's output from tasktrackers for
analysis.
If the profiler is not installed on all the tasktracker machines, consider using the
Distributed Cache ("Distributed Cache" on page 288) for making the profiler binary
available on the required machines.
MapReduce Workflows
So far in this chapter, you have seen the mechanics of writing a program using
MapReduce. We haven't yet considered how to turn a data processing problem into the
MapReduce model.
The data processing you have seen so far in this book is to solve a fairly simple problem
(finding the maximum recorded temperature for given years). When the processing
gets more complex, this complexity is generally manifested by having more MapReduce
jobs, rather than having more complex map and reduce functions. In other words, as
a rule of thumb, think about adding more jobs, rather than adding complexity to jobs.
For more complex problems, it is worth considering a higher-level language than
MapReduce, such as Pig, Hive, Cascading, Cascalog, or Crunch. One immediate benefit is
that it frees you up from having to do the translation into MapReduce jobs, allowing
you to concentrate on the analysis you are performing.
Finally, the book Data-Intensive Text Processing with MapReduce by Jimmy Lin and
Chris Dyer (Morgan & Claypool Publishers, 2010, http://mapreduce.me/) is a great
resource for learning more about MapReduce algorithm design, and is highly
recommended.
Decomposing a Problem into MapReduce Jobs
Let's look at an example of a more complex problem that we want to translate into a
MapReduce workflow.
Imagine that we want to find the mean maximum recorded temperature for every day
of the year and every weather station. In concrete terms, to calculate the mean
maximum daily temperature recorded by station 029070-99999, say, on January 1, we take
the mean of the maximum daily temperatures for this station for January 1, 1901;
January 1, 1902; and so on up to January 1, 2000.
How can we compute this using MapReduce? The computation decomposes most
naturally into two stages:
1. Compute the maximum daily temperature for every station-date pair.
The MapReduce program in this case is a variant of the maximum temperature
program, except that the keys in this case are a composite station-date pair, rather
than just the year.
2. Compute the mean of the maximum daily temperatures for every
station-day-month key.
The mapper takes the output from the previous job (station-date, maximum
temperature) records and projects it into (station-day-month, maximum temperature)
records by dropping the year component. The reduce function then takes the mean
of the maximum temperatures for each station-day-month key.
The output from the first stage looks like this for the station we are interested in (the
mean_max_daily_temp.sh script in the examples provides an implementation in
Hadoop Streaming):
029070-99999 19010101 0
029070-99999 19020101 -94
...
The first two fields form the key, and the final column is the maximum temperature
from all the readings for the given station and date. The second stage averages these
daily maxima over years to yield:
029070-99999 0101 -68
which is interpreted as saying the mean maximum daily temperature on January 1 for
station 029070-99999 over the century is -6.8°C.
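The two stages can be sketched in plain Java, with one method standing in for each job's map and reduce. This is an in-memory illustration only, not a MapReduce program: the class MeanMaxDailyTemp, its methods, the tab-separated key layout, and the rounding of the mean to a whole number of tenths are all assumptions made for the example, and the four readings in main() are made up so that the stage-one maxima match the two output lines shown above.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the two stages; each method mimics one job's map and reduce.
public class MeanMaxDailyTemp {

  // Stage 1: maximum temperature per (station, yyyyMMdd) key.
  // Each reading is {station, date, temperature in tenths of a degree}.
  public static Map<String, Integer> maxPerStationDate(String[][] readings) {
    Map<String, Integer> max = new TreeMap<>();
    for (String[] r : readings) {
      String key = r[0] + "\t" + r[1];
      max.merge(key, Integer.parseInt(r[2]), Math::max);
    }
    return max;
  }

  // Stage 2: drop the year from each key, then average the maxima per (station, MMdd).
  public static Map<String, Integer> meanPerStationDay(Map<String, Integer> stage1) {
    Map<String, long[]> sumAndCount = new TreeMap<>();
    for (Map.Entry<String, Integer> e : stage1.entrySet()) {
      String[] parts = e.getKey().split("\t");
      String key = parts[0] + "\t" + parts[1].substring(4);  // yyyyMMdd -> MMdd
      long[] acc = sumAndCount.computeIfAbsent(key, k -> new long[2]);
      acc[0] += e.getValue();  // running sum of the daily maxima
      acc[1]++;                // number of years seen for this day
    }
    Map<String, Integer> mean = new TreeMap<>();
    for (Map.Entry<String, long[]> e : sumAndCount.entrySet()) {
      long[] acc = e.getValue();
      mean.put(e.getKey(), (int) Math.round((double) acc[0] / acc[1]));
    }
    return mean;
  }

  public static void main(String[] args) {
    String[][] readings = {
      {"029070-99999", "19010101", "-22"},
      {"029070-99999", "19010101", "0"},
      {"029070-99999", "19020101", "-94"},
      {"029070-99999", "19020101", "-150"},
    };
    Map<String, Integer> stage1 = maxPerStationDate(readings);
    System.out.println(stage1);
    System.out.println(meanPerStationDay(stage1));
  }
}
```

With only these two years of data, the mean for January 1 comes out at -47 (that is, -4.7°C); the figure of -68 quoted above is for the full century of readings.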
It's possible to do this computation in one MapReduce stage, but it takes more work
on the part of the programmer (see footnote 3).
3. It's an interesting exercise to do this. Hint: use "Secondary Sort" on page 276.
The arguments for having more (but simpler) MapReduce stages are that doing so leads
to more composable and more maintainable mappers and reducers. The case studies
in Chapter 16 cover a wide range of real-world problems that were solved using
MapReduce, and in each case, the data processing task is implemented using two or more
MapReduce jobs. The details in that chapter are invaluable for getting a better idea of
how to decompose a processing problem into a MapReduce workflow.
It's possible to make map and reduce functions even more composable than we have
done. A mapper commonly performs input format parsing, projection (selecting the
relevant fields), and filtering (removing records that are not of interest). In the mappers
you have seen so far, we have implemented all of these functions in a single mapper.
However, there is a case for splitting these into distinct mappers and chaining them
into a single mapper using the ChainMapper library class that comes with Hadoop.
Combined with a ChainReducer, you can run a chain of mappers, followed by a reducer
and another chain of mappers in a single MapReduce job.
JobControl
When there is more than one job in a MapReduce workflow, the question arises: how
do you manage the jobs so they are executed in order? There are several approaches,
and the main consideration is whether you have a linear chain of jobs, or a more
complex directed acyclic graph (DAG) of jobs.
For a linear chain, the simplest approach is to run each job one after another, waiting
until a job completes successfully before running the next:
JobClient.runJob(conf1);
JobClient.runJob(conf2);
If a job fails, the runJob() method will throw an IOException, so later jobs in the pipeline
don't get executed. Depending on your application, you might want to catch the
exception and clean up any intermediate data that was produced by any previous jobs.
For anything more complex than a linear chain, there are libraries that can help
orchestrate your workflow (although they are suited to linear chains, or even one-off jobs,
too). The simplest is in the org.apache.hadoop.mapreduce.jobcontrol package: the
JobControl class. (There is an equivalent class in the
org.apache.hadoop.mapred.jobcontrol package too.) An instance of JobControl represents
a graph of jobs to be run. You
add the job configurations, then tell the JobControl instance the dependencies between
jobs. You run the JobControl in a thread, and it runs the jobs in dependency order. You
can poll for progress, and when the jobs have finished, you can query for all the jobs'
statuses and the associated errors for any failures. If a job fails, JobControl won't run
the jobs that depend on it.
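The idea of running jobs in dependency order, and of skipping whatever depends on a failed job, can be illustrated without Hadoop. The sketch below is a miniature of that idea in plain Java and is not the JobControl API: the class DependencyRunner, its addJob() and run() methods, and the use of job names in place of real jobs are all hypothetical. It runs everything sequentially in the calling thread and assumes the dependencies form a DAG.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical miniature of dependency-ordered execution; not the JobControl API.
public class DependencyRunner {

  // Maps each job name to the names of the jobs it depends on.
  private final Map<String, List<String>> dependsOn = new LinkedHashMap<>();

  public void addJob(String name, String... dependencies) {
    dependsOn.put(name, Arrays.asList(dependencies));
  }

  // "Runs" every job whose dependencies have all succeeded and returns the names
  // in the order they ran. Jobs named in failing are treated as failing when run,
  // so anything that depends on them, directly or indirectly, is never started.
  public List<String> run(Set<String> failing) {
    List<String> ran = new ArrayList<>();
    Set<String> succeeded = new HashSet<>();
    Set<String> attempted = new HashSet<>();
    boolean progress = true;
    while (progress) {
      progress = false;
      for (Map.Entry<String, List<String>> job : dependsOn.entrySet()) {
        String name = job.getKey();
        if (attempted.contains(name) || !succeeded.containsAll(job.getValue())) {
          continue;  // already run, or still waiting on a dependency
        }
        attempted.add(name);
        ran.add(name);
        if (!failing.contains(name)) {
          succeeded.add(name);
        }
        progress = true;
      }
    }
    return ran;
  }

  public static void main(String[] args) {
    DependencyRunner runner = new DependencyRunner();
    runner.addJob("mean", "max");  // the "mean" job depends on the "max" job
    runner.addJob("max");
    System.out.println(runner.run(new HashSet<String>()));
  }
}
```

Even though "mean" is added first, it runs only after "max" has succeeded; if "max" is listed as failing, "mean" is never started.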
Apache Oozie
If you need to run a complex workflow, or one on a tight production schedule, or you
have a large number of connected workflows with data dependencies between them,
then a more sophisticated approach is required. Apache Oozie
(http://incubator.apache.org/oozie/) fits the bill in any or all of these cases. It has been
designed to manage the executions of thousands of dependent workflows, each
composed of possibly thousands of constituent actions at the level of an individual
MapReduce job.
Oozie has two main parts: a workflow engine that stores and runs workflows composed
of Hadoop jobs, and a coordinator engine that runs workflow jobs based on predefined
schedules and data availability. The latter property is especially powerful since it allows
a workflow job to wait until its input data has been produced by a dependent workflow;
also, it makes rerunning failed workflows more tractable, since no time is wasted running
successful parts of a workflow. Anyone who has managed a complex batch system
knows how difficult it can be to catch up from jobs missed due to downtime or failure,
and will appreciate this feature.
Unlike JobControl, which runs on the client machine submitting the jobs, Oozie runs
as a service in the cluster, and clients submit workflow definitions for immediate or
later execution. In Oozie parlance, a workflow is a DAG of action nodes and
control-flow nodes. An action node performs a workflow task, like moving files in HDFS,
running a MapReduce, Streaming, Pig or Hive job, performing a Sqoop import, or running
an arbitrary shell script or Java program. A control-flow node governs the workflow
execution between actions by allowing such constructs as conditional logic (so different
execution branches may be followed depending on the result of an earlier action node)
or parallel execution. When the workflow completes, Oozie can make an HTTP
callback to the client to inform it of the workflow status. It is also possible to receive
callbacks every time the workflow enters or exits an action node.
Defining an Oozie workflow
Workflow definitions are written in XML using the Hadoop Process Definition Language,
the specification for which can be found on the Oozie website. Example 5-14
shows a simple Oozie workflow definition for running a single MapReduce job.
Example 5-14. Oozie workflow definition to run the maximum temperature MapReduce job
<workflow-app xmlns="uri:oozie:workflow:0.1" name="max-temp-workflow">
  <start to="max-temp-mr"/>
  <action name="max-temp-mr">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>
        <delete path="${nameNode}/user/${wf:user()}/output"/>
      </prepare>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>OldMaxTemperature$OldMaxTemperatureMapper</value>
        </property>
        <property>
          <name>mapred.combiner.class</name>
          <value>OldMaxTemperature$OldMaxTemperatureReducer</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>OldMaxTemperature$OldMaxTemperatureReducer</value>
        </property>
        <property>
          <name>mapred.output.key.class</name>
          <value>org.apache.hadoop.io.Text</value>
        </property>
        <property>
          <name>mapred.output.value.class</name>
          <value>org.apache.hadoop.io.IntWritable</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>/user/${wf:user()}/input/ncdc/micro</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>/user/${wf:user()}/output</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>MapReduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
This workflow has three control-flow nodes and one action node: a start control node,
a map-reduce action node, a kill control node, and an end control node. The nodes and
allowed transitions between them are shown in Figure 5-5.
Figure 5-5. Transition diagram of an Oozie workflow.
All workflows must have one start and one end node. When the workflow job starts
it transitions to the node specified by the start node (the max-temp-mr action in this
example). A workflow job succeeds when it transitions to the end node. However, if
the workflow job transitions to a kill node, then it is considered to have failed and
reports an error message as specified by the message element in the workflow definition.
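The transition rules can be pictured as a small state machine. The following is only a minimal sketch in plain Java, not Oozie's implementation: the class WorkflowTransitions and its methods are hypothetical names introduced for illustration, and the node names mirror the example workflow (a start node pointing at max-temp-mr, which goes to end on success and to the fail kill node on error).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of start -> action -> end/kill transitions; not Oozie code. */
public class WorkflowTransitions {
    // For each action node: where to go on success and where to go on error.
    private final Map<String, String> okTo = new HashMap<>();
    private final Map<String, String> errorTo = new HashMap<>();
    private final String startTo;

    public WorkflowTransitions(String startTo) {
        this.startTo = startTo; // the node named by <start to="..."/>
    }

    public void addAction(String name, String ok, String error) {
        okTo.put(name, ok);
        errorTo.put(name, error);
    }

    /**
     * Walks the workflow from the start node. The map says whether each action
     * succeeds (missing entries count as failures). Returns the terminal node reached.
     */
    public String run(Map<String, Boolean> actionSucceeds) {
        String node = startTo;
        while (okTo.containsKey(node)) { // keep going while we are on an action node
            boolean ok = actionSucceeds.getOrDefault(node, false);
            node = ok ? okTo.get(node) : errorTo.get(node);
        }
        return node; // "end" means success; a kill node such as "fail" means failure
    }

    public static void main(String[] args) {
        WorkflowTransitions wf = new WorkflowTransitions("max-temp-mr");
        wf.addAction("max-temp-mr", "end", "fail");

        Map<String, Boolean> outcome = new HashMap<>();
        outcome.put("max-temp-mr", true);
        System.out.println(wf.run(outcome)); // prints "end"
    }
}
```

The sketch assumes an acyclic workflow, which matches the DAG requirement; a real engine would also evaluate the kill node's message element.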
The bulk of this workflow definition file specifies the map-reduce action. The first two
elements, job-tracker and name-node, are used to specify the jobtracker to submit the
job to, and the namenode (actually a Hadoop filesystem URI) for input and output
data. Both are parameterized so that the workflow definition is not tied to a particular
cluster (which makes it easy to test). The parameters are specified as workflow job
properties at submission time, as we shall see later.
The optional prepare element runs before the MapReduce job, and is used for directory
deletion (and creation too, if needed, although that is not shown here). By ensuring
that the output directory is in a consistent state before running a job, Oozie can safely
rerun the action if the job fails.
The MapReduce job to run is specified in the configuration element using nested elements
for specifying the Hadoop configuration name-value pairs. You can view the
MapReduce configuration section as a declarative replacement for the driver classes
that we have used elsewhere in this book for running MapReduce programs (such as
Example 2-6).
There are two non-standard Hadoop properties, mapred.input.dir and mapred.output.dir,
which are used to set the FileInputFormat input paths and FileOutputFormat
output path, respectively.
We have taken advantage of JSP Expression Language (EL) functions in several places
in the workflow definition. Oozie provides a set of functions for interacting with the
workflow; ${wf:user()}, for example, returns the name of the user who started the
current workflow job, and we use it to specify the correct filesystem path. The Oozie
specification lists all the EL functions that Oozie supports.
Packaging and deploying an Oozie workflow application
A workflow application is made up of the workflow definition plus all the associated
resources (such as MapReduce JAR files, Pig scripts, and so on) needed to run it.
Applications must adhere to a simple directory structure, and are deployed to HDFS
so that they can be accessed by Oozie. For this workflow application, we'll put all of
the files in a base directory called max-temp-workflow, as shown diagrammatically here:
max-temp-workflow/
├── lib/
│   └── hadoop-examples.jar
└── workflow.xml
The workflow definition file workflow.xml must appear in the top level of this directory.
JAR files containing the application's MapReduce classes are placed in the lib directory.
Workflow applications that conform to this layout can be built with any suitable build
tool, like Ant or Maven; you can find an example in the code that accompanies this
book. Once an application has been built, it should be copied to HDFS using regular
Hadoop tools. Here is the appropriate command for this application:
% hadoop fs -put hadoop-examples/target/max-temp-workflow max-temp-workflow
Running an Oozie workflow job
Next let's see how to run a workflow job for the application we just uploaded. For this
we use the oozie command-line tool, a client program for communicating with an Oozie
server. For convenience we export the OOZIE_URL environment variable to tell the
oozie command which Oozie server to use (we're using one running locally here):
% export OOZIE_URL="http://localhost:11000/oozie"
There are lots of subcommands for the oozie tool (type oozie help to get a list), but
we're going to call the job subcommand with the -run option to run the workflow job:
% oozie job -config ch05/src/main/resources/max-temp-workflow.properties -run
job: 0000009-120119174508294-oozie-tom-W
The -config option specifies a local Java properties file containing definitions for the
parameters in the workflow XML file (in this case nameNode and jobTracker), as well as
oozie.wf.application.path, which tells Oozie the location of the workflow application
in HDFS. Here are the contents of the properties file:
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
oozie.wf.application.path=${nameNode}/user/${user.name}/max-temp-workflow
To get information about the status of the workflow job we use the -info option, using
the job ID that was printed by the run command earlier (type oozie job to get a list of
all jobs).
% oozie job -info 0000009-120119174508294-oozie-tom-W
The output shows the status: RUNNING, KILLED, or SUCCEEDED. You can find all this
information via Oozie's web UI too, available at http://localhost:11000/oozie.
When the job has succeeded we can inspect the results in the usual way:
% hadoop fs -cat output/part-*
1949 111
1950 22
This example only scratched the surface of writing Oozie workflows. The documentation
on Oozie's website has information about creating more complex workflows, as
well as writing and running coordinator jobs.
CHAPTER 6
How MapReduce Works
In this chapter, we look at how MapReduce in Hadoop works in detail. This knowledge
provides a good foundation for writing more advanced MapReduce programs, which
we will cover in the following two chapters.
Anatomy of a MapReduce Job Run
You can run a MapReduce job with a single method call: submit() on a Job object (note
that you can also call waitForCompletion(), which will submit the job if it hasn't been
submitted already, then wait for it to finish).¹ This method call conceals a great deal of
processing behind the scenes. This section uncovers the steps Hadoop takes to run a
job.
We saw in Chapter 5 that the way Hadoop executes a MapReduce program depends
on a couple of configuration settings.
In releases of Hadoop up to and including the 0.20 release series, mapred.job.tracker
determines the means of execution. If this configuration property is set to local, the
default, then the local job runner is used. This runner runs the whole job in a single
JVM. It's designed for testing and for running MapReduce programs on small datasets.
Alternatively, if mapred.job.tracker is set to a colon-separated host and port pair, then
the property is interpreted as a jobtracker address, and the runner submits the job to
the jobtracker at that address. The whole process is described in detail in the next
section.
In Hadoop 0.23.0 a new MapReduce implementation was introduced. The new implementation
(called MapReduce 2) is built on a system called YARN, described in
"YARN (MapReduce 2)" on page 194. For now, just note that the framework that is
used for execution is set by the mapreduce.framework.name property, which takes the
values local (for the local job runner), classic (for the "classic" MapReduce framework,
also called MapReduce 1, which uses a jobtracker and tasktrackers), and yarn
(for the new framework).
1. In the old MapReduce API you can call JobClient.submitJob(conf) or JobClient.runJob(conf).
It's important to realize that the old and new MapReduce APIs are not
the same thing as the classic and YARN-based MapReduce implementations
(MapReduce 1 and 2 respectively). The APIs are user-facing client-side
features and determine how you write MapReduce programs,
while the implementations are just different ways of running MapReduce
programs. All four combinations are supported: both the old and
new API run on both MapReduce 1 and 2. Table 1-2 lists which of these
combinations are supported in the different Hadoop releases.
Classic MapReduce (MapReduce 1)
A job run in classic MapReduce is illustrated in Figure 6-1. At the highest level, there
are four independent entities:
• The client, which submits the MapReduce job.
• The jobtracker, which coordinates the job run. The jobtracker is a Java application
  whose main class is JobTracker.
• The tasktrackers, which run the tasks that the job has been split into. Tasktrackers
  are Java applications whose main class is TaskTracker.
• The distributed filesystem (normally HDFS, covered in Chapter 3), which is used
  for sharing job files between the other entities.
Job Submission
The submit() method on Job creates an internal JobSubmitter instance and calls
submitJobInternal() on it (step 1 in Figure 6-1). Having submitted the job, waitForCompletion()
polls the job's progress once a second and reports the progress to the console
if it has changed since the last report. When the job is complete, if it was successful,
the job counters are displayed. Otherwise, the error that caused the job to fail is logged
to the console.
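The poll-and-report-on-change behavior can be sketched in a few lines. This is an illustration only, not the code in Hadoop's Job class: ProgressPoller and monitor() are hypothetical names, and the progress readings are supplied as a list instead of being fetched from a jobtracker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of a client-side progress monitor; not Hadoop's Job class. */
public class ProgressPoller {
    /**
     * Reads successive progress values (each in [0, 1]) until one reaches 1.0 or the
     * readings run out, sleeping pollIntervalMillis between polls. Returns the lines
     * that would be printed: one for each change in progress.
     */
    public static List<String> monitor(Iterator<Double> readings, long pollIntervalMillis) {
        List<String> reported = new ArrayList<>();
        double last = -1.0; // sentinel: nothing reported yet
        while (readings.hasNext()) {
            double progress = readings.next();
            if (progress != last) { // report only when the value has changed
                reported.add("progress " + Math.round(progress * 100) + "%");
                last = progress;
            }
            if (progress >= 1.0) {
                break; // job complete
            }
            try {
                Thread.sleep(pollIntervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop
                break;
            }
        }
        return reported;
    }

    public static void main(String[] args) {
        List<Double> readings = Arrays.asList(0.0, 0.0, 0.5, 0.5, 1.0);
        // The text describes a one-second interval; a short one keeps the demo quick.
        for (String line : monitor(readings.iterator(), 10)) {
            System.out.println(line);
        }
    }
}
```

Five readings produce only three report lines here, because repeated values are suppressed.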
The job submission process implemented by JobSubmitter does the following:
• Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).
• Checks the output specification of the job. For example, if the output directory has
  not been specified or it already exists, the job is not submitted and an error is
  thrown to the MapReduce program.
• Computes the input splits for the job. If the splits cannot be computed, because
  the input paths don't exist, for example, then the job is not submitted and an error
  is thrown to the MapReduce program.
• Copies the resources needed to run the job, including the job JAR file, the configuration
  file, and the computed input splits, to the jobtracker's filesystem in a
  directory named after the job ID. The job JAR is copied with a high replication
  factor (controlled by the mapred.submit.replication property, which defaults to
  10) so that there are lots of copies across the cluster for the tasktrackers to access
  when they run tasks for the job (step 3).
• Tells the jobtracker that the job is ready for execution (by calling submitJob() on
  JobTracker) (step 4).
Figure 6-1. How Hadoop runs a MapReduce job using the classic framework
Job Initialization
When the JobTracker receives a call to its submitJob() method, it puts it into an internal
queue from where the job scheduler will pick it up and initialize it. Initialization involves
creating an object to represent the job being run, which encapsulates its tasks, and
bookkeeping information to keep track of the tasks' status and progress (step 5).
To create the list of tasks to run, the job scheduler first retrieves the input splits computed
by the client from the shared filesystem (step 6). It then creates one map task for
each split. The number of reduce tasks to create is determined by the
mapred.reduce.tasks property in the Job, which is set by the setNumReduceTasks()
method, and the scheduler simply creates this number of reduce tasks to be run. Tasks
are given IDs at this point.
In addition to the map and reduce tasks, two further tasks are created: a job setup task
and a job cleanup task. These are run by tasktrackers and are used to run code to set up
the job before any map tasks run, and to clean up after all the reduce tasks are complete.
The OutputCommitter that is configured for the job determines the code to be run, and
by default this is a FileOutputCommitter. For the job setup task it will create the final
output directory for the job and the temporary working space for the task output, and
for the job cleanup task it will delete the temporary working space for the task output.
The commit protocol is described in more detail in "Output Committers"
on page 215.
Task Assignment
Tasktrackers run a simple loop that periodically sends heartbeat method calls to the
jobtracker. Heartbeats tell the jobtracker that a tasktracker is alive, but they also double
as a channel for messages. As a part of the heartbeat, a tasktracker will indicate whether
it is ready to run a new task, and if it is, the jobtracker will allocate it a task, which it
communicates to the tasktracker using the heartbeat return value (step 7).
Before it can choose a task for the tasktracker, the jobtracker must choose a job to select
the task from. There are various scheduling algorithms as explained later in this chapter
(see "Job Scheduling" on page 204), but the default one simply maintains a priority
list of jobs. Having chosen a job, the jobtracker now chooses a task for the job.
Tasktrackers have a fixed number of slots for map tasks and for reduce tasks: for example,
a tasktracker may be able to run two map tasks and two reduce tasks simultaneously.
(The precise number depends on the number of cores and the amount of
memory on the tasktracker; see "Memory" on page 305.) The default scheduler fills
empty map task slots before reduce task slots, so if the tasktracker has at least one
empty map task slot, the jobtracker will select a map task; otherwise, it will select a
reduce task.
To choose a reduce task, the jobtracker simply takes the next in its list of yet-to-be-run
reduce tasks, since there are no data locality considerations. For a map task, however,
it takes account of the tasktracker's network location and picks a task whose input split
is as close as possible to the tasktracker. In the optimal case, the task is data-local, that
is, running on the same node that the split resides on. Alternatively, the task may be
rack-local: on the same rack, but not the same node, as the split. Some tasks are neither
data-local nor rack-local and retrieve their data from a different rack from the one they
are running on. You can tell the proportion of each type of task by looking at a job's
counters (see "Built-in Counters" on page 257).
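The locality preference for map tasks (data-local first, then rack-local, then off-rack) can be sketched as a simple ranking. This is a simplified illustration, not the jobtracker's code: LocalityChooser, Split, and choose() are hypothetical names, and each split is assumed to live on a single node, whereas real HDFS blocks usually have several replicas.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of locality-aware map task selection; not JobTracker code. */
public class LocalityChooser {
    /** Where an input split lives: the node holding the data and that node's rack. */
    public static final class Split {
        final String node;
        final String rack;

        public Split(String node, String rack) {
            this.node = node;
            this.rack = rack;
        }
    }

    /** 0 = data-local, 1 = rack-local, 2 = off-rack. */
    static int distance(Split split, String trackerNode, String trackerRack) {
        if (split.node.equals(trackerNode)) return 0;
        if (split.rack.equals(trackerRack)) return 1;
        return 2;
    }

    /** Returns the index of the pending split closest to the tasktracker, or -1 if none. */
    public static int choose(List<Split> pending, String trackerNode, String trackerRack) {
        int best = -1;
        int bestDistance = Integer.MAX_VALUE;
        for (int i = 0; i < pending.size(); i++) {
            int d = distance(pending.get(i), trackerNode, trackerRack);
            if (d < bestDistance) { // strictly closer, so ties keep the earlier task
                best = i;
                bestDistance = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Split> pending = Arrays.asList(
            new Split("node7", "rack2"),   // off-rack for node1
            new Split("node2", "rack1"),   // rack-local for node1
            new Split("node1", "rack1"));  // data-local for node1
        System.out.println(choose(pending, "node1", "rack1")); // prints 2
    }
}
```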
Task Execution
Now that the tasktracker has been assigned a task, the next step is for it to run the task.
First, it localizes the job JAR by copying it from the shared filesystem to the tasktracker's
filesystem. It also copies any files needed from the distributed cache by the application
to the local disk; see "Distributed Cache" on page 288 (step 8). Second, it creates a
local working directory for the task, and un-jars the contents of the JAR into this
directory. Third, it creates an instance of TaskRunner to run the task.
TaskRunner launches a new Java Virtual Machine (step 9) to run each task in (step 10),
so that any bugs in the user-defined map and reduce functions don't affect the tasktracker
(by causing it to crash or hang, for example). It is, however, possible to reuse
the JVM between tasks; see "Task JVM Reuse" on page 216.
The child process communicates with its parent through the umbilical interface. This
way it informs the parent of the task's progress every few seconds until the task is
complete.
Each task can perform setup and cleanup actions, which are run in the same JVM as
the task itself, and are determined by the OutputCommitter for the job (see "Output
Committers" on page 215). The cleanup action is used to commit the task, which in
the case of file-based jobs means that its output is written to the final location for that
task. The commit protocol ensures that when speculative execution is enabled ("Speculative
Execution" on page 213), only one of the duplicate tasks is committed and the
other is aborted.
Streaming and Pipes. Both Streaming and Pipes run special map and reduce tasks for the
purpose of launching the user-supplied executable and communicating with it (Figure 6-2).
In the case of Streaming, the Streaming task communicates with the process (which
may be written in any language) using standard input and output streams. The Pipes
task, on the other hand, listens on a socket and passes the C++ process a port number
in its environment, so that on startup, the C++ process can establish a persistent socket
connection back to the parent Java Pipes task.
In both cases, during execution of the task, the Java process passes input key-value
pairs to the external process, which runs them through the user-defined map or reduce
function and passes the output key-value pairs back to the Java process. From the
tasktracker's point of view, it is as if the tasktracker child process ran the map or reduce
code itself.
Progress and Status Updates
MapReduce jobs are long-running batch jobs, taking anything from minutes to hours
to run. Because this is a significant length of time, it's important for the user to get
feedback on how the job is progressing. A job and each of its tasks have a status, which
includes such things as the state of the job or task (e.g., running, successfully completed,
failed), the progress of maps and reduces, the values of the job's counters, and a status
message or description (which may be set by user code). These statuses change over
the course of the job, so how do they get communicated back to the client?
When a task is running, it keeps track of its progress, that is, the proportion of the task
completed. For map tasks, this is the proportion of the input that has been processed.
For reduce tasks, it's a little more complex, but the system can still estimate the proportion
of the reduce input processed. It does this by dividing the total progress into
three parts, corresponding to the three phases of the shuffle (see "Shuffle and
Sort" on page 205). For example, if the task has run the reducer on half its input, then
the task's progress is 5/6, since it has completed the copy and sort phases (1/3 each) and
is halfway through the reduce phase (1/6).
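The arithmetic for reduce task progress can be written out directly: each of the three phases (copy, sort, reduce) contributes a third of the total. The sketch below is an illustration under that reading of the text; ReduceProgress and overall() are hypothetical names, not Hadoop APIs.

```java
/** Hypothetical sketch: reduce-task progress as the mean of three phase fractions. */
public class ReduceProgress {
    /** Each argument is the fraction (0 to 1) of that phase completed so far. */
    public static double overall(double copy, double sort, double reduce) {
        return (copy + sort + reduce) / 3.0; // every phase carries a weight of 1/3
    }

    public static void main(String[] args) {
        // Copy and sort finished, reducer halfway through its input:
        // 1/3 + 1/3 + 1/6 = 5/6
        System.out.println(overall(1.0, 1.0, 0.5));
    }
}
```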
Figure 6-2. The relationship of the Streaming and Pipes executable to the tasktracker and its child
What Constitutes Progress in MapReduce?
Progress is not always measurable, but nevertheless it tells Hadoop that a task is doing
something. For example, a task writing output records is making progress, even though
it cannot be expressed as a percentage of the total number that will be written, since
the latter figure may not be known, even by the task producing the output.
Progress reporting is important, as it means Hadoop will not fail a task that's making
progress. All of the following operations constitute progress:
• Reading an input record (in a mapper or reducer)
• Writing an output record (in a mapper or reducer)
• Setting the status description on a reporter (using Reporter's setStatus() method)
• Incrementing a counter (using Reporter's incrCounter() method)
• Calling Reporter's progress() method
Tasks also have a set of counters that count various events as the task runs (we saw an
example in "A test run" on page 25), either those built into the framework, such as the
number of map output records written, or ones defined by users.
If a task reports progress, it sets a flag to indicate that the status change should be sent
to the tasktracker. The flag is checked in a separate thread every three seconds, and if
set it notifies the tasktracker of the current task status. Meanwhile, the tasktracker is
sending heartbeats to the jobtracker every five seconds (this is a minimum, as the
heartbeat interval is actually dependent on the size of the cluster: for larger clusters,
the interval is longer), and the status of all the tasks being run by the tasktracker is sent
in the call. Counters are sent less frequently than every five seconds, because they can
be relatively high-bandwidth.
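The flag-based handoff between the task and its reporting thread can be sketched as follows. This is an illustration of the pattern only, not Hadoop's Task or Reporter code: ProgressFlag, report(), and pollForUpdate() are hypothetical names, and the three-second timer is left to the caller so that the example stays deterministic.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch of the progress flag; not Hadoop's Task or Reporter code. */
public class ProgressFlag {
    private final AtomicBoolean changed = new AtomicBoolean(false);
    private volatile String status = "";

    /** Called by the task whenever it makes progress, e.g. after writing a record. */
    public void report(String newStatus) {
        status = newStatus;
        changed.set(true);
    }

    /**
     * Called periodically by a separate reporting thread (every three seconds in the
     * text). Returns the status to send to the parent, or null if nothing changed.
     */
    public String pollForUpdate() {
        // getAndSet clears the flag so the same change is not resent on every poll.
        return changed.getAndSet(false) ? status : null;
    }

    public static void main(String[] args) {
        ProgressFlag flag = new ProgressFlag();
        System.out.println(flag.pollForUpdate()); // null: nothing reported yet
        flag.report("processed 1000 records");
        System.out.println(flag.pollForUpdate()); // the new status is picked up
        System.out.println(flag.pollForUpdate()); // null: the flag was cleared
    }
}
```

Separating the cheap flag update from the periodic send means a task can report progress on every record without generating a message each time.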
The jobtracker combines these updates to produce a global view of the status of all the
jobs being run and their constituent tasks. Finally, as mentioned earlier, the Job receives
the latest status by polling the jobtracker every second. Clients can also use Job's
getStatus() method to obtain a JobStatus instance, which contains all of the status
information for the job.
The method calls are illustrated in Figure 6-3.
Job Completion
When the jobtracker receives a notification that the last task for a job is complete (this
will be the special job cleanup task), it changes the status for the job to "successful."
Then, when the Job polls for status, it learns that the job has completed successfully,
so it prints a message to tell the user and then returns from the waitForCompletion()
method.
The jobtracker also sends an HTTP job notification if it is configured to do so. This
can be configured by clients wishing to receive callbacks, via the job.end.notification.url
property.
Last, the jobtracker cleans up its working state for the job and instructs tasktrackers to
do the same (so intermediate output is deleted, for example).
YARN (MapReduce 2)
For very large clusters in the region of 4000 nodes and higher, the MapReduce system
described in the previous section begins to hit scalability bottlenecks, so in 2010 a group
at Yahoo! began to design the next generation of MapReduce. The result was YARN,
short for Yet Another Resource Negotiator (or if you prefer recursive acronyms, YARN
Application Resource Negotiator).²
2. You can read more about the motivation for and development of YARN in Arun C Murthy's post, "The
Next Generation of Apache Hadoop MapReduce."
YARN meets the scalability shortcomings of "classic" MapReduce by splitting the responsibilities
of the jobtracker into separate entities. The jobtracker takes care of both
job scheduling (matching tasks with tasktrackers) and task progress monitoring (keeping
track of tasks and restarting failed or slow tasks, and doing task bookkeeping such
as maintaining counter totals).
YARN separates these two roles into two independent daemons: a resource manager
to manage the use of resources across the cluster, and an application master to manage
the lifecycle of applications running on the cluster. The idea is that an application
master negotiates with the resource manager for cluster resources (described in terms
of a number of containers, each with a certain memory limit), then runs application-specific
processes in those containers. The containers are overseen by node managers
running on cluster nodes, which ensure that the application does not use more resources
than it has been allocated.³
Figure 6-3. How status updates are propagated through the MapReduce 1 system
3. At the time of writing, memory is the only resource that is managed, and node managers will kill any
containers that exceed their allocated memory limits.
In contrast to the jobtracker, each instance of an application (here a MapReduce job)
has a dedicated application master, which runs for the duration of the application.
This model is actually closer to the original Google MapReduce paper, which describes
how a master process is started to coordinate map and reduce tasks running on a set
of workers.
As described, YARN is more general than MapReduce, and in fact MapReduce is just
one type of YARN application. There are a few other YARN applications (such as a
distributed shell that can run a script on a set of nodes in the cluster), and others are
actively being worked on (some are listed at http://wiki.apache.org/hadoop/PoweredByYarn).
The beauty of YARN's design is that different YARN applications can co-exist
on the same cluster (so a MapReduce application can run at the same time as an MPI
application, for example), which brings great benefits for manageability and cluster
utilization.
Furthermore, it is even possible for users to run different versions of MapReduce on
the same YARN cluster, which makes the process of upgrading MapReduce more
manageable. (Note that some parts of MapReduce, like the job history server and the
shuffle handler, as well as YARN itself, still need to be upgraded across the cluster.)
MapReduce on YARN involves more entities than classic MapReduce. They are:⁴
• The client, which submits the MapReduce job.
• The YARN resource manager, which coordinates the allocation of compute resources
  on the cluster.
• The YARN node managers, which launch and monitor the compute containers on
  machines in the cluster.
• The MapReduce application master, which coordinates the tasks running the
  MapReduce job. The application master and the MapReduce tasks run in containers
  that are scheduled by the resource manager, and managed by the node
  managers.
• The distributed filesystem (normally HDFS, covered in Chapter 3), which is used
  for sharing job files between the other entities.
The process of running a job is shown in Figure 6-4, and described in the following
sections.
Job Submission
Jobs are submitted in MapReduce 2 using the same user API as MapReduce 1 (step 1).
MapReduce 2 has an implementation of ClientProtocol that is activated when
mapreduce.framework.name is set to yarn. The submission process is very similar to the classic
implementation. The new job ID is retrieved from the resource manager (rather than
the jobtracker), although in the nomenclature of YARN it is an application ID (step 2).
The job client checks the output specification of the job; computes input splits (although
there is an option to generate them on the cluster,
yarn.app.mapreduce.am.compute-splits-in-cluster, which can be beneficial for jobs with
many splits); and copies job resources (including the job JAR, configuration, and split
information) to HDFS (step 3). Finally, the job is submitted by calling submitApplication()
on the resource manager (step 4).
4. Not discussed in this section are the job history server daemon (for retaining job history data) and the
shuffle handler auxiliary service (for serving map outputs to reduce tasks), which are a part of the
jobtracker and the tasktracker respectively in classic MapReduce, but are independent entities in YARN.
Job Initialization
When the resource manager receives a call to its submitApplication(), it hands off the
request to the scheduler. The scheduler allocates a container, and the resource manager
then launches the application master's process there, under the node manager's management
(steps 5a and 5b).
The application master for MapReduce jobs is a Java application whose main class is
MRAppMaster. It initializes the job by creating a number of bookkeeping objects to keep
track of the job's progress, as it will receive progress and completion reports from the
tasks (step 6). Next, it retrieves the input splits computed in the client from the shared
filesystem (step 7). It then creates a map task object for each split, and a number of
reduce task objects determined by the mapreduce.job.reduces property.
The next thing the application master does is decide how to run the tasks that make
up the MapReduce job. If the job is small, the application master may choose to run
them in the same JVM as itself, since it judges the overhead of allocating new containers
and running tasks in them as outweighing the gain to be had in running them in parallel,
compared to running them sequentially on one node. (This is different to MapReduce
1, where small jobs are never run on a single tasktracker.) Such a job is said to be
uberized, or run as an uber task.
Figure 6-4. How Hadoop runs a MapReduce job using YARN
What qualifies as a small job? By default, one that has fewer than 10 mappers, only one
reducer, and an input size that is less than the size of one HDFS block. (These values may
be changed for a job by setting mapreduce.job.ubertask.maxmaps,
mapreduce.job.ubertask.maxreduces, and mapreduce.job.ubertask.maxbytes.) It's also possible
to disable uber tasks entirely (by setting mapreduce.job.ubertask.enable to false).
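The default test for a small job amounts to a simple predicate. The following is a sketch based only on the thresholds stated in the text, not the application master's actual code: UberDecision and isSmallJob() are hypothetical names, and the 64 MB block size is an assumed example value, since the real limit depends on the cluster's HDFS configuration.

```java
/** Hypothetical sketch of the "small job" test; not the MRAppMaster's code. */
public class UberDecision {
    // Thresholds from the text; the block size is an assumed example value.
    static final int MAX_MAPS = 10;                          // must have fewer maps than this
    static final int MAX_REDUCES = 1;                        // at most one reducer
    static final long BLOCK_SIZE_BYTES = 64L * 1024 * 1024;  // assumed 64 MB HDFS block

    /** Returns true if the job would qualify to run as an uber task. */
    public static boolean isSmallJob(boolean uberEnabled, int maps, int reduces,
                                     long inputBytes) {
        return uberEnabled
            && maps < MAX_MAPS
            && reduces <= MAX_REDUCES
            && inputBytes < BLOCK_SIZE_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(isSmallJob(true, 3, 1, 10L * 1024 * 1024));  // true
        System.out.println(isSmallJob(true, 12, 1, 10L * 1024 * 1024)); // false: too many maps
    }
}
```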
Before any tasks can be run the job setup method is called (for the job's OutputCommitter),
to create the job's output directory. In contrast to MapReduce 1, where it is called
in a special task that is run by the tasktracker, in the YARN implementation the method
is called directly by the application master.
Task Assignment
If the job does not qualify for running as an uber task, then the application master
requests containers for all the map and reduce tasks in the job from the resource manager
(step 8). The requests, which are piggybacked on heartbeat calls, include information
about each map task's data locality, in particular the hosts and corresponding
racks that the input split resides on. The scheduler uses this information to make
scheduling decisions (just like a jobtracker's scheduler does): it attempts to place tasks
on data-local nodes in the ideal case, but if this is not possible the scheduler prefers
rack-local placement to non-local placement.
Requests also specify memory requirements for tasks. By default both map and reduce
tasks are allocated 1024 MB of memory, but this is configurable by setting
mapreduce.map.memory.mb and mapreduce.reduce.memory.mb.
The way memory is allocated is different to MapReduce 1, where tasktrackers have a
fixed number of "slots," set at cluster configuration time, and each task runs in a single
slot. Slots have a maximum memory allowance, which again is fixed for a cluster, and
which leads both to problems of underutilization when tasks use less memory (since
other waiting tasks are not able to take advantage of the unused memory) and problems
of job failure when a task can't complete since it can't get enough memory to run
correctly.
In YARN, resources are more fine-grained, so both these problems can be avoided. In
particular, applications may request a memory capability that is anywhere between the
minimum allocation and a maximum allocation, and which must be a multiple of the
minimum allocation. Default memory allocations are scheduler-specific, and for the
capacity scheduler the default minimum is 1024 MB (set by
yarn.scheduler.capacity.minimum-allocation-mb), and the default maximum is 10240 MB (set by
yarn.scheduler.capacity.maximum-allocation-mb). Thus, tasks can request any memory
allocation between 1 and 10 GB (inclusive), in multiples of 1 GB (the scheduler will
round to the nearest multiple if needed), by setting mapreduce.map.memory.mb and
mapreduce.reduce.memory.mb appropriately.
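As a rough illustration of this arithmetic, the sketch below normalizes a requested memory size to a multiple of the minimum allocation and caps it at the maximum. The class and method names are hypothetical, and rounding up (rather than to the nearest multiple) is an assumption made here so that a task never receives less memory than it asked for; the scheduler's real behavior may differ.

```java
// Hypothetical helper (not part of YARN): normalizes a memory request.
final class MemoryRequestSketch {
    // Assumption: requests are rounded UP to the next multiple of minMb, then
    // capped at maxMb. The defaults quoted in the text are 1024 and 10240.
    static int normalize(int requestedMb, int minMb, int maxMb) {
        if (minMb <= 0 || maxMb < minMb) {
            throw new IllegalArgumentException("invalid allocation bounds");
        }
        int multiples = (Math.max(requestedMb, 1) + minMb - 1) / minMb; // ceiling division
        return Math.min(multiples * minMb, maxMb);
    }

    public static void main(String[] args) {
        System.out.println(normalize(1500, 1024, 10240));  // 2048
        System.out.println(normalize(1024, 1024, 10240));  // 1024
        System.out.println(normalize(20000, 1024, 10240)); // 10240
    }
}
```

With the default bounds, a request for 1500 MB therefore occupies two 1 GB units.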
Task Execution
Once a task has been assigned a container by the resource manager's scheduler, the
application master starts the container by contacting the node manager (steps 9a and
9b). The task is executed by a Java application whose main class is YarnChild. Before
it can run the task it localizes the resources that the task needs, including the job
configuration and JAR file, and any files from the distributed cache (step 10). Finally, it
runs the map or reduce task (step 11).
The YarnChild runs in a dedicated JVM, for the same reason that tasktrackers spawn
new JVMs for tasks in MapReduce 1: to isolate user code from long-running system
daemons. Unlike MapReduce 1, however, YARN does not support JVM reuse, so each
task runs in a new JVM.
Streaming and Pipes programs work in the same way as MapReduce 1. The YarnChild
launches the Streaming or Pipes process and communicates with it using standard
input/output or a socket (respectively), as shown in Figure 6-2 (except the child and
subprocesses run on node managers, not tasktrackers).
Progress and Status Updates
When running under YARN, the task reports its progress and status (including
counters) every three seconds (over the umbilical interface) back to its application
master, which has an aggregate view of the job. The process is illustrated in Figure 6-5.
Contrast this to MapReduce 1, where progress updates flow from the child through the
tasktracker to the jobtracker for aggregation.
The client polls the application master every second (set via
mapreduce.client.progressmonitor.pollinterval) to receive progress updates, which are
usually displayed to the user.
Job Completion
As well as polling the application master for progress, every five seconds the client
checks whether the job has completed when using the waitForCompletion() method on
Job. The polling interval can be set via the mapreduce.client.completion.pollinterval
configuration property.
Notification of job completion via an HTTP callback is also supported, as in
MapReduce 1. In MapReduce 2 the application master initiates the callback.
Figure 6-5. How status updates are propagated through the MapReduce 2 system
On job completion the application master and the task containers clean up their
working state, and the OutputCommitter's job cleanup method is called. Job information is
archived by the job history server to enable later interrogation by users if desired.
Failures
In the real world, user code is buggy, processes crash, and machines fail. One of the
major benefits of using Hadoop is its ability to handle such failures and allow your job
to complete.
Failures in Classic MapReduce
In the MapReduce 1 runtime there are three failure modes to consider: failure of the
running task, failure of the tasktracker, and failure of the jobtracker. Let's look at each
in turn.
Task Failure
Consider first the case of the child task failing. The most common way that this happens
is when user code in the map or reduce task throws a runtime exception. If this happens,
the child JVM reports the error back to its parent tasktracker, before it exits. The error
ultimately makes it into the user logs. The tasktracker marks the task attempt as
failed, freeing up a slot to run another task.
For Streaming tasks, if the Streaming process exits with a nonzero exit code, it is marked
as failed. This behavior is governed by the stream.non.zero.exit.is.failure property
(the default is true).
Another failure mode is the sudden exit of the child JVM; perhaps there is a JVM bug
that causes the JVM to exit for a particular set of circumstances exposed by the
MapReduce user code. In this case, the tasktracker notices that the process has exited and
marks the attempt as failed.
Hanging tasks are dealt with differently. The tasktracker notices that it hasn't received
a progress update for a while and proceeds to mark the task as failed. The child JVM
process will be automatically killed after this period.[5] The timeout period after which
tasks are considered failed is normally 10 minutes and can be configured on a per-job
basis (or a cluster basis) by setting the mapred.task.timeout property to a value in
milliseconds.
5. If a Streaming or Pipes process hangs, the tasktracker will kill it (along with the JVM that launched it)
only in one of the following circumstances: either mapred.task.tracker.task-controller is set to
org.apache.hadoop.mapred.LinuxTaskController, or the default task controller is being used
(org.apache.hadoop.mapred.DefaultTaskController) and the setsid command is available on the system
(so that the child JVM and any processes it launches are in the same process group). In any other case
orphaned Streaming or Pipes processes will accumulate on the system, which will impact utilization over
time.
Setting the timeout to a value of zero disables the timeout, so long-running tasks are
never marked as failed. In this case, a hanging task will never free up its slot, and over
time there may be cluster slowdown as a result. This approach should therefore be
avoided; making sure that a task is reporting progress periodically will suffice (see
"What Constitutes Progress in MapReduce?" on page 193).
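The timeout rule can be sketched in a few lines. The class TaskTimeoutSketch and its method isHung are hypothetical and are not the tasktracker's implementation; the strict "greater than" comparison at the boundary is an assumption.

```java
// Hypothetical helper (not part of Hadoop): decides whether a task attempt should be
// treated as hung, given the time of its last progress report.
final class TaskTimeoutSketch {
    // timeoutMs plays the role of mapred.task.timeout; 600000 is the 10-minute default.
    static boolean isHung(long lastProgressMs, long nowMs, long timeoutMs) {
        if (timeoutMs == 0) {
            return false; // a timeout of zero disables the check entirely
        }
        return nowMs - lastProgressMs > timeoutMs;
    }

    public static void main(String[] args) {
        long tenMinutes = 10 * 60 * 1000L;
        System.out.println(isHung(0L, tenMinutes + 1, tenMinutes)); // true
        System.out.println(isHung(0L, tenMinutes, tenMinutes));     // false
        System.out.println(isHung(0L, Long.MAX_VALUE, 0L));         // false
    }
}
```

The last call shows why a zero timeout is risky: however long the task stays silent, it is never flagged, so its slot is never freed.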
When the jobtracker is notified of a task attempt that has failed (by the tasktracker's
heartbeat call), it will reschedule execution of the task. The jobtracker will try to avoid
rescheduling the task on a tasktracker where it has previously failed. Furthermore, if a
task fails four times (or more), it will not be retried further. This value is configurable:
the maximum number of attempts to run a task is controlled by the
mapred.map.max.attempts property for map tasks and mapred.reduce.max.attempts for
reduce tasks. By default, if any task fails four times (or whatever the maximum number
of attempts is configured to), the whole job fails.
For some applications, it is undesirable to abort the job if a few tasks fail, as it may be
possible to use the results of the job despite some failures. In this case, the maximum
percentage of tasks that are allowed to fail without triggering job failure can be set
for the job. Map tasks and reduce tasks are controlled independently, using
the mapred.max.map.failures.percent and mapred.max.reduce.failures.percent
properties.
A task attempt may also be killed, which is different from it failing. A task attempt may
be killed because it is a speculative duplicate (for more, see "Speculative Execution"
on page 213), or because the tasktracker it was running on failed, and the
jobtracker marked all the task attempts running on it as killed. Killed task attempts do
not count against the number of attempts to run the task (as set by
mapred.map.max.attempts and mapred.reduce.max.attempts), since it wasn't the task's
fault that an attempt was killed.
Users may also kill or fail task attempts using the web UI or the command line (type
hadoop job to see the options). Jobs may also be killed by the same mechanisms.
Tasktracker Failure
Failure of a tasktracker is another failure mode. If a tasktracker fails by crashing, or
running very slowly, it will stop sending heartbeats to the jobtracker (or send them very
infrequently). The jobtracker will notice a tasktracker that has stopped sending
heartbeats (if it hasn't received one for 10 minutes, configured via the
mapred.tasktracker.expiry.interval property, in milliseconds) and remove it from its pool of
tasktrackers to schedule tasks on. The jobtracker arranges for map tasks that were run
and completed successfully on that tasktracker to be rerun if they belong to incomplete
jobs, since their intermediate output residing on the failed tasktracker's local filesystem
may not be accessible to the reduce task. Any tasks in progress are also rescheduled.
A tasktracker can also be blacklisted by the jobtracker, even if the tasktracker has not
failed. If more than four tasks from the same job fail on a particular tasktracker (set by
mapred.max.tracker.failures), then the jobtracker records this as a fault. A tasktracker
is blacklisted if the number of faults is over some minimum threshold (four, set by
mapred.max.tracker.blacklists) and is significantly higher than the average number
of faults for tasktrackers in the cluster.
Blacklisted tasktrackers are not assigned tasks, but they continue to communicate with
the jobtracker. Faults expire over time (at the rate of one per day), so tasktrackers get
the chance to run jobs again simply by leaving them running. Alternatively, if there is
an underlying fault that can be fixed (by replacing hardware, for example), the
tasktracker will be removed from the jobtracker's blacklist after it restarts and rejoins the
cluster.
Jobtracker Failure
Failure of the jobtracker is the most serious failure mode. Hadoop has no mechanism
for dealing with failure of the jobtracker (it is a single point of failure), so in this case
the job fails. However, this failure mode has a low chance of occurring, since the chance
of a particular machine failing is low. The good news is that the situation is improved
in YARN, since one of its design goals is to eliminate single points of failure in
MapReduce.
After restarting a jobtracker, any jobs that were running at the time it was stopped will
need to be resubmitted. There is a configuration option that attempts to recover any
running jobs (mapred.jobtracker.restart.recover, turned off by default); however, it
is known not to work reliably, so it should not be used.
Failures in YARN
For MapReduce programs running on YARN, we need to consider the failure of any of
the following entities: the task, the application master, the node manager, and the
resource manager.
Task Failure
Failure of the running task is similar to the classic case. Runtime exceptions and sudden
exits of the JVM are propagated back to the application master and the task attempt is
marked as failed. Likewise, hanging tasks are noticed by the application master by the
absence of a ping over the umbilical channel (the timeout is set by
mapreduce.task.timeout), and again the task attempt is marked as failed.
The configuration properties for determining when a task is considered to be failed are
the same as the classic case: a task is marked as failed after four attempts (set by
mapreduce.map.maxattempts for map tasks and mapreduce.reduce.maxattempts for
reducer tasks). A job will be failed if more than mapreduce.map.failures.maxpercent
percent of the map tasks in the job fail, or more than
mapreduce.reduce.failures.maxpercent percent of the reduce tasks fail.
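The "more than N percent" rule can be written out as a small check. The sketch below is illustrative only: FailureToleranceSketch and jobFails are hypothetical names, and a "failed task" here means a task that has exhausted its attempts.

```java
// Hypothetical helper (not part of Hadoop): applies the failure-percentage rule to
// one kind of task (map or reduce) in a job.
final class FailureToleranceSketch {
    // maxFailuresPercent plays the role of mapreduce.map.failures.maxpercent or
    // mapreduce.reduce.failures.maxpercent.
    static boolean jobFails(int failedTasks, int totalTasks, int maxFailuresPercent) {
        if (totalTasks <= 0) {
            throw new IllegalArgumentException("totalTasks must be positive");
        }
        // Integer form of: failedTasks / totalTasks * 100 > maxFailuresPercent
        return (long) failedTasks * 100 > (long) maxFailuresPercent * totalTasks;
    }

    public static void main(String[] args) {
        System.out.println(jobFails(1, 100, 0)); // true
        System.out.println(jobFails(5, 100, 5)); // false
        System.out.println(jobFails(6, 100, 5)); // true
    }
}
```

With a limit of 0, a single failed task is enough to fail the job; with a limit of 5 on a 100-task job, up to five tasks may fail.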
Application Master Failure
Just as MapReduce tasks are given several attempts to succeed (in the face of hardware
or network failures), applications in YARN are tried multiple times in the event of
failure. By default, applications are marked as failed if they fail once, but this can be
increased by setting the property yarn.resourcemanager.am.max-retries.
An application master sends periodic heartbeats to the resource manager, and in the
event of application master failure, the resource manager will detect the failure and
start a new instance of the master running in a new container (managed by a node
manager). In the case of the MapReduce application master, it can recover the state of
the tasks that had already been run by the (failed) application so they don't have to be
rerun. By default, recovery is not enabled, so failed application masters will rerun
all their tasks, but you can turn it on by setting
yarn.app.mapreduce.am.job.recovery.enable to true.
The client polls the application master for progress reports, so if its application master
fails the client needs to locate the new instance. During job initialization the client asks
the resource manager for the application master's address, and then caches it, so it
doesn't overload the resource manager with a request every time it needs to poll
the application master. If the application master fails, however, the client will
experience a timeout when it issues a status update, at which point the client will go back to
the resource manager to ask for the new application master's address.
Node Manager Failure
If a node manager fails, then it will stop sending heartbeats to the resource manager,
and the node manager will be removed from the resource manager's pool of available
nodes. The property yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms,
which defaults to 600000 (10 minutes), determines the minimum time the resource
manager waits before considering a node manager that has sent no heartbeat in that
time as failed.
Any task or application master running on the failed node manager will be recovered
using the mechanisms described in the previous two sections.
Node managers may be blacklisted if the number of failures for the application is high.
Blacklisting is done by the application master, and for MapReduce the application
master will try to reschedule tasks on different nodes if more than three tasks fail on a
node manager. The threshold may be set with
mapreduce.job.maxtaskfailures.per.tracker.
Resource Manager Failure
Failure of the resource manager is serious, since without it neither jobs nor task
containers can be launched. The resource manager was designed from the outset to be able
to recover from crashes, by using a checkpointing mechanism to save its state to
persistent storage, although at the time of writing the latest release did not have a complete
implementation.
After a crash, a new resource manager instance is brought up (by an administrator) and
it recovers from the saved state. The state consists of the node managers in the system
as well as the running applications. (Note that tasks are not part of the resource
manager's state, since they are managed by the application. Thus the amount of state to be
stored is much more manageable than that of the jobtracker.)
The storage used by the resource manager is configurable via the
yarn.resourcemanager.store.class property. The default is
org.apache.hadoop.yarn.server.resourcemanager.recovery.MemStore, which keeps the store
in memory, and is therefore not highly available. However, there is a ZooKeeper-based
store in the works that will support reliable recovery from resource manager failures in
the future.
Job Scheduling
Early versions of Hadoop had a very simple approach to scheduling users' jobs: they
ran in order of submission, using a FIFO scheduler. Typically, each job would use the
whole cluster, so jobs had to wait their turn. Although a shared cluster offers great
potential for offering large resources to many users, the problem of sharing resources
fairly between users requires a better scheduler. Production jobs need to complete in a
timely manner, while allowing users who are making smaller ad hoc queries to get
results back in a reasonable time.
Later on, the ability to set a job's priority was added, via the mapred.job.priority
property or the setJobPriority() method on JobClient (both of which take one of the
values VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW). When the job scheduler is choosing the
next job to run, it selects one with the highest priority. However, with the FIFO
scheduler, priorities do not support preemption, so a high-priority job can still be
blocked by a long-running low-priority job that started before the high-priority job was
scheduled.
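A minimal sketch of priority-aware FIFO selection follows. It is not the FIFO scheduler's code: FifoPrioritySketch, QueuedJob, and next are hypothetical names, and breaking ties between equal priorities by submission time is an assumption consistent with the FIFO ordering described above.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical model (not part of Hadoop) of how a FIFO scheduler with priorities
// might choose the next waiting job.
final class FifoPrioritySketch {
    // Declared from highest to lowest, so a lower ordinal means a higher priority.
    enum Priority { VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW }

    static final class QueuedJob {
        final String name;
        final Priority priority;
        final long submitTimeMs;

        QueuedJob(String name, Priority priority, long submitTimeMs) {
            this.name = name;
            this.priority = priority;
            this.submitTimeMs = submitTimeMs;
        }
    }

    // Highest priority first; among equal priorities, the earliest submission wins.
    // Returns null when nothing is waiting.
    static QueuedJob next(List<QueuedJob> waiting) {
        return waiting.stream()
                .min(Comparator.comparing((QueuedJob j) -> j.priority)
                        .thenComparingLong(j -> j.submitTimeMs))
                .orElse(null);
    }

    public static void main(String[] args) {
        List<QueuedJob> waiting = List.of(
                new QueuedJob("report", Priority.NORMAL, 1L),
                new QueuedJob("etl", Priority.HIGH, 5L),
                new QueuedJob("alerts", Priority.HIGH, 3L));
        System.out.println(next(waiting).name); // alerts
    }
}
```

Note that the selection only applies to waiting jobs. A job that is already running is not interrupted, which is exactly the lack of preemption the text describes.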
MapReduce in Hadoop comes with a choice of schedulers. The default in MapReduce
1 is the original FIFO queue-based scheduler, and there are also multiuser schedulers
called the Fair Scheduler and the Capacity Scheduler.
MapReduce 2 comes with the Capacity Scheduler (the default), and the FIFO scheduler.
The Fair Scheduler
The Fair Scheduler aims to give every user a fair share of the cluster capacity over time.
If a single job is running, it gets all of the cluster. As more jobs are submitted, free task
slots are given to the jobs in such a way as to give each user a fair share of the cluster.
A short job belonging to one user will complete in a reasonable time even while another
user's long job is running, and the long job will still make progress.
Jobs are placed in pools, and by default, each user gets their own pool. A user who
submits more jobs than a second user will not get any more cluster resources than the
second, on average. It is also possible to define custom pools with guaranteed minimum
capacities defined in terms of the number of map and reduce slots, and to set weightings
for each pool.
The Fair Scheduler supports preemption, so if a pool has not received its fair share for
a certain period of time, then the scheduler will kill tasks in pools running over capacity
in order to give the slots to the pool running under capacity.
The Fair Scheduler is a contrib module. To enable it, place its JAR file on Hadoop's
classpath, by copying it from Hadoop's contrib/fairscheduler directory to the lib
directory. Then set the mapred.jobtracker.taskScheduler property to:
org.apache.hadoop.mapred.FairScheduler
The Fair Scheduler will work without further configuration, but to take full advantage
of its features and learn how to configure it (including its web interface), refer to the
README in the src/contrib/fairscheduler directory of the distribution.
The Capacity Scheduler
The Capacity Scheduler takes a slightly different approach to multiuser scheduling. A
cluster is made up of a number of queues (like the Fair Scheduler's pools), which may
be hierarchical (so a queue may be the child of another queue), and each queue has an
allocated capacity. This is like the Fair Scheduler, except that within each queue, jobs
are scheduled using FIFO scheduling (with priorities). In effect, the Capacity Scheduler
allows users or organizations (defined using queues) to simulate a separate MapReduce
cluster with FIFO scheduling for each user or organization. By contrast, the Fair
Scheduler (which also supports FIFO job scheduling within pools as an option, making
it like the Capacity Scheduler) enforces fair sharing within each pool, so running
jobs share the pool's resources.
Shuffle and Sort
MapReduce makes the guarantee that the input to every reducer is sorted by key. The
process by which the system performs the sort (and transfers the map outputs to the
reducers as inputs) is known as the shuffle.[6] In this section, we look at how the shuffle
works, as a basic understanding would be helpful, should you need to optimize a
MapReduce program. The shuffle is an area of the codebase where refinements and
improvements are continually being made, so the following description necessarily
conceals many details (and may change over time; this is for version 0.20). In many
ways, the shuffle is the heart of MapReduce and is where the "magic" happens.
The Map Side
When the map function starts producing output, it is not simply written to disk. The
process is more involved, and takes advantage of buffering writes in memory and doing
some presorting for efficiency reasons. Figure 6-6 shows what happens.
Each map task has a circular memory buffer that it writes the output to. The buffer is
100 MB by default, a size which can be tuned by changing the io.sort.mb property.
When the contents of the buffer reach a certain threshold size (io.sort.spill.percent,
default 0.80, or 80%), a background thread will start to spill the contents to disk.
Map outputs will continue to be written to the buffer while the spill takes place, but if
the buffer fills up during this time, the map will block until the spill is complete.
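To make the threshold concrete, the sketch below computes the buffer occupancy at which spilling would start. SpillThresholdSketch is a hypothetical name, and the calculation is a simplification: it treats the buffer as a single region and ignores the portion set aside for record boundaries (io.sort.record.percent).

```java
// Hypothetical helper (not part of Hadoop): computes the number of buffered bytes at
// which a background spill to disk would begin.
final class SpillThresholdSketch {
    // ioSortMb plays the role of io.sort.mb, spillPercent of io.sort.spill.percent.
    static long spillThresholdBytes(int ioSortMb, double spillPercent) {
        long bufferBytes = ioSortMb * 1024L * 1024L;
        return Math.round(bufferBytes * spillPercent);
    }

    public static void main(String[] args) {
        // With the defaults (100 MB buffer, 0.80 threshold) spilling starts at 80 MB.
        System.out.println(spillThresholdBytes(100, 0.80)); // 83886080
    }
}
```

The remaining 20% of the buffer is what lets the map keep writing while the spill is in progress.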
Spills are written in round-robin fashion to the directories specified by the
mapred.local.dir property, in a job-specific subdirectory.
Figure 6-6. Shuffle and sort in MapReduce
Before it writes to disk, the thread first divides the data into partitions corresponding
to the reducers that they will ultimately be sent to. Within each partition, the
background thread performs an in-memory sort by key, and if there is a combiner function,
it is run on the output of the sort. Running the combiner function makes for a more
compact map output, so there is less data to write to local disk and to transfer to the
reducer.
6. The term shuffle is actually imprecise, since in some contexts it refers to only the part of the process where
map outputs are fetched by reduce tasks. In this section, we take it to mean the whole process from the
point where a map produces output to where a reduce consumes input.
Each time the memory buffer reaches the spill threshold, a new spill file is created, so
after the map task has written its last output record there could be several spill files.
Before the task is finished, the spill files are merged into a single partitioned and sorted
output file. The configuration property io.sort.factor controls the maximum number
of streams to merge at once; the default is 10.
If there are at least three spill files (set by the min.num.spills.for.combine property)
then the combiner is run again before the output file is written. Recall that combiners
may be run repeatedly over the input without affecting the final result. If there are only
one or two spills, then the potential reduction in map output size is not worth the
overhead in invoking the combiner, so it is not run again for this map output.
It is often a good idea to compress the map output as it is written to disk, since doing
so makes it faster to write to disk, saves disk space, and reduces the amount of data to
transfer to the reducer. By default, the output is not compressed, but it is easy to enable
by setting mapred.compress.map.output to true. The compression library to use is
specified by mapred.map.output.compression.codec; see "Compression" on page 85 for more
on compression formats.
The output file's partitions are made available to the reducers over HTTP. The
maximum number of worker threads used to serve the file partitions is controlled by the
tasktracker.http.threads property; this setting is per tasktracker, not per map task
slot. The default of 40 may need increasing for large clusters running large jobs. In
MapReduce 2, this property is not applicable since the maximum number of threads
used is set automatically based on the number of processors on the machine.
(MapReduce 2 uses Netty, which by default allows up to twice as many threads as there are
processors.)
The Reduce Side
Let's turn now to the reduce part of the process. The map output file is sitting on the
local disk of the machine that ran the map task (note that although map outputs always
get written to local disk, reduce outputs may not be), but now it is needed by the
machine that is about to run the reduce task for the partition. Furthermore, the reduce
task needs the map output for its particular partition from several map tasks across the
cluster. The map tasks may finish at different times, so the reduce task starts copying
their outputs as soon as each completes. This is known as the copy phase of the reduce
task. The reduce task has a small number of copier threads so that it can fetch map
outputs in parallel. The default is five threads, but this number can be changed by
setting the mapred.reduce.parallel.copies property.
How do reducers know which machines to fetch map output from?
As map tasks complete successfully, they notify their parent tasktracker
of the status update, which in turn notifies the jobtracker. (In
MapReduce 2, the tasks notify their application master directly.) These
notifications are transmitted over the heartbeat communication mechanism
described earlier. Therefore, for a given job, the jobtracker (or
application master) knows the mapping between map outputs and hosts. A
thread in the reducer periodically asks the master for map output hosts
until it has retrieved them all.
Hosts do not delete map outputs from disk as soon as the first reducer
has retrieved them, as the reducer may subsequently fail. Instead, they
wait until they are told to delete them by the jobtracker (or application
master), which is after the job has completed.
The map outputs are copied to the reduce task JVM's memory if they are small enough
(the buffer's size is controlled by mapred.job.shuffle.input.buffer.percent, which
specifies the proportion of the heap to use for this purpose); otherwise, they are copied
to disk. When the in-memory buffer reaches a threshold size (controlled by
mapred.job.shuffle.merge.percent), or reaches a threshold number of map outputs
(mapred.inmem.merge.threshold), it is merged and spilled to disk. If a combiner is
specified it will be run during the merge to reduce the amount of data written to disk.
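The two merge triggers can be sketched as follows. ShuffleBufferSketch and its methods are hypothetical names, not Hadoop code; the sketch assumes the size trigger is the buffer size multiplied by the merge percentage, and that a count threshold of zero or less means "no count threshold", as described in Table 6-2.

```java
// Hypothetical helper (not part of Hadoop): models when the reduce-side in-memory
// buffer would be merged and spilled to disk.
final class ShuffleBufferSketch {
    // inputBufferPercent plays the role of mapred.job.shuffle.input.buffer.percent.
    static long inputBufferBytes(long heapBytes, double inputBufferPercent) {
        return Math.round(heapBytes * inputBufferPercent);
    }

    // mergePercent plays the role of mapred.job.shuffle.merge.percent and
    // inmemThreshold the role of mapred.inmem.merge.threshold.
    static boolean shouldMerge(long bufferedBytes, int bufferedOutputs, long heapBytes,
                               double inputBufferPercent, double mergePercent,
                               int inmemThreshold) {
        long triggerBytes =
                Math.round(inputBufferBytes(heapBytes, inputBufferPercent) * mergePercent);
        boolean sizeReached = bufferedBytes >= triggerBytes;
        boolean countReached = inmemThreshold > 0 && bufferedOutputs >= inmemThreshold;
        return sizeReached || countReached;
    }

    public static void main(String[] args) {
        long heap = 1000L; // a toy heap of 1000 bytes keeps the numbers readable
        System.out.println(inputBufferBytes(heap, 0.70));                  // 700
        System.out.println(shouldMerge(462L, 1, heap, 0.70, 0.66, 1000));  // true
        System.out.println(shouldMerge(461L, 1, heap, 0.70, 0.66, 1000));  // false
        System.out.println(shouldMerge(0L, 5000, heap, 0.70, 0.66, 0));    // false
    }
}
```

With the defaults, the buffer is 70% of the heap and a merge starts once it is 66% full, or once 1,000 map outputs have accumulated, whichever comes first.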
As the copies accumulate on disk, a background thread merges them into larger, sorted
files. This saves some time merging later on. Note that any map outputs that were
compressed (by the map task) have to be decompressed in memory in order to perform
a merge on them.
When all the map outputs have been copied, the reduce task moves into the sort
phase (which should properly be called the merge phase, as the sorting was carried out
on the map side), which merges the map outputs, maintaining their sort ordering. This
is done in rounds. For example, if there were 50 map outputs, and the merge factor was
10 (the default, controlled by the io.sort.factor property, just like in the map's merge),
then there would be 5 rounds. Each round would merge 10 files into one, so at the end
there would be five intermediate files.
Rather than have a final round that merges these five files into a single sorted file, the
merge saves a trip to disk by directly feeding the reduce function in what is the last
phase: the reduce phase. This final merge can come from a mixture of in-memory and
on-disk segments.
The number of files merged in each round is actually more subtle than
this example suggests. The goal is to merge the minimum number of
files to get to the merge factor for the final round. So if there were 40
files, the merge would not merge 10 files in each of the four rounds to
get 4 files. Instead, the first round would merge only 4 files, and the
subsequent three rounds would merge the full 10 files. The 4 merged
files, and the 6 (as yet unmerged) files make a total of 10 files for the
final round. The process is illustrated in Figure 6-7.
Note that this does not change the number of rounds; it's just an
optimization to minimize the amount of data that is written to disk, since
the final round always merges directly into the reduce.
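The round sizes in this example can be reproduced with a short calculation. The sketch below is a reconstruction under assumptions, not code from the text: MergePlanSketch and intermediateRounds are hypothetical names, and the formula for the first round (merge just enough files that every later round, including the final one, handles exactly the merge factor) was chosen because it matches the 40-file example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Hadoop): plans how many files each intermediate
// merge round combines. The final round, which feeds the reduce, is not listed.
final class MergePlanSketch {
    static List<Integer> intermediateRounds(int files, int factor) {
        if (factor < 2) {
            throw new IllegalArgumentException("merge factor must be at least 2");
        }
        List<Integer> rounds = new ArrayList<>();
        int remaining = files;
        boolean first = true;
        while (remaining > factor) {
            int toMerge = factor;
            if (first) {
                // Each full round reduces the file count by (factor - 1). Merging only
                // the "remainder" first leaves exactly `factor` files for the last round.
                int mod = (remaining - 1) % (factor - 1);
                if (mod != 0) {
                    toMerge = mod + 1;
                }
                first = false;
            }
            rounds.add(toMerge);
            remaining = remaining - toMerge + 1; // merged files are replaced by one file
        }
        return rounds;
    }

    public static void main(String[] args) {
        System.out.println(intermediateRounds(40, 10)); // [4, 10, 10, 10]
        System.out.println(intermediateRounds(10, 10)); // []
    }
}
```

For 40 files and a merge factor of 10, the plan is one round of 4 followed by three rounds of 10, which leaves 10 files for the final merge into the reduce.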
Figure 6-7. Efficiently merging 40 file segments with a merge factor of 10
During the reduce phase, the reduce function is invoked for each key in the sorted
output. The output of this phase is written directly to the output filesystem, typically
HDFS. In the case of HDFS, since the tasktracker node (or node manager) is also
running a datanode, the first block replica will be written to the local disk.
Configuration Tuning
We are now in a better position to understand how to tune the shuffle to improve
MapReduce performance. The relevant settings, which can be used on a per-job basis
(except where noted), are summarized in Tables 6-1 and 6-2, along with the defaults,
which are good for general-purpose jobs.
The general principle is to give the shuffle as much memory as possible. However, there
is a trade-off, in that you need to make sure that your map and reduce functions get
enough memory to operate. This is why it is best to write your map and reduce functions
to use as little memory as possible; certainly they should not use an unbounded
amount of memory (by avoiding accumulating values in a map, for example).
The amount of memory given to the JVMs in which the map and reduce tasks run is
set by the mapred.child.java.opts property. You should try to make this as large as
possible for the amount of memory on your task nodes; the discussion in "Memory"
on page 305 goes through the constraints to consider.
On the map side, the best performance can be obtained by avoiding multiple spills to
disk; one is optimal. If you can estimate the size of your map outputs, then you can set
the io.sort.* properties appropriately to minimize the number of spills. In particular,
you should increase io.sort.mb if you can. There is a MapReduce counter ("Spilled
records"; see "Counters" on page 257) that counts the total number of records that
were spilled to disk over the course of a job, which can be useful for tuning. Note that
the counter includes both map-side and reduce-side spills.
On the reduce side, the best performance is obtained when the intermediate data can
reside entirely in memory. By default, this does not happen, since for the general case
all the memory is reserved for the reduce function. But if your reduce function has light
memory requirements, then setting mapred.inmem.merge.threshold to 0 and
mapred.job.reduce.input.buffer.percent to 1.0 (or a lower value; see Table 6-2) may
bring a performance boost.
More generally, Hadoop uses a buffer size of 4 KB by default, which is low, so you
should increase this across the cluster (by setting io.file.buffer.size; see also "Other
Hadoop Properties" on page 315).
In April 2008, Hadoop won the general-purpose terabyte sort benchmark (described
in "TeraByte Sort on Apache Hadoop" on page 601), and one of the optimizations
used was keeping the intermediate data in memory on the reduce side.
Table 6-1. Map-side tuning properties
Property name | Type | Default value | Description
io.sort.mb | int | 100 | The size, in megabytes, of the memory buffer to use while sorting map output.
io.sort.record.percent | float | 0.05 | The proportion of io.sort.mb reserved for storing record boundaries of the map outputs. The remaining space is used for the map output records themselves. This property was removed in release 0.21.0 as the shuffle code was improved to do a better job of using all the available memory for map output and accounting information.
io.sort.spill.percent | float | 0.80 | The threshold usage proportion for both the map output memory buffer and the record boundaries index to start the process of spilling to disk.
io.sort.factor | int | 10 | The maximum number of streams to merge at once when sorting files. This property is also used in the reduce. It's fairly common to increase this to 100.
min.num.spills.for.combine | int | 3 | The minimum number of spill files needed for the combiner to run (if a combiner is specified).
mapred.compress.map.output | boolean | false | Compress map outputs.
mapred.map.output.compression.codec | Class name | org.apache.hadoop.io.compress.DefaultCodec | The compression codec to use for map outputs.
tasktracker.http.threads | int | 40 | The number of worker threads per tasktracker for serving the map outputs to reducers. This is a cluster-wide setting and cannot be set by individual jobs. Not applicable in MapReduce 2.
Table 6-2. Reduce-side tuning properties
Property name | Type | Default value | Description
mapred.reduce.parallel.copies | int | 5 | The number of threads used to copy map outputs to the reducer.
mapred.reduce.copy.backoff | int | 300 | The maximum amount of time, in seconds, to spend retrieving one map output for a reducer before declaring it as failed. The reducer may repeatedly reattempt a transfer within this time if it fails (using exponential backoff).
io.sort.factor | int | 10 | The maximum number of streams to merge at once when sorting files. This property is also used in the map.
mapred.job.shuffle.input.buffer.percent | float | 0.70 | The proportion of total heap size to be allocated to the map outputs buffer during the copy phase of the shuffle.
mapred.job.shuffle.merge.percent | float | 0.66 | The threshold usage proportion for the map outputs buffer (defined by mapred.job.shuffle.input.buffer.percent) for starting the process of merging the outputs and spilling to disk.
mapred.inmem.merge.threshold | int | 1000 | The threshold number of map outputs for starting the process of merging the outputs and spilling to disk. A value of 0 or less means there is no threshold, and the spill behavior is governed solely by mapred.job.shuffle.merge.percent.
mapred.job.reduce.input.buffer.percent | float | 0.0 | The proportion of total heap size to be used for retaining map outputs in memory during the reduce. For the reduce phase to begin, the size of map outputs in memory must be no more than this size. By default, all map outputs are merged to disk before the reduce begins, to give the reducers as much memory as possible. However, if your reducers require less memory, this value may be increased to minimize the number of trips to disk.
Task Execution
We saw how the MapReduce system executes tasks in the context of the overall job at
the beginning of the chapter in "Anatomy of a MapReduce Job Run" on page 187. In
this section, we'll look at some more controls that MapReduce users have over task
execution.
The Task Execution Environment
Hadoop provides information to a map or reduce task about the environment in which
it is running. For example, a map task can discover the name of the file it is processing
(see "File information in the mapper" on page 239), and a map or reduce task can find
out the attempt number of the task. The properties in Table 6-3 can be accessed from
the job's configuration, obtained in the old MapReduce API by providing an
implementation of the configure() method for Mapper or Reducer, where the configuration
is passed in as an argument. In the new API these properties can be accessed from the
context object passed to all methods of the Mapper or Reducer.
Table 6-3. Task environment properties
Property name | Type | Description | Example
mapred.job.id | String | The job ID. (See "Job, Task, and Task Attempt IDs" on page 163 for a description of the format.) | job_200811201130_0004
mapred.tip.id | String | The task ID. | task_200811201130_0004_m_000003
mapred.task.id | String | The task attempt ID. (Not the task ID.) | attempt_200811201130_0004_m_000003_0
mapred.task.partition | int | The index of the task within the job. | 3
mapred.task.is.map | boolean | Whether this task is a map task. | true
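The example values in Table 6-3 are related: the job ID and task ID are both prefixes of the task attempt ID. The sketch below splits an attempt ID string to show this. It is plain string handling under the assumption that IDs always have the six underscore-separated fields seen in the table; AttemptIdParts is a hypothetical helper, not a Hadoop class.

```java
// Illustrative only: splits a task attempt ID of the shape shown in Table 6-3.
// AttemptIdParts is a made-up helper; it does not use any Hadoop classes.
class AttemptIdParts {
    final String jobId;
    final String taskId;
    final boolean isMap;
    final int partition;
    final int attemptNumber;

    AttemptIdParts(String attemptId) {
        // Assumed layout: attempt_<timestamp>_<job number>_<m|r>_<task number>_<attempt number>
        String[] f = attemptId.split("_");
        if (f.length != 6 || !f[0].equals("attempt")) {
            throw new IllegalArgumentException("Not a task attempt ID: " + attemptId);
        }
        jobId = "job_" + f[1] + "_" + f[2];
        taskId = "task_" + f[1] + "_" + f[2] + "_" + f[3] + "_" + f[4];
        isMap = f[3].equals("m");
        partition = Integer.parseInt(f[4]);      // leading zeros are ignored
        attemptNumber = Integer.parseInt(f[5]);
    }

    public static void main(String[] args) {
        AttemptIdParts p = new AttemptIdParts("attempt_200811201130_0004_m_000003_0");
        System.out.println(p.jobId + " " + p.taskId + " " + p.isMap
                + " " + p.partition + " " + p.attemptNumber);
    }
}
```

In a real task there is no need to parse the string, since each of these values is available directly as a property.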
Streaming environment variables
Hadoop sets job configuration parameters as environment variables for Streaming
programs. However, it replaces nonalphanumeric characters with underscores to make
sure they are valid names. The following Python expression illustrates how you can
retrieve the value of the mapred.job.id property from within a Python Streaming script:
os.environ["mapred_job_id"]
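The renaming rule can be sketched in a few lines of Java. This is a model of the rule as described here (every character that is not a letter or digit becomes an underscore), not the Streaming source code, and StreamingEnvName is a hypothetical name.

```java
// Sketch of the property-name-to-environment-variable rule described above.
// StreamingEnvName is an illustrative class, not part of Hadoop.
class StreamingEnvName {

    /** Replaces every character that is not an ASCII letter or digit with an underscore. */
    static String toEnvName(String propertyName) {
        return propertyName.replaceAll("[^A-Za-z0-9]", "_");
    }

    public static void main(String[] args) {
        String envName = toEnvName("mapred.job.id");
        System.out.println(envName);
        // Outside a Streaming task the variable is normally unset, so this prints null.
        System.out.println(System.getenv(envName));
    }
}
```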
You can also set environment variables for the Streaming processes launched by
MapReduce by supplying the -cmdenv option to the Streaming launcher program (once for
each variable you wish to set). For example, the following sets the MAGIC_PARAMETER
environment variable:
-cmdenv MAGIC_PARAMETER=abracadabra
Speculative Execution
The MapReduce model is to break jobs into tasks and run the tasks in parallel to make
the overall job execution time smaller than it would otherwise be if the tasks ran
sequentially. This makes job execution time sensitive to slow-running tasks, as it takes
only one slow task to make the whole job take significantly longer than it would have
done otherwise. When a job consists of hundreds or thousands of tasks, the possibility
of a few straggling tasks is very real.
Tasks may be slow for various reasons, including hardware degradation or software
misconfiguration, but the causes may be hard to detect since the tasks still complete
successfully, albeit after a longer time than expected. Hadoop doesn't try to diagnose
and fix slow-running tasks; instead, it tries to detect when a task is running slower than
expected and launches another, equivalent, task as a backup. This is termed speculative
execution of tasks.
It's important to understand that speculative execution does not work by launching
two duplicate tasks at about the same time so they can race each other. This would be
wasteful of cluster resources. Rather, a speculative task is launched only after all the
tasks for a job have been launched, and then only for tasks that have been running for
some time (at least a minute) and have failed to make as much progress, on average, as
the other tasks from the job. When a task completes successfully, any duplicate tasks
that are running are killed since they are no longer needed. So if the original task
completes before the speculative task, then the speculative task is killed; on the other hand,
if the speculative task finishes first, then the original is killed.
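The three conditions in the paragraph above can be written as a single predicate. The sketch below is a deliberately simplified model of the prose, not the scheduler's actual algorithm, which uses more detailed progress statistics; the class, method, and parameter names are all hypothetical.

```java
// Simplified model of the launch conditions described in the text.
// Not Hadoop's scheduling code: names and the exact comparison are assumptions.
class SpeculationCheck {
    static final long MIN_RUNTIME_MS = 60_000L;  // "at least a minute"

    static boolean isCandidate(boolean allTasksLaunched, long runtimeMs,
                               double progress, double averageProgress) {
        return allTasksLaunched                 // every task of the job has been started
                && runtimeMs >= MIN_RUNTIME_MS  // the task has run for a while
                && progress < averageProgress;  // and it lags behind its peers
    }

    public static void main(String[] args) {
        System.out.println(isCandidate(true, 90_000L, 0.20, 0.60));  // slow straggler
        System.out.println(isCandidate(true, 30_000L, 0.20, 0.60));  // too young to judge
        System.out.println(isCandidate(false, 90_000L, 0.20, 0.60)); // tasks still waiting to launch
    }
}
```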
Speculative execution is an optimization, not a feature to make jobs run more reliably.
If there are bugs that sometimes cause a task to hang or slow down, then relying on
speculative execution to avoid these problems is unwise, and won't work reliably, since
the same bugs are likely to affect the speculative task. You should fix the bug so that
the task doesn't hang or slow down.
Speculative execution is turned on by default. It can be enabled or disabled
independently for map tasks and reduce tasks, on a cluster-wide basis, or on a per-job basis. The
relevant properties are shown in Table 6-4.
Table 6-4. Speculative execution properties
Property name | Type | Default value | Description
mapred.map.tasks.speculative.execution | boolean | true | Whether extra instances of map tasks may be launched if a task is making slow progress.
mapred.reduce.tasks.speculative.execution | boolean | true | Whether extra instances of reduce tasks may be launched if a task is making slow progress.
yarn.app.mapreduce.am.job.speculator.class | Class | org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator | (MapReduce 2 only) The Speculator class implementing the speculative execution policy.
yarn.app.mapreduce.am.job.task.estimator.class | Class | org.apache.hadoop.mapreduce.v2.app.speculate.LegacyTaskRuntimeEstimator | (MapReduce 2 only) An implementation of TaskRuntimeEstimator that provides estimates for task runtimes, and used by Speculator instances.
Why would you ever want to turn off speculative execution? The goal of speculative
execution is to reduce job execution time, but this comes at the cost of cluster efficiency.
On a busy cluster, speculative execution can reduce overall throughput, since
redundant tasks are being executed in an attempt to bring down the execution time for a
single job. For this reason, some cluster administrators prefer to turn it off on the cluster
and have users explicitly turn it on for individual jobs. This was especially relevant for
older versions of Hadoop, when speculative execution could be overly aggressive in
scheduling speculative tasks.
There is a good case for turning off speculative execution for reduce tasks, since any
duplicate reduce tasks have to fetch the same map outputs as the original task, and this
can significantly increase network traffic on the cluster.
Another reason that speculative execution is turned off is for tasks that are not
idempotent. However, in many cases it is possible to write tasks to be idempotent and use
an OutputCommitter to promote the output to its final location when the task succeeds.
This technique is explained in more detail in the next section.
Output Committers
Hadoop MapReduce uses a commit protocol to ensure that jobs and tasks either
succeed, or fail cleanly. The behavior is implemented by the OutputCommitter in use for the
job, and this is set in the old MapReduce API by calling the setOutputCommitter() on
JobConf, or by setting mapred.output.committer.class in the configuration. In the new
MapReduce API, the OutputCommitter is determined by the OutputFormat, via its
getOutputCommitter() method. The default is FileOutputCommitter, which is appropriate for
file-based MapReduce. You can customize an existing OutputCommitter or even write a
new implementation if you need to do special setup or cleanup for jobs or tasks.
The OutputCommitter API is as follows (in both old and new MapReduce APIs):
  public abstract class OutputCommitter {
    public abstract void setupJob(JobContext jobContext) throws IOException;
    public void commitJob(JobContext jobContext) throws IOException { }
    public void abortJob(JobContext jobContext, JobStatus.State state)
        throws IOException { }
    public abstract void setupTask(TaskAttemptContext taskContext)
        throws IOException;
    public abstract boolean needsTaskCommit(TaskAttemptContext taskContext)
        throws IOException;
    public abstract void commitTask(TaskAttemptContext taskContext)
        throws IOException;
    public abstract void abortTask(TaskAttemptContext taskContext)
        throws IOException;
  }
The setupJob() method is called before the job is run, and is typically used to perform
initialization. For FileOutputCommitter the method creates the final output directory,
${mapred.output.dir}, and a temporary working space for task output,
${mapred.output.dir}/_temporary.
If the job succeeds then the commitJob() method is called, which in the default
file-based implementation deletes the temporary working space, and creates a hidden
empty marker file in the output directory called _SUCCESS to indicate to filesystem
clients that the job completed successfully. If the job did not succeed, then
abortJob() is called with a state object indicating whether the job failed or was killed (by a
user, for example). In the default implementation this will delete the job's temporary
working space.
The operations are similar at the task level. The setupTask() method is called before
the task is run, and the default implementation doesn't do anything, since temporary
directories named for task outputs are created when the task outputs are written.
The commit phase for tasks is optional, and may be disabled by returning false from
needsTaskCommit(). This saves the framework from having to run the distributed
commit protocol for the task, and neither commitTask() nor abortTask() is called.
FileOutputCommitter will skip the commit phase when no output has been written by a task.
If a task succeeds then commitTask() is called, which in the default implementation
moves the temporary task output directory (which has the task attempt ID in its name
to avoid conflicts between task attempts) to the final output path,
${mapred.output.dir}. Otherwise, the framework calls abortTask(), which deletes the temporary
task output directory.
The framework ensures that in the event of multiple task attempts for a particular task,
only one will be committed, and the others will be aborted. This situation may arise
because the first attempt failed for some reason, in which case it would be aborted,
and a later, successful attempt would be committed. Another case is if two task attempts
were running concurrently as speculative duplicates; then the one that finished first
would be committed, and the other would be aborted.
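The file-based protocol described in this section can be imitated on a local filesystem to see why concurrent attempts do not clash. The sketch below is a toy model using java.nio.file, not FileOutputCommitter itself: the class name, the attempt names, and the directory layout under _temporary are simplifying assumptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Toy model of a file-based commit protocol on the local filesystem (not Hadoop code).
class CommitProtocolSketch {
    private final Path outputDir;
    private final Path tempDir;

    CommitProtocolSketch(Path outputDir) {
        this.outputDir = outputDir;
        this.tempDir = outputDir.resolve("_temporary");
    }

    void setupJob() throws IOException {
        Files.createDirectories(tempDir);
    }

    /** Each attempt writes into its own directory, so attempts cannot overwrite each other. */
    Path workDir(String attemptId) throws IOException {
        return Files.createDirectories(tempDir.resolve(attemptId));
    }

    /** Promotes the attempt's files to the final output directory. */
    void commitTask(String attemptId) throws IOException {
        Path work = tempDir.resolve(attemptId);
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(work)) {
            for (Path file : stream) {
                files.add(file);
            }
        }
        for (Path file : files) {
            Files.move(file, outputDir.resolve(file.getFileName()));
        }
        Files.delete(work);
    }

    /** Discards everything the attempt wrote. */
    void abortTask(String attemptId) throws IOException {
        deleteTree(tempDir.resolve(attemptId));
    }

    void commitJob() throws IOException {
        deleteTree(tempDir);
        Files.createFile(outputDir.resolve("_SUCCESS"));
    }

    private static void deleteTree(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> deepestFirst;
        try (Stream<Path> walk = Files.walk(root)) {
            deepestFirst = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : deepestFirst) {
            Files.delete(p);
        }
    }

    /** Runs two duplicate attempts of one task and returns a sorted listing of the output. */
    static List<String> runDemo() {
        try {
            Path out = Files.createTempDirectory("job-output");
            CommitProtocolSketch job = new CommitProtocolSketch(out);
            job.setupJob();
            Files.write(job.workDir("attempt_0").resolve("part-00000"),
                    "first".getBytes(StandardCharsets.UTF_8));
            Files.write(job.workDir("attempt_1").resolve("part-00000"),
                    "second".getBytes(StandardCharsets.UTF_8));
            job.commitTask("attempt_0");  // the attempt that finished first
            job.abortTask("attempt_1");   // its duplicate is discarded
            job.commitJob();
            List<String> listing = new ArrayList<>();
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(out)) {
                for (Path f : stream) {
                    listing.add(f.getFileName() + " (" + Files.size(f) + " bytes)");
                }
            }
            Collections.sort(listing);
            return listing;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        runDemo().forEach(System.out::println);
    }
}
```

Both attempts write a file with the same name, yet only the committed attempt's five-byte file reaches the output directory, alongside the empty _SUCCESS marker.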
Task side-effect files
The usual way of writing output from map and reduce tasks is by using the
OutputCollector to collect key-value pairs. Some applications need more flexibility than a single
key-value pair model, so these applications write output files directly from the map or
reduce task to a distributed filesystem, like HDFS. (There are other ways to produce
multiple outputs, too, as described in "Multiple Outputs" on page 251.)
Care needs to be taken to ensure that multiple instances of the same task don't try to
write to the same file. As we saw in the previous section, the OutputCommitter protocol
solves this problem. If applications write side files in their tasks' working directories,
then the side files for tasks that successfully complete will be promoted to the output
directory automatically, while failed tasks will have their side files deleted.
A task may find its working directory by retrieving the value of the
mapred.work.output.dir property from its configuration file. Alternatively, a MapReduce program using
the Java API may call the getWorkOutputPath() static method on FileOutputFormat to
get the Path object representing the working directory. The framework creates the
working directory before executing the task, so you don't need to create it.
To take a simple example, imagine a program for converting image files from one format
to another. One way to do this is to have a map-only job, where each map is given a
set of images to convert (perhaps using NLineInputFormat; see "NLineInputFormat"
on page 245). If a map task writes the converted images into its working directory,
then they will be promoted to the output directory when the task successfully finishes.
Task JVM Reuse
Hadoop runs tasks in their own Java Virtual Machine to isolate them from other
running tasks. The overhead of starting a new JVM for each task can take around a second,
which for jobs that run for a minute or so is insignificant. However, jobs that have a
large number of very short-lived tasks (these are usually map tasks), or that have lengthy
initialization, can see performance gains when the JVM is reused for subsequent tasks. [7]
Note that, with task JVM reuse enabled, tasks are not run concurrently in a single JVM;
rather, the JVM runs tasks sequentially. Tasktrackers can, however, run more than one
task at a time, but this is always done in separate JVMs. The properties for controlling
the tasktrackers' number of map task slots and reduce task slots are discussed in
"Memory" on page 305.
The property for controlling task JVM reuse is mapred.job.reuse.jvm.num.tasks: it
specifies the maximum number of tasks to run for a given job for each JVM launched;
the default is 1 (see Table 6-5). No distinction is made between map or reduce tasks;
however, tasks from different jobs are always run in separate JVMs. The method
setNumTasksToExecutePerJvm() on JobConf can also be used to configure this property.
Table 6-5. Task JVM Reuse properties
Property name | Type | Default value | Description
mapred.job.reuse.jvm.num.tasks | int | 1 | The maximum number of tasks to run for a given job for each JVM on a tasktracker. A value of -1 indicates no limit: the same JVM may be used for all tasks for a job.
Tasks that are CPU-bound may also benefit from task JVM reuse by taking advantage
of runtime optimizations applied by the HotSpot JVM. After running for a while, the
HotSpot JVM builds up enough information to detect performance-critical sections in
the code and dynamically translates the Java byte codes of these hot spots into native
machine code. This works well for long-running processes, but JVMs that run for
seconds or a few minutes may not gain the full benefit of HotSpot. In these cases, it is
worth enabling task JVM reuse.
Another place where a shared JVM is useful is for sharing state between the tasks of a
job. By storing reference data in a static field, tasks get rapid access to the shared data.
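A minimal sketch of the static-field idea follows. The ReferenceData class and the lookup entries in it are made-up placeholders; a real task would load its reference data from a file or the distributed cache. The point is only that the load happens once per JVM, so later tasks running in a reused JVM find the data already in memory.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for reference data shared by all tasks that run in the same JVM.
class ReferenceData {
    private static Map<String, String> cache;  // survives between tasks when the JVM is reused
    private static int loadCount = 0;

    /** Loads the data on first use; later callers in the same JVM get the cached copy. */
    static synchronized Map<String, String> get() {
        if (cache == null) {
            cache = Collections.unmodifiableMap(load());
            loadCount++;
        }
        return cache;
    }

    static synchronized int loadCount() {
        return loadCount;
    }

    // Placeholder for an expensive load; the entries are invented sample values.
    private static Map<String, String> load() {
        Map<String, String> data = new HashMap<>();
        data.put("station-a", "Example Station A");
        data.put("station-b", "Example Station B");
        return data;
    }

    public static void main(String[] args) {
        get();  // first task in this JVM pays for the load
        get();  // a subsequent task reuses it
        System.out.println("loads: " + loadCount());
    }
}
```

Making the shared map unmodifiable is a design choice for the sketch: state that outlives a task is easier to reason about when tasks cannot change it.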
Skipping Bad Records
Large datasets are messy. They often have corrupt records. They often have records
that are in a different format. They often have missing fields. In an ideal world, your
code would cope gracefully with all of these conditions. In practice, it is often expedient
to ignore the offending records. Depending on the analysis being performed, if only a
small percentage of records are affected, then skipping them may not significantly affect
the result. However, if a task trips up when it encounters a bad record (by throwing
a runtime exception), then the task fails. Failing tasks are retried (since the failure may
be due to hardware failure or some other reason outside the task's control), but if a
task fails four times, then the whole job is marked as failed (see "Task Failure"
on page 200). If it is the data that is causing the task to throw an exception,
rerunning the task won't help, since it will fail in exactly the same way each time.
7. JVM reuse is not supported in MapReduce 2.
If you are using TextInputFormat ("TextInputFormat" on page 244),
then you can set a maximum expected line length to safeguard against
corrupted files. Corruption in a file can manifest itself as a very long line,
which can cause out of memory errors and then task failure. By setting
mapred.linerecordreader.maxlength to a value in bytes that fits in
memory (and is comfortably greater than the length of lines in your input
data), the record reader will skip the (long) corrupt lines without the
task failing.
The best way to handle corrupt records is in your mapper or reducer code. You can
detect the bad record and ignore it, or you can abort the job by throwing an exception.
You can also count the total number of bad records in the job using counters to see
how widespread the problem is.
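The detect, ignore, and count pattern looks roughly like this when stripped of the MapReduce machinery. The sketch is plain Java: the int field stands in for a Hadoop counter, the records are invented sample strings, and BadRecordTally is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for a mapper that tolerates malformed records.
class BadRecordTally {
    int sum = 0;         // result computed from the good records
    int badRecords = 0;  // plays the role of a counter for bad records

    void process(List<String> records) {
        for (String record : records) {
            try {
                sum += Integer.parseInt(record.trim());
            } catch (NumberFormatException e) {
                // Detect the bad record, count it, and carry on with the next one.
                badRecords++;
            }
        }
    }

    public static void main(String[] args) {
        BadRecordTally tally = new BadRecordTally();
        tally.process(Arrays.asList("12", "x7", "", "30"));
        System.out.println("sum=" + tally.sum + " bad=" + tally.badRecords);
    }
}
```

Comparing the bad-record count with the number of records processed shows whether the skipped records are a negligible fraction or a sign of a systematic problem in the data.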
In rare cases, though, you can't handle the problem because there is a bug in a
third-party library that you can't work around in your mapper or reducer. In these cases, you
can use Hadoop's optional skipping mode for automatically skipping bad records. [8]
When skipping mode is enabled, tasks report the records being processed back to the
tasktracker. When the task fails, the tasktracker retries the task, skipping the records
that caused the failure. Because of the extra network traffic and bookkeeping to
maintain the failed record ranges, skipping mode is turned on for a task only after it
has failed twice.
Thus, for a task consistently failing on a bad record, the tasktracker runs the following
task attempts with these outcomes:
1. Task fails.
2. Task fails.
3. Skipping mode is enabled. Task fails, but failed record is stored by the tasktracker.
4. Skipping mode is still enabled. Task succeeds by skipping the bad record that failed
in the previous attempt.
Skipping mode is off by default; you enable it independently for map and reduce tasks
using the SkipBadRecords class. It's important to note that skipping mode can detect
only one bad record per task attempt, so this mechanism is appropriate only for
detecting occasional bad records (a few per task, say). You may need to increase the
maximum number of task attempts (via mapred.map.max.attempts and
mapred.reduce.max.attempts) to give skipping mode enough attempts to detect and skip
all the bad records in an input split.
8. Skipping mode is not supported in the new MapReduce API. See
https://issues.apache.org/jira/browse/MAPREDUCE-1932.
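Following the sequence of attempts described in this section, a task that hits n bad records needs two ordinary failures, then one failing attempt per bad record, then one clean attempt. The sketch below turns that into arithmetic. It models only the account given here, under the assumption that exactly one bad record is found per attempt; the names are hypothetical.

```java
// Arithmetic model of skipping mode as described in the text (not Hadoop code).
class SkippingModeAttempts {
    static final int FAILURES_BEFORE_SKIPPING = 2;  // skipping mode starts after two failures

    /** Attempts needed for a task whose input contains the given number of bad records. */
    static int attemptsNeeded(int badRecords) {
        if (badRecords == 0) {
            return 1;  // the first attempt simply succeeds
        }
        // two plain failures + one failing attempt per bad record + the final clean attempt
        return FAILURES_BEFORE_SKIPPING + badRecords + 1;
    }

    static boolean succeedsWithin(int badRecords, int maxAttempts) {
        return attemptsNeeded(badRecords) <= maxAttempts;
    }

    public static void main(String[] args) {
        System.out.println(attemptsNeeded(1));     // the four-attempt sequence listed above
        System.out.println(succeedsWithin(2, 4));  // two bad records do not fit in four attempts
        System.out.println(succeedsWithin(2, 5));
    }
}
```

Under this model a limit of four attempts leaves room for only one bad record per task, which is why the maximum number of attempts may need to be raised.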
Bad records that have been detected by Hadoop are saved as sequence files in the job's
output directory under the _logs/skip subdirectory. These can be inspected for
diagnostic purposes after the job has completed (using hadoop fs -text, for example).
CHAPTER 7
MapReduce Types and Formats
MapReduce has a simple model of data processing: inputs and outputs for the map and
reduce functions are key-value pairs. This chapter looks at the MapReduce model in
detail and, in particular, how data in various formats, from simple text to structured
binary objects, can be used with this model.
MapReduce Types
The map and reduce functions in Hadoop MapReduce have the following general form:
  map: (K1, V1) → list(K2, V2)
  reduce: (K2, list(V2)) → list(K3, V3)
In general, the map input key and value types (K1 and V1) are different from the map
output types (K2 and V2). However, the reduce input must have the same types as the
map output, although the reduce output types may be different again (K3 and V3). The
Java API mirrors this general form:
  public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    public class Context extends MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
      // ...
    }
    protected void map(KEYIN key, VALUEIN value,
        Context context) throws IOException, InterruptedException {
      // ...
    }
  }

  public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    public class Context extends ReduceContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
      // ...
    }
    protected void reduce(KEYIN key, Iterable<VALUEIN> values,
        Context context) throws IOException, InterruptedException {
      // ...
    }
  }
The context objects are used for emitting key-value pairs, so they are parameterized by
the output types, so that the signature of the write() method is:
  public void write(KEYOUT key, VALUEOUT value)
      throws IOException, InterruptedException
Since Mapper and Reducer are separate classes the type parameters have different scopes,
and the actual type argument of KEYIN (say) in the Mapper may be different from the type
of the type parameter of the same name (KEYIN) in the Reducer. For instance, in the
maximum temperature example from earlier chapters, KEYIN is replaced by
LongWritable for the Mapper, and by Text for the Reducer.
Similarly, even though the map output types and the reduce input types must match,
this is not enforced by the Java compiler.
The type parameters are named differently from the abstract types (KEYIN versus K1, and
so on), but the form is the same.
If a combine function is used, then it is the same form as the reduce function (and is
an implementation of Reducer), except its output types are the intermediate key and
value types (K2 and V2), so they can feed the reduce function:
  map: (K1, V1) → list(K2, V2)
  combine: (K2, list(V2)) → list(K2, V2)
  reduce: (K2, list(V2)) → list(K3, V3)
Often the combine and reduce functions are the same, in which case, K3 is the same as
K2, and V3 is the same as V2.
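To see how the six type parameters fit together, here is a tiny in-memory imitation of the map, group, and reduce steps written with Java generics. It is a teaching sketch and not Hadoop's API: the MapFn and ReduceFn interfaces, the run() method, and the sample records are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// In-memory illustration of the general form; none of these types come from Hadoop.
class TinyMapReduce {
    /** map: (K1, V1) -> list(K2, V2) */
    interface MapFn<K1, V1, K2, V2> {
        void map(K1 key, V1 value, BiConsumer<K2, V2> out);
    }

    /** reduce: (K2, list(V2)) -> list(K3, V3) */
    interface ReduceFn<K2, V2, K3, V3> {
        void reduce(K2 key, List<V2> values, BiConsumer<K3, V3> out);
    }

    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> Map<K3, V3> run(
            Map<K1, V1> input, MapFn<K1, V1, K2, V2> mapper, ReduceFn<K2, V2, K3, V3> reducer) {
        // Group the intermediate values by key, in sorted key order (a stand-in for the shuffle).
        SortedMap<K2, List<V2>> groups = new TreeMap<>();
        for (Map.Entry<K1, V1> record : input.entrySet()) {
            mapper.map(record.getKey(), record.getValue(),
                    (k, v) -> groups.computeIfAbsent(k, unused -> new ArrayList<>()).add(v));
        }
        Map<K3, V3> output = new LinkedHashMap<>();
        for (Map.Entry<K2, List<V2>> group : groups.entrySet()) {
            reducer.reduce(group.getKey(), group.getValue(), output::put);
        }
        return output;
    }

    /** Maximum value per year over a few invented "year,temperature" lines. */
    static Map<String, Integer> demo() {
        Map<Long, String> lines = new LinkedHashMap<>();  // K1 = line offset, V1 = line
        lines.put(0L, "1950,0");
        lines.put(7L, "1950,22");
        lines.put(15L, "1949,111");
        lines.put(24L, "1949,78");
        return TinyMapReduce.<Long, String, String, Integer, String, Integer>run(
                lines,
                (offset, line, out) -> {
                    String[] fields = line.split(",");
                    out.accept(fields[0], Integer.parseInt(fields[1]));  // emits (K2, V2)
                },
                (year, temps, out) -> out.accept(year, Collections.max(temps)));  // emits (K3, V3)
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Here the compiler does check that the mapper's output types match the reducer's input types, because both functions are passed to one generic method. In Hadoop the mapper and reducer are configured separately, so no such check is possible at compile time.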
The partition function operates on the intermediate key and value types (K2 and V2),
and returns the partition index. In practice, the partition is determined solely by the
key (the value is ignored):
  partition: (K2, V2) → integer
Or in Java:
  public abstract class Partitioner<KEY, VALUE> {
    public abstract int getPartition(KEY key, VALUE value, int numPartitions);
  }
MapReduce signatures in the old API
In the old API the signatures are very similar, and actually name the type parameters
K1, V1, and so on, although the constraints on the types are exactly the same in both
old and new APIs.
  public interface Mapper<K1, V1, K2, V2> extends JobConfigurable, Closeable {

    void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
        throws IOException;
  }

  public interface Reducer<K2, V2, K3, V3> extends JobConfigurable, Closeable {

    void reduce(K2 key, Iterator<V2> values,
        OutputCollector<K3, V3> output, Reporter reporter) throws IOException;
  }

  public interface Partitioner<K2, V2> extends JobConfigurable {
    int getPartition(K2 key, V2 value, int numPartitions);
  }
So much for the theory; how does this help configure MapReduce jobs? Table 7-1
summarizes the configuration options for the new API (and Table 7-2 does the same
for the old API). It is divided into the properties that determine the types and those that
have to be compatible with the configured types.
Input types are set by the input format. So, for instance, a TextInputFormat generates
keys of type LongWritable and values of type Text. The other types are set explicitly by
calling the methods on the Job (or JobConf in the old API). If not set explicitly, the
intermediate types default to the (final) output types, which default to LongWritable
and Text. So if K2 and K3 are the same, you don't need to call setMapOutputKeyClass(),
since it falls back to the type set by calling setOutputKeyClass(). Similarly, if V2 and
V3 are the same, you only need to use setOutputValueClass().
It may seem strange that these methods for setting the intermediate and final output
types exist at all. After all, why can't the types be determined from a combination of
the mapper and the reducer? The answer is that it's to do with a limitation in Java
generics: type erasure means that the type information isn't always present at runtime,
so Hadoop has to be given it explicitly. This also means that it's possible to configure
a MapReduce job with incompatible types, because the configuration isn't checked at
compile time. The settings that have to be compatible with the MapReduce types are
listed in the lower part of Table 7-1. Type conflicts are detected at runtime during job
execution, and for this reason, it is wise to run a test job using a small amount of data
to flush out and fix any type incompatibilities.
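Type erasure is easy to observe in plain Java. In the sketch below, two lists with different type arguments turn out to share one runtime class, while an explicit Class object (the kind of value handed to methods such as setOutputKeyClass()) still allows a check at runtime. ErasureDemo and its method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

// Demonstrates why generic type arguments cannot be recovered at runtime.
class ErasureDemo {

    /** Both lists have the same runtime class: the type arguments have been erased. */
    static boolean sameRuntimeClass() {
        List<String> strings = new ArrayList<>();
        List<Integer> numbers = new ArrayList<>();
        return strings.getClass() == numbers.getClass();
    }

    /** With an explicit class token, a value's type can still be checked at runtime. */
    static boolean accepts(Class<?> declaredType, Object value) {
        return declaredType.isInstance(value);
    }

    public static void main(String[] args) {
        System.out.println(sameRuntimeClass());
        System.out.println(accepts(String.class, "a key"));
        System.out.println(accepts(Long.class, "a key"));  // a mismatch is visible only at runtime
    }
}
```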
Table 7-1. Configuration of MapReduce types in the new API
Property | Job setter method | Input types | Intermediate types | Output types
Properties for configuring types:
mapreduce.job.inputformat.class | setInputFormatClass() | K1, V1 | - | -
mapreduce.map.output.key.class | setMapOutputKeyClass() | - | K2 | -
mapreduce.map.output.value.class | setMapOutputValueClass() | - | V2 | -
mapreduce.job.output.key.class | setOutputKeyClass() | - | - | K3
mapreduce.job.output.value.class | setOutputValueClass() | - | - | V3
Properties that must be consistent with the types:
mapreduce.job.map.class | setMapperClass() | K1, V1 | K2, V2 | -
mapreduce.job.combine.class | setCombinerClass() | - | K2, V2 | -
mapreduce.job.partitioner.class | setPartitionerClass() | - | K2, V2 | -
mapreduce.job.output.key.comparator.class | setSortComparatorClass() | - | K2 | -
mapreduce.job.output.group.comparator.class | setGroupingComparatorClass() | - | K2 | -
mapreduce.job.reduce.class | setReducerClass() | - | K2, V2 | K3, V3
mapreduce.job.outputformat.class | setOutputFormatClass() | - | - | K3, V3
Table 7-2. Configuration of MapReduce types in the old API
Property | JobConf setter method | Input types | Intermediate types | Output types
Properties for configuring types:
mapred.input.format.class | setInputFormat() | K1, V1 | - | -
mapred.mapoutput.key.class | setMapOutputKeyClass() | - | K2 | -
mapred.mapoutput.value.class | setMapOutputValueClass() | - | V2 | -
mapred.output.key.class | setOutputKeyClass() | - | - | K3
mapred.output.value.class | setOutputValueClass() | - | - | V3
Properties that must be consistent with the types:
mapred.mapper.class | setMapperClass() | K1, V1 | K2, V2 | -
mapred.map.runner.class | setMapRunnerClass() | K1, V1 | K2, V2 | -
mapred.combiner.class | setCombinerClass() | - | K2, V2 | -
mapred.partitioner.class | setPartitionerClass() | - | K2, V2 | -
mapred.output.key.comparator.class | setOutputKeyComparatorClass() | - | K2 | -
mapred.output.value.groupfn.class | setOutputValueGroupingComparator() | - | K2 | -
mapred.reducer.class | setReducerClass() | - | K2, V2 | K3, V3
mapred.output.format.class | setOutputFormat() | - | - | K3, V3
The Default MapReduce Job
What happens when you run MapReduce without setting a mapper or a reducer? Let's
try it by running this minimal MapReduce program:
  public class MinimalMapReduce extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
      if (args.length != 2) {
        System.err.printf("Usage: %s [generic options] <input> <output>\n",
            getClass().getSimpleName());
        ToolRunner.printGenericCommandUsage(System.err);
        return -1;
      }

      Job job = new Job(getConf());
      job.setJarByClass(getClass());
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
      int exitCode = ToolRunner.run(new MinimalMapReduce(), args);
      System.exit(exitCode);
    }
  }
The only configuration that we set is an input path and an output path. We run it over
a subset of our weather data with the following:
  % hadoop MinimalMapReduce "input/ncdc/all/190{1,2}.gz" output
We do get some output: one file named part-r-00000 in the output directory. Here's
what the first few lines look like (truncated to fit the page):
  0	0029029070999991901010106004+64333+023450FM-12+000599999V0202701N01591...
  0	0035029070999991902010106004+64333+023450FM-12+000599999V0201401N01181...
  135	0029029070999991901010113004+64333+023450FM-12+000599999V0202901N00821...
  141	0035029070999991902010113004+64333+023450FM-12+000599999V0201401N01181...
  270	0029029070999991901010120004+64333+023450FM-12+000599999V0209991C00001...
  282	0035029070999991902010120004+64333+023450FM-12+000599999V0201401N01391...
Each line is an integer followed by a tab character, followed by the original weather
data record. Admittedly, it's not a very useful program, but understanding how it
produces its output does provide some insight into the defaults that Hadoop uses when
running MapReduce jobs. Example 7-1 shows a program that has exactly the same
effect as MinimalMapReduce, but explicitly sets the job settings to their defaults.
Example 7-1. A minimal MapReduce driver, with the defaults explicitly set
  public class MinimalMapReduceWithDefaults extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
      Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
      if (job == null) {
        return -1;
      }

      job.setInputFormatClass(TextInputFormat.class);

      job.setMapperClass(Mapper.class);

      job.setMapOutputKeyClass(LongWritable.class);
      job.setMapOutputValueClass(Text.class);

      job.setPartitionerClass(HashPartitioner.class);

      job.setNumReduceTasks(1);
      job.setReducerClass(Reducer.class);
      job.setOutputKeyClass(LongWritable.class);
      job.setOutputValueClass(Text.class);
      job.setOutputFormatClass(TextOutputFormat.class);

      return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
      int exitCode = ToolRunner.run(new MinimalMapReduceWithDefaults(), args);
      System.exit(exitCode);
    }
  }
We've simplified the first few lines of the run() method, by extracting the logic for
printing usage and setting the input and output paths into a helper method. Almost all
MapReduce drivers take these two arguments (input and output), so reducing
the boilerplate code here is a good thing. Here are the relevant methods in the
JobBuilder class for reference:
  public static Job parseInputAndOutput(Tool tool, Configuration conf,
      String[] args) throws IOException {

    if (args.length != 2) {
      printUsage(tool, "<input> <output>");
      return null;
    }
    Job job = new Job(conf);
    job.setJarByClass(tool.getClass());
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job;
  }

  public static void printUsage(Tool tool, String extraArgsUsage) {
    System.err.printf("Usage: %s [genericOptions] %s\n\n",
        tool.getClass().getSimpleName(), extraArgsUsage);
    GenericOptionsParser.printGenericCommandUsage(System.err);
  }
Going back to MinimalMapReduceWithDefaults in Example 7-1, although there are many
other default job settings, the ones set explicitly in the driver are those most central
to running a job. Let's go through them in turn.
The default input format is TextInputFormat, which produces keys of type LongWritable
(the offset of the beginning of the line in the file) and values of type Text (the line
of text). This explains where the integers in the final output come from: they are the
line offsets.
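To make the idea of a line offset concrete, here is a minimal sketch in plain Java. It does not use Hadoop at all; LineOffsetSketch and lineOffsets() are hypothetical names, and the sketch assumes "\n" line endings and UTF-8 text. It only illustrates how the byte offset of each line start is derived, which is the kind of value that appears as the key.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: computes the byte offset at which
// each line of a text starts, assuming "\n" line endings and UTF-8 encoding.
final class LineOffsetSketch {

    static List<Long> lineOffsets(String contents) {
        List<Long> offsets = new ArrayList<>();
        long offset = 0;
        for (String line : contents.split("\n")) {
            offsets.add(offset);
            // Advance past the line's bytes plus one byte for the newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Prints [0, 6, 18]
        System.out.println(lineOffsets("first\nsecond line\nthird\n"));
    }
}
```

Because the offsets depend only on the bytes that precede each line, two input files will both produce a key of 0 for their first line, which is why keys from different files can collide.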
The default mapper is just the Mapper class, which writes the input key and value
unchanged to the output:
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  protected void map(KEYIN key, VALUEIN value,
      Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}
Mapper is a generic type, which allows it to work with any key or value types. In this
case, the map input and output key is of type LongWritable and the map input and
output value is of type Text.
The default partitioner is HashPartitioner, which hashes a record's key to determine
which partition the record belongs in. Each partition is processed by a reduce task, so
the number of partitions is equal to the number of reduce tasks for the job:
public class HashPartitioner<K, V> extends Partitioner<K, V> {
  public int getPartition(K key, V value,
      int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
The key's hash code is turned into a nonnegative integer by bitwise ANDing it with the
largest integer value. It is then reduced modulo the number of partitions to find the
index of the partition that the record belongs in.
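The reason for the bitwise AND is easiest to see with a key whose hash code is negative. The following is a small sketch in plain Java that reuses the same arithmetic outside Hadoop; HashPartitionSketch and partitionFor() are hypothetical names chosen for illustration.

```java
// Hypothetical stand-alone illustration of the partition arithmetic; it uses
// the same expression as HashPartitioner but has no Hadoop dependency.
final class HashPartitionSketch {

    static int partitionFor(Object key, int numReduceTasks) {
        // Clearing the sign bit guarantees a nonnegative dividend, so the
        // remainder always falls in the range [0, numReduceTasks).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        Integer key = -7;                          // Integer.hashCode() is the value itself
        System.out.println(key.hashCode() % 4);    // prints -3: not a valid partition index
        System.out.println(partitionFor(key, 4));  // prints 1
        System.out.println(partitionFor(key, 1));  // prints 0: one reducer, one partition
    }
}
```

In Java the remainder operator takes the sign of the dividend, so without the mask a negative hash code would yield a negative partition index.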
By default, there is a single reducer, and therefore a single partition, so the action of
the partitioner is irrelevant in this case since everything goes into one partition.
However, it is important to understand the behavior of HashPartitioner when you have
more than one reduce task. Assuming the key's hash function is a good one, the records
will be evenly allocated across reduce tasks, with all records sharing the same key being
processed by the same reduce task.
You may have noticed that we didn't set the number of map tasks. The reason for this
is that the number is equal to the number of splits that the input is turned into, which
is driven by the size of the input and the file's block size (if the file is in HDFS). The
options for controlling split size are discussed in "FileInputFormat input splits" on page 236.
Choosing the Number of Reducers
The single reducer default is something of a gotcha for new users to Hadoop. Almost
all real-world jobs should set this to a larger number; otherwise, the job will be very
slow since all the intermediate data flows through a single reduce task. (Note that when
running under the local job runner, only zero or one reducers are supported.)
The optimal number of reducers is related to the total number of available reducer slots
in your cluster. The total number of slots is found by multiplying the number of nodes
in the cluster and the number of slots per node (which is determined by the value of
the mapred.tasktracker.reduce.tasks.maximum property, described in "Environment
Settings" on page 305).
One common setting is to have slightly fewer reducers than total slots, which gives one
wave of reduce tasks (and tolerates a few failures, without extending job execution
time). If your reduce tasks are very big, then it makes sense to have a larger number of
reducers (resulting in two waves, for example) so that the tasks are more fine-grained,
and failure doesn't affect job execution time significantly.
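The slot arithmetic can be sketched in a few lines of plain Java. The class and method names are hypothetical, and the 90 percent figure is only an assumed stand-in for "slightly fewer"; the text does not prescribe a specific fraction.

```java
// Hypothetical calculator for a starting reducer count; not a Hadoop API.
final class ReducerCountSketch {

    // Total reduce slots = nodes x slots per node (the per-node value would come
    // from mapred.tasktracker.reduce.tasks.maximum).
    static int totalReduceSlots(int nodes, int reduceSlotsPerNode) {
        return nodes * reduceSlotsPerNode;
    }

    // percentOfSlots is an assumed tuning knob: a value a little under 100
    // leaves a few slots spare so failed tasks can rerun within the same wave.
    static int suggestedReducers(int totalSlots, int percentOfSlots) {
        return Math.max(1, totalSlots * percentOfSlots / 100);
    }

    public static void main(String[] args) {
        int slots = totalReduceSlots(10, 2);
        System.out.println(slots);                         // prints 20
        System.out.println(suggestedReducers(slots, 90));  // prints 18
    }
}
```

The resulting number would then be passed to job.setNumReduceTasks(), or set with -D mapred.reduce.tasks on the command line.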
The default reducer is Reducer, again a generic type, which simply writes all its input
to its output:
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  protected void reduce(KEYIN key, Iterable<VALUEIN> values,
      Context context) throws IOException, InterruptedException {
    for (VALUEIN value: values) {
      context.write((KEYOUT) key, (VALUEOUT) value);
    }
  }
}
For this job, the output key is LongWritable, and the output value is Text. In fact, all
the keys for this MapReduce program are LongWritable, and all the values are Text,
since these are the input keys and values, and the map and reduce functions are both
identity functions which by definition preserve type. Most MapReduce programs,
however, don't use the same key or value types throughout, so you need to configure
the job to declare the types you are using, as described in the previous section.
Records are sorted by the MapReduce system before being presented to the reducer.
In this case, the keys are sorted numerically, which has the effect of interleaving the
lines from the input files into one combined output file.
The default output format is TextOutputFormat, which writes out records, one per line,
by converting keys and values to strings and separating them with a tab character. This
is why the output is tab-separated: it is a feature of TextOutputFormat.
The default Streaming job
In Streaming, the default job is similar, but not identical, to the Java equivalent. The
minimal form is:
% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper /bin/cat
Notice that you have to supply a mapper: the default identity mapper will not work.
The reason has to do with the default input format, TextInputFormat, which generates
LongWritable keys and Text values. However, Streaming output keys and values
(including the map keys and values) are always both of type Text.[1] The identity mapper
cannot change LongWritable keys to Text keys, so it fails.
When we specify a non-Java mapper, and the input format is TextInputFormat, Streaming
does something special. It doesn't pass the key to the mapper process; it just passes
the value. (For other input formats the same effect can be achieved by setting
stream.map.input.ignoreKey to true.) This is actually very useful, since the key is just
the line offset in the file, and the value is the line, which is all most applications are
interested in. The overall effect of this job is to perform a sort of the input.
With more of the defaults spelled out, the command looks like this:
% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-inputformat org.apache.hadoop.mapred.TextInputFormat \
-mapper /bin/cat \
-partitioner org.apache.hadoop.mapred.lib.HashPartitioner \
-numReduceTasks 1 \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-outputformat org.apache.hadoop.mapred.TextOutputFormat
The mapper and reducer arguments take a command or a Java class. A combiner may
optionally be specified, using the -combiner argument.
Keys and values in Streaming
A Streaming application can control the separator that is used when a key-value pair is
turned into a series of bytes and sent to the map or reduce process over standard input.
The default is a tab character, but it is useful to be able to change it in the case that the
keys or values themselves contain tab characters.
1. Except when used in binary mode, from version 0.21.0 onward, via the -io rawbytes or -io typedbytes
options. Text mode (-io text) is the default.
Similarly, when the map or reduce writes out key-value pairs, they may be separated
by a configurable separator. Furthermore, the key from the output can be composed
of more than the first field: it can be made up of the first n fields (defined by
stream.num.map.output.key.fields or stream.num.reduce.output.key.fields), with
the value being the remaining fields. For example, if the output from a Streaming
process was a,b,c (and the separator is a comma), and n is two, then the key would be
parsed as a,b and the value as c.
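The a,b,c example can be reproduced with a short plain-Java sketch. This is not Streaming's own implementation: StreamingKeySplitSketch and splitKeyValue() are hypothetical names, and the choice to treat the whole line as the key when there are fewer separators than key fields is an assumption made for the sketch.

```java
// Hypothetical illustration of splitting one output line into key and value
// on the n-th occurrence of a separator; not taken from Hadoop's source.
final class StreamingKeySplitSketch {

    // Returns a two-element array: {key, value}.
    static String[] splitKeyValue(String line, String separator, int numKeyFields) {
        if (numKeyFields < 1) {
            throw new IllegalArgumentException("numKeyFields must be at least 1");
        }
        int from = 0;
        int pos = -1;
        for (int i = 0; i < numKeyFields; i++) {
            pos = line.indexOf(separator, from);
            if (pos < 0) {
                // Assumption: too few separators means the whole line is the key.
                return new String[] { line, "" };
            }
            from = pos + separator.length();
        }
        // The key is everything before the n-th separator; the value is the rest.
        return new String[] { line.substring(0, pos), line.substring(from) };
    }

    public static void main(String[] args) {
        String[] kv = splitKeyValue("a,b,c", ",", 2);
        System.out.println(kv[0]);  // prints a,b
        System.out.println(kv[1]);  // prints c
    }
}
```

With numKeyFields set to 1, the same line splits into the key a and the value b,c, which corresponds to the default of a single key field.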
Separators may be configured independently for maps and reduces. The properties are
listed in Table 7-3 and shown in a diagram of the data flow path in Figure 7-1.
These settings do not have any bearing on the input and output formats. For example,
if stream.reduce.output.field.separator were set to be a colon, say, and the reduce
stream process wrote the line a:b to standard out, then the Streaming reducer would
know to extract the key as a and the value as b. With the standard TextOutputFormat,
this record would be written to the output file with a tab separating a and b. You can
change the separator that TextOutputFormat uses by setting
mapred.textoutputformat.separator.
A list of Streaming configuration parameters can be found on the Hadoop website at
http://hadoop.apache.org/mapreduce/docs/current/streaming.html#Configurable+parameters.
Table 7-3. Streaming separator properties
Property name                         Type    Default value  Description
stream.map.input.field.separator      String  \t             The separator to use when passing the input key and value strings to the stream map process as a stream of bytes.
stream.map.output.field.separator     String  \t             The separator to use when splitting the output from the stream map process into key and value strings for the map output.
stream.num.map.output.key.fields      int     1              The number of fields separated by stream.map.output.field.separator to treat as the map output key.
stream.reduce.input.field.separator   String  \t             The separator to use when passing the input key and value strings to the stream reduce process as a stream of bytes.
stream.reduce.output.field.separator  String  \t             The separator to use when splitting the output from the stream reduce process into key and value strings for the final reduce output.
stream.num.reduce.output.key.fields   int     1              The number of fields separated by stream.reduce.output.field.separator to treat as the reduce output key.
Figure 7-1. Where separators are used in a Streaming MapReduce job
Input Formats
Hadoop can process many different types of data formats, from flat text files to
databases. In this section, we explore the different formats available.
Input Splits and Records
As we saw in Chapter 2, an input split is a chunk of the input that is processed by a
single map. Each map processes a single split. Each split is divided into records, and
the map processes each record (a key-value pair) in turn. Splits and records are logical:
there is nothing that requires them to be tied to files, for example, although in their
most common incarnations, they are. In a database context, a split might correspond
to a range of rows from a table and a record to a row in that range (this is precisely what
DBInputFormat does, an input format for reading data from a relational database).
Input splits are represented by the Java class InputSplit (which, like all of the classes
mentioned in this section, is in the org.apache.hadoop.mapreduce package[2]):
public abstract class InputSplit {
  public abstract long getLength() throws IOException, InterruptedException;
  public abstract String[] getLocations() throws IOException,
      InterruptedException;
}
An InputSplit has a length in bytes and a set of storage locations, which are just
hostname strings. Notice that a split doesn't contain the input data; it is just a reference to
the data. The storage locations are used by the MapReduce system to place map tasks
as close to the split's data as possible, and the size is used to order the splits so that the
largest get processed first, in an attempt to minimize the job runtime (this is an instance
of a greedy approximation algorithm).
2. But see the classes in org.apache.hadoop.mapred for the old MapReduce API counterparts.
As a MapReduce application writer, you don't need to deal with InputSplits directly,
as they are created by an InputFormat. An InputFormat is responsible for creating the
input splits and dividing them into records. Before we see some concrete examples of
InputFormat, let's briefly examine how it is used in MapReduce. Here's the interface:
public abstract class InputFormat<K, V> {
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  public abstract RecordReader<K, V>
      createRecordReader(InputSplit split,
          TaskAttemptContext context) throws IOException,
              InterruptedException;
}
The client running the job calculates the splits for the job by calling getSplits(), then
sends them to the jobtracker, which uses their storage locations to schedule map tasks
to process them on the tasktrackers. On a tasktracker, the map task passes the split to
the createRecordReader() method on InputFormat to obtain a RecordReader for that
split. A RecordReader is little more than an iterator over records, and the map task uses
one to generate record key-value pairs, which it passes to the map function. We can
see this by looking at the Mapper's run() method:
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
  }
After running setup(), the nextKeyValue() method is called repeatedly on the Context
(which delegates to the identically-named method on the RecordReader) to populate the
key and value objects for the mapper. The key and value are retrieved from the
RecordReader by way of the Context, and passed to the map() method for it to do its work.
When the reader gets to the end of the stream, the nextKeyValue() method returns
false, and the map task runs its cleanup() method, then completes.
It's not shown in the code snippet, but for reasons of efficiency
RecordReader implementations will return the same key and value objects on
each call to getCurrentKey() and getCurrentValue(). Only the contents
of these objects are changed by the reader's nextKeyValue() method.
This can be a surprise to users, who might expect keys and values to be
immutable, and not to be reused. This causes problems when a reference
to a key or value object is retained outside the map() method, as its value
can change without warning. If you need to do this, make a copy of the
object you want to hold on to. For example, for a Text object, you can
use its copy constructor: new Text(value).
The situation is similar with reducers. In this case, the value objects in
the reducer's iterator are reused, so you need to copy any that you need
to retain between calls to the iterator (see Example 8-14).
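The pitfall can be demonstrated without Hadoop. In the following sketch, ReusingReader is a hypothetical stand-in for a RecordReader, and a StringBuilder plays the role of a mutable Writable such as Text; none of these names come from the Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demonstration of retaining a reused mutable object versus
// copying it; StringBuilder stands in for a mutable Writable.
final class ObjectReuseSketch {

    static final class ReusingReader {
        private final String[] records;
        private final StringBuilder value = new StringBuilder(); // one shared instance
        private int next = 0;

        ReusingReader(String... records) {
            this.records = records;
        }

        boolean nextKeyValue() {
            if (next >= records.length) {
                return false;
            }
            value.setLength(0);             // overwrite the contents in place
            value.append(records[next++]);
            return true;
        }

        StringBuilder getCurrentValue() {
            return value;                   // always the same object
        }
    }

    // Keeps references to the reused object, then reports what they hold at the end.
    static List<String> retainReferences(String... records) {
        List<StringBuilder> retained = new ArrayList<>();
        ReusingReader reader = new ReusingReader(records);
        while (reader.nextKeyValue()) {
            retained.add(reader.getCurrentValue());
        }
        List<String> seen = new ArrayList<>();
        for (StringBuilder sb : retained) {
            seen.add(sb.toString());
        }
        return seen;
    }

    // Copies the contents on each iteration, like new Text(value).
    static List<String> retainCopies(String... records) {
        List<String> copies = new ArrayList<>();
        ReusingReader reader = new ReusingReader(records);
        while (reader.nextKeyValue()) {
            copies.add(reader.getCurrentValue().toString());
        }
        return copies;
    }

    public static void main(String[] args) {
        System.out.println(retainReferences("a", "b", "c")); // prints [c, c, c]
        System.out.println(retainCopies("a", "b", "c"));     // prints [a, b, c]
    }
}
```

Every retained reference points at the single shared object, so all of them show the last record read; only the copies preserve the individual values.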
Finally, note that the Mapper's run() method is public, and may be customized by users.
MultithreadedMapper is an implementation that runs mappers concurrently
in a configurable number of threads (set by
mapreduce.mapper.multithreadedmapper.threads). For most data processing tasks, it
confers no advantage over the default implementation. However, for mappers that spend
a long time processing each record, because they contact external servers, for example,
it allows multiple mappers to run in one JVM with little contention. See "Fetcher: A
multithreaded MapRunner in action" on page 575 for an example of an application that
uses the multi-threaded version (using the old API).
FileInputFormat
FileInputFormat is the base class for all implementations of InputFormat that use files
as their data source (see Figure 7-2). It provides two things: a place to define which files
are included as the input to a job, and an implementation for generating splits for the
input files. The job of dividing splits into records is performed by subclasses.
FileInputFormat input paths
The input to a job is specified as a collection of paths, which offers great flexibility in
constraining the input to a job. FileInputFormat offers four static convenience methods
for setting a Job's input paths:
public static void addInputPath(Job job, Path path)
public static void addInputPaths(Job job, String commaSeparatedPaths)
public static void setInputPaths(Job job, Path... inputPaths)
public static void setInputPaths(Job job, String commaSeparatedPaths)

The addInputPath() and addInputPaths() methods add a path or paths to the list of
inputs. You can call these methods repeatedly to build the list of paths. The
setInputPaths() methods set the entire list of paths in one go (replacing any paths set on the
Job in previous calls).
A path may represent a file, a directory, or, by using a glob, a collection of files and
directories. A path representing a directory includes all the files in the directory as input
to the job. See "File patterns" on page 67 for more on using globs.
The contents of a directory specified as an input path are not processed
recursively. In fact, the directory should only contain files: if the
directory contains a subdirectory, it will be interpreted as a file, which will
cause an error. The way to handle this case is to use a file glob or a filter
to select only the files in the directory based on a name pattern.
Alternatively, set mapred.input.dir.recursive to true to force the input
directory to be read recursively.
The add and set methods allow files to be specified by inclusion only. To exclude certain
files from the input, you can set a filter using the setInputPathFilter() method on
FileInputFormat:
public static void setInputPathFilter(Job job, Class<? extends PathFilter> filter)
Filters are discussed in more detail in "PathFilter" on page 68.
Figure 7-2. InputFormat class hierarchy
Even if you don't set a filter, FileInputFormat uses a default filter that excludes hidden
files (those whose names begin with a dot or an underscore). If you set a filter by calling
setInputPathFilter(), it acts in addition to the default filter. In other words, only
nonhidden files that are accepted by your filter get through.
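The way the two filters combine is a logical AND. The following plain-Java sketch models that with java.util.function.Predicate over file names; it is not the Hadoop PathFilter interface, and InputFilterSketch, NOT_HIDDEN, and effectiveFilter() are hypothetical names.

```java
import java.util.function.Predicate;

// Hypothetical model of "default hidden-file filter AND user filter",
// expressed over plain file names rather than Hadoop Path objects.
final class InputFilterSketch {

    // Mirrors the default rule described above: reject names that start
    // with a dot or an underscore.
    static final Predicate<String> NOT_HIDDEN =
        name -> !name.startsWith(".") && !name.startsWith("_");

    // A file is accepted only if both the default and the user filter accept it.
    static Predicate<String> effectiveFilter(Predicate<String> userFilter) {
        return NOT_HIDDEN.and(userFilter);
    }

    public static void main(String[] args) {
        Predicate<String> textFilesOnly = name -> name.endsWith(".txt");
        Predicate<String> filter = effectiveFilter(textFilesOnly);
        String[] names = {"part-00000.txt", "_logs.txt", ".staging.txt", "data.csv"};
        for (String name : names) {
            // Only part-00000.txt is accepted.
            System.out.println(name + " -> " + filter.test(name));
        }
    }
}
```

A user filter can therefore narrow the input further, but it cannot bring hidden files back in.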
Paths and filters can be set through configuration properties, too (Table 7-4), which
can be handy for Streaming and Pipes. Setting paths is done with the -input option for
both Streaming and Pipes interfaces, so setting paths directly is not usually needed.
Table 7-4. Input path and filter properties
Property name                  Type                   Default value  Description
mapred.input.dir               comma-separated paths  none           The input files for a job. Paths that contain commas should have those commas escaped by a backslash character. For example, the glob {a,b} would be escaped as {a\,b}.
mapred.input.pathFilter.class  PathFilter classname   none           The filter to apply to the input files for a job.
FileInputFormat input splits
Given a set of files, how does FileInputFormat turn them into splits? FileInputFormat
splits only large files. Here "large" means larger than an HDFS block. The split size is
normally the size of an HDFS block, which is appropriate for most applications;
however, it is possible to control this value by setting various Hadoop properties, as shown
in Table 7-5.
Table 7-5. Properties for controlling split size
Property name              Type  Default value                                 Description
mapred.min.split.size      int   1                                             The smallest valid size in bytes for a file split.
mapred.max.split.size [a]  long  Long.MAX_VALUE, that is 9223372036854775807   The largest valid size in bytes for a file split.
dfs.block.size             long  64 MB, that is 67108864                       The size of a block in HDFS in bytes.
a. This property is not present in the old MapReduce API (with the exception of CombineFileInputFormat). Instead, it is calculated
indirectly as the size of the total input for the job, divided by the guide number of map tasks specified by mapred.map.tasks (or the
setNumMapTasks() method on JobConf). Because mapred.map.tasks defaults to 1, this makes the maximum split size the size
of the input.
The minimum split size is usually 1 byte, although some formats have a lower bound
on the split size. (For example, sequence files insert sync entries every so often in the
stream, so the minimum split size has to be large enough to ensure that every split has
a sync point to allow the reader to resynchronize with a record boundary.)
Applications may impose a minimum split size: by setting this to a value larger than
the block size, they can force splits to be larger than a block. There is no good reason
for doing this when using HDFS, since doing so will increase the number of blocks that
are not local to a map task.
The maximum split size defaults to the maximum value that can be represented by a
Java long type. It has an effect only when it is less than the block size, forcing splits to
be smaller than a block.
The split size is calculated by the formula (see the computeSplitSize() method in
FileInputFormat):
  max(minimumSize, min(maximumSize, blockSize))
By default:
  minimumSize < blockSize < maximumSize
so the split size is blockSize. Various settings for these parameters and how they affect
the final split size are illustrated in Table 7-6.
Table 7-6. Examples of how to control the split size
Minimum split size  Maximum split size        Block size       Split size  Comment
1 (default)         Long.MAX_VALUE (default)  64 MB (default)  64 MB       By default, split size is the same as the default block size.
1 (default)         Long.MAX_VALUE (default)  128 MB           128 MB      The most natural way to increase the split size is to have larger blocks in HDFS, by setting dfs.block.size, or on a per-file basis at file construction time.
128 MB              Long.MAX_VALUE (default)  64 MB (default)  128 MB      Making the minimum split size greater than the block size increases the split size, but at the cost of locality.
1 (default)         32 MB                     64 MB (default)  32 MB       Making the maximum split size less than the block size decreases the split size.
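The rows of Table 7-6 can be checked by evaluating the split size formula directly. The sketch below is plain Java with no Hadoop dependency; SplitSizeSketch is a hypothetical class, and its computeSplitSize() simply restates the formula given in the text rather than calling FileInputFormat.

```java
// Hypothetical stand-alone evaluation of max(minimumSize, min(maximumSize, blockSize)).
final class SplitSizeSketch {

    static final long MB = 1024L * 1024L;

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        // Row 1: all defaults, so the split size equals the 64 MB block size.
        System.out.println(computeSplitSize(64 * MB, 1, Long.MAX_VALUE) / MB);        // prints 64
        // Row 2: a larger block size gives a larger split.
        System.out.println(computeSplitSize(128 * MB, 1, Long.MAX_VALUE) / MB);       // prints 128
        // Row 3: a minimum above the block size forces bigger splits.
        System.out.println(computeSplitSize(64 * MB, 128 * MB, Long.MAX_VALUE) / MB); // prints 128
        // Row 4: a maximum below the block size forces smaller splits.
        System.out.println(computeSplitSize(64 * MB, 1, 32 * MB) / MB);               // prints 32
    }
}
```

Note that the minimum wins if it is set above the maximum, since it is applied last in the formula.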
Small files and CombineFileInputFormat
Hadoop works better with a small number of large files than a large number of small
files. One reason for this is that FileInputFormat generates splits in such a way that each
split is all or part of a single file. If the file is very small ("small" means significantly
smaller than an HDFS block) and there are a lot of them, then each map task will process
very little input, and there will be a lot of them (one per file), each of which imposes
extra bookkeeping overhead. Compare a 1 GB file broken into sixteen 64 MB blocks,
and 10,000 or so 100 KB files. The 10,000 files use one map each, and the job time can
be tens or hundreds of times slower than the equivalent one with a single input file and
16 map tasks.
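The map task counts in this comparison follow from simple ceiling division. The sketch below is a rough model only: MapTaskCountSketch and splitsForFile() are hypothetical names, and the model assumes every non-empty file yields the ceiling of its size divided by the split size, ignoring any rounding tolerance a real input format may apply to the last split.

```java
// Hypothetical back-of-the-envelope model of how many splits (and hence map
// tasks) a set of files produces when no split spans more than one file.
final class MapTaskCountSketch {

    static final long KB = 1024L;
    static final long MB = 1024L * KB;
    static final long GB = 1024L * MB;

    // Number of splits for one non-empty file, by ceiling division.
    static long splitsForFile(long fileSize, long splitSize) {
        if (fileSize <= 0 || splitSize <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long splitSize = 64 * MB;
        // One 1 GB file: 16 splits.
        System.out.println(splitsForFile(1 * GB, splitSize));             // prints 16
        // 10,000 files of 100 KB each: one split per file.
        System.out.println(10_000 * splitsForFile(100 * KB, splitSize));  // prints 10000
    }
}
```

Both inputs hold roughly the same amount of data (10,000 files of 100 KB is a little under 1 GB), yet the second needs 625 times as many map tasks.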
The situation is alleviated somewhat by CombineFileInputFormat, which was designed
to work well with small files. Where FileInputFormat creates a split per file,
CombineFileInputFormat packs many files into each split so that each mapper has more
to process. Crucially, CombineFileInputFormat takes node and rack locality into account
when deciding which blocks to place in the same split, so it does not compromise the
speed at which it can process the input in a typical MapReduce job.
Of course, if possible, it is still a good idea to avoid the many small files case since
MapReduce works best when it can operate at the transfer rate of the disks in the cluster,
and processing many small files increases the number of seeks that are needed to run
a job. Also, storing large numbers of small files in HDFS is wasteful of the namenode's
memory. One technique for avoiding the many small files case is to merge small files
into larger files by using a SequenceFile: the keys can act as filenames (or a constant
such as NullWritable, if not needed) and the values as file contents. See Example 7-4.
But if you already have a large number of small files in HDFS, then
CombineFileInputFormat is worth trying.
CombineFileInputFormat isn't just good for small files; it can bring
benefits when processing large files, too. Essentially,
CombineFileInputFormat decouples the amount of data that a mapper consumes
from the block size of the files in HDFS.
If your mappers can process each block in a matter of seconds, then
you could use CombineFileInputFormat with the maximum split size set
to a small multiple of the block size (by setting the
mapred.max.split.size property in bytes) so that each mapper processes
more than one block. In return, the overall processing time falls, since
proportionally fewer mappers run, which reduces the overhead in task
bookkeeping and startup time associated with a large number of
short-lived mappers.
Since CombineFileInputFormat is an abstract class without any concrete classes (unlike
FileInputFormat), you need to do a bit more work to use it. (Hopefully, common
implementations will be added to the library over time.) For example, to have the
CombineFileInputFormat equivalent of TextInputFormat, you would create a concrete
subclass of CombineFileInputFormat and implement the createRecordReader() method
(getRecordReader() in the old API).
Preventing splitting
Some applications don't want files to be split, so that a single mapper can process each
input file in its entirety. For example, a simple way to check if all the records in a file
are sorted is to go through the records in order, checking whether each record is not
less than the preceding one. Implemented as a map task, this algorithm will work only
if one map processes the whole file.[3]
There are a couple of ways to ensure that an existing file is not split. The first (quick
and dirty) way is to increase the minimum split size to be larger than the largest file in
your system. Setting it to its maximum value, Long.MAX_VALUE, has this effect. The
second is to subclass the concrete subclass of FileInputFormat that you want to use, to
override the isSplitable() method[4] to return false. For example, here's a nonsplittable
TextInputFormat:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
}
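To see why the sortedness check mentioned at the start of this section needs the whole file in one mapper, consider the following plain-Java sketch. SortedCheckSketch and isSorted() are hypothetical names, and a List of integers stands in for the records of a file.

```java
import java.util.List;

// Hypothetical illustration: a per-split sortedness check can pass on every
// split even though the file as a whole is not sorted.
final class SortedCheckSketch {

    // True if no record is less than the one preceding it.
    static <T extends Comparable<? super T>> boolean isSorted(List<T> records) {
        for (int i = 1; i < records.size(); i++) {
            if (records.get(i).compareTo(records.get(i - 1)) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Integer> file = List.of(5, 6, 7, 1, 2, 3);
        List<Integer> firstSplit = file.subList(0, 3);
        List<Integer> secondSplit = file.subList(3, 6);

        System.out.println(isSorted(firstSplit));   // prints true
        System.out.println(isSorted(secondSplit));  // prints true
        System.out.println(isSorted(file));         // prints false
    }
}
```

Each half passes the check on its own, so two mappers working on separate splits would both report success; only a mapper that sees the boundary between 7 and 1 can detect the problem.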
File information in the mapper
A mapper processing a file input split can find information about the split by calling
the getInputSplit() method on the Mapper's Context object. When the input format
derives from FileInputFormat, the InputSplit returned by this method can be cast to a
FileSplit to access the file information listed in Table 7-7.
In the old MapReduce API, Streaming, and Pipes, the same file split information is
made available through properties which can be read from the mapper's configuration.
(In the old MapReduce API this is achieved by implementing configure() in your
Mapper implementation to get access to the JobConf object.)
In addition to the properties in Table 7-7, all mappers and reducers have access to the
properties listed in "The Task Execution Environment" on page 212.
Table 7-7. File split properties
FileSplit method  Property name     Type         Description
getPath()         map.input.file    Path/String  The path of the input file being processed
getStart()        map.input.start   long         The byte offset of the start of the split from the beginning of the file
getLength()       map.input.length  long         The length of the split in bytes
3. This is how the mapper in SortValidator.RecordStatsChecker is implemented.
4. In the method name isSplitable(), "splitable" has a single "t". It is usually spelled "splittable", which
is the spelling I have used in this book.
In the next section, you will see how to use a FileSplit when we need to access the
split's filename.
Processing a whole file as a record
A related requirement that sometimes crops up is for mappers to have access to the full
contents of a file. Not splitting the file gets you part of the way there, but you also need
to have a RecordReader that delivers the file contents as the value of the record. The
listing for WholeFileInputFormat in Example 7-2 shows a way of doing this.
Example 7-2. An InputFormat for reading a whole file as a record
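import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException,
      InterruptedException {
    WholeFileRecordReader reader = new WholeFileRecordReader();
    reader.initialize(split, context);
    return reader;
  }
}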
WholeFileInputFormat defines a format where the keys are not used, represented by
NullWritable, and the values are the file contents, represented by BytesWritable
instances. It defines two methods. First, the format is careful to specify that input files
should never be split, by overriding isSplitable() to return false. Second, we
implement createRecordReader() to return a custom implementation of
RecordReader, which appears in Example 7-3.
Example 7-3. The RecordReader used by WholeFileInputFormat for reading a whole file as a record
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {

  private FileSplit fileSplit;
  private Configuration conf;
  private BytesWritable value = new BytesWritable();
  private boolean processed = false;

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    this.fileSplit = (FileSplit) split;
    this.conf = context.getConfiguration();
  }

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    if (!processed) {
      // The split covers the whole file, so its length is the file length
      byte[] contents = new byte[(int) fileSplit.getLength()];
      Path file = fileSplit.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }
    return false;
  }

  @Override
  public NullWritable getCurrentKey() throws IOException, InterruptedException {
    return NullWritable.get();
  }

  @Override
  public BytesWritable getCurrentValue() throws IOException,
      InterruptedException {
    return value;
  }

  @Override
  public float getProgress() throws IOException {
    return processed ? 1.0f : 0.0f;
  }

  @Override
  public void close() throws IOException {
    // do nothing
  }
}
WholeFileRecordReader is responsible for taking a FileSplit and converting it into a
single record, with a null key and a value containing the bytes of the file. Because there
is only a single record, WholeFileRecordReader has either processed it or not, so it
maintains a boolean called processed. If, when the nextKeyValue() method is called, the file
has not been processed, then we open the file, create a byte array whose length is the
length of the file, and use the Hadoop IOUtils class to slurp the file into the byte array.
Then we set the array on the reader's BytesWritable value instance, and return true to
signal that a record has been read.
The other methods are straightforward bookkeeping methods for accessing the current
key and value, getting the progress of the reader, and a close() method, which
is invoked by the MapReduce framework when the reader is done with.
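The single-record pattern can be tried out locally without a cluster. The following is a plain-Java analog, not the Hadoop class: WholeFileReaderSketch and runOnTempFile() are hypothetical names, and java.nio.file.Files.readAllBytes() stands in for the FileSystem and IOUtils calls.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical local analog of a reader that yields exactly one record
// whose value is the full contents of a file.
final class WholeFileReaderSketch {

    private final Path file;
    private byte[] value = new byte[0];
    private boolean processed = false;

    WholeFileReaderSketch(Path file) {
        this.file = file;
    }

    // Returns true the first time (after loading the file), false afterwards.
    boolean nextKeyValue() throws IOException {
        if (processed) {
            return false;
        }
        value = Files.readAllBytes(file);
        processed = true;
        return true;
    }

    byte[] getCurrentValue() {
        return value;
    }

    float getProgress() {
        return processed ? 1.0f : 0.0f;
    }

    // Writes the contents to a temporary file, drains the reader, and reports
    // how many records and bytes were seen.
    static String runOnTempFile(String contents) {
        try {
            Path tmp = Files.createTempFile("whole-file-sketch", ".txt");
            try {
                Files.write(tmp, contents.getBytes(StandardCharsets.UTF_8));
                WholeFileReaderSketch reader = new WholeFileReaderSketch(tmp);
                int records = 0;
                int bytes = 0;
                while (reader.nextKeyValue()) {
                    records++;
                    bytes += reader.getCurrentValue().length;
                }
                return "records=" + records + " bytes=" + bytes;
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runOnTempFile("hello")); // prints records=1 bytes=5
    }
}
```

As with the Hadoop version, the whole file is held in memory as one byte array, so the approach only suits files that comfortably fit in a task's heap.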
To demonstrate how WholeFileInputFormat can be used, consider a MapReduce job for
packaging small files into sequence files, where the key is the original filename, and the
value is the content of the file. The listing is in Example 7-4.
Example 7-4. A MapReduce program for packaging a collection of small files as a single SequenceFile
import java.io.IOException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SmallFilesToSequenceFileConverter extends Configured
    implements Tool {

  static class SequenceFileMapper
      extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {

    private Text filenameKey;

    @Override
    protected void setup(Context context) throws IOException,
        InterruptedException {
      // Each split is a whole file, so the filename is fixed for this mapper
      InputSplit split = context.getInputSplit();
      Path path = ((FileSplit) split).getPath();
      filenameKey = new Text(path.toString());
    }

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
        throws IOException, InterruptedException {
      context.write(filenameKey, value);
    }

  }

  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }

    job.setInputFormatClass(WholeFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(BytesWritable.class);

    job.setMapperClass(SequenceFileMapper.class);

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new SmallFilesToSequenceFileConverter(), args);
    System.exit(exitCode);
  }
}
Since the input format is a WholeFileInputFormat, the mapper only has to find the
filename for the input file split. It does this by casting the InputSplit from the context
to a FileSplit, which has a method to retrieve the file path. The path is stored in a
Text object for the key. The reducer is the identity (not explicitly set), and the output
format is a SequenceFileOutputFormat.
Here's a run on a few small files. We've chosen to use two reducers, so we get two
output sequence files:
% hadoop jar hadoop-examples.jar SmallFilesToSequenceFileConverter \
-conf conf/hadoop-localhost.xml -D mapred.reduce.tasks=2 input/smallfiles output
Two part files are created, each of which is a sequence file, which we can inspect with
the -text option to the filesystem shell:
% hadoop fs -conf conf/hadoop-localhost.xml -text output/part-r-00000
hdfs://localhost/user/tom/input/smallfiles/a 61 61 61 61 61 61 61 61 61 61
hdfs://localhost/user/tom/input/smallfiles/c 63 63 63 63 63 63 63 63 63 63
hdfs://localhost/user/tom/input/smallfiles/e
% hadoop fs -conf conf/hadoop-localhost.xml -text output/part-r-00001
hdfs://localhost/user/tom/input/smallfiles/b 62 62 62 62 62 62 62 62 62 62
hdfs://localhost/user/tom/input/smallfiles/d 64 64 64 64 64 64 64 64 64 64
hdfs://localhost/user/tom/input/smallfiles/f 66 66 66 66 66 66 66 66 66 66
The input files were named a, b, c, d, e, and f, and each contained 10 characters of the
corresponding letter (so, for example, a contained 10 "a" characters), except e, which
was empty. We can see this in the textual rendering of the sequence files, which prints
the filename followed by the hex representation of the file.
There's at least one way we could improve this program. As mentioned earlier, having
one mapper per file is inefficient, so subclassing CombineFileInputFormat instead of
FileInputFormat would be a better approach. Also, for a related technique of packing
files into a Hadoop Archive, rather than a sequence file, see the section "Hadoop
Archives" on page 78.
Text Input
Hadoop excels at processing unstructured text. In this section, we discuss the different
InputFormats that Hadoop provides to process text.
TextInputFormat
TextInputFormat is the default InputFormat. Each record is a line of input. The key, a
LongWritable, is the byte offset within the file of the beginning of the line. The value is
the contents of the line, excluding any line terminators (newline, carriage return), and
is packaged as a Text object. So a file containing the following text:
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
is divided into one split of four records. The records are interpreted as the following
key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
Clearly, the keys are not line numbers. This would be impossible to implement in
general, in that a file is broken into splits at byte, not line, boundaries. Splits are processed
independently. Line numbers are really a sequential notion: you have to keep a count
of lines as you consume them, so knowing the line number within a split would be
possible, but not within the file.
However, the offset within the file of each line is known by each split independently of
the other splits, since each split knows the size of the preceding splits and just adds this
on to the offsets within the split to produce a global file offset. The offset is usually
sufficient for applications that need a unique identifier for each line. Combined with
the file's name, it is unique within the filesystem. Of course, if all the lines are a fixed
width, then calculating the line number is simply a matter of dividing the offset by the
width (counting the line terminator as part of the width).
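To make the offset arithmetic concrete, here is a minimal sketch in plain Java (it does not use Hadoop, and the LineOffsets class and its method names are hypothetical). It assumes lines are terminated by a single newline byte and the text is encoded as UTF-8. Run on the four-line verse above, it produces the keys 0, 33, 57, and 89.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: computes the keys that a
// byte-offset-keyed line reader would assign to each line.
class LineOffsets {

  /** Returns the byte offset at which each line of the text starts. */
  static List<Long> offsets(String text) {
    byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
    List<Long> starts = new ArrayList<>();
    long lineStart = 0;
    for (int i = 0; i < bytes.length; i++) {
      if (bytes[i] == '\n') {
        starts.add(lineStart);   // a line ends here, so record where it began
        lineStart = i + 1;       // the next line starts after the newline
      }
    }
    if (lineStart < bytes.length) {
      starts.add(lineStart);     // final line with no trailing newline
    }
    return starts;
  }

  /** For fixed-width lines, the zero-based line number is offset / width. */
  static long lineNumber(long offset, int widthIncludingTerminator) {
    return offset / widthIncludingTerminator;
  }

  public static void main(String[] args) {
    String text = "On the top of the Crumpetty Tree\n"
        + "The Quangle Wangle sat,\n"
        + "But his face you could not see,\n"
        + "On account of his Beaver Hat.\n";
    System.out.println(offsets(text)); // prints [0, 33, 57, 89]
  }
}
```

The first line is 32 bytes long plus one newline byte, which is why the second key is 33 rather than 32.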
The Relationship Between Input Splits and HDFS Blocks
The logical records that FileInputFormats define do not usually fit neatly into HDFS
blocks. For example, a TextInputFormat's logical records are lines, which will cross
HDFS boundaries more often than not. This has no bearing on the functioning of your
program (lines are not missed or broken, for example), but it's worth knowing about,
as it does mean that data-local maps (that is, maps that are running on the same host
as their input data) will perform some remote reads. The slight overhead this causes is
not normally significant.
Figure 7-3 shows an example. A single file is broken into lines, and the line boundaries
do not correspond with the HDFS block boundaries. Splits honor logical record
boundaries, in this case lines, so we see that the first split contains line 5, even though it spans
the first and second block. The second split starts at line 6.
Figure 7-3. Logical records and HDFS blocks for TextInputFormat
KeyValueTextInputFormat
TextInputFormat's keys, being simply the offset within the file, are not normally very
useful. It is common for each line in a file to be a key-value pair, separated by a delimiter
such as a tab character. For example, this is the output produced by TextOutputFormat,
Hadoop's default OutputFormat. To interpret such files correctly, KeyValueTextInputFormat
is appropriate.
You can specify the separator via the
mapreduce.input.keyvaluelinerecordreader.key.value.separator property (or
key.value.separator.in.input.line in the old API). It is a tab character by default.
Consider the following input file, where → represents a (horizontal) tab character:
line1→On the top of the Crumpetty Tree
line2→The Quangle Wangle sat,
line3→But his face you could not see,
line4→On account of his Beaver Hat.
As in the TextInputFormat case, the input is in a single split comprising four records,
although this time the keys are the Text sequences before the tab in each line:
(line1, On the top of the Crumpetty Tree)
(line2, The Quangle Wangle sat,)
(line3, But his face you could not see,)
(line4, On account of his Beaver Hat.)
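The splitting rule can be sketched in a few lines of plain Java. This is an illustration only, not Hadoop's implementation: the KeyValueLineSplitter class is hypothetical, and treating a line with no separator as a key with an empty value is an assumption of the sketch.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: splits a line into a key and a
// value at the first occurrence of the separator character.
class KeyValueLineSplitter {

  static Map.Entry<String, String> split(String line, char separator) {
    int pos = line.indexOf(separator);
    if (pos == -1) {
      // Assumption of this sketch: no separator means the whole line is the key.
      return new SimpleEntry<>(line, "");
    }
    // Only the first separator counts; later ones stay in the value.
    return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
  }

  public static void main(String[] args) {
    Map.Entry<String, String> record =
        split("line1\tOn the top of the Crumpetty Tree", '\t');
    System.out.println("(" + record.getKey() + ", " + record.getValue() + ")");
  }
}
```

Splitting on the first separator only means that values are free to contain further tab characters.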
NLineInputFormat
With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable
number of lines of input. The number depends on the size of the split and the length
of the lines. If you want your mappers to receive a fixed number of lines of input, then
NLineInputFormat is the InputFormat to use. Like TextInputFormat, the keys are the byte
offsets within the file and the values are the lines themselves.
N refers to the number of lines of input that each mapper receives. With N set to
one (the default), each mapper receives exactly one line of input. The
mapreduce.input.lineinputformat.linespermap property
(mapred.line.input.format.linespermap in the old API) controls the value of N. By way of
example, consider these four lines again:
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
If, for example, N is two, then each split contains two lines. One mapper will receive
the first two key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
And another mapper will receive the second two key-value pairs:
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
The keys and values are the same as TextInputFormat produces. What is different is the
way the splits are constructed.
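The grouping itself is simple to picture. The following plain-Java sketch (the NLineGrouping class is hypothetical and does not use Hadoop) chops a list of lines into consecutive groups of N, one group per mapper. If the number of lines is not a multiple of N, the last group is simply shorter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop: groups lines into "splits" of
// at most n lines each, in input order.
class NLineGrouping {

  static List<List<String>> group(List<String> lines, int n) {
    if (n < 1) {
      throw new IllegalArgumentException("n must be at least 1");
    }
    List<List<String>> splits = new ArrayList<>();
    for (int start = 0; start < lines.size(); start += n) {
      int end = Math.min(start + n, lines.size());
      // Copy the sublist so that each group is independent of the input list.
      splits.add(new ArrayList<>(lines.subList(start, end)));
    }
    return splits;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList(
        "On the top of the Crumpetty Tree",
        "The Quangle Wangle sat,",
        "But his face you could not see,",
        "On account of his Beaver Hat.");
    for (List<String> split : group(lines, 2)) {
      System.out.println(split);
    }
  }
}
```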
Usually, having a map task for a small number of lines of input is inefficient (due to the
overhead in task setup), but there are applications that take a small amount of input
data and run an extensive (that is, CPU-intensive) computation for it, then emit their
output. Simulations are a good example. By creating an input file that specifies input
parameters, one per line, you can perform a parameter sweep: run a set of simulations
in parallel to find how a model varies as the parameter changes.
If you have long-running simulations, you may fall afoul of task timeouts.
When a task doesn't report progress for more than 10 minutes,
then the tasktracker assumes it has failed and aborts the process (see
"Task Failure" on page 200).
The best way to guard against this is to report progress periodically, by
writing a status message, or incrementing a counter, for example. See
"What Constitutes Progress in MapReduce?" on page 193.
Another example is using Hadoop to bootstrap data loading from multiple data
sources, such as databases. You create a "seed" input file that lists the data sources,
one per line. Then each mapper is allocated a single data source, and it loads the data
from that source into HDFS. The job doesn't need the reduce phase, so the number of
reducers should be set to zero (by calling setNumReduceTasks() on Job). Furthermore,
MapReduce jobs can be run to process the data loaded into HDFS. See Appendix C for
an example.
XML
Most XML parsers operate on whole XML documents, so if a large XML document is
made up of multiple input splits, then it is a challenge to parse these individually. Of
course, you can process the entire XML document in one mapper (if it is not too large)
using the technique in "Processing a whole file as a record" on page 240.
Large XML documents that are composed of a series of "records" (XML document
fragments) can be broken into these records using simple string or regular-expression
matching to find start and end tags of records. This alleviates the problem when the
document is split by the framework, since the next start tag of a record is easy to find
by simply scanning from the start of the split, just like TextInputFormat finds newline
boundaries.
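The scanning idea can be sketched with plain string matching. The XmlRecordScanner class below is hypothetical and is not how any Hadoop class is implemented; it treats the tags as literal strings (so a start tag carrying attributes would not match) and works on a chunk of text already held in memory. A real record reader would also read past the end of its split to complete the final record, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: extracts complete records that
// begin with startTag and end with endTag from an arbitrary chunk of text.
class XmlRecordScanner {

  static List<String> records(String chunk, String startTag, String endTag) {
    List<String> found = new ArrayList<>();
    int from = 0;
    while (true) {
      // Skip anything before the next start tag, such as the tail of a
      // record that began in the previous chunk.
      int start = chunk.indexOf(startTag, from);
      if (start == -1) {
        break;
      }
      int end = chunk.indexOf(endTag, start + startTag.length());
      if (end == -1) {
        break; // the record continues beyond this chunk
      }
      end += endTag.length();
      found.add(chunk.substring(start, end));
      from = end;
    }
    return found;
  }

  public static void main(String[] args) {
    String chunk = "end of an earlier record</page>"
        + "<page><title>A</title></page>\n"
        + "<page><title>B</title></page>"
        + "<page><title>unfinished";
    for (String record : records(chunk, "<page>", "</page>")) {
      System.out.println(record);
    }
  }
}
```

In the usage shown, the chunk starts and ends in the middle of a record, so only the two complete page elements are returned.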
Hadoop comes with a class for this purpose called StreamXmlRecordReader (which is in
the org.apache.hadoop.streaming package, although it can be used outside of
Streaming). You can use it by setting your input format to StreamInputFormat and setting the
stream.recordreader.class property to org.apache.hadoop.streaming.StreamXmlRecordReader.
The reader is configured by setting job configuration properties to tell it the
patterns for the start and end tags (see the class documentation for details).[5]
To take an example, Wikipedia provides dumps of its content in XML form, which are
appropriate for processing in parallel with MapReduce using this approach. The data
is contained in one large XML wrapper document, which contains a series of elements,
such as page elements that contain a page's content and associated metadata. Using
StreamXmlRecordReader, the page elements can be interpreted as records for processing
by a mapper.
Binary Input
Hadoop MapReduce is not just restricted to processing textual data; it has support
for binary formats, too.
SequenceFileInputFormat
Hadoop's sequence file format stores sequences of binary key-value pairs. Sequence
files are well suited as a format for MapReduce data since they are splittable (they have
sync points so that readers can synchronize with record boundaries from an arbitrary
point in the file, such as the start of a split), they support compression as a part of the
format, and they can store arbitrary types using a variety of serialization frameworks.
(These topics are covered in "SequenceFile" on page 132.)
To use data from sequence files as the input to MapReduce, you use
SequenceFileInputFormat. The keys and values are determined by the sequence file, and you need to
make sure that your map input types correspond. For example, if your sequence file
has IntWritable keys and Text values, like the one created in Chapter 4, then the map
signature would be Mapper<IntWritable, Text, K, V>, where K and V are the types of
the map's output keys and values.
5. See Mahout's XmlInputFormat (available from http://mahout.apache.org/) for an improved XML input
format.
Although its name doesn't give it away, SequenceFileInputFormat can
read MapFiles as well as sequence files. If it finds a directory where it
was expecting a sequence file, SequenceFileInputFormat assumes that it
is reading a MapFile and uses its data file. This is why there is no
MapFileInputFormat class.
SequenceFileAsTextInputFormat
SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat that converts
the sequence file's keys and values to Text objects. The conversion is performed by
calling toString() on the keys and values. This format makes sequence files suitable
input for Streaming.
SequenceFileAsBinaryInputFormat
SequenceFileAsBinaryInputFormat is a variant of SequenceFileInputFormat that retrieves
the sequence file's keys and values as opaque binary objects. They are encapsulated as
BytesWritable objects, and the application is free to interpret the underlying byte array
as it pleases. Combined with a process that creates sequence files with
SequenceFile.Writer's appendRaw() method, this provides a way to use any binary data types
with MapReduce (packaged as a sequence file), although plugging into Hadoop's
serialization mechanism is normally a cleaner alternative (see "Serialization
Frameworks" on page 110).
Multiple Inputs
Although the input to a MapReduce job may consist of multiple input files (constructed
by a combination of file globs, filters, and plain paths), all of the input is interpreted
by a single InputFormat and a single Mapper. What often happens, however, is that over
time, the data format evolves, so you have to write your mapper to cope with all of your
legacy formats. Or, you have data sources that provide the same type of data but in
different formats. This arises in the case of performing joins of different datasets; see
"Reduce-Side Joins" on page 284. For instance, one might be tab-separated plain text,
the other a binary sequence file. Even if they are in the same format, they may have
different representations and, therefore, need to be parsed differently.
These cases are handled elegantly by using the MultipleInputs class, which allows you
to specify the InputFormat and Mapper to use on a per-path basis. For example, if we
had weather data from the UK Met Office[6] that we wanted to combine with the NCDC
data for our maximum temperature analysis, then we might set up the input as follows:
MultipleInputs.addInputPath(job, ncdcInputPath,
    TextInputFormat.class, MaxTemperatureMapper.class);
MultipleInputs.addInputPath(job, metOfficeInputPath,
    TextInputFormat.class, MetOfficeMaxTemperatureMapper.class);
This code replaces the usual calls to FileInputFormat.addInputPath() and
job.setMapperClass(). Both the Met Office and NCDC data are text-based, so we use
TextInputFormat for each. But the line format of the two data sources is different, so we use two
different mappers. The MaxTemperatureMapper reads NCDC input data and extracts the
year and temperature fields. The MetOfficeMaxTemperatureMapper reads Met Office
input data and extracts the year and temperature fields. The important thing is that the
map outputs have the same types, since the reducers (which are all of the same type)
see the aggregated map outputs and are not aware of the different mappers used to
produce them.
6. Met Office data is generally available only to the research and academic community. However, there is a
small amount of monthly weather station data available at http://www.metoffice.gov.uk/climate/uk/
stationdata/.
The MultipleInputs class has an overloaded version of addInputPath() that doesn't take
a mapper:
public static void addInputPath(Job job, Path path,
    Class<? extends InputFormat> inputFormatClass)
This is useful when you only have one mapper (set using the Job's setMapperClass()
method) but multiple input formats.
Database Input (and Output)
DBInputFormat is an input format for reading data from a relational database, using
JDBC. Because it doesn't have any sharding capabilities, you need to be careful not to
overwhelm the database you are reading from by running too many mappers. For this
reason, it is best used for loading relatively small datasets, perhaps for joining with
larger datasets from HDFS, using MultipleInputs. The corresponding output format is
DBOutputFormat, which is useful for dumping job outputs (of modest size) into a
database.[7]
For an alternative way of moving data between relational databases and HDFS, consider
using Sqoop, which is described in Chapter 15.
HBase's TableInputFormat is designed to allow a MapReduce program to operate on
data stored in an HBase table. TableOutputFormat is for writing MapReduce outputs
into an HBase table.
Output Formats
Hadoop has output data formats that correspond to the input formats covered in the
previous section. The OutputFormat class hierarchy appears in Figure 7-4.
Figure 7-4. OutputFormat class hierarchy
7. Instructions for how to use these formats are provided in "Database Access with Hadoop," http://www
.cloudera.com/blog/2009/03/0/database-access-with-hadoop/, by Aaron Kimball.
Text Output
The default output format, TextOutputFormat, writes records as lines of text. Its keys
and values may be of any type, since TextOutputFormat turns them to strings by calling
toString() on them. The key and value in each pair are separated by a tab character,
although that may be changed using the mapreduce.output.textoutputformat.separator property
(mapred.textoutputformat.separator in the old API). The counterpart to
TextOutputFormat for reading in this case is KeyValueTextInputFormat, since it breaks lines into
key-value pairs based on a configurable separator (see "KeyValueTextInputFormat"
on page 245).
You can suppress the key or the value (or both, making this output format equivalent
to NullOutputFormat, which emits nothing) from the output using a NullWritable type.
This also causes no separator to be written, which makes the output suitable for reading
in using TextInputFormat.
Binary Output
SequenceFileOutputFormat
As the name indicates, SequenceFileOutputFormat writes sequence files for its output.
This is a good choice of output if it forms the input to a further MapReduce job, since
it is compact and is readily compressed. Compression is controlled via the static
methods on SequenceFileOutputFormat, as described in "Using Compression in
MapReduce" on page 92. For an example of how to use SequenceFileOutputFormat, see
"Sorting" on page 266.
SequenceFileAsBinaryOutputFormat
SequenceFileAsBinaryOutputFormat is the counterpart to SequenceFileAsBinaryInputFormat,
and it writes keys and values in raw binary format into a SequenceFile container.
MapFileOutputFormat
MapFileOutputFormat writes MapFiles as output. The keys in a MapFile must be added
in order, so you need to ensure that your reducers emit keys in sorted order.
The reduce input keys are guaranteed to be sorted, but the output keys
are under the control of the reduce function, and there is nothing in the
general MapReduce contract that states that the reduce output keys have
to be ordered in any way. The extra constraint of sorted reduce output
keys is just needed for MapFileOutputFormat.
Multiple Outputs
FileOutputFormat and its subclasses generate a set of files in the output directory. There
is one file per reducer, and files are named by the partition number: part-r-00000,
part-r-00001, etc. There is sometimes a need to have more control over the naming of the
files or to produce multiple files per reducer. MapReduce comes with the
MultipleOutputs class to help you do this.[8]
An example: Partitioning data
Consider the problem of partitioning the weather dataset by weather station. We would
like to run a job whose output is a file per station, with each file containing all the
records for that station.
8. In the old MapReduce API there are two classes for producing multiple outputs: MultipleOutputFormat
and MultipleOutputs. In a nutshell, MultipleOutputs is more fully featured, but MultipleOutputFormat has
more control over the output directory structure and file naming. MultipleOutputs in the new API
combines the best features of the two multiple output classes in the old API.
One way of doing this is to have a reducer for each weather station. To arrange this,
we need to do two things. First, write a partitioner that puts records from the same
weather station into the same partition. Second, set the number of reducers on the job
to be the number of weather stations. The partitioner would look like this:
public class StationPartitioner extends Partitioner<LongWritable, Text> {

  private NcdcRecordParser parser = new NcdcRecordParser();

  @Override
  public int getPartition(LongWritable key, Text value, int numPartitions) {
    parser.parse(value);
    return getPartition(parser.getStationId());
  }

  private int getPartition(String stationId) {
    ...
  }
}
The getPartition(String) method, whose implementation is not shown, turns the
station ID into a partition index. To do this, it needs a list of all the station IDs and
then just returns the index of the station ID in the list.
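Since the implementation is not shown, the following is only one way such a lookup might be written, in plain Java with no Hadoop dependency. The StationIndex class is hypothetical, as is the choice to throw an exception for a station ID that is not in the list. Building a map once makes each lookup constant-time rather than a scan of the list.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: maps each known station ID to its position in the
// list of station IDs, which then serves as the partition index.
class StationIndex {

  private final Map<String, Integer> indexById = new HashMap<>();

  StationIndex(List<String> stationIds) {
    for (String id : stationIds) {
      // The next free index is the current map size; duplicates are ignored.
      indexById.putIfAbsent(id, indexById.size());
    }
  }

  int getPartition(String stationId) {
    Integer index = indexById.get(stationId);
    if (index == null) {
      // Assumption of this sketch: an unknown station is treated as an error.
      throw new IllegalArgumentException("Unknown station: " + stationId);
    }
    return index;
  }

  /** The number of partitions (and so reducers) the job would need. */
  int size() {
    return indexById.size();
  }

  public static void main(String[] args) {
    StationIndex index =
        new StationIndex(Arrays.asList("010010-99999", "010050-99999"));
    System.out.println(index.getPartition("010050-99999"));
  }
}
```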
There are two drawbacks to this approach. The first is that since the number of
partitions needs to be known before the job is run, so does the number of weather stations.
Although the NCDC provides metadata about its stations, there is no guarantee that
the IDs encountered in the data match those in the metadata. A station that appears in
the metadata but not in the data wastes a reducer slot. Worse, a station that appears
in the data but not in the metadata doesn't get a reducer slot; it has to be thrown away.
One way of mitigating this problem would be to write a job to extract the unique station
IDs, but it's a shame that we need an extra job to do this.
The second drawback is more subtle. It is generally a bad idea to allow the number of
partitions to be rigidly fixed by the application, since it can lead to small or
uneven-sized partitions. Having many reducers doing a small amount of work isn't an efficient
way of organizing a job: it's much better to get reducers to do more work and have
fewer of them, as the overhead in running a task is then reduced. Uneven-sized
partitions can be difficult to avoid, too. Different weather stations will have gathered a
widely varying amount of data: compare a station that opened one year ago to one that
has been gathering data for one century. If a few reduce tasks take significantly longer
than the others, they will dominate the job execution time and cause it to be longer
than it needs to be.
There are two special cases when it does make sense to allow the
application to set the number of partitions (or equivalently, the number
of reducers):
Zero reducers
This is a vacuous case: there are no partitions, as the application
needs to run only map tasks.
One reducer
It can be convenient to run small jobs to combine the output of
previous jobs into a single file. This should only be attempted when
the amount of data is small enough to be processed comfortably
by one reducer.
It is much better to let the cluster drive the number of partitions for a job, the idea
being that the more cluster reduce slots are available, the faster the job can complete.
This is why the default HashPartitioner works so well, as it works with any number of
partitions and ensures each partition has a good mix of keys, leading to more even-sized
partitions.
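The idea behind hash partitioning fits in one line: clear the sign bit of the key's hash code and take the remainder modulo the number of partitions. The sketch below is plain Java, and the HashPartitionSketch class is hypothetical. It illustrates the technique rather than reproducing Hadoop's class, and because it relies on Java's own hashCode() for the key, the partition numbers it prints need not match those of a real job using Writable keys.

```java
// Hypothetical sketch of hash partitioning: any key type works, and the
// result is always in the range [0, numPartitions).
class HashPartitionSketch {

  static int getPartition(Object key, int numPartitions) {
    // Masking with Integer.MAX_VALUE clears the sign bit, so a negative
    // hash code cannot produce a negative partition number.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  public static void main(String[] args) {
    String[] stations = {"010010-99999", "010050-99999", "010100-99999"};
    for (int numPartitions : new int[] {2, 5, 30}) {
      for (String station : stations) {
        System.out.println(station + " with " + numPartitions
            + " partitions -> " + getPartition(station, numPartitions));
      }
    }
  }
}
```

The same key always lands in the same partition for a given number of partitions, which is all the framework needs; nothing in the function depends on knowing the set of keys in advance.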
If we go back to using HashPartitioner, each partition will contain multiple stations,
so to create a file per station, we need to arrange for each reducer to write multiple files,
which is where MultipleOutputs comes in.
MultipleOutputs
MultipleOutputs allows you to write data to files whose names are derived from the
output keys and values, or in fact from an arbitrary string. This allows each reducer (or
mapper in a map-only job) to create more than a single file. File names are of the form
name-m-nnnnn for map outputs and name-r-nnnnn for reduce outputs, where name is an
arbitrary name that is set by the program, and nnnnn is an integer designating the part
number, starting from zero. The part number ensures that outputs written from
different partitions (mappers or reducers) do not collide in the case of the same name.
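As a small illustration of the naming scheme just described, the following plain-Java sketch assembles a file name from its three parts. The OutputFileNames class is hypothetical and only mimics the pattern; it is not the code that MultipleOutputs itself uses.

```java
import java.util.Locale;

// Hypothetical helper: builds names of the form name-m-nnnnn (map outputs)
// or name-r-nnnnn (reduce outputs), with a zero-padded five-digit part number.
class OutputFileNames {

  static String fileName(String name, boolean mapOutput, int partition) {
    return String.format(Locale.ROOT, "%s-%s-%05d",
        name, mapOutput ? "m" : "r", partition);
  }

  public static void main(String[] args) {
    // Two reducers writing under the same name produce distinct files.
    System.out.println(fileName("010550-99999", false, 0));
    System.out.println(fileName("010550-99999", false, 1));
  }
}
```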
The program in Example 7-5 shows how to use MultipleOutputs to partition the dataset
by station.
Example 7-5. Partitions whole dataset into files named by the station ID using MultipleOutputs
public class PartitionByStationUsingMultipleOutputs extends Configured
    implements Tool {

  static class StationMapper
      extends Mapper<LongWritable, Text, Text, Text> {

    private NcdcRecordParser parser = new NcdcRecordParser();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      parser.parse(value);
      context.write(new Text(parser.getStationId()), value);
    }
  }

  static class MultipleOutputsReducer
      extends Reducer<Text, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> multipleOutputs;

    @Override
    protected void setup(Context context)
        throws IOException, InterruptedException {
      multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text value : values) {
        multipleOutputs.write(NullWritable.get(), value, key.toString());
      }
    }

    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      multipleOutputs.close();
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }

    job.setMapperClass(StationMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setReducerClass(MultipleOutputsReducer.class);
    job.setOutputKeyClass(NullWritable.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new PartitionByStationUsingMultipleOutputs(),
        args);
    System.exit(exitCode);
  }
}
In the reducer, where we generate the output, we construct an instance of
MultipleOutputs in the setup() method and assign it to an instance variable. We then use the
MultipleOutputs instance in the reduce() method to write to the output, in place of the
context. The write() method takes the key and value, as well as a name. We use the
station identifier for the name, so the overall effect is to produce output files with the
naming scheme station_identifier-r-nnnnn.
In one run, the first few output files were named as follows (other columns from the
directory listing have been dropped):
/output/010010-99999-r-00027
/output/010050-99999-r-00013
/output/010100-99999-r-00015
/output/010280-99999-r-00014
/output/010550-99999-r-00000
/output/010980-99999-r-00011
/output/011060-99999-r-00025
/output/012030-99999-r-00029
/output/012350-99999-r-00018
/output/012620-99999-r-00004
The base path specified in the write() method of MultipleOutputs is interpreted relative
to the output directory, and since it may contain file path separator characters (/), it's
possible to create subdirectories of arbitrary depth. For example, the following
modification partitions the data by station and year so that each year's data is contained in
a directory named by the station ID (such as 029070-99999/1901/part-r-00000):
@Override
protected void reduce(Text key, Iterable<Text> values, Context context)
    throws IOException, InterruptedException {
  for (Text value : values) {
    parser.parse(value);
    String basePath = String.format("%s/%s/part",
        parser.getStationId(), parser.getYear());
    multipleOutputs.write(NullWritable.get(), value, basePath);
  }
}
MultipleOutputs delegates to the job's OutputFormat, which in this example is a
TextOutputFormat, but more complex setups are possible. For example, you can create
named outputs, each with its own OutputFormat and key and value types (which may
differ from the output types of the mapper or reducer). Furthermore, the mapper or
reducer (or both) may write to multiple output files for each record processed. Please
consult the Java documentation for more information.
Lazy Output
FileOutputFormat subclasses will create output (part-r-nnnnn) files, even if they are
empty. Some applications prefer that empty files not be created, which is where
LazyOutputFormat helps. It is a wrapper output format that ensures that the output file is
created only when the first record is emitted for a given partition. To use it, call its
setOutputFormatClass() method with the JobConf and the underlying output format.
Streaming and Pipes support a -lazyOutput option to enable LazyOutputFormat.
Database Output
The output formats for writing to relational databases and to HBase are mentioned in
"Database Input (and Output)" on page 249.
CHAPTER 8
MapReduce Features
This chapter looks at some of the more advanced features of MapReduce, including
counters, sorting, and joining datasets.
Counters
There are often things you would like to know about the data you are analyzing but
that are peripheral to the analysis you are performing. For example, if you were counting
invalid records and discovered that the proportion of invalid records in the whole
dataset was very high, you might be prompted to check why so many records were being
marked as invalid; perhaps there is a bug in the part of the program that detects invalid
records? Or if the data were of poor quality and genuinely did have very many invalid
records, after discovering this, you might decide to increase the size of the dataset so
that the number of good records was large enough for meaningful analysis.
Counters are a useful channel for gathering statistics about the job: for quality control
or for application-level statistics. They are also useful for problem diagnosis. If you are
tempted to put a log message into your map or reduce task, then it is often better to
see whether you can use a counter instead to record that a particular condition occurred.
In addition to counter values being much easier to retrieve than log output for large
distributed jobs, you get a record of the number of times that condition occurred, which
is more work to obtain from a set of logfiles.
Built-in Counters
Hadoop maintains some built-in counters for every job, which report various metrics
for your job. For example, there are counters for the number of bytes and records
processed, which allows you to confirm that the expected amount of input was
consumed and the expected amount of output was produced.
Counters are divided into groups, and there are several groups for the built-in counters,
listed in Table 8-1.
Table 8-1. Built-in counter groups
Group | Name/Enum | Reference
MapReduce Task Counters | org.apache.hadoop.mapred.Task$Counter (0.20); org.apache.hadoop.mapreduce.TaskCounter (post 0.20) | Table 8-2
Filesystem Counters | FileSystemCounters (0.20); org.apache.hadoop.mapreduce.FileSystemCounter (post 0.20) | Table 8-3
FileInputFormat Counters | org.apache.hadoop.mapred.FileInputFormat$Counter (0.20); org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter (post 0.20) | Table 8-4
FileOutputFormat Counters | org.apache.hadoop.mapred.FileOutputFormat$Counter (0.20); org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter (post 0.20) | Table 8-5
Job Counters | org.apache.hadoop.mapred.JobInProgress$Counter (0.20); org.apache.hadoop.mapreduce.JobCounter (post 0.20) | Table 8-6
Each group either contains task counters (which are updated as a task progresses) or
job counters (which are updated as a job progresses). We look at both types in the
following sections.
Task counters
Task counters gather information about tasks over the course of their execution, and
the results are aggregated over all the tasks in a job. For example, the
MAP_INPUT_RECORDS counter counts the input records read by each map task and aggregates
over all map tasks in a job, so that the final figure is the total number of input
records for the whole job.
Task counters are maintained by each task attempt, and periodically sent to the
tasktracker and then to the jobtracker, so they can be globally aggregated. (This is described
in "Progress and Status Updates" on page 192. Note that the information flow is
different in YARN; see "YARN (MapReduce 2)" on page 194.) Task counters are sent in
full every time, rather than sending the counts since the last transmission, since this
guards against errors due to lost messages. Furthermore, during a job run, counters
may go down if a task fails.
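The benefit of sending counters in full can be illustrated with a small plain-Java sketch. It is not Hadoop code: the class and method names are hypothetical, and the "lost update" is simulated with a boolean array. It compares what a receiver would see if each update carried the running total versus only the increment since the previous update.

```java
/** Compares two ways a task could report a counter when some updates are lost. */
class CounterReporting {

  /** Receiver's view when each update carries the full running total. */
  static long aggregateFullTotals(long[] totals, boolean[] delivered) {
    long seen = 0;
    for (int i = 0; i < totals.length; i++) {
      if (delivered[i]) {
        seen = totals[i]; // overwrite: a later update repairs an earlier loss
      }
    }
    return seen;
  }

  /** Receiver's view when each update carries only the increment since the last send. */
  static long aggregateDeltas(long[] totals, boolean[] delivered) {
    long seen = 0;
    long previous = 0;
    for (int i = 0; i < totals.length; i++) {
      long delta = totals[i] - previous;
      previous = totals[i];
      if (delivered[i]) {
        seen += delta; // a lost delta is never recovered
      }
    }
    return seen;
  }

  public static void main(String[] args) {
    long[] totals = {100, 250, 400};           // running totals at three reporting points
    boolean[] delivered = {true, false, true}; // the second update is lost
    System.out.println("full totals: " + aggregateFullTotals(totals, delivered));
    System.out.println("deltas:      " + aggregateDeltas(totals, delivered));
  }
}
```

With the second update dropped, the full-total scheme still ends at the true value, whereas the delta scheme permanently undercounts by the lost increment.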
Counter values are definitive only once a job has successfully completed. However,
some counters provide useful diagnostic information as a task is progressing, and it can
be useful to monitor them with the web UI. For example, PHYSICAL_MEMORY_BYTES,
VIRTUAL_MEMORY_BYTES, and COMMITTED_HEAP_BYTES provide an indication of how memory
usage varies over the course of a particular task attempt.
The built-in task counters include those in the MapReduce task counters group
(Table 8-2) and those in the file-related counters groups (Table 8-3, Table 8-4, Table 8-5).
Table 8-2. Built-in MapReduce task counters
Counter Description
Map input records
(MAP_INPUT_RECORDS)
The number of input records consumed by all the maps in the job. Incremented every
time a record is read from a RecordReader and passed to the map's map()
method by the framework.
Map skipped records
(MAP_SKIPPED_RECORDS)
The number of input records skipped by all the maps in the job. See Skipping Bad
Records on page 217.
Map input bytes
(MAP_INPUT_BYTES)
The number of bytes of uncompressed input consumed by all the maps in the job.
Incremented every time a record is read from a RecordReader and passed to the
map's map() method by the framework.
Split raw bytes
(SPLIT_RAW_BYTES)
The number of bytes of input split objects read by maps. These objects represent
the split metadata (that is, the offset and length within a file) rather than the split
data itself, so the total size should be small.
Map output records
(MAP_OUTPUT_RECORDS)
The number of map output records produced by all the maps in the job.
Incremented every time the collect() method is called on a map's
OutputCollector.
Map output bytes
(MAP_OUTPUT_BYTES)
The number of bytes of uncompressed output produced by all the maps in the job.
Incremented every time the collect() method is called on a map's
OutputCollector.
Map output materialized bytes
(MAP_OUTPUT_MATERIALIZED_BYTES)
The number of bytes of map output actually written to disk. If map output com-
pression is enabled this is reflected in the counter value.
Combine input records
(COMBINE_INPUT_RECORDS)
The number of input records consumed by all the combiners (if any) in the job.
Incremented every time a value is read from the combiner's iterator over values.
Note that this count is the number of values consumed by the combiner, not the
number of distinct key groups (which would not be a useful metric, since there is
not necessarily one group per key for a combiner; see "Combiner Functions"
on page 34, and also "Shuffle and Sort" on page 205).
Combine output records
(COMBINE_OUTPUT_RECORDS)
The number of output records produced by all the combiners (if any) in the job.
Incremented every time the collect() method is called on a combiner's
OutputCollector.
Reduce input groups
(REDUCE_INPUT_GROUPS)
The number of distinct key groups consumed by all the reducers in the job. Incremented
every time the reducer's reduce() method is called by the framework.
Reduce input records
(REDUCE_INPUT_RECORDS)
The number of input records consumed by all the reducers in the job. Incremented
every time a value is read from the reducer's iterator over values. If reducers consume
all of their inputs, this count should be the same as the count for Map output records.
Reduce output records
(REDUCE_OUTPUT_RECORDS)
The number of reduce output records produced by all the reducers in the job.
Incremented every time the collect() method is called on a reducer's
OutputCollector.
Reduce skipped groups
(REDUCE_SKIPPED_GROUPS)
The number of distinct key groups skipped by all the reducers in the job. See Skipping
Bad Records on page 217.
Reduce skipped records
(REDUCE_SKIPPED_RECORDS)
The number of input records skipped by all the reducers in the job.
Reduce shuffle bytes
(REDUCE_SHUFFLE_BYTES)
The number of bytes of map output copied by the shuffle to reducers.
Spilled records
(SPILLED_RECORDS)
The number of records spilled to disk in all map and reduce tasks in the job.
CPU milliseconds
(CPU_MILLISECONDS)
The cumulative CPU time for a task in milliseconds, as reported by /proc/cpuinfo.
Physical memory bytes
(PHYSICAL_MEMORY_BYTES)
The physical memory being used by a task in bytes, as reported by /proc/meminfo.
Virtual memory bytes
(VIRTUAL_MEMORY_BYTES)
The virtual memory being used by a task in bytes, as reported by /proc/meminfo.
Committed heap bytes
(COMMITTED_HEAP_BYTES)
The total amount of memory available in the JVM in bytes, as reported by
Runtime.getRuntime().totalMemory().
GC time milliseconds
(GC_TIME_MILLIS)
The elapsed time for garbage collection in tasks in milliseconds, as reported by
GarbageCollectorMXBean.getCollectionTime(). From 0.21.
Shuffled maps
(SHUFFLED_MAPS)
The number of map output files transferred to reducers by the shuffle (see Shuffle
and Sort on page 205). From 0.21.
Failed shuffle
(FAILED_SHUFFLE)
The number of map output copy failures during the shuffle. From 0.21.
Merged map outputs
(MERGED_MAP_OUTPUTS)
The number of map outputs that have been merged on the reduce side of the shuffle.
From 0.21.
Table 8-3. Built-in filesystem task counters
Counter Description
Filesystem bytes read
(BYTES_READ)
The number of bytes read by each filesystem by map and reduce tasks. There is a counter for each
filesystem: Filesystem may be Local, HDFS, S3, KFS, etc.
Filesystem bytes written
(BYTES_WRITTEN)
The number of bytes written by each filesystem by map and reduce tasks.
Table 8-4. Built-in FileInputFormat task counters
Counter Description
Bytes read
(BYTES_READ)
The number of bytes read by map tasks via the FileInputFormat.
Table 8-5. Built-in FileOutputFormat task counters
Counter Description
Bytes written
(BYTES_WRITTEN)
The number of bytes written by map tasks (for map-only jobs) or reduce tasks via the FileOutputFormat.
Job counters
Job counters (Table 8-6) are maintained by the jobtracker (or application master in
YARN), so they don't need to be sent across the network, unlike all other counters,
including user-defined ones. They measure job-level statistics, not values that change
while a task is running. For example, TOTAL_LAUNCHED_MAPS counts the number of map
tasks that were launched over the course of a job (including ones that failed).
Table 8-6. Built-in job counters
Counter Description
Launched map tasks
(TOTAL_LAUNCHED_MAPS)
The number of map tasks that were launched. Includes tasks that were
started speculatively.
Launched reduce tasks
(TOTAL_LAUNCHED_REDUCES)
The number of reduce tasks that were launched. Includes tasks that
were started speculatively.
Launched uber tasks
(TOTAL_LAUNCHED_UBERTASKS)
The number of uber tasks (see YARN (MapReduce 2) on page 194)
that were launched. From 0.23.
Maps in uber tasks
(NUM_UBER_SUBMAPS)
The number of maps in uber tasks. From 0.23.
Reduces in uber tasks
(NUM_UBER_SUBREDUCES)
The number of reduces in uber tasks. From 0.23.
Failed map tasks
(NUM_FAILED_MAPS)
The number of map tasks that failed. See Task Failure on page 200
for potential causes.
Failed reduce tasks
(NUM_FAILED_REDUCES)
The number of reduce tasks that failed.
Failed uber tasks
(NUM_FAILED_UBERTASKS)
The number of uber tasks that failed. From 0.23.
Data-local map tasks
(DATA_LOCAL_MAPS)
The number of map tasks that ran on the same node as their input data.
Rack-local map tasks
(RACK_LOCAL_MAPS)
The number of map tasks that ran on a node in the same rack as their
input data, but that are not data-local.
Other local map tasks
(OTHER_LOCAL_MAPS)
The number of map tasks that ran on a node in a different rack to their
input data. Inter-rack bandwidth is scarce, and Hadoop tries to place
map tasks close to their input data, so this count should be low. See
Figure 2-2.
Total time in map tasks
(SLOTS_MILLIS_MAPS)
The total time taken running map tasks in milliseconds. Includes tasks
that were started speculatively.
Total time in reduce tasks
(SLOTS_MILLIS_REDUCES)
The total time taken running reduce tasks in milliseconds. Includes
tasks that were started speculatively.
Total time in map tasks waiting after reserving slots
(FALLOW_SLOTS_MILLIS_MAPS)
The total time spent waiting after reserving slots for map tasks in
milliseconds. Slot reservation is a Capacity Scheduler feature for high-memory
jobs; see "Task memory limits" on page 316. Not used by
YARN-based MapReduce.
Total time in reduce tasks waiting after reserving slots
(FALLOW_SLOTS_MILLIS_REDUCES)
The total time spent waiting after reserving slots for reduce tasks in
milliseconds. Slot reservation is a Capacity Scheduler feature for high-memory
jobs; see "Task memory limits" on page 316. Not used by
YARN-based MapReduce.
User-Defined Java Counters
MapReduce allows user code to define a set of counters, which are then incremented
as desired in the mapper or reducer. Counters are defined by a Java enum, which serves
to group related counters. A job may define an arbitrary number of enums, each with
an arbitrary number of fields. The name of the enum is the group name, and the enum's
fields are the counter names. Counters are global: the MapReduce framework aggregates
them across all maps and reduces to produce a grand total at the end of the job.
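The relationship between enums, group names, and global aggregation can be mimicked in plain Java. The sketch below does not use any Hadoop classes; EnumCounterSketch, its "group:counter" key format, and the per-task maps are assumptions made purely for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class EnumCounterSketch {

  /** The enum acts as the group; its fields are the counter names. */
  enum Temperature { MISSING, MALFORMED }

  /** Builds a "group:counter" key from an enum constant (hypothetical format). */
  static String key(Enum<?> counter) {
    return counter.getDeclaringClass().getName() + ":" + counter.name();
  }

  /** Increments one counter in a single task's local counter map. */
  static void increment(Map<String, Long> taskCounters, Enum<?> counter, long amount) {
    taskCounters.merge(key(counter), amount, Long::sum);
  }

  /** Sums the per-task maps into job-wide totals. */
  static Map<String, Long> aggregate(List<Map<String, Long>> tasks) {
    Map<String, Long> totals = new HashMap<>();
    for (Map<String, Long> task : tasks) {
      task.forEach((name, value) -> totals.merge(name, value, Long::sum));
    }
    return totals;
  }

  public static void main(String[] args) {
    Map<String, Long> map1 = new HashMap<>();
    Map<String, Long> map2 = new HashMap<>();
    increment(map1, Temperature.MISSING, 2);
    increment(map2, Temperature.MISSING, 1);
    increment(map2, Temperature.MALFORMED, 1);
    System.out.println(aggregate(Arrays.asList(map1, map2)));
  }
}
```

Each simulated task only ever touches its own map; the grand total exists only after aggregation, which mirrors why counter values are authoritative only at the end of a job.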
We created some counters in Chapter 5 for counting malformed records in the weather
dataset. The program in Example 8-1 extends that example to count the number of
missing records and the distribution of temperature quality codes.
Example 8-1. Application to run the maximum temperature job, including counting missing and
malformed fields and quality codes
public class MaxTemperatureWithCounters extends Configured implements Tool {

  // The enum name is the counter group; its fields are the counter names.
  enum Temperature {
    MISSING,
    MALFORMED
  }

  static class MaxTemperatureMapperWithCounters extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

    private NcdcRecordParser parser = new NcdcRecordParser();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {

      parser.parse(value);
      if (parser.isValidTemperature()) {
        int airTemperature = parser.getAirTemperature();
        output.collect(new Text(parser.getYear()),
            new IntWritable(airTemperature));
      } else if (parser.isMalformedTemperature()) {
        System.err.println("Ignoring possibly corrupt input: " + value);
        reporter.incrCounter(Temperature.MALFORMED, 1);
      } else if (parser.isMissingTemperature()) {
        reporter.incrCounter(Temperature.MISSING, 1);
      }

      // dynamic counter
      reporter.incrCounter("TemperatureQuality", parser.getQuality(), 1);

    }
  }

  @Override
  public int run(String[] args) throws IOException {
    JobConf conf = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (conf == null) {
      return -1;
    }

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(MaxTemperatureMapperWithCounters.class);
    conf.setCombinerClass(MaxTemperatureReducer.class);
    conf.setReducerClass(MaxTemperatureReducer.class);
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MaxTemperatureWithCounters(), args);
    System.exit(exitCode);
  }
}
The best way to see what this program does is to run it over the complete dataset:
% hadoop jar hadoop-examples.jar MaxTemperatureWithCounters input/ncdc/all output-counters
When the job has successfully completed, it prints out the counters at the end (this is
done by JobClient's runJob() method). Here are the ones we are interested in:
09/04/20 06:33:36 INFO mapred.JobClient:   TemperatureQuality
09/04/20 06:33:36 INFO mapred.JobClient:     2=1246032
09/04/20 06:33:36 INFO mapred.JobClient:     1=973422173
09/04/20 06:33:36 INFO mapred.JobClient:     0=1
09/04/20 06:33:36 INFO mapred.JobClient:     6=40066
09/04/20 06:33:36 INFO mapred.JobClient:     5=158291879
09/04/20 06:33:36 INFO mapred.JobClient:     4=10764500
09/04/20 06:33:36 INFO mapred.JobClient:     9=66136858
09/04/20 06:33:36 INFO mapred.JobClient:   Air Temperature Records
09/04/20 06:33:36 INFO mapred.JobClient:     Malformed=3
09/04/20 06:33:36 INFO mapred.JobClient:     Missing=66136856
Dynamic counters
The code makes use of a dynamic counter, one that isn't defined by a Java enum. Since
a Java enum's fields are defined at compile time, you can't create new counters on the
fly using enums. Here we want to count the distribution of temperature quality codes,
and though the format specification defines the values that it can take, it is more
convenient to use a dynamic counter to emit the values that it actually takes. The method
we use on the Reporter object takes a group and counter name using String names:
public void incrCounter(String group, String counter, long amount)
The two ways of creating and accessing counters, using enums and using Strings,
are actually equivalent since Hadoop turns enums into Strings to send counters over
RPC. Enums are slightly easier to work with, provide type safety, and are suitable for
most jobs. For the odd occasion when you need to create counters dynamically, you
can use the String interface.
Readable counter names
By default, a counter's name is the enum's fully qualified Java classname. These names
are not very readable when they appear on the web UI, or in the console, so Hadoop
provides a way to change the display names using resource bundles. We've done this
here, so we see "Air Temperature Records" instead of "Temperature$MISSING". For
dynamic counters, the group and counter names are used for the display names, so this
is not normally an issue.
The recipe to provide readable names is as follows. Create a properties file named after
the enum, using an underscore as a separator for nested classes. The properties file
should be in the same directory as the top-level class containing the enum. The file is
named MaxTemperatureWithCounters_Temperature.properties for the counters in
Example 8-1.
The properties file should contain a single property named CounterGroupName, whose
value is the display name for the whole group. Then each field in the enum should have
a corresponding property defined for it, whose name is the name of the field suffixed
with .name, and whose value is the display name for the counter. Here are the contents
of MaxTemperatureWithCounters_Temperature.properties:
CounterGroupName=Air Temperature Records
MISSING.name=Missing
MALFORMED.name=Malformed
Hadoop uses the standard Java localization mechanisms to load the correct properties
for the locale you are running in, so, for example, you can create a Chinese version of
the properties in a file named MaxTemperatureWithCounters_Temperature_zh_CN.properties,
and they will be used when running in the zh_CN locale. Refer
to the documentation for java.util.PropertyResourceBundle for more information.
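To see how such a properties file resolves to display names, the following sketch reads the same three properties with java.util.PropertyResourceBundle. It is only an approximation of the lookup: the properties are inlined as a string rather than loaded from the classpath, and the CounterDisplayNames class with its fallback rule is hypothetical, not Hadoop's own code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class CounterDisplayNames {

  /** The properties from the text, inlined so the sketch needs no file on the classpath. */
  static final String PROPERTIES =
      "CounterGroupName=Air Temperature Records\n"
      + "MISSING.name=Missing\n"
      + "MALFORMED.name=Malformed\n";

  static ResourceBundle load() {
    try {
      return new PropertyResourceBundle(new StringReader(PROPERTIES));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Display name for an enum field, falling back to the raw field name if none is defined. */
  static String displayName(ResourceBundle bundle, String field) {
    String key = field + ".name";
    return bundle.containsKey(key) ? bundle.getString(key) : field;
  }

  public static void main(String[] args) {
    ResourceBundle bundle = load();
    System.out.println(bundle.getString("CounterGroupName"));
    System.out.println(displayName(bundle, "MISSING"));
    System.out.println(displayName(bundle, "MALFORMED"));
  }
}
```

In a real job the bundle would be located by name and locale (for example through ResourceBundle.getBundle()), which is what makes the zh_CN variant take effect.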
Retrieving counters
In addition to being available via the web UI and the command line (using hadoop job
-counter), you can retrieve counter values using the Java API. You can do this while
the job is running, although it is more usual to get counters at the end of a job run,
when they are stable. Example 8-2 shows a program that calculates the proportion of
records that have missing temperature fields.
Example 8-2. Application to calculate the proportion of records with missing temperature fields
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class MissingTemperatureFields extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 1) {
      JobBuilder.printUsage(this, "<job ID>");
      return -1;
    }
    JobClient jobClient = new JobClient(new JobConf(getConf()));
    String jobID = args[0];
    RunningJob job = jobClient.getJob(JobID.forName(jobID));
    if (job == null) {
      System.err.printf("No job with ID %s found.\n", jobID);
      return -1;
    }
    if (!job.isComplete()) {
      System.err.printf("Job %s is not complete.\n", jobID);
      return -1;
    }

    Counters counters = job.getCounters();
    // user-defined counter, looked up by enum
    long missing = counters.getCounter(
        MaxTemperatureWithCounters.Temperature.MISSING);

    // built-in counter, looked up by group name and counter name
    long total = counters.findCounter("org.apache.hadoop.mapred.Task$Counter",
        "MAP_INPUT_RECORDS").getCounter();

    System.out.printf("Records with missing temperature fields: %.2f%%\n",
        100.0 * missing / total);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MissingTemperatureFields(), args);
    System.exit(exitCode);
  }
}
First we retrieve a RunningJob object from a JobClient, by calling the getJob() method
with the job ID. We check whether there is actually a job with the given ID. There may
not be, either because the ID was incorrectly specified or because the jobtracker no
longer has a reference to the job (only the last 100 jobs are kept in memory, controlled
by mapred.jobtracker.completeuserjobs.maximum, and all are cleared out if the
jobtracker is restarted).
After confirming that the job has completed, we call the RunningJob's getCounters()
method, which returns a Counters object, encapsulating all the counters for a job. The
Counters class provides various methods for finding the names and values of counters.
We use the getCounter() method, which takes an enum to find the number of records
that had a missing temperature field.
There are also findCounter() methods, all of which return a Counter object. We use
this form to retrieve the built-in counter for map input records. To do this, we refer to
the counter by its group name (the fully qualified Java classname for the enum) and
counter name (both strings; see footnote 1).
Finally, we print the proportion of records that had a missing temperature field. Here's
what we get for the whole weather dataset:
% hadoop jar hadoop-examples.jar MissingTemperatureFields job_200904200610_0003
Records with missing temperature fields: 5.47%
User-Defined Streaming Counters
A Streaming MapReduce program can increment counters by sending a specially
formatted line to the standard error stream, which is co-opted as a control channel in this
case. The line must have the following format:
reporter:counter:group,counter,amount
This snippet in Python shows how to increment the "Missing" counter in the
"Temperature" group by one:
sys.stderr.write("reporter:counter:Temperature,Missing,1\n")
In a similar way, a status message may be sent with a line formatted like this:
reporter:status:message
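Because the protocol is just text on standard error, it can be produced from any language. The helper below is a minimal sketch in Java; the StreamingReporter class and its method names are hypothetical, and it assumes that group and counter names contain no commas, since the line format uses commas as separators.

```java
/** Formats the control lines that a Streaming task writes to standard error. */
class StreamingReporter {

  /** Builds a line of the form reporter:counter:group,counter,amount. */
  static String counterLine(String group, String counter, long amount) {
    return "reporter:counter:" + group + "," + counter + "," + amount;
  }

  /** Builds a line of the form reporter:status:message. */
  static String statusLine(String message) {
    return "reporter:status:" + message;
  }

  public static void main(String[] args) {
    // Each control line must be terminated by a newline, which println() supplies.
    System.err.println(counterLine("Temperature", "Missing", 1));
    System.err.println(statusLine("processed one record"));
  }
}
```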
1. The built-in counters' enums are not currently a part of the public API, so this is the only way to retrieve
them. From release 0.21.0, counters are available via the JobCounter and TaskCounter enums in the
org.apache.hadoop.mapreduce package.
Sorting
The ability to sort data is at the heart of MapReduce. Even if your application isn't
concerned with sorting per se, it may be able to use the sorting stage that MapReduce
provides to organize its data. In this section, we will examine different ways of sorting
datasets and how you can control the sort order in MapReduce. Sorting Avro data is
covered separately in "Sorting using Avro MapReduce" on page 130.
Preparation
We are going to sort the weather dataset by temperature. Storing temperatures as
Text objects doesn't work for sorting purposes, since signed integers don't sort
lexicographically (see footnote 2). Instead, we are going to store the data using sequence
files whose IntWritable keys represent the temperature (and sort correctly), and whose
Text values are the lines of data.
The MapReduce job in Example 8-3 is a map-only job that also filters the input to
remove records that don't have a valid temperature reading. Each map creates a single
block-compressed sequence file as output. It is invoked with the following command:
% hadoop jar hadoop-examples.jar SortDataPreprocessor input/ncdc/all \
  input/ncdc/all-seq
Example 8-3. A MapReduce program for transforming the weather data into SequenceFile format
public class SortDataPreprocessor extends Configured implements Tool {

  static class CleanerMapper
    extends Mapper<LongWritable, Text, IntWritable, Text> {

    private NcdcRecordParser parser = new NcdcRecordParser();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {

      parser.parse(value);
      // keep only records with a valid temperature, keyed by that temperature
      if (parser.isValidTemperature()) {
        context.write(new IntWritable(parser.getAirTemperature()), value);
      }
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }
    job.setMapperClass(CleanerMapper.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    SequenceFileOutputFormat.setCompressOutput(job, true);
    SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job,
        CompressionType.BLOCK);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new SortDataPreprocessor(), args);
    System.exit(exitCode);
  }
}
2. One commonly used workaround for this problem, particularly in text-based Streaming applications,
is to add an offset to eliminate all negative numbers, and left pad with zeros, so all numbers are the same
number of characters. However, see "Streaming" on page 280 for another approach.
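The offset-and-pad workaround for text keys can be sketched in a few lines of Java. The OFFSET and WIDTH constants below are assumptions chosen for illustration (they cover temperatures from -10000 to 89999 tenths of a degree); nothing in the dataset format prescribes them, and the SortableTemperatureText class is hypothetical.

```java
import java.util.Arrays;
import java.util.Locale;

class SortableTemperatureText {

  static final int OFFSET = 10000; // assumed: shifts every valid temperature to >= 0
  static final int WIDTH = 5;      // assumed: fixed number of digits after padding

  /** Encodes a temperature so that String order matches numeric order. */
  static String encode(int temperature) {
    int shifted = temperature + OFFSET;
    if (shifted < 0 || shifted > 99999) {
      throw new IllegalArgumentException("out of range: " + temperature);
    }
    return String.format(Locale.ROOT, "%0" + WIDTH + "d", shifted);
  }

  static int decode(String text) {
    return Integer.parseInt(text) - OFFSET;
  }

  /** Sorts temperatures purely by comparing their encoded strings. */
  static int[] sortViaText(int[] temperatures) {
    String[] encoded = new String[temperatures.length];
    for (int i = 0; i < temperatures.length; i++) {
      encoded[i] = encode(temperatures[i]);
    }
    Arrays.sort(encoded); // lexicographic String order
    int[] result = new int[encoded.length];
    for (int i = 0; i < encoded.length; i++) {
      result[i] = decode(encoded[i]);
    }
    return result;
  }

  public static void main(String[] args) {
    String[] naive = {"-100", "22", "-11", "5", "100"};
    Arrays.sort(naive); // plain text sort puts "100" before "22" and "5"
    System.out.println("naive text sort: " + Arrays.toString(naive));
    System.out.println("offset and pad:  "
        + Arrays.toString(sortViaText(new int[] {-100, 22, -11, 5, 100})));
  }
}
```

Storing the key as an IntWritable, as this section does, avoids the need for any such encoding.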
Partial Sort
In "The Default MapReduce Job" on page 226, we saw that, by default, MapReduce
will sort input records by their keys. Example 8-4 is a variation for sorting sequence
files with IntWritable keys.
Example 8-4. A MapReduce program for sorting a SequenceFile with IntWritable keys using the
default HashPartitioner
public class SortByTemperatureUsingHashPartitioner extends Configured
  implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }

    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    SequenceFileOutputFormat.setCompressOutput(job, true);
    SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job,
        CompressionType.BLOCK);

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new SortByTemperatureUsingHashPartitioner(),
        args);
    System.exit(exitCode);
  }
}
Controlling Sort Order
The sort order for keys is controlled by a RawComparator, which is found as follows:
1. If the property mapred.output.key.comparator.class is set, either explicitly or by
calling setSortComparatorClass() on Job, then an instance of that class is used. (In
the old API the equivalent method is setOutputKeyComparatorClass() on JobConf.)
2. Otherwise, keys must be a subclass of WritableComparable, and the registered
comparator for the key class is used.
3. If there is no registered comparator, then a RawComparator is used that deserializes
the byte streams being compared into objects and delegates to the
WritableComparable's compareTo() method.
These rules reinforce why it's important to register optimized versions of
RawComparators for your own custom Writable classes (which is covered in "Implementing a
RawComparator for speed" on page 108), and also that it's straightforward to override
the sort order by setting your own comparator (we do this in "Secondary
Sort" on page 276).
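The difference between rule 3 (deserialize, then compare) and an optimized raw comparison can be shown without Hadoop. The sketch below assumes keys serialized as 4-byte big-endian integers, which is the layout DataOutput.writeInt() produces; the RawIntComparison class is hypothetical and is not Hadoop's RawComparator interface.

```java
import java.nio.ByteBuffer;

class RawIntComparison {

  /** Serializes an int as 4 big-endian bytes, the layout DataOutput.writeInt() produces. */
  static byte[] serialize(int value) {
    return ByteBuffer.allocate(4).putInt(value).array();
  }

  /** Fallback style: rebuild the values from the bytes, then compare them. */
  static int compareByDeserializing(byte[] a, byte[] b) {
    int x = ByteBuffer.wrap(a).getInt();
    int y = ByteBuffer.wrap(b).getInt();
    return Integer.compare(x, y);
  }

  /** Raw style: compare the serialized bytes in place. */
  static int compareRaw(byte[] a, byte[] b) {
    // The first byte carries the sign bit, so it is compared as a signed byte.
    if (a[0] != b[0]) {
      return Byte.compare(a[0], b[0]);
    }
    // The remaining bytes are compared as unsigned values, most significant first.
    for (int i = 1; i < 4; i++) {
      int x = a[i] & 0xff;
      int y = b[i] & 0xff;
      if (x != y) {
        return Integer.compare(x, y);
      }
    }
    return 0;
  }

  public static void main(String[] args) {
    byte[] cold = serialize(-100);
    byte[] warm = serialize(5);
    System.out.println(Integer.signum(compareByDeserializing(cold, warm)));
    System.out.println(Integer.signum(compareRaw(cold, warm)));
  }
}
```

Both methods agree on the ordering; the raw version simply never has to reconstruct a key, which is the saving a registered RawComparator is after when millions of keys are compared during the sort.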
Suppose we run this program using 30 reducers (see footnote 3):
% hadoop jar hadoop-examples.jar SortByTemperatureUsingHashPartitioner \
-D mapred.reduce.tasks=30 input/ncdc/all-seq output-hashsort
This command produces 30 output files, each of which is sorted. However, there is no
easy way to combine the files (by concatenation, for example, in the case of plain-text
files) to produce a globally sorted file. For many applications, this doesn't matter. For
example, having a partially sorted set of files is fine if you want to do lookups.
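Why concatenation fails can be seen with a handful of keys. The sketch below is plain Java, not a MapReduce job: it assigns integer keys to partitions with the same arithmetic the default HashPartitioner uses, (hashCode & Integer.MAX_VALUE) % numPartitions, sorts each partition, and then checks the concatenated result. The class name and the sample keys are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class HashPartitionSortSketch {

  /** Hash partitioning arithmetic; for an int key the hash code is the value itself. */
  static int partition(int key, int numPartitions) {
    return (Integer.hashCode(key) & Integer.MAX_VALUE) % numPartitions;
  }

  /** Distributes keys over partitions and sorts each partition independently. */
  static List<List<Integer>> sortWithinPartitions(int[] keys, int numPartitions) {
    List<List<Integer>> partitions = new ArrayList<>();
    for (int i = 0; i < numPartitions; i++) {
      partitions.add(new ArrayList<>());
    }
    for (int key : keys) {
      partitions.get(partition(key, numPartitions)).add(key);
    }
    for (List<Integer> p : partitions) {
      Collections.sort(p);
    }
    return partitions;
  }

  static List<Integer> concatenate(List<List<Integer>> partitions) {
    List<Integer> all = new ArrayList<>();
    for (List<Integer> p : partitions) {
      all.addAll(p);
    }
    return all;
  }

  static boolean isSorted(List<Integer> values) {
    for (int i = 1; i < values.size(); i++) {
      if (values.get(i - 1) > values.get(i)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    int[] temperatures = {-100, 22, -11, 5, 100, 0};
    List<List<Integer>> partitions = sortWithinPartitions(temperatures, 3);
    System.out.println("partitions:   " + partitions);
    System.out.println("concatenated: " + concatenate(partitions)
        + " sorted? " + isSorted(concatenate(partitions)));
  }
}
```

Every partition is internally ordered, but neighbouring keys land in unrelated partitions, so the files overlap in key range and cannot simply be appended to one another.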
An application: Partitioned MapFile lookups
To perform lookups by key, for instance, having multiple files works well. If we change
the output format to be a MapFileOutputFormat, as shown in Example 8-5, then the
output is 30 map files, which we can perform lookups against.
Example 8-5. A MapReduce program for sorting a SequenceFile and producing MapFiles as output
public class SortByTemperatureToMapFile extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }

    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputFormatClass(MapFileOutputFormat.class);
    SequenceFileOutputFormat.setCompressOutput(job, true);
    SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job,
        CompressionType.BLOCK);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new SortByTemperatureToMapFile(), args);
    System.exit(exitCode);
  }
}
3. See "Sorting and merging SequenceFiles" on page 138 for how to do the same thing using the sort program
example that comes with Hadoop.
MapFileOutputFormat provides a pair of convenience static methods for performing
lookups against MapReduce output; their use is shown in Example 8-6.
Example 8-6. Retrieve the first entry with a given key from a collection of MapFiles
public class LookupRecordByTemperature extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      JobBuilder.printUsage(this, "<path> <key>");
      return -1;
    }
    Path path = new Path(args[0]);
    IntWritable key = new IntWritable(Integer.parseInt(args[1]));

    Reader[] readers = MapFileOutputFormat.getReaders(path, getConf());
    Partitioner<IntWritable, Text> partitioner =
        new HashPartitioner<IntWritable, Text>();
    Text val = new Text();
    Writable entry =
        MapFileOutputFormat.getEntry(readers, partitioner, key, val);
    if (entry == null) {
      System.err.println("Key not found: " + key);
      return -1;
    }
    NcdcRecordParser parser = new NcdcRecordParser();
    parser.parse(val.toString());
    System.out.printf("%s\t%s\n", parser.getStationId(), parser.getYear());
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new LookupRecordByTemperature(), args);
    System.exit(exitCode);
  }
}
The getReaders() method opens a MapFile.Reader for each of the output files created
by the MapReduce job. The getEntry() method then uses the partitioner to choose the
reader for the key and finds the value for that key by calling Reader's get() method. If
getEntry() returns null, it means no matching key was found. Otherwise, it returns
the value, which we translate into a station ID and year.
To see this in action, let's find the first entry for a temperature of -10°C (remember
that temperatures are stored as integers representing tenths of a degree, which is why
we ask for a temperature of -100):
% hadoop jar hadoop-examples.jar LookupRecordByTemperature output-hashmapsort -100
357460-99999 1956
We can also use the readers directly, in order to get all the records for a given key. The
array of readers that is returned is ordered by partition, so that the reader for a given
key may be found using the same partitioner that was used in the MapReduce job:
Example 8-7. Retrieve all entries with a given key from a collection of MapFiles
public class LookupRecordsByTemperature extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      JobBuilder.printUsage(this, "<path> <key>");
      return -1;
    }
    Path path = new Path(args[0]);
    IntWritable key = new IntWritable(Integer.parseInt(args[1]));

    Reader[] readers = MapFileOutputFormat.getReaders(path, getConf());
    Partitioner<IntWritable, Text> partitioner =
        new HashPartitioner<IntWritable, Text>();
    Text val = new Text();

    // pick the reader for the partition that the key was written to
    Reader reader = readers[partitioner.getPartition(key, val, readers.length)];
    Writable entry = reader.get(key, val);
    if (entry == null) {
      System.err.println("Key not found: " + key);
      return -1;
    }
    NcdcRecordParser parser = new NcdcRecordParser();
    IntWritable nextKey = new IntWritable();
    // keep reading while subsequent entries have the same key
    do {
      parser.parse(val.toString());
      System.out.printf("%s\t%s\n", parser.getStationId(), parser.getYear());
    } while(reader.next(nextKey, val) && key.equals(nextKey));
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new LookupRecordsByTemperature(), args);
    System.exit(exitCode);
  }
}
And here is a sample run to retrieve all readings of -10°C and count them:
% hadoop jar hadoop-examples.jar LookupRecordsByTemperature output-hashmapsort -100 \
2> /dev/null | wc -l
1489272
Total Sort
How can you produce a globally sorted file using Hadoop? The naive answer is to use
a single partition (see footnote 4). But this is incredibly inefficient for large files, since
one machine has to process all of the output, so you are throwing away the benefits of
the parallel architecture that MapReduce provides.
Instead, it is possible to produce a set of sorted files that, if concatenated, would form
a globally sorted file. The secret to doing this is to use a partitioner that respects the
total order of the output. For example, if we had four partitions, we could put keys for
temperatures less than -10°C in the first partition, those between -10°C and 0°C in the
second, those between 0°C and 10°C in the third, and those over 10°C in the fourth.
Although this approach works, you have to choose your partition sizes carefully to
ensure that they are fairly even so that job times aren't dominated by a single reducer.
For the partitioning scheme just described, the relative sizes of the partitions are as
follows:
Temperature range     | < -10°C | [-10°C, 0°C) | [0°C, 10°C) | >= 10°C
Proportion of records | 11%     | 13%          | 17%         | 59%
These partitions are not very even. To construct more even partitions, we need to have
a better understanding of the temperature distribution for the whole dataset. It's fairly
easy to write a MapReduce job to count the number of records that fall into a collection
of temperature buckets. For example, Figure 8-1 shows the distribution for buckets of
size 1°C, where each point on the plot corresponds to one bucket.
4. A better answer is to use Pig ("Sorting Data" on page 405) or Hive ("Sorting and
Aggregating" on page 441), both of which can sort with a single command.
Figure 8-1. Temperature distribution for the weather dataset
While we could use this information to construct a very even set of partitions, the fact
that we needed to run a job that used the entire dataset to construct them is not ideal.
It's possible to get a fairly even set of partitions by sampling the key space. The idea
behind sampling is that you look at a small subset of the keys to approximate the key
distribution, which is then used to construct partitions. Luckily, we don't have to write
the code to do this ourselves, as Hadoop comes with a selection of samplers.
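To make the idea concrete, here is a rough sketch of how a small sample of keys can be turned into partition boundaries: sort the sample and read off evenly spaced split points. This is a deliberate simplification in plain Java, not the algorithm Hadoop's samplers use. The class name SampleBoundariesSketch and the sample values are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration of sampling-based partitioning.
class SampleBoundariesSketch {

  // Returns numPartitions - 1 boundaries taken at even intervals through the
  // sorted sample. Assumes the sample holds at least numPartitions keys.
  static int[] splitPoints(int[] sample, int numPartitions) {
    int[] sorted = sample.clone();
    Arrays.sort(sorted);
    int[] boundaries = new int[numPartitions - 1];
    for (int i = 1; i < numPartitions; i++) {
      boundaries[i - 1] = sorted[i * sorted.length / numPartitions];
    }
    return boundaries;
  }

  public static void main(String[] args) {
    // A made-up sample of temperature keys (tenths of a degree Celsius).
    int[] sample = {211, -56, 139, 0, 250, 78, -120, 180};
    System.out.println(Arrays.toString(splitPoints(sample, 4)));
  }
}
```

The larger and more representative the sample, the closer the resulting partitions come to equal size.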
The InputSampler class defines a nested Sampler interface whose implementations
return a sample of keys given an InputFormat and Job:
public interface Sampler<K, V> {
  K[] getSample(InputFormat<K, V> inf, Job job)
      throws IOException, InterruptedException;
}
This interface is not usually called directly by clients. Instead, the writePartitionFile()
static method on InputSampler is used, which creates a sequence file to store the
keys that define the partitions:
public static <K, V> void writePartitionFile(Job job, Sampler<K, V> sampler)
    throws IOException, ClassNotFoundException, InterruptedException
The sequence file is used by TotalOrderPartitioner to create partitions for the sort job.
Example 8-8 puts it all together.
Example 8-8. A MapReduce program for sorting a SequenceFile with IntWritable keys using the
TotalOrderPartitioner to globally sort the data
public class SortByTemperatureUsingTotalOrderPartitioner extends Configured
    implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }

    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    SequenceFileOutputFormat.setCompressOutput(job, true);
    SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job,
        CompressionType.BLOCK);

    job.setPartitionerClass(TotalOrderPartitioner.class);

    // Sample keys with probability 0.1, up to 10,000 samples from at most 10 splits
    InputSampler.Sampler<IntWritable, Text> sampler =
        new InputSampler.RandomSampler<IntWritable, Text>(0.1, 10000, 10);

    Path input = FileInputFormat.getInputPaths(job)[0];
    input = input.makeQualified(input.getFileSystem(getConf()));

    // Write the partition boundaries to a file in the input directory
    Path partitionFile = new Path(input, "_partitions");
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
        partitionFile);
    InputSampler.writePartitionFile(job, sampler);

    // Add to DistributedCache
    URI partitionUri = new URI(partitionFile.toString() + "#_partitions");
    job.addCacheFile(partitionUri);
    job.createSymlink();

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(
        new SortByTemperatureUsingTotalOrderPartitioner(), args);
    System.exit(exitCode);
  }
}
We use a RandomSampler, which chooses keys with a uniform probability (here, 0.1).
There are also parameters for the maximum number of samples to take and the
maximum number of splits to sample (here, 10,000 and 10, respectively; these settings are
the defaults when InputSampler is run as an application), and the sampler stops when
the first of these limits is met. Samplers run on the client, making it important to limit
the number of splits that are downloaded, so the sampler runs quickly. In practice, the
time taken to run the sampler is a small fraction of the overall job time.
The partition file that InputSampler writes is called _partitions, which we have set to be
in the input directory (it will not be picked up as an input file since it starts with an
underscore). To share the partition file with the tasks running on the cluster, we add
it to the distributed cache (see "Distributed Cache" on page 288).
On one run, the sampler chose 5.6°C, 13.9°C, and 22.0°C as partition boundaries (for
four partitions), which translates into more even partition sizes than the earlier choice
of partitions:
Temperature range      < 5.6°C   [5.6°C, 13.9°C)   [13.9°C, 22.0°C)   >= 22.0°C
Proportion of records  29%       24%               23%                24%
Your input data determines the best sampler for you to use. For example, SplitSampler,
which samples only the first n records in a split, is not so good for sorted data[5]
because it doesn't select keys from throughout the split.
On the other hand, IntervalSampler chooses keys at regular intervals through the split
and makes a better choice for sorted data. RandomSampler is a good general-purpose
sampler. If none of these suits your application (and remember that the point of
sampling is to produce partitions that are approximately equal in size), you can write your
own implementation of the Sampler interface.
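The difference between taking the first n records and taking records at regular intervals is easy to see on a split whose keys are already sorted. The sketch below is plain Java written only in the spirit of SplitSampler and IntervalSampler. The class and method names are hypothetical, and the real samplers differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of two sampling strategies over one split.
class SamplerComparisonSketch {

  // Takes the first n keys of the split.
  static List<Integer> firstN(List<Integer> keys, int n) {
    return new ArrayList<>(keys.subList(0, Math.min(n, keys.size())));
  }

  // Takes every step-th key, starting with the first.
  static List<Integer> everyNth(List<Integer> keys, int step) {
    List<Integer> sample = new ArrayList<>();
    for (int i = 0; i < keys.size(); i += step) {
      sample.add(keys.get(i));
    }
    return sample;
  }

  public static void main(String[] args) {
    // A split whose keys are already sorted: 0, 1, ..., 99.
    List<Integer> sortedSplit = new ArrayList<>();
    for (int k = 0; k < 100; k++) {
      sortedSplit.add(k);
    }
    System.out.println("first 5:    " + firstN(sortedSplit, 5));
    System.out.println("every 20th: " + everyNth(sortedSplit, 20));
  }
}
```

On sorted input the first strategy sees only the smallest keys, so any boundaries derived from it would be badly skewed, whereas the interval strategy covers the whole key range.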
One of the nice properties of InputSampler and TotalOrderPartitioner is that you are
free to choose the number of partitions. This choice is normally driven by the number
of reducer slots in your cluster (choose a number slightly fewer than the total, to allow
for failures). However, TotalOrderPartitioner will work only if the partition
boundaries are distinct: one problem with choosing a high number is that you may get
collisions if you have a small key space.
Here's how we run it:
% hadoop jar hadoop-examples.jar SortByTemperatureUsingTotalOrderPartitioner \
-D mapred.reduce.tasks=30 input/ncdc/all-seq output-totalsort
The program produces 30 output partitions, each of which is internally sorted; in
addition, for these partitions, all the keys in partition i are less than the keys in partition
i + 1.
5. In some applications, it's common for some of the input to already be sorted, or at least partially sorted.
For example, the weather dataset is ordered by time, which may introduce certain biases, making the
RandomSampler a safer choice.
Secondary Sort
The MapReduce framework sorts the records by key before they reach the reducers.
For any particular key, however, the values are not sorted. The order that the values
appear is not even stable from one run to the next, since they come from different map
tasks, which may finish at different times from run to run. Generally speaking, most
MapReduce programs are written so as not to depend on the order that the values
appear to the reduce function. However, it is possible to impose an order on the values
by sorting and grouping the keys in a particular way.
To illustrate the idea, consider the MapReduce program for calculating the maximum
temperature for each year. If we arranged for the values (temperatures) to be sorted in
descending order, we wouldn't have to iterate through them to find the maximum;
we could take the first for each year and ignore the rest. (This approach isn't the most
efficient way to solve this particular problem, but it illustrates how secondary sort works
in general.)
To achieve this, we change our keys to be composite: a combination of year and
temperature. We want the sort order for keys to be by year (ascending) and then by
temperature (descending):
1900 35°C
1900 34°C
1900 34°C
...
1901 36°C
1901 35°C
If all we did was change the key, then this wouldn't help since now records for the same
year would not (in general) go to the same reducer since they have different keys. For
example, (1900, 35°C) and (1900, 34°C) could go to different reducers. By setting a
partitioner to partition by the year part of the key, we can guarantee that records for
the same year go to the same reducer. This still isn't enough to achieve our goal,
however. A partitioner ensures only that one reducer receives all the records for a year;
it doesn't change the fact that the reducer groups by key within the partition.
The final piece of the puzzle is the setting to control the grouping. If we group values
in the reducer by the year part of the key, then we will see all the records for the same
year in one reduce group. And since they are sorted by temperature in descending order,
the first is the maximum temperature.
To summarize, there is a recipe here to get the effect of sorting by value:
• Make the key a composite of the natural key and the natural value.
• The sort comparator should order by the composite key, that is, the natural key
  and natural value.
• The partitioner and grouping comparator for the composite key should consider
  only the natural key for partitioning and grouping.
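The recipe above can be tried out without a cluster. The following plain-Java sketch simulates what a single reducer would see: keys are sorted by the full composite key, grouped by year only, and the first key of each group is kept. All names here (SecondarySortSketch, YearTemp, and so on) are hypothetical stand-ins for illustration and are not the Hadoop classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-process simulation of the secondary sort recipe.
class SecondarySortSketch {

  // Composite key: natural key (year) plus natural value (temperature).
  static final class YearTemp {
    final int year;
    final int temp;
    YearTemp(int year, int temp) {
      this.year = year;
      this.temp = temp;
    }
  }

  // Sort comparator: year ascending, then temperature descending.
  static int compareKeys(YearTemp a, YearTemp b) {
    int cmp = Integer.compare(a.year, b.year);
    return cmp != 0 ? cmp : Integer.compare(b.temp, a.temp);
  }

  // Partitioner: looks at the natural key only.
  static int partition(YearTemp key, int numPartitions) {
    return (Integer.hashCode(key.year) & Integer.MAX_VALUE) % numPartitions;
  }

  // Sorts by composite key, groups by year, keeps the first key per group.
  static Map<Integer, Integer> maxByYear(List<YearTemp> keys) {
    List<YearTemp> sorted = new ArrayList<>(keys);
    sorted.sort(SecondarySortSketch::compareKeys);
    Map<Integer, Integer> firstPerGroup = new LinkedHashMap<>();
    for (YearTemp key : sorted) {
      firstPerGroup.putIfAbsent(key.year, key.temp); // grouping ignores temp
    }
    return firstPerGroup;
  }

  public static void main(String[] args) {
    List<YearTemp> keys = Arrays.asList(
        new YearTemp(1900, 34), new YearTemp(1901, 35),
        new YearTemp(1900, 35), new YearTemp(1901, 36),
        new YearTemp(1900, 34));
    System.out.println(maxByYear(keys));
  }
}
```

Because the partition function ignores the temperature, two keys for the same year always land in the same partition, which is what lets the grouping step see them together.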
Java code
Putting this all together results in the code in Example 8-9. This program uses the
plain-text input again.
Example 8-9. Application to find the maximum temperature by sorting temperatures in the key
public class MaxTemperatureUsingSecondarySort
    extends Configured implements Tool {

  static class MaxTemperatureMapper
      extends Mapper<LongWritable, Text, IntPair, NullWritable> {

    private NcdcRecordParser parser = new NcdcRecordParser();

    @Override
    protected void map(LongWritable key, Text value,
        Context context) throws IOException, InterruptedException {

      parser.parse(value);
      if (parser.isValidTemperature()) {
        context.write(new IntPair(parser.getYearInt(),
            parser.getAirTemperature()), NullWritable.get());
      }
    }
  }

  static class MaxTemperatureReducer
      extends Reducer<IntPair, NullWritable, IntPair, NullWritable> {

    @Override
    protected void reduce(IntPair key, Iterable<NullWritable> values,
        Context context) throws IOException, InterruptedException {

      context.write(key, NullWritable.get());
    }
  }

  public static class FirstPartitioner
      extends Partitioner<IntPair, NullWritable> {

    @Override
    public int getPartition(IntPair key, NullWritable value, int numPartitions) {
      // multiply by 127 to perform some mixing
      return Math.abs(key.getFirst() * 127) % numPartitions;
    }
  }

  public static class KeyComparator extends WritableComparator {
    protected KeyComparator() {
      super(IntPair.class, true);
    }
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
      IntPair ip1 = (IntPair) w1;
      IntPair ip2 = (IntPair) w2;
      int cmp = IntPair.compare(ip1.getFirst(), ip2.getFirst());
      if (cmp != 0) {
        return cmp;
      }
      return -IntPair.compare(ip1.getSecond(), ip2.getSecond()); //reverse
    }
  }

  public static class GroupComparator extends WritableComparator {
    protected GroupComparator() {
      super(IntPair.class, true);
    }
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
      IntPair ip1 = (IntPair) w1;
      IntPair ip2 = (IntPair) w2;
      return IntPair.compare(ip1.getFirst(), ip2.getFirst());
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setPartitionerClass(FirstPartitioner.class);
    job.setSortComparatorClass(KeyComparator.class);
    job.setGroupingComparatorClass(GroupComparator.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(IntPair.class);
    job.setOutputValueClass(NullWritable.class);

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MaxTemperatureUsingSecondarySort(), args);
    System.exit(exitCode);
  }
}
In the mapper, we create a key representing the year and temperature, using an IntPair
Writable implementation. (IntPair is like the TextPair class we developed in
"Implementing a Custom Writable" on page 105.) We don't need to carry any information in
the value, since we can get the first (maximum) temperature in the reducer from the
key, so we use a NullWritable. The reducer emits the first key, which due to the
secondary sorting, is an IntPair for the year and its maximum temperature. IntPair's
toString() method creates a tab-separated string, so the output is a set of tab-separated
year-temperature pairs.
Many applications need to access all the sorted values, not just the first
value as we have provided here. To do this, you need to populate the
value fields since in the reducer you can retrieve only the first key. This
necessitates some unavoidable duplication of information between key
and value.
We set the partitioner to partition by the first field of the key (the year), using a custom
partitioner called FirstPartitioner. To sort keys by year (ascending) and temperature
(descending), we use a custom sort comparator, using setSortComparatorClass(), that
extracts the fields and performs the appropriate comparisons. Similarly, to group keys
by year, we set a custom comparator, using setGroupingComparatorClass(), to extract
the first field of the key for comparison.[6]
Running this program gives the maximum temperatures for each year:
% hadoop jar hadoop-examples.jar MaxTemperatureUsingSecondarySort input/ncdc/all \
> output-secondarysort
% hadoop fs -cat output-secondarysort/part-* | sort | head
1901 317
1902 244
1903 289
1904 256
1905 283
1906 294
1907 283
1908 289
1909 278
1910 294
6. For simplicity, these custom comparators as shown are not optimized; see "Implementing a
RawComparator for speed" on page 108 for the steps we would need to take to make them faster.
Streaming
To do a secondary sort in Streaming, we can take advantage of a couple of library classes
that Hadoop provides. Here's the driver that we can use to do a secondary sort:
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-D stream.num.map.output.key.fields=2 \
-D mapred.text.key.partitioner.options=-k1,1 \
-D mapred.output.key.comparator.class=\
org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D mapred.text.key.comparator.options="-k1n -k2nr" \
-input input/ncdc/all \
-output output_secondarysort_streaming \
-mapper ch08/src/main/python/secondary_sort_map.py \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-reducer ch08/src/main/python/secondary_sort_reduce.py \
-file ch08/src/main/python/secondary_sort_map.py \
-file ch08/src/main/python/secondary_sort_reduce.py

Our map function (Example 8-10) emits records with year and temperature fields. We
want to treat the combination of both of these fields as the key, so we set
stream.num.map.output.key.fields to 2. This means that values will be empty, just like
in the Java case.
Example 8-10. Map function for secondary sort in Python
#!/usr/bin/env python

import re
import sys

for line in sys.stdin:
    val = line.strip()
    # Fixed-width fields: year, air temperature, and quality code
    (year, temp, q) = (val[15:19], int(val[87:92]), val[92:93])
    if temp == 9999:
        # Missing reading: increment a counter instead of emitting a record
        sys.stderr.write("reporter:counter:Temperature,Missing,1\n")
    elif re.match("[01459]", q):
        print("%s\t%s" % (year, temp))
However, we don't want to partition by the entire key, so we use the
KeyFieldBasedPartitioner partitioner, which allows us to partition by a part of the key. The
specification mapred.text.key.partitioner.options configures the partitioner. The value
-k1,1 instructs the partitioner to use only the first field of the key, where fields are
assumed to be separated by a string defined by the map.output.key.field.separator
property (a tab character by default).
Next, we want a comparator that sorts the year field in ascending order and the
temperature field in descending order, so that the reduce function can simply return the
first record in each group. Hadoop provides KeyFieldBasedComparator, which is ideal
for this purpose. The comparison order is defined by a specification that is like the one
used for GNU sort. It is set using the mapred.text.key.comparator.options property.
The value -k1n -k2nr used in this example means sort by the first field in numerical
order, then by the second field in reverse numerical order. Like its partitioner cousin,
KeyFieldBasedPartitioner, it uses the separator defined by the
map.output.key.field.separator to split a key into fields.
In the Java version, we had to set the grouping comparator; however, in Streaming,
groups are not demarcated in any way, so in the reduce function we have to detect the
group boundaries ourselves by looking for when the year changes (Example 8-11).
Example 8-11. Reducer function for secondary sort in Python
#!/usr/bin/env python

import sys

last_group = None
for line in sys.stdin:
    val = line.strip()
    (year, temp) = val.split("\t")
    group = year
    # Emit only the first record of each year (the maximum, given the sort order)
    if last_group != group:
        print(val)
        last_group = group
When we run the streaming program, we get the same output as the Java version.
Finally, note that KeyFieldBasedPartitioner and KeyFieldBasedComparator are not
confined to use in Streaming programs; they are applicable to Java MapReduce programs,
too.
Joins
MapReduce can perform joins between large datasets, but writing the code to do joins
from scratch is fairly involved. Rather than writing MapReduce programs, you might
consider using a higher-level framework such as Pig, Hive, or Cascading, in which join
operations are a core part of the implementation.
Let's briefly consider the problem we are trying to solve. We have two datasets (for
example, the weather stations database and the weather records), and we want to
reconcile the two. For example, we want to see each station's history, with the station's
metadata inlined in each output row. This is illustrated in Figure 8-2.
How we implement the join depends on how large the datasets are and how they are
partitioned. If one dataset is large (the weather records) but the other one is small
enough to be distributed to each node in the cluster (as the station metadata is), then
the join can be effected by a MapReduce job that brings the records for each station
together (a partial sort on station ID, for example). The mapper or reducer uses the
smaller dataset to look up the station metadata for a station ID, so it can be written out
with each record. See "Side Data Distribution" on page 287 for a discussion of this
approach, where we focus on the mechanics of distributing the data to tasktrackers.
If the join is performed by the mapper, it is called a map-side join, whereas if it is
performed by the reducer it is called a reduce-side join.
If both datasets are too large for either to be copied to each node in the cluster, then
we can still join them using MapReduce with a map-side or reduce-side join, depending
on how the data is structured. One common example of this case is a user database and
a log of some user activity (such as access logs). For a popular service, it is not feasible
to distribute the user database (or the logs) to all the MapReduce nodes.
Map-Side Joins
A map-side join between large inputs works by performing the join before the data
reaches the map function. For this to work, though, the inputs to each map must be
partitioned and sorted in a particular way. Each input dataset must be divided into the
same number of partitions, and it must be sorted by the same key (the join key) in each
source. All the records for a particular key must reside in the same partition. This may
sound like a strict requirement (and it is), but it actually fits the description of the output
of a MapReduce job.
Figure 8-2. Inner join of two datasets
A map-side join can be used to join the outputs of several jobs that had the same number
of reducers, the same keys, and output files that are not splittable (by being smaller
than an HDFS block, or by virtue of being gzip compressed, for example). In the context
of the weather example, if we ran a partial sort on the stations file by station ID, and
another, identical sort on the records, again by station ID, and with the same number
of reducers, then the two outputs would satisfy the conditions for running a map-side
join.
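Why matching partitions and a common sort order matter can be seen in a small merge join over two inputs that are already sorted by the join key: a single forward pass over each side is enough. The sketch below is plain Java and is not how Hadoop implements map-side joins. The class name MergeJoinSketch is hypothetical, the readings are made up, and it assumes at most one station row per ID.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: inner join of two partitions sorted by station ID.
class MergeJoinSketch {

  // Each row is {stationId, payload}. Both arrays must be sorted by stationId.
  static List<String> join(String[][] stations, String[][] records) {
    List<String> out = new ArrayList<>();
    int s = 0;
    int r = 0;
    while (s < stations.length && r < records.length) {
      int cmp = stations[s][0].compareTo(records[r][0]);
      if (cmp < 0) {
        s++; // no more records for this station
      } else if (cmp > 0) {
        r++; // record has no matching station
      } else {
        out.add(records[r][0] + "\t" + stations[s][1] + "\t" + records[r][1]);
        r++; // stay on this station, since more records may match
      }
    }
    return out;
  }

  public static void main(String[] args) {
    String[][] stations = {
        {"011990-99999", "SIHCCAJAVRI"},
        {"012650-99999", "TYNSET-HANSMOEN"}};
    String[][] records = {
        {"011990-99999", "0"}, {"011990-99999", "22"},
        {"012650-99999", "111"}, {"012650-99999", "78"}};
    for (String row : join(stations, records)) {
      System.out.println(row);
    }
  }
}
```

If the two inputs were partitioned differently, records for a station could sit in a partition that the matching station row never reaches, which is why the number of partitions must be the same on both sides.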
Use a CompositeInputFormat from the org.apache.hadoop.mapreduce.join package to
run a map-side join. The input sources and join type (inner or outer) for
CompositeInputFormat are configured through a join expression that is written according to a simple
grammar. The package documentation has details and examples.
The org.apache.hadoop.examples.Join example is a general-purpose command-line
program for running a map-side join, since it allows you to run a MapReduce job for
any specified mapper and reducer over multiple inputs that are joined with a given join
operation.
Reduce-Side Joins
A reduce-side join is more general than a map-side join, in that the input datasets don't
have to be structured in any particular way, but it is less efficient as both datasets have
to go through the MapReduce shuffle. The basic idea is that the mapper tags each record
with its source and uses the join key as the map output key, so that the records with
the same key are brought together in the reducer. We use several ingredients to make
this work in practice:
Multiple inputs
    The input sources for the datasets have different formats, in general, so it is very
    convenient to use the MultipleInputs class (see "Multiple Inputs" on page 248) to
    separate the logic for parsing and tagging each source.
Secondary sort
    As described, the reducer will see the records from both sources that have the same
    key, but they are not guaranteed to be in any particular order. However, to perform
    the join, it is important to have the data from one source before another. For the
    weather data join, the station record must be the first of the values seen for each
    key, so the reducer can fill in the weather records with the station name and emit
    them straightaway. Of course, it would be possible to receive the records in any
    order if we buffered them in memory, but this should be avoided, since the number
    of records in any group may be very large and exceed the amount of memory
    available to the reducer.[7]
    We saw in "Secondary Sort" on page 276 how to impose an order on the values
    for each key that the reducers see, so we use this technique here.
To tag each record, we use TextPair from Chapter 4 for the keys, to store the station
ID, and the tag. The only requirement for the tag values is that they sort in such a way
that the station records come before the weather records. This can be achieved by
tagging station records as 0 and weather records as 1. The mapper classes to do this are
shown in Examples 8-12 and 8-13.
Example 8-12. Mapper for tagging station records for a reduce-side join
public class JoinStationMapper
    extends Mapper<LongWritable, Text, TextPair, Text> {
  private NcdcStationMetadataParser parser = new NcdcStationMetadataParser();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (parser.parse(value)) {
      context.write(new TextPair(parser.getStationId(), "0"),
          new Text(parser.getStationName()));
    }
  }
}
7. The data_join package in the contrib directory implements reduce-side joins by buffering records in
memory, so it suffers from this limitation.
Example 8-13. Mapper for tagging weather records for a reduce-side join
public class JoinRecordMapper
    extends Mapper<LongWritable, Text, TextPair, Text> {
  private NcdcRecordParser parser = new NcdcRecordParser();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    parser.parse(value);
    context.write(new TextPair(parser.getStationId(), "1"), value);
  }
}
The reducer knows that it will receive the station record first, so it extracts its name
from the value and writes it out as a part of every output record (Example 8-14).
Example 8-14. Reducer for joining tagged station records with tagged weather records
public class JoinReducer extends Reducer<TextPair, Text, Text, Text> {
  @Override
  protected void reduce(TextPair key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    Iterator<Text> iter = values.iterator();
    Text stationName = new Text(iter.next());
    while (iter.hasNext()) {
      Text record = iter.next();
      Text outValue = new Text(stationName.toString() + "\t" + record.toString());
      context.write(key.getFirst(), outValue);
    }
  }
}
The code assumes that every station ID in the weather records has exactly one matching
record in the station dataset. If this were not the case, we would need to generalize the
code to put the tag into the value objects, by using another TextPair. The reduce()
method would then be able to tell which entries were station names and detect (and
handle) missing or duplicate entries, before processing the weather records.
Because objects in the reducer's values iterator are reused (for efficiency
purposes), it is vital that the code makes a copy of the first Text object
from the values iterator:
    Text stationName = new Text(iter.next());
If the copy is not made, then the stationName reference will refer to the
value just read when it is turned into a string, which is a bug.
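The effect of object reuse can be reproduced outside Hadoop. In the sketch below, a hypothetical iterator hands back the same mutable StringBuilder on every call, standing in for the reused Text objects. It is an analogy in plain Java, not Hadoop's actual iterator, and the class and method names are made up.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical illustration of why a reused value object must be copied.
class ReusedObjectSketch {

  // Returns an iterator that refills one shared object for each element.
  static Iterator<StringBuilder> reusingIterator(final String... items) {
    final StringBuilder shared = new StringBuilder();
    return new Iterator<StringBuilder>() {
      private int next = 0;

      @Override
      public boolean hasNext() {
        return next < items.length;
      }

      @Override
      public StringBuilder next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        shared.setLength(0);
        shared.append(items[next++]);
        return shared; // same object every time, new contents
      }
    };
  }

  // Keeps only a reference to the first element, then reads the rest.
  static String firstWithoutCopy(Iterator<StringBuilder> iter) {
    StringBuilder first = iter.next();
    while (iter.hasNext()) {
      iter.next();
    }
    return first.toString();
  }

  // Copies the first element before reading the rest.
  static String firstWithCopy(Iterator<StringBuilder> iter) {
    StringBuilder first = new StringBuilder(iter.next());
    while (iter.hasNext()) {
      iter.next();
    }
    return first.toString();
  }

  public static void main(String[] args) {
    System.out.println(firstWithoutCopy(reusingIterator("station", "rec1", "rec2")));
    System.out.println(firstWithCopy(reusingIterator("station", "rec1", "rec2")));
  }
}
```

Without the copy, the variable that was meant to hold the station name ends up showing whichever value was read last.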
Tying the job together is the driver class, shown in Example 8-15. The essential point
is that we partition and group on the first part of the key, the station ID, which we do
with a custom Partitioner (KeyPartitioner) and a custom group comparator,
FirstComparator (from TextPair).
Example 8-15. Application to join weather records with station names
public class JoinRecordWithStationName extends Configured implements Tool {

  public static class KeyPartitioner extends Partitioner<TextPair, Text> {
    @Override
    public int getPartition(TextPair key, Text value, int numPartitions) {
      return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 3) {
      JobBuilder.printUsage(this, "<ncdc input> <station input> <output>");
      return -1;
    }

    Job job = new Job(getConf(), "Join weather records with station names");
    job.setJarByClass(getClass());

    Path ncdcInputPath = new Path(args[0]);
    Path stationInputPath = new Path(args[1]);
    Path outputPath = new Path(args[2]);

    MultipleInputs.addInputPath(job, ncdcInputPath,
        TextInputFormat.class, JoinRecordMapper.class);
    MultipleInputs.addInputPath(job, stationInputPath,
        TextInputFormat.class, JoinStationMapper.class);
    FileOutputFormat.setOutputPath(job, outputPath);

    job.setPartitionerClass(KeyPartitioner.class);
    job.setGroupingComparatorClass(TextPair.FirstComparator.class);

    job.setMapOutputKeyClass(TextPair.class);

    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new JoinRecordWithStationName(), args);
    System.exit(exitCode);
  }
}
Running the program on the sample data yields the following output:
011990-99999 SIHCCAJAVRI 0067011990999991950051507004+68750...
011990-99999 SIHCCAJAVRI 0043011990999991950051512004+68750...
011990-99999 SIHCCAJAVRI 0043011990999991950051518004+68750...
012650-99999 TYNSET-HANSMOEN 0043012650999991949032412004+62300...
012650-99999 TYNSET-HANSMOEN 0043012650999991949032418004+62300...
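A detail of KeyPartitioner in Example 8-15 is worth a closer look: it masks the hash code with Integer.MAX_VALUE before taking the remainder, which clears the sign bit and so keeps the partition index non-negative. The sketch below contrasts that expression with a Math.abs variant, shown only for comparison. The class and method names are hypothetical.

```java
// Hypothetical illustration of keeping a partition index non-negative.
class NonNegativeHashSketch {

  // The expression used by KeyPartitioner: clear the sign bit, then take
  // the remainder.
  static int maskedPartition(int hash, int numPartitions) {
    return (hash & Integer.MAX_VALUE) % numPartitions;
  }

  // For contrast only: Math.abs(Integer.MIN_VALUE) is still negative, so
  // this variant can return a negative index for that one hash value.
  static int absPartition(int hash, int numPartitions) {
    return Math.abs(hash) % numPartitions;
  }

  public static void main(String[] args) {
    int[] hashes = {42, -42, Integer.MAX_VALUE, Integer.MIN_VALUE};
    for (int hash : hashes) {
      System.out.println(hash + ": masked=" + maskedPartition(hash, 30)
          + " abs=" + absPartition(hash, 30));
    }
  }
}
```

The two variants can assign the same negative hash to different partitions, which is harmless as long as a job uses one of them consistently.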
Side Data Distribution
Side data can be defined as extra read-only data needed by a job to process the main
dataset. The challenge is to make side data available to all the map or reduce tasks
(which are spread across the cluster) in a convenient and efficient fashion.
In addition to the distribution mechanisms described in this section, it is possible to
cache side-data in memory in a static field, so that tasks of the same job that run in
succession on the same tasktracker can share the data. "Task JVM Reuse"
on page 216 describes how to enable this feature. If you take this approach, be
aware of the amount of memory that you are using, as it might affect the memory needed
by the shuffle (see "Shuffle and Sort" on page 205).
Using the Job Configuration
You can set arbitrary key-value pairs in the job configuration using the various setter
methods on Configuration (or JobConf in the old MapReduce API). This is very useful
if you need to pass a small piece of metadata to your tasks.
In the task you can retrieve the data from the configuration returned by Context's
getConfiguration() method. (In the old API, it's a little more involved: override the
configure() method in the Mapper or Reducer and use a getter method on the JobConf
object passed in to retrieve the data. It's very common to store the data in an instance
field so it can be used in the map() or reduce() method.)
Usually, a primitive type is sufficient to encode your metadata, but for arbitrary objects
you can either handle the serialization yourself (if you have an existing mechanism for
turning objects to strings and back), or you can use Hadoop's Stringifier class.
DefaultStringifier uses Hadoop's serialization framework to serialize objects (see
"Serialization" on page 94).
You shouldn't use this mechanism for transferring more than a few kilobytes of data
because it can put pressure on the memory usage in the Hadoop daemons, particularly
in a system running hundreds of jobs. The job configuration is read by the jobtracker,
the tasktracker, and the child JVM, and each time the configuration is read, all of its
entries are read into memory, even if they are not used. User properties are not used
by the jobtracker or the tasktracker, so they just waste time and memory.
Distributed Cache
Rather than serializing side data in the job configuration, it is preferable to distribute
datasets using Hadoop's distributed cache mechanism. This provides a service for
copying files and archives to the task nodes in time for the tasks to use them when they
run. To save network bandwidth, files are normally copied to any particular node once
per job.
Usage
For tools that use GenericOptionsParser (this includes many of the programs in this
book; see "GenericOptionsParser, Tool, and ToolRunner" on page 151), you can
specify the files to be distributed as a comma-separated list of URIs as the argument to
the -files option. Files can be on the local filesystem, on HDFS, or on another
Hadoop-readable filesystem (such as S3). If no scheme is supplied, then the files are assumed to
be local. (This is true even if the default filesystem is not the local filesystem.)
You can also copy archive files (JAR files, ZIP files, tar files, and gzipped tar files) to
your tasks, using the -archives option; these are unarchived on the task node. The
-libjars option will add JAR files to the classpath of the mapper and reducer tasks.
This is useful if you haven't bundled library JAR files in your job JAR file.
Streaming doesn't use the distributed cache for copying the streaming
scripts across the cluster. You specify a file to be copied using the
-file option (note the singular), which should be repeated for each file
to be copied. Furthermore, files specified using the -file option must
be file paths only, not URIs, so they must be accessible from the local
filesystem of the client launching the Streaming job.
Streaming also accepts the -files and -archives options for copying
files into the distributed cache for use by your Streaming scripts.
Let's see how to use the distributed cache to share a metadata file for station names.
The command we will run is:
% hadoop jar hadoop-examples.jar MaxTemperatureByStationNameUsingDistributedCacheFile \
-files input/ncdc/metadata/stations-fixed-width.txt input/ncdc/all output
This command will copy the local file stations-fixed-width.txt (no scheme is supplied,
so the path is automatically interpreted as a local file) to the task nodes, so we can use
it to look up station names. The listing for
MaxTemperatureByStationNameUsingDistributedCacheFile appears in Example 8-16.
Example 8-16. Application to find the maximum temperature by station, showing station names from
a lookup table passed as a distributed cache file
public class MaxTemperatureByStationNameUsingDistributedCacheFile
  extends Configured implements Tool {

  static class StationTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

    private NcdcRecordParser parser = new NcdcRecordParser();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {

      parser.parse(value);
      if (parser.isValidTemperature()) {
        context.write(new Text(parser.getStationId()),
            new IntWritable(parser.getAirTemperature()));
      }
    }
  }

  static class MaxTemperatureReducerWithStationLookup
    extends Reducer<Text, IntWritable, Text, IntWritable> {

    private NcdcStationMetadata metadata;

    @Override
    protected void setup(Context context)
        throws IOException, InterruptedException {
      metadata = new NcdcStationMetadata();
      metadata.initialize(new File("stations-fixed-width.txt"));
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {

      String stationName = metadata.getStationName(key.toString());

      int maxValue = Integer.MIN_VALUE;
      for (IntWritable value : values) {
        maxValue = Math.max(maxValue, value.get());
      }
      context.write(new Text(stationName), new IntWritable(maxValue));
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(StationTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducerWithStationLookup.class);

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(
        new MaxTemperatureByStationNameUsingDistributedCacheFile(), args);
    System.exit(exitCode);
  }
}
The program finds the maximum temperature by weather station, so the mapper
(StationTemperatureMapper) simply emits (station ID, temperature) pairs. For the
combiner, we reuse MaxTemperatureReducer (from Chapters 2 and 5) to pick the
maximum temperature for any given group of map outputs on the map side. The
reducer (MaxTemperatureReducerWithStationLookup) is different from the combiner, since
in addition to finding the maximum temperature, it uses the cache file to look up the
station name.
We use the reducer's setup() method to retrieve the cache file using its original name,
relative to the working directory of the task.
You can use the distributed cache for copying files that do not fit in
memory. MapFiles are very useful in this regard, since they serve as an
on-disk lookup format (see "MapFile" on page 139). Because MapFiles
are a collection of files with a defined directory structure, you should
put them into an archive format (JAR, ZIP, tar, or gzipped tar) and add
them to the cache using the -archives option.
Here's a snippet of the output, showing some maximum temperatures for a few weather
stations:
PEATS RIDGE WARATAH 372
STRATHALBYN RACECOU 410
SHEOAKS AWS 399
WANGARATTA AERO 409
MOOGARA 334
MACKAY AERO 331
How it works
When you launch a job, Hadoop copies the files specified by the -files, -archives, and
-libjars options to the jobtracker's filesystem (normally HDFS). Then, before a task
is run, the tasktracker copies the files from the jobtracker's filesystem to a local disk
(the cache) so the task can access the files. The files are said to be localized at this point.
From the task's point of view, the files are just there (and it doesn't care that they came
from HDFS). In addition, files specified by -libjars are added to the task's classpath
before it is launched.
The tasktracker also maintains a reference count for the number of tasks using each
file in the cache. Before the task has run, the file's reference count is incremented by
one; then after the task has run, the count is decreased by one. Only when the count
reaches zero is it eligible for deletion, since no tasks are using it. Files are deleted to
make room for a new file when the cache exceeds a certain size (10 GB by default). The
cache size may be changed by setting the configuration property local.cache.size,
which is measured in bytes.
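The reference-counting scheme can be illustrated with a small standalone model. This is only a sketch, not the tasktracker's code: the class name RefCountedCacheSketch, its methods, and the oldest-first eviction order are assumptions made for illustration, and the byte limit stands in for local.cache.size.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of a reference-counted, size-bounded local cache (hypothetical, not Hadoop code). */
final class RefCountedCacheSketch {

    /** One localized file: its size and the number of running tasks using it. */
    private static final class CachedFile {
        final long sizeBytes;
        int refCount;
        CachedFile(long sizeBytes) { this.sizeBytes = sizeBytes; }
    }

    private final long maxBytes;   // plays the role of local.cache.size
    private long usedBytes;
    // Insertion-ordered, so eviction considers the oldest files first (an assumption).
    private final Map<String, CachedFile> files = new LinkedHashMap<>();

    RefCountedCacheSketch(long maxBytes) { this.maxBytes = maxBytes; }

    /** Before a task runs: localize the file if needed and take a reference. */
    void acquire(String name, long sizeBytes) {
        CachedFile f = files.get(name);
        if (f == null) {
            f = new CachedFile(sizeBytes);
            files.put(name, f);
            usedBytes += sizeBytes;
        }
        f.refCount++;
        evictIfOverLimit();
    }

    /** After a task has run: drop its reference. */
    void release(String name) {
        CachedFile f = files.get(name);
        if (f != null && f.refCount > 0) {
            f.refCount--;
        }
    }

    /** Delete files with a zero count until the cache fits, or nothing more can go. */
    private void evictIfOverLimit() {
        Iterator<Map.Entry<String, CachedFile>> it = files.entrySet().iterator();
        while (usedBytes > maxBytes && it.hasNext()) {
            CachedFile f = it.next().getValue();
            if (f.refCount == 0) {
                usedBytes -= f.sizeBytes;
                it.remove();
            }
        }
    }

    boolean contains(String name) { return files.containsKey(name); }

    long usedBytes() { return usedBytes; }

    public static void main(String[] args) {
        RefCountedCacheSketch cache = new RefCountedCacheSketch(100);
        cache.acquire("stations.txt", 60);   // first task starts
        cache.acquire("other.dat", 60);      // over the limit, but both files are in use
        System.out.println(cache.contains("stations.txt"));   // prints true
        cache.release("stations.txt");       // first task finishes
        cache.acquire("third.dat", 10);      // a new file triggers eviction
        System.out.println(cache.contains("stations.txt"));   // prints false
    }
}
```

Note that a file in use is never deleted in this model, even when the cache is over its limit; only files whose count has dropped to zero are candidates.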
Although this design doesn't guarantee that subsequent tasks from the same job
running on the same tasktracker will find the file in the cache, it is very likely that they will,
since tasks from a job are usually scheduled to run at around the same time, so there
isn't the opportunity for enough other jobs to run and cause the original task's file to
be deleted from the cache.
Files are localized under the ${mapred.local.dir}/taskTracker/archive directory on
the tasktrackers. Applications don't have to know this, however, since the files are
symbolically linked from the task's working directory.
The distributed cache API
Most applications don't need to use the distributed cache API because they can use the
cache via GenericOptionsParser, as we saw in Example 8-16. However, some
applications may need to use more advanced features of the distributed cache, and for this
they can use its API directly. The API is in two parts: methods for putting data into the
cache (found in Job), and methods for retrieving data from the cache (found in
JobContext).[8]
Here are the pertinent methods in Job for putting data into the cache:
public void addCacheFile(URI uri)
public void addCacheArchive(URI uri)
public void setCacheFiles(URI[] files)
public void setCacheArchives(URI[] archives)
public void addFileToClassPath(Path file)
public void addArchiveToClassPath(Path archive)
public void createSymlink()
8. If you are using the old MapReduce API, the same methods can be found in
org.apache.hadoop.filecache.DistributedCache.
Recall that there are two types of object that can be placed in the cache: files and
archives. Files are left intact on the task node, while archives are unarchived on the task
node. For each type of object, there are three methods: an addCacheXXXX() method to
add the file or archive to the distributed cache, a setCacheXXXXs() method to set the
entire list of files or archives to be added to the cache in a single call (replacing those
set in any previous calls), and an addXXXXToClassPath() to add the file or archive to the
MapReduce task's classpath. Table 8-7 compares these API methods to the
GenericOptionsParser options described in Table 5-1.
Table 8-7. Distributed cache API
Job API method | GenericOptionsParser equivalent | Description
addCacheFile(URI uri), setCacheFiles(URI[] files) | -files file1,file2,... | Add files to the distributed cache to be copied to the task node.
addCacheArchive(URI uri), setCacheArchives(URI[] files) | -archives archive1,archive2,... | Add archives to the distributed cache to be copied to the task node and unarchived there.
addFileToClassPath(Path file) | -libjars jar1,jar2,... | Add files to the distributed cache to be added to the MapReduce task's classpath. The files are not unarchived, so this is a useful way to add JAR files to the classpath.
addArchiveToClassPath(Path archive) | None | Add archives to the distributed cache to be unarchived and added to the MapReduce task's classpath. This can be useful when you want to add a directory of files to the classpath, since you can create an archive containing the files, although you can equally well create a JAR file and use addFileToClassPath().
The URIs referenced in the add() or set() methods must be files in a
shared filesystem that exist when the job is run. On the other hand, the
files specified as a GenericOptionsParser option (e.g., -files) may refer
to a local file, in which case they get copied to the default shared
filesystem (normally HDFS) on your behalf.
This is the key difference between using the Java API directly and using
GenericOptionsParser: the Java API does not copy the file specified in
the add() or set() method to the shared filesystem, whereas the
GenericOptionsParser does.
The remaining distributed cache API method on Job is createSymlink(), which creates
symbolic links for all the files for the current job when they are localized on the task
node. The symbolic link name is set by the fragment identifier of the file's URI. For
example, the file specified by the URI hdfs://namenode/foo/bar#myfile is symlinked as
myfile in the task's working directory. (There's an example of using this API in
Example 8-8.) If there is no fragment identifier, then no symbolic link is created. Files added
to the distributed cache using GenericOptionsParser are automatically symlinked.
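The fragment identifier is the part of a URI after the # character, and the standard java.net.URI class can extract it. The following sketch only shows how a link name relates to a URI; the class SymlinkNameSketch and its symlinkName() helper are hypothetical and are not part of the Hadoop API.

```java
import java.net.URI;

/** Illustrates how a symlink name can be read from a URI fragment (hypothetical helper). */
final class SymlinkNameSketch {

    /** Returns the link name implied by the URI's fragment, or null if there is none. */
    static String symlinkName(URI uri) {
        String fragment = uri.getFragment();
        return (fragment == null || fragment.isEmpty()) ? null : fragment;
    }

    public static void main(String[] args) {
        // The URI from the text: the fragment names the link.
        System.out.println(symlinkName(URI.create("hdfs://namenode/foo/bar#myfile")));  // prints myfile
        // No fragment, so no link name.
        System.out.println(symlinkName(URI.create("hdfs://namenode/foo/bar")));         // prints null
    }
}
```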
Symbolic links are not created for files in the distributed cache when
using the local job runner, so for this reason you may choose to use the
getLocalCacheFiles() and getLocalCacheArchives() methods (discussed
below) if you want your jobs to work both locally and on a cluster.
The second part of the distributed cache API is found on JobContext, and it is used from
the map or reduce task code when you want to access files from the distributed cache.
public Path[] getLocalCacheFiles() throws IOException;
public Path[] getLocalCacheArchives() throws IOException;
public Path[] getFileClassPaths();
public Path[] getArchiveClassPaths();
If the files from the distributed cache have symbolic links in the task's working
directory, then you can access the localized file directly by name, as we did in
Example 8-16. It's also possible to get a reference to files and archives in the cache using the
getLocalCacheFiles() and getLocalCacheArchives() methods. In the case of archives,
the paths returned are to the directory containing the unarchived files. (For
completeness, you can also retrieve the files and archives added to the task classpath via
getFileClassPaths() and getArchiveClassPaths().)
Note that files are returned as local Path objects. To read the files you can use a Hadoop
local FileSystem instance, retrieved using its getLocal() method. Alternatively, you can
use the java.io.File API, as shown in this updated setup() method for
MaxTemperatureReducerWithStationLookup:
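@Override
protected void setup(Context context)
    throws IOException, InterruptedException {
  metadata = new NcdcStationMetadata();
  Path[] localPaths = context.getLocalCacheFiles();
  // The array may be null (not just empty) when no cache files were set.
  if (localPaths == null || localPaths.length == 0) {
    throw new FileNotFoundException("Distributed cache file not found.");
  }
  // Use the path string: a local Path may have no URI scheme, and
  // new File(URI) rejects URIs that are not absolute.
  File localFile = new File(localPaths[0].toString());
  metadata.initialize(localFile);
}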
MapReduce Library Classes
Hadoop comes with a library of mappers and reducers for commonly used functions.
They are listed with brief descriptions in Table 8-8. For further information on how to
use them, please consult their Java documentation.
Table 8-8. MapReduce library classes
Classes | Description
ChainMapper, ChainReducer | Run a chain of mappers in a single mapper, and a reducer followed by a chain of mappers in a single reducer. (Symbolically: M+RM*, where M is a mapper and R is a reducer.) This can substantially reduce the amount of disk I/O incurred compared to running multiple MapReduce jobs.
FieldSelectionMapReduce (old API); FieldSelectionMapper and FieldSelectionReducer (new API) | A mapper and a reducer that can select fields (like the Unix cut command) from the input keys and values and emit them as output keys and values.
IntSumReducer, LongSumReducer | Reducers that sum integer values to produce a total for every key.
InverseMapper | A mapper that swaps keys and values.
MultithreadedMapRunner (old API); MultithreadedMapper (new API) | A mapper (or map runner in the old API) that runs mappers concurrently in separate threads. Useful for mappers that are not CPU-bound.
TokenCounterMapper | A mapper that tokenizes the input value into words (using Java's StringTokenizer) and emits each word along with a count of one.
RegexMapper | A mapper that finds matches of a regular expression in the input value and emits the matches along with a count of one.
CHAPTER 9
Setting Up a Hadoop Cluster
This chapter explains how to set up Hadoop to run on a cluster of machines. Running
HDFS and MapReduce on a single machine is great for learning about these systems,
but to do useful work they need to run on multiple nodes.
There are a few options when it comes to getting a Hadoop cluster, from building your
own to running on rented hardware, or using an offering that provides Hadoop as a
service in the cloud. This chapter and the next give you enough information to set up
and operate your own cluster, but even if you are using a Hadoop service in which a
lot of the routine maintenance is done for you, these chapters still offer valuable
information about how Hadoop works from an operations point of view.
Cluster Specification
Hadoop is designed to run on commodity hardware. That means that you are not tied
to expensive, proprietary offerings from a single vendor; rather, you can choose
standardized, commonly available hardware from any of a large range of vendors to build
your cluster.
"Commodity" does not mean "low-end." Low-end machines often have cheap
components, which have higher failure rates than more expensive (but still
commodity-class) machines. When you are operating tens, hundreds, or thousands of machines,
cheap components turn out to be a false economy, as the higher failure rate incurs a
greater maintenance cost. On the other hand, large database-class machines are not
recommended either, since they don't score well on the price/performance curve. And
even though you would need fewer of them to build a cluster of comparable
performance to one built of mid-range commodity hardware, when one did fail it would have
a bigger impact on the cluster, since a larger proportion of the cluster hardware would
be unavailable.
Hardware specifications rapidly become obsolete, but for the sake of illustration, a
typical choice of machine for running a Hadoop datanode and tasktracker in mid-2010
would have the following specifications:
Processor
  2 quad-core 2-2.5GHz CPUs
Memory
  16-24 GB ECC RAM[1]
Storage
  4 × 1TB SATA disks
Network
  Gigabit Ethernet
While the hardware specification for your cluster will assuredly be different, Hadoop
is designed to use multiple cores and disks, so it will be able to take full advantage of
more powerful hardware.
Why Not Use RAID?
HDFS clusters do not benefit from using RAID (Redundant Array of Independent
Disks) for datanode storage (although RAID is recommended for the namenode's disks,
to protect against corruption of its metadata). The redundancy that RAID provides is
not needed, since HDFS handles it by replication between nodes.
Furthermore, RAID striping (RAID 0), which is commonly used to increase
performance, turns out to be slower than the JBOD (Just a Bunch Of Disks) configuration
used by HDFS, which round-robins HDFS blocks between all disks. The reason for this
is that RAID 0 read and write operations are limited by the speed of the slowest disk
in the RAID array. In JBOD, disk operations are independent, so the average speed of
operations is greater than that of the slowest disk. Disk performance often shows
considerable variation in practice, even for disks of the same model. In some benchmarking
carried out on a Yahoo! cluster (http://markmail.org/message/xmzc45zi25htr7ry),
JBOD performed 10% faster than RAID 0 in one test (Gridmix), and 30% better in
another (HDFS write throughput).
Finally, if a disk fails in a JBOD configuration, HDFS can continue to operate without
the failed disk, whereas with RAID 0, failure of a single disk causes the whole array (and
hence the node) to become unavailable.
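The effect of the slowest disk can be seen in a deliberately simplified throughput model. This is a sketch under stated assumptions, not a benchmark: the class StripeVersusJbodSketch is hypothetical, the per-disk speeds are made-up figures, and real arrays are affected by controllers, caching, and access patterns that are ignored here.

```java
/** Simplified throughput model for RAID 0 versus JBOD (illustrative figures only). */
final class StripeVersusJbodSketch {

    /** RAID 0 model: every stripe spans all disks, so each disk is held to the slowest one's pace. */
    static int raid0Throughput(int[] diskMBps) {
        if (diskMBps.length == 0) {
            return 0;
        }
        int slowest = diskMBps[0];
        for (int speed : diskMBps) {
            slowest = Math.min(slowest, speed);
        }
        return slowest * diskMBps.length;
    }

    /** JBOD model: disks operate independently, so their individual speeds add up. */
    static int jbodThroughput(int[] diskMBps) {
        int total = 0;
        for (int speed : diskMBps) {
            total += speed;
        }
        return total;
    }

    public static void main(String[] args) {
        // Four disks of the "same model" with some variation (hypothetical MB/s values).
        int[] speeds = {80, 100, 100, 90};
        System.out.println(raid0Throughput(speeds));  // prints 320
        System.out.println(jbodThroughput(speeds));   // prints 370
    }
}
```

In this model the two configurations are equal only when every disk runs at exactly the same speed; any variation favors JBOD.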
The bulk of Hadoop is written in Java, and can therefore run on any platform with a
JVM, although there are enough parts that harbor Unix assumptions (the control
scripts, for example) to make it unwise to run on a non-Unix platform in production.
In fact, Windows operating systems are not supported production platforms (although
they can be used with Cygwin as a development platform; see Appendix A).
1. ECC memory is strongly recommended, as several Hadoop users have reported seeing many checksum
errors when using non-ECC memory on Hadoop clusters.
How large should your cluster be? There isn't an exact answer to this question, but the
beauty of Hadoop is that you can start with a small cluster (say, 10 nodes) and grow it
as your storage and computational needs grow. In many ways, a better question is this:
how fast does my cluster need to grow? You can get a good feel for this by considering
storage capacity.
For example, if your data grows by 1 TB a week, and you have three-way HDFS
replication, then you need an additional 3 TB of raw storage per week. Allow some room
for intermediate files and logfiles (around 30%, say), and this works out at about one
machine (2010 vintage) per week, on average. In practice, you wouldn't buy a new
machine each week and add it to the cluster. The value of doing a back-of-the-envelope
calculation like this is that it gives you a feel for how big your cluster should be: in this
example, a cluster that holds two years of data needs 100 machines.
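The back-of-the-envelope calculation can be written out as a few lines of arithmetic. The sketch below assumes 4 TB of usable disk per machine (the 4 × 1TB configuration described earlier, with all of it available to HDFS) and counts two years as 104 weeks; the class and method names are hypothetical.

```java
/** Back-of-the-envelope cluster sizing from the storage growth rate (illustrative only). */
final class ClusterSizingSketch {

    /** Raw storage consumed per week: replicated data plus a fractional overhead allowance. */
    static double rawTbPerWeek(double dataTbPerWeek, int replication, double overheadFraction) {
        return dataTbPerWeek * replication * (1.0 + overheadFraction);
    }

    /** Number of machines needed to hold the given number of weeks of data, rounded up. */
    static long machinesNeeded(double dataTbPerWeek, int replication, double overheadFraction,
                               double tbPerMachine, int weeks) {
        double totalTb = rawTbPerWeek(dataTbPerWeek, replication, overheadFraction) * weeks;
        return (long) Math.ceil(totalTb / tbPerMachine);
    }

    public static void main(String[] args) {
        // 1 TB/week of new data, three-way replication, 30% allowance for
        // intermediate files and logfiles: roughly 3.9 TB of raw storage per week.
        double weekly = rawTbPerWeek(1.0, 3, 0.30);
        System.out.printf("raw TB per week: %.1f%n", weekly);

        // With 4 TB per machine that is close to one machine per week, so two
        // years of data needs about 100 machines.
        System.out.println(machinesNeeded(1.0, 3, 0.30, 4.0, 104));  // prints 102
    }
}
```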
For a small cluster (on the order of 10 nodes), it is usually acceptable to run the
namenode and the jobtracker on a single master machine (as long as at least one copy of the
namenode's metadata is stored on a remote filesystem). As the cluster and the number
of files stored in HDFS grow, the namenode needs more memory, so the namenode
and jobtracker should be moved onto separate machines.
The secondary namenode can be run on the same machine as the namenode, but again
for reasons of memory usage (the secondary has the same memory requirements as the
primary), it is best to run it on a separate piece of hardware, especially for larger clusters.
(This topic is discussed in more detail in "Master node scenarios" on page 304.)
Machines running the namenodes should typically run on 64-bit hardware to avoid the
3 GB limit on Java heap size in 32-bit architectures.[2]
Network Topology
A common Hadoop cluster architecture consists of a two-level network topology, as
illustrated in Figure 9-1. Typically there are 30 to 40 servers per rack, with a 1 Gb switch
for the rack (only three are shown in the diagram), and an uplink to a core switch or
router (which is normally 1 Gb or better). The salient point is that the aggregate
bandwidth between nodes on the same rack is much greater than that between nodes on
different racks.
2. The traditional advice says other machines in the cluster (jobtracker, datanodes/tasktrackers) should be
32-bit to avoid the memory overhead of larger pointers. Sun's Java 6 update 14 features "compressed
ordinary object pointers," which eliminates much of this overhead, so there's now no real downside to
running on 64-bit hardware.
Rack awareness
To get maximum performance out of Hadoop, it is important to configure Hadoop so
that it knows the topology of your network. If your cluster runs on a single rack, then
there is nothing more to do, since this is the default. However, for multirack clusters,
you need to map nodes to racks. By doing this, Hadoop will prefer within-rack transfers
(where there is more bandwidth available) to off-rack transfers when placing
MapReduce tasks on nodes. HDFS will be able to place replicas more intelligently to
trade off performance and resilience.
Network locations such as nodes and racks are represented in a tree, which reflects the
network "distance" between locations. The namenode uses the network location when
determining where to place block replicas (see "Network Topology and
Hadoop" on page 71); the MapReduce scheduler uses network location to determine
where the closest replica is as input to a map task.
For the network in Figure 9-1, the rack topology is described by two network locations,
say, /switch1/rack1 and /switch1/rack2. Since there is only one top-level switch in this
cluster, the locations can be simplified to /rack1 and /rack2.
The Hadoop configuration must specify a map between node addresses and network
locations. The map is described by a Java interface, DNSToSwitchMapping, whose
signature is:
public interface DNSToSwitchMapping {
  public List<String> resolve(List<String> names);
}
Figure 9-1. Typical two-level network architecture for a Hadoop cluster
The names parameter is a list of IP addresses, and the return value is a list of
corresponding network location strings. The topology.node.switch.mapping.impl
configuration property defines an implementation of the DNSToSwitchMapping interface that the
namenode and the jobtracker use to resolve worker node network locations.
For the network in our example, we would map node1, node2, and node3 to /rack1,
and node4, node5, and node6 to /rack2.
Most installations don't need to implement the interface themselves, however, since
the default implementation is ScriptBasedMapping, which runs a user-defined script to
determine the mapping. The script's location is controlled by the property
topology.script.file.name. The script must accept a variable number of arguments
that are the hostnames or IP addresses to be mapped, and it must emit the
corresponding network locations to standard output, separated by whitespace. The Hadoop wiki
has an example at http://wiki.apache.org/hadoop/topology_rack_awareness_scripts.
If no script location is specified, the default behavior is to map all nodes to a single
network location, called /default-rack.
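To make the mapping concrete, here is a minimal standalone sketch of a table-driven resolver for the six-node example. It mirrors the shape of the resolve() method but deliberately does not implement the Hadoop interface, so that it runs without Hadoop on the classpath; the class name StaticRackMappingSketch is hypothetical, and sending unknown nodes to /default-rack is an assumption made here, echoing the default behavior described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Table-driven node-to-rack lookup for the example network (hypothetical, not a Hadoop class). */
final class StaticRackMappingSketch {

    static final String DEFAULT_RACK = "/default-rack";

    private final Map<String, String> rackByNode = new HashMap<>();

    StaticRackMappingSketch() {
        for (String node : Arrays.asList("node1", "node2", "node3")) {
            rackByNode.put(node, "/rack1");
        }
        for (String node : Arrays.asList("node4", "node5", "node6")) {
            rackByNode.put(node, "/rack2");
        }
    }

    /** Returns one network location per input name, in the same order as the input. */
    List<String> resolve(List<String> names) {
        List<String> locations = new ArrayList<>(names.size());
        for (String name : names) {
            // Unknown nodes fall back to a single catch-all location (an assumption).
            locations.add(rackByNode.getOrDefault(name, DEFAULT_RACK));
        }
        return locations;
    }

    public static void main(String[] args) {
        StaticRackMappingSketch mapping = new StaticRackMappingSketch();
        System.out.println(mapping.resolve(Arrays.asList("node1", "node4", "node9")));
        // prints [/rack1, /rack2, /default-rack]
    }
}
```

A script used with ScriptBasedMapping would perform the same lookup, taking the names as arguments and printing the locations separated by whitespace.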
Cluster Setup and Installation
Your hardware has arrived. The next steps are to get it racked up and install the software
needed to run Hadoop.
There are various ways to install and configure Hadoop. This chapter describes how
to do it from scratch using the Apache Hadoop distribution, and will give you the
background to cover the things you need to think about when setting up Hadoop.
Alternatively, if you would like to use RPMs or Debian packages for managing your
Hadoop installation, then you might want to start with Cloudera's Distribution,
described in Appendix B.
To ease the burden of installing and maintaining the same software on each node, it is
normal to use an automated installation method like Red Hat Linux's Kickstart or
Debian's Fully Automatic Installation. These tools allow you to automate the operating
system installation by recording the answers to questions that are asked during the
installation process (such as the disk partition layout), as well as which packages to
install. Crucially, they also provide hooks to run scripts at the end of the process, which
are invaluable for doing final system tweaks and customization that is not covered by
the standard installer.
The following sections describe the customizations that are needed to run Hadoop.
These should all be added to the installation script.
Installing Java
Java 6 or later is required to run Hadoop. The latest stable Sun JDK is the preferred
option, although Java distributions from other vendors may work, too. The following
command confirms that Java was installed correctly:
% java -version
java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)
Creating a Hadoop User
It's good practice to create a dedicated Hadoop user account to separate the Hadoop
installation from other services running on the same machine.
For small clusters, some administrators choose to make this user's home directory an
NFS-mounted drive, to aid with SSH key distribution (see the following discussion).
The NFS server is typically outside the Hadoop cluster. If you use NFS, it is worth
considering autofs, which allows you to mount the NFS filesystem on demand, when
the system accesses it. Autofs provides some protection against the NFS server failing
and allows you to use replicated filesystems for failover. There are other NFS gotchas
to watch out for, such as synchronizing UIDs and GIDs. For help setting up NFS on
Linux, refer to the HOWTO at http://nfs.sourceforge.net/nfs-howto/index.html.
Installing Hadoop
Download Hadoop from the Apache Hadoop releases page
(http://hadoop.apache.org/core/releases.html), and unpack the contents of the distribution
in a sensible location, such as /usr/local (/opt is another standard choice). Note that
Hadoop is not installed in the hadoop user's home directory, as that may be an
NFS-mounted directory:
% cd /usr/local
% sudo tar xzf hadoop-x.y.z.tar.gz
We also need to change the owner of the Hadoop files to be the hadoop user and group:
% sudo chown -R hadoop:hadoop hadoop-x.y.z
Some administrators like to install HDFS and MapReduce in separate
locations on the same system. At the time of this writing, only HDFS
and MapReduce from the same Hadoop release are compatible with one
another; however, in future releases, the compatibility requirements will
be loosened. When this happens, having independent installations
makes sense, as it gives more upgrade options (for more, see
"Upgrades" on page 360). For example, it is convenient to be able to
upgrade MapReduce (perhaps to patch a bug) while leaving HDFS
running.
Note that separate installations of HDFS and MapReduce can still share
configuration by using the --config option (when starting daemons) to
refer to a common configuration directory. They can also log to the same
directory, as the logfiles they produce are named in such a way as to
avoid clashes.
Testing the Installation
Once you've created an installation script, you are ready to test it by installing it on the
machines in your cluster. This will probably take a few iterations as you discover kinks
in the install. When it's working, you can proceed to configure Hadoop and give it a
test run. This process is documented in the following sections.
SSH Configuration
The Hadoop control scripts (but not the daemons) rely on SSH to perform cluster-wide
operations. For example, there is a script for stopping and starting all the daemons in
the cluster. Note that the control scripts are optional; cluster-wide operations can be
performed by other mechanisms, too (such as a distributed shell).
To work seamlessly, SSH needs to be set up to allow password-less login for the
hadoop user from machines in the cluster. The simplest way to achieve this is to generate
a public/private key pair, and place it in an NFS location that is shared across the cluster.
First, generate an RSA key pair by typing the following in the hadoop user account:
% ssh-keygen -t rsa -f ~/.ssh/id_rsa
Even though we want password-less logins, keys without passphrases are not
considered good practice (it's OK to have an empty passphrase when running a local
pseudo-distributed cluster, as described in Appendix A), so we specify a passphrase when
prompted for one. We shall use ssh-agent to avoid the need to enter a password for
each connection.
The private key is in the file specified by the -f option, ~/.ssh/id_rsa, and the public key
is stored in a file with the same name with .pub appended, ~/.ssh/id_rsa.pub.
Next we need to make sure that the public key is in the ~/.ssh/authorized_keys file on
all the machines in the cluster that we want to connect to. If the hadoop user's home
directory is an NFS filesystem, as described earlier, then the keys can be shared across
the cluster by typing:
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
If the home directory is not shared using NFS, then the public keys will need to be
shared by some other means.
Test that you can SSH from the master to a worker machine by making sure
ssh-agent is running,[3] and then run ssh-add to store your passphrase. You should be able
to ssh to a worker without entering the passphrase again.
Hadoop Configuration
There are a handful of files for controlling the configuration of a Hadoop installation;
the most important ones are listed in Table 9-1. This section covers MapReduce 1,
which employs the jobtracker and tasktracker daemons. Running MapReduce 2 is
substantially different, and is covered in "YARN Configuration" on page 318.
Table 9-1. Hadoop configuration files
Filename | Format | Description
hadoop-env.sh | Bash script | Environment variables that are used in the scripts to run Hadoop.
core-site.xml | Hadoop configuration XML | Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml | Hadoop configuration XML | Configuration settings for HDFS daemons: the namenode, the secondary namenode, and the datanodes.
mapred-site.xml | Hadoop configuration XML | Configuration settings for MapReduce daemons: the jobtracker, and the tasktrackers.
masters | Plain text | A list of machines (one per line) that each run a secondary namenode.
slaves | Plain text | A list of machines (one per line) that each run a datanode and a tasktracker.
hadoop-metrics.properties | Java Properties | Properties for controlling how metrics are published in Hadoop (see "Metrics" on page 350).
log4j.properties | Java Properties | Properties for system logfiles, the namenode audit log, and the task log for the tasktracker child process (see "Hadoop Logs" on page 173).
These files are all found in the conf directory of the Hadoop distribution. The
configuration directory can be relocated to another part of the filesystem (outside the Hadoop
installation, which makes upgrades marginally easier) as long as daemons are started
with the --config option specifying the location of this directory on the local filesystem.
3. See its man page for instructions on how to start ssh-agent.
Configuration Management
Hadoop does not have a single, global location for configuration information. Instead,
each Hadoop node in the cluster has its own set of configuration files, and it is up to
administrators to ensure that they are kept in sync across the system. Hadoop provides
a rudimentary facility for synchronizing configuration using rsync (see upcoming
discussion); alternatively, there are parallel shell tools that can help do this, like dsh or
pdsh.
Hadoop is designed so that it is possible to have a single set of configuration files that
are used for all master and worker machines. The great advantage of this is simplicity,
both conceptually (since there is only one configuration to deal with) and operationally
(as the Hadoop scripts are sufficient to manage a single configuration setup).
For some clusters, the one-size-fits-all configuration model breaks down. For example,
if you expand the cluster with new machines that have a different hardware
specification to the existing ones, then you need a different configuration for the new machines
to take advantage of their extra resources.
In these cases, you need to have the concept of a class of machine, and maintain a
separate configuration for each class. Hadoop doesn't provide tools to do this, but there
are several excellent tools for doing precisely this type of configuration management,
such as Chef, Puppet, cfengine, and bcfg2.
For a cluster of any size, it can be a challenge to keep all of the machines in sync: consider
what happens if a machine is unavailable when you push out an update. Who
ensures it gets the update when it becomes available? This is a big problem and can lead
to divergent installations, so even if you use the Hadoop control scripts for managing
Hadoop, it may be a good idea to use configuration management tools for maintaining
the cluster. These tools are also excellent for doing regular maintenance, such as
patching security holes and updating system packages.
Control scripts
Hadoop comes with scripts for running commands, and starting and stopping daemons
across the whole cluster. To use these scripts (which can be found in the bin directory),
you need to tell Hadoop which machines are in the cluster. There are two files for this
purpose, called masters and slaves, each of which contains a list of the machine
hostnames or IP addresses, one per line. The masters file is actually a misleading name, in
that it determines which machine or machines should run a secondary namenode. The
slaves file lists the machines that the datanodes and tasktrackers should run on. Both
masters and slaves files reside in the configuration directory, although the slaves file
may be placed elsewhere (and given another name) by changing the HADOOP_SLAVES
setting in hadoop-env.sh. Also, these files do not need to be distributed to worker nodes,
since they are used only by the control scripts running on the namenode or jobtracker.
You don't need to specify which machine (or machines) the namenode and jobtracker
run on in the masters file, as this is determined by the machine the scripts are run on.
(In fact, specifying these in the masters file would cause a secondary namenode to run
there, which isn't always what you want.) For example, the start-dfs.sh script, which
starts all the HDFS daemons in the cluster, runs the namenode on the machine the
script is run on. In slightly more detail, it:
1. Starts a namenode on the local machine (the machine that the script is run on)
2. Starts a datanode on each machine listed in the slaves file
3. Starts a secondary namenode on each machine listed in the masters file
There is a similar script called start-mapred.sh, which starts all the MapReduce
daemons in the cluster. More specifically, it:
1. Starts a jobtracker on the local machine
2. Starts a tasktracker on each machine listed in the slaves file
Note that masters is not used by the MapReduce control scripts.
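The start-up sequence of the two control scripts can be sketched as a dry run. This is a minimal sketch and not the scripts' actual source: the function names list_daemons, plan_start_dfs, and plan_start_mapred are hypothetical, and they only print which daemon would be started on which host, given a directory that holds masters and slaves files.

```shell
# Hypothetical dry-run helpers (not part of Hadoop): they print the daemons
# that the control scripts would start, without starting anything.

# Print one "daemon host" line for each non-empty line of a hosts file.
list_daemons() {
  daemon=$1
  hosts_file=$2
  while read -r host || [ -n "$host" ]; do
    if [ -n "$host" ]; then
      echo "$daemon $host"
    fi
  done < "$hosts_file"
}

# Mirrors the three steps described for start-dfs.sh.
plan_start_dfs() {
  local_host=$1   # the machine the script is run on
  conf_dir=$2     # directory containing the masters and slaves files
  echo "namenode $local_host"
  list_daemons datanode "$conf_dir/slaves"
  list_daemons secondarynamenode "$conf_dir/masters"
}

# Mirrors the two steps described for start-mapred.sh; masters is not read.
plan_start_mapred() {
  local_host=$1
  conf_dir=$2
  echo "jobtracker $local_host"
  list_daemons tasktracker "$conf_dir/slaves"
}
```

Calling plan_start_dfs with the local hostname and a configuration directory prints one line per daemon, which makes it easy to check that the masters file lists only the machines intended to run a secondary namenode.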
Also provided are stop-dfs.sh and stop-mapred.sh scripts to stop the daemons started
by the corresponding start script.
These scripts start and stop Hadoop daemons using the hadoop-daemon.sh script. If
you use the aforementioned scripts, you shouldn't call hadoop-daemon.sh directly. But
if you need to control Hadoop daemons from another system or from your own scripts,
then the hadoop-daemon.sh script is a good integration point. Likewise,
hadoop-daemons.sh (with an "s") is handy for starting the same daemon on a set of hosts.
Master node scenarios
Depending on the size of the cluster, there are various configurations for running the
master daemons: the namenode, secondary namenode, and jobtracker. On a small
cluster (a few tens of nodes), it is convenient to put them on a single machine; however,
as the cluster gets larger, there are good reasons to separate them.
The namenode has high memory requirements, as it holds file and block metadata for
the entire namespace in memory. The secondary namenode, while idle most of the time,
has a comparable memory footprint to the primary when it creates a checkpoint. (This
is explained in detail in "The filesystem image and edit log" on page 338.) For
filesystems with a large number of files, there may not be enough physical memory on one
machine to run both the primary and secondary namenode.
The secondary namenode keeps a copy of the latest checkpoint of the filesystem
metadata that it creates. Keeping this (stale) backup on a different node to the namenode
allows recovery in the event of loss (or corruption) of all the namenode's metadata files.
(This is discussed further in Chapter 10.)
On a busy cluster running lots of MapReduce jobs, the jobtracker uses considerable
memory and CPU resources, so it should run on a dedicated node.
Whether the master daemons run on one or more nodes, the following instructions
apply:
• Run the HDFS control scripts from the namenode machine. The masters file should
  contain the address of the secondary namenode.
• Run the MapReduce control scripts from the jobtracker machine.
When the namenode and jobtracker are on separate nodes, their slaves files need to be
kept in sync, since each node in the cluster should run a datanode and a tasktracker.
Environment Settings
In this section, we consider how to set the variables in hadoop-env.sh.
Memory
By default, Hadoop allocates 1,000 MB (1 GB) of memory to each daemon it runs. This
is controlled by the HADOOP_HEAPSIZE setting in hadoop-env.sh. In addition, the
tasktracker launches separate child JVMs to run map and reduce tasks in, so we need to
factor these into the total memory footprint of a worker machine.
The maximum number of map tasks that can run on a tasktracker at one time is
controlled by the mapred.tasktracker.map.tasks.maximum property, which defaults to two
tasks. There is a corresponding property for reduce tasks,
mapred.tasktracker.reduce.tasks.maximum, which also defaults to two tasks. The tasktracker is said
to have two map slots and two reduce slots.
The memory given to each child JVM running a task can be changed by setting the
mapred.child.java.opts property. The default setting is -Xmx200m, which gives each task
200 MB of memory. (Incidentally, you can provide extra JVM options here, too. For
example, you might enable verbose GC logging to debug GC.) The default configuration
therefore uses 2,800 MB of memory for a worker machine (see Table 9-2).
Table 9-2. Worker node memory calculation
JVM | Default memory used (MB) | Memory used for 8 processors, 400 MB per child (MB)
Datanode | 1,000 | 1,000
Tasktracker | 1,000 | 1,000
Tasktracker child map task | 2 × 200 | 7 × 400
Tasktracker child reduce task | 2 × 200 | 7 × 400
Total | 2,800 | 7,600
The number of tasks that can be run simultaneously on a tasktracker is related to the
number of processors available on the machine. Because MapReduce jobs are normally
I/O-bound, it makes sense to have more tasks than processors to get better
utilization. The amount of oversubscription depends on the CPU utilization of jobs
you run, but a good rule of thumb is to have a factor of between one and two more
tasks (counting both map and reduce tasks) than processors.
For example, if you had 8 processors and you wanted to run 2 processes on each
processor, then you could set each of mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum to 7 (not 8, since the datanode and the
tasktracker each take one slot). If you also increased the memory available to each child
task to 400 MB, then the total memory usage would be 7,600 MB (see Table 9-2).
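The arithmetic behind these figures can be checked with a short script. This is a minimal sketch: the function names worker_memory_mb and slots_per_type are hypothetical, and the script assumes, as the text does, that the datanode and the tasktracker each use the default 1,000 MB heap and each take one slot.

```shell
# Heap assumed for each of the datanode and tasktracker daemons
# (the default HADOOP_HEAPSIZE of 1,000 MB).
DAEMON_HEAP_MB=1000

# Total JVM memory on a worker: two daemons plus one child JVM per slot.
worker_memory_mb() {
  map_slots=$1
  reduce_slots=$2
  child_heap_mb=$3
  echo $(( 2 * DAEMON_HEAP_MB + (map_slots + reduce_slots) * child_heap_mb ))
}

# Slots of each type (map or reduce) for a given number of processes per
# processor, leaving one slot each for the datanode and the tasktracker.
slots_per_type() {
  processors=$1
  processes_per_processor=$2
  echo $(( (processors * processes_per_processor - 2) / 2 ))
}

worker_memory_mb 2 2 200                  # default configuration: 2800
slots=$(slots_per_type 8 2)               # 8 processors, 2 processes each: 7
worker_memory_mb "$slots" "$slots" 400    # tuned configuration: 7600
```

The two totals correspond to the two memory columns of Table 9-2.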
Whether this Java memory allocation will fit into 8 GB of physical memory depends
on the other processes that are running on the machine. If you are running Streaming
or Pipes programs, this allocation will probably be inappropriate (and the memory
allocated to the child should be dialed down), since it doesn't allow enough memory
for users' (Streaming or Pipes) processes to run. The thing to avoid is processes being
swapped out, as this leads to severe performance degradation. The precise memory
settings are necessarily very cluster-dependent and can be optimized over time with
experience gained from monitoring the memory usage across the cluster. Tools like
Ganglia ("GangliaContext" on page 352) are good for gathering this information. See
"Task memory limits" on page 316 for more on how to enforce task memory limits.
Hadoop also provides settings to control how much memory is used for MapReduce
operations. These can be set on a per-job basis and are covered in the section on "Shuffle
and Sort" on page 205.
For the master nodes, each of the namenode, secondary namenode, and jobtracker
daemons uses 1,000 MB by default, a total of 3,000 MB.
How much memory does a namenode need?
A namenode can eat up memory, since a reference to every block of every file is
maintained in memory. It's difficult to give a precise formula, since memory usage depends
on the number of blocks per file, the filename length, and the number of directories in
the filesystem; plus it can change from one Hadoop release to another.
The default of 1,000 MB of namenode memory is normally enough for a few million
files, but as a rule of thumb for sizing purposes you can conservatively allow 1,000 MB
per million blocks of storage.
For example, a 200 node cluster with 4 TB of disk space per node, a block size of 128
MB and a replication factor of 3 has room for about 2 million blocks (or more):
200 × 4,000,000 MB / (128 MB × 3). So in this case, setting the namenode memory to 2,000
MB would be a good starting point.
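The sizing estimate can be reproduced with integer arithmetic. This is a minimal sketch: the function names cluster_block_capacity and namenode_heap_mb are hypothetical, and the heap figure simply applies the rule of thumb of 1,000 MB per million blocks, so it should be read as a starting point and not a precise requirement.

```shell
# Approximate number of blocks the cluster has room for.
cluster_block_capacity() {
  nodes=$1
  disk_mb_per_node=$2
  block_size_mb=$3
  replication=$4
  echo $(( nodes * disk_mb_per_node / (block_size_mb * replication) ))
}

# Heap estimate from the rule of thumb of 1,000 MB per million blocks,
# which works out to 1 MB per thousand blocks.
namenode_heap_mb() {
  block_count=$1
  echo $(( block_count / 1000 ))
}

blocks=$(cluster_block_capacity 200 4000000 128 3)
echo "blocks: $blocks"
echo "suggested heap (MB): $(namenode_heap_mb "$blocks")"
```

For the example cluster this gives 2,083,333 blocks and about 2,083 MB, which the text rounds to 2,000 MB.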
You can increase the namenode's memory without changing the memory allocated to
other Hadoop daemons by setting HADOOP_NAMENODE_OPTS in hadoop-env.sh to include a
JVM option for setting the memory size. HADOOP_NAMENODE_OPTS allows you to pass extra
options to the namenode's JVM. So, for example, if using a Sun JVM, -Xmx2000m would
specify that 2,000 MB of memory should be allocated to the namenode.
If you change the namenode's memory allocation, don't forget to do the same for the
secondary namenode (using the HADOOP_SECONDARYNAMENODE_OPTS variable), since its
memory requirements are comparable to the primary namenode's. You will probably
also want to run the secondary namenode on a different machine, in this case.
There are corresponding environment variables for the other Hadoop daemons, so you
can customize their memory allocations, if desired. See hadoop-env.sh for details.
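As an illustration, the namenode settings described above might look like this in hadoop-env.sh. This is a sketch: the 2,000 MB figure is taken from the sizing example and is not a general recommendation, and appending the option after any existing value is an assumed convention, chosen so that options already set are preserved.

```shell
# Sketch of lines for hadoop-env.sh (values are illustrative).
# Append the heap option, keeping any options that are already set.
export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS -Xmx2000m"
# Give the secondary namenode a matching heap, since its memory
# requirements are comparable to the primary namenode's.
export HADOOP_SECONDARYNAMENODE_OPTS="$HADOOP_SECONDARYNAMENODE_OPTS -Xmx2000m"
```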
Java
The location of the Java implementation to use is determined by the JAVA_HOME setting
in hadoop-env.sh or from the JAVA_HOME shell environment variable, if not set in
hadoop-env.sh. It's a good idea to set the value in hadoop-env.sh, so that it is clearly defined in
one place and to ensure that the whole cluster is using the same version of Java.
System logfiles
System logfiles produced by Hadoop are stored in $HADOOP_INSTALL/logs by default.
This can be changed using the HADOOP_LOG_DIR setting in hadoop-env.sh. It's a good idea
to change this so that logfiles are kept out of the directory that Hadoop is installed in,
since this keeps logfiles in one place even after the installation directory changes after
an upgrade. A common choice is /var/log/hadoop, set by including the following line in
hadoop-env.sh:
export HADOOP_LOG_DIR=/var/log/hadoop
The log directory will be created if it doesn't already exist (if not, confirm that the
Hadoop user has permission to create it). Each Hadoop daemon running on a machine
produces two logfiles. The first is the log output written via log4j. This file, which ends
in .log, should be the first port of call when diagnosing problems, since most application
log messages are written here. The standard Hadoop log4j configuration uses a Daily
Rolling File Appender to rotate logfiles. Old logfiles are never deleted, so you should
arrange for them to be periodically deleted or archived, so as to not run out of disk
space on the local node.
The second logfile is the combined standard output and standard error log. This logfile,
which ends in .out, usually contains little or no output, since Hadoop uses log4j for
logging. It is only rotated when the daemon is restarted, and only the last five logs are
retained. Old logfiles are suffixed with a number between 1 and 5, with 5 being the
oldest file.
Logfile names (of both types) are a combination of the name of the user running the
daemon, the daemon name, and the machine hostname. For example,
hadoop-tom-datanode-sturges.local.log.2008-07-01 is the name of a logfile after it has been rotated.
This naming structure makes it possible to archive logs from all machines in the cluster
in a single directory, if needed, since the filenames are unique.
The username in the logfile name is actually the default for the HADOOP_IDENT_STRING
setting in hadoop-env.sh. If you wish to give the Hadoop instance a different identity
for the purposes of naming the logfiles, change HADOOP_IDENT_STRING to be the identifier
you want.
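The naming scheme can be expressed as a small helper. This is a minimal sketch: the function name hadoop_logfile_name is hypothetical, and it only assembles the name from the parts described above (it does not cover the date or number suffix added on rotation).

```shell
# Hypothetical helper that assembles a daemon logfile name from its parts,
# following the naming scheme described in the text.
hadoop_logfile_name() {
  ident=$1     # value of HADOOP_IDENT_STRING (defaults to the username)
  daemon=$2    # daemon name, such as datanode
  host=$3      # machine hostname
  suffix=$4    # log for the log4j file, out for stdout and stderr
  echo "hadoop-${ident}-${daemon}-${host}.${suffix}"
}

hadoop_logfile_name tom datanode sturges.local log
hadoop_logfile_name tom datanode sturges.local out
```

Because the hostname is part of the name, files collected from different machines into one archive directory do not collide.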
SSH settings
The control scripts allow you to run commands on (remote) worker nodes from the
master node using SSH. It can be useful to customize the SSH settings, for various
reasons. For example, you may want to reduce the connection timeout (using the
ConnectTimeout option) so the control scripts don't hang around waiting to see whether
a dead node is going to respond. Obviously, this can be taken too far. If the timeout is
too low, then busy nodes will be skipped, which is bad.
Another useful SSH setting is StrictHostKeyChecking, which can be set to no to
automatically add new host keys to the known hosts files. The default, ask, is to prompt
the user to confirm they have verified the key fingerprint, which is not a suitable setting
in a large cluster environment.[4]
To pass extra options to SSH, define the HADOOP_SSH_OPTS environment variable in
hadoop-env.sh. See the ssh and ssh_config manual pages for more SSH settings.
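Combining the two SSH options discussed above, a hadoop-env.sh entry might look like the following. This is a sketch: the 5-second timeout is an illustrative value and not a recommendation, since the right figure depends on how busy the nodes are.

```shell
# Sketch for hadoop-env.sh. The timeout value is illustrative only; too low
# a value causes busy nodes to be skipped by the control scripts.
export HADOOP_SSH_OPTS="-o ConnectTimeout=5 -o StrictHostKeyChecking=no"
```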
The Hadoop control scripts can distribute configuration files to all nodes of the cluster
using rsync. This is not enabled by default, but by defining the HADOOP_MASTER setting
in hadoop-env.sh, worker daemons will rsync the tree rooted at HADOOP_MASTER to the
local node's HADOOP_INSTALL whenever the daemon starts up.
What if you have two masters, a namenode and a jobtracker, on separate machines?
You can pick one as the source and the other can rsync from it, along with all the
workers. In fact, you could use any machine, even one outside the Hadoop cluster, to
rsync from.
Because HADOOP_MASTER is unset by default, there is a bootstrapping problem: how do
we make sure hadoop-env.sh with HADOOP_MASTER set is present on worker nodes? For
small clusters, it is easy to write a small script to copy hadoop-env.sh from the master
to all of the worker nodes. For larger clusters, tools like dsh can do the copies in parallel.
Alternatively, a suitable hadoop-env.sh can be created as a part of the automated
installation script (such as Kickstart).
When starting a large cluster with rsyncing enabled, the worker nodes can overwhelm
the master node with rsync requests since the workers start at around the same time.
To avoid this, set the HADOOP_SLAVE_SLEEP setting to a small number of seconds, such
as 0.1, for one-tenth of a second. When running commands on all nodes of the cluster,
the master will sleep for this period between invoking the command on each worker
machine in turn.
[4] For more discussion on the security implications of SSH Host Keys, consult the article "SSH Host Key
Protection" by Brian Hatch at http://www.securityfocus.com/infocus/1806.
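The rsync-related settings might be written in hadoop-env.sh as follows. This is a sketch under stated assumptions: the host name confmaster and the path /opt/hadoop are hypothetical placeholders for the machine and tree to rsync from, written in host:path form.

```shell
# Sketch for hadoop-env.sh. The host confmaster and the path /opt/hadoop
# are hypothetical placeholders for the rsync source tree.
export HADOOP_MASTER=confmaster:/opt/hadoop
# Pause, in seconds, between invoking a command on successive worker
# machines, so that workers do not all rsync at the same moment.
export HADOOP_SLAVE_SLEEP=0.1
```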
Important Hadoop Daemon Properties
Hadoop has a bewildering number of configuration properties. In this section, we
address the ones that you need to define (or at least understand why the default is
appropriate) for any real-world working cluster. These properties are set in the Hadoop
site files: core-site.xml, hdfs-site.xml, and mapred-site.xml. Typical examples of these
files are shown in Example 9-1, Example 9-2, and Example 9-3. Notice that most
properties are marked as final, in order to prevent them from being overridden by job
configurations. You can learn more about how to write Hadoop's configuration files in
"The Configuration API" on page 146.
Example 9-1. A typical core-site.xml configuration file
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode/</value>
    <final>true</final>
  </property>
</configuration>
Example 9-2. A typical hdfs-site.xml configuration file
<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/disk1/hdfs/name,/remote/hdfs/name</value>
    <final>true</final>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
    <final>true</final>
  </property>

  <property>
    <name>fs.checkpoint.dir</name>
    <value>/disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary</value>
    <final>true</final>
  </property>
</configuration>
Example 9-3. A typical mapred-site.xml configuration file
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker:8021</value>
    <final>true</final>
  </property>

  <property>
    <name>mapred.local.dir</name>
    <value>/disk1/mapred/local,/disk2/mapred/local</value>
    <final>true</final>
  </property>

  <property>
    <name>mapred.system.dir</name>
    <value>/tmp/hadoop/mapred/system</value>
    <final>true</final>
  </property>

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>7</value>
    <final>true</final>
  </property>

  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>7</value>
    <final>true</final>
  </property>

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx400m</value>
    <!-- Not marked as final so jobs can include JVM debugging options -->
  </property>
</configuration>
HDFS
To run HDFS, you need to designate one machine as a namenode. In this case, the
property fs.default.name is an HDFS filesystem URI, whose host is the namenode's
hostname or IP address, and port is the port that the namenode will listen on for RPCs.
If no port is specified, the default of 8020 is used.
The masters file that is used by the control scripts is not used by the
HDFS (or MapReduce) daemons to determine hostnames. In fact,
because the masters file is only used by the scripts, you can ignore it if you
don't use them.
The fs.default.name property also doubles as specifying the default filesystem. The
default filesystem is used to resolve relative paths, which are handy to use since they
save typing (and avoid hardcoding knowledge of a particular namenode's address). For
example, with the default filesystem defined in Example 9-1, the relative URI /a/b is
resolved to hdfs://namenode/a/b.
If you are running HDFS, the fact that fs.default.name is used to specify
both the HDFS namenode and the default filesystem means HDFS has
to be the default filesystem in the server configuration. Bear in mind,
however, that it is possible to specify a different filesystem as the default
in the client configuration, for convenience.
For example, if you use both HDFS and S3 filesystems, then you have
a choice of specifying either as the default in the client configuration,
which allows you to refer to the default with a relative URI and the other
with an absolute URI.
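The effect of the default filesystem on path resolution can be illustrated with a simplified model. This is a sketch only, not Hadoop's actual resolution code: the function name qualify_path is hypothetical, it handles only paths that begin with a slash or that already carry a scheme, and the s3n://mybucket URI in the usage lines is an assumed example of an absolute URI on another filesystem.

```shell
# Simplified illustration only; not Hadoop's actual path-resolution code.
qualify_path() {
  default_fs=$1    # value of fs.default.name, such as hdfs://namenode/
  path=$2
  case "$path" in
    *://*) echo "$path" ;;                      # absolute URI: used as is
    /*)    echo "${default_fs%/}${path}" ;;     # resolved against the default
    *)     echo "unsupported path: $path" >&2; return 1 ;;
  esac
}

qualify_path hdfs://namenode/ /a/b
qualify_path hdfs://namenode/ s3n://mybucket/a/b
```

The first call prints hdfs://namenode/a/b, matching the example in the text; the second returns its argument unchanged, since an absolute URI does not depend on the default filesystem.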
There are a few other configuration properties you should set for HDFS: those that set
the storage directories for the namenode and for datanodes. The property
dfs.name.dir specifies a list of directories where the namenode stores persistent
filesystem metadata (the edit log and the filesystem image). A copy of each of the metadata
files is stored in each directory for redundancy. It's common to configure
dfs.name.dir so that the namenode metadata is written to one or two local disks, and
a remote disk, such as an NFS-mounted directory. Such a setup guards against failure
of a local disk and failure of the entire namenode, since in both cases the files can be
recovered and used to start a new namenode. (The secondary namenode takes only
periodic checkpoints of the namenode, so it does not provide an up-to-date backup of
the namenode.)
You should also set the dfs.data.dir property, which specifies a list of directories for
a datanode to store its blocks. Unlike the namenode, which uses multiple directories
for redundancy, a datanode round-robins writes between its storage directories, so for
performance you should specify a storage directory for each local disk. Read
performance also benefits from having multiple disks for storage, because blocks will be
spread across them, and concurrent reads for distinct blocks will be correspondingly
spread across disks.
For maximum performance, you should mount storage disks with the
noatime option. This setting means that last accessed time information
is not written on file reads, which gives significant performance gains.
Finally, you should configure where the secondary namenode stores its checkpoints of
the filesystem. The fs.checkpoint.dir property specifies a list of directories where the
checkpoints are kept. Like the storage directories for the namenode, which keep
redundant copies of the namenode metadata, the checkpointed filesystem image is stored
in each checkpoint directory for redundancy.
Table 9-3 summarizes the important configuration properties for HDFS.
Table 9-3. Important HDFS daemon properties
Property name | Type | Default value | Description
fs.default.name | URI | file:/// | The default filesystem. The URI defines the hostname and port that the namenode's RPC server runs on. The default port is 8020. This property is set in core-site.xml.
dfs.name.dir | comma-separated directory names | ${hadoop.tmp.dir}/dfs/name | The list of directories where the namenode stores its persistent metadata. The namenode stores a copy of the metadata in each directory in the list.
dfs.data.dir | comma-separated directory names | ${hadoop.tmp.dir}/dfs/data | A list of directories where the datanode stores blocks. Each block is stored in only one of these directories.
fs.checkpoint.dir | comma-separated directory names | ${hadoop.tmp.dir}/dfs/namesecondary | A list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list.
Note that the storage directories for HDFS are under Hadoop's temporary
directory by default (the hadoop.tmp.dir property, whose default
is /tmp/hadoop-${user.name}). Therefore, it is critical that these
properties are set so that data is not lost by the system clearing out temporary
directories.
MapReduce
To run MapReduce, you need to designate one machine as a jobtracker, which on small
clusters may be the same machine as the namenode. To do this, set the
mapred.job.tracker property to the hostname or IP address and port that the jobtracker
will listen on. Note that this property is not a URI, but a host-port pair, separated by
a colon. The port number 8021 is a common choice.
During a MapReduce job, intermediate data and working files are written to temporary
local files. Since this data includes the potentially very large output of map tasks, you
need to ensure that the mapred.local.dir property, which controls the location of local
temporary storage, is configured to use disk partitions that are large enough. The
mapred.local.dir property takes a comma-separated list of directory names, and you
should use all available local disks to spread disk I/O. Typically, you will use the same
disks and partitions (but different directories) for MapReduce temporary data as you
use for datanode block storage, as governed by the dfs.data.dir property, discussed
earlier.
MapReduce uses a distributed filesystem to share files (such as the job JAR file) with
the tasktrackers that run the MapReduce tasks. The mapred.system.dir property is used
to specify a directory where these files can be stored. This directory is resolved relative
to the default filesystem (configured in fs.default.name), which is usually HDFS.
Finally, you should set the mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum properties to reflect the number of available cores on
the tasktracker machines and mapred.child.java.opts to reflect the amount of memory
available for the tasktracker child JVMs. See the discussion in "Memory"
on page 305.
Table 9-4 summarizes the important configuration properties for MapReduce.
Table 9-4. Important MapReduce daemon properties
Property name | Type | Default value | Description
mapred.job.tracker | hostname and port | local | The hostname and port that the jobtracker's RPC server runs on. If set to the default value of local, then the jobtracker is run in-process on demand when you run a MapReduce job (you don't need to start the jobtracker in this case, and in fact you will get an error if you try to start it in this mode).
mapred.local.dir | comma-separated directory names | ${hadoop.tmp.dir}/mapred/local | A list of directories where MapReduce stores intermediate data for jobs. The data is cleared out when the job ends.
mapred.system.dir | URI | ${hadoop.tmp.dir}/mapred/system | The directory relative to fs.default.name where shared files are stored, during a job run.
mapred.tasktracker.map.tasks.maximum | int | 2 | The number of map tasks that may be run on a tasktracker at any one time.
mapred.tasktracker.reduce.tasks.maximum | int | 2 | The number of reduce tasks that may be run on a tasktracker at any one time.
mapred.child.java.opts | String | -Xmx200m | The JVM options used to launch the tasktracker child process that runs map and reduce tasks. This property can be set on a per-job basis, which can be useful for setting JVM properties for debugging, for example.
mapreduce.map.java.opts | String | -Xmx200m | The JVM options used for the child process that runs map tasks. From 0.21.
mapreduce.reduce.java.opts | String | -Xmx200m | The JVM options used for the child process that runs reduce tasks. From 0.21.
Hadoop Daemon Addresses and Ports
Hadoop daemons generally run both an RPC server (Table 9-5) for communication
between daemons and an HTTP server to provide web pages for human consumption
(Table 9-6). Each server is configured by setting the network address and port number
to listen on. By specifying the network address as 0.0.0.0, Hadoop will bind to all
addresses on the machine. Alternatively, you can specify a single address to bind to. A
port number of 0 instructs the server to start on a free port: this is generally discouraged,
since it is incompatible with setting cluster-wide firewall policies.
Table 9-5. RPC server properties
Property name | Default value | Description
fs.default.name | file:/// | When set to an HDFS URI, this property determines the namenode's RPC server address and port. The default port is 8020 if not specified.
dfs.datanode.ipc.address | 0.0.0.0:50020 | The datanode's RPC server address and port.
mapred.job.tracker | local | When set to a hostname and port, this property specifies the jobtracker's RPC server address and port. A commonly used port is 8021.
mapred.task.tracker.report.address | 127.0.0.1:0 | The tasktracker's RPC server address and port. This is used by the tasktracker's child JVM to communicate with the tasktracker. Using any free port is acceptable in this case, as the server only binds to the loopback address. You should change this setting only if the machine has no loopback address.
In addition to an RPC server, datanodes run a TCP/IP server for block transfers. The
server address and port is set by the dfs.datanode.address property, and has a default
value of 0.0.0.0:50010.
Table 9-6. HTTP server properties
Property name | Default value | Description
mapred.job.tracker.http.address | 0.0.0.0:50030 | The jobtracker's HTTP server address and port.
mapred.task.tracker.http.address | 0.0.0.0:50060 | The tasktracker's HTTP server address and port.
dfs.http.address | 0.0.0.0:50070 | The namenode's HTTP server address and port.
dfs.datanode.http.address | 0.0.0.0:50075 | The datanode's HTTP server address and port.
dfs.secondary.http.address | 0.0.0.0:50090 | The secondary namenode's HTTP server address and port.
There are also settings for controlling which network interfaces the datanodes and
tasktrackers report as their IP addresses (for HTTP and RPC servers). The relevant
properties are dfs.datanode.dns.interface and mapred.tasktracker.dns.interface,
both of which are set to default, which will use the default network interface. You can
set this explicitly to report the address of a particular interface (eth0, for example).
Other Hadoop Properties
This section discusses some other properties that you might consider setting.
Cluster membership
To aid the addition and removal of nodes in the future, you can specify a file containing
a list of authorized machines that may join the cluster as datanodes or tasktrackers.
The file is specified using the dfs.hosts (for datanodes) and mapred.hosts (for
tasktrackers) properties, as well as the corresponding dfs.hosts.exclude and
mapred.hosts.exclude files used for decommissioning. See "Commissioning and
Decommissioning Nodes" on page 357 for further discussion.
Buffer size
Hadoop uses a buffer size of 4 KB (4,096 bytes) for its I/O operations. This is a
conservative setting, and with modern hardware and operating systems, you will likely see
performance benefits by increasing it; 128 KB (131,072 bytes) is a common choice. Set
this using the io.file.buffer.size property in core-site.xml.
HDFS block size
The HDFS block size is 64 MB by default, but many clusters use 128 MB (134,217,728
bytes) or even 256 MB (268,435,456 bytes) to ease memory pressure on the namenode
and to give mappers more data to work on. Set this using the dfs.block.size property
in hdfs-site.xml.
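The byte counts quoted for the block size, and for the I/O buffer size discussed earlier, follow from simple unit conversions. This is a minimal sketch; the helper names kb_to_bytes and mb_to_bytes are hypothetical.

```shell
# Hypothetical helpers for converting sizes to the byte counts quoted above.
kb_to_bytes() { echo $(( $1 * 1024 )); }
mb_to_bytes() { echo $(( $1 * 1024 * 1024 )); }

echo "io.file.buffer.size=$(kb_to_bytes 128)"   # 128 KB -> 131072
echo "dfs.block.size=$(mb_to_bytes 128)"        # 128 MB -> 134217728
echo "dfs.block.size=$(mb_to_bytes 256)"        # 256 MB -> 268435456
```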
Reserved storage space
By default, datanodes will try to use all of the space available in their storage directories.
If you want to reserve some space on the storage volumes for non-HDFS use, then you
can set dfs.datanode.du.reserved to the amount, in bytes, of space to reserve.
Trash
Hadoop filesystems have a trash facility, in which deleted files are not actually deleted,
but rather are moved to a trash folder, where they remain for a minimum period before
being permanently deleted by the system. The minimum period in minutes that a file
will remain in the trash is set using the fs.trash.interval configuration property in
core-site.xml. By default, the trash interval is zero, which disables trash.
Like in many operating systems, Hadoop's trash facility is a user-level feature, meaning
that only files that are deleted using the filesystem shell are put in the trash. Files deleted
programmatically are deleted immediately. It is possible to use the trash
programmatically, however, by constructing a Trash instance, then calling its moveToTrash() method
with the Path of the file intended for deletion. The method returns a value indicating
success; a value of false means either that trash is not enabled or that the file is already
in the trash.
When trash is enabled, each user has her own trash directory called .Trash in her home
directory. File recovery is simple: you look for the file in a subdirectory of .Trash and
move it out of the trash subtree.
HDFS will automatically delete files in trash folders, but other filesystems will not, so
you have to arrange for this to be done periodically. You can expunge the trash, which
will delete files that have been in the trash longer than their minimum period, using
the filesystem shell:
% hadoop fs -expunge
The Trash class exposes an expunge() method that has the same effect.
Job scheduler
Particularly in a multiuser MapReduce setting, consider changing the default FIFO job
scheduler to one of the more fully featured alternatives. See "Job Scheduling"
on page 204.
Reduce slow start
By default, schedulers wait until 5% of the map tasks in a job have completed before
scheduling reduce tasks for the same job. For large jobs this can cause problems with
cluster utilization, since they take up reduce slots while waiting for the map tasks to
complete. Setting mapred.reduce.slowstart.completed.maps to a higher value, such as
0.80 (80%), can help improve throughput.
Task memory limits
On a shared cluster, it shouldn't be possible for one user's errant MapReduce program
to bring down nodes in the cluster. This can happen if the map or reduce task has a
memory leak, for example, because the machine on which the tasktracker is running
will run out of memory and may affect the other running processes.
Or consider the case where a user sets mapred.child.java.opts to a large value and
causes memory pressure on other running tasks, causing them to swap. Marking this
property as final on the cluster would prevent it being changed by users in their jobs,
but there are legitimate reasons to allow some jobs to use more memory, so this is not
always an acceptable solution. Furthermore, even locking down
mapred.child.java.opts does not solve the problem, since tasks can spawn new processes
which are not constrained in their memory usage. Streaming and Pipes jobs do
exactly that, for example.
To prevent cases like these, some way of enforcing a limit on a task's memory usage is
needed. Hadoop provides two mechanisms for this. The simplest is via the Linux
ulimit command, which can be done at the operating system level (in the limits.conf
file, typically found in /etc/security), or by setting mapred.child.ulimit in the Hadoop
configuration. The value is specified in kilobytes, and should be comfortably larger
than the memory of the JVM set by mapred.child.java.opts; otherwise, the child JVM
might not start.
The second mechanism is Hadoop's task memory monitoring feature.[5] The idea is that
an administrator sets a range of allowed virtual memory limits for tasks on the cluster,
and users specify the maximum memory requirements for their jobs in the job
configuration. If a user doesn't set memory requirements for their job, then the defaults are
used (mapred.job.map.memory.mb and mapred.job.reduce.memory.mb).
This approach has a couple of advantages over the ulimit approach. First, it enforces
the memory usage of the whole task process tree, including spawned processes. Second,
it enables memory-aware scheduling, where tasks are scheduled on tasktrackers which
have enough free memory to run them. The Capacity Scheduler, for example, will
account for slot usage based on the memory settings, so that if a job's
mapred.job.map.memory.mb setting exceeds mapred.cluster.map.memory.mb then the scheduler
will allocate more than one slot on a tasktracker to run each map task for that job.
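The slot accounting described above can be sketched as a ceiling division of the job's memory requirement by the per-slot size. This is a minimal illustration, assuming memory monitoring is enabled (both values positive); the class and method names are hypothetical, and the exact rounding used by any particular scheduler is an assumption here.

```java
public class SlotAccounting {
    /**
     * Slots charged for one task: the task's memory requirement divided by
     * the per-slot memory size, rounded up (integer ceiling division).
     */
    public static int slotsPerTask(int jobMemoryMb, int slotMemoryMb) {
        if (jobMemoryMb <= 0 || slotMemoryMb <= 0) {
            throw new IllegalArgumentException("memory settings must be positive");
        }
        return (jobMemoryMb + slotMemoryMb - 1) / slotMemoryMb;
    }

    public static void main(String[] args) {
        int slotMb = 1024; // hypothetical mapred.cluster.map.memory.mb
        int jobMb = 2560;  // hypothetical mapred.job.map.memory.mb
        System.out.println("Slots per map task: " + slotsPerTask(jobMb, slotMb));
    }
}
```

A job that asks for no more than the slot size is charged a single slot, which matches the behavior of a cluster without memory-aware scheduling.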
To enable task memory monitoring you need to set all six of the properties in
Table 9-7. The default values are all -1, which means the feature is disabled.
Table 9-7. MapReduce task memory monitoring properties
Property name | Type | Default value | Description
mapred.cluster.map.memory.mb | int | -1 | The amount of virtual memory, in MB, that defines a map slot. Map tasks that require more than this amount of memory will use more than one map slot.
mapred.cluster.reduce.memory.mb | int | -1 | The amount of virtual memory, in MB, that defines a reduce slot. Reduce tasks that require more than this amount of memory will use more than one reduce slot.
mapred.job.map.memory.mb | int | -1 | The amount of virtual memory, in MB, that a map task requires to run. If a map task exceeds this limit it may be terminated and marked as failed.
mapred.job.reduce.memory.mb | int | -1 | The amount of virtual memory, in MB, that a reduce task requires to run. If a reduce task exceeds this limit it may be terminated and marked as failed.
mapred.cluster.max.map.memory.mb | int | -1 | The maximum limit that users can set mapred.job.map.memory.mb to.
mapred.cluster.max.reduce.memory.mb | int | -1 | The maximum limit that users can set mapred.job.reduce.memory.mb to.
5. YARN uses a different memory model to the one described here, and the configuration options are
different. See "Memory" on page 321.
User Account Creation
Once you have a Hadoop cluster up and running, you need to give users access to it.
This involves creating a home directory for each user and setting ownership permissions
on it:
% hadoop fs -mkdir /user/username
% hadoop fs -chown username:username /user/username
This is a good time to set space limits on the directory. The following sets a 1 TB limit
on the given user directory:
% hadoop dfsadmin -setSpaceQuota 1t /user/username
YARN Configuration
YARN is the next-generation architecture for running MapReduce (and is described in
"YARN (MapReduce 2)" on page 194). It has a different set of daemons and configuration
options to classic MapReduce (also called MapReduce 1), and in this section we
shall look at these differences and how to run MapReduce on YARN.
Under YARN you no longer run a jobtracker or tasktrackers. Instead, there is a single
resource manager running on the same machine as the HDFS namenode (for small
clusters) or on a dedicated machine, and node managers running on each worker node
in the cluster.
The YARN start-all.sh script (in the bin directory) starts the YARN daemons in the
cluster. This script will start a resource manager (on the machine the script is run on),
and a node manager on each machine listed in the slaves file.
YARN also has a job history server daemon that provides users with details of past job
runs, and a web app proxy server for providing a secure way for users to access the UI
provided by YARN applications. In the case of MapReduce, the web UI served by the
proxy provides information about the current job you are running, similar to the one
described in "The MapReduce Web UI" on page 164. By default the web app proxy
server runs in the same process as the resource manager, but it may be configured to
run as a standalone daemon.
YARN has its own set of configuration files, listed in Table 9-8; these are used in addition
to those in Table 9-1.
Table 9-8. YARN configuration files
Filename | Format | Description
yarn-env.sh | Bash script | Environment variables that are used in the scripts to run YARN.
yarn-site.xml | Hadoop configuration XML | Configuration settings for YARN daemons: the resource manager, the job history server, the webapp proxy server, and the node managers.
Important YARN Daemon Properties
When running MapReduce on YARN the mapred-site.xml file is still used for general
MapReduce properties, although the jobtracker and tasktracker-related properties are
not used. None of the properties in Table 9-4 are applicable to YARN, except for
mapred.child.java.opts (and the related properties mapreduce.map.java.opts and
mapreduce.reduce.java.opts which apply only to map or reduce tasks, respectively). The
JVM options specified in this way are used to launch the YARN child process that runs
map or reduce tasks.
The configuration files in Example 9-4 show some of the important configuration
properties for running MapReduce on YARN.
Example 9-4. An example set of site configuration files for running MapReduce on YARN
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx400m</value>
    <!-- Not marked as final so jobs can include JVM debugging options -->
  </property>
</configuration>

<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>resourcemanager:8040</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/disk1/nm-local-dir,/disk2/nm-local-dir</value>
    <final>true</final>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
</configuration>
The YARN resource manager address is controlled via yarn.resourcemanager.address,
which takes the form of a host-port pair. In a client configuration this
property is used to connect to the resource manager (using RPC), and in addition the
mapreduce.framework.name property must be set to yarn for the client to use YARN
rather than the local job runner.
Although YARN does not honor mapred.local.dir, it has an equivalent property called
yarn.nodemanager.local-dirs, which allows you to specify which local disks to store
intermediate data on. It is specified by a comma-separated list of local directory paths,
which are used in a round-robin fashion.
YARN doesn't have tasktrackers to serve map outputs to reduce tasks, so for this function
it relies on shuffle handlers, which are long-running auxiliary services running in
node managers. Since YARN is a general-purpose service, the shuffle handlers need to
be explicitly enabled in yarn-site.xml by setting the yarn.nodemanager.aux-services
property to mapreduce.shuffle.
Table 9-9 summarizes the important configuration properties for YARN.
Table 9-9. Important YARN daemon properties
Property name | Type | Default value | Description
yarn.resourcemanager.address | hostname and port | 0.0.0.0:8040 | The hostname and port that the resource manager's RPC server runs on.
yarn.nodemanager.local-dirs | comma-separated directory names | /tmp/nm-local-dir | A list of directories where node managers allow containers to store intermediate data. The data is cleared out when the application ends.
yarn.nodemanager.aux-services | comma-separated service names | | A list of auxiliary services run by the node manager. A service is implemented by the class defined by the property yarn.nodemanager.aux-services.service-name.class. By default no auxiliary services are specified.
yarn.nodemanager.resource.memory-mb | int | 8192 | The amount of physical memory (in MB) which may be allocated to containers being run by the node manager.
yarn.nodemanager.vmem-pmem-ratio | float | 2.1 | The ratio of virtual to physical memory for containers. Virtual memory usage may exceed the allocation by this amount.
Memory
YARN treats memory in a more fine-grained manner than the slot-based model used
in the classic implementation of MapReduce. Rather than specifying a fixed maximum
number of map and reduce slots that may run on a tasktracker node at once, YARN
allows applications to request an arbitrary amount of memory (within limits) for a task.
In the YARN model, node managers allocate memory from a pool, so the number of
tasks that are running on a particular node depends on the sum of their memory
requirements, and not simply on a fixed number of slots.
The slot-based model can lead to cluster under-utilization, since the proportion of map
slots to reduce slots is fixed as a cluster-wide configuration. However, the number of
map versus reduce slots that are in demand changes over time: at the beginning of a job
only map slots are needed, while at the end of the job only reduce slots are needed. On
larger clusters with many concurrent jobs the variation in demand for a particular type
of slot may be less pronounced, but there is still wastage. YARN avoids this problem
by not distinguishing between the two types of slot.
The considerations for how much memory to dedicate to a node manager for running
containers are similar to those discussed in "Memory" on page 305. Each Hadoop
daemon uses 1,000 MB, so for a datanode and a node manager the total is 2,000 MB.
Set aside enough for other processes that are running on the machine, and the remainder
can be dedicated to the node manager's containers, by setting the configuration
property yarn.nodemanager.resource.memory-mb to the total allocation in MB. (The default
is 8,192 MB.)
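The sizing rule above is simple arithmetic, sketched here for a single worker machine. The class name, method name, and the example figures for physical memory and other processes are hypothetical; only the 1,000 MB per-daemon figure comes from the text.

```java
public class NodeManagerMemory {
    /** Per-daemon memory figure quoted in the text, in MB. */
    static final int DAEMON_MB = 1000;

    /**
     * Memory (MB) left for containers after reserving space for the datanode,
     * the node manager, and any other processes on the machine. The result is
     * a candidate value for yarn.nodemanager.resource.memory-mb.
     */
    public static int containerPoolMb(int physicalMb, int otherProcessesMb) {
        int pool = physicalMb - 2 * DAEMON_MB - otherProcessesMb;
        if (pool <= 0) {
            throw new IllegalArgumentException("no memory left for containers");
        }
        return pool;
    }

    public static void main(String[] args) {
        int physicalMb = 16384;      // hypothetical machine with 16 GB of RAM
        int otherProcessesMb = 2000; // hypothetical allowance for the OS and other services
        System.out.println("Container pool (MB): "
            + containerPoolMb(physicalMb, otherProcessesMb));
    }
}
```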
The next step is to determine how to set memory options for individual jobs. There are
two controls: mapred.child.java.opts which allows you to set the JVM heap size of the
map or reduce task; and mapreduce.map.memory.mb (or mapreduce.reduce.memory.mb)
which is used to specify how much memory you need for map (or reduce) task
containers. The latter setting is used by the scheduler when negotiating for resources in the
cluster, and by the node manager, which runs and monitors the task containers.
For example, suppose that mapred.child.java.opts is set to -Xmx800m, and
mapreduce.map.memory.mb is left at its default value of 1,024 MB. When a map task is run, the
node manager will allocate a 1,024 MB container (decreasing the size of its pool by that
amount for the duration of the task) and launch the task JVM configured with an 800
MB maximum heap size. Note that the JVM process will have a larger memory footprint
than the heap size, and the overhead will depend on such things as the native libraries
that are in use, the size of the permanent generation space, and so on. The important
thing is that the physical memory used by the JVM process, including any processes
that it spawns, such as Streaming or Pipes processes, does not exceed its allocation
(1,024 MB). If a container uses more memory than it has been allocated, then it may be
terminated by the node manager and marked as failed.
Schedulers may impose a minimum or maximum on memory allocations. For example,
for the capacity scheduler the default minimum is 1024 MB (set by
yarn.scheduler.capacity.minimum-allocation-mb), and the default maximum is 10240 MB (set by
yarn.scheduler.capacity.maximum-allocation-mb).
There are also virtual memory constraints that a container must meet. If a container's
virtual memory usage exceeds a given multiple of the allocated physical memory, then
the node manager may terminate the process. The multiple is expressed by the
yarn.nodemanager.vmem-pmem-ratio property, which defaults to 2.1. In the example
above, the virtual memory threshold above which the task may be terminated is 2,150
MB, which is 2.1 × 1,024 MB.
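The two limits in the running example can be worked through in a few lines. This sketch uses the figures from the text (a 1,024 MB container, an 800 MB maximum heap, and a ratio of 2.1); the class and method names are hypothetical.

```java
public class ContainerLimits {
    /** Virtual memory ceiling in MB: allocated physical memory times the ratio. */
    public static double virtualLimitMb(int containerMb, double vmemPmemRatio) {
        return containerMb * vmemPmemRatio;
    }

    /**
     * Physical memory (MB) left in the container once the maximum heap is
     * accounted for. This has to cover non-heap JVM overhead and any
     * processes the task spawns.
     */
    public static int nonHeapHeadroomMb(int containerMb, int maxHeapMb) {
        return containerMb - maxHeapMb;
    }

    public static void main(String[] args) {
        int containerMb = 1024; // mapreduce.map.memory.mb in the example
        int maxHeapMb = 800;    // from -Xmx800m in the example
        double ratio = 2.1;     // yarn.nodemanager.vmem-pmem-ratio default

        System.out.println("Non-heap headroom (MB): "
            + nonHeapHeadroomMb(containerMb, maxHeapMb));
        System.out.println("Virtual memory limit (MB): "
            + String.format("%.1f", virtualLimitMb(containerMb, ratio)));
    }
}
```

If the headroom figure looks too small for a job that uses native libraries or Streaming, the container size rather than the heap size is the setting to raise.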
When configuring memory parameters it's very useful to be able to monitor a task's
actual memory usage during a job run, and this is possible via MapReduce task counters.
The counters PHYSICAL_MEMORY_BYTES, VIRTUAL_MEMORY_BYTES, and
COMMITTED_HEAP_BYTES (described in Table 8-2) provide snapshot values of memory usage and
are therefore suitable for observation during the course of a task attempt.
YARN Daemon Addresses and Ports
YARN daemons run one or more RPC servers and HTTP servers, details of which are
covered in Table 9-10 and Table 9-11.
Table 9-10. YARN RPC server properties
Property name | Default value | Description
yarn.resourcemanager.address | 0.0.0.0:8040 | The resource manager's RPC server address and port. This is used by the client (typically outside the cluster) to communicate with the resource manager.
yarn.resourcemanager.admin.address | 0.0.0.0:8141 | The resource manager's admin RPC server address and port. This is used by the admin client (invoked with yarn rmadmin, typically run outside the cluster) to communicate with the resource manager.
yarn.resourcemanager.scheduler.address | 0.0.0.0:8030 | The resource manager scheduler's RPC server address and port. This is used by (in-cluster) application masters to communicate with the resource manager.
yarn.resourcemanager.resource-tracker.address | 0.0.0.0:8025 | The resource manager resource tracker's RPC server address and port. This is used by the (in-cluster) node managers to communicate with the resource manager.
yarn.nodemanager.address | 0.0.0.0:0 | The node manager's RPC server address and port. This is used by (in-cluster) application masters to communicate with node managers.
yarn.nodemanager.localizer.address | 0.0.0.0:4344 | The node manager localizer's RPC server address and port.
mapreduce.jobhistory.address | 0.0.0.0:10020 | The job history server's RPC server address and port. This is used by the client (typically outside the cluster) to query job history. This property is set in mapred-site.xml.
Table 9-11. YARN HTTP server properties
Property name | Default value | Description
yarn.resourcemanager.webapp.address | 0.0.0.0:8088 | The resource manager's HTTP server address and port.
yarn.nodemanager.webapp.address | 0.0.0.0:9999 | The node manager's HTTP server address and port.
yarn.web-proxy.address | | The web app proxy server's HTTP server address and port. If not set (the default) then the web app proxy server will run in the resource manager process.
mapreduce.jobhistory.webapp.address | 0.0.0.0:19888 | The job history server's HTTP server address and port. This property is set in mapred-site.xml.
mapreduce.shuffle.port | 8080 | The shuffle handler's HTTP port number. This is used for serving map outputs, and is not a user-accessible web UI. This property is set in mapred-site.xml.
Security
Early versions of Hadoop assumed that HDFS and MapReduce clusters would be used
by a group of cooperating users within a secure environment. The measures for
restricting access were designed to prevent accidental data loss, rather than to prevent
unauthorized access to data. For example, the file permissions system in HDFS prevents
one user from accidentally wiping out the whole filesystem from a bug in a program,
or by mistakenly typing hadoop fs -rmr /, but it doesn't prevent a malicious user from
assuming root's identity (see "Setting User Identity" on page 150) to access or delete
any data in the cluster.
In security parlance, what was missing was a secure authentication mechanism to assure
Hadoop that the user seeking to perform an operation on the cluster is who they claim
to be and therefore trusted. HDFS file permissions provide only a mechanism for
authorization, which controls what a particular user can do to a particular file. For
example, a file may only be readable by a group of users, so anyone not in that group
is not authorized to read it. However, authorization is not enough by itself, since the
system is still open to abuse via spoofing by a malicious user who can gain network
access to the cluster.
It's common to restrict access to data that contains personally identifiable information
(such as an end user's full name or IP address) to a small set of users (of the cluster)
within the organization, who are authorized to access such information. Less sensitive
(or anonymized) data may be made available to a larger set of users. It is convenient to
host a mix of datasets with different security levels on the same cluster (not least because
it means the datasets with lower security levels can be shared). However, to meet
regulatory requirements for data protection, secure authentication must be in place for
shared clusters.
This is the situation that Yahoo! faced in 2009, which led a team of engineers there to
implement secure authentication for Hadoop. In their design, Hadoop itself does not
manage user credentials, since it relies on Kerberos, a mature open-source network
authentication protocol, to authenticate the user. In turn, Kerberos doesn't manage
permissions. Kerberos says that a user is who they say they are; it's Hadoop's job to
determine whether that user has permission to perform a given action. There's a lot to
Kerberos, so here we only cover enough to use it in the context of Hadoop, referring
readers who want more background to Kerberos: The Definitive Guide by Jason Garman
(O'Reilly, 2003).
Which Versions of Hadoop Support Kerberos Authentication?
Kerberos for authentication was first added in the 0.20.20x series of releases of Apache
Hadoop. See Table 1-2 for which recent release series support this feature.
Kerberos and Hadoop
At a high level, there are three steps that a client must take to access a service when
using Kerberos, each of which involves a message exchange with a server:
1. Authentication. The client authenticates itself to the Authentication Server and
receives a timestamped Ticket-Granting Ticket (TGT).
2. Authorization. The client uses the TGT to request a service ticket from the Ticket
Granting Server.
3. Service Request. The client uses the service ticket to authenticate itself to the server
that is providing the service the client is using. In the case of Hadoop, this might
be the namenode or the jobtracker.
Together, the Authentication Server and the Ticket Granting Server form the Key
Distribution Center (KDC). The process is shown graphically in Figure 9-2.
Figure 9-2. The three-step Kerberos ticket exchange protocol
The authorization and service request steps are not user-level actions: the client
performs these steps on the user's behalf. The authentication step, however, is normally
carried out explicitly by the user using the kinit command, which will prompt for a
password. However, this doesn't mean you need to enter your password every time
you run a job or access HDFS, since TGTs last for 10 hours by default (and can be
renewed for up to a week). It's common to automate authentication at operating system
login time, thereby providing single sign-on to Hadoop.
In cases where you don't want to be prompted for a password (for running an
unattended MapReduce job, for example), you can create a Kerberos keytab file using
the ktutil command. A keytab is a file that stores passwords and may be supplied to
kinit with the -t option.
An example
Let's look at an example of the process in action. The first step is to enable Kerberos
authentication by setting the hadoop.security.authentication property in
core-site.xml to kerberos.[6] The default setting is simple, which signifies that the old
backwards-compatible (but insecure) behavior of using the operating system user name
to determine identity should be employed.
6. To use Kerberos authentication with Hadoop, you need to install, configure, and run a KDC (Hadoop
does not come with one). Your organization may already have a KDC you can use (an Active Directory
installation, for example); if not, you can set up an MIT Kerberos 5 KDC using the instructions in the
Linux Security Cookbook (O'Reilly, 2003).
We also need to enable service-level authorization by setting
hadoop.security.authorization to true in the same file. You may configure Access Control Lists (ACLs) in the
hadoop-policy.xml configuration file to control which users and groups have permission
to connect to each Hadoop service. Services are defined at the protocol level, so there
are ones for MapReduce job submission, namenode communication, and so on. By
default, all ACLs are set to *, which means that all users have permission to access each
service, but on a real cluster you should lock the ACLs down to only those users and
groups that should have access.
The format for an ACL is a comma-separated list of usernames, followed by whitespace,
followed by a comma-separated list of group names. For example, the ACL
preston,howard directors,inventors would authorize access to users named preston
or howard, or in groups directors or inventors.
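To illustrate the ACL format just described, here is a minimal, self-contained parser and access check. It is a sketch for explanation only, not Hadoop's own implementation: the ServiceAcl class and its methods are hypothetical names, and the caller is assumed to supply the user's group memberships.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class ServiceAcl {
    private final boolean allowAll;
    private final Set<String> users = new HashSet<>();
    private final Set<String> groups = new HashSet<>();

    /** Parses "user1,user2 group1,group2"; a lone "*" admits everyone. */
    public ServiceAcl(String acl) {
        allowAll = acl.trim().equals("*");
        if (!allowAll) {
            // Users come first, then whitespace, then (optionally) groups.
            String[] parts = acl.split("\\s+", 2);
            addNames(users, parts[0]);
            if (parts.length > 1) {
                addNames(groups, parts[1]);
            }
        }
    }

    private static void addNames(Set<String> target, String commaList) {
        for (String name : commaList.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                target.add(trimmed);
            }
        }
    }

    /** True if the user is listed by name or belongs to a listed group. */
    public boolean isAllowed(String user, Set<String> userGroups) {
        if (allowAll || users.contains(user)) {
            return true;
        }
        for (String group : userGroups) {
            if (groups.contains(group)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ServiceAcl acl = new ServiceAcl("preston,howard directors,inventors");
        Set<String> noGroups = Collections.emptySet();
        Set<String> inventors = new HashSet<>(Arrays.asList("inventors"));
        System.out.println("preston, no groups: " + acl.isAllowed("preston", noGroups));
        System.out.println("alice, inventors:   " + acl.isAllowed("alice", inventors));
        System.out.println("alice, no groups:   " + acl.isAllowed("alice", noGroups));
    }
}
```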
With Kerberos authentication turned on, let's see what happens when we try to copy
a local file to HDFS:
% hadoop fs -put quangle.txt .
10/07/03 15:44:58 WARN ipc.Client: Exception encountered while connecting to the
server: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSEx
ception: No valid credentials provided (Mechanism level: Failed to find any Ker
beros tgt)]
Bad connection to FS. command aborted. exception: Call to localhost/127.0.0.1:80
20 failed on local exception: java.io.IOException: javax.security.sasl.SaslExcep
tion: GSS initiate failed [Caused by GSSException: No valid credentials provided
(Mechanism level: Failed to find any Kerberos tgt)]
The operation fails, since we don't have a Kerberos ticket. We can get one by
authenticating to the KDC, using kinit:
% kinit
Password for hadoop-user@LOCALDOMAIN: password
% hadoop fs -put quangle.txt .
% hadoop fs -stat %n quangle.txt
quangle.txt
And we see that the file is successfully written to HDFS. Notice that even though we
carried out two filesystem commands, we only needed to call kinit once, since the
Kerberos ticket is valid for 10 hours (use the klist command to see the expiry time of
your tickets and kdestroy to invalidate your tickets). After we get a ticket, everything
works just as normal.
Delegation Tokens
In a distributed system like HDFS or MapReduce, there are many client-server
interactions, each of which must be authenticated. For example, an HDFS read operation
will involve multiple calls to the namenode and calls to one or more datanodes. Instead
of using the three-step Kerberos ticket exchange protocol to authenticate each call,
which would present a high load on the KDC on a busy cluster, Hadoop uses delegation
tokens to allow later authenticated access without having to contact the KDC again.
Delegation tokens are created and used transparently by Hadoop on behalf of users, so
there's no action you need to take as a user beyond using kinit to sign in. However, it's
useful to have a basic idea of how they are used.
A delegation token is generated by the server (the namenode in this case), and can be
thought of as a shared secret between the client and the server. On the first RPC call to
the namenode, the client has no delegation token, so it uses Kerberos to authenticate,
and as a part of the response it gets a delegation token from the namenode. In
subsequent calls, it presents the delegation token, which the namenode can verify (since it
generated it using a secret key), and hence the client is authenticated to the server.
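The idea that a server can verify a token it generated with its own secret key can be sketched with a keyed hash (HMAC). This is an illustration of the general technique under stated assumptions, not Hadoop's token format or code: the TokenSketch class, the identifier string, and the choice of HmacSHA1 are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class TokenSketch {
    private static final String ALGORITHM = "HmacSHA1"; // assumption for this sketch
    private final byte[] secretKey = new byte[20];

    public TokenSketch() {
        // The secret key is generated on the server and never sent to clients.
        new SecureRandom().nextBytes(secretKey);
    }

    /** Server side: derive a token from an identifier using the secret key. */
    public byte[] issue(String identifier) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(new SecretKeySpec(secretKey, ALGORITHM));
            return mac.doFinal(identifier.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC unavailable", e);
        }
    }

    /** Server side: recompute the token and compare it with the one presented. */
    public boolean verify(String identifier, byte[] presented) {
        return MessageDigest.isEqual(issue(identifier), presented);
    }

    public static void main(String[] args) {
        TokenSketch server = new TokenSketch();
        String id = "owner=hadoop-user,seq=1"; // hypothetical identifier format
        byte[] token = server.issue(id);       // returned after the Kerberos-authenticated call

        System.out.println("Valid token accepted:    " + server.verify(id, token));
        token[0] ^= 1;                         // simulate tampering
        System.out.println("Tampered token accepted: " + server.verify(id, token));
    }
}
```

Because verification only needs the server's own key, later calls can be authenticated without another round trip to the KDC.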
When it wants to perform operations on HDFS blocks, the client uses a special kind of
delegation token, called a block access token, that the namenode passes to the client in
response to a metadata request. The client uses the block access token to authenticate
itself to datanodes. This is possible only because the namenode shares its secret key
used to generate the block access token with datanodes (which it sends in heartbeat
messages), so that they can verify block access tokens. Thus, an HDFS block may only
be accessed by a client with a valid block access token from a namenode. This closes
the security hole in unsecured Hadoop where only the block ID was needed to gain
access to a block. This property is enabled by setting dfs.block.access.token.enable
to true.
In MapReduce, job resources and metadata (such as JAR files, input splits, configuration
files) are shared in HDFS for the jobtracker to access, and user code runs on the
tasktrackers and accesses files on HDFS (the process is explained in "Anatomy of a
MapReduce Job Run" on page 187). Delegation tokens are used by the jobtracker and
tasktrackers to access HDFS during the course of the job. When the job has finished,
the delegation tokens are invalidated.
Delegation tokens are automatically obtained for the default HDFS instance, but if your
job needs to access other HDFS clusters, then you can have the delegation tokens for
these loaded by setting the mapreduce.job.hdfs-servers job property to a
comma-separated list of HDFS URIs.
Other Security Enhancements
Security has been tightened throughout HDFS and MapReduce to protect against
unauthorized access to resources.[7] The more notable changes are listed here:
• Tasks can be run using the operating system account for the user who submitted
  the job, rather than the user running the tasktracker. This means that the operating
  system is used to isolate running tasks, so they can't send signals to each other (to
  kill another user's tasks, for example), and so local information, such as task data,
  is kept private via local file system permissions.
  This feature is enabled by setting mapred.task.tracker.task-controller to
  org.apache.hadoop.mapred.LinuxTaskController.[8] In addition, administrators
  need to ensure that each user is given an account on every node in the cluster
  (typically using LDAP).
• When tasks are run as the user who submitted the job, the distributed cache
  ("Distributed Cache" on page 288) is secure: files that are world-readable are put
  in a shared cache (the insecure default), otherwise they go in a private cache, only
  readable by the owner.
• Users can view and modify only their own jobs, not others. This is enabled by
  setting mapred.acls.enabled to true. There are two job configuration properties,
  mapreduce.job.acl-view-job and mapreduce.job.acl-modify-job, which may be set
  to a comma-separated list of users to control who may view or modify a particular
  job.
• The shuffle is secure, preventing a malicious user from requesting another user's
  map outputs. However, the shuffle is not encrypted, so it is subject to malicious
  sniffing.
• When appropriately configured, it's no longer possible for a malicious user to run
  a rogue secondary namenode, datanode, or tasktracker that can join the cluster
  and potentially compromise data stored in the cluster. This is enforced by requiring
  daemons to authenticate with the master node they are connecting to.
  To enable this feature, you first need to configure Hadoop to use a keytab previously
  generated with the ktutil command. For a datanode, for example, you would
  set the dfs.datanode.keytab.file property to the keytab filename and
  dfs.datanode.kerberos.principal to the username to use for the datanode. Finally, the ACL
  for the DataNodeProtocol (which is used by datanodes to communicate with the
  namenode) must be set in hadoop-policy.xml, by restricting
  security.datanode.protocol.acl to the datanode's username.
• A datanode may be run on a privileged port (one lower than 1024), so a client may
  be reasonably sure that it was started securely.
• A task may only communicate with its parent tasktracker, thus preventing an
  attacker from obtaining MapReduce data from another user's job.
7. At the time of writing, other projects like HBase and Hive had not been integrated with this security model.
One area that hasn't yet been addressed in the security work is encryption: neither RPC
nor block transfers are encrypted. HDFS blocks are not stored in an encrypted form
either. These features are planned for a future release, and in fact, encrypting the data
stored in HDFS could be carried out in existing versions of Hadoop by the application
itself (by writing an encryption CompressionCodec, for example).
8. LinuxTaskController uses a setuid executable called task-controller found in the bin directory. You should
ensure that this binary is owned by root and has the setuid bit set (with chmod +s).
Benchmarking a Hadoop Cluster
Is the cluster set up correctly? The best way to answer this question is empirically: run
some jobs and confirm that you get the expected results. Benchmarks make good tests,
as you also get numbers that you can compare with other clusters as a sanity check on
whether your new cluster is performing roughly as expected. And you can tune a cluster
using benchmark results to squeeze the best performance out of it. This is often done
with monitoring systems in place ("Monitoring" on page 349), so you can see how
resources are being used across the cluster.
To get the best results, you should run benchmarks on a cluster that is not being used
by others. In practice, this is just before it is put into service and users start relying on
it. Once users have periodically scheduled jobs on a cluster, it is generally impossible
to find a time when the cluster is not being used (unless you arrange downtime with
users), so you should run benchmarks to your satisfaction before this happens.
Experience has shown that most hardware failures for new systems are hard drive
failures. By running I/O intensive benchmarks, such as the ones described next, you
can "burn in" the cluster before it goes live.
Hadoop Benchmarks
Hadoop comes with several benchmarks that you can run very easily with minimal
setup cost. Benchmarks are packaged in the test JAR file, and you can get a list of them,
with descriptions, by invoking the JAR file with no arguments:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar
Most of the benchmarks show usage instructions when invoked with no arguments.
For example:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO
TestFDSIO.0.0.4
Usage: TestFDSIO -read | -write | -clean [-nrFiles N] [-fileSize MB] [-resFile
resultFileName] [-bufferSize Bytes]
Benchmarking HDFS with TestDFSIO
TestDFSIO tests the I/O performance of HDFS. It does this by using a MapReduce job
as a convenient way to read or write files in parallel. Each file is read or written in a
separate map task, and the output of the map is used for collecting statistics relating
to the file just processed. The statistics are accumulated in the reduce to produce a
summary.
The following command writes 10 files of 1,000 MB each:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -write -nrFiles 10
-fileSize 1000
At the end of the run, the results are written to the console and also recorded in a local
file (which is appended to, so you can rerun the benchmark and not lose old results):
% cat TestDFSIO_results.log
----- TestDFSIO ----- : write
Date & time: Sun Apr 12 07:14:09 EDT 2009
Number of files: 10
Total MBytes processed: 10000
Throughput mb/sec: 7.796340865378244
Average IO rate mb/sec: 7.8862199783325195
IO rate std deviation: 0.9101254683525547
Test exec time sec: 163.387
The files are written under the /benchmarks/TestDFSIO directory by default (this can
be changed by setting the test.build.data system property), in a directory called
io_data.
To run a read benchmark, use the -read argument. Note that these files must already
exist (having been written by TestDFSIO -write):
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -read -nrFiles 10
-fileSize 1000
Here are the results for a real run:
----- TestDFSIO ----- : read
Date & time: Sun Apr 12 07:24:28 EDT 2009
Number of files: 10
Total MBytes processed: 10000
Throughput mb/sec: 80.25553361904304
Average IO rate mb/sec: 98.6801528930664
IO rate std deviation: 36.63507598174921
Test exec time sec: 47.624
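Because the results file is appended to on every run, it can be handy to pull the numbers out programmatically when comparing runs. The following is a minimal sketch, not part of Hadoop: the function name parse_dfsio_results is our own, and it assumes the log keeps exactly the "key: value" layout shown above.

```python
import re

def parse_dfsio_results(text):
    """Split a TestDFSIO results log into one dict per run (hypothetical helper)."""
    runs = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        # A header such as "----- TestDFSIO ----- : write" starts a new run.
        header = re.match(r"^-+\s*TestDFSIO\s*-+\s*:\s*(\w+)$", line)
        if header:
            current = {"operation": header.group(1)}
            runs.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only, so timestamps stay intact.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return runs

SAMPLE = """\
----- TestDFSIO ----- : write
Date & time: Sun Apr 12 07:14:09 EDT 2009
Number of files: 10
Total MBytes processed: 10000
Throughput mb/sec: 7.796340865378244
Test exec time sec: 163.387
----- TestDFSIO ----- : read
Date & time: Sun Apr 12 07:24:28 EDT 2009
Number of files: 10
Total MBytes processed: 10000
Throughput mb/sec: 80.25553361904304
Test exec time sec: 47.624
"""

runs = parse_dfsio_results(SAMPLE)
for run in runs:
    print(run["operation"], float(run["Throughput mb/sec"]))
```

In practice you would read TestDFSIO_results.log from the directory the benchmark was launched in rather than use an inline sample.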
When you've finished benchmarking, you can delete all the generated files from HDFS
using the -clean argument:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -clean
Benchmarking MapReduce with Sort
Hadoop comes with a MapReduce program that does a partial sort of its input. It is
very useful for benchmarking the whole MapReduce system, as the full input dataset
is transferred through the shuffle. The three steps are: generate some random data,
perform the sort, then validate the results.
First we generate some random data using RandomWriter. It runs a MapReduce job
with 10 maps per node, and each map generates (approximately) 10 GB of random
binary data, with keys and values of various sizes. You can change these values if you
like by setting the properties test.randomwriter.maps_per_host and
test.randomwrite.bytes_per_map. There are also settings for the size ranges of the keys
and values; see RandomWriter for details.
Here's how to invoke RandomWriter (found in the example JAR file, not the test one) to
write its output to a directory called random-data:
% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar randomwriter random-data
Next we can run the Sort program:
% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar sort random-data sorted-data
The overall execution time of the sort is the metric we are interested in, but it's
instructive to watch the job's progress via the web UI (http://jobtracker-host:50030/),
where you can get a feel for how long each phase of the job takes. Adjusting the
parameters mentioned in "Tuning a Job" on page 176 is a useful exercise, too.
As a final sanity check, we validate that the data in sorted-data is, in fact, correctly
sorted:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar testmapredsort -sortInput random-data \
-sortOutput sorted-data
This command runs the SortValidator program, which performs a series of checks on
the unsorted and sorted data to check whether the sort is accurate. It reports the
outcome to the console at the end of its run:
SUCCESS! Validated the MapReduce framework's 'sort' successfully.
Other benchmarks
There are many more Hadoop benchmarks, but the following are widely used:
• MRBench (invoked with mrbench) runs a small job a number of times. It acts as a good
  counterpoint to sort, as it checks whether small job runs are responsive.
• NNBench (invoked with nnbench) is useful for load testing namenode hardware.
• Gridmix is a suite of benchmarks designed to model a realistic cluster workload,
  by mimicking a variety of data-access patterns seen in practice. See the documentation
  in the distribution for how to run Gridmix, and the blog post at
  http://developer.yahoo.net/blogs/hadoop/2010/04/gridmix3_emulating_production.html
  for more background.[9]
User Jobs
For tuning, it is best to include a few jobs that are representative of the jobs that your
users run, so your cluster is tuned for these and not just for the standard benchmarks.
If this is your first Hadoop cluster and you don't have any user jobs yet, then Gridmix
is a good substitute.
When running your own jobs as benchmarks, you should select a dataset for your user
jobs that you use each time you run the benchmarks to allow comparisons between
runs. When you set up a new cluster, or upgrade a cluster, you will be able to use the
same dataset to compare the performance with previous runs.
9. In a similar vein, PigMix is a set of benchmarks for Pig available at
https://cwiki.apache.org/confluence/display/PIG/PigMix.
Hadoop in the Cloud
Although many organizations choose to run Hadoop in-house, it is also popular to run
Hadoop in the cloud on rented hardware or as a service. For instance, Cloudera offers
tools for running Hadoop (see Appendix B) in a public or private cloud, and Amazon
has a Hadoop cloud service called Elastic MapReduce.
In this section, we look at running Hadoop on Amazon EC2, which is a great way to
try out your own Hadoop cluster on a low-commitment, trial basis.
Hadoop on Amazon EC2
Amazon Elastic Compute Cloud (EC2) is a computing service that allows customers
to rent computers (instances) on which they can run their own applications. A customer
can launch and terminate instances on demand, paying by the hour for active instances.
The Apache Whirr project (http://whirr.apache.org/) provides a Java API and a set of
scripts that make it easy to run Hadoop on EC2 and other cloud providers.[10] The scripts
allow you to perform such operations as launching or terminating a cluster, or listing
the running instances in a cluster.
Running Hadoop on EC2 is especially appropriate for certain workflows. For example,
if you store data on Amazon S3, then you can run a cluster on EC2 and run MapReduce
jobs that read the S3 data and write output back to S3, before shutting down the cluster.
If you're working with longer-lived clusters, you might copy S3 data onto HDFS
running on EC2 for more efficient processing, as HDFS can take advantage of data locality,
but S3 cannot (since S3 storage is not collocated with EC2 nodes).
Setup
First install Whirr by downloading a recent release tarball, and unpacking it on the
machine you want to launch the cluster from, as follows:
% tar xzf whirr-x.y.z.tar.gz
10. There are also bash scripts in the src/contrib/ec2 subdirectory of the Hadoop distribution, but these are
deprecated in favor of Whirr.
Whirr uses SSH to communicate with machines running in the cloud, so it's a good
idea to generate an SSH keypair for exclusive use with Whirr. Here we create an RSA
keypair with an empty passphrase, stored in a file called id_rsa_whirr in the current
user's .ssh directory:
% ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr
Do not confuse the Whirr SSH keypair with any certificates, private
keys, or SSH keypairs associated with your Amazon Web Services
account. Whirr is designed to work with many cloud providers, and it
must have access to both the public and private SSH key of a
passphrase-less keypair that's read from the local filesystem. In practice, it's simplest
to generate a new keypair for Whirr, as we did here.
We need to tell Whirr our cloud provider credentials. We can export them as
environment variables as follows, although you can alternatively specify them on the command
line, or in the configuration file for the service.
% export AWS_ACCESS_KEY_ID='...'
% export AWS_SECRET_ACCESS_KEY='...'
Launching a cluster
We are now ready to launch a cluster. Whirr comes with several recipe files for
launching common service configurations, and here we use the recipe to run Hadoop
on EC2:
% bin/whirr launch-cluster --config recipes/hadoop-ec2.properties \
--private-key-file ~/.ssh/id_rsa_whirr
The launch-cluster command provisions the cloud instances and starts the services
running on them, before returning control to the user.
Configuration
Before we start using the cluster, let's look at Whirr configuration in more detail.
Configuration parameters are passed to Whirr commands as bundles in a configuration file
specified by the --config option, or individually using command line arguments, like
the --private-key-file argument we used to indicate the location of the SSH private
key file.
The recipe file is actually just a Java properties file that defines a number of Whirr
properties. Let's step through the salient properties from hadoop-ec2.properties,
starting with the two properties that define the cluster and the services running on it:
whirr.cluster-name=hadoop
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,5 hadoop-datanode+hadoop-tasktracker
Every cluster has a name, specified by whirr.cluster-name, which serves to identify the
cluster so you can perform operations on it, like listing all running instances, or
terminating the cluster. The name must be unique within the cloud account that the cluster
is running in.
The whirr.instance-templates property defines the services that run on a cluster. An
instance template specifies a cardinality, and a set of roles that run on each instance of
that type. Thus, we have one instance running in both the hadoop-namenode role and
the hadoop-jobtracker role. There are also 5 instances running a hadoop-datanode and
a hadoop-tasktracker. With whirr.instance-templates, you can define the precise
composition of your cluster; there are plenty of other services you can run in addition
to Hadoop, and you can discover them by running bin/whirr with no arguments.
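To make the template syntax concrete, here is a small sketch that splits a whirr.instance-templates value into cardinalities and roles. It illustrates the format described above and is not Whirr's own parser; the function name parse_instance_templates is hypothetical.

```python
def parse_instance_templates(value):
    """Return a list of (cardinality, [roles]) pairs from a templates string."""
    templates = []
    for template in value.split(","):
        # Each template is "<count> <role>+<role>+..."
        count, roles = template.strip().split(None, 1)
        templates.append((int(count), roles.split("+")))
    return templates

templates = parse_instance_templates(
    "1 hadoop-namenode+hadoop-jobtracker,5 hadoop-datanode+hadoop-tasktracker"
)
for count, roles in templates:
    print(count, roles)

# Total number of instances the recipe asks the cloud provider for.
total_instances = sum(count for count, _ in templates)
print(total_instances)
```

For the recipe shown, this gives one master instance and five workers, six instances in all.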
The next group of properties specify cloud credentials:
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
The whirr.provider property defines the cloud provider, here EC2 (other supported
providers are listed in the Whirr documentation). The whirr.identity and
whirr.credential properties are the cloud-specific credentials: roughly speaking the username
and password, although the terminology varies from provider to provider.
The final three parameters offer control over the cluster hardware (instance capabilities,
like memory, disk, CPU, network speed), the machine image (operating system), and
geographic location (data center). These are all provider dependent, but if you omit
them then Whirr will try to pick good defaults.
whirr.hardware-id=c1.xlarge
whirr.image-id=us-east-1/ami-da0cf8b3
whirr.location-id=us-east-1
Properties in the file are prefixed with whirr., but if they are passed as arguments on
the command line, then the prefix is dropped. So for example, you could set the cluster
name by adding --cluster-name hadoop on the command line (and this would take
precedence over any value set in the properties file). Conversely, we could have set the
private key file in the properties file by adding a line like
whirr.private-key-file=/user/tom/.ssh/id_rsa_whirr
There are also properties for specifying the version of Hadoop to run on the cluster,
and for setting Hadoop configuration properties across the cluster (details are in the
recipe file).
Running a proxy
To use the cluster, network traffic from the client needs to be proxied through the
master node of the cluster using an SSH tunnel, which we can set up using the following
command:
% . ~/.whirr/hadoop/hadoop-proxy.sh
You should keep the proxy running as long as the cluster is running. When you have
finished with the cluster, stop the proxy with Ctrl-c.
Running a MapReduce job
You can run MapReduce jobs either from within the cluster or from an external
machine. Here we show how to run a job from the machine we launched the cluster on.
Note that this requires that the same version of Hadoop has been installed locally as is
running on the cluster.
When we launched the cluster, Hadoop site configuration files were created in the
directory ~/.whirr/hadoop. We can use this to connect to the cluster by setting the
HADOOP_CONF_DIR environment variable as follows:
% export HADOOP_CONF_DIR=~/.whirr/hadoop
The cluster's filesystem is empty, so before we run a job, we need to populate it with
data. Doing a parallel copy from S3 (see "Hadoop Filesystems" on page 54 for more on
the S3 filesystems in Hadoop) using Hadoop's distcp tool is an efficient way to transfer
data into HDFS:
% hadoop distcp \
-Dfs.s3n.awsAccessKeyId='...' \
-Dfs.s3n.awsSecretAccessKey='...' \
s3n://hadoopbook/ncdc/all input/ncdc/all
The permissions on the files in the hadoopbook S3 bucket only allow
copying to the US East EC2 region. This means you should run the
distcp command from within that region; the easiest way to achieve
that is to log into the master node (its address is printed to the console
during launch) with
% ssh -i ~/.ssh/id_rsa_whirr master_host
After the data has been copied, we can run a job in the usual way:
% hadoop jar hadoop-examples.jar MaxTemperatureWithCombiner \
/user/$USER/input/ncdc/all /user/$USER/output
Alternatively, we could have specified the output to be on S3, as follows:
% hadoop jar hadoop-examples.jar MaxTemperatureWithCombiner \
/user/$USER/input/ncdc/all s3n://mybucket/output
You can track the progress of the job using the jobtracker's web UI, found at
http://master_host:50030/. To access web pages running on worker nodes, you need to set up a
proxy auto-config (PAC) file in your browser. See the Whirr documentation for details
on how to do this.
Shutting down a cluster
To shut down the cluster, issue the destroy-cluster command:
% bin/whirr destroy-cluster --config recipes/hadoop-ec2.properties
This will terminate all the running instances in the cluster and delete all the data stored
in the cluster.
CHAPTER 10
Administering Hadoop
The previous chapter was devoted to setting up a Hadoop cluster. In this chapter, we
look at the procedures to keep a cluster running smoothly.
HDFS
Persistent Data Structures
As an administrator, it is invaluable to have a basic understanding of how the
components of HDFS (the namenode, the secondary namenode, and the datanodes)
organize their persistent data on disk. Knowing which files are which can help you
diagnose problems or spot that something is awry.
Namenode directory structure
A newly formatted namenode creates the following directory structure:
${dfs.name.dir}/current/VERSION
                       /edits
                       /fsimage
                       /fstime
Recall from Chapter 9 that the dfs.name.dir property is a list of directories, with the
same contents mirrored in each directory. This mechanism provides resilience,
particularly if one of the directories is an NFS mount, as is recommended.
The VERSION file is a Java properties file that contains information about the version
of HDFS that is running. Here are the contents of a typical file:
#Tue Mar 10 19:21:36 GMT 2009
namespaceID=134368441
cTime=0
storageType=NAME_NODE
layoutVersion=-18
The layoutVersion is a negative integer that defines the version of HDFS's persistent
data structures. This version number has no relation to the release number of the
Hadoop distribution. Whenever the layout changes, the version number is decremented
(for example, the version after -18 is -19). When this happens, HDFS needs to be
upgraded, since a newer namenode (or datanode) will not operate if its storage layout
is an older version. Upgrading HDFS is covered in "Upgrades" on page 360.
The namespaceID is a unique identifier for the filesystem, which is created when the
filesystem is first formatted. The namenode uses it to identify new datanodes, since
they will not know the namespaceID until they have registered with the namenode.
The cTime property marks the creation time of the namenode's storage. For newly
formatted storage, the value is always zero, but it is updated to a timestamp whenever the
filesystem is upgraded.
The storageType indicates that this storage directory contains data structures for a
namenode.
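Since VERSION is a plain properties file, it is easy to inspect from a script. The sketch below reads the sample contents shown earlier and compares layout versions. It handles only simple key=value lines and comment lines, which is an assumption that holds for this file but not for Java properties files in general; both function names are our own.

```python
def parse_version_file(text):
    """Parse simple key=value lines, skipping blanks and comment lines."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("!"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

def layout_is_older(storage_layout, software_layout):
    """Layout versions are negative and are decremented on every change,
    so the numerically larger value is the older layout."""
    return storage_layout > software_layout

SAMPLE = """\
#Tue Mar 10 19:21:36 GMT 2009
namespaceID=134368441
cTime=0
storageType=NAME_NODE
layoutVersion=-18
"""

props = parse_version_file(SAMPLE)
on_disk = int(props["layoutVersion"])
# -19 stands in for the layout version expected by newer software.
print(layout_is_older(on_disk, -19))
```

A storage directory at layout -18 is older than software expecting -19, which is the situation in which an HDFS upgrade is needed.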
The other files in the namenode's storage directory are edits, fsimage, and fstime. These
are all binary files, which use Hadoop Writable objects as their serialization format (see
"Serialization" on page 94). To understand what these files are for, we need to dig into
the workings of the namenode a little more.
The filesystem image and edit log
When a filesystem client performs a write operation (such as creating or moving a file),
it is first recorded in the edit log. The namenode also has an in-memory representation
of the filesystem metadata, which it updates after the edit log has been modified. The
in-memory metadata is used to serve read requests.
The edit log is flushed and synced after every write before a success code is returned to
the client. For namenodes that write to multiple directories, the write must be flushed
and synced to every copy before returning successfully. This ensures that no operation
is lost due to machine failure.
The fsimage file is a persistent checkpoint of the filesystem metadata. However, it is
not updated for every filesystem write operation, since writing out the fsimage file,
which can grow to be gigabytes in size, would be very slow. This does not compromise
resilience, however, because if the namenode fails, then the latest state of its metadata
can be reconstructed by loading the fsimage from disk into memory, then applying each
of the operations in the edit log. In fact, this is precisely what the namenode does when
it starts up (see "Safe Mode" on page 342).
The fsimage file contains a serialized form of all the directory and file
inodes in the filesystem. Each inode is an internal representation of a
file or directory's metadata and contains such information as the file's
replication level, modification and access times, access permissions,
block size, and the blocks a file is made up of. For directories, the
modification time, permissions, and quota metadata is stored.
The fsimage file does not record the datanodes on which the blocks are
stored. Instead the namenode keeps this mapping in memory, which it
constructs by asking the datanodes for their block lists when they join
the cluster and periodically afterward to ensure the namenode's block
mapping is up-to-date.
As described, the edits file would grow without bound. Though this state of affairs
would have no impact on the system while the namenode is running, if the namenode
were restarted, it would take a long time to apply each of the operations in its (very
long) edit log. During this time, the filesystem would be offline, which is generally
undesirable.
The solution is to run the secondary namenode, whose purpose is to produce
checkpoints of the primary's in-memory filesystem metadata.[1] The checkpointing process
proceeds as follows (and is shown schematically in Figure 10-1):
1. The secondary asks the primary to roll its edits file, so new edits go to a new file.
2. The secondary retrieves fsimage and edits from the primary (using HTTP GET).
3. The secondary loads fsimage into memory, applies each operation from edits, then
creates a new consolidated fsimage file.
4. The secondary sends the new fsimage back to the primary (using HTTP POST).
5. The primary replaces the old fsimage with the new one from the secondary, and
the old edits file with the new one it started in step 1. It also updates the fstime file
to record the time that the checkpoint was taken.
At the end of the process, the primary has an up-to-date fsimage file and a shorter
edits file (it is not necessarily empty, as it may have received some edits while the
checkpoint was being taken). It is possible for an administrator to run this process
manually while the namenode is in safe mode, using the hadoop dfsadmin
-saveNamespace command.
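The five steps above can be modeled in a few lines. This is a toy simulation for illustration only: the namespace is a plain dictionary, the three operation types and the names Primary, new_edits, and checkpoint are hypothetical, and the HTTP transfers are replaced by in-process copies. It does show why the edits file is shorter, but not necessarily empty, after a checkpoint.

```python
import copy

def apply_edit(namespace, edit):
    """Apply one logged operation to an in-memory namespace (a plain dict)."""
    op, path, *args = edit
    if op == "create":
        namespace[path] = {"replication": args[0]}
    elif op == "rename":
        namespace[args[0]] = namespace.pop(path)
    elif op == "delete":
        del namespace[path]
    else:
        raise ValueError("unknown operation: %r" % (op,))

class Primary:
    def __init__(self):
        self.fsimage = {}      # last persistent checkpoint
        self.edits = []        # operations logged since that checkpoint
        self.new_edits = None  # receives writes while a checkpoint is running

    def log(self, edit):
        target = self.edits if self.new_edits is None else self.new_edits
        target.append(edit)

    def roll_edits(self):
        self.new_edits = []

    def install(self, new_fsimage):
        self.fsimage = new_fsimage
        self.edits, self.new_edits = self.new_edits, None

def checkpoint(primary, arriving=()):
    primary.roll_edits()                    # step 1: new edits go to a new log
    image = copy.deepcopy(primary.fsimage)  # step 2: secondary fetches fsimage
    edits = list(primary.edits)             #         and edits
    for edit in arriving:                   # writes that arrive mid-checkpoint
        primary.log(edit)
    for edit in edits:                      # step 3: replay edits onto the image
        apply_edit(image, edit)
    primary.install(image)                  # steps 4 and 5: send back and swap

primary = Primary()
primary.log(("create", "/a", 3))
primary.log(("rename", "/a", "/b"))
checkpoint(primary, arriving=[("create", "/c", 2)])
print(primary.fsimage)
print(primary.edits)
```

After the call, the image reflects the rename, and the one write that arrived during the checkpoint remains in the edit log, ready to be replayed on the next checkpoint or restart.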
1. From Hadoop version 0.22.0 onwards you can start a namenode with the -checkpoint option so that it
runs the checkpointing process against another (primary) namenode. This is functionally equivalent to
running a secondary namenode, but at the time of writing offers no advantages over the secondary
namenode (and indeed the secondary namenode is the most tried and tested option). When running in
a high-availability environment ("HDFS High-Availability" on page 50), it will be possible for the standby
node to do checkpointing.
This procedure makes it clear why the secondary has similar memory requirements to
the primary (since it loads the fsimage into memory), which is the reason that the
secondary needs a dedicated machine on large clusters.
The schedule for checkpointing is controlled by two configuration parameters. The
secondary namenode checkpoints every hour (fs.checkpoint.period in seconds) or
sooner if the edit log has reached 64 MB (fs.checkpoint.size in bytes), which it checks
every five minutes.
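The scheduling rule reduces to a simple either/or test. The sketch below expresses it using the defaults quoted above; the function name should_checkpoint is hypothetical, and we assume 64 MB means 64 × 1024 × 1024 bytes.

```python
CHECKPOINT_PERIOD_SECS = 60 * 60          # fs.checkpoint.period: one hour
CHECKPOINT_SIZE_BYTES = 64 * 1024 * 1024  # fs.checkpoint.size: 64 MB (assumed binary MB)

def should_checkpoint(secs_since_last, edits_bytes,
                      period=CHECKPOINT_PERIOD_SECS,
                      size=CHECKPOINT_SIZE_BYTES):
    """True if either the time limit or the edit log size limit has been reached."""
    return secs_since_last >= period or edits_bytes >= size

# Evaluated at each five-minute check:
print(should_checkpoint(300, 1024))               # quiet cluster, too soon
print(should_checkpoint(300, 64 * 1024 * 1024))   # busy cluster, edit log is large
print(should_checkpoint(3600, 1024))              # an hour has passed
```

On a busy cluster the size condition is what keeps the edit log, and therefore namenode restart time, bounded.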
Secondary namenode directory structure
Figure 10-1. The checkpointing process
A useful side effect of the checkpointing process is that the secondary has a checkpoint
at the end of the process, which can be found in a subdirectory called
previous.checkpoint. This can be used as a source for making (stale) backups of the namenode's
metadata:
${fs.checkpoint.dir}/current/VERSION
                            /edits
                            /fsimage
                            /fstime
                    /previous.checkpoint/VERSION
                                        /edits
                                        /fsimage
                                        /fstime
The layout of this directory and of the secondary's current directory is identical to the
namenode's. This is by design, since in the event of total namenode failure (when there
are no recoverable backups, even from NFS), it allows recovery from a secondary
namenode. This can be achieved either by copying the relevant storage directory to a
new namenode, or, if the secondary is taking over as the new primary namenode, by
using the -importCheckpoint option when starting the namenode daemon. The
-importCheckpoint option will load the namenode metadata from the latest checkpoint
in the directory defined by the fs.checkpoint.dir property, but only if there is no
metadata in the dfs.name.dir directory, so there is no risk of overwriting precious
metadata.
Datanode directory structure
Unlike namenodes, datanodes do not need to be explicitly formatted, since they create
their storage directories automatically on startup. Here are the key files and directories:
${dfs.data.dir}/current/VERSION
                       /blk_<id_1>
                       /blk_<id_1>.meta
                       /blk_<id_2>
                       /blk_<id_2>.meta
                       /...
                       /blk_<id_64>
                       /blk_<id_64>.meta
                       /subdir0/
                       /subdir1/
                       /...
                       /subdir63/
A datanode's VERSION file is very similar to the namenode's:
#Tue Mar 10 21:32:31 GMT 2009
namespaceID=134368441
storageID=DS-547717739-172.16.85.1-50010-1236720751627
cTime=0
storageType=DATA_NODE
layoutVersion=-18
The namespaceID, cTime, and layoutVersion are all the same as the values in the
namenode (in fact, the namespaceID is retrieved from the namenode when the datanode first
connects). The storageID is unique to the datanode (it is the same across all storage
directories) and is used by the namenode to uniquely identify the datanode. The
storageType identifies this directory as a datanode storage directory.
The other files in the datanode's current storage directory are the files with the blk_
prefix. There are two types: the HDFS blocks themselves (which just consist of the file's
raw bytes) and the metadata for a block (with a .meta suffix). A block file just consists
of the raw bytes of a portion of the file being stored; the metadata file is made up of a
header with version and type information, followed by a series of checksums for
sections of the block.
When the number of blocks in a directory grows to a certain size, the datanode creates
a new subdirectory in which to place new blocks and their accompanying metadata. It
creates a new subdirectory every time the number of blocks in a directory reaches 64
(set by the dfs.datanode.numblocks configuration property). The effect is to have a tree
with high fan-out, so even for systems with a very large number of blocks, the directories
will only be a few levels deep. By taking this measure, the datanode ensures that there
is a manageable number of files per directory, which avoids the problems that most
operating systems encounter when there are a large number of files (tens or hundreds
of thousands) in a single directory.
If the configuration property dfs.data.dir specifies multiple directories (on different
drives), blocks are written to each in a round-robin fashion. Note that blocks are not
replicated on each drive on a single datanode: block replication is across distinct
datanodes.
Safe Mode
When the namenode starts, the first thing it does is load its image file (fsimage) into
memory and apply the edits from the edit log (edits). Once it has reconstructed a
consistent in-memory image of the filesystem metadata, it creates a new fsimage file
(effectively doing the checkpoint itself, without recourse to the secondary namenode)
and an empty edit log. Only at this point does the namenode start listening for RPC
and HTTP requests. However, the namenode is running in safe mode, which means
that it offers only a read-only view of the filesystem to clients.
Strictly speaking, in safe mode, only filesystem operations that access
the filesystem metadata (like producing a directory listing) are
guaranteed to work. Reading a file will work only if the blocks are available on
the current set of datanodes in the cluster; and file modifications (writes,
deletes, or renames) will always fail.
Recall that the locations of blocks in the system are not persisted by the namenode;
this information resides with the datanodes, in the form of a list of the blocks each
one is storing. During normal operation of the system, the namenode has a map of block
locations stored in memory. Safe mode is needed to give the datanodes time to check
in to the namenode with their block lists, so the namenode can be informed of enough
block locations to run the filesystem effectively. If the namenode didn't wait for enough
datanodes to check in, then it would start the process of replicating blocks to new
datanodes, which would be unnecessary in most cases (since it only needed to wait for
the extra datanodes to check in), and would put a great strain on the cluster's resources.
Indeed, while in safe mode, the namenode does not issue any block replication or
deletion instructions to datanodes.
Safe mode is exited when the minimal replication condition is reached, plus an extension
time of 30 seconds. The minimal replication condition is when 99.9% of the blocks in
the whole filesystem meet their minimum replication level (which defaults to one, and
is set by dfs.replication.min, see Table 10-1).
When you are starting a newly formatted HDFS cluster, the namenode does not go into
safe mode since there are no blocks in the system.
Table 10-1. Safe mode properties
Property name | Type | Default value | Description
dfs.replication.min | int | 1 | The minimum number of replicas that have to be written for a write to be successful.
dfs.safemode.threshold.pct | float | 0.999 | The proportion of blocks in the system that must meet the minimum replication level defined by dfs.replication.min before the namenode will exit safe mode. Setting this value to 0 or less forces the namenode not to start in safe mode. Setting this value to more than 1 means the namenode never exits safe mode.
dfs.safemode.extension | int | 30,000 | The time, in milliseconds, to extend safe mode by after the minimum replication condition defined by dfs.safemode.threshold.pct has been satisfied. For small clusters (tens of nodes), it can be set to 0.
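The way the first two properties interact can be sketched as a small predicate. This is our own illustration of the condition as described in the text and in Table 10-1, not the namenode's implementation; the function name and the list-of-replica-counts input are hypothetical, and the extension period is left out.

```python
def minimal_replication_reached(replica_counts, min_replication=1, threshold=0.999):
    """replica_counts holds, for each block, the number of replicas reported so far."""
    if threshold <= 0:
        return True   # a value of 0 or less: do not start in safe mode
    if threshold > 1:
        return False  # a value over 1: never leave safe mode
    total = len(replica_counts)
    if total == 0:
        return True   # newly formatted filesystem: no blocks to wait for
    safe = sum(1 for count in replica_counts if count >= min_replication)
    return safe / total >= threshold

# 10,000 blocks, of which 9,999 have at least one reported replica.
mostly_reported = [1] * 9999 + [0]
print(minimal_replication_reached(mostly_reported))

# Only half the datanodes have checked in so far.
half_reported = [1] * 5000 + [0] * 5000
print(minimal_replication_reached(half_reported))
```

Once the predicate first becomes true, the namenode would wait a further dfs.safemode.extension milliseconds before leaving safe mode.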
Entering and leaving safe mode
To see whether the namenode is in safe mode, you can use the dfsadmin command:
% hadoop dfsadmin -safemode get
Safe mode is ON
The front page of the HDFS web UI provides another indication of whether the
namenode is in safe mode.
Sometimes you want to wait for the namenode to exit safe mode before carrying out a
command, particularly in scripts. The wait option achieves this:
hadoop dfsadmin -safemode wait
# command to read or write a file
An administrator has the ability to make the namenode enter or leave safe mode at any
time. It is sometimes necessary to do this when carrying out maintenance on the cluster
or after upgrading a cluster to confirm that data is still readable. To enter safe mode,
use the following command:
% hadoop dfsadmin -safemode enter
Safe mode is ON
You can use this command when the namenode is still in safe mode while starting up
to ensure that it never leaves safe mode. Another way of making sure that the namenode
stays in safe mode indefinitely is to set the property dfs.safemode.threshold.pct to a
value over one.
You can make the namenode leave safe mode by using:
% hadoop dfsadmin -safemode leave
Safe mode is OFF
Audit Logging
HDFS has the ability to log all filesystem access requests, a feature that some
organizations require for auditing purposes. Audit logging is implemented using log4j logging
at the INFO level, and in the default configuration it is disabled, as the log threshold is
set to WARN in log4j.properties:
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=WARN
You can enable audit logging by replacing WARN with INFO, and the result will be a log
line written to the namenode's log for every HDFS event. Here's an example for a list
status request on /user/tom:
2009-03-13 07:11:22,982 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.
audit: ugi=tom,staff,admin ip=/127.0.0.1 cmd=listStatus src=/user/tom dst=null
perm=null
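Audit lines of this shape are straightforward to process with a script. The following sketch pulls the key=value fields out of the example line. It assumes fields are separated by whitespace and that values contain no spaces, which would not hold for paths with embedded spaces; the function name parse_audit_fields is our own.

```python
def parse_audit_fields(line):
    """Return the key=value pairs that follow the 'audit:' logger name."""
    _, _, payload = line.partition("audit:")
    fields = {}
    for token in payload.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore any token that is not of the form key=value
            fields[key] = value
    return fields

LINE = (
    "2009-03-13 07:11:22,982 INFO "
    "org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: "
    "ugi=tom,staff,admin ip=/127.0.0.1 cmd=listStatus src=/user/tom dst=null perm=null"
)

fields = parse_audit_fields(LINE)
user = fields["ugi"].split(",")[0]  # first entry is the user, the rest are groups
print(user, fields["cmd"], fields["src"])
```

A filter built on this could, for example, count operations per user or flag every access under a sensitive directory.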
It is a good idea to configure log4j so that the audit log is written to a separate file and
isn't mixed up with the namenode's other log entries. An example of how to do this
can be found on the Hadoop wiki at http://wiki.apache.org/hadoop/HowToConfigure.
Tools
dfsadmin
The dfsadmin tool is a multipurpose tool for finding information about the state of
HDFS, as well as performing administration operations on HDFS. It is invoked as
hadoop dfsadmin and requires superuser privileges.
Some of the available commands to dfsadmin are described in Table 10-2. Use the
-help command to get more information.
Table 10-2. dfsadmin commands
Command | Description
-help | Shows help for a given command, or all commands if no command is specified.
-report | Shows filesystem statistics (similar to those shown in the web UI) and information on connected datanodes.
-metasave | Dumps information to a file in Hadoop's log directory about blocks that are being replicated or deleted, and a list of connected datanodes.
-safemode | Changes or queries the state of safe mode. See "Safe Mode" on page 342.
-saveNamespace | Saves the current in-memory filesystem image to a new fsimage file and resets the edits file. This operation may be performed only in safe mode.
-refreshNodes | Updates the set of datanodes that are permitted to connect to the namenode. See "Commissioning and Decommissioning Nodes" on page 357.
-upgradeProgress | Gets information on the progress of an HDFS upgrade or forces an upgrade to proceed. See "Upgrades" on page 360.
-finalizeUpgrade | Removes the previous version of the datanodes' and namenode's storage directories. Used after an upgrade has been applied and the cluster is running successfully on the new version. See "Upgrades" on page 360.
-setQuota | Sets directory quotas. Directory quotas set a limit on the number of names (files or directories) in the directory tree. Directory quotas are useful for preventing users from creating large numbers of small files, a measure that helps preserve the namenode's memory (recall that accounting information for every file, directory, and block in the filesystem is stored in memory).
-clrQuota | Clears specified directory quotas.
-setSpaceQuota | Sets space quotas on directories. Space quotas set a limit on the size of files that may be stored in a directory tree. They are useful for giving users a limited amount of storage.
-clrSpaceQuota | Clears specified space quotas.
-refreshServiceAcl | Refreshes the namenode's service-level authorization policy file.
Filesystem check (fsck)
Hadoop provides an fsck utility for checking the health of files in HDFS. The tool looks
for blocks that are missing from all datanodes, as well as under- or over-replicated
blocks. Here is an example of checking the whole filesystem for a small cluster:
% hadoop fsck /
......................Status: HEALTHY
Total size: 511799225 B
Total dirs: 10
Total files: 22
Total blocks (validated): 22 (avg. block size 23263601 B)
Minimally replicated blocks: 22 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 4
Number of racks: 1
The filesystem under path '/' is HEALTHY
fsck recursively walks the filesystem namespace, starting at the given path (here the
filesystem root), and checks the files it finds. It prints a dot for every file it checks. To
check a file, fsck retrieves the metadata for the file's blocks and looks for problems or
inconsistencies. Note that fsck retrieves all of its information from the namenode; it
does not communicate with any datanodes to actually retrieve any block data.
Most of the output from fsck is self-explanatory, but here are some of the conditions it
looks for:
Over-replicated blocks
These are blocks that exceed their target replication for the file they belong to.
Over-replication is not normally a problem, and HDFS will automatically delete
excess replicas.
Under-replicated blocks
These are blocks that do not meet their target replication for the file they belong
to. HDFS will automatically create new replicas of under-replicated blocks until
they meet the target replication. You can get information about the blocks being
replicated (or waiting to be replicated) using hadoop dfsadmin -metasave.
Misreplicated blocks
These are blocks that do not satisfy the block replica placement policy (see "Replica
Placement" on page 74). For example, for a replication level of three in a multirack
cluster, if all three replicas of a block are on the same rack, then the block is
misreplicated since the replicas should be spread across at least two racks for
resilience. HDFS will automatically re-replicate misreplicated blocks so that they
satisfy the rack placement policy.
Corrupt blocks
These are blocks whose replicas are all corrupt. Blocks with at least one noncorrupt
replica are not reported as corrupt; the namenode will replicate the noncorrupt
replica until the target replication is met.
Missing replicas
These are blocks with no replicas anywhere in the cluster.
Corrupt or missing blocks are the biggest cause for concern, as it means data has been
lost. By default, fsck leaves files with corrupt or missing blocks, but you can tell it to
perform one of the following actions on them:
• Move the affected files to the /lost+found directory in HDFS, using the -move option.
  Files are broken into chains of contiguous blocks to aid any salvaging efforts you
  may attempt.
• Delete the affected files, using the -delete option. Files cannot be recovered after
  being deleted.
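If fsck is run from a scheduled job, the summary it prints can be checked mechanically before deciding whether either action is needed. The sketch below is one way to do that, assuming the counter labels match the sample report shown earlier (they may differ between Hadoop releases). The function names are hypothetical, and only whole-number counters are picked up.

```python
import re


def parse_fsck_summary(output):
    """Return the whole-number 'label: count' lines of an fsck report as a dict."""
    counters = {}
    for line in output.splitlines():
        # Matches e.g. "Corrupt blocks: 0" or "Missing replicas: 0 (0.0 %)",
        # but skips fractional values such as "Average block replication: 3.0".
        match = re.match(r"\s*([A-Za-z][A-Za-z \-]*?):\s+(\d+)(?![\d.])", line)
        if match:
            counters[match.group(1).strip()] = int(match.group(2))
    return counters


def needs_attention(counters):
    """True if the report shows any corrupt blocks or missing replicas."""
    return (counters.get("Corrupt blocks", 0) > 0
            or counters.get("Missing replicas", 0) > 0)


sample = """\
Total size: 511799225 B
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
"""
counters = parse_fsck_summary(sample)
print(needs_attention(counters))
```

Under- and over-replicated blocks are deliberately not treated as alarms here, since HDFS corrects them on its own.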
Finding the blocks for a file. The fsck tool provides an easy way to find out which blocks are
in any particular file. For example:
% hadoop fsck /user/tom/part-00007 -files -blocks -racks
/user/tom/part-00007 25582428 bytes, 1 block(s): OK
0. blk_-3724870485760122836_1035 len=25582428 repl=3 [/default-rack/10.251.43.2:50010,
/default-rack/10.251.27.178:50010, /default-rack/10.251.123.163:50010]
This says that the file /user/tom/part-00007 is made up of one block and shows the
datanodes where the blocks are located. The fsck options used are as follows:
• The -files option shows the line with the filename, size, number of blocks, and
  its health (whether there are any missing blocks).
• The -blocks option shows information about each block in the file, one line per
  block.
• The -racks option displays the rack location and the datanode addresses for each
  block.
Running hadoop fsck without any arguments displays full usage instructions.
Datanode block scanner
Every datanode runs a block scanner, which periodically verifies all the blocks stored
on the datanode. This allows bad blocks to be detected and fixed before they are read
by clients. The DataBlockScanner maintains a list of blocks to verify and scans them one
by one for checksum errors. The scanner employs a throttling mechanism to preserve
disk bandwidth on the datanode.
Blocks are verified every three weeks to guard against disk errors over time
(this is controlled by the dfs.datanode.scan.period.hours property, which defaults to
504 hours). Corrupt blocks are reported to the namenode to be fixed.
You can get a block verification report for a datanode by visiting the datanode's web
interface at http://datanode:50075/blockScannerReport. Here's an example of a report,
which should be self-explanatory:
Total Blocks : 21131
Verified in last hour : 70
Verified in last day : 1767
Verified in last week : 7360
Verified in last four weeks : 20057
Verified in SCAN_PERIOD : 20057
Not yet verified : 1074
Verified since restart : 35912
Scans since restart : 6541
Scan errors since restart : 0
Transient scan errors : 0
Current scan rate limit KBps : 1024
Progress this period : 109%
Time left in cur period : 53.08%
By specifying the listblocks parameter, http://datanode:50075/blockScannerReport
?listblocks, the report is preceded by a list of all the blocks on the datanode along with
their latest verification status. Here is a snippet of the block list (lines are split to fit the
page):
blk_6035596358209321442 : status : ok type : none scan time : 0
not yet verified
blk_3065580480714947643 : status : ok type : remote scan time : 1215755306400
2008-07-11 05:48:26,400
blk_8729669677359108508 : status : ok type : local scan time : 1215755727345
2008-07-11 05:55:27,345
The first column is the block ID, followed by some key-value pairs. The status can be
one of failed or ok according to whether the last scan of the block detected a checksum
error. The type of scan is local if it was performed by the background thread, remote
if it was performed by a client or a remote datanode, or none if a scan of this block has
yet to be made. The last piece of information is the scan time, which is displayed as the
number of milliseconds since midnight 1 January 1970, and also as a more readable
value.
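To relate the two forms of the scan time, here is a small sketch that converts the millisecond value to a date. The helper name scan_time_to_utc is hypothetical, and the conversion is done in UTC, which agrees with the readable values in the snippet above.

```python
from datetime import datetime, timezone


def scan_time_to_utc(millis):
    """Convert a scan time (milliseconds since the epoch) to a UTC datetime."""
    seconds, ms = divmod(millis, 1000)  # integer split avoids float rounding
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.replace(microsecond=ms * 1000)


stamp = scan_time_to_utc(1215755306400)
print(stamp.strftime("%Y-%m-%d %H:%M:%S"))  # 2008-07-11 05:48:26
```

A scan time of 0, as in the first line of the snippet, simply means the block has not yet been verified.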
balancer
Over time, the distribution of blocks across datanodes can become unbalanced. An
unbalanced cluster can affect locality for MapReduce, and it puts a greater strain on
the highly utilized datanodes, so it's best avoided.
The balancer program is a Hadoop daemon that redistributes blocks by moving them
from over-utilized datanodes to under-utilized datanodes, while adhering to the block
replica placement policy that makes data loss unlikely by placing block replicas on
different racks (see "Replica Placement" on page 74). It moves blocks until the cluster
is deemed to be balanced, which means that the utilization of every datanode (ratio of
used space on the node to total capacity of the node) differs from the utilization of the
cluster (ratio of used space on the cluster to total capacity of the cluster) by no more
than a given threshold percentage. You can start the balancer with:
% start-balancer.sh
The -threshold argument specifies the threshold percentage that defines what it means
for the cluster to be balanced. The flag is optional; if it is omitted, the threshold is 10%.
At any one time, only one balancer may be running on the cluster.
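The balance criterion can be written down directly. The following is a sketch of the definition given above, not the balancer's actual implementation; the function names and the (used, capacity) representation of a datanode are assumptions made for illustration.

```python
def utilization(used, capacity):
    """Used space as a percentage of capacity."""
    return 100.0 * used / capacity


def is_balanced(datanodes, threshold=10.0):
    """datanodes: list of (used_bytes, capacity_bytes) pairs, one per datanode."""
    cluster_used = sum(used for used, _ in datanodes)
    cluster_capacity = sum(capacity for _, capacity in datanodes)
    cluster_util = utilization(cluster_used, cluster_capacity)
    # Balanced when every node is within `threshold` points of the cluster figure.
    return all(abs(utilization(used, capacity) - cluster_util) <= threshold
               for used, capacity in datanodes)


# Cluster utilization is 50%; the first node sits 30 points above it.
nodes = [(80, 100), (30, 100), (40, 100)]
print(is_balanced(nodes))                  # False with the default threshold of 10
print(is_balanced(nodes, threshold=30.0))  # True with a looser threshold
```

A smaller threshold gives a more even cluster at the cost of moving more blocks.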
The balancer runs until the cluster is balanced, it cannot move any more blocks, or it
loses contact with the namenode. It produces a logfile in the standard log directory,
where it writes a line for every iteration of redistribution that it carries out. Here is the
output from a short run on a small cluster:
Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved
Mar 18, 2009 5:23:42 PM 0 0 KB 219.21 MB 150.29 MB
Mar 18, 2009 5:27:14 PM 1 195.24 MB 22.45 MB 150.29 MB
The cluster is balanced. Exiting...
Balancing took 6.072933333333333 minutes
The balancer is designed to run in the background without unduly taxing the cluster
or interfering with other clients using the cluster. It limits the bandwidth that it uses to
copy a block from one node to another. The default is a modest 1 MB/s, but this can
be changed by setting the dfs.balance.bandwidthPerSec property in hdfs-site.xml,
specified in bytes.
Monitoring
Monitoring is an important part of system administration. In this section, we look at
the monitoring facilities in Hadoop and how they can hook into external monitoring
systems.
The purpose of monitoring is to detect when the cluster is not providing the expected
level of service. The master daemons are the most important to monitor: the namenodes
(primary and secondary) and the jobtracker. Failure of datanodes and tasktrackers is
to be expected, particularly on larger clusters, so you should provide extra capacity so
that the cluster can tolerate having a small percentage of dead nodes at any time.
In addition to the facilities described next, some administrators run test jobs on a
periodic basis as a test of the cluster's health.
There is a lot of work going on to add more monitoring capabilities to Hadoop, which
is not covered here. For example, Chukwa[2] is a data collection and monitoring system
built on HDFS and MapReduce, and excels at mining log data for finding large-scale
trends.
Logging
All Hadoop daemons produce logfiles that can be very useful for finding out what is
happening in the system. "System logfiles" on page 307 explains how to configure these
files.
Setting log levels
When debugging a problem, it is very convenient to be able to change the log level
temporarily for a particular component in the system.
2. http://hadoop.apache.org/chukwa
Hadoop daemons have a web page for changing the log level for any log4j log name,
which can be found at /logLevel in the daemon's web UI. By convention, log names in
Hadoop correspond to the classname doing the logging, although there are exceptions
to this rule, so you should consult the source code to find log names.
For example, to enable debug logging for the JobTracker class, we would visit the
jobtracker's web UI at http://jobtracker-host:50030/logLevel and set the log name
org.apache.hadoop.mapred.JobTracker to level DEBUG.
The same thing can be achieved from the command line as follows:
% hadoop daemonlog -setlevel jobtracker-host:50030 \
org.apache.hadoop.mapred.JobTracker DEBUG
Log levels changed in this way are reset when the daemon restarts, which is usually
what you want. However, to make a persistent change to a log level, simply change the
log4j.properties file in the configuration directory. In this case, the line to add is:
log4j.logger.org.apache.hadoop.mapred.JobTracker=DEBUG
Getting stack traces
Hadoop daemons expose a web page (/stacks in the web UI) that produces a thread
dump for all running threads in the daemon's JVM. For example, you can get a thread
dump for a jobtracker from http://jobtracker-host:50030/stacks.
Metrics
The HDFS and MapReduce daemons collect information about events and
measurements that are collectively known as metrics. For example, datanodes collect the
following metrics (and many more): the number of bytes written, the number of blocks
replicated, and the number of read requests from clients (both local and remote).
Metrics belong to a context, and Hadoop currently uses "dfs", "mapred", "rpc", and
"jvm" contexts. Hadoop daemons usually collect metrics under several contexts. For
example, datanodes collect metrics for the "dfs", "rpc", and "jvm" contexts.
How Do Metrics Differ from Counters?
The main difference is their scope: metrics are collected by Hadoop daemons, whereas
counters (see "Counters" on page 257) are collected for MapReduce tasks and
aggregated for the whole job. They have different audiences, too: broadly speaking, metrics
are for administrators, and counters are for MapReduce users.
The way they are collected and aggregated is also different. Counters are a MapReduce
feature, and the MapReduce system ensures that counter values are propagated from
the tasktrackers where they are produced, back to the jobtracker, and finally back to
the client running the MapReduce job. (Counters are propagated via RPC heartbeats;
see "Progress and Status Updates" on page 192.) Both the tasktrackers and the
jobtracker perform aggregation.
The collection mechanism for metrics is decoupled from the component that receives
the updates, and there are various pluggable outputs, including local files, Ganglia, and
JMX. The daemon collecting the metrics performs aggregation on them before they are
sent to the output.
A context defines the unit of publication; you can choose to publish the "dfs" context,
but not the "jvm" context, for instance. Metrics are configured in the
conf/hadoop-metrics.properties file, and, by default, all contexts are configured so they do not publish
their metrics. This is the contents of the default configuration file (minus the
comments):
dfs.class=org.apache.hadoop.metrics.spi.NullContext
mapred.class=org.apache.hadoop.metrics.spi.NullContext
jvm.class=org.apache.hadoop.metrics.spi.NullContext
rpc.class=org.apache.hadoop.metrics.spi.NullContext
Each line in this file configures a different context and specifies the class that handles
the metrics for that context. The class must be an implementation of the MetricsContext
interface; and, as the name suggests, the NullContext class neither publishes nor
updates metrics.[3]
The other implementations of MetricsContext are covered in the following sections.
You can view raw metrics gathered by a particular Hadoop daemon by connecting to
its /metrics web page. This is handy for debugging. For example, you can view
jobtracker metrics in plain text at http://jobtracker-host:50030/metrics. To retrieve metrics
in JSON format you would use http://jobtracker-host:50030/metrics?format=json.
FileContext
FileContext writes metrics to a local file. It exposes two configuration properties:
fileName, which specifies the absolute name of the file to write to, and period, for the
time interval (in seconds) between file updates. Both properties are optional; if not set,
the metrics will be written to standard output every five seconds.
Configuration properties apply to a context name and are specified by appending the
property name to the context name (separated by a dot). For example, to dump the
"jvm" context to a file, we alter its configuration to be the following:
jvm.class=org.apache.hadoop.metrics.file.FileContext
jvm.fileName=/tmp/jvm_metrics.log
In the first line, we have changed the "jvm" context to use a FileContext, and in the
second, we have set the "jvm" context's fileName property to be a temporary file. Here
are two lines of output from the logfile, split over several lines to fit the page:
3. The term "context" is (perhaps unfortunately) overloaded here, since it can refer to either a collection of
metrics (the "dfs" context, for example) or the class that publishes metrics (the NullContext, for example).
jvm.metrics: hostName=ip-10-250-59-159, processName=NameNode, sessionId=,
gcCount=46, gcTimeMillis=394, logError=0, logFatal=0, logInfo=59, logWarn=1,
memHeapCommittedM=4.9375, memHeapUsedM=2.5322647, memNonHeapCommittedM=18.25,
memNonHeapUsedM=11.330269, threadsBlocked=0, threadsNew=0, threadsRunnable=6,
threadsTerminated=0, threadsTimedWaiting=8, threadsWaiting=13
jvm.metrics: hostName=ip-10-250-59-159, processName=SecondaryNameNode, sessionId=,
gcCount=36, gcTimeMillis=261, logError=0, logFatal=0, logInfo=18, logWarn=4,
memHeapCommittedM=5.4414062, memHeapUsedM=4.46756, memNonHeapCommittedM=18.25,
memNonHeapUsedM=10.624519, threadsBlocked=0, threadsNew=0, threadsRunnable=5,
threadsTerminated=0, threadsTimedWaiting=4, threadsWaiting=2
FileContext can be useful on a local system for debugging purposes, but is unsuitable
on a larger cluster since the output files are spread across the cluster, which makes
analyzing them difficult.
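For local debugging, the records that FileContext writes are simple to read back. Here is a minimal sketch that parses one record, rejoined onto a single line, into a record name and a dictionary of metrics. The function name is hypothetical, and the sketch assumes that metric values never contain the ", " separator, as is the case in the sample output above.

```python
def parse_metrics_record(line):
    """Parse one FileContext output record into (record_name, {metric: value})."""
    record_name, _, body = line.partition(": ")
    metrics = {}
    for pair in body.split(", "):
        key, _, value = pair.partition("=")
        metrics[key.strip()] = value.strip()  # values are kept as strings
    return record_name, metrics


# A shortened copy of the first sample record, on one logical line.
record = ("jvm.metrics: hostName=ip-10-250-59-159, processName=NameNode, "
          "sessionId=, gcCount=46, gcTimeMillis=394, memHeapUsedM=2.5322647, "
          "threadsWaiting=13")
name, metrics = parse_metrics_record(record)
print(name, metrics["processName"], float(metrics["memHeapUsedM"]))
```

Values are left as strings so that the caller can decide which ones to convert to numbers.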
GangliaContext
Ganglia (http://ganglia.info/) is an open source distributed monitoring system for very
large clusters. It is designed to impose very low resource overheads on each node in the
cluster. Ganglia itself collects metrics, such as CPU and memory usage; by using
GangliaContext, you can inject Hadoop metrics into Ganglia.
GangliaContext has one required property, servers, which takes a space- and/or
comma-separated list of Ganglia server host-port pairs. Further details on configuring
this context can be found on the Hadoop wiki.
For a flavor of the kind of information you can get out of Ganglia, see Figure 10-2,
which shows how the number of tasks in the jobtracker's queue varies over time.
NullContextWithUpdateThread
Both FileContext and GangliaContext push metrics to an external system. However,
some monitoring systems (notably JMX) need to pull metrics from Hadoop.
NullContextWithUpdateThread is designed for this. Like NullContext, it doesn't publish any
metrics, but in addition it runs a timer that periodically updates the metrics stored in
memory. This ensures that the metrics are up-to-date when they are fetched by another
system.
Figure 10-2. Ganglia plot of number of tasks in the jobtracker queue
All implementations of MetricsContext, except NullContext, perform this updating
function (and they all expose a period property that defaults to five seconds), so you
need to use NullContextWithUpdateThread only if you are not collecting metrics using
another output. If you were using GangliaContext, for example, then it would ensure
the metrics are updated, so you would be able to use JMX in addition with no further
configuration of the metrics system. JMX is discussed in more detail shortly.
CompositeContext
CompositeContext allows you to output the same set of metrics to multiple contexts,
such as a FileContext and a GangliaContext. The configuration is slightly tricky and is
best shown by an example:
jvm.class=org.apache.hadoop.metrics.spi.CompositeContext
jvm.arity=2
jvm.sub1.class=org.apache.hadoop.metrics.file.FileContext
jvm.fileName=/tmp/jvm_metrics.log
jvm.sub2.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.servers=ip-10-250-59-159.ec2.internal:8649
The arity property is used to specify the number of subcontexts; in this case, there are
two. The property names for each subcontext are modified to have a part specifying
the subcontext number, hence jvm.sub1.class and jvm.sub2.class.
Java Management Extensions
Java Management Extensions (JMX) is a standard Java API for monitoring and
managing applications. Hadoop includes several managed beans (MBeans), which expose
Hadoop metrics to JMX-aware applications. There are MBeans that expose the metrics
in the "dfs" and "rpc" contexts, but none for the "mapred" context (at the time of this
writing) or the "jvm" context (as the JVM itself exposes a richer set of JVM metrics).
These MBeans are listed in Table 10-3.
Table 10-3. Hadoop MBeans
MBean class             Daemons                      Metrics
NameNodeActivityMBean   Namenode                     Namenode activity metrics, such as the number of
                                                     create file operations
FSNamesystemMBean       Namenode                     Namenode status metrics, such as the number of
                                                     connected datanodes
DataNodeActivityMBean   Datanode                     Datanode activity metrics, such as number of
                                                     bytes read
FSDatasetMBean          Datanode                     Datanode storage metrics, such as capacity and
                                                     free storage space
RpcActivityMBean        All daemons that use RPC:    RPC statistics, such as average processing time
                        namenode, datanode,
                        jobtracker, tasktracker
The JDK comes with a tool called JConsole for viewing MBeans in a running JVM. It's
useful for browsing Hadoop metrics, as demonstrated in Figure 10-3.
Figure 10-3. JConsole view of a locally running namenode, showing metrics for the filesystem state
Although you can see Hadoop metrics via JMX using the default metrics
configuration, they will not be updated unless you change the
MetricsContext implementation to something other than NullContext.
For example, NullContextWithUpdateThread is appropriate if JMX is the
only way you will be monitoring metrics.
Many third-party monitoring and alerting systems (such as Nagios or Hyperic) can
query MBeans, making JMX the natural way to monitor your Hadoop cluster from an
existing monitoring system. You will need to enable remote access to JMX, however,
and choose a level of security that is appropriate for your cluster. The options here
include password authentication, SSL connections, and SSL client-authentication. See
the official Java documentation[4] for an in-depth guide on configuring these options.
All the options for enabling remote access to JMX involve setting Java system
properties, which we do for Hadoop by editing the conf/hadoop-env.sh file. The following
configuration settings show how to enable password-authenticated remote access to
JMX on the namenode (with SSL disabled). The process is very similar for other Hadoop
daemons:
4. http://java.sun.com/javase/6/docs/technotes/guides/management/agent.html
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.password.file=$HADOOP_CONF_DIR/jmxremote.password
-Dcom.sun.management.jmxremote.port=8004 $HADOOP_NAMENODE_OPTS"
The jmxremote.password file lists the usernames and their passwords in plain text; the
JMX documentation has further details on the format of this file.
With this configuration, we can use JConsole to browse MBeans on a remote
namenode. Alternatively, we can use one of the many JMX tools to retrieve MBean attribute
values. Here is an example of using the "jmxquery" command-line tool (and Nagios
plug-in, available from http://code.google.com/p/jmxquery/) to retrieve the number of
under-replicated blocks:
% ./check_jmx -U service:jmx:rmi:///jndi/rmi://namenode-host:8004/jmxrmi -O \
hadoop:service=NameNode,name=FSNamesystemState -A UnderReplicatedBlocks \
-w 100 -c 1000 -username monitorRole -password secret
JMX OK - UnderReplicatedBlocks is 0
This command establishes a JMX RMI connection to the host namenode-host on port
8004 and authenticates using the given username and password. It reads the attribute
UnderReplicatedBlocks of the object named
hadoop:service=NameNode,name=FSNamesystemState and prints out its value on the console.[5]
The -w and -c options specify warning
and critical levels for the value: the appropriate values of these are normally determined
after operating a cluster for a while.
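The effect of the warning and critical levels can be illustrated with a short sketch. It assumes simple upper-bound thresholds, where a value above the level triggers the state; the exact boundary handling in check_jmx is not described in the text, so treat the comparison operators as an assumption. The function name classify is hypothetical.

```python
def classify(value, warning, critical):
    """Map a metric value to a Nagios-style status using upper thresholds."""
    if value > critical:
        return "CRITICAL"
    if value > warning:
        return "WARNING"
    return "OK"


# The thresholds from the command above: -w 100 -c 1000.
for blocks in (0, 250, 5000):
    print(blocks, classify(blocks, warning=100, critical=1000))
```

With these levels, the value of 0 reported in the sample output maps to OK.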
It's common to use Ganglia in conjunction with an alerting system like Nagios for
monitoring a Hadoop cluster. Ganglia is good for efficiently collecting a large number
of metrics and graphing them, whereas Nagios and similar systems are good at sending
alerts when a critical threshold is reached in any of a smaller set of metrics.
Maintenance
Routine Administration Procedures
Metadata backups
If the namenode's persistent metadata is lost or damaged, the entire filesystem is
rendered unusable, so it is critical that backups are made of these files. You should keep
multiple copies of different ages (one hour, one day, one week, and one month, say) to
protect against corruption, either in the copies themselves or in the live files running
on the namenode.
5. It's convenient to use JConsole to find the object names of the MBeans that you want to monitor. Note
that MBeans for datanode metrics contain a random identifier in Hadoop 0.20, which makes it difficult
to monitor them in anything but an ad hoc way. This was fixed in Hadoop 0.21.0.
A straightforward way to make backups is to write a script to periodically archive the
secondary namenode's previous.checkpoint subdirectory (under the directory defined
by the fs.checkpoint.dir property) to an offsite location. The script should additionally
test the integrity of the copy. This can be done by starting a local namenode daemon
and verifying that it has successfully read the fsimage and edits files into memory (by
scanning the namenode log for the appropriate success message, for example).[6]
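The archiving half of such a script might look like the following sketch. The function names and the archive naming scheme are hypothetical, and copying the archive offsite is left out. Reading the archive back only confirms that the archive itself is intact; it does not replace the namenode-based integrity test described above.

```python
import os
import tarfile
import time


def archive_checkpoint(checkpoint_dir, backup_dir):
    """Write a timestamped .tar.gz of checkpoint_dir into backup_dir."""
    os.makedirs(backup_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_path = os.path.join(backup_dir, "checkpoint-%s.tar.gz" % stamp)
    # Store members under the directory's own name, not its full path.
    top_level = os.path.basename(checkpoint_dir.rstrip(os.sep))
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(checkpoint_dir, arcname=top_level)
    return archive_path


def archived_names(archive_path):
    """List member names, which also confirms the archive can be read back."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(archive.getnames())
```

A scheduler such as cron could call archive_checkpoint at the intervals suggested earlier (hourly, daily, weekly, monthly) and prune older archives accordingly.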
Data backups
Although HDFS is designed to store data reliably, data loss can occur, just like in any
storage system, and thus a backup strategy is essential. With the large data volumes
that Hadoop can store, deciding what data to back up and where to store it is a
challenge. The key here is to prioritize your data. The highest priority is the data that cannot
be regenerated and that is critical to the business; however, data that is straightforward
to regenerate, or essentially disposable because it is of limited business value, is the
lowest priority, and you may choose not to make backups of this category of data.
Do not make the mistake of thinking that HDFS replication is a
substitute for making backups. Bugs in HDFS can cause replicas to be lost,
and so can hardware failures. Although Hadoop is expressly designed
so that hardware failure is very unlikely to result in data loss, the
possibility can never be completely ruled out, particularly when combined
with software bugs or human error.
When it comes to backups, think of HDFS in the same way as you would
RAID. Although the data will survive the loss of an individual RAID
disk, it may not if the RAID controller fails, or is buggy (perhaps
overwriting some data), or the entire array is damaged.
It's common to have a policy for user directories in HDFS. For example, they may have
space quotas and be backed up nightly. Whatever the policy, make sure your users
know what it is, so they know what to expect.
The distcp tool is ideal for making backups to other HDFS clusters (preferably running
on a different version of the software, to guard against loss due to bugs in HDFS) or
other Hadoop filesystems (such as S3 or KFS), since it can copy files in parallel.
Alternatively, you can employ an entirely different storage system for backups, using one of
the ways to export data from HDFS described in "Hadoop Filesystems" on page 54.
6. Hadoop 0.23.0 comes with an Offline Image Viewer and Offline Edits Viewer, which can be used to check
the integrity of the image and edits files. Note that both viewers support older formats of these files, so
you can use them to diagnose problems in these files generated by previous releases of Hadoop. Type
hdfs oiv and hdfs oev to invoke these tools.
Filesystem check (fsck)
It is advisable to run HDFS's fsck tool regularly (for example, daily) on the whole
filesystem to proactively look for missing or corrupt blocks. See "Filesystem check
(fsck)" on page 345.
Filesystem balancer
Run the balancer tool (see "balancer" on page 348) regularly to keep the filesystem
datanodes evenly balanced.
Commissioning and Decommissioning Nodes
As an administrator of a Hadoop cluster, you will need to add or remove nodes from
time to time. For example, to grow the storage available to a cluster, you commission
new nodes. Conversely, sometimes you may wish to shrink a cluster, and to do so, you
decommission nodes. It can sometimes be necessary to decommission a node if it is
misbehaving, perhaps because it is failing more often than it should or its performance
is noticeably slow.
Nodes normally run both a datanode and a tasktracker, and both are typically
commissioned or decommissioned in tandem.
Commissioning new nodes
Although commissioning a new node can be as simple as configuring the
hdfs-site.xml file to point to the namenode and the mapred-site.xml file to point to the
jobtracker, and starting the datanode and tasktracker daemons, it is generally best to have
a list of authorized nodes.
It is a potential security risk to allow any machine to connect to the namenode and act
as a datanode, since the machine may gain access to data that it is not authorized to
see. Furthermore, since such a machine is not a real datanode, it is not under your
control, and may stop at any time, causing potential data loss. (Imagine what would
happen if a number of such nodes were connected, and a block of data was present
only on the "alien" nodes.) This scenario is a risk even inside a firewall, through
misconfiguration, so datanodes (and tasktrackers) should be explicitly managed on all
production clusters.
Datanodes that are permitted to connect to the namenode are specified in a file whose
name is specified by the dfs.hosts property. The file resides on the namenode's local
filesystem, and it contains a line for each datanode, specified by network address (as
reported by the datanode; you can see what this is by looking at the namenode's web
UI). If you need to specify multiple network addresses for a datanode, put them on one
line, separated by whitespace.
Similarly, tasktrackers that may connect to the jobtracker are specified in a file whose
name is specified by the mapred.hosts property. In most cases, there is one shared file,
referred to as the include file, that both dfs.hosts and mapred.hosts refer to, since nodes
in the cluster run both datanode and tasktracker daemons.
The file (or files) specified by the dfs.hosts and mapred.hosts properties
is different from the slaves file. The former is used by the namenode and
jobtracker to determine which worker nodes may connect. The slaves
file is used by the Hadoop control scripts to perform cluster-wide
operations, such as cluster restarts. It is never used by the Hadoop
daemons.
To add new nodes to the cluster:
1. Add the network addresses of the new nodes to the include file.
2. Update the namenode with the new set of permitted datanodes using this
command:
% hadoop dfsadmin -refreshNodes
3. Update the jobtracker with the new set of permitted tasktrackers using:
% hadoop mradmin -refreshNodes
4. Update the slaves file with the new nodes, so that they are included in future
operations performed by the Hadoop control scripts.
5. Start the new datanodes and tasktrackers.
6. Check that the new datanodes and tasktrackers appear in the web UI.
HDFS will not move blocks from old datanodes to new datanodes to balance the cluster.
To do this, you should run the balancer described in "balancer" on page 348.
Decommissioning old nodes
Although HDFS is designed to tolerate datanode failures, this does not mean you can
just terminate datanodes en masse with no ill effect. With a replication level of three,
for example, the chances are very high that you will lose data by simultaneously shutting
down three datanodes if they are on different racks. The way to decommission
datanodes is to inform the namenode of the nodes that you wish to take out of
circulation, so that it can replicate the blocks to other datanodes before the datanodes are
shut down.
With tasktrackers, Hadoop is more forgiving. If you shut down a tasktracker that is
running tasks, the jobtracker will notice the failure and reschedule the tasks on other
tasktrackers.
The decommissioning process is controlled by an exclude file, which for HDFS is set
by the dfs.hosts.exclude property and for MapReduce by the mapred.hosts.exclude
property. It is often the case that these properties refer to the same file. The exclude file
lists the nodes that are not permitted to connect to the cluster.
The rules for whether a tasktracker may connect to the jobtracker are simple: a
tasktracker may connect only if it appears in the include file and does not appear in the
exclude file. An unspecified or empty include file is taken to mean that all nodes are in
the include file.
For HDFS, the rules are slightly different. If a datanode appears in both the include and
the exclude file, then it may connect, but only to be decommissioned. Table 10-4
summarizes the different combinations for datanodes. As for tasktrackers, an unspecified
or empty include file means all nodes are included.
Table 10-4. HDFS include and exclude file precedence
Node appears in include file   Node appears in exclude file   Interpretation
No                             No                             Node may not connect.
No                             Yes                            Node may not connect.
Yes                            No                             Node may connect.
Yes                            Yes                            Node may connect and will be decommissioned.
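The precedence rules in Table 10-4, together with the simpler rule for tasktrackers, can be sketched as a pair of decision functions. This is only an illustration in Python, not how Hadoop implements the check; the function names and the use of sets to stand in for the contents of the include and exclude files are assumptions made for the example.

```python
def datanode_status(host, include, exclude):
    """Interpret a datanode according to Table 10-4.

    `include` and `exclude` are sets of host names (hypothetical stand-ins
    for the contents of the include and exclude files). An empty include
    set is treated as "all nodes are included".
    """
    included = not include or host in include
    excluded = host in exclude
    if not included:
        return "may not connect"
    if excluded:
        return "may connect and will be decommissioned"
    return "may connect"


def tasktracker_may_connect(host, include, exclude):
    """A tasktracker connects only if it is included and not excluded."""
    included = not include or host in include
    return included and host not in exclude
```

The only difference between the two is the "included and excluded" case: a datanode is let in so that it can be decommissioned, whereas a tasktracker is simply refused.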
To remove nodes from the cluster:
1. Add the network addresses of the nodes to be decommissioned to the exclude file.
Do not update the include file at this point.
2. Update the namenode with the new set of permitted datanodes, with this
command:
% hadoop dfsadmin -refreshNodes
3. Update the jobtracker with the new set of permitted tasktrackers using:
% hadoop mradmin -refreshNodes
4. Go to the web UI and check whether the admin state has changed to "Decommission
In Progress" for the datanodes being decommissioned. They will start copying
their blocks to other datanodes in the cluster.
5. When all the datanodes report their state as "Decommissioned," then all the blocks
have been replicated. Shut down the decommissioned nodes.
6. Remove the nodes from the include file, and run:
% hadoop dfsadmin -refreshNodes
% hadoop mradmin -refreshNodes
7. Remove the nodes from the slaves file.
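Step 5 of the procedure above hinges on every datanode reporting the Decommissioned state before anything is shut down. A minimal sketch of that check follows, assuming the admin states have been copied out of the web UI into a dictionary by hand; the dictionary, the host names, and the function names are hypothetical, and no Hadoop API is called.

```python
def ready_to_shut_down(admin_states):
    """Return True only when every node being removed is Decommissioned.

    `admin_states` maps a host name to the admin state string read from
    the web UI (a hypothetical, hand-built input).
    """
    return bool(admin_states) and all(
        state == "Decommissioned" for state in admin_states.values()
    )


def still_copying(admin_states):
    """List the hosts that are still copying their blocks elsewhere."""
    return sorted(
        host for host, state in admin_states.items()
        if state == "Decommission In Progress"
    )
```

An empty dictionary is deliberately treated as "not ready", so that a missing reading is never mistaken for a completed decommission.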
Upgrades
Upgrading an HDFS and MapReduce cluster requires careful planning. The most
important consideration is the HDFS upgrade. If the layout version of the filesystem has
changed, then the upgrade will automatically migrate the filesystem data and metadata
to a format that is compatible with the new version. As with any procedure that involves
data migration, there is a risk of data loss, so you should be sure that both your data
and metadata are backed up (see "Routine Administration Procedures" on page 355).
Part of the planning process should include a trial run on a small test cluster with a
copy of data that you can afford to lose. A trial run will allow you to familiarize yourself
with the process, customize it to your particular cluster configuration and toolset, and
iron out any snags before running the upgrade procedure on a production cluster. A
test cluster also has the benefit of being available to test client upgrades on. You can
read about general compatibility concerns for clients in "Compatibility" on page 15.
Upgrading a cluster when the filesystem layout has not changed is fairly
straightforward: install the new versions of HDFS and MapReduce on the cluster (and
on clients at the same time), shut down the old daemons, update configuration files,
then start up the new daemons and switch clients to use the new libraries. This process
is reversible, so rolling back an upgrade is also straightforward.
After every successful upgrade, you should perform a couple of final cleanup steps:
• Remove the old installation and configuration files from the cluster.
• Fix any deprecation warnings in your code and configuration.
HDFS data and metadata upgrades
If you use the procedure just described to upgrade to a new version of HDFS and it
expects a different layout version, then the namenode will refuse to run. A message like
the following will appear in its log:
File system image contains an old layout version -16.
An upgrade to version -18 is required.
Please restart NameNode with -upgrade option.
The most reliable way of finding out whether you need to upgrade the filesystem is by
performing a trial on a test cluster.
An upgrade of HDFS makes a copy of the previous version's metadata and data. Doing
an upgrade does not double the storage requirements of the cluster, as the datanodes
use hard links to keep two references (for the current and previous version) to the same
block of data. This design makes it straightforward to roll back to the previous version
of the filesystem, should you need to. You should understand that any changes made
to the data on the upgraded system will be lost after the rollback completes.
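The reason hard links avoid doubling the storage can be seen with a small experiment on any POSIX filesystem. The sketch below is not Hadoop code: it creates a stand-in "block" file in a temporary directory, links it into a second directory, and checks that both names refer to the same underlying data. The directory names echo the current and previous versions mentioned above, and the file name blk_0001 is made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    current = os.path.join(root, "current")
    previous = os.path.join(root, "previous")
    os.mkdir(current)
    os.mkdir(previous)

    # A stand-in for a block file; the name is invented for this example.
    block = os.path.join(current, "blk_0001")
    with open(block, "wb") as f:
        f.write(b"x" * 1024)

    # Keep a second reference to the same data instead of copying it.
    os.link(block, os.path.join(previous, "blk_0001"))

    cur_stat = os.stat(block)
    prev_stat = os.stat(os.path.join(previous, "blk_0001"))
    same_inode = cur_stat.st_ino == prev_stat.st_ino  # one copy on disk
    link_count = cur_stat.st_nlink                    # two names for it
    print(same_inode, link_count)
```

Both paths share one inode, so the data occupies disk space once; removing either name leaves the other intact.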
You can keep only the previous version of the filesystem: you can't roll back several
versions. Therefore, to carry out another upgrade to HDFS data and metadata, you will
need to delete the previous version, a process called finalizing the upgrade. Once an
upgrade is finalized, there is no procedure for rolling back to a previous version.
In general, you can skip releases when upgrading (for example, you can upgrade from
release 0.18.3 to 0.20.0 without having to upgrade to a 0.19.x release first), but in some
cases, you may have to go through intermediate releases. The release notes make it clear
when this is required.
You should only attempt to upgrade a healthy filesystem. Before running the upgrade,
do a full fsck (see "Filesystem check (fsck)" on page 345). As an extra precaution, you
can keep a copy of the fsck output that lists all the files and blocks in the system, so
you can compare it with the output of running fsck after the upgrade.
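Comparing the two saved listings can be automated. The following is a rough sketch that treats each listing as opaque text with one entry per line and reports lines that disappeared or appeared; it does not parse real fsck output, and the function name is an assumption for the example.

```python
def compare_listings(before_text, after_text):
    """Return (missing, added) lines between two saved listings.

    Both arguments are plain text with one entry per line. The line
    format is treated as opaque; nothing here understands fsck output.
    """
    before = {line.strip() for line in before_text.splitlines() if line.strip()}
    after = {line.strip() for line in after_text.splitlines() if line.strip()}
    return sorted(before - after), sorted(after - before)
```

In practice, lines that legitimately differ between runs (timestamps, for instance) would need to be filtered out first; the sketch leaves that to the reader.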
It's also worth clearing out temporary files before doing the upgrade, both from the
MapReduce system directory on HDFS and local temporary files.
With these preliminaries out of the way, here is the high-level procedure for upgrading
a cluster when the filesystem layout needs to be migrated:
1. Make sure that any previous upgrade is finalized before proceeding with another
upgrade.
2. Shut down MapReduce and kill any orphaned task processes on the tasktrackers.
3. Shut down HDFS and back up the namenode directories.
4. Install new versions of Hadoop HDFS and MapReduce on the cluster and on
clients.
5. Start HDFS with the -upgrade option.
6. Wait until the upgrade is complete.
7. Perform some sanity checks on HDFS.
8. Start MapReduce.
9. Roll back or finalize the upgrade (optional).
While running the upgrade procedure, it is a good idea to remove the Hadoop scripts
from your PATH environment variable. This forces you to be explicit about which version
of the scripts you are running. It can be convenient to define two environment variables
for the old and new installation directories; in the following instructions, we have defined
OLD_HADOOP_INSTALL and NEW_HADOOP_INSTALL.
Start the upgrade. To perform the upgrade, run the following command (this is step 5 in
the high-level upgrade procedure):
% $NEW_HADOOP_INSTALL/bin/start-dfs.sh -upgrade
This causes the namenode to upgrade its metadata, placing the previous version in a
new directory called previous:
${dfs.name.dir}/current/VERSION
                       /edits
                       /fsimage
                       /fstime
               /previous/VERSION
                        /edits
                        /fsimage
                        /fstime
Similarly, datanodes upgrade their storage directories, preserving the old copy in a
directory called previous.
Wait until the upgrade is complete. The upgrade process is not instantaneous, but you can
check the progress of an upgrade using dfsadmin (upgrade events also appear in the
daemons' logfiles, step 6):
% $NEW_HADOOP_INSTALL/bin/hadoop dfsadmin -upgradeProgress status
Upgrade for version -18 has been completed.
Upgrade is not finalized.
Check the upgrade. This shows that the upgrade is complete. At this stage, you should run
some sanity checks (step 7) on the filesystem (check files and blocks using fsck, basic
file operations). You might choose to put HDFS into safe mode while you are running
some of these checks (the ones that are read-only) to prevent others from making
changes.
Roll back the upgrade (optional). If you find that the new version is not working correctly,
you may choose to roll back to the previous version (step 9). This is only possible if
you have not finalized the upgrade.
A rollback reverts the filesystem state to before the upgrade was
performed, so any changes made in the meantime will be lost. In other
words, it rolls back to the previous state of the filesystem, rather than
downgrading the current state of the filesystem to a former version.
First, shut down the new daemons:
% $NEW_HADOOP_INSTALL/bin/stop-dfs.sh
Then start up the old version of HDFS with the -rollback option:
% $OLD_HADOOP_INSTALL/bin/start-dfs.sh -rollback
This command gets the namenode and datanodes to replace their current storage
directories with their previous copies. The filesystem will be returned to its previous
state.
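Because the rollback deliberately mixes scripts from two installations, it can help to build the command lines explicitly rather than rely on PATH. The sketch below only assembles the two commands shown above as argument lists; it does not run them, and the function name and example paths are assumptions for illustration.

```python
def rollback_commands(old_install, new_install):
    """Return the rollback commands, in order, as argument lists.

    The new daemons are stopped with the new installation's script,
    then HDFS is started from the old installation with -rollback.
    """
    return [
        [f"{new_install}/bin/stop-dfs.sh"],
        [f"{old_install}/bin/start-dfs.sh", "-rollback"],
    ]


# Example with hypothetical installation directories.
for command in rollback_commands("/opt/hadoop-old", "/opt/hadoop-new"):
    print(" ".join(command))
```

Keeping the old and new paths as separate parameters makes it harder to start the wrong version by accident.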
Finalize the upgrade (optional). When you are happy with the new version of HDFS, you
can finalize the upgrade (step 9) to remove the previous storage directories.
After an upgrade has been finalized, there is no way to roll back to the
previous version.
This step is required before performing another upgrade:
% $NEW_HADOOP_INSTALL/bin/hadoop dfsadmin -finalizeUpgrade
% $NEW_HADOOP_INSTALL/bin/hadoop dfsadmin -upgradeProgress status
There are no upgrades in progress.
HDFS is now fully upgraded to the new version.
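When scripting around an upgrade, the text printed by dfsadmin -upgradeProgress status can be classified with a few string checks. This is a hedged sketch that recognizes only the messages quoted in this section; any other wording is reported as unknown rather than guessed at, and the function name and return labels are assumptions.

```python
def upgrade_state(status_output):
    """Classify the text printed by `dfsadmin -upgradeProgress status`.

    Only the messages quoted in this section are recognized; anything
    else is reported as "unknown".
    """
    text = status_output.strip()
    if "There are no upgrades in progress." in text:
        return "no upgrade in progress"
    if "has been completed." in text and "Upgrade is not finalized." in text:
        return "complete, not finalized"
    return "unknown"
```

A rollback is still possible only in the "complete, not finalized" state; after finalizing, the status reports that no upgrade is in progress.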
CHAPTER 11
Pig
Pig raises the level of abstraction for processing large datasets. MapReduce allows you,
the programmer, to specify a map function followed by a reduce function, but working
out how to fit your data processing into this pattern, which often requires multiple
MapReduce stages, can be a challenge. With Pig, the data structures are much richer,
typically being multivalued and nested, and the transformations you can apply
to the data are much more powerful. They include joins, for example, which are not
for the faint of heart in MapReduce.
Pig is made up of two pieces:
• The language used to express data flows, called Pig Latin.
• The execution environment to run Pig Latin programs. There are currently two
environments: local execution in a single JVM and distributed execution on a
Hadoop cluster.
A Pig Latin program is made up of a series of operations, or transformations, that are
applied to the input data to produce output. Taken as a whole, the operations describe
a data flow, which the Pig execution environment translates into an executable
representation and then runs. Under the covers, Pig turns the transformations into a series
of MapReduce jobs, but as a programmer you are mostly unaware of this, which allows
you to focus on the data rather than the nature of the execution.
Pig is a scripting language for exploring large datasets. One criticism of MapReduce is
that the development cycle is very long. Writing the mappers and reducers, compiling
and packaging the code, submitting the job(s), and retrieving the results is a
time-consuming business, and even with Streaming, which removes the compile and package
step, the experience is still involved. Pig's sweet spot is its ability to process terabytes
of data simply by issuing a half-dozen lines of Pig Latin from the console. Indeed, it
was created at Yahoo! to make it easier for researchers and engineers to mine the huge
datasets there. Pig is very supportive of a programmer writing a query, since it provides
several commands for introspecting the data structures in your program, as it is written.
Even more useful, it can perform a sample run on a representative subset of your input
data, so you can see whether there are errors in the processing before unleashing it on
the full dataset.
Pig was designed to be extensible. Virtually all parts of the processing path are
customizable: loading, storing, filtering, grouping, and joining can all be altered by
user-defined functions (UDFs). These functions operate on Pig's nested data model, so they
can integrate very deeply with Pig's operators. As another benefit, UDFs tend to be
more reusable than the libraries developed for writing MapReduce programs.
Pig isn't suitable for all data processing tasks, however. Like MapReduce, it is designed
for batch processing of data. If you want to perform a query that touches only a small
amount of data in a large dataset, then Pig will not perform well, since it is set up to
scan the whole dataset, or at least large portions of it.
In some cases, Pig doesn't perform as well as programs written in MapReduce.
However, the gap is narrowing with each release, as the Pig team implements sophisticated
algorithms for implementing Pig's relational operators. It's fair to say that unless you
are willing to invest a lot of effort optimizing Java MapReduce code, writing queries in
Pig Latin will save you time.
Installing and Running Pig
Pig runs as a client-side application. Even if you want to run Pig on a Hadoop cluster,
there is nothing extra to install on the cluster: Pig launches jobs and interacts with
HDFS (or other Hadoop filesystems) from your workstation.
Installation is straightforward. Java 6 is a prerequisite (and on Windows, you will need
Cygwin). Download a stable release from http://pig.apache.org/releases.html, and
unpack the tarball in a suitable place on your workstation:
% tar xzf pig-x.y.z.tar.gz
It's convenient to add Pig's binary directory to your command-line path. For example:
% export PIG_INSTALL=/home/tom/pig-x.y.z
% export PATH=$PATH:$PIG_INSTALL/bin
You also need to set the JAVA_HOME environment variable to point to a suitable Java
installation.
Try typing pig -help to get usage instructions.
Execution Types
Pig has two execution types or modes: local mode and MapReduce mode.
Local mode
In local mode, Pig runs in a single JVM and accesses the local filesystem. This mode is
suitable only for small datasets and when trying out Pig.
The execution type is set using the -x or -exectype option. To run in local mode, set
the option to local:
% pig -x local
grunt>
This starts Grunt, the Pig interactive shell, which is discussed in more detail shortly.
MapReduce mode
In MapReduce mode, Pig translates queries into MapReduce jobs and runs them on a
Hadoop cluster. The cluster may be a pseudo- or fully distributed cluster. MapReduce
mode (with a fully distributed cluster) is what you use when you want to run Pig on
large datasets.
To use MapReduce mode, you first need to check that the version of Pig you
downloaded is compatible with the version of Hadoop you are using. Pig releases will only
work against particular versions of Hadoop; this is documented in the release notes.
Pig honors the HADOOP_HOME environment variable for finding which Hadoop client to
run. However, if it is not set, Pig will use a bundled copy of the Hadoop libraries. Note
that these may not match the version of Hadoop running on your cluster, so it is best
to explicitly set HADOOP_HOME.
Next, you need to point Pig at the cluster's namenode and jobtracker. If the installation
of Hadoop at HADOOP_HOME is already configured for this, then there is nothing more to
do. Otherwise, you can set HADOOP_CONF_DIR to a directory containing the Hadoop site
file (or files) that define fs.default.name and mapred.job.tracker.
Alternatively, you can set these two properties in the pig.properties file in Pig's conf
directory (or the directory specified by PIG_CONF_DIR). Here's an example for a
pseudo-distributed setup:
fs.default.name=hdfs://localhost/
mapred.job.tracker=localhost:8021
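To check what a properties file like this one will tell Pig, the key=value lines can be read into a dictionary. The sketch below handles only the simple form shown above (one key=value pair per line, with # comments skipped); the full Java properties syntax has more rules that are not modelled here, and the function name is an assumption.

```python
def parse_properties(text):
    """Parse simple key=value lines into a dict.

    Blank lines and lines starting with '#' are skipped. This covers
    only the plain form used in the example, not every feature of the
    Java properties format.
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


example = "fs.default.name=hdfs://localhost/\nmapred.job.tracker=localhost:8021\n"
print(parse_properties(example))
```

Splitting on the first = only matters for values that themselves contain an equals sign; the two values here do not.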
Once you have configured Pig to connect to a Hadoop cluster, you can launch Pig,
setting the -x option to mapreduce, or omitting it entirely, as MapReduce mode is the
default:
% pig
2012-01-18 20:23:05,764 [main] INFO org.apache.pig.Main - Logging error messages to: /private/tmp/pig_1326946985762.log
2012-01-18 20:23:06,009 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost/
2012-01-18 20:23:06,274 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:8021
grunt>
As you can see from the output, Pig reports the filesystem and jobtracker that it has
connected to.
Running Pig Programs
There are three ways of executing Pig programs, all of which work in both local and
MapReduce mode:
Script
Pig can run a script file that contains Pig commands. For example, pig
script.pig runs the commands in the local file script.pig. Alternatively, for very
short scripts, you can use the -e option to run a script specified as a string on the
command line.
Grunt
Grunt is an interactive shell for running Pig commands. Grunt is started when no
file is specified for Pig to run, and the -e option is not used. It is also possible to
run Pig scripts from within Grunt using run and exec.
Embedded
You can run Pig programs from Java using the PigServer class, much like you can
use JDBC to run SQL programs from Java. For programmatic access to Grunt, use
PigRunner.
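The first two ways of running Pig differ only in the arguments passed to the pig launcher, which can be sketched as a small helper that assembles the argument list. The helper only builds the list and does not start Pig; the function name and parameter names are assumptions for the example, while the -x and -e options are the ones described above.

```python
def pig_command(script=None, inline=None, exectype=None):
    """Build an argument list for the `pig` launcher.

    script   -- path to a script file, e.g. "script.pig"
    inline   -- a short script passed as a string with -e
    exectype -- "local" or "mapreduce"; left out when None
    With neither script nor inline, the command starts Grunt.
    """
    if script and inline:
        raise ValueError("give a script file or an inline script, not both")
    args = ["pig"]
    if exectype:
        args += ["-x", exectype]
    if inline:
        args += ["-e", inline]
    elif script:
        args.append(script)
    return args


print(pig_command(script="script.pig", exectype="local"))
```

Passing the result to a process launcher as a list, rather than as one joined string, avoids having to quote an inline script for the shell.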
Grunt
Grunt has line-editing facilities like those found in GNU Readline (used in the bash
shell and many other command-line applications). For instance, the Ctrl-E key
combination will move the cursor to the end of the line. Grunt remembers command
history, too,[1] and you can recall lines in the history buffer using Ctrl-P or Ctrl-N (for
previous and next) or, equivalently, the up or down cursor keys.
[1] History is stored in a file called .pig_history in your home directory.
Another handy feature is Grunt's completion mechanism, which will try to complete
Pig Latin keywords and functions when you press the Tab key. For example, consider
the following incomplete line:
grunt> a = foreach b ge
If you press the Tab key at this point, ge will expand to generate, a Pig Latin keyword:
grunt> a = foreach b generate
You can customize the completion tokens by creating a file named autocomplete and
placing it on Pig's classpath (such as in the conf directory in Pig's install directory), or
in the directory you invoked Grunt from. The file should have one token per line, and
tokens must not contain any whitespace. Matching is case-sensitive. It can be very
handy to add commonly used file paths (especially because Pig does not perform
filename completion) or the names of any user-defined functions you have created.
You can get a list of commands using the help command. When you've finished your
Grunt session, you can exit with the quit command.
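The format rules for the autocomplete file (one token per line, no whitespace inside a token) are easy to enforce when the file is generated. The following sketch produces the file's text from a list of tokens; the function name is hypothetical, and writing the result to a file named autocomplete in the right directory is left to the caller.

```python
def autocomplete_lines(tokens):
    """Return the text for an autocomplete file: one token per line.

    Raises ValueError for a token that is empty or contains whitespace,
    since such tokens are not allowed in the file.
    """
    cleaned = []
    for token in tokens:
        if not token or any(ch.isspace() for ch in token):
            raise ValueError("invalid completion token: %r" % (token,))
        if token not in cleaned:  # drop exact duplicates, keep order
            cleaned.append(token)
    return "\n".join(cleaned) + "\n"


print(autocomplete_lines(["input/ncdc/micro-tab/sample.txt", "MAX"]), end="")
```

Because matching is case-sensitive, tokens are written exactly as given, with no case folding.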
Pig Latin Editors
PigPen is an Eclipse plug-in that provides an environment for developing Pig programs.
It includes a Pig script text editor, an example generator (equivalent to the
ILLUSTRATE command), and a button for running the script on a Hadoop cluster. There is
also an operator graph window, which shows a script in graph form, for visualizing the
data flow. For full installation and usage instructions, please refer to the Pig wiki at
https://cwiki.apache.org/confluence/display/PIG/PigTools.
There are also Pig Latin syntax highlighters for other editors, including Vim and
TextMate. Details are available on the Pig wiki.
An Example
Let's look at a simple example by writing the program to calculate the maximum
recorded temperature by year for the weather dataset in Pig Latin (just like we did using
MapReduce in Chapter 2). The complete program is only a few lines long:
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
DUMP max_temp;
To explore what's going on, we'll use Pig's Grunt interpreter, which allows us to enter
lines and interact with the program to understand what it's doing. Start up Grunt in
local mode, then enter the first line of the Pig script:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year:chararray, temperature:int, quality:int);
For simplicity, the program assumes that the input is tab-delimited text, with each line
having just year, temperature, and quality fields. (Pig actually has more flexibility than
this with regard to the input formats it accepts, as you'll see later.) This line describes
the input data we want to process. The year:chararray notation describes the field's
name and type; a chararray is like a Java string, and an int is like a Java int. The LOAD
operator takes a URI argument; here we are just using a local file, but we could refer
to an HDFS URI. The AS clause (which is optional) gives the fields names to make it
convenient to refer to them in subsequent statements.
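What the LOAD statement and its AS clause describe can be mimicked in a few lines of Python: split each tab-delimited line into three fields and give them the declared types. This is an analogy rather than Pig's loader; the function name is an assumption, and the sketch simply raises an error on malformed input without modelling how Pig itself treats bad fields.

```python
def load_records(lines):
    """Parse tab-delimited lines into (year, temperature, quality) tuples.

    Mirrors the schema in the AS clause: year stays a string
    (chararray); temperature and quality become ints.
    """
    records = []
    for line in lines:
        year, temperature, quality = line.rstrip("\n").split("\t")
        records.append((year, int(temperature), int(quality)))
    return records


sample = ["1950\t0\t1\n", "1950\t22\t1\n", "1949\t111\t1\n"]
print(load_records(sample))
```

Keeping the year as a string matches the chararray type in the schema; it is used only as a grouping key, so no arithmetic is needed on it.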
The result of the LOAD operator, indeed any operator in Pig Latin, is a relation, which
is just a set of tuples. A tuple is just like a row of data in a database table, with multiple
fields in a particular order. In this example, the LOAD operator produces a set of (year,
temperature, quality) tuples that are present in the input file. We write a relation with
one tuple per line, where tuples are represented as comma-separated items in
parentheses:
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
Relations are given names, or aliases, so they can be referred to. This relation is given
the records alias. We can examine the contents of an alias using the DUMP operator:
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
We can also see the structure of a relation (the relation's schema) using the
DESCRIBE operator on the relation's alias:
grunt> DESCRIBE records;
records: {year: chararray,temperature: int,quality: int}
This tells us that records has three fields, with aliases year, temperature, and quality,
which are the names we gave them in the AS clause. The fields have the types given to
them in the AS clause, too. We shall examine types in Pig in more detail later.
The second statement removes records that have a missing temperature (indicated by
a value of 9999) or an unsatisfactory quality reading. For this small dataset, no records
are filtered out:
grunt> filtered_records = FILTER records BY temperature != 9999 AND
>> (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grunt> DUMP filtered_records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
The third statement uses the GROUP operator to group the filtered_records relation by the
year field. Let's use DUMP to see what it produces:
grunt> grouped_records = GROUP filtered_records BY year;
grunt> DUMP grouped_records;
(1949,{(1949,111,1),(1949,78,1)})
(1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})
We now have two rows, or tuples, one for each year in the input data. The first field in
each tuple is the field being grouped by (the year), and the second field is a bag of tuples
for that year. A bag is just an unordered collection of tuples, which in Pig Latin is
represented using curly braces.
By grouping the data in this way, we have created a row per year, so now all that remains
is to find the maximum temperature for the tuples in each bag. Before we do this, let's
understand the structure of the grouped_records relation:
grunt> DESCRIBE grouped_records;
grouped_records: {group: chararray,filtered_records: {year: chararray,
temperature: int,quality: int}}
This tells us that the grouping field is given the alias group by Pig, and the second field
is the same structure as the filtered_records relation that was being grouped. With
this information, we can try the fourth transformation:
grunt> max_temp = FOREACH grouped_records GENERATE group,
>> MAX(filtered_records.temperature);
FOREACH processes every row to generate a derived set of rows, using a GENERATE
clause to define the fields in each derived row. In this example, the first field is
group, which is just the year. The second field is a little more complex.
The filtered_records.temperature reference is to the temperature field of the
filtered_records bag in the grouped_records relation. MAX is a built-in function for
calculating the maximum value of fields in a bag. In this case, it calculates the maximum
temperature for the fields in each filtered_records bag. Let's check the result:
grunt> DUMP max_temp;
(1949,111)
(1950,22)
So we've successfully calculated the maximum temperature for each year.
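The same four-step data flow (load, filter, group, and take the maximum per group) can be traced in ordinary Python on the five sample tuples shown earlier. This is only an analogy to make each step concrete, not a description of how Pig executes the script; the variable names follow the Pig aliases, and a list stands in for a bag.

```python
from collections import defaultdict

# The five tuples shown by DUMP records.
records = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1949", 111, 1),
    ("1949", 78, 1),
]

# FILTER: drop missing temperatures and unsatisfactory quality codes.
filtered_records = [
    r for r in records
    if r[1] != 9999 and r[2] in (0, 1, 4, 5, 9)
]

# GROUP ... BY year: one bag (here, a list) of tuples per year.
grouped_records = defaultdict(list)
for r in filtered_records:
    grouped_records[r[0]].append(r)

# FOREACH ... GENERATE group, MAX(filtered_records.temperature).
max_temp = sorted(
    (year, max(temperature for _, temperature, _ in bag))
    for year, bag in grouped_records.items()
)
print(max_temp)
```

Each intermediate variable corresponds to one alias in the Pig script, which is what makes it possible to inspect the flow one step at a time, just as DUMP and DESCRIBE do in Grunt.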
Generating Examples
In this example, we've used a small sample dataset with just a handful of rows to make
it easier to follow the data flow and aid debugging. Creating a cut-down dataset is an
art, as ideally it should be rich enough to cover all the cases to exercise your queries
(the completeness property), yet be small enough to reason about by the programmer
(the conciseness property). Using a random sample doesn't work well in general, since
join and filter operations tend to remove all random data, leaving an empty result,
which is not illustrative of the general data flow.
With the ILLUSTRATE operator, Pig provides a tool for generating a reasonably
complete and concise dataset. Here is the output from running ILLUSTRATE (slightly
reformatted to fit the page):
grunt> ILLUSTRATE max_temp;
-------------------------------------------------------------------------------
| records | year:chararray | temperature:int | quality:int |
-------------------------------------------------------------------------------
| | 1949 | 78 | 1 |
| | 1949 | 111 | 1 |
| | 1949 | 9999 | 1 |
-------------------------------------------------------------------------------
----------------------------------------------------------------------------------------
| filtered_records | year:chararray | temperature:int | quality:int |
----------------------------------------------------------------------------------------
| | 1949 | 78 | 1 |
| | 1949 | 111 | 1 |
----------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------
| grouped_records | group:chararray | filtered_records:bag{:tuple(year:chararray, |
temperature:int,quality:int)} |
--------------------------------------------------------------------------------------------
| | 1949 | {(1949, 78, 1), (1949, 111, 1)} |
--------------------------------------------------------------------------------------------
---------------------------------------------------
| max_temp | group:chararray | :int |
---------------------------------------------------
| | 1949 | 111 |
---------------------------------------------------
Notice that Pig used some of the original data (this is important to keep the generated
dataset realistic), as well as creating some new data. It noticed the special value 9999
in the query and created a tuple containing this value to exercise the FILTER statement.
In summary, the output of the ILLUSTRATE is easy to follow and can help you
understand what your query is doing.
Comparison with Databases
Having seen Pig in action, it might seem that Pig Latin is similar to SQL. The presence
of such operators as GROUP BY and DESCRIBE reinforces this impression. However,
there are several differences between the two languages, and between Pig and RDBMSs
in general.
The most significant difference is that Pig Latin is a data flow programming language,
whereas SQL is a declarative programming language. In other words, a Pig Latin
program is a step-by-step set of operations on an input relation, in which each step is a
single transformation. By contrast, SQL statements are a set of constraints that, taken
together, define the output. In many ways, programming in Pig Latin is like working
at the level of an RDBMS query planner, which figures out how to turn a declarative
statement into a system of steps.
RDBMSs store data in tables, with tightly predefined schemas. Pig is more relaxed about
the data that it processes: you can define a schema at runtime, but it's optional.
Essentially, it will operate on any source of tuples (although the source should support
being read in parallel, by being in multiple files, for example), where a UDF is used to
read the tuples from their raw representation.[2] The most common representation is a
text file with tab-separated fields, and Pig provides a built-in load function for this
format. Unlike with a traditional database, there is no data import process to load the
data into the RDBMS. The data is loaded from the filesystem (usually HDFS) as the
first step in the processing.
[2] Or as the Pig Philosophy has it, "Pigs eat anything."
Pig's support for complex, nested data structures differentiates it from SQL, which
operates on flatter data structures. Also, Pig's ability to use UDFs and streaming
operators that are tightly integrated with the language and Pig's nested data structures
makes Pig Latin more customizable than most SQL dialects.
There are several features to support online, low-latency queries that RDBMSs have
that are absent in Pig, such as transactions and indexes. As mentioned earlier, Pig does
not support random reads or queries in the order of tens of milliseconds. Nor does it
support random writes to update small portions of data; all writes are bulk, streaming
writes, just like MapReduce.
Hive (covered in Chapter 12) sits between Pig and conventional RDBMSs. Like Pig,
Hive is designed to use HDFS for storage, but otherwise there are some significant
differences. Its query language, HiveQL, is based on SQL, and anyone who is familiar
with SQL would have little trouble writing queries in HiveQL. Like RDBMSs, Hive
mandates that all data be stored in tables, with a schema under its management;
however, it can associate a schema with preexisting data in HDFS, so the load step is
optional. Hive does not support low-latency queries, a characteristic it shares with Pig.
Pig Latin
This section gives an informal description of the syntax and semantics of the Pig Latin
programming language.[3] It is not meant to offer a complete reference to the
language,[4] but there should be enough here for you to get a good understanding of Pig
Latin's constructs.
[3] Not to be confused with Pig Latin, the language game. English words are translated into Pig Latin by
moving the initial consonant sound to the end of the word and adding an "ay" sound. For example, "pig"
becomes "ig-pay," and "Hadoop" becomes "Adoop-hay."
[4] Pig Latin does not have a formal language definition as such, but there is a comprehensive guide to the
language that can be found linked to from the Pig website at http://pig.apache.org/.
Structure
A Pig Latin program consists of a collection of statements. A statement can be thought
of as an operation, or a command.[5] For example, a GROUP operation is a type of
statement:
grouped_records = GROUP records BY year;
The command to list the files in a Hadoop filesystem is another example of a statement:
ls /
Statements are usually terminated with a semicolon, as in the example of the GROUP
statement. In fact, this is an example of a statement that must be terminated with a
semicolon: it is a syntax error to omit it. The ls command, on the other hand, does not
have to be terminated with a semicolon. As a general guideline, statements or
commands for interactive use in Grunt do not need the terminating semicolon. This group
includes the interactive Hadoop commands, as well as the diagnostic operators like
DESCRIBE. It's never an error to add a terminating semicolon, so if in doubt, it's
simplest to add one.
Statements that have to be terminated with a semicolon can be split across multiple
lines for readability:
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
Pig Latin has two forms of comments. Double hyphens are single-line comments.
Everything from the first hyphen to the end of the line is ignored by the Pig Latin
interpreter:
-- My program
DUMP A; -- What's in A?
C-style comments are more flexible since they delimit the beginning and end of the
comment block with /* and */ markers. They can span lines or be embedded in a single
line:
/*
* Description of my program spanning
* multiple lines.
*/
A = LOAD 'input/pig/join/A';
B = LOAD 'input/pig/join/B';
C = JOIN A BY $0, /* ignored */ B BY $1;
DUMP C;
Pig Latin has a list of keywords that have a special meaning in the language and cannot
be used as identifiers. These include the operators (LOAD, ILLUSTRATE), commands
(cat, ls), expressions (matches, FLATTEN), and functions (DIFF, MAX), all of which
are covered in the following sections.
Pig Latin has mixed rules on case sensitivity. Operators and commands are not
case-sensitive (to make interactive use more forgiving); however, aliases and function names
are case-sensitive.
5. You sometimes see these terms being used interchangeably in documentation on Pig Latin. For example,
"GROUP command," "GROUP operation," "GROUP statement."
Statements
As a Pig Latin program is executed, each statement is parsed in turn. If there are syntax
errors, or other (semantic) problems such as undefined aliases, the interpreter will halt
and display an error message. The interpreter builds a logical plan for every relational
operation, which forms the core of a Pig Latin program. The logical plan for the
statement is added to the logical plan for the program so far, then the interpreter moves on
to the next statement.
It's important to note that no data processing takes place while the logical plan of the
program is being constructed. For example, consider again the Pig Latin program from
the first example:
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
DUMP max_temp;
When the Pig Latin interpreter sees the first line containing the LOAD statement, it
confirms that it is syntactically and semantically correct, and adds it to the logical plan,
but it does not load the data from the file (or even check whether the file exists). Indeed,
where would it load it? Into memory? Even if it did fit into memory, what would it do
with the data? Perhaps not all the input data is needed (since later statements filter it,
for example), so it would be pointless to load it. The point is that it makes no sense to
start any processing until the whole flow is defined. Similarly, Pig validates the GROUP
and FOREACH...GENERATE statements, and adds them to the logical plan without
executing them. The trigger for Pig to start execution is the DUMP statement. At that
point, the logical plan is compiled into a physical plan and executed.
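The deferred execution described above can be modeled in a few lines of Python. This is only an illustrative sketch, not how Pig is implemented: the LazyPlan class, its add and dump methods, and the in-memory SAMPLE rows are all hypothetical stand-ins (the rows mirror the sample weather data used in this chapter). Each statement is recorded when it is added, and nothing runs until dump is called, much as DUMP triggers execution in Pig:

```python
class LazyPlan:
    """Toy model of a logical plan: statements are recorded, not executed."""

    def __init__(self):
        self.steps = []        # (alias, function) pairs, in program order
        self.executed = False  # becomes True only when dump() runs

    def add(self, alias, func):
        # Stands in for parsing a relational statement: record it, run nothing.
        self.steps.append((alias, func))

    def dump(self, alias):
        # Stands in for DUMP: only now is any data touched.
        self.executed = True
        relations = {}
        for name, func in self.steps:
            relations[name] = func(relations)
        return relations[alias]


# Hypothetical in-memory copy of the sample data: (year, temperature, quality).
SAMPLE = [("1950", 0, 1), ("1950", 22, 1), ("1950", -11, 1),
          ("1949", 111, 1), ("1949", 78, 1)]


def group_by_year(rows):
    """Group rows into a dict keyed by the first field (the year)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row)
    return groups


plan = LazyPlan()
plan.add("records", lambda rel: list(SAMPLE))
plan.add("filtered_records", lambda rel: [
    r for r in rel["records"] if r[1] != 9999 and r[2] in (0, 1, 4, 5, 9)])
plan.add("grouped_records", lambda rel: group_by_year(rel["filtered_records"]))
plan.add("max_temp", lambda rel: sorted(
    (year, max(r[1] for r in rows))
    for year, rows in rel["grouped_records"].items()))

built_without_running = not plan.executed  # nothing has run at this point
result = plan.dump("max_temp")
print(result)
```

Running the sketch prints [('1949', 111), ('1950', 22)], and built_without_running is True, showing that adding the four statements touched no data.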
Multiquery execution
Since DUMP is a diagnostic tool, it will always trigger execution. However, the STORE
command is different. In interactive mode, STORE acts like DUMP and will always
trigger execution (this includes the run command), but in batch mode it will not (this
includes the exec command). The reason for this is efficiency. In batch mode, Pig will
parse the whole script to see if there are any optimizations that could be made to limit
the amount of data to be written to or read from disk. Consider the following simple
example:
A = LOAD 'input/pig/multiquery/A';
B = FILTER A BY $1 == 'banana';
C = FILTER A BY $1 != 'banana';
STORE B INTO 'output/b';
STORE C INTO 'output/c';
Relations B and C are both derived from A, so to save reading A twice, Pig can run this
script as a single MapReduce job by reading A once and writing two output files from
the job, one for each of B and C. This feature is called multiquery execution.
In previous versions of Pig that did not have multiquery execution, each STORE
statement in a script run in batch mode triggered execution, resulting in a job for each
STORE statement. It is possible to restore the old behavior by disabling multiquery
execution with the -M or -no_multiquery option to pig.
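To see why a single pass matters, here is a rough Python model of the two strategies. It assumes nothing about Pig's internals: CountingSource, run_separately, and run_multiquery are hypothetical names, and the four input rows are made up for illustration. The point is only that both filters can be fed from one scan of A:

```python
def run_separately(read_input):
    """Without multiquery: one pass over the input per STORE."""
    b = [row for row in read_input() if row[1] == "banana"]
    c = [row for row in read_input() if row[1] != "banana"]
    return b, c


def run_multiquery(read_input):
    """With multiquery: a single pass over the input feeds both outputs."""
    b, c = [], []
    for row in read_input():
        (b if row[1] == "banana" else c).append(row)
    return b, c


class CountingSource:
    """Hypothetical input that counts how many times it is scanned."""

    def __init__(self, rows):
        self.rows = rows
        self.scans = 0

    def __call__(self):
        self.scans += 1
        return iter(self.rows)


# Made-up rows for illustration; the second field plays the role of $1.
rows = [(1, "banana"), (2, "apple"), (3, "banana"), (4, "cherry")]

separate = CountingSource(rows)
combined = CountingSource(rows)
separate_result = run_separately(separate)
combined_result = run_multiquery(combined)
print(separate.scans, combined.scans)
```

The sketch prints 2 1: two scans of the input when each STORE is handled on its own, and one when they are combined, with identical outputs either way.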
The physical plan that Pig prepares is a series of MapReduce jobs, which in local mode
Pig runs in the local JVM, and in MapReduce mode Pig runs on a Hadoop cluster.
You can see the logical and physical plans created by Pig using the
EXPLAIN command on a relation (EXPLAIN max_temp; for example).
EXPLAIN will also show the MapReduce plan, which shows how the
physical operators are grouped into MapReduce jobs. This is a good
way to find out how many MapReduce jobs Pig will run for your query.
The relational operators that can be a part of a logical plan in Pig are summarized in
Table 11-1. We shall go through the operators in more detail in "Data Processing
Operators" on page 397.
Table 11-1. Pig Latin relational operators
Category                 Operator            Description
Loading and storing      LOAD                Loads data from the filesystem or other storage into a relation
                         STORE               Saves a relation to the filesystem or other storage
                         DUMP                Prints a relation to the console
Filtering                FILTER              Removes unwanted rows from a relation
                         DISTINCT            Removes duplicate rows from a relation
                         FOREACH...GENERATE  Adds or removes fields from a relation
                         MAPREDUCE           Runs a MapReduce job using a relation as input
                         STREAM              Transforms a relation using an external program
                         SAMPLE              Selects a random sample of a relation
Grouping and joining     JOIN                Joins two or more relations
                         COGROUP             Groups the data in two or more relations
                         GROUP               Groups the data in a single relation
                         CROSS               Creates the cross-product of two or more relations
Sorting                  ORDER               Sorts a relation by one or more fields
                         LIMIT               Limits the size of a relation to a maximum number of tuples
Combining and splitting  UNION               Combines two or more relations into one
                         SPLIT               Splits a relation into two or more relations
There are other types of statements that are not added to the logical plan. For example,
the diagnostic operators, DESCRIBE, EXPLAIN, and ILLUSTRATE, are provided to
allow the user to interact with the logical plan, for debugging purposes (see
Table 11-2). DUMP is a sort of diagnostic operator, too, since it is used only to allow
interactive debugging of small result sets or in combination with LIMIT to retrieve a
few rows from a larger relation. The STORE statement should be used when the size
of the output is more than a few lines, as it writes to a file, rather than to the console.
Table 11-2. Pig Latin diagnostic operators
Operator    Description
DESCRIBE    Prints a relation's schema
EXPLAIN     Prints the logical and physical plans
ILLUSTRATE  Shows a sample execution of the logical plan, using a generated subset of the input
Pig Latin provides three statements, REGISTER, DEFINE, and IMPORT, to make it
possible to incorporate macros and user-defined functions into Pig scripts (see
Table 11-3).
Table 11-3. Pig Latin macro and UDF statements
Statement  Description
REGISTER   Registers a JAR file with the Pig runtime
DEFINE     Creates an alias for a macro, a UDF, streaming script, or a command specification
IMPORT     Imports macros defined in a separate file into a script
Since they do not process relations, commands are not added to the logical plan;
instead, they are executed immediately. Pig provides commands to interact with Hadoop
filesystems (which are very handy for moving data around before or after processing
with Pig) and MapReduce, as well as a few utility commands (described in Table 11-4).
Table 11-4. Pig Latin commands
Category           Command        Description
Hadoop Filesystem  cat            Prints the contents of one or more files
                   cd             Changes the current directory
                   copyFromLocal  Copies a local file or directory to a Hadoop filesystem
                   copyToLocal    Copies a file or directory on a Hadoop filesystem to the local filesystem
                   cp             Copies a file or directory to another directory
                   fs             Accesses Hadoop's filesystem shell
                   ls             Lists files
                   mkdir          Creates a new directory
                   mv             Moves a file or directory to another directory
                   pwd            Prints the path of the current working directory
                   rm             Deletes a file or directory
                   rmf            Forcibly deletes a file or directory (does not fail if the file or directory does not exist)
Hadoop MapReduce   kill           Kills a MapReduce job
Utility            exec           Runs a script in a new Grunt shell in batch mode
                   help           Shows the available commands and options
                   quit           Exits the interpreter
                   run            Runs a script within the existing Grunt shell
                   set            Sets Pig options and MapReduce job properties
                   sh             Runs a shell command from within Grunt
The filesystem commands can operate on files or directories in any Hadoop filesystem,
and they are very similar to the hadoop fs commands (which is not surprising, as both
are simple wrappers around the Hadoop FileSystem interface). You can access all of
the Hadoop filesystem shell commands using Pig's fs command. For example,
fs -ls will show a file listing, and fs -help will show help on all the available
commands.
Precisely which Hadoop filesystem is used is determined by the fs.default.name
property in the site file for Hadoop Core. See "The Command-Line Interface" on page 51
for more details on how to configure this property.
These commands are mostly self-explanatory, except set, which is used to set options
that control Pig's behavior, including arbitrary MapReduce job properties. The debug
option is used to turn debug logging on or off from within a script (you can also control
the log level when launching Pig, using the -d or -debug option):
grunt> set debug on
Another useful option is the job.name option, which gives a Pig job a meaningful name,
making it easier to pick out your Pig MapReduce jobs when running on a shared
Hadoop cluster. If Pig is running a script (rather than being an interactive query from
Grunt), its job name defaults to a value based on the script name.
There are two commands in Table 11-4 for running a Pig script, exec and run. The
difference is that exec runs the script in batch mode in a new Grunt shell, so any aliases
defined in the script are not accessible to the shell after the script has completed. On
the other hand, when running a script with run, it is as if the contents of the script had
been entered manually, so the command history of the invoking shell contains all the
statements from the script. Multiquery execution, where Pig executes a batch of
statements in one go (see "Multiquery execution" on page 375), is only used by exec, not run.
By design, Pig Latin lacks native control flow statements. The
recommended approach for writing programs that have conditional logic or
loop constructs is to embed Pig Latin in another language like Python,
JavaScript or Java, and manage the control flow from there. In this
model the host script uses a compile-bind-run API to execute Pig scripts
and retrieve their status. Consult the Pig documentation for details of
the API.
Embedded Pig programs always run in a JVM, so for Python and
JavaScript you use the pig command followed by the name of your script,
and the appropriate Java scripting engine will be selected (Jython for
Python, Rhino for JavaScript).
Expressions
An expression is something that is evaluated to yield a value. Expressions can be used
in Pig as a part of a statement containing a relational operator. Pig has a rich variety of
expressions, many of which will be familiar from other programming languages. They
are listed in Table 11-5, with brief descriptions and examples. We shall see examples
of many of these expressions throughout the chapter.
Table 11-5. Pig Latin expressions
Category              Expressions     Description                                                          Examples
Constant              Literal         Constant value (see also literals in Table 11-6)                     1.0, 'a'
Field (by position)   $n              Field in position n (zero-based)                                     $0
Field (by name)       f               Field named f                                                        year
Field (disambiguate)  r::f            Field named f from relation r after grouping or joining              A::year
Projection            c.$n, c.f       Field in container c (relation, bag, or tuple) by position, by name  records.$0, records.year
Map lookup            m#k             Value associated with key k in map m                                 items#'Coat'
Cast                  (t) f           Cast of field f to type t                                            (int) year
Arithmetic            x + y, x - y    Addition, subtraction                                                $1 + $2, $1 - $2
                      x * y, x / y    Multiplication, division                                             $1 * $2, $1 / $2
                      x % y           Modulo, the remainder of x divided by y                              $1 % $2
                      +x, -x          Unary positive, negation                                             +1, -1
Conditional           x ? y : z       Bincond/ternary, y if x evaluates to true, z otherwise               quality == 0 ? 0 : 1
Comparison            x == y, x != y  Equals, not equals                                                   quality == 0, temperature != 9999
                      x > y, x < y    Greater than, less than                                              quality > 0, quality < 10
                      x >= y, x <= y  Greater than or equal to, less than or equal to                      quality >= 1, quality <= 9
                      x matches y     Pattern matching with regular expression                             quality matches '[01459]'
                      x is null       Is null                                                              temperature is null
                      x is not null   Is not null                                                          temperature is not null
Boolean               x or y          Logical or                                                           q == 0 or q == 1
                      x and y         Logical and                                                          q == 0 and r == 0
                      not x           Logical negation                                                     not q matches '[01459]'
Functional            fn(f1,f2,…)     Invocation of function fn on fields f1, f2, etc.                     isGood(quality)
Flatten               FLATTEN(f)      Removal of a level of nesting from bags and tuples                   FLATTEN(group)
Types
So far you have seen some of the simple types in Pig, such as int and chararray. Here
we will discuss Pig's built-in types in more detail.
Pig has four numeric types: int, long, float, and double, which are identical to their
Java counterparts. There is also a bytearray type, like Java's byte array type for
representing a blob of binary data, and chararray, which, like java.lang.String, represents
textual data in UTF-16 format, although it can be loaded or stored in UTF-8 format.
Pig does not have types corresponding to Java's boolean,⁶ byte, short, or char primitive
types. These are all easily represented using Pig's int type, or chararray for char.
The numeric, textual, and binary types are simple atomic types. Pig Latin also has three
complex types for representing nested structures: tuple, bag, and map. All of Pig Latin's
types are listed in Table 11-6.
6. Although there is no boolean type for data (until version 0.10.0), Pig has the concept of an expression
evaluating to true or false, for testing conditions (such as in a FILTER statement). However, Pig does not
allow a boolean expression to be stored in a field.
Table 11-6. Pig Latin types
Category  Type       Description                                                                      Literal example
Numeric   int        32-bit signed integer                                                            1
          long       64-bit signed integer                                                            1L
          float      32-bit floating-point number                                                     1.0F
          double     64-bit floating-point number                                                     1.0
Text      chararray  Character array in UTF-16 format                                                 'a'
Binary    bytearray  Byte array                                                                       Not supported
Complex   tuple      Sequence of fields of any type                                                   (1,'pomegranate')
          bag        An unordered collection of tuples, possibly with duplicates                      {(1,'pomegranate'),(2)}
          map        A set of key-value pairs. Keys must be character arrays; values may be any type  ['a'#'pomegranate']
The complex types are usually loaded from files or constructed using relational
operators. Be aware, however, that the literal form in Table 11-6 is used when a constant
value is created from within a Pig Latin program. The raw form in a file is usually
different when using the standard PigStorage loader. For example, the representation
in a file of the bag in Table 11-6 would be {(1,pomegranate),(2)} (note the lack of
quotes), and with a suitable schema, this would be loaded as a relation with a single
field and row, whose value was the bag.
Pig provides built-in functions TOTUPLE, TOBAG and TOMAP, which are used for turning
expressions into tuples, bags and maps.
Although relations and bags are conceptually the same (an unordered collection of
tuples), in practice Pig treats them slightly differently. A relation is a top-level construct,
whereas a bag has to be contained in a relation. Normally, you don't have to worry
about this, but there are a few restrictions that can trip up the uninitiated. For example,
it's not possible to create a relation from a bag literal. So the following statement fails:
A = {(1,2),(3,4)}; -- Error
The simplest workaround in this case is to load the data from a file using the LOAD
statement.
As another example, you can't treat a relation like a bag and project a field into a new
relation ($0 refers to the first field of A, using the positional notation):
B = A.$0;
Instead, you have to use a relational operator to turn the relation A into relation B:
B = FOREACH A GENERATE $0;
It's possible that a future version of Pig Latin will remove these inconsistencies and
treat relations and bags in the same way.
Schemas
A relation in Pig may have an associated schema, which gives the fields in the relation
names and types. We've seen how an AS clause in a LOAD statement is used to attach
a schema to a relation:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: int,temperature: int,quality: int}
This time we've declared the year to be an integer, rather than a chararray, even though
the file it is being loaded from is the same. An integer may be more appropriate if we
needed to manipulate the year arithmetically (to turn it into a timestamp, for example),
whereas the chararray representation might be more appropriate when it's being used
as a simple identifier. Pig's flexibility in the degree to which schemas are declared
contrasts with schemas in traditional SQL databases, which are declared before the data
is loaded into the system. Pig is designed for analyzing plain input files with no
associated type information, so it is quite natural to choose types for fields later than
you would with an RDBMS.
It's possible to omit type declarations completely, too:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year, temperature, quality);
grunt> DESCRIBE records;
records: {year: bytearray,temperature: bytearray,quality: bytearray}
In this case, we have specified only the names of the fields in the schema, year,
temperature, and quality. The types default to bytearray, the most general type,
representing a binary string.
You don't need to specify types for every field; you can leave some to default to
bytearray, as we have done for year in this declaration:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: bytearray,temperature: int,quality: int}
However, if you specify a schema in this way, you do need to specify every field. Also,
there's no way to specify the type of a field without specifying the name. On the other
hand, the schema is entirely optional and can be omitted by not specifying an AS clause:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt';
grunt> DESCRIBE records;
Schema for records unknown.
Fields in a relation with no schema can be referenced only using positional notation:
$0 refers to the first field in a relation, $1 to the second, and so on. Their types default
to bytearray:
grunt> projected_records = FOREACH records GENERATE $0, $1, $2;
grunt> DUMP projected_records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
grunt> DESCRIBE projected_records;
projected_records: {bytearray,bytearray,bytearray}
Although it can be convenient not to have to assign types to fields (particularly in the
first stages of writing a query), doing so can improve the clarity and efficiency of Pig
Latin programs, and is generally recommended.
Declaring a schema as a part of the query is flexible, but doesn't lend
itself to schema reuse. A set of Pig queries over the same input data will
often have the same schema repeated in each query. If the query
processes a large number of fields, this repetition can become hard to
maintain.
The Apache HCatalog project (http://incubator.apache.org/hcatalog/)
solves this problem by providing a table metadata service, based on
Hive's metastore, so that Pig queries can reference schemas by name,
rather than specifying them in full each time.
Validation and nulls
An SQL database will enforce the constraints in a table's schema at load time: for
example, trying to load a string into a column that is declared to be a numeric type will
fail. In Pig, if the value cannot be cast to the type declared in the schema, then it will
substitute a null value. Let's see how this works if we have the following input for the
weather data, which has an "e" character in place of an integer:
1950 0 1
1950 22 1
1950 e 1
1949 111 1
1949 78 1
Pig handles the corrupt line by producing a null for the offending value, which is
displayed as the absence of a value when dumped to screen (and also when saved using
STORE):
grunt> records = LOAD 'input/ncdc/micro-tab/sample_corrupt.txt'
>> AS (year:chararray, temperature:int, quality:int);
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,,1)
(1949,111,1)
(1949,78,1)
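The substitution of nulls for values that fail to cast can be sketched in plain Python. This is a simplified model under stated assumptions, not Pig's loader: cast_or_null and load are hypothetical helpers, None stands in for Pig's null, and the input is assumed to be tab-delimited:

```python
def cast_or_null(raw, caster):
    """Return caster(raw), or None (standing in for Pig's null) on failure."""
    try:
        return caster(raw)
    except ValueError:
        return None


def load(lines, schema):
    """Split tab-delimited lines and apply one caster per declared field."""
    records = []
    for line in lines:
        fields = line.split("\t")
        records.append(tuple(
            cast_or_null(value, caster)
            for value, caster in zip(fields, schema)))
    return records


# The corrupt sample input: the third line has "e" where an integer belongs.
lines = ["1950\t0\t1", "1950\t22\t1", "1950\te\t1",
         "1949\t111\t1", "1949\t78\t1"]

# Schema equivalent to (year:chararray, temperature:int, quality:int).
records = load(lines, (str, int, int))
corrupt_records = [r for r in records if r[1] is None]
print(corrupt_records)
```

The sketch prints [('1950', None, 1)]: the bad field becomes a null rather than stopping the load, and the affected record can then be selected with a null test.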
Pig produces a warning for the invalid field (not shown here), but does not halt its
processing. For large datasets, it is very common to have corrupt, invalid, or merely
unexpected data, and it is generally infeasible to incrementally fix every unparsable
record. Instead, we can pull out all of the invalid records in one go, so we can take
action on them, perhaps by fixing our program (because they indicate we have made a
mistake) or by filtering them out (because the data is genuinely unusable):
grunt> corrupt_records = FILTER records BY temperature is null;
grunt> DUMP corrupt_records;
(1950,,1)
Note the use of the is null operator, which is analogous to SQL. In practice, we would
include more information from the original record, such as an identifier and the value
that could not be parsed, to help our analysis of the bad data.
We can find the number of corrupt records using the following idiom for counting the
number of rows in a relation:
grunt> grouped = GROUP corrupt_records ALL;
grunt> all_grouped = FOREACH grouped GENERATE group, COUNT(corrupt_records);
grunt> DUMP all_grouped;
(all,1)
("GROUP" on page 405 explains grouping and the ALL operation in more detail.)
Another useful technique is to use the SPLIT operator to partition the data into good
and bad relations, which can then be analyzed separately:
grunt> SPLIT records INTO good_records IF temperature is not null,
>> bad_records IF temperature is null;
grunt> DUMP good_records;
(1950,0,1)
(1950,22,1)
(1949,111,1)
(1949,78,1)
grunt> DUMP bad_records;
(1950,,1)
Going back to the case in which temperature's type was left undeclared, the corrupt
data cannot be easily detected, since it doesn't surface as a null:
grunt> records = LOAD 'input/ncdc/micro-tab/sample_corrupt.txt'
>> AS (year:chararray, temperature, quality:int);
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,e,1)
(1949,111,1)
(1949,78,1)
grunt> filtered_records = FILTER records BY temperature != 9999 AND
>> (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grunt> grouped_records = GROUP filtered_records BY year;
grunt> max_temp = FOREACH grouped_records GENERATE group,
>> MAX(filtered_records.temperature);
grunt> DUMP max_temp;
(1949,111.0)
(1950,22.0)
What happens in this case is that the temperature field is interpreted as a bytearray, so
the corrupt field is not detected when the input is loaded. When passed to the MAX
function, the temperature field is cast to a double, since MAX works only with numeric
types. The corrupt field cannot be represented as a double, so it becomes a null, which
MAX silently ignores. The best approach is generally to declare types for your data on
loading, and look for missing or corrupt values in the relations themselves before you
do your main processing.
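The following Python sketch walks through the same sequence of events. It is a simplified model, not Pig's MAX implementation: to_double and max_ignoring_nulls are hypothetical helpers, None stands in for null, and the temperature is deliberately kept as an unparsed string until the aggregate is applied:

```python
def to_double(raw):
    """Cast a raw string field to float, or None if it cannot be parsed."""
    try:
        return float(raw)
    except ValueError:
        return None


def max_ignoring_nulls(values):
    """Maximum of the non-null values, or None if there are none."""
    present = [v for v in values if v is not None]
    return max(present) if present else None


# (year, temperature) with the temperature left as an untyped string.
rows = [("1950", "0"), ("1950", "22"), ("1950", "e"),
        ("1949", "111"), ("1949", "78")]

# Group by year, casting each temperature only at aggregation time.
by_year = {}
for year, raw_temp in rows:
    by_year.setdefault(year, []).append(to_double(raw_temp))

max_temp = sorted((year, max_ignoring_nulls(temps))
                  for year, temps in by_year.items())
print(max_temp)
```

The sketch prints [('1949', 111.0), ('1950', 22.0)]. The corrupt value disappears without a trace, which is why declaring types at load time and checking for nulls up front is the safer habit.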
Sometimes corrupt data shows up as smaller tuples since fields are simply missing. You
can filter these out by using the SIZE function as follows:
grunt> A = LOAD 'input/pig/corrupt/missing_fields';
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3)
(1,Scarf)
grunt> B = FILTER A BY SIZE(TOTUPLE(*)) > 1;
grunt> DUMP B;
(2,Tie)
(4,Coat)
(1,Scarf)
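For comparison, the same check can be expressed in a few lines of Python. The sketch assumes tab-delimited input and uses a hypothetical load_untyped helper; the length of each Python tuple plays the role of SIZE(TOTUPLE(*)):

```python
def load_untyped(lines, delimiter="\t"):
    """Split each line into a tuple with however many fields it has."""
    return [tuple(line.split(delimiter)) for line in lines]


# The third line is missing its second field.
lines = ["2\tTie", "4\tCoat", "3", "1\tScarf"]

a = load_untyped(lines)
b = [row for row in a if len(row) > 1]  # keep tuples with more than one field
print(b)
```

The sketch prints [('2', 'Tie'), ('4', 'Coat'), ('1', 'Scarf')], dropping the short tuple just as the FILTER statement does.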
Schema merging
In Pig, you don't declare the schema for every new relation in the data flow. In most
cases, Pig can figure out the resulting schema for the output of a relational operation
by considering the schema of the input relation.
How are schemas propagated to new relations? Some relational operators don't change
the schema, so the relation produced by the LIMIT operator (which restricts a relation
to a maximum number of tuples), for example, has the same schema as the relation it
operates on. For other operators, the situation is more complicated. UNION, for
example, combines two or more relations into one, and tries to merge the input relations'
schemas. If the schemas are incompatible, due to different types or number of fields,
then the schema of the result of the UNION is unknown.
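The rule just described for UNION can be modeled as follows. This Python sketch is a deliberate simplification based only on the description above: merge_union_schemas is a hypothetical function, schemas are represented as lists of (name, type) pairs, None stands for an unknown schema, and taking field names from the first input is an assumption. Pig's actual merging rules are more detailed:

```python
def merge_union_schemas(first, second):
    """Return the merged schema, or None if the inputs are incompatible."""
    if len(first) != len(second):
        return None  # different number of fields: schema unknown
    merged = []
    for (name, left_type), (_, right_type) in zip(first, second):
        if left_type != right_type:
            return None  # different types: schema unknown
        merged.append((name, left_type))  # assumption: keep the first names
    return merged


a = [("year", "int"), ("temperature", "int")]
b = [("year", "int"), ("temperature", "int")]
c = [("year", "chararray"), ("temperature", "int")]
d = [("year", "int")]

same = merge_union_schemas(a, b)
different_types = merge_union_schemas(a, c)
different_sizes = merge_union_schemas(a, d)
print(same, different_types, different_sizes)
```

Only the first call yields a schema; the other two return None, mirroring the "Schema for ... unknown" outcome that DESCRIBE reports for a relation without a schema.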
You can find out the schema for any relation in the data flow using the DESCRIBE
operator. If you want to redefine the schema for a relation, you can use the
FOREACH...GENERATE operator with AS clauses to define the schema for some or
all of the fields of the input relation.
See "User-Defined Functions" on page 389 for further discussion of schemas.
Functions
Functions in Pig come in four types:
Eval function
    A function that takes one or more expressions and returns another expression. An
    example of a built-in eval function is MAX, which returns the maximum value of the
    entries in a bag. Some eval functions are aggregate functions, which means they
    operate on a bag of data to produce a scalar value; MAX is an example of an aggregate
    function. Furthermore, many aggregate functions are algebraic, which means that
    the result of the function may be calculated incrementally. In MapReduce terms,
    algebraic functions make use of the combiner and are much more efficient to
    calculate (see "Combiner Functions" on page 34). MAX is an algebraic function,
    whereas a function to calculate the median of a collection of values is an example
    of a function that is not algebraic.
Filter function
    A special type of eval function that returns a logical boolean result. As the name
    suggests, filter functions are used in the FILTER operator to remove unwanted
    rows. They can also be used in other relational operators that take boolean
    conditions and, in general, expressions using boolean or conditional expressions. An
    example of a built-in filter function is IsEmpty, which tests whether a bag or a map
    contains any items.
Load function
    A function that specifies how to load data into a relation from external storage.
Store function
    A function that specifies how to save the contents of a relation to external storage.
    Often, load and store functions are implemented by the same type. For example,
    PigStorage, which loads data from delimited text files, can store data in the same
    format.
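The distinction between algebraic and non-algebraic functions can be checked with a small Python experiment. This is a sketch rather than Pig code: max_with_combiner and median_of_medians are hypothetical names, and the two partitions stand in for the input of two map tasks. Taking the maximum of per-partition maxima gives the right answer, whereas applying the same shortcut to the median generally does not:

```python
from statistics import median


def max_with_combiner(partitions):
    """MAX is algebraic: the max of per-partition maxima is the overall max."""
    partial = [max(p) for p in partitions]  # combiner-style partial results
    return max(partial)                     # final reduce step


def median_of_medians(partitions):
    """The same shortcut applied to the median; not correct in general."""
    return median(median(p) for p in partitions)


# Two partitions of temperature values, as two map tasks might see them.
partitions = [[0, 22, -11], [111, 78]]
all_values = [v for p in partitions for v in p]

max_matches = max_with_combiner(partitions) == max(all_values)
median_matches = median_of_medians(partitions) == median(all_values)
print(max_matches, median_matches)
```

The sketch prints True False: the partial maxima combine to the true maximum (111), but the median of the partial medians (47.25) differs from the true median (22), so a median function cannot take advantage of the combiner in this way.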
Pig comes with a collection of built-in functions, a selection of which are listed in
Table 11-7. The complete list of built-in functions, which includes a large number of
standard math and string functions, can be found in the documentation for each Pig
release.
Table 11-7. A selection of Pig's built-in functions
Category    Function                 Description
Eval        AVG                      Calculates the average (mean) value of entries in a bag.
            CONCAT                   Concatenates byte arrays or character arrays together.
            COUNT                    Calculates the number of non-null entries in a bag.
            COUNT_STAR               Calculates the number of entries in a bag, including those that are null.
            DIFF                     Calculates the set difference of two bags. If the two arguments are not bags,
                                     then returns a bag containing both if they are equal; otherwise, returns an
                                     empty bag.
            MAX                      Calculates the maximum value of entries in a bag.
            MIN                      Calculates the minimum value of entries in a bag.
            SIZE                     Calculates the size of a type. The size of numeric types is always one; for
                                     character arrays, it is the number of characters; for byte arrays, the number
                                     of bytes; and for containers (tuple, bag, map), it is the number of entries.
            SUM                      Calculates the sum of the values of entries in a bag.
            TOBAG                    Converts one or more expressions to individual tuples which are then put in
                                     a bag.
            TOKENIZE                 Tokenizes a character array into a bag of its constituent words.
            TOMAP                    Converts an even number of expressions to a map of key-value pairs.
            TOP                      Calculates the top n tuples in a bag.
            TOTUPLE                  Converts one or more expressions to a tuple.
Filter      IsEmpty                  Tests if a bag or map is empty.
Load/Store  PigStorage               Loads or stores relations using a field-delimited text format. Each line is
                                     broken into fields using a configurable field delimiter (defaults to a tab
                                     character) to be stored in the tuple's fields. It is the default storage when
                                     none is specified.
            BinStorage               Loads or stores relations from or to binary files. A Pig-specific format is used
                                     that uses Hadoop Writable objects.
            TextLoader               Loads relations from a plain-text format. Each line corresponds to a tuple
                                     whose single field is the line of text.
            JsonLoader, JsonStorage  Loads or stores relations from or to a (Pig-defined) JSON format. Each tuple
                                     is stored on one line.
            HBaseStorage             Loads or stores relations from or to HBase tables.
If the function you need is not available, you can write your own. Before you do that,
however, have a look in the Piggy Bank, a repository of Pig functions shared by the Pig
community. For example, there are load and store functions in the Piggy Bank for Avro
data files, CSV files, Hive RCFiles, SequenceFiles, and XML files. The Pig website has
instructions on how to browse and obtain the Piggy Bank functions. If the Piggy Bank
doesn't have what you need, you can write your own function (and if it is sufficiently
general, you might consider contributing it to the Piggy Bank so that others can benefit
from it, too). These are known as user-defined functions, or UDFs.
Macros
Macros provide a way to package reusable pieces of Pig Latin code from within Pig
Latin itself. For example, we can extract the part of our Pig Latin program that performs
grouping on a relation then finds the maximum value in each group, by defining a macro
as follows:
DEFINE max_by_group(X, group_key, max_field) RETURNS Y {
A = GROUP $X by $group_key;
$Y = FOREACH A GENERATE group, MAX($X.$max_field);
};
The macro, called max_by_group, takes three parameters: a relation, X, and two field
names, group_key and max_field. It returns a single relation, Y. Within the macro body,
parameters and return aliases are referenced with a $ prefix, such as $X.
The macro is used as follows:
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
max_temp = max_by_group(filtered_records, year, temperature);
DUMP max_temp
At runtime, Pig will expand the macro using the macro definition. After expansion, the
program looks like the following, where the expanded section is the GROUP statement
and the FOREACH...GENERATE statement that follows it:
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
macro_max_by_group_A_0 = GROUP filtered_records by (year);
max_temp = FOREACH macro_max_by_group_A_0 GENERATE group,
MAX(filtered_records.(temperature));
DUMP max_temp
You don't normally see the expanded form since Pig creates it internally; however, in
some cases it is useful to see it when writing and debugging macros. You can get Pig to
perform macro expansion only (without executing the script) by passing the -dryrun
argument to pig.
Notice that the parameters that were passed to the macro (filtered_records, year, and
temperature) have been substituted for the names in the macro definition. Aliases in
the macro definition that don't have a $ prefix, such as A in this example, are local to
the macro definition and are rewritten at expansion time to avoid conflicts with aliases
in other parts of the program. In this case, A becomes macro_max_by_group_A_0 in the
expanded form.
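The substitution and renaming rules just described can be sketched in a few lines of Python. This is only an illustration of the idea, not Pig's actual macro processor: the expand_macro function, its parameters, and the regular expressions are assumptions made for this sketch, and the real expansion also wraps substituted field names in parentheses, which the sketch does not attempt.

```python
import re

def expand_macro(name, params, returns, body, args, target, counter=0):
    """Expand one macro call into plain Pig Latin text (hypothetical helper)."""
    # Aliases assigned in the body without a $ prefix are local to the macro,
    # so give each one a unique name to avoid clashes with the calling script.
    local_aliases = re.findall(r"^\s*(\w+)\s*=", body, flags=re.MULTILINE)
    for alias in local_aliases:
        unique = "macro_%s_%s_%d" % (name, alias, counter)
        body = re.sub(r"(?<![$\w])%s\b" % re.escape(alias), unique, body)
    # Replace $parameter with the caller's argument and $return_alias with the
    # alias on the left-hand side of the call.
    substitutions = dict(zip(params, args))
    substitutions[returns] = target
    return re.sub(r"\$(\w+)", lambda m: substitutions[m.group(1)], body)

macro_body = ("A = GROUP $X by $group_key;\n"
              "$Y = FOREACH A GENERATE group, MAX($X.$max_field);")

expanded = expand_macro(
    "max_by_group", ["X", "group_key", "max_field"], "Y", macro_body,
    ["filtered_records", "year", "temperature"], "max_temp")
print(expanded)
```

Renaming the local aliases before substituting the arguments keeps the sketch from accidentally rewriting text that came from the caller.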
To foster reuse, macros can be defined in separate files to Pig scripts, in which case they
need to be imported into any script that uses them. An import statement looks like this:
IMPORT './ch11/src/main/pig/max_temp.macro';
User-Defined Functions
Pig's designers realized that the ability to plug in custom code is crucial for all but the
most trivial data processing jobs. For this reason, they made it easy to define and use
user-defined functions. We only cover Java UDFs in this section, but be aware that you
can write UDFs in Python or JavaScript too, both of which are run using the Java
Scripting API.
A Filter UDF
Let's demonstrate by writing a filter function for filtering out weather records that do
not have a temperature quality reading of satisfactory (or better). The idea is to change
this line:
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
to:
filtered_records = FILTER records BY temperature != 9999 AND isGood(quality);
This achieves two things: it makes the Pig script more concise, and it encapsulates the
logic in one place so that it can be easily reused in other scripts. If we were just writing
an ad hoc query, then we probably wouldn't bother to write a UDF. It's when you start
doing the same kind of processing over and over again that you see opportunities for
reusable UDFs.
Filter UDFs are all subclasses of FilterFunc, which itself is a subclass of EvalFunc. We'll
look at EvalFunc in more detail later, but for the moment just note that, in essence,
EvalFunc looks like the following class:
public abstract class EvalFunc<T> {
public abstract T exec(Tuple input) throws IOException;
}
EvalFunc's only abstract method, exec(), takes a tuple and returns a single value, the
(parameterized) type T. The fields in the input tuple consist of the expressions passed
to the function (in this case, a single integer). For FilterFunc, T is Boolean, so the
method should return true only for those tuples that should not be filtered out.
For the quality filter, we write a class, IsGoodQuality, that extends FilterFunc and
implements the exec() method. See Example 11-1. The Tuple class is essentially a list of
objects with associated types. Here we are concerned only with the first field (since the
function only has a single argument), which we extract by index using the get() method
on Tuple. The field is an integer, so if it's not null, we cast it and check whether the
value is one that signifies the temperature was a good reading, returning the appropriate
value, true or false.
Example 11-1. A FilterFunc UDF to remove records with unsatisfactory temperature quality readings
package com.hadoopbook.pig;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.FilterFunc;

import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.FrontendException;

public class IsGoodQuality extends FilterFunc {

  @Override
  public Boolean exec(Tuple tuple) throws IOException {
    // An empty or missing tuple cannot be a good reading
    if (tuple == null || tuple.size() == 0) {
      return false;
    }
    try {
      Object object = tuple.get(0);
      if (object == null) {
        return false;
      }
      int i = (Integer) object;
      // Quality codes that signify a good temperature reading
      return i == 0 || i == 1 || i == 4 || i == 5 || i == 9;
    } catch (ExecException e) {
      throw new IOException(e);
    }
  }

}
To use the new function, we first compile it and package it in a JAR file (the example
code that accompanies this book comes with build instructions for how to do this).
Then we tell Pig about the JAR file with the REGISTER operator, which is given the
local path to the filename (and is not enclosed in quotes):
grunt> REGISTER pig-examples.jar;
Finally, we can invoke the lunction:
grunt> filtered_records = FILTER records BY temperature != 9999 AND
>> com.hadoopbook.pig.IsGoodQuality(quality);
Pig resolves function calls by treating the function's name as a Java classname and
attempting to load a class of that name. (This, incidentally, is why function names are
case-sensitive: because Java classnames are.) When searching for classes, Pig uses a
classloader that includes the JAR files that have been registered. When running in
distributed mode, Pig will ensure that your JAR files get shipped to the cluster.
For the UDF in this example, Pig looks for a class with the name
com.hadoopbook.pig.IsGoodQuality, which it finds in the JAR file we registered.
Resolution of built-in functions proceeds in the same way, except for one difference:
Pig has a set of built-in package names that it searches, so the function call does not
have to be a fully qualified name. For example, the function MAX is actually implemented
by a class MAX in the package org.apache.pig.builtin. This is one of the packages that
Pig looks in, so we can write MAX rather than org.apache.pig.builtin.MAX in our Pig
programs.
We can add our package name to the search path by invoking Grunt with this
command-line argument: -Dudf.import.list=com.hadoopbook.pig. Or, we can shorten the
function name by defining an alias, using the DEFINE operator:
grunt> DEFINE isGood com.hadoopbook.pig.IsGoodQuality();
grunt> filtered_records = FILTER records BY temperature != 9999 AND isGood(quality);
Defining an alias is a good idea if you want to use the function several times in the same
script. It's also necessary if you want to pass arguments to the constructor of the UDF's
implementation class.
Leveraging types
The filter works when the quality field is declared to be of type int, but if the type
information is absent, then the UDF fails! This happens because the field is the default
type, bytearray, represented by the DataByteArray class. Because DataByteArray is not
an Integer, the cast fails.
The obvious way to fix this is to convert the field to an integer in the exec() method.
However, there is a better way, which is to tell Pig the types of the fields that the function
expects. The getArgToFuncMapping() method on EvalFunc is provided for precisely this
reason. We can override it to tell Pig that the first field should be an integer:
  @Override
  public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
    List<FuncSpec> funcSpecs = new ArrayList<FuncSpec>();
    funcSpecs.add(new FuncSpec(this.getClass().getName(),
        new Schema(new Schema.FieldSchema(null, DataType.INTEGER))));
    return funcSpecs;
  }
This method returns a FuncSpec object corresponding to each of the fields of the tuple
that are passed to the exec() method. Here there is a single field, and we construct an
anonymous FieldSchema (the name is passed as null, since Pig ignores the name when
doing type conversion). The type is specified using the INTEGER constant on Pig's
DataType class.
With the amended function, Pig will attempt to convert the argument passed to the
function to an integer. If the field cannot be converted, then a null is passed for the
field. The exec() method always returns false if the field is null. For this application,
this behavior is appropriate, as we want to filter out records whose quality field is
unintelligible.
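The combined behavior (convert the argument, pass null when conversion fails, and reject null) can be mimicked outside Pig. The following Python sketch is not Pig code: to_integer and is_good_quality are hypothetical names that stand in for Pig's cast and for the exec() method, and the byte strings play the role of bytearray fields.

```python
GOOD_QUALITY_CODES = {0, 1, 4, 5, 9}

def to_integer(raw):
    """Stand-in for the cast Pig applies before calling the UDF.

    Returns None (Pig's null) when the value cannot be read as an integer.
    """
    if raw is None:
        return None
    try:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        return int(raw)
    except (ValueError, TypeError, UnicodeDecodeError):
        return None

def is_good_quality(fields):
    """Mirror of exec(): False for an empty tuple or a null first field."""
    if not fields:
        return False
    quality = to_integer(fields[0])
    if quality is None:
        return False
    return quality in GOOD_QUALITY_CODES

# Untyped (bytearray-like) fields: only parsable, good codes pass the filter.
rows = [(b"1",), (b"2",), (b"abc",), (None,), (9,)]
kept = [row for row in rows if is_good_quality(row)]
print(kept)
```

Treating an unparsable quality code the same way as a missing one keeps the filter conservative: a record is kept only when its quality is known to be good.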
Here's the final program using the new function:
-- max_temp_filter_udf.pig
REGISTER pig-examples.jar;
DEFINE isGood com.hadoopbook.pig.IsGoodQuality();
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND isGood(quality);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
DUMP max_temp;
An Eval UDF
Writing an eval function is a small step up from writing a filter function. Consider a
UDF (see Example 11-2) for trimming the leading and trailing whitespace from
chararray values, just like the trim() method on java.lang.String. We will use this
UDF later in the chapter.
Example 11-2. An EvalFunc UDF to trim leading and trailing whitespace from chararray values
public class Trim extends EvalFunc<String> {

  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0) {
      return null;
    }
    try {
      Object object = input.get(0);
      if (object == null) {
        return null;
      }
      return ((String) object).trim();
    } catch (ExecException e) {
      throw new IOException(e);
    }
  }

  @Override
  public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
    List<FuncSpec> funcList = new ArrayList<FuncSpec>();
    funcList.add(new FuncSpec(this.getClass().getName(), new Schema(
        new Schema.FieldSchema(null, DataType.CHARARRAY))));
    return funcList;
  }
}
An eval function extends the EvalFunc class, parameterized by the type of the return
value (which is String for the Trim UDF).[7] The exec() and getArgToFuncMapping()
methods are straightforward, like the ones in the IsGoodQuality UDF.
When you write an eval function, you need to consider what the output's schema looks
like. In the following statement, the schema of B is determined by the function udf:
B = FOREACH A GENERATE udf($0);
If udf creates tuples with scalar fields, then Pig can determine B's schema through
reflection. For complex types such as bags, tuples, or maps, Pig needs more help, and
you should implement the outputSchema() method to give Pig the information about
the output schema.
The Trim UDF returns a string, which Pig translates as a chararray, as can be seen from
the following session:
grunt> DUMP A;
( pomegranate)
(banana )
(apple)
( lychee )
grunt> DESCRIBE A;
A: {fruit: chararray}
grunt> B = FOREACH A GENERATE com.hadoopbook.pig.Trim(fruit);
grunt> DUMP B;
(pomegranate)
(banana)
(apple)
(lychee)
grunt> DESCRIBE B;
B: {chararray}
A has chararray fields that have leading and trailing spaces. We create B from A by
applying the Trim function to the first field in A (named fruit). B's fields are correctly
inferred to be of type chararray.
Dynamic Invokers
Sometimes you may want to use a function that is provided by a Java library, but
without going to the effort of writing a UDF. Dynamic invokers allow you to do this
by calling Java methods directly from a Pig script. The trade-off is that method calls are
made via reflection, which, when being called for every record in a large dataset, can
impose significant overhead. So for scripts that are run repeatedly, a dedicated UDF is
normally preferred.
The following snippet shows how we could define and use a trim UDF that uses the
Apache Commons Lang StringUtils class.
7. Although not relevant for this example, eval functions that operate on a bag may additionally implement
Pig's Algebraic or Accumulator interfaces for more efficient processing of the bag in chunks.
grunt> DEFINE trim InvokeForString('org.apache.commons.lang.StringUtils.trim', 'String');
grunt> B = FOREACH A GENERATE trim(fruit);
grunt> DUMP B;
(pomegranate)
(banana)
(apple)
(lychee)
The InvokeForString invoker is used since the return type of the method is a String.
(There are also InvokeForInt, InvokeForLong, InvokeForDouble, and InvokeForFloat
invokers.) The first argument to the invoker constructor is the fully qualified method to
be invoked. The second is a space-separated list of the method argument classes.
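To make the reflection trade-off concrete, here is a rough Python analogue of a dynamic invoker. It is not how Pig implements InvokeForString: the invoke_for_string and resolve functions are hypothetical, and Python's built-in str.strip stands in for StringUtils.trim. The point is only that the target is looked up by name on every call, which is the per-record cost a dedicated UDF avoids.

```python
import importlib

def resolve(qualified_name):
    """Look up 'module.attribute.attribute' by name (hypothetical helper)."""
    parts = qualified_name.split(".")
    target = importlib.import_module(parts[0])
    for attribute in parts[1:]:
        target = getattr(target, attribute)
    return target

def invoke_for_string(qualified_name):
    """Return a callable that resolves the target reflectively on each call."""
    def invoker(*args):
        # The name lookup is repeated for every record, which is what makes
        # reflective invocation slower than calling a function directly.
        return str(resolve(qualified_name)(*args))
    return invoker

trim = invoke_for_string("builtins.str.strip")
A = [" pomegranate", "banana ", "apple", " lychee "]
B = [trim(fruit) for fruit in A]
print(B)
```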
A Load UDF
We'll demonstrate a custom load function that can read plain-text column ranges as
fields, very much like the Unix cut command. It is used as follows:
grunt> records = LOAD 'input/ncdc/micro/sample.txt'
>> USING com.hadoopbook.pig.CutLoadFunc('16-19,88-92,93-93')
>> AS (year:int, temperature:int, quality:int);
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
The string passed to CutLoadFunc is the column specification; each comma-separated
range defines a field, which is assigned a name and type in the AS clause. Let's examine
the implementation of CutLoadFunc shown in Example 11-3.
Example 11-3. A LoadFunc UDF to load tuple fields as column ranges
public class CutLoadFunc extends LoadFunc {

  private static final Log LOG = LogFactory.getLog(CutLoadFunc.class);

  private final List<Range> ranges;
  private final TupleFactory tupleFactory = TupleFactory.getInstance();
  private RecordReader reader;

  public CutLoadFunc(String cutPattern) {
    ranges = Range.parse(cutPattern);
  }

  @Override
  public void setLocation(String location, Job job)
      throws IOException {
    FileInputFormat.setInputPaths(job, location);
  }

  @Override
  public InputFormat getInputFormat() {
    return new TextInputFormat();
  }

  @Override
  public void prepareToRead(RecordReader reader, PigSplit split) {
    this.reader = reader;
  }

  @Override
  public Tuple getNext() throws IOException {
    try {
      if (!reader.nextKeyValue()) {
        return null;
      }
      Text value = (Text) reader.getCurrentValue();
      String line = value.toString();
      Tuple tuple = tupleFactory.newTuple(ranges.size());
      for (int i = 0; i < ranges.size(); i++) {
        Range range = ranges.get(i);
        if (range.getEnd() > line.length()) {
          LOG.warn(String.format(
              "Range end (%s) is longer than line length (%s)",
              range.getEnd(), line.length()));
          continue;
        }
        tuple.set(i, new DataByteArray(range.getSubstring(line)));
      }
      return tuple;
    } catch (InterruptedException e) {
      throw new ExecException(e);
    }
  }
}
In Pig, like in Hadoop, data loading takes place before the mapper runs, so it is
important that the input can be split into portions that are independently handled by each
mapper (see "Input Splits and Records" on page 232 for background).
From Pig 0.7.0 the load and store function interfaces have been overhauled to be more
closely aligned with Hadoop's InputFormat and OutputFormat classes. Functions written
for previous versions of Pig will need rewriting (guidelines for doing so are provided at
http://wiki.apache.org/pig/LoadStoreMigrationGuide). A LoadFunc will typically use an
existing underlying InputFormat to create records, with the LoadFunc providing the logic
for turning the records into Pig tuples.
CutLoadFunc is constructed with a string that specifies the column ranges to use for each
field. The logic for parsing this string and creating a list of internal Range objects that
encapsulates these ranges is contained in the Range class, and is not shown here (it is
available in the example code that accompanies this book).
Pig calls setLocation() on a LoadFunc to pass the input location to the loader. Since
CutLoadFunc uses a TextInputFormat to break the input into lines, we just pass the
location to set the input path using a static method on FileInputFormat.
Pig uses the new MapReduce API, so we use the input and output formats
and associated classes from the org.apache.hadoop.mapreduce
package.
Next, Pig calls the getInputFormat() method to create a RecordReader for each split, just
like in MapReduce. Pig passes each RecordReader to the prepareToRead() method of
CutLoadFunc, which stores a reference to it so that the getNext() method can use it
for iterating through the records.
The Pig runtime calls getNext() repeatedly, and the load function reads tuples from the
reader until the reader reaches the last record in its split. At this point, it returns null
to signal that there are no more tuples to be read.
It is the responsibility of the getNext() implementation to turn lines of the input file
into Tuple objects. It does this by means of a TupleFactory, a Pig class for creating
Tuple instances. The newTuple() method creates a new tuple with the required number
of fields, which is just the number of Range objects, and the fields are populated using
substrings of the line, which are determined by the Range objects.
We need to think about what to do if the line is shorter than the range asked for. One
option is to throw an exception and stop further processing. This is appropriate if your
application cannot tolerate incomplete or corrupt records. In many cases, it is better
to return a tuple with null fields and let the Pig script handle the incomplete data as it
sees fit. This is the approach we take here: when a range end is past the end of the line,
we log a warning and skip to the next range, leaving that field in the tuple with its
default value of null.
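Since the Range class is not shown, the following Python sketch illustrates one way the column specification could be parsed and applied. It is a guess at the behavior rather than the book's implementation: parse_ranges and cut_fields are hypothetical names, and the sketch assumes ranges are 1-based and inclusive, as they are for the Unix cut command.

```python
def parse_ranges(cut_pattern):
    """Parse a pattern such as '16-19,88-92' into (start, end) pairs.

    Assumes 1-based, inclusive column numbers, as in the Unix cut command.
    """
    ranges = []
    for part in cut_pattern.split(","):
        start, end = part.split("-")
        ranges.append((int(start), int(end)))
    return ranges

def cut_fields(line, ranges):
    """Return one field per range; a range past the end of the line gives None."""
    fields = [None] * len(ranges)
    for i, (start, end) in enumerate(ranges):
        if end > len(line):
            continue  # leave this field as None, like a null field in the tuple
        fields[i] = line[start - 1:end]
    return fields

line = "0123456789ABCDEFGHIJ"  # a made-up 20-character record
fields = cut_fields(line, parse_ranges("1-4,6-6,19-25"))
print(fields)
```

The third range runs past the end of the 20-character line, so its field is left empty while the first two are filled in.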
Using a schema
Let's now consider the type of the fields being loaded. If the user has specified a schema,
then the fields need converting to the relevant types. However, this is performed lazily
by Pig, and so the loader should always construct tuples of type bytearray, using the
DataByteArray type. The loader function still has the opportunity to do the conversion,
however, by overriding getLoadCaster() to return a custom implementation of the
LoadCaster interface, which provides a collection of conversion methods for this
purpose:
public interface LoadCaster {
  public Integer bytesToInteger(byte[] b) throws IOException;
  public Long bytesToLong(byte[] b) throws IOException;
  public Float bytesToFloat(byte[] b) throws IOException;
  public Double bytesToDouble(byte[] b) throws IOException;
  public String bytesToCharArray(byte[] b) throws IOException;
  public Map<String, Object> bytesToMap(byte[] b) throws IOException;
  public Tuple bytesToTuple(byte[] b) throws IOException;
  public DataBag bytesToBag(byte[] b) throws IOException;
}
CutLoadFunc doesn't override getLoadCaster() since the default implementation returns
Utf8StorageConverter, which provides standard conversions between UTF-8 encoded
data and Pig data types.
In some cases, the load function itself can determine the schema. For example, if we
were loading self-describing data like XML or JSON, we could create a schema for Pig
by looking at the data. Alternatively, the load function may determine the schema in
another way, such as an external file, or by being passed information in its constructor.
To support such cases, the load function should implement the LoadMetadata interface
(in addition to the LoadFunc interface), so it can supply a schema to the Pig runtime.
Note, however, that if a user supplies a schema in the AS clause of LOAD, then it takes
precedence over the schema specified by the LoadMetadata interface.
A load function may additionally implement the LoadPushDown interface as a means for
finding out which columns the query is asking for. This can be a useful optimization
for column-oriented storage, so that the loader only loads the columns that are needed
by the query. There is no obvious way for CutLoadFunc to load only a subset of columns,
since it reads the whole line for each tuple, so we don't use this optimization.
Data Processing Operators
Loading and Storing Data
Throughout this chapter, we have seen how to load data from external storage for
processing in Pig. Storing the results is straightforward, too. Here's an example of using
PigStorage to store tuples as plain-text values separated by a colon character:
grunt> STORE A INTO 'out' USING PigStorage(':');
grunt> cat out
Joe:cherry:2
Ali:apple:3
Joe:banana:2
Eve:apple:7
Other built-in storage functions were described in Table 11-7.
Filtering Data
Once you have some data loaded into a relation, the next step is often to filter it to
remove the data that you are not interested in. By filtering early in the processing
pipeline, you minimize the amount of data flowing through the system, which can improve
efficiency.
FOREACH...GENERATE
We have already seen how to remove rows from a relation using the FILTER operator
with simple expressions and a UDF. The FOREACH...GENERATE operator is used to
act on every row in a relation. It can be used to remove fields or to generate new ones.
In this example, we do both:
grunt> DUMP A;
(Joe,cherry,2)
(Ali,apple,3)
(Joe,banana,2)
(Eve,apple,7)
grunt> B = FOREACH A GENERATE $0, $2+1, 'Constant';
grunt> DUMP B;
(Joe,3,Constant)
(Ali,4,Constant)
(Joe,3,Constant)
(Eve,8,Constant)
Here we have created a new relation B with three fields. Its first field is a projection of
the first field ($0) of A. B's second field is the third field of A ($2) with one added to it.
B's third field is a constant field (every row in B has the same third field) with the
chararray value Constant.
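For readers who think more readily in a general-purpose language, the same row-by-row transformation can be written as a Python list comprehension. This is only an analogy for what FOREACH...GENERATE computes, using the sample relation A shown in the session; it says nothing about how Pig executes the statement.

```python
A = [("Joe", "cherry", 2),
     ("Ali", "apple", 3),
     ("Joe", "banana", 2),
     ("Eve", "apple", 7)]

# Equivalent of: B = FOREACH A GENERATE $0, $2+1, 'Constant';
B = [(row[0], row[2] + 1, "Constant") for row in A]

for row in B:
    print(row)
```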
The FOREACH...GENERATE operator has a nested form to support more complex
processing. In the following example, we compute various statistics for the weather
dataset:
-- year_stats.pig
REGISTER pig-examples.jar;
DEFINE isGood com.hadoopbook.pig.IsGoodQuality();
records = LOAD 'input/ncdc/all/19{1,2,3,4,5}0*'
USING com.hadoopbook.pig.CutLoadFunc('5-10,11-15,16-19,88-92,93-93')
AS (usaf:chararray, wban:chararray, year:int, temperature:int, quality:int);

grouped_records = GROUP records BY year PARALLEL 30;
year_stats = FOREACH grouped_records {
uniq_stations = DISTINCT records.usaf;
good_records = FILTER records BY isGood(quality);
GENERATE FLATTEN(group), COUNT(uniq_stations) AS station_count,
COUNT(good_records) AS good_record_count, COUNT(records) AS record_count;
}
DUMP year_stats;
Using the cut UDF we developed earlier, we load various fields from the input dataset
into the records relation. Next we group records by year. Notice the PARALLEL
keyword for setting the number of reducers to use; this is vital when running on a cluster.
Then we process each group using a nested FOREACH...GENERATE operator. The
first nested statement creates a relation for the distinct USAF identifiers for stations
using the DISTINCT operator. The second nested statement creates a relation for the
records with good readings using the FILTER operator and a UDF. The final nested
statement is a GENERATE statement (a nested FOREACH...GENERATE must always
have a GENERATE statement as the last nested statement) that generates the summary
fields of interest using the grouped records, as well as the relations created in the nested
block.
Running it on a few years of data, we get the following:
(1920,8L,8595L,8595L)
(1950,1988L,8635452L,8641353L)
(1930,121L,89245L,89262L)
(1910,7L,7650L,7650L)
(1940,732L,1052333L,1052976L)
The fields are year, number of unique stations, total number of good readings, and total
number of readings. We can see how the number of weather stations and readings grew
over time.
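The per-group logic of the nested FOREACH can be paraphrased in Python as follows. This is a single-machine sketch of what the script computes, not of how Pig runs it: the year_stats function is hypothetical, the sample records are invented for illustration, and the sketch ignores the way Pig's COUNT treats null values.

```python
from collections import defaultdict

GOOD_QUALITY_CODES = {0, 1, 4, 5, 9}

def year_stats(records):
    """records: iterable of (usaf, wban, year, temperature, quality) tuples."""
    groups = defaultdict(list)
    for record in records:
        groups[record[2]].append(record)  # GROUP records BY year
    stats = []
    for year, group in sorted(groups.items()):
        # DISTINCT records.usaf
        uniq_stations = {r[0] for r in group}
        # FILTER records BY isGood(quality)
        good_records = [r for r in group if r[4] in GOOD_QUALITY_CODES]
        stats.append((year, len(uniq_stations), len(good_records), len(group)))
    return stats

# Invented sample data, for illustration only.
sample = [("010010", "99999", 1950, 0, 1),
          ("010010", "99999", 1950, 22, 2),
          ("010020", "99999", 1950, -11, 1),
          ("010010", "99999", 1949, 111, 1)]
print(year_stats(sample))
```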
STREAM
The STREAM operator allows you to transform data in a relation using an external
program or script. It is named by analogy with Hadoop Streaming, which provides a
similar capability for MapReduce (see "Hadoop Streaming" on page 37).
STREAM can use built-in commands with arguments. Here is an example that uses the
Unix cut command to extract the second field of each tuple in A. Note that the
command and its arguments are enclosed in backticks:
grunt> C = STREAM A THROUGH `cut -f 2`;
grunt> DUMP C;
(cherry)
(apple)
(banana)
(apple)
The STREAM operator uses PigStorage to serialize and deserialize relations to and from
the program's standard input and output streams. Tuples in A are converted to
tab-delimited lines that are passed to the script. The output of the script is read one line at
a time and split on tabs to create new tuples for the output relation C. You can provide
a custom serializer and deserializer, which implement PigToStream and StreamToPig
respectively (both in the org.apache.pig package), using the DEFINE command.
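The round trip through tab-delimited text can be sketched in Python. The function names here (to_stream, from_stream, cut_second_field) are hypothetical, and cut_second_field is a pure-Python stand-in for the external cut -f 2 command, so the sketch only illustrates the serialization steps described above rather than reproducing Pig's PigStorage code. Note that fields come back as strings after the round trip.

```python
def to_stream(tuples):
    """Serialize tuples as tab-delimited lines, as fed to the external program."""
    return ["\t".join(str(field) for field in t) for t in tuples]

def from_stream(lines):
    """Split each output line on tabs to build the tuples of the new relation."""
    return [tuple(line.rstrip("\n").split("\t")) for line in lines]

def cut_second_field(lines):
    """Stand-in for `cut -f 2`: keep the second tab-separated field."""
    return [line.split("\t")[1] for line in lines]

A = [("Joe", "cherry", 2),
     ("Ali", "apple", 3),
     ("Joe", "banana", 2),
     ("Eve", "apple", 7)]

C = from_stream(cut_second_field(to_stream(A)))
print(C)
```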
Pig streaming is most powerful when you write custom processing scripts. The
following Python script filters out bad weather records:
#!/usr/bin/env python
import re
import sys

for line in sys.stdin:
    (year, temp, q) = line.strip().split()
    # Keep only records with a valid temperature and a good quality code
    if (temp != "9999" and re.match("[01459]", q)):
        print("%s\t%s" % (year, temp))
To use the script, you need to ship it to the cluster. This is achieved via a DEFINE
clause, which also creates an alias for the STREAM command. The STREAM statement
can then refer to the alias, as the following Pig script shows:
-- max_temp_filter_stream.pig
DEFINE is_good_quality `is_good_quality.py`
SHIP ('ch11/src/main/python/is_good_quality.py');
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = STREAM records THROUGH is_good_quality
AS (year:chararray, temperature:int);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
DUMP max_temp;
Grouping and Joining Data
Joining datasets in MapReduce takes some work on the part of the programmer (see
"Joins" on page 281), whereas Pig has very good built-in support for join operations,
making it much more approachable. Since the large datasets that are suitable for
analysis by Pig (and MapReduce in general) are not normalized, joins are used more
infrequently in Pig than they are in SQL.
JOIN
Let's look at an example of an inner join. Consider the relations A and B:
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)
We can join the two relations on the numerical (identity) field in each:
grunt> C = JOIN A BY $0, B BY $1;
grunt> DUMP C;
(2,Tie,Joe,2)
(2,Tie,Hank,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)
This is a classic inner join, where each match between the two relations corresponds
to a row in the result. (It's actually an equijoin since the join predicate is equality.) The
result's fields are made up of all the fields of all the input relations.
You should use the general join operator if all the relations being joined are too large
to fit in memory. If one of the relations is small enough to fit in memory, there is a
special type of join called a fragment replicate join, which is implemented by distributing
the small input to all the mappers and performing a map-side join using an in-memory
lookup table against the (fragmented) larger relation. There is a special syntax for telling
Pig to use a fragment replicate join:[8]
grunt> C = JOIN A BY $0, B BY $1 USING "replicated";
The first relation must be the large one, followed by one or more small ones (all of
which must fit in memory).
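The idea behind a fragment replicate join can be shown with a small Python sketch: the small relation is loaded into an in-memory lookup table, and each row of the large relation is matched against it as it streams past. The replicated_join function is a hypothetical illustration of the technique on a single machine. It is not Pig's implementation, where the lookup table is built in every mapper. For the sake of the example, A plays the role of the large relation.

```python
from collections import defaultdict

def replicated_join(large, large_key, small, small_key):
    """Join by probing an in-memory lookup table built from the small input."""
    lookup = defaultdict(list)
    for row in small:
        lookup[row[small_key]].append(row)
    for row in large:  # the large input is only ever streamed, never stored
        for match in lookup.get(row[large_key], []):
            yield row + match

A = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
B = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]

C = list(replicated_join(A, 0, B, 1))
for row in C:
    print(row)
```

Rows come out in the order of the large relation, which differs from the order in the Grunt session above but contains the same tuples.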
Pig also supports outer joins using a syntax that is similar to SQL's (this is covered for
Hive in "Outer joins" on page 444). For example:
grunt> C = JOIN A BY $0 LEFT OUTER, B BY $1;
grunt> DUMP C;
(1,Scarf,,)
(2,Tie,Joe,2)
(2,Tie,Hank,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)
COGROUP
JOIN always gives a flat structure: a set of tuples. The COGROUP statement is similar
to JOIN, but creates a nested set of output tuples. This can be useful if you want to
exploit the structure in subsequent statements:
grunt> D = COGROUP A BY $0, B BY $1;
grunt> DUMP D;
(0,{},{(Ali,0)})
(1,{(1,Scarf)},{})
(2,{(2,Tie)},{(Joe,2),(Hank,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})
COGROUP generates a tuple for each unique grouping key. The first field of each tuple
is the key, and the remaining fields are bags of tuples from the relations with a matching
key. The first bag contains the matching tuples from relation A with the same key.
Similarly, the second bag contains the matching tuples from relation B with the same
key.
8. There are more keywords that may be used in the USING clause, including "skewed" (for large datasets
with a skewed keyspace) and "merge" (to effect a merge join for inputs that are already sorted on the join
key). See Pig's documentation for details on how to use these specialized joins.
If for a particular key a relation has no matching key, then the bag for that relation is
empty. For example, since no one has bought a scarf (with ID 1), the second bag in the
tuple for that row is empty. This is an example of an outer join, which is the default
type for COGROUP. It can be made explicit using the OUTER keyword, making this
COGROUP statement the same as the previous one:
D = COGROUP A BY $0 OUTER, B BY $1 OUTER;
You can suppress rows with empty bags by using the INNER keyword, which gives the
COGROUP inner join semantics. The INNER keyword is applied per relation, so the
following only suppresses rows when relation A has no match (dropping the unknown
product 0 here):
grunt> E = COGROUP A BY $0 INNER, B BY $1;
grunt> DUMP E;
(1,{(1,Scarf)},{})
(2,{(2,Tie)},{(Joe,2),(Hank,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})
We can flatten this structure to discover who bought each of the items in relation A:
grunt> F = FOREACH E GENERATE FLATTEN(A), B.$0;
grunt> DUMP F;
(1,Scarf,{})
(2,Tie,{(Joe),(Hank)})
(3,Hat,{(Eve)})
(4,Coat,{(Hank)})
Using a combination of COGROUP, INNER, and FLATTEN (which removes nesting)
it's possible to simulate an (inner) JOIN:
grunt> G = COGROUP A BY $0 INNER, B BY $1 INNER;
grunt> H = FOREACH G GENERATE FLATTEN($1), FLATTEN($2);
grunt> DUMP H;
(2,Tie,Joe,2)
(2,Tie,Hank,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)
This gives the same result as JOIN A BY $0, B BY $1.
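The relationship between COGROUP and JOIN may be easier to see in a small Python model. The cogroup and inner_join_via_cogroup functions are hypothetical and operate on in-memory lists, so this is a sketch of the semantics only: one tuple per key holding a bag per relation, with flattening of both bags producing the joined rows.

```python
def cogroup(a, a_key, b, b_key):
    """Return (key, bag_from_a, bag_from_b) for every key seen in either input."""
    keys = sorted({row[a_key] for row in a} | {row[b_key] for row in b})
    return [(key,
             [row for row in a if row[a_key] == key],
             [row for row in b if row[b_key] == key])
            for key in keys]

def inner_join_via_cogroup(a, a_key, b, b_key):
    """Flatten both bags of each group; an empty bag contributes no rows."""
    joined = []
    for _, bag_a, bag_b in cogroup(a, a_key, b, b_key):
        joined.extend(row_a + row_b for row_a in bag_a for row_b in bag_b)
    return joined

A = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
B = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]

D = cogroup(A, 0, B, 1)
H = inner_join_via_cogroup(A, 0, B, 1)
for row in H:
    print(row)
```

Because flattening an empty bag yields nothing, groups with no match on either side disappear from H, which is the effect that the INNER keyword has in the Pig version.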
If the join key is composed of several fields, you can specify them all in the BY clauses
of the JOIN or COGROUP statement. Make sure that the number of fields in each BY
clause is the same.
Here's another example of a join in Pig, in a script for calculating the maximum
temperature for every station over a time period controlled by the input:
-- max_temp_station_name.pig
REGISTER pig-examples.jar;
DEFINE isGood com.hadoopbook.pig.IsGoodQuality();
stations = LOAD 'input/ncdc/metadata/stations-fixed-width.txt'
USING com.hadoopbook.pig.CutLoadFunc('1-6,8-12,14-42')
AS (usaf:chararray, wban:chararray, name:chararray);

trimmed_stations = FOREACH stations GENERATE usaf, wban,
com.hadoopbook.pig.Trim(name);
records = LOAD 'input/ncdc/all/191*'
USING com.hadoopbook.pig.CutLoadFunc('5-10,11-15,88-92,93-93')
AS (usaf:chararray, wban:chararray, temperature:int, quality:int);

filtered_records = FILTER records BY temperature != 9999 AND isGood(quality);
grouped_records = GROUP filtered_records BY (usaf, wban) PARALLEL 30;
max_temp = FOREACH grouped_records GENERATE FLATTEN(group),
MAX(filtered_records.temperature);
max_temp_named = JOIN max_temp BY (usaf, wban), trimmed_stations BY (usaf, wban)
PARALLEL 30;
max_temp_result = FOREACH max_temp_named GENERATE $0, $1, $5, $2;
STORE max_temp_result INTO 'max_temp_by_station';
We use the cut UDF we developed earlier to load one relation holding the station IDs
(USAF and WBAN identifiers) and names, and one relation holding all the weather
records, keyed by station ID. We group the filtered weather records by station ID and
aggregate by maximum temperature, before joining with the stations. Finally, we
project out the fields we want in the final result: USAF, WBAN, station name, and
maximum temperature.
Here are a few results for the 1910s:
228020 99999 SORTAVALA 322
029110 99999 VAASA AIRPORT 300
040650 99999 GRIMSEY 378
This query could be made more efficient by using a fragment replicate join, as the station
metadata is small.
CROSS
Pig Latin includes the cross-product operator (also known as the Cartesian product),
which joins every tuple in a relation with every tuple in a second relation (and with
every tuple in further relations if supplied). The size of the output is the product of the
size of the inputs, potentially making the output very large:
grunt> I = CROSS A, B;
grunt> DUMP I;
(2,Tie,Joe,2)
(2,Tie,Hank,4)
(2,Tie,Ali,0)
(2,Tie,Eve,3)
(2,Tie,Hank,2)
(4,Coat,Joe,2)
(4,Coat,Hank,4)
(4,Coat,Ali,0)
(4,Coat,Eve,3)
(4,Coat,Hank,2)
(3,Hat,Joe,2)
(3,Hat,Hank,4)
(3,Hat,Ali,0)
(3,Hat,Eve,3)
(3,Hat,Hank,2)
(1,Scarf,Joe,2)
(1,Scarf,Hank,4)
(1,Scarf,Ali,0)
(1,Scarf,Eve,3)
(1,Scarf,Hank,2)
When dealing with large datasets, you should try to avoid operations that generate
intermediate representations that are quadratic (or worse) in size. Computing the
cross-product of the whole input dataset is rarely needed, if ever.
For example, at first blush one might expect that calculating pairwise document
similarity in a corpus of documents would require every document pair to be generated
before calculating their similarity. However, if one starts with the insight that most
document pairs have a similarity score of zero (that is, they are unrelated), then we can
find a way to a better algorithm.
In this case, the key idea is to focus on the entities that we are using to calculate similarity
(terms in a document, for example) and make them the center of the algorithm. In
practice, we also remove terms that don't help discriminate between documents
(stopwords), and this reduces the problem space still further. Using this technique to analyze
a set of roughly one million (10^6) documents generates in the order of one billion
(10^9) intermediate pairs,[9] rather than the one trillion (10^12) produced by the naive
approach (generating the cross-product of the input) or the approach with no stopword
removal.
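The term-centered idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpus, the stopword list, and the function name candidate_pairs are all invented here, and the sketch is not the algorithm from the cited paper. It builds a term-to-documents index and emits only the pairs of documents that share a non-stopword term, instead of every pair.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy corpus: document ID -> text.
docs = {
    "d1": "the pig ate the apple",
    "d2": "the hive stores the honey",
    "d3": "an apple a day",
    "d4": "honey from the hive",
}
STOPWORDS = {"the", "a", "an", "from"}  # assumed stopword list


def candidate_pairs(docs, stopwords):
    """Return the document pairs that share at least one non-stopword term."""
    postings = defaultdict(set)  # term -> IDs of the documents containing it
    for doc_id, text in docs.items():
        for term in set(text.split()) - stopwords:
            postings[term].add(doc_id)
    pairs = set()
    for doc_ids in postings.values():
        # Only documents that appear under the same term are ever paired.
        pairs.update(combinations(sorted(doc_ids), 2))
    return pairs


all_pairs = set(combinations(sorted(docs), 2))  # naive approach: every pair
pairs = candidate_pairs(docs, STOPWORDS)
print(len(all_pairs), "pairs in total,", len(pairs), "candidate pairs")
```

Passing an empty stopword set to candidate_pairs shows the effect of skipping stopword removal: a common word such as "the" pairs up documents that are otherwise unrelated.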
9. "Pairwise Document Similarity in Large Collections with MapReduce," Elsayed, Lin, and Oard (2008,
College Park, MD: University of Maryland).
GROUP
Although COGROUP groups the data in two or more relations, the GROUP statement
groups the data in a single relation. GROUP supports grouping by more than equality
of keys: you can use an expression or user-defined function as the group key. For
example, consider the following relation A:
grunt> DUMP A;
(Joe,cherry)
(Ali,apple)
(Joe,banana)
(Eve,apple)
Let's group by the number of characters in the second field:
grunt> B = GROUP A BY SIZE($1);
grunt> DUMP B;
(5,{(Ali,apple),(Eve,apple)})
(6,{(Joe,cherry),(Joe,banana)})
GROUP creates a relation whose first field is the grouping field, which is given the alias
group. The second field is a bag containing the grouped fields with the same schema as
the original relation (in this case, A).
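As a rough model of what GROUP computes, the following Python sketch groups tuples by the value of an arbitrary key function. The helper name group_by is an assumption made for illustration and is not part of Pig; Python lists stand in for bags, which in Pig carry no ordering guarantee.

```python
from collections import defaultdict

# The relation A from the example above, as a list of tuples.
A = [("Joe", "cherry"), ("Ali", "apple"), ("Joe", "banana"), ("Eve", "apple")]


def group_by(relation, key_func):
    """Model of GROUP: return (group key, bag of matching tuples) pairs."""
    groups = defaultdict(list)
    for tup in relation:
        groups[key_func(tup)].append(tup)
    # Sorted by key only to make the printed output stable.
    return sorted(groups.items())


# Rough equivalent of: B = GROUP A BY SIZE($1);
B = group_by(A, lambda tup: len(tup[1]))
for key, bag in B:
    print(key, bag)
```

Using a constant key function, such as lambda tup: "all", puts every tuple into a single group.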
There are also two special grouping operations: ALL and ANY. ALL groups all the
tuples in a relation in a single group, as if the GROUP function was a constant:
grunt> C = GROUP A ALL;
grunt> DUMP C;
(all,{(Joe,cherry),(Ali,apple),(Joe,banana),(Eve,apple)})
Note that there is no BY in this form of the GROUP statement. The ALL grouping is
commonly used to count the number of tuples in a relation, as shown in "Validation
and nulls" on page 383.
The ANY keyword is used to group the tuples in a relation randomly, which can be
useful for sampling.
Sorting Data
Relations are unordered in Pig. Consider a relation A:
grunt> DUMP A;
(2,3)
(1,2)
(2,4)
There is no guarantee which order the rows will be processed in. In particular, when
retrieving the contents of A using DUMP or STORE, the rows may be written in any
order. If you want to impose an order on the output, you can use the ORDER operator
to sort a relation by one or more fields. The default sort order compares fields of the
same type using the natural ordering, and different types are given an arbitrary, but
deterministic, ordering (a tuple is always "less than" a bag, for example).
The following example sorts A by the first field in ascending order and by the second
field in descending order:
grunt> B = ORDER A BY $0, $1 DESC;
grunt> DUMP B;
(1,2)
(2,4)
(2,3)
Any further processing on a sorted relation is not guaranteed to retain its order. For
example:
grunt> C = FOREACH B GENERATE *;
Even though relation C has the same contents as relation B, its tuples may be emitted
in any order by a DUMP or a STORE. It is for this reason that it is usual to perform the
ORDER operation just before retrieving the output.
The LIMIT statement is useful for limiting the number of results, as a quick and dirty
way to get a sample of a relation; prototyping (the ILLUSTRATE command) should be
preferred for generating more representative samples of the data. It can be used
immediately after the ORDER statement to retrieve the first n tuples. Usually, LIMIT will
select any n tuples from a relation, but when used immediately after an ORDER
statement, the order is retained (in an exception to the rule that processing a relation does
not retain its order):
grunt> D = LIMIT B 2;
grunt> DUMP D;
(1,2)
(2,4)
If the limit is greater than the number of tuples in the relation, all tuples are returned
(so LIMIT has no effect).
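For comparison, here is a small Python sketch of the same ORDER-then-LIMIT sequence on the example relation. It models only the result, not how Pig executes the statements, and negating the second field to get a descending sort is a shortcut that works only because the field is numeric.

```python
# The relation A from the sorting example, as a list of tuples.
A = [(2, 3), (1, 2), (2, 4)]

# Rough equivalent of: B = ORDER A BY $0, $1 DESC;
B = sorted(A, key=lambda tup: (tup[0], -tup[1]))

# Rough equivalent of: D = LIMIT B 2;
D = B[:2]

print(B)
print(D)
```

A slice that asks for more tuples than exist, such as B[:10], simply returns the whole list, which mirrors the behavior of a limit larger than the relation.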
Using LIMIT can improve the performance of a query because Pig tries to apply the
limit as early as possible in the processing pipeline, to minimize the amount of data
that needs to be processed. For this reason, you should always use LIMIT if you are
not interested in the entire output.
Combining and Splitting Data
Sometimes you have several relations that you would like to combine into one. For this,
the UNION statement is used. For example:
grunt> DUMP A;
(2,3)
(1,2)
(2,4)
grunt> DUMP B;
(z,x,8)
(w,y,1)
grunt> C = UNION A, B;
grunt> DUMP C;
(2,3)
(1,2)
(2,4)
(z,x,8)
(w,y,1)
C is the union of relations A and B, and since relations are unordered, the order of the
tuples in C is undefined. Also, it's possible to form the union of two relations with
different schemas or with different numbers of fields, as we have done here. Pig attempts
to merge the schemas from the relations that UNION is operating on. In this case, they
are incompatible, so C has no schema:
grunt> DESCRIBE A;
A: {f0: int,f1: int}
grunt> DESCRIBE B;
B: {f0: chararray,f1: chararray,f2: int}
grunt> DESCRIBE C;
Schema for C unknown.
If the output relation has no schema, your script needs to be able to handle tuples that
vary in the number of fields and/or types.
The SPLIT operator is the opposite of UNION; it partitions a relation into two or more
relations. See "Validation and nulls" on page 383 for an example of how to use it.
Pig in Practice
There are some practical techniques that are worth knowing about when you are
developing and running Pig programs. This section covers some of them.
Parallelism
When running in MapReduce mode it's important that the degree of parallelism
matches the size of the dataset. By default, Pig sets the number of reducers by
looking at the size of the input, and using one reducer per 1GB of input, up to a
maximum of 999 reducers. You can override these parameters by setting
pig.exec.reducers.bytes.per.reducer (the default is 1000000000 bytes) and pig.exec.reducers.max
(default 999).
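The default rule amounts to a small piece of arithmetic, sketched here in Python. The function name estimate_reducers is invented for illustration, and the lower bound of one reducer is an assumption that the text above does not state; only the two default values come from the text.

```python
import math


def estimate_reducers(input_bytes,
                      bytes_per_reducer=1_000_000_000,  # pig.exec.reducers.bytes.per.reducer
                      max_reducers=999):                # pig.exec.reducers.max
    """One reducer per bytes_per_reducer of input, capped at max_reducers."""
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    # Assumed lower bound: always run at least one reducer.
    return min(max_reducers, max(1, wanted))


print(estimate_reducers(2_500_000_000))   # 2.5 GB of input
print(estimate_reducers(5 * 10**12))      # 5 TB of input hits the cap
```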
To explicitly set the number of reducers you want for each job, you can use a PARALLEL
clause for operators that run in the reduce phase. These include all the grouping and
joining operators (GROUP, COGROUP, JOIN, CROSS), as well as DISTINCT and
ORDER. The following line sets the number of reducers to 30 for the GROUP:
grouped_records = GROUP records BY year PARALLEL 30;
Alternatively, you can set the default_parallel option, and it will take effect for all
subsequent jobs:
grunt> set default_parallel 30
A good setting for the number of reduce tasks is slightly fewer than the number of
reduce slots in the cluster. See "Choosing the Number of Reducers" on page 229 for
further discussion.
The number of map tasks is set by the size of the input (with one map per HDFS block)
and is not affected by the PARALLEL clause.
Parameter Substitution
If you have a Pig script that you run on a regular basis, then it's quite common to want
to be able to run the same script with different parameters. For example, a script that
runs daily may use the date to determine which input files it runs over. Pig supports
parameter substitution, where parameters in the script are substituted with values
supplied at runtime. Parameters are denoted by identifiers prefixed with a $ character;
for example, $input and $output are used in the following script to specify the input
and output paths:
-- max_temp_param.pig
records = LOAD '$input' AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
STORE max_temp into '$output';
Parameters can be specified when launching Pig, using the -param option, one for each
parameter:
% pig -param input=/user/tom/input/ncdc/micro-tab/sample.txt \
> -param output=/tmp/out \
> ch11/src/main/pig/max_temp_param.pig
You can also put parameters in a file and pass them to Pig using the -param_file option.
For example, we can achieve the same result as the previous command by placing the
parameter definitions in a file:
# Input file
input=/user/tom/input/ncdc/micro-tab/sample.txt
# Output file
output=/tmp/out
The pig invocation then becomes:
% pig -param_file ch11/src/main/pig/max_temp_param.param \
> ch11/src/main/pig/max_temp_param.pig
You can specify multiple parameter files using -param_file repeatedly. You can also
use a combination of -param and -param_file options, and if any parameter is defined
in both a parameter file and on the command line, the last value on the command line
takes precedence.
Dynamic parameters
For parameters that are supplied using the -param option, it is easy to make the value
dynamic by running a command or script. Many Unix shells support command
substitution for a command enclosed in backticks, and we can use this to make the output
directory date-based:
% pig -param input=/user/tom/input/ncdc/micro-tab/sample.txt \
> -param output=/tmp/`date "+%Y-%m-%d"`/out \
> ch11/src/main/pig/max_temp_param.pig
Pig also supports backticks in parameter files, by executing the enclosed command in
a shell and using the shell output as the substituted value. If the command or script
exits with a nonzero exit status, then the error message is reported and execution halts.
Backtick support in parameter files is a useful feature; it means that parameters can be
defined in the same way if they are defined in a file or on the command line.
Parameter substitution processing
Parameter substitution occurs as a preprocessing step before the script is run. You can
see the substitutions that the preprocessor made by executing Pig with the -dryrun
option. In dry run mode, Pig performs parameter substitution (and macro expansion)
and generates a copy of the original script with substituted values, but does not execute
the script. You can inspect the generated script and check that the substitutions look
sane (because they are dynamically generated, for example) before running it in normal
mode.
At the time of this writing, Grunt does not support parameter substitution.
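Because substitution is a textual preprocessing step, its effect can be approximated with a few lines of Python. This is a simplified model written for illustration, not Pig's preprocessor: the function name substitute and the regular expression are assumptions, and features such as backtick commands and macro expansion are left out. Positional references such as $0 are left alone because they do not start with a letter or underscore.

```python
import re

# Two statements from max_temp_param.pig, before substitution.
SCRIPT = """records = LOAD '$input' AS (year:chararray, temperature:int, quality:int);
STORE max_temp into '$output';"""

PARAMS = {
    "input": "/user/tom/input/ncdc/micro-tab/sample.txt",
    "output": "/tmp/out",
}


def substitute(script, params):
    """Replace each $name with its value; raise KeyError if a name is undefined."""
    return re.sub(r"\$([A-Za-z_]\w*)", lambda match: params[match.group(1)], script)


# Comparable in spirit to inspecting the script produced by a dry run.
print(substitute(SCRIPT, PARAMS))
```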
CHAPTER 12
Hive
In "Information Platforms and the Rise of the Data Scientist,"[1] Jeff Hammerbacher
describes Information Platforms as "the locus of their organization's efforts to ingest,
process, and generate information," and how they "serve to accelerate the process of
learning from empirical data."
One of the biggest ingredients in the Information Platform built by Jeff's team at
Facebook was Hive, a framework for data warehousing on top of Hadoop. Hive grew from
a need to manage and learn from the huge volumes of data that Facebook was producing
every day from its burgeoning social network. After trying a few different systems, the
team chose Hadoop for storage and processing, since it was cost-effective and met their
scalability needs.[2]
Hive was created to make it possible for analysts with strong SQL skills (but meager
Java programming skills) to run queries on the huge volumes of data that Facebook
stored in HDFS. Today, Hive is a successful Apache project used by many organizations
as a general-purpose, scalable data processing platform.
Of course, SQL isn't ideal for every big data problem (it's not a good fit for building
complex machine learning algorithms, for example), but it's great for many analyses,
and it has the huge advantage of being very well known in the industry. What's more,
SQL is the lingua franca in business intelligence tools (ODBC is a common bridge, for
example), so Hive is well placed to integrate with these products.
This chapter is an introduction to using Hive. It assumes that you have working
knowledge of SQL and general database architecture; as we go through Hive's features, we'll
often compare them to the equivalent in a traditional RDBMS.
1. Beautiful Data: The Stories Behind Elegant Data Solutions, by Toby Segaran and Jeff Hammerbacher
(O'Reilly, 2009)
2. You can read more about the history of Hadoop at Facebook in "Hadoop and Hive at
Facebook" on page 554.
Installing Hive
In normal use, Hive runs on your workstation and converts your SQL query into a series
of MapReduce jobs for execution on a Hadoop cluster. Hive organizes data into tables,
which provide a means for attaching structure to data stored in HDFS. Metadata
(such as table schemas) is stored in a database called the metastore.
When starting out with Hive, it is convenient to run the metastore on your local
machine. In this configuration, which is the default, the Hive table definitions that you
create will be local to your machine, so you can't share them with other users. We'll
see how to configure a shared remote metastore, which is the norm in production
environments, later in "The Metastore" on page 419.
Installation of Hive is straightforward. Java 6 is a prerequisite; and on Windows, you
will need Cygwin, too. You also need to have the same version of Hadoop installed
locally that your cluster is running.[3] Of course, you may choose to run Hadoop locally,
either in standalone or pseudo-distributed mode, while getting started with Hive. These
options are all covered in Appendix A.
Which Versions of Hadoop Does Hive Work With?
Any given release of Hive is designed to work with multiple versions of Hadoop.
Generally, Hive works with the latest release of Hadoop, as well as supporting a number
of older versions. For example, Hive 0.5.0 is compatible with versions of Hadoop
between 0.17.x and 0.20.x (inclusive). You don't need to do anything special to tell Hive
which version of Hadoop you are using, beyond making sure that the hadoop executable
is on the path or setting the HADOOP_HOME environment variable.
Download a release at http://hive.apache.org/releases.html, and unpack the tarball in a
suitable place on your workstation:
% tar xzf hive-x.y.z-dev.tar.gz
It's handy to put Hive on your path to make it easy to launch:
% export HIVE_INSTALL=/home/tom/hive-x.y.z-dev
% export PATH=$PATH:$HIVE_INSTALL/bin
Now type hive to launch the Hive shell:
% hive
hive>
3. It is assumed that you have network connectivity from your workstation to the Hadoop cluster. You can
test this before running Hive by installing Hadoop locally and performing some HDFS operations with
the hadoop fs command.
The Hive Shell
The shell is the primary way that we will interact with Hive, by issuing commands in
HiveQL. HiveQL is Hive's query language, a dialect of SQL. It is heavily influenced by
MySQL, so if you are familiar with MySQL you should feel at home using Hive.
When starting Hive for the first time, we can check that it is working by listing its tables:
there should be none. The command must be terminated with a semicolon to tell Hive
to execute it:
hive> SHOW TABLES;
OK
Time taken: 10.425 seconds
Like SQL, HiveQL is generally case insensitive (except for string comparisons), so show
tables; works equally well here. The tab key will autocomplete Hive keywords and
functions.
For a fresh install, the command takes a few seconds to run since it is lazily creating
the metastore database on your machine. (The database stores its files in a directory
called metastore_db, which is relative to where you ran the hive command from.)
You can also run the Hive shell in non-interactive mode. The -f option runs the
commands in the specified file, script.q, in this example:
% hive -f script.q
For short scripts, you can use the -e option to specify the commands inline, in which
case the final semicolon is not required:
% hive -e 'SELECT * FROM dummy'
Hive history file=/tmp/tom/hive_job_log_tom_201005042112_1906486281.txt
OK
X
Time taken: 4.734 seconds
It's useful to have a small table of data to test queries against, such as
trying out functions in SELECT expressions using literal data (see "Operators
and Functions" on page 426). Here's one way of populating a
single row table:
% echo 'X' > /tmp/dummy.txt
% hive -e "CREATE TABLE dummy (value STRING); \
LOAD DATA LOCAL INPATH '/tmp/dummy.txt' \
OVERWRITE INTO TABLE dummy"
In both interactive and non-interactive mode, Hive will print information to standard
error (such as the time taken to run a query) during the course of operation. You can
suppress these messages using the -S option at launch time, which has the effect of only
showing the output result for queries:
% hive -S -e 'SELECT * FROM dummy'
X
Other useful Hive shell features include the ability to run commands on the host
operating system by using a ! prefix to the command and the ability to access Hadoop
filesystems using the dfs command.
An Example
Let's see how to use Hive to run a query on the weather dataset we explored in earlier
chapters. The first step is to load the data into Hive's managed storage. Here we'll have
Hive use the local filesystem for storage; later we'll see how to store tables in HDFS.
Just like an RDBMS, Hive organizes its data into tables. We create a table to hold the
weather data using the CREATE TABLE statement:
CREATE TABLE records (year STRING, temperature INT, quality INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
The first line declares a records table with three columns: year, temperature, and
quality. The type of each column must be specified, too: here the year is a string, while
the other two columns are integers.
So far, the SQL is familiar. The ROW FORMAT clause, however, is particular to HiveQL.
What this declaration is saying is that each row in the data file is tab-delimited text.
Hive expects there to be three fields in each row, corresponding to the table columns,
with fields separated by tabs, and rows by newlines.
Next we can populate Hive with the data. This is just a small sample, for exploratory
purposes:
LOAD DATA LOCAL INPATH 'input/ncdc/micro-tab/sample.txt'
OVERWRITE INTO TABLE records;
Running this command tells Hive to put the specified local file in its warehouse
directory. This is a simple filesystem operation. There is no attempt, for example, to parse
the file and store it in an internal database format, since Hive does not mandate any
particular file format. Files are stored verbatim: they are not modified by Hive.
In this example, we are storing Hive tables on the local filesystem (fs.default.name is
set to its default value of file:///). Tables are stored as directories under Hive's
warehouse directory, which is controlled by the hive.metastore.warehouse.dir, and defaults
to /user/hive/warehouse.
Thus, the files for the records table are found in the /user/hive/warehouse/records
directory on the local filesystem:
% ls /user/hive/warehouse/records/
sample.txt
In this case, there is only one file, sample.txt, but in general there can be more, and Hive
will read all of them when querying the table.
The OVERWRITE keyword in the LOAD DATA statement tells Hive to delete any existing files
in the directory for the table. If it is omitted, then the new files are simply added to the
table's directory (unless they have the same names, in which case they replace the old
files).
Now that the data is in Hive, we can run a query against it:
hive> SELECT year, MAX(temperature)
> FROM records
> WHERE temperature != 9999
> AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9)
> GROUP BY year;
1949 111
1950 22
This SQL query is unremarkable. It is a SELECT statement with a GROUP BY clause for
grouping rows into years, which uses the MAX() aggregate function to find the maximum
temperature for each year group. But the remarkable thing is that Hive transforms this
query into a MapReduce job, which it executes on our behalf, then prints the results
to the console. There are some nuances such as the SQL constructs that Hive supports
and the format of the data that we can query (and we shall explore some of these in
this chapter), but it is the ability to execute SQL queries against our raw data that gives
Hive its power.
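To make the query's semantics concrete, the following Python sketch parses tab-delimited rows and computes the same per-year maximum. It is a plain in-memory model, not a description of the MapReduce job that Hive generates, and the rows in SAMPLE are an assumed stand-in for the contents of sample.txt.

```python
# Assumed contents of sample.txt: tab-separated year, temperature, quality.
SAMPLE = "1950\t0\t1\n1950\t22\t1\n1950\t-11\t1\n1949\t111\t1\n1949\t78\t1\n"

GOOD_QUALITY = {0, 1, 4, 5, 9}  # the quality codes accepted by the WHERE clause
MISSING = 9999                  # the value that marks a missing reading


def parse(text):
    """Split tab-delimited lines into (year, temperature, quality) rows."""
    for line in text.splitlines():
        year, temperature, quality = line.split("\t")
        yield year, int(temperature), int(quality)


def max_temperature_by_year(rows):
    """Model of SELECT year, MAX(temperature) ... GROUP BY year."""
    maxima = {}
    for year, temperature, quality in rows:
        if temperature != MISSING and quality in GOOD_QUALITY:
            maxima[year] = max(temperature, maxima.get(year, temperature))
    return sorted(maxima.items())


for year, temperature in max_temperature_by_year(parse(SAMPLE)):
    print(year, temperature, sep="\t")
```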
Running Hive
In this section, we look at some more practical aspects of running Hive, including how
to set up Hive to run against a Hadoop cluster and a shared metastore. In doing so,
we'll see Hive's architecture in some detail.
Configuring Hive
Hive is configured using an XML configuration file like Hadoop's. The file is called
hive-site.xml and is located in Hive's conf directory. This file is where you can set
properties that you want to set every time you run Hive. The same directory contains
hive-default.xml, which documents the properties that Hive exposes and their default values.
You can override the configuration directory that Hive looks for in hive-site.xml by
passing the --config option to the hive command:
% hive --config /Users/tom/dev/hive-conf
Note that this option specifies the containing directory, not hive-site.xml itself. It can
be useful if you have multiple site files (for different clusters, say) that you switch
between on a regular basis. Alternatively, you can set the HIVE_CONF_DIR environment
variable to the configuration directory, for the same effect.
The hive-site.xml is a natural place to put the cluster connection details: you can specify
the filesystem and jobtracker using the usual Hadoop properties, fs.default.name and
mapred.job.tracker (see Appendix A for more details on configuring Hadoop). If not
set, they default to the local filesystem and the local (in-process) job runner (just like
they do in Hadoop), which is very handy when trying out Hive on small trial datasets.
Metastore configuration settings (covered in "The Metastore" on page 419) are
commonly found in hive-site.xml, too.
Hive also permits you to set properties on a per-session basis, by passing the
-hiveconf option to the hive command. For example, the following command sets the
cluster (to a pseudo-distributed cluster) for the duration of the session:
% hive -hiveconf fs.default.name=localhost -hiveconf mapred.job.tracker=localhost:8021
If you plan to have more than one Hive user sharing a Hadoop cluster,
then you need to make the directories that Hive uses writable by all
users. The following commands will create the directories and set their
permissions appropriately:
% hadoop fs -mkdir /tmp
% hadoop fs -chmod a+w /tmp
% hadoop fs -mkdir /user/hive/warehouse
% hadoop fs -chmod a+w /user/hive/warehouse
If all users are in the same group, then permissions g+w are sufficient on
the warehouse directory.
You can change settings from within a session, too, using the SET command. This is
useful for changing Hive or MapReduce job settings for a particular query. For example,
the following command ensures buckets are populated according to the table definition
(see "Buckets" on page 430):
hive> SET hive.enforce.bucketing=true;
To see the current value of any property, use SET with just the property name:
hive> SET hive.enforce.bucketing;
hive.enforce.bucketing=true
By itself, SET will list all the properties (and their values) set by Hive. Note that the list
will not include Hadoop defaults, unless they have been explicitly overridden in one of
the ways covered in this section. Use SET -v to list all the properties in the system,
including Hadoop defaults.
There is a precedence hierarchy to setting properties. In the following list, lower
numbers take precedence over higher numbers:
1. The Hive SET command
2. The command line -hiveconf option
3. hive-site.xml
4. hive-default.xml
5. hadoop-site.xml (or, equivalently, core-site.xml, hdfs-site.xml, and mapred-site.xml)
6. hadoop-default.xml (or, equivalently, core-default.xml, hdfs-default.xml, and
mapred-default.xml)
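The precedence list behaves like a chain of lookups, which the following Python sketch models with collections.ChainMap. The dictionaries and their values are illustrative assumptions (they are not read from any real configuration file), and the sketch says nothing about how Hive implements the lookup internally.

```python
from collections import ChainMap

# Illustrative property sources, listed from lowest to highest precedence.
hadoop_default = {"fs.default.name": "file:///", "mapred.job.tracker": "local"}
hadoop_site = {}
hive_default = {"hive.enforce.bucketing": "false"}
hive_site = {"fs.default.name": "hdfs://localhost/"}
hiveconf_option = {"mapred.job.tracker": "localhost:8021"}
set_command = {"hive.enforce.bucketing": "true"}

# ChainMap searches its maps from left to right, so the first map wins.
effective = ChainMap(set_command, hiveconf_option, hive_site,
                     hive_default, hadoop_site, hadoop_default)

for name in sorted(effective):
    print(name, "=", effective[name])
```

Removing a property from a higher-precedence dictionary lets the value from the next source in the chain show through.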
Logging
You can find Hive's error log on the local file system at /tmp/$USER/hive.log. It can be
very useful when trying to diagnose configuration problems or other types of error.
Hadoop's MapReduce task logs are also a useful source for troubleshooting; see
"Hadoop Logs" on page 173 for where to find them.
The logging configuration is in conf/hive-log4j.properties, and you can edit this file to
change log levels and other logging-related settings. Often though, it's more convenient
to set logging configuration for the session. For example, the following handy
invocation will send debug messages to the console:
% hive -hiveconf hive.root.logger=DEBUG,console
Hive Services
The Hive shell is only one of several services that you can run using the hive command.
You can specify the service to run using the --service option. Type hive --service
help to get a list of available service names; the most useful are described below.
cli
The command line interface to Hive (the shell). This is the default service.
hiveserver
Runs Hive as a server exposing a Thrift service, enabling access from a range of
clients written in different languages. Applications using the Thrift, JDBC, and
ODBC connectors need to run a Hive server to communicate with Hive. Set the
HIVE_PORT environment variable to specify the port the server will listen on (defaults
to 10,000).
hwi
The Hive Web Interface. See "The Hive Web Interface (HWI)" on page 418.
jar
The Hive equivalent to hadoop jar, a convenient way to run Java applications that
includes both Hadoop and Hive classes on the classpath.
metastore
By default, the metastore is run in the same process as the Hive service. Using this
service, it is possible to run the metastore as a standalone (remote) process. Set the
METASTORE_PORT environment variable to specify the port the server will listen on.
The Hive Web Interface (HWI)
As an alternative to the shell, you might want to try Hive's simple web interface. Start
it using the following commands:
% export ANT_LIB=/path/to/ant/lib
% hive --service hwi
(You only need to set the ANT_LIB environment variable if Ant's library is not found
in /opt/ant/lib on your system.) Then navigate to http://localhost:9999/hwi in your
browser. From there, you can browse Hive database schemas and create sessions for
issuing commands and queries.
It's possible to run the web interface as a shared service to give users within an
organization access to Hive without having to install any client software. There are more
details on the Hive Web Interface on the Hive wiki at
https://cwiki.apache.org/confluence/display/Hive/HiveWebInterface.
Hive clients
If you run Hive as a server (hive --service hiveserver), then there are a number of
different mechanisms for connecting to it from applications. The relationship between
Hive clients and Hive services is illustrated in Figure 12-1.
Figure 12-1. Hive architecture
Thrift Client
The Hive Thrift Client makes it easy to run Hive commands from a wide range of
programming languages. Thrift bindings for Hive are available for C++, Java, PHP,
Python, and Ruby. They can be found in the src/service/src subdirectory in the Hive
distribution.
JDBC Driver
Hive provides a Type 4 (pure Java) JDBC driver, defined in the class
org.apache.hadoop.hive.jdbc.HiveDriver. When configured with a JDBC URI of
the form jdbc:hive://host:port/dbname, a Java application will connect to a Hive
server running in a separate process at the given host and port. (The driver makes
calls to an interface implemented by the Hive Thrift Client using the Java Thrift
bindings.)
You may alternatively choose to connect to Hive via JDBC in embedded mode using
the URI jdbc:hive://. In this mode, Hive runs in the same JVM as the application
invoking it, so there is no need to launch it as a standalone server since it does not
use the Thrift service or the Hive Thrift Client.
ODBC Driver
The Hive ODBC Driver allows applications that support the ODBC protocol to
connect to Hive. (Like the JDBC driver, the ODBC driver uses Thrift to
communicate with the Hive server.) The ODBC driver is still in development, so you should
refer to the latest instructions on the Hive wiki for how to build and run it.
There are more details on using these clients on the Hive wiki at
https://cwiki.apache.org/confluence/display/Hive/HiveClient.
The Metastore
The metastore is the central repository of Hive metadata. The metastore is divided into
two pieces: a service and the backing store for the data. By default, the metastore service
runs in the same JVM as the Hive service and contains an embedded Derby database
instance backed by the local disk. This is called the embedded metastore configuration
(see Figure 12-2).
Using an embedded metastore is a simple way to get started with Hive; however, only
one embedded Derby database can access the database files on disk at any one time,
which means you can only have one Hive session open at a time that shares the same
metastore. Trying to start a second session gives the error:
Failed to start database 'metastore_db'
when it attempts to open a connection to the metastoie.
The solution to supporting multiple sessions (and therefore multiple users) is to use a
standalone database. This configuration is referred to as a local metastore, since the
metastore service still runs in the same process as the Hive service, but connects to a
database running in a separate process, either on the same machine or on a remote
machine. Any JDBC-compliant database may be used by setting the
javax.jdo.option.* configuration properties listed in Table 12-1.[4]
MySQL is a popular choice for the standalone metastore. In this case,
javax.jdo.option.ConnectionURL is set to
jdbc:mysql://host/dbname?createDatabaseIfNotExist=true, and
javax.jdo.option.ConnectionDriverName is set to
com.mysql.jdbc.Driver. (The user name and password should be set, too, of course.)
The JDBC driver JAR file for MySQL (Connector/J) must be on Hive's classpath, which
is simply achieved by placing it in Hive's lib directory.
Going a step further, there's another metastore configuration called a remote
metastore, where one or more metastore servers run in separate processes to the Hive service.
This brings better manageability and security, since the database tier can be completely
firewalled off, and the clients no longer need the database credentials.
Figure 12-2. Metastore configurations
4. The properties have the javax.jdo prefix since the metastore implementation uses the Java Data Objects
(JDO) API for persisting Java objects. It uses the DataNucleus implementation of JDO.
A Hive service is configured to use a remote metastore by setting
hive.metastore.local to false, and hive.metastore.uris to the metastore server URIs, separated
by commas if there is more than one. Metastore server URIs are of the form
thrift://host:port, where the port corresponds to the one set by METASTORE_PORT when starting
the metastore server (see "Hive Services" on page 417).
Table 12-1. Important metastore configuration properties
Property name | Type | Default value | Description
hive.metastore.warehouse.dir | URI | /user/hive/warehouse | The directory relative to fs.default.name where managed tables are stored.
hive.metastore.local | boolean | true | Whether to use an embedded metastore server (true), or connect to a remote instance (false). If false, then hive.metastore.uris must be set.
hive.metastore.uris | comma-separated URIs | Not set | The URIs specifying the remote metastore servers to connect to. Clients connect in a round-robin fashion if there are multiple remote servers.
javax.jdo.option.ConnectionURL | URI | jdbc:derby:;databaseName=metastore_db;create=true | The JDBC URL of the metastore database.
javax.jdo.option.ConnectionDriverName | String | org.apache.derby.jdbc.EmbeddedDriver | The JDBC driver classname.
javax.jdo.option.ConnectionUserName | String | APP | The JDBC user name.
javax.jdo.option.ConnectionPassword | String | mine | The JDBC password.
Comparison with Traditional Databases
While Hive resembles a traditional database in many ways (such as supporting an SQL
interface), its HDFS and MapReduce underpinnings mean that there are a number of
architectural differences that directly influence the features that Hive supports, which
in turn affects the uses that Hive can be put to.
Schema on Read Versus Schema on Write
In a traditional database, a table's schema is enforced at data load time. If the data being
loaded doesn't conform to the schema, then it is rejected. This design is sometimes
called schema on write, since the data is checked against the schema when it is written
into the database.
Hive, on the other hand, doesn't verify the data when it is loaded, but rather when a
query is issued. This is called schema on read.
There are trade-offs between the two approaches. Schema on read makes for a very fast
initial load, since the data does not have to be read, parsed, and serialized to disk in
the database's internal format. The load operation is just a file copy or move. It is more
flexible, too: consider having two schemas for the same underlying data, depending on
the analysis being performed. (This is possible in Hive using external tables, see
"Managed Tables and External Tables" on page 427.)
Schema on write makes query time performance faster, since the database can index
columns and perform compression on the data. The trade-off, however, is that it takes
longer to load data into the database. Furthermore, there are many scenarios where the
schema is not known at load time, so there are no indexes to apply, since the queries
have not been formulated yet. These scenarios are where Hive shines.
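The idea of schema on read can be sketched in a few lines of Python. This is an
illustration of the concept, not Hive's implementation: the raw lines are left untouched,
and a schema is applied only when the rows are read. A field that is missing or does not
parse becomes None, standing in for NULL. The tab delimiter, the sample data, and the
names read_with_schema, as_numbers, and as_text are all assumptions made for the example.

```python
def read_with_schema(lines, schema, delimiter="\t"):
    """Apply a schema to raw text rows at read time (schema on read)."""
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        row = []
        for i, (_name, convert) in enumerate(schema):
            try:
                row.append(convert(fields[i]))
            except (IndexError, ValueError):
                # A missing or malformed field shows up as NULL at query time.
                row.append(None)
        yield tuple(row)


# The same raw data, untouched by any load step.
raw = ["1950\t22\n", "1951\tn/a\n", "1952\n"]

# Two different schemas over the same underlying data.
as_numbers = [("year", int), ("temperature", int)]
as_text = [("year", str), ("temperature", str)]
```

Reading raw with as_numbers turns the unparseable and missing temperatures into None,
while as_text keeps "n/a" as an ordinary string; the stored data is identical in both cases.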
Updates, Transactions, and Indexes
Updates, transactions, and indexes are mainstays of traditional databases. Yet, until
recently, these features have not been considered a part of Hive's feature set. This is
because Hive was built to operate over HDFS data using MapReduce, where full-table
scans are the norm and a table update is achieved by transforming the data into a new
table. For a data warehousing application that runs over large portions of the dataset,
this works well.
However, there are workloads where updates (or insert appends, at least) are needed,
or where indexes would yield significant performance gains. On the transactions front,
Hive doesn't define clear semantics for concurrent access to tables, which means
applications need to build their own application-level concurrency or locking mechanism.
The Hive team is actively working on improvements in all these areas.[5]
5. See, for example, https://issues.apache.org/jira/browse/HIVE-306,
https://issues.apache.org/jira/browse/HIVE-417, and
https://issues.apache.org/jira/browse/HIVE-1293.
Change is also coming from another direction: HBase integration. HBase (Chapter 13)
has different storage characteristics to HDFS, such as the ability to do row
updates and column indexing, so we can expect to see these features used by Hive in
future releases. It is already possible to access HBase tables from Hive; you can find out
more at https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration.
HiveQL
Hive's SQL dialect, called HiveQL, does not support the full SQL-92 specification.
There are a number of reasons for this. Being a fairly young project, it has not had time
to provide the full repertoire of SQL-92 language constructs. More fundamentally,
SQL-92 compliance has never been an explicit project goal; rather, as an open source
project, features were added by developers to meet their users' needs. Furthermore,
Hive has some extensions that are not in SQL-92, which have been inspired by syntax
from other database systems, notably MySQL. In fact, to a first-order approximation,
HiveQL most closely resembles MySQL's SQL dialect.
Some of Hive's extensions to SQL-92 were inspired by MapReduce, such as multitable
inserts (see "Multitable insert" on page 439) and the TRANSFORM, MAP, and REDUCE clauses
(see "MapReduce Scripts" on page 442).
It turns out that some SQL-92 constructs that are missing from HiveQL are easy to
work around using other language features, so there has not been much pressure to
implement them. For example, SELECT statements do not (at the time of writing) support
a HAVING clause in HiveQL, but the same result can be achieved by adding a subquery
in the FROM clause (see "Subqueries" on page 446).
This chapter does not provide a complete reference to HiveQL; for that, see the Hive
documentation at https://cwiki.apache.org/confluence/display/Hive/LanguageManual.
Instead, we focus on commonly used features and pay particular attention to features
that diverge from SQL-92, or popular databases like MySQL. Table 12-2 provides a
high-level comparison of SQL and HiveQL.
Table 12-2. A high-level comparison of SQL and HiveQL
| Feature | SQL | HiveQL | References |
|---|---|---|---|
| Updates | UPDATE, INSERT, DELETE | INSERT OVERWRITE TABLE (populates whole table or partition) | "INSERT OVERWRITE TABLE" on page 438, "Updates, Transactions, and Indexes" on page 422 |
| Transactions | Supported | Not supported | |
| Indexes | Supported | Not supported | |
| Latency | Sub-second | Minutes | |
| Data types | Integral, floating point, fixed point, text and binary strings, temporal | Integral, floating point, boolean, string, array, map, struct | "Data Types" on page 424 |
| Functions | Hundreds of built-in functions | Dozens of built-in functions | "Operators and Functions" on page 426 |
| Multitable inserts | Not supported | Supported | "Multitable insert" on page 439 |
| Create table as select | Not valid SQL-92, but found in some databases | Supported | "CREATE TABLE...AS SELECT" on page 440 |
| Select | SQL-92 | Single table or view in the FROM clause. SORT BY for partial ordering. LIMIT to limit number of rows returned. | "Querying Data" on page 441 |
| Joins | SQL-92 or variants (join tables in the FROM clause, join condition in the WHERE clause) | Inner joins, outer joins, semi joins, map joins. SQL-92 syntax, with hinting. | "Joins" on page 443 |
| Subqueries | In any clause. Correlated or noncorrelated. | Only in the FROM clause. Correlated subqueries not supported | "Subqueries" on page 446 |
| Views | Updatable. Materialized or nonmaterialized. | Read-only. Materialized views not supported | "Views" on page 447 |
| Extension points | User-defined functions. Stored procedures. | User-defined functions. MapReduce scripts. | "User-Defined Functions" on page 448, "MapReduce Scripts" on page 442 |
Data Types
Hive supports both primitive and complex data types. Primitives include numeric,
boolean, string, and timestamp types. The complex data types include arrays, maps,
and structs. Hive's data types are listed in Table 12-3. Note that the literals shown are
those used from within HiveQL; they are not the serialized form used in the table's
storage format (see "Storage Formats" on page 433).
Table 12-3. Hive data types
| Category | Type | Description | Literal examples |
|---|---|---|---|
| Primitive | TINYINT | 1-byte (8-bit) signed integer, from -128 to 127 | 1 |
| | SMALLINT | 2-byte (16-bit) signed integer, from -32,768 to 32,767 | 1 |
| | INT | 4-byte (32-bit) signed integer, from -2,147,483,648 to 2,147,483,647 | 1 |
| | BIGINT | 8-byte (64-bit) signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 | 1 |
| | FLOAT | 4-byte (32-bit) single-precision floating-point number | 1.0 |
| | DOUBLE | 8-byte (64-bit) double-precision floating-point number | 1.0 |
| | BOOLEAN | true/false value | TRUE |
| | STRING | Character string | 'a', "a" |
| | BINARY | Byte array | Not supported |
| | TIMESTAMP | Timestamp with nanosecond precision | 1325502245000, '2012-01-02 03:04:05.123456789' |
| Complex | ARRAY | An ordered collection of fields. The fields must all be of the same type. | array(1, 2) [a] |
| | MAP | An unordered collection of key-value pairs. Keys must be primitives; values may be any type. For a particular map, the keys must be the same type, and the values must be the same type. | map('a', 1, 'b', 2) |
| | STRUCT | A collection of named fields. The fields may be of different types. | struct('a', 1, 1.0) [b] |
a. The literal forms for arrays, maps, and structs are provided as functions. That is, array(), map(), and struct() are built-in Hive functions.
b. The columns are named col1, col2, col3, etc.
Primitive types
Hive's primitive types correspond roughly to Java's, although some names are
influenced by MySQL's type names (some of which, in turn, overlap with SQL-92). There
are four signed integral types: TINYINT, SMALLINT, INT, and BIGINT, which are equivalent
to Java's byte, short, int, and long primitive types, respectively; they are 1-byte, 2-byte,
4-byte, and 8-byte signed integers.
Hive's floating-point types, FLOAT and DOUBLE, correspond to Java's float and double,
which are 32-bit and 64-bit floating point numbers. Unlike some databases, there is no
option to control the number of significant digits or decimal places stored for floating
point values.
Hive supports a BOOLEAN type for storing true and false values.
There is a single Hive data type for storing text, STRING, which is a variable-length
character string. Hive's STRING type is like VARCHAR in other databases, although there
is no declaration of the maximum number of characters to store with STRING. (The
theoretical maximum size STRING that may be stored is 2GB, although in practice it may
be inefficient to materialize such large values. Sqoop has large object support, see
"Importing Large Objects" on page 538.)
The BINARY data type is for storing variable-length binary data.
The TIMESTAMP data type stores timestamps with nanosecond precision. Hive comes
with UDFs for converting between Hive timestamps, Unix timestamps (seconds since
the Unix epoch), and strings, which makes most common date operations tractable.
TIMESTAMP does not encapsulate a timezone; however, the to_utc_timestamp and
from_utc_timestamp functions make it possible to do timezone conversions.
Conversions
Primitive types form a hierarchy, which dictates the implicit type conversions that Hive
will perform. For example, a TINYINT will be converted to an INT, if an expression
expects an INT; however, the reverse conversion will not occur and Hive will return an
error unless the CAST operator is used.
The implicit conversion rules can be summarized as follows. Any integral numeric type
can be implicitly converted to a wider type. All the integral numeric types, FLOAT, and
(perhaps surprisingly) STRING can be implicitly converted to DOUBLE. TINYINT,
SMALLINT, and INT can all be converted to FLOAT. BOOLEAN types cannot be converted to
any other type.
You can perform explicit type conversion using CAST. For example, CAST('1' AS INT)
will convert the string '1' to the integer value 1. If the cast fails, as it does in CAST('X'
AS INT), for example, then the expression returns NULL.
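The conversion rules just summarized can be written down as a small Python predicate.
This is a sketch that encodes only the rules stated in the text; the function names are
hypothetical, and cast_to_int mirrors just the two CAST examples above (it uses Python's
own integer parsing and does no 32-bit range checking, so it should not be read as a
description of how Hive parses every string).

```python
INTEGRAL = ["TINYINT", "SMALLINT", "INT", "BIGINT"]  # narrowest to widest


def can_convert_implicitly(source, target):
    """Return True if the rules in the text allow an implicit source -> target conversion."""
    if source == target:
        return True
    # Any integral type can be implicitly converted to a wider integral type.
    if source in INTEGRAL and target in INTEGRAL:
        return INTEGRAL.index(source) < INTEGRAL.index(target)
    # All integral types, FLOAT, and STRING can be implicitly converted to DOUBLE.
    if target == "DOUBLE":
        return source in INTEGRAL or source in ("FLOAT", "STRING")
    # TINYINT, SMALLINT, and INT can be converted to FLOAT.
    if target == "FLOAT":
        return source in ("TINYINT", "SMALLINT", "INT")
    # Everything else, including any conversion from BOOLEAN, needs an explicit CAST.
    return False


def cast_to_int(value):
    """Mimic CAST(value AS INT) for the examples in the text: None (NULL) on failure."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None
```

For instance, can_convert_implicitly("TINYINT", "INT") holds but the reverse does not,
and cast_to_int('X') yields None rather than raising an error.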
Complex types
Hive has three complex types: ARRAY, MAP, and STRUCT. ARRAY and MAP are like their
namesakes in Java, while a STRUCT is a record type which encapsulates a set of named
fields. Complex types permit an arbitrary level of nesting. Complex type declarations
must specify the type of the fields in the collection, using an angled bracket notation,
as illustrated in this table definition which has three columns, one for each complex
type:
CREATE TABLE complex (
col1 ARRAY<INT>,
col2 MAP<STRING, INT>,
col3 STRUCT<a:STRING, b:INT, c:DOUBLE>
);
If we load the table with one row of data for ARRAY, MAP, and STRUCT shown in the "Literal
examples" column in Table 12-3 (we'll see the file format needed to do this in "Storage
Formats" on page 433), then the following query demonstrates the field accessor
operators for each type:
hive> SELECT col1[0], col2['b'], col3.c FROM complex;
1 2 1.0
Operators and Functions
The usual set of SQL operators is provided by Hive: relational operators (such as x =
'a' for testing equality, x IS NULL for testing nullity, x LIKE 'a%' for pattern matching),
arithmetic operators (such as x + 1 for addition), and logical operators (such as x OR y
for logical OR). The operators match those in MySQL, which deviates from SQL-92
since || is logical OR, not string concatenation. Use the concat function for the latter
in both MySQL and Hive.
Hive comes with a large number of built-in functions (too many to list here) divided
into categories including mathematical and statistical functions, string functions, date
functions (for operating on string representations of dates), conditional functions,
aggregate functions, and functions for working with XML (using the xpath function) and
JSON.
You can retrieve a list of functions from the Hive shell by typing SHOW FUNCTIONS.[6] To
get brief usage instructions for a particular function, use the DESCRIBE command:
hive> DESCRIBE FUNCTION length;
length(str) - Returns the length of str
In the case when there is no built-in function that does what you want, you can write
your own; see "User-Defined Functions" on page 448.
Tables
A Hive table is logically made up of the data being stored and the associated metadata
describing the layout of the data in the table. The data typically resides in HDFS,
although it may reside in any Hadoop filesystem, including the local filesystem or S3.
Hive stores the metadata in a relational database, and not in HDFS, say (see "The
Metastore" on page 419).
In this section, we shall look in more detail at how to create tables, the different physical
storage formats that Hive offers, and how to import data into them.
Multiple Database/Schema Support
Many relational databases have a facility for multiple namespaces, which allow users
and applications to be segregated into different databases or schemas. Hive supports
the same facility, and provides commands such as CREATE DATABASE dbname, USE
dbname, and DROP DATABASE dbname. You can fully qualify a table by writing
dbname.tablename. If no database is specified, tables belong to the default database.
Managed Tables and External Tables
When you create a table in Hive, by default Hive will manage the data, which means
that Hive moves the data into its warehouse directory. Alternatively, you may create
an external table, which tells Hive to refer to the data that is at an existing location
outside the warehouse directory.
The difference between the two types of table is seen in the LOAD and DROP semantics.
Let's consider a managed table first.
6. Or see the Hive function reference at
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF.
When you load data into a managed table, it is moved into Hive's warehouse directory.
For example:
CREATE TABLE managed_table (dummy STRING);
LOAD DATA INPATH '/user/tom/data.txt' INTO table managed_table;
will move the file hdfs://user/tom/data.txt into Hive's warehouse directory for the
managed_table table, which is hdfs://user/hive/warehouse/managed_table.[7]
The load operation is very fast, since it is just a move or rename within
a filesystem. However, bear in mind that Hive does not check that the
files in the table directory conform to the schema declared for the table,
even for managed tables. If there is a mismatch, then this will become
apparent at query time, often by the query returning NULL for a missing
field. You can check that the data is being parsed correctly by issuing a
simple SELECT statement to retrieve a few rows directly from the table.
If the table is later dropped, using:
DROP TABLE managed_table;
then the table, including its metadata and its data, is deleted. It bears repeating that
since the initial LOAD performed a move operation, and the DROP performed a delete
operation, the data no longer exists anywhere. This is what it means for Hive to manage
the data.
An external table behaves differently. You control the creation and deletion of the data.
The location of the external data is specified at table creation time:
CREATE EXTERNAL TABLE external_table (dummy STRING)
LOCATION '/user/tom/external_table';
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
With the EXTERNAL keyword, Hive knows that it is not managing the data, so it doesn't
move it to its warehouse directory. Indeed, it doesn't even check if the external location
exists at the time it is defined. This is a useful feature, since it means you can create the
data lazily after creating the table.
When you drop an external table, Hive will leave the data untouched and only delete
the metadata.
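The difference in LOAD and DROP semantics can be modeled with a toy Python class that
works on the local filesystem. This is only a sketch of the behavior described above: the
Warehouse class and its methods are hypothetical, a dictionary stands in for the metastore,
and nothing here talks to Hive or HDFS.

```python
import os
import shutil


class Warehouse:
    """Toy model of LOAD and DROP for managed and external tables."""

    def __init__(self, root):
        self.root = root
        self.tables = {}  # table name -> (data directory, is_external)

    def create_table(self, name, external_location=None):
        external = external_location is not None
        location = external_location if external else os.path.join(self.root, name)
        if not external:
            # Managed tables live under the warehouse directory.
            os.makedirs(location, exist_ok=True)
        # For an external table the location is recorded but not checked or created.
        self.tables[name] = (location, external)

    def load(self, source_path, name):
        # LOAD is a move into the table's directory, not a copy.
        location, _ = self.tables[name]
        os.makedirs(location, exist_ok=True)
        shutil.move(source_path, location)

    def drop_table(self, name):
        location, external = self.tables.pop(name)
        if not external:
            # Dropping a managed table deletes the data as well as the metadata.
            shutil.rmtree(location)
```

After load followed by drop_table on a managed table, the original file exists nowhere.
With an external table, the same sequence leaves the file in the external location.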
So how do you choose which type of table to use? In most cases, there is not much
difference between the two (except of course for the difference in DROP semantics), so
it is just a matter of preference. As a rule of thumb, if you are doing all your processing
with Hive, then use managed tables, but if you wish to use Hive and other tools on the
same dataset, then use external tables. A common pattern is to use an external table to
access an initial dataset stored in HDFS (created by another process), then use a Hive
transform to move the data into a managed Hive table. This works the other way
around, too: an external table (not necessarily on HDFS) can be used to export data
from Hive for other applications to use.[8]
Another reason for using external tables is when you wish to associate multiple schemas
with the same dataset.
7. The move will only succeed if the source and target filesystems are the same. Also,
there is a special case if the LOCAL keyword is used, where Hive will copy the data from
the local filesystem into Hive's warehouse directory (even if it, too, is on the same local
filesystem). In all other cases though, LOAD is a move operation and is best thought of
as such.
Partitions and Buckets
Hive organizes tables into partitions, a way of dividing a table into coarse-grained parts
based on the value of a partition column, such as date. Using partitions can make it
faster to do queries on slices of the data.
Tables or partitions may further be subdivided into buckets, to give extra structure to
the data that may be used for more efficient queries. For example, bucketing by user
ID means we can quickly evaluate a user-based query by running it on a randomized
sample of the total set of users.
Partitions
To take an example where partitions are commonly used, imagine log files where each
record includes a timestamp. If we partitioned by date, then records for the same date
would be stored in the same partition. The advantage to this scheme is that queries that
are restricted to a particular date or set of dates can be answered much more efficiently
since they only need to scan the files in the partitions that the query pertains to. Notice
that partitioning doesn't preclude more wide-ranging queries: it is still feasible to query
the entire dataset across many partitions.
A table may be partitioned in multiple dimensions. For example, in addition to
partitioning logs by date, we might also subpartition each date partition by country to
permit efficient queries by location.
Partitions are defined at table creation time[9] using the PARTITIONED BY clause, which
takes a list of column definitions. For the hypothetical log files example, we might
define a table with records comprising a timestamp and the log line itself:
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);
8. You can also use INSERT OVERWRITE DIRECTORY to export data to a Hadoop filesystem,
but unlike external tables you cannot control the output format, which is Control-A
separated text files. Complex data types are serialized using a JSON representation.
9. However, partitions may be added to or removed from a table after creation using an
ALTER TABLE statement.
When we load data into a partitioned table, the partition values are specified explicitly:
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
INTO TABLE logs
PARTITION (dt='2001-01-01', country='GB');
At the filesystem level, partitions are simply nested subdirectories of the table directory.
After loading a few more files into the logs table, the directory structure might look
like this:
/user/hive/warehouse/logs/dt=2001-01-01/country=GB/file1
                                                  /file2
                                       /country=US/file3
                         /dt=2001-01-02/country=GB/file4
                                       /country=US/file5
                                                  /file6
The logs table has two date partitions, 2001-01-01 and 2001-01-02, corresponding to
subdirectories called dt=2001-01-01 and dt=2001-01-02; and two country subpartitions,
GB and US, corresponding to nested subdirectories called country=GB and
country=US. The data files reside in the leaf directories.
We can ask Hive for the partitions in a table using SHOW PARTITIONS:
hive> SHOW PARTITIONS logs;
dt=2001-01-01/country=GB
dt=2001-01-01/country=US
dt=2001-01-02/country=GB
dt=2001-01-02/country=US
One thing to bear in mind is that the column definitions in the PARTITIONED BY clause
are full-fledged table columns, called partition columns; however, the data files do not
contain values for these columns since they are derived from the directory names.
You can use partition columns in SELECT statements in the usual way. Hive performs
input pruning to scan only the relevant partitions. For example:
SELECT ts, dt, line
FROM logs
WHERE country='GB';
will only scan file1, file2, and file4. Notice, too, that the query returns the values of the
dt partition column, which Hive reads from the directory names since they are not in
the data files.
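Input pruning can be illustrated with a short Python sketch that derives partition values
from directory names and keeps only the files whose partitions satisfy an equality
predicate. The helper names partition_values and prune are hypothetical, and the file
list simply restates the example layout; Hive's real planner is considerably more involved.

```python
def partition_values(path):
    """Extract partition column values from the name=value components of a path."""
    values = {}
    for component in path.split("/"):
        if "=" in component:
            name, value = component.split("=", 1)
            values[name] = value
    return values


def prune(paths, **predicate):
    """Keep only the files whose partition values match every column=value pair given."""
    return [
        path for path in paths
        if all(partition_values(path).get(col) == val for col, val in predicate.items())
    ]


base = "/user/hive/warehouse/logs"
files = [
    f"{base}/dt=2001-01-01/country=GB/file1",
    f"{base}/dt=2001-01-01/country=GB/file2",
    f"{base}/dt=2001-01-01/country=US/file3",
    f"{base}/dt=2001-01-02/country=GB/file4",
    f"{base}/dt=2001-01-02/country=US/file5",
    f"{base}/dt=2001-01-02/country=US/file6",
]
```

Calling prune(files, country="GB") selects the paths ending in file1, file2, and file4,
and partition_values recovers the dt value for each of them from the directory name alone.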
Buckets
There are two reasons why you might want to organize your tables (or partitions) into
buckets. The first is to enable more efficient queries. Bucketing imposes extra structure
on the table, which Hive can take advantage of when performing certain queries. In
particular, a join of two tables that are bucketed on the same columns (which include
the join columns) can be efficiently implemented as a map-side join.
The second reason to bucket a table is to make sampling more efficient. When working
with large datasets, it is very convenient to try out queries on a fraction of your dataset
while you are in the process of developing or refining them. We shall see how to do
efficient sampling at the end of this section.
First, let's see how to tell Hive that a table should be bucketed. We use the CLUSTERED
BY clause to specify the columns to bucket on and the number of buckets:
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
Here we are using the user ID to determine the bucket (which Hive does by hashing
the value and reducing modulo the number of buckets), so any particular bucket will
effectively have a random set of users in it.
In the map-side join case, where the two tables are bucketed in the same way, a mapper
processing a bucket of the left table knows that the matching rows in the right table are
in its corresponding bucket, so it need only retrieve that bucket (which is a small fraction
of all the data stored in the right table) to effect the join. This optimization works, too,
if the numbers of buckets in the two tables are multiples of each other; they do not
have to have exactly the same number of buckets. The HiveQL for joining two bucketed
tables is shown in "Map joins" on page 446.
The data within a bucket may additionally be sorted by one or more columns. This
allows even more efficient map-side joins, since the join of each bucket becomes an
efficient merge-sort. The syntax for declaring that a table has sorted buckets is:
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
How can we make sure the data in our table is bucketed? While it's possible to load
data generated outside Hive into a bucketed table, it's often easier to get Hive to do the
bucketing, usually from an existing table.
Hive does not check that the buckets in the data files on disk are
consistent with the buckets in the table definition (either in number, or on
the basis of bucketing columns). If there is a mismatch, then you may
get an error or undefined behavior at query time. For this reason, it is
advisable to get Hive to perform the bucketing.
Take an unbucketed users table:
hive> SELECT * FROM users;
0 Nat
2 Joe
3 Kay
4 Ann
To populate the bucketed table, we need to set the hive.enforce.bucketing property
to true, so that Hive knows to create the number of buckets declared in the table
definition. Then it is a matter of just using the INSERT command:
INSERT OVERWRITE TABLE bucketed_users
SELECT * FROM users;
Physically, each bucket is just a file in the table (or partition) directory. The file name
is not important, but bucket n is the nth file, when arranged in lexicographic order. In
fact, buckets correspond to MapReduce output file partitions: a job will produce as
many buckets (output files) as reduce tasks. We can see this by looking at the layout
of the bucketed_users table we just created. Running this command:
hive> dfs -ls /user/hive/warehouse/bucketed_users;
shows that four files were created, with the following names (the name is generated by
Hive and incorporates a timestamp, so it will change from run to run):
attempt_201005221636_0016_r_000000_0
attempt_201005221636_0016_r_000001_0
attempt_201005221636_0016_r_000002_0
attempt_201005221636_0016_r_000003_0
The first bucket contains the users with IDs 0 and 4, since for an INT the hash is the
integer itself, and the value is reduced modulo the number of buckets (4 in this case):[10]
hive> dfs -cat /user/hive/warehouse/bucketed_users/*0_0;
0Nat
4Ann
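The bucket assignment described above is simple enough to reproduce in Python. This
sketch assumes non-negative INT IDs, for which the text says the hash is the integer
itself; the function name bucket_for and the zero-based bucket index are choices made
for the illustration.

```python
def bucket_for(user_id, num_buckets):
    """Zero-based bucket index for an INT column: the integer itself, modulo num_buckets."""
    return user_id % num_buckets


# The rows of the unbucketed users table.
users = [(0, "Nat"), (2, "Joe"), (3, "Kay"), (4, "Ann")]

# Distribute the rows over four buckets, as INTO 4 BUCKETS would.
buckets = {n: [] for n in range(4)}
for user_id, name in users:
    buckets[bucket_for(user_id, 4)].append((user_id, name))
```

Bucket 0 ends up holding Nat and Ann, matching the contents of the first file, while
bucket 1 is empty because no user ID in this tiny table leaves a remainder of 1.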
We can see the same thing by sampling the table using the TABLESAMPLE clause, which
restricts the query to a fraction of the buckets in the table rather than the whole table:
hive> SELECT * FROM bucketed_users
> TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
0 Nat
4 Ann
Bucket numbering is 1-based, so this query retrieves all the users from the first of four
buckets. For a large, evenly distributed dataset, approximately one quarter of the table's
rows would be returned. It's possible to sample a number of buckets by specifying a
different proportion (which need not be an exact multiple of the number of buckets,
since sampling is not intended to be a precise operation). For example, this query
returns half of the buckets:
hive> SELECT * FROM bucketed_users
> TABLESAMPLE(BUCKET 1 OUT OF 2 ON id);
0 Nat
4 Ann
2 Joe
10. The fields appear run together when displaying the raw file since the separator
character in the output is a nonprinting control character. The control characters used
are explained in the next section.
Sampling a bucketed table is very efficient, since the query only has to read the buckets
that match the TABLESAMPLE clause. Contrast this with sampling a non-bucketed table,
using the rand() function, where the whole input dataset is scanned, even if a very small
sample is needed:
hive> SELECT * FROM users
> TABLESAMPLE(BUCKET 1 OUT OF 4 ON rand());
2 Joe
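Which rows a TABLESAMPLE(BUCKET x OUT OF y ON id) clause selects can be modeled in
Python as follows. The sketch assumes non-negative INT IDs hashed as the integer itself,
and the function name tablesample is hypothetical. It models row selection only: it
filters an in-memory list, whereas the efficiency described above comes from Hive reading
only the matching bucket files.

```python
def tablesample(rows, x, y, key=lambda row: row[0]):
    """Rows falling in bucket x (1-based) out of y, using the INT key modulo y."""
    return [row for row in rows if key(row) % y == x - 1]


users = [(0, "Nat"), (2, "Joe"), (3, "Kay"), (4, "Ann")]

one_quarter = tablesample(users, 1, 4)  # BUCKET 1 OUT OF 4 ON id
one_half = tablesample(users, 1, 2)     # BUCKET 1 OUT OF 2 ON id
```

The first sample contains Nat and Ann, and the second adds Joe, which is the same set of
rows as in the query output shown earlier (Hive's row order may differ, since it depends on
the order in which bucket files are read).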
Storage Formats
There are two dimensions that govern table storage in Hive: the row format and the
file format. The row format dictates how rows, and the fields in a particular row, are
stored. In Hive parlance, the row format is defined by a SerDe, a portmanteau word
for a Serializer-Deserializer.
When acting as a deserializer, which is the case when querying a table, a SerDe will
deserialize a row of data from the bytes in the file to objects used internally by Hive to
operate on that row of data. When used as a serializer, which is the case when
performing an INSERT or CTAS (see "Importing Data" on page 438), the table's SerDe will
serialize Hive's internal representation of a row of data into the bytes that are written
to the output file.
The file format dictates the container format for fields in a row. The simplest format is
a plain text file, but there are row-oriented and column-oriented binary formats
available, too.
The default storage format: Delimited text
When you create a table with no ROW FORMAT or STORED AS clauses, the default format is
delimited text, with a row per line.
The default field delimiter is not a tab character, but the Control-A character from the
set of ASCII control codes (it has ASCII code 1). The choice of Control-A, sometimes
written as ^A in documentation, came about since it is less likely to be a part of the
field text than a tab character. There is no means for escaping delimiter characters in
Hive, so it is important to choose ones that don't occur in data fields.
The default collection item delimiter is a Control-B character, used to delimit items in
an ARRAY or STRUCT, or key-value pairs in a MAP. The default map key delimiter is a
Control-C character, used to delimit the key and value in a MAP. Rows in a table are
delimited by a newline character.
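To see how these default delimiters fit together, here is a Python sketch that parses one
line of delimited text for a table with an ARRAY<INT> column, a MAP<STRING, INT> column,
and a STRUCT<a:STRING, b:INT, c:DOUBLE> column. The function name and the sample line are
assumptions for the illustration, and a Python dictionary stands in for the struct; this is
not LazySimpleSerDe's actual code.

```python
FIELD, ITEM, MAP_KEY = "\x01", "\x02", "\x03"  # Control-A, Control-B, Control-C


def parse_complex_row(line):
    """Parse one row with columns ARRAY<INT>, MAP<STRING,INT>, STRUCT<a,b,c>."""
    col1, col2, col3 = line.rstrip("\n").split(FIELD)
    # Array items are separated by Control-B.
    array = [int(item) for item in col1.split(ITEM)]
    # Map entries are separated by Control-B; key and value by Control-C.
    mapping = {}
    for pair in col2.split(ITEM):
        key, value = pair.split(MAP_KEY)
        mapping[key] = int(value)
    # Struct fields are separated by Control-B, in declaration order.
    a, b, c = col3.split(ITEM)
    struct = {"a": a, "b": int(b), "c": float(c)}
    return array, mapping, struct


# array(1, 2), map('a', 1, 'b', 2), struct('a', 1, 1.0) as a single delimited line.
line = "1\x022\x01a\x031\x02b\x032\x01a\x021\x021.0\n"
col1, col2, col3 = parse_complex_row(line)
```

With this row, col1[0] is 1, col2['b'] is 2, and col3['c'] is 1.0, which are the same values
that the field accessor operators retrieve from a row built from the literal examples.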
The preceding description of delimiters is correct for the usual case of
flat data structures, where the complex types only contain primitive
types. For nested types, however, this isn't the whole story, and in fact
the level of the nesting determines the delimiter.
For an array of arrays, for example, the delimiters for the outer array are
Control-B characters, as expected, but for the inner array they are
Control-C characters, the next delimiter in the list. If you are unsure
which delimiters Hive uses for a particular nested structure, you can run
a command like:
CREATE TABLE nested
AS
SELECT array(array(1, 2), array(3, 4))
FROM dummy;
then use hexdump, or similar, to examine the delimiters in the output file.
Hive actually supports eight levels of delimiters, corresponding to ASCII
codes 1, 2, ... 8, but you can only override the first three.
Thus, the statement:
CREATE TABLE ...;
is iuentical to the moie explicit:
CREATE TABLE ...
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
Notice that the octal form of the delimiter characters can be used: 001 for Control-A,
for instance.
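Returning to nested types, the rule that the nesting level determines the delimiter can be
sketched in Python. This is an illustration under simplifying assumptions: it handles only
nested arrays (represented as Python lists) with the default delimiters, it does not model
maps, and the function name serialize is hypothetical.

```python
def serialize(value, level=1):
    """Serialize nested lists, using ASCII code `level` as the delimiter at each depth.

    Level 1 (Control-A) separates the top-level fields of a row, level 2 (Control-B)
    separates the items of a collection, level 3 (Control-C) the items of a nested
    collection, and so on up to the eight levels mentioned in the text.
    """
    if isinstance(value, list):
        if level > 8:
            raise ValueError("only eight levels of delimiters are available")
        delimiter = chr(level)
        return delimiter.join(serialize(item, level + 1) for item in value)
    return str(value)


# A row with a single column holding array(array(1, 2), array(3, 4)).
row = [[[1, 2], [3, 4]]]
encoded = serialize(row)
```

Here the outer array's items are separated by Control-B and the inner arrays' items by
Control-C, which is the pattern you would look for in the hexdump output.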
Internally, Hive uses a SerDe called LazySimpleSerDe for this delimited format, along
with the line-oriented MapReduce text input and output formats we saw in Chapter 7.
The "lazy" prefix comes about since it deserializes fields lazily, only as they are
accessed. However, it is not a compact format since fields are stored in a verbose textual
format, so a boolean value, for instance, is written as the literal string true or false.
The simplicity of the format has a lot going for it, such as making it easy to process
with other tools, including MapReduce programs or Streaming, but there are more
compact and performant binary SerDe's that you might consider using. Some are listed
in Table 12-4.
Binary SerDe's should not be used with the default TEXTFILE format (or
explicitly using a STORED AS TEXTFILE clause). There is always the
possibility that a binary row will contain a newline character, which would
cause Hive to truncate the row and fail at deserialization time.
Table 12-4. Hive SerDe's
| SerDe name | Java package | Description |
|---|---|---|
| LazySimpleSerDe | org.apache.hadoop.hive.serde2.lazy | The default SerDe. Delimited textual format, with lazy field access. |
| LazyBinarySerDe | org.apache.hadoop.hive.serde2.lazybinary | A more efficient version of LazySimpleSerDe. Binary format with lazy field access. Used internally for such things as temporary tables. |
| BinarySortableSerDe | org.apache.hadoop.hive.serde2.binarysortable | A binary SerDe like LazyBinarySerDe, but optimized for sorting at the expense of compactness (although it is still significantly more compact than LazySimpleSerDe). |
| ColumnarSerDe | org.apache.hadoop.hive.serde2.columnar | A variant of LazySimpleSerDe for column-based storage with RCFile. |
| RegexSerDe | org.apache.hadoop.hive.contrib.serde2 | A SerDe for reading textual data where columns are specified by a regular expression. Also writes data using a formatting expression. Useful for reading log files, but inefficient, so not suitable for general-purpose storage. |
| ThriftByteStreamTypedSerDe | org.apache.hadoop.hive.serde2.thrift | A SerDe for reading Thrift-encoded binary data. |
| HBaseSerDe | org.apache.hadoop.hive.hbase | A SerDe for storing data in an HBase table. HBase storage uses a Hive storage handler, which unifies (and generalizes) the roles of row format and file format. Storage handlers are specified using a STORED BY clause, which replaces the ROW FORMAT and STORED AS clauses. See https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration. |
Binary storage formats: Sequence files and RCFiles
Hadoop's sequence file format ("SequenceFile" on page 132) is a general purpose
binary format for sequences of records (key-value pairs). You can use sequence files in
Hive by using the declaration STORED AS SEQUENCEFILE in the CREATE TABLE statement.

One of the main benefits of using sequence files is their support for splittable
compression. If you have a collection of sequence files that were created outside Hive, then
Hive will read them with no extra configuration. If, on the other hand, you want tables
populated from Hive to use compressed sequence files for their storage, you need to
set a few properties to enable compression (see "Using Compression in MapReduce"
on page 92):
hive> SET hive.exec.compress.output=true;
hive> SET mapred.output.compress=true;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> INSERT OVERWRITE TABLE ...;
Sequence files are row-oriented. What this means is that the fields in each row are stored
together, as the contents of a single sequence file record.

Hive provides another binary storage format called RCFile, short for Record Columnar
File. RCFiles are similar to sequence files, except that they store data in a
column-oriented fashion. RCFile breaks up the table into row splits, then within each split stores
the values for each row in the first column, followed by the values for each row in the
second column, and so on. This is shown diagrammatically in Figure 12-3.

Figure 12-3. Row-oriented versus column-oriented storage

A column-oriented layout permits columns that are not accessed in a query to be
skipped. Consider a query of the table in Figure 12-3 that processes only column 2. With
row-oriented storage, like a sequence file, the whole row (stored in a sequence file
record) is loaded into memory, even though only the second column is actually read.
Lazy deserialization goes some way to save processing cycles by only deserializing the
column fields that are accessed, but it can't avoid the cost of reading each row's bytes
from disk.

With column-oriented storage, only the column 2 parts of the file (shaded in the figure)
need to be read into memory.
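The difference between the two layouts can be sketched in a few lines of Python. This is
only a toy model of the order in which values are written, using a hypothetical
rows_per_split parameter; the real on-disk formats contain more than the bare values
and are not modeled here.

```python
def row_layout(rows):
    """Row-oriented: all the fields of a row are stored together."""
    return [value for row in rows for value in row]


def column_layout(rows, rows_per_split):
    """Column-oriented within row splits: for each split, store every
    value of column 1, then every value of column 2, and so on."""
    out = []
    for start in range(0, len(rows), rows_per_split):
        split = rows[start:start + rows_per_split]
        for column in zip(*split):  # transpose the split
            out.extend(column)
    return out


# A made-up table of four rows and three columns.
table = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
print(row_layout(table))
print(column_layout(table, rows_per_split=2))
```

In the column-oriented output, the values of column 2 within a split sit next to each
other, which is what lets a reader skip the other columns.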
In general, column-oriented formats work well when queries access only a small
number of columns in the table. Conversely, row-oriented formats are appropriate when a
large number of columns of a single row are needed for processing at the same time.
Space permitting, it is relatively straightforward to measure the performance difference
between the two formats for your particular workload, since you can create a copy of
a table with a different storage format for comparison, using "CREATE TABLE...AS
SELECT" on page 440.

Use the following CREATE TABLE clauses to enable column-oriented storage in Hive:
CREATE TABLE ...
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS RCFILE;
An example: RegexSerDe
Let's see how to use another SerDe for storage. We'll use a contrib SerDe that uses a
regular expression for reading the fixed-width station metadata from a text file:
CREATE TABLE stations (usaf STRING, wban STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\\d{6}) (\\d{5}) (.{29}) .*"
);
In previous examples, we have used the DELIMITED keyword to refer to delimited text
in the ROW FORMAT clause. In this example, we instead specify a SerDe with the SERDE
keyword and the fully qualified classname of the Java class that implements the SerDe,
org.apache.hadoop.hive.contrib.serde2.RegexSerDe.

SerDe's can be configured with extra properties using the WITH SERDEPROPERTIES clause.
Here we set the input.regex property, which is specific to RegexSerDe.

input.regex is the regular expression pattern to be used during deserialization to turn
the line of text forming the row into a set of columns. Java regular expression syntax
is used for the matching (see
http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html), and columns are
formed from capturing groups of parentheses.[11] In this example, there are three
capturing groups for usaf (a six-digit identifier), wban (a five-digit identifier), and
name (a fixed-width column of 29 characters).

11. Sometimes you need to use parentheses for regular expression constructs that you don't want to count
as a capturing group. For example, the pattern (ab)+ for matching a string of one or more ab characters.
The solution is to use a noncapturing group, which has a ? character after the first parenthesis. There are
various noncapturing group constructs (see the Java documentation), but in this example we could use
(?:ab)+ to avoid capturing the group as a Hive column.
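To get a feel for what the pattern captures, we can try it outside Hive. The sketch below
uses Python's re module rather than Java's regular expression engine (the two differ in
places, but this particular pattern is written the same way in both). The function name
parse_station and the sample line are made up for illustration.

```python
import re

# The same pattern as the input.regex property, without the extra
# backslash escaping that the HiveQL string literal needs.
STATION_PATTERN = re.compile(r"(\d{6}) (\d{5}) (.{29}) .*")


def parse_station(line):
    """Return (usaf, wban, name) if the whole line matches, else None."""
    match = STATION_PATTERN.fullmatch(line.rstrip("\n"))
    return match.groups() if match else None


# A made-up fixed-width line: the name is padded to 29 characters.
sample = "010010 99999 " + "JAN MAYEN".ljust(29) + " NO"
print(parse_station(sample))

# Capturing versus noncapturing groups (see footnote 11).
print(re.compile(r"(ab)+").groups)    # number of capturing groups
print(re.compile(r"(?:ab)+").groups)  # noncapturing group is not counted
```

Note that the captured name keeps its trailing padding, since the third group always
matches exactly 29 characters.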
To populate the table, we use a LOAD DATA statement as before:
LOAD DATA LOCAL INPATH "input/ncdc/metadata/stations-fixed-width.txt"
INTO TABLE stations;
Recall that LOAD DATA copies or moves the files to Hive's warehouse directory (in this
case, it's a copy since the source is the local filesystem). The table's SerDe is not used
for the load operation.

When we retrieve data from the table, the SerDe is invoked for deserialization, as we
can see from this simple query, which correctly parses the fields for each row:
hive> SELECT * FROM stations LIMIT 4;
010000 99999 BOGUS NORWAY
010003 99999 BOGUS NORWAY
010010 99999 JAN MAYEN
010013 99999 ROST
Importing Data
We've already seen how to use the LOAD DATA operation to import data into a Hive table
(or partition) by copying or moving files to the table's directory. You can also populate
a table with data from another Hive table using an INSERT statement, or at creation time
using the CTAS construct, which is an abbreviation used to refer to CREATE TABLE...AS
SELECT.

If you want to import data from a relational database directly into Hive, have a look at
Sqoop, which is covered in "Imported Data and Hive" on page 536.
INSERT OVERWRITE TABLE
Here's an example of an INSERT statement:
INSERT OVERWRITE TABLE target
SELECT col1, col2
FROM source;
For partitioned tables, you can specify the partition to insert into by supplying a
PARTITION clause:
INSERT OVERWRITE TABLE target
PARTITION (dt='2010-01-01')
SELECT col1, col2
FROM source;
The OVERWRITE keyword is actually mandatory in both cases, and means that the
contents of the target table (for the first example) or the 2010-01-01 partition (for the
second example) are replaced by the results of the SELECT statement. At the time of
writing, Hive does not support adding records to an already-populated nonpartitioned
table or partition using an INSERT statement. Instead, you can achieve the same effect
using a LOAD DATA operation without the OVERWRITE keyword.

You can specify the partition dynamically, by determining the partition value from the
SELECT statement:
INSERT OVERWRITE TABLE target
PARTITION (dt)
SELECT col1, col2, dt
FROM source;
This is known as a dynamic-partition insert. This feature is off by default, so you need
to enable it by setting hive.exec.dynamic.partition to true first.

Unlike other databases, Hive does not (currently) support a form of the
INSERT statement for inserting a collection of records specified in the
query, in literal form. That is, statements of the form INSERT INTO...VALUES...
are not allowed.
Multitable insert
In HiveQL, you can turn the INSERT statement around and start with the FROM clause,
for the same effect:
FROM source
INSERT OVERWRITE TABLE target
SELECT col1, col2;
The reason for this syntax becomes clear when you see that it's possible to have multiple
INSERT clauses in the same query. This so-called multitable insert is more efficient than
multiple INSERT statements, since the source table need only be scanned once to produce
the multiple, disjoint outputs.

Here's an example that computes various statistics over the weather dataset:
FROM records2
INSERT OVERWRITE TABLE stations_by_year
SELECT year, COUNT(DISTINCT station)
GROUP BY year
INSERT OVERWRITE TABLE records_by_year
SELECT year, COUNT(1)
GROUP BY year
INSERT OVERWRITE TABLE good_records_by_year
SELECT year, COUNT(1)
WHERE temperature != 9999
AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9)
GROUP BY year;
There is a single source table (records2), but three tables to hold the results from three
different queries over the source.
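The efficiency argument for the multitable insert, one scan feeding several outputs, can
be illustrated with a small Python sketch. It mirrors the three aggregations in the
query above over an in-memory list; the function name multitable_scan and the sample
records are assumptions for illustration, and this is not how Hive executes the query.

```python
from collections import defaultdict

GOOD_QUALITY = {0, 1, 4, 5, 9}


def multitable_scan(records):
    """Compute three per-year outputs in a single pass over the source.

    Each record is a (year, station, temperature, quality) tuple.
    """
    stations = defaultdict(set)  # for COUNT(DISTINCT station)
    counts = defaultdict(int)    # for COUNT(1)
    good = defaultdict(int)      # for COUNT(1) with the WHERE filter
    for year, station, temperature, quality in records:
        stations[year].add(station)
        counts[year] += 1
        if temperature != 9999 and quality in GOOD_QUALITY:
            good[year] += 1
    stations_by_year = {year: len(s) for year, s in stations.items()}
    return stations_by_year, dict(counts), dict(good)


# Made-up records: (year, station, temperature, quality).
records = [
    (1949, "A", 111, 1),
    (1949, "A", 9999, 1),
    (1949, "B", 78, 2),
    (1950, "A", 0, 0),
]
print(multitable_scan(records))
```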
CREATE TABLE...AS SELECT
It's often very convenient to store the output of a Hive query in a new table, perhaps
because it is too large to be dumped to the console or because there are further
processing steps to carry out on the result.

The new table's column definitions are derived from the columns retrieved by the
SELECT clause. In the following query, the target table has two columns named col1
and col2 whose types are the same as the ones in the source table:
CREATE TABLE target
AS
SELECT col1, col2
FROM source;
A CTAS operation is atomic, so if the SELECT query fails for some reason, then the table
is not created.
Altering Tables
Since Hive uses the schema on read approach, it's flexible in permitting a table's
definition to change after the table has been created. The general caveat, however, is that
it is up to you, in many cases, to ensure that the data is changed to reflect the new
structure.

You can rename a table using the ALTER TABLE statement:
ALTER TABLE source RENAME TO target;
In addition to updating the table metadata, ALTER TABLE moves the underlying table
directory so that it reflects the new name. In the current example,
/user/hive/warehouse/source is renamed to /user/hive/warehouse/target. (An external
table's underlying directory is not moved; only the metadata is updated.)

Hive allows you to change the definition for columns, add new columns, or even replace
all existing columns in a table with a new set.

For example, consider adding a new column:
ALTER TABLE target ADD COLUMNS (col3 STRING);
The new column col3 is added after the existing (nonpartition) columns. The data files
are not updated, so queries will return null for all values of col3 (unless of course there
were extra fields already present in the files). Since Hive does not permit updating
existing records, you will need to arrange for the underlying files to be updated by
another mechanism. For this reason, it is more common to create a new table that
defines new columns and populates them using a SELECT statement.

Changing a column's metadata, such as a column's name or data type, is more
straightforward, assuming that the old data type can be interpreted as the new data type.
To learn more about how to alter a table's structure, including adding and dropping
partitions, changing and replacing columns, and changing table and SerDe properties,
see the Hive wiki at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL.
Dropping Tables
The DROP TABLE statement deletes the data and metadata for a table. In the case of
external tables, only the metadata is deleted; the data is left untouched.

If you want to delete all the data in a table, but keep the table definition (like DELETE or
TRUNCATE in MySQL), then you can simply delete the data files. For example:
hive> dfs -rmr /user/hive/warehouse/my_table;
Hive treats a lack of files (or indeed no directory for the table) as an empty table.

Another possibility, which achieves a similar effect, is to create a new, empty table that
has the same schema as the first, using the LIKE keyword:
CREATE TABLE new_table LIKE existing_table;
Querying Data
This section discusses how to use various forms of the SELECT statement to retrieve data
from Hive.
Sorting and Aggregating
Sorting data in Hive can be achieved by use of a standard ORDER BY clause, but there is
a catch. ORDER BY produces a result that is totally sorted, as expected, but to do so it
sets the number of reducers to one, making it very inefficient for large datasets.
(Hopefully, a future release of Hive will employ the techniques described in "Total
Sort" on page 272 to support efficient parallel sorting.)

When a globally sorted result is not required (and in many cases it isn't), then you
can use Hive's nonstandard extension, SORT BY instead. SORT BY produces a sorted file
per reducer.

In some cases, you want to control which reducer a particular row goes to, typically so
you can perform some subsequent aggregation. This is what Hive's DISTRIBUTE BY
clause does. Here's an example to sort the weather dataset by year and temperature, in
such a way to ensure that all the rows for a given year end up in the same reducer
partition:[12]
hive> FROM records2
> SELECT year, temperature
> DISTRIBUTE BY year
> SORT BY year ASC, temperature DESC;
1949 111
1949 78
1950 22
1950 0
1950 -11

12. This is a reworking in Hive of the discussion in "Secondary Sort" on page 276.
A follow-on query (or a query that nested this query as a subquery, see "Subqueries"
on page 446) would be able to use the fact that each year's temperatures were
grouped and sorted (in descending order) in the same file.

If the columns for SORT BY and DISTRIBUTE BY are the same, you can use CLUSTER BY as
a shorthand for specifying both.
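The effect of combining DISTRIBUTE BY and SORT BY can be imitated in plain Python. In
this sketch, rows are assigned to a reducer partition by hashing the year, and each
partition is then sorted independently. The function name, the number of reducers, and
the use of Python's built-in hash are assumptions for illustration; Hive's own
partitioning function may assign years to reducers differently.

```python
def distribute_and_sort(rows, num_reducers):
    """Imitate DISTRIBUTE BY year SORT BY year ASC, temperature DESC.

    Each row is a (year, temperature) tuple. Returns one sorted list
    per (simulated) reducer.
    """
    partitions = [[] for _ in range(num_reducers)]
    for year, temperature in rows:
        # All rows with the same year land in the same partition.
        partitions[hash(year) % num_reducers].append((year, temperature))
    for partition in partitions:
        # Sort within the partition only: year ascending, temperature descending.
        partition.sort(key=lambda row: (row[0], -row[1]))
    return partitions


rows = [(1950, 0), (1949, 111), (1950, 22), (1949, 78), (1950, -11)]
for index, partition in enumerate(distribute_and_sort(rows, num_reducers=2)):
    print(index, partition)
```

Each partition is sorted, but there is no ordering between partitions, which is the
difference from a total sort with ORDER BY.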
MapReduce Scripts
Using an approach like Hadoop Streaming, the TRANSFORM, MAP, and REDUCE clauses make
it possible to invoke an external script or program from Hive. Suppose we want to use
a script to filter out rows that don't meet some condition, such as the script in
Example 12-1, which removes poor quality readings.

Example 12-1. Python script to filter out poor quality weather records
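#!/usr/bin/env python
import re
import sys

# Emit only readings that are present and have a good quality code.
for line in sys.stdin:
    (year, temp, q) = line.strip().split()
    if (temp != "9999" and re.match("[01459]", q)):
        # Parenthesized print works under both Python 2 and Python 3.
        print("%s\t%s" % (year, temp))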
We can use the script as follows:
hive> ADD FILE /path/to/is_good_quality.py;
hive> FROM records2
> SELECT TRANSFORM(year, temperature, quality)
> USING 'is_good_quality.py'
> AS year, temperature;
1949 111
1949 78
1950 0
1950 22
1950 -11
Before running the query, we need to register the script with Hive. This is so Hive knows
to ship the file to the Hadoop cluster (see "Distributed Cache" on page 288).

The query itself streams the year, temperature, and quality fields as a tab-separated
line to the is_good_quality.py script, and parses the tab-separated output into year and
temperature fields to form the output of the query.

This example has no reducers. If we use a nested form for the query, we can specify a
map and a reduce function. This time we use the MAP and REDUCE keywords, but SELECT
TRANSFORM in both cases would have the same result. The source for the
max_temperature_reduce.py script is shown in Example 2-11:
FROM (
FROM records2
MAP year, temperature, quality
USING 'is_good_quality.py'
AS year, temperature) map_output
REDUCE year, temperature
USING 'max_temperature_reduce.py'
AS year, temperature;
Joins
One of the nice things about using Hive, rather than raw MapReduce, is that it makes
performing commonly used operations very simple. Join operations are a case in point,
given how involved they are to implement in MapReduce ("Joins" on page 281).
Inner joins
The simplest kind of join is the inner join, where each match in the input tables results
in a row in the output. Consider two small demonstration tables: sales, which lists the
names of people and the ID of the item they bought; and things, which lists the item
ID and its name:
hive> SELECT * FROM sales;
Joe 2
Hank 4
Ali 0
Eve 3
Hank 2
hive> SELECT * FROM things;
2 Tie
4 Coat
3 Hat
1 Scarf
We can perform an inner join on the two tables as follows:
hive> SELECT sales.*, things.*
> FROM sales JOIN things ON (sales.id = things.id);
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat
The table in the FROM clause (sales) is joined with the table in the JOIN clause (things),
using the predicate in the ON clause. Hive only supports equijoins, which means that
only equality can be used in the join predicate, which here matches on the id column
in both tables.

Some databases, such as MySQL and Oracle, allow you to list the join
tables in the FROM clause and specify the join condition in the WHERE clause
of a SELECT statement. However, this syntax is not supported in Hive, so
the following fails with a parse error:
SELECT sales.*, things.*
FROM sales, things
WHERE sales.id = things.id;
Hive only allows a single table in the FROM clause, and joins must follow
the SQL-92 JOIN clause syntax.

In Hive, you can join on multiple columns in the join predicate by specifying a series
of expressions, separated by AND keywords. You can also join more than two tables by
supplying additional JOIN...ON... clauses in the query. Hive is intelligent about trying
to minimize the number of MapReduce jobs to perform the joins.

A single join is implemented as a single MapReduce job, but multiple joins can be
performed in less than one MapReduce job per join if the same column is used in the
join condition.[13] You can see how many MapReduce jobs Hive will use for any
particular query by prefixing it with the EXPLAIN keyword:
EXPLAIN
SELECT sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);
The EXPLAIN output includes many details about the execution plan for the query,
including the abstract syntax tree, the dependency graph for the stages that Hive will
execute, and information about each stage. Stages may be MapReduce jobs or
operations such as file moves. For even more detail, prefix the query with EXPLAIN EXTENDED.

Hive currently uses a rule-based query optimizer for determining how to execute a
query, but it's likely that in the future a cost-based optimizer will be added.
Outer joins
Outei joins allow you to linu nonmatches in the taLles Leing joineu. In the cuiient
example, when we peiloimeu an innei join, the iow loi Ali uiu not appeai in the output,
since the ID ol the item she puichaseu was not piesent in the things taLle. Il we change
the join type to LEFT OUTER JOIN, then the gueiy will ietuin a iow loi eveiy iow in the
13. The oiuei ol the taLles in the JOIN clauses is signilicant: it`s geneially Lest to have the laigest taLle last,
Lut see https://cwi|i.apachc.org/conj|ucncc/disp|ay/Hivc/LanguagcManua|-joins loi moie uetails,
incluuing how to give hints to the Hive plannei.
444 | Chapter 12: Hive
lelt taLle (sales), even il theie is no coiiesponuing iow in the taLle it is Leing joineu to
(things):
hive> SELECT sales.*, things.*
> FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
Ali 0 NULL NULL
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat
Notice that the row for Ali is now returned, and the columns from the things table are
NULL, since there is no match.

Hive supports right outer joins, which reverse the roles of the tables relative to the left
join. In this case, all items from the things table are included, even those that weren't
purchased by anyone (a scarf):
hive> SELECT sales.*, things.*
> FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);
NULL NULL 1 Scarf
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat
Finally, there is a full outer join, where the output has a row for each row from both
tables in the join:
hive> SELECT sales.*, things.*
> FROM sales FULL OUTER JOIN things ON (sales.id = things.id);
Ali 0 NULL NULL
NULL NULL 1 Scarf
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat
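The four join types seen so far differ only in what happens to rows without a match. The
following Python sketch reproduces the results above on the same demonstration data,
using a naive nested loop. The function name equijoin and its keep_left and keep_right
flags are made up for this example, and the order of the rows it returns is not the
order in which Hive prints them.

```python
sales = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]
things = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]


def equijoin(left, right, keep_left=False, keep_right=False):
    """Join (name, id) rows with (id, name) rows on equal ids.

    keep_left / keep_right retain unmatched rows from that side,
    padded with None (NULL), as the outer joins do.
    """
    rows = []
    matched_right = set()
    for name, sale_id in left:
        matches = [thing for thing in right if thing[0] == sale_id]
        for thing in matches:
            rows.append((name, sale_id) + thing)
            matched_right.add(thing)
        if not matches and keep_left:
            rows.append((name, sale_id, None, None))
    if keep_right:
        for thing in right:
            if thing not in matched_right:
                rows.append((None, None) + thing)
    return rows


inner = equijoin(sales, things)
left_outer = equijoin(sales, things, keep_left=True)
right_outer = equijoin(sales, things, keep_right=True)
full_outer = equijoin(sales, things, keep_left=True, keep_right=True)
for result in (inner, left_outer, right_outer, full_outer):
    print(result)
```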
Semi joins
Hive doesn't support IN subqueries (at the time of writing), but you can use a LEFT SEMI
JOIN to do the same thing.

Consider this IN subquery, which finds all the items in the things table that are in the
sales table:
SELECT *
FROM things
WHERE things.id IN (SELECT id from sales);
We can rewrite it as follows:
hive> SELECT *
> FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
2 Tie
3 Hat
4 Coat
There is a restriction that we must observe for LEFT SEMI JOIN queries: the right table
(sales) may only appear in the ON clause. It cannot be referenced in a SELECT expression,
for example.
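A semi join can be thought of as a membership test: each row of the left table is emitted
at most once, if its key appears anywhere in the right table. The Python sketch below
shows this on the demonstration data; the function name and the key-extractor
parameters are assumptions made for the example.

```python
sales = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]
things = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]


def left_semi_join(left, right, left_key, right_key):
    """Return the left rows whose key occurs in the right table."""
    right_keys = {right_key(row) for row in right}
    # Only left-table columns are available in the output.
    return [row for row in left if left_key(row) in right_keys]


result = left_semi_join(things, sales,
                        left_key=lambda thing: thing[0],
                        right_key=lambda sale: sale[1])
print(result)
```

Unlike an inner join, the tie appears only once here even though two people bought one,
and since only left-table columns are returned, the restriction on referencing the right
table in the SELECT list follows naturally.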
Map joins
If one table is small enough to fit in memory, then Hive can load the smaller table into
memory to perform the join in each of the mappers. The syntax for specifying a map
join is a hint embedded in an SQL C-style comment:
SELECT /*+ MAPJOIN(things) */ sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);
The job to execute this query has no reducers, so this query would not work for a
RIGHT or FULL OUTER JOIN, since absence of matching can only be detected in an
aggregating (reduce) step across all the inputs.
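The idea behind a map join can be sketched as a hash join: the small table is loaded into
an in-memory lookup table once, and the large table is streamed past it row by row. The
Python below is a minimal illustration under that assumption, with a made-up function
name; it does not reflect how Hive distributes the small table to the mappers.

```python
sales = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]
things = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]


def map_join(big_rows, small_rows):
    """Inner-join streamed (name, id) rows against an in-memory table."""
    # Build the lookup from the small table once, keyed on id.
    lookup = {}
    for thing in small_rows:
        lookup.setdefault(thing[0], []).append(thing)
    # Stream the big table; each row is handled independently of the
    # others, so no aggregating (reduce) step is needed.
    for name, sale_id in big_rows:
        for thing in lookup.get(sale_id, []):
            yield (name, sale_id) + thing


print(list(map_join(sales, things)))
```

Because each streamed row is processed in isolation, nothing in this scheme can notice
that the scarf was never matched, which is why right and full outer joins need more than
a map-only job.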
Map joins can take advantage of bucketed tables ("Buckets" on page 430), since a
mapper working on a bucket of the left table only needs to load the corresponding
buckets of the right table to perform the join. The syntax for the join is the same as for
the in-memory case above; however, you also need to enable the optimization with:
SET hive.optimize.bucketmapjoin=true;
Subqueries
A subquery is a SELECT statement that is embedded in another SQL statement. Hive has
limited support for subqueries, only permitting a subquery in the FROM clause of a
SELECT statement.

Other databases allow subqueries almost anywhere that an expression
is valid, such as in the list of values to retrieve from a SELECT statement
or in the WHERE clause. Many uses of subqueries can be rewritten as joins,
so if you find yourself writing a subquery where Hive does not support
it, then see if it can be expressed as a join. For example, an IN subquery
can be written as a semi join, or an inner join (see "Joins" on page 443).

The following query finds the mean maximum temperature for every year and weather
station:
SELECT station, year, AVG(max_temperature)
FROM (
SELECT station, year, MAX(temperature) AS max_temperature
FROM records2
WHERE temperature != 9999
AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9)
GROUP BY station, year
) mt
GROUP BY station, year;
The subquery is used to find the maximum temperature for each station/year
combination, then the outer query uses the AVG aggregate function to find the average of the
maximum temperature readings for each station/year combination.

The outer query accesses the results of the subquery like it does a table, which is why
the subquery must be given an alias (mt). The columns of the subquery have to be given
unique names so that the outer query can refer to them.
Views
A view is a sort of "virtual table" that is defined by a SELECT statement. Views can be
used to present data to users in a different way to the way it is actually stored on disk.
Often, the data from existing tables is simplified or aggregated in a particular way that
makes it convenient for further processing. Views may also be used to restrict users'
access to particular subsets of tables that they are authorized to see.

In Hive, a view is not materialized to disk when it is created; rather, the view's SELECT
statement is executed when the statement that refers to the view is run. If a view
performs extensive transformations on the base tables, or is used frequently, then you may
choose to manually materialize it by creating a new table that stores the contents of the
view (see "CREATE TABLE...AS SELECT" on page 440).

We can use views to rework the query from the previous section for finding the mean
maximum temperature for every year and weather station. First, let's create a view for
valid records, that is, records that have a particular quality value:
CREATE VIEW valid_records
AS
SELECT *
FROM records2
WHERE temperature != 9999
AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9);
When we create a view, the query is not run; it is simply stored in the metastore. Views
are included in the output of the SHOW TABLES command, and you can see more details
about a particular view, including the query used to define it, by issuing the DESCRIBE
EXTENDED view_name command.

Next, let's create a second view of maximum temperatures for each station and year.
It is based on the valid_records view:
CREATE VIEW max_temperatures (station, year, max_temperature)
AS
SELECT station, year, MAX(temperature)
FROM valid_records
GROUP BY station, year;
In this view definition, we list the column names explicitly. We do this since the
maximum temperature column is an aggregate expression, and otherwise Hive would create
a column alias for us (such as _c2). We could equally well have used an AS clause in the
SELECT to name the column.

With the views in place, we can now use them by running a query:
SELECT station, year, AVG(max_temperature)
FROM max_temperatures
GROUP BY station, year;
The result of the query is the same as running the one that uses a subquery, and, in
particular, the number of MapReduce jobs that Hive creates is the same for both: two
in each case, one for each GROUP BY. This example shows that Hive can combine a query
on a view into a sequence of jobs that is equivalent to writing the query without using
a view. In other words, Hive won't needlessly materialize a view even at execution time.

Views in Hive are read-only, so there is no way to load or insert data into an underlying
base table via a view.
User-Defined Functions
Sometimes the query you want to write can't be expressed easily (or at all) using the
built-in functions that Hive provides. Hive makes it easy to plug in your own processing
code by writing a user-defined function (UDF) and invoking it from a Hive query.

UDFs have to be written in Java, the language that Hive itself is written in. For other
languages, consider using a SELECT TRANSFORM query, which allows you to stream data
through a user-defined script ("MapReduce Scripts" on page 442).

There are three types of UDF in Hive: (regular) UDFs, UDAFs (user-defined aggregate
functions), and UDTFs (user-defined table-generating functions). They differ in the
numbers of rows that they accept as input and produce as output:
• A UDF operates on a single row and produces a single row as its output. Most
  functions, such as mathematical functions and string functions, are of this type.
• A UDAF works on multiple input rows and creates a single output row. Aggregate
  functions include such functions as COUNT and MAX.
• A UDTF operates on a single row and produces multiple rows (a table) as output.

Table-generating functions are less well known than the other two types, so let's look
at an example. Consider a table with a single column, x, which contains arrays of strings.
It's instructive to take a slight detour to see how the table is defined and populated:
CREATE TABLE arrays (x ARRAY<STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002';
Notice that the ROW FORMAT clause specifies that the entries in the array are delimited by
Control-B characters. The example file that we are going to load has the following
contents, where ^B is a representation of the Control-B character to make it suitable for
printing:
a^Bb
c^Bd^Be
After running a LOAD DATA command, the following query confirms that the data was
loaded correctly:
hive> SELECT * FROM arrays;
["a","b"]
["c","d","e"]
Next, we can use the explode UDTF to transform this table. This function emits a row
for each entry in the array, so in this case the type of the output column y is STRING.
The result is that the table is flattened into five rows:
hive> SELECT explode(x) AS y FROM arrays;
a
b
c
d
e
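The behavior of explode on this table is easy to mimic in Python, which may help to fix
the idea of a function that turns one input row into several output rows. The helper names
load_arrays and explode below are illustrative only; the sketch parses the same
Control-B-delimited lines and then flattens the resulting arrays.

```python
ITEM_DELIM = "\x02"  # Control-B, the collection item delimiter


def load_arrays(lines):
    """Parse each line of the example file into an array of strings."""
    return [line.rstrip("\n").split(ITEM_DELIM) for line in lines]


def explode(arrays):
    """Emit one output row per array entry, like a table-generating function."""
    for array in arrays:
        for item in array:
            yield item


arrays = load_arrays(["a\x02b\n", "c\x02d\x02e\n"])
print(arrays)
print(list(explode(arrays)))
```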
SELECT statements using UDTFs have some restrictions (such as not being able to
retrieve additional column expressions), which make them less useful in practice. For
this reason, Hive supports LATERAL VIEW queries, which are more powerful. LATERAL
VIEW queries are not covered here, but you may find out more about them at
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView.
Writing a UDF
To illustrate the process of writing and using a UDF, we'll write a simple UDF to trim
characters from the ends of strings. Hive already has a built-in function called trim, so
we'll call ours strip. The code for the Strip Java class is shown in Example 12-2.

Example 12-2. A UDF for stripping characters from the ends of strings
package com.hadoopbook.hive;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Strip extends UDF {
  private Text result = new Text();

  public Text evaluate(Text str) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString()));
    return result;
  }

  public Text evaluate(Text str, String stripChars) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString(), stripChars));
    return result;
  }
}
A UDF must satisfy the following two properties:
1. A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF.
2. A UDF must implement at least one evaluate() method.

The evaluate() method is not defined by an interface since it may take an arbitrary
number of arguments, of arbitrary types, and it may return a value of arbitrary type.
Hive introspects the UDF to find the evaluate() method that matches the Hive function
that was invoked.

The Strip class has two evaluate() methods. The first strips leading and trailing
whitespace from the input, while the second can strip any of a set of supplied characters from
the ends of the string. The actual string processing is delegated to the StringUtils class
from the Apache Commons project, which makes the only noteworthy part of the code
the use of Text from the Hadoop Writable library. Hive actually supports Java
primitives in UDFs (and a few other types like java.util.List and java.util.Map), so a
signature like:
public String evaluate(String str)
would work equally well. However, by using Text, we can take advantage of object
reuse, which can bring efficiency savings, and so is to be preferred in general.

To use the UDF in Hive, we need to package the compiled Java class in a JAR file (you
can do this by typing ant hive with the book's example code) and register the file with
Hive:
ADD JAR /path/to/hive-examples.jar;
We also need to create an alias for the Java classname:
CREATE TEMPORARY FUNCTION strip AS 'com.hadoopbook.hive.Strip';
The TEMPORARY keyword here highlights the fact that UDFs are only defined for the
duration of the Hive session (they are not persisted in the metastore). In practice, this
means you need to add the JAR file, and define the function at the beginning of each
script or session.
As an alternative to calling ADD JAR, you can specify, at launch time,
a path where Hive looks for auxiliary JAR files to put on its classpath
(including the MapReduce classpath). This technique is useful for
automatically adding your own library of UDFs every time you run Hive.
There are two ways of specifying the path, either passing the
--auxpath option to the hive command:
% hive --auxpath /path/to/hive-examples.jar
or by setting the HIVE_AUX_JARS_PATH environment variable before
invoking Hive. The auxiliary path may be a comma-separated list of JAR
file paths or a directory containing JAR files.
The UDF is now ready to be used, just like a built-in function:
hive> SELECT strip(' bee ') FROM dummy;
bee
hive> SELECT strip('banana', 'ab') FROM dummy;
nan
Notice that the UDF's name is not case-sensitive:
hive> SELECT STRIP(' bee ') FROM dummy;
bee
Writing a UDAF
An aggregate function is more difficult to write than a regular UDF, since values are
aggregated in chunks (potentially across many Map or Reduce tasks), so the
implementation has to be capable of combining partial aggregations into a final result. The
code to achieve this is best explained by example, so let's look at the implementation
of a simple UDAF for calculating the maximum of a collection of integers
(Example 12-3).
Example 12-3. A UDAF for calculating the maximum of a collection of integers
package com.hadoopbook.hive;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.io.IntWritable;

public class Maximum extends UDAF {

  public static class MaximumIntUDAFEvaluator implements UDAFEvaluator {

    private IntWritable result;

    public void init() {
      result = null;
    }

    public boolean iterate(IntWritable value) {
      if (value == null) {
        return true;
      }
      if (result == null) {
        result = new IntWritable(value.get());
      } else {
        result.set(Math.max(result.get(), value.get()));
      }
      return true;
    }

    public IntWritable terminatePartial() {
      return result;
    }

    public boolean merge(IntWritable other) {
      return iterate(other);
    }

    public IntWritable terminate() {
      return result;
    }
  }
}
The class structure is slightly different to the one for UDFs. A UDAF must be a subclass
of org.apache.hadoop.hive.ql.exec.UDAF (note the "A" in UDAF) and contain one or
more nested static classes implementing org.apache.hadoop.hive.ql.exec.UDAFEvaluator.
In this example, there is a single nested class, MaximumIntUDAFEvaluator, but we
could add more evaluators such as MaximumLongUDAFEvaluator, MaximumFloatUDAFEvaluator,
and so on, to provide overloaded forms of the UDAF for finding the maximum
of a collection of longs, floats, and so on.
An evaluator must implement five methods, described in turn below (the flow is
illustrated in Figure 12-4):
init()
The init() method initializes the evaluator and resets its internal state. In
MaximumIntUDAFEvaluator, we set the IntWritable object holding the final result to
null. We use null to indicate that no values have been aggregated yet, which has
the desirable effect of making the maximum value of an empty set NULL.
iterate()
The iterate() method is called every time there is a new value to be aggregated.
The evaluator should update its internal state with the result of performing the
aggregation. The arguments that iterate() takes correspond to those in the Hive
function from which it was called. In this example, there is only one argument.
The value is first checked to see if it is null, and if it is, it is ignored. Otherwise,
the result instance variable is set to value's integer value (if this is the first value
that has been seen), or set to the larger of the current result and value (if one or
more values have already been seen). We return true to indicate that the input
value was valid.
terminatePartial()
The terminatePartial() method is called when Hive wants a result for the partial
aggregation. The method must return an object that encapsulates the state of the
aggregation. In this case, an IntWritable suffices, since it encapsulates either the
maximum value seen or null if no values have been processed.
merge()
The merge() method is called when Hive decides to combine one partial aggregation
with another. The method takes a single object whose type must correspond
to the return type of the terminatePartial() method. In this example, the
merge() method can simply delegate to the iterate() method, because the partial
aggregation is represented in the same way as a value being aggregated. This is not
generally the case (and we'll see a more general example later), and the method
should implement the logic to combine the evaluator's state with the state of the
partial aggregation.
Figure 12-4. Data flow with partial results for a UDAF
terminate()
The terminate() method is called when the final result of the aggregation is needed.
The evaluator should return its state as a value. In this case, we return the result
instance variable.
Let's exercise our new function:
hive> CREATE TEMPORARY FUNCTION maximum AS 'com.hadoopbook.hive.Maximum';
hive> SELECT maximum(temperature) FROM records;
110
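The order in which the five methods are called can be simulated outside Hive. The following is a minimal sketch in plain Java with no Hive dependency: MaxEvaluatorSketch is a hypothetical stand-in for MaximumIntUDAFEvaluator that uses Integer in place of IntWritable, and the way the input is cut into chunks is an assumption made for illustration, not a description of how Hive schedules work.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for MaximumIntUDAFEvaluator; Integer replaces IntWritable.
final class MaxEvaluatorSketch {
  private Integer result;

  void init() {
    result = null; // null means "no values aggregated yet"
  }

  boolean iterate(Integer value) {
    if (value == null) {
      return true; // null inputs are ignored
    }
    if (result == null) {
      result = value;
    } else {
      result = Math.max(result, value);
    }
    return true;
  }

  Integer terminatePartial() {
    return result;
  }

  boolean merge(Integer other) {
    return iterate(other); // a partial maximum looks just like an input value
  }

  Integer terminate() {
    return result;
  }

  // Aggregates one chunk of input, as a single task might.
  static Integer partialMax(List<Integer> chunk) {
    MaxEvaluatorSketch evaluator = new MaxEvaluatorSketch();
    evaluator.init();
    for (Integer value : chunk) {
      evaluator.iterate(value);
    }
    return evaluator.terminatePartial();
  }

  // Combines the partial results from several chunks into the final answer.
  static Integer combine(List<Integer> partials) {
    MaxEvaluatorSketch evaluator = new MaxEvaluatorSketch();
    evaluator.init();
    for (Integer partial : partials) {
      evaluator.merge(partial);
    }
    return evaluator.terminate();
  }

  public static void main(String[] args) {
    Integer p1 = partialMax(Arrays.asList(3, null, 110));
    Integer p2 = partialMax(Arrays.asList(-5, 42));
    Integer p3 = partialMax(Collections.<Integer>emptyList()); // stays null
    System.out.println(combine(Arrays.asList(p1, p2, p3)));    // prints 110
  }
}
```

Note that the empty chunk contributes a null partial result, which merge() ignores in the same way that iterate() ignores a null input.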
A more complex UDAF
The previous example is unusual in that a partial aggregation can be represented using
the same type (IntWritable) as the final result. This is not generally the case for more
complex aggregate functions, as can be seen by considering a UDAF for calculating the
mean (average) of a collection of double values. It's not mathematically possible to
combine partial means into a final mean value (see "Combiner Functions"
on page 34). Instead, we can represent the partial aggregation as a pair of
numbers: the cumulative sum of the double values processed so far, and the number of
values.
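A small worked case makes the point concrete. The sketch below is plain Java, independent of Hive; the class and method names are hypothetical and chosen only for this illustration. It averages two chunks of unequal size both ways: by taking the mean of the two partial means, and by merging (sum, count) pairs.

```java
// Hypothetical illustration: why partial means cannot simply be averaged.
final class PartialMeanSketch {

  // The partial aggregation: a running sum and a count.
  static final class Partial {
    double sum;
    long count;

    void add(double value) {
      sum += value;
      count++;
    }

    void merge(Partial other) {
      sum += other.sum;     // pairwise addition of sums
      count += other.count; // and of counts
    }

    double mean() {
      return sum / count;
    }
  }

  static Partial of(double... values) {
    Partial partial = new Partial();
    for (double value : values) {
      partial.add(value);
    }
    return partial;
  }

  public static void main(String[] args) {
    Partial a = of(1.0, 2.0, 3.0); // mean 2.0
    Partial b = of(10.0);          // mean 10.0
    double meanOfMeans = (a.mean() + b.mean()) / 2;
    a.merge(b);
    // prints "6.0 vs 4.0": only the merged pair gives the true mean, 16 / 4
    System.out.println(meanOfMeans + " vs " + a.mean());
  }
}
```

The mean of means is wrong whenever the chunks differ in size, because it weights each chunk equally rather than each value.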
This idea is implemented in the UDAF shown in Example 12-4. Notice that the partial
aggregation is implemented as a "struct" nested static class, called PartialResult,
which Hive is intelligent enough to serialize and deserialize, since we are using field
types that Hive can handle (Java primitives in this case).
In this example, the merge() method is different to iterate(), since it combines the
partial sums and partial counts, by pairwise addition. Also, the return type of
terminatePartial() is PartialResult (which of course is never seen by the user calling the
function), while the return type of terminate() is DoubleWritable, the final result seen
by the user.
Example 12-4. A UDAF for calculating the mean of a collection of doubles
package com.hadoopbook.hive;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;

public class Mean extends UDAF {

  public static class MeanDoubleUDAFEvaluator implements UDAFEvaluator {

    public static class PartialResult {
      double sum;
      long count;
    }

    private PartialResult partial;

    public void init() {
      partial = null;
    }

    public boolean iterate(DoubleWritable value) {
      if (value == null) {
        return true;
      }
      if (partial == null) {
        partial = new PartialResult();
      }
      partial.sum += value.get();
      partial.count++;
      return true;
    }

    public PartialResult terminatePartial() {
      return partial;
    }

    public boolean merge(PartialResult other) {
      if (other == null) {
        return true;
      }
      if (partial == null) {
        partial = new PartialResult();
      }
      partial.sum += other.sum;
      partial.count += other.count;
      return true;
    }

    public DoubleWritable terminate() {
      if (partial == null) {
        return null;
      }
      return new DoubleWritable(partial.sum / partial.count);
    }
  }
}
CHAPTER 13
HBase
Jonathan Gray and Michael Stack
HBasics
HBase is a distributed column-oriented database built on top of HDFS. HBase is the
Hadoop application to use when you require real-time read/write random-access to
very large datasets.
Although there are countless strategies and implementations for database storage and
retrieval, most solutions (especially those of the relational variety) are not built with
very large scale and distribution in mind. Many vendors offer replication and
partitioning solutions to grow the database beyond the confines of a single node, but these
add-ons are generally an afterthought and are complicated to install and maintain. They
also come at some severe compromise to the RDBMS feature set. Joins, complex
queries, triggers, views, and foreign-key constraints become prohibitively expensive to run
on a scaled RDBMS or do not work at all.
HBase comes at the scaling problem from the opposite direction. It is built from the
ground-up to scale linearly just by adding nodes. HBase is not relational and does not
support SQL, but given the proper problem space, it is able to do what an RDBMS
cannot: host very large, sparsely populated tables on clusters made from commodity
hardware.
The canonical HBase use case is the webtable, a table of crawled web pages and their
attributes (such as language and MIME type) keyed by the web page URL. The webtable
is large, with row counts that run into the billions. Batch analytic and parsing
MapReduce jobs are continuously run against the webtable deriving statistics and
adding new columns of verified MIME type and parsed text content for later indexing
by a search engine. Concurrently, the table is randomly accessed by crawlers running
at various rates updating random rows while random web pages are served in real time
as users click on a website's cached-page feature.
Backdrop
The HBase project was started toward the end of 2006 by Chad Walters and Jim
Kellerman at Powerset. It was modeled after Google's "Bigtable: A Distributed Storage
System for Structured Data" by Chang et al. (http://labs.google.com/papers/bigtable.html),
which had just been published. In February 2007, Mike Cafarella made a code
drop of a mostly working system that Jim Kellerman then carried forward.
The first HBase release was bundled as part of Hadoop 0.15.0 in October 2007. In May
2010, HBase graduated from a Hadoop subproject to become an Apache Top Level
Project. Production users of HBase include Adobe, StumbleUpon, Twitter, and groups
at Yahoo!.
Concepts
In this section, we provide a quick overview of core HBase concepts. At a minimum, a
passing familiarity will ease the digestion of all that follows.[1]
Whirlwind Tour of the Data Model
Applications store data into labeled tables. Tables are made of rows and columns. Table
cells (the intersection of row and column coordinates) are versioned. By default, their
version is a timestamp auto-assigned by HBase at the time of cell insertion. A cell's
content is an uninterpreted array of bytes.
Table row keys are also byte arrays, so theoretically anything can serve as a row key
from strings to binary representations of long or even serialized data structures. Table
rows are sorted by row key, the table's primary key. The sort is byte-ordered. All table
accesses are via the table primary key.[2]
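Byte ordering has a practical consequence for row key design that is easy to miss. The following is a minimal sketch in plain Java, not HBase code: the comparator is our own illustrative implementation of unsigned lexicographic byte comparison, and the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Hypothetical illustration of byte-ordered row keys.
final class ByteOrderSketch {

  // Lexicographic comparison that treats each byte as unsigned (0..255).
  static final Comparator<byte[]> BYTE_ORDER = (a, b) -> {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int diff = (a[i] & 0xff) - (b[i] & 0xff);
      if (diff != 0) {
        return diff;
      }
    }
    return a.length - b.length; // a shorter key sorts before its extensions
  };

  // Returns the given keys in the order a byte-ordered table would store them.
  static List<String> sortedKeys(String... keys) {
    TreeMap<byte[], String> rows = new TreeMap<>(BYTE_ORDER);
    for (String key : keys) {
      rows.put(key.getBytes(StandardCharsets.UTF_8), key);
    }
    return new ArrayList<>(rows.values());
  }

  public static void main(String[] args) {
    System.out.println(sortedKeys("row2", "row10", "row1"));
    // prints [row1, row10, row2]
  }
}
```

Because row10 sorts before row2, keys that embed numbers are usually zero-padded (row01, row02, row10) or stored in a fixed-width binary form when numeric order matters.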
Row columns are grouped into column families. All column family members have a
common prefix, so, for example, the columns temperature:air and
temperature:dew_point are both members of the temperature column family, whereas
station:identifier belongs to the station family.[3] The column family prefix must be
composed of printable characters. The qualifying tail, the column family qualifier, can be
made of any arbitrary bytes.
1. For more detail than is provided here, see the HBase Architecture page on the HBase wiki.
2. As of this writing, there are at least two projects up on github that add secondary indices to HBase.
3. In HBase, by convention, the colon character (:) delimits the column family from the column family
qualifier. It is hardcoded.
A table's column families must be specified up front as part of the table schema
definition, but new column family members can be added on demand. For example, a new
column station:address can be offered by a client as part of an update, and its value
persisted, as long as the column family station is already in existence on the targeted
table.
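The rule can be modeled in a few lines. This is a toy sketch in plain Java, not the HBase API: ColumnFamilySketch and its methods are hypothetical names, and a real table stores versioned byte arrays rather than strings.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical toy model: families are fixed by the schema, qualifiers are not.
final class ColumnFamilySketch {
  private final Set<String> families;
  private final TreeMap<String, String> cells = new TreeMap<>();

  ColumnFamilySketch(String... declaredFamilies) {
    this.families = new HashSet<>(Arrays.asList(declaredFamilies));
  }

  // The family is everything before the first colon.
  static String familyOf(String column) {
    int colon = column.indexOf(':');
    return colon < 0 ? column : column.substring(0, colon);
  }

  // Accepts any qualifier, but only within a family the schema already declares.
  boolean put(String row, String column, String value) {
    if (!families.contains(familyOf(column))) {
      return false;
    }
    cells.put(row + "/" + column, value);
    return true;
  }

  public static void main(String[] args) {
    ColumnFamilySketch table = new ColumnFamilySketch("station");
    System.out.println(table.put("row1", "station:identifier", "011990")); // true
    System.out.println(table.put("row1", "station:address", "1 Main St")); // true
    System.out.println(table.put("row1", "temperature:air", "22"));        // false
  }
}
```

The second put succeeds even though station:address was never declared, while the third fails because the temperature family is not part of this table's schema.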
Physically, all column family members are stored together on the filesystem. So, though
earlier we described HBase as a column-oriented store, it would be more accurate if it
were described as a column-family-oriented store. Because tunings and storage
specifications are done at the column family level, it is advised that all column family
members have the same general access pattern and size characteristics.
In synopsis, HBase tables are like those in an RDBMS, only cells are versioned, rows
are sorted, and columns can be added on the fly by the client as long as the column
family they belong to preexists.
Regions
Tables are automatically partitioned horizontally by HBase into regions. Each region
comprises a subset of a table's rows. A region is denoted by the table it belongs to, its
first row, inclusive, and last row, exclusive. Initially, a table comprises a single region,
but as the size of the region grows, after it crosses a configurable size threshold, it splits
at a row boundary into two new regions of approximately equal size. Until this first
split happens, all loading will be against the single server hosting the original region.
As the table grows, the number of its regions grows. Regions are the units that get
distributed over an HBase cluster. In this way, a table that is too big for any one server
can be carried by a cluster of servers with each node hosting a subset of the table's total
regions. This is also the means by which the loading on a table gets distributed. The
online set of sorted regions comprises the table's total content.
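One way to picture regions is as a sorted map from each region's first row to the server hosting it. The sketch below is a simplified model in plain Java, not how HBase represents regions internally: the class name, the server names, and the use of String keys (whose ordering matches byte ordering only for ASCII) are all assumptions made for illustration.

```java
import java.util.TreeMap;

// Hypothetical model of a table's regions, keyed by inclusive start row.
final class RegionMapSketch {
  private final TreeMap<String, String> regions = new TreeMap<>();

  // A new table is a single region starting at the empty key.
  RegionMapSketch(String firstServer) {
    regions.put("", firstServer);
  }

  // Splits the region containing splitRow; the upper daughter starts at splitRow.
  void split(String splitRow, String serverForUpperDaughter) {
    regions.put(splitRow, serverForUpperDaughter);
  }

  // The hosting region is the one with the greatest start row <= the requested row.
  String serverFor(String row) {
    return regions.floorEntry(row).getValue();
  }

  int regionCount() {
    return regions.size();
  }

  public static void main(String[] args) {
    RegionMapSketch table = new RegionMapSketch("server-a");
    System.out.println(table.serverFor("zebra")); // server-a: one region holds every row
    table.split("m", "server-b");
    System.out.println(table.serverFor("apple")); // server-a
    System.out.println(table.serverFor("m"));     // server-b: the start row is inclusive
    System.out.println(table.serverFor("zebra")); // server-b
  }
}
```

Before the split every request lands on server-a, which is the single-server loading the text describes.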
Locking
Row updates are atomic, no matter how many row columns constitute the row-level
transaction. This keeps the locking model simple.
Implementation
Just as HDFS and MapReduce are built of clients, slaves, and a coordinating master
(namenode and datanodes in HDFS and jobtracker and tasktrackers in MapReduce), so
is HBase modeled with an HBase master node orchestrating a cluster of one or more
regionserver slaves (see Figure 13-1). The HBase master is responsible for bootstrapping
a virgin install, for assigning regions to registered regionservers, and for recovering
regionserver failures. The master node is lightly loaded. The regionservers carry zero
or more regions and field client read/write requests. They also manage region splits,
informing the HBase master about the new daughter regions for it to manage the
offlining of parent region and assignment of the replacement daughters.
Figure 13-1. HBase cluster members
HBase depends on ZooKeeper (Chapter 14) and by default it manages a ZooKeeper
instance as the authority on cluster state. The ZooKeeper ensemble hosts vitals such as
the location of the root catalog table and the address of the current cluster Master.
Assignment of regions is mediated via ZooKeeper in case participating servers crash
mid-assignment. Hosting the assignment transaction state in ZooKeeper makes it so
recovery can pick up on the assignment at where the crashed server left off. At a minimum,
when bootstrapping a client connection to an HBase cluster, the client must be passed the
location of the ZooKeeper ensemble. Thereafter, the client navigates the ZooKeeper
hierarchy to learn cluster attributes such as server locations.[4]
Regionserver slave nodes are listed in the HBase conf/regionservers file as you would
list datanodes and tasktrackers in the Hadoop conf/slaves file. Start and stop scripts are
like those in Hadoop using the same SSH-based running of remote commands
mechanism. Cluster site-specific configuration is made in the HBase conf/hbase-site.xml and
conf/hbase-env.sh files, which have the same format as that of their equivalents up in
the Hadoop parent project (see Chapter 9).
4. HBase can be configured to use an existing ZooKeeper cluster instead.
Where there is commonality to be found, HBase directly uses or
subclasses the parent Hadoop implementation, whether a service or type.
When this is not possible, HBase will follow the Hadoop model where
it can. For example, HBase uses the Hadoop Configuration system so
configuration files have the same format. What this means for you, the
user, is that you can leverage any Hadoop familiarity in your
exploration of HBase. HBase deviates from this rule only when adding its
specializations.
HBase persists data via the Hadoop filesystem API. Since there are multiple
implementations of the filesystem interface (one for the local filesystem, one for the KFS
filesystem, Amazon's S3, and HDFS, the Hadoop Distributed Filesystem), HBase can
persist to any of these implementations. Most experience though has been had using
HDFS, though by default, unless told otherwise, HBase writes to the local filesystem.
The local filesystem is fine for experimenting with your initial HBase install, but
thereafter, usually the first configuration made in an HBase cluster involves pointing HBase
at the HDFS cluster to use.
HBase in operation
HBase, internally, keeps special catalog tables named -ROOT- and .META. within which
it maintains the current list, state, and location of all regions afloat on the cluster. The
-ROOT- table holds the list of .META. table regions. The .META. table holds the list of all
user-space regions. Entries in these tables are keyed by region name, where a region
name is made of the table name the region belongs to, the region's start row, its time
of creation, and finally, an MD5 hash of all of the former (i.e., a hash of tablename,
start row, and creation timestamp).[5] Row keys, as noted previously, are sorted so
finding the region that hosts a particular row is a matter of a lookup to find the first entry
whose key is greater than or equal to that of the requested row key. As regions
transition (are split, disabled/enabled, deleted, redeployed by the region load
balancer, or redeployed due to a regionserver crash), the catalog tables are updated so the
state of all regions on the cluster is kept current.
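To make the shape of a region name concrete, here is a rough sketch in plain Java that assembles one from its parts. It is not HBase's own code: the class and method names are hypothetical, and exactly which bytes HBase feeds to MD5 is an assumption here, so the hash this produces should not be expected to match a real region name.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of how a region name is put together.
final class RegionNameSketch {

  static String regionName(String table, String startRow, long created) {
    // Commas delimit table name, start row, and creation timestamp.
    String prefix = table + "," + startRow + "," + created;
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      // Assumption: the hash covers the UTF-8 bytes of the prefix.
      byte[] digest = md5.digest(prefix.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : digest) {
        hex.append(String.format("%02x", b & 0xff));
      }
      return prefix + "." + hex + "."; // the name ends in a period
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(regionName("TestTable", "xyz", 1279729913622L));
  }
}
```

Since the name begins with the table name and start row, catalog entries for one table sort together and in start-row order, which is what makes the lookup described above possible.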
Fresh clients connect to the ZooKeeper cluster first to learn the location of -ROOT-.
Clients consult -ROOT- to elicit the location of the .META. region whose scope covers
that of the requested row. The client then does a lookup against the found .META. region
to figure the hosting user-space region and its location. Thereafter, the client interacts
directly with the hosting regionserver.
To save on having to make three round-trips per row operation, clients cache all they
learn traversing -ROOT- and .META., caching locations as well as user-space region start
and stop rows so they can figure hosting regions themselves without having to go back
to the .META. table. Clients continue to use the cached entry as they work until there is
a fault. When this happens (the region has moved), the client consults the .META.
again to learn the new location. If, in turn, the consulted .META. region has moved, then
-ROOT- is reconsulted.
5. Here is an example region name from the table TestTable whose start row is xyz:
TestTable,xyz,1279729913622.1b6e176fb8d8aa88fd4ab6bc80247ece. A comma delimits table
name, start row, and timestamp. The name always ends in a period.
Writes arriving at a regionserver are first appended to a commit log and then are added
to an in-memory memstore. When a memstore fills, its content is flushed to the
filesystem.
The commit log is hosted on HDFS, so it remains available through a regionserver crash.
When the master notices that a regionserver is no longer reachable, usually because the
server's znode has expired in ZooKeeper, it splits the dead regionserver's commit log
by region. On reassignment, regions that were on the dead regionserver, before they
open for business, will pick up their just-split file of not yet persisted edits and replay
them to bring themselves up-to-date with the state they had just before the failure.
When reading, the region's memstore is consulted first. If sufficient versions are found
reading memstore alone, the query completes there. Otherwise, flush files are consulted in
order, from newest to oldest, until versions sufficient to satisfy the query are found, or
until we run out of flush files.
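The write and read paths just described can be sketched as a toy store in plain Java. This is a deliberately simplified model, not HBase code: the class name is hypothetical, the memstore "fills" after a fixed number of rows rather than at a byte size, flush files are kept in memory, and each row holds a single version.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.TreeMap;

// Hypothetical toy model of a region's commit log, memstore, and flush files.
final class MemstoreSketch {
  private final int flushThreshold;
  private final List<String> commitLog = new ArrayList<>();
  private TreeMap<String, String> memstore = new TreeMap<>();
  // Flush files, newest first.
  private final Deque<TreeMap<String, String>> flushFiles = new ArrayDeque<>();

  MemstoreSketch(int flushThreshold) {
    this.flushThreshold = flushThreshold;
  }

  void put(String row, String value) {
    commitLog.add(row + "=" + value); // 1. append to the commit log
    memstore.put(row, value);         // 2. add to the in-memory memstore
    if (memstore.size() >= flushThreshold) {
      flushFiles.addFirst(memstore);  // 3. flush a full memstore as a new file
      memstore = new TreeMap<>();
    }
  }

  String get(String row) {
    String value = memstore.get(row); // the memstore is consulted first
    if (value != null) {
      return value;
    }
    for (TreeMap<String, String> file : flushFiles) { // then files, newest to oldest
      value = file.get(row);
      if (value != null) {
        return value;
      }
    }
    return null; // ran out of flush files
  }

  int flushFileCount() {
    return flushFiles.size();
  }

  public static void main(String[] args) {
    MemstoreSketch region = new MemstoreSketch(2);
    region.put("a", "1");
    region.put("b", "2"); // memstore is full: flushed
    region.put("a", "3"); // newer value for a, still in the memstore
    System.out.println(region.get("a") + " " + region.get("b")); // prints "3 2"
  }
}
```

Reading newest first is what lets the newer value of a shadow the older one sitting in a flush file.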
A background process compacts flush files once their number has reached a threshold,
rewriting many files as one, because the fewer files a read consults, the more performant
it will be. On compaction, versions beyond the schema configured maximum, deletes,
and expired cells are cleaned out. A separate process running in the regionserver
monitors flush file sizes, splitting the region when they grow in excess of the configured
maximum.
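As a rough illustration of what compaction accomplishes, the sketch below merges several flush files into one in plain Java. It is a simplification under stated assumptions rather than HBase's algorithm: the class name and the TOMBSTONE delete marker are hypothetical, only one version per row is kept, and cell expiry is not modeled.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy compaction: many flush files are rewritten as one.
final class CompactionSketch {
  static final String TOMBSTONE = "<deleted>"; // hypothetical delete marker

  // Builds a small "flush file" from alternating row/value arguments.
  static Map<String, String> mapOf(String... rowsAndValues) {
    Map<String, String> file = new TreeMap<>();
    for (int i = 0; i + 1 < rowsAndValues.length; i += 2) {
      file.put(rowsAndValues[i], rowsAndValues[i + 1]);
    }
    return file;
  }

  // Merges files given newest first; the newest value for each row wins.
  static TreeMap<String, String> compact(List<Map<String, String>> filesNewestFirst) {
    TreeMap<String, String> merged = new TreeMap<>();
    for (Map<String, String> file : filesNewestFirst) {
      for (Map.Entry<String, String> cell : file.entrySet()) {
        merged.putIfAbsent(cell.getKey(), cell.getValue());
      }
    }
    merged.values().removeIf(TOMBSTONE::equals); // deletes are cleaned out
    return merged;
  }

  public static void main(String[] args) {
    Map<String, String> newer = mapOf("row1", TOMBSTONE, "row2", "v2");
    Map<String, String> older = mapOf("row1", "v1", "row2", "old", "row3", "v3");
    System.out.println(compact(Arrays.asList(newer, older)));
    // prints {row2=v2, row3=v3}
  }
}
```

After compaction a read for any row consults one file instead of two, and the deleted row1 and the superseded value of row2 no longer occupy space.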
Installation
Download a stable release from an Apache Download Mirror and unpack it on your
local filesystem. For example:
% tar xzf hbase-x.y.z.tar.gz
As with Hadoop, you first need to tell HBase where Java is located on your system. If
you have the JAVA_HOME environment variable set to point to a suitable Java installation,
then that will be used, and you don't have to configure anything further. Otherwise,
you can set the Java installation that HBase uses by editing HBase's conf/hbase-env.sh,
and specifying the JAVA_HOME variable (see Appendix A for some examples) to
point to version 1.6.0 of Java.
HBase, just like Hadoop, requires Java 6.
For convenience, add the HBase binary directory to your command-line path. For
example:
% export HBASE_HOME=/home/hbase/hbase-x.y.z
% export PATH=$PATH:$HBASE_HOME/bin
To get the list of HBase options, type:
% hbase
Usage: hbase <command>
where <command> is one of:
  shell            run the HBase shell
  master           run an HBase HMaster node
  regionserver     run an HBase HRegionServer node
  zookeeper        run a Zookeeper server
  rest             run an HBase REST server
  thrift           run an HBase Thrift server
  avro             run an HBase Avro server
  migrate          upgrade an hbase.rootdir
  hbck             run the hbase 'fsck' tool
 or
  CLASSNAME        run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
Test Drive
To start a temporary instance of HBase that uses the /tmp directory on the local
filesystem for persistence, type:
% start-hbase.sh
This will launch a standalone HBase instance that persists to the local filesystem; by
default, HBase will write to /tmp/hbase-${USERID}.[6]
To administer your HBase instance, launch the HBase shell by typing:
% hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version: 0.89.0-SNAPSHOT, ra4ea1a9a7b074a2e5b7b24f761302d4ea28ed1b2, Sun Jul 18 15:01:50 PDT 2010
hbase(main):001:0>
This will bring up a JRuby IRB interpreter that has had some HBase-specific commands
added to it. Type help and then RETURN to see the list of shell commands grouped
into categories. Type help COMMAND_GROUP for help by category or help COMMAND for help
on a specific command and example usage. Commands use Ruby formatting to specify
lists and dictionaries. See the end of the main help screen for a quick tutorial.
Now let us create a simple table, add some data, and then clean up.
To create a table, you must name your table and define its schema. A table's schema
comprises table attributes and the list of table column families. Column families
themselves have attributes that you in turn set at schema definition time. Examples of
column family attributes include whether the family content should be compressed on
the filesystem and how many versions of a cell to keep. Schemas can be later edited by
offlining the table using the shell disable command, making the necessary alterations
using alter, then putting the table back online with enable.
6. In standalone mode, HBase master, regionserver, and a ZooKeeper instance are all run in the same JVM.
To create a table named test with a single column family named data using defaults for
table and column family attributes, enter:
hbase(main):007:0> create 'test', 'data'
0 row(s) in 1.3066 seconds
If the previous command does not complete successfully, and the shell
displays an error and a stack trace, your install was not successful. Check
the master logs under the HBase logs directory (the default location for
the logs directory is ${HBASE_HOME}/logs) for a clue as to where
things went awry.
See the help output for examples of adding table and column family attributes when
specifying a schema.
To prove the new table was created successfully, run the list command. This will
output all tables in user space:
hbase(main):019:0> list
test
1 row(s) in 0.1485 seconds
To insert data into three different rows and columns in the data column family, and
then list the table content, do the following:
hbase(main):021:0> put 'test', 'row1', 'data:1', 'value1'
0 row(s) in 0.0454 seconds
hbase(main):022:0> put 'test', 'row2', 'data:2', 'value2'
0 row(s) in 0.0035 seconds
hbase(main):023:0> put 'test', 'row3', 'data:3', 'value3'
0 row(s) in 0.0090 seconds
hbase(main):024:0> scan 'test'
ROW COLUMN+CELL
row1 column=data:1, timestamp=1240148026198, value=value1
row2 column=data:2, timestamp=1240148040035, value=value2
row3 column=data:3, timestamp=1240148047497, value=value3
3 row(s) in 0.0825 seconds
Notice how we added three new columns without changing the schema.
To remove the table, you must first disable it before dropping it:
hbase(main):025:0> disable 'test'
09/04/19 06:40:13 INFO client.HBaseAdmin: Disabled test
0 row(s) in 6.0426 seconds
hbase(main):026:0> drop 'test'
09/04/19 06:40:17 INFO client.HBaseAdmin: Deleted test
0 row(s) in 0.0210 seconds
hbase(main):027:0> list
0 row(s) in 2.0645 seconds
Shut down your HBase instance by running:
% stop-hbase.sh
To learn how to set up a distributed HBase and point it at a running HDFS, see the
"Getting Started" section of the HBase documentation.
Clients
There are a number of client options for interacting with an HBase cluster.
Java
HBase, like Hadoop, is written in Java. Example 13-1 shows how you would do in Java
the shell operations listed previously at "Test Drive" on page 463.
Example 13-1. Basic table administration and access
public class ExampleClient {

  public static void main(String[] args) throws IOException {
    Configuration config = HBaseConfiguration.create();

    // Create table
    HBaseAdmin admin = new HBaseAdmin(config);
    HTableDescriptor htd = new HTableDescriptor("test");
    HColumnDescriptor hcd = new HColumnDescriptor("data");
    htd.addFamily(hcd);
    admin.createTable(htd);
    byte [] tablename = htd.getName();
    HTableDescriptor [] tables = admin.listTables();
    // Fail unless the new table is the only table on the cluster
    if (tables.length != 1 || !Bytes.equals(tablename, tables[0].getName())) {
      throw new IOException("Failed create of table");
    }

    // Run some operations -- a put, a get, and a scan -- against the table.
    HTable table = new HTable(config, tablename);
    byte [] row1 = Bytes.toBytes("row1");
    Put p1 = new Put(row1);
    byte [] databytes = Bytes.toBytes("data");
    p1.add(databytes, Bytes.toBytes("1"), Bytes.toBytes("value1"));
    table.put(p1);
    Get g = new Get(row1);
    Result result = table.get(g);
    System.out.println("Get: " + result);
    Scan scan = new Scan();
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result scannerResult: scanner) {
        System.out.println("Scan: " + scannerResult);
      }
    } finally {
      scanner.close();
    }

    // Drop the table
    admin.disableTable(tablename);
    admin.deleteTable(tablename);
  }
}
This class has a main method only. For the sake of brevity, we do not include package
name nor imports. In this class, we first create an instance of
org.apache.hadoop.conf.Configuration. We ask the org.apache.hadoop.hbase.HBaseConfiguration
class to create the instance. It will return a Configuration that has read
HBase configuration from hbase-site.xml and hbase-default.xml files found on the
program's classpath. This Configuration is subsequently used to create instances of
HBaseAdmin and HTable, two classes found in the org.apache.hadoop.hbase.client Java
package. HBaseAdmin is used for administering your HBase cluster, for adding and
dropping tables. HTable is used to access a specific table. The Configuration instance points
these classes at the cluster the code is to work against.
To create a table, we need to first create an instance of HBaseAdmin and then ask it to
create the table named test with a single column family named data. In our example,
our table schema is the default. Use methods on org.apache.hadoop.hbase.HTableDescriptor
and org.apache.hadoop.hbase.HColumnDescriptor to change the table schema.
The code next asserts the table was actually created and then it moves to run operations
against the just-created table.
Operating on a table, we will need an instance of org.apache.hadoop.hbase.client.HTable,
passing it our Configuration instance and the name of the table we want to
operate on. After creating an HTable, we then create an instance of
org.apache.hadoop.hbase.client.Put to put a single cell value of value1 into a row
named row1 on the column named data:1. (The column name is specified in two parts:
the column family name as bytes, databytes in the code above, and then the column
family qualifier specified as Bytes.toBytes("1").) Next we create an
org.apache.hadoop.hbase.client.Get, do a get of the just-added cell, and then use an
org.apache.hadoop.hbase.client.Scan to scan over the just-created table,
printing out what we find.
Finally, we clean up by first disabling the table and then deleting it. A table must be
disabled before it can be dropped.
MapReduce
HBase classes and utilities in the org.apache.hadoop.hbase.mapreduce package facilitate
using HBase as a source and/or sink in MapReduce jobs. The TableInputFormat class
makes splits on region boundaries so maps are handed a single region to work on. The
TableOutputFormat will write the result of reduce into HBase. The RowCounter class in
Example 13-2 can be found in the HBase mapreduce package. It runs a map task to count
rows using TableInputFormat.
Example 13-2. A MapReduce application to count the number of rows in an HBase table
public class RowCounter {
  /** Name of this 'program'. */
  static final String NAME = "rowcounter";

  static class RowCounterMapper
      extends TableMapper<ImmutableBytesWritable, Result> {
    /** Counter enumeration to count the actual rows. */
    public static enum Counters {ROWS}

    @Override
    public void map(ImmutableBytesWritable row, Result values,
        Context context)
        throws IOException {
      for (KeyValue value: values.list()) {
        if (value.getValue().length > 0) {
          context.getCounter(Counters.ROWS).increment(1);
          break;
        }
      }
    }
  }

  public static Job createSubmittableJob(Configuration conf, String[] args)
      throws IOException {
    String tableName = args[0];
    Job job = new Job(conf, NAME + "_" + tableName);
    job.setJarByClass(RowCounter.class);
    // Columns are space delimited
    StringBuilder sb = new StringBuilder();
    final int columnoffset = 1;
    for (int i = columnoffset; i < args.length; i++) {
      if (i > columnoffset) {
        sb.append(" ");
      }
      sb.append(args[i]);
    }
    Scan scan = new Scan();
    scan.setFilter(new FirstKeyOnlyFilter());
    if (sb.length() > 0) {
      for (String columnName : sb.toString().split(" ")) {
        String [] fields = columnName.split(":");
        if (fields.length == 1) {
          scan.addFamily(Bytes.toBytes(fields[0]));
        } else {
          scan.addColumn(Bytes.toBytes(fields[0]), Bytes.toBytes(fields[1]));
        }
      }
    }
    // Second argument is the table name.
    job.setOutputFormatClass(NullOutputFormat.class);
    TableMapReduceUtil.initTableMapperJob(tableName, scan,
        RowCounterMapper.class, ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);
    return job;
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 1) {
      System.err.println("ERROR: Wrong number of parameters: " + args.length);
      System.err.println("Usage: RowCounter <tablename> [<column1> <column2>...]");
      System.exit(-1);
    }
    Job job = createSubmittableJob(conf, otherArgs);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
This class uses GenericOptionsParser, which is discussed in "GenericOptionsParser,
Tool, and ToolRunner" on page 151, for parsing command-line arguments. The
RowCounterMapper inner class extends the HBase TableMapper abstract class, a specialization
of org.apache.hadoop.mapreduce.Mapper that sets the map input types passed by
TableInputFormat. The createSubmittableJob() method parses arguments added to the
configuration that were passed on the command line to specify the table and columns
we are to run RowCounter against. The column names are used to configure an instance
of org.apache.hadoop.hbase.client.Scan, a scan object that will be passed through to
TableInputFormat and used to constrain what our Mapper sees. Notice how we set a
filter, an instance of org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter, on the
scan. This filter instructs the server to short-circuit when running server-side, doing no
more than verify that a row has an entry before returning. This speeds up the row count. The
createSubmittableJob() method also invokes the TableMapReduceUtil.initTableMapperJob()
utility method, which, among other things such as setting the map class to use,
sets the input format to TableInputFormat. The map is simple. It checks for empty
values. If empty, it doesn't count the row. Otherwise, it increments Counters.ROWS by
one.
Avro, REST, and Thrift
HBase ships with Avro, REST, and Thrift interfaces. These are useful when the
interacting application is written in a language other than Java. In all cases, a Java server
hosts an instance of the HBase client brokering application Avro, REST, and Thrift
requests in and out of the HBase cluster. This extra work proxying requests and
responses means these interfaces are slower than using the Java client directly.
REST
To put up a stargate instance (stargate is the name for the HBase REST service), start
it using the following command:
% hbase-daemon.sh start rest
This will start a server instance, by default on port 8080, background it, and catch any
emissions by the server in logfiles under the HBase logs directory.
Clients can ask for the response to be formatted as JSON, Google's protobufs, or as
XML, depending on how the client HTTP Accept header is set. See the REST wiki
page for documentation and examples of making REST client requests.
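As a rough illustration of the Accept header mechanism, the following JDK-only sketch builds (but does not send) an HTTP request that asks for JSON. The class name RestAcceptSketch is hypothetical, and the host, port, and /version path are assumptions made for illustration; consult the REST documentation for the actual resource paths.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper: builds a GET request whose Accept header selects the response format. */
final class RestAcceptSketch {

  /** Returns an unsent GET request for baseUrl + path with the given Accept value. */
  static HttpRequest build(String baseUrl, String path, String accept) {
    return HttpRequest.newBuilder(URI.create(baseUrl + path))
        .header("Accept", accept)
        .GET()
        .build();
  }

  public static void main(String[] args) {
    // Host, port, and path are assumed values for a locally running REST server.
    HttpRequest json = build("http://localhost:8080", "/version", "application/json");
    HttpRequest xml = build("http://localhost:8080", "/version", "text/xml");
    System.out.println(json.uri() + " " + json.headers().map());
    System.out.println(xml.uri() + " " + xml.headers().map());
  }
}
```

Only the Accept value differs between the two requests; the resource being asked for is the same.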
To stop the REST server, type:
% hbase-daemon.sh stop rest
Thrift
Similarly, start a Thrift service by putting up a server to field Thrift clients by running
the following:
% hbase-daemon.sh start thrift
This will start the server instance, by default on port 9090, background it, and catch
any emissions by the server in logfiles under the HBase logs directory. The HBase Thrift
documentation (see footnote 7) notes the Thrift version used to generate the classes. The
HBase Thrift IDL can be found at
src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift in the
HBase source code.
To stop the Thrift server, type:
% hbase-daemon.sh stop thrift
Avro
The Avro server is started and stopped in the same manner as you'd start and stop the
Thrift or REST services. The Avro server by default uses port 9090 (the same as the
Thrift server, although you wouldn't normally run both).
Example
Although HDFS and MapReduce are powerful tools for processing batch operations
over large datasets, they do not provide ways to read or write individual records
efficiently. In this example, we'll explore using HBase as the tool to fill this gap.
The existing weather dataset described in previous chapters contains observations for
tens of thousands of stations over 100 years and this data is growing without bound.
7. http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/thrift/package-summary.html
In this example, we will build a simple web interface that allows a user to navigate the
different stations and page through their historical temperature observations in time
order. For the sake of this example, let us allow that the dataset is massive, that the
observations run to the billions, and that the rate at which temperature updates arrive
is significant (say, hundreds to thousands of updates a second from around the world
across the whole range of weather stations). Also, let us allow that it is a requirement
that the web application must display the most up-to-date observation within a second
or so of receipt.
The first size requirement should preclude our use of a simple RDBMS instance and
make HBase a candidate store. The second latency requirement rules out plain HDFS.
A MapReduce job could build initial indices that allowed random-access over all of the
observation data, but keeping up this index as the updates arrived is not what HDFS
and MapReduce are good at.
Schemas
In our example, there will be two tables:
Stations
This table holds station data. Let the row key be the stationid. Let this table have
a column family info that acts as a key/val dictionary for station information. Let
the dictionary keys be the column names info:name, info:location, and
info:description. This table is static and the info family, in this case, closely
mirrors a typical RDBMS table design.
Observations
This table holds temperature observations. Let the row key be a composite key of
stationid followed by a reverse-order timestamp. Give this table a column family data
that will contain one column airtemp with the observed temperature as the column value.
Our choice of schema is derived from how we want to most efficiently read from HBase.
Rows and columns are stored in increasing lexicographical order. Though there are
facilities for secondary indexing and regular expression matching, they come at a
performance penalty. It is vital that you understand how you want to most efficiently query
your data in order to most effectively store and access it.
For the stations table, the choice of stationid as key is obvious because we will always
access information for a particular station by its id. The observations table, however,
uses a composite key that adds the observation timestamp at the end. This will group
all observations for a particular station together, and by using a reverse order timestamp
(Long.MAX_VALUE - epoch) and storing it as binary, observations for each station will be
ordered with most recent observation first.
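To see why the reverse order timestamp puts the newest observation first, here is a minimal JDK-only sketch. It uses java.nio.ByteBuffer and an unsigned byte comparison as stand-ins for HBase's own utilities and row ordering; the class ReverseKeySketch, its method names, and the sample station ID are all hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.TreeMap;

/** JDK-only sketch of the composite row key; not the chapter's RowKeyConverter. */
final class ReverseKeySketch {

  /** Station ID bytes followed by (Long.MAX_VALUE - epochMillis) as 8 big-endian bytes. */
  static byte[] makeKey(String stationId, long epochMillis) {
    byte[] id = stationId.getBytes(StandardCharsets.UTF_8);
    return ByteBuffer.allocate(id.length + Long.BYTES)
        .put(id)
        .putLong(Long.MAX_VALUE - epochMillis)
        .array();
  }

  /** Recovers the original timestamp from the trailing 8 bytes of a key. */
  static long timestampOf(byte[] key) {
    long reversed = ByteBuffer.wrap(key, key.length - Long.BYTES, Long.BYTES).getLong();
    return Long.MAX_VALUE - reversed;
  }

  /** Returns the timestamps in the order a lexicographically sorted store would hold them. */
  static long[] storedOrder(String stationId, long... epochMillis) {
    // Keys are compared byte by byte, treating each byte as unsigned.
    Comparator<byte[]> unsignedLexicographic = Arrays::compareUnsigned;
    TreeMap<byte[], Long> rows = new TreeMap<>(unsignedLexicographic);
    for (long t : epochMillis) {
      rows.put(makeKey(stationId, t), t);
    }
    return rows.keySet().stream().mapToLong(ReverseKeySketch::timestampOf).toArray();
  }

  public static void main(String[] args) {
    // Observations inserted out of order come back newest first.
    System.out.println(Arrays.toString(storedOrder("011990-99999", 1000L, 3000L, 2000L)));
  }
}
```

Because the larger timestamp yields the smaller reversed value, the most recent observation sorts to the front of each station's block of rows, so reading the first N rows for a station returns its N newest observations.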
In the shell, you would define your tables as follows:
hbase(main):036:0> create 'stations', {NAME => 'info', VERSIONS => 1}
0 row(s) in 0.1304 seconds
hbase(main):037:0> create 'observations', {NAME => 'data', VERSIONS => 1}
0 row(s) in 0.1332 seconds
In both cases, we are interested only in the latest version of a table cell, so set VERSIONS to
1. The default is 3.
Loading Data
There are a relatively small number of stations, so their static data is easily inserted
using any of the available interfaces.
However, let's assume that there are billions of individual observations to be loaded.
This kind of import is normally an extremely complex and long-running database
operation, but MapReduce and HBase's distribution model allow us to make full use of
the cluster. Copy the raw input data onto HDFS and then run a MapReduce job that
can read the input and write to HBase.
Example 13-3 shows an example MapReduce job that imports observations to HBase
from the same input file used in the previous chapters' examples.
Example 13-3. A MapReduce application to import temperature data from HDFS into an HBase table
public class HBaseTemperatureImporter extends Configured implements Tool {

// Inner-class for map
static class HBaseTemperatureMapper<K, V> extends MapReduceBase implements
Mapper<LongWritable, Text, K, V> {
private NcdcRecordParser parser = new NcdcRecordParser();
private HTable table;
public void map(LongWritable key, Text value,
OutputCollector<K, V> output, Reporter reporter)
throws IOException {
parser.parse(value.toString());
if (parser.isValidTemperature()) {
byte[] rowKey = RowKeyConverter.makeObservationRowKey(parser.getStationId(),
parser.getObservationDate().getTime());
Put p = new Put(rowKey);
p.add(HBaseTemperatureCli.DATA_COLUMNFAMILY,
HBaseTemperatureCli.AIRTEMP_QUALIFIER,
Bytes.toBytes(parser.getAirTemperature()));
table.put(p);
}
}
public void configure(JobConf jc) {
super.configure(jc);
// Create the HBase table client once up-front and keep it around
// rather than create on each map invocation.
try {
this.table = new HTable(new HBaseConfiguration(jc), "observations");
} catch (IOException e) {
throw new RuntimeException("Failed HTable construction", e);
}
}
@Override
public void close() throws IOException {
super.close();
table.close();
}
}
public int run(String[] args) throws IOException {
if (args.length != 1) {
System.err.println("Usage: HBaseTemperatureImporter <input>");
return -1;
}
JobConf jc = new JobConf(getConf(), getClass());
FileInputFormat.addInputPath(jc, new Path(args[0]));
jc.setMapperClass(HBaseTemperatureMapper.class);
jc.setNumReduceTasks(0);
jc.setOutputFormat(NullOutputFormat.class);
JobClient.runJob(jc);
return 0;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new HBaseConfiguration(),
new HBaseTemperatureImporter(), args);
System.exit(exitCode);
}
}
HBaseTemperatureImporter has an inner class named HBaseTemperatureMapper that is like
the MaxTemperatureMapper class from Chapter 5. The outer class implements Tool and
does the setup to launch the HBaseTemperatureMapper inner class. HBaseTemperatureMapper
takes the same input as MaxTemperatureMapper and does the same parse (using the
NcdcRecordParser introduced in Chapter 5) to check for valid temperatures, but rather
than add valid temperatures to the output collector as MaxTemperatureMapper does,
instead it adds valid temperatures to the observations HBase table into the
data:airtemp column. (We are using static defines for data and airtemp imported from
the HBaseTemperatureCli class described later below.) In the configure() method, we create an
HTable instance once against the observations table and use it afterward in map
invocations talking to HBase. Finally, we call close on our HTable instance to flush out any
write buffers not yet cleared.
The row key used is created in the makeObservationRowKey() method on
RowKeyConverter from the station ID and observation time:
public class RowKeyConverter {
private static final int STATION_ID_LENGTH = 12;
/**
* @return A row key whose format is: <station_id> <reverse_order_epoch>
*/
public static byte[] makeObservationRowKey(String stationId,
long observationTime) {
byte[] row = new byte[STATION_ID_LENGTH + Bytes.SIZEOF_LONG];
Bytes.putBytes(row, 0, Bytes.toBytes(stationId), 0, STATION_ID_LENGTH);
long reverseOrderEpoch = Long.MAX_VALUE - observationTime;
Bytes.putLong(row, STATION_ID_LENGTH, reverseOrderEpoch);
return row;
}
}
The conversion takes advantage of the fact that the station ID is a fixed-length string.
The Bytes class used in makeObservationRowKey() is from the HBase utility package. It
includes methods for converting between byte arrays and common Java and Hadoop
types. In makeObservationRowKey(), the Bytes.putLong() method is used to fill the key
byte array. The Bytes.SIZEOF_LONG constant is used for sizing and positioning in the
row key array.
We can run the program with the following:
% hbase HBaseTemperatureImporter input/ncdc/all
Optimization notes
Watch for the phenomenon where an import walks in lock-step through the table
with all clients in concert pounding one of the table's regions (and thus, a single
node), then moving on to the next, and so on, rather than evenly distributing the
load over all regions. This is usually brought on by some interaction between sorted
input and how the splitter works. Randomizing the ordering of your row keys prior
to insertion may help. In our example, given the distribution of stationid values
and how TextInputFormat makes splits, the upload should be sufficiently
distributed (see footnote 8).
Only obtain one HTable instance per task. There is a cost to instantiating an
HTable, so if you do this for each insert, you may have a negative impact on
performance, hence our setup of HTable in the configure() step.
8. If a table is new, it will have only one region and initially all updates will be to this single region until it
splits. This will happen even if row keys are randomly distributed. This startup phenomenon means
uploads run slow at first until there are sufficient regions distributed so all cluster members are able to
participate in the upload. Do not confuse this phenomenon with that noted here.
By default, each HTable.put(put) actually performs the insert without any buffering.
You can disable the HTable auto-flush feature using HTable.setAutoFlush(false)
and then set the size of the configurable write buffer. When the inserts
committed fill the write buffer, it is then flushed. Remember though, you must call
a manual HTable.flushCommits(), or HTable.close(), which will call through to
HTable.flushCommits(), at the end of each task to ensure that nothing is left
unflushed in the buffer. You could do this in an override of the mapper's close()
method.
HBase includes TableInputFormat and TableOutputFormat to help with MapReduce
jobs that source and sink HBase (see Example 13-2). One way to write the previous
example would have been to use MaxTemperatureMapper from Chapter 5 as is but
add a reducer task that takes the output of the MaxTemperatureMapper and feeds it
to HBase via TableOutputFormat.
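The write-buffer behaviour described in the notes above (hold inserts client-side, flush when the buffer fills, flush once more on close) can be illustrated with a small JDK-only sketch. BufferedSinkSketch and its methods are hypothetical stand-ins, not HBase classes, and the buffer here counts records for simplicity, whereas HBase sizes its write buffer in bytes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a client with auto-flush disabled; not an HBase class. */
final class BufferedSinkSketch implements AutoCloseable {
  private final int capacity;
  private final List<String> buffer = new ArrayList<>();
  private final List<String> committed = new ArrayList<>(); // stands in for the server
  private int flushCount = 0;

  BufferedSinkSketch(int capacity) {
    this.capacity = capacity;
  }

  /** Buffers the record and flushes only when the buffer is full. */
  void put(String record) {
    buffer.add(record);
    if (buffer.size() >= capacity) {
      flush();
    }
  }

  /** Sends everything buffered so far in one batch. */
  void flush() {
    if (buffer.isEmpty()) {
      return;
    }
    committed.addAll(buffer);
    buffer.clear();
    flushCount++;
  }

  /** Closing flushes whatever is still buffered, so nothing is left behind. */
  @Override
  public void close() {
    flush();
  }

  int committedCount() { return committed.size(); }
  int pendingCount() { return buffer.size(); }
  int flushCount() { return flushCount; }

  public static void main(String[] args) {
    BufferedSinkSketch sink = new BufferedSinkSketch(3);
    for (int i = 0; i < 7; i++) {
      sink.put("row-" + i);
    }
    System.out.println("pending before close: " + sink.pendingCount());
    sink.close();
    System.out.println("committed after close: " + sink.committedCount());
  }
}
```

The record left pending before close() is the case the note warns about: without the final flush, it would never reach the table.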
Web Queries
To implement the web application, we will use the HBase Java API directly. Here it
becomes clear how important your choice of schema and storage format is.
The simplest query will be to get the static station information. This type of query is
simple in a traditional database, but HBase gives you additional control and flexibility.
Using the info family as a key/value dictionary (column names as keys, column values
as values), the code would look like this:
public Map<String, String> getStationInfo(HTable table, String stationId)
throws IOException {
Get get = new Get(Bytes.toBytes(stationId));
get.addFamily(INFO_COLUMNFAMILY);
Result res = table.get(get);
if (res == null) {
return null;
}
Map<String, String> resultMap = new HashMap<String, String>();
resultMap.put("name", getValue(res, INFO_COLUMNFAMILY, NAME_QUALIFIER));
resultMap.put("location", getValue(res, INFO_COLUMNFAMILY, LOCATION_QUALIFIER));
resultMap.put("description", getValue(res, INFO_COLUMNFAMILY,
DESCRIPTION_QUALIFIER));
return resultMap;
}
private static String getValue(Result res, byte [] cf, byte [] qualifier) {
byte [] value = res.getValue(cf, qualifier);
return value == null? "": Bytes.toString(value);
}
In this example, getStationInfo() takes an HTable instance and a station ID. To get the
station info, we use HTable.get() passing a Get instance configured to get all the column
values for the row identified by the station ID in the defined column family,
INFO_COLUMNFAMILY.
The get() results are returned in Result. It contains the row and you can fetch cell
values by stipulating the column cell wanted. The getStationInfo() method converts
the Result Map into a more friendly Map of String keys and values.
We can already see how there is a need for utility functions when using HBase. There
are an increasing number of abstractions being built atop HBase to deal with this
low-level interaction, but it's important to understand how this works and how storage
choices make a difference.
One of the strengths of HBase over a relational database is that you don't have to
prespecify the columns. So, in the future, if each station now has at least these three
attributes but there are hundreds of optional ones, we can just insert them without
modifying the schema. Your application's reading and writing code would of course
need to be changed. The example code might change in this case to looping through
Result rather than grabbing each value explicitly.
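A minimal sketch of that looping approach follows. In HBase, the qualifier-to-value map would typically come from a call such as Result.getFamilyMap(); here a plain java.util.Map of byte arrays stands in for it so that the sketch runs without a cluster. The class FamilyMapSketch, its method, and the sample qualifiers and values are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: converts whatever columns a row has into String keys and values. */
final class FamilyMapSketch {

  /** Walks every qualifier present instead of naming each expected column. */
  static Map<String, String> toStringMap(Map<byte[], byte[]> cells) {
    Map<String, String> out = new LinkedHashMap<>();
    for (Map.Entry<byte[], byte[]> cell : cells.entrySet()) {
      String qualifier = new String(cell.getKey(), StandardCharsets.UTF_8);
      byte[] value = cell.getValue();
      // Mirror getValue() in the listing above: a missing value becomes an empty string.
      out.put(qualifier, value == null ? "" : new String(value, StandardCharsets.UTF_8));
    }
    return out;
  }

  public static void main(String[] args) {
    Map<byte[], byte[]> cells = new LinkedHashMap<>();
    cells.put("name".getBytes(StandardCharsets.UTF_8),
        "Example Station".getBytes(StandardCharsets.UTF_8));
    // An optional attribute that only some stations carry (made-up qualifier).
    cells.put("elevation".getBytes(StandardCharsets.UTF_8),
        "120".getBytes(StandardCharsets.UTF_8));
    System.out.println(toStringMap(cells));
  }
}
```

With this shape, a station row that gains new optional columns needs no change to the reading code.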
We will make use of HBase scanners for retrieval of observations in our web
application.
Here we are after a Map<ObservationTime, ObservedTemp> result. We will use a
NavigableMap<Long, Integer> because it is sorted and has a descendingMap() method,
so we can access observations in both ascending or descending order. The code is in
Example 13-4.
Example 13-4. Methods for retrieving a range of rows of weather station observations from an HBase
table
public NavigableMap<Long, Integer> getStationObservations(HTable table,
String stationId, long maxStamp, int maxCount) throws IOException {
byte[] startRow = RowKeyConverter.makeObservationRowKey(stationId, maxStamp);
NavigableMap<Long, Integer> resultMap = new TreeMap<Long, Integer>();
Scan scan = new Scan(startRow);
scan.addColumn(DATA_COLUMNFAMILY, AIRTEMP_QUALIFIER);
ResultScanner scanner = table.getScanner(scan);
Result res = null;
int count = 0;
try {
while ((res = scanner.next()) != null && count++ < maxCount) {
byte[] row = res.getRow();
byte[] value = res.getValue(DATA_COLUMNFAMILY, AIRTEMP_QUALIFIER);
Long stamp = Long.MAX_VALUE -
Bytes.toLong(row, row.length - Bytes.SIZEOF_LONG, Bytes.SIZEOF_LONG);
Integer temp = Bytes.toInt(value);
resultMap.put(stamp, temp);
}
} finally {
scanner.close();
}
return resultMap;
}
/**
* Return the last ten observations.
*/
public NavigableMap<Long, Integer> getStationObservations(HTable table,
String stationId) throws IOException {
return getStationObservations(table, stationId, Long.MAX_VALUE, 10);
}
The getStationObservations() method takes a station ID and a range defined by
maxStamp and a maximum number of rows (maxCount). Note that although the scan returns
rows newest first, the NavigableMap that is returned is a TreeMap keyed on timestamp, so
iterating over it yields observations in ascending time order. If you want to read through
it newest first, you would make use of NavigableMap.descendingMap().
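The following JDK-only sketch shows how a TreeMap keyed on timestamp behaves regardless of the order in which entries are inserted. The class ObservationOrderSketch and the sample values are hypothetical; only TreeMap and NavigableMap come from the JDK.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical demo of the two views a NavigableMap of observations offers. */
final class ObservationOrderSketch {

  /** Collects (timestamp, temperature) pairs, whatever order they arrive in. */
  static NavigableMap<Long, Integer> collect(long[] stamps, int[] temps) {
    NavigableMap<Long, Integer> observations = new TreeMap<>();
    for (int i = 0; i < stamps.length; i++) {
      observations.put(stamps[i], temps[i]);
    }
    return observations;
  }

  public static void main(String[] args) {
    // Inserted newest first, as a scan over reverse-order row keys would deliver them.
    NavigableMap<Long, Integer> observations =
        collect(new long[] {3000L, 2000L, 1000L}, new int[] {30, 20, 10});
    System.out.println("natural order:   " + observations);
    System.out.println("descending view: " + observations.descendingMap());
  }
}
```

The descendingMap() call returns a view rather than a copy, so switching between oldest-first and newest-first presentation does not duplicate the data.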
Scanners
HBase scanners are like cursors in a traditional database or Java iterators, except
(unlike the latter) they have to be closed after use. Scanners return rows in order. Users
obtain a scanner on an HBase table by calling HTable.getScanner(scan) where the
scan parameter is a configured instance of a Scan object. In the Scan instance, you can
pass the row at which to start and stop the scan, which columns in a row to return in
the row result, and optionally, a filter to run on the server side (see footnote 9). The
ResultScanner interface, which is returned when you call HTable.getScanner(), is as follows:
public interface ResultScanner extends Closeable, Iterable<Result> {
public Result next() throws IOException;
public Result [] next(int nbRows) throws IOException;
public void close();
}
You can ask for the next row's results or a number of rows. Each invocation of
next() involves a trip back to the regionserver, so grabbing a bunch of rows at once can
make for significant performance savings (see footnote 10).
9. To learn more about the server-side filtering mechanism in HBase, see http://hadoop.apache.org/hbase/
docs/current/api/org/apache/hadoop/hbase/filter/package-summary.html.
10. The hbase.client.scanner.caching configuration option is set to 1 by default. You can also set how
much to cache/prefetch on the Scan instance itself. Scanners will, under the covers, fetch this many
results at a time, bringing them client side, and returning to the server to fetch the next batch only
after the current batch has been exhausted. Higher caching values will enable faster scanning but will
eat up more memory in the client. Also, avoid setting the caching so high that the time spent processing
the batch client-side exceeds the scanner lease period. If a client fails to check back with the server
before the scanner lease expires, the server will go ahead and garbage collect resources consumed by
the scanner server-side. The default scanner lease is 60 seconds, and can be changed by setting
hbase.regionserver.lease.period. Clients will see an UnknownScannerException if the scanner lease
has expired.
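As a rough illustration of why the caching setting in footnote 10 matters, the number of server round trips for a scan can be estimated as the row count divided by the caching value, rounded up. The class and method names below are hypothetical, and the estimate ignores the final call that discovers the end of the scan.

```java
/** Hypothetical back-of-the-envelope estimate; not part of the HBase API. */
final class ScanRoundTripSketch {

  /** Estimated fetches needed to read rows when each fetch returns up to caching rows. */
  static long roundTrips(long rows, int caching) {
    if (rows < 0 || caching < 1) {
      throw new IllegalArgumentException("rows must be >= 0 and caching >= 1");
    }
    // Integer ceiling of rows / caching.
    return (rows + caching - 1) / caching;
  }

  public static void main(String[] args) {
    long rows = 10_000;
    for (int caching : new int[] {1, 10, 100}) {
      System.out.println("caching=" + caching + " -> " + roundTrips(rows, caching) + " round trips");
    }
  }
}
```

The saving has to be weighed against client memory and the scanner lease period, as the footnote cautions.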
The advantage of storing things as Long.MAX_VALUE - stamp may not be clear in the
previous example. It has more use when you want to get the newest observations for a
given offset and limit, which is often the case in web applications. If the observations
were stored with the actual stamps, we would be able to get only the oldest observations
for a given offset and limit efficiently. Getting the newest would mean getting all of
them and then grabbing them off the end. One of the prime reasons for moving from
RDBMS to HBase is to allow for these types of early-out scenarios.
HBase Versus RDBMS
HBase and other column-oriented databases are often compared to more traditional
and popular relational databases or RDBMSs. Although they differ dramatically in their
implementations and in what they set out to accomplish, the fact that they are potential
solutions to the same problems means that despite their enormous differences, the
comparison is a fair one to make.
As described previously, HBase is a distributed, column-oriented data storage system.
It picks up where Hadoop left off by providing random reads and writes on top of
HDFS. It has been designed from the ground up with a focus on scale in every direction:
tall in numbers of rows (billions), wide in numbers of columns (millions), and to be
horizontally partitioned and replicated across thousands of commodity nodes
automatically. The table schemas mirror the physical storage, creating a system for efficient
data structure serialization, storage, and retrieval. The burden is on the application
developer to make use of this storage and retrieval in the right way.
Strictly speaking, an RDBMS is a database that follows Codd's 12 Rules. Typical
RDBMSs are fixed-schema, row-oriented databases with ACID properties and a
sophisticated SQL query engine. The emphasis is on strong consistency, referential
integrity, abstraction from the physical layer, and complex queries through the SQL
language. You can easily create secondary indexes, perform complex inner and outer joins,
count, sum, sort, group, and page your data across a number of tables, rows, and
columns.
For a majority of small- to medium-volume applications, there is no substitute for the
ease of use, flexibility, maturity, and powerful feature set of available open source
RDBMS solutions like MySQL and PostgreSQL. However, if you need to scale up in
terms of dataset size, read/write concurrency, or both, you'll soon find that the
conveniences of an RDBMS come at an enormous performance penalty and make
distribution inherently difficult. The scaling of an RDBMS usually involves breaking Codd's
rules, loosening ACID restrictions, forgetting conventional DBA wisdom, and on the
way losing most of the desirable properties that made relational databases so
convenient in the first place.
Successful Service
Here is a synopsis of how the typical RDBMS scaling story runs. The following list
presumes a successful growing service:
Initial public launch
Move from local workstation to shared, remote hosted MySQL instance with a
well-defined schema.
Service becomes more popular, too many reads hitting the database
Add memcached to cache common queries. Reads are now no longer strictly ACID;
cached data must expire.
Service continues to grow in popularity, too many writes hitting the database
Scale MySQL vertically by buying a beefed-up server with 16 cores, 128 GB of RAM,
and banks of 15 k RPM hard drives. Costly.
New features increase query complexity, now we have too many joins
Denormalize your data to reduce joins. (That's not what they taught me in DBA
school!)
Rising popularity swamps the server, things are too slow
Stop doing any server-side computations.
Some queries are still too slow
Periodically prematerialize the most complex queries, try to stop joining in most
cases.
Reads are OK, but writes are getting slower and slower
Drop secondary indexes and triggers (no indexes?).
At this point, there are no clear solutions for how to solve your scaling problems. In
any case, you'll need to begin to scale horizontally. You can attempt to build some type
of partitioning on your largest tables, or look into some of the commercial solutions
that provide multiple master capabilities.
Countless applications, businesses, and websites have successfully achieved scalable,
fault-tolerant, and distributed data systems built on top of RDBMSs and are likely using
many of the previous strategies. But what you end up with is something that is no longer
a true RDBMS, sacrificing features and conveniences for compromises and
complexities. Any form of slave replication or external caching introduces weak consistency into
your now denormalized data. The inefficiency of joins and secondary indexes means
almost all queries become primary key lookups. A multiwriter setup likely means no
real joins at all and distributed transactions are a nightmare. There's now an incredibly
complex network topology to manage with an entirely separate cluster for caching.
Even with this system and the compromises made, you will still worry about your
primary master crashing and the daunting possibility of having 10 times the data and
10 times the load in a few months.
HBase
Enter HBase, which has the following characteristics:
No real indexes
Rows are stored sequentially, as are the columns within each row. Therefore, no
issues with index bloat, and insert performance is independent of table size.
Automatic partitioning
As your tables grow, they will automatically be split into regions and distributed
across all available nodes.
Scale linearly and automatically with new nodes
Add a node, point it to the existing cluster, and run the regionserver. Regions will
automatically rebalance and load will spread evenly.
Commodity hardware
Clusters are built on $1,000 to $5,000 nodes rather than $50,000 nodes. RDBMSs
are I/O hungry, requiring more costly hardware.
Fault tolerance
Lots of nodes means each is relatively insignificant. No need to worry about
individual node downtime.
Batch processing
MapReduce integration allows fully parallel, distributed jobs against your data
with locality awareness.
If you stay up at night worrying about your database (uptime, scale, or speed), then
you should seriously consider making a jump from the RDBMS world to HBase. Utilize
a solution that was intended to scale rather than a solution based on stripping down
and throwing money at what used to work. With HBase, the software is free, the
hardware is cheap, and the distribution is intrinsic.
Use Case: HBase at Streamy.com
Streamy.com is a real-time news aggregator and social sharing platform. With a broad
feature set, we started out with a complex implementation on top of PostgreSQL. It's
a terrific product with a great community and a beautiful codebase. We tried every trick
in the book to keep things fast as we scaled, going so far as to modify the PostgreSQL
code directly to suit our needs. Originally taking advantage of all RDBMS goodies, we
found that eventually, one by one, we had to let them all go. Along the way, our entire
team became the DBA.
We did manage to solve many of the issues that we ran into, but there were two that
eventually led to the decision to find another solution from outside the world of
RDBMS.
Streamy crawls thousands of RSS feeds and aggregates hundreds of millions of items
from them. In addition to having to store these items, one of our more complex queries
reads a time-ordered list of all items from a set of sources. At the high end, this can run
to several thousand sources and all of their items all in a single query.
Very large items tables
At first, this was a single items table, but the high number of secondary indexes made
inserts and updates very slow. We started to divide items up into several one-to-one
link tables to store other information, separating static fields from dynamic ones,
grouping fields based on how they were queried, and denormalizing everything along
the way. Even with these changes, single updates required rewriting the entire record,
so tracking statistics on items was difficult to scale. The rewriting of records and having
to update indexes along the way are intrinsic properties of the RDBMS we were using.
They could not be decoupled. We partitioned our tables, which was not too difficult
because of the natural partition of time, but the complexity got out of hand fast. We
needed another solution!
Very large sort merges
Performing sorted merges of time-ordered lists is common in many Web 2.0
applications. An example SQL query might look like this:
SELECT id, stamp, type FROM streams
WHERE type IN ('type1','type2','type3','type4',...,'typeN')
ORDER BY stamp DESC LIMIT 10 OFFSET 0;
Assuming id is a primary key on streams, and that stamp and type have secondary
indexes, an RDBMS query planner treats this query as follows:
MERGE (
SELECT id, stamp, type FROM streams
WHERE type = 'type1' ORDER BY stamp DESC,
...,
SELECT id, stamp, type FROM streams
WHERE type = 'typeN' ORDER BY stamp DESC
) ORDER BY stamp DESC LIMIT 10 OFFSET 0;
The problem here is that we are after only the top 10 IDs, but the query planner actually
materializes an entire merge and then limits at the end. A simple heapsort across each
of the types would allow you to early out once you have the top 10. In our case, each
type could have tens of thousands of IDs in it, so materializing the entire list and sorting
it was extremely slow and unnecessary. We actually went so far as to write a custom
PL/Python script that performed a heapsort using a series of queries like the following:
SELECT id, stamp, type FROM streams
WHERE type = 'typeN'
ORDER BY stamp DESC LIMIT 1 OFFSET 0;
If we ended up taking from typeN (it was the next most recent in the heap), we would
run another query:
SELECT id, stamp, type FROM streams
WHERE type = 'typeN'
ORDER BY stamp DESC LIMIT 1 OFFSET 1;
In nearly all cases, this outperformed the native SQL implementation and the query
planner's strategy. In the worst cases for SQL, we were more than an order of magnitude
faster using the Python procedure. We found ourselves continually trying to outsmart
the query planner.
Again, at this point, we really needed another solution.
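The early-out merge described above can be sketched with a priority queue that holds one cursor per stream and stops as soon as the limit is reached. This is an illustrative JDK-only sketch, not the PL/Python procedure Streamy wrote; the class TopNMergeSketch, its nested Cursor type, and the sample stamps are hypothetical, and in-memory lists stand in for the per-type queries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical k-way merge that returns only the newest 'limit' stamps. */
final class TopNMergeSketch {

  /** One stream's position: its current head and the items not yet read. */
  private static final class Cursor {
    long head;
    final Iterator<Long> rest;

    Cursor(Iterator<Long> items) {
      this.rest = items;
      this.head = items.next();
    }
  }

  /** Each input list must already be sorted newest first (descending stamps). */
  static List<Long> newest(List<List<Long>> streams, int limit) {
    // Max-heap on the head stamp, so the most recent item overall is polled first.
    PriorityQueue<Cursor> heap =
        new PriorityQueue<>(Comparator.comparingLong((Cursor c) -> c.head).reversed());
    for (List<Long> stream : streams) {
      Iterator<Long> items = stream.iterator();
      if (items.hasNext()) {
        heap.add(new Cursor(items));
      }
    }
    List<Long> result = new ArrayList<>();
    // Early out: stop once 'limit' items are collected, leaving the rest unread.
    while (result.size() < limit && !heap.isEmpty()) {
      Cursor cursor = heap.poll();
      result.add(cursor.head);
      if (cursor.rest.hasNext()) {
        cursor.head = cursor.rest.next();
        heap.add(cursor);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<List<Long>> streams = Arrays.asList(
        Arrays.asList(90L, 50L, 10L),
        Arrays.asList(80L, 70L, 20L),
        Arrays.asList(60L));
    System.out.println(newest(streams, 4));
  }
}
```

Only as many items are pulled from each stream as the result needs, which is the behaviour the LIMIT 1 OFFSET n queries were emulating.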
Life with HBase
Our RDBMS-based system was always capable of correctly implementing our
requirements; the issue was scaling. When you start to focus on scale and performance rather
than correctness, you end up short-cutting and optimizing for your domain-specific
use cases everywhere possible. Once you start implementing your own solutions to
your data problems, the overhead and complexity of an RDBMS gets in your way. The
abstraction from the storage layer and ACID requirements are an enormous barrier and
luxury that you cannot always afford when building for scale. HBase is a distributed,
column-oriented, sorted map store and not much else. The only major part that is
abstracted from the user is the distribution, and that's exactly what we don't want to
deal with. Business logic, on the other hand, is very specialized and optimized. With
HBase not trying to solve all of our problems, we've been able to solve them better
ourselves and rely on HBase for scaling our storage, not our logic. It was an extremely
liberating experience to be able to focus on our applications and logic rather than the
scaling of the data itself.
We currently have tables with hundreds of millions of rows and tens of thousands of
columns; the thought of storing billions of rows and millions of columns is exciting,
not scary.
Praxis
In this section, we discuss some of the common issues users run into when running an
HBase cluster under load.
Versions
Up until HBase 0.20, HBase aligned its versioning with that of Hadoop. A particular
HBase version would run on any Hadoop that had a matching minor version, where
minor version in this context is considered the number between the periods (e.g., 20 is
the minor version of an HBase 0.20.5). HBase 0.20.5 would run on an Hadoop 0.20.2,
but HBase 0.19.5 would not run on Hadoop 0.20.0.
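The matching rule that applied before HBase 0.90 can be written down in a few lines. The following is a minimal sketch rather than anything taken from HBase: the class and method names (VersionMatch, minorVersion, sameMinorVersion) are hypothetical, and it assumes version strings of the form major.minor.patch.

```java
// Illustrative sketch only; these names are not part of HBase or Hadoop.
final class VersionMatch {

  // Returns the number between the periods, e.g. 20 for "0.20.5".
  static int minorVersion(String version) {
    String[] parts = version.split("\\.");
    if (parts.length < 2) {
      throw new IllegalArgumentException("Expected major.minor[.patch]: " + version);
    }
    return Integer.parseInt(parts[1]);
  }

  // Pre-0.90 rule from the text: HBase and Hadoop must share a minor version.
  static boolean sameMinorVersion(String hbaseVersion, String hadoopVersion) {
    return minorVersion(hbaseVersion) == minorVersion(hadoopVersion);
  }

  public static void main(String[] args) {
    System.out.println(sameMinorVersion("0.20.5", "0.20.2")); // the pair that runs
    System.out.println(sameMinorVersion("0.19.5", "0.20.0")); // the pair that does not
  }
}
```

The check covers only the historical rule described here; it says nothing about releases that no longer follow Hadoop's numbering.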
With HBase 0.90,[11] the version relationship was broken. The Hadoop release cycle has
slowed and no longer aligns with that of HBase developments. Also, the intent is that
now a particular HBase version can run on multiple versions of Hadoop. For example,
HBase 0.90.x will work with both Hadoop 0.20.x and 0.21.x.
This said, ensure you are running compatible versions of Hadoop and HBase. Check
the requirements section of your download. Incompatible versions will throw an
exception complaining about the version mismatch, if you are lucky. If they cannot talk
to each other sufficiently to pass versions, you may see your HBase cluster hang indefinitely,
soon after startup. The mismatch exception or HBase hang can also happen on upgrade
if older versions of either HBase or Hadoop can still be found on the classpath because
of imperfect cleanup of the old software.
HDFS
HBase's use of HDFS is very different from how it's used by MapReduce. In MapReduce,
generally, HDFS files are opened, with their content streamed through a map
task and then closed. In HBase, data files are opened on cluster startup and kept open
so that we avoid paying the file open costs on each access. Because of this, HBase tends
to see issues not normally encountered by MapReduce clients:
Running out of file descriptors
Because we keep files open, on a loaded cluster, it doesn't take long before we run
into system- and Hadoop-imposed limits. For instance, say we have a cluster that
has three nodes each running an instance of a datanode and a regionserver and
we're running an upload into a table that is currently at 100 regions and 10 column
families. Allow that each column family has on average two flush files. Doing the
math, we can have 100 × 10 × 2, or 2,000, files open at any one time. Add to this
total miscellaneous other descriptors consumed by outstanding scanners and Java
libraries. Each open file consumes at least one descriptor over on the remote
datanode. The default limit on the number of file descriptors per process is 1,024.
When we exceed the filesystem ulimit, we'll see the complaint about Too many
open files in logs, but often you'll first see indeterminate behavior in HBase. The
fix requires increasing the file descriptor ulimit count.[12] You can verify that the
HBase process is running with sufficient file descriptors by looking at the first few
lines of a regionserver's log. It emits vitals such as the JVM being used and
environment settings such as the file descriptor ulimit.
11. Why 0.90? We wanted there to be no confusion that a break had been made, so we put a large gap between
our new versioning and that of Hadoop's.
12. See the HBase FAQ for how to up the ulimit on your cluster.
Running out of datanode threads
Similarly, the Hadoop datanode has an upper bound of 256 on the number of
threads it can run at any one time. Given the same table statistics quoted in the
preceding bullet, it's easy to see how we can exceed this upper bound relatively
early, given that in the datanode as of this writing each open connection to a file
block consumes a thread. If you look in the datanode log, you'll see a complaint
like xceiverCount 258 exceeds the limit of concurrent xcievers 256 but again, you'll
likely see HBase act erratically before you encounter this log entry. Increase the
dfs.datanode.max.xcievers (note that the property name is misspelled) count in
HDFS and restart your cluster.[13]
Sync
You must run HBase on an HDFS that has a working sync. Otherwise, you will
lose data. This means running HBase on Hadoop 0.20.205.0 or later.[14]
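The arithmetic behind the file descriptor and datanode thread items above is easy to reproduce for your own table layout. The following is a rough sketch under the same simplifying assumptions as the text (every flush file of every column family in every region is open at once); the class and method names are hypothetical, and the two limits are the defaults quoted above.

```java
// Back-of-the-envelope estimate; not an HBase API.
final class OpenFileEstimate {

  static final int DEFAULT_DESCRIPTOR_LIMIT = 1024;    // per-process default cited in the text
  static final int DEFAULT_DATANODE_THREAD_LIMIT = 256; // datanode upper bound cited in the text

  // regions x column families x flush files per family
  static long estimateOpenFiles(int regions, int columnFamilies, int flushFilesPerFamily) {
    return (long) regions * columnFamilies * flushFilesPerFamily;
  }

  public static void main(String[] args) {
    long open = estimateOpenFiles(100, 10, 2);
    System.out.println("Estimated open files: " + open);
    System.out.println("Exceeds descriptor limit: " + (open > DEFAULT_DESCRIPTOR_LIMIT));
    System.out.println("Exceeds datanode thread limit: " + (open > DEFAULT_DATANODE_THREAD_LIMIT));
  }
}
```

The estimate ignores the extra descriptors used by scanners and Java libraries, so treat it as a lower bound.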
UI
HBase runs a web server on the master to present a view on the state of your running
cluster. By default, it listens on port 60010. The master UI displays a list of basic
attributes such as software versions, cluster load, request rates, lists of cluster tables, and
participating regionservers. Click on a regionserver in the master UI and you are taken
to the web server running on the individual regionserver. It lists the regions this server
is carrying and basic metrics such as resources consumed and request rates.
Metrics
Hadoop has a metrics system that can be used to emit vitals over a period to a
context (this is covered in "Metrics" on page 350). Enabling Hadoop metrics, and in
particular tying them to Ganglia or emitting them via JMX, will give you views on what is
happening on your cluster currently and in the recent past. HBase also adds metrics of
its own (request rates, counts of vitals, resources used) that can be caught by a
Hadoop context. See the file hadoop-metrics.properties under the HBase conf directory.[15]
Schema Design
HBase tables are like those in an RDBMS, except that cells are versioned, rows are
sorted, and columns can be added on the fly by the client as long as the column family
they belong to preexists. These factors should be considered when designing schemas
for HBase, but far and away the most important concern designing schemas is
consideration of how the data will be accessed. All access is via primary key so the key design
should lend itself to how the data is going to be queried. The other property to keep in
mind when designing schemas is that a defining attribute of column(-family)-oriented
stores, like HBase, is that it can host wide and sparsely populated tables at no incurred
cost.[16]
13. See the HBase troubleshooting guide for more detail on this issue.
14. On regionserver crash, when running on an older version of Hadoop, edits written to the commit log kept
in HDFS were not recoverable, as files that had not been properly closed lost all edits no matter how
much had been written to them at the time of the crash.
15. Yes, this file is named for Hadoop, though it's for setting up HBase metrics.
Joins
There is no native database join facility in HBase, but wide tables can make it so that
there is no need for database joins pulling from secondary or tertiary tables. A wide
row can sometimes be made to hold all data that pertains to a particular primary key.
Row keys
Take time designing your row key. In the weather data example in this chapter, the
compound row key has a station prefix that served to group temperatures by station.
The reversed timestamp suffix made it so temperatures could be scanned ordered from
most recent to oldest. A smart compound key can be used to cluster data in ways
amenable to how it will be accessed.
Designing compound keys, you may have to zero-pad number components so row keys
sort properly. Otherwise, you will run into the issue where 10 sorts before 2 when only
byte-order is considered (02 sorts before 10).
If your keys are integers, use a binary representation rather than persist the string
version of a number; it consumes less space.
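The three row key techniques above (zero-padding, reversed timestamps, and binary encoding) can be tried out with nothing but the JDK. This is a small sketch, not code from HBase or from the weather example: the class RowKeySketch and its method names are made up for illustration, and the timestamp value is arbitrary.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Illustrative helpers for building sortable row key components; not an HBase API.
final class RowKeySketch {

  // Zero-pads a non-negative number to a fixed width so string order matches numeric order.
  static String zeroPad(long value, int width) {
    return String.format(Locale.ROOT, "%0" + width + "d", value);
  }

  // Reverses a timestamp so that more recent values sort first.
  static long reverseTimestamp(long timestampMillis) {
    return Long.MAX_VALUE - timestampMillis;
  }

  // Encodes a long as a fixed-width, 8-byte big-endian array.
  static byte[] toBinary(long value) {
    return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
  }

  public static void main(String[] args) {
    // Unpadded: "10" sorts before "2" because '1' < '2'.
    System.out.println("10".compareTo("2") < 0);
    // Padded: "02" sorts before "10", matching numeric order.
    System.out.println(zeroPad(2, 2).compareTo(zeroPad(10, 2)) < 0);

    long timestamp = 1325376000000L; // arbitrary example value
    int asString = Long.toString(timestamp).getBytes(StandardCharsets.UTF_8).length;
    int asBinary = toBinary(timestamp).length;
    System.out.println(asString + " bytes as a string, " + asBinary + " bytes in binary");
  }
}
```

A fixed-width encoding has the further advantage that every key component has the same length, which keeps compound keys easy to split apart again.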
Counters
At StumbleUpon, the first production feature deployed on HBase was keeping counters
for the stumbleupon.com frontend. Counters used to be kept in MySQL, but the rate of
change was such that drops were frequent and the load imposed by the counter writes
was such that web designers self-imposed limits on what was counted. Using the
incrementColumnValue() method on org.apache.hadoop.hbase.client.HTable, counters can be
incremented many thousands of times a second.
Bulk Load
HBase has an efficient facility for bulk loading HBase by writing its internal data format
directly into the filesystem from MapReduce. Going this route, it's possible to load an
HBase instance at rates that are an order of magnitude or more beyond those attainable
by writing via the HBase client API. The facility is described at
http://hbase.apache.org/docs/current/bulk-loads.html. It's also possible to bulk load into a live table.
16. "Column-Stores for Wide and Sparse Data" by Daniel J. Abadi.
CHAPTER 14
ZooKeeper
So far in this book, we have been studying large-scale data processing. This chapter is
different: it is about building general distributed applications using Hadoop's
distributed coordination service, called ZooKeeper.
Writing distributed applications is hard. It's hard primarily because of partial failure.
When a message is sent across the network between two nodes and the network fails,
the sender does not know whether the receiver got the message. It may have gotten
through before the network failed, or it may not have. Or perhaps the receiver's process
died. The only way that the sender can find out what happened is to reconnect to the
receiver and ask it. This is partial failure: when we don't even know if an operation
failed.
ZooKeeper can't make partial failures go away, since they are intrinsic to distributed
systems. It certainly does not hide partial failures, either.[1] But what ZooKeeper does
do is give you a set of tools to build distributed applications that can safely handle
partial failures.
ZooKeeper also has the following characteristics:
ZooKeeper is simple
ZooKeeper is, at its core, a stripped-down filesystem that exposes a few simple
operations, and some extra abstractions such as ordering and notifications.
ZooKeeper is expressive
The ZooKeeper primitives are a rich set of building blocks that can be used to build
a large class of coordination data structures and protocols. Examples include:
distributed queues, distributed locks, and leader election among a group of peers.
ZooKeeper is highly available
ZooKeeper runs on a collection of machines and is designed to be highly available,
so applications can depend on it. ZooKeeper can help you avoid introducing single
points of failure into your system, so you can build a reliable application.
ZooKeeper facilitates loosely coupled interactions
ZooKeeper interactions support participants that do not need to know about one
another. For example, ZooKeeper can be used as a rendezvous mechanism so that
processes that otherwise don't know of each other's existence (or network details)
can discover and interact with each other. Coordinating parties may not even be
contemporaneous, since one process may leave a message in ZooKeeper that is
read by another after the first has shut down.
ZooKeeper is a library
ZooKeeper provides an open source, shared repository of implementations and
recipes of common coordination patterns. Individual programmers are spared the
burden of writing common protocols themselves (which are often difficult to get
right). Over time, the community can add to and improve the libraries, which is
to everyone's benefit.
1. This is the message of J. Waldo et al., "A Note on Distributed Computing," (1994),
http://research.sun.com/techrep/1994/smli_tr-94-29.pdf. That is, distributed programming is fundamentally
different from local programming, and the differences cannot simply be papered over.
ZooKeeper is highly performant, too. At Yahoo!, where it was created, the throughput
for a ZooKeeper cluster has been benchmarked at over 10,000 operations per second
for write-dominant workloads generated by hundreds of clients. For workloads where
reads dominate, which is the norm, the throughput is several times higher.[2]
Installing and Running ZooKeeper
When trying out ZooKeeper for the first time, it's simplest to run it in standalone mode
with a single ZooKeeper server. You can do this on a development machine, for
example. ZooKeeper requires Java 6 to run, so make sure you have it installed first. You don't
need Cygwin to run ZooKeeper on Windows, since there are Windows versions of the
ZooKeeper scripts. (Windows is supported only as a development platform, not as a
production platform.)
Download a stable release of ZooKeeper from the Apache ZooKeeper releases page at
http://zookeeper.apache.org/releases.html, and unpack the tarball in a suitable location:
% tar xzf zookeeper-x.y.z.tar.gz
ZooKeeper provides a few binaries to run and interact with the service, and it's
convenient to put the directory containing the binaries on your command-line path:
% export ZOOKEEPER_INSTALL=/home/tom/zookeeper-x.y.z
% export PATH=$PATH:$ZOOKEEPER_INSTALL/bin
2. Detailed benchmarks are available in the excellent paper "ZooKeeper: Wait-free coordination for
Internet-scale systems," by Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed (USENIX
Annual Technology Conference, 2010).
Before running the ZooKeeper service, we need to set up a configuration file. The
configuration file is conventionally called zoo.cfg and placed in the conf subdirectory
(although you can also place it in /etc/zookeeper, or in the directory defined by the
ZOOCFGDIR environment variable, if set). Here's an example:
tickTime=2000
dataDir=/Users/tom/zookeeper
clientPort=2181
This is a standard Java properties file, and the three properties defined in this example
are the minimum required for running ZooKeeper in standalone mode. Briefly,
tickTime is the basic time unit in ZooKeeper (specified in milliseconds), dataDir is the
local filesystem location where ZooKeeper stores persistent data, and clientPort is the
port that ZooKeeper listens on for client connections (2181 is a common choice). You
should change dataDir to an appropriate setting for your system.
With a suitable configuration defined, we are now ready to start a local ZooKeeper
server:
% zkServer.sh start
To check whether ZooKeeper is running, send the ruok command ("Are you OK?") to
the client port using nc (telnet works, too):
% echo ruok | nc localhost 2181
imok
That's ZooKeeper saying, "I'm OK." There are other commands, known as the
"four-letter words," for managing ZooKeeper and they are listed in Table 14-1.
Table 14-1. ZooKeeper commands: the four-letter words
Category            Command  Description
Server status       ruok     Prints imok if the server is running and not in an error state.
                    conf     Print the server configuration (from zoo.cfg).
                    envi     Print the server environment, including ZooKeeper version, Java version and other system properties.
                    srvr     Print server statistics, including latency statistics, the number of znodes, and the server mode (standalone, leader or follower).
                    stat     Print server statistics and connected clients.
                    srst     Reset server statistics.
                    isro     Shows if the server is in read-only (ro) mode (due to a network partition), or read-write mode (rw).
Client connections  dump     List all the sessions and ephemeral znodes for the ensemble. You must connect to the leader (see srvr) for this command.
                    cons     List connection statistics for all the server's clients.
                    crst     Reset connection statistics.
Watches             wchs     List summary information for the server's watches.
                    wchc     List all the server's watches by connection. Caution: may impact server performance for large number of watches.
                    wchp     List all the server's watches by znode path. Caution: may impact server performance for large number of watches.
Monitoring          mntr     Lists server statistics in Java Properties format, suitable as a source for monitoring systems like Ganglia and Nagios.
In addition to the mntr command, ZooKeeper exposes statistics via JMX. For more
details see the ZooKeeper documentation at http://zookeeper.apache.org/. There are
also monitoring tools and recipes in the src/contrib directory of the distribution.
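Because the four-letter words are plain text sent over the client port, they can also be issued from a program rather than from nc. The following is a minimal sketch using only JDK sockets; the class name FourLetterWord and its send() method are hypothetical helpers, not part of the ZooKeeper client library, and the sketch assumes that the server closes the connection once it has replied.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that mimics "echo ruok | nc localhost 2181".
final class FourLetterWord {

  // Sends a command to host:port and returns whatever the server writes back.
  static String send(String host, int port, String command, int timeoutMillis)
      throws IOException {
    try (Socket socket = new Socket()) {
      socket.connect(new InetSocketAddress(host, port), timeoutMillis);
      socket.setSoTimeout(timeoutMillis); // fail rather than hang if no reply arrives

      OutputStream out = socket.getOutputStream();
      out.write(command.getBytes(StandardCharsets.US_ASCII));
      out.flush();

      // Read until the server closes the connection (assumed behavior, see lead-in).
      InputStream in = socket.getInputStream();
      ByteArrayOutputStream reply = new ByteArrayOutputStream();
      byte[] buffer = new byte[1024];
      int n;
      while ((n = in.read(buffer)) != -1) {
        reply.write(buffer, 0, n);
      }
      return new String(reply.toByteArray(), StandardCharsets.US_ASCII);
    }
  }

  public static void main(String[] args) throws IOException {
    // Assumes a standalone server on localhost:2181, as configured earlier.
    System.out.println(send("localhost", 2181, "ruok", 3000));
  }
}
```

The same method can be pointed at any of the commands in Table 14-1, for example send("localhost", 2181, "srvr", 3000).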
An Example
Imagine a group of servers that provide some service to clients. We want clients to be
able to locate one of the servers, so they can use the service. One of the challenges is
maintaining the list of servers in the group.
The membership list clearly cannot be stored on a single node in the network, as the
failure of that node would mean the failure of the whole system (we would like the list
to be highly available). Suppose for a moment that we had a robust way of storing the
list. We would still have the problem of how to remove a server from the list if it failed.
Some process needs to be responsible for removing failed servers, but note that it can't
be the servers themselves, since they are no longer running!
What we are describing is not a passive distributed data structure, but an active one,
and one that can change the state of an entry when some external event occurs.
ZooKeeper provides this service, so let's see how to build this group membership
application (as it is known) with it.
Group Membership in ZooKeeper
One way of understanding ZooKeeper is to think of it as providing a high-availability
filesystem. It doesn't have files and directories, but a unified concept of a node, called
a znode, which acts both as a container of data (like a file) and a container of other
znodes (like a directory). Znodes form a hierarchical namespace, and a natural way to
build a membership list is to create a parent znode with the name of the group and
child znodes with the name of the group members (servers). This is shown in
Figure 14-1.
Figure 14-1. ZooKeeper znodes
In this example, we won't store data in any of the znodes, but in a real application, you
could imagine storing data about the members in their znodes, such as hostname.
Creating the Group
Let's introduce ZooKeeper's Java API by writing a program to create a znode for the
group, /zoo in this example. See Example 14-1.
Example 14-1. A program to create a znode representing a group in ZooKeeper
import java.io.IOException;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class CreateGroup implements Watcher {

  private static final int SESSION_TIMEOUT = 5000;

  private ZooKeeper zk;
  private CountDownLatch connectedSignal = new CountDownLatch(1);

  public void connect(String hosts) throws IOException, InterruptedException {
    zk = new ZooKeeper(hosts, SESSION_TIMEOUT, this);
    connectedSignal.await(); // block until process() sees the connection event
  }

  @Override
  public void process(WatchedEvent event) { // Watcher interface
    if (event.getState() == KeeperState.SyncConnected) {
      connectedSignal.countDown();
    }
  }

  public void create(String groupName) throws KeeperException,
      InterruptedException {
    String path = "/" + groupName;
    String createdPath = zk.create(path, null/*data*/, Ids.OPEN_ACL_UNSAFE,
        CreateMode.PERSISTENT);
    System.out.println("Created " + createdPath);
  }

  public void close() throws InterruptedException {
    zk.close();
  }

  public static void main(String[] args) throws Exception {
    CreateGroup createGroup = new CreateGroup();
    createGroup.connect(args[0]); // ZooKeeper host, e.g. localhost
    createGroup.create(args[1]);  // group name, e.g. zoo
    createGroup.close();
  }
}
When the main() method is run, it creates a CreateGroup instance and then calls its
connect() method. This method instantiates a new ZooKeeper object, the main class of
the client API and the one that maintains the connection between the client and the
ZooKeeper service. The constructor takes three arguments: the first is the host address
(and optional port, which defaults to 2181) of the ZooKeeper service;[3] the second is
the session timeout in milliseconds (which we set to 5 seconds), explained in more
detail later; and the third is an instance of a Watcher object. The Watcher object receives
callbacks from ZooKeeper to inform it of various events. In this case, CreateGroup is a
Watcher, so we pass this to the ZooKeeper constructor.
When a ZooKeeper instance is created, it starts a thread to connect to the ZooKeeper
service. The call to the constructor returns immediately, so it is important to wait for
the connection to be established before using the ZooKeeper object. We make use of
Java's CountDownLatch class (in the java.util.concurrent package) to block until the
ZooKeeper instance is ready. This is where the Watcher comes in. The Watcher interface
has a single method:
public void process(WatchedEvent event);
When the client has connected to ZooKeeper, the Watcher receives a call to its
process() method with an event indicating that it has connected. On receiving a
connection event (represented by the Watcher.Event.KeeperState enum, with value
SyncConnected), we decrement the counter in the CountDownLatch, using its
countDown() method. The latch was created with a count of one, representing the number of
events that need to occur before it releases all waiting threads. After calling
countDown() once, the counter reaches zero and the await() method returns.
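The latch handshake described above can be tried in isolation, without a ZooKeeper server. In the following sketch a plain background thread stands in for the client's connection thread and calls countDown() where the real program would do so from process(). The class name LatchSketch and the method connectAndWait() are illustrative, and the timed form of await() is a choice made for this sketch; the example program uses the untimed await().

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Stand-alone illustration of the connect-then-wait pattern; no ZooKeeper involved.
final class LatchSketch {

  // Starts a thread that signals "connected" and blocks the caller until it does.
  static boolean connectAndWait(long timeoutMillis) throws InterruptedException {
    final CountDownLatch connectedSignal = new CountDownLatch(1);

    Thread connector = new Thread(new Runnable() {
      @Override
      public void run() {
        // Stand-in for the connection event arriving in Watcher.process().
        connectedSignal.countDown();
      }
    });
    connector.start();

    // Returns true once the count reaches zero, false if the timeout elapses first.
    return connectedSignal.await(timeoutMillis, TimeUnit.MILLISECONDS);
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("connected: " + connectAndWait(5000));
  }
}
```

Waiting with a timeout lets the caller report a connection problem instead of blocking forever when no connection event ever arrives.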
The connect() method has now returned, and the next method to be invoked on the
CreateGroup is the create() method. In this method, we create a new ZooKeeper znode
using the create() method on the ZooKeeper instance. The arguments it takes are the
path (represented by a string), the contents of the znode (a byte array, null here), an
access control list (or ACL for short, which here is a completely open ACL, allowing
any client to read or write the znode), and the nature of the znode to be created.
3. For a replicated ZooKeeper service, this parameter is the comma-separated list of servers (host and
optional port) in the ensemble.
Znodes may be ephemeral or persistent. An ephemeral znode will be deleted by the
ZooKeeper service when the client that created it disconnects, either by explicitly
disconnecting or if the client terminates for whatever reason. A persistent znode, on the
other hand, is not deleted when the client disconnects. We want the znode representing
a group to live longer than the lifetime of the program that creates it, so we create a
persistent znode.
The return value of the create() method is the path that was created by ZooKeeper.
We use it to print a message that the path was successfully created. We will see how
the path returned by create() may differ from the one passed into the method when
we look at sequential znodes.
To see the program in action, we need to have ZooKeeper running on the local machine,
and then we can type:
% export CLASSPATH=ch14/target/classes/:$ZOOKEEPER_INSTALL/*:$ZOOKEEPER_INSTALL/lib/*:\
$ZOOKEEPER_INSTALL/conf
% java CreateGroup localhost zoo
Created /zoo
Joining a Group
The next part of the application is a program to register a member in a group. Each
member will run as a program and join a group. When the program exits, it should be
removed from the group, which we can do by creating an ephemeral znode that
represents it in the ZooKeeper namespace.
The JoinGroup program implements this idea, and its listing is in Example 14-2. The
logic for creating and connecting to a ZooKeeper instance has been refactored into a base
class, ConnectionWatcher, and appears in Example 14-3.
Example 14-2. A program that joins a group
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;

public class JoinGroup extends ConnectionWatcher {

  public void join(String groupName, String memberName) throws KeeperException,
      InterruptedException {
    String path = "/" + groupName + "/" + memberName;
    // Ephemeral: the znode is removed when this client's session ends
    String createdPath = zk.create(path, null/*data*/, Ids.OPEN_ACL_UNSAFE,
        CreateMode.EPHEMERAL);
    System.out.println("Created " + createdPath);
  }

  public static void main(String[] args) throws Exception {
    JoinGroup joinGroup = new JoinGroup();
    joinGroup.connect(args[0]);
    joinGroup.join(args[1], args[2]);

    // stay alive until process is killed or thread is interrupted
    Thread.sleep(Long.MAX_VALUE);
  }
}
Example 14-3. A helper class that waits for the connection to ZooKeeper to be established
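import java.io.IOException;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooKeeper;

public class ConnectionWatcher implements Watcher {

  private static final int SESSION_TIMEOUT = 5000;

  protected ZooKeeper zk;
  private CountDownLatch connectedSignal = new CountDownLatch(1);

  public void connect(String hosts) throws IOException, InterruptedException {
    zk = new ZooKeeper(hosts, SESSION_TIMEOUT, this);
    connectedSignal.await(); // block until the connection event arrives
  }

  @Override
  public void process(WatchedEvent event) {
    if (event.getState() == KeeperState.SyncConnected) {
      connectedSignal.countDown();
    }
  }

  public void close() throws InterruptedException {
    zk.close();
  }
}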
The code for JoinGroup is very similar to CreateGroup. It creates an ephemeral znode as
a child of the group znode in its join() method, then simulates doing work of some
kind by sleeping until the process is forcibly terminated. Later, you will see that upon
termination, the ephemeral znode is removed by ZooKeeper.
Listing Members in a Group
Now we need a program to find the members in a group (see Example 14-4).
Example 14-4. A program to list the members in a group
import java.util.List;

import org.apache.zookeeper.KeeperException;

public class ListGroup extends ConnectionWatcher {

  public void list(String groupName) throws KeeperException,
      InterruptedException {
    String path = "/" + groupName;

    try {
      // false: do not set a watch on the group's children
      List<String> children = zk.getChildren(path, false);
      if (children.isEmpty()) {
        System.out.printf("No members in group %s\n", groupName);
        System.exit(1);
      }
      for (String child : children) {
        System.out.println(child);
      }
    } catch (KeeperException.NoNodeException e) {
      System.out.printf("Group %s does not exist\n", groupName);
      System.exit(1);
    }
  }

  public static void main(String[] args) throws Exception {
    ListGroup listGroup = new ListGroup();
    listGroup.connect(args[0]);
    listGroup.list(args[1]);
    listGroup.close();
  }
}
In the list() method, we call getChildren() with a znode path and a watch flag to
retrieve a list of child paths for the znode, which we print out. Placing a watch on a
znode causes the registered Watcher to be triggered if the znode changes state. Although
we're not using it here, watching a znode's children would permit a program to get
notifications of members joining or leaving the group, or of the group being deleted.
We catch KeeperException.NoNodeException, which is thrown in the case when the
group's znode does not exist.
Let's see ListGroup in action. As expected, the zoo group is empty, since we haven't
added any members yet:
% java ListGroup localhost zoo
No members in group zoo
We can use JoinGroup to add some members. We launch them as background
processes, since they don't terminate on their own (due to the sleep statement):
% java JoinGroup localhost zoo duck &
% java JoinGroup localhost zoo cow &
% java JoinGroup localhost zoo goat &
% goat_pid=$!
The last line saves the process ID of the Java process running the program that adds
goat as a member. We need to remember the ID so that we can kill the process in a
moment, after checking the members:
% java ListGroup localhost zoo
goat
duck
cow
To remove a member, we kill its process:
% kill $goat_pid
And a few seconds later, it has disappeared from the group because the process's
ZooKeeper session has terminated (the timeout was set to 5 seconds) and its associated
ephemeral node has been removed:
% java ListGroup localhost zoo
duck
cow
Let's stand back and see what we've built here. We have a way of building up a list of
a group of nodes that are participating in a distributed system. The nodes may have no
knowledge of each other. A client that wants to use the nodes in the list to perform
some work, for example, can discover the nodes without them being aware of the
client's existence.
Finally, note that group membership is not a substitute for handling network errors
when communicating with a node. Even if a node is a group member, communications
with it may fail, and such failures must be handled in the usual ways (retrying, trying
a different member of the group, and so on).
ZooKeeper command-line tools
ZooKeeper comes with a command-line tool for interacting with the ZooKeeper
namespace. We can use it to list the znodes under the /zoo znode as follows:
% zkCli.sh localhost ls /zoo
Processing ls
WatchedEvent: Server state change. New state: SyncConnected
[duck, cow]
You can run the command without arguments to display usage instructions.
Deleting a Group
To round off the example, let's see how to delete a group. The ZooKeeper class provides
a delete() method that takes a path and a version number. ZooKeeper will delete a
znode only if the version number specified is the same as the version number of the
znode it is trying to delete, an optimistic locking mechanism that allows clients to detect
conflicts over znode modification. You can bypass the version check, however, by using
a version number of -1 to delete the znode regardless of its version number.
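The version check is easier to picture with a toy model. The class below is not ZooKeeper code: VersionedNode is a hypothetical in-memory stand-in that keeps a version counter, bumps it on every successful write, and refuses a delete whose expected version is stale. It also returns false on a mismatch where a real client would see an exception, a simplification made for this sketch.

```java
// Toy in-memory model of optimistic locking on a single node; not the ZooKeeper API.
final class VersionedNode {

  static final int ANY_VERSION = -1; // bypasses the version check, as described above

  private byte[] data;
  private int version = 0;
  private boolean deleted = false;

  VersionedNode(byte[] data) {
    this.data = data;
  }

  synchronized int getVersion() {
    return version;
  }

  synchronized byte[] getData() {
    return data;
  }

  // Replaces the data and bumps the version, but only if the caller's version is current.
  synchronized boolean setData(byte[] newData, int expectedVersion) {
    if (deleted || !matches(expectedVersion)) {
      return false;
    }
    data = newData;
    version++;
    return true;
  }

  // Deletes the node only if the caller's version is current (or ANY_VERSION is given).
  synchronized boolean delete(int expectedVersion) {
    if (deleted || !matches(expectedVersion)) {
      return false;
    }
    deleted = true;
    return true;
  }

  private boolean matches(int expectedVersion) {
    return expectedVersion == ANY_VERSION || expectedVersion == version;
  }

  public static void main(String[] args) {
    VersionedNode node = new VersionedNode(new byte[0]);
    int seen = node.getVersion();                 // client A reads the version
    node.setData(new byte[] {1}, seen);           // client B writes, bumping the version
    System.out.println(node.delete(seen));        // A's delete is refused: stale version
    System.out.println(node.delete(ANY_VERSION)); // unconditional delete goes through
  }
}
```

A refused delete tells the client that someone else changed the node in the meantime, so it can reread the node and decide whether deleting is still appropriate.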
There is no recursive delete operation in ZooKeeper, so you have to delete child znodes
before parents. This is what we do in the DeleteGroup class, which will remove a group
and all its members (Example 14-5).
Example 14-5. A program to delete a group and its members
import java.util.List;

import org.apache.zookeeper.KeeperException;

public class DeleteGroup extends ConnectionWatcher {

  public void delete(String groupName) throws KeeperException,
      InterruptedException {
    String path = "/" + groupName;

    try {
      // Delete the children first, since there is no recursive delete
      List<String> children = zk.getChildren(path, false);
      for (String child : children) {
        zk.delete(path + "/" + child, -1); // -1 bypasses the version check
      }
      zk.delete(path, -1);
    } catch (KeeperException.NoNodeException e) {
      System.out.printf("Group %s does not exist\n", groupName);
      System.exit(1);
    }
  }

  public static void main(String[] args) throws Exception {
    DeleteGroup deleteGroup = new DeleteGroup();
    deleteGroup.connect(args[0]);
    deleteGroup.delete(args[1]);
    deleteGroup.close();
  }
}
Finally, we can delete the zoo group that we created earlier:
% java DeleteGroup localhost zoo
% java ListGroup localhost zoo
Group zoo does not exist
The ZooKeeper Service
ZooKeeper is a highly available, high-performance coordination service. In this section,
we look at the nature of the service it provides: its model, operations, and
implementation.
Data Model
ZooKeeper maintains a hierarchical tree of nodes called znodes. A znode stores data
and has an associated ACL. ZooKeeper is designed for coordination (which typically
uses small data files), not high-volume data storage, so there is a limit of 1 MB on the
amount of data that may be stored in any znode.
Data access is atomic. A client reading the data stored at a znode will never receive only
some of the data; either the data will be delivered in its entirety, or the read will fail.
Similarly, a write will replace all the data associated with a znode. ZooKeeper
guarantees that the write will either succeed or fail; there is no such thing as a partial write,
where only some of the data written by the client is stored. ZooKeeper does not support
an append operation. These characteristics contrast with HDFS, which is designed for
high-volume data storage, with streaming data access, and provides an append
operation.
Znodes are referenced by paths, which in ZooKeeper are represented as slash-delimited
Unicode character strings, like filesystem paths in Unix. Paths must be absolute, so
they must begin with a slash character. Furthermore, they are canonical, which means
that each path has a single representation, and so paths do not undergo resolution. For
example, in Unix, a file with the path /a/b can equivalently be referred to by the
path /a/./b, since "." refers to the current directory at the point it is encountered in the
path. In ZooKeeper, "." does not have this special meaning and is actually illegal as a
path component (as is ".." for the parent of the current directory).
Path components are composed of Unicode characters, with a few restrictions (these
are spelled out in the ZooKeeper reference documentation). The string "zookeeper" is
a reserved word and may not be used as a path component. In particular, ZooKeeper
uses the /zookeeper subtree to store management information, such as information on
quotas.
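The path rules above translate directly into a small check. The following sketch covers only the rules stated in this section (absolute paths, no "." or ".." components, and the reserved word "zookeeper"). The class ZnodePathCheck is hypothetical and is not the validation the ZooKeeper client itself performs; in particular, it does not model the character restrictions spelled out in the reference documentation, and rejecting empty components is an assumption made here in keeping with paths being canonical.

```java
// Hypothetical checker for the path rules described in the text; not ZooKeeper's own code.
final class ZnodePathCheck {

  // Returns true if an application could use this path according to the rules above.
  static boolean isValidApplicationPath(String path) {
    if (path == null || !path.startsWith("/")) {
      return false; // paths must be absolute
    }
    if (path.equals("/")) {
      return true; // the root of the namespace
    }
    // A limit of -1 keeps trailing empty strings, so "/a/" yields an empty component.
    String[] components = path.substring(1).split("/", -1);
    for (String component : components) {
      if (component.isEmpty()) {
        return false; // assumption: "//" and trailing slashes are not canonical
      }
      if (component.equals(".") || component.equals("..")) {
        return false; // no relative components
      }
      if (component.equals("zookeeper")) {
        return false; // reserved word
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(isValidApplicationPath("/a/b"));   // accepted
    System.out.println(isValidApplicationPath("/a/./b")); // rejected
    System.out.println(isValidApplicationPath("a/b"));    // rejected
  }
}
```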
Note that paths are not URIs, and they are represented in the Java API by a
java.lang.String, rather than the Hadoop Path class (or by the java.net.URI class, for
that matter).
Znodes have some properties that are very useful for building distributed applications,
which we discuss in the following sections.
Ephemeral znodes
Znodes can be one of two types: ephemeral or persistent. A znode's type is set at creation
time and may not be changed later. An ephemeral znode is deleted by ZooKeeper when
the creating client's session ends. By contrast, a persistent znode is not tied to the client's
session and is deleted only when explicitly deleted by a client (not necessarily the one
that created it). An ephemeral znode may not have children, not even ephemeral ones.
Even though ephemeral nodes are tied to a client session, they are visible to all clients
(subject to their ACL policy, of course).
Ephemeral znodes are ideal for building applications that need to know when certain
distributed resources are available. The example earlier in this chapter uses ephemeral
znodes to implement a group membership service, so any process can discover the
members of the group at any particular time.
Sequence numbers
A sequential znode is given a sequence number by ZooKeeper as a part of its name. If
a znode is created with the sequential flag set, then the value of a monotonically
increasing counter (maintained by the parent znode) is appended to its name.
If a client asks to create a sequential znode with the name /a/b-, for example, then the
znode created may actually have the name /a/b-3.[4] If, later on, another sequential znode
with the name /a/b- is created, then it will be given a unique name with a larger value
of the counter (for example, /a/b-5). In the Java API, the actual path given to sequential
znodes is communicated back to the client as the return value of the create() call.
Sequence numbers can be used to impose a global ordering on events in a distributed
system, and may be used by the client to infer the ordering. In "A Lock Service"
on page 517, you will learn how to use sequential znodes to build a shared lock.
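As a rough sketch of how a client might infer ordering from sequence numbers, the following parses the number after the trailing dash and sorts child names by it. The class SequenceOrder and its methods are hypothetical names for this illustration, and the input is assumed to be a list of child names that all follow the trailing-dash convention. Parsing the suffix as a number (rather than comparing strings) keeps b-10 after b-5, and also copes with zero-padded suffixes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of the ZooKeeper API.
final class SequenceOrder {

  // Returns the number that follows the last dash in a znode name.
  static long sequenceOf(String name) {
    int dash = name.lastIndexOf('-');
    if (dash < 0 || dash == name.length() - 1) {
      throw new IllegalArgumentException("no sequence suffix: " + name);
    }
    return Long.parseLong(name.substring(dash + 1));
  }

  // Returns a copy of the names ordered by sequence number, lowest first.
  static List<String> inCreationOrder(List<String> names) {
    List<String> sorted = new ArrayList<>(names);
    sorted.sort(Comparator.comparingLong(SequenceOrder::sequenceOf));
    return sorted;
  }

  public static void main(String[] args) {
    // Prints [b-3, b-5, b-10]
    System.out.println(inCreationOrder(Arrays.asList("b-5", "b-3", "b-10")));
  }
}
```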
Watches
Watches allow clients to get notifications when a znode changes in some way. Watches
are set by operations on the ZooKeeper service, and are triggered by other operations
on the service. For example, a client might call the exists operation on a znode, placing
a watch on it at the same time. If the znode doesn't exist, then the exists operation
will return false. If, some time later, the znode is created by a second client, then the
watch is triggered, notifying the first client of the znode's creation. You will see precisely
which operations trigger others in the next section.
Watches are triggered only once.[5] To receive multiple notifications, a client needs to
reregister the watch. If the client in the previous example wishes to receive further
notifications for the znode's existence (to be notified when it is deleted, for example),
it needs to call the exists operation again to set a new watch.
There is an example in "A Configuration Service" on page 510 demonstrating how to
use watches to update configuration across a cluster.
Operations
There are nine basic operations in ZooKeeper, listed in Table 14-2.
Table 14-2. Operations in the ZooKeeper service
Operation          Description
create             Creates a znode (the parent znode must already exist)
delete             Deletes a znode (the znode must not have any children)
exists             Tests whether a znode exists and retrieves its metadata
getACL, setACL     Gets/sets the ACL for a znode
getChildren        Gets a list of the children of a znode
getData, setData   Gets/sets the data associated with a znode
sync               Synchronizes a client's view of a znode with ZooKeeper
4. It is conventional (but not required) to have a trailing dash on path names for sequential nodes, to make
their sequence numbers easy to read and parse (by the application).
5. Except for callbacks for connection events, which do not need reregistration.
Update operations in ZooKeeper are conditional. A delete or setData operation has to
specify the version number of the znode that is being updated (which is found from a
previous exists call). If the version number does not match, the update will fail.
Updates are nonblocking operations, so a client that loses an update (because another
process updated the znode in the meantime) can decide whether to try again or take
some other action, and it can do so without blocking the progress of any other process.
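The version check can be illustrated with a small in-memory model. This is a sketch of the idea only: VersionedStore is a made-up class that stands in for the server's znode table, it is not the ZooKeeper client API, and it assumes (for the sake of the example) that a new znode starts at version 0 and that each successful write increments the version by one.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of conditional updates; not the ZooKeeper API.
final class VersionedStore {

  private static final class Entry {
    final String data;
    final int version;

    Entry(String data, int version) {
      this.data = data;
      this.version = version;
    }
  }

  private final Map<String, Entry> znodes = new HashMap<>();

  synchronized void create(String path, String data) {
    if (znodes.containsKey(path)) {
      throw new IllegalStateException("already exists: " + path);
    }
    znodes.put(path, new Entry(data, 0));
  }

  // Returns the current version of the znode (what a client would learn
  // from an earlier exists call in this model).
  synchronized int version(String path) {
    return znodes.get(path).version;
  }

  synchronized String getData(String path) {
    return znodes.get(path).data;
  }

  // Applies the write only if the caller's expected version is current.
  // Returns false immediately otherwise; the caller never waits.
  synchronized boolean setData(String path, String data, int expectedVersion) {
    Entry current = znodes.get(path);
    if (current == null || current.version != expectedVersion) {
      return false;
    }
    znodes.put(path, new Entry(data, current.version + 1));
    return true;
  }

  public static void main(String[] args) {
    VersionedStore store = new VersionedStore();
    store.create("/config", "a");
    int seen = store.version("/config");                  // two clients both see version 0
    boolean first = store.setData("/config", "b", seen);  // wins the race
    boolean second = store.setData("/config", "c", seen); // loses: version has moved on
    System.out.println(first + " " + second + " " + store.getData("/config"));
  }
}
```

The losing caller gets an immediate failure rather than waiting, so it is free to reread the version and retry, or to give up.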
Although ZooKeeper can be viewed as a filesystem, there are some filesystem primitives
that it does away with in the name of simplicity. Because files are small and are written
and read in their entirety, there is no need to provide open, close, or seek operations.
The sync operation is not like fsync() in POSIX filesystems. As mentioned
earlier, writes in ZooKeeper are atomic, and a successful write
operation is guaranteed to have been written to persistent storage on a
majority of ZooKeeper servers. However, it is permissible for reads to
lag the latest state of the ZooKeeper service, and the sync operation exists
to allow a client to bring itself up-to-date. This topic is covered in more
detail in the section on "Consistency" on page 505.
Multi-update
There is another ZooKeeper operation, called multi, which batches together multiple
primitive operations into a single unit that either succeeds or fails in its entirety. The
situation where some of the primitive operations succeed and some fail can never arise.
Multi-update is very useful for building structures in ZooKeeper that maintain some
global invariant. One example is an undirected graph. Each vertex in the graph is
naturally represented as a znode in ZooKeeper, and to add or remove an edge we need to
update the two znodes corresponding to its vertices, since each has a reference to the
other. If we only used primitive ZooKeeper operations, it would be possible for another
client to observe the graph in an inconsistent state where one vertex is connected to
another but the reverse connection is absent. Batching the updates on the two znodes
into one multi operation ensures that the update is atomic, so a pair of vertices can
never have a dangling connection.
APIs
There are two core language bindings for ZooKeeper clients, one for Java and one for
C; there are also contrib bindings for Perl, Python, and REST clients. For each binding,
there is a choice between performing operations synchronously or asynchronously.
We've already seen the synchronous Java API. Here's the signature for the exists
operation, which returns a Stat object that encapsulates the znode's metadata, or null if
the znode doesn't exist:
public Stat exists(String path, Watcher watcher) throws KeeperException,
InterruptedException
The asynchronous equivalent, which is also found in the ZooKeeper class, looks like this:
public void exists(String path, Watcher watcher, StatCallback cb, Object ctx)
In the Java API, all the asynchronous methods have void return types, since the result
of the operation is conveyed via a callback. The caller passes a callback implementation,
whose method is invoked when a response is received from ZooKeeper. In this case,
the callback is the StatCallback interface, which has the following method:
public void processResult(int rc, String path, Object ctx, Stat stat);
The rc argument is the return code, corresponding to the codes defined by
KeeperException. A nonzero code represents an exception, in which case the stat parameter will
be null. The path and ctx arguments correspond to the equivalent arguments passed
by the client to the exists() method, and can be used to identify the request for which
this callback is a response. The ctx parameter can be an arbitrary object that may be
used by the client when the path does not give enough context to disambiguate the
request. If not needed, it may be set to null.
There are actually two C shared libraries. The single-threaded library, zookeeper_st,
supports only the asynchronous API and is intended for platforms where the pthread
library is not available or stable. Most developers will use the multithreaded library,
zookeeper_mt, as it supports both the synchronous and asynchronous APIs. For details
on how to build and use the C API, please refer to the README file in the src/c directory
of the ZooKeeper distribution.
Should I Use the Synchronous or Asynchronous API?
Both APIs offer the same functionality, so the one you use is largely a matter of style.
The asynchronous API is appropriate if you have an event-driven programming model,
for example.
The asynchronous API allows you to pipeline requests, which in some scenarios can
offer better throughput. Imagine that you want to read a large batch of znodes and
process them independently. Using the synchronous API, each read would block until
it returned, whereas with the asynchronous API, you can fire off all the asynchronous
reads very quickly and process the responses in a separate thread as they come back.
Watch triggers
The read operations exists, getChildren, and getData may have watches set on them,
and the watches are triggered by write operations: create, delete, and setData. ACL
operations do not participate in watches. When a watch is triggered, a watch event is
generated, and the watch event's type depends both on the watch and the operation
that triggered it:
• A watch set on an exists operation will be triggered when the znode being watched
is created, deleted, or has its data updated.
• A watch set on a getData operation will be triggered when the znode being watched
is deleted or has its data updated. No trigger can occur on creation, since the znode
must already exist for the getData operation to succeed.
• A watch set on a getChildren operation will be triggered when a child of the znode
being watched is created or deleted, or when the znode itself is deleted. You can
tell whether the znode or its child was deleted by looking at the watch event type:
NodeDeleted shows the znode was deleted, and NodeChildrenChanged indicates that
it was a child that was deleted.
The combinations are summarized in Table 14-3.
Table 14-3. Watch creation operations and their corresponding triggers
                 Watch trigger
Watch creation   create (znode)   create (child)        delete (znode)   delete (child)        setData
exists           NodeCreated      -                     NodeDeleted      -                     NodeDataChanged
getData          -                -                     NodeDeleted      -                     NodeDataChanged
getChildren      -                NodeChildrenChanged   NodeDeleted      NodeChildrenChanged   -

A watch event includes the path of the znode that was involved in the event, so for
NodeCreated and NodeDeleted events, you can tell which node was created or deleted
simply by inspecting the path. To discover which children have changed after a
NodeChildrenChanged event, you need to call getChildren again to retrieve the new list of
children. Similarly, to discover the new data for a NodeDataChanged event, you need to
call getData. In both of these cases, the state of the znodes may have changed between
receiving the watch event and performing the read operation, so you should bear this
in mind when writing applications.
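The watch and trigger combinations can also be written down as a lookup, which is sometimes handy when reasoning about which events a piece of code should expect. The following is a plain restatement of the rules above in code; the class WatchTriggers and its enums are hypothetical names for this sketch, and the event types are returned as strings rather than as the ZooKeeper client's own event type constants.

```java
// Hypothetical lookup that restates the watch trigger rules; not the ZooKeeper API.
final class WatchTriggers {

  // The read operation that set the watch.
  enum WatchOp { EXISTS, GET_DATA, GET_CHILDREN }

  // The write operation that happened afterward, and what it acted on.
  enum Trigger { CREATE_ZNODE, CREATE_CHILD, DELETE_ZNODE, DELETE_CHILD, SET_DATA }

  // Returns the name of the event type generated, or null when the
  // write does not trigger that kind of watch.
  static String eventFor(WatchOp watch, Trigger trigger) {
    switch (trigger) {
      case CREATE_ZNODE:
        // Only an exists watch can be set on a znode that is not there yet.
        return watch == WatchOp.EXISTS ? "NodeCreated" : null;
      case DELETE_ZNODE:
        // Every kind of watch sees the deletion of the watched znode itself.
        return "NodeDeleted";
      case SET_DATA:
        return watch == WatchOp.GET_CHILDREN ? null : "NodeDataChanged";
      case CREATE_CHILD:
      case DELETE_CHILD:
        return watch == WatchOp.GET_CHILDREN ? "NodeChildrenChanged" : null;
      default:
        return null;
    }
  }

  public static void main(String[] args) {
    for (WatchOp watch : WatchOp.values()) {
      for (Trigger trigger : Trigger.values()) {
        System.out.println(watch + " + " + trigger + " -> " + eventFor(watch, trigger));
      }
    }
  }
}
```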
ACLs
A znode is created with a list of ACLs, which determines who can perform certain
operations on it.
ACLs depend on authentication, the process by which the client identifies itself to
ZooKeeper. There are a few authentication schemes that ZooKeeper provides:
digest
The client is authenticated by a username and password.
sasl
The client is authenticated using Kerberos.
ip
The client is authenticated by its IP address.
Clients may authenticate themselves after establishing a ZooKeeper session.
Authentication is optional, although a znode's ACL may require an authenticated client, in
which case the client must authenticate itself to access the znode. Here is an example
of using the digest scheme to authenticate with a username and password:
zk.addAuthInfo("digest", "tom:secret".getBytes());
An ACL is the combination of an authentication scheme, an identity for that scheme,
and a set of permissions. For example, if we wanted to give a client with the IP address
10.0.0.1 read access to a znode, we would set an ACL on the znode with the ip scheme,
an ID of 10.0.0.1, and READ permission. In Java, we would create the ACL object as
follows:
new ACL(Perms.READ,
new Id("ip", "10.0.0.1"));
The full set of permissions is listed in Table 14-4. Note that the exists operation is
not governed by an ACL permission, so any client may call exists to find the Stat for
a znode or to discover that a znode does not in fact exist.
Table 14-4. ACL permissions
ACL permission   Permitted operations
CREATE           create (a child znode)
READ             getChildren, getData
WRITE            setData
DELETE           delete (a child znode)
ADMIN            setACL
There are a number of predefined ACLs defined in the ZooDefs.Ids class, including
OPEN_ACL_UNSAFE, which gives all permissions (except ADMIN permission) to everyone.
In addition, ZooKeeper has a pluggable authentication mechanism, which makes it
possible to integrate third-party authentication systems if needed.
Implementation
The ZooKeeper service can run in two modes. In standalone mode, there is a single
ZooKeeper server, which is useful for testing due to its simplicity (it can even be
embedded in unit tests), but provides no guarantees of high-availability or resilience.
In production, ZooKeeper runs in replicated mode, on a cluster of machines called an
ensemble. ZooKeeper achieves high-availability through replication, and can provide a
service as long as a majority of the machines in the ensemble are up. For example, in a
five-node ensemble, any two machines can fail and the service will still work because
a majority of three remain. Note that a six-node ensemble can also tolerate only two
machines failing, since with three failures the remaining three do not constitute a
majority of the six. For this reason, it is usual to have an odd number of machines in an
ensemble.
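The majority arithmetic behind this advice is easy to check. In the sketch below, QuorumMath is a made-up helper (not something from the ZooKeeper codebase) that computes the smallest strict majority for an ensemble of a given size and, from that, how many machines may fail.

```java
// Hypothetical helper illustrating majority arithmetic; not ZooKeeper code.
final class QuorumMath {

  // Smallest number of servers that forms a strict majority of the ensemble.
  static int quorumSize(int ensembleSize) {
    return ensembleSize / 2 + 1;
  }

  // Largest number of servers that can fail while a majority remains up.
  static int toleratedFailures(int ensembleSize) {
    return ensembleSize - quorumSize(ensembleSize);
  }

  public static void main(String[] args) {
    for (int n = 3; n <= 7; n++) {
      System.out.printf("ensemble of %d: quorum %d, tolerates %d failure(s)%n",
          n, quorumSize(n), toleratedFailures(n));
    }
  }
}
```

Five and six servers both tolerate two failures, so the sixth machine adds cost without adding resilience.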
Conceptually, ZooKeeper is very simple: all it has to do is ensure that every modification
to the tree of znodes is replicated to a majority of the ensemble. If a minority of the
machines fail, then a minimum of one machine will survive with the latest state. The
other remaining replicas will eventually catch up with this state.
The implementation of this simple idea, however, is nontrivial. ZooKeeper uses a
protocol called Zab that runs in two phases, which may be repeated indefinitely:
Phase 1: Leader election
The machines in an ensemble go through a process of electing a distinguished
member, called the leader. The other machines are termed followers. This phase is
finished once a majority (or quorum) of followers have synchronized their state
with the leader.
Phase 2: Atomic broadcast
All write requests are forwarded to the leader, which broadcasts the update to the
followers. When a majority have persisted the change, the leader commits the
update, and the client gets a response saying the update succeeded. The protocol for
achieving consensus is designed to be atomic, so a change either succeeds or fails.
It resembles a two-phase commit.
Does ZooKeeper Use Paxos?
No. ZooKeeper's Zab protocol is not the same as the well-known Paxos algorithm
(Leslie Lamport, "Paxos Made Simple," ACM SIGACT News [Distributed Computing
Column] 32, 4 [Whole Number 121, December 2001] 51-58.). Zab is similar, but it
differs in several aspects of its operation, such as relying on TCP for its message ordering
guarantees.
Zab is described in "A simple totally ordered broadcast protocol" by Benjamin Reed
and Flavio Junqueira (LADIS '08: Proceedings of the 2nd Workshop on Large-Scale
Distributed Systems and Middleware, pages 1-6, New York, NY, USA, 2008. ACM).
Google's Chubby Lock Service (Mike Burrows, "The Chubby Lock Service for
Loosely-Coupled Distributed Systems," November 2006,
http://labs.google.com/papers/chubby.html), which shares similar goals with ZooKeeper,
is based on Paxos.
If the leader fails, the remaining machines hold another leader election and continue
as before with the new leader. If the old leader later recovers, it then starts as a follower.
Leader election is very fast, around 200 ms according to one published result,[6] so
performance does not noticeably degrade during an election.
All machines in the ensemble write updates to disk before updating their in-memory
copy of the znode tree. Read requests may be serviced from any machine, and since
they involve only a lookup from memory, they are very fast.
Consistency
Understanding the basis of ZooKeeper's implementation helps in understanding the
consistency guarantees that the service makes. The terms "leader" and "follower" for
the machines in an ensemble are apt, for they make the point that a follower may lag
the leader by a number of updates. This is a consequence of the fact that only a majority
and not all of the ensemble needs to have persisted a change before it is committed. A
good mental model for ZooKeeper is of clients connected to ZooKeeper servers that
are following the leader. A client may actually be connected to the leader, but it has no
control over this, and cannot even know if this is the case.[7] See Figure 14-2.
Every update made to the znode tree is given a globally unique identifier, called a
zxid (which stands for "ZooKeeper transaction ID"). Updates are ordered, so if zxid
z1 is less than z2, then z1 happened before z2, according to ZooKeeper, which is the
single authority on ordering in the distributed system.
6. Reported by Yahoo! at http://zookeeper.apache.org/doc/current/zookeeperOver.html.
7. It is possible to configure ZooKeeper so that the leader does not accept client connections. In this case,
its only job is to coordinate updates. Do this by setting the leaderServes property to no. This is
recommended for ensembles of more than three servers.
Figure 14-2. Reads are satisfied by followers, while writes are committed by the leader
The following guarantees for data consistency flow from ZooKeeper's design:
Sequential consistency
Updates from any particular client are applied in the order that they are sent. This
means that if a client updates the znode z to the value a, and in a later operation,
it updates z to the value b, then no client will ever see z with value a after it has
seen it with value b (if no other updates are made to z).
Atomicity
Updates either succeed or fail. This means that if an update fails, no client will ever
see it.
Single system image
A client will see the same view of the system regardless of the server it connects to.
This means that if a client connects to a new server during the same session, it will
not see an older state of the system than the one it saw with the previous server.
When a server fails and a client tries to connect to another in the ensemble, a server
that is behind the one that failed will not accept connections from the client until
it has caught up with the failed server.
Durability
Once an update has succeeded, it will persist and will not be undone. This means
updates will survive server failures.
Timeliness
The lag in any client's view of the system is bounded, so it will not be out of date
by more than some multiple of tens of seconds. This means that rather than allow
a client to see data that is very stale, a server will shut down, forcing the client to
switch to a more up-to-date server.
For performance reasons, reads are satisfied from a ZooKeeper server's memory and
do not participate in the global ordering of writes. This property can lead to the
appearance of inconsistent ZooKeeper states from clients that communicate through a
mechanism outside ZooKeeper.
For example, client A updates znode z from a to a', A tells B to read z, and B reads the value
of z as a, not a'. This is perfectly compatible with the guarantees that ZooKeeper makes
(the condition that it does not promise is called "Simultaneously Consistent
Cross-Client Views"). To prevent this condition from happening, B should call sync on z
before reading z's value. The sync operation forces the ZooKeeper server to which B is
connected to "catch up" with the leader, so that when B reads z's value it will be the
one that A set (or a later value).
Slightly confusingly, the sync operation is only available as an asynchronous
call. The reason for this is that you don't need to wait for it to
return, since ZooKeeper guarantees that any subsequent operation will
happen after the sync completes on the server, even if the operation is
issued before the sync completes.
Sessions
A ZooKeeper client is configured with the list of servers in the ensemble. On startup,
it tries to connect to one of the servers in the list. If the connection fails, it tries another
server in the list, and so on, until it either successfully connects to one of them or fails
if all ZooKeeper servers are unavailable.
Once a connection has been made with a ZooKeeper server, the server creates a new
session for the client. A session has a timeout period that is decided on by the
application that creates it. If the server hasn't received a request within the timeout period,
it may expire the session. Once a session has expired, it may not be reopened, and any
ephemeral nodes associated with the session will be lost. Although session expiry is a
comparatively rare event, since sessions are long-lived, it is important for applications
to handle it (we will see how in "The Resilient ZooKeeper Application"
on page 513).
Sessions are kept alive by the client sending ping requests (also known as heartbeats)
whenever the session is idle for longer than a certain period. (Pings are automatically
sent by the ZooKeeper client library, so your code doesn't need to worry about
maintaining the session.) The period is chosen to be low enough to detect server failure
(manifested by a read timeout) and reconnect to another server within the session
timeout period.
Failover to another ZooKeeper server is handled automatically by the ZooKeeper client,
and, crucially, sessions (and associated ephemeral znodes) are still valid after another
server takes over from the failed one.
During failover, the application will receive notifications of disconnections and
connections to the service. Watch notifications will not be delivered while the client is
disconnected, but they will be delivered when the client successfully reconnects. Also,
if the application tries to perform an operation while the client is reconnecting to
another server, the operation will fail. This underlines the importance of handling
connection loss exceptions in real-world ZooKeeper applications (described in "The
Resilient ZooKeeper Application" on page 513).
Time
There are several time parameters in ZooKeeper. The tick time is the fundamental period
of time in ZooKeeper and is used by servers in the ensemble to define the schedule on
which their interactions run. Other settings are defined in terms of tick time, or are at
least constrained by it. The session timeout, for example, may not be less than 2 ticks
or more than 20. If you attempt to set a session timeout outside this range, it will be
modified to fall within the range.
A common tick time setting is 2 seconds (2,000 milliseconds). This translates to an
allowable session timeout of between 4 and 40 seconds. There are a few considerations
in selecting a session timeout.
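The adjustment of an out-of-range session timeout amounts to a clamp. The sketch below only illustrates the arithmetic described above; SessionTimeouts and negotiate() are made-up names, not the server's actual code, and all values are in milliseconds.

```java
// Hypothetical illustration of the session timeout bounds; not ZooKeeper code.
final class SessionTimeouts {

  static final int MIN_TICKS = 2;
  static final int MAX_TICKS = 20;

  // Moves a requested session timeout into the range [2 ticks, 20 ticks].
  static int negotiate(int requestedMs, int tickTimeMs) {
    int min = MIN_TICKS * tickTimeMs;
    int max = MAX_TICKS * tickTimeMs;
    return Math.max(min, Math.min(max, requestedMs));
  }

  public static void main(String[] args) {
    int tick = 2000; // the common 2-second tick time
    System.out.println(negotiate(1000, tick));  // too low: raised to 4000
    System.out.println(negotiate(10000, tick)); // in range: unchanged
    System.out.println(negotiate(60000, tick)); // too high: lowered to 40000
  }
}
```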
A low session timeout leads to faster detection of machine failure. In the group
membership example, the session timeout is the time it takes for a failed machine to be
removed from the group. Beware of setting the session timeout too low, however, since
a busy network can cause packets to be delayed and may cause inadvertent session
expiry. In such an event, a machine would appear to "flap": leaving and then rejoining
the group repeatedly in a short space of time.
Applications that create more complex ephemeral state should favor longer session
timeouts, as the cost of reconstruction is higher. In some cases, it is possible to design
the application so it can restart within the session timeout period and avoid session
expiry. (This might be desirable to perform maintenance or upgrades.) Every session
is given a unique identity and password by the server, and if these are passed to
ZooKeeper while a connection is being made, it is possible to recover a session (as long as
it hasn't expired). An application can therefore arrange a graceful shutdown, whereby
it stores the session identity and password to stable storage before restarting the
process, retrieving the stored session identity and password and recovering the session.
You should view this feature as an optimization, which can help avoid expired sessions.
It does not remove the need to handle session expiry, which can still occur if a machine
fails unexpectedly, or even if an application is shut down gracefully but does not restart
before its session expires, for whatever reason.
As a general rule, the larger the ZooKeeper ensemble, the larger the session timeout
should be. Connection timeouts, read timeouts, and ping periods are all defined
internally as a function of the number of servers in the ensemble, so as the ensemble grows,
these periods decrease. Consider increasing the timeout if you experience frequent
connection loss. You can monitor ZooKeeper metrics, such as request latency
statistics, using JMX.
States
The ZooKeeper object transitions through different states in its lifecycle (see
Figure 14-3). You can query its state at any time by using the getState() method:
public States getState()
States is an enum representing the different states that a ZooKeeper object may be in.
(Despite the enum's name, an instance of ZooKeeper may only be in one state at a time.)
A newly constructed ZooKeeper instance is in the CONNECTING state, while it tries to
establish a connection with the ZooKeeper service. Once a connection is established,
it goes into the CONNECTED state.
Figure 14-3. ZooKeeper state transitions
A client using the ZooKeeper object can receive notifications of the state transitions by
registering a Watcher object. On entering the CONNECTED state, the watcher receives a
WatchedEvent whose KeeperState value is SyncConnected.
A ZooKeeper Watcher object serves double duty: it can be used to be
notified of changes in the ZooKeeper state (as described in this section),
and it can be used to be notified of changes in znodes (described in
"Watch triggers" on page 501). The (default) watcher passed into the
ZooKeeper object constructor is used for state changes, but znode
changes may either use a dedicated instance of Watcher (by passing one
in to the appropriate read operation), or they may share the default one
if using the form of the read operation that takes a Boolean flag to specify
whether to use a watcher.
The ZooKeeper instance may disconnect and reconnect to the ZooKeeper service, moving
between the CONNECTED and CONNECTING states. If it disconnects, the watcher receives
a Disconnected event. Note that these state transitions are initiated by the ZooKeeper
instance itself, and it will automatically try to reconnect if the connection is lost.
The ZooKeeper instance may transition to a third state, CLOSED, if either the close()
method is called or the session times out as indicated by a KeeperState of type
Expired. Once in the CLOSED state, the ZooKeeper object is no longer considered to be
alive (this can be tested using the isAlive() method on States) and cannot be reused.
To reconnect to the ZooKeeper service, the client must construct a new ZooKeeper
instance.
Building Applications with ZooKeeper
Having covered ZooKeeper in some depth, let's turn back to writing some useful
applications with it.
A Configuration Service
One of the most basic services that a distributed application needs is a configuration
service so that common pieces of configuration information can be shared by machines
in a cluster. At the simplest level, ZooKeeper can act as a highly available store for
configuration, allowing application participants to retrieve or update configuration
files. Using ZooKeeper watches, it is possible to create an active configuration service,
where interested clients are notified of changes in configuration.
Let's write such a service. We make a couple of assumptions that simplify the
implementation (they could be removed with a little more work). First, the only configuration
values we need to store are strings, and keys are just znode paths, so we use a znode to
store each key-value pair. Second, there is a single client that performs updates at any
one time. Among other things, this model fits with the idea of a master (such as the
namenode in HDFS) that wishes to update information that its workers need to follow.
We wrap the code up in a class called ActiveKeyValueStore:
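import java.nio.charset.Charset;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.data.Stat;

public class ActiveKeyValueStore extends ConnectionWatcher {

  private static final Charset CHARSET = Charset.forName("UTF-8");

  public void write(String path, String value) throws InterruptedException,
      KeeperException {
    Stat stat = zk.exists(path, false);
    if (stat == null) {
      // No znode yet: create a persistent one
      zk.create(path, value.getBytes(CHARSET), Ids.OPEN_ACL_UNSAFE,
          CreateMode.PERSISTENT);
    } else {
      // A version of -1 matches any version, so the update is unconditional
      zk.setData(path, value.getBytes(CHARSET), -1);
    }
  }
}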
The contract of the write() method is that a key with the given value is written to
ZooKeeper. It hides the difference between creating a new znode and updating an
existing znode with a new value, by testing first for the znode using the exists operation
and then performing the appropriate operation. The other detail worth mentioning is
the need to convert the string value to a byte array, for which we just use the
getBytes() method with a UTF-8 encoding.
To illustrate the use of the ActiveKeyValueStore, consider a ConfigUpdater class that
updates a configuration property with a value. The listing appears in Example 14-6.
Example 14-6. An application that updates a property in ZooKeeper at random times
import java.io.IOException;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.KeeperException;

public class ConfigUpdater {

  public static final String PATH = "/config";

  private ActiveKeyValueStore store;
  private Random random = new Random();

  public ConfigUpdater(String hosts) throws IOException, InterruptedException {
    store = new ActiveKeyValueStore();
    store.connect(hosts);
  }

  public void run() throws InterruptedException, KeeperException {
    while (true) {
      String value = random.nextInt(100) + "";
      store.write(PATH, value);
      System.out.printf("Set %s to %s\n", PATH, value);
      // Pause for a random interval of up to ten seconds
      TimeUnit.SECONDS.sleep(random.nextInt(10));
    }
  }

  public static void main(String[] args) throws Exception {
    ConfigUpdater configUpdater = new ConfigUpdater(args[0]);
    configUpdater.run();
  }
}
The program is simple. A ConfigUpdater has an ActiveKeyValueStore that connects to
ZooKeeper in ConfigUpdater's constructor. The run() method loops forever, updating
the /config znode at random times with random values.
Next, let's look at how to read the /config configuration property. First, we add a read
method to ActiveKeyValueStore:
  public String read(String path, Watcher watcher) throws InterruptedException,
      KeeperException {
    byte[] data = zk.getData(path, watcher, null/*stat*/);
    return new String(data, CHARSET);
  }
The getData() method of ZooKeeper takes the path, a Watcher, and a Stat object. The
Stat object is filled in with values by getData(), and is used to pass information back
to the caller. In this way, the caller can get both the data and the metadata for a znode,
although in this case, we pass a null Stat because we are not interested in the metadata.
As a consumer of the service, ConfigWatcher (see Example 14-7) creates an
ActiveKeyValueStore, and after starting, calls the store's read() method (in its displayConfig()
method) to pass a reference to itself as the watcher. It displays the initial value of the
configuration that it reads.
Example 14-7. An application that watches for updates of a property in ZooKeeper and prints them
to the console
public class ConfigWatcher implements Watcher {

private ActiveKeyValueStore store;

public ConfigWatcher(String hosts) throws IOException, InterruptedException {
store = new ActiveKeyValueStore();
store.connect(hosts);
}

public void displayConfig() throws InterruptedException, KeeperException {
String value = store.read(ConfigUpdater.PATH, this);
System.out.printf("Read %s as %s\n", ConfigUpdater.PATH, value);
}
@Override
public void process(WatchedEvent event) {
if (event.getType() == EventType.NodeDataChanged) {
try {
displayConfig();
} catch (InterruptedException e) {
System.err.println("Interrupted. Exiting.");
Thread.currentThread().interrupt();
} catch (KeeperException e) {
System.err.printf("KeeperException: %s. Exiting.\n", e);
}
}
}

public static void main(String[] args) throws Exception {
ConfigWatcher configWatcher = new ConfigWatcher(args[0]);
configWatcher.displayConfig();

// stay alive until process is killed or thread is interrupted
Thread.sleep(Long.MAX_VALUE);
}
}
When the ConfigUpdater updates the znode, ZooKeeper causes the watcher to fire with
an event type of EventType.NodeDataChanged. ConfigWatcher acts on this event in its
process() method by reading and displaying the latest version of the config.
Because watches are one-time signals, we tell ZooKeeper of the new watch each time
we call read() on ActiveKeyValueStore; this ensures we see future updates. Furthermore,
we are not guaranteed to receive every update, since between the receipt of the
watch event and the next read, the znode may have been updated, possibly many times,
and as the client has no watch registered during that period, it is not notified. For the
configuration service, this is not a problem because clients care only about the latest
value of a property, as it takes precedence over previous values, but in general you
should be aware of this potential limitation.
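The one-time nature of watches can be illustrated without a ZooKeeper server. The
following is a minimal in-memory sketch, not the ZooKeeper API: the OneShotWatchDemo
class and its nested Node class are hypothetical stand-ins that mimic only the fire-once
behavior described above. The client here deliberately does not re-read inside the
callback, which models a client that is slow to register its next watch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical in-memory model of a one-time watch; NOT the ZooKeeper API.
class OneShotWatchDemo {

  // Stand-in for a znode that supports a single pending watch.
  static class Node {
    private String data;
    private Consumer<String> watcher;

    Node(String initial) {
      this.data = initial;
    }

    // Returns the current value and registers a new watch.
    String read(Consumer<String> newWatcher) {
      this.watcher = newWatcher;
      return data;
    }

    // Stores the value; a pending watch fires once and is then cleared.
    void write(String value) {
      this.data = value;
      Consumer<String> pending = watcher;
      watcher = null;
      if (pending != null) {
        pending.accept(value);
      }
    }
  }

  // Writes three values while the client re-reads only once, at the end.
  static List<String> run() {
    Node node = new Node("0");
    List<String> seen = new ArrayList<>();
    seen.add("initial=" + node.read(v -> seen.add("notified")));
    node.write("79"); // the watch fires here
    node.write("14"); // no watch is registered, so this update is not signaled
    node.write("78"); // likewise not signaled
    seen.add("latest=" + node.read(v -> seen.add("notified")));
    return seen;
  }

  public static void main(String[] args) {
    System.out.println(run()); // [initial=0, notified, latest=78]
  }
}
```

Three writes produce a single notification, yet the final read still returns the latest
value, which is all the configuration service needs.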
Let's see the code in action. Launch the ConfigUpdater in one terminal window:
% java ConfigUpdater localhost
Set /config to 79
Set /config to 14
Set /config to 78
Then launch the ConfigWatcher in another window immediately afterward:
% java ConfigWatcher localhost
Read /config as 79
Read /config as 14
Read /config as 78
The Resilient ZooKeeper Application
The first of the Fallacies of Distributed Computing [8] states that "The network is
reliable." As they stand, the programs so far have been assuming a reliable network, so when
they run on a real network, they can fail in several ways. Let's examine possible failure
modes and what we can do to correct them so that our programs are resilient in the
face of failure.
8. See http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing.
Every ZooKeeper operation in the Java API declares two types of exception in its throws
clause: InterruptedException and KeeperException.
InterruptedException
An InterruptedException is thrown if the operation is interrupted. There is a standard
Java mechanism for canceling blocking methods, which is to call interrupt() on the
thread from which the blocking method was called. A successful cancellation will result
in an InterruptedException. ZooKeeper adheres to this standard, so you can cancel a
ZooKeeper operation in this way. Classes or libraries that use ZooKeeper should usually
propagate the InterruptedException so that their clients can cancel their operations. [9]
An InterruptedException does not indicate a failure, but rather that the operation has
been canceled, so in the configuration application example, it is appropriate to propagate
the exception, causing the application to terminate.
KeeperException
A KeeperException is thrown if the ZooKeeper server signals an error or if there is a
communication problem with the server. There are various subclasses of
KeeperException for different error cases. For example, KeeperException.NoNodeException
is a subclass of KeeperException that is thrown if you try to perform an operation
on a znode that doesn't exist.
Every subclass of KeeperException has a corresponding code with information about
the type of error. For example, for KeeperException.NoNodeException the code is
KeeperException.Code.NONODE (an enum value).
There are two ways then to handle KeeperException: either catch KeeperException and
test its code to determine what remedying action to take, or catch the equivalent
KeeperException subclasses and perform the appropriate action in each catch block.
KeeperExceptions fall into three broad categories.
State exceptions. A state exception occurs when the operation fails because it cannot be
applied to the znode tree. State exceptions usually happen because another process is
mutating a znode at the same time. For example, a setData operation with a version
number will fail with a KeeperException.BadVersionException if the znode is updated
by another process first, since the version number does not match. The programmer is
usually aware that this kind of conflict is possible and will code to deal with it.
Some state exceptions indicate an error in the program, such as
KeeperException.NoChildrenForEphemeralsException, which is thrown when trying to create a child
znode of an ephemeral znode.
9. For more detail, see the excellent article "Dealing with InterruptedException" by Brian Goetz.
Recoverable exceptions. Recoverable exceptions are those from which the application can
recover within the same ZooKeeper session. A recoverable exception is manifested by
KeeperException.ConnectionLossException, which means that the connection to
ZooKeeper has been lost. ZooKeeper will try to reconnect, and in most cases the
reconnection will succeed and ensure that the session is intact.
However, ZooKeeper cannot tell whether the operation that failed with
KeeperException.ConnectionLossException was applied. This is an example of partial failure (which
we introduced at the beginning of the chapter). The onus is therefore on the programmer
to deal with the uncertainty, and the action that should be taken depends on the
application.
At this point, it is useful to make a distinction between idempotent and nonidempotent
operations. An idempotent operation is one that may be applied one or more times
with the same result, such as a read request or an unconditional setData. These can
simply be retried.
A nonidempotent operation cannot be indiscriminately retried, as the effect of applying
it multiple times is not the same as applying it once. The program needs a way of
detecting whether its update was applied by encoding information in the znode's path
name or its data. We shall discuss how to deal with failed nonidempotent operations
in "Recoverable exceptions" on page 518, when we look at the implementation of a
lock service.
Unrecoverable exceptions. In some cases, the ZooKeeper session becomes invalid,
perhaps because of a timeout or because the session was closed (both get a
KeeperException.SessionExpiredException), or perhaps because authentication failed
(KeeperException.AuthFailedException). In any case, all ephemeral nodes associated with the
session will be lost, so the application needs to rebuild its state before reconnecting to
ZooKeeper.
A reliable configuration service
Going back to the write() method in ActiveKeyValueStore, recall that it is composed
of an exists operation followed by either a create or a setData:
public void write(String path, String value) throws InterruptedException,
KeeperException {
Stat stat = zk.exists(path, false);
if (stat == null) {
zk.create(path, value.getBytes(CHARSET), Ids.OPEN_ACL_UNSAFE,
CreateMode.PERSISTENT);
} else {
zk.setData(path, value.getBytes(CHARSET), -1);
}
}
Taken as a whole, the write() method is idempotent, so we can afford to unconditionally
retry it. Here's a modified version of the write() method that retries in a loop.
It is set to try a maximum number of retries (MAX_RETRIES) and sleeps for
RETRY_PERIOD_SECONDS between each attempt:
public void write(String path, String value) throws InterruptedException,
KeeperException {
int retries = 0;
while (true) {
try {
Stat stat = zk.exists(path, false);
if (stat == null) {
zk.create(path, value.getBytes(CHARSET), Ids.OPEN_ACL_UNSAFE,
CreateMode.PERSISTENT);
} else {
zk.setData(path, value.getBytes(CHARSET), stat.getVersion());
}
return; // the write succeeded, so stop retrying
} catch (KeeperException.SessionExpiredException e) {
throw e;
} catch (KeeperException e) {
if (retries++ == MAX_RETRIES) {
throw e;
}
// sleep then retry
TimeUnit.SECONDS.sleep(RETRY_PERIOD_SECONDS);
}
}
}
The code is careful not to retry KeeperException.SessionExpiredException, since when
a session expires, the ZooKeeper object enters the CLOSED state, from which it can never
reconnect (refer to Figure 14-3). We simply rethrow the exception [10] and let the caller
create a new ZooKeeper instance, so that the whole write() method can be retried. A
simple way to create a new instance is to create a new ConfigUpdater (which we've
actually renamed ResilientConfigUpdater) to recover from an expired session:
public static void main(String[] args) throws Exception {
while (true) {
try {
ResilientConfigUpdater configUpdater =
new ResilientConfigUpdater(args[0]);
configUpdater.run();
} catch (KeeperException.SessionExpiredException e) {
// start a new session
} catch (KeeperException e) {
// already retried, so exit
e.printStackTrace();
break;
}
}
}
10. Another way of writing the code would be to have a single catch block, just for KeeperException, and a
test to see whether its code has the value KeeperException.Code.SESSIONEXPIRED. Which method you use
is a matter of style, since they both behave in the same way.
An alternative way of dealing with session expiry would be to look for a KeeperState
of type Expired in the watcher (that would be the ConnectionWatcher in the example
here), and create a new connection when this is detected. This way, we would just keep
retrying in the write() method, even if we got a KeeperException.SessionExpiredException,
since the connection should eventually be reestablished. Regardless of the precise
mechanics of how we recover from an expired session, the important point is that it is
a different kind of failure from connection loss and needs to be handled differently.
There's actually another failure mode that we've ignored here. When
the ZooKeeper object is created, it tries to connect to a ZooKeeper server.
If the connection fails or times out, then it tries another server in the
ensemble. If, after trying all of the servers in the ensemble, it can't
connect, then it throws an IOException. The likelihood of all ZooKeeper
servers being unavailable is low; nevertheless, some applications may
choose to retry the operation in a loop until ZooKeeper is available.
This is just one strategy for retry handling; there are many others, such as using
exponential backoff, where the period between retries is multiplied by a constant each
time. The org.apache.hadoop.io.retry package in Hadoop Core is a set of utilities for
adding retry logic into your code in a reusable way, and it may be helpful for building
ZooKeeper applications.
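As an illustration of the exponential backoff strategy mentioned above, here is a minimal
sketch that uses only the Java standard library. The BackoffRetry class, its method names,
and its parameters are assumptions made for this example; they are not part of ZooKeeper
or of the org.apache.hadoop.io.retry package. A real ZooKeeper client would also rethrow
KeeperException.SessionExpiredException immediately rather than retry it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Illustrative helper; names and parameters are assumptions, not a Hadoop or ZooKeeper API.
class BackoffRetry {

  // Delay before retry number `attempt` (0-based): initialMillis * multiplier^attempt.
  static long delayMillis(long initialMillis, double multiplier, int attempt) {
    return (long) (initialMillis * Math.pow(multiplier, attempt));
  }

  // Calls op until it succeeds or maxRetries retries have failed, sleeping longer each time.
  static <T> T call(Callable<T> op, int maxRetries, long initialMillis,
      double multiplier) throws Exception {
    int retries = 0;
    while (true) {
      try {
        return op.call();
      } catch (InterruptedException e) {
        throw e; // cancellation is not a failure, so propagate it
      } catch (Exception e) {
        if (retries == maxRetries) {
          throw e; // give up after the last permitted retry
        }
        TimeUnit.MILLISECONDS.sleep(delayMillis(initialMillis, multiplier, retries));
        retries++;
      }
    }
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // Simulated operation that fails twice and succeeds on the third call.
    String result = call(() -> {
      if (++calls[0] < 3) {
        throw new java.io.IOException("simulated connection loss");
      }
      return "ok after " + calls[0] + " calls";
    }, 5, 10, 2.0);
    System.out.println(result); // ok after 3 calls
  }
}
```

With an initial delay of 10 ms and a multiplier of 2.0, the two retries in this run sleep
for 10 ms and 20 ms respectively.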
A Lock Service
A distributed lock is a mechanism for providing mutual exclusion between a collection
of processes. At any one time, only a single process may hold the lock. Distributed locks
can be used for leader election in a large distributed system, where the leader is the
process that holds the lock at any point in time.
Do not confuse ZooKeeper's own leader election with a general leader
election service, which can be built using ZooKeeper primitives (and in
fact one implementation is included with ZooKeeper). ZooKeeper's
own leader election is not exposed publicly, unlike the type of general
leader election service we are describing here, which is designed to be
used by distributed systems that need to agree upon a master process.
To implement a distributed lock using ZooKeeper, we use sequential znodes to impose
an order on the processes vying for the lock. The idea is simple: first designate a lock
znode, typically describing the entity being locked on, say /leader; then clients that want
to acquire the lock create sequential ephemeral znodes as children of the lock znode.
At any point in time, the client with the lowest sequence number holds the lock. For
example, if two clients create znodes at around the same time, /leader/lock-1
and /leader/lock-2, then the client that created /leader/lock-1 holds the lock, since its
znode has the lowest sequence number. The ZooKeeper service is the arbiter of order,
since it assigns the sequence numbers.
The lock may be released simply by deleting the znode /leader/lock-1; alternatively, if
the client process dies, it will be deleted by virtue of it being an ephemeral znode. The
client that created /leader/lock-2 will then hold the lock, since it has the next lowest
sequence number. It will be notified that it has the lock by creating a watch that fires
when znodes go away.
The pseudocode for lock acquisition is as follows:
1. Create an ephemeral sequential znode named lock- under the lock znode and
remember its actual path name (the return value of the create operation).
2. Get the children of the lock znode and set a watch.
3. If the path name of the znode created in 1 has the lowest number of the children
returned in 2, then the lock has been acquired. Exit.
4. Wait for the notification from the watch set in 2 and go to step 2.
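Step 3 of the pseudocode is a pure comparison, so it can be sketched without a ZooKeeper
connection. The LockCheck class below is a hypothetical helper, not part of ZooKeeper; it
assumes that the sequence number is the run of digits after the last hyphen in each child
name (ZooKeeper pads it with zeros to ten digits, as in the sample names).

```java
import java.util.Arrays;
import java.util.List;

// Illustrative helper for step 3 of the pseudocode; it is not part of ZooKeeper.
class LockCheck {

  // Parses the sequence number that follows the last '-' in a child znode name.
  static long sequenceOf(String childName) {
    return Long.parseLong(childName.substring(childName.lastIndexOf('-') + 1));
  }

  // True if ourChild has the lowest sequence number among the lock znode's children.
  static boolean holdsLock(String ourChild, List<String> children) {
    long ours = sequenceOf(ourChild);
    for (String child : children) {
      if (sequenceOf(child) < ours) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Children may be returned in any order, so the check must not rely on list position.
    List<String> children = Arrays.asList("lock-0000000002", "lock-0000000001");
    System.out.println(holdsLock("lock-0000000001", children)); // true
    System.out.println(holdsLock("lock-0000000002", children)); // false
  }
}
```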
The herd effect
Although this algorithm is correct, there are some problems with it. The first problem
is that this implementation suffers from the herd effect. Consider hundreds or thousands
of clients, all trying to acquire the lock. Each client places a watch on the lock
znode for changes in its set of children. Every time the lock is released, or another
process starts the lock acquisition process, the watch fires and every client receives a
notification. The herd effect refers to a large number of clients being notified of the
same event, when only a small number of them can actually proceed. In this case, only
one client will successfully acquire the lock, and the process of maintaining and sending
watch events to all clients causes traffic spikes, which put pressure on the ZooKeeper
servers.
To avoid the herd effect, the condition for notification needs to be refined. The key
observation for implementing locks is that a client needs to be notified only when the
child znode with the previous sequence number goes away, not when any child znode
is deleted (or created). In our example, if clients have created the znodes /leader/
lock-1, /leader/lock-2, and /leader/lock-3, then the client holding /leader/lock-3 only
needs to be notified when /leader/lock-2 disappears. It does not need to be notified
when /leader/lock-1 disappears or when a new znode /leader/lock-4 is added.
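The refined condition amounts to finding, among the children, the one with the highest
sequence number that is still lower than our own, and watching only that znode. The
sketch below shows that selection; PredecessorFinder is a hypothetical helper written for
this illustration and is not the logic of any particular ZooKeeper recipe.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative helper for avoiding the herd effect; it is not part of ZooKeeper.
class PredecessorFinder {

  // Parses the sequence number that follows the last '-' in a child znode name.
  static long sequenceOf(String childName) {
    return Long.parseLong(childName.substring(childName.lastIndexOf('-') + 1));
  }

  // Returns the child with the highest sequence number below ours,
  // or null if no such child exists (meaning we hold the lock).
  static String predecessor(String ourChild, List<String> children) {
    long ours = sequenceOf(ourChild);
    String best = null;
    long bestSeq = -1;
    for (String child : children) {
      long seq = sequenceOf(child);
      if (seq < ours && (best == null || seq > bestSeq)) {
        best = child;
        bestSeq = seq;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    List<String> children = Arrays.asList("lock-3", "lock-1", "lock-2");
    System.out.println(predecessor("lock-3", children)); // lock-2
    System.out.println(predecessor("lock-1", children)); // null
  }
}
```

A client would then set its watch on the returned znode only, so that each deletion wakes
at most one waiting client.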
Recoverable exceptions
Another problem with the lock algorithm as it stands is that it doesn't handle the case
when the create operation fails due to connection loss. Recall that in this case we do
not know if the operation succeeded or failed. Creating a sequential znode is a
nonidempotent operation, so we can't simply retry, since if the first create had
succeeded, we would have an orphaned znode that would never be deleted (until the
client session ended, at least). Deadlock would be the unfortunate result.
The problem is that after reconnecting, the client can't tell whether it created any of
the child znodes. By embedding an identifier in the znode name, if it suffers a connection
loss, it can check to see whether any of the children of the lock node have its identifier
in their name. If a child contains its identifier, it knows that the create operation
succeeded, and it shouldn't create another child znode. If no child has the identifier in its
name, then the client can safely create a new sequential child znode.
The client's session identifier is a long integer that is unique for the ZooKeeper service
and therefore ideal for the purpose of identifying a client across connection loss events.
The session identifier can be obtained by calling the getSessionId() method on the
ZooKeeper Java class.
The ephemeral sequential znode should be created with a name of the form
lock-<sessionId>-, so that when the sequence number is appended by ZooKeeper, the name
becomes lock-<sessionId>-<sequenceNumber>. The sequence numbers are unique to the
parent, not to the name of the child, so this technique allows the child znodes to identify
their creators as well as impose an order of creation.
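The naming scheme can be sketched as two small functions: one that builds the prefix
passed to create, and one that scans the children for a znode carrying our session
identifier after a connection loss. LockNodeName is a hypothetical helper for this
illustration, and the session identifiers in the example are made-up values.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative helper for the lock-<sessionId>-<sequenceNumber> scheme; not part of ZooKeeper.
class LockNodeName {

  // Name prefix to pass to create(); ZooKeeper appends the sequence number.
  static String prefixFor(long sessionId) {
    return "lock-" + sessionId + "-";
  }

  // Returns the child created by this session, or null if there is none,
  // in which case it is safe to create a new sequential child znode.
  static String findOwnChild(long sessionId, List<String> children) {
    String prefix = prefixFor(sessionId);
    for (String child : children) {
      if (child.startsWith(prefix)) {
        return child;
      }
    }
    return null;
  }

  // Ordering must use the trailing sequence number, not the whole name,
  // because the session identifier comes first in the name.
  static long sequenceOf(String childName) {
    return Long.parseLong(childName.substring(childName.lastIndexOf('-') + 1));
  }

  public static void main(String[] args) {
    List<String> children =
        Arrays.asList("lock-102-0000000008", "lock-101-0000000007");
    System.out.println(findOwnChild(102L, children)); // lock-102-0000000008
    System.out.println(findOwnChild(10L, children));  // null
    System.out.println(sequenceOf("lock-102-0000000008")); // 8
  }
}
```

Because the prefix ends with a hyphen, session 10 does not match a znode created by
session 101 or 102.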
Unrecoverable exceptions
If a client's ZooKeeper session expires, the ephemeral znode created by the client will
be deleted, effectively relinquishing the lock or at least forfeiting the client's turn to
acquire the lock. The application using the lock should realize that it no longer holds
the lock, clean up its state, and then start again by creating a new lock object and trying
to acquire it. Notice that it is the application that controls this process, not the lock
implementation, since it cannot second-guess how the application needs to clean up
its state.
Implementation
Implementing a distributed lock correctly is a delicate matter, since accounting for all
of the failure modes is nontrivial. ZooKeeper comes with a production-quality lock
implementation in Java called WriteLock that is very easy for clients to use.
More Distributed Data Structures and Protocols
There are many distributed data structures and protocols that can be built with
ZooKeeper, such as barriers, queues, and two-phase commit. One interesting thing to note
is that these are synchronous protocols, even though we use asynchronous ZooKeeper
primitives (such as notifications) to build them.
The ZooKeeper website describes several such data structures and protocols in
pseudocode. ZooKeeper comes with implementations of some of these standard recipes
(including locks, leader election, queues); they can be found in the recipes directory of
the distribution.
The Curator project (https://github.com/Netflix/curator) also provides an extensive set
of ZooKeeper recipes.
BookKeeper and Hedwig
BookKeeper is a highly available and reliable logging service. It can be used to provide
write-ahead logging, which is a common technique for ensuring data integrity in storage
systems. In a system using write-ahead logging, every write operation is written to the
transaction log before it is applied. Using this procedure, we don't have to write the
data to permanent storage after every write operation because in the event of a system
failure, the latest state may be recovered by replaying the transaction log for any writes
that had not been applied.
BookKeeper clients create logs called ledgers, and each record appended to a ledger is
called a ledger entry, which is simply a byte array. Ledgers are managed by bookies,
which are servers that replicate the ledger data. Note that ledger data is not stored in
ZooKeeper, only metadata is.
Traditionally, the challenge has been to make systems that use write-ahead logging
robust in the face of failure of the node writing the transaction log. This is usually done
by replicating the transaction log in some manner. Hadoop's HDFS namenode, for
instance, writes its edit log to multiple disks, one of which is typically an NFS-mounted
disk. However, in the event of failure of the primary, failover is still manual. By
providing logging as a highly available service, BookKeeper promises to make failover
transparent, since it can tolerate the loss of bookie servers. (In the case of HDFS
High-Availability, described on page 50, a BookKeeper-based edit log will remove the requirement
for using NFS for shared storage.)
Hedwig is a topic-based publish-subscribe system built on BookKeeper. Thanks to its
ZooKeeper underpinnings, Hedwig is a highly available service and guarantees message
delivery even if subscribers are offline for extended periods of time.
BookKeeper is a ZooKeeper subproject, and you can find more information on how to
use it, and Hedwig, at http://zookeeper.apache.org/bookkeeper/.
ZooKeeper in Production
In production, you should run ZooKeeper in replicated mode. Here we will cover some
of the considerations for running an ensemble of ZooKeeper servers. However, this
section is not exhaustive, so you should consult the ZooKeeper Administrator's
Guide for detailed up-to-date instructions, including supported platforms,
recommended hardware, maintenance procedures, and configuration properties.
Resilience and Performance
ZooKeeper machines should be located to minimize the impact of machine and network
failure. In practice, this means that servers should be spread across racks, power
supplies, and switches, so that the failure of any one of these does not cause the ensemble
to lose a majority of its servers.
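One way to sanity-check a proposed layout is to count how many servers would remain if
the most heavily populated rack (or power supply, or switch) failed, and compare that
with a majority of the ensemble. The sketch below does this arithmetic; QuorumCheck and
the term "failure domain" are assumptions introduced for the example, not ZooKeeper
terminology or API.

```java
// Illustrative arithmetic only; the class and the "failure domain" term are assumptions.
class QuorumCheck {

  // Smallest number of servers that constitutes a majority of the ensemble.
  static int majority(int ensembleSize) {
    return ensembleSize / 2 + 1;
  }

  // True if a majority remains after losing every server in any single failure domain.
  // Each array element is the number of servers placed in one rack, power supply, or switch.
  static boolean survivesAnySingleFailure(int[] serversPerDomain) {
    int total = 0;
    int largest = 0;
    for (int n : serversPerDomain) {
      total += n;
      largest = Math.max(largest, n);
    }
    return total - largest >= majority(total);
  }

  public static void main(String[] args) {
    // Five servers spread 2/2/1 across three racks: any one rack can fail.
    System.out.println(survivesAnySingleFailure(new int[] {2, 2, 1})); // true
    // Five servers spread 3/2 across two racks: losing the first rack loses the majority.
    System.out.println(survivesAnySingleFailure(new int[] {3, 2})); // false
  }
}
```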
For applications that require low-latency service (on the order of a few milliseconds),
it is important to run all the servers in an ensemble in a single data center. Some use
cases don't require low-latency responses, however, which makes it feasible to spread
servers across data centers (at least two per data center) for extra resilience. Example
applications in this category are leader election and distributed coarse-grained locking,
both of which have relatively infrequent state changes, so the overhead of a few tens of
milliseconds that inter-data-center messages incur is not significant to the overall
functioning of the service.
ZooKeeper has the concept of an observer node, which is like a non-voting
follower. Since they do not participate in the vote for consensus
during write requests, observers allow a ZooKeeper cluster to improve
read performance without hurting write performance. [11] Observers can
be used to good advantage to allow a ZooKeeper cluster to span data
centers without impacting latency as much as regular voting followers.
This is achieved by placing the voting members in one data center and
observers in the other.
ZooKeeper is a highly available system, and it is critical that it can perform its functions
in a timely manner. Therefore, ZooKeeper should run on machines that are dedicated
to ZooKeeper alone. Having other applications contend for resources can cause
ZooKeeper's performance to degrade significantly.
Configure ZooKeeper to keep its transaction log on a different disk drive from its
snapshots. By default, both go in the directory specified by the dataDir property, but by
specifying a location for dataLogDir, the transaction log will be written there. By having
its own dedicated device (not just a partition), a ZooKeeper server can maximize the
rate at which it writes log entries to disk, which it does sequentially, without seeking.
Since all writes go through the leader, write throughput does not scale by adding servers,
so it is crucial that writes are as fast as possible.
If the process swaps to disk, performance will be adversely affected. This can be avoided
by setting the Java heap size to less than the amount of unused physical memory on
the machine. The ZooKeeper scripts will source a file called java.env from its
configuration directory, and this can be used to set the JVMFLAGS environment variable to set
the heap size (and any other desired JVM arguments).
11. This is discussed in more detail in "Observers: Making ZooKeeper Scale Even Further" by
Henry Robinson.
Configuration
Each server in the ensemble of ZooKeeper servers has a numeric identifier that is unique
within the ensemble, and must fall between 1 and 255. The server number is specified
in plain text in a file named myid in the directory specified by the dataDir property.
Setting each server number is only half of the job. We also need to give all the servers
all the identities and network locations of the others in the ensemble. The ZooKeeper
configuration file must include a line for each server, of the form:
server.n=hostname:port:port
The value of n is replaced by the server number. There are two port settings: the first
is the port that followers use to connect to the leader, and the second is used for leader
election. Here is a sample configuration for a three-machine replicated ZooKeeper
ensemble:
tickTime=2000
dataDir=/disk1/zookeeper
dataLogDir=/disk2/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=zookeeper1:2888:3888
server.2=zookeeper2:2888:3888
server.3=zookeeper3:2888:3888
Servers listen on three ports: 2181 for client connections; 2888 for follower connections,
if they are the leader; and 3888 for other server connections during the leader election
phase. When a ZooKeeper server starts up, it reads the myid file to determine which
server it is, then reads the configuration file to determine the ports it should listen on,
as well as the network addresses of the other servers in the ensemble.
Clients connecting to this ZooKeeper ensemble should use
zookeeper1:2181,zookeeper2:2181,zookeeper3:2181 as the host string in the constructor
for the ZooKeeper object.
In replicated mode, there are two extra mandatory properties: initLimit and
syncLimit, both measured in multiples of tickTime.
initLimit is the amount of time to allow for followers to connect to and sync with the
leader. If a majority of followers fail to sync within this period, then the leader renounces
its leadership status and another leader election takes place. If this happens often (and
you can discover if this is the case because it is logged), it is a sign that the setting is too
low.
syncLimit is the amount of time to allow a follower to sync with the leader. If a follower
fails to sync within this period, it will restart itself. Clients that were attached to this
follower will connect to another one.
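Since initLimit and syncLimit are expressed in ticks, it can help to convert them to
milliseconds when reasoning about timeouts. The sketch below parses the sample
configuration shown earlier with java.util.Properties, computes both limits, and derives
the client host string. ZooCfgSketch and its method names are hypothetical, and the code
assumes that every server uses the same clientPort, as in the sample.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;
import java.util.TreeMap;

// Illustrative reader for the sample configuration; not how ZooKeeper itself is invoked.
class ZooCfgSketch {

  static final String SAMPLE = String.join("\n",
      "tickTime=2000",
      "dataDir=/disk1/zookeeper",
      "dataLogDir=/disk2/zookeeper",
      "clientPort=2181",
      "initLimit=5",
      "syncLimit=2",
      "server.1=zookeeper1:2888:3888",
      "server.2=zookeeper2:2888:3888",
      "server.3=zookeeper3:2888:3888");

  static Properties parse(String text) {
    Properties cfg = new Properties();
    try {
      cfg.load(new StringReader(text));
    } catch (IOException e) {
      throw new IllegalStateException(e); // not expected when reading from a string
    }
    return cfg;
  }

  // Converts a limit given in ticks (initLimit or syncLimit) to milliseconds.
  static long limitMillis(Properties cfg, String limitName) {
    long tickTime = Long.parseLong(cfg.getProperty("tickTime"));
    return tickTime * Long.parseLong(cfg.getProperty(limitName));
  }

  // Builds the client host string from the server.n entries, ordered by server number.
  // Assumes all servers listen for clients on the same clientPort.
  static String connectString(Properties cfg) {
    String clientPort = cfg.getProperty("clientPort");
    TreeMap<Integer, String> hosts = new TreeMap<>();
    for (String key : cfg.stringPropertyNames()) {
      if (key.startsWith("server.")) {
        int id = Integer.parseInt(key.substring("server.".length()));
        String host = cfg.getProperty(key).split(":")[0];
        hosts.put(id, host + ":" + clientPort);
      }
    }
    return String.join(",", hosts.values());
  }

  public static void main(String[] args) {
    Properties cfg = parse(SAMPLE);
    System.out.println(limitMillis(cfg, "initLimit")); // 10000
    System.out.println(limitMillis(cfg, "syncLimit")); // 4000
    System.out.println(connectString(cfg));
  }
}
```

With tickTime set to 2000 ms, the sample gives followers 10 seconds to connect and sync
initially, and 4 seconds to stay in sync afterward.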
These are the minimum settings needed to get up and running with a cluster of
ZooKeeper servers. There are, however, more configuration options, particularly for tuning
performance, documented in the ZooKeeper Administrator's Guide.
CHAPTER 15
Sqoop
Aaron Kimball
A great strength of the Hadoop platform is its ability to work with data in several
different forms. HDFS can reliably store logs and other data from a plethora of sources,
and MapReduce programs can parse diverse ad hoc data formats, extracting relevant
information and combining multiple data sets into powerful results.
But to interact with data in storage repositories outside of HDFS, MapReduce programs
need to use external APIs to get to this data. Often, valuable data in an organization is
stored in relational database systems (RDBMS). Sqoop is an open-source tool that
allows users to extract data from a relational database into Hadoop for further processing.
This processing can be done with MapReduce programs or other higher-level tools such
as Hive. (It's even possible to use Sqoop to move data from a relational database into
HBase.) When the final results of an analytic pipeline are available, Sqoop can export
these results back to the database for consumption by other clients.
In this chapter, we'll take a look at how Sqoop works and how you can use it in your
data processing pipeline.
Getting Sqoop
Sqoop is available in a few places. The primary home of the project is
http://incubator.apache.org/sqoop/. This repository contains all the Sqoop source code and
documentation. Official releases are available at this site, as well as the source code for the version
currently under development. The repository itself contains instructions for compiling
the project. Alternatively, Cloudera's Distribution for Hadoop contains an installation
package for Sqoop alongside compatible editions of Hadoop and other tools like Hive.
If you download a release from Apache, it will be placed in a directory such as
/home/yourname/sqoop-x.y.z/. We'll call this directory $SQOOP_HOME. You can run Sqoop by
running the executable script $SQOOP_HOME/bin/sqoop.
If you've installed a release from Cloudera, the package will have placed Sqoop's scripts
in standard locations like /usr/bin/sqoop. You can run Sqoop by simply typing sqoop at
the command line.
(Regardless of how you install Sqoop, we'll refer to this script as just sqoop from here
on.)
Running Sqoop with no arguments does not do much of interest:
% sqoop
Try sqoop help for usage.
Sqoop is organized as a set of tools or commands. Without selecting a tool, Sqoop does
not know what to do. help is the name of one such tool; it can print out the list of
available tools, like this:
% sqoop help
usage: sqoop COMMAND [ARGS]
Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
job Work with saved jobs
list-databases List available databases on a server
list-tables List available tables in a database
merge Merge results of incremental imports
metastore Run a standalone Sqoop metastore
version Display version information
See 'sqoop help COMMAND' for information on a specific command.
As it explains, the help tool can also provide specific usage instructions on a particular
tool, by providing that tool's name as an argument:
% sqoop help import
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]
Common arguments:
--connect <jdbc-uri> Specify JDBC connect string
--driver <class-name> Manually specify JDBC driver class to use
--hadoop-home <dir> Override $HADOOP_HOME
--help Print usage instructions
-P Read password from console
--password <password> Set authentication password
--username <username> Set authentication username
--verbose Print more information while working
...
An alternate way of running a Sqoop tool is to use a tool-specific script. This script will
be named sqoop-toolname. For example, sqoop-help, sqoop-import, etc. These
commands are identical to running sqoop help or sqoop import.
A Sample Import
After you install Sqoop, you can use it to import data to Hadoop.
Sqoop imports from databases. The list of databases that it has been tested with includes
MySQL, PostgreSQL, Oracle, SQL Server and DB2. For the examples in this chapter
we'll use MySQL, which is easy to use and available for a large number of platforms.
To install and configure MySQL, follow the documentation at
http://dev.mysql.com/doc/refman/5.1/en/. Chapter 2 ("Installing and Upgrading MySQL") in particular
should help. Users of Debian-based Linux systems (e.g., Ubuntu) can type sudo apt-get
install mysql-client mysql-server. RedHat users can type sudo yum install
mysql mysql-server.
Now that MySQL is installed, let's log in and create a database (Example 15-1).
Example 15-1. Creating a new MySQL database schema
% mysql -u root -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 349
Server version: 5.1.37-1ubuntu5.4 (Ubuntu)
Type 'help;' or '\h' for help. Type '\c' to clear the current input
statement.
mysql> CREATE DATABASE hadoopguide;
Query OK, 1 row affected (0.02 sec)
mysql> GRANT ALL PRIVILEGES ON hadoopguide.* TO '%'@'localhost';
Query OK, 0 rows affected (0.00 sec)
mysql> GRANT ALL PRIVILEGES ON hadoopguide.* TO ''@'localhost';
Query OK, 0 rows affected (0.00 sec)
mysql> quit;
Bye
The password prompt above asks for your root user password. This is likely the same
as the password for the root shell login. If you are running Ubuntu or another variant
of Linux where root cannot directly log in, then enter the password you picked at
MySQL installation time.
In this session, we created a new database schema called hadoopguide, which we'll use
throughout this chapter. We then allowed any local user to view and modify the
contents of the hadoopguide schema, and closed our session. [1]
Now let's log back into the database (not as root, but as yourself this time), and create
a table to import into HDFS (Example 15-2).
Example 15-2. Populating the database
% mysql hadoopguide
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 352
Server version: 5.1.37-1ubuntu5.4 (Ubuntu)
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> CREATE TABLE widgets(id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
-> widget_name VARCHAR(64) NOT NULL,
-> price DECIMAL(10,2),
-> design_date DATE,
-> version INT,
-> design_comment VARCHAR(100));
Query OK, 0 rows affected (0.00 sec)
mysql> INSERT INTO widgets VALUES (NULL, 'sprocket', 0.25, '2010-02-10',
-> 1, 'Connects two gizmos');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO widgets VALUES (NULL, 'gizmo', 4.00, '2009-11-30', 4,
-> NULL);
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO widgets VALUES (NULL, 'gadget', 99.99, '1983-08-13',
-> 13, 'Our flagship product');
Query OK, 1 row affected (0.00 sec)
mysql> quit;
In the above listing, we created a new table called widgets. We'll be using this fictional
product database in further examples in this chapter. The widgets table contains several
fields representing a variety of data types.
Now let's use Sqoop to import this table into HDFS:
% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets -m 1
10/06/23 14:44:18 INFO tool.CodeGenTool: Beginning code generation
...
10/06/23 14:44:20 INFO mapred.JobClient: Running job: job_201006231439_0002
10/06/23 14:44:21 INFO mapred.JobClient: map 0% reduce 0%
10/06/23 14:44:32 INFO mapred.JobClient: map 100% reduce 0%
10/06/23 14:44:34 INFO mapred.JobClient: Job complete: job_201006231439_0002
...
10/06/23 14:44:34 INFO mapreduce.ImportJobBase: Retrieved 3 records.
1. Of course, in a production deployment, we'd need to be much more careful about access control, but
this serves for demonstration purposes. The above privilege grant also assumes you're running a
pseudo-distributed Hadoop instance. If you're working with a distributed Hadoop cluster, you'd need to enable
remote access by at least one user, whose account will be used to perform imports and exports via Sqoop.
Sqoop's import tool will run a MapReduce job that connects to the MySQL database
and reads the table. By default, this will use four map tasks in parallel to speed up the
import process. Each task will write its imported results to a different file, but all in a
common directory. Since we knew that we had only three rows to import in this
example, we specified that Sqoop should use a single map task (-m 1) so we get a single
file in HDFS.
We can inspect this file's contents like so:
% hadoop fs -cat widgets/part-m-00000
1,sprocket,0.25,2010-02-10,1,Connects two gizmos
2,gizmo,4.00,2009-11-30,4,null
3,gadget,99.99,1983-08-13,13,Our flagship product
The connect string (jdbc:mysql://localhost/hadoopguide) shown in the
example will read from a database on the local machine. If a distributed
Hadoop cluster is being used, then localhost should not be specified in
the connect string; map tasks not running on the same machine as the
database will fail to connect. Even if Sqoop is run from the same host
as the database server, the full hostname should be specified.
By default, Sqoop will generate comma-delimited text files for our imported data.
Delimiters can be explicitly specified, as well as field enclosing and escape characters to
allow the presence of delimiters in the field contents. The command-line arguments
that specify delimiter characters, file formats, compression, and more fine-grained
control of the import process are described in the Sqoop User Guide distributed with
Sqoop,[2] as well as in the online help (sqoop help import, or man sqoop-import in CDH).
2. Available from the website at http://incubator.apache.org/sqoop.
Text and binary file formats
Sqoop is capable of importing into a few different file formats. Text files
(the default) offer a human-readable representation of data, platform
independence, and the simplest structure. However, they cannot hold
binary fields (such as database columns of type VARBINARY) and cannot
distinguish between null values and String-based fields containing the
value "null".
To handle these conditions, you can use either Sqoop's
SequenceFile-based format or its Avro-based format. Both Avro data
files and SequenceFiles provide the most precise representation of the
imported data possible. They also allow data to be compressed while
retaining MapReduce's ability to process different sections of the same
file in parallel. However, current versions of Sqoop cannot load either
Avro or SequenceFiles into Hive (although you can load Avro data files
into Hive manually). A final disadvantage of SequenceFiles is that they
are Java-specific, whereas Avro data files can be processed by a wide
range of languages.
Generated Code
In addition to writing the contents of the database table to HDFS, Sqoop has also
provided you with a generated Java source file (widgets.java) written to the current local
directory. (After running the sqoop import command above, you can see this file by
running ls widgets.java.)
Code generation is a necessary part of Sqoop's import process; as you'll learn in
"Database Imports: A Deeper Look" on page 531, Sqoop uses generated code to
handle the deserialization of table-specific data from the database source before writing
it to HDFS.
The generated class (widgets) is capable of holding a single record retrieved from the
imported table. It can manipulate such a record in MapReduce or store it in a
SequenceFile in HDFS. (SequenceFiles written by Sqoop during the import process will store
each imported row in the "value" element of the SequenceFile's key-value pair format,
using the generated class.)
It is likely that you don't want to name your generated class widgets since each instance
of the class refers to only a single record. We can use a different Sqoop tool to generate
source code without performing an import; this tool still examines the
database table to determine the appropriate data types for each field:
% sqoop codegen --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets --class-name Widget
The codegen tool simply generates code; it does not perform the full import. We
specified that we'd like it to generate a class named Widget; this will be written to
Widget.java. We also could have specified --class-name and other code-generation
arguments during the import process we performed earlier. This tool can be used to
regenerate code, if you accidentally remove the source file, or generate code with
different settings than were used during the import.
If you're working with records imported to SequenceFiles, it is inevitable that you'll
need to use the generated classes (to deserialize data from the SequenceFile storage).
You can work with text file-based records without using generated code, but as we'll
see in "Working with Imported Data" on page 535, Sqoop's generated code can
handle some tedious aspects of data processing for you.
Additional Serialization Systems
Recent versions of Sqoop support Avro-based serialization and schema generation as
well (see "Avro" on page 112), allowing you to use Sqoop in your project without
integrating with generated code.
Database Imports: A Deeper Look
As mentioned earlier, Sqoop imports a table from a database by running a MapReduce
job that extracts rows from the table, and writes the records to HDFS. How does
MapReduce read the rows? This section explains how Sqoop works under the hood.
At a high level, Figure 15-1 demonstrates how Sqoop interacts with both the database
source and Hadoop. Like Hadoop itself, Sqoop is written in Java. Java provides an API
called Java Database Connectivity, or JDBC, that allows applications to access data
stored in an RDBMS as well as inspect the nature of this data. Most database vendors
provide a JDBC driver that implements the JDBC API and contains the necessary code
to connect to their database server.
Figure 15-1. Sqoop's import process
Based on the URL in the connect string used to access the database,
Sqoop attempts to predict which driver it should load. You may still
need to download the JDBC driver itself and install it on your Sqoop
client. For cases where Sqoop does not know which JDBC driver is
appropriate, users can specify exactly how to load the JDBC driver into
Sqoop. This capability allows Sqoop to work with a wide variety of
database platforms.
Before the import can start, Sqoop uses JDBC to examine the table it is to import. It
retrieves a list of all the columns and their SQL data types. These SQL types (VARCHAR,
INTEGER, and so on) can then be mapped to Java data types (String, Integer, etc.), which
will hold the field values in MapReduce applications. Sqoop's code generator will use
this information to create a table-specific class to hold a record extracted from the table.
The Widget class from earlier, for example, contains the following methods that retrieve
each column from an extracted record:
public Integer get_id();
public String get_widget_name();
public java.math.BigDecimal get_price();
public java.sql.Date get_design_date();
public Integer get_version();
public String get_design_comment();
More critical to the import system's operation, though, are the serialization methods
that form the DBWritable interface, which allow the Widget class to interact with JDBC:
public void readFields(ResultSet __dbResults) throws SQLException;
public void write(PreparedStatement __dbStmt) throws SQLException;
JDBC's ResultSet interface provides a cursor that retrieves records from a query; the
readFields() method here will populate the fields of the Widget object with the columns
from one row of the ResultSet's data. The write() method shown above allows Sqoop
to insert new Widget rows into a table, a process called exporting. Exports are discussed
in "Performing an Export" on page 540.
The MapReduce job launched by Sqoop uses an InputFormat that can read sections of
a table from a database via JDBC. The DataDrivenDBInputFormat provided with Hadoop
partitions a query's results over several map tasks.
Reading a table is typically done with a simple query such as:
SELECT col1,col2,col3,... FROM tableName
But often, better import performance can be gained by dividing this query across
multiple nodes. This is done using a splitting column. Using metadata about the table, Sqoop
will guess a good column to use for splitting the table (typically the primary key for the
table, if one exists). The minimum and maximum values for the primary key column
are retrieved, and then these are used in conjunction with a target number of tasks to
determine the queries that each map task should issue.
For example, suppose the widgets table had 100,000 entries, with the id column
containing values 0 through 99,999. When importing this table, Sqoop would determine
that id is the primary key column for the table. When starting the MapReduce job, the
DataDrivenDBInputFormat used to perform the import would then issue a statement such
as SELECT MIN(id), MAX(id) FROM widgets. These values would then be used to
interpolate over the entire range of data. Assuming we specified that 5 map tasks should
run in parallel (with -m 5), this would result in each map task executing queries such
as: SELECT id, widget_name, ... FROM widgets WHERE id >= 0 AND id < 20000, SELECT
id, widget_name, ... FROM widgets WHERE id >= 20000 AND id < 40000, and so on.
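The interpolation described above can be sketched in a few lines of Java. This is a simplified illustration for an integer splitting column, not the code that DataDrivenDBInputFormat actually runs; the class name SplitSketch and the method splitClauses() are hypothetical, and the choice to close the last range at the maximum value is an assumption made here so that the largest id is not dropped.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    /**
     * Divides the inclusive range [min, max] of an integer splitting column
     * into at most numTasks contiguous ranges, one WHERE condition per map task.
     */
    public static List<String> splitClauses(String column, long min, long max, int numTasks) {
        List<String> clauses = new ArrayList<>();
        long span = max - min + 1; // number of possible column values
        for (int i = 0; i < numTasks; i++) {
            long lo = min + span * i / numTasks;
            long hi = min + span * (i + 1) / numTasks;
            if (lo >= hi) {
                continue; // more tasks than values: this task would have no work
            }
            if (i == numTasks - 1) {
                // Close the final range so the maximum value itself is included.
                clauses.add(column + " >= " + lo + " AND " + column + " <= " + max);
            } else {
                clauses.add(column + " >= " + lo + " AND " + column + " < " + hi);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Bounds as returned by SELECT MIN(id), MAX(id) FROM widgets in the example.
        for (String clause : splitClauses("id", 0, 99999, 5)) {
            System.out.println("SELECT id, widget_name, ... FROM widgets WHERE " + clause);
        }
    }
}
```

With the bounds from the example (0 and 99,999) and five tasks, the first two conditions are the ones quoted in the text. Note that the ranges are equal in width, not in row count, which is why a skewed id column leads to unbalanced tasks.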
The choice of splitting column is essential to efficiently parallelizing work. If the id
column were not uniformly distributed (perhaps there are no widgets with IDs between
50,000 and 75,000), then some map tasks may have little or no work to perform,
whereas others have a great deal. Users can specify a particular splitting column when
running an import job, to tune the job to the data's actual distribution. If an import
job is run as a single (sequential) task with -m 1, then this split process is not performed.
After generating the deserialization code and configuring the InputFormat, Sqoop sends
the job to the MapReduce cluster. Map tasks execute the queries and deserialize rows
from the ResultSet into instances of the generated class, which are either stored directly
in SequenceFiles or transformed into delimited text before being written to HDFS.
Controlling the Import
Sqoop does not need to import an entire table at a time. For example, a subset of the
table's columns can be specified for import. Users can also specify a WHERE clause to
include in queries, which bounds the rows of the table to import. For example, if widgets
0 through 99,999 were imported last month, but this month our vendor catalog
included 1,000 new types of widget, an import could be configured with the clause
WHERE id >= 100000; this will start an import job retrieving all the new rows added to
the source database since the previous import run. User-supplied WHERE clauses are
applied before task splitting is performed, and are pushed down into the queries
executed by each task.
Imports and Consistency
When importing data to HDFS, it is important that you ensure access to a consistent
snapshot of the source data. Map tasks reading from a database in parallel are running
in separate processes. Thus, they cannot share a single database transaction. The best
way to ensure consistency is to disable any processes that update existing rows of a table
for the duration of the import.
Direct-mode Imports
Sqoop's architecture allows it to choose from multiple available strategies for
performing an import. Most databases will use the DataDrivenDBInputFormat-based approach
described above. Some databases offer specific tools designed to extract data quickly.
For example, MySQL's mysqldump application can read from a table with greater
throughput than a JDBC channel. The use of these external tools is referred to as direct
mode in Sqoop's documentation. Direct mode must be specifically enabled by the user
(via the --direct argument), as it is not as general-purpose as the JDBC approach. (For
example, MySQL's direct mode cannot handle large objects, such as CLOB or BLOB columns,
as Sqoop needs to use a JDBC-specific API to load these columns into HDFS.)
For databases that provide such tools, Sqoop can use these to great effect. A
direct-mode import from MySQL is usually much more efficient (in terms of map tasks and
time required) than a comparable JDBC-based import. Sqoop will still launch multiple
map tasks in parallel. These tasks will then spawn instances of the mysqldump program
and read its output. The effect is similar to a distributed implementation of
mk-parallel-dump from the Maatkit tool set. Sqoop can also perform direct-mode imports
from PostgreSQL.
Even when direct mode is used to access the contents of a database, the metadata is
still queried through JDBC.
Working with Imported Data
Once data has been imported to HDFS, it is ready for processing by custom
MapReduce programs. Text-based imports can be easily used in scripts run with Hadoop
Streaming or in MapReduce jobs run with the default TextInputFormat.
To use individual fields of an imported record, though, the field delimiters (and any
escape/enclosing characters) must be parsed and the field values extracted and
converted to the appropriate data types. For example, the id of the "sprocket" widget is
represented as the string "1" in the text file, but should be parsed into an Integer or
int variable in Java. The generated table class provided by Sqoop can automate this
process, allowing you to focus on the actual MapReduce job to run. Each
auto-generated class has several overloaded methods named parse() that operate on the data
represented as Text, CharSequence, char[], or other common types.
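To make the parsing work concrete, here is a hand-written sketch of the kind of conversion involved. It is not the code that Sqoop generates: the class WidgetRecordSketch, its public fields, and its helper methods are hypothetical stand-ins, and the sketch assumes comma-delimited text with no escaped or enclosed delimiters.

```java
import java.math.BigDecimal;
import java.sql.Date;

public class WidgetRecordSketch {
    public Integer id;
    public String widgetName;
    public BigDecimal price;
    public Date designDate;
    public Integer version;
    public String designComment;

    // Text imports write SQL NULL as the literal string "null".
    private static String orNull(String field) {
        return "null".equals(field) ? null : field;
    }

    private static Integer toInteger(String field) {
        String s = orNull(field);
        return s == null ? null : Integer.valueOf(s);
    }

    private static BigDecimal toDecimal(String field) {
        String s = orNull(field);
        return s == null ? null : new BigDecimal(s);
    }

    private static Date toDate(String field) {
        String s = orNull(field);
        return s == null ? null : Date.valueOf(s); // expects yyyy-mm-dd
    }

    /** Parses one line of a comma-delimited text import of the widgets table. */
    public static WidgetRecordSketch parse(CharSequence line) {
        // Naive split: assumes no field contains an escaped or enclosed comma.
        String[] f = line.toString().split(",", -1);
        if (f.length != 6) {
            throw new IllegalArgumentException("Expected 6 fields, got " + f.length);
        }
        WidgetRecordSketch w = new WidgetRecordSketch();
        w.id = toInteger(f[0]);
        w.widgetName = orNull(f[1]);
        w.price = toDecimal(f[2]);
        w.designDate = toDate(f[3]);
        w.version = toInteger(f[4]);
        w.designComment = orNull(f[5]);
        return w;
    }

    public static void main(String[] args) {
        WidgetRecordSketch w = parse("1,sprocket,0.25,2010-02-10,1,Connects two gizmos");
        System.out.println(w.id + " " + w.widgetName + " " + w.price + " " + w.designDate);
    }
}
```

Even this small sketch has to decide how to treat the string "null", which is the ambiguity of text files mentioned earlier in the chapter. Letting the generated class do this work avoids repeating such decisions in every job.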
The MapReduce application called MaxWidgetId (available in the example code) will
find the widget with the highest ID.
The class can be compiled into a JAR file along with Widget.java. Both Hadoop
(hadoop-core-version.jar) and Sqoop (sqoop-version.jar) will need to be on the classpath
for compilation. The class files can then be combined into a JAR file and executed like
so:
% jar cvvf widgets.jar *.class
% HADOOP_CLASSPATH=/usr/lib/sqoop/sqoop-version.jar hadoop jar \
> widgets.jar MaxWidgetId -libjars /usr/lib/sqoop/sqoop-version.jar
This command line ensures that Sqoop is on the classpath locally (via $HADOOP_CLASSPATH),
when running the MaxWidgetId.run() method, as well as when map tasks are
running on the cluster (via the -libjars argument).
When run, the maxwidgets path in HDFS will contain a file named part-r-00000 with
the following expected result:
3,gadget,99.99,1983-08-13,13,Our flagship product
It is worth noting that in this example MapReduce program, a Widget object was
emitted from the mapper to the reducer; the auto-generated Widget class implements
the Writable interface provided by Hadoop, which allows the object to be sent via
Hadoop's serialization mechanism, as well as written to and read from SequenceFiles.
The MaxWidgetId example is built on the new MapReduce API. MapReduce applications
that rely on Sqoop-generated code can be built on the new or old APIs, though some
advanced features (such as working with large objects) are more convenient to use in
the new API.
Avro-based imports can be processed using the APIs described in "Avro MapReduce"
on page 126. With the Generic Avro mapping the MapReduce program does not
need to use schema-specific generated code (although this is an option too, by using
Avro's Specific compiler; Sqoop does not do the code generation in this case). The
example code includes a program called MaxWidgetIdGenericAvro, which finds the
widget with the highest ID and writes out the result in an Avro data file.
Imported Data and Hive
As noted in Chapter 12, for many types of analysis, using a system like Hive to handle
relational operations can dramatically ease the development of the analytic pipeline.
Especially for data originally from a relational data source, using Hive makes a lot of
sense. Hive and Sqoop together form a powerful toolchain for performing analysis.
Suppose we had another log of data in our system, coming from a web-based widget
purchasing system. This may return log files containing a widget id, a quantity, a
shipping address, and an order date.
Here is a snippet from an example log of this type:
1,15,120 Any St.,Los Angeles,CA,90210,2010-08-01
3,4,120 Any St.,Los Angeles,CA,90210,2010-08-01
2,5,400 Some Pl.,Cupertino,CA,95014,2010-07-30
2,7,88 Mile Rd.,Manhattan,NY,10005,2010-07-18
By using Hadoop to analyze this purchase log, we can gain insight into our sales
operation. By combining this data with the data extracted from our relational data source
(the widgets table), we can do better. In this example session, we will compute which
zip code is responsible for the most sales dollars, so we can better focus our sales team's
operations. Doing this requires data from both the sales log and the widgets table.
The log snippet above should be in a local file named sales.log for this to work.
First, let's load the sales data into Hive:
hive> CREATE TABLE sales(widget_id INT, qty INT,
> street STRING, city STRING, state STRING,
> zip INT, sale_date STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
OK
Time taken: 5.248 seconds
hive> LOAD DATA LOCAL INPATH "sales.log" INTO TABLE sales;
Copying data from file:/home/sales.log
Loading data to table sales
OK
Time taken: 0.188 seconds
Sqoop can generate a Hive table based on a table from an existing relational data source.
Since we've already imported the widgets data to HDFS, we can generate the Hive table
definition and then load in the HDFS-resident data:
% sqoop create-hive-table --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets --fields-terminated-by ','
...
10/06/23 18:05:34 INFO hive.HiveImport: OK
10/06/23 18:05:34 INFO hive.HiveImport: Time taken: 3.22 seconds
10/06/23 18:05:35 INFO hive.HiveImport: Hive import complete.
% hive
hive> LOAD DATA INPATH "widgets" INTO TABLE widgets;
Loading data to table widgets
OK
Time taken: 3.265 seconds
When creating a Hive table definition with a specific already-imported dataset in mind,
we need to specify the delimiters used in that dataset. Otherwise, Sqoop will allow Hive
to use its default delimiters (which are different from Sqoop's default delimiters).
Hive's type system is less rich than that of most SQL systems. Many
SQL types do not have direct analogues in Hive. When Sqoop generates
a Hive table definition for an import, it uses the best Hive type available
to hold a column's values. This may result in a decrease in precision.
When this occurs, Sqoop will provide you with a warning message, such
as this one:
10/06/23 18:09:36 WARN hive.TableDefWriter:
Column design_date had to be
cast to a less precise type in Hive
This three-step process of importing data to HDFS, creating the Hive table, and then
loading the HDFS-resident data into Hive can be shortened to one step if you know
that you want to import straight from a database directly into Hive. During an import,
Sqoop can generate the Hive table definition and then load in the data. Had we not
already performed the import, we could have executed this command, which re-creates
the widgets table in Hive, based on the copy in MySQL:
% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets -m 1 --hive-import
The sqoop import tool run with the --hive-import argument will load
the data directly from the source database into Hive; it infers a Hive
schema automatically based on the schema for the table in the source
database. Using this, you can get started working with your data in Hive
with only one command.
Regardless of which data import route we chose, we can now use the widgets data set
and the sales data set together to calculate the most profitable zip code. Let's do so,
and also save the result of this query in another table for later:
hive> CREATE TABLE zip_profits (sales_vol DOUBLE, zip INT);
OK
hive> INSERT OVERWRITE TABLE zip_profits
> SELECT SUM(w.price * s.qty) AS sales_vol, s.zip FROM SALES s
> JOIN widgets w ON (s.widget_id = w.id) GROUP BY s.zip;
...
3 Rows loaded to zip_profits
OK
hive> SELECT * FROM zip_profits ORDER BY sales_vol DESC;
...
OK
403.71 90210
28.0 10005
20.0 95014
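As a sanity check on these figures, the same join and aggregation can be reproduced by hand for the three widgets and four log lines shown earlier. The following Java sketch is only an in-memory illustration of the arithmetic; the class ZipProfitsSketch and its method salesByZip() are hypothetical names, and nothing here involves Hive or Sqoop.

```java
import java.math.BigDecimal;
import java.util.Map;
import java.util.TreeMap;

public class ZipProfitsSketch {
    /** Sums price * qty per zip code, mirroring the Hive query's JOIN and GROUP BY. */
    public static Map<Integer, BigDecimal> salesByZip() {
        // id -> price, copied from the rows inserted into the widgets table.
        Map<Integer, BigDecimal> price = new TreeMap<>();
        price.put(1, new BigDecimal("0.25"));   // sprocket
        price.put(2, new BigDecimal("4.00"));   // gizmo
        price.put(3, new BigDecimal("99.99"));  // gadget

        // {widget_id, qty, zip}, copied from the sales log snippet.
        int[][] sales = {
            {1, 15, 90210},
            {3, 4, 90210},
            {2, 5, 95014},
            {2, 7, 10005},
        };

        Map<Integer, BigDecimal> total = new TreeMap<>();
        for (int[] sale : sales) {
            BigDecimal amount = price.get(sale[0]).multiply(BigDecimal.valueOf(sale[1]));
            total.merge(sale[2], amount, BigDecimal::add);
        }
        return total;
    }

    public static void main(String[] args) {
        salesByZip().forEach((zip, volume) -> System.out.println(volume + "\t" + zip));
    }
}
```

For zip code 90210 the total is 15 × 0.25 + 4 × 99.99 = 3.75 + 399.96 = 403.71, and the gizmo sales give 5 × 4.00 = 20.00 for 95014 and 7 × 4.00 = 28.00 for 10005, which agrees with the query output.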
Importing Large Objects
Most databases provide the capability to store large amounts of data in a single field.
Depending on whether this data is textual or binary in nature, it is usually represented
as a CLOB or BLOB column in the table. These "large objects" are often handled specially
by the database itself. In particular, most tables are physically laid out on disk as in
Figure 15-2. When scanning through rows to determine which rows match the criteria
for a particular query, this typically involves reading all columns of each row from disk.
If large objects were stored "inline" in this fashion, they would adversely affect the
performance of such scans. Therefore, large objects are often stored externally from
their rows, as in Figure 15-3. Accessing a large object often requires "opening" it
through the reference contained in the row.
Figure 15-2. Database tables are typically physically represented as an array of rows, with all the
columns in a row stored adjacent to one another
The difficulty of working with large objects in a database suggests that a system such
as Hadoop, which is much better suited to storing and processing large, complex data
objects, is an ideal repository for such information. Sqoop can extract large objects
from tables and store them in HDFS for further processing.
As in a database, MapReduce typically materializes every record before passing it along
to the mapper. If individual records are truly large, this can be very inefficient.
As shown earlier, records imported by Sqoop are laid out on disk in a fashion very
similar to a database's internal structure: an array of records with all fields of a record
concatenated together. When running a MapReduce program over imported records,
each map task must fully materialize all fields of each record in its input split. If the
contents of a large object field are only relevant for a small subset of the total number
of records used as input to a MapReduce program, it would be inefficient to fully
materialize all these records. Furthermore, depending on the size of the large object, full
materialization in memory may be impossible.
To overcome these difficulties, Sqoop will store imported large objects in a separate
file called a LobFile. The LobFile format can store individual records of very large size
(a 64-bit address space is used). Each record in a LobFile holds a single large object.
The LobFile format allows clients to hold a reference to a record without accessing the
record contents. When records are accessed, this is done through a java.io.InputStream
(for binary objects) or java.io.Reader (for character-based objects).
When a record is imported, the "normal" fields will be materialized together in a text
file, along with a reference to the LobFile where a CLOB or BLOB column is stored.
For example, suppose our widgets table contained a BLOB field named schematic
holding the actual schematic diagram for each widget.
An imported record might then look like:
2,gizmo,4.00,2009-11-30,4,null,externalLob(lf,lobfile0,100,5011714)
The externalLob(...) text is a reference to an externally stored large object, stored in
LobFile format (lf) in a file named lobfile0, with the specified byte offset and length
inside that file.
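The four parts of such a reference can be pulled apart with a small parser. The sketch below is purely illustrative: ExternalLobRefSketch is a hypothetical class, not part of Sqoop (which provides its own BlobRef and ClobRef classes for this purpose), and the regular expression assumes the reference has exactly the shape shown in the example record.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExternalLobRefSketch {
    public final String format; // storage format code, "lf" for LobFile
    public final String file;   // name of the file holding the object
    public final long offset;   // byte offset of the object inside that file
    public final long length;   // length of the object in bytes

    // Matches text of the form externalLob(format,file,offset,length).
    private static final Pattern REF =
        Pattern.compile("externalLob\\(([^,]+),([^,]+),(\\d+),(\\d+)\\)");

    private ExternalLobRefSketch(String format, String file, long offset, long length) {
        this.format = format;
        this.file = file;
        this.offset = offset;
        this.length = length;
    }

    /** Parses a reference string without touching the large object it points to. */
    public static ExternalLobRefSketch parse(String text) {
        Matcher m = REF.matcher(text.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an external LOB reference: " + text);
        }
        return new ExternalLobRefSketch(
            m.group(1), m.group(2), Long.parseLong(m.group(3)), Long.parseLong(m.group(4)));
    }

    public static void main(String[] args) {
        ExternalLobRefSketch ref = parse("externalLob(lf,lobfile0,100,5011714)");
        System.out.println(ref.file + " @ " + ref.offset + " (" + ref.length + " bytes)");
    }
}
```

The point of the design is visible even in the sketch: holding the reference costs a few dozen bytes, while the roughly 5 MB object it describes stays on disk until something asks for it.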
When working with this record, the Widget.get_schematic() method would return an
object of type BlobRef referencing the schematic column, but not actually containing
its contents. The BlobRef.getDataStream() method actually opens the LobFile and
returns an InputStream allowing you to access the schematic field's contents.
When running a MapReduce job processing many Widget records, you might need to
access the schematic field of only a handful of records. This system allows you to incur
the I/O costs of accessing only the required large object entries, as individual schematics
may be several megabytes or more of data.
Figure 15-3. Large objects are usually held in a separate area of storage; the main row storage contains
indirect references to the large objects
The BlobRef and ClobRef classes cache references to underlying LobFiles within a map
task. If you do access the schematic field of several sequentially ordered records, they
will take advantage of the existing file pointer's alignment on the next record body.
Performing an Export
In Sqoop, an import refers to the movement of data from a database system into HDFS.
By contrast, an export uses HDFS as the source of data and a remote database as the
destination. In the previous sections, we imported some data and then performed some
analysis using Hive. We can export the results of this analysis to a database for
consumption by other tools.
Before exporting a table from HDFS to a database, we must prepare the database to
receive the data by creating the target table. While Sqoop can infer which Java types
are appropriate to hold SQL data types, this translation does not work in both directions
(for example, there are several possible SQL column definitions that can hold data in
a Java String; this could be CHAR(64), VARCHAR(200), or something else entirely).
Consequently, you must determine which types are most appropriate.
We are going to export the zip_profits table from Hive. We need to create a table in
MySQL that has target columns in the same order, with the appropriate SQL types:
% mysql hadoopguide
mysql> CREATE TABLE sales_by_zip (volume DECIMAL(8,2), zip INTEGER);
Query OK, 0 rows affected (0.01 sec)
Then we run the export command:
% sqoop export --connect jdbc:mysql://localhost/hadoopguide -m 1 \
> --table sales_by_zip --export-dir /user/hive/warehouse/zip_profits \
> --input-fields-terminated-by '\0001'
...
10/07/02 16:16:50 INFO mapreduce.ExportJobBase: Transferred 41 bytes in 10.8947
seconds (3.7633 bytes/sec)
10/07/02 16:16:50 INFO mapreduce.ExportJobBase: Exported 3 records.
Finally, we can verify that the export worked by checking MySQL:
% mysql hadoopguide -e 'SELECT * FROM sales_by_zip'
+--------+-------+
| volume | zip |
+--------+-------+
| 28.00 | 10005 |
| 403.71 | 90210 |
| 20.00 | 95014 |
+--------+-------+
When we created the zip_profits table in Hive, we did not specify any delimiters. So
Hive used its default delimiters: a Ctrl-A character (Unicode 0x0001) between fields,
and a newline at the end of each record. When we used Hive to access the contents of
this table (in a SELECT statement), Hive converted this to a tab-delimited representation
for display on the console. But when reading the tables directly from files, we need to
tell Sqoop which delimiters to use. Sqoop assumes records are newline-delimited by
default, but needs to be told about the Ctrl-A field delimiters. The
--input-fields-terminated-by argument to sqoop export specified this information. Sqoop supports
several escape sequences (which start with a '\' character) when specifying delimiters.
In the example syntax above, the escape sequence is enclosed in 'single quotes' to
ensure that the shell processes it literally. Without the quotes, the leading backslash
itself may need to be escaped (for example, --input-fields-terminated-by \\0001).
The escape sequences supported by Sqoop are listed in Table 15-1.
Table 15-1. Escape sequences can be used to specify nonprintable characters as field and record
delimiters in Sqoop
Escape    Description
\b        backspace
\n        newline
\r        carriage return
\t        tab
\'        single-quote
\"        double-quote
\\        backslash
\0        NUL. This will insert NUL characters between fields or lines, or will disable enclosing/escaping if used for one of the --enclosed-by, --optionally-enclosed-by, or --escaped-by arguments.
\0ooo     The octal representation of a Unicode character's code point. The actual character is specified by the octal value ooo.
\0xhhh    The hexadecimal representation of a Unicode character's code point. This should be of the form \0xhhh, where hhh is the hex value. For example, --fields-terminated-by '\0x0d' specifies the carriage return character.
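To see how these escape sequences map onto actual characters, consider the following decoder. It is a simplified sketch written for this discussion and is not Sqoop's own parsing code; the class DelimiterEscapeSketch and its decode() method are hypothetical names, and error handling is kept to a minimum.

```java
import java.util.regex.Pattern;

public class DelimiterEscapeSketch {
    /** Converts a delimiter specification such as "\t" or "\0001" to a single char. */
    public static char decode(String spec) {
        if (spec.length() == 1) {
            return spec.charAt(0); // a plain character such as ','
        }
        if (!spec.startsWith("\\")) {
            throw new IllegalArgumentException("Not a delimiter: " + spec);
        }
        switch (spec) {
            case "\\b": return '\b';
            case "\\n": return '\n';
            case "\\r": return '\r';
            case "\\t": return '\t';
            case "\\'": return '\'';
            case "\\\"": return '"';
            case "\\\\": return '\\';
            case "\\0": return (char) 0; // NUL
            default: break;
        }
        if (spec.startsWith("\\0x") || spec.startsWith("\\0X")) {
            return (char) Integer.parseInt(spec.substring(3), 16); // \0xhhh: hexadecimal
        }
        if (spec.startsWith("\\0")) {
            return (char) Integer.parseInt(spec.substring(2), 8);  // \0ooo: octal
        }
        throw new IllegalArgumentException("Unsupported escape: " + spec);
    }

    public static void main(String[] args) {
        // One record as Hive stores it by default: fields separated by Ctrl-A.
        char ctrlA = decode("\\0001");
        String record = "403.71" + ctrlA + "90210";
        String[] fields = record.split(Pattern.quote(String.valueOf(ctrlA)));
        System.out.println(fields[0] + " | " + fields[1]);
    }
}
```

Here the argument value '\0001' used in the export command decodes to the character with code point 1, which is the Ctrl-A that Hive placed between the sales volume and the zip code.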
Exports: A Deeper Look
The architecture of Sqoop's export capability is very similar in nature to how Sqoop
performs imports. (See Figure 15-4.) Before performing the export, Sqoop picks a
strategy based on the database connect string. For most systems, Sqoop uses JDBC. Sqoop
then generates a Java class based on the target table definition. This generated class has
the ability to parse records from text files and insert values of the appropriate types into
a table (in addition to the ability to read the columns from a ResultSet). A MapReduce
job is then launched that reads the source data files from HDFS, parses the records
using the generated class, and executes the chosen export strategy.
Figure 15-4. Exports are performed in parallel using MapReduce
The JDBC-based export strategy builds up batch INSERT statements that will each add
multiple records to the target table. Inserting many records per statement performs
much better than executing many single-row INSERT statements on most database
systems. Separate threads are used to read from HDFS and communicate with the
database, to ensure that I/O operations involving different systems are overlapped as much
as possible.
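The shape of such a statement is easy to show. The sketch below builds the text of a multi-row INSERT with JDBC-style placeholders; it is an illustration of the idea only, under the assumption that the target database accepts multi-row VALUES lists (as MySQL does), and the class BatchInsertSketch and method multiRowInsert() are hypothetical rather than taken from Sqoop.

```java
import java.util.Collections;

public class BatchInsertSketch {
    /** Builds an INSERT statement that adds the given number of rows in one round trip. */
    public static String multiRowInsert(String table, String[] columns, int rows) {
        if (rows < 1 || columns.length < 1) {
            throw new IllegalArgumentException("Need at least one row and one column");
        }
        // One "(?, ?, ...)" group per row, with one placeholder per column.
        String group = "(" + String.join(", ", Collections.nCopies(columns.length, "?")) + ")";
        return "INSERT INTO " + table
            + " (" + String.join(", ", columns) + ") VALUES "
            + String.join(", ", Collections.nCopies(rows, group));
    }

    public static void main(String[] args) {
        // Target table and columns from the export example in this chapter.
        System.out.println(multiRowInsert("sales_by_zip", new String[] {"volume", "zip"}, 3));
    }
}
```

A statement built this way would be prepared once and then have its placeholders bound row by row before execution, so that the three exported records travel to the database in a single statement instead of three.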
For MySQL, Sqoop can employ a direct-mode strategy using mysqlimport. Each map
task spawns a mysqlimport process that it communicates with via a named FIFO on the
local filesystem. Data is then streamed into mysqlimport via the FIFO channel, and from
there into the database.
While most MapReduce jobs reading from HDFS pick the degree of parallelism
(number of map tasks) based on the number and size of the files to process, Sqoop's export
system allows users explicit control over the number of tasks. The performance of the
export can be affected by the number of parallel writers to the database, so Sqoop uses
the CombineFileInputFormat class to group up the input files into a smaller number of
map tasks.
Exports and Transactionality
Due to the parallel nature of the process, an export is often not an atomic operation.
Sqoop will spawn multiple tasks to export slices of the data in parallel. These tasks can
complete at different times, meaning that even though transactions are used inside
tasks, results from one task may be visible before the results of another task. Moreover,
databases often use fixed-size buffers to store transactions. As a result, one transaction
cannot necessarily contain the entire set of operations performed by a task. Sqoop
commits results every few thousand rows, to ensure that it does not run out of memory.
These intermediate results are visible while the export continues. Applications that will
use the results of an export should not be started until the export process is complete,
or they may see partial results.
To solve this problem, Sqoop can export to a temporary staging table, then at the end
of the job (if the export has succeeded) move the staged data into the destination
table in a single transaction. You can specify a staging table with the --staging-table
option. The staging table must already exist and have the same schema as the destination.
It must also be empty, unless the --clear-staging-table option is also supplied.
Exports and SequenceFiles
The example export read source data from a Hive table, which is stored in HDFS as a
delimited text file. Sqoop can also export delimited text files that were not Hive tables.
For example, it can export text files that are the output of a MapReduce job.
Sqoop can also export records stored in SequenceFiles to an output table, although
some restrictions apply. A SequenceFile can contain arbitrary record types. Sqoop's
export tool will read objects from SequenceFiles and send them directly to the
OutputCollector, which passes the objects to the database export OutputFormat. To work with
Sqoop, the record must be stored in the "value" portion of the SequenceFile's key-value
pair format and must subclass the org.apache.sqoop.lib.SqoopRecord abstract class (as
is done by all classes generated by Sqoop).
If you use the codegen tool (sqoop-codegen) to generate a SqoopRecord implementation
for a record based on your export target table, you can then write a MapReduce program
that populates instances of this class and writes them to SequenceFiles. sqoop-export
can then export these SequenceFiles to the table. Another means by which data
may be in SqoopRecord instances in SequenceFiles is if data is imported from a database
table to HDFS, modified in some fashion, and the results stored in SequenceFiles holding
records of the same data type.
In this case, Sqoop should reuse the existing class definition to read data from
SequenceFiles, rather than generate a new (temporary) record container class to perform the
export, as is done when converting text-based records to database rows. You can suppress
code generation and instead use an existing record class and jar by providing the
--class-name and --jar-file arguments to Sqoop. Sqoop will use the specified class,
loaded from the specified jar, when exporting records.
In the following example, we will re-import the widgets table as SequenceFiles, and
then export it back to the database in a different table:
% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets -m 1 --class-name WidgetHolder --as-sequencefile \
> --target-dir widget_sequence_files --bindir .
...
10/07/05 17:09:13 INFO mapreduce.ImportJobBase: Retrieved 3 records.
% mysql hadoopguide
mysql> CREATE TABLE widgets2(id INT, widget_name VARCHAR(100),
-> price DOUBLE, designed DATE, version INT, notes VARCHAR(200));
Query OK, 0 rows affected (0.03 sec)
mysql> exit;
% sqoop export --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets2 -m 1 --class-name WidgetHolder \
> --jar-file widgets.jar --export-dir widget_sequence_files
...
10/07/05 17:26:44 INFO mapreduce.ExportJobBase: Exported 3 records.
During the import, we specified the SequenceFile format, and that we wanted the jar
file to be placed in the current directory (with --bindir), so we can reuse it. Otherwise,
it would be placed in a temporary directory. We then created a destination table for
the export, which had a slightly different schema, albeit one that is compatible with
the original data. We then ran an export that used the existing generated code to read
the records from the SequenceFile and write them to the database.
CHAPTER 16
Case Studies
Hadoop Usage at Last.fm
Last.fm: The Social Music Revolution
Founded in 2002, Last.fm is an Internet radio and music community website that offers
many services to its users, such as free music streams and downloads, music and event
recommendations, personalized charts, and much more. There are about 25 million
people who use Last.fm every month, generating huge amounts of data that need to be
processed. One example of this is users transmitting information indicating which
songs they are listening to (this is known as "scrobbling"). This data is processed and
stored by Last.fm, so the user can access it directly (in the form of charts), and it is also
used to make decisions about users' musical tastes and compatibility, and artist and
track similarity.
Hadoop at Last.fm
As Last.fm's service developed and the number of users grew from thousands to millions,
storing, processing, and managing all the incoming data became increasingly
challenging. Fortunately, Hadoop was quickly becoming stable enough and was
enthusiastically adopted as it became clear how many problems it solved. It was first used
at Last.fm in early 2006 and was put into production a few months later. There were
several reasons for adopting Hadoop at Last.fm:
• The distributed filesystem provided redundant backups for the data stored on it
  (e.g., web logs, user listening data) at no extra cost.
• Scalability was simplified through the ability to add cheap, commodity hardware
  when required.
• The cost was right (free) at a time when Last.fm had limited financial resources.
• The open source code and active community meant that Last.fm could freely modify
  Hadoop to add custom features and patches.
• Hadoop provided a flexible framework for running distributed computing algorithms
  with a relatively easy learning curve.
Hadoop has now become a crucial part of Last.fm's infrastructure, currently consisting
of two Hadoop clusters spanning over 50 machines, 300 cores, and 100 TB of disk
space. Hundreds of daily jobs are run on the clusters performing operations such as
logfile analysis, evaluation of A/B tests, ad hoc processing, and charts generation. This
case study will focus on the process of generating charts, as this was the first usage of
Hadoop at Last.fm and illustrates the power and flexibility that Hadoop provides over
other approaches when working with very large datasets.
Generating Charts with Hadoop
Last.fm uses user-generated track listening data to produce many different types of
charts, such as weekly charts for tracks, per country and per user. A number of Hadoop
programs are used to process the listening data and generate these charts, and these
run on a daily, weekly, or monthly basis. Figure 16-1 shows an example of how this
data is displayed on the site; in this case, the weekly top tracks.
Figure 16-1. Last.fm top tracks chart
Listening data typically arrives at Last.fm from one of two sources:
• A user plays a track of her own (e.g., listening to an MP3 file on a PC or other
  device), and this information is sent to Last.fm using either the official Last.fm
  client application or one of many hundreds of third-party applications.
• A user tunes into one of Last.fm's Internet radio stations and streams a song to her
  computer. The Last.fm player or website can be used to access these streams and
  extra functionality is made available to the user, allowing her to love, skip, or ban
  each track that she listens to.
When processing the received data, we distinguish between a track listen submitted by
a user (the first source above, referred to as a scrobble from here on) and a track listened
to on the Last.fm radio (the second source, mentioned earlier, referred to as a radio
listen from here on). This distinction is very important in order to prevent a feedback
loop in the Last.fm recommendation system, which is based only on scrobbles. One of
the most fundamental Hadoop jobs at Last.fm takes the incoming listening data and
summarizes it into a format that can be used for display purposes on the Last.fm website
as well as for input to other Hadoop programs. This is achieved by the Track Statistics
program, which is the example described in the following sections.
The Track Statistics Program
When track listening data is submitted to Last.fm, it undergoes a validation and
conversion phase, the end result of which is a number of space-delimited text files
containing the user ID, the track ID, the number of times the track was scrobbled, the
number of times the track was listened to on the radio, and the number of times it was
skipped. Table 16-1 contains sample listening data, which is used in the following
examples as input to the Track Statistics program (the real data is gigabytes in size and
includes many more fields that have been omitted here for simplicity's sake).
Table 16-1. Listening data
UserId   TrackId   Scrobble   Radio   Skip
111115   222       0          1       0
111113   225       1          0       0
111117   223       0          1       1
111115   225       1          0       0
These text files are the initial input provided to the Track Statistics program, which
consists of two jobs that calculate various values from this data and a third job that
merges the results (see Figure 16-2).
The Unique Listeners job calculates the total number of unique listeners for a track by
counting the first listen by a user and ignoring all other listens by the same user. The
Sum job accumulates the total listens, scrobbles, radio listens, and skips for each track
by counting these values for all listens by all users. Although the input format of these
two jobs is identical, two separate jobs are needed, as the Unique Listeners job is
responsible for emitting values per track per user, and the Sum job emits values per track.
The final "Merge" job is responsible for merging the intermediate output of the two
other jobs into the final result. The end results of running the program are the following
values per track:
• Number of unique listeners
• Number of times the track was scrobbled
• Number of times the track was listened to on the radio
• Number of times the track was listened to in total
• Number of times the track was skipped on the radio
Each job and its MapReduce phases are described in more detail next. Please note that
the provided code snippets have been simplified due to space constraints; for download
details for the full code listings, refer to the preface.
Calculating the number of unique listeners
The Unique Listeners job calculates, per track, the number of unique listeners.
UniqueListenersMapper. The UniqueListenersMapper processes the space-delimited raw
listening data and emits the user ID associated with each track ID:
public void map(LongWritable position, Text rawLine,
    OutputCollector<IntWritable, IntWritable> output, Reporter reporter)
    throws IOException {

  String[] parts = (rawLine.toString()).split(" ");
  int scrobbles = Integer.parseInt(parts[TrackStatisticsProgram.COL_SCROBBLES]);
  int radioListens = Integer.parseInt(parts[TrackStatisticsProgram.COL_RADIO]);
  // if track somehow is marked with zero plays - ignore
  if (scrobbles <= 0 && radioListens <= 0) {
    return;
  }
  // if we get to here then user has listened to track,
  // so output user id against track id
  IntWritable trackId = new IntWritable(
      Integer.parseInt(parts[TrackStatisticsProgram.COL_TRACKID]));
  IntWritable userId = new IntWritable(
      Integer.parseInt(parts[TrackStatisticsProgram.COL_USERID]));
  output.collect(trackId, userId);
}
Figure 16-2. TrackStats jobs
UniqueListenersReducer. The UniqueListenersReducer receives a list of user IDs per track
ID and puts these IDs into a Set to remove any duplicates. The size of this set is then
emitted (i.e., the number of unique listeners) for each track ID. Storing all the reduce
values in a Set runs the risk of running out of memory if there are many values for a
certain key. This hasn't happened in practice, but to overcome this, an extra
MapReduce step could be introduced to remove all the duplicate values or a secondary
sort could be used (for more details, see "Secondary Sort" on page 276):
public void reduce(IntWritable trackId, Iterator<IntWritable> values,
    OutputCollector<IntWritable, IntWritable> output, Reporter reporter)
    throws IOException {

  Set<Integer> userIds = new HashSet<Integer>();
  // add all userIds to the set, duplicates automatically removed (set contract)
  while (values.hasNext()) {
    IntWritable userId = values.next();
    userIds.add(Integer.valueOf(userId.get()));
  }
  // output trackId -> number of unique listeners per track
  output.collect(trackId, new IntWritable(userIds.size()));
}
Table 16-2 shows the sample input data for the job. The map output appears in
Table 16-3 and the reduce output in Table 16-4.
Table 16-2. Job input
Line of file   UserId        TrackId       Scrobbled   Radio play   Skip
LongWritable   IntWritable   IntWritable   Boolean     Boolean      Boolean
0              11115         222           0           1            0
1              11113         225           1           0            0
2              11117         223           0           1            1
3              11115         225           1           0            0
Table 16-3. Mapper output
TrackId       UserId
IntWritable   IntWritable
222           11115
225           11113
223           11117
225           11115
Table 16-4. Reducer output
TrackId       #listeners
IntWritable   IntWritable
222           1
225           2
223           1
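The map and reduce steps above can be traced without a cluster. The following sketch
replays the same logic over the sample rows of Table 16-2 using ordinary Java collections.
It is only an illustration: the class name UniqueListenersSketch and the column constants
are assumptions made for this example (the real constants live in TrackStatisticsProgram,
which is not shown here), and no Hadoop classes are involved.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java stand-in for the Unique Listeners job (hypothetical, no Hadoop needed).
final class UniqueListenersSketch {

    // Assumed column positions in a space-delimited input line.
    static final int COL_USERID = 0;
    static final int COL_TRACKID = 1;
    static final int COL_SCROBBLES = 2;
    static final int COL_RADIO = 3;

    static Map<Integer, Integer> countUniqueListeners(String[] lines) {
        // "Map" step: collect user IDs per track ID, skipping lines with zero plays.
        Map<Integer, Set<Integer>> usersByTrack = new TreeMap<Integer, Set<Integer>>();
        for (String line : lines) {
            String[] parts = line.split(" ");
            int scrobbles = Integer.parseInt(parts[COL_SCROBBLES]);
            int radioListens = Integer.parseInt(parts[COL_RADIO]);
            if (scrobbles <= 0 && radioListens <= 0) {
                continue;
            }
            int trackId = Integer.parseInt(parts[COL_TRACKID]);
            int userId = Integer.parseInt(parts[COL_USERID]);
            Set<Integer> users = usersByTrack.get(trackId);
            if (users == null) {
                users = new HashSet<Integer>();
                usersByTrack.put(trackId, users);
            }
            users.add(userId); // the Set discards repeated user IDs
        }
        // "Reduce" step: the size of each set is the unique listener count.
        Map<Integer, Integer> listeners = new TreeMap<Integer, Integer>();
        for (Map.Entry<Integer, Set<Integer>> entry : usersByTrack.entrySet()) {
            listeners.put(entry.getKey(), entry.getValue().size());
        }
        return listeners;
    }

    public static void main(String[] args) {
        String[] sample = {
            "11115 222 0 1 0", "11113 225 1 0 0", "11117 223 0 1 1", "11115 225 1 0 0"
        };
        // Prints: {222=1, 223=1, 225=2}
        System.out.println(countUniqueListeners(sample));
    }
}
```

The counts agree with Table 16-4; only the ordering differs, because the sketch keeps
tracks in a sorted map.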
Summing the track totals
The Sum job is relatively simple; it just adds up the values we are interested in for each
track.
SumMapper. The input data is again the raw text files, but in this case, it is handled quite
differently. The desired end result is a number of totals (unique listener count, play
count, scrobble count, radio listen count, skip count) associated with each track. To
simplify things, we use an intermediate TrackStats object generated using Hadoop
Record I/O, which implements WritableComparable (so it can be used as output) to hold
these values. The mapper creates a TrackStats object and sets the values on it for each
line in the file, except for the unique listener count, which is left empty (it will be filled
in by the final merge job):
public void map(LongWritable position, Text rawLine,
    OutputCollector<IntWritable, TrackStats> output, Reporter reporter)
    throws IOException {

  String[] parts = (rawLine.toString()).split(" ");
  int trackId = Integer.parseInt(parts[TrackStatisticsProgram.COL_TRACKID]);
  int scrobbles = Integer.parseInt(parts[TrackStatisticsProgram.COL_SCROBBLES]);
  int radio = Integer.parseInt(parts[TrackStatisticsProgram.COL_RADIO]);
  int skip = Integer.parseInt(parts[TrackStatisticsProgram.COL_SKIP]);
  // set number of listeners to 0 (this is calculated later)
  // and other values as provided in text file
  TrackStats trackstat = new TrackStats(0, scrobbles + radio, scrobbles, radio, skip);
  output.collect(new IntWritable(trackId), trackstat);
}
SumReducer. In this case, the reducer performs a very similar function to the mapper;
it sums the statistics per track and returns an overall total:
public void reduce(IntWritable trackId, Iterator<TrackStats> values,
    OutputCollector<IntWritable, TrackStats> output, Reporter reporter)
    throws IOException {

  TrackStats sum = new TrackStats(); // holds the totals for this track
  while (values.hasNext()) {
    TrackStats trackStats = (TrackStats) values.next();
    sum.setListeners(sum.getListeners() + trackStats.getListeners());
    sum.setPlays(sum.getPlays() + trackStats.getPlays());
    sum.setSkips(sum.getSkips() + trackStats.getSkips());
    sum.setScrobbles(sum.getScrobbles() + trackStats.getScrobbles());
    sum.setRadioPlays(sum.getRadioPlays() + trackStats.getRadioPlays());
  }
  output.collect(trackId, sum);
}
Table 16-5 shows the input data for the job (the same as for the Unique Listeners job).
The map output appears in Table 16-6 and the reduce output in Table 16-7.
Table 16-5. Job input
Line           UserId        TrackId       Scrobbled   Radio play   Skip
LongWritable   IntWritable   IntWritable   Boolean     Boolean      Boolean
0              11115         222           0           1            0
1              11113         225           1           0            0
2              11117         223           0           1            1
3              11115         225           1           0            0
Table 16-6. Map output
TrackId       #listeners    #plays        #scrobbles    #radio plays   #skips
IntWritable   IntWritable   IntWritable   IntWritable   IntWritable    IntWritable
222           0             1             0             1              0
225           0             1             1             0              0
223           0             1             0             1              1
225           0             1             1             0              0
Table 16-7. Reduce output
TrackId       #listeners    #plays        #scrobbles    #radio plays   #skips
IntWritable   IntWritable   IntWritable   IntWritable   IntWritable    IntWritable
222           0             1             0             1              0
225           0             2             2             0              0
223           0             1             0             1              1
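The arithmetic behind Tables 16-6 and 16-7 can be checked with a small plain-Java sketch.
It is an illustration under stated assumptions rather than the Last.fm code: an int array
in the order {listeners, plays, scrobbles, radio plays, skips} stands in for the TrackStats
record, and the class name SumJobSketch and the column constants are hypothetical.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the Sum job (hypothetical, no Hadoop needed).
final class SumJobSketch {

    // Assumed column positions in a space-delimited input line.
    static final int COL_TRACKID = 1;
    static final int COL_SCROBBLES = 2;
    static final int COL_RADIO = 3;
    static final int COL_SKIP = 4;

    // Returns, per track, the totals {listeners, plays, scrobbles, radioPlays, skips}.
    static Map<Integer, int[]> sumPerTrack(String[] lines) {
        Map<Integer, int[]> totals = new TreeMap<Integer, int[]>();
        for (String line : lines) {
            String[] parts = line.split(" ");
            int trackId = Integer.parseInt(parts[COL_TRACKID]);
            int scrobbles = Integer.parseInt(parts[COL_SCROBBLES]);
            int radio = Integer.parseInt(parts[COL_RADIO]);
            int skip = Integer.parseInt(parts[COL_SKIP]);
            // "Map" step: listeners stays 0, plays is scrobbles plus radio listens.
            int[] stats = {0, scrobbles + radio, scrobbles, radio, skip};
            // "Reduce" step: add the record field by field into the running total.
            int[] sum = totals.get(trackId);
            if (sum == null) {
                sum = new int[stats.length];
                totals.put(trackId, sum);
            }
            for (int i = 0; i < stats.length; i++) {
                sum[i] += stats[i];
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        String[] sample = {
            "11115 222 0 1 0", "11113 225 1 0 0", "11117 223 0 1 1", "11115 225 1 0 0"
        };
        for (Map.Entry<Integer, int[]> entry : sumPerTrack(sample).entrySet()) {
            System.out.println(entry.getKey() + " " + Arrays.toString(entry.getValue()));
        }
    }
}
```

For track 225 the sketch yields [0, 2, 2, 0, 0], matching the corresponding row of
Table 16-7.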
Merging the results
The final job needs to merge the output from the two previous jobs: the number of
unique listeners per track and the statistics per track. In order to be able to merge these
different inputs, two different mappers (one for each type of input) are used. The two
intermediate jobs are configured to write their results to different paths, and the
MultipleInputs class is used to specify which mapper will process which files. The
following code shows how the JobConf for the job is set up to do this:
MultipleInputs.addInputPath(conf, sumInputDir,
    SequenceFileInputFormat.class, IdentityMapper.class);
MultipleInputs.addInputPath(conf, listenersInputDir,
    SequenceFileInputFormat.class, MergeListenersMapper.class);
It is possible to use a single mapper to handle different inputs, but the example solution
is more convenient and elegant.
MergeListenersMapper. This mapper is used to process the UniqueListenerJob's output of
unique listeners per track. It creates a TrackStats object in a similar manner to the
SumMapper, but this time, it fills in only the unique listener count per track and leaves
the other values empty:
public void map(IntWritable trackId, IntWritable uniqueListenerCount,
    OutputCollector<IntWritable, TrackStats> output, Reporter reporter)
    throws IOException {

  TrackStats trackStats = new TrackStats();
  trackStats.setListeners(uniqueListenerCount.get());
  output.collect(trackId, trackStats);
}
Table 16-8 shows some input for the mapper; the corresponding output is shown in
Table 16-9.
Table 16-8. MergeListenersMapper input
TrackId       #listeners
IntWritable   IntWritable
222           1
225           2
223           1
Table 16-9. MergeListenersMapper output
TrackId   #listeners   #plays   #scrobbles   #radio   #skips
222       1            0        0            0        0
225       2            0        0            0        0
223       1            0        0            0        0
IdentityMapper. The IdentityMapper is configured to process the SumJob's output of
TrackStats objects and, as no additional processing is required, it directly emits the
input data (see Table 16-10).
Table 16-10. IdentityMapper input and output
TrackId       #listeners    #plays        #scrobbles    #radio        #skips
IntWritable   IntWritable   IntWritable   IntWritable   IntWritable   IntWritable
222           0             1             0             1             0
225           0             2             2             0             0
223           0             1             0             1             1
SumReducer. The two mappers above emit values of the same type: a TrackStats object
per track, with different values filled in. The final reduce phase can reuse the
SumReducer described earlier to create a TrackStats object per track, sum up all the
values, and emit it (see Table 16-11).
Table 16-11. Final SumReducer output
TrackId       #listeners    #plays        #scrobbles    #radio        #skips
IntWritable   IntWritable   IntWritable   IntWritable   IntWritable   IntWritable
222           1             1             0             1             0
225           2             2             2             0             0
223           1             1             0             1             1
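The merge works because the two mapper outputs never fill in the same field, so a
field-by-field sum combines them without double counting. The sketch below shows this
with plain Java collections. It is a hypothetical illustration, not the Last.fm code:
MergeJobSketch is an invented name, and an int array in the order {listeners, plays,
scrobbles, radio, skips} stands in for the TrackStats record.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the Merge job (hypothetical, no Hadoop needed).
final class MergeJobSketch {

    static Map<Integer, int[]> merge(Map<Integer, Integer> listeners, Map<Integer, int[]> sums) {
        Map<Integer, int[]> merged = new TreeMap<Integer, int[]>();
        // IdentityMapper role: pass the Sum job records through unchanged.
        for (Map.Entry<Integer, int[]> entry : sums.entrySet()) {
            add(merged, entry.getKey(), entry.getValue());
        }
        // MergeListenersMapper role: a record with only the listener count filled in.
        for (Map.Entry<Integer, Integer> entry : listeners.entrySet()) {
            add(merged, entry.getKey(), new int[] {entry.getValue(), 0, 0, 0, 0});
        }
        return merged;
    }

    // SumReducer role: add a record field by field into the total for its track.
    private static void add(Map<Integer, int[]> totals, int trackId, int[] stats) {
        int[] sum = totals.get(trackId);
        if (sum == null) {
            sum = new int[stats.length];
            totals.put(trackId, sum);
        }
        for (int i = 0; i < stats.length; i++) {
            sum[i] += stats[i];
        }
    }

    public static void main(String[] args) {
        Map<Integer, Integer> listeners = new TreeMap<Integer, Integer>();
        listeners.put(222, 1);
        listeners.put(225, 2);
        listeners.put(223, 1);
        Map<Integer, int[]> sums = new TreeMap<Integer, int[]>();
        sums.put(222, new int[] {0, 1, 0, 1, 0});
        sums.put(225, new int[] {0, 2, 2, 0, 0});
        sums.put(223, new int[] {0, 1, 0, 1, 1});
        for (Map.Entry<Integer, int[]> entry : merge(listeners, sums).entrySet()) {
            System.out.println(entry.getKey() + " " + Arrays.toString(entry.getValue()));
        }
    }
}
```

Fed with the values of Tables 16-8 and 16-10, the sketch produces the rows of
Table 16-11, for example [2, 2, 2, 0, 0] for track 225.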
The final output files are then accumulated and copied to a server where a web service
makes the data available to the Last.fm website for display. An example of this is shown
in Figure 16-3, where the total number of listeners and plays are displayed for a track.
Figure 16-3. TrackStats result
Summary
Hadoop has become an essential part of Last.fm's infrastructure and is used to generate
and process a wide variety of datasets ranging from web logs to user listening data. The
example covered here has been simplified considerably in order to get the key concepts
across; in real-world usage the input data has a more complicated structure and the
code that processes it is more complex. Hadoop itself, while mature enough for
production use, is still in active development, and new features and improvements are
added by the Hadoop community every week. We at Last.fm are happy to be part of
this community as a contributor of code and ideas, and as end users of a great piece of
open source technology.
Adrian Woodhead and Marc de Palol
Hadoop and Hive at Facebook
Introduction
Hadoop can be used to form core backend batch and near real-time computing
infrastructures. It can also be used to store and archive massive datasets. In this case study,
we will explore backend data architectures and the role Hadoop can play in them. We
will describe hypothetical Hadoop configurations, potential uses of Hive (an open
source data warehousing and SQL infrastructure built on top of Hadoop), and the
different kinds of business and product applications that have been built using this
infrastructure.
Hadoop at Facebook
History
The amount of log and dimension data in Facebook that needs to be processed and
stored has exploded as the usage of the site has increased. A key requirement for any
data processing platform for this environment is the ability to scale rapidly. Further,
engineering resources being limited, the system should be very reliable and easy to use
and maintain.
Initially, data warehousing at Facebook was performed entirely on an Oracle instance.
After we started hitting scalability and performance problems, we investigated whether
there were open source technologies that could be used in our environment. As part of
this investigation, we deployed a relatively small Hadoop instance and started publishing
some of our core datasets into this instance. Hadoop was attractive because
Yahoo! was using it internally for its batch processing needs and because we were
familiar with the simplicity and scalability of the MapReduce model as popularized by
Google.
Our initial prototype was very successful: the engineers loved the ability to process
massive amounts of data in reasonable timeframes, an ability that we just did not have
before. They also loved being able to use their favorite programming language for
processing (using Hadoop streaming). Having our core datasets published in one
centralized data store was also very convenient. At around the same time, we started
developing Hive. This made it even easier for users to process data in the Hadoop cluster
by being able to express common computations in the form of SQL, a language with
which most engineers and analysts are familiar.
As a result, the cluster size and usage grew by leaps and bounds, and today Facebook
is running the second largest Hadoop cluster in the world. As of this writing, we hold
more than 2 PB of data in Hadoop and load more than 10 TB of data into it every day.
Our Hadoop instance has 2,400 cores and about 9 TB of memory and runs at 100%
utilization at many points during the day. We are able to scale out this cluster rapidly
in response to our growth, and we have been able to take advantage of open source by
modifying Hadoop where required to suit our needs. We have contributed back to open
source, both in the form of contributions to some core components of Hadoop as well
as by open-sourcing Hive, which is now a Hadoop top-level project.
Use cases
There are at least four interrelated but distinct classes of uses for Hadoop at Facebook:
• Producing daily and hourly summaries over large amounts of data. These summaries
  are used for a number of different purposes within the company:
  - Reports based on these summaries are used by engineering and nonengineering
    functional teams to drive product decisions. These summaries include reports
    on growth of the users, page views, and average time spent on the site by the
    users.
  - Providing performance numbers about advertisement campaigns that are run
    on Facebook.
  - Backend processing for site features such as people you may like and applications
    you may like.
• Running ad hoc jobs over historical data. These analyses help answer questions
  from our product groups and executive team.
• As a de facto long-term archival store for our log datasets.
• To look up log events by specific attributes (where logs are indexed by such
  attributes), which is used to maintain the integrity of the site and protect users
  against spambots.
Data architecture
Figure 16-4 shows the basic components of our architecture and the data flow within
these components.
As shown in Figure 16-4, the following components are used in processing data:
Scribe
    Log data is generated by web servers as well as internal services such as the Search
    backend. We use Scribe, an open source log collection service developed in
    Facebook that deposits hundreds of log datasets with daily volume in tens of terabytes
    into a handful of NFS servers.
HDFS
    A large fraction of this log data is copied into one central HDFS instance.
    Dimension data is also scraped from our internal MySQL databases and copied over into
    HDFS daily.
Hive/Hadoop
    We use Hive, a Hadoop subproject developed in Facebook, to build a data
    warehouse over all the data collected in HDFS. Files in HDFS, including log data from
    Scribe and dimension data from the MySQL tier, are made available as tables with
    logical partitions. A SQL-like query language provided by Hive is used in
    conjunction with MapReduce to create/publish a variety of summaries and reports, as well
    as to perform historical analysis over these tables.
Tools
    Browser-based interfaces built on top of Hive allow users to compose and launch
    Hive queries (which in turn launch MapReduce jobs) using just a few mouse clicks.
Traditional RDBMS
    We use Oracle and MySQL databases to publish these summaries. The volume of
    data here is relatively small, but the query rate is high and needs real-time response.
DataBee
    An in-house ETL workflow software that is used to provide a common framework
    for reliable batch processing across all data processing jobs.
Figure 16-4. Data warehousing architecture at Facebook
Data from the NFS tier storing Scribe data is continuously replicated to the HDFS
cluster by copier jobs. The NFS devices are mounted on the Hadoop tier and the copier
processes run as map-only jobs on the Hadoop cluster. This makes it easy to scale the
copier processes and makes them fault-resilient. Currently, we copy over 6 TB per day
from Scribe to HDFS in this manner. We also download up to 4 TB of dimension data
from our MySQL tier to HDFS every day. These are also conveniently arranged on the
Hadoop cluster, as map-only jobs that copy data out of MySQL boxes.
Hadoop configuration
The central philosophy behind our Hadoop deployment is consolidation. We use a
single HDFS instance, and a vast majority of processing is done in a single MapReduce
cluster (running a single jobtracker). The reasons for this are fairly straightforward:
• We can minimize the administrative overheads by operating a single cluster.
• Data does not need to be duplicated. All data is available in a single place for all
  the use cases described previously.
• By using the same compute cluster across all departments, we get tremendous
  efficiencies.
• Our users work in a collaborative environment, so requirements in terms of quality
  of service are not onerous (yet).
We also have a single shared Hive metastore (using a MySQL database) that holds
metadata about all the Hive tables stored in HDFS.
Hypothetical Use Case Studies
In this section, we will describe some typical problems that are common for large
websites, which are difficult to solve through traditional warehousing technologies, simply
because the costs and scales involved are prohibitively high. Hadoop and Hive can
provide a more scalable and more cost-effective solution in such situations.
Advertiser insights and performance
One of the most common uses of Hadoop is to produce summaries from large volumes
of data. It is very typical of large ad networks, such as the Facebook ad network, Google
AdSense, and many others, to provide advertisers with standard aggregated statistics
about their ads that help the advertisers to tune their campaigns effectively. Computing
advertisement performance numbers on large datasets is a very data-intensive operation,
and the scalability and cost advantages of Hadoop and Hive can really help in
computing these numbers in a reasonable time frame and at a reasonable cost.
Many ad networks provide standardized CPC- and CPM-based ad units to the advertisers.
The CPC ads are cost-per-click ads: the advertiser pays the ad network amounts
that are dependent on the number of clicks that the particular ad gets from the users
visiting the site. The CPM ads (short for cost per mille, that is, the cost per thousand
impressions), on the other hand, bill the advertisers amounts that are proportional to
the number of users who see the ad on the site. Apart from these standardized ad units,
in the last few years ads that have more dynamic content that is tailored to each
individual user have also become common in the online advertisement industry. Yahoo!
does this through SmartAds, whereas Facebook provides its advertisers with Social Ads.
The latter allows the advertisers to embed information from a user's network of friends;
for example, a Nike ad may refer to a friend of the user who recently fanned Nike and
shared that information with his friends on Facebook. In addition, Facebook also
provides Engagement Ad units to the advertisers, wherein the users can more effectively
interact with the ad, be it by commenting on it or by playing embedded videos. In
general, a wide variety of ads are provided to the advertisers by the online ad networks,
and this variety also adds yet another dimension to the various kinds of performance
numbers that the advertisers are interested in getting about their campaigns.
At the most basic level, advertisers are interested in knowing the total and the number
of unique users that have seen the ad or have clicked on it. For more dynamic ads, they
may even be interested in getting the breakdown of these aggregated numbers by the
kind of dynamic information shown in the ad unit or the kind of engagement action
undertaken by the users on the ad. For example, a particular advertisement may have
been shown 100,000 times to 30,000 unique users. Similarly, a video embedded inside
an Engagement Ad may have been watched by 100,000 unique users. In addition, these
performance numbers are typically reported for each ad, campaign, and account. An
account may have multiple campaigns with each campaign running multiple ads on
the network. Finally, these numbers are typically reported for different time durations
by the ad networks. Typical durations are daily, rolling week, month to date, rolling
month, and sometimes even for the entire lifetime of the campaign. Moreover,
advertisers also look at the geographic breakdown of these numbers among other ways of
slicing and dicing this data, such as what percentage of the total viewers or clickers of
a particular ad are in the Asia Pacific region.
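The difference between the total and the unique count is the core of these reports. A
minimal plain-Java sketch of that aggregation follows. Everything in it is an assumption
made for illustration (the class name AdReachSketch, the two-field {adId, userId} event
layout, and the sample values); at production scale the same grouping would be expressed
as a MapReduce job or a Hive query rather than held in memory.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical in-memory aggregation of ad impressions: total views and unique viewers.
final class AdReachSketch {

    // Each event is {adId, userId}. Returns, per ad, {totalImpressions, uniqueUsers}.
    static Map<String, int[]> totalAndUnique(String[][] impressions) {
        Map<String, Integer> totals = new TreeMap<String, Integer>();
        Map<String, Set<String>> viewers = new TreeMap<String, Set<String>>();
        for (String[] event : impressions) {
            String adId = event[0];
            String userId = event[1];
            // Every event counts toward the total for its ad.
            Integer seen = totals.get(adId);
            totals.put(adId, seen == null ? 1 : seen + 1);
            // The set keeps each user only once, which gives the unique count.
            Set<String> users = viewers.get(adId);
            if (users == null) {
                users = new HashSet<String>();
                viewers.put(adId, users);
            }
            users.add(userId);
        }
        Map<String, int[]> result = new TreeMap<String, int[]>();
        for (String adId : totals.keySet()) {
            result.put(adId, new int[] {totals.get(adId), viewers.get(adId).size()});
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] events = {
            {"ad1", "u1"}, {"ad1", "u1"}, {"ad1", "u2"}, {"ad2", "u3"}
        };
        for (Map.Entry<String, int[]> entry : totalAndUnique(events).entrySet()) {
            int[] counts = entry.getValue();
            System.out.println(entry.getKey() + " total=" + counts[0] + " unique=" + counts[1]);
        }
    }
}
```

Breakdowns by campaign, account, time window, or region would add further grouping keys
alongside the ad ID in the same pattern.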
As is evident, there are four predominant dimension hierarchies: the account, campaign,
and ad dimension; the time period; the type of interaction; and the user dimension.
The last of these is used to report unique numbers, whereas the other three are
the reporting dimensions. The user dimension is also used to create aggregated
geographic profiles for the viewers and clickers of ads. All this information in totality allows
the advertisers to tune their campaigns to improve their effectiveness on any given ad
network. Aside from the multidimensional nature of this set of pipelines, the volumes
of data processed and the rate at which this data is growing on a daily basis make this
difficult to scale without a technology like Hadoop for large ad networks. As of this
writing, for example, the ad log volume that is processed for ad performance numbers
at Facebook is approximately 1 TB per day of (uncompressed) logs. This volume has
seen a 30-fold increase since January 2008, when the volumes were in the range of 30
GB per day. Hadoop's ability to scale with hardware has been a major factor behind
the ability of these pipelines to keep up with this data growth with minor tweaking of
job configurations. Typically, these configuration changes involve increasing the number
of reducers for the Hadoop jobs that are processing the intensive portions of these
pipelines. The largest of these stages currently run with 400 reducers (an increase of
eight times from the 50 reducers that were being used in January 2008).
Ad hoc analysis and product feedback
Apart from regular reports, another primary use case for a data warehousing solution
is to be able to support ad hoc analysis and product feedback solutions. Any typical
website, for example, makes product changes, and it is typical for product managers
or engineers to understand the impact of a new feature, based on user engagement as
well as on the click-through rate on that feature. The product team may even wish to
do a deeper analysis on what is the impact of the change based on various regions and
countries, such as whether this change increases the click-through rate of the users in
the US or whether it reduces the engagement of users in India. A lot of this type of
analysis could be done with Hadoop by using Hive and regular SQL. The measurement
of click-through rate can be easily expressed as a join of the impressions and clicks for
the particular link related to the feature. This information can be joined with geographic
information to compute the effect of product changes on different regions.
Subsequently one can compute average click-through rate for different geographic regions
by performing aggregations over them. All of these are easily expressible in Hive using
a couple of SQL queries (that would, in turn, generate multiple Hadoop jobs). If only
an estimate were required, the same queries can be run for a sample set of the users
using sampling functionality natively supported by Hive. Some of this analysis needs
the use of custom map and reduce scripts in conjunction with the Hive SQL, and that
is also easy to plug into a Hive query.
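To make the join-then-aggregate shape of the click-through computation concrete, here is a
minimal in-memory sketch in plain Java. It is not Hive code and not taken from the Facebook
pipelines; the class name, the method name, and the representation of events as lists of
user IDs are assumptions made purely for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: joins impression and click events with a user-to-region
// lookup, then aggregates a click-through rate per region.
class ClickThroughByRegion {

  // impressionUsers / clickUsers hold one user ID per event for the link
  // under study; userRegion plays the role of the geographic table.
  static Map<String, Double> ctrByRegion(List<String> impressionUsers,
                                         List<String> clickUsers,
                                         Map<String, String> userRegion) {
    // region -> {impressions, clicks}
    Map<String, long[]> counts = new TreeMap<>();
    for (String user : impressionUsers) {
      String region = userRegion.get(user);
      if (region != null) { // inner join: skip users with no known region
        counts.computeIfAbsent(region, k -> new long[2])[0]++;
      }
    }
    for (String user : clickUsers) {
      String region = userRegion.get(user);
      if (region != null) {
        counts.computeIfAbsent(region, k -> new long[2])[1]++;
      }
    }
    Map<String, Double> ctr = new TreeMap<>();
    for (Map.Entry<String, long[]> e : counts.entrySet()) {
      long impressions = e.getValue()[0];
      if (impressions > 0) { // avoid dividing by zero for click-only regions
        ctr.put(e.getKey(), (double) e.getValue()[1] / impressions);
      }
    }
    return ctr;
  }

  public static void main(String[] args) {
    Map<String, String> regions = Map.of("u1", "US", "u2", "US", "u3", "IN");
    System.out.println(
        ctrByRegion(List.of("u1", "u1", "u2", "u3"), List.of("u1"), regions));
  }
}
```

In Hive the same computation would be spread over several MapReduce jobs; the sketch only
shows the logical steps of join, group, and divide.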
A good example of a more complex analysis is estimating the peak number of users
logging into the site per minute for the entire past year. This would involve sampling
page view logs (because the total page view data for a popular website is huge), grouping
it by time and then finding the number of new users at different time points via a custom
reduce script. This is a good example where both SQL and MapReduce are required
for solving the end user problem and something that is possible to achieve easily with
Hive.
Data analysis
Hive and Hadoop can be easily used for training and scoring for data analysis
applications. These data analysis applications can span multiple domains such as popular
websites, bioinformatics companies, and oil exploration companies. A typical example
of such an application in the online ad network industry would be the prediction of
what features of an ad make it more likely to be noticed by the user. The training phase
typically would involve identifying the response metric and the predictive features. In
this case, a good metric to measure the effectiveness of an ad could be its click-through
rate. Some interesting features of the ad could be the industry vertical that it belongs
to, the content of the ad, the placement of the ad on the page, and so on. Hive is useful
for assembling training data and then feeding the same into a data analysis engine
(typically R or user programs written in MapReduce). In this particular case, different
ad performance numbers and features can be structured as tables in Hive. One can
easily sample this data (sampling is required as R can only handle limited data volume)
and perform the appropriate aggregations and joins using Hive queries to assemble a
response table that contains the most important ad features that determine the
effectiveness of an advertisement. However, since sampling loses information, some of the
more important data analysis applications use parallel implementations of popular data
analysis kernels using the MapReduce framework.
Once the model has been trained, it may be deployed for scoring on a daily basis. The
bulk of the data analysis tasks do not perform daily scoring though. Many of them are
ad hoc in nature and require one-time analysis that can be used as input into the product
design process.
Hive
Overview
When we started using Hadoop, we very quickly became impressed by its scalability
and availability. However, we were worried about widespread adoption, primarily
because of the complexity involved in writing MapReduce programs in Java (as well as
the cost of training users to write them). We were aware that a lot of engineers and
analysts in the company understood SQL as a tool to query and analyze data, and that
a lot of them were proficient in a number of scripting languages like PHP and Python.
As a result, it was imperative for us to develop software that could bridge this gap
between the languages that the users were proficient in and the languages required to
program Hadoop.
It was also evident that a lot of our datasets were structured and could be easily
partitioned. The natural consequence of these requirements was a system that could model
data as tables and partitions and that could also provide a SQL-like language for query
and analysis. Also essential was the ability to plug in customized MapReduce programs
written in the programming language of the user's choice into the query. This system
was called Hive. Hive is a data warehouse infrastructure built on top of Hadoop and
serves as the predominant tool that is used to query the data stored in Hadoop at
Facebook. In the following sections, we describe this system in more detail.
Data organization
Data is organized consistently across all datasets and is stored compressed, partitioned,
and sorted:
Compression
Almost all datasets are stored as sequence files using the gzip codec. Older datasets
are recompressed to use the bzip codec that gives substantially more compression
than gzip. Bzip is slower than gzip, but older data is accessed much less frequently
and this performance hit is well worth the savings in terms of disk space.
Partitioning
Most datasets are partitioned by date. Individual partitions are loaded into Hive,
which loads each partition into a separate HDFS directory. In most cases, this
partitioning is based simply on datestamps associated with Scribe logfiles.
However, in some cases, we scan data and collate them based on timestamp available
inside a log entry. Going forward, we are also going to be partitioning data on
multiple attributes (for example, country and date).
Sorting
Each partition within a table is often sorted (and hash-partitioned) by unique ID
(if one is present). This has a few key advantages:
• It is easy to run sampled queries on such datasets.
• We can build indexes on sorted data.
• Aggregates and joins involving unique IDs can be done very efficiently on such
datasets.
Loading data into this long-term format is done by daily MapReduce jobs (and is
distinct from the near real-time data import processes).
Query language
The Hive Query language is very SQL-like. It has traditional SQL constructs like joins,
group bys, where, select, from clauses, and from clause subqueries. It tries to convert
SQL commands into a set of MapReduce jobs. Apart from the normal SQL clauses, it
has a bunch of other extensions, like the ability to specify custom mapper and reducer
scripts in the query itself, the ability to insert into multiple tables, partitions, HDFS, or
local files while doing a single scan of the data and the ability to run the query on data
samples rather than the full dataset (this ability is fairly useful while testing queries).
The Hive metastore stores the metadata for a table and provides this metadata to the
Hive compiler for converting SQL commands to MapReduce jobs. Through partition
pruning, map-side aggregations, and other features, the compiler tries to create plans
that can optimize the runtime for the query.
Data pipelines using Hive
Additionally, the ability provided by Hive in terms of expressing data pipelines in SQL
can and has provided the much needed flexibility in putting these pipelines together in
an easy and expedient manner. This is especially useful for organizations and products
that are still evolving and growing. Many of the operations needed in processing data
pipelines are the well-understood SQL operations like join, group by, and distinct
aggregations. With Hive's ability to convert SQL into a series of Hadoop MapReduce
jobs, it becomes fairly easy to create and maintain these pipelines. We illustrate these
facets of Hive in this section by using an example of a hypothetical ad network and
showing how some typical aggregated reports needed by the advertisers can be
computed using Hive. As an example, assuming that an online ad network stores
information on ads in a table named dim_ads and stores all the impressions served to that ad in
a table named impression_logs in Hive, with the latter table being partitioned by date,
the daily impression numbers (both unique and total by campaign, that are routinely
given by ad networks to the advertisers) for 2008-12-01 are expressible as the following
SQL in Hive:
SELECT a.campaign_id, count(1), count(DISTINCT b.user_id)
FROM dim_ads a JOIN impression_logs b ON(b.ad_id = a.ad_id)
WHERE b.dateid = '2008-12-01'
GROUP BY a.campaign_id;
This would also be the typical SQL statement that one could use in other RDBMSs such
as Oracle, DB2, and so on.
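For readers who think more readily in Java than in SQL, the following sketch shows what
the query above computes, using ordinary collections instead of Hive tables. It is an
illustration under assumed names (DailyImpressions, byCampaign) and an assumed row layout
of {ad_id, user_id}; it says nothing about how Hive actually executes the query.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// In-memory stand-in for the daily query; all names here are illustrative.
class DailyImpressions {

  // dimAds maps ad_id -> campaign_id (the role of dim_ads); each impression
  // is {ad_id, user_id} taken from one dateid partition of impression_logs.
  // Returns campaign_id -> {count(1), count(DISTINCT user_id)}.
  static Map<String, long[]> byCampaign(Map<String, String> dimAds,
                                        List<String[]> impressions) {
    Map<String, Long> totals = new TreeMap<>();
    Map<String, Set<String>> users = new TreeMap<>();
    for (String[] imp : impressions) {
      String campaign = dimAds.get(imp[0]);
      if (campaign == null) {
        continue; // inner join: impressions without a matching ad are dropped
      }
      totals.merge(campaign, 1L, Long::sum);
      users.computeIfAbsent(campaign, k -> new HashSet<>()).add(imp[1]);
    }
    Map<String, long[]> result = new TreeMap<>();
    for (Map.Entry<String, Long> e : totals.entrySet()) {
      result.put(e.getKey(),
          new long[] {e.getValue(), users.get(e.getKey()).size()});
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> dimAds = Map.of("ad1", "c1", "ad2", "c1", "ad3", "c2");
    List<String[]> logs = List.of(
        new String[] {"ad1", "u1"}, new String[] {"ad1", "u1"},
        new String[] {"ad2", "u2"}, new String[] {"ad3", "u1"});
    byCampaign(dimAds, logs).forEach((campaign, c) ->
        System.out.println(campaign + " total=" + c[0] + " unique=" + c[1]));
  }
}
```

Keeping a set of user IDs per campaign is what makes the distinct count expensive at scale,
which is one reason the work is better left to a distributed engine.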
In order to compute the daily impression numbers by ad and account from the same
joined data as earlier, Hive provides the ability to do multiple group bys simultaneously
as shown in the following query (SQL-like but not strictly SQL):
FROM(
SELECT a.ad_id, a.campaign_id, a.account_id, b.user_id
FROM dim_ads a JOIN impression_logs b ON (b.ad_id = a.ad_id)
WHERE b.dateid = '2008-12-01') x
INSERT OVERWRITE DIRECTORY 'results_gby_adid'
SELECT x.ad_id, count(1), count(DISTINCT x.user_id) GROUP BY x.ad_id
INSERT OVERWRITE DIRECTORY 'results_gby_campaignid'
SELECT x.campaign_id, count(1), count(DISTINCT x.user_id) GROUP BY x.campaign_id
INSERT OVERWRITE DIRECTORY 'results_gby_accountid'
SELECT x.account_id, count(1), count(DISTINCT x.user_id) GROUP BY x.account_id;
In one of the optimizations that is being added to Hive, the query can be converted into
a sequence of Hadoop MapReduce jobs that are able to scale with data skew.
Essentially, the join is converted into one MapReduce job and the three group bys are
converted into four MapReduce jobs, with the first one generating a partial aggregate on
unique_id. This is especially useful because the distribution of impression_logs over
unique_id is much more uniform as compared to ad_id (typically in an ad network, a
few ads dominate in that they are shown more uniformly to the users). As a result,
computing the partial aggregation by unique_id allows the pipeline to distribute the
work more uniformly to the reducers. The same template can be used to compute
performance numbers for different time periods by simply changing the date predicate
in the query.
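The idea of aggregating first on the evenly distributed user key and only then rolling up
by ad can be sketched in plain Java as two passes. This is a simplified illustration, not
the plan Hive generates; the class and method names and the {ad_id, user_id} row layout
are assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative two-stage aggregation: partial counts per (ad, user) first,
// then a cheap roll-up per ad.
class TwoStageAggregate {

  // Stage 1: each row is {ad_id, user_id}. The key includes the user ID, so
  // in a distributed setting this work spreads evenly even if a few ads
  // account for most impressions.
  static Map<List<String>, Long> partialByAdAndUser(List<String[]> impressions) {
    Map<List<String>, Long> partial = new HashMap<>();
    for (String[] imp : impressions) {
      partial.merge(List.of(imp[0], imp[1]), 1L, Long::sum);
    }
    return partial;
  }

  // Stage 2: roll the partials up per ad. Returns ad_id -> {total, unique}.
  static Map<String, long[]> rollUpByAd(Map<List<String>, Long> partial) {
    Map<String, long[]> result = new TreeMap<>();
    for (Map.Entry<List<String>, Long> e : partial.entrySet()) {
      long[] counts = result.computeIfAbsent(e.getKey().get(0), k -> new long[2]);
      counts[0] += e.getValue(); // total impressions
      counts[1] += 1;            // each (ad, user) pair occurs once: one distinct user
    }
    return result;
  }

  public static void main(String[] args) {
    List<String[]> logs = List.of(
        new String[] {"ad1", "u1"}, new String[] {"ad1", "u1"},
        new String[] {"ad1", "u2"}, new String[] {"ad2", "u1"});
    rollUpByAd(partialByAdAndUser(logs)).forEach((ad, c) ->
        System.out.println(ad + " total=" + c[0] + " unique=" + c[1]));
  }
}
```

After the first stage, the distinct-user count for an ad is just the number of partial rows
for that ad, so the second stage no longer needs to hold a set of user IDs in memory.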
Computing the lifetime numbers can be more tricky though, as using the strategy
described previously, one would have to scan all the partitions of the impression_logs
table. Therefore, in order to compute the lifetime numbers, a more viable strategy is to
store the lifetime counts on a per ad_id, unique_id grouping every day in a partition of
an intermediate table. The data in this table combined with the next day's
impression_logs can be used to incrementally generate the lifetime ad performance
numbers. As an example, in order to get the impression numbers for 2008-12-01, the
intermediate table partition for 2008-11-30 is used. The Hive queries that can be used
to achieve this are as follows:
INSERT OVERWRITE TABLE lifetime_partial_imps PARTITION(dateid='2008-12-01')
SELECT x.ad_id, x.user_id, sum(x.cnt)
FROM (
SELECT a.ad_id, a.user_id, a.cnt
FROM lifetime_partial_imps a
WHERE a.dateid = '2008-11-30'
UNION ALL
SELECT b.ad_id, b.user_id, 1 as cnt
FROM impression_logs b
WHERE b.dateid = '2008-12-01'
) x
GROUP BY x.ad_id, x.user_id;
This query computes the partial sums for 2008-12-01, which can be used for computing
the 2008-12-01 numbers as well as the 2008-12-02 numbers (not shown here). The
SQL is converted to a single Hadoop MapReduce job that essentially computes the
group by on the combined stream of inputs. This SQL can be followed by the following
Hive query, which computes the actual numbers for different groupings (similar to the
one in the daily pipelines):
FROM(
SELECT a.ad_id, a.campaign_id, a.account_id, b.user_id, b.cnt
FROM dim_ads a JOIN lifetime_partial_imps b ON (b.ad_id = a.ad_id)
WHERE b.dateid = '2008-12-01') x
INSERT OVERWRITE DIRECTORY 'results_gby_adid'
SELECT x.ad_id, sum(x.cnt), count(DISTINCT x.user_id) GROUP BY x.ad_id
INSERT OVERWRITE DIRECTORY 'results_gby_campaignid'
SELECT x.campaign_id, sum(x.cnt), count(DISTINCT x.user_id) GROUP BY x.campaign_id
INSERT OVERWRITE DIRECTORY 'results_gby_accountid'
SELECT x.account_id, sum(x.cnt), count(DISTINCT x.user_id) GROUP BY x.account_id;
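The incremental step, in which the previous day's partial counts are combined with the
current day's impressions, can be sketched in plain Java as follows. This is only an
in-memory illustration of the UNION ALL plus GROUP BY logic; the names LifetimePartials
and nextPartition and the {ad_id, user_id} row layout are assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative incremental update of lifetime counts per (ad_id, user_id).
class LifetimePartials {

  // previous: yesterday's partition, keyed by [ad_id, user_id] -> cnt.
  // todaysImpressions: rows of {ad_id, user_id}, each counting as "1 as cnt".
  // Returns today's partition without modifying the previous one.
  static Map<List<String>, Long> nextPartition(Map<List<String>, Long> previous,
                                               List<String[]> todaysImpressions) {
    Map<List<String>, Long> next = new HashMap<>(previous);
    for (String[] imp : todaysImpressions) {
      next.merge(List.of(imp[0], imp[1]), 1L, Long::sum); // sum(x.cnt)
    }
    return next;
  }

  public static void main(String[] args) {
    Map<List<String>, Long> nov30 = Map.of(List.of("ad1", "u1"), 5L);
    List<String[]> dec01 = List.of(
        new String[] {"ad1", "u1"}, new String[] {"ad1", "u2"});
    System.out.println(nextPartition(nov30, dec01));
  }
}
```

Because each day's partition already contains the running totals, only one earlier partition
has to be read, rather than every partition of impression_logs.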
Hive and Hadoop are batch processing systems that cannot serve the computed data
with the same latency as a usual RDBMS such as Oracle or MySQL. Therefore, on many
occasions, it is still useful to load the summaries generated through Hive and Hadoop
to a more traditional RDBMS for serving this data to users through different BI tools
or even through a web portal.
Problems and Future Work
Fair sharing
Hadoop clusters typically run a mix of production daily jobs that need to finish
computation within a reasonable time frame as well as ad hoc jobs that may be of different
priorities and sizes. In typical installations, the production jobs tend to run overnight,
when interference from ad hoc jobs run by users is minimal. However, overlap between large
ad hoc and production jobs is often unavoidable and, without adequate safeguards,
can impact the latency of production jobs. ETL processing also contains several near
real-time jobs that must be performed at hourly intervals (these include processes to
copy Scribe data from NFS servers as well as hourly summaries computed over some
datasets). It also means that a single rogue job can bring down the entire cluster and
put production processes at risk.
The fair-sharing Hadoop job scheduler, developed at Facebook and contributed back
to Hadoop, provides a solution to many of these issues. It reserves guaranteed compute
resources for specific pools of jobs while at the same time letting idle resources be used
by everyone. It also prevents large jobs from hogging cluster resources by allocating
compute resources in a fair manner across these pools. Memory can become one of the
most contended resources in the cluster. We have made some changes to Hadoop so
that if the JobTracker is low on memory, Hadoop job submissions are throttled. This
can allow the user processes to run with reasonable per-process memory limits, and it
is possible to put in place some monitoring scripts in order to prevent MapReduce jobs
from impacting HDFS daemons (due primarily to high memory consumption) running
on the same node. Log directories are stored in separate disk partitions and cleaned
regularly, and we think it can also be useful to put MapReduce intermediate storage in
separate disk partitions as well.
Space management
Capacity management continues to be a big challenge: utilization is increasing at a
fast rate with growth of data. Many growing companies with growing datasets have the
same pain. In many situations, much of this data is temporary in nature. In such cases,
one can use retention settings in Hive and also recompress older data in bzip format to
save on space. Although configurations are largely symmetrical from a disk storage
point of view, adding a separate tier of high-storage-density machines to hold older
data may prove beneficial. This will make it cheaper to store archival data in Hadoop.
However, access to such data should be transparent. We are currently working on a
data archival layer to make this possible and to unify all the aspects of dealing with
older data.
Scribe-HDFS integration
Currently, Scribe writes to a handful of NFS filers from where data is picked up and
delivered to HDFS by custom copier jobs as described earlier. We are working on
making Scribe write directly to another HDFS instance. This will make it very easy to
scale and administer Scribe. Due to the high uptime requirements for Scribe, its target
HDFS instance is likely to be different from the production HDFS instance (so that it
is isolated from any load/downtime issues due to user jobs).
Improvements to Hive
Hive is still under active development. A number of key features are being worked on
such as order by and having clause support, more aggregate functions, more built in
functions, datetime data type, and so on. At the same time, a number of performance
optimizations are being worked on, such as predicate pushdown and common
subexpression elimination. On the integration side, JDBC and ODBC drivers are being
developed in order to integrate with OLAP and BI tools. With all these optimizations, we
hope that we can unlock the power of MapReduce and Hadoop and bring it closer to
nonengineering communities as well within Facebook. For more information on this
project, please visit http://hadoop.apache.org/hive/.
Joydeep Sen Sarma and Ashish Thusoo
Nutch Search Engine
Background
Nutch is a framework for building scalable Internet crawlers and search engines. It's
an Apache Software Foundation project, and a subproject of Lucene, and it's available
under the Apache 2.0 license.
We won't go deeply into the anatomy of a web crawler as such; the purpose of this
case study is to show how Hadoop can be used to implement various complex
processing tasks typical for a search engine. Interested readers can find plenty of
Nutch-specific information on the official site of the project (http://lucene.apache.org/nutch).
Suffice to say that in order to create and maintain a search engine, one needs the
following subsystems:
Database of pages
This database keeps track of all pages known to the crawler and their status, such
as the last time it visited the page, its fetching status, refresh interval, content
checksum, etc. In Nutch terminology, this database is called CrawlDb.
List of pages to fetch
As crawlers periodically refresh their view of the Web, they download new pages
(previously unseen) or refresh pages that they think already expired. Nutch calls
such a list of candidate pages prepared for fetching a fetchlist.
Raw page data
Page content is downloaded from remote sites and stored locally in the original
uninterpreted format, as a byte array. This data is called the page content in Nutch.
Parsed page data
Page content is then parsed using a suitable parser; Nutch provides parsers for
documents in many popular formats, such as HTML, PDF, Open Office and
Microsoft Office, RSS, and others.
Link graph database
This database is necessary to compute link-based page ranking scores, such as
PageRank. For each URL known to Nutch, it contains a list of other URLs pointing
to it, and their associated anchor text (from HTML <a href="..">anchor
text</a> elements). This database is called LinkDb.
Full-text search index
This is a classical inverted index, built from the collected page metadata and from
the extracted plain-text content. It is implemented using the excellent Lucene
library.
We briefly mentioned before that Hadoop began its life as a component in Nutch,
intended to improve its scalability and to address clear performance bottlenecks caused
by a centralized data processing model. Nutch was also the first public proof-of-concept
application ported to the framework that would later become Hadoop, and the effort
required to port Nutch algorithms and data structures to Hadoop proved to be
surprisingly small. This probably encouraged the following development of Hadoop as a
separate subproject with the aim of providing a reusable framework for applications
other than Nutch.
Currently, nearly all Nutch tools process data by running one or more MapReduce jobs.
Data Structures
There are several major data structures maintained in Nutch, and they all make use of
Hadoop I/O classes and formats. Depending on the purpose of the data, and the way
it's accessed once it's created, the data is kept either using Hadoop map files or
sequence files.
Since the data is produced and processed by MapReduce jobs, which in turn run several
map and reduce tasks, its on-disk layout corresponds to the common Hadoop output
formats, that is, MapFileOutputFormat and SequenceFileOutputFormat. So to be precise,
we should say that data is kept in several partial map files or sequence files, with as
many parts as there were reduce tasks in the job that created the data. For simplicity,
we omit this distinction in the following sections.
CrawlDb
CrawlDb stores the current status of each URL, as a map file of <url, CrawlDatum>,
where keys use Text and values use a Nutch-specific CrawlDatum class (which
implements the Writable interface).
In order to provide a quick random access to the records (sometimes useful for
diagnostic reasons, when users want to examine individual records in CrawlDb), this data
is stored in map files and not in sequence files.
CrawlDb is initially created using the Injector tool, which simply converts a plain-text
representation of the initial list of URLs (called the seed list) to a map file in the format
described earlier. Subsequently, it is updated with the information from the fetched
and parsed pages (more on that shortly).
LinkDb
This database stores the incoming link information for every URL known to Nutch. It
is a map file of <url, Inlinks>, where Inlinks is a list of URL and anchor text data. It's
worth noting that this information is not immediately available during page collection,
but the reverse information is available, namely that of outgoing links from a page. The
process of inverting this relationship is implemented as a MapReduce job, described
shortly.
Segments
Segments in Nutch parlance correspond to fetching and parsing a batch of URLs.
Figure 16-5 presents how segments are created and processed.
A segment (which is really a directory in a filesystem) contains the following parts
(which are simply subdirectories containing MapFileOutputFormat or
SequenceFileOutputFormat data):
content
Contains the raw data of downloaded pages, as a map file of <url, Content>. Nutch
uses a map file here, because it needs fast random access in order to present a cached
view of a page.
crawl_generate
Contains the list of URLs to be fetched, together with their current status retrieved
from CrawlDb, as a sequence file of <url, CrawlDatum>. This data uses the sequence
file format, first because it's processed sequentially, and second because we
couldn't satisfy the map file invariants of sorted keys. We need to spread URLs
that belong to the same host as far apart as possible to minimize the load per target
host, and this means that records are sorted more or less randomly.
crawl_fetch
Contains status reports from the fetching, that is, whether it was successful, what
was the response code, etc. This is stored in a map file of <url, CrawlDatum>.
crawl_parse
The list of outlinks for each successfully fetched and parsed page is stored here so
that Nutch can expand its crawling frontier by learning new URLs.
parse_data
Metadata collected during parsing; among others, the list of outgoing links
(outlinks) for a page. This information is crucial later on to build an inverted graph (of
incoming links, or inlinks).
parse_text
Plain-text version of the page, suitable for indexing in Lucene. These are stored as
a map file of <url, ParseText> so that Nutch can access them quickly when
building summaries (snippets) to display the list of search results.
New segments are created from CrawlDb when the Generator tool is run (1 in
Figure 16-5), and initially contain just a list of URLs to fetch (the crawl_generate
subdirectory). As this list is processed in several steps, the segment collects output data from
the processing tools in a set of subdirectories.
For example, the content part is populated by a tool called Fetcher, which downloads
raw data from URLs on the fetchlist (2). This tool also saves the status information in
crawl_fetch so that this data can be used later on for updating the status of the page in
CrawlDb.
The remaining parts of the segment are populated by the Parse segment tool (3), which
reads the content section, selects appropriate content parser based on the declared (or
detected) MIME type, and saves the results of parsing in three parts: crawl_parse,
parse_data, and parse_text. This data is then used to update the CrawlDb with new
information (4) and to create the LinkDb (5).
Figure 16-5. Segments
Segments are kept around until all pages present in them are expired. Nutch applies a
configurable maximum time limit, after which a page is forcibly selected for refetching;
this helps the operator phase out all segments older than this limit (because he can be
sure that by that time all pages in this segment would have been refetched).
Segment data is used to create Lucene indexes (6, primarily the parse_text and
parse_data parts), but it also provides a data storage mechanism for quick retrieval of
plain text and raw content data. The former is needed so that Nutch can generate
snippets (fragments of document text best matching a query); the latter provides the
ability to present a cached view of the page. In both cases, this data is accessed directly
from map files in response to requests for snippet generation or for cached content. In
practice, even for large collections the performance of accessing data directly from map
files is quite sufficient.
Selected Examples of Hadoop Data Processing in Nutch
The following sections present relevant details of some Nutch tools to illustrate how
the MapReduce paradigm is applied to a concrete data processing task in Nutch.
Link inversion
HTML pages collected during crawling contain HTML links, which may point either
to the same page (internal links) or to other pages. HTML links are directed from source
page to target page. See Figure 16-6.
Figure 16-6. Link inversion
However, most algorithms for calculating a page's importance (or quality) need the
opposite information, that is, what pages contain outlinks that point to the current
page. This information is not readily available when crawling. Also, the indexing
process benefits from taking into account the anchor text on inlinks so that this text may
semantically enrich the text of the current page.
As mentioned earlier, Nutch collects the outlink information and then uses this data
to build a LinkDb, which contains this reversed link data in the form of inlinks and
anchor text.
This section presents a rough outline of the implementation of the LinkDb tool; many
details have been omitted (such as URL normalization and filtering) in order to present
a clear picture of the process. What's left gives a classical example of why the
MapReduce paradigm fits so well with the key data transformation processes required
to run a search engine. Large search engines need to deal with massive web graph data
(many pages with a lot of outlinks/inlinks), and the parallelism and fault tolerance
offered by Hadoop make this possible. Additionally, it's easy to express the link
inversion using the map-sort-reduce primitives, as illustrated next.
The snippet below presents the job initialization of the LinkDb tool:
JobConf job = new JobConf(configuration);
FileInputFormat.addInputPath(job, new Path(segmentPath, "parse_data"));
job.setInputFormat(SequenceFileInputFormat.class);
job.setMapperClass(LinkDb.class);
job.setReducerClass(LinkDb.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Inlinks.class);
job.setOutputFormat(MapFileOutputFormat.class);
FileOutputFormat.setOutputPath(job, newLinkDbPath);
As we can see, the source data for this job is the list of fetched URLs (keys) and the
corresponding ParseData records that contain among others the outlink information
for each page, as an array of outlinks. An outlink contains both the target URL and the
anchor text.
The output from the job is again a list of URLs (keys), but the values are instances of
inlinks, which is simply a specialized Set of inlinks that contain target URLs and anchor
text.
Perhaps surprisingly, URLs are typically stored and processed as plain text and not as
java.net.URL or java.net.URI instances. There are several reasons for this: URLs
extracted from downloaded content usually need normalization (e.g., converting
hostnames to lowercase, resolving relative paths), are often broken or invalid, or refer to
unsupported protocols. Many normalization and filtering operations are better
expressed as text patterns that span several parts of a URL. Also, for the purpose of link
analysis, we may still want to process and count invalid URLs.
Let's take a closer look now at the map() and reduce() implementations; in this case,
they are simple enough to be implemented in the body of the same class:
public void map(Text fromUrl, ParseData parseData,
    OutputCollector<Text, Inlinks> output, Reporter reporter) {
  ...
  Outlink[] outlinks = parseData.getOutlinks();
  Inlinks inlinks = new Inlinks();
  for (Outlink out : outlinks) {
    inlinks.clear(); // instance reuse to avoid excessive GC
    String toUrl = out.getToUrl();
    String anchor = out.getAnchor();
    inlinks.add(new Inlink(fromUrl, anchor));
    output.collect(new Text(toUrl), inlinks);
  }
}
You can see from this listing that for each Outlink, our map() implementation produces
a pair of <toUrl, Inlinks>, where Inlinks contains just a single Inlink containing
fromUrl and the anchor text. The direction of the link has been inverted.
Subsequently, these one-element-long Inlinks are aggregated in the reduce() method:
public void reduce(Text toUrl, Iterator<Inlinks> values,
    OutputCollector<Text, Inlinks> output, Reporter reporter) {
  Inlinks result = new Inlinks();
  while (values.hasNext()) {
    result.add(values.next());
  }
  output.collect(toUrl, result);
}
From this code, it's obvious that we have got exactly what we wanted: a list of all
fromUrls that point to our toUrl, together with their anchor text. The inversion
process has been accomplished.
This data is then saved using the MapFileOutputFormat and becomes the new version of
LinkDb.
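As a rough illustration of the inversion idea outside Hadoop, the following sketch does the same regrouping in memory. It is not Nutch code: the class LinkInversionSketch, its Link type, and the "fromUrl|anchor" string encoding are all made up for this example, and a plain map stands in for the shuffle between map() and reduce().

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory sketch of link inversion; hypothetical names, not Nutch classes. */
class LinkInversionSketch {

  /** One outgoing link: the page it was found on, its target, and its anchor text. */
  static final class Link {
    final String fromUrl;
    final String toUrl;
    final String anchor;

    Link(String fromUrl, String toUrl, String anchor) {
      this.fromUrl = fromUrl;
      this.toUrl = toUrl;
      this.anchor = anchor;
    }
  }

  /** Returns toUrl -> list of "fromUrl|anchor" entries, in input order. */
  static Map<String, List<String>> invert(List<Link> outlinks) {
    Map<String, List<String>> inlinks = new LinkedHashMap<>();
    for (Link link : outlinks) {
      // "map" step: re-key the link by its target;
      // "reduce" step: append it to the list collected for that target
      inlinks.computeIfAbsent(link.toUrl, k -> new ArrayList<>())
          .add(link.fromUrl + "|" + link.anchor);
    }
    return inlinks;
  }
}
```

Calling invert() on links a→c, b→c, and a→b yields two entries under c and one under b, which mirrors what the reduce() method above produces per toUrl.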
Generation of fetchlists
Let's take a look now at a more complicated use case. Fetchlists are produced from the
CrawlDb (which is a map file of <url, crawlDatum>, with the crawlDatum containing a
status of this URL), and they contain URLs ready to be fetched, which are then
processed by the Nutch Fetcher tool. Fetcher is itself a MapReduce application (described
shortly). This means that the input data (partitioned in N parts) will be processed by
N map tasks; the Fetcher tool enforces that SequenceFileInputFormat should not
further split the data in more parts than there are already input partitions. We mentioned
earlier briefly that fetchlists need to be generated in a special way so that the data in
each part of the fetchlist (and consequently processed in each map task) meets certain
requirements:
1. All URLs from the same host need to end up in the same partition. This is required
so that Nutch can easily implement in-JVM host-level blocking to avoid
overwhelming target hosts.
2. URLs from the same host should be as far apart as possible (i.e., well mixed with
URLs from other hosts) in order to minimize the host-level blocking.
3. There should be no more than x URLs from any single host so that large sites with
many URLs don't dominate smaller sites (and URLs from smaller sites still have a
chance to be scheduled for fetching).
4. URLs with high scores should be preferred over URLs with low scores.
5. There should be, at most, y URLs in total in the fetchlist.
6. The number of output partitions should match the optimum number of fetching
map tasks.
In this case, two MapReduce jobs are needed to satisfy all these requirements, as
illustrated in Figure 16-7. Again, in the following listings, we are going to skip some details
of these steps for the sake of brevity.
Figure 16-7. Generation of fetchlists
Step 1: Select, sort by score, limit by URL count per host. In this step, Nutch runs a
MapReduce job to select URLs that are considered eligible for fetching and to sort them
by their score (a floating-point value assigned to each URL, e.g., a PageRank score). The
input data comes from CrawlDb, which is a map file of <url, datum>. The output from this
job is a sequence file with <score, <url, datum>>, sorted in descending order by score.
First, let's look at the job setup:
FileInputFormat.addInputPath(job, crawlDbPath);
job.setInputFormat(SequenceFileInputFormat.class);
job.setMapperClass(Selector.class);
job.setPartitionerClass(Selector.class);
job.setReducerClass(Selector.class);
FileOutputFormat.setOutputPath(job, tempDir);
job.setOutputFormat(SequenceFileOutputFormat.class);
job.setOutputKeyClass(FloatWritable.class);
job.setOutputKeyComparatorClass(DecreasingFloatComparator.class);
job.setOutputValueClass(SelectorEntry.class);
The Selector class implements three functions: mapper, reducer, and partitioner. The
last function is especially interesting: Selector uses a custom Partitioner to assign
URLs from the same host to the same reduce task so that we can satisfy criteria 3
through 5 from the previous list. If we didn't override the default partitioner, URLs from
the same host would end up in different partitions of the output, and we wouldn't be able
to track and limit the total counts, because MapReduce tasks don't communicate between
themselves. As it is now, all URLs that belong to the same host will end up being
processed by the same reduce task, which means we can control how many URLs per
host are selected.
It's easy to implement a custom partitioner so that data that needs to be processed in
the same task ends up in the same partition. Let's take a look first at how the
Selector class implements the Partitioner interface (which consists of a single
method):
/** Partition by host. */
public int getPartition(FloatWritable key, Writable value, int numReduceTasks) {
  return hostPartitioner.getPartition(((SelectorEntry)value).url, key,
      numReduceTasks);
}
The method returns an integer number from 0 to numReduceTasks - 1. It simply replaces
the key with the original URL from SelectorEntry to pass the URL (instead of score)
to an instance of PartitionUrlByHost, where the partition number is calculated:
/** Hash by hostname. */
public int getPartition(Text key, Writable value, int numReduceTasks) {
  String urlString = key.toString();
  URL url = null;
  try {
    url = new URL(urlString);
  } catch (MalformedURLException e) {
    LOG.warn("Malformed URL: '" + urlString + "'");
  }
  int hashCode = (url == null ? urlString : url.getHost()).hashCode();
  // make hosts wind up in different partitions on different runs
  hashCode ^= seed;
  return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
}
As you can see from the code snippet, the partition number is a function of only the
host part of the URL, which means that all URLs that belong to the same host will end
up in the same partition.
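The same calculation can be tried without Hadoop. The sketch below is a standalone restatement of the listing above, assuming only the JDK; the class name HostPartitionSketch and the explicit seed parameter are inventions for this example, and logging of malformed URLs is left out.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Standalone sketch of hash-by-host partitioning; hypothetical helper, not Nutch code. */
class HostPartitionSketch {

  /** Returns a partition number in the range 0 .. numPartitions - 1. */
  static int partition(String urlString, int seed, int numPartitions) {
    String hashSource;
    try {
      // only the host contributes to the hash, so path and query are ignored
      hashSource = new URL(urlString).getHost();
    } catch (MalformedURLException e) {
      // as in the listing, fall back to hashing the whole string
      hashSource = urlString;
    }
    int hashCode = hashSource.hashCode() ^ seed;
    // clear the sign bit so the remainder is never negative
    return (hashCode & Integer.MAX_VALUE) % numPartitions;
  }
}
```

Two URLs on the same host, such as http://example.com/a and http://example.com/b?x=1, always receive the same partition number for a given seed, whereas a different seed on the next run may move the whole host to another partition.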
The output from this job is sorted in decreasing order by score. Since there are many
records in CrawlDb with the same score, we couldn't use MapFileOutputFormat because
we would violate the map file's invariant of strict key ordering.
Observant readers will notice that we had to use something other than the original keys,
but we still want to preserve the original key-value pairs. We use here a SelectorEntry
class to pass the original key-value pairs to the next step of processing.
Selector.reduce() keeps track of the total number of URLs and the maximum number
of URLs per host, and simply discards excessive records. Please note that the
enforcement of the total count limit is necessarily approximate. We calculate the limit
for the current task as the total limit divided by the number of reduce tasks. But we
don't know for sure from within the task that it is going to get an equal share of URLs;
indeed, in most cases, it doesn't because of the uneven distribution of URLs among
hosts. However, for Nutch this approximation is sufficient.
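To make the arithmetic concrete, here is a minimal sketch of the bookkeeping a single reduce task might do. It is an assumption-laden illustration rather than the actual Selector.reduce() code: the class SelectionLimitSketch and its accept() method are hypothetical, and hosts are passed in as plain strings.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of per-task and per-host limits; hypothetical names, not Nutch code. */
class SelectionLimitSketch {
  private final long taskLimit;   // this task's share: total limit / number of reduce tasks
  private final int maxPerHost;   // the "x" from requirement 3
  private final Map<String, Integer> perHost = new HashMap<>();
  private long accepted;

  SelectionLimitSketch(long totalLimit, int numReduceTasks, int maxPerHost) {
    this.taskLimit = totalLimit / numReduceTasks;
    this.maxPerHost = maxPerHost;
  }

  /** Returns true if a URL from this host may still be selected in this task. */
  boolean accept(String host) {
    if (accepted >= taskLimit) {
      return false;               // the task's share of the total is used up
    }
    int seen = perHost.getOrDefault(host, 0);
    if (seen >= maxPerHost) {
      return false;               // this host already has its maximum
    }
    perHost.put(host, seen + 1);
    accepted++;
    return true;
  }
}
```

With a total limit of 10 spread over 2 reduce tasks, each task accepts at most 5 URLs. Because records arrive sorted by descending score, the ones accepted first are the highest-scoring ones. A task that happens to receive fewer than 5 eligible URLs simply leaves part of its share unused, which is why the overall limit is only approximate.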
Step 2: Invert, partition by host, sort randomly. In the previous step, we ended up with
a sequence file of <score, selectorEntry>. Now we have to produce a sequence file of
<url, datum> and satisfy criteria 1, 2, and 6 just described. The input data for this step
is the output data produced in step 1.
The following is a snippet showing the setup of this job:
FileInputFormat.addInputPath(job, tempDir);
job.setInputFormat(SequenceFileInputFormat.class);
job.setMapperClass(SelectorInverseMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(SelectorEntry.class);
job.setPartitionerClass(PartitionUrlByHost.class);
job.setReducerClass(PartitionReducer.class);
job.setNumReduceTasks(numParts);
FileOutputFormat.setOutputPath(job, output);
job.setOutputFormat(SequenceFileOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(CrawlDatum.class);
job.setOutputKeyComparatorClass(HashComparator.class);
The SelectorInverseMapper class simply discards the current key (the score value),
extracts the original URL and uses it as a key, and uses the SelectorEntry as the value.
Careful readers may wonder why we don't go one step further, extracting also the
original CrawlDatum and using it as the value (more on this shortly).
The final output from this job is a sequence file of <Text, CrawlDatum>, but our output
from the map phase uses <Text, SelectorEntry>. We have to specify that we use
different key/value classes for the map output, using the setMapOutputKeyClass() and
setMapOutputValueClass() setters; otherwise, Hadoop assumes that we use the same
classes as declared for the reduce output (this conflict usually would cause a job to fail).
The output from the map phase is partitioned using the PartitionUrlByHost class so
that it again assigns URLs from the same host to the same partition. This satisfies
requirement 1.
Once the data is shuffled from map to reduce tasks, it's sorted by Hadoop according
to the output key comparator, in this case the HashComparator. This class uses a simple
hashing scheme to mix URLs in a way that is least likely to put URLs from the same
host close to each other.
In order to meet requirement 6, we set the number of reduce tasks to the desired number
of Fetcher map tasks (the numParts mentioned earlier), keeping in mind that each reduce
partition will be used later on to create a single Fetcher map task.
The PartitionReducer class is responsible for the final step, that is, to convert <url,
selectorEntry> to <url, crawlDatum>. A surprising side effect of using
HashComparator is that several URLs may be hashed to the same hash value, and Hadoop
will call the reduce() method passing only the first such key; all other keys considered
equal will be discarded. Now it becomes clear why we had to preserve all URLs in
SelectorEntry records, because now we can extract them from the iterated values. Here
is the implementation of this method:
public void reduce(Text key, Iterator<SelectorEntry> values,
    OutputCollector<Text, CrawlDatum> output, Reporter reporter)
    throws IOException {
  // when using HashComparator, we get only one input key in case of hash collisions
  // so use only URLs extracted from values
  while (values.hasNext()) {
    SelectorEntry entry = values.next();
    output.collect(entry.url, entry.datum);
  }
}
Finally, the output from reduce tasks is stored as a SequenceFileOutputFormat in a Nutch
segment directory, in a crawl_generate subdirectory. This output satisfies all criteria
from 1 to 6.
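The collision effect is easy to reproduce in plain Java. The sketch below is not the Nutch HashComparator: it assumes String.hashCode() as a stand-in hash function and uses a TreeMap with a hash-only comparator to imitate sort-then-group behavior, so keys that compare as equal fall into one group and only the first key of each group survives.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of grouping under a hash-only comparator; hypothetical, not Nutch code. */
class HashOrderSketch {

  /** Orders URLs by String.hashCode() alone, so distinct URLs may compare as equal. */
  static final Comparator<String> BY_HASH =
      (a, b) -> Integer.compare(a.hashCode(), b.hashCode());

  /** Groups URLs the way a sort-then-group phase would under BY_HASH. */
  static Map<String, List<String>> group(List<String> urls) {
    Map<String, List<String>> groups = new TreeMap<>(BY_HASH);
    for (String url : urls) {
      // the map keeps the first key of each group; later "equal" keys are dropped,
      // but their values are still appended to the group's list
      groups.computeIfAbsent(url, k -> new ArrayList<>()).add(url);
    }
    return groups;
  }
}
```

The strings "Aa" and "BB" have the same hashCode() in Java, so http://x/Aa and http://x/BB end up in a single group keyed by whichever came first. Both URLs are still present among the group's values, which is the reason PartitionReducer takes URLs from the values rather than from the key.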
Fetcher: A multithreaded MapRunner in action
The Fetcher application in Nutch is responsible for downloading the page content from
remote sites. As such, it is important that the process uses every opportunity for
parallelism, in order to minimize the time it takes to crawl a fetchlist.
There is already one level of parallelism present in Fetcher: multiple parts of the input
fetchlists are assigned to multiple map tasks. However, in practice this is not sufficient:
sequential download of URLs from different hosts (see the earlier section on
HashComparator) would be a tremendous waste of time. For this reason, the Fetcher map
tasks process this data using multiple worker threads.
Hadoop uses the MapRunner class to implement the sequential processing of input data
records. The Fetcher class implements its own MapRunner that uses multiple threads to
process input records in parallel.
Let's begin with the setup of the job:
job.setSpeculativeExecution(false);
FileInputFormat.addInputPath(job, "segment/crawl_generate");
job.setInputFormat(InputFormat.class);
job.setMapRunnerClass(Fetcher.class);
FileOutputFormat.setOutputPath(job, segment);
job.setOutputFormat(FetcherOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NutchWritable.class);
First, we turn off speculative execution. We can't run several map tasks downloading
content from the same hosts because it would violate the host-level load limits (such
as the number of concurrent requests and the number of requests per second).
Next, we use a custom InputFormat implementation that prevents Hadoop from
splitting partitions of input data into smaller chunks (splits), which would otherwise
create more map tasks than there are input partitions. This again ensures that we control
host-level access limits.
Output data is stored using a custom OutputFormat implementation, which creates
several output map files and sequence files created using data contained in
NutchWritable values. The NutchWritable class is a subclass of GenericWritable, able to
pass instances of several different Writable classes declared in advance.
The Fetcher class implements the MapRunner interface, and we set this class as the job's
MapRunner implementation. The relevant parts of the code are listed here:
public void run(RecordReader<Text, CrawlDatum> input,
    OutputCollector<Text, NutchWritable> output,
    Reporter reporter) throws IOException {
  int threadCount = getConf().getInt("fetcher.threads.fetch", 10);
  feeder = new QueueFeeder(input, fetchQueues, threadCount * 50);
  feeder.start();
  for (int i = 0; i < threadCount; i++) { // spawn threads
    new FetcherThread(getConf()).start();
  }
  do { // wait for threads to exit
    try {
      Thread.sleep(1000);
    } catch (InterruptedException e) {}
    reportStatus(reporter);
  } while (activeThreads.get() > 0);
}
Fetcher reads many input records in advance, using the QueueFeeder thread that puts
input records into a set of per-host queues. Then several FetcherThread instances are
started, which consume items from per-host queues, while QueueFeeder keeps reading
input data to keep the queues filled. Each FetcherThread consumes items from any
nonempty queue.
In the meantime, the main thread of the map task spins around waiting for all threads
to finish their job. Periodically, it reports the status to the framework to ensure that
Hadoop doesn't consider this task to be dead and kill it. Once all items are processed,
the loop is finished and the control is returned to Hadoop, which considers this map
task to be completed.
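The wait loop in the listing can be sketched with JDK classes alone. The version below is heavily simplified and is only an assumption about the general shape, not the Fetcher implementation: it uses one shared queue instead of per-host queues, preloads all items instead of running a feeder thread, and counts items instead of downloading anything. The class name FetchLoopSketch is hypothetical.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of worker threads plus a polling main thread; hypothetical, not Nutch code. */
class FetchLoopSketch {

  /** Drains the URLs with worker threads and returns how many items were handled. */
  static int run(List<String> urls, int threadCount) {
    BlockingQueue<String> queue = new LinkedBlockingQueue<>(urls);
    AtomicInteger activeThreads = new AtomicInteger();
    AtomicInteger handled = new AtomicInteger();

    for (int i = 0; i < threadCount; i++) {
      activeThreads.incrementAndGet();   // count the worker before it starts
      new Thread(() -> {
        try {
          while (queue.poll() != null) {
            handled.incrementAndGet();   // a real worker would download the page here
          }
        } finally {
          activeThreads.decrementAndGet();
        }
      }).start();
    }

    // the main thread polls until every worker has exited;
    // a MapRunner would also report status to the framework on each pass
    while (activeThreads.get() > 0) {
      try {
        Thread.sleep(10);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
      }
    }
    return handled.get();
  }
}
```

Counting each worker before it starts avoids the race in which the main thread sees zero active threads before any worker has had a chance to run.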
Indexer: Using custom OutputFormat
This is an example of a MapReduce application that doesn't produce sequence file or
map file output; instead, the output from this application is a Lucene index. Again,
as MapReduce applications may consist of several reduce tasks, the output from this
application may consist of several partial Lucene indexes.
The Nutch Indexer tool uses information from CrawlDb, LinkDb, and Nutch segments
(fetch status, parsing status, page metadata, and plain-text data), so the job setup
section involves adding several input paths:
FileInputFormat.addInputPath(job, crawlDbPath);
FileInputFormat.addInputPath(job, linkDbPath);
// add segment data
FileInputFormat.addInputPath(job, "segment/crawl_fetch");
FileInputFormat.addInputPath(job, "segment/crawl_parse");
FileInputFormat.addInputPath(job, "segment/parse_data");
FileInputFormat.addInputPath(job, "segment/parse_text");
job.setInputFormat(SequenceFileInputFormat.class);
job.setMapperClass(Indexer.class);
job.setReducerClass(Indexer.class);
FileOutputFormat.setOutputPath(job, indexDir);
job.setOutputFormat(OutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LuceneDocumentWrapper.class);
All corresponding records for a URL dispersed among these input locations need to be
combined to create Lucene documents to be added to the index.
The Mapper implementation in Indexer simply wraps input data, whatever its source
and implementation class, in a NutchWritable, so that the reduce phase may receive
data from different sources, using different classes, and still be able to consistently
declare a single output value class (as NutchWritable) from both map and reduce steps.
The Reducer implementation iterates over all values that fall under the same key (URL),
unwraps the data (fetch CrawlDatum, CrawlDb CrawlDatum, LinkDb Inlinks, ParseData
and ParseText) and, using this information, builds a Lucene document, which is
then wrapped in a Writable LuceneDocumentWrapper and collected. In addition to all
textual content (coming either from the plain-text data or from metadata), this
document also contains PageRank-like score information (obtained from CrawlDb data).
Nutch uses this score to set the boost value of the Lucene document.
The OutputFormat implementation is the most interesting part of this tool:
public static class OutputFormat extends
    FileOutputFormat<WritableComparable, LuceneDocumentWrapper> {
  public RecordWriter<WritableComparable, LuceneDocumentWrapper>
      getRecordWriter(final FileSystem fs, JobConf job,
          String name, final Progressable progress) throws IOException {
    final Path out = new Path(FileOutputFormat.getOutputPath(job), name);
    final IndexWriter writer = new IndexWriter(out.toString(),
        new NutchDocumentAnalyzer(job), true);
    return new RecordWriter<WritableComparable, LuceneDocumentWrapper>() {
      // set by close() and read by the heartbeat thread, hence volatile
      volatile boolean closed;

      public void write(WritableComparable key, LuceneDocumentWrapper value)
          throws IOException { // unwrap & index doc
        Document doc = value.get();
        writer.addDocument(doc);
        progress.progress();
      }

      public void close(final Reporter reporter) throws IOException {
        // spawn a thread to give progress heartbeats
        Thread prog = new Thread() {
          public void run() {
            while (!closed) {
              try {
                reporter.setStatus("closing");
                Thread.sleep(1000);
              } catch (InterruptedException e) {
                continue;
              } catch (Throwable e) {
                return;
              }
            }
          }
        };
        try {
          prog.start();
          // optimize & close index
          writer.optimize();
          writer.close();
        } finally {
          closed = true;
        }
      }
    };
  }
}
When an instance of RecordWriter is requested, the OutputFormat creates a new Lucene
index by opening an IndexWriter. Then, for each new output record collected in the
reduce method, it unwraps the Lucene document from the LuceneDocumentWrapper value
and adds it to the index.
When a reduce task is finished, Hadoop will try to close the RecordWriter. In this case,
the process of closing may take a long time, because we would like to optimize the
index before closing. During this time, Hadoop may conclude that the task is hung,
since there are no progress updates, and it may attempt to kill it. For this reason, we
first start a background thread to give reassuring progress updates, and then proceed
to perform the index optimization. Once the optimization is completed, we stop the
progress updater thread. The output index is now created, optimized, and closed,
and ready for use in a searcher application.
Summary
This short overview of Nutch necessarily omits many details, such as error handling,
logging, URL filtering and normalization, dealing with redirects or other forms of
aliased pages (such as mirrors), removing duplicate content, calculating PageRank
scoring, etc. You can find this and much more information on the official page of the
project and on the wiki (http://wiki.apache.org/nutch).
Today, Nutch is used by many organizations and individual users. Still, operating a
search engine requires nontrivial investment in hardware, integration, and
customization, and the maintenance of the index; so in most cases, Nutch is used to build
commercial vertical- or field-specific search engines.
Nutch is under active development, and the project follows closely new releases of
Hadoop. As such, it will continue to be a practical example of a real-life application
that uses Hadoop at its core, with excellent results.
Andrzej Bialecki
Log Processing at Rackspace
Rackspace Hosting has always provided managed systems for enterprises, and in that
vein, Mailtrust became Rackspace's mail division in Fall 2007. Rackspace currently
hosts email for over 1 million users and thousands of companies on hundreds of servers.
Requirements/The Problem
Transferring the mail generated by Rackspace customers through the system generates
a considerable paper trail, in the form of around 150 GB per day of logs in various
formats. It is extremely helpful to aggregate that data for growth planning purposes
and to understand how customers use our applications, and the records are also a boon
for troubleshooting problems in the system.
If an email fails to be delivered, or a customer is unable to log in, it is vital that our
customer support team is able to find enough information about the problem to begin
the debugging process. To make it possible to find that information quickly, we cannot
leave the logs on the machines that generated them or in their original format. Instead,
we use Hadoop to do a considerable amount of processing, with the end result being
Lucene indexes that customer support can query.
Logs
Two of our highest volume log formats are produced by the Postfix mail transfer agent
and Microsoft Exchange Server. All mail that travels through our systems touches
Postfix at some point, and the majority of messages travel through multiple Postfix
servers. The Exchange environment is independent by necessity, but one class of Postfix
machines acts as an added layer of protection and uses SMTP to transfer messages
between mailboxes hosted in each environment.
The messages travel through many machines, but each server only knows enough about
the destination of the mail to transfer it to the next responsible server. Thus, in order
to build the complete history of a message, our log processing system needs to have a
global view of the system. This is where Hadoop helps us immensely: as our system
grows, so does the volume of logs. For our log processing logic to stay viable, we had
to ensure that it would scale, and MapReduce was the perfect framework for that
growth.
Brief History
Earlier versions of our log processing system were based on MySQL, but as we gained
more and more logging machines, we reached the limits of what a single MySQL server
could process. The database schema was already reasonably denormalized, which
would have made it less difficult to shard, but MySQL's partitioning support was still
very weak at that point in time. Rather than implementing our own sharding and
processing solution around MySQL, we chose to use Hadoop.
Choosing Hadoop
As soon as you shard the data in an RDBMS, you lose a lot of the advantages of
SQL for performing analysis of your dataset. Hadoop gives us the ability to easily
process all of our data in parallel using the same algorithms we would for smaller datasets.
Collection and Storage
Log collection
The servers generating the logs we process are distributed across multiple data centers,
but we currently have a single Hadoop cluster, located in one of those data centers (see
Figure 16-8). In order to aggregate the logs and place them into the cluster, we use the
Unix syslog replacement syslog-ng and some simple scripts to control the creation of
files in Hadoop.
Figure 16-8. Hadoop data flow at Rackspace
Within a data center, syslog-ng is used to transfer logs from a source machine to a
load-balanced set of collector machines. On the collectors, each type of log is aggregated
into a single stream and lightly compressed with gzip (step A in Figure 16-8). From remote
collectors, logs can be transferred through an SSH tunnel cross-data center to collectors
that are local to the Hadoop cluster (step B).
Once the compressed log stream reaches a local collector, it can be written to Hadoop
(step C). We currently use a simple Python script that buffers input data to disk and
periodically pushes the data into the Hadoop cluster using the Hadoop command-line
interface. The script copies the log buffers to input folders in Hadoop when they reach
a multiple of the Hadoop block size or when enough time has passed.
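The flush rule in the last sentence can be written down as a small decision function. The sketch below is in Java for consistency with the other listings, although the collector script described here is written in Python. The class FlushPolicySketch, its parameters, and the choice to push everything on a timeout are assumptions made for illustration, since the text does not give the script's actual thresholds.

```java
/** Sketch of a size-or-age flush rule; hypothetical, not the Rackspace script. */
class FlushPolicySketch {
  private final long blockSizeBytes;  // HDFS block size, e.g. 64 MB
  private final long maxAgeMillis;    // how long data may sit in the buffer

  FlushPolicySketch(long blockSizeBytes, long maxAgeMillis) {
    this.blockSizeBytes = blockSizeBytes;
    this.maxAgeMillis = maxAgeMillis;
  }

  /** True when at least one full block is buffered or nonempty data is too old. */
  boolean shouldFlush(long bufferedBytes, long ageMillis) {
    boolean hasFullBlock = bufferedBytes >= blockSizeBytes;
    boolean tooOld = bufferedBytes > 0 && ageMillis >= maxAgeMillis;
    return hasFullBlock || tooOld;
  }

  /** Bytes to push now: everything on a timeout, otherwise whole blocks only. */
  long bytesToPush(long bufferedBytes, long ageMillis) {
    if (bufferedBytes > 0 && ageMillis >= maxAgeMillis) {
      return bufferedBytes;
    }
    return (bufferedBytes / blockSizeBytes) * blockSizeBytes;
  }
}
```

Pushing whole multiples of the block size keeps files in HDFS aligned with block boundaries, while the age limit bounds how stale the data in the cluster can become during quiet periods.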
This method of securely aggregating logs from different data centers was developed
before SOCKS support was added to Hadoop via the
hadoop.rpc.socket.factory.class.default parameter and SocksSocketFactory class. By
using SOCKS support and the HDFS API directly from remote collectors, we could
eliminate one disk write and a lot of complexity from the system. We plan to implement
a replacement using these features in future development sprints.
Once the raw logs have been placed in Hadoop, they are ready for processing by our
MapReduce jobs.
Log storage
Our Hadoop cluster currently contains 15 datanodes with commodity CPUs and three
500 GB disks each. We use a default replication factor of three for files that need to
survive for our archive period of six months and two for anything else.
The Hadoop namenode uses hardware identical to the datanodes. To provide
reasonably high availability, we use two secondary namenodes and a virtual IP that can
easily be pointed at any of the three machines with snapshots of the HDFS. This means that
in a failover situation, there is potential for us to lose up to 30 minutes of data,
depending on the ages of the snapshots on the secondary namenodes. This is acceptable
for our log processing application, but other Hadoop applications may require lossless
failover by using shared storage for the namenode's image.
MapReduce for Logs
Processing
In distributed systems, the sad truth of unique identifiers is that they are rarely actually
unique. All email messages have a (supposedly) unique identifier called a message-id
that is generated by the host where they originated, but a bad client could easily send
out duplicates. In addition, since the designers of Postfix could not trust the
message-id to uniquely identify the message, they were forced to come up with a separate
ID called a queue-id, which is guaranteed to be unique only for the lifetime of the message
on a local machine.
Although the message-id tends to be the definitive identifier for a message, in Postfix
logs, it is necessary to use queue-ids to find the message-id. Looking at the second line
in Example 16-1 (which is formatted to better fit the page), you will see the hex string
1DBD21B48AE, which is the queue-id of the message that the log line refers to. Because
information about a message (including its message-id) is output as separate lines when
it is collected (potentially hours apart), it is necessary for our parsing code to keep state
about messages.
Example 16-1. Postfix log lines
Nov 12 17:36:54 gate8.gate.sat.mlsrvr.com postfix/smtpd[2552]: connect from hostname
Nov 12 17:36:54 relay2.relay.sat.mlsrvr.com postfix/qmgr[9489]: 1DBD21B48AE:
from=<[email protected]>, size=5950, nrcpt=1 (queue active)
Nov 12 17:36:54 relay2.relay.sat.mlsrvr.com postfix/smtpd[28085]: disconnect from
hostname
Nov 12 17:36:54 gate5.gate.sat.mlsrvr.com postfix/smtpd[22593]: too many errors
after DATA from hostname
Nov 12 17:36:54 gate5.gate.sat.mlsrvr.com postfix/smtpd[22593]: disconnect from
hostname
Nov 12 17:36:54 gate10.gate.sat.mlsrvr.com postfix/smtpd[10311]: connect from
hostname
Nov 12 17:36:54 relay2.relay.sat.mlsrvr.com postfix/smtp[28107]: D42001B48B5:
to=<[email protected]>, relay=hostname[ip], delay=0.32, delays=0.28/0/0/0.04,
dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 1DBD21B48AE)
Nov 12 17:36:54 gate20.gate.sat.mlsrvr.com postfix/smtpd[27168]: disconnect from
hostname
Nov 12 17:36:54 gate5.gate.sat.mlsrvr.com postfix/qmgr[1209]: 645965A0224: removed
Nov 12 17:36:54 gate2.gate.sat.mlsrvr.com postfix/smtp[15928]: 732196384ED: to=<m
[email protected]>, relay=hostname[ip], conn_use=2, delay=0.69, delays=0.04/
0.44/0.04/0.17, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 02E1544C005)
Nov 12 17:36:54 gate2.gate.sat.mlsrvr.com postfix/qmgr[13764]: 732196384ED: removed
Nov 12 17:36:54 gate1.gate.sat.mlsrvr.com postfix/smtpd[26394]: NOQUEUE: reject: RCP
T from hostname 554 5.7.1 <[email protected]>: Client host rejected: The
sender's mail server is blocked; from=<[email protected]> to=<mapred
[email protected]> proto=ESMTP helo=<[email protected]>
From a MapReduce perspective, each line of the log is a single key-value pair. In phase
1, we need to map all lines with a single queue-id key together, and then reduce them
to determine if the log message values indicate that the queue-id is complete.
Similarly, once we have a completed queue-id for a message, we need to group by the
message-id in phase 2. We Map each completed queue-id with its message-id as key
and a list of its log lines as the value. In Reduce, we determine whether all of the
queue-ids for the message-id indicate that the message left our system.
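A minimal sketch of the phase 1 grouping, using only the JDK, might look as follows. Everything in it is an assumption made for illustration: the class QueueIdGroupingSketch, the regular expression (inferred from the lines in Example 16-1), and in particular the completeness rule, which here treats a queue-id as complete once a "removed" line has been seen. The text does not spell out the real criteria.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of grouping Postfix log lines by queue-id; hypothetical, not Rackspace code. */
class QueueIdGroupingSketch {
  // matches e.g. "postfix/qmgr[9489]: 1DBD21B48AE: from=<...>"
  private static final Pattern QUEUE_LINE =
      Pattern.compile("postfix/[a-z]+\\[\\d+\\]: ([0-9A-F]+): (.*)");

  /** Returns the queue-id of a log line, or null if the line carries none. */
  static String queueId(String line) {
    Matcher m = QUEUE_LINE.matcher(line);
    return m.find() ? m.group(1) : null;
  }

  /** "Map and shuffle": groups lines by queue-id, preserving input order. */
  static Map<String, List<String>> groupByQueueId(List<String> lines) {
    Map<String, List<String>> groups = new LinkedHashMap<>();
    for (String line : lines) {
      String id = queueId(line);
      if (id != null) {
        groups.computeIfAbsent(id, k -> new ArrayList<>()).add(line);
      }
    }
    return groups;
  }

  /** "Reduce": assumed rule, complete once the queue manager logs "removed". */
  static boolean isComplete(List<String> linesForQueueId) {
    for (String line : linesForQueueId) {
      if (line.endsWith(": removed")) {
        return true;
      }
    }
    return false;
  }
}
```

Lines without a queue-id, such as the connect, disconnect, and NOQUEUE entries in Example 16-1, are skipped by this sketch. In the job described here, a group that is not yet complete would be requeued and merged with the next batch of raw logs.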
Together, the two phases of the mail log MapReduce job and their InputFormat and
OutputFormat form a type of staged event-driven architecture (SEDA). In SEDA, an
application is broken up into multiple stages, which are separated by queues. In a
Hadoop context, a queue could be either an input folder in HDFS that a MapReduce job
consumes from or the implicit queue that MapReduce forms between the Map and
Reduce steps.
In Figure 16-9, the arrows between stages represent the queues, with a dashed arrow
being the implicit MapReduce queue. Each stage can send a key-value pair (SEDA calls
them events or messages) to another stage via these queues.
Figure 16-9. MapReduce chain
Phase 1: Map. During the first phase of our Mail log processing job, the inputs to the Map
stage are either a line number key and log message value or a queue-id key to an array
of log-message values. The first type of input is generated when we process a raw logfile
from the queue of input files, and the second type is an intermediate format that
represents the state of a queue-id we have already attempted to process but that was
requeued because it was incomplete.
In order to accomplish this dual input, we implemented a Hadoop InputFormat that
delegates the work to an underlying SequenceFileRecordReader or LineRecordReader,
depending on the file extension of the input FileSplit. The two input formats come
from different input folders (queues) in HDFS.
Phase 1: Reduce. During this phase, the Reduce stage determines whether the queue-id
has enough lines to be considered completed. If the queue-id is completed, we output
the message-id as key and a HopWritable object as value. Otherwise, the queue-id is set
as the key, and the array of log lines is requeued to be Mapped with the next set of raw
logs. This will continue until we complete the queue-id or until it times out.
The HopWritable object is a POJO that implements Hadoop's
Writable interface. It completely describes a message from the viewpoint
of a single server, including the sending address and IP, attempts to
deliver the message to other servers, and typical message header
information.
This split output is accomplished with an OutputFormat implementation that is
somewhat symmetrical with our dual InputFormat. Our MultiSequenceFileOutputFormat was
implemented before the Hadoop API added a MultipleSequenceFileOutputFormat in
r0.17.0, but fulfills the same type of goal: we needed our Reduce output pairs to go to
different files depending on characteristics of their keys.
Phase 2: Map. In the next stage of the Mail log processing job, the input is a message-id
key, with a HopWritable value from the previous phase. This stage does not contain any
logic: instead, it simply combines the inputs from the first phase using the standard
SequenceFileInputFormat and IdentityMapper.
Phase 2: Reduce. In the final Reduce stage, we want to see whether all of the
HopWritables we have collected for the message-id represent a complete message path
through our system. A message path is essentially a directed graph (which is typically
acyclic, but it may contain loops if servers are misconfigured). In this graph, a vertex is a
server, which can be labeled with multiple queue-ids, and attempts to deliver the message
from one server to another are edges. For this processing, we use the JGraphT graph library.
For output, we again use the MultiSequenceFileOutputFormat. If the Reducer decides
that all of the queue-ids for a message-id create a complete message path, then the
message is serialized and queued for the SolrOutputFormat. Otherwise, the
HopWritables for the message are queued for phase 2: Map stage to be reprocessed with
the next batch of queue-ids.
The SolrOutputFormat contains an embedded Apache Solr instance (in the fashion that
was originally recommended by the Solr wiki) to generate an index on local disk.
Closing the OutputFormat then involves compressing the disk index to the final
destination for the output file. This approach has a few advantages over using Solr's HTTP
interface or using Lucene directly:
• We can enforce a Solr schema.
• Map and Reduce remain idempotent.
• Indexing load is removed from the Search nodes.
We currently use the default HashPartitioner class to decide which Reduce task will
receive particular keys, which means that the keys are semirandomly distributed. In a
future iteration of the system, we'd like to implement a new Partitioner to split by
sending address instead (our most common search term). Once the indexes are split
by sender, we can use the hash of the address to determine where to merge or query
for an index, and our search API will only need to communicate with the relevant nodes.
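The idea of splitting by sending address can be sketched in plain Java. This is an illustration only, not the partitioner described above: the class and method names are hypothetical, and lowercasing the address before hashing is an assumption. The arithmetic (clear the sign bit, then take the modulus) is the same one the default HashPartitioner applies to a key's hash code.

```java
import java.util.Locale;

/** Hypothetical helper: maps a sender address to a Reduce task (or index shard). */
final class SenderPartitionSketch {
    private SenderPartitionSketch() {}

    static int partitionFor(String senderAddress, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        // Assumption: addresses are compared case-insensitively and without padding.
        String normalized = senderAddress.trim().toLowerCase(Locale.ROOT);
        // Clear the sign bit so the result of % is never negative.
        return (normalized.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("alice@example.com", 8));
    }
}
```

Because the same address always maps to the same partition, a search for one sender only has to consult the node that holds that partition.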
Merging for near-term search
After a set of MapReduce phases have completed, a different set of machines are notified
of the new indexes and can pull them for merging. These Search nodes are running
Apache Tomcat and Solr instances to host completed indexes, along with a service to
pull and merge the indexes to local disk (step D in Figure 16-8).
Each compressed file from SolrOutputFormat is a complete Lucene index, and Lucene
provides the IndexWriter.addIndexes() methods for quickly merging multiple indexes.
Our MergeAgent service decompresses each new index into a Lucene RAMDirectory or
FSDirectory (depending on size), merges them to local disk, and sends a <commit/>
request to the Solr instance hosting the index to make the changed index visible to
queries.
Sharding. The Query/Management API is a thin layer of PHP code that handles sharding
the output indexes across all of the Search nodes. We use a simple implementation of
consistent hashing to decide which Search nodes are responsible for each index file.
Currently, indexes are sharded by their creation time and then by their hashed filename,
but we plan to replace the filename hash with a sending address hash at some point in
the future (see phase 2: Reduce).
Because HDFS already handles replication of the Lucene indexes, there is no need to
keep multiple copies available in Solr. Instead, in a failover situation, the Search node
is completely removed, and other nodes become responsible for merging the indexes.
Search results. With this system, we've achieved a 15-minute turnaround time from log
generation to availability of a search result for our Customer Support team.
Our search API supports the full Lucene query syntax, so we commonly see complex
queries like:
sender:"[email protected]" -recipient:"[email protected]"
recipient:"@rackspace.com" short-status:deferred timestamp:[1228140900 TO 2145916799]
Each result returned by a query is a complete serialized message path, which indicates
whether individual servers and recipients received the message. We currently display
the path as a 2D graph (Figure 16-10) that the user can interact with to expand points
of interest, but there is a lot of room for improvement in the visualization of this data.
Figure 16-10. Data tree
Archiving for analysis
In addition to providing short-term search for Customer Support, we are also interested
in performing analysis of our log data.
Every night, we run a series of MapReduce jobs with the day's indexes as input. We
implemented a SolrInputFormat that can pull and decompress an index, and emit each
document as a key-value pair. With this InputFormat, we can iterate over all message
paths for a day and answer almost any question about our mail system, including:
• Per domain data (viruses, spam, connections, recipients)
• Most effective spam rules
• Load generated by specific users
• Reasons for message bounces
• Geographical sources of connections
• Average latency between specific machines
Since we have months of compressed indexes archived in Hadoop, we are also able to
retrospectively answer questions that our nightly log summaries leave out. For instance,
we recently wanted to determine the top sending IP addresses per month, which we
accomplished with a simple one-off MapReduce job.
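The shape of such a one-off job can be sketched in plain Java, without Hadoop. The sketch assumes each record has already been reduced to a (month, IP address) pair. The class name, the record layout, and the tie-breaking rule are illustrative assumptions, not the job that was actually run.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for a "top senders per month" job. */
final class TopSendersSketch {
    private TopSendersSketch() {}

    /** Each record is {month, ip}; returns the n busiest IPs for every month. */
    static Map<String, List<String>> topSendersPerMonth(List<String[]> records, int n) {
        // "Map" and "reduce": count connections per (month, ip).
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (String[] r : records) {
            counts.computeIfAbsent(r[0], m -> new HashMap<>()).merge(r[1], 1, Integer::sum);
        }
        // Rank within each month: highest count first, ties broken by IP string.
        Map<String, List<String>> top = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> month : counts.entrySet()) {
            List<Map.Entry<String, Integer>> ranked = new ArrayList<>(month.getValue().entrySet());
            ranked.sort((a, b) -> {
                int byCount = Integer.compare(b.getValue(), a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            });
            List<String> ips = new ArrayList<>();
            for (Map.Entry<String, Integer> e : ranked.subList(0, Math.min(n, ranked.size()))) {
                ips.add(e.getKey());
            }
            top.put(month.getKey(), ips);
        }
        return top;
    }
}
```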
Stu Hood
Cascading
Cascading is an open source Java library and application programming interface (API)
that provides an abstraction layer for MapReduce. It allows developers to build
complex, mission-critical data processing applications that run on Hadoop clusters.
The Cascading project began in the summer of 2007. Its first public release, version
0.1, launched in January 2008. Version 1.0 was released in January 2009. Binaries,
source code, and add-on modules can be downloaded from the project website,
http://www.cascading.org/.
Map and Reduce operations offer powerful primitives. However, they tend to be
at the wrong level of granularity for creating sophisticated, highly composable code
that can be shared among different developers. Moreover, many developers find it
difficult to think in terms of MapReduce when faced with real-world problems.
To address the first issue, Cascading substitutes the keys and values used in
MapReduce with simple field names and a data tuple model, where a tuple is simply a list
of values. For the second issue, Cascading departs from Map and Reduce operations
directly by introducing higher-level abstractions as alternatives: Functions, Filters,
Aggregators, and Buffers.
Other alternatives began to emerge at about the same time as the project's initial public
release, but Cascading was designed to complement them. Consider that most of these
alternative frameworks impose pre- and post-conditions, or other expectations.
For example, in several other MapReduce tools, you must preformat, filter, or import
your data into the Hadoop Filesystem (HDFS) prior to running the application. That
step of preparing the data must be performed outside of the programming abstraction.
In contrast, Cascading provides means to prepare and manage your data as integral
parts of the programming abstraction.
This case study begins with an introduction to the main concepts of Cascading, then
finishes with an overview of how ShareThis uses Cascading in its infrastructure.
Please see the Cascading User Guide on the project website for a more in-depth
presentation of the Cascading processing model.
Fields, Tuples, and Pipes
The MapReduce model uses keys and values to link input data to the Map function,
the Map function to the Reduce function, and the Reduce function to the output data.
But as we know, real-world Hadoop applications are usually more than one MapReduce
job chained together. Consider the canonical word count example implemented
in MapReduce. If you needed to sort the numeric counts in descending order, not an
unlikely requirement, it would need to be done in a second MapReduce job.
So, in the abstract, keys and values not only bind Map to Reduce, but Reduce to the
next Map, and then to the next Reduce, and so on (Figure 16-11). That is, key-value
pairs are sourced from input files and stream through chains of Map and Reduce
operations, and finally rest in an output file. When you implement enough of these
chained MapReduce applications, you start to see a well-defined set of key/value
manipulations used over and over again to modify the key/value data stream.
Figure 16-11. Counting and sorting in MapReduce
Cascading simplifies this by abstracting away keys and values and replacing them with
tuples that have corresponding field names, similar in concept to tables and column
names in a relational database. And during processing, streams of these fields and tuples
are then manipulated as they pass through user-defined operations linked together by
pipes (Figure 16-12).
So, MapReduce keys and values are reduced to:
Fields
Fields are a collection of either String names (like "firstname"), numeric positions
(like 2, or -1, for the third and last position, respectively), or a combination of
both, very much like column names. So fields are used to declare the names of
values in a tuple and to select values by name from a tuple. The latter is like a SQL
select call.
Tuple
A tuple is simply an array of java.lang.Comparable objects. A tuple is very much
like a database row or record.
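A small plain-Java sketch may make the field and tuple model concrete. It is not Cascading's Fields or Tuple API: the class and method names are hypothetical. It only shows the selection rule described above, where a selector is either a name or a position and a negative position counts back from the end.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of selecting tuple values by field name or position. */
final class FieldsTupleSketch {
    private FieldsTupleSketch() {}

    /** Resolves a String name or an Integer position (negative = from the end) to an index. */
    static int resolve(List<String> fieldNames, Object selector) {
        if (selector instanceof String) {
            int pos = fieldNames.indexOf(selector);
            if (pos < 0) {
                throw new IllegalArgumentException("no such field: " + selector);
            }
            return pos;
        }
        int pos = (Integer) selector;
        int resolved = pos < 0 ? fieldNames.size() + pos : pos;
        if (resolved < 0 || resolved >= fieldNames.size()) {
            throw new IndexOutOfBoundsException("no such position: " + pos);
        }
        return resolved;
    }

    /** Returns the selected values, in selector order, much like a SQL select list. */
    static List<Comparable<?>> select(List<String> fieldNames, Comparable<?>[] tuple,
                                      Object... selectors) {
        List<Comparable<?>> out = new ArrayList<>();
        for (Object selector : selectors) {
            out.add(tuple[resolve(fieldNames, selector)]);
        }
        return out;
    }
}
```

With field names (firstname, lastname, age), selecting "firstname" and -1 returns the first and last values of the tuple.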
And the Map and Reduce operations are abstracted behind one or more pipe instances
(Figure 16-13):
Each
The Each pipe processes a single input tuple at a time. It may apply either a
Function or a Filter operation (described shortly) to the input tuple.
GroupBy
The GroupBy pipe groups tuples on grouping fields. It behaves just like the SQL
group by statement. It can also merge multiple input tuple streams into a single
stream, if they all share the same field names.
CoGroup
The CoGroup pipe joins multiple tuple streams together by common field
names, and it also groups the tuples by the common grouping fields. All standard
join types (inner, outer, etc.) and custom joins can be used across two or more
tuple streams.
Every
The Every pipe processes a single grouping of tuples at a time, where the group
was grouped by a GroupBy or CoGroup pipe. The Every pipe may apply either an
Aggregator or a Buffer operation to the grouping.
SubAssembly
The SubAssembly pipe allows for nesting of assemblies inside a single pipe, which
can, in turn, be nested in more complex assemblies.
Figure 16-12. Pipes linked by fields and tuples
Figure 16-13. Pipe types
All these pipes are chained together by the developer into pipe assemblies in which
each assembly can have many input tuple streams (sources) and many output tuple
streams (sinks) (see Figure 16-14).
Figure 16-14. A simple PipeAssembly
On the surface, this might seem more complex than the traditional MapReduce model.
And admittedly there are more concepts here than Map, Reduce, Key, and Value. But
in practice, there are many more concepts that must all work in tandem to provide
different behaviors.
For example, if a developer wanted to provide a secondary sorting of reducer values,
she would need to implement Map, Reduce, a composite Key (two Keys nested in a
parent Key), Value, Partitioner, an output value grouping Comparator, and an
output key Comparator, all of which would be coupled to one another in varying ways
and, very likely, nonreusable in subsequent applications.
In Cascading, this would be one line of code: new GroupBy(<previous>, <grouping
fields>, <secondary sorting fields>), where previous is the pipe that came before.
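What a grouping with a secondary sort produces can be shown with a few lines of plain Java. This is a sketch of the resulting data shape only, under the assumption that both keys are strings. It is neither the MapReduce machinery listed above nor Cascading's implementation, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of "group by one field, sort values by another". */
final class SecondarySortSketch {
    private SecondarySortSketch() {}

    /** Each row is {groupingValue, secondaryValue}. */
    static Map<String, List<String>> groupWithSecondarySort(List<String[]> rows) {
        // Primary grouping: one entry per distinct grouping value, in key order.
        Map<String, List<String>> groups = new TreeMap<>();
        for (String[] row : rows) {
            groups.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        // Secondary sort: order the values inside every group.
        for (List<String> values : groups.values()) {
            Collections.sort(values);
        }
        return groups;
    }
}
```

In a distributed job the values cannot simply be collected and sorted in memory like this, which is why the hand-written MapReduce version needs the composite key, partitioner, and comparators.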
Operations
As mentioned earlier, Cascading departs from MapReduce by introducing alternative
operations that are applied either to individual tuples or to groups of tuples (Figure 16-15):
Function
A Function operates on individual input tuples and may return zero or more output
tuples for every one input. Functions are applied by the Each pipe.
Filter
A Filter is a special kind of function that returns a boolean value indicating
whether the current input tuple should be removed from the tuple stream. A
function could serve this purpose, but the Filter is optimized for this case, and
many filters can be grouped by logical filters like And, Or, Xor, and Not, rapidly
creating more complex filtering operations.
Aggregator
An Aggregator performs some operation against a group of tuples, where the
grouped tuples are grouped by a common set of field values. For example, all
tuples having the same last-name value. Common Aggregator implementations
would be Sum, Count, Average, Max, and Min.
Buffer
A Buffer is similar to the Aggregator, except it is optimized to act as a sliding
window across all the tuples in a unique grouping. This is useful when the
developer needs to efficiently insert missing values in an ordered set of tuples (like a
missing date or duration), or create a running average. Usually Aggregator is the
operation of choice when working with groups of tuples, since many
Aggregators can be chained together very efficiently, but sometimes a Buffer is the
best tool for the job.
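The four kinds of operation differ mainly in what they see and what they may emit. The following plain-Java sketch mirrors those shapes with ordinary static methods. The names are hypothetical and none of this is the Cascading operation API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-ins for the four operation shapes. */
final class OperationsSketch {
    private OperationsSketch() {}

    /** Function-style: one input value, zero or more outputs. */
    static List<String> splitWords(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    /** Filter-style: returns true when the input should be removed from the stream. */
    static boolean removeIfBlank(String value) {
        return value.trim().isEmpty();
    }

    /** Aggregator-style: folds a whole group into a single result. */
    static long count(List<?> group) {
        long n = 0;
        for (Object ignored : group) {
            n++;
        }
        return n;
    }

    /** Buffer-style: walks an ordered group and emits one running average per input. */
    static List<Double> runningAverage(List<Double> orderedGroup) {
        List<Double> out = new ArrayList<>();
        double sum = 0;
        for (int i = 0; i < orderedGroup.size(); i++) {
            sum += orderedGroup.get(i);
            out.add(sum / (i + 1));
        }
        return out;
    }
}
```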
Operations are bound to pipes when the pipe assembly is created (Figure 16-16).
The Each and Every pipes provide a simple mechanism for selecting some or all values
out of an input tuple before being passed to its child operation. And there is a simple
mechanism for merging the operation results with the original input tuple to create the
output tuple. Without going into great detail, this allows for each operation to only
care about argument tuple values and fields, not the whole set of fields in the current
input tuple. Subsequently, operations can be reusable across applications the same way
Java methods can be reusable.
Figure 16-15. Operation types
Figure 16-16. An assembly of operations
For example, in Java, a method declared as concatenate(String first, String
second) is more abstract than concatenate(Person person). In the second case, the
concatenate() function must know about the Person object; in the first case, it is
agnostic to where the data came from. Cascading operations exhibit this same quality.
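The two declarations can be written out as follows. The Person class and its two fields are assumptions made for the illustration.

```java
/** Two ways to declare the same operation, differing in what they must know. */
final class ConcatenateSketch {
    private ConcatenateSketch() {}

    /** Hypothetical domain object used only for this comparison. */
    static final class Person {
        final String firstName;
        final String lastName;

        Person(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }
    }

    /** Agnostic version: works on any two strings, wherever they came from. */
    static String concatenate(String first, String second) {
        return first + second;
    }

    /** Coupled version: only usable by callers that have a Person. */
    static String concatenate(Person person) {
        return concatenate(person.firstName, person.lastName);
    }
}
```

The first form is the one that corresponds to an operation that only sees its argument values.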
Taps, Schemes, and Flows
In many of the previous diagrams, there are references to sources and sinks. In
Cascading, all data is read from or written to Tap instances, but is converted to and
from tuple instances via Scheme objects:
Tap
A Tap is responsible for the how and where parts of accessing data. For
example, is the data on HDFS or the local filesystem? In Amazon S3 or over HTTP?
Scheme
A Scheme is responsible for reading raw data and converting it to a tuple and/or
writing a tuple out into raw data, where this raw data can be lines of text, Hadoop
binary sequence files, or some proprietary format.
Note that Taps are not part of a pipe assembly, and so they are not a type of Pipe.
But they are connected with pipe assemblies when they are made cluster-executable.
When a pipe assembly is connected with the necessary number of source and sink Tap
instances, and the Taps either emit or capture the field names the pipe assembly
expects, we get a Flow. That is, if a Tap emits a tuple with the field name
"line" (by reading data from a file on HDFS), the head of the pipe assembly must be
expecting a "line" value as well. Otherwise, the process that connects the pipe assembly
with the Taps will immediately fail with an error.
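The fail-fast check described above amounts to comparing two sets of field names before anything runs. A minimal plain-Java sketch, with hypothetical names and an exception type chosen for the illustration rather than taken from Cascading:

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical check performed when a source is bound to the head of an assembly. */
final class FlowBindingSketch {
    private FlowBindingSketch() {}

    /** Throws if the source does not provide every field the head of the assembly expects. */
    static void checkBinding(Set<String> fieldsFromTap, Set<String> fieldsExpectedByHead) {
        Set<String> missing = new TreeSet<>(fieldsExpectedByHead);
        missing.removeAll(fieldsFromTap);
        if (!missing.isEmpty()) {
            throw new IllegalStateException("source does not provide fields: " + missing);
        }
    }
}
```

The point of checking at connection time is that a mismatch surfaces on the client, before any job is submitted to the cluster.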
So pipe assemblies are really data process definitions, and are not executable on their
own. They must be connected to source and sink Tap instances before they can run on
a cluster. This separation between Taps and pipe assemblies is part of what makes
Cascading so powerful.
If you think of pipe assemblies like a Java class, then a Flow is like a Java Object instance
(Figure 16-17). That is, the same pipe assembly can be instantiated many times into
new Flows, in the same application, without fear of any interference between them.
This allows pipe assemblies to be created and shared like standard Java libraries.
Figure 16-17. A Flow
Cascading in Practice
Now that we know what Cascading is and have a good idea how it works, what does
an application written in Cascading look like? See Example 16-2.
Example 16-2. Word count and sort
// Source: read text files, emitting each line as a tuple with one field, "line".
Scheme sourceScheme = new TextLine(new Fields("line"));
Tap source = new Hfs(sourceScheme, inputPath);
// Sink: write tuples as text, replacing any existing output.
Scheme sinkScheme = new TextLine();
Tap sink = new Hfs(sinkScheme, outputPath, SinkMode.REPLACE);
// Head of the pipe assembly.
Pipe assembly = new Pipe("wordcount");
// Parse each "line" into one "word" tuple per word found.
String regexString = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function regex = new RegexGenerator(new Fields("word"), regexString);
assembly = new Each(assembly, new Fields("line"), regex);
// Group by word and count the tuples in each group.
assembly = new GroupBy(assembly, new Fields("word"));
Aggregator count = new Count(new Fields("count"));
assembly = new Every(assembly, count);
// Group by count, with a secondary sort on word.
assembly = new GroupBy(assembly, new Fields("count"), new Fields("word"));
// Bind the taps to the assembly and run the resulting Flow.
FlowConnector flowConnector = new FlowConnector();
Flow flow = flowConnector.connect("word-count", source, sink, assembly);
flow.complete();
We create a new Scheme that reads simple text files and emits a new Tuple for each
line in a field named "line," as declared by the Fields instance.
We create a new Scheme that writes simple text files and expects a Tuple with any
number of fields/values. If more than one value, they will be tab-delimited in the
output file.
We create source and sink Tap instances that reference the input file and output
directory, respectively. The sink Tap will overwrite any file that may already exist.
We construct the head of our pipe assembly, and name it "wordcount." This name
is used to bind the source and sink taps to the assembly. Multiple heads or tails
would require unique names.
We construct an Each pipe with a function that will parse the "line" field into a new
Tuple for each word encountered.
We construct a GroupBy pipe that will create a new Tuple grouping for each unique
value in the field "word."
We construct an Every pipe with an Aggregator that will count the number of
Tuples in every unique word group. The result is stored in a field named "count."
We construct a GroupBy pipe that will create a new Tuple grouping for each unique
value in the field "count" and secondary sort each value in the field "word." The
result will be a list of "count" and "word" values with "count" sorted in increasing
order.
We connect the pipe assembly to its sources and sinks into a Flow, and then execute
the Flow on the cluster.
In the example, we count the words encountered in the input document, and we sort
the counts in their natural order (ascending). And if some words have the same "count"
value, these words are sorted in their natural order (alphabetical).
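To check what this ordering looks like on a small input, the same computation can be run in memory with the Java standard library. The sketch below reuses the regular expression from the listing but is otherwise an assumption-laden stand-in: the class name, the tab-separated "word, count" output, and the in-memory sort are illustrative and involve no Cascading code. Like the listing, it does not change the case of the words it finds.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical in-memory equivalent of the word count and sort. */
final class WordCountSketch {
    // Same pattern as the listing: a run of non-space characters that starts and ends on a letter.
    private static final Pattern WORD =
        Pattern.compile("(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)");

    private WordCountSketch() {}

    /** Returns "word<TAB>count" lines ordered by count, then by word. */
    static List<String> countAndSort(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            Matcher m = WORD.matcher(line);
            while (m.find()) {
                counts.merge(m.group(), 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(a.getValue(), b.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            out.add(e.getKey() + "\t" + e.getValue());
        }
        return out;
    }
}
```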
One obvious problem with this example is that some words might have uppercase
letters; for example, "the" and "The" when the word comes at the beginning of a
sentence. So we might decide to insert a new operation to force all the words to
lowercase, but we realize that all future applications that need to parse words from
documents should have the same behavior, so we decide to create a reusable pipe
SubAssembly, just like we would by creating a subroutine in a traditional application
(see Example 16-3).
Example 16-3. Creating a SubAssembly
public class ParseWordsAssembly extends SubAssembly
{
  public ParseWordsAssembly(Pipe previous)
  {
    String regexString = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
    Function regex = new RegexGenerator(new Fields("word"), regexString);
    previous = new Each(previous, new Fields("line"), regex);
    String exprString = "word.toLowerCase()";
    Function expression =
      new ExpressionFunction(new Fields("word"), exprString, String.class);
    previous = new Each(previous, new Fields("word"), expression);
    setTails(previous);
  }
}
We subclass the SubAssembly class, which is itself a kind of Pipe.
We create a Java expression function that will call toLowerCase() on the String value
in the field named "word." We must also pass in the Java type the expression expects
"word" to be, in this case, String. (http://www.janino.net/ is used under the covers.)
We must tell the SubAssembly superclass where the tail ends of our pipe subassembly
are.
First, we create a SubAssembly pipe to hold our "parse words" pipe assembly. Since this
is a Java class, it can be reused in any other application, as long as there is an incoming
field named "line" (Example 16-4). Note that there are ways to make this function
even more generic, but they are covered in the Cascading User Guide.
Example 16-4. Extending word count and sort with a SubAssembly
Scheme sourceScheme = new TextLine(new Fields("line"));
Tap source = new Hfs(sourceScheme, inputPath);
Scheme sinkScheme = new TextLine(new Fields("word", "count"));
Tap sink = new Hfs(sinkScheme, outputPath, SinkMode.REPLACE);
Pipe assembly = new Pipe("wordcount");
assembly = new ParseWordsAssembly(assembly);
assembly = new GroupBy(assembly, new Fields("word"));
Aggregator count = new Count(new Fields("count"));
assembly = new Every(assembly, count);
assembly = new GroupBy(assembly, new Fields("count"), new Fields("word"));
FlowConnector flowConnector = new FlowConnector();
Flow flow = flowConnector.connect("word-count", source, sink, assembly);
flow.complete();
We replace the Each from the previous example with our ParseWordsAssembly pipe.
Finally, we just substitute in our new SubAssembly right where the previous Each and
word parser function was used in the previous example. This nesting can continue as
deep as necessary.
Flexibility
Take a step back and see what this new model has given us or, better yet, what it has
taken away.
You see, we no longer think in terms of MapReduce jobs, or Mapper and Reducer
interface implementations, and how to bind or link subsequent MapReduce jobs to the
ones that precede them. During runtime, the Cascading planner figures out the
optimal way to partition the pipe assembly into MapReduce jobs and manages the linkages
between them (Figure 16-18).
Figure 16-18. How a Flow translates to chained MapReduce jobs
Because of this, developers can build applications of arbitrary granularity. They can
start with a small application that just filters a logfile, but then can iteratively build up
more features into the application as needed.
Since Cascading is an API and not a syntax like strings of SQL, it is more flexible. First
off, developers can create domain-specific languages (DSLs) using their favorite
language, like Groovy, JRuby, Jython, Scala, and others (see the project site for examples).
Second, developers can extend various parts of Cascading, like allowing custom Thrift
or JSON objects to be read and written and to be passed through the
tuple stream.
Hadoop and Cascading at ShareThis
ShareThis is a sharing network that makes it simple to share any online content. With
the click of a button on a web page or browser plug-in, ShareThis allows users to
seamlessly access their contacts and networks from anywhere online and share the
content through email, IM, Facebook, Digg, mobile SMS, etc., without ever leaving the
current page. Publishers can deploy the ShareThis button to tap the service's universal
sharing capabilities to drive traffic, stimulate viral activity, and track the sharing of
online content. ShareThis also simplifies social media services by reducing clutter on
web pages and providing instant distribution of content across social networks, affiliate
groups, and communities.
As ShareThis users share pages and information through the online widgets, a
continuous stream of events enters the ShareThis network. These events are first filtered and
processed, and then handed to various backend systems, including AsterData,
Hypertable, and Katta.
The volume of these events can be huge, too large to process with traditional systems.
This data can also be very dirty thanks to injection attacks from rogue systems,
browser bugs, or faulty widgets. For this reason, ShareThis chose to deploy Hadoop as
the preprocessing and orchestration frontend to their backend systems. They also chose
to use Amazon Web Services to host their servers, on the Elastic Compute Cloud
(EC2), and provide long-term storage, on the Simple Storage Service (S3), with an eye
toward leveraging Elastic MapReduce (EMR).
In this overview, we will focus on the log processing pipeline (Figure 16-19). The log
processing pipeline simply takes data stored in an S3 bucket, processes it (described
shortly), and stores the results back into another bucket. Simple Queue Service (SQS)
is used to coordinate the events that mark the start and completion of data processing
runs. Downstream, other processes pull data to load into AsterData, pull URL lists from
Hypertable to source a web crawl, or pull crawled page data to create Lucene indexes
for use by Katta. Note that Hadoop is central to the ShareThis architecture. It is used
to coordinate the processing and movement of data between architectural components.
With Hadoop as the frontend, all the event logs can be parsed, filtered, cleaned, and
organized by a set of rules before ever being loaded into the AsterData cluster or used
by any other component. AsterData is a clustered data warehouse that can support
large datasets and allow for complex ad hoc queries using a standard SQL syntax.
ShareThis chose to clean and prepare the incoming datasets on the Hadoop cluster and
then to load that data into the AsterData cluster for ad hoc analysis and reporting.
Though this cleaning would have been possible within AsterData, it made a lot of sense
to use Hadoop as the first stage in the processing pipeline to offset load on the main
data warehouse.
Cascading was chosen as the primary data processing API to simplify the development
process, codify how data is coordinated between architectural components, and
provide the developer-facing interface to those components. This represents a departure
from more traditional Hadoop use cases, which essentially just query stored data.
Instead, Cascading and Hadoop together provide better and simpler structure to the
complete solution, end-to-end, and thus more value to the users.
For developers, Cascading made it easy to start with a simple unit test (by subclassing
cascading.ClusterTestCase) that did simple text parsing and then to layer in more
processing rules while keeping the application logically organized for maintenance.
Cascading aided this organization in a couple of ways. First, standalone operations
(Functions, Filters, etc.) could be written and tested independently. Second, the
application was segmented into stages: one for parsing, one for rules, and a final stage for
binning/collating the data, all via the SubAssembly base class described earlier.
The data coming from the ShareThis loggers looks a lot like Apache logs with date/
timestamps, share URLs, referrer URLs, and a bit of metadata. To use the data for
analysis downstream, the URLs needed to be unpacked (parsing query-string data,
domain names, etc.). So a top-level SubAssembly was created to encapsulate the parsing,
and child SubAssemblies were nested inside to handle specific fields if they were
sufficiently complex to parse.
The same was done for applying rules. As every Tuple passed through the rules
SubAssembly, it was marked as "bad" if any of the rules were triggered. Along with the "bad"
tag, a description of why the record was bad was added to the Tuple for later review.
Finally, a splitter SubAssembly was created to do two things. First, to allow for the
tuple stream to split into two, one stream for "good" data and one for "bad" data.
Second, the splitter binned the data into intervals, such as every hour. To do this, only
two operations were necessary: the first to create the interval from the timestamp value
already present in the stream, and the second to use the interval and good/bad metadata
to create a directory path (for example, "05/good/" where "05" is 5am and "good"
means the tuple passed all the rules). This path would then be used by the Cascading
TemplateTap, a special Tap that can dynamically output tuple streams to different
locations based on values in the Tuple. In this case, the TemplateTap used the "path"
value to create the final output path.
Figure 16-19. The ShareThis log processing pipeline
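The two splitter operations can be approximated in a few lines of plain Java. The sketch assumes the timestamp is in epoch seconds and that hours are taken in UTC. Neither detail is given in the text, and the class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical version of "timestamp to interval" and "interval plus tag to path". */
final class SplitterPathSketch {
    // Assumption: one bin per hour of the day, formatted as two digits, in UTC.
    private static final DateTimeFormatter HOUR =
        DateTimeFormatter.ofPattern("HH").withZone(ZoneOffset.UTC);

    private SplitterPathSketch() {}

    /** First operation: derive the interval from the timestamp. */
    static String intervalFor(long epochSeconds) {
        return HOUR.format(Instant.ofEpochSecond(epochSeconds));
    }

    /** Second operation: combine interval and good/bad tag into a directory path. */
    static String pathFor(long epochSeconds, boolean passedAllRules) {
        return intervalFor(epochSeconds) + "/" + (passedAllRules ? "good" : "bad") + "/";
    }
}
```

A tuple stamped a few minutes after 05:00 UTC that passed every rule would be given the path 05/good/.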
The developers also created a fourth SubAssembly, this one to apply Cascading
Assertions during unit testing. These assertions double-checked that rules and parsing
SubAssemblies did their job.
In the unit test in Example 16-5, we see the splitter isn't being tested, but it is added in
another integration test not shown.
Example 16-5. Unit testing a Flow
public void testLogParsing() throws IOException
{
  Hfs source = new Hfs(new TextLine(new Fields("line")), sampleData);
  Hfs sink =
    new Hfs(new TextLine(), outputPath + "/parser", SinkMode.REPLACE);
  Pipe pipe = new Pipe("parser");
  // split "line" on tabs
  pipe = new Each(pipe, new Fields("line"), new RegexSplitter("\t"));
  pipe = new LogParser(pipe);
  pipe = new LogRules(pipe);
  // testing only assertions
  pipe = new ParserAssertions(pipe);
  Flow flow = new FlowConnector().connect(source, sink, pipe);
  flow.complete(); // run the test flow
  // verify there are 98 tuples, 2 fields, and matches the regex pattern
  // for TextLine schemes the tuples are { "offset", "line" }
  validateLength(flow, 98, 2, Pattern.compile("^[0-9]+(\\t[^\\t]*){19}$"));
}
For integration and deployment, many of the features built into Cascading allowed for
easier integration with external systems and for greater process tolerance.
In production, all the SubAssemblies are joined and planned into a Flow, but instead
of just source and sink Taps, trap Taps were planned in (Figure 16-20). Normally, when
an operation throws an exception from a remote Mapper or Reducer task, the Flow
will fail and kill all its managed MapReduce jobs. When a Flow has traps, any
exceptions are caught and the data causing the exception is saved to the Tap associated with
the current trap. Then the next Tuple is processed without stopping the Flow. Sometimes
you want your Flows to fail on errors, but in this case, the ShareThis developers knew
they could go back and look at the "failed" data and update their unit tests while the
production system kept running. Losing a few hours of processing time was worse than
losing a couple of bad records.
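The behavior of a trap can be mimicked in plain Java: apply an operation to each input, and when it throws, set the offending input aside and carry on. The sketch is an analogy only. The names are hypothetical, and in Cascading the diverted data is written to a Tap rather than collected in a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical illustration of diverting failing inputs instead of stopping. */
final class TrapSketch {
    private TrapSketch() {}

    /** Applies the operation to every input; inputs that make it throw go to the trap. */
    static <I, O> List<O> applyWithTrap(List<I> input, Function<I, O> operation, List<I> trap) {
        List<O> results = new ArrayList<>();
        for (I tuple : input) {
            try {
                results.add(operation.apply(tuple));
            } catch (RuntimeException e) {
                // Keep the bad input for later inspection and move on to the next one.
                trap.add(tuple);
            }
        }
        return results;
    }
}
```

Without the try/catch, the first malformed input would end the whole run, which is the default behavior described above.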
Using Cascading's event listeners, Amazon SQS could be integrated. When a Flow
finishes, a message is sent to notify other systems that there is data ready to be picked
up from Amazon S3. On failure, a different message is sent, alerting other processes.
The remaining downstream processes pick up where the log processing pipeline leaves
off on different independent clusters. The log processing pipeline today runs once a
day, so there is no need to keep a 100-node cluster sitting around for the 23 hours it
has nothing to do. So it is decommissioned and recommissioned 24 hours later.
In the future, it would be trivial to run the pipeline more frequently on smaller clusters,
every 6 hours or every hour, as the business demands. Independently, other clusters are
booting and shutting down at different intervals based on the needs of the business unit
responsible for that component. For example, the web crawler component (using Bixo,
a Cascading-based web-crawler toolkit developed by EMI and ShareThis) may run
continuously on a small cluster with a companion Hypertable cluster. This on-demand
model works very well with Hadoop, where each cluster can be tuned for the kind of
workload it is expected to handle.
Figure 16-20. The ShareThis log processing Flow
Summary
Hadoop is a very powerful platform for processing and coordinating the movement of
data across various architectural components. Its only drawback is that the primary
computing model is MapReduce.
Cascading aims to help developers build powerful applications quickly and simply,
through a well-reasoned API, without needing to think in MapReduce, while leaving
the heavy lifting of data distribution, replication, distributed process management, and
liveness to Hadoop.
Read more about Cascading, join the online community, and download sample
applications by visiting the project website.
Chris K Wensel
TeraByte Sort on Apache Hadoop
This article is reproduced from http://sortbenchmark.org/YahooHadoop.pdf, which was
written in May 2008. Jim Gray and his successors define a family of benchmarks to find
the fastest sort programs every year. TeraByte Sort and other sort benchmarks are listed
with winners over the years at http://sortbenchmark.org/. In April 2009, Arun Murthy
and I won the minute sort (where the aim is to sort as much data as possible in under one
minute) by sorting 500 GB in 59 seconds on 1,406 Hadoop nodes. We also sorted a
terabyte in 62 seconds on the same cluster. The cluster we used in 2009 was similar to the
hardware listed below, except that the network was much better with only 2-to-1
oversubscription between racks instead of 5-to-1 in the previous year. We also used LZO
compression on the intermediate data between the nodes. We also sorted a petabyte
(10^15 bytes) in 975 minutes on 3,658 nodes, for an average rate of 1.03 TB/minute. See
http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html
for more details about the 2009 results.
Apache Hadoop is an open source software framework that dramatically simplifies
writing distributed data-intensive applications. It provides a distributed filesystem,
which is modeled after the Google File System,[1] and a MapReduce[2] implementation
that manages distributed computation. Since the primary primitive of MapReduce is a
distributed sort, most of the custom code is glue to get the desired behavior.
I wrote three Hadoop applications to run the terabyte sort:
1. TeraGen is a MapReduce program to generate the data.
2. TeraSort samples the input data and uses MapReduce to sort the data into a total
order.
3. TeraValidate is a MapReduce program that validates that the output is sorted.
The total is around 1,000 lines of Java code, which will be checked in to the Hadoop
example directory.
TeraGen generates output data that is byte-for-byte equivalent to the C version including
the newlines and specific keys. It divides the desired number of rows by the desired
number of tasks and assigns ranges of rows to each map. The map jumps the random
number generator to the correct value for the first row and generates the following rows.
1. S. Ghemawat, H. Gobioff, and S.-T. Leung. "The Google File System." In 19th Symposium on Operating
Systems Principles (October 2003), Lake George, NY: ACM.
2. J. Dean and S. Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." In Sixth
Symposium on Operating System Design and Implementation (December 2004), San Francisco, CA.
For the final run, I configured TeraGen to use 1,800 tasks to generate a total of 10 billion
rows in HDFS, with a block size of 512 MB.
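The row assignment can be sketched in a few lines of Python. This is an illustration only, not the TeraGen source: the function name row_ranges and the rule that leftover rows go to the first few tasks are assumptions, and the step that advances the random number generator to each range's first row is left out.

```python
def row_ranges(total_rows, num_tasks):
    """Split [0, total_rows) into num_tasks contiguous half-open ranges."""
    base, extra = divmod(total_rows, num_tasks)
    ranges = []
    start = 0
    for task in range(num_tasks):
        # Assumption: the first `extra` tasks each take one additional row.
        count = base + (1 if task < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


# The final run described in the text: 10 billion rows over 1,800 map tasks.
ranges = row_ranges(10_000_000_000, 1_800)
```

Each map task would then generate only the rows in its own range, so no coordination between tasks is needed.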
TeraSort is a standard MapReduce sort, except for a custom partitioner that uses a
sorted list of N-1 sampled keys that define the key range for each reduce. In particular,
all keys such that sample[i-1] <= key < sample[i] are sent to reduce i. This guarantees
that the outputs of reduce i are all less than the outputs of reduce i+1. To speed up the
partitioning, the partitioner builds a two-level trie that quickly indexes into the list of
sample keys based on the first two bytes of the key. TeraSort generates the sample keys
by sampling the input before the job is submitted and writing the list of keys into HDFS.
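The partitioning rule can be illustrated with a short Python sketch. It swaps the two-level trie for a plain binary search over the sorted sample list, which gives the same answer but is not the optimization TeraSort uses; the name choose_reduce and the sample values are hypothetical.

```python
import bisect


def choose_reduce(key, samples):
    """Return the index of the reduce that should receive `key`.

    `samples` is the sorted list of N-1 split keys; the result is in 0..N-1.
    """
    # bisect_right counts the samples that are <= key, which is exactly the
    # index i satisfying sample[i-1] <= key < sample[i].
    return bisect.bisect_right(samples, key)


samples = [b"g", b"p"]  # hypothetical split points for N = 3 reduces
keys = [b"apple", b"g", b"melon", b"p", b"zebra"]
assignment = [choose_reduce(k, samples) for k in keys]
```

Because every key sent to reduce i sorts before every key sent to the next reduce, concatenating the sorted reduce outputs in index order yields a total order.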
I wrote an input and output format, which are used by all three applications, that read
and write the text files in the right format. The output of the reduce has replication set
to 1, instead of the default 3, because the contest does not require the output data be
replicated on to multiple nodes. I configured the job with 1,800 maps and 1,800 reduces
and io.sort.mb, io.sort.factor, fs.inmemory.size.mb, and a task heap size sufficient
that transient data was never spilled to disk other than at the end of the map. The sampler
used 100,000 keys to determine the reduce boundaries, although as can be seen in
Figure 16-21, the distribution between reduces was hardly perfect and would benefit
from more samples. You can see the distribution of running tasks over the job run in
Figure 16-22.
Figure 16-21. Plot of reduce output size versus finish time
TeraValidate ensures that the output is globally sorted. It creates one map per file in
the output directory, and each map ensures that each key is greater than or equal to the
previous one. The map also generates records with the first and last keys of the file, and
the reduce ensures that the first key of file i is greater than the last key of file i-1. Any
problems are reported as output of the reduce with the keys that are out of order.
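The two checks can be sketched in Python as follows. This is a single-process illustration of the idea rather than the TeraValidate code: the function names are hypothetical, each file is assumed to be non-empty, and the "files" are small in-memory lists.

```python
def check_file(keys):
    """Map side: scan one output file's keys in order.

    Returns (first_key, last_key, out_of_order_pairs).
    """
    errors = [(prev, cur) for prev, cur in zip(keys, keys[1:]) if cur < prev]
    return keys[0], keys[-1], errors


def check_boundaries(summaries):
    """Reduce side: compare each file's last key with the next file's first key."""
    errors = []
    for (_, prev_last, _), (next_first, _, _) in zip(summaries, summaries[1:]):
        if not next_first > prev_last:
            errors.append((prev_last, next_first))
    return errors


files = [[b"a", b"c", b"f"], [b"g", b"k"], [b"j", b"z"]]  # hypothetical part files
summaries = [check_file(f) for f in files]
problems = [e for s in summaries for e in s[2]] + check_boundaries(summaries)
```

Here each file is sorted internally, but the third file starts before the second one ends, so only the boundary check reports a problem.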
The cluster I ran on was:
910 nodes
2 quad core Xeons at 2.0 GHz per node
4 SATA disks per node
8 GB RAM per node
1 gigabit Ethernet on each node
40 nodes per rack
8 gigabit Ethernet uplinks from each rack to the core
Red Hat Enterprise Linux Server release 5.1 (kernel 2.6.18)
Sun Java JDK 1.6.0_05-b13
Figure 16-22. Number of tasks in each phase across time
The sort completed in 209 seconds (3.48 minutes). I ran Hadoop trunk (pre-0.18.0) with
patches for HADOOP-3443 and HADOOP-3446, which were required to remove
intermediate writes to disk. Although I had the 910 nodes mostly to myself, the network
core was shared with another active 2,000-node cluster, so the times varied a lot
depending on the other activity.
Owen O'Malley, Yahoo!
Using Pig and Wukong to Explore Billion-edge Network Graphs
Networks at massive scale are fascinating. The things they model are extremely
general: if you have a collection of things (that we'll call nodes), they are related
(edges), and if the nodes and edges tell a story (node/edge metadata), you have a
network graph.
I started the Infochimps project, a site to find, share, or sell any dataset in the world.
At Infochimps, we've got a whole bag of tricks ready to apply to any interesting network
graph that comes into the collection. We chiefly use Pig (described in Chapter 11) and
Wukong, a toolkit we've developed for Hadoop streaming in the Ruby programming
language. They let us write simple scripts like the ones below (almost all of which fit
on a single printed page) to process terabyte-scale graphs. Here are a few datasets that
come up in a search for "network" on infochimps.org:[3]
A social network, such as Twitter or Facebook. We somewhat impersonally model
people as nodes, and relationships (@mrflip is friends with @tom_e_white) or actions
(@infochimps mentioned @hadoop) as edges. The number of messages a user has sent
and the bag of words from all those messages are each important pieces of node
metadata.
A linked document collection such as Wikipedia or the entire web.[4] Each page is
a node (carrying its title, view count, and categories as node metadata). Each
hyperlink is an edge, and the frequency at which people click from one page to the
next is edge metadata.
The connections of neurons (nodes) and synapses (edges) in the C. elegans
roundworm.[5]
3. http://infochimps.org/search?query=network
4. http://www.datawrangling.com/wikipedia-page-traffic-statistics-dataset
5. http://www.wormatlas.org/neuronalwiring.html
A highway map, with exits as nodes, and highway segments as edges. The Open
Street Map project's dataset has global coverage of place names (node metadata),
street number ranges (edge metadata), and more.[6]
Or the many esoteric graphs that fall out if you take an interesting system and shake
it just right. Stream through a few million Twitter messages, and emit an edge for
every pair of nonkeyboard characters occurring within the same message. Simply
by observing "often, when humans use one such character, they also use another,"
you can re-create a map of human languages (see Figure 16-23).
Figure 16-23. Twitter language map
6. http://www.openstreetmap.org/
What's amazing about these organic network graphs is that given enough data, a
collection of powerful tools is able to generically use this network structure to expose
insight. For example, we've used variants of the same algorithm[7] to do each of:
Rank the most important pages in the Wikipedia linked-document collection.
Google uses a vastly more refined version of this approach to identify top search
hits.
Identify celebrities and experts in the Twitter social graph. Users who have many
more followers than their "trstrank" would imply are often spammers.
Predict a school's impact on student education, using millions of anonymized exam
scores gathered over five years.
Measuring Community
The most interesting network in the Infochimps collection is a massive crawl of the
Twitter social graph. With more than 90 million nodes and 2 billion edges, it is a marvelous
instrument for understanding what people talk about and how they relate to each other.
Here is an exploration, using the subgraph of "People who talk about Infochimps or
Hadoop,"[8] of three ways to characterize a user's community:
Who are the people they converse with (the reply graph)?
Do the people they engage with reciprocate that attention (symmetric links)?
Among the user's community, how many engage with each other (clustering
coefficient)?
Everybody's Talkin' at Me: The Twitter Reply Graph
Twitter lets you reply to another user's message and thus engage in conversation. Since
it's an expressly public activity, a reply is a strong social token: it shows interest in what
the other is saying and demonstrates that interest is worth rebroadcasting.
The first step in our processing is done in Wukong, a Ruby language library for Hadoop.
It lets us write small, agile programs capable of handling multiterabyte data streams.
Here is a snippet from the class that represents a Twitter message (or tweet):[9]
class Tweet < Struct.new(:tweet_id, :screen_name, :created_at,
                         :reply_tweet_id, :reply_screen_name, :text)
  def initialize(raw_tweet)
    # ... gory details of parsing raw tweet omitted
  end

  # Tweet is a reply if there's something in the reply_tweet_id slot
  def is_reply?
    # The last expression is the return value, so nothing may follow this line
    not reply_tweet_id.blank?
  end
end
7. All are steady-state network flow problems. A flowing crowd of websurfers wandering the linked-
document collection will visit the most interesting pages the most often. The transfer of social capital
implied by social network interactions highlights the most central actors within each community. The
year-to-year progress of students to higher or lower test scores implies what each school's effect on a
generic class would be.
8. Chosen without apology, in keeping with the ego-centered ethos of social networks.
9. You can find full working source code on this book's website.
Twitter's Stream API lets anyone easily pull gigabytes of messages.[10] They arrive in a
raw JSON format:
{"text":"Just finished the final draft for Hadoop: the Definitive Guide!",
"screen_name":"tom_e_white","reply_screen_name":null,"id":3239897342,
"reply_tweet_id":null,...}
{"text":"@tom_e_white Can't wait to get a copy!",
"screen_name":"mrflip","reply_screen_name":"tom_e_white","id":3239873453,
"reply_tweet_id":3239897342,...}
{"text":"@josephkelly great job on the #InfoChimps API.
Remind me to tell you about the time a baboon broke into our house.",
"screen_name":"wattsteve","reply_screen_name":"josephkelly","id":16434069252,...}
{"text":"@mza Re: http://j.mp/atbroxmr Check out @James_Rubino's
http://bit.ly/clusterfork ? Lots of good hadoop refs there too",
"screen_name":"mrflip","reply_screen_name":"@mza","id":7809927173,...}
{"text":"@tlipcon divide lots of data into little parts. Magic software gnomes
fix up the parts, elves then assemble those into whole things #hadoop",
"screen_name":"nealrichter","reply_screen_name":"tlipcon","id":4491069515,...}
The reply_screen_name and reply_tweet_id let you follow the conversation (as you can
see, they're otherwise null). Let's find each reply and emit the respective user IDs as
an edge:[11]
class ReplyGraphMapper < LineStreamer
  def process(raw_tweet)
    tweet = Tweet.new(raw_tweet)
    if tweet.is_reply?
      emit [tweet.screen_name, tweet.reply_screen_name]
    end
  end
end
The mapper derives from LineStreamer, a class that feeds each line as a single record
to its process method. We only have to define that process method; Wukong and
Hadoop take care of the rest. In this case, we use the raw JSON record to create a tweet
object. Where user A replies to user B, emit the edge as A and B separated by a tab. The
raw output will look like this:
% reply_graph_mapper --run raw_tweets.json a_replies_b.tsv
mrflip tom_e_white
wattsteve josephkelly
mrflip mza
nealrichter tlipcon
10. Refer to the Twitter developer site or use a tool like Hayes Davis' Flamingo.
11. In practice, we of course use numeric IDs and not screen names, but it's easier to follow along with screen
names. In order to keep the graph-theory discussion general, I'm going to play loose with some details
and leave out various janitorial details of loading and running.
You should read this as "a replies b" and interpret it as a directed "out" edge:
@wattsteve conveys social capital to @josephkelly.
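For readers who do not use Ruby, the same mapper logic can be sketched in Python. This is not Wukong code: the function name reply_edges and the two sample records are hypothetical, and a real streaming job would read lines from standard input rather than from a list.

```python
import json


def reply_edges(lines):
    """Yield (replier, replied_to) screen-name pairs from raw JSON tweet lines."""
    for line in lines:
        tweet = json.loads(line)
        # Same test as is_reply?: a tweet is a reply when reply_tweet_id is set.
        if tweet.get("reply_tweet_id") is not None:
            yield tweet["screen_name"], tweet["reply_screen_name"]


raw_tweets = [  # hypothetical records in the same shape as the stream above
    '{"text": "hello", "screen_name": "alice", "reply_screen_name": null, '
    '"id": 1, "reply_tweet_id": null}',
    '{"text": "@alice hi", "screen_name": "bob", "reply_screen_name": "alice", '
    '"id": 2, "reply_tweet_id": 1}',
]
# One tab-separated "a replies b" line per edge.
edges = ["\t".join(edge) for edge in reply_edges(raw_tweets)]
```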
Edge pairs versus adjacency list
That is the edge pairs representation of a network. It's simple, and it gives an equal
jumping-off point for in- or out- edges, but there's some duplication of data. You can
tell the same story from the node's point of view (and save some disk space) by rolling
up on the source node. We call this the adjacency list, and it can be generated in Pig by
a simple GROUP BY. Load the file:
a_replies_b = LOAD 'a_replies_b.tsv' AS (src:chararray, dest:chararray);
Then find all edges out from each node by grouping on source:
replies_out = GROUP a_replies_b BY src;
DUMP replies_out
(cutting,{(tom_e_white)})
(josephkelly,{(wattsteve)})
(mikeolson,{(LusciousPear),(kevinweil),(LusciousPear),(tlipcon)})
(mndoci,{(mrflip),(peteskomoroch),(LusciousPear),(mrflip)})
(mrflip,{(LusciousPear),(mndoci),(mndoci),(esammer),(ogrisel),(esammer),(wattsteve)})
(peteskomoroch,{(CMastication),(esammer),(DataJunkie),(mndoci),(nealrichter),...
(tlipcon,{(LusciousPear),(LusciousPear),(nealrichter),(mrflip),(kevinweil)})
(tom_e_white,{(mrflip),(lenbust)})
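To make the roll-up concrete, here is a minimal in-memory Python sketch of turning edge pairs into an adjacency list. It only illustrates what the GROUP BY computes; the function name adjacency_list is hypothetical, and the edges are the four from the mapper output shown earlier.

```python
from collections import defaultdict


def adjacency_list(edge_pairs):
    """Roll (src, dest) edge pairs up into {src: [dest, ...]}."""
    out_links = defaultdict(list)
    for src, dest in edge_pairs:
        out_links[src].append(dest)  # duplicates are kept, as in a Pig bag
    return dict(out_links)


edge_pairs = [("mrflip", "tom_e_white"), ("wattsteve", "josephkelly"),
              ("mrflip", "mza"), ("nealrichter", "tlipcon")]
replies_out = adjacency_list(edge_pairs)
```

The source node is stored once per group instead of once per edge, which is where the disk-space saving comes from.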
Degree
A simple, useful measure of influence is the number of replies a user receives. In graph
terms, this is the degree (specifically the in-degree, since this is a directed graph).
Pig's nested FOREACH syntax lets us count the distinct incoming repliers (neighbor
nodes) and the total incoming replies in one pass:[12]
a_replies_b = LOAD 'a_replies_b.tsv' AS (src:chararray, dest:chararray);
replies_in = GROUP a_replies_b BY dest; -- group on dest to get in-links
replies_in_degree = FOREACH replies_in {
  nbrs = DISTINCT a_replies_b.src;
  GENERATE group, COUNT(nbrs), COUNT(a_replies_b);
};
DUMP replies_in_degree
(cutting,1L,1L)
(josephkelly,1L,1L)
(mikeolson,3L,4L)
(mndoci,3L,4L)
(mrflip,5L,9L)
(peteskomoroch,9L,18L)
(tlipcon,4L,8L)
(tom_e_white,2L,2L)
12. Due to the small size of the edge pair records and a pesky Hadoop implementation detail, the mapper
may spill data to disk early. If the jobtracker dashboard shows "spilled records" greatly exceeding "map
output records," try bumping up the io.sort.record.percent:
PIG_OPTS="-Dio.sort.record.percent=0.25 -Dio.sort.mb=350" pig my_file.pig
In this sample, @peteskomoroch has 9 neighbors and 18 incoming replies, far more than
most. This large variation in degree is typical for social networks. Most users see a small
number of replies, but a few celebrities, such as @THE_REAL_SHAQ (basketball star
Shaquille O'Neal) or @sockington (a fictional cat), receive millions. By contrast, almost
every intersection on a road map is four-way.[13] The skewed dataflow produced by this
wild variation in degree has important ramifications for how you process such graphs
(more on that later).
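The two counts per user (distinct repliers and total replies) can be sketched in Python as below. This is an in-memory illustration of the in-degree calculation, not a translation of how Pig executes it; the function name in_degree and the sample edges are hypothetical.

```python
from collections import defaultdict


def in_degree(edge_pairs):
    """Return {dest: (distinct_repliers, total_replies)} for (src, dest) pairs."""
    repliers = defaultdict(list)
    for src, dest in edge_pairs:
        repliers[dest].append(src)  # group on dest to collect in-links
    return {dest: (len(set(srcs)), len(srcs)) for dest, srcs in repliers.items()}


edge_pairs = [("a", "c"), ("b", "c"), ("a", "c"), ("c", "a")]  # hypothetical
degrees = in_degree(edge_pairs)
```

A user whose total is much larger than the distinct count is being replied to repeatedly by the same few people.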
Symmetric Links
While millions of people have given @THE_REAL_SHAQ a shout-out on Twitter, he has
understandably not reciprocated with millions of replies. As the graph shows, I
frequently converse with @mndoci,[14] making ours a symmetric link. This accurately
reflects the fact that I have more in common with @mndoci than with @THE_REAL_SHAQ.
One way to find symmetric links is to take the edges in A Replied To B that are also in
A Replied By B. We can do that set intersection with an inner self-join:[15]
a_repl_to_b = LOAD 'a_replies_b.tsv' AS (user_a:chararray, user_b:chararray);
a_repl_by_b = LOAD 'a_replies_b.tsv' AS (user_b:chararray, user_a:chararray);
-- symmetric edges appear in both sets
a_symm_b_j = JOIN a_repl_to_b BY (user_a, user_b),
a_repl_by_b BY (user_a, user_b);
...
However, this sends two full copies of the edge-pairs list to the reduce phase, doubling
the memory required. We can do better by noticing that from a node's point of view,
a symmetric link is equivalent to a paired edge: one out and one in. Make the graph
undirected by putting the node with lowest sort order in the first slot, but preserve
the direction as a piece of edge metadata:
a_replies_b = LOAD 'a_replies_b.tsv' AS (src:chararray, dest:chararray);
a_b_rels = FOREACH a_replies_b GENERATE
  ((src <= dest) ? src : dest) AS user_a,
  ((src <= dest) ? dest : src) AS user_b,
  ((src <= dest) ? 1 : 0) AS a_re_b:int,
  ((src <= dest) ? 0 : 1) AS b_re_a:int;
DUMP a_b_rels
(mrflip,tom_e_white,1,0)
(josephkelly,wattsteve,0,1)
(mrflip,mza,1,0)
(nealrichter,tlipcon,0,1)
13. The largest outlier that comes to mind is the famous "Magic Roundabout" in Swindon, England, with
degree 10, http://en.wikipedia.org/wiki/Magic_Roundabout_%28Swindon%29.
14. Deepak Singh, open data advocate and bizdev manager of the Amazon AWS cloud.
15. Current versions of Pig get confused on self-joins, so just load the table with differently named relations
as shown here.
Now gather all edges for each node pair. A symmetric edge has at least one reply in
each direction:
a_b_rels_g = GROUP a_b_rels BY (user_a, user_b);
a_symm_b_all = FOREACH a_b_rels_g GENERATE
group.user_a AS user_a,
group.user_b AS user_b,
(( (SUM(a_b_rels.a_re_b) > 0) AND
(SUM(a_b_rels.b_re_a) > 0) ) ? 1 : 0) AS is_symmetric:int;
DUMP a_symm_b_all
(mrflip,tom_e_white,1)
(mrflip,mza,0)
(josephkelly,wattsteve,0)
(nealrichter,tlipcon,1)
...
a_symm_b = FILTER a_symm_b_all BY (is_symmetric == 1);
STORE a_symm_b INTO 'a_symm_b.tsv';
Here's a portion of the output, showing that @mrflip and @tom_e_white have a
symmetric link:
(mrflip,tom_e_white,1)
(nealrichter,tlipcon,1)
...
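The canonical-ordering trick can be sketched in Python as follows. This is a single-machine illustration of the idea, not the Pig execution plan; the function name symmetric_links is hypothetical, and the sample edges are made up apart from the screen names, which come from the text.

```python
from collections import defaultdict


def symmetric_links(edge_pairs):
    """Return sorted (user_a, user_b) pairs that have replies in both directions."""
    # For each pair stored in sort order: [count of a->b, count of b->a].
    counts = defaultdict(lambda: [0, 0])
    for src, dest in edge_pairs:
        if src <= dest:
            counts[(src, dest)][0] += 1  # edge already in canonical order
        else:
            counts[(dest, src)][1] += 1  # flipped, so record it as b replied to a
    return sorted(pair for pair, (a_re_b, b_re_a) in counts.items()
                  if a_re_b > 0 and b_re_a > 0)


edge_pairs = [("mrflip", "mndoci"), ("mndoci", "mrflip"),
              ("mrflip", "mza"), ("mndoci", "mrflip")]
symmetric = symmetric_links(edge_pairs)
```

Each edge is examined once and the direction survives as a count, so only one copy of the edge list is needed.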
Community Extraction
So far, we've generated a node measure (in-degree) and an edge measure (symmetric
link identification). Let's move out one step and look at a neighborhood measure: how
many of a given person's friends are friends with each other? Along the way, we'll
produce the edge set for a visualization like the one above.
Get neighbors
Choose a seed node (here, @hadoop). First, round up the seed's neighbors:
a_replies_b = LOAD 'a_replies_b.tsv' AS (src:chararray, dest:chararray);
-- Extract edges that originate or terminate on the seed
n0_edges = FILTER a_replies_b BY (src == 'hadoop') OR (dest == 'hadoop');
-- Choose the node in each pair that *isn't* our seed:
n1_nodes_all = FOREACH n0_edges GENERATE
((src == 'hadoop') ? dest : src) AS screen_name;
n1_nodes = DISTINCT n1_nodes_all;
DUMP n1_nodes
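The neighbor round-up amounts to a filter followed by a de-duplication, which a few lines of Python can show. This is an in-memory sketch; the function name neighbors and the sample edges are hypothetical.

```python
def neighbors(edge_pairs, seed):
    """Return the set of nodes sharing an edge, in either direction, with seed."""
    found = set()
    for src, dest in edge_pairs:
        if src == seed:
            found.add(dest)  # edge originates on the seed
        elif dest == seed:
            found.add(src)   # edge terminates on the seed
    return found


edge_pairs = [("hadoop", "a"), ("b", "hadoop"), ("a", "b"), ("hadoop", "a")]
n1_nodes = neighbors(edge_pairs, "hadoop")
```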
Now intersect the set of neighbors with the set of starting nodes to find all edges
originating in n1_nodes:
n1_edges_out_j = JOIN a_replies_b BY src,
n1_nodes BY screen_name USING 'replicated';
n1_edges_out = FOREACH n1_edges_out_j GENERATE src, dest;
Our copy of the graph (with more than 1 billion edges) is far too large to fit in memory.
On the other hand, the neighbor count for a single user rarely exceeds a couple million,
which fits easily in memory. Including USING 'replicated' in the JOIN command
instructs Pig to do a map-side join (also called a fragment replicate join). Pig holds the
n1_nodes relation in memory as a lookup table and streams the full edge list past.
Whenever the join condition is met (src is in the n1_nodes lookup table), it produces
output. No reduce step means an enormous speedup!
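The shape of a map-side join can be sketched in Python: keep the small relation in a hash set and stream the large one past it. This illustrates the idea only and says nothing about Pig's internals; the function name replicated_join and the sample data are hypothetical.

```python
def replicated_join(edge_stream, lookup_nodes, field):
    """Yield edges whose `field` ("src" or "dest") is in the small node set."""
    lookup = frozenset(lookup_nodes)  # the small relation, held in memory
    index = 0 if field == "src" else 1
    for edge in edge_stream:          # the large relation, streamed once
        if edge[index] in lookup:
            yield edge


all_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]  # hypothetical
n1_nodes = {"a", "b", "c"}
# Keep edges that start in the neighbor set ...
n1_edges_out = replicated_join(all_edges, n1_nodes, "src")
# ... then apply the same join on dest to keep edges that also end in it.
n1_edges = list(replicated_join(n1_edges_out, n1_nodes, "dest"))
```

Because the function is a generator, the second pass consumes the first lazily and the full edge list is never materialized.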
To leave only edges where both source and destination are neighbors of the seed node,
repeat the join:
n1_edges_j = JOIN n1_edges_out BY dest,
n1_nodes BY screen_name USING 'replicated';
n1_edges = FOREACH n1_edges_j GENERATE src, dest;
DUMP n1_edges
(mrflip,tom_e_white)
(mrflip,mza)
(wattsteve,josephkelly)
(nealrichter,tlipcon)
(bradfordcross,lusciouspear)
(mrflip,jeromatron)
(mndoci,mrflip)
(nealrichter,datajunkie)
Community metrics and the 1 million × 1 million problem
With @hadoop, @cloudera and @infochimps as seeds, I applied similar scripts to 2 billion
messages to create Figure 16-24 (this image is also hosted on this book's website).
As you can see, the big data community is very interconnected. The link neighborhood
of a celebrity such as @THE_REAL_SHAQ is far more sparse. We can characterize this using
the clustering coefficient: the ratio of actual n1_edges to the maximum number of
possible n1_edges. It ranges from zero (no neighbor links to any other neighbor) to one
(every neighbor links to every other neighbor). A moderately high clustering coefficient
indicates a cohesive community. A low clustering coefficient could indicate widely
dispersed interest (as it does with @THE_REAL_SHAQ), or it could indicate the kind of
inorganic community that a spam account would engender.
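As a worked example, the ratio can be computed in Python as below. The text does not say whether the denominator counts directed or undirected pairs; this sketch assumes directed edges, so k neighbors allow at most k(k-1) links. The function name and sample data are hypothetical.

```python
def clustering_coefficient(n1_nodes, n1_edges):
    """Ratio of distinct directed links among the neighbors to the k*(k-1) possible."""
    nodes = set(n1_nodes)
    k = len(nodes)
    if k < 2:
        return 0.0  # no pair of neighbors exists, so no link is possible
    # Count each directed link once; ignore self-loops and outside nodes.
    links = {(s, d) for s, d in n1_edges if s in nodes and d in nodes and s != d}
    return len(links) / (k * (k - 1))


n1_nodes = {"a", "b", "c"}
n1_edges = [("a", "b"), ("b", "a"), ("a", "c"), ("a", "b")]  # hypothetical
coefficient = clustering_coefficient(n1_nodes, n1_edges)
```

With three neighbors there are six possible directed links and three distinct ones are present, so the coefficient is one half.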
Figure 16-24. Big data community on Twitter
Local properties at global scale
We've calculated community metrics at the scale of a node, an edge, and a
neighborhood. How about the whole globe? There's not enough space here to cover it, but you
can simultaneously determine the clustering coefficient for every node by generating
every "triangle" in the graph. For each user, comparing the number of triangles they
belong to with their degree leads to the clustering coefficient.
Be careful, though! Remember the wide variation in node degree discussed above?
Recklessly extending the previous method will lead to an explosion of data: pop star
@britneyspears (5.2M followers, 420k following as of July 2010) or @WholeFoods (1.7M
followers, 600k following) will each generate trillions of entries. What's worse, since
large communities have a sparse clustering coefficient, almost all of these will be thrown
away! There is a very elegant way to do this on the full graph,[16] but always keep in
mind what the real world says about the problem. If you're willing to assert that
@britneyspears isn't really friends with 420,000 people, you can keep only the strong
links. Weight each edge (by number of replies, whether it's symmetric, and so on) and
set limits on the number of links from any node. This sharply reduces the intermediate
data size, yet still does a reasonable job of estimating cohesiveness.
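One way to cap the links per node is to keep only the heaviest few edges from each source. The Python sketch below shows that pruning step under assumed inputs: the weights, the limit of two links, and the function name strongest_links are all hypothetical, and the text does not prescribe a particular weighting.

```python
import heapq
from collections import defaultdict


def strongest_links(weighted_edges, max_links):
    """Keep at most max_links heaviest outgoing edges per source node."""
    by_src = defaultdict(list)
    for src, dest, weight in weighted_edges:
        by_src[src].append((weight, dest))
    # nlargest returns the entries in descending order of weight.
    return {src: [dest for _, dest in heapq.nlargest(max_links, links)]
            for src, links in by_src.items()}


weighted = [("star", "fan1", 1), ("star", "friend", 9), ("star", "fan2", 2),
            ("star", "peer", 5), ("fan1", "star", 3)]  # hypothetical weights
kept = strongest_links(weighted, max_links=2)
```

A node with hundreds of thousands of weak links then contributes only a bounded number of edges to any later triangle count.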
Philip (flip) Kromer, Infochimps
16. See http://www.slideshare.net/ydn/3-xxl-graphalgohadoopsummit2010. Sergei Vassilvitskii (@vsergei)
and Jake Hofman (@jakehofman) of Yahoo! Research solve several graph problems by very intelligently
throwing away most of the graph.
APPENDIX A
Installing Apache Hadoop
It's easy to install Hadoop on a single machine to try it out. (For installation on a cluster,
please refer to Chapter 9.) The quickest way is to download and run a binary release
from an Apache Software Foundation Mirror.
In this appendix, we cover how to install Hadoop Common, HDFS, and MapReduce.
Instructions for installing the other projects covered in this book are included at the
start of the relevant chapter.
Prerequisites
Hadoop is written in Java, so you will need to have Java installed on your machine,
version 6 or later. Sun's JDK is the one most widely used with Hadoop, although others
have been reported to work.
Hadoop runs on Unix and on Windows. Linux is the only supported production
platform, but other flavors of Unix (including Mac OS X) can be used to run Hadoop for
development. Windows is only supported as a development platform, and additionally
requires Cygwin to run. During the Cygwin installation process, you should include
the openssh package if you plan to run Hadoop in pseudo-distributed mode (see the
following explanation).
Installation
Start by deciding which user you'd like to run Hadoop as. For trying out Hadoop or
developing Hadoop programs, it is simplest to run Hadoop on a single machine using
your own user account.
Download a stable release, which is packaged as a gzipped tar file, from the Apache
Hadoop releases page and unpack it somewhere on your filesystem:
% tar xzf hadoop-x.y.z.tar.gz
Before you can run Hadoop, you need to tell it where Java is located on your system.
If you have the JAVA_HOME environment variable set to point to a suitable Java
installation, that will be used, and you don't have to configure anything further. (It is often set
in a shell startup file, such as ~/.bash_profile or ~/.bashrc.) Otherwise, you can set the
Java installation that Hadoop uses by editing conf/hadoop-env.sh and specifying the
JAVA_HOME variable. For example, on my Mac, I changed the line to read:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
to point to version 1.6.0 of Java. On Ubuntu, the equivalent line is:
export JAVA_HOME=/usr/lib/jvm/java-6-sun
It's very convenient to create an environment variable that points to the Hadoop
installation directory (HADOOP_INSTALL, say) and to put the Hadoop binary directory on
your command-line path. For example:
% export HADOOP_INSTALL=/home/tom/hadoop-x.y.z
% export PATH=$PATH:$HADOOP_INSTALL/bin
Check that Hadoop runs by typing:
% hadoop version
Hadoop 0.20.2
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707
Compiled by chrisdo on Fri Feb 19 08:07:34 UTC 2010
Configuration
Each component in Hadoop is configured using an XML file. Common properties go
in core-site.xml, HDFS properties go in hdfs-site.xml, and MapReduce properties go in
mapred-site.xml. These files are all located in the conf subdirectory.
In earlier versions of Hadoop, there was a single site configuration file
for the Common, HDFS, and MapReduce components, called
hadoop-site.xml. From release 0.20.0 onward, this file has been split into three:
one for each component. The property names have not changed, just
the configuration file they have to go in. You can see the default settings
for all the properties that are governed by these configuration files by
looking in the docs directory of your Hadoop installation for HTML files
called core-default.html, hdfs-default.html, and mapred-default.html.
Hadoop can be run in one of three modes:
Standalone (or local) mode
There are no daemons running and everything runs in a single JVM. Standalone
mode is suitable for running MapReduce programs during development, since it
is easy to test and debug them.
Pseudo-distributed mode
The Hadoop daemons run on the local machine, thus simulating a cluster on a
small scale.
Fully distributed mode
The Hadoop daemons run on a cluster of machines. This setup is described in
Chapter 9.
To run Hadoop in a particular mode, you need to do two things: set the appropriate
properties, and start the Hadoop daemons. Table A-1 shows the minimal set of
properties to configure each mode. In standalone mode, the local filesystem and the local
MapReduce job runner are used, while in the distributed modes the HDFS and
MapReduce daemons are started.
Table A-1. Key configuration properties for different modes
Component  Property            Standalone          Pseudo-distributed  Fully distributed
Common     fs.default.name     file:/// (default)  hdfs://localhost/   hdfs://namenode/
HDFS       dfs.replication     N/A                 1                   3 (default)
MapReduce  mapred.job.tracker  local (default)     localhost:8021      jobtracker:8021
You can read more about configuration in "Hadoop Configuration" on page 302.
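If you would rather generate the site files than type them, the properties in Table A-1 map directly onto the configuration XML. The following Python sketch is one hypothetical way to do that; the MODES dictionary and the site_xml helper are illustrative names, not part of Hadoop, and the output is not pretty-printed.

```python
import xml.etree.ElementTree as ET

# Property values copied from Table A-1 (standalone mode needs no site files).
MODES = {
    "pseudo-distributed": {
        "core-site.xml": {"fs.default.name": "hdfs://localhost/"},
        "hdfs-site.xml": {"dfs.replication": "1"},
        "mapred-site.xml": {"mapred.job.tracker": "localhost:8021"},
    },
    "fully distributed": {
        "core-site.xml": {"fs.default.name": "hdfs://namenode/"},
        "hdfs-site.xml": {"dfs.replication": "3"},
        "mapred-site.xml": {"mapred.job.tracker": "jobtracker:8021"},
    },
}


def site_xml(properties):
    """Render a {name: value} mapping as the text of a Hadoop site file."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode")


# Build the text of the three files for pseudo-distributed mode.
files = {fname: site_xml(props)
         for fname, props in MODES["pseudo-distributed"].items()}
```

Writing each entry of files into the conf directory would then complete the property half of the setup; starting the daemons is still a separate step.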
Standalone Mode
In standalone mode, there is no further action to take, since the default properties are
set for standalone mode, and there are no daemons to run.
Pseudo-Distributed Mode
The configuration files should be created with the following contents and placed in the
conf directory (although you can place configuration files in any directory as long as
you start the daemons with the --config option):
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost/</value>
</property>
</configuration>
<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
</configuration>
Configuring SSH
In pseudo-distributed mode, we have to start daemons, and to do that, we need to have
SSH installed. Hadoop doesn't actually distinguish between pseudo-distributed and
fully distributed modes: it merely starts daemons on the set of hosts in the cluster
(defined by the slaves file) by SSH-ing to each host and starting a daemon process.
Pseudo-distributed mode is just a special case of fully distributed mode in which the
(single) host is localhost, so we need to make sure that we can SSH to localhost and log
in without having to enter a password.
First, make sure that SSH is installed and a server is running. On Ubuntu, for example,
this is achieved with:
% sudo apt-get install ssh
On Windows with Cygwin, you can set up an SSH server (after having
installed the openssh package) by running ssh-host-config -y.
On Mac OS X, make sure Remote Login (under System Preferences,
Sharing) is enabled for the current user (or all users).
Then to enable password-less login, generate a new SSH key with an empty passphrase:
% ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Test this with:
% ssh localhost
If successful, you should not have to type in a password.
Formatting the HDFS filesystem
Before it can be used, a brand-new HDFS installation needs to be formatted. The
formatting process creates an empty filesystem by creating the storage directories and the
initial versions of the namenode's persistent data structures. Datanodes are not
involved in the initial formatting process, since the namenode manages all of the
filesystem's metadata, and datanodes can join or leave the cluster dynamically. For the same
reason, you don't need to say how large a filesystem to create, since this is determined
by the number of datanodes in the cluster, which can be increased as needed, long after
the filesystem was formatted.
Formatting HDFS is quick to do. Just type the following:
% hadoop namenode -format
Starting and stopping the daemons
To start the HDFS and MapReduce daemons, type:
% start-dfs.sh
% start-mapred.sh
If you have placed configuration files outside the default conf directory,
start the daemons with the --config option, which takes an absolute
path to the configuration directory:
% start-dfs.sh --config path-to-config-directory
% start-mapred.sh --config path-to-config-directory
Five daemons will be started on your local machine: a namenode, a secondary
namenode, and a datanode (started by start-dfs.sh), plus a jobtracker and a tasktracker
(started by start-mapred.sh). You can check whether the daemons started successfully
by looking at the logfiles in the logs directory (in the Hadoop installation directory), or
by looking at the web UIs, at http://localhost:50030/ for the jobtracker and at
http://localhost:50070/ for the namenode. You can also use Java's jps command to see
whether they are running.
Stopping the daemons is done in the obvious way:
% stop-dfs.sh
% stop-mapred.sh
Fully Distributed Mode
Setting up a cluster of machines brings many additional considerations, so this mode
is covered in Chapter 9.
APPENDIX B
Cloudera's Distribution for Hadoop
Clouueia`s DistiiLution loi Hauoop (heiealtei CDH) is Laseu on the most iecent staLle
veision ol Apache Hauoop with numeious patches, Lackpoits, anu upuates. Clouueia
makes the uistiiLution availaLle in a numLei ol uilleient loimats: souice anu Linaiy
tai liles, RPMs, DeLian packages, VMwaie images, anu sciipts loi iunning CDH in the
clouu. CDH is liee, ieleaseu unuei the Apache 2.0 license anu availaLle at http://www
.c|oudcra.con/hadoop/.
To simplily ueployment, Clouueia hosts packages on puLlic yum anu apt iepositoiies.
CDH enaLles you to install anu conliguie Hauoop on each machine using a single
commanu. Kickstait useis can commission entiie Hauoop clusteis without manual
inteivention.
CDH manages cross-component versions and provides a stable platform with a
compatible set of packages that work together. As of CDH3, the following packages are
included, many of which are covered elsewhere in this book:
HDFS           Self-healing distributed file system
MapReduce      Powerful, parallel data processing framework
Hadoop Common  A set of utilities that support the Hadoop subprojects
HBase          Hadoop database for random read/write access
Hive           SQL-like queries and tables on large datasets
Pig            Dataflow language and compiler
Oozie          Workflow for interdependent Hadoop jobs
Sqoop          Integrate databases and data warehouses with Hadoop
Flume          Highly reliable, configurable streaming data collection
ZooKeeper      Coordination service for distributed applications
Hue            User interface framework and SDK for visual Hadoop applications
To download CDH, visit http://www.cloudera.com/downloads/.
APPENDIX C
Preparing the NCDC Weather Data
This section gives a runthrough of the steps taken to prepare the raw weather data files
so they are in a form that is amenable for analysis using Hadoop. If you want to get a
copy of the data to process using Hadoop, you can do so by following the instructions
given at the website that accompanies this book at http://www.hadoopbook.com/. The
rest of this section explains how the raw weather data files were processed.
The raw data is provided as a collection of tar files, compressed with bzip2. Each year
of readings comes in a separate file. Here's a partial directory listing of the files:
1901.tar.bz2
1902.tar.bz2
1903.tar.bz2
...
2000.tar.bz2
Each tar file contains a file for each weather station's readings for the year, compressed
with gzip. (The fact that the files in the archive are compressed makes the bzip2
compression on the archive itself redundant.) For example:
% tar jxf 1901.tar.bz2
% ls -l 1901 | head
011990-99999-1950.gz
011990-99999-1950.gz
...
011990-99999-1950.gz
Since there are tens of thousands of weather stations, the whole dataset is made up of
a large number of relatively small files. It's generally easier and more efficient to process
a smaller number of relatively large files in Hadoop (see "Small files and
CombineFileInputFormat" on page 237), so in this case, I concatenated the decompressed
files for a whole year into a single file, named by the year. I did this using a MapReduce
program, to take advantage of its parallel processing capabilities. Let's take a closer
look at the program.
The program has only a map function: no reduce function is needed since the map does
all the file processing in parallel with no combine stage. The processing can be done
with a Unix script, so the Streaming interface to MapReduce is appropriate in this case;
see Example C-1.
Example C-1. Bash script to process raw NCDC data files and store in HDFS
#!/usr/bin/env bash
# NLineInputFormat gives a single line: key is offset, value is S3 URI
read offset s3file
# Retrieve file from S3 to local disk
echo "reporter:status:Retrieving $s3file" >&2
$HADOOP_INSTALL/bin/hadoop fs -get $s3file .
# Un-bzip and un-tar the local file
target=`basename $s3file .tar.bz2`
mkdir -p $target
echo "reporter:status:Un-tarring $s3file to $target" >&2
tar jxf `basename $s3file` -C $target
# Un-gzip each station file and concat into one file
echo "reporter:status:Un-gzipping $target" >&2
for file in $target/*/*
do
  gunzip -c $file >> $target.all
  echo "reporter:status:Processed $file" >&2
done
# Put gzipped version into HDFS
echo "reporter:status:Gzipping $target and putting in HDFS" >&2
gzip -c $target.all | $HADOOP_INSTALL/bin/hadoop fs -put - gz/$target.gz
The input is a small text file (ncdc_files.txt) listing all the files to be processed (the files
start out on S3, so the files are referenced using S3 URIs that Hadoop understands).
Here is a sample:
s3n://hadoopbook/ncdc/raw/isd-1901.tar.bz2
s3n://hadoopbook/ncdc/raw/isd-1902.tar.bz2
...
s3n://hadoopbook/ncdc/raw/isd-2000.tar.bz2
By specifying the input format to be NLineInputFormat, each mapper receives one line
of input, which contains the file it has to process. The processing is explained in the
script, but, briefly, it unpacks the bzip2 file, and then concatenates each station file into
a single file for the whole year. Finally, the file is gzipped and copied into HDFS. Note
the use of hadoop fs -put - to consume from standard input.
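To see how the script's read picks apart the line it is handed, here is a small sketch that can be run without Hadoop. The function name input_to_target is made up for illustration, and the sample line assumes that the key and value arrive separated by a tab.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one "offset<TAB>URI" line from standard input,
# as the script in Example C-1 does, and print the offset followed by the
# archive's base name with the .tar.bz2 extension removed.
input_to_target() {
  local offset s3file
  read offset s3file
  echo "$offset $(basename "$s3file" .tar.bz2)"
}

# Prints: 0 isd-1901
printf '0\ts3n://hadoopbook/ncdc/raw/isd-1901.tar.bz2\n' | input_to_target
```

The base name is what the script uses for the local directory, the concatenated file, and the final name in HDFS, so this input line would end up as gz/isd-1901.gz.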
Status messages are echoed to standard error with a reporter:status prefix so that they
get interpreted as a MapReduce status update. This tells Hadoop that the script is
making progress and is not hanging.
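The repeated echo lines in the script could be wrapped in a small function. This is only a sketch, and the name report_status is an invention for the example; the reporter:status prefix and the use of standard error are taken from the script itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: send a status update to the Streaming framework.
# The message goes to standard error, so it never mixes with any records
# that the mapper writes to standard output.
report_status() {
  echo "reporter:status:$*" >&2
}

report_status "Retrieving s3n://hadoopbook/ncdc/raw/isd-1901.tar.bz2"
```

Keeping the prefix in one place makes it harder to mistype, which matters because a line without the exact prefix is just ordinary standard error output.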
The script to run the Streaming job is as follows:
% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
  -D mapred.reduce.tasks=0 \
  -D mapred.map.tasks.speculative.execution=false \
  -D mapred.task.timeout=12000000 \
  -input ncdc_files.txt \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
  -output output \
  -mapper load_ncdc_map.sh \
  -file load_ncdc_map.sh
I set the number of reduce tasks to zero, since this is a map-only job. I also turned off
speculative execution so duplicate tasks didn't write the same files (although the
approach discussed in "Task side-effect files" on page 216 would have worked, too).
The task timeout was set high (12,000,000 milliseconds, or 200 minutes) so that Hadoop
didn't kill tasks that were taking a long time (for example, when unarchiving files or
copying to HDFS, when no progress is reported).
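A different way to survive long silent steps, which the job above does not use, is to keep sending status lines while a slow command runs. The following is a sketch only: the names with_heartbeat and HEARTBEAT_SECS are invented for the example, and it assumes that a reporter:status line on standard error counts as progress for the purposes of the task timeout.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command while a background loop writes a
# status line to standard error every HEARTBEAT_SECS seconds (default 60).
# Returns the exit status of the wrapped command.
with_heartbeat() {
  local interval=${HEARTBEAT_SECS:-60}
  (
    while true; do
      # Detach sleep from our output streams so that a leftover sleep
      # cannot hold them open after the loop has been killed
      sleep "$interval" >/dev/null 2>&1
      echo "reporter:status:Still running: $*" >&2
    done
  ) &
  local hb_pid=$!
  local rc=0
  "$@" || rc=$?
  # Stop the heartbeat loop and reap it before returning
  kill "$hb_pid" 2>/dev/null || true
  wait "$hb_pid" 2>/dev/null || true
  return "$rc"
}

# Example use inside a mapper script:
#   with_heartbeat tar jxf isd-1901.tar.bz2 -C isd-1901
```

With a wrapper like this, the timeout could stay closer to its default, at the cost of a slightly more complicated script; raising the timeout, as was done here, is the simpler choice for a one-off job.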
Last, the files were archived on S3 by copying them from HDFS using distcp.