Hadoop Beginner's Guide
About this ebook
Data is arriving faster than you can process it, and the overall volumes keep growing at a rate that keeps you awake at night. Hadoop can help you tame the data beast. Effective use of Hadoop, however, requires a mixture of programming, design, and system administration skills.
"Hadoop Beginner's Guide" removes the mystery from Hadoop, presenting Hadoop and related technologies with a focus on building working systems and getting the job done, using cloud services to do so when it makes sense. From basic concepts and initial setup through developing applications and keeping the system running as the data grows, the book gives the understanding needed to effectively use Hadoop to solve real world problems.
Starting with the basics of installing and configuring Hadoop, the book explains how to develop applications, maintain the system, and how to use additional products to integrate with other systems.
While learning different ways to develop applications to run on Hadoop the book also covers tools such as Hive, Sqoop, and Flume that show how Hadoop can be integrated with relational databases and log collection.
In addition to examples run on Hadoop clusters on Ubuntu, the book covers the use of cloud services such as Amazon EC2 and Elastic MapReduce.
Approach
As a Packt Beginner's Guide, the book is packed with clear step-by-step instructions for performing the most useful tasks, getting you up and running quickly, and learning by doing.
Who this book is for
This book assumes no existing experience with Hadoop or cloud services. It assumes you have familiarity with a programming language such as Java or Ruby, but gives you the needed background on the other topics.
Garry Turkington
Book preview
Hadoop Beginner's Guide - Garry Turkington
Table of Contents
Hadoop Beginner's Guide
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Time for action – heading
What just happened?
Pop quiz – heading
Have a go hero – heading
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. What It's All About
Big data processing
The value of data
Historically for the few and not the many
Classic data processing systems
Scale-up
Early approaches to scale-out
Limiting factors
A different approach
All roads lead to scale-out
Share nothing
Expect failure
Smart software, dumb hardware
Move processing, not data
Build applications, not infrastructure
Hadoop
Thanks, Google
Thanks, Doug
Thanks, Yahoo
Parts of Hadoop
Common building blocks
HDFS
MapReduce
Better together
Common architecture
What it is and isn't good for
Cloud computing with Amazon Web Services
Too many clouds
A third way
Different types of costs
AWS – infrastructure on demand from Amazon
Elastic Compute Cloud (EC2)
Simple Storage Service (S3)
Elastic MapReduce (EMR)
What this book covers
A dual approach
Summary
2. Getting Hadoop Up and Running
Hadoop on a local Ubuntu host
Other operating systems
Time for action – checking the prerequisites
What just happened?
Setting up Hadoop
A note on versions
Time for action – downloading Hadoop
What just happened?
Time for action – setting up SSH
What just happened?
Configuring and running Hadoop
Time for action – using Hadoop to calculate Pi
What just happened?
Three modes
Time for action – configuring the pseudo-distributed mode
What just happened?
Configuring the base directory and formatting the filesystem
Time for action – changing the base HDFS directory
What just happened?
Time for action – formatting the NameNode
What just happened?
Starting and using Hadoop
Time for action – starting Hadoop
What just happened?
Time for action – using HDFS
What just happened?
Time for action – WordCount, the Hello World of MapReduce
What just happened?
Have a go hero – WordCount on a larger body of text
Monitoring Hadoop from the browser
The HDFS web UI
The MapReduce web UI
Using Elastic MapReduce
Setting up an account in Amazon Web Services
Creating an AWS account
Signing up for the necessary services
Time for action – WordCount on EMR using the management console
What just happened?
Have a go hero – other EMR sample applications
Other ways of using EMR
AWS credentials
The EMR command-line tools
The AWS ecosystem
Comparison of local versus EMR Hadoop
Summary
3. Understanding MapReduce
Key/value pairs
What it means
Why key/value data?
Some real-world examples
MapReduce as a series of key/value transformations
Pop quiz – key/value pairs
The Hadoop Java API for MapReduce
The 0.20 MapReduce Java API
The Mapper class
The Reducer class
The Driver class
Writing MapReduce programs
Time for action – setting up the classpath
What just happened?
Time for action – implementing WordCount
What just happened?
Time for action – building a JAR file
What just happened?
Time for action – running WordCount on a local Hadoop cluster
What just happened?
Time for action – running WordCount on EMR
What just happened?
The pre-0.20 Java MapReduce API
Hadoop-provided mapper and reducer implementations
Time for action – WordCount the easy way
What just happened?
Walking through a run of WordCount
Startup
Splitting the input
Task assignment
Task startup
Ongoing JobTracker monitoring
Mapper input
Mapper execution
Mapper output and reduce input
Partitioning
The optional partition function
Reducer input
Reducer execution
Reducer output
Shutdown
That's all there is to it!
Apart from the combiner…maybe
Why have a combiner?
Time for action – WordCount with a combiner
What just happened?
When you can use the reducer as the combiner
Time for action – fixing WordCount to work with a combiner
What just happened?
Reuse is your friend
Pop quiz – MapReduce mechanics
Hadoop-specific data types
The Writable and WritableComparable interfaces
Introducing the wrapper classes
Primitive wrapper classes
Array wrapper classes
Map wrapper classes
Time for action – using the Writable wrapper classes
What just happened?
Other wrapper classes
Have a go hero – playing with Writables
Making your own
Input/output
Files, splits, and records
InputFormat and RecordReader
Hadoop-provided InputFormat
Hadoop-provided RecordReader
OutputFormat and RecordWriter
Hadoop-provided OutputFormat
Don't forget Sequence files
Summary
4. Developing MapReduce Programs
Using languages other than Java with Hadoop
How Hadoop Streaming works
Why use Hadoop Streaming
Time for action – implementing WordCount using Streaming
What just happened?
Differences in jobs when using Streaming
Analyzing a large dataset
Getting the UFO sighting dataset
Getting a feel for the dataset
Time for action – summarizing the UFO data
What just happened?
Examining UFO shapes
Time for action – summarizing the shape data
What just happened?
Time for action – correlating sighting duration to UFO shape
What just happened?
Using Streaming scripts outside Hadoop
Time for action – performing the shape/time analysis from the command line
What just happened?
Java shape and location analysis
Time for action – using ChainMapper for field validation/analysis
What just happened?
Have a go hero
Too many abbreviations
Using the Distributed Cache
Time for action – using the Distributed Cache to improve location output
What just happened?
Counters, status, and other output
Time for action – creating counters, task states, and writing log output
What just happened?
Too much information!
Summary
5. Advanced MapReduce Techniques
Simple, advanced, and in-between
Joins
When this is a bad idea
Map-side versus reduce-side joins
Matching account and sales information
Time for action – reduce-side join using MultipleInputs
What just happened?
DataJoinMapper and TaggedMapperOutput
Implementing map-side joins
Using the Distributed Cache
Have a go hero – Implementing map-side joins
Pruning data to fit in the cache
Using a data representation instead of raw data
Using multiple mappers
To join or not to join...
Graph algorithms
Graph 101
Graphs and MapReduce – a match made somewhere
Representing a graph
Time for action – representing the graph
What just happened?
Overview of the algorithm
The mapper
The reducer
Iterative application
Time for action – creating the source code
What just happened?
Time for action – the first run
What just happened?
Time for action – the second run
What just happened?
Time for action – the third run
What just happened?
Time for action – the fourth and last run
What just happened?
Running multiple jobs
Final thoughts on graphs
Using language-independent data structures
Candidate technologies
Introducing Avro
Time for action – getting and installing Avro
What just happened?
Avro and schemas
Time for action – defining the schema
What just happened?
Time for action – creating the source Avro data with Ruby
What just happened?
Time for action – consuming the Avro data with Java
What just happened?
Using Avro within MapReduce
Time for action – generating shape summaries in MapReduce
What just happened?
Time for action – examining the output data with Ruby
What just happened?
Time for action – examining the output data with Java
What just happened?
Have a go hero – graphs in Avro
Going forward with Avro
Summary
6. When Things Break
Failure
Embrace failure
Or at least don't fear it
Don't try this at home
Types of failure
Hadoop node failure
The dfsadmin command
Cluster setup, test files, and block sizes
Fault tolerance and Elastic MapReduce
Time for action – killing a DataNode process
What just happened?
NameNode and DataNode communication
Have a go hero – NameNode log delving
Time for action – the replication factor in action
What just happened?
Time for action – intentionally causing missing blocks
What just happened?
When data may be lost
Block corruption
Time for action – killing a TaskTracker process
What just happened?
Comparing the DataNode and TaskTracker failures
Permanent failure
Killing the cluster masters
Time for action – killing the JobTracker
What just happened?
Starting a replacement JobTracker
Have a go hero – moving the JobTracker to a new host
Time for action – killing the NameNode process
What just happened?
Starting a replacement NameNode
The role of the NameNode in more detail
File systems, files, blocks, and nodes
The single most important piece of data in the cluster – fsimage
DataNode startup
Safe mode
SecondaryNameNode
So what to do when the NameNode process has a critical failure?
BackupNode/CheckpointNode and NameNode HA
Hardware failure
Host failure
Host corruption
The risk of correlated failures
Task failure due to software
Failure of slow-running tasks
Time for action – causing task failure
What just happened?
Have a go hero – HDFS programmatic access
Hadoop's handling of slow-running tasks
Speculative execution
Hadoop's handling of failing tasks
Have a go hero – causing tasks to fail
Task failure due to data
Handling dirty data through code
Using Hadoop's skip mode
Time for action – handling dirty data by using skip mode
What just happened?
To skip or not to skip...
Summary
7. Keeping Things Running
A note on EMR
Hadoop configuration properties
Default values
Time for action – browsing default properties
What just happened?
Additional property elements
Default storage location
Where to set properties
Setting up a cluster
How many hosts?
Calculating usable space on a node
Location of the master nodes
Sizing hardware
Processor / memory / storage ratio
EMR as a prototyping platform
Special node requirements
Storage types
Commodity versus enterprise class storage
Single disk versus RAID
Finding the balance
Network storage
Hadoop networking configuration
How blocks are placed
Rack awareness
The rack-awareness script
Time for action – examining the default rack configuration
What just happened?
Time for action – adding a rack awareness script
What just happened?
What is commodity hardware anyway?
Pop quiz – setting up a cluster
Cluster access control
The Hadoop security model
Time for action – demonstrating the default security
What just happened?
User identity
The super user
More granular access control
Working around the security model via physical access control
Managing the NameNode
Configuring multiple locations for the fsimage class
Time for action – adding an additional fsimage location
What just happened?
Where to write the fsimage copies
Swapping to another NameNode host
Having things ready before disaster strikes
Time for action – swapping to a new NameNode host
What just happened?
Don't celebrate quite yet!
What about MapReduce?
Have a go hero – swapping to a new NameNode host
Managing HDFS
Where to write data
Using balancer
When to rebalance
MapReduce management
Command line job management
Have a go hero – command line job management
Job priorities and scheduling
Time for action – changing job priorities and killing a job
What just happened?
Alternative schedulers
Capacity Scheduler
Fair Scheduler
Enabling alternative schedulers
When to use alternative schedulers
Scaling
Adding capacity to a local Hadoop cluster
Have a go hero – adding a node and running balancer
Adding capacity to an EMR job flow
Expanding a running job flow
Summary
8. A Relational View on Data with Hive
Overview of Hive
Why use Hive?
Thanks, Facebook!
Setting up Hive
Prerequisites
Getting Hive
Time for action – installing Hive
What just happened?
Using Hive
Time for action – creating a table for the UFO data
What just happened?
Time for action – inserting the UFO data
What just happened?
Validating the data
Time for action – validating the table
What just happened?
Time for action – redefining the table with the correct column separator
What just happened?
Hive tables – real or not?
Time for action – creating a table from an existing file
What just happened?
Time for action – performing a join
What just happened?
Have a go hero – improve the join to use regular expressions
Hive and SQL views
Time for action – using views
What just happened?
Handling dirty data in Hive
Have a go hero – do it!
Time for action – exporting query output
What just happened?
Partitioning the table
Time for action – making a partitioned UFO sighting table
What just happened?
Bucketing, clustering, and sorting... oh my!
User-Defined Function
Time for action – adding a new User Defined Function (UDF)
What just happened?
To preprocess or not to preprocess...
Hive versus Pig
What we didn't cover
Hive on Amazon Web Services
Time for action – running UFO analysis on EMR
What just happened?
Using interactive job flows for development
Have a go hero – using an interactive EMR cluster
Integration with other AWS products
Summary
9. Working with Relational Databases
Common data paths
Hadoop as an archive store
Hadoop as a preprocessing step
Hadoop as a data input tool
The serpent eats its own tail
Setting up MySQL
Time for action – installing and setting up MySQL
What just happened?
Did it have to be so hard?
Time for action – configuring MySQL to allow remote connections
What just happened?
Don't do this in production!
Time for action – setting up the employee database
What just happened?
Be careful with data file access rights
Getting data into Hadoop
Using MySQL tools and manual import
Have a go hero – exporting the employee table into HDFS
Accessing the database from the mapper
A better way – introducing Sqoop
Time for action – downloading and configuring Sqoop
What just happened?
Sqoop and Hadoop versions
Sqoop and HDFS
Time for action – exporting data from MySQL to HDFS
What just happened?
Mappers and primary key columns
Other options
Sqoop's architecture
Importing data into Hive using Sqoop
Time for action – exporting data from MySQL into Hive
What just happened?
Time for action – a more selective import
What just happened?
Datatype issues
Time for action – using a type mapping
What just happened?
Time for action – importing data from a raw query
What just happened?
Have a go hero
Sqoop and Hive partitions
Field and line terminators
Getting data out of Hadoop
Writing data from within the reducer
Writing SQL import files from the reducer
A better way – Sqoop again
Time for action – importing data from Hadoop into MySQL
What just happened?
Differences between Sqoop imports and exports
Inserts versus updates
Have a go hero
Sqoop and Hive exports
Time for action – importing Hive data into MySQL
What just happened?
Time for action – fixing the mapping and re-running the export
What just happened?
Other Sqoop features
Incremental merge
Avoiding partial exports
Sqoop as a code generator
AWS considerations
Considering RDS
Summary
10. Data Collection with Flume
A note about AWS
Data, data everywhere...
Types of data
Getting network traffic into Hadoop
Time for action – getting web server data into Hadoop
What just happened?
Have a go hero
Getting files into Hadoop
Hidden issues
Keeping network data on the network
Hadoop dependencies
Reliability
Re-creating the wheel
A common framework approach
Introducing Apache Flume
A note on versioning
Time for action – installing and configuring Flume
What just happened?
Using Flume to capture network data
Time for action – capturing network traffic in a log file
What just happened?
Time for action – logging to the console
What just happened?
Writing network data to log files
Time for action – capturing the output of a command to a flat file
What just happened?
Logs versus files
Time for action – capturing a remote file in a local flat file
What just happened?
Sources, sinks, and channels
Sources
Sinks
Channels
Or roll your own
Understanding the Flume configuration files
Have a go hero
It's all about events
Time for action – writing network traffic onto HDFS
What just happened?
Time for action – adding timestamps
What just happened?
To Sqoop or to Flume...
Time for action – multi level Flume networks
What just happened?
Time for action – writing to multiple sinks
What just happened?
Selectors replicating and multiplexing
Handling sink failure
Have a go hero – Handling sink failure
Next, the world
Have a go hero – Next, the world
The bigger picture
Data lifecycle
Staging data
Scheduling
Summary
11. Where to Go Next
What we did and didn't cover in this book
Upcoming Hadoop changes
Alternative distributions
Why alternative distributions?
Bundling
Free and commercial extensions
Cloudera Distribution for Hadoop
Hortonworks Data Platform
MapR
IBM InfoSphere BigInsights
Choosing a distribution
Other Apache projects
HBase
Oozie
Whir
Mahout
MRUnit
Other programming abstractions
Pig
Cascading
AWS resources
HBase on EMR
SimpleDB
DynamoDB
Sources of information
Source code
Mailing lists and forums
LinkedIn groups
HUGs
Conferences
Summary
A. Pop Quiz Answers
Chapter 3, Understanding MapReduce
Pop quiz – key/value pairs
Pop quiz – walking through a run of WordCount
Chapter 7, Keeping Things Running
Pop quiz – setting up a cluster
Index
Hadoop Beginner's Guide
Hadoop Beginner's Guide
Copyright © 2013 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author nor Packt Publishing, and its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: February 2013
Production Reference: 1150213
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-84951-730-0
www.packtpub.com
Cover Image by Asher Wishkerman (<[email protected]>)
Credits
Author
Garry Turkington
Reviewers
David Gruzman
Muthusamy Manigandan
Vidyasagar N V
Acquisition Editor
Robin de Jongh
Lead Technical Editor
Azharuddin Sheikh
Technical Editors
Ankita Meshram
Varun Pius Rodrigues
Copy Editors
Brandt D'Mello
Aditya Nair
Laxmi Subramanian
Ruta Waghmare
Project Coordinator
Leena Purkait
Proofreader
Maria Gould
Indexer
Hemangini Bari
Production Coordinator
Nitesh Thakur
Cover Work
Nitesh Thakur
About the Author
Garry Turkington has 14 years of industry experience, most of which has been focused on the design and implementation of large-scale distributed systems. In his current roles as VP Data Engineering and Lead Architect at Improve Digital, he is primarily responsible for the realization of systems that store, process, and extract value from the company's large data volumes. Before joining Improve Digital, he spent time at Amazon.co.uk, where he led several software development teams building systems that process Amazon catalog data for every item worldwide. Prior to this, he spent a decade in various government positions in both the UK and USA.
He has BSc and PhD degrees in Computer Science from Queen's University Belfast in Northern Ireland and an MEng in Systems Engineering from the Stevens Institute of Technology in the USA.
I would like to thank my wife, Lea, for her support and encouragement—not to mention her patience—throughout the writing of this book, and my daughter, Maya, whose spirit and curiosity are more of an inspiration than she could ever imagine.
About the Reviewers
David Gruzman is a Hadoop and big data architect with more than 18 years of hands-on experience, specializing in the design and implementation of scalable high-performance distributed systems. He has extensive expertise in OOA/OOD and (R)DBMS technology. He is a practitioner of Agile methodologies and strongly believes that a daily coding routine makes good software architects. He is interested in solving challenging problems related to real-time analytics and the application of machine learning algorithms to big data sets.
He founded—and is working with—BigDataCraft.com, a boutique consulting firm in the area of big data. Visit their site at www.bigdatacraft.com. David can be contacted at [email protected]. More detailed information about his skills and experience can be found at http://www.linkedin.com/in/davidgruzman.
Muthusamy Manigandan is a systems architect for a startup. Prior to this, he was a Staff Engineer at VMware and a Principal Engineer with Oracle. Mani has been programming for the past 14 years on large-scale distributed-computing applications. His areas of interest are machine learning and algorithms.
Vidyasagar N V has been interested in computer science since an early age. Some of his serious work in computers and computer networks began during his high school days. Later, he went to the prestigious Institute of Technology, Banaras Hindu University, for his B.Tech. He has been working as a software developer and data expert, developing and building scalable systems. He has worked with a variety of second, third, and fourth generation languages, as well as with flat files, indexed files, hierarchical databases, network databases, relational databases, NoSQL databases, Hadoop, and related technologies. Currently, he is working as a Senior Developer at Collective Inc., developing big data-based structured data extraction techniques for web and local information. He enjoys producing high-quality software and web-based solutions and designing secure and scalable data systems. He can be contacted at <[email protected]>.
I would like to thank the Almighty, my parents, Mr. N Srinivasa Rao and Mrs. Latha Rao, and my family who supported and backed me throughout my life. I would also like to thank my friends for being good friends and all those people willing to donate their time, effort, and expertise by participating in open source software projects. Thank you, Packt Publishing for selecting me as one of the technical reviewers for this wonderful book. It is my honor to be a part of it.
www.PacktPub.com
Support files, eBooks, discount offers and more
You might want to visit www.PacktPub.com for support files and downloads related to your book.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
http://PacktLib.PacktPub.com
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.
Why Subscribe?
Fully searchable across every book published by Packt
Copy and paste, print and bookmark content
On demand and accessible via web browser
Free Access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.
Preface
This book is here to help you make sense of Hadoop and use it to solve your big data problems. It's a really exciting time to work with data processing technologies such as Hadoop. The ability to apply complex analytics to large data sets—once the monopoly of large corporations and government agencies—is now possible through free open source software (OSS).
But because of the seeming complexity and pace of change in this area, getting a grip on the basics can be somewhat intimidating. That's where this book comes in, giving you an understanding of just what Hadoop is, how it works, and how you can use it to extract value from your data now.
In addition to an explanation of core Hadoop, we also spend several chapters exploring other technologies that either use Hadoop or integrate with it. Our goal is to give you an understanding not just of what Hadoop is but also how to use it as a part of your broader technical infrastructure.
A complementary technology is the use of cloud computing, and in particular, the offerings from Amazon Web Services. Throughout the book, we will show you how to use these services to host your Hadoop workloads, demonstrating that not only can you process large data volumes, but also that you don't actually need to buy any physical hardware to do so.
What this book covers
This book comprises three main parts: Chapters 1 through 5, which cover the core of Hadoop and how it works; Chapters 6 and 7, which cover the more operational aspects of Hadoop; and Chapters 8 through 11, which look at the use of Hadoop alongside other products and technologies.
Chapter 1, What It's All About, gives an overview of the trends that have made Hadoop and cloud computing such important technologies today.
Chapter 2, Getting Hadoop Up and Running, walks you through the initial setup of a local Hadoop cluster and the running of some demo jobs. For comparison, the same work is also executed on the hosted Hadoop Amazon service.
Chapter 3, Understanding MapReduce, goes inside the workings of Hadoop to show how MapReduce jobs are executed and shows how to write applications using the Java API.
Chapter 4, Developing MapReduce Programs, takes a case study of a moderately sized data set to demonstrate techniques to help when deciding how to approach the processing and analysis of a new data source.
Chapter 5, Advanced MapReduce Techniques, looks at a few more sophisticated ways of applying MapReduce to problems that don't necessarily seem immediately applicable to the Hadoop processing model.
Chapter 6, When Things Break, examines Hadoop's much-vaunted high availability and fault tolerance in some detail, seeing just how good they are by intentionally causing havoc: killing processes and feeding in corrupt data.
Chapter 7, Keeping Things Running, takes a more operational view of Hadoop and will be of most use for those who need to administer a Hadoop cluster. Along with demonstrating some best practice, it describes how to prepare for the worst operational disasters so you can sleep at night.
Chapter 8, A Relational View on Data with Hive, introduces Apache Hive, which allows Hadoop data to be queried with a SQL-like syntax.
Chapter 9, Working with Relational Databases, explores how Hadoop can be integrated with existing databases, and in particular, how to move data from one to the other.
Chapter 10, Data Collection with Flume, shows how Apache Flume can be used to gather data from multiple sources and deliver it to destinations such as Hadoop.
Chapter 11, Where to Go Next, wraps up the book with an overview of the broader Hadoop ecosystem, highlighting other products and technologies of potential interest. In addition, it gives some ideas on how to get involved with the Hadoop community and how to get help.
What you need for this book
As we discuss the various Hadoop-related software packages used in this book, we will describe the particular requirements for each chapter. However, you will generally need somewhere to run your Hadoop cluster.
In the simplest case, a single Linux-based machine will give you a platform to explore almost all the exercises in this book. We assume you have a recent distribution of Ubuntu, but as long as you have command-line Linux familiarity, any modern distribution will suffice.
Some of the examples in later chapters really need multiple machines to see things working, so you will require access to at least four such hosts. Virtual machines are completely acceptable; they're not ideal for production but are fine for learning and exploration.
Since we also explore Amazon Web Services in this book, you can run all the examples on EC2 instances, and we will look at some other more Hadoop-specific uses of AWS throughout the book. AWS services are usable by anyone, but you will need a credit card to sign up!
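As a quick check of the two main prerequisites, a Java runtime and SSH, you can run the following commands before starting Chapter 2. This is a minimal sketch; the openjdk-6-jdk package name is an assumption and will vary with your distribution and preferred Java version:
# Confirm a Java runtime is present; Hadoop needs Java 1.6 or later
java -version
# If it is missing, install one (package name assumed; varies by distribution)
sudo apt-get install openjdk-6-jdk
# Hadoop's control scripts manage services over SSH, even on a single host
ssh localhost echo "SSH is working"
If the last command prompts for a password, don't worry; setting up the passwordless SSH that Hadoop expects is covered in Chapter 2.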
Who this book is for
We assume you are reading this book because you want to know more about Hadoop at a hands-on level; the key audience is those with software development experience but no prior exposure to Hadoop or similar big data technologies.
For developers who want to know how to write MapReduce applications, we assume you are comfortable writing Java programs and are familiar with the Unix command-line interface. We will also show you a few programs in Ruby, but these are usually only to demonstrate language independence, and you don't need to be a Ruby expert.
For architects and system administrators, the book also provides significant value in explaining how Hadoop works, its place in the broader architecture, and how it can be managed operationally. Some of the more involved techniques in Chapter 4, Developing MapReduce Programs, and Chapter 5, Advanced MapReduce Techniques, are probably of less direct interest to this audience.
Conventions
In this book, you will find several headings appearing frequently.
To give clear instructions of how to complete a procedure or task, we use:
Time for action – heading
Action 1
Action 2
Action 3
Instructions often need some extra explanation so that they make sense, so they are followed with:
What just happened?
This heading explains the working of tasks or instructions that you have just completed.
You will also find some other learning aids in the book, including:
Pop quiz – heading
These are short multiple-choice questions intended to help you test your own understanding.
Have a go hero – heading
These set practical challenges and give you ideas for experimenting with what you have learned.
You will also find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.
Code words in text are shown as follows: "You may notice that we used the Unix command rm to remove the Drush directory rather than the DOS del command."
A block of code is set as follows:
# * Fine Tuning
#
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 16M
thread_stack = 512K
thread_cache_size = 8
max_connections = 300
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
# * Fine Tuning
#
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 16M
thread_stack = 512K
thread_cache_size = 8
max_connections = 300
Any command-line input or output is written as follows:
cd /ProgramData/Propeople
rm -r Drush
git clone --branch master http://git.drupal.org/project/drush.git
New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "On the Select Destination Location screen, click on Next to accept the default destination."
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title through the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website, or added to any list of existing errata, under the Errata section of that title.
Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors, and our ability to bring you valuable content.
Questions
You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.
Chapter 1. What It's All About
This book is about Hadoop, an open source framework for large-scale data processing. Before we get into the details of the technology and its use in later chapters, it is important to spend a little time exploring the trends that led to Hadoop's creation and its enormous success.
Hadoop was not created in a vacuum; instead, it exists due to the explosion in the amount of data being created and consumed and a shift that sees this data deluge arrive at small startups and not just huge multinationals. At the same time, other trends have changed how software and systems are deployed, using cloud resources alongside or even in preference to more traditional infrastructures.
This chapter will explore some of these trends and explain in detail the specific problems Hadoop seeks to solve and the drivers that shaped its design.
In the rest of this chapter, we shall:
Learn about the big data revolution
Understand what Hadoop