Hadoop Beginner's Guide
About this ebook
Data is arriving faster than you can process it, and the overall volumes keep growing at a rate that keeps you awake at night. Hadoop can help you tame the data beast. Effective use of Hadoop, however, requires a mixture of programming, design, and system administration skills.
"Hadoop Beginner's Guide" removes the mystery from Hadoop, presenting Hadoop and related technologies with a focus on building working systems and getting the job done, using cloud services to do so when it makes sense. From basic concepts and initial setup through developing applications and keeping the system running as the data grows, the book gives the understanding needed to effectively use Hadoop to solve real world problems.
Starting with the basics of installing and configuring Hadoop, the book explains how to develop applications, maintain the system, and how to use additional products to integrate with other systems.
While learning different ways to develop applications to run on Hadoop the book also covers tools such as Hive, Sqoop, and Flume that show how Hadoop can be integrated with relational databases and log collection.
In addition to examples run on Hadoop clusters on Ubuntu, the book covers the use of cloud services such as Amazon EC2 and Elastic MapReduce.
Approach
As a Packt Beginner's Guide, the book is packed with clear step-by-step instructions for performing the most useful tasks, getting you up and running quickly, and learning by doing.
Who this book is for
This book assumes no existing experience with Hadoop or cloud services. It assumes you have familiarity with a programming language such as Java or Ruby, but gives you the needed background on the other topics.
Garry Turkington
Book preview
Hadoop Beginner's Guide - Garry Turkington
Table of Contents
Hadoop Beginner's Guide
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Time for action – heading
What just happened?
Pop quiz – heading
Have a go hero – heading
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. What It's All About
Big data processing
The value of data
Historically for the few and not the many
Classic data processing systems
Scale-up
Early approaches to scale-out
Limiting factors
A different approach
All roads lead to scale-out
Share nothing
Expect failure
Smart software, dumb hardware
Move processing, not data
Build applications, not infrastructure
Hadoop
Thanks, Google
Thanks, Doug
Thanks, Yahoo
Parts of Hadoop
Common building blocks
HDFS
MapReduce
Better together
Common architecture
What it is and isn't good for
Cloud computing with Amazon Web Services
Too many clouds
A third way
Different types of costs
AWS – infrastructure on demand from Amazon
Elastic Compute Cloud (EC2)
Simple Storage Service (S3)
Elastic MapReduce (EMR)
What this book covers
A dual approach
Summary
2. Getting Hadoop Up and Running
Hadoop on a local Ubuntu host
Other operating systems
Time for action – checking the prerequisites
What just happened?
Setting up Hadoop
A note on versions
Time for action – downloading Hadoop
What just happened?
Time for action – setting up SSH
What just happened?
Configuring and running Hadoop
Time for action – using Hadoop to calculate Pi
What just happened?
Three modes
Time for action – configuring the pseudo-distributed mode
What just happened?
Configuring the base directory and formatting the filesystem
Time for action – changing the base HDFS directory
What just happened?
Time for action – formatting the NameNode
What just happened?
Starting and using Hadoop
Time for action – starting Hadoop
What just happened?
Time for action – using HDFS
What just happened?
Time for action – WordCount, the Hello World of MapReduce
What just happened?
Have a go hero – WordCount on a larger body of text
Monitoring Hadoop from the browser
The HDFS web UI
The MapReduce web UI
Using Elastic MapReduce
Setting up an account in Amazon Web Services
Creating an AWS account
Signing up for the necessary services
Time for action – WordCount on EMR using the management console
What just happened?
Have a go hero – other EMR sample applications
Other ways of using EMR
AWS credentials
The EMR command-line tools
The AWS ecosystem
Comparison of local versus EMR Hadoop
Summary
3. Understanding MapReduce
Key/value pairs
What it means
Why key/value data?
Some real-world examples
MapReduce as a series of key/value transformations
Pop quiz – key/value pairs
The Hadoop Java API for MapReduce
The 0.20 MapReduce Java API
The Mapper class
The Reducer class
The Driver class
Writing MapReduce programs
Time for action – setting up the classpath
What just happened?
Time for action – implementing WordCount
What just happened?
Time for action – building a JAR file
What just happened?
Time for action – running WordCount on a local Hadoop cluster
What just happened?
Time for action – running WordCount on EMR
What just happened?
The pre-0.20 Java MapReduce API
Hadoop-provided mapper and reducer implementations
Time for action – WordCount the easy way
What just happened?
Walking through a run of WordCount
Startup
Splitting the input
Task assignment
Task startup
Ongoing JobTracker monitoring
Mapper input
Mapper execution
Mapper output and reduce input
Partitioning
The optional partition function
Reducer input
Reducer execution
Reducer output
Shutdown
That's all there is to it!
Apart from the combiner…maybe
Why have a combiner?
Time for action – WordCount with a combiner
What just happened?
When you can use the reducer as the combiner
Time for action – fixing WordCount to work with a combiner
What just happened?
Reuse is your friend
Pop quiz – MapReduce mechanics
Hadoop-specific data types
The Writable and WritableComparable interfaces
Introducing the wrapper classes
Primitive wrapper classes
Array wrapper classes
Map wrapper classes
Time for action – using the Writable wrapper classes
What just happened?
Other wrapper classes
Have a go hero – playing with Writables
Making your own
Input/output
Files, splits, and records
InputFormat and RecordReader
Hadoop-provided InputFormat
Hadoop-provided RecordReader
OutputFormat and RecordWriter
Hadoop-provided OutputFormat
Don't forget Sequence files
Summary
4. Developing MapReduce Programs
Using languages other than Java with Hadoop
How Hadoop Streaming works
Why use Hadoop Streaming
Time for action – implementing WordCount using Streaming
What just happened?
Differences in jobs when using Streaming
Analyzing a large dataset
Getting the UFO sighting dataset
Getting a feel for the dataset
Time for action – summarizing the UFO data
What just happened?
Examining UFO shapes
Time for action – summarizing the shape data
What just happened?
Time for action – correlating sighting duration to UFO shape
What just happened?
Using Streaming scripts outside Hadoop
Time for action – performing the shape/time analysis from the command line
What just happened?
Java shape and location analysis
Time for action – using ChainMapper for field validation/analysis
What just happened?
Have a go hero
Too many abbreviations
Using the Distributed Cache
Time for action – using the Distributed Cache to improve location output
What just happened?
Counters, status, and other output
Time for action – creating counters, task states, and writing log output
What just happened?
Too much information!
Summary
5. Advanced MapReduce Techniques
Simple, advanced, and in-between
Joins
When this is a bad idea
Map-side versus reduce-side joins
Matching account and sales information
Time for action – reduce-side join using MultipleInputs
What just happened?
DataJoinMapper and TaggedMapperOutput
Implementing map-side joins
Using the Distributed Cache
Have a go hero – Implementing map-side joins
Pruning data to fit in the cache
Using a data representation instead of raw data
Using multiple mappers
To join or not to join...
Graph algorithms
Graph 101
Graphs and MapReduce – a match made somewhere
Representing a graph
Time for action – representing the graph
What just happened?
Overview of the algorithm
The mapper
The reducer
Iterative application
Time for action – creating the source code
What just happened?
Time for action – the first run
What just happened?
Time for action – the second run
What just happened?
Time for action – the third run
What just happened?
Time for action – the fourth and last run
What just happened?
Running multiple jobs
Final thoughts on graphs
Using language-independent data structures
Candidate technologies
Introducing Avro
Time for action – getting and installing Avro
What just happened?
Avro and schemas
Time for action – defining the schema
What just happened?
Time for action – creating the source Avro data with Ruby
What just happened?
Time for action – consuming the Avro data with Java
What just happened?
Using Avro within MapReduce
Time for action – generating shape summaries in MapReduce
What just happened?
Time for action – examining the output data with Ruby
What just happened?
Time for action – examining the output data with Java
What just happened?
Have a go hero – graphs in Avro
Going forward with Avro
Summary
6. When Things Break
Failure
Embrace failure
Or at least don't fear it
Don't try this at home
Types of failure
Hadoop node failure
The dfsadmin command
Cluster setup, test files, and block sizes
Fault tolerance and Elastic MapReduce
Time for action – killing a DataNode process
What just happened?
NameNode and DataNode communication
Have a go hero – NameNode log delving
Time for action – the replication factor in action
What just happened?
Time for action – intentionally causing missing blocks
What just happened?
When data may be lost
Block corruption
Time for action – killing a TaskTracker process
What just happened?
Comparing the DataNode and TaskTracker failures
Permanent failure
Killing the cluster masters
Time for action – killing the JobTracker
What just happened?
Starting a replacement JobTracker
Have a go hero – moving the JobTracker to a new host
Time for action – killing the NameNode process
What just happened?
Starting a replacement NameNode
The role of the NameNode in more detail
File systems, files, blocks, and nodes
The single most important piece of data in the cluster – fsimage
DataNode startup
Safe mode
SecondaryNameNode
So what to do when the NameNode process has a critical failure?
BackupNode/CheckpointNode and NameNode HA
Hardware failure
Host failure
Host corruption
The risk of correlated failures
Task failure due to software
Failure of slow-running tasks
Time for action – causing task failure
What just happened?
Have a go hero – HDFS programmatic access
Hadoop's handling of slow-running tasks
Speculative execution
Hadoop's handling of failing tasks
Have a go hero – causing tasks to fail
Task failure due to data
Handling dirty data through code
Using Hadoop's skip mode
Time for action – handling dirty data by using skip mode
What just happened?
To skip or not to skip...
Summary
7. Keeping Things Running
A note on EMR
Hadoop configuration properties
Default values
Time for action – browsing default properties
What just happened?
Additional property elements
Default storage location
Where to set properties
Setting up a cluster
How many hosts?
Calculating usable space on a node
Location of the master nodes
Sizing hardware
Processor / memory / storage ratio
EMR as a prototyping platform
Special node requirements
Storage types
Commodity versus enterprise class storage
Single disk versus RAID
Finding the balance
Network storage
Hadoop networking configuration
How blocks are placed
Rack awareness
The rack-awareness script
Time for action – examining the default rack configuration
What just happened?
Time for action – adding a rack awareness script
What just happened?
What is commodity hardware anyway?
Pop quiz – setting up a cluster
Cluster access control
The Hadoop security model
Time for action – demonstrating the default security
What just happened?
User identity
The super user
More granular access control
Working around the security model via physical access control
Managing the NameNode
Configuring multiple locations for the fsimage class
Time for action – adding an additional fsimage location
What just happened?
Where to write the fsimage copies
Swapping to another NameNode host
Having things ready before disaster strikes
Time for action – swapping to a new NameNode host
What just happened?
Don't celebrate quite yet!
What about MapReduce?
Have a go hero – swapping to a new NameNode host
Managing HDFS
Where to write data
Using balancer
When to rebalance
MapReduce management
Command line job management
Have a go hero – command line job management
Job priorities and scheduling
Time for action – changing job priorities and killing a job
What just happened?
Alternative schedulers
Capacity Scheduler
Fair Scheduler
Enabling alternative schedulers
When to use alternative schedulers
Scaling
Adding capacity to a local Hadoop cluster
Have a go hero – adding a node and running balancer
Adding capacity to an EMR job flow
Expanding a running job flow
Summary
8. A Relational View on Data with Hive
Overview of Hive
Why use Hive?
Thanks, Facebook!
Setting up Hive
Prerequisites
Getting Hive
Time for action – installing Hive
What just happened?
Using Hive
Time for action – creating a table for the UFO data
What just happened?
Time for action – inserting the UFO data
What just happened?
Validating the data
Time for action – validating the table
What just happened?
Time for action – redefining the table with the correct column separator
What just happened?
Hive tables – real or not?
Time for action – creating a table from an existing file
What just happened?
Time for action – performing a join
What just happened?
Have a go hero – improve the join to use regular expressions
Hive and SQL views
Time for action – using views
What just happened?
Handling dirty data in Hive
Have a go hero – do it!
Time for action – exporting query output
What just happened?
Partitioning the table
Time for action – making a partitioned UFO sighting table
What just happened?
Bucketing, clustering, and sorting... oh my!
User-Defined Function
Time for action – adding a new User Defined Function (UDF)
What just happened?
To preprocess or not to preprocess...
Hive versus Pig
What we didn't cover
Hive on Amazon Web Services
Time for action – running UFO analysis on EMR
What just happened?
Using interactive job flows for development
Have a go hero – using an interactive EMR cluster
Integration with other AWS products
Summary
9. Working with Relational Databases
Common data paths
Hadoop as an archive store
Hadoop as a preprocessing step
Hadoop as a data input tool
The serpent eats its own tail
Setting up MySQL
Time for action – installing and setting up MySQL
What just happened?
Did it have to be so hard?
Time for action – configuring MySQL to allow remote connections
What just happened?
Don't do this in production!
Time for action – setting up the employee database
What just happened?
Be careful with data file access rights
Getting data into Hadoop
Using MySQL tools and manual import
Have a go hero – exporting the employee table into HDFS
Accessing the database from the mapper
A better way – introducing Sqoop
Time for action – downloading and configuring Sqoop
What just happened?
Sqoop and Hadoop versions
Sqoop and HDFS
Time for action – exporting data from MySQL to HDFS
What just happened?
Mappers and primary key columns
Other options
Sqoop's architecture
Importing data into Hive using Sqoop
Time for action – exporting data from MySQL into Hive
What just happened?
Time for action – a more selective import
What just happened?
Datatype issues
Time for action – using a type mapping
What just happened?
Time for action – importing data from a raw query
What just happened?
Have a go hero
Sqoop and Hive partitions
Field and line terminators
Getting data out of Hadoop
Writing data from within the reducer
Writing SQL import files from the reducer
A better way – Sqoop again
Time for action – importing data from Hadoop into MySQL
What just happened?
Differences between Sqoop imports and exports
Inserts versus updates
Have a go hero
Sqoop and Hive exports
Time for action – importing Hive data into MySQL
What just happened?
Time for action – fixing the mapping and re-running the export
What just happened?
Other Sqoop features
Incremental merge
Avoiding partial exports
Sqoop as a code generator
AWS considerations
Considering RDS
Summary
10. Data Collection with Flume
A note about AWS
Data, data everywhere...
Types of data
Getting network traffic into Hadoop
Time for action – getting web server data into Hadoop
What just happened?
Have a go hero
Getting files into Hadoop
Hidden issues
Keeping network data on the network
Hadoop dependencies
Reliability
Re-creating the wheel
A common framework approach
Introducing Apache Flume
A note on versioning
Time for action – installing and configuring Flume
What just happened?
Using Flume to capture network data
Time for action – capturing network traffic in a log file
What just happened?
Time for action – logging to the console
What just happened?
Writing network data to log files
Time for action – capturing the output of a command to a flat file
What just happened?
Logs versus files
Time for action – capturing a remote file in a local flat file
What just happened?
Sources, sinks, and channels
Sources
Sinks
Channels
Or roll your own
Understanding the Flume configuration files
Have a go hero
It's all about events
Time for action – writing network traffic onto HDFS
What just happened?
Time for action – adding timestamps
What just happened?
To Sqoop or to Flume...
Time for action – multi level Flume networks
What just happened?
Time for action – writing to multiple sinks
What just happened?
Selectors replicating and multiplexing
Handling sink failure
Have a go hero – Handling sink failure
Next, the world
Have a go hero – Next, the world
The bigger picture
Data lifecycle
Staging data
Scheduling
Summary
11. Where to Go Next
What we did and didn't cover in this book
Upcoming Hadoop changes
Alternative distributions
Why alternative distributions?
Bundling
Free and commercial extensions
Cloudera Distribution for Hadoop
Hortonworks Data Platform
MapR
IBM InfoSphere BigInsights
Choosing a distribution
Other Apache projects
HBase
Oozie
Whir
Mahout
MRUnit
Other programming abstractions
Pig
Cascading
AWS resources
HBase on EMR
SimpleDB
DynamoDB
Sources of information
Source code
Mailing lists and forums
LinkedIn groups
HUGs
Conferences
Summary
A. Pop Quiz Answers
Chapter 3, Understanding MapReduce
Pop quiz – key/value pairs
Pop quiz – walking through a run of WordCount
Chapter 7, Keeping Things Running
Pop quiz – setting up a cluster
Index
Hadoop Beginner's Guide
Hadoop Beginner's Guide
Copyright © 2013 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author nor Packt Publishing, and its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: February 2013
Production Reference: 1150213
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-84951-730-0
www.packtpub.com
Cover Image by Asher Wishkerman (<[email protected]>)
Credits
Author
Garry Turkington
Reviewers
David Gruzman
Muthusamy Manigandan
Vidyasagar N V
Acquisition Editor
Robin de Jongh
Lead Technical Editor
Azharuddin Sheikh
Technical Editors
Ankita Meshram
Varun Pius Rodrigues
Copy Editors
Brandt D'Mello
Aditya Nair
Laxmi Subramanian
Ruta Waghmare
Project Coordinator
Leena Purkait
Proofreader
Maria Gould
Indexer
Hemangini Bari
Production Coordinator
Nitesh Thakur
Cover Work
Nitesh Thakur
About the Author
Garry Turkington has 14 years of industry experience, most of which has been focused on the design and implementation of large-scale distributed systems. In his current roles as VP Data Engineering and Lead Architect at Improve Digital, he is primarily responsible for the realization of systems that store, process, and extract value from the company's large data volumes. Before joining Improve Digital, he spent time at Amazon.co.uk, where he led several software development teams building systems that process Amazon catalog data for every item worldwide. Prior to this, he spent a decade in various government positions in both the UK and USA.
He has BSc and PhD degrees in Computer Science from Queen's University Belfast in Northern Ireland and an MEng in Systems Engineering from the Stevens Institute of Technology in the USA.
I would like to thank my wife, Lea, for her support and encouragement—not to mention her patience—throughout the writing of this book, and my daughter, Maya, whose spirit and curiosity are more of an inspiration than she could ever imagine.
About the Reviewers
David Gruzman is a Hadoop and big data architect with more than 18 years of hands-on experience, specializing in the design and implementation of scalable high-performance distributed systems. He has extensive expertise in OOA/OOD and (R)DBMS technology. He is a practitioner of Agile methodologies and strongly believes that a daily coding routine makes good software architects. He is interested in solving challenging problems related to real-time analytics and the application of machine learning algorithms to big data sets.
He founded—and is working with—BigDataCraft.com, a boutique consulting firm in the area of big data. Visit their site at www.bigdatacraft.com. David can be contacted at [email protected]. More detailed information about his skills and experience can be found at http://www.linkedin.com/in/davidgruzman.
Muthusamy Manigandan is a systems architect for a startup. Prior to this, he was a Staff Engineer at VMware and a Principal Engineer with Oracle. Mani has been programming for the past 14 years on large-scale distributed-computing applications. His areas of interest are machine learning and algorithms.
Vidyasagar N V has been interested in computer science since an early age. Some of his serious work in computers and computer networks began during his high school days. Later, he went to the prestigious Institute of Technology, Banaras Hindu University, for his B.Tech. He has been working as a software developer and data expert, developing and building scalable systems. He has worked with a variety of second, third, and fourth generation languages, as well as with flat files, indexed files, hierarchical databases, network databases, relational databases, NoSQL databases, Hadoop, and related technologies. Currently, he is working as a Senior Developer at Collective Inc., developing big data-based structured data extraction techniques for web and local information. He enjoys producing high-quality software and web-based solutions and designing secure and scalable data systems. He can be contacted at <[email protected]>.
I would like to thank the Almighty, my parents, Mr. N Srinivasa Rao and Mrs. Latha Rao, and my family who supported and backed me throughout my life. I would also like to thank my friends for being good friends and all those people willing to donate their time, effort, and expertise by participating in open source software projects. Thank you, Packt Publishing for selecting me as one of the technical reviewers for this wonderful book. It is my honor to be a part of it.
www.PacktPub.com
Support files, eBooks, discount offers and more
You might want to visit www.PacktPub.com for support files and downloads related to your book.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
http://PacktLib.PacktPub.com
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.
Why Subscribe?
Fully searchable across every book published by Packt
Copy and paste, print and bookmark content
On demand and accessible via web browser
Free Access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.
Preface
This book is here to help you make sense of Hadoop and use it to solve your big data problems. It's a really exciting time to work with data processing technologies such as Hadoop. The ability to apply complex analytics to large data sets—once the monopoly of large corporations and government agencies—is now possible through free open source software (OSS).
But because of the seeming complexity and pace of change in this area, getting a grip on the basics can be somewhat intimidating. That's where this book comes in, giving you an understanding of just what Hadoop is, how it works, and how you can use it to extract value from your data now.
In addition to an explanation of core Hadoop, we also spend several chapters exploring other technologies that either use Hadoop or integrate with it. Our goal is to give you an understanding not just of what Hadoop is but also how to use it as a part of your broader technical infrastructure.
A complementary technology is the use of cloud computing, and in particular, the offerings from Amazon Web Services. Throughout the book, we will show you how to use these services to host your Hadoop workloads, demonstrating that not only can you process large data volumes, but also that you don't actually need to buy any physical hardware to do so.
What this book covers
This book comprises three main parts: Chapters 1 through 5, which cover the core of Hadoop and how it works; Chapters 6 and 7, which cover the more operational aspects of Hadoop; and Chapters 8 through 11, which look at the use of Hadoop alongside other products and technologies.
Chapter 1, What It's All About, gives an overview of the trends that have made Hadoop and cloud computing such important technologies today.
Chapter 2, Getting Hadoop Up and Running, walks you through the initial setup of a local Hadoop cluster and the running of some demo jobs. For comparison, the same work is also executed on the hosted Hadoop Amazon service.
Chapter 3, Understanding MapReduce, goes inside the workings of Hadoop to show how MapReduce jobs are executed and shows how to write applications using the Java API.
Chapter 4, Developing MapReduce Programs, takes a case study of a moderately sized data set to demonstrate techniques to help when deciding how to approach the processing and analysis of a new data source.
Chapter 5, Advanced MapReduce Techniques, looks at a few more sophisticated ways of applying MapReduce to problems that don't necessarily seem immediately applicable to the Hadoop processing model.
Chapter 6, When Things Break, examines Hadoop's much-vaunted high availability and fault tolerance in some detail, seeing just how good they are by intentionally causing havoc: killing processes and feeding in corrupt data.
Chapter 7, Keeping Things Running, takes a more operational view of Hadoop and will be of most use for those who need to administer a Hadoop cluster. Along with demonstrating some best practice, it describes how to prepare for the worst operational disasters so you can sleep at night.
Chapter 8, A Relational View on Data with Hive, introduces Apache Hive, which allows Hadoop data to be queried with a SQL-like syntax.
Chapter 9, Working with Relational Databases, explores how Hadoop can be integrated with existing databases, and in particular, how to move data from one to the other.
Chapter 10, Data Collection with Flume, shows how Apache Flume can be used to gather data from multiple sources and deliver it to destinations such as Hadoop.
Chapter 11, Where to Go Next, wraps up the book with an overview of the broader Hadoop ecosystem, highlighting other products and technologies of potential interest. In addition, it gives some ideas on how to get involved with the Hadoop community and how to get help.
What you need for this book
As we discuss the various Hadoop-related software packages used in this book, we will describe the particular requirements for each chapter. However, you will generally need somewhere to run your Hadoop cluster.
In the simplest case, a single Linux-based machine will give you a platform to explore almost all the exercises in this book. We assume you have a recent distribution of Ubuntu, but as long as you have command-line Linux familiarity, any modern distribution will suffice.
Some of the examples in later chapters really need multiple machines to see things working, so you will require access to at least four such hosts. Virtual machines are completely acceptable; they're not ideal for production but are fine for learning and exploration.
Since we also explore Amazon Web Services in this book, you can run all the examples on EC2 instances, and we will look at some other more Hadoop-specific uses of AWS throughout the book. AWS services are usable by anyone, but you will need a credit card to sign up!
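As a quick check of the two main prerequisites, a Java runtime and SSH, you can run the following commands before starting Chapter 2. This is a minimal sketch; the openjdk-6-jdk package name is an assumption and will vary with your distribution and preferred Java version:
# Confirm a Java runtime is present; Hadoop needs Java 1.6 or later
java -version
# If it is missing, install one (package name assumed; varies by distribution)
sudo apt-get install openjdk-6-jdk
# Hadoop's control scripts manage services over SSH, even on a single host
ssh localhost echo "SSH is working"
If the last command prompts for a password, don't worry; setting up the passwordless SSH that Hadoop expects is covered in Chapter 2.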
Who this book is for
We assume you are reading this book because you want to know more about Hadoop at a hands-on level; the key audience is those with software development experience but no prior exposure to Hadoop or similar big data technologies.
For developers who want to know how to write MapReduce applications, we assume you are comfortable writing Java programs and are familiar with the Unix command-line interface. We will also show you a few programs in Ruby, but these are usually only to demonstrate language independence, and you don't need to be a Ruby expert.
For architects and system administrators, the book also provides significant value in explaining how Hadoop works, its place in the broader architecture, and how it can be managed operationally. Some of the more involved techniques in Chapter 4, Developing MapReduce Programs, and Chapter 5, Advanced MapReduce Techniques, are probably of less direct interest to this audience.
Conventions
In this book, you will find several headings appearing frequently.
To give clear instructions of how to complete a procedure or task, we use:
Time for action – heading
Action 1
Action 2
Action 3
Instructions often need some extra explanation so that they make sense, so they are followed with:
What just happened?
This heading explains the working of tasks or instructions that you have just completed.
You will also find some other learning aids in the book, including:
Pop quiz – heading
These are short multiple-choice questions intended to help you test your own understanding.
Have a go hero – heading
These set practical challenges and give you ideas for experimenting with what you have learned.
You will also find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.
Code words in text are shown as follows: "You may notice that we used the Unix command rm to remove the Drush directory rather than the DOS del command."
A block of code is set as follows:
# * Fine Tuning
#
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 16M
thread_stack = 512K
thread_cache_size = 8
max_connections = 300
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
# * Fine Tuning
#
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 16M
thread_stack = 512K
thread_cache_size = 8
max_connections = 300
Any command-line input or output is written as follows:
cd /ProgramData/Propeople
rm -r Drush
git clone --branch master http://git.drupal.org/project/drush.git
New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "On the Select Destination Location screen, click on Next to accept the default destination."
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title through the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website, or added to any list of existing errata, under the Errata section of that title.
Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors, and our ability to bring you valuable content.
Questions
You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.
Chapter 1. What It's All About
This book is about Hadoop, an open source framework for large-scale data processing. Before we get into the details of the technology and its use in later chapters, it is important to spend a little time exploring the trends that led to Hadoop's creation and its enormous success.
Hadoop was not created in a vacuum; instead, it exists due to the explosion in the amount of data being created and consumed and a shift that sees this data deluge arrive at small startups and not just huge multinationals. At the same time, other trends have changed how software and systems are deployed, using cloud resources alongside or even in preference to more traditional infrastructures.
This chapter will explore some of these trends and explain in detail the specific problems Hadoop seeks to solve and the drivers that shaped its design.
In the rest of this chapter, we shall:
Learn about the big data revolution
Understand what Hadoop