Practical Data Science Cookbook Sample Chapter


Practical Data Science Cookbook

Tony Ojeda
Sean Patrick Murphy
Benjamin Bengfort
Abhijit Dasgupta







Chapter No. 1: "Preparing Your Data Science Environment"
In this package, you will find:
A biography of the authors of the book
A preview chapter from the book, Chapter No. 1, "Preparing Your Data Science Environment"
A synopsis of the book's content
Information on where to buy this book









About the Authors
Tony Ojeda is an accomplished data scientist and entrepreneur, with expertise
in business process optimization and over a decade of experience creating and
implementing innovative data products and solutions. He has a Master's degree in
Finance from Florida International University and an MBA with concentrations in
Strategy and Entrepreneurship from DePaul University. He is the founder of District Data Labs and a cofounder of Data Community DC, and he is actively involved in promoting data science education through both organizations.



First and foremost, I'd like to thank my coauthors for the tireless work they
put in to make this book something we can all be proud to say we wrote
together. I hope to work on many more projects and achieve many great
things with you in the future.
I'd like to thank our reviewers, specifically Will Voorhees and Sarah Kelley,
for reading every single chapter of the book and providing excellent feedback
on each one. This book owes much of its quality to their great advice
and suggestions.
I'd also like to thank my family and friends for their support
and encouragement in just about everything I do.
Last, but certainly not least, I'd like to thank my fiancée and partner in life, Nikki, for her patience, understanding, and willingness to stick with me throughout all my ambitious undertakings, this book being just one of them. I wouldn't dare take risks and experiment with nearly as many things professionally if my personal life were not the stable, loving, supportive environment she provides.

Sean Patrick Murphy spent 15 years as a senior scientist at The Johns Hopkins University Applied Physics Laboratory, where he focused on machine learning, modeling and simulation, signal processing, and high-performance computing in the cloud. Now, he acts as an advisor and data consultant for companies in SF, NY, and DC. He graduated from The Johns Hopkins University and earned his MBA from the University of Oxford. He currently co-organizes the Data Innovation DC meetup and cofounded the Data Science MD meetup. He is also a board member and cofounder of Data Community DC.




Benjamin Bengfort is an experienced data scientist and Python developer who has worked in the military, industry, and academia for the past 8 years. He is currently pursuing his PhD in Computer Science at the University of Maryland, College Park, doing research in Metacognition and Natural Language Processing. He holds a Master's degree in Computer Science from North Dakota State University, where he taught undergraduate Computer Science courses. He is also an adjunct faculty member at Georgetown University, where he teaches Data Science and Analytics. Benjamin has been involved in two data science start-ups in the DC region, leveraging large-scale machine learning and Big Data techniques across a variety of applications. He has a deep appreciation for the combination of models and data for entrepreneurial effect, and he is currently building one of these start-ups into a more mature organization.
I'd like to thank Will Voorhees for his tireless support in everything I've been
doing, even agreeing to review my technical writing. He made my chapters
understandable, and I'm thankful that he reads what I write. It's been essential
to my career and sanity to have a classmate, a colleague, and a friend like
him. I'd also like to thank my coauthors, Tony and Sean, for working their
butts off to make this book happen; it was a spectacular effort on their part.
I'd also like to thank Sarah Kelley for her input and fresh take on the
material; so far, she's gone on many adventures with us, and I'm looking
forward to the time when I get to review her books! Finally, I'd especially
like to thank my wife, Jaci, who puts up with a lot, especially when I bite off more than I can chew and end up working late into the night. Without her, I wouldn't be writing anything at all. She is an inspiration, and of the writers in my family, she is the one whom students will be reading, even a hundred years from now.

Abhijit Dasgupta is a data consultant working in the greater DC-Maryland-Virginia
area, with several years of experience in biomedical consulting, business analytics,
bioinformatics, and bioengineering consulting. He has a PhD in Biostatistics from the
University of Washington and over 40 collaborative peer-reviewed manuscripts, with
strong interests in bridging the statistics/machine-learning divide. He is always on the
lookout for interesting and challenging projects, and is an enthusiastic speaker and
discussant on new and better ways to look at and analyze data. He is a member of Data
Community DC and a founding member and co-organizer of Statistical Programming
DC (formerly, R Users DC).



Practical Data Science Cookbook
We live in the age of data. As increasing amounts of data are generated each year, the need to analyze and create value from this asset is more important than ever. Companies that know what to do with their data and how to do it well will have a competitive advantage over companies that don't. As a result, there will be increasing demand for people who possess both the analytical and technical abilities to extract valuable insights from data and the business acumen to create valuable and pragmatic solutions that put these insights to use.
This book provides multiple opportunities to learn how to create value from data through a variety of projects that span the spectrum of contemporary data science projects. Each chapter stands on its own, with step-by-step instructions that include screenshots, code snippets, and more detailed explanations where necessary, all with a focus on process and practical application.
The goal of this book is to introduce you to the data science pipeline, show you how it applies to a variety of different data science projects, and get you comfortable enough to apply it to future projects of your own. Along the way, you'll learn different analytical and programming lessons, and working through an actual project while learning will help cement these concepts and facilitate your understanding of them.
What This Book Covers
Chapter 1, Preparing Your Data Science Environment, introduces you to the data
science pipeline and helps you get your data science environment properly set up
with instructions for the Mac, Windows, and Linux operating systems.
Chapter 2, Driving Visual Analysis with Automobile Data (R), takes you through
the process of analyzing and visualizing automobile data to identify trends and
patterns in fuel efficiency over time.
Chapter 3, Simulating American Football Data (R), provides a fun and entertaining
project where you will analyze the relative offensive and defensive strengths of football
teams and simulate games, predicting which teams should win against other teams.
Chapter 4, Modeling Stock Market Data (R), shows you how to build your
own stock screener and use moving averages to analyze historical stock prices.
Chapter 5, Visually Exploring Employment Data (R), shows you how to obtain
employment and earnings data from the Bureau of Labor Statistics and conduct
geospatial analysis at different levels with R.




Chapter 6, Creating Application-oriented Analyses Using Tax Data (Python),
shows you how to use Python to transition your analyses from one-off, custom efforts
to reproducible and production-ready code using income distribution data as the base
for the project.
Chapter 7, Driving Visual Analyses with Automobile Data (Python), mirrors the
automobile data analyses and visualizations in Chapter 2, Driving Visual Analysis with
Automobile Data (R), but does so using the powerful programming language, Python.
Chapter 8, Working with Social Graphs (Python), shows you how to build, visualize,
and analyze a social network that consists of comic book character relationships.
Chapter 9, Recommending Movies at Scale (Python), walks you through building
a movie recommender system with Python.
Chapter 10, Harvesting and Geolocating Twitter Data (Python), shows you how to
connect to the Twitter API and plot the geographic information contained in profiles.
Chapter 11, Optimizing Numerical Code with NumPy and SciPy (Python), walks you
through how to optimize numerically intensive Python code to save you time and money
when dealing with large datasets.



1
Preparing Your Data Science Environment
In this chapter, we will cover the following:
Understanding the data science pipeline
Installing R on Windows, Mac OS X, and Linux
Installing libraries in R and RStudio
Installing Python on Linux and Mac OS X
Installing Python on Windows
Installing the Python data stack on Mac OS X and Linux
Installing extra Python packages
Installing and using virtualenv
Introduction
A traditional cookbook contains culinary recipes of interest to the authors and helps readers expand their repertoire of foods to prepare. Many might believe that the end product of a recipe is the dish itself, and one can read this book in much the same way. Every chapter guides the reader through the application of the stages of the data science pipeline to different datasets with various goals. Also, just as in cooking, the final product can simply be the analysis applied to a particular dataset.



We hope that you will take a broader view, however. Data scientists learn by doing, ensuring that every iteration and hypothesis improves the practitioner's knowledge base. By taking multiple datasets through the data science pipeline using two different programming languages (R and Python), we hope that you will start to abstract out the analysis patterns, see the bigger picture, and achieve a deeper understanding of this rather ambiguous field of data science.
We also want you to know that, unlike culinary recipes, data science recipes are ambiguous.
When chefs begin a particular dish, they have a very clear picture in mind of what the finished
product will look like. For data scientists, the situation is often different. One does not always
know what the dataset in question will look like, and what might or might not be possible,
given the amount of time and resources. Recipes are essentially a way to dig into the data and
get started on the path towards asking the right questions to complete the best dish possible.
If you are from a statistical or mathematical background, the modeling techniques on display might not excite you per se. Pay attention instead to how many of the recipes overcome practical issues in the data science pipeline, from loading large datasets and working with scalable tools to adapting known techniques to create data applications, interactive graphics, and web pages rather than reports and papers. We hope that these aspects will enhance your appreciation and understanding of data science and help you apply good data science to your domains.
Practicing data scientists require a great number and diversity of tools to get the job done.
Data practitioners scrape, clean, visualize, model, and perform a million different tasks with a
wide array of tools. If you ask most people working with data, you will learn that the foremost
component in this toolset is the language used to perform the analysis and modeling of the
data. Identifying the best programming language for a particular task is akin to asking which
world religion is correct, just with slightly less bloodshed.
In this book, we split our attention between two highly regarded, yet very different, languages
used for data analysis, R and Python, and leave it up to you to make your own decision as
to which language you prefer. We will help you by dropping hints along the way as to the
suitability of each language for various tasks, and we'll compare and contrast similar analyses
done on the same dataset with each language.
When you learn new concepts and techniques, there is always the question of depth versus breadth. Given a fixed amount of time and effort, should you work towards achieving moderate proficiency in both R and Python, or should you go all in on a single language? From our professional experiences, we strongly recommend that you aim to master one language and have awareness of the other. Does that mean skipping chapters on a particular language? Absolutely not! However, as you go through this book, pick one language and dig deeper, looking not only to develop conversational ability, but also fluency.
To prepare for this chapter, ensure that you have sufficient bandwidth to download up to several gigabytes of software in a reasonable amount of time.



Understanding the data science pipeline
Before we start installing any software, we need to understand the repeatable set of steps
that we will use for data analysis throughout the book.
How to do it...
The following five steps are key for data analysis:
1. Acquisition: The first step in the pipeline is to acquire the data from a variety of sources, including relational databases, NoSQL and document stores, web scraping, and distributed databases such as HDFS on a Hadoop platform, RESTful APIs, flat files, or, and hopefully this is not the case, PDFs.
2. Exploration and understanding: The second step is to come to an understanding of the data that you will use and how it was collected; this often requires significant exploration.
3. Munging, wrangling, and manipulation: This step is often the single most time-
consuming and important step in the pipeline. Data is almost never in the needed
form for the desired analysis.
4. Analysis and modeling: This is the fun part where the data scientist gets to explore
the statistical relationships between the variables in the data and pulls out his or her
bag of machine learning tricks to cluster, categorize, or classify the data and create
predictive models to see into the future.
5. Communicating and operationalizing: At the end of the pipeline, we need to give the
data back in a compelling form and structure, sometimes to ourselves to inform the
next iteration, and sometimes to a completely different audience. The data products
produced can be a simple one-off report or a scalable web product that will be used
interactively by millions.
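Taken together, these stages compose naturally. The following is a minimal sketch of the pipeline as code; the function names and toy data are purely our own illustration, not code from any recipe in this book:
from __future__ import print_function

def acquire():
    """Pull raw data from a source (database, API, flat file, and so on)."""
    return [1, 2, 3, None, 5]

def explore(data):
    """Summarize the data to understand what we actually have."""
    print("rows:", len(data), "missing:", data.count(None))
    return data

def wrangle(data):
    """Clean the data into the form the analysis needs."""
    return [x for x in data if x is not None]

def model(data):
    """Stand-in for the analysis and modeling step."""
    return sum(data) / float(len(data))

def communicate(result):
    """Report the result, to ourselves or to a wider audience."""
    print("mean value:", result)

communicate(model(wrangle(explore(acquire()))))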
How it works...
Although the preceding list is a numbered list, don't assume that every project will strictly
adhere to this exact linear sequence. In fact, agile data scientists know that this process is
highly iterative. Often, data exploration informs how the data must be cleaned, which then
enables more exploration and deeper understanding. Which of these steps comes first often
depends on your initial familiarity with the data. If you work with the systems producing and
capturing the data every day, the initial data exploration and understanding stage might be
quite short, unless something is wrong with the production system. Conversely, if you are
handed a dataset with no background details, the data exploration and understanding stage
might require quite some time (and numerous non-programming steps, such as talking with
the system developers).



The following diagram shows the data science pipeline:
As you probably have heard or read by now, data munging or wrangling can often consume 80
percent or more of project time and resources. In a perfect world, we would always be given
perfect data. Unfortunately, this is never the case, and the number of data problems that you
will see is virtually infinite. Sometimes, a data dictionary might change or might be missing, so understanding the field values is simply not possible. Some data fields may contain garbage or values that have been switched with another field. An update to the web app that
passed testing might cause a little bug that prevents data from being collected, causing a few
hundred thousand rows to go missing. If it can go wrong, it probably did at some point; the
data you analyze is the sum total of all of these mistakes.
The last step, communication and operationalization, is absolutely critical, but with intricacies
that are not often fully appreciated. Note that the last step in the pipeline is not entitled data
visualization and does not revolve around simply creating something pretty and/or compelling,
which is a complex topic in itself. Instead, data visualizations will become a piece of a larger
story that we will weave together from and with data. Some go even further and say that the
end result is always an argument as there is no point in undertaking all of this effort unless
you are trying to persuade someone or some group of a particular point.



Installing R on Windows, Mac OS X, and Linux
Straight from the R project, "R is a language and environment for statistical computing and graphics", and it has emerged as one of the de facto languages for statistical and data analysis. For us, it will be the default tool that we use in the first half of the book.
Getting ready
Make sure you have a good broadband connection to the Internet as you may have to
download up to 200 MB of software.
How to do it...
Installing R is easy; use the following steps:
1. Go to the Comprehensive R Archive Network (CRAN) and download the latest release of R for your particular operating system:
For Windows, go to http://cran.r-project.org/bin/windows/base/
For Linux, go to http://cran.us.r-project.org/bin/linux/
For Mac OS X, go to http://cran.us.r-project.org/bin/macosx/
As of February 2014, the latest release of R is Version 3.0.2 from September 2013.
2. Once downloaded, follow the excellent instructions provided by CRAN to install the
software on your respective platform. For both Windows and Mac, just double-click on
the downloaded install packages.



3. With R installed, go ahead and launch it. You should see a window similar to what is
shown in the following screenshot:
4. You can stop at just downloading R, but you will miss out on the excellent Integrated Development Environment (IDE) built for R, called RStudio. Visit http://www.rstudio.com/ide/download/ to download RStudio, and follow the online installation instructions.
5. Once installed, go ahead and run RStudio. The following screenshot shows one of our authors' customized RStudio configurations with the Console panel in the upper-left corner, the editor in the upper-right corner, the current variable list in the lower-left corner, and the current directory in the lower-right corner.



How it works...
R is an interpreted language that appeared in 1993 and is an implementation of the S statistical programming language that emerged from Bell Labs in the '70s (S-PLUS is a commercial implementation of S). R, sometimes referred to as GNU S due to its open source license, is a domain-specific language (DSL) focused on statistical analysis and visualization. While you can do many things with R that are not seemingly related directly to statistical analysis (including web scraping), it is still a domain-specific language and not intended for general-purpose usage.
R is also supported by CRAN, the Comprehensive R Archive Network (http://cran.r-project.org/). CRAN contains an accessible archive of previous versions of R, allowing analyses that depend on older versions of the software to be reproduced. Further, CRAN contains hundreds of freely downloadable software packages that greatly extend the capability of R. In fact, R has become the default development platform for multiple academic fields, including statistics, resulting in the latest and greatest statistical algorithms being implemented first in R.
RStudio (http://www.rstudio.com/) is available under the GNU Affero General Public
License v3 and is open source and free to use. RStudio, Inc., the company, offers additional
tools and services for R as well as commercial support.



See also
Refer to the Getting Started with R article at https://support.rstudio.com/hc/en-us/articles/201141096-Getting-Started-with-R
Visit the home page for RStudio at http://www.rstudio.com/
Refer to the Stages in the Evolution of S article at http://cm.bell-labs.com/cm/ms/departments/sia/S/history.html
Refer to the A Brief History of S PS file at http://cm.bell-labs.com/stat/doc/94.11.ps
Installing libraries in R and RStudio
R has an incredible number of libraries that add to its capabilities. In fact, R has become the default language for many college and university statistics departments across the country. Thus, R is often the language that will get the first implementation of newly developed statistical algorithms and techniques. Luckily, installing additional libraries is easy, as you will see in the following sections.
Getting ready
As long as you have R or RStudio installed, you should be ready to go.
How to do it...
R makes installing additional packages simple:
1. Launch the R interactive environment or, preferably, RStudio.
2. Let's install ggplot2. Type the following command, and then press the Enter key:
install.packages("ggplot2")
Note that for the remainder of the book, it is assumed that when we specify
entering a line of text, it is implicitly followed by hitting the Return or Enter
key on the keyboard.



3. You should now see text similar to the following as you scroll down the screen:
trying URL 'http://cran.rstudio.com/bin/macosx/contrib/3.0/
ggplot2_0.9.3.1.tgz'
Content type 'application/x-gzip' length 2650041 bytes (2.5
Mb)
opened URL
==================================================
downloaded 2.5 Mb
The downloaded binary packages are in
/var/folders/db/z54jmrxn4y9bjtv8zn_1zlb00000gn/T//Rtmpw0N1dA/
downloaded_packages
4. You might have noticed that you need to know the exact name, in this case,
ggplot2, of the package you wish to install. Visit http://cran.us.r-project.org/web/packages/available_packages_by_name.html to make sure you have the correct name.
5. RStudio provides a simpler mechanism to install packages. Open up RStudio if you
haven't already done so.
6. Go to Tools in the menu bar and select Install Packages. A new window will pop up, as shown in the following screenshot:
7. As soon as you start typing in the Packages field, RStudio will show you a list of possible packages. The autocomplete feature of this field simplifies the installation of libraries. Better yet, if there is a similarly named library that is related, or an earlier or newer version of the library with the same first few letters of the name, you will see it.



8. Let's install a few more packages that we highly recommend. At the R prompt, type
the following commands:
install.packages("lubridate")
install.packages("plyr")
install.packages("reshape2")
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
How it works...
Whether you use RStudio's graphical interface or the install.packages command, you do
the same thing. You tell R to search for the appropriate library built for your particular version
of R. When you issue the command, R reports back the URL of the location where it has found
a match for the library in CRAN and the location of the binary packages after download.
There's more...
R's community is one of its strengths, and we would be remiss if we didn't briefly mention two things. R-bloggers is a website that aggregates R-related news and tutorials from over 450 different blogs. If you have a few questions on R, this is a great place to look for more information. The Stack Overflow site (http://www.stackoverflow.com) is a great place to ask questions and find answers on R using the tag rstats.
Finally, as your prowess with R grows, you might consider building an R package that others
can use. Giving an in-depth tutorial on the library building process is beyond the scope of this
book, but keep in mind that community submissions form the heart of the R movement.
See also
Refer to the 10 R packages I wish I knew about earlier article at http://blog.yhathq.com/posts/10-R-packages-I-wish-I-knew-about-earlier.html
Visit the R-bloggers website at http://www.r-bloggers.com/
Refer to the Creating R Packages: A Tutorial at http://cran.r-project.org/doc/contrib/Leisch-CreatingPackages.pdf



Refer to the Top 100 R packages for 2013 (Jan-May)! article at http://www.r-bloggers.com/top-100-r-packages-for-2013-jan-may/
Visit the Learning R blog website at http://learnr.wordpress.com
Installing Python on Linux and Mac OS X
Luckily for us, Python comes preinstalled on most versions of Mac OS X and many flavors of Linux (both the latest versions of Ubuntu and Fedora come with Python 2.7 or later versions out of the box). Thus, we really don't have a lot to do for this recipe, except check whether everything is installed.
For this book, we will work with Python 2.7.x and not Version 3. Thus, if Python 3 is your
default installed Python, you will have to make sure to use Python 2.7.
Getting ready
Just make sure you have a good Internet connection, just in case we need to install anything.
How to do it...
Perform the following steps in the command prompt:
1. Open a new terminal window and type the following command:
which python
2. If you have Python installed, you should see something like this:
/usr/bin/python
3. Next, check which version you are running with the following command:
python --version
On my MacBook Air, I see the following:
Python 2.7.5
How it works...
If you are planning on using OS X, you might want to set up a separate Python distribution on your machine for a few reasons. First, each time Apple upgrades your OS, it can and will obliterate your installed Python packages, forcing a reinstall of all previously installed packages. Second, new versions of Python will be released more frequently than Apple will update the Python distribution included with OS X. Thus, if you want to stay on the bleeding edge of Python releases, it is best to install your own distribution. Finally, Apple's Python release is slightly different from the official Python release and is located in a nonstandard location on the hard drive.



There are a number of tutorials available online to help walk you through the installation and setup of a separate Python distribution on your Mac. We recommend the excellent guide available at http://docs.python-guide.org/en/latest/starting/install/osx/.
There's more...
One of the confusing aspects of Python is that the language is currently straddled between two versions. Python 3.0 is a fundamentally different version of the language that was released around the same time as Python 2.6. However, because Python is used in many operating systems (hence, it is installed by default on OS X and Linux), the Python Software Foundation decided on a gradual transition, continuing to maintain the 2.x line for backwards compatibility. Starting with Version 2.6, the Python 2.x versions have become increasingly like Version 3. The latest version is Python 3.4, and many expect a transition to happen around Python 3.5. Don't worry about learning the specific differences between Python 2.x and 3.x, although this book will focus primarily on the latest 2.x version. Further, we have ensured that the code in this book is portable between Python 2.x and 3.x with some minor differences.
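As a small illustration of the kind of portability we mean (this snippet is our own, not from a recipe), the __future__ module lets Python 2.7 code adopt Python 3 semantics:
from __future__ import division, print_function

# With these imports in Python 2.7 (and natively in Python 3),
# print is a function and / performs true division.
print(3 / 2)   # 1.5 in both versions
print(3 // 2)  # 1 in both versions (floor division)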
See also
Refer to the Python For Beginners guide at http://www.python.org/about/gettingstarted/
Refer to The Hitchhiker's Guide to Python at http://docs.python-guide.org/en/latest/
Refer to the Python Development Environment on Mac OS X Mavericks 10.9 article at http://hackercodex.com/guide/python-development-environment-on-mac-osx/
Installing Python on Windows
Installing Python on Windows systems is complicated, leaving you with three different options. First, you can choose to use the standard Windows release with the executable installer from Python.org, available at http://www.python.org/download/releases/. The potential problem with this route is that the directory structure, and therefore the paths for configuration and settings, will be different from the standard Python installation. As a result, each Python package that was installed (and there will be many) might have path problems. Further, most tutorials and answers online won't apply to a Windows environment, and you will be left to your own devices to figure out problems. We have witnessed countless tutorial-ending problems for students who install Python on Windows in this way. Unless you are an expert, we recommend that you do not choose this option.



The second option is to install a prebundled Python distribution that contains all scientific, numeric, and data-related packages in a single install. There are two suitable bundles, one from Enthought and another from Continuum Analytics. Enthought offers the Canopy distribution of Python 2.7.6 in both 32- and 64-bit versions for Windows. The free version of the software, Canopy Express, comes with more than 50 Python packages preconfigured so that they work straight out of the box, including pandas, NumPy, SciPy, IPython, and matplotlib, which should be sufficient for the purposes of this book. Canopy Express also comes with its own IDE reminiscent of MATLAB or RStudio.
Continuum Analytics offers Anaconda, a completely free (even for commercial work) distribution of Python 2.6, 2.7, and 3.3, which contains over 100 Python packages for science, math, engineering, and data analysis. Anaconda contains NumPy, SciPy, pandas, IPython, matplotlib, and much more, and it should be more than sufficient for the work that we will do in this book.
The third option, and the best one for purists, is to run a virtual Linux machine within Windows using the free VirtualBox (https://www.virtualbox.org/wiki/Downloads) from Oracle. This will allow you to run Python in whatever version of Linux you prefer. The downsides to this approach are that virtual machines tend to run a bit slower than native software, and you will have to get used to navigating via the Linux command line, a skill that any practicing data scientist should have.
How to do it...
Perform the following steps to install Python using VirtualBox:
1. If you choose to run Python in a virtual Linux machine, visit https://www.virtualbox.org/wiki/Downloads to download VirtualBox from Oracle Software for free.
2. Follow the detailed install instructions for Windows at https://www.virtualbox.org/manual/ch01.html#intro-installing.
3. Continue with the instructions and walk through the sections entitled 1.6 Starting VirtualBox, 1.7 Creating your first virtual machine, and 1.8 Running your virtual machine.
4. Once your virtual machine is running, head over to the Installing Python on Linux and Mac OS X recipe.



If you want to install Continuum Analytics' Anaconda distribution locally instead, follow
these steps:
1. If you choose to install Continuum Analytics' Anaconda distribution, go to http://continuum.io/downloads and select either the 64- or 32-bit version of the software (the 64-bit version is preferable) under Windows installers.
2. Follow the detailed install instructions for Windows at http://docs.continuum.io/anaconda/install.html.
How it works...
For many readers, choosing between a prepackaged Python distribution and running a virtual
machine might be easy based on their experience. If you are wrestling with this decision, keep
reading. If you come from a Windows-only background and/or don't have much experience
with a *nix command line, the virtual machine-based route will be challenging and will force
you to expand your skill set greatly. This takes effort and a significant amount of tenacity,
both useful for data science in general (trust us on this one). If you have the time and/or
knowledge, running everything in a virtual machine will move you further down the path to
becoming a data scientist and, most likely, make your code easier to deploy in production
environments. If not, you can choose the backup plan and use the Anaconda distribution,
as many people choose to do.
For the remainder of this book, we will always include Linux/Mac OS X-oriented Python
package install instructions first and supplementary Anaconda install instructions second.
Thus, for Windows users, we will assume you have either gone the route of the Linux virtual
machine or used the Anaconda distribution. If you choose to go down another path, we
applaud your sense of adventure and wish you the best of luck! Let Google be with you.
See also
Refer to the Anaconda web page at https://store.continuum.io/cshop/anaconda/
Visit the Enthought Canopy Express web page at https://www.enthought.com/canopy-express/
Visit the VirtualBox website at https://www.virtualbox.org/
Find various installers of Python packages for Windows at http://www.lfd.uci.edu/~gohlke/pythonlibs



Installing the Python data stack on Mac OS X and Linux
While Python is often said to have "batteries included", there are a few key libraries that
really take Python's ability to work with data to another level. In this recipe, we will install
what is sometimes called the SciPy stack, which includes NumPy, SciPy, pandas, matplotlib,
and IPython.
Getting ready
This recipe assumes that you have a standard Python installed.
If, in the previous section, you decided to install the Anaconda distribution
(or another distribution of Python with the needed libraries included), you
can skip this recipe.
To check whether you have a particular Python package installed, start up your Python
interpreter and try to import the package. If successful, the package is available on your
machine. Also, you will probably need root access to your machine via the sudo command.
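For example, here is a quick sketch of that check, using NumPy as the example package (substitute any package name you want to verify):
try:
    import numpy
    print("numpy is installed")
except ImportError:
    print("numpy is not installed")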
How to do it...
The following steps will allow you to install the Python data stack on Linux:
1. When installing this stack on Linux, you must know which distribution of Linux you are using. The flavor of Linux usually determines the package management system that you will be using, and the options include apt-get, yum, and rpm.
2. Open your browser and navigate to http://www.scipy.org/install.html, which contains detailed instructions for most platforms. These instructions may change and should supersede the instructions offered here, if different.
3. Open up a shell.
4. If you are using Ubuntu or Debian, type the following:
sudo apt-get install build-essential python-dev python-setuptools python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose



5. If you are using Fedora, type the following:
sudo yum install numpy scipy python-matplotlib ipython python-pandas sympy python-nose
You have several options to install the Python data stack on your Macintosh running OS X. These are:
The first option is to download prebuilt installers (.dmg) for each tool and install them as you would any other Mac application (this is recommended).
The second option applies if you have MacPorts, a command line-based system to install software, available on your system. You will also likely need XCode with the command-line tools already installed. If so, you can enter:
sudo port install py27-numpy py27-scipy py27-matplotlib py27-ipython +notebook py27-pandas py27-sympy py27-nose
As the third option, Chris Fonnesbeck provides a bundled way to install the stack on the Mac that is tested and covers all the packages we will use here. Refer to http://fonnesbeck.github.io/ScipySuperpack.
All the preceding options will take time as a large number of files will be installed on your system.
How it works...
Installing the SciPy stack has been challenging historically due to compilation dependencies,
including the need for Fortran. Thus, we don't recommend that you compile and install from
source code, unless you feel comfortable doing such things.
Now, the better question is, what did you just install? We installed the latest versions of
NumPy, SciPy, matplotlib, IPython, IPython Notebook, pandas, SymPy, and nose. The following
are their descriptions:
SciPy: This is a Python-based ecosystem of open source software for mathematics, science, and engineering and includes a number of useful libraries for machine learning, scientific computing, and modeling.
NumPy: This is the foundational Python package providing numerical computation in Python, which is C-like and incredibly fast, particularly when using multidimensional arrays and linear algebra operations. NumPy is the reason that Python can do efficient, large-scale numerical computation that other interpreted or scripting languages cannot do.
matplotlib: This is a well-established and extensive 2D plotting library for Python that
will be familiar to MATLAB users.



IPython: This offers a rich and powerful interactive shell for Python. It is a
replacement for the standard Python Read-Eval-Print Loop (REPL), among many
other tools.
IPython Notebook: This offers a browser-based tool to perform and record work done
in Python with support for code, formatted text, markdown, graphs, images, sounds,
movies, and mathematical expressions.
pandas: This provides a robust data frame object and many additional tools to make
traditional data and statistical analysis fast and easy.
nose: This is a test harness that extends the unit testing framework in the Python
standard library.
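Once the installs finish, a quick sanity check like the following (our own sketch, not a recipe from the book) confirms that the core stack imports and reports each installed version:
from __future__ import print_function

# Import the core data stack and print the version of each package.
import numpy
import scipy
import matplotlib
import pandas

for module in (numpy, scipy, matplotlib, pandas):
    print(module.__name__, module.__version__)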
There's more...
We will discuss the various packages in greater detail in the chapter in which they are
introduced. However, we would be remiss if we did not at least mention the Python IDEs. In
general, we recommend using your favorite programming text editor in place of a full-blown
Python IDE. This can include the open source Atom from GitHub, the excellent Sublime Text
editor, or TextMate, a favorite of the Ruby crowd. Vim and Emacs are both excellent choices
not only because of their incredible power but also because they can easily be used to edit
files on a remote server, a common task for the data scientist. Each of these editors is highly configurable with plugins that can handle code completion, highlighting, linting, and more. If you must have an IDE, take a look at PyCharm (the community edition is free) from the IDE wizards at JetBrains, Spyder, and Ninja-IDE. You will find that most Python IDEs are better
suited for web development as opposed to data work.
See also
For more information on pandas, refer to the Python Data Analysis Library article at http://pandas.pydata.org/
Visit the NumPy website at http://www.numpy.org/
Visit the SciPy website at http://www.scipy.org/
Visit the matplotlib website at http://matplotlib.org/
Visit the IPython website at http://ipython.org/
Refer to the History of SciPy article at http://wiki.scipy.org/History_of_SciPy
Visit the MacPorts home page at http://www.macports.org/
Visit the XCode web page at https://developer.apple.com/xcode/features/
Visit the XCode download page at https://developer.apple.com/xcode/downloads/



Installing extra Python packages
There are a few additional Python libraries that you will need throughout this book. Just as R
provides a central repository for community-built packages, so does Python in the form of the
Python Package Index (PyPI). As of August 28, 2014, there were 48,054 packages in PyPI.
Getting ready
A reasonable Internet connection is all that is needed for this recipe. Unless otherwise specified, these directions assume that you are using the default Python distribution that came with your system, and not Anaconda.
How to do it...
The following steps will show you how to download a Python package and install it from the
command line:
1. Download the source code for the package in the place you like to keep
your downloads.
2. Unzip the package.
3. Open a terminal window.
4. Navigate to the base directory of the source code.
5. Type in the following command:
python setup.py install
6. If you need root access, type in the following command:
sudo python setup.py install
To use pip, the contemporary and easiest way to install Python packages, follow these steps:
1. First, let's check whether you have pip already installed by opening a terminal and launching the Python interpreter. At the interpreter, type:
>>> import pip
2. If you don't get an error, you have pip installed and can move on to step 5. If you see an error, let's quickly install pip.
3. Download the get-pip.py file from https://raw.github.com/pypa/pip/master/contrib/get-pip.py onto your machine.



4. Open a terminal window, navigate to the downloaded file, and type:
python get-pip.py
Alternatively, you can type in the following command:
sudo python get-pip.py
5. Once pip is installed, make sure you are at the system command prompt.
6. If you are using the default system distribution of Python, type in the following:
pip install networkx
Alternatively, you can type in the following command:
sudo pip install networkx
7. If you are using the Anaconda distribution, type in the following command:
conda install networkx
8. Now, let's try to install another package, ggplot. Regardless of your distribution,
type in the following command:
pip install ggplot
Alternatively, you can type in the following command:
sudo pip install ggplot
How it works...
You have at least two options to install Python packages. In the preceding "old fashioned"
way, you download the source code and unpack it on your local computer. Next, you run the
included setup.py script with the install flag. If you want, you can open the setup.py
script in a text editor and take a more detailed look at exactly what the script is doing. You
might need the sudo command, depending on the current user's system privileges.
As the second option, we leverage the pip installer, which automatically grabs the package
from the remote repository and installs it to your local machine for use by the system-level
Python installation. This is the preferred method, when available.
There's more...
pip is a capable tool, so we suggest taking a look at its user guide online. Pay special attention to the very useful pip freeze > requirements.txt functionality so that you can communicate about external dependencies with your colleagues.



Finally, conda is the package manager and pip replacement for the Anaconda Python
distribution or, in the words of its home page, "a cross-platform, Python-agnostic binary
package manager". Conda has some very lofty aspirations that transcend the Python
language. If you are using Anaconda, we encourage you to read further on what conda
can do and use it, and not pip, as your default package manager.
See also
Refer to the pip User Guide at http://www.pip-installer.org/en/latest/user_guide.html
Visit the Conda home page at http://conda.pydata.org
Refer to the Conda blog posts from Continuum Blog at http://www.continuum.io/blog/conda
Installing and using virtualenv
virtualenv is a transformative Python tool. Once you start using it, you will never look back.
virtualenv creates a local environment with its own Python distribution installed. Once this
environment is activated from the shell, you can easily install packages using pip install
into the new local Python.
At first, this might sound strange. Why would anyone want to do this? Not only does this help you handle the issue of package dependencies and versions in Python, but it also allows you to experiment rapidly without breaking anything important. Imagine that you build a web application that requires Version 0.8 of the awesome_template library, but then your new data product needs the awesome_template library Version 1.2. What do you do? With virtualenv, you can have both.
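As a concrete sketch (awesome_template is the hypothetical library from the example above), each project simply gets its own environment with its own pinned version:
virtualenv webapp_env
source webapp_env/bin/activate
pip install "awesome_template==0.8"
deactivate

virtualenv dataproduct_env
source dataproduct_env/bin/activate
pip install "awesome_template==1.2"
deactivate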
As another use case, what happens if you don't have admin privileges on a particular machine? You can't use sudo pip install to install the packages required for your analysis, so what do you do? If you use virtualenv, it doesn't matter.
Virtual environments are development tools that software developers use to collaborate
effectively. Environments ensure that the software runs on different computers (for example,
from production to development servers) with varying dependencies. The environment
also alerts other developers to the needs of the software under development. Python's
virtualenv ensures that the software created is in its own holistic environment, can be tested
independently, and built collaboratively.
Getting ready
Assuming you have completed the previous recipe, you are ready to go for this one.



How to do it...
Install and test the virtual environment using the following steps:
1. Open a command-line shell and type in the following command:
pip install virtualenv
Alternatively, you can type in the following command:
sudo pip install virtualenv
2. Once installed, type virtualenv in the command window, and you should be
greeted with the information shown in the following screenshot:
3. Create a temporary directory and change location to this directory using the following
commands:
mkdir temp
cd temp



4. From within the directory, create the first virtual environment named venv:
virtualenv venv
5. You should see text similar to the following:
New python executable in venv/bin/python
Installing setuptools, pip...done.
6. The new local Python distribution is now available. To use it, we need to activate venv
using the following command:
source ./venv/bin/activate
7. The activate script is not executable and must be run using the source command. Also, note that your shell's command prompt has probably changed and is prefixed with venv to indicate that you are now working in your new virtual environment.
8. To check this fact, use which to see the location of Python, as follows:
which python
You should see the following output:
/path/to/your/temp/venv/bin/python
So, when you type python once your virtual environment is activated, you will run the
local Python.
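As an additional, optional check (our own suggestion), you can ask the interpreter itself where it lives:
python -c "import sys; print(sys.prefix)"
With the environment active, this should print a path ending in temp/venv rather than the system prefix.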
9. Next, install something by typing the following:
pip install flask
Flask is a micro-web framework written in Python; the preceding command will install
a number of packages that Flask uses.
10. Finally, we demonstrate the versioning power that virtual environment and pip offer,
as follows:
pip freeze > requirements.txt
cat requirements.txt
This should produce the following output:
Flask==0.10.1
Jinja2==2.7.2
MarkupSafe==0.19
Werkzeug==0.9.4
itsdangerous==0.23
wsgiref==0.1.2



11. Note that not only the name of each package is captured, but also the exact version
number. The beauty of this requirements.txt file is that if we have a new virtual environment, we can simply issue the following command to install each of the specified versions of the listed Python packages:
pip install -r requirements.txt
12. To deactivate your virtual environment, simply type the following at the shell prompt:
deactivate
How it works...
virtualenv creates its own virtual environment with its own installation directories that operate
independently from the default system environment. This allows you to try out new libraries
without polluting your system-level Python distribution. Further, if you have an application that
just works and want to leave it alone, you can do so by making sure the application has its
own virtualenv.
There's more...
virtualenv is a fantastic tool, one that will prove invaluable to any Python programmer.
However, we wish to offer a note of caution. Python provides many tools that connect to
C-shared objects in order to improve performance. Therefore, installing certain Python
packages, such as NumPy and SciPy, into your virtual environment may require external
dependencies to be compiled and installed, which are system specific. Even when successful, these compilations can be tedious, which is one of the reasons for maintaining a virtual environment. Worse, missing dependencies will cause compilations to fail, producing errors that require you to troubleshoot alien error messages, dated makefiles, and complex
dependency chains. This can be daunting to even the most veteran data scientist.
A quick solution is to use a package manager to install complex libraries into the system
environment (aptitude or Yum for Linux, Homebrew or MacPorts for OS X, and Windows will
generally already have compiled installers). These tools use precompiled forms of the third-
party packages. Once you have these Python packages installed in your system environment,
you can use the --system-site-packages flag when initializing a virtualenv. This flag tells the virtualenv tool to use the system site packages already installed and circumvents the need for an additional installation that will require compilation. In order to nominate packages particular to your environment that might already be in the system (for example, when you wish to use a newer version of a package), use pip install -I to install dependencies into virtualenv and ignore the global packages. This technique works best when you only install large-scale packages on your system, but use virtualenv for other types of development.
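A brief sketch of that workflow (pandas here is just an example of a heavy compiled package; adjust to your own setup):
# System-wide install of the hard-to-compile package via your OS package
# manager, for example on Ubuntu:
sudo apt-get install python-pandas

# Create a virtualenv that can see the system site packages
virtualenv --system-site-packages venv
source venv/bin/activate

# Force a newer, environment-local copy of one package, ignoring the
# system version
pip install -I pandas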



For the rest of the book, we will assume that you are using a virtualenv and have the tools
mentioned in this chapter ready to go. Therefore, we won't enforce or discuss the use of
virtual environments in much detail. Just consider the virtual environment as a safety net
that will allow you to perform the recipes listed in this book in isolation.
See also
Read an introduction to virtualenv at http://www.virtualenv.org/en/latest/virtualenv.html
Explore virtualenvwrapper at http://virtualenvwrapper.readthedocs.org/en/latest/
Explore virtualenv at https://pypi.python.org/pypi/virtualenv



Where to buy this book
You can buy Practical Data Science Cookbook from the Packt Publishing website:
https://www.packtpub.com/big-data-and-business-intelligence/practical-data-science-cookbook.
Free shipping to the US, UK, Europe and selected Asian countries. For more information, please
read our shipping policy.
Alternatively, you can buy the book from Amazon, BN.com, Computer Manuals and
most internet book retailers.

















www.PacktPub.com




