Programming Skills For Data Science - Start Writing Code To Wrangle, Analyze, and Visualize Data With R
The series aims to tie all three of these areas together to help the reader build
end-to-end systems for fighting spam; making recommendations; building
personalization; detecting trends, patterns, or problems; and gaining insight
from the data exhaust of systems and user interactions.
Michael Freeman
Joel Ross
The authors and publisher have taken care in the preparation of this book, but make no expressed or implied
warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for
incidental or consequential damages in connection with or arising out of the use of the information or
programs contained herein.
For information about buying this title in bulk quantities, or for special sales opportunities (which may include
electronic versions; custom cover designs; and content particular to your business, training goals, marketing
focus, or branding interests), please contact our corporate sales department
at corpsales@pearsoned.com or (800) 382-3419.
For questions about sales outside the U.S., please contact intlcs@pearson.com.
All rights reserved. This publication is protected by copyright, and permission must be obtained from the
publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by
any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding
permissions, request forms and the appropriate contacts within the Pearson Education Global Rights &
Permissions Department, please visit www.pearsoned.com/permissions/.
ISBN-13: 978-0-13-513310-1
ISBN-10: 0-13-513310-6
Contents
Foreword xi
Preface xiii
Acknowledgments xvii
About the Authors xix
I: Getting Started 1
5 Introduction to R 53
5.1 Programming with R 53
5.2 Running R Code 54
5.3 Including Comments 58
5.4 Defining Variables 58
5.5 Getting Help 63
6 Functions 69
6.1 What Is a Function? 69
6.2 Built-in R Functions 71
6.3 Loading Functions 73
6.4 Writing Functions 75
6.5 Using Conditional Statements 79
7 Vectors 81
7.1 What Is a Vector? 81
7.2 Vectorized Operations 83
7.3 Vector Indices 88
7.4 Vector Filtering 90
7.5 Modifying Vectors 92
8 Lists 95
8.1 What Is a List? 95
8.2 Creating Lists 96
8.3 Accessing List Elements 97
8.4 Modifying Lists 100
8.5 Applying Functions to Lists with lapply() 102
Index 345
Foreword
The data science skill set is ever-expanding to include more and more of the analytics pipeline. In
addition to fitting statistical and machine learning models, data scientists are expected to ingest
data from different file formats, interact with APIs, work at the command line, manipulate data,
create plots, build dashboards, and track all their work in git. By combining all of these
components, data scientists can produce amazing results. In this text, Michael Freeman and Joel
Ross have created the definitive resource for new and aspiring data scientists to learn foundational
programming skills.
Michael and Joel are best known for leveraging visualization and front-end interfaces to compose
explanations of complex data science topics. In addition to their written work, they have created
interactive explanations of statistical methods, including a particularly clarifying and captivating
introduction to hierarchical modeling. It is this sensibility and deep commitment to demystifying
complicated topics that they bring to their new book, which teaches a plethora of data science
skills.
This tour of data science begins with setting up the local computing environment: text editors,
RStudio, the command line, and git. This lays a solid foundation—that is far too often glossed
over—making it easier to learn core data skills. After this, those core skills are given attention,
including data manipulation, visualization, reporting, and an excellent explanation of APIs. They
even show how to use git collaboratively, something data scientists all too often neglect to integrate
into their projects.
Programming Skills for Data Science lives up to its name in teaching the foundational skills needed to
get started in data science. This book provides valuable insights for both beginners and those with
more experience who may be missing some key knowledge. Michael and Joel made full use of their
years of teaching experience to craft an engrossing tutorial.
—Jared Lander, series editor
Preface
Transforming data into actionable information requires the ability to clearly and reproducibly
wrangle, analyze, and visualize that data. These skills are the foundations of data science, a field that
has amplified our collective understanding of issues ranging from disease transmission to racial
inequities. Moreover, the ability to programmatically interact with data enables researchers and
professionals to quickly discover and communicate patterns in data that are often difficult to
detect. Understanding how to write code to work with data allows people to engage with
information in new ways and on larger scales.
The existence of free and open source software has made these tools accessible to anyone with
access to a computer. The purpose of this book is to teach people how to leverage programming to
ask questions of their data sets.
If you are interested in pursuing a career in data science, or if you use data on a regular basis and
want to use programming techniques to gain information from that data, then this text is for you.
Book Structure
The book is divided into six sections, each of which is summarized here.
This section walks through the steps of downloading and installing necessary software for the rest
of the book. More specifically, Chapter 1 details how to install a text editor, Bash terminal, the R
interpreter, and the RStudio program. Then, Chapter 2 describes how to use the command line for
basic file system navigation.
This section walks through the technical basis of project management, including keeping track of
the version of your code and producing documentation. Chapter 3 introduces the git software to
track line-by-line code changes, as well as the corresponding popular code hosting and
collaboration service GitHub. Chapter 4 then describes how to use Markdown to produce the
well-structured and -styled documentation needed for sharing and presenting data.
This section introduces the R programming language, the primary language used throughout the
book. In doing so, it introduces the basic syntax of the language (Chapter 5), describes
fundamental programming concepts such as functions (Chapter 6), and introduces the basic data
structures of the language: vectors (Chapter 7) and lists (Chapter 8).
Because the most time-consuming part of data science is often loading, formatting, exploring, and
reshaping data, this section of the book provides a deep dive into the best ways to wrangle data in R.
After introducing techniques and concepts for understanding the structure of real-world data
(Chapter 9), the book presents the data structure most commonly used for managing data in R: the
data frame (Chapter 10). To better support working with this data, the book then describes
two packages for programmatically interacting with the data: dplyr (Chapter 11) and
tidyr (Chapter 12). The last two chapters of the section describe how to load data from
databases (Chapter 13) and web-based data services with application programming interfaces
(APIs) (Chapter 14).
This section of the book focuses on the conceptual and technical skills necessary to design and
build visualizations as part of the data science process. It begins with an overview of data
visualization principles (Chapter 15) to guide your choices in designing visualizations. Chapter 16
then describes in granular detail how to use the ggplot2 visualization package in R. Finally,
Chapter 17 explores the use of three additional R packages for producing engaging interactive
visualizations.
As in any domain, data science insights are valuable only if they can be shared with and understood
by others. The final section of the book focuses on using two different approaches to creating
interactive platforms to share your insights (directly from your R program!). Chapter 18 uses the R
Markdown framework to transform analyses into sharable documents and websites. Chapter 19
takes this a step further with the Shiny framework, which allows you to create interactive web
applications using R. Chapter 20 then describes approaches for working on collaborative teams of
data scientists, and Chapter 21 details how you can further your education beyond this book.
Book Conventions
Throughout the book, you will see computer code appear inline with the text, as well as in distinct
blocks. When code appears inline, it will appear in monospace font. A distinct code block looks
like this (here, a short command line example of the kind used throughout the book):
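# Change the working directory to a folder of your choosing
cd FOLDER_NAME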
The text in the code blocks is colored to reflect the syntax of the programming language used
(typically the R language). Example code blocks often include values that you need to replace.
These replacement values appear in UPPER_CASE_FONT, with words separated by underscores. For
example, if you need to work with a folder of your choosing, you would put the name of your folder
where it says FOLDER_NAME in the code. Code sections will all include comments: in programming,
comments are bits of text that are not interpreted as computer instructions—they aren’t code,
they’re just notes about the code! While a computer is able to understand the code, comments are
there to help people understand it. Tips for writing your own descriptive comments are discussed in
Chapter 5.
To guide your reading, we also include five types of special callout notes:
Tip: These boxes provide best practices and shortcuts that can make your life easier.
Fun Fact: These boxes share interesting asides and background related to the topic at hand.
Remember: These boxes reinforce key points that are important to keep in mind.
Caution: These boxes describe common mistakes and explain how to avoid them.
Going Further: These boxes suggest resources for expanding your knowledge beyond this text.
Throughout the text there are instructions for using specific keyboard keys. These are included in
the text in lowercase monospace font. When multiple keys need to be pressed at the same time,
they are separated by a plus sign (+). For example, if you needed to press the Command and “c” keys
at the same time, it would appear as cmd+c.
Whenever the cmd key is used, Windows users should instead use the Control (ctrl) key.
This book includes a large number of code examples and demonstrations, with reported output and
results. That said, the best way to learn to program is to do it, so we highly recommend that as you
read, you type out the code examples and try them yourself! Experiment with different options and
variations—if you’re wondering how something works or if an option is supported, the best thing
to do is try it yourself. This will help you not only practice the actual writing of code, but also better
develop your own mental model of how data science programs work.
Many chapters conclude by applying the described techniques to a real data set in an In Action
section. These sections take a data-driven approach to understanding issues such as gentrification,
investment in education, and variation in life expectancy around the world. They offer hands-on practice with new skills, and all of the code is available online.1
As you move through each chapter, you may want to complete the accompanying set of online
exercises.2 This will help you practice new techniques and ensure your understanding of the
material. Solutions to the exercises are also available online.
Finally, you should know that this text does not aim to be comprehensive. It is both impractical
and detrimental to learning to attempt to explain every nuance and option in the R language and
ecosystem (particularly to people who are just starting out). While we discuss a large number of
popular tools and packages, the book cannot explain all possible options that exist now or will be
created in the future. Instead, this text aims to provide a primer on each topic—giving you enough
details to understand the basics and to get up and running with a particular data science
programming task. Beyond those basics, we provide copious links and references to further
resources where you can explore more and dive deeper into topics that are relevant or of interest to
you. This book will provide the foundations of using R for data science—it is up to each reader to
apply and build upon those skills.
Accompanying Code
To guide your learning, a set of online exercises (and their solutions) is available for each chapter.
The complete analysis code for all seven In Action sections is also provided. See the book website3 for
details.
Register your copy of Programming Skills for Data Science on the InformIT site for convenient
access to updates and/or corrections as they become available. To start the registration pro-
cess, go to informit.com/register and log in or create an account. Enter the product ISBN
(9780135133101) and click Submit. Look on the Registered Products tab for an Access Bonus
Content link next to this product, and follow that link to access any available bonus materials.
If you would like to be notified of exclusive offers on new editions and updates, please check the
box to receive email from us.
1. In-Action Code: https://github.com/programming-for-data-science/in-action
2. Book Exercises: https://github.com/programming-for-data-science
3. https://programming-for-data-science.github.io
Acknowledgments
We would like to thank the University of Washington Information School for providing us with an
environment in which to collaborate and develop these materials. We had the support of many
faculty members—in particular, David Stearns (who contributed to the materials on version
control) as well as Jessica Hullman and Ott Toomet (who provided initial feedback on the text). We
also thank Kevin Hodges, Jason Baik, and Jared Lander for their comments and insights, as well as
Debra Williams Cauley, Julie Nahil, Rachel Paul, Jill Hobbs, and the staff at Pearson for their work
bringing this book to press.
Finally, this book would not have been possible without the extraordinary open source community
around the R programming language.
About the Authors
Michael Freeman is a Senior Lecturer at the University of Washington Information School,
where he teaches courses in data science, interactive data visualization, and web development.
Prior to his teaching career, he worked as a data visualization specialist and research fellow at the
Institute for Health Metrics and Evaluation. There, he performed quantitative global health
research and built a variety of interactive visualization systems to help researchers and the public
explore global health trends.
Michael is interested in applications of data science to social justice, and holds a Master’s
in Public Health from the University of Washington. (His faculty page is at
https://faculty.washington.edu/mikefree/.)
Joel Ross is a Senior Lecturer at the University of Washington Information School, where he
teaches courses in web development, mobile application development, software architecture, and
introductory programming. While his primary focus is on teaching, his research interests include
games and gamification, pervasive systems, computer science education, and social computing. He
has also done research on crowdsourcing systems, human computation, and encouraging
environmental sustainability.
Joel earned his M.S. and Ph.D. in Information and Computer Sciences from the University of
California, Irvine. (His faculty page is at https://faculty.washington.edu/joelross/.)
I
Getting Started
The first part of this book is designed to help you install necessary software for doing data science
(Chapter 1), and to introduce you to the syntax needed to provide text-based instructions to your
computer using the command line (Chapter 2). Note that all of the software that you will
download is free, and instructions are included for both Mac and Windows operating systems.
1
Setting Up Your Computer
In order to write code to work with data, you will need to use a number of different (free) software
programs for writing, executing, and managing your code. This chapter details which software you
will need and explains how to install those programs. While there are a variety of options for each
task, we discuss software programs that are largely supported within the data science community,
and whose popularity continues to grow.
It is an unfortunate reality that one of the most frustrating and confusing barriers to working with
code is getting your machine properly set up. This chapter aims to provide sufficient information
for setting up your machine and troubleshooting the installation process.
In short, you will need to install the following programs, each of which is described in detail in the
following sections.
There are two different programs that we suggest you use for writing code:
- RStudio: An integrated development environment (IDE) for writing and executing R code. This
will be your primary work environment for doing data science. You will also need to install
the R software so that RStudio will be able to execute your code (discussed later in this
section).
- Atom: A lightweight text editor that supports programming in lots of different languages.
(Other text editors will also work effectively; some further suggestions are included in this
chapter.)
To manage your code, you will need to install and set up the following programs:
- git: An application used to track changes to your files (namely, your code). This is crucial for
maintaining an organized project, and can help facilitate collaboration with other
developers. This program is already installed on Macs.
- GitHub: A web service for hosting code online. You don’t actually need to install anything
(GitHub uses git), but you will need to create a free account on the GitHub website. The
corresponding exercises for this book are hosted on GitHub.
To provide instructions to your machine (i.e., run code), you will need to have an environment in
which to provide those instructions, while also ensuring that your machine is able to understand
the language in which you’re writing your code.
- Bash shell: A command line interface for controlling your computer. This will provide you
with a text-based interface you can use to work with your machine. Macs already have a Bash
shell program called Terminal, which you can use “out of the box.” On Windows, installing
git will also install an application called Git Bash, which you can use as your Bash shell.
- R: A programming language commonly used for data science. This is the primary
programming language used throughout this book. “Installing R” actually means
downloading and installing tools that will let your computer understand and run R code.
The remainder of this chapter has additional information about the purpose of each software
system, how to install it, and alternative configurations or options. The programs are described in
the order they are introduced in the book (though in many cases, the software programs are used in
tandem).
To use the command line, you will need to use a command shell (also called a command prompt or
terminal). This computer program provides the interface in which you type commands. In
particular, this book discusses the Bash shell, which provides a particular set of commands
common to Mac and Linux machines.
Alternatively, the 64-bit Windows 10 Anniversary Update (August 2016) includes an integrated Bash shell. You can access it by enabling the Windows Subsystem for Linux1 and then running bash in the command prompt.
Caution: Windows includes its own command shell, called Command Prompt (previously
DOS Prompt), but it has a different set of commands and features. If you try to use the com-
mands described in Chapter 2 with DOS Prompt, they will not work. For a more advanced
Windows Management Framework, you can look into using PowerShell.a Because Bash is
more common in open source programming like in this book, we will focus on that set of
commands.
a. https://docs.microsoft.com/en-us/powershell/scripting/getting-started/getting-started-with-windows-powershell
git comes pre-installed on Macs, though it is possible that the first time you try to use the tool you
will be prompted to install the Xcode command line developer tools via a dialog box. You can choose to
install these tools, or download the latest version of git online.
On Windows, you will need to download2 the git software. Once you have downloaded the
installer, double-click on your downloaded file, and follow the instructions to complete
installation. This will also install a program called Git Bash, which provides a command line
(text-based) interface for executing commands on your computer. See Section 1.1.2 for alternative
and additional Windows command line tools.
On Linux, you can install git using apt-get or a similar command. For more information, see the
download page for Linux.3
1. Install the Windows Subsystem for Linux: https://msdn.microsoft.com/en-us/commandline/wsl/install_guide
2. git downloads: https://git-scm.com/downloads
3. git download for Linux and Unix: https://git-scm.com/download/linux
Many different text editors are available, all of which have slightly different appearances and
features. You only need to download and use one of the following programs (we recommend Atom
as a default), but feel free to try out different ones until you find something you like (and then
evangelize about it to your friends!).
Tip: Programming involves working with many different file types, each of which is indi-
cated by its extension (the letters after the . in the file name, such as .pdf). It is useful
to specify that your computer should show these extensions in File Explorer or Finder; see
instructions for Windowsa or for Macb to enable this.
a. https://helpx.adobe.com/x-productkb/global/show-hidden-files-folders-extensions.html
b. https://support.apple.com/kb/PH25381?locale=en_US
1.4.1 Atom
Atom is a text editor built by the folks at GitHub. As it is an open source project, people are
continually building (and making available) interesting and useful extensions to Atom. Atom’s
built-in spell-check is a great feature, especially for documents that require lots of written text. It
also has excellent support for Markdown, a markup language used regularly in this book (see
Chapter 4). In fact, much of this text was written using Atom!
To download Atom, visit the application’s webpage and click the “Download” button to download
the program. On Windows, you will download the installer AtomSetup.exe file; double-click on
that icon to install the application. On a Mac, you will download a zip file; open that file and drag
the Atom.app file to your “Applications” folder.
4. GitHub: https://github.com
5. Join GitHub: https://github.com/join
6. Atom: https://atom.io
Once you’ve installed Atom, you can open the program and create a new text file (just like you
would create a new file with a word processor such as Microsoft Word). When you save a document
that is a particular file type (e.g., FILE_NAME.R or FILE_NAME.md), Atom (or any other modern text
editor) will apply a language specific color scheme to your text, making it easier to read.
The trick to using Atom more efficiently is to get comfortable with the Command Palette.7 If you
press cmd+shift+p (Mac) or ctrl+shift+p (Windows), Atom will open a small window where you
can search for whatever you want the editor to do. For example, if you type in markdown, you can
get a list of commands related to Markdown files (including the ability to open up a preview right in
Atom).
To program with R, you will need to install the R Interpreter on your machine. This software is able to
“read” code written in R and use that code to control your computer, thereby “programming” it.
The easiest way to install R is to download it from the Comprehensive R Archive Network (CRAN).12
Click on the appropriate link for your operating system to find a link to the installer. On a Mac,
7. Atom Command Palette: http://flight-manual.atom.io/getting-started/sections/atom-basics/#command-palette
8. Atom Flight Manual: http://flight-manual.atom.io
9. Visual Studio Code: https://code.visualstudio.com
10. Sublime Text: https://www.sublimetext.com/3
11. The R Project for Statistical Computing: https://www.r-project.org
12. The Comprehensive R Archive Network (CRAN): https://cran.rstudio.com
click the link to download the .pkg file for the latest version supported by your computer.
Double-click on the .pkg file and follow the prompts to install the software. On Windows, follow
the Windows link to “install R for the first time,” then click the link to download the latest version of
R for Windows. You will need to double-click on the .exe file and follow the prompts to install the
software.
To install the RStudio program, visit the download page,13 select to “Download” the free version of
RStudio Desktop, and then select the installer for your operating system to download it.
After the download is complete, double-click on the .exe or .dmg file to run the installer. Follow
the steps of the installer, and you should be prepared to use RStudio.
This chapter has walked you through setting up the necessary software for basic data science, including the following programs:
- A command line interface (Terminal on a Mac, or Git Bash on Windows)
- git (along with a free GitHub account)
- A text editor (such as Atom)
- The R language
- The RStudio integrated development environment
With this software installed, you are ready to get started programming for data science!
13. Download RStudio: https://www.rstudio.com/products/rstudio/download/
2
Using the Command Line
The command line is an interface to a computer—a way for you (the human) to communicate with
the machine. Unlike common graphical interfaces that use “windows, icons, menus, and pointers”
(i.e., WIMP), the command line is text-based, meaning you type commands instead of clicking on
icons. The command line lets you do everything you would normally do by clicking with a mouse,
but by typing in a manner similar to programming! As a data scientist, you will mostly use the
command line to manage your files and keep track of your code using a version control system (see
Chapter 3).
While the command line is not as friendly or intuitive as a graphical interface, it has the advantage
of being both more powerful and more efficient (it’s faster to type than to move a mouse, and you
can do lots of “clicks” with a single command). The command line is also necessary when working
on remote servers (other computers that often do not have graphical interfaces enabled). Thus, the
command line is an essential tool for data scientists, particularly when working with large amounts
of data or files.
This chapter provides a brief introduction to basic tasks using the command line—enough to get
you comfortable navigating the interface and to enable you to interpret commands.
Once you open up the command shell (the Terminal program on Mac, or Git Bash on Windows),
you should see something like the screen shown in Figure 2.1.
A command shell is the textual equivalent of having opened up Finder or File Explorer and having
it display the user’s “Home” folder. While every command shell program has a slightly different
interface, most will display at least the following information:
- The machine you are currently interfacing with (you can use the command line to control
different computers across a network or the internet). In Figure 2.1 the Mac machine (top) is
work-laptop1, and the Windows machine (bottom) is is-joelrossm13.
Figure 2.1 Newly opened command shells: Terminal on a Mac (top) and Git Bash on Windows
(bottom). Red notes are added.
- The directory (folder) you are currently looking at. In Figure 2.1 the Mac directory is
~/Documents, while the Windows directory is ~/Desktop. The ~ is a shorthand for the
“home directory”: /Users/CURRENT_USER/ on a Mac, or C:/Users/CURRENT_USER/ on
Windows.
- The user you are logged in as. In Figure 2.1 the users are mikefree (Mac) and joelross
(Windows).
- The command prompt (typically denoted as the $ symbol), which is where you will type in
your commands.
Remember: Lines of code that begin with a pound symbol (#) are comments: They are
included to explain the code to human readers (they will be ignored by your computer!).
# Print the working directory (which folder the shell is currently inside)
pwd
This command stands for print working directory (shell commands are highly abbreviated to make
them faster to type), and will tell the computer to print the folder you are currently “in.” You can
see the results of the pwd command (among others) in Figure 2.2.
Fun Fact: Command line functions like pwd actually start a tiny program (app) that does
exactly one thing. In this case, the app prints the working directory. When you run a com-
mand, you’re actually executing a tiny program!
Folders on computers are stored in a hierarchy: each folder has more folders inside it, which have
more folders inside them. This produces a tree structure similar to the one shown in Figure 2.3.
You describe what folder you are in by putting a slash / between each folder in the tree. Thus
/Users/mikefree means “the mikefree folder, which is inside the Users folder.” You can
optionally include a trailing / at the end of a directory: /Users/mikefree and
/Users/mikefree/ are identical. The final / can be useful for indicating that something is a folder,
rather than just a file that lacks an extension.
Figure 2.2 Using basic commands to navigate and explore a file system using the command line.
Figure 2.3 An example directory tree: the Users folder contains the Guest and mikefree folders, which in turn contain folders such as Desktop, Documents, project-1, and project-2.
At the very top (or bottom, depending on your point of view) is the root / directory—which has no
name, and so is just indicated with that single slash. Thus /Users/mikefree really means “the
mikefree folder, which is inside the Users folder, which is inside the root folder.”
Caution: There is no clicking with the mouse on the command line (at all!). This includes
clicking to move the cursor to an earlier part of the command you have typed, which can be
frustrating. You will need to use your left and right arrow keys to move the cursor instead.
However, you can make the cursor jump over segments of your syntax if you hold down the
alt (or option) key when you press the left and right arrow keys.
The command to change your directory is called cd (for change directory). You type this
command as:
# Change the working directory to the child folder with the name "FOLDER_NAME"
cd FOLDER_NAME
The first word in this example is the command, or what you want the computer to do. In this case,
you’re issuing the cd command.
The second word is an example of an argument, which is a programming term that means “more
details about what to do.” In this case, you’re providing a required argument of what folder you want
to change to! You will, of course, need to replace FOLDER_NAME with the name of the folder to
change to (which need not be in all caps).
For practice, you can try changing to the Desktop folder and printing your current location to
confirm that you have moved locations.
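For example, starting from your home folder (which contains a Desktop folder):
# Change the working directory to the child folder named "Desktop"
cd Desktop
# Print the working directory to confirm that you have moved
pwd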
Tip: The up and down arrow keys will let you cycle though your previous commands so you
don’t need to retype them!
The ls command says to list the folder contents. If you just issue this command without an
argument (as shown in the example below), it will list the contents of the current folder. If you include an
optional argument (e.g., ls FOLDER_NAME), you can “peek” at the contents of a folder you are not
currently in (as in Figure 2.2).
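Both forms of the command look like this (the folder name is just an example):
# List the contents of the current folder
ls
# List the contents of the child folder named "Desktop" without moving into it
ls Desktop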
Caution: The command line often gives limited or no feedback for your actions. For exam-
ple, if there are no files in the folder, then ls will show nothing, so it may seem as if it “didn’t
work.” Additionally, when you’re typing a password, the letters you type won’t be displayed
(not even as *) as a security measure.
Just because you don’t see any results from your command/typing, that doesn’t mean it
didn’t work! Trust in yourself, and use basic commands like ls and pwd to confirm any
changes if you’re unsure. Take it slow, one step at a time.
Caution: The ls command is specific to Bash shells, such as Terminal or Git Bash. Other
command shells such as the Windows Command Prompt use different commands. This
book focuses on the syntax for Bash shells, which are available across all operating systems
and are more common on remote servers where the command line becomes a necessity (see
Section 2.6).
2.2.3 Paths
Both the cd and ls commands work even for folders that are not “immediately inside” the current
directory! You can refer to any file or folder on the computer by specifying its path. A file’s path is
“how you get to that file”: the list of folders you would need to click through to get to the file, with
each folder separated by a slash (/). For example, user mikefree could navigate to his Desktop by
describing the path to that location in his file system:
# Change the directory to the Desktop using an absolute path (from the root)
cd /Users/mikefree/Desktop/
This code says to start at the root directory (that initial /), then go to Users, then go to mikefree,
then to Desktop. Because this path starts with a specific directory (the root directory), it is called an
absolute path. No matter what folder you currently happen to be in, that path will refer to the
correct directory because it always starts on its journey from the root.
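Compare that command to a version that omits the leading slash (an illustrative example):
# Change the directory to the Desktop using a relative path (from inside /Users)
cd mikefree/Desktop/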
Because this path doesn’t have the leading slash, it just says to “go to the mikefree/Desktop/
folder from the current location.” This is an example of a relative path: it gives you directions to a file
relative to the current folder. As such, the relative path mikefree/Desktop/ will refer to the correct
location only if you happen to be in the /Users folder; if you start somewhere else, who knows
where you will end up!
Remember: You should always use relative paths, particularly when programming! Because
you will almost always be managing multiples files in a project, you should refer to the files
relatively within your project. That way, your program can easily work across computers. For
example, if your code refers to /Users/YOUR_USER_NAME/PROJECT_NAME/data, it can run
only on the YOUR_USER_NAME account. However, if you use a relative path within your code
(i.e., PROJECT_NAME/data), the program will run on multiple computers—which is crucial
for collaborative projects.
You can refer to the “current folder” by using a single dot (.). So the command ls . means “list the contents of the current folder” (the same thing you get if you leave off the argument
entirely).
If you want to go up a directory, you use two dots (..) to refer to the parent folder (that is, the one
that contains this one). So the command ls .. means “list the contents of the folder that contains the current folder.”
Note that . and .. act just like folder names, so you can include them anywhere in paths:
../../my_folder says to “go up two directories, and then into my_folder.”
Tip: Most command shells like Terminal and Git Bash support tab-completion. If you type
out just the first few letters of a file or folder name and then press the tab key, it will automat-
ically fill in the rest of the name! If the name is ambiguous (e.g., you type Do and there is both
a Documents and a Downloads folder), you can press Tab twice to see the list of matching
folders. Then add enough letters to distinguish them and press Tab to complete the name.
This shortcut will make your life easier.
Additionally, you can use a tilde ~ as shorthand for the absolute path to the home directory of the
current user. Just as dot (.) refers to “current folder,” ~ refers to the user’s home directory (usually
/Users/USERNAME). And of course, you can use the tilde as part of a path as well (e.g., ~/Desktop is
an absolute path to the desktop for the current user).
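For example, the following command works no matter which folder you are currently in:
# Change to the current user's Desktop folder using the ~ shorthand
cd ~/Desktop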
You can specify a path (relative or absolute) to a file as well as to a folder by including the full
filename at the end of the folder path—like the “destination”:
# Use the `cat` command to conCATenate and print the contents of a file
cat ~/Desktop/my_file.txt
Files are sometimes discussed as if they were part of the folder that contains them. For example,
telling someone to “go up a directory from ~/Desktop/my_file.txt” is just shorthand for saying
“go up a directory from the folder that contains ~/Desktop/my_file.txt” (e.g., from
~/Desktop/ to the ~ home directory).
1. An example list of Unix commands can be found here: http://www.lagmonster.org/docs/unix/intro-137.html
Caution: The command line makes it dangerously easy to permanently delete multiple files
or folders and will not ask you to confirm that you want to delete them (or move them to the
“recycling bin”). Be very careful when using the terminal to manage your files, as it is very
powerful.
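A few of the most common file management commands look like the following (the file and folder names here are only examples):
# Make a new folder called "data"
mkdir data
# Copy the file report.txt into the data folder
cp report.txt data/report.txt
# Move (or rename) the file report.txt to old_report.txt
mv report.txt old_report.txt
# Permanently delete the file old_report.txt (there is no confirmation!)
rm old_report.txt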
Be aware that many of these commands won’t print anything when you run them. This often
means that they worked; they just did so quietly. If it doesn’t work, you will know because you will
see a message telling you so (and why, if you read the message). So just because you didn’t get any
output, that doesn’t mean you did something wrong—you can use another command (such as ls)
to confirm that the files or folders changed in the way you wanted!
# View the manual for the `mkdir` command (not available in Git Bash)
man mkdir
This command will display the manual for the mkdir command (shown in Figure 2.4). Because
manuals are often long, they are opened up in a command line viewer called less. You can “scroll”
up and down by using the arrow keys. Press the q key to quit and return to the command prompt.
If you look under “Synopsis,” you can see a summary of all the different arguments this command
understands. A few notes about reading this syntax:
- Anything written in brackets [] is optional. Arguments that are not in brackets (e.g.,
directory_name) are required.
- Underlined arguments are ones you choose: You don’t actually type the word
directory_name, but instead insert your own directory name. Contrast this with the
options: if you want to use the -p option, you need to type -p exactly.
- “Options” (or “flags”) for command line programs are often marked with a leading hyphen
- to distinguish them from file or folder names. Options may change the way a command
line program behaves—just as you might set “easy” or “hard” as the mode in a game. You can
either write out each option individually or combine them: mkdir -p -v and mkdir -pv
are equivalent.
Some options may require an additional argument beyond just indicating a particular
operation style. In Figure 2.4 you can see that the -m option requires you to specify an
additional mode argument; check the details in the “Description” for exactly what that
argument should be.
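For example, combining the -p option (create any missing parent folders) with the -v option (print a message for each folder created):
# Make a nested set of folders in a single step
mkdir -pv projects/data_science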
Figure 2.4 The manual (“man”) page for the mkdir command, as shown on a Mac Terminal.
Command line manuals (“man pages”) are often very difficult to read and understand. Start by
looking at just the required arguments (which are usually straightforward), and then search for and
use a particular option if you want to change a command’s behavior. For practice, read the man
page for rm and try to figure out how to delete a folder and not just a single file. Be careful, as this is a
good way to unintentionally permanently delete files.
Tip: Manual pages are a good example of the kind of syntax explanations you will find when
learning about a particular command, but are not necessarily the best way to actually learn
to use a command. To do that, we recommend more focused resources, such as Michael Har-
tle’s excellent online tutorial Learn Enough Command Line to Be Dangerous.a Try searching
online for a particular command to find many different tutorials and examples!
a. https://www.learnenough.com/command-line-tutorial
Some other useful commands you could explore are listed in Table 2.2.
2.3.2 Wildcards
One last note about working with files: since you will often work with multiple files, command
shells offer some shortcuts for talking about files with similar names. In particular, you can use an
asterisk * as a wildcard when referring to files. This symbol acts like a “wild” or “blank” tile in the
board game Scrabble—it can be “replaced” by any character (or any set of characters) when
determining which file(s) you’re talking about.
- *.txt refers to all files that have .txt at the end. cat *.txt would output the contents of
every .txt file in the folder.
- hello*.txt refers to all files that start with hello and end with .txt, no matter how many
characters are in the middle (including no characters!).
- *.* refers to all files that have an extension (usually all files).
As an example, you could remove all files that have the extension .txt by using the following
syntax (again, be careful!):
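# Permanently remove all files in the current folder that end with .txt
rm *.txt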
Consider another command: echo lets you “echo” (print out) some text. For example, you can echo
"Hello World", which is the traditional first computer program written for a new language or
environment:
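# Print (echo) the text "Hello World" to the terminal
echo "Hello World"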
What happens if you forget the closing quotation mark (")? You keep pressing enter but the shell
just shows a > each time!
What’s going on? Because you didn’t “close” the quote, the shell thinks you are still typing the
message you want to echo! When you press enter, it adds a line break instead of ending the
command, and the > indicates that you’re still going. If you finally close the quote, you will see
your multi-line message printed.
Tip: If you ever get stuck in the command line, press ctrl+c (the control and c keys
together). This almost always means “cancel” and will “stop” whatever program or com-
mand is currently running in the shell so that you can try again. Just remember: “ctrl+c to
flee.”
If that doesn’t work, try pressing the esc key, or typing exit, q, or quit. Those commands
will cover most command line programs.
This book discusses a variety of approaches to handling errors in computer programs. Many
programs do provide error messages that explain what went wrong, though the density of these
messages may make it tempting to disregard them. If you enter an unrecognized command, the
shell will inform you of your mistake, as shown in Figure 2.5. In that example, a simple typo (lx
instead of ls) is invalid syntax, yielding a fairly helpful error message (command not found—the
computer can’t find the lx command you are trying to use).
However, forgetting arguments yields different results. In some cases, there will be a default
behavior (consider what happens if you enter cd without any arguments). If some arguments are
required to run a command, the shell may provide you with a brief summary of the command’s
usage, as shown in Figure 2.6.
Remember: Whenever the command line (or any other code interpreter, for that matter)
provides you with feedback, take the time to read the message and think about what the
problem might be before you try again.
Figure 2.5 An error on the command line due to a typo in the command name.
Figure 2.6 Executing a command without the required arguments may provide information about how
to use the command.
- > says “take the output of the command and put it in this file.” For example, echo "Hello
World" > hello.txt will put the outputted text "Hello World" into a file called
hello.txt. Note that this will replace any previous content in the file, or create the file if it
doesn’t exist. This is a great way to save the output of your command line work!
- >> says “take the output of the command and append it to the end of this file.” This will keep
you from overwriting previous content.
- | (the pipe) says “take the output of this command and send it to the next command.” For
example, cat hello.txt | less would take the output of the hello.txt file and send it
to the less program, which provides the arrow-based “scrolling” interface that man pages
use. This is primarily used when you need to “chain” multiple commands together—that is,
take the result of one command and send it to the next, and then send the result of that to
the next command. This type of sequencing is used in R, as described in Chapter 11.
You might not use this syntax on a regular basis, but it is useful to be familiar with the symbols and
concepts. Indeed, you can use them to quickly perform some complex data tasks, such as
determining how often a word appears in a set of files. For example, the text of this book was
written across a number of different files, all with the extension .Rmd (more on this in Chapter 18).
To see how frequently the word “data” appears in these .Rmd files, you could first search for the word
using the grep command (using a wildcard to specify all files with that extension), then redirect
the output of the search to the wc command to count the words:
# Search .Rmd files for "data", then perform a word count on the results
grep -io data *.Rmd | wc -w
This command shows the value of interest on the command line: The word “data” is used 1897
times! While this example is somewhat dense and requires understanding the different options each
command makes available, it demonstrates the potential power of the command line.
To access a remote computer, you will most commonly use the ssh (secure shell) command. ssh is
a command utility and protocol for securely transferring information over a network. In this case,
the information being transferred will be the commands you run on the machine and the output
they produce. At the most basic level, you can use the ssh command to connect to a remote
machine by specifying the host URL of that machine. For example, if you wanted to connect to a
computer at ovid.washington.edu, you would use the command:
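# Connect to the remote machine at ovid.washington.edu
ssh ovid.washington.edu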
However, most remote machines don’t let just anyone connect to them for security reasons.
Instead, you need to specify your username for that machine. You do this by putting the username
followed by an @ symbol at the beginning of the host URL:
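# Connect to the remote machine as a particular user (mikefree is just an example)
ssh mikefree@ovid.washington.edu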
When you give this command, the remote server will prompt you for your password to that
machine. Remember that the command line won’t show anything (even *) as you type in the
password, but it is being entered nonetheless!
Tip: If you connect to a remote server repeatedly, it can become tedious to constantly retype
your password. Instead, you can create and use an ssh key,a which “saves” your authentica-
tion information on the server so you don’t need to put in a password each time. Check with
the administrator of the remote machine for specific instructions.
a. https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/
Once you connect to a remote server, you will see the command prompt change to that of the
remote server, as shown in Figure 2.7.
Figure 2.7 Connecting to a remote server using the ssh command on a Mac Terminal.
At that point, you can use commands like pwd and ls to see where you are on that remote
computer, cd to navigate to another folder, and any other command line command you wish—just
as if you had opened a terminal on that machine!
Once you are finished working on the remote machine, you can disconnect by using the exit
command. Closing the command shell will also usually end your connection, but using exit will
more explicitly stop any ongoing processes on a remote machine.
The ssh utility will let you connect to a remote machine and control it as if it were right in front of
you. But if you want to move files between your local machine and the remote one, you will need to
use the scp (secure copy) command. This command works exactly like the cp command
mentioned earlier, but copies files over the secure SSH protocol.
To copy a local file to a location on a remote machine, you need to specify the username and host
URL of that machine, similar to what you would use to connect via ssh. In addition, you will need
to specify the destination path (which folder to copy the file to) on that remote machine. You can
specify a path on a remote machine by including it after a colon (:) following the host URL. For
example, to refer to the ~/projects folder on the ovid.washington.edu machine (for user
mikefree), you would use
mikefree@ovid.washington.edu:~/projects
Thus to copy a local file to a folder on a remote machine, user mikefree would use a command like
this:
# Securely copy the local file data.csv into the projects folder on the
# remote machine
scp data.csv mikefree@ovid.washington.edu:~/projects
# Or more generically:
scp MY_LOCAL_FILE username@hostname:path/to/destination
It is important to note that file paths are relative to the currently connected machine—that is why you
need to specify the host URL. For example, if you had connected to a remote server via ssh and
wanted to copy a file back to your local machine, you would need to specify the remote path to your
computer! Since most personal computers don’t have easily identifiable hostnames, it’s usually
easiest to copy a file to a local machine by disconnecting from ssh and making the first scp
argument the remote host:
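# Securely copy a file from the remote machine into the current local folder
# (the user, host, and file names here follow the earlier example)
scp mikefree@ovid.washington.edu:~/projects/data.csv .

# Or more generically:
scp username@hostname:path/to/remote/file path/to/local/destination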
Going Further: Other utilities can also be used to copy files between machines. For example,
the rsync command will copy only changes to a file or folder, which helps avoid the need to
frequently transfer large amounts of data.
Overall, being able to use basic terminal commands will allow you to navigate to and interact with a
wide variety of machines, and provides you with a quick and powerful interface to your computer.
For practice using the command line, see the set of accompanying book exercises.2
2. Command line exercises: https://github.com/programming-for-data-science/chapter-02-exercises
II
Managing Projects
This section of the book teaches you the necessary skills for managing data science projects. The
two core skills involved are keeping track of the version of your code (Chapter 3), and producing
documentation for your code using a language called Markdown (Chapter 4).
3
Version Control with
git and GitHub
One of the most important parts of writing code to work with data is keeping track of changes to
your code. Maintaining a clear and well-documented history of your work is crucial for
transparency and collaboration. Even if you are working independently, tracking your changes will
enable you to revert to earlier versions of your project and more easily identify errors.
Alternatives to proper version control systems—such as emailing code to others, or having dozens
of versions of the same file—lack any structured way of backing up work, and are time-consuming
and error-prone. This is why you should be using a version control system like git.
This chapter introduces the git command line program and the GitHub cloud storage service, two
wonderful tools that track changes to your code (git) and facilitate collaboration (GitHub). git
and GitHub are the industry standards for the family of tasks known as version control. Being able
to manage changes to your code and share that code with others is one of the most important
technical skills a data scientist can learn, and is the focus of this chapter as well as Chapter 20.
Tip: Because this chapter revolves around using new interfaces and commands to track file
changes—which can be difficult to understand abstractly—we suggest that you follow along
with the instructions as they are introduced throughout the chapter. The best way to learn
is by doing!
1. Git homepage: http://git-scm.com/
A version control system (VCS) is a tool for managing a collection of program code that
provides you with three important capabilities: reversibility, concurrency, and
annotation.2
Version control systems work a lot like Dropbox or Google Docs: they allow multiple people to work
on the same files at the same time, and to view or “roll back” to previous versions. However, systems
like git differ from Dropbox in a couple of key ways:
- Each new version or “checkpoint” of your files must be explicitly created (committed). git
doesn’t save a new version of your entire project each time you save a file to disk. Instead,
after making progress on your project (which may involve editing multiple files), you take a
snapshot of your work, along with a description of what you’ve changed.
- For text files (which almost all programming files are), git tracks changes line by line. This
means it can easily and automatically combine changes from multiple people, and give you
very precise information about which lines of code have changed.
Like Dropbox and Google Docs, git can show you all previous versions of a file and can quickly roll
back to one of those previous versions. This is often helpful in programming, especially if you
embark on making a massive set of changes, only to discover partway through that those changes
were a bad idea (we speak from experience here).
But where git really comes in handy is in team development. Almost all professional development
work is done in teams, which involves multiple people working on the same set of files at the same
time. git helps teams coordinate all these changes, and provides a record so that anyone can see
how a given file ended up the way it did.
There are a number of different version control systems that offer these features, but git is the de
facto standard—particularly when used in combination with the cloud-based service GitHub.
- repository (repo): A database of your file history, containing all the checkpoints of all your
files, along with some additional meta-data. This database is stored in a hidden subdirectory
named .git within your project directory. If you want to sound cool and in-the-know, call
the project folder itself a “repo” (even though the repository is technically the database
inside the project folder).
- commit: A snapshot or checkpoint of your work at a given time that has been added to the
repository (saved in the database). Each commit will also maintain additional information,
including the name of the person who did the commit, a message describing the commit,
and a timestamp. This extra tracking information allows you to see when, why, and by
whom changes were made to a given file. Committing a set of changes creates a snapshot of
what that work looks like at the time, which you can return to in the future.
2. Raymond, E. S. (2009). Understanding version-control systems. http://www.catb.org/esr/writings/version-control/version-control.html
- remote: A link to a copy of your repository on a different machine. This link points to a
location on the web where the copy is stored. Typically this will be a central (“master”)
version of the project that all local copies point to. This chapter generally deals with copies
stored on GitHub as remote repositories. You can push (upload) commits to, and pull
(download) commits from, a remote repository to keep everything in sync.
- merging: git supports having multiple different versions of your work that all live side by
side (in what are called branches), which may be created by one person or by many
collaborators. git allows the commits (checkpoints) saved in different versions of the code
to be easily merged (combined) back together without any need to manually copy and paste
different pieces of the code. This makes it easy to separate and then recombine work from
different developers.
Teams can set up their own servers to host these centralized repositories, but many choose to use a
server maintained by someone else. The most popular of these in the open source world is
GitHub,3 which as of 2017 had more than 24 million developers using the site.4 In addition to
hosting centralized repositories, GitHub offers other team development features such as issue
tracking, wiki pages, and notifications. Public repositories on GitHub are free, but you have to pay
for private ones.
In short, GitHub is a site that will host a copy of your project in the cloud, enabling multiple people
to collaborate (using git). git is what you use to do version control; GitHub is one possible place
where repositories of code can be stored.
Going Further: Although GitHub is the most popular service that hosts “git” repositories,
it is not the only such site. BitBucket a offers a similar set of features to GitHub, though it
has a different pricing model (you get unlimited free private repos, but are limited in the
number of collaborators). GitLabb offers a hosting system that incorporates more operations
and deployment services for software projects.
a. https://bitbucket.org
b. https://gitlab.com
3. GitHub: https://github.com
4. The State of the Octoverse 2017: https://octoverse.github.com
Caution: The interface and functionality of websites such as GitHub are constantly evolving
and may change. Additional features may become available, and the current structure may
be reorganized to better support common usage.
The first time you use git on your machine after having installed it, you will need to configure7 the
installation, telling git who you are so you can commit changes to a repository. You can do this by
using the git command line command with the config option (i.e., running the git config
command):
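# Tell git your name and email (substitute your own name, and the
# email address associated with your GitHub account)
git config --global user.name "YOUR_FULL_NAME"
git config --global user.email "YOUR_EMAIL_ADDRESS"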
Even after git knows who you are, it will still prompt you for your password before pushing your
code up to GitHub. One way to save some time is by setting up an SSH key for GitHub. This will
allow GitHub to recognize and permit interactions coming from your machine. If you don’t set up
the key, you will need to enter your GitHub password each time you want to push changes up to
GitHub (which may be multiple times a day). Instructions for setting up an SSH key are available
from GitHub Help.8 Make sure you set up your key on a machine that you control and trust!
5. GitHub Desktop: https://desktop.github.com
6. Sourcetree: https://www.sourcetreeapp.com
7. GitHub: Set Up Git: https://help.github.com/articles/set-up-git/
8. GitHub: Authenticating to GitHub: https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/
A repository is always created in an existing directory (folder) on your computer. For example, you
could create a new folder called learning_git on your computer’s Desktop. You can turn this
directory into a repository by telling the git program to run the init action (running the git
init command) inside that directory:
# Change your current directory to the new folder you just created
cd learning_git
# Turn the current directory into a new git repository
git init
The git init command creates a new hidden folder called .git inside the current directory.
Because it’s hidden, you won’t see this folder in Finder, but if you use ls -a (the “list” command
with the all option) you can see it listed. This folder is the “database” of changes that you will
make—git will store all changes you commit in this folder. The inclusion of the .git folder causes
a directory to become a repository; you refer to the whole directory as the “repo.” However, you
won’t ever have to directly interact with this hidden folder; instead, you will use a short set of
terminal commands to interact with the database.
Caution: Do not put one repo inside of another! Because a git repository tracks all of the con-
tent inside of a single folder (including the content in subfolders), this will turn one repo
into a “sub-repo” of another. Managing changes to both the repo and sub-repo will be diffi-
cult and should be avoided.
Instead, you should create a lot of different repos on your computer (one for each project),
making sure that they are in separate folders.
Note that it is also not a good idea to have a git repository inside of a shared folder, such
as one managed with Dropbox or Google Drive. Those systems’ built-in file tracking will
interfere with how git manages changes to files.
The git status command will give you information about the current “state” of the repo.
Running this command on a new repo tells you a few things (as shown in Figure 3.1):
Figure 3.1 Checking the status of a new (empty) repository with the git status command.
n That you’re at the initial commit (you haven’t committed anything yet)
n That currently there are no changes to files that you need to commit (save) to the database
n What to do next! (namely, create/copy files and use “git add” to track)
That last point is important. git status messages are verbose and somewhat awkward to read (this is
the command line after all). Nevertheless, if you look at them carefully, they will almost always tell
you which command to use next.
Tip: If you ever get stuck, use git status to figure out what to do next!
This makes git status the most useful command in the entire process. As you are learning the
basics of git, you will likely find it useful to run the command before and after each other command
to see how the status of your project changes. Learn it, use it, love it.
Remember: After editing a file, always save it to your computer’s hard drive (e.g., with File
> Save). git can track only changes that have been saved!
Figure 3.2 The status of a repository with changes that have not (yet) been added and are therefore
shown in red.
The first step is to add those changes to the staging area. The staging area is like a shopping cart in
an online store: you put changes in temporary storage before you commit to recording them in the
database (e.g., before clicking “purchase”).
You add files to the staging area using the git add command (replacing FILENAME in the following
example with the name/path of the file or folder you want to add):
# Add changes to a file with the name FILENAME to the staging area
# Replace FILENAME with the name of your file (e.g., favorite_books.txt)
git add FILENAME
This will add a single file in its current saved state to the staging area. For example, git add
favorite_books.txt would add that file to the staging area. If you change the file later, you will
need to add the updated version by running the git add command again.
You can also add all of the contents of the current directory (tracked or untracked) to the staging
area with the following command:
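# Add all saved contents of the current directory (.) to the staging area
git add .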
This command is the most common way to add files to the staging area, unless you’ve made
changes to specific files that you aren’t ready to commit yet. Once you’ve added files to the staging
area, you’ve “changed” the repo and so can run git status again to see what it says to do next. As
you can see in Figure 3.3, git will tell you which files are in the staging area, as well as the
command to unstage those files (i.e., remove them from the “cart”).
3.3.2 Committing
When you’re happy with the contents of your staging area (i.e., you’re ready to purchase), it’s time
to commit those changes, saving that snapshot of the files in the repository database. You do this
with the git commit command:
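# Commit the changes currently in the staging area, with a message
# describing what the commit does
git commit -m "Your message here"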
Figure 3.3 The status of a repository after adding changes (added files are displayed in green).
You should replace "Your message here" with a short message saying what changes that commit
makes to the repo. For example, you could type git commit -m "Create favorite_books.txt
file".
Caution: If you forget the -m option, git will put you into a command line text editor so that
you can compose a message (then save and exit to finish the commit). If you haven’t done
any other configuration, you might be dropped into the vim editor. Type :q (colon then q)
and press enter to flee from this place and try again, remembering the -m option! Don’t
panic: getting stuck in vim happens to everyone.a
a
https://stackoverflow.blog/2017/05/23/stack-overflow-helping-one-million-developers-exit-vim/
Your commit messages should be informative9 about which changes the commit is making to the
repo. "stuff" is not a good commit message. In contrast, "Fix critical authorization
error" is a good commit message.
Commit messages should use the imperative mood ("Add feature", not "Added feature").
They should complete the following sentence:
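If applied, this commit will {your message}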
Other advice suggests that you limit your message to 50 characters (like an email subject line), at
least for the first line—this helps when you are going back and looking at previous commits. If you
want to include more detail, do so after a blank line. (For more detailed commit messages, we
recommend you learn to use vim or another command line text editor.)
A specific commit message format may also be required by your company or project team. Further
consideration of good commit messages can be found in this blog post.10
9. Do not do this: https://xkcd.com/1296/
10. Chris Beams: How to Write a Git Commit Message blog post: http://chris.beams.io/posts/git-commit/
As you make commits, remember that they are a public part of your project history, and will be read
by your professors, bosses, coworkers, and other developers on the internet.11
After you’ve committed your changes, be sure and check git status, which should now say that
there is nothing to commit!
In general, you will make lots of changes to your code (editing lots of files, running and testing your
code, and so on). Once you’re at a good “break point”—you’ve got a feature working, you’re stuck
and need some coffee, you’re about to embark on some radical changes—be sure to add and commit
your changes to make sure you don’t lose any work and you can always get back to that point.
Remember: Each commit represents a set of changes, which can and usually does include
multiple files. Do not think about each commit being a change to a file; instead, think about
each commit as being a snapshot of your entire project!
Tip: If you accidentally add files that you want to “unadd,” you can use the git reset com-
mand (with no additional arguments) to remove all added files from the staging area.
If you accidentally commit files when you didn’t want to, you can “undo” the commit using
the command git reset --soft HEAD~1. This command makes it so the commit you just
made never occurred, leaving the changed files in your working directory. You can then edit
which files you wish to commit before running the git commit command again. Note that
this works only on the most recent commit, and you cannot (easily) undo commits that have
been pushed to a remote repository.
Figure 3.4 The local git process: add changes to the staging area, then create a checkpoint of your
project by making a commit. The commit saves a version of the project at this point in time to the
database of file history.
11. Don't join this group: https://twitter.com/gitlost
Repositories stored on GitHub are examples of remotes: other repos that are linked to your local
one. Each repo can have multiple remotes, and you can synchronize commits between them. Each
remote has a URL associated with it (indicating where on the internet the remote copy of the repo
can be found), but they are given “alias” names—similar to browser bookmarks. By convention, the
remote repo stored on GitHub’s servers is named origin, since it tends to be the “origin” of any
code you’ve started working on.
To use GitHub, you will need to create a free GitHub account, which is discussed in Chapter 1.
Next, you will need to “link” your local repository to the remote one on GitHub. There are two
common processes for doing this:
1. If you already have a project tracked with git on your computer, you can create a new repository
on GitHub by clicking the green “New Repository” button on the GitHub homepage (you
will need to be logged in). This will create a new empty repo on GitHub’s servers under your
account. Follow the provided instructions on how to link a repository on your machine to
the new one on GitHub.
2. If there is a project on GitHub that you want to edit on your computer, you can clone (download) a
copy of a repo that already exists on GitHub, allowing you to work with and modify that
code. This process is more common, so it is described in more detail here.
Each repository on GitHub has a web portal at a unique location. For example, https://github.com/
programming-for-data-science/book-exercises is the webpage for the programming exercises that
accompany this book. You can click on the files and folders on this page to view their source and
contents online, but you won’t change them through the browser.
Remember: You should always create a local copy of the repository when working with code.
Although GitHub’s web interface supports it, you should never make changes or commit directly
to GitHub. All development work is done locally, and changes you make are then uploaded
and merged into the remote. This allows you to test your work and to be more flexible with
your development.
To fork a repo, click the “Fork” button in the upper right of the screen (shown in Figure 3.5). This
will copy the repo over to your own account; you will be able to download and upload changes to
that copy but not to the original. Once you have a copy of the repo under your own account, you
need to download the entire project (files and their history) to your local machine to make
changes. You do this by using the git clone command:
# Change to the folder that will contain the downloaded repository folder
cd ~/Desktop
# Clone (download) the repo from GitHub, replacing REPO_URL with the copied URL
git clone REPO_URL
This command creates a new repo (directory) in the current folder, and downloads a copy of the code
and all the commits from the URL you specify into that new folder.
Caution: Make sure that you are in the desired location in the command line before running
any git commands. For example, you would want to cd out of the learning_git directory
described earlier; you don’t want to clone into a folder that is already a repo!
You can get the URL for the git clone command from the address bar of your browser, or by
clicking the green “Clone or Download” button. If you click that button, you will see a pop-up that
contains a small clipboard icon that will copy the URL to your clipboard, as shown in Figure 3.6.
This allows you to use your terminal to clone the repository. If you click “Open in Desktop,” it will
prompt you to use a program called GitHub Desktop12 to manage your version control (a
technology not discussed in this book). But do not click the “Download Zip” option, as it contains
code without the previous version history (the code, but not the repository itself).
Figure 3.5 The Fork button for a repository on GitHub’s web portal. Click this button to create your
own copy of a repository on GitHub.
12. GitHub Desktop: https://desktop.github.com
Figure 3.6 The Clone button for a repository on GitHub’s web portal. Click this button to open the
dialog box, then click the clipboard icon to copy the GitHub URL needed to clone the repository to your
machine. Red notes are added.
Remember: Make sure you clone from the forked version (the one under your account!) so
that the repo downloads with a proper link back to the origin remote.
Note that you will only need to clone once per machine. clone is like init for repos that are on
GitHub; in fact, the clone command includes the init command (so you do not need to init a
cloned repo). After cloning, you will have a full copy of the repository—which includes the full
project history—on your machine.
Committing will save your changes locally, but it does not push those changes to GitHub. If you
refresh the web portal page (make sure you’re looking at the one under your account), you
shouldn’t see your changes yet.
To get the changes to GitHub (and share your code with others), you will need to push (upload)
them to GitHub’s computers. You can do this with the following command:
By default, this command will push the current code to the origin remote (specifically, to its
master branch of development). When you cloned the repo, it came with an origin “bookmark”
link to the original repo’s location on GitHub. To check where the remote is, you can use the
following command:
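# Print the name and URL of each remote linked to this repo
git remote -v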
Once you’ve pushed your code, you should be able to refresh the GitHub webpage and see your
changes on the web portal.
If you want to download the changes (commits) that someone else has made, you can do that using
the pull command. This command will download the changes from GitHub and merge them into
the code on your local machine:
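# Pull (download) commits from the origin remote and merge them into your code
git pull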
Caution: Because pulling down code involves merging versions of code together, you will
need to keep an eye out for merge conflicts! Merge conflicts are discussed in more detail in
Chapter 20.
Going Further: The commands git pull and git push have the default behavior of inter-
acting with the master branch at the origin remote location. git push is thus equivalent
to the more explicit command git push origin master. As discussed in Chapter 20, you
will adjust these arguments when engaging in more complex and collaborative development
processes.
The overall process of using git and GitHub together is illustrated in Figure 3.7.
Figure 3.7 The remote git process: fork a repository to create a copy on GitHub, then clone it to
your machine. Then add and commit changes, and push them up to GitHub to share.
Tip: If you are working with others (or just on different computers), always pull in the latest
changes before you start working. This will get you the most up-to-date changes, and reduce
the chances that you will encounter an issue when you try to push your code.
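You can view the history of commits you have made by running the git log command from inside the repo (a minimal example; Figure 3.8 shows the output when the --oneline option is added):

# Print a log of the commits made to this repo
git log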
This will give you a list of the sequence of commits you’ve made: you can see who made which
changes and when. (The term HEAD refers to the most recent commit made.) The optional
--oneline argument gives you a nice compact printout, though it displays less information (as
shown in Figure 3.8). Note that each commit is listed with its SHA-1 hash (the sequence of
random-looking numbers and letters), which you can use to identify that commit.
1. You can replace a file (or the entire project directory!) with a version saved as a previous
commit.
2. You can have git “reverse” the changes that you made with a previous commit, effectively
applying the opposite changes and thereby undoing it.
Figure 3.8 A project's commit history displayed using the git log --oneline command in the
terminal. Each commit is identified by a short hash (e.g., e4894a0), the most recent of which is
referred to as the HEAD.
Note that both of these approaches require you to have committed a working version of the code
that you want to go back to. git only knows about changes that have been committed: if you don’t
commit, git can’t help you!
For both forms of undoing, you first need to decide which version of the file to revert to. Use the
git log --oneline command described earlier, and note the SHA-1 hash for the commit that
saved the version you want to revert to. The first several characters of each hash act as a unique ID
and serve as the "name" for the commit.
To go back to an older version of the file (to “revert” it to the version of a previous commit), you can
use the git checkout command:
# Checkout (load) the version of the file from the given commit
git checkout COMMIT_HASH FILENAME
Replace COMMIT_HASH and FILENAME with the commit ID hash and the file you want to revert,
respectively. This will replace the current version of that single file with the version saved in
COMMIT_HASH. You can also use -- as the commit hash to refer to the most recent commit (called the
HEAD), such as if you want to discard current changes:
# Checkout the file from the HEAD (the most recent commit)
git checkout -- FILENAME
This will change the file in your working directory, so that it appears just as it did when you made the
earlier commit.
Caution: You can use the git checkout command to view project files at the time of a par-
ticular commit by leaving off the filename (i.e., git checkout COMMIT_HASH). However,
you can’t actually commit any changes to these files when you do this. Thus you should use
this command only to explore the files at a previous point in time.
If you do this (or if you forget the filename when checking out), you can return to your most
recent version of the code with the following command:
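# Return to the most recent commit on the master branch
# (this chapter assumes the default branch is named master)
git checkout master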
If you just had one bad commit but don’t want to throw out other valuable changes you made to
your project later, you can use the git revert command:
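# Reverse the changes made by the commit with the given hash; the --no-edit
# option skips the prompt for a custom commit message
git revert COMMIT_HASH --no-edit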
This command will determine which changes the specified commit made to the files, and then
apply the opposite changes to effectively “back out” the commit. Note that this does not go back to
the given commit number (that’s what git checkout is for!), but rather reverses only the commit you
specify.
The git revert command does create a new commit (the --no-edit option tells git that you
don’t want to include a custom commit message). This is great from an archival point of view: you
never “destroy history” and lose the record of which changes were made and then reverted. History
is important; don’t mess with it!
Caution: The git reset command can destroy your commit history. Be very careful when
using it. We recommend you never reset beyond the most recent commit—that is, use it
only to unstage files (git reset) or undo the most recent commit (git reset --soft
HEAD~1).
You can tell git to ignore files like these by creating a special hidden file in your project directory
called .gitignore (note the leading dot). This text file contains a list of files or folders that git
should “ignore” and therefore not “see” as one of the files in the folder. The file uses a very simple
format: each line contains the path to a directory or file to ignore; multiple files are placed on
multiple lines. For example:
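# An illustrative .gitignore: each line names a file or folder for git to ignore
.DS_Store
raw_data/
api_key.txt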
The easiest way to create the .gitignore file is to use your preferred text editor (e.g., Atom). Select
File > New from the menu and choose to make the .gitignore file directly inside your repo (in
the root folder of that repo, not in a subfolder).
If you are on a Mac, we strongly suggest globally ignoring your .DS_Store file. There’s no need to
ever share or track this file. To always ignore this file on your machine, you can create a “global”
.gitignore file (e.g., in your ~ home directory), and then tell git to always exclude files listed
there through the core.excludesfile configuration option:
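# Tell git to always ignore the files listed in the global ignore file
# (the file name and location here are one common choice, not a requirement)
git config --global core.excludesfile ~/.gitignore_global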
Note that you may still want to list .DS_Store in a repo’s local .gitignore file in case you are
collaborating with others.
Additionally, GitHub provides a number of suggested .gitignore files for different languages,13
including R.14 These are good places to start when creating a local .gitignore file for a project.
Whew! You made it through! This chapter has a lot to take in, but really you just need to
understand and use the following half-dozen commands:
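git status   # check the status of your repo (and see what to do next)
git add      # add file changes to the staging area
git commit   # commit the staged changes to the repo
git push     # push (upload) commits to GitHub
git pull     # pull (download) commits from GitHub
git clone    # download an existing repo from GitHub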
While it’s tempting to ignore version control systems, they will save you time in the long run. git
is a particularly complex and difficult-to-understand system given its usefulness and popularity. As
such, a wide variety of tutorials and explanations are available online if you need further
clarification. Here are a few recommendations to get started:
n Atlassian’s Git Tutorial15 is an excellent introduction to all of the major git commands.
13. .gitignore templates: https://github.com/github/gitignore
14. .gitignore template for R: https://github.com/github/gitignore/blob/master/R.gitignore
15. https://www.atlassian.com/git/tutorials/what-is-version-control
16. https://education.github.com/git-cheat-sheet-education.pdf
17. https://help.github.com/articles/git-and-github-learning-resources/
18. https://try.github.io
n Jenny Bryan’s free online book Happy Git and GitHub for the useR19 provides an in-depth
approach to using version control for R users.
n DataCamp’s online course Introduction to Git for Data Science20 will also cover the basics of
git.
n The Pro Git Book21 is the official reference for full (if not necessarily clear) details on any and
all git commands.
For practice working with git and GitHub, see the set of accompanying book exercises.22
19. http://happygitwithr.com
20. https://www.datacamp.com/courses/introduction-to-git-for-data-science/
21. https://git-scm.com/book/en/v2
22. Version control exercises: https://github.com/programming-for-data-science/chapter-03-exercises
4
Using Markdown for
Documentation
As a data scientist, you will often encounter the somewhat trivial task of adding formatting to plain
text (e.g., making it bold or italic) without the use of a program like Microsoft Word. This chapter
introduces Markdown, a simple programming syntax that can be used to describe text formatting
and structure by adding special characters to the text. Being comfortable with this simple syntax to
describe text rendering will help you document your code, and post well-formatted messages to
question forums (such as StackOverflow1 ) or chat applications (such as Slack2 ), as well as create
clear documentation that describes your code’s purpose when hosted on GitHub (called the
“README” file). In this chapter, you will learn the basics of Markdown syntax, and how to leverage
it to produce readable code documents.
Fun Fact: Markdown belongs to a family of programming languages used to describe doc-
ument formatting known as markup languages (confusing, right?). For example, HTML
(HyperText Markup Language) is used to describe the content and format of websites.
1. StackOverflow: https://stackoverflow.com
2. Slack: https://slack.com
3. Markdown: Syntax original specification by John Gruber: https://daringfireball.net/projects/markdown/syntax
There are a few different ways you can format text, as summarized in Table 4.1.
While there are further variations and syntax options, these are the most common.
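For example, surrounding text with double asterisks makes it bold, single underscores make it italic, and backticks mark it as inline code:

This text is **bold**, this text is _italic_, and this is `inline code`.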
4.1.3 Hyperlinks
Providing hyperlinks in documentation is a great way to reference other resources on the web. You
turn text into a hyperlink in Markdown by surrounding the text in square brackets [], and placing
the URL to link to immediately after that in parentheses (). Here’s an example:
[text to display](https://some/url/or/path)
The text between the brackets (“text to display”) will be displayed in your document with hyperlink
formatting. Clicking on the hyperlink will direct a web browser to the URL in the parentheses
(https://some/url/or/path). Note that hyperlinks can be included inline in the middle of a
paragraph or list item; the text to display can also be formatted with Markdown to make it bold or
italic.
Figure 4.1 Markdown text formatting. The code version is on the left; the rendered version is on the
right.
Figure 4.2 Markdown block formatting. Code (left) and rendered output (right).
While the URL is most commonly an absolute path to a resource on the web, it can also be a relative
path to another file on the same machine (the file path is relative to the Markdown document that
contains the link). This is particularly useful for linking from one Markdown file to another (e.g., if
the documentation for a project is spread across multiple pages).
4.1.4 Images
Markdown also supports the rendering of images in your documents, which allows you to include
diagrams, charts, and pictures in your documentation. The syntax for including images is similar to
that for hyperlinks, except with an exclamation point ! before the link to indicate that it should be
shown as an image:
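![text to display](https://some/url/or/path/to/image.png)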
When shown as an image, the "text to display" becomes an alternate text description for the image,
which will be displayed if the image itself cannot be shown (e.g., if it fails to load). This is particularly
important for the accessibility of the documents you create, as anyone using a screen reader will have
that description read aloud in place of the image.
As with hyperlinks, the path to an image can be an absolute path (for referencing images on the
web), or a relative path to an image file on the same machine (the file path is relative to the
Markdown document). Specifying the correct path is the most common problem when rendering
images in Markdown; make sure to review paths (Section 2.2.3) if you have any trouble rendering
your image.
4.1.5 Tables
While syntax for tables isn’t supported in all Markdown environments, tables can be shown on
GitHub and in many other rendering engines. Tables are useful for organizing content, though
they are somewhat verbose to express in markup syntax. For example, Table 4.2 describing
Markdown syntax and formatting was written using the following Markdown syntax:
| Syntax | Formatting |
| :-------------| :--------------------------------------------------------------- |
| `#` | Header (use `##` for second level, `###` for third level, etc.) |
| ```` ``` ```` | Code section (3 backticks) that encapsulate the code |
| `-` | Bulleted/unordered lists (hyphens) |
| `>` | Block quote |
This is known as a pipe table, as columns are separated with the pipe symbol (|). The first line
contains the column headers, followed by a line of hyphens (-), followed by each row of the table
on a new line. The colon (:) next to the hyphens indicates that the content in that column should
be aligned to the left. The outer pipe characters and additional spaces in each row are optional, but
they help keep the code easy to read; it isn’t required to have the pipes line up.
(Note that in the table the triple backticks used for a code section are surrounded by quadruple
backticks to make sure that they are rendered as the ` symbol, and not interpreted as a Markdown
command!)
For other Markdown options—including blockquotes and syntax-colored code blocks—see, for
example, this GitHub Markdown Cheatsheet.4
Indeed, the web portal page for each GitHub repository will automatically render the Markdown file
called README.md (it must have this name) stored in the root directory of the project repo and
display it as project documentation. The README file contains important instructions and details
about the program—it asks you to "read me!" Most public GitHub repositories include a README
4. Markdown Cheatsheet: https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet
that explains the context and usage of the code in the repo. For example, the documentation
describing this book’s exercises is written in this README.md file,5 with individual folders also
having their own README files to explain the code in those specific directories.
Caution: The syntax may vary slightly across programs and services that render Markdown.
For example, Slack doesn’t technically support Markdown syntax (though it is very similar
to Markdown). GitHub in particular has special limitations and extensions to the language;
see the documentationa for details or if you run into problems.
a. https://help.github.com/categories/writing-on-github/
However, it can be helpful to preview your rendered Markdown before posting code to GitHub or
StackOverflow. One of the best ways to do this is to write your Markdown code in a text editor that
supports preview rendering, such as Atom.
To preview what your rendered content will look like, simply open a Markdown file (.md) in Atom.
Then use the command palette6 (or the shortcut ctrl+shift+m) to toggle the Markdown Preview.
Once this preview is open, it will automatically update to reflect any changes to the Markdown
code as you type.
Tip: You can use the command palette to Toggle Github Style for the Markdown preview;
this will make the rendered preview look (mostly) the same as it will when uploaded to
GitHub, though some syntactical differences may still apply.
- Many editors (such as Visual Studio Code7) include automatic Markdown rendering, or have extensions to provide that functionality.
- Stand-alone programs such as MacDown8 (Mac only) will also do the same work, often providing nicer-looking editor windows.
- There are a variety of online Markdown editors that you can use for practice or quick tests. Dillinger9 is one of the nicer ones, but there are plenty of others if you're looking for something more specific.
- A number of Google Chrome Extensions will render Markdown files for you. For example, Markdown Reader10 provides a simple rendering of a Markdown file (note it may differ slightly from the way GitHub would render the document). Once you've installed the extension, you can drag-and-drop a .md file into a blank Chrome tab to view the formatted document. Double-click to view the raw code.
5. See https://github.com/programming-for-data-science/book-exercises/blob/master/README.md
6. Atom Command Palette: http://flight-manual.atom.io/getting-started/sections/atom-basics/#command-palette
7. Visual Studio Code: https://code.visualstudio.com
8. MacDown: Markdown editor (Mac Only): http://macdown.uranusjr.com
9. Dillinger: online Markdown editor: http://dillinger.io
10. Markdown Reader extension for Google Chrome: https://chrome.google.com/webstore/detail/markdown-reader/gpoigdifkoadgajcincpilkjmejcaanc?hl=en
- If you want to render (compile) your markdown to a .pdf file, you can use an Atom extension11 or a variety of other programs to do so.
This chapter introduced Markdown syntax as a helpful tool for formatting documentation about
your code. You will use this syntax to provide information about your code (e.g., in git repository
README.md files), to ask questions about your code (e.g., on StackOverflow), and to present the
results of your code analysis (e.g., using R Markdown, described in Chapter 18). For practice writing
Markdown syntax, see the set of accompanying book exercises.12
11. Markdown to PDF extension for Atom: https://atom.io/packages/markdown-pdf
12. Markdown exercises: https://github.com/programming-for-data-science/chapter-04-exercises
III
Foundational R Skills
This section of the book introduces the fundamentals of the R programming language. In doing so,
it both explains the syntax of the language and describes the core concepts in computer
programming you will need to begin writing code to work with data.
5
Introduction to R
R is an extraordinarily powerful open source software program built for working with data. It is one
of the most popular data science tools because of its ability to efficiently perform statistical analysis,
implement machine learning algorithms, and create data visualizations. R is the primary
programming language used throughout this book, and understanding its foundational operations
is key to being able to perform more complex tasks.
Fun Fact: R is called “R” in part because it was inspired by the language “S,” a language
for Statistics developed by AT&T, and because it was developed by Ross Ihaka and Robert
Gentleman.
In previous chapters, you leveraged formal language to give instructions to your computer, such as
by writing syntactically precise instructions at the command line. Programming in R works in a
similar manner: you write instructions using R’s special language and syntax, which the computer
interprets as instructions for how to work with data.
However, as projects grow in complexity, it becomes useful if you can write down all the
instructions in a single place, and then order the computer to execute all of those instructions at
once. This list of instructions is called a script. Executing or “running” a script will cause each
instruction (line of code) to be run in order, one after the other, just as if you had typed them in one
by one. Writing scripts allows you to save, share, and reuse your work. By saving instructions in a
file (or set of files), you can easily check, change, and re-execute the list of instructions as you figure
out how to use data to answer questions. And, because R is an interpreted language, rather than a
compiled language like C or Java, R programming environments give you the ability to separately
execute each individual line of code in your script if you desire.
As you begin working with data in R, you will be writing multiple instructions (lines of code) and
saving them in files with the .R extension, representing R scripts. You can write this R code in any
text editor (such as Atom), but we recommend you usually use RStudio, a program that is
specialized for writing and running R scripts.
When you open the RStudio program, you will see an interface similar to that in Figure 5.1. An
RStudio session usually involves four sections (“panes”), though you can customize this layout if
you wish:
- Script: The top-left pane is a simple text editor for writing your R code as different script files.
While it is not as robust as a text editing program like Atom, it will colorize code,
auto-complete text, and allow you to easily execute your code. Note that this pane is hidden
if there are no open scripts; select File > New File > R Script from the menu to create
a new script file.
Figure 5.1 RStudio’s user interface, showing a script file. Red notes are added.
To execute (run) the code you write, you have two options:
1. You can execute a section of your script by selecting (highlighting) the desired code
and clicking the “Run” button (or use the keyboard shortcut1 : cmd+enter on Mac, or
ctrl+enter on Windows). If no lines are selected, this will run the line currently
containing the cursor. This is the most common way to execute code in RStudio.
Tip: Use cmd+a (Mac) or ctrl+a (Windows) to select the entire script!
2. You can execute an entire script by clicking the “Source” button (at the top right of the
Script pane, or via shift+cmd+enter) to execute all lines of code in the script file, one
at a time, from top to bottom. This command will treat the current script file as the
“source” of code to run. If you check the “Source on Save” option, your entire script
will be executed every time you save the file (which may or may not be appropriate,
depending on the complexity of your script and its output). You can also hover your
mouse over this or any other button to see keyboard shortcuts.
Fun Fact: The Source button actually calls an R function called source(),
described in Chapter 14.
- Console: The bottom-left pane is a console for entering R commands. This is identical to an
interactive session you would run on the command line, in which you can type and execute
one line of code at a time. The console will also show the printed results of executing the
code from the Script pane. If you want to perform a task once, but don’t want to save that task
in your script, simply type it in the console and press enter.
Tip: Just as with the command line, you can use the up arrow to easily access previ-
ously executed lines of code.
- Plots, packages, help, etc.: The bottom-right pane contains multiple tabs for accessing a
variety of information about your program. When you create visualizations, those plots will
be rendered in this section. You can also see which packages you have loaded or look up
information about files. Most importantly, you can access the official documentation for the
R language in this pane. If you ever have a question about how something in R works, this is a
good place to start!
1. RStudio Keyboard Shortcuts: https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts
Note that you can use the small spaces between the quadrants to adjust the size of each area to your
liking. You can also use menu options to reorganize the panes.
Tip: RStudio provides a built-in link to a “Cheatsheet” for the IDE—as well as for other pack-
ages described in this text—through the Help > Cheatsheets menu.
With the R software installed, you can start an interactive R session on a Mac by typing R (or
lowercase r) into the Terminal to run the R program. This will start the session and provide you
with some information about the R language, as shown in Figure 5.2.
Notice that this description also includes instructions on what to do next—most importantly, "Type
'q()' to quit R."
Remember: Always read the output carefully when working on the command line!
Once you’ve started running an interactive R session, you can begin entering one line of code at a
time at the prompt (>). This is a nice way to experiment with the R language or to quickly run some
code. For example, you can try doing some math at the command prompt (e.g., enter 1 + 1 and
see the output).
It is also possible to run entire scripts from the command line by using the RScript program,
specifying the .R file you wish to execute, as shown in Figure 5.3. Entering the command shown in
Figure 5.3 in the terminal would execute each line of R code written in the analysis.R file,
performing all of the instructions that you had saved there. This is helpful if your data has changed,
and you want to recalculate the results of your analysis using the same instructions.
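For example, on a Mac you might run something like the following (assuming your script is saved in a file named analysis.R, as in Figure 5.3):

# Execute all of the R code saved in the analysis.R script
Rscript analysis.R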
On Windows (and some other operating systems), you may need to tell the computer where to find
the R and RScript programs to execute—that is, the path to these programs. You can do this by
specifying the absolute path to the R.exe program when you execute it, as in Figure 5.3.
Going Further: If you use Windows and plan to run R from the command line regularly
(which is not required or even suggested in this book), a better solution is to add the folder
containing these programs to your computer’s PATH variable. This system-level variable con-
tains a list of folders that the computer searches when finding programs to execute. The rea-
son the computer knows where to find the git.exe program when you type git in the com-
mand line is because that program is “on the PATH.”
In Windows, you can add the R.exe and RScript.exe programs to your computer’s PATH by
editing your machine’s environment variables through the Control Panel.a Overall, using R
from the command line can be tricky; we recommend you just use RStudio instead as you’re
starting out.
a. https://helpdeskgeek.com/windows-10/add-windows-path-environment-variable/
Figure 5.3 Using the RScript command to run an R script from a command shell: Mac (top) and
Windows (bottom).
Caution: On Windows, the R interpreter download also installs an “RGui” application (e.g.,
“R x64 3.4.4”), which will likely be the default program for opening .R scripts. Make sure to
use the RStudio IDE for working in R!
Comments should be clear, concise, and helpful. They should provide information that is not
otherwise present or “obvious” in the code itself.
In R, you mark text as a comment by putting it after the pound symbol (#). Everything from the #
until the end of the line is a comment. You put descriptive comments immediately above the code
they describe, but you can also put very short notes at the end of the line of code, as in the
following example (note that the R code syntax used is described in the following section):
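# An illustrative example: calculate the number of minutes in a (non-leap) year
minutes_in_a_year <- 60 * 24 * 365 # 60 minutes/hour, 24 hours/day, 365 days/year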
(You may recognize this # syntax and commenting behavior from command line examples in
previous chapters—because the same syntax is used in a Bash shell!)
In the R language, variable names can contain any combination of letters, numbers, periods (.), or
underscores (_), though they must begin with a letter. Like almost everything in programming,
variable names are case sensitive. It is best practice to make variable names descriptive and
informative about what data they contain. For example, x is not a good variable name, whereas
num_cups_coffee is a good variable name. Throughout this book, we use the formatting suggested
in the tidyverse style guide.2 As such, variable names should be all lowercase letters, separated by
underscores (_). This is also known as snake_case.
2. Tidyverse style guide: http://style.tidyverse.org
Remember: There is an important distinction between syntax and style. The syntax of a lan-
guage describes the rules for writing the code so that a computer can interpret it. Certain
operations are permitted, and others are not. Conversely, styles are optional conventions
that make it easier for other humans to interpret your code. The use of a style guide allows
you to describe the conventions you will follow in your code to help keep things like vari-
able names consistent.
Storing information in a variable is referred to as assigning a value to the variable. You assign a
value to a variable using the assignment operator <-. For example:
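# Store the value 3 in a variable named num_cups_coffee
num_cups_coffee <- 3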
Notice that the variable name goes on the left, and the value goes on the right.
You can see which value (data) is “inside” a variable by either executing that variable name as a line
of code or by using R’s built-in print() function (functions are detailed in Chapter 6):
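# Print the value stored in the variable num_cups_coffee
print(num_cups_coffee) # [1] 3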
The print() function prints out the value (3) stored in the variable (num_cups_coffee). The [1]
in that output indicates that the first element stored in the variable is the number 3—this is
discussed in detail in Chapter 7.
You can also use mathematical operators (e.g., +, -, /, *) when assigning values to variables. For
example, you could create a variable that is the sum of two numbers as follows:
# Use the plus (+) operator to add numbers, assigning the result to a variable
too_much_coffee <- 3 + 4
Once a value (like a number) is in a variable, you can use that variable in place of any other value.
So all of the following statements are valid:
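# Use a variable anywhere you would use its value (these names are illustrative)
num_cups_coffee + 1 # add 1 to the value stored in the variable
total_cups <- num_cups_coffee + 2 # compute a new variable from an old one
cups_per_day <- num_cups_coffee / 2 # variables work with any operator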
In many ways, script files are just note pads where you’ve jotted down the R code you wish to run.
Lines of code can be (and often are) executed out of order, particularly when you want to change or
fix a previous statement. When you do change a previous line of code, you will need to re-execute
that line of code to have it take effect, as well as re-execute any subsequent lines if you want them to
use the updated value.
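For example, consider a short script like the following (the second variable name here is just illustrative):

num_cups_coffee <- 3 # line 1
cups_of_tea <- 2 # line 2
caffeine_level <- num_cups_coffee + cups_of_tea # line 3
print(caffeine_level) # line 4: prints [1] 5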
Executing all of the lines of code one after another would assign the variables and print a value of 5. If
you edited line 1 to say num_cups_coffee <- 4, the computer wouldn't do anything different
until you re-executed the line (by selecting it and pressing cmd+enter). And re-executing line 1
wouldn't cause a new value to be printed, since the print command occurs at line 4! If you then
re-executed line 4 (by selecting that line and pressing cmd+enter), it would still print out
5—because you haven't told R to recalculate the value of caffeine_level! You would need to
re-execute all of the lines of code (e.g., by selecting them all and pressing cmd+enter) to have your script
print out the desired (new) value of 6. This kind of behavior is common for computer programming
languages (though different from environments like Excel, where values are automatically updated
when you change other referenced cells).
Going Further: In statically typed languages, you need to declare the type of variable you
want to create. For example, in the Java programming language (which is not used in this
text), you have to indicate the type of variable you want to create: if you want the integer 10
to be stored in the variable my_num, you would have to write int my_num = 10 (where int
indicates that my_num will be an integer).
- Numeric: The default computational data type in R is numeric data, which consists of the
set of real numbers (including decimals). You can use mathematical operators on numeric
data (such as +, -, *, /, etc.). There are also numerous functions that work on numeric data
(such as for calculating sums or averages).
Note that you can use multiple operators in a single expression. As in algebra, parentheses
can be used to enforce order of operations:
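# Parentheses make the addition happen before the multiplication
(3 + 4) * 2 # returns 14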
- Character: Character data stores strings of characters (e.g., letters, special characters,
numbers) in a variable. You specify that information is character data by surrounding it with
either single quotes (') or double quotes ("); the tidyverse style guide suggests always using
double quotes.
Note that character data is still data, so it can be assigned to a variable just like numeric data.
There are no special operators for character data, though there are many built-in functions
for working with strings.
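# Store a string of characters in a variable (an illustrative example)
favorite_food <- "pizza"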
Caution: If you see a plus sign (+) in the terminal as opposed to the typical greater
than symbol (>)—as in Figure 5.4—you have probably forgotten to close a quotation
mark. If you find yourself in this situation, you can press the esc key to cancel the line
of code and start over. This will also work if you forget to close a set of parentheses (())
or brackets ([]).
- Logical: Logical (boolean) data types store "yes-or-no" data. A logical value can be one of two
values: TRUE or FALSE. Importantly, these are not the strings "TRUE" or "FALSE"; logical
values are a different type! If you prefer, you can use the shorthand T or F in lieu of TRUE and
FALSE in variable assignment.
Fun Fact: Logical values are called “booleans” after mathematician and logician
George Boole.
Figure 5.4 An unclosed statement in the RStudio console: press the esc key to cancel the statement
and return to the command prompt.
Logical values are most commonly produced by applying a relational operator (also called a
comparison operator) to some other data. Comparison operators are used to compare
values and include < (less than), > (greater than), <= (less than or equal), >= (greater than or
equal), == (equal), and != (not equal). Here are a few examples:
num_wheels <- 4 # an illustrative variable
num_wheels > 2 # returns logical value TRUE
# Equivalently, you can compare values that are not stored in variables
6 == 8 # returns logical value FALSE
If you want to write a more complex logical expression (i.e., for when something is true and
something else is false), you can do so using logical operators (also called boolean
operators). These include & (and), | (or), and ! (not).
# Illustrative values: a five-person band with 38 guitar strings among them
total_strings <- 38
total_band_members <- 5
# Are there fewer than 30 total strings AND fewer than 6 band members?
total_strings < 30 & total_band_members < 6 # FALSE
# Are there fewer than 30 total strings OR fewer than 6 band members?
total_strings < 30 | total_band_members < 6 # TRUE
It’s easy to write complex—even overly complex—expressions with logical operators. If you
find yourself getting lost in your logic, we recommend rethinking your question to see if
there is a simpler way to express it!
- Integer: Integer (whole-number) values are technically a different data type than numeric
values because of how they are stored and manipulated by the R interpreter. This is
something that you will rarely encounter, but it’s good to know that you can specify that a
number is of the integer type rather than the general numeric type by placing a capital L (for
“long integer”) after a value in variable assignment (my_integer <- 10L). You will rarely
do this intentionally, but this is helpful for answering the question, Why is there an L after my
number…?
- Complex: Complex (imaginary) numbers have their own data storage type in R, and are
created by placing an i after the number: complex_variable <- 2i. We will not be using
complex numbers in this book, as they are rarely important for data science.
1. Read the error messages: If there is an issue with the way you have written or executed your
code, R will often print out an error message in your console (in red in RStudio). Do your best
to decipher the message—read it carefully, and think about what is meant by each word in
the message—or you can put that message directly into Google to search for more
information. You will soon get the hang of interpreting these messages if you put the time
into trying to understand them. For example, Figure 5.5 shows the result of accidentally
mistyping a variable name. In that error message, R indicated that the object cty was not
found. This makes sense, because the code never defined a variable cty (the variable was
called city).
2. Google: When you’re trying to figure out how to do something, it should come as no surprise
that search engines such as Google are often the best resource. Try searching for queries like
"how to DO_THING in R". More frequently than not, your question will lead you to a Q&A
forum called StackOverflow (discussed next), which is a great place to find potential answers.
Figure 5.5 RStudio showing an error message due to a typo (there is no variable cty).
Tip: There is a classical method of fixing errors called rubber duck debugging, which
involves trying to explain your code/problem to an inanimate object (talking to pets
works too). You will usually be able to fix the problem if you just step back and think
about how you would explain it to someone else!
You can also look up help by using the help() function (e.g., help(print) will look up
information on the print() function, just as ?print does). There is also an example()
function you can call to see examples of a function in action (e.g., example(print)). This
will be more applicable starting in Chapter 6.
5. RStudio Community: RStudio recently launched an online community4 for R users. The
intention is to build a more positive online community for getting programming help with R
and engaging with the open source community using the software.
3. RDocumentation.org: https://www.rdocumentation.org
4. RStudio Community: https://community.rstudio.com
Luckily, you’re not alone in this process! There is a huge number of resources that you can use to
help you learn R or any other topic in programming or data science. This section provides an
overview and examples of the types of resources you might use.
- Books: Many excellent text resources are available both in print and for free online. Books
can provide a comprehensive overview of a topic, usually with a large number of examples
and links to even more resources. We typically recommend them for beginners, as they help
to cover all of the myriad steps involved in programming and their extensive examples help
inform good programming habits. Free online books are easily accessible (and allow you to
copy-and-paste code examples), but physical print can provide a useful point of reference
(and typing out examples is a great way to practice).
For learning R in particular, R for Data Science5 is one of the best free online textbooks,
covering the programming language through the lens of the tidyverse collection of
packages (which are used in this book as well). Excellent print books include R for Everyone6
and The Art of R Programming.7
- Tutorials and videos: The internet is also host to a large number of more informal
explanations of programming concepts. These range from mini-books (such as the
opinionated but clear introduction aRrgh: a newcomer’s (angry) guide to R8 ), to tutorial series
(such as those provided by R Tutor 9 or Quick-R10 ), to focused articles and guides (e.g., posts on
R-bloggers11 ), to particularly informative StackOverflow responses. These smaller guides are
particularly useful when you’re trying to answer a specific question or clarify a single
concept—when you want to know how to do one thing, not necessarily understand the
entire language. In addition, many people have created and shared online video tutorials
(such as Pearson’s LiveLessons12 ), often in support of a course or textbook. Video code
blogging is even more common in other programming languages such as JavaScript. Video
demonstrations are great at showing you how to actually use a programming concept in
practice—you can see all the steps that go into a program (though there is no substitute for
doing it yourself).
Because such guides can be created and hosted by anyone, the quality and accuracy may
vary. It’s always a good idea to confirm your understanding of a concept with multiple
sources (do multiple tutorials agree?), with your own experience (does the solution actually
work for your code?), and your own intuition (does that seem like a sensible explanation?).
In general, we encourage you to start with more popular or official guides, as they are more
likely to encourage best practices.
- Interactive tutorials and courses: The best way to learn any skill is by doing it, and there are
multiple interactive websites that will let you learn and practice programming right in your
web browser. These are great for seeing topics in action or for experimenting with different
approaches. For example, swirl13 provides interactive tutorials that you complete directly
within an R session.
5. Wickham, H., & Grolemund, G. (2016). R for Data Science. O'Reilly Media, Inc. http://r4ds.had.co.nz
6. Lander, J. P. (2017). R for Everyone: Advanced Analytics and Graphics (2nd ed.). Boston, MA: Addison-Wesley.
7. Matloff, N. (2011). The Art of R Programming: A Tour of Statistical Software Design. San Francisco, CA: No Starch Press.
8. aRrgh: a newcomer's (angry) guide to R: http://arrgh.tim-smith.us
9. R Tutor: http://www.r-tutor.com/; start with the introduction at http://www.r-tutor.com/r-introduction
10. Quick-R: https://www.statmethods.net/index.html; be sure and follow the hyperlinks.
11. R-Bloggers: https://www.r-bloggers.com
12. LiveLessons video tutorials: https://www.youtube.com/user/livelessons
The most popular set of interactive tutorials for R programming is provided by DataCamp14 and is presented as online courses (sequences of explanations and exercises through which you learn a skill) on different topics. DataCamp provides videos and interactive exercises for a wide range of data science topics. While most of the introductory
courses (e.g., Introduction to R15 ) are free, more advanced courses require you to sign up and
pay for the service. Nevertheless, even at the free level, this is an effective set of resources for
picking up new skills.
In addition to these informal interactive courses, it is possible to find more formal online
courses in R and data science through massive open online course (MOOC) services such as
Coursera16 or Udacity.17 For example, the Data Science at Scale18 course from the University of
Washington offers a deep introduction to data science (though it assumes some
programming experience, so it may be more appropriate for after you’ve finished this book!).
Note that these online courses almost always require a fee, though you can sometimes earn university credit or certifications from them.
■ Documentation: One of the best places to start out when learning a programming concept
is the official documentation. In addition to the base R documentation described in the
previous section, many system creators will produce useful “getting started” guides and
references—called “vignettes” in the R community—that you can use (to encourage
adoption of their tool). For example, the dplyr package (described in great detail in
Chapter 11) has an official “getting started” summary on its homepage19 as well as a complete
reference.20 Further detail on a package may also often be found linked from that package’s
homepage on GitHub (where the documentation can be kept under version control);
checking the GitHub page for a package or library is often an effective way to gain more
information about it. Additionally, many R packages host their documentation in .pdf
format on CRAN’s website; to learn to use a package, you will need to read its explanation
carefully and try out its examples!
13. swirl interactive tutorial: http://swirlstats.com
14. DataCamp: https://www.datacamp.com/home
15. DataCamp: Introduction to R: https://www.datacamp.com/courses/free-introduction-to-r
16. Coursera: https://www.coursera.org
17. Udacity: https://www.udacity.com
18. Data Science at Scale: online course from the University of Washington: https://www.coursera.org/specializations/data-science
19. dplyr homepage: https://dplyr.tidyverse.org
20. dplyr reference: https://dplyr.tidyverse.org/reference/index.html
This section lists only a few of the many, many resources for learning R. You can find many more
online resources on similar topics by searching for “TOPIC tutorial” or “how to DO_SOMETHING
in R.” You may also find other compilations of resources. For example, RStudio has put together a
list21 of its recommended tutorials and resources.
In the end, remember that the best way to learn about anything—whether about programming or
from a set of data—is to ask questions. For practice writing code in R and familiarizing yourself with
RStudio, see the set of accompanying book exercises.22
21. RStudio: Online Learning resource collection: https://www.rstudio.com/online-learning/
22. Introductory R exercises: https://github.com/programming-for-data-science/chapter-05-exercises
6 Functions
As you begin to take on data science projects, you will find that the tasks you perform will involve
multiple different instructions (lines of code). Moreover, you will often want to be able to repeat
these tasks (both within and across projects). For example, there are many steps involved in
computing summary statistics for some data, and you may want to repeat this analysis for different
variables in a data set or perform the same type of analysis across two different data sets. Planning out and writing your code will be notably easier if you can group together the lines of code associated with each overarching task into a single step.
Functions represent a way for you to add a label to a group of instructions. Thinking about the tasks
you need to perform (rather than the individual lines of code you need to write) provides a useful
abstraction in the way you think about your programming. It will help you hide the details and
generalize your work, allowing you to better reason about it. Instead of thinking about the many
lines of code involved in each task, you can think about the task itself (e.g., compute_summary_
stats()). In addition to helping you better reason about your code, labeling groups of instructions
will allow you to save time by reusing your code in different contexts—repeating the task without
rewriting the individual instructions.
This chapter explores how to use functions in R to group instructions and create code that is flexible enough to analyze multiple data sets. After considering what a function is in a general sense, it
discusses using built-in R functions, accessing additional functions by loading R packages, and
writing your own functions.
In addition to grouping instructions, functions in programming languages like R tend to follow the
mathematical definition of functions, which is a set of operations (instructions!) that are
performed on some inputs and lead to some outputs. Function inputs are called arguments (also
referred to as parameters); specifying an argument for a function is called passing the argument
into the function (like passing a football). A function then returns an output to use. For example,
imagine a function that can determine the largest number in a set of numbers—that function’s
input would be the set of numbers, and the output would be the largest number in the set.
Grouping instructions into reusable functions is helpful throughout the data science process,
including areas such as the following:
■ Data management: You can group instructions for loading and organizing data so they can be applied to multiple data sets.
■ Data analysis: You can store the steps for calculating a metric of interest so that you can repeat your analysis for multiple variables.
■ Data visualization: You can define a process for creating graphics with a particular structure and style so that you can generate consistent reports.
Remember: In this text, we always include empty parentheses () when referring to a func-
tion by name to help distinguish between variables that hold functions and variables that
hold values (e.g., add_values() versus my_value). This does not mean that the function
takes no arguments; instead, it is just a useful shorthand for indicating that a variable holds
a function (not a value).
If you call any of these functions interactively, R will display the returned value (the output) in the
console. However, the computer is not able to “read” what is written in the console—that’s for
humans to view! If you want the computer to be able to use a returned value, you will need to give
that value a name so that the computer can refer to it. That is, you need to store the returned value
in a variable:
# Store the smallest of several numbers in a variable (example values)
smallest_number <- min(1, 6 / 8, 4 / 3) # returns 0.75
# You can then use the variable as usual, such as for a comparison
min_is_greater_than_one <- smallest_number > 1 # returns FALSE
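Function calls can also be nested, so that the output of one function is used directly as an argument to another (a minimal sketch):

# The result of the "inner" function sqrt() is immediately passed to round()
round(sqrt(50), 1) # returns 7.1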
In the last example, the resulting value of the “inner” function—sqrt()—is immediately used as an argument. Because that value is used immediately, you don’t have to assign it a separate variable name. Consequently, it is known as an anonymous variable.
To learn more about any individual function, you can look it up in the R documentation by using
?FUNCTION_NAME as described in Chapter 5.
Tip: Part of learning any programming language is identifying which functions are available
in that language and understanding how to use them. Thus, you should look around and
become familiar with these functions—but do not feel that you need to memorize them! It’s
enough to be aware that they exist, and then be able to look up the name and arguments for
that function. As you can imagine, Google also comes in handy here (i.e., “how to DO_TASK
in R”).
This is just a tiny taste of the many different functions available in R. More functions will be
introduced throughout the text, and you can also see a nice list of options in the R Reference Card2
cheatsheet.
1. Quick-R: Built-in Functions: http://www.statmethods.net/management/functions.html
2. R Reference Card: cheatsheet summarizing built-in R functions: https://cran.r-project.org/doc/contrib/Short-refcard.pdf
Named arguments are written by putting the name of the argument (which is like a variable name),
followed by the equals symbol (=), followed by the value to pass to that argument. For example:
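A minimal sketch, using the paste() function (discussed below) and its sep argument:

# Pass the named argument `sep` to control how paste() separates its inputs
paste("Programming", "in", "R", sep = "...") # returns "Programming...in...R"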
Named arguments are almost always optional (since they have default values), and can be included
in any order. Indeed, many functions allow you to specify arguments either as positional
arguments (called such because they are determined by their position in the argument list) or with
a name. For example, the second positional argument to the round() function can also be
specified as the named argument digits:
# These function calls are all equivalent, though the 2nd is most clear/common
round(3.1415, 3) # 3.142
round(3.1415, digits = 3) # 3.142
round(digits = 3, 3.1415) # 3.142
If you look up the documentation for the paste() function (using ?paste in RStudio), you will see the documentation shown in Figure 6.1. The usage displayed—paste(..., sep = " ", collapse = NULL)—specifies that the function takes
any number of positional arguments (represented by the ...), as well as two additional named
arguments: sep (whose default value is " ", making pasted words default to having a space
between them) and collapse (used when pasting vectors, described in Chapter 7).
Tip: In R’s documentation, functions that require a limited number of unnamed arguments
will often refer to them as x. For example, the documentation for round() is listed as follows:
round(x, digits = 0). The x just means “the data value to run this function on.”
Fun Fact: The mathematical operators (e.g., +) are actually functions in R that take two argu-
ments (the operands). The familiar mathematical notation is just a shortcut.
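For example (a sketch), you can call the + operator as an ordinary function by wrapping its name in backticks:

`+`(2, 3) # equivalent to 2 + 3, returns 5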
Because the R community writes and shares packages of functions, you can benefit from the work of others. (This is the amazing thing about the open source community—people solve problems and then make those solutions available to others.) Popular R packages exist for manipulating data (dplyr), making beautiful graphics (ggplot2), and implementing machine learning algorithms (randomForest).
R packages do not ship with the R software by default, but rather need to be downloaded (once) and
then loaded into your interpreter’s environment (each time you wish to use them). While this may
seem cumbersome, the R software would be huge and slow if you had to install and load all
available packages to do anything with it.
Luckily, it is possible to install and load R packages from within R. The base R software provides the install.packages() function for installing packages, and the library() function for loading
them. The following example illustrates installing and loading the stringr package (which
contains handy functions for working with character strings):
# Install the `stringr` package. Only needs to be done once per computer
install.packages("stringr")
# Load the package (make `stringr` functions available in this `R` session)
library("stringr") # quotes optional here, but best to include them
Caution: When you install a package, you may receive a warning message about the package
being built under a previous version of R. In all likelihood, this shouldn’t cause a problem,
but you should pay attention to the details of the messages and keep them in mind (espe-
cially if you start getting unexpected errors).
Errors installing packages are some of the trickiest to solve, since they depend on machine-
specific configuration details. Read any error messages carefully to determine what the prob-
lem may be.
The install.packages() function downloads the necessary set of R code for a given package
(which explains why you need to do it only once per machine), while the library() function
loads those scripts into your current R session (you connect to the “library” where the package has
been installed). If you’re curious where the library of packages is located on your computer, you can
run the R function .libPaths() to see where the files are stored.
Caution: Loading a package sometimes overrides a function of the same name that is
already in your environment. This may cause a warning to appear in your R terminal, but
it does not necessarily mean you made a mistake. Make sure to read warning messages care-
fully and attempt to decipher their meaning. If the warning doesn’t refer to something that
seems to be a problem (such as overriding existing functions you weren’t going to use), you
can ignore it and move on.
After loading a package with the library() function, you have access to functions that were
written as part of that package. For example, stringr provides a function str_count() that
returns how many times a “substring” appears in a word (see the stringr documentation3 for a
complete list of functions included in that package):
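# A sketch: count how many times the substring "a" appears in "banana"
str_count("banana", "a") # returns 3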
Because there are so many packages, many of them will provide functions with the same names.
You thus might need to distinguish between the str_count() function from stringr and the
str_count() function from somewhere else. You can do this by using the full package name of the
function (called namespacing the function)—written as the package name, followed by a double
colon (::), followed by the name of the function:
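# Make sure to use the str_count() function from the `stringr` package
stringr::str_count("banana", "a") # returns 3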
Much of the work involved in programming for data science involves finding, understanding, and
using these external packages (no need to reinvent the wheel!). A number of such packages will be
discussed and introduced in this text, but you must also be willing to extrapolate what you learn
(and research further examples) to new situations.
Tip: There are packages available to help you improve the style of your R code. The lintra
package detects code that violates the tidyverse style guide, and the stylerb package
applies suggested formatting to your code. After loading those packages, you can run
lint("MY_FILENAME.R") and style_file("MY_FILENAME.R") (using the appropriate
filename) to help ensure you have used good code style.
a. https://github.com/jimhester/lintr
b. http://styler.r-lib.org
3. https://cran.r-project.org/web/packages/stringr/stringr.pdf
The best way to understand the syntax for defining a function is to look at an example:
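For instance, here is a sketch of a function that combines a first and last name into a full name (consistent with the discussion that follows):

# A function named `make_full_name` that takes two arguments
# and returns the full name created from them
make_full_name <- function(first_name, last_name) {
  # Function body: combine the first and last names into a full name
  full_name <- paste(first_name, last_name)

  # The last expression evaluated is the value the function returns
  full_name
}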
# Call the `make_full_name()` function with the values "Alice" and "Kim"
my_name <- make_full_name("Alice", "Kim") # returns "Alice Kim" into `my_name`
Functions are in many ways like variables: they have a name to which you assign a value (using the
same assignment operator: <-). One difference is that they are written using the function keyword
to indicate that you are creating a function and not simply storing a value. Per the tidyverse style
guide,4 functions should be written in snake_case and named using verbs—after all, they define
something that the code will do. A function’s name should clearly suggest what it does (without
becoming too long).
Remember: Although tidyverse functions are written in snake_case, many built-in R func-
tions use a dot . to separate words—for example, install.packages() and is.numeric()
(which determines whether a value is a number and not, for example, a character string).
■ Arguments: The value assigned to the function name uses the syntax function(...) to
indicate that you are creating a function (as opposed to a number or character string). The
words put between the parentheses are names for variables that will contain the values
passed in as arguments. For example, when you call make_full_name("Alice", "Kim"),
the value of the first argument ("Alice") will be assigned to the first variable (first_name),
and the value of the second argument ("Kim") will be assigned to the second variable
(last_name).
Importantly, you can make the argument names anything you want (name_first,
given_name, and so on), just as long as you then use that variable name to refer to the
argument inside the function body. Moreover, these argument variables are available only
while inside the function. You can think of them as being “nicknames” for the values. The
variables first_name, last_name, and full_name exist only within this particular
function; that is, they are accessible within the scope of the function.
■ Body: The body of the function is a block of code that falls between curly braces {} (a “block”
is represented by curly braces surrounding code statements). The cleanest style is to put the
opening { immediately after the arguments list, and the closing } on its own line.
4. tidyverse style guide: http://style.tidyverse.org/functions.html
The function body specifies all the instructions (lines of code) that your function will
perform. A function can contain as many lines of code as you want. You will usually want
more than 1 line to make the effort of creating the function worthwhile, but if you have
more than 20 lines, you might want to break it up into separate functions. You can use the
argument variables in here, create new variables, call other functions, and so on. Basically,
any code that you would write outside of a function can be written inside of one as well!
■ Return value: A function will return (output) whatever value is evaluated in the last
statement (line) of that function. In the preceding example, the final full_name statement
will be returned.
It is also possible to explicitly state what value to return by using the return() function, passing it the value that you wish your function to return. However, it is considered good style to use the return() statement only when you wish to return a value before the final statement is executed (see Section 6.5). Otherwise, you can simply place the value you wish to return as the last line of the function, and it will be returned, as in the sketches below:
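# A sketch using an explicit return() call
make_full_name <- function(first_name, last_name) {
  full_name <- paste(first_name, last_name)
  return(full_name)
}

# Preferred style: the value on the last line is returned automatically
make_full_name <- function(first_name, last_name) {
  full_name <- paste(first_name, last_name)
  full_name
}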
You can call (execute) a function you defined the same way you call built-in functions. When you
do so, R will take the arguments you pass in (e.g., "Alice" and "Kim") and assign them to the
argument variables. It then executes each line of code in the function body one at a time. When it
gets to the last line (or the return() call), it will end the function and return the last expression,
which could be assigned to a different variable outside of the function.
Overall, writing functions is an effective way to group lines of code together, creating an
abstraction for those statements. Instead of needing to think about doing four or five steps at once,
you can just think about a single step: calling the function! This makes it easier to understand your
code and the analysis you need to perform.
For example, consider a function that calculates a person’s body mass index (BMI):
# Calculate body mass index (kg/m^2) given the input in pounds (lbs) and
# inches (inches)
calculate_bmi <- function(lbs, inches) {
  height_in_meters <- inches * 0.0254
  weight_in_kg <- lbs * 0.453592
  bmi <- weight_in_kg / height_in_meters ^ 2
  bmi
}
# Calculate the BMI of a person who is 180 pounds and 70 inches tall
calculate_bmi(180, 70)
Recall that when you execute a function, R evaluates each line of code, replacing the arguments of
that function with the values you supply. When you execute the function (e.g., by calling
calculate_bmi(180, 70)), you are essentially replacing the variable lbs with the value 180, and
replacing the variable inches with the value 70 throughout the function.
But if you try to run each statement in the function one at a time, then the variables lbs and
inches won’t have values (because you never actually called the function)! Thus a strategy for
debugging functions is to assign sample values to your arguments, and then run through the
function line by line. For example, you could do the following (either within the function, in
another part of the script, or just in the console):
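# A sketch: assign sample values to the argument variables so you can run
# each line of the function body on its own
lbs <- 180
inches <- 70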
With those variables assigned, you can run each statement inside the function one at a time,
checking the intermediate results to see where your code makes a mistake—and then you can fix
that line and retest the function! Be sure to delete the temporary variables when you’re done.
Note that while this will identify syntax errors, it will not help you identify logical errors. For example, this strategy will not help if you use the incorrect conversion between inches and meters, or pass the arguments to your function in the incorrect order. Calling calculate_bmi(70, 180) won’t return an error, but it will return a very different BMI than calculate_bmi(180, 70).
Remember: When you pass arguments to functions, order matters! Be sure that you are
passing in values in the order expected by the function.
6.5 Using Conditional Statements
IF something is true
do some lines of code
OTHERWISE
do some other lines of code
In R, you write these conditional statements using the keywords if and else and the following
syntax:
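# General syntax (a sketch; `condition` stands for any logical expression)
if (condition) {
  # lines of code to run if `condition` is TRUE
} else {
  # lines of code to run if `condition` is FALSE
}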
Note that the else needs to be on the same line as the closing curly brace (}) of the if block. It is
also possible to omit the else and its block, in case you don’t want to do anything when the
condition isn’t met.
The condition can be any variable or expression that resolves to a logical value (TRUE or FALSE).
Thus both of the following conditional statements are valid:
# If the porridge temperature exceeds a given threshold, enter the code block
if (porridge_temp > 120) { # expression is true
print("This porridge is too hot!") # will be executed
}
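# The condition can also be a variable that holds a logical value
porridge_is_cold <- porridge_temp < 70 # FALSE, given the temperature above
if (porridge_is_cold) { # expression is false
  print("This porridge is too cold!") # will not be executed
}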
You can further extend the set of conditions evaluated using an else if statement (e.g., an if
immediately after an else). For example:
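# A sketch chaining multiple conditions with else if
if (porridge_temp > 120) {
  print("This porridge is too hot!")
} else if (porridge_temp < 70) {
  print("This porridge is too cold!")
} else {
  print("This porridge is just right!")
}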
Note that a set of conditional statements causes the code to branch—that is, only one block of the
code will be executed. As such, you may want to have one block return a specific value from a
function, while the other block might keep going (or return something else). This is when you
would want to use the return() function:
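# A sketch (the function name and message text are illustrative)
test_food_temp <- function(temp) {
  if (temp > 120) {
    return("This food is too hot!") # exits the function immediately
  }
  "This food is an acceptable temperature" # returned if the condition was not met
}

test_food_temp(150) # returns "This food is too hot!"
test_food_temp(90) # returns "This food is an acceptable temperature"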
Note that this example didn’t use an explicit else clause, but rather just let the function “keep going” when the if condition wasn’t met. While both approaches would be valid (they achieve the same desired result), it’s better code design to avoid else statements when possible and to instead view the if conditional as just handling a “special case.”
Overall, conditionals and functions are ways to organize the flow of code in your program: to
explicitly tell the R interpreter in which order lines of code should be executed. These structures
become particularly useful as programs get large, or when you need to combine code from multiple
script files. For practice using and writing functions, see the set of accompanying book exercises.5
5. Function exercises: https://github.com/programming-for-data-science/chapter-06-exercises
7 Vectors
As you move from practicing R basics to interacting with data, you will need to understand how
that data is stored, and to carefully consider the appropriate structure for the organization,
analysis, and visualization of your data. This chapter covers the foundational concepts for working
with vectors in R. Vectors are the fundamental data type in R, so understanding these concepts is
key to effectively programming in the language. This chapter discusses how R stores information in
vectors, the way in which operations are executed in vectorized form, and how to extract data from
vectors.
Remember: All the elements in a vector need to have the same type (e.g., numeric, character,
logical). You can’t have a vector whose elements include both numbers and character strings.
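For example, you can create a vector of values with the c() (combine) function; this is a minimal sketch, and the variable name reappears in later examples:

# Use the c() function to create a vector of character values
people <- c("Sarah", "Amit", "Zhang")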
When you print out a variable in R, the interpreter prints out a [1] before the value you have stored
in your variable. This is R telling you that it is printing from the first element in your vector (more
on element indexing later in this chapter). When R prints a vector, it prints the elements separated
with spaces, not commas.
You can use the length() function to determine how many elements are in a vector:
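# How many elements are in the vector?
length(people) # returns 3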
Other functions can also help with creating vectors. For example, the seq() function mentioned
in Chapter 6 takes two arguments and produces a vector of the integers between them. An optional third argument specifies the increment between values (the amount to count by in each step):
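# Produce a vector of the numbers from 1 to 70 (a sketch)
one_to_seventy <- seq(1, 70)

# Produce a vector of numbers from 1 to 70, counting by 10
counted_by_tens <- seq(1, 70, 10) # returns 1 11 21 31 41 51 61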
As a shorthand, you can produce a sequence with the colon operator (a:b), which returns a vector
from a to b with the element values being incremented by 1:
# Use the colon operator (:) as a shortcut for the `seq()` function
one_to_seventy <- 1:70
When you print out one_to_seventy (as in Figure 7.1), in addition to the leading [1] that you’ve
seen in all printed results, there are bracketed numbers at the start of each line. These bracketed
numbers tell you the starting position (index) of elements printed on that line. Thus the [1]
means that the printed line shows elements starting at element number 1, a [28] means that the
printed line shows elements starting at element number 28, and so on. This information is
Figure 7.1 Creating a vector using the seq() function and printing the results in the RStudio terminal.
intended to help make the output more readable, so you know where in the vector you are when
looking at a printed line of elements.
Figure 7.2 demonstrates the element-wise nature of the vectorized operations shown in the following
code:
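# Create two vectors and add them together element-wise
v1 <- c(3, 1, 4, 1, 5)
v2 <- c(1, 6, 1, 8, 0)
v3 <- v1 + v2 # returns 4 7 5 9 5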
Vectors support any operators that apply to their “type” (i.e., numeric or character). While you
can’t apply mathematical operators (namely, +) to combine vectors of character strings, you can use
functions like paste() to concatenate the elements of two vectors, as described in Section 7.2.3.
Figure 7.2 Vector operations are applied element-wise: the first element in the resulting vector (v3) is the sum of the first element in the first vector (v1) and the first element in the second vector (v2).
7.2.1 Recycling
Recycling refers to what R does in cases when there are an unequal number of elements in two
operand vectors. If R is tasked with performing a vectorized operation with two vectors of unequal
length, it will reuse (recycle) elements from the shorter vector. For example:
v1 <- c(1, 3, 5, 1, 5)
v2 <- c(1, 2)
# Add vectors of unequal length: the shorter vector's elements are recycled
v3 <- v1 + v2 # returns 2 5 6 3 6
In this example, R first combined the elements in the first position of each vector (1 + 1 = 2).
Then, it combined elements from the second position (3 + 2 = 5). When it got to the third
element (which was present only in v1), it went back to the beginning of v2 to select a value,
yielding 5 + 1 = 6. This recycling is illustrated in Figure 7.3.
Figure 7.3 Recycling values in vector addition. If one vector is shorter than another (e.g., v2), the values will be repeated (recycled) to match the length of the longer vector.
Remember: Recycling will occur no matter whether the longer vector is the first or the sec-
ond operand. In either case, R will provide a warning message if the length of the longer
vector is not a multiple of the shorter (so that there would be elements “left over” from recy-
cling). This warning doesn’t necessarily mean you did something wrong, but you should pay
attention to it because it may be indicative of an error (i.e., you thought the vectors were of
the same length, but made a mistake somewhere).
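Vectorized operations also apply when one of the operands is a single value. For example (a sketch):

v1 <- c(3, 1, 4, 1, 5)
v1 + 4 # returns 7 5 8 5 9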
As you can see (and probably expected), the operation added 4 to every element in the vector.
This sensible behavior occurs because R stores all character, numeric, and boolean values as vectors.
Even when you thought you were creating a single value (a scalar), you were actually creating a
vector with a single element (length 1). When you create a variable storing the number 7 (e.g., with
x <- 7), R creates a vector of length 1 with the number 7 as that single element.
This is why R prints the [1] in front of all results: it’s telling you that it’s showing a vector (which
happens to have one element) starting at element number 1.
# Print out `x`: R displays the vector index (1) in the console
print(x)
# [1] 7
This behavior explains why you can’t use the length() function to get the length of a character
string; it just returns the length of the vector containing that string (which is 1). Instead, you would
use the nchar() function to get the number of characters in a character string.
Thus when you add a “scalar” such as 4 to a vector, what you’re really doing is adding a vector with
a single element 4. As such, the same recycling principle applies, so that the single element is
recycled and applied to each element of the first operand.
This means that you can use nearly any function on a vector, and it will act in the same vectorized,
element-wise manner: the function will result in a new vector where the function’s transformation
has been applied to each individual element in order.
For example, consider the round() function described in Chapter 6. This function rounds the
given argument to the nearest whole number (or number of decimal places if specified).
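# Round a single value (a sketch)
round(1.67) # returns 2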
But recall that the 1.67 in the preceding example is actually a vector of length 1. If you instead pass
a vector containing multiple values as an argument, the function will perform the same rounding
on each element in the vector.
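# Round each element in a vector of numbers (a sketch)
round(c(1.67, 2.14, 3.49), 1) # returns 1.7 2.1 3.5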
Vectorized operations such as these are also possible with character data. For example, the nchar()
function, which returns the number of characters in a string, can be used equivalently for a vector
of length 1 or a vector with many elements inside of it:
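# A sketch, using the `people` vector from earlier in the chapter
nchar("Amit") # returns 4
nchar(people) # returns 5 4 5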
Remember: When you use a function on a vector, you’re using that function on each item in
the vector!
You can even use vectorized functions in which each argument is a vector. For example, the
following code uses the paste() function to paste together elements in two different vectors. Just
as the plus operator (+) performed element-wise addition, other vectorized functions such as
paste() are also implemented element-wise:
colors <- c("Green", "Blue")
locations <- c("sky", "grass")
# Use the vectorized paste() operation to paste together the vectors above
band <- paste(colors, locations, sep = "") # returns "Greensky" "Bluegrass"
Notice the same element-wise combination is occurring: the paste() function is applied to the
first elements, then to the second elements, and so on.
This vectorization process is extremely powerful, and is a significant factor in what makes R an
efficient language for working with large data sets (particularly in comparison to languages that
require explicit iteration through elements in a collection).1 To write really effective R code, you will
need to be comfortable applying functions to vectors of data, and getting vectors of data back as
results.
Going Further: As with other programming languages, R does support explicit iteration in
the form of loops. For example, if you wanted to take an action for each element in a vector,
you could do that using a for loop. However, because operations are vectorized in R, there
is no need to explicitly iterate through vectors. While you are able to write loops in R, they are rarely necessary when programming in the language and therefore are not discussed in
this text.
The simplest way to refer to individual elements in a vector is by their index, which is the number of their position in the vector. For example, in the vector c("a", "e", "i", "o", "u"), the "a" (the first element) is at index 1, "e" (the second element) is at index 2, and so on.
Remember: In R, vector elements are indexed starting with 1. This is distinct from most
other programming languages, which are zero-indexed and so reference the first element in
a set at index 0.
You can retrieve a value from a vector using bracket notation. With this approach, you refer to the
element at a particular index of a vector by writing the name of the vector, followed by square
brackets ([]) that contain the index of interest:
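# A sketch, using the `people` vector from earlier
people <- c("Sarah", "Amit", "Zhang")

# Retrieve the second element by its index
second_person <- people[2] # returns "Amit"
print(second_person)
# [1] "Amit"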
1. Vectorization in R: Why? is a blog post by Noam Ross with detailed discussion about the underlying mechanics of vectorization: http://www.noamross.net/blog/2014/4/16/vectorization-in-r--why.html
Caution: Don’t get confused by the [1] in the printed output. It doesn’t refer to which
index you got from people, but rather to the index in the extracted result (e.g., stored in
second_person) that is being printed!
If you specify an index that is out-of-bounds (e.g., greater than the number of elements in the
vector) in the square brackets, you will get back the special value NA, which stands for not available.
Note that this is not the character string "NA", but rather a specific logical value.
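people[10] # returns NA (there is no tenth element)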
If you specify a negative index in the square brackets, R will return all elements except the
(negative) index specified:
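people[-2] # returns "Sarah" "Zhang" (all elements except the second)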
It’s common practice to use the colon operator to quickly specify a range of indices to extract:
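people[1:2] # returns "Sarah" "Amit"

You can also put a vector of logical (boolean) values inside the square brackets; a sketch, with example shoe sizes assumed for illustration:

# A vector of shoe sizes
shoe_sizes <- c(5.5, 11, 7, 8, 4)

# A boolean vector (the same length as `shoe_sizes`)
filter <- c(TRUE, FALSE, FALSE, FALSE, TRUE)

# Use the boolean vector inside the brackets
shoe_sizes[filter] # returns 5.5 4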
R will go through the boolean vector and extract every item at the same position as a TRUE. In the
preceding example, since filter is TRUE at indices 1 and 5, then shoe_sizes[filter] returns a
vector with the elements from indices 1 and 5.
This may seem a bit strange, but it is actually incredibly powerful because it lets you select elements
from a vector that meet a certain criteria—a process called filtering. You perform this filtering
operation by first creating a vector of boolean values that correspond with the indices meeting that
criteria, and then put that filter vector inside the square brackets to return the values of interest:
# Create a boolean vector that indicates if a shoe size is less than 6.5
shoe_is_small <- shoe_sizes < 6.5 # returns T F F F T
# Use the boolean vector to extract only the small shoe sizes
small_shoes <- shoe_sizes[shoe_is_small] # returns the sizes at the TRUE positions
The magic here is that you are once again using recycling: the relational operator < is vectorized,
meaning that the shorter vector (6.5) is recycled and applied to each element in the shoe_sizes
vector, thus producing the boolean vector that you want!
You can even combine the second and third lines of code into a single statement. You can think of
the following as saying shoe_sizes where shoe_sizes is less than 6.5:
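shoe_sizes[shoe_sizes < 6.5] # returns the small shoe sizes in a single expression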
This is a valid statement because the expression inside of the square brackets (shoe_sizes < 6.5)
is evaluated first, producing a boolean vector (a vector of TRUEs and FALSEs) that is then used to
filter the shoe_sizes vector. Figure 7.4 diagrams this evaluation. This kind of filtering is crucial for
being able to ask real-world questions of data sets.
Figure 7.4 A demonstration of vector filtering using relational operators. The value 6 is recycled to match the length of the shoe_sizes vector. The resulting boolean values are used to filter the vector.
You can assign an element at a particular vector index a new value by specifying the index on the
left-hand side of the operation:
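# A sketch with assumed example values
prices <- c(25, 28, 30)

# Change the value stored at index 2 of the vector
prices[2] <- 20 # `prices` is now 25 20 30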
To create a new element in your vector, you need to specify the index in which you want to store
the new value:
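# Assign a value to index 4, which does not yet exist, creating a new element
prices[4] <- 32 # `prices` is now 25 20 30 32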
Of course, there’s no reason that you can’t select multiple elements on the left-hand side and assign
them multiple values. The assignment operator is also vectorized!
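# Assign new values to the first two elements at once
prices[1:2] <- c(26, 21) # `prices` is now 26 21 30 32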
If you try to modify an element at an index that is greater than the length of the vector, R will fill
the vector with NA values:
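# Assign to an index well beyond the current length of the vector
prices[7] <- 40 # `prices` is now 26 21 30 32 NA NA 40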
Since keeping track of indices can be difficult (and may easily change with your data, making the
code fragile), a better approach for adding information at the end of a vector is to create a new
vector by combining an existing vector with new element(s):
# Use the `c()` function to combine the `people` vector and the name "Josh"
more_people <- c(people, "Josh")
print(more_people)
# [1] "Sarah" "Amit" "Zhang" "Josh"
Finally, vector modification can be combined with vector filtering to allow you to replace a specific
subset of values. For example, you could replace all values in a vector that were greater than 10 with
the number 10 (to “cap” the values). Because the assignment operator is vectorized, you can
leverage recycling to assign a single value to each element that has been filtered from the vector:
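# A sketch with assumed values: cap all elements greater than 10 at 10
v1 <- c(3, 24, 14, 2, 8)
v1[v1 > 10] <- 10 # `v1` is now 3 10 10 2 8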
In this example, the number 10 gets recycled for each element in which v1 is greater than 10
(v1[v1 > 10]).
This technique is particularly powerful when wrangling and cleaning data, as it will allow you to
identify and manipulate invalid values or other outliers.
Overall, vectors provide a powerful way of organizing and grouping data for analysis, and will be
used throughout your programming with R. For practice working with vectors in R, see the set of
accompanying book exercises.2
2. Vector exercises: https://github.com/programming-for-data-science/chapter-07-exercises
8 Lists
This chapter covers an additional R data type called a list. Lists are somewhat similar to vectors, but
can store more types of data and usually include more details about that data (with some cost). Lists
are R’s version of a map, which is a common and extremely useful way of organizing data in a
computer program. Moreover, lists are used to create data frames, which are the primary data
storage type used for working with sets of real data in R. This chapter covers how to create and
access elements in a list, as well as how to apply functions to lists.
Elements in a list can also be tagged with names that you can use to easily refer to them. For
example, rather than talking about the list’s “element #1,” you can talk about the list’s
“first_name element.” This feature allows you to use lists to create a type of map. In computer
programming, a map (or “mapping”) is a way of associating one value with another. The most
common real-world example of a map is a dictionary or encyclopedia. A dictionary associates each
word with its definition: you can “look up” a definition by using the word itself, rather than
needing to look up the 3891st definition in the book. In fact, this same data structure is called a
dictionary in the Python programming language!
Caution: The definition of a list in the R language is distinct from how some other languages
use the term “list.” When you begin to explore other languages, don’t assume that the same
terminology implies the same capabilities.
As a result, lists are extremely useful for organizing data. They allow you to group together data like
a person’s name (characters), job title (characters), salary (number), and whether the person is a
member of a union (logical)—and you don’t have to remember whether the person’s name or title
was the first element!
Remember: If you want to label elements in a collection, use a list. While vector elements
can also be tagged with names, that practice is somewhat uncommon and requires a more
verbose syntax for accessing the elements.
However, you can (and should) specify the tags for each element in the list by putting the name of
the tag (which is like a variable name), followed by an equals symbol (=), followed by the value you
want to go in the list and be associated with that tag. This is similar to how named arguments are
specified for functions (see Section 6.2.1). For example:
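# Create a list of information about a person, with tags for each element
person <- list(first_name = "Ada", job = "Programmer", salary = 78000, in_union = TRUE)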
This creates a list of four elements: "Ada", which is tagged with first_name; "Programmer",
which is tagged with job; 78000, which is tagged with salary; and TRUE, which is tagged with
in_union.
Remember: You can have vectors as elements of a list. In fact, each scalar value in the pre-
ceding example is really a vector (of length 1).
However, tags make it easier and less error-prone to access specific elements. In addition, tags help
other programmers read and understand the code—tags let them know what each element in the
list represents, similar to an informative variable name. Thus it is recommended to always tag lists
you create.
Tip: You can get a vector of the names of your list items using the names() function. This
is useful for understanding the structure of variables that may have come from other data
sources.
Because lists can store elements of different types, they can store values that are lists themselves. For
example, consider adding a list of favorite items to the person list in the previous example:
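# A sketch: the `favorites` element is itself a list (specific values assumed)
person <- list(
  first_name = "Ada",
  job = "Programmer",
  salary = 78000,
  in_union = TRUE,
  favorites = list(music = "jazz", food = "pizza")
)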
This data structure (a list of lists) is a common way to represent data that is typically stored in
JavaScript Object Notation (JSON). For more information on working with JSON data, see Chapter 14.
Because list elements are (usually) tagged, you can access them by their tag name rather than by the
index number you used with vectors. You do this by using dollar notation: refer to the element
with a particular tag in a list by writing the name of the list, followed by a $, followed by the
element’s tag (a syntax unavailable to named vectors):
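# Refer to elements in the `person` list by their tags
person$first_name # returns "Ada"
person$salary # returns 78000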
You can almost read the dollar sign as if it were an “apostrophe s” (possessive) in English. Thus,
person$salary would mean “the person list’s salary value.”
Regardless of whether a list element has a tag, you can also access it by its numeric index (i.e., if it is
the first, second, and so on item in the list). You do this by using double-bracket notation. With
this notation, you refer to the element at a particular index of a list by writing the name of the list,
followed by double square brackets ([[]]) that contain the index of interest:
# This is a list (not a vector!), even though elements have the same type
animals <- list("Aardvark", "Baboon", "Camel")
animals[[1]] # returns "Aardvark"
animals[[3]] # returns "Camel"
You can also use double-bracket notation to access an element by its tag if you put a character string
of the tag name inside the brackets. This is particularly useful in cases when the tag name is stored
in a variable:
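# Access an element by its tag using double-bracket notation
person[["first_name"]] # returns "Ada"

# The tag to retrieve can also be stored in a variable
name_of_tag <- "salary"
person[[name_of_tag]] # returns 78000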
Remember that lists can contain complex values (including other lists). Accessing these elements
with either dollar or double-bracket notation will return that “nested” list, allowing you to access
its elements:
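# A sketch of a list that contains another list (specific values assumed)
job_post <- list(
  title = "Data Scientist",
  qualifications = list(experience = "5 years", certification = "data science")
)

# Accessing a nested element returns the "nested" list
job_qualifications <- job_post$qualifications
job_qualifications$experience # returns "5 years"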
In this example, job_qualifications is a variable that refers to a list, so its elements can be
accessed via dollar notation. But as with any operator or function, it is also possible to use dollar
notation on an anonymous value (e.g., a literal value that has not been assigned to a variable). That
is, because job_post$qualifications is a list, you can use bracket or dollar notation to refer to
an element of that list without assigning it to a variable first:
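# Access the nested value directly, without an intermediate variable
job_post$qualifications$experience # returns "5 years"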
This example of “chaining” together dollar-sign operators allows you to directly access elements in
lists with a complex structure: you can use a single expression to refer to the “job-post’s
qualification’s experience” value.
NULL is a special value that means “undefined” (note that it is a special value NULL, not the
character string "NULL"). NULL is somewhat similar to the term NA—the difference is that NA is used
to refer to a value that is missing (such as an empty element in a vector)—that is, a “hole.”
Conversely, NULL is used to refer to a value that is not defined but doesn’t necessarily leave a “hole”
in the data. NA values usually result when you are creating or loading data that may have parts
missing; NULL can be used to remove values. For more information on the difference between these
values, see this R-Bloggers post.1
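For example, a sketch of adding, modifying, and removing a list element by assignment (the tag name is illustrative):

person$vacation_days <- 15 # add a new element to the list
person$vacation_days <- 12 # modify the element's value
person$vacation_days <- NULL # remove the element by assigning NULL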
1. R: NA vs. NULL post on R-Bloggers: https://www.r-bloggers.com/r-na-vs-null/
Remember: Vectors use single-bracket notation for accessing elements by index, but lists use
double-bracket notation for accessing elements by index!
The single-bracket syntax used with vectors isn’t actually selecting values by index; instead, it is
filtering by whatever vector is inside the brackets (which may be just a single element—the index
number to retrieve). In R, single brackets always mean to filter a collection. So if you put single
brackets after a list, what you’re actually doing is getting a filtered sublist of the elements that have
those indices, just as single brackets on a vector return a subset of elements from that vector:
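# A sketch using the `person` list from earlier
person["first_name"] # returns a list containing one tagged element
person[c("first_name", "job")] # returns a list with two elements
person[["first_name"]] # returns the value "Ada" itself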
Notice that with lists you can filter by a vector of tag names (as well as by a vector of element indices).
In short, remember that single brackets return a list, whereas double brackets return a list element.
You almost always want to refer to the value itself rather than a list, so you almost always want to
use double brackets (or better yet—dollar notation) when accessing lists.
In particular, you need to use a function called lapply() (for list apply). This function takes two
arguments: a list you want to operate upon, followed by a function you want to “apply” to each
item in that list. For example:
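# A sketch: apply the paste() function to each element in a list of people,
# passing an additional argument ("!") for the applied function to use
people <- list("Sarah", "Amit", "Zhang")
excited <- lapply(people, paste, "!")
# `excited` is a new list: "Sarah !", "Amit !", "Zhang !"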
Caution: Make sure you pass your actual function to the lapply() function, not a charac-
ter string of your function name (i.e., paste, not "paste"). You’re also not actually calling
that function (i.e., paste, not paste()). Just put the name of the function! After that, you
can include any additional arguments you want the applied function to be called with—for
example, how many digits to round to, or what value to paste to the end of a string.
The lapply() function returns a new list; the original one is unmodified.
You commonly use lapply() with your own custom functions that define what you want to do to
a single element in that list:
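# A sketch (the function name and greeting are illustrative)
greet <- function(name) {
  paste("Hello", name)
}

# Apply the `greet()` function to each element in the `people` list
greetings <- lapply(people, greet)
print(greetings)
# [[1]]
# [1] "Hello Sarah"
#
# [[2]]
# [1] "Hello Amit"
#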
# [[3]]
# [1] "Hello Zhang"
Additionally, lapply() is a member of the “*apply()” family of functions. Each member of this
set of functions starts with a different letter and is used with a different data structure, but
otherwise all work basically the same way. For example, lapply() is used for lists, while sapply()
(simplified apply) works well for vectors. You can use both lapply() and sapply() on vectors; the difference is what the function returns. As you might imagine, lapply() will return a list, while
sapply() will return a vector:
# A vector of people
people <- c("Sarah", "Amit", "Zhang")
lapply(people, greet) # returns a list of greeting strings
sapply(people, greet) # returns a (named) character vector of greeting strings
The sapply() function is really useful only with functions that you define yourself. Most built-in R
functions are vectorized so they will work correctly on vectors when used directly (e.g.,
toupper(people)).
Lists represent an alternative technique to vectors for organizing data in R. In practice, the two data
structures will both be used in your programs, and in fact can be combined to create a data frame
(described in Chapter 10). For practice working with lists in R, see the set of accompanying book
exercises.2
2. List exercises: https://github.com/programming-for-data-science/chapter-08-exercises
IV: Data Wrangling
The following data wrangling chapters provide you with the necessary skills for understanding,
loading, manipulating, reshaping, and exploring data structures. Perhaps the most
time-consuming part of data science is preparing and exploring your data set, and learning how to
perform these tasks programmatically can make the process easier and more transparent.
Mastering these skills is thus vital to being an effective data scientist.
9 Understanding Data
Previous chapters have introduced the basic programming fundamentals for working with data,
detailing how you can tell a computer to do data processing for you. To use a computer to analyze
data, you need to both access a data set and interpret that data set so that you can ask meaningful
questions about it. This will enable you to transform raw data into actionable information.
This chapter provides a high-level overview of how to interpret data sets as you get started doing
data science—it details the sources of data you might encounter, the formats that data may take,
and strategies for determining which questions to ask of that data. Developing a clear mental
model of what the values in a data set signify is a necessary prerequisite before you can program a
computer to effectively analyze that data.
■ Sensors: The volume of data being collected by sensors has increased dramatically in the last decade. Sensors that automatically detect and record information, such as pollution sensors that measure air quality, are now entering the personal data management sphere (think of Fitbit trackers or other step counters). Assuming these devices have been properly calibrated, they offer a reliable and consistent mechanism for data collection.
■ Surveys: Data that is less externally measurable, such as people’s opinions or personal
histories, can be gathered from surveys. Because surveys are dependent on individuals’
self-reporting of their behavior, the quality of data may vary (across surveys, or across
individuals). Depending on the domain, people may have poor recall (i.e., people don’t
remember what they ate last week) or have incentives to respond in a particular way (i.e.,
people may over-report healthy behaviors). The biases inherent in survey responses should
be recognized and, when possible, adjusted for in your analysis.
■ Record keeping: In many domains, organizations use both automatic and manual processes
to keep track of their activities. For example, a hospital may track the length and result of
every surgery it performs (and a governing body may require that hospital to report those
results). The reliability of such data will depend on the quality of the systems used to
produce it. Scientific experiments also depend on diligent record keeping of results.
■ Secondary data analysis: Data can be compiled from existing knowledge artifacts or
measurements, such as counting word occurrences in a historical text (computers can help
with this!).
All of these methods of collecting data can lead to potential concerns and biases. For example,
sensors may be inaccurate, people may present themselves in particular ways when responding to
surveys, record keeping may only focus on particular tasks, and existing artifacts may already
exclude perspectives. When working with any data set, it is vital to consider where the data came
from (e.g., who recorded it, how, and why) to effectively and meaningfully analyze it.
Luckily, there are also plenty of free, nonproprietary data sets that you can work with.
Organizations will often make large amounts of data available to the public to support experiment
duplication, promote transparency, or just see what other people can do with that data. These data
sets are great for building your data science skills and portfolio, and are made available in a variety
of formats. For example, data may be accessed as downloadable CSV spreadsheets (see Chapter 10),
as relational databases (see Chapter 13), or through a web service API (see Chapter 14).
1. U.S. government’s open data: https://www.data.gov
2. Government of Canada open data: https://open.canada.ca/en/open-data
3. Open Government Data Platform India: https://data.gov.in
4. City of Seattle open data portal: https://data.seattle.gov
■ News and journalism: Journalism remains one of the most important contexts in which
data is gathered and analyzed. Journalists do much of the legwork in producing
data—searching existing artifacts, questioning and surveying people, or otherwise revealing
and connecting previously hidden or ignored information. News media usually publish the
analyzed, summative information for consumption, but they also may make the source data
available for others to confirm and expand on their work. For example, the New York Times5
makes much of its historical data available through a web service, while the data politics
blog FiveThirtyEight 6 makes all of the data behind its articles available on GitHub (invalid
models and all).
■ Scientific research: Another excellent source of data is ongoing scientific research, whether
performed in academic or industrial settings. Scientific studies are (in theory) well grounded
and structured, providing meaningful data when considered within their proper scope.
Since science needs to be disseminated and validated by others to be usable, research is often
made publicly available for others to study and critique. Some scientific journals, such as the
premier journal Nature, require authors to make their data available for others to access and
investigate (check out its list7 of scientific data repositories!).
■ Social networks and media organizations: Some of the largest quantities of data produced
occur online, automatically recorded from people’s usage of and interactions with social
media applications such as Facebook, Twitter, or Google. To better integrate these services
into people’s everyday lives, social media companies make much of their data
programmatically available for other developers to access and use. For example, it is possible
to access live data from Twitter,8 which has been used for a variety of interesting analyses.
Google also provides programmatic access9 to most of its many services (including search
and YouTube).
■ Online communities: As data science has rapidly increased in popularity, so too has the
community of data science practitioners. This community and its online spaces are another
great source for interesting and varied data sets and analysis. For example, Kaggle10 hosts a
number of data sets as well as “challenges” to analyze them. Socrata11 (which powers the
Seattle data repository), also collects a variety of data sets (often from professional or
government contributors). Somewhat similarly, the UCI Machine Learning Repository12
maintains a collection of data sets used in machine learning, drawn primarily from
academic sources. And there are many other online lists of data sources as well—including a
dedicated Subreddit /r/Datasets.13
In short, there are a huge number of real-world data sets available for you to work with—whether
you have a specific question you would like to answer, or just want to explore and be inspired.
5. New York Times Developer Network: https://developer.nytimes.com
6. FiveThirtyEight: Our Data: https://data.fivethirtyeight.com
7. Nature: Recommended Data Repositories: https://www.nature.com/sdata/policies/repositories
8. Twitter developer platform: https://developer.twitter.com/en/docs
9. Google APIs Explorer: https://developers.google.com/apis-explorer/
10. Kaggle: “the home of data science and machine learning”: https://www.kaggle.com
11. Socrata: data as a service platform: https://opendata.socrata.com
12. UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php
13. /r/DataSets: https://www.reddit.com/r/datasets/
The field of statistics commonly classifies values into one of four levels, described in Table 9.1.
Nominal data (often equivalently categorical data) is data that has no implicit ordering. For
example, you cannot say that “apples are more than oranges,” though you can indicate that a
particular fruit either is an apple or an orange. Nominal data is commonly used to indicate that an
observation belongs in a particular category or group. You do not usually perform mathematical
analysis on nominal data (e.g., you can’t find the “average” fruit), though you can discuss counts or
distributions. Nominal data can be represented by strings (such as the name of the fruit), but also
by numbers (e.g., “fruit type #1”, “fruit type #2”). Just because a value in a data set is a number, that
does not mean you can do math upon it! Note that boolean values (TRUE or FALSE) are a type of
nominal value.
Ordinal data establishes an order for nominal categories. Ordinal data may be used for
classification, but it also establishes that some groups are greater than or less than others. For
example, you may have classifications of hotels or restaurants as 5-star, 4-star, and so on. There is an
ordering to these categories, but the distances between the values may vary. You are able to find the
minimum, maximum, and even median values of ordinal variables, but you can’t compute a
statistical mean (since ordinal values do not define how much greater one value is than another).
Note that it is possible to treat nominal variables as ordinal by enforcing an ordering, though in
14. Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680. https://doi.org/10.1126/science.103.2684.677
effect this changes the measurement level of the data. For example, colors are usually nominal
data—you cannot say that “red is greater than blue.” This is despite the conventional ordering
based on the colors of a rainbow; when you say that “red comes before blue (in the rainbow),”
you’re actually replacing the nominal color value with an ordinal value representing its position in a
rainbow (which itself is dependent on the ratio value of its wavelength)! Ordinal data is also
considered categorical.
Ratio data (often equivalently continuous data) is the most common level of measurement in
real-world data: data based on population counts, monetary values, or amount of activity is usually
measured at the ratio level. With ratio data, you can find averages, as well as measure the distance
between different values (a feature also available with interval data). As you might expect, you can
also compare the ratio of two values when working with ratio data (i.e., value x is twice as great as
value y).
Interval data is similar to ratio data, except there is no fixed zero point. For example, dates cannot
be discussed in proportional terms (i.e., you wouldn’t say that Wednesday is “twice” Monday).
Therefore, you can compute the distance (interval) between two values (i.e., 2 days apart), but you
cannot compute the ratio between two values. Interval data is also considered continuous.
Identifying and understanding the level of measurement of a particular data feature is important
when determining how to analyze a data set. In particular, you need to know what kinds of
statistical analysis will be valid for that data, as well as how to interpret what that data is measuring.
In practice, most data sets are structured as tables of information, with individual data values
arranged into rows and columns (see Figure 9.1). These tables are similar to how data may be
recorded in a spreadsheet (using a program such as Microsoft Excel). In a table, each row represents
a record or observation: an instance of a single thing being measured (e.g., a person, a sports
match). Each column represents a feature: a particular property or aspect of the thing being
measured (e.g., the person’s height or weight, the scores in a sports game). Each data value can be
referred to as a cell in the table.
Viewed in this way, a table is a collection of “things” being measured, each of which has a particular
value for a characteristic of that thing. And, because all the observations share the same
characteristics (features), it is possible to analyze them comparatively. Moreover, by organizing data
into a table, each data value (cell) can be automatically given two associated meanings: which
observation it is from as well as which feature it represents. This structure allows you to discern
semantic meaning from the numbers: the number 64 in Figure 9.1 is not just some value; it’s
“Ada’s height.”
The table in Figure 9.1 represents a small (even tiny) data set, in that it contains just five
observations (rows). The size of a data set is generally measured in terms of its number of
Figure 9.1 A table of data (of people’s weights and heights). Rows represent observations, while
columns represent features.
observations: a small data set may contain only a few dozen observations, while a large data set may
contain thousands or hundreds of thousands of records. Indeed, “Big Data” is a term that, in part,
refers to data sets that are so large that they can’t be loaded into the computer’s memory without
special handling, and may have billions or even trillions of rows! Yet, even a data set with a relatively
small number of observations can contain a large number of cells if they record a lot of features per
observation (though these tables can often be “inverted” to have more rows and fewer columns;
see Chapter 12). Overall, the number of observations and features (rows and columns) is referred to
as the dimensions of the data set—not to be confused with referring to a table’s “two-dimensional”
data structure (because each data value has two meanings: observation and feature).
Although it is commonly structured in this way, data need not be represented as a single table.
More complex data sets may spread data values across multiple tables (such as in a database; see
Chapter 13). In other complex data structures, each individual cell in the table may hold a vector or
even its own data table. This can cause the table to no longer be two-dimensional, but three- or
more-dimensional. Indeed, many data sets available from web services are structured as “nested
tables”; see Chapter 14 for details.
To work with a data set, you will need at least a basic level of understanding of that problem domain to do any sensible analysis of that data. You
will need to develop a mental model of what the data values mean. This includes understanding
the significance and purpose of any features (so you’re not doing math on contextless numbers),
the range of expected values for a feature (to detect outliers and other errors), and some of the
subtleties that may not be explicit in the data set (such as biases or aggregations that may hide
important causalities).
As a specific example, if you wanted to analyze the table shown in Figure 9.1, you would need to first
understand what is meant by “height” and “weight” of a person, the implied units of the numbers
(inches, centimeters, … or something else?), an expected range (does Ada’s height of 64 mean she is
short?), and other external factors that may have influenced the data (e.g., age).
Remember: You do not need to necessarily be an expert in the problem domain (though it
wouldn’t hurt); you just need to acquire sufficient domain knowledge to work within that
problem domain!
While people’s heights and other data sets discussed in this text should be familiar to most readers,
in practice you are quite likely to come across data from problem domains that are outside of your
personal domain expertise. Or, more problematically, the data set may be from a problem domain
that you think you understand but actually have a flawed mental model of (a failure of
meta-cognition).
For example, consider the data set shown in Figure 9.2, a screenshot taken from the City of Seattle’s
data repository. This data set presents information on Land Use Permits, a somewhat opaque
bureaucratic procedure with which you may be unfamiliar. The question becomes: how would you
acquire sufficient domain knowledge to understand and analyze this data set?
Gathering domain knowledge almost always requires outside research—you will rarely be able to
understand a domain just by looking at a spreadsheet of numbers. To gain general domain
knowledge, we recommend you start by consulting a general knowledge reference: Wikipedia
provides easy access to basic descriptions. Be sure to read any related articles or resources to
improve your understanding: sifting through the vast amount of information online requires
cross-referencing different resources, and mapping that information to your data set.
That said, the best way to learn about a problem is to find a domain expert who can help explain the
domain to you. If you want to know about land use permits, try to find someone who has used one
in the past. The second best solution is to ask a librarian—librarians are specifically trained to help
people discover and acquire basic domain knowledge. Libraries may also support access to more
specialized information sources.
Figure 9.2 A preview of land use permits data from the City of Seattle.15 Content has been edited
for display in this text.
Many publicly available data sets come with summative explanations, instructions for access and
usage, or even descriptions of individual features. This meta-data (data about the data) is the best
way to begin to understand what value is represented by each cell in the table, since the
information comes directly from the source.
For example, Seattle’s land use permits page has a short summary (though you would want to look
up what an “over-the-counter review application” is), provides a number of categories and tags, lists
the dimensions of the data set (14,200 rows as of this writing), and gives a quick description of each
column.
15. City of Seattle: Land Use Permits (access requires a free account): https://data.seattle.gov/Permitting/Land-Use-Permits/uyyd-8gak
Understanding who generated the data set (and how they did so!) will allow you to know where to
find more information about the data—it will let you know who the domain experts are. Moreover,
knowing the source and methodology behind the data can help you uncover hidden biases or
other subtleties that may not be obvious in the data itself. For example, the Land Use Permits page
notes that the data was provided by the “City of Seattle, Department of Planning and
Development” (now the Department of Construction & Inspections). If you search for this
organization, you can find its website.16 This website would be a good place to gain further
information about the specific data found in the data set.
Once you understand this meta-data, you can begin researching the data set itself:
Regardless of the presence of meta-data, you will need to understand the columns of the table to
work with it. Go through each column and check if you understand:
3. For categorical data: what different categories are represented, and what do those mean?
If the meta-data provides a key to the data table, this becomes an easy task. Otherwise, you may
need to study the source of the data to determine how to understand the features, sparking
additional domain research.
Tip: As you read through a data set—or anything really—you should write down the terms
and phrases you are not familiar with to look up later. This will discourage you from (inaccurately)
guessing a term’s meaning, and will help delineate between terms you have and have
not yet clarified.
For example, the Land Use Permits data set provides clear descriptions of the columns in the
meta-data, but looking at the sample data reveals that some of the values may require additional
research. For example, what are the different Permit Types and Decision Types? By going back to the
source of the data (the Department of Construction home page), you can navigate to the Permits
page and then to the “Permits We Issue (A-Z)” to see a full list of possible permit types. This will let
you find out, for example, that “PLAT” refers to “creating or modifying individual parcels of
property”—in other words, adjusting lot boundaries.
To understand the features, you will need to look at some sample observations. Open up the
spreadsheet or table and look at the first few rows to get a sense for what kind of values they have
and what that may say about the data.
16. Seattle Department of Construction & Inspections (access requires a free account): http://www.seattle.gov/dpd/
Depending on the problem domain, a data set may contain a large amount of jargon, both to
explain the data and inside the data itself. Making sure you understand all the technical terms used
will go a long way toward ensuring you can effectively discuss and analyze the data.
Caution: Watch out for acronyms you are not familiar with, and be sure to look them up!
For example, looking at the “Table Preview,” you may notice that many of the values for the “Permit
Type” feature use the term “SEPA.” Searching for this acronym would lead you to a page describing
the State Environmental Policy Act (requiring environmental impact to be considered in how land is
used), as well as details on the “Threshold Determination” process.
Overall, interpreting a data set will require research and work that is not programming. While it
may seem like such work is keeping you from making progress in processing the data, having a
valid mental model of the data is both useful and necessary to perform data analysis.
Consider a motivating question such as “What is the worst disease in the United States?” To answer this question, you will need to understand the problem domain of disease burden
measurement and acquire a data set that is well positioned to address the question. For example,
one appropriate data set would be the Global Burden of Disease17 study performed by the Institute for
Health Metrics and Evaluation, which details the burden of disease in the United States and around
the world.
Once you have acquired this data set, you will need to operationalize the motivating question.
Considering each of the key words, you will need to identify a set of diseases, and then quantify
what is meant by “worst.” For example, the question could be more concretely phrased as any of
these interpretations:
n Which disease causes the largest number of deaths in the United States?
n Which disease causes the most premature deaths in the United States?
Depending on your definition of “worst,” you will perform very different computations and
analysis, possibly arriving at different answers. You thus need to be able to decide what precisely is
meant by a question—a task that requires understanding the nuances found in the question’s
problem domain.
Figure 9.3 shows visualizations that try to answer this very question. The figure contains screenshots
of treemaps from an online tool called GBD Compare.18 A treemap is like a pie chart that is built with
17. IHME: Global Burden of Disease: http://www.healthdata.org/node/835
18. GBD Compare: visualization for global burden of disease: https://vizhub.healthdata.org/gbd-compare/
Figure 9.3 Treemaps from the GBD Compare tool showing the proportion of deaths (top), years of
life lost (middle), and years lived with disability (bottom) attributable to each disease in the United
States.
rectangles: the area of each segment is drawn proportionally to an underlying piece of data. The
additional advantage of the treemap is that it can show hierarchies of information by nesting
different levels of rectangles inside of one another. For example, in Figure 9.3, the disease burden
from each communicable disease (shown in red) is nested within the same segment of each chart.
Depending on how you choose to operationalize the idea of the “worst disease,” different diseases
stand out as the most impactful. As you can see in Figure 9.3, almost 90% of all deaths are caused by
non-communicable diseases such as cardiovascular diseases (CVD) and cancers (Neoplasms), shown
in blue. When you consider the age of death for each person (computing a metric called Years of Life
Lost), this value drops to 80%. Moreover, this metric enables you to identify causes of death that
disproportionately affect young people, such as traffic accidents (Trans Inj) and self-harm, shown in
green (see the middle chart in Figure 9.3). Finally, if you consider the “worst” disease to be that
currently causing the most physical disability in the population (as in the bottom chart in
Figure 9.3), the impacts of musculoskeletal conditions (MSK) and mental health issues (Mental) are
exposed.
Because data analysis is about identifying answers to questions, the first step is to ensure you have a
strong understanding of the question of interest and how it is being measured. Only after you have
mapped from your questions of interest to specific features (columns) of your data can you perform
an effective and meaningful analysis of that data.
10 Data Frames
This chapter introduces data frame values, which are the primary two-dimensional data storage
type used in R. In many ways, data frames are similar to the row-and-column table layout that you
may be familiar with from spreadsheet programs like Microsoft Excel. Rather than interact with
this data structure through a user interface (UI), you will learn how to programmatically and
reproducibly perform operations on this data type. This chapter covers ways of creating, describing,
and accessing data from data frames in R.
Data frames are really just lists (see Chapter 8) in which each element is a vector of the same length.
Each vector represents a column, not a row. The elements at corresponding indices in the vectors
are considered part of the same row (record). This structure makes sense because each row may have
different types of data—such as a person’s name (string) and height (number)—and vector
elements must all be of the same type.
Figure 10.1 A table of data (of people’s weights and heights) when viewed as a data frame in RStudio.
For example, you can think of the data shown in Figure 10.1 as a list of three vectors: name, height,
and weight. The name, height, and weight of the first person measured are represented by the first
elements of the name, height, and weight vectors, respectively.
You can work with data frames as if they were lists, but data frames have additional properties that
make them particularly well suited for handling tables of data.
# A vector of names
name <- c("Ada", "Bob", "Chris", "Diya", "Emma")
# A vector of heights
height <- c(64, 74, 69, 69, 71)
# A vector of weights
weight <- c(135, 156, 139, 144, 152)
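These vectors can then be combined into a single data frame (a sketch of the call the following paragraph describes; the people variable name matches its later use):

# Combine the vectors into a data frame, keeping strings as plain (non-factor) vectors
people <- data.frame(name, height, weight, stringsAsFactors = FALSE)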
The last argument to the data.frame() function is included because one of the vectors contains
strings; it tells R to treat that vector as a typical vector, instead of another data type called a factor
when constructing the data frame. This is usually what you will want to do—see Section 10.3.2 for
more information.
You can also specify data frame column names using the key = value syntax used by named lists
when you create your data frame:
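A sketch of that equivalent construction with named arguments (values repeated from above):

# Create the same data frame, naming each column explicitly
people <- data.frame(
  name = c("Ada", "Bob", "Chris", "Diya", "Emma"),
  height = c(64, 74, 69, 69, 71),
  weight = c(135, 156, 139, 144, 152),
  stringsAsFactors = FALSE
)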
Because data frames are lists, you can access the values from people using the same dollar
notation and double-bracket notation as you use with lists:
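For example (a minimal sketch using the people data frame from above):

# Access a column (vector) using dollar notation
people$height # [1] 64 74 69 69 71

# Access a column (vector) using double-bracket notation
people[["height"]] # [1] 64 74 69 69 71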
For more flexible approaches to accessing data from data frames, see Section 10.2.3.
ncol(people) # [1] 3
dim(people) # [1] 5 3
colnames(people) # [1] "name" "height" "weight"
rownames(people) # [1] "1" "2" "3" "4" "5"
Many of these description functions can also be used to modify the structure of a data frame. For
example, you can use the colnames() function to assign a new set of column names to a data
frame.
Table 10.2 summarizes how single-bracket notation can be used to access data frames. Take special
note of the fourth option’s syntax (for retrieving rows): you still include the comma (,), but
because you leave the column specification blank, you get all of the columns!
# Extract the row with the name "Ada" (and all columns)
people["Ada", ] # note the comma, indicating all columns
Of course, because numbers and strings are stored in vectors, you’re actually specifying vectors of
names or indices to extract. This allows you to get multiple rows or columns:
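A sketch of such extractions (the column and row choices are illustrative):

# Extract the `name` and `height` columns (all rows)
people[, c("name", "height")]

# Extract the second through fourth rows (all columns)
people[2:4, ]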
Additionally, you can use a vector of boolean values to specify your indices of interest (just as you
did with vectors):
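For example (a sketch; the filtering condition is illustrative):

# Extract the rows for people taller than 70 inches (and all columns)
people[people$height > 70, ]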
Remember: The type of data that is returned when selecting data using single brackets
depends on how many columns you are selecting. Extracting values from more than one col-
umn will produce a data frame; extracting from just one column will produce a vector.
Tip: In general, it’s easier, cleaner, and less buggy to filter by column name (character string),
rather than by column number, because it’s not unusual for column order to change in a data
frame. You should almost never access data in a data frame by its positional index. Instead,
you should use the column name to specify columns, and a filter to specify rows of interest.
Going Further: While data frames are the two-dimensional data structure suggested by this
book, they are not the only 2D data structure in R. For example, a matrix is a two-dimensional
data structure in which all of the values have the same type (usually numeric).
To use all the syntax and functions described in this chapter, first confirm that a data object
is a data frame (using is.data.frame()), and if necessary, convert an object to a data frame
(such as by using the as.data.frame() function).
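A minimal sketch of that check-and-convert pattern (the matrix here is only for illustration):

m <- matrix(1:6, nrow = 2) # a matrix, a different 2D structure
is.data.frame(m)           # [1] FALSE
m_df <- as.data.frame(m)   # convert the matrix to a data frame
is.data.frame(m_df)        # [1] TRUE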
Most spreadsheet programs, such as Microsoft Excel, Numbers, and Google Sheets, are just
interfaces for formatting and interacting with data that is saved in this format. These programs
easily import and export .csv files. But note that .csv files are unable to save the formatting and
calculation formulas used in those programs—a .csv file stores only the data!
You can load the data from a .csv file into R by using the read.csv() function:
# Read data from the file `my_file.csv` into a data frame `my_df`
my_df <- read.csv("my_file.csv", stringsAsFactors = FALSE)
Again, use the stringsAsFactors argument to make sure string data is stored as a vector rather
than as a factor (see Section 10.3.2 for details). This function will return a data frame just as if you
had created it yourself.
Remember: If an element is missing from a data frame (which is very common with real-world data), R will fill that cell with the logical value NA, meaning “not available.” There are multiple ways to handle this in an analysis; you can filter for those values using bracket notation to replace them, exclude them from your analysis, or impute them using more sophisticated techniques.
a. See, for example, http://www.statmethods.net/input/missingdata.html
Conversely, you can write data to a .csv file using the write.csv() function, in which you specify
the data frame you want to write, the filename of the file you want to write the data to, and other
optional arguments:
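A sketch of such a call (the file name and the row.names argument are illustrative):

# Write the `people` data frame to the file `people.csv`, omitting row names
write.csv(people, "people.csv", row.names = FALSE)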
Additionally, there are many data sets you can explore that ship with the R software. You can see a
list of these data sets using the data() function, and begin working with them directly (try
View(mtcars) as an example). Moreover, many packages include data sets that are well suited for
demonstrating their functionality. For a robust (though incomplete) list of more than 1,000 data
sets that ship with R packages, see this webpage.1
Like the command line, the R interpreter (running inside RStudio) has a current working
directory from which all file paths are relative. The trick is that the working directory is not necessarily
the directory of the current script file! This makes sense, as you may have many files open in RStudio at
the same time, and your R interpreter can have only one working directory.
Just as you can view the current working directory when on the command line (using pwd), you can
use an R function to view the current working directory when in R:
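That function is getwd():

# Print the current working directory to the console
getwd()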
You often will want to change the working directory to be your project’s directory (wherever your
scripts and data files happen to be; often the root of your project repository). It is possible to change
the current working directory using the setwd() function. However, this function also takes an
absolute path, so doesn’t fix the problem of working across machines. You should not include this
absolute path in your script (though you could use it from the console).
A better solution is to use RStudio itself to change the working directory. This is reasonable because
the working directory is a property of the current running environment, which is what RStudio makes
accessible. The easiest way to do this is to use the Session > Set Working Directory menu
option (see Figure 10.2): you can either set the working directory To Source File Location (the
folder containing whichever .R script you are currently editing; this is usually what you want), or
you can browse for a particular directory with Choose Directory.
As a specific example, consider trying to load the my-data.csv file from the analysis.R script,
given the folder structure illustrated in Figure 10.3. In your analysis.R script you want to be able
to use a relative path to access your data (my-data.csv). In other words, you don’t want to have to
specify the absolute path (/Users/YOUR_NAME/Documents/projects/analysis-project/
data/my-data.csv) to find this. Instead, you want to provide instructions on how your program
can find your data file relative to where you are working (in your analysis.R file). After setting the
working directory to the project folder, you will be able to use the relative path to find it:
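A sketch of that relative-path load (the my_data variable name is illustrative):

# Read the data file using a path relative to the working directory
my_data <- read.csv("data/my-data.csv", stringsAsFactors = FALSE)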
1. R Package Data Sets: https://vincentarelbundock.github.io/Rdatasets/datasets.html
Figure 10.2 Use Session > Set Working Directory to change the working directory through
RStudio.
Figure 10.3 The folder structure for a sample project. Once you set the working directory in RStudio,
you can access the my-data.csv file from the analysis.R script using the relative path
data/my-data.csv.
Factors are a data structure for optimizing variables that consist of a finite set of categories (i.e., they
are categorical variables). For example, imagine that you had a vector of shirt sizes that could take
on only the values small, medium, or large. If you were working with a large data set (thousands of
shirts!), it would end up taking up a lot of memory to store the character strings (5+ letters per word
at 1 or more bytes per letter) for each of those variables.
A factor would instead store a number (called a level) for each of these character strings—for
example, 1 for small, 2 for medium, or 3 for large (though the order of the numbers may vary).
R will remember the relationship between the integers and their labels (the strings). Since each
number takes just 2–4 bytes (rather than 1 byte per letter), factors allow R to keep much more
information in memory.
To see how factor variables appear similar to (but are actually different from) vectors, you can create
a factor variable using as.factor():
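A sketch of that conversion (the example shirt sizes are illustrative):

# A character vector of shirt sizes
shirt_sizes <- c("small", "medium", "small", "large", "medium", "large")

# Convert to a vector of factor data
shirt_sizes_factor <- as.factor(shirt_sizes)

# Print the factor: R shows the labels, plus the possible levels
print(shirt_sizes_factor)
# [1] small  medium small  large  medium large
# Levels: large medium small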
When you print out the shirt_sizes_factor variable, R still (intelligently) prints out the labels
that you are presumably interested in. It also indicates the levels, which are the only possible values
that elements can take on.
It is worth restating: factors are not vectors. This means that most of the operations and functions
you want to use on vectors will not work:
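An illustrative example of that mismatch (continuing with the shirt_sizes_factor sketched above):

# Vectorized character functions reject factors
nchar(shirt_sizes_factor)
# Error: 'nchar()' requires a character vector (or one coercible to)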
If you create a data frame with a string vector as a column (as happens with read.csv()), it will
automatically be treated as a factor unless you explicitly tell it not to be:
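A sketch of that behavior (strings-to-factors was the default before R 4.0; the option is written explicitly here so the example is reproducible):

# A data frame whose `size` column is stored as a factor
shirts <- data.frame(size = c("small", "medium", "large"), stringsAsFactors = TRUE)

# Attempting to store a value that is not an existing level produces an NA
shirts$size[1] <- "x-large" # Warning: invalid factor level, NA generated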
The NA produced in the preceding example can be avoided if the stringsAsFactors option is set
to FALSE:
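A sketch of the same construction with the option set:

# Keep the strings as a plain character vector instead of a factor
shirts <- data.frame(size = c("small", "medium", "large"), stringsAsFactors = FALSE)
shirts$size[1] <- "x-large" # no warning, and no NA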
This is not to say that factors can’t be useful (beyond just saving memory)! They offer easy ways to
group and process data using specialized functions:
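One hedged sketch of such grouped processing, using the base R tapply() function (the prices and sizes are illustrative):

# Mean price of shirts for each size level
prices <- c(15.5, 17, 21, 16, 17.5, 22)
sizes <- factor(c("small", "medium", "large", "small", "medium", "large"))
tapply(prices, sizes, mean)
#  large medium  small
#  21.50  17.25  15.75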
While this is a handy use of factors, you can easily do the same type of aggregation without them
(as shown in Chapter 11).
In general, the skills associated with this text are more concerned with working with data as
vectors. Thus you should always use stringsAsFactors = FALSE when creating data frames or
loading .csv files that include strings.
This chapter has introduced the data frame as the primary data structure for working with
two-dimensional data in R. Moving forward, almost all analysis and visualization work will depend
on working with data frames. For practice working with data frames, see the set of accompanying
book exercises.2
2. Data frame exercises: https://github.com/programming-for-data-science/chapter-10-exercises
11 Manipulating Data with dplyr
The dplyr1 (“dee-ply-er”) package is the preeminent tool for data wrangling in R (and perhaps in
data science more generally). It provides programmers with an intuitive vocabulary for executing
data management and analysis tasks. Learning and using this package will make your data
preparation and management process faster and easier to understand. This chapter introduces the
philosophy behind the package and provides an overview of how to use the package to work with
data frames using its expressive and efficient syntax.
n Select specific features (columns) of interest from a data set
n Filter out irrelevant data and keep only observations (rows) of interest
n Mutate a data set by adding more features (columns)
n Arrange observations (rows) in a particular order
n Summarize data in terms of aggregates such as a mean, median, or maximum
n Join multiple data sets together into a single data frame
You can use these words when describing the algorithm or process for interrogating data, and then
use dplyr to write code that will closely follow your “plain language” description because it uses
1. dplyr: http://dplyr.tidyverse.org
functions and procedures that share the same language. Indeed, many real-world questions about a
data set come down to isolating specific rows/columns of the data set as the “elements of interest”
and then performing a basic comparison or computation (e.g., mean, count, max). While it is
possible to perform such computation with base R functions (described in the previous chapters),
the dplyr package makes it much easier to write and read such code.
Since dplyr is an external package, you will need to install it (once per machine) and load it in each
script in which you want to use the functions:
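A sketch of those two steps:

install.packages("dplyr") # once per machine
library("dplyr")          # in each relevant script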
Fun Fact: dplyr is a key part of the tidyverse collection of R packages, which also includes tidyr (Chapter 12) and ggplot2 (Chapter 16). While these packages are discussed individually, you can install and use them all at once by installing and loading the collected "tidyverse" package.
a. https://www.tidyverse.org
After loading the package, you can call any of the functions just as if they were the built-in
functions you’ve come to know and love.
To demonstrate the usefulness of the dplyr package as a tool for asking questions of real data sets,
this chapter applies the functions to historical data about U.S. presidential elections. The
presidentialElections data set is included as part of the pscl package, so you will need to
install and load that package to access the data:
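A sketch of those steps:

install.packages("pscl") # once per machine
library("pscl")          # in each relevant script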
This data set contains the percentage of votes that were cast in each state for the Democratic Party
candidate in each presidential election from 1932 to 2016. Each row contains the state, year,
percentage of Democrat votes (demVote), and whether each state was a member of the former
Confederacy during the Civil War (south). For more information, see the pscl package reference
manual,2 or use ?presidentialElections to view the documentation in RStudio.
2. pscl reference manual: https://cran.r-project.org/web/packages/pscl/pscl.pdf
11.2.1 Select
The select() function allows you to choose and extract columns of interest from your data frame,
as illustrated in Figure 11.1.
The select() function takes as arguments the data frame to select from, followed by the names of
the columns you wish to select (without quotation marks)!
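A sketch of such a call (the votes variable name is illustrative; the columns match Figure 11.1):

# Select `year` and `demVote` (percentage of votes for the Democrat) from
# the `presidentialElections` data frame
votes <- select(presidentialElections, year, demVote)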
This use of select() is equivalent to simply extracting the columns using base R syntax:
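A sketch of that base R equivalent:

# Extract the same columns using bracket notation
votes <- presidentialElections[, c("year", "demVote")]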
While this base R syntax achieves the same end, the dplyr approach provides a more expressive
syntax that is easier to read and write.
Remember: Inside the function argument list (inside the parentheses) of dplyr functions,
you specify data frame columns without quotation marks—that is, you just give the column
names as variable names, rather than as character strings. This is referred to as non-standard
evaluation (NSE).a While this capability makes dplyr code easier to write and read, it can
occasionally create challenges when trying to work with a column name that is stored in a
variable.
If you encounter errors in such situations, you can and should fall back to working with base
R syntax (e.g., dollar sign and bracket notation).
a. http://dplyr.tidyverse.org/articles/programming.html
select(
presidentialElections,
year,
demVote
)
Figure 11.1 Using the select() function to select the columns year and demVote from the
presidentialElections data frame.
This selection of data could be used to explore trends in voting patterns across states, as shown in
Figure 11.2. For an interactive exploration of how state voting patterns have shifted over time, see
this piece by the New York Times.3
Note that the arguments to the select() function can also be vectors of column names—you can
write exactly what you would specify inside bracket notation, just without calling c(). Thus you
can both select a range of columns using the : operator, and exclude columns using the - operator:
# Select columns `state` through `year` (i.e., `state`, `demVote`, and `year`)
select(presidentialElections, state:year)
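A sketch of the exclusion form (using the south column described earlier):

# Select all columns except `south`
select(presidentialElections, -south)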
3. Over the Decades, How States Have Shifted: https://archive.nytimes.com/www.nytimes.com/interactive/2012/10/15/us/politics/swing-history.html
Caution: Unlike with the use of bracket notation, using select() to select a single column
will return a data frame, not a vector. If you want to extract a specific column or value from a
data frame, you can use the pull() function from the dplyr package, or use base R syntax.
In general, use dplyr for manipulating a data frame, and then use base R for referring to
specific values in that data.
11.2.2 Filter
The filter() function allows you to choose and extract rows of interest from your data frame
(contrasted with select(), which extracts columns), as illustrated in Figure 11.3.
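A sketch of such a call (the votes_2008 variable name is illustrative; the condition matches Figure 11.3):

# Select all rows (observations) from the 2008 election
votes_2008 <- filter(presidentialElections, year == 2008)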
The filter() function takes in the data frame to filter, followed by a comma-separated list of
conditions that each returned row must satisfy. Again, column names must be specified without
quotation marks. The preceding filter() statement is equivalent to extracting the rows using the
following base R syntax:
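A sketch of that base R equivalent:

# Extract the same rows using bracket notation and a logical filter
votes_2008 <- presidentialElections[presidentialElections$year == 2008, ]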
The filter() function will extract rows that match all given conditions. Thus you can specify
that you want to filter a data frame for rows that meet the first condition and the second condition
(and so on). For example, you may be curious about how the state of Colorado voted in 2008:
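A sketch of such a multi-condition filter:

# Extract the row(s) for Colorado in 2008 (both conditions must be met)
filter(
  presidentialElections,
  year == 2008,
  state == "Colorado"
)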
filter(
presidentialElections,
year == 2008
)
Figure 11.3 Using the filter() function to select observations from the
presidentialElections data frame in which the year column is 2008.
In cases where you are using multiple conditions—and therefore might be writing really long
code—you should break the single statement into multiple lines for readability (as in the preceding
example). Because you haven’t closed the parentheses on the function arguments, R will treat each
new line as part of the current statement. See the tidyverse style guide4 for more details.
Caution: If you are working with a data frame that has row names (presidentialElections
does not), the dplyr functions will remove row names. If you need to retain these
names, consider instead making them a column (feature) of the data, thereby allowing you
to include those names in your wrangling and analysis. You can add row names as a column
using the mutate function (described in Section 11.2.3):
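A hedged sketch of that pattern (the df data frame and row_names column name are hypothetical):

# A small data frame that happens to have row names
df <- data.frame(x = 1:3, row.names = c("a", "b", "c"))

# Preserve the row names as a regular column before wrangling
df <- mutate(df, row_names = rownames(df))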
11.2.3 Mutate
The mutate() function allows you to create additional columns for your data frame, as illustrated
in Figure 11.4. For example, it may be useful to add a column to the presidentialElections data
frame that stores the percentage of votes that went to other candidates:
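A sketch of such a call (the new column names match Figure 11.4; the result is assigned back, as discussed in the Caution below):

# Add columns for the vote share of other parties and the absolute difference
presidentialElections <- mutate(
  presidentialElections,
  other_parties_vote = 100 - demVote,
  abs_vote_difference = abs(demVote - other_parties_vote)
)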
The mutate() function takes in the data frame to mutate, followed by a comma-separated list of
columns to create using the same name = vector syntax you use when creating lists or data frames
from scratch. As always, the names of the columns in the data frame are specified without
quotation marks. Again, it is common to put each new column declaration on a separate line for
spacing and readability.
Caution: Despite the name, the mutate() function doesn’t actually change the data frame;
instead, it returns a new data frame that has the extra columns added. You will often want to
replace your old data frame variable with this new value (as in the preceding code).
4. tidyverse style guide: http://style.tidyverse.org
mutate(
presidentialElections,
other_parties_vote = 100 - demVote,
abs_vote_difference = abs(demVote - other_parties_vote)
)
Figure 11.4 Using the mutate() function to create new columns on the presidentialElections
data frame. Note that the mutate() function does not actually change a data frame (you need to
assign the result to a variable).
Tip: If you want to rename a particular column rather than adding a new one, you can use
the dplyr function rename(), which is actually a variation of passing a named argument to
the select() function to select columns aliased to different names.
11.2.4 Arrange
The arrange() function allows you to sort the rows of your data frame by some feature (column
value), as illustrated in Figure 11.5. For example, you may want to sort the
presidentialElections data frame by year, and then within each year, sort the rows based on
the percentage of votes that went to the Democratic Party candidate:
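A sketch of such a call:

# Arrange rows by `year`, then by `demVote` within each year
presidentialElections <- arrange(presidentialElections, year, demVote)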
As demonstrated in the preceding code, you can pass multiple arguments into the arrange()
function (in addition to the data frame to arrange). The data frame will be sorted by the column
provided as the second argument, then by the column provided as the third argument (in case of a
“tie”), and so on. Like mutate(), the arrange() function doesn’t actually modify the argument
data frame; instead, it returns a new data frame that you can store in a variable to use later.
By default, the arrange() function will sort rows in increasing order. To sort in reverse (decreasing)
order, place a minus sign (-) in front of the column name (e.g., -year). You can also use the desc()
helper function; for example, you can pass desc(year) as the argument.
11.2.5 Summarize
The summarize() function (equivalently summarise() for those using the British spelling) will
generate a new data frame that contains a “summary” of a column, computing a single value from
the multiple elements in that column. This is an aggregation operation (i.e., it will reduce an entire
column to a single value—think about taking a sum or average), as illustrated in Figure 11.6. For
example, you can calculate the average percentage of votes cast for Democratic Party candidates:
The summarize() function takes in the data frame to aggregate, followed by values that will be
computed for the resulting summary table. These values are specified using name = value syntax,
summarize(
presidentialElections,
mean_dem_vote = mean(demVote),
mean_other_parties = mean(other_parties_vote)
)
Figure 11.6 Using the summarize() function to calculate summary statistics for the presidentialElections data frame.
similar to using mutate() or defining a list. You can use multiple arguments to include multiple
aggregations in the same statement. This will return a data frame with a single row and a different
column for each value that is computed by the function, as shown in Figure 11.6.
The summarize() function produces a data frame (a table) of summary values. If you want to
reference any of those individual aggregates, you will need to extract them from this table using
base R syntax or the dplyr function pull().
You can use the summarize() function to aggregate columns with any function that takes a vector
as a parameter and returns a single value. This includes many built-in R functions such as mean(),
max(), and median(). Alternatively, you can write your own summary functions. For example,
using the presidentialElections data frame, you may want to find the least close election (i.e.,
the one in which the demVote was furthest from 50% in absolute value). The following code
constructs a function to find the value furthest from 50 in a vector, and then applies the function to
the presidentialElections data frame using summarize():
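A hedged sketch of that code (the function and column names are illustrative):

# A function that returns the value in a vector furthest from 50
furthest_from_50 <- function(vec) {
  vec[which.max(abs(vec - 50))] # element with the largest absolute distance from 50
}

# Summarize the data frame, generating a `biggest_landslide` column that
# stores the `demVote` value furthest from 50
summarize(
  presidentialElections,
  biggest_landslide = furthest_from_50(demVote)
)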
The true power of the summarize() function becomes evident when you are working with data
that has been grouped. In that case, each different group will be summarized as a different row in the
summary table (see Section 11.4).
“Which state had the highest percentage of votes for the Democratic Party candidate (Barack
Obama) in 2008?”
1. Filter down to only the observations (rows) from the year 2008.
2. Of the percentages in 2008, filter down to the one with the highest percentage of votes for a
Democrat.
3. Select the name of the state that meets the above criteria.
# Use a sequence of steps to find the state with the highest 2008
# `demVote` percentage
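# (A hedged reconstruction; the intermediate variable names are illustrative,
# while `most_dem_state` matches its later use.)
votes_2008 <- filter(presidentialElections, year == 2008)     # 1. Filter down to only 2008 votes
most_dem_votes <- filter(votes_2008, demVote == max(demVote)) # 2. Filter down to the highest `demVote`
most_dem_state <- select(most_dem_votes, state)               # 3. Select name of the state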
While this approach works, it clutters the work environment with variables you won’t need to use
again. It does help with readability (the result of each step is explicit), but those extra variables make
it harder to modify and change the algorithm later (you have to change them in two places).
An alternative to saving each step as a distinct, named variable would be to use anonymous
variables and nest the desired statements within other functions. While this is possible, it quickly
becomes difficult to read and write. For example, you could write the preceding algorithm as
follows:
# Use nested functions to find the state with the highest 2008
# `demVote` percentage
most_dem_state <- select( # 3. Select name of the state
filter( # 2. Filter down to the highest `demVote`
filter( # 1. Filter down to only 2008 votes
presidentialElections, # arguments for the Step 1 `filter`
year == 2008
),
demVote == max(demVote) # second argument for the Step 2 `filter`
),
state # second argument for the Step 3 `select`
)
This version uses anonymous variables—result values that are not assigned to variables (and so are
anonymous), but are instead immediately used as the arguments to other functions. You’ve used
these anonymous variables frequently with the print() function and with filters (those vectors of
TRUE and FALSE values)—even the max(demVote) in the Step 2 filter is an anonymous variable!
This nested approach achieves the same result as the previous example does without creating extra
variables. But, even with only three steps, it can get quite complicated to read—in a large part
because you have to think about it “inside out,” with the code in the middle being evaluated first.
This will obviously become undecipherable for more involved operations.
# Ask the same question of our data using the pipe operator
most_dem_state <- presidentialElections %>% # data frame to start with
filter(year == 2008) %>% # 1. Filter down to only 2008 votes
filter(demVote == max(demVote)) %>% # 2. Filter down to the highest `demVote`
select(state) # 3. Select name of the state
Here the presidentialElections data frame is “piped” in as the first argument to the first
filter() call; because the argument has been piped in, the filter() call takes in only the
remaining arguments (e.g., year == 2008). The result of that function is then piped in as the first
argument to the second filter() call (which needs to specify only the remaining arguments), and
so on. The additional arguments (such as the filter criteria) continue to be passed in as normal, as if
no data frame argument is needed.
Because all dplyr functions discussed in this chapter take as a first argument the data frame to
manipulate, and then return a manipulated data frame, it is possible to “chain” together any of
these functions using a pipe!
Yes, the %>% operator can be awkward to type and takes some getting used to (especially compared to
the command line’s use of | to pipe). However, you can ease the typing by using the RStudio
keyboard shortcut cmd+shift+m.
Tip: You can see all RStudio keyboard shortcuts by navigating to the Tools > Keyboard
Shortcuts Help menu, or you can use the keyboard shortcut alt+shift+k (yes, this is
the keyboard shortcut to show the keyboard shortcuts menu!).
The pipe operator is loaded when you load the dplyr package (it is available only if you load that
package), but it will work with any function, not just dplyr ones. This syntax, while slightly odd,
can greatly simplify the way you write code to ask questions about your data.
Fun Fact: Many packages load other packages (which are referred to as dependencies). For example, the pipe operator is actually part of the magrittr package, which is loaded as a dependency of dplyr.
a. https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html
Note that as in the preceding example, it is best practice to put each “step” of a pipe sequence on its
own line (indented by two spaces). This allows you to easily rearrange the steps (simply by moving
lines), as well as to “comment out” particular steps to test and debug your analysis as you go.
The group_by() function allows you to create associations among groups of rows in a data frame so
that you can easily perform such aggregations. It takes as arguments a data frame to do the
grouping on, followed by which column(s) you wish to use to group the data—each row in the
table will be grouped with other rows that have the same value in that column. For example, you
can group all of the data in the presidentialElections data set into groups whose rows share
the same state value:
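A sketch of such a call (the grouped variable name matches the code shown in Figure 11.8):

# Group observations (rows) by their `state` value
grouped <- group_by(presidentialElections, state)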
The group_by() function returns a tibble,5 which is a version of a data frame used by the
“tidyverse”6 family of packages (which includes dplyr). You can think of this as a “special” kind of
data frame—one that is able to keep track of “subsets” (groups) within the same variable. While this
grouping is not visually apparent (i.e., it does not sort the rows), the tibble keeps track of each row’s
group for computation, as shown in Figure 11.7.
The group_by() function is useful because it lets you apply operations to groups of data without
having to explicitly break your data into different variables (sometimes called bins or chunks). Once
you’ve used group_by() to group the rows of a data frame, you can apply other verbs (e.g.,
summarize(), filter()) to that tibble, and they will be automatically applied to each group (as if
they were separate data frames). Rather than needing to explicitly extract different sets of data into
separate data frames and run the same operations on each, you can use the group_by() function
to accomplish all of this with a single command:
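A hedged sketch of such a command (the state_voting_summary name is illustrative; the summary columns match Figure 11.8):

# Compute the average vote shares for each state
state_voting_summary <- presidentialElections %>%
  group_by(state) %>% # group by the `state` column
  summarize(          # summarize each group (state) into one row
    mean_dem_vote = mean(demVote),
    mean_other_parties = mean(other_parties_vote)
  )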
5. tibble package website: http://tibble.tidyverse.org
6. tidyverse website: https://www.tidyverse.org
Figure 11.7 A tibble—created by the group_by() function—that stores associations by the grouping
variable (state). Red notes are added.
The preceding code will first group the rows together by state, then compute summary
information (mean() values) for each one of these groups (i.e., for each state), as illustrated in
Figure 11.8. A summary of groups will still return a tibble, where each row is the summary of a
summarize(
grouped,
mean_dem_vote = mean(demVote),
mean_other_parties = mean(other_parties_vote)
)
Figure 11.8 Using the group_by() and summarize() functions to calculate summary statistics in
the presidentialElections data frame by state.
different group. You can extract values from a tibble using dollar sign or bracket notation, or
convert it back into a normal data frame with the as.data.frame() function.
This form of grouping can allow you to quickly compare different subsets of your data. In doing so,
you’re redefining your unit of analysis. Grouping lets you frame your analysis question in terms of
comparing groups of observations, rather than individual observations. This form of abstraction
makes it easier to ask and answer complex questions about your data.
Storing related information in two separate data frames—for example, a table of donations and a table of donor contact information (see Figure 11.9)—has two benefits:
1. Data storage: Rather than duplicating information about each donor every time that person
makes a donation, you can store that information a single time. This will reduce the amount
of space your data takes up.
2. Data updates: If you need to update information about a donor (e.g., the donor’s phone
number changes), you can make that change in a single location.
This separation and organization of data is a core concern in the design of relational databases,
which are discussed in Chapter 13.
At some point, you will want to access information from both data sets (e.g., you need to email
donors about their contributions), and thus need a way to reference values from both data frames
at once—in effect, to combine the data frames. This process is called a join (because you are
“joining” the data frames together). When you perform a join, you identify columns which are
present in both tables, and use those columns to “match” corresponding rows to one another.
Those column values are used as identifiers to determine which rows in each table correspond to
one another, and thus will be combined into a single row in the resulting (joined) table.
Figure 11.9 An example data frame of donations (left) and donor information (right). Notice that not
all donors are present in both data frames.
The left_join() function is one example of a join. This function looks for matching columns
between two data frames, and then returns a new data frame that is the first (“left”) argument with
extra columns from the second (“right”) argument added on—in effect, “merging” the tables. You
specify which columns you want to “match” on by specifying a by argument, which takes a vector
of column names (as strings).
For example, because both of the data frames in Figure 11.9 have a donor_name column, you can
“match” the rows from the donor table to the donations table by this column and merge them
together, producing the joined table illustrated in Figure 11.10.
# Combine (join) donations and donors data frames by their shared column
# ("donor_name")
combined_data <- left_join(donations, donors, by = "donor_name")
When you perform a left join as in the preceding code, the function performs the following steps:
1. It goes through each row in the table on the “left” (the first argument; e.g., donations),
considering the values from the shared columns (e.g., donor_name).
2. For each of these values from the left-hand table, the function looks for a row in the
right-hand table (e.g., donors) that has the same value in the specified column.
3. If it finds such a matching row, it adds any other data values from columns that are in
donors but not in donations to that left-hand row in the resulting table.
4. It repeats steps 1–3 for each row in the left-hand table, until all rows have been given values
from their matches on the right (if any).
You can see in Figure 11.10 that there were elements in the left-hand table (donations) that did not
match to a row in the right-hand table (donors). This may occur because there are some donations
whose donors do not have contact information (there is no matching donor_name entry): those
rows will be given NA (not available) values, as shown in Figure 11.10.
Remember: A left join returns all of the rows from the first table, with all of the columns
from both tables.
For rows to match, they need to have the same data in all specified shared columns. However, if the
names of your columns don’t match or if you want to match only on specific columns, you can use
a named vector (one with tags similar to a list) to indicate the different names from each data frame.
If you don’t specify a by argument, the join will match on all shared column names.
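A hedged sketch of that named-vector form (the name column in donors is hypothetical, for illustration):

# Match the `donor_name` column in `donations` to the `name` column in `donors`
combined_data <- left_join(donations, donors, by = c("donor_name" = "name"))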
Figure 11.10 In a left join, columns from the right hand table (Donors) are added to the end of the
left-hand table (Donations). Rows are matched on the shared column (donor_name). Note the
observations present in the left-hand table that don’t have a corresponding row in the right-hand table
(Yang Huiyan).
Caution: Because of how joins are defined, the argument order matters! For example, in a
left_join(), the resulting table has rows for only the elements in the left (first) table; any
unmatched elements in the second table are lost.
If you switch the order of the arguments, you will instead keep all of the information from the
donors data frame, adding in available information from donations (see Figure 11.11).
# Combine (join) donations and donors data frames (see Figure 11.11)
combined_data <- left_join(donors, donations, by = "donor_name")
Since some donor_name values show up multiple times in the right-hand (donations) table, the
rows from donors end up being repeated so that the information can be “merged” with each set of
values from donations. Again, notice that rows that lack a match in the right-hand table don’t get
any additional information (representing “donors” who gave their contact information to the
organization, but have not yet made a donation).
Because the order of the arguments matters, dplyr (and relational database systems in general)
provide several different kinds of joins, each influencing which rows are included in the final table.
Note that in all joins, columns from both tables will be present in the resulting table—the join type
dictates which rows are included. See Figure 11.12 for a diagram of these joins.
n left_join: All rows from the first (left) data frame are returned. That is, you get all the data
from the left-hand table, with extra column values added from the right-hand table.
Left-hand rows without a match will have NA in the right-hand columns.
Figure 11.11 Switching the order of the tables in a left-hand join (compared to Figure 11.10) returns
a different set of rows. All rows from the left-hand table (donors) are returned with additional columns
from the right-hand table (donations).
Figure 11.12 The rows returned by each join type: an inner join selects all records from Table A and Table B where the join condition is met; a left join selects all records from Table A, along with records from Table B for which the join condition is met (if at all); a right join selects all records from Table B, along with records from Table A for which the join condition is met (if at all); a full join selects all records from Table A and Table B, regardless of whether the join condition is met.
n right_join: All rows from the second (right) data frame are returned. That is, you get all the
data from the right-hand table, with extra column values added from the left-hand table.
Right-hand rows without a match will have NA in the left-hand columns. This is the
“opposite” of a left_join, and the equivalent of switching the order of the arguments.
n inner_join: Only rows in both data frames are returned. That is, you get any rows that had
matching observations in both tables, with the column values from both tables. There will
be no additional NA values created by the join. Observations from the left that had no match
in the right, or observations from the right that had no match in the left, will not be
returned at all—the order of arguments does not matter.
n full_join: All rows from both data frames are returned. That is, you get a row for any
observation, whether or not it matched. If it happened to match, values from both tables
will appear in that row. Observations without a match will have NA in the columns from the
other table—the order of arguments does not matter.
The key to deciding between these joins is to think about which set of data you want as your set of
observations (rows), and which columns you’d be okay with being NA if a record is missing.
Tip: Jenny Bryan has created an excellent “cheatsheet” for dplyr join functions that you can reference.
a. http://stat545.com/bit001_dplyr-cheatsheet.html
Going Further: All the joins discussed here are mutating joins, which add columns from one
table to another. dplyr also provides filtering joins, which exclude rows based on whether
they have a matching observation in another table, and set operations, which combine observations
as if they were set elements. See the package documentation for more detail on
these options—but to get started you can focus primarily on the mutating joins.
a. https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html
Before you can start asking targeted questions of the data set, you will need to understand the
structure of the data set a bit better:
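A sketch of that exploration (the nycflights13 package, noted in Figure 11.13, provides the flights data frame):

install.packages("nycflights13") # once per machine
library("nycflights13")          # load the package (and the `flights` data frame)

dim(flights)      # number of rows (observations) and columns (features)
colnames(flights) # the names of the features
?flights          # view the documentation in RStudio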
7. dplyr in Action: https://github.com/programming-for-data-science/in-action/tree/master/dplyr
8. Introduction to dplyr: http://dplyr.tidyverse.org/articles/dplyr.html
9. Bureau of Transportation Statistics: air flights data: https://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120
A subset of the flights data frame in RStudio’s Viewer is shown in Figure 11.13.
Given this information, you may be interested in asking questions such as the following:
Your task here is to map from these questions to specific procedures so that you can write the
appropriate dplyr code.
This question involves comparing observations (flights) that share a particular feature (airline), so
you perform the analysis as follows:
1. Since you want to consider all the flights from a particular airline (based on the carrier
feature), you will first want to group the data by that feature.
2. You need to figure out the largest number of delayed departures (based on the dep_delay
feature)—which means you need to find the flights that were delayed (filtering for them).
3. You can take the found flights and aggregate them into a count (summarize the different
groups).
Figure 11.13 A subset of the flights data set, which is included as part of the nycflights13
package.
4. You will then need to find which group has the highest count (filtering).
Tip: When you’re trying to find the right operation to answer your question of interest, the
phrase “Find the entry that…” usually corresponds to a filter() operation!
Once you have established this algorithm, you can directly map it to dplyr functions:
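One possible mapping is sketched below (the has_most_delays variable name is illustrative, not prescribed by the text):
# Identify the airline (`carrier`) that has the highest number of delayed departures
has_most_delays <- flights %>%
  group_by(carrier) %>%                   # group by airline (carrier)
  filter(dep_delay > 0) %>%               # keep only the delayed departures
  summarize(num_delay = n()) %>%          # count the delayed flights in each group
  filter(num_delay == max(num_delay)) %>% # keep the airline with the most delays
  select(carrier)                         # keep just the airline code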
Remember: Often many approaches can be used to solve the same problem. The preceding
code shows one possible approach; as an alternative, you could filter for delayed departures
before grouping. The point is to think through how you might solve the problem (by hand)
in terms of the Grammar of Data Manipulation, and then convert that into dplyr!
Unfortunately, the final answer to this question appears to be an abbreviation: UA. To reduce the
size of the flights data frame, information about each airline is stored in a separate data frame
called airlines. Since you are interested in combining these two data frames (your answer and the
airline information), you can use a join:
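A sketch of that join, building on the has_most_delays result from above (the most_delayed_name variable name is illustrative):
# Join the answer to the `airlines` data frame to get the airline's full name
most_delayed_name <- has_most_delays %>%
  left_join(airlines, by = "carrier") %>% # join on the shared `carrier` column
  select(name)                            # keep the human-readable airline name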
After this step, you will have learned that the carrier that had the largest absolute number of delays
was United Air Lines Inc. Before criticizing the airline too strongly, however, keep in mind that you
might be interested in the proportion of flights that are delayed, which would require a separate
analysis.
To answer this question, you can follow a similar approach. Because this question pertains to how
early flights arrive, the outcome (feature) of interest is arr_delay (noting that a negative amount of
delay indicates that the flight arrived early). You will want to group this information by destination
airport (dest) where the flight arrived. And then, since you’re interested in the average arrival delay,
you will want to summarize those groups to aggregate them:
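A sketch of those steps (the most_early variable name matches the data frame printed later in this section):
# Calculate the average arrival delay (`arr_delay`) for each destination (`dest`)
most_early <- flights %>%
  group_by(dest) %>%                 # group by destination airport
  summarize(delay = mean(arr_delay)) # average arrival delay for each group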
It’s always a good idea to check your work as you perform each step of an analysis—don’t write a
long sequence of manipulations and hope that you got the right answer! By printing out the
most_early data frame at this point, you notice that it has a lot of NA values, as seen in Figure 11.14.
This kind of unexpected result occurs frequently when doing data programming—and the best way
to solve the problem is to work backward. By carefully inspecting the arr_delay column, you may
notice that some entries have NA values—the arrival delay is not available for that record. Because
you can’t take the mean() of NA values, you decide to exclude those values from the analysis. You
can do this by passing an na.rm = TRUE argument (“NA remove”) to the mean() function:
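A sketch of that adjustment:
# Recompute the averages, removing NA values before taking the mean
most_early <- flights %>%
  group_by(dest) %>%
  summarize(delay = mean(arr_delay, na.rm = TRUE))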
Figure 11.14 Average delay by destination in the flights data set. Because NA values are present
in the data set, the mean delay for many destinations is calculated as NA. To remove NA values from
the mean() calculation, set na.rm = TRUE.
Removing NA values returns numeric results, and you can continue working through your
algorithm:
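A sketch of the remaining steps, consistent with the printed output below (the join to the airports data frame and its faa column is explained in the following paragraph):
# Identify the destination where flights, on average, arrive most early
most_early <- flights %>%
  group_by(dest) %>%                                   # group by destination
  summarize(delay = mean(arr_delay, na.rm = TRUE)) %>% # average arrival delay
  filter(delay == min(delay, na.rm = TRUE)) %>%        # keep the smallest (most negative) delay
  left_join(airports, by = c("dest" = "faa")) %>%      # join on the airport code
  select(dest, name, delay)                            # the columns shown in the output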
print(most_early)
# A tibble: 1 x 3
# dest name delay
# <chr> <chr> <dbl>
#1 LEX Blue Grass -22
Answering this question follows a very similar structure to the first question. The preceding code
reduces the steps to a single statement by including the left_join() statement in the sequence of
piped operations. Note that the column containing the airport code has a different name in the
flights and airports data frames (dest and faa, respectively), so you use a named vector value
for the by argument to specify the match.
As a result, you learn that LEX—Blue Grass Airport in Lexington, Kentucky—is the airport with the
earliest average arrival time (22 minutes early!).
These kinds of summary questions all follow a similar pattern: group the data by a column (feature)
of interest, compute a summary value for (another) feature of interest for each group, filter down to a
row of interest, and select the columns that answer your question:
# Identify the month in which flights tend to have the longest delays
flights %>%
group_by(month) %>% # group by selected feature
summarize(delay = mean(arr_delay, na.rm = TRUE)) %>% # summarize delays
filter(delay == max(delay)) %>% # filter for the record of interest
select(month) %>% # select the column that answers the question
print() # print the tibble out directly
# A tibble: 1 x 1
# month
# <int>
#1 7
If you are okay with the result being in the form of a tibble rather than a vector, you can even pipe
the results directly to the print() function to view the results in the R console (the answer being
July). Alternatively, you can use a package such as ggplot2 (see Chapter 16) to visually
communicate the delays by month, as in Figure 11.15.
Overall, understanding how to formulate questions, translate them into data manipulation steps
(following the Grammar of Data Manipulation), and then map those to dplyr functions will enable
you to quickly and effectively learn pertinent information about your data set. For practice
wrangling data with the dplyr package, see the set of accompanying book exercises.10
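12
Reshaping Data with tidyr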
One of the most common data wrangling challenges is adjusting how exactly rows and columns are
used to represent your data. Structuring (or restructuring) data frames to have the desired shape can
be the most difficult part of creating a visualization, running a statistical model, or implementing a
machine learning algorithm.
This chapter describes how you can use the tidyr (“tidy-er”) package to effectively transform your
data into an appropriate shape for analysis and visualization.
Indeed, the principles of organizing data in a “tidy” way lead to the data structuring described in Chapter 9: rows represent
observations, and columns represent features of that data.
However, asking different questions of a data set may involve different interpretations of what
constitutes an “observation.” For example, Section 11.6 described working with the flights data
set from the nycflights13 package, in which each observation is a flight. However, the analysis
made comparisons between airlines, airports, and months. Each question worked with a different
unit of analysis, implying a different data structure (e.g., what should be represented by each row).
While the example somewhat changed the nature of these rows by grouping and joining different
data sets, having a more specific data structure where each row represented a specific unit of analysis
(e.g., an airline or a month) may have made much of the wrangling and analysis more
straightforward.
1
tidyr: https://tidyr.tidyverse.org
To use multiple different definitions of an “observation” when investigating your data, you will
need to create multiple representations (i.e., data frames) of the same data set—each with its own
configuration of rows and columns.
To demonstrate how you may need to adjust what each observation represents, consider the
(fabricated) data set of music concert prices shown in Table 12.1. In this table, each observation
(row) represents a city, with each city having features (columns) of the ticket price for a specific
band.
But consider if you wanted to analyze the ticket price across all concerts. You could not do this
easily with the data in its current form, since the data is organized by city (not by concert)! You
would prefer instead that all of the prices were listed in a single column, as a feature of a row
representing a single concert (a city-and-band combination), as in Table 12.2.
Table 12.1 A “wide” data set of concert ticket price in different cities. Each observation (i.e., unit
of analysis) is a city, and each feature is the concert ticket price for a given band.
city greensky_bluegrass trampled_by_turtles billy_strings fruition
Seattle 40 30 15 30
Portland 40 20 25 50
Denver 20 40 25 40
Minneapolis 30 100 15 20
Table 12.2 A “long” data set of concert ticket price by city and band. Each observation (i.e., unit
of analysis) is a city–band combination, and each has a single feature that is the ticket price.
city band price
Seattle greensky_bluegrass 40
Portland greensky_bluegrass 40
Denver greensky_bluegrass 20
Minneapolis greensky_bluegrass 30
Seattle trampled_by_turtles 30
Portland trampled_by_turtles 20
Denver trampled_by_turtles 40
Minneapolis trampled_by_turtles 100
Seattle billy_strings 15
Portland billy_strings 25
Denver billy_strings 25
Minneapolis billy_strings 15
Seattle fruition 30
Portland fruition 50
Denver fruition 40
Minneapolis fruition 20
Both Table 12.1 and Table 12.2 represent the same set of data—they both have prices for 16 different
concerts. But by representing that data in terms of different observations, they may better support
different analyses. These data tables are said to be in a different orientation: the price data in
Table 12.1 is often referred to as being in wide format (because it is spread wide across multiple
columns), while the price data in Table 12.2 is in long format (because it is in one long column).
Note that the long format table includes some duplicated data (the names of the cities and bands
are repeated), which is part of why the data might instead be stored in wide format in the first place!
For example, to move from wide format (Table 12.1) to long format (Table 12.2), you need to gather
all of the prices into a single column. You can do this using the gather() function, which collects
data values stored across multiple columns into a single new feature (e.g., “price” in Table 12.2),
along with an additional new column representing which feature that value was gathered from
(e.g., “band” in Table 12.2). In effect, it creates two columns representing key–value pairs of the
feature and its value from the original data frame.
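For example (a sketch that uses band_data as the name of the wide data frame from Table 12.1):
# Gather the prices from the band columns into a single `price` column
library("tidyr")
band_data_long <- gather(
  band_data,     # data frame to gather from (the wide table)
  key = band,    # name for the new column listing the gathered column names
  value = price, # name for the new column listing the gathered values
  -city          # gather from every column except `city`
)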
The gather() function takes in a number of arguments, starting with the data frame to gather
from. It then takes in a key argument giving a name for a column that will contain as values the
column names the data was gathered from—for example, a new band column that will contain
the values "greensky_bluegrass", "trampled_by_turtles", and so on. The third argument is a
value, which is the name for the column that will contain the gathered values—for example,
price to contain the price numbers. Finally, the function takes in arguments representing which
columns to gather data from, using syntax similar to using dplyr to select() those columns (in
the preceding example, -city indicates that it should gather from all columns except city). Again,
any columns provided as this final set of arguments will have their names listed in the key column,
and their values listed in the value column. This process is illustrated in Figure 12.1. The gather()
function’s syntax can be hard to intuit and remember; try tracing where each value “moves” in the
table and diagram.
Note that once data is in long format, you can continue to analyze an individual feature (e.g., a
specific band) by filtering for that value. For example, filter(band_data_long, band ==
"greensky_bluegrass") would produce just the prices for a single band.
gather(
band_data,
key = band,
value = price,
-city
)
Figure 12.1 The gather() function takes values from multiple columns (greensky_bluegrass,
trampled_by_turtles, etc.) and gathers them into a (new) single column (price). In doing so, it
also creates a new column (band) that stores the names of the columns that were gathered (i.e., the
column name in which each value was stored prior to gathering).
Table 12.3 A “wide” data set of concert ticket prices for a set of bands. Each observation (i.e.,
unit of analysis) is a band, and each feature is the ticket price in a given city.
band Denver Minneapolis Portland Seattle
billy_strings 25 15 25 15
fruition 40 20 50 30
greensky_bluegrass 20 30 40 40
trampled_by_turtles 40 100 20 30
# Reshape long data (Table 12.2), spreading prices out among multiple features
price_by_band <- spread(
band_data_long, # data frame to spread from
key = city, # column indicating where to get new feature names
value = price # column indicating where to get new feature values
)
The spread() function takes arguments similar to those passed to the gather() function, but
applies them in the opposite direction. In this case, the key and value arguments are where to get
the column names and values, respectively. The spread() function will create a new column for
each unique value in the provided key column, with values taken from the value feature. In the
preceding example, the new column names (e.g., "Denver", "Minneapolis") were taken from the
city feature in the long format table, and the values for those columns were taken from the price
feature. This process is illustrated in Figure 12.2.
By combining gather() and spread(), you can effectively change the “shape” of your data and
what concept is represented by an observation.
spread(
band_data_long,
key = city,
value = price
)
Figure 12.2 The spread() function spreads out a single column into multiple columns. It creates a
new column for each unique value in the provided key column (city). The values in each new column
will be populated with the provided value column (price).
Tip: Before spreading or gathering your data, you will often need to unite multiple columns
into a single column, or to separate a single column into multiple columns. The tidyr func-
tions unite()a and separate()b provide a specific syntax for these common data prepara-
tion tasks.
a
https://tidyr.tidyverse.org/reference/unite.html
b
https://tidyr.tidyverse.org/reference/separate.html
After having downloaded the data, you will need to load it into your R environment:
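A sketch of that loading step (the file name and path here are hypothetical—use wherever you saved the downloaded .csv file, and the wb_data variable name is illustrative):
# Load the World Bank education data into a data frame
wb_data <- read.csv(
  "data/world_bank_edstats.csv", # hypothetical path to the downloaded file
  stringsAsFactors = FALSE       # keep text columns as character strings
)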
When you first load the data, each observation (row) represents an indicator for a country, with
features (columns) that are the values of that indicator in a given year (see Figure 12.3). Notice that
many values, particularly for earlier years, are missing (NA). Also, because R does not allow column
names to be numbers, the read.csv() function has prepended an X to each column name (which is
just a number in the raw .csv file).
While in terms of the indicator this data is in long format, in terms of the indicator and year the
data is in wide format—a single column contains all the values for a single year. This structure
allows you to make comparisons between years for the indicators by filtering for the indicator of
interest. For example, you could compare each country’s educational expenditure in 1990 to its
expenditure in 2014 as follows:
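A sketch of that comparison, assuming (hypothetically) that the loaded data has Indicator.Name and Country.Code columns; the exact indicator string is also illustrative:
# The indicator of interest
indicator <- "Government expenditure on education, total (% of GDP)"

# Keep the rows for that indicator and the columns needed for the comparison
expenditure_plot_data <- wb_data %>%
  filter(Indicator.Name == indicator) %>%
  select(Country.Code, X1990, X2014)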
2
World Bank Data Explorer: https://data.worldbank.org
3
World Bank education: http://datatopics.worldbank.org/education
4
tidyr in Action: https://github.com/programming-for-data-science/in-action/tree/master/tidyr
Figure 12.3 Untransformed World Bank educational data used in Section 12.4.
# Plot the expenditure in 1990 against 2014 using the `ggplot2` package
# See Chapter 16 for details
expenditure_chart <- ggplot(data = expenditure_plot_data) +
geom_text_repel(
mapping = aes(x = X1990 / 100, y = X2014 / 100, label = Country.Code)
) +
scale_x_continuous(labels = percent) +
scale_y_continuous(labels = percent) +
labs(title = indicator, x = "Expenditure 1990", y = "Expenditure 2014")
Figure 12.4 shows that the expenditure (relative to gross domestic product) is fairly correlated
between the two time points: countries that spent more in 1990 also spent more in 2014
(specifically, the correlation—calculated in R using the cor() function—is .64).
However, if you want to extend your analysis to visually compare how the expenditure across all
years varies for a given country, you would need to reshape the data. Instead of having each
observation be an indicator for a country, you want each observation to be an indicator for a
country for a year—thereby having all of the values for all of the years in a single column and
making the data long(er) format.
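One way to do that reshaping is sketched below (the long_year_data name and the X1960:X2017 column range are assumptions about the data set):
# Gather the year columns into a single `year` feature (long format)
long_year_data <- wb_data %>%
  gather(
    key = year,    # new column listing which year each value came from
    value = value, # new column holding the gathered values
    X1960:X2017    # the (assumed) range of year columns to gather
  )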
Figure 12.4 A comparison of each country’s education expenditures in 1990 and 2014.
Figure 12.5 Reshaped educational data (long format by year). This structure allows you to more
easily create visualizations across multiple years.
As shown in Figure 12.5, this gather() statement creates a year column, so each observation (row)
represents the value of an indicator in a particular country in a given year. The expenditure for each
year is stored in the value column created (coincidentally, this column is given the name
"value").
This structure will now allow you to compare fluctuations in an indicator’s value over time (across
all years):
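A sketch of such a comparison for a single country (Spain), reusing the hypothetical column names from above and the ggplot2/scales functions used in the earlier plot:
# Filter for one country and indicator, and clean up the `year` values
spain_plot_data <- long_year_data %>%
  filter(Indicator.Name == indicator, Country.Name == "Spain") %>%
  mutate(year = as.numeric(substr(year, 2, 5))) # drop the "X" prefix from each year

# Plot the expenditure over time
ggplot(data = spain_plot_data) +
  geom_line(mapping = aes(x = year, y = value / 100)) +
  scale_y_continuous(labels = percent) +
  labs(title = indicator, x = "Year", y = "Expenditure (percent of GDP)")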
The resulting chart, shown in Figure 12.6, uses the available data to show a timeline of the
fluctuations in government expenditures on education in Spain. This produces a more complete
picture of the history of educational investment, and draws attention to major changes as well as
the absence of data in particular years.
You may also want to compare two indicators to each other. For example, you may want to assess
the relationship between each country’s literacy rate (a first indicator) and its unemployment rate
(a second indicator). To do this, you would need to reshape the data again so that each observation
is a particular country and each column is an indicator. Since indicators are currently in one
column, you need to spread them out using the spread() function:
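A sketch of that reshaping (again assuming a hypothetical Indicator.Name column; in practice you may need to drop or unite other identifying columns first so that each row is uniquely identified):
# Spread the indicators into their own columns (one column per indicator)
wide_data <- long_year_data %>%
  spread(
    key = Indicator.Name, # new column names come from the indicator names
    value = value         # new column values come from the gathered values
  )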
This wide format data shape allows for comparisons between two different indicators. For example,
you can explore the relationship between female unemployment and female literacy rates, as
shown in Figure 12.7.
Each comparison in this analysis—between two time points, over a full time-series, and between
indicators—required a different representation of the data set. Mastering use of the tidyr
functions will allow you to quickly transform the shape of your data set, allowing for rapid and
effective data analysis. For practice reshaping data with the tidyr package, see the set of
accompanying book exercises.5
5
tidyr exercises: https://github.com/programming-for-data-science/chapter-12-exercises
13
Accessing Databases
This chapter introduces relational databases as a way to structure and organize complex data sets.
After introducing the purpose and format of relational databases, it describes the syntax for
interacting with them using R. By the end of the chapter you will be able to wrangle data from a
database.
In particular, your data may not be structured in a way that it can easily and efficiently be
represented as a single data frame. For example, imagine you were trying to organize information
about music playlists (e.g., on a service such as Spotify). If your playlist is the unit of analysis you are
interested in, each playlist would be an observation (row) and would have different features
(columns) included. One such feature you could be interested in is the songs that appear on the
playlist (implying that one of your columns should be songs). However, playlists may have lots of
different songs, and you may also be tracking further information about each song (e.g., the artist,
the genre, the length of the song). Thus you could not easily represent each song as a simple data
type such as a number or string. Moreover, because the same song may appear in multiple playlists,
such a data set would include a lot of duplicate information (e.g., the title and artist of the song).
To solve this problem, you could use multiple data frames (perhaps loaded from multiple .csv
files), joining those data frames together as described in Chapter 11 to ask questions of the data.
However, that solution would require you to manage multiple different .csv files, as well as to
determine an effective and consistent way of joining them together. Since organizing, tracking, and
updating multiple .csv files can be difficult, many large data sets are instead stored in databases.
Metonymically, a database is a specialized application (called a database management system) used
to save, organize, and access information—similar to what git does for versions of code, but in this
case for the kind of data that might be found in multiple .csv files. Because many organizations
store their data in a database of some kind, you will need to be able to access that data to analyze it.
Moreover, accessing data directly from a database makes it possible to process data sets that are too
large to fit into your computer’s memory (RAM) at once. The computer does not need to hold all of
the data in memory at the same time; instead, your data manipulations (e.g., selecting and filtering
the data) can be applied to the data stored on the computer’s hard drive.
What makes relational databases special is how they specify the relationships between these tables.
In particular, each record (row) in a table is given a field (column) called the primary key. The
primary key is a unique value for each row in the table, so it lets you refer to a particular record.
Thus even if there were two songs with the same name and artist, you could still distinguish
between them by referencing them through their primary key. Primary keys can be any unique
identifier, but they are almost always numbers and are frequently automatically generated and
assigned by the database. Note that databases can’t just use the “row number” as the primary key,
because records may be added, removed, or reordered—which means a record won’t always be at
the same index!
Moreover, each record in one table may be associated with a record in another—for example, each
record in a songs table might have an associated record in the artists table indicating which
artist performed the song. Because each record in the artists table has a unique key, the songs
table is able to establish this association by including a field (column) for each record that contains
the corresponding key from artists (see Figure 13.1). This is known as a foreign key (it is the key
from a “foreign” or other table). Foreign keys allow you to join tables together, similar to how you
would with dplyr. You can think of foreign keys as a formalized way of defining a consistent
column to use for the by argument of dplyr’s join functions.
Databases can use tables with foreign keys to organize data into complex structures; indeed, a
database may have a table that just contains foreign keys to link together other tables! For example,
if a database needs to represent data such that each playlist can have multiple songs, and songs can
be on many playlists (a “many-to-many” relationship), you could introduce a new “bridge table”
(e.g., playlists_songs) whose records represent the associations between the two other tables
(see Figure 13.2). You can think of this as a “table of lines to draw between the other tables.” The
database could then join all three of the tables to access the information about all of the songs for a
particular playlist.
table: artists (primary key: id)
id   name
10   David Bowie
11   Queen
12   Prince

table: songs (primary key: id; foreign key: artist_id, which references artists.id)
id   title                artist_id
80   Bohemian Rhapsody    11
81   Don’t Stop Me Now    11
82   Purple Rain          12
83   Starman              10
Figure 13.1 An example pair of database tables (top). Each table has a primary key column id. The
songs table (top right) also has an artist_id foreign key used to associate it with the artists
table (top left). The bottom table illustrates how the foreign key can be used when joining the tables.
Going Further: Database design, development, and use is actually its own (very rich) prob-
lem domain. The broader question of making databases reliable and efficient is beyond the
scope of this book.
■ SQLite1 is the simplest SQL database system, and so is most commonly used for testing and
development (though rarely in real-world “production” systems). SQLite databases have the
advantage of being highly self-contained: each SQLite database is a single file (with the
.sqlite extension) that is formatted to enable the SQLite RDMS to access and manipulate
its data. You can almost think of these files as advanced, efficient versions of .csv files that
can hold multiple tables! Because the database is stored in a single file, it is easy to
share databases with others or even place one under version control.
1
SQLite: https://www.sqlite.org/index.html
Figure 13.2 An example “bridge table” (top right) used to associate many playlists with many songs.
The bottom table illustrates how these three tables might be joined.
To work with an SQLite database you can download and install a command line application2
for manipulating the data. Alternatively, you can use an application such as DB Browser for
SQLite,3 which provides a graphical interface for interacting with the data. This is
particularly useful for testing and verifying your SQL and R code.
■ PostgreSQL4 (often shortened to “Postgres”) is a free open source RDMS, providing a more
robust system and set of features (e.g., for speeding up data access and ensuring data
integrity) and functions than SQLite. It is often used in real-world production systems, and is
the recommended system to use if you need a “full database.” Unlike with SQLite, a Postgres
database is not isolated to a single file that can easily be shared, though there are ways to
export a database.
2
SQLite download page: https://www.sqlite.org/download.html; look for “Precompiled Binaries” for your system.
3
DB Browser for SQLite: http://sqlitebrowser.org
4
PostgreSQL: https://www.postgresql.org
You can download and install the Postgres RDMS from its website;5 follow the instructions in
the installation wizard to set up the database system. This application will install the
manager on your machine, as well as provide you with a graphical application (pgAdmin) to
administer your databases. You can also use the provided psql command line application if
you add it to your PATH; alternatively, the SQL Shell application will open the command line
interface directly.
■ MySQL6 is a free RDMS (its Community edition is open source, with commercial editions offered
by Oracle), providing a similar level of features and structure as Postgres. MySQL is more widely
used than Postgres, but can be somewhat more difficult to install and set up.
If you wish to set up and use a MySQL database, we recommend that you install the
Community Server Edition from the MySQL website.7 Note that you do not need to sign up
for an account (click the smaller “No thanks, just start my download” link instead).
We suggest you use SQLite when you’re just experimenting with a database (as it requires the least
amount of setup), and recommend Postgres if you need something more full-featured.
This section introduces the most basic of SQL statements: the SELECT statement used to access
data. Note that it is absolutely possible to access and manipulate a database through R without
using SQL; see Section 13.3. However, it is often useful to understand the underlying commands
that R is issuing. Moreover, if you eventually need to discuss database manipulations with someone
else, this language will provide some common ground.
Caution: Most RDMSs support SQL, though systems often use slightly different “flavors” of
SQL. For example, data types may be named differently, or different RDMSs may support
additional functions or features.
Tip: For a more thorough introduction to SQL, w3schoolsa offers a very newbie-friendly tuto-
rial on SQL syntax and usage. You can also find more information in Forta, Sams Teach Your-
self SQL in 10 Minutes, Fourth Edition (Sams, 2013), and van der Lans, Introduction to SQL, Fourth
Edition (Addison-Wesley, 2007).
a
https://www.w3schools.com/sql/default.asp
5
PostgreSQL download page: https://www.postgresql.org/download
6
MySQL: https://www.mysql.com
7
MySQL download Page: https://dev.mysql.com/downloads/mysql
The most commonly used SQL statement is the SELECT statement. The SELECT statement is used to
access and extract data from a database (without modifying that data)—this makes it a query
statement. It performs the same work as the select() function in dplyr. In its simplest form, the
SELECT statement has the following format:
/* A generic SELECT statement: access a column from a table */
SELECT column FROM table
(In SQL, comments are written on their own line surrounded by /* */.)
This query will return the data from the specified column in the specified table (keywords like
SELECT in SQL are usually written in all-capital letters—though they are not case-sensitive—while
column and table names are often lowercase). For example, the following statement would return
all of the data from the title column of the songs table (as shown in Figure 13.3):
/* Access the `title` column from the `songs` table */
SELECT title FROM songs
You can select multiple columns by separating the names with commas (,). For example, to select
both the id and title columns from the songs table, you would use the following query:
/* Access the `id` and `title` columns from the `songs` table */
SELECT id, title FROM songs
If you wish to select all the columns, you can use the special * symbol to represent “everything”—
the same wildcard symbol you use on the command line! The following query will return all
columns from the songs table:
/* Access all columns from the `songs` table */
SELECT * FROM songs
Figure 13.3 A SELECT statement and results shown in the SQLite Browser.
Using the * wildcard to select data is common practice when you just want to load the entire table
from the database.
You can also optionally give the resulting column a new name (similar to a mutate manipulation)
by using the AS keyword. This keyword is placed immediately after the name of the column to be
aliased, followed by the new column name. It doesn’t actually change the table, just the label of the
resulting “subtable” returned by the query.
/* Access the `id` column (calling it `song_id`) from the `songs` table */
SELECT id AS song_id FROM songs
The SELECT statement performs a select data manipulation. To perform a filter manipulation, you
add a WHERE clause at the end of the SELECT statement. This clause includes the keyword WHERE
followed by a condition, similar to the boolean expression you would use with dplyr. For example,
to select the title column from the songs table with an artist_id value of 11, you would use
the following query (also shown in Figure 13.4):
/* Access the `title` column for songs with a particular `artist_id` */
SELECT title FROM songs WHERE artist_id = 11
This SQL statement is equivalent to the following dplyr code:
# Filter for the rows with a particular `artist_id`, and then select
# the `title` column
filter(songs, artist_id == 11) %>%
select(title)
Figure 13.4 A WHERE clause and results shown in the SQLite Browser.
The filter condition is applied to the whole table, not just the selected columns. In SQL, the
filtering occurs before the selection.
Note that a WHERE condition uses = (not ==) as the “is equal” operator. Conditions can also use
other relational operators (e.g., >, <=), as well as some special keywords such as LIKE, which
checks whether the column’s text value matches a pattern (e.g., contains a particular substring).
(String values in SQL must be specified in
quotation marks—it’s most common to use single quotes.)
You can combine multiple WHERE conditions by using the AND, OR, and NOT keywords as boolean
operators:
The statement SELECT columns FROM table WHERE conditions is the most common form of
SQL query. But you can also include other keyword clauses to perform further data manipulations.
For example, you can include an ORDER BY clause to perform an arrange manipulation (by a
specified column), or a GROUP BY clause to perform aggregation (typically used with SQL-specific
aggregation functions such as MAX() or MIN()). See the official documentation for your database
system (e.g., for Postgres8 ) for further details on the many options available when specifying
SELECT queries.
The SELECT statements described so far all access data in a single table. However, the entire point of
using a database is to be able to store and query data across multiple tables. To do this, you use a join
manipulation similar to that used in dplyr. In SQL, a join is specified by including a JOIN clause,
which has the following format:
As with dplyr, an SQL join will by default “match” columns if they have the same value in the
same column. However, tables in databases often don’t have the same column names, or the shared
column name doesn’t refer to the same value—for example, the id column in artists is for the
artist ID, while the id column in songs is for the song ID. Thus you will almost always include an
ON clause to specify which columns should be matched to perform the join (writing the names of
the columns separated by an = operator):
/* Access artists, song titles, and ID values from two JOINed tables */
SELECT artists.id, artists.name, songs.id, songs.title FROM artists
JOIN songs ON songs.artist_id = artists.id
This query (shown in Figure 13.5) will select the IDs, names, and titles from the artists and songs
tables by matching to the foreign key (artist_id); the JOIN clause appears on its own line just for
readability. To distinguish between columns with the same name from different tables, you specify
each column first by its table name, followed by a period (.), followed by the column name. (The
dot can be read like “apostrophe s” in English, so artists.id would be “the artists table’s id.”)
8
PostgreSQL: SELECT: https://www.postgresql.org/docs/current/static/sql-select.html
Figure 13.5 A JOIN statement and results shown in the SQLite Browser.
You can join on multiple conditions by combining them with AND clauses, as with multiple WHERE
conditions.
Like dplyr, SQL supports four kinds of joins (see Chapter 11 to review them). By default, the JOIN
statement will perform an inner join—meaning that only rows that contain matches in both tables
will be returned (e.g., the joined table will not have rows that don’t match). You can also make this
explicit by specifying the join clause with the keywords INNER JOIN. Alternatively, you can specify
that you want to perform a LEFT JOIN, RIGHT JOIN, or FULL OUTER JOIN (i.e., a full join). For example,
to perform a left join you would use a query such as the following:
/* Access artists and song titles, including artists without any songs */
SELECT artists.id, artists.name, songs.id, songs.title FROM artists
LEFT JOIN songs ON songs.artist_id = artists.id
Notice that the statement is written the same way as before, except with an extra word to clarify the
type of join.
As with dplyr, deciding on the type of join to use requires that you carefully consider which
observations (rows) must be included, and which features (columns) must not be missing in the
table you produce. Most commonly you are interested in an inner join, which is why that is the
default!
From within R, you can use the same, familiar syntax and data structures (i.e., data frames) to work with databases. The
simplest way to access a database through R is to use the dbplyr9 package, which was developed as
part of the tidyverse collection. This package allows you to query a relational database using
dplyr functions, avoiding the need to use an external application!
Going Further: RStudio also provides an interface and documentation for connecting to a
database through the IDE; see the Databases Using R portal.a
a
https://db.rstudio.com
Because dbplyr is another external package (like dplyr and tidyr), you will need to install it
before you can use it. However, because dbplyr is actually a “backend” for dplyr (it provides the
behind-the-scenes code that dplyr uses to work with a database), you actually need to use
functions from dplyr and so load the dplyr package instead. In addition, you will need to load
the DBI package, which is installed along with dbplyr and allows you to connect to the database:
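A minimal sketch of that setup:
# Install the dbplyr backend (once per machine)
install.packages("dbplyr")

# Load the packages whose functions you will actually call
library("dplyr") # the data manipulation functions used to build queries
library("DBI")   # functions for connecting to a database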
You will also need to install an additional package depending on which kind of database you wish
to access (e.g., the RSQLite package for SQLite databases). These packages provide a common
interface (set of functions) across multiple database
formats—they will allow you to access an SQLite database and a Postgres database using the same
R functions.
Remember that databases are managed and accessed through an RDMS, which is a separate
program from the R interpreter. Thus, to access databases through R, you will need to “connect” to
that external RDMS program and use R to issue statements through it. You can connect to an
external database using the dbConnect() function provided by the DBI package:
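A sketch of connecting to a local SQLite database (the db_connection variable name and the file path are illustrative, and the RSQLite package must be installed):
# Connect to a local SQLite database file
db_connection <- dbConnect(RSQLite::SQLite(), dbname = "data/music.sqlite")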
The dbConnect() function takes as a first argument a “connection” interface provided by the
relevant database connection package (e.g., RSQLite). The remaining arguments specify the
9
dbplyr repository page: https://github.com/tidyverse/dbplyr
location of the database, and are dependent on where that database is located and what kind of
database it is. For example, you use a dbname argument to specify the path to a local SQLite
database file, while you use host, user, and password to specify the connection to a database on a
remote machine.
Caution: Never include your database password directly in your R script—saving it in plain
text will allow others to easily steal it! Instead, dbplyr recommends that you prompt users
for the password through RStudio by using the askForPassword()a function from the
rstudioapi package (which will cause a pop-up window to appear for users to type in their
password). See the dbplyr introduction vignetteb for an example.
a
https://www.rdocumentation.org/packages/rstudioapi/versions/0.7/topics/askForPassword
b
https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html
Once you have a connection to the database, you can use the dbListTables() function to get a
vector of all the table names. This is useful for checking that you’ve connected to the database
(as well as seeing what data is available to you!).
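For example (using the connection created above):
# List the names of the tables available through the connection
dbListTables(db_connection)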
Since all SQL queries access data FROM a particular table, you will need to start by creating a reference
to that table in the form of a variable. You can do this by using the tbl() function provided by
dplyr (not dbplyr!). This function takes as arguments the connection to the database and the
name of the table you want to reference. For example, to query a songs table as in Figure 13.1, you
would use the following:
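For example:
# Create a reference to the `songs` table in the database
songs_table <- tbl(db_connection, "songs")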
If you print this variable out, you will notice that it looks mostly like a normal data frame
(specifically a tibble), except that the variable refers to a remote source (since the table is in the
database, not in R!); see Figure 13.6.
Once you have a reference to the table, you can use the same dplyr functions discussed in
Chapter 11 (e.g., select(), filter()). Just use the table in place of the data frame to manipulate!
# Construct a query from the `songs_table` for songs by Queen (artist ID 11)
queen_songs_query <- songs_table %>%
filter(artist_id == 11) %>%
select(title)
Figure 13.6 A database tbl, printed in RStudio. This is only a preview of the data that will be returned
by the database.
The dbplyr package will automatically convert a sequence of dplyr functions into an equivalent
SQL statement, without the need for you to write any SQL! You can see the SQL statement it is
generating by using the show_query() function:
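For example:
# Inspect the SQL statement that dbplyr generates for the query
show_query(queen_songs_query)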
Importantly, using dplyr functions on a database table does not return a data frame (or even a tibble);
printing the result displays just a small preview of the requested data!
database is relatively slow in comparison to accessing data in a data frame, particularly when the
database is on a remote computer. Thus dbplyr uses lazy evaluation—it actually executes the
query on the database only when you explicitly tell it to do so. What is shown when you print
the queen_songs_query is just a subset of the data; the results will not include all of the rows
returned if there are a large number of them! RStudio very subtly indicates that the data is just a
preview of what has been requested—note in Figure 13.6 that the dimensions of the songs_table
are unknown (i.e., table<songs> [?? X 3]). Lazy evaluation keeps you from accidentally
making a large number of queries and downloading a huge amount of data as you are designing
and testing your data manipulation statements (i.e., writing your select() and filter() calls).
To actually query the database and load the results into memory as a R value you can manipulate,
use the collect() function. You can often add this function call as a last step in your pipe of
dplyr calls.
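A sketch of that final step (the queen_songs_data variable name is illustrative):
# Execute the query and load the results into memory as a tibble
queen_songs_data <- collect(queen_songs_query)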
This tibble is exactly like those described in earlier chapters; you can use as.data.frame() to
convert it into a data frame. Thus, anytime you want to query data from a database in R, you will
need to perform the following steps:
1. Connect to the database using dbConnect().
2. Create a reference to a table in that database using tbl().
3. Construct your query using dplyr functions such as select() and filter().
4. Execute the query and load the results into R using collect().
And with that, you have accessed and queried a database using R! You can now write R code to use
the same dplyr functions for either a local data frame or a remote database, allowing you to test
and then expand your data analysis.
Tip: For more information on using dbplyr, check out the introduction vignette.a
a
https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html
For practice working with databases, see the set of accompanying book exercises.10
10
Database exercises: https://github.com/programming-for-data-science/chapter-13-exercises
14
Accessing Web APIs
Previous chapters have described how to access data from local .csv files, as well as from local
databases. While working with local data is common for many analyses, more complex shared data
systems leverage web services for data access. Rather than store data on each analyst’s computer,
data is stored on a remote server (i.e., a central computer somewhere on the internet) and accessed
similarly to how you access information on the web (via a URL). This allows scripts to always work
with the latest data available when performing analysis of data that may be changing rapidly, such
as social media data.
In this chapter, you will learn how to use R to programmatically interact with data stored by web
services. From an R script, you can read, write, and delete data stored by these services (though this
book focuses on the skill of reading data). Web services may make their data accessible to computer
programs like R scripts by offering an application programming interface (API). A web service’s API
specifies where and how particular data may be accessed, and many web services follow a particular
style known as REpresentational State Transfer (REST).1 This chapter covers how to access and work
with data from these RESTful APIs.
While some APIs provide an interface to a service’s functionality (operations you can ask the service
to perform), other APIs provide an interface for accessing data. One of the most common sources of
these data APIs is web services—that is, websites that offer an interface for accessing their data.
With web services, the interface (the set of “functions” you can call to access the data) takes the
form of HTTP requests—that is, requests for data sent following the HyperText Transfer Protocol.
1
Fielding, R. T. (2000). Architectural styles and the design of network-based software architectures. University of
California, Irvine, doctoral dissertation. https://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm.
Note that this is the original specification and is very technical.
This is the same protocol (way of communicating) used by your browser to view a webpage! An
HTTP request represents a message that your computer sends to a web server: another computer on
the internet that “serves,” or provides, information. That server, upon receiving the request, will
determine what data to include in the response it sends back to the requesting computer. With a
web browser, the response data takes the form of HTML files that the browser can render as
webpages. With data APIs, the response data will be structured data that you can convert into R
data types such as lists or data frames.
In short, loading data from a web API involves sending an HTTP request to a server for a particular
piece of data, and then receiving and parsing the response to that request.
Learning how to use web APIs will greatly expand the available data sets you may want to use for
analysis. Companies and services with large amounts of data, such as Twitter,2 iTunes,3 or Reddit,4
make (some of) their data publicly accessible through an API. This chapter will use the GitHub API5
to demonstrate how to work with data stored in a web service.
14.2.1 URIs
Which resource you want to access is specified with a Uniform Resource Identifier (URI).6 A URI is
a generalization of a URL (Uniform Resource Locator)—what you commonly think of as a “web
address.” A URI acts a lot like the address on a postal letter sent within a large organization such as a
university: you indicate the business address as well as the department and the person to receive
the letter, and will get a different response (and different data) from Alice in Accounting than from
Sally in Sales.
Like postal letter addresses, URIs have a very specific format used to direct the request to the right
resource, illustrated in Figure 14.1.
https://domain.com:9999/example/page?type=husky&name=dubs#nose
2
Twitter API: https://developer.twitter.com/en/docs
3
iTunes search API: https://affiliate.itunes.apple.com/resources/documentation/itunes-store-web-service-
search-api/
4
Reddit API: https://www.reddit.com/dev/api/
5
GitHub API: https://developer.github.com/v3/
6
Uniform Resource Identifier (URI) Generic Syntax (official technical specification): https://tools.ietf.org/html/
rfc3986
Not all parts of the URI are required. For example, you don’t necessarily need a port, query, or
fragment. Important parts of the URI include:
■ scheme (protocol): The “language” that the computer will use to communicate the request
to the API. With web services this is normally https (secure HTTP).
■ path: The identifier of the resource on that web server you wish to access. This may be the
name of a file with an extension if you’re trying to access a particular file, but with web
services it often just looks like a folder path!
■ query: Extra parameters (arguments) with further details about the resource to access.
The domain and path usually specify the location of the resource of interest. For example,
www.domain.com/users might be an identifier for a resource that serves information about all the
users. Web services can also have “subresources” that you can access by adding extra pieces to the
path. For example, www.domain.com/users/layla might provide access to the specific resource (“layla”)
that you are interested in.
With web APIs, the URI is often viewed as being broken up into three parts, as shown in Figure 14.2:
■ The base URI is the domain that is included on all resources. It acts as the “root” for any
particular endpoint. For example, the GitHub API has a base URI of
https://api.github.com. All requests to the GitHub API will have that base.
■ An endpoint is the location that holds the specific information you want to access. Each API
will have many different endpoints at which you can access specific data resources. The
GitHub API, for example, has different endpoints for /users and /orgs so that you can
access data about users or organizations, respectively.
Note that many endpoints support accessing multiple subresources. For example, you can
access information about a specific user at the endpoint /users/:username. The colon :
indicates that the subresource name is a variable—you can replace that part of the endpoint
with whatever string you want. Thus if you were interested in the GitHub user nbremer,7 you
would access the /users/nbremer endpoint.
Subresources may have further subresources (which may or may not have variable names).
The endpoint /orgs/:org/repos refers to the list of repositories belonging to an
organization. Variable names in endpoints might alternatively be written inside of curly
braces {}—for example, /orgs/{org}/repos. Neither the colon nor the braces are included in the
URI that you actually send; they simply indicate that the value is a variable.
https://api.github.com/search/repositories?q=dplyr&sort=forks
7
Nadieh Bremer, freelance data visualization designer: https://www.visualcinnamon.com
■ Query parameters allow you to specify additional information about which exact
information you want from the endpoint, or how you want it to be organized (see
Section 14.2.1.1 for more details).
Remember: One of the biggest challenges in accessing a web API is understanding what
resources (data) the web service makes available and which endpoints (URIs) can request
those resources. Read the web service’s documentation carefully—popular services often
include examples of URIs and the data returned from them.
A query is constructed by appending the endpoint and any query parameters to the base URI. For
example, you could access a GitHub user by combining the base URI (https://api.github.
com) and endpoint (/users/nbremer) into a single string: https://api.github.com/users/
nbremer. Sending a request to that URI will return data about the user—you can send this request
from an R program or by visiting that URI in a web browser, as shown in Figure 14.3. In short, you
can access a particular data resource by sending a request to a particular endpoint.
Indeed, one of the easiest ways to make a request to a web API is by navigating to the URI using your
web browser. Viewing the information in your browser is a great way to explore the resulting data,
and make sure you are requesting information from the proper URI (i.e., that you haven’t made a
typo in the URI).
Tip: The JSON format (see Section 14.4) of data returned from web APIs can be quite messy
when viewed in a web browser. Installing a browser extension such as JSONViewa will for-
mat the data in a somewhat more readable way. Figure 14.3 shows data formatted with this
extension.
a
https://chrome.google.com/webstore/detail/jsonview/chklaanhfefbnpoihckbnefhakgolnmc
Web URIs can optionally include query parameters, which are used to request a more specific
subset of data. You can think of them as additional optional arguments that are given to the request
function—for example, a keyword to search for or criteria to order results by.
The query parameters are listed at the end of a URI, following a question mark (?) and are formed as
key–value pairs similar to how you named items in lists. The key (parameter name) is listed first,
followed by an equals sign (=), followed by the value (parameter value), with no spaces between
anything. You can include multiple query parameters by putting an ampersand (&) between each
key–value pair. You can see an example of this syntax by looking at the URL bar in a web browser
when you use a search engine such as Google or Yahoo, as shown in Figure 14.4. Search engines
produce URLs with a lot of query parameters, not all of which are obvious or understandable.
Notice that the exact query parameter name used differs depending on the web service. Google uses
a q parameter (likely for “query”) to store the search term, while Yahoo uses a p parameter.
Similar to arguments for functions, API endpoints may either require query parameters (e.g., you
must provide a search term) or optionally allow them (e.g., you may provide a sorting order). For
example, the GitHub API has a /search/repositories endpoint that allows users to search for a
specific repository: you are required to provide a q parameter for the query, and can optionally
provide a sort parameter for how to sort the results:
# A GitHub API URI with query parameters: search term `q` and sort
# order `sort`
https://api.github.com/search/repositories?q=dplyr&sort=forks
Figure 14.4 Search engine URLs for Google (top) and Yahoo (bottom) with query parameters (under-
lined in blue). The “search term” parameter for each web service is underlined in red.
Caution: Many special characters (e.g., punctuation) cannot be included in a URL. This
group includes characters such as spaces! Browsers and many HTTP request packages will
automatically encode these special characters into a usable format (for example, converting
a space into a %20), but sometimes you may need to do this conversion yourself.
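For example, base R’s URLencode() function can perform this conversion (a minimal sketch):
# Encode a string so that it can safely be included in a URL
URLencode("data science", reserved = TRUE) # returns "data%20science"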
Many web services require you to register with them to send them requests. This allows the web
service to limit access to the data, as well as to keep track of who is asking for which data (usually so
that if someone starts “spamming” the service, that user can be blocked).
To facilitate this tracking, many services provide users with access tokens (also called API keys).
These unique strings of letters and numbers identify a particular developer (like a secret password
that works just for you). Furthermore, your API key can provide you with additional access to
information based on which user you are. For example, when you get an access key for the GitHub
API, that key will provide you with additional access and control over your repositories. This
enables you to request information about private repos, and even programmatically interact with
GitHub through the API (i.e., you can delete a repo8 —so tread carefully!).
Web services will require you to include your access token in the request, usually as a query
parameter; the exact name of the parameter varies, but it often looks like access_token or
api_key. When exploring a web service, keep an eye out for whether it requires such tokens.
8
GitHub API, delete a repository https://developer.github.com/v3/repos/#delete-a- repository
Figure 14.5 A subset of the GitHub API response returned by the URI https://api.github.com/
search/repositories?q=dplyr&sort=forks, as displayed in a web browser.
Caution: Watch out for APIs that mention using an authentication service called OAuth
when explaining required API keys. OAuth is a system for performing authentication—that
is, having someone prove that they are who they say they are. OAuth is generally used to let
someone log into a website from your application (like what a “Log in with Google” button
does). OAuth systems require more than one access key, and these keys must be kept secret.
Moreover, they usually require you to run a web server to use them correctly (which requires
significant extra setup; see the full httr documentationa for details). You can do this in R,
but may want to avoid this challenge while learning how to use APIs.
a
https://cran.r-project.org/web/packages/httr/httr.pdf
Access tokens are a lot like passwords; you will want to keep them secret and not share them with
others. This means that you should not include them in any files you commit to git and push to
GitHub. The best way to ensure the secrecy of access tokens in R is to create a separate script file in
your repo (e.g., api_keys.R) that includes exactly one line, assigning the key to a variable:
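For example, the entire api_keys.R file might contain just the following (the key value shown is obviously fake):
# Contents of api_keys.R: store your secret key in a variable
api_key <- "123456789abcdef"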
To access this variable in your “main” script, you can use the source() function to load and run
your api_keys.R script (similar to clicking the Source button to run a script). This function will
execute all lines of code in the specified script file, as if you had “copy-and-pasted” its contents and
run them all with ctrl+enter. When you use source() to execute the api_keys.R script, it will
execute the code statement that defines the api_key variable, making it available in your
environment for your use:
# In your "main" script, load your API key from another file
# (Make sure working directory is set before running the following code!)
source("api_keys.R") # defines the `api_key` variable
Anyone else who runs the script will need to provide an api_key variable to access the API using
that user’s own key. This practice keeps everyone’s account separate.
You can keep your api_keys.R file from being committed by including the filename in the
.gitignore file in your repo; that will keep it from even possibly being committed with your code!
See Chapter 3 for details about working with the .gitignore file.
When you send a request to a URI, you also specify an HTTP verb (method) indicating what you want to do with that resource. The most common verbs include:
■ GET: Return (download) a representation of the resource.
■ POST: Add a new resource (e.g., create a new record).
■ PUT: Update the resource with new data.
■ PATCH: Apply a partial update to the resource.
■ DELETE: Remove the resource.
■ OPTIONS: Return the set of methods that can be performed on the resource.
By far the most commonly used verb is GET, which is used to “get” (download) data from a web
service—this is the type of request that is sent when you enter a URL into a web browser. Thus you
would send a GET request for the /users/nbremer endpoint to access that data resource.
Taken together, this structure of treating each datum on the web as a resource that you can interact
with via HTTP requests is referred to as the REST architecture (REpresentational State Transfer).
Thus, a web service that enables data access through named resources and responds to HTTP
requests is known as a RESTful service with a RESTful API.
# The URI for the `search/repositories` endpoint of the GitHub API: query
# for `dplyr`, sorting by `forks`
https://api.github.com/search/repositories?q=dplyr&sort=forks
This query accesses the /search/repositories endpoint, and also specifies two query
parameters:
n q: The search terms to query for (in this case, dplyr)
n sort: The attribute of each repository that you would like to use to sort the results (in this
case, the number of forks of the repo)
(Note that the data you will get back is structured in JSON format. See Section 14.4 for details.)
While you can access this information using your browser, you will want to load it into R for
analysis. In R, you can send GET requests using the httr9 package. As with dplyr, you will need to
install and load this package to use it:
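The installation and loading pattern is the same as for any other package:

# Install the `httr` package (only needs to be done once per machine)
install.packages("httr")

# Load the package to make its functions available to your script
library("httr")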
This package provides a number of functions that reflect HTTP verbs. For example, the GET()
function will send an HTTP GET request to the URI:
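For example, a request to the GitHub endpoint shown above might look like the following sketch:

# Send a GET request to the `search/repositories` endpoint of the GitHub API
response <- GET("https://api.github.com/search/repositories?q=dplyr&sort=forks")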
This code will make the same request as your web browser, and store the response in a variable
called response. While it is possible to include query parameters in the URI string (as above), httr
9. Getting started with httr: official quickstart guide for httr: https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html
also allows you to include them as a list passed as a query argument. Furthermore, if you plan on
accessing multiple different endpoints (which is common), you can structure your code a bit more
modularly, as described in the following example; this structure makes it easy to set and change
variables (instead of needing to do a complex paste() operation to produce the correct string):
# Restructure the previous request to make it easier to read and update. DO THIS.
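# (A sketch of the restructured request; the URI and parameters mirror the example above)
base_uri <- "https://api.github.com" # the base URI of the web service
endpoint <- "/search/repositories" # the endpoint (resource) of interest
resource_uri <- paste0(base_uri, endpoint) # combine them into the full URI

# Store the query parameters as a named list
query_params <- list(q = "dplyr", sort = "forks")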
# Make your request, specifying the query parameters via the `query` argument
response <- GET(resource_uri, query = query_params)
If you try printing out the response variable that is returned by the GET() function, you will first
see information about the response:
Response [https://api.github.com/search/repositories?q=dplyr&sort=forks]
Date: 2018-03-14 06:43
Status: 200
Content-Type: application/json; charset=utf-8
Size: 171 kB
This is called the response header. Each response has two parts: the header and the body. You can
think of the response as an envelope: the header contains meta-data like the address and postage
date, while the body contains the actual contents of the letter (the data).
Tip: The URI shown when you print out the response variable is a good way to check exactly
which URI you sent the request to: copy that into your browser to make sure it goes where
you expected!
Since you are almost always interested in working with the response body, you will need to extract
that data from the response (e.g., open up the envelope and pull out the letter). You can do this
with the content() function:
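A sketch of that extraction:

# Extract the body of the response as a character string (in JSON format)
response_text <- content(response, "text")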
Note the second argument "text" (the `as` argument); this is needed to keep httr from doing its own
processing on the response data (you will use other methods to handle that processing).
In JSON, lists of key–value pairs (called objects) are put inside braces ({ }), with the key and the
value separated by a colon (:) and each pair separated by a comma (,). Key–value pairs are often
written on separate lines for readability, but this isn’t required. Note that keys need to be character
strings (so, “in quotes”), while values can either be character strings, numbers, booleans (written in
lowercase as true and false), or even other lists! For example:
{
"first_name": "Ada",
"job": "Programmer",
"salary": 78000,
"in_union": true,
"favorites": {
"music": "jazz",
"food": "pizza",
}
}
Additionally, JSON supports arrays of data. Arrays are like untagged lists (or vectors with different
types), and are written in square brackets ([ ]), with values separated by commas. For example:
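A sketch of such an array (the same list of pet names appears in the next example):

["Magnet", "Mocha", "Anni", "Fifi"]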
Just as R allows you to have nested lists of lists, JSON can have any form of nested objects and arrays.
This structure allows you to store arrays (think vectors) within objects (think lists), such as the
following (more complex) set of data about Ada:
{
"first_name": "Ada",
"job": "Programmer",
"pets": ["Magnet", "Mocha", "Anni", "Fifi"],
"favorites": {
"music": "jazz",
"food": "pizza",
"colors": ["green", "blue"]
}
}
The JSON equivalent of a data frame is to store data as an array of objects. This is like having a list of
lists. For example, the following is an array of objects of FIFA Men’s World Cup data10 :
[
{"country": "Brazil", "titles": 5, "total_wins": 70, "total_losses": 17},
{"country": "Italy", "titles": 4, "total_wins": 66, "total_losses": 20},
{"country": "Germany", "titles": 4, "total_wins": 45, "total_losses": 17},
{"country": "Argentina", "titles": 2, "total_wins": 42, "total_losses": 21},
{"country": "Uruguay", "titles": 2, "total_wins": 20, "total_losses": 19}
]
# Represent the sample JSON data (World Cup data) as a list of lists in R
list(
list(country = "Brazil", titles = 5, total_wins = 70, total_losses = 17),
list(country = "Italy", titles = 4, total_wins = 66, total_losses = 20),
list(country = "Germany", titles = 4, total_wins = 45, total_losses = 17),
list(country = "Argentina", titles = 2, total_wins = 42, total_losses = 21),
list(country = "Uruguay", titles = 2, total_wins = 20, total_losses = 19)
)
This structure is incredibly common in web API data: as long as each object in the array has the
same set of keys, then you can easily consider this structure to be a data frame where each object
(list) represents an observation (row), and each key represents a feature (column) of that
observation. A data frame representation of this data is shown in Figure 14.6.
Remember: In JSON, tables are represented as lists of rows, instead of a data frame’s list of
columns.
10. FIFA World Cup data: https://www.fifa.com/fifa-tournaments/statistics-and-records/worldcup/teams/index.html
Figure 14.6 A data frame representation of World Cup statistics (left), which can also be represented
as JSON data (right).
A more effective solution for transforming JSON data is to use the jsonlite package.11 This
package provides helpful methods to convert JSON data into R data, and is particularly well suited
for converting content into data frames.
The jsonlite package provides a function called fromJSON() that allows you to convert from a
JSON string into a list—or even a data frame if the intended columns have the same lengths!
11. Package jsonlite: full documentation for jsonlite: https://cran.r-project.org/web/packages/jsonlite/jsonlite.pdf
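As an external package, jsonlite needs to be installed and loaded before use; converting the response text from above is then a single call (a sketch):

# Install (once) and load the `jsonlite` package
install.packages("jsonlite")
library("jsonlite")

# Convert the JSON string (the response body) into a list
response_data <- fromJSON(response_text)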
Both the raw JSON data (response_text) and the parsed data structure (response_data) are
shown in Figure 14.7. As you can see, the raw string (response_text) is indecipherable. However,
once it is transformed using the fromJSON() function, it has a much more operable structure.
The response_data will contain a list built out of the JSON. Depending on the complexity of the
JSON, this may already be a data frame you can View()—but more likely you will need to explore
the list to locate the “main” data you are interested in. Good strategies for this include the
following techniques:
n You can print() the data, but that is often hard to read (it requires a lot of scrolling).
n The str() function will return a list’s structure, though it can still be hard to read.
n The names() function will return the keys of the list, which is helpful for delving into the
data.
Figure 14.7 Parsing the text of an API response using fromJSON(). The untransformed text is shown
on the left (response_text), which is transformed into a list (on the right) using the fromJSON()
function.
# Use various methods to explore and extract information from API results
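# (A sketch, assuming `response_data` was produced by fromJSON() as above)
names(response_data) # returns the keys of the list, such as "items"
str(response_data$items) # inspect the structure of the `items` element
items <- response_data$items # extract the data frame of matching repositories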
The set of responses—GitHub repositories that match the search term "dplyr"—returned from the
request and stored in the response_data$items key is shown in Figure 14.8.
Figure 14.8 Data returned by the GitHub API: repositories that match the term “dplyr” (stored in the
variable response_data$items in the code example).
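A parsed response will sometimes contain nested data frames: a column whose values are themselves a data frame. As a sketch of how such a structure can arise, consider two small hypothetical data frames (people and favorites), each with one row per person:

# Two hypothetical data frames, each with one row per person
people <- data.frame(first_name = c("Ada", "Bob"), job = c("Programmer", "Chef"))
favorites <- data.frame(music = c("jazz", "rock"), food = c("pizza", "pasta"))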
# Store the second data frame as a column of the first -- A BAD IDEA
people$favorites <- favorites # the `favorites` column is a data frame!
Nested data frames make it hard to work with the data using previously established techniques and
syntax. Luckily, the jsonlite package provides a helpful function for addressing this issue, called
flatten(). This function takes the columns of each nested data frame and converts them into
appropriately named columns in the “outer” data frame, as shown in Figure 14.9:
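A sketch of that call, using the hypothetical people data frame from above:

# Convert the nested `favorites` column into columns of the outer data frame
people <- flatten(people) # columns are now named `favorites.music`, `favorites.food`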
Note that flatten() works only on values that are already data frames; thus you may need to find
the appropriate element inside of the list—that is, the element that is the data frame you want to
flatten.
In practice, you will almost always want to flatten the data returned from a web API. Thus, your
algorithm for requesting and parsing data from an API is this:
1. Use GET() to request the data from an API, specifying the URI (and any query parameters).
2. Use content() to extract the data from your response as a JSON string (as “text”).
3. Use fromJSON() to convert the data from a JSON string into a list.
4. Explore the returned list (e.g., with names() or str()) to find the element that contains your data of interest.
5. Use flatten() to flatten your data into a properly structured data frame. (A combined sketch of these steps follows.)
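Put together, the pipeline looks something like the following sketch (reusing the GitHub search example from earlier in the chapter):

# 1. Request the data from the API, specifying the URI and query parameters
response <- GET(
  "https://api.github.com/search/repositories",
  query = list(q = "dplyr", sort = "forks")
)

# 2. Extract the body of the response as a JSON string
response_text <- content(response, "text")

# 3. Convert the JSON string into a list
response_data <- fromJSON(response_text)

# 4. Find the element that holds your data of interest
names(response_data) # here, the `items` element stores the matching repositories

# 5. Flatten that element into a properly structured data frame
repos <- flatten(response_data$items)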
Figure 14.9 The flatten() function transforming a nested data frame (top) into a usable format
(bottom).
Given the geographic nature of this question (where to find the best Cuban food in Seattle), this
section builds a map of the best-rated Cuban restaurants in Seattle, as shown in Figure 14.12. The
complete code for this analysis is also available online in the book's code repository.13
To send requests to the Yelp Fusion API, you will need to acquire an API key. You can do this by
signing up for an account on the API’s website, and registering an application (it is common for
APIs to require you to register for access). As described earlier, you should store your API key in a
separate file so that it can be kept secret:
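A sketch of what that file might contain (the key value here is a placeholder; the variable name matches the source() call shown below):

# Contents of the separate `api_key.R` file
yelp_key <- "abcdef123456789"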
This API requires you to use an alternative syntax for specifying your API key in the HTTP
request—instead of passing your key as a query parameter, you’ll need to add a header to the
request that you make to the API. An HTTP header provides additional information to the server
about who is sending the request—it’s like extra information on the request’s envelope. Specifically,
12. Yelp Fusion API documentation: https://www.yelp.com/developers/documentation/v3
13. APIs in Action: https://github.com/programming-for-data-science/in-action/tree/master/apis
you will need to include an “Authorization” header containing your API key (in the format
expected by the API) for the request to be accepted:
# Load your API key from a separate file so that you can access the API:
source("api_key.R") # the `yelp_key` variable is now available
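# Make a GET request, passing the key in an `Authorization` header
# (the URI here is illustrative; the full search URI is constructed below)
response <- GET(
  "https://api.yelp.com/v3/businesses/search",
  add_headers(Authorization = paste("bearer", yelp_key))
)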
This code invokes the add_headers() function inside the GET() request. The header that it adds
sets the value of the Authorization header to "bearer" followed by your key. This syntax indicates that the API
should grant authorization to the bearer of the API key (you). This authentication process is used
instead of setting the API key as a query parameter (a method of authentication that is not
supported by the Yelp Fusion API).
As with any other API, you can determine the URI to send the request to by reading through the
documentation. Given the prompt of searching for Cuban restaurants in Seattle, you should focus
on the Business Search documentation,14 a section of which is shown in Figure 14.10.
Figure 14.10 A subset of the Yelp Fusion API Business Search documentation.
14. Yelp Fusion API Business Search endpoint documentation: https://www.yelp.com/developers/documentation/v3/business_search
As you read through the documentation, it is important to identify the query parameters that you
need to specify in your request. In doing so, you are mapping from your question of interest to the
specific R code you will need to write. For this question (“Where is the best Cuban food in
Seattle?”), you need to figure out how to make the following specifications:
n Food: Rather than search all businesses, you need to search for only restaurants. The API
makes this available through the term parameter.
n Cuban: The restaurants you are interested in must be of a certain type. To support this, you
can specify the category of your search (making sure to specify a supported category, as
described elsewhere in the documentation15 ).
n Seattle: The restaurant you are looking for must be in Seattle. There are a few ways of
specifying a location, the most general of which is to use the location parameter. You can
further limit your results using the radius parameter.
n Best: To find the best food, you can control how the results are sorted with the sort_by
parameter. You’ll want to sort the results before you receive them (that is, by using an API
parameter and not dplyr) to save you some effort and to make sure the API sends only the
data you care about.
Often the most time-consuming part of using an API is figuring out how to hone in on your data of
interest using the parameters of the API. Once you understand how to control which resource
(data) is returned, you can then construct and send an HTTP request to the API:
# Construct a search query for the Yelp Fusion API's Business Search endpoint
base_uri <- "https://api.yelp.com/v3"
endpoint <- "/businesses/search"
search_uri <- paste0(base_uri, endpoint)
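# Store the query parameters for the search; the specific values below are an
# illustrative sketch of the parameters described above
query_params <- list(
  term = "restaurant",
  categories = "cuban",
  location = "Seattle, WA",
  sort_by = "rating",
  radius = 8000 # limit results to (roughly) the city; measured in meters
)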
# Make a GET request, including the API key (as a header) and the list of
# query parameters
response <- GET(
search_uri,
query = query_params,
add_headers(Authorization = paste("bearer", yelp_key))
)
15. Yelp Fusion API Category List: https://www.yelp.com/developers/documentation/v3/all_category_list
As with any other API response, you will need to use the content() method to extract the content
from the response, and then format the result using the fromJSON() method. You will then need to
find the data frame of interest in your response. A great way to start is to use the names() function
on your result to see what data is available (in this case, you should notice that the businesses key
stores the desired information). You can flatten() this item into a data frame for easy access.
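A sketch of those steps:

# Extract the response body as text, then convert it from a JSON string
response_text <- content(response, "text")
response_data <- fromJSON(response_text)
names(response_data) # look for the element that stores the businesses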
# Flatten the data frame stored in the `businesses` key of the response
restaurants <- flatten(response_data$businesses)
Because the data was requested in sorted format, you can mutate the data frame to include a column
with the rank number, as well as add a column with a string representation of the name and rank:
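A sketch of that step (it assumes the dplyr package is loaded; name_and_rank is a hypothetical column name, while name comes from the Yelp response):

# Add a rank column (results arrive already sorted), plus a label for the map
restaurants <- restaurants %>%
  mutate(
    rank = row_number(),
    name_and_rank = paste0(rank, ". ", name)
  )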
The final step is to create a map of the results. The following code uses two different visualization
packages (namely, ggmap and ggplot2), both of which are explained in more detail in Chapter 16.
Figure 14.11 A subset of the data returned by a request to the Yelp Fusion API for Cuban food in
Seattle.
Figure 14.12 A map of the best Cuban restaurants in Seattle, according to the Yelp Fusion API.
# Create a base layer for the map (Google Maps image of Seattle)
base_map <- ggmap(get_map(location = "Seattle, WA", zoom = 11))
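One way to layer the restaurant locations on top of that base map (a sketch; the coordinate column names assume the flattened Yelp response):

# Add a layer of points at each restaurant's longitude and latitude
base_map +
  geom_point(
    data = restaurants,
    mapping = aes(x = coordinates.longitude, y = coordinates.latitude)
  )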
Below is the full script that runs the analysis and creates the map—only 52 lines of clearly
commented code to figure out where to go to dinner!
# Create a base layer for the map (Google Maps image of Seattle)
base_map <- ggmap(get_map(location = "Seattle, WA", zoom = 11))
Using this approach, you can use R to load and format data from web APIs, enabling you to analyze
and work with a wider variety of data. For practice working with APIs, see the set of accompanying
book exercises.16
16. API exercises: https://github.com/programming-for-data-science/chapter-14-exercises
V: Data Visualization
This section of the book covers the conceptual (design) and technical (programming) skills
necessary to construct meaningful visualizations. It provides the necessary visualization theory
(Chapter 15) to identify optimal layouts for your data, and includes in-depth descriptions of the
most popular visualization packages in R (Chapter 16 and Chapter 17).
15 Designing Data Visualizations
Data visualization, when done well, allows you to reveal patterns in your data and communicate
insights to your audience. This chapter describes the conceptual and design skills necessary to craft
effective and expressive visual representations of your data. In doing so, it introduces skills for each of
the following steps in the visualization process: selecting a visual layout for your data, choosing effective graphical encodings, maximizing the expressiveness of your displays, and enhancing the aesthetics of your charts.
Generating visual displays of your data is a key step in the analytical process. While you should
strive to design aesthetically pleasing visuals, it’s important to remember that visualization is a
means to an end. Devising appropriate renderings of your data can help expose underlying patterns
in your data that were previously unseen, or that were undetectable by other tests.
1. Card, S. K., Mackinlay, J. D., & Shneiderman, B. (1999). Readings in information visualization: Using vision to think. Burlington, MA: Morgan Kaufmann.
To demonstrate how visualization makes a distinct contribution to the data analysis process
(beyond statistical tests), consider the canonical data set Anscombe’s Quartet (which is included
with the R software as the data set anscombe). This data set consists of four pairs of x and y data:
(x1, y1), (x2, y2), and so on. The data set is shown in Table 15.1.
The challenge of Anscombe’s Quartet is to identify differences between the four pairs of columns.
For example, how does the (x1, y1) pair differ from the (x2, y2) pair? Using a nonvisual approach to
answer this question, you could compute a variety of descriptive statistics for each set, as shown in
Table 15.2. Given these six statistical assessments, these four data sets appear to be identical.
However, if you graphically represent the relationship between each x and y pair, as in Figure 15.1,
you reveal the distinct nature of their relationships.
While computing summary statistics is an important part of the data exploration process, it is only
through visual representations that differences across these sets emerge. The simple graphics in
Figure 15.1 expose variations in the distributions of x and y values, as well as in the relationships
between them. Thus the choice of representation becomes paramount when analyzing and
presenting data. The following sections introduce basic principles for making that choice.
Table 15.1 Anscombe’s Quartet: four data sets with two features each
x1 y1 x2 y2 x3 y3 x4 y4
10.00 8.04 10.00 9.14 10.00 7.46 8.00 6.58
8.00 6.95 8.00 8.14 8.00 6.77 8.00 5.76
13.00 7.58 13.00 8.74 13.00 12.74 8.00 7.71
9.00 8.81 9.00 8.77 9.00 7.11 8.00 8.84
11.00 8.33 11.00 9.26 11.00 7.81 8.00 8.47
14.00 9.96 14.00 8.10 14.00 8.84 8.00 7.04
6.00 7.24 6.00 6.13 6.00 6.08 8.00 5.25
4.00 4.26 4.00 3.10 4.00 5.39 19.00 12.50
12.00 10.84 12.00 9.13 12.00 8.15 8.00 5.56
7.00 4.82 7.00 7.26 7.00 6.42 8.00 7.91
5.00 5.68 5.00 4.74 5.00 5.73 8.00 6.89
Table 15.2 Anscombe’s Quartet: the (X, Y) pairs share identical summary statistics
Set Mean X Std. Deviation X Mean Y Std. Deviation Y Correlation Linear Fit
1 9.00 3.32 7.50 2.03 0.82 y = 3 + 0.5x
2 9.00 3.32 7.50 2.03 0.82 y = 3 + 0.5x
3 9.00 3.32 7.50 2.03 0.82 y = 3 + 0.5x
4 9.00 3.32 7.50 2.03 0.82 y = 3 + 0.5x
Figure 15.1 Anscombe’s Quartet: scatterplots reveal four different (x, y) relationships that are not
detectable using descriptive statistics.
Selecting a visual layout for your data depends on a number of constraints:
1. The specific question of interest you are attempting to answer in your domain
2. The type of data you have available for answering that question
3. How accurately different graphical encodings are visually decoded by your audience
4. The spatial limitations in the medium you are using (pixels on the screen, inches on the page, etc.)
This section focuses on the second of these constraints (data type); the last two constraints are
addressed in Section 15.3 and Section 15.4. The first constraint (the question of interest) is closely
tied to Chapter 10 on understanding data. Based on your domain, you need to hone in on a
question of interest, and identify a data set that is well suited for answering your question. This
section will expand upon the same data set and question from Chapter 10: an exploration of the causes of death in the United States.
As with the Anscombe’s Quartet example, most basic exploratory data questions can be reduced to
investigating how a variable is distributed or how variables are related to one another. Once you have
mapped from your question of interest to a specific data set, your visualization type will largely
depend on the data type of your variables. The data type of each column—nominal, ordinal, or
continuous—will dictate how the information can be represented. The following sections describe
techniques for visually exploring each variable, as well as making comparisons across variables.
For continuous variables, a histogram will allow you to see the distribution and range of values, as
shown in Figure 15.2. Alternatively, you can use a box plot or a violin plot, both of which are
shown in Figure 15.3. Note that outliers (extreme values) in the data set have been removed to better
express the information in the charts.
While these visualizations display information about the distribution of the number of deaths by
cause, they all leave an obvious question unanswered: what are the names of these diseases?
Figure 15.4 uses a bar chart to label the top 10 causes of death, but due to the constraint of the page
size, this display is able to express just a small subset of the data. In other words, bar charts don’t
easily scale to hundreds or thousands of observations because they are inefficient to scan, or won’t
fit in a given medium.
Figure 15.2 A histogram showing the distribution of the number of deaths for each cause in the United States.
Figure 15.3 Alternative visualizations for showing distributions of the number of deaths in the United
States: violin plot (left) and box plot (right). Some outliers have been removed for demonstration.
Figure 15.4 Top causes of death in the United States as shown in a bar chart.
Figure 15.5 Proportional representations of the top causes of death in the United States: stacked
bar chart (top) and pie chart (bottom).
An added benefit of a treemap is its ability to express hierarchical data (more on this later in the chapter). Later
sections explore the trade-offs in perceptual accuracy associated with each of these representations.
If your variable of interest is a categorical variable, you will need to aggregate your data (e.g., count
the number of occurrences of different categories) to ask similar questions about the distribution.
Once you have done so, you can use similar techniques to show the data (e.g., bar chart, pie chart, treemap).
For example, the diseases in this data set are categorized into three types of diseases:
non-communicable diseases, such as heart disease or lung cancer; communicable diseases, such as
tuberculosis or whooping cough; and injuries, such as road traffic accidents or self harm. To
understand how this categorical variable (disease type) is distributed, you can count the number of
rows for each category, then display those quantitative values, as in Figure 15.6.
For comparing relationships between two continuous variables, the best choice is a scatterplot.
The visual processing system is quite good at estimating the linearity in a field of points created by a
scatterplot, allowing you to describe how two variables are related. For example, using the disease
burden data set, you can compare different metrics for measuring health loss. Figure 15.7 compares
the disease burden as measured by the number of deaths due to each cause to the number of years of
life lost (a metric that accounts for the age at death for each individual).
You can extend this approach to multiple continuous variables by creating a scatterplot matrix of
all continuous features in the data set. Figure 15.8 compares all pairs of metrics of disease burden,
Figure 15.6 A visual representation of the number of causes in each disease category: non-
communicable diseases, communicable diseases, and injuries.
Figure 15.7 Using a scatterplot to compare two continuous variables: the number of deaths versus
the years of life lost for each disease in the United States.
Figure 15.8 Comparing multiple continuous measurements of disease burden using a scatterplot
matrix.
including number of deaths, years of life lost (YLLs), years lived with disability (YLDs, a measure of
the disability experienced by the population), and disability-adjusted life years (DALYs, a combined
measure of life lost and disability).
When comparing relationships between one continuous variable and one categorical variable, you
can compute summary statistics for each group (see Figure 15.6), use a violin plot to display
distributions for each category (see Figure 15.9), or use faceting to show the distribution for each
category (see Figure 15.10).
For assessing relationships between two categorical variables, you need a layout that enables you to
assess the co-occurrences of nominal values (that is, whether an observation contains both values). A
great way to do this is to count the co-occurrences and show a heatmap. As an example, consider a
broader set of population health data that evaluates the leading cause of death in each country
(also from the Global Burden of Disease study). Figure 15.11 shows a subset of this data, including
the disease type (communicable, non-communicable) for each disease, and the region where each
country is found.
One question you may ask about this categorical data is:
“In each region, how often is the leading cause of death a communicable disease versus a
non-communicable disease?”
To answer this question, you can aggregate the data by region, and count the number of times each
disease category (communicable, non-communicable) appears as the category for the leading cause
of death. This aggregated data (shown in Figure 15.12) can then be displayed as a heatmap, as in
Figure 15.13.
Figure 15.9 A violin plot showing the continuous distributions of the number of deaths for each
cause (by category). Some outliers have been removed for demonstration.
Figure 15.10 A faceted layout of histograms showing the continuous distributions of the number of
deaths for each cause (by category). Some outliers have been removed for demonstration.
Figure 15.11 The leading cause of death in each country. The category of each disease (communi-
cable, non-communicable) is shown, as is the region in which each country is found.
Figure 15.12 Number of countries in each region in which the leading cause of death is
communicable/non-communicable.
Figure 15.13 A heatmap of the number of countries in each region in which the leading cause of
death is communicable/non-communicable.
Figure 15.14 A treemap of the number of deaths in the United States from each cause.
Screenshot from GBD Compare, a visualization tool for the global burden of disease
(https://vizhub.healthdata.org/gbd-compare/).
Disease burden data is also hierarchical: each cause of death (e.g., lung cancer) is a member of a family of causes (e.g., cancers), which can be further grouped
into overarching categories (e.g., non-communicable diseases). Hierarchical data can be visualized
using treemaps (Figure 15.14), circle packing (Figure 15.15), sunburst diagrams (Figure 15.16), or
other layouts. Each of these visualizations uses an area encoding to represent a numeric value.
These shapes (rectangles, circles, or arcs) are organized in a layout that clearly expresses the
hierarchy of information.
The benefit of visualizing the hierarchy of a data set, however, is not without its costs. As described
in Section 15.3, it is quite difficult to visually decipher and compare values encoded in a treemap
(especially with rectangles of different aspect ratios). However, these displays provide a great
summary overview of hierarchies, which is an important starting point for visually exploring data.
Figure 15.15 A re-creation of the treemap visualization (of disease burden in the United States)
using a circle pack layout. Created using the d3.js library https://d3js.org.
Figure 15.16 A re-creation of the treemap visualization (of disease burden in the United States)
using a sunburst diagram. Created using the d3.js library https://d3js.org.
Your task is thus to select the encodings that are most accurately decoded by users, answering the
question:
“What visual form best allows you to exploit the human visual system and available space to
accurately display your data values?”
In designing a visual layout, you should choose the graphical encodings that are most accurately
visually decoded by your audience. This means that, for every value in your data, your user’s
interpretation of that value should be as accurate as possible. The accuracy of these perceptions is
referred to as the effectiveness of a graphical encoding. Academic research2 measuring the
perceptiveness of different visual encodings has established a common set of possible encodings for
quantitative information, listed here in order from most effective to least effective:
n Position: the position of an element along a common scale (e.g., an axis), as in a scatterplot or bar chart
n Length: the length of an element, such as the length of a bar in a bar chart
n Area: the area of an element, such as a circle or a rectangle, typically used in a bubble chart (a
scatterplot with differently sized markers) or a treemap
n Angle: the rotational angle of each marker, typically used in a circular layout like a pie chart
n Color: the color of each marker, usually along a continuous color scale
n Volume: the volume of a three-dimensional shape, typically used in a 3D bar chart
As an example, consider the very simple data set in Table 15.3. An effective visualization of this data
set would enable you to easily distinguish between the values of each group (e.g., between the
values 10 and 11). While this identification is simple for a position encoding, detecting this 10%
difference is very difficult for other encodings. Comparisons between encodings of this data set are
shown in Figure 15.17.
Thus when a visualization designer makes a blanket claim like “You should always use a bar chart
rather than a pie chart,” the designer is really saying, “A bar chart, which uses position encoding
along a common scale, is more accurately visually decoded compared to a pie chart (which uses an
angle encoding).”
Table 15.3 A simple data set to demonstrate the perceptiveness of different graphical encodings
(shown in Figure 15.17). Users should be able to visually distinguish between these values.
group value
a 1
b 10
c 11
d 7
e 8
2. Most notably, Cleveland, W. S., & McGill, R. (1984). Graphical perception: Theory, experimentation, and application to the development of graphical methods. Journal of the American Statistical Association, 79(387), 531–554. https://doi.org/10.1080/01621459.1984.10478080
Figure 15.17 Different graphical encodings of the same data. Note the variation in perceptibility of
differences between values!
To design your visualization, you should begin by encoding the most important data features with
the most accurately decoded visual features (position, then length, then area, and so on). This will
provide you with guidance as you compare different chart options and begin to explore more
creative layouts.
While these guidelines may feel intuitive, the volume and distribution of your data often make this
task more challenging. You may struggle to display all of your data, requiring you to also work to
maximize the expressiveness of your visualizations (see Section 15.4).
One useful way to think about colors is in terms of the HSL color model, which describes each color using three attributes:
n The hue of a color, which is likely how you think of describing a color (e.g., "green" or "blue")
n The saturation or intensity of a color, which describes how “rich” the color is on a linear
scale between gray (0%) and the full display of the hue (100%)
n The lightness of the color, which describes how “bright” the color is on a linear scale from
black (0%) to white (100%)
This color model can be seen in Figure 15.18, which is an example of an interactive color selector3
that allows you to manipulate each attribute independently to pick a color. The HSL model
provides a good foundation for color selection in data visualization.
When selecting colors for visualization, the data type of your variable should drive your decisions.
Depending on the data type (categorical or continuous), the purpose of your encoding will likely
be different:
n For categorical variables, a color encoding is used to distinguish between groups. Therefore, you
should select colors with different hues that are visually distinct and do not imply a rank
ordering.
n For continuous variables, a color encoding is used to estimate values. Therefore, colors should
be picked using a linear interpolation between color points (i.e., different lightness values).
Picking colors that most effectively satisfy these goals is trickier than it seems (and beyond the
scope of this short section). But as with any other challenge in data science, you can build upon the
open source work of other people. One of the most popular tools for picking colors (especially for
maps) is Cynthia Brewer’s ColorBrewer.4 This tool provides a wonderful set of color palettes that
differ in hue for categorical data (e.g., “Set3”) and in lightness for continuous data (e.g., “Purples”);
see Figure 15.19. Moreover, these palettes have been carefully designed to be viewable to people
3. HSL Calculator by w3schools: https://www.w3schools.com/colors/colors_hsl.asp
4. ColorBrewer: http://colorbrewer2.org
Figure 15.19 All palettes made available by the RColorBrewer package in R. Run the
display.brewer.all() function to see them in RStudio.
with certain forms of color blindness. These palettes are available in R through the RColorBrewer
package; see Chapter 16 for details on how to use this package as part of your visualization process.
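For example, you can preview the available palettes and extract colors from one of them (a short sketch):

# Load the RColorBrewer package (install it first if necessary)
library("RColorBrewer")

# Display all of the available palettes in the plot pane
display.brewer.all()

# Get a vector of 5 colors from the "Set3" palette
brewer.pal(n = 5, name = "Set3")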
Selecting between different types of color palettes depends on the semantic meaning of the data.
This choice is illustrated in Figure 15.20, which shows map visualizations of the population of each
county in Washington state. The choice between different types of continuous color scales
depends on the data:
n Sequential color scales are often best for displaying continuous values along a linear scale
(e.g., for this population data).
n Diverging color scales are most appropriate when the divergence from a center value is
meaningful (e.g., the midpoint is zero). For example, if you were showing changes in
population over time, you could use a diverging scale to show increases in population using
one hue, and decreases in population using another hue.
n Multi-hue color scales afford an increase in contrast between colors by providing a broader
color range. While this allows for more precise interpretations than a (single hue) sequential
color scale, the user may misinterpret or misjudge the differences in hue if the scale is not
carefully chosen.
Figure 15.20 Population data in Washington represented with four ColorBrewer scales. The sequen-
tial and black/white scales accurately represent continuous data, while the diverging scale (inappro-
priately) implies divergence from a meaningful center point. Colors in the multi-hue scale may be
misinterpreted as having different meanings.
n Black and white color scales are equivalent to sequential color scales (just with a hue of
gray!) and may be required for your medium (e.g., when printing in a book or newspaper).
Overall, the choice of color will depend on the data. Your goal is to make sure that the color scale
chosen enables the viewer to most effectively distinguish between the data’s values and meanings.
As an example, consider Figure 15.21, in which you are able to count the occurrences of the number
3 at dramatically different speeds in each graphic. This is possible because your brain naturally
identifies elements of the same color (more specifically, opacity) without having to put forth any
effort. This technique can be used to drive focus in a visualization, thereby helping people quickly
identify pertinent information.
Figure 15.21 Because opacity is processed preattentively, the visual processing system identifies
elements of interest (the number 3) without effort in the right graphic, but not in the left graphic.
5. Healey, C. G., & Enns, J. T. (2012). Attention and visual memory in visualization and computer graphics. IEEE Transactions on Visualization and Computer Graphics, 18(7), 1170–1188. https://doi.org/10.1109/TVCG.2011.127. Also at: https://www.csc2.ncsu.edu/faculty/healey/PP/
6. Ware, C. (2012). Information visualization: Perception for design. Philadelphia, PA: Elsevier.
Figure 15.22 Driving focus with preattentive attributes. The selected point is clear in each graph,
but especially easy to detect using color.
In addition to color, you can use other visual attributes that help viewers preattentively distinguish
observations from those around them, as illustrated in Figure 15.22. Notice how quickly you can
identify the “selected” point—though this identification happens more rapidly with some
encodings (i.e., color) than with others!
As you can see, color and opacity are two of the most powerful ways to grab attention. However, you
may find that you are already using color and opacity to encode a feature of your data, and thus
can’t also use these encodings to draw attention to particular observations. In that case, you can
consider the remaining options (e.g., shape, size, enclosure) to direct attention to a specific set of
observations.
A set of facts [data] is expressible in a language [visual layout] if that language contains
a sentence [form] that encodes all of the facts in the set, and encodes only the facts in the set.
7. Mackinlay, J. (1986). Automating the design of graphical presentations of relational information. ACM Transactions on Graphics, 5(2), 110–141. https://doi.org/10.1145/22949.22950. Restatement by Jeffrey Heer.
The prompt of this expressiveness aim is to devise visualizations that express all of (and only) the
data in your data set. The most common barrier to expressiveness is occlusion (overlapping data
points). As an example, consider Figure 15.23, which visualizes the distribution of the number of
deaths attributable to different causes in the United States. This chart uses the most visually
perceptive visual encoding (position), but fails to express all of the data due to the overlap in values.
There are two common approaches to address the failure of expressiveness caused by overlapping
data points:
1. Adjust the visual encoding, for example by lowering the opacity of your symbols so that overlapping points remain visible.
2. Break the data into different groupings or facets to alleviate the overlap (by showing only a
subset of the data at a time).
Alternatively, you could consider changing the data that you are visualizing by aggregating it in an
appropriate way. For example, you could group your data by values that have similar number of
deaths (putting each into a “bin”), and then use a position encoding to show the number of
observations per bin. The result of this is the commonly used layout known as a histogram, as
shown in Figure 15.25. While this visualization does communicate summary information to your
audience, it is unable to express each individual observation in the data (which would
communicate more information through the chart).
At times, the expressiveness and effectiveness principles are at odds with one another. In an
attempt to maximize expressiveness (and minimize the overlap of your symbols), you may have to
choose a less effective encoding. While there are multiple strategies for this—for example, breaking
Figure 15.23 Position encoding of the number of deaths from each cause in the United States.
Notice how the overlapping points (occlusion) prevent this layout from expressing all of the data.
Some outliers have been removed for demonstration.
Figure 15.24 Position encoding of the number of deaths from each cause in the United States,
faceted by the category of each cause. The use of a lower opacity in conjunction with the faceting
enhances the expressiveness of the plots. Some outliers have been removed for demonstration.
Figure 15.25 A histogram created by binning the data: the number of causes that fall within each range of death counts.
the data into multiple plots, aggregating the data, and changing the opacity of your symbols—the
most appropriate choice will depend on the distribution and volume of your data, as well as the
specific question you wish to answer.
Tip: Making beautiful charts is a practice of removing clutter, not adding design.
One of the most renowned data visualization theorists, Edward Tufte, frames this idea in terms of
the data–ink ratio.8 Tufte argues that in every chart, you should maximize the ink dedicated to
displaying the data (and in turn, minimize the non-data ink). This can translate to a number of
actions:
n Remove unnecessary encodings. For example, if you have a bar chart, the bars should have
different colors only if that information isn’t otherwise expressed.
8. Tufte, E. R. (1986). The visual display of quantitative information. Cheshire, CT: Graphics Press.
n Avoid visual effects. Any 3D effects, unnecessary shading, or other distracting formatting
should be avoided. Tufte refers to this as “chart junk.”
n Include chart and axis labels. Provide a title for your chart, as well as meaningful labels for
your axes.
n Lighten legends/labels. Reduce the size or opacity of axis labels. Avoid using striking colors.
It’s easy to look at a chart such as the chart on the left side of Figure 15.26 and claim that it looks
unpleasant. However, describing why it looks distracting and how to improve it can be more
challenging. If you follow the tips in this section and strive for simplicity, you can remove
unnecessary elements and drive focus to the data (as shown on the right-hand side of Figure 15.26).
Luckily, many of these optimal choices are built into the default R packages for visualization, or are
otherwise readily implemented. That being said, you may have to adhere to the aesthetics of your
organization (or your own preferences!), so choosing an easily configurable visualization package
(such as ggplot2, described in Chapter 16) is crucial.
As you begin to design and build visualizations, remember the following guidelines:
1. Start from a specific question of interest and the data available to answer it.
2. Select a visual layout that suits the data types of your variables.
3. Choose optimal graphical encodings based on how well they are visually decoded.
4. Maximize the expressiveness of your display (e.g., by avoiding occlusion of your data).
5. Enhance the aesthetics by removing visual effects, and by including clear labels.
These guidelines will be a helpful start, and don’t forget that visualizations are about insights, not
pictures.
Figure 15.26 Removing distracting and uninformative visual features (left) and adding informative
labels to create a cleaner chart (right).
16 Creating Visualizations with ggplot2
The ability to create visualizations (graphical representations) of data is a key step in being able to
communicate information and findings to others. In this chapter, you will learn to use the
ggplot21 package to declaratively make beautiful visual representations of your data.
Although R does provide built-in plotting functions, the ggplot2 package is built on the premise
of the Grammar of Graphics (similar to how dplyr implements a Grammar of Data Manipulation;
indeed, both packages were originally developed by the same person). This makes the package
particularly effective for describing how visualizations should represent data, and has turned it into
the preeminent plotting package in R. Learning to use this package will allow you to make nearly
any kind of (static) data visualization, customized to your exact specifications.
In short, this grammar describes any plot in terms of a common set of components, including:
n The data (data frame) being plotted
n The geometric objects (e.g., circles, lines) that appear on the plot
n The aesthetics (appearance) of the geometric objects, and the mappings from variables in the
data to those aesthetics
n A position adjustment for placing elements on the plot so they don’t overlap
1. ggplot2: http://ggplot2.tidyverse.org
2. Wickham, H. (2010). A layered grammar of graphics. Journal of Computational and Graphical Statistics, 19(1), 3–28. https://doi.org/10.1198/jcgs.2009.07098. Also at http://vita.had.co.nz/papers/layered-grammar.pdf
ggplot2 further organizes these components into layers, where each layer displays a single type of
(highly configurable) geometric object. Following this grammar, you can think of each plot as a set of
layers of images, where each image’s appearance is based on some aspect of the data set.
Collectively, this grammar enables you to discuss what plots look like using a standard set of
vocabulary. And like with dplyr and the Grammar of Data Manipulation, ggplot2 uses this
grammar directly to declare plots, allowing you to more easily create specific visual images and tell
stories3 about your data.
ggplot2 is yet another external package (like dplyr, httr, etc.), so you will need to install and load
the package to use it:
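As with those packages, this is a two-step pattern (shown here as a reminder):

# Install the `ggplot2` package (only needs to be done once per machine)
install.packages("ggplot2")

# Load the package to make its functions available to your script
library("ggplot2")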
This will make all of the plotting functions you will need available. As a reminder, plots will be
rendered in the lower-right quadrant of RStudio, as shown in Figure 16.1.
Fun Fact: Similar to dplyr, the ggplot2 package also comes with a number of built-in data
sets. This chapter will use the provided midwest data set as an example, described below.
This section uses the midwest data set that is included as part of the ggplot2 package—a subset of
the data is shown in Figure 16.2. The data set contains information on each of 437 counties in
5 states in the midwestern United States (specifically, Illinois, Indiana, Michigan, Ohio, and
Wisconsin). For each county, there are 28 features that describe the demographics of the county,
including racial composition, poverty levels, and education rates. To learn more about the data,
you can consult the documentation (?midwest).
To create a plot using the ggplot2 package, you call the ggplot() function, specifying as an
argument the data that you wish to plot (i.e., ggplot(data = SOME_DATA_FRAME)). This will
create a blank canvas upon which you can layer different visual markers. Each layer contains a
specific geometry—think points, lines, and so on—that will be drawn on the canvas. For example, in
Figure 16.3 (created using the following code), you can add a layer of points to assess the association
between the percentage of people with a college education and the percentage of adults living in
poverty in counties in the Midwest.
3. Sander, L. (2016). Telling stories with data using the grammar of graphics. Code Words, 6. https://codewords.recurse.com/issues/six/telling-stories-with-data-using-the-grammar-of-graphics
Figure 16.1 ggplot2 graphics will render in the lower-right quadrant of the RStudio window.
Figure 16.2 A subset of the midwest data set, which captures demographic information on 5 mid-
western states. The data set is included as part of the ggplot2 package and used throughout this
chapter.
Figure 16.3 A basic use of ggplot: comparing the college education rates to adult poverty rates in
Midwestern counties by adding a layer of points (thereby creating a scatterplot).
# Plot the `midwest` data set, with college education rate on the x-axis and
# percentage of adult poverty on the y-axis
ggplot(data = midwest) +
geom_point(mapping = aes(x = percollege, y = percadultpoverty))
n The ggplot() function is passed the data frame to plot as the named data argument (it can
also be passed as the first positional argument). Calling this function creates the blank
canvas on which the visualization will be created.
n You specify the type of geometric object (sometimes referred to as a “geom”) to draw by
calling one of the many geom_ functions4 —in this case, geom_point(). Functions to
render a layer of geometric objects all share a common prefix (geom_), followed by the name
of the kind of geometry you wish to create. For example, geom_point() will create a layer
with “point” (dot) elements as the geometry. There are a large number of these functions;
more details are provided in Section 16.2.1.
n In each geom_ function, you must specify the aesthetic mappings, which specify how data
from the data frame will be mapped to the visual aspects of the geometry. These mappings
are defined using the aes() (aesthetic) function. The aes() function takes a set of named
arguments (like a list), where the argument name is the visual property to map to, and the
argument value is the data feature (i.e., the column in the data frame) to map from. The value
returned by the aes() function is passed to the named mapping argument (or passed as the
first positional argument).
4. Layer: geoms function reference: http://ggplot2.tidyverse.org/reference/index.html#section-layer-geoms
Caution: The aes() function uses non-standard evaluation similar to dplyr, so you
don’t need to put the data frame column names in quotes. This can cause issues if the
name of the column you wish to plot is stored as a string in a variable (e.g., plot_var
<- "COLUMN_NAME"). To handle this situation, you can use the aes_string() func-
tion instead and specify the column names as string values or variables.
n You add layers of geometric objects to the plot by using the addition (+) operator.
Thus, you can create a basic plot by specifying a data set, an appropriate geometry, and a set of
aesthetic mappings.
Tip: The ggplot2 package includes a qplot() functiona for creating “quick plots.” This
function is a convenient shortcut for making simple, “default”-like plots. While this is a nice
starting point, the strength of ggplot2 lies in its customizability, so read on!
a. http://www.statmethods.net/advgraphs/ggplot2.html
ggplot2 provides many different geom_ functions; some of the most common include:
n geom_point() for drawing individual points (e.g., for a scatterplot)
n geom_line() for drawing lines (e.g., for a line chart)
n geom_col() for drawing columns (e.g., for a bar chart)
n geom_smooth() for drawing smoothed lines (e.g., for simple trends or approximations)
n geom_polygon() for drawing arbitrary shapes (e.g., for drawing an area in a coordinate
plane)
Each of these geom_ functions requires as an argument a set of aesthetic mappings (defined using
the aes() function, described in Section 16.2.2), though the specific visual properties that the data
will map to will vary. For example, you can map a data feature to the shape of a geom_point()
(e.g., if the points should be circles or squares), or you can map a feature to the linetype of a
geom_line() (e.g., if it is solid or dotted), but not vice versa.
Since graphics are two-dimensional representations of data, almost all geom_ functions require an x
and y mapping. For example, in Figure 16.4, the bar chart of the number of counties per state (left)
is built using the geom_col() geometry, while the hexagonal aggregation of the scatterplot from
Figure 16.3 (right) is built using the geom_hex() function.
Figure 16.4 Plots with column geometry (left) and binned hexagons (right). The rectangles in the
column geometry represent separate observations (counties) that have been automatically stacked
on top of each other; see Section 16.3.1 for details.
What makes this really powerful is that you can add multiple geometries to a plot. This allows you to
create complex graphics showing multiple aspects of your data, as in Figure 16.5.
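A sketch of such a plot (the pairing of points and a smoothed line shown in Figure 16.5):

# Plot the same x/y mapping using two layers: points and a smoothed line
ggplot(data = midwest) +
  geom_point(mapping = aes(x = percollege, y = percadultpoverty)) +
  geom_smooth(mapping = aes(x = percollege, y = percadultpoverty))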
While the geom_point() and geom_smooth() layers in this code both use the same aesthetic
mappings, there’s no reason you couldn’t assign different aesthetic mappings to each geometry.
Note that if the layers do share some aesthetic mappings, you can specify those as an argument to
the ggplot() function as follows:
# A plot with both points and a smoothed line, sharing aesthetic mappings
ggplot(data = midwest, mapping = aes(x = percollege, y = percadultpoverty)) +
geom_point() + # uses the default x and y mappings
geom_smooth() + # uses the default x and y mappings
geom_point(mapping = aes(y = percchildbelowpovert)) # uses own y mapping
Figure 16.5 A plot comparing the adult poverty rate and the college education rate using multiple
geometries. Each layer is added with a different ggplot2 function: geom_point() for points, and
geom_smooth() for the smoothed line.
Each geometry will use the data and individual aesthetics specified in the ggplot() function
unless they are overridden by individual specifications.
Going Further: Some geom_ functions also perform a statistical transformation on the data,
aggregating the data (e.g., counting the number of observations) before mapping that data
to an aesthetic. While you can do many of these transformations using the dplyr functions
group_by() and summarize(), a statistical transformation allows you to apply some aggre-
gations purely to adjust the data’s presentation, without needing to modify the data itself.
You can find more information in the documentation.a
a. http://ggplot2.tidyverse.org/reference/index.html#section-layer-stats
The data-driven aesthetics for a plot are specified using the aes() function and passed into a
particular geom_ function layer. For example, if you want to know which state each county is in,
you can add a mapping from the state feature of each row to the color channel. ggplot2 will even
create a legend for you automatically (as in Figure 16.6)! Note that using the aes() function will
cause the visual channel to be based on the data specified in the argument.
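A sketch of that data-driven mapping (the left-hand plot in Figure 16.6):

# Map each county's `state` value to the color channel
ggplot(data = midwest) +
  geom_point(mapping = aes(x = percollege, y = percadultpoverty, color = state))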
Figure 16.6 Different approaches for choosing color when comparing the adult poverty rate and
the college education rate. The left uses a data-driven approach, in which each observation’s state
column is used to set the color (an aesthetic mapping), while the right sets a constant color for all
observations. Code is below.
Conversely, if you wish to apply a visual property to an entire geometry, you can set that property
on the geometry by passing it as an argument to the geom_ function, outside of the aes() call, as
shown in the following code. Figure 16.6 shows both approaches: driving color with the aesthetic
(left) and choosing constant styles for each point (right).
# Set a consistent color ("red") for all points -- not driven by data
ggplot(data = midwest) +
geom_point(
mapping = aes(x = percollege, y = percadultpoverty),
color = "red",
alpha = .3
)
You can also map a variable to the color encoding (using the fill aesthetic). In Figure 16.7 you can see the racial
breakdown of the population in each state, created by adding a fill to the column geometry:
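A minimal sketch of such code, assuming state_race_long is a long-format data frame (as used for Figure 16.8 below) with state, race, and population columns:
# Create a stacked column chart of the population (by race) in each state
ggplot(state_race_long) +
  geom_col(mapping = aes(x = state, y = population, fill = race))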
Remember: You will need to use your dplyr and tidyr skills to wrangle your data frames
into the proper orientation for plotting. Being confident in those skills will make using the
ggplot2 library a relatively straightforward process; the hard part is getting your data in the
desired shape.
Tip: Use the fill aesthetic when coloring in bars or other area shapes (that is, specifying
what color to “fill” the area). The color aesthetic is instead used for the outline (stroke) of
the shapes.
Figure 16.7 A stacked bar chart of the number of people in each state (by race). Colors are added
by setting a fill aesthetic based on the race column.
By default, ggplot will adjust the position of each rectangle by stacking the “columns” for each
county. The plot thus shows all of the elements instead of causing them to overlap. However, if you
wish to specify a different position adjustment, you can use the position argument. For example,
to see the relative composition (e.g., percentage) of people by race in each state, you can use a
"fill" position (to fill each bar to 100%). To see the relative measures within each state side by
side, you can use a "dodge" position. To explicitly achieve the default behavior, you can use the
"identity" position. The first two options are shown in Figure 16.8.
Figure 16.8 Bar charts of state population by race, shown with different position adjustments: filled
(left) and dodged (right).
# Create a percentage (filled) column of the population (by race) in each state
ggplot(state_race_long) +
geom_col(
mapping = aes(x = state, y = population, fill = race), position = "fill"
)
# Create a grouped (dodged) column of the number of people (by race) in each state
ggplot(state_race_long) +
geom_col(
mapping = aes(x = state, y = population, fill = race), position = "dodge"
)
# Plot the `midwest` data set, with college education rate on the x-axis and
# percentage of adult poverty on the y-axis. Color by state.
ggplot(data = midwest) +
geom_point(mapping = aes(x = percollege, y = percadultpoverty, color = state))
# Plot the `midwest` data set, with college education rate and
# percentage of adult poverty. Explicitly set the scales.
ggplot(data = midwest) +
geom_point(mapping = aes(x = percollege, y = percadultpoverty, color = state)) +
scale_x_continuous() + # explicitly set a continuous scale for the x-axis
scale_y_continuous() + # explicitly set a continuous scale for the y-axis
scale_color_discrete() # explicitly set a discrete scale for the color aesthetic
Each scale can be represented by a function named in the following format: scale_, followed by
the name of the aesthetic property (e.g., x or color), followed by an _ and the type of the scale
(e.g., continuous or discrete). A continuous scale will handle values such as numeric data
(where there is a continuous set of numbers), whereas a discrete scale will handle values such as
colors (since there is a small discrete list of distinct colors). Notice also that scales are added to a plot
using the + operator, similar to a geom layer.
While the default scales will often suffice for your plots, it is possible to explicitly add different
scales to replace the defaults. For example, you can use a scale to change the direction of an axis
(scale_x_reverse()), or plot the data on a logarithmic scale (scale_x_log10()). You can also use
scales to specify the range of values on an axis by passing in a limits argument. Explicit limits are
useful for making sure that multiple graphs share scales or formats, as well as for customizing the
appearance of your visualizations. For example, the following code imposes the same scale across
two plots, as shown in Figure 16.9:
Figure 16.9 Plots of the percent college-educated population versus the percent adult poverty in
Wisconsin (left) and Michigan (right). These plots share the same explicit scales (which are not based
solely on the plotted data). Notice how it is easy to compare the two data sets to each other because
the axes and colors match!
# Define a discrete color scale using the unique set of locations (urban/rural)
color_scale <- scale_color_discrete(limits = unique(labeled$location))
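A sketch of how the rest of that example might look; the x_scale and y_scale names, and the wisconsin_data and michigan_data subsets of a labeled data frame (the midwest data with an added urban/rural location column), are assumptions based on the figure:
# Define position scales based on the full data set so both plots share them
x_scale <- scale_x_continuous(limits = range(labeled$percollege))
y_scale <- scale_y_continuous(limits = range(labeled$percadultpoverty))

# Plot the Wisconsin counties using the shared scales
ggplot(data = wisconsin_data) +
  geom_point(mapping = aes(x = percollege, y = percadultpoverty, color = location)) +
  x_scale +
  y_scale +
  color_scale

# Plot the Michigan counties using the same scales
ggplot(data = michigan_data) +
  geom_point(mapping = aes(x = percollege, y = percadultpoverty, color = location)) +
  x_scale +
  y_scale +
  color_scale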
These scales can also be used to specify the “tick” marks and labels; see the ggplot2 documentation
for details. For further ways of specifying where the data appears on the graph, see Section 16.3.3.
Figure 16.10 A comparison of each county’s adult poverty rate and college education rate, using
color to show the state each county is in. These colors come from the ColorBrewer Set3 palette.
5
ColorBrewer: http://colorbrewer2.org
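A minimal sketch of how a ColorBrewer palette could be applied to produce a plot like Figure 16.10:
# Color the points by state using the ColorBrewer "Set3" palette
ggplot(data = midwest) +
  geom_point(mapping = aes(x = percollege, y = percadultpoverty, color = state)) +
  scale_color_brewer(palette = "Set3")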
If you instead want to define your own color scheme, you can make use of a variety of ggplot2
functions. For discrete color scales6 , you can specify a distinct set of colors to map to using a
function such as scale_color_manual(). For continuous color scales7 , you can specify a range of
colors to display using a function such as scale_color_gradient().
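For example, a continuous gradient scale might look like the following sketch (the choice of variable and colors is illustrative):
# Color the points by a continuous variable (population density) using a gradient
ggplot(data = midwest) +
  geom_point(mapping = aes(x = percollege, y = percadultpoverty, color = popdensity)) +
  scale_color_gradient(low = "white", high = "darkblue")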
16.3.3 Coordinate Systems
ggplot2 provides a number of coordinate systems to choose from, including:
- coord_cartesian(): The default Cartesian coordinate system, where you specify x and y values—x values increase from left to right, and y values increase from bottom to top
- coord_fixed(): A Cartesian system with a "fixed" aspect ratio (e.g., 1.78 for "widescreen")
- coord_quickmap(): A coordinate system that approximates a good aspect ratio for maps. See the documentation for more details
The example in Figure 16.11 uses coord_flip() to create a horizontal bar chart (a useful layout for
making labels more legible). In the geom_col() function’s aesthetic mapping, you do not change
what you assign to the x and y variables to make the bars horizontal; instead, you call the
coord_flip() function to switch the orientation of the graph. The following code (which
generates Figure 16.11) also creates a factor variable to sort the bars using the variable of interest:
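A sketch of what that code might look like; the top_10 data frame (the ten most populous counties, with a location column combining county and state names) is an assumption based on the figure:
# Identify the ten most populous counties and label them "COUNTY, STATE"
# (uses the dplyr and tidyr packages)
top_10 <- midwest %>%
  top_n(10, wt = poptotal) %>%                   # keep the ten most populous counties
  unite(location, county, state, sep = ", ") %>% # combine county and state names
  mutate(location = factor(location, levels = location[order(poptotal)])) # sort the bars

# Draw the columns, then flip the orientation to make a horizontal bar chart
ggplot(top_10) +
  geom_col(mapping = aes(x = location, y = poptotal)) +
  coord_flip()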
6
Create your own discrete scale function reference: http://ggplot2.tidyverse.org/reference/scale_manual.html
7
Gradient color scales function reference: http://ggplot2.tidyverse.org/reference/scale_gradient.html
8
Coordinate systems function reference: http://ggplot2.tidyverse.org/reference/index.html#section-coordinate-systems
Figure 16.11 A horizontal bar chart of the population in the ten most populous counties. The orien-
tation of the chart is “flipped” by calling the coord_flip() function.
In general, the coordinate system is used to specify where in the plot the x and y axes are placed,
while scales are used to determine which values are shown on those axes.
16.3.4 Facets
Facets are ways of grouping a visualization into multiple different pieces (subplots). This allows you
to view a separate plot for each unique value in a categorical variable. Conceptually, breaking a plot
up into facets is similar to using the group_by() verb in dplyr: it creates the same visualization for
each group separately (just as summarize() performs the same analysis for each group).
You can construct a plot with multiple facets by using a facet_ function such as facet_wrap().
This function will produce a “row” of subplots, one for each unique value of a categorical variable (the number of
rows can be specified with an additional argument); subplots will “wrap” to the next line if there is
not enough space to show them all in a single row. Figure 16.12 demonstrates faceting; as you can
see in this plot, using facets is basically an “automatic” way of doing the same kind of grouping
performed in Figure 16.9, which shows separate graphs for Wisconsin and Michigan.
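A minimal sketch of such a faceted plot; the labeled data frame (the midwest data with an added urban/rural location column) is an assumption based on the figure:
# Create a subplot of college education versus adult poverty for each state
ggplot(data = labeled) +
  geom_point(mapping = aes(x = percollege, y = percadultpoverty, color = location)) +
  facet_wrap(~state) # facet (group) the plot by state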
Figure 16.12 A comparison of each county’s adult poverty rate and college education rate. A separate
plot is created for each state using the facet_wrap() function.
Note that the argument to the facet_wrap() function is the column to facet by, with the column
name written with a tilde (~) in front of it, turning it into a formula.9 A formula is a bit like an
equation in mathematics; that is, it represents a set of operations to perform. The tilde can be read
“as a function of.” The facet_ functions take formulas as arguments in order to determine how
they should group and divide the subplots. In short, with facet_wrap() you need to put a ~ in
front of the feature name you want to “group” by. See the official ggplot2 documentation10 for
facet_ functions for more details and examples.
9
Formula documentation: https://www.rdocumentation.org/packages/stats/versions/3.4.3/topics/formula. See
the Details in particular.
10
ggplot2 facetting: https://ggplot2.tidyverse.org/reference/#section-facetting
You can add titles and axis labels to a chart using the labs() function (not labels(), which is a
different R function!), as in Figure 16.13. This function takes named arguments for each aspect to
label—either title (or subtitle or caption), or the name of the aesthetic (e.g., x, y, color). Axis
aesthetics such as x and y will have their label shown on the axis, while other aesthetics will use the
provided label for the legend.
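A sketch of how labels might be added to produce a plot like Figure 16.13 (the x-axis label and legend title come from the figure; the title and y-axis label are illustrative):
# Add a title and labels for each aesthetic mapping
ggplot(data = labeled) +
  geom_point(mapping = aes(x = percollege, y = percadultpoverty, color = location)) +
  labs(
    title = "College Education vs. Adult Poverty", # plot title (illustrative)
    x = "Percentage of College Educated Adults",   # x-axis label
    y = "Percentage of Adults Living in Poverty",  # y-axis label (illustrative)
    color = "Urbanity"                             # legend label
  )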
Figure 16.13 A comparison of each county’s adult poverty rate and college education rate. The
labs() function is used to add a title and labels for each aesthetic mapping.
You can also add labels into the plot itself (e.g., to label each point or line) by adding a new
geom_text() (for plain text) or geom_label() (for boxed text). In effect, you’re plotting an extra
set of data values that happen to be the value names. For example, in Figure 16.14, labels are used to
identify the county with the highest level of poverty in each state. The background and border for
each piece of text are created by using the geom_label_repel() function (from the ggrepel
package), which positions labels so that they don't overlap.
Figure 16.14 Using labels to identify the county in each state with the highest level of poverty. The
ggrepel package is used to prevent labels from overlapping.
# Load the `ggrepel` package: functions that prevent labels from overlapping
library(ggrepel)
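A sketch of how the labels in Figure 16.14 might be added; the most_poverty data frame (one row per state, with a county_label column of text to show) is an assumption:
# Plot the counties, then add non-overlapping labels for the selected counties
ggplot(data = labeled, mapping = aes(x = percollege, y = percadultpoverty)) +
  geom_point(mapping = aes(color = location)) +
  geom_label_repel(
    data = most_poverty,                 # this layer uses its own (smaller) data set
    mapping = aes(label = county_label), # column containing the text to display
    alpha = .7                           # make the labels semi-transparent
  )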
16.4 Building Maps
Two common types of maps you can create are:
- Choropleth maps: Maps in which different geographic areas are shaded based on data about each region (as in Figure 16.16). These maps can be used to visualize data that is aggregated to specified geographic areas. For example, you could show the eviction rate in each state using a choropleth map. Choropleth maps are also called heatmaps.
- Dot distribution maps: Maps in which markers are placed at specific coordinates, as in Figure 16.19. These plots can be used to visualize observations that occur at discrete (latitude/longitude) points. For example, you could show the specific address of each eviction notice filed in a given city.
This section details how to build such maps using ggplot2 and complementary packages. Geographic
data sets (shapefiles), such as those made available by the U.S. Census Bureau11 and OpenStreetMap,12 can be freely
downloaded and used in R.
To help you get started with mapping, ggplot2 includes a handful of shapefiles (meaning you
don’t need to download one). You can load a given shapefile by providing the name of the shapefile
you wish to load (e.g., "usa", "state", "world") to the map_data() function. Once you have the
desired shapefile in a usable format, you can render a map using the geom_polygon() function.
This function plots a shape by drawing lines between each individual pair of x- and y- coordinates
(in order), similar to a “connect-the-dots” puzzle. To maintain an appropriate aspect ratio for your
map, use the coord_map() coordinate system. The map created by the following code is shown in
Figure 16.15.
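A minimal sketch of that code (map_data() requires the maps package to be installed):
# Load the shapefile describing U.S. state outlines as a data frame of points
state_shape <- map_data("state")

# Draw the outlines by connecting the points within each group (state)
ggplot(state_shape) +
  geom_polygon(mapping = aes(x = long, y = lat, group = group)) +
  coord_map() # use a map-based coordinate system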
Figure 16.15 A U.S. state map, made with ggplot2.
11
U.S. Census: Cartographic Boundary Shapefiles: https://www.census.gov/geo/maps-data/data/
tiger-cart-boundary.html
12
OpenStreetMap: Shapefiles: https://wiki.openstreetmap.org/wiki/Shapefiles
The data in the state_shape variable is just a data frame of longitude/latitude points that describe
how to draw the outline of each state—the group variable indicates which state each point belongs
to. If you want each geographic area (in this case, each U.S. state) to express different data through a
visual channel such as color, you need to load the data, join it to the shapefile, and map the fill of
each polygon. As is often the case, the biggest challenge is getting the data in the proper format for
visualizing it (not using the visualization package). The map in Figure 16.16, which is built using
the following code, shows the eviction rate in each U.S. state in 2016. The data was downloaded
from the Eviction Lab at Princeton University.13
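A sketch of that wrangling step; the evictions data frame (with state and eviction.rate columns) is an assumption based on the description and the plotting code below:
# Join each point in the shapefile to its state's eviction rate (uses dplyr)
state_shape <- map_data("state") %>%
  rename(state = region) %>%         # rename the column so the join keys match
  left_join(evictions, by = "state") # add the eviction.rate column to each point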
Figure 16.16 A choropleth map of eviction rates by state, made with ggplot2.
# Draw the map setting the `fill` of each state using its eviction rate
ggplot(state_shape) +
geom_polygon(
mapping = aes(x = long, y = lat, group = group, fill = eviction.rate),
color = "white", # show state outlines
size = .1 # thinly stroked
) +
coord_map() + # use a map-based coordinate system
scale_fill_continuous(low = "#132B43", high = "Red") +
labs(fill = "Eviction Rate") +
blank_theme # variable containing map styles (defined in next code snippet)
13
Eviction Lab: https://evictionlab.org. The Eviction Lab at Princeton University is a project directed by Matthew
Desmond and designed by Ashley Gromis, Lavar Edmonds, James Hendrickson, Katie Krywokulski, Lillian Leung,
and Adam Porton. The Eviction Lab is funded by the JPB, Gates, and Ford Foundations, as well as the Chan
Zuckerberg Initiative.
The beauty and challenge of working with ggplot2 are that nearly every visual feature is
configurable. These features can be adjusted using the theme() function for any plot (including
maps!). Nearly every granular detail—minor grid lines, axis tick color, and more—is available for
your manipulation. See the documentation14 for details. The following is an example set of styles
targeted to remove default visual features from maps:
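A sketch of the kind of theme the text describes, which blanks out the axes, grid lines, and borders that are not meaningful on a map:
blank_theme <- theme_bw() +
  theme(
    axis.line = element_blank(),        # remove axis lines
    axis.text = element_blank(),        # remove axis labels
    axis.ticks = element_blank(),       # remove axis ticks
    axis.title = element_blank(),       # remove axis titles
    plot.background = element_blank(),  # remove the gray plot background
    panel.grid.major = element_blank(), # remove major grid lines
    panel.grid.minor = element_blank(), # remove minor grid lines
    panel.border = element_blank()      # remove the border around the plot
  )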
Figure 16.17 Adding discrete points to a map.
14
ggplot2 themes reference: http://ggplot2.tidyverse.org/reference/index.html#section-themes
# Draw the state outlines, then plot the city points on the map
ggplot(state_shape) +
geom_polygon(mapping = aes(x = long, y = lat, group = group)) +
geom_point(
data = cities, # plots own data set
mapping = aes(x = long, y = lat), # points are drawn at given coordinates
color = "red"
) +
coord_map() # use a map-based coordinate system
As you seek to increase the granularity of your map visualizations, it may be infeasible to describe
every feature with a set of coordinates. This is why many visualizations use images (rather than
polygons) to show geographic information such as streets, topography, buildings, and other
geographic features. These images are called map tiles—they are pictures that can be stitched
together to represent a geographic area. Map tiles are usually downloaded from a remote server, and
then combined to display the complete map. The ggmap15 package provides a nice extension to
ggplot2 for both downloading map tiles and rendering them in R. Map tiles are also used with the
Leaflet package, described in Chapter 17.
Before mapping the eviction notices data (downloaded from data.gov16), a minor amount of formatting needs to be done
on the raw data set (shown in Figure 16.18):
15
ggmap repository on GitHub: https://github.com/dkahle/ggmap
16
data.gov: Eviction Notices: https://catalog.data.gov/dataset/eviction-notices
17
ggplot2 in Action: https://github.com/programming-for-data-science/in-action/tree/master/ggplot2
# Data wrangling: format dates, filter to 2017 notices, extract lat/long data
notices <- notices %>%
mutate(date = as.Date(File.Date, format="%m/%d/%y")) %>%
filter(format(date, "%Y") == "2017") %>%
separate(Location, c("lat", "long"), ", ") %>% # split column at the comma
mutate(
lat = as.numeric(gsub("\\(", "", lat)), # remove starting parentheses
long = as.numeric(gsub("\\)", "", long)) # remove closing parentheses
)
Figure 16.18 A subset of the eviction notices data downloaded from data.gov.
To create a background map of San Francisco, you can use the qmplot() function from the
development version of ggmap package (see below). Because the ggmap package is built to work with
ggplot2, you can then display points on top of the map as you normally would (using
geom_point()). Figure 16.19 shows the location of each eviction notice filed in 2017, created using
the following code:
# Create a map of San Francisco, with a point at each eviction notice address
# Use `install_github()` to install the newer version of `ggmap` on GitHub
# devtools::install_github("dkahle/ggmap") # once per machine
library("ggmap")
library("ggplot2")
Figure 16.19 Location of each eviction notice in San Francisco in 2017. The image is generated by
layering points on top of map tiles using the ggplot2 package.
Tip: You can store a plot returned by the ggplot() function in a variable (as in the preceding
code)! This allows you to add different layers on top of a base plot, or to render the plot at
chosen locations throughout a report (see Chapter 18).
While Figure 16.19 captures the gravity of the issue of evictions in the city, the overlapping nature
of the points prevents ready identification of any patterns in the data. Using the geom_polygon()
function, you can compute point density across two dimensions and display the computed values
in contours, as shown in Figure 16.20.
This example of the geom_polygon() function uses the stat argument to automatically perform a
statistical transformation (aggregation)—similar to what you could do using the dplyr functions
group_by() and summarize()—that calculates the shape and color of each contour based on
point density (a "density2d" aggregation). ggplot2 stores the result of this aggregation in an
internal data frame in a column labeled level, which can be accessed using the stat() helper
function to set the fill (that is, mapping = aes(fill = stat(level))).
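A sketch of that transformation, building on the base_plot of map tiles created above:
# Aggregate the points into filled 2D density contours
base_plot +
  geom_polygon(
    stat = "density2d",                                   # compute contours from point density
    mapping = aes(x = long, y = lat, fill = stat(level)), # fill by the computed level
    alpha = .3                                            # make the contours semi-transparent
  )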
Tip: For more examples of producing maps with ggplot2, see this tutorial.a
a
http://eriqande.github.io/rep-res-web/lectures/making-maps-with-R.html
This chapter introduced the ggplot2 package for constructing precise data visualizations. While
the intricacies of this package can be difficult to master, the investment is well worth the effort, as it
enables you to control the granular details of your visualizations.
Figure 16.20 A heatmap of eviction notices in San Francisco. The image is created by aggregating
eviction notices into 2D contours with one of ggplot2's statistical transformations.
Tip: Similar to dplyr and many other packages, ggplot2 has a large number of functions. A
cheatsheet for the package is available through the RStudio menu: Help > Cheatsheets.
In addition, this phenomenal cheatsheeta describes how to control the granular details of
your ggplot2 visualizations.
a
http://zevross.com/blog/2014/08/04/beautiful-plotting-in-r-a-ggplot2-cheatsheet-3/
For practice creating configurable visualizations with ggplot2, see the set of accompanying book
exercises.18
18
ggplot2 exercises: https://github.com/programming-for-data-science/chapter-16-exercises
17 Interactive Visualization in R
Adding interactivity to a visualization provides an additional mechanism through which data can
be presented in an engaging, efficient, and communicative way. Interactions can allow users to
effectively explore large data sets by panning and zooming through plots, or by hovering over specific
plot geometry to gain additional details on demand.1
While ggplot2 is the definitive, leading package for making static plots in R, there is not a
comparably popular package for creating interactive visualizations. Thus this chapter briefly
introduces three different packages for building such visualizations. Instead of offering an in-depth
description (as with ggplot2), this chapter provides a high-level “tour” of these packages. The first
two (Plotly and Bokeh) are able to add basic interactions to the plots you might make with ggplot2,
while the third (Leaflet) is used to create interactive map visualizations. Picking among these (and
other) packages depends on the type of interactions you want your visualization to provide, the
ease of use, the clarity of the package documentation, and your aesthetic preferences. And because
these open source projects are constantly evolving, you will need to reference their documentation
to make the best use of these packages. Indeed, exploring these packages further is great practice in
learning to use new R packages!
The first two sections demonstrate creating interactive plots of the iris data set, a canonical data
set in the machine learning and visualization world in which a flower’s species is predicted using
features of that flower. The data set is built into the R software, and is partially shown in Figure 17.1.
For example, you can use ggplot2 to create a static visualization of flower species in terms of the
width of the petals and the sepals (the container for the buds), as shown in Figure 17.2.
The following sections show how to use the plotly and rbokeh packages to make this plot
interactive. The third section of the chapter then explores interactive mapping with the leaflet
package.
1
Shneiderman, B. (1996). The eyes have it: A task by data type taxonomy for information visualizations. Proceedings
of the 1996 IEEE Symposium on Visual Languages (pp. 336–). Washington, DC: IEEE Computer Society.
http://dl.acm.org/citation.cfm?id=832277.834354
Figure 17.1 A subset of the iris data set, in which each observation (row) represents the physical
measurements of a flower. This canonical data set is used to practice the machine learning task of
classification—the challenge is to predict (classify) each flower’s Species based on the other features.
Figure 17.2 A static visualization of the iris data set, created using ggplot2.
Plotly is an external package (like dplyr or ggplot2), so you will need to install and load the
package before you can use it:
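The usual install-and-load pattern applies:
install.packages("plotly") # once per machine
library("plotly")          # in each relevant script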
2
Plotly: https://plot.ly/r/
This will make all of the plotting functions you will need available.
With the package loaded, there are two main ways to create interactive plots. First, you can take any
plot created using ggplot2 and “wrap” it in a Plotly plot,3 thereby adding interactions to it. You do
this by taking the plot returned by the ggplot() function and passing it into the ggplotly()
function provided by the plotly package:
# Create (and store) a scatterplot of the `iris` data set using ggplot2
flower_plot <- ggplot(data = iris) +
geom_point(mapping = aes(x = Sepal.Width, y = Petal.Width, color = Species))
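You then pass the stored plot to the ggplotly() function:
# Make the stored ggplot2 chart interactive
ggplotly(flower_plot)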
This will render an interactive version of the iris plot! You can hover the mouse over any
geometry element to see details about that data point, or you can click and drag in the plot area to
zoom in on a cluster of points (see Figure 17.3).
When you move the mouse over a Plotly chart, you can see the suite of interaction types built into
it through the menu that appears (see Figure 17.3). You can use these options to navigate and zoom
into the data to explore it.
In addition to making ggplot plots interactive, you can use the Plotly API itself (e.g., calling its
own functions) to build interactive graphics. For example, the following code will create an
equivalent plot of the iris data set:
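A sketch of such a plot built directly with the Plotly API:
# Create an interactive scatterplot of the `iris` data set with Plotly
plot_ly(
  data = iris,      # data frame to visualize
  x = ~Sepal.Width, # column for the x-axis, specified as a formula
  y = ~Petal.Width, # column for the y-axis, specified as a formula
  color = ~Species, # column for the color encoding, specified as a formula
  type = "scatter", # the type of plot to create
  mode = "markers"  # the "drawing mode" for the scatter (points)
)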
Figure 17.3 Plotly chart interactions: hover for tooltips (left), and brush (click + drag) to zoom into a
region (right). More interactions, such as panning, are provided via the interaction menu at the top of
the left-hand chart.
3
Plotly ggplot2 library: https://plot.ly/ggplot2/ (be sure to check the navigation links in the menu on the left).
Plotly plots are created using the plot_ly() function, which is the counterpart to the ggplot()
function. The plot_ly() function takes as arguments details about how the chart should be
rendered. For example, in the preceding code, arguments are used to specify the data, the aesthetic
mappings, and the plot type (that is, geometry). Aesthetic mappings are specified as formulas (using
a tilde ~), indicating that the visual channel is a “function of” the data column. Also note that
Plotly will try to “guess” values such as type and mode if they are left unspecified (in which
case it will print a warning in the console).
For a complete list of options available to the plot_ly() function, see the official documentation.4
It’s often easiest to learn to make Plotly charts by working from one of the many examples.5 We
suggest that you find an example that is close to what you want to produce, and then read that code
and modify it to fit your particular use case.
In addition to using the plot_ly() function to specify how the data will be rendered, you can add
other chart options, such as titles and axes labels. These are specified using the layout() function,
which is conceptually similar to the labs() and theme() functions from ggplot2. Plotly’s
layout() function takes as an argument a Plotly chart (e.g., one returned by the plot_ly()
function), and then modifies that object to produce a chart with a different layout. Most
commonly, this is done by piping the Plotly chart into the layout() function:
# Create a plot, then pipe that plot into the `layout()` function to modify it
# (Example adapted from the Plotly documentation)
plot_ly(
data = iris, # pass in the data to be visualized
x = ~Sepal.Width, # use a formula to specify the column for the x-axis
y = ~Petal.Width, # use a formula to specify the column for the y-axis
color = ~Species, # use a formula to specify the color encoding
type = "scatter", # specify the type of plot to create
mode = "markers" # determine the "drawing mode" for the scatter (points)
) %>%
layout(
title = "Iris Data Set Visualization", # plot title
xaxis = list(title = "Sepal Width", ticksuffix = "cm"), # axis label + format
yaxis = list(title = "Petal Width", ticksuffix = "cm") # axis label + format
)
4
Plotly: R Figure Reference: https://plot.ly/r/reference/
5
Plotly: Basic Charts example gallery: https://plot.ly/r/#basic-charts
The chart created by this code is shown in Figure 17.4. The xaxis and yaxis arguments expect lists
of axis properties, allowing you to control many aspects of each axis (such as the title and the
ticksuffix to put after each numeric value in the axis). You can read about the structure and
options to the other arguments in the API documentation.6
Figure 17.4 A Plotly chart with informative labels and axes added using the layout() function.
6
Plotly layout: https://plot.ly/r/reference/#layout
7
Bokeh: http://bokeh.pydata.org
8
rbokeh, R Interface for Bokeh: http://hafen.github.io/rbokeh/
As with other packages, you will need to install and load the rbokeh package before you can use it.
At the time of this writing, the version of rbokeh on CRAN (what is installed with
install.packages()) gives warnings—but not errors!—for R version 3.4; installing a
development version from the package’s maintainer Ryan Hafen fixes this problem.
You create a new plot with Bokeh by calling the figure() function (the counterpart to the
ggplot() and plot_ly() functions). The figure() function will create a new plotting area, to
which you add layers of plot elements such as plot geometry. Similar to when using geometries in
ggplot2, each layer is created with a different function—all of which start with the ly_ prefix.
These layer functions take as a first argument the plot region created with figure(), so in practice
they are “added” to a plot through piping rather than through the addition operator.
For example, the following code shows how to recreate the iris visualization using Bokeh (shown
in Figure 17.5):
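A sketch of that code: a figure is created and then piped into a points layer (the pipe operator is loaded here via dplyr):
library("dplyr") # provides the pipe operator (%>%)

# Create a Bokeh figure, then add a layer of points from the `iris` data
figure(title = "Iris Data Set") %>%
  ly_points(
    Sepal.Width,    # column for the x-axis (no quotes needed)
    Petal.Width,    # column for the y-axis
    data = iris,    # data frame to visualize
    color = Species # column for the color encoding
  )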
The code for adding layers is reminiscent of how geometries act as layers in ggplot2. Bokeh even
supports non-standard evaluation (referring to column names without quotes) just like
ggplot2—as opposed to Plotly’s reliance on formulas. However, formatting the axis tick marks is
more verbose with Bokeh (and is not particularly clear in the documentation).
The plot that is generated by Bokeh (Figure 17.5) is quite similar to the version generated by Plotly
(Figure 17.4) in terms of general layout, and offers a comparable set of interaction utilities through a
Figure 17.5 A Bokeh chart with styled axes. Note the interaction menu to the right of the chart.
toolbar to the right of the chart. Thus you might choose between these packages based on which
coding style you prefer, as well as any other aesthetic or interactive design choices of the packages.
As with other packages, you will need to install and load the leaflet package before you can use it:
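The install-and-load step follows the same pattern:
install.packages("leaflet") # once per machine
library("leaflet")          # in each relevant script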
9
Leaflet: https://leafletjs.com
10
Leaflet for R: https://rstudio.github.io/leaflet/
You can create a new Leaflet map by calling the leaflet() function. Just as calling ggplot() will
create a blank canvas for constructing a plot, the leaflet() function will create a blank canvas on
which you can build a map. Similar to the other visualization packages, Leaflet maps are then
constructed by adding (via pipes) a series of layers with different visual elements to constitute the
image—including map tiles, markers, lines, and polygons.
The most important layer to add when creating a Leaflet map are the map tiles, which are added
with the addTiles() function. Map tiles are a series of small square images, each of which shows a
single piece of a map. These tiles can then be placed next to each other (like tiles on a bathroom
floor) to form the full image of the map to show. Map tiles power mapping applications like Leaflet
and Google Maps, enabling them to show a map of the entire world at a wide variety of levels of
zoom (from street level to continent level); which tiles will be rendered depends on what region
and zoom level the user is looking at. As you interactively navigate through the map (e.g., panning
to the side or zooming in or out), Leaflet will automatically load and show the appropriate tiles to
display the desired map!
Fun Fact: It takes 366,503,875,925 tiles (each 256 × 256 pixels) to map the entire globe for
the (standard) 20 different zoom levels!
There are many different sources of map tiles that you can use in your maps, each of which has its
own appearance and included information (e.g., rivers, streets, and buildings). By default, Leaflet
will use tiles from OpenStreetMap,11 an open source set of map tiles. OpenStreetMap provides a
number of different tile sets; you can choose which to use by passing in the name of the tile set (or a
URL schema for the tiles) to the addTiles() function. But you can also choose to use another map
tile provider12 depending on your aesthetic preferences and desired information. You do this by
instead using the addProviderTiles() function (again passing in the name of the tile set). For
example, the following code creates a basic map (Figure 17.6) using map tiles from the Carto13
service. Note the use of the setView() function to specify where to center the map (including the
“zoom level”).
# Create a new map and add a layer of map tiles from CartoDB
leaflet() %>%
addProviderTiles("CartoDB.Positron") %>%
setView(lng = -122.3321, lat = 47.6062, zoom = 10) # center the map on Seattle
The rendered map will be interactive in the sense that you can drag and scroll to pan and zoom—just
as with other online mapping services!
After rendering a basic map with a chosen set of map tiles, you can add further layers to the map to
show more information. For instance, you can add a layer of shapes or markers to help answer
questions about events that occur at specific geographic locations. To do this, you will need to pass
the data to map into the leaflet() function call as the data argument (i.e., leaflet(data =
SOME_DATA_FRAME)). You can then use the addCircles() function to add a layer of circles to the map.
11
OpenStreetMap map data service: https://www.openstreetmap.org
12
Leaflet-providers preview http://leaflet-extras.github.io/leaflet-providers/preview/
13
Carto map data service: https://carto.com
Figure 17.6 A map of Seattle, created using the leaflet package. The image is constructed by
stitching together a layer of map tiles, provided by the Carto service.
Adding a layer with addCircles() is similar to adding a geometry in ggplot2. This function takes as arguments the data
columns to map to the circle's location aesthetics, specified as formulas (with a ~).
# Create the map of Seattle, specifying the data to use and a layer of circles
leaflet(data = locations) %>% # specify the data you want to add as a layer
addProviderTiles("CartoDB.Positron") %>%
setView(lng = -122.3321, lat = 47.6062, zoom = 11) %>% # focus on Seattle
addCircles(
lat = ~latitude, # a formula specifying the column to use for latitude
lng = ~ longitude, # a formula specifying the column to use for longitude
popup = ~label, # a formula specifying the information to pop up
radius = 500, # radius for the circles, in meters
stroke = FALSE # remove the outline from each circle
)
Figure 17.7 A map showing two universities in Seattle, created by adding a layer of markers
(addCircles()) on top of a layer of map tiles.
Caution: Interactive visualization packages such as plotly and leaflet are limited in
the number of markers they can display. Because they render scalable vector graphics (SVGs)
rather than raster images, they actually add a new visual element for each marker. As a result
they are often unable to handle more than a few thousand points (something that isn’t an
issue with ggplot2).
The preceding code also adds interactivity to the map by providing popups—information that pops
up on click and remains displayed—as shown in Figure 17.7. Because these popups appear when
users are interacting with the circle elements you created, they are specified as another argument to
the addCircles() function—that is, as a formula specifying which column to map to the
popup. Alternatively, you can cause labels to appear on hover by passing in the label argument
instead of popup.
14
City of Seattle Land use permits: https://data.seattle.gov/Permitting/Building-Permits/76t5-zqzr
Figure 17.8 City of Seattle data on permits for buildings in Seattle, showing the subset of new
permits since 2010.
This section's example explores building permit data from the City of Seattle's open data program. A subset of this data is shown in Figure 17.8. The complete
code for this analysis is also available online in the book code repository.15
First, the data needs to be loaded into R and filtered down to the subset of data of interest (new
buildings since 2010):
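A sketch of that step; the file name and the PermitTypeDesc column are assumptions, while IssuedDate and PermitClass are referenced later in the text:
library("dplyr")

# Load the raw permit data downloaded from the City of Seattle
all_permits <- read.csv("data/Building_Permits.csv", stringsAsFactors = FALSE)

# Filter for permits for new buildings issued in 2010 or later
new_buildings <- all_permits %>%
  filter(
    PermitTypeDesc == "New",                     # keep only permits for new buildings
    as.Date(IssuedDate) >= as.Date("2010-01-01") # keep only permits issued since 2010
  )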
Before mapping these points, you may want to get a higher-level view of the data. For example, you
could aggregate the data to show the number of permits issued per year. This will again involve a bit
of data wrangling, which is often the most time-consuming part of visualization:
# Create a new column storing the year the permit was issued
new_buildings <- new_buildings %>%
mutate(year = substr(IssuedDate, 1, 4)) # extract the year
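One way to produce a chart like Figure 17.9 is to count the permits per year and plot those counts (shown here with ggplot2; the original approach may differ):
# Count the number of permits issued in each year
by_year <- new_buildings %>%
  count(year)

# Plot the counts as a bar chart
ggplot(by_year) +
  geom_col(mapping = aes(x = year, y = n)) +
  labs(x = "Year", y = "Number of permits for new buildings")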
15
Interactive visualization in action: https://github.com/programming-for-data-science/in-action/tree/master/
interactive-vis
The preceding code produces the bar chart shown in Figure 17.9. Keep in mind that the data was
downloaded before the summer of 2018, so the observed downward trend is an artifact of when the
visualization was created!
Figure 17.9 The number of permits issued for new buildings in Seattle since 2010. The chart was
built before the summer of 2018.
After understanding this high-level view of the data, you likely want to know where buildings are
being constructed. To do so, you can take the previous map of Seattle and add an additional layer of
circles on top of the tiles (one for each building constructed) using the addCircles() function:
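A sketch of that map; the Latitude, Longitude, and Description column names are assumptions about the permit data:
# Draw a map of Seattle with a circle for each new building
leaflet(data = new_buildings) %>%
  addProviderTiles("CartoDB.Positron") %>%
  setView(lng = -122.3321, lat = 47.6062, zoom = 11) %>% # focus on Seattle
  addCircles(
    lat = ~Latitude,      # column for each circle's latitude
    lng = ~Longitude,     # column for each circle's longitude
    popup = ~Description, # information to show when a circle is clicked
    radius = 20,
    stroke = FALSE
  )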
The results of this code are shown in Figure 17.10—it’s a lot of new buildings. And because the map
is interactive, you can click on each one to get more details!
While this visualization shows all of the new construction, it leaves unanswered the question of
who benefits and who suffers as a result of this change. You would need to do further research into
the number of affordable housing units being built, and the impact on low-income and homeless
communities. As you may discover, building at such a rapid pace often has a detrimental effect on
housing security in a city.
As with ggplot2, the visual attributes of each shape or marker (such as the size or color) can also be
driven by data. For example, you could use information about the permit classification (i.e., if the
Figure 17.10 A Leaflet map of permits for new buildings in Seattle since 2010.
permit is for a home versus a commercial building) to color the individual circles. To effectively
map this (categorical) data to a set of colors in Leaflet, you can use the colorFactor() function.
This function is a lot like a scale in ggplot2, in that it returns a specific mapping to use:
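A sketch of that mapping function (the choice of palette is illustrative):
# Construct a function that maps each permit class to a color
palette_fn <- colorFactor(
  palette = "Dark2",                 # the set of colors to map to
  domain = new_buildings$PermitClass # the set of data values being mapped
)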
The colorFactor() function returns a new function (here called palette_fn()) that maps from
a set of data values (here the unique values from the PermitClass column) to a set of colors—it
performs an aesthetic mapping. You can use this function to specify how the circles on the map
should be rendered (as with ggplot2 geometries, further arguments can be used to customize the
shape rendering):
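A sketch of that layer; map_base stands in for the tile and view layers shown above:
# Color each circle using the palette function applied to its permit class
map_base %>%
  addCircles(
    lat = ~Latitude, lng = ~Longitude, # assumed coordinate columns (as above)
    color = ~palette_fn(PermitClass),  # color driven by the data
    radius = 20,
    stroke = FALSE
  )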
To make these colors meaningful, you will need to add a legend to your map. As you might have
expected, you can do this by adding another layer with a legend in it, specifying the color scale,
values, and other attributes:
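A sketch of such a legend layer (the legend title is illustrative):
# Add a legend explaining which color corresponds to which permit class
map_base %>%
  addLegend(
    position = "bottomright",           # where to place the legend on the map
    title = "New Buildings in Seattle",
    pal = palette_fn,                   # the palette function used for the circles
    values = ~PermitClass,              # the data values described by the legend
    opacity = 1
  )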
Putting these layers together (the map tiles, the circles colored by the palette function, and the legend) generates the interactive map displayed in Figure 17.11.
Figure 17.11 A Leaflet map of permits for new buildings in Seattle since 2010, colored by construc-
tion category.
In summary, packages for developing interactive visualizations (whether plots or maps) use the
same general concepts as ggplot2, but with their own preferred syntax for specifying plot options
and customizations. As you choose among these (and other) packages for making visualizations,
consider the style of code you prefer to use, the trade-off of customizability versus ease of use, and
the visual design choices of each package. There are dozens (if not hundreds) of other packages
available and more created every day; exploring and learning these packages is an excellent way to
expand your programming and data science skills.
That said, when you are exploring new packages, be careful about using code that is poorly
documented or not widely used—such packages may have internal errors, memory leaks, or even
security flaws that haven’t been noticed or addressed yet. It’s a good idea to view the package code
on GitHub, where you can check the popularity by looking at the number of stars (similar to “likes”)
and forks for the project, as well as how actively and recently new commits have been made to the
code. Such research and consideration are vital when choosing one of the many packages for
building interactive visualizations—or doing any other kind of work—with R.
For practice building interactive visualizations, see the set of accompanying book exercises.16
16
Interactive visualization exercises: https://github.com/programming-for-data-science/chapter-17-exercises
VI: Building and Sharing Applications
The final part of this book focuses on the technologies that allow you to collaborate with others
and share your work with the world. It walks through multiple approaches to building interactive
web applications (Chapter 18, Chapter 19), and explains how to leverage git and GitHub when
working as a member of a team (Chapter 20).
18 Dynamic Reports with R Markdown
The insights you discover through your analysis are only valuable if you can share them with
others. To do this, it’s important to have a simple, repeatable process for combining the set of
charts, tables, and statistics you generate into an easily presentable format.
This chapter introduces R Markdown1 as a tool for compiling and sharing your results. R
Markdown is a development framework that supports using R to dynamically create documents,
such as websites (.html files), reports (.pdf files), and even slideshows (using ioslides or slidy).
As you may have guessed, R Markdown does this by providing the ability to blend Markdown
syntax and R code so that, when compiled and executed, the results from your code will be
automatically injected into a formatted document. The ability to automatically generate reports
and documents from a computer script eliminates the need to manually update the results of a data
analysis project, enabling you to more effectively share the information that you’ve produced from
your data. In this chapter, you will learn the fundamentals of the R Markdown package so that you
can create well-formatted documents that combine analysis and reporting.
1
R Markdown: https://rmarkdown.rstudio.com
2
knitr package: https://yihui.name/knitr/
RStudio will then prompt you to provide some additional details about what kind of R Markdown
document you want to create (shown in Figure 18.2). In particular, you will need to choose a default
document type and output format. You can also provide a title and author information that will be included in the document.
Figure 18.1 Create a new R Markdown document in RStudio via the dropdown menu (File > New
File > R Markdown).
Figure 18.2 RStudio wizard for creating R Markdown documents. Enter a Title and Author, and select
the document output format (we suggest HTML to start).
This chapter focuses on creating HTML documents (websites, the
default format); other formats require the installation of additional software.
Once you’ve chosen your desired document type and output format, RStudio will open up a new
script file for you. You should save this file with the extension .Rmd (for “R Markdown”), which tells
the computer and RStudio that the document contains Markdown content with embedded R code.
If you use a different extension, RStudio won’t know how to interpret the code and render the
output!
The wizard-generated file contains some example code demonstrating how to write an R Markdown
document. Understanding the basic structure of this file will enable you to insert your own content
into this structure.
A .Rmd file has three major types of content: the header, the Markdown content, and R code
chunks.
- The header is found at the top of the file, and includes text with the following format:
---
title: "EXAMPLE_TITLE"
author: "YOUR_NAME"
date: "2/01/2018"
output: html_document
---
This header is written in YAML3 format, which is yet another way of formatting structured
data, similar to CSV or JSON. In fact, YAML is a superset of JSON and can represent the same
data structures, just using indentation and dashes instead of braces and commas.
The header contains meta-data, or information about the file and how it should be
processed and rendered. For example, the title, author, and date will be automatically
included and displayed at the top of your generated document. You can include additional
information and configuration options as well, such as whether there should be a table of
contents. See the R Markdown documentation4 for further details.
- Everything below the header is the content that will be included in your report, and is
primarily made up of Markdown content. This is normal Markdown text like that described
in Chapter 4. For example, you could include the following markdown code in your .Rmd file:
## Second Level Header
This is just plain markdown that can contain **bold** or _italics_.
R Markdown also provides the ability to render code content inline with the Markdown
content, as described later in this chapter.
- R code chunks can be included in the middle of the regular Markdown content. These
segments (chunks) of R code look like normal code block elements (using three
backticks ```), but with an extra {r} immediately after the opening set of backticks. Inside
these code chunks you include regular R code, which will be evaluated and then rendered into the document.
3
YAML: http://yaml.org
4
R Markdown HTML Documents: http://rmarkdown.rstudio.com/html_document_format.html
Section 18.2 provides more details about the format and process used by these chunks.
```{r}
# R code chunk in an R Markdown file
some_variable <- 100
```
Combining these content types (header, markdown, and code chunks), you will be able to
reproducibly create documents to share your insights.
While it is straightforward to generate such documents, the knitting process can make it hard to
debug errors in your R code (whether syntax or logical), in part because the output may or may not
show up in the document! We suggest that you write complex R code in another script and then use
the source() function to insert that script into your .Rmd file and use calculated variables in your
output (see Chapter 14 for details and examples of the source() function). This makes it possible
to test your data processing work outside of the knitted document. It also separates the concerns
of the data and its representation—which is good programming practice.
Nevertheless, you should be sure to knit your document frequently, paying close attention to any
errors that appear in the console.
Tip: If you’re having trouble finding your error, a good strategy is to systematically remove
(“comment out”) segments of your code and attempt to re-knit the document. This will help
you identify the problematic syntax.
Figure 18.3 Click on RStudio’s Knit button to compile your code to the desired document type (e.g.,
HTML).
```{r}
# Execute R code in here
course_number <- 201
```
By default, the code chunk will execute the R code listed, and then render both the code that was
executed and the result of the last statement into the Markdown—similar to what would be
returned by a function. Indeed, you can think of code chunks as functions that calculate and return
a value that will be included in the rendered report. If your code chunk doesn’t return a particular
expression (e.g., the last line is just an assignment), then no returned output will be rendered,
although R Markdown will still render the code that was executed.
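A chunk with a label and options might look like the following sketch (the name options_example and the particular options are illustrative):
```{r options_example, echo = FALSE, message = FALSE}
# A named code chunk with two options set: the code runs, but neither the
# code itself nor any messages it generates will appear in the report
```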
The first “argument” (options_example) is a “name” or label for the chunk; it is followed by
named arguments (written in option = VALUE format) for the options. While including chunk
names is technically optional, this practice will help you create well-documented code and
reference results in the text. It will also help in the debugging process, as it will allow RStudio to
produce more detailed error messages.
There are many options5 you can use when creating code chunks. Some of the most useful ones
have to do with how the executed code is output in the document:
- echo indicates whether you want the R code itself to be displayed in the document (i.e., if you
want readers to be able to see your work and reproduce your calculations and analysis). The
value is either TRUE (do display; the default) or FALSE (do not display).
- message indicates whether you want any messages generated by the code to be displayed.
This includes print statements! The value is either TRUE (do display; the default) or FALSE (do
not display).
- include indicates if any results of the code should be output in the report. Note that any
code in this chunk will still be executed—it just won’t be included in the output. It is
extremely common and best practice to have a “setup” code chunk at the beginning of your
report that has the include = FALSE option and is used to do initial processing work—such
as library() packages, source() analysis code, or perform some other data wrangling. The
R Markdown reports produced by RStudio’s wizard include a code chunk like this.
If you want to show your R code but not evaluate it, you can use a standard Markdown code block
that indicates the r language (```r instead of ```{r}), or set the eval option to FALSE.
Recall that a single backtick (`) is the Markdown syntax for making text display as code. You can
make R Markdown evaluate—rather than display—inline code by adding the letter r and a space
immediately after the first backtick. For example:
To calculate 3 + 4 inside some text, you can use `r 3 + 4` right in the _middle_.
When you knit this text, `r 3 + 4` would be replaced with the number 7 (what 3 + 4 evaluates
to).
You can also reference values computed in any code chunks that precede the inline code. For
example, `r SOME_VARIABLE` would include the value of SOME_VARIABLE inline with the
paragraph. In fact, it is best practice to do your calculations in a code block (with the echo =
FALSE option), save the result in a variable, and then inline that variable to display it.
Tip: To quickly access the R Markdown Cheatsheet and Reference, use the RStudio menu:
Help > Cheatsheets.
5
knitr Chunk options and package options: https://yihui.name/knitr/options/
Printing a value directly from a code chunk will include R's console formatting in the output (e.g., the ## prefix and the [1] vector index).
For this reason, you usually want to have the code block generate a string that you save in a
variable, which you can then display with an inline expression (e.g., on its own line):
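For example, a chunk might store a (Markdown-formatted) message in a variable (the exact string is illustrative):
```{r echo = FALSE}
# Store a message (containing Markdown syntax) in a variable
msg <- "**Hello world!**"
```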
`r msg`
When knit, this code produces the text shown in Figure 18.4. Note that the Markdown syntax
included in the variable is rendered as well: `r msg` is replaced by the value of the expression just
as if you had typed that Markdown in directly. This allows you to even include dynamic styling if
you construct a “Markdown string” (i.e., containing Markdown syntax) from your data.
Figure 18.4 A preview of the .html file that is created by knitting an R Markdown document con-
taining a chunk that stores a message in a variable and an inline expression of that message.
Alternatively, you can give your chunk a results option6 with a value "asis", which will cause
the output to be rendered directly into the Markdown. When combined with the base R function
cat() (which concatenates content without specifying additional information such as vector
position), you can make a code chunk effectively render a specific string:
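A minimal sketch of such a chunk:
```{r results = "asis", echo = FALSE}
# Because of `results = "asis"`, the string is rendered as raw Markdown
cat("This sentence will appear with **bold** formatting in the knitted report.")
```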
The same technique can be used to render an entire bulleted list into the document, such as:
◦ Lions
◦ Tigers
◦ Bears
◦ Oh mys
When this approach is combined with the vectorized paste() function and its collapse
argument, it becomes possible to convert vectors into Markdown lists that can be rendered:
6
knitr text result options: https://yihui.name/knitr/options/#text-results
```{r}
# Create a vector of animals to include in the list
animals <- c("Lions", "Tigers", "Bears", "Oh mys")

# Paste `-` in front of each animal and join the items together with
# newlines between
markdown_list <- paste("-", animals, collapse = "\n")
```
`r markdown_list`
Of course, the contents of the vector (e.g., the text "Lions") could include additional Markdown
syntax to make it bold, italic, or hyperlinked text.
Tip: Creating a “helper function” to help with formatting your output is a great approach.
For some other work in this area, see the pander package.a
a
http://rapporter.github.io/pander/
The kable() function (from the knitr package) renders a data frame as a formatted table; Figure 18.5 compares the rendered R Markdown results with and without the kable() function.
The kable() function supports a number of other arguments that can be used to customize how it
outputs a table; see the documentation for details. Again, if the values in the data frame are strings
that contain Markdown syntax (e.g., bold, italics, or hyperlinks), they will be rendered as such in
the table!
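A minimal sketch of using kable() in a chunk (the data frame shown is an illustrative choice):
```{r echo = FALSE}
library("knitr") # provides the kable() function

# Render the first rows of a data frame as a formatted table
kable(head(iris))
```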
Going Further: Tables generated with the kable() function can be further customized
using additional packages, such as kableExtra.a This package allows you to add more layers
and styling to a table using a format similar to how you add labels and themes with ggplot2.
a
http://haozhu233.github.io/kableExtra/
Figure 18.5 R Markdown rendering a data frame with and without the kable() function.
So while you may need to do a little bit of work to manually generate the Markdown syntax, R
Markdown makes it possible to dynamically produce complex documents based on dynamic data
sources.
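A code chunk that renders a chart might look like the following sketch (the data set is an illustrative choice):
```{r echo = FALSE}
# Render a ggplot2 chart directly in the report
library("ggplot2")
ggplot(data = midwest) +
  geom_point(mapping = aes(x = percollege, y = percadultpoverty))
```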
When knit, the generated document will include the chart produced by such a code chunk.
Moreover, RStudio allows you to preview each code chunk before knitting—just click the green play
button icon above each chunk, as shown in Figure 18.6. While this can help you debug individual
chunks, it may be tedious to do in longer scripts, especially if variables in one code chunk rely on
an earlier chunk.
It is best practice to do any data wrangling necessary to prepare the data for your plot in a separate
.R file, which you can then source() into the R Markdown (in an initial setup code chunk with
the include = FALSE option). See Section 18.5 for an example of this organization.
Figure 18.6 A preview of the content generated by knitr is displayed when you click the green play
button icon (very helpful for debugging .Rmd files!).
HTML (HyperText Markup Language), like Markdown, is a syntax for describing the structure and formatting of content (though HTML is far
more extensive and detailed). In particular, HTML is a markup language that can be automatically
rendered by web browsers, so it is the language used to create webpages. In fact, you can open up
.html files generated by RStudio in any web browser to see the content. Additionally, this means
that the .html files you create with R Markdown can be put online as webpages for others to view!
As it turns out, you can use GitHub not only to host versions of your code repository, but also to
serve (display) .html files—including ones generated from R Markdown. GitHub will host
webpages on a publicly accessible web server that can “serve” the page to anyone who requests it (at
a particular URL on the github.io domain). This feature is known as GitHub Pages.7
Using GitHub Pages involves a few steps. First, you need to knit your document into a .html file
with the name index.html—this is the traditional name for a website’s homepage (and the file
that will be served at a particular URL by default). You will need to have pushed this file to a GitHub
repository; the index.html file will need to be in the root folder of the repo.
7. What Is GitHub Pages: https://help.github.com/articles/what-is-github-pages/
Next, you need to configure that GitHub repository to enable GitHub Pages. On the web portal page
for your repo, click on the “Settings” tab, and scroll down to the section labeled “GitHub Pages.”
From there, you need to specify the “Source” of the .html file that GitHub Pages should serve.
Select the “master branch” option to enable GitHub Pages and have it serve the “master” version of
your index.html file (see Figure 18.7).
Going Further: If you push code to a different branch on GitHub with the name gh-pages,
GitHub Pages will automatically be enabled—serving the files on that branch—without any
need to adjust the repository settings. See Section 20.1 for details on working with branches.
Once you’ve enabled GitHub Pages, you will be able to view your hosted webpage at the URL:
https://GITHUB_USERNAME.github.io/REPO_NAME
Replace GITHUB_USERNAME with the username of the account hosting the repo, and REPO_NAME with
your repository name. Thus, if you pushed your code to the mkfreeman/report repo on GitHub
(stored online at https://github.com/mkfreeman/report), the webpage would be available at
https://mkfreeman.github.io/report. See the official documentation8 for more details and
options.
Figure 18.7 Enable hosting via GitHub Pages for a repository by navigating to the Settings tab on a
repository and scrolling down to the GitHub Pages section. Set the “source” as the master branch to
host your compiled index.html file as a website!
8. Documentation for GitHub Pages: https://help.github.com/articles/user-organization-and-project-pages/
To keep the code organized, the report will be written in two separate files:
- analysis.R, which will contain the analysis and save important values in variables
- index.Rmd, which will source() the analysis.R script and generate the report (the file is named so that it can be hosted on GitHub Pages when rendered)
As each step is completed in the analysis.R file, key reporting values and charts are saved to variables so that
they can be referenced in the index.Rmd file.
To reference these variables, you load the analysis.R script (with source()) in a “setup” block of
the index.Rmd file, enabling its data to be referenced within the Markdown. The include =
FALSE code chunk option means that the block will be evaluated, but not rendered in the
document.
Figure 18.8 A subset of the World Bank data on the life expectancy in each country from 1960 to
2015.
9. World Bank life expectancy at birth data: https://data.worldbank.org/indicator/SP.DYN.LE00.IN
10. R Markdown in Action: https://github.com/programming-for-data-science/in-action/tree/master/r-markdown
Remember: All “algorithmic” work should be done in the separate analysis.R file, allowing you to more easily debug and iterate your analysis. Since visualizations are part of the “presented” information, they could instead be generated directly in the R Markdown, though the data to be visualized should be preprocessed in the analysis.R file.
To compute the metrics of interest in your analysis.R file, you can use dplyr functions to ask
questions of the data set. For example:
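# A sketch of one such question (assumes the data has been loaded into a
# `life_exp` data frame whose 2015 values live in an `X2015` column)
library("dplyr")

# Which country had the longest life expectancy in 2015?
longest_le <- life_exp %>%
  filter(X2015 == max(X2015, na.rm = TRUE)) %>%
  select(Country.Name, expectancy = X2015)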
In this example, the data frame longest_le stores an answer to the question Which country had the
longest life expectancy in 2015? This data frame could be included directly as content of the
index.Rmd file. You will be able to reference values from this data frame inline to ensure the report
contains the most up-to-date information, even if the data in your analysis changes:
The data revealed that the country with the longest life expectancy is
`r longest_le$Country.Name`, with a life expectancy of
`r longest_le$expectancy`.
When rendered, this code snippet would replace `r longest_le$Country.Name` with the value
of that variable. Similarly, if you want to show a table as part of your report, you can construct a
data frame with the desired information in your analysis.R script, and render it in your
index.Rmd file using the kable() function:
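# A sketch of such a computation (the gain calculation and the year columns
# `X1960`/`X2015` are assumptions based on the wide format of the data)
top_10_gain <- life_exp %>%
  mutate(gain = X2015 - X1960) %>%
  top_n(10, wt = gain) %>%
  arrange(-gain) %>%
  select(Country.Name, gain)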
Once you have stored the desired information in the top_10_gain data frame in your analysis.R
script, you can display that information in your index.Rmd file using the following syntax:
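```{r, echo = FALSE}
library("knitr")  # provides the kable() function
kable(top_10_gain)
```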
Figure 18.9 shows the entire report; the complete analysis and R Markdown code to generate this
report follows. Note that the report uses a package called rworldmap to quickly generate a simple,
static world map (as an alternative to mapping with ggplot2).
# analysis.R script

# Load the package used to draw the world map
library("rworldmap")

# Notice that R puts the letter "X" in front of each year column,
# as column names can't begin with numbers

# Join this data frame to a shapefile that describes how to draw each country
# The `rworldmap` package provides a helpful function for doing this
mapped_data <- joinCountryData2Map(
  life_exp,
  joinCode = "ISO3",
  nameJoinColumn = "Country.Code",
  mapResolution = "high"
)
The following index.Rmd file renders the report using the preceding analysis.R script:
---
title: "Life Expectancy Report"
output: html_document
---
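```{r setup, include = FALSE}
# Load the analysis script so its variables (e.g., `longest_le`) can be
# referenced in the Markdown below; this setup chunk is reconstructed from
# the description in the text
source("analysis.R")
```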
## Overview
This is a brief report regarding life expectancy for each country from
1960 to 2015 ([source](https://data.worldbank.org/indicator/SP.DYN.LE00.IN)).
The data reveals that the country with the longest life expectancy was
`r longest_le$Country.Name`, with a life expectancy of
`r longest_le$expectancy`. That life expectancy was `r le_difference`
years longer than the life expectancy in `r shortest_le$Country.Name`.
Here are the countries whose life expectancy **improved the most** since 1960.
For practice creating reports with R Markdown, see the set of accompanying book exercises.11
11. R Markdown exercises: https://github.com/programming-for-data-science/chapter-18-exercises
19 Building Interactive Web Applications with Shiny
Adding interactivity to a data report is a highly effective way of communicating information and
enabling users to explore a data set. This chapter describes the Shiny1 framework for building
interactive applications using R. This will allow you to create dynamic systems in which users can
choose what information they want to see, and how they want to see it.
Shiny provides a structure for communicating between a user interface (i.e., a web browser) and a
data server (i.e., an R session), allowing users to interactively change the “code” that is run and the
data that are output. This not only enables developers to create interactive data presentations, but
provides a way for users to interact directly with an R session (without requiring them to write any
code).
Sharing data with others requires your code to perform two different tasks: it needs to process and
analyze information, and then present that information for the user to see. Moreover, with an
interactive application, the user is able to interact with the presented data (e.g., click on a button or
enter a search term into a form). That user input then needs to be used to re-process the
information, and then re-present the output results.
The Shiny framework provides a structure for applications to perform this exchange: it enables you
to write R functions that are able to output (serve) results to a web browser, as well as an interface for
showing those outputs in the browser. Users can interact with this interface to send information to
1. Shiny: http://shiny.rstudio.com
the server, which will then output new content for the user. Passing these inputs and outputs back
and forth (as illustrated in Figure 19.1) allows Shiny to provide a dynamic and interactive user
experience!
Fun Fact: Because Shiny is rendering a user interface for a web browser, it actually generates a website. That is, the framework will create all of the necessary components (HTML elements), their styles (CSS rules), and the scripts (JavaScript code) to enable interactivity. But don’t worry: you don’t need to know anything about these languages; Shiny code is written entirely in R. However, if you already know a few things about web development, you can augment the Shiny-generated elements and interactivity to really make your application shine.
Specifically, Shiny applications are built around the following core concepts and components:
- User interface (UI): The UI of a Shiny app defines how the application is displayed in the
browser. The UI provides a webpage that renders R content such as text or graphics (just like a
knitted R Markdown document). Moreover, a Shiny UI supports interactivity through
control widgets, which are interactive controls for the application (think: buttons or
sliders). The UI can specify a layout for these components, allowing you to organize your
content in side-by-side panels, or across multiple tabs.
- Server: The server of a Shiny app defines and processes the data that will be displayed by the
UI. Generally speaking, a server is a program running on a computer (often remotely) that
receives requests and provides (“serves”) content based on the request. For example, when
you request information from a web API, you submit a request to a server that processes the
request and returns the desired information. In a Shiny application, you can think of the
server as an interactive R session that the user will use to “run” data processing functions by
interacting with the UI in the web browser (not in RStudio). The server takes in inputs from
the user (based on their interactions) and runs functions that provide outputs (e.g., text or
charts) for the UI to display. These data processing functions are reactive, which means they
are automatically rerun whenever the input changes (they “react” to it). This allows the
output to be dynamic and interactive.
- Control widget: An element in the UI that allows the user to provide input to the server—for
example, a text input box, a dropdown menu, or a slider. Control widgets store input values,
which are automatically updated as the user interacts with the widget. Updates to the value
stored by the widget are sent from the UI to the server, which will react to those changes to
generate new content to display.
- Reactive output: An element in the UI that displays dynamic (changing) content produced
by the server—for example, a chart that dynamically updates when the user selects different
data to display, or a table that responds to a search query. A reactive output will
automatically update whenever the server sends it a new value to display.
- Render function: Functions in the server that produce output that can be understood and
displayed by the UI’s reactive outputs. A render function will automatically “re-execute”
whenever a related control widget changes, producing an updated value that will be read and
displayed by a reactive output.
- Reactivity: Shiny apps are designed around reactivity: updating some components in the UI
(e.g., the control widgets) will cause other components (e.g., the render functions in the
server) to “react” to that change and automatically re-execute. This is similar to how
equations in a spreadsheet program like Microsoft Excel work: when you change the value in
one cell, any others that reference it “react” and change as well.
Shiny is made available through the shiny package—another external package (like dplyr and
ggplot2) that you will need to install and load before you can use it:
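# Install the `shiny` package (only needs to be done once)
install.packages("shiny")

# Load the package to make its functions available
library("shiny")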
This will make available all of the framework functions and variables you will need to work with.
Mirroring Figure 19.1, Shiny applications are separated into two components (parts of the
application): the UI and the server.
1. The UI defines how the application is displayed in the browser. The UI for a Shiny
application is defined as a value, almost always one returned from calling one of Shiny’s
layout functions.
The following example UI defines a fluidPage() (where the content will “fluidly” flow
down the page based on the browser size) that contains three content elements: static text
content for the page heading, a text input box where the user can type a name, and the
output text of a calculated message value (which is defined by the server). These functions
and their usage are described in more detail in Section 19.2.
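A sketch of such a UI (the heading text and input label are illustrative; the inputId and outputId need to match what the server uses):

# Define the UI: a heading, a text input box, and a text output
my_ui <- fluidPage(
  # static content: a second-level heading
  h2("Greetings from Shiny"),

  # a control widget where the user can type a name
  textInput(inputId = "username", label = "What is your name?"),

  # a reactive output that displays the server's `message` value
  textOutput(outputId = "message")
)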
2. The server defines and processes the data that will be displayed by the UI. The server for a
Shiny application is defined as a function (in contrast, the UI is a value). This function needs
to take in two lists as arguments, conventionally called input and output. The values in the
input list are received from the user interface (e.g., web browser), and are used to create
content (e.g., calculate information or make graphics). This content is then saved in the
output list so that it can be sent back to the UI to be rendered in the browser. The server uses
render functions to assign these values to output so that the content will automatically be
recalculated whenever the input list changes. For example:
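A sketch of such a server (mirroring the UI above; the exact greeting text is illustrative):

# Define the server: compute a greeting from the `username` input
my_server <- function(input, output) {
  # renderText() re-executes whenever `input$username` changes
  output$message <- renderText({
    paste0("Hello ", input$username, "!")
  })
}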
(The specifics of the server and its functions are detailed in Section 19.3.)
The UI and the server are both written in the app.R file. They are combined by calling the
shinyApp() function, which takes a UI value and a server function as arguments. For example:
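# Combine the UI and server; executing this function starts the app
shinyApp(ui = my_ui, server = my_server)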
Executing the shinyApp() function will start the app. Alternatively, you can launch a Shiny app
using the “Run App” button at the top of RStudio (see Figure 19.2). This will launch a viewer
window presenting your app (Figure 19.3); you can also click the “Open in Browser” button at the
top to show the app running in your computer’s web browser. Note that if you need to stop the app,
you can close the window or click the “Stop Sign” icon that appears on the RStudio console.
Tip: If you change the UI or the server, you generally do not need to stop and start the app.
Instead, you can refresh the browser or viewer window, and it will reload with the new UI
and server.
When this example application is run, Shiny will combine the UI and server components into a
webpage that allows the user to type a name into an input box; the page will then say “Hello” to
whatever name is typed in (as shown in Figure 19.3). As the user types into the input box (created by
the textInput() function), the UI sends an updated username value to the server; this value is
stored in the input argument list as input$username. The renderText() function in the server
then reacts to the change to the input$username value, and automatically re-executes to calculate
a new renderable value that is stored in output$message and sent back to the UI (illustrated in
Figure 19.4). Through this process, the app provides a dynamic experience in which the user types
into a box and sees the message change in response. While this is a simple example, the same
structure can be used to create searchable data tables, change the content of interactive graphics, or
even specify the parameters of a machine learning model!
Figure 19.2 Use RStudio to run a Shiny app. The "Run App" button starts the application, while the
“Stop Sign” icon in the console stops it.
Figure 19.3 A Shiny application that greets a user based on an input name, running in the RStudio
viewer. Note the “Open in Browser” and Refresh buttons at the top.
Figure 19.4 Variables passing between a UI and a server. The server function accepts inputs from
the UI and generates a set of outputs that are passed back to the UI to be rendered.
Tip: The reactivity involved in Shiny apps can make them difficult to debug. Code statements don’t flow directly from top to bottom as with most scripts, and Shiny may produce somewhat obscure error messages in the console when something goes wrong. As with R Markdown, a good strategy for identifying problematic code is to systematically remove (“comment out”) segments of your project and attempt to rerun your application. For additional advice on how to fix issues in Shiny apps, see the official Debugging Shiny applications guide.a
a. https://shiny.rstudio.com/articles/debugging.html
A Shiny app divides responsibilities between its UI and server: the UI is responsible for presenting
information, while the server is responsible for processing information. Enabling such a separation
of concerns is a fundamental principle when designing computer programs, as it allows developers
to isolate their problem solving and more easily create scalable and collaborative projects. Indeed,
this division is the same separation recommended in splitting code across .R and .Rmd files.
While it is possible to define both the UI and server in the same app.R file, you can further
emphasize this separation of concerns by instead defining the UI and server in separate files (e.g.,
my_ui.R and my_server.R). You can then use the source() function to load those variables into
the app.R script for combining. Such a division can help keep your code more organized and
understandable, particularly as your apps grow larger.
If you name the separate files exactly ui.R and server.R (and have the last value returned in each
script be the UI value and the server function, respectively), RStudio will be able to launch your
Shiny application without having a unified app.R file. Even so, it is better practice to use a single
app.R script to run your Shiny app, and then source() in the UI and server to keep them
separated.
Caution: Avoid creating both an app.R and files named exactly ui.R and server.R in your
project. This can confuse RStudio and cause your application not to run. Pick one approach
or the other!
Going Further: You can use the Shiny framework to add interactive widgets to HTML documents created using R Markdown! See the Introduction to Interactive Documents article.a Note that the webpage will still need to be hosted somewhere that supports a Shiny server (such as shinyapps.io, described in Section 19.4).
a. https://shiny.rstudio.com/articles/interactive-docs.html
When you write code defining a UI, you are defining how the app will be displayed in the browser.
You create a UI by calling a layout function such as fluidPage(), which will return a UI definition
that can be used by the shinyApp() function. Layout functions take as arguments the content
elements (pieces of content) that you want the layout to contain (and thus will be shown in the
app’s UI):
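# A layout function wrapping a few content elements (the content shown is
# illustrative)
my_ui <- fluidPage(
  h2("Page heading"),
  p("Some text content for the page.")
)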
A layout function can take as many content elements as needed, each as an additional argument
(often placed onto separate lines for readability). For example, the UI shown in Figure 19.2 has three
content elements: one produced by the h2() function, one produced by the textInput()
function, and one produced by the textOutput() function.
Many different types of content elements can be passed to a layout function, as described in the
following sections.
Tip: You can initially implement your app with an “empty” server function as a way to design
and test your UI—a UI does not require any actual content in the server! See Section 19.3 for
an example of an empty server function.
Content elements are created by calling specific functions that create them. For example, the h1()
function will create an element that has a first-level heading (similar to using a # in Markdown).
These functions are passed arguments that are the content (usually strings) that should be shown:
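# A first-level heading and a paragraph of plain text (the strings shown are
# illustrative)
h1("My Application")
p("This is a paragraph describing the application.")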
Static content functions can alternatively be referenced as elements of the tags list (e.g.,
tags$h1()), so they are also known as “tag functions.” This is because static content functions are
used to produce HTML,2 the language used to specify the content of webpages (recall that a Shiny
app is an interactive webpage). As such, static content functions are all named after HTML tags. But
since Markdown is also compiled into HTML tags (as when you knit an R Markdown document),
many static content functions correspond to Markdown syntax, such as those described in
Table 19.1. See the HTML Tags Glossary3 for more information about the meaning of individual
functions and their common arguments.
Static content functions can be passed multiple unnamed arguments (i.e., multiple strings), all of
which are included as that kind of static content. You can even pass other content elements as
arguments to a tag function, allowing you to “nest” formatted content:
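# Nest formatted content elements inside a paragraph (text is illustrative)
p(
  "This paragraph contains both",
  strong("bold text"),
  "and",
  em("italic text"),
  "in a single element."
)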
It is common practice to include a number of static elements (often with such nesting) to describe
your application—similar to how you would include static Markdown content in an R Markdown
document. In particular, almost all Shiny apps include a titlePanel() content element, which
provides both a second-level heading (h2()) element for the page title and specifies the title shown
in the tab of a web browser.
2. HTML tutorials and reference from the Mozilla Developer Network: https://developer.mozilla.org/en-US/docs/Web/HTML
3. Shiny HTML Tags Glossary: https://shiny.rstudio.com/articles/tag-glossary.html
Table 19.1 Some example static content functions and their Markdown equivalents
Static Content Function Markdown Equivalent Description
h1("Heading 1") # Heading 1 A first-level heading
h2("Heading 2") ## Heading 2 A second-level heading
p("some text") some text (on own line) A paragraph (of plain text)
em("some text") _some text_ Emphasized (italic) text
strong("some text") **some text** Strong (bold) text
a("some text", href = "url") [some text](url) A hyperlink (anchor)
img("description", src = "path") ![description](path) An image
Going Further: If you are familiar with HTML syntax, you can write such content directly
using the HTML() function, passing in a string of the HTML you want to include. Similarly,
if you are familiar with CSS, you can include stylesheets using the includeCSS() content
function. See the article Style Your Apps with CSSa for other options and details.
a. https://shiny.rstudio.com/articles/css.html
Each widget handles user input by storing a value that the user has entered—whether by typing
into a box, moving a slider, or clicking a button. When the user interacts with the widget and
changes the input, the stored value automatically changes as well. Thus you can almost think of
each widget’s value as a “variable” that the user is able to modify by interacting with the web
browser. Updates to the value stored by the widget are sent to the server, which will react to those
changes to generate new content to display.
Like static content elements, control widgets are created by calling an appropriate function—most
of which include the word “input” in the name. For example:
- textInput() creates a box in which the user can enter text. The “Greeting” app described
previously includes a textInput().
- sliderInput() creates a slider that the user can drag to choose a value (or range of values).
Figure 19.5 Examples of control widgets that can be included in the UI of a Shiny application (image
from shiny.rstudio.com).
- radioButtons() creates “radio” buttons (the user can select only one of these buttons at a
time, just like selecting the station on a radio).
See the documentation4 for a complete list of available control widgets, and the widgets gallery5 for
examples.
All widget functions take at least two arguments:
- An inputId (a string) or “name” for the widget’s value. This is the “key” that allows the
server to access that widget’s value (literally, it is the key for that value in the input list
argument).
- A label (as a string or static content element) that will be shown alongside the widget and
tell the user what the value represents. The label can be an empty string ("") if you don’t
want to show anything.
Other arguments may be required by a particular widget. For example, a slider widget requires a
min, max, and (starting) value, as in the code below.
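# A slider widget: the user picks a value between 1 and 100, starting at 50
# (the inputId and label shown here are illustrative)
sliderInput(
  inputId = "chosen_value",
  label = "Pick a value:",
  min = 1,
  max = 100,
  value = 50
)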
Control widgets are used to solicit input values from the user, which are then sent to the server for
processing. See Section 19.3 for details on how to use these input values.
4. Shiny reference: http://shiny.rstudio.com/reference/shiny/latest/
5. Shiny Widgets Gallery: http://shiny.rstudio.com/gallery/widget-gallery.html
As with other content elements, reactive outputs are created by calling an appropriate function,
most of which include the word “output” in the name. For example:
- textOutput() displays output as plain text; use htmlOutput() if you want to render HTML
content.
- plotOutput() displays a graphical plot, such as one created with the ggplot2 package. The
plotlyOutput() function from the plotly package can be used to render an interactive
plot, or you can make a ggplot2 plot interactive.6
Each of these functions takes as an argument the outputId (a string) or “name” for the value that
will be displayed. The function uses this “key” to access the value that is output by the server. For
example, you could show the following information generated by your server:
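# Reactive outputs inside a layout; the outputIds are illustrative and must
# match keys the server assigns to the `output` list
my_ui <- fluidPage(
  textOutput(outputId = "message"),  # display a character string
  plotOutput(outputId = "my_plot")   # display a graphical plot
)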
Note that each function may support additional arguments as well (e.g., to specify the size of a
plot). See the documentation for details on individual functions.
6. Interactive Plots: http://shiny.rstudio.com/articles/plot-interaction.html
Caution: Each page can show a single output value just once (because it needs to be given a
unique id in the generated HTML). For example, you can’t include textOutput(outputId
= "mean_value") twice in the same UI.
Remember: As you build your application’s UI, be careful to keep track of the names
(inputId and outputId) you give to each control widget and reactive output; you will need
these to match with the values referenced by the server!
19.2.4 Layouts
You can specify how content is organized on the page by using different layout content elements.
Layout elements are similar to other content elements, but are used to specify the position of
different pieces of content on the page—for example, organizing content into columns or grids, or
breaking up a webpage into tabs.
Layout content elements are also created by calling associated functions; see the Shiny
documentation or the Layout Guide7 for a complete list. Layout functions all take as arguments a
sequence of other content elements (created by calling other functions) that will be shown on the
page following the specified layout. For example, the previous examples use a fluidPage() layout
to position content from top to bottom in a way that responds to the size of the browser window.
Because layouts themselves are content elements, it’s also possible to pass the result of calling one
layout function as an argument to another. This allows you to specify some content that is laid out
in “columns,” and then have the “columns” be placed into a “row” of a grid. As an example, the
commonly used sidebarLayout() function organizes content into two columns: a “sidebar”
(shown in a gray box, often used for control widgets or related content) and a “main” section (often
used for reactive outputs such as plots or tables). Thus sidebarLayout() needs to be passed two
arguments: a sidebarPanel() layout element that contains the content for the sidebar, and a
mainPanel() layout element that contains the content for the main section:
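A sketch of this structure (the specific widget and output shown are illustrative):

my_ui <- fluidPage(
  titlePanel("My App"),
  sidebarLayout(
    # the sidebar holds control widgets
    sidebarPanel(
      textInput(inputId = "username", label = "What is your name?")
    ),
    # the main section holds reactive outputs
    mainPanel(
      textOutput(outputId = "message")
    )
  )
)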
7. Shiny Application Layout Guide: http://shiny.rstudio.com/articles/layout-guide.html
Caution: Because Shiny layouts are usually responsive to web browser size, on a small window (such as the default app viewer) the sidebar may be placed above the content—since there isn’t room for it to fit nicely on the side!
Since a layout and its content elements are often nested (similar to some static content elements),
you almost always want to use line breaks and indentation to make that nesting apparent in the
code. With large applications or complex layouts, you may need to trace down the page to find the
closing parenthesis ) that indicates exactly where a particular layout’s argument list (passed in
content) ends.
Because layout functions can quickly become complex (with many other nested content
functions), it is also useful to store the returned layouts in variables. These variables can then be
passed into higher-level layout functions. The following example specifies multiple “tabs” of
content (created using the tabPanel() layout function), which are then passed into a
navbarPage() layout function to create a page with a “navigation bar” at the top to browse the
different tabs. The result is shown in Figure 19.6.
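A sketch of this approach (the tab titles and contents are illustrative):

# Define each tab's content, then pass the tabs to navbarPage()
first_tab <- tabPanel(
  "First Tab",  # title shown in the navigation bar
  h2("Welcome"),
  p("Introductory content for the first tab.")
)
second_tab <- tabPanel(
  "Second Tab",
  plotOutput(outputId = "my_plot")  # a reactive output shown on this tab
)
my_ui <- navbarPage("My Application", first_tab, second_tab)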
Figure 19.6 A “multi-page” application built with Shiny’s layout functions, including navbarPage()
and sidebarLayout(). Red notes are added.
The Shiny framework can be used to develop highly complex layouts just by calling R functions. For
more examples and details on how to achieve particular layout and UI effects, check the Shiny
documentation and application gallery.
Fun Fact: Much of Shiny’s styling and layout structure is based on the Bootstrapa web framework, which is how it supports layouts that are responsive to window size. Note that Shiny uses Bootstrap 3, not the more recent Bootstrap 4.
a. http://getbootstrap.com/docs/3.3/
You create a Shiny server by defining a function (rather than calling a provided one, as with a UI).
The function must be defined to take at least two arguments: a list to hold the input values, and a
list to hold the output values:
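# A minimal server skeleton; an "empty" server like this is enough to test a UI
my_server <- function(input, output) {
  # use render functions to assign values to `output` here
}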
Note that a server function is just a normal function, albeit one that will be executed to “set up” the
application’s reactive data processing. Thus you can include any code statements that would
normally go in a function—though that code will be run only once (when the application is first
started) unless defined as part of a render function.
When the server function is called to set up the application, it will be passed the input and output
list arguments. The first argument (input) will be a list containing any values stored by the control
widgets in the UI: each inputId (“name”) in a control widget will be a key in this list, whose value
is the value currently stored by the widget. For example, the textInput() shown in Figure 19.2 has
an inputId of username, so would cause the input list to have a username key (referenced as
input$username inside of the server function). This allows the server to access any data that the
user has input into the UI. Importantly, these lists are reactive, so the values inside of them will
automatically change as the user interacts with the UI’s control widgets.
The primary purpose of the server function is to assign new values to the output list (each with an
appropriate key). These values will then be displayed by the reactive outputs defined in the UI. The
output list is assigned values that are produced by render functions, which are able to produce
output in a format that can be understood by the UI’s outputs (reactive outputs can’t just display
plain strings). As with the UI’s reactive output functions, the server uses different render functions
for the different types of output it provides, as shown in Table 19.2.
The result of a render function must be assigned to a key in the output list argument that matches
the outputId (“name”) specified in the reactive output. For example, if the UI includes
textOutput(outputId = "message"), then the value must be assigned to output$message. If
the keys don’t match, then the UI won’t know what output to display! In addition, the type of
render function must match the type of reactive output: you can’t have the server provide a plot
to render but have the UI try to output a table for that value! This usually means that the word
after “render” in the render function needs to be the same as the word before “Output” in the
reactive output function. Note that Shiny server functions will usually have multiple render
functions assigning values to the output list—one for each associated reactive output in the UI.
Table 19.2 Some example render functions and their associated reactive outputs
Render Function (Server) Reactive Output (UI) Content Type
renderText() textOutput() Unformatted text (character strings)
renderTable() tableOutput() A simple data table
renderDataTable() dataTableOutput() An interactive data table (use the DT package)
renderPlot() plotOutput() A graphical plot (e.g., created with ggplot2)
renderPlotly() plotlyOutput() An interactive Plotly plot
renderLeaflet() leafletOutput() An interactive Leaflet map
renderPrint() verbatimTextOutput() Any output produced with print()
All render functions take as an argument a reactive expression. A reactive expression is a lot like a
function: it is written as a block of code (in braces {}) that returns the value to be rendered. Indeed,
the only difference between writing a function and writing a reactive expression is that you don’t
include the keyword function or a list of arguments—you just include the block (the braces and
the code inside it).
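For instance, the greeting message from the earlier example could be produced as follows:

# The braces contain a reactive expression whose result is rendered
output$message <- renderText({
  paste0("Hello ", input$username, "!")
})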
What is significant about render functions is that they will automatically “rerun” their passed-in
code block every time a value they reference in the input list changes. So if the user interacts with
the username control widget in the UI (and thereby changes the input$username value), the
function in the preceding example will be executed again—producing a new value that will be
reassigned to output$message. And once output$message changes, any reactive output in the UI
(e.g., a textOutput()) will update to show the latest value. This makes the app interactive!
Remember: In effect, render functions are functions that will be rerun automatically when
an input changes, without you having to call them explicitly! You can think of them as the
functions you define for how the output should be determined—and those functions will
be rerun when the input changes.
Thus your server defines a series of “functions” (render functions) that specify how the output
should change based on changes to the input—when that input changes, the output changes
along with it.
Tip: Data values that are not reactive (that will not change based on user interaction) can be
defined elsewhere in the server function, as normal. If you want a nonreactive data value to
be available to the UI as well—such as one that contains configuration or static data range
information—you should create it outside of the server function in the app.R file, or in a
separate global.R file. See the Scoping Rules for Shiny Apps articlea for details.
a. https://shiny.rstudio.com/articles/scoping.html
Going Further: Understanding the flow of data in and between render functions and other
reactive expressions is the key to developing complex Shiny applications. For more details on
reactivity in Shiny, see RStudio’s articles on reactivity,a particularly Reactivity: An Overviewb
and How to Understand Reactivity in R.c
a. https://shiny.rstudio.com/articles/#reactivity
b. https://shiny.rstudio.com/articles/reactivity-overview.html
c. https://shiny.rstudio.com/articles/understanding-reactivity.html
While there are a few different solutions for hosting Shiny apps, the simplest is hosting through shinyapps.io.8 shinyapps.io is a platform provided by RStudio for hosting and running Shiny apps. Anyone can deploy and host five small(ish) applications on the platform for free, though deploying larger applications costs money.
To host your app on shinyapps.io, you will need to create a free account.9 You can sign up with
GitHub (recommended) or a Google account. After you sign up, follow the site’s instructions:
- Select an account name, keeping in mind it will be part of the URL people use to access your
application.
- Install the required rsconnect package (it may have been included with your RStudio
download).
- Set your authorization token (“password”) for uploading your app. To do this, click the green
“Copy to Clipboard” button, and then paste that selected command into the Console in
RStudio. You should need to do this just once per machine.
8. shinyapps.io web hosting for Shiny apps: https://www.shinyapps.io
9. shinyapps.io signup: https://www.shinyapps.io/admin/#/signup
Don’t worry about the listed “Step 3 - Deploy”; you should instead publish directly through
RStudio!
After you have set up an account, you can publish your application by running your app through
RStudio (i.e., by clicking the “Run App” button), and then clicking the “Publish” button in the
upper-right corner of the app viewer (see Figure 19.7).
After a minute of processing and uploading, your app should become available online at a URL of the form https://ACCOUNT_NAME.shinyapps.io/APP_NAME (based on your account and application names).
1. Always test and debug your app locally (e.g., on your own computer, by running the app
through RStudio). It’s easier to find and fix errors locally; make sure the app works on your
machine before you even try to put it online.
2. You can view the error logs for your deployed app by either using the “Logs” tab in the
application view or calling the showLogs() function (part of the rsconnect package).
These logs will show print() statements and often list the errors that explain the problem
that occurred when deploying your app.
3. Use correct folder structures and relative paths. All of your app files should reside in a single
folder (usually named after the project). Make sure any .csv or .R files referenced are inside
the app folder, and that you use relative paths to refer to them in your code. Do not ever
include any setwd() statements in your code; only set the working directory through
RStudio (because shinyapps.io will have its own working directory).
Figure 19.7 Click the Publish button in the upper-right corner of a Shiny app to publish it to
shinyapps.io.
4. Make sure that any external packages you use are referenced with the library() function in
your app.R file. The most common problem we’ve seen involves external packages not being
available. See the documentation10 for an example and suggested solutions.
For more options and details, see the shinyapps.io user guide.11
As of the time of writing, the data set captured 506 fatalities during the time period, each of which has 17 pieces of information about the incident, such as the name, age, and race
of the victim (a subset of the data is shown in Figure 19.8). The purpose of the Shiny application is
to understand the geographic distribution of where people have been killed by the police, and to
provide summary information about the incidents, such as the total number of people killed
broken down by race or gender. The final product (shown in Figure 19.10) allows users to select a
variable in the data set—such as race or gender—through which to analyze the data. This choice
will dictate the color encoding in the map as well as the level of aggregation in a summary table.
A main component of this application will be an interactive map displaying the location of each shooting. The color of each point will express additional information about that individual (such as race or gender).
Figure 19.8 A subset of the police shootings data set, originally compiled by the Washington Post.
10. shinyapps.io build errors on deployment: http://docs.rstudio.com/shinyapps.io/Troubleshooting.html#build-errors-on-deployment
11. shinyapps.io user guide: http://docs.rstudio.com/shinyapps.io/index.html
12. “Fatal Force,” Washington Post: https://www.washingtonpost.com/graphics/2018/national/police-shootings-2018/
13. Fatal Shootings GitHub page: https://github.com/washingtonpost/data-police-shootings
14. Shiny in Action: https://github.com/programming-for-data-science/in-action/tree/master/shiny
While the column used to dictate the color will eventually be dynamically
selected by the user, you can start by creating a map with the column “hard-coded.” For example,
you can use Leaflet (discussed in Section 17.3) to generate a map displaying the location of each
shooting with points colored by race of the victim (shown in Figure 19.9):
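A sketch of that map (the shootings data frame and its latitude, longitude, and race columns are assumptions based on the data set described above):

library("leaflet")

# Create a color palette keyed to the (hard-coded) race column
palette_fn <- colorFactor(palette = "Dark2", domain = shootings$race)

# Draw a map with a circle for each shooting, colored by the victim's race
leaflet(data = shootings) %>%
  addProviderTiles("CartoDB.Positron") %>%  # add a background "tile" layer
  addCircleMarkers(
    lat = ~latitude,
    lng = ~longitude,
    color = ~palette_fn(race),
    fillOpacity = 0.7,
    radius = 4,
    stroke = FALSE
  )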
Tip: Because server inputs in a Shiny application are strings, it’s helpful to use R’s double-bracket notation to select data of interest (e.g., df[[input$some_key]]), rather than relying on dplyr functions such as select().
Figure 19.9 A map of each person killed by the police in 2018, created using leaflet.
Tip: A great way to develop a Shiny application is to first build a static version of your content, then swap out static values (variable names) for dynamic ones (information stored in the input variable). Starting with a working version of your content will make debugging the application much easier.
While this map allows you to get an overall sense of the geographic distribution of the fatalities,
supporting it with specific quantitative data—such as the total number of people killed by
race—can provide more precise information. Such summary information can be calculated using
the dplyr functions group_by() and count(). Note the use of double-bracket notation to pass in
the column values directly (rather than referencing the column by name), which will allow you to
more easily make the column name dynamic in the Shiny application.
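A sketch of this aggregation (again assuming a shootings data frame; the column of interest is hard-coded for now):

library("dplyr")

# Column to aggregate by (hard-coded here, selected by the user in the app)
analysis_var <- "race"

# Count the number of fatalities for each value of that column
summary_table <- shootings %>%
  group_by(shootings[[analysis_var]]) %>%
  count()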
With these data representations established, you can begin implementing the Shiny application.
For every Shiny app, you will need to create a UI and a server. It’s often useful to start with the UI to
help provide structure to the application (and it’s easier to test that it works). To create the UI that
will render these elements, you can use a structure similar to that described in Section 19.2.4 and
declare a fluidPage() layout that has a sidebarPanel() to keep the control widgets (a
“dropdown box” that lets the user select which column to analyze) along the side, and a
mainPanel() in which to show the primary content (the leaflet map and the data table):
# Define the UI for the application that renders the map and table
my_ui <- fluidPage(
  # Application title
  titlePanel("Fatal Police Shootings"),

  # Sidebar (for the dropdown widget) and main panel (for the map and table)
  sidebarLayout(sidebarPanel(), mainPanel())
)
You can check that the UI looks correct by providing an empty server function and calling the
shinyApp() function to run the application. While the map and data won’t show up (they haven’t
been defined), you can at least check the layout of your work.
Once the UI is complete, you can fill in the server. Since the UI renders two reactive outputs (a
leafletOutput() and a tableOutput()), your server needs to provide corresponding render
functions. These functions can return versions of the “hard-coded” map and data table defined
previously, but using information taken from the UI’s input to select the appropriate column—in
other words, replacing the "race" column with the column named by input$analysis_var.
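A sketch of such a server (the output names shooting_map and grouped_table are illustrative and must match the outputIds used by the UI's leafletOutput() and tableOutput() calls):

my_server <- function(input, output) {
  # Render the map, coloring each point by the user-selected column
  output$shooting_map <- renderLeaflet({
    palette_fn <- colorFactor(
      palette = "Dark2",
      domain = shootings[[input$analysis_var]]
    )
    leaflet(data = shootings) %>%
      addProviderTiles("CartoDB.Positron") %>%
      addCircleMarkers(
        lat = ~latitude,
        lng = ~longitude,
        color = palette_fn(shootings[[input$analysis_var]]),
        fillOpacity = 0.7,
        radius = 4,
        stroke = FALSE
      )
  })

  # Render a table of fatalities aggregated by the user-selected column
  output$grouped_table <- renderTable({
    shootings %>%
      group_by(shootings[[input$analysis_var]]) %>%
      count()
  })
}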
Notice the use of input$analysis_var to dynamically set the color of each point, as well as the
aggregation column for the data table.
Figure 19.10 A Shiny application exploring fatal police shootings in 2018. A dropdown menu allows
users to select the feature that dictates the color on the map, as well as the level of aggregation for
the summary table.
As this example shows, in a little less than 80 lines of well-commented code, you can build an
interactive application for exploring fatal police shootings. The final application is shown in
Figure 19.10, and the full code appears below.
# Load libraries
library(shiny)
library(dplyr)
library(leaflet)
# Dropdown menu (a selectInput() in the sidebar) for choosing which column
# to analyze; the surrounding layout code is omitted here
selectInput(
  inputId = "analysis_var",
  label = "Level of Analysis",
  choices = c("gender", "race", "body_camera", "threat_level")
)
By creating interactive user interfaces for exploring your data, you can empower others to discover
relationships in the data, regardless of their technical skills. This will help bolster their
understanding of your data set and eliminate requests for you to perform different analyses (others
can do it themselves!).
Tip: Shiny is a very complex framework and system, so RStudio provides a large number of
resources to help you learn to use it. In addition to providing a cheatsheet available through
the RStudio menu (Help > Cheatsheets), RStudio has compiled a detailed and effective
set of video and written tutorials.a
a. http://shiny.rstudio.com/tutorial/
For practice building Shiny applications, see the set of accompanying book exercises.15
15. Shiny exercises: https://github.com/programming-for-data-science/chapter-19-exercises
20 Working Collaboratively
To be a successful member of a data science team, you will need to be able to effectively collaborate
with others. While this is true for nearly any practice, an additional challenge for collaborative data
science is working on shared code for the same project. Many of the techniques for supporting
collaborative coding involve writing clear, well-documented code (as demonstrated throughout
this book!) that can be read, understood, and modified by others. But you will also need to be able
to effectively integrate your code with code written by others, avoiding any “copy-and-pasting”
work for collaboration. The best way to do this is to use a version control system. Indeed, one of the
biggest benefits of git is its ability to support collaboration (working with other people). In this
chapter, you will expand your version control skills to maintain different versions of the same code
base using git’s branching model, and familiarize yourself with two different models for
collaborative development.
Chapter 3 describes how to use git when you are working on a single branch (called master) using
a linear sequence of commits. As an example, Figure 20.1 illustrates a series of commits for a sample
project history. Each one of these commits—identified by its hash (e.g., e6cfd89 in short
form)—follows sequentially from the previous commit, building directly on it; you move back and forth through the history in a straight line. This linear sequence
represents a workflow using a single line of development. Having a single line of development is a great
start for a work process, as it allows you to track changes and revert to earlier versions of your work.
In addition to supporting single development lines, git supports a nonlinear model in which you
“branch off” from a particular line of development to create new concurrent change histories. You
can think of these as “alternate timelines,” which are used for developing different features or fixing
bugs. For example, suppose you want to develop a new visualization for your project, but you’re
Figure 20.1 A diagram of a linear sequence of commits alongside a log of the commit history as
shown in the terminal. This project has a single history of commits (i.e., branch), each represented by
a six-character commit hash. The HEAD—most recent commit—is on the master branch.
Figure 20.2 A sequence of commits spread across multiple branches, producing “alternate timelines.” Commits switch between being added to each branch (timeline). The commits on the bug-fix branch (labeled G and H) are merged into the master branch, becoming part of that history.
unsure if it will look good and be incorporated. You don’t want to pollute the primary line of
development (the “main work”) with experimental code, so instead you branch off from the line
of development to work on this code at the same time as the rest of the core work. You are able to
commit iterative changes to both the experimental visualization branch and the main
development line, as shown in Figure 20.2. If you eventually decide that the code from the
experimental branch is worth keeping, you can easily merge it back into the main development line
as if it were created there from the start!
Running the git branch command with no arguments lists the branches in the repo; the line printed with an asterisk (*) is the “current branch” you’re on. You can use the same git
branch command to create a new branch:
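git branch BRANCH_NAME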
This will create a new branch called BRANCH_NAME (replace BRANCH_NAME with whatever name you
want; usually not in all caps). For example, you could create a branch called experiment:
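git branch experiment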
If you run git branch again, you will see that this hasn’t actually changed what branch you’re on. In
fact, all you have done is create a new branch that starts at the current commit!
Going Further: Creating a new branch is similar to creating a new pointer to a node in the
linked list data structure from computer science.
To switch to a different branch, you use the git checkout command (the same one described in
Section 3.5.2).
For example, you can switch to the experiment branch with the following command:
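git checkout experiment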
Checking out a branch doesn’t actually create a new commit! All it does is change the HEAD so that it
now refers to the latest commit of the target branch (the alternate timeline). HEAD is just an alias for
“the most recent commit on the current branch.” It lets you talk about the most recent commit
generically, rather than needing to use a particular commit hash.
You can confirm that the branch has changed by running the git branch command and looking
for the asterisk (*), as shown in Figure 20.3.
Alternatively (and more commonly), you can create and checkout a branch in a single step using
the -b option with git checkout:
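git checkout -b BRANCH_NAME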
Figure 20.3 Using git commands on the command line to display the current branch (git branch),
and create and checkout a new branch called experiment (git checkout -b experiment).
For example, to create and switch to a new branch called experiment, you would use the following
command:
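git checkout -b experiment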
This effectively does a git branch BRANCH_NAME followed by a git checkout BRANCH_NAME.
This is the recommended way of creating new branches.
Once you have checked out a particular branch, any new commits from that point on will occur in
the “alternate timeline,” without disturbing any other line of development. New commits will be
“attached” to the HEAD (the most recent commit on the current branch), while all other branches
(e.g., master) will stay the same. If you use git checkout again, you can switch back to the other
branch. This process is illustrated in Figure 20.4.
Figure 20.4 Using git to commit to multiple branches. A hollow circle is used to represent where the
next commit will be added to the history. Switching branches, as in figures (a), (d), and (f), will change
the location of the HEAD (the commit that points to the hollow circle), while making new commits, as
in figures (b), (c), and (e), will add new commits to the current branch.
Importantly, checking out a branch will “reset” the files and code in the repo to whatever they
looked like when you made the last commit on that branch; the code from the other branches’
versions is stored in the repo’s .git database. You can switch back and forth between branches and
watch your code change!
1. git status: Check the status of your project. This confirms that the repo is on the master
branch.
2. git checkout -b experiment: Create and checkout a new branch, experiment. This
code will branch off of the master branch.
3. Make an update to the README file in a text editor (still on the experiment branch).
4. git commit -am "Update README": This will add and commit the changes (as a single
command)! This commit is made only to the experiment branch; it exists in that timeline.
5. git checkout master: Switch back to the master branch. The file switches to show the
latest version of the master branch.
6. git checkout experiment: Switch back to the experiment branch. The file switches to
show the latest version of the experiment branch.
Figure 20.5 Switching branches allows you to work on multiple versions of the code simultaneously.
Caution: You can only check out a branch if the current working directory has no uncommitted changes. This means you will need to commit any changes to the current branch before you checkout another branch. If you want to “save” your changes but don’t want to commit to them, you can use git’s ability to temporarily stash changes.a
a. https://git-scm.com/book/en/v2/Git-Tools-Stashing-and-Cleaning
Finally, you can delete a branch using git branch -d BRANCH_NAME. Note that this command
will give you a warning if you might lose work; be sure to read the output message!
Taken together, these commands will allow you to develop different aspects of your project in
parallel. The next section discusses how to bring these lines of development together.
Tip: You can also use the git checkout BRANCH_NAME FILE_NAME command to check out an individual file from a particular branch. This will load the file directly into the current working directory as a file change, replacing the current version of the file (git will not merge the two versions of the file together)! This is identical to checking out a file from a past commit (as described in Chapter 3), just using a branch name instead of a commit hash.
For example, you can merge the experiment branch into the master branch as follows:
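git checkout master    # switch to the branch the changes should end up in
git merge experiment   # merge the experiment branch into master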
The merge command will (in effect) walk through each line of code in the two versions of the files,
looking for any differences. Changes to each line of code in the incoming branch will then be
applied to the equivalent line in the current branch, so that the current version of the files contains
all of the incoming changes. For example, if the experiment branch included a commit that added
a new code statement to a file at line 5, changed the code statement at line 9, and deleted the code
statement at line 13, then git would add the new line 5 to the file (pushing everything else down),
change the code statement that was at line 9, and delete the code statement that was at line 13. git
will automatically “stitch” together the two versions of the files so that the current version contains
all of the changes.
Tip: When merging, think about where you want the code to “end up”—that is the branch
you want to checkout and merge into!
In effect, merging will take the commits from another branch and insert them into the history of
the current branch. This is illustrated in Figure 20.6.
Figure 20.6 Merging an experiment branch into the master branch. The committed changes from
the experiment branch (labeled C and D) are inserted into the master branch’s history, while also
remaining present in the experiment branch.
Note that the git merge command will merge OTHER_BRANCH into the branch you are currently
on. For example, if you want to take the changes from your experiment branch and merge them
into your master branch, you will need to first checkout your master branch, and merge in the
changes from the experiment branch.
Caution: If something goes wrong, don’t panic and close your command shell! Instead, take
a breath and look up how to fix the problem you’ve encountered (e.g., how to exit vim). As
always, if you’re unsure why something isn’t working with git, use git status to check the
current status and to determine which steps to do next.
If the two branches have not edited the same line of code, git will stitch the files together
seamlessly and you can move forward with your development. Otherwise, you will have to resolve
any conflict that occurs as part of your merge.
git is just a simple computer program, and has no way of knowing which version of the conflicting
code it should keep—is the master version or the experiment version better? Since git can’t
determine which version of the code to keep, it stops the merge in the middle and forces you to
choose what code is correct manually.
To resolve the merge conflict, you will need to edit the files (code) to pick which version to keep.
git adds special characters (e.g., <<<<<<<) to the files to indicate where it encountered a conflict
(and thus where you need to make a decision about which code to keep), as shown in Figure 20.8.
Figure 20.8 A merge conflict as shown in Atom. You can select the version of the code you wish to
keep by clicking one of the Use me buttons, or edit the code in the file directly.
1. Use git status to see which files have merge conflicts. Note that multiple files may have
conflicts, and each file may have more than one conflict.
2. Choose which version of the code to keep. You do this by editing the files (e.g., in RStudio or
Atom). You can make these edits manually, though some IDEs (including Atom) provide
buttons that let you directly choose a version of the code to keep (e.g., the “Use me” button
in Figure 20.8).
Note that you can choose to keep the “original” HEAD version from the current branch, the
“incoming” version from the other branch, or some combination thereof. Alternatively, you
can replace the conflicting code with something new entirely! Think about what you want
the “correct” version of the final code to be, and make it so. Remember to remove the
<<<<<<< and ======= and >>>>>>> characters; these are not legal code in any language.
Tip: When resolving a merge conflict, pretend that a cat walked across your keyboard
and added a bunch of extra junk to your code. Your task is to fix your work and restore
it to a clean, working state. Be sure to test your code to confirm that it continues to work
after making these changes!
3. Once you are confident that the conflicts are all resolved and everything works as it should,
follow the instructions shown by git status to add and commit the change you made to
the code to resolve the conflict:
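# Stage the files you edited to resolve the conflict
git add .

# Commit to complete the merge (an example commit message)
git commit -m "Resolve merge conflict"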
This will complete the merge! Use git status to check that everything is clean again.
Tip: If you want to “cancel” a merge with a conflict (e.g., you initiated a merge, but you don’t
want to go through with it because of various conflicts), you can cancel the merge process
with the git merge --abort command.
Remember: Merge conflicts are expected. You didn’t do something wrong if one occurs! Don’t
worry about getting merge conflicts or try to avoid them: just resolve the conflict, fix the
“bug” that has appeared, and move on with your life.
Merging also comes into play when you work with GitHub, in two situations:
1. You will not be able to push to GitHub if merging your commits into GitHub's repo might
cause a merge conflict. git will instead report an error, telling you that you need to pull
changes first and make sure that your version is up to date. “Up to date” in this case means
that you have downloaded and merged all the commits on your local machine, so there is no
chance of divergent changes causing a merge conflict when your commits are applied by the push.
2. Whenever you pull changes from GitHub, there may be a merge conflict. These are resolved
in the exact same way as when merging local branches; that is, you need to edit the files to
resolve the conflict, then add and commit the updated versions.
Thus, when working with GitHub (and especially with multiple people), you will need to perform
the following steps to upload your changes:
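# A typical work session (one possible sequence; the commit message is an example):

# 1. Pull in the latest changes from GitHub before you start working
git pull

# 2. Do your work, then add and commit it
git add .
git commit -m "Describe the work you did"

# 3. Pull again to merge in anything pushed in the interim
git pull

# 4. Push your commits up to GitHub
git push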
Of course, because GitHub repositories are repos just like the ones on your local machine, they can
have branches as well. You gain access to any remote branches when you clone a repo; you can see a
list of them with git branch -a (using the “all” option).
If you create a new branch on your local machine, it is possible to push that branch to GitHub,
creating a mirroring branch on the remote repo (which usually has the alias name origin). You do
this by specifying the branch in the git push command:
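git push origin BRANCH_NAME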
where BRANCH_NAME is the name of the branch you are currently on (and thus want to push to
GitHub). For example, you could push the experiment branch to GitHub with the following
command:
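git push origin experiment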
You often want to create an association between your local branch and the remote one on GitHub.
You can establish this relationship by including the -u option in your push:
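# The `-u` option sets up the local branch to track the remote one
git push -u origin BRANCH_NAME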
This causes your local branch to “track” the one on GitHub. Then when you run a command such
as git status, it will tell you whether one repo has more commits than the other. Tracking will be
remembered once set up, so you only need to use the -u option once. It is best to do this the first
time you push a local branch to GitHub.
The goal when organizing projects into feature branches is that the master branch should always
contain “production-level” code: valid, completely working code that you could deploy or publish
(read: give to your boss or teacher) at a moment's notice. All feature branches branch off of master, and are
allowed to contain temporary or even broken code (since they are still in development). This way
there is always a “working” (if incomplete) copy of the code (master), and development can be kept
isolated and considered independent of the whole. Note that this organization is similar to how the
earlier example uses an experiment branch.
For example, you might use feature branches to develop a project as follows:
1. You decide to add a new feature to the project: a snazzy visualization. You create a new
feature branch off of master to isolate this work:
# Make sure you are on the `master` branch
git checkout master
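# Create and switch to a new branch called `new-chart` for the visualization
git checkout -b new-chart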
2. You then do your coding work while on this branch. Once you have completed some work,
you would make a commit to add that progress:
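# Add and commit your progress on the `new-chart` branch (example message)
git add .
git commit -m "Add progress on the new visualization"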
3. Unfortunately, you may then realize that there is a bug in the master branch. To address this
issue, you would switch back to the master branch, then create a new branch to fix the bug:
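# Switch back to the `master` branch
git checkout master

# Create and switch to a new branch called `bug-fix`
git checkout -b bug-fix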
(You would fix a bug on a separate branch if it were complex or involved multiple commits, so that you can work on the fix separately from your regular work.)
4. After fixing the bug on the bug-fix branch, you would add and commit those changes, then
checkout the master branch to merge the fix back into master:
# Add and commit changes that fix the bug (on the `bug-fix` branch)
git add .
git commit -m "Fix the bug"
# Merge the changes from `bug-fix` into the current (`master`) branch
git merge bug-fix
5. Now that you have fixed the bug (and merged the changes into master), you can get back to
developing the visualization (on the new-chart branch). When it is complete, you will add
and commit those changes, then checkout the master branch to merge the visualization
code back into master:
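# Switch back to the `new-chart` branch to continue the visualization work
git checkout new-chart

# ...do the remaining work, then add and commit it (example message)...
git add .
git commit -m "Complete the new visualization"

# Switch to the `master` branch and merge in the `new-chart` changes
git checkout master
git merge new-chart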
The use of feature branches helps isolate progress on different elements of a project, reducing the
need for repeated merging (and the resultant conflicts) of half-finished features and creating an
organized project history. Note that feature branches can be used as part of either the centralized
workflow (see Section 20.3) or the forking workflow (see Section 20.4).
To make sure everyone is able to push to the repository, whoever creates the repo will need to add
the other team members as collaborators.2 They can do this under the “Settings” tab of the repo’s
web portal page, as shown in Figure 20.9. (The creator will want to give all team members “write”
access so they can push changes to the repo.)
1 Atlassian: Centralized Workflow: https://www.atlassian.com/git/tutorials/comparing-workflows#centralized-workflow
2 GitHub: Inviting collaborators to a personal repository: https://help.github.com/articles/inviting-collaborators-to-a-personal-repository/
Figure 20.9 Adding a collaborator to a GitHub repository via the web portal.
Once everyone has been added to the GitHub repository, each team member will need to clone the
repository to their local machines to work on the code individually, as shown in Figure 20.10.
Collaborators can then push any changes they make to the central repository, and pull any
changes made by others.
When you are contributing to the same repository along with multiple other people, it’s important
to ensure that you are working on the most up-to-date version of the code. This means that you will
regularly have to pull changes from GitHub that your team members may have committed. As a
result, developing code with the centralized workflow follows these steps:
1. To begin your work session, pull in the latest changes from GitHub. For example:
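# Download and merge the latest commits from GitHub
git pull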
2. Do your work, making changes to the code. Remember to add and commit your work each
time you make notable progress!
3. Once you are satisfied with your changes and want to share them with your team, you’ll
need to upload the changes back to GitHub. But note that if someone pushes a commit to
GitHub before you push your own changes, you will need to integrate those changes into your
code (and test them!) before doing your own push up to GitHub. Thus you’ll want to first
pull down any changes that have been made in the interim (there may not be any) so that
you are up to date and ready to push:
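# Pull in (and merge) any commits pushed since you started working
git pull

# If the pull produced a merge conflict: edit the files to resolve it, then
git add .
git commit --no-edit

# Upload your commits to GitHub
git push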
Remember that when you pull in changes, git is really merging the remote branch with
your local one, which may result in a merge conflict you need to resolve; be sure to fix the
conflict and then mark it as resolved. (The --no-edit argument used with git commit tells
git to use the default commit message, instead of specifying your own with the -m option.)
While this strategy of working on a single master branch may suffice for small teams and projects,
you can spend less time merging commits from different team members if your team instead uses a
dedicated feature branch for each feature they work on.
Remember: In the feature branch workflow, each branch is for a different feature, not a different developer! This means that a developer can work on multiple different features, and a
feature can be worked on by multiple developers.
As an example of this workflow, consider the collaboration on a feature occurring between two
developers, Ada and Bebe:
1. Ada decides to add a new feature to the code, a snazzy visualization. She creates a new feature
branch off of master:
# Double-check that the current branch is the `master` branch
git checkout master
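# Create and switch to a new branch called `new-chart`
git checkout -b new-chart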
2. Ada does some work on this feature, and then commits that work when she’s satisfied with it:
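# Add and commit the feature work (example message)
git add .
git commit -m "Add progress on the visualization feature"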
3. Happy with her work, Ada decides to take a break. She pushes her feature branch to GitHub
to back it up (and so her team can also contribute to it):
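# Push the feature branch to GitHub, setting it up to track the remote branch
git push -u origin new-chart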
4. After talking to Ada, Bebe decides to help finish up the feature. She checks out the feature
branch and makes some changes, then pushes them back to GitHub:
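# Download new commits and branches from GitHub, then switch to the feature branch
git fetch origin
git checkout new-chart

# ...Bebe does some work, then adds and commits it (example message)...
git add .
git commit -m "Extend the visualization feature"

# Push the new commits back up to GitHub
git push origin new-chart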
The git fetch command will “download” commits and branches from GitHub (but
without merging them); it is used to get access to branches that were created after the repo
was originally cloned. Note that git pull is actually a shortcut for a git fetch followed by
a git merge!
6. Ada decides the feature is finished, and merges it back into master. But first, she makes sure
she has the latest version of the master code:
# Switch to the `master` branch, and download any changes
git checkout master
git pull
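# Merge the completed feature into `master`, then upload the result to GitHub
git merge new-chart
git push origin master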
7. Now that the feature has been successfully added to the project, Ada can delete the feature
branch (using git branch -d new-chart). She can delete GitHub’s version of the branch
through the web portal interface (recommended), or by using git push origin -d
new-chart.
This kind of workflow is very common and effective for supporting collaboration. Moreover, as
projects grow in size, you may need to start being more organized about how and when you create
feature branches. For example, the Git Flow3 model organizes feature branches around product
releases, and is a popular starting point for large collaborative projects.
In the forking workflow, each person contributes code to their own personal copy (fork) of the repository. The
changes between these different repos are merged together through a GitHub process called a pull
3 Git Flow: A successful Git branching model: http://nvie.com/posts/a-successful-git-branching-model/
request.4 A pull request (colloquially called a “PR”) is a request for the changes in one version of the
code (i.e., a fork or branch) to be pulled (merged) into another. With pull requests, one developer
can send a request to another developer, essentially saying “I forked your repository and made some
changes: can you integrate my changes into your repo?” The second developer can perform a code
review: reviewing the proposed changes and making comments or asking for corrections to
anything that appears problematic. Once the changes are made (committed and pushed to the
“source” branch on GitHub), the pull request can be accepted and the changes merged into the
“target” branch. Because pull requests can be applied across (forked) repositories that share history,
a developer can fork an existing professional project, make changes to that fork, and then send a
pull request back to the original developer asking that developer to merge in changes.
Caution: You should only use pull requests to integrate changes on remote branches (i.e., two
different forks of a repo). To integrate commits from different branches of the same repository, you should merge changes on your local machine (not using GitHub's pull request
feature).
To issue a pull request, you will need to make changes to your fork of a repository and push those to
GitHub. For example, you could walk through the following steps:
4 GitHub: About pull requests: https://help.github.com/articles/about-pull-requests/
1. Fork a repository to create your own version on GitHub. For example, you could fork the
repository for the dplyr package5 if you wanted to make additions to it, or fix a bug that
you’ve identified.
You will need to clone your fork of the repository to your own machine. Be careful that you
clone the correct repo (look at the username for the repo in the GitHub web portal—where it
says YOUR_USER_NAME in Figure 20.12).
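For example, if you had forked the dplyr repository, the clone command might look like the following (with YOUR_USER_NAME replaced by your own GitHub username):

# Clone your fork of the repo (not the original) to your machine
git clone https://github.com/YOUR_USER_NAME/dplyr.git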
2. After you’ve cloned your fork of the repository to your own machine, you can make any
changes desired. When you’re finished, add and commit those changes, then push them up
to GitHub.
You can use feature branches to make these changes, including pushing the feature branches
to GitHub as described earlier.
3. Once you’ve pushed your changes, navigate to the web portal page for your fork of the
repository on GitHub and click the “New Pull Request” button as shown in Figure 20.12.
4. On the next page, you will need to specify which branches of the repositories you wish to
merge. The base branch is the one you want to merge into (often the master branch of the
original repository), and the head branch (labeled “compare”) is the branch with the new
changes you want to be merged in (often the master branch of your fork of the repository),
as shown in Figure 20.13.
5. After clicking the “Create Pull Request” button (in Figure 20.13), you will write a title and a
description for your pull request (as shown in Figure 20.14). After describing your proposed
changes, click the “Create pull request” button to issue the pull request.
Figure 20.12 Create a new pull request by clicking the “New Pull Request” button on your fork of a
repository.
5 dplyr Package GitHub Repository: https://github.com/tidyverse/dplyr
Figure 20.13 Compare changes between the two forks of a repository on GitHub before issuing a
pull request.
Figure 20.14 Write a title and description for your pull request, then issue the request by clicking
the “Create Pull Request” button.
Remember: The pull request is a request to merge two branches, not to merge a specific set
of commits. This means that you can push more commits to the head (“merge-from”) branch,
and they will automatically be included in the pull request—the request is always up to date
with whatever commits are on the (remote) branch.
If the code reviewer requests changes, you make those changes to your local repo and just push the changes as normal. They will be integrated into the existing pull request automatically without you needing to issue a new request!
You can view all pull requests (including those that have been accepted) through the “Pull Requests”
tab at the top of the repository’s web portal. This view will allow you to see comments that have
been left by the reviewer.
If someone (e.g., another developer on your team) sends you a pull request, you can accept that pull
request6 through GitHub’s web portal. If the branches can be merged without a conflict, you can do
this simply by clicking the “Merge pull request” button. However, if GitHub detects that a conflict
may occur, you will need to pull down the branches and merge them locally.7
Note that when you merge a pull request via the GitHub website, the merge is done in the repository
on GitHub’s servers. Your copy of the repository on your local machine will not yet have those
changes, so you will need to use git pull to download the updates to the appropriate branch.
In the end, the ability to effectively collaborate with others on programming and data projects is
one of the biggest benefits of using git and GitHub. While such collaboration may involve some
coordination and additional commands, the techniques described in this chapter will enable you
to work with others—both within your team and throughout the open source community—on
larger and more important projects.
Tip: Branches and collaboration are among the most confusing parts of git, so there is no
shortage of resources that aim to help clarify this interaction. Git and GitHub in Plain Englisha
is an example tutorial focused on collaboration with branches, while Learn Git Branching b is
an interactive tutorial focused on branching itself. Additional interactive visualizations of
branching with git can be found here.c
a https://red-badger.com/blog/2016/11/29/gitgithub-in-plain-english
b http://learngitbranching.js.org
c https://onlywei.github.io/explain-git-with-d3/
For practice working with collaborative version control methods, see the set of accompanying book
exercises.8
6 GitHub: Merging a pull request: https://help.github.com/articles/merging-a-pull-request/
7 GitHub: Checking out pull requests locally: https://help.github.com/articles/checking-out-pull-requests-locally
8 git collaboration exercises: https://github.com/programming-for-data-science/chapter-20-exercises
21 Moving Forward
In this text, you have learned the foundational programming skills necessary for entering the data
science field. The ability to write code to work with data empowers you to explore and communicate
information in transparent, reusable, and collaborative ways. As many data scientists will attest,
the most time-consuming part of a project is organizing and exploring the data—something that
you are now more than capable of doing. These skills on their own are quite valuable for gaining
insight from quantitative information, but there is always more to learn. If you are eager to expand
your skills, there are a few areas that serve as obvious next steps in data science.
- R for Everyone1 introduces statistical modeling and evaluation in R, including linear and
non-linear methods.
1 Lander, J. P. (2017). R for everyone: Advanced analytics and graphics (2nd ed.). Boston, MA: Addison-Wesley.
- OpenIntro Stats3 is an open source4 set of texts that focus on the basics of probability and
statistics.
- Python is another popular language for doing data science. Like R, it is open source, and has
a large community of people contributing to its statistical, machine learning, and
visualization packages. Because R and Python largely enable you to solve the same problems
in data science, the motivations to learn Python would include collaboration (to work with
people who only use Python), curiosity (about how a similar language solves the same
problems), and analysis (if a specific sophisticated analysis is only available in a Python
package). A great book for learning to program for data science in Python is the Python Data
Science Handbook.6
2 James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). Springer. http://www-bcf.usc.edu/~gareth/ISL/
3 Diez, D. M., Barr, C. D., & Cetinkaya-Rundel, M. (2012). OpenIntro statistics. CreateSpace. https://www.openintro.org/stat/textbook.php
4 OpenIntro Statistics: https://www.openintro.org
5 A Visual Introduction to Machine Learning: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
6 VanderPlas, J. (2016). Python data science handbook: Essential tools for working with data. O'Reilly Media, Inc.
- Web development skills (e.g., HTML, CSS, and JavaScript) let you build custom interactive websites that go beyond the limitations of the Shiny framework. Building interactive websites from scratch requires a
notable time investment, but it gives you complete control over the style and behavior of
your webpages. If you are seriously interested in building custom visualizations, look into
using the d3.js7 JavaScript library, which you can also read about in Visual storytelling
with D3.8
Such consequences of unchecked assumptions in data science, like the racial bias that ProPublica documented in automated criminal risk scores,9 can be difficult to detect and have
outsized effects on people, so tread carefully as you move forward with your newly acquired skills.
Remember: you are responsible for the impact of the programs that you write. The analytical and
programming skills covered in this text empower you to identify and communicate about the
injustices in the world. As a data scientist, you have a moral responsibility to do no harm with your
skills (or better yet, to work to undo harms that have occurred in the past and are occurring today).
As you begin to work in data science, you must always consider how people will be differentially
impacted by your work. Think carefully about who is represented in—and excluded from—your
data, what assumptions are built into your analysis, and how any decisions made using your data
could differentially benefit different communities—particularly those communities that are often
overlooked.
Thank you for reading our book! We hope that it provided inspiration and guidance in your pursuit
of data science, and that you use these skills for good.
7 d3.js: https://d3js.org
8 King, R. S. (2014). Visual storytelling with D3: An introduction to data visualization in JavaScript. Addison-Wesley.
9 Angwin, J. L. (2016, May 23). Machine bias. ProPublica. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
10 Machine Bias Analysis, Complete Code: https://github.com/propublica/compas-analysis
Index
Symbols
, (comma)
data frame syntax, 122
function syntax, 69
key-value pair syntax, 191
" (double quotes), character data syntax, 61
’ (single quotes), character data syntax, 61
.. (double dot), moving up directory, 14
. (single dot), referencing current folder, 14
| (pipe)
directing output, 20
pipe table, 48
! (exclamation point), Markdown image syntax, 47
# (pound/hashtag symbol)
comment syntax, 10, 58
$ (dollar notation)
accessing data frames, 122
accessing list elements, 97–98
%>% (pipe operator), dplyr package, 141–142
() (parentheses)
function syntax, 70
Markdown hyperlink syntax, 46
* (asterisk wildcard)
loading entire table from database, 173
using wildcards with files, 17–18
? (question mark), query parameter syntax, 184
[] (single-bracket notation)
accessing data frames, 122–123
comparing single- and double-bracket notation, 101
Markdown hyperlink syntax, 46
retrieving value from vector, 88
[[]] (double-bracket notation)
list syntax, 98–99, 101
selecting data of interest for application, 312
{} (braces)
Anonymous variables, 71, 140 Bash shell. See also Git Bash
anscombe data set, in R, 208 commands, 13
example finding Cuban food in Seattle, 196–197 Bins, breaking data into different variables, 142
registering with web services, 186–188 BitBucket, comparing with GitHub, 29
APIs (application programming interfaces). Blockquotes, markdown options, 48
See also Web APIs Blocks, markdown formatting syntax, 47
defined, 181 Body, function parts, 76–77
in plotly package, 258
D R language, 60–63
selecting visual layouts, 209–210
d3.js JavaScript library, 343
vectorized functions and, 87
Data
vectorized operations and, 83
acquiring domain knowledge, 112–113
Data visualization
analyzing. See Data analysis
aesthetics, 229–230
answering questions, 116–118
choosing effective colors, 222–226
dplyr example analyzing flight data, 148–153
choosing effective graphical encodings, 220–222
dplyr grammar for manipulating, 131–132
expressive displays, 227–229
encoding, 220–222, 229, 237
ggplot2. See ggplot2 package
finding, 108–109
of hierarchical data, 217–220
flattening JSON data, 196–197
leveraging preattentive attributes, 226–227
generating, 107–108
with multiple variables, 213–217
interactive presentation, 293
overview of, 205–207
interpreting, 112
purpose of, 207–209
measuring, 110–111
reusable functions, 70
overview of, 107
selecting visual layouts, 209–210
ratio data, 111
with single variable, 210–213
reusable functions in managing, 70
tidyr package. See tidyr package
schemas, 113–116
Data visualization, interactive
structures, 111–112, 122
example exploring changes to Seattle, 266–272
transforming into information, 341
leaflet package, 263–266
understanding data schemas, 113–116
overview of, 257–258
visualization of. See Data visualization
plotly package, 258–261
working with CSV data, 124–125
rbokeh package, 261–263
wrangling, 106
Databases
Data analysis
accessing from R, 175–179
generating data, 108
designing relational, 144
reusable functions, 70
overview of relational, 167–169
tidyr package. See tidyr package
setting up relational, 169–171
Data frames
SQL statements, 171–175
accessing, 122–123
DataCamp, resources for learning R, 66
analyzing by group, 142–144
dbConnect(), accessing SQLite, 176–177
creating, 120–121
dbListTables(), listing database tables, 177
describing structure of, 121–122
dbplyr package, 176–179
factor variables, 126–129
dbplyr package, accessing databases, 174
joining, 144–148
Debugging functions, 78. See also Error handling
overview of, 119–120
Directories
viewing working directory, 125–126
accessing command line and, 10
working with CSV data, 124–125
changing from command line, 12–13
data() function, viewing available data sets, 124–125
printing working directory, 11
Data-ink ratio, aesthetics of graphics, 229
Data schemas, 113–116
Forking workflow
feature branches in, 331, 333–335
G
gather()
working with, 335–339
applying to educational statistics, 161–163
Formats
combining with spread(), 159
table, 157
tidyr function for changing columns to rows, 157–158
text, 46
geom_ functions
Formulas, 245
adding titles and labels to charts, 247–248
Frameworks
aesthetic mappings and, 237–238
defined, 293
creating choropleth maps, 249–250
Shiny framework. See Shiny framework
creating dot distribution maps, 252
fromJSON(), converting JSON string to list, 193–194, 200
example mapping evictions in San Francisco, 253–256
full_join(), 148
rendering plots, 284
function keyword, 76
specifying geometric objects, 234
Functions
specifying geometries, 235–237
for aesthetic mappings (aes()), 237–238
statistical transformation of data, 237
applying to lists, 102–103
Geometries
built-in, 71–72
ggplot2 layers, 232
c() function, 81–82
position adjustments, 238–240
conditional statements, 79–80
specifying geometric objects, 234–235
converting dplyr functions into SQL statements, 178
specifying with ggplot2 package, 235–237
coord_ functions, 243–244
GET
correlation function (cor()), 161
example finding Cuban food in Seattle, 197–198, 202
creating lists, 96
HTTP verbs, 188–189
debugging, 78. See also Error handling
sending GET requests, 189–190
developing application servers, 307–309
getwd(), viewing working directory, 125
geometry. See geom_ functions
ggmap package
inspecting data frames, 121–122
example finding Cuban food in Seattle, 200–203
loading, 73–75
example mapping evictions in San Francisco, 253
named arguments, 72–73
map tiles, 252
nested statements within, 140–141
ggplot()
overview of, 69–70
creating plots, 232, 234
referencing database table, 177
example mapping evictions in San Francisco, 256
in Shiny layouts, 305
ggplot2 package
syntax, 70–71
tidyr functions for changing columns to/from rows,
aesthetic mappings, 237–238
157–159 basic plotting, 232–235
vectorized, 86–88 choropleth maps, 248–251
viewing available data sets (data()), 124–125 coordinate systems, 243–244
writing, 75–77 dot distribution maps, 252
Functions, dplyr example finding Cuban food in Seattle, 200
arrange(), 137–138 example mapping evictions in San Francisco, 252–256
core functions, 131–132 facets, 244–245
filter(), 135–136 Grammar of Graphics, 231–232
group_by(), 142–144 labels and annotations, 246–248
left_join(), 145–147 map types, 248
mutate(), 136–137 position adjustments, 238–240
overview of, 132 rendering plots, 284
select(), 133–134 specifying geometries, 235–237
summarize(), 138–139
summarizing information using, 313
static plot of iris data set, 257–258 Google Docs, version control systems compared with, 28
statistical transformation of data, 255 Google, getting help via, 63
styling with scales, 240–242 Google Sheets, working with CSV data, 124
tidyr example, 160–161 Government publications, sources of data, 108
ggplotly(), 259 Grammar of Data Manipulation (Wickham), 131
ggrepel package, preventing labels from overlapping, Grammar of Graphics, 231–232
247–248 Graphics. See also by individual types of graphs; Data
git visualization
accessing project history, 40–42 aesthetics, 229–230
adding files, 32–33 choosing effective graphical encodings, 220–222
branching model. See Branches expressive displays, 227–229
checking repository status, 31–33 with ggplot2. See ggplot2 package
committing changes, 33–35 Grammar of Graphics, 231–232
core concepts, 27–28 leveraging preattentive attributes, 226–227
creating repository, 30–31 selecting visual layouts, 209–210
ignoring files, 42–44 visualizing hierarchical data, 217–220
installing, 5 group_by()
leveraging using GitHub, 6 analyzing data frames by group, 142–144
local git process, 35 facets and, 244
managing code with, 3–4 statistical transformation of data, 255
overview of, 27–28 summarizing information using, 313
project setup and configuration, 30 GROUP_BY clause, SQL SELECT, 174
tracking changes, 32
tutorials, 43–44 H
version control, 4
Heatmaps. See also Choropleth maps
Git Bash. See also Bash shell
data visualization with multiple variables, 215, 217
accessing command line, 9–10
example mapping evictions in San Francisco, 256
commands used by, 13
Help
executing code using Bash shell, 4–5
R language, 63–64
ls command, 13
RStudio, 55
tab-completion support, 15
Hidden files, 42–44
Git Flow model, 335
Hierarchical data, visualization of, 217–220
GitHub
Histograms
accessing project history, 40–42
data visualization with multiple variables, 216
creating centralized repository, 331–333
expressive displays, 229
creating GitHub account, 6
visualizing data with single variable, 210
forking/cloning repos on GitHub, 36–38
Hosts, Shiny apps, 309–310
ignoring files, 42–44
HSL Calculator, 223
managing code with, 3
HSL (hue-saturation-lightness) color model, 222–223
overview of, 29
HTML (Hypertext Markup Language)
pushing/pulling repos on GitHub, 38–40
HTML Tags Glossary, 300–301
README file, 48–49
markup languages, 45
sharing reports as website, 285–286
sharing reports as website, 284–286
storing projects on, 36
web development language, 342
tutorials, 43–44
HTTP (HyperText Transfer Protocol)
.gitignore, ignoring files, 42–44
header, 196–197
GitLab, comparing with GitHub, 29
dplyr core functions, 131, 136–137 Orientation, tidyr data tables, 157
example finding Cuban food in Seattle, 202 Out-of-bounds indices, vector indices, 89
example report on life expectancy, 289–290 OUTER JOIN clause, SQL SELECT, 174
Mutating joins, 148 Outliers, visualizing data with single variable, 210
MySQL, 171 Output
directing/redirecting, 20
N dynamic, 303–304
functions and, 69
NA value
reactive, 295
compared with NULL, 100
Shiny framework, 293–294
logical values and, 89
modifying vectors and, 92
Named arguments, R functions, 72–73 P
Named lists, creating data frames, 120 Packages
names() function, creating lists and, 96 Bokeh, 261
Nested statements, within other functions, 140–141 ggmap. See ggmap package
Nested structures, visualizing hierarchical data, 217–220 ggplot2. See ggplot2 package
Staging area, adding files, 33. See also add (git) referencing database table, 177
Statistics ls command, 13
summarize(), dplyr core functions, 131, 138–139 The tidyverse style guide
Sunburst diagrams, 218, 220 defining variables, 58
example finding Cuban food in Seattle, 202 forking/cloning repos and, 36–38
Wildcards, command line, 17–18 working with feature branch workflows, 333–335
Windows, icons, menus, and pointers working with forking workflows, 335–339
(WIMP), 9
Windows Management Framework, 5 X
Windows OSs
Xcode command line developer tools, 5
accessing command line, 9–10
command-line tools, 4–5
installing git, 5 Z
Windows, types of interfaces, 9 Zooming, interactive data visualization, 257
Credits
Cover: Garry Killian/Shutterstock
Chapter 2, Figures 2.1a, 2.2, 2.4, 2.5, 2.6, 2.7: Screenshot of Mac © 2018 Apple Inc.
Chapter 2, Figure 2.1b: Screenshot of Git Bash © Software Freedom Conservancy, Inc.
Chapter 3: “A version control system … reversibility, concurrency, and annotation”. Raymond, E. S. (2009).
Understanding version-control systems. http://www.catb.org/esr/writings/versioncontrol/version-control.html
Chapter 3, Figures 3.1, 3.2, 3.3, 3.8: Screenshot of Mac © 2018 Apple Inc.
Chapter 3: “If you forget the -m option, git … happens to everyone”. Stack Overflow: Helping One Million
Developers Exit Vim, David Robinson, Stack Exchange Inc.
Chapter 3, Figures 3.5, 3.6: Screenshot of GitHub’s web portal © 2018 GitHub Inc.
Chapter 4, Figures 4.1, 4.2: Screenshot of Markdown © 2002–2018 The Daring Fireball Company LLC
Chapter 5, Figure 5.3b: Screenshot of Git Bash © Software Freedom Conservancy, Inc.
Chapter 9, Table 9.1: Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684),
677–680. https://doi.org/10.1126/science.103.2684.677
Chapter 10, Figures 10.1, 10.2, 10.3: Screenshot of Rstudio © 2018 Rstudio
Chapter 11, Figures 11.1, 11.3, 11.4, 11.5, 11.6, 11.8, 11.9, 11.10, 11.11, 11.13, 11.14: Screenshot of Rstudio © 2018 Rstudio
Chapter 12, Figures 12.1, 12.2, 12.3, 12.5: Screenshot of Rstudio © 2018 Rstudio
Chapter 12, Figures 12.4, 12.6, 12.7: Screenshot of ggplot2 © Hadley Wickham
Chapter 13, Figures 13.3, 13.4, 13.5: Screenshot of SQLite Browser © DB Browser
Chapter 14, Figures 14.3, 14.5, 14.6, 14.8, 14.9, 14.11: Screenshot of Rstudio © GitHub Inc.
Chapter 14, Figure 14.10: Screenshot of Yelp Fusion © 2004–2018 Yelp Inc.
Chapter 14, Figure 14.12: Google Inc.
Chapter 15: “The purpose of visualization is insight, not pictures”. Card, S. K., Mackinlay, J. D., & Shneiderman, B.
(1999). Readings in information visualization: using vision to think. Morgan Kaufmann.
Chapter 15, Figures 15.11, 15.12, 15.13: Screenshot of Rstudio © 2018 Rstudio
Chapter 15, Figures 15.15, 15.16: Screenshot of d3.js © 2017 Mike Bostock
Chapter 15, Figure 15.18: Screenshot of HSL Calculator © 1999-2018 by Refsnes Data
Chapter 15: “[perceptual] tasks that can be performed on large multi-element displays in less than 200 to 250
milliseconds”. Healey, C. G., & Enns, J. T. (2012). Attention and visual memory in visualization and computer
graphics. IEEE Transactions on Visualization and Computer Graphics, 18(7), 1170–1188.
https://doi.org/10.1109/TVCG.2011.127. Also at: https://www.csc2.ncsu.edu/
Chapter 15: “A set of facts is expressible in a visual language if the sentences (i.e. the visualizations) in
the language express all the facts in the set of data, and only the facts in the data”. Mackinlay, J. (1986).
Automating the Design of Graphical Presentations of Relational Information. ACM Trans. Graph., 5(2), 110–141.
https://doi.org/10.1145/22949.22950. Restatement by Jeffrey Heer.
Chapter 16: “the data being plotted … data shown in different plots”. Wickham, H. (2010). A Layered
Grammar of Graphics. Journal of Computational and Graphical Statistics, 19(1), 3–28.
https://doi.org/10.1198/jcgs.2009.07098. Also at http://vita.had.co.nz/papers/layered-grammar.pdf
Chapter 16, Figures 16.1, 16.2, 16.18: Screenshot of Rstudio © 2018 Rstudio
Chapter 17, Figures 17.3, 17.4: Screenshot of Plotly chart © 2018 Plotly
Chapter 17, Figure 17.6: Map of Seattle © OpenStreetMap contributors; map of Seattle © CARTO 2018
Chapter 17, Figures 17.10, 17.11: Map of permits © OpenStreetMap contributors ; Map of permits © CARTO 2018
Chapter 18, Figures 18.1, 18.2, 18.3, 18.4, 18.5, 18.8: Screenshot of R Markdown © 2018 Rstudio
Chapter 18: “Echo indicates whether you want … a code chunk like this.” Yihui Xie
Chapter 18, Figure 18.9: Life expectancy at birth, total (years) by The World Bank
Chapter 19, Figure 19.1: Screenshot of R Markdown © 2018 Rstudio; icons made by Freepik from
www.flaticon.com are licensed by CC 3.0 BY.
Chapter 19, Figures 19.2, 19.3, 19.4, 19.8: Screenshot of R Markdown © 2018 Rstudio
Chapter 19, Figures 19.5, 19.6, 19.7: Screenshot of Shiny applications © 2018 Rstudio
Chapter 19, Figures 19.9, 19.10: Map of police Shooting © OpenStreetMap contributors; map of police Shooting
© Stamen Design LLC
Chapter 20, Figures 20.1, 20.3: Screenshot of Mac © 2018 Apple Inc.
Chapter 20, Figures 20.5, 20.7: Screenshot of Git © Software Freedom Conservancy
Chapter 20, Figures 20.8, 20.9, 20.12, 20.13, 20.14: Screenshot of GitHub’s web portal © 2018 GitHub Inc.