Machine Learning
in Production
The Pearson Addison-Wesley
Data & Analytics Series

Visit informit.com/awdataseries for a complete list of available publications.

The Pearson Addison-Wesley Data & Analytics Series provides readers with
practical knowledge for solving problems and answering questions with data.
Titles in this series primarily focus on three areas:
1. Infrastructure: how to store, move, and manage data
2. Algorithms: how to mine intelligence or make predictions based on data
3. Visualizations: how to represent data and insights in a meaningful and
compelling way

The series aims to tie all three of these areas together to help the reader build
end-to-end systems for fighting spam; making recommendations; building
personalization; detecting trends, patterns, or problems; and gaining insight
from the data exhaust of systems and user interactions.

Make sure to connect with us!


informit.com/socialconnect
Machine Learning
in Production
Developing and Optimizing
Data Science Workflows and
Applications

Andrew Kelleher
Adam Kelleher

Boston • Columbus • New York • San Francisco • Amsterdam • Cape Town


Dubai • London • Madrid • Milan • Munich • Paris • Montreal • Toronto • Delhi • Mexico City
São Paulo • Sydney • Hong Kong • Seoul • Singapore • Taipei • Tokyo
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and the publisher was aware of a trademark
claim, the designations have been printed with initial capital letters or in all capitals.

The authors and publisher have taken care in the preparation of this book, but make no expressed or implied
warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for
incidental or consequential damages in connection with or arising out of the use of the information or
programs contained herein.

For information about buying this title in bulk quantities, or for special sales opportunities (which may
include electronic versions; custom cover designs; and content particular to your business, training goals,
marketing focus, or branding interests), please contact our corporate sales department
at [email protected] or (800) 382-3419.

For government sales inquiries, please contact [email protected].

For questions about sales outside the U.S., please contact [email protected].

Visit us on the Web: informit.com/aw

Library of Congress Control Number: 2018954331

Copyright © 2019 Pearson Education, Inc.

All rights reserved. This publication is protected by copyright, and permission must be obtained from the
publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by
any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding
permissions, request forms, and the appropriate contacts within the Pearson Education Global Rights &
Permissions Department, please visit www.pearsoned.com/permissions/.

ISBN-13: 978-0-13-411654-9
ISBN-10: 0-13-411654-2


This book is dedicated to our lifelong mentor, William F. Walsh III.


We could never thank you enough for all the years of support and
encouragement.

Contents
Foreword xv
Preface xvii
About the Authors xxi

I: Principles of Framing 1

1 The Role of the Data Scientist 3


1.1 Introduction 3
1.2 The Role of the Data Scientist 3
1.2.1 Company Size 3
1.2.2 Team Context 4
1.2.3 Ladders and Career Development 5
1.2.4 Importance 5
1.2.5 The Work Breakdown 6
1.3 Conclusion 6

2 Project Workflow 7
2.1 Introduction 7
2.2 The Data Team Context 7
2.2.1 Embedding vs. Pooling Resources 8
2.2.2 Research 8
2.2.3 Prototyping 9
2.2.4 A Combined Workflow 10
2.3 Agile Development and the Product Focus 10
2.3.1 The 12 Principles 11
2.4 Conclusion 15

3 Quantifying Error 17
3.1 Introduction 17
3.2 Quantifying Error in Measured Values 17
3.3 Sampling Error 19
3.4 Error Propagation 21
3.5 Conclusion 23

4 Data Encoding and Preprocessing 25


4.1 Introduction 25
4.2 Simple Text Preprocessing 26
4.2.1 Tokenization 26
4.2.2 N-grams 27
4.2.3 Sparsity 28
4.2.4 Feature Selection 28
4.2.5 Representation Learning 30
4.3 Information Loss 33
4.4 Conclusion 34

5 Hypothesis Testing 37
5.1 Introduction 37
5.2 What Is a Hypothesis? 37
5.3 Types of Errors 39
5.4 P-values and Confidence Intervals 40
5.5 Multiple Testing and “P-hacking” 41
5.6 An Example 42
5.7 Planning and Context 43
5.8 Conclusion 44

6 Data Visualization 45
6.1 Introduction 45
6.2 Distributions and Summary Statistics 45
6.2.1 Distributions and Histograms 46
6.2.2 Scatter Plots and Heat Maps 51
6.2.3 Box Plots and Error Bars 55
6.3 Time-Series Plots 58
6.3.1 Rolling Statistics 58
6.3.2 Auto-Correlation 60
6.4 Graph Visualization 61
6.4.1 Layout Algorithms 62
6.4.2 Time Complexity 64
6.5 Conclusion 64

II: Algorithms and Architectures 67

7 Introduction to Algorithms and Architectures 69


7.1 Introduction 69
7.2 Architectures 70
7.2.1 Services 71
7.2.2 Data Sources 72
7.2.3 Batch and Online Computing 72
7.2.4 Scaling 73
7.3 Models 74
7.3.1 Training 74
7.3.2 Prediction 75
7.3.3 Validation 76
7.4 Conclusion 77

8 Comparison 79
8.1 Introduction 79
8.2 Jaccard Distance 79
8.2.1 The Algorithm 80
8.2.2 Time Complexity 81
8.2.3 Memory Considerations 81
8.2.4 A Distributed Approach 81
8.3 MinHash 82
8.3.1 Assumptions 83
8.3.2 Time and Space Complexity 83
8.3.3 Tools 83
8.3.4 A Distributed Approach 83
8.4 Cosine Similarity 84
8.4.1 Complexity 85
8.4.2 Memory Considerations 85
8.4.3 A Distributed Approach 86
8.5 Mahalanobis Distance 86
8.5.1 Complexity 86
8.5.2 Memory Considerations 87
8.5.3 A Distributed Approach 87
8.6 Conclusion 88

9 Regression 89
9.1 Introduction 89
9.1.1 Choosing the Model 90
9.1.2 Choosing the Objective Function 90
9.1.3 Fitting 91
9.1.4 Validation 92
9.2 Linear Least Squares 96


9.2.1 Assumptions 97
9.2.2 Complexity 97
9.2.3 Memory Considerations 97
9.2.4 Tools 98
9.2.5 A Distributed Approach 98
9.2.6 A Worked Example 98
9.3 Nonlinear Regression with Linear Regression 105
9.3.1 Uncertainty 107
9.4 Random Forest 109
9.4.1 Decision Trees 109
9.4.2 Random Forests 112
9.5 Conclusion 115

10 Classification and Clustering 117


10.1 Introduction 117
10.2 Logistic Regression 118
10.2.1 Assumptions 121
10.2.2 Time Complexity 121
10.2.3 Memory Considerations 122
10.2.4 Tools 122
10.3 Bayesian Inference, Naive Bayes 122
10.3.1 Assumptions 124
10.3.2 Complexity 124
10.3.3 Memory Considerations 124
10.3.4 Tools 124
10.4 K-Means 125
10.4.1 Assumptions 127
10.4.2 Complexity 128
10.4.3 Memory Considerations 128
10.4.4 Tools 128
10.5 Leading Eigenvalue 128
10.5.1 Complexity 129
10.5.2 Memory Considerations 130
10.5.3 Tools 130
10.6 Greedy Louvain 130
10.6.1 Assumptions 130
10.6.2 Complexity 130
10.6.3 Memory Considerations 131


10.6.4 Tools 131
10.7 Nearest Neighbors 131
10.7.1 Assumptions 132
10.7.2 Complexity 132
10.7.3 Memory Considerations 133
10.7.4 Tools 133
10.8 Conclusion 133

11 Bayesian Networks 135


11.1 Introduction 135
11.2 Causal Graphs, Conditional Independence, and Markovity 136
11.2.1 Causal Graphs and Conditional Independence 136
11.2.2 Stability and Dependence 137
11.3 D-separation and the Markov Property 138
11.3.1 Markovity and Factorization 138
11.3.2 D-separation 139
11.4 Causal Graphs as Bayesian Networks 142
11.4.1 Linear Regression 142
11.5 Fitting Models 143
11.6 Conclusion 147

12 Dimensional Reduction and Latent Variable Models 149


12.1 Introduction 149
12.2 Priors 149
12.3 Factor Analysis 151
12.4 Principal Components Analysis 152
12.4.1 Complexity 154
12.4.2 Memory Considerations 154
12.4.3 Tools 154
12.5 Independent Component Analysis 154
12.5.1 Assumptions 158
12.5.2 Complexity 158
12.5.3 Memory Considerations 159
12.5.4 Tools 159
12.6 Latent Dirichlet Allocation 159
12.7 Conclusion 165

13 Causal Inference 167


13.1 Introduction 167
13.2 Experiments 168
13.3 Observation: An Example 171
13.4 Controlling to Block Non-causal Paths 177
13.4.1 The G-formula 179
13.5 Machine-Learning Estimators 182
13.5.1 The G-formula Revisited 182
13.5.2 An Example 183
13.6 Conclusion 187

14 Advanced Machine Learning 189


14.1 Introduction 189
14.2 Optimization 189
14.3 Neural Networks 191
14.3.1 Layers 192
14.3.2 Capacity 193
14.3.3 Overfitting 196
14.3.4 Batch Fitting 199
14.3.5 Loss Functions 200
14.4 Conclusion 201

III: Bottlenecks and Optimizations 203

15 Hardware Fundamentals 205


15.1 Introduction 205
15.2 Random Access Memory 205
15.2.1 Access 205
15.2.2 Volatility 206
15.3 Nonvolatile/Persistent Storage 206
15.3.1 Hard Disk Drives or “Spinning Disks” 207
15.3.2 SSDs 207
15.3.3 Latency 207
15.3.4 Paging 207
15.3.5 Thrashing 208
15.4 Throughput 208
15.4.1 Locality 208
15.4.2 Execution-Level Locality 208
15.4.3 Network Locality 209
15.5 Processors 209


15.5.1 Clock Rate 209
15.5.2 Cores 210
15.5.3 Threading 210
15.5.4 Branch Prediction 210
15.6 Conclusion 212

16 Software Fundamentals 213


16.1 Introduction 213
16.2 Paging 213
16.3 Indexing 214
16.4 Granularity 214
16.5 Robustness 216
16.6 Extract, Transfer/Transform, Load 216
16.7 Conclusion 216

17 Software Architecture 217


17.1 Introduction 217
17.2 Client-Server Architecture 217
17.3 N-tier/Service-Oriented Architecture 218
17.4 Microservices 220
17.5 Monolith 220
17.6 Practical Cases (Mix-and-Match Architectures) 221
17.7 Conclusion 221

18 The CAP Theorem 223


18.1 Introduction 223
18.2 Consistency/Concurrency 223
18.2.1 Conflict-Free Replicated Data Types 224
18.3 Availability 225
18.3.1 Redundancy 225
18.3.2 Front Ends and Load Balancers 225
18.3.3 Client-Side Load Balancing 228
18.3.4 Data Layer 228
18.3.5 Jobs and Taskworkers 230
18.3.6 Failover 230
18.4 Partition Tolerance 231


18.4.1 Split Brains 231
18.5 Conclusion 232

19 Logical Network Topological Nodes 233


19.1 Introduction 233
19.2 Network Diagrams 233
19.3 Load Balancing 234
19.4 Caches 235
19.4.1 Application-Level Caching 236
19.4.2 Cache Services 237
19.4.3 Write-Through Caches 238
19.5 Databases 238
19.5.1 Primary and Replica 238
19.5.2 Multimaster 239
19.5.3 A/B Replication 240
19.6 Queues 241
19.6.1 Task Scheduling and Parallelization 241
19.6.2 Asynchronous Process Execution 242
19.6.3 API Buffering 243
19.7 Conclusion 243

Bibliography 245

Index 247
Foreword
This pragmatic book introduces both machine learning and data science, bridging the gap between
data scientists and engineers and helping you bring these techniques into production. It is filled
with code examples in Python and with visualizations that illustrate the concepts behind the
algorithms. Validation, hypothesis testing, and visualization are introduced early on, as these are
all key to ensuring that your efforts in data science actually solve your problem. Part III of the
book is unique among data science and machine learning books because of its focus on real-world
concerns in optimization. Thinking about hardware, infrastructure, and distributed systems is a
necessary step in bringing machine learning and data science techniques into a production setting.

Andrew and Adam Kelleher bring their experience in engineering and data science, respectively,
from their work at BuzzFeed. The topics covered and where to provide breadth versus depth are
informed by their real-world experience solving problems in a large production environment.
Algorithms for comparison, classification, clustering, and dimensionality reduction are all
presented with examples of specific problems that can be solved with each. Explorations into more
advanced topics like Bayesian networks or deep learning are provided after the framework for basic
machine learning tasks is laid.

This book is a great addition to the Data & Analytics Series. It provides a well-grounded
introduction to data science and machine learning with a focus on problem-solving. It should
serve as a great resource to any engineer or “accidental programmer” with a more traditional math
or science background looking to apply machine learning to their production applications and
environment.
—Paul Dix, series editor
Preface
Most of this book was written while Andrew and Adam were working together at BuzzFeed. Adam
was a data scientist, Andrew was an engineer, and they spent a good deal of time working together
on the same team! Given that they're identical twins (from a set of triplets), it was confusing and amusing for
everyone involved.

The idea for this book came after PyGotham in New York City in August 2014. There were several
talks relating to the relatively broadly defined field of “data science.” What we noticed was that
many data scientists start their careers driven by the curiosity and excitement of learning new
things. They discover new tools and often have a favorite technique or algorithm. They’ll apply that
tool to the problem they’re working on. When you have a hammer, every problem looks like a nail.
Often, as with neural networks (discussed in Chapter 14), it’s more like a pile driver. We wanted to
push past the hype of data science by giving data scientists, especially at the time they’re starting
their careers, a whole tool box. One could argue the context and error analysis tools of Part I are
actually more important than the advanced techniques discussed in Part II. In fact, they're a major
motivator in writing this book. It’s very unlikely a choice of algorithm will be successful if its signal
is trumped by its noise, or if there is a high amount of systematic error. We hope this book provides
the right tools to take on the projects our readers encounter, and to be successful in their careers.

There’s no lack of texts in machine learning or computer science. There are even some decent texts
in the field of data science. What we hope to offer with this book is a comprehensive and rigorous
entry point to the field of data science. This tool box is slim and driven by our own experience of
what is useful in practice. We try to avoid opening up paths that lead to research-level problems. If
you’re solving research-level problems as a junior data scientist, you’ve probably gone out of scope.

There’s a critical side of data science that is separate from machine learning: engineering. In Part III
of this text we get into the engineering side. We discuss the problems you’re likely to encounter and
give you the fundamentals you’ll need to overcome them. Part III is essentially a Computer Science
201-202 crash course. Once you know what you’re building, you still have to address many
considerations on the path to production. This means understanding your toolbox from the
perspective of the tools.

Who This Book Is For


For the last several years there has been a serious demand for good engineers. During the Interactive
session of SXSW in 2008 we heard the phrase “accidental developer” coined for the first time. It was
used to describe people playing the role of engineer without having had formal training. They
simply happened into that position and began filling it out of necessity. More than a decade later
we still see this demand for developers, but it’s also begun to extend to data scientists. Who fills the
role of the “accidental data scientist”? Well, it’s usually developers. Or physics undergraduates. Or
math majors. People who haven’t had much if any formal training in all the disciplines required of
a data scientist. People who don’t lack for technical training, and have all the prerequisite curiosity
and ambition to succeed. People in need of a tool box.

This book is intended to be a crash course for those people. We run through a basic procedure for
taking on most data science tasks, encouraging data scientists to use their data set, rather than the
tools of the day, as the starting point. Data-driven data science is key to success. The big open secret
of data science is that while modeling is important, the bread and butter of data science is
simple queries, aggregations, and visualizations. Many industries are in a place where they’re
accumulating and seeing data for the very first time. There is value to be delivered quickly
and with minimal complexity.

Modeling is important, but hard. We believe in applying the principles of agile development to
data science. We talk about this a lot in Chapter 2. Start with a minimal solution: a simple heuristic
based on a data aggregation, for example. Improve the heuristic with a simple model when your
data pipeline is mature and stable. Improve the model when you don’t have anything more
important to do with your time. We’ll provide realistic case studies where this approach is applied.

What This Book Covers


We start this text by providing you with some background on the field of data science. Part I,
“Principles of Framing,” includes Chapter 1, “The Role of the Data Scientist,” which serves as a
starting point for your understanding of the data industry.

Chapter 2, “Project Workflow,” sets the context for data science by describing agile development.
It’s a philosophy that helps keep scope small, and development efficient. It can be hard to keep
yourself from trying out the latest machine learning framework or tools offered by cloud platforms,
but it pays off in the long run.

Next, in Chapter 3, “Quantifying Error,” we provide you with a basic introduction to error analysis.
Much of data science is reporting simple statistics. Without understanding the error in those
statistics, you’re likely to come to invalid conclusions. Error analysis is a foundational skill and
important enough to be the first item in your tool kit.

We continue in Chapter 4, “Data Encoding and Preprocessing,” by discovering a few of the many
ways of encoding the real world in the form of data. Naturally this leads us to ask data-driven
questions about the real world. The framework for answering these questions is hypothesis testing,
which we provide a foundation for in Chapter 5, “Hypothesis Testing.”

At this point, we haven’t seen many graphs, and our tool kit is lacking in communicating our
results to the outside (nontechnical) world. We aim to resolve this in Chapter 6, “Data
Visualization,” where we learn many approaches to it. We keep the scope small and aim to mostly
either make plots of quantities we know how to calculate errors for, or plots that resolve some of the
tricky nuances of data visualization. While these tools aren’t as flashy as interactive visualizations
in d3 (which are worth learning!), they serve as a solid foundational skill set for communicating
results to nontechnical audiences.

Having provided the basic tools for working with data, we move on to more advanced concepts in
Part II, “Algorithms and Architectures.” We start with a brief introduction to data architectures in
Chapter 7, “Introduction to Algorithms and Architectures,” and an introduction to basic concepts in machine learning in
Chapter 8, “Comparison.” You now have some very handy methods for measuring the similarities
of objects.

From there, we have some tools to do basic machine learning. In Chapter 9, “Regression,” we
introduce regression and start with one of the most important tools: linear regression. It’s odd to
start with such a simple tool in the age of neural networks and nonlinear machine learning, but
linear regression is outstanding for several reasons. As we’ll detail later, it’s interpretable, stable, and
often provides an excellent baseline. It can describe nonlinearities with some simple tricks, and
recent results have shown that polynomial regression (a simple modification of linear regression)
can outperform deep feedforward networks on typical applications!

From there, we describe one more basic workhorse of regression: the random forest. These are
nonlinear algorithms that rely on a statistical trick, called “bagging,” to provide excellent baseline
performance for a wide range of tasks. If you want a simple model to start a task with and linear
regression doesn’t quite work for you, random forest is a nice candidate.

Having introduced regression and provided some basic examples of the machine learning
workflow, we move on to Chapter 10, “Classification and Clustering.” We see a variety of methods
that work on both vector and graph data. We use this section to provide some basic background on
graphs and an abbreviated introduction to Bayesian inference. We dive into Bayesian inference and
causality in the next chapter.

Our Chapter 11, “Bayesian Networks,” is both unconventional and difficult. We take the view that
Bayesian networks are most intuitive (though not necessarily easiest) from the viewpoint of causal
graphs. We lay this intuition as the foundation for our introduction of Bayesian networks and
come back to it in later sections as the foundation for understanding causal inference. In
Chapter 12, “Dimensional Reduction and Latent Variable Models,” we build off of the foundation
of Bayesian networks to understand PCA and other variants of latent factor models. Topic modeling
is an important example of a latent variable model, and we provide a detailed example on the
newsgroups data set.

As the next to last data-focused chapter, we focus on the problem of causal inference in Chapter 13,
“Causal Inference.” It's hard to overstate the importance of this skill. Data science typically aims
to inform how businesses act. The assumption is that the data tells you something about the
outcomes of your actions. That can only be true if your analysis has captured causal relationships
and not just correlative ones. In that sense, understanding causation underlies much of what we do
as data scientists. Unfortunately, with a view toward minimizing scope, it’s also too often the first
thing to cut. It’s important to balance stakeholder expectations when you scope a project, and good
causal inference can take time. We hope to empower data scientists to make informed decisions
and not to accept purely correlative results lightly.

Finally, in the last data-focused chapter we provide a section to introduce some of the nuances of
more advanced machine learning techniques in Chapter 14, “Advanced Machine Learning.” We use
neural networks as a tool to discuss overfitting and model capacity. The focus should be on using as
simple a solution as is available. Resist the urge to start with neural networks as a first model. Simple
regression techniques almost always provide a good enough baseline for a first solution.

Up to this point, the platform on which all of the data science happens has been in the
background. It’s where you do the data science and is not the primary focus. Not anymore. In the
last part of this book, Part III, “Bottlenecks and Optimizations,” we go in depth on hardware,
software, and the systems they make up.

We start with a comprehensive look at hardware in Chapter 15, “Hardware Fundamentals.” This
provides a tool box of basic resources we have to work with and also provides a framework to discuss
the constraints under which we must operate. These constraints are physical limitations on what is
possible, and those limitations are realized in the hardware.

Chapter 16, “Software Fundamentals,” provides the fundamentals of software and a basic
description of data logistics with a section on extract-transfer/transform-load, commonly known
as ETL.

Next, we give an overview of design considerations for architecture in Chapter 17, “Software
Architecture.” Architecture is the design for how your whole system fits together. It includes the
components for data storage, data transfer, and computation, as well as how they all communicate
with one another. Some architectures are more efficient than others and objectively do their jobs
better than others. Still, a less efficient solution might be more practical, given constraints on time
and resources. We hope to provide enough context so you can make informed decisions. Even if
you’re a data scientist and not an engineer, we hope to provide enough knowledge so you can at
least understand what’s happening with your data platform.

We then move on to some more advanced topics in engineering. Chapter 18, “The CAP Theorem,”
covers some fundamental bounds on database performance. Finally, we discuss how it all fits
together in the last chapter, which is on network topology: Chapter 19, “Logical Network
Topological Nodes.”

Going Forward
We hope that not only can you do the machine learning side of data science, but you can also
understand what’s possible in your own data platform. From there, you can understand what you
might need to build and find an efficient path for building out your infrastructure as you need to.
We hope that with a complete toolbox, you’re free to realize that the tools are only a part of the
solution. They’re a means to solve real problems, and real problems always have resource
constraints.

If there’s one lesson to take away from this book, it’s that you should always direct your resources
toward solving the problems with the highest return on investment. Solving your problem is a real
constraint. Occasionally, it might be true that nothing but the best machine learning models can
solve it. The question to ask, then, is whether that’s the best problem to solve or if there’s a simpler
one that presents a lower-risk value proposition.

Finally, while we would have liked to have addressed all aspects of production machine learning in
this book, it currently exists more as a production data science text. In subsequent editions, we
intend to cover omissions, especially in the area of machine learning infrastructure. This new
material will include methods to parallelize model training and prediction; the basics of
Tensorflow, Apache Airflow, Spark, and other frameworks and tools; the details of several real
machine learning platforms, including Uber’s Michelangelo, Google’s TFX, and our own work on
similar systems; and avoiding and managing coupling in machine learning systems. We encourage
the reader to seek out the many books, papers, and blog posts covering these topics in the
meantime, and to check for updates on the book’s website at adamkelleher.com/ml_book.

We hope you’ll enjoy learning these tools as much as we did, and we hope this book will save you
time and effort in the long run.
About the Authors
Andrew Kelleher is a staff software engineer and distributed systems architect at Venmo. He was
previously a staff software engineer at BuzzFeed and has worked on data pipelines and algorithm
implementations for modern optimization. He graduated with a BS in physics from Clemson
University. He runs a meetup in New York City that studies the fundamentals behind distributed
systems in the context of production applications, and was ranked one of FastCompany’s most
creative people two years in a row.

Adam Kelleher wrote this book while working as principal data scientist at BuzzFeed and adjunct
professor at Columbia University in the City of New York. As of May 2018, he is chief data scientist
for research at Barclays and teaches causal inference and machine learning products at Columbia.
He graduated from Clemson University with a BS in physics, and has a PhD in cosmology from
University of North Carolina at Chapel Hill.
I
Principles of Framing
Chapter 1, “The Role of the Data Scientist,” provides background information about the field of
data science. This should serve as a starting point to gain context for the role of data science in
industry.

Chapter 2, “Project Workflow,” describes project workflow and how it relates to the principles of
agile software development.

Chapter 3, “Quantifying Error,” introduces the concept of measurement error and describes how to
quantify it. It then shows how to propagate error approximately through calculations.

Chapter 4, “Data Encoding and Preprocessing,” describes how to encode complex, real-world data
into something a machine learning algorithm can understand. Using text processing as the
example case, the chapter explores the information that is lost due to this encoding.

Chapter 5, “Hypothesis Testing,” covers this core skill for a data scientist. You’ll encounter
statistical tests and p-values throughout your work, and in the application of algorithms like
least-squares regression. This chapter provides a brief introduction to statistical hypothesis testing.

Chapter 6, “Data Visualization,” is the last subject before the unit on machine learning. Data
visualization and exploratory data analysis are critical steps in machine learning, where you
evaluate the quality of your data and develop intuition for what you’ll model.
1
The Role of the Data Scientist

1.1 Introduction
We want to set the context for this book by exposing you to the focus on products, rather than
methods, early on. Data scientists often take shortcuts, use rules of thumb, and forgo rigor. They
do this in favor of speed, accepting a reasonable level of uncertainty in the decisions they inform.
The world moves fast, and businesses don’t have time for you to write a dissertation on error bars
when they need answers to hard questions.

We’ll begin by describing how the sizes of companies put different demands on a data scientist.
Then, we’ll describe agile development: the framework for building products that keeps them
responsive to the world outside of the office. We’ll discuss ladders and career development. These
are useful for both data scientists and the companies they work for. They lay out expectations that
companies have for their scientists and help scientists see which traits the company has found
useful. Finally, we’ll describe what data scientists actually “do” with their time.

1.2 The Role of the Data Scientist


The role of the data scientist is different depending on the context. It’s worth having an in-depth
understanding of some of the factors that influence your role so you can adapt as your role changes.
A lot of this chapter is informed by working within a company that grew from about 150 people to
almost 1,500 in a few short years. As the size changed, the roles, supporting structure, management,
interdepartmental communications, infrastructure, and expectations of the role changed with it.
Adam came in as a data scientist at 300 people, and Andrew came in as an engineer at 150 people.
We both stayed on as the company grew over the years. Here is some of what we learned.

1.2.1 Company Size


When the company was smaller, we tended to be generalists. We didn’t have the head count to have
people work on very specific tasks, even though that might have led to deeper analyses, depth of
perspective, and specialized knowledge about the products. As a data scientist at the then-small
company, Adam did analyses across several products and departments. As the company grew, the
team roles tended to get more specialized, and data scientists tended to start working more on one
product or a small number of related products. There’s an obvious benefit: they can have deep
knowledge of a sophisticated product, so they have fuller context and nuanced understanding that
they might not be capable of if they were working on several different products.

A popular team structure is for a product to be built and maintained by a small, mostly autonomous
team. We’ll go into detail on that in the next section. When our company was smaller, team
members often performed a much more general role, acting as the machine learning engineer, the
data analyst, the quantitative researcher, and even the product manager and project manager. As
the company grew, the company hired more people to take on these roles, so team members’ roles
became more specialized.

1.2.2 Team Context


Most of the context of this book will be for data scientists working in small, autonomous teams,
roughly following the Agile Manifesto. This was largely developed in the context of software
engineering, so the focus is on producing code. It extends well to executing data science projects.
The manifesto is as follows:
- Individuals and interactions over processes and tools
- Working software over comprehensive documentation
- Customer collaboration over contract negotiation
- Responding to change over following a plan

In this list, the items on the right of each line are still important, but the items on the left are
the priorities. This means that team structure is flat, with
more experienced people working alongside (rather than above) more junior people. They share
skills with interactions like pair-coding and peer-reviewing each other’s code. A great benefit to
this is that everyone learns quickly from direct interactions with more senior teammates as peers.
A drawback is that there can be a little friction when senior developers have their code reviewed by
junior team members.

The team’s overall goal is to produce working software quickly, so it’s okay to procrastinate on
documentation. There is generally less focus on process and more on getting things done. As long
as the team knows what’s going on and they’re capable of onboarding new members efficiently
enough, they can focus on the work of shipping products.

On the other side of this, the focus on moving fast causes teams to take shortcuts. This can lead to
systems being more fragile. It can also create an ever-growing list of things to do more perfectly
later. These tasks make up what is called technical debt. Much like debt in finance, it’s a natural part
of the process. Many argue, especially in smaller companies, that it’s a necessary part of the process.
The argument is that a team should do enough “paying the debt” by writing documentation,
making cleaner abstractions, and adding test coverage to keep a sustainable pace of development
and keep from introducing bugs.

Teams generally work directly with stakeholders, and data scientists often have a front-facing role
in these interactions. There is constant feedback between teams and their stakeholders to make
sure the project is still aligned with stakeholder priorities. This is opposed to contract negotiation,
where the requirements are laid out and the team decouples from the stakeholders, delivering the
product at a later date. In business, things move fast. Priorities change, and the team and product
must adapt to those changes. Frequent feedback from stakeholders lets teams learn about changes
quickly and adapt to them before investing too much in the wrong product and features.

It’s hard to predict the future. If you come up with a long- or moderate-term plan, priorities can
shift, team structure can change, and the plan can fall apart. Planning is important, and trying to
stick to a plan is important. You’ll do all you can to make a plan for building an amazing product,
but you’ll often have to respond quickly and agilely to change. It can be hard to throw out your
favorite plans as priorities shift, but it’s a necessary part of the job.

Data scientists are integral members of these teams. They help their teams develop products and
help the product managers evaluate a product’s performance. Throughout product development,
there are critical decisions to make about its features. To that end, a data scientist works with
product managers and engineers to formulate questions to answer. They can be as simple as “What
unit on this page generates the most clicks?” and as complex as “How would the site perform if the
recommender system didn’t exist?” Data lets us answer these questions, and data scientists are the
people who analyze and help interpret the data for making these decisions. They do this in the
context of a dynamic team environment and have to work quickly and effectively in response to
change.
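
To make the simpler kind of question concrete: answering “What unit on this page generates the most clicks?” usually reduces to an aggregation over an event log. The sketch below is illustrative only; the table, the column names (unit, clicked), and the numbers are hypothetical assumptions of ours, not an example from the book.

```python
import pandas as pd

# Hypothetical click log: one row per impression of a unit on the page.
events = pd.DataFrame({
    "unit": ["hero", "sidebar", "footer", "hero", "sidebar", "hero"],
    "clicked": [1, 0, 0, 1, 1, 0],
})

# Impressions, clicks, and click-through rate per unit, sorted by total clicks.
summary = (
    events.groupby("unit")["clicked"]
    .agg(impressions="count", clicks="sum", ctr="mean")
    .sort_values("clicks", ascending=False)
)
print(summary)
```

The more complex questions, such as estimating how the site would perform without the recommender system, need the experimental and causal tools covered later in the book; the point here is that much of the day-to-day decision support starts with aggregations of this kind.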

1.2.3 Ladders and Career Development


Sometimes the data scientist is contrasted with the data analyst. The data analyst has an
overlapping skill set, which includes querying databases, making plots, doing statistics, and
interpreting data. In addition to these skills, according to this view, a data scientist is someone who
can build production machine-learning systems. If that were an apt view, then there might not be
such a thing as a junior data scientist. It’s not typical to start your career designing production
machine-learning systems. Most companies have well-defined “ladders” for career advancement,
with specific skills expected at each level.

The team’s goal is to build and ship products. There are many skills that are critically important
for this that have nothing to do with data. Ladders go beyond technical skills to include
communication skills, the ability to understand project scope, and the ability to balance long- and
short-term goals.

Generally, companies will define an “individual contributor” track and a “management” track.
Junior scientists will start in the same place and shift onto a specific track as their skills develop.
They generally start out being able to execute tasks on projects with guidance from more senior
team members. They advance to being able to execute tasks more autonomously. Finally, they’re
the ones helping people execute tasks and usually take more of a role in project planning. The shift
often happens at this point, when they hit the “senior” level of their role.

1.2.4 Importance
The data scientist, like everyone on their teams, has an important role. Analysis can lie on the
“critical path” of a project’s development. This means the analysis might need to be finished before
a project can proceed and be delivered. If a data scientist isn’t skillful with their analysis and
delivers too slowly or incompletely, they might block progress. You don’t want to be responsible for
delaying the release of a product or feature!

Without data, decision-makers might move more toward experience and intuition. While these
might not be wrong, they’re not the best way to make decisions. Adding data to the
decision-making process moves business more toward science. The data scientist, then, has a
critical role in making business decisions more rational.

1.2.5 The Work Breakdown


Anecdotally, probably 80 to 90 percent of the work a data scientist does outside of interpersonal
and managerial tasks is basic analysis and reporting on experimental and observational data. Much
of the data the scientist has to work with is observational since experimental data takes time and
resources to collect, while observational data is essentially “free” once you’ve implemented data
collection. This makes observational data analysis methods important to be familiar with. You’ll
examine correlation and causation later in this book. You’ll develop an understanding of
observational data analysis methods by contrasting them with experimental data and an
understanding of why observational results are often biased.

Many data scientists work primarily with experimental data. We’ll cover experiment design and
analysis in some detail as well. Good experiment design is hard. Web-scale experiments, while
often providing large samples, don’t guarantee you’ll actually be able to measure the experimental
effects you’re looking for, even when they’re large! Randomized assignment doesn’t even guarantee
you’ll have correct experimental results (due to selection bias). We’ll cover all of this and more later
in the book.

The other 10 or so percent of the work is the stuff you usually read about when you hear about data
science in the news. It’s the cool machine learning, artificial intelligence, and Internet of Things
applications that are so exciting and drive so many people toward the field of data science. In a very
real sense, these applications are the future, but they’re also the minority of the work data scientists
do, unless they’re the hybrid data scientist/machine learning engineer type. Those roles are
relatively rare and are generally for very senior data scientists. This book is aimed at entry- to
mid-level data scientists. We want to give you the skills to start developing your career in whichever
direction you’d like so you can find the data science role that is perfect for you.

1.3 Conclusion
Getting things right can be hard. Often, the need to move fast supersedes the need to get it right.
Consider the case when you need to decide between two policies, A and B, that cost the same
amount to implement. You must implement one, and time is a factor. If you can show that the effect
of policy A, Y(A), is more positive than Y(B), it doesn’t matter how much more positive it is. As long
as Y(A) − Y(B) > 0, policy A is the right choice. As long as your measurement error is smaller than
the difference itself (that is, within 100 percent of the correct difference), you know the sign of
Y(A) − Y(B), and that is enough to make the policy choice!
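
A minimal sketch of this decision rule follows; the sample sizes, means, and use of NumPy are illustrative assumptions of ours, not an example from the book. All you need is the estimated difference and its error.

```python
import numpy as np

# Hypothetical outcome measurements under each policy.
rng = np.random.default_rng(0)
y_a = rng.normal(loc=1.10, scale=1.0, size=500)  # outcomes under policy A
y_b = rng.normal(loc=1.00, scale=1.0, size=500)  # outcomes under policy B

# Estimated difference Y(A) - Y(B) and its standard error (independent samples).
diff = y_a.mean() - y_b.mean()
se = np.sqrt(y_a.var(ddof=1) / len(y_a) + y_b.var(ddof=1) / len(y_b))

# If the error is smaller than the difference itself (within 100 percent of the
# correct difference), the sign of the difference is trustworthy, and that alone
# determines the policy choice.
if se < abs(diff):
    print("Choose policy", "A" if diff > 0 else "B", f"(diff = {diff:.3f} ± {se:.3f})")
else:
    print("Too noisy to call; improve the measurement before deciding.")
```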

At this point, you should have a better idea of what it means to be a data scientist. Now that you
understand a little about the context, you can start exploring the product development process.
2
Project Workflow

2.1 Introduction
This chapter focuses on the workflow of executing data science tasks as one-offs versus tasks that
will eventually make up components in production systems. We’ll present a few diagrams of
common workflows and propose combining two as a general approach. At the end of this chapter
you should understand where they fit in an organization that uses data-driven analyses to fuel
innovation. We’ll start by giving a little more context about team structure. Then, we’ll break down
the workflow into several steps: planning, design/preprocessing, analysis, and action. These steps
often blend together and are usually not formalized. At the end, you’ll have gone from the concept
of a product, like a recommender system or a deep-dive analysis, to a working prototype or result.

At that stage, you’re ready to start working with engineers to have the system implemented in
production. That might mean bringing an algorithm into a production setting, automating a
report, or something else.

We should say that as you get closer to feature development, your workflow can evolve to look more
like an engineer’s workflow. Instead of prototyping in a Jupyter Notebook on your computer, you
might prototype a model as a component of a microservice. This chapter is really aimed at getting a
data scientist oriented with the steps that start them toward building prototypes for models.

When you’re prototyping data products, it’s important to keep in mind the broader context of the
organization. The focus should be more on testing value propositions than on perfect architecture,
clean code, and crisp software abstractions. Those things take time, and the world changes quickly.
With that in mind, we spend the remainder of this chapter talking about the agile methodology
and how data products should follow that methodology like any other piece of software.

2.2 The Data Team Context


When you’re faced with a problem that you might solve with machine learning, usually many
options are available. You could make a fast, heuristic solution involving very little math that you
could produce in a day and move on to the next project. You could take a smarter approach and
probably achieve better performance. The cost is your time and the loss of opportunity to spend
that time working on a different product. Finally, you could implement the state of the art.
That usually means you’d have to research the best approach before even beginning coding,
implement algorithms from scratch, and potentially solve unsolved problems with how to scale the
implementation.

When you’re working with limited resources, as you usually are in small organizations, the third
option usually isn’t the best choice. If you want a high-quality and competitive product, the first
option might not be the best either. Where you fall along the spectrum between the get-it-done
and state-of-the-art approaches depends on the problem, the context, and the resources available.
If you’re making a healthcare diagnosis system, the stakes are much higher than if you’re building a
content recommendation system.

To understand why you’ll use machine learning at all, you need a little context for where and how
it’s used. In this section, we’ll try to give you some understanding of how teams are structured,
what some workflows might look like, and practical constraints on machine learning.

2.2.1 Embedding vs. Pooling Resources


In our experience, we’ve seen two models for data science teams. The first is a “pool of resources”
where the team gets a request and someone on the team fulfills it. The second is for members of the
team to “embed” with other teams in the organization to help them with their work.

In the first “pool of resources” approach, each request of the team gets assigned and triaged like
with any project. Some member of the team executes it, and if they need help, they lean on
someone else. A common feature of this approach is that tasks aren’t necessarily related, and it’s
not formally decided that a single member of the team executes all the tasks in a certain domain or
that a single member should handle all incoming requests from a particular person. It makes sense
to have the same person answer the questions for the same stakeholders so they can develop more
familiarity with the products and more rapport with the stakeholders. When teams are small, the
same data scientist will tend to do this for many products, and there’s little specialization.

In the “embedded” approach, a data scientist works with some team in the organization each day,
understanding the team’s needs and their particular goals. In this scenario, the understanding of
problems and the approaches are clear as the data scientist is exposed to them day to day. This is
probably the biggest contrast between the “embedded” and “pool of resources” approaches.
Anecdotally, the former is more common than the latter in small organizations. Larger
organizations tend to have more need and resources for the latter.

This chapter has a dual focus. First we’ll discuss the data science project life cycle in particular, and
then we’ll cover the integration of the data science project cycle with a technical project life cycle.

2.2.2 Research
The steps to develop a project involving a machine learning component aren’t really different from
those of an engineering project. Planning, design, development, integration, deployment, and
post-deployment are still the steps of the product life cycle (see Figure 2.1).

There are two major differences between a typical engineering product and one involving a data
science component. The first is that with a data science component, there are commonly
unknowns, especially in smaller teams or teams with less experience. This creates the need for a
recursive workflow, where analysis can be done and redone.

Figure 2.1 The stages of a product's life cycle: planning, design, development, integration, deployment, postdeployment

Figure 2.2 The independent life cycle of a data science task: planning, design/preprocessing, analysis, action

The second major difference is that many if not most data science tasks are executed without the
eventual goal of deployment to production. This creates a more abridged product life cycle (see
Figure 2.2).

The Field Guide to Data Science [1] explains that four steps comprise the procedure of data science
tasks. Figure 2.2 shows our interpretation.

Here are the steps (a minimal code sketch follows the list):

1. Build domain knowledge and collect data.

2. Preprocess the data. This involves cleaning out sources of error (e.g., removing outliers), as
well as reformatting the data as needed.

3. Execute some analyses and draw conclusions. This is where models are applied and tested.

4. Do something with the result. Report it or refine the existing infrastructure.
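
As a rough, hypothetical illustration of these four steps (the file name, column names, and outlier rule below are our assumptions, not an example from the book or the Field Guide), a small one-off analysis might look like this:

```python
import pandas as pd

# 1. Build domain knowledge and collect data: here, a hypothetical CSV export.
df = pd.read_csv("daily_engagement.csv")  # assumed columns: date, pageviews

# 2. Preprocess: drop missing values and remove extreme outliers.
df = df.dropna(subset=["pageviews"])
df = df[df["pageviews"] <= df["pageviews"].quantile(0.99)]

# 3. Analyze and draw conclusions: a simple monthly summary stands in for a model.
df["date"] = pd.to_datetime(df["date"])
monthly = df.groupby(df["date"].dt.to_period("M"))["pageviews"].mean()

# 4. Do something with the result: report it, or refine the pipeline and repeat.
print(monthly)
```

In practice the loop is recursive: the conclusions from step 3 often send you back to steps 1 and 2 before anything is reported or brought to production.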

2.2.3 Prototyping
The workflows we’ve outlined are useful for considering the process of data science tasks
independently. These steps, while linear, seem in some ways to mirror the general steps to software
prototyping as outlined in “Software Prototyping: Adoption, Practice and Management”[2].
Figure 2.3 shows our interpretation of these steps.

Figure 2.3 The life cycle of a software prototype: planning, implementation, evaluation, completion

1. Planning: Assess project requirements.

2. Implementation: Build a prototype.

3. Evaluation: Determine whether the problem is solved.

4. Completion: If the needs are not satisfied, re-assess and incorporate new information.

2.2.4 A Combined Workflow


Sometimes these tasks are straightforward. When they take some time and investment, a more
rigorous focus on process becomes crucial. We propose the typical data science track should be
considered as being like a typical prototyping track for engineering projects. This is especially
useful when the end goal is a component in a production system. In situations where data science
influences engineering decisions or products that are brought to production, you can picture a
combined product life cycle, as in Figure 2.4.

Figure 2.4 The combined product life cycle of an engineering project dependent on exploratory analysis: planning, design, analysis, engineering design, development, integration, deployment, postdeployment

This approach allows data scientists to work with engineers in an initial planning and design phase,
before the engineering team takes lessons learned to inform their own planning and design
processes with technical/infrastructural considerations taken fully into account. It also allows data
scientists to operate free of technical constraints and influences, which could otherwise slow
progress and lead to premature optimization.

2.3 Agile Development and the Product Focus


Now that you understand how prototyping works and the product life cycle, we can build a richer
context around product development. The end goal is to build a product that provides value to
someone. That might be a product that performs a task, like translating languages for travelers or
recommending articles to read on your morning commute. It might be a product that monitors
heart rates for patients after surgery or tracks people’s fitness as they work toward personal
milestones. Common to each of these products is a value proposition.

When you’re building a new product, you have a value proposition in mind. The issue is that it’s
likely untested. You might have good reason to believe that the proposition will be true: that users
are willing to pay $1 for an app that will monitor their heart rate after surgery (or will tolerate some
number of ads for a free app). You wouldn’t be building it in the first place if you didn’t believe in
the value proposition. Unfortunately, things don’t always turn out how you expect. The whole
purpose of AB tests is to test product changes in the real world and make sure reality aligns with our
expectations. It’s the same with value propositions. You need to build the product to see whether
the product is worth building.

To manage this paradox, we always start with a minimum viable product, or MVP. It’s minimal in the
sense that it’s the simplest thing you can possibly build while still providing the value that you’re
proposing providing. For the heart rate monitor example, it might be a heart rate monitor that
attaches to a hardware device, alerts you when you’re outside of a target range, and then calls an
ambulance if you don’t respond. This is a version of an app that can provide value in the extra
security. Any more features (e.g., providing a fancy dashboard, tracking goals, etc.), and you’re
going beyond just testing the basic value proposition. It takes time to develop features, and that is
time you might invest in testing a different value proposition! You should do as little work as
possible to test the value proposition and then decide whether to invest more resources in the
product or shift focus to something different.

Some version of this will be true with every product you build. You can look at features of large
products as their own products. Facebook’s Messenger app was originally part of the Facebook
platform and was split into its own mobile app. That’s a case where a feature literally evolved into its
own product. Everything you build should have this motivation behind it of being minimal. This
can cause problems, and we have strategies to mitigate them. The cycle of software development is
built around this philosophy, and you can see it in the concept of microservice architecture, as well
as the “sprints” of the product development cycle. This leads us to the principles of the agile
methodology.

2.3.1 The 12 Principles


The agile methodology is described with 12 principles [3].

1. Our highest priority is to satisfy the customer through early and continuous delivery of
valuable software. The customer is the person you’re providing value to. That can be a consumer,
or it can be the organization you’re working for. The reason you’d like to deliver software early is to
test the value proposition by actually putting it in front of the user. The requirement that the
software be “valuable” means you don’t work so fast that you fail to test your value proposition.

2. Welcome changing requirements, even late in development. Agile processes harness change
for the customer’s competitive advantage. This principle sounds counterintuitive. When
requirements for software change, you have to throw away some of your work, go back to the
planning phase to re-specify the work to be done, and then do the new work. That’s a lot of
inefficiency! Consider the alternative: the customer's needs have changed. The value proposition is
no longer satisfied by the software requirements as they were originally planned. If you don’t adapt
your software to the (unknown!) new requirements, the value proposition, as executed by your
software, will fail to meet the customer’s needs. Clearly, it’s better to throw away some work than to
throw away the whole product without testing the value proposition! Even better, if the
competition isn’t keeping this “tight coupling” with their stakeholders (or customers), then your
stakeholders are at a competitive advantage!

3. Deliver working software frequently, from a couple of weeks to a couple of months, with a
preference to the shorter timescale. There are a few reasons for this. One of them is for
consistency with the last principle. You should deliver software often, so you can get frequent
feedback from stakeholders. That will let you adjust your project plans at each step of its
development and make sure you’re aligned with the stakeholders’ needs as well as you can be. The
time when you deliver value is a great time to hear more about the customer’s needs and get ideas
for new features. We don’t think we’ve ever been in a meeting where we put a new product or
feature in front of someone and didn’t hear something along the lines of “You know, it would be
amazing if it also did... .”

Another reason for this is that the world changes quickly. If you don’t deliver value quickly, your
opportunity for providing that value can pass. You might be building a recommender system for an
app and take so long with the prototype that the app is already being deprecated! More realistically,
you might take so long that the organization’s priorities have shifted to other projects and you’ve
lost support (from product managers, engineers, and others) for the system you were working on.

4. Businesspeople and developers must work together daily throughout the project. This
principle is an extension of the previous two. Periodically meeting with the stakeholders isn’t the
only time to connect the software development process with the context of the business.
Developers should at least also be meeting with product managers to keep context with the
business goals of their products. These managers, ideally, would be in their team check-ins each
day, or at the least a few times per week. This makes sure that not only does the team building the
software keep the context of what they’re working on, but the business knows where the software
engineering and data resources (your and your team’s time) are being spent.

5. Build projects around motivated individuals. Give them the environment and support they
need, and trust them to get the job done. One sure way to restrict teams from developing things
quickly is to have them all coordinate their work through a single manager. Not only does that
person have to keep track of everything everyone is working on, but they need to have the time to
physically meet with all of them! This kind of development doesn’t scale. Typically, teams will be
small enough to share a pizza and have one lead per team. The leads can communicate with each
other in a decentralized way (although they do typically all communicate through management
meetings), and you can scale the tech organization by just adding new similar teams.

Each person on a team has a role, and that lets the team function as a mostly autonomous unit.
The product person keeps the business goals in perspective and helps coordinate with stakeholders.
The engineering manager helps make sure the engineers are staying productive and does a lot
of the project planning. The engineers write the code and participate in the project planning
process. The data scientist answers questions for the product person and can have different roles
(depending on seniority) with managing the product’s data sources, building machine learning
and statistical tools for products, and helping figure out the presentation of data and statistics to
stakeholders. In short, the team has everything they need to work quickly and efficiently together
to get the job done. When external managers get too involved in the details of a team's operations,
they can end up slowing them down just as easily as they can help.

6. The most efficient and effective method of conveying information to and within a
development team is face-to-face conversation. A lot of communication is done over chat
clients, through shared documents, and through email. These media can make it hard to judge
someone’s understanding of project requirements as well as their motivation, focus, and
confidence for getting it done. Team morale can fluctuate throughout product development.
People can tend to err on the side of agreeing to work that they aren’t sure they can execute. When
teams communicate face to face, it’s much easier to notice these issues and handle them before
they’re a problem.

As a further practical issue, when you communicate over digital media, there can be a lot of other
windows, and even other conversations, going on. It can be hard to have a deep conversation with
someone when you aren’t even sure if they’re paying attention!

7. Working software is the primary measure of progress. Your goal is to prove value
propositions. If you follow the steps we’ve already outlined, then the software you’re building is
satisfying stakeholders’ needs. You can do that without implementing the best software
abstractions, cleaning up your code, fully documenting your code, and adding complete test
coverage. In short, you can take as many shortcuts as you like (respecting the next principle), as
long as your software works!

When things break, it's important to hold a retrospective. Always have a meeting to figure out why
it happened but without placing blame on any individual. The whole team is responsible when
things do or don’t work. Make whatever changes are necessary to make sure things don’t break in
the future. That might mean setting a higher standard for test coverage, adding more
documentation around certain types of code (like describing input data), or cleaning up your code
just a little more.

8. Agile processes promote sustainable development. The sponsors, developers, and users
should be able to maintain a constant pace indefinitely. When you’re working fast, it’s easy for
your code to end up messy. It's easy to write big monolithic blocks of code instead of breaking them up
into nice small functions with test coverage on each. It’s easy to write big services instead of
microservices with clearly defined responsibilities. All of these things get you to a value proposition
quickly and can be great if they’re done in the right context. All of them are also technical debt,
which is something you need to fix later when you end up having to build new features onto the
product.

When you have to change a monolithic block of code you’ve written, it can be really hard to read
through all the logic. It’s even worse if you change teams and someone else has to read through it!
It’s the type of problem that can slow progress to a halt if it isn’t kept in check. You should always
notice when you’re taking shortcuts and consider at each week’s sprint whether you might fix some
small piece of technical debt so it doesn’t build up too much. Remember that you’d like to keep up
your pace of development indefinitely, and you want to keep delivering product features at the
same rate. Your stakeholders will notice if they suddenly stop seeing you for a while! All of this
brings us to the next point.

9. Continuous attention to technical excellence and good design enhances agility. When you
have clear abstractions, code can be much more readable. When functions are short, clean, and
well-documented, it’s easy for anyone to read and modify the code. This is true for software
development as well as for data science. Data scientists in particular can be guilty of poor coding
standards: one-character variable names, large blocks of data preprocessing code with no
documentation, and other bad practices. If you make a habit of writing good code, it won’t slow
you down to do it! In fact, it’ll speed up the team as a whole.

10. Simplicity—the art of maximizing the amount of work not done—is essential. Writing a
good MVP can be an art. How do you know exactly the features to write to test your value
proposition? How do you know what software development best practices you can skip to keep a
sustainable pace of development? Which architectural shortcuts can you get away with now and in
the long term?

These are all skills you learn with practice, and your manager and team will be good resources
for advice. If you're not sure which product features really test the minimum value proposition, talk
to your product manager and your stakeholders. If you’re not sure how sloppy your code can be,
talk to a more senior data scientist, or even to an engineer on your team.

11. The best architectures, requirements, and designs emerge from self-organizing teams.
Some things are hard to understand unless you’re working with them directly. The team writing the
software is going to have the best idea of which architectural changes will work best. This is
partly because they know the architecture well and partly because they know their strengths and
weaknesses for executing it. Teams communicate with each other and can collaborate without the
input of other managers. Together, they can build bigger systems than they could on their
own, and when several teams coordinate, they can architect fairly large and complex systems
without a centralized architect guiding them.

12. At regular intervals, the team reflects on how to become more effective and then tunes and
adjusts its behavior accordingly. While the focus is on delivering value quickly and often and
working closely with stakeholders to do that, teams also have to be introspective occasionally to
make sure they’re working as well as they can. This is often done once per week in a “retrospective”
meeting, where the team will get together and talk about what went well during the past week,
what didn’t work well, and what they’ll plan to change for the next week.

These are the 12 principles of agile development. They apply to data science as well as software. If
someone ever proposes a big product loaded with features and says “Let’s build this!” you should
think about how to do it agilely. Think about what the main value proposition is (chances are that
it contains several). Next, think of the minimal version of it that lets you test the proposition. Build
it, and see whether it works!

Often in data science, there are extra shortcuts you can take. You can use a worse-performing model
while you work on a better one just to fill the gap that the engineers are building around. You can
write big monolithic functions that return a model just by copying and pasting a prototype from a
Jupyter Notebook. You can use CSV files instead of running database queries when you need static
data sets. Get creative, but always think about what you’d need to do to build something right. That
might be creating good abstractions around your models, replacing CSV files with database queries
to get live data, or just writing cleaner code.

To summarize, there are four points to the Agile Manifesto. Importantly, these are tendencies. Real
life is not usually dichotomous. These points really reflect our priorities:
- Individuals and interactions over processes and tools
- Working software over comprehensive documentation
- Customer collaboration over contract negotiation
- Responding to change over following a plan

2.4 Conclusion
Ideally now you have a good idea of what the development process looks like and where you fit in.
We hope you’ll take the agile philosophy as a guide when building data products and will see the
value in keeping a tight feedback loop with your stakeholders.

Now that you have the context for doing data science, let’s learn the skills!
3 Quantifying Error

To kill an error is as good a service as, and sometimes even better than, the establishing of a new truth or fact.
—Charles Darwin

3.1 Introduction
Most measurements have some error associated with them. We often think of the numbers we
report as exact values (e.g., “there were 9,126 views of this article”). Anyone who has implemented
multiple tracking systems that are supposed to measure the same quantity knows there is rarely
perfect agreement between measurements. The chances are that neither system measures the
ground truth—there are always failure modes, and it’s hard to know how often failures happen.

Aside from errors in data collection, some measured quantities are uncertain. Instead of running
an experiment with all users of your website, you’ll work with a sample. Metrics like retention and
engagement you measure in the sample are noisy measurements of what you’d see in the whole
population. You can quantify that noise and make sure you bound the error from sampling to
something within reasonable limits.

In this chapter, we’ll discuss the concept of error analysis. You’ll learn how to think about error in a
measurement, and you’ll learn how to calculate error in simple quantities you derive from
measurements. You’ll develop some intuition for when error matters a lot and when you can safely
ignore it.

3.2 Quantifying Error in Measured Values


Imagine you want to measure the length of a piece of string. You take out a ruler, stretch the string
along the length of the ruler, and type the length you measure into a spreadsheet. You’re a good
scientist, so you know that you really shouldn’t stop at one measurement. You measure it again and
get a slightly different result. The string was stretched a little less; maybe it was a little misaligned on
the ruler the first time. You repeat the process over and over again. You find the measurements
plotted in Figure 3.1.

Figure 3.1 Several measurements of a string are given by the red dots along the number line. The
true length of the string is shown with the vertical blue line. If you look at the average value of the
measurements, it falls around the center of the group of red dots. It’s higher than the true value, so
you have a positive bias in your measurement process.

There is a true length to the string, but the string bends a little. To straighten it out to measure it,
you have to stretch it a little, so the measurement tends to be a little longer than it should be. If you
average your measurements together, the average will be a little higher than the true length of the
string. This difference between the “expected” length from your measurements and the “true”
length is called systematic error. It’s also sometimes called bias.

If there is no bias, there is still some random spread your measurements take around the true value.
On average you measure the true value, but each measurement is a little low or a little high. This
type of error is called random error. It’s what we commonly think of as measurement noise. We usually
measure it with the standard deviations of the measurements around their average value.

If measurements have no systematic error, then you can take a large enough sample of them,
average them together, and find the true value! This is a great situation to be in, if you’re able to take
several independent measurements. Unfortunately, it’s not a common situation. Usually you can
make only one measurement and expect that there is at least a little systematic error (e.g., data is
only lost, so count measurements are systematically low).
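
To make the distinction concrete, here is a minimal simulation of a measurement process with both a systematic and a random component. This is our own illustrative sketch: the true length, bias, and noise values are made up, and a real measurement process won't hand you these numbers directly.

import numpy as np

np.random.seed(0)

true_length = 10.0   # the "true" value we are trying to measure
bias = 0.3           # systematic error: stretching the string adds about 0.3 units
noise = 0.2          # random error: spread of any individual measurement

# simulate 1,000 repeated measurements of the same string
measurements = true_length + bias + np.random.normal(0, noise, size=1000)

print(measurements.mean())  # about 10.3: averaging removes the noise, but the bias remains
print(measurements.std())   # about 0.2: the random error, estimated by the standard deviation

Averaging many measurements shrinks the random error toward zero, but no amount of averaging removes the 0.3 of systematic error.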

Consider the case of tracking impressions on a web page. When a user clicks a link to the page on
which you’re tracking impressions, in some instances, they will not follow the link completely (but
exit before arriving at the final page). Still closer to conversion, they may load the page but not
allow the pixel tracking impressions on that page to be requested. Further, we may double-count
impressions in the case that a user refreshes the page for whatever reason (which happens quite a lot).
These all contribute random and systematic errors, going in different directions. It’s hard to say
whether the measurement will be systematically low or high.

There are certainly ways to quantify errors in tracking. Server logs, for example, can tell the story of
requests complete with response codes your tracking pixels may have missed. tcpdump or
wireshark can be used for monitoring attempted connections that get dropped or disconnected
before the requests are fulfilled. The main consideration is that both of these methods are difficult
in real-time reporting applications. That doesn't mean, though, that you can't do a sampled
comparison of tracked impressions to impressions collected through these other, less convenient
and more expensive, means.

Once you’ve implemented your tracking system and checked it against some ground truth (e.g.,
another system, like Google Analytics), you’ll usually assume the error in these raw numbers is
small and that you can safely ignore it.

There is another context where you have to deal with systematic and random error, where you can’t
safely ignore the error. This comes up most often in AB testing, where you look at a performance
metric within a subpopulation of your users (i.e., those participating in the experiment) and want
to extrapolate that result to all of your users. The measurement you make with your experiment,
you hope, is an “unbiased” measurement (one with no systematic error) of the “true” value of the
metric (the one you would measure over the whole population).

To understand error from sampling, it’ll be helpful to take a little side trip into sampling error. The
end result is familiar: with each measurement, we should have random and systematic error in
comparison to the “true” value.

3.3 Sampling Error


Sampling error is a very rich subject. There are entire books written about it. We can’t hope to cover
all of its intricacies here, but we can provide enough of an introduction to give you some working
knowledge. We hope you’ll continue reading more on your own!

Suppose you run a news website, and you want to know the average amount of time it takes you to
read an article on your website. You could read every article on the site, record your reading time,
and get your answer that way, but that’s incredibly labor intensive. It would be great if you could
read a much smaller number of the articles and be reasonably confident about the average reading
time.

The trick you’ll use is this: you can take a random sample of articles on the website, measure the
reading time for those articles, and take the average. This will be a measurement of the average
reading time for articles on the whole website. It probably won’t match the actual average reading
time exactly, the one you’d measure if you read all of the articles. This true number is called the
population average since it’s averaging over the whole population of articles instead of just a sample
from it.

How does the average read time in your sample compare with the average read time across the
whole site? This is where the magic happens. The result comes from the central limit theorem. It
says that the average of N independent measurements, µN , from a population is an unbiased
estimate for the population average, µ, as long as you have a reasonably large number of samples.
Even better, it says the random error for the sample average, σµ, is just the sample standard
deviation, σN, divided by the square root of the sample size, N:

$$\sigma_\mu = \frac{\sigma_N}{\sqrt{N}} \tag{3.1}$$

In practice, N = 30 is a pretty good rule of thumb for using this approximation. Let’s draw a sample
from a uniform distribution to try it.

First, let’s make the population of reading times. Let’s make it uniform over the range of 5 to 15
minutes and generate a population of 1,000 articles.

import numpy as np

population = np.random.uniform(5, 15, size=1000)

and then sample 30 articles from it at random.

sample = np.random.choice(population, size=30, replace=False)



Note that in practice, you won’t have access to a whole population to sample from. If these were the
reading times of articles, none of the reading times is even measured when you start the process!
Instead, you’d sample 30 articles from a database and then read those articles to generate your
sample from the populations. We generate a population to sample from here, just so we can check
how close our sample mean is to the population mean.

Note also that database queries don’t sample randomly from the database. To get random sampling,
you can use the rand() SQL function to generate random floats between 0 and 1. Then, you can
sort by the random value, or limit to results with rand() < 0.05 for example, to keep 5 percent of
results. An example query might look like this (NOTE: This should never be used on large tables):

SELECT article_id FROM articles WHERE rand() < 0.05;

Continuing, you can compute the population and sample means, as shown here:

population.mean()
sample.mean()

which for us returns 10.086 for the population and 9.701 for the sample. Note that your values will
be different since we’re dealing with random numbers. Our sample mean is only 3 percent below
the population value!

Repeating this sampling process (keeping the population fixed) and plotting the resulting averages,
the histogram of sample averages takes on a bell curve shape. If you look at the standard deviation
of this bell curve, it's exactly the quantity σµ that we defined earlier. This turns out to be
extremely convenient since we know a lot about bell curves.
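
If you'd like to check this yourself, here is a small sketch that reuses the population array from earlier, repeats the sampling many times, and compares the spread of the sample means to the prediction of Equation 3.1 (ignoring the small finite-population correction discussed later):

# repeat the sampling many times, keeping the population fixed
sample_means = [
    np.random.choice(population, size=30, replace=False).mean()
    for _ in range(10000)
]

print(np.std(sample_means))            # spread of the sample averages...
print(population.std() / np.sqrt(30))  # ...is close to sigma / sqrt(N) from Equation 3.1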

Another useful fact is that 95 percent of measurements that fall onto a bell curve happen within
±1.96σµ of the average. This range, (µN − 1.96σµ , µN + 1.96σµ ), is called the 95 percent confidence
interval for the measurement: 95 percent of times you take a sample it will fall within this range of
the true value. Another useful way to look at it is that if you take a sample, and estimate this range,
you’re 95 percent sure that the true value is within this range!

In the context of our example, that means you can expect roughly 95 percent of the time that our
sample average will be within this range of the population average. You can compute the range as
follows:

lower_range = sample.mean() - 1.96 * sample.std(ddof=1) / np.sqrt(len(sample))
upper_range = sample.mean() + 1.96 * sample.std(ddof=1) / np.sqrt(len(sample))

You use ddof=1 because here you’re trying to estimate a population standard deviation from a
sample. To estimate a sample standard deviation, you can leave it as the default of 0. The values we
get here are 8.70 for the lower value and 10.70 for the upper. This means from this sample, the true
population value will be between 8.70 and 10.70 95 percent of the time. We use an interval like this
to estimate a population value.

Notice the factor of $1/\sqrt{N}$, where N is the size of the sample. The standard deviation and
the mean don't change with the sample size (except to get rid of some measurement noise), so the
sample size is the piece that controls the size of your confidence intervals. How much do they
change? If you increase the sample size to $N_{\text{new}} = 100\,N_{\text{old}}$, increasing the sample 100 times, the factor
is $\frac{1}{\sqrt{N_{\text{new}}}} = \frac{1}{\sqrt{100\,N_{\text{old}}}} = \frac{1}{10}\,\frac{1}{\sqrt{N_{\text{old}}}}$. You can see, then, that the error bars only shrink to one-tenth of
their original size. The error bars decrease slowly with the sample size!

We should also note that if the number of samples is comparable to the size of the whole
population, you need to use a finite-population correction. We won’t go into that here since it’s
pretty rare that you actually need to use it.

Note that you can get creative with how you use this rule. A click-through rate (CTR) is a metric
you’re commonly interested in. If a user views a link to an article, that is called an impression. If they
click the link, that is called a click. An impression is an opportunity to click. In that sense, each
impression is a trial, and each click is a success. The CTR, then, is a success rate and can be thought
of as a probability of success given a trial.

If you code a click as a 1 and an impression with no click as a 0, then each impression gives you
either a 1 or a 0. You end up with a big list of 1s and 0s. If you average these, you take the sum of the
outcomes (which is just the number of clicks) and divide by the number of trials. The average of these
binary outcomes is just the click-through rate! You can apply the central limit theorem. You can
take the standard deviation of these 1/0 measurements and divide by the square root of the number
of measurements to get the standard error. You can use the standard error as before to get a
confidence interval for your CTR measurement!
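
As a quick sketch of that trick, using made-up click and impression counts:

import numpy as np

clicks = 120
impressions = 2400

# code each impression as a 1 (click) or a 0 (no click)
outcomes = np.concatenate([np.ones(clicks), np.zeros(impressions - clicks)])

ctr = outcomes.mean()                                    # the click-through rate
std_err = outcomes.std(ddof=1) / np.sqrt(len(outcomes))  # standard error of the CTR

# 95 percent confidence interval for the CTR
lower, upper = ctr - 1.96 * std_err, ctr + 1.96 * std_err
print(ctr, lower, upper)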

Now that you know how to calculate standard errors and confidence intervals, you’ll want to be
able to derive error measurements on calculated quantities. You don’t often care about metrics
alone but rather differences in metrics. That’s how you know, for example, if one thing is performing
better or worse than another thing.

3.4 Error Propagation


So, assume you’ve done all the work to take good random samples of data for two measurements,
and you’ve calculated standard errors for the two measurements. Maybe these measurements are
the click-through rates of two different articles. Suppose you’d like to know if one article is clicking
better than the other. How can you find that out?

A simple way is to look at the difference in the click-through rates. Suppose article 1 has CTR p1
with standard error σ1 and article 2 has CTR p2 with standard error σ2 . Then the difference, d, is
d = p1 − p2 . If the difference is positive, that means p1 > p2 , and article 1 is the better clicking
article. If it’s negative, then article 2 clicks better.

The trouble is that the standard errors might be bigger than d! How can you interpret things in that
case? You need to find the standard error for d. If you can say you’re 95 percent sure that the
difference is positive, then you can say you’re 95 percent sure that article 1 is clicking better.

Let’s take a look at how to estimate the standard error of an arbitrary function of many variables. If
you know calculus, this will be a fun section to read! If you don’t, feel free to skip ahead to the
results.

Start with the Taylor Series, which is written as follows:

$$f(x) \approx \sum_{n \geq 0}^{N < \infty} \frac{f^{(n)}(a)\,(x - a)^n}{n!} \tag{3.2}$$

If you let f be a function of two variables, x and y, then you can compute up to the first order term.

$$f(x, y) \approx f(x_o, y_o) + \frac{\partial f}{\partial x}(x - x_o) + \frac{\partial f}{\partial y}(y - y_o) + O(2) \tag{3.3}$$

Here, O(2) denotes terms that are of size (x − xo )n or (y − yo )n where n is greater than or equal to 2.
Since these differences are relatively small, raising them to larger powers makes them very small
and ignorable.

When xo and yo are the expectations of x and y, you can put this equation in terms of the definition
of variance, $\sigma_f^2 = \langle (f(x, y) - f(x_o, y_o))^2 \rangle$, by subtracting f(xo, yo) from both sides, squaring, and
taking expectation values. You're dropping terms like (x − xo)(y − yo), which amounts to assuming
that the errors in x and y are uncorrelated.

$$\sigma_f^2 \approx \left\langle \left( \frac{\partial f}{\partial x}(x - x_o) + \frac{\partial f}{\partial y}(y - y_o) \right)^2 \right\rangle = \left( \frac{\partial f}{\partial x} \right)^2 \sigma_x^2 + \left( \frac{\partial f}{\partial y} \right)^2 \sigma_y^2 \tag{3.4}$$

Just taking the square root gives you the standard error we were looking for!

This formula should work well whenever the measurement errors in x and y are relatively small and
uncorrelated. Small here means that the relative error, e.g., σx /xo , is less than 1.

You can use this formula to derive a lot of really useful formulae! If you let f (x, y) = x − y, then this
will give you the standard error in the difference that you wanted before! If you let f (x, y) = x/y,
then you get standard error in a ratio, like the standard error in a click rate due to a measurement
error in clicks and impressions!
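
As a minimal sketch of the difference case, with made-up CTRs and standard errors:

import numpy as np

p1, se1 = 0.050, 0.003  # CTR and standard error for article 1
p2, se2 = 0.041, 0.003  # CTR and standard error for article 2

d = p1 - p2                      # difference in click-through rates
se_d = np.sqrt(se1**2 + se2**2)  # standard error of the difference, from f(x, y) = x - y

# 95 percent confidence interval for the difference
lower, upper = d - 1.96 * se_d, d + 1.96 * se_d
print(d, lower, upper)

If the whole interval sits above zero, you can be roughly 95 percent sure that article 1 really is the better-clicking article.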

We'll give a few handy formulae here for reference. Here, c1 and c2 will be constants with no
measurement error associated with them. x and y will be variables with measurement error. If you
ever like to assume that x or y has no error, simply plug in σx = 0, for example, and the formulae will
simplify.

f(x, y)             σf
$c_1 x - c_2 y$     $\sqrt{c_1^2 \sigma_x^2 + c_2^2 \sigma_y^2}$
$c_1 x + c_2 y$     $\sqrt{c_1^2 \sigma_x^2 + c_2^2 \sigma_y^2}$
$x / y$             $f \sqrt{(\sigma_x / x)^2 + (\sigma_y / y)^2}$
$x y$               $f \sqrt{(\sigma_x / x)^2 + (\sigma_y / y)^2}$
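
These formulae are easy to sanity-check numerically. Here is a small Monte Carlo sketch for the ratio case, f(x, y) = x/y, with made-up values for the clicks and impressions and their errors:

import numpy as np

np.random.seed(1)

x0, sigma_x = 100.0, 5.0    # e.g., clicks, with some measurement error
y0, sigma_y = 2000.0, 40.0  # e.g., impressions, with some measurement error

# propagated error from the formula for f(x, y) = x / y
f0 = x0 / y0
sigma_f = f0 * np.sqrt((sigma_x / x0) ** 2 + (sigma_y / y0) ** 2)

# Monte Carlo check: simulate noisy x and y and look at the spread of the ratio
x = np.random.normal(x0, sigma_x, size=100000)
y = np.random.normal(y0, sigma_y, size=100000)

print(sigma_f)        # about 0.0027 from the formula
print((x / y).std())  # close to the same value from the simulation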