Machine Learning
in Production
The Pearson Addison-Wesley
Data & Analytics Series
The series aims to tie all three of these areas together to help the reader build
end-to-end systems for fighting spam; making recommendations; building
personalization; detecting trends, patterns, or problems; and gaining insight
from the data exhaust of systems and user interactions.
Andrew Kelleher
Adam Kelleher
The authors and publisher have taken care in the preparation of this book, but make no expressed or implied
warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for
incidental or consequential damages in connection with or arising out of the use of the information or
programs contained herein.
For information about buying this title in bulk quantities, or for special sales opportunities (which may
include electronic versions; custom cover designs; and content particular to your business, training goals,
marketing focus, or branding interests), please contact our corporate sales department
at [email protected] or (800) 382-3419.
For questions about sales outside the U.S., please contact [email protected].
All rights reserved. This publication is protected by copyright, and permission must be obtained from the
publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by
any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding
permissions, request forms, and the appropriate contacts within the Pearson Education Global Rights &
Permissions Department, please visit www.pearsoned.com/permissions/.
ISBN-13: 978-0-13-411654-9
ISBN-10: 0-13-411654-2
Contents
Foreword xv
Preface xvii
About the Authors xxi
I: Principles of Framing 1
2 Project Workflow 7
2.1 Introduction 7
2.2 The Data Team Context 7
2.2.1 Embedding vs. Pooling Resources 8
2.2.2 Research 8
2.2.3 Prototyping 9
2.2.4 A Combined Workflow 10
2.3 Agile Development and the Product Focus 10
2.3.1 The 12 Principles 11
2.4 Conclusion 15
3 Quantifying Error 17
3.1 Introduction 17
3.2 Quantifying Error in Measured Values 17
3.3 Sampling Error 19
3.4 Error Propagation 21
3.5 Conclusion 23
4.2.2 N-grams 27
4.2.3 Sparsity 28
4.2.4 Feature Selection 28
4.2.5 Representation Learning 30
4.3 Information Loss 33
4.4 Conclusion 34
5 Hypothesis Testing 37
5.1 Introduction 37
5.2 What Is a Hypothesis? 37
5.3 Types of Errors 39
5.4 P-values and Confidence Intervals 40
5.5 Multiple Testing and “P-hacking” 41
5.6 An Example 42
5.7 Planning and Context 43
5.8 Conclusion 44
6 Data Visualization 45
6.1 Introduction 45
6.2 Distributions and Summary Statistics 45
6.2.1 Distributions and Histograms 46
6.2.2 Scatter Plots and Heat Maps 51
6.2.3 Box Plots and Error Bars 55
6.3 Time-Series Plots 58
6.3.1 Rolling Statistics 58
6.3.2 Auto-Correlation 60
6.4 Graph Visualization 61
6.4.1 Layout Algorithms 62
6.4.2 Time Complexity 64
6.5 Conclusion 64
7.2.1 Services 71
7.2.2 Data Sources 72
7.2.3 Batch and Online Computing 72
7.2.4 Scaling 73
7.3 Models 74
7.3.1 Training 74
7.3.2 Prediction 75
7.3.3 Validation 76
7.4 Conclusion 77
8 Comparison 79
8.1 Introduction 79
8.2 Jaccard Distance 79
8.2.1 The Algorithm 80
8.2.2 Time Complexity 81
8.2.3 Memory Considerations 81
8.2.4 A Distributed Approach 81
8.3 MinHash 82
8.3.1 Assumptions 83
8.3.2 Time and Space Complexity 83
8.3.3 Tools 83
8.3.4 A Distributed Approach 83
8.4 Cosine Similarity 84
8.4.1 Complexity 85
8.4.2 Memory Considerations 85
8.4.3 A Distributed Approach 86
8.5 Mahalanobis Distance 86
8.5.1 Complexity 86
8.5.2 Memory Considerations 87
8.5.3 A Distributed Approach 87
8.6 Conclusion 88
9 Regression 89
9.1 Introduction 89
9.1.1 Choosing the Model 90
9.1.2 Choosing the Objective Function 90
9.1.3 Fitting 91
9.1.4 Validation 92
Bibliography 245
Index 247
Foreword
This pragmatic book introduces both machine learning and data science, bridging gaps between
data scientist and engineer, and helping you bring these techniques into production. It is filled
with code examples in Python and visualizations that illustrate the concepts behind the
algorithms. Validation, hypothesis testing, and visualization are introduced early on, as these
are all key to ensuring that your efforts in data science actually solve your problem. Part III of
the book is unique among data science and machine learning books because of its focus on
real-world concerns in optimization. Thinking about hardware, infrastructure, and distributed
systems is a necessary step in bringing machine learning and data science techniques into a
production setting.
Andrew and Adam Kelleher bring their experience in engineering and data science, respectively,
from their work at BuzzFeed. The topics covered and where to provide breadth versus depth are
informed by their real-world experience solving problems in a large production environment.
Algorithms for comparison, classification, clustering, and dimensionality reduction are all
presented with examples of specific problems that can be solved with each. Explorations into more
advanced topics like Bayesian networks or deep learning are provided after the framework for basic
machine learning tasks is laid.
This book is a great addition to the Data & Analytics Series. It provides a well-grounded
introduction to data science and machine learning with a focus on problem-solving. It should
serve as a great resource to any engineer or “accidental programmer” with a more traditional math
or science background looking to apply machine learning to their production applications and
environment.
—Paul Dix, series editor
Preface
Most of this book was written while Andrew and Adam were working together at BuzzFeed. Adam
was a data scientist, Andrew was an engineer, and they spent a good deal of time working together
on the same team! Given that they’re identical twins of triplets, it was confusing and amusing for
everyone involved.
The idea for this book came after PyGotham in New York City in August 2014. There were several
talks relating to the relatively broadly defined field of “data science.” What we noticed was that
many data scientists start their careers driven by the curiosity and excitement of learning new
things. They discover new tools and often have a favorite technique or algorithm. They’ll apply that
tool to the problem they’re working on. When you have a hammer, every problem looks like a nail.
Often, as with neural networks (discussed in Chapter 14), it’s more like a pile driver. We wanted to
push past the hype of data science by giving data scientists, especially at the time they’re starting
their careers, a whole tool box. One could argue the context and error analysis tools of Part I are
actually more important than the advanced techniques discussed in Part III. In fact, they’re a major
motivator in writing this book. It’s very unlikely a choice of algorithm will be successful if its signal
is trumped by its noise, or if there is a high amount of systematic error. We hope this book provides
the right tools to take on the projects our readers encounter, and to be successful in their careers.
There’s no lack of texts in machine learning or computer science. There are even some decent texts
in the field of data science. What we hope to offer with this book is a comprehensive and rigorous
entry point to the field of data science. This tool box is slim and driven by our own experience of
what is useful in practice. We try to avoid opening up paths that lead to research-level problems. If
you’re solving research-level problems as a junior data scientist, you’ve probably gone out of scope.
There’s a critical side of data science that is separate from machine learning: engineering. In Part III
of this text we get into the engineering side. We discuss the problems you’re likely to encounter and
give you the fundamentals you’ll need to overcome them. Part III is essentially a Computer Science
201-202 crash course. Once you know what you’re building, you still have to address many
considerations on the path to production. This means understanding your toolbox from the
perspective of the tools.
This book is intended to be a crash course for those people. We run through a basic procedure for
taking on most data science tasks, encouraging data scientists to use their data set, rather than the
tools of the day, as the starting point. Data-driven data science is key to success. The big open secret
of data science is that while modeling is important, the bread and butter of data science is
simple queries, aggregations, and visualizations. Many industries are in a place where they’re
accumulating and seeing data for the very first time. There is value to be delivered quickly
and with minimal complexity.
Modeling is important, but hard. We believe in applying the principles of agile development to
data science. We talk about this a lot in Chapter 2. Start with a minimal solution: a simple heuristic
based on a data aggregation, for example. Improve the heuristic with a simple model when your
data pipeline is mature and stable. Improve the model when you don’t have anything more
important to do with your time. We’ll provide realistic case studies where this approach is applied.
Chapter 2, “Project Workflow,” sets the context for data science by describing agile development.
It’s a philosophy that helps keep scope small, and development efficient. It can be hard to keep
yourself from trying out the latest machine learning framework or tools offered by cloud platforms,
but it pays off in the long run.
Next, in Chapter 3, “Quantifying Error,” we provide you with a basic introduction to error analysis.
Much of data science is reporting simple statistics. Without understanding the error in those
statistics, you’re likely to come to invalid conclusions. Error analysis is a foundational skill and
important enough to be the first item in your tool kit.
We continue in Chapter 4, “Data Encoding and Preprocessing,” by discovering a few of the many
ways of encoding the real world in the form of data. Naturally this leads us to ask data-driven
questions about the real world. The framework for answering these questions is hypothesis testing,
which we provide a foundation for in Chapter 5, “Hypothesis Testing.”
At this point, we haven’t seen many graphs, and our tool kit is lacking in communicating our
results to the outside (nontechnical) world. We aim to resolve this in Chapter 6, “Data
Visualization,” where we learn many approaches to it. We keep the scope small and aim to mostly
either make plots of quantities we know how to calculate errors for, or plots that resolve some of the
tricky nuances of data visualization. While these tools aren’t as flashy as interactive visualizations
in d3 (which are worth learning!), they serve as a solid foundational skill set for communicating
results to nontechnical audiences.
Having provided the basic tools for working with data, we move on to more advanced concepts in
Part II, “Algorithms and Architecture.” We start with a brief introduction to data architectures in
Chapter 7, “Data Architectures,” and an introduction to basic concepts in machine learning in
Chapter 8, “Comparison.” You now have some very handy methods for measuring the similarities
of objects.
From there, we have some tools to do basic machine learning. In Chapter 9, “Regression,” we
introduce regression and start with one of the most important tools: linear regression. It’s odd to
start with such a simple tool in the age of neural networks and nonlinear machine learning, but
linear regression is outstanding for several reasons. As we’ll detail later, it’s interpretable, stable, and
often provides an excellent baseline. It can describe nonlinearities with some simple tricks, and
recent results have shown that polynomial regression (a simple modification of linear regression)
can outperform deep feedforward networks on typical applications!
From there, we describe one more basic workhorse of regression: the random forest. These are
nonlinear algorithms that rely on a statistical trick, called “bagging,” to provide excellent baseline
performance for a wide range of tasks. If you want a simple model to start a task with and linear
regression doesn’t quite work for you, random forest is a nice candidate.
Having introduced regression and provided some basic examples of the machine learning
workflow, we move on to Chapter 10, “Classification and Clustering.” We see a variety of methods
that work on both vector and graph data. We use this section to provide some basic background on
graphs and an abbreviated introduction to Bayesian inference. We dive into Bayesian inference and
causality in the next chapter.
Our Chapter 11, “Bayesian Networks,” is both unconventional and difficult. We take the view that
Bayesian networks are most intuitive (though not necessarily easiest) from the viewpoint of causal
graphs. We lay this intuition as the foundation for our introduction of Bayesian networks and
come back to it in later sections as the foundation for understanding causal inference. In
Chapter 12, “Dimensional Reduction and Latent Variable Models,” we build off of the foundation
of Bayesian networks to understand PCA and other variants of latent factor models. Topic modeling
is an important example of a latent variable model, and we provide a detailed example on the
newsgroups data set.
As the next to last data-focused chapter, we focus on the problem of causal inference in Chapter 13,
“Causal Inference.” It’s hard to overstate the importance of this skill. Data science typically aims
to inform how businesses act. The assumption is that the data tells you something about the
outcomes of your actions. That can only be true if your analysis has captured causal relationships
and not just correlative ones. In that sense, understanding causation underlies much of what we do
as data scientists. Unfortunately, with a view toward minimizing scope, it’s also too often the first
thing to cut. It’s important to balance stakeholder expectations when you scope a project, and good
causal inference can take time. We hope to empower data scientists to make informed decisions
and not to accept purely correlative results lightly.
Finally, in the last data-focused chapter we provide a section to introduce some of the nuances of
more advanced machine learning techniques in Chapter 14, “Advanced Machine Learning.” We use
neural networks as a tool to discuss overfitting and model capacity. The focus should be on using as
simple a solution as is available. Resist the urge to start with neural networks as a first model. Simple
regression techniques almost always provide a good enough baseline for a first solution.
Up to this point, the platform on which all of the data science happens has been in the
background. It’s where you do the data science and is not the primary focus. Not anymore. In the
last part of this book, Part III, “Bottlenecks and Optimizations,” we go in depth on hardware,
software, and the systems they make up.
We start with a comprehensive look at hardware in Chapter 15, “Hardware Fundamentals.” This
provides a tool box of basic resources we have to work with and also provides a framework to discuss
the constraints under which we must operate. These constraints are physical limitations on what is
possible, and those limitations are realized in the hardware.
Chapter 16, “Software Fundamentals,” provides the fundamentals of software and a basic
description of data logistics with a section on extract-transform-load, commonly known
as ETL.
Next, we give an overview of design considerations for architecture in Chapter 17, “Architecture
Fundamentals.” Architecture is the design for how your whole system fits together. It includes the
components for data storage, data transfer, and computation, as well as how they all communicate
with one another. Some architectures are more efficient than others and objectively do their jobs
better than others. Still, a less efficient solution might be more practical, given constraints on time
and resources. We hope to provide enough context so you can make informed decisions. Even if
you’re a data scientist and not an engineer, we hope to provide enough knowledge so you can at
least understand what’s happening with your data platform.
We then move on to some more advanced topics in engineering. Chapter 18, “The CAP Theorem,”
covers some fundamental bounds on database performance. Finally, we discuss how it all fits
together in the last chapter, which is on network topology: Chapter 19, “Logical Network
Topological Nodes.”
Going Forward
We hope that not only can you do the machine learning side of data science, but you can also
understand what’s possible in your own data platform. From there, you can understand what you
might need to build and find an efficient path for building out your infrastructure as you need to.
We hope that with a complete toolbox, you’re free to realize that the tools are only a part of the
solution. They’re a means to solve real problems, and real problems always have resource
constraints.
If there’s one lesson to take away from this book, it’s that you should always direct your resources
toward solving the problems with the highest return on investment. Solving your problem is a real
constraint. Occasionally, it might be true that nothing but the best machine learning models can
solve it. The question to ask, then, is whether that’s the best problem to solve or if there’s a simpler
one that presents a lower-risk value proposition.
Finally, while we would have liked to have addressed all aspects of production machine learning in
this book, it currently exists more as a production data science text. In subsequent editions, we
intend to cover omissions, especially in the area of machine learning infrastructure. This new
material will include methods to parallelize model training and prediction; the basics of
TensorFlow, Apache Airflow, Spark, and other frameworks and tools; the details of several real
machine learning platforms, including Uber’s Michelangelo, Google’s TFX, and our own work on
similar systems; and avoiding and managing coupling in machine learning systems. We encourage
the reader to seek out the many books, papers, and blog posts covering these topics in the
meantime, and to check for updates on the book’s website at adamkelleher.com/ml_book.
We hope you’ll enjoy learning these tools as much as we did, and we hope this book will save you
time and effort in the long run.
About the Authors
Andrew Kelleher is a staff software engineer and distributed systems architect at Venmo. He was
previously a staff software engineer at BuzzFeed and has worked on data pipelines and algorithm
implementations for modern optimization. He graduated with a BS in physics from Clemson
University. He runs a meetup in New York City that studies the fundamentals behind distributed
systems in the context of production applications, and was ranked one of Fast Company’s most
creative people two years in a row.
Adam Kelleher wrote this book while working as principal data scientist at BuzzFeed and adjunct
professor at Columbia University in the City of New York. As of May 2018, he is chief data scientist
for research at Barclays and teaches causal inference and machine learning products at Columbia.
He graduated from Clemson University with a BS in physics, and has a PhD in cosmology from
University of North Carolina at Chapel Hill.
I
Principles of Framing
Chapter 1, “The Role of the Data Scientist,” provides background information about the field of
data science. This should serve as a starting point to gain context for the role of data science in
industry.
Chapter 2, “Project Workflow,” describes project workflow and how it relates to the principles of
agile software development.
Chapter 3, “Quantifying Error,” introduces the concept of measurement error and describes how to
quantify it. It then shows how to propagate error approximately through calculations.
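As a flavor of what approximate error propagation looks like, here is a minimal sketch of our own (an illustration, not code from the book): for a product f = x · y with independent measurement errors, the relative errors add in quadrature to first order.

```python
import math

def product_with_error(x, dx, y, dy):
    """First-order error propagation for f = x * y, assuming the
    measurement errors dx and dy are independent: the relative
    errors add in quadrature."""
    f = x * y
    df = abs(f) * math.sqrt((dx / x) ** 2 + (dy / y) ** 2)
    return f, df

# A measured length of 2.0 +/- 0.1 times a width of 3.0 +/- 0.2
# gives an area of 6.0 +/- 0.5.
area, err = product_with_error(2.0, 0.1, 3.0, 0.2)
print(f"{area:.1f} +/- {err:.1f}")  # 6.0 +/- 0.5
```

Chapter 3 develops this idea more generally, for sums, quotients, and other calculations built from measured values.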
Chapter 4, “Data Encoding and Preprocessing,” describes how to encode complex, real-world data
into something a machine learning algorithm can understand. Using text processing as the
example case, the chapter explores the information that is lost due to this encoding.
Chapter 5, “Hypothesis Testing,” covers this core skill for a data scientist. You’ll encounter
statistical tests and p-values throughout your work, and in the application of algorithms like
least-squares regression. This chapter provides a brief introduction to statistical hypothesis testing.
Chapter 6, “Data Visualization,” is the last subject before the unit on machine learning. Data
visualization and exploratory data analysis are critical steps in machine learning, where you
evaluate the quality of your data and develop intuition for what you’ll model.
1
The Role of the Data Scientist
1.1 Introduction
We want to set the context for this book by exposing you to the focus on products, rather than
methods, early on. Data scientists often take shortcuts, use rules of thumb, and forgo rigor. They
do this in favor of speed, accepting reasonable levels of uncertainty in the decisions they make.
The world moves fast, and businesses don’t have time for you to write a dissertation on error bars
when they need answers to hard questions.
We’ll begin by describing how the sizes of companies put different demands on a data scientist.
Then, we’ll describe agile development: the framework for building products that keeps them
responsive to the world outside of the office. We’ll discuss ladders and career development. These
are useful for both data scientists and the companies they work for. They lay out expectations that
companies have for their scientists and help scientists see which traits the company has found
useful. Finally, we’ll describe what data scientists actually “do” with their time.
knowledge of a sophisticated product, so they have fuller context and nuanced understanding that
they might not be capable of if they were working on several different products.
A popular team structure is for a product to be built and maintained by a small, mostly autonomous
team. We’ll go into detail on that in the next section. When our company was smaller, team
members often performed a much more general role, acting as the machine learning engineer, the
data analyst, the quantitative researcher, and even the product manager and project manager. As
the company grew, the company hired more people to take on these roles, so team members’ roles
became more specialized.
The Agile Manifesto states its values as four pairs:
Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan
In the Agile Manifesto’s value pairs, the items on the right of each “over” are still
important, but the items on the left are the priorities. This means that team structure is flat, with
more experienced people working alongside (rather than above) more junior people. They share
skills with interactions like pair-coding and peer-reviewing each other’s code. A great benefit to
this is that everyone learns quickly from direct interactions with more senior teammates as peers.
A drawback is that there can be a little friction when senior developers have their code reviewed by
junior team members.
The team’s overall goal is to produce working software quickly, so it’s okay to procrastinate on
documentation. There is generally less focus on process and more on getting things done. As long
as the team knows what’s going on and they’re capable of onboarding new members efficiently
enough, they can focus on the work of shipping products.
On the other side of this, the focus on moving fast causes teams to take shortcuts. This can lead to
systems being more fragile. It can also create an ever-growing list of things to do more perfectly
later. These tasks make up what is called technical debt. Much like debt in finance, it’s a natural part
of the process. Many argue, especially in smaller companies, that it’s a necessary part of the process.
The argument is that a team should do enough “paying the debt” by writing documentation,
making cleaner abstractions, and adding test coverage to keep a sustainable pace of development
and keep from introducing bugs.
Teams generally work directly with stakeholders, and data scientists often have a front-facing role
in these interactions. There is constant feedback between teams and their stakeholders to make
sure the project is still aligned with stakeholder priorities. This is opposed to contract negotiation,
where the requirements are laid out and the team decouples from the stakeholders, delivering the
product at a later date. In business, things move fast. Priorities change, and the team and product
must adapt to those changes. Frequent feedback from stakeholders lets teams learn about changes
quickly and adapt to them before investing too much in the wrong product and features.
It’s hard to predict the future. If you come up with a long- or moderate-term plan, priorities can
shift, team structure can change, and the plan can fall apart. Planning is important, and trying to
stick to a plan is important. You’ll do all you can to make a plan for building an amazing product,
but you’ll often have to respond quickly and agilely to change. It can be hard to throw out your
favorite plans as priorities shift, but it’s a necessary part of the job.
Data scientists are integral members of these teams. They help their teams develop products and
help the product managers evaluate a product’s performance. Throughout product development,
there are critical decisions to make about its features. To that end, a data scientist works with
product managers and engineers to formulate questions to answer. They can be as simple as “What
unit on this page generates the most clicks?” and as complex as “How would the site perform if the
recommender system didn’t exist?” Data lets us answer these questions, and data scientists are the
people who analyze and help interpret the data for making these decisions. They do this in the
context of a dynamic team environment and have to work quickly and effectively in response to
change.
The team’s goal is to build and ship products. There are many skills that are critically important
for this that have nothing to do with data. Ladders go beyond technical skills to include
communication skills, the ability to understand project scope, and the ability to balance long- and
short-term goals.
Generally, companies will define an “individual contributor” track and a “management” track.
Junior scientists will start in the same place and shift onto a specific track as their skills develop.
They generally start out being able to execute tasks on projects with guidance from more senior
team members. They advance to being able to execute tasks more autonomously. Finally, they’re
the ones helping people execute tasks and usually take more of a role in project planning. The shift
often happens at this point, when they hit the “senior” level of their role.
1.2.4 Importance
The data scientist, like everyone on their teams, has an important role. Analysis can lie on the
“critical path” of a project’s development. This means the analysis might need to be finished before
a project can proceed and be delivered. If a data scientist isn’t skillful with their analysis and
delivers too slowly or incompletely, they might block progress. You don’t want to be responsible for
delaying the release of a product or feature!
Without data, decision-makers might move more toward experience and intuition. While these
might not be wrong, they’re not the best way to make decisions. Adding data to the
decision-making process moves business more toward science. The data scientist, then, has a
critical role in making business decisions more rational.
Many data scientists work primarily with experimental data. We’ll cover experiment design and
analysis in some detail as well. Good experiment design is hard. Web-scale experiments, while
often providing large samples, don’t guarantee you’ll actually be able to measure the experimental
effects you’re looking for, even when they’re large! Randomized assignment doesn’t even guarantee
you’ll have correct experimental results (due to selection bias). We’ll cover all of this and more later
in the book.
The other 10 or so percent of the work is the stuff you usually read about when you hear about data
science in the news. It’s the cool machine learning, artificial intelligence, and Internet of Things
applications that are so exciting and drive so many people toward the field of data science. In a very
real sense, these applications are the future, but they’re also the minority of the work data scientists
do, unless they’re the hybrid data scientist/machine learning engineer type. Those roles are
relatively rare and are generally for very senior data scientists. This book is aimed at entry- to
mid-level data scientists. We want to give you the skills to start developing your career in whichever
direction you’d like so you can find the data science role that is perfect for you.
1.3 Conclusion
Getting things right can be hard. Often, the need to move fast supersedes the need to get it right.
Consider the case when you need to decide between two policies, A and B, that cost the same
amount to implement. You must implement one, and time is a factor. If you can show that the effect
of policy A, Y(A), is more positive than Y(B), it doesn’t matter how much more positive it is. As long
as Y(A) − Y(B) > 0, policy A is the right choice. As long as your measurement is good enough to be
within 100 percent of the correct difference, you know enough to make the policy choice!
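This decision rule is easy to see in a quick simulation (the effect sizes here are invented for illustration): as long as the measurement error is smaller than 100 percent of the true difference, the estimated difference keeps the right sign, and you always choose policy A.

```python
import random

random.seed(0)

true_diff = 0.10  # hypothetical true value of Y(A) - Y(B)

# Simulate noisy measurements whose error stays within 100 percent
# of the true difference: the estimate never crosses zero, so the
# policy choice is always correct.
correct_choices = 0
for _ in range(1000):
    error = random.uniform(-0.99 * true_diff, 0.99 * true_diff)
    estimated_diff = true_diff + error
    if estimated_diff > 0:
        correct_choices += 1

print(correct_choices)  # 1000: the sign survives the noise
```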
At this point, you should have a better idea of what it means to be a data scientist. Now that you
understand a little about the context, you can start exploring the product development process.
2
Project Workflow
2.1 Introduction
This chapter focuses on the workflow of executing data science tasks as one-offs versus tasks that
will eventually make up components in production systems. We’ll present a few diagrams of
common workflows and propose combining two as a general approach. At the end of this chapter
you should understand where they fit in an organization that uses data-driven analyses to fuel
innovation. We’ll start by giving a little more context about team structure. Then, we’ll break down
the workflow into several steps: planning, design/preprocessing, analysis, and action. These steps
often blend together and are usually not formalized. At the end, you’ll have gone from the concept
of a product, like a recommender system or a deep-dive analysis, to a working prototype or result.
At that stage, you’re ready to start working with engineers to have the system implemented in
production. That might mean bringing an algorithm into a production setting, automating a
report, or something else.
We should say that as you get closer to feature development, your workflow can evolve to look more
like an engineer’s workflow. Instead of prototyping in a Jupyter Notebook on your computer, you
might prototype a model as a component of a microservice. This chapter is really aimed at getting a
data scientist oriented with the steps that start them toward building prototypes for models.
When you’re prototyping data products, it’s important to keep in mind the broader context of the
organization. The focus should be more on testing value propositions than on perfect architecture,
clean code, and crisp software abstractions. Those things take time, and the world changes quickly.
With that in mind, we spend the remainder of this chapter talking about the agile methodology
and how data products should follow that methodology like any other piece of software.
At the state-of-the-art end of the spectrum, that usually means you'd have to research the best
approach before even beginning to code, implement algorithms from scratch, and potentially solve
open problems with how to scale the implementation.
When you’re working with limited resources, as you usually are in small organizations, the third
option usually isn’t the best choice. If you want a high-quality and competitive product, the first
option might not be the best either. Where you fall along the spectrum between the get-it-done
and state-of-the-art approaches depends on the problem, the context, and the resources available.
If you’re making a healthcare diagnosis system, the stakes are much higher than if you’re building a
content recommendation system.
To understand why you’ll use machine learning at all, you need a little context for where and how
it’s used. In this section, we’ll try to give you some understanding of how teams are structured,
what some workflows might look like, and practical constraints on machine learning.
In the first “pool of resources” approach, each request of the team gets assigned and triaged like
with any project. Some member of the team executes it, and if they need help, they lean on
someone else. A common feature of this approach is that tasks aren’t necessarily related, and it’s
not formally decided that a single member of the team executes all the tasks in a certain domain or
that a single member should handle all incoming requests from a particular person. It makes sense
to have the same person answer the questions for the same stakeholders so they can develop more
familiarity with the products and more rapport with the stakeholders. When teams are small, the
same data scientist will tend to do this for many products, and there’s little specialization.
In the “embedded” approach, a data scientist works with some team in the organization each day,
understanding the team’s needs and their particular goals. In this scenario, the understanding of
problems and the approaches are clear as the data scientist is exposed to them day to day. This is
probably the biggest contrast between the “embedded” and “pool of resources” approaches.
Anecdotally, the former is more common than the latter in small organizations. Larger
organizations tend to have more need and resources for the latter.
This chapter has a dual focus. First we’ll discuss the data science project life cycle in particular, and
then we’ll cover the integration of the data science project cycle with a technical project life cycle.
2.2.2 Research
The steps to develop a project involving a machine learning component aren’t really different from
those of an engineering project. Planning, design, development, integration, deployment, and
post-deployment are still the steps of the product life cycle (see Figure 2.1).
There are two major differences between a typical engineering product and one involving a data
science component. The first is that with a data science component, there are commonly
unknowns, especially in smaller teams or teams with less experience. This creates the need for a
recursive workflow, where analysis can be done and redone.
The second major difference is that many if not most data science tasks are executed without the
eventual goal of deployment to production. This creates a more abridged product life cycle (see
Figure 2.2).
The Field Guide to Data Science [1] explains that four steps comprise the procedure of data science
tasks. Figure 2.2 shows our interpretation.
2. Preprocess the data. This involves cleaning out sources of error (e.g., removing outliers), as
well as reformatting the data as needed.
3. Execute some analyses and draw conclusions. This is where models are applied and tested.
2.2.3 Prototyping
The workflows we’ve outlined are useful for considering the process of data science tasks
independently. These steps, while linear, seem in some ways to mirror the general steps to software
prototyping as outlined in “Software Prototyping: Adoption, Practice and Management”[2].
Figure 2.3 shows our interpretation of these steps.
4. Completion: If the needs are not satisfied, re-assess and incorporate new information.
Figure 2.4 The combined product life cycle of an engineering project dependent on exploratory
analysis
This approach allows data scientists to work with engineers in an initial planning and design phase,
before the engineering team takes lessons learned to inform their own planning and design
processes with technical/infrastructural considerations taken fully into account. It also allows data
scientists to operate free of technical constraints and influences, which could otherwise slow
progress and lead to premature optimization.
When you’re building a new product, you have a value proposition in mind. The issue is that it’s
likely untested. You might have good reason to believe that the proposition will be true: that users
are willing to pay $1 for an app that will monitor their heart rate after surgery (or will tolerate some
number of ads for a free app). You wouldn’t be building it in the first place if you didn’t believe in
the value proposition. Unfortunately, things don’t always turn out how you expect. The whole
purpose of AB tests is to test product changes in the real world and make sure reality aligns with our
expectations. It’s the same with value propositions. You need to build the product to see whether
the product is worth building.
To manage this paradox, we always start with a minimum viable product, or MVP. It’s minimal in the
sense that it’s the simplest thing you can possibly build while still providing the value you’re
proposing to provide. For the heart rate monitor example, it might be a heart rate monitor that
attaches to a hardware device, alerts you when you’re outside of a target range, and then calls an
ambulance if you don’t respond. This is a version of an app that can provide value in the extra
security. Any more features (e.g., providing a fancy dashboard, tracking goals, etc.), and you’re
going beyond just testing the basic value proposition. It takes time to develop features, and that is
time you might invest in testing a different value proposition! You should do as little work as
possible to test the value proposition and then decide whether to invest more resources in the
product or shift focus to something different.
Some version of this will be true with every product you build. You can look at features of large
products as their own products. Facebook’s Messenger app was originally part of the Facebook
platform and was split into its own mobile app. That’s a case where a feature literally evolved into its
own product. Everything you build should have this motivation behind it of being minimal. This
can cause problems, and we have strategies to mitigate them. The cycle of software development is
built around this philosophy, and you can see it in the concept of microservice architecture, as well
as the “sprints” of the product development cycle. This leads us to the principles of the agile
methodology.
1. Our highest priority is to satisfy the customer through early and continuous delivery of
valuable software. The customer is the person you’re providing value to. That can be a consumer,
or it can be the organization you’re working for. The reason you’d like to deliver software early is to
test the value proposition by actually putting it in front of the user. The requirement that the
software be “valuable” means you don’t work so fast that you fail to test your value proposition.
2. Welcome changing requirements, even late in development. Agile processes harness change
for the customer’s competitive advantage. This principle sounds counterintuitive. When
requirements for software change, you have to throw away some of your work, go back to the
planning phase to re-specify the work to be done, and then do the new work. That’s a lot of
inefficiency! Consider the alternative: the customer needs have changed. The value proposition is
no longer satisfied by the software requirements as they were originally planned. If you don’t adapt
your software to the (unknown!) new requirements, the value proposition, as executed by your
software, will fail to meet the customer’s needs. Clearly, it’s better to throw away some work than to
throw away the whole product without testing the value proposition! Even better, if the
competition isn’t keeping this “tight coupling” with their stakeholders (or customers), then your
stakeholders are at a competitive advantage!
3. Deliver working software frequently, from a couple of weeks to a couple of months, with a
preference to the shorter timescale. There are a few reasons for this. One of them is for
consistency with the last principle. You should deliver software often, so you can get frequent
feedback from stakeholders. That will let you adjust your project plans at each step of its
development and make sure you’re aligned with the stakeholders’ needs as well as you can be. The
time when you deliver value is a great time to hear more about the customer’s needs and get ideas
for new features. We don’t think we’ve ever been in a meeting where we put a new product or
feature in front of someone and didn’t hear something along the lines of “You know, it would be
amazing if it also did... .”
Another reason for this is that the world changes quickly. If you don’t deliver value quickly, your
opportunity for providing that value can pass. You might be building a recommender system for an
app and take so long with the prototype that the app is already being deprecated! More realistically,
you might take so long that the organization’s priorities have shifted to other projects and you’ve
lost support (from product managers, engineers, and others) for the system you were working on.
4. Businesspeople and developers must work together daily throughout the project. This
principle is an extension of the previous two. Periodically meeting with the stakeholders isn’t the
only time to connect the software development process with the context of the business.
Developers should at least also be meeting with product managers to keep context with the
business goals of their products. These managers, ideally, would be in their team check-ins each
day, or at the least a few times per week. This makes sure that not only does the team building the
software keep the context of what they’re working on, but the business knows where the software
engineering and data resources (your and your team’s time) are being spent.
5. Build projects around motivated individuals. Give them the environment and support they
need, and trust them to get the job done. One sure way to restrict teams from developing things
quickly is to have them all coordinate their work through a single manager. Not only does that
person have to keep track of everything everyone is working on, but they need to have the time to
physically meet with all of them! This kind of development doesn’t scale. Typically, teams will be
small enough to share a pizza and have one lead per team. The leads can communicate with each
other in a decentralized way (although they do typically all communicate through management
meetings), and you can scale the tech organization by just adding new similar teams.
Each person on a team has a role, and that lets the team function as a mostly autonomous unit.
The product person keeps the business goals in perspective and helps coordinate with stakeholders.
The engineering manager helps make sure the engineers are staying productive and does a lot
of the project planning. The engineers write the code and participate in the project planning
process. The data scientist answers questions for the product person and can have different roles
(depending on seniority) with managing the product’s data sources, building machine learning
and statistical tools for products, and helping figure out the presentation of data and statistics to
stakeholders. In short, the team has everything they need to work quickly and efficiently together
to get the job done. When external managers get too involved in the details of a team’s operations,
they can end up slowing them down just as easily as they can help.
6. The most efficient and effective method of conveying information to and within a
development team is face-to-face conversation. A lot of communication is done over chat
clients, through shared documents, and through email. These media can make it hard to judge
someone’s understanding of project requirements as well as their motivation, focus, and
confidence for getting it done. Team morale can fluctuate throughout product development.
People can tend to err on the side of agreeing to work that they aren’t sure they can execute. When
teams communicate face to face, it’s much easier to notice these issues and handle them before
they’re a problem.
As a further practical issue, when you communicate over digital media, there can be a lot of other
windows, and even other conversations, going on. It can be hard to have a deep conversation with
someone when you aren’t even sure if they’re paying attention!
7. Working software is the primary measure of progress. Your goal is to prove value
propositions. If you follow the steps we’ve already outlined, then the software you’re building is
satisfying stakeholders’ needs. You can do that without implementing the best software
abstractions, cleaning up your code, fully documenting your code, and adding complete test
coverage. In short, you can take as many shortcuts as you like (respecting the next principle), as
long as your software works!
When things break, it’s important to take a retrospective. Always have a meeting to figure out why
it happened but without placing blame on any individual. The whole team is responsible when
things do or don’t work. Make whatever changes are necessary to make sure things don’t break in
the future. That might mean setting a higher standard for test coverage, adding more
documentation around certain types of code (like describing input data), or cleaning up your code
just a little more.
8. Agile processes promote sustainable development. The sponsors, developers, and users
should be able to maintain a constant pace indefinitely. When you’re working fast, it’s easy for
your code to end up messy. It’s easy to write big monolithic blocks of code instead of breaking it up
into nice small functions with test coverage on each. It’s easy to write big services instead of
microservices with clearly defined responsibilities. All of these things get you to a value proposition
quickly and can be great if they’re done in the right context. All of them are also technical debt,
which is something you need to fix later when you end up having to build new features onto the
product.
When you have to change a monolithic block of code you’ve written, it can be really hard to read
through all the logic. It’s even worse if you change teams and someone else has to read through it!
It’s the type of problem that can slow progress to a halt if it isn’t kept in check. You should always
notice when you’re taking shortcuts and consider at each week’s sprint whether you might fix some
small piece of technical debt so it doesn’t build up too much. Remember that you’d like to keep up
your pace of development indefinitely, and you want to keep delivering product features at the
same rate. Your stakeholders will notice if they suddenly stop seeing you for a while! All of this
brings us to the next point.
9. Continuous attention to technical excellence and good design enhances agility. When you
have clear abstractions, code can be much more readable. When functions are short, clean, and
well-documented, it’s easy for anyone to read and modify the code. This is true for software
development as well as for data science. Data scientists in particular can be guilty of poor coding
standards: one character variable names, large blocks of data preprocessing code with no
documentation, and other bad practices. If you make a habit of writing good code, it won’t slow
you down to do it! In fact, it’ll speed up the team as a whole.
10. Simplicity—the art of maximizing the amount of work not done—is essential. Writing a
good MVP can be an art. How do you know exactly the features to write to test your value
proposition? How do you know what software development best practices you can skip to keep a
sustainable pace of development? Which architectural shortcuts can you get away with now and in
the long term?
These are all skills you learn with practice and that your manager and team will be good resources
for advice. If you’re not sure which product features really test the minimum value proposition, talk
to your product manager and your stakeholders. If you’re not sure how sloppy your code can be,
talk to a more senior data scientist, or even to an engineer on your team.
11. The best architectures, requirements, and designs emerge from self-organizing teams.
Some things are hard to understand unless you’re working with them directly. The team writing the
software is going to have the best idea what architectural changes are going to work the best. This is
partly because they know the architecture well and partly because they know their strengths and
weaknesses for executing it. Teams communicate with each other and can collaborate without the
input of other managers. They can build bigger systems that work together than they could on their
own, and when several teams coordinate, they can architect fairly large and complex systems
without a centralized architect guiding them.
12. At regular intervals, the team reflects on how to become more effective and then tunes and
adjusts its behavior accordingly. While the focus is on delivering value quickly and often and
working closely with stakeholders to do that, teams also have to be introspective occasionally to
make sure they’re working as well as they can. This is often done once per week in a “retrospective”
meeting, where the team will get together and talk about what went well during the past week,
what didn’t work well, and what they’ll plan to change for the next week.
These are the 12 principles of agile development. They apply to data science as well as software. If
someone ever proposes a big product loaded with features and says “Let’s build this!” you should
think about how to do it agilely. Think about what the main value proposition is (chances are that
it contains several). Next, think of the minimal version of it that lets you test the proposition. Build
it, and see whether it works!
Often in data science, there are extra shortcuts you can take. You can use a worse-performing model
while you work on a better one just to fill the gap that the engineers are building around. You can
write big monolithic functions that return a model just by copying and pasting a prototype from a
Jupyter Notebook. You can use CSV files instead of running database queries when you need static
data sets. Get creative, but always think about what you’d need to do to build something right. That
might be creating good abstractions around your models, replacing CSV files with database queries
to get live data, or just writing cleaner code.
To summarize, there are four points to the Agile Manifesto. Importantly, these are tendencies. Real
life is not usually dichotomous. These points really reflect our priorities:

- Individuals and interactions over processes and tools
- Working software over comprehensive documentation
- Customer collaboration over contract negotiation
- Responding to change over following a plan
2.4 Conclusion
Ideally now you have a good idea of what the development process looks like and where you fit in.
We hope you’ll take the agile philosophy as a guide when building data products and will see the
value in keeping a tight feedback loop with your stakeholders.
Now that you have the context for doing data science, let’s learn the skills!
3
Quantifying Error
3.1 Introduction
Most measurements have some error associated with them. We often think of the numbers we
report as exact values (e.g., “there were 9,126 views of this article”). Anyone who has implemented
multiple tracking systems that are supposed to measure the same quantity knows there is rarely
perfect agreement between measurements. The chances are that neither system measures the
ground truth—there are always failure modes, and it’s hard to know how often failures happen.
Aside from errors in data collection, some measured quantities are uncertain. Instead of running
an experiment with all users of your website, you’ll work with a sample. Metrics like retention and
engagement you measure in the sample are noisy measurements of what you’d see in the whole
population. You can quantify that noise and make sure you bound the error from sampling to
something within reasonable limits.
In this chapter, we’ll discuss the concept of error analysis. You’ll learn how to think about error in a
measurement, and you’ll learn how to calculate error in simple quantities you derive from
measurements. You’ll develop some intuition for when error matters a lot and when you can safely
ignore it.
Figure 3.1 Several measurements of a string are given by the red dots along the number line. The
true length of the string is shown with the vertical blue line. If you look at the average value of the
measurements, it falls around the center of the group of red dots. It’s higher than the true value, so
you have a positive bias in your measurement process.
There is a true length to the string, but the string bends a little. To straighten it out to measure it,
you have to stretch it a little, so the measurement tends to be a little longer than it should be. If you
average your measurements together, the average will be a little higher than the true length of the
string. This difference between the “expected” length from your measurements and the “true”
length is called systematic error. It’s also sometimes called bias.
If there is no bias, there is still some random spread your measurements take around the true value.
On average you measure the true value, but each measurement is a little low or a little high. This
type of error is called random error. It’s what we commonly think of as measurement noise. We usually
measure it with the standard deviations of the measurements around their average value.
If measurements have no systematic error, then you can take a large enough sample of them,
average them together, and find the true value! This is a great situation to be in, if you’re able to take
several independent measurements. Unfortunately, it’s not a common situation. Usually you can
make only one measurement and expect that there is at least a little systematic error (e.g., data is
only lost, so count measurements are systematically low).
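To build intuition for the two kinds of error, here is a small simulation (the numbers are invented): a measurement process with a fixed positive bias and some noise. Averaging many measurements removes the random error, but the average still lands above the true value by the amount of the systematic error.

```python
import numpy as np

rng = np.random.default_rng(42)

true_length = 10.0   # hypothetical true length of the string, in cm
bias = 0.3           # systematic error: stretching adds ~0.3 cm on average
noise = 0.5          # random error: spread of individual measurements

measurements = true_length + bias + rng.normal(0.0, noise, size=10_000)

# The average converges to true_length + bias, not true_length:
# no amount of averaging removes the systematic error.
systematic_error = measurements.mean() - true_length
random_error = measurements.std(ddof=1)
print(round(systematic_error, 2))  # close to 0.3
print(round(random_error, 2))      # close to 0.5
```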
Consider the case of tracking impressions on a web page. When a user clicks a link to the page on
which you’re tracking impressions, in some instances, they will not follow the link completely (but
exit before arriving at the final page). Still closer to conversion, they may load the page but not
allow the pixel tracking impressions on that page to be requested. Further, we may double-count
impressions in the case a user refreshes the page for whatever reason (which happens quite a lot).
These all contribute random and systematic errors, going in different directions. It’s hard to say
whether the measurement will be systematically low or high.
There are certainly ways to quantify errors in tracking. Server logs, for example, can tell the story of
requests complete with response codes your tracking pixels may have missed. tcpdump or
wireshark can be used for monitoring attempted connections that get dropped or disconnected
before the requests are fulfilled. The main consideration is that both of these methods are difficult
in real-time reporting applications. That doesn’t mean, though, that you can’t do a sampled
comparison of tracked impressions to impressions collected through these other, less convenient,
more expensive means.
Once you’ve implemented your tracking system and checked it against some ground truth (e.g.,
another system, like Google Analytics), you’ll usually assume the error in these raw numbers is
small and that you can safely ignore it.
There is another context where you have to deal with systematic and random error, where you can’t
safely ignore the error. This comes up most often in AB testing, where you look at a performance
metric within a subpopulation of your users (i.e., those participating in the experiment) and want
to extrapolate that result to all of your users. The measurement you make with your experiment,
you hope, is an “unbiased” measurement (one with no systematic error) of the “true” value of the
metric (the one you would measure over the whole population).
To understand error from sampling, it’ll be helpful to take a little side trip into sampling error. The
end result is familiar: with each measurement, we should have random and systematic error in
comparison to the “true” value.
Suppose you run a news website, and you want to know the average amount of time it takes you to
read an article on your website. You could read every article on the site, record your reading time,
and get your answer that way, but that’s incredibly labor intensive. It would be great if you could
read a much smaller number of the articles and be reasonably confident about the average reading
time.
The trick you’ll use is this: you can take a random sample of articles on the website, measure the
reading time for those articles, and take the average. This will be a measurement of the average
reading time for articles on the whole website. It probably won’t match the actual average reading
time exactly, the one you’d measure if you read all of the articles. This true number is called the
population average since it’s averaging over the whole population of articles instead of just a sample
from it.
How close does the average read time in your sample compare with the average read time across the
whole site? This is where the magic happens. The result comes from the central limit theorem. It
says that the average of N independent measurements, µ_N, from a population is an unbiased
estimate for the population average, µ, as long as you have a reasonably large number of samples.
Even better, it says the random error of the sample average, σ_µ, is just the sample standard
deviation, σ_N, divided by the square root of the sample size, N:

σ_µ = σ_N / √N    (3.1)
In practice, N = 30 is a pretty good rule of thumb for using this approximation. Let’s draw a sample
from a uniform distribution to try it.
First, let’s make the population of reading times. Let’s make it uniform over the range of 5 to 15
minutes and generate a population of 1,000 articles.
import numpy as np

population = np.random.uniform(5, 15, size=1000)
Note that in practice, you won’t have access to a whole population to sample from. If these were the
reading times of articles, none of the reading times is even measured when you start the process!
Instead, you’d sample 30 articles from a database and then read those articles to generate your
sample from the populations. We generate a population to sample from here, just so we can check
how close our sample mean is to the population mean.
Note also that database queries don’t sample randomly from the database. To get random sampling,
you can use the rand() SQL function to generate random floats between 0 and 1. Then, you can
sort by the random value, or limit to results with rand() < 0.05, for example, to keep 5 percent of
results. An example query might look like this (NOTE: This should never be used on large tables):

SELECT * FROM articles WHERE rand() < 0.05;
Continuing, you can draw a sample of 30 articles and compute the population and sample means,
as shown here:

sample = np.random.choice(population, size=30)
population.mean()
sample.mean()
which for us returns 10.086 for the population and 9.701 for the sample. Note that your values will
be different since we’re dealing with random numbers. Our sample mean is only 3 percent below
the population value!
Repeating this sampling process (keeping the population fixed) and plotting the resulting averages,
the histogram of sample averages takes on a bell curve shape. If you look at the standard deviation
of this bell curve, it’s exactly the quantity that we measured earlier, σµ . This turns out to be
extremely convenient since we know a lot about bell curves.
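A quick simulation, mirroring the reading-time example above, shows this behavior directly: the standard deviation of repeated sample averages comes out close to the prediction σ_N/√N.

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.uniform(5, 15, size=1000)  # reading times, as above

N = 30
# Repeat the sampling process many times, keeping the population fixed
sample_means = [
    rng.choice(population, size=N).mean()
    for _ in range(5_000)
]

# Spread of the bell curve of sample averages ...
empirical_sigma_mu = float(np.std(sample_means))
# ... versus the central limit theorem's prediction, sigma_N / sqrt(N)
predicted_sigma_mu = float(population.std() / np.sqrt(N))

print(round(empirical_sigma_mu, 2))
print(round(predicted_sigma_mu, 2))
```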
Another useful fact is that 95 percent of measurements that fall onto a bell curve happen within
±1.96σµ of the average. This range, (µN − 1.96σµ , µN + 1.96σµ ), is called the 95 percent confidence
interval for the measurement: 95 percent of times you take a sample it will fall within this range of
the true value. Another useful way to look at it is that if you take a sample, and estimate this range,
you’re 95 percent sure that the true value is within this range!
In the context of our example, that means you can expect roughly 95 percent of the time that our
sample average will be within this range of the population average. You can compute the range as
follows:

standard_error = sample.std(ddof=1) / np.sqrt(len(sample))
lower = sample.mean() - 1.96 * standard_error
upper = sample.mean() + 1.96 * standard_error
You use ddof=1 because here you’re trying to estimate a population standard deviation from a
sample. To estimate a sample standard deviation, you can leave it as the default of 0. The values we
get here are 8.70 for the lower value and 10.70 for the upper. This means from this sample, the true
population value will be between 8.70 and 10.70 95 percent of the time. We use an interval like this
to estimate a population value.
Notice the factor of 1/√N in the standard error, where N is the size of the sample. The standard deviation and
the mean don't change with the sample size (except to get rid of some measurement noise), so the
sample size is the piece that controls the size of your confidence intervals. How much do they
change? If you increase the sample size to N_new = 100 N_old, increasing the sample 100 times, the factor
becomes 1/√N_new = 1/√(100 N_old) = (1/10) · 1/√N_old. You can see, then, that the error bars only shrink to one-tenth of
their original size. The error bars decrease slowly with the sample size!
We should also note that if the number of samples is comparable to the size of the whole
population, you need to use a finite-population correction. We won’t go into that here since it’s
pretty rare that you actually need to use it.
Note that you can get creative with how you use this rule. A click-through rate (CTR) is a metric
you’re commonly interested in. If a user views a link to an article, that is called an impression. If they
click the link, that is called a click. An impression is an opportunity to click. In that sense, each
impression is a trial, and each click is a success. The CTR, then, is a success rate and can be thought
of as a probability of success given a trial.
If you code a click as a 1 and an impression with no click as a 0, then each impression gives you
either a 1 or a 0. You end up with a big list of 1s and 0s. If you average these, you take the sum of the
outcomes, which is just the number of clicks divided by the number of trials. The average of these
binary outcomes is just the click-through rate! You can apply the central limit theorem. You can
take the standard deviation of these 1/0 measurements and divide by the square root of the number
of measurements to get the standard error. You can use the standard error as before to get a
confidence interval for your CTR measurement!
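As a sketch of this trick, here the impressions are simulated with an assumed 5 percent click probability rather than taken from real traffic:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate 10,000 impressions; each is 1 (click) or 0 (no click).
outcomes = rng.binomial(1, 0.05, size=10_000)

ctr = outcomes.mean()  # the average of the 1s and 0s is the CTR
# Standard error: std of the 1/0 outcomes over sqrt(number of impressions).
std_err = outcomes.std(ddof=1) / np.sqrt(len(outcomes))

print(ctr, ctr - 1.96 * std_err, ctr + 1.96 * std_err)
```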
Now that you know how to calculate standard errors and confidence intervals, you’ll want to be
able to derive error measurements on calculated quantities. You don’t often care about metrics
alone but rather differences in metrics. That’s how you know, for example, if one thing is performing
better or worse than another thing.
A simple way is to look at the difference in the click-through rates. Suppose article 1 has CTR p1
with standard error σ1 and article 2 has CTR p2 with standard error σ2 . Then the difference, d, is
d = p1 − p2 . If the difference is positive, that means p1 > p2 , and article 1 is the better clicking
article. If it’s negative, then article 2 clicks better.
The trouble is that the standard errors might be bigger than d! How can you interpret things in that
case? You need to find the standard error for d. If you can say you’re 95 percent sure that the
difference is positive, then you can say you’re 95 percent sure that article 1 is clicking better.
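Anticipating the result derived below, the standard error of a difference combines the two errors in quadrature. Here is a sketch of the comparison; the CTRs and their errors are made-up numbers:

```python
import math

# Hypothetical measurements for two articles.
p1, se1 = 0.057, 0.0023  # article 1: CTR and its standard error
p2, se2 = 0.048, 0.0021  # article 2

d = p1 - p2
# Standard error of a difference of independent measurements.
se_d = math.sqrt(se1**2 + se2**2)

# 95 percent confidence interval for the difference.
lower, upper = d - 1.96 * se_d, d + 1.96 * se_d
print(lower, upper)
if lower > 0:
    print("95 percent sure article 1 clicks better")
```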
Let’s take a look at how to estimate the standard error of an arbitrary function of many variables. If
you know calculus, this will be a fun section to read! If you don’t, feel free to skip ahead to the
results.
The key tool is the Taylor expansion of a function f around a point a, truncated at some finite order:

$$f(x) \approx \sum_{n \ge 0}^{N < \infty} \frac{f^{(n)}(a)\,(x - a)^n}{n!} \qquad (3.2)$$
If you let f be a function of two variables, x and y, then you can compute up to the first-order term:

$$f(x, y) \approx f(x_o, y_o) + \frac{\partial f}{\partial x}(x - x_o) + \frac{\partial f}{\partial y}(y - y_o) + O(2) \qquad (3.3)$$
Here, O(2) denotes terms that are of size (x − xo )n or (y − yo )n where n is greater than or equal to 2.
Since these differences are relatively small, raising them to larger powers makes them very small
and ignorable.
When xo and yo are the expectations of x and y, you can put this equation in terms of the definition
of variance, σf² = ⟨(f(x, y) − f(xo, yo))²⟩, by subtracting f(xo, yo) from both sides, squaring, and
taking expectation values. You're dropping terms like ⟨(x − xo)(y − yo)⟩, which amounts to assuming
that the errors in x and y are uncorrelated.
$$\sigma_f^2 \approx \left\langle \left( \frac{\partial f}{\partial x}(x - x_o) + \frac{\partial f}{\partial y}(y - y_o) \right)^2 \right\rangle = \left( \frac{\partial f}{\partial x} \right)^2 \sigma_x^2 + \left( \frac{\partial f}{\partial y} \right)^2 \sigma_y^2 \qquad (3.4)$$
Just taking the square root gives you the standard error we were looking for!
This formula should work well whenever the measurement errors in x and y are relatively small and
uncorrelated. Small here means that the relative error, e.g., σx /xo , is less than 1.
You can use this formula to derive a lot of really useful formulae! If you let f (x, y) = x − y, then this
will give you the standard error in the difference that you wanted before! If you let f (x, y) = x/y,
then you get standard error in a ratio, like the standard error in a click rate due to a measurement
error in clicks and impressions!
We'll give a few handy formulae here for reference. Here, c1 and c2 will be constants with no
measurement error associated with them; x and y will be variables with measurement error. If you
ever want to assume that x or y has no error, simply plug in σx = 0, for example, and the formulae will
simplify.
f(x, y)          σf
c1 x − c2 y      √(c1² σx² + c2² σy²)
c1 x + c2 y      √(c1² σx² + c2² σy²)
x/y              |f| √((σx/x)² + (σy/y)²)
x y              |f| √((σx/x)² + (σy/y)²)
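The table's rules are easy to wrap in small helper functions. Here is a sketch; the click and impression errors in the example are invented for illustration:

```python
import math

def diff_err(sx, sy, c1=1.0, c2=1.0):
    """Standard error of c1*x - c2*y (same form for c1*x + c2*y)."""
    return math.sqrt(c1**2 * sx**2 + c2**2 * sy**2)

def ratio_err(f, x, y, sx, sy):
    """Standard error of f = x / y (same form for f = x * y)."""
    return abs(f) * math.sqrt((sx / x)**2 + (sy / y)**2)

# Example: CTR = clicks / impressions, with measurement error in both.
clicks, s_clicks = 500.0, 22.0
imps, s_imps = 10_000.0, 100.0
ctr = clicks / imps
print(ratio_err(ctr, clicks, imps, s_clicks, s_imps))
```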