Machine Learning in Production
Developing and Optimizing
Data Science Workflows and
Applications
Andrew Kelleher
Adam Kelleher
The authors and publisher have taken care in the preparation of this book, but make no expressed or implied
warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for
incidental or consequential damages in connection with or arising out of the use of the information or
programs contained herein.
For information about buying this title in bulk quantities, or for special sales opportunities (which may
include electronic versions; custom cover designs; and content particular to your business, training goals,
marketing focus, or branding interests), please contact our corporate sales department
at [email protected] or (800) 382-3419.
For questions about sales outside the U.S., please contact [email protected].
All rights reserved. This publication is protected by copyright, and permission must be obtained from the
publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by
any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding
permissions, request forms, and the appropriate contacts within the Pearson Education Global Rights &
Permissions Department, please visit www.pearsoned.com/permissions/.
ISBN-13: 978-0-13-411654-9
ISBN-10: 0-13-411654-2
Contents
Foreword xv
Preface xvii
About the Authors xxi
I: Principles of Framing 1
2 Project Workflow 7
2.1 Introduction 7
2.2 The Data Team Context 7
2.2.1 Embedding vs. Pooling Resources 8
2.2.2 Research 8
2.2.3 Prototyping 9
2.2.4 A Combined Workflow 10
2.3 Agile Development and the Product Focus 10
2.3.1 The 12 Principles 11
2.4 Conclusion 15
3 Quantifying Error 17
3.1 Introduction 17
3.2 Quantifying Error in Measured Values 17
3.3 Sampling Error 19
3.4 Error Propagation 21
3.5 Conclusion 23
4.2.2 N-grams 27
4.2.3 Sparsity 28
4.2.4 Feature Selection 28
4.2.5 Representation Learning 30
4.3 Information Loss 33
4.4 Conclusion 34
5 Hypothesis Testing 37
5.1 Introduction 37
5.2 What Is a Hypothesis? 37
5.3 Types of Errors 39
5.4 P-values and Confidence Intervals 40
5.5 Multiple Testing and “P-hacking” 41
5.6 An Example 42
5.7 Planning and Context 43
5.8 Conclusion 44
6 Data Visualization 45
6.1 Introduction 45
6.2 Distributions and Summary Statistics 45
6.2.1 Distributions and Histograms 46
6.2.2 Scatter Plots and Heat Maps 51
6.2.3 Box Plots and Error Bars 55
6.3 Time-Series Plots 58
6.3.1 Rolling Statistics 58
6.3.2 Auto-Correlation 60
6.4 Graph Visualization 61
6.4.1 Layout Algorithms 62
6.4.2 Time Complexity 64
6.5 Conclusion 64
7.2.1 Services 71
7.2.2 Data Sources 72
7.2.3 Batch and Online Computing 72
7.2.4 Scaling 73
7.3 Models 74
7.3.1 Training 74
7.3.2 Prediction 75
7.3.3 Validation 76
7.4 Conclusion 77
8 Comparison 79
8.1 Introduction 79
8.2 Jaccard Distance 79
8.2.1 The Algorithm 80
8.2.2 Time Complexity 81
8.2.3 Memory Considerations 81
8.2.4 A Distributed Approach 81
8.3 MinHash 82
8.3.1 Assumptions 83
8.3.2 Time and Space Complexity 83
8.3.3 Tools 83
8.3.4 A Distributed Approach 83
8.4 Cosine Similarity 84
8.4.1 Complexity 85
8.4.2 Memory Considerations 85
8.4.3 A Distributed Approach 86
8.5 Mahalanobis Distance 86
8.5.1 Complexity 86
8.5.2 Memory Considerations 87
8.5.3 A Distributed Approach 87
8.6 Conclusion 88
9 Regression 89
9.1 Introduction 89
9.1.1 Choosing the Model 90
9.1.2 Choosing the Objective Function 90
9.1.3 Fitting 91
9.1.4 Validation 92
Bibliography 245
Index 247
Foreword
This pragmatic book introduces both machine learning and data science, bridging gaps between
data scientist and engineer, and helping you bring these techniques into production. It helps
ensure that your efforts actually solve your problem, and offers unique coverage of real-world
optimization in production settings. This book is filled with code examples in Python and
visualizations to illustrate concepts in algorithms. Validation, hypothesis testing, and visualization
are introduced early on as these are all key to ensuring that your efforts in data science are actually
solving your problem. Part III of the book is unique among data science and machine learning
books because of its focus on real-world concerns in optimization. Thinking about hardware,
infrastructure, and distributed systems is a necessary step in bringing machine learning and data science
techniques into a production setting.
Andrew and Adam Kelleher bring their experience in engineering and data science, respectively,
from their work at BuzzFeed. The topics covered and where to provide breadth versus depth are
informed by their real-world experience solving problems in a large production environment.
Algorithms for comparison, classification, clustering, and dimensionality reduction are all
presented with examples of specific problems that can be solved with each. Explorations into more
advanced topics like Bayesian networks or deep learning are provided after the framework for basic
machine learning tasks is laid.
This book is a great addition to the Data & Analytics Series. It provides a well-grounded
introduction to data science and machine learning with a focus on problem-solving. It should
serve as a great resource to any engineer or “accidental programmer” with a more traditional math
or science background looking to apply machine learning to their production applications and
environment.
—Paul Dix, series editor
Preface
Most of this book was written while Andrew and Adam were working together at BuzzFeed. Adam
was a data scientist, Andrew was an engineer, and they spent a good deal of time working together
on the same team! Given that they’re identical twins of triplets, it was confusing and amusing for
everyone involved.
The idea for this book came after PyGotham in New York City in August 2014. There were several
talks relating to the relatively broadly defined field of “data science.” What we noticed was that
many data scientists start their careers driven by the curiosity and excitement of learning new
things. They discover new tools and often have a favorite technique or algorithm. They’ll apply that
tool to the problem they’re working on. When you have a hammer, every problem looks like a nail.
Often, as with neural networks (discussed in Chapter 14), it’s more like a pile driver. We wanted to
push past the hype of data science by giving data scientists, especially at the time they’re starting
their careers, a whole tool box. One could argue the context and error analysis tools of Part I are
actually more important than the advanced techniques discussed in Part III. In fact, they’re a major
motivator in writing this book. It’s very unlikely a choice of algorithm will be successful if its signal
is trumped by its noise, or if there is a high amount of systematic error. We hope this book provides
the right tools to take on the projects our readers encounter, and to be successful in their careers.
There’s no lack of texts in machine learning or computer science. There are even some decent texts
in the field of data science. What we hope to offer with this book is a comprehensive and rigorous
entry point to the field of data science. This tool box is slim and driven by our own experience of
what is useful in practice. We try to avoid opening up paths that lead to research-level problems. If
you’re solving research-level problems as a junior data scientist, you’ve probably gone out of scope.
There’s a critical side of data science that is separate from machine learning: engineering. In Part III
of this text we get into the engineering side. We discuss the problems you’re likely to encounter and
give you the fundamentals you’ll need to overcome them. Part III is essentially a Computer Science
201-202 crash course. Once you know what you’re building, you still have to address many
considerations on the path to production. This means understanding your toolbox from the
perspective of the tools.
This book is intended to be a crash course for those people. We run through a basic procedure for
taking on most data science tasks, encouraging data scientists to use their data set, rather than the
tools of the day, as the starting point. Data-driven data science is key to success. The big open secret
of data science is that while modeling is important, the bread and butter of data science is
simple queries, aggregations, and visualizations. Many industries are in a place where they’re
accumulating and seeing data for the very first time. There is value to be delivered quickly
and with minimal complexity.
Modeling is important, but hard. We believe in applying the principles of agile development to
data science. We talk about this a lot in Chapter 2. Start with a minimal solution: a simple heuristic
based on a data aggregation, for example. Improve the heuristic with a simple model when your
data pipeline is mature and stable. Improve the model when you don’t have anything more
important to do with your time. We’ll provide realistic case studies where this approach is applied.
Chapter 2, “Project Workflow,” sets the context for data science by describing agile development.
It’s a philosophy that helps keep scope small, and development efficient. It can be hard to keep
yourself from trying out the latest machine learning framework or tools offered by cloud platforms,
but it pays off in the long run.
Next, in Chapter 3, “Quantifying Error,” we provide you with a basic introduction to error analysis.
Much of data science is reporting simple statistics. Without understanding the error in those
statistics, you’re likely to come to invalid conclusions. Error analysis is a foundational skill and
important enough to be the first item in your tool kit.
We continue in Chapter 4, “Data Encoding and Preprocessing,” by discovering a few of the many
ways of encoding the real world in the form of data. Naturally this leads us to ask data-driven
questions about the real world. The framework for answering these questions is hypothesis testing,
which we provide a foundation for in Chapter 5, “Hypothesis Testing.”
At this point, we haven’t seen many graphs, and our tool kit is lacking in communicating our
results to the outside (nontechnical) world. We aim to resolve this in Chapter 6, “Data
Visualization,” where we learn many approaches to it. We keep the scope small and aim to mostly
either make plots of quantities we know how to calculate errors for, or plots that resolve some of the
tricky nuances of data visualization. While these tools aren’t as flashy as interactive visualizations
in d3 (which are worth learning!), they serve as a solid foundational skill set for communicating
results to nontechnical audiences.
Having provided the basic tools for working with data, we move on to more advanced concepts in
Part II, “Algorithms and Architecture.” We start with a brief introduction to data architectures in
Chapter 7, “Data Architectures,” and an introduction to basic concepts in machine learning in
Chapter 8, “Comparison.” You now have some very handy methods for measuring the similarities
of objects.
From there, we have some tools to do basic machine learning. In Chapter 9, “Regression,” we
introduce regression and start with one of the most important tools: linear regression. It’s odd to
start with such a simple tool in the age of neural networks and nonlinear machine learning, but
linear regression is outstanding for several reasons. As we’ll detail later, it’s interpretable, stable, and
often provides an excellent baseline. It can describe nonlinearities with some simple tricks, and
recent results have shown that polynomial regression (a simple modification of linear regression)
can outperform deep feedforward networks on typical applications!
From there, we describe one more basic workhorse of regression: the random forest. These are
nonlinear algorithms that rely on a statistical trick, called “bagging,” to provide excellent baseline
performance for a wide range of tasks. If you want a simple model to start a task with and linear
regression doesn’t quite work for you, random forest is a nice candidate.
Having introduced regression and provided some basic examples of the machine learning
workflow, we move on to Chapter 10, “Classification and Clustering.” We see a variety of methods
that work on both vector and graph data. We use this section to provide some basic background on
graphs and an abbreviated introduction to Bayesian inference. We dive into Bayesian inference and
causality in the next chapter.
Our Chapter 11, “Bayesian Networks,” is both unconventional and difficult. We take the view that
Bayesian networks are most intuitive (though not necessarily easiest) from the viewpoint of causal
graphs. We lay this intuition as the foundation for our introduction of Bayesian networks and
come back to it in later sections as the foundation for understanding causal inference. In
Chapter 12, “Dimensional Reduction and Latent Variable Models,” we build off of the foundation
of Bayesian networks to understand PCA and other variants of latent factor models. Topic modeling
is an important example of a latent variable model, and we provide a detailed example on the
newsgroups data set.
As the next to last data-focused chapter, we focus on the problem of causal inference in Chapter 13,
“Causal Inference.” It’s hard to overstate the importance of this skill. Data science typically aims
to inform how businesses act. The assumption is that the data tells you something about the
outcomes of your actions. That can only be true if your analysis has captured causal relationships
and not just correlative ones. In that sense, understanding causation underlies much of what we do
as data scientists. Unfortunately, with a view toward minimizing scope, it’s also too often the first
thing to cut. It’s important to balance stakeholder expectations when you scope a project, and good
causal inference can take time. We hope to empower data scientists to make informed decisions
and not to accept purely correlative results lightly.
Finally, in the last data-focused chapter we provide a section to introduce some of the nuances of
more advanced machine learning techniques in Chapter 14, “Advanced Machine Learning.” We use
neural networks as a tool to discuss overfitting and model capacity. The focus should be on using as
simple a solution as is available. Resist the urge to start with neural networks as a first model. Simple
regression techniques almost always provide a good enough baseline for a first solution.
Up to this point, the platform on which all of the data science happens has been in the
background. It’s where you do the data science and is not the primary focus. Not anymore. In the
last part of this book, Part III, “Bottlenecks and Optimizations,” we go in depth on hardware,
software, and the systems they make up.
We start with a comprehensive look at hardware in Chapter 15, “Hardware Fundamentals.” This
provides a tool box of basic resources we have to work with and also provides a framework to discuss
the constraints under which we must operate. These constraints are physical limitations on what is
possible, and those limitations are realized in the hardware.
Chapter 16, “Software Fundamentals,” provides the fundamentals of software and a basic
description of data logistics with a section on extract-transform-load, commonly known
as ETL.
Next, we give an overview of design considerations for architecture in Chapter 17, “Architecture
Fundamentals.” Architecture is the design for how your whole system fits together. It includes the
components for data storage, data transfer, and computation, as well as how they all communicate
with one another. Some architectures are more efficient than others and objectively do their jobs
better than others. Still, a less efficient solution might be more practical, given constraints on time
and resources. We hope to provide enough context so you can make informed decisions. Even if
you’re a data scientist and not an engineer, we hope to provide enough knowledge so you can at
least understand what’s happening with your data platform.
We then move on to some more advanced topics in engineering. Chapter 18, “The CAP Theorem,”
covers some fundamental bounds on database performance. Finally, we discuss how it all fits
together in the last chapter, which is on network topology: Chapter 19, “Logical Network
Topological Nodes.”
Going Forward
We hope that not only can you do the machine learning side of data science, but you can also
understand what’s possible in your own data platform. From there, you can understand what you
might need to build and find an efficient path for building out your infrastructure as you need to.
We hope that with a complete toolbox, you’re free to realize that the tools are only a part of the
solution. They’re a means to solve real problems, and real problems always have resource
constraints.
If there’s one lesson to take away from this book, it’s that you should always direct your resources
toward solving the problems with the highest return on investment. Solving your problem is a real
constraint. Occasionally, it might be true that nothing but the best machine learning models can
solve it. The question to ask, then, is whether that’s the best problem to solve or if there’s a simpler
one that presents a lower-risk value proposition.
Finally, while we would have liked to have addressed all aspects of production machine learning in
this book, it currently exists more as a production data science text. In subsequent editions, we
intend to cover omissions, especially in the area of machine learning infrastructure. This new
material will include methods to parallelize model training and prediction; the basics of
Tensorflow, Apache Airflow, Spark, and other frameworks and tools; the details of several real
machine learning platforms, including Uber’s Michelangelo, Google’s TFX, and our own work on
similar systems; and avoiding and managing coupling in machine learning systems. We encourage
the reader to seek out the many books, papers, and blog posts covering these topics in the
meantime, and to check for updates on the book’s website at adamkelleher.com/ml_book.
We hope you’ll enjoy learning these tools as much as we did, and we hope this book will save you
time and effort in the long run.
About the Authors
Andrew Kelleher is a staff software engineer and distributed systems architect at Venmo. He was
previously a staff software engineer at BuzzFeed and has worked on data pipelines and algorithm
implementations for modern optimization. He graduated with a BS in physics from Clemson
University. He runs a meetup in New York City that studies the fundamentals behind distributed
systems in the context of production applications, and was ranked one of FastCompany’s most
creative people two years in a row.
Adam Kelleher wrote this book while working as principal data scientist at BuzzFeed and adjunct
professor at Columbia University in the City of New York. As of May 2018, he is chief data scientist
for research at Barclays and teaches causal inference and machine learning products at Columbia.
He graduated from Clemson University with a BS in physics, and has a PhD in cosmology from
University of North Carolina at Chapel Hill.
13
Causal Inference
13.1 Introduction
We’ve introduced a couple of machine-learning algorithms and suggested that they can be used to
produce clear, interpretable results. You’ve seen that logistic regression coefficients can be used to
say how much more likely an outcome is to occur in conjunction with a feature (for binary
features) or how much more likely an outcome is to occur per unit increase in a variable (for
real-valued features). We’d like to make stronger statements. We’d like to say “If you increase a
variable by a unit, then it will have the effect of making an outcome more likely.”
These two interpretations of a regression coefficient are so similar on the surface that you may have
to read them a few times to take away the meaning. The key is that in the first case, we’re describing
what usually happens in a system that we observe. In the second case, we’re saying what will
happen if we intervene in that system and disrupt it from its normal operation.
After we go through an example, we’ll build up the mathematical and conceptual machinery to
describe interventions. We’ll cover how to go from a Bayesian network describing observational
data to one that describes the effects of an intervention. We’ll go through some classic approaches
to estimating the effects of interventions, and finally we’ll explain how to use machine-learning
estimators to estimate the effects of interventions.
If you imagine a binary outcome, such as “I’m late for work,” you can imagine some features that
might vary with it. Bad weather can cause you to be late for work. Bad weather can also cause you to
wear rain boots. Days when you’re wearing rain boots, then, are days when you’re more likely to be
late for work. If you look at the correlation between the binary feature “wearing rain boots” and the
outcome “I’m late for work,” you’ll find a positive relationship. It’s nonsense, of course, to say that
wearing rain boots causes you to be late for work. It’s just a proxy for bad weather. You’d never
recommend a policy of “You shouldn’t wear rain boots, so you’ll be late for work less often.” That
would be reasonable only if “wearing rain boots” was causally related to “being late for work.” As an
intervention to prevent lateness, not wearing rain boots doesn’t make any sense.
In this chapter, you’ll learn the difference between correlative (rain boots and lateness) and causal
(rain and lateness) relationships. We’ll discuss the gold standard for establishing causality: an
experiment. We’ll also cover some methods to discover causal relationships in cases when you’re
not able to run an experiment, which happens often in realistic settings.
13.2 Experiments
The case that might be familiar to you is an AB test. You can make a change to a product and test it
against the original version of the product. You do this by randomly splitting your users into two
groups. The group membership is denoted by D, where D = 1 is the group that experiences the new
change (the test group), and D = 0 is the group that experiences the original version of the product
(the control group). For concreteness, let’s say you’re looking at the effect of a recommender system
change that recommends articles on a website. The control group experiences the original
algorithm, and the test group experiences the new version. You want to see the effect of this change
on total pageviews, Y.
You’ll measure this effect by looking at a quantity called the average treatment effect (ATE). The ATE is
the average difference in the outcome between the test and control groups, E_test[Y] − E_control[Y], or
δ_naive = E[Y|D = 1] − E[Y|D = 0]. This is the “naive” estimator for the ATE since here we’re ignoring
everything else in the world. For experiments, it’s an unbiased estimate for the true effect.
A nice way to estimate this is to do a regression. That lets you also measure error bars at the same
time and include other covariates that you think might reduce the noise in Y so you can get more
precise results. Let’s continue with this example.
import numpy as np
import pandas as pd

N = 1000

x = np.random.normal(size=N)
d = np.random.binomial(1, 0.5, size=N)
# size=N gives each unit its own noise draw; a bare np.random.normal()
# would add the same constant to every row instead of noise
y = 3. * d + x + np.random.normal(size=N)

X = pd.DataFrame({'X': x, 'D': d, 'Y': y})
Here, we’ve randomized D to get about half in the test group and half in the control. X is some
other covariate that causes Y, and Y is the outcome variable. We’ve added a little extra noise to Y to
just make the problem a little noisier.
You can use a regression model Y = β0 + β1 D to estimate the expected value of Y, given the
covariate D, as E[Y|D] = β0 + β1 D. The β0 piece will be added to E[Y|D] for all values of D (i.e., 0 or
1). The β1 part is added only when D = 1 because when D = 0, it’s multiplied by zero. That means
E[Y|D = 0] = β0 when D = 0 and E[Y|D = 1] = β0 + β1 when D = 1. Thus, the β1 coefficient is
going to be the difference in average Y values between the D = 1 group and the D = 0 group,
E[Y|D = 1] − E[Y|D = 0] = β1 ! You can use that coefficient to estimate the effect of this experiment.
When you do the regression of Y against D, you get the result in Figure 13.1.
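A minimal sketch of that regression, assuming statsmodels (whose OLS the later listings in this
chapter also use); its summary corresponds to Figure 13.1:

import statsmodels.api as sm

# fit Y = b0 + b1 * D; add_constant supplies the intercept column for b0
model = sm.OLS(X['Y'], sm.add_constant(X['D']))
result = model.fit()
result.summary()

The coefficient on D is the estimate of the ATE, close to 3 here by construction.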
Why did this work? Why is it okay to say the effect of the experiment is just the difference between
the test and control group outcomes? It seems obvious, but that intuition will break down in the
next section. Let’s make sure you understand it deeply before moving on.
Each person can be assigned to the test group or the control group, but not both. For a person
assigned to the test group, you can talk hypothetically about the value their outcome would have
had, had they been assigned to the control group. You can call this value Y^0 because it’s the value Y
would take if D had been set to 0. Likewise, for control group members, you can talk about a
hypothetical Y^1. What you really want to measure is the difference in outcomes δ = Y^1 − Y^0 for
each person. This is impossible since each person can be in only one group! For this reason, these
Y^1 and Y^0 variables are called potential outcomes.
If a person is assigned to the test group, you measure the outcome Y = Y^1. If a person is assigned to
the control group, you measure Y = Y^0. Since you can’t measure the individual effects, maybe you
can measure population level effects. We can try to talk instead about E[Y^1] and E[Y^0]. We’d like
E[Y^1] = E[Y|D = 1] and E[Y^0] = E[Y|D = 0], but we’re not guaranteed that that’s true. In the
recommender system test example, what would happen if you assigned people with higher Y^0
pageview counts to the test group? You might measure an effect that’s larger than the true effect!
Fortunately, you randomize D to make sure it’s independent of Y^0 and Y^1. That way, you’re sure
that E[Y^1] = E[Y|D = 1] and E[Y^0] = E[Y|D = 0], so you can say that δ = E[Y^1 − Y^0] =
E[Y|D = 1] − E[Y|D = 0]. When other factors can influence assignment, D, then you can no longer
be sure you have correct estimates! This is true in general when you don’t have control over a
system, so you can’t ensure D is independent of all other factors.
In the general case, D won’t just be a binary variable. It can be ordered, discrete, or continuous. You
might wonder about the effect of the length of an article on the share rate, about smoking on the
probability of getting lung cancer, of the city you’re born in on future earnings, and so on.
Just for fun before we go on, let’s see something nice you can do in an experiment to get more
precise results. Since we have a co-variate, X, that also causes Y, we can account for more of the
variation in Y. That makes our predictions less noisy, so our estimates for the effect of D will be
more precise! Let’s see how this looks. We regress on both D and X now to get Figure 13.2.
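A sketch of that regression, again assuming statsmodels, now with the covariate X alongside D:

# including X soaks up variance in Y, tightening the confidence interval on D
model = sm.OLS(X['Y'], sm.add_constant(X[['D', 'X']]))
result = model.fit()
result.summary()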
Notice that the R² is much better. Also, notice that the confidence interval for D is much narrower!
We went from a range of 3.95 − 2.51 = 1.44 down to 3.65 − 2.76 = 0.89. In short, finding covariates
that account for the outcome can increase the precision of your experiments!
13.3 Observation: An Example
Figure 13.3 The neighborhood is a cause of its racial composition and poverty levels. The poverty
level is a cause of crime.
Here, there is no causal relationship between race and crime, but you would find them to be
correlated in observational data. Let’s simulate some data to examine this.
N = 10000

neighborhood = np.array(range(N))
industry = neighborhood % 3
race = ((neighborhood % 3) + np.random.binomial(3, p=0.2, size=N)) % 4
income = np.random.gamma(25, 1000*(industry + 1))
crime = np.random.gamma(100000. / income, 100, size=N)

X = pd.DataFrame({'$R$': race, '$I$': income, '$C$': crime,
                  '$E$': industry, '$N$': neighborhood})
Here, each data point will be a neighborhood. There are common historic reasons for the racial
composition and the dominant industry in each neighborhood. The industry determines the
income levels in the neighborhood, and the income level is inversely related with crime rates.
If you plot the correlation matrix for this data (Figure 13.4), you can see that race and crime are
correlated, even though there is no causal relationship between them!
C E I N R
C 1.000000 -0.542328 -0.567124 0.005518 -0.492169
E -0.542328 1.000000 0.880411 0.000071 0.897789
I -0.567124 0.880411 1.000000 -0.005650 0.793993
N 0.005518 0.000071 -0.005650 1.000000 -0.003666
R -0.492169 0.897789 0.793993 -0.003666 1.000000
Figure 13.4 Raw data showing correlations between crime (C), industry (E), income (I), neighborhood
(N), and race (R)
You can take a regression approach and see how you can interpret the regression coefficients. Since
we know the right model to use, we can just do the right regression, which gives the results in
Figure 13.5.
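A minimal sketch of that regression, assuming statsmodels; the '$1/I$' column name is ours, added
so we can regress crime on inverse income:

from statsmodels.api import OLS

# regress crime on 1/income, the form suggested by the data-generating process
X['$1/I$'] = 1. / X['$I$']
model = OLS(X['$C$'], X[['$1/I$']])
result = model.fit()
result.summary()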
From this you can see that when 1/I increases by a unit, the number of crimes increases by 123
units. If the crime units are in crimes per 10,000 people, this means 123 more crimes per 10,000
people.
This is a nice result, but you’d really like to know whether the result is causal. If it is causal, that
means you can design a policy intervention to exploit the relationship. That is, you’d like to know
if people earned more income, everything else held fixed, would there be less crime? If this were a
causal result, you could say that if you make incomes higher (independent of everything else), then
you can expect that for each unit decrease in 1/I, you’ll see 123 fewer crimes. What is keeping us
from making those claims now?
You’ll see that regression results aren’t necessarily causal; let’s look at the relationship between race
and crime. We’ll do another regression as shown here:
Here, you find a strong correlative relationship between race and crime, even though there’s no
causal relationship. You know that if we moved a lot of white people into a black neighborhood
(holding income level constant), you should have no effect on crime. If this regression were causal,
then you would. Why do you find a significant regression coefficient even when there’s no causal
relationship?
In this example, you went wrong because racial composition and income level were both caused by
the history of each neighborhood. This is a case where two variables share a common cause. If you
don’t control for that history, then you’ll find a spurious association between the two variables.
What you’re seeing is a general rule: when two variables share a common cause, they will be
correlated (or, more generally, statistically dependent) even when there’s no causal relationship
between them.
Another nice example of this common cause problem is that when lemonade sales are high, crime
rates are also high. If you regress crime on lemonade sales, you’d find a significant increase in
crimes per unit increase in lemonade sales! Clearly the solution isn’t to crack down on lemonade
stands. As it happens, more lemonade is sold on hot days. Crime is also higher on hot days. The
weather is a common cause of crime and lemonade sales. We find that the two are correlated even
though there is no causal relationship between them.
The solution in the lemonade example is to control for the weather. If you look at all days where it
is sunny and 95 degrees Fahrenheit, the effect of the weather on lemonade sales is constant. The
effect of weather on crime is also constant in the restricted data set. Any variance in the two must
be because of other factors. You’ll find that lemonade sales and crime no longer have a significant
correlation in this restricted data set. This problem is usually called confounding, and the way to
break confounding is to control for the confounder.
Similarly, if you look only at neighborhoods with a specific history (in this case the relevant
variable is the dominant industry), then you’ll break the relationship between race and income
and so also the relationship between race and crime.
To reason about this more rigorously, let’s look at Figure 13.3. We can see the source of dependence,
where there’s a path from N to R and a path from N through E and P to C. If you were able to break
this path by holding a variable fixed, you could disrupt the dependence that flows along it. The
result will be different from the usual observational result. You will have changed the dependencies
in the graph, so you will have changed the joint distribution of all these variables.
If you intervene to set the income level in an area in a way that is independent of the dominant
industry, you’ll break the causal link between the industry and the income, resulting in the graph
in Figure 13.7. In this system, you should find that the path that produces dependence between
race and crime is broken. The two should be independent.
Figure 13.7 The result of an intervention, where you set the income level by direct intervention in a
way that is independent of the dominant industry in the neighborhood
How can you do this controlling using only observational data? One way is just to restrict to subsets
of the data. You can, for example, look only at industry 0 and see how this last regression looks.
from statsmodels.api import OLS

# restrict to neighborhoods whose dominant industry is industry 0
X_restricted = X[X['$E$'] == 0].copy()

races = {0: 'african-american', 1: 'hispanic',
         2: 'asian', 3: 'white'}
X_restricted['race'] = X_restricted['$R$'].apply(lambda x: races[x])
race_dummies = pd.get_dummies(X_restricted['race'])
X_restricted[race_dummies.columns] = race_dummies
model = OLS(X_restricted['$C$'], race_dummies)
result = model.fit()
result.summary()
Now you can see that all of the results are within confidence of each other! The dependence
between race and crime is fully explained by the industry in the area. In other words, in this
hypothetical data set, crime is independent of race when you know what the dominant industry is
in the area. What you have done is the same as the conditioning you did before.
Notice that the confidence intervals on the new coefficients are fairly wide compared to what they
were before. This is because you’ve restricted to a small subset of your data. Can you do better,
maybe by using more of the data? It turns out there’s a better way to control for something than
restricting the data set. You can just regress on the variables you’d like to control for!
Figure 13.8 A hypothetical regression on race indicator variables predicting crime rates, but controlling for local industry using stratification of the data. There are no differences in expected crimes, controlling for industry.
# race_dummies here are the dummy columns built on the full data set, as in the
# unrestricted regression above; the '$E$' column holds the dominant industry
# (the original listing referenced it as 'industry')
industry_dummies = pd.get_dummies(X['$E$'], prefix='industry')
X[industry_dummies.columns] = industry_dummies

x = list(industry_dummies.columns)[1:] + list(race_dummies.columns)

model = OLS(X['$C$'], X[x])
result = model.fit()
result.summary()
Here, the confidence intervals are much narrower, and you see there’s still no significant
association between race and crime: the coefficients are roughly equal. This is a causal
regression result: you can now see that there would be no effect of an intervention to change the
racial composition of neighborhoods. This simple example is nice because you can see what to
control for, and you’ve measured the things you need to control for. How do you know what to
control for in general? Will you always be able to do it successfully? It turns out it’s very hard in
practice, but sometimes it’s the best you can do.
Figure 13.9 Statistics highlighting the relationship between race and industry from an OLS fit
13.4 Controlling to Block Non-causal Paths
You saw in the previous chapter that conditioning can break statistical dependence. If you
condition on the middle variable of a path X → Y → Z, you’ll break the dependence between X
and Z that the path produces. If you condition on a confounding variable X ← Z → Y, you can
break the dependence between X and Y induced by the confounder as well. It’s important to note
that statistical dependence induced by other paths between X and Y is left unharmed by this
conditioning. If, for example, you condition on Z in the system in Figure 13.10, you’ll get rid of the
confounding but leave the causal dependence.
Figure 13.10 Conditioning on Z disrupts the confounding but leaves the causal statistical dependence between X and Y intact
If you had a general rule to choose which paths to block, you could eliminate all noncausal
dependence between variables but save the causal dependence. The “back-door” criterion is the
rule you’re looking for. It tells you what set of variables, Z, you should control for to eliminate any
noncausal statistical dependence between Xi and Xj . You should note a final nuance before
introducing the criterion. If you want to know if the correlation between Xi and Xj is “causal,” you
have to worry about the direction of the effect. It’s great to know, for example, that the correlation
“being on vacation” and “being relaxed” is not confounded, but you’d really like to know whether
“being on vacation” causes you to “be relaxed.” That will inform a policy of going on vacation in
order to be relaxed. If the causation were reversed, you couldn’t take that policy.
With that in mind, the back-door criterion is defined relative to an ordered pair of variables,
(Xi, Xj), where Xi will be the cause, and Xj will be the effect: a set of variables Z satisfies the
back-door criterion relative to (Xi, Xj) if no variable in Z is a descendant of Xi, and Z blocks every
path between Xi and Xj that contains an arrow into Xi.
We won’t prove this theorem, but let’s build some intuition for it. First, let’s examine the condition
“no variable in Z is a descendant of Xi .” You learned earlier that if you condition on a common
effect of Xi and Xj , then the two variables will be conditionally dependent, even if they’re normally
independent. This remains true if you condition on any effect of the common effect (and so on
down the paths). Thus, you can see that the first part of the back-door criterion prevents you from
introducing extra dependence where there is none.
There is something more to this condition, too. If you have a chain like Xi → Xk → Xj , you see that
Xk is a descendant of Xi . It’s not allowed in Z. This is because if you condition on Xk , you’d block a
causal path between Xi and Xj . Thus, you see that the first condition also prevents you from
conditioning on variables that fall along causal paths.
The second condition says “Z blocks every path between Xi and Xj that contains an arrow into Xi .”
This part will tell us to control for confounders. How can you see this? Let’s consider some cases
where there is one or more node along the path between Xi and Xj and the path contains an arrow
into Xi . If there is a collider along the path between Xi and Xj , then the path is already blocked, so
you just condition on the empty set to block that path. Next, if there is a fork along the path, like
the path Xi ← Xk → Xj , and no colliders, then you have typical confounding. You can condition
on any node along the path that will block it. In this case, you add Xk to the set Z. Note that there
can be no causal path from Xi to Xj that contains an arrow pointing into Xi, because causal paths
must start with an arrow pointing out of Xi.
Thus, you can see that you’re blocking all noncausal paths from Xi to Xj , and the remaining
statistical dependence will be showing the causal dependence of Xj on Xi . Is there a way you can
use this dependence to estimate the effects of interventions?
Figure 13.11 A pre-intervention causal graph. Data collected from this system reflects the way the
world works when we just observe it.
You want to estimate the effect of X2 on X5 . That is, you want to say “If I intervene in this system to
set the value of X2 to x2, what will happen to X5?” To quantify the effect, you have to realize that all
of these variables are taking on values that depend not only on their predecessors but also on noise
in the system. Thus, even if there’s a deterministic effect of X2 on X5 (say, raising the value of X5 by
exactly one unit), you can only really describe the value X5 will take with a distribution of values.
Thus, when you’re estimating the effect of X2 on X5 , what you really want is the distribution of X5
when you intervene to set the value of X2 .
Let’s look at what we mean by intervene. We’re saying we want to ignore the usual effect of X1 on X2
and set the value of X2 to x2 by applying some external force (our action) to X2 . This removes the
usual dependence between X2 and X1 and disrupts the downstream effect of X1 on X4 by breaking
the path that passes through X2. Thus, we’ll also expect the marginal distribution of X1 and
X4, P(X1, X4), to change, as well as the distribution of X1 and X5! Our intervention can affect every
variable downstream from it in ways that don’t just depend on the value x2 . We actually disrupt
other dependences.
You can draw a new graph that represents this intervention. At this point, you’re seeing that the
operation is very different from observing the value of X2 = x2 , i.e., simply conditioning on
X2 = x2 . This is because you’re disrupting other dependences in the graph. You’re actually talking
about a new system described by the graph in Figure 13.12.
Figure 13.12 The graph representing the intervention do(X2 = x2). The statistics of this data will be different from those of the system in Figure 13.11.
You need some new notation to talk about an intervention like this, so you’ll denote do(X2 = x2 )
the intervention where you perform this operation. This gives you the definition of the
intervention, or do-operation.
What does the joint distribution look like for this new graph? Let’s use the usual factorization, and
write the following:
P_do(X2=x2)(X1, X2, X3, X4, X5) = P(X5|X4) P(X4|X2, X3) P(X3|X1) δ(X2, x2) P(X1)    (13.1)
Here we’ve just indicated P(X2) by the δ-function, so P(X2) = 0 if X2 ≠ x2, and P(X2) = 1 when
X2 = x2. We’re basically saying that when we intervene to set X2 = x2, we’re sure that it worked. We
can carry through that X2 = x2 elsewhere, like in the distribution for P(X4|X2, X3), by just
replacing X2 with X2 = x2, since the whole right-hand side is zero if X2 ≠ x2.
Finally, let’s just condition on the X2 distribution to get rid of the weirdness on the right-hand side
of this formula. We can write the following:
P_do(X2=x2)(X1, X2, X3, X4, X5 | X2) = P(X5|X4) P(X4|X2 = x2, X3) P(X3|X1) P(X1)    (13.2)
However, this is the same as the original formula, divided by P(X2|X1)! To be precise,
P_do(X2=x2)(X1, X2, X3, X4, X5 | X2 = x2) = P(X1, X2 = x2, X3, X4, X5) / P(X2 = x2|X1)    (13.3)
The same argument goes through for an intervention on any variable Xi with parents Pa(Xi):
P(X1, ..., Xn | do(Xi = xi)) = P(X1, ..., Xn) / P(Xi|Pa(Xi))    (13.4)
This leads us to a nice general rule: the parents of a variable will always satisfy the back-door
criterion! It turns out we can be even more general than this. If we marginalize out everything
except Xi and Xj, we see the parents are the set of variables that control confounders.
P(Xj, Pa(Xi) | do(Xi = xi)) = P(Xj, Xi, Pa(Xi)) / P(Xi|Pa(Xi))    (13.5)
It turns out (we’ll state without proof) that you can generalize the parents to any set, Z, that
satisfies the back door criterion.
P(Xj , Xi , Z)
P(Xj , Z|do(Xi = xi)) = (13.6)
P(Xi |Z)
You can marginalize Z out of this and use the definition of conditional probability to write an
important formula, shown in Definition 13.3 (the g-formula):
P(Xj | do(Xi = xi)) = Σ_Z P(Xj | Xi = xi, Z) P(Z)    (13.7)
This is a general formula for estimating the distribution of Xj under an intervention on Xi. Notice
that all of these distributions are from the pre-intervention system. This means you can use
observational data to estimate the distribution of Xj under some hypothetical intervention!
There are a few critical caveats here. First, the term in the denominator of Equation 13.4,
P(Xi |Pa(Xi )), must be nonzero for the quantity on the left side to be defined. This means you would
have to have observed Xi taking on the value you’d like to set it to with your intervention. If you’ve
never seen it, you can’t say how the system might behave in response to it!
Next, you’re assuming that you have a set Z that you can control for. Practically, it’s hard to know if
you’ve found a good set of variables. There can always be a confounder you have never thought to
measure. Likewise, your way of controlling for known confounders might not do a very good job.
You’ll understand this second caveat more as you go into some machine learning estimators.
With these caveats, it can be hard to estimate causal effects from observational data. You should
consider the results of a conditioning approach to be a provisional estimate of a causal effect. If
you’re sure you’re not violating the first condition of the back-door criterion, then you can expect
that you’ve removed some spurious dependence. You can’t say for sure that you’ve reduced bias.
Imagine, for example, two sources of bias for the effect of Xi on Xj. Suppose you’re interested in
measuring an average value of Xj, E_do(Xi=xi)[Xj] = µ_j. Path A introduces a bias of −δ, and path B
introduces a bias of 2δ. If you estimate the mean without controlling for either path, you’ll find
µ_j^(biased) = µ_j + 2δ − δ = µ_j + δ. If you control for a confounder along path A, then you remove its
contribution to the bias, which leaves µ_j^(biased,A) = µ_j + 2δ. Now the bias is twice as large! The
problem, of course, is that the bias you corrected was actually pushing our estimate back toward its
correct value. In practice, more controlling usually helps, but you can’t be guaranteed that you
won’t find an effect like this.
Now that you have a good background in observational causal inference, let’s see how
machine-learning estimators can help in practice!
13.5 Machine-Learning Estimators
In the simplest case, you’d like to estimate the expected difference in some outcome, Xj, per unit
change in a variable you have control over, Xi. For example, you might like to measure
E[Xj |do(Xi = 1)] − E[Xj |do(Xi = 0)]. This tells you the change in Xj you can expect on average when
you set Xi to 1 from what it would be if Xi were set to 0.
Let’s revisit the g-formula to see how can measure these kinds of quantities.
If you take expectation values on each side (by multiplying by Xj and summing over Xj ), then you
find this: X
E(Xj |do(Xi = xi )) = E(Xj |Xi , Z)P(Z) (13.8)
Z
In practice, it’s easy to estimate the first factor on the right side of this formula. If you fit a
regression estimator using mean-squared error loss, then the best fit is just the expected value of Xj
at each point (Xi , Z). As long as the model has enough freedom to accurately describe the expected
value, you can estimate this first factor by using standard machine-learning approaches.
To estimate the whole left side, you need to deal with the P(Z) term, as well as the sum. It turns out
there’s a simple trick for doing this. If your data was generated by drawing from the observational
joint distribution, then your samples of Z are actually just draws from P(Z). Then, if you replace the
P(Z) term by 1/N (for N samples) and sum over data points, you’re left with an estimator for this
sum. That is, you can make the substitution as follows:
E_N(Xj | do(Xi = xi)) = (1/N) Σ_{k=1}^{N} E(Xj | Xi, Z^(k)),    (13.9)
where the (k) index runs over our data points, from 1 to N. Let’s see how all of this works in an
example.
13.5.2 An Example
Let’s go back to the graph in Figure 13.11. We’ll use an example from Judea Pearl’s book. We’re
concerned with the sidewalk being slippery, so we’re investigating its causes. X5 can be 1 or 0, for
slippery or not, respectively. You’ve found that the sidewalk is slippery when it’s wet, and you’ll use
X4 to indicate whether the sidewalk is wet. Next, you need to know the causes of the sidewalk being
wet. You see that a sprinkler is near the sidewalk, and if the sprinkler is on, it makes the sidewalk
wet. X2 will indicate whether the sprinkler is on. You’ll notice the sidewalk is also wet after it rains,
which you’ll indicate with X3 being 1 after rain, 0 otherwise. Finally, you note that on sunny days
you turn the sprinkler on. You’ll indicate the weather with X1 , where X1 is 1 if it is sunny, and 0
otherwise.
In this picture, rain and the sprinkler being on are negatively related to each other. This statistical
dependence happens because of their mutual dependence on the weather. Let’s simulate some data
to explore this system. You’ll use a lot of data, so the random error will be small, and you can focus
your attention on the bias.
import numpy as np
import pandas as pd
from scipy.special import expit

N = 100000
inv_logit = expit
x1 = np.random.binomial(1, p=0.5, size=N)
x2 = np.random.binomial(1, p=inv_logit(-3.*x1))
x3 = np.random.binomial(1, p=inv_logit(3.*x1))
x4 = np.bitwise_or(x2, x3)
x5 = np.random.binomial(1, p=inv_logit(3.*x4))

X = pd.DataFrame({'$x_1$': x1, '$x_2$': x2, '$x_3$': x3,
                  '$x_4$': x4, '$x_5$': x5})
Every variable here is binary. You use a logistic link function to make logistic regression
appropriate. When you don’t know the data-generating process, you might get a little more
creative. You’ll come to this point in a moment!
Let’s look at the correlation matrix, shown in Figure 13.13. When the weather is good, the sprinkler
is turned on. When it rains, the sprinkler is turned off. You can see there’s a negative relationship
between the sprinkler being on and the rain due to this relationship.
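With pandas, producing the matrix in Figure 13.13 is a one-liner on the frame defined above:

X.corr()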
There are a few ways you can get an estimate for the effect of X2 on X5 . The first is simply by finding
the probability that X5 = 1 given that X2 = 1 or X2 = 0. The difference in these probabilities tells
you how much more likely it is that the sidewalk is slippery given that the sprinkler was on. A
simple way to calculate these probabilities is simply to average the X5 variable in each subset of the
data (where X2 = 0 and X2 = 1). You can run the following, which produces the table in
Figure 13.14.
X.groupby('$x_2$').mean()[['$x_5$']]
Figure 13.13 The correlation matrix for the simulated data set. Notice that X2 and X3 are negatively
related because of their common cause, X1 .
x5
x2
0 0.861767
1 0.951492
Figure 13.14 The naive conditional expectation values for whether the grass is wet given that the
sprinkler is on, E[X5 |X2 = x2 ]. This is not a causal result because you haven’t adjusted for confounders.
If you look at the difference here, you see that the sidewalk is 0.95 − 0.86 = 0.09, or nine percentage
points more likely to be slippery given that the sprinkler was on. You can compare this with the
interventional graph to get the true estimate for the change. You can generate this data using the
process shown here:
N = 100000
inv_logit = expit
x1 = np.random.binomial(1, p=0.5, size=N)
x2 = np.random.binomial(1, p=0.5, size=N)
x3 = np.random.binomial(1, p=inv_logit(3.*x1))
x4 = np.bitwise_or(x2, x3)
x5 = np.random.binomial(1, p=inv_logit(3.*x4))

X = pd.DataFrame({'$x_1$': x1, '$x_2$': x2, '$x_3$': x3,
                  '$x_4$': x4, '$x_5$': x5})
Now, X2 is independent of X1 and X3 . If you repeat the calculation from before (try it!), you get a
difference of 0.12, or 12 percentage points. This is about 30 percent larger than the naive estimate!
Now, you’ll use some machine learning approaches to try to get a better estimate of the true (0.12)
effect strictly using the observational data. First, you’ll try a logistic regression on the first data set.
Let’s re-create the naive estimate, just to make sure it’s working properly.
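A sketch of that estimate follows; the scikit-learn import is an assumption here, and the controlled
listing below reuses it.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model = model.fit(X[['$x_2$']], X['$x_5$'])

# probabilities of X5 = 1 with x2 forced to 0, then to 1, over the whole data set
X0 = X.copy()
X0['$x_2$'] = 0
y_pred_0 = model.predict_proba(X0[['$x_2$']])

X1 = X.copy()
X1['$x_2$'] = 1
y_pred_1 = model.predict_proba(X1[['$x_2$']])

# average difference in probabilities: about 0.09, the naive estimate
y_pred_1[:, 1].mean() - y_pred_0[:, 1].mean()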
You first build a logistic regression model using X2 to predict X5 . You do the prediction and use it to
get probabilities of X5 under the X2 = 0 and X2 = 1 states. You did this over the whole data set.
The reason for this is that you’ll often have more interesting data sets, with many more variables
changing, and you’ll want to see the average effect of X2 on X5 over the whole data set. This
procedure lets you do that. Finally, you find the average difference in probabilities between the two
states, and you get the same 0.09 result as before!
Now, you’d like to do controlling on the same observational data to get the causal (0.12) result. You
perform the same procedure as before, but this time you include X1 in the regression.
model = LogisticRegression()
model = model.fit(X[['$x_2$', '$x_1$']], X['$x_5$'])

# what would have happened if $x_2$ was always 0:
X0 = X.copy()
X0['$x_2$'] = 0
y_pred_0 = model.predict_proba(X0[['$x_2$', '$x_1$']])

# what would have happened if $x_2$ was always 1:
X1 = X.copy()
X1['$x_2$'] = 1

# now, let's check the difference in probabilities
y_pred_1 = model.predict_proba(X1[['$x_2$', '$x_1$']])
y_pred_1[:, 1].mean() - y_pred_0[:, 1].mean()
In this case, you find 0.14 for the result. You’ve over-estimated it! What went wrong? You didn’t
actually do anything wrong with the modeling procedure. The problem is simply that logistic
regression isn’t the right model for this situation. It’s the correct model for predicting each variable
from its parents, but it doesn’t work properly for variables farther downstream than the parents. Can we do
better, with a more general model?
This will be your first look at how powerful neural networks can be for general machine-learning
tasks. You’ll learn about building them in a little more detail in the next chapter. For now, let’s try a
deep feedforward neural network using keras. It’s called deep because there are more than just the
input and output layers. It’s a feedforward network because you put some input data into the
network and pass them forward through the layers to produce the output.
Deep feedforward networks have the property of being “universal function approximators,” in the
sense that they can approximate any function, given enough neurons and layers (although it’s not
always easy to learn, in practice). You’ll construct the network like this:
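The following is a minimal sketch; the layer sizes, activations, and training settings here are
illustrative assumptions rather than a prescription.

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(128, input_shape=(2,), activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(1, activation='sigmoid'))  # output approximates P(X5 = 1)
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(X[['$x_1$', '$x_2$']].values, X['$x_5$'].values,
          epochs=10, batch_size=100)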
Now do the same prediction procedure as before, which produces the result 0.129.
X_zero = X.copy()
X_zero['$x_2$'] = 0
x5_pred_0 = model.predict(X_zero[['$x_1$', '$x_2$']].values)

X_one = X.copy()
X_one['$x_2$'] = 1
x5_pred_1 = model.predict(X_one[['$x_1$', '$x_2$']].values)

x5_pred_1.mean() - x5_pred_0.mean()
You’ve done better than the logistic regression model! This was a tricky case. You’re given binary
data where it’s easy to calculate probabilities, and you’d do the best by simply using the g-formula
directly. When you do this (try it yourself!), you calculate the true result of 0.127 from this data.
Your neural network model is very close!
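For reference, the direct calculation might look like the following sketch, which estimates
E[X5 | X2, X1] with empirical frequencies on the observational data and averages over P(X1):

# g-formula: E[X5 | do(X2 = x2)] = sum_z E[X5 | X2 = x2, X1 = z] P(X1 = z)
p_x1 = X['$x_1$'].value_counts(normalize=True)
e_x5 = X.groupby(['$x_2$', '$x_1$'])['$x_5$'].mean()
effect = sum((e_x5.loc[(1, z)] - e_x5.loc[(0, z)]) * p_x1[z]
             for z in p_x1.index)
print(effect)  # about 0.127 on data simulated as above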
Now, you’d like to enact a policy that would make the sidewalk less likely to be slippery. You know
that if you turn the sprinkler on less often, that should do the trick. You see that enacting this
policy (and so intervening to change the system), you can expect the slipperiness of the sidewalk to
decrease. How much? You want to compare the pre-intervention chance of slipperiness with the
post-intervention chance, when you set sprinkler = off. You can simply calculate this with our
neural network model like so:
X['$x_5$'].mean() - x5_pred_0.mean()
This gives the result 0.07. It will be seven percentage points less likely that the sidewalk is slippery if you make a
policy of keeping the sprinkler turned off!
13.6 Conclusion
In this chapter, you’ve developed the tools to do causal inference. You’ve learned that machine
learning models can be useful to get more general model specifications, and you saw that the better
you can predict an outcome using a machine learning model, the better you can remove bias from
an observational causal effect estimate.
Observational causal effect estimates should always be used with care. Whenever possible, you
should try to do a randomized controlled experiment instead of using the observational estimate.
In this example, you should simply use randomized control: flip a coin each day to see whether the
sprinkler gets turned on. This re-creates the post-intervention system and lets you measure how
much less likely the sidewalk is to be slippery when the sprinkler is turned off versus turned on (or
when the system isn’t intervened upon). When you’re trying to estimate the effect of a policy, it’s
hard to find a substitute for actually testing the policy through a controlled experiment.
It’s especially useful to be able to think causally when designing machine-learning systems. If you’d
simply like to say what outcome is most likely given what normally happens in a system, a standard
machine learning algorithm is appropriate. You’re not trying to predict the result of an
intervention, and you’re not trying to make a system that is robust to changes in how the system
operates. You just want to describe the system’s joint distribution (or expectation values under it).
If you would like to inform policy changes, predict the outcomes of intervention, or make the
system robust to changes in variables upstream from it (i.e., external interventions), then you will
want a causal machine learning system, where you control for the appropriate variables to measure
causal effects.
An especially interesting application area is when you’re estimating the coefficients in a logistic
regression. Earlier, you saw that logistic regression coefficients had a particular interpretation in
observational data: they describe how much more likely an outcome is per unit increase in some
independent variable. If you control for the right variables to get a causal logistic regression
estimate (or just do the regression on data generated by control), then you have a new, stronger
interpretation: the coefficients tell you how much more likely an outcome is to occur when you
intervene to increase the value of an independent variable by one unit. You can use these
coefficients to inform policy!
Index
Numbers
12 principles of agile methodology, 11–14
“12-factor rules,” 71
95 percent confidence interval, 20, 107
A
A/B replication, 229–230, 240–241
access, RAM (random access memory), 205–206
aggregation, 214
agile development, product focus and, 10–11
agile methodology, 12 principles, 11–14
algorithms
Cannon’s algorithm, 97
classification algorithms, 117
k-means, 125–127
logistic regression, 118–122
naive Bayes, 122–124
clustering algorithms, 117
greedy Louvain, 130–131
k-means. see k-means
leading eigenvalue, 128–130
nearest neighbors, 131–133
comparison algorithms. See comparison
algorithms
Amazon, Route 53, 226–227
ANNoy, 133
API buffering, queues, 243
application-level caching, 236
architectures, 70–71
batch computing, 72–73
data sources, 72
online computing, 72–73
scaling, 73–74
services, 71
software architecture
client-server architecture, 217–218
microservices, 220
mix-and-match architectures, 221
monolith, 220
n-tier/service-oriented architecture, 218–219
assumptions
greedy Louvain, 130
ICA (independent component analysis), 158
k-means, 127
linear least squares, 97
logistic regression, 121
MinHash, 83
naive Bayes, 124
nearest neighbors, 132
asynchronous process execution, queues, 242–243
ATE (average treatment effect), 168
auto-correlation, time-series plots, 60–61
availability, CAP theorem, 225
client-side load balancing, 228
data layers, 228–230
failover, 230–231
front ends and load balancers, 225–228
jobs and taskworkers, 230
redundancy, 225
average treatment effect (ATE), 168
avoiding locally caching sensitive information, 237
C
cache invalidation, 72
cache services, 237
caches, 72, 235
application-level caching, 236
cache services, 237
write-through caches, 238
Cannon’s algorithm, 97
CAP theorem, 223
availability, 225
client-side load balancing, 228
data layers, 228–230
failover, 230–231
front ends and load balancers, 225–228
jobs and taskworkers, 230
redundancy, 225
consistency/concurrency, 223–224
conflict-free data types, 224–225
partition tolerance, 231–232
capacity, neural networks, 193–196
career development, for data scientists, 5
CARP (Common Address Redundancy Protocol), 227–228
S
sampling error, 19–21
scaling, 73–74
scatter plots, 51–55
scikit learn, 122, 128
scipy.optimize.leastsq, 98
secondary, 228
SELECT FOR UPDATE, 224
self-organizing teams, 14
separation of concerns, 70–71
sequences, 79–80
service-oriented architectures (SOAs), 71, 218–219
services, 71
sets, 79–80
sharding, 229, 241
simplicity, 14
sklearn.neighbors, 133
SOAs (service-oriented architectures), 71, 218–219
sockets, 217
software architecture
client-server architecture, 217–218
microservices, 220
mix-and-match architectures, 221
monolith, 220
n-tier/service-oriented architecture, 218–219
solid-state drives (SSDs), nonvolatile/persistent storage, 207
space complexity, MinHash, 83
sparse vectors, 28
sparsity, text preprocessing, 28
spinning disks, 206–208
split brains, 231–232
SSDs (solid-state drives), 207
stability, dependence and (Bayesian networks), 137–138
static content, application-level caching, 236
stochastic gradient descent, 75, 200
stochasticity, 200
storage, nonvolatile/persistent storage, 206–208
storing data, 215
supervised learning, 125
survival plots, 51
swapping, 208
systematic error, 18
T
task scheduling, queues, 241–242
taskworkers, availability, 230
teams, 12
role of, data scientists, 4–5
self-organizing teams, 14
technical debt, 4, 13
terminal nodes, 110
test coverage, 34
testing
hypothesis testing, 37
multiple testing, 41–42
tests, Jarque-Bera test, 108
text preprocessing, 26
feature selection, 28–30
n-grams, 27–28
representation learning, 30–33
sparsity, 28
tokenization, 26–27
thrashing, nonvolatile/persistent storage, 208
threading, processors, 210
threads, 208
threads of execution, 73
throughput, 208–209
time complexity, 64
Jaccard distance, 81
logistic regression, 121
MinHash, 83
time to live (TTL), 72
time-series plots, 58
auto-correlation, 60–61
rolling statistics, 58–60
tokenization, text preprocessing, 26–27
tools
greedy Louvain, 131
ICA (independent component analysis), 159
k-means, 128
leading eigenvalue, 130
linear least squares, 98
logistic regression, 122
MinHash, 83
naive Bayes, 124
nearest neighbors, 133
PCA (principal components analysis), 154
topics, 159
topological ordering, 139
tracking impressions, 18
training models, 74–75
true value, 18
TTL (time to live), 72
Type I errors, 39
Type II errors, 39
U
uncertainty, nonlinear regression with linear regression, 107–109
underpowered, 39
V
validation, 92–96
models, 76–77
value proposition, 10–11
variables
binary variables, 25
continuous variables, 46
W
workstations, 69
write-through caches, 238
Z
Z statistic, 38–39