Machine Learning in Production
Developing and Optimizing
Data Science Workflows and
Applications
Andrew Kelleher
Adam Kelleher
The authors and publisher have taken care in the preparation of this book, but make no expressed or implied
warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for
incidental or consequential damages in connection with or arising out of the use of the information or
programs contained herein.
For information about buying this title in bulk quantities, or for special sales opportunities (which may
include electronic versions; custom cover designs; and content particular to your business, training goals,
marketing focus, or branding interests), please contact our corporate sales department
at [email protected] or (800) 382-3419.
For questions about sales outside the U.S., please contact [email protected].
All rights reserved. This publication is protected by copyright, and permission must be obtained from the
publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by
any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding
permissions, request forms, and the appropriate contacts within the Pearson Education Global Rights &
Permissions Department, please visit www.pearsoned.com/permissions/.
ISBN-13: 978-0-13-411654-9
ISBN-10: 0-13-411654-2
Contents
Foreword xv
Preface xvii
About the Authors xxi
I: Principles of Framing 1
2 Project Workflow 7
2.1 Introduction 7
2.2 The Data Team Context 7
2.2.1 Embedding vs. Pooling Resources 8
2.2.2 Research 8
2.2.3 Prototyping 9
2.2.4 A Combined Workflow 10
2.3 Agile Development and the Product Focus 10
2.3.1 The 12 Principles 11
2.4 Conclusion 15
3 Quantifying Error 17
3.1 Introduction 17
3.2 Quantifying Error in Measured Values 17
3.3 Sampling Error 19
3.4 Error Propagation 21
3.5 Conclusion 23
4.2.2 N-grams 27
4.2.3 Sparsity 28
4.2.4 Feature Selection 28
4.2.5 Representation Learning 30
4.3 Information Loss 33
4.4 Conclusion 34
5 Hypothesis Testing 37
5.1 Introduction 37
5.2 What Is a Hypothesis? 37
5.3 Types of Errors 39
5.4 P-values and Confidence Intervals 40
5.5 Multiple Testing and “P-hacking” 41
5.6 An Example 42
5.7 Planning and Context 43
5.8 Conclusion 44
6 Data Visualization 45
6.1 Introduction 45
6.2 Distributions and Summary Statistics 45
6.2.1 Distributions and Histograms 46
6.2.2 Scatter Plots and Heat Maps 51
6.2.3 Box Plots and Error Bars 55
6.3 Time-Series Plots 58
6.3.1 Rolling Statistics 58
6.3.2 Auto-Correlation 60
6.4 Graph Visualization 61
6.4.1 Layout Algorithms 62
6.4.2 Time Complexity 64
6.5 Conclusion 64
7.2.1 Services 71
7.2.2 Data Sources 72
7.2.3 Batch and Online Computing 72
7.2.4 Scaling 73
7.3 Models 74
7.3.1 Training 74
7.3.2 Prediction 75
7.3.3 Validation 76
7.4 Conclusion 77
8 Comparison 79
8.1 Introduction 79
8.2 Jaccard Distance 79
8.2.1 The Algorithm 80
8.2.2 Time Complexity 81
8.2.3 Memory Considerations 81
8.2.4 A Distributed Approach 81
8.3 MinHash 82
8.3.1 Assumptions 83
8.3.2 Time and Space Complexity 83
8.3.3 Tools 83
8.3.4 A Distributed Approach 83
8.4 Cosine Similarity 84
8.4.1 Complexity 85
8.4.2 Memory Considerations 85
8.4.3 A Distributed Approach 86
8.5 Mahalanobis Distance 86
8.5.1 Complexity 86
8.5.2 Memory Considerations 87
8.5.3 A Distributed Approach 87
8.6 Conclusion 88
9 Regression 89
9.1 Introduction 89
9.1.1 Choosing the Model 90
9.1.2 Choosing the Objective Function 90
9.1.3 Fitting 91
9.1.4 Validation 92
Bibliography 245
Index 247
Foreword
This pragmatic book introduces both machine learning and data science, bridging gaps between
data scientist and engineer, and helping you bring these techniques into production. It helps
ensure that your efforts actually solve your problem, and offers unique coverage of real-world
optimization in production settings. This book is filled with code examples in Python and
visualizations to illustrate concepts in algorithms. Validation, hypothesis testing, and visualization
are introduced early on as these are all key to ensuring that your efforts in data science are actually
solving your problem. Part III of the book is unique among data science and machine learning
books because of its focus on real-world concerns in optimization. Thinking about hardware,
infrastructure, and distributed systems is a necessary step in bringing machine learning and data science
techniques into a production setting.
Andrew and Adam Kelleher bring their experience in engineering and data science, respectively,
from their work at BuzzFeed. The topics covered and where to provide breadth versus depth are
informed by their real-world experience solving problems in a large production environment.
Algorithms for comparison, classification, clustering, and dimensionality reduction are all
presented with examples of specific problems that can be solved with each. Explorations into more
advanced topics like Bayesian networks or deep learning are provided after the framework for basic
machine learning tasks is laid.
This book is a great addition to the Data & Analytics Series. It provides a well-grounded
introduction to data science and machine learning with a focus on problem-solving. It should
serve as a great resource to any engineer or “accidental programmer” with a more traditional math
or science background looking to apply machine learning to their production applications and
environment.
—Paul Dix, series editor
Preface
Most of this book was written while Andrew and Adam were working together at BuzzFeed. Adam
was a data scientist, Andrew was an engineer, and they spent a good deal of time working together
on the same team! Given that they’re identical twins of triplets, it was confusing and amusing for
everyone involved.
The idea for this book came after PyGotham in New York City in August 2014. There were several
talks relating to the relatively broadly defined field of “data science.” What we noticed was that
many data scientists start their careers driven by the curiosity and excitement of learning new
things. They discover new tools and often have a favorite technique or algorithm. They’ll apply that
tool to the problem they’re working on. When you have a hammer, every problem looks like a nail.
Often, as with neural networks (discussed in Chapter 14), it’s more like a pile driver. We wanted to
push past the hype of data science by giving data scientists, especially at the time they’re starting
their careers, a whole tool box. One could argue the context and error analysis tools of Part I are
actually more important than the advanced techniques discussed in Part III. In fact, they’re a major
motivator in writing this book. It’s very unlikely a choice of algorithm will be successful if its signal
is trumped by its noise, or if there is a high amount of systematic error. We hope this book provides
the right tools to take on the projects our readers encounter, and to be successful in their careers.
There’s no lack of texts in machine learning or computer science. There are even some decent texts
in the field of data science. What we hope to offer with this book is a comprehensive and rigorous
entry point to the field of data science. This tool box is slim and driven by our own experience of
what is useful in practice. We try to avoid opening up paths that lead to research-level problems. If
you’re solving research-level problems as a junior data scientist, you’ve probably gone out of scope.
There’s a critical side of data science that is separate from machine learning: engineering. In Part III
of this text we get into the engineering side. We discuss the problems you’re likely to encounter and
give you the fundamentals you’ll need to overcome them. Part III is essentially a Computer Science
201-202 crash course. Once you know what you’re building, you still have to address many
considerations on the path to production. This means understanding your toolbox from the
perspective of the tools.
This book is intended to be a crash course for those people. We run through a basic procedure for
taking on most data science tasks, encouraging data scientists to use their data set, rather than the
tools of the day, as the starting point. Data-driven data science is key to success. The big open secret
of data science is that while modeling is important, the bread and butter of data science is
simple queries, aggregations, and visualizations. Many industries are in a place where they’re
accumulating and seeing data for the very first time. There is value to be delivered quickly
and with minimal complexity.
Modeling is important, but hard. We believe in applying the principles of agile development to
data science. We talk about this a lot in Chapter 2. Start with a minimal solution: a simple heuristic
based on a data aggregation, for example. Improve the heuristic with a simple model when your
data pipeline is mature and stable. Improve the model when you don’t have anything more
important to do with your time. We’ll provide realistic case studies where this approach is applied.
Chapter 2, “Project Workflow,” sets the context for data science by describing agile development.
It’s a philosophy that helps keep scope small, and development efficient. It can be hard to keep
yourself from trying out the latest machine learning framework or tools offered by cloud platforms,
but it pays off in the long run.
Next, in Chapter 3, “Quantifying Error,” we provide you with a basic introduction to error analysis.
Much of data science is reporting simple statistics. Without understanding the error in those
statistics, you’re likely to come to invalid conclusions. Error analysis is a foundational skill and
important enough to be the first item in your tool kit.
We continue in Chapter 4, “Data Encoding and Preprocessing,” by discovering a few of the many
ways of encoding the real world in the form of data. Naturally this leads us to ask data-driven
questions about the real world. The framework for answering these questions is hypothesis testing,
which we provide a foundation for in Chapter 5, “Hypothesis Testing.”
At this point, we haven’t seen many graphs, and our tool kit is lacking in communicating our
results to the outside (nontechnical) world. We aim to resolve this in Chapter 6, “Data
Visualization,” where we learn many approaches to it. We keep the scope small and aim to mostly
either make plots of quantities we know how to calculate errors for, or plots that resolve some of the
tricky nuances of data visualization. While these tools aren’t as flashy as interactive visualizations
in d3 (which are worth learning!), they serve as a solid foundational skill set for communicating
results to nontechnical audiences.
Having provided the basic tools for working with data, we move on to more advanced concepts in
Part II, “Algorithms and Architecture.” We start with a brief introduction to data architectures in
Chapter 7, “Data Architectures,” and an introduction to basic concepts in machine learning in
Chapter 8, “Comparison.” You now have some very handy methods for measuring the similarities
of objects.
From there, we have some tools to do basic machine learning. In Chapter 9, “Regression,” we
introduce regression and start with one of the most important tools: linear regression. It’s odd to
start with such a simple tool in the age of neural networks and nonlinear machine learning, but
linear regression is outstanding for several reasons. As we’ll detail later, it’s interpretable, stable, and
often provides an excellent baseline. It can describe nonlinearities with some simple tricks, and
recent results have shown that polynomial regression (a simple modification of linear regression)
can outperform deep feedforward networks on typical applications!
From there, we describe one more basic workhorse of regression: the random forest. These are
nonlinear algorithms that rely on a statistical trick, called “bagging,” to provide excellent baseline
performance for a wide range of tasks. If you want a simple model to start a task with and linear
regression doesn’t quite work for you, random forest is a nice candidate.
Having introduced regression and provided some basic examples of the machine learning
workflow, we move on to Chapter 10, “Classification and Clustering.” We see a variety of methods
that work on both vector and graph data. We use this section to provide some basic background on
graphs and an abbreviated introduction to Bayesian inference. We dive into Bayesian inference and
causality in the next chapter.
Our Chapter 11, “Bayesian Networks,” is both unconventional and difficult. We take the view that
Bayesian networks are most intuitive (though not necessarily easiest) from the viewpoint of causal
graphs. We lay this intuition as the foundation for our introduction of Bayesian networks and
come back to it in later sections as the foundation for understanding causal inference. In
Chapter 12, “Dimensional Reduction and Latent Variable Models,” we build off of the foundation
of Bayesian networks to understand PCA and other variants of latent factor models. Topic modeling
is an important example of a latent variable model, and we provide a detailed example on the
newsgroups data set.
As the next to last data-focused chapter, we focus on the problem of causal inference in Chapter 13,
“Causal Inference.” It’s hard to overstate the importance of this skill. Data science typically aims
to inform how businesses act. The assumption is that the data tells you something about the
outcomes of your actions. That can only be true if your analysis has captured causal relationships
and not just correlative ones. In that sense, understanding causation underlies much of what we do
as data scientists. Unfortunately, with a view toward minimizing scope, it’s also too often the first
thing to cut. It’s important to balance stakeholder expectations when you scope a project, and good
causal inference can take time. We hope to empower data scientists to make informed decisions
and not to accept purely correlative results lightly.
Finally, in the last data-focused chapter we provide a section to introduce some of the nuances of
more advanced machine learning techniques in Chapter 14, “Advanced Machine Learning.” We use
neural networks as a tool to discuss overfitting and model capacity. The focus should be on using as
simple a solution as is available. Resist the urge to start with neural networks as a first model. Simple
regression techniques almost always provide a good enough baseline for a first solution.
Up to this point, the platform on which all of the data science happens has been in the
background. It’s where you do the data science and is not the primary focus. Not anymore. In the
last part of this book, Part III, “Bottlenecks and Optimizations,” we go in depth on hardware,
software, and the systems they make up.
We start with a comprehensive look at hardware in Chapter 15, “Hardware Fundamentals.” This
provides a tool box of basic resources we have to work with and also provides a framework to discuss
the constraints under which we must operate. These constraints are physical limitations on what is
possible, and those limitations are realized in the hardware.
Chapter 16, “Software Fundamentals,” provides the fundamentals of software and a basic
description of data logistics with a section on extract-transform-load, commonly known
as ETL.
Next, we give an overview of design considerations for architecture in Chapter 17, “Architecture
Fundamentals.” Architecture is the design for how your whole system fits together. It includes the
components for data storage, data transfer, and computation, as well as how they all communicate
with one another. Some architectures are more efficient than others and objectively do their jobs
better than others. Still, a less efficient solution might be more practical, given constraints on time
and resources. We hope to provide enough context so you can make informed decisions. Even if
you’re a data scientist and not an engineer, we hope to provide enough knowledge so you can at
least understand what’s happening with your data platform.
We then move on to some more advanced topics in engineering. Chapter 18, “The CAP Theorem,”
covers some fundamental bounds on database performance. Finally, we discuss how it all fits
together in the last chapter, which is on network topology: Chapter 19, “Logical Network
Topological Nodes.”
Going Forward
We hope that not only can you do the machine learning side of data science, but you can also
understand what’s possible in your own data platform. From there, you can understand what you
might need to build and find an efficient path for building out your infrastructure as you need to.
We hope that with a complete toolbox, you’re free to realize that the tools are only a part of the
solution. They’re a means to solve real problems, and real problems always have resource
constraints.
If there’s one lesson to take away from this book, it’s that you should always direct your resources
toward solving the problems with the highest return on investment. Solving your problem is a real
constraint. Occasionally, it might be true that nothing but the best machine learning models can
solve it. The question to ask, then, is whether that’s the best problem to solve or if there’s a simpler
one that presents a lower-risk value proposition.
Finally, while we would have liked to have addressed all aspects of production machine learning in
this book, it currently exists more as a production data science text. In subsequent editions, we
intend to cover omissions, especially in the area of machine learning infrastructure. This new
material will include methods to parallelize model training and prediction; the basics of
Tensorflow, Apache Airflow, Spark, and other frameworks and tools; the details of several real
machine learning platforms, including Uber’s Michelangelo, Google’s TFX, and our own work on
similar systems; and avoiding and managing coupling in machine learning systems. We encourage
the reader to seek out the many books, papers, and blog posts covering these topics in the
meantime, and to check for updates on the book’s website at adamkelleher.com/ml_book.
We hope you’ll enjoy learning these tools as much as we did, and we hope this book will save you
time and effort in the long run.
About the Authors
Andrew Kelleher is a staff software engineer and distributed systems architect at Venmo. He was
previously a staff software engineer at BuzzFeed and has worked on data pipelines and algorithm
implementations for modern optimization. He graduated with a BS in physics from Clemson
University. He runs a meetup in New York City that studies the fundamentals behind distributed
systems in the context of production applications, and was ranked one of FastCompany’s most
creative people two years in a row.
Adam Kelleher wrote this book while working as principal data scientist at BuzzFeed and adjunct
professor at Columbia University in the City of New York. As of May 2018, he is chief data scientist
for research at Barclays and teaches causal inference and machine learning products at Columbia.
He graduated from Clemson University with a BS in physics, and has a PhD in cosmology from
University of North Carolina at Chapel Hill.
13
Causal Inference
13.1 Introduction
We’ve introduced a couple of machine-learning algorithms and suggested that they can be used to
produce clear, interpretable results. You’ve seen that logistic regression coefficients can be used to
say how much more likely an outcome is to occur in conjunction with a feature (for binary
features) or how much more likely an outcome is to occur per unit increase in a variable (for
real-valued features). We’d like to make stronger statements. We’d like to say “If you increase a
variable by a unit, then it will have the effect of making an outcome more likely.”
These two interpretations of a regression coefficient are so similar on the surface that you may have
to read them a few times to take away the meaning. The key is that in the first case, we’re describing
what usually happens in a system that we observe. In the second case, we’re saying what will
happen if we intervene in that system and disrupt it from its normal operation.
After we go through an example, we’ll build up the mathematical and conceptual machinery to
describe interventions. We’ll cover how to go from a Bayesian network describing observational
data to one that describes the effects of an intervention. We’ll go through some classic approaches
to estimating the effects of interventions, and finally we’ll explain how to use machine-learning
estimators to estimate the effects of interventions.
If you imagine a binary outcome, such as “I’m late for work,” you can imagine some features that
might vary with it. Bad weather can cause you to be late for work. Bad weather can also cause you to
wear rain boots. Days when you’re wearing rain boots, then, are days when you’re more likely to be
late for work. If you look at the correlation between the binary feature “wearing rain boots” and the
outcome “I’m late for work,” you’ll find a positive relationship. It’s nonsense, of course, to say that
wearing rain boots causes you to be late for work. It’s just a proxy for bad weather. You’d never
recommend a policy of “You shouldn’t wear rain boots, so you’ll be late for work less often.” That
would be reasonable only if “wearing rain boots” was causally related to “being late for work.” As an
intervention to prevent lateness, not wearing rain boots doesn’t make any sense.
In this chapter, you’ll learn the difference between correlative (rain boots and lateness) and causal
(rain and lateness) relationships. We’ll discuss the gold standard for establishing causality: an
experiment. We’ll also cover some methods to discover causal relationships in cases when you’re
not able to run an experiment, which happens often in realistic settings.
13.2 Experiments
The case that might be familiar to you is an AB test. You can make a change to a product and test it
against the original version of the product. You do this by randomly splitting your users into two
groups. The group membership is denoted by D, where D = 1 is the group that experiences the new
change (the test group), and D = 0 is the group that experiences the original version of the product
(the control group). For concreteness, let’s say you’re looking at the effect of a recommender system
change that recommends articles on a website. The control group experiences the original
algorithm, and the test group experiences the new version. You want to see the effect of this change
on total pageviews, Y.
You’ll measure this effect by looking at a quantity called the average treatment effect (ATE). The ATE is
the average difference in the outcome between the test and control groups, E_test[Y] − E_control[Y], or
δ_naive = E[Y|D = 1] − E[Y|D = 0]. This is the “naive” estimator for the ATE since here we’re ignoring
everything else in the world. For experiments, it’s an unbiased estimate for the true effect.
A nice way to estimate this is to do a regression. That lets you also measure error bars at the same
time and include other covariates that you think might reduce the noise in Y so you can get more
precise results. Let’s continue with this example.
import numpy as np
import pandas as pd

N = 1000

x = np.random.normal(size=N)
d = np.random.binomial(1, 0.5, size=N)
# size=N gives each unit its own noise draw; a bare np.random.normal()
# would add the same constant to every row instead of noise
y = 3. * d + x + np.random.normal(size=N)

X = pd.DataFrame({'X': x, 'D': d, 'Y': y})
Here, we’ve randomized D to get about half in the test group and half in the control. X is some
other covariate that causes Y, and Y is the outcome variable. We’ve added a little extra noise to Y to
just make the problem a little noisier.
You can use a regression model Y = β0 + β1 D to estimate the expected value of Y, given the
covariate D, as E[Y|D] = β0 + β1 D. The β0 piece will be added to E[Y|D] for all values of D (i.e., 0 or
1). The β1 part is added only when D = 1 because when D = 0, it’s multiplied by zero. That means
E[Y|D = 0] = β0 when D = 0 and E[Y|D = 1] = β0 + β1 when D = 1. Thus, the β1 coefficient is
going to be the difference in average Y values between the D = 1 group and the D = 0 group,
E[Y|D = 1] − E[Y|D = 0] = β1 ! You can use that coefficient to estimate the effect of this experiment.
When you do the regression of Y against D, you get the result in Figure 13.1.
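A minimal sketch of that regression, assuming statsmodels (whose OLS the later listings in this
chapter also use); its summary corresponds to Figure 13.1:

import statsmodels.api as sm

# fit Y = b0 + b1 * D; add_constant supplies the intercept column for b0
model = sm.OLS(X['Y'], sm.add_constant(X['D']))
result = model.fit()
result.summary()

The coefficient on D is the estimate of the ATE, close to 3 here by construction.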
Why did this work? Why is it okay to say the effect of the experiment is just the difference between
the test and control group outcomes? It seems obvious, but that intuition will break down in the
next section. Let’s make sure you understand it deeply before moving on.
Each person can be assigned to the test group or the control group, but not both. For a person
assigned to the test group, you can talk hypothetically about the value their outcome would have
had, had they been assigned to the control group. You can call this value Y^0 because it’s the value Y
would take if D had been set to 0. Likewise, for control group members, you can talk about a
hypothetical Y^1. What you really want to measure is the difference in outcomes δ = Y^1 − Y^0 for
each person. This is impossible since each person can be in only one group! For this reason, these
Y^1 and Y^0 variables are called potential outcomes.
If a person is assigned to the test group, you measure the outcome Y = Y^1. If a person is assigned to
the control group, you measure Y = Y^0. Since you can’t measure the individual effects, maybe you
can measure population level effects. We can try to talk instead about E[Y^1] and E[Y^0]. We’d like
E[Y^1] = E[Y|D = 1] and E[Y^0] = E[Y|D = 0], but we’re not guaranteed that that’s true. In the
recommender system test example, what would happen if you assigned people with higher Y^0
pageview counts to the test group? You might measure an effect that’s larger than the true effect!
Fortunately, you randomize D to make sure it’s independent of Y^0 and Y^1. That way, you’re sure
that E[Y^1] = E[Y|D = 1] and E[Y^0] = E[Y|D = 0], so you can say that δ = E[Y^1 − Y^0] =
E[Y|D = 1] − E[Y|D = 0]. When other factors can influence assignment, D, then you can no longer
be sure you have correct estimates! This is true in general when you don’t have control over a
system, so you can’t ensure D is independent of all other factors.
In the general case, D won’t just be a binary variable. It can be ordered, discrete, or continuous. You
might wonder about the effect of the length of an article on the share rate, about smoking on the
probability of getting lung cancer, of the city you’re born in on future earnings, and so on.
Just for fun before we go on, let’s see something nice you can do in an experiment to get more
precise results. Since we have a co-variate, X, that also causes Y, we can account for more of the
variation in Y. That makes our predictions less noisy, so our estimates for the effect of D will be
more precise! Let’s see how this looks. We regress on both D and X now to get Figure 13.2.
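A sketch of that regression, again assuming statsmodels, now with the covariate X alongside D:

# including X soaks up variance in Y, tightening the confidence interval on D
model = sm.OLS(X['Y'], sm.add_constant(X[['D', 'X']]))
result = model.fit()
result.summary()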
Notice that the R² is much better. Also, notice that the confidence interval for D is much narrower!
We went from a range of 3.95 − 2.51 = 1.44 down to 3.65 − 2.76 = 0.89. In short, finding covariates
that account for the outcome can increase the precision of your experiments!
13.3 Observation: An Example
Figure 13.3 The neighborhood is a cause of its racial composition and poverty levels. The poverty
level is a cause of crime.
Here, there is no causal relationship between race and crime, but you would find them to be
correlated in observational data. Let’s simulate some data to examine this.
N = 10000

neighborhood = np.array(range(N))
industry = neighborhood % 3
race = ((neighborhood % 3) + np.random.binomial(3, p=0.2, size=N)) % 4
income = np.random.gamma(25, 1000*(industry + 1))
crime = np.random.gamma(100000. / income, 100, size=N)

X = pd.DataFrame({'$R$': race, '$I$': income, '$C$': crime,
                  '$E$': industry, '$N$': neighborhood})
Here, each data point will be a neighborhood. There are common historic reasons for the racial
composition and the dominant industry in each neighborhood. The industry determines the
income levels in the neighborhood, and the income level is inversely related with crime rates.
If you plot the correlation matrix for this data (Figure 13.4), you can see that race and crime are
correlated, even though there is no causal relationship between them!
C E I N R
C 1.000000 -0.542328 -0.567124 0.005518 -0.492169
E -0.542328 1.000000 0.880411 0.000071 0.897789
I -0.567124 0.880411 1.000000 -0.005650 0.793993
N 0.005518 0.000071 -0.005650 1.000000 -0.003666
R -0.492169 0.897789 0.793993 -0.003666 1.000000
Figure 13.4 Raw data showing correlations between crime (C), industry (E), income (I), neighborhood
(N), and race (R)
You can take a regression approach and see how you can interpret the regression coefficients. Since
we know the right model to use, we can just do the right regression, which gives the results in
Figure 13.5.
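A minimal sketch of that regression, assuming statsmodels; the '$1/I$' column name is ours, added
so we can regress crime on inverse income:

from statsmodels.api import OLS

# regress crime on 1/income, the form suggested by the data-generating process
X['$1/I$'] = 1. / X['$I$']
model = OLS(X['$C$'], X[['$1/I$']])
result = model.fit()
result.summary()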
From this you can see that when 1/I increases by a unit, the number of crimes increases by 123
units. If the crime units are in crimes per 10,000 people, this means 123 more crimes per 10,000
people.
This is a nice result, but you’d really like to know whether the result is causal. If it is causal, that
means you can design a policy intervention to exploit the relationship. That is, you’d like to know
if people earned more income, everything else held fixed, would there be less crime? If this were a
causal result, you could say that if you make incomes higher (independent of everything else), then
you can expect that for each unit decrease in 1/I, you’ll see 123 fewer crimes. What is keeping us
from making those claims now?
You’ll see that regression results aren’t necessarily causal; let’s look at the relationship between race
and crime. We’ll do another regression as shown here:
Here, you find a strong correlative relationship between race and crime, even though there’s no
causal relationship. You know that if we moved a lot of white people into a black neighborhood
(holding income level constant), you should have no effect on crime. If this regression were causal,
then you would. Why do you find a significant regression coefficient even when there’s no causal
relationship?
In this example, you went wrong because racial composition and income level were both caused by
the history of each neighborhood. This is a case where two variables share a common cause. If you
don’t control for that history, then you’ll find a spurious association between the two variables.
What you’re seeing is a general rule: when two variables share a common cause, they will be
correlated (or, more generally, statistically dependent) even when there’s no causal relationship
between them.
Another nice example of this common cause problem is that when lemonade sales are high, crime
rates are also high. If you regress crime on lemonade sales, you’d find a significant increase in
crimes per unit increase in lemonade sales! Clearly the solution isn’t to crack down on lemonade
stands. As it happens, more lemonade is sold on hot days. Crime is also higher on hot days. The
weather is a common cause of crime and lemonade sales. We find that the two are correlated even
though there is no causal relationship between them.
The solution in the lemonade example is to control for the weather. If you look at all days where it
is sunny and 95 degrees Fahrenheit, the effect of the weather on lemonade sales is constant. The
effect of weather on crime is also constant in the restricted data set. Any variance in the two must
be because of other factors. You’ll find that lemonade sales and crime no longer have a significant
correlation in this restricted data set. This problem is usually called confounding, and the way to
break confounding is to control for the confounder.
Similarly, if you look only at neighborhoods with a specific history (in this case the relevant
variable is the dominant industry), then you’ll break the relationship between race and income
and so also the relationship between race and crime.
To reason about this more rigorously, let’s look at Figure 13.3. We can see the source of dependence,
where there’s a path from N to R and a path from N through E and P to C. If you were able to break
this path by holding a variable fixed, you could disrupt the dependence that flows along it. The
result will be different from the usual observational result. You will have changed the dependencies
in the graph, so you will have changed the joint distribution of all these variables.
If you intervene to set the income level in an area in a way that is independent of the dominant
industry, you’ll break the causal link between the industry and the income, resulting in the graph
in Figure 13.7. In this system, you should find that the path that produces dependence between
race and crime is broken. The two should be independent.
Figure 13.7 The result of an intervention, where you set the income level by direct intervention in a
way that is independent of the dominant industry in the neighborhood
How can you do this controlling using only observational data? One way is just to restrict to subsets
of the data. You can, for example, look only at industry 0 and see how this last regression looks.
from statsmodels.api import OLS

# restrict to neighborhoods whose dominant industry is industry 0
X_restricted = X[X['$E$'] == 0].copy()

races = {0: 'african-american', 1: 'hispanic',
         2: 'asian', 3: 'white'}
X_restricted['race'] = X_restricted['$R$'].apply(lambda x: races[x])
race_dummies = pd.get_dummies(X_restricted['race'])
X_restricted[race_dummies.columns] = race_dummies
model = OLS(X_restricted['$C$'], race_dummies)
result = model.fit()
result.summary()
Now you can see that all of the results are within confidence of each other! The dependence
between race and crime is fully explained by the industry in the area. In other words, in this
hypothetical data set, crime is independent of race when you know what the dominant industry is
in the area. What you have done is the same as the conditioning you did before.
Notice that the confidence intervals on the new coefficients are fairly wide compared to what they
were before. This is because you’ve restricted to a small subset of your data. Can you do better,
maybe by using more of the data? It turns out there’s a better way to control for something than
restricting the data set. You can just regress on the variables you’d like to control for!
Figure 13.8 A hypothetical regression on race indicator variables predicting crime rates, but controlling for local industry using stratification of the data. There are no differences in expected crimes, controlling for industry.
# race_dummies here are the dummy columns built on the full data set, as in the
# unrestricted regression above; the '$E$' column holds the dominant industry
# (the original listing referenced it as 'industry')
industry_dummies = pd.get_dummies(X['$E$'], prefix='industry')
X[industry_dummies.columns] = industry_dummies

x = list(industry_dummies.columns)[1:] + list(race_dummies.columns)

model = OLS(X['$C$'], X[x])
result = model.fit()
result.summary()
Here, the confidence intervals are much narrower, and you see there’s still no significant
association between race and crime: the coefficients are roughly equal. This is a causal
regression result: you can now see that there would be no effect of an intervention to change the
racial composition of neighborhoods. This simple example is nice because you can see what to
control for, and you’ve measured the things you need to control for. How do you know what to
control for in general? Will you always be able to do it successfully? It turns out it’s very hard in
practice, but sometimes it’s the best you can do.
Figure 13.9 Statistics highlighting the relationship between race and industry from an OLS fit
13.4 Controlling to Block Non-causal Paths
You saw in the previous chapter that conditioning can break statistical dependence. If you
condition on the middle variable of a path X → Y → Z, you’ll break the dependence between X
and Z that the path produces. If you condition on a confounding variable X ← Z → Y, you can
break the dependence between X and Y induced by the confounder as well. It’s important to note
that statistical dependence induced by other paths between X and Y is left unharmed by this
conditioning. If, for example, you condition on Z in the system in Figure 13.10, you’ll get rid of the
confounding but leave the causal dependence.
Figure 13.10 Conditioning on Z disrupts the confounding but leaves the causal statistical dependence between X and Y intact
If you had a general rule to choose which paths to block, you could eliminate all noncausal
dependence between variables but save the causal dependence. The “back-door” criterion is the
rule you’re looking for. It tells you what set of variables, Z, you should control for to eliminate any
noncausal statistical dependence between Xi and Xj . You should note a final nuance before
introducing the criterion. If you want to know if the correlation between Xi and Xj is “causal,” you
have to worry about the direction of the effect. It’s great to know, for example, that the correlation
“being on vacation” and “being relaxed” is not confounded, but you’d really like to know whether
“being on vacation” causes you to “be relaxed.” That will inform a policy of going on vacation in
order to be relaxed. If the causation were reversed, you couldn’t take that policy.
With that in mind, the back-door criterion is defined relative to an ordered pair of variables,
(Xi, Xj), where Xi will be the cause, and Xj will be the effect: a set of variables Z satisfies the
back-door criterion relative to (Xi, Xj) if no variable in Z is a descendant of Xi, and Z blocks every
path between Xi and Xj that contains an arrow into Xi.
We won’t prove this theorem, but let’s build some intuition for it. First, let’s examine the condition
“no variable in Z is a descendant of Xi .” You learned earlier that if you condition on a common
effect of Xi and Xj , then the two variables will be conditionally dependent, even if they’re normally
independent. This remains true if you condition on any effect of the common effect (and so on
down the paths). Thus, you can see that the first part of the back-door criterion prevents you from
introducing extra dependence where there is none.
There is something more to this condition, too. If you have a chain like Xi → Xk → Xj , you see that
Xk is a descendant of Xi . It’s not allowed in Z. This is because if you condition on Xk , you’d block a
causal path between Xi and Xj . Thus, you see that the first condition also prevents you from
conditioning on variables that fall along causal paths.
The second condition says “Z blocks every path between Xi and Xj that contains an arrow into Xi .”
This part will tell us to control for confounders. How can you see this? Let’s consider some cases
where there is one or more node along the path between Xi and Xj and the path contains an arrow
into Xi . If there is a collider along the path between Xi and Xj , then the path is already blocked, so
you just condition on the empty set to block that path. Next, if there is a fork along the path, like
the path Xi ← Xk → Xj , and no colliders, then you have typical confounding. You can condition
on any node along the path that will block it. In this case, you add Xk to the set Z. Note that there
can be no causal path from Xi to Xj that contains an arrow pointing into Xi, because causal paths
must start with an arrow pointing out of Xi.
Thus, you can see that you’re blocking all noncausal paths from Xi to Xj , and the remaining
statistical dependence will be showing the causal dependence of Xj on Xi . Is there a way you can
use this dependence to estimate the effects of interventions?
Figure 13.11 A pre-intervention causal graph. Data collected from this system reflects the way the
world works when we just observe it.
You want to estimate the effect of X2 on X5 . That is, you want to say “If I intervene in this system to
set the value of X2 to x2, what will happen to X5?” To quantify the effect, you have to realize that all
of these variables are taking on values that depend not only on their predecessors but also on noise
in the system. Thus, even if there’s a deterministic effect of X2 on X5 (say, raising the value of X5 by
exactly one unit), you can only really describe the value X5 will take with a distribution of values.
Thus, when you’re estimating the effect of X2 on X5 , what you really want is the distribution of X5
when you intervene to set the value of X2 .
Let’s look at what we mean by intervene. We’re saying we want to ignore the usual effect of X1 on X2
and set the value of X2 to x2 by applying some external force (our action) to X2 . This removes the
usual dependence between X2 and X1 and disrupts the downstream effect of X1 on X4 by breaking
the path that passes through X2. Thus, we’ll also expect the marginal distribution of X1 and
X4, P(X1, X4), to change, as well as the distribution of X1 and X5! Our intervention can affect every
variable downstream from it in ways that don’t just depend on the value x2 . We actually disrupt
other dependences.
You can draw a new graph that represents this intervention. At this point, you’re seeing that the
operation is very different from observing the value of X2 = x2 , i.e., simply conditioning on
X2 = x2 . This is because you’re disrupting other dependences in the graph. You’re actually talking
about a new system described by the graph in Figure 13.12.
Figure 13.12 The graph representing the intervention do(X2 = x2). The statistics of this data will be different from those of the system in Figure 13.11.
You need some new notation to talk about an intervention like this, so you’ll denote do(X2 = x2 )
the intervention where you perform this operation. This gives you the definition of the
intervention, or do-operation.
What does the joint distribution look like for this new graph? Let’s use the usual factorization, and
write the following:
P_do(X2=x2)(X1, X2, X3, X4, X5) = P(X5|X4) P(X4|X2, X3) P(X3|X1) δ(X2, x2) P(X1)    (13.1)
Here we’ve just indicated P(X2) by the δ-function, so P(X2) = 0 if X2 ≠ x2, and P(X2) = 1 when
X2 = x2. We’re basically saying that when we intervene to set X2 = x2, we’re sure that it worked. We
can carry through that X2 = x2 elsewhere, like in the distribution for P(X4|X2, X3), by just
replacing X2 with X2 = x2, since the whole right-hand side is zero if X2 ≠ x2.
Finally, let’s just condition on the X2 distribution to get rid of the weirdness on the right-hand side
of this formula. We can write the following:
P_do(X2=x2)(X1, X2, X3, X4, X5 | X2) = P(X5|X4) P(X4|X2 = x2, X3) P(X3|X1) P(X1)    (13.2)
However, this is the same as the original formula, divided by P(X2|X1)! To be precise,
P_do(X2=x2)(X1, X2, X3, X4, X5 | X2 = x2) = P(X1, X2 = x2, X3, X4, X5) / P(X2 = x2|X1)    (13.3)
The same argument goes through for an intervention on any variable Xi with parents Pa(Xi):
P(X1, ..., Xn | do(Xi = xi)) = P(X1, ..., Xn) / P(Xi|Pa(Xi))    (13.4)
This leads us to a nice general rule: the parents of a variable will always satisfy the back-door
criterion! It turns out we can be even more general than this. If we marginalize out everything
except Xi and Xj, we see the parents are the set of variables that control confounders.
P(Xj, Pa(Xi) | do(Xi = xi)) = P(Xj, Xi, Pa(Xi)) / P(Xi|Pa(Xi))    (13.5)
It turns out (we’ll state without proof) that you can generalize the parents to any set, Z, that
satisfies the back door criterion.
P(Xj , Xi , Z)
P(Xj , Z|do(Xi = xi)) = (13.6)
P(Xi |Z)
You can marginalize Z out of this and use the definition of conditional probability to write an
important formula, shown in Definition 13.3 (the g-formula):
P(Xj | do(Xi = xi)) = Σ_Z P(Xj | Xi = xi, Z) P(Z)    (13.7)
This is a general formula for estimating the distribution of Xj under an intervention on Xi. Notice
that all of these distributions are from the pre-intervention system. This means you can use
observational data to estimate the distribution of Xj under some hypothetical intervention!
There are a few critical caveats here. First, the term in the denominator of Equation 13.4,
P(Xi |Pa(Xi )), must be nonzero for the quantity on the left side to be defined. This means you would
have to have observed Xi taking on the value you’d like to set it to with your intervention. If you’ve
never seen it, you can’t say how the system might behave in response to it!
Next, you’re assuming that you have a set Z that you can control for. Practically, it’s hard to know if
you’ve found a good set of variables. There can always be a confounder you have never thought to
measure. Likewise, your way of controlling for known confounders might not do a very good job.
You’ll understand this second caveat more as you go into some machine learning estimators.
With these caveats, it can be hard to estimate causal effects from observational data. You should
consider the results of a conditioning approach to be a provisional estimate of a causal effect. If
you’re sure you’re not violating the first condition of the back-door criterion, then you can expect
that you’ve removed some spurious dependence. You can’t say for sure that you’ve reduced bias.
Imagine, for example, two sources of bias for the effect of Xi on Xj. Suppose you’re interested in
measuring an average value of Xj, E_do(Xi=xi)[Xj] = µ_j. Path A introduces a bias of −δ, and path B
introduces a bias of 2δ. If you estimate the mean without controlling for either path, you’ll find
µ_j^(biased) = µ_j + 2δ − δ = µ_j + δ. If you control for a confounder along path A, then you remove its
contribution to the bias, which leaves µ_j^(biased,A) = µ_j + 2δ. Now the bias is twice as large! The
problem, of course, is that the bias you corrected was actually pushing our estimate back toward its
correct value. In practice, more controlling usually helps, but you can’t be guaranteed that you
won’t find an effect like this.
Now that you have a good background in observational causal inference, let’s see how
machine-learning estimators can help in practice!
13.5 Machine-Learning Estimators
In the simplest case, you’d like to estimate the expected difference in some outcome, Xj, per unit
change in a variable you have control over, Xi. For example, you might like to measure
E[Xj |do(Xi = 1)] − E[Xj |do(Xi = 0)]. This tells you the change in Xj you can expect on average when
you set Xi to 1 from what it would be if Xi were set to 0.
Let’s revisit the g-formula to see how can measure these kinds of quantities.
If you take expectation values on each side (by multiplying by Xj and summing over Xj ), then you
find this: X
E(Xj |do(Xi = xi )) = E(Xj |Xi , Z)P(Z) (13.8)
Z
In practice, it’s easy to estimate the first factor on the right side of this formula. If you fit a
regression estimator using mean-squared error loss, then the best fit is just the expected value of Xj
at each point (Xi , Z). As long as the model has enough freedom to accurately describe the expected
value, you can estimate this first factor by using standard machine-learning approaches.
To estimate the whole left side, you need to deal with the P(Z) term, as well as the sum. It turns out
there’s a simple trick for doing this. If your data was generated by drawing from the observational
joint distribution, then your samples of Z are actually just draws from P(Z). Then, if you replace the
P(Z) term by 1/N (for N samples) and sum over data points, you’re left with an estimator for this
sum. That is, you can make the substitution as follows:
E_N(Xj | do(Xi = xi)) = (1/N) Σ_{k=1}^{N} E(Xj | Xi, Z^(k)),    (13.9)
where the (k) index runs over our data points, from 1 to N. Let’s see how all of this works in an
example.
13.5.2 An Example
Let’s go back to the graph in Figure 13.11. We’ll use an example from Judea Pearl’s book. We’re
concerned with the sidewalk being slippery, so we’re investigating its causes. X5 can be 1 or 0, for
slippery or not, respectively. You’ve found that the sidewalk is slippery when it’s wet, and you’ll use
X4 to indicate whether the sidewalk is wet. Next, you need to know the causes of the sidewalk being
wet. You see that a sprinkler is near the sidewalk, and if the sprinkler is on, it makes the sidewalk
wet. X2 will indicate whether the sprinkler is on. You’ll notice the sidewalk is also wet after it rains,
which you’ll indicate with X3 being 1 after rain, 0 otherwise. Finally, you note that on sunny days
you turn the sprinkler on. You’ll indicate the weather with X1 , where X1 is 1 if it is sunny, and 0
otherwise.
In this picture, rain and the sprinkler being on are negatively related to each other. This statistical
dependence happens because of their mutual dependence on the weather. Let’s simulate some data
to explore this system. You’ll use a lot of data, so the random error will be small, and you can focus
your attention on the bias.
import numpy as np
import pandas as pd
from scipy.special import expit

N = 100000
inv_logit = expit
x1 = np.random.binomial(1, p=0.5, size=N)
x2 = np.random.binomial(1, p=inv_logit(-3.*x1))
x3 = np.random.binomial(1, p=inv_logit(3.*x1))
x4 = np.bitwise_or(x2, x3)
x5 = np.random.binomial(1, p=inv_logit(3.*x4))

X = pd.DataFrame({'$x_1$': x1, '$x_2$': x2, '$x_3$': x3,
                  '$x_4$': x4, '$x_5$': x5})
Every variable here is binary. You use a logistic link function to make logistic regression
appropriate. When you don’t know the data-generating process, you might get a little more
creative. You’ll come to this point in a moment!
Let’s look at the correlation matrix, shown in Figure 13.13. When the weather is good, the sprinkler
is turned on. When it rains, the sprinkler is turned off. You can see there’s a negative relationship
between the sprinkler being on and the rain due to this relationship.
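With pandas, producing the matrix in Figure 13.13 is a one-liner on the frame defined above:

X.corr()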
There are a few ways you can get an estimate for the effect of X2 on X5 . The first is simply by finding
the probability that X5 = 1 given that X2 = 1 or X2 = 0. The difference in these probabilities tells
you how much more likely it is that the sidewalk is slippery given that the sprinkler was on. A
simple way to calculate these probabilities is simply to average the X5 variable in each subset of the
data (where X2 = 0 and X2 = 1). You can run the following, which produces the table in
Figure 13.14.
X.groupby('$x_2$').mean()[['$x_5$']]
Figure 13.13 The correlation matrix for the simulated data set. Notice that X2 and X3 are negatively
related because of their common cause, X1 .
x5
x2
0 0.861767
1 0.951492
Figure 13.14 The naive conditional expectation values for whether the grass is wet given that the
sprinkler is on, E[X5 |X2 = x2 ]. This is not a causal result because you haven’t adjusted for confounders.
If you look at the difference here, you see that the sidewalk is 0.95 − 0.86 = 0.09, or nine percentage
points more likely to be slippery given that the sprinkler was on. You can compare this with the
interventional graph to get the true estimate for the change. You can generate this data using the
process shown here:
N = 100000
inv_logit = expit
x1 = np.random.binomial(1, p=0.5, size=N)
x2 = np.random.binomial(1, p=0.5, size=N)
x3 = np.random.binomial(1, p=inv_logit(3.*x1))
x4 = np.bitwise_or(x2, x3)
x5 = np.random.binomial(1, p=inv_logit(3.*x4))

X = pd.DataFrame({'$x_1$': x1, '$x_2$': x2, '$x_3$': x3,
                  '$x_4$': x4, '$x_5$': x5})
Now, X2 is independent of X1 and X3 . If you repeat the calculation from before (try it!), you get a
difference of 0.12, or 12 percentage points. This is about 30 percent larger than the naive estimate!
Now, you’ll use some machine learning approaches to try to get a better estimate of the true (0.12)
effect strictly using the observational data. First, you’ll try a logistic regression on the first data set.
Let’s re-create the naive estimate, just to make sure it’s working properly.
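A sketch of that estimate follows; the scikit-learn import is an assumption here, and the controlled
listing below reuses it.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model = model.fit(X[['$x_2$']], X['$x_5$'])

# probabilities of X5 = 1 with x2 forced to 0, then to 1, over the whole data set
X0 = X.copy()
X0['$x_2$'] = 0
y_pred_0 = model.predict_proba(X0[['$x_2$']])

X1 = X.copy()
X1['$x_2$'] = 1
y_pred_1 = model.predict_proba(X1[['$x_2$']])

# average difference in probabilities: about 0.09, the naive estimate
y_pred_1[:, 1].mean() - y_pred_0[:, 1].mean()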
You first build a logistic regression model using X2 to predict X5 . You do the prediction and use it to
get probabilities of X5 under the X2 = 0 and X2 = 1 states. You did this over the whole data set.
The reason for this is that you’ll often have more interesting data sets, with many more variables
changing, and you’ll want to see the average effect of X2 on X5 over the whole data set. This
procedure lets you do that. Finally, you find the average difference in probabilities between the two
states, and you get the same 0.09 result as before!
Now, you’d like to do controlling on the same observational data to get the causal (0.12) result. You
perform the same procedure as before, but this time you include X1 in the regression.
model = LogisticRegression()
model = model.fit(X[['$x_2$', '$x_1$']], X['$x_5$'])

# what would have happened if $x_2$ was always 0:
X0 = X.copy()
X0['$x_2$'] = 0
y_pred_0 = model.predict_proba(X0[['$x_2$', '$x_1$']])

# what would have happened if $x_2$ was always 1:
X1 = X.copy()
X1['$x_2$'] = 1

# now, let's check the difference in probabilities
y_pred_1 = model.predict_proba(X1[['$x_2$', '$x_1$']])
y_pred_1[:, 1].mean() - y_pred_0[:, 1].mean()
In this case, you find 0.14 for the result. You’ve over-estimated it! What went wrong? You didn’t
actually do anything wrong with the modeling procedure. The problem is simply that logistic
regression isn’t the right model for this situation. It’s the correct model for predicting each variable
from its parents, but it doesn’t work properly for variables farther downstream than the parents. Can we do
better, with a more general model?
This will be your first look at how powerful neural networks can be for general machine-learning
tasks. You’ll learn about building them in a little more detail in the next chapter. For now, let’s try a
deep feedforward neural network using keras. It’s called deep because there are more than just the
input and output layers. It’s a feedforward network because you put some input data into the
network and pass them forward through the layers to produce the output.
Deep feedforward networks have the property of being “universal function approximators,” in the
sense that they can approximate any function, given enough neurons and layers (although it’s not
always easy to learn, in practice). You’ll construct the network like this:
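The following is a minimal sketch; the layer sizes, activations, and training settings here are
illustrative assumptions rather than a prescription.

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(128, input_shape=(2,), activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(1, activation='sigmoid'))  # output approximates P(X5 = 1)
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(X[['$x_1$', '$x_2$']].values, X['$x_5$'].values,
          epochs=10, batch_size=100)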
Now do the same prediction procedure as before, which produces the result 0.129.
X_zero = X.copy()
X_zero['$x_2$'] = 0
x5_pred_0 = model.predict(X_zero[['$x_1$', '$x_2$']].values)

X_one = X.copy()
X_one['$x_2$'] = 1
x5_pred_1 = model.predict(X_one[['$x_1$', '$x_2$']].values)

x5_pred_1.mean() - x5_pred_0.mean()
You’ve done better than the logistic regression model! This was a tricky case. You’re given binary
data where it’s easy to calculate probabilities, and you’d do the best by simply using the g-formula
directly. When you do this (try it yourself!), you calculate the true result of 0.127 from this data.
Your neural network model is very close!
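For reference, the direct calculation might look like the following sketch, which estimates
E[X5 | X2, X1] with empirical frequencies on the observational data and averages over P(X1):

# g-formula: E[X5 | do(X2 = x2)] = sum_z E[X5 | X2 = x2, X1 = z] P(X1 = z)
p_x1 = X['$x_1$'].value_counts(normalize=True)
e_x5 = X.groupby(['$x_2$', '$x_1$'])['$x_5$'].mean()
effect = sum((e_x5.loc[(1, z)] - e_x5.loc[(0, z)]) * p_x1[z]
             for z in p_x1.index)
print(effect)  # about 0.127 on data simulated as above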
Now, you’d like to enact a policy that would make the sidewalk less likely to be slippery. You know
that if you turn the sprinkler on less often, that should do the trick. You see that enacting this
policy (and so intervening to change the system), you can expect the slipperiness of the sidewalk to
decrease. How much? You want to compare the pre-intervention chance of slipperiness with the
post-intervention chance, when you set sprinkler = off. You can simply calculate this with our
neural network model like so:
X['$x_5$'].mean() - x5_pred_0.mean()
This gives the result 0.07. It will be seven percentage points less likely that the sidewalk is slippery if you make a
policy of keeping the sprinkler turned off!
13.6 Conclusion
In this chapter, you’ve developed the tools to do causal inference. You’ve learned that machine
learning models can be useful to get more general model specifications, and you saw that the better
you can predict an outcome using a machine learning model, the better you can remove bias from
an observational causal effect estimate.
Observational causal effect estimates should always be used with care. Whenever possible, you
should try to do a randomized controlled experiment instead of using the observational estimate.
In this example, you should simply use randomized control: flip a coin each day to see whether the
sprinkler gets turned on. This re-creates the post-intervention system and lets you measure how
much less likely the sidewalk is to be slippery when the sprinkler is turned off versus turned on (or
when the system isn’t intervened upon). When you’re trying to estimate the effect of a policy, it’s
hard to find a substitute for actually testing the policy through a controlled experiment.
It’s especially useful to be able to think causally when designing machine-learning systems. If you’d
simply like to say what outcome is most likely given what normally happens in a system, a standard
machine learning algorithm is appropriate. You’re not trying to predict the result of an
intervention, and you’re not trying to make a system that is robust to changes in how the system
operates. You just want to describe the system’s joint distribution (or expectation values under it).
If you would like to inform policy changes, predict the outcomes of intervention, or make the
system robust to changes in variables upstream from it (i.e., external interventions), then you will
want a causal machine learning system, where you control for the appropriate variables to measure
causal effects.
An especially interesting application area is when you’re estimating the coefficients in a logistic
regression. Earlier, you saw that logistic regression coefficients had a particular interpretation in
observational data: they describe how much more likely an outcome is per unit increase in some
independent variable. If you control for the right variables to get a causal logistic regression
estimate (or just do the regression on data generated by control), then you have a new, stronger
interpretation: the coefficients tell you how much more likely an outcome is to occur when you
intervene to increase the value of an independent variable by one unit. You can use these
coefficients to inform policy!
Index
Numbers
12 principles of agile methodology, 11–14
“12-factor rules,” 71
95 percent confidence interval, 20, 107
A
A/B replication, 229–230, 240–241
access, RAM (random access memory), 205–206
aggregation, 214
agile development, product focus and, 10–11
agile methodology, 12 principles, 11–14
algorithms
Cannon’s algorithm, 97
classification algorithms, 117
k-means, 125–127
logistic regression, 118–122
naive Bayes, 122–124
clustering algorithms, 117
greedy Louvain, 130–131
k-means. see k-means
leading eigenvalue, 128–130
nearest neighbors, 131–133
comparison algorithms. See comparison
algorithms
Amazon, Route 53, 226–227
ANNoy, 133
API buffering, queues, 243
application-level caching, 236
architectures, 70–71
batch computing, 72–73
data sources, 72
online computing, 72–73
scaling, 73–74
services, 71
software architecture
client-server architecture, 217–218
microservices, 220
mix-and-match architectures, 221
monolith, 220
n-tier/service-oriented architecture, 218–219
assumptions
greedy Louvain, 130
ICA (independent component analysis), 158
k-means, 127
linear least squares, 97
logistic regression, 121
MinHash, 83
naive Bayes, 124
nearest neighbors, 132
asynchronous process execution, queues, 242–243
ATE (average treatment effect), 168
auto-correlation, time-series plots, 60–61
availability, CAP theorem, 225
client-side load balancing, 228
data layers, 228–230
failover, 230–231
front ends and load balancers, 225–228
jobs and taskworkers, 230
redundancy, 225
average treatment effect (ATE), 168
avoiding locally caching sensitive information, 237
C
cache invalidation, 72
cache services, 237
caches, 72, 235
application-level caching, 236
cache services, 237
write-through caches, 238
Cannon’s algorithm, 97
CAP theorem, 223
availability, 225
client-side load balancing, 228
data layers, 228–230
failover, 230–231
front ends and load balancers, 225–228
jobs and taskworkers, 230
redundancy, 225
consistency/concurrency, 223–224
conflict-free data types, 224–225
partition tolerance, 231–232
capacity, neural networks, 193–196
career development, for data scientists, 5
CARP (Common Address Redundancy Protocol), 227–228
S
sampling error, 19–21
scaling, 73–74
scatter plots, 51–55
scikit learn, 122, 128
scipy.optimize.leastsq, 98
secondary, 228
SELECT FOR UPDATE, 224
self-organizing teams, 14
separation of concerns, 70–71
sequences, 79–80
service-oriented architectures (SOAs), 71, 218–219
services, 71
sets, 79–80
sharding, 229, 241
simplicity, 14
sklearn.neighbors, 133
SOAs (service-oriented architectures), 71, 218–219
sockets, 217
software architecture
client-server architecture, 217–218
microservices, 220
mix-and-match architectures, 221
monolith, 220
n-tier/service-oriented architecture, 218–219
solid-state drives (SSDs), nonvolatile/persistent storage, 207
space complexity, MinHash, 83
sparse vectors, 28
sparsity, text preprocessing, 28
spinning disks, 206–208
split brains, 231–232
SSDs (solid-state drives), 207
stability, dependence and (Bayesian networks), 137–138
static content, application-level caching, 236
stochastic gradient descent, 75, 200
stochasticity, 200
storage, nonvolatile/persistent storage, 206–208
storing data, 215
supervised learning, 125
survival plots, 51
swapping, 208
systematic error, 18
T
task scheduling, queues, 241–242
taskworkers, availability, 230
teams, 12
role of, data scientists, 4–5
self-organizing teams, 14
technical debt, 4, 13
terminal nodes, 110
test coverage, 34
testing
hypothesis testing, 37
multiple testing, 41–42
tests, Jarque-Bera test, 108
text preprocessing, 26
feature selection, 28–30
n-grams, 27–28
representation learning, 30–33
sparsity, 28
tokenization, 26–27
thrashing, nonvolatile/persistent storage, 208
threading, processors, 210
threads, 208
threads of execution, 73
throughput, 208–209
time complexity, 64
Jaccard distance, 81
logistic regression, 121
MinHash, 83
time to live (TTL), 72
time-series plots, 58
auto-correlation, 60–61
rolling statistics, 58–60
tokenization, text preprocessing, 26–27
tools
greedy Louvain, 131
ICA (independent component analysis), 159
k-means, 128
leading eigenvalue, 130
linear least squares, 98
logistic regression, 122
MinHash, 83
naive Bayes, 124
nearest neighbors, 133
PCA (principal components analysis), 154
topics, 159
topological ordering, 139
tracking impressions, 18
training models, 74–75
true value, 18
TTL (time to live), 72
Type I errors, 39
Type II errors, 39
U
uncertainty, nonlinear regression with linear regression, 107–109
underpowered, 39
V
validation, 92–96
models, 76–77
value proposition, 10–11
variables
binary variables, 25
continuous variables, 46
W
workstations, 69
write-through caches, 238
Z
Z statistic, 38–39