Data Science in Production: Building Scalable Model Pipelines with Python
Ben G. Weber
Independently Published (2020)
Preface
0.1 Prerequisites
0.2 Book Contents
0.3 Code Examples
0.4 Acknowledgements
1 Introduction
1.1 Applied Data Science
1.2 Python for Scalable Compute
1.3 Cloud Environments
1.3.1 Amazon Web Services (AWS)
1.3.2 Google Cloud Platform (GCP)
1.4 Coding Environments
1.4.1 Jupyter on EC2
1.5 Datasets
1.5.1 BigQuery to Pandas
1.5.2 Kaggle to Pandas
1.6 Prototype Models
1.6.1 Linear Regression
1.6.2 Logistic Regression
1.6.3 Keras Regression
1.7 Automated Feature Engineering
1.8 Conclusion
This book was developed using the Leanpub platform
(https://leanpub.com/ProductionDataScience). Please send
any feedback or corrections to: [email protected]
The data science landscape is constantly evolving, as new
tools and libraries enable smaller teams to deliver more
impactful products. Data scientists are now expected to build
systems that scale not just to a single product, but to a
portfolio of products. The goal of this book is to provide data
scientists with a set of tools that can be used to build predictive
model services for product teams.
This text is meant to be a Data Science 201 course for data science
practitioners who want to develop skills in the applied science dis-
cipline. The target audience is readers with past experience with
Python and scikit-learn who want to learn how to build data prod-
ucts. The goal is to get readers hands-on with a number of tools
and cloud environments that they would use in industry settings.
0.1 Prerequisites
This book assumes that readers have prior knowledge of Python
and Pandas, as well as some experience with modeling packages
such as scikit-learn. This book focuses on breadth rather than
depth, with the goal of getting readers hands-on with a number of
different tools.
Python has a large library of books available, covering the lan-
guage fundamentals, specific packages, and disciplines such as data
science. Here are some of the books I would recommend for readers
to build additional knowledge of the Python ecosystem.
• Python And Pandas
– Data Science from Scratch (Grus, 2015): Introduces
Python from a data science perspective.
– Python for Data Analysis (McKinney, 2017): Provides ex-
tensive details on the Pandas library.
• Machine Learning
– Hands-On Machine Learning (Géron, 2017): Covers scikit-
learn in depth as well as TensorFlow and Keras.
– Deep Learning with Python (Chollet, 2017): Provides an ex-
cellent introduction to deep learning concepts using Keras
as the core framework.
I will walk through the code samples in this book in detail, but
will not cover the fundamentals of Python. Readers may find it
useful to first explore these texts before digging into building large
scale pipelines in Python.
0.4 Acknowledgements
I was able to author this book using Yihui Xie’s excellent book-
down package (Xie, 2015). For the design, I used Shashi Kumar’s
template (https://bit.ly/2MjFDgV), available under the Creative
Commons 4.0 license. The book cover uses Cédric Franchetti’s
image from pxhere (https://pxhere.com/en/photo/1417846).
This book was last updated on December 31, 2019.
1 Introduction
AWS. The result is a remote machine that we can use for Python
scripting. Accomplishing this task requires spinning up an EC2
instance, configuring firewall settings for the EC2 instance, con-
necting to the instance using SSH, and running a few commands
to deploy a Jupyter environment on the machine.
The first step is to set up an AWS account and log into the AWS
management console. AWS provides a free account option with
free-tier access to a number of services including EC2. Next, pro-
vision a machine using the following steps:
The machine may take a few minutes to provision. Once the ma-
chine is ready, the instance state will be set to “running”. We can
now connect to the machine via SSH. One note on the different
AMI options is that some of the configurations are set up with
Python already installed. However, this book focuses on Python 3
and the included version is often 2.7.
There are two different IPs that you need in order to connect to the
machine via SSH and later connect to the machine via web browser.
The public and private IPs are listed under the “Description” tab
pip --version
pip install --user jupyter
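The launch command itself is not shown in this excerpt. A hedged
example that makes the notebook reachable from outside the instance
is:

jupyter notebook --ip=0.0.0.0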
When you run the jupyter notebook command, you’ll get a URL
with a token that can be used to connect to the machine. Be-
fore entering the URL into your browser, you’ll need to swap the
Private IP output to the console with the Public IP of the EC2
instance, as shown in the snippet below.
# Original URL
The Jupyter Notebook is running at:
http://172.31.53.82:8888/?token=
98175f620fd68660d26fa7970509c6c49ec2afc280956a26
You can now paste the updated URL into your browser to connect
to Jupyter on the EC2 machine. The result should be a fresh Jupyter
notebook install with a single file, get-pip.py, in the base di-
rectory, as shown in Figure 1.3. Now that we have a machine set
up with Python 3 and Jupyter notebook, we can start exploring
different data sets for data science.
1.5 Datasets
To build scalable data pipelines, we’ll need to switch from using
local files, such as CSVs, to distributed data sources, such as Par-
quet files on S3. While the tools used across cloud platforms to
load data vary significantly, the end result is usually the same,
which is a dataframe. In a single machine environment, we can use
Pandas to load the dataframe, while distributed environments use
different implementations such as Spark dataframes in PySpark.
This section will introduce the data sets that we’ll explore through-
out the rest of the book. In this chapter we’ll focus on loading the
data using a single machine, while later chapters will present dis-
tributed approaches. While most of the data sets presented here
can be downloaded as CSV files and read into Pandas using read_csv,
it’s good practice to develop automated workflows to connect to di-
verse data sources. We’ll explore the following datasets throughout
this book:
• Boston Housing: Records of sale prices of homes in the Boston
housing market back in 1980.
• Game Purchases: A synthetic data set representing games pur-
chased by different users on XBox One.
• Natality: One of BigQuery’s open data sets on birth statistics
in the US over multiple decades.
• Kaggle NHL: Play-by-play events from professional hockey
games and game statistics over the past decade.
The first two data sets are single commands to load, as long as
you have the required libraries installed. The Natality and Kaggle
NHL data sets require setting up authentication files before you
can programmatically pull the data sources into Pandas.
The first approach we’ll use to load a data set is to retrieve it di-
rectly from a library. Multiple libraries include the Boston housing
data set, because it is a small data set that is useful for testing out
regression models. We’ll load it from scikit-learn by first running
pip from the command line:
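The commands for this step are not included in this excerpt; a hedged
sketch is shown below, assuming scikit-learn has been installed with
pip install --user sklearn and a release that still provides the
load_boston helper. The bostonDF name is illustrative.

from sklearn.datasets import load_boston
import pandas as pd

# load the Boston housing data and convert it to a Pandas dataframe
data = load_boston()
bostonDF = pd.DataFrame(data.data, columns=data.feature_names)
bostonDF['label'] = data.target
bostonDF.head()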
gamesDF = pd.read_csv("https://github.com/bgweber/Twitch/"
    "raw/master/Recommendations/games-expand.csv")
gamesDF.head()
Next, we’ll need to set up the Google Cloud command line tools,
in order to set up credentials for connecting to BigQuery. While
the files to use will vary based on the current release (see
https://cloud.google.com/sdk/install), here are the steps I ran
on the command line:
curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-255.0.0-linux-x86_64.tar.gz
tar zxvf google-cloud-sdk-255.0.0-linux-x86_64.tar.gz google-cloud-sdk
./google-cloud-sdk/install.sh
Once the Google Cloud command line tools are installed, we can
set up credentials for connecting to BigQuery:
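The credential commands and the query used in the book are not shown
in this excerpt; a minimal sketch, assuming application-default
credentials and an illustrative query against the public natality
table, is given below so that the snippet that follows has a client
and sql object to work with.

gcloud auth application-default login

from google.cloud import bigquery

client = bigquery.Client()

sql = """
  SELECT year, COUNT(*) AS births
  FROM `bigquery-public-data.samples.natality`
  GROUP BY year
  ORDER BY year
"""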
natalityDF = client.query(sql).to_dataframe()
natalityDF.head()
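The download commands themselves are not included in this excerpt;
a hedged sketch using the Kaggle CLI, with a placeholder dataset
name, is shown below.

kaggle datasets download [owner]/[nhl-dataset]
unzip [nhl-dataset].zip
chmod 0600 *.csv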
These commands will download the data set, unzip the files into
the current directory, and enable read access on the files. Now
that the files are downloaded on the EC2 instance, we can load
and display the Game data set, as shown in Figure 1.7. This data
set includes different files, where the game file provides game-level
summaries and the game_plays file provides play-by-play details.
import pandas as pd
nhlDF = pd.read_csv('game.csv')
nhlDF.head()
This book does not focus on state-of-the-art models, but instead
covers tools that can be applied to a variety of different machine
learning algorithms.
The library to use for implementing different models will vary
based on the cloud platform and execution environment being used
to deploy a model. The regression models presented in this section
are built with scikit-learn, while the models we’ll build out with
PySpark use MLlib.
model = LinearRegression()
model.fit(x_train, y_train)
model = LogisticRegression()
model.fit(x_train, y_train)
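The surrounding steps, loading the games data, creating a train and
holdout split, and computing the metrics described below, are not
included in this excerpt. A hedged sketch is shown here; the split
ratio is an assumption.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# load the games data set
df = pd.read_csv("https://github.com/bgweber/Twitch/"
                 "raw/master/Recommendations/games-expand.csv")
x = df.drop(['label'], axis=1)
y = df['label']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

model = LogisticRegression()
model.fit(x_train, y_train)

# accuracy and ROC AUC on the holdout set
print("Accuracy: " + str(model.score(x_test, y_test)))
print("ROC AUC: " + str(roc_auc_score(y_test,
    model.predict_proba(x_test)[:, 1])))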
The output of this script is two metrics that describe the perfor-
mance of the model on the holdout data set. The accuracy metric
is the fraction of correct predictions over the total number of
predictions, and the ROC AUC metric summarizes how well the model
ranks positive outcomes above negative ones across different
classification thresholds. ROC AUC is a useful metric when the
classes being predicted are imbalanced, with noticeably different
sizes. Since most players are unlikely to buy a specific game, ROC
AUC is a good metric for this use case. When I ran this script, the
result was an accuracy of 86.6% and an ROC AUC score of 0.757.
Linear and logistic regression models with scikit-learn are a good
starting point for many machine learning projects. We’ll explore
more complex models in this book, but one of the general strategies
I take as a data scientist is to quickly deliver a proof of concept,
and then iterate and improve a model once it is shown to provide
value to an organization.
This process can take a while to complete, and based on your en-
vironment may run into installation issues. It’s recommended to
verify that the installation worked by checking your Keras version
in a Jupyter notebook:
import tensorflow as tf
import keras
from keras import models, layers
import matplotlib.pyplot as plt
keras.__version__
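The model-building step is not included in this excerpt. A hedged
stand-in is sketched below: a small network trained on the games
data, recording AUC per epoch so that the plotting code that follows
has a history object to work with. The book defines a custom auc
metric; the built-in tf.keras AUC metric is used here as a stand-in
and given the same name.

import pandas as pd
import tensorflow as tf
from tensorflow.keras import models, layers

df = pd.read_csv("https://github.com/bgweber/Twitch/"
                 "raw/master/Recommendations/games-expand.csv")
x = df.drop(['label'], axis=1)
y = df['label']

# a small feed-forward network for the binary purchase label
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(x.shape[1],)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy',
              metrics=[tf.keras.metrics.AUC(name='auc')])

history = model.fit(x, y, epochs=10, batch_size=100,
                    validation_split=0.2, verbose=0)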
auc = history.history['auc']
val_auc = history.history['val_auc']
epochs = range(1, len(auc) + 1)
plt.figure(figsize=(10,6))
plt.plot(epochs, auc, 'bo', label='Training AUC')
plt.plot(epochs, val_auc, 'b', label='Validation AUC')
plt.legend()
plt.show()
import pandas as pd
game_df = pd.read_csv("game.csv")
plays_df = pd.read_csv("game_plays.csv")
import featuretools as ft
from featuretools import Feature
es = ft.EntitySet(id="plays")
es = es.entity_from_dataframe(entity_id="plays",dataframe=plays_df
,index="play_id", variable_types = {
"event": ft.variable_types.Categorical,
"description": ft.variable_types.Categorical })
f1 = Feature(es["plays"]["event"])
f2 = Feature(es["plays"]["description"])
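The encoding step that produces the encoded dataframe used below is
not shown in this excerpt. A hedged sketch using featuretools'
encode_features is given here; the top_n value is an assumption.

# one-hot encode the categorical play events so that deep feature
# synthesis can aggregate numeric columns at the game level
encoded, defs = ft.encode_features(plays_df, [f1, f2], top_n=10)
encoded.reset_index(inplace=True)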
es = ft.EntitySet(id="plays")
es = es.entity_from_dataframe(entity_id="plays",
dataframe=encoded, index="play_id")
es = es.normalize_entity(base_entity_id="plays",
new_entity_id="games", index="game_id")
features, transform = ft.dfs(entityset=es,
    target_entity="games", max_depth=2)
features.reset_index(inplace=True)
features.head()
import framequery as fq
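The framequery step that joins the generated features with game
outcomes to produce X and y is not included in this excerpt. As a
stand-in, a hedged pandas version is sketched below; the outcome
column name and the label rule are assumptions about the Kaggle
game.csv file, introduced only for illustration.

# stand-in for the book's framequery SQL join: attach a label to the
# generated game-level features (column names are assumptions)
train_df = features.merge(game_df[['game_id', 'outcome']], on='game_id')
train_df['label'] = (train_df['outcome'] == 'home win REG').astype(int)

y = train_df['label']
X = train_df.drop(['game_id', 'outcome', 'label'], axis=1)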
# train a classifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

lr = LogisticRegression()
model = lr.fit(X, y)

# Results
print("Accuracy: " + str(model.score(X, y)))
print("ROC: " + str(roc_auc_score(y, model.predict_proba(X)[:,1])))
1.8 Conclusion
Building data products is becoming an essential competency for ap-
plied data scientists. The Python ecosystem provides useful tools
for taking prototype models and scaling them up to production-
quality systems. In this chapter, we laid the groundwork for the
rest of this book by introducing the data sets, coding tools, cloud
environments, and predictive models that we’ll use to build scal-
able model pipelines. We also explored a recent Python library
called Featuretools, which automates much of the feature
engineering work in a model pipeline.
In our current setup, we built a simple batch model on a single
machine in the cloud. In the next chapter, we’ll explore how to
share our models with the world, by exposing them as endpoints
on the web.
2 Models as Web Endpoints
well for hosting Python applications that we’ll explore later in this
chapter. The Cat Facts service provides a simple API that provides
a JSON response containing interesting tidbits about felines. We
can use the /facts/random endpoint to retrieve a random fact using
the requests library:
import requests
result = requests.get("http://cat-fact.herokuapp.com/facts/random")
print(result)
print(result.json())
print(result.json()['text'])
This snippet loads the requests library and then uses the get func-
tion to perform an HTTP get for the passed in URL. The result
is a response object that provides a response code and payload if
available. In this case, the payload can be processed using the json
function, which returns the payload as a Python dictionary. The
three print statements show the response code, the full payload,
and the value for the text key in the returned dictionary object.
The output for a run of this script is shown below.
<Response [200]>
import flask
app = flask.Flask(__name__)
@app.route("/", methods=["GET","POST"])
def predict():
data = {"success": False}
return flask.jsonify(data)
if __name__ == '__main__':
app.run(host='0.0.0.0')
The first step is loading the Flask library and creating a Flask
object using the name special variable. Next, we define a predict
function with a Flask annotation that specifies that the function
should be hosted at “/” and accessible by HTTP GET and POST
commands. The last step specifies that the application should run
using 0.0.0.0 as the host, which enables remote machines to access
the application.
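The parameter-handling version of the echo service that the tests
below exercise is not shown in full in this excerpt; a hedged sketch
is given here, mirroring the pattern used later for the Cloud
Functions version of the service.

# echo.py: hedged sketch of the fuller echo service
import flask

app = flask.Flask(__name__)

@app.route("/", methods=["GET","POST"])
def predict():
    data = {"success": False}

    # read parameters from the JSON body if present, otherwise
    # fall back to the query string
    params = flask.request.get_json(silent=True)
    if params is None:
        params = flask.request.args

    if "msg" in params:
        data["response"] = str(params['msg'])
        data["success"] = True

    return flask.jsonify(data)

if __name__ == '__main__':
    app.run(host='0.0.0.0')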
python3 echo.py
* Serving Flask app "echo" (lazy loading)
* Environment: production
WARNING: This is a development server.
Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
* Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
The response indicates that the service was called but that no
message was provided to the echo service.
We can pass parameters to the web service using a few different ap-
proaches. The parameters can be appended to the URL, specified
using the params object when using a GET command, or passed
in using the json parameter when using a POST command. The
snippet below shows how to perform these types of requests. For
small sets of parameters, the GET approach works fine, but for
larger parameters, such as sending images to a server, the POST
approach is preferred.
import requests
result = requests.get("http://52.90.199.190:5000/?msg=HelloWorld!")
print(result.json())
result = requests.get("http://52.90.199.190:5000/",
params = { 'msg': 'Hello from params' })
print(result.json())
result = requests.post("http://52.90.199.190:5000/",
json = { 'msg': 'Hello from data' })
print(result.json())
The output of the code snippet is shown below. There are 3 JSON
responses showing that the service successfully received the mes-
sage parameter and echoed the response:
We can run the script within a Jupyter notebook. The script will
load the image and send it to the server, and then render the result
as a plot. The output of this script, which uses an image of my
in-laws’ cat, is shown in Figure 2.1. We won’t work much with
image data in this book, but I did want to cover how to use more
complex objects with web endpoints.
2.2.1 Scikit-Learn
We’ll start with scikit-learn, which we previously used to build a
propensity model for identifying which players were most likely to
purchase a game. A simple LogisticRegression model object can be
created using the following script:
import pandas as pd
from sklearn.linear_model import LogisticRegression
df = pd.read_csv("https://github.com/bgweber/Twitch/"
    "raw/master/Recommendations/games-expand.csv")
x = df.drop(['label'], axis=1)
y = df['label']
model = LogisticRegression()
model.fit(x, y)
The process for saving a model with pickle is shown below. Once you
have loaded a model, you can use the
prediction functions, such as predict_proba.
import pickle
pickle.dump(model, open("logit.pkl", 'wb'))
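Reloading the saved model is symmetric; a short sketch is:

# reload the pickled model and apply it to the training data
model = pickle.load(open("logit.pkl", 'rb'))
model.predict_proba(x)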
Pickle is great for simple workflows, but can run into serialization
issues when your execution environment is different from your pro-
duction environment. For example, you might train models on your
local machine using Python 3.7 but need to host the models on an
EC2 instance running Python 3.6 with different library versions
installed.
MLflow is a broad project focused on improving the lifecycle of
machine learning projects. The Models component of this platform
focuses on making models deployable across a diverse range of exe-
cution environments. A key goal is to make models more portable,
so that your training environment does not need to match your
deployment environment. In the current version of MLflow, many
of the save and load functions wrap direct serialization calls, but
future versions will be focused on using generalized model formats.
We can use MLflow to save a model using sklearn.save_model and
load a model using sklearn.load_model. The script below shows how
to perform the same task as the prior code example, but uses
MLflow in place of pickle. The file is saved at the model_path lo-
cation, which is a relative path. There’s also a commented out
command, which needs to be uncommented if the code is executed
multiple times. MLflow currently throws an exception if a model
is already saved at the current location, and the rmtree command
can be used to overwrite the existing model.
import mlflow
import mlflow.sklearn
import shutil
model_path = "models/logit_games_v1"
#shutil.rmtree(model_path)
mlflow.sklearn.save_model(model, model_path)
loaded = mlflow.sklearn.load_model(model_path)
loaded.predict_proba(x)
2.2.2 Keras
Keras provides built-in functionality for saving and loading deep
learning models. We covered building a Keras model for the games
data set in Section 1.6.3. The key steps in this process are shown
in the following snippet:
import tensorflow as tf
import keras
from keras import models, layers
model.compile(optimizer='rmsprop',
loss='binary_crossentropy', metrics=[auc])
Once we have trained a Keras model, we can use the save and
load_model functions to persist and reload the model using the h5
file format. One additional step here is that we need to pass the
custom auc function we defined as a metric to the load function in
order to reload the model. Once the model is loaded, we can call
the prediction functions, such as evaluate.
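The save and load calls themselves are not included in this excerpt;
a hedged sketch of the round trip described above is shown below. The
custom auc metric is assumed to be defined as in the training script.

from keras.models import load_model

# persist the trained model as an h5 file and reload it
model.save("games.h5")
loaded = load_model("games.h5", custom_objects={'auc': auc})
loaded.evaluate(x, y, verbose=0)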
We can also use MLflow for Keras. The save_model and load_model
functions can be used to persist Keras models. As before, we need
to provide the custom-defined auc function to load the model.
import mlflow.keras
model_path = "models/keras_games_v1"
mlflow.keras.save_model(model, model_path)
loaded = mlflow.keras.load_model(model_path,
custom_objects={'auc': auc})
loaded.evaluate(x, y, verbose = 0)
2.3.1 Scikit-Learn
To use scikit-learn to host a predictive model, we’ll modify our echo
service built with Flask. The main changes to make are loading
a scikit-learn model using MLflow, parsing out the feature vector
to pass to the model from the input parameters, and adding the
model result to the response payload. The updated Flask applica-
tion for using scikit-learn is shown in the following snippet:
import pandas as pd
from sklearn.linear_model import LogisticRegression
import mlflow
import mlflow.sklearn
import flask
model_path = "models/logit_games_v1"
model = mlflow.sklearn.load_model(model_path)
app = flask.Flask(__name__)
@app.route("/", methods=["GET","POST"])
def predict():
data = {"success": False}
params = flask.request.args
if "G1" in params.keys():
new_row = { "G1": params.get("G1"),"G2": params.get("G2"),
"G3": params.get("G3"),"G4": params.get("G4"),
"G5": params.get("G5"),"G6": params.get("G6"),
"G7": params.get("G7"),"G8": params.get("G8"),
"G9": params.get("G9"),"G10":params.get("G10")}
new_x = pd.DataFrame.from_dict(new_row,
orient = "index").transpose()
data["response"] = str(model.predict_proba(new_x)[0][1])
data["success"] = True
return flask.jsonify(data)
if __name__ == '__main__':
app.run(host='0.0.0.0')
Similar to the echo service, we’ll need to save the app as a Python
file rather than running the code directly in Jupyter. I saved the
code as predict.py and launched the endpoint by running python3
predict.py, which runs the service on port 5000.
import requests
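The rest of the test snippet is not included in this excerpt; a
hedged example request against the local endpoint, with illustrative
feature values, is shown below.

result = requests.get("http://localhost:5000/",
    params = { 'G1': '1', 'G2': '0', 'G3': '0', 'G4': '0', 'G5': '0',
               'G6': '0', 'G7': '0', 'G8': '0', 'G9': '0', 'G10': '0' })
print(result.json())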
2.3.2 Keras
The setup for Keras is similar to scikit-learn, but there are a few
additions that need to be made to handle the TensorFlow graph
context. We also need to redefine the auc function prior to loading
the model using MLflow. The snippet below shows the complete
code for a Flask app that serves a Keras model for the game pur-
chases data set.
The main thing to note in this script is the use of the graph object.
Because Flask uses multiple threads, we need to define the graph
used by Keras as a global object, and grab a reference to the graph
using the with statement when serving requests.
import pandas as pd
import mlflow
import mlflow.keras
import flask
import tensorflow as tf
import keras as k
global graph
graph = tf.get_default_graph()
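# The custom auc metric must be redefined before the model is loaded.
# This is a hedged stand-in definition using TensorFlow 1.x's streaming
# AUC; the book's exact implementation may differ.
def auc(y_true, y_pred):
    auc = tf.metrics.auc(y_true, y_pred)[1]
    k.backend.get_session().run(tf.local_variables_initializer())
    return auc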
model_path = "models/keras_games_v1"
model = mlflow.keras.load_model(model_path,
custom_objects={'auc': auc})
app = flask.Flask(__name__)
@app.route("/", methods=["GET","POST"])
def predict():
data = {"success": False}
params = flask.request.args
if "G1" in params.keys():
new_row = { "G1": params.get("G1"), "G2": params.get("G2"),
"G3": params.get("G3"), "G4": params.get("G4"),
"G5": params.get("G5"), "G6": params.get("G6"),
"G7": params.get("G7"), "G8": params.get("G8"),
"G9": params.get("G9"), "G10": params.get("G10") }
new_x = pd.DataFrame.from_dict(new_row,
orient = "index").transpose()
with graph.as_default():
data["response"] = str(model.predict(new_x)[0][0])
data["success"] = True
return flask.jsonify(data)
if __name__ == '__main__':
app.run(host='0.0.0.0')
2.4.1 Gunicorn
We can use Gunicorn to provide a WSGI server for our echo Flask
application. Using gunicorn helps separate the functionality of an
application, which we implemented in Flask, from the deployment
of an application. Gunicorn is a lightweight WSGI implementation
that works well with Flask apps.
It’s straightforward to switch from using Flask directly to using
Gunicorn to run the web service. The new command for running
the application is shown below. Note that we are passing in a bind
parameter to enable remote connections to the service.
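The command itself is not shown in this excerpt; a typical Gunicorn
invocation matching this description is:

gunicorn --bind 0.0.0.0 echo:app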
The result on the command line is shown below. The main differ-
ence from before is that we now interface with the service on port
8000 rather than on port 5000. If you want to test out the service,
you’ll need to enable remote access on port 8000.
To test the service using Python, we can run the following snippet.
You’ll need to make sure that access to port 8000 is enabled, as
discussed in Section 1.4.1.
result = requests.get("http://52.90.199.190:8000/",
params = { 'msg': 'Hello from Gunicorn' })
print(result.json())
2.4.2 Heroku
Now that we have a Gunicorn application, we can host it in the
cloud using Heroku. Python is one of the core languages supported
by this cloud environment. A nice benefit of Heroku is that you can
host apps for free, which is useful for showcasing data
science projects. The first step is to set up an account on the web
site: https://www.heroku.com/
Next, we’ll set up the command line tools for Heroku, by running
the commands shown below. There can be some complications
when setting up Heroku on an AMI EC2 instance, but downloading
and unzipping the binaries directly works around these problems.
The steps shown below download a release, extract it, and install
an additional dependency. The last step outputs the version of the
Heroku CLI.
wget https://cli-assets.heroku.com/heroku-linux-x64.tar.gz
tar xf heroku-linux-x64.tar.gz
sudo yum -y install glibc.i686
/home/ec2-user/heroku/bin/heroku --version
/home/ec2-user/heroku/bin/heroku login
/home/ec2-user/heroku/bin/heroku create
Next, we’ll make our changes to the project. We copy our echo.py
file into the directory, add Flask to the list of dependencies in the
requirements.txt file, override the command to run in the Procfile,
and then call heroku local to test the configuration locally.
cp ../echo.py echo.py
echo 'flask' >> requirements.txt
echo "web: gunicorn echo:app" > Procfile
/home/ec2-user/heroku/bin/heroku local
/home/ec2-user/heroku/bin/heroku local
[OKAY] Loaded ENV .env File as KEY=VALUE Format
[INFO] Starting gunicorn 19.9.0
[INFO] Listening at: http://0.0.0.0:5000 (10485)
[INFO] Using worker: sync
[INFO] Booting worker with pid: 10488
result = requests.get("http://localhost:5000/",
params = { 'msg': 'Hello from Heroku Local'})
print(result.json())
The final step is to deploy the service to production. The git com-
mands are used to push the results to Heroku, which automatically
releases a new version of the application. The last command tells
Heroku to scale up to a single worker, which is free.
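The commands are not included in this excerpt; a hedged sketch of
this deployment step is shown below, where the commit message is
illustrative.

git add .
git commit -m "deploy echo app"
git push heroku master
/home/ec2-user/heroku/bin/heroku ps:scale web=1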
After these steps run, there should be a message that the applica-
tion has been deployed to Heroku. Now we can call the endpoint,
which has a proper URL, is secured, and can be used to publicly
share data science projects.
result = requests.get("https://obscure-coast-69593.herokuapp.com",
params = { 'msg': 'Hello from Heroku Prod' })
print(result.json())
2.5.1 Dash
Dash is a Python library written by the Plotly team that enables
building interactive web applications with Python. You specify an
application layout and a set of callbacks that respond to user input.
If you’ve used Shiny in the past, Dash shares many similarities,
but is built on Python rather than R. With Dash, you can create
simple applications as we’ll show here, or complex dashboards that
interact with machine learning models.
We’ll create a simple Dash application that provides a UI for in-
teracting with a model. The application layout will contain three
text boxes, where two of these are for user inputs and the third one
shows the output of the model. We’ll create a file called dash_app.py
and start by specifying the libraries to import.
import dash
import dash_html_components as html
import dash_core_components as dcc
from dash.dependencies import Input, Output
import pandas as pd
import mlflow.sklearn
app = dash.Dash(__name__)
app.layout = html.Div(children=[
html.H1(children='Model UI'),
html.P([
html.Label('Game 1 '),
dcc.Input(value='1', type='text', id='g1'),
]),
html.Div([
html.Label('Game 2 '),
dcc.Input(value='0', type='text', id='g2'),
]),
html.P([
html.Label('Prediction '),
dcc.Input(value='0', type='text', id='pred')
]),
])
if __name__ == '__main__':
app.run_server(host='0.0.0.0')
Before writing the callbacks, we can test out the layout of the
application by running python3 dash_app.py, which will run on port
8050 by default. You can browse to your public IP on port 8050
to see the resulting application. The initial application layout is
shown in Figure 2.2. Before any callbacks are added, the result of
the Prediction text box will always be 0.
The next step is to add a callback to the application so that the
Prediction text box is updated whenever the user changes one of
the Game 1 or Game 2 values. To perform this task, we define
a callback shown in the snippet below. The callback is defined
after the application layout, but before the run_server command.
We also load the logistic regression model for the games data set
using MLflow. The callback uses an annotation to define the inputs
to the function, the output, and any additional state that needs
to be provided. The way that the annotation is defined here, the
function will be called whenever the value of Game 1 or Game 2 is
modified by the user, and the value returned by this function will
be set as the value of the Prediction text box.
model_path = "models/logit_games_v1"
model = mlflow.sklearn.load_model(model_path)
@app.callback(
Output(component_id='pred', component_property='value'),
[Input(component_id='g1', component_property='value'),
Input(component_id='g2', component_property='value')]
)
def update_prediction(game1, game2):
    # the remaining game features are assumed to default to 0
    new_row = { "G1": game1, "G2": game2, "G3": 0, "G4": 0, "G5": 0,
                "G6": 0, "G7": 0, "G8": 0, "G9": 0, "G10": 0 }
    new_x = pd.DataFrame.from_dict(new_row,
        orient = "index").transpose()
    return str(model.predict_proba(new_x)[0][1])
The function takes the two values provided by the user, and creates
a Pandas dataframe. As before, we transpose the dataframe to
provide a single row that we’ll pass as input to the loaded model.
The value predicted by the model is then returned and set as the
value of the Prediction text box.
The updated application with the callback function included
is shown in Figure 2.3. The prediction value now dynamically
changes in response to changes in the other text fields, and pro-
vides a way of introspecting the model.
Dash is great for building web applications, because it eliminates
the need to write JavaScript code. It’s also possible to stylize Dash
applications using CSS to add some polish to your tools.
2.6 Conclusion
The Python ecosystem has a great suite of tools for building web
applications. Using only Python, you can write scalable APIs de-
ployed to the open web or custom UI applications that interact
with backend Python code. This chapter focused on Flask, which
can be extended with other libraries and hosted in a wide range
of environments. One of the important concepts we touched on
in this chapter is model persistence, which will be useful in other
contexts when building scalable model pipelines. We also deployed
a simple application to Heroku, which is a separate cloud platform
from AWS and GCP.
This chapter is only an introduction to the many different web
tools within the Python ecosystem, and the topic of scaling these
types of tools is outside the scope of this book. Instead, we’ll focus
on managed solutions for models on the web, which significantly
reduce the DevOps overhead of deploying models as web services.
The next chapter will cover two systems for serverless functions in
managed environments.
3 Models as Serverless Functions
# requirements.txt
flask
#main.py
def echo(request):
    from flask import jsonify

    data = {"success": False}
    # parameters are assumed to arrive as a JSON payload
    params = request.get_json()

    if "msg" in params:
        data["response"] = str(params['msg'])
        data["success"] = True

    return jsonify(data)
Once the function has been deployed, you can click on the “Testing”
tab to check if the deployment of the function worked as intended.
You can specify a JSON object to pass to the function, and invoke
the function by clicking “Test the function”, as shown in Figure 3.2.
The result of running this test case is the JSON object returned
in the Output dialog, which shows that invoking the echo function
worked correctly.
import requests
result = requests.post(
"https://us-central1-gameanalytics.cloudfunctions.net/echo"
,json = { 'msg': 'Hello from Cloud Function' })
print(result.json())
{
'response': 'Hello from Cloud Function',
'success': True
}
from google.cloud import storage

bucket_name = "dsp_model_storage"
storage_client = storage.Client()
storage_client.create_bucket(bucket_name)
After running this code, the output of the script should be a sin-
gle bucket, with the name assigned to the bucket_name variable.
We now have a path on GCS that we can use for saving files:
gs://dsp_model_storage.
bucket_name = "dsp_model_store"
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)
blob = bucket.blob("serverless/logit/v1")
blob.upload_from_filename("logit.pkl")
After running this script, the local file logit.pkl will now be avail-
able on GCS at the following location:
gs://dsp_model_storage/serverless/logit/v1/logit.pkl
While it’s possible to use URIs such as this directly to access files,
as we’ll explore with Spark in Chapter 6, in this section we’ll re-
trieve the file using the bucket name and blob path. The code
snippet below shows how to download the model file from GCS to
local storage. We download the model file to the local path of lo-
cal_logit.pkl and then load the model by calling pickle.load with
this path.
import pickle
from google.cloud import storage
bucket_name = "dsp_model_store"
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)
blob = bucket.blob("serverless/logit/v1")
blob.download_to_filename("local_logit.pkl")
model = pickle.load(open("local_logit.pkl", 'rb'))
model
For the requirements.txt file, we need Pandas and sklearn for applying
the model, and cloud storage for retrieving the model object from
GCS.
google-cloud-storage
sklearn
pandas
flask
The next step is to implement our model function in the main.py file.
A small change from before is that the params object is now fetched
using request.get_json() rather than flask.request.args. The main
change is that we are now downloading the model file from GCS
rather than retrieving the file directly from local storage, because
local files are not available when writing Cloud Functions with the
UI tool. An additional change from the prior function is that we
are now reloading the model for every request, rather than loading
the model file once at startup. In a later code snippet, we’ll show
how to use global objects to cache the loaded model.
def pred(request):
    from google.cloud import storage
    import pickle as pk
    import sklearn
    import pandas as pd
    from flask import jsonify

    data = {"success": False}
    params = request.get_json()

    if "G1" in params:
        # download and load the model file from GCS; the bucket and
        # blob names are assumed to match the upload step shown earlier
        bucket = storage.Client().get_bucket("dsp_model_store")
        blob = bucket.blob("serverless/logit/v1")
        blob.download_to_filename("/tmp/local_logit.pkl")
        model = pk.load(open("/tmp/local_logit.pkl", 'rb'))

        new_row = { "G" + str(i): params.get("G" + str(i))
                                        for i in range(1, 11) }
        new_x = pd.DataFrame.from_dict(new_row,
            orient = "index").transpose()
        data["response"] = str(model.predict_proba(new_x)[0][1])
        data["success"] = True

    return jsonify(data)
One note in the code snippet above is that the /tmp directory is
used to store the downloaded model file. In Cloud Functions, you
are unable to write to the local disk, with the exception of this
directory. Generally it’s best to read objects directly into memory
rather than pulling objects to local storage, but the Python library
for reading objects from GCS currently requires this approach.
For this function, we created a new Cloud Function named pred,
set the function to execute to pred, and deployed the function to
production. We can now call the function from Python, using the
same approach from Section 2.3.1 with a URL that now points to the
Cloud Function, as shown below:
import requests

result = requests.post(
    "https://us-central1-gameanalytics.cloudfunctions.net/pred",
    # illustrative feature values; the exact payload is not shown here
    json = { 'G1': '1', 'G2': '0', 'G3': '0', 'G4': '0', 'G5': '0',
             'G6': '0', 'G7': '0', 'G8': '0', 'G9': '0', 'G10': '0' })
print(result.json())
{
'response': '0.06745113592634559',
'success': True
}
model = None

def pred(request):
    global model

    if not model:
        # lazily load the model on the first request (as shown above)
        model = ...

    # apply model
    return jsonify(data)
google-cloud-storage
tensorflow
keras
pandas
flask
The result is a Keras predictive model that lazily fetches the model
file and can scale to meet variable workloads as a serverless function.
model = None
graph = None

def predict(request):
    global model
    global graph

    from google.cloud import storage
    import pandas as pd
    import tensorflow as tf
    from keras.models import load_model
    from flask import jsonify

    data = {"success": False}
    params = request.get_json()

    # lazily fetch and load the model on the first invocation; the
    # custom auc metric is assumed to be defined earlier in main.py
    if not model:
        graph = tf.get_default_graph()
        bucket_name = "dsp_model_store_1"
        storage_client = storage.Client()
        bucket = storage_client.get_bucket(bucket_name)
        blob = bucket.blob("serverless/keras/v1")
        blob.download_to_filename("/tmp/games.h5")
        model = load_model('/tmp/games.h5',
                    custom_objects={'auc':auc})

    if "G1" in params:
        new_row = { "G" + str(i): params.get("G" + str(i))
                                        for i in range(1, 11) }
        new_x = pd.DataFrame.from_dict(new_row,
            orient = "index").transpose()

        with graph.as_default():
            data["response"] = str(model.predict_proba(new_x)[0][0])
            data["success"] = True

    return jsonify(data)
To test the deployed model, we can reuse the Python web request
script from the prior section and replace pred with predict in the
request URL. We have now deployed a deep learning model to
production.
There are a few different approaches for locking down Cloud Func-
tions to ensure that only authenticated users have access to the
functions. The easiest approach is to disable “Allow unauthenti-
cated invocations” in the function setup to prevent hosting the
function on the open web. To use the function, you’ll need to set
up IAM roles and credentials for the function. This process involves
a number of steps and may change over time as GCP evolves. In-
stead of walking through this process, it’s best to refer to the GCP
documentation.
Another approach for setting up functions that enforce authenti-
cation is by using other services within GCP. We’ll explore this
approach in Chapter 8, which introduces GCP’s PubSub system
for producing and consuming messages within GCP’s ecosystem.
While the first approach is the easiest to implement and can work
well for small-scale deployments, the third approach, where a load
balancer is used to direct calls to the newest function available,
is probably the most robust approach for production systems. A
best practice is to add logging to your function, in order to track
predictions over time so that you can log the performance of the
model and identify potential drift.
def lambda_handler(event, context):
    return {
        'statusCode': 200,
        'body': event['msg']
    }
Click “Save” to deploy the function and then “Test” to test the
file. If you use the default test parameters, then an error will be
returned when running the function, because no msg key is available
in the event object. Click on “Configure test event”, and use
the following configuration:
{
"msg": "Hello from Lambda!"
}
After clicking on “Test”, you should see the execution results. The
response should be the echoed message with a status code of 200
returned. There’s also details about how long the function took to
execute (25.8ms), the billing duration (100ms), and the maximum
memory used (56 MB).
We now have a simple function running on AWS Lambda. For this
function to be exposed to external systems, we’ll need to set up an
API Gateway, which is covered in Section 3.3.3. This function will
scale up to meet demand if needed, and requires no server mon-
itoring once deployed. To set up a function that deploys a model,
we’ll need to use a different workflow for authoring and publishing
the function, because AWS Lambda does not currently support a
requirements.txt file for defining dependencies when writing
functions with the inline code editor. To store the model file that we
want to serve with a Lambda function, we’ll use S3 as a storage
layer for model artifacts.
aws configure
aws s3 ls
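The bucket creation step referenced later is not included in this
excerpt; it would look something like the following, using the bucket
name that appears in the upload commands below.

aws s3 mb s3://dsp-ch3-logit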
The function's dependencies need to be bundled into a zip file that
we'll upload to S3. To accomplish this, we can append
-t . to the end of the pip command in order to install the libraries
into the current directory. The last steps to run on the command
line are copying our logistic regression model into the current di-
rectory, and creating a new file that will implement the Lambda
function.
mkdir lambda
cd lambda
pip install pandas -t .
pip install sklearn -t .
cp ../logit.pkl logit.pkl
vi logit.py
The full source code for the Lambda function that serves our lo-
gistic regression model is shown in the code snippet below. The
structure of the file should look familiar: we first globally define a
model object and then implement a function that services model
requests. This function first parses the response to extract the in-
puts to the model, and then calls predict_proba on the resulting
dataframe to get a model prediction. The result is then returned
as a dictionary object containing a body key. It’s important to de-
fine the function response within the body key, otherwise Lambda
will throw an exception when invoking the function over the web.
import pickle
import pandas as pd

# the model is loaded once, outside of the handler function; the check
# that distinguishes web requests from console test events is elided here
model = pickle.load(open("logit.pkl", 'rb'))

def lambda_handler(event, context):
    if "G1" in event:
        new_row = { "G1": event["G1"],"G2": event["G2"],
            "G3": event["G3"],"G4": event["G4"],
            "G5": event["G5"],"G6": event["G6"],
            "G7": event["G7"],"G8": event["G8"],
            "G9": event["G9"],"G10":event["G10"]}
        new_x = pd.DataFrame.from_dict(new_row,
            orient = "index").transpose()
        prediction = str(model.predict_proba(new_x)[0][1])

        return { 'statusCode': 200, 'body': prediction }

    return { 'statusCode': 200, 'body': 'No parameters provided' }
One of the main differences from this approach with the GCP
Cloud Function is that we did not need to explicitly define global
variables that are lazily defined. With Lambda functions, you can
define variables outside the scope of the function that are persisted
before the function is invoked. It’s important to load model objects
outside of the model service function, because reloading the model
each time a request is made can become expensive when handling
large workloads.
To deploy the model, we need to create a zip file of the current
directory, and upload the file to a location on S3. The snippet
below shows how to perform these steps and then confirm that
the upload succeeded using the s3 ls command. You’ll need to
modify the paths to use the S3 bucket name that you defined in
the previous section.
zip -r logitFunction.zip .
aws s3 cp logitFunction.zip s3://dsp-ch3-logit/logitFunction.zip
aws s3 ls s3://dsp-ch3-logit/
Once your function is uploaded as a zip file to S3, you can return
to the AWS console and set up a new Lambda function. Select
“Author from scratch” as before, and under “Code entry type” se-
lect the option to upload from S3, specifying the location from the
cp command above. You’ll also need to define the Handler, which is
a combination of the Python file name and the Lambda function
name. An example configuration for the logit function is shown in
Figure 3.6.
Make sure to select the Python runtime as the same version of
Python that was used to run the pip commands on the EC2 in-
stance. Once the function is deployed by pressing “Save”, we can
test the function using the following definition for the test event.
{
  "G1": "1", "G2": "1", "G3": "1",
  "G4": "1", "G5": "1", "G6": "1",
  "G7": "1", "G8": "1", "G9": "1",
  "G10": "1"
}
Since the model is loaded when the function is deployed, the re-
sponse time for testing the function should be relatively fast. An
example output of testing the function is shown in Figure 3.7. The
output of the function is a dictionary that includes a body key and
the output of the model as the value. The function took 110 ms
to execute and was billed for a duration of 200 ms.
So far, we’ve invoked the function only using the built-in test func-
tionality of Lambda. In order to host the function so that other
services can interact with the function, we’ll need to define an API
Gateway. Under the “Designer” tab, click “Add Trigger” and se-
lect “API Gateway”. Next, select “Create a new API” and choose
“Open” as the security setting. After setting up the trigger, an
API Gateway should be visible in the Designer layout, as shown
in Figure 3.8.
Before calling the function from Python code, we can use the API
Gateway testing functionality to make sure that the function is
set up properly. One of the challenges I ran into when testing this
Lambda function was that the structure of the request varies when
the function is invoked from the web versus the console. This is
why the function first checks if the event object is a web request
or dictionary with parameters. When you use the API Gateway to
test the function, the resulting call will emulate calling the function
as a web request. An example test of the logit function is shown
in Figure 3.9.
Now that the gateway is set up, we can call the function from a
remote host using Python. The code snippet below shows how to
use a POST command to call the function and display the result.
Since the function returns a string for the response, we use the
text attribute rather than the json function to display the result.
import requests
result = requests.post("https://3z5btf0ucb.execute-api.us-east-1."
    "amazonaws.com/default/logit",
    json = { 'G1':'1', 'G2':'0', 'G3':'0', 'G4':'0', 'G5':'0',
        'G6':'0', 'G7':'0', 'G8':'0', 'G9':'0', 'G10':'0' })
print(result.text)
3.4 Conclusion
Serverless functions are a type of managed service that enable
developers to deploy production-scale systems without needing to
worry about infrastructure. To provide this abstraction, different
cloud platforms do place constraints on how functions must be
implemented, but the trade-off is generally worth the improvement
in DevOps that these tools enable. While serverless technologies
like Cloud Functions and Lambda can be operationally expensive,
they provide flexibility that can offset these costs.
In this chapter, we implemented echo services and sklearn model
endpoints using both GCP’s Cloud Functions and AWS’s Lambda
offerings. With AWS, we created a local Python environment with
all dependencies and then uploaded the resulting files to S3 to
deploy functions, while in GCP we authored functions directly
using the online code editor. The best system to use will likely
depend on which cloud provider your organization is already using,
but when prototyping new systems, it’s useful to have hands-on
experience using more than one serverless function ecosystem.
4 Containers for Reproducible Models
4.1 Docker
Docker, and other platform-as-a-service tools, provide a virtual-
ization concept called containers. Containers run on top of a host
operating system, but provide a standardized environment for code
running within the container. One of the key goals of this virtual-
ization approach is that you can write code for a target environ-
ment, and any system running Docker can run your container.
Containers are a lightweight alternative to virtual machines, which
provide similar functionality. The key difference is that containers
are much faster to spin up, while providing a similar degree of
isolation to virtual machines. Another benefit is that containers can
re-use layers from other containers, making it much faster to build
and share containers. Containers are a great solution to use when
you need to run conflicting versions of Python runtimes or libraries
on a single machine.
With Docker, you author a file called a Dockerfile that is used to
define the dependencies for a container. The result of building the
Dockerfile is a Docker Image, which packages all of the runtimes,
libraries, and code needed to run an app. A Docker Container
is an instantiated image that is running an application. One of
the useful features in Docker is that new images can build off
of existing images. For our model deployment, we’ll extend the
ubuntu:latest image.
# load Flask
import flask
app = flask.Flask(__name__)
The Dockerfile starts from the base ubuntu image and adds the name
of the image maintainer. Next, the RUN
command is used to install Python, set up a symbolic link, and
install Flask. For containers with many Python libraries, it’s also
possible to use a requirements.txt file. The COPY command inserts
our script into the image and places the file in the root directory.
The final command specifies the arguments to run to execute the
application.
FROM ubuntu:latest
MAINTAINER Ben Weber
ENTRYPOINT ["python3","echo.py"]
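The RUN and COPY steps described above are not included in this
excerpt; a hedged sketch of the full Dockerfile, assuming the Flask
script is named echo.py, is shown below.

FROM ubuntu:latest
MAINTAINER Ben Weber

RUN apt-get update \
  && apt-get install -y python3-pip python3-dev \
  && cd /usr/local/bin \
  && ln -s /usr/bin/python3 python \
  && pip3 install flask

COPY echo.py echo.py

ENTRYPOINT ["python3","echo.py"]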
After writing a Dockerfile, you can use the build command that
docker provides to create an image. The first command shown
in the snippet below shows how to build an image, tagged as
echo_service, using the file ./Dockerfile. The second command
shows the list of Docker images available on the instance. The
output will show both the ubuntu image we used as the base for
our image, and our newly created image.
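The commands themselves are not included in this excerpt; a hedged
version is:

sudo docker image build -t "echo_service" .
sudo docker images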
To test the container, we can use the same process as before where
we use the external IP of the EC2 instance in a web browser and
pass a msg parameter to the /predict endpoint. Since we set up
a port mapping from the host port of 80 to the container port
80, we can directly invoke the container over the open web. An
example invocation and result from the echo service container is
shown below.
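The run command is not shown in this excerpt; a typical invocation
matching the 80-to-80 port mapping described above is:

sudo docker run -d -p 80:80 echo_service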
http://34.237.242.46/predict?msg=Hi_from_docker
{"response":"Hi_from_docker","success":true}
4.2 Orchestration
Container orchestration systems are responsible for managing the
life cycles of containers in a cluster. They provide services including
provisioning, scaling, failover, load balancing, and service discovery
between containers. AWS has multiple orchestration solutions, but
the general trend has been moving towards Kubernetes for this
functionality, which is an open-source platform originally designed
by Google.
The Elastic Container Registry (ECR) is a managed Docker registry
that you can use to store and manage images within the AWS
ecosystem. It works well with both ECS and EKS.
The goal of this subsection is to walk through the process of getting
a Docker image from an EC2 instance to ECR. We’ll cover the
following steps:
The first step is to create a repository for the image that we want
to store on ECR. A registry can have multiple repositories, and
each repository can have multiple tagged images. To set up a new
repository, perform the following steps from the AWS console:
After completing these steps, you should have a new repository for
saving images on ECR, as shown in Figure 4.1. The repository will
initially be empty until we push a container.
Since our goal is to push a container from an EC2 instance to
ECR, we’ll need to set up permissions for pushing to the registry
After tagging your image, it’s good to check that the outcome
matches the expected behavior. To check the tags of your images,
run sudo docker images from the command line. An example output
is shown below, with my account ID and region omitted.
The final step is to push the tagged image to the ECR repository.
We can accomplish this by running the command shown below:
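The push command is not included in this excerpt; it would look
something like the following, where the account ID and region are
placeholders.

sudo docker push [account_id].dkr.ecr.[region].amazonaws.com/models:echo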
After running this command, the echo service should now be avail-
able in the model repository on ECR. To check if the process suc-
ceeded, return to the AWS console and click on “Images” for the
model repository. The repo should now show an image with the
tag models:echo, as shown in Figure 4.2.
The outcome of this process is that we now have a Docker image
pushed to ECR that can be leveraged by an orchestration system.
1. Setting up a cluster
2. Setting up a task
3. Running a task
4. Running a service
At the end of this section, we’ll have a service that manages a task
running the echo service, but we’ll connect directly to the IP of
the provisioned EC2 instance. In the next section, we’ll set up a
load balancer to provide a static URL for accessing the service.
The first step in using ECS is to set up a cluster. There is a newer
feature in ECS called Fargate that abstracts away the notion of
EC2 instances when running a cluster, but this mode does not cur-
rently support the networking modes that we need for connecting
directly to the container. To set up an ECS cluster, perform the
following steps from the AWS console:
We now have a task setup in ECS that we can use to host our
image as a container in the cloud. It is a good practice to test out
your tasks in ECS before defining a service to manage the task.
We can test out the task by performing the following steps:
the EC2 console in AWS, you’ll see that a new EC2 instance has
been provisioned, and the name of the instance will be based on
the service, such as: ECS Instance - EC2ContainerService-echo.
Now that we have a container running in our ECS cluster, we can
query it over the open web. To find the URL of the service, click
on the running task and under containers, expand the echo service
details. The console will show an external IP address where the
container can be accessed, as shown in Figure 4.5. An example of
using the echo service is shown in the snippet below.
http://18.212.21.97/predict?msg=Hi_from_ECS
{"response":"Hi_from_ECS","success":true}
We now have a container running in the cloud, but it’s not scalable
and there is no fault tolerance. To enable these types of capabilities
we need to define a service in our ECS cluster that manages the task.
This will start the service. The service will set the “Desired count”
value to 1, and it may take a few minutes for a new task to get
ramped up by the cluster. Once “Running count” is set to 1, you
can start using the service to host a model. An example of the
provisioned service is shown in Figure 4.6. To find the IP of the
container, click on the task within the service definition.
We now have a container that is managed by a service, but we’re
still accessing the container directly. In order to use ECS in a way
that scales, we'll place a load balancer in front of the service. Follow
the same steps shown in the prior section for setting up a service,
but instead of selecting “None” for the load balancer type, perform
the following actions:
It’s taken quite a few steps, but our echo service is now running in
a scalable environment, using a load balancer, and using a service
that will manage tasks to handle failures and provision new EC2
instances as necessary. This approach requires quite a bit more
configuration than Lambda for similar functionality, but it may be
preferred based on the type of workload that you need to handle.
There is cost involved with running an ECS cluster, even if you
are not actively servicing requests, so understanding your expected
workload is useful when modeling out the cost of different model
serving options on AWS.
http://model123.us-east-1.elb.amazonaws.com/predict?msg=Hi_from_ELB
{"response":"Hi_from_ELB","success":true}
AWS does provide an option for Kubernetes called EKS, but the
options available through the web console are currently limited for
managing Docker images. EKS can work with ECR as well, and
as the AWS platform evolves EKS will likely be the best option
for new deployments.
Make sure to terminate your cluster, load balancers, and EC2 in-
stances once you are done testing out your deployment to reduce
your cloud platform costs.
With GCP, we can use Google Kubernetes Engine (GKE) to host containers
and expose the service to the open web. To deploy the echo service container,
perform the following steps from the GCP console:
To use the service, we’ll need to expose the cluster to the open
web by performing the following steps from the GCP console:
http://35.238.43.63/predict?msg=Hi_from_GKE
{"response":"Hi_from_GKE","success":true}
4.4 Conclusion
Containers are great to use to make sure that your analyses and
models are reproducible across different environments. While con-
tainers are useful for keeping dependencies clean on a single ma-
chine, the main benefit is that they enable data scientists to write
model endpoints without worrying about how the container will
be hosted. This separation of concerns makes it easier to partner
with engineering teams to deploy models to production, or using
the approaches shown in this chapter, data and applied science
teams can also own the deployment of models to production.
The best approach to use for serving models depends on your
deployment environment and expected workload. Typically, you
are constrained to a specific cloud platform when working at a
company, because your model service may need to interface with
other components in the cloud, such as a database or cloud stor-
age. Within AWS, there are multiple options for hosting contain-
ers while GCP is aligned on GKE as a single solution. The main
question to ask is whether it is more cost effective to serve your
model using serverless function technologies or elastic container
technologies. The correct answer will depend on the volume of
traffic you need to handle, the amount of latency that is tolera-
ble for end users, and the complexity of models that you need to
host. Containerized solutions are great for serving complex models
and making sure that you can meet latency requirements, but may
require a bit more DevOps overhead versus serverless functions.
5 Workflow Tools for Model Pipelines
The pipeline will execute as a single Python script that performs all
of these steps. For situations where you want to use intermediate
outputs from steps across multiple tasks, it’s useful to decompose
the pipeline into multiple processes that are integrated through a
workflow tool such as Airflow.
We’ll build this workflow by first writing a Python script that runs
on an EC2 instance, and then Dockerize the script so that we can
use the container in workflows. To get started, we need to install
a library for writing a Pandas dataframe to BigQuery:
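The pandas_gbq package provides this functionality; a minimal install command, assuming pip3 is available on the EC2 instance, is shown below.

pip3 install --user pandas_gbq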
Next, we’ll create a file called pipeline.py that performs the four
pipeline steps identified above. The script shown below performs
these steps by loading the necessary libraries, fetching the CSV
file from GitHub into a Pandas dataframe, splits the dataframe
into train and test groups to simulate historic and more recent
users, builds a logistic regression model using the training data
set, creates predictions for the test data set, and saves the resulting
dataframe to BigQuery.
import pandas as pd
import numpy as np
from google.oauth2 import service_account
from sklearn.linear_model import LogisticRegression
from datetime import datetime
import pandas_gbq
# build a model
model = LogisticRegression()
model.fit(x_train, y_train)
y_pred = model.predict_proba(x_test)[:, 1]
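The data loading, train/test split, and BigQuery export steps are omitted from the listing above. A minimal sketch of these steps, continuing the same script and assuming the games CSV used throughout the book and the dsp_demo.user_scores destination table (the dataset name and split logic are assumptions), is shown below.

# fetch the games data set into a Pandas dataframe
gamesDF = pd.read_csv("https://github.com/bgweber/Twitch/raw/"
                      "master/Recommendations/games-expand.csv")
gamesDF['User_ID'] = gamesDF.index

# simulate historic (train) and more recent (test) users
train = gamesDF.sample(frac=0.7, random_state=0)
test = gamesDF.drop(train.index)
x_train = train.iloc[:, 0:10]
y_train = train['label']
x_test = test.iloc[:, 0:10]

# (model fitting and y_pred as in the snippet above)

# save the predictions to BigQuery with a timestamp column
resultDF = pd.DataFrame({'User_ID': test['User_ID'], 'pred': y_pred})
resultDF['time'] = str(datetime.now())
pandas_gbq.to_gbq(resultDF, 'dsp_demo.user_scores',
                  project_id='your_project_id', if_exists='replace')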
export GOOGLE_APPLICATION_CREDENTIALS=/home/ec2-user/dsdemo.json
python3 pipeline.py
This script will set up a client for connecting to BigQuery and then
display the result set of the query submitted to BigQuery. You can
also browse to the BigQuery web UI to inspect the results of the
pipeline, as shown in Figure 5.1. We now have a script that can
fetch data, apply a machine learning model, and save the results
as a single process.
With many workflow tools, you can run Python code or bash
scripts directly, but it’s good to set up isolated environments for
executing scripts in order to avoid dependency conflicts for differ-
ent libraries and runtimes. Luckily, we explored a tool for this in
Chapter 4 and can use Docker with workflow tools. It’s useful to
wrap Python scripts in Docker for workflow tools, because you can
add libraries that may not be installed on the system responsible
for scheduling, you can avoid issues with Python version conflicts,
and containers are becoming a common way of defining tasks in
workflow tools.
To containerize our workflow, we need to define a Dockerfile, as
shown below. Since we are building out a new Python environ-
ment from scratch, we’ll need to install Pandas, sklearn, and the
BigQuery library. We also need to copy credentials from the EC2
instance into the container so that we can run the export command for authenticating with GCP. This works for short term
deployments, but for longer running containers it’s better to run
the export in the instantiated container rather than copying static
credentials into images. The Dockerfile lists out the Python li-
braries needed to run the script, copies in the local files needed for
execution, exports credentials, and specifies the script to run.
FROM ubuntu:latest
MAINTAINER Ben Weber
ENTRYPOINT ["python3","pipeline.py"]
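A fuller sketch of the Dockerfile described above, assuming a pip-based Python 3 install and the credential file copied from the EC2 instance (the package list and paths are assumptions), might look like the following.

FROM ubuntu:latest
MAINTAINER Ben Weber

# install Python and the libraries needed by pipeline.py
RUN apt-get update && apt-get install -y python3-pip python3-dev
RUN pip3 install pandas scikit-learn pandas_gbq

# copy the script and credentials into the image
COPY pipeline.py pipeline.py
COPY dsdemo.json dsdemo.json
ENV GOOGLE_APPLICATION_CREDENTIALS=/dsdemo.json

ENTRYPOINT ["python3","pipeline.py"]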
5.2 Cron
A common requirement for model pipelines is running a task at
a regular frequency, such as every day or every hour. Cron is a
utility that provides scheduling functionality for machines running
the Linux operating system. You can set up a scheduled task using
the crontab utility and assign a cron expression that defines how
frequently to run the command. Cron jobs run directly on the
machine where cron is utilized, and can make use of the runtimes
and libraries installed on the system.
There are a number of challenges with using cron in production-
grade systems, but it’s a great way to get started with scheduling
a small number of tasks and it’s good to learn the cron expression
syntax that is used in many scheduling systems. The main issue
with the cron utility is that it runs on a single machine, and does
not natively integrate with tools such as version control. If your
machine goes down, then you’ll need to recreate your environment
and update your cron table on a new machine.
A cron expression defines how frequently to run a command. It is a sequence of five fields (minute, hour, day of month, month, and day of week) that define when to execute, and it can include wildcards to always run for certain time periods. A few sample expressions are shown in the snippet below:
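A sketch of a few common expressions, with the schedule each one encodes noted as a comment, is shown below.

# run every minute
* * * * *

# run at the top of every hour
0 * * * *

# run at 6:00 AM every day
0 6 * * *

# run at 8:00 AM on the first day of every month
0 8 1 * *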
When getting started with cron, it’s good to use tools1 to validate
your expressions. Cron expressions are used in Airflow and many
other scheduling systems.
We can use cron to schedule our model pipeline to run on a reg-
ular frequency. To schedule a command to run, run the following
command on the console:
crontab -e
This command will open up the cron table file for editing in vi.
To schedule the pipeline to run every minute, add the following
commands to the file and save.
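A sketch of such an entry, assuming the Docker image name used earlier in this chapter, is shown below.

* * * * * sudo docker run sklearn_pipeline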
After exiting the editor, the cron table will be updated with the
new command to run. The second part of the cron statement is the
command to run. When defining the command to run, it’s useful
to include full file paths. With Docker, we just need to define the
image to run. To check that the script is actually executing, browse
to the BigQuery UI and check the time column on the user_scores
model output table.
We now have a utility for scheduling our model pipeline on a regu-
lar schedule. However, if the machine goes down then our pipeline
will fail to execute. To handle this situation, it’s good to explore
cloud offerings with cron scheduling capabilities.
1 https://crontab.guru/
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: sklearn
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: sklearn
            image: us.gcr.io/[gcp_account]/sklearn_pipeline
          restartPolicy: OnFailure
After saving the file, we can use kubectl to update the cluster with
the YAML file. Run the command below to update the cluster
with the model pipeline task:
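Assuming the YAML above is saved as a file named sklearn.yaml, the command might look like the following sketch.

kubectl apply -f sklearn.yaml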
export AIRFLOW_HOME=~/airflow
pip install --user apache-airflow
airflow initdb
airflow scheduler
Airflow also provides a web frontend for managing DAGs that have
been scheduled. To start this service, run the following command
in a new terminal on the same machine.
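The standard Airflow CLI exposes this through the webserver command; the port value matches the description below.

airflow webserver -p 8080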
This command tells Airflow to start the web service on port 8080.
You can open a web browser at this port on your machine to view
the web frontend for Airflow, as shown in Figure 5.3.
Airflow comes preloaded with a number of example DAGs. For our
model pipeline we’ll create a new DAG and then notify Airflow of
the update. We’ll create a file called sklearn.py with the following
DAG definition:
default_args = {
    'owner': 'Airflow',
    'depends_on_past': False,
    'email': '[email protected]',
    'start_date': datetime(2019, 11, 1),
    'email_on_failure': True,
}
t1 = BashOperator(
    task_id='sklearn_pipeline',
    bash_command='sudo docker run sklearn_pipeline',
    dag=dag)
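The imports and the DAG instantiation are omitted from the listing above; a minimal sketch of these pieces, using the games DAG name that Airflow lists below and assuming a once-per-minute schedule, is:

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

# instantiate the DAG with the settings defined in default_args
dag = DAG('games', default_args=default_args,
          schedule_interval="* * * * *")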
There’s a few steps in this Python script to call out. The script
uses a Bash operator to define the action to perform. The Bash
operator is defined as the last step in the script, which specifies
the operation to perform. The DAG is instantiated with a number
of input arguments that define the workflow settings, such as who
to email when the task fails. A cron expression is passed to the
DAG object to define the schedule for the task, and the DAG
object is passed to the Bash operator to associate the task with
this graph of operations.
Before adding the DAG to Airflow, it's useful to check for syntax
errors in your code. We can run the following command from the
terminal to check for issues with the DAG:
python3 sklearn.py
This command will not run the DAG, but will flag any syntax
errors present in the script. To update Airflow with the new DAG
file, run the following command:
airflow list_dags
-------------------------------------------------------------------
DAGS
-------------------------------------------------------------------
games
This command will add the DAG to the list of workflows in Airflow.
To view the list of DAGs, navigate to the Airflow web server, as
shown in Figure 5.4. The web server will show the schedule of the
DAG, and provide a history of past runs of the workflow. To check
that the DAG is actually working, browse to the BigQuery UI and
check for fresh model outputs.
We now have an Airflow service up and running that we can use
to monitor the execution of our workflows. This setup enables us
to track the execution of workflows, backfill any gaps in data sets,
and enable alerting for critical workflows.
Airflow supports a variety of operations, and many companies au-
thor custom operators for internal usage. In our first DAG, we used
the Bash operator to define the task to execute, but other options
are available for running Docker images, including the Docker op-
erator. The code snippet below shows how to change our DAG to
use the Docker operator instead of the Bash operator.
t1 = DockerOperator(
    task_id='sklearn_pipeline',
    image='sklearn_pipeline',
    dag=dag)
The DAG we defined does not have any dependencies, since the
container performs all of the steps in the model pipeline. If we
had a dependency, such as running a sklearn_etl container before
running the model pipeline, we can use the set_upstream command
as shown below. This configuration sets up two tasks, where the
pipeline task will execute after the etl task completes.
t1 = BashOperator(
    task_id='sklearn_etl',
    bash_command='sudo docker run sklearn_etl',
    dag=dag)

t2 = BashOperator(
    task_id='sklearn_pipeline',
    bash_command='sudo docker run sklearn_pipeline',
    dag=dag)

t2.set_upstream(t1)
While Airflow can run on a single machine or be set up with separate Scheduler and Worker nodes, one of the recent trends is using Kubernetes to create more robust Airflow deployments.
It is possible to self-host Airflow on Kubernetes, but it can be
complex to set up. There are also fully-managed versions of Airflow
available for cloud platforms such as Cloud Composer on GCP.
With a managed version of Airflow, you define the DAGs to execute
and set the schedules, and the platform is responsible for providing
a high-availability deployment.
To run our DAG on Cloud Composer, we’ll need to update the
task to use a GKE Pod operator in place of a Docker operator, be-
cause Composer needs to be able to authenticate with Container
Registry. The updated DAG is shown in the snippet below.
t1 = GKEPodOperator(
    task_id='sklearn_pipeline',
    project_id='{your_project_id}',
    cluster_name='us-central1-models-13d59d5b-gke',
    name='sklearn_pipeline',
    namespace='default',
    location='us-central1-c',
    image='us.gcr.io/{your_project_id}/sklearn_pipeline',
    dag=dag)
5.4 Conclusion
In this chapter we explored a batch model pipeline for applying a machine learning model to a set of users and storing the results to BigQuery.
6
PySpark for Batch Pipelines
My advice for readers that want to dig deeper into the Spark ecosystem
is to explore books based on the broader Spark ecosystem, such
as (Karau et al., 2015). You’ll likely need to read through Scala or
Java code examples, but the majority of content covered will be
relevant to PySpark.
After a few minutes we’ll have a cluster set up that we can use
for submitting Spark commands. Before attaching a notebook to
the cluster, we’ll first set up the libraries that we’ll use throughout
1 https://community.cloud.databricks.com/
this chapter. Instead of using pip to install libraries, we’ll use the
Databricks UI, which makes sure that every node in the cluster
has the same set of libraries installed. We’ll use both Maven and
PyPI to install libraries on the cluster. To install the BigQuery
connector, perform the following steps:
The UI will show the library status progress from resolving to installing to installed. We also need to attach a few Python libraries that are not pre-installed on a new Databricks cluster. Standard libraries such as Pandas are installed, but you might need to upgrade to a more recent version, since the libraries pre-installed by Databricks can lag significantly behind the latest releases.
To install a Python library on Databricks, perform the same steps
as before up to step 5. Next, instead of selecting “Maven” choose
“PyPI”. Under Package, specify the package you want to install and
then click “Install”. To follow along with all of the sections in this
chapter, you’ll need to install the following Python packages:
• koalas - for dataframe conversion.
• featuretools - for feature generation.
• tensorflow - for a deep learning backend.
• keras - for a deep learning model.
You’ll now have a cluster set up capable of performing distributed
feature engineering and deep learning. We’ll start with basic Spark
commands, show off newer functionality such as the Koalas library,
and then dig into these more advanced topics. After this setup,
your cluster library setup should look like Figure 6.1. To ensure
that everything is set up successfully, restart the cluster and check
the status of the installed libraries.
wget https://github.com/bgweber/Twitch/raw/master/Recommendations/games-expand.csv
aws s3 cp games-expand.csv s3://dsp-ch6/csv/games-expand.csv
In addition to staging the games data set to S3, we’ll also copy a
subset of the CSV files from the Kaggle NHL data set, which we
set up in Section 1.5.2. Run the following commands to stage the
plays and stats CSV files from the NHL data set to S3.
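A sketch of these commands, assuming the Kaggle CSV files have been downloaded to the current directory and using the file names referenced later in this chapter, is shown below.

aws s3 cp game_plays.csv s3://dsp-ch6/csv/game_plays.csv
aws s3 cp game_skater_stats.csv s3://dsp-ch6/csv/game_skater_stats.csv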
We now have all of the data sets needed for the code examples
in this chapter. In order to read in these data sets from Spark,
we’ll need to set up S3 credentials for interacting with S3 from the
Spark cluster.
6.2.1 S3 Credentials
For production environments, it is better to use IAM roles to man-
age access instead of using access keys. However, the community
edition of Databricks constrains how much configuration is allowed,
so we’ll use access keys to get up and running with the examples
in this chapter. We already set up a user for accessing S3 from an
EC2 instance. To create a set of credentials for accessing S3 pro-
grammatically, perform the following steps from the AWS console:
The result will be an access key and a secret key enabling access
to S3. Save these values in a secure location, as we’ll use them in
the notebook to connect to the data sets on S3. Once you are done
with this chapter, it is recommended to revoke these credentials.
Now that we have credentials set up for access, we can return to
the Databricks notebook to read in the data set. To enable access
to S3, we need to set the access key and secret key in the Hadoop
configuration of the cluster. To set these keys, run the PySpark
commands shown in the snippet below. You’ll need to replace the
access and secret keys with the credentials we just created for the
S3_Lambda role.
AWS_ACCESS_KEY = "AK..."
AWS_SECRET_KEY = "dC..."
sc._jsc.hadoopConfiguration().set(
"fs.s3n.awsAccessKeyId", AWS_ACCESS_KEY)
sc._jsc.hadoopConfiguration().set(
"fs.s3n.awsSecretAccessKey", AWS_SECRET_KEY)
We can now read the data set into a Spark dataframe using the read
command, as shown below. This command uses the spark context
to issue a read command and reads in the data set using the CSV
input reader. We also specify that the CSV file includes a header
row and that we want Spark to infer the data types for the columns.
When reading in CSV files, Spark eagerly fetches the data set into
memory, which can cause issues for larger data sets. When working
with large CSV files, it’s a best practice to split up large data sets
into multiple files and then read in the files using a wildcard in
the input path. When using other file formats, such as Parquet or
Avro, Spark lazily fetches the data sets.
games_df = spark.read.csv("s3://dsp-ch6/csv/games-expand.csv",
header=True, inferSchema = True)
display(games_df)
stats_df = spark.read.csv("s3://dsp-ch6/csv/game_skater_stats.csv",
header=True, inferSchema = True)
display(stats_df)
In this section, we'll round trip the stats data set through the Avro, Parquet, ORC, and finally CSV formats. After performing this round trip of data IO, we'll end up with our initial Spark dataframe. To start, we'll save the
stats dataframe in Avro format, using the code snippet shown be-
low. This code writes the dataframe to S3 in Avro format using
the Databricks Avro writer, and then reads in the results using
the same library. The result of performing these steps is that we
now have a Spark dataframe pointing to the Avro files on S3. Since
PySpark lazily evaluates operations, the Avro files are not pulled
to the Spark cluster until an output needs to be created from this
data set.
# AVRO write
avro_path = "s3://dsp-ch6/avro/game_skater_stats/"
stats_df.write.mode('overwrite').format(
"com.databricks.spark.avro").save(avro_path)
# AVRO read
avro_df = sqlContext.read.format(
"com.databricks.spark.avro").load(avro_path)
aws s3 ls s3://dsp-ch6/avro/game_skater_stats/
2019-11-27 23:02:43 1455 _committed_1588617578250853157
2019-11-27 22:36:31 1455 _committed_1600779730937880795
2019-11-27 23:02:40 0 _started_1588617578250853157
2019-11-27 23:31:42 0 _started_6942074136190838586
# parquet out
parquet_path = "s3a://dsp-ch6/games-parquet/"
avro_df.write.mode('overwrite').parquet(parquet_path)
# parquet in
parquet_df = sqlContext.read.parquet(parquet_path)
Like the Avro format, the ORC write command will distribute the
dataframe to multiple files based on the size.
# orc out
orc_path = "s3a://dsp-ch6/games-orc/"
parquet_df.write.mode('overwrite').orc(orc_path)
# orc in
orc_df = sqlContext.read.orc(orc_path)
To complete our round trip of file formats, we’ll write the results
back to S3 in the CSV format. To make sure that we write a single
file rather than a batch of files, we'll use the coalesce command to collect the data to a single node before exporting it. This approach will fail with large data sets, and in general it's best to avoid the CSV format when using Spark. However,
CSV files are still a common format for sharing data, so it’s useful
to understand how to export to this format.
# CSV out
csv_path = "s3a://dsp-ch6/games-csv-out/"
orc_df.coalesce(1).write.mode('overwrite').format(
"com.databricks.spark.csv").option("header","true").save(csv_path)
It's often necessary to convert dataframes between different formats based on your use case. For example, you might
need to perform a Pandas operation, such as selecting a specific
element from a dataframe. When this is required, you can use the
toPandas function to pull a Spark dataframe into memory on the
driver node. The PySpark snippet below shows how to perform this
task, display the results, and then convert the Pandas dataframe
back to a Spark dataframe. In general, it’s best to avoid Pandas
when authoring PySpark workflows, because it prevents distribu-
tion and scale, but it’s often the best way of expressing a command
to execute.
stats_pd = stats_df.toPandas()
stats_df = sqlContext.createDataFrame(stats_pd)
import databricks.koalas as ks
stats_ks = stats_df.to_koalas()
stats_df = stats_ks.to_spark()
print(stats_ks['timeOnIce'].mean())
print(stats_ks.iloc[:1, 1:2])
The snippet below shows how to load the stats data set, expose it as a view to Spark, and then run a query against the dataframe. The aggregated dataframe is then visualized using the display command in Databricks.
stats_df = spark.read.csv("s3://dsp-ch6/csv/game_skater_stats.csv",
header=True, inferSchema = True)
stats_df.createOrReplaceTempView("stats")
new_df = spark.sql("""
select player_id, sum(1) as games, sum(goals) as goals
from stats
group by 1
order by 3 desc
limit 5
""")
display(new_df)
While earlier Spark versions performed better with the Dataframe API versus Spark SQL,
the difference in performance is now trivial and you should use the
transformation tools that provide the best iteration speed for work-
ing with large data sets. With Spark SQL, you can join dataframes,
run nested queries, set up temp tables, and mix expressive Spark
operations with SQL operations. For example, if you want to look
at the distribution of goals versus shots in the NHL stats data, you
can run the following command on the dataframe.
display(spark.sql("""
select cast(goals/shots * 50 as int)/50.0 as Goals_per_shot
,sum(1) as Players
from (
select player_id, sum(shots) as shots, sum(goals) as goals
from stats
group by 1
having goals >= 5
)
group by 1
order by 1
"""))
This query restricts the ratio of goals to shots to players with at least 5 goals, to prevent outliers such as goalies scoring during
power plays. We’ll use the display command to output the result
set as a table and then use Databricks to display the output as a
chart. Many Spark ecosystems have ways of visualizing results, and
the Databricks environment provides this capability through the
display command, which works well with both tabular and pivot
table data. After running the above command, you can click on
the chart icon and choose dimensions and measures which show
the distribution of goals versus shots, as visualized in Figure 6.5.
While I’m an advocate of using SQL to transform data, since it
scales to different programming environments, it’s useful to get
familiar with some of the basic dataframe operations in PySpark.
The code snippet below shows how to perform common operations
# dropping columns
copy_df = stats_df.drop('game_id', 'player_id')
# selecting columns
copy_df = copy_df.select('assists', 'goals', 'shots')
# adding columns
copy_df = copy_df.withColumn("league", lit('NHL'))
display(copy_df)
This approach works for appending a new column onto a small dataframe, but the join operation from the Dataframe API can scale to massive data sets.
The result set from the join operation above is shown in Figure
6.6. Spark supports a variety of different join types, and in this
example we used an inner join to append the league column to the
players stats dataframe.
It’s also possible to perform aggregation operations on a dataframe,
such as calculating sums and averages of columns. An example of
computing the average time on ice for players in the stats data set,
and total number of goals scored is shown in the snippet below.
The groupBy command uses the player_id as the column for collaps-
ing the data set, and the agg command specifies the aggregations
to perform.
summary_df = stats_df.groupBy("player_id").agg(
{'timeOnIce':'avg', 'goals':'sum'})
display(summary_df)
The aggregated dataframe can be visualized in Databricks by using the display command and selecting a plot option. The resulting plot of goals versus time on ice is shown in Figure 6.7.
We’ve worked through introductory examples to get up and run-
ning with dataframes in PySpark, focusing on operations that are
useful for munging data prior to training machine learning models.
These types of operations, in combination with reading and writing
dataframes provide a useful set of skills for performing exploratory
analysis on massive data sets.
sample_pd = spark.sql("""
select * from stats
where player_id = 8471214
""").toPandas()
Now we want to perform this operation for every player in the stats data set. To scale to this volume, we'll first partition by player_id, as shown by the groupBy operation in the code snippet below. Next, we'll run the analyze_player function for each of these partitioned data sets using the apply command. While the stats_df dataframe used as input to this operation and the players_df dataframe returned are Spark dataframes, the sampled_pd dataframe and the dataframe returned by the analyze_player function are Pandas dataframes. The
Pandas UDF annotation provides a hint to PySpark for how to dis-
tribute this workload so that it can scale the operation across the
cluster of worker nodes rather than eagerly pulling all of the data
to the driver node. Like most Spark operations, Pandas UDFs are
lazily evaluated and will not be executed until an output value is
needed.
Our initial example now translated to use Pandas UDFs is shown
below. After defining additional modules to include, we specify the
schema of the dataframe that will be returned from the operation.
The schema object defines the structure of the Spark dataframe
that will be returned from applying the analyze_player function.
The next step in the code block lists an annotation that defines
this function as a grouped map operation, which means that it
works on dataframes rather than scalar values. As before, we’ll
use the leastsq function to fit the shots and hits attributes. After
calculating the coefficients for this curve fitting, we create a new
Pandas dataframe with the player id, and regression coefficients.
The display command at the end of this code block will force the
Pandas UDF to execute, which will create a partition for each of
the players in the data set, apply the least squares operation, and
merge the results back together into a Spark dataframe.
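The full listing for this Pandas UDF is not included above; a minimal sketch of the approach described in this passage, assuming the shots and hits columns from the stats data set and illustrative names for the coefficient columns, is shown below.

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, LongType, DoubleType
from scipy.optimize import leastsq
import pandas as pd

# schema of the dataframe returned by the grouped map operation
schema = StructType([StructField('player_id', LongType(), True),
                     StructField('C', DoubleType(), True),
                     StructField('m', DoubleType(), True)])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def analyze_player(sample_pd):
    # fit hits as a linear function of shots for a single player
    def fit(params, x, y):
        return y - (params[0] + x * params[1])

    result = leastsq(fit, [1, 0], args=(sample_pd.shots, sample_pd.hits))
    return pd.DataFrame({'player_id': [sample_pd.player_id.max()],
                         'C': [result[0][0]], 'm': [result[0][1]]})

players_df = stats_df.groupby('player_id').apply(analyze_player)
display(players_df)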
The key capability that Pandas UDFs provide is that they enable
Python libraries to be used in a distributed environment, as long
as you have a good way of partitioning your data.
• Limit Pandas usage: Use the toPandas function when data has already been aggregated and you want to make use of familiar Python plotting tools, but it should not be used for large dataframes.
• Avoid loops: Instead of using for loops, it’s often possible to
use functional approaches such as group by and apply to achieve
the same result. Using this pattern means that code can be par-
allelized by supported execution environments. I’ve noticed that
focusing on using this pattern in Python has also resulted in
cleaner code that is easier to translate to PySpark.
• Minimize eager operations: In order for your pipeline to be as
scalable as possible, it’s good to avoid eager operations that pull
full dataframes into memory. For example, reading in CSVs is an
eager operation, and my work around is to stage the dataframe
to S3 as Parquet before using it in later pipeline steps.
• Use SQL: There are libraries that provide SQL operations
against dataframes in both Python and PySpark. If you’re work-
ing with someone else’s Python code, it can be tricky to decipher
what some of the Pandas operations are achieving. If you plan
on porting your code from Python to PySpark, then using a SQL
library for Pandas can make this translation easier.
By following these best practices when writing PySpark code, I’ve
been able to improve both my Python and PySpark data science
workflows.
Spark's MLlib library provides many of the machine learning algorithms needed for data science workflows. In this section, we'll show how to apply MLlib to a classification problem and save the outputs from the model application to a data lake.
games_df = spark.read.csv("s3://dsp-ch6/csv/games-expand.csv",
header=True, inferSchema = True)
games_df.createOrReplaceTempView("games_df")
games_df = spark.sql("""
select *, row_number() over (order by rand()) as user_id
,case when rand() > 0.7 then 1 else 0 end as test
from games_df
""")
The first step in the pipeline is loading the data set that we want
to use for model training. The snippet above shows how to load the
games data set, and append two additional attributes to the loaded
dataframe using Spark SQL. The result of running this query is
that about 30% of users will be assigned a test value of 1, which we'll use for model application, and each record is assigned a unique user ID which we'll use when saving the model predictions.
The next step is splitting up the data set into train and test
dataframes. For this pipeline, we’ll use the test dataframe as the
data set for model application, where we predict user behavior.
An example of splitting up the dataframes using the test column
is shown in the snippet below. This should result in roughly 16.1k
training users and 6.8k test users.
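The splitting code is omitted here; a minimal sketch using the test column, with dataframe names that match later snippets in this chapter, is:

trainDF = games_df.filter("test == 0")
testDF = games_df.filter("test == 1")
print("Train " + str(trainDF.count()) + " Test " + str(testDF.count()))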
roc = BinaryClassificationEvaluator().evaluate(predDF)
print(roc)
After running this code block, the dataframe will have an addi-
tional column called propensity as shown in Figure 6.10. The final
step in this batch prediction pipeline is to save the results to S3.
We’ll use the select function to retrieve the relevant columns from
the predictions dataframe, and then use the write function on the
dataframe to persist the results as Parquet on S3.
# save results to S3
results_df = predDF.select("user_id", "propensity")
results_path = "s3a://dsp-ch6/game-predictions/"
results_df.write.mode('overwrite').parquet(results_path)
plotDF = spark.sql("""
select cast(propensity*100 as int)/100 as propensity,
label, sum(1) as users
from predDF
group by 1, 2
order by 1, 2
""")
# table output
display(plotDF)
import tensorflow as tf
import keras
from keras import models, layers
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10,)))
model.add(layers.Dropout(0.1))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy')
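The training call that produces the history object used below is not listed; a minimal sketch, assuming train_x and train_y arrays built from the games data set and a 20% validation split, is:

history = model.fit(train_x, train_y, epochs=100, batch_size=100,
                    validation_split=0.2, verbose=0)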
To test for overfitting, we can plot the results of the training and
validation data sets, as shown in Figure 6.12. The snippet below
shows how to use matplotlib to display the losses over time for
these data sets. While the training loss continued to decrease over
additional epochs, the validation loss stopped improving after 20
epochs, but did not noticeably increase over time.
import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)

fig = plt.figure(figsize=(10,6))
plt.plot(epochs, loss, 'bo', label='Training Loss')
plt.plot(epochs, val_loss, 'b', label='Validation Loss')
plt.legend()
plt.show()
display(fig)
partitionedDF = spark.sql("""
select *, cast(rand()*100 as int) as partition_id
from testDF
""")
The next step is to define the Pandas UDF that will apply the
Keras model. We’ll define an output schema of a user ID and
propensity score, as shown below. The UDF uses the predict func-
tion on the model object we previously trained to create a pre-
diction column on the passed in dataframe. The return command
selects the two relevant columns that we defined for the schema
object. The group by command partitions the data set using our
bucketing approach, and the apply command performs the Keras
model application across the cluster of worker nodes. The result is
a Spark dataframe visualized with the display command, as shown
in Figure 6.13.
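The schema object referenced by the UDF below is not listed; a minimal sketch of the user ID and propensity schema described above is:

from pyspark.sql.types import StructType, StructField, LongType, DoubleType

schema = StructType([StructField('user_id', LongType(), True),
                     StructField('propensity', DoubleType(), True)])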
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def apply_keras(pd):
    pd['propensity'] = model.predict(pd.iloc[:,0:10])
    return pd[['user_id', 'propensity']]

results_df = partitionedDF.groupby('partition_id').apply(apply_keras)
display(results_df)
plays_df = spark.read.csv("s3://dsp-ch6/csv/game_plays.csv",
header=True, inferSchema = True).drop(
'secondaryType', 'periodType', 'dateTime', 'rink_side')
plays_pd = plays_df.filter("rand() < 0.003").toPandas()
plays_pd.shape
import featuretools as ft
from featuretools import Feature
es = ft.EntitySet(id="plays")
es = es.entity_from_dataframe(entity_id="plays",dataframe=plays_pd,
index="play_id", variable_types = {
"event": ft.variable_types.Categorical,
"description": ft.variable_types.Categorical })
f1 = Feature(es["plays"]["event"])
f2 = Feature(es["plays"]["description"])
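The encoding step that produces the encoded dataframe and the defs feature definitions used later in this section is not listed; a minimal sketch using featuretools one-hot encoding is:

# one-hot encode the event and description features
encoded, defs = ft.encode_features(plays_pd, [f1, f2], top_n=10)
encoded.reset_index(inplace=True)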
The next step is using the dfs function to perform deep feature
synthesis on our encoded dataframe. The input dataframe will have
a record per play, while the output dataframe will have a single
record per game after collapsing the detailed events into a wide
column representation using a variety of different aggregations.
es = ft.EntitySet(id="plays")
es = es.entity_from_dataframe(entity_id="plays",
dataframe=encoded, index="play_id")
es = es.normalize_entity(base_entity_id="plays",
new_entity_id="games", index="game_id")
features, transform = ft.dfs(entityset=es,
    target_entity="games", max_depth=2)
features.reset_index(inplace=True)
One of the new steps versus the prior approach is that we need to determine the schema of the generated features, since this is needed as an input to the Pandas UDF annotation. To figure out the schema of the generated dataframe, we can create a Spark dataframe and then retrieve the schema from it. Before converting the Pandas dataframe, we need to modify the column names in the generated dataframe to remove special characters, as shown in the snippet below. The resulting Spark schema for the feature application step is displayed in Figure 6.14.
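A minimal sketch of this schema discovery step, reusing the column cleanup pattern from the UDF later in this section, is:

# remove special characters and infer the Spark schema
features.columns = features.columns.str.replace("[(). =]", "")
schema = sqlContext.createDataFrame(features).schema
features.columns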
We now have the defs and transform objects that define the feature encoding and deep feature synthesis. Like the model object in the past section, copies of these objects will be passed to the Pandas UDF executing on worker nodes.
# bucket IDs
plays_df.createOrReplaceTempView("plays_df")
plays_df = spark.sql("""
select *, abs(hash(game_id))%1000 as partition_id
from plays_df
""")
We can now apply feature transformation to the full data set, using
the Pandas UDF defined below. The plays dataframe is partitioned
by the bucket before being passed to the gen_features function. This function uses the previously generated feature transfor-
mations to ensure that the same transformation is applied across
all of the worker nodes. The input Pandas dataframe is a narrow
and deep representation of play data, while the returned dataframe
is a shallow and wide representation of game summaries.
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def gen_features(plays_pd):

    es = ft.EntitySet(id="plays")
    es = es.entity_from_dataframe(entity_id="plays",
        dataframe=plays_pd, index="play_id", variable_types = {
            "event": ft.variable_types.Categorical,
            "description": ft.variable_types.Categorical })
    encoded_features = ft.calculate_feature_matrix(defs, es)
    encoded_features.reset_index(inplace=True)

    # apply the saved transformations to the encoded partition
    es = ft.EntitySet(id="plays")
    es = es.entity_from_dataframe(entity_id="plays",
        dataframe=encoded_features, index="play_id")
    es = es.normalize_entity(base_entity_id="plays",
        new_entity_id="games", index="game_id")

    generated = ft.calculate_feature_matrix(transform, es).fillna(0)
    generated.reset_index(inplace=True)
    generated.columns = generated.columns.str.replace("[(). =]", "")
    return generated

features_df = plays_df.groupby('partition_id').apply(gen_features)
display(features_df)
3 https://github.com/spotify/spark-bigquery/
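A sketch of the sampling query referenced in the next paragraph, assuming a hypothetical dsp_demo dataset as the destination, is shown below.

create table dsp_demo.natality as
  select *
  from `bigquery-public-data.samples.natality`
  order by rand()
  limit 10000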
To create a data set, we’ll sample 10k records from the natality
public data set in BigQuery. To export this result set to GCS, we
need to create a table on BigQuery with the data that we want
to export. The SQL for creating this data sample is shown in the
snippet above. To export this data to GCS, perform the following
steps:
We'll need a JSON credentials file on the driver node in order to read and write files on GCS. One of the challenges with using Spark is that you may not have SSH access to the driver node,
which means that we’ll need to use persistent storage to move the
file to the driver machine. This isn’t recommended for production
environments, but instead is being shown as a proof of concept.
The best practice for managing credentials in a production envi-
ronment is to use IAM roles.
aws s3 ls s3://dsp-ch6/secrets/
To move the json file to the driver node, we can first copy the
credentials file to S3, as shown in the snippet above. Now we can
switch back to Databricks and author the model pipeline. To copy
the file to the driver node, we can read in the file using the sc
Spark context to read the file line by line. This is different from
all of our prior operations where we have read in data sets as
dataframes. After reading the file, we then create a file on the
driver node using the Python open and write functions. Again, this
is an unusual action to perform in Spark, because you typically
want to write to persistent storage rather than local storage. The
result of performing these steps is that the credentials file will now
be available locally on the driver node in the cluster.
creds_file = '/databricks/creds.json'
creds = sc.textFile('s3://dsp-ch6/secrets/dsdemo.json')
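The step that writes these lines to local storage on the driver is not listed; a minimal sketch using the Python open and write functions described above is:

# write the credential lines to a local file on the driver node
with open(creds_file, 'w') as file:
    for line in creds.take(100):
        file.write(line + "\n")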
Now that we have the json credentials file moved to the driver local
storage, we can set up the Hadoop configuration needed to access
data on GCS. The code snippet below shows how to configure
the project ID, file system implementation, and credentials file
location. After running these commands, we now have access to
read and write files on GCS.
sc._jsc.hadoopConfiguration().set("fs.gs.impl",
"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
sc._jsc.hadoopConfiguration().set("fs.gs.project.id",
"your_project_id")
sc._jsc.hadoopConfiguration().set(
"mapred.bq.auth.service.account.json.keyfile", creds_file)
sc._jsc.hadoopConfiguration().set(
"fs.gs.auth.service.account.json.keyfile", creds_file)
natality_path = "gs://dsp_model_store/natality/avro"
natality_df = spark.read.format("avro").load(natality_path)
display(natality_df)
natality_df.createOrReplaceTempView("natality_df")
natality_df = spark.sql("""
SELECT year, plurality, apgar_5min,
mother_age, father_age,
gestation_weeks, ever_born
,case when mother_married = true
then 1 else 0 end as mother_married
,weight_pounds as weight
,case when rand() < 0.5 then 1 else 0 end as test
from natality_df
""").fillna(0)
Next, we’ll translate our dataframe into the vector data types that
MLlib requires as input. The process for transforming the natality
data set is shown in the snippet below. After executing the transform function, we now have training and test data sets we can use for fitting and evaluating a model.
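The assembler and the train/test split used below are not listed; a minimal sketch, assuming the feature columns from the query above and the test column for splitting, is:

from pyspark.ml.feature import VectorAssembler

# combine the input columns into a single feature vector
assembler = VectorAssembler(inputCols=['year', 'plurality', 'apgar_5min',
    'mother_age', 'father_age', 'gestation_weeks', 'ever_born',
    'mother_married'], outputCol="features")

# split the data set using the test column
trainDF = natality_df.filter("test == 0")
testDF = natality_df.filter("test == 1")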
trainVec = assembler.transform(trainDF).select('weight','features')
testVec = assembler.transform(testDF).select('weight', 'features')
folds = 3
rf_trees = [ 50, 100 ]
rf_depth = [ 4, 5 ]
rf= RandomForestRegressor(featuresCol='features',labelCol='weight')
In the final step of our GCP model pipeline, we’ll save the results
to GCS, so that other applications or processes in a workflow can
make use of the predictions. The code snippet below shows how
to write the dataframe to GCS in Avro format. To ensure that
different runs of the pipeline do not overwrite past predictions, we
append a timestamp to the export path.
import time
out_path = "gs://dsp_model_store/natality/preds-{time}/".format(
    time = int(time.time()*1000))
predsDF.write.mode('overwrite').format("avro").save(out_path)
print(out_path)
One of the benefits of running the pipeline entirely within GCP is that you can leverage IAM roles for properly managing access to different services.
6.9 Conclusion
PySpark is a powerful tool for data scientists to build scalable
analyses and model pipelines. It is a highly desirable skill set for
companies, because it enables data science teams to own more of
the process of building and owning data products. There’s a variety
of ways to set up an environment for PySpark, and in this chapter
we explored a free notebook environment from one of the popular
Spark vendors.
This chapter focused on batch model pipelines, where the goal is
to create a set of predictions for a large number of users on a
regular schedule. We explored pipelines for both AWS and GCP
deployments, where the data sources and data outputs are data
lakes. One of the issues with these types of pipelines is that predictions may be quite stale by the time that a prediction is used.
7
Cloud Dataflow for Batch Modeling
Dataflow is a tool for building data pipelines that can run locally,
or scale up to large clusters in a managed environment. While
Cloud Dataflow was initially incubated at Google as a GCP specific
tool, it now builds upon the open-source Apache Beam library,
making it usable in other cloud environments. The tool provides
input connectors to different data sources, such as BigQuery and
files on Cloud Storage, operators for transforming and aggregating
data, and output connectors to systems such as Cloud Datastore
and BigQuery.
In this chapter, we’ll build a pipeline with Dataflow that reads in
data from BigQuery, applies a sklearn model to create predictions,
and then writes the predictions to BigQuery and Cloud Datastore.
We’ll start by running the pipeline locally on a subset of data and
then scale up to a larger data set using GCP.
Dataflow is designed to enable highly-scalable data pipelines, such
as performing ETL work where you need to move data between
different systems in your cloud deployment. It’s also been extended
to work well for building ML pipelines, and there’s built-in support
for TensorFlow and other machine learning methods. The result is
that Dataflow enables data scientists to build large scale pipelines
without needing the support of an engineering team to scale things
up for production.
The core component in Dataflow is a pipeline, which defines the op-
erations to perform as part of a workflow. A workflow in Dataflow
is a DAG that includes data sources, data sinks, and data trans-
formations. Here are some of the key components:
The first step in this code is to load the necessary modules needed
in order to set up a Beam pipeline. We import IO methods for
reading and writing text files, and utilities for passing parameters
to the Beam pipeline. Next, we define a class that will perform
a DoFn operation on every element passed to the process function.
This class extends the beam.DoFn class, which provides an interface
for processing elements in a collection. The third step is setting up
parameters for the pipeline to use for execution. For this example,
we need to set up the input location for reading the text and output
location for writing the result.
Once we have set up the pipeline options, we can set up the DAG
that defines the sequence of actions to perform. For this example,
we’ll create a simple sequence where the input text is passed to
our append step and the output is passed to the text writer. A
visualization of this pipeline is shown in Figure 7.1. To construct
the DAG, we use pipe (|) commands to chain the different steps
together. Each step in the pipeline is a ParDo or Transform command
that defines the Beam operation to perform. In more complicated
workflows, an operation can have multiple outputs and multiple
inputs.
Once the pipeline is constructed, we can use the run function to
execute the pipeline. When running this example in Jupyter, the
Direct Runner will be used by Beam to execute the pipeline on
the local machine. The last command waits for the pipeline to
complete before proceeding.
With the Direct Runner, all of the global objects defined in the
Python file can be used in the DoFn classes, because the code is run-
ning as a single process. When using a distributed runner, some additional steps need to be performed to make sure that the class has access to the modules and objects it needs on the worker nodes.
# run locally
python3 append.py
# run managed
python3 append.py \
--runner DataflowRunner \
--project your_project_name \
--temp_location gs://dsp_model_store/tmp/
sql = """
SELECT year, plurality, apgar_5min,
mother_age, father_age,
gestation_weeks, ever_born
,case when mother_married = true
then 1 else 0 end as mother_married
,weight_pounds as weight
FROM `bigquery-public-data.samples.natality`
order by rand()
limit 10000
"""
natalityDF = client.query(sql).to_dataframe().fillna(0)
natalityDF.head()
Once we have the data to train on, we can use the LinearRegression
class in sklearn to fit a model. We’ll use the full dataframe for
fitting, because the holdout data is the rest of the data set that
was not sampled. Once trained, we can use pickle to serialize the
model and save it to disk. The last step is to move the model file
from local storage to cloud storage, as shown below. We now have
a model trained that can be used as part of a distributed model
application workflow.
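The training and serialization steps are not listed; a minimal sketch, assuming the weight column as the label and the remaining feature columns in the same positions used by the prediction code later in this chapter, is:

from sklearn.linear_model import LinearRegression
import pickle

# fit a linear regression model on the sampled natality data
model = LinearRegression()
model.fit(natalityDF.iloc[:, 1:8], natalityDF['weight'])

# serialize the model to local storage
pickle.dump(model, open("natality.pkl", 'wb'))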
# Save to GCS
bucket = storage.Client().get_bucket('dsp_model_store')
blob = bucket.blob('natality/sklearn-linear')
blob.upload_from_filename('natality.pkl')
query = """
SELECT year, plurality, apgar_5min,
mother_age, father_age,
gestation_weeks, ever_born
,case when mother_married = true
then 1 else 0 end as mother_married
,weight_pounds as weight
,current_timestamp as time
,GENERATE_UUID() as guid
FROM `bigquery-public-data.samples.natality`
order by rand()
limit 100
"""
Next, we’ll define a DoFn class that implements the process function
and applies the sklearn model to individual records in the Natality
data set. One of the changes from before is that we now have an
init function, which we use to instantiate a set of fields. In order to
have references to the modules that we need to use in the process
function, we need to assign these as fields in the class, otherwise
the references will be undefined when running the function on dis-
tributed worker nodes. For example, we use self._pd to refer to
the Pandas module instead of pd. For the model, we’ll use lazy ini-
tialization to fetch the model from Cloud Storage once it’s needed.
While it’s possible to implement the setup function defined by the
DoFn interface to load the model, there are limitations on which
runners call this function.
class ApplyDoFn(beam.DoFn):
def __init__(self):
self._model = None
new_x = self._pd.DataFrame.from_dict(element,
orient = "index").transpose().fillna(0)
weight = self._model.predict(new_x.iloc[:,1:8])[0]
return [ { 'guid': element['guid'], 'weight': weight,
'time': str(element['time']) } ]
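The remaining pieces of this class are omitted above; a fuller sketch of the DoFn described in this passage, assuming the Cloud Storage bucket and blob names from the training step, might look like the following.

class ApplyDoFn(beam.DoFn):

    def __init__(self):
        self._model = None
        from google.cloud import storage
        import pandas as pd
        import pickle as pkl
        self._storage = storage
        self._pkl = pkl
        self._pd = pd

    def process(self, element):
        # lazily fetch the pickled model from Cloud Storage on first use
        if self._model is None:
            bucket = self._storage.Client().get_bucket('dsp_model_store')
            blob = bucket.get_blob('natality/sklearn-linear')
            self._model = self._pkl.loads(blob.download_as_string())

        # apply the model to the record and emit a result dictionary
        new_x = self._pd.DataFrame.from_dict(element,
            orient="index").transpose().fillna(0)
        weight = self._model.predict(new_x.iloc[:, 1:8])[0]
        return [{'guid': element['guid'], 'weight': weight,
                 'time': str(element['time'])}]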
Once the model object has been lazily loaded in the process func-
tion, it can be used to apply the linear regression model to the
input record. In Dataflow, records retrieved from BigQuery are re-
turned as a collection of dictionary objects and our process function
is responsible for operating on each of these dictionaries indepen-
dently. We first convert the dictionary to a Pandas dataframe and
then pass it to the model to get a predicted weight. The process
function returns a list of dictionary objects, which describe the
results to write to BigQuery. A list is returned instead of a dictio-
nary, because a process function in Beam can return zero, one, or
multiple objects.
An example element object passed to the process function is a dictionary, where the keys are the column names of the query record and the values are the record values.
schema = parse_table_schema_from_json(json.dumps({'fields':
[ { 'name': 'guid', 'type': 'STRING'},
{ 'name': 'weight', 'type': 'FLOAT64'},
{ 'name': 'time', 'type': 'STRING'} ]}))
The next step is to create the pipeline and define a DAG of Beam
operations. This time we are not providing input or output argu-
ments to the pipeline, and instead we are passing the input and
output destinations to the BigQuery operators. The pipeline has
three steps: read from BigQuery, apply the model, and write to
BigQuery. To read from BigQuery, we pass in the query and spec-
ify that we are using standard SQL. To apply the model, we use
our custom class for making predictions. To write the results, we
pass the schema and table name to the BigQuery writer, and spec-
ify that a new table should be created if necessary and that data
should be appended to the table if data already exists.
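The pipeline construction itself is not listed here; a minimal sketch of the three-step DAG described above, reusing the query, ApplyDoFn, and schema objects defined earlier and assuming a hypothetical dsp_demo.weight_preds output table, is:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

p = beam.Pipeline(options=PipelineOptions())

(p
 | 'Read from BigQuery' >> beam.io.Read(beam.io.BigQuerySource(
         query=query, use_standard_sql=True))
 | 'Apply Model' >> beam.ParDo(ApplyDoFn())
 | 'Save to BigQuery' >> beam.io.WriteToBigQuery(
         'weight_preds', 'dsp_demo', schema=schema,
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
)

# run the pipeline and wait for it to finish
result = p.run()
result.wait_until_finish()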
The last step in the script is running the pipeline. While it is pos-
sible to run this complete code listing from Jupyter, the pipeline
will not be able to complete because the project parameter needs
to be passed as a command line argument to the pipeline.
# running locally
python3 apply.py --project your_project_name
# running on GCP
echo $'google-cloud-storage==1.19.0' > reqs.txt
python3 apply.py \
--runner DataflowRunner \
--project your_project_name \
--temp_location gs://dsp_model_store/tmp/ \
--requirements_file reqs.txt \
--max_num_workers 5
We can now remove the limit command from the query in the
pipeline and scale the workload to the full dataset. When running
the full-scale pipeline, it’s useful to keep an eye on the job to
make sure that the cluster size does not scale beyond expectations.
Setting the maximum worker count helps avoid issues, but if you
forget to set this parameter then the cluster size can quickly scale
and result in a costly pipeline run.
One of the potential issues with using Python for Dataflow
pipelines is that it can take a while to initialize a cluster, because
each worker node will install the required libraries for the job from
class PublishDoFn(beam.DoFn):

    def __init__(self):
        from google.cloud import datastore
        self._ds = datastore
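The process function of this class is not shown here; a minimal sketch, using the entity kind and attributes that appear in the Datastore output below, is:

    def process(self, element):
        # write the prediction to Cloud Datastore, keyed by the record GUID
        client = self._ds.Client()
        key = client.key('natality-guid', element['guid'])
        entity = self._ds.Entity(key)
        entity['weight'] = element['weight']
        entity['time'] = element['time']
        client.put(entity)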
from google.cloud import datastore

client = datastore.Client()
query = client.query(kind='natality-guid')
query_iter = query.fetch()
for entity in query_iter:
    print(entity)
    break
break
<Entity('natality-guid', '0046cdef-6a0f-4586-86ec-4b995cfc7c4e')
{'weight': 7.9434742419056,
'time': '2019-12-15 03:00:06.319496 UTC'}>
7.3 Conclusion
Dataflow is a powerful data pipeline tool that enables data scien-
tists to rapidly prototype and deploy data processing workflows
that can apply machine learning algorithms. The framework pro-
vides a few basic operations that can be chained together to define the DAG for a pipeline.
8
Streaming Model Workflows
To get started, we'll launch a single-node Kafka broker and create a topic using the commands shown in the snippet below. We'll also install a library for working with Kafka in Python called kafka-python.
# new terminal
bin/kafka-server-start.sh config/server.properties
# new terminal
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 \
    --replication-factor 1 --partitions 1 --topic dsp
# output
[2019-12-18 10:50:25] INFO Log partition=dsp-0, dir=/tmp/kafka-logs
Completed load of log with 1 segments, log start offset 0 and
log end offset 0 in 56 ms (kafka.log.Log)
from kafka import KafkaProducer, KafkaConsumer
from json import dumps, loads

producer = KafkaProducer(bootstrap_servers=['localhost:9092'],
    value_serializer=lambda x: dumps(x).encode('utf-8'))

consumer = KafkaConsumer('dsp',
    bootstrap_servers=['localhost:9092'],
    value_deserializer=lambda x: loads(x.decode('utf-8')))
for x in consumer:
    print(x.value)
vi config/server.properties
advertised.listeners=PLAINTEXT://{external_ip}:9092
df = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "{external_ip}:9092")
    .option("subscribe", "dsp")
    .option("startingOffsets", "earliest").load())

display(df)
For the Spark streaming example, we’ll again use the Games data
set, which has ten attributes and a label column. In this workflow,
we’ll send the feature vector to the streaming pipeline as input,
and output an additional prediction column as the output. We’ll
also append a unique identifier, as shown in the Python snippet
below, in order to track the model applications in the pipeline.
The snippet below shows how to create a Python dict with the ten
attributes needed for the model, append a GUID to the dictionary,
and send the object to the streaming model topic.
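A minimal sketch of this step, assuming hypothetical attribute names G1 through G10 for the feature vector and the Kafka broker configured above, is:

import json
import uuid
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=['{external_ip}:9092'],
    value_serializer=lambda x: json.dumps(x).encode('utf-8'))

# build a feature vector with ten attributes and a unique identifier
record = {'G' + str(i): 0 for i in range(1, 11)}
record['User_ID'] = str(uuid.uuid1())

producer.send('dsp', value=record)
producer.flush()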
The script first trains a logistic regression model using data fetched
from GitHub. The model object is created on the driver node, but
is copied to the worker nodes when used by the UDF. The next
step is to define a UDF that we'll apply to streaming records in the pipeline.
consumer = KafkaConsumer('preds',
bootstrap_servers=['{external_ip}:9092'],
value_deserializer=lambda x: loads(x))
for x in consumer:
    print(x.value)
{'User_ID': '4be94cd4-21e7-11ea-ae04-8c8590b3eee6',
'pred': 0.9325488640736544}
8.2.1 PubSub
PubSub is a fully-managed streaming platform available on GCP.
It provides similar functionality to Kafka for achieving high
throughput and low latency when handling large volumes of mes-
sages, but reduces the amount of DevOps work needed to maintain
the pipeline. One of the benefits of PubSub is that the APIs map
well to common use cases for streaming Dataflow pipelines.
One of the differences from Kafka is that PubSub uses separate
concepts for producer and consumer data sources. In Kafka, you
can publish and subscribe to a topic directly, while in PubSub con-
sumers subscribe to subscriptions rather than directly subscribing
to topics. With PubSub, you first set up a topic and then create
one or more subscriptions that listen on this topic. To create a
topic with PubSub, perform the following steps:
import time
from google.cloud import pubsub_v1
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(
"your_project_name", "dsp")
def callback(message):
    print(message.data)
    message.ack()
subscriber.subscribe(subscription_path, callback=callback)
while True:
    time.sleep(10)
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("your_project_name", "natality")
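The publish call is omitted above; a minimal sketch that sends the test message expected by the consumer is:

data = "Hello World!".encode('utf-8')
publisher.publish(topic_path, data=data)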
To test out the pipeline, first run the consumer in a Jupyter note-
book and then run the producer in a separate Jupyter notebook.
The result should be that the consumer cell outputs "Hello World"
to the console after receiving a message. Now that we have tested
out basic functionality with PubSub, we can now integrate this
messaging platform into a streaming Dataflow pipeline.
class ApplyDoFn(beam.DoFn):

    def __init__(self):
        self._model = None
        from google.cloud import storage
        import pandas as pd
        import pickle as pkl
        import json as js
        self._storage = storage
        self._pkl = pkl
        self._pd = pd
        self._json = js
element = self._json.loads(element.decode('utf-8'))
new_x = self._pd.DataFrame.from_dict(element,
orient = "index").transpose().fillna(0)
weight = self._model.predict(new_x.iloc[:,1:8])[0]
return [ { 'guid': element['guid'], 'weight': weight,
'time': str(element['time']) } ]
The code snippet above shows the function we'll use to perform model application in the streaming pipeline. This function is the same as the function we defined in Chapter 7, with one modification: the json.loads function is used to convert the passed-in string into a dictionary object. In the previous pipeline, the elements passed in from the BigQuery result set were already dictionary objects, while the elements passed in from the PubSub consumer are string objects. We'll also reuse the DoFn from the past chapter which publishes elements to Datastore, listed in the snippet below.
class PublishDoFn(beam.DoFn):
def __init__(self):
from google.cloud import datastore
self._ds = datastore
entity['time'] = element['time']
client.put(entity)
The code does not explicitly state that this is a streaming pipeline,
and the code above can be executed in a batch or streaming mode.
In order to run this pipeline as a streaming Dataflow deployment,
we need to specify the streaming flag as shown below. We can first
test the pipeline locally before deploying the pipeline to GCP. For
a streaming pipeline, it's best to use GCP deployments, because the fully-managed pipeline can scale to match demand.
To test out the pipeline, we’ll need to pass data to the dsp topic
which is forwarded to the natality subscription. The code snippet
below shows how to pass a dictionary object to the topic using
Python and the Google Cloud library. The data passed to Pub-
Sub represents a single record in the BigQuery result set from the
previous chapter.
import json
from google.cloud import pubsub_v1
import time
publisher = pubsub_v1.PublisherClient()
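The remainder of the snippet is omitted; a sketch of publishing a single record, where the topic name and the field values are illustrative assumptions, is:

topic_path = publisher.topic_path("your_project_name", "natality")

# an illustrative natality record with the fields used by the model
data = json.dumps({'year': 2001, 'plurality': 1, 'apgar_5min': 99,
                   'mother_age': 33, 'father_age': 40,
                   'gestation_weeks': 38, 'ever_born': 8,
                   'mother_married': 1, 'weight': 6.8,
                   'time': str(time.time()),
                   'guid': 'b281c5e8-85b2-4cbd-a2d8-e501ca816363'
                  }).encode('utf-8')

publisher.publish(topic_path, data=data)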
8.3 Conclusion
Streaming model pipelines are useful for systems that need to ap-
ply ML models in real-time. To build these types of pipelines, we
explored two message brokers that can scale to large volumes of
events and provide data sources and data sinks for these pipelines.
Streaming pipelines often constrain the types of operations you
can perform, due to latency requirements. For example, it would
be challenging to build a streaming pipeline that performs feature
generation on user data, because historic data would need to be re-
trieved and combined with the streaming data while maintaining
low latency. There are patterns for achieving this type of result,
such as precomputing aggregates for a user and storing the data
in an application database, but it can be significantly more work
getting this type of pipeline to work in a streaming mode versus a
batch mode.
We first explored Kafka as a streaming message platform and built a real-time pipeline using structured streaming and PySpark. Next, we built a streaming Dataflow pipeline reusing components from the past chapter that now interface with the PubSub streaming
service. Kafka is typically going to provide the best performance
in terms of latency between these two message brokers, but it takes
significantly more resources to maintain this type of infrastructure
versus using a managed solution. For small teams getting started,
PubSub or Kinesis provide great options for scaling to match de-
mand while reducing DevOps support.
3 https://labs.spotify.com/2017/10/16/