Data Science and Engineering at Enterprise Scale

Notebook-Driven Results and Analysis
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Science and Engineering at Enterprise Scale, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
This work is part of a collaboration between O’Reilly and IBM. See our statement of
editorial independence.
978-1-492-03931-0
For Sofia, David, and Charlie.
Thank you for giving me purpose, joy, and love.
Table of Contents

Foreword ix
Preface xi
Numerical Optimization, the Workhorse of All Machine Learning 45
Feature Scaling 48
Letting the Libraries Do Their Job 49
The Data Scientist Has a Job to Do Too 50
Summary 50
Foreword

This book is an easy read—for example, you can take it along on a flight. I especially like how notebooks get used. They serve as containers for most of the technical details, organized as scaffolding: run them at first to get familiar and see the big picture, then dig into details on later iterations through the code. Moreover, these notebooks provide examples of how you’ll be collaborating on data science teams, delivering insights in the enterprise.
That’s the point about learning data science: you’ll continue to learn
and grow in your practice, as these popular tools continue to evolve
and as your team continually adapts to business needs. But get
started now, and head in the right direction with Data Science and
Engineering at Enterprise Scale.
— Paco Nathan
Derwen, Inc.
Preface
• Notebooks
— Notebooks and the Jupyter ecosystem
— Language examples (Python, Scala)
— IBM Watson Studio
• Using machine learning frameworks in a notebook
— Installation, architectures
— Scalable machine learning (Apache Spark)
— Deep learning frameworks
• Analytics in the production environment
— Implementation issues
— Collaboration across the enterprise
and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Data Science and Engineering at Enterprise Scale by Jerome Nilmeier, PhD (O’Reilly). Copyright 2019 O’Reilly Media, Inc., 978-1-492-03931-0.”
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.
O’Reilly
For almost 40 years, O’Reilly Media has provided technology and business training,
knowledge, and insight to help companies
succeed.
How to Contact Us
Please address comments and questions concerning this book to the
publisher:
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
Acknowledgments
In no particular order, for their various contributions, assistance,
and/or inspirations:
Justin McCoy, Brendan Dwyer, Vijay Bommireddipalli, Carmen
Rupach, Susan Malaika, Sourav Mazumder, Stacey Ronaghan, Dean
Wampler, Paco Nathan, William Roberts, Michele Cronin, Steve
Beier, Colleen Lobner, Nicole Tache, Anita Chung, Lori Vella,
Katherine Krantz, Edward Leardi, Rachel Monaghan, and Kristen
Brown.
CHAPTER 1
Sharing Information Across
Disciplines in the Enterprise
There is a saying in the research community, “if you didn’t publish it, it didn’t happen,” which places a very heavy emphasis on careful documentation of work as the measure of success. It is not enough, however, to document and present your findings. As a data scientist, you must also be prepared to defend your position and persuade skeptics. This process requires diligence and determination if your idea is to be embraced.
On the other hand, for modern enterprise developers and engineers working in a fast-paced environment, the emphasis is on delivering code that provides the functionality required for the company’s success. The process of reporting findings is not typically as highly valued, and documentation is often considered a necessary evil that only the more diligent developer is committed to maintaining. Tracking progress is tied more to performance measures for time management than to explaining your reasoning and design choices. Furthermore, an aesthetic of compactness and brevity is more highly valued in a mature codebase. This more terse style, however, may be more difficult to read without additional documentation or explanation.
How, then, do we reconcile the two approaches in a coherent way?
The data scientist may have a question about an algorithm that
could affect performance, and will want to run tests. How do these
tests translate into useful code? How does the data scientist persuade
the development team that these tests open a path to a useful
solution?
Conversely, how can an engineer or developer explain some of the
more elegant but difficult-to-read pieces of code to a data scientist
without creating unnecessarily verbose descriptions in the
codebase?
Finally, how can management figure out what on earth their team is
up to, beyond using a ticketing system (such as JIRA or GitHub
Issues)?
Enter the notebook.
$$ f(x_1, \ldots, x_k) = \frac{n!}{x_1! \cdots x_k!} \, p_1^{x_1} \cdots p_k^{x_k} $$
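To see what draws from this distribution look like, numpy can sample it directly. This small sketch (variable names follow the chapter's conventions) simulates five rounds of two rolls of a fair six-sided die:

```python
import numpy as np

np.random.seed(10)
numFaces = 6
nTrials = 2     # dice rolls per round
nRounds = 5     # number of rounds

# each row is one round: counts of how many times each face appeared
s = np.random.multinomial(nTrials, [1/6.] * numFaces, size=nRounds)
print(s)
```

Each row sums to nTrials, since every roll lands on exactly one face.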
s = np.zeros((nRounds, numFaces))
return s
Figure 1-4. Method for selecting a die value for a six-sided die using a uniform random number generator; the top bar shows a fair die and the bottom bar shows a biased die
s = np.zeros((nRounds, numFaces))
# assume that nRounds is of reasonable size
# (nTrials can be very large).
# This means that Spark data types won't be needed.
for iRound in range(nRounds):
    # each round is assigned a deterministic, unique seed
    s[iRound, :] = countsForSingleRound(
        numFaces, nTrials, sparkSeed + iRound, pcdf)
return s
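The countsForSingleRound helper is defined elsewhere in the notebook. A plausible implementation, assuming pcdf holds the cumulative probabilities of the faces, might look like this (a sketch of the idea, not the book's exact code):

```python
import numpy as np

def countsForSingleRound(numFaces, nTrials, seed, pcdf):
    # a private generator seeded per round keeps each round deterministic
    rng = np.random.RandomState(seed)
    u = rng.uniform(size=nTrials)
    # inverse-CDF sampling: map each uniform draw to a face index
    faces = np.searchsorted(pcdf, u, side='right')
    faces = np.minimum(faces, numFaces - 1)  # guard against rounding at 1.0
    counts = np.zeros(numFaces)
    for f in faces:
        counts[f] += 1
    return counts
```

Because the generator is seeded explicitly, calling it twice with the same seed returns identical counts, which is what makes the Spark version reproducible.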
Our histograms here should also look pretty much the same for sufficiently large sample sizes. Once we have verified this, we can generate a pseudorandom sequence for unit testing:
>>> np.random.seed(10)
>>> sSpark = multinomialSpark(nTrialsUT, p1, size = nRoundsUT)
>>> print(sSpark[0:5])
[[ 1. 0. 1. 0. 0. 0.]
[ 0. 0. 0. 1. 0. 1.]
[ 0. 2. 0. 0. 0. 0.]
[ 0. 0. 0. 1. 1. 0.]
[ 0. 1. 1. 0. 0. 0.]]
These outputs can now be used to define a unit test. Unit tests are
used to verify that a function (or method, depending on how it is
written) is producing the correct outputs. For large codebases, they
are a fundamental component that allows the developer to make
sure that newly added functionality does not break other pieces of
code.
These tests should be added as new functions are incorporated. In
many cases, you can even write the test beforehand and use it as a
recommendation for writing the function by enforcing input and
output types as well as the expected content. This approach to
coding is referred to as test-driven development (TDD), and can be a
very efficient way to assign coding tasks to a team.
At the very least, TDD can be a nice way to concretely express your
idea to those who are considering it for production code. The unit
tests for the two functions discussed are given as follows, with the outputs extracted. Notice that the random seed assignment is critical to the reproducibility of these functions.
class TestMultinomialMethods(unittest.TestCase):
    # See
    # http://localhost:8888/notebooks/multinomialScratch.ipynb
    # for a detailed description
    nTrials = 2
    nRounds = 5

    def testMultinomialLocal(self):
        np.random.seed(10)
        p = [1/6.]*6
        nTrials = 2
        nRounds = 5
        # reference data generated in notebook
        # (preferably a GitHub link)
        # http://localhost:8888/notebooks/multinomialScratch.ipynb
        # Numpy-Unit-Test-Data
        sLocalReference = np.array([[ 0., 1., 0., 0., 1., 0.],
                                    [ 1., 1., 0., 0., 0., 0.],
                                    [ 0., 0., 0., 1., 1., 0.],
                                    [ 0., 1., 0., 0., 1., 0.],
                                    [ 1., 0., 1., 0., 0., 0.]])
        # run the local implementation and compare against the reference
        sLocal = multinomialLocal(nTrials, p, size=nRounds)
        np.testing.assert_array_equal(sLocal, sLocalReference)

    def testMultinomialSpark(self):
        np.random.seed(10)
        p = [1/6.]*6
        nTrials = 2
        nRounds = 5
        # reference data generated in notebook:
        # http://localhost:8888/notebooks/multinomialScratch.ipynb
        # Spark-Unit-Test-Data
        sSparkReference = np.array([[ 0., 0., 1., 0., 1., 0.],
                                    [ 0., 0., 1., 0., 1., 0.],
                                    [ 0., 0., 1., 0., 1., 0.],
                                    [ 0., 0., 1., 0., 1., 0.],
                                    [ 0., 0., 1., 0., 1., 0.]])
        # run the Spark implementation and compare against the reference
        sSpark = multinomialSpark(nTrials, p, size=nRounds)
        np.testing.assert_array_equal(sSpark, sSparkReference)
OK
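The OK above is the text test runner's report. Inside a notebook, calling unittest.main() would try to shut down the kernel, so a runner invocation along these lines is a common workaround (the test case shown here is a stand-in; TestMultinomialMethods plugs in the same way):

```python
import unittest

class TestSanity(unittest.TestCase):
    # placeholder test; the multinomial tests would be loaded the same way
    def testAddition(self):
        self.assertEqual(1 + 1, 2)

suite = unittest.TestLoader().loadTestsFromTestCase(TestSanity)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

result.wasSuccessful() can then be inspected programmatically in the next cell.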
Summary
The tests were a success, and so we can provide not only the function, but also a means for testing it. We now have a well-defined function that can be implemented and studied at scale. The development team can now review the notebook before it gets implemented in the codebase. From this point forward, we can focus on implementation issues rather than wonder if our algorithm is performing as expected.
In the next chapter, we will discuss the details of setting up your
notebook environment. Once this is complete, you will be able to
run all of the examples in the text and in the GitHub repository.
CHAPTER 2
Setting Up Your
Notebook Environment
Figure 2-1. The enterprise-scale data science ecosystem covered in this
book
You can name the project whatever you like. Here we’ve named it
DSAtEnterpriseScale because the GitHub repository has the same name. A project is a place for storing (among other services) notebooks and data. Once the project is created, go to the Assets tab and see what is available to you, as shown in Figure 2-3. For the purposes of this text, only data assets and notebooks will be used.
Figure 2-3. Project page for creating a new Watson Studio notebook
• Apache Spark
• Jupyter Notebooks
• TensorFlow
Installing Spark
Apache Spark is a framework for scalable computing. You can, however, download and run it on a single compute node (your laptop works too). The code can be written in local mode, and can then run on a cluster with larger-scale data and computations without modifications. The project is written entirely in Scala, which compiles to Java byte code. This means you will need the Java Development Kit (JDK).
Installing Java
Java may already be installed on your machine. To check, simply
type:
$ java -version
For this book, we recommend Java 8. Java 9 will not work for this version of Spark! Older versions of Java may work, but we have not tested them for the examples presented. The output should look something like this:
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
Since the native language of Spark is Scala, you will see a scala>
prompt. We will go into more detail about Scala syntax later, but for
now we only want to verify that an object called the Spark Context
has been created. This object, sc, is initialized by the spark-shell
command. The Spark Context is the portal to all things Spark, and
contains all of the information required to communicate commands
to a cluster of computers. Of course, we are only using a local
machine here for our examples, but it is possible to connect the
Spark Context to a cluster of arbitrary size and run the same code at
much larger scale.
To verify that this object has been initialized properly, type:
scala> sc
You should see:
res0: org.apache.spark.SparkContext =
org.apache.spark.SparkContext@3f322610
If you do not see something like this, something has gone wrong in
creating the Spark Context, and you will not be able to run Spark
commands.
You can also verify that the Python API to Spark is working like so:
$SPARK_HOME/bin/pyspark
You should see a similar splash screen upon initialization, followed
by a Python shell prompt. Verify that the Spark Context is correctly
initialized by typing sc at the Python REPL:
>>> sc
You should see a similar result.
Installing Jupyter
Following the Anaconda documentation, add this to your .bashrc
or .bash_profile file:
export SPARK_HOME=/path/to/spark-2.3.0-bin-hadoop2.7
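A fuller set of exports often used for an Anaconda-plus-Spark setup looks like the following; the path and the driver settings are placeholders that you should adapt to your own installation:

```shell
export SPARK_HOME=/path/to/spark-2.3.0-bin-hadoop2.7
export PATH="$SPARK_HOME/bin:$PATH"
# optional: have the pyspark launcher open Jupyter instead of a plain REPL
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
```

Remember to open a new shell (or source the file) so the exports take effect.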
To launch Jupyter Notebook, simply type:
(py35) $ jupyter notebook
This command will launch Jupyter in your default browser. You
should see a browser tab open that looks like Figure 2-5.
Under the New tab, select the py35 kernel from the drop-down list (see Figure 2-6). This will open a new notebook with the correct kernel.

Figure 2-6. The drop-down menu will allow you to choose which kernel to run when launching a notebook
For purposes of verifying the installation, you can run the code in
the cell as shown in Figure 2-7.
Figure 2-7. Hello World code, along with a version check; the name of
the kernel (py35) is shown in the upper-right corner
The [*] option specifies the number of threads, and * indicates that
you wish to have as many threads as are available (you can specify a
number as well if you know how many you want to use).
We will only use the Scala and PySpark kernels for this book, but it may be of interest to you to pursue the other interpreters:

jupyter toree install --interpreters=Scala,PySpark,SparkR,SQL

Once Toree is installed, you can launch Jupyter Notebook as usual, but you will now have several kernels to choose from.
Summary
That’s it for setting up your environment! You should be able to run all of the examples in the remainder of the text through any of the three options discussed in this chapter. We encourage you to run the notebooks for most of the examples, but you can also copy the code into the interactive shells.
CHAPTER 3
Data Science Technologies
Data science tooling has entered a golden age. At the laptop scale, the most common tools are R, MATLAB, and Python Scikit-learn, but there are many others. Oftentimes, an expert data scientist will have her “go-to” language, where she feels most confident developing prototypes. A data engineer may also have her preferences when writing scalable code.
There are so many tools to choose from that it can sometimes be a challenge to know which ones to start with. Our aim here is to narrow the field by providing technical foundations for a few simple frameworks. These may not be optimal for all workloads, but they are certainly among the most popular choices for many use cases. We will discuss Apache Spark for most scalable applications. For the deep learning examples, we will use TensorFlow, which we will not discuss in detail in this chapter. The examples provided in later sections are relatively straightforward to follow along with.
Apache Spark
As you have learned, Apache Spark is a framework for writing distributed code. It can be developed at the desktop scale on moderately sized datasets. Once the code is ready, it can be migrated to a cluster or cloud computing resource. The scale-up process is straightforward, and often requires only trivial modifications to the code to run at scale.
Spark is often considered to be the second generation of distributed computing in the enterprise. The first generation was Hadoop, which consists of the Hadoop Distributed File System (HDFS), a resource manager (YARN), and an execution framework (MapReduce). Apache Spark can use a variety of resource managers and filesystems, but the basic design has its roots in the Hadoop ecosystem.
Executor: A process that is launched for an application on a worker node, and runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.

Job: A parallel computation that consists of multiple tasks and gets spawned in response to a Spark action.

Stage: A smaller set of tasks that make up a job and depend on each other.
Spark Core contains the basic parallel framework for running jobs on a cluster. All of the libraries can be accessed through this framework, which provides an impressive breadth of parallel functionality in a single unified platform (see Figure 3-2).
Figure 3-2. The main libraries that are accessible from within Spark
Notice that the sc.parallelize() method from the Spark Context
was invoked to create the RDD. In the result section, the name of the
object is given as ParallelCollectionRDD. The next two commands
are transformations, which result in new ParallelCollectionRDDs.
Notice that no actual result is given, nor is any calculation carried
out. Transformations are treated as lazy evaluations, which means
that they are not computed until needed.
In another cell, we can carry out some actions on these RDDs:
println("Count: " + filteredRDD.count())
filteredRDD.collect()
Count: 10
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
These actions actually require a result to be computed and returned, which initiates a calculation. The results of an action are no longer RDDs, but are local Scala objects (or Python objects, or whatever API is used). If an action is carried out on an RDD that has undergone many transformations, the entire chain of calculations will be computed when the first action is called. It is important to remember this process when evaluating performance, as it can sometimes be confusing to determine which operations are consuming the most time.
Caching Results
Since an RDD is an object that persists for the lifetime of the Spark job, it is often desirable to access its contents repeatedly. The ability to cache results in memory was an important initial design element of the Spark framework, and paved the way for massive improvements in efficiency for machine learning and other algorithms that required iterative solvers to find parameters for complex models. An example of the cache() method is shown here in Scala:
val test = sc.parallelize(1 to 50000,50)
//cache this data
test.cache
val t1 = System.nanoTime()
// first count will trigger evaluation of count *and* cache
test.count
val dt1 = (System.nanoTime() - t1).toDouble/1.0e9
val t2 = System.nanoTime()
// second count operates on cached data only
test.count
val dt2 = (System.nanoTime() - t2).toDouble/1.0e9
Figure 3-3. The Spark UI has many features for diagnostics. Here we
are comparing the calculation time for the two count operations. Job
Id 1 is the count after caching, and Job Id 2 is the count before caching.
Figure 3-4. The Spark UI also provides information on the status of
cached data.
val df = spark.read.json("people.json")
df.show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

df.printSchema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
Summary
In this chapter we discussed the fundamentals of the primary frame‐
work for big data that is used in this book, Apache Spark. Though it
is only the very beginning of our understanding of this powerful
framework, this should give you an idea of Spark’s basic syntax and
usage.
Now that we have the tools installed and available to us, along with a
basic understanding of syntax and usage, we are able to dig deeper
into its functionality. The notebook examples that follow in the rest
of the book will build on this basic understanding.
Numerical optimization: A method for solving for parameters for a given loss function and training dataset. The loss function can be for linear or nonlinear models, but only the linear model is studied in this chapter. The only optimization technique used in this chapter is the gradient descent algorithm.

Gradient descent: The most basic numerical optimization technique. It is an iterative process, where the parameters are incrementally updated by evaluating the gradient of the objective at each step.
$$ f(x; w_1, w_0) = w_1 \cdot x + w_0 $$

$$ f(x_1, \ldots, x_N; w_0, \ldots, w_N) = w_0 + \sum_{i=1}^{N} w_i x_i $$
where the variable with no feature associated with it is called the bias term. The prediction is a result of applying the model to an incoming feature set, and is based upon a training set of data. The training set contains a list of features and labels. Labels can, in general, take on discrete values (which leads us more naturally to think of them as labels). For the model studied here, labels are simply floating-point numbers.
The previous equation is often written in the more compact linear algebra notation as:

$$ f(\mathbf{x}) = \mathbf{x}\mathbf{w} $$

where:

$$ \mathbf{x} = [1 \;\; x_1 \;\; \ldots \;\; x_N] $$

is written as a row vector and $\mathbf{w} = [w_0 \;\; w_1 \;\; \ldots \;\; w_N]^T$ is written as a column vector. The value 1 in the first position of the feature vector is there to account for the intercept.
$$ S(\mathbf{w}) = (\mathbf{y} - X\mathbf{w})^T (\mathbf{y} - X\mathbf{w}) $$

and the optimal parameters w are those that minimize the loss function with respect to w. This idea is expressed mathematically as:

$$ \hat{\mathbf{w}} = \arg\min_{\mathbf{w}} S(\mathbf{w}) $$
The dataset and the objective function provide all of the information needed to define a given model. This basic formulation of the loss function is true for nearly every machine learning model that has labeled data (supervised learning), and this simple linear model provides the conceptual framework that illuminates our understanding of the entire field.
The next step is to solve for the parameters.
This approach works for any number of features, but is useful only
for the linear regression case. Figure 4-2 shows the resulting line
generated from this approach, which is described in more depth on
GitHub. Any type of model other than a linear regression requires a
process called numerical optimization, which is covered in the next
section.
Figure 4-2. The best-fitting line through the data as generated with the
normal equation
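The normal equation itself solves the linear system (X^T X) w = X^T y in a single step. A minimal numpy sketch on an invented dataset:

```python
import numpy as np

np.random.seed(0)
M = 100
x = np.random.uniform(-1, 1, M)
y = 2.0 * x + 1.0 + 0.1 * np.random.randn(M)   # ground truth: w0 = 1, w1 = 2

X = np.column_stack([np.ones(M), x])           # prepend the bias column of 1s
# solve the normal equation (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
```

With this little noise, the recovered w lands close to the ground-truth values used to generate the data.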
$$ (\nabla_{\mathbf{w}} S)_i = \frac{\partial S}{\partial w_i} $$

$$ \mathbf{w}_{k+1} = \mathbf{w}_k - \alpha \, \nabla_{\mathbf{w}} S \big|_{\mathbf{w}_k} $$
where α is adjusted so that step sizes are optimal. If the steps are too large, we risk leaping across the surface and overstepping the optimal location. If, however, the steps are too small, we may find ourselves taking too many unproductive (and costly) steps. The idea behind each optimization step is that the new parameters w_{k+1} result in a lower loss function. Looking at Figure 4-3, we see that, initially, the objective function is rapidly reduced. Later steps in the process (up to K = 100) gradually approach a convergent solution. This behavior is typical for a well-behaved optimization process. Had we chosen our α incorrectly, we might not have seen this smooth approach to a stable solution. In fact, you can adjust α yourself in the notebook and see how the behavior changes!
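The update rule above takes only a few lines of numpy. This sketch fits the same kind of synthetic linear data by gradient descent (the data and variable names are invented for illustration):

```python
import numpy as np

np.random.seed(0)
M = 100
x = np.random.uniform(-1, 1, M)
y = 2.0 * x + 1.0 + 0.1 * np.random.randn(M)    # ground truth: w0 = 1, w1 = 2
X = np.column_stack([np.ones(M), x])

alpha = 0.1                       # propagation step size
w = np.zeros(2)                   # initial guess for [w0, w1]
losses = []
for k in range(100):
    r = X @ w - y                 # residuals at step k
    grad = 2.0 * (X.T @ r) / M    # gradient, scaled by the number of samples
    w = w - alpha * grad          # incremental parameter update
    losses.append((r ** 2).mean())
```

Plotting losses reproduces the qualitative behavior of Figure 4-3: a rapid initial drop followed by a gradual approach to convergence.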
If we look at the evolution of parameters w = [w_0 w_1 w_2]^T, shown in Figure 4-4, we can see that they gradually approach the ground truth solution from the normal equation. In general, we would not have a solution like this for comparison, and we usually only look at higher-level metrics like the objective function as an indicator of the solution’s fitness. This is one reason why it is so interesting to look at the linear models, because we can make high-resolution comparisons and develop a deeper intuition about how these systems behave.
Feature Scaling
This well-behaved optimization was not only the result of adjusting the propagation step of gradient descent. It required an additional piece of preprocessing known as feature scaling. Feature scaling is not necessary in the normal equation approach, but is absolutely essential in even the simplest of numerical optimization algorithms.

Recall that we like to think of our objective function as a surface. Imagine a surface, like a golf course, where the putting green is an ellipse. Ideally, we would like to have a circular putting green, but an ellipse is okay, right? What about an ellipse that is a foot wide and a mile long? You would want to putt in one direction, but use a driver for the other direction, which would make it hard. Feature scaling has the effect of taking your putting green, no matter how distorted, and making it more symmetric, so that propagation steps are of similar scale no matter what direction you are going.
The most popular way to scale features is standardization scaling, although there are many others. The standardization scaling is given by:

$$ x'_{j,i} = \frac{x_{j,i} - \mu_i}{\sigma_i} $$

where x_{j,i} is the jth example of the ith feature. There are N features and M examples, and the primed notation x'_{j,i} indicates the scaled feature. The summary variables μ_i and σ_i are the mean and standard deviation of the ith feature across the M examples, respectively.
Without this scaling, you would find yourself adjusting propagation step sizes for each dimension, which would add N more parameters to tune. Also, with this scaling, as well as scaling α by the number of samples, we can obtain more general insights into the range of α that works across many different optimization procedures. The typically recommended range for propagation step size under these conditions is somewhere between 0 and 1 as a result.
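The standardization itself is one line per feature in numpy; a small sketch with an invented feature matrix:

```python
import numpy as np

np.random.seed(0)
# M = 200 examples of N = 3 features on wildly different scales
Xf = np.random.randn(200, 3) * np.array([1.0, 100.0, 0.01]) \
     + np.array([0.0, 50.0, -3.0])

mu = Xf.mean(axis=0)       # per-feature mean
sigma = Xf.std(axis=0)     # per-feature standard deviation
Xs = (Xf - mu) / sigma     # standardized (primed) features
```

After scaling, every feature has zero mean and unit standard deviation, so no direction of the objective surface is wildly stretched relative to the others.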
Summary
In this chapter, we have provided the basis for understanding
machine learning models. These concepts are surprisingly persistent
in even the most complex models, up to and including deep learning
models. In Chapter 5, we will build more on these concepts, and see
how they are used in classic machine learning examples.
The simplest case is a binary label, which indicates whether the data
point falls into a particular category or not. We will address how
multiple categories are treated a bit later, but first let’s think about
how a continuous function can be interpreted as a label.
In Chapter 4, we defined the linear model as:

$$ z = w_0 + \sum_{i=1}^{N} w_i x_i = \mathbf{x}\mathbf{w} $$
$$ f(z) = z $$

The logistic (sigmoid) activation function is

$$ f(z) = \frac{1}{1 + e^{-z}} $$
and is the third plot of Figure 5-1. In fact, the term activation function is not always used in the machine learning field. It is used more frequently in deep learning, and is discussed in more detail in the next chapter.
In deep learning, the activation functions are sometimes taken as
labels when used as the outputs, but are also used as intermediate
transformations. In those intermediate cases, it is sometimes more
convenient to use a function that does not have an upper maximum
limit, because the numerical optimization algorithm will perform
better in cases where the gradient does not vanish at high values of
z. The reasons for this are beyond the scope of this text, but we
present one activation function that is very popular in the deep
learning field—the rectified linear unit (ReLU) function, given as:
$$ f(z) = \max(z, 0) $$
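The identity, logistic (sigmoid), and ReLU activations can be written in numpy as follows (a small sketch for experimentation):

```python
import numpy as np

def identity(z):
    return z

def sigmoid(z):
    # squashes any real input into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # no upper limit, so the gradient does not vanish for large z
    return np.maximum(z, 0.0)
```

Plotting the three over a range of z reproduces the panels of Figure 5-1.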
We will use the column “Heart Failure” as the label and treat the
remaining columns as features. Note that some columns are
reported as categories. Fortunately, Apache Spark (as well as Pandas
and many other machine learning libraries) has automated methods
for converting fields like this into numerical values so that they can
be input into a numerical model. The input labels will be changed
into 0 or 1. The resulting prediction will be a floating-point number.
Collaborative Filtering
Recommendation systems based on the Alternating Least Squares
(ALS) algorithm have gained popularity in recent years because, in general, they perform better than content-based approaches. A recommendation system suggests items to a new user based on the known preferences of previous users. It has elements of both supervised and unsupervised learning in its formulation. Since it uses labeled data as part of the model training, however, we can think of it somewhat as a supervised learning algorithm.
ALS is a matrix factorization algorithm, where a user-item matrix is
factorized into two low-rank non-orthogonal matrices:
$$ R = U^T M $$
$$ S(U, M) = \sum_{i,j} \left( r_{ij} - u_i^T m_j \right)^2 = \lVert R - U^T M \rVert^2 $$

where each reconstructed rating is

$$ \hat{r}_{ij} = u_i^T m_j $$
for the ith user and jth movie. One way to think of this is in terms of
the movie score only as a linear model:
$$ r_{ij} = \sum_{k=1}^{K} \alpha_k m_{k,j} $$

or, with the user vector as the features:

$$ r_{ij} = \sum_{k=1}^{K} \beta_k u_{i,k} $$
Now, the features are related to the user vector and the parameters are learned weights β_k = m_{k,j}. In fact, the training process alternates between solving for the weights of the movie vector while holding the user features constant, and solving for the weights of the user vector while holding the movie features constant. The numerical solution of the objective is obtained through this alternating least squares approach.
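A toy numpy version of the alternating scheme may make this concrete. Here U is stored as users × factors so that R ≈ UM; regularization, sparsity, and missing ratings (which production implementations such as Spark's ALS handle) are omitted:

```python
import numpy as np

np.random.seed(0)
nUsers, nMovies, K = 20, 30, 3
# synthetic ratings matrix with exact rank K
R = np.random.rand(nUsers, K) @ np.random.rand(K, nMovies)

U = np.random.rand(nUsers, K)     # user factors
M = np.random.rand(K, nMovies)    # movie factors
for _ in range(20):
    # hold M fixed, solve the least squares problem for U
    U = np.linalg.lstsq(M.T, R.T, rcond=None)[0].T
    # hold U fixed, solve the least squares problem for M
    M = np.linalg.lstsq(U, R, rcond=None)[0]

err = np.linalg.norm(R - U @ M)
```

Because R is exactly rank K here, the alternating solves drive the reconstruction error toward zero within a handful of iterations.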
If this model were not so popular and powerful, it might be considered an interesting peculiarity of machine learning. It is, however, the most widely used machine learning algorithm in ecommerce, and is particularly effective for recommendation systems, whereby a list of recommended items can be generated for new users based on the preferences of similar previous users.
K Means Clustering
The process of K Means clustering is simple: choose k initial cluster centers, assign each data point to its nearest center, recompute each center as the mean of its assigned points, and repeat until the assignments no longer change.
The distances are computed as the Euclidean norm between two vectors, given as:

$$ d^2_{ij} = (\mathbf{x}_i - \mathbf{x}_j)^T (\mathbf{x}_i - \mathbf{x}_j) $$
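These steps fit in a few lines of numpy (my own illustrative sketch, not a production implementation):

```python
import numpy as np

def kmeans(X, k, nIter=20, seed=0):
    rng = np.random.RandomState(seed)
    # start from k randomly chosen data points as the initial centers
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(nIter):
        # squared Euclidean distance from every point to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)   # assign each point to its nearest center
        for j in range(k):           # recompute each center as a mean
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```

On well-separated data this converges in a few iterations; real workloads would use the scalable implementation in Spark MLlib instead.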
• It can model sparse feature sets very well, which makes it ideal
for dealing directly with word counts as feature vectors, rather
than having to generate word embedding using topics as with
Word2Vec.
• It admits multiple membership in topics, meaning every topic is
able to emit its own distribution of words, and those words can
come from multiple topics.
The notebooks for the LDA algorithm can be found on GitHub and
a display of the discovered topics is shown in the second notebook.
Figure 5-5 shows a word cloud for one of the discovered topics.
Figure 5-5. Word cloud for a topic discovered by the LDA algorithm
applied to the Yahoo Newsgroups dataset
CHAPTER 6
Advanced Machine Learning
Examples and Applications
Deep learning models can be orders of magnitude larger than the classic machine learning workflows. With models this large, it is difficult to interpret the parameters in any intuitive way, or even understand the importance of particular features. With all of these new elements to consider, then, we should think of deep learning as a completely new field of knowledge, while being mindful of its similarities to machine learning.
We will consider only three kinds of layers here: the input layer, the output layer, and the hidden layer. The input layer is the layer of neurons that receives the features and transforms them into something to feed into the network. The number of features in neural networks is typically much higher than that in machine learning. The output layer is the very last layer of neurons, and the output value is taken as the label. Each neuron typically reports a value between 0 and 1, which indicates membership in the category associated with that neuron. For the number identification example shown in Figure 6-4, there are 10 neurons in the output layer because there are 10 possible labels in the dataset (0 through 9). During training, these neurons are assigned “one hot” labels (neuron 0 has a value of 1 if the image is a 0). For the prediction, the neuron with the largest value is taken as the label.
Finally, hidden layers sit in between the input and output layers. The existence of hidden layers in a neural network is what gives the neural network depth, which is why we call these models deep learning models.
The input layer is constrained to have the number of neurons equal to the number of features, and the output layer is constrained to have the number of neurons equal to the number of categories in the labels. There are no constraints, however, on the number of hidden layers, or the number of neurons per hidden layer. In fact, it is often recommended to have as many hidden layers as is computationally feasible, with less concern given to the overfitting issues that typically arise in simpler models.
Graph Analytics
Graph analytics is not always considered a part of the machine
learning corpus. It is, however, a powerful way of understanding
intricate connections between data that are really only discoverable
within a graphical structure. Thus, as a data scientist you should
always have this in your repertoire. Fortunately, if you have already gone through the trouble of setting up your Apache Spark environment, you will have ready access to the GraphX library, providing a low barrier to entry.
Figure 6-6. Clickstream graph of Wikipedia entries related to IBM
Watson
Summary
In this final chapter, we have covered some additional examples in
the field of machine learning that will be very helpful for you to
understand. There are many more examples in deep learning and
graph theory that you can run using the frameworks that are in
place, and you are encouraged to try them out. The frameworks that
we have installed for this book are very robust and well understood,
and you should be able to grow your understanding of the field with
these technologies as a starting point. Happy learning!