Machine Learning: The Basics
!! ROUGH DRAFT !!
Alexander Jung
January 7, 2021
Figure 1: Machine learning implements the scientific principle of “trial and error”.
Machine learning continuously validates and refines a hypothesis based on a model about a
phenomenon that generates observable data.
Preface
Machine learning (ML) has become a commodity in our everyday lives. We routinely ask ML-empowered
smartphones to suggest lovely food places or to guide us through an unfamiliar place.
ML methods have also become standard tools in many fields of science and engineering. A
plethora of ML applications transform human lives at unprecedented pace and scale.
This book portrays ML as the combination of three basic components: data, model
and loss. ML methods combine these three components within computationally efficient
implementations of the basic scientific principle “trial and error”. This principle consists of
the continual adaptation of a hypothesis about a phenomenon that generates data.
ML methods use a hypothesis to compute predictions for future events. ML methods
choose or learn a hypothesis from a (typically very) large set of candidate hypotheses. We
refer to this set of candidates as the model of an ML method.
The adaptation or improvement of the hypothesis is based on the discrepancy between
predictions and observed data. ML methods use a loss function to quantify this discrepancy.
A plethora of different ML methods is obtained by combining different design choices
for the data representation, model and loss. ML methods also differ vastly in their actual
implementations which might obscure their unifying basic principles.
Deep learning methods use cloud computing frameworks to train large models on huge
datasets. Operating on a much finer granularity for data and computation, linear least
squares regression can be implemented on small embedded systems. Nevertheless, deep
learning methods and linear regression use the same principle of iteratively updating a model
based on the discrepancy between model predictions and actual observed data.
The three-component picture of ML championed in this book allows a unified treatment
of a wide range of concepts and techniques which seem quite unrelated at first sight. At
a low level, we discuss the regularization effect of early stopping in terms of adjusting the
effective model space. At a higher level, we can interpret privacy-preserving and explainable
ML as particular design choices for the model, data and loss.
To make good use of ML tools it is instrumental to understand their underlying principles
at different levels of detail. At a lower level, this tutorial helps ML engineers to choose
suitable methods for the application at hand. The book also provides leaders with a higher-level
view on the development of ML, which is required to manage an ML or data analysis team. We
believe that thinking about ML as combinations of data, model and loss helps to navigate
the steadily growing supply of ready-to-use ML methods.
Acknowledgement
This tutorial is based on lecture notes prepared for the courses CS-E3210 “Machine Learning:
Basic Principles”, CS-E4800 “Artificial Intelligence”, CS-EJ3211 “Machine Learning with
Python”, CS-EJ3311 “Deep Learning with Python” and CS-C3240 “Machine Learning”
offered at Aalto University and within the Finnish university network fitech.io. This
tutorial is accompanied by practical implementations of ML methods in MATLAB and
Python, available at https://github.com/alexjungaalto/.
This text benefited from the feedback of numerous students within the courses that
have been (co-)taught by the author. The author is indebted to Shamsiiat Abdurakhmanova,
Tomi Janhunen, Yu Tian, Natalia Vesselinova, Ekaterina Voskoboinik, Buse Atli, Stefan
Mojsilovic for carefully reviewing early drafts of this tutorial. Some of the figures have been
generated with the help of Eric Bach. The author is grateful for the feedback received from
Jukka Suomela, Oleg Vlasovetc, Georgios Karakasidis, Joni Pääkkö, Harri Wallenius and
Satu Korhonen.
Contents
1 Introduction 9
1.1 Relation to Other Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1.1 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.1.3 Theoretical Computer Science . . . . . . . . . . . . . . . . . . . . . . 14
1.1.4 Communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.1.5 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.1.6 Artificial Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.2 Flavours of Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.3 Organization of this Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.6 Approximate Non-Linear Maps Using Indicator Functions for Feature
Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.5.7 Python Hypothesis Space . . . . . . . . . . . . . . . . . . . . . . . . 47
2.5.8 A Lot of Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.5.9 Over parametrization . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.5.10 Squared Error Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.5.11 Classification Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.5.12 Intercept Term . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.5.13 Picture Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.5.14 Maximum Hypothesis Space . . . . . . . . . . . . . . . . . . . . . . . 48
2.5.15 A Large but Finite Hypothesis Space . . . . . . . . . . . . . . . . . . 48
2.5.16 Size of Linear Hypothesis Space . . . . . . . . . . . . . . . . . . . . . 49
3 Some Examples 50
3.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2 Polynomial Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3 Least Absolute Deviation Regression . . . . . . . . . . . . . . . . . . . . . . 53
3.4 The Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5 Gaussian Basis Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.6 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.7 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.8 Bayes’ Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.9 Kernel Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.10 Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.11 Artificial Neural Networks – Deep Learning . . . . . . . . . . . . . . . . . . . 64
3.12 Maximum Likelihood Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.13 k-Nearest Neighbours . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.14 Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.15 Clustering Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.16 Deep Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.17 LinUCB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.18 Network Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.19 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.19.1 How Many Neurons? . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.19.2 Linear Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.19.3 Data Dependent Hypothesis Space . . . . . . . . . . . . . . . . . . . 73
7 Regularization 114
7.1 Regularized ERM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.2 Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.3 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.4 Regularized Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.5 Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.6 Multitask Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.7.1 Ridge Regression as Quadratic Form . . . . . . . . . . . . . . . . . . 122
8 Clustering 123
8.1 Hard Clustering with K-Means . . . . . . . . . . . . . . . . . . . . . . . . . 125
8.2 Soft Clustering with Gaussian Mixture Models . . . . . . . . . . . . . . . . . 130
8.3 Density Based Clustering with DBSCAN . . . . . . . . . . . . . . . . . . . . 134
8.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
8.4.1 Image Compression with k-means . . . . . . . . . . . . . . . . . . . . 135
8.4.2 Compression with k-means . . . . . . . . . . . . . . . . . . . . . . . . 135
10 Privacy-Preserving ML 144
10.1 Privacy-Preserving Feature Learning (Operating on level of individual data
points) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
10.1.1 Privacy-Preserving Information Bottleneck . . . . . . . . . . . . . . . 145
10.1.2 Privacy-Preserving Feature Selection . . . . . . . . . . . . . . . . . . 145
10.1.3 Privacy-Preserving Random Projections . . . . . . . . . . . . . . . . 145
10.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
10.2.1 Where are you? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
10.3 Federated Learning (Operates on level of local datasets) . . . . . . . . . . . . 146
11 Explainable ML 147
11.1 A Model Agnostic Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
11.2 Explainable Empirical Risk Minimization . . . . . . . . . . . . . . . . . . . . 148
13 Glossary 150
Chapter 1
Introduction
Consider waking up some morning during winter in Finland and looking outside the window
(see Figure 1.1). It seems to be turning into a nice sunny day, ideal for a ski trip. To
choose the right gear (clothing, wax) it is vital to have some idea of the maximum daytime
temperature, which is typically reached around early afternoon. If we expect a maximum
daytime temperature of around plus 10 degrees, we might not put on the extra warm jacket
but rather pack only a spare shirt.
Figure 1.1: Looking outside the window during the morning of a winter day in Finland.
How can we predict the maximum daytime temperature for the specific day depicted in
Figure 1.1? Let us now show how this can be done via ML. In a nutshell, ML methods are
computational implementations of a simple (scientific) principle.
This principle contains three components: data, a model and a loss function. Any ML
method, including linear regression and deep reinforcement learning, combines these three
components.
We illustrate the (rather abstract) concepts behind the main components of ML with the
above problem of predicting the maximum daytime temperature during some day in Finland
(see Figure 1.1). The prediction shall be based solely on the minimum daytime temperature
observed in the morning of that day.
The Finnish Meteorological Institute (FMI) offers data on historic weather observations.
We can download historic recordings of minimum and maximum daytime temperature recorded
by some FMI weather station. Let us denote the resulting dataset by D = {(x^(1), y^(1)), . . . , (x^(m), y^(m))}, where x^(i) and y^(i) denote the minimum and maximum daytime temperature recorded for the i-th day.
Figure 1.2: Each dot represents a day that is characterized by its minimum daytime
temperature x and its maximum daytime temperature y measured at a weather station
in Finland.
ML methods allow us to learn a predictor map h(x), reading in the minimum temperature
x and delivering a prediction (forecast or approximation) ŷ = h(x) for the actual maximum
daytime temperature y. We base this prediction on a simple hypothesis for how the minimum
and maximum daytime temperature during some day are related. We assume that they are
related approximately by
y ≈ w1 x + w0 with w1 ≥ 0. (1.2)
This hypothesis reflects the intuition that the maximum daytime temperature y should be
higher for days with a higher minimum daytime temperature x.
Given our initial hypothesis (1.2), it seems reasonable to restrict the ML method to only
consider linear predictor maps
h^(w)(x) = w1 x + w0 , with some weights w1 , w0 . (1.3)

Figure 1.3: A linear predictor map h^(w)(x) plotted against the feature x.
ML would be trivial if there were only one single hypothesis. Having only a single hypothesis
means that there is no need to try out different hypotheses to find the best one. To enable
ML, we need to choose between a whole space of different hypotheses. ML methods are
computationally efficient methods to choose (learn) a good hypothesis out of (typically very
large) hypothesis spaces. The hypothesis space constituted by the maps (1.3) for different
weights is uncountably infinite.
To find, or learn, a good hypothesis out of the infinite set (1.3), we need to somehow
assess the quality of a particular hypothesis map. ML methods use data and a loss function
for this purpose.
A loss function is a measure for the difference between the actual data and the predictions
obtained from a hypothesis map (see Figure 1.4). One widely-used example of a loss function
is the squared error loss (y −h(x))2 . Using this loss function, ML methods learn a hypothesis
map out of the model (1.3) by tuning w1 , w0 to minimize the average loss
(1/m) ∑_{i=1}^{m} ( y^(i) − h(x^(i)) )^2 .
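As a concrete sketch of these steps, the snippet below fits the weights w1, w0 by minimizing the average squared error loss on a handful of made-up temperature recordings (the numbers below are stand-ins for illustration, not actual FMI data).

```python
import numpy as np

# Made-up stand-ins for FMI recordings: minimum (x) and maximum (y)
# daytime temperatures, in degrees Celsius, for m = 5 days.
x = np.array([-10.0, -5.0, 0.0, 3.0, 8.0])
y = np.array([-5.0, -1.0, 4.0, 8.0, 14.0])

# Learn the hypothesis h(x) = w1*x + w0 by minimizing the average
# squared error loss (1/m) * sum_i (y^(i) - h(x^(i)))**2; np.polyfit
# solves this least-squares problem in closed form.
w1, w0 = np.polyfit(x, y, deg=1)

# Predict the maximum daytime temperature for a morning minimum of -2 C.
y_hat = w1 * (-2.0) + w0
print(w1, w0, y_hat)
```

The learned slope w1 comes out positive, consistent with the intuition behind hypothesis (1.2).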
Figure 1.4: Dots represent days, each characterized by its minimum daytime temperature x and
its maximum daytime temperature y. We also depict a straight line representing a linear
predictor map. ML methods learn a predictor map with minimum discrepancy between
predictor map and data points.
The above weather prediction is prototypical for many other ML applications. Figure
1 illustrates the typical workflow of an ML method. Starting from some initial guess, ML
methods repeatedly improve their current hypothesis based on (new) observed data.
Using the current hypothesis, ML methods make predictions or forecasts about future
observations. The discrepancy between the predictions and the actual observations, as
measured using some loss function, is used to improve the hypothesis. Learning happens
by improving the current hypothesis based on the discrepancy between its predictions
and the actual observations.
ML methods must start with some initial guess or choice for a good hypothesis. This
initial guess can be based on some prior knowledge or domain expertise [50]. While the
initial guess for a hypothesis might not be made explicit in some ML methods, each method
must use such an initial guess. In our weather prediction application discussed above, we
used the approximate linear model (1.2) as the initial hypothesis.
principal component analysis, are deeply rooted in the theory of linear algebra (see Sections
3.1 and 9.2).
1.1.2 Optimization
A main design principle for ML methods is to formulate learning tasks as optimization
problems [64]. The weather prediction problem above can be formulated as the problem of
optimizing (minimizing) the prediction error for the maximum daytime temperature. ML
methods are then obtained by applying optimization methods to these learning problems.
The statistical and computational properties of such ML methods can be studied using
tools from the theory of optimization. What sets the optimization problems arising in
ML apart from “standard” optimization problems is that we do not have full access to the
objective function to be minimized. Section 4 discusses methods that are based on estimating
the correct objective function by empirical averages that are computed over subsets of data
points (the training set).
1.1.4 Communication
We can interpret ML as a particular form of data processing. An ML algorithm is fed with
observed data in order to adjust some model and, in turn, compute a prediction of some
future event. Thus, ML involves transferring or communicating data to some computer
which executes a ML algorithm.
The design of efficient ML systems also involves the design of efficient communication
between data source and ML algorithm. The learning progress of an ML method will be
slowed down if it cannot be fed with data at a sufficiently large rate. Given limited memory
or storage capacity, being too slow to process data at their rate of arrival (in real-time) means
that we need to “throw away” data. The lost data might have carried relevant information
for the ML task at hand.
1.1.5 Statistics
Consider the data points depicted in Figure 1.2. Each data point represents some previous
day. Each data point (day) is characterized by the minimum and maximum daytime temperature
as measured by some weather observation station. It might be useful to interpret these data
points as independent and identically distributed (i.i.d.) realizations of a random vector
z = (x, y)^T . The random vector z is distributed according to some fixed but typically
unknown probability distribution p(z). Figure 1.5 extends the scatter plot of Figure 1.2
with some contour line that indicates the probability distribution p(z).
Probability theory offers a great selection of methods for estimating the probability
distribution from observed data (see Section 3.12). Given (an estimate of) the probability
distribution p(z), we can compute estimates for the label of a data point based on its features.
Having a probability distribution p(z) for a randomly drawn data point z = (x, y) allows
us to compute not only a single prediction (point estimate) ŷ of the label y but rather an
entire probability distribution q(ŷ) over all possible prediction values ŷ.
The distribution q(ŷ) represents, for each value ŷ, how likely it is that
this is the true label value of the data point. By its very definition, this distribution q(ŷ) is
precisely the conditional probability distribution p(y|x) of the label value y, given the feature
value x of a randomly drawn data point z = (x, y) ∼ p(z).
Having an (estimate of the) probability distribution p(z) for the observed data points not
only allows us to compute predictions but also to generate new data points. Indeed, we can
artificially augment the available data by randomly drawing new data points according to the
probability distribution p(z) (see Section 7.3). A recently popularized class of ML methods
that use probabilistic models to generate synthetic data is known as generative adversarial
networks [27].
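As a much simpler illustration of this idea than a GAN, the sketch below fits a Gaussian probabilistic model to a handful of made-up (x, y) temperature pairs and draws synthetic data points from it.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up data points z = (x, y): minimum and maximum daytime temperatures.
data = np.array([[-10.0, -5.0], [-5.0, -1.0], [0.0, 4.0],
                 [3.0, 8.0], [8.0, 14.0]])

# Estimate a Gaussian model p(z) via the sample mean and covariance.
mean = data.mean(axis=0)
cov = np.cov(data, rowvar=False)

# Augment the dataset by drawing new data points from the fitted p(z).
synthetic = rng.multivariate_normal(mean, cov, size=100)
print(synthetic.shape)  # (100, 2)
```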
Figure 1.5: A scatterplot where each dot represents some day that is characterized by its
minimum daytime temperature x and its maximum daytime temperature y.
• a forest fire management system: perceptions given by satellite images and local
observations using sensors or “crowd sensing” via some mobile application which allows
humans to notify about relevant events; actions amount to issuing warnings and bans
of open fire; return is the reduction in the number of forest fires.
• a severe weather warning service: perceptions given by weather radar; actions are
preventive measures taken by farmers or power grid operators; return is measured by
savings in damage costs (see https://www.munichre.com/)
• an automated benefit application system for a social insurance institute (like “Kela”
in Finland): perceptions given by information about application and applicant; actions
are either to accept or to reject the application along with a justification for the
decision; return is measured in reduction of processing time (applicants tend to prefer
getting decisions quickly)
• a personal diet assistant: perceived environment is the food preferences of the app
user and their health condition; actions amount to personalized suggestions for healthy
and yummy food; return is the increase in well-being or the reduction in public spending
for health-care.
• the cleaning robot Rumba (see Figure 1.6) perceives its environment using different
sensors (distance sensors, on-board camera); actions amount to choosing different
moving directions (“north”, “south”, “east”, “west”); return might be the amount
of cleaned floor area within a particular time period.
Figure 1.6: A cleaning robot chooses actions (moving directions) to maximize a long-term
reward measured by the amount of cleaned floor area per day.
that are predicted as optimal according to some hypothesis which could be obtained by ML
methods.
What sets AI methods apart from other ML methods is that they must compute predictions
in real-time while collecting data and choosing the next action. Consider an AI system
that steers a toy car. At any given state (point in time), the resulting prediction immediately
influences the features of the following data points.
Consider data points to represent the different states of a toy car. For such data points
we could define their labels as the optimal steering angle for these states. However, it might
be very challenging to obtain accurate label values for any of these data points. Instead,
we could evaluate the usefulness of a particular steering angle only in an indirect fashion by
using a reward signal. For the toy car example, we might obtain a reward from a distance
sensor that indicates if the car reduces the distance to some goal or target location.
out different choices for the map.
The basic idea of supervised ML methods, as illustrated in Figure 1.7, is to fit a curve
(representing the predictor map) to data points obtained from historic data (see Chapter 4).
While this sounds like a simple task, the challenge of modern ML applications is the sheer
amount of data points.
ML methods must process billions of data points with each single data point characterized
by a potentially vast number of features. Consider data points representing social network
users, whose features include all media that has been posted (videos, images, text).
Besides the size of datasets, another computational challenge for modern ML methods
is that they must be able to fit highly non-linear predictor maps. Deep learning methods
address this challenge by using a computationally convenient representation of non-linear
maps via artificial neural networks [26].
Figure 1.7: Supervised ML methods fit a curve to (a huge number of) data points.
Unsupervised Learning. Some ML applications do not need the concept of labels but
require only to understand the intrinsic structure of data points. We refer to such applications
as unsupervised ML. One example of an intrinsic structure is when the data points can be
grouped into a few coherent subsets or clusters (see Chapter 8). Another example of such an
intrinsic structure is when the data points are localized around a low-dimensional subspace
(see Chapter 9). Unsupervised ML methods allow us to determine such an intrinsic structure.
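To make the notion of grouping data points into clusters concrete, here is a minimal sketch of the k-means idea from Chapter 8, applied to made-up two-dimensional data points.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two well-separated groups of 2-D data points (made up for illustration).
pts = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
                 rng.normal(5.0, 0.5, size=(20, 2))])

# Minimal k-means with k = 2: alternate between assigning each data point
# to its nearest centroid and recomputing the centroids.
centroids = pts[[0, -1]].astype(float)   # crude initialization
for _ in range(100):
    dists = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    new_centroids = np.array([pts[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)  # cluster index (0 or 1) for each data point
```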
Reinforcement Learning. Another main flavour of ML considers data points that are
characterized by labels which cannot be determined easily beforehand. Reinforcement
learning (RL) studies applications where the label values can only be determined in an indirect
fashion. Consider the problem of choosing the optimum steering direction for a car based
on the snapshot of an on-board camera. Data points represent a particular state of the car;
its label is the optimum steering direction.
It is typically impossible to get labelled data points here since there are so many different
driving scenarios, each of which has a different optimal steering direction. Instead, RL methods
use some predictor of the optimal steering direction and then evaluate the quality of this
prediction by some other sensor signals, e.g., which determine if the car stays in the lane.
Two main challenges for the widespread use of ML techniques in critical application
domains are privacy preservation and explainability. Chapters 10 and 11 will discuss recent
approaches to solve these challenges. We will see that the concepts developed in Chapter 9
for feature learning will be perfect tools for privacy-preserving and explainable ML.
Prerequisites. We assume some familiarity with basic concepts of linear algebra, real
analysis, and probability theory. For a review of those concepts, we recommend [26, Chapter
2-4] and the references therein.
Notation. We mainly follow the notational conventions used in [26]. Boldface upper
case letters such as A, X, . . . denote matrices. Boldface lower case letters such as y, x, . . .
denote vectors. The generalized identity matrix In×r ∈ {0, 1}n×r is a diagonal matrix with
ones on the main diagonal. The Euclidean norm of a vector x = (x1 , . . . , xn )^T is denoted
∥x∥ = √( ∑_{r=1}^{n} x_r^2 ).
Chapter 2
Figure 2.1: ML methods fit a model to data via minimizing a loss function.
• data as collections of data points characterized by features (see Section 2.1.1) and
labels (see Section 2.1.2)
• a model or hypothesis space that consists of computationally feasible hypothesis maps
• a loss function (see Section 2.3) to measure the quality of a predictor (or classifier).
data, which hypothesis space or model to use and with which loss function to measure the
quality of a hypothesis. Once the ML problem is formally defined, we can readily apply
off-the-shelf ML methods to solve it.
Similar to ML problems (or applications) we also think of ML methods as specific
combinations of the three above components. We detail in Chapter 3 how some of the most
popular ML methods, such as linear regression and deep learning methods, are obtained by
specific design choices for the three components.
Linear regression is an ML method which uses linear maps for the hypothesis space and the
squared error loss function. Deep learning methods are characterized by using artificial neural
networks to represent hypothesis spaces constituted by highly non-linear predictor maps. The
remainder of this chapter discusses in some depth each of the three main components of ML.
We use the concept of datapoints in a highly abstract and therefore very flexible manner.
Data points can represent very different types of objects. For an image processing application
it might be useful to define datapoints as images.
A recommendation system might use data points to represent customers. Data points
might represent time periods, animals, mountain hikes, proteins or humans. The meaning
or definition of what data points represent is nothing but a design choice.
One practical requirement for a useful definition of data points is that we should have
access to many of them. ML methods typically rely on constructing estimates for quantities
of interest by averaging over data points. These estimates are often more accurate the more
data points are used for the averaging.
A key parameter of a dataset is the number m of individual datapoints it contains.
Statistically, the larger the sample size m the better. However, there might be restrictions
on computational resources that limit the maximum sample size m that can be processed.
In general it is impossible to have full access to every single microscopic property of a
data point. Consider a data point that represents a vaccine. A full characterization of such
a data point would require specifying its chemical composition down to the level of molecules
and atoms. Moreover, there are properties of a vaccine that depend on the patient who
received the vaccine.
It is useful to distinguish between two different groups of properties of a data point. The
first group of properties is referred to as features and the second group of properties is
referred to as labels (also called “targets” or “outputs”). This distinction is somewhat blurry. The
same property of a data point might be used as a feature in one application, while it might
be used as a label in another application.
As an example consider feature learning for data points representing images. One
approach to learn representative features of an image is to use some of the image pixels
as the label or target pixels. We can then learn new features by learning a feature map that
allows us to predict the target pixels.
2.1.1 Features
Similar to the definition of data points, the choice of which properties to use as
features is a design choice. We typically use as features any quantity that can be computed
or measured easily. Note that this is a highly informal characterization since there is no
formal measure for the difficulty of measuring a specific property.
If we develop a ML method that can use snapshots taken by a digital camera, then these
snapshots might be a useful choice for the features. However, if we only have a thermometer
at our disposal then we might only use the measured temperature as the feature. In what
follows we will denote the total number of features used to describe a data point by the letter
n.
The ability of ML methods has been boosted by modern information technology, which
allows us to measure a huge number of properties of data points in many application domains.
Consider a data point representing the book author “Alex Jung”. Alex uses a smartphone
to take snapshots.
Let us assume that Alex takes five snapshots per day on average (sometimes more,
e.g., during a mountain hike). This results in more than 1000 snapshots per year. Each
snapshot contains around 10^6 pixels. If we only use the greyscale levels of the pixels in
all those snapshots, we would obtain more than 10^9 new features per year! Modern ML
applications face extremely high-dimensional feature vectors which calls for methods from
high-dimensional statistics [13, 71].
At first sight it might seem that “the more features the better” since using more features
might convey more relevant information to achieve the overall goal. However, as we discuss
in Chapter 7, it can actually be detrimental for the performance of ML methods to use an
excessive amount of (irrelevant) features.
Using too many irrelevant features might overwhelm or jam ML algorithms which should
invest their computational resources mainly in the processing of the most relevant features. It
is difficult to give a precise characterization of the maximum number of features that should
be used. Some guidance is offered by the condition n/m ≫ 1, which requires the number of
features to be much larger than the number of data points available to an ML algorithm. In
this high-dimensional regime, there is a high risk of overwhelming ML algorithms by having
too many irrelevant features. To avoid this we could apply some feature selection or model
regularization techniques (see Chapter 9 and Chapter 7).
Choosing “good” features of the datapoints arising within a given ML application is far
from trivial and might be the most difficult task within the overall ML application. The
family of ML methods known as kernel methods [43] is based on constructing efficient
features by applying high-dimensional feature maps.
A recent breakthrough achieved by modern ML methods, which are known as deep
learning methods (see Section 3.11), is their ability to automatically learn good features
without requiring too much manual engineering (“tuning”) [26]. We will discuss the very
basic ideas behind such feature learning methods in Chapter 9 but for now assume the
task of selecting good features is already solved.
A datapoint is typically characterized by many individual features x1 , . . . , xn . It is
convenient to stack the individual features into a single feature vector
x = (x1 , . . . , xn )^T ∈ R^n .
Each datapoint is then characterized by such a feature vector x. The set of all possible
values that the feature vector can take on is sometimes referred to as feature space, which
we denote as X . Note that we allow the feature space to be finite. This can be useful for
network-structured datasets where the data points can be compared with each other by
some application-specific notion of similarity [37, 36, 3, 35]. These approaches use as a feature
space the node set of an “empirical graph” whose nodes represent individual datapoints. The
edges in the empirical graph encode similarities between individual datapoints.
The feature space X is a design choice for the ML engineer facing a particular ML
application and computational infrastructure. If the computational infrastructure allows for
efficient numerical linear algebra, then using X = Rn might be a good choice. In general,
to obtain computationally efficient ML methods one typically uses feature spaces X with a
rich mathematical structure.
The Euclidean space Rn is a prime example of a feature space with a rich geometric and
algebraic structure [60]. The algebraic structure of Rn is defined by linear algebra of vector
addition and multiplication with scalars. A geometric structure is obtained by defining
distances between two elements of Rn via the Euclidean norm. The interplay between these
two structures allows us then to efficiently search over subsets of Rn to find an element that
is closest to some other given element of Rn .
Throughout this book we will mainly use feature spaces X ⊆ Rn which are subsets of
the Euclidean space Rn with some fixed dimension n. Using RGB intensities (modelled as
real numbers) of the pixels within a (rather small) 512 × 512 pixel bitmap, we end up with
a feature space X = R^n of (rather large) dimension n = 3 · 512^2. Indeed, for each of the
512 × 512 pixels we obtain 3 numbers which encode the red, green and blue colour intensity
of the respective pixel (see Figure 2.3).
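The stacking of pixel intensities into a feature vector can be sketched as follows; the pixel data below are hypothetical (random numbers in place of a real snapshot):

```python
import numpy as np

# A minimal sketch (hypothetical pixel data): stacking the RGB
# intensities of a 512 x 512 bitmap into a single feature vector.
rng = np.random.default_rng(0)
image = rng.random((512, 512, 3))   # one R, G, B intensity per pixel

x = image.reshape(-1)               # feature vector x in R^n

print(x.shape)                      # (786432,), i.e. n = 3 * 512^2
```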
Consider data points representing images. A natural construction for the feature vector
of such data points is to stack the red, green and blue intensities for all image pixels (see
Figure 2.3). For other types of data points it is less obvious how to represent the datapoints
by a numeric feature vector in Rn . Feature learning methods are ML methods that aim
at automatically determining useful feature vectors. For natural language processing, some
successful feature learning methods have been proposed recently [49].
Figure 2.3: If the snapshot z^{(i)} is stored as a 512 × 512 RGB bitmap, we could use as features
x^{(i)} ∈ R^n the red, green and blue components of each pixel in the snapshot. The length of
the feature vector would then be n = 3 · 512 · 512 ≈ 786000.
2.1.2 Labels
Besides the features of a data point, there are other properties of a data point that represent
some higher-level information or “quantity of interest” associated with the data point. We
refer to the higher level information, or quantity of interest, associated with a data point as
its label (or “output” or “target”). In contrast to features, determining the value of labels
is more difficult to automate. Many ML methods revolve around finding efficient ways to
determine the label of a data point given its features.
As already mentioned above, the distinction between data point properties that are features
and those that are labels is blurry. Roughly speaking, labels are properties of data points that might
only be determined with the help of human experts. For a data point representing a human,
we could define the label y as an indicator of whether the person has the flu (y = 1) or not (y = 0). This
label value can typically only be determined by a physician. However, in another application
we might have enough resources to determine the flu status of any person of interest and
could then use it as a feature that characterizes a person.
Consider a data point that represents some hike, at the start of which the snapshot in
Figure 2.2 has been taken. The features of this data point could be the red, green and blue
intensities of each pixel in the snapshot in Figure 2.2. We can stack these values into a vector
x ∈ Rn whose length n is given by three times the number of pixels in the image. The label
y associated with this data point could be the expected hiking time to reach the mountain
in the snapshot. Alternatively, we could define the label y as the water temperature of the
lake visible in the snapshot.
The label space Y of an ML problem contains all possible label values of data points.
For the choice Y = R, we refer to the ML problem as a regression problem. It is
also common to refer to ML problems involving a discrete (finite or countably infinite) label
space as classification problems.
ML problems with only two different label values are referred to as binary classification
problems. Examples of classification problems are: detecting the presence of a tumour
in a tissue, classifying persons according to their age group or detecting the current floor
conditions ( “grass”, “tiles” or “soil”) for a mower robot.
A data point is called labeled if, besides its features x, the value of its label y is known. The
acquisition of labeled data points typically involves human labour, such as handling a water
thermometer at certain locations in a lake. In other applications, acquiring labels might
require sending out a team of marine biologists to the Baltic Sea [63], running a particle
physics experiment at the European Organization for Nuclear Research (CERN) [14], or running
animal testing in pharmacology [23].
There are also online marketplaces for a human labelling workforce [51]. In these marketplaces,
one can upload data points, such as images, and then pay some money to humans
who label the data points, such as marking images that show a cat.
Many applications involve data points whose features can be determined easily, but whose
labels are known for few data points only. Labeled data is a scarce resource. Some of the most
successful ML methods have been devised in application domains where label information
can be acquired easily [29]. ML methods for speech recognition and machine translation can
make use of massive labeled datasets that are freely available [40].
In the extreme case, we do not know the label of any single data point. Even in the
absence of any labeled data, ML methods can be useful for extracting relevant information
out of the features only. We refer to ML methods which do not require any labeled data points
as unsupervised ML methods. We discuss some of the most important unsupervised ML
methods in Chapter 8 and Chapter 9.
As discussed next, many ML methods aim at constructing (or finding) a “good” predictor
h : X → Y which takes the features x ∈ X of a data point as its input and outputs a predicted
label (or output, or target) ŷ = h(x) ∈ Y. A good predictor should be such that ŷ ≈ y, i.e.,
the predicted label ŷ is close (with small error ŷ − y) to the true underlying label y.
2.1.3 Scatterplot
Consider datapoints characterized by a single numeric feature x and label y. To get more
insight into the relation between feature and label, it can be instructive to generate a scatter
plot as shown in Figure 1.2. A scatter plot depicts the data points z(i) = (x(i) , y (i) ) in a
two-dimensional plane with the axes representing the values of feature x and label y.
A visual inspection of a scatterplot might suggest potential relationships between feature
x and label y. From Figure 1.2, it seems that there might be a relation between feature x and
label y since datapoints with larger x tend to have larger y. This makes sense since having
a larger minimum daytime temperature typically implies also a larger maximum daytime
temperature.
We can obtain scatter plots for data points with more than two features using feature
learning methods (see Chapter 9). These methods transform high-dimensional data
points, having billions of raw features, into two or three new features. These new features can
then be used as the coordinates of the data point in a scatter plot.
These parameters can be estimated using the sample mean (average) and sample variance,

μ̂_x := (1/m) ∑_{i=1}^{m} x^{(i)}, and σ̂_x^2 := (1/m) ∑_{i=1}^{m} ( x^{(i)} − μ̂_x )^2.   (2.1)

A widely used estimator for the square root of the variance is the (sample) standard deviation

ŝ_x := √( (1/(m−1)) ∑_{i=1}^{m} ( x^{(i)} − μ̂_x )^2 ).
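The estimators in (2.1) can be sketched in a few lines of plain Python; the m = 4 sample values of the feature x are hypothetical:

```python
import math

# Hypothetical sample of m = 4 values of a single numeric feature x.
x = [2.0, 3.0, 5.0, 10.0]
m = len(x)

mu_hat = sum(x) / m                                   # sample mean
var_hat = sum((x_i - mu_hat) ** 2 for x_i in x) / m   # sample variance

# Sample standard deviation (note the normalization by m - 1).
s_hat = math.sqrt(sum((x_i - mu_hat) ** 2 for x_i in x) / (m - 1))

print(mu_hat, var_hat)  # 5.0 9.5
```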
The informal goal (2.2) needs to be made precise in two aspects. First, we need to quantify
the approximation error (2.2) incurred by a given hypothesis map h. Second, we need to
make precise what we actually mean by requiring (2.2) to hold for “any data point”. We
solve the first issue by the concept of a loss function in Section 2.3. The second issue is then
solved in Chapter 4.
The main goal of ML is to learn a good hypothesis h from data. Given a good hypothesis
map h, such that (2.2) is satisfied, ML methods use it to predict the label of any data point.
The prediction ŷ = h(x) is obtained by evaluating the hypothesis for the features x of a data
point. We will use the term predictor map for the hypothesis map to highlight its use for
computing predictions.
If the label space Y is finite, such as Y = {−1, 1}, we refer to a hypothesis also as
a classifier. For a finite label space Y and feature space X = Rn , we can characterize
a particular classifier map h using its decision boundary. The decision boundary of a
classifier h is the set of boundary points between its different decision regions. The
decision region R_ŷ contains all feature vectors x ∈ X which are mapped to the same
label value ŷ ∈ Y.
In principle, ML methods could use any possible map h : X → Y to predict the label
Figure 2.4: A predictor (hypothesis) h maps features x ∈ X , of an on-board camera snapshot,
to the prediction ŷ = h(x) ∈ Y for the coordinate of the current location of a cleaning robot.
ML methods use data to learn predictors h such that ŷ ≈ y (with true label y).
y ∈ Y via computing ŷ = h(x). However, any ML method has only limited computational
resources and therefore can only make use of a subset of all possible predictor maps.
This subset of computationally feasible (“affordable”) predictor maps is referred to as the
hypothesis space or model underlying a ML method.
The largest possible hypothesis space H is the set Y^X constituted by all maps from the
feature space X to the label space Y. The elements of Y^X are all the maps h : X → Y.
The hypothesis space H = Y^X is rarely used in practice since it is simply too large to
be searched with a reasonable amount of computational resources. ML methods typically use a
hypothesis space H that is a very small subset of Y^X (see Figure 2.8).
The preference for a particular hypothesis space often depends on the computational
infrastructure available to an ML method. Different computational infrastructures favour
different hypothesis spaces. ML methods implemented on a small embedded system might
prefer a linear hypothesis space, which results in algorithms that require a small number of
arithmetic operations. Deep learning methods implemented in a cloud computing environment
typically use much larger hypothesis spaces obtained from deep neural networks.
For the computational infrastructure provided by a spreadsheet program, we might
use a hypothesis space constituted by maps h : X → Y which can be implemented easily
by a spreadsheet (see Table 2.1). If we instead use the programming language Python to
implement an ML method, we can obtain a hypothesis space by collecting all possible Python
subroutines with one input (scalar feature x), one output argument (predicted label ŷ) and
fewer than 100 lines of code.
If the computational infrastructure allows for efficient numerical linear algebra and the
feature space is the Euclidean space R^n, a popular choice for the hypothesis space is

H^{(n)} := { h^{(w)} : R^n → R : h^{(w)}(x) = w^T x with some weight vector w ∈ R^n }.   (2.4)

The hypothesis space (2.4) is constituted by the linear maps (functions) h^{(w)} : R^n →
R. The function h^{(w)} maps the feature vector x ∈ R^n to the predicted label (or output)
h^{(w)}(x) = x^T w ∈ R. For n = 1 the feature vector reduces to a single feature x, and the
hypothesis space (2.4) consists of all maps h^{(w)}(x) = wx with some weight w ∈ R (see Figure
2.6).
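Evaluating members of the linear hypothesis space (2.4) can be sketched as follows; the weight vectors and feature vector below are arbitrary hypothetical examples:

```python
# A minimal sketch (hypothetical weights and features): evaluating
# members h^(w) of the linear hypothesis space (2.4).

def h(w, x):
    """Linear hypothesis h^(w)(x) = w^T x."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))

x = [1.0, 2.0, 3.0]                # feature vector with n = 3 features
print(h([1.0, 0.0, 0.0], x))       # picks out the first feature: 1.0
print(h([0.5, 0.5, 0.5], x))       # weighted sum: 3.0

# For n = 1, each hypothesis reduces to h^(w)(x) = w * x:
print(h([0.7], [2.0]))             # 1.4
```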
Figure 2.5: A predictor (hypothesis) h : X → Y takes the feature vector x(t) ∈ X (e.g.,
representing the snapshot taken by Rumba at time t) as input and outputs a predicted label
ŷt = h(x(t) ) (e.g., the predicted y-coordinate of Rumba at time t). A key problem studied
within ML is how to automatically learn a good (accurate) predictor h such that yt ≈ h(x(t) ).
Figure 2.6: Three particular members of the hypothesis space H = {h(w) : R → R, h(w) (x) =
w · x} which consists of all linear functions of the scalar feature x. We can parametrize this
hypothesis space conveniently using the weight w ∈ R as h(w) (x) = w · x.
Figure 2.7: A hypothesis h : X → Y for a binary classification problem, with label space
Y = {−1, 1} and feature space X = R2 , can be represented conveniently via the decision
boundary (dashed line) which separates all feature vectors x with h(x) ≥ 0 from the region
of feature vectors with h(x) < 0. If the decision boundary is a hyperplane {x : wT x = b}
(with normal vector w ∈ Rn ), we refer to the map h as a linear classifier.
The elements of the hypothesis space H in (2.4) are parametrized by the weight vector
w ∈ Rn . Each map h(w) ∈ H is fully specified by the weight vector w. Instead of searching
over the function space H (its elements are functions!), we can equivalently search over all
possible weight vectors w ∈ R^n. The search space R^n is still (uncountably) infinite, but it
has a rich geometric and algebraic structure that allows us to efficiently search over this space.
The hypothesis space (2.4) is also appealing because of the broad availability of computing
hardware (graphic processing units) and programming frameworks (numerical linear algebra
libraries).
The hypothesis space (2.4) can also be used for classification problems, e.g., with label
space Y = {−1, 1}. Indeed, given a linear predictor map h^{(w)}, we can classify data points
according to ŷ = 1 if h^{(w)}(x) ≥ 0 and ŷ = −1 otherwise. The resulting classifiers are
referred to as linear classifiers. ML methods that use linear classifiers include logistic
regression (see Section 3.6), the SVM (see Section 3.7) and naive Bayes’ classifiers (see
Section 3.8). The decision regions (2.3) of a linear classifier are half-spaces and their decision
boundary is a hyperplane {x : wT x = b} (see Figure 2.7).
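Such a linear classifier can be sketched in a few lines; the weight vector and feature vectors below are hypothetical:

```python
# A minimal sketch of a linear classifier for label space Y = {-1, 1};
# the weight vector and feature vectors below are hypothetical.

def linear_classify(w, x):
    """Classify according to the sign of h^(w)(x) = w^T x."""
    h_x = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return 1 if h_x >= 0 else -1

w = [1.0, -1.0]                         # normal vector of the decision boundary
print(linear_classify(w, [3.0, 1.0]))   # h(x) = 2 >= 0, so y_hat = 1
print(linear_classify(w, [1.0, 3.0]))   # h(x) = -2 < 0, so y_hat = -1
```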
The hypothesis space (2.4) can only be used for data points whose features are numeric
vectors x = (x1 , . . . , xn )T ∈ Rn . In some application domains, such as natural language
processing, there is no obvious natural choice for numeric features. However, since ML
methods based on the hypothesis space (2.4) are well developed (using numerical linear
algebra), it might be useful to construct numerical features even for non-numeric data (such
as text). For text data, there has been significant progress recently on methods that map a
human-generated text into sequences of vectors (see [26, Chap. 12] for more details).
The hypothesis space H used in an ML method, i.e., the set of possible predictor maps,
is a design choice. Some choices have proven useful for a wide range of applications (see
Chapter 3). In general, choosing a suitable hypothesis space requires a good understanding
(“domain expertise”) of statistical properties of the data and the limitations of the available
computational infrastructure.
The design choice for the hypothesis space H has to balance between two conflicting
requirements.
• It has to be sufficiently large such that it contains at least one accurate predictor
map ĥ ∈ H. A hypothesis space H that is too small might fail to include a predictor
map required to reproduce the (potentially highly non-linear) relation between features
and label.
Consider the task of grouping or classifying images into “cat” images and “no cat”
images. The classification of each image is based solely on the feature vector obtained
from the pixel colour intensities.
The relation between features and label (y ∈ {cat, no cat}) is highly non-linear. Any
ML method that uses a hypothesis space consisting only of linear maps will most likely
fail to learn a good predictor (classifier). We say that a ML method underfits the
data if it uses a too small hypothesis space.
• It has to be sufficiently small such that its processing fits the available computational
resources (memory, bandwidth, processing time). We must be able to efficiently search
over the hypothesis space to find good predictors (see Section 2.3 and Chapter 4).
This requirement implies also that the maps h(x) contained in H can be evaluated
(computed) efficiently [5]. Another important reason for using a hypothesis space H
not too large is to avoid overfitting (see Chapter 7). If the hypothesis space H is too
large, then just by luck we might find a predictor which fits well the training dataset.
Such a predictor will perform poorly on data which is different from the training data
(it will not generalize well).
The notion of a hypothesis space being too small or being too large can be made precise
in different ways. The size of a finite hypothesis space H can be defined as its cardinality |H|
which is simply the number of its elements.

Example. Consider data points represented by 100 × 10 = 1000 black-and-white pixels (see
Figure 2.3) and characterized by a binary label y ∈ {0, 1}. We can model such data points
using the feature space X = {0, 1}^{1000} and label space Y = {0, 1}. The largest possible
hypothesis space H = Y^X consists of all maps from X to Y. The size or cardinality of this
space is |H| = 2^{2^{1000}}.
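The doubly exponential growth of |H| = |Y|^{|X|} can be checked for a much smaller, hypothetical example (a 2 × 2 black-and-white image with d = 4 binary pixels):

```python
# A minimal sketch of how fast the largest hypothesis space Y^X grows:
# for binary labels and d binary pixels we have |X| = 2^d and
# |H| = |Y|^|X| = 2^(2^d). Even a tiny hypothetical 2x2 black-and-white
# image (d = 4 pixels) yields a large hypothesis space.
d = 4
num_feature_vectors = 2 ** d               # |X| = 16
num_hypotheses = 2 ** num_feature_vectors  # |H| = 2^16
print(num_hypotheses)                      # 65536
```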
Many ML methods use a hypothesis space which contains infinitely many different
predictor maps (see, e.g., (2.4)). For an infinite hypothesis spaces, we cannot simply use
the number of its elements as a measure for its size. Different concepts have been studied
for measuring the size of infinite hypothesis spaces with the Vapnik–Chervonenkis (VC)
dimension being maybe the most famous one [69].
We will use a simplified variant of the VC dimension and define the size of a hypothesis
space H as the maximum number D of arbitrary data points that can be perfectly fit (with
probability one). For any set D of D data points with different features, we can always find a
hypothesis h ∈ H such that y = h(x) for all data points (x, y) ∈ D.
Let us illustrate our concept for the size of a hypothesis space with two examples: linear
regression and polynomial regression. Linear regression uses the linear hypothesis space H^{(n)} in (2.4).
Consider m data points, each characterized by a feature vector x(i) ∈ Rn and a numeric label
y (i) ∈ R. We assume that data points are realizations of i.i.d. continuous random variables
with the same probability density function. Under this assumption, the matrix obtained by
stacking (column-wise) the feature vectors is full rank with probability one. Basic linear
algebra allows to show that such a set of data points can be perfectly fit by a linear map
h ∈ H^{(n)} as long as m ≤ n. The size of the linear hypothesis space H^{(n)} is therefore D = n.
As a second example, consider the hypothesis space H_poly^{(n)}, which is constituted by the set
of polynomials with maximum degree n. The fundamental theorem of algebra tells us that
any set of m data points with different features can be perfectly fit by a polynomial of degree
n as long as n ≥ m. Therefore, the size of the hypothesis space H_poly^{(n)} is D = n. Section 3.2
discusses polynomial regression in more detail.
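The perfect-fit property can be sketched with numpy's polynomial routines; the m = 4 data points below are hypothetical, and a polynomial of degree m − 1 already suffices to fit them exactly:

```python
import numpy as np

# A minimal sketch of the perfect-fit property: m = 4 hypothetical data
# points with distinct features are fit exactly by a polynomial of
# degree m - 1 = 3.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, -2.0, 0.5, 7.0])       # arbitrary labels

coeffs = np.polyfit(x, y, deg=len(x) - 1) # fit a degree-3 polynomial
y_hat = np.polyval(coeffs, x)

print(np.max(np.abs(y_hat - y)))          # close to 0: perfect fit
```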
Figure 2.8: The hypothesis space H is a (typically very small) subset of the (typically very
large) set Y^X of all possible maps from the feature space X into the label space Y.
Table 2.1: A spreadsheet representation of a hypothesis map h in the form of a look-up table.
The value h(x) is given by the entry in the second column of the row whose first column
entry is x.
way to measure the loss (or error) incurred by using the particular predictor h(x) when
the true label is y.
We formally define a loss function L : X ×Y ×H → R which measures the loss L((x, y), h)
incurred by predicting the label y of a data point using the prediction h(x)(=: ŷ). The
concept of loss functions is best understood by considering some examples.
Regression Loss. For ML problems involving numeric labels y ∈ R, a good first choice
for the loss function can be the squared error loss (see Figure 2.9)

L((x, y), h) := ( y − h(x) )^2   (with ŷ = h(x)).   (2.5)
The squared error loss (2.5) depends on the features x only via the predicted label value
ŷ = h(x). We can evaluate the squared error loss solely using the prediction h(x) and the true
label value y. Besides the prediction h(x), no other properties of the data point’s features x
are required to determine the squared error loss. We will use the shorthand L(y, ŷ) for any
loss function that depends on the features only via the prediction ŷ = h(x).
Figure 2.9: A widely used choice for the loss function in regression problems (with label
space Y = R) is the squared error loss L((x, y), h) := (y − h(x))2 . Note that in order to
evaluate the loss function for a given hypothesis h, so that we can tell if h is any good, we
need to know the feature x and the label y of the data point.
The squared error loss (2.5) has appealing computational and statistical properties. For
linear predictor maps h(x) = wT x, the squared error loss is a convex and differentiable
function of the weight vector w. This allows us, in turn, to search for the optimal
linear predictor using efficient iterative optimization methods (see Chapter 5).
The squared error loss also has a useful interpretation in terms of a probabilistic model
for the features and labels. Minimizing the squared error loss is equivalent to maximum
likelihood estimation within a linear Gaussian model [30, Sec. 2.6.3].
Another loss function used in regression problems is the absolute error loss |ŷ − y|. Using
this loss function to learn a good predictor results in methods that are robust against a few
outliers in the training set (see Section 3.3).
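The contrast between the two regression losses can be sketched with hypothetical label and prediction values; the squared error penalizes a single large residual (an outlier) much more heavily:

```python
# A minimal sketch (hypothetical values) contrasting the squared error
# loss (2.5) with the absolute error loss.

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    return abs(y - y_hat)

# Small residual: the two losses are of similar size.
print(squared_loss(1.0, 1.5), absolute_loss(1.0, 1.5))    # 0.25 0.5
# Large residual (outlier): the squared error loss explodes.
print(squared_loss(1.0, 11.0), absolute_loss(1.0, 11.0))  # 100.0 10.0
```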
Classification Loss. In classification problems with a discrete label space Y, such as
in binary classification where Y = {−1, 1}, the squared error (y − h(x))2 is not a useful
measure for the quality of a classifier h(x). We would like the loss function to punish wrong
classifications, e.g., when the true label is y = −1 but the classifier produces a large positive
number, e.g., h(x) = 1000. On the other hand, for a true label y = −1, we do not want to
punish a classifier h which yields a large negative number, e.g., h(x) = −1000. But exactly
this unwanted result would happen for the squared error loss.
Figure 2.10 depicts a dataset consisting of 5 labeled data points with binary labels
represented by circles (for y = 1) and squares (for y = −1). The squared error loss incurred
by the classifier h1 , which does not separate the two classes perfectly, is smaller than the
squared error loss incurred by classifier h2 which perfectly separates the two classes. The
squared error loss is a bad choice for classification problems with a discrete label space Y.
Figure 2.10: Minimizing the squared error loss would prefer the (poor) classifier h1 over the
(reasonable) classifier h2 .
??? use a different example to illustrate that squared error loss is not a good idea to learn
a linear classifier. e.g., using single feature and showing the graph of the predictor instead
of the decision boundary obtained from a linear predictor (which) might be confusing ???
We now discuss some popular choices for the loss function suitable for ML problems with
binary labels. While the particular representation of the label values is irrelevant in principle, it will be
convenient to encode the two label values by the real numbers −1 and 1. The formulas for
the loss functions we present apply only to this encoding. The modification of these formulas
to a different encoding, such as label values 0 and 1, is not very difficult.
Consider the problem of detecting forest fires as early as possible using webcam snapshots
such as the one depicted in Figure 2.11. A particular snapshot is characterized by the features
x and the label y ∈ Y = {−1, 1} with y = 1 if the snapshot shows a forest fire and y = −1
if there is no forest fire. We would like to find or learn a classifier h(x) which takes the
features x as input and provides a classification according to ŷ = 1 if h(x) > 0 and ŷ = −1
if h(x) ≤ 0. Ideally we would like to have ŷ = y for any data point. This suggests to use the
0/1 loss (see Figure 2.12)

L((x, y), h) := 1 if y h(x) < 0, and L((x, y), h) := 0 else.   (2.6)
classifier if our goal is to enforce correct classification (ŷ = y). This appealing statistical
property of the 0/1 loss comes at the cost of high computational complexity. Indeed, for a
given data point (x, y), the 0/1 loss (2.6) is neither convex nor differentiable when viewed
as a function of the classifier h. Thus, using the 0/1 loss for binary classification problems
typically involves advanced optimization methods for solving the resulting learning problem
(see Section 3.8).
In order to “cure” the non-convexity of the 0/1 loss we approximate it by a convex loss
function. This convex approximation results in the hinge loss (see Figure 2.12)

L((x, y), h) := max{ 0, 1 − y h(x) }.   (2.8)

While the hinge loss avoids the non-convexity of the 0/1 loss, it is still a non-differentiable
function of the classifier h.
The next example of a loss function that is useful for classification problems is differentiable.
The logistic loss is used within logistic regression (see Section 3.6) and is defined as

L((x, y), h) := log( 1 + exp(−y h(x)) ).   (2.9)
For a fixed feature vector x and label y, both the hinge loss and the logistic loss
are convex functions of the hypothesis h. The logistic loss (2.9) depends smoothly on h,
such that we can define a derivative of the loss with respect to h. In contrast, the hinge
loss (2.8) is non-smooth, which makes it more difficult to minimize.

ML methods based on the logistic loss function, such as logistic regression (see Section
3.6), can make use of simple gradient descent methods (see Chapter 5) to minimize the
average loss. ML methods based on the hinge loss, such as support vector machines [30],
must use more sophisticated optimization methods to learn a predictor by minimizing the
loss (see Chapter 4).
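The three classification losses can be sketched as follows, using the standard formulas max{0, 1 − y h(x)} for the hinge loss and log(1 + exp(−y h(x))) for the logistic loss; the label and classifier values below are hypothetical:

```python
import math

# A minimal sketch of three loss functions for binary classification
# with labels y in {-1, 1} and real-valued classifier output h(x).

def zero_one_loss(y, h_x):
    return 1.0 if y * h_x < 0 else 0.0

def hinge_loss(y, h_x):
    return max(0.0, 1.0 - y * h_x)

def logistic_loss(y, h_x):
    return math.log(1.0 + math.exp(-y * h_x))

# Correct, confident classification (y = 1, h(x) = 2): small losses.
print(zero_one_loss(1, 2.0), hinge_loss(1, 2.0), logistic_loss(1, 2.0))
# Wrong classification (y = 1, h(x) = -1): all losses penalize it.
print(zero_one_loss(1, -1.0), hinge_loss(1, -1.0), logistic_loss(1, -1.0))
```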
Let us emphasize that, very much like the choice of features and hypothesis space, the
question of which particular loss function to use within an ML method is a design choice,
which has to be tailored to the application at hand. The choice for the loss function must
take into account the available computational resources and the statistical properties of the
data (e.g. presence of few outliers).
Figure 2.12: Some popular loss functions for binary classification problems with label space
Y = {−1, 1}. Note that the more correct a decision, i.e., the more positive h(x) is (when y =
1), the smaller the loss. In particular, all depicted loss functions tend to 0 monotonically
with increasing h(x).
An important aspect guiding the choice for the loss function is the computational
complexity of the resulting ML method. The basic idea behind ML methods is
quite simple: learn (find) the particular hypothesis out of a given hypothesis space
which yields the smallest (average) loss. The difficulty of the resulting optimization
problem (see Chapter 4) depends crucially on the properties of the chosen loss
function. Some loss functions allow the use of simple but efficient iterative methods
for solving the optimization problem underlying an ML method (see Chapter 5).
E(h|D) := (1/m) ∑_{i=1}^{m} L( (x^{(i)}, y^{(i)}), h ).   (2.10)
To ease the notational burden, we write E(h) whenever the dataset D is clear from the context.
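Computing the empirical risk (2.10) can be sketched in a few lines; the toy dataset and hypothesis below are hypothetical:

```python
# A minimal sketch of the empirical risk (2.10): the average loss of a
# hypothesis h on a labeled dataset D (a hypothetical toy example).

def empirical_risk(D, h, loss):
    """Average loss of hypothesis h over dataset D = [(x, y), ...]."""
    return sum(loss(y, h(x)) for x, y in D) / len(D)

D = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]    # (feature, label) pairs
h = lambda x: 2.0 * x                       # candidate hypothesis h(x) = 2x
squared_loss = lambda y, y_hat: (y - y_hat) ** 2

print(empirical_risk(D, h, squared_loss))   # approximately 0.02
```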
Regret. In some applications, we might have access to the predictions obtained from
some reference methods or experts. The quality of a hypothesis h can then be measured
via the difference between the loss incurred by its predictions h(x) and the loss incurred
by the predictions of the experts [31]. This difference is referred to as the regret of using
the prediction h(x) instead of the expert prediction. The goal of regret minimization is to learn a
hypothesis with small regret compared to all considered experts.
The concept of regret minimization is useful when we do not make any probabilistic
assumptions (such as i.i.d.) about the data points. Without a probabilistic model, we
cannot use the Bayes risk (of the Bayes optimal estimator) as a benchmark. Regret minimization
techniques can be designed and analyzed without any such probabilistic model for the data
[15]. This approach replaces the Bayes risk with the regret relative to given reference
predictors (experts) as the benchmark.
Partial Feedback, “Reward”. Some applications involve data points whose labels are
so difficult or costly to determine that we cannot assume to have any labeled data available.
Without any labeled data, we cannot use the concept of a loss function to measure the quality
of a prediction.1 Instead we must use some other form of indirect feedback or “reward” that
indicates the usefulness of a particular prediction [15, 67].

Consider the ML problem of predicting the optimal direction for moving a toy car next,
given its current state. ML methods can sense the state via a feature vector x whose entries
are pixel intensities of a snapshot. The goal is to learn a hypothesis map from the feature
vector x to a guess ŷ = h(x) for the optimal steering direction y (the true label).
In some applications, we might not have access to the true label of any data point. This
means that we cannot evaluate the quality of a particular map based on the average loss on
training data. Instead, we might have only some indirect signal about the loss incurred by
the prediction ŷ = h(x). Such a feedback signal, or reward, could be obtained by a distance
sensor that measures the change of the distance between the car and its goal, such as a
charging station.
for which we know the true label values y (i) .
The assumption of knowing the exact true label values y (i) for any data point is an
idealization. We might often face labelling or measurement errors such that the observed
labels are noisy versions of the true label. We discuss techniques that allow ML methods to
cope with noisy labels (see Chapter 7).
Our goal is to learn a predictor map h(x) such that h(x) ≈ y for any data point. We
require the predictor map to belong to the hypothesis space H of linear predictors

h^{(w_0,w_1)}(x) := w_1 x + w_0.   (2.12)

The predictor (2.12) is parametrized by the slope w_1 and the intercept (bias or offset)
w_0. We indicate this by the notation h^{(w_0,w_1)}. A particular choice for w_1, w_0 defines the
linear predictor h^{(w_0,w_1)}(x) = w_1 x + w_0.
Let us use some linear predictor h^{(w_0,w_1)}(x) to predict the labels of training data points.
In general, the predictions ŷ^{(i)} = h^{(w_0,w_1)}(x^{(i)}) will not be perfect and incur a non-zero
prediction error ŷ^{(i)} − y^{(i)} (see Figure 2.13).
We measure the goodness of the predictor map h^{(w_0,w_1)} using the average squared error
loss (see (2.5))

f(w_0, w_1) := (1/m) ∑_{i=1}^{m} ( y^{(i)} − h^{(w_0,w_1)}(x^{(i)}) )^2
             = (1/m) ∑_{i=1}^{m} ( y^{(i)} − (w_1 x^{(i)} + w_0) )^2   (by (2.12)).   (2.13)

The training error f(w_0, w_1) is the average of the squared prediction errors incurred by the
predictor h^{(w_0,w_1)}(x) on the labeled data points (2.11).
It seems natural to learn a good predictor (2.12) by choosing the weights w0 , w1 to
minimize the training error
min_{w_0, w_1 ∈ R} f(w_0, w_1) = min_{w_0, w_1 ∈ R} (1/m) ∑_{i=1}^{m} ( y^{(i)} − (w_1 x^{(i)} + w_0) )^2.   (2.14)
The optimal weights w_0^0, w_1^0 are characterized by the zero-gradient condition,2

∇f(w_0^0, w_1^0) = 0.   (2.15)

Inserting (2.13) into (2.15) and using basic rules for calculating derivatives, we obtain the
following optimality conditions:

(1/m) ∑_{i=1}^{m} ( y^{(i)} − (w_1^0 x^{(i)} + w_0^0) ) = 0, and
(1/m) ∑_{i=1}^{m} x^{(i)} ( y^{(i)} − (w_1^0 x^{(i)} + w_0^0) ) = 0.   (2.16)

Any weights w_0^0, w_1^0 that satisfy (2.16) define a predictor h^{(w_0^0,w_1^0)}(x) = w_1^0 x + w_0^0 that is
optimal in the sense of incurring the minimum training error.
We find it convenient to rewrite the optimality condition (2.16) using matrices and
vectors. To this end, we first rewrite the predictor (2.12) as

h(x) = w^T x with w = (w_0, w_1)^T, x = (1, x)^T.
T
Let us stack the feature vectors x(i) = 1, x(i) and labels y (i) of training data points (2.11)
into the feature matrix and label vector,
T T
X = x(1) , . . . , x(m) ∈ Rm×2 , y = y (1) , . . . , y (m) ∈ Rm . (2.17)
XT y − Xw0 = 0.
(2.18)
The entries of any weight vector ŵ = (ŵ_0, ŵ_1)^T that satisfies (2.18) are solutions of (2.16).
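As a quick numerical illustration, the condition (2.18) can be solved with a few lines of NumPy. The toy data below is made up for this sketch and is not from the text:

```python
import numpy as np

# Toy training set: m = 5 data points with scalar feature x^(i) and label y^(i)
# (made-up numbers, roughly following y = 2x + 1).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.9, 5.1, 7.0, 9.1])

# Feature matrix X with rows x^(i) = (1, x^(i))^T, see (2.17).
X = np.column_stack([np.ones_like(x), x])

# Solve X^T (y - X w) = 0, i.e., the normal equations X^T X w = X^T y.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
w0, w1 = w_hat  # intercept and slope of the optimal linear predictor
```

Here `np.linalg.lstsq` returns a minimizer of the training error (2.14) even when X^T X happens to be singular.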
²A necessary and sufficient condition for ŵ to minimize a convex differentiable function f(w) is
∇f(ŵ) = 0 [12, Sec. 4.2.3].
Figure 2.13: We can evaluate the quality of a particular predictor h ∈ H by measuring the
prediction error y − h(x) obtained for a labeled data point (x, y).
2.5 Exercises
2.5.1 How Many Features?
Consider the ML problem underlying a music information retrieval smartphone app [72].
Such an app aims at identifying the song-title based on a short audio recording of (an
interpretation of) the song obtained via the microphone of a smartphone. Here, the feature
vector x represents the sampled audio signal and the label y is a particular song title out of
a huge music database. What is the length n of the feature vector x ∈ Rn if its entries are
the signal amplitudes of a 20 second long recording which is sampled at a rate of 44 kHz?
How many different linear predictors (2.19) are there? 10, 30, 40, or infinitely many?
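The first question is plain arithmetic (one feature entry per signal sample); a quick sanity check:

```python
# Length n of the feature vector for a 20-second recording sampled at 44 kHz:
# one entry per signal amplitude, i.e., n = duration * sampling rate.
duration_s = 20
sampling_rate_hz = 44_000
n = duration_s * sampling_rate_hz  # 880000 entries
```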
2.5.3 Average Squared Error Loss as Quadratic Form
Consider the linear hypothesis space consisting of linear maps parametrized by a weight vector w. We
try to find the best linear map by minimizing the average squared error loss (empirical
risk) incurred on some labeled training data points (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m)). Is
it possible to write the resulting empirical risk, viewed as a function f(w), as a convex
quadratic form f(w) = w^T Cw + b^T w + c? If this is possible, how are the matrix C, the vector b
and the constant c related to the feature vectors and labels of the training data?
Consider the hypothesis space

H = {h(x) = x^T Aw : w ∈ S}.

Here, we used the matrix A = (1 −1; −1 1) and the set S = {(1, 1)^T, (2, 2)^T, (−1, 3)^T, (0, 4)^T} ⊆
R^2. What is the cardinality of H, i.e., how many different predictor maps does H contain?
the sense of incurring the smallest average squared error loss on the three (training) data
points (x = 1/10, y = 3), (0, 0) and (1, −1).
2.5.16 Size of Linear Hypothesis Space
Consider a training set of m data points with feature vectors x^(i) ∈ R^n and numeric labels
y^(1), …, y^(m). The feature vectors and label values of the training set are arbitrary except
that we assume the feature matrix X = (x^(1), …, x^(m))^T is full rank. What condition on m and
n guarantees that we can find a linear predictor h(x) = w^T x that perfectly fits the training
set, i.e., y^(1) = h(x^(1)), …, y^(m) = h(x^(m))?
Chapter 3
Some Examples
• the data, which is characterized by features that can be computed or measured easily
and labels that represent high-level facts.
Each of these three components involves design choices for the data features and labels, the
model and loss function. This chapter details the specific design choices used by some of the
most popular ML methods.
The quality of a particular predictor h^(w) is measured by the squared error loss (2.5).
Using labeled training data D = {(x^(i), y^(i))}_{i=1}^m, linear regression learns a predictor ĥ which
Figure 3.1: ML methods fit a model to data by minimizing a loss function. Different ML
methods use different design choices for model, data and loss.
minimizes the average squared error loss, or mean squared error, (see (2.5))
Since the hypothesis space H(n) is parametrized by the weight vector w (see (3.1)), we
can rewrite (3.2) as an optimization problem directly over the weight vector w:
w_opt = argmin_{w∈R^n} (1/m) ∑_{i=1}^m (y^(i) − h^(w)(x^(i)))^2
      = argmin_{w∈R^n} (1/m) ∑_{i=1}^m (y^(i) − w^T x^(i))^2,    (3.3)

where the second equality uses h^(w)(x) = w^T x.
The optimization problems (3.2) and (3.3) are equivalent in the following sense: any optimal
weight vector w_opt which solves (3.3) can be used to construct an optimal predictor ĥ, which
solves (3.2), via ĥ(x) = h^(w_opt)(x) = w_opt^T x.
3.2 Polynomial Regression
Consider an ML problem involving data points which are characterized by a single numeric
feature x ∈ R (the feature space is X = R) and a numeric label y ∈ R (the label space is
Y = R). We observe a number of labeled data points, which are depicted in Figure 3.2.
[Figure 3.2: a scatter plot of the labeled data points, with the feature x on the horizontal axis and the label y on the vertical axis.]
Figure 3.2 suggests that the relation x ↦ y between feature x and label y is highly non-linear.
For such non-linear relations between features and labels it is useful to consider a
hypothesis space which is constituted by polynomial functions,

H^(n)_poly = {h^(w) : R → R : h^(w)(x) = ∑_{r=1}^{n+1} w_r x^{r−1}, with some w = (w_1, …, w_{n+1})^T ∈ R^{n+1}}.

We can approximate any non-linear relation y = h(x) with any desired level of accuracy using
a polynomial ∑_{r=1}^{n+1} w_r x^{r−1} of sufficiently large degree n.¹
As for linear regression (see Section 3.1), we measure the quality of a predictor by the
squared error loss (2.5). Based on labeled training data D = {(x^(i), y^(i))}_{i=1}^m, with scalar
features x^(i) and labels y^(i), polynomial regression amounts to minimizing the average squared
error loss (mean squared error) (see (2.5)):

min_{h∈H^(n)_poly} (1/m) ∑_{i=1}^m (y^(i) − h^(w)(x^(i)))^2.    (3.5)
¹The precise formulation of this statement is known as the “Stone-Weierstrass Theorem” [60, Thm. 7.26].
It is useful to interpret polynomial regression as a combination of a feature map (transformation)
(see Section 2.1.1) and linear regression (see Section 3.1). Indeed, any polynomial predictor
h^(w) ∈ H^(n)_poly is obtained as a concatenation of the feature map φ (see (3.6)) with the linear map g^(w) (see (3.7)).
Thus, we can implement polynomial regression by first applying the feature map φ (see
(3.6)) to the scalar features x^(i), resulting in the transformed feature vectors

x^(i) = φ(x^(i)) = (1, x^(i), …, (x^(i))^n)^T ∈ R^{n+1},    (3.8)
and then applying linear regression (see Section 3.1) to these new feature vectors. By
inserting (3.7) into (3.5), we end up with a linear regression problem (3.3) with the feature
vectors (3.8). Thus, while a predictor h^(w) ∈ H^(n)_poly is a non-linear function h^(w)(x) of the
original feature x, it is a linear function, given explicitly by g^(w)(x) = w^T x (see (3.7)), of
the transformed features x (3.8).
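This reduction of polynomial regression to linear regression is easy to verify numerically. A minimal sketch (the data is made up; `np.vander` builds the transformed feature vectors (3.8)):

```python
import numpy as np

# Made-up data with a non-linear relation between feature x and label y.
x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
y = x**2  # the labels follow y = x^2 exactly

# Feature map (3.6): phi(x) = (1, x, ..., x^n)^T with degree n = 2.
n = 2
X = np.vander(x, n + 1, increasing=True)  # rows are (1, x^(i), (x^(i))^2)

# Plain linear regression (3.3) on the transformed features.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w  # the learned polynomial fits this data perfectly
```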
The Huber loss contains a parameter ε, which has to be adapted to the application at
hand. The Huber loss is robust to outliers since the corresponding (large) prediction errors
y − ŷ are not squared. Outliers therefore have a smaller effect on the average Huber loss over the
entire dataset.
The Huber loss contains two important special cases. The first special case occurs when
a very large value of ε is chosen, such that the condition |y − ŷ| ≤ ε is always satisfied. In
this case, the Huber loss is equivalent to the squared error loss (y − ŷ)2 (up to a scaling
factor 1/2).
The second special case occurs when ε is chosen very small (close to 0) such that the
condition |y − ŷ| ≤ ε is never satisfied. In this case, the Huber loss is equivalent to the
absolute loss |y − ŷ| scaled by a factor ε.
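A common definition of the Huber loss matching these two special cases (quadratic for errors up to ε, linear beyond) is sketched below; the exact form of the text's equation is assumed to be this standard one:

```python
import numpy as np

def huber_loss(y, y_hat, eps=1.0):
    """Huber loss: squared for small errors, linear for large (outlier) errors."""
    err = np.abs(y - y_hat)
    quad = 0.5 * err**2            # used when |y - y_hat| <= eps
    lin = eps * (err - 0.5 * eps)  # used otherwise; grows only linearly
    return np.where(err <= eps, quad, lin)
```

For a small error the loss equals (1/2)(y − ŷ)^2, while a large error of size 5 contributes only ε(5 − ε/2) instead of 12.5.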
The choice for the tuning parameter α can be guided by using a probabilistic model,

y = w^T x + ε.

Here, w denotes some true underlying weight vector and ε is a random variable modelling noise.
Appropriate values for α can then be determined based on the variance of the noise, the
number of non-zero entries in w and a lower bound on the non-zero values. Another option
for choosing the value α is to try out different candidate values and pick the one resulting
in smallest validation loss (see Section 6.2).
3.5 Gaussian Basis Regression
As discussed in Section 3.2, we can extend the basic linear regression problem by first
transforming the features x using a vector-valued feature map φ : R → Rn and then applying
a weight vector w to the transformed features φ(x). For polynomial regression, the feature
map is constructed using powers x^l of the scalar feature x.
It is possible to use other functions, different from polynomials, to construct the feature
map φ. We can extend linear regression using an arbitrary feature map
with the scalar maps φj : R → R which are referred to as basis functions. The choice
of basis functions depends heavily on the particular application and the underlying relation
between features and labels of the observed data points. The basis functions underlying
polynomial regression are φj (x) = xj .
Another popular choice for the basis functions are “Gaussians”
The family (3.12) of maps is parametrized by the variance σ^2 and the mean (shift) μ. We
obtain Gaussian basis linear regression by combining the feature map

φ(x) = (φ_{σ_1,μ_1}(x), …, φ_{σ_n,μ_n}(x))^T    (3.13)

with linear regression (see Figure 3.3). The resulting hypothesis space is then

H^(n)_Gauss = {h^(w) : R → R : h^(w)(x) = ∑_{j=1}^n w_j φ_{σ_j,μ_j}(x) with some w = (w_1, …, w_n)^T ∈ R^n}.    (3.14)
We obtain a different hypothesis space H_Gauss for each choice of the variances σ_j^2 and shifts
μ_j used for the Gaussian functions in (3.12). These parameters have to be chosen suitably
for the ML application at hand (e.g., using the model selection techniques discussed in Section
6.3).
The hypothesis space (3.14) is parameterized by a weight vector w ∈ R^n. Each element
of H_Gauss corresponds to a particular choice for the weight vector w. Instead of searching
over H_Gauss to find a good hypothesis, we can search over R^n.
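To make this concrete, the following sketch implements the feature map (3.13) and a predictor from H_Gauss, assuming the common unnormalized form φ_{σ,μ}(x) = exp(−(x − μ)^2/(2σ^2)) for the Gaussian basis functions in (3.12); the shifts, width and weights are arbitrary illustrative choices:

```python
import numpy as np

def gauss_feature_map(x, mus, sigma):
    """phi(x) = (phi_{sigma,mu_1}(x), ..., phi_{sigma,mu_n}(x))^T, see (3.13)."""
    x = np.atleast_1d(x).astype(float)
    return np.exp(-((x[:, None] - mus[None, :]) ** 2) / (2.0 * sigma**2))

mus = np.array([-1.0, 1.0])  # shifts mu_1, mu_2 (illustrative)
sigma = 0.5                  # common width sigma (illustrative)
w = np.array([1.0, 0.5])     # weight vector parametrizing h^(w)

def h(x):
    # h^(w)(x) = sum_j w_j * phi_{sigma,mu_j}(x), an element of H_Gauss.
    return gauss_feature_map(x, mus, sigma) @ w
```

Searching over the weight vector w ∈ R^n (e.g., via linear regression on the transformed features) then amounts to searching over H_Gauss.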
Figure 3.3: The true relation x ↦ y = h(x) (blue) between feature x and label y is highly
non-linear. We might predict the label using a non-linear predictor ŷ = h^(w)(x) with some
weight vector w ∈ R^2 and h^(w) ∈ H^(2)_Gauss.
seems to be much more reliable. In general it is beneficial to complement a particular
prediction (or classification) result by some reliability information.
Within logistic regression, we assess the quality of a particular classifier h^(w) ∈ H^(n) using
the logistic loss (2.9). Given some labeled training data D = {(x^(i), y^(i))}_{i=1}^m, logistic regression
amounts to minimizing the empirical risk (average logistic loss)

E(w|D) = (1/m) ∑_{i=1}^m log(1 + exp(−y^(i) h^(w)(x^(i))))
       = (1/m) ∑_{i=1}^m log(1 + exp(−y^(i) w^T x^(i))),    (3.15)

where the second equality uses h^(w)(x) = w^T x.
Once we have found the optimal weight vector ŵ which minimizes (3.15), we can classify a
data point based on its features x according to

ŷ = 1 if h^(ŵ)(x) ≥ 0, and ŷ = −1 otherwise.    (3.16)
Since h^(ŵ)(x) = ŵ^T x (see (3.1)), the classifier (3.16) amounts to testing whether ŵ^T x ≥
0 or not. Thus, the classifier (3.16) partitions the feature space X = R^n into two half-spaces
R_1 = {x : ŵ^T x ≥ 0} and R_{−1} = {x : ŵ^T x < 0} which are separated by the hyperplane
ŵ^T x = 0 (see Figure 2.7). Any data point with features x ∈ R_1 (x ∈ R_{−1}) is classified as
ŷ = 1 (ŷ = −1).
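Minimizing (3.15) has no closed-form solution, but plain gradient descent (see Chapter 5) works since the logistic loss is smooth and convex. A self-contained sketch on made-up, linearly separable data:

```python
import numpy as np

# Made-up training set: labels y in {-1, 1}, features x in R^2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = X.shape[0]

w = np.zeros(X.shape[1])
lr = 0.5  # step size (illustrative choice)
for _ in range(500):
    margins = y * (X @ w)
    # Gradient of (3.15): -(1/m) * sum_i y^(i) x^(i) / (1 + exp(y^(i) w^T x^(i))).
    grad = -(1.0 / m) * X.T @ (y / (1.0 + np.exp(margins)))
    w -= lr * grad

# Classify via (3.16): y_hat = 1 if w^T x >= 0, else -1.
y_hat = np.where(X @ w >= 0, 1.0, -1.0)
```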
Logistic regression can be interpreted as a particular probabilistic inference method. This
interpretation is based on modelling the labels y ∈ {−1, 1} as i.i.d. random variables with
some probability P(y = 1) which is parameterized by a linear predictor h(w) (x) = wT x via
or, equivalently,
P(y = 1) = 1/(1 + exp(−wT x)). (3.18)
Since P(y = 1) + P(y = −1) = 1, we also have P(y = −1) = 1/(1 + exp(w^T x)).
Given the probabilistic model (3.18), a principled approach to choosing the weight vector
w is based on maximizing the probability (or likelihood) of the observed dataset D =
{(x^(i), y^(i))}_{i=1}^m under the probabilistic model (3.18). This yields the maximum likelihood
estimator
The maximizer of a positive function f(w) > 0 is not affected by replacing f(w) with
log f(w), i.e., argmax_{w∈R^n} f(w) = argmax_{w∈R^n} log f(w). Therefore, (3.20) can be further developed as

ŵ = argmax_{w∈R^n} −∑_{i=1}^m log(1 + exp(−y^(i) w^T x^(i)))
  = argmin_{w∈R^n} (1/m) ∑_{i=1}^m log(1 + exp(−y^(i) w^T x^(i))).    (3.21)
Comparing (3.21) with (3.15) reveals that logistic regression is nothing but maximum likelihood
estimation of the weight vector w in the probabilistic model (3.18).
(see Section 3.6).
The soft-margin SVM [43, Chapter 2] uses the loss
with a tuning parameter λ > 0. According to [43, Chapter 2], learning a classifier h^(w_SVM) by minimizing
the loss (3.22), averaged over some labeled data points D = {(x^(i), y^(i))}_{i=1}^m, is equivalent
to maximizing the distance (margin) ξ between the decision boundary, given by the set
of points x satisfying w_SVM^T x = 0, and each of the two classes C_1 = {x^(i) : y^(i) = 1} and
C_2 = {x^(i) : y^(i) = −1}. Maximizing this margin is sensible as it ensures that the resulting
classifications are robust against small (relative to the margin) perturbations of the features
(see Section 7.2).
As depicted in Figure 3.4, the margin between the decision boundary and the classes
C_1 and C_2 is typically determined by a few data points (such as x^(6) in Figure 3.4) which are
closest to the decision boundary. Such data points are referred to as support vectors and
entirely determine the resulting classifier h^(w_SVM). In other words, once the support vectors
are identified, the remaining data points become irrelevant for learning the classifier h^(w_SVM).
Figure 3.4: The SVM aims at a classifier h^(w) with small hinge loss. Minimizing the hinge loss
of a classifier is the same as maximizing the margin ξ between the decision boundary (of the
classifier) and each class of the training set.
We highlight that both the SVM and logistic regression amount to linear classifiers
h^(w) ∈ H^(n) (see (3.1)) whose decision boundary is a hyperplane in the feature space X = R^n
(see Figure 2.7). The difference between the SVM and logistic regression is the loss function used
for evaluating the quality of a particular classifier h^(w) ∈ H^(n). The SVM uses the hinge
loss (2.8) which is the best convex approximation to the 0/1 loss (2.6). Thus, we expect the
classifier obtained by the SVM to yield a smaller classification error probability P(ŷ ≠ y)
(with ŷ = 1 if h(x) > 0 and ŷ = −1 otherwise) compared to logistic regression, which uses
the logistic loss (2.9).
The statistical superiority of the SVM comes at the cost of increased computational
complexity. In particular, the hinge loss (2.8) is non-differentiable which prevents the use
of simple gradient-based methods (see Chapter 5) and requires more advanced optimization
methods. In contrast, the logistic loss (2.9) is convex and differentiable which allows to apply
simple iterative methods for minimization of the loss (see Chapter 5).
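The relation between the three losses can be checked numerically as functions of the margin y·h(x); the hinge loss upper-bounds the 0/1 loss everywhere, while the logistic loss is smooth (the scaling of (2.8) and (2.9) is assumed to be the standard one):

```python
import numpy as np

def loss_01(margin):
    """0/1 loss as a function of the margin y * h(x)."""
    return (margin <= 0).astype(float)

def hinge(margin):
    """Hinge loss max{0, 1 - y h(x)} used by the SVM."""
    return np.maximum(0.0, 1.0 - margin)

def logistic(margin):
    """Logistic loss log(1 + exp(-y h(x))) used by logistic regression."""
    return np.log1p(np.exp(-margin))

margins = np.linspace(-2.0, 2.0, 9)
# hinge(margins) dominates loss_01(margins) pointwise; logistic is
# differentiable everywhere, which is what enables gradient methods.
```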
3.9 Kernel Methods
Consider a ML (classification or regression) problem with an underlying feature space X .
In order to predict the label y ∈ Y of a data point based on its features x ∈ X , we apply
a predictor h selected out of some hypothesis space H. Let us assume that the available
computational infrastructure only allows to use a linear hypothesis space H(n) (see (3.1)).
For some applications using only linear predictor maps in H(n) is not sufficient to model
the relation between features and labels (see Figure 3.2 for a data set which suggests a
non-linear relation between features and labels). In such cases it is beneficial to add a
pre-processing step before applying a predictor h.
The family of kernel methods is based on transforming the features x to new features
x̂ ∈ X′ which belong to a (typically very) high-dimensional space X′ [43]. It is not uncommon
that, while the original feature space is a low-dimensional Euclidean space (e.g., X = R^2),
the transformed feature space X′ is an infinite-dimensional function space.
The rationale behind transforming the original features into a new (higher-dimensional)
feature space X′ is to reshape the intrinsic geometry of the feature vectors x^(i) ∈ X such
that the transformed feature vectors x̂^(i) have a “simpler” geometry (see Figure 3.5).
Kernel methods are obtained by formulating ML problems (such as linear regression or
logistic regression) using the transformed features x̂ = φ(x). A key challenge within kernel
methods is the choice of the feature map φ : X → X′ which maps the original feature vector
x to a new feature vector x̂ = φ(x).
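A tiny example of the idea: points on two concentric circles in X = R^2 cannot be separated by a linear classifier, but become linearly separable under the (illustrative, hand-picked) feature map φ(x) = (x_1^2, x_2^2)^T:

```python
import numpy as np

inner = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])  # class +1
outer = 3.0 * inner                                                    # class -1

def phi(X):
    """Illustrative feature map into the new feature space X': x -> (x_1^2, x_2^2)."""
    return X**2

# In X', the linear test "x_hat_1 + x_hat_2 <= 5" separates the classes:
score_inner = phi(inner).sum(axis=1)  # equals the squared radius, 1 for all points
score_outer = phi(outer).sum(axis=1)  # equals 9 for all points
```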
Figure 3.5: Consider a dataset D = {(x^(i), y^(i))}_{i=1}^5 constituted by data points with features
x^(i) and binary labels y^(i). Left: In the original feature space X, the data points cannot be
separated perfectly by any linear classifier. Right: The feature map φ : X → X′ transforms
the features x^(i) to the new features x̂^(i) = φ(x^(i)) in the new feature space X′. In the new
feature space X′ the data points can be separated perfectly by a linear classifier.
3.10 Decision Trees
A decision tree is a flowchart-like description of a map h : X → Y which maps the features
x ∈ X of a data point to a predicted label h(x) ∈ Y [30].
While decision trees can be used for an arbitrary feature space X and label space Y, we will
discuss them for the particular feature space X = R^2 and label space Y = R.
We have depicted an example of a decision tree in Figure 3.6. The decision tree consists
of nodes which are connected by directed edges. We can think of a decision tree as a step-by-step
instruction, or a “recipe”, for how to compute the predictor value h(x) given the input
feature x ∈ X. This computation starts at the root node and ends at one of the leaf
nodes.
A leaf node m, which does not have any outgoing edges, corresponds to a certain subset
or “region” Rm ⊆ X of the feature space. The hypothesis h associated with a decision tree
is constant over the regions Rm , such that h(x) = hm for all x ∈ Rm and some fixed number
h_m ∈ R. In general, there are two types of nodes in a decision tree:

• decision (or test) nodes, which represent particular “tests” about the feature vector x
(e.g., “is the norm of x larger than 10?”);

• leaf nodes, which have no outgoing edges and deliver a constant prediction h_m.
The particular decision tree depicted in Figure 3.6 consists of two decision nodes (including
the root node) and three leaf nodes.
Given limited computational resources, we need to restrict ourselves to decision trees
which are not too large. We can define a particular hypothesis space by collecting all decision
trees which use the tests “‖x − u‖ ≤ r” and “‖x − v‖ ≤ r” (for fixed vectors u and v and a
fixed radius r > 0) and whose depth is not larger than 2.³ To assess the quality of different decision
trees we need to use some loss function. Examples of loss functions used to measure the
quality of a decision tree are the squared error loss (for numeric labels) or the impurity of
individual decision regions (for categorical labels).
In general, we are not interested in one particular decision tree only but in a large set of
different decision trees from which we choose the most suitable given some data (see Section
4.3). We can define a hypothesis space by collecting predictor maps h represented by a set
of decision trees (such as depicted in Figure 3.7).
³The depth of a decision tree is the maximum number of hops it takes to reach a leaf node starting from
the root and following the arrows. The decision tree depicted in Figure 3.6 has depth 2.
A collection of decision trees can be constructed based on a fixed set of “elementary
tests” on the input feature vector, e.g., ‖x‖ > 3, x_3 < 1, or a continuous ensemble of tests
such as {x_2 > η}_{η∈[0,10]}. We then build a hypothesis space by considering all decision trees
not exceeding a maximum depth and whose decision nodes implement elementary tests.
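A hypothesis represented by a depth-2 tree of this kind can be sketched directly as nested tests; the vectors u, v, the radius r and the leaf values h_1, h_2, h_3 below are arbitrary illustrative choices:

```python
import numpy as np

u = np.array([0.0, 0.0])    # center of the root test ||x - u|| <= r
v = np.array([1.5, 0.0])    # center of the second test ||x - v|| <= r
r = 1.0
h1, h2, h3 = 0.0, 1.0, 2.0  # constant predictions on the regions R_1, R_2, R_3

def h(x):
    """Decision-tree hypothesis: constant value h_m on each region R_m."""
    if np.linalg.norm(x - u) <= r:        # root decision node
        if np.linalg.norm(x - v) <= r:    # second decision node ("yes" branch)
            return h3
        return h2
    return h1                             # leaf reached via the "no" branch
```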
Figure 3.6: A decision tree represents a hypothesis h which is constant on subsets R_m, i.e.,
h(x) = h_m for all x ∈ R_m. Each subset R_m ⊆ X corresponds to a leaf node of the decision
tree.
Figure 3.7: A hypothesis space H consisting of two decision trees with depth at most 2 and
using the tests ‖x − u‖ ≤ r and ‖x − v‖ ≤ r with a fixed radius r and points u and v.
Figure 3.8: Using a sufficiently large (deep) decision tree, we can construct a map h that
perfectly fits any given labeled dataset {(x^(i), y^(i))}_{i=1}^m, such that h(x^(i)) = y^(i) for i = 1, …, m.
Figure 3.9: ANN representation of a predictor h^(w)(x) which maps the input (feature) vector
x = (x_1, x_2)^T to a predicted label (output) h^(w)(x).
fed into the input units, each of which reads in one single feature x_i ∈ R. The features x_i
are then multiplied with the weights w_{j,i} associated with the link between the i-th input
node (“neuron”) and the j-th node in the middle (hidden) layer. The output of the j-th
node in the hidden layer is given by s_j = g(∑_{i=1}^n w_{j,i} x_i) with some (typically non-linear)
activation function g(z). The input (or activation) z for the activation (or output) g(z)
of a neuron is a weighted (linear) combination ∑_i w_{j,i} s_i of the outputs s_i of the nodes in
the previous layer. For the ANN depicted in Figure 3.9, the activation of the neuron delivering s_1 is
z = w_{1,1} x_1 + w_{1,2} x_2.
Two popular choices for the activation function used within ANNs are the sigmoid
function g(z) = 1/(1 + exp(−z)) and the rectified linear unit (ReLU) g(z) = max{0, z}. An ANN with
many, say 10, hidden layers, is often referred to as a deep neural network and the obtained
ML methods are known as deep learning methods (see [26] for an in-depth introduction
to deep learning methods).
Remarkably, using some simple non-linear activation function g(z) as the building block
for ANNs allows to represent an extremely large class of predictor maps h(w) : Rn → R. The
hypothesis space generated by a given ANN structure, i.e., the set of all predictor maps which
can be implemented by a given ANN and suitable weights w, tends to be much larger than
the hypothesis space (2.4) of linear predictors using weight vectors w of the same length [26,
Ch. 6.4.1.]. It can be shown that an ANN with only one single hidden layer can approximate
any given map h : X → Y = R to any desired accuracy [19]. However, a key insight which
underlies many deep learning methods is that using several layers with few neurons, instead
of one single layer containing many neurons, is computationally favourable [21].
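A forward pass through an ANN like the one in Figure 3.9 takes only a few lines; the weights below are arbitrary illustrative values, not a trained network:

```python
import numpy as np

def relu(z):
    """ReLU activation function g(z) = max{0, z}."""
    return np.maximum(0.0, z)

# Hidden-layer weights w_{j,i}: row j holds the weights of hidden neuron j.
W_hidden = np.array([[1.0, -1.0],
                     [0.5,  0.5],
                     [-1.0, 2.0]])
w_out = np.array([1.0, -2.0, 0.5])  # weights feeding the output unit

def h(x):
    """Forward pass: s_j = g(sum_i w_{j,i} x_i), then a weighted sum of the s_j."""
    s = relu(W_hidden @ x)
    return w_out @ s
```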
Exercise. Consider the simple ANN structure in Figure 3.10 using the ReLU
activation function g(z) = max{0, z} (see Figure 3.11). Show that there is
a particular choice for the weights w = (w_1, …, w_9)^T such that the resulting
hypothesis map h^(w)(x) is a triangle as depicted in Figure 3.12. Can you also find
a choice for the weights w = (w_1, …, w_9)^T that produces the same triangle shape if
we replace the ReLU activation function with the linear function g(z) = 10 · z?
The recent success of ML methods based on ANNs with many hidden layers (which makes
them deep) might be attributed to the fact that the network representation of hypothesis
maps is beneficial for the computational implementation of ML methods. First, we can
evaluate a map h^(w) represented by an ANN efficiently using modern parallel and distributed
computing infrastructure via message passing over the network. Second, the ANN representation
also allows us to efficiently compute how the loss function changes with small modifications of the weights
w. The gradient of the overall loss or empirical risk (see Chapter 5) can be obtained via a
message passing procedure known as back-propagation [26].
Figure 3.10: This ANN with one hidden layer defines a hypothesis space consisting of all maps
h^(w)(x) obtained by implementing the ANN with different weight vectors w = (w_1, …, w_9)^T.
[Figure 3.11: a single neuron computing the activation g(z) from the inputs x_1, x_2, x_3 weighted by w_1, w_2, w_3.]
Figure 3.12: A hypothesis map with the shape of a triangle.
A widely used choice for the probability distribution P(z; w) is a multivariate normal
distribution with mean μ and covariance matrix Σ, both of which constitute the weight
vector w = (μ, Σ) (we have to reshape the matrix Σ suitably into a vector form). Given
the i.i.d. realizations z^(1), …, z^(m) ∼ P(z; w), the maximum likelihood estimates μ̂, Σ̂ of the
mean vector and the covariance matrix are obtained via

(μ̂, Σ̂) = argmin_{μ∈R^n, Σ∈S^n_+} (1/m) ∑_{i=1}^m −log P(z^(i); (μ, Σ)).    (3.24)
The optimization in (3.24) is over all pairs of a mean vector μ ∈ R^n and a covariance
matrix Σ ∈ S^n_+. Here, S^n_+ denotes the set of all positive semi-definite (psd) Hermitian n × n matrices.
Note that the maximum likelihood problem (3.24) can be interpreted as an instance of ERM
(4.2) using the particular loss function (3.23). The resulting estimates are given explicitly as
μ̂ = (1/m) ∑_{i=1}^m z^(i), and Σ̂ = (1/m) ∑_{i=1}^m (z^(i) − μ̂)(z^(i) − μ̂)^T.    (3.25)
Note that the expressions (3.25) are valid only when the probability distribution of the
data points is modelled as a multivariate normal distribution.
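The closed-form estimates (3.25) are straightforward to compute; note the factor 1/m (the maximum likelihood estimate), not the 1/(m − 1) used by the default of `np.cov`:

```python
import numpy as np

# m = 3 made-up realizations z^(1), ..., z^(m) in R^2.
z = np.array([[1.0, 2.0],
              [3.0, 0.0],
              [2.0, 4.0]])
m = z.shape[0]

mu_hat = z.mean(axis=0)                  # (1/m) sum_i z^(i)
centered = z - mu_hat
Sigma_hat = (centered.T @ centered) / m  # (1/m) sum_i (z^(i)-mu_hat)(z^(i)-mu_hat)^T
```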
Figure 3.13: A hypothesis map h for k-NN with k = 1 and feature space X = R^2. The
hypothesis map is constant over regions (indicated by the coloured areas) located around the
feature vectors x^(i) (indicated by dots) of a dataset D = {(x^(i), y^(i))}.
3.14 Dimensionality Reduction
Data points are whole datasets (collections of individual data points); the label is the hyperplane that allows
for optimal dimensionality reduction by projecting onto it. The notion of optimality depends
on the application at hand; one notion of optimality is obtained from approximation errors
(PCA).
3.17 LinUCB
Data points are customers characterized by a feature vector; the label is discrete and indicates
which product out of a finite set of products should be advertised to the customer.
A recent thread in ML is to use feature spaces whose structure better reflects the structure
of non-Euclidean data. One example of non-Euclidean data is network-structured data where
individual data points are related by some application-specific notion of similarity. For such
data it might be useful to use as a feature space a graph whose nodes represent individual
data points. Similar data points are connected by an edge.
A particular class of ML problems involves partially labeled network-structured data
arising in many important application domains including signal processing [18, 17], image
processing [47, 62], social networks, the internet and bioinformatics [55, 16, 22]. Such network-structured
data (see Figure 3.14) can be described by an “empirical graph” G = (V, E, W),
whose nodes V represent individual data points which are connected by edges E if they
are considered “similar” in an application-specific sense. The extent of similarity between
connected nodes i, j ∈ V is encoded in the edge weights W_{i,j} > 0, which are collected into
the weight matrix W ∈ R_+^{|V|×|V|}.
The notion of similarity between data points can be based on physical proximity (in time
or space), communication networks or probabilistic graphical models [45, 10, 41]. Besides
the graph structure, datasets carry additional information in the form of labels associated
with individual data points. In a social network, we might define the personal preference
for some product as the label associated with a data point (which represents a user profile).
Acquiring labels is often costly and requires manual labor or experiment design. Therefore,
we assume to have access to the labels of only a few data points, which belong to a small
“training set”.
The availability of accurate network models for datasets provides computational and
statistical benefits. Computationally, network models lend themselves naturally to highly scalable
ML methods which can be implemented as message passing over the empirical graph [11].
Network models enable borrowing statistical strength between connected data points, which
allows semi-supervised learning (SSL) methods to capitalize on massive amounts of unlabeled
data [16].
The key idea behind many SSL methods is the assumption that the labels of close-by data
points are similar, which allows combining partially labeled data with its network structure
in order to obtain predictors which generalize well [16, 6]. While SSL methods on graphs
have been applied to many application domains, the precise understanding of which types of
data allow for accurate SSL is still in its infancy [75, 53, 1].
Besides the empirical graph structure G, a dataset typically conveys additional information,
e.g., features, labels or model parameters. We can represent this additional information by
Figure 3.14: Examples of the empirical graph of networked data. (a) Chain graph
representing the signal amplitudes of a discrete-time signal. (b) Grid graph representing the pixels
of a 2D image. (c) Empirical graph G = (V, E, W) for a dataset obtained from the social
relations between the members of a karate club [77]. The empirical graph contains m nodes
i ∈ V = {1, …, m} which represent the m individual club members. Two nodes i, j ∈ V are
connected by an edge {i, j} ∈ E if the corresponding club members have interacted outside
the club.
a graph signal defined over G. A graph signal h[·] is a map V → R which associates every
node i ∈ V with the signal value h[i] ∈ R.
Most methods for processing graph signals rely on a signal model which is inspired by
a cluster assumption [16]. The cluster assumption requires similar signal values h[i] ≈ h[j]
at nodes i, j ∈ V which belong to the same well-connected subset of nodes (“cluster”) of
the empirical graph. The clusteredness of a graph signal h[·] can be measured by its total
variation (TV)

‖h‖_TV = ∑_{{i,j}∈E} W_{i,j} |h[i] − h[j]|.    (3.26)
Clustered graph signals arise in digital signal processing, which studies graph signals
defined over the chain graph representing sampling time instants. Signal samples at adjacent
time instants are strongly correlated for a sufficiently high sampling rate. Image processing
methods rely on close-by pixels tending to be coloured similarly, which amounts to a clustered
graph signal over a grid graph representing the pixels of a 2D image.
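Computing the TV (3.26) of a graph signal is a one-liner once the weighted edges are listed; the chain graph below with three nodes is a minimal made-up example:

```python
# Chain graph 0 - 1 - 2 with unit edge weights W_{i,j} = 1.
edges = {(0, 1): 1.0, (1, 2): 1.0}

# A clustered graph signal: nodes 0 and 1 form a cluster with identical values.
h = [0.0, 0.0, 1.0]

# Total variation (3.26): sum over edges of W_{i,j} * |h[i] - h[j]|.
tv = sum(w_ij * abs(h[i] - h[j]) for (i, j), w_ij in edges.items())
```

The signal is perfectly clustered on {0, 1}, so only the boundary edge (1, 2) contributes to the TV.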
The recently introduced network Lasso (nLasso) provides a formal ML problem for
network-structured data which can be represented by an empirical graph G. In particular,
the hypothesis space of the nLasso is constituted by the graph signals on G:
H = {h : V → Y}. (3.27)
The loss function of nLasso is a combination of squared error and TV (see (3.26))
The regularization parameter λ allows us to trade off a small prediction error y − h(x) against
the “clusteredness” of the predictor h.
Logistic Network Lasso. The logistic network Lasso [2, 3] is a modification of the
network Lasso (see Section 3.18) for classification problems involving partially labeled networked
data represented by an empirical graph G = (V, E, W).
Each data point z is characterized by the features x and is associated with a label y ∈ Y,
taking on values from a discrete label space Y. The simplest setting is binary classification
where each data point has a binary label y ∈ {−1, 1}. The hypothesis space underlying
logistic network Lasso is given by the graph signals on the empirical graph:
H = {h : V → Y} (3.29)
and the loss function is a combination of logistic loss and TV (see (3.26))
3.19 Exercises
3.19.1 How Many Neurons?
Consider a predictor map h(x) which is piecewise linear, consisting of 1000 pieces.
Assume we want to represent this map by an ANN using neurons with ReLU activation
functions. How many neurons must the ANN at least contain?
• logistic regression
• linear regression
• k-NN
Chapter 4

Empirical Risk Minimization
Figure 4.1: ML methods aim at learning a predictor h ∈ H that incurs small loss on any
data point. Empirical risk minimization approximates the expected loss or risk with the
empirical risk (solid curve) incurred on a finite set of labeled data points (the training set).
• and a loss function L((x, y), h) which measures the error incurred by predictor h ∈ H.
ML methods find, or learn, an accurate predictor map h out of the model H such that
h(x) ≈ y for any data point (x, y). The deviation between the predicted label ŷ = h(x) and
the true label y is measured by a loss function L((x, y), h). However, how can we make precise
the requirement that the loss should be small for any data point?
To assess how well a predictor map is doing for any data point, we can use the concept
of an expected loss or risk. The risk is defined as the expectation of the loss incurred by the
predictor for a randomly drawn data point. In this approach, we interpret data points as
realizations of random variables which are characterized by a probability distribution.
If the probability distribution underlying the data points is known, minimizing the
expected loss or risk amounts to computing the Bayes’ estimator for the label given the
feature vector. Roughly speaking, this estimator can be read off directly from the posterior
probability distribution of the label given the features.
In practice we do not know the true underlying probability distribution and have to
estimate it from data. Therefore, we cannot compute the Bayes’ optimal estimator exactly.
However, we can approximately compute this estimator by replacing the exact probability
distribution with an estimate. Moreover, the risk of the Bayes’ optimal estimator provides
a useful benchmark against which we can compare the average loss of practical ML methods.
Using a simple probabilistic model for data points, we formally define empirical risk
minimization (ERM) in Section 4.1. We then specialize the ERM to three particular ML
problems. In Section 4.2, we discuss the ERM obtained for linear regression (see Section 3.1).
The resulting ERM has appealing properties, as it amounts to minimizing a differentiable
(smooth) and convex function, which can be done efficiently using gradient-based methods
(see Chapter 5).
We then discuss in Section 4.3 the ERM obtained for decision trees, which yields a
discrete optimization problem and is therefore fundamentally different from the smooth
ERM obtained for linear regression. In particular, we cannot apply gradient-based methods
(see Chapter 5) to solve it but have to rely on discrete search methods.
Section 4.4 discusses how Bayes’ methods can be used to solve the non-convex and
non-differentiable ERM problem obtained for classification problems with the 0/1 loss (2.6).
As explained in Section 4.5, many ML methods use the ERM during a training period
to learn a hypothesis which is then applied to new data points during the inference period.
Section 4.6 discusses how to obtain online learning by solving the ERM sequentially as new
data points come in. Online learning can be interpreted as interleaving training and inference
periods.
4.1 Why Empirical Risk Minimization?
We assume that data points are i.i.d. realizations drawn from some fixed probability distribution
p(x, y). The probability distribution p(x, y) allows us to define the expected loss or risk
E{L((x, y), h)}.   (4.1)

ĥ = argmin_{h∈H} E(h|D)
  = argmin_{h∈H} (1/m) ∑_{i=1}^{m} L((x^(i), y^(i)), h)   (see (2.10)).   (4.2)
The objective function f(w) in (4.3) is the empirical risk E(h^(w) | D) achieved by h^(w) when
applied to the data points in the dataset D. Note that the two formulations (4.3) and (4.2)
are fully equivalent. In particular, given the optimal weight vector w_opt solving (4.3), the
predictor h^(w_opt) is an optimal predictor solving (4.2).
Learning a hypothesis via ERM (4.2) is a form of learning by “trial and error”. An
instructor (or supervisor) provides some snapshots z(i) which are characterized by features
x(i) and associated with known labels y (i) .
The learner then tries out some hypothesis h to tell the label y^(i) only from the snapshot
features x^(i) and determines the (training) error E(h|D) incurred. If the error E(h|D) is too
large, we try out another predictor map h′ instead of h, with the hope of achieving a smaller
training error E(h′|D).
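This trial-and-error search can be sketched in a few lines of Python; the tiny dataset and the three candidate hypotheses below are made up for illustration:

```python
# Brute-force ERM over a tiny finite model: score every candidate
# hypothesis by its average squared-error loss (the empirical risk)
# and keep the best one. Dataset and candidate slopes are made up.

def empirical_risk(h, data):
    """Average squared error of hypothesis h on labeled data points."""
    return sum((y - h(x)) ** 2 for x, y in data) / len(data)

def erm(model, data):
    """Return the hypothesis in `model` with minimum empirical risk."""
    return min(model, key=lambda h: empirical_risk(h, data))

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]              # (feature, label)
model = [lambda x, w=w: w * x for w in (1.0, 2.0, 3.0)]  # h(x) = w*x

best = erm(model, data)
print(best(1.0))   # the learned hypothesis has slope w = 2.0
```

For realistic (infinite) models, this exhaustive search is replaced by the optimization methods discussed in Chapter 5.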
We highlight that the precise shape of the objective function f(w) in (4.3) depends
heavily on the parametrization of the predictor functions, i.e., on how the predictor h^(w)
varies with the weight vector w.
The shape of f(w) also depends on the choice of the loss function L((x^(i), y^(i)), h).
As depicted in Figure 4.2, different combinations of predictor parametrization and loss
function can result in objective functions with fundamentally different properties, making
their optimization more or less difficult.
The objective function f (w) for the ERM obtained for linear regression (see Section 3.1)
is differentiable and convex and can therefore be minimized using simple iterative gradient
descent methods (see Chapter 5). In contrast, the objective function f (w) of ERM obtained
for the SVM (see Section 3.7) is non-differentiable but still convex. The minimization of such
functions is more challenging but still tractable as there exist efficient convex optimization
methods which do not require differentiability of the objective function [57].
The objective functions f(w) obtained for ANNs are typically highly non-convex, having
many local minima. Optimizing non-convex objective functions is in general more
difficult than optimizing convex ones. However, it turns out that, despite the
non-convexity, iterative gradient-based methods can still be successfully applied to solve the
ERM [26]. Even more challenging is the ERM obtained for decision trees or Bayes’ classifiers.
These ML problems involve non-differentiable and non-convex objective functions.
Figure 4.2: Different types of objective functions f(w) obtained for ERM in different settings.
Here, |D| denotes the cardinality (number of elements) of the set D. The objective function
f (w) in (4.4) has some computationally appealing properties, since it is convex and smooth
(see Chapter 5).
It will be useful to rewrite the ERM problem (4.4) using matrix and vector representations
of the feature vectors x(i) and labels y (i) contained in the dataset D. To this end, we stack
the labels y (i) and the feature vectors x(i) , for i = 1, . . . , m, into a “label vector” y and
“feature matrix” X as follows
Inserting (4.6) into (4.4), we obtain the quadratic problem

min_{w∈R^n} (1/2) wᵀQw − qᵀw =: f(w).   (4.7)
Since f(w) is a differentiable and convex function, a necessary and sufficient condition for
some w_opt to satisfy f(w_opt) = min_{w∈R^n} f(w) is the zero-gradient condition [12, Sec. 4.2.3]

∇f(w_opt) = 0.   (4.8)

Combining (4.7) with (4.8) yields the following necessary and sufficient condition for a
weight vector w_opt to solve the ERM (4.4):
It can be shown that, for any given feature matrix X and label vector y, there always
exists at least one optimal weight vector w_opt which solves (4.9). The optimal weight vector
might not be unique, i.e., there can be several different vectors achieving the minimum
in (4.4). However, any optimal solution w_opt, which solves (4.9), achieves the same minimum
empirical risk
E(h^(w_opt) | D) = min_{w∈R^n} E(h^(w) | D) = ‖(I − P)y‖².   (4.10)
Here, we used the orthogonal projection matrix P ∈ Rm×m on the linear span of the feature
matrix X = (x(1) , . . . , x(m) )T ∈ Rm×n (see (4.5)).1
If the feature matrix X (see (4.5)) has full column rank, implying invertibility of the
matrix XᵀX, the projection matrix P is given explicitly as

P = X(XᵀX)⁻¹Xᵀ.   (4.11)
The closed-form solution (4.11) requires the inversion of the n × n matrix XᵀX. Computing
the inverse can be computationally challenging for large feature length n (see Figure 2.3 for
a simple ML problem where the feature length is almost a million). Moreover, inverting a
matrix which is close to singular typically introduces numerical errors.

¹The linear span span(A) of a matrix A = (a^(1), . . . , a^(m)) ∈ R^{n×m} is the subspace of R^n consisting of all
linear combinations of the columns a^(r) ∈ R^n of A.
Section 5.4 discusses a method for computing the optimal weight vector wopt which does
not require any matrix inversion. This method, referred to as gradient descent, constructs
a sequence w(0) , w(1) , . . . of increasingly accurate approximations of wopt . This iterative
method has two major benefits compared to evaluating the formula (4.11) using direct matrix
inversion, such as Gauss-Jordan elimination [25]. First, gradient descent requires far fewer
arithmetic operations than direct matrix inversion. This is crucial in modern ML
applications involving large feature matrices. Second, gradient descent does not break down
when the matrix X does not have full column rank, in which case the formula (4.11) cannot be used any more.
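As a minimal sketch (with synthetic, noiseless data chosen for illustration), the closed-form solution can be compared against `numpy.linalg.lstsq`, which solves the same least-squares problem without forming the inverse explicitly:

```python
import numpy as np

# Closed-form least-squares solution w_opt = (X^T X)^{-1} X^T y versus
# numpy.linalg.lstsq, which solves the same problem without forming the
# inverse explicitly. The noiseless synthetic data is illustrative only.
rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.normal(size=(m, n))            # full column rank (almost surely)
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w_closed = np.linalg.inv(X.T @ X) @ (X.T @ y)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_closed, w_true), np.allclose(w_lstsq, w_true))
```

The `lstsq` route also works when X is rank-deficient, in which case the explicit inverse in the first formula does not exist.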
4.3 ERM for Decision Trees

The idea behind many decision tree learning methods is quite simple: try
expanding a decision tree by replacing a leaf node with a decision node
(implementing another “test” on the feature vector) in order to reduce the overall
empirical risk as much as possible.
Consider the labeled dataset D depicted in Figure 4.3 and a given decision tree for
predicting the label y based on the features x. We start with a very simple tree shown in the
top of Figure 4.3. Then we try out growing the tree by replacing a leaf node with a decision
node. According to Figure 4.3, replacing the right leaf node results in a decision tree which
is able to perfectly represent the training dataset (it achieves zero empirical risk).
[Figure: scatter plot of the labeled data points x^(1), . . . , x^(4) in the x1–x2 plane, together
with decision trees built from the tests “x1 ≤ 3?” and “x2 ≤ 3?”.]

Figure 4.3: Given the labeled dataset and a decision tree in the top row, we grow the decision
tree by expanding it at one of its two leaf nodes. The resulting new decision trees, obtained
by expanding different leaf nodes, are shown in the bottom row.
One important aspect of learning decision trees from labeled data is the question of when
to stop growing. A natural stopping criterion might be obtained from the limitations in
computational resources, i.e., we can only afford to use decision trees up to a certain maximum
depth. Besides the computational limitations, we also face statistical limitations on the
maximum size of decision trees. With very large decision trees, which represent highly complicated
maps, we might end up overfitting the training data (see Figure 3.8 and Chapter 7), which
is detrimental for the prediction performance of decision trees on new data (which
has not been used for training or growing the decision tree).
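The greedy growing step described above (replace a leaf by the test that most reduces the empirical risk) can be sketched for a one-dimensional toy dataset; a stopping rule would simply cap how often such a step is repeated. All names and the dataset below are illustrative:

```python
# One greedy growing step for a decision stump: try every threshold
# test "x <= t?" on a scalar feature and keep the test whose two leaves
# (each predicting the majority label of its data points) incur the
# smallest empirical 0/1 risk. The toy dataset is made up.

def majority(labels):
    """Most frequent label; 0 for an empty leaf."""
    return max(set(labels), key=labels.count) if labels else 0

def stump_risk(data, t):
    """Empirical 0/1 risk of the stump with test x <= t."""
    left = [y for x, y in data if x <= t]
    right = [y for x, y in data if x > t]
    h_left, h_right = majority(left), majority(right)
    errors = sum(y != h_left for y in left) + sum(y != h_right for y in right)
    return errors / len(data)

def grow(data):
    """Return the threshold that minimizes the empirical risk."""
    return min({x for x, _ in data}, key=lambda t: stump_risk(data, t))

data = [(1, -1), (2, -1), (3, -1), (4, 1), (5, 1)]
t_best = grow(data)
print(t_best, stump_risk(data, t_best))   # perfect split at t = 3
```

Deeper trees are obtained by applying the same step recursively to each leaf, subject to the stopping criteria discussed above.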
4.4 ERM for Bayes’ Classifiers
The family of Bayes’ classifiers is based on using the 0/1 loss (2.6) for measuring the quality
of a classifier h. The resulting ERM is
ĥ = argmin_{h∈H} (1/m) ∑_{i=1}^{m} L((x^(i), y^(i)), h)
  = argmin_{h∈H} (1/m) ∑_{i=1}^{m} I(h(x^(i)) ≠ y^(i))   (see (2.6)).   (4.12)
Note that the objective function of this optimization problem is non-smooth (non-differentiable)
and non-convex (see Figure 4.2). This prevents us from using standard gradient-based
optimization methods (see Chapter 5) to solve (4.12).
We will now approach the ERM (4.12) via a different route by interpreting the data
points (x(i) , y (i) ) as realizations of i.i.d. random variables which are distributed according to
some probability distribution p(x, y). As discussed in Section 2.3, the empirical risk obtained
using the 0/1 loss approximates the error probability P(ŷ ≠ y), with the predicted label ŷ = 1
for h(x) > 0 and ŷ = −1 otherwise (see (2.7)). Thus, we can approximate the ERM (4.12)
as
ĥ ≈ argmin_{h∈H} P(ŷ ≠ y)   (see (2.7)).   (4.13)
Note that the hypothesis h, which is the optimization variable in (4.13), enters into the
objective function of (4.13) via the definition of the predicted label ŷ, which is ŷ = 1 if
h(x) > 0 and ŷ = −1 otherwise.
It turns out that if we knew the probability distribution p(x, y), which is required
to compute P(ŷ ≠ y), the solution of (4.13) could be found easily via elementary Bayesian
decision theory [58]. In particular, the optimal classifier h(x) is such that ŷ achieves the
maximum “a-posteriori” probability p(ŷ|x) of the label being ŷ, given (or conditioned on)
the features x. However, since we do not know the probability distribution p(x, y), we have
to estimate (or approximate) it from the observed data points (x^(i), y^(i)), which are modelled
as i.i.d. random variables distributed according to p(x, y).
The estimation of p(x, y) can be based on assuming a particular probabilistic model for the features
and labels, which depends on certain parameters, and then determining those parameters using
maximum likelihood (see Section 3.12). A widely used probabilistic model is based on
Gaussian random vectors. In particular, conditioned on the label y, we model the feature
vector x as a Gaussian vector with mean µ_y and covariance Σ, i.e.,²

p(x|y) = N(x; µ_y, Σ).   (4.14)
Note that the mean vector of x depends on the label: for y = 1 the mean of x is µ₁,
while for data points with label y = −1 the mean of x is µ₋₁. In contrast, the covariance
matrix Σ = E{(x − µ_y)(x − µ_y)ᵀ | y} of x is the same for both values of the label y ∈ {−1, 1}.
Note that, while conditioned on y the random vector x is Gaussian, the marginal distribution
of x is a Gaussian mixture model (see Section 8.2). For this probabilistic model of features
and labels, the optimal classifier minimizing the error probability P(ŷ ≠ y) is ŷ = 1 for
h(x) > 0 and ŷ = −1 for h(x) ≤ 0, using the classifier map

h(x) = wᵀx with w = Σ⁻¹(µ₁ − µ₋₁).   (4.15)
Carefully note that this expression is only valid if the matrix Σ is invertible.
We cannot implement the classifier (4.15) directly, since we do not know the true values
of the class-specific mean vectors µ₁, µ₋₁ and the covariance matrix Σ. Therefore, we have to
replace those unknown parameters with estimates µ̂₁, µ̂₋₁ and Σ̂, such as the maximum
likelihood estimates, which are given by (see (3.25))
µ̂₁ = (1/m₁) ∑_{i=1}^{m} I(y^(i) = 1) x^(i),

µ̂₋₁ = (1/m₋₁) ∑_{i=1}^{m} I(y^(i) = −1) x^(i),

µ̂ = (1/m) ∑_{i=1}^{m} x^(i),

and Σ̂ = (1/m) ∑_{i=1}^{m} (x^(i) − µ̂)(x^(i) − µ̂)ᵀ,   (4.16)

with m₁ = ∑_{i=1}^{m} I(y^(i) = 1) denoting the number of data points with label y = 1 (m₋₁
²We use the shorthand N(x; µ, Σ) to denote the probability density function

p(x) = (1/√(det(2πΣ))) exp(−(1/2)(x − µ)ᵀΣ⁻¹(x − µ))

of a Gaussian random vector x with mean µ = E{x} and covariance matrix Σ = E{(x − µ)(x − µ)ᵀ}.
is defined similarly). Inserting the estimates (4.16) into (4.15) yields the implementable
classifier

h(x) = wᵀx with w = Σ̂⁻¹(µ̂₁ − µ̂₋₁).   (4.17)
We highlight that the classifier (4.17) is only well-defined if the estimated covariance matrix
Σ̂ in (4.16) is invertible. This requires using a sufficiently large number of training data points,
such that m ≥ n.
Using the route via maximum likelihood estimation, we arrived at (4.17) as an approximate
solution to the ERM (4.12). The final classifier (4.17) turns out to be a linear classifier very
much like logistic regression and SVM. In particular, the classifier (4.17) partitions the
feature space Rn into two halfspaces: one for ŷ = 1 and one for ŷ = −1 (see Figure 2.7).
Thus, the Bayes’ classifier (4.17) belongs to the same family (of linear classifiers) as logistic
regression and the SVM. These three classification methods differ only in how they choose
the decision boundary (see Figure 2.7) separating the two half-spaces in the feature space.
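A minimal plug-in implementation of the estimates (4.16) and the classifier (4.17) might look as follows; the Gaussian toy data (means, seed, sample size) is made up for illustration:

```python
import numpy as np

# Plug-in Bayes classifier (4.17): estimate the class means and the
# shared covariance as in (4.16), then classify via the sign of w^T x
# with w = Sigma_hat^{-1} (mu_hat_1 - mu_hat_-1). The Gaussian toy
# data below (means, seed, sample size) is made up for illustration.
rng = np.random.default_rng(1)
m, n = 200, 2
y = np.where(rng.random(m) < 0.5, 1, -1)
means = {1: np.array([2.0, 0.0]), -1: np.array([-2.0, 0.0])}
X = np.stack([rng.normal(size=n) + means[label] for label in y])

mu1 = X[y == 1].mean(axis=0)         # estimate of mu_1
mu_minus1 = X[y == -1].mean(axis=0)  # estimate of mu_-1
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / m    # covariance estimate as in (4.16)

# Solve Sigma w = mu1 - mu_minus1 instead of inverting Sigma explicitly.
w = np.linalg.solve(Sigma, mu1 - mu_minus1)
y_hat = np.where(X @ w > 0, 1, -1)
print((y_hat == y).mean())           # training accuracy
```

Using `np.linalg.solve` instead of an explicit matrix inverse is a standard numerical choice; it fails (as it should) when Σ̂ is singular, e.g., for m < n.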
For the estimator Σ̂ (3.25) to be accurate (close to the unknown covariance matrix), we
need a number of data points (sample size) which is at least on the order of n². This sample
size requirement might be infeasible for applications with only few data points available.
The maximum likelihood estimate Σ̂ (4.16) is not invertible whenever m < n. In this
case, the expression (4.17) becomes useless. To cope with a small sample size m < n, we can
simplify the model (4.14) by requiring the covariance matrix to be diagonal, Σ = diag(σ₁², . . . , σ_n²).
This is equivalent to modelling the individual features x₁, . . . , x_n of a data point
as conditionally independent, given the label y of the data point. The resulting special case of
a Bayes’ classifier is often referred to as a naive Bayes classifier.
We finally highlight that the classifier (4.17) is obtained using the generative model (4.14)
for the data. Therefore, Bayes’ classifiers belong to the family of generative ML methods
which involve modelling the data generation. In contrast, logistic regression and SVM do
not require a generative model for the data points but aim directly at finding the relation
between features x and label y of a data point. These methods belong therefore to the family
of discriminative ML methods.
Generative methods such as the Bayes’ classifier are preferable for applications with only very
limited amounts of labeled data. Indeed, having a generative model such as (4.14) allows us to
synthetically generate more labeled data by generating random features and labels according
to the probability distribution (4.14). We refer to [56] for a more detailed comparison between
generative and discriminative methods.
4.5 Training and Inference Periods
Some ML methods repeat the cycle in Figure 1 in a highly irregular fashion. Consider a
large image collection which we use to learn a hypothesis about what cat images look like.
It might be reasonable to adjust the hypothesis by fitting a model to the image collection.
This fitting or training amounts to repeating the cycle in Figure 1 for a large number of
iterations during some specific time period (the “training time”). After the training period,
we only apply the hypothesis to predict the labels of new images. This second phase is also
known as the inference period and might be much longer than the training period. Ideally,
we would like to have only a very short training period to learn a good hypothesis and then
only use the hypothesis for inference.
and y^(m):

m = 1:   X^(1) = (x^(1))ᵀ,   y^(1) = (y^(1))ᵀ,   (4.18)
m = 2:   X^(2) = (x^(1), x^(2))ᵀ,   y^(2) = (y^(1), y^(2))ᵀ,   (4.19)
m = 3:   X^(3) = (x^(1), x^(2), x^(3))ᵀ,   y^(3) = (y^(1), y^(2), y^(3))ᵀ.   (4.20)
Note that in this online learning setting, the sample size m has the meaning of a time index.
Naively, we could try to solve the optimality condition (2.18) for each time step m.
However, this approach does not reuse computations already invested in solving (2.18) at
previous time steps m′ < m.
4.7 Exercise
4.7.1 Uniqueness in Linear Regression
Consider linear regression with squared error loss. When is the optimal linear predictor
unique? Does there always exist an optimal linear predictor?
x(1) , y (1) , . . . , x(m) , y (m) , we construct the feature matrix X ∈ Rm×m . The columns of
the feature matrix are the feature vectors x(i) . Is this feature matrix a Vandermonde matrix
[24]? Can you say something about the determinant of the feature matrix?
Chapter 5
ML methods are optimization methods that learn an optimal hypothesis out of the model.
The quality of each hypothesis is measured, or scored, by some average loss or empirical risk.
This average loss, viewed as a function of the hypothesis, defines an objective function whose
minimum is achieved by the optimal hypothesis.
Many ML methods use gradient-based methods to efficiently search for a (nearly) optimal
hypothesis. These methods locally approximate the objective function by a linear function,
which is used to improve the current guess for the optimal hypothesis. The prototype of
gradient-based optimization methods is gradient descent (GD).
Variants of GD are used to tune the weights of artificial neural networks within deep
learning methods [26]. GD can also be applied in reinforcement learning applications. The
difference between these applications lies merely in the details of how to compute or estimate
the gradient and how to incorporate the information provided by the gradients.
In the following, we will mainly focus on ML problems with hypothesis space H consisting
of predictor maps h(w) which are parametrized by a weight vector w ∈ Rn . Moreover, we
will restrict ourselves to loss functions L((x, y), h(w) ) which depend smoothly on the weight
vector w.
Many important ML problems, including linear regression (see Section 3.1) and logistic
regression (see Section 3.6), involve a smooth loss function. A smooth function f : R^n → R
has continuous partial derivatives of all orders. In particular, we can define the gradient
∇f(w) of a smooth function f(w) at every point w.
For a smooth loss function, the resulting ERM (see (4.3))
The approximation (5.3) lends itself naturally to an iterative method for finding the minimum
of the function f(w). This method is known as gradient descent (GD), and variants of it
underlie many state-of-the-art ML methods, including deep learning methods.
Figure 5.1: A smooth function f (w) can be approximated locally around a point w0 using a
hyperplane whose normal vector n = (∇f (w0 ), −1) is determined by the gradient ∇f (w0 ).
[Figure 5.2: starting from the current iterate w^(k), the GD step moves by −α∇f(w^(k)) to
obtain the new iterate w^(k+1).]
with a sufficiently small step size α > 0 (a small α ensures that the linear approximation
(5.3) is valid). Then, we repeat this procedure to obtain w(k+2) = w(k+1) − α∇f (w(k+1) )
and so on.
The update (5.4) amounts to a gradient descent (GD) step. For a convex differentiable
objective function f(w) and a sufficiently small step size α, the values f(w^(k)) obtained by
repeating the GD steps (5.4) converge to the minimum, i.e., lim_{k→∞} f(w^(k)) = f(w_opt) (see
Figure 5.2).
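As a minimal sketch of this iteration, consider a one-dimensional convex quadratic; the objective, step size, and iteration count below are illustrative choices:

```python
# A single-variable GD sketch: iterate w <- w - alpha * f'(w) on the
# convex quadratic f(w) = (w - 3)^2, whose minimizer is w_opt = 3.
# The function, step size and iteration count are illustrative choices.

def grad_f(w):
    """Gradient (derivative) of f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w = 0.0        # initialization w^(0)
alpha = 0.1    # sufficiently small step size (learning rate)
for _ in range(100):
    w = w - alpha * grad_f(w)   # GD step (5.4)

print(round(w, 6))   # converges towards the minimizer w_opt = 3
```

For this quadratic, each step multiplies the distance to the minimizer by the constant factor 1 − 2α, which illustrates why too large an α (here, α > 1) makes the iterates diverge.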
When the GD step is used within an ML method (see Section 5.4 and Section 3.6), the
step size α is also referred to as the learning rate.
In order to implement the GD step (5.4) we need to choose the step size α and we need
to be able to compute the gradient ∇f (w(k) ). Both tasks can be very challenging for an ML
problem.
The success of deep learning methods, which represent predictor maps using ANN (see
Section 3.11), can be partially attributed to the ability of computing the gradient ∇f (w(k) )
efficiently via a message passing protocol known as back-propagation [26].
For the particular case of linear regression (see Section 3.1) and logistic regression (see
Section 3.6), we will present precise conditions on the step size α which guarantee convergence
of GD in Section 5.4 and Section 5.5. Moreover, the objective functions f(w) arising within
linear and logistic regression allow for closed-form expressions of the gradient ∇f(w).
Figure 5.3: Effect of choosing learning rate α in GD step (5.4) too small (a) or too large (b).
If the step size α in the GD step (5.4) is chosen too small, the iterations make only very
little progress towards the optimum. If the learning rate α is chosen too large, the iterates
w^(k) might not converge at all (it might happen that f(w^(k+1)) > f(w^(k))!).
The choice of the step size α in the GD step (5.4) has a strong impact on the performance
of Algorithm 1. If we choose the step size α too large, the GD steps (5.4) diverge (see Figure
5.3-(b)) and, in turn, Algorithm 1 fails to deliver an approximation of the optimal weight
vector w_opt (see (5.7)).
If we choose the step size α too small (see Figure 5.3-(a)), the updates (5.4) make only
very little progress towards approximating the optimal weight vector w_opt. In applications
that require real-time processing of data streams, it is possible to repeat the GD steps only
for a moderate number of iterations. Thus, if the GD step size is chosen too small, Algorithm 1
will fail to deliver a good approximation of w_opt within an acceptable amount of computation time.
The optimal choice of the step size α of GD can be a challenging task, and many
sophisticated approaches have been proposed for its solution (see [26, Chapter 8]). We
will restrict ourselves to a simple sufficient condition on the step size which guarantees
convergence of the GD iterations w^(k) for k = 1, 2, . . ..
If the objective function f(w) is convex and smooth, the GD steps (5.4) converge to an
optimum w_opt for any step size α satisfying [54]

α ≤ 1 / λ_max(∇²f(w))   for all w ∈ R^n.   (5.5)

Here, we use the Hessian matrix ∇²f(w) ∈ R^{n×n} of the smooth function f(w), whose entries
are the second-order partial derivatives ∂²f(w)/(∂w_i ∂w_j) of the function f(w). It is important to note
that (5.5) guarantees convergence for every possible initialization w(0) of the GD iterations.
Note that while it might be computationally challenging to determine the maximum
eigenvalue λ_max(∇²f(w)) for arbitrary w, it might still be feasible to find an upper bound
U for the maximum eigenvalue. If we know an upper bound U ≥ λ_max(∇²f(w)) (valid for
all w ∈ R^n), the step size α = 1/U still ensures convergence of the GD iteration.
The optimal weight vector w_opt for (5.6) should minimize the empirical risk (see (4.3)) under
squared error loss (2.5),

E(h^(w) | D) = (1/m) ∑_{i=1}^{m} (y^(i) − wᵀx^(i))²,   (5.7)

incurred by the predictor h^(w)(x) when applied to the labeled dataset D = {(x^(i), y^(i))}_{i=1}^{m}.
Thus, wopt is obtained as the solution of a particular smooth optimization problem (5.2),
i.e.,

w_opt = argmin_{w∈R^n} f(w)   with   f(w) = (1/m) ∑_{i=1}^{m} (y^(i) − wᵀx^(i))².   (5.8)
In order to apply GD (5.4) to solve (5.8), and to find the optimal weight vector w_opt,
we need to compute the gradient ∇f(w). The gradient of the objective function in (5.8) is
given by

∇f(w) = −(2/m) ∑_{i=1}^{m} (y^(i) − wᵀx^(i)) x^(i).   (5.9)
The update (5.10) has an appealing form as it amounts to correcting the previous guess (or
approximation) w^(k−1) for the optimal weight vector w_opt by the correction term

(2α/m) ∑_{i=1}^{m} (y^(i) − (w^(k−1))ᵀx^(i)) x^(i),   with e^(i) := y^(i) − (w^(k−1))ᵀx^(i).   (5.11)

The correction term (5.11) is a weighted average of the feature vectors x^(i) using weights
(2α/m) · e^(i). These weights consist of the global factor (2α/m) (which applies equally to
all feature vectors x^(i)) and a sample-specific factor e^(i) = y^(i) − (w^(k−1))ᵀx^(i), which
is the prediction (approximation) error obtained by the linear predictor h^(w^(k−1))(x^(i)) =
(w^(k−1))ᵀx^(i) when predicting the label y^(i) from the features x^(i).
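These updates can be sketched as follows; the noiseless synthetic data, and the step-size choice via the maximum eigenvalue of the Hessian (2/m)XᵀX (cf. (5.5)), are illustrative assumptions:

```python
import numpy as np

# GD for linear regression: repeat w <- w + (2*alpha/m) * X^T (y - X w),
# i.e., the update with correction term (5.11). The step size is chosen
# via (5.5) from the maximum eigenvalue of the Hessian (2/m) X^T X.
# The noiseless data is synthetic, for illustration only.
rng = np.random.default_rng(2)
m, n = 100, 2
X = rng.normal(size=(m, n))
w_true = np.array([1.5, -0.5])
y = X @ w_true

hessian = (2.0 / m) * X.T @ X
alpha = 1.0 / np.linalg.eigvalsh(hessian).max()   # alpha <= 1/lambda_max

w = np.zeros(n)
for _ in range(500):
    e = y - X @ w                          # prediction errors e^(i)
    w = w + (2.0 * alpha / m) * (X.T @ e)  # GD step with gradient (5.9)

print(np.allclose(w, w_true))
```

Note that no matrix inversion is required at any point, in line with the discussion in Section 4.2.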
We can interpret the GD step (5.10) as an instance of “learning by trial and error”.
Indeed, the GD step amounts to “trying out” the predictor h(x^(i)) = (w^(k−1))ᵀx^(i)
and then correcting the weight vector w^(k−1) according to the error e^(i) = y^(i) −
(w^(k−1))ᵀx^(i).
The choice of the step size α used for Algorithm 1 can be based on the sufficient condition
(5.5) with the Hessian ∇²f(w) of the objective function f(w) underlying linear regression
(see (5.8)). This Hessian is given explicitly as
regression amounts to an instance of the smooth optimization problem (5.2), i.e.,
In order to apply GD (5.4) to solve (5.13), we need to compute the gradient ∇f(w). The
gradient of the objective function in (5.13) is given by

∇f(w) = (1/m) ∑_{i=1}^{m} ( −y^(i) / (1 + exp(y^(i) wᵀx^(i))) ) x^(i).   (5.14)
The update (5.15) has an appealing form as it amounts to correcting the previous guess (or
approximation) w^(k−1) for the optimal weight vector w_opt by the correction term

(α/m) ∑_{i=1}^{m} ( y^(i) / (1 + exp(y^(i) wᵀx^(i))) ) x^(i),   with e^(i) := y^(i) / (1 + exp(y^(i) wᵀx^(i))).   (5.16)

The correction term (5.16) is a weighted average of the feature vectors x^(i), each of which is
weighted by the factor (α/m) · e^(i). These weighting factors consist of the global
factor (α/m) (which applies equally to all feature vectors x^(i)) and a sample-specific factor
e^(i) = y^(i)/(1 + exp(y^(i) wᵀx^(i))), which quantifies the error of the classifier h^(w^(k−1))(x^(i)) = (w^(k−1))ᵀx^(i)
for a data point with true label y^(i) ∈ {−1, 1} and features x^(i) ∈ R^n.
We can use the sufficient condition (5.5) (which guarantees convergence of GD) to guide
the choice of the step size α in Algorithm 2. In order to apply condition (5.5), we need
to determine the Hessian matrix ∇²f(w) of the objective function f(w) underlying logistic
regression (see (5.13)). Some basic calculus reveals (see [30, Ch. 4.4])

We highlight that, in contrast to the Hessian (5.12) obtained for the objective function arising
in linear regression, the Hessian (5.17) varies with the weight vector w. This makes the
analysis of Algorithm 2 and the optimal choice of the step size somewhat more difficult compared
to Algorithm 1. However, since the diagonal entries (5.18) take values in the interval [0, 1],
for normalized features (with ‖x^(i)‖ = 1) the step size α = 1 ensures convergence of the GD
updates (5.15) to the optimal weight vector w_opt solving (5.13).
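A sketch of these GD updates for logistic regression, using normalized features and the step size α = 1 discussed above; the toy data (clusters, seed, labels) is made up for illustration:

```python
import numpy as np

# GD for logistic regression: repeat the update (5.15) with the gradient
# (5.14). Features are normalized to unit length so that, per the text,
# the step size alpha = 1 guarantees convergence. Data is synthetic.
rng = np.random.default_rng(3)
m, n = 100, 2
X = rng.normal(size=(m, n))
X[:, 0] += np.where(rng.random(m) < 0.5, 2.0, -2.0)  # two clusters
y = np.where(X[:, 0] > 0, 1, -1)                     # labels from a linear rule
X /= np.linalg.norm(X, axis=1, keepdims=True)        # ||x^(i)|| = 1

alpha, w = 1.0, np.zeros(n)
for _ in range(500):
    e = y / (1.0 + np.exp(y * (X @ w)))   # sample-specific factors e^(i)
    w = w + (alpha / m) * X.T @ e         # correction term (5.16)

accuracy = np.mean(np.where(X @ w > 0, 1, -1) == y)
print(accuracy)
```

Correctly classified points with a large margin contribute factors e^(i) ≈ 0, so the updates are driven mainly by misclassified or near-boundary points.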
κ(XᵀX) [34]. Thus, GD will be faster for datasets with a feature matrix X such that
κ(XᵀX) ≈ 1. It is therefore often beneficial to pre-process the feature vectors using a
normalization (or standardization) procedure as detailed in Algorithm 3.
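A quick illustration of this effect, using a simple zero-mean, unit-variance standardization (a common choice, assumed here; not necessarily identical to Algorithm 3):

```python
import numpy as np

# Standardizing each feature to zero mean and unit variance (a common
# choice; not necessarily identical to Algorithm 3) can drastically
# reduce the condition number kappa(X^T X). Synthetic data with two
# features on wildly different scales:
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 2)) * np.array([100.0, 0.01])

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

cond_before = np.linalg.cond(X.T @ X)
cond_after = np.linalg.cond(X_norm.T @ X_norm)
print(cond_before, cond_after)   # condition number drops to roughly 1
```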
Exercise. Consider the dataset with feature vectors x^(1) = (100, 0)ᵀ ∈ R² and
x^(2) = (0, 1/10)ᵀ, which we stack into the matrix X = (x^(1), x^(2))ᵀ. What is
the condition number of XᵀX? What is the condition number of X̂ᵀX̂, with
the matrix X̂ = (x̂^(1), x̂^(2))ᵀ constructed from the normalized feature vectors x̂^(i)
delivered by Algorithm 3?
5.7 Stochastic GD
Consider an ML problem with a hypothesis space H which is parametreized by a weight
vector w ∈ Rn (such that each element h(w) of H corresponds to a particular choice of w)
and a loss function L((x, y), h(w) ) which depends smoothly on the weight vector w. The
resulting ERM (5.1) amounts to a smooth optimization problem which can be solved using
GD (5.4).
Note that the gradient ∇f(w) obtained for the optimization problem (5.1) has a particular
structure. Indeed, the gradient is a sum

∇f(w) = (1/m) ∑_{i=1}^{m} ∇f_i(w)   with   f_i(w) := L((x^(i), y^(i)), h^(w)).   (5.20)
Evaluating the gradient ∇f(w) (e.g., within a GD step (5.4)) by computing the sum in
(5.20) can be computationally challenging for at least two reasons. First, computing the
sum exactly is challenging for extremely large datasets with m in the order of billions.
Second, for datasets which are stored in different data centres located all over the world, the
summation would require a huge amount of network resources and also limit the rate
at which the GD steps (5.4) can be executed.
ImageNet. The “ImageNet” database contains more than 10⁶ images [42]. These
images are labeled according to their content (e.g., does the image show a dog?).
Let us assume that each image is represented by a (rather small) feature vector
x ∈ R^n of length n = 1000. Then, if we represent each feature by a floating point
number, performing only one single GD update (5.4) per second would require at
least 10⁹ FLOPS.
The idea of stochastic GD (SGD) is quite simple: replace the exact gradient ∇f(w)
by some approximation which can be computed more easily than (5.20). The word “stochastic”
in the name SGD already hints at the use of randomness (stochastic approximations).
One basic variant of SGD approximates the gradient ∇f(w) (see (5.20)) by a single randomly
selected component ∇f_î(w) of the sum in (5.20), with the index î chosen randomly out of {1, . . . , m}.
SGD amounts to iterating the update
It is important to use a fresh randomly chosen index î during each new iteration. The indices
used in different iterations are statistically independent.
Note that SGD replaces the summation over all training data points in the GD step (5.4)
just by the random selection of a single component of the sum. The resulting savings in
computational complexity can be significant in applications where a large number of data
points is stored in a distributed fashion. However, this saving in computational complexity
comes at the cost of introducing a non-zero gradient noise

ε := ∇f(w) − ∇f_î(w) (5.22)

into the SGD updates. In order to avoid the accumulation of the gradient noise (5.22) while
running the SGD updates (5.21), the step size α needs to be gradually decreased, e.g., using
α = 1/k with k being the iteration counter (see [52]).
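As a concrete sketch (not from the book; all function and variable names are illustrative), the SGD update (5.21) with decaying step size α = 1/k can be implemented for linear regression with squared error loss as follows:

```python
import numpy as np

def sgd_linreg(X, y, num_iters=50000, seed=0):
    """SGD for linear regression: each iteration uses only the gradient of the
    loss of one randomly chosen data point, with decaying step size 1/k."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for k in range(1, num_iters + 1):
        i = rng.integers(m)                      # fresh random index each iteration
        grad_i = 2.0 * (X[i] @ w - y[i]) * X[i]  # gradient of the i-th summand
        w -= (1.0 / k) * grad_i                  # step size alpha = 1/k
    return w
```

Compared with a full GD step, each iteration touches only one of the m data points, so the per-iteration cost drops from order mn to order n.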
The SGD iteration (5.21) assumes that the training data has already been collected but is so large
that the sum in (5.20) is computationally intractable. Another variant of SGD is obtained by
assuming a different data generation mechanism. If the data points are collected sequentially,
with one new data point (x^(t), y^(t)) arriving at each time step t, we can use a SGD variant for online
learning (see Section 4.6). This online SGD algorithm amounts to computing, for each time
step t, the iteration

w^(t+1) = w^(t) − α_t ∇f_{t+1}(w^(t)). (5.23)
5.8 Exercises
5.8.1 Use Knowledge About Problem Class
Consider the space P of sequences f = (f[0], f[1], . . .) that have the following properties:
• a change point n, where f[n] ≠ f[n + 1], can only occur at integer multiples of 100, e.g.,
n = 100 or n = 300.
Given some unknown function f ∈ P and starting point n_0, the problem is to find the
minimum value of f as quickly as possible. We consider iterative algorithms that can query
the function at some point n to obtain the values f[n], f[n − 1] and f[n + 1].
Chapter 6
The idea of ERM is to learn a hypothesis out of H that incurs minimum average loss
(empirical error) on a set of labelled data points, which is used as training set. For ML
methods using high-dimensional hypothesis spaces, such as linear maps with a large number
of features or deep neural networks, this approach bears the risk of overfitting.
A method overfits if it learns a predictor h ∈ H that, merely by luck, fits well the training
data but does a poor job on other data. Such a predictor will fail to generalize well to new
data for which we do not know the label y but only the features x (if we would know the
label, then there is no point in learning predictors which estimate the label).
This chapter discusses a few basic techniques to detect and avoid overfitting. To detect
overfitting we need to monitor, or validate, the performance of the predictor h on new
data points which are not contained in the training set. We call the set of data points used
for validation the validation set. The empirical risk incurred by the predictor h on the
validation set is referred to as the validation error. If a method overfits, it will learn a predictor
whose training error is much smaller than the validation error.
Validation is useful not only for verifying if the predictor generalises well to new data
(in particular detecting overfitting) but also for guiding model selection. In what follows,
we mean by model selection the problem of selecting a particular hypothesis space out of a
whole ensemble of potential hypothesis spaces H1 , H2 , . . ..
We first study the phenomenon of overfitting within a simple probabilistic model for the
data points in Section 6.1. Then, in Section 6.2, we analyze a simple validation technique
that allows us to detect overfitting.
6.1 Overfitting
Let us illustrate the phenomenon of overfitting using a simplified model for how a human
child learns the concept “tractor”. In particular, this learning task amounts to finding an
association (or predictor) between an image and the fact if the image shows a tractor or not.
To teach this association to a child, we show it many pictures and tell for each picture if
there is a “tractor” or if there is “no tractor” depicted.
Consider that we have taught the child using the image collection X(train) depicted in
Figure 6.1. For some reason, one of the images is labeled erroneously as “tractor” but
actually shows an ocean wave. As a consequence, if the child is good at memorizing images,
it might predict the presence of tractors whenever looking at a wave (Figure 6.2).
images. The i-th image is characterized by the feature vector x^(i) ∈ R^n and labeled with
y^(i) = 1 (if the image depicts a tractor) or y^(i) = −1 (if the image does not depict a tractor).
For the sake of the argument, we assume that the child uses a linear predictor h^(w)(x) =
x^T w, based on the features x of the image, and encodes the presence of a tractor by y = 1
and its absence by y = −1. In order to learn the weight vector, the child uses
ERM with squared error loss over the training dataset, i.e., its learning process amounts to
solving the ERM problem (4.4) using the labeled training dataset D^(train).
If we stack the feature vectors x^(i) and labels y^(i) into the feature matrix X = (x^(1), . . . , x^(m_t))^T
and label vector y = (y^(1), . . . , y^(m_t))^T, the optimal linear predictor is obtained for the weight
Figure 6.2: The child, who has been taught the concept “tractor” using the image collection
X(train) in Figure 6.1, might “see” a lot of tractors during the next beach holiday.
vector solving (4.9) and the associated training error is given by (4.10), which we repeat here
for convenience:
span{X} = {Xa : a ∈ R^n} ⊆ R^{m_t},
ML methods using linear predictors overfit as soon as the number of features is not
smaller than the sample size, i.e., whenever

m_t ≤ n. (6.3)
Inserting P = I into (4.10) yields
To sum up: as soon as the number of training examples m_t = |D^(train)| is smaller than the
size n of the feature vector x, there is a linear predictor h^(w_opt) achieving zero empirical
risk (see (6.4)) on the training data. The result (6.4) only applies if the feature vectors of
the training data points are linearly independent. It can be shown that if the feature vectors
x(1) , . . . , x(m) ∈ Rn are realizations of i.i.d. RVs with a continuous probability distribution,
then with probability one they are linearly independent whenever (6.3) holds.
While this “optimal” predictor h(wopt ) is perfectly accurate on the training data (the
training error is zero!), it will typically incur a non-zero average prediction error y−h(wopt ) (x)
on new data points (x, y) (which are different from the training data). Indeed, using a
simple toy model for the data generation, we obtained the expression (6.26) for the average
prediction error. This average prediction error is lower bounded by the noise variance σ 2
which might be very large even if the training error is zero. Thus, in case of overfitting, a small
training error can be highly misleading regarding the average prediction error of a predictor.
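This effect is easy to reproduce numerically. The following sketch (array sizes and noise level are illustrative choices, not from the book) fits a linear predictor to fewer data points than features; the training error is zero while the error on fresh data stays above the noise variance:

```python
import numpy as np

rng = np.random.default_rng(1)
m_t, n, sigma = 10, 50, 0.5            # fewer training points than features (m_t <= n)
w_true = np.zeros(n)
w_true[0] = 1.0                        # labels depend only on the first feature
X = rng.standard_normal((m_t, n))
y = X @ w_true + sigma * rng.standard_normal(m_t)

# minimum-norm least squares solution; interpolates the training set exactly
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
train_mse = np.mean((X @ w_hat - y) ** 2)

# evaluate on fresh data points from the same distribution
X_new = rng.standard_normal((1000, n))
y_new = X_new @ w_true + sigma * rng.standard_normal(1000)
val_mse = np.mean((X_new @ w_hat - y_new) ** 2)
```

Here train_mse is numerically zero, while val_mse cannot fall below the noise variance σ² = 0.25 on average, illustrating how misleading a zero training error can be.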
A simple, yet quite useful, strategy to detect if a predictor ĥ overfits the training dataset
D(train) , is to compare the resulting training error E(ĥ|D(train) ) (see (6.6)) with the validation
error E(ĥ|D(val) ) (see (6.7)). The validation error E(ĥ|D(val) ) is the empirical risk of the
predictor ĥ on the validation dataset D(val) . If overfitting occurs, the validation error
E(ĥ|D(val) ) is significantly larger than the training error E(ĥ|D(train) ). The occurrence of
overfitting for polynomial regression with degree n (see Section 3.2) chosen too large is
depicted in Figure 7.1.
6.2 Validation
Consider an ML method using some hypothesis space H. We then learn a predictor ĥ ∈ H
by ERM (4.2) using a labeled dataset (the training set). The basic idea of validating the
predictor ĥ is simple: compute the empirical risk of ĥ on a new set of data points (x, y)
which have not been already used for training.
It is very important to validate the predictor ĥ using labeled data points which do not
belong to the dataset which has been used to learn ĥ (e.g., via ERM (4.2)). The predictor
ĥ tends to “look better” on the training set than for other data points, since it is optimized
Figure 6.3: The training dataset consists of the blue crosses and can be almost perfectly
fit by a high-degree polynomial. This high-degree polynomial gives only poor results for a
different (validation) dataset indicated by the orange dots.
A very simple recipe for implementing learning and validation of a predictor based on
one single labeled dataset D = {(x^(i), y^(i))}_{i=1}^m is as follows (see Figure 6.4):
1. randomly divide (“split”) the entire dataset D of labeled snapshots into two disjoint
subsets X(train) (the “training set”) and X(val) (the “validation set”): D = X(train) ∪X(val)
(see Figure 6.4).
2. learn a predictor ĥ via ERM using the training data X^(train), i.e., compute (cf. (4.2))

ĥ = argmin_{h∈H} E(h|X^(train)) = argmin_{h∈H} (1/m_t) Σ_{(x,y)∈X^(train)} L((x, y), h) (6.5)
with corresponding training error

E(ĥ|X^(train)) = (1/m_t) Σ_{i=1}^{m_t} L((x^(i), y^(i)), ĥ) (6.6)
3. validate the predictor ĥ obtained from (6.5) by computing the empirical risk

E(ĥ|X^(val)) = (1/m_v) Σ_{(x,y)∈X^(val)} L((x, y), ĥ) (6.7)
obtained when applying the predictor ĥ to the validation dataset D(val) . We might
refer to E(ĥ|D(val) ) as the validation error.
The choice of the split ratio |D^(val)|/|D^(train)|, i.e., how large the validation set should be
relative to the training set, is often based on experimental tuning. It seems difficult to
make a precise statement on how to choose the split ratio that applies broadly to different
ML problems [44].
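The three-step recipe can be sketched as follows for linear least squares regression (function names and the 80/20 split are illustrative choices, not from the book):

```python
import numpy as np

def train_and_validate(X, y, split=0.8, seed=0):
    """Randomly split the dataset, learn a linear predictor via ERM on the
    training part, and return training and validation errors (cf. (6.6), (6.7))."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    m_t = int(split * len(y))
    tr, va = perm[:m_t], perm[m_t:]
    w_hat = np.linalg.lstsq(X[tr], y[tr], rcond=None)[0]  # ERM, squared error loss
    e_train = np.mean((X[tr] @ w_hat - y[tr]) ** 2)
    e_val = np.mean((X[va] @ w_hat - y[va]) ** 2)
    return w_hat, e_train, e_val
```

If the method does not overfit, the two returned errors should be of similar magnitude; a validation error much larger than the training error indicates overfitting.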
Figure 6.4: If we have only one single labeled dataset D, we split it into a training set
D(train) and a validation set D(val) . We use the training set in order to learn (find) a good
predictor ĥ(x) by minimizing the empirical risk E(h|D(train) ) (see (4.2)). In order to validate
the performance of the predictor ĥ on new data, we compute the empirical risk E(h|D(val) )
incurred by ĥ(x) for the validation set D(val) . We refer to the empirical risk E(h|D(val) )
obtained for the validation set as the validation error.
The basic idea of randomly splitting the available labeled data into training and validation
sets is underlying many validation techniques. A popular extension of the above approach,
which is known as k-fold cross-validation, is based on repeating the splitting into training
and validation sets k times. During each repetition, this method uses different subsets for
training and validation. We refer to [30, Sec. 7.10] for a detailed discussion of k-fold cross-
validation.
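A minimal sketch of k-fold cross-validation for linear least squares (illustrative names; see [30, Sec. 7.10] for the general method):

```python
import numpy as np

def k_fold_cv(X, y, k=5, seed=0):
    """Average validation error over k splits; each fold serves once as the
    validation set while the remaining folds form the training set."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for j in range(k):
        val_idx = folds[j]
        tr_idx = np.concatenate([folds[l] for l in range(k) if l != j])
        w_hat = np.linalg.lstsq(X[tr_idx], y[tr_idx], rcond=None)[0]
        errors.append(np.mean((X[val_idx] @ w_hat - y[val_idx]) ** 2))
    return float(np.mean(errors))
```

Averaging over k splits reduces the variance of the validation error estimate compared with a single random split.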
• randomly divide (split) the entire dataset D of labeled snapshots into two disjoint
subsets X^(train) (the “training set”) and X^(val) (the “validation set”): D = X^(train) ∪ X^(val)
(see Figure 6.4).
• for each hypothesis space H_l, learn a predictor ĥ_l ∈ H_l via ERM (4.2) using the training
data X^(train).
• compute the validation error of ĥ_l,

E(ĥ_l|X^(val)) = (1/m_v) Σ_{i=1}^{m_v} L((x^(i), y^(i)), ĥ_l), (6.9)
obtained when applying the predictor ĥl to the validation dataset X(val) .
• pick the hypothesis space H_l resulting in the smallest validation error E(ĥ_l|X^(val)).
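The model-selection loop above can be sketched with polynomial hypothesis spaces H_1, H_2, . . . of increasing degree (a toy illustration; `np.polyfit` plays the role of the ERM solver and all names are illustrative):

```python
import numpy as np

def select_degree(x_tr, y_tr, x_val, y_val, max_degree=8):
    """Fit polynomials of increasing degree on the training set and pick the
    degree whose predictor has the smallest validation error."""
    best_deg, best_err = None, np.inf
    for r in range(1, max_degree + 1):
        coeffs = np.polyfit(x_tr, y_tr, deg=r)              # ERM within H_r
        err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
        if err < best_err:
            best_deg, best_err = r, err
    return best_deg, best_err
```

Note that the degree is chosen by the validation error, not the training error; the training error alone would always favour the largest hypothesis space.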
The noise variance σ 2 is assumed fixed (non-random) and known. Note that the error
component ε in (6.10) is intrinsic to the data (within our toy model) and cannot be overcome
by any ML method. We highlight that this model for the observed data points might not
be accurate for a particular ML application. However, this toy model will allow us to study
some fundamental behaviour of ML methods.
In order to predict the label y from the features x we will use predictors h that are linear
maps of the first r features x1 , . . . , xr . This results in the hypothesis space
The design parameter r determines the size of the hypothesis space H^(r) and allows us to control
the computational complexity of the resulting ML method based on the hypothesis
space H^(r). For r < n, the hypothesis space H^(r) is a proper subset of the space of linear
predictors (2.4) used within linear regression (see Section 3.1). Note that each element
h^(w) ∈ H^(r) corresponds to a particular choice of the weight vector w ∈ R^r.
The quality of a particular predictor h^(w) ∈ H^(r) is measured via the mean squared error
E(h^(w)|X^(train)) incurred over a labeled training set X^(train) = {(x^(i), y^(i))}_{i=1}^{m_t}. Within our toy
model (see (6.10), (6.12) and (6.13)), the training data points (x^(i), y^(i)) are i.i.d. copies of
the data point z = (x, y).
Each of the data points in the training dataset is statistically independent from any other
data point (x, y) (which has not been used for training). However, the training data points
(x(i) , y (i) ) and any other (new) data point (x, y) share the same probability distribution (a
multivariate normal distribution):
y^(i) = w_true^T x^(i) + ε^(i), and y = w_true^T x + ε (6.13)
vector (6.14) as well as the zero-padded length-n vector (ŵ^T, 0)^T. This allows us to write

h^(ŵ)(x) = ŵ^T x. (6.16)
We highlight that the formula (6.14) for the optimal weight vector ŵ is only valid if the
matrix X_r^T X_r is invertible. However, it can be shown that within our toy model (see (6.12)),
this is true with probability one whenever mt ≥ r. In what follows, we will consider the case
of having more training samples than the dimension of the hypothesis space, i.e., mt > r
such that the formula (6.14) is valid (with probability one). The case mt ≤ r will be studied
in Chapter 7.
The optimal weight vector ŵ (see (6.14)) depends on the training data X^(train) via the
feature matrix X_r and label vector y (see (6.15)). Therefore, since we model the training data
as random, the weight vector ŵ (6.14) is a random quantity. For each different realization
of the training dataset, we obtain a different realization of the optimal weight vector ŵ.
Within our toy model, which relates the features x of a data point to its label y via
(6.10), the best case would be if ŵ = w_true. However, in general this will not happen since
we have to compute ŵ based on the features x^(i) and noisy labels y^(i) of the data points in
the training dataset D. Thus, we typically have to face a non-zero estimation error

∆w := ŵ − w_true. (6.17)
Note that this estimation error is a random quantity since the learnt weight vector ŵ (see
(6.14)) is random.
Bias and Variance. As we will see below, the prediction quality achieved by h^(ŵ) depends
crucially on the mean squared estimation error (MSE)

E_est := E{‖∆w‖_2^2}. (6.18)
It is useful to characterize the MSE Eest by decomposing it into two components, one
component (the “bias”) which depends on the choice r for the hypothesis space and another
component (the “variance”) which only depends on the distribution of the observed feature
vectors x(i) and labels y (i) . It is then not too hard to show that
E_est = ‖w_true − E{ŵ}‖_2^2 + E{‖ŵ − E{ŵ}‖_2^2} (6.19)

where the first term is the “bias” B^2 and the second term is the “variance” V.
The bias term in (6.19), which can be computed as

B^2 = ‖w_true − E{ŵ}‖_2^2 = Σ_{l=r+1}^{n} w_true,l^2, (6.20)
measures the distance between the “true predictor” h^(w_true)(x) = w_true^T x and the hypothesis
space H^(r) (see (6.11)) of the linear regression problem. The bias is zero if w_true,l = 0 for every
index l = r + 1, . . . , n, or equivalently if h^(w_true) ∈ H^(r). We can guarantee h^(w_true) ∈ H^(r)
only if we use the largest possible hypothesis space H(r) with r = n. For r < n, we cannot
guarantee a zero bias term since we have no access to the true underlying weight vector wtrue
in (6.10). In general, the bias term decreases for increasing model size r (see Figure 6.5).
We also highlight that the bias term does not depend on the variance σ 2 of the noise ε in
our toy model (6.10).
Let us now consider the variance term in (6.19). Using the properties of our toy model
(see (6.10), (6.12) and (6.13)), one can show that

V = σ^2 E{trace{(X_r^T X_r)^{−1}}}.

By (6.12), the matrix (X_r^T X_r)^{−1} is random and distributed according to an inverse Wishart
distribution [48]. In particular, for m_t > r + 1, its expectation is obtained as
E{(X_r^T X_r)^{−1}} = (1/(m_t − r − 1)) I, so that

V = σ^2 r/(m_t − r − 1). (6.23)
As indicated by (6.23), the variance term increases with increasing model complexity r (see
Figure 6.5). This behaviour is in stark contrast to the bias term which decreases with
increasing r. The opposite dependency of bias and variance on the model complexity is
known as the bias-variance tradeoff. Thus, the choice of model complexity r (see (6.11))
has to balance between small variance and small bias term.
Generalization. In most ML applications, we are primarily interested in how well a
predictor h(ŵ) , which has been learnt from some training data D (see (4.2)), predicts the
label y of a new datapoint (which is not contained in the training data D) with features x.
Within our linear regression model, the prediction (approximation, guess or estimate) ŷ of the label y is given by
Figure 6.5: The estimation error Eest incurred by linear regression can be decomposed into
a bias term B 2 and a variance term V (see (6.19)). These two components depend on the
model complexity r in an opposite manner resulting in a bias-variance tradeoff.
ŷ = ŵ^T x. (6.24)
Note that the prediction ŷ is a random variable since (i) the feature vector x is modelled as
a random vector (see (6.12)) and (ii) the optimal weight vector ŵ (see (6.14)) is random. In
general, we cannot hope for a perfect prediction but have to face a non-zero prediction error
e_pred := ŷ − y
       = ŵ^T x − y                     (by (6.24))
       = ŵ^T x − (w_true^T x + ε)      (by (6.10))
       = ∆w^T x − ε. (6.25)
Note that, within our toy model (see (6.10), (6.12) and (6.13)), the prediction error epred is
a random variable since (i) the label y is modelled as a random variable (see (6.10)) and (ii)
the prediction ŷ is random.
Since, within our toy model (6.13), ε is zero-mean and independent of x and ŵ − w_true,
we obtain the average prediction error as

E_pred = E{e_pred^2}
       = E{∆w^T x x^T ∆w} + σ^2            (by (6.25) and (6.10))
   (a) = E{E{∆w^T x x^T ∆w | D}} + σ^2
   (b) = E{∆w^T ∆w} + σ^2
       = E_est + σ^2                        (by (6.17) and (6.18))
       = B^2 + V + σ^2.                     (by (6.19))  (6.26)
Here, step (a) is due to the law of total expectation [8] and step (b) uses that, conditioned
on the dataset D, the feature vector x of a new data point (not belonging to D) has zero
mean and covariance matrix I (see (6.12)).
Thus, as indicated by (6.26), the average (expected) prediction error Epred is the sum of
three contributions: (i) the bias B 2 , (ii) the variance V and (iii) the noise variance σ 2 . The
bias and variance, whose sum is the estimation error Eest , can be influenced by varying the
model complexity r (see Figure 6.5) which is a design parameter. The noise variance σ 2 is
the intrinsic accuracy limit of our toy model (6.10) and is not under the control of the ML
engineer. It is impossible for any ML method (no matter how cleverly it is engineered) to
achieve, on average, a smaller prediction error than the noise variance σ^2.
We finally highlight that our analysis of bias (6.20), variance (6.23) and the average
prediction error (6.26) achieved by linear regression only applies if the observed data points
are well modelled as realizations of random vectors according to (6.10), (6.12) and (6.13).
The usefulness of this model for the data arising in a particular application has to be verified
in practice by some validation techniques [76, 70].
An alternative approach for analyzing bias, variance and average prediction error of linear
regression is to use simulations. Here, we generate a number of i.i.d. copies of the observed
data points by some random number generator [4]. Using these i.i.d. copies, we can replace
exact computations (expectations) by empirical approximations (sample averages).
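Such a simulation could look as follows (sizes, seed and the true weight vector are illustrative choices): we draw many i.i.d. training sets from the toy model, fit ŵ on the first r features each time, and approximate the bias (6.20) and the variance by sample averages:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m_t, sigma, runs = 6, 3, 50, 0.5, 2000
w_true = np.array([1.0, -0.5, 0.8, 0.3, 0.0, -0.2])   # illustrative true weights

w_hats = np.empty((runs, r))
for it in range(runs):
    X = rng.standard_normal((m_t, n))                   # features, cf. (6.12)
    y = X @ w_true + sigma * rng.standard_normal(m_t)   # labels, cf. (6.13)
    Xr = X[:, :r]                                       # keep only the first r features
    w_hats[it] = np.linalg.lstsq(Xr, y, rcond=None)[0]

w_bar = w_hats.mean(axis=0)                             # sample estimate of E{w_hat}
# bias term with the zero-padded weight vector, cf. (6.19) and (6.20)
bias2 = np.sum((w_true[:r] - w_bar) ** 2) + np.sum(w_true[r:] ** 2)
variance = np.mean(np.sum((w_hats - w_bar) ** 2, axis=1))
```

For this choice of w_true, the bias estimate should come out near Σ_{l>r} w_true,l² = 0.13, i.e., it is dominated by the weight components outside the hypothesis space H^(r).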
6.5 Diagnosing ML
A useful diagnostic is to compare the training error, the validation error, and a benchmark
error. The benchmark can be the Bayes risk when using a probabilistic model (such as an
i.i.d. model), human performance, or the risk achieved by some other ML method (the
“experts” in a regret framework).
Consider a predictor ĥ obtained from ERM (4.2) with training error E(ĥ|X(train) ) and
validation error E(ĥ|D(val) ). By comparing the two numbers E(ĥ|X(train) ) and E(ĥ|D(val) )
with some desired or tolerated error E0 , we can get some idea of how to adapt the current
ERM approach (see (4.2)) to improve performance:
• E(ĥ|X^(train)) ≫ E(ĥ|X^(val)): This indicates that the method for solving the ERM (4.2)
is not working properly. The training error obtained by solving the ERM (4.2) should
always be smaller than the validation error. When using GD for solving ERM, one
particular reason for E(ĥ|X^(train)) ≫ E(ĥ|X^(val)) could be that the step size α in the
GD step (5.4) is chosen too large (see Figure 5.3-(b)).
6.6 Exercises
6.6.1 Validation Set Size
Consider a linear regression problem with data points characterized by a scalar feature and
numeric label. Assume the data points are i.i.d. Gaussian with zero mean and covariance C.
How many data points do we need in a validation set such that, with probability larger than
0.8, the MSE incurred on the validation set does not deviate by more than 20 percent from
the average MSE?
Chapter 7
Regularization
7.1 Regularized ERM
It seems reasonable to avoid overfitting by pruning the hypothesis space H, i.e., removing
some of its elements. In particular, instead of solving (4.2) we solve the restricted ERM

ĥ = argmin_{h∈H′} E(h|D) with H′ ⊂ H. (7.1)

Another approach to avoid overfitting is to regularize the ERM (4.2) by adding a penalty
term R(h), which measures the complexity or non-regularity of a predictor map h
by a non-negative number R(h) ∈ R_+. We then obtain the regularized ERM

ĥ = argmin_{h∈H} E(h|D) + R(h). (7.2)
The additional term R(h) aims at approximating (or anticipating) the increase in the
empirical risk of a predictor ĥ when it is applied to new data points, which are different
from the dataset D used to learn the predictor ĥ by (7.2).
The two approaches (7.1) and (7.2) for making ERM (4.2) robust against overfitting
are closely related. In particular, these two approaches are, in a certain sense, dual to
each other: for a given restriction H′ ⊂ H we can find a penalty term R(h) such that the
solutions of (7.1) and (7.2) coincide. Similarly, for many popular types of penalty terms
R(h), we can find a restriction H′ ⊂ H such that the solutions of (7.1) and (7.2) coincide.
These statements can be made precise using the theory of duality for optimization problems
(see [7]).
In what follows we will analyze the occurrence of overfitting in Section ?? and then
discuss in Section 7.4 how to avoid overfitting using regularization.
7.2 Robustness
Overfitting is one of the main challenges in applying modern ML methods. Modern ML methods use
large hypothesis spaces that can represent highly non-linear predictor maps. Just by
pure luck, such a hypothesis space may contain a predictor map that perfectly fits the training set,
resulting in zero training error and, in turn, solving the ERM (4.2).
Overfitting is closely related to another property of ML methods: robustness. If a method
overfits, it will typically not be robust to small perturbations in the training data. The
robustness to small perturbations in the data is almost a mandatory requirement for ML
methods to be useful in important application domains.
The ML methods discussed in Chapter 4 rest on the idealizing assumption that we have
access to the true label values and feature values of a set of data points (the training set).
However, the means by which the label and feature values are determined are prone to errors.
These errors might stem from the measurement device itself (hardware failures) or might
be due to modelling errors. We need ML methods that do not “break” if we feed them slightly
perturbed label values for the training data.
Figure 7.1: Modern ML methods allow us to find a predictor map that perfectly fits the training
data. Such a predictor might perform poorly on a new data point outside the training set.
To prevent learning such a predictor map, we could require it to be robust against small
perturbations in the features of the training data points or in the predictor map itself.
7.3 Data Augmentation
The robustness principle can be implemented by augmenting the dataset with random
perturbations of the original training data.
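A minimal sketch of this idea (the number of copies and the noise level are illustrative design choices): each training point is duplicated several times with small random feature perturbations, while the labels are kept unchanged:

```python
import numpy as np

def augment(X, y, num_copies=5, noise_std=0.1, seed=0):
    """Return the dataset enlarged by num_copies randomly perturbed versions
    of every feature vector; labels are simply repeated."""
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [X], [y]
    for _ in range(num_copies):
        X_parts.append(X + noise_std * rng.standard_normal(X.shape))
        y_parts.append(y)
    return np.vstack(X_parts), np.concatenate(y_parts)
```

A predictor trained on the augmented set is pushed towards predicting the same label for slightly perturbed features, which is exactly the robustness requirement discussed above.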
Overfitting is typically caused by choosing the hypothesis space too large. Therefore, we can avoid overfitting by
making (pruning) the hypothesis space H smaller to obtain a new hypothesis space H_small.
This smaller hypothesis space H_small can be obtained by pruning, i.e., removing certain maps
h, from H.
A more general strategy is regularization, which amounts to modifying the loss function
of an ML problem in order to favour a subset of predictor maps. Pruning the hypothesis
space can be interpreted as an extreme case of regularization, where the loss function becomes
infinite for predictors which do not belong to the smaller hypothesis space H_small.
In order to avoid overfitting, we have to augment our basic ERM approach (cf. (4.2)) by
regularization techniques. According to [26], regularization aims at “any modification
we make to a learning algorithm that is intended to reduce its generalization error but not
its training error.” By generalization error, we mean the average prediction error (see (6.26))
incurred by a predictor when applied to new data points (different from the training set).
A simple but effective method to regularize the ERM learning principle is to augment
the empirical risk (5.7) of linear regression by the penalty term R(h^(w)) := λ‖w‖_2^2, which
penalizes overly large weight vectors w. Thus, we arrive at the regularized ERM

ŵ^(λ) = argmin_{w∈R^n} E(h^(w)|D) + λ‖w‖_2^2 (7.3)
with the regularization parameter λ > 0. The parameter λ trades a small training error
E(h(w) |D) against a small norm kwk of the weight vector. In particular, if we choose a large
value for λ, then weight vectors w with a large norm kwk are “penalized” by having a larger
objective function and are therefore unlikely to be a solution (minimizer) of the optimization
problem (7.3).
Specialising (7.3) to the squared error loss and linear predictors yields regularized linear
regression (see (4.4)):

ŵ^(λ) = argmin_{w∈R^n} (1/m_t) Σ_{i=1}^{m_t} (y^(i) − w^T x^(i))^2 + λ‖w‖_2^2. (7.4)
The optimization problem (7.4) is also known under the name ridge regression [30].
Using the feature matrix X = (x^(1), . . . , x^(m_t))^T and label vector y = (y^(1), . . . , y^(m_t))^T,
we can rewrite (7.4) more compactly as

ŵ^(λ) = argmin_{w∈R^n} (1/m_t)‖y − Xw‖_2^2 + λ‖w‖_2^2. (7.5)

The solution of (7.5) is given by the closed-form expression

ŵ^(λ) = ((1/m_t) X^T X + λI)^{−1} (1/m_t) X^T y. (7.6)
This reduces to the closed-form expression (6.14) when λ = 0 in which case regularized
linear regression reduces to ordinary linear regression (see (7.4) and (4.4)). It is important
to note that for λ > 0, the formula (7.6) is always valid, even when XT X is singular (not
invertible). This implies, in turn, that for λ > 0 the optimization problem (7.5) (and (7.4))
have a unique solution (which is given by (7.6)).
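A sketch of regularized linear regression via the closed-form solution (7.6); setting λ = 0 recovers ordinary linear regression whenever X^T X is invertible (function names are illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form solution of (7.4): w = ((1/m) X^T X + lam*I)^{-1} (1/m) X^T y."""
    m, n = X.shape
    A = (X.T @ X) / m + lam * np.eye(n)   # adding lam*I makes A invertible for lam > 0
    return np.linalg.solve(A, (X.T @ y) / m)
```

For λ > 0 the matrix A is always invertible, mirroring the remark that (7.6) is valid even when X^T X is singular; larger λ shrinks the learnt weight vector towards zero.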
We now study the effect of regularization on the bias, variance and average prediction
error incurred by the predictor h^(ŵ^(λ))(x) = (ŵ^(λ))^T x. To this end, we will again invoke the
simple probabilistic toy model (see (6.10), (6.12) and (6.13)) used already in Section 6.4.
In particular, we interpret the training data D^(train) = {(x^(i), y^(i))}_{i=1}^{m_t} as realizations of i.i.d.
random data points. Since the features are zero-mean with covariance matrix I (see (6.12)),
for sufficiently large sample size m_t we can use the approximation

X^T X ≈ m_t I. (7.8)
Figure 7.2: The bias and variance of regularized linear regression depend on the regularization
parameter λ in an opposite manner resulting in a bias-variance tradeoff.
Using (7.8), the bias term of regularized linear regression can be approximated as

B^2 = Σ_{l=1}^{n} (λ/(1 + λ))^2 w_true,l^2. (7.9)
By comparing the (approximate) bias term (7.9) of regularized linear regression with the
bias term (6.20) of ordinary linear regression, we see that introducing regularization typically
increases the bias. The bias increases with larger values of the regularization parameter λ.
The variance of regularized linear regression (7.4) satisfies

V = (σ^2/m_t^2) trace{E{((1/m_t)X^T X + λI)^{−1} X^T X ((1/m_t)X^T X + λI)^{−1}}}. (7.10)

Using the approximation (7.8), which is reasonable for sufficiently large sample size m_t, we
can in turn approximate (7.10) as

V ≈ σ^2 n/(m_t (1 + λ)^2). (7.11)
According to (7.11), the variance of regularized linear regression decreases with increasing
regularization λ. Thus, as illustrated in Figure 7.2, the choice of λ has to balance between the
bias B 2 (7.9) (which increases with increasing λ) and the variance V (7.11) (which decreases
with increasing λ). This is another instance of the bias-variance tradeoff (see Figure 6.5).
So far, we have only discussed the statistical effect of regularization on the resulting ML
method (how regularization influences bias, variance and average prediction error). However,
regularization also has an effect on the computational properties of the resulting ML method.
Note that the objective function in (7.5) is a smooth (infinitely often differentiable) convex
function.
Similar to linear regression, we can solve the regularized linear regression problem
using GD (2.5) (see Algorithm 4). The effect of adding the regularization term λ‖w‖_2^2 to
the objective function of linear regression is a speed-up of GD. Indeed, we can rewrite
(7.5) as the quadratic problem

min_{w∈R^n} f(w) := (1/2) w^T Q w − q^T w. (7.12)
This is similar to the quadratic problem (4.7) underlying linear regression but with different
matrix Q. It turns out that the convergence speed of GD (see (5.4)) applied to solving a
quadratic problem of the form (7.12) depends crucially on the condition number κ(Q) ≥ 1
of the psd matrix Q [34]. In particular, GD methods are fast if the condition number κ(Q)
is small (close to 1).
This condition number is given by λ_max((1/m)X^T X)/λ_min((1/m)X^T X) for ordinary linear
regression (see (4.7)), and by

(λ_max((1/m)X^T X) + λ)/(λ_min((1/m)X^T X) + λ)

for regularized linear regression (7.12). For increasing regularization parameter λ, the condition
number obtained for regularized linear regression (7.12) tends to 1:

lim_{λ→∞} (λ_max((1/m)X^T X) + λ)/(λ_min((1/m)X^T X) + λ) = 1. (7.13)
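The limit (7.13) can be verified numerically (an illustrative sketch with deliberately ill-conditioned features):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 10
# scale the features so that (1/m) X^T X has widely spread eigenvalues
X = rng.standard_normal((m, n)) * np.linspace(1.0, 10.0, n)
eig = np.linalg.eigvalsh(X.T @ X / m)          # eigenvalues in ascending order

def cond_with_reg(lam):
    """Condition number of (1/m) X^T X + lam * I."""
    return (eig[-1] + lam) / (eig[0] + lam)

kappas = [cond_with_reg(lam) for lam in (0.0, 1.0, 10.0, 1000.0)]
```

The sequence kappas is strictly decreasing and approaches 1, so GD converges faster on the regularized problem.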
Algorithm 4 “Regularized Linear Regression via GD”
Input: labeled dataset D = {(x^(i), y^(i))}_{i=1}^m containing feature vectors x^(i) ∈ R^n and labels y^(i) ∈ R; GD step size α > 0.
Initialize: set w^(0) := 0; set iteration counter k := 0
1: repeat
2:   k := k + 1 (increase iteration counter)
3:   w^(k) := (1 − αλ)w^(k−1) + α(2/m) Σ_{i=1}^{m} (y^(i) − (w^(k−1))^T x^(i)) x^(i) (do a GD step (5.4))
4: until convergence
Output: w^(k) (which approximates ŵ^(λ) in (7.5))
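Algorithm 4 translates directly into code (step size, stopping rule and names are illustrative choices):

```python
import numpy as np

def ridge_gd(X, y, lam, alpha=0.01, max_iter=10000, tol=1e-10):
    """GD for regularized linear regression, mirroring step 3 of Algorithm 4:
    w <- (1 - alpha*lam) w + alpha (2/m) sum_i (y_i - w^T x_i) x_i."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(max_iter):
        w_new = (1.0 - alpha * lam) * w + alpha * (2.0 / m) * (X.T @ (y - X @ w))
        if np.linalg.norm(w_new - w) < tol:   # stop when the update stalls
            return w_new
        w = w_new
    return w
```

At a fixed point of this update, λw = (2/m) X^T (y − Xw): the regularization term pulls the weights towards zero while the data-fit term pulls them towards the least squares solution.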
For any given value of λ, we can find a bound C(λ) such that the solutions of (7.3) coincide with
the solutions of (7.14). Thus, by solving the regularized ERM (7.3) we are implicitly
performing model selection using a continuous ensemble of hypothesis spaces H^(λ) given by
(7.15). In contrast, the simple model selection strategy considered in Section 6.3 uses a
discrete sequence of hypothesis spaces.
showing an apple. To learn such a predictor we might have a collection of hand-drawings
at our disposal. For each hand-drawing we might know certain higher-level information, such
as the object it is showing. This allows us to use different choices for the label. We could
also use the confidence level of a hand-drawing showing an orange. Clearly, this problem is
related to the problem of predicting the apple confidence.
The definition (design choice) of the labels corresponds to formulating a particular
question we want to have answered by an ML method. Some questions (label choices)
are more difficult to answer while others are easier to answer.
Consider the ML problem arising from guiding the operation of a mowing robot. For
a mowing robot, it is important to determine if it is currently on grassland or not. Let us
assume the mowing robot is equipped with an on-board camera which allows it to take snapshots
which are characterized by a feature vector x (see Figure 2.3). We could then define the
label as y = 1 if the snapshot suggests that the mower is on grassland and y = −1
if not. However, we might be interested in finer-grained information about the floor type
and define the label as y = 1 for grassland, y = 0 for soil and y = −1 for when the mower
is on tiles. The latter problem is more difficult since we have to distinguish between three
different types of floor (“grass” vs. “soil” vs. “tiles”), whereas for the former problem we
only have to distinguish between two types of floor (“grass” vs. “no grass”).
7.7 Exercises
7.7.1 Ridge Regression as Quadratic Form
Consider a linear hypothesis space consisting of linear maps parameterized by weights w.
We try to find the best linear map by minimizing the regularized average squared error loss
(empirical risk) incurred on some labeled training data points (x^(1), y^(1)), (x^(2), y^(2)), . . . , (x^(m), y^(m)).
As regularizer we use ‖w‖_2, yielding the learning problem

min_w f(w) = Σ_{i=1}^{m} . . . + ‖w‖_2^2
Is it possible to write the objective function f(w) as a convex quadratic form f(w) =
w^T Cw + b^T w + c? If this is possible, how are the matrix C, the vector b and the constant c
related to the feature vectors and labels of the training data?
Chapter 8
Clustering
Figure 8.1: A scatterplot obtained from the features x^(i) = (x_r^(i), x_g^(i))^T, given by the redness
x_r^(i) and greenness x_g^(i), of some snapshots.
Up to now, we mainly considered ML methods which require some labeled training data
in order to learn a good predictor or classifier. We will now start to discuss ML methods
which do not make use of labels. These methods are often referred to as “unsupervised”
since they do not require a supervisor (or teacher) who provides the labels for the data points
in a training set.
An important class of unsupervised methods, known as clustering methods, aims at
grouping data points into a few subsets (or clusters). While there is no unique formal
definition, we understand by a cluster a subset of data points which are more similar to each
other than to the remaining data points (belonging to different clusters). Different clustering
methods are obtained for different ways to measure the “similarity” between data points.
There are two main flavours of clustering methods: in hard clustering, each data point x^(i) belongs to one and only one cluster. In contrast,
soft clustering methods assign a data point x^(i) to several different clusters with varying
degrees of belonging (confidence).
Clustering methods determine for each data point z^(i) a cluster assignment y^(i). The
cluster assignment y^(i) encodes the cluster to which the data point x^(i) is assigned. For hard
clustering with a prescribed number of k clusters, the cluster assignment y^(i) ∈ {1, . . . , k}
represents the index of the cluster to which x^(i) belongs.
In contrast, soft clustering methods allow each data point to belong to several different
clusters. The degree with which data point x^(i) belongs to cluster c ∈ {1, . . . , k} is represented
by the degree of belonging y_c^(i) ∈ [0, 1], which we stack into the vector y^(i) = (y_1^(i), . . . , y_k^(i))^T ∈
[0, 1]^k. Thus, while hard clustering generates non-overlapping clusters, the clusters produced
by soft clustering methods may overlap.
We intentionally used the same symbol y^(i) for the cluster assignment of a data point as we
used to denote an associated label in classification problems. There is a strong conceptual
link between clustering and classification. We can interpret clustering as an extreme case
of classification without having access to any labeled training data, i.e., we do not know
the label of any data point. To find the correct labels (cluster assignments) y^(i), clustering
methods rely solely on the intrinsic geometry of the data points.
8.1 Hard Clustering with K-Means
In what follows, we assume that data points z^(i), for i = 1, . . . , m, are characterized by
feature vectors x^(i) ∈ R^n, and we measure the similarity between data points using the Euclidean
distance ‖x^(i) − x^(j)‖. With a slight abuse of notation, we will occasionally denote a data
point z^(i) by its feature vector x^(i). In general, the feature vector is only an (incomplete)
representation of a data point, but it is customary in many unsupervised ML methods to
identify a data point with its features. Thus, we consider two data points z^(i) and z^(j)
similar if ‖x^(i) − x^(j)‖ is small. Moreover, we assume the number k of clusters to be prescribed.
A simple method for hard clustering is the “k-means” algorithm, which requires the
number k of clusters to be specified beforehand. The idea underlying k-means is quite simple:
First, given a current guess for the cluster assignments y^(i), determine the cluster mean
m^(c) = (1/|{i : y^(i) = c}|) ∑_{i : y^(i) = c} x^(i) for each cluster c. Then, in a second step, update the cluster
assignments y^(i) ∈ {1, . . . , k} for each data point x^(i) based on the nearest cluster mean.
Iterating these two steps yields Algorithm 5.
Algorithm 5 “k-means”
Input: dataset D = {x^(i)}_{i=1}^m; number k of clusters.
Initialize: choose initial cluster means m^(c) for c = 1, . . . , k.
1: repeat
2:   for each data point x^(i), i = 1, . . . , m, update the cluster assignment

       y^(i) ∈ argmin_{c′ ∈ {1,...,k}} ‖x^(i) − m^(c′)‖   (update cluster assignments)   (8.1)

3:   for each cluster c = 1, . . . , k, update the cluster mean

       m^(c) := (1/|{i : y^(i) = c}|) ∑_{i : y^(i) = c} x^(i)   (update cluster means)   (8.2)

4: until convergence
Output: cluster assignments y^(i) ∈ {1, . . . , k}
In (8.1) we denote by argmin_{c′ ∈ {1,...,k}} ‖x^(i) − m^(c′)‖ the set of all cluster indices c ∈ {1, . . . , k}
such that ‖x^(i) − m^(c)‖ = min_{c′ ∈ {1,...,k}} ‖x^(i) − m^(c′)‖.
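As a concrete illustration, the two alternating updates can be sketched in a few lines of numpy (our own illustration, not the book's reference implementation; ties in (8.1) are broken by taking the smallest cluster index, and a cluster that loses all its data points simply keeps its previous mean):

```python
import numpy as np

def kmeans(X, k, means_init, n_iter=50):
    """Hard clustering of the rows of X into k clusters (sketch of Algorithm 5)."""
    means = np.array(means_init, dtype=float)
    for _ in range(n_iter):
        # (8.1): assign each data point to the nearest cluster mean
        # (argmin breaks ties by returning the smallest cluster index)
        dist = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        y = dist.argmin(axis=1)
        # (8.2): recompute each cluster mean; empty clusters keep their old mean
        for c in range(k):
            if np.any(y == c):
                means[c] = X[y == c].mean(axis=0)
    return y, means

# two tight groups of points around (0,0) and (10,10)
X = np.vstack([np.zeros((10, 2)), np.full((10, 2), 10.0)])
y, means = kmeans(X, k=2, means_init=[[1.0, 1.0], [9.0, 9.0]])
assert (y[:10] == 0).all() and (y[10:] == 1).all()
```

With the initial means placed near the two groups, the assignments (8.1) are already correct after the first iteration and the means (8.2) converge to the group centers.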
The k-means algorithm requires the specification of initial choices for the cluster means
m^(c), for c = 1, . . . , k. There is no unique optimal strategy for the initialization, but several
heuristic strategies can be used. One option is to initialize the cluster means with i.i.d.
realizations of a random vector m whose distribution is matched to the dataset D = {x^(i)}_{i=1}^m,
e.g., m ∼ N(m̂, Ĉ) with the sample mean m̂ = (1/m) ∑_{i=1}^m x^(i) and the sample covariance
Ĉ = (1/m) ∑_{i=1}^m (x^(i) − m̂)(x^(i) − m̂)^T. Another option is to choose the cluster means m^(c) by
randomly selecting k different data points x^(i). The cluster means might also be chosen by
evenly partitioning the principal component of the dataset (see Chapter 9).
We now show that k-means can be interpreted as a variant of ERM. To this end, we define
the empirical risk as the clustering error

    E({m^(c)}_{c=1}^k, {y^(i)}_{i=1}^m | D) = (1/m) ∑_{i=1}^m ‖x^(i) − m^(y^(i))‖².   (8.3)

Note that the empirical risk (8.3) depends on the current guess for the cluster means
{m^(c)}_{c=1}^k and cluster assignments {y^(i)}_{i=1}^m.
Finding the global optimum of the function (8.3), over all possible cluster means {m^(c)}_{c=1}^k
and cluster assignments {y^(i)}_{i=1}^m, is difficult as the function is non-convex. However, minimizing
(8.3) only with respect to the cluster assignments {y^(i)}_{i=1}^m, with the cluster means
{m^(c)}_{c=1}^k held fixed, is easy. Similarly, minimizing (8.3) over the choices of cluster means
with the cluster assignments held fixed is also straightforward. Algorithm 5 exploits
this observation: it alternates between minimizing E over the cluster assignments with the cluster means
held fixed and minimizing E over the cluster means with the assignments held fixed.
The interpretation of Algorithm 5 as a method for minimizing the cost function (8.3)
is useful for convergence diagnosis. In particular, we might terminate Algorithm 5 if the
decrease of the objective function E is below a prescribed (small) threshold.
A practical implementation of Algorithm 5 needs to fix three issues:
• Issue 1: We need to specify a “tie-breaking strategy” for the case when several
different cluster indices c ∈ {1, . . . , k} achieve the minimum value in (8.1).
• Issue 2: We need to specify how to handle the situation when, after a cluster assignment
update (8.1), there is a cluster c to which no data points are assigned, i.e.,
|{i : y^(i) = c}| = 0. In this case, the cluster mean update (8.2) would not be well
defined for the cluster c.
• Issue 3: We need to specify a stopping criterion, i.e., a way of “checking convergence”.
The following algorithm fixes those three issues in a particular way [28].
Algorithm 6 “k-Means II” (slight variation of the “Fixed-Point Algorithm” in [28])
Input: dataset D = {x^(i)}_{i=1}^m; number k of clusters; tolerance ε ≥ 0.
Initialize: choose initial cluster means {m^(c)}_{c=1}^k and cluster assignments {y^(i)}_{i=1}^m; set
the iteration counter r := 0; compute E^(0) = E({m^(c)}_{c=1}^k, {y^(i)}_{i=1}^m | D).
1: repeat
2:   for all data points i = 1, . . . , m, update the cluster assignment

       y^(i) := min{ argmin_{c′ ∈ {1,...,k}} ‖x^(i) − m^(c′)‖ }   (update cluster assignments)   (8.4)

3:   for all clusters c = 1, . . . , k, update the activity indicator: b^(c) := 1 if |{i : y^(i) = c}| > 0, and b^(c) := 0 otherwise
4:   for all clusters c with b^(c) = 1, update the cluster mean

       m^(c) := (1/|{i : y^(i) = c}|) ∑_{i : y^(i) = c} x^(i)   (update cluster means)   (8.5)

5:   r := r + 1; compute E^(r) = E({m^(c)}_{c=1}^k, {y^(i)}_{i=1}^m | D)
6: until E^(r−1) − E^(r) ≤ ε
Output: cluster assignments y^(i) ∈ {1, . . . , k}
The variables b^(c) ∈ {0, 1} indicate whether cluster c is active (b^(c) = 1) or inactive
(b^(c) = 0), in the sense of having no data points assigned to it during the preceding cluster
assignment step (8.4). We use the cluster activity indicators b^(c) to make sure that the mean
update (8.5) is only applied to clusters c with at least one assigned data point x^(i).
It can be shown that Algorithm 6 amounts to a fixed-point iteration

    {y^(i)}_{i=1}^m ↦ P({y^(i)}_{i=1}^m)   (8.6)

with an operator P (depending on the dataset D) that combines the updates (8.4) and (8.5).
Figure 8.2: Evolution of cluster means and cluster assignments within k-means.
Figure 8.3: The clustering error E({m^(c)}_{c=1}^k, {y^(i)}_{i=1}^m | D) (see (8.3)), which is minimized
by k-means, is a non-convex function of the cluster means and assignments. It is therefore
possible for k-means to get trapped around a local minimum.
Figure 8.4: The clustering error E^(k) achieved by k-means for an increasing number k of clusters.

Finally, the choice of k might be guided by some probabilistic model which penalizes
larger values of k.
8.2 Soft Clustering with Gaussian Mixture Models
A principled approach to soft clustering uses the probability density function

    N(x; µ, Σ) = (1/√(det{2πΣ})) exp(−(1/2)(x − µ)^T Σ^{−1} (x − µ))   (8.7)

of a Gaussian random vector with mean µ and (invertible) covariance matrix Σ.¹
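As a quick sanity check of (8.7), we can verify (a sketch with arbitrary example parameters) that for a diagonal covariance matrix the density factorizes into a product of univariate normal densities:

```python
import numpy as np

def gauss_pdf(x, mu, Sigma):
    """Multivariate normal density (8.7)."""
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / np.sqrt(np.linalg.det(2 * np.pi * Sigma))

# for a diagonal covariance matrix, (8.7) factorizes into univariate normal densities
mu, sig = np.array([1.0, -1.0]), np.array([0.5, 2.0])
x = np.array([0.3, 0.7])
uni = np.exp(-0.5 * ((x - mu) / sig) ** 2) / np.sqrt(2 * np.pi * sig ** 2)
assert np.isclose(gauss_pdf(x, mu, np.diag(sig ** 2)), uni.prod())
```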
Each cluster c ∈ {1, . . . , k} is represented by a distribution of the form (8.7) with a
cluster-specific mean µ^(c) ∈ R^n and cluster-specific covariance matrix Σ^(c) ∈ R^{n×n}.
Since we do not know beforehand the cluster assignment c^(i) of the data point x^(i), we
model c^(i) as a random variable with probability distribution

    P(c^(i) = c) = p_c for c ∈ {1, . . . , k}.   (8.8)

The (prior) probabilities p_c are unknown and therefore have to be estimated somehow by
the soft-clustering method. The random cluster assignment c^(i) selects the cluster-specific
distribution (8.7) of the random data point x^(i),

    P(x^(i) | c^(i)) = N(x^(i); µ^(c^(i)), Σ^(c^(i))).   (8.9)
The degrees of belonging satisfy

    ∑_{c=1}^k y_c^(i) = 1 for each i = 1, . . . , m.   (8.11)
It is important to note that we use the conditional cluster probability (8.10), conditioned on
the dataset, for defining the degree of belonging y_c^(i). This is reasonable since the degree of
belonging y_c^(i) depends on the overall (cluster) geometry of the dataset D.
A probabilistic model for the observed data points x^(i) is obtained by considering each
data point x^(i) to be the result of a random draw from the distribution N(x; µ^(c^(i)), Σ^(c^(i)))
with some cluster index c^(i). Since the cluster indices c^(i) are unknown,² we model them as random

¹ Note that the distribution (8.7) is only defined for an invertible (non-singular) covariance matrix Σ.
² After all, the goal of soft-clustering is to estimate the cluster indices c^(i).
Figure 8.5: The GMM (8.12) yields a probability density function which is a weighted sum
of multivariate normal distributions N(µ^(c), Σ^(c)). The weight of the c-th component is the
cluster probability P(c^(i) = c).
variables. In particular, we model the cluster indices c^(i) as i.i.d. with probabilities p_c =
P(c^(i) = c).
The overall probabilistic model (8.8), (8.9) amounts to a Gaussian mixture model
(GMM). The marginal distribution P(x^(i)), which is the same for all data points x^(i), is an
(additive) mixture of multivariate Gaussian distributions,

    P(x^(i)) = ∑_{c=1}^k P(c^(i) = c) P(x^(i) | c^(i) = c) = ∑_{c=1}^k p_c N(x^(i); µ^(c), Σ^(c)).   (8.12)
The cluster assignments c^(i) are hidden (unobserved) random variables. We thus have to infer
or estimate these variables from the observed data points x^(i), which are i.i.d. realizations of
the GMM (8.12).
Using the GMM (8.12) for explaining the observed data points x^(i) turns the clustering
problem into a statistical inference or parameter estimation problem [39, 46]. The
estimation problem is to estimate the true underlying cluster probabilities p_c (see (8.8)),
cluster means µ^(c) and cluster covariance matrices Σ^(c) (see (8.9)) from the observed data
points D = {x^(i)}_{i=1}^m. The data points x^(i) are i.i.d. realizations of a random vector with
probability distribution (8.12).
We denote the estimates for the GMM parameters by p̂_c (≈ p_c), m^(c) (≈ µ^(c)) and C^(c) (≈
Σ^(c)), respectively. Based on these estimates, we can then compute an estimate ŷ_c^(i) of the
(a-posteriori) probability

    y_c^(i) = P(c^(i) = c | D)   (8.13)

of the i-th data point x^(i) belonging to cluster c, given the observed dataset D.
This estimation problem becomes significantly easier when solved in an alternating
fashion. In each iteration, we first compute a new estimate p̂_c of the cluster probabilities p_c,
given the current estimates m^(c), C^(c) for the cluster means and covariance matrices. Then,
using this new estimate p̂_c for the cluster probabilities, we update the estimates m^(c), C^(c)
of the cluster means and covariance matrices. Then, using the new estimates m^(c), C^(c), we
compute a new estimate p̂_c, and so on. By repeating these two steps, we obtain an iterative
soft-clustering method, which is summarized in Algorithm 7.
Algorithm 7 “Soft Clustering via GMM”
Input: dataset D = {x^(i)}_{i=1}^m; number k of clusters.
Initialize: choose initial GMM parameter estimates {m^(c), C^(c), p̂_c}_{c=1}^k.
1: repeat
2:   for each data point x^(i) and cluster c ∈ {1, . . . , k}, update the degrees of belonging

       y_c^(i) = p̂_c N(x^(i); m^(c), C^(c)) / ∑_{c′=1}^k p̂_{c′} N(x^(i); m^(c′), C^(c′))   (8.14)

3:   for each cluster c ∈ {1, . . . , k}, update the parameter estimates using the effective cluster size m_c = ∑_{i=1}^m y_c^(i):

       p̂_c := m_c/m,  m^(c) := (1/m_c) ∑_{i=1}^m y_c^(i) x^(i),  C^(c) := (1/m_c) ∑_{i=1}^m y_c^(i) (x^(i) − m^(c))(x^(i) − m^(c))^T

4: until convergence
Output: soft cluster assignments y^(i) = (y_1^(i), . . . , y_k^(i))^T for each data point x^(i)
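The update (8.14) and the alternating parameter updates can be sketched with numpy (our own minimal illustration; a small ridge 1e-6·I is added to the covariance estimates to keep them invertible):

```python
import numpy as np

def gmm_soft_clustering(X, k, n_iter=50, seed=0):
    """EM-style soft clustering (sketch of Algorithm 7)."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    means = X[rng.choice(m, size=k, replace=False)]   # initial cluster means
    covs = np.stack([np.eye(n)] * k)                  # initial covariance estimates
    probs = np.full(k, 1 / k)                         # initial cluster probabilities
    for _ in range(n_iter):
        # update degrees of belonging (8.14)
        Y = np.empty((m, k))
        for c in range(k):
            d = X - means[c]
            quad = np.einsum('ij,jl,il->i', d, np.linalg.inv(covs[c]), d)
            Y[:, c] = probs[c] * np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(2 * np.pi * covs[c]))
        Y /= Y.sum(axis=1, keepdims=True)
        # update the GMM parameter estimates
        mc = Y.sum(axis=0)                            # effective cluster sizes
        probs = mc / m
        means = (Y.T @ X) / mc[:, None]
        for c in range(k):
            d = X - means[c]
            covs[c] = (Y[:, c, None] * d).T @ d / mc[c] + 1e-6 * np.eye(n)
    return Y

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(8, 1, (30, 2))])
Y = gmm_soft_clustering(X, k=2)
assert np.allclose(Y.sum(axis=1), 1.0)   # degrees of belonging sum to one (8.11)
```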
As for k-means, we can interpret the soft clustering problem as an instance of the ERM
principle (Chapter 4). In particular, Algorithm 7 aims at minimizing the empirical risk

    E({m^(c), C^(c), p̂_c}_{c=1}^k | D) = − log Prob(D; {m^(c), C^(c), p̂_c}_{c=1}^k).   (8.15)

The interpretation of Algorithm 7 as a method for minimizing the empirical risk (8.15)
suggests monitoring the decrease of the empirical risk − log Prob(D; {m^(c), C^(c), p̂_c}_{c=1}^k) to
decide when to stop iterating (see Step 4 of Algorithm 7).
Similar to the k-means Algorithm 5, the soft-clustering Algorithm 7 suffers from the
problem of getting stuck in local minima of the empirical risk (8.15). As for k-means,
we can mitigate this problem by running Algorithm 7 several times, each time with a different
initialization of the GMM parameter estimates {m^(c), C^(c), p̂_c}_{c=1}^k, and then picking the result
which yields the smallest empirical risk (8.15).
The empirical risk (8.15) underlying the soft-clustering Algorithm 7 is essentially a (negative)
log-likelihood function. Thus, Algorithm 7 can be interpreted as an approximate maximum
likelihood estimator for the true underlying GMM parameters {µ^(c), Σ^(c), p_c}_{c=1}^k. In particular,
Algorithm 7 is an instance of a generic approximate maximum likelihood technique referred
to as expectation maximization (EM) (see [30, Chap. 8.5] for more details). The
interpretation of Algorithm 7 as a special case of EM allows us to characterize its behaviour
using existing convergence results for EM methods [74].
There is an interesting link between the soft-clustering Algorithm 7 and k-means. In
particular, k-means hard clustering can be interpreted as an extreme case of the soft-clustering
Algorithm 7.
Consider fixing the cluster covariance matrices Σ^(c) within the GMM (8.9) to be the
scaled identity matrix:

    Σ^(c) = σ² I for all c ∈ {1, . . . , k}.   (8.16)
We assume the covariance matrix (8.16), with a particular value for σ², to be the actual
“correct” covariance matrix for each cluster c. Thus, we replace the covariance matrix updates in
Algorithm 7 with C^(c) := Σ^(c).
When using a very small variance σ² in (8.16), the update (8.14) tends to enforce
y_c^(i) ∈ {0, 1}, i.e., each data point x^(i) is associated with exactly one cluster c, whose cluster
mean m^(c) is closest to the data point x^(i). Thus, for σ² → 0, the soft-clustering update
(8.14) reduces to the hard cluster assignment update (8.1) of the k-means Algorithm 5.
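The effect of a small variance σ² on the update (8.14) can be seen in a small numerical experiment (our own sketch, with two fixed cluster means and equal cluster probabilities):

```python
import numpy as np

def degrees_of_belonging(x, means, sigma2):
    """Update (8.14) for Sigma^(c) = sigma2*I and equal cluster probabilities p_c.
    The common normalization factor of the Gaussian densities cancels in (8.14)."""
    w = np.exp(-0.5 * np.sum((x - means) ** 2, axis=1) / sigma2)
    return w / w.sum()

means = np.array([[0.0, 0.0], [4.0, 0.0]])
x = np.array([1.0, 0.0])                               # closer to the first cluster mean

y_soft = degrees_of_belonging(x, means, sigma2=4.0)    # genuinely soft assignment
y_hard = degrees_of_belonging(x, means, sigma2=0.01)   # close to a hard assignment
assert y_soft[0] < 0.99 and y_hard[0] > 0.999
```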
can be similar in terms of connectivity, even if their Euclidean distance is large. Density-based
spatial clustering of applications with noise (DBSCAN) is a hard clustering method
that uses a connectivity-based similarity measure. In contrast to k-means and the GMM,
DBSCAN does not require the number of clusters to be specified beforehand; the number of
clusters delivered by DBSCAN depends on its parameters. Moreover, DBSCAN detects outliers,
which are interpreted as degenerate clusters consisting of a single data point. For a detailed
discussion of how DBSCAN works, we refer to https://en.wikipedia.org/wiki/DBSCAN.
8.4 Exercises
8.4.1 Image Compression with k-means
Use k-means to compress an RGB bitmap image: instead of storing the RGB value of each
pixel, we only need to store its cluster index and the k cluster means.
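A minimal numpy sketch of this compression (with a random array standing in for the bitmap, since no particular image is prescribed; loading a real image would require an imaging library such as Pillow):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3)).astype(float)  # stand-in for an RGB bitmap
pixels = img.reshape(-1, 3)                                 # one RGB vector per pixel
k = 8                                                       # size of the color palette

# plain k-means on the RGB vectors (a few fixed iterations for brevity)
means = pixels[rng.choice(len(pixels), size=k, replace=False)]
for _ in range(20):
    y = np.linalg.norm(pixels[:, None, :] - means[None, :, :], axis=2).argmin(axis=1)
    for c in range(k):
        if np.any(y == c):
            means[c] = pixels[y == c].mean(axis=0)

compressed = means[y].reshape(img.shape)   # each pixel replaced by its cluster mean
# storage: one index (log2(k) bits) per pixel plus the k cluster means,
# instead of 24 bits per pixel
assert compressed.shape == img.shape
```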
Chapter 9
Feature Learning
Figure 9.1: Dimensionality reduction methods aim at finding a map h which maximally
compresses the raw data while still allowing us to accurately reconstruct the original data point
from a small number of features x_1, . . . , x_n.
Roughly speaking, ML methods exploit the intrinsic geometry of (large) sets of data
points to compute predictions. By definition, we represent these data points as elements of
the feature space X. Note that the features are a design choice, so we can shape the intrinsic
geometry of the data points by using different choices for the features (and feature space).
Feature learning methods automate the search for a good feature space for a given
dataset. A subclass of feature learning methods are dimensionality reduction methods,
where the new feature space has a (much) smaller dimension than the original feature space
(see Section 9.1). However, sometimes it might be useful to change to a higher-dimensional
feature space (see Section 9.6).
Feature learning can be developed as an approximation problem: the raw data is the vector
to be approximated, and the approximation has to lie in a (small) subspace which is spanned
by all possible low-dimensional feature vectors.
9.2 Principal Component Analysis
Consider a data point z ∈ R^D which is represented by a (typically very long) vector of length
D. The length D of the raw feature vector might easily be on the order of millions. To obtain
a small set of relevant features x ∈ R^n, we apply a linear transformation to the data point:

    x = W z.   (9.1)

Here, the “compression” matrix W ∈ R^{n×D} maps (in a linear fashion) the long vector
z ∈ R^D to a short feature vector x ∈ R^n.
It is reasonable to choose the compression matrix W ∈ R^{n×D} in (9.1) such that the
resulting features x ∈ R^n allow us to approximate the original data point z ∈ R^D as accurately
as possible. We can approximate (or recover) the data point z ∈ R^D from the features
x by applying a reconstruction operator R ∈ R^{D×n}, which is chosen such that

    z ≈ R x = R W z,   (9.2)

where the second equality uses (9.1).
The approximation error E(W, R | D) resulting when (9.2) is applied to each data point
in a dataset D = {z^(i)}_{i=1}^m is then

    E(W, R | D) = (1/m) ∑_{i=1}^m ‖z^(i) − R W z^(i)‖².   (9.3)
One can verify that the approximation error E(W, R | D) can only be minimal if the
compression matrix W is of the form

    W = W_PCA := (u^(1), . . . , u^(n))^T ∈ R^{n×D},   (9.4)

with n orthonormal vectors u^(l) which correspond to the n largest eigenvalues of the sample
covariance matrix

    Q := (1/m) Z^T Z ∈ R^{D×D}   (9.5)

with the data matrix Z = (z^(1), . . . , z^(m))^T ∈ R^{m×D}.¹ By its very definition (9.5), the matrix Q
is positive semi-definite, so that it allows for an eigenvalue decomposition (EVD) of the form
[65]

    Q = (u^(1), . . . , u^(D)) diag(λ^(1), . . . , λ^(D)) (u^(1), . . . , u^(D))^T,

with orthonormal eigenvectors u^(r) and non-negative eigenvalues λ^(1) ≥ λ^(2) ≥ . . . ≥ λ^(D) ≥ 0.

¹ Some authors define the data matrix as Z = (z̃^(1), . . . , z̃^(m))^T ∈ R^{m×D} using “centered” data points
z̃^(i) = z^(i) − m̂ obtained by subtracting the sample mean m̂ = (1/m) ∑_{i=1}^m z^(i).
Note that the length n of the feature vectors x, which is also the number of principal components (PCs) used, is an input
parameter of Algorithm 8. The number n can be chosen between n = 0 and n = D. However,
it can be shown that PCA for n > m is not well-defined; in particular, the orthonormal
eigenvectors u^(n+1), . . . , u^(D) are then not unique.
From a computational perspective, Algorithm 8 essentially amounts to performing an
EVD of the sample covariance matrix Q (see (9.5)). Indeed, the EVD of Q provides not
only the optimal compression matrix W_PCA but also the measure E^(PCA) of the information
loss incurred by replacing the original data points z^(i) ∈ R^D with the smaller feature vectors
x^(i) ∈ R^n. In particular, this information loss is measured by the approximation error
obtained for the optimal reconstruction matrix R_opt = W_PCA^T,

    E^(PCA) := E(W_PCA, R_opt | D) = ∑_{r=n+1}^D λ^(r).   (9.6)
As depicted in Figure 9.2, the approximation error E^(PCA) decreases with an increasing number
n of PCs used for the new features (9.1). The maximum error E^(PCA) = (1/m) ∑_{i=1}^m ‖z^(i)‖²
is obtained for n = 0, which amounts to completely ignoring the data points z^(i). In the
other extreme case, where n = D and x^(i) = z^(i), which amounts to no compression at all, the
approximation error is zero, E^(PCA) = 0.
Figure 9.2: Reconstruction error E^(PCA) (see (9.6)) of PCA for a varying number n of PCs.
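The relation (9.6) between the PCA approximation error and the discarded eigenvalues can be checked numerically (our own sketch; following the definition (9.5), the data points are not centered):

```python
import numpy as np

rng = np.random.default_rng(0)
m, D, n = 100, 5, 2
Z = rng.normal(size=(m, D))                  # data matrix, rows are data points z^(i)

Q = Z.T @ Z / m                              # sample covariance matrix (9.5)
eigvals, U = np.linalg.eigh(Q)               # EVD; eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]            # sort eigenvalues in decreasing order
eigvals, U = eigvals[order], U[:, order]

W = U[:, :n].T                               # compression matrix W_PCA (9.4)
R = W.T                                      # optimal reconstruction matrix R_opt
err = np.mean(np.sum((Z - Z @ W.T @ R.T) ** 2, axis=1))   # approximation error (9.3)
assert np.isclose(err, eigvals[n:].sum())    # matches the discarded eigenvalues (9.6)
```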
• computational budget: choose n sufficiently small such that the computational complexity
of the overall ML method fits the available computational resources.
• statistical budget: consider using PCA as a pre-processing step within a linear regression
problem (see Section 3.1), i.e., we use the output x^(i) of PCA as the feature vectors
in linear regression. In order to avoid overfitting, we should choose n < m (see Chapter 7).
• elbow method: choose n just large enough such that the approximation error E^(PCA) is reasonably
small (see Figure 9.2).
Figure 9.3: A scatter plot of feature vectors x^(i) = (x_1^(i), x_2^(i))^T whose entries are the first two
PCs of the Bitcoin statistics z^(i) of the i-th day.
9.2.4 Extensions of PCA
Several extensions of the basic PCA method have been proposed:
• kernel PCA [30, Ch. 14.5.4]: combines PCA with a non-linear feature map (see
Section 3.9).
• robust PCA [73]: modifies PCA to better cope with outliers in the dataset.
• sparse PCA [30, Ch. 14.5.5]: requires each PC to depend only on a small number
of data attributes z_j.
A matrix W with entries drawn i.i.d. from a suitable probability distribution (such as a Bernoulli or
Gaussian distribution) is, with high probability, a good compression matrix (see (9.1)) [9, 38].
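As an illustration of this idea (our own sketch, in the spirit of the Johnson–Lindenstrauss lemma; the dimensions and the tolerance used here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, D, n = 50, 2000, 500
Z = rng.normal(size=(m, D))                 # raw data points of length D
W = rng.normal(size=(n, D)) / np.sqrt(n)    # random Gaussian compression matrix for (9.1)
X = Z @ W.T                                 # compressed feature vectors x^(i) = W z^(i)

# with high probability, pairwise Euclidean distances are roughly preserved
ratio = np.linalg.norm(X[0] - X[1]) / np.linalg.norm(Z[0] - Z[1])
assert abs(ratio - 1) < 0.2
```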
Chapter 10
Privacy-Preserving ML
Many ML applications involve data points representing individual humans. These data
points might include sensitive data, such as medical records, which are subject to privacy
protection. This chapter discusses some techniques for preprocessing the raw data in order to protect
the privacy of individuals while still allowing us to solve the overall ML task. We will illustrate
these techniques using a stylized healthcare application.
Figure 10.1: Data points represent humans. We are interested in the fruit preference of
humans. Their gender is considered sensitive information and should not be revealed to ML
methods.
must only forward the fraction of infected patients. This is an example of privacy-preserving
data processing. For a sufficiently large number of patients at the hospital (say, more than
1000), we cannot infer much about individual patients just from the fraction of infected
patients treated in that hospital.
10.2 Exercises
10.2.1 Where are you?
Consider an ML method that uses FMI data for temperature forecasts. The ML method
downloads the following sequence of daily temperatures: ??,???,???,??. What is the most
likely nearest observation station to the ML user?
10.3 Federated Learning
Federated learning (FL) methods operate on the level of local datasets: they only exchange
model parameter updates; no raw local data is revealed.
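A toy illustration of this idea (our own sketch, not any particular FL protocol): several clients each hold a local dataset for the same linear model; in each round they send only a gradient of their local loss to a server, which averages these parameter updates.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])

def make_local_dataset():
    """A local dataset (e.g., at one hospital); its raw data never leaves the client."""
    X = rng.normal(size=(20, 2))
    return X, X @ w_true + 0.01 * rng.normal(size=20)

local_data = [make_local_dataset() for _ in range(3)]

w = np.zeros(2)                              # shared model parameters
for _ in range(200):
    # each client computes a gradient of its local average squared error loss ...
    grads = [2 * X.T @ (X @ w - y) / len(y) for X, y in local_data]
    # ... and only these parameter updates are sent to and averaged by the server
    w -= 0.1 * np.mean(grads, axis=0)

assert np.allclose(w, w_true, atol=0.1)
```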
Chapter 11
Explainable ML
A key challenge for the successful deployment of ML methods in many (critical) application
domains is their explainability. Human users of ML seem to have a strong desire to get
explanations that resolve the uncertainty about predictions and decisions obtained from ML
methods. Explainable ML enables the user to better predict the outcomes of ML methods.
Explainable ML is challenging since explanations must be tailored (personalized) to
individual users with varying backgrounds. Some users might have received university-level
education in ML, while other users might have no formal training in linear algebra. Linear
regression with few features might be perfectly interpretable for the first group but might
be considered a black-box by the latter.
There is a close relation between finding good explanations and active learning. Active
learning aims at finding data points (via their features) which provide the most information
about the true model parameters. Explainable ML, in contrast, aims at finding explanations
(e.g., data points from the training set) which provide the most information about the
prediction delivered by some black-box ML method. There is also a relation between explainable
ML and feature learning: explanations can be obtained from feature learning methods by
learning the subset of features which provides the most information about the prediction (and
not about the label itself).
Chapter 12
Lists of Symbols
12.1 Sets
Chapter 13
Glossary
• classifier: a hypothesis map h : X → Y with discrete label space (e.g., Y = {−1, 1}).
• data point: an elementary unit of information such as a single pixel, a single image,
a particular audio recording, a letter, a text document or an entire social network user
profile.
• features: any measurements (or quantities) used to characterize a data point (e.g.,
the maximum amplitude of a sound recording or the greenness of an RGB image). In
principle, we can use as a feature any quantity which can be measured or computed
easily in an automated fashion.
• hypothesis: a map h : X → Y that is used to
estimate (or approximate) the label y using the predicted label ŷ = h(x). ML is about
automating the search for a good hypothesis map such that the error y − h(x) is small.
• i.i.d.: independent and identically distributed; e.g., “x, y, z are i.i.d. random variables”
means that the joint probability distribution p(x, y, z) of the random variables x, y, z
factors into the product p(x)p(y)p(z) with the marginal probability distribution p(·)
which is the same for all three variables x, y, z.
• label: some property of a data point which is of interest, such as whether a webcam
snapshot shows a forest fire or not. In contrast to features, labels are properties of
a data point that cannot be measured or computed easily in an automated fashion.
Instead, acquiring accurate label information often involves human expert labor. Many
ML methods aim at learning accurate predictor maps that allow us to guess or approximate
the label of a data point based on its features.
• loss function: a function which assigns to a given data point (x, y), with features
x and label y, and a hypothesis map h a number that quantifies the prediction error
y − h(x).
• training data: a dataset which is used for finding a good hypothesis map h ∈ H out
of a hypothesis space H, e.g., via empirical risk minimization (see Chapter 4).
• validation data: a dataset which is used for evaluating the quality of a predictor
which has been learnt using some other (training) data.
Bibliography
[2] H. Ambos, N. Tran, and A. Jung. Classifying big data over networks via the logistic
network lasso. In Proc. 52nd Asilomar Conf. Signals, Systems, Computers, Oct./Nov.
2018.
[3] H. Ambos, N. Tran, and A. Jung. The logistic network lasso. arXiv, 2018.
[5] P. Austin, P. Kaski, and K. Kubjas. Tensor network complexity of multilinear maps.
arXiv, 2018.
[7] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 2nd edition,
June 1999.
[8] P. Billingsley. Probability and Measure. Wiley, New York, 3rd edition, 1995.
[11] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed Optimization and
Statistical Learning via the Alternating Direction Method of Multipliers, volume 3. Now
Publishers, Hanover, MA, 2010.
[13] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data. Springer, New
York, 2011.
[15] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University
Press, New York, NY, USA, 2006.
[16] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. The MIT
Press, Cambridge, Massachusetts, 2006.
[21] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. CoRR,
abs/1512.03965, 2015.
[24] W. Gautschi and G. Inglese. Lower bounds for the condition number of vandermonde
matrices. Numer. Math., 52:241 – 250, 1988.
[25] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University
Press, Baltimore, MD, 3rd edition, 1996.
[26] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[28] R. Gray, J. Kieffer, and Y. Linde. Locally optimal block quantizer design. Information
and Control, 45:178 – 198, 1980.
[29] A. Halevy, P. Norvig, and F. Pereira. The unreasonable effectiveness of data. IEEE
Intelligent Systems, March/April 2009.
[31] E. Hazan. Introduction to Online Convex Optimization. Now Publishers Inc., 2016.
[34] A. Jung. A fixed-point of view on gradient methods for big data. Frontiers in Applied
Mathematics and Statistics, 3, 2017.
[36] A. Jung and M. Hulsebos. The network nullspace property for compressed sensing of
big data over networks. Front. Appl. Math. Stat., Apr. 2018.
[37] A. Jung, N. Quang, and A. Mara. When is Network Lasso Accurate? Front. Appl.
Math. Stat., 3, Jan. 2018.
[38] A. Jung, G. Tauböck, and F. Hlawatsch. Compressive spectral estimation for
nonstationary random processes. IEEE Trans. Inf. Theory, 59(5):3117–3138, May 2013.
[40] P. Koehn. Europarl: A parallel corpus for statistical machine translation. In The 10th
Machine Translation Summit, pages 79–86, AAMT, Phuket, Thailand, 2005.
[41] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques.
Adaptive computation and machine learning. MIT Press, 2009.
[43] C. Lampert. Kernel methods in computer vision. Foundations and Trends in Computer
Graphics and Vision, 2009.
[44] J. Larsen and C. Goutte. On optimal data split for generalization estimation and model
selection. In IEEE Workshop on Neural Networks for Signal Process, 1999.
[46] E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer, New York, 2nd
edition, 1998.
[47] V. Lempitsky, P. Kohli, C. Rother, and T. Sharp. Image segmentation with a bounding
box prior. In 2009 IEEE 12th International Conference on Computer Vision, pages
277–284, Sept 2009.
[50] T. Mitchell. The need for biases in learning generalizations. Technical Report CBM-TR
5-110, Rutgers University, New Brunswick, New Jersey, USA, 1980.
[51] K. Mortensen and T. Hughes. Comparing amazon’s mechanical turk platform to
conventional data collection methods in the health and medical research literature. J.
Gen. Intern Med., 33(4):533–538, 2018.
[52] N. Murata. A statistical study on on-line learning. In D. Saad, editor, On-line Learning
in Neural Networks, pages 63–92. Cambridge University Press, New York, NY, USA,
1998.
[53] B. Nadler, N. Srebro, and X. Zhou. Statistical analysis of semi-supervised learning: The
limit of infinite unlabelled data. In Advances in Neural Information Processing Systems
22, pages 1330–1338. 2009.
[57] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization,
1(3):123–231, 2013.
[58] H. Poor. An Introduction to Signal Detection and Estimation. Springer, 2nd edition, 1994.
[59] S. Roweis. EM Algorithms for PCA and SPCA. In Advances in Neural Information
Processing Systems, pages 626–632. MIT Press, 1998.
[61] A. Sharma and K. Paliwal. Fast principal component analysis using fixed-point analysis.
Pattern Recognition Letters, 28:1151 – 1155, 2007.
[62] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern
Anal. Mach. Intell., 22(8):888–905, Aug. 2000.
[63] S. Smoliski and K. Radtke. Spatial prediction of demersal fish diversity in the baltic
sea: comparison of machine learning and regression-based techniques. ICES Journal of
Marine Science, 74(1):102–111, 2017.
[64] S. Sra, S. Nowozin, and S. J. Wright, editors. Optimization for Machine Learning. MIT
Press, 2012.
[70] O. Vasicek. A test for normality based on sample entropy. Journal of the Royal Statistical
Society. Series B (Methodological), 38(1):54–59, 1976.
[73] J. Wright, Y. Peng, Y. Ma, A. Ganesh, and S. Rao. Robust principal component
analysis: Exact recovery of corrupted low-rank matrices by convex optimization. In
Neural Information Processing Systems, NIPS 2009, 2009.
[75] Y. Yamaguchi and K. Hayashi. When does label propagation fail? a view from a network
generative model. In Proceedings of the Twenty-Sixth International Joint Conference
on Artificial Intelligence, IJCAI-17, pages 3224–3230, 2017.
[76] K. Young. Bayesian diagnostics for checking assumptions of normality. Journal of
Statistical Computation and Simulation, 47(3–4):167 – 180, 1993.
[77] W. W. Zachary. An information flow model for conflict and fission in small groups. J.
Anthro. Res., 33(4), 1977.