
CPSC540

Machine Learning

Nando de Freitas
January, 2013
University of British Columbia
Outline of the lecture
This lecture provides an introduction to the course. It covers the following four areas:

1. Introduction to the machine learning approach
2. The big data phenomenon
3. Drawing inspiration from neural systems
4. Machine learning applications and impact

The intent of the lecture is not to explain the details of building ML systems, or to tell you what to study for the exam. Rather, it is an overview of what can be accomplished with ML. If it inspires you, then you'll have to take the course and learn a lot of cool stuff!
Application: Invariant recognition in natural images

[Thomas Serre 2012]
Computer vision successes
[Thomas Serre 2012]

Millions of labeled examples are used to build real-world applications, such as pedestrian detection.
[Thomas Serre]
Application: Autonomous driving

Mobileye: already available on the Volvo S60, and soon on models from most car manufacturers.
“tufa”   “tufa”   “tufa”

Can you pick out the tufas?

[Josh Tenenbaum]
Machine learning
Machine learning deals with the problem of extracting features from data so as to solve many different predictive tasks:

 Forecasting (e.g. energy demand prediction, sales)
 Imputing missing data (e.g. Netflix recommendations)
 Detecting anomalies (e.g. intruders, virus mutations)
 Classifying (e.g. credit risk assessment, cancer diagnosis)
 Ranking (e.g. Google search, personalization)
 Summarizing (e.g. news zeitgeist, social media sentiment)
 Decision making (e.g. AI, robotics, compiler tuning, trading)

When to apply machine learning
 Human expertise is absent (e.g. navigating on Mars)
 Humans are unable to explain their expertise (e.g. speech recognition, vision, language)
 The solution changes with time (e.g. tracking, temperature control, preferences)
 The solution needs to be adapted to particular cases (e.g. biometrics, personalization)
 The problem size is too vast for our limited reasoning capabilities (e.g. calculating webpage ranks, matching ads to Facebook pages)
Machine learning in language
“Large” text dataset:

• 1,000,000 words in 1967
• 1,000,000,000,000 words in 2006

Success stories:

• Speech recognition
• Machine translation

What is a common thing that makes both of these work well?

• Lots of labeled data

[Halevy, Norvig & Pereira, 2009]


Scene completion: More data is better

Given an input image with a missing region, Efros uses matching scenes from a large collection of photographs to complete the image.

[Efros, 2008]
The semantic challenge
 “We’ve already solved the sociological problem of building a network infrastructure that has encouraged hundreds of millions of authors to share a trillion pages of content.
 We’ve solved the technological problem of aggregating and indexing all this content.
 But we’re left with a scientific problem of interpreting the content.”

[Halevy, Norvig & Pereira, 2009]

 It’s not only about how big your data is. It is about understanding it and using this understanding to derive reasonable inferences. Think of citation matching.

 http://openie.cs.washington.edu/
Learning, understanding and causality
“Learning denotes changes in the system that are adaptive in the sense that they enable the system to do the task or tasks drawn from the same population more efficiently and more effectively the next time.”

Herbert Simon

[Diagram: an agent perceives the environment through percepts and acts back on it through actions.]
A source of inspiration
Associative memory

Example 2: Say the alphabet… backward.

[Jain, Mao & Mohiuddin, 1996]

[Figure: the x and y coordinates correspond to the spatial location of a rat; the red dots indicate the places where a particular neuron fires. Hafting et al. 2005]
Selectivity and Topographic maps in V1

[Diagram: an autoencoder maps an image patch through hidden units to a binary feature vector (1 0 1 0 … 0).]
Deep learning with autoencoders

[Russ Salakhutdinov, Geoff Hinton, Yann Lecun, Yoshua Bengio, Andrew Ng, …]
Validating Unsupervised Learning

[Figure: neuron responses at the 1st, 2nd and 3rd stages of the feature hierarchy. Ranzato]

Top Images For Best Face Neuron [Ranzato]

Best Input For Face Neuron [Ranzato]
Hierarchical spatial-temporal feature learning

[Figure: observed gaze sequence vs. model predictions. Bo Chen et al 2010]

Application: Speech recognition

[George Dahl et al 2011]


Next lecture

In the following lecture we will revise least squares predictions.
CPSC540

Linear prediction

Nando de Freitas
January, 2013
University of British Columbia
Outline of the lecture
This lecture introduces us to the topic of supervised learning. Here the data consist of input-output pairs. Inputs are also often referred to as covariates, predictors and features, while outputs are known as variates and labels. The goal of the lecture is for you to:

 Understand the supervised learning setting.
 Understand linear regression (aka least squares).
 Understand how to apply linear regression models to make predictions.
 Learn to derive the least squares estimate by optimization.
Linear supervised learning

 Many real processes can be approximated with linear models.
 Linear regression often appears as a module of larger systems.
 Linear problems can be solved analytically.
 Linear prediction provides an introduction to many of the core concepts of machine learning.
Energy demand prediction
Prostate cancer example
 Goal: predict the log of the prostate-specific antigen level (lpsa) from a number of clinical measures in men who are about to receive a radical prostatectomy.

The inputs are:

• Log cancer volume (lcavol)
• Log prostate weight (lweight)
• Age
• Log of the amount of benign prostatic hyperplasia (lbph)
• Seminal vesicle invasion (svi) – binary
• Log of capsular penetration (lcp)
• Gleason score (gleason) – ordered categorical
• Percent of Gleason scores 4 or 5 (pgg45)

Which inputs are more important?

[Hastie, Tibshirani & Friedman book]
Linear prediction
Linear prediction

Likewise, for a point that we have never seen before, say x = [50 20], we generate the following prediction (with the fitted coefficients θ̂ = [1, 0, 0.5]ᵀ implied by the arithmetic):

y(x) = [1 50 20] θ̂ = 1 + 0 + 10 = 11.
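A minimal numerical sketch of this prediction step, assuming the coefficient vector θ̂ = [1, 0, 0.5]ᵀ read off the arithmetic above:

import numpy as np

# Fitted coefficients implied by the worked example: bias 1, weights 0 and 0.5.
theta_hat = np.array([1.0, 0.0, 0.5])

# New input x = [50, 20]; prepend a 1 so the bias term is handled by the dot product.
x_new = np.array([1.0, 50.0, 20.0])

y_pred = x_new @ theta_hat
print(y_pred)  # 11.0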


Optimization approach
Optimization approach
Optimization: Finding the minimum
Optimization
Least squares estimates
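As a sketch of the estimate these slides derive, in the standard matrix notation (design matrix X, target vector y) used throughout the course:

J(\theta) = (y - X\theta)^T (y - X\theta), \qquad
\nabla_\theta J(\theta) = -2 X^T (y - X\theta) = 0
\;\Longrightarrow\; \hat{\theta} = (X^T X)^{-1} X^T y.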
Multiple outputs
Next lecture
In the next lecture, we learn to derive the linear regression estimates
by maximum likelihood with multivariate Gaussian distributions.

Please go to the Wikipedia page for the multivariate Normal distribution beforehand.
CPSC540

Probabilistic linear prediction


and maximum likelihood

Nando de Freitas
January, 2013
University of British Columbia
Outline of the lecture
In this lecture, we formulate the problem of linear prediction using probabilities. We also introduce the maximum likelihood estimate and show that it coincides with the least squares estimate. The goal of the lecture is for you to learn:

 Multivariate Gaussian distributions.
 How to formulate the likelihood for linear regression.
 How to compute the maximum likelihood estimates for linear regression.
 Why maximum likelihood is used.
Univariate Gaussian distribution
Sampling from a Gaussian distribution
The bivariate Gaussian distribution
Multivariate Gaussian distribution
Bivariate Gaussian distribution example
Assume we have two independent univariate Gaussian variables

x₁ ∼ N(µ₁, σ²)  and  x₂ ∼ N(µ₂, σ²).

Their joint distribution p(x₁, x₂) is:
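A sketch of what the joint density works out to, assuming independence and the shared variance σ² written above:

p(x_1, x_2) = \mathcal{N}(x_1 \mid \mu_1, \sigma^2)\, \mathcal{N}(x_2 \mid \mu_2, \sigma^2)
= \frac{1}{2\pi\sigma^2} \exp\!\left( -\frac{(x_1 - \mu_1)^2 + (x_2 - \mu_2)^2}{2\sigma^2} \right),

i.e. a bivariate Gaussian with mean (µ₁, µ₂) and diagonal covariance σ²I.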


Sampling from a multivariate Gaussian distribution
We have n = 3 data points y₁ = 1, y₂ = 0.5, y₃ = 1.5, which are independent and Gaussian with unknown mean θ and variance 1:

yᵢ ∼ N(θ, 1) = θ + N(0, 1),

with likelihood P(y₁, y₂, y₃ | θ) = P(y₁ | θ) P(y₂ | θ) P(y₃ | θ). Consider two guesses of θ, 1 and 2.5. Which has the higher likelihood?

Finding the θ that maximizes the likelihood is equivalent to sliding the Gaussian until the product of the 3 green bars (the likelihood) is maximized.
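A quick numerical check of the two guesses, as a sketch using the three data points above:

import numpy as np
from scipy.stats import norm

y = np.array([1.0, 0.5, 1.5])

# Likelihood of the data under N(theta, 1) for the two candidate means.
for theta in (1.0, 2.5):
    likelihood = np.prod(norm.pdf(y, loc=theta, scale=1.0))
    print(theta, likelihood)

# theta = 1.0 gives the larger product, since it sits at the sample mean.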
The likelihood for linear regression
Let us assume that each label yᵢ is Gaussian distributed with mean xᵢᵀθ and variance σ², which in short we write as:

yᵢ ∼ N(xᵢᵀθ, σ²) = xᵢᵀθ + N(0, σ²).
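A sketch of the resulting likelihood for n i.i.d. observations, in the notation above:

p(y \mid X, \theta, \sigma^2) = \prod_{i=1}^{n} \mathcal{N}(y_i \mid x_i^T \theta, \sigma^2)
= (2\pi\sigma^2)^{-n/2} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i^T \theta)^2 \right),

so maximizing the log-likelihood over θ is the same as minimizing the sum of squared errors, which is why the ML estimate coincides with least squares.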
Maximum likelihood
The ML estimate of θ is:
The ML estimate of σ is:
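As a sketch of the standard results that follow from the Gaussian likelihood (the θ estimate coincides with least squares, as stated in the outline):

\hat{\theta}_{ML} = (X^T X)^{-1} X^T y, \qquad
\hat{\sigma}^2_{ML} = \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i^T \hat{\theta}_{ML})^2.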
Making predictions
The ML plug-in prediction, given the training data D = (X, y), for a new input x∗ and known σ², is given by:

p(y | x∗, D, σ²) = N(y | x∗ᵀ θ_ML, σ²).
Frequentist learning and maximum likelihood
Frequentist learning assumes that there exists a true model, say with parameters θ₀.

The estimate (learned value) will be denoted θ̂.

Given n data points, x₁:ₙ = {x₁, x₂, …, xₙ}, we choose the value of θ that has the highest probability of generating the data. That is,

θ̂ = arg max_θ p(x₁:ₙ | θ).
Bernoulli: a model for coins
A Bernoulli random variable X takes values in {0, 1}:

p(x | θ) = θ if x = 1,  and  p(x | θ) = 1 − θ if x = 0,

where θ ∈ (0, 1). We can write this probability more succinctly as follows:
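A sketch of the usual compact form:

p(x \mid \theta) = \theta^{x} (1 - \theta)^{1 - x}, \qquad x \in \{0, 1\}.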
Entropy
In information theory, entropy H is a measure of the uncertainty associated with a random variable. It is defined as:

H(X) = − Σₓ p(x | θ) log p(x | θ).

Example: For a Bernoulli variable X, the entropy is:
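A sketch of how the example works out, using the Bernoulli pmf above:

H(X) = -\theta \log \theta - (1 - \theta) \log (1 - \theta),

which is maximized at θ = 1/2 (a fair coin) and equals zero at θ = 0 or θ = 1.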


MLE - properties
For independent and identically distributed (i.i.d.) data from p(x | θ₀), the MLE minimizes the Kullback-Leibler divergence:

\hat{\theta} = \arg\max_{\theta} \prod_{i=1}^{N} p(x_i \mid \theta)
= \arg\max_{\theta} \sum_{i=1}^{N} \log p(x_i \mid \theta)
= \arg\max_{\theta} \left[ \frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid \theta) - \frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid \theta_0) \right]
= \arg\max_{\theta} \frac{1}{N} \sum_{i=1}^{N} \log \frac{p(x_i \mid \theta)}{p(x_i \mid \theta_0)}
\longrightarrow \arg\min_{\theta} \int \log \frac{p(x \mid \theta_0)}{p(x \mid \theta)}\, p(x \mid \theta_0)\, dx
= \arg\min_{\theta} \mathrm{KL}\big( p(x \mid \theta_0) \,\|\, p(x \mid \theta) \big),

where the arrow denotes the limit as N → ∞: by the law of large numbers, the sample average converges to the expectation under p(x | θ₀).
MLE - properties
Under smoothness and identifiability assumptions, the MLE is consistent:

θ̂ →ᵖ θ₀   (convergence in probability),

or equivalently,

plim(θ̂) = θ₀,

or equivalently,

lim_{N→∞} P(|θ̂ − θ₀| > α) = 0   for every α > 0.
MLE - properties
The MLE is asymptotically normal. That is, as N → ∞, we have:

θ̂ − θ₀ ⟹ N(0, I⁻¹),

where I is the Fisher information matrix.

It is asymptotically optimal, or efficient: asymptotically, it has the lowest variance among all well-behaved estimators. In particular, it attains a lower bound on the CLT variance known as the Cramér-Rao lower bound.

But what about issues like robustness and computation? Is MLE always the right option?
Bias and variance
Note that the estimator is a function of the data: θ̂ = g(D).

Its bias is:

bias(θ̂) = E_{p(D|θ₀)}(θ̂) − θ₀ = θ̄ − θ₀.

Its variance is:

V(θ̂) = E_{p(D|θ₀)}(θ̂ − θ̄)².

Its mean squared error is:

MSE = E_{p(D|θ₀)}(θ̂ − θ₀)² = (θ̄ − θ₀)² + E_{p(D|θ₀)}(θ̂ − θ̄)² = bias² + variance.
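A one-line sketch of why the decomposition holds, expanding around θ̄ = E(θ̂):

E(\hat{\theta} - \theta_0)^2
= E\big( (\hat{\theta} - \bar{\theta}) + (\bar{\theta} - \theta_0) \big)^2
= E(\hat{\theta} - \bar{\theta})^2 + (\bar{\theta} - \theta_0)^2 + 2 (\bar{\theta} - \theta_0)\, E(\hat{\theta} - \bar{\theta}),

and the last term vanishes because E(θ̂ − θ̄) = 0.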


Next lecture
In the next lecture, we introduce ridge regression and the Bayesian
learning approach for linear predictive models.
CPSC540

Regularization,
nonlinear prediction
and generalization

Nando de Freitas
January, 2013
University of British Columbia
Outline of the lecture
This lecture will teach you how to fit nonlinear functions by using basis functions and how to control model complexity. The goal is for you to:

 Learn how to derive ridge regression.
 Understand the trade-off between fitting the data and regularizing it.
 Learn polynomial regression.
 Understand that, if the basis functions are given, the problem of learning the parameters is still linear.
 Learn cross-validation.
 Understand the effects of the number of data points and the number of basis functions on generalization.
Regularization
Derivation
Ridge regression as constrained optimization
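A sketch of the objective and its minimizer in the deck's notation (with δ² as the regularization coefficient, as on the degree-14 polynomial slide later):

J(\theta) = (y - X\theta)^T (y - X\theta) + \delta^2 \theta^T \theta
\;\Longrightarrow\;
\hat{\theta}_{ridge} = (X^T X + \delta^2 I)^{-1} X^T y,

which is equivalent to the constrained problem of minimizing (y − Xθ)ᵀ(y − Xθ) subject to θᵀθ ≤ t(δ).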
Regularization paths
As δ increases, t(δ) decreases and each θᵢ goes to zero.

[Figure: coefficient profiles θ₁, θ₂, …, θ₈ plotted against t(δ). Hastie, Tibshirani & Friedman book]

Going nonlinear via basis functions
We introduce basis functions φ(·) to deal with nonlinearity:

y(x) = φ(x)θ + ε.

For example, φ(x) = [1, x, x²].


Going nonlinear via basis functions
y(x) = φ(x)θ + ε

φ(x) = [1, x₁, x₂]        φ(x) = [1, x₁, x₂, x₁², x₂²]


Example: Ridge regression with a polynomial of degree 14

y(xᵢ) = 1·θ₀ + xᵢθ₁ + xᵢ²θ₂ + … + xᵢ¹³θ₁₃ + xᵢ¹⁴θ₁₄

Φ = [1  xᵢ  xᵢ²  …  xᵢ¹³  xᵢ¹⁴]

J(θ) = (y − Φθ)ᵀ(y − Φθ) + δ²θᵀθ

[Figure: fitted curves y(x) for small δ, medium δ and large δ.]
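A minimal sketch of this experiment, assuming synthetic 1-D data and the ridge solution above (with the penalty written as δ²):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.cos(3 * x) + 0.1 * rng.standard_normal(x.size)  # assumed toy data

# Degree-14 polynomial design matrix: columns 1, x, x^2, ..., x^14.
Phi = np.vander(x, N=15, increasing=True)

def ridge_fit(Phi, y, delta):
    """Solve (Phi^T Phi + delta^2 I) theta = Phi^T y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + delta**2 * np.eye(d), Phi.T @ y)

for delta in (1e-3, 1e-1, 1e2):   # small, medium, large regularization
    theta = ridge_fit(Phi, y, delta)
    train_err = np.mean((y - Phi @ theta) ** 2)
    print(f"delta={delta:g}  training MSE={train_err:.4f}")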
Kernel regression and RBFs
We can use kernels or radial basis functions (RBFs) as features:

φ(x) = [κ(x, µ₁, λ), …, κ(x, µ_d, λ)],   e.g.   κ(x, µᵢ, λ) = exp(−(1/λ)‖x − µᵢ‖²)

y(xᵢ) = φ(xᵢ)θ = 1·θ₀ + κ(xᵢ, µ₁, λ)θ₁ + … + κ(xᵢ, µ_d, λ)θ_d

We can choose the locations µ of the basis functions to be the inputs. That is, µᵢ = xᵢ. These basis functions are then known as kernels. The choice of width λ is tricky, as illustrated below.

[Figure: kernel fits with too small a λ, the right λ, and too large a λ.]
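A sketch of building these RBF features, assuming the centers are placed at the training inputs (µᵢ = xᵢ):

import numpy as np

def rbf_features(x, centers, lam):
    """kappa(x, mu_i, lambda) = exp(-(1/lambda) * (x - mu_i)^2), plus a bias column."""
    K = np.exp(-((x[:, None] - centers[None, :]) ** 2) / lam)
    return np.hstack([np.ones((x.size, 1)), K])

x = np.linspace(-1, 1, 30)
Phi = rbf_features(x, centers=x, lam=0.1)   # centers at the inputs; width lambda is the knob to tune
print(Phi.shape)                            # (30, 31): bias plus one kernel per training point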
The big question is: how do we choose the regularization coefficient, the width of the kernels, or the polynomial order?

One solution: cross-validation

K-fold cross-validation

The idea is simple: we split the training data into K folds; then, for each fold k ∈ {1, . . . , K}, we train on all the folds but the k'th, and test on the k'th, in a round-robin fashion.

It is common to use K = 5; this is called 5-fold CV.

If we set K = N, then we get a method called leave-one-out cross-validation, or LOOCV, since in fold i we train on all the data cases except for i, and then test on i. A code sketch of the procedure follows below.
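A minimal sketch of using K-fold CV to pick the regularization coefficient δ, assuming the design matrix Phi and targets y from the earlier ridge sketch:

import numpy as np

def kfold_cv_error(Phi, y, delta, K=5, seed=0):
    """Average held-out MSE of ridge regression over K folds."""
    n = y.size
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        d = Phi.shape[1]
        theta = np.linalg.solve(Phi[train].T @ Phi[train] + delta**2 * np.eye(d),
                                Phi[train].T @ y[train])
        errs.append(np.mean((y[test] - Phi[test] @ theta) ** 2))
    return np.mean(errs)

# Pick the delta with the lowest cross-validated error from a grid of candidate values:
# best_delta = min(deltas, key=lambda d: kfold_cv_error(Phi, y, d))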
Example: Ridge regression with polynomial of degree 14
Effect of data when we have the right model
yᵢ = θ₀ + xᵢθ₁ + xᵢ²θ₂ + N(0, σ²)
Effect of data when the model is too simple
yᵢ = θ₀ + xᵢθ₁ + xᵢ²θ₂ + N(0, σ²)
Effect of data when the model is very complex
yᵢ = θ₀ + xᵢθ₁ + xᵢ²θ₂ + N(0, σ²)
Confidence in the predictions
Next lecture
In the next lecture, we introduce Bayesian inference, and show how it
can provide us with an alternative way of learning a model from data.
