
CPSC540

Machine Learning

Nando de Freitas
January, 2013
University of British Columbia
Outline of the lecture
This lecture provides an introduction to the course. It covers the following four areas:

1. Introduction to the machine learning approach
2. The big data phenomenon
3. Drawing inspiration from neural systems
4. Machine learning applications and impact

The intent of the lecture is not to explain the details of building ML systems, or to tell you what to study for the exam. Rather, it is an overview of what can be accomplished with ML. If it inspires you, then you'll have to take the course and learn a lot of cool stuff!
Application: Invariant recognition in natural images

[Thomas Serre 2012]
Computer vision successes
[Thomas Serre 2012]

Millions of labeled examples are used to build real-world applications, such as pedestrian detection.
[Thomas Serre]
Application: Autonomous driving

Mobileye: already available on the Volvo S60, and soon on models from most car manufacturers.
“tufa”   “tufa”   “tufa”

Can you pick out the tufas?

[Josh Tenenbaum]
Machine learning
Machine learning deals with the problem of extracting features from data so as to solve many different predictive tasks:

 Forecasting (e.g. energy demand prediction, sales)
 Imputing missing data (e.g. Netflix recommendations)
 Detecting anomalies (e.g. intruders, virus mutations)
 Classifying (e.g. credit risk assessment, cancer diagnosis)
 Ranking (e.g. Google search, personalization)
 Summarizing (e.g. news zeitgeist, social media sentiment)
 Decision making (e.g. AI, robotics, compiler tuning, trading)

When to apply machine learning
 Human expertise is absent (e.g. navigating on Mars)
 Humans are unable to explain their expertise (e.g. speech recognition, vision, language)
 The solution changes with time (e.g. tracking, temperature control, preferences)
 The solution needs to be adapted to particular cases (e.g. biometrics, personalization)
 The problem size is too vast for our limited reasoning capabilities (e.g. calculating webpage ranks, matching ads to Facebook pages)
Machine learning in language
“Large” text dataset:

• 1,000,000 words in 1967
• 1,000,000,000,000 words in 2006

Success stories:

• Speech recognition
• Machine translation

What is a common thing that makes both of these work well?

• Lots of labeled data

[Halevy, Norvig & Pereira, 2009]


Scene completion: More data is better

Given an input image with a missing region, Efros uses matching scenes from a large collection of photographs to complete the image.

[Efros, 2008]
The semantic challenge
 “We’ve already solved the sociological problem of building a network infrastructure that has encouraged hundreds of millions of authors to share a trillion pages of content.
 We’ve solved the technological problem of aggregating and indexing all this content.
 But we’re left with a scientific problem of interpreting the content.”

[Halevy, Norvig & Pereira, 2009]

 It’s not only about how big your data is. It is about understanding it and using this understanding to derive reasonable inferences. Think of citation matching.

 http://openie.cs.washington.edu/
Learning, understanding and causality
“Learning denotes changes in the system that are adaptive in the sense that they enable the system to do the task or tasks drawn from the same population more efficiently and more effectively the next time.”

Herbert Simon

[Diagram: an agent perceives the environment through percepts and acts back on it through actions.]
A source of inspiration
Associative memory

Example 2: Say the alphabet… backward.

[Jain, Mao & Mohiuddin, 1996]

[Figure: the x and y coordinates correspond to the spatial location of a rat; the red dots indicate the places where a particular neuron fires. Hafting et al. 2005]
Selectivity and Topographic maps in V1

[Diagram: an autoencoder maps an image patch through hidden units to a binary feature vector (1 0 1 0 … 0).]
Deep learning with autoencoders

[Russ Salakhutdinov, Geoff Hinton, Yann Lecun, Yoshua Bengio, Andrew Ng, …]
Validating Unsupervised Learning

[Figure: neuron responses at the 1st, 2nd and 3rd stages of the feature hierarchy. Ranzato]

Top Images For Best Face Neuron [Ranzato]

Best Input For Face Neuron [Ranzato]
Hierarchical spatial-temporal feature learning

[Figure: observed gaze sequence vs. model predictions. Bo Chen et al 2010]

Application: Speech recognition

[George Dahl et al 2011]


Next lecture

In the following lecture we will revise least squares predictions.
CPSC540

Linear prediction

Nando de Freitas
January, 2013
University of British Columbia
Outline of the lecture
This lecture introduces us to the topic of supervised learning. Here the data consist of input-output pairs. Inputs are also often referred to as covariates, predictors and features, while outputs are known as variates and labels. The goal of the lecture is for you to:

 Understand the supervised learning setting.
 Understand linear regression (aka least squares).
 Understand how to apply linear regression models to make predictions.
 Learn to derive the least squares estimate by optimization.
Linear supervised learning

 Many real processes can be approximated with linear models.
 Linear regression often appears as a module of larger systems.
 Linear problems can be solved analytically.
 Linear prediction provides an introduction to many of the core concepts of machine learning.
Energy demand prediction
Prostate cancer example
 Goal: predict the log of the prostate-specific antigen level (lpsa) from a number of clinical measures in men who are about to receive a radical prostatectomy.

The inputs are:

• Log cancer volume (lcavol)
• Log prostate weight (lweight)
• Age
• Log of the amount of benign prostatic hyperplasia (lbph)
• Seminal vesicle invasion (svi) – binary
• Log of capsular penetration (lcp)
• Gleason score (gleason) – ordered categorical
• Percent of Gleason scores 4 or 5 (pgg45)

Which inputs are more important?

[Hastie, Tibshirani & Friedman book]
Linear prediction
Linear prediction

Likewise, for a point that we have never seen before, say x = [50 20], we generate the following prediction (with the fitted coefficients θ̂ = [1, 0, 0.5]ᵀ implied by the arithmetic):

y(x) = [1 50 20] θ̂ = 1 + 0 + 10 = 11.
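A minimal numerical sketch of this prediction step, assuming the coefficient vector θ̂ = [1, 0, 0.5]ᵀ read off the arithmetic above:

import numpy as np

# Fitted coefficients implied by the worked example: bias 1, weights 0 and 0.5.
theta_hat = np.array([1.0, 0.0, 0.5])

# New input x = [50, 20]; prepend a 1 so the bias term is handled by the dot product.
x_new = np.array([1.0, 50.0, 20.0])

y_pred = x_new @ theta_hat
print(y_pred)  # 11.0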


Optimization approach
Optimization approach
Optimization: Finding the minimum
Optimization
Least squares estimates
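As a sketch of the estimate these slides derive, in the standard matrix notation (design matrix X, target vector y) used throughout the course:

J(\theta) = (y - X\theta)^T (y - X\theta), \qquad
\nabla_\theta J(\theta) = -2 X^T (y - X\theta) = 0
\;\Longrightarrow\; \hat{\theta} = (X^T X)^{-1} X^T y.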
Multiple outputs
Next lecture
In the next lecture, we learn to derive the linear regression estimates
by maximum likelihood with multivariate Gaussian distributions.

Please go to the Wikipedia page for the multivariate Normal distribution beforehand.
CPSC540

Probabilistic linear prediction


and maximum likelihood

Nando de Freitas
January, 2013
University of British Columbia
Outline of the lecture
In this lecture, we formulate the problem of linear prediction using probabilities. We also introduce the maximum likelihood estimate and show that it coincides with the least squares estimate. The goal of the lecture is for you to learn:

 Multivariate Gaussian distributions.
 How to formulate the likelihood for linear regression.
 How to compute the maximum likelihood estimates for linear regression.
 Why maximum likelihood is used.
Univariate Gaussian distribution
Sampling from a Gaussian distribution
The bivariate Gaussian distribution
Multivariate Gaussian distribution
Bivariate Gaussian distribution example
Assume we have two independent univariate Gaussian variables

x₁ ∼ N(µ₁, σ²)  and  x₂ ∼ N(µ₂, σ²).

Their joint distribution p(x₁, x₂) is:
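A sketch of what the joint density works out to, assuming independence and the shared variance σ² written above:

p(x_1, x_2) = \mathcal{N}(x_1 \mid \mu_1, \sigma^2)\, \mathcal{N}(x_2 \mid \mu_2, \sigma^2)
= \frac{1}{2\pi\sigma^2} \exp\!\left( -\frac{(x_1 - \mu_1)^2 + (x_2 - \mu_2)^2}{2\sigma^2} \right),

i.e. a bivariate Gaussian with mean (µ₁, µ₂) and diagonal covariance σ²I.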


Sampling from a multivariate Gaussian distribution
We have n = 3 data points y₁ = 1, y₂ = 0.5, y₃ = 1.5, which are independent and Gaussian with unknown mean θ and variance 1:

yᵢ ∼ N(θ, 1) = θ + N(0, 1),

with likelihood P(y₁, y₂, y₃ | θ) = P(y₁ | θ) P(y₂ | θ) P(y₃ | θ). Consider two guesses of θ, 1 and 2.5. Which has the higher likelihood?

Finding the θ that maximizes the likelihood is equivalent to sliding the Gaussian until the product of the 3 green bars (the likelihood) is maximized.
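A quick numerical check of the two guesses, as a sketch using the three data points above:

import numpy as np
from scipy.stats import norm

y = np.array([1.0, 0.5, 1.5])

# Likelihood of the data under N(theta, 1) for the two candidate means.
for theta in (1.0, 2.5):
    likelihood = np.prod(norm.pdf(y, loc=theta, scale=1.0))
    print(theta, likelihood)

# theta = 1.0 gives the larger product, since it sits at the sample mean.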
The likelihood for linear regression
Let us assume that each label yᵢ is Gaussian distributed with mean xᵢᵀθ and variance σ², which in short we write as:

yᵢ ∼ N(xᵢᵀθ, σ²) = xᵢᵀθ + N(0, σ²).
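A sketch of the resulting likelihood for n i.i.d. observations, in the notation above:

p(y \mid X, \theta, \sigma^2) = \prod_{i=1}^{n} \mathcal{N}(y_i \mid x_i^T \theta, \sigma^2)
= (2\pi\sigma^2)^{-n/2} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i^T \theta)^2 \right),

so maximizing the log-likelihood over θ is the same as minimizing the sum of squared errors, which is why the ML estimate coincides with least squares.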
Maximum likelihood
The ML estimate of θ is:
The ML estimate of σ is:
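As a sketch of the standard results that follow from the Gaussian likelihood (the θ estimate coincides with least squares, as stated in the outline):

\hat{\theta}_{ML} = (X^T X)^{-1} X^T y, \qquad
\hat{\sigma}^2_{ML} = \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i^T \hat{\theta}_{ML})^2.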
Making predictions
The ML plug-in prediction, given the training data D = (X, y), for a new input x∗ and known σ², is given by:

p(y | x∗, D, σ²) = N(y | x∗ᵀ θ_ML, σ²).
Frequentist learning and maximum likelihood
Frequentist learning assumes that there exists a true model, say with parameters θ₀.

The estimate (learned value) will be denoted θ̂.

Given n data points, x₁:ₙ = {x₁, x₂, …, xₙ}, we choose the value of θ that has the highest probability of generating the data. That is,

θ̂ = arg max_θ p(x₁:ₙ | θ).
Bernoulli: a model for coins
A Bernoulli random variable X takes values in {0, 1}:

p(x | θ) = θ if x = 1,  and  p(x | θ) = 1 − θ if x = 0,

where θ ∈ (0, 1). We can write this probability more succinctly as follows:
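A sketch of the usual compact form:

p(x \mid \theta) = \theta^{x} (1 - \theta)^{1 - x}, \qquad x \in \{0, 1\}.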
Entropy
In information theory, entropy H is a measure of the uncertainty associated with a random variable. It is defined as:

H(X) = − Σₓ p(x | θ) log p(x | θ).

Example: For a Bernoulli variable X, the entropy is:
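A sketch of how the example works out, using the Bernoulli pmf above:

H(X) = -\theta \log \theta - (1 - \theta) \log (1 - \theta),

which is maximized at θ = 1/2 (a fair coin) and equals zero at θ = 0 or θ = 1.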


MLE - properties
For independent and identically distributed (i.i.d.) data from p(x | θ₀), the MLE minimizes the Kullback-Leibler divergence:

\hat{\theta} = \arg\max_{\theta} \prod_{i=1}^{N} p(x_i \mid \theta)
= \arg\max_{\theta} \sum_{i=1}^{N} \log p(x_i \mid \theta)
= \arg\max_{\theta} \left[ \frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid \theta) - \frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid \theta_0) \right]
= \arg\max_{\theta} \frac{1}{N} \sum_{i=1}^{N} \log \frac{p(x_i \mid \theta)}{p(x_i \mid \theta_0)}
\longrightarrow \arg\min_{\theta} \int \log \frac{p(x \mid \theta_0)}{p(x \mid \theta)}\, p(x \mid \theta_0)\, dx
= \arg\min_{\theta} \mathrm{KL}\big( p(x \mid \theta_0) \,\|\, p(x \mid \theta) \big),

where the arrow denotes the limit as N → ∞: by the law of large numbers, the sample average converges to the expectation under p(x | θ₀).
MLE - properties
Under smoothness and identifiability assumptions, the MLE is consistent:

θ̂ →ᵖ θ₀   (convergence in probability),

or equivalently,

plim(θ̂) = θ₀,

or equivalently,

lim_{N→∞} P(|θ̂ − θ₀| > α) = 0   for every α > 0.
MLE - properties
The MLE is asymptotically normal. That is, as N → ∞, we have:

θ̂ − θ₀ ⟹ N(0, I⁻¹),

where I is the Fisher information matrix.

It is asymptotically optimal, or efficient: asymptotically, it has the lowest variance among all well-behaved estimators. In particular, it attains a lower bound on the CLT variance known as the Cramér-Rao lower bound.

But what about issues like robustness and computation? Is MLE always the right option?
Bias and variance
Note that the estimator is a function of the data: θ̂ = g(D).

Its bias is:

bias(θ̂) = E_{p(D|θ₀)}(θ̂) − θ₀ = θ̄ − θ₀.

Its variance is:

V(θ̂) = E_{p(D|θ₀)}(θ̂ − θ̄)².

Its mean squared error is:

MSE = E_{p(D|θ₀)}(θ̂ − θ₀)² = (θ̄ − θ₀)² + E_{p(D|θ₀)}(θ̂ − θ̄)² = bias² + variance.
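A one-line sketch of why the decomposition holds, expanding around θ̄ = E(θ̂):

E(\hat{\theta} - \theta_0)^2
= E\big( (\hat{\theta} - \bar{\theta}) + (\bar{\theta} - \theta_0) \big)^2
= E(\hat{\theta} - \bar{\theta})^2 + (\bar{\theta} - \theta_0)^2 + 2 (\bar{\theta} - \theta_0)\, E(\hat{\theta} - \bar{\theta}),

and the last term vanishes because E(θ̂ − θ̄) = 0.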


Next lecture
In the next lecture, we introduce ridge regression and the Bayesian
learning approach for linear predictive models.
CPSC540

Regularization,
nonlinear prediction
and generalization

Nando de Freitas
January, 2013
University of British Columbia
Outline of the lecture
This lecture will teach you how to fit nonlinear functions by using basis functions and how to control model complexity. The goal is for you to:

 Learn how to derive ridge regression.
 Understand the trade-off between fitting the data and regularizing it.
 Learn polynomial regression.
 Understand that, if the basis functions are given, the problem of learning the parameters is still linear.
 Learn cross-validation.
 Understand the effects of the number of data points and the number of basis functions on generalization.
Regularization
Derivation
Ridge regression as constrained optimization
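A sketch of the objective and its minimizer in the deck's notation (with δ² as the regularization coefficient, as on the degree-14 polynomial slide later):

J(\theta) = (y - X\theta)^T (y - X\theta) + \delta^2 \theta^T \theta
\;\Longrightarrow\;
\hat{\theta}_{ridge} = (X^T X + \delta^2 I)^{-1} X^T y,

which is equivalent to the constrained problem of minimizing (y − Xθ)ᵀ(y − Xθ) subject to θᵀθ ≤ t(δ).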
Regularization paths
As δ increases, t(δ) decreases and each θᵢ goes to zero.

[Figure: coefficient profiles θ₁, θ₂, …, θ₈ plotted against t(δ). Hastie, Tibshirani & Friedman book]

Going nonlinear via basis functions
We introduce basis functions φ(·) to deal with nonlinearity:

y(x) = φ(x)θ + ε.

For example, φ(x) = [1, x, x²].


Going nonlinear via basis functions
y(x) = φ(x)θ + ε

φ(x) = [1, x₁, x₂]        φ(x) = [1, x₁, x₂, x₁², x₂²]


Example: Ridge regression with a polynomial of degree 14

y(xᵢ) = 1·θ₀ + xᵢθ₁ + xᵢ²θ₂ + … + xᵢ¹³θ₁₃ + xᵢ¹⁴θ₁₄

Φ = [1  xᵢ  xᵢ²  …  xᵢ¹³  xᵢ¹⁴]

J(θ) = (y − Φθ)ᵀ(y − Φθ) + δ²θᵀθ

[Figure: fitted curves y(x) for small δ, medium δ and large δ.]
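A minimal sketch of this experiment, assuming synthetic 1-D data and the ridge solution above (with the penalty written as δ²):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.cos(3 * x) + 0.1 * rng.standard_normal(x.size)  # assumed toy data

# Degree-14 polynomial design matrix: columns 1, x, x^2, ..., x^14.
Phi = np.vander(x, N=15, increasing=True)

def ridge_fit(Phi, y, delta):
    """Solve (Phi^T Phi + delta^2 I) theta = Phi^T y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + delta**2 * np.eye(d), Phi.T @ y)

for delta in (1e-3, 1e-1, 1e2):   # small, medium, large regularization
    theta = ridge_fit(Phi, y, delta)
    train_err = np.mean((y - Phi @ theta) ** 2)
    print(f"delta={delta:g}  training MSE={train_err:.4f}")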
Kernel regression and RBFs
We can use kernels or radial basis functions (RBFs) as features:

φ(x) = [κ(x, µ₁, λ), …, κ(x, µ_d, λ)],   e.g.   κ(x, µᵢ, λ) = exp(−(1/λ)‖x − µᵢ‖²)

y(xᵢ) = φ(xᵢ)θ = 1·θ₀ + κ(xᵢ, µ₁, λ)θ₁ + … + κ(xᵢ, µ_d, λ)θ_d

We can choose the locations µ of the basis functions to be the inputs. That is, µᵢ = xᵢ. These basis functions are then known as kernels. The choice of width λ is tricky, as illustrated below.

[Figure: kernel fits with too small a λ, the right λ, and too large a λ.]
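A sketch of building these RBF features, assuming the centers are placed at the training inputs (µᵢ = xᵢ):

import numpy as np

def rbf_features(x, centers, lam):
    """kappa(x, mu_i, lambda) = exp(-(1/lambda) * (x - mu_i)^2), plus a bias column."""
    K = np.exp(-((x[:, None] - centers[None, :]) ** 2) / lam)
    return np.hstack([np.ones((x.size, 1)), K])

x = np.linspace(-1, 1, 30)
Phi = rbf_features(x, centers=x, lam=0.1)   # centers at the inputs; width lambda is the knob to tune
print(Phi.shape)                            # (30, 31): bias plus one kernel per training point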
The big question is: how do we choose the regularization coefficient, the width of the kernels, or the polynomial order?

One solution: cross-validation

K-fold cross-validation

The idea is simple: we split the training data into K folds; then, for each fold k ∈ {1, . . . , K}, we train on all the folds but the k'th, and test on the k'th, in a round-robin fashion.

It is common to use K = 5; this is called 5-fold CV.

If we set K = N, then we get a method called leave-one-out cross-validation, or LOOCV, since in fold i we train on all the data cases except for i, and then test on i. A code sketch of the procedure follows below.
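A minimal sketch of using K-fold CV to pick the regularization coefficient δ, assuming the design matrix Phi and targets y from the earlier ridge sketch:

import numpy as np

def kfold_cv_error(Phi, y, delta, K=5, seed=0):
    """Average held-out MSE of ridge regression over K folds."""
    n = y.size
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        d = Phi.shape[1]
        theta = np.linalg.solve(Phi[train].T @ Phi[train] + delta**2 * np.eye(d),
                                Phi[train].T @ y[train])
        errs.append(np.mean((y[test] - Phi[test] @ theta) ** 2))
    return np.mean(errs)

# Pick the delta with the lowest cross-validated error from a grid of candidate values:
# best_delta = min(deltas, key=lambda d: kfold_cv_error(Phi, y, d))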
Example: Ridge regression with polynomial of degree 14
Effect of data when we have the right model
yᵢ = θ₀ + xᵢθ₁ + xᵢ²θ₂ + N(0, σ²)
Effect of data when the model is too simple
yᵢ = θ₀ + xᵢθ₁ + xᵢ²θ₂ + N(0, σ²)
Effect of data when the model is very complex
yᵢ = θ₀ + xᵢθ₁ + xᵢ²θ₂ + N(0, σ²)
Confidence in the predictions
Next lecture
In the next lecture, we introduce Bayesian inference, and show how it
can provide us with an alternative way of learning a model from data.
