CPSC540: Machine Learning
Machine Learning
Nando de Freitas
January, 2013
University of British Columbia
Outline of the lecture
This lecture provides an introduction to the course. It covers
the following four areas:
[Thomas Serre 2012]
Computer vision successes
[Figure: one-shot concept learning: given a few examples labelled “tufa”, find the other tufas (Josh Tenenbaum)]
Machine learning
Machine learning deals with the problem of extracting features from
data so as to solve many different predictive tasks:
Success stories:
• Speech recognition
• Machine translation
[Efros, 2008]
The semantic challenge
“We’ve already solved the sociological problem of building a network
infrastructure that has encouraged hundreds of millions of authors to
share a trillion pages of content.”
It’s not only about how big your data is. It is about
understanding it and using this understanding to derive
reasonable inferences. Think of citation matching.
http://openie.cs.washington.edu/
Learning, understanding and causality
“Learning denotes changes in the system that are adaptive in the sense
that they enable the system to do the task or tasks drawn from the
same population more efficiently and more effectively the next time.”
Herbert Simon
[Figure: the agent-environment loop: the agent receives percepts from the environment and responds with actions]
A source of inspiration
Associative memory
The red dots indicate the places where a particular neuron fires.
[Hafting et al 2005]
Selectivity and Topographic maps in V1
[Figure: autoencoder: an image patch is passed through hidden units to produce a binary feature vector 1 0 1 0 … 0]
Deep learning with autoencoders
[Russ Salakhutdinov, Geoff Hinton, Yann LeCun, Yoshua Bengio, Andrew Ng, …]
Validating Unsupervised Learning
[Figure: neuron responses across the 1st, 2nd and 3rd stages of the feature hierarchy (Ranzato)]
Top Images For Best Face Neuron [Ranzato]
Best Input For Face Neuron [Ranzato]
Hierarchical spatial-temporal feature learning
Model predictions
Linear prediction
Nando de Freitas
January, 2013
University of British Columbia
Outline of the lecture
This lecture introduces us to the topic of supervised learning. Here
the data consists of input-output pairs. Inputs are also often referred
to as covariates, predictors and features, while outputs are known
as variates and labels. The goal of the lecture is for you to:
Likewise, for a point that we have never seen before, say x = [50 20],
we generate the following prediction:
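The actual number depends on the fitted θ̂. As a minimal numpy sketch of the computation, with made-up training data purely for illustration:

```python
import numpy as np

# Hypothetical training data: each row of X is an input x_i, y the outputs.
# These numbers are made up for illustration only.
X = np.array([[60.0, 25.0],
              [45.0, 18.0],
              [55.0, 22.0],
              [40.0, 15.0]])
y = np.array([120.0, 90.0, 110.0, 80.0])

# Least-squares estimate of theta: minimizes ||y - X theta||^2.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Prediction for a previously unseen input x = [50 20]:
x_new = np.array([50.0, 20.0])
y_hat = x_new @ theta
print(theta, y_hat)
```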
Nando de Freitas
January, 2013
University of British Columbia
Outline of the lecture
In this lecture, we formulate the problem of linear prediction using
probabilities. We also introduce the maximum likelihood estimate and
show that it coincides with the least squares estimate. The goal of the
lecture is for you to learn:
x1 ~ N(µ1, σ²) and x2 ~ N(µ2, σ²)
yi ~ N(θ, 1) = θ + N(0, 1)
yi ~ N(xiᵀθ, σ²) = xiᵀθ + N(0, σ²)
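Writing the n observations jointly, and assuming they are independent given X, the likelihood of this linear-Gaussian model factorizes over the data points:

```latex
p(y \mid X, \theta, \sigma^2)
  = \prod_{i=1}^{n} \mathcal{N}\big(y_i \mid x_i^{\mathsf T}\theta,\ \sigma^2\big)
  = (2\pi\sigma^2)^{-n/2}
    \exp\!\Big(-\tfrac{1}{2\sigma^2}\,\|y - X\theta\|^2\Big)
```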
Maximum likelihood
The ML estimate of θ is:
The ML estimate of σ is:
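In the standard derivation (maximize the log-likelihood above and set the derivatives to zero), these are the familiar closed forms; the estimate of θ coincides with the least-squares solution:

```latex
\hat{\theta}_{\mathrm{ML}} = \big(X^{\mathsf T} X\big)^{-1} X^{\mathsf T} y,
\qquad
\hat{\sigma}^2_{\mathrm{ML}} = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - x_i^{\mathsf T}\hat{\theta}\big)^2
```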
Making predictions
The ML plugin prediction, given the training data D = (X, y), for a
new input x* and known σ², is given by:
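Concretely, for the Gaussian model this is a normal density centred at the plug-in point prediction:

```latex
p(y_* \mid x_*, \mathcal{D}, \sigma^2)
  = \mathcal{N}\big(y_* \mid x_*^{\mathsf T}\hat{\theta}_{\mathrm{ML}},\ \sigma^2\big)
```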
Given n data points, x1:n = {x1, x2, …, xn}, we choose the value of θ that
maximizes the probability of generating the data. That is,
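Assuming the observations are i.i.d., this reads:

```latex
\hat{\theta}_{\mathrm{ML}}
  = \arg\max_{\theta}\; p(x_{1:n} \mid \theta)
  = \arg\max_{\theta}\; \prod_{i=1}^{n} p(x_i \mid \theta)
  = \arg\max_{\theta}\; \sum_{i=1}^{n} \log p(x_i \mid \theta)
```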
For example, for a Bernoulli variable x ∈ {0, 1}:
p(x|θ) = θ if x = 1, and 1 − θ if x = 0.
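For this Bernoulli example, with n i.i.d. observations, the likelihood and its maximizer (the sample proportion of ones) are:

```latex
p(x_{1:n} \mid \theta) = \prod_{i=1}^{n} \theta^{x_i} (1-\theta)^{1-x_i},
\qquad
\hat{\theta}_{\mathrm{ML}} = \frac{1}{n} \sum_{i=1}^{n} x_i
```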
H(X) = − Σx p(x|θ) log p(x|θ)
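As a quick numerical illustration, the minimal numpy sketch below (the helper bernoulli_entropy is written just for this example) evaluates the entropy of a Bernoulli variable for a few values of θ; it is largest at θ = 0.5:

```python
import numpy as np

def bernoulli_entropy(theta):
    """Entropy H(X) = -sum_x p(x|theta) log p(x|theta) for x in {0, 1}."""
    p = np.array([theta, 1.0 - theta])
    p = p[p > 0]                      # avoid log(0); 0 log 0 is treated as 0
    return -np.sum(p * np.log(p))

for theta in [0.1, 0.5, 0.9]:
    print(theta, bernoulli_entropy(theta))  # theta = 0.5 gives log 2
```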
The MLE is consistent: θ̂ → θ0 in probability,
or equivalently,
plim(θ̂) = θ0,
or equivalently,
lim P(|θ̂ − θ0| > α) = 0 as n → ∞, for every α > 0.
MLE - properties
The MLE is asymptotically normal. That is, as N → ∞, we have:
θ̂ − θ0 ⇒ N(0, I⁻¹), where I denotes the Fisher information.
But what about issues like robustness and computation? Is MLE always
the right option?
Bias and variance
Note that the estimator is a function of the data: θ̂ = g(D).
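Since θ̂ = g(D) is random (it depends on which data set we happen to observe), it makes sense to ask about its expectation and spread over data sets; the standard definitions are:

```latex
\mathrm{bias}(\hat{\theta}) = \mathbb{E}_{\mathcal{D}}\big[\hat{\theta}\big] - \theta_0,
\qquad
\mathrm{Var}(\hat{\theta}) = \mathbb{E}_{\mathcal{D}}\Big[\big(\hat{\theta} - \mathbb{E}_{\mathcal{D}}[\hat{\theta}]\big)^2\Big]
```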
Regularization,
nonlinear prediction
and generalization
Nando de Freitas
January, 2013
University of British Columbia
Outline of the lecture
This lecture will teach you how to fit nonlinear functions by using
basis functions and how to control model complexity. The goal is for
you to:
y(x) = φ(x)θ + ε
J(θ) = (y − Φθ)ᵀ(y − Φθ) + δ² θᵀθ
[Figure: fits of the regularized model for small δ, medium δ and large δ]
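Setting the gradient of J(θ) to zero gives the ridge solution, which shrinks the least-squares estimate as δ grows:

```latex
\hat{\theta}_{\mathrm{ridge}} = \big(\Phi^{\mathsf T}\Phi + \delta^2 I\big)^{-1} \Phi^{\mathsf T} y
```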
Kernel regression and RBFs
We can use kernels or radial basis functions (RBFs) as features:
φ(x) = [κ(x, µ1, λ), …, κ(x, µd, λ)], e.g. κ(x, µi, λ) = exp(−(1/λ) ‖x − µi‖²)
[Figure: kernel regression fits with too small λ, right λ and too large λ]
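To make the two knobs concrete, here is a minimal numpy sketch (the 1-D data and the kernel centres are made up for illustration) that builds the RBF feature matrix Φ and fits the ridge solution above; both λ (kernel width) and δ (regularization) change the fit:

```python
import numpy as np

# Hypothetical 1-D training data, made up for illustration.
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(x.size)

def rbf_features(x, centres, lam):
    """phi(x) = [kappa(x, mu_1, lam), ..., kappa(x, mu_d, lam)] with Gaussian kernels."""
    return np.exp(-(1.0 / lam) * (x[:, None] - centres[None, :]) ** 2)

centres = np.linspace(0.0, 1.0, 10)   # kernel centres mu_i
lam, delta = 0.05, 0.1                # kernel width and regularization strength

Phi = rbf_features(x, centres, lam)
# Ridge solution: theta = (Phi^T Phi + delta^2 I)^{-1} Phi^T y
theta = np.linalg.solve(Phi.T @ Phi + delta**2 * np.eye(centres.size), Phi.T @ y)

x_test = np.linspace(0.0, 1.0, 100)
y_pred = rbf_features(x_test, centres, lam) @ theta
```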
The big question is: how do we choose the regularization coefficient,
the width of the kernels, or the polynomial order?
One Solution: cross-validation
K-fold cross-validation
The idea is simple: we split the training data into K folds; then, for each
fold k ∈ {1, . . . , K}, we train on all the folds but the k’th, and test on the
k’th, in a round-robin fashion.
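A bare-bones sketch of this procedure, written here with ridge regression on hypothetical data and δ as the example hyperparameter being tuned:

```python
import numpy as np

def kfold_cv_error(X, y, delta, K=5, seed=0):
    """Average held-out squared error of ridge regression over K folds."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        # Ridge fit on all folds except the k'th.
        A = X[train].T @ X[train] + delta**2 * np.eye(X.shape[1])
        theta = np.linalg.solve(A, X[train].T @ y[train])
        # Test on the held-out k'th fold.
        errors.append(np.mean((y[test] - X[test] @ theta) ** 2))
    return np.mean(errors)

# Hypothetical data, made up for illustration; pick the delta with lowest CV error.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
for delta in [0.01, 0.1, 1.0, 10.0]:
    print(delta, kfold_cv_error(X, y, delta))
```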