

MLPR Tutorial Sheet 4¹

Reminders: Attempt the tutorial questions, and ideally discuss them, before your tutorial.
You can seek clarifications and hints on the class forum. Full answers will be released.

1. Some computation with probabilities: It’s common to compute with log-probabilities
to avoid numerical underflow. Quick example: notice that the probability of any
particular sequence of 2000 coin tosses, 2^−2000, underflows to zero in Matlab, NumPy,
or any package using IEEE 64-bit floating point numbers.
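
A quick way to see this for yourself, as a minimal NumPy sketch (the numbers are only
for illustration):

    import numpy as np

    p = 0.5 ** 2000              # probability of one particular sequence of 2000 tosses
    print(p)                     # 0.0: underflows in IEEE 64-bit floating point
    log_p = 2000 * np.log(0.5)   # about -1386.3: easily representable in log space
    print(log_p)
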
If we have two possible models, M1 and M2, for some features, we can define:

    a1 = log P(x | M1) + log P(M1)
    a2 = log P(x | M2) + log P(M2).

Up to a constant, these ‘activations’ give the log-posterior probabilities that the model
generated the features. Show that we can get the posterior probability of model M1
neatly with the logistic function:

    P(M1 | x) = σ(a1 − a2) = 1 / (1 + exp(−(a1 − a2))).
Now given K models, with ak = log[P(x | Mk) P(Mk)], show:

    log P(Mk | x) = ak − log ∑_k′ exp(ak′).

The log ∑ exp function occurs frequently in the maths for probabilistic models (not
just model comparison). Show that:

    log ∑_k exp(ak) = max_k ak + log ∑_k′ exp(ak′ − max_k ak).

Explain why the expression is often implemented this way. (Hint: consider what
happens when all of the ak’s are less than −1000.)
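
For reference, here is a minimal NumPy sketch of the shifted computation (the function
name logsumexp is just a label here, not a reference to any particular library routine):

    import numpy as np

    def logsumexp(a):
        """Compute log(sum(exp(a))) stably by subtracting the maximum first."""
        a = np.asarray(a, dtype=float)
        a_max = np.max(a)
        return a_max + np.log(np.sum(np.exp(a - a_max)))

    a = np.array([-1001.0, -1002.0, -1003.0])
    print(np.log(np.sum(np.exp(a))))   # -inf: every exp(a_k) underflows to zero
    print(logsumexp(a))                # about -1000.59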

2. Building a toy neural network


Consider the following classification problem. There are two real-valued features x1
and x2, and a binary class label. The class label is determined by

    y = 1 if x2 ≥ |x1|,  and y = 0 otherwise.

a) Can this function be perfectly represented by logistic regression, or a feedforward
neural network without a hidden layer? Why or why not?

b) Consider a simpler problem for a moment, the classification problem

    y = 1 if x2 ≥ x1,  and y = 0 otherwise.
Design a single ‘neuron’ that represents this function. Pick the weights by hand.
Use the hard threshold function

    Θ(a) = 1 if a ≥ 0,  and 0 otherwise,

applied to a linear combination of the x inputs.
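
If you want to check a candidate answer numerically, evaluating such a unit is just a
thresholded linear combination; a minimal sketch, with the weights w1, w2 and bias b
left as placeholders for your own choices:

    def hard_threshold_unit(x1, x2, w1, w2, b):
        """Return Theta(w1*x1 + w2*x2 + b): 1 if the activation is >= 0, else 0."""
        a = w1 * x1 + w2 * x2 + b
        return 1 if a >= 0 else 0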

¹ Q2 is based on a previous sheet by Amos Storkey, Charles Sutton, and/or Chris Williams.



c) Now go back to the classification problem at the beginning of this question.
Design a two-layer feedforward network (that is, one hidden layer, with two
layers of weights) that represents this function. Use the hard threshold activation
function as in the previous part.
Hints: Use two units in the hidden layer. The unit from the previous part will be
one of the units, and you will need to design one more. You can then find settings
of the weights such that the output unit performs a binary AND operation on the
two hidden units.
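
Again only as a way to check your chosen weights, a skeleton of the forward pass,
reusing the single unit from part b) (all weights and biases are placeholders for you
to fill in):

    def theta(a):
        """Hard threshold activation."""
        return 1 if a >= 0 else 0

    def two_layer_net(x1, x2, W1, b1, w2, b2):
        """Two hard-threshold hidden units followed by a hard-threshold output unit.

        W1 is a 2x2 nested list of hidden weights, b1 a length-2 list of hidden
        biases, w2 a length-2 list of output weights, b2 the output bias.
        """
        h1 = theta(W1[0][0] * x1 + W1[0][1] * x2 + b1[0])
        h2 = theta(W1[1][0] * x1 + W1[1][1] * x2 + b1[1])
        return theta(w2[0] * h1 + w2[1] * h2 + b2)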

3. Learning a transformation:
The K-nearest-neighbour (KNN) classifier predicts the label of a feature vector by
finding the K nearest feature vectors in the training set. The label predicted is the
label shared by the majority of the training neighbours. For binary classification we
normally choose K to be an odd number so ties aren’t possible.
KNN is an example of a non-parametric method: no fixed-size vector of parameters is
sufficient to summarize the training data. The complexity of the function represented
by the classifier can grow with the number of training examples N, but the whole
training set needs to be stored so that it can be consulted at test time.
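
As a concrete illustration of the prediction rule just described (a minimal NumPy sketch
for binary 0/1 labels and odd K, not a reference implementation):

    import numpy as np

    def knn_predict(X_train, y_train, x_test, K=1):
        """Majority vote among the K training points nearest to x_test
        (Euclidean distance, binary labels in {0, 1}, K odd)."""
        dists = np.sum((X_train - x_test) ** 2, axis=1)   # squared distances
        nearest = np.argsort(dists)[:K]
        return int(np.mean(y_train[nearest]) > 0.5)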

a) How would the predictions from regularized linear logistic regression, with
p(y = 1 | x, w, b) = σ(w⊤x + b), and from 1-nearest-neighbour classification compare
on the dataset below?

Non-parametric methods can have parameters. We could modify the KNN classifier by
taking a linear transformation of the data z = Ax, and finding the K nearest neighbours
using the new z features. One loss function for evaluating possible transformations A
could be the leave-one-out (LOO) classification error, defined as the fraction of errors
made on the training set when the K nearest neighbours for a training item may not
include the point being classified. (It’s an M-fold cross-validation loss with M = N.)
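
A sketch of that loss, assuming the knn_predict function from the previous snippet and
an A that maps a feature vector x to z = Ax:

    import numpy as np

    def loo_error(A, X, y, K=1):
        """Leave-one-out error of KNN after transforming each point to z = A x.
        X is N x D, y holds binary labels; each row of X is transformed by A."""
        Z = X @ A.T                                   # transform the whole training set
        N = X.shape[0]
        errors = 0
        for n in range(N):
            keep = np.arange(N) != n                  # the neighbours may not include point n
            pred = knn_predict(Z[keep], y[keep], Z[n], K)
            errors += int(pred != y[n])
        return errors / N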

b) Write down a matrix A where the 1-nearest neighbour classifier has lower LOO
error than using the identity matrix for the data above, and explain why it works.

c) Explain whether we can fit the matrix A by gradient descent on the LOO error
for a KNN classifier.

d) Assume that I have implemented some other classification method where I can
evaluate a cost function c and its derivatives with respect to the feature input
locations, Z̄, where Z̄nk = ∂c/∂Znk, and Z is an N × K matrix of inputs.

I will use that code by creating the feature input locations from a linear transformation
of some original features, Z = XA. How could I fit the matrix A? If A is a
D × K matrix, with K < D, how will the computational cost of this method scale
with D?
You may quote results given in lecture note w5a.

MLPR:tut4 Iain Murray, http://www.inf.ed.ac.uk/teaching/courses/mlpr/2018/
