Tut4 Questions
Reminders: Attempt the tutorial questions, and ideally discuss them, before your tutorial.
You can seek clarifications and hints on the class forum. Full answers will be released.
2. The log-sum-exp function occurs frequently in the maths for probabilistic models (not
just model comparison). Show that:
log ∑_k exp(a_k) = max_k a_k + log ∑_k exp(a_k − max_{k'} a_{k'}).
Explain why the expression is often implemented this way. (Hint: consider what
happens when all the a_k's are less than −1000.)
(Footnote: Q2 is based on a previous sheet by Amos Storkey, Charles Sutton, and/or Chris Williams.)
3. Learning a transformation:
The K-nearest-neighbour (KNN) classifier predicts the label of a feature vector by
finding the K nearest feature vectors in the training set. The predicted label is the
one shared by the majority of those neighbours. For binary classification we
normally choose K to be odd so that ties aren't possible.
KNN is an example of a non-parametric method: no fixed-size vector of parameters is
sufficient to summarize the training data. The complexity of the function represented
by the classifier can grow with the number of training examples N, but the whole
training set needs to be stored so that it can be consulted at test time.
a) How would the predictions from regularized linear logistic regression, with
p(y = 1 | x, w, b) = σ(w⊤x + b), and 1-nearest neighbours compare on the dataset
below?
Non-parametric methods can have parameters. We could modify the KNN classifier by
taking a linear transformation of the data z = Ax, and finding the K nearest neighbours
using the new z features. One loss function for evaluating possible transformations A
could be the leave-one-out (LOO) classification error, defined as the fraction of errors
made on the training set when the K nearest neighbours for a training item may not
include the point being classified. (It’s an M-fold cross-validation loss with M = N.)
b) Write down a matrix A for which the 1-nearest-neighbour classifier has lower LOO
error than with the identity matrix on the data above, and explain why it works.
c) Explain whether we can minimize the LOO error of a KNN classifier by gradient
descent on the matrix A.
d) Assume that I have implemented some other classification method where I can
evaluate a cost function c and its derivatives with respect to the feature input
locations, Z̄, where Z̄_{nk} = ∂c/∂Z_{nk}, and Z is an N × K matrix of inputs.
I will use that code by creating the feature input locations from a linear
transformation of some original features, Z = XA. How could I fit the matrix A?
If A is a D × K matrix, with K < D, how will the computational cost of this method
scale with D?
You may quote results given in lecture note w5a.