
Exam in Statistical Machine Learning

Statistisk Maskininlärning (1RT700)

Date and time: March 15, 2019, 08.00–13.00

Responsible teacher: Niklas Wahlström

Number of problems: 5

Aiding material: Calculator, mathematical handbooks, one (1) hand-written sheet of paper with notes and formulas (A4, front and back)

Preliminary grades:   grade 3: 23 points
                      grade 4: 33 points
                      grade 5: 43 points

Some general instructions and information:

• Your solutions can be given in Swedish or in English.

• Only write on one side of each sheet of paper.

• Write your exam code and a page number on all pages.

• Do not use a red pen.

• Use separate sheets of paper for the different problems (i.e. the numbered problems, 1–5).

• For subproblems (a), (b), (c), ..., it is usually possible to answer later subproblems
independently of the earlier subproblems (for example, you can answer (b) without
answering (a)).

With the exception of Problem 1, all your answers must be clearly motivated!
A correct answer without a proper motivation will score zero points!

Good luck!

Some relevant formulas

The following pages contain some expressions that may or may not be useful for solving the exam
problems. This is not a complete list of formulas used in the course, and some of the
problems may require knowledge about certain expressions not listed here. Furthermore,
the formulas listed below are not self-explanatory, meaning that you need to be familiar
with the expressions to be able to interpret them. They may serve as a support for solving
the problems, but they are not a comprehensive summary of the course.

The Gaussian distribution: The probability density function of the p-dimensional
Gaussian distribution with mean vector µ and covariance matrix Σ is

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{p/2}\sqrt{\det \Sigma}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right), \qquad x \in \mathbb{R}^p.$$
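As an illustration (not part of the original formula sheet), the density above can be evaluated numerically. The sketch below uses NumPy; the function name and test values are chosen for illustration only.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the p-dimensional Gaussian density N(x | mu, Sigma)."""
    p = len(mu)
    diff = x - mu
    # Normalization constant 1 / ((2*pi)^(p/2) * sqrt(det(Sigma)))
    norm_const = 1.0 / ((2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma)))
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    quad = diff @ np.linalg.solve(Sigma, diff)
    return norm_const * np.exp(-0.5 * quad)

# Standard bivariate Gaussian evaluated at the origin: should equal 1 / (2*pi)
print(gaussian_pdf(np.zeros(2), np.zeros(2), np.eye(2)))
```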

Sum of identically distributed variables: For identically distributed random variables
$\{z_i\}_{i=1}^n$ with mean µ, variance σ² and average correlation between distinct variables ρ,
it holds that

$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} z_i\right] = \mu \qquad \text{and} \qquad \mathrm{Var}\left[\frac{1}{n}\sum_{i=1}^{n} z_i\right] = \frac{1-\rho}{n}\sigma^2 + \rho\sigma^2.$$

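The variance formula above can be checked by simulation. The sketch below (not from the exam) draws Gaussian variables with a prescribed pairwise correlation ρ and compares the empirical variance of their average with (1 − ρ)σ²/n + ρσ²; all numerical values are illustrative.

```python
import numpy as np

n, sigma2, rho = 10, 2.0, 0.3          # illustrative values
mu = np.full(n, 1.0)
# Covariance matrix with variance sigma2 and constant correlation rho between distinct variables
Sigma = sigma2 * (rho * np.ones((n, n)) + (1 - rho) * np.eye(n))

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu, Sigma, size=100_000)
averages = samples.mean(axis=1)        # (1/n) * sum_i z_i for each draw

print("empirical variance:", averages.var())
print("formula           :", (1 - rho) / n * sigma2 + rho * sigma2)
```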
Linear regression and regularization:

• The least-squares estimate of β in the linear regression model
$$y = \beta_0 + \sum_{j=1}^{p} \beta_j x_j + \varepsilon$$
is given by the solution $\hat{\beta}_{LS}$ to the normal equations $X^T X \hat{\beta}_{LS} = X^T y$, where
$$X = \begin{bmatrix} 1 & -x_1^T- \\ 1 & -x_2^T- \\ \vdots & \vdots \\ 1 & -x_n^T- \end{bmatrix} \quad \text{and} \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
from the training data $\mathcal{T} = \{x_i, y_i\}_{i=1}^{n}$.
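As a sketch (not part of the exam), the normal equations can be solved numerically as follows, assuming the rows of X already include the leading 1 for the intercept; the toy data are made up for illustration.

```python
import numpy as np

def least_squares(X, y):
    """Solve the normal equations X^T X beta = X^T y for the least-squares estimate."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy example with one input dimension (first column of X is the intercept column)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 2.9])
print(least_squares(X, y))   # approximately [1.02, 0.95]
```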

• Ridge regression uses the regularization term $\gamma\|\beta\|_2^2 = \gamma\sum_{j=0}^{p} \beta_j^2$.
  The ridge regression estimate is $\hat{\beta}_{RR} = (X^T X + \gamma I)^{-1} X^T y$.

• LASSO uses the regularization term $\gamma\|\beta\|_1 = \gamma\sum_{j=0}^{p} |\beta_j|$.
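The ridge estimate has the closed form shown above, while LASSO has no closed-form solution and is usually computed with iterative solvers. Below is a minimal ridge sketch (illustration only, not part of the formula sheet); note that, as in the formula above, the intercept is regularized as well.

```python
import numpy as np

def ridge_regression(X, y, gamma):
    """Ridge estimate beta_RR = (X^T X + gamma * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(p), X.T @ y)

# With gamma = 0 this reduces to the least-squares estimate (when X^T X is invertible)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 2.9])
print(ridge_regression(X, y, gamma=0.0))
print(ridge_regression(X, y, gamma=1.0))   # coefficients shrink towards zero
```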

Maximum likelihood: The maximum likelihood estimate is given by
$$\hat{\beta}_{ML} = \arg\max_{\beta} \log \ell(\beta),$$
where $\log \ell(\beta) = \sum_{i=1}^{n} \log p(y_i \mid x_i; \beta)$ is the log-likelihood function (the last equality
holds when the n training data points are modeled to be independent).

Logistic regression: The logistic regression combines linear regression with the logistic
function to model the class probability
$$p(y = 1 \mid x) = \frac{e^{\beta^T x}}{1 + e^{\beta^T x}}.$$
For multi-class logistic regression we use the softmax function and model
$$p(y = k \mid x_i) = \frac{e^{\beta_k^T x_i}}{\sum_{l=1}^{K} e^{\beta_l^T x_i}}.$$
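The class probabilities above can be computed as in the sketch below (not part of the exam); the parameter values are illustrative, and the max-subtraction in the softmax is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def logistic_prob(beta, x):
    """p(y = 1 | x) for binary logistic regression."""
    return 1.0 / (1.0 + np.exp(-beta @ x))

def softmax_probs(betas, x):
    """p(y = k | x) for multi-class logistic regression; betas has one row per class."""
    scores = betas @ x
    scores = scores - scores.max()      # numerical stability only
    e = np.exp(scores)
    return e / e.sum()

x = np.array([1.0, 0.3, 0.7])                       # input with a leading 1 for the intercept
beta = np.array([0.5, -1.0, 2.0])                   # illustrative parameters
print(logistic_prob(beta, x))
betas = np.array([[0.5, -1.0, 2.0],
                  [0.0, 0.2, -0.5],
                  [1.0, 0.0, 0.0]])
print(softmax_probs(betas, x))                      # sums to 1 over the three classes
```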

Discriminant Analysis: The linear discriminant analysis (LDA) classifier models
p(y | x) using Bayes' theorem and the following assumptions
$$p(y = k \mid x) = \frac{p(x \mid k)\, p(y = k)}{\sum_{j=1}^{K} p(x \mid j)\, p(y = j)} \approx \frac{\mathcal{N}\!\left(x \mid \hat{\mu}_k, \hat{\Sigma}\right) \hat{\pi}_k}{\sum_{j=1}^{K} \mathcal{N}\!\left(x \mid \hat{\mu}_j, \hat{\Sigma}\right) \hat{\pi}_j},$$
where
$$\hat{\pi}_k = n_k / n \quad \text{for } k = 1, \dots, K,$$
$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i: y_i = k} x_i \quad \text{for } k = 1, \dots, K,$$
$$\hat{\Sigma} = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T.$$

For quadratic discriminant analysis (QDA), the model is
$$p(y = k \mid x) \approx \frac{\mathcal{N}\!\left(x \mid \hat{\mu}_k, \hat{\Sigma}_k\right) \hat{\pi}_k}{\sum_{j=1}^{K} \mathcal{N}\!\left(x \mid \hat{\mu}_j, \hat{\Sigma}_j\right) \hat{\pi}_j},$$
where $\hat{\mu}_k$ and $\hat{\pi}_k$ are as for LDA, and
$$\hat{\Sigma}_k = \frac{1}{n_k - 1} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T.$$
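As a sketch (not part of the formula sheet), the estimates above map directly onto code. The helper below is illustrative only and assumes every class contains at least two data points (so that n_k − 1 > 0).

```python
import numpy as np

def lda_qda_estimates(X, y):
    """Class priors, class means, pooled covariance (LDA) and per-class covariances (QDA)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    n, p = X.shape
    K = len(classes)
    pi_hat, mu_hat, Sigma_k = {}, {}, {}
    Sigma_pooled = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        nk = len(Xk)
        pi_hat[k] = nk / n                               # pi_k = n_k / n
        mu_hat[k] = Xk.mean(axis=0)                      # class mean mu_k
        centered = Xk - mu_hat[k]
        Sigma_k[k] = centered.T @ centered / (nk - 1)    # QDA: per-class covariance
        Sigma_pooled += centered.T @ centered
    Sigma_pooled /= (n - K)                              # LDA: pooled covariance
    return pi_hat, mu_hat, Sigma_pooled, Sigma_k
```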

Classification trees: The cost function for tree splitting is $\sum_{m=1}^{|T|} n_m Q_m$ where T is the
tree, |T| the number of terminal nodes, $n_m$ the number of training data points falling in
node m, and $Q_m$ the impurity of node m. Three common impurity measures for splitting
classification trees are:

Misclassification error: $Q_m = 1 - \max_k \hat{\pi}_{mk}$

Gini index: $Q_m = \sum_{k=1}^{K} \hat{\pi}_{mk}(1 - \hat{\pi}_{mk})$

Entropy/deviance: $Q_m = -\sum_{k=1}^{K} \hat{\pi}_{mk} \log \hat{\pi}_{mk}$

where $\hat{\pi}_{mk} = \frac{1}{n_m} \sum_{i: x_i \in R_m} I(y_i = k)$.
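The three impurity measures can be computed from the class proportions of the labels falling in a node; the sketch below (not from the exam) uses natural logarithms and treats 0·log 0 as 0.

```python
import numpy as np

def node_impurities(labels, num_classes):
    """Misclassification error, Gini index and entropy for the labels in one node."""
    counts = np.bincount(labels, minlength=num_classes)
    pi = counts / counts.sum()              # class proportions pi_mk in the node
    misclass = 1.0 - pi.max()
    gini = np.sum(pi * (1.0 - pi))
    nonzero = pi[pi > 0]                    # skip empty classes so that 0*log(0) = 0
    entropy = -np.sum(nonzero * np.log(nonzero))
    return misclass, gini, entropy

print(node_impurities(np.array([0, 0, 1, 1, 1]), num_classes=2))
```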

Loss functions for classification: For a binary classifier expressed as $\hat{y}(x) = \mathrm{sign}\{C(x)\}$,
for some real-valued function C(x), the margin is defined as y · C(x) (note the convention
y ∈ {−1, 1} here). A few common loss functions expressed in terms of the margin,
L(y, C(x)), are:

Exponential loss: $L(y, c) = \exp(-yc)$.

Hinge loss: $L(y, c) = \begin{cases} 1 - yc & \text{for } yc < 1, \\ 0 & \text{otherwise.} \end{cases}$

Binomial deviance: $L(y, c) = \log(1 + \exp(-yc))$.

Huber-like loss: $L(y, c) = \begin{cases} -yc & \text{for } yc < -1, \\ \frac{1}{4}(1 - yc)^2 & \text{for } -1 \le yc \le 1, \\ 0 & \text{otherwise.} \end{cases}$

Misclassification loss: $L(y, c) = \begin{cases} 1 & \text{for } yc < 0, \\ 0 & \text{otherwise.} \end{cases}$
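For reference (not part of the formula sheet), the losses above can be evaluated as functions of the margin m = yc; the Huber-like case follows the piecewise form given above, with the quadratic piece on −1 ≤ yc ≤ 1.

```python
import numpy as np

def exponential_loss(m):  return np.exp(-m)
def hinge_loss(m):        return np.maximum(0.0, 1.0 - m)
def binomial_deviance(m): return np.log1p(np.exp(-m))
def misclassification(m): return (m < 0).astype(float)

def huber_like(m):
    # Linear for m < -1, quadratic (1/4)*(1 - m)^2 on [-1, 1], zero for m > 1
    return np.where(m < -1, -m, np.where(m <= 1, 0.25 * (1.0 - m) ** 2, 0.0))

margins = np.linspace(-2.0, 2.0, 5)
for loss in (exponential_loss, hinge_loss, binomial_deviance, huber_like, misclassification):
    print(loss.__name__, loss(margins))
```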

1. This problem is composed of 10 true-or-false statements. You only have to classify
these as either true or false. For this problem (only!) no motivation is required.
Each correct answer scores 1 point, each incorrect answer scores -1 point and each
missing answer scores 0 points. The total score for the whole problem is capped
below at 0.

i. A nonlinear classifier can never have a linear decision boundary.


ii. A neural network is an ensemble method.
iii. Convolutional neural networks are well suited for classification problems
where the input is an image.
iv. Deep learning is a parametric method.
v. A marketing company wants to build a model for predicting the number of
visitors to a web page. Since the number of visitors is an integer, this is best
viewed as a classification problem.
vi. One should not split datasets randomly into training and test data, but always
take the last data points as the test data.
vii. The k-NN classifier most often suffers from overfitting when k = 1.
viii. Neural networks can only be used for classification problems, and not for
regression problems.
ix. Regularization can be used to avoid overfitting in linear regression.
x. Regularization can only be used for regression methods,
and not for classification methods.

(10p)

2. Consider the following training data

i    x1    x2    y
1 0.0 5.0 1
2 4.0 1.0 0
3 1.0 2.0 1
4 4.0 2.0 0
5 0.0 3.0 1
6 1.0 8.0 1
7 9.0 0.0 0
8 6.0 5.0 0
9 8.0 6.0 0
10 5.0 7.0 1

where x is the two-dimensional input variable, y the output and i is the data point
index.
(a) Illustrate the training data points in a graph with x1 and x2 on the two axes.
Represent the points belonging to class 0 with a cross and those belonging to
class 1 with a circle. Also annotate the data points with their data point indices.
(1p)
(b) Based on the training data we want to construct a random forest classifier with
B = 3 trees, each of depth one (i.e., stumps). For this we randomly draw 3 new
datasets by bootstrapping the training data (sampling with replacement). We also
randomly draw an input dimension index (1 or 2), along which the split is to be
performed (if the split dimension is 1, the split is of the form x1 < c, etc.). The
following data point indices have been drawn for each of the three bootstrapped
datasets:
            Data point indices i             Split dimension
Dataset 1   1 2 4 4 6 7 8 9 9 10             Tree 1: 2
Dataset 2   1 2 3 4 5 5 6 7 7 9              Tree 2: 2
Dataset 3   1 1 4 5 5 5 6 9 9 9              Tree 3: 1
For each bootstrapped dataset, construct a classification stump (tree of depth one)
by finding the split along the prescribed dimension that minimizes the Gini
index.
(5p)
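As a generic illustration of the splitting criterion (a sketch, not a solution to this subproblem), a stump along a prescribed dimension can be found by scanning candidate thresholds midway between sorted data values and keeping the one that minimizes the weighted Gini index n_left·Q_left + n_right·Q_right; the demo data at the end are made up.

```python
import numpy as np

def gini(labels):
    """Gini impurity Q_m of one node."""
    _, counts = np.unique(labels, return_counts=True)
    pi = counts / counts.sum()
    return np.sum(pi * (1.0 - pi))

def best_stump_split(X, y, dim):
    """Return the threshold c (split x_dim < c) minimizing the weighted Gini index."""
    values = np.unique(X[:, dim])
    best_c, best_cost = None, np.inf
    for c in (values[:-1] + values[1:]) / 2:           # candidate thresholds between data points
        left, right = y[X[:, dim] < c], y[X[:, dim] >= c]
        cost = len(left) * gini(left) + len(right) * gini(right)
        if cost < best_cost:
            best_c, best_cost = c, cost
    return best_c, best_cost

X_demo = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])   # made-up points, not the exam data
y_demo = np.array([0, 0, 1])
print(best_stump_split(X_demo, y_demo, dim=0))            # (3.0, 0.0)
```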
(c) The final random forest classifier predicts according to a majority vote of the three
trees. Sketch the decision boundary of the final classifier.
(4p)

3. (a) Why is the training error not a good estimate of the test error? Explain in a few
sentences.
(3p)
(b) What is the purpose of using cross-validation? Explain in a few sentences.
(3p)
(c) For each of the following decision boundaries (1, 2, 3 and 4), tell whether it could
possibly come from one of these classifiers
• logistic regression
• LDA
• QDA
• k-NN with k = 3
• A decision tree of depth 2
• A random forest with B = 3 trees, each of maximum depth 3
It is assumed that each classifier only uses x1 and x2 as inputs, and no nonlinear
transformations of them.
Note! Each decision boundary could possibly originate from more than one classifier.
Do not forget to include a motivation for all your answers.

(4p)

4. (a) Consider the following training data
i    x1    x2    y
1 −1 2 2
2 0 0 1
3 1 −2 −3
from which we want to learn a linear regression model

y = β0 + β1 x1 + β2 x2 + ε,

where we assume that ε has a Gaussian distribution. Compute $\hat{\beta}$!
(3p)
(b) How would $\hat{\beta}$ change if you were to use regularized linear regression (e.g. Ridge
regression or LASSO) in (a) instead? It is enough to provide a qualitative
explanation; you do not need to compute $\hat{\beta}$.
(2p)
(c) In the context of neural networks, describe, using a few sentences, the difference
between a dense layer and a convolutional layer.
(2p)
(d) Describe how mini-batch gradient descent works and what the main advantage is
in comparison to gradient descent.
(3p)
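A minimal sketch of mini-batch gradient descent for least-squares linear regression (illustration only, not part of the exam): each parameter update uses the gradient computed on a random subset (mini-batch) of the training data rather than on all n data points, which makes every update much cheaper. All names and values below are made up.

```python
import numpy as np

def minibatch_gd(X, y, lr=0.05, batch_size=8, epochs=200, seed=0):
    """Mini-batch gradient descent on the squared-error loss for linear regression."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(epochs):
        order = rng.permutation(n)                     # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2.0 / len(idx) * Xb.T @ (Xb @ beta - yb)   # gradient on the mini-batch only
            beta -= lr * grad
    return beta

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])  # intercept column plus one input
y = 2.0 + 3.0 * X[:, 1] + 0.1 * rng.normal(size=100)
print(minibatch_gd(X, y))                                   # should be close to [2, 3]
```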

Note: After the exam we realized that the inverse in H(γ) was missing in the actual
exam paper. We will take this typo into consideration in the grading.
5. For leave-one-out cross validation (or equivalently c-fold cross validation with
c = n) the cross validation error $E_{\text{val}}$ for ridge regression actually has a closed-form
solution
$$E_{\text{val}} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{\hat{y}_i - y_i}{1 - [H(\gamma)]_{ii}} \right)^2, \qquad (1)$$
where $\hat{y}_i$ is the prediction of $y_i$ when the model is learned from all n data points
(no data point is left out), γ the regularization parameter and $[H(\gamma)]_{ii}$ is element
(i, i) of the matrix $H(\gamma) = X(X^T X + \gamma I)^{-1} X^T$.
In this problem, we will let $x_i^T$ denote the entire ith row of X.
(a) Let $X_{-i}$ denote the matrix X where row i is removed, and $y_{-i}$ is the column
vector y with element i removed. Show that
$$X_{-i}^T X_{-i} = X^T X - x_i x_i^T,$$
$$X_{-i}^T y_{-i} = X^T y - x_i y_i, \quad \text{and}$$
$$[H(\gamma)]_{ii} = x_i^T (X^T X + \gamma I)^{-1} x_i.$$
(1p)
(b) Using the results from (a) and a special case of the matrix inversion lemma
$$(A - vv^T)^{-1} = A^{-1} + \frac{A^{-1} v v^T A^{-1}}{1 - v^T A^{-1} v},$$
show that
$$\hat{\beta}_{-i} = \hat{\beta} + \frac{1}{1 - [H(\gamma)]_{ii}} (X^T X + \gamma I)^{-1} x_i (\hat{y}_i - y_i).$$
Here $\hat{\beta}_{-i}$ are the parameters learned from all data points except i, and $\hat{\beta}$ the
parameters learned from all data.
Hint: Start from the ridge regression expression for $\hat{\beta}_{-i}$ as a function of $X_{-i}$, $y_{-i}$
and γ.
(4p)
(c) Use your result from (b) to derive eq. (1), starting from
$$E_{\text{val}} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{\beta}_{-i}^T x_i - y_i \right)^2.$$
(2p)
(d) Describe (in a few sentences) what eq. (1) can be used for in practice.
(3p)
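Eq. (1) can be checked numerically against an explicit leave-one-out loop; the sketch below (not part of the exam) compares the two on randomly generated data for ridge regression, and the two printed values should agree up to numerical precision.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, gamma = 20, 3, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # first column: intercept
y = rng.normal(size=n)

A = X.T @ X + gamma * np.eye(p)
beta = np.linalg.solve(A, X.T @ y)
H = X @ np.linalg.solve(A, X.T)          # H(gamma) = X (X^T X + gamma I)^{-1} X^T
y_hat = X @ beta

# Closed-form leave-one-out error, eq. (1)
e_closed = np.mean(((y_hat - y) / (1.0 - np.diag(H))) ** 2)

# Explicit leave-one-out loop
errors = []
for i in range(n):
    Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
    beta_i = np.linalg.solve(Xi.T @ Xi + gamma * np.eye(p), Xi.T @ yi)
    errors.append((X[i] @ beta_i - y[i]) ** 2)
e_loop = np.mean(errors)

print(e_closed, e_loop)
```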
