University of Edinburgh
College of Science and Engineering
School of Informatics
14:30 to 16:30
MSc Courses
Convener: B. Franke
External Examiners: T. Attwood, R. Connor, R. Cooper, D. Marshall, M. Richardson
INSTRUCTIONS TO CANDIDATES
Question 1 is COMPULSORY.
θ∗ = arg min_θ f(θ)   subject to   g_1(θ) > 0, g_2(θ) > 0, . . . , g_k(θ) > 0        (1)
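As a sketch of the standard device for handling constraints of this form (the Lagrangian and multipliers below are illustrative background, not part of the question): the constraints can be absorbed into

    L(θ, λ) = f(θ) − Σ_{i=1}^{k} λ_i g_i(θ),   with λ_i ≥ 0,

whose stationary points in θ, together with feasibility and complementary slackness (λ_i g_i(θ) = 0), characterise the constrained optimum.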
where the integral is a definite integral over the whole parameter space. [3 marks]
ii. Describe the process for rejection sampling using a distribution Q(θ)
that we are able to sample from, and where Q(θ) > αP(D|θ)P(θ) for some
known constant α > 0. [4 marks]
iii. In importance sampling using a proposal distribution Q(θ) for a target
distribution P(θ|D), what condition do we need on Q for the importance
sampling procedure to be valid? [1 mark]
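A minimal sketch of the rejection sampling procedure described in (b)ii, assuming an unnormalised target p̃(θ) = P(D|θ)P(θ) that we can evaluate pointwise; the particular p̃, proposal Q and bound α below are illustrative choices, not taken from the paper:

    import numpy as np

    rng = np.random.default_rng(0)

    def p_tilde(theta):
        # Illustrative unnormalised target P(D|theta) P(theta): a bimodal density.
        return np.exp(-0.5 * (theta - 1.0) ** 2) + 0.5 * np.exp(-0.5 * (theta + 2.0) ** 2)

    SCALE = 3.0   # proposal Q = N(0, SCALE^2), heavier-tailed than the target
    ALPHA = 0.05  # chosen so that Q(theta) > ALPHA * p_tilde(theta) everywhere

    def q_pdf(theta):
        return np.exp(-0.5 * (theta / SCALE) ** 2) / (SCALE * np.sqrt(2.0 * np.pi))

    def rejection_sample(n):
        samples = []
        while len(samples) < n:
            theta = rng.normal(0.0, SCALE)   # 1. draw theta from Q
            u = rng.uniform()                # 2. draw u uniformly on [0, 1]
            # 3. accept theta with probability ALPHA * p_tilde(theta) / Q(theta),
            #    which the envelope condition guarantees is at most 1.
            if u < ALPHA * p_tilde(theta) / q_pdf(theta):
                samples.append(theta)
        return np.array(samples)  # accepted draws follow P(theta|D)

The same Q could serve as the proposal in (b)iii, where the validity condition is that Q(θ) > 0 wherever P(θ|D) > 0.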
(c) Write out the Bernoulli likelihood for a binary dataset x_1, x_2, . . . , x_N with
Bernoulli probability p. From this, derive the log-likelihood and hence show
that the maximum likelihood value for p given data D corresponds to the
proportion of 1s in the dataset. [5 marks]
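An outline of the derivation this part asks for:

    L(p) = ∏_{n=1}^{N} p^{x_n} (1 − p)^{1−x_n},
    log L(p) = (∑_n x_n) log p + (N − ∑_n x_n) log(1 − p).

Setting d log L/dp = (∑_n x_n)/p − (N − ∑_n x_n)/(1 − p) = 0 and solving gives p̂ = (1/N) ∑_n x_n, the proportion of 1s.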
(d) Naive Bayes assumes conditional independence. By considering the worst
case of two attributes being identical given the class label, explain what
effect a positive correlation between attributes (given the class) has on the
inferred posterior probabilities. [3 marks]
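A minimal worked illustration of the duplicated-attribute case: if x_2 is an exact copy of x_1, Naive Bayes computes

    P(c|x_1, x_2) ∝ P(x_1|c) P(x_2|c) P(c) = P(x_1|c)² P(c),

squaring the evidence from what is really a single attribute. Positive correlation has the same effect to a lesser degree: shared evidence is double-counted, so the inferred posteriors are pushed towards 0 and 1 and become overconfident.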
2. (a) In standard degree courses, students can get one of a number of final marks
(Fail, 3rd, Lower 2nd, Upper 2nd, 1st). You plan to use a neural network to
predict the class of degree that a student will get based on the marks they
obtained on coursework for their courses. Note that different people take
different courses, each with a different number of pieces of coursework. You
could choose to represent the data by having one input attribute for each
piece of coursework, and substituting the mean coursework value (computed
across all those who did the course) in cases where an individual did not take
that course.
Alternatively, you could represent the data for an individual by having
one input attribute for each mark range (0% to 10%, 11% to 20%, . . . , 91% to
100%) and making each input attribute value the proportion of the coursework
the person did that was given a mark in that range. For example, the inputs
for one individual might take the form of the vector
(0%, 0%, 0%, 0%, 10%, 40%, 40%, 10%, 0%, 0%)^T.
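A minimal sketch of this second representation, assuming integer percentage marks (the bin edges and function name are illustrative):

    import numpy as np

    def range_proportions(marks):
        # marks: coursework percentages for one student, e.g. [55, 62, 68, 71].
        # Returns the proportion of that student's coursework marked in each
        # of the ten ranges 0-10, 11-20, ..., 91-100.
        edges = [0, 10.5, 20.5, 30.5, 40.5, 50.5, 60.5, 70.5, 80.5, 90.5, 100]
        counts, _ = np.histogram(marks, bins=edges)
        return counts / len(marks)

For instance, range_proportions([55, 62, 68, 71]) returns 0.25, 0.5 and 0.25 in the 51% to 60%, 61% to 70% and 71% to 80% bins, and 0 elsewhere.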
i. Give one brief argument against each of the alternative representations. [6 marks]
ii. Suppose that you knew you were going to use a Naive Bayes model.
Describe a representation that you could then use that is similar to,
but arguably more elegant than, the first alternative above in that it
explicitly represents missing data, and briefly explain why that repre-
sentation is suitable for Naive Bayes. [2 marks]
(b) This question relates to neural networks in practice:
i. Explain why it is important to standardise the data and start with small
weights in neural networks. [3 marks]
ii. Why should each of the initial weights for the different units be different
from one another? [2 marks]
iii. If we stop training early, what is the effective bias this induces on our
learnt networks? [1 mark]
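A minimal illustration of the practices in (b)i and (b)ii (the array shapes, names and the 0.1 scale are assumptions for the sketch):

    import numpy as np

    rng = np.random.default_rng(1)

    def standardise(X):
        # Zero mean and unit variance per input attribute: inputs on a common
        # scale keep unit activations in the sensitive, near-linear region of
        # the nonlinearity and stop large-valued attributes from dominating.
        return (X - X.mean(axis=0)) / X.std(axis=0)

    def init_weights(n_in, n_out):
        # Small weights keep units away from saturation at the start of
        # training; random (hence different) values break symmetry, since
        # units with identical initial weights would receive identical
        # gradient updates and could never learn different features.
        return rng.normal(0.0, 0.1, size=(n_in, n_out))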
(c) Describe a simple gradient ascent procedure for optimising the weights w
(which includes the biases) of a neural network, given the network log-
likelihood, denoted L(D|w) for data D. Describe the problems associated
with setting the learning rate. You do not need to say how to compute any
derivatives you might need. [5 marks]
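A minimal sketch of such a procedure, assuming a helper grad_log_lik(w, D) that returns the gradient of L(D|w) with respect to w (how that gradient is computed is explicitly out of scope here):

    def gradient_ascent(w, D, grad_log_lik, eta=0.01, n_steps=1000):
        # Repeatedly step uphill on the log-likelihood L(D|w).
        # The learning rate eta is the awkward part: too large and the
        # updates overshoot and can diverge; too small and convergence is
        # painfully slow, with no single value suited to all of training.
        for _ in range(n_steps):
            w = w + eta * grad_log_lik(w, D)
        return w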
(d) Let σ(x) be the logistic function, so that dσ(x)/dx = σ(x)(1 − σ(x)). Consider
the single datum likelihood for a logistic regression model:
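A standard form for such a likelihood, assuming binary target c ∈ {0, 1}, input x and weights w (the notation here is an assumption, as the paper's own equation is not shown above):

    P(c|x, w) = σ(w^T x)^c (1 − σ(w^T x))^{1−c}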
3. (a) For real y and binary c, let P(y|c, µ_c, Σ_c) be a class conditional Gaussian
distribution with mean µ_c and covariance matrix Σ_c:

    P(y|c, µ_c, Σ_c) = |2πΣ_c|^(−1/2) exp( −(1/2) (y − µ_c)^T Σ_c^(−1) (y − µ_c) )        (5)
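A minimal sketch of evaluating the density in equation (5) (the function name is illustrative; scipy.stats.multivariate_normal.pdf performs the same computation):

    import numpy as np

    def class_conditional(y, mu_c, Sigma_c):
        # P(y | c, mu_c, Sigma_c) = |2 pi Sigma_c|^(-1/2)
        #   * exp(-(1/2) (y - mu_c)^T Sigma_c^{-1} (y - mu_c)), as in (5).
        diff = y - mu_c
        norm = np.linalg.det(2.0 * np.pi * Sigma_c) ** -0.5
        return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma_c, diff))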