Deep Learning Assignment 1 Solutions
1. Recall that McCulloch Pitts (MP) neuron aggregates the inputs and takes a decision
based on this aggregation. If the sum of all inputs is greater than the threshold (θ),
then the output of MP neuron is 1, otherwise the output is 0. We say that a MP neuron
implements a boolean function if the output of the MP neuron is consistent with the
truth table of the boolean function. In other words, if for a given input configuration,
the boolean function outputs 1 then the output of the neuron should also be 1. Similarly,
if for a given input configuration, the boolean function outputs 0 then the output of the
neuron should also be 0.
Consider the following boolean function:
[Figure: an MP neuron with four inputs x1, x2, x3, x4 (x3 and x4 drawn with inhibitory circles) and output y ∈ {0, 1}]
What should be the value of the threshold (θ) such that the MP neuron implements the
above boolean function? (Note that the circle at the end of the input to the MP neuron
indicates inhibitory input. If any inhibitory input is 1 the output will be 0.)
A. θ = 1
B. θ = 2
C. θ = 3
D. θ = 4
Solution: The MP neuron will output 1 if and only if the sum of its inputs is greater than
or equal to the threshold, i.e.
∑_{i=1}^{4} x_i ≥ θ
There are 16 possible inputs to this network: {0,0,0,0}, {0,0,0,1}, ..., {1,1,1,1}. Since
x3, x4 are inhibitory inputs, the output of the neuron will be zero whenever any one
of these inputs is 1. This is as expected, because the output of the boolean function will
also be 0 when either x3 or x4 is 1. There are 8 such inputs where either x3 or x4 will be
1 and hence the output of the neuron will be 0. Out of the remaining 8 inputs, the given
boolean function f (x1 , x2 , x3 , x4 ) will output 1 if and only if x1 = 1, x2 = 1, x3 = 0, x4 = 0
and 0 for all other cases. Thus, we want the MP neuron to fire only if both x1 and x2
are 1, i.e., we want the sum
x1 + x2 + x3 + x4 ≥ 2
which implies
θ = 2.
∴ Option B is the correct answer.
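The full truth table can be checked mechanically. Below is a small Python sketch (the function name and the tuple encoding of the inputs are my own; indices 2 and 3 correspond to the inhibitory inputs x3 and x4):

```python
from itertools import product

def mp_neuron(x, theta, inhibitory=(2, 3)):
    """Return 1 iff no inhibitory input fires and the input sum reaches theta."""
    if any(x[i] == 1 for i in inhibitory):
        return 0
    return 1 if sum(x) >= theta else 0

# With theta = 2 the neuron fires exactly when x1 = x2 = 1 and x3 = x4 = 0:
fired = [x for x in product([0, 1], repeat=4) if mp_neuron(x, theta=2)]
print(fired)  # [(1, 1, 0, 0)]
```

Running the same loop with θ = 1 would also make inputs such as (1, 0, 0, 0) fire, which the boolean function forbids, so θ = 2 is the unique correct choice among the options.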
2. Keeping the concept discussed in question 1 in mind, consider the following boolean
function:
[Figure: an MP neuron with four inputs x1, x2, x3, x4 (x3 and x4 drawn with inhibitory circles) and output y ∈ {0, 1}]
What should be the value of the threshold (θ) such that the MP neuron implements the
above boolean function? (Note that the circle at the end of the input to the MP neuron
indicates inhibitory input. If any inhibitory input is 1 the output will be 0.)
A. θ = 1
B. θ = 2
C. θ = 3
D. θ = 4
Solution: The MP neuron will output 1 if and only if the sum of its inputs is greater than
or equal to the threshold, i.e.
∑_{i=1}^{4} x_i ≥ θ
There are 16 possible inputs to this network: {0,0,0,0}, {0,0,0,1}, ..., {1,1,1,1}. Since
x3, x4 are inhibitory inputs, the output of the neuron will be zero whenever any one
of these inputs is 1. This is as expected, because the output of the boolean function will
also be 0 when either x3 or x4 is 1. There are 8 such inputs where either x3 or x4 will
be 1 and hence the output of the neuron will be 0. Out of the remaining 8 inputs, the
given boolean function f (x1 , x2 , x3 , x4 ) will output 1 for any one of the following input
settings:
x1 = 1, x2 = 0, x3 = 0, x4 = 0
OR
x1 = 0, x2 = 1, x3 = 0, x4 = 0
and 0 for all other input settings. Thus, we want the MP neuron to fire if either x1 or
x2 is 1, i.e., we want the sum
x1 + x2 + x3 + x4 ≥ 1
which implies
θ = 1.
∴ Option A is the correct answer.
3. Let us consider the movie example as discussed in this week’s lecture. Suppose we want
to predict whether a movie buff would like to watch a movie or not. Note that each
movie is represented by a vector, X = [x1 x2 x3 x4 ] and the description of each input
(xi ) is mentioned in the figure below. Also, the weight assigned to each of these inputs
(or features) is given by W = [w1 w2 w3 w4 ] and the threshold is represented by the
parameter θ.
[Figure: a neuron with output y, inputs x0 = 1, x1, x2, x3, x4 and weights w0 = −θ, w1, w2, w3, w4, where
x1 = popularity (between 1 and 10)
x2 = isGenreScifi (boolean)
x3 = isDirectorNolan (boolean)
x4 = imdbRating (between 0 and 1)]
Now, consider the movie Interstellar has the feature vector X = [8 1 1 0.86]; which
means the movie has a popularity of 8 on a scale of 10 and is a Scifi movie directed
by Nolan with 0.86 as its imdbRating. Now consider a person who assigns the following
weights to each of these inputs: W = [0.14 1 0.9 0.6]. Further, suppose that θ = 2.
Based on the above information, what do you think will be his/her decision?
A. Yes, (s)he will watch it.
B. No, (s)he won’t watch it.
Solution: His/her decision will be to watch the movie if the weighted sum of the inputs
is greater than or equal to the threshold, i.e.
∑_{i=1}^{4} w_i x_i ≥ θ
Here the weighted sum is 0.14 × 8 + 1 × 1 + 0.9 × 1 + 0.6 × 0.86 = 1.12 + 1 + 0.9 + 0.516 = 3.536,
which exceeds the threshold θ = 2, so (s)he will watch the movie.
∴ Option A is the correct answer.
4. Keeping the discussion of question 3 in mind, consider the movie The Green Lantern
has the feature vector X = [5 1 0 0.53]. Now consider a person who assigns the
following weights to each of these inputs: W = [0.8 1 0.4 0.8]. Further, suppose that
θ = 7.
Based on the above information, what do you think will be his/her decision?
A. Yes, (s)he will watch it.
B. No, (s)he won’t watch it.
Solution: The weighted sum is 0.8 × 5 + 1 × 1 + 0.4 × 0 + 0.8 × 0.53 = 4 + 1 + 0 + 0.424 = 5.424,
which lies below the threshold θ = 7, so (s)he will not watch the movie.
∴ Option B is the correct answer.
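Both decisions can be verified with a few lines of Python (the `weighted_sum` helper is my own naming; the features and weights are exactly those given in questions 3 and 4):

```python
def weighted_sum(W, X):
    """Dot product of the weight vector W and the feature vector X."""
    return sum(w * x for w, x in zip(W, X))

# Question 3: Interstellar, threshold 2
s1 = weighted_sum([0.14, 1, 0.9, 0.6], [8, 1, 1, 0.86])
print(s1, s1 >= 2)   # about 3.536, above the threshold -> watch

# Question 4: The Green Lantern, threshold 7
s2 = weighted_sum([0.8, 1, 0.4, 0.8], [5, 1, 0, 0.53])
print(s2, s2 >= 7)   # about 5.424, below the threshold -> skip
```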
5. Consider the following set of points:
Index   Point [x0, x, y, z]   Class
n1      [1, 0, 0, 0]          Class 0
p1      [1, 0, 0, 1]          Class 1
p2      [1, 0, 1, 0]          Class 1
p3      [1, 0, 1, 1]          Class 1
p4      [1, 1, 0, 0]          Class 1
p5      [1, 1, 0, 1]          Class 1
p6      [1, 1, 1, 0]          Class 1
p7      [1, 1, 1, 1]          Class 1
Note that there are 8 points which are divided into two classes, Class 0 and Class 1.
We are interested in finding the plane which divides the input space into two classes.
Starting with the weight vector, w = [0, 0, −1, 2], apply the perceptron algorithm by
going over the points in the following order [n1 , p1 , p2 , p3 , p4 , p5 , p6 , p7 ]. If needed, repeat
in the same order till convergence. After the algorithm converges, what is the value of
the weight vector?
A. w = [1, 1, 2, 3]
B. w = [−1, 1, 1, 2]
C. w = [−3, −2, −1, −1]
D. w = [−2, −1, −1, 1]
Solution: You can arrive at the solution by implementing the perceptron algorithm in
Python. Running it over the points in the given order until convergence yields
w = [−1, 1, 1, 2].
∴ Option B is the correct answer.
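A runnable sketch of the perceptron algorithm from the lecture (a positive point with w·x < 0 gets added to w; a negative point with w·x ≥ 0 gets subtracted):

```python
# Points in the order given in the question: (point, class label)
points = [
    ([1, 0, 0, 0], 0),  # n1, Class 0
    ([1, 0, 0, 1], 1),  # p1
    ([1, 0, 1, 0], 1),  # p2
    ([1, 0, 1, 1], 1),  # p3
    ([1, 1, 0, 0], 1),  # p4
    ([1, 1, 0, 1], 1),  # p5
    ([1, 1, 1, 0], 1),  # p6
    ([1, 1, 1, 1], 1),  # p7
]

w = [0, 0, -1, 2]  # initial weight vector

converged = False
while not converged:
    converged = True
    for x, label in points:
        s = sum(wi * xi for wi, xi in zip(w, x))
        if label == 1 and s < 0:          # misclassified positive point
            w = [wi + xi for wi, xi in zip(w, x)]
            converged = False
        elif label == 0 and s >= 0:       # misclassified negative point
            w = [wi - xi for wi, xi in zip(w, x)]
            converged = False

print(w)  # converged weight vector: [-1, 1, 1, 2]
```

The loop stops after a full pass makes no updates, at which point every point is on the correct side of the plane w·x = 0.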
6. A 2-dimensional dataset for 2 classes is given to you. Plot the data and comment whether
the 2 classes are linearly separable or not. Note that the top 500 rows are of Class A
and the remaining 500 are of Class B. You can download the dataset by clicking here. Feel
free to use any programming language/plotting tool of your choice. Once you plot the
data, answer the following question:
Is the data linearly separable?
A. True
B. False
Solution: You can plot the data with a few lines of Python.
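A minimal plotting sketch, assuming the dataset downloads as a headerless CSV with one 2-D point per row (the filename `data.csv` and the helper names are illustrative; adjust them to match the actual file):

```python
import csv

def load_points(path):
    """Read the 2-D points as a list of (x, y) float pairs."""
    with open(path) as fh:
        return [(float(r[0]), float(r[1])) for r in csv.reader(fh) if r]

def plot_classes(points, split=500):
    """Scatter-plot the first `split` points as Class A, the rest as Class B."""
    import matplotlib.pyplot as plt  # imported here so loading works without matplotlib
    a, b = points[:split], points[split:]
    plt.scatter([p[0] for p in a], [p[1] for p in a], label="Class A")
    plt.scatter([p[0] for p in b], [p[1] for p in b], label="Class B")
    plt.legend()
    plt.show()

# plot_classes(load_points("data.csv"))
```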
As we can see from the resulting scatter plot, the two classes are linearly separable.
∴ Option A is the correct answer.
7. Partial derivatives. This question is not based on the material that we have covered
so far. However, it is a part of the pre-requisites and will be required for the material
that we will cover in the next class. Consider the following function,
f(x) = 1 / (1 + e^{−(w·x+b)})
[Figure: the input x is passed through a sigmoid unit σ to produce f(x)]
Let the loss be L = (1/2)(y − f(x))². Compute ∂L/∂w and ∂L/∂b, and choose the
correct option.
A. ∂L/∂w = (y − f(x))f(x)(1 − f(x)) and ∂L/∂b = (y − f(x))f(x)(1 − f(x))x
B. ∂L/∂w = (y − f(x))(1 − f(x))x and ∂L/∂b = −(y − f(x))f(x)(1 − f(x))
C. ∂L/∂w = −(y − f(x))f(x)(1 − f(x))x and ∂L/∂b = −(y − f(x))f(x)(1 − f(x))
Solution: Let us denote z = f(x). Also, let us derive some elementary derivatives
which we then stitch together to create the partial derivatives we seek:
L = (1/2)(y − z)²
⇒ dL/dz = −(y − z)
Next, if θ is one of the parameters on which z = f(x) depends, using the chain rule
of derivatives (and the fact that y is constant), we have:
⇒ ∂L/∂θ = (dL/dz)(∂z/∂θ)
⇒ ∂L/∂θ = −(y − z) ∂z/∂θ
⇒ ∂L/∂θ = −(y − f(x)) ∂f(x)/∂θ
For the sigmoid function, let us denote the total input as wx + b = t and find its
derivative with respect to t:
f(x) = 1 / (1 + e^{−(wx+b)}) = 1 / (1 + e^{−t})
⇒ df(x)/dt = (d/dt) (1 + e^{−t})^{−1}
⇒ df(x)/dt = e^{−t} / (1 + e^{−t})²
⇒ df(x)/dt = 1/(1 + e^{−t}) − 1/(1 + e^{−t})²
⇒ df(x)/dt = f(x)(1 − f(x))
Again, using the chain rule of derivatives and for a generic parameter θ:
⇒ ∂f(x)/∂θ = (df(x)/dt)(∂t/∂θ)
⇒ ∂f(x)/∂θ = f(x)(1 − f(x)) ∂t/∂θ
And so, stitching it all back together and using the fact that the partial derivative of
t with respect to w is x and with respect to b is 1, we get:
⇒ ∂L/∂w = −(y − f(x))f(x)(1 − f(x))x
⇒ ∂L/∂b = −(y − f(x))f(x)(1 − f(x))
∴ Option C is the correct answer.
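The derived expressions can be sanity-checked numerically with central differences (the values chosen below for w, b, x, y are arbitrary illustrative choices):

```python
import math

def f(w, b, x):
    """Sigmoid of the total input wx + b."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b, x, y):
    """Squared-error loss L = (1/2)(y - f(x))^2."""
    return 0.5 * (y - f(w, b, x)) ** 2

w, b, x, y = 0.5, -0.3, 1.2, 1.0

# Analytic gradients from the derivation above
fx = f(w, b, x)
dL_dw = -(y - fx) * fx * (1 - fx) * x
dL_db = -(y - fx) * fx * (1 - fx)

# Numerical gradients via central differences
eps = 1e-6
num_dw = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
num_db = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)
print(abs(dL_dw - num_dw), abs(dL_db - num_db))  # both should be ~0
```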
8. Consider the following computation graph:
[Figure: computation graph with intermediate node m = ax + by and output E = f(cm + dz)]
Here x, y, z are inputs (constants) and a, b, c, d are parameters (variables). m is an
intermediate computation and f is some differentiable function. Specifically, let us
consider f to be the tanh function.
tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})
Note that here E is a function of a, b, c, d. Compute the partial derivative of E with
respect to a, i.e. ∂E/∂a, and choose the correct option.
A. ∂E/∂a = (1 − f(c(ax + by) + dz)²)cx
B. ∂E/∂a = c(1 − f(c(ax + by) + dz)²)
C. ∂E/∂a = (1 − f(c(ax + by) − dz)²)cx
Solution: Let us denote t = c(ax + by) + dz. Then by the chain rule of derivatives, for
a generic parameter θ we have:
∂E/∂θ = (df(t)/dt)(∂t/∂θ)
We have:
f(t) = (e^t − e^{−t}) / (e^t + e^{−t})
Using the quotient rule of derivatives, we have:
⇒ df(t)/dt = 1 − f(t)²
Since ∂t/∂a = cx, we have:
⇒ ∂E/∂a = (1 − f(c(ax + by) + dz)²)cx
∴ Option A is the correct answer.
9. Keeping the graph discussed in question 8 in mind, find ∂E/∂b and choose the correct
option.
A. ∂E/∂b = (1 − f(c(ax + by) + dz))cy
B. ∂E/∂b = (1 − f(c(ax + by) + dz)²)
C. ∂E/∂b = (1 − f(c(ax + by) + dz)²)cy
Solution: Following the derivation in question 8, with ∂t/∂b = cy:
⇒ ∂E/∂b = (1 − f(c(ax + by) + dz)²)cy
∴ Option C is the correct answer.
10. Keeping the graph discussed in question 8 in mind, find ∂E/∂c and choose the correct
option.
A. ∂E/∂c = (1 − f(c(ax + by) + dz)²)(ax + by)
B. ∂E/∂c = (1 − f(c(ax + by) + dz))(ax + by)
C. ∂E/∂c = (1 − f(c(ax + by) + dz)²)
Solution: Following the derivation in question 8, with ∂t/∂c = ax + by:
⇒ ∂E/∂c = (1 − f(c(ax + by) + dz)²)(ax + by)
∴ Option A is the correct answer.
11. Keeping the graph discussed in question 8 in mind, find ∂E/∂d and choose the correct
option.
A. ∂E/∂d = 2(1 − f(c(ax + by) + dz)²)z
B. ∂E/∂d = (1 − f(c(ax + by) + dz)²)z
C. ∂E/∂d = (1 − f(c(ax + by) + dz)²)
Solution: Continuing the derivation in question 8, with ∂t/∂d = z, we have:
⇒ ∂E/∂d = (1 − f(c(ax + by) + dz)²)z
∴ Option B is the correct answer.
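All four partial derivatives from questions 8-11 share the factor (1 − f(t)²) and can be checked against central differences (the input and parameter values below are arbitrary illustrative choices):

```python
import math

x, y, z = 0.7, -0.2, 0.4          # inputs (constants)
a, b, c, d = 0.3, -0.5, 0.8, 0.6  # parameters

def E(a, b, c, d):
    # E = f(c(ax + by) + dz) with f = tanh
    return math.tanh(c * (a * x + b * y) + d * z)

t = c * (a * x + b * y) + d * z
g = 1 - math.tanh(t) ** 2  # common factor 1 - f(t)^2

# Analytic partials, in the order of questions 8, 9, 10, 11
analytic = [g * c * x, g * c * y, g * (a * x + b * y), g * z]

# Numerical partials via central differences
eps = 1e-6
numeric = [
    (E(a + eps, b, c, d) - E(a - eps, b, c, d)) / (2 * eps),
    (E(a, b + eps, c, d) - E(a, b - eps, c, d)) / (2 * eps),
    (E(a, b, c + eps, d) - E(a, b, c - eps, d)) / (2 * eps),
    (E(a, b, c, d + eps) - E(a, b, c, d - eps)) / (2 * eps),
]
print(max(abs(u - v) for u, v in zip(analytic, numeric)))  # should be ~0
```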