
Deep Learning - Assignment 1
Your Name, Roll Number

1. Recall that a McCulloch-Pitts (MP) neuron aggregates its inputs and takes a decision
based on this aggregation. If the sum of all inputs is greater than or equal to the threshold
(θ), then the output of the MP neuron is 1; otherwise the output is 0. We say that an MP
neuron implements a boolean function if the output of the MP neuron is consistent with
the truth table of the boolean function. In other words, if for a given input configuration
the boolean function outputs 1, then the output of the neuron should also be 1. Similarly,
if for a given input configuration the boolean function outputs 0, then the output of the
neuron should also be 0.
Consider the following boolean function:

f (x1 , x2 , x3 , x4 ) = (x1 AND x2 ) AND (!x3 AND !x4 )

The MP neuron for the above boolean function is as follows:

[MP neuron diagram: inputs x1, x2, x3, x4 feed a single unit with output y ∈ {0, 1};
the connections from x3 and x4 end in circles, marking them as inhibitory.]

What should be the value of the threshold (θ) such that the MP neuron implements the
above boolean function? (Note that the circle at the end of the input to the MP neuron
indicates inhibitory input. If any inhibitory input is 1 the output will be 0.)

A. θ = 1
B. θ = 2
C. θ = 3
D. θ = 4
Solution: The MP neuron will output 1 if and only if the sum of its inputs is greater
than or equal to the threshold, i.e.

x1 + x2 + x3 + x4 ≥ θ

There are 16 possible inputs to this network: {0,0,0,0}, {0,0,0,1}, ..., {1,1,1,1}. Since
x3 and x4 are inhibitory inputs, the output of the neuron will be zero whenever either
of them is 1. This is as expected, because the output of the boolean function is also
0 when either x3 or x4 is 1. There are 8 such inputs where either x3 or x4 is 1, and
hence the output of the neuron will be 0. Out of the remaining 8 inputs, the given
boolean function f(x1, x2, x3, x4) outputs 1 if and only if x1 = 1, x2 = 1, x3 = 0, x4 = 0,
and 0 in all other cases. Thus, we want the MP neuron to fire only if both x1 and x2
are 1, i.e., we want the sum

x1 + x2 + x3 + x4 ≥ 2

which implies θ = 2.
∴ Option B is the correct answer.
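As a quick sanity check, one can enumerate all 16 input configurations in Python and
confirm that θ = 2 reproduces the truth table (a minimal sketch; the helper names are
ours, not from the lecture):

from itertools import product

def mp_neuron(x, theta, inhibitory=(2, 3)):
    # McCulloch-Pitts neuron: any active inhibitory input forces output 0,
    # otherwise fire iff the sum of inputs reaches the threshold.
    if any(x[i] == 1 for i in inhibitory):
        return 0
    return 1 if sum(x) >= theta else 0

def f(x1, x2, x3, x4):  # (x1 AND x2) AND (!x3 AND !x4)
    return int(x1 and x2 and not x3 and not x4)

# theta = 2 matches the boolean function on all 16 inputs.
assert all(mp_neuron(x, theta=2) == f(*x) for x in product([0, 1], repeat=4))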

2. Keeping the concept discussed in question 1 in mind, consider the following boolean
function:

f (x1 , x2 , x3 , x4 ) = (x1 OR x2 ) AND (!x3 AND !x4 )

The MP neuron for the above boolean function is as follows:

[MP neuron diagram: inputs x1, x2, x3, x4 feed a single unit with output y ∈ {0, 1};
x3 and x4 are inhibitory.]

What should be the value of the threshold (θ) such that the MP neuron implements the
above boolean function? (Note that the circle at the end of the input to the MP neuron
indicates inhibitory input. If any inhibitory input is 1 the output will be 0.)

A. θ = 1
B. θ = 2
C. θ = 3
D. θ = 4

Solution: The MP neuron will output 1 if and only if the sum of its inputs is greater
than or equal to the threshold, i.e.

x1 + x2 + x3 + x4 ≥ θ

There are 16 possible inputs to this network: {0,0,0,0}, {0,0,0,1}, ..., {1,1,1,1}. Since
x3 and x4 are inhibitory inputs, the output of the neuron will be zero whenever either
of them is 1. This is as expected, because the output of the boolean function is also
0 when either x3 or x4 is 1. There are 8 such inputs where either x3 or x4 is 1, and
hence the output of the neuron will be 0. Out of the remaining 8 inputs, the given
boolean function f(x1, x2, x3, x4) outputs 1 whenever at least one of x1, x2 is 1 (with
x3 = 0, x4 = 0), i.e., for the input settings

x1 = 1, x2 = 0, x3 = 0, x4 = 0
x1 = 0, x2 = 1, x3 = 0, x4 = 0
x1 = 1, x2 = 1, x3 = 0, x4 = 0

and 0 for all other input settings. Thus, we want the MP neuron to fire if either x1 or
x2 is 1, i.e., we want the sum

x1 + x2 + x3 + x4 ≥ 1

which implies θ = 1.
∴ Option A is the correct answer.
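Reusing the mp_neuron helper from the sketch in question 1, the same exhaustive check
confirms θ = 1:

def g(x1, x2, x3, x4):  # (x1 OR x2) AND (!x3 AND !x4)
    return int((x1 or x2) and not x3 and not x4)

assert all(mp_neuron(x, theta=1) == g(*x) for x in product([0, 1], repeat=4))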

3. Let us consider the movie example as discussed in this week’s lecture. Suppose we want
to predict whether a movie buff would like to watch a movie or not. Note that each
movie is represented by a vector, X = [x1 x2 x3 x4 ] and the description of each input
(xi ) is mentioned in the figure below. Also, the weight assigned to each of these inputs
(or features) is given by W = [w1 w2 w3 w4 ] and the threshold is represented by the
parameter θ.

[Perceptron diagram: inputs x0 = 1, x1, x2, x3, x4 with weights w0 = −θ, w1, w2, w3, w4
feed the output y.]

x1 = popularity (between 1 and 10)
x2 = isGenreScifi (boolean)
x3 = isDirectorNolan (boolean)
x4 = imdbRating (between 0 and 1)
Now, consider the movie Interstellar with the feature vector X = [8 1 1 0.86]; this
means the movie has a popularity of 8 on a scale of 10 and is a Scifi movie directed
by Nolan with an imdbRating of 0.86. Now consider a person who assigns the following
weights to these inputs: W = [0.14 1 0.9 0.6]. Further, suppose that θ = 2.
Based on the above information, what do you think will be his/her decision?
A. Yes, (s)he will watch it.
B. No, (s)he won’t watch it.

Solution: His/her decision will be to watch the movie if the weighted sum of the inputs
is greater than or equal to the threshold, i.e.

w1 x1 + w2 x2 + w3 x3 + w4 x4 ≥ θ

In this case, we have

w1 x1 + w2 x2 + w3 x3 + w4 x4 = (8 × 0.14) + (1 × 1) + (1 × 0.9) + (0.86 × 0.6)
                               = 3.536
                               ≥ θ = 2

which means (s)he will watch the movie.


∴ Option A is the correct answer.
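The same arithmetic in a couple of lines of Python (a sketch; the helper name is ours):

def will_watch(x, w, theta):
    # Decide to watch iff the weighted sum of the features reaches the threshold.
    return sum(wi * xi for wi, xi in zip(w, x)) >= theta

print(will_watch([8, 1, 1, 0.86], [0.14, 1, 0.9, 0.6], theta=2))  # True (sum = 3.536)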

4. Keeping the discussion of question 3 in mind, consider the movie The Green Lantern
with the feature vector X = [5 1 0 0.53]. Now consider a person who assigns the
following weights to these inputs: W = [0.8 1 0.4 0.8]. Further, suppose that
θ = 7.
Based on the above information, what do you think will be his/her decision?

A. Yes, (s)he will watch it.
B. No, (s)he won't watch it.

Solution: In this case,

w1 x1 + w2 x2 + w3 x3 + w4 x4 = (5 × 0.8) + (1 × 1) + (0 × 0.4) + (0.53 × 0.8)
                               = 5.424
                               < θ = 7

which means (s)he will not watch the movie as the weighted sum lies below the threshold.
∴ Option B is the correct answer.
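Reusing the will_watch helper from the sketch in question 3:

print(will_watch([5, 1, 0, 0.53], [0.8, 1, 0.4, 0.8], theta=7))  # False (sum = 5.424)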

5. Consider a small training set with the following points in R3, each augmented with a bias coordinate x0 = 1:

Index   Point [x0, x, y, z]   Class
n1      [1, 0, 0, 0]          Class 0
p1      [1, 0, 0, 1]          Class 1
p2      [1, 0, 1, 0]          Class 1
p3      [1, 0, 1, 1]          Class 1
p4      [1, 1, 0, 0]          Class 1
p5      [1, 1, 0, 1]          Class 1
p6      [1, 1, 1, 0]          Class 1
p7      [1, 1, 1, 1]          Class 1

Note that there are 8 points which are divided into two classes, Class 0 and Class 1.
We are interested in finding the plane which divides the input space into two classes.
Starting with the weight vector, w = [0, 0, −1, 2], apply the perceptron algorithm by
going over the points in the following order [n1 , p1 , p2 , p3 , p4 , p5 , p6 , p7 ]. If needed, repeat
in the same order till convergence. After the algorithm converges, what is the value of
the weight vector?
A. w = [1, 1, 2, 3]
B. w = [−1, 1, 1, 2]
C. w = [−3, −2, −1, −1]
D. w = [−2, −1, −1, 1]

Solution: You can arrive at the solution by implementing the following pseudocode in
Python:

Algorithm 1: Perceptron Learning Algorithm

P ← inputs with label 1;
N ← inputs with label 0;
Initialize w = [0, 0, −1, 2];
while !convergence do
    for x ∈ [n1, p1, p2, p3, p4, p5, p6, p7] do
        if x ∈ P and w·x < 0 then
            w = w + x;
        end
        if x ∈ N and w·x ≥ 0 then
            w = w − x;
        end
    end
end
// The algorithm converges when all the inputs are classified correctly, so after
// every pass over the data you need to check the number of errors. When the number
// of errors is 0 the algorithm has converged and you can output the value of w.
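A direct Python translation of the pseudocode (a sketch; variable names are ours),
which converges after a few passes:

import numpy as np

data = [  # (point, label) in the prescribed order n1, p1, ..., p7
    ([1, 0, 0, 0], 0),
    ([1, 0, 0, 1], 1), ([1, 0, 1, 0], 1), ([1, 0, 1, 1], 1),
    ([1, 1, 0, 0], 1), ([1, 1, 0, 1], 1), ([1, 1, 1, 0], 1),
    ([1, 1, 1, 1], 1),
]

w = np.array([0.0, 0.0, -1.0, 2.0])
while True:
    for x, label in data:
        x = np.asarray(x, dtype=float)
        if label == 1 and w.dot(x) < 0:
            w = w + x
        elif label == 0 and w.dot(x) >= 0:
            w = w - x
    # Converged when every point is classified correctly.
    errors = sum((w.dot(np.asarray(x, dtype=float)) >= 0) != (label == 1)
                 for x, label in data)
    if errors == 0:
        break

print(w)  # [-1.  1.  1.  2.]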

∴ Option B is the correct answer.

6. A 2-dimensional dataset for 2 classes is given to you. Plot the data and comment on
whether the two classes are linearly separable. Note that the top 500 rows are of Class A
and the remaining 500 are of Class B. You can download the dataset by clicking here. Feel
free to use any programming language/plotting tool of your choice. Once you plot the
data, answer the following question:
Is the data linearly separable?
A. True
B. False

Solution: You can plot the data with a few lines of Python.
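A minimal matplotlib sketch, assuming the file downloads as a headerless two-column
CSV named data.csv (the filename and format are assumptions):

import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('data.csv', delimiter=',')  # assumed filename and format
plt.scatter(data[:500, 0], data[:500, 1], label='Class A')   # top 500 rows
plt.scatter(data[500:, 0], data[500:, 1], label='Class B')   # remaining 500 rows
plt.legend()
plt.show()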

As the resulting scatter plot shows, the two classes are linearly separable.

∴ Option A is the correct answer.

7. Partial derivatives. This question is not based on the material that we have covered
so far. However, it is part of the prerequisites and will be required for the material
that we will cover in the next class. Consider the following function,

[Diagram: the input x passes through a sigmoid unit σ to produce f(x).]

f(x) = 1 / (1 + e^{−(w·x+b)})

The value L is given by

L = (1/2) (y − f(x))²

Here, x and y are constants, and w and b are parameters that can be modified. In other
words, L is a function of w and b.
Derive the partial derivatives ∂L/∂w and ∂L/∂b and choose the correct option.

A. ∂L/∂w = (y − f(x)) f(x)(1 − f(x)) and ∂L/∂b = (y − f(x)) f(x)(1 − f(x)) x

B. ∂L/∂w = (y − f(x))(1 − f(x)) x and ∂L/∂b = −(y − f(x)) f(x)(1 − f(x))

C. ∂L/∂w = −(y − f(x)) f(x)(1 − f(x)) x and ∂L/∂b = −(y − f(x)) f(x)(1 − f(x))

Solution: Let us denote z = f(x). Let us also derive some elementary derivatives
which we then stitch together to obtain the partial derivatives we seek:

L = (1/2)(y − z)²
⇒ dL/dz = −(y − z)

Next, if θ is one of the parameters on which z = f(x) depends, using the chain rule
of derivatives (and the fact that y is constant), we have:

∂L/∂θ = (dL/dz)(∂z/∂θ) = −(y − z) ∂z/∂θ = −(y − f(x)) ∂f(x)/∂θ

For the sigmoid function, let us denote the total input as t = w·x + b and find the
derivative of f with respect to t:

f(x) = 1/(1 + e^{−(w·x+b)}) = 1/(1 + e^{−t})

⇒ df(x)/dt = (d/dt)(1 + e^{−t})^{−1}
           = e^{−t}/(1 + e^{−t})²
           = 1/(1 + e^{−t}) − 1/(1 + e^{−t})²
           = f(x)(1 − f(x))
Again, using the chain rule of derivatives for a generic parameter θ:

∂f(x)/∂θ = (df(x)/dt)(∂t/∂θ) = f(x)(1 − f(x)) ∂t/∂θ

Stitching it all back together, and using the fact that the partial derivative of t
with respect to w is x and with respect to b is 1, we get:

∂L/∂w = −(y − f(x)) f(x)(1 − f(x)) x
∂L/∂b = −(y − f(x)) f(x)(1 − f(x))

∴ Option C is the correct answer.
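A quick numerical sanity check of these gradients against central finite differences
(the values of x, y, w, b below are arbitrary choices of ours):

import numpy as np

def f(w, b, x):
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

def L(w, b, x, y):
    return 0.5 * (y - f(w, b, x)) ** 2

x, y, w, b, eps = 0.7, 1.0, 0.3, -0.2, 1e-6

fx = f(w, b, x)
dL_dw = -(y - fx) * fx * (1 - fx) * x   # analytic, from the solution above
dL_db = -(y - fx) * fx * (1 - fx)

num_dw = (L(w + eps, b, x, y) - L(w - eps, b, x, y)) / (2 * eps)
num_db = (L(w, b + eps, x, y) - L(w, b - eps, x, y)) / (2 * eps)
print(dL_dw, num_dw)  # the two values agree to ~1e-9
print(dL_db, num_db)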

8. Consider the function E as given below,


E = g(x, y, z) = f (c(ax + by) + dz)
Represented as a graph, we have:

[Computation graph: inputs x, y, z and parameters a, b, c, d feed through an
intermediate computation m into f, producing E.]
Here x, y, z are inputs (constants) and a, b, c, d are parameters (variables). m is an
intermediate computation and f is some differentiable function. Specifically, let us
consider f to be the tanh function.
tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})

Note that here E is a function of a, b, c, d. Compute the partial derivative of E with
respect to a, i.e. ∂E/∂a, and choose the correct option.

A. ∂E/∂a = (1 − f(c(ax + by) + dz)²) cx

B. ∂E/∂a = c (1 − f(c(ax + by) + dz)²)

C. ∂E/∂a = (1 − f(c(ax + by) − dz)²) cx

Solution: Let us denote t = c(ax + by) + dz. Then, by the chain rule of derivatives,
for a generic parameter θ we have:

∂E/∂θ = (df(t)/dt)(∂t/∂θ)

We have:

f(t) = (e^t − e^{−t}) / (e^t + e^{−t})

Using the quotient rule of derivatives:

df(t)/dt = ((e^t + e^{−t})² − (e^t − e^{−t})²) / (e^t + e^{−t})² = 1 − f(t)²

Hence, since ∂t/∂a = cx, we have:

∂E/∂a = (1 − f(c(ax + by) + dz)²) cx
∴ Option A is the correct answer.
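If you want to double-check the algebra, sympy can differentiate E symbolically (an
optional check, not part of the original solution):

import sympy as sp

x, y, z, a, b, c, d = sp.symbols('x y z a b c d')
t = c*(a*x + b*y) + d*z
E = sp.tanh(t)

# sympy returns c*x*(1 - tanh(t)**2), i.e. option A.
assert sp.simplify(sp.diff(E, a) - (1 - sp.tanh(t)**2)*c*x) == 0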

9. Keeping the graph discussed in question 8 in mind, find ∂E/∂b and choose the correct
option.

A. ∂E/∂b = (1 − f(c(ax + by) + dz)) cy

B. ∂E/∂b = (1 − f(c(ax + by) + dz)²)

C. ∂E/∂b = (1 − f(c(ax + by) + dz)²) cy

Solution: Continuing the derivation in question 8, since ∂t/∂b = cy, we have

∂E/∂b = (1 − f(c(ax + by) + dz)²) cy
∴ Option C is the correct answer.

10. Keeping the graph discussed in question 8 in mind, find ∂E/∂c and choose the correct
option.

A. ∂E/∂c = (1 − f(c(ax + by) + dz)²)(ax + by)

B. ∂E/∂c = (1 − f(c(ax + by) + dz))(ax + by)

C. ∂E/∂c = (1 − f(c(ax + by) + dz)²)

Solution: Continuing the derivation in question 8, since ∂t/∂c = ax + by, we have

∂E/∂c = (1 − f(c(ax + by) + dz)²)(ax + by)
∴ Option A is the correct answer.

11. Keeping the graph discussed in question 8 in mind, find ∂E/∂d and choose the correct
option.

A. ∂E/∂d = 2(1 − f(c(ax + by) + dz)²) z

B. ∂E/∂d = (1 − f(c(ax + by) + dz)²) z

C. ∂E/∂d = (1 − f(c(ax + by) + dz)²)

Solution: Continuing the derivation in question 8, since ∂t/∂d = z, we have

∂E/∂d = (1 − f(c(ax + by) + dz)²) z
∴ Option B is the correct answer.
