CS 229, Public Course Problem Set #4: Unsupervised Learning and Re-Inforcement Learning
CS 229, Public Course Problem Set #4: Unsupervised Learning and Re-Inforcement Learning
CS 229, Public Course Problem Set #4: Unsupervised Learning and Re-Inforcement Learning
However, EM can also be applied to the supervised learning setting, and in this problem we
discuss a “mixture of linear regressors” model; this is an instance of what is often call the
Hierarchical Mixture of Experts model. We want to represent p(y|x), x ∈ Rn and y ∈ R,
and we do so by again introducing a discrete latent random variable
X X
p(y|x) = p(y, z|x) = p(y|x, z)p(z|x).
z z
For simplicity we’ll assume that z is binary valued, that p(y|x, z) is a Gaussian density,
and that p(z|x) is given by a logistic regression model. More formally
p(z|x; φ) = g(φT x)z (1 − g(φT x))1−z
−(y − θiT x)2
1
p(y|x, z = i; θi ) = √ exp i = 1, 2
2πσ 2σ 2
where σ is a known parameter and φ, θ0 , θ1 ∈ Rn are parameters of the model (here we
use the subscript on θ to denote two different parameter vectors, not to index a particular
entry in these vectors).
Intuitively, the process behind model can be thought of as follows. Given a data point x,
we first determine whether the data point belongs to one of two hidden classes z = 0 or
z = 1, using a logistic regression model. We then determine y as a linear function of x
(different linear functions for different values of z) plus Gaussian noise, as in the standard
linear regression model. For example, the following data set could be well-represented by
the model, but not by standard linear regression.
CS229 Problem Set #4 2
(a) Suppose x, y, and z are all observed, so that we obtain a training set
{(x(1) , y (1) , z (1) ), . . . , (x(m) , y (m) , z (m) )}. Write the log-likelihood of the parameters,
and derive the maximum likelihood estimates for φ, θ0 , and θ1 . Note that because
p(z|x) is a logistic regression model, there will not exist a closed form estimate of φ.
In this case, derive the gradient and the Hessian of the likelihood with respect to φ;
in practice, these quantities can be used to numerically compute the ML esimtate.
(b) Now suppose z is a latent (unobserved) random variable. Write the log-likelihood of
the parameters, and derive an EM algorithm to maximize the log-likelihood. Clearly
specify the E-step and M-step (again, the M-step will require a numerical solution,
so find the appropriate gradients and Hessians).
z ∼ N (0, I)
x|z ∼ N (U z, σ 2 I).
(a) Use the rules for manipulating Gaussian distributions to determine the joint distri-
bution over (x, z) and the conditional distribution of z|x. [Hint: for later parts of
this problem, it will help significantly if you simplify your soluting for the conditional
distribution using the identity we first mentioned in problem set #1: (λI +BA)−1 B =
B(λI + AB)−1 .]
(b) Using these distributions, derive an EM algorithm for the model. Clearly state the
E-step and the M-step of the algorithm.
(c) As σ 2 → 0, show that if the EM algorithm convergences to a parameter vector U ⋆
(and such convergence is guarenteed by the argument presented in class), then U ⋆
1
Pm (i) (i) T
must be an eigenvector of the sample covariance matrix Σ = m i=1 x x — i.e.,
U ⋆ must satisfy
λU ⋆ = ΣU ⋆ .
[Hint: When σ 2 → 0, Σz|x → 0, so the E step only needs to compute the means
µz|x and not the variances. Let w ∈ Rm be a vector containing all these means,
wi = µz(i) |x(i) , and show that the E step and M step can be expressed as
XU XT w
w= , U=
UT U wT w
CS229 Problem Set #4 3
respectively. Finally, show that if U doesn’t change after this update, it must satisfy
the eigenvector equation shown above. ]
For reference, computing the ICA W matrix for the entire set of image patches takes about
5 minutes on a 1.6 Ghz laptop using our implementation.
After you’ve learned the U matrix for PCA (the columns of U should contain the principal
components of the data) and the W matrix of ICA, you can plot the basis functions using
the plot ica bases(W); and plot pca bases(U); functions we have provide. Comment
briefly on the difference between the two sets of basis functions.
1 Recall that the first step of performing PCA is to subtract the mean and normalize the variance of the features.
For the image data we’re using, the preprocessing step for the ICA algorithm is slightly different, though the
precise mechanism and justification is not imporant for the sake of this problem. Those who are curious about
the details should read Bell and Sejnowki’s paper “The ’Independent Components’ of Natural Scenes are Edge
Filters,” which provided the basis for the implementation we use in this problem.
CS229 Problem Set #4 4
(a) Prove that if V1 (s) ≤ V2 (s) for all s ∈ S, then B(V1 )(s) ≤ B(V2 )(s) for all s ∈ S.
(b) Prove that for any V ,
kB π (V ) − V π k∞ ≤ γkV − V π k∞
where kV k∞ = maxs∈S |V (s)|. Intuitively, this means that applying the Bellman
operator B π to any value function V , brings that value function “closer” to the value
function for π, V π . This also means that applying B π repeatedly (an infinite number
of times)
B π (B π (. . . B π (V ) . . .))
will result in the value function V π (a little bit more is needed to make this completely
formal, but we won’t worry about that here).
[Hint: Use the fact that for any α, x ∈ Rn , if i αi = 1 and αi ≥ 0, then i αi xi ≤
P P
maxi xi .]
(c) Now suppose that we have some policy π, and use Policy Iteration to choose a new
policy π ′ according to
X
π ′ (s) = arg max Psa (s′ )V π (s′ ).
a∈A
s′ ∈S
Show that this policy will never perform worse that the previous one — i.e., show
′
that for all s ∈ S, V π (s) ≤ V π (s).
′
[Hint: First show that V π (s) ≤ B π (V π )(s), then use the proceeding excercises to
′ ′
show that B π (V π )(s) ≤ V π (s).]
(d) Use the proceeding exercises to show that policy iteration will eventually converge
(i.e., produce a policy π ′ = π). Furthermore, show that it must converge to the
optimal policy π ⋆ . For the later part, you may use the property that if some value
function satisfies
X
V (s) = R(s) + γ max s′ ∈ SPsa (s′ )V (s′ )
a∈A
then V = V ⋆ .
0.6
0.4
0.2
−0.2
−0.4
−0.6
All states except those at the top of the hill have a constant reward R(s) = −1, while the
goal state at the hilltop has reward R(s) = 0; thus an optimal agent will try to get to the
top of the hill as fast as possible (when the car reaches the top of the hill, the episode is
over, and the car is reset to its initial position). However, when starting at the bottom
of the hill, the car does not have enough power to reach the top by driving forward, so
it must first accerlaterate accelerate backwards, building up enough momentum to reach
the top of the hill. This strategy of moving away from the goal in order to reach the goal
makes the problem difficult for many classical control algorithms.
As discussed in class, Q-learning maintains a table of Q-values, Q(s, a), for each state and
action. These Q-values are useful because, in order to select an action in state s, we only
need to check to see which Q-value is greatest. That is, in state s we take the action
arg max Q(s, a).
a∈A
The Q-learning algorithm adjusts its estimates of the Q-values as follows. If an agent is in
state s, takes action a, then ends up in state s′ , Q-learning will update Q(s, a) by
Q(s, a) = (1 − α)Q(s, a) + γ(R(s′ ) + γ max Q(s′ , a′ ).
′ a ∈A
At each time, your implementation of Q-learning can execute the greedy policy π(s) =
arg maxa∈A Q(s, a)
Implement the [q, steps per episode] = qlearning(episodes) function in the q5/
directory. As input, the function takes the total number of episodes (each episode starts
with the car at the bottom of the hill, and lasts until the car reaches the top), and outputs
a matrix of the Q-values and a vector indicating how many steps it took before the car was
able to reach the top of the hill. You should use the [x, s, absorb] = mountain car(x,
actions(a)) function to simulate one control cycle for the task — the x variable describes
the true (continuous) state of the system, whereas the s variable describes the discrete
index of the state, which you’ll use to build the Q values.
Plot a graph showing the average number of steps before the car reaches the top of the
hill versus the episode number (there is quite a bit of variation in this quantity, so you will
probably want to average these over a large number of episodes, as this will give you a
better idea of how the number of steps before reaching the hilltop is decreasing). You can
also visualize your resulting controller by calling the draw mountain car(q) function.