Mathematical Foundations of Deep Neural Networks, M1407.

E. Ryu
Spring 2024
Due 5pm, Monday, May 06, 2024

Problem 1: Transpose of downsampling. Consider the downsampling operator
$T : \mathbb{R}^{m\times n} \to \mathbb{R}^{(m/2)\times(n/2)}$, defined as the average pool with a
2 × 2 kernel and stride 2. For the sake of simplicity, assume m and n are even. Describe the
action of $T^\top$. More specifically, describe how to compute $T^\top(Y)$ for any
$Y \in \mathbb{R}^{(m/2)\times(n/2)}$.

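Clarification. The downsampling operator T is a linear operator (why?). Therefore, T has a
matrix representation $A \in \mathbb{R}^{(mn/4)\times(mn)}$ such that

$$T(X) = \big(A\,(X.\mathrm{reshape}(mn))\big).\mathrm{reshape}(m/2,\, n/2)$$

for all $X \in \mathbb{R}^{m\times n}$. The adjoint $T^\top$ has two equivalent definitions. One
definition is

$$T^\top(Y) = \big(A^\top (Y.\mathrm{reshape}(mn/4))\big).\mathrm{reshape}(m,\, n)$$

for all $Y \in \mathbb{R}^{(m/2)\times(n/2)}$. Another is

$$\sum_{i=1}^{m/2}\sum_{j=1}^{n/2} Y_{ij}\,(T(X))_{ij} = \sum_{i=1}^{m}\sum_{j=1}^{n} (T^\top(Y))_{ij}\,X_{ij}$$

for all $X \in \mathbb{R}^{m\times n}$ and $Y \in \mathbb{R}^{(m/2)\times(n/2)}$.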
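
One way to sanity-check a candidate formula for the adjoint is to build the matrix A numerically from its definition and compare both sides of the inner-product identity. The sketch below does this in plain Python on a small example. The helper names (`avg_pool_2x2`, `matrix_of`, `adjoint_via_matrix`) and the sizes m = 4, n = 6 are illustrative assumptions, not part of the assignment.

```python
import random

def avg_pool_2x2(X):
    """Average pool with a 2x2 kernel and stride 2 on a list-of-lists matrix."""
    m, n = len(X), len(X[0])
    return [[(X[2*i][2*j] + X[2*i][2*j+1] + X[2*i+1][2*j] + X[2*i+1][2*j+1]) / 4
             for j in range(n // 2)] for i in range(m // 2)]

def flatten(M):
    """Row-major flattening, playing the role of reshape to a vector."""
    return [v for row in M for v in row]

def reshape(v, rows, cols):
    """Row-major reshape of a flat list into a rows x cols matrix."""
    return [v[r * cols:(r + 1) * cols] for r in range(rows)]

def matrix_of(T, m, n):
    """Build A column by column by applying T to each standard basis matrix."""
    cols = []
    for k in range(m * n):
        e = [0.0] * (m * n)
        e[k] = 1.0
        cols.append(flatten(T(reshape(e, m, n))))
    # cols[k] is column k of A, so A[r][k] = cols[k][r]
    return [[cols[k][r] for k in range(m * n)] for r in range(len(cols[0]))]

def adjoint_via_matrix(A, Y, m, n):
    """Apply A transposed to the flattened Y, then reshape to m x n."""
    y = flatten(Y)
    x = [sum(A[r][k] * y[r] for r in range(len(A))) for k in range(m * n)]
    return reshape(x, m, n)

def inner(P, Q):
    """Entrywise inner product of two matrices of the same shape."""
    return sum(p * q for p, q in zip(flatten(P), flatten(Q)))

m, n = 4, 6
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]
Y = [[random.gauss(0, 1) for _ in range(n // 2)] for _ in range(m // 2)]

A = matrix_of(avg_pool_2x2, m, n)
lhs = inner(Y, avg_pool_2x2(X))                  # sum of Y_ij (T(X))_ij
rhs = inner(adjoint_via_matrix(A, Y, m, n), X)   # sum of (T^T(Y))_ij X_ij
print(abs(lhs - rhs) < 1e-9)
```

Printing `adjoint_via_matrix(A, Y, m, n)` for a simple Y can help in spotting the pattern that a written answer should then explain.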

Hint. To spoil the suspense, $T^\top$ is a constant times the nearest neighbor upsampling. Explain
why in your answer.

Problem 2: Nearest neighbor upsampling. How is the nearest neighbor upsampling operator
an instance of transpose convolution? Specifically, describe how
layer = nn.Upsample(scale_factor=r, mode='nearest')

where r is a positive integer, can be equivalently represented by

layer = nn.ConvTranspose2d(...)
layer.weight.data = ...

with ... appropriately filled in.
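
As a reference point, nearest neighbor upsampling by an integer factor r copies each input entry into an r × r block. The plain-Python sketch below implements this for a single 2D array, so that the output of a candidate transpose-convolution layer can be compared against it on small inputs. The function name `nearest_upsample` is a hypothetical helper, and the sketch ignores the batch and channel dimensions that the PyTorch layers expect.

```python
def nearest_upsample(X, r):
    """Nearest neighbor upsampling of a list-of-lists matrix by integer factor r.

    Output entry (i, j) copies input entry (i // r, j // r).
    """
    rows, cols = len(X), len(X[0])
    return [[X[i // r][j // r] for j in range(cols * r)] for i in range(rows * r)]

X = [[1, 2],
     [3, 4]]
for row in nearest_upsample(X, 2):
    print(row)
```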

Problem 3: f-divergence. Let X and Y be two continuous random variables with densities $p_X$
and $p_Y$. The f-divergence of X from Y is defined as

$$D_f(X\|Y) = \int f\!\left(\frac{p_X(x)}{p_Y(x)}\right) p_Y(x)\,dx,$$

where f is a convex function such that f(1) = 0.

(a) Show that Df (X∥Y ) ≥ 0.

(b) Show that $f(t) = -\log t$ and $f(t) = t \log t$ correspond to the KL divergence.
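
A numerical experiment can make the definition concrete before attempting the proofs. The sketch below approximates the f-divergence integral with the trapezoid rule for two univariate Gaussians and compares the results with the standard closed-form KL divergence between Gaussians. The specific Gaussians, the integration range, and the helper names (`normal_pdf`, `f_divergence`, `kl_gauss`) are assumptions chosen for illustration. The agreement is only approximate and is not a substitute for the argument asked for in (a) and (b).

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def f_divergence(f, p_x, p_y, lo=-12.0, hi=12.0, steps=24000):
    """Trapezoid-rule approximation of the integral of f(p_x/p_y) * p_y over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        w = 0.5 if k in (0, steps) else 1.0  # endpoints get half weight
        total += w * f(p_x(x) / p_y(x)) * p_y(x)
    return total * h

def kl_gauss(mu1, s1, mu2, s2):
    """Standard closed-form KL divergence of N(mu1, s1^2) from N(mu2, s2^2)."""
    return math.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5

p_x = lambda x: normal_pdf(x, 0.0, 1.0)   # X ~ N(0, 1)
p_y = lambda x: normal_pdf(x, 1.0, 1.5)   # Y ~ N(1, 1.5^2)

d_tlogt = f_divergence(lambda t: t * math.log(t), p_x, p_y)
d_neglog = f_divergence(lambda t: -math.log(t), p_x, p_y)

print(d_tlogt, kl_gauss(0.0, 1.0, 1.0, 1.5))    # the two values should agree closely
print(d_neglog, kl_gauss(1.0, 1.5, 0.0, 1.0))   # note the swapped roles of X and Y
```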

Problem 4: Generalized inverse transform sampling. Let $F : \mathbb{R} \to [0, 1]$ be the CDF of a
random variable and let $U \sim \mathrm{Uniform}([0, 1])$. If F is continuous and strictly
increasing and therefore invertible, then $F^{-1}(U)$ is a random variable with CDF F, because

$$\mathbb{P}(F^{-1}(U) \le t) = \mathbb{P}(U \le F(t)) = F(t).$$

When F is not necessarily invertible, the generalized inverse of F is G : (0, 1) → R with

G(u) = inf{x ∈ R | u ≤ F (x)}.

Show that G(U ) is a random variable with CDF F .

Hint. Use the fact that F is right-continuous, i.e., $\lim_{h\to 0^+} F(x + h) = F(x)$ for all
$x \in \mathbb{R}$, and that $\lim_{x\to-\infty} F(x) = 0$.
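
For a discrete distribution the CDF is a step function and has no ordinary inverse, so it makes a convenient test case for the generalized inverse. In that case G(u) is the smallest atom whose cumulative probability is at least u. The sketch below implements this with a binary search and compares empirical frequencies of G(U) with the target probabilities. The atoms, probabilities, sample size, and the helper name `make_generalized_inverse` are illustrative assumptions.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def make_generalized_inverse(values, probs):
    """Return G(u) = inf{x : u <= F(x)} for a discrete distribution.

    `values` must be sorted in increasing order, with probs[k] = P(X = values[k]).
    """
    cdf = list(accumulate(probs))  # cdf[k] = F(values[k])
    def G(u):
        # bisect_left returns the first index k with cdf[k] >= u
        k = bisect_left(cdf, u)
        # guard against cdf[-1] falling just below 1 due to rounding
        return values[min(k, len(values) - 1)]
    return G

values = [-1.0, 0.5, 2.0]
probs = [0.25, 0.5, 0.25]
G = make_generalized_inverse(values, probs)

random.seed(0)
N = 100_000
# random.random() lies in [0, 1); the value 0 lies outside the domain (0, 1) of G
# but occurs with negligible probability
samples = [G(random.random()) for _ in range(N)]
freqs = [samples.count(v) / N for v in values]
print(freqs)  # should be close to probs
```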

Problem 5: Change of variables formula for Gaussians. If $\varphi : \mathbb{R}^n \to \mathbb{R}^n$
is a one-to-one differentiable function, $Y = \varphi(X)$, and Y is a continuous random variable
with density function $p_Y$, then X is a continuous random variable with density function

$$p_X(x) = p_Y(\varphi(x)) \left| \det \frac{\partial \varphi}{\partial x}(x) \right|.$$

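Let $Y \in \mathbb{R}^n$ be a continuous random vector with density

$$p_Y(y) = \frac{1}{(2\pi)^{n/2}}\, e^{-\frac{1}{2}\|y\|^2},$$

i.e., $Y \sim \mathcal{N}(0, I)$. Let $X = AY + b$ with an invertible matrix
$A \in \mathbb{R}^{n\times n}$ and a vector $b \in \mathbb{R}^n$. Define $\Sigma = AA^\intercal$.
Show that X is a continuous random vector with density

$$p_X(x) = \frac{1}{\sqrt{(2\pi)^n \det \Sigma}}\, e^{-\frac{1}{2}(x-b)^\intercal \Sigma^{-1} (x-b)}.$$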
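
The claimed density can be checked numerically on a small case before writing the proof. The sketch below takes n = 2, evaluates the density obtained from the change of variables formula with the map x ↦ A⁻¹(x − b), and compares it with the closed-form Gaussian density at a few points. The matrix A, the vector b, the test points, and all helper names are assumptions chosen for illustration.

```python
import math

def mat_vec(M, v):
    """Product of a 2x2 matrix and a length-2 vector."""
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inv2(M):
    """Inverse of an invertible 2x2 matrix."""
    d = det2(M)
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]

def p_Y(y):
    """Standard normal density on R^2."""
    return math.exp(-0.5 * (y[0] ** 2 + y[1] ** 2)) / (2 * math.pi)

A = [[2.0, 1.0], [0.5, 3.0]]   # any invertible matrix would do
b = [1.0, -2.0]
A_inv = inv2(A)
Sigma = [[sum(A[i][k] * A[j][k] for k in range(2)) for j in range(2)] for i in range(2)]
Sigma_inv = inv2(Sigma)

def p_X_change_of_variables(x):
    """p_Y(phi(x)) |det Jacobian| with phi(x) = A^{-1}(x - b), whose Jacobian is A^{-1}."""
    d = [x[0] - b[0], x[1] - b[1]]
    return p_Y(mat_vec(A_inv, d)) * abs(det2(A_inv))

def p_X_closed_form(x):
    """Gaussian density with mean b and covariance Sigma = A A^T, for n = 2."""
    d = [x[0] - b[0], x[1] - b[1]]
    Sd = mat_vec(Sigma_inv, d)
    quad = d[0] * Sd[0] + d[1] * Sd[1]
    return math.exp(-0.5 * quad) / math.sqrt((2 * math.pi) ** 2 * det2(Sigma))

points = [[0.0, 0.0], [1.0, -2.0], [3.5, 0.25], [-2.0, 4.0]]
for x in points:
    print(p_X_change_of_variables(x), p_X_closed_form(x))  # the pairs should agree
```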

Problem 6: Inverse permutation. Let $S_n$ denote the group of length-n permutations. Note
that for $\sigma \in S_n$, the map $i \mapsto \sigma(i)$ is a bijection. Define
$\sigma^{-1} \in S_n$ as the permutation representing the inverse of this map, i.e.,
$\sigma^{-1}(\sigma(i)) = i$ for $i = 1, \dots, n$. Describe an algorithm for computing
$\sigma^{-1}$ given $\sigma$.

Clarification. In this class, we defined $\sigma$ as a list of length n containing the elements of
$\{1, \dots, n\}$ exactly once. The output of the algorithm, $\sigma^{-1}$, should also be provided
as a list.
Clarification. For this problem, it is sufficient to describe the algorithm in equations or
pseudocode. There is no need to submit a Python script for this problem.
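
With the list convention above, the defining property of the inverse can be checked mechanically. The sketch below only verifies a candidate answer against that property. It does not compute the inverse, which is left to the algorithm being asked for. The function name `is_inverse` is a hypothetical helper, and the lists store the 1-based values σ(1), ..., σ(n) in order.

```python
def is_inverse(sigma, tau):
    """Return True if tau(sigma(i)) = i for i = 1, ..., n.

    Both arguments are lists holding 1-based permutation values,
    so sigma(i) is stored at Python index i - 1.
    """
    n = len(sigma)
    if sorted(sigma) != list(range(1, n + 1)) or len(tau) != n:
        return False  # sigma is not a permutation, or the lengths differ
    return all(tau[sigma[i - 1] - 1] == i for i in range(1, n + 1))

sigma = [2, 3, 1]                    # sigma(1) = 2, sigma(2) = 3, sigma(3) = 1
print(is_inverse(sigma, [3, 1, 2]))  # candidate worked out by hand
print(is_inverse(sigma, [2, 3, 1]))  # sigma itself is not its own inverse here
```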

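Problem 7: Permutation matrix. Given a permutation $\sigma \in S_n$, the permutation matrix of
$\sigma$ is defined as

$$P_\sigma = \begin{bmatrix} e_{\sigma(1)}^\intercal \\ e_{\sigma(2)}^\intercal \\ \vdots \\ e_{\sigma(n)}^\intercal \end{bmatrix} \in \mathbb{R}^{n\times n},$$

where $e_1, \dots, e_n \in \mathbb{R}^n$ are the standard unit vectors. Show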

(a) $(P_\sigma x)_i = x_{\sigma(i)}$ for all $x \in \mathbb{R}^n$ and $i = 1, \dots, n$,

(b) $P_\sigma^\intercal = P_\sigma^{-1} = P_{\sigma^{-1}}$, and

(c) $|\det P_\sigma| = 1$.

Hint. If the rows of $U \in \mathbb{R}^{n\times n}$ are orthonormal, we say U is an orthogonal
matrix. Orthogonal matrices satisfy $UU^\intercal = U^\intercal U = I$.
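
The three claims can be checked on a concrete permutation before proving them in general. The sketch below builds the permutation matrix row by row from the definition and tests (a), (b), and (c) for one 3-element example. The example permutation, its hand-computed inverse, and the helper names are assumptions for illustration. A check on a single example does not replace the proof.

```python
def permutation_matrix(sigma):
    """Row i (1-based) of P_sigma is the transposed unit vector e_{sigma(i)}."""
    n = len(sigma)
    return [[1 if j == sigma[i] - 1 else 0 for j in range(n)] for i in range(n)]

def mat_vec(P, x):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(p * v for p, v in zip(row, x)) for row in P]

def transpose(P):
    return [list(col) for col in zip(*P)]

def mat_mul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]

def det(M):
    """Determinant by Laplace expansion along the first row (fine for small n)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

sigma = [2, 3, 1]
sigma_inv = [3, 1, 2]   # inverse of sigma, worked out by hand for this example
P = permutation_matrix(sigma)
x = [10, 20, 30]
identity = [[1 if i == j else 0 for j in range(3)] for i in range(3)]

print(mat_vec(P, x) == [x[s - 1] for s in sigma])      # (a): (P x)_i = x_{sigma(i)}
print(mat_mul(P, transpose(P)) == identity)            # (b): P^T is the inverse of P
print(transpose(P) == permutation_matrix(sigma_inv))   # (b): P^T equals P_{sigma^{-1}}
print(abs(det(P)) == 1)                                # (c)
```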

