
Estimation of covariance matrices

Roman Vershynin
University of Michigan

Probability and Geometry in High Dimensions


Paris, May 2010


Covariance matrix
Basic problem in multivariate statistics:
by sampling from a high-dimensional distribution, determine
its covariance structure.
Principal Component Analysis (PCA): detect the principal
axes along which most dependence occurs:

[Figure: PCA of a multivariate Gaussian distribution. Source: Gaël Varoquaux's blog, gael-varoquaux.info]


Covariance matrix

The covariance structure of a high-dimensional distribution
is captured by its covariance matrix Σ.
Let X be a random vector in R^p with this distribution.
We may assume that X is centered (by estimating and
subtracting E X). The covariance matrix of X is defined as

    Σ = E XXᵀ = E X ⊗ X = (E X_i X_j)_{i,j=1}^p = (cov(X_i, X_j))_{i,j=1}^p.

Σ = Σ(X) is a symmetric, positive semi-definite p × p matrix.
It is a multivariate version of the variance Var(X).
If Σ(X) = I we say that X is isotropic. Every full-dimensional
random vector X can be made isotropic by the linear
transformation X ↦ Σ^{−1/2} X.


Estimation of covariance matrices

Estimation of covariance matrices is a basic problem in
multivariate statistics. It arises in signal processing, genomics,
financial mathematics, pattern recognition, and computational
convex geometry.
We take a sample of n independent points X_1, ..., X_n from
the distribution and form the sample covariance matrix

    Σ_n = (1/n) ∑_{k=1}^n X_k X_kᵀ.

Σ_n is a random matrix. Hopefully it approximates Σ well:


Estimation of covariance matrices


Covariance Estimation Problem. Determine the minimal sample
size n = n(p) that guarantees, with high probability (say, 0.99), that
the sample covariance matrix Σ_n estimates the actual covariance
matrix Σ with fixed accuracy (say, ε = 0.01) in the operator norm:

    ‖Σ_n − Σ‖ ≤ ε ‖Σ‖.

[Figure: PCA of a multivariate Gaussian distribution, as on the earlier slide.]
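As a quick numerical illustration of this problem (not part of the original slides; the distribution, dimension, and sample sizes below are assumptions made only for the demo), one can watch the operator-norm error of Σ_n shrink as the sample size n grows:

```python
import numpy as np

rng = np.random.default_rng(0)

p = 100
# Illustrative true covariance matrix: diagonal with a decaying spectrum.
Sigma = np.diag(np.linspace(1.0, 0.1, p))
L = np.linalg.cholesky(Sigma)

def relative_error(n):
    """Draw n samples X_k ~ N(0, Sigma), form Sigma_n, return ||Sigma_n - Sigma|| / ||Sigma||."""
    X = rng.standard_normal((n, p)) @ L.T      # rows are the samples X_1, ..., X_n
    Sigma_n = X.T @ X / n                      # sample covariance matrix (1/n) A^T A
    return np.linalg.norm(Sigma_n - Sigma, 2) / np.linalg.norm(Sigma, 2)

for n in [p, 5 * p, 25 * p, 100 * p]:
    print(f"n = {n:6d}: relative operator-norm error ≈ {relative_error(n):.3f}")
```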


Estimation problem and random matrices


The estimation problem can be stated as a problem about the
spectrum of random matrices.
Assume for simplicity that the distribution is isotropic, Σ = I.
Form our sample X_1, ..., X_n into an n × p random matrix A
with independent rows X_1ᵀ, ..., X_nᵀ.

Then the sample covariance matrix is

    Σ_n = (1/n) ∑_{k=1}^n X_k X_kᵀ = (1/n) AᵀA.


Estimation problem and random matrices

Σ_n = (1/n) AᵀA.
The desired estimation ‖Σ_n − I‖ ≤ ε is equivalent to saying
that (1/√n) A is an almost isometric embedding R^p → R^n:

    (1 − ε)√n ≤ ‖Ax‖_2 ≤ (1 + ε)√n   for all x ∈ S^{p−1}.

Equivalently, the singular values s_i(A) = eig(AᵀA)^{1/2} are all
close to each other and to √n:

    (1 − ε)√n ≤ s_min(A) ≤ s_max(A) ≤ (1 + ε)√n.

Question. What random matrices with independent rows are
almost isometric embeddings?

Random matrices with independent entries

Simplest example: Gaussian distributions.
A is an n × p random matrix with independent N(0, 1) entries.
Σ_n is then called a Wishart matrix.
Random matrix theory in the asymptotic regime n, p → ∞:

Theorem (Bai-Yin Law). When n, p → ∞, n/p → const, one has

    s_min(A) ≈ √n − √p,   s_max(A) ≈ √n + √p   a.s.


Random matrices with independent entries

Bai-Yin: s_min(A) ≈ √n − √p,   s_max(A) ≈ √n + √p.

Thus, making n slightly bigger than p, we force both extreme
singular values to be close to each other, and make A an almost
isometric embedding.
Formally, the sample covariance matrix Σ_n = (1/n) AᵀA nicely
approximates the actual covariance matrix I:

    ‖Σ_n − I‖ ≲ 2 √(p/n) + p/n.

Answer to the Estimation Problem for Gaussian distributions:
sample size n(p) ∼ p suffices to estimate the covariance matrix by
a sample covariance matrix.
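A minimal numerical check of the Bai-Yin prediction and of the resulting covariance bound (a sketch with illustrative parameters, not taken from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)

p = 200
for n in [2 * p, 10 * p, 50 * p]:
    A = rng.standard_normal((n, p))            # independent N(0, 1) entries
    s = np.linalg.svd(A, compute_uv=False)     # singular values, in decreasing order
    print(f"n = {n:5d}: s_min = {s[-1]:8.2f} (√n − √p = {np.sqrt(n) - np.sqrt(p):8.2f}),"
          f" s_max = {s[0]:8.2f} (√n + √p = {np.sqrt(n) + np.sqrt(p):8.2f})")
    # Consequence for covariance estimation: ||Σ_n − I|| should be roughly 2√(p/n) + p/n.
    err = np.linalg.norm(A.T @ A / n - np.eye(p), 2)
    print(f"           ||Σ_n − I|| = {err:.3f}  vs  2√(p/n) + p/n = {2*np.sqrt(p/n) + p/n:.3f}")
```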


Random matrices with independent rows

However, many distributions of interest do not have
independent coordinates. Thus the random matrix A has
independent rows (the samples), but not independent entries in
each row.

Problem. Study the spectral properties of random matrices with
independent rows. When do such n × p matrices A produce almost
isometric embeddings?


High-dimensional distributions

Under appropriate moment assumptions on the distribution (of the
rows), are there results similar to Bai-Yin?

Definition. A distribution of X in R^p is subgaussian if all its
one-dimensional marginals are subgaussian random variables:

    P{ |⟨X, x⟩| ≥ t } ≤ 2 exp(−c t²).

Similarly we define subexponential distributions (with tails
2 exp(−c t)), distributions with finite moments, etc. We thus
always assess a distribution by its one-dimensional marginals.

Examples: The standard normal distribution and the uniform
distributions on the round ball and the cube of unit volume are
subgaussian. The uniform distribution on any convex body of unit
volume is subexponential (this follows from the Brunn-Minkowski
inequality; see Borell's lemma). Discrete distributions are usually
not even subexponential unless they are supported on
exponentially many points.

Random matrices with independent subgaussian rows

Proposition (Random matrices with subgaussian rows). Let A be
an n × p matrix whose rows X_k are independent subgaussian
isotropic random vectors in R^p. Then with high probability,

    √n − C√p ≤ s_min(A) ≤ s_max(A) ≤ √n + C√p.

As before, this yields that the sample covariance matrix
Σ_n = (1/n) AᵀA approximates the actual covariance matrix I:

    ‖Σ_n − I‖ ≤ C √(p/n) + C p/n.

Answer to the Estimation Problem for subgaussian
distributions is the same as for Gaussian ones: sample size
n(p) ∼ p suffices to estimate the covariance matrix by a
sample covariance matrix.

Random matrices with independent subgaussian rows

Proposition (Random matrices with subgaussian rows). Let A be
an n × p matrix whose rows X_k are independent subgaussian
isotropic random vectors in R^p. Then with high probability,

    √n − C√p ≤ s_min(A) ≤ s_max(A) ≤ √n + C√p.

Proof (ε-net argument). As we know, the conclusion is equivalent
to saying that (1/√n) A is an almost isometric embedding.
Equivalently, we need to show that ‖Ax‖_2² is close to its expected
value n for every unit vector x. But

    ‖Ax‖_2² = ∑_{k=1}^n ⟨X_k, x⟩²

is a sum of independent subexponential random variables.
Exponential deviation inequalities (Bernstein's) yield that
‖Ax‖_2² ≈ n with high probability. Conclude by taking a union bound
over x in some fixed ε-net of the sphere S^{p−1} and approximation.
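For readers who want the bookkeeping behind this step, here is a sketch of the standard Bernstein-plus-net computation, with generic absolute constants assumed rather than the exact ones from the talk:

```latex
% Fixed direction: for a unit vector x, the variables Z_k = <X_k, x>^2 - 1 are
% independent, centered and subexponential (squares of subgaussians), so
% Bernstein's inequality gives, for an absolute constant c > 0,
\[
  \mathbb{P}\Big\{ \Big| \tfrac{1}{n}\|Ax\|_2^2 - 1 \Big| > \varepsilon \Big\}
  \;\le\; 2\exp\!\big(-c\,\min(\varepsilon^2,\varepsilon)\,n\big).
\]
% Union bound: a (1/4)-net N of the sphere S^{p-1} can be chosen with |N| <= 9^p,
% so the probability that the bound fails for some x in N is at most
\[
  2\cdot 9^{\,p}\,\exp\!\big(-c\,\min(\varepsilon^2,\varepsilon)\,n\big)
  \;\le\; 2\exp\!\big(-c\,\min(\varepsilon^2,\varepsilon)\,n + 3p\big),
\]
% which is exponentially small once n >= C(\varepsilon) p. The approximation step
% then transfers the estimate from the net N to the whole sphere S^{p-1}.
```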

Beyond subgaussian

This argument fails for anything weaker than subgaussian
distributions: exponential deviation inequalities will fail.
Different ideas are needed to address the Estimation Problem
for distributions with heavier tails.
Boundedness assumption: we will assume throughout the rest
of this talk that the distribution is supported in a centered ball
of radius O(√p). Most of the (isotropic) distribution always
lies in that ball, as E‖X‖_2² = p.


Random matrices with heavy-tailed rows

Under no moment assumptions at all, we have:

Theorem (Random matrices with heavy tails). Let A be an n × p
matrix whose rows X_k are independent isotropic random vectors in
R^p. Then with high probability,

    √n − C√(p log p) ≤ s_min(A) ≤ s_max(A) ≤ √n + C√(p log p).

The log p is needed (uniform distribution on p orthogonal vectors).
As before, this yields that the sample covariance matrix
Σ_n = (1/n) AᵀA approximates the actual covariance matrix I:

    ‖Σ_n − I‖ ≤ C √(p log p / n)   for n ≥ p.

This result was proved by Rudelson '00 (Bourgain '99: log³ p).
The answer to the Estimation Problem for heavy-tailed
distributions requires a logarithmic oversampling: a sample
size n(p) ∼ p log p suffices to estimate the covariance matrix.
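A quick numerical illustration of the parenthetical example above (a sketch; the parameters are assumptions made for the demo): the uniform distribution on the p scaled coordinate vectors √p·e_1, ..., √p·e_p is isotropic, yet Σ_n stays far from I until every coordinate direction has been sampled, which by the coupon-collector effect takes about p log p samples.

```python
import numpy as np

rng = np.random.default_rng(2)

p = 200
for n in [p, int(p * np.log(p)), 4 * int(p * np.log(p))]:
    idx = rng.integers(0, p, size=n)           # which coordinate vector each sample hits
    X = np.zeros((n, p))
    X[np.arange(n), idx] = np.sqrt(p)          # rows are the samples √p · e_idx
    Sigma_n = X.T @ X / n                      # sample covariance matrix (diagonal here)
    err = np.abs(Sigma_n.diagonal() - 1).max() # equals ||Σ_n − I|| since Σ_n is diagonal
    missed = int(np.sum(Sigma_n.diagonal() == 0))
    print(f"n = {n:5d}: ||Σ_n − I|| = {err:.2f}, directions never sampled: {missed}")
```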

Random matrices with heavy-tailed rows

Theorem (Random matrices with heavy tails). Let A be an n × p
matrix whose rows X_k are independent isotropic random vectors in
R^p. Then with high probability,

    √n − C√(p log p) ≤ s_min(A) ≤ s_max(A) ≤ √n + C√(p log p).

Proof. There are now several ways to prove this result. The most
straightforward argument is the Ahlswede-Winter approach. It
directly addresses the Estimation Problem. The sample covariance
matrix

    Σ_n = (1/n) ∑_{k=1}^n X_k X_kᵀ

is a sum of independent random matrices X_k X_kᵀ. One can prove
and use versions of the classical deviation inequalities (Chernoff,
Hoeffding, Bernstein, Bennett, ...) for sums of random matrices.
The proofs are similar: exponentiate and estimate the m.g.f. using
trace inequalities (Golden-Thompson). See Tropp '10.

When is the logarithmic oversampling needed?

Problem (when is logarithmic oversampling needed?). Classify the
distributions in R^p for which the sample size n(p) ∼ p suffices to
estimate the covariance matrix by the sample covariance matrix.
What we know: for general distributions, logarithmic
oversampling is needed: n(p) ∼ p log p by Rudelson's
theorem. For subgaussian distributions, it is not needed: n(p) ∼ p.
It was recently shown that n(p) ∼ p for subexponential
distributions: Adamczak-Litvak-Pajor-Tomczak '09. This
includes uniform distributions on all convex bodies.
But there is still a big gap between the distributions that do
not require the logarithmic oversampling (convex bodies) and
those that do require it (very discrete ones).
How do we close this gap? We conjecture that for most
distributions, n(p) ∼ p. For example, this should hold under
any non-trivial moment assumptions:

The logarithmic oversampling is almost never needed?

Conjecture. Consider a distribution in R^p with bounded q-th
moment for some q > 2, i.e. E|⟨X, x⟩|^q ≤ C^q for all unit vectors x.
Then the sample size n ∼ p suffices for estimation of the
covariance matrix Σ by the sample covariance matrix Σ_n w.h.p.:
‖Σ_n − Σ‖ ≤ ε.
Recall that any isotropic distribution has a bounded second
moment. The conjecture says that a slightly higher moment
should suffice for estimation without logarithmic oversampling.


The logarithmic oversampling is almost never needed

Theorem (Covariance). Consider a distribution in R^p with bounded
q-th moment for some q > 4. Then the sample covariance matrix
Σ_n approximates the covariance matrix Σ: with high probability,

    ‖Σ_n − Σ‖ ≲ (log log p)² (p/n)^{1/2 − 2/q}.

As a consequence, the sample size n ∼ (log log p)^{C_q} p suffices
for covariance estimation: ‖Σ_n − Σ‖ ≤ ε.


Estimation of moments of marginals

Once we know Σ, we know the variances of all one-dimensional
marginals: ⟨Σx, x⟩ = ⟨E XXᵀ x, x⟩ = E⟨X, x⟩².
More generally, we can estimate the r-th moments of marginals:

Theorem (Marginals). Consider a random vector X in R^p with
bounded 4r-th moment. Take a sample of size n ∼ p if r ∈ [1, 2)
and n ∼ p^{r/2} if r ∈ (2, ∞). Then with high probability,

    sup_{x ∈ S^{p−1}} | (1/n) ∑_{k=1}^n |⟨X_k, x⟩|^r − E|⟨X, x⟩|^r | ≤ ε.

The sample size n has optimal order for all r.
For subexponential distributions, this result is due to
Adamczak-Litvak-Pajor-Tomczak '09. Without extra
moment assumptions (except the r-th), a logarithmic
oversampling is needed as before. The optimal sample size in
this case is n ∼ p^{r/2} log p, due to Guédon-Rudelson '07.
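An illustrative sketch of the quantity controlled by the Marginals Theorem. The supremum over the sphere is approximated here by a maximum over random unit directions (so this only probes, rather than certifies, the uniform bound), and the Gaussian distribution, the value of r, and the sample size are assumptions made for the demo:

```python
import math
import numpy as np

rng = np.random.default_rng(3)

p, r = 30, 3.0
n = 20 * int(p ** (r / 2))                     # sample size on the order of p^{r/2}, as in the regime r > 2

X = rng.standard_normal((n, p))                # isotropic Gaussian samples (illustrative choice)
dirs = rng.standard_normal((1000, p))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # random unit directions standing in for a net

emp_moments = (np.abs(X @ dirs.T) ** r).mean(axis=0)  # empirical r-th moments of the marginals <X, x>
true_moment = 2 ** (r / 2) * math.gamma((r + 1) / 2) / math.sqrt(math.pi)   # E|N(0,1)|^r

print("max deviation over the sampled directions:",
      float(np.max(np.abs(emp_moments - true_moment))))
```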

Norms of random operators

Corollary (Norms of random operators). Let A be an n × p matrix
whose rows X_k are independent random vectors in R^p with
bounded 4r-th moment, r > 2. Then with high probability,

    ‖A‖_{ℓ_2 → ℓ_r} ≲ n^{1/2} + p^{1/r}.

This result is also optimal. It is conjectured to hold for r = 2.
For subexponential distributions, this result is due to
Adamczak-Litvak-Pajor-Tomczak '09. Without extra
moment assumptions (except the r-th), a logarithmic
oversampling is needed as before.


Heuristics of the argument: structure of divergent series

Two new ingredients in the proofs of these results:
(1) structure of slowly divergent series;
(2) a new decoupling technique.
Consider a simpler problem: for a random vector with heavy
tails, we want to show that ‖Σ_n‖ = O(1):

    ‖Σ_n‖ = sup_{x ∈ S^{p−1}} (1/n) ∑_{k=1}^n ⟨X_k, x⟩² = O(1).

This is a stochastic process indexed by vectors x ∈ S^{p−1}.
For each fixed x, we have to control the sum of independent
random variables ∑_k ⟨X_k, x⟩². Unfortunately, because of the
heavy tails of these random variables, we can only control the
sum with a polynomial rather than exponential probability,
1 − n^{−O(1)}. This is too weak for uniform control over x in the
sphere S^{p−1}, where ε-nets have sizes exponential in p.

Sums of independent heavy-tailed random variables

This brings us to a basic question in probability theory: control
a sum of independent heavy-tailed random variables Z_k.
Here we follow a simple combinatorial approach. Suppose

    (1/n) ∑_{k=1}^n Z_k ≫ 1.

Try to locate some structure in the terms Z_k that is
responsible for the largeness of the sum.
Often one can find an ideal structure: a subset of very large
terms Z_k. Namely, suppose there is I ⊆ [n], |I| = n_0, such that

    Z_k ≥ 4 n/n_0   for k ∈ I.

(We can always locate an ideal structure losing a log n factor.)

Sums of independent heavy-tailed random variables

Ideal structure: a subset I, |I| = n_0, such that Z_k ≥ 4 n/n_0 for k ∈ I.

Advantage of the ideal structure: the probability that it exists
can be easily bounded, even if the Z_k have just a first moment,
say E Z_k = 1.
By independence, Markov's inequality and a union bound over I,

    P{ideal structure exists} ≤ (n choose n_0) (n_0 / 4n)^{n_0} ≤ e^{−2 n_0}.

We get an exponential probability despite the heavy tails.

Combinatorial approach for stochastic processes

Let us see how the combinatorial approach works for
controlling stochastic processes. Assume for some x ∈ S^{p−1}

    (1/n) ∑_{k=1}^n ⟨X_k, x⟩² ≫ 1.

Suppose we can locate an ideal structure responsible for this:
a subset I, |I| = n_0, such that ⟨X_k, x⟩² ≥ 4 n/n_0 for k ∈ I. As
we know,

    P{ideal structure exists} ≤ e^{−2 n_0}.

This is still not strong enough to take a union bound over all x
in some net of the sphere S^{p−1}, which has cardinality exponential in p.
Dimension reduction: by projecting x onto E = span(X_k)_{k ∈ I},
we can automatically assume that x ∈ E. This subspace has
dimension n_0. Its ε-net has cardinality e^{C n_0}, which is OK!
Unfortunately, x ∈ E becomes random, correlated with the X_k's.
Decoupling can make x depend on only half of the X_k's (random
selection à la Maurey). Condition on this half and finish the proof.

Combinatorial approach for stochastic processes

This argument yields the optimal Marginals Theorem (on
estimation of r-th moments of one-dimensional marginals).
Generally, in locating the ideal structure one loses a log p
factor. To lose just log log p, as in the Covariance Theorem,
one has to locate a structure that is weaker (thus harder to
find) than the ideal structure. This requires a structural
theorem for series that diverge slower than the iterated
logarithm.


Sparse estimation of covariance matrices

A variety of practical applications (genomics, pattern
recognition, etc.) require very small sample sizes compared
with the number of parameters, calling for

    n ≪ p.

In this regime, covariance estimation is generally impossible
(for dimension reasons). But usually (in practice) one knows a
priori some structure of the covariance matrix Σ.
For example, Σ is often known to be sparse, having few
non-zero entries (i.e., most random variables are uncorrelated).
Example:


Covariance graph

[Figure: Gene association network of E. coli. [J. Schäfer, K. Strimmer '05]]


Sparse estimation of covariance matrices

Sparse Estimation Problem. Consider a distribution in R^p whose
covariance matrix Σ has at most s ≤ p nonzero entries in each
column (equivalently, each component of the distribution is
correlated with at most s other components). Determine the
minimal sample size n = n(p, s) needed to estimate Σ with a fixed
error ε in the operator norm, and with high probability.
A variety of techniques has been proposed in statistics,
notably the shrinkage methods going back to Stein.


Sparse estimation of covariance matrices

The problem is nontrivial even for Gaussian distributions, and
even if we know the locations of the non-zero entries of Σ.
Let's assume this (otherwise, take the biggest entries of Σ_n).
Method: compute the sample covariance matrix Σ_n. Zero out
all entries that are a priori known to be zero. The resulting
sparse matrix M ∘ Σ_n should be a good estimator for Σ.
Zeroing out amounts to taking the Hadamard (entrywise) product
M ∘ Σ_n with a given sparse 0/1 matrix M (the mask).
Does this method work? Yes:


Sparse estimation of covariance matrices

Theorem (Sparse Estimation). [Levina-V. '10] Consider a centered
Gaussian distribution in R^p with covariance matrix Σ. Let M be a
symmetric p × p mask matrix with 0/1 entries and with at most
s nonzero entries in each column. Then

    E ‖M ∘ Σ_n − M ∘ Σ‖ ≤ C log³ p ( √(s/n) + s/n ) ‖Σ‖.

Compare this with the consequence of the Bai-Yin law:

    E ‖Σ_n − Σ‖ ≤ 2 ( √(p/n) + p/n ) ‖Σ‖.

This matches the Theorem in the non-sparse case s = p.
Note the mild, logarithmic dependence on the dimension p
and the optimal dependence on the sparsity s.
A logarithmic factor is needed for s = 1, when M = I.
As a consequence, sample size n ∼ s log⁶ p suffices for sparse
estimation. In the sparse case s ≪ p, we have n ≪ p.
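A minimal sketch of the masking method and the theorem's message, on a synthetic example (the banded covariance structure and all parameters are assumptions made for the demo, not from the talk): with n ≪ p the raw sample covariance matrix is useless, while the masked estimator M ∘ Σ_n is already accurate.

```python
import numpy as np

rng = np.random.default_rng(4)

p, s, n = 400, 5, 200                          # note n << p: far fewer samples than dimensions
# Illustrative sparse covariance: banded, with roughly 2s+1 nonzero entries per column.
Sigma = sum(0.5 ** abs(k) * np.eye(p, k=k) for k in range(-s, s + 1))
M = (Sigma != 0).astype(float)                 # mask: the a-priori-known sparsity pattern

L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((n, p)) @ L.T          # n Gaussian samples with covariance Sigma
Sigma_n = X.T @ X / n                          # full sample covariance matrix

err_full = np.linalg.norm(Sigma_n - Sigma, 2)          # unmasked error: large when n << p
err_masked = np.linalg.norm(M * Sigma_n - Sigma, 2)    # masked (Hadamard product) estimator
print(f"||Σ_n − Σ|| = {err_full:.2f},  ||M∘Σ_n − Σ|| = {err_masked:.2f},  ||Σ|| = {np.linalg.norm(Sigma, 2):.2f}")
```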

Sparse estimation of covariance matrices

More generally:

Theorem (Estimation of Hadamard Products). [Levina-V. '10]
Consider a centered Gaussian distribution on R^p with covariance
matrix Σ. Then for every symmetric p × p matrix M we have

    E ‖M ∘ Σ_n − M ∘ Σ‖ ≤ C log³ p ( ‖M‖_{1,2} / √n + ‖M‖ / n ) ‖Σ‖,

where ‖M‖_{1,2} = max_j (∑_i m_ij²)^{1/2} is the ℓ_1 → ℓ_2 operator norm.
This result is quite general. It applies to arbitrary Gaussian
distributions (no covariance structure assumed) and arbitrary
mask matrices M.


Complexity of matrix norm

The Sparse Estimation Theorem would follow by an ε-net
argument if the norm of a sparse matrix could be computed on
a small set.
As is well known, the operator norm of a p × p matrix A can
be computed on a (1/2)-net N of the unit sphere S^{p−1},

    ‖A‖ ≈ max_{x ∈ N} ‖Ax‖_2,

and one can construct such a net with cardinality |N| ≤ e^{cp}.
Can one reduce the size of N for sparse matrices?

Question (discretizing the norm of sparse matrices). Does there
exist a subset N of S^{p−1} such that, for every p × p matrix A with
at most s nonzero entries in each row and column, one has

    ‖A‖ ≈ max_{x ∈ N} ‖Ax‖_2,

and with cardinality |N| ≤ (Cp/s)^s ≈ p^{Cs}?



Gaussian Chaos

Since we don't know how to answer this question, the proof of
the estimation theorem takes a different route, through
estimating a Gaussian chaos.
A Gaussian chaos arises naturally when one tries to compute
the operator norm of a sample covariance matrix
Σ_n = (1/n) ∑_{k=1}^n X_k X_kᵀ:

    ‖Σ_n‖ = sup_{x ∈ S^{p−1}} ⟨Σ_n x, x⟩
          = sup_{x ∈ S^{p−1}} ∑_{i,j=1}^p Σ_n(i, j) x_i x_j
          = sup_{x ∈ S^{p−1}} (1/n) ∑_{k,i,j} X_{ki} X_{kj} x_i x_j,

where the X_{kj} are Gaussian random variables (the coordinates of
the sampled points from the Gaussian distribution).
Argument: (1) decoupling; (2) a combinatorial approach to the
estimation, classifying x according to the measure of its
sparsity, similar to [Schechtman '04] and many later papers.

References

Survey: M. Rudelson, R. Vershynin, Non-asymptotic theory of
random matrices: extreme singular values, 2010.
Marginal Estimation: R. Vershynin, Approximating the moments
of marginals of high-dimensional distributions, 2009.
Covariance Estimation: R. Vershynin, How close is the sample
covariance matrix to the actual covariance matrix? 2010.
Sparse Covariance Estimation: L. Levina, R. Vershynin, Sparse
estimation of covariance matrices, in progress, 2010.

