
Estimation of covariance matrices

Roman Vershynin
University of Michigan

Probability and Geometry in High Dimensions


Paris, May 2010


Covariance matrix
Basic problem in multivariate statistics:
by sampling from a high-dimensional distribution, determine
its covariance structure.
Principal Component Analysis (PCA): detect the principal
axes along which most dependence occurs:

[Figure: PCA of a multivariate Gaussian distribution. Source: Gaël Varoquaux's blog, gael-varoquaux.info]


Covariance matrix

The covariance structure of a high-dimensional distribution
is captured by its covariance matrix Σ.
Let X be a random vector in R^p with this distribution.
We may assume that X is centered (by estimating and
subtracting E X). The covariance matrix of X is defined as

    Σ = E XXᵀ = E X ⊗ X = (E X_i X_j)_{i,j=1}^p = (cov(X_i, X_j))_{i,j=1}^p.

Σ = Σ(X) is a symmetric, positive semi-definite p × p matrix.
It is a multivariate version of the variance Var(X).
If Σ(X) = I we say that X is isotropic. Every full-dimensional
random vector X can be made isotropic by the linear
transformation X ↦ Σ^{−1/2} X.


Estimation of covariance matrices

Estimation of covariance matrices is a basic problem in
multivariate statistics. It arises in signal processing, genomics,
financial mathematics, pattern recognition, and computational
convex geometry.
We take a sample of n independent points X_1, ..., X_n from
the distribution and form the sample covariance matrix

    Σ_n = (1/n) ∑_{k=1}^n X_k X_kᵀ.

Σ_n is a random matrix. Hopefully it approximates Σ well:


Estimation of covariance matrices


Covariance Estimation Problem. Determine the minimal sample
size n = n(p) that guarantees, with high probability (say, 0.99), that
the sample covariance matrix Σ_n estimates the actual covariance
matrix Σ with fixed accuracy (say, ε = 0.01) in the operator norm:

    ‖Σ_n − Σ‖ ≤ ε ‖Σ‖.

[Figure: PCA of a multivariate Gaussian distribution, as on the earlier slide.]
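As a quick numerical illustration of this problem (not part of the original slides; the distribution, dimension, and sample sizes below are assumptions made only for the demo), one can watch the operator-norm error of Σ_n shrink as the sample size n grows:

```python
import numpy as np

rng = np.random.default_rng(0)

p = 100
# Illustrative true covariance matrix: diagonal with a decaying spectrum.
Sigma = np.diag(np.linspace(1.0, 0.1, p))
L = np.linalg.cholesky(Sigma)

def relative_error(n):
    """Draw n samples X_k ~ N(0, Sigma), form Sigma_n, return ||Sigma_n - Sigma|| / ||Sigma||."""
    X = rng.standard_normal((n, p)) @ L.T      # rows are the samples X_1, ..., X_n
    Sigma_n = X.T @ X / n                      # sample covariance matrix (1/n) A^T A
    return np.linalg.norm(Sigma_n - Sigma, 2) / np.linalg.norm(Sigma, 2)

for n in [p, 5 * p, 25 * p, 100 * p]:
    print(f"n = {n:6d}: relative operator-norm error ≈ {relative_error(n):.3f}")
```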


Estimation problem and random matrices


The estimation problem can be stated as a problem about the
spectrum of random matrices.
Assume for simplicity that the distribution is isotropic, Σ = I.
Form our sample X_1, ..., X_n into an n × p random matrix A
with independent rows X_1ᵀ, ..., X_nᵀ.

Then the sample covariance matrix is

    Σ_n = (1/n) ∑_{k=1}^n X_k X_kᵀ = (1/n) AᵀA.


Estimation problem and random matrices

Σ_n = (1/n) AᵀA.
The desired estimation ‖Σ_n − I‖ ≤ ε is equivalent to saying
that (1/√n) A is an almost isometric embedding R^p → R^n:

    (1 − ε)√n ≤ ‖Ax‖_2 ≤ (1 + ε)√n   for all x ∈ S^{p−1}.

Equivalently, the singular values s_i(A) = eig(AᵀA)^{1/2} are all
close to each other and to √n:

    (1 − ε)√n ≤ s_min(A) ≤ s_max(A) ≤ (1 + ε)√n.

Question. What random matrices with independent rows are
almost isometric embeddings?

Random matrices with independent entries

Simplest example: Gaussian distributions.
A is an n × p random matrix with independent N(0, 1) entries.
Σ_n is then called a Wishart matrix.
Random matrix theory in the asymptotic regime n, p → ∞:

Theorem (Bai-Yin Law). When n, p → ∞, n/p → const, one has

    s_min(A) ≈ √n − √p,   s_max(A) ≈ √n + √p   a.s.


Random matrices with independent entries

Bai-Yin: s_min(A) ≈ √n − √p,   s_max(A) ≈ √n + √p.

Thus, making n slightly bigger than p, we force both extreme
singular values to be close to each other, and make A an almost
isometric embedding.
Formally, the sample covariance matrix Σ_n = (1/n) AᵀA nicely
approximates the actual covariance matrix I:

    ‖Σ_n − I‖ ≲ 2 √(p/n) + p/n.

Answer to the Estimation Problem for Gaussian distributions:
sample size n(p) ∼ p suffices to estimate the covariance matrix by
a sample covariance matrix.
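A minimal numerical check of the Bai-Yin prediction and of the resulting covariance bound (a sketch with illustrative parameters, not taken from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)

p = 200
for n in [2 * p, 10 * p, 50 * p]:
    A = rng.standard_normal((n, p))            # independent N(0, 1) entries
    s = np.linalg.svd(A, compute_uv=False)     # singular values, in decreasing order
    print(f"n = {n:5d}: s_min = {s[-1]:8.2f} (√n − √p = {np.sqrt(n) - np.sqrt(p):8.2f}),"
          f" s_max = {s[0]:8.2f} (√n + √p = {np.sqrt(n) + np.sqrt(p):8.2f})")
    # Consequence for covariance estimation: ||Σ_n − I|| should be roughly 2√(p/n) + p/n.
    err = np.linalg.norm(A.T @ A / n - np.eye(p), 2)
    print(f"           ||Σ_n − I|| = {err:.3f}  vs  2√(p/n) + p/n = {2*np.sqrt(p/n) + p/n:.3f}")
```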


Random matrices with independent rows

However, many distributions of interest do not have
independent coordinates. Thus the random matrix A has
independent rows (the samples), but not independent entries in
each row.

Problem. Study the spectral properties of random matrices with
independent rows. When do such n × p matrices A produce almost
isometric embeddings?


High-dimensional distributions

Under appropriate moment assumptions on the distribution (of the
rows), are there results similar to Bai-Yin?

Definition. A distribution of X in R^p is subgaussian if all its
one-dimensional marginals are subgaussian random variables:

    P{ |⟨X, x⟩| ≥ t } ≤ 2 exp(−c t²).

Similarly we define subexponential distributions (with tails
2 exp(−c t)), distributions with finite moments, etc. We thus
always assess a distribution by its one-dimensional marginals.

Examples: The standard normal distribution and the uniform
distributions on the round ball and the cube of unit volume are
subgaussian. The uniform distribution on any convex body of unit
volume is subexponential (this follows from the Brunn-Minkowski
inequality; see Borell's lemma). Discrete distributions are usually
not even subexponential unless they are supported on
exponentially many points.

Random matrices with independent subgaussian rows

Proposition (Random matrices with subgaussian rows). Let A be
an n × p matrix whose rows X_k are independent subgaussian
isotropic random vectors in R^p. Then with high probability,

    √n − C√p ≤ s_min(A) ≤ s_max(A) ≤ √n + C√p.

As before, this yields that the sample covariance matrix
Σ_n = (1/n) AᵀA approximates the actual covariance matrix I:

    ‖Σ_n − I‖ ≤ C √(p/n) + C p/n.

Answer to the Estimation Problem for subgaussian
distributions is the same as for Gaussian ones: sample size
n(p) ∼ p suffices to estimate the covariance matrix by a
sample covariance matrix.

Random matrices with independent subgaussian rows

Proposition (Random matrices with subgaussian rows). Let A be
an n × p matrix whose rows X_k are independent subgaussian
isotropic random vectors in R^p. Then with high probability,

    √n − C√p ≤ s_min(A) ≤ s_max(A) ≤ √n + C√p.

Proof (ε-net argument). As we know, the conclusion is equivalent
to saying that (1/√n) A is an almost isometric embedding.
Equivalently, we need to show that ‖Ax‖_2² is close to its expected
value n for every unit vector x. But

    ‖Ax‖_2² = ∑_{k=1}^n ⟨X_k, x⟩²

is a sum of independent subexponential random variables.
Exponential deviation inequalities (Bernstein's) yield that
‖Ax‖_2² ≈ n with high probability. Conclude by taking a union bound
over x in some fixed ε-net of the sphere S^{p−1} and approximation.
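For readers who want the bookkeeping behind this step, here is a sketch of the standard Bernstein-plus-net computation, with generic absolute constants assumed rather than the exact ones from the talk:

```latex
% Fixed direction: for a unit vector x, the variables Z_k = <X_k, x>^2 - 1 are
% independent, centered and subexponential (squares of subgaussians), so
% Bernstein's inequality gives, for an absolute constant c > 0,
\[
  \mathbb{P}\Big\{ \Big| \tfrac{1}{n}\|Ax\|_2^2 - 1 \Big| > \varepsilon \Big\}
  \;\le\; 2\exp\!\big(-c\,\min(\varepsilon^2,\varepsilon)\,n\big).
\]
% Union bound: a (1/4)-net N of the sphere S^{p-1} can be chosen with |N| <= 9^p,
% so the probability that the bound fails for some x in N is at most
\[
  2\cdot 9^{\,p}\,\exp\!\big(-c\,\min(\varepsilon^2,\varepsilon)\,n\big)
  \;\le\; 2\exp\!\big(-c\,\min(\varepsilon^2,\varepsilon)\,n + 3p\big),
\]
% which is exponentially small once n >= C(\varepsilon) p. The approximation step
% then transfers the estimate from the net N to the whole sphere S^{p-1}.
```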

Beyond subgaussian

This argument fails for anything weaker than subgaussian
distributions: exponential deviation inequalities will fail.
Different ideas are needed to address the Estimation Problem
for distributions with heavier tails.
Boundedness assumption: we will assume throughout the rest
of this talk that the distribution is supported in a centered ball
of radius O(√p). Most of the (isotropic) distribution always
lies in that ball, as E‖X‖_2² = p.


Random matrices with heavy-tailed rows

Under no moment assumptions at all, we have:

Theorem (Random matrices with heavy tails). Let A be an n × p
matrix whose rows X_k are independent isotropic random vectors in
R^p. Then with high probability,

    √n − C√(p log p) ≤ s_min(A) ≤ s_max(A) ≤ √n + C√(p log p).

The log p is needed (uniform distribution on p orthogonal vectors).
As before, this yields that the sample covariance matrix
Σ_n = (1/n) AᵀA approximates the actual covariance matrix I:

    ‖Σ_n − I‖ ≤ C √(p log p / n)   for n ≥ p.

This result was proved by Rudelson '00 (Bourgain '99: log³ p).
The answer to the Estimation Problem for heavy-tailed
distributions requires a logarithmic oversampling: a sample
size n(p) ∼ p log p suffices to estimate the covariance matrix.
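A quick numerical illustration of the parenthetical example above (a sketch; the parameters are assumptions made for the demo): the uniform distribution on the p scaled coordinate vectors √p·e_1, ..., √p·e_p is isotropic, yet Σ_n stays far from I until every coordinate direction has been sampled, which by the coupon-collector effect takes about p log p samples.

```python
import numpy as np

rng = np.random.default_rng(2)

p = 200
for n in [p, int(p * np.log(p)), 4 * int(p * np.log(p))]:
    idx = rng.integers(0, p, size=n)           # which coordinate vector each sample hits
    X = np.zeros((n, p))
    X[np.arange(n), idx] = np.sqrt(p)          # rows are the samples √p · e_idx
    Sigma_n = X.T @ X / n                      # sample covariance matrix (diagonal here)
    err = np.abs(Sigma_n.diagonal() - 1).max() # equals ||Σ_n − I|| since Σ_n is diagonal
    missed = int(np.sum(Sigma_n.diagonal() == 0))
    print(f"n = {n:5d}: ||Σ_n − I|| = {err:.2f}, directions never sampled: {missed}")
```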

Random matrices with heavy-tailed rows

Theorem (Random matrices with heavy tails). Let A be an n × p
matrix whose rows X_k are independent isotropic random vectors in
R^p. Then with high probability,

    √n − C√(p log p) ≤ s_min(A) ≤ s_max(A) ≤ √n + C√(p log p).

Proof. There are now several ways to prove this result. The most
straightforward argument is the Ahlswede-Winter approach. It
directly addresses the Estimation Problem. The sample covariance
matrix

    Σ_n = (1/n) ∑_{k=1}^n X_k X_kᵀ

is a sum of independent random matrices X_k X_kᵀ. One can prove
and use versions of the classical deviation inequalities (Chernoff,
Hoeffding, Bernstein, Bennett, ...) for sums of random matrices.
The proofs are similar: exponentiate and estimate the m.g.f. using
trace inequalities (Golden-Thompson). See Tropp '10.

When is the logarithmic oversampling needed?

Problem (when is logarithmic oversampling needed?). Classify the
distributions in R^p for which the sample size n(p) ∼ p suffices to
estimate the covariance matrix by the sample covariance matrix.
What we know: for general distributions, logarithmic
oversampling is needed: n(p) ∼ p log p by Rudelson's
theorem. For subgaussian distributions, it is not needed: n(p) ∼ p.
It was recently shown that n(p) ∼ p for subexponential
distributions: Adamczak-Litvak-Pajor-Tomczak '09. This
includes uniform distributions on all convex bodies.
But there is still a big gap between the distributions that do
not require the logarithmic oversampling (convex bodies) and
those that do require it (very discrete ones).
How do we close this gap? We conjecture that for most
distributions, n(p) ∼ p. For example, this should hold under
any non-trivial moment assumptions:

The logarithmic oversampling is almost never needed?

Conjecture. Consider a distribution in R^p with bounded q-th
moment for some q > 2, i.e. E|⟨X, x⟩|^q ≤ C^q for all unit vectors x.
Then the sample size n ∼ p suffices for estimation of the
covariance matrix Σ by the sample covariance matrix Σ_n w.h.p.:
‖Σ_n − Σ‖ ≤ ε.
Recall that any isotropic distribution has a bounded second
moment. The conjecture says that a slightly higher moment
should suffice for estimation without logarithmic oversampling.


The logarithmic oversampling is almost never needed

Theorem (Covariance). Consider a distribution in R^p with bounded
q-th moment for some q > 4. Then the sample covariance matrix
Σ_n approximates the covariance matrix Σ: with high probability,

    ‖Σ_n − Σ‖ ≲ (log log p)² (p/n)^{1/2 − 2/q}.

As a consequence, the sample size n ∼ (log log p)^{C_q} p suffices
for covariance estimation: ‖Σ_n − Σ‖ ≤ ε.


Estimation of moments of marginals

Once we know Σ, we know the variances of all one-dimensional
marginals: ⟨Σx, x⟩ = ⟨E XXᵀ x, x⟩ = E⟨X, x⟩².
More generally, we can estimate the r-th moments of marginals:

Theorem (Marginals). Consider a random vector X in R^p with
bounded 4r-th moment. Take a sample of size n ∼ p if r ∈ [1, 2)
and n ∼ p^{r/2} if r ∈ (2, ∞). Then with high probability,

    sup_{x ∈ S^{p−1}} | (1/n) ∑_{k=1}^n |⟨X_k, x⟩|^r − E|⟨X, x⟩|^r | ≤ ε.

The sample size n has optimal order for all r.
For subexponential distributions, this result is due to
Adamczak-Litvak-Pajor-Tomczak '09. Without extra
moment assumptions (except the r-th), a logarithmic
oversampling is needed as before. The optimal sample size in
this case is n ∼ p^{r/2} log p, due to Guédon-Rudelson '07.
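An illustrative sketch of the quantity controlled by the Marginals Theorem. The supremum over the sphere is approximated here by a maximum over random unit directions (so this only probes, rather than certifies, the uniform bound), and the Gaussian distribution, the value of r, and the sample size are assumptions made for the demo:

```python
import math
import numpy as np

rng = np.random.default_rng(3)

p, r = 30, 3.0
n = 20 * int(p ** (r / 2))                     # sample size on the order of p^{r/2}, as in the regime r > 2

X = rng.standard_normal((n, p))                # isotropic Gaussian samples (illustrative choice)
dirs = rng.standard_normal((1000, p))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # random unit directions standing in for a net

emp_moments = (np.abs(X @ dirs.T) ** r).mean(axis=0)  # empirical r-th moments of the marginals <X, x>
true_moment = 2 ** (r / 2) * math.gamma((r + 1) / 2) / math.sqrt(math.pi)   # E|N(0,1)|^r

print("max deviation over the sampled directions:",
      float(np.max(np.abs(emp_moments - true_moment))))
```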

Norms of random operators

Corollary (Norms of random operators). Let A be an n × p matrix
whose rows X_k are independent random vectors in R^p with
bounded 4r-th moment, r > 2. Then with high probability,

    ‖A‖_{ℓ_2 → ℓ_r} ≲ n^{1/2} + p^{1/r}.

This result is also optimal. It is conjectured to hold for r = 2.
For subexponential distributions, this result is due to
Adamczak-Litvak-Pajor-Tomczak '09. Without extra
moment assumptions (except the r-th), a logarithmic
oversampling is needed as before.


Heuristics of the argument: structure of divergent series

Two new ingredients in the proofs of these results:
(1) structure of slowly divergent series;
(2) a new decoupling technique.
Consider a simpler problem: for a random vector with heavy
tails, we want to show that ‖Σ_n‖ = O(1):

    ‖Σ_n‖ = sup_{x ∈ S^{p−1}} (1/n) ∑_{k=1}^n ⟨X_k, x⟩² = O(1).

This is a stochastic process indexed by vectors x ∈ S^{p−1}.
For each fixed x, we have to control the sum of independent
random variables ∑_k ⟨X_k, x⟩². Unfortunately, because of the
heavy tails of these random variables, we can only control the
sum with a polynomial rather than exponential probability,
1 − n^{−O(1)}. This is too weak for uniform control over x in the
sphere S^{p−1}, where ε-nets have sizes exponential in p.

Sums of independent heavy-tailed random variables

This brings us to a basic question in probability theory: control
a sum of independent heavy-tailed random variables Z_k.
Here we follow a simple combinatorial approach. Suppose

    (1/n) ∑_{k=1}^n Z_k ≫ 1.

Try to locate some structure in the terms Z_k that is
responsible for the largeness of the sum.
Often one can find an ideal structure: a subset of very large
terms Z_k. Namely, suppose there is I ⊆ [n], |I| = n_0, such that

    Z_k ≥ 4 n/n_0   for k ∈ I.

(We can always locate an ideal structure losing a log n factor.)

Sums of independent heavy-tailed random variables

Ideal structure: a subset I, |I| = n_0, such that Z_k ≥ 4 n/n_0 for k ∈ I.

Advantage of the ideal structure: the probability that it exists
can be easily bounded, even if the Z_k have just a first moment,
say E Z_k = 1.
By independence, Markov's inequality and a union bound over I,

    P{ideal structure exists} ≤ (n choose n_0) (n_0 / 4n)^{n_0} ≤ e^{−2 n_0}.

We get an exponential probability despite the heavy tails.

Combinatorial approach for stochastic processes

Let us see how the combinatorial approach works for
controlling stochastic processes. Assume for some x ∈ S^{p−1}

    (1/n) ∑_{k=1}^n ⟨X_k, x⟩² ≫ 1.

Suppose we can locate an ideal structure responsible for this:
a subset I, |I| = n_0, such that ⟨X_k, x⟩² ≥ 4 n/n_0 for k ∈ I. As
we know,

    P{ideal structure exists} ≤ e^{−2 n_0}.

This is still not strong enough to take a union bound over all x
in some net of the sphere S^{p−1}, which has cardinality exponential in p.
Dimension reduction: by projecting x onto E = span(X_k)_{k ∈ I},
we can automatically assume that x ∈ E. This subspace has
dimension n_0. Its ε-net has cardinality e^{C n_0}, which is OK!
Unfortunately, x ∈ E becomes random, correlated with the X_k's.
Decoupling can make x depend on only half of the X_k's (random
selection à la Maurey). Condition on this half and finish the proof.

Combinatorial approach for stochastic processes

This argument yields the optimal Marginals Theorem (on
estimation of r-th moments of one-dimensional marginals).
Generally, in locating the ideal structure one loses a log p
factor. To lose just log log p, as in the Covariance Theorem,
one has to locate a structure that is weaker (thus harder to
find) than the ideal structure. This requires a structural
theorem for series that diverge slower than the iterated
logarithm.


Sparse estimation of covariance matrices

A variety of practical applications (genomics, pattern
recognition, etc.) require very small sample sizes compared
with the number of parameters, calling for

    n ≪ p.

In this regime, covariance estimation is generally impossible
(for dimension reasons). But usually (in practice) one knows a
priori some structure of the covariance matrix Σ.
For example, Σ is often known to be sparse, having few
non-zero entries (i.e., most random variables are uncorrelated).
Example:


Covariance graph

[Figure: Gene association network of E. coli. [J. Schäfer, K. Strimmer '05]]


Sparse estimation of covariance matrices

Sparse Estimation Problem. Consider a distribution in R^p whose
covariance matrix Σ has at most s ≤ p nonzero entries in each
column (equivalently, each component of the distribution is
correlated with at most s other components). Determine the
minimal sample size n = n(p, s) needed to estimate Σ with a fixed
error ε in the operator norm, and with high probability.
A variety of techniques has been proposed in statistics,
notably the shrinkage methods going back to Stein.


Sparse estimation of covariance matrices

The problem is nontrivial even for Gaussian distributions, and
even if we know the locations of the non-zero entries of Σ.
Let's assume this (otherwise, take the biggest entries of Σ_n).
Method: compute the sample covariance matrix Σ_n. Zero out
all entries that are a priori known to be zero. The resulting
sparse matrix M ∘ Σ_n should be a good estimator for Σ.
Zeroing out amounts to taking the Hadamard (entrywise) product
M ∘ Σ_n with a given sparse 0/1 matrix M (the mask).
Does this method work? Yes:


Sparse estimation of covariance matrices

Theorem (Sparse Estimation). [Levina-V. '10] Consider a centered
Gaussian distribution in R^p with covariance matrix Σ. Let M be a
symmetric p × p mask matrix with 0/1 entries and with at most
s nonzero entries in each column. Then

    E ‖M ∘ Σ_n − M ∘ Σ‖ ≤ C log³ p ( √(s/n) + s/n ) ‖Σ‖.

Compare this with the consequence of the Bai-Yin law:

    E ‖Σ_n − Σ‖ ≤ 2 ( √(p/n) + p/n ) ‖Σ‖.

This matches the Theorem in the non-sparse case s = p.
Note the mild, logarithmic dependence on the dimension p
and the optimal dependence on the sparsity s.
A logarithmic factor is needed for s = 1, when M = I.
As a consequence, sample size n ∼ s log⁶ p suffices for sparse
estimation. In the sparse case s ≪ p, we have n ≪ p.
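A minimal sketch of the masking method and the theorem's message, on a synthetic example (the banded covariance structure and all parameters are assumptions made for the demo, not from the talk): with n ≪ p the raw sample covariance matrix is useless, while the masked estimator M ∘ Σ_n is already accurate.

```python
import numpy as np

rng = np.random.default_rng(4)

p, s, n = 400, 5, 200                          # note n << p: far fewer samples than dimensions
# Illustrative sparse covariance: banded, with roughly 2s+1 nonzero entries per column.
Sigma = sum(0.5 ** abs(k) * np.eye(p, k=k) for k in range(-s, s + 1))
M = (Sigma != 0).astype(float)                 # mask: the a-priori-known sparsity pattern

L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((n, p)) @ L.T          # n Gaussian samples with covariance Sigma
Sigma_n = X.T @ X / n                          # full sample covariance matrix

err_full = np.linalg.norm(Sigma_n - Sigma, 2)          # unmasked error: large when n << p
err_masked = np.linalg.norm(M * Sigma_n - Sigma, 2)    # masked (Hadamard product) estimator
print(f"||Σ_n − Σ|| = {err_full:.2f},  ||M∘Σ_n − Σ|| = {err_masked:.2f},  ||Σ|| = {np.linalg.norm(Sigma, 2):.2f}")
```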

Sparse estimation of covariance matrices

More generally:

Theorem (Estimation of Hadamard Products). [Levina-V. '10]
Consider a centered Gaussian distribution on R^p with covariance
matrix Σ. Then for every symmetric p × p matrix M we have

    E ‖M ∘ Σ_n − M ∘ Σ‖ ≤ C log³ p ( ‖M‖_{1,2} / √n + ‖M‖ / n ) ‖Σ‖,

where ‖M‖_{1,2} = max_j (∑_i m_ij²)^{1/2} is the ℓ_1 → ℓ_2 operator norm.
This result is quite general. It applies to arbitrary Gaussian
distributions (no covariance structure assumed) and arbitrary
mask matrices M.


Complexity of matrix norm

The Sparse Estimation Theorem would follow by an ε-net
argument if the norm of a sparse matrix could be computed on
a small set.
As is well known, the operator norm of a p × p matrix A can
be computed on a (1/2)-net N of the unit sphere S^{p−1},

    ‖A‖ ≈ max_{x ∈ N} ‖Ax‖_2,

and one can construct such a net with cardinality |N| ≤ e^{cp}.
Can one reduce the size of N for sparse matrices?

Question (discretizing the norm of sparse matrices). Does there
exist a subset N of S^{p−1} such that, for every p × p matrix A with
at most s nonzero entries in each row and column, one has

    ‖A‖ ≈ max_{x ∈ N} ‖Ax‖_2,

and with cardinality |N| ≤ (Cp/s)^s ≈ p^{Cs}?



Gaussian Chaos

Since we don't know how to answer this question, the proof of
the estimation theorem takes a different route, through
estimating a Gaussian chaos.
A Gaussian chaos arises naturally when one tries to compute
the operator norm of a sample covariance matrix
Σ_n = (1/n) ∑_{k=1}^n X_k X_kᵀ:

    ‖Σ_n‖ = sup_{x ∈ S^{p−1}} ⟨Σ_n x, x⟩
          = sup_{x ∈ S^{p−1}} ∑_{i,j=1}^p Σ_n(i, j) x_i x_j
          = sup_{x ∈ S^{p−1}} (1/n) ∑_{k,i,j} X_{ki} X_{kj} x_i x_j,

where the X_{kj} are Gaussian random variables (the coordinates of
the sampled points from the Gaussian distribution).
Argument: (1) decoupling; (2) a combinatorial approach to the
estimation, classifying x according to the measure of its
sparsity, similar to [Schechtman '04] and many later papers.

References

Survey: M. Rudelson, R. Vershynin, Non-asymptotic theory of
random matrices: extreme singular values, 2010.
Marginal Estimation: R. Vershynin, Approximating the moments
of marginals of high-dimensional distributions, 2009.
Covariance Estimation: R. Vershynin, How close is the sample
covariance matrix to the actual covariance matrix? 2010.
Sparse Covariance Estimation: L. Levina, R. Vershynin, Sparse
estimation of covariance matrices, in progress, 2010.

