

J. Japan Statist. Soc.
Vol. 37 No. 1 2007 53–86

MULTIVARIATE THEORY FOR ANALYZING HIGH


DIMENSIONAL DATA

M. S. Srivastava*

In this article, we develop a multivariate theory for analyzing multivariate


datasets that have fewer observations than dimensions. More specifically, we con-
sider the problem of testing the hypothesis that the mean vector µ of a p-dimensional
random vector x is a zero vector where N , the number of independent observations
on x , is less than the dimension p. It is assumed that x is normally distributed
with mean vector µ and unknown nonsingular covariance matrix Σ. We propose the
test statistic $F^+ = n^{-2}(p-n+1)N\bar{x}'S^+\bar{x}$, where $n = N - 1 < p$, $\bar{x}$ and $S$ are
the sample mean vector and the sample covariance matrix respectively, and S + is
the Moore-Penrose inverse of S. It is shown that a suitably normalized version of
the F + statistic is asymptotically normally distributed under the hypothesis. The
asymptotic non-null distribution in one sample case is given. The case when the co-
variance matrix Σ is singular of rank r but the sample size N is larger than r is also
considered. The corresponding results for the case of two-samples and k samples,
known as MANOVA, are given.

Key words and phrases: Distribution of test statistics, DNA microarray data, fewer
observations than dimension, multivariate analysis of variance, singular Wishart.

1. Introduction
In DNA microarray data, gene expressions are available on thousands of
genes of an individual but there are only few individuals in the dataset. For
example, in the data analyzed by Ibrahim et al. (2002), gene expressions were
available on only 14 individuals, of which 4 were normal tissues and 10 were
endometrial cancer tissues. But even after excluding many genes, the dataset
consisted of observations on 3214 genes. Thus, the observation matrix is a 3214×
14 matrix. Although these genes are correlated, the correlations are ignored in the analysis.
The empirical Bayes analysis of Efron et al. (2001) also ignores the correlations
among the genes in the analysis of their data. Similarly, in the comparison of
discrimination methods for the classification of tumors using gene expression
data, Dudoit et al. (2002) considered in their Leukemia example 6817 human
genes on 72 subjects from two groups, 38 from the learning set and 34 from the
test set. Although it gives rise to a 6817×6817 correlation matrix, they examined
a 72 × 72 correlation matrix. Most of the methods used in the above analyses
ignore the correlations among the genes.
The above cited papers use some criteria for reducing the dimension of the
data. For example, Dudoit et al. (2002) reduce the dimension of their Leukemia
data from 3571 to 40 before applying their methods of classification to the data.

Received July 15, 2005. Revised October 21, 2005. Accepted January 25, 2006.
*Department of Statistics, University of Toronto, 100 St. George Street, Toronto, Ontario, Canada M5S 3G3. Email: [email protected]

Methods for reducing the dimension in microarray data have also
been given by Alter et al. (2000) in their singular value decomposition methods.
Similarly, Eisen et al. (1998) have given a method of clustering the genes using a
measure similar to the Pearson correlation coefficient. Dimension reduction has
also been the objective of Efron et al. (2001) and Ibrahim et al. (2002). Such
reduction of the dimension is important in microarray data analysis because
even though there are observations on thousands of genes, there are relatively
few genes that reflect the differences between the two groups of data or several
groups of data. Thus, in analyzing any microarray data, the first step should be
to reduce the dimension. The theory developed in this paper provides a method
for reducing the dimension.
The sample covariance matrix for the microarray datasets on N individuals
with p genes in which p is larger than N is a singular matrix of rank n = N − 1.
For example if xik denotes the gene expression of the i-th gene on the k-th
individual, i = 1, . . . , p, k = 1, . . . , N , then the sample covariance matrix is of
the order p × p given by
$S = (s_{ij})$, where
$$n s_{ij} = \sum_{k=1}^{N} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j),$$
and $\bar{x}_i$ and $\bar{x}_j$ are the sample means of the i-th and j-th genes' expressions,
$\bar{x}_i = N^{-1}\sum_{k=1}^{N} x_{ik}$, $\bar{x}_j = N^{-1}\sum_{k=1}^{N} x_{jk}$. The corresponding p × p population
covariance matrix Σ will be assumed to be positive-definite. To understand the
difficulty in analyzing such a dataset, let us divide the p genes arbitrarily into two
groups, one consisting of n = N −1 genes and the other consisting of q = p−N +1
genes. We write the sample covariance matrix for the two partitioned groups as
 
$$S = \begin{pmatrix} S_{11} & S_{12} \\ S_{12}' & S_{22} \end{pmatrix}, \qquad S_{22} = (s_{22ij}),$$
where $S_{12} = (s_{121}, \ldots, s_{12q}) : n \times q$, $q = p - n$, $n = N - 1$. Then the sample multiple correlation between the i-th gene in the second group and the n genes in the first group is given by
$$s_{22ii}^{-1}\, s_{12i}' S_{11}^{-1} s_{12i}.$$

However, since the matrix S is of rank n, it follows from Srivastava and Khatri
(1979, p. 11) that
$$S_{22} = S_{12}' S_{11}^{-1} S_{12},$$
giving $s_{22ii} = s_{12i}' S_{11}^{-1} s_{12i}$. Thus, the sample multiple correlation between any
gene i in the second set with the first set is always equal to one for all i =
1, . . . , p − n. Similarly, all the n sample canonical correlations between the two
sets are one for n < (p − n). It thus appears that any inference procedure that
takes into account the correlations among the genes may be based on n genes or

their linear combinations such as n principal components. On the other hand,


since the test procedure discussed in this article is based on the Moore-Penrose
inverse of the singular sample covariance matrix $S = n^{-1}YY'$, which involves
the non-zero eigenvalues of $YY'$, or equivalently of $Y'Y$, $Y : p \times n$, it would be
desirable to have a larger p than dictated by the above consideration.
Another situation that often arises in practice is the case when n ≈ p. For
in this case, even if n > p and even though theoretically the covariance matrix S
is positive definite, the smallest eigenvalues are very small, as demonstrated by
Johnstone (2001) for p = n = 10. Since this could also happen if the covariance
matrix is singular of rank r ≤ p, we shall assume that the covariance matrix is
singular of rank r ≤ n. In the analysis of microarrays data, this situation may
arise since the selected number of characteristics could be close to n.
The objective of this article is to develop a multivariate theory for analyz-
ing high-dimensional data. Specifically, we consider the problem of testing the
hypothesis concerning the mean vector in one-sample, equality of the two mean
vectors in the two-sample case, and the equality of several mean vectors—the
so called MANOVA problem in Sections 2, 3 and 4 respectively. The null dis-
tributions of these statistics are also given in these sections. In the one-sample
case, the non-null distribution of the test statistic is also given in Section 2.
The confidence intervals for linear combinations of the mean vectors or for linear
combinations of contrasts in one, two, and more than two samples are given in
Section 5.
Tests for verifying the assumption of the equality of covariances are given in
Section 6. In Section 7, we give a general procedure for reducing the dimension
of the data. An example from Alon et al. (1999) on colon tissue, where the gene
expressions are taken from normal and cancerous tissues, is discussed in Section 8.
The paper concludes in Section 9.

2. One-sample problem
Let p-dimensional random vectors x1 , . . . , xN be independent and identically
distributed (hereafter referred to as iid) as normal with mean vectors µ and
unknown nonsingular covariance matrix Σ. Such a distribution will be denoted
by x ∼ Np (µ, Σ). We shall assume that N ≤ p and consider the problem of
testing the hypothesis H : µ = 0 against the alternative A : µ ≠ 0. The sample
mean vector and the sample covariance matrix are respectively defined by

(2.1)  $\bar{x} = N^{-1}\sum_{i=1}^{N} x_i$, and $S = n^{-1}V$,

where

(2.2)  $V = \sum_{i=1}^{N}(x_i - \bar{x})(x_i - \bar{x})'$, $n = N - 1$.

Let $B^+$ denote the Moore-Penrose inverse of the m × r matrix B satisfying the four conditions (i) $BB^+B = B$, (ii) $B^+BB^+ = B^+$, (iii) $(B^+B)' = B^+B$, (iv) $(BB^+)' = BB^+$. The Moore-Penrose inverse is unique and is equal to the inverse of the nonsingular m × m square matrix B. We propose the following test statistics for testing the hypothesis that µ = 0 vs µ ≠ 0. Define for n < p,

(2.3)  $T^{+2} = N\bar{x}'S^+\bar{x} = nN\bar{x}'V^+\bar{x}$.

Let

(2.4)  $F^+ = \dfrac{p-n+1}{n}\,\dfrac{T^{+2}}{n}$.

Thus, for n < p, we propose the statistic $F^+$, or equivalently the $T^{+2}$ statistic, for testing the hypothesis that µ = 0 against the alternative that µ ≠ 0. It is shown in Subsection 2.1 that $F^+$ is invariant under the linear transformation $x_i \to c\Gamma x_i$, $c \neq 0$, $\Gamma\Gamma' = I_p$. Let

(2.5)  $\hat{b} = \dfrac{(n-1)(n+2)}{n^2}\,\dfrac{(\operatorname{tr} S/p)^2}{p^{-1}\left[\operatorname{tr} S^2 - \frac{1}{n}(\operatorname{tr} S)^2\right]}$.
An asymptotic distribution of the $F^+$ statistic, given later in Theorem 2.3, is given by

(2.6)  $\lim_{n,p\to\infty} P\left\{\left(\dfrac{n}{2}\right)^{1/2}\left(\dfrac{p}{p-n+1}\,\hat{b}F^+ - 1\right) \le z_{1-\alpha}\right\} = \Phi(z_{1-\alpha}),$

where Φ denotes the cumulative distribution function (cdf) of a standard normal random variable with mean 0 and variance 1. In most practical cases $n = O(p^\delta)$, $0 < \delta < 1$, and in this case (2.6) simplifies to

(2.7)  $\lim_{n,p\to\infty} P\left\{c_{p,n}\left(\dfrac{n}{2}\right)^{1/2}(\hat{b}F^+ - 1) < z_{1-\alpha}\right\} = \Phi(z_{1-\alpha}),$

where we choose
$$c_{p,n} = [(p - n + 1)/(p + 1)]^{1/2}$$
for fast convergence to normality; see Corollary 2.1.
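The following is a minimal numerical sketch of the test just described, computing $T^{+2}$ and $F^+$ of (2.3)–(2.4), the estimator $\hat{b}$ of (2.5), and the normalized statistic of (2.7); it is illustrative only, and the function and variable names are our own, not the paper's.

```python
# Illustrative sketch of the one-sample F+ test of (2.3)-(2.5) and the
# normal approximation (2.7); not the author's code.
import numpy as np
from scipy.stats import norm

def one_sample_f_plus(X, alpha=0.05):
    """X: N x p data matrix with N <= p observations (rows) on p variables."""
    N, p = X.shape
    n = N - 1
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar) / n            # singular sample covariance, rank n
    # T+^2 = N xbar' S+ xbar, using the Moore-Penrose inverse of S
    T2_plus = N * xbar @ np.linalg.pinv(S) @ xbar
    F_plus = (p - n + 1) / n * T2_plus / n                       # (2.4)
    # b-hat of (2.5), equivalently a1hat^2 / a2hat of (2.19)-(2.21)
    trS, trS2 = np.trace(S), np.trace(S @ S)
    a1_hat = trS / p
    a2_hat = n**2 / ((n - 1) * (n + 2)) * (trS2 - trS**2 / n) / p
    b_hat = a1_hat**2 / a2_hat
    # normalized statistic of (2.7) with the convergence factor c_{p,n}
    c_pn = np.sqrt((p - n + 1) / (p + 1))
    z = c_pn * np.sqrt(n / 2) * (b_hat * F_plus - 1)
    return F_plus, z, z > norm.ppf(1 - alpha)    # reject H when z exceeds z_{1-alpha}
```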
When
$$\mu = \left(\frac{1}{nN}\right)^{1/2}\delta,$$
where δ is a vector of constants, the asymptotic power of the $F^+$ test, as given in Subsection 2.3, is
$$\beta(F^+) = \lim_{n,p\to\infty} P\left\{\frac{n\hat{b}p(p-n+1)^{-1}F^+ - n}{(2n)^{1/2}} > z_{1-\alpha} \;\Big|\; \mu\right\} \simeq \Phi\left(-z_{1-\alpha} + \frac{n}{p}\left(\frac{n}{2}\right)^{1/2}\frac{\mu'\wedge\mu}{a_2}\right),$$

where ∧ = diag(λ1 , . . . , λp ) and λi are the eigenvalues of the covariance matrix.


A competitive test proposed by Dempster (1958, 1960) is given by
$$T_D = \frac{N\bar{x}'\bar{x}}{\operatorname{tr} S}.$$
The asymptotic power of the $T_D$ test, as given by Bai and Saranadasa (1996), is
$$\beta(T_D) \simeq \Phi\left(-z_{1-\alpha} + \frac{n\mu'\mu}{\sqrt{2pa_2}}\right).$$
Thus, when $\Sigma = \gamma^2 I$,
$$\beta(T_D) = \Phi\left(-z_{1-\alpha} + \frac{n}{\sqrt{2p}}\,\frac{\mu'\mu}{\gamma^2}\right),$$
and
$$\beta(F^+) = \Phi\left(-z_{1-\alpha} + \left(\frac{n}{p}\right)^{1/2}\frac{n}{\sqrt{2p}}\,\frac{\mu'\mu}{\gamma^2}\right).$$

Thus, in this case, Dempster’s test is superior to the F + test, unless (n/p) → 1.
It is as expected since Dempster’s test is uniformly most powerful among all tests
whose power depends on (µ µ/γ 2 ), see Simaika (1941). This test is also invariant
under the linear transformations xi → cΓxi , c = 0, ΓΓ = Ip . In other cases, F +
may be preferred if
 1/2
n
(2.8) µ ∧ µ > µ µ;
pa2

For example, if µ  Np (0, ∧), then on the average (2.8) implies that
 1/2
n
(tr ∧2 ) > tr ∧,
pa2
that is
a21
n> p = bp.
a2
1/2
Since a21 < a2 , such an n exists. Similarly, if µ = λi , the same inequality is
obtained and F + will have better power than the Dempster test.
Next, we compare the power of TD and F + tests by simulation where the
+
F statistic in (2.7) is used. From the asymptotic expressions of the power
given above and in (2.29), it is clear that asymptotically, the tests TD and TBS
(defined in equation (2.26)) have the same power. Thus, in our comparison
of power, we include only Dempster's test $T_D$, as it is also known that if $\Sigma = \sigma^2 I_p$, then the test $T_D$ is the best invariant test under the transformation $x_i \to c\Gamma x_i$, $c \neq 0$, $\Gamma\Gamma' = I_p$, among all tests whose power depends on $\mu'\mu/\sigma^2$, irrespective of the size of n and p. Therefore it is better than Hotelling's $T^2$-test (when n > p), $T_{BS}$ and the $F^+$ tests. However, when $\Sigma \neq \sigma^2 I$, no ordering between the two tests $T_D$ and $F^+$ exists. Thus, we shall compare the power of the $T_D$ test with the $F^+$ test by simulation when the covariance matrix $\Sigma = \operatorname{diag}(d_1, \ldots, d_p) \neq \sigma^2 I_p$. The diagonal elements $d_1, \ldots, d_p$ are obtained as
an iid sample from several distributions. However, once the values of d1 , . . . , dp
are obtained by a single simulation, they are kept fixed in our simulation study.
Similarly, the values of the elements of the mean vector µ for the alternative
hypothesis are obtained as iid observations from some distributions. Once these
values are obtained, they are also held fixed throughout the simulation. In order
that both tests have the same significance level, we obtain by simulation the cut-
off points for the distribution of the statistic under the hypothesis. For example,
for the statistic F + , we obtain Fα+ such that

$$\frac{\#\{F^+ \ge F_\alpha^+\}}{1000} = \alpha,$$
where F + is calculated from the (n + 1) samples from Np (0, Σ) for each 1000
replications. The power is then calculated from the (n+1) samples from Np (µ, Σ)
replicated again 1000 times. We have chosen α to be 0.05. The power of the two
tests are shown in Tables 1–3. The mean vectors for the alternative are obtained
as

$\mu_1 = (x_1, \ldots, x_p)'$, with $x_i \sim U(-0.5, 0.5)$, $i = 1, \ldots, p$;

$\mu_2 = (x_1, \ldots, x_p)'$, with $x_i \sim U(-0.5, 0.5)$ for even i and $x_i = 0$ for odd i, $i = 1, \ldots, p$.

In our power comparison, we have used the statistic given in Corollary 2.1.
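The following is a rough sketch, in our own notation, of the simulation scheme just described: the cutoff $F_\alpha^+$ is obtained empirically from 1000 replications under the hypothesis, and the power from 1000 replications under the alternative. The helper name, seed, and example parameters are illustrative assumptions, not the authors' code.

```python
# Hedged sketch of the Monte Carlo power comparison described above.
import numpy as np

rng = np.random.default_rng(0)

def f_plus(X):
    N, p = X.shape; n = N - 1
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar) / n
    return (p - n + 1) / n * (N * xbar @ np.linalg.pinv(S) @ xbar) / n

def simulate_power(p, n, mu, d, alpha=0.05, reps=1000):
    """d: vector of variances (Sigma = diag(d)); mu: mean under the alternative."""
    sd = np.sqrt(d)
    null_stats = np.array([f_plus(rng.normal(size=(n + 1, p)) * sd)
                           for _ in range(reps)])
    cutoff = np.quantile(null_stats, 1 - alpha)       # empirical F_alpha^+
    alt_stats = np.array([f_plus(mu + rng.normal(size=(n + 1, p)) * sd)
                          for _ in range(reps)])
    return np.mean(alt_stats >= cutoff)               # estimated power

# e.g. one cell in the spirit of Table 2: p = 60, n = 30, d_i ~ U(2,3), mu_i ~ U(-0.5,0.5)
# power = simulate_power(60, 30, rng.uniform(-0.5, 0.5, 60), rng.uniform(2, 3, 60))
```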

Table 1. Power, Σ = Ip .

µ1 µ2
p n TD F+ TD F+
60 30 1.0000 0.9780 0.9987 0.8460
100 40 1.0000 1.0000 1.0000 0.9530
60 1.0000 1.0000 1.0000 1.0000
80 1.0000 1.0000 1.0000 0.9970
150 40 1.0000 1.0000 0.9317 0.9830
60 1.0000 1.0000 1.0000 1.0000
80 1.0000 1.0000 1.0000 1.0000
200 40 1.0000 1.0000 1.0000 0.9870
60 1.0000 1.0000 1.0000 1.0000
80 1.0000 1.0000 1.0000 1.0000
400 40 1.0000 1.0000 1.0000 0.9960
60 1.0000 1.0000 1.0000 1.0000
80 1.0000 1.0000 1.0000 1.0000

Table 2. Power, Σ = D, D = diag{d1 , . . . , dp }, and, di ∼ U (2, 3), i = 1, . . . , p.

µ1 µ2
p n TD F+ TD F+
60 30 0.6424 1.0000 0.2909 1.0000
100 40 0.9417 1.0000 0.5934 1.0000
60 0.9979 1.0000 0.8608 1.0000
80 1.0000 1.0000 0.9622 1.0000
150 40 0.9606 1.0000 0.6605 1.0000
60 0.9991 1.0000 0.9127 1.0000
80 1.0000 1.0000 0.9829 1.0000
200 40 0.9845 1.0000 0.7892 1.0000
60 0.9998 1.0000 0.9668 1.0000
80 1.0000 1.0000 0.9981 1.0000
400 40 1.0000 1.0000 0.9032 1.0000
60 1.0000 1.0000 0.9951 1.0000
80 1.0000 1.0000 0.9981 1.0000

Table 3. Power, Σ = D, D = diag{d1 , . . . , dp }, and, di ∼ χ23 , i = 1, . . . , p.

µ1 µ2
p n TD F+ TD F+
60 30 0.9541 1.0000 0.5420 1.0000
100 40 0.9998 1.0000 0.9394 1.0000
60 1.0000 1.0000 0.9983 1.0000
80 1.0000 1.0000 1.0000 1.0000
150 40 0.9999 1.0000 0.9622 1.0000
60 1.0000 1.0000 0.9999 1.0000
80 1.0000 1.0000 1.0000 1.0000
200 40 1.0000 1.0000 0.9925 1.0000
60 1.0000 1.0000 1.0000 1.0000
80 1.0000 1.0000 1.0000 1.0000
400 40 1.0000 1.0000 0.9996 1.0000
60 1.0000 1.0000 1.0000 1.0000
80 1.0000 1.0000 1.0000 1.0000

In other words, the hypothesis H : µ = 0 is rejected if
$$F^+ > \left[1 + \left(\frac{2}{n}\right)^{1/2} z_{1-\alpha}/c_{p,n}\right] = F_\alpha^+.$$

Thus, ideally under the hypothesis that µ = 0, we should have

P {F + ≥ Fα+ } = α

if the normal approximation given in Corollary 2.1 is a good approximation.


Thus, in order to ascertain how good this approximation is, we do a simulation
in which we calculate the statistic F + by taking n + 1 samples from N (0, D) and
replicating it 1,000 times. We then calculate
$$\frac{\#\{F^+ \ge F_\alpha^+\}}{1000} = \hat{\alpha}$$
and compare it with α. We call α̂ the attained significance level (ASL). We
choose α = 0.05. Tables 4–6 show the closeness of α̂ with α.

2.1. Invariance and other properties of the F + test


For testing the hypothesis that the mean vector µ is equal to a zero vector
against the alternative that µ ≠ 0, Hotelling's $T^2$-test based on the statistic

(2.9)  $F = \dfrac{n-p+1}{np}\,N\bar{x}'S^{-1}\bar{x}$

Table 4. Attained significance level of F + test under H; sample from N (0, I).

n = 30 n = 40 n = 50 n = 60 n = 70 n = 80 n = 90
p = 100 0.077 0.062 0.057 0.058 0.078 0.098 0.117
p = 150 0.069 0.069 0.053 0.073 0.067 0.052 0.059
p = 200 0.053 0.052 0.053 0.047 0.056 0.048 0.039
p = 300 0.068 0.057 0.064 0.069 0.054 0.037 0.039
p = 400 0.071 0.064 0.053 0.067 0.053 0.048 0.048

Table 5. Attained significance level of F + test under H; sample from N (0, D),
D = diag(d1 , . . . , dp ), where di ∼ U (2, 3).

n = 30 n = 40 n = 50 n = 60 n = 70 n = 80 n = 90
p = 100 0.072 0.060 0.055 0.062 0.071 0.096 0.108
p = 150 0.053 0.047 0.057 0.048 0.052 0.046 0.060
p = 200 0.062 0.060 0.053 0.058 0.050 0.039 0.047
p = 300 0.068 0.067 0.064 0.052 0.053 0.068 0.058
p = 400 0.071 0.061 0.067 0.052 0.051 0.058 0.061

Table 6. Attained significance level of F + test under H; sample from N (0, D),
D = diag(d1 , . . . , dp ), where di ∼ χ² with 2 degrees of freedom.

n = 30 n = 40 n = 50 n = 60 n = 70 n = 80 n = 90
p = 100 0.049 0.032 0.014 0.008 0.018 0.015 0.030
p = 150 0.028 0.017 0.017 0.015 0.025 0.006 0.002
p = 200 0.019 0.024 0.030 0.018 0.001 0.002 0.010
p = 300 0.043 0.030 0.013 0.013 0.008 0.028 0.012
p = 400 0.055 0.042 0.033 0.021 0.018 0.018 0.002

is used when n ≥ p, since S is positive definite with probability one and F has an
F -distribution with p and n−p+1 degrees of freedom. The F -test in (2.9) can be
interpreted in many ways. For example, x S −1 x is the sample (squared) distance
of the sample mean vector from the zero vector. It can also be interpreted as the
test based on the p sample principal components of the mean vector, since

$$\bar{x}'S^{-1}\bar{x} = \bar{x}'H_0'L_0^{-1}H_0\bar{x},$$
where $S = H_0'L_0H_0$, $H_0H_0' = I_p$, $L_0 = \operatorname{diag}(l_1, \ldots, l_p)$, and $H_0\bar{x}$ is the vector of the p sample principal components of the sample mean vector. It can also be shown to be equivalent to a test based on $(a'\bar{x})^2$ where a is chosen such that $a'Sa = 1$ and $(a'\bar{x})^2$ is maximized.
When n < p, S has a singular Wishart distribution, see Srivastava (2003)
for its distribution. For the singular case, a test corresponding to the F -test in
(2.9) can be proposed as,
$$F^- = cN\bar{x}'S^-\bar{x},$$
for some constant c, where S − is a generalized inverse of S and SS − S = S. No
such test has been proposed in the literature so far as it raises two obvious ques-
tions, namely which g-inverse to use and what is its distribution. For example,
the p × p sample covariance matrix S can be written as
$$S = \begin{pmatrix} S_{11} & S_{12} \\ S_{12}' & S_{22} \end{pmatrix},$$

where $S_{11}$ can be taken as an n × n positive definite matrix. When n < p, it follows from Srivastava and Khatri (1979, p. 11) that $S_{22} = S_{12}'S_{11}^{-1}S_{12}$. Thus, a g-inverse of S can be taken as
$$S^- = \begin{pmatrix} S_{11}^{-1} & 0 \\ 0 & 0 \end{pmatrix},$$
see Rao (1973, p. 27), Rao and Mitra (1971, p. 208), Schott (1997), or Siotani et al. (1985, p. 595). In this case,
$$F^- = cN\bar{x}_1'S_{11}^{-1}\bar{x}_1,$$

where $\bar{x} = (\bar{x}_1', \bar{x}_2')'$ with $\bar{x}_1 : n \times 1$ and $\bar{x}_2 : (p-n) \times 1$. With $c = 1/n^2$, it follows that when µ = 0,
$$F^- = \frac{1}{n^2}\,N\bar{x}_1'S_{11}^{-1}\bar{x}_1$$
has an F-distribution with n and 1 degrees of freedom. Thus, the distribution of $\bar{x}_1'S_{11}^{-1}\bar{x}_1$ does not depend on the covariance matrix Σ. In fact, since $A'(ASA')^-A$ is a generalized inverse of S for any p × p nonsingular matrix A,

it is possible that the distribution of $\bar{x}'S^-\bar{x}$ may not depend on the covariance matrix Σ. When more restrictions are placed on the generalized inverse, as in the case of the Moore-Penrose inverse, this property may not hold, since $A'(ASA')^+A$ may not be the Moore-Penrose inverse of S. However, the statistic $\bar{x}_1'S_{11}^{-1}\bar{x}_1$ is not invariant under the nonsingular linear transformations $\bar{x} \to A\bar{x}$ and $S \to ASA'$ for any p × p nonsingular matrix A.
In fact, when N ≤ p, no invariant test exists under the linear transformation
by an element of the group Glp of non-singular p × p matrices. The sample space
χ consists of p × N matrices of rank N ≤ p, since Σ is nonsingular, and any
matrix in χ can be transformed to any other matrix of χ by an element of Glp.
Thus the group Glp acts transitively on the sample space. Hence the only α-level
test that is affine invariant is the test Φ ≡ α, see Lehmann (1959, p. 318).
The generalized inverse used to obtain $F^-$, however, does not use all the information available in the data. The sufficient statistic for the above problem is $(\bar{x}, S)$, or equivalently $(\bar{x}, H'LH)$, where $H : n \times p$, $HH' = I_n$, and

(2.10)  $nS = V = H'LH$,

$L = \operatorname{diag}(l_1, \ldots, l_n)$, a diagonal matrix. The Moore-Penrose inverse of S is given by

(2.11)  $S^+ = nH'L^{-1}H$.

Thus, we define the sample (squared) distance of the sample mean vector from the zero vector by

(2.12)  $D^{+2} = \bar{x}'S^+\bar{x} = n\bar{x}'H'L^{-1}H\bar{x}$.
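The construction in (2.10)–(2.12) can be carried out numerically as follows; this is a hedged sketch in which the tolerance used to decide which eigenvalues of V are treated as non-zero is our own assumption.

```python
# Sketch of (2.10)-(2.12): S+ = n H' L^{-1} H built from the n non-zero
# eigenvalues of V = nS (a relative tolerance is assumed for "non-zero").
import numpy as np

def moore_penrose_S(X):
    N, p = X.shape; n = N - 1
    xbar = X.mean(axis=0)
    V = (X - xbar).T @ (X - xbar)              # V = nS = H' L H, rank n
    vals, vecs = np.linalg.eigh(V)             # eigenvalues in ascending order
    keep = vals > vals.max() * 1e-10           # the n non-zero eigenvalues l_1,...,l_n
    L = vals[keep]                             # diagonal of L
    H = vecs[:, keep].T                        # H : n x p, with H H' = I_n
    S_plus = n * H.T @ np.diag(1.0 / L) @ H    # (2.11)
    D2_plus = xbar @ S_plus @ xbar             # squared distance (2.12)
    return S_plus, D2_plus
```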

It may be noted that Hx is the vector of the n principal components of the


sample mean vector, and n−1 L is the sample covariance of these n components.
Thus the distance function, as in the case of nonsingular S, can also be defined
in terms of the n principal components. For testing the hypothesis that µ = 0
against the alternative that µ = 0, we propose a test statistic based on (a x)2 ,
where a belonging to the column space of H  , a∈ ρ(H  ), is chosen such that
a Sa = 1 and (a x)2 is maximized. It will be shown next that this maximum is
2
equal to D+ .
Since a∈ ρ(H  ), a = H  b for some n-vector b. Hence, since HH  = In

(a x)2 = [b (HH  L1/2 HH  L−1/2 )Hx]2


= [b H(S)1/2 (S + )1/2 x]2
≤ (a Sa)(x S + x) = D+
2

from the Cauchy-Schwarz inequality. The equality holds at

a = (S + )x/(x S + x)1/2 .

Hence, for testing the hypothesis that µ = 0 against the alternative that µ ≠ 0, we propose the test statistic

(2.13)  $F^+ = \dfrac{\max(p,n) - \min(p,n) + 1}{n\,\min(p,n)}\,N\bar{x}'S^+\bar{x} = \begin{cases}\dfrac{p-n+1}{n^2}\,N\bar{x}'S^+\bar{x}, & n < p,\\[6pt] \dfrac{n-p+1}{np}\,N\bar{x}'S^{-1}\bar{x}, & n \ge p,\end{cases}$

which is the same as F defined in (2.9) when the sample covariance matrix is nonsingular. Thus, when n < p, we propose the statistics $T^{+2}$ or $F^+$, as defined in (2.3) and (2.4) respectively, for testing the hypothesis that µ = 0 vs µ ≠ 0. We note that the statistic $T^{+2}$ is invariant under the transformation $x_i \to c\Gamma x_i$, where $c \in R^{(0)}$, $c \neq 0$, and $\Gamma \in O_p$; $R^{(0)}$ denotes the real line without zero and $O_p$ denotes the group of p × p orthogonal matrices. Clearly $c\Gamma \in Gl_p$. To show the invariance, we note that
$$\frac{T^{+2}}{n} = N\bar{x}'V^+\bar{x} = N\bar{x}'H'L^{-1}H\bar{x} = N\bar{x}'\Gamma'\Gamma H'L^{-1}H\Gamma'\Gamma\bar{x} = N\bar{x}'\Gamma'(\Gamma V\Gamma')^+\Gamma\bar{x},$$
since the eigenvalues of $\Gamma V\Gamma'$ are the same as those of V. The invariance under
scalar transformation obviously holds.

2.2. Distribution of F + when µ = 0


We first consider the case when the covariance matrix Σ is of rank r ≤ n. In
this case, we get the following theorem.

Theorem 2.1. Suppose that the covariance matrix Σ is singular of rank


r ≤ n. Then under the hypothesis that the mean vector µ = 0,
$$\frac{n-r+1}{nr}\,N\bar{x}'S^+\bar{x} \sim F_{r,n-r+1}.$$
The proof is given in the Appendix.

In the above theorem, it is assumed that the covariance matrix Σ is not only
singular but is of rank r ≤ n. At the moment, no statistical test is available
to check this assumption. However, in practical applications, we look at the
eigenvalues of the sample covariance matrix S, and delete the eigenvalues that
are zero or very small, as is done in the selection of principal components.
Tests are available in Srivastava (2005) to check if $\Sigma = \gamma^2 I$, $\gamma^2$ unknown, or if $\Sigma = \operatorname{diag}(\lambda_1, \ldots, \lambda_p)$, under the assumption that $n = O(p^\delta)$, $0 < \delta \le 1$. A test for the first hypothesis, that is, of sphericity, is also available in Ledoit and Wolf (2002) under the assumption that n = O(p). More efficient tests can be constructed based on the statistics $N\bar{x}'\bar{x}/\operatorname{tr}S$ and $N\bar{x}'D_S^{-1}\bar{x}$, depending upon which of the above two hypotheses is true; here $D_S = \operatorname{diag}(s_{11}, \ldots, s_{pp})$, $S = (s_{ij})$. Thus the $F^+$ test may not be used when the covariance matrix is
either a constant times an identity matrix or a diagonal matrix. Nevertheless,
we derive the distribution of the F + test when Σ = γ 2 I, and γ 2 is unknown
because in this case we obtain an exact distribution which may serve as a basis
for comparison when only asymptotic or approximate distributions are available.

Theorem 2.2. Let the F + statistic be as defined in (2.4). Then when the
covariance matrix Σ = γ 2 I, and γ 2 is unknown, the F + statistic is distributed
under the hypothesis H, as an F -distribution with n and p − n + 1 degrees of
freedom, n ≤ p.

To derive the distribution of the F + statistic when Σ = γ 2 I and n < p, we


first note that for any p × p orthogonal matrix G, $GG' = I$, the statistic defined
2
by T + is invariant under linear transformations, xi → Gxi . Hence, we may
assume without any loss of generality that

Σ = ∧ = diag(λ1 , . . . , λp ).

Let

(2.14)  $z = N^{1/2}AH\bar{x}$, where $A = (H\wedge H')^{-1/2}$.

Then under the hypothesis H, $z \sim N_n(0, I)$, and

(2.15)  $\dfrac{T^{+2}}{n} = N\bar{x}'V^+\bar{x} = z'(ALA)^{-1}z$.

The n eigenvalues l1 , . . . , ln of the diagonal matrix L are the n non-zero eigen-


values of V = Y Y  , where the n columns of the p × n matrix Y are iid Np (0, ∧).
To derive the asymptotic results in which p may also go to infinity, we assume that

(2.16)  $0 < a_{i,0} = \lim_{p\to\infty} a_i < \infty$, $i = 1, \ldots, 4$,

where

(2.17)  $a_i = \operatorname{tr}\Sigma^i/p$.

Let

(2.18)  $b = a_1^2/a_2$.


Under the assumption (2.16), consistent and unbiased estimators of $a_2$ and $a_1$, as $(n, p) \to \infty$, are given by

(2.19)  $\hat{a}_2 = \dfrac{n^2}{(n-1)(n+2)}\left[\dfrac{\operatorname{tr}S^2}{p} - \dfrac{p}{n}\left(\dfrac{\operatorname{tr}S}{p}\right)^2\right]$

and

(2.20)  $\hat{a}_1 = \operatorname{tr}S/p$,

respectively; see Srivastava (2005). Thus, clearly $\operatorname{tr}S^2/p$ is not a consistent estimator of $a_2$ unless p is fixed and $n \to \infty$. Thus, under the assumption (2.16), a consistent estimator of b as $(n, p) \to \infty$ is given by

(2.21)  $\hat{b} = \dfrac{\hat{a}_1^2}{\hat{a}_2}$.

In the next theorem, we give an asymptotic distribution of the F + statistic,


the proof of which is given in the Appendix.

Theorem 2.3. Under the condition (2.16) and when µ = 0,
$$\lim_{n,p\to\infty} P\left\{\left(\frac{n}{2}\right)^{1/2}\left(\frac{p}{p-n+1}\,\hat{b}F^+ - 1\right) \le z_{1-\alpha}\right\} = \Phi(z_{1-\alpha}).$$

Corollary 2.1. Let $n = O(p^\delta)$, $0 < \delta < 1$; then under the condition (2.16) and when µ = 0,
$$\lim_{n,p\to\infty} P\left\{c_{p,n}\left(\frac{n}{2}\right)^{1/2}(\hat{b}F^+ - 1) < z_{1-\alpha}\right\} = \Phi(z_{1-\alpha}),$$
where
$$c_{p,n} = [(p-n+1)/(p+1)]^{1/2}.$$

In the following theorem, it is shown that F + > Fn,p−n+1 with probability


one, see the Appendix for its proof.

Theorem 2.4. Let $F_{n,p-n+1}(\alpha)$ be the upper 100α% point of the F-distribution with n and (p − n + 1) degrees of freedom. Then

P {F + > Fn,p−n+1 (α)} ≥ α.

2.3. Asymptotic distribution of F + when µ ≠ 0

In this section, we obtain the asymptotic distribution under the local alternative

(2.22)  $\mu = E(x) = \left(\dfrac{1}{nN}\right)^{1/2}\delta.$

From (2.15),
$$\frac{T^{+2}}{n} = z'(ALA)^{-1}z,$$
where, given H,
$$z \sim N_n(n^{-1/2}AH\delta,\, I), \qquad A = (H\wedge H')^{-1/2}.$$
Let Γ be an n × n orthogonal matrix, $\Gamma\Gamma' = I_n$, with the first row of Γ given by
$$\delta'H'A/(\delta'H'A^2H\delta)^{1/2}.$$
Then, given H,
$$w = \Gamma z \sim N_n\left(\begin{pmatrix}\theta\\0\end{pmatrix},\, I\right),$$
where

(2.23)  $\theta = (n^{-1}\delta'H'A^2H\delta)^{1/2}.$

As before,
$$\lim_{p\to\infty}(pbT^{+2}/n) = \lim_{p\to\infty} z'z = \lim_{p\to\infty} w'w.$$
We note that, from Lemma A.2 given in the Appendix,
$$\lim_{n,p\to\infty}\theta^2 = \lim_{n,p\to\infty}(n^{-1}\delta'H'A^2H\delta) = \lim_{n,p\to\infty}\frac{\delta'\wedge\delta}{pa_2} = \theta_0^2, \qquad A = (H\wedge H')^{-1/2}.$$
We shall assume that $\theta_0^2 < \infty$. Thus,
$$\lim_{n,p\to\infty} P\left\{\frac{(pbT^{+2}/n) - n}{\sqrt{2n}} > z_{1-\alpha}\right\}
= \lim_{n,p\to\infty} P\left\{\frac{(pbT^{+2}/n) - n - \theta_0^2}{(2n + 4\theta_0^2)^{1/2}} > \left(z_{1-\alpha} - \frac{\theta_0^2}{\sqrt{2n}}\right)\left(\frac{2n}{2n + 4\theta_0^2}\right)^{1/2}\right\}
= \lim_{n,p\to\infty}\Phi[-z_{1-\alpha} + \theta_0^2(2n)^{-1/2}].$$
Hence, the power of the $F^+$ test is given by

(2.24)  $\beta(F^+) \simeq \Phi\left(-z_{1-\alpha} + \dfrac{n}{p}\left(\dfrac{n}{2}\right)^{1/2}\dfrac{\mu'\wedge\mu}{a_2}\right).$

2.4. Dempster's test

We assume as before that the $x_i$ are iid $N_p(\mu, \Sigma)$. Suppose $\Sigma = \sigma^2 I$, $\sigma^2$ unknown. Then, the sufficient statistics are $\bar{x}$ and $\operatorname{tr}S/p$. The problem remains invariant under the transformation $x_i \to c\Gamma x_i$, where $c \neq 0$ and $\Gamma\Gamma' = I_p$. The maximal invariant statistic is given by

(2.25)  $T_D = \dfrac{N\bar{x}'\bar{x}}{(\operatorname{tr}S/p)}.$

The maximal invariant in the parameter space is given by
$$\gamma = \mu'\mu/\sigma^2.$$
$T_D$ is distributed as an F-distribution with p and np degrees of freedom and noncentrality parameter γ. Thus $T_D$ is the uniformly most powerful invariant test. This is also the likelihood ratio test. But when $\Sigma \neq \sigma^2 I$, $T_D$ has none of the properties mentioned above. Dempster (1958) nevertheless proposed the same test $F_D$ when $\Sigma \neq \sigma^2 I$, since it can be computed for all values of p irrespective of whether $n \le p$ or $n > p$. To obtain the distribution of $F_D$, we note that
$$F_D = \frac{Q_1}{Q_2 + \cdots + Q_N},$$
where the $Q_i$'s are independently and identically distributed, but not as chi-square $\chi^2$ random variables. Dempster (1958) approximated the distribution of $Q_i$ as $m\chi_r^2$, where r is the degrees of freedom associated with $\chi^2$ and m is a constant. Since $F_D$ does not depend on m, he gave two equations from which an iterative solution of r can be obtained. Alternatively, since
$$E(m\chi_r^2) = mr = \operatorname{tr}\Sigma,$$
and
$$\operatorname{var}(m\chi_r^2) = 2m^2r = 2\operatorname{tr}\Sigma^2,$$
r is given by
$$r = \frac{(\operatorname{tr}\Sigma)^2}{\operatorname{tr}\Sigma^2} = p\,\frac{(\operatorname{tr}\Sigma/p)^2}{\operatorname{tr}\Sigma^2/p} = pb.$$
A consistent estimator of b has been given in (2.21). Thus, r can be estimated by
$$\hat{r} = p\hat{b}.$$
Hence, an approximate distribution of $F_D$ under the hypothesis that µ = 0 is given by an F-distribution with $[\hat{r}]$ and $[n\hat{r}]$ degrees of freedom, where [a] denotes the largest integer ≤ a. The asymptotic distribution of the $F_D$ test under the alternative that µ ≠ 0 is given by
$$\lim_{n,p\to\infty}\left[P\left\{\left(\frac{r}{2}\right)^{1/2}(F_D - 1) > z_{1-\alpha} \;\Big|\; \mu = (nN)^{-1/2}\delta\right\} - \Phi\left(-z_{1-\alpha} + \frac{n\mu'\mu}{\sqrt{2pa_2}}\right)\right] = 0,$$

see Bai and Saranadasa (1996). This also gives the asymptotic distribution of
Dempster’s test under the hypothesis.
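The following short sketch illustrates the approximate calibration of Dempster's statistic just described; it uses the form $T_D = N\bar{x}'\bar{x}/\operatorname{tr}S$ appearing in the power comparison of Section 2, refers it to an F distribution with $[\hat{r}]$ and $[n\hat{r}]$ degrees of freedom, and is an illustrative assumption rather than Dempster's or the author's code.

```python
# Hedged sketch of Dempster's test with the chi-square degree-of-freedom
# approximation r_hat = p * b_hat described above.
import numpy as np
from scipy.stats import f as f_dist

def dempster_test(X):
    N, p = X.shape; n = N - 1
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar) / n
    TD = N * xbar @ xbar / np.trace(S)
    trS, trS2 = np.trace(S), np.trace(S @ S)
    a1 = trS / p
    a2 = n**2 / ((n - 1) * (n + 2)) * (trS2 - trS**2 / n) / p
    r_hat = p * a1**2 / a2
    df1, df2 = int(r_hat), int(n * r_hat)        # [r_hat] and [n r_hat]
    return TD, f_dist.sf(TD, df1, df2)           # statistic and approximate p-value
```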

2.5. Bai-Saranadasa test (BS test)


Bai and Saranadasa (1996) proposed a standardized version of the Dempster
test which is much easier to handle in order to obtain the asymptotic null and
non-null distributions, with performance almost identical to the Dempster test
in all simulation results given by Bai and Saranadasa (1996). The test statistic
proposed by Bai and Saranadasa (1996) is given by
(2.26)  $T_{BS} = \dfrac{N\bar{x}'\bar{x} - \operatorname{tr}S}{(p\hat{a}_2)^{1/2}\,(2(n+1)/n)^{1/2}},$

where
$$p\hat{a}_2 = \frac{n^2}{(n+2)(n-1)}\left[\operatorname{tr}S^2 - \frac{1}{n}(\operatorname{tr}S)^2\right] \equiv \frac{n^2}{(n+2)(n-1)}\,\hat{C},$$
and is an unbiased and ratio-consistent estimator of $\operatorname{tr}\Sigma^2$. This test is also invariant under the transformation $x_i \to c\Gamma x_i$, $c \neq 0$, $\Gamma\Gamma' = I$, as was the Dempster
test TD . To obtain the distribution of TBS , we need the following lemma.

Lemma 2.1. Let $a_{in}$ be a sequence of constants such that
$$\lim_{n\to\infty}\max_{1\le i\le n} a_{in}^2 = 0, \qquad\text{and}\qquad \lim_{n\to\infty}\sum_{i=1}^{n} a_{in}^2 = 1.$$
Then for any iid random variables $u_i$ with mean zero and variance one,
$$\lim_{n\to\infty} P\left\{\sum_{i=1}^{n} a_{in}u_i < z\right\} = \Phi(z);$$

see Srivastava (1970) or Srivastava (1972, Lemma 2.1).

Thus, if

(2.27)  $\lim_{p\to\infty}\dfrac{\max_{1\le i\le p}\lambda_i}{(\operatorname{tr}\Sigma^2)^{1/2}} = 0,$

then
$$\lim_{n,p\to\infty} P\{T_{BS} \le z\} = \Phi(z).$$
It may be noted that the condition (2.27) will be satisfied if

(2.28)  $\lambda_i = O(p^\gamma), \qquad 0 \le \gamma < \tfrac{1}{2},$

since $a_2$ is finite. To obtain the power of the test, we need to obtain the distribution of the statistic $T_{BS}$ under the alternative hypothesis µ ≠ 0. As before, we consider the local alternatives in which
$$E(\bar{x}) = \mu = (1/nN)^{1/2}\delta.$$



From the above results, we get, under condition (2.28),
$$\lim_{n,p\to\infty} P\left\{\frac{N[\bar{x} - (nN)^{-1/2}\delta]'[\bar{x} - (nN)^{-1/2}\delta] - \operatorname{tr}S}{(2\operatorname{tr}\Sigma^2)^{1/2}} < z\right\} = \Phi(z).$$
Since
$$\lim_{n,p\to\infty}\operatorname{var}\left(\frac{(N/n)^{1/2}\delta'\bar{x}}{\sqrt{2\operatorname{tr}\Sigma^2}}\right) = \lim_{n,p\to\infty}\frac{\delta'\wedge\delta}{2n\operatorname{tr}\Sigma^2} = \lim_{n,p\to\infty}\frac{\theta_0^2}{2n} = 0,$$
it follows that
$$\lim_{n,p\to\infty}\left[P\left\{\frac{N\bar{x}'\bar{x} - \operatorname{tr}S}{\sqrt{2p\hat{a}_2}}\,\frac{\sqrt{p\hat{a}_2}}{\sqrt{\operatorname{tr}\Sigma^2}} > z_{1-\alpha}\right\} - \Phi\left(-z_{1-\alpha} + \frac{\delta'\delta}{n\sqrt{2pa_2}}\right)\right] = 0.$$
Thus, the asymptotic power of the BS test under condition (2.28) is given by

(2.29)  $\beta(BS) \simeq \Phi\left(-z_{1-\alpha} + \dfrac{n\mu'\mu}{\sqrt{2pa_2}}\right).$
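A brief illustrative sketch of the statistic $T_{BS}$ in (2.26), referred to its standard normal limit, is given below; this is our own illustration, not Bai and Saranadasa's code.

```python
# Hedged sketch of the Bai-Saranadasa statistic (2.26).
import numpy as np
from scipy.stats import norm

def bai_saranadasa_test(X):
    N, p = X.shape; n = N - 1
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar) / n
    trS, trS2 = np.trace(S), np.trace(S @ S)
    pa2_hat = n**2 / ((n + 2) * (n - 1)) * (trS2 - trS**2 / n)   # estimates tr(Sigma^2)
    T_BS = (N * xbar @ xbar - trS) / np.sqrt(pa2_hat * 2 * (n + 1) / n)
    return T_BS, norm.sf(T_BS)          # statistic and asymptotic p-value
```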

2.6. Läuter, Glimm and Kropf test


Läuter et al. (1998) proposed a test based on principal components. The components are obtained by using the eigenvectors corresponding to the non-zero eigenvalues of $V + N\bar{x}\bar{x}'$, the complete sufficient statistic under the hypothesis. In place of using the sample covariance of the principal components, they use $O_1'SO_1$, where $V + N\bar{x}\bar{x}' = O'\tilde{L}O$, $OO' = I_{n+1}$, $O' = (O_1, O_2)$, and $\tilde{L} = \operatorname{diag}(\tilde{l}_1, \ldots, \tilde{l}_{n+1})$, $\tilde{l}_1 > \cdots > \tilde{l}_{n+1}$. The matrix $O_1$ is of $p \times k$ dimension, $k \le n$. When $O_1$ is a $p \times k$ matrix such that $O_1'SO_1$ is a positive definite matrix with probability one, then it follows from Läuter et al. (1998) that the statistic
$$T_{LGK} = \frac{n-k+1}{nk}\,N\bar{x}'O_1(O_1'SO_1)^{-1}O_1'\bar{x} \sim F_{k,n-k+1}, \qquad k \le n.$$
The non-null distribution of this statistic is not available. A comparison
by simulation also appears difficult as no guidance is available as to how many
components to include. For example, if we choose only one component from the
n + 1 components, denoted by a, then one may use the t2 -test given by

$$t^2 = N(a'\bar{x})^2/(a'Sa),$$
where a is a function of the statistic $V + N\bar{x}\bar{x}'$. This has a $t^2$-distribution with n


degrees of freedom. Such a test however, has a poor power when the components
of the mean vector do not shift in the same direction and the shifts are not of
equal magnitude; see Srivastava et al. (2001) for some power comparison.

3. Two-sample F + test
In this section, we consider the problem of testing that the mean vectors $\mu_1$ and $\mu_2$ of two populations with common covariance Σ are equal. For this, we shall define the sample (squared) distance between the two populations, with sample mean vectors $\bar{x}_1$ and $\bar{x}_2$ and pooled sample covariance matrix S, by
$$D^{+2} = (\bar{x}_1 - \bar{x}_2)'S^+(\bar{x}_1 - \bar{x}_2).$$
Thus, for testing the equality of the two mean vectors, the test statistic $F^+$ becomes
$$F^+ = \frac{p-n+1}{n^2}\left(\frac{1}{N_1} + \frac{1}{N_2}\right)^{-1}(\bar{x}_1 - \bar{x}_2)'S^+(\bar{x}_1 - \bar{x}_2), \qquad n \le p,$$
where $x_{11}, \ldots, x_{1N_1}$ are iid $N_p(\mu_1, \Sigma)$, $x_{21}, \ldots, x_{2N_2}$ are iid $N_p(\mu_2, \Sigma)$, $\Sigma > 0$, $\bar{x}_1$ and $\bar{x}_2$ are the sample mean vectors,
$$nS = \sum_{i=1}^{N_1}(x_{1i} - \bar{x}_1)(x_{1i} - \bar{x}_1)' + \sum_{i=1}^{N_2}(x_{2i} - \bar{x}_2)(x_{2i} - \bar{x}_2)',$$
and
$$n = N_1 + N_2 - 2.$$
All the results obtained in Theorems 2.1 to 2.4 for the one-sample case are also available for the two-sample case, except that $T^{+2}$ is now defined as
$$T^{+2} = \frac{N_1N_2}{N_1 + N_2}\,(\bar{x}_1 - \bar{x}_2)'S^+(\bar{x}_1 - \bar{x}_2).$$
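A minimal sketch of the two-sample version follows, reusing the normal approximation of Corollary 2.1 with $n = N_1 + N_2 - 2$; the function name and structure are illustrative assumptions.

```python
# Hedged sketch of the two-sample F+ test described above.
import numpy as np
from scipy.stats import norm

def two_sample_f_plus(X1, X2, alpha=0.05):
    """X1: N1 x p, X2: N2 x p samples assumed to share a common covariance."""
    (N1, p), N2 = X1.shape, X2.shape[0]
    n = N1 + N2 - 2
    x1bar, x2bar = X1.mean(axis=0), X2.mean(axis=0)
    V = (X1 - x1bar).T @ (X1 - x1bar) + (X2 - x2bar).T @ (X2 - x2bar)
    S = V / n                                           # pooled covariance
    diff = x1bar - x2bar
    T2_plus = N1 * N2 / (N1 + N2) * diff @ np.linalg.pinv(S) @ diff
    F_plus = (p - n + 1) / n * T2_plus / n
    trS, trS2 = np.trace(S), np.trace(S @ S)
    b_hat = (trS / p)**2 / (n**2 / ((n - 1) * (n + 2)) * (trS2 - trS**2 / n) / p)
    c_pn = np.sqrt((p - n + 1) / (p + 1))
    z = c_pn * np.sqrt(n / 2) * (b_hat * F_plus - 1)    # Corollary 2.1 form
    return F_plus, z, z > norm.ppf(1 - alpha)
```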

4. Multivariate analysis of variance (MANOVA)


For MANOVA, that is for testing the equality of k mean vectors, we shall
denote the independently distributed observation vectors by xij , j = 1, . . . , Ni ,
i = 1, . . . , k where xij ∼ N (µi , Σ), Σ > 0. The between sum of squares will be
denoted by
$$B = \sum_{i=1}^{k} N_i(\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})',$$
where the $\bar{x}_i$ are the sample mean vectors,
$$\bar{x} = \left(\sum_{i=1}^{k} N_i\bar{x}_i\right)\Big/N, \qquad N = N_1 + \cdots + N_k,$$
and the within sum of squares, or sum of squares due to error, will be denoted by
$$V = nS = \sum_{i=1}^{k}\sum_{j=1}^{N_i}(x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)',$$
where $n = N - k$.

The test statistic we propose for testing the hypothesis that $\mu_1 = \cdots = \mu_k$ against the alternative that $\mu_i \neq \mu_j$ for at least one pair $(i, j)$, $i \neq j$, is given by
$$U^+ = \prod_{i=1}^{m}(1 + d_i)^{-1} = |I + BV^+|^{-1},$$
where the $d_i$ are the non-zero eigenvalues of the matrix $BV^+$, $m = k - 1 \le n$, and $V^+$ is the Moore-Penrose inverse of V. To obtain the distribution of $U^+$ under the hypothesis that all the mean vectors are equal, we note that the between sum of squares B defined above can be written as
$$B = UU',$$
where $U = (u_1, \ldots, u_m)$, $m = k - 1$, is distributed as $N_{p,m}(\theta, \Sigma, I_m)$, where $\theta = (\theta_1, \ldots, \theta_m)$ is a p × m matrix; a random matrix U is said to have the $N_{p,m}(\theta, \Sigma, C)$ distribution if the pdf of U can be written as
$$(2\pi)^{-(1/2)pm}|C|^{-(1/2)p}|\Sigma|^{-(1/2)m}\operatorname{etr}\left\{-\frac{1}{2}\Sigma^{-1}[(U - \theta)C^{-1}(U - \theta)']\right\},$$
where etr(A) stands for the exponential of the trace of the matrix A; see Srivastava and Khatri (1979, p. 54). Under the hypothesis of the equality of the k mean vectors, θ = 0 and $U \sim N_{p,m}(0, \Sigma, I_m)$.
Similarly, the sum of squares due to error defined above can be written as
$$V = nS = YY',$$
where $Y : p \times n$ and the n columns of Y are iid $N_p(0, \Sigma)$. It is also known that B and V are independently distributed. Let
$$HVH' = L = \operatorname{diag}(l_1, \ldots, l_n), \qquad n \le p,$$
where $HH' = I_n$, $H : n \times p$. Hence, the non-zero eigenvalues of $BV^+$ are the non-zero eigenvalues of

(4.1)  $U'H'A(ALA)^{-1}AHU = Z'(ALA)^{-1}Z,$

where

(4.2)  $A = (H\Sigma H')^{-1/2},$

and the m columns of Z are iid $N_n(0, I)$ under the hypothesis that µ = 0.
From Srivastava and von Rosen (2002), the results corresponding to the case
when Σ is singular become available. Thus, we get the following theorem.

Theorem 4.1. Suppose the covariance matrix Σ is singular of rank r ≤ n. Let $d_1, \ldots, d_m$ be the non-zero eigenvalues of $BV^+$, where $V^+ = H'L^{-1}H$, $L = \operatorname{diag}(l_1, \ldots, l_r)$, $H : r \times p$, $HH' = I_r$ and $m = k - 1 \le r$. Then, in the notation of Srivastava (2002, p. XV),
$$U^+ = \prod_{i=1}^{m}(1 + d_i)^{-1} \sim U_{r,m,n},$$
and, under the hypothesis that µ = 0, with r fixed and $n \to \infty$,

(4.3)  $P\left\{-\left[n - \frac{1}{2}(r - m + 1)\right]\log U_{r,m,n} > c\right\} = P\{\chi^2_{rm} > c\} + n^{-2}\eta[P(\chi^2_{rm+4} > c) - P(\chi^2_{rm} > c)] + O(n^{-4}),$

where $\eta = rm(r^2 + m - 5)/48$ and $\chi^2_f$ denotes a chi-square random variable with f degrees of freedom.

When $\Sigma = \gamma^2 I$, A = I, and hence we get the following theorem.

Theorem 4.2. Suppose the covariance matrix $\Sigma = \gamma^2 I$, γ is unknown and $m \le n < p$; then $U^+$ has the distribution of $U_{n,m,p}$. Thus, the asymptotic expression of $P\{-l\log U^+ \le z\}$ under the hypothesis that µ = 0 is given by

(4.4)  $P\{-l\hat{b}\log U^+ \le z\} = P\{\chi^2_g \le z\} + p^{-2}\beta[P(\chi^2_{g+4} \le z) - P(\chi^2_g \le z)] + O(p^{-4}),$

where n is fixed, $p \to \infty$, $g = nm$,
$$l = p - \frac{1}{2}(n - m + 1), \qquad m = k - 1, \qquad\text{and}\qquad \beta = \frac{g}{48}(n^2 + m - 5).$$

Theorem 4.3. Under the null hypothesis and (2.16),
$$\lim_{n,p\to\infty} P\left\{\frac{-p\hat{b}\log U^+ - mn}{\sqrt{2mn}} < z_{1-\alpha}\right\} = \Phi(z_{1-\alpha}).$$

For some other tests and power comparisons, see Srivastava and Fujikoshi (2006), and when $(p/n) \to c$, $c \in (0, \infty)$, see Fujikoshi et al. (2004).
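An illustrative sketch of the statistic $U^+$ and the normal approximation of Theorem 4.3 is given below; it is not the author's code, and it assumes the groups share a common covariance so that B and V have the stated ranks.

```python
# Hedged sketch of the MANOVA test based on U+ and Theorem 4.3.
import numpy as np
from scipy.stats import norm

def manova_u_plus(groups, alpha=0.05):
    """groups: list of k arrays, each N_i x p, with a common covariance assumed."""
    k = len(groups)
    N = sum(len(g) for g in groups)
    n, p = N - k, groups[0].shape[1]
    m = k - 1
    grand = np.vstack(groups).mean(axis=0)
    B = sum(len(g) * np.outer(g.mean(axis=0) - grand, g.mean(axis=0) - grand)
            for g in groups)
    V = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
    d = np.linalg.eigvals(B @ np.linalg.pinv(V))
    d = np.sort(d.real)[::-1][:m]                 # the m = k-1 non-zero eigenvalues
    U_plus = np.prod(1.0 / (1.0 + d))
    S = V / n
    trS, trS2 = np.trace(S), np.trace(S @ S)
    b_hat = (trS / p)**2 / (n**2 / ((n - 1) * (n + 2)) * (trS2 - trS**2 / n) / p)
    z = (-p * b_hat * np.log(U_plus) - m * n) / np.sqrt(2 * m * n)  # Theorem 4.3
    return U_plus, z, z > norm.ppf(1 - alpha)
```

For large p one would in practice avoid forming the full p × p product and work instead with the n-dimensional reduced problem, but the direct form above suffices as an illustration.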

5. Confidence intervals
When a hypothesis is rejected, it is desirable to find out which component or
components may have caused the rejection. In the context of DNA microarrays,
it will be desirable to find out which genes have been affected after the radiation
treatment. Since p is very large, the confidence intervals obtained by the Bonfer-
roni inequality, which corresponds to the maximum of t-tests, are of no practical
value as they give very wide confidence intervals.
As an alternative, many researchers such as Efron et al. (2001) use the 'False Discovery Rate' (FDR) proposed by Benjamini and Hochberg (1995), although

the conditions required for its validity may not hold in many cases. On the other
hand, we may use approximate confidence intervals for selecting the variables or
to get some idea as to the variable or variables that may have caused the rejection
of the hypothesis. These kind of confidence intervals use the fact that

$$\sup_{a\in R^p} g^2(a) < c \;\Rightarrow\; \sup_{a\in\rho(H')} g^2(a) < c,$$

for any real valued function g(·) of the p-vector a, where H  : p × n, and Rp is the
p-dimensional Euclidean space. Thus, it gives a confidence coefficient less than
100(1 − α)%. However, we shall assume that they are approximately equal.

5.1. Confidence intervals in one-sample

Let

(5.1)  $A_\alpha^2 = n + (2n)^{1/2}z_{1-\alpha},$

where $z_{1-\alpha}$ denotes the upper 100α% point of the standard normal distribution. It may be noted that $A_\alpha^2 \simeq \chi^2_{n,\alpha}$. In the notation and assumptions of Section 2, approximate simultaneous confidence intervals for linear combinations $a'\mu$ at the 100(1 − α)% confidence coefficient are given by
$$a'\bar{x} \pm N^{-1/2}(n/p)^{1/2}\hat{b}^{-1/2}(a'Sa)^{1/2}A_\alpha,$$
when $a'Sa \neq 0$ and $a \in \rho(H')$. Thus, approximate simultaneous confidence intervals for the components $\mu_i$ of $\mu = (\mu_1, \ldots, \mu_p)'$, with confidence coefficient approximately 100(1 − α)%, are given by
$$\bar{x}_i \pm N^{-1/2}(n/p)^{1/2}\hat{b}^{-1/2}s_{ii}^{1/2}A_\alpha, \qquad i = 1, \ldots, p,$$
where $\bar{x} = (\bar{x}_1, \ldots, \bar{x}_p)'$ and $S = (s_{ij})$.
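The component-wise intervals above can be computed as in the following hedged sketch; the function name is illustrative.

```python
# Sketch of the approximate simultaneous intervals for the components mu_i,
# with A_alpha^2 = n + sqrt(2n) z_{1-alpha} as in (5.1).
import numpy as np
from scipy.stats import norm

def componentwise_cis(X, alpha=0.05):
    N, p = X.shape; n = N - 1
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar) / n
    trS, trS2 = np.trace(S), np.trace(S @ S)
    b_hat = (trS / p)**2 / (n**2 / ((n - 1) * (n + 2)) * (trS2 - trS**2 / n) / p)
    A_alpha = np.sqrt(n + np.sqrt(2 * n) * norm.ppf(1 - alpha))      # (5.1)
    half = np.sqrt(np.diag(S)) * A_alpha * np.sqrt(n / p) / np.sqrt(N * b_hat)
    return xbar - half, xbar + half     # lower and upper limits for each mu_i
```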

5.2. Confidence intervals in two-samples

In the two-sample case, and again under the assumption made in the beginning of this section, approximate simultaneous confidence intervals for the linear combinations $a'(\mu_1 - \mu_2)$ at the 100(1 − α)% confidence coefficient are given by
$$a'(\bar{x}_1 - \bar{x}_2) \pm \left(\frac{1}{N_1} + \frac{1}{N_2}\right)^{1/2}(n/p)^{1/2}\hat{b}^{-1/2}(a'Sa)^{1/2}A_\alpha,$$
when $a'Sa \neq 0$ and $a \in \rho(H')$. Here $n = N_1 + N_2 - 2$, and S is the pooled estimate of the covariance matrix Σ. Thus, approximate simultaneous confidence intervals for the differences in the components of the means, $\mu_{1i} - \mu_{2i}$, are given by
$$(\bar{x}_{1i} - \bar{x}_{2i}) \pm \left(\frac{1}{N_1} + \frac{1}{N_2}\right)^{1/2}(n/p)^{1/2}\hat{b}^{-1/2}s_{ii}^{1/2}A_\alpha, \qquad i = 1, \ldots, p,$$
where
$$\mu_1 = (\mu_{11}, \ldots, \mu_{1p})', \quad \mu_2 = (\mu_{21}, \ldots, \mu_{2p})', \quad \bar{x}_1 = (\bar{x}_{11}, \ldots, \bar{x}_{1p})', \quad \bar{x}_2 = (\bar{x}_{21}, \ldots, \bar{x}_{2p})',$$
and $S = (s_{ij})$.

5.3. Confidence intervals in MANOVA


From the results of Section 2, it follows that the eigenvalues of V + B, under
the hypothesis of the equality of means, are asymptotically the eigenvalues of
p−1 U , where U ∼ Wn (m, I), m = k − 1. Approximately, this is equal to the
eigenvalues of W −1 U , where W ∼ Wn (p, I) for large p. With this result in mind,
we now proceed to obtain the confidence intervals in MANOVA.
In the MANOVA, we are comparing k means. Thus, we need simultaneous confidence intervals for $\sum_{j=1}^{k} a'\mu_j q_j$, where $\sum_{j=1}^{k} q_j = 0$, that is, for the $q_j$'s of m contrasts. These approximate 100(1 − α)% simultaneous confidence intervals for the contrasts in the means, $a'(\mu_1, \ldots, \mu_k)q$, $q = (q_1, \ldots, q_k)'$, are given by
$$\sum_{j=1}^{k} a'\bar{x}_j q_j \pm \left(\sum_{j=1}^{k}\frac{q_j^2}{N_j}\right)^{1/2}\left(\frac{a'Va}{p}\,\frac{c_\alpha}{1-c_\alpha}\right)^{1/2}\hat{b}^{-1/2},$$
where V is defined in the beginning of Section 4, and $c_\alpha/(1 - c_\alpha)$ is the upper 100α% point of the distribution of the largest eigenvalue of $W^{-1}U$. The value of $c_\alpha$ can be obtained from Table B.7 in Srivastava (2002) with $p_0 = n$, $m_0 = k - 1$, and n corresponding to p. The vector a of constants can be chosen as in the previous two subsections.

6. Testing the equality of covariances


For testing the equality of the two covariance matrices, let $S_1$ and $S_2$ be the sample covariances based on $n_1$ and $n_2$ degrees of freedom. Then a test for the equality of the two covariance matrices can be based on the test statistic

(6.1)  $T_2^{(1)} = (p\hat{b}\operatorname{tr}V_1^+V_2 - n_1n_2)/(2n_1n_2)^{1/2},$

where $V_1 = n_1S_1 = H_1'L_1H_1$, $H_1H_1' = I_{n_1}$, $V_2 = n_2S_2$ and $V_1^+ = H_1'L_1^{-1}H_1$. The statistic $T_2^{(1)}$ is distributed asymptotically as N(0, 1) under the hypothesis that the two covariances are equal. The estimate $\hat{b}$ of b can be obtained from the pooled estimate. For the equality of the k covariances in MANOVA, let $S_i = n_i^{-1}V_i$ be the sample covariances of the k populations based on $n_i$ degrees of freedom, $i = 1, \ldots, k$. Then, a test for the equality of the k covariances can be based on the test statistic

(6.2)  $T_3^{(1)} = [p\hat{b}\operatorname{tr}V_1^+V_{(1)} - n_1n_{(1)}]/(2n_1n_{(1)})^{1/2},$

where $V_{(1)} = \sum_{\alpha=2}^{k}V_\alpha$ and $n_{(1)} = n_2 + \cdots + n_k$. It can be shown that under the hypothesis of equality of all covariances, $T_3^{(1)}$ is asymptotically distributed as N(0, 1). Alternatively, we may use the approximate distributions of

(6.3)  $P_2 = p\hat{b}\operatorname{tr}V_1^+V_2,$

and

(6.4)  $P_3 = p\hat{b}\operatorname{tr}V_1^+V_{(1)},$

which are approximately distributed as $\chi^2_{n_1n_2}$ and $\chi^2_{n_1\times n_{(1)}}$ respectively. The tests in (6.1) and (6.2) are arbitrary. To make them less arbitrary, we may define
$$T_2^{(i)} = [p\hat{b}\operatorname{tr}V_i^+V_{3-i} - n_1n_2]/(2n_1n_2)^{1/2}, \qquad i = 1, 2,$$
and
$$T_3^{(i)} = [p\hat{b}\operatorname{tr}V_i^+V_{(i)} - n_in_{(i)}]/(2n_in_{(i)})^{1/2}, \qquad i = 1, \ldots, k.$$
Then, conservative tests based on

(6.5)  $R_2 = \max(|T_2^{(1)}|, |T_2^{(2)}|)$

for testing the equality of two covariance matrices, and

(6.6)  $R_3 = \max(|T_3^{(1)}|, \ldots, |T_3^{(k)}|)$

for testing the equality of k covariance matrices may be considered. The hypothesis of the equality of the two covariance matrices is rejected if
$$R_2 > z_{1-\alpha/4},$$
and the hypothesis of equality of k covariance matrices is rejected if
$$R_3 > z_{1-\alpha/2k},$$
where $z_{1-\alpha}$ is the upper 100α% point of the N(0, 1) distribution.
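A rough sketch of the two-sample versions of these tests, $T_2^{(1)}$ of (6.1) with its N(0, 1) limit and the chi-square approximation $P_2$ of (6.3), is given below; as suggested above, $\hat{b}$ is taken from the pooled covariance, and the code is illustrative only.

```python
# Hedged sketch of the covariance-equality tests (6.1) and (6.3).
import numpy as np
from scipy.stats import norm, chi2

def covariance_equality_tests(X1, X2):
    (N1, p), N2 = X1.shape, X2.shape[0]
    n1, n2 = N1 - 1, N2 - 1
    V1 = (X1 - X1.mean(axis=0)).T @ (X1 - X1.mean(axis=0))
    V2 = (X2 - X2.mean(axis=0)).T @ (X2 - X2.mean(axis=0))
    n = n1 + n2
    S = (V1 + V2) / n                                   # pooled covariance
    trS, trS2 = np.trace(S), np.trace(S @ S)
    b_hat = (trS / p)**2 / (n**2 / ((n - 1) * (n + 2)) * (trS2 - trS**2 / n) / p)
    t = p * b_hat * np.trace(np.linalg.pinv(V1) @ V2)
    T2 = (t - n1 * n2) / np.sqrt(2 * n1 * n2)           # (6.1)
    P2 = t                                              # (6.3)
    return T2, 2 * norm.sf(np.abs(T2)), P2, chi2.sf(P2, n1 * n2)
```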

7. Selection of variables
In order to have good power for any test in multivariate analysis, it is important to have as few characteristics as possible. In particular, those variables that do not have discriminating ability should be removed. We begin with $\hat{b}_{n,n}$ based on the n characteristics chosen, for the three cases, as the n largest values of $N\bar{x}_r^2/s_{rr}$, $[N_1N_2/(N_1+N_2)](\bar{x}_{1r} - \bar{x}_{2r})^2/s_{rr}$, and $(a_r'Ba_r)/(a_r'Sa_r)$, where $r = 1, \ldots, p$ and $a_r = (0, \ldots, 0, 1, 0, \ldots, 0)'$ is a p-vector with all zeros except the r-th place, which is one. Let $p^*$ be the number of characteristics for which

(7.1)  $(p/n)\hat{b}_{n,n}(N\bar{x}_r^2/s_{rr}) > A_\alpha^2,$

(7.2)  $(p/n)\hat{b}_{n,n}[N_1N_2/(N_1+N_2)](\bar{x}_{1r} - \bar{x}_{2r})^2/s_{rr} > A_\alpha^2,$

and

(7.3)  $(p\hat{b}_{n,n})(a_r'Ba_r)/(a_r'Sa_r) > c_\alpha/(1 - c_\alpha),$

for the three cases respectively, where $A_\alpha^2$ is defined in (5.1) and $c_\alpha$ in Subsection 5.3. All the testing procedures, confidence intervals, etc., may now be carried out with N observations and $p^*$ characteristics. Other values of $p^*$ around it may also be tried for separation of the groups, etc. If the selected $p^*$ is less than n, then the usual multivariate methods apply. However, if $p^* \approx n$ or $p^* > n$, then the methods proposed in this paper apply. This method of selection of variables is illustrated in Example 8.2 in Section 8.
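The following is a small sketch of the two-sample screening rule (7.2). It reflects our reading of $\hat{b}_{n,n}$ as the estimator (2.5) computed from the n characteristics with the largest univariate statistics, with p replaced by n; the names and this interpretation are assumptions for illustration.

```python
# Hedged sketch of the two-sample variable-selection rule (7.2).
import numpy as np
from scipy.stats import norm

def select_genes_two_sample(X1, X2, alpha=0.05):
    (N1, p), N2 = X1.shape, X2.shape[0]
    n = N1 + N2 - 2
    x1bar, x2bar = X1.mean(axis=0), X2.mean(axis=0)
    V = (X1 - x1bar).T @ (X1 - x1bar) + (X2 - x2bar).T @ (X2 - x2bar)
    S = V / n
    # univariate two-sample statistic for each gene r
    stat = (N1 * N2 / (N1 + N2)) * (x1bar - x2bar)**2 / np.diag(S)
    top_n = np.argsort(stat)[::-1][:n]            # n genes with the largest values
    Sn = S[np.ix_(top_n, top_n)]
    trSn, trSn2 = np.trace(Sn), np.trace(Sn @ Sn)
    # b_hat_{n,n}: (2.5) applied to the selected n x n covariance submatrix
    b_hat = (trSn / n)**2 / (n**2 / ((n - 1) * (n + 2)) * (trSn2 - trSn**2 / n) / n)
    A2_alpha = n + np.sqrt(2 * n) * norm.ppf(1 - alpha)
    keep = np.flatnonzero((p / n) * b_hat * stat > A2_alpha)
    return keep                                    # indices of the p* selected genes
```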

8. An example
Alon et al. (1999) used Affymetrix oligonucleotide arrays to monitor absolute
measurements on expressions of over 6500 human gene expressions in 40 tumour
and 22 normal colon tissues. Alon et al. considered only 2000 genes with highest
minimal intensity across the samples. We therefore consider only data on these
2000 genes on 40 patients and 22 normal subjects.
To analyze the data, we first calculate the pooled sample covariance matrix
assuming that the two populations of tumour and normal tissues have the same
covariance matrix Σ. Although, this assumption can be ascertained by the test
statistic given in Section 6, it may not be necessary at this stage as we will be
testing for it after the selection of variables. From the pooled sample covariance
matrix S, we found that $\hat{b}_{60,60} = 0.1485$. Using the formula (7.2) in Section 7, we select 103 characteristics. For testing the equality of means, we calculated $\hat{b}_{60,103} = 0.110$, and

(8.1)  $\dfrac{N_1N_2}{N_1+N_2}\,\dfrac{p}{n}\,\hat{b}_{60,103}\,(\bar{x}_1 - \bar{x}_2)'S^+(\bar{x}_1 - \bar{x}_2) = 94.760.$

Using the approximation $A_\alpha^2 \simeq \chi^2_{n,\alpha}$, we find that the value of $\chi^2_{60,0.05} =$
79.08. Hence, the p-value is 0.0028. Thus, the hypothesis of the equality of
the two means is rejected. Similarly, for testing the equality of the covariances
of the two populations, we calculate the sample covariances of the two popu-
lations, S1 and S2 on 39 and 21 degrees of freedom respectively. The value of
P2 = pb̂60,103 tr V1+ V2 = 523.01, where V1 = 39S1 and V2 = 21S2 . This is approx-
imately chi-square with 819 degrees of freedom. The p-value is 0.999 for this case,
and the hypothesis of the equality of covariances of the two groups is accepted.
Thus, the final analysis may be based on 103 characteristics with both groups
having the common covariance. We can also try a few other values of p if the
seperation of the two groups and equality of the two covariances are maintained.
We first tried p = 200. For this case, we find that b60,200 = 0.0096, and the
value of the statistic in (8.1) is 59.376, with a p-value of 0.4984. Thus, p = 200
does not provide the separation of the two groups. Next we tried p = 100. For
this case, b60,100 = 0.113, and the value of the statistic in (8.1) is 86.0128 with a
p-value of 0.015. Thus, p = 100 is a possible alternative, since the value of the
test statistic P2 for p∗ = 100 is 317.3079 with a p-value of one, accepting the hy-
pothesis of equality of the two covariances of the two groups. Next, we consider

the case when p = n = 60. We now have the case of p ≈ n. We found that one
eigenvalue of the sample covariance matrix is very small. Thus, we assume that
the population covariance matrix Σ is of rank 59 and apply the results given in
Theorem 2.1. The value of the test statistic is 21.1035 with a p-value of 0.04646.
The corresponding value of the test statistic P2 for p∗ = 60 is 666.0173 with a
p-value of one. Consequently, p = 60 is also a possibility on which we can base
our future analysis. However, it appears from the above analysis that p = 103 is
the best choice.

9. Concluding remarks
In this paper we define sample (squared) distance using the Moore-Penrose
inverse instead of any generalized inverse of the sample covariance matrix which
is singular due to fewer observations than the dimension. For normally dis-
tributed data, it is based on the sufficient statistic, while a sample (squared) distance
using any generalized inverse is not. These distances are used to propose tests in
one-sample, two-sample and multivariate analysis of variance. Simultaneous con-
fidence intervals for the mean parameters are given. Using the proposed methods,
a dataset is analyzed. This shows that the proposed test statistic performs well.
In addition, it provides confidence intervals which have been used to select the
relevant characteristics from any dataset.

Appendix A: Proof of Theorem 2.1


Here, we consider the case when the covariance matrix Σ is singular of rank
r ≤ n. That is, for an r × p matrix M , M M  = Ir , Σ = M  Dλ M where
Dλ = diag(λ1 , . . . , λr ), λi > 0, i = 1, . . . , r. Also, nS = V = Y Y  where the
n columns of Y are iid Np (0, Σ). Since Σ is of rank r, we can write Σ = BB 
where B : p × r of rank r. Hence, Y = BU , where the n columns of U are iid
Nr (0, I). Then, the matrix S is of rank r, and ρ(S) = ρ(Σ) = ρ(M  ) = ρ(H  )
where ρ(S) denotes the column vector space of S, and HV H  = diag(l1 , . . . , lr ),
H : r × p. Thus, there exists an r × r orthogonal matrix A such that M = AH.
Let u ∼ Np (0, I) and W ∼ Wp (I, n) be independently distributed.
Then, since Σ1/2 = M  Dλ M , M H  = A, and HM  = A , we get
1/2


N
x S + x = u Σ1/2 H  (HV H  )−1 HΣ1/2 u
n
= u M  Dλ M H  (HΣ1/2 W Σ1/2 H  )−1 HΣ1/2 u
1/2

= u M  Dλ A(A Dλ M W M  Dλ A)−1 A Dλ M u
1/2 1/2 1/2 1/2

= u M  (M W M  )−1 M u
= z  U −1 z

where z ∼ Nr (0, I) and U ∼ Wr (I, n). This proves Theorem 2.1.

To prove Theorem 2.2, we need the following lemma.



Lemma A.1. Let $\tilde{D} = \operatorname{diag}(\tilde{d}_1, \ldots, \tilde{d}_n)$ whose distribution is that of the n non-zero eigenvalues of a p × p matrix distributed as $W_p(I, n)$, $n \le p$. That is, the joint pdf of $\tilde{d}_1 > \cdots > \tilde{d}_n$ is given by

(A.1)  $\dfrac{\pi^{(1/2)n^2}}{2^{(1/2)pn}\,\Gamma_n\!\left(\frac{1}{2}n\right)\Gamma_n\!\left(\frac{1}{2}p\right)}\;\displaystyle\prod_{i<j}^{n}(\tilde{d}_i - \tilde{d}_j)\prod_{i=1}^{n}\tilde{d}_i^{(1/2)(p-n-1)}e^{-(1/2)\tilde{d}_i}.$

Let P be an n × n random orthogonal matrix with positive diagonal elements, distributed independently of $\tilde{D}$, with density given by

(A.2)  $\pi^{-(1/2)n^2}\,\Gamma_n\!\left(\tfrac{1}{2}n\right) g_n(P)\,dP, \qquad PP' = I,$

where

(A.3)  $g_n(P) = J(P'(dP) \to dP),$

the Jacobian of the transformation from $P'(dP)$ to dP as defined in Srivastava and Khatri (1979, p. 31). Then $U = P\tilde{D}P'$ is distributed as $W_n(I, p)$, $n \le p$.

Proof of Lemma A.1. The Jacobian of the transformation $J(P, \tilde{D} \to U)$ can be obtained from Theorem 1.11.5 of Srivastava and Khatri (1979, p. 31). It is given by

(A.4)  $J(P, \tilde{D} \to U) = \left[\displaystyle\prod_{i<j}^{n}(\tilde{d}_i - \tilde{d}_j)\,g_n(P)\right]^{-1}.$

The joint pdf of $\tilde{D}$ and P is given by
$$\frac{1}{2^{(1/2)pn}\,\Gamma_n\!\left(\frac{1}{2}p\right)}\,g_n(P)\prod_{i<j}^{n}(\tilde{d}_i - \tilde{d}_j)\prod_{i=1}^{n}\tilde{d}_i^{(1/2)(p-n-1)}e^{-(1/2)\tilde{d}_i}$$

(A.5)  $= \dfrac{1}{2^{(1/2)pn}\,\Gamma_n\!\left(\frac{1}{2}p\right)}\,g_n(P)\displaystyle\prod_{i<j}^{n}(\tilde{d}_i - \tilde{d}_j)\,|P\tilde{D}P'|^{(1/2)(p-n-1)}\operatorname{etr}\left(-\frac{1}{2}P\tilde{D}P'\right).$

Putting $U = P\tilde{D}P'$ and using the above Jacobian of the transformation given in (A.4), we get the pdf of U, which is $W_n(I, p)$, $n \le p$.

Corollary A.1. Let $w \sim N_n(0, I_n)$ and $\tilde{D} = \operatorname{diag}(\tilde{d}_1, \ldots, \tilde{d}_n)$, $\tilde{d}_1 > \cdots > \tilde{d}_n$, whose pdf is given by (A.1), and let w and $\tilde{D}$ be independently distributed. Then
$$\frac{p-n+1}{n}\,w'\tilde{D}^{-1}w \sim F_{n,p-n+1}.$$

Proof of Corollary A.1. Let P be an n × n random orthogonal matrix, distributed independently of $\tilde{D}$, whose pdf is given by (A.2). Then
$$w'\tilde{D}^{-1}w = w'P'(P\tilde{D}P')^{-1}Pw = z'U^{-1}z,$$
where $z \sim N_n(0, I_n)$ and $U \sim W_n(I, p)$ are independently distributed. Hence,
$$\frac{p-n+1}{n}\,w'\tilde{D}^{-1}w = \frac{p-n+1}{n}\,z'U^{-1}z \sim F_{n,p-n+1}.$$

Proof of Theorem 2.2. Since $\Sigma = \gamma^2 I = \wedge$ and since $T^{+2}$ is invariant under scalar transformation, we may assume without any loss of generality that $\gamma^2 = 1$. Hence,
$$A = (H\wedge H')^{-1/2} = (HH')^{-1/2} = I.$$
Thus, from (2.15), with a slight change of notation (w instead of z), it follows that
$$\frac{T^{+2}}{n} = w'L^{-1}w,$$
where $w \sim N_n(0, I)$ is distributed independently of the diagonal matrix L. Also, the diagonal elements of L are the non-zero eigenvalues of $YY'$, where the n columns of Y are iid $N_p(0, I)$. Thus, L is the diagonal matrix of the eigenvalues of $U = Y'Y \sim W_n(p, I_n)$. Using Corollary A.1, we thus have
$$\frac{p-n+1}{n}\,\frac{T^{+2}}{n} = \frac{p-n+1}{n}\,w'L^{-1}w = \frac{p-n+1}{n}\,z'U^{-1}z \sim F_{n,p-n+1}.$$

To prove Theorem 2.3, we need the results stated in the following lemma.

Lemma A.2. Let $V = YY' \sim W_p(\wedge, n)$, where the columns of Y are iid $N_p(0, \wedge)$. Let $l_1, \ldots, l_n$ be the n non-zero eigenvalues of $V = H'LH$, $HH' = I_n$, $L = \operatorname{diag}(l_1, \ldots, l_n)$, and let the eigenvalues of $W \sim W_n(I_n, p)$ be given by the diagonal elements of the diagonal matrix $D = \operatorname{diag}(d_1, \ldots, d_n)$. Then, in probability,

(a)  $\lim_{p\to\infty}\dfrac{Y'Y}{p} = \lim_{p\to\infty}\dfrac{\operatorname{tr}\wedge}{p}I_n = a_{10}I_n$,

(b)  $\lim_{p\to\infty}\dfrac{1}{p}L = a_{10}I_n$,

(c)  $\lim_{p\to\infty}\dfrac{1}{p}D = I_n$,

(d)  $\lim_{p\to\infty}H\wedge H' = (a_{20}/a_{10})I_n$,

(e)  $\lim_{n,p\to\infty}\dfrac{1}{n}a'H'Ha = \lim_{n,p\to\infty}\dfrac{a'\wedge a}{pa_1}$,

for a non-null vector $a = (a_1, \ldots, a_p)'$ of constants.

Proof of Lemma A.2. The n eigenvalues l1 , . . . , ln of the diagonal matrix


L are the n non-zero eigenvalues of $V = YY'$, where the n columns of the p × n matrix Y are iid $N_p(0, \wedge)$. The n non-zero eigenvalues of $YY'$ are also the n eigenvalues of $Y'Y$. Let U denote a p × n matrix whose n columns are iid $N_p(0, I)$. Then, the eigenvalues of $Y'Y$ are, in distribution, the eigenvalues of
$$U'\wedge U = \begin{pmatrix}u_1'\\ \vdots\\ u_n'\end{pmatrix}\wedge(u_1, \ldots, u_n) = \begin{pmatrix}u_1'\wedge u_1 & u_1'\wedge u_2 & \cdots & u_1'\wedge u_n\\ u_2'\wedge u_1 & u_2'\wedge u_2 & \cdots & u_2'\wedge u_n\\ \vdots & \vdots & \ddots & \vdots\\ u_n'\wedge u_1 & u_n'\wedge u_2 & \cdots & u_n'\wedge u_n\end{pmatrix}.$$
Let $U = (u_1, \ldots, u_n) = (u_{ij})$. Then the $u_{ij}$ are iid N(0, 1) and $u_1, \ldots, u_n$ are iid $N_p(0, I)$. Hence,
$$E(u_1'\wedge u_1) = \operatorname{tr}\wedge, \qquad E(u_1'\wedge u_2) = 0,$$
and
$$E(Y'Y) = E(U'\wedge U) = pa_1I_n,$$
where
$$a_1 = \operatorname{tr}\wedge/p, \qquad \lim_{p\to\infty}a_1 = a_{10}.$$

We also note that
$$E(u_{ij}^2) = 1, \qquad \operatorname{var}(u_{ij}^2) = 2.$$
Hence, from Chebyshev's inequality,
$$P\left\{\left|\frac{u_1'\wedge u_1}{p} - a_1\right| > \epsilon\right\} = P\left\{\frac{\left|\sum_{i=1}^{p}\lambda_i(u_{1i}^2 - 1)\right|}{p} > \epsilon\right\}
\le \frac{E\left[\sum_{i=1}^{p}\lambda_i(u_{1i}^2 - 1)\right]^2}{p^2\epsilon^2}
= \frac{E\left[\sum_{i=1}^{p}\lambda_i^2(u_{1i}^2 - 1)^2\right]}{p^2\epsilon^2}
= \frac{2\sum_{i=1}^{p}\lambda_i^2}{p^2\epsilon^2}.$$
Since $0 < \lim_{p\to\infty}(\operatorname{tr}\wedge^2/p) < \infty$, it follows that
$$\lim_{p\to\infty}\frac{\sum_{i=1}^{p}\lambda_i^2}{p^2} = 0.$$
Hence,
$$\lim_{p\to\infty}\frac{u_i'\wedge u_i}{p} = a_{10}, \qquad i = 1, \ldots, n,$$
in probability. Similarly, it can be shown that in probability
$$\lim_{p\to\infty}\frac{u_i'\wedge u_j}{p} = 0, \qquad i \neq j,$$
and
$$\lim_{p\to\infty}\frac{Y'Y}{p} = a_{10}I_n \quad\text{in probability}.$$
This proves (a). Also, if $l_1, \ldots, l_n$ denote the non-zero eigenvalues of $YY'$, then, from the above result, it follows that
$$\lim_{p\to\infty}\frac{1}{p}L = a_{10}I_n \quad\text{in probability}.$$

This proves (b).


It follows from the above results that if $d_1, \ldots, d_n$ are the non-zero eigenvalues of $\wedge^{-1/2}V\wedge^{-1/2}$ and $D = \operatorname{diag}(d_1, \ldots, d_n)$, then in probability

(A.6)  $\lim_{p\to\infty}\dfrac{1}{p}D = I_n,$

since $d_1, \ldots, d_n$ are the eigenvalues of $W \sim W_n(p, I)$. This proves (c). We note that
$$YY' = H'L^{1/2}GG'L^{1/2}H,$$
for any n × n orthogonal matrix G, $GG' = I_n$, depending on Y. Choosing $G = L^{1/2}HY(Y'Y)^{-1}$, we find that, in distribution,
$$Y = H'L^{1/2}G \sim N_{p,n}(0, \wedge, I_n).$$
Thus, in distribution,
$$GY'\wedge YG' = GU'\wedge^2UG' = L^{1/2}H\wedge H'L^{1/2},$$

where $U = (u_1, \ldots, u_n)$. We note that
$$E\left(\frac{u_i'\wedge^2u_j}{p}\right) = \delta_{ij}\,\frac{\operatorname{tr}\wedge^2}{p},$$
where $\delta_{ij} = 1$ if i = j and $\delta_{ij} = 0$ if $i \neq j$, $i, j = 1, \ldots, n$ (the Kronecker symbol). Similarly,
$$\operatorname{Var}\left(\frac{u_i'\wedge^2u_j}{p}\right) = \frac{2\operatorname{tr}\wedge^4}{p^2}, \quad i = j, \qquad = \frac{\operatorname{tr}\wedge^4}{p^2}, \quad i \neq j.$$
Since $\lim_{p\to\infty}\operatorname{tr}\wedge^4/p = a_{40}$ and $0 < a_{40} < \infty$, it follows that
$$\lim_{p\to\infty}\frac{\operatorname{tr}\wedge^4}{p^2} = 0.$$
Hence, in probability,
$$\lim_{p\to\infty}G\left(\frac{Y'\wedge Y}{p}\right)G' = \lim_{p\to\infty}\left(\frac{\operatorname{tr}\wedge^2}{p}\right)I_n = a_{20}I_n.$$
Thus, in probability,
$$\lim_{p\to\infty}\frac{L^{1/2}H\wedge H'L^{1/2}}{p} = a_{20}I_n.$$
Since $\lim_{p\to\infty}(L/p) = a_{10}I_n$, it follows that in probability
$$\lim_{p\to\infty}(H\wedge H') = (a_{20}/a_{10})I_n.$$

This proves (d).


To prove (e), consider a non-null p-vector $a = (a_1, \ldots, a_p)'$. Then, since $YY' = H'LH$, $HH' = I_n$, we get
$$\frac{a'YY'a}{pn} = \frac{a'H'LHa}{pn}.$$
With $Y = (y_1, \ldots, y_n)$ and $y_i = (y_{i1}, \ldots, y_{ip})'$, the left side equals
$$\frac{1}{pn}\sum_{i=1}^{n}(a'y_i)^2 = \frac{1}{pn}\sum_{i=1}^{n}\sum_{j=1}^{p}a_j^2y_{ij}^2 + \frac{2}{pn}\sum_{i=1}^{n}\sum_{j<k}^{p}a_ja_ky_{ij}y_{ik},$$
of which the second term goes to zero in probability.

Hence, in probability,
$$\lim_{n,p\to\infty}\frac{1}{pn}\sum_{i=1}^{n}\sum_{j=1}^{p}a_j^2y_{ij}^2 = \lim_{n,p\to\infty}\frac{a'H'LHa}{pn}.$$
From the law of large numbers, the left side goes to $\lim_{n\to\infty}\lim_{p\to\infty}(a'\wedge a/p)$, and from the results in (a), we have in probability $\lim_{p\to\infty}p^{-1}L = a_{10}I_n$. Hence, in probability,
$$\lim_{n,p\to\infty}\frac{a'H'Ha}{n} = \lim_{n,p\to\infty}\left(\frac{a'\wedge a}{p}\right)\Big/a_{10}.$$

Proof of Theorem 2.3. We note that $A = (H\wedge H')^{-1/2}$ and, from the proof of Lemma A.2(d),
$$(H\wedge H') \overset{d}{=} \left(\frac{L}{p}\right)^{-1/2}\frac{GU'\wedge^2UG'}{p}\left(\frac{L}{p}\right)^{-1/2},$$
where $GG' = I_n$, $U = (u_1, \ldots, u_n)$, and the $u_i$ are iid $N_p(0, I)$. Since the distribution of $T^{+2}$ is invariant under orthogonal transformations, we get from (2.15)
$$p\hat{b}\,\frac{T^{+2}}{n} = \hat{b}\,z'\left(\frac{ALA}{p}\right)^{-1}z
\overset{d}{=} \hat{b}\,z'\left(\frac{L}{p}\right)^{-1/2}A^{-2}\left(\frac{L}{p}\right)^{-1/2}z
\overset{d}{=} \hat{b}\,z'G\left(\frac{L}{p}\right)^{-1}\frac{U'\wedge^2U}{p}\left(\frac{L}{p}\right)^{-1}G'z.$$

Hence,

 
$$\lim_{n,p\to\infty} P_0\left\{\frac{(p/n)\hat{b}T^{+2} - n}{(2n)^{1/2}} < z_{1-\alpha}\right\}
= \lim_{n,p\to\infty} P_0\left\{\frac{z'G\dfrac{U'\wedge^2U}{pa_{20}}G'z - n}{(2n)^{1/2}} < z_{1-\alpha}\right\}
= \lim_{n,p\to\infty} P_0\left\{\frac{z'\dfrac{U'\wedge^2U}{pa_{20}}z - n}{(2n)^{1/2}} < z_{1-\alpha}\right\}.$$
Let $r_i$ be the eigenvalues of $U'\wedge^2U/(pa_{20})$. Then
$$r_i = 1 + O_p(p^{-1/2}),$$
and
$$\frac{1}{n}\sum_{i=1}^{n}r_i = \frac{\operatorname{tr}\wedge^2UU'}{npa_{20}} = 1 + O_p((np)^{-1/2}).$$
Hence,
$$\lim_{n,p\to\infty} P_0\left\{\frac{(p/n)\hat{b}T^{+2} - n}{(2n)^{1/2}} < z_{1-\alpha}\right\}
= \lim_{n,p\to\infty} P_0\left\{\frac{\sum_{i=1}^{n}r_iz_i^2 - n}{(2n)^{1/2}} < z_{1-\alpha}\right\}
= \lim_{n,p\to\infty} P_0\left\{\frac{\sum_{i=1}^{n}r_i(z_i^2 - 1)}{(2n)^{1/2}} + \frac{\sum_{i=1}^{n}r_i - n}{(2n)^{1/2}} < z_{1-\alpha}\right\}.$$
Now
$$\frac{\sum_{i=1}^{n}r_i - n}{(2n)^{1/2}} = \left(\frac{n}{2}\right)^{1/2}\left(\frac{\sum_{i=1}^{n}r_i}{n} - 1\right) = \left(\frac{n}{2}\right)^{1/2}\left(\frac{1}{np}\right)^{1/2}O_p(1) \to 0 \quad\text{as } (n,p)\to\infty.$$
Hence, from Lemma 2.1,
$$\lim_{n,p\to\infty} P_0\left\{\frac{(p/n)\hat{b}T^{+2} - n}{(2n)^{1/2}} < z_{1-\alpha}\right\} = \Phi(z_{1-\alpha}).$$

Proof of Theorem 2.4. The matrix ALA occurring in (2.15) can be rewritten as
$$ALA = (H\wedge H')^{-1/2}(HVH')(H\wedge H')^{-1/2} = (H\wedge H')^{-1/2}H\wedge^{1/2}W\wedge^{1/2}H'(H\wedge H')^{-1/2} = PWP',$$
where $W \sim W_p(I, n)$ and $P = (H\wedge H')^{-1/2}H\wedge^{1/2}$, $PP' = I_n$. Let G be an n × n orthogonal matrix such that
$$PWP' = GMG',$$
where $M = \operatorname{diag}(m_1, \ldots, m_n)$, $m_1 > \cdots > m_n$, are the ordered eigenvalues of $PWP'$. Then, if $d_1 > \cdots > d_n$ are the ordered non-zero eigenvalues of W, we get from the Poincaré separation theorem, see Rao (1973, p. 64), $m_i \le d_i$, $i = 1, \ldots, n$. Hence, in distribution,
$$\frac{T^{+2}}{n} = z'GM^{-1}G'z = v'M^{-1}v \ge v'D^{-1}v,$$
where $v \sim N_n(0, I)$ is distributed independently of D. Thus, from Corollary A.1,
$$F^+ \ge F_{n,p-n+1}.$$

Acknowledgements
I am grateful to Dr. Ekkehard Glimm of AICOS Technologies, Basel, Switzer-
land for kindly reading this article and offering many suggestions that greatly
improved the presentation. Thanks are also due to Professors Y. Fujikoshi, J.
Láuter, R. Pincus, M. Genton, and two anonymous referees for their helpful
comments. The calculation for the example in Section 8 was carried out by M.
Shakhatreh, that of Tables 1–3 by Meng Du, and that of Tables 4–6 by Yan Liu;
sincerest thanks to each of them. The research was supported by the Natural
Sciences and Engineering Research Council of Canada.

References

Alon, U., Notterman, D. A., Gish, K., Ybarra, S., Mack, D. and Levine, A. J. (1999). Broad
patterns of Gene expression revealed by clustering analysis of Tumor and Normal colon
tissues probed by Oligonucleotide Assays, Proceedings of the National Academy of Sciences,
96, 6745–6750.
Alter, O., Brown, P. O. and Botstein, D. (2000). Singular value decomposition for genome-wide
expression data processing and modelling, Proceedings of the National Academy of Sciences,
97, 10101–10106.
Bai, Z. and Saranadasa, H. (1996). Effect of high dimension: by an example of a two sample
problem, Statistica Sinica, 6, 311–329.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the False Discovery Rate: A practical and
powerful approach to multiple testing, J. Roy. Statist. Soc. Ser. B , 57, 289–300.
Dempster, A. P. (1958). A high dimensional two sample significance test, Ann. Math. Statist.,
29, 995–1010.
Dempster, A. P. (1960). A significance test for the separation of two highly multivariate small
samples, Biometrics, 16, 41–50.
Dudoit, S., Fridlyand, J. and Speed, T. P. (2002). Comparison of discrimination methods for
the classification of Tumors using gene expression data, J. Amer. Statist. Assoc., 97, 77–87.
Efron, B., Tibshirani, R., Storey, J. D. and Tusher, V. (2001). Empirical Bayes analysis of a
microarray experiment, J. Amer. Statist. Assoc., 96, 1151–1160.
Eisen, M. B., Spellman, P. T., Brown, P. O. and Botstein, D. (1998). Cluster analysis and dis-
play of genome-wide expression-patterns, Proceeding of the National Academy of Sciences,
95, 14863–14868.
Fujikoshi, Y., Himeno, T. and Wakaki, H. (2004). Asymptotic results of a high dimensional
MANOVA test and power comparison when the dimension is large compared to the sample
size, J. Japan. Statist. Soc., 34, 19–26.
Ibrahim, J., Chen, M. and Gray, R. J. (2002). Bayesian models for gene expression with DNA
microarray data, J. Amer. Statist. Assoc., 97, 88–99.
Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components
analysis, Ann. Statist., 29, 295–327.
Läuter, J., Glimm, E. and Kropf, S. (1998). Multivariate tests based on left-spherically dis-
tributed linear scores, Ann. Statist., 26, 1972–1988.
Ledoit, O. and Wolf, M. (2002). Some hypothesis tests for the covariance matrix when the
dimension is large compared to the sample size, Ann. Statist., 30, 1081–1102.
Lehmann, E. L. (1959). Testing Statistical Hypothesis, Wiley, New York.
Rao, C. R. (1973). Linear Statistical Inference and Its Applications, Wiley, New York.
Rao, C. R. and Mitra, S. K. (1971). Generalized Inverse of Matrices and Its Applications,
Wiley, New York.
Schott, J. R. (1997). Matrix Analysis for Statistics, Wiley, New York.
Simaika, J. B. (1941). On an optimum property of two important statistical tests, Biometrika,
32, 70–80.
Siotani, M., Hayakawa, T. and Fujikoshi, Y. (1985). Modern Multivariate Statistical Analysis:
A Graduate Course and Handbook , American Sciences Press, Inc., Columbus, Ohio, U.S.A.
Srivastava, M. S. (1970). On a class of non-parametric tests for regression parameters, J. Statist.
Res., 4, 117–131.
Srivastava, M. S. (1972). Asymptotically most powerful rank tests for regression parameters in
MANOVA, Ann. Inst. Statist. Math., 24, 285–297.
Srivastava, M. S. (2002). Methods of Multivariate Statistics, Wiley, New York.
Srivastava, M. S. (2003). Singular Wishart and Multivariate beta distributions, Ann. Statist.,
31, 1537–1560.
Srivastava, M. S. (2005). Some tests concerning the covariance matrix in High-Dimensional
data, J. Japan. Statist. Soc., 35, 251–272.
Srivastava, M. S. and Fujikoshi, Y. (2006). Multivariate analysis of variance with fewer obser-
vations than the dimension, J. Multivariate Anal., 97, 1927–1940.

Srivastava, M. S. and Khatri, C. G. (1979). An Introduction to Multivariate Statistics, North-


Holland, New York.
Srivastava, M. S. and von Rosen, D. (2002). Regression Models with unknown singular covari-
ance matrix, Linear Algebra and its Applications, 354, 255–273.
Srivastava, M. S., Hirotsu, C., Aoki, S. and Glimm, E. (2001). Multivariate one-sided tests,
Data Analysis from Statistical Foundations, (eds. Mohammad, A. K. and Saleh, E.), 387–
401, Nova Science Publishers Inc., New York.
