Multivariate Theory for Analyzing High Dimensional Data
M. S. Srivastava*
Key words and phrases: Distribution of test statistics, DNA microarray data, fewer
observations than dimension, multivariate analysis of variance, singular Wishart.
1. Introduction
In DNA microarray data, gene expressions are available on thousands of
genes of an individual, but there are only a few individuals in the dataset. For
example, in the data analyzed by Ibrahim et al. (2002), gene expressions were
available on only 14 individuals, of which 4 were normal tissues and 10 were
endometrial cancer tissues. But even after excluding many genes, the dataset
consisted of observations on 3214 genes. Thus, the observation matrix is a 3214×
14 matrix. Although these genes are correlated, this correlation is ignored in the analysis.
The empirical Bayes analysis of Efron et al. (2001) also ignores the correlations
among the genes in the analysis of their data. Similarly, in the comparison of
discrimination methods for the classification of tumors using gene expression
data, Dudoit et al. (2002) considered in their Leukemia example 6817 human
genes on 72 subjects from two groups, 38 from the learning set and 34 from the
test set. Although it gives rise to a 6817×6817 correlation matrix, they examined
a 72 × 72 correlation matrix. Most of the methods used in the above analyses
ignore the correlations among the genes.
The above cited papers use some criteria for reducing the dimension of the
data. For example, Dudoit et al. (2002) reduced the dimension of their Leukemia
data from 3571 to 40 before applying their methods of classification on
Received July 15, 2005. Revised October 21, 2005. Accepted January 25, 2006.
*Department of Statistics, University of Toronto, 100 St. George Street Toronto, Ontario, Canada
M5S 3G3. Email: [email protected]
the data. Methods for reducing the dimension in microarray data have also
been given by Alter et al. (2000) in their singular value decomposition methods.
Similarly, Eisen et al. (1998) have given a method of clustering the genes using a
measure similar to the Pearson correlation coefficient. Dimension reduction has
also been the objective of Efron et al. (2001) and Ibrahim et al. (2002). Such
reduction of the dimension is important in microarray data analysis because
even though there are observations on thousands of genes, there are relatively
few genes that reflect the differences between the two groups of data or several
groups of data. Thus, in analyzing any microarray data, the first step should be
to reduce the dimension. The theory developed in this paper provides a method
for reducing the dimension.
The sample covariance matrix for the microarray datasets on N individuals
with p genes in which p is larger than N is a singular matrix of rank n = N − 1.
For example if xik denotes the gene expression of the i-th gene on the k-th
individual, i = 1, . . . , p, k = 1, . . . , N , then the sample covariance matrix is of
the order p × p given by
    S = (sij),

where

    n sij = Σ_{k=1}^{N} (xik − x̄i)(xjk − x̄j),

and x̄i and x̄j are the sample means of the i-th and j-th genes' expressions,
x̄i = N⁻¹ Σ_{k=1}^{N} xik, x̄j = N⁻¹ Σ_{k=1}^{N} xjk. The corresponding p × p population
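For concreteness, the rank deficiency described here is easy to verify numerically; a minimal sketch with simulated data (an illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
p, N = 50, 10                       # many more "genes" than individuals
n = N - 1
X = rng.standard_normal((p, N))     # rows = genes, columns = individuals

xbar = X.mean(axis=1, keepdims=True)
V = (X - xbar) @ (X - xbar).T       # p x p matrix of corrected sums of products
S = V / n                           # sample covariance matrix

assert np.linalg.matrix_rank(S) == n   # rank is N - 1, not p
```

Even though S is p × p, its rank can never exceed n = N − 1, which is the source of all the singularity issues treated below.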
covariance matrix Σ will be assumed to be positive-definite. To understand the
difficulty in analyzing such a dataset, let us divide the p genes arbitrarily into two
groups, one consisting of n = N −1 genes and the other consisting of q = p−N +1
genes. We write the sample covariance matrix for the two partitioned groups as
    S = [ S11  S12 ]
        [ S12′ S22 ],    S22 = (s22ij),

and the squared sample multiple correlation between the i-th gene of the second
group and the first group is s22ii⁻¹ s12i′ S11⁻¹ s12i.
However, since the matrix S is of rank n, it follows from Srivastava and Khatri
(1979, p. 11) that
    S22 = S12′ S11⁻¹ S12,

giving s22ii = s12i′ S11⁻¹ s12i. Thus, the sample multiple correlation between any
gene i in the second set with the first set is always equal to one for all i =
1, . . . , p − n. Similarly, all the n sample canonical correlations between the two
sets are one for n < (p − n). It thus appears that any inference procedure that
takes into account the correlations among the genes may be based on n genes or
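This degeneracy can be checked directly: with rank(S) = n, the squared sample multiple correlation of each gene in the second group with the first group comes out numerically equal to one. An illustrative sketch (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
p, N = 20, 8
n = N - 1                              # rank of the sample covariance matrix
X = rng.standard_normal((p, N))
Xc = X - X.mean(axis=1, keepdims=True)
S = Xc @ Xc.T / n                      # singular: rank n = N - 1 < p

S11 = S[:n, :n]                        # "first group" of n genes
for i in range(n, p):                  # every gene in the second group
    s12 = S[:n, i]
    r2 = s12 @ np.linalg.solve(S11, s12) / S[i, i]
    assert abs(r2 - 1.0) < 1e-6       # squared multiple correlation is 1
```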
2. One-sample problem
Let p-dimensional random vectors x1, ..., xN be independent and identically
distributed (hereafter referred to as iid) as normal with mean vector µ and
unknown nonsingular covariance matrix Σ. Such a distribution will be denoted
by x ∼ Np(µ, Σ). We shall assume that N ≤ p and consider the problem of
testing the hypothesis H : µ = 0 against the alternative A : µ ≠ 0. The sample
mean vector and the sample covariance matrix are respectively defined by
N
(2.1) x = N −1 xi , and S = n−1 V,
i=1
where
N
(2.2) V = (xi − x)(xi − x) , n = N − 1.
i=1
(2.3)    T⁺² = N x̄′S⁺x̄ = nN x̄′V⁺x̄.

Let

(2.4)    F⁺ = ((p − n + 1)/n)(T⁺²/n).
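The statistics (2.3) and (2.4) can be computed directly with the Moore–Penrose inverse; an illustrative sketch (the function name is mine):

```python
import numpy as np

def f_plus(X):
    """One-sample F+ statistic of (2.3)-(2.4) for a p x N data matrix X, n = N-1 < p."""
    p, N = X.shape
    n = N - 1
    xbar = X.mean(axis=1)
    Xc = X - xbar[:, None]
    V = Xc @ Xc.T                                       # n * S
    Tplus2 = n * N * xbar @ np.linalg.pinv(V) @ xbar    # (2.3)
    return (p - n + 1) / n * Tplus2 / n                 # (2.4)

rng = np.random.default_rng(2)
F = f_plus(rng.standard_normal((100, 20)))
```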
Thus, for n < p, we propose the statistic F⁺, or equivalently the T⁺² statistic,
for testing the hypothesis that µ = 0 against the alternative that µ ≠ 0. It
is shown in Subsection 2.1 that F⁺ is invariant under the linear transformation
xi → cΓxi, c ≠ 0, and ΓΓ′ = Ip. Let

(2.6)    lim_{n,p→∞} P{(n/2)^{1/2}(p(p − n + 1)⁻¹b̂F⁺ − 1) ≤ z1−α} = Φ(z1−α),

and

(2.7)    lim_{n,p→∞} P{c_{p,n}(n/2)^{1/2}(b̂F⁺ − 1) < z1−α} = Φ(z1−α),

where we choose

    c_{p,n} = [(p − n + 1)/(p + 1)]^{1/2}

for fast convergence to normality; see Corollary 2.1.
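The standardized statistic on the left of (2.7) is straightforward to evaluate; a sketch (function name and input values are mine):

```python
import numpy as np

def standardized_f_plus(F_plus, b_hat, n, p):
    """Left-hand statistic of (2.7): c_{p,n} * sqrt(n/2) * (b_hat * F_plus - 1)."""
    c_pn = np.sqrt((p - n + 1) / (p + 1))
    return c_pn * np.sqrt(n / 2) * (b_hat * F_plus - 1)

# Reject H: mu = 0 at level alpha when the statistic exceeds z_{1-alpha}.
z = standardized_f_plus(F_plus=1.4, b_hat=0.9, n=40, p=200)
```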
When
    µ = (1/(nN))^{1/2} δ,

where δ is a vector of constants, then the asymptotic power of the F⁺ test, as
given in Subsection 2.3, is given by

    β(F⁺) = lim_{n,p→∞} P{(nb̂p(p − n + 1)⁻¹F⁺ − n)/(2n)^{1/2} > z1−α | µ}
          ≅ Φ(−z1−α + (n/p)^{1/2} nµ′∧µ/((2p)^{1/2}a2)).
The asymptotic power of the TD test as given by Bai and Saranadasa (1996) is
given by

    β(TD) ≅ Φ(−z1−α + nµ′µ/(2pa2)^{1/2}).

Thus, when Σ = γ²I,

    β(TD) = Φ(−z1−α + nµ′µ/((2p)^{1/2}γ²)),

and

    β(F⁺) = Φ(−z1−α + (n/p)^{1/2} nµ′µ/((2p)^{1/2}γ²)).
Thus, in this case, Dempster's test is superior to the F⁺ test unless (n/p) → 1.
This is as expected, since Dempster's test is uniformly most powerful among all
tests whose power depends on (µ′µ/γ²); see Simaika (1941). This test is also
invariant under the linear transformations xi → cΓxi, c ≠ 0, ΓΓ′ = Ip. In other
cases, F⁺ may be preferred if
(2.8)    (n/(pa2))^{1/2} µ′∧µ > µ′µ.

For example, if µ ∼ Np(0, ∧), then on the average (2.8) implies that

    (n/(pa2))^{1/2} tr ∧² > tr ∧,

that is,

    n > (a1²/a2)p = bp.

Since a1² < a2, such an n exists. Similarly, if µi = λi^{1/2}, the same inequality is
obtained and F⁺ will have better power than the Dempster test.
Next, we compare the power of the TD and F⁺ tests by simulation, where the
F⁺ statistic in (2.7) is used. From the asymptotic expressions of the power
given above and in (2.29), it is clear that asymptotically the tests TD and TBS
(defined in equation (2.26)) have the same power. Thus, in our comparison
of power we include only Dempster's test TD, as it is also known that if
Σ = σ²Ip, then the test TD is the best invariant test under the transformation
x → cΓx, c ≠ 0, ΓΓ′ = Ip, among all tests whose power depends on µ′µ/σ²,
irrespective of the size of n and p. Therefore it is better than Hotelling's
T²-test (when n > p), TBS and the F⁺ tests. However, when Σ ≠ σ²I, no
ordering between the two tests TD and F⁺ exists. Thus, we shall compare the
power of the TD test with the F⁺ test by simulation when the covariance matrix
Σ = diag(d1, ..., dp) ≠ σ²Ip. The diagonal elements d1, ..., dp are obtained as
an iid sample from several distributions. However, once the values of d1 , . . . , dp
are obtained by a single simulation, they are kept fixed in our simulation study.
Similarly, the values of the elements of the mean vector µ for the alternative
hypothesis are obtained as iid observations from some distributions. Once these
values are obtained, they are also held fixed throughout the simulation. In order
that both tests have the same significance level, we obtain by simulation the cut-
off points for the distribution of the statistic under the hypothesis. For example,
for the statistic F + , we obtain Fα+ such that
    (# of F⁺ ≥ F⁺α)/1000 = α,

where F⁺ is calculated from the (n + 1) samples from Np(0, Σ) for each of the 1000
replications. The power is then calculated from the (n+1) samples from Np (µ, Σ)
replicated again 1000 times. We have chosen α to be 0.05. The powers of the two
tests are shown in Tables 1–3. The mean vectors for the alternative are obtained
as
In our power comparison, we have used the statistic given in Corollary 2.1.
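The calibration just described, holding Σ = diag(d1, ..., dp) and µ fixed across replications and estimating the cut-off F⁺α from the null replications, can be sketched as follows (a simplified illustration, not the author's code):

```python
import numpy as np

def f_plus(X):
    # One-sample F+ of (2.3)-(2.4) for a p x N data matrix, n = N - 1 < p.
    p, N = X.shape
    n = N - 1
    xbar = X.mean(axis=1)
    Xc = X - xbar[:, None]
    V = Xc @ Xc.T
    return (p - n + 1) * N * (xbar @ np.linalg.pinv(V) @ xbar) / n

def simulated_cutoff_and_power(p, N, mu, d, alpha=0.05, reps=1000, seed=0):
    """Empirical cut-off of F+ under H (mu = 0), then power at mean vector mu.
    Sigma = diag(d) and mu are held fixed across replications, as in the text."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(np.asarray(d))[:, None]
    null = np.array([f_plus(sd * rng.standard_normal((p, N))) for _ in range(reps)])
    cutoff = np.quantile(null, 1 - alpha)
    mu = np.asarray(mu)[:, None]
    hits = [f_plus(mu + sd * rng.standard_normal((p, N))) >= cutoff for _ in range(reps)]
    return cutoff, float(np.mean(hits))
```

With µ = 0 the estimated "power" should be close to the nominal α, which is the attained-significance-level check reported in Tables 4–6.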
Table 1. Power, Σ = Ip .
             µ1                  µ2
  p    n     TD       F⁺        TD       F⁺
60 30 1.0000 0.9780 0.9987 0.8460
100 40 1.0000 1.0000 1.0000 0.9530
60 1.0000 1.0000 1.0000 1.0000
80 1.0000 1.0000 1.0000 0.9970
150 40 1.0000 1.0000 0.9317 0.9830
60 1.0000 1.0000 1.0000 1.0000
80 1.0000 1.0000 1.0000 1.0000
200 40 1.0000 1.0000 1.0000 0.9870
60 1.0000 1.0000 1.0000 1.0000
80 1.0000 1.0000 1.0000 1.0000
400 40 1.0000 1.0000 1.0000 0.9960
60 1.0000 1.0000 1.0000 1.0000
80 1.0000 1.0000 1.0000 1.0000
             µ1                  µ2
  p    n     TD       F⁺        TD       F⁺
60 30 0.6424 1.0000 0.2909 1.0000
100 40 0.9417 1.0000 0.5934 1.0000
60 0.9979 1.0000 0.8608 1.0000
80 1.0000 1.0000 0.9622 1.0000
150 40 0.9606 1.0000 0.6605 1.0000
60 0.9991 1.0000 0.9127 1.0000
80 1.0000 1.0000 0.9829 1.0000
200 40 0.9845 1.0000 0.7892 1.0000
60 0.9998 1.0000 0.9668 1.0000
80 1.0000 1.0000 0.9981 1.0000
400 40 1.0000 1.0000 0.9032 1.0000
60 1.0000 1.0000 0.9951 1.0000
80 1.0000 1.0000 0.9981 1.0000
             µ1                  µ2
  p    n     TD       F⁺        TD       F⁺
60 30 0.9541 1.0000 0.5420 1.0000
100 40 0.9998 1.0000 0.9394 1.0000
60 1.0000 1.0000 0.9983 1.0000
80 1.0000 1.0000 1.0000 1.0000
150 40 0.9999 1.0000 0.9622 1.0000
60 1.0000 1.0000 0.9999 1.0000
80 1.0000 1.0000 1.0000 1.0000
200 40 1.0000 1.0000 0.9925 1.0000
60 1.0000 1.0000 1.0000 1.0000
80 1.0000 1.0000 1.0000 1.0000
400 40 1.0000 1.0000 0.9996 1.0000
60 1.0000 1.0000 1.0000 1.0000
80 1.0000 1.0000 1.0000 1.0000
    P{F⁺ ≥ F⁺α} = α
Table 4. Attained significance level of the F⁺ test under H; sample from N(0, I).
n = 30 n = 40 n = 50 n = 60 n = 70 n = 80 n = 90
p = 100 0.077 0.062 0.057 0.058 0.078 0.098 0.117
p = 150 0.069 0.069 0.053 0.073 0.067 0.052 0.059
p = 200 0.053 0.052 0.053 0.047 0.056 0.048 0.039
p = 300 0.068 0.057 0.064 0.069 0.054 0.037 0.039
p = 400 0.071 0.064 0.053 0.067 0.053 0.048 0.048
Table 5. Attained significance level of the F⁺ test under H; sample from Np(0, D),
D = diag(d1, ..., dp), where di ∼ U(2, 3).
n = 30 n = 40 n = 50 n = 60 n = 70 n = 80 n = 90
p = 100 0.072 0.060 0.055 0.062 0.071 0.096 0.108
p = 150 0.053 0.047 0.057 0.048 0.052 0.046 0.060
p = 200 0.062 0.060 0.053 0.058 0.050 0.039 0.047
p = 300 0.068 0.067 0.064 0.052 0.053 0.068 0.058
p = 400 0.071 0.061 0.067 0.052 0.051 0.058 0.061
Table 6. Attained significance level of the F⁺ test under H; sample from Np(0, D),
D = diag(d1, ..., dp), where di ∼ χ² with 2 df.
n = 30 n = 40 n = 50 n = 60 n = 70 n = 80 n = 90
p = 100 0.049 0.032 0.014 0.008 0.018 0.015 0.030
p = 150 0.028 0.017 0.017 0.015 0.025 0.006 0.002
p = 200 0.019 0.024 0.030 0.018 0.001 0.002 0.010
p = 300 0.043 0.030 0.013 0.013 0.008 0.028 0.012
p = 400 0.055 0.042 0.033 0.021 0.018 0.018 0.002
is used when n ≥ p, since S is positive definite with probability one and F has an
F -distribution with p and n−p+1 degrees of freedom. The F -test in (2.9) can be
interpreted in many ways. For example, x S −1 x is the sample (squared) distance
of the sample mean vector from the zero vector. It can also be interpreted as the
test based on the p sample principal components of the mean vector, since
    x̄′S⁻¹x̄ = x̄′H0′L0⁻¹H0x̄,

where

    S = H0′L0H0,   H0H0′ = Ip,

L0 = diag(l1, ..., lp), and H0x̄ is the vector of the p sample principal components
of the sample mean vector. It can also be shown to be equivalent to a test based
on (a x)2 where a is chosen such that a Sa = 1 and (a x)2 is maximized.
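The principal-component identity above is just the spectral decomposition of S; a quick numerical check (illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
p, N = 5, 50                        # n >= p, so S is nonsingular
X = rng.standard_normal((p, N))
xbar = X.mean(axis=1)
S = np.cov(X)                       # p x p, positive definite w.p. 1

l, H0t = np.linalg.eigh(S)          # S = H0' L0 H0 with H0 = H0t.T
lhs = xbar @ np.linalg.solve(S, xbar)
y = H0t.T @ xbar                    # the p sample principal components of xbar
rhs = y @ (y / l)                   # x' H0' L0^{-1} H0 x
assert abs(lhs - rhs) < 1e-10
```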
When n < p, S has a singular Wishart distribution, see Srivastava (2003)
for its distribution. For the singular case, a test corresponding to the F -test in
(2.9) can be proposed as,
F − = cN x S − x,
for some constant c, where S − is a generalized inverse of S and SS − S = S. No
such test has been proposed in the literature so far as it raises two obvious ques-
tions, namely which g-inverse to use and what is its distribution. For example,
the p × p sample covariance matrix S can be written as
    S = [ S11  S12 ]
        [ S12′ S22 ];

see Rao (1973, p. 27), Rao and Mitra (1971, p. 208), Schott (1997), or Siotani et
al. (1985, p. 595). In this case,

    F⁻ = cN x̄1′S11⁻¹x̄1,
(2.10)    nS = V = H′LH,

(2.11)    S⁺ = nH′L⁻¹H.

Thus, we define the sample (squared) distance of the sample mean vector from
the zero vector by

    D⁺² = x̄′S⁺x̄,

the maximizing vector in (a′x̄)² being a = S⁺x̄/(x̄′S⁺x̄)^{1/2}.
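Relation (2.11) gives the Moore–Penrose inverse of S from the n nonzero eigenvalues of V; a numerical check against a general-purpose pseudoinverse (illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
p, N = 30, 10
n = N - 1
X = rng.standard_normal((p, N))
Xc = X - X.mean(axis=1, keepdims=True)
V = Xc @ Xc.T                       # nS, rank n < p

w, Q = np.linalg.eigh(V)
keep = w > 1e-8                     # the n nonzero eigenvalues l_1, ..., l_n
H = Q[:, keep].T                    # n x p, HH' = I_n, V = H' L H
S_plus = n * H.T @ np.diag(1.0 / w[keep]) @ H       # (2.11)

assert np.allclose(S_plus, np.linalg.pinv(V / n), atol=1e-8)
```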
Hence, for testing the hypothesis that µ = 0 against the alternative that µ ≠ 0,
we propose the test statistic

(2.13)    F⁺ = [(max(p, n) − min(p, n) + 1)/(n min(p, n))] N x̄′S⁺x̄
            = ((p − n + 1)/n²) N x̄′S⁺x̄,      if n < p,
            = ((n − p + 1)/(np)) N x̄′S⁻¹x̄,   if n ≥ p,

which is the same as F defined in (2.9) when the sample covariance matrix is
nonsingular. Thus, when n < p, we propose the statistics T⁺² or F⁺, as defined
in (2.3) and (2.4) respectively, for testing the hypothesis that µ = 0 vs µ ≠ 0.
We note that the statistic T⁺² is invariant under the transformation xi → cΓxi,
where c ∈ R(0), c ≠ 0, and Γ ∈ Op; R(0) denotes the real line without zero and
Op denotes the group of p × p orthogonal matrices. Clearly cΓ ∈ Glp. To show
the invariance, we note that

    T⁺²/n = N x̄′V⁺x̄
          = N x̄′H′L⁻¹Hx̄
          = N x̄′Γ′ΓH′L⁻¹HΓ′Γx̄
          = N x̄′Γ′(ΓVΓ′)⁺Γx̄,

since the eigenvalues of ΓVΓ′ are the same as those of V. The invariance under
scalar transformation obviously holds.
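The claimed invariance under xi → cΓxi is easy to confirm numerically (an illustration):

```python
import numpy as np

def t_plus2(X):
    # T+^2 = N xbar' S+ xbar for a p x N sample, via the Moore-Penrose inverse.
    p, N = X.shape
    xbar = X.mean(axis=1)
    Xc = X - xbar[:, None]
    V = Xc @ Xc.T
    return (N - 1) * N * xbar @ np.linalg.pinv(V) @ xbar

rng = np.random.default_rng(5)
p, N = 40, 12
X = rng.standard_normal((p, N))
G, _ = np.linalg.qr(rng.standard_normal((p, p)))   # random orthogonal Gamma
c = 2.7
assert np.isclose(t_plus2(X), t_plus2(c * G @ X))  # invariance under x -> c*Gamma*x
```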
In the above theorem, it is assumed that the covariance matrix Σ is not only
singular but is of rank r ≤ n. At the moment, no statistical test is available
to check this assumption. However, in practical applications, we look at the
eigenvalues of the sample covariance matrix S, and delete the eigenvalues that
are zero or very small, as is done in the selection of principal components.
Tests are available in Srivastava (2005) to check if Σ = γ²I, γ² unknown,
or if Σ = diag(λ1, ..., λp), under the assumption that n = O(p^δ), 0 < δ ≤ 1.
A test for the first hypothesis, that of sphericity, is also available in Ledoit
and Wolf (2002) under the assumption that n = O(p). More efficient tests can
be constructed based on the statistics (N x̄′x̄/tr S) and N x̄′DS⁻¹x̄, depending
upon which of the above two hypotheses is true; here DS = diag(s11, ..., spp),
S = (sij). Thus the F⁺ test may not be used when the covariance matrix is
either a constant times an identity matrix or a diagonal matrix. Nevertheless,
we derive the distribution of the F + test when Σ = γ 2 I, and γ 2 is unknown
because in this case we obtain an exact distribution which may serve as a basis
for comparison when only asymptotic or approximate distributions are available.
Theorem 2.2. Let the F + statistic be as defined in (2.4). Then when the
covariance matrix Σ = γ 2 I, and γ 2 is unknown, the F + statistic is distributed
under the hypothesis H, as an F -distribution with n and p − n + 1 degrees of
freedom, n ≤ p.
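Theorem 2.2 can be checked by simulation: under Σ = γ²I and µ = 0, the statistic F⁺ should behave exactly like an F_{n,p−n+1} variable. A sketch (the tolerance is deliberately generous):

```python
import numpy as np

def f_plus(X):
    p, N = X.shape
    n = N - 1
    xbar = X.mean(axis=1)
    V = (X - xbar[:, None]) @ (X - xbar[:, None]).T
    return (p - n + 1) * N * (xbar @ np.linalg.pinv(V) @ xbar) / n

rng = np.random.default_rng(6)
p, N, gamma = 25, 8, 3.0            # n = 7 < p, Sigma = gamma^2 * I, mu = 0
n = N - 1
stats = np.array([f_plus(gamma * rng.standard_normal((p, N))) for _ in range(2000)])

# The mean of F_{n, m} with m = p - n + 1 = 19 is m/(m - 2) = 19/17.
assert abs(stats.mean() - 19 / 17) < 0.1
```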
    Σ = ∧ = diag(λ1, ..., λp).

Let

    z ∼ Nn(0, I),

and

(2.15)    T⁺²/n = N x̄′V⁺x̄ = z′(ALA)⁻¹z,
where
Let
(2.19)    â2 = [n²/((n − 1)(n + 2))][tr S²/p − (p/n)(tr S/p)²]

and

(2.20)    â1 = (tr S/p),

respectively; see Srivastava (2005). Thus, clearly (tr S²/p) is not a consistent
estimator of a2 unless p is fixed and n → ∞. Thus, under the assumption (2.16),
a consistent estimator of b as (n, p) → ∞ is given by

(2.21)    b̂ = â1²/â2.
    lim_{n,p→∞} P{(n/2)^{1/2}(p(p − n + 1)⁻¹b̂F⁺ − 1) ≤ z1−α} = Φ(z1−α).
Corollary 2.1. Let n = O(p^δ), 0 < δ < 1. Then under the condition (2.16)
and when µ = 0,

    lim_{n,p→∞} P{c_{p,n}(n/2)^{1/2}(b̂F⁺ − 1) < z1−α} = Φ(z1−α),

where

    c_{p,n} = [(p − n + 1)/(p + 1)]^{1/2}.
Theorem 2.4. Let F_{n,p−n+1}(α) be the upper 100α% point of the F-distri-
bution with n and (p − n + 1) degrees of freedom. Then, under the alter-
native,

(2.22)    µ = E(x̄) = (1/(nN))^{1/2} δ.
From (2.15),

    T⁺²/n = z′(ALA)⁻¹z,

where, given H, the first row of Γ is δ′H′A/(δ′H′A²Hδ)^{1/2}. Then, given H,

    w = Γz ∼ Nn((θ, 0′)′, I),

where, as before,

    lim_{p→∞}(pbT⁺²/n) = lim_{p→∞} z′z = lim_{p→∞} w′w,

and

    − Φ(−z1−α + nµ′µ/(2pa2)^{1/2}) = 0;
see Bai and Saranadasa (1996). This also gives the asymptotic distribution of
Dempster’s test under the hypothesis.
and is an unbiased and ratio-consistent estimator of tr Σ². This test is also in-
variant under the transformation xi → cΓxi, c ≠ 0, ΓΓ′ = I, as was the Dempster
test TD. To obtain the distribution of TBS, we need the following lemma.
Then for any iid random variables ui with mean zero and variance one,

    lim_{n→∞} P{Σ_{i=1}^{n} ain ui < z} = Φ(z).

Thus, if

(2.27)    lim_{p→∞} max_{1≤i≤p} λi/(tr Σ²)^{1/2} = 0,

then

    lim_{n,p→∞} P{TBS ≤ z} = Φ(z).
Since

    lim_{n,p→∞} var[(N/n)^{1/2}δ′x̄/(2 tr Σ²)^{1/2}] = lim_{n,p→∞} δ′∧δ/(2n tr Σ²) = lim_{n,p→∞} θ0²/(2n) = 0,

it follows that

    lim_{n,p→∞} [ P{ ((N x̄′x̄ − tr S)/(2pâ2)^{1/2})(pâ2/tr Σ²)^{1/2} > z1−α − δ′δ/(n(2pa2)^{1/2}) }
                 − Φ( −z1−α + δ′δ/(n(2pa2)^{1/2}) ) ] = 0.
Thus, the asymptotic power of the BS test under condition (2.28) is given by

(2.29)    β(TBS) ≅ Φ(−z1−α + nµ′µ/(2pa2)^{1/2}).
    TLGK = [(n − k + 1)/(nk)] N x̄′O1(O1′SO1)⁻¹O1′x̄ ∼ F_{k,n−k+1},   k ≤ n.
The non-null distribution of this statistic is not available. A comparison
by simulation also appears difficult as no guidance is available as to how many
components to include. For example, if we choose only one component from the
n + 1 components, denoted by a, then one may use the t²-test given by
3. Two-sample F + test
In this section, we consider the problem of testing that the mean vector
µ1 and µ2 of two populations with common covariance Σ are equal. For this,
we shall define the sample (squared) distance between the two populations with
sample mean vectors x1 and x2 and pooled sample covariance matrix S by
    D⁺² = (x̄1 − x̄2)′S⁺(x̄1 − x̄2).
Thus, for testing the equality of the two mean vectors, the test statistic F +
becomes
    F⁺ = ((p − n + 1)/n²)(1/N1 + 1/N2)⁻¹(x̄1 − x̄2)′S⁺(x̄1 − x̄2),   n ≤ p,
where x11 , . . . , x1N1 are iid Np (µ1 , Σ), x21 , . . . , x2N2 are iid Np (µ2 , Σ), Σ > 0,
x1 and x2 are the sample mean vectors,
    nS = Σ_{i=1}^{N1}(x1i − x̄1)(x1i − x̄1)′ + Σ_{i=1}^{N2}(x2i − x̄2)(x2i − x̄2)′,
and
n = N1 + N2 − 2.
All the results obtained in Theorems 2.1 to 2.4 for the one-sample case are also
available for the two-sample case, except that T⁺² is now defined as

    T⁺² = [N1N2/(N1 + N2)](x̄1 − x̄2)′S⁺(x̄1 − x̄2).
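The two-sample statistic defined above can be computed in the same way as in the one-sample case (an illustrative sketch, not the author's code):

```python
import numpy as np

def two_sample_f_plus(X1, X2):
    """Two-sample F+ for p x N1 and p x N2 samples with n = N1 + N2 - 2 <= p."""
    p, N1 = X1.shape
    _, N2 = X2.shape
    n = N1 + N2 - 2
    d = X1.mean(axis=1) - X2.mean(axis=1)
    C1 = X1 - X1.mean(axis=1, keepdims=True)
    C2 = X2 - X2.mean(axis=1, keepdims=True)
    S = (C1 @ C1.T + C2 @ C2.T) / n        # pooled covariance, singular if n < p
    return (p - n + 1) / n**2 / (1/N1 + 1/N2) * d @ np.linalg.pinv(S) @ d

rng = np.random.default_rng(8)
F = two_sample_f_plus(rng.standard_normal((60, 10)), rng.standard_normal((60, 12)))
```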
and within sum of squares or sum of squares due to error will be denoted by
    V = nS = Σ_{i=1}^{k} Σ_{j=1}^{Ni} (xij − x̄i)(xij − x̄i)′,

where n = N − k.
    B = UU′,

where etr(A) stands for the exponential of the trace of the matrix A; see Srivastava
and Khatri (1979, p. 54). Under the hypothesis of the equality of the k mean vectors,
θ = 0 and U ∼ Np,m(0, Σ, Im).
Similarly, the sum of squares due to error given in (2.19) can be written as
V = nS = Y Y ,
where Y : p × n, the n columns of Y are iid Np (0, Σ). It is also known that B
and V are independently distributed. Let
    HVH′ = L = diag(l1, ..., ln),   n ≤ p,
where
and the m columns of Z are iid Nn (0, I) under the hypothesis that µ = 0.
From Srivastava and von Rosen (2002), the results corresponding to the case
when Σ is singular become available. Thus, we get the following theorem.
where η = rm(r² + m − 5)/48 and χ²_f denotes a chi-square random variable with
f degrees of freedom.

    lim_{n,p→∞} P{(−pb̂ log U⁺ − mn)/(2mn)^{1/2} < z1−α} = Φ(z1−α).
For some other tests and power comparisons, see Srivastava and Fujikoshi (2006);
for the case (p/n) → c, c ∈ (0, ∞), see Fujikoshi et al. (2004).
5. Confidence intervals
When a hypothesis is rejected, it is desirable to find out which component or
components may have caused the rejection. In the context of DNA microarrays,
it will be desirable to find out which genes have been affected after the radiation
treatment. Since p is very large, the confidence intervals obtained by the Bonfer-
roni inequality, which corresponds to the maximum of t-tests, are of no practical
value as they give very wide confidence intervals.
As an alternative, many researchers such as Efron et al. (2001) use the 'False
Discovery Rate', FDR, proposed by Benjamini and Hochberg (1995), although
the conditions required for its validity may not hold in many cases. On the other
hand, we may use approximate confidence intervals for selecting the variables or
to get some idea as to the variable or variables that may have caused the rejection
of the hypothesis. These kinds of confidence intervals use the fact that
for any real-valued function g(·) of the p-vector a, where H : p × n, and Rp is the
p-dimensional Euclidean space. Thus, it gives a confidence coefficient less than
100(1 − α)%. However, we shall assume that they are approximately equal.
where z1−α denotes the upper 100α% point of the standard normal distribution.
It may be noted that A²α ≅ χ²n,α. In the notation and assumptions of Section
2, approximate simultaneous confidence intervals for linear combinations a µ at
100(1 − α)% confidence coefficient are given by
when a′Sa ≠ 0 and a ∈ ρ(H′). Here n = N1 + N2 − 2, and S is the pooled estimate
of the covariance matrix Σ. Thus, approximate simultaneous confidence intervals
for the differences in the components of the means µ1i − µ2i are given by
    (x̄1i − x̄2i) ± (1/N1 + 1/N2)^{1/2}(n/p)^{1/2} b̂^{−1/2}(sii)^{1/2} Aα,   i = 1, ..., p,
where V is defined in the beginning of Section 4, and cα/(1 − cα) is the upper
100α% point of the distribution of the largest eigenvalue of W⁻¹U. The value of
cα can be obtained from Table B.7 in Srivastava (2002) with p0 = n, m0 = k − 1, and
n corresponds to p. The vector a of constants can be chosen as in the previous
two subsections.
and

(6.4)    P3 = pb̂ tr(V1⁺V(1)),

which are approximately distributed as χ²_{n1n2} and χ²_{n1n(1)}, respectively. The
tests in (6.1) and (6.2) are arbitrary. To make them less arbitrary, we may define

    T2^(i) = [pb̂ tr Vi⁺V(3−i) − n1n2]/(2n1n2)^{1/2},   i = 1, 2,

and

    T3^(i) = [pb̂ tr Vi⁺V(i) − nin(i)]/(2nin(i))^{1/2},   i = 1, ..., k,

which may be considered for testing the equality of the k covariance matrices. The
hypothesis of the equality of the two covariance matrices is rejected if

    R2 > z1−α/4,

and that of the k covariance matrices if

    R3 > z1−α/2k.
7. Selection of variables
In order to have good power for any test in multivariate analysis, it is impor-
tant to have as few characteristics as possible. In particular, those variables that
have no discriminating ability should be removed. We begin with b̂n,n based on
the n characteristics chosen, for the three cases, as the n largest values of N x̄r²/srr,
[N1N2/(N1 + N2)](x̄1r − x̄2r)²/srr, and (ar′Bar)/(ar′Sar), where r = 1, ..., p
and ar = (0, ..., 0, 1, 0, ..., 0)′ is a p-vector of zeros except for the r-th place, which
is one. Let p* be the number of characteristics for which
for the three cases respectively, where A²α is defined in (5.1) and cα in Subsection
5.3. All the testing procedures, confidence intervals, etc., may now be carried
out with N observations and p* characteristics. Other values of p* around it may
also be tried for separation of groups, etc. If the selected p* is less than n, then
the usual multivariate methods apply. However, if p* ≈ n or p* > n, then the
methods proposed in this paper apply. This method of selection of variables is
illustrated in Example 8.2 in Section 8.
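The screening just described can be sketched for the two-sample case as follows; since the cut-off of (7.2) is not reproduced in this chunk, a fixed number of retained characteristics is used instead (my simplification):

```python
import numpy as np

def select_two_sample(X1, X2, p_star):
    """Rank characteristics by [N1*N2/(N1+N2)] (x1bar_r - x2bar_r)^2 / s_rr
    and keep the p_star largest; a simplified version of the Section 7 screening."""
    N1, N2 = X1.shape[1], X2.shape[1]
    n = N1 + N2 - 2
    d = X1.mean(axis=1) - X2.mean(axis=1)
    C1 = X1 - X1.mean(axis=1, keepdims=True)
    C2 = X2 - X2.mean(axis=1, keepdims=True)
    s = ((C1**2).sum(axis=1) + (C2**2).sum(axis=1)) / n   # pooled variances s_rr
    t2 = (N1 * N2 / (N1 + N2)) * d**2 / s
    return np.argsort(t2)[::-1][:p_star]                  # indices of selected genes

rng = np.random.default_rng(9)
X1 = rng.standard_normal((500, 10))
X2 = rng.standard_normal((500, 12))
X2[:5] += 5.0                          # the first five "genes" truly differ
idx = select_two_sample(X1, X2, p_star=20)
```

The remaining analysis (testing, confidence intervals) would then use only the p* retained characteristics.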
8. An example
Alon et al. (1999) used Affymetrix oligonucleotide arrays to monitor absolute
measurements on expressions of over 6500 human gene expressions in 40 tumour
and 22 normal colon tissues. Alon et al. considered only 2000 genes with highest
minimal intensity across the samples. We therefore consider only data on these
2000 genes on 40 patients and 22 normal subjects.
To analyze the data, we first calculate the pooled sample covariance matrix
assuming that the two populations of tumour and normal tissues have the same
covariance matrix Σ. Although, this assumption can be ascertained by the test
statistic given in Section 6, it may not be necessary at this stage as we will be
testing for it after the selection of variables. From the pooled sample covariance
matrix S, we found that b60,60 = 0.1485. Using the formula (7.2) in Section 7,
we select 103 characteristics. For testing the equality of means, we calculated
b60,103 = 0.110, and
(8.1)    [N1N2/(N1 + N2)](p/n) b̂60,103 (x̄1 − x̄2)′S⁺(x̄1 − x̄2) = 94.760.
Using the approximation A²α ≅ χ²n,α, we find that the value of χ²60,0.05 =
79.08. Hence, the p-value is 0.0028. Thus, the hypothesis of the equality of
the two means is rejected. Similarly, for testing the equality of the covariances
of the two populations, we calculate the sample covariances of the two popu-
lations, S1 and S2 on 39 and 21 degrees of freedom respectively. The value of
P2 = pb̂60,103 tr V1+ V2 = 523.01, where V1 = 39S1 and V2 = 21S2 . This is approx-
imately chi-square with 819 degrees of freedom. The p-value is 0.999 for this case,
and the hypothesis of the equality of covariances of the two groups is accepted.
Thus, the final analysis may be based on 103 characteristics with both groups
having the common covariance. We can also try a few other values of p if the
separation of the two groups and equality of the two covariances are maintained.
We first tried p = 200. For this case, we find that b60,200 = 0.0096, and the
value of the statistic in (8.1) is 59.376, with a p-value of 0.4984. Thus, p = 200
does not provide the separation of the two groups. Next we tried p = 100. For
this case, b60,100 = 0.113, and the value of the statistic in (8.1) is 86.0128 with a
p-value of 0.015. Thus, p = 100 is a possible alternative, since the value of the
test statistic P2 for p∗ = 100 is 317.3079 with a p-value of one, accepting the hy-
pothesis of equality of the two covariances of the two groups. Next, we consider
the case when p = n = 60. We now have the case of p = n. We found that one
eigenvalue of the sample covariance matrix is very small. Thus, we assume that
the population covariance matrix Σ is of rank 59 and apply the results given in
Theorem 2.1. The value of the test statistic is 21.1035 with a p-value of 0.04646.
The corresponding value of the test statistic P2 for p∗ = 60 is 666.0173 with a
p-value of one. Consequently, p = 60 is also a possibility on which we can base
our future analysis. However, it appears from the above analysis that p = 103 is
the best choice.
9. Concluding remarks
In this paper we define sample (squared) distance using the Moore-Penrose
inverse instead of any generalized inverse of the sample covariance matrix which
is singular due to fewer observations than the dimension. For normally dis-
tributed data, it is based on sufficient statistic while a sample (squared) distance
using any generalized inverse is not. These distances are used to propose tests in
one-sample, two-sample and multivariate analysis of variance. Simultaneous con-
fidence intervals for the mean parameters are given. Using the proposed methods,
a dataset is analyzed. This shows that the proposed test statistic performs well.
In addition, it provides confidence intervals which have been used to select the
relevant characteristics from any dataset.
    (N/n) x̄′S⁺x̄ = u′Σ^{1/2}H′(HVH′)⁻¹HΣ^{1/2}u
                 = u′MDλ^{1/2}M′H′(HΣ^{1/2}WΣ^{1/2}H′)⁻¹HΣ^{1/2}u
                 = u′MDλ^{1/2}A(A′Dλ^{1/2}M′WMDλ^{1/2}A)⁻¹A′Dλ^{1/2}M′u
                 = u′M(M′WM)⁻¹M′u
                 = z′U⁻¹z.
    × Π_{i<j}(d̃i − d̃j) Π_{i=1}^{n} d̃i^{(1/2)(p−n−1)} e^{−(1/2)d̃i}.
    [2^{(1/2)pn} Γn((1/2)p)]⁻¹ gn(P) Π_{i<j}(d̃i − d̃j) Π_{i=1}^{n} d̃i^{(1/2)(p−n−1)} e^{−(1/2)d̃i}

(A.5)    = [2^{(1/2)pn} Γn((1/2)p)]⁻¹ gn(P)
           × Π_{i<j}(d̃i − d̃j) |P′D̃P|^{(1/2)(p−n−1)} etr(−(1/2)P′D̃P).

Putting U = P′D̃P and using the above Jacobian of the transformation given in
(A.4), we get the pdf of U, which is Wn(I, p), n ≤ p.
Corollary A.1. Let w ∼ Nn(0, In) and D̃ = diag(d̃1, ..., d̃n), d̃1 > ··· >
d̃n, whose pdf is given by (A.1), and let w and D̃ be independently distributed. Then

    ((p − n + 1)/n) w′D̃⁻¹w ∼ Fn,p−n+1.
where

    z ∼ Nn(0, In)

and

    U ∼ Wn(I, p).

Proof of Theorem 2.2. Since Σ = γ²I = ∧ and since T⁺² is invariant
under scalar transformation, we may assume without any loss of generality that
γ² = 1. Hence,

    A = (H∧H′)^{−1/2} = (HH′)^{−1/2} = I.
Thus, from (2.15), with a slight change of notation (w instead of z), it follows
that

    T⁺²/n = w′L⁻¹w,

where w ∼ Nn(0, I) is distributed independently of the diagonal matrix L. Also,
the diagonal elements of L are the non-zero eigenvalues of Y′Y, where the n
columns of Y are iid Np(0, Ip). Thus, L is the diagonal matrix of the eigenvalues
of U = Y′Y ∼ Wn(I, p). Using Corollary A.1, we thus have

    ((p − n + 1)/n)(T⁺²/n) = ((p − n + 1)/n) w′L⁻¹w
                           = ((p − n + 1)/n) z′U⁻¹z ∼ Fn,p−n+1.
To prove Theorem 2.3, we need the results stated in the following lemma.

Lemma A.2. Let V = YY′ ∼ Wp(∧, n), where the columns of Y are iid
Np(0, ∧). Let l1, ..., ln be the n non-zero eigenvalues of V = H′LH, HH′ =
In, L = diag(l1, ..., ln), and the eigenvalues of W ∼ Wn(In, p) are given by
where

    a1 = (tr ∧/p),   and   lim_{p→∞} a1 = a10,

    lim_{p→∞} ui′∧uj/p = 0,   i ≠ j,

and

    lim_{p→∞} Y′Y/p = a10 In   in probability,
since d1, ..., dn are the eigenvalues of W ∼ Wn(In, p). This proves (c). We note
that

    YY′ = H′L^{1/2}GG′L^{1/2}H

for any n × n orthogonal matrix G, GG′ = In, depending on Y. Choosing G =
L^{1/2}HY(Y′Y)⁻¹, we find that, in distribution,

    GY′∧YG′ = GU′∧²UG′ = L^{1/2}H∧H′L^{1/2},
    lim_{p→∞} tr ∧⁴/p² = 0.

Hence, in probability,

    lim_{p→∞} G(Y′∧Y/p)G′ = lim_{p→∞} (tr ∧²/p) In = a20 In.

Thus, in probability,

    lim_{p→∞} L^{1/2}H∧H′L^{1/2}/p = a20 In.
    a′YY′a/(pn) = a′H′LHa/(pn)
                = (1/(pn)) Σ_{i=1}^{n} (a′yi)²
                = (1/(pn)) Σ_{i=1}^{n} Σ_{j=1}^{p} aj²yij² + (2/(pn)) Σ_{i=1}^{n} Σ_{j<k} aj ak yij yik.

Hence, in probability,

    lim_{n,p→∞} (1/(pn)) Σ_{i=1}^{n} Σ_{j=1}^{p} aj²yij² = lim_{n,p→∞} a′H′LHa/(pn).

From the law of large numbers, the left side goes to lim_{n→∞} lim_{p→∞} (a′∧a/p), and
from the results in (a), we have in probability lim_{p→∞} p⁻¹L = a10 In. Hence, in
probability,

    lim_{n,p→∞} a′H′Ha/n = lim_{n,p→∞} (a′∧a/p)/a10,
where GG′ = In, U = (u1, ..., un), and the ui are iid Np(0, I). Since the distribution
of T⁺² is invariant under orthogonal transformations, we get from (2.15)

    pb̂T⁺²/n = b̂z′(ALA/p)⁻¹z
             = b̂z′(L/p)^{−1/2}A⁻²(L/p)^{−1/2}z
             = b̂z′G′(L/p)⁻¹(U′∧²U/p)(L/p)⁻¹Gz   (in distribution).

Hence,

    lim_{n,p→∞} P0{((p/n)b̂T⁺² − n)/(2n)^{1/2} < z1−α}
        = lim_{n,p→∞} P0{(z′G′(U′∧²U/(pa20))Gz − n)/(2n)^{1/2} < z1−α}
        = lim_{n,p→∞} P0{(z′(U′∧²U/(pa20))z − n)/(2n)^{1/2} < z1−α}.
    ri = 1 + Op(p^{−1/2}),

and

    (1/n) Σ_{i=1}^{n} ri = tr(∧²UU′)/(npa20) = 1 + Op((np)^{−1/2}).

Hence,

    lim_{n,p→∞} P0{((p/n)b̂T⁺² − n)/(2n)^{1/2} < z1−α}
        = lim_{n,p→∞} P0{(Σ_{i=1}^{n} ri zi² − n)/(2n)^{1/2} < z1−α},

and

    lim_{n,p→∞} P0{((p/n)b̂T⁺² − n)/(2n)^{1/2} < z1−α} = Φ(z1−α).
Acknowledgements
I am grateful to Dr. Ekkehard Glimm of AICOS Technologies, Basel, Switzer-
land for kindly reading this article and offering many suggestions that greatly
improved the presentation. Thanks are also due to Professors Y. Fujikoshi, J.
Läuter, R. Pincus, M. Genton, and two anonymous referees for their helpful
comments. The calculation for the example in Section 8 was carried out by M.
Shakhatreh, that of Tables 1–3 by Meng Du, and that of Tables 4–6 by Yan Liu;
sincerest thanks to each of them. The research was supported by the Natural
Sciences and Engineering Research Council of Canada.
References
Alon, U., Notterman, D. A., Gish, K., Ybarra, S., Mack, D. and Levine, A. J. (1999). Broad
patterns of gene expression revealed by clustering analysis of tumor and normal colon
tissues probed by oligonucleotide arrays, Proceedings of the National Academy of Sciences,
96, 6745–6750.
Alter, O., Brown, P. O. and Botstein, D. (2000). Singular value decomposition for genome-wide
expression data processing and modelling, Proceedings of the National Academy of Sciences,
97, 10101–10106.
Bai, Z. and Saranadasa, H. (1996). Effect of high dimension: by an example of a two sample
problem, Statistica Sinica, 6, 311–329.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and
powerful approach to multiple testing, J. Roy. Statist. Soc. Ser. B, 57, 289–300.
Dempster, A. P. (1958). A high dimensional two sample significance test, Ann. Math. Statist.,
29, 995–1010.
Dempster, A. P. (1960). A significance test for the separation of two highly multivariate small
samples, Biometrics, 16, 41–50.
Dudoit, S., Fridlyand, J. and Speed, T. P. (2002). Comparison of discrimination methods for
the classification of Tumors using gene expression data, J. Amer. Statist. Assoc., 97, 77–87.
Efron, B., Tibshirani, R., Storey, J. D. and Tusher, V. (2001). Empirical Bayes analysis of a
microarray experiment, J. Amer. Statist. Assoc., 96, 1151–1160.
Eisen, M. B., Spellman, P. T., Brown, P. O. and Botstein, D. (1998). Cluster analysis and dis-
play of genome-wide expression patterns, Proceedings of the National Academy of Sciences,
95, 14863–14868.
Fujikoshi, Y., Himeno, T. and Wakaki, H. (2004). Asymptotic results of a high dimensional
MANOVA test and power comparison when the dimension is large compared to the sample
size, J. Japan. Statist. Soc., 34, 19–26.
Ibrahim, J., Chen, M. and Gray, R. J. (2002). Bayesian models for gene expression with DNA
microarray data, J. Amer. Statist. Assoc., 97, 88–99.
Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal component
analysis, Ann. Statist., 29, 295–327.
Läuter, J., Glimm, E. and Kropf, S. (1998). Multivariate tests based on left-spherically dis-
tributed linear scores, Ann. Statist., 26, 1972–1988.
Ledoit, O. and Wolf, M. (2002). Some hypothesis tests for the covariance matrix when the
dimension is large compared to the sample size, Ann. Statist., 30, 1081–1102.
Lehmann, E. L. (1959). Testing Statistical Hypotheses, Wiley, New York.
Rao, C. R. (1973). Linear Statistical Inference and Its Applications, Wiley, New York.
Rao, C. R. and Mitra, S. K. (1971). Generalized Inverse of Matrices and Its Applications,
Wiley, New York.
Schott, J. R. (1997). Matrix Analysis for Statistics, Wiley, New York.
Simaika, J. B. (1941). On an optimum property of two important statistical tests, Biometrika,
32, 70–80.
Siotani, M., Hayakawa, T. and Fujikoshi, Y. (1985). Modern Multivariate Statistical Analysis:
A Graduate Course and Handbook , American Sciences Press, Inc., Columbus, Ohio, U.S.A.
Srivastava, M. S. (1970). On a class of non-parametric tests for regression parameters, J. Statist.
Res., 4, 117–131.
Srivastava, M. S. (1972). Asymptotically most powerful rank tests for regression parameters in
MANOVA, Ann. Inst. Statist. Math., 24, 285–297.
Srivastava, M. S. (2002). Methods of Multivariate Statistics, Wiley, New York.
Srivastava, M. S. (2003). Singular Wishart and Multivariate beta distributions, Ann. Statist.,
31, 1537–1560.
Srivastava, M. S. (2005). Some tests concerning the covariance matrix in High-Dimensional
data, J. Japan. Statist. Soc., 35, 251–272.
Srivastava, M. S. and Fujikoshi, Y. (2006). Multivariate analysis of variance with fewer obser-
vations than the dimension, J. Multivariate Anal., 97, 1927–1940.
Srivastava, M. S. and Khatri, C. G. (1979). An Introduction to Multivariate Statistics, North-
Holland, New York.