Sparse Matrix-Based Random Projection For Classification: Weizhi Lu, Weiyu Li, Kidiyo Kpalma and Joseph Ronsin
Abstract
Index Terms
Random Projection, Sparse Matrix, Classification, Feature Selection, Distance Preservation, High-dimensional Data
I. INTRODUCTION
Random projection aims to project a set of high-dimensional data into a low-dimensional subspace without distorting pairwise distances. This brings attractive computational advantages to the collection and processing of high-dimensional signals. In practice, it has been successfully applied in numerous fields concerning categorization, as shown in [1] and the references therein. Currently, the theoretical study of this technique mainly falls into one of two topics. The first concerns the construction of random matrices with respect to distance preservation; this problem has been thoroughly addressed since the emergence of the Johnson-Lindenstrauss (JL) lemma [2]. The second popular topic is to estimate the performance of traditional classifiers combined with random projection, as detailed in [3] and the references therein. In particular, it is worth mentioning that the performance consistency of SVM under random projection has recently been proved by exploiting the underlying connection between the JL lemma and compressed sensing [4], [5].
Based on the principle of distance preservation, Gaussian random matrices [6] and a few sparse {0, ±1} random matrices [7], [8], [9] have successively been proposed for random projection. In terms of implementation complexity, the sparse random matrix is clearly more attractive. Unfortunately, as will be proved in section II-B, a sparser matrix tends to yield weaker distance preservation. This fact largely weakens the interest in pursuing ever sparser random matrices. However, it is necessary to point out a problem that has long been ignored: random projection is mainly exploited for classification tasks, which prefer to maximize the distances between different classes rather than to preserve pairwise distances. In this sense, we are motivated to study random projection from the viewpoint of feature selection, rather than of the traditional distance preservation required by the JL lemma. During this study, however, the property of satisfying the JL lemma should not be ignored, because it guarantees the stability of the data structure under random projection, which makes classification in the projected space possible. Thus, throughout the paper, all evaluated random matrices are first ensured to satisfy the JL lemma to a certain degree.
In this paper, we derive the desired {0, ±1} random projection matrix with the best feature selection performance, by theoretically analyzing how the feature selection performance changes with the sparsity of the random matrix. The proposed matrix presents the sparsest structure known so far, holding only one random nonzero position per column. In theory, it is expected to provide better classification performance than other denser matrices, provided the projection dimension is not much smaller than the number of feature elements. This conjecture is confirmed with extensive classification experiments on both synthetic and real data.
The rest of the paper is organized as follows. In the next section, the JL lemma is first introduced, and the distance preservation of sparse random matrices with varying sparsity is then evaluated. In section III, a theoretical framework is proposed to predict the feature selection performance of random matrices with varying sparsity. Based on the resulting theoretical conjecture, the currently sparsest known matrix, with better performance than other denser matrices, is proposed and analyzed in section IV. In section V, the performance advantage of the proposed sparse matrix is verified by performing binary classification on both synthetic and real data. The real data includes three representative dataset types in dimension reduction: face images, DNA microarrays and text documents. Finally, the paper is concluded in section VI.
II. PRELIMINARIES
This section first briefly reviews the JL lemma, and then evaluates the distance preservation of sparse random matrices with varying sparsity.
For ease of reading, we begin by introducing some basic notation. A random matrix is denoted by R ∈ ℝ^{k×d}, k < d; r_ij represents the element of R at the i-th row and j-th column, and r ∈ ℝ^{1×d} denotes a row vector of R. Since the paper is concerned with binary classification, in the following we define two samples v ∈ ℝ^{1×d} and w ∈ ℝ^{1×d}, randomly drawn from two different patterns (classes) of high-dimensional data V ⊂ ℝ^d and W ⊂ ℝ^d, respectively. The inner product between two vectors is written as ⟨v, w⟩. To distinguish them from scalar variables, vectors are written in bold. In the proofs of the following lemmas, Φ(∗) denotes the cumulative distribution function of N(0, 1). The minimal integer not less than ∗ and the maximal integer not larger than ∗ are denoted by ⌈∗⌉ and ⌊∗⌋, respectively.
The distance preservation of random projection is supported by the JL lemma. In the past decades, several variants of the JL lemma have been proposed [10], [11], [12]. For the convenience of the proof of the following Lemma 2, we recall the version of [12] in Lemma 1 below. From Lemma 1, it can be observed that a random matrix satisfying the JL lemma should have E(r_ij) = 0 and E(r_ij^2) = 1.

Lemma 1. [12] Consider a random matrix R ∈ ℝ^{k×d}, with each entry r_ij chosen independently from a distribution that is symmetric about the origin with E(r_ij^2) = 1. For any fixed vector v ∈ ℝ^d, let v′ = (1/√k) R v^T.
• Suppose B = E(r_ij^4) < ∞. Then for any ε > 0,
Pr(‖v′‖² ≤ (1 − ε)‖v‖²) ≤ e^{−(ε² − ε³) k / (2(B+1))}.   (1)
• Suppose there exists L > 0 such that for any integer m > 0, E(r_ij^{2m}) ≤ ((2m)!/(2^m m!)) L^{2m}. Then for any ε > 0,
Pr(‖v′‖² ≥ (1 + ε) L² ‖v‖²) ≤ ((1 + ε) e^{−ε})^{k/2}.   (2)
Up to now, only a few random matrices have been theoretically proposed for random projection. They can be roughly classified into two typical classes. One is the Gaussian random matrix with entries i.i.d. drawn from N(0, 1), and the other is the sparse random matrix with elements satisfying the distribution below:

r_ij = √q × { +1 with probability 1/(2q);  0 with probability 1 − 1/q;  −1 with probability 1/(2q) }   (3)

where q is allowed to be 2, 3 [7] or √d [8]. Clearly, a larger q indicates a sparser matrix.
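For concreteness, a matrix with entries following formula (3) can be generated as in the short sketch below; this is our own illustration (the function name, parameters and the use of NumPy are not from the paper).

```python
import numpy as np

def sparse_random_matrix(k, d, q, rng=None):
    """Generate a k x d matrix with i.i.d. entries following formula (3):
    +sqrt(q) or -sqrt(q) with probability 1/(2q) each, and 0 otherwise,
    so that E(r_ij) = 0 and E(r_ij^2) = 1."""
    rng = np.random.default_rng(rng)
    signs = rng.choice([-1.0, 1.0], size=(k, d))      # random symmetric sign
    mask = rng.random((k, d)) < 1.0 / q               # nonzero with probability 1/q
    return np.sqrt(q) * signs * mask

# Example: the very sparse case q = sqrt(d) of [8]
k, d = 50, 400
R = sparse_random_matrix(k, d, q=np.sqrt(d), rng=0)
print(np.mean(R != 0))    # empirical density, about 1/sqrt(d) = 0.05
print(np.mean(R ** 2))    # about 1, consistent with E(r_ij^2) = 1
```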
Naturally, an interesting question arises: can we further increase the sparsity of random projection matrices? Unfortunately, as shown in Lemma 2, the concentration guaranteed by the JL lemma decreases as the sparsity increases; in other words, higher sparsity leads to weaker distance preservation. However, as will be disclosed in the following sections, classification tasks involving random projection are more sensitive to feature selection than to distance preservation.
Lemma 2. Consider a class of random matrices R ∈ ℝ^{k×d}, with each entry r_ij distributed as in formula (3), where q = k/s and 1 ≤ s ≤ k is an integer. Then these matrices satisfy the JL lemma to different degrees: a sparser matrix implies a weaker distance preservation property.

Proof: From formula (3), it is easy to verify that these matrices follow the distribution required in Lemma 1. Hence they also obey the JL lemma, provided the two constraints corresponding to formulas (1) and (2) can be further proved.
For the first constraint, corresponding to formula (1):

B = E(r_ij^4) = (√(k/s))^4 × s/(2k) + (−√(k/s))^4 × s/(2k) = k/s < ∞,   (4)

so the first constraint is satisfied.
For the second constraint, corresponding to formula (2): for any integer m > 0, E(r_ij^{2m}) = (k/s)^{m−1}, and

E(r_ij^{2m}) / [(2m)! L^{2m} / (2^m m!)] = 2^m m! k^{m−1} / (s^{m−1} (2m)! L^{2m}) ≤ 2^m k^{m−1} / (s^{m−1} m^m L^{2m}),

where the last inequality uses (2m)! ≥ m^m m!. Letting L = (2k/s)^{1/2} ≥ √2 (k/s)^{(m−1)/(2m)} / √m, we further derive

E(r_ij^{2m}) / [(2m)! L^{2m} / (2^m m!)] ≤ 1

for any integer m > 0. Then the second constraint is also proved.
Consequently, as s decreases, B in formula (4) increases, and the error bound in formula (1) becomes looser. This implies that the sparser the matrix, the weaker its JL (distance preservation) property.
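The trend stated in Lemma 2 can be illustrated numerically. The sketch below is our own Monte Carlo check (not from the paper): for q = k/s, a smaller s (a sparser matrix) produces a larger spread of the squared-norm ratio ‖Rv/√k‖²/‖v‖² around 1, i.e. weaker concentration. The spiky test vector is our own choice; the effect is strongest for vectors with a few large entries.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, trials = 400, 40, 2000
v = np.zeros(d)
v[:10] = 1.0                       # a spiky test vector with 10 large coordinates

for s in (1, 10, 40):              # q = k/s: s = 1 is the sparsest case considered here
    q = k / s
    ratios = []
    for _ in range(trials):
        signs = rng.choice([-1.0, 1.0], size=(k, d))
        mask = rng.random((k, d)) < 1.0 / q
        R = np.sqrt(q) * signs * mask
        ratios.append(np.sum((R @ v) ** 2) / (k * np.sum(v ** 2)))
    print(f"s = {s:2d} (q = {q:4.1f}): std of ||v'||^2/||v||^2 = {np.std(ratios):.3f}")
```

The printed standard deviation shrinks as s grows, matching the conclusion that sparser matrices preserve distances less tightly.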
III. FEATURE SELECTION OF RANDOM MATRICES OVER VARYING SPARSITY

In this section, a theoretical framework is proposed to evaluate the feature selection performance of random matrices with varying sparsity. As will be shown later, the feature selection performance can be readily assessed once the product between the difference of two distinct high-dimensional vectors and the sampling (row) vectors of the random matrix is derived. This requires knowing the distribution of the difference between two distinct high-dimensional vectors. To make the analysis tractable, this distribution should be characterized with a unified model. Unfortunately, such a model can hardly be perfect, owing to the diversity and complexity of natural data. Therefore, without loss of generality, we assume an i.i.d. Gaussian distribution for the elements of the difference between two distinct high-dimensional vectors, as detailed in section III-A. By the law of large numbers, the Gaussian distribution is a reasonable characterization of the magnitude distribution of high-dimensional vectors. Like most theoretical work attempting to model the real world, our assumption also suffers from an obvious limitation. Empirically, some elements of real data, in particular the redundant (indiscriminative) elements, tend to be correlated to some extent rather than strictly independent as assumed above. This imperfection may limit the accuracy and applicability of our theoretical model. However, as will be detailed later, this problem can be ignored in our analysis, where the difference between pairwise redundant elements is assumed to be zero. This also explains why our theoretical proposal is widely verified in the final experiments involving a large amount of real data. With the aforementioned assumption, in section III-B the product between the high-dimensional vector difference and the row vectors of the random matrix is calculated and analyzed with respect to the varying sparsity of the random matrix, as detailed in Lemmas 3-5 and the related remarks. Note that, to keep the paper readable, the proofs of Lemmas 3-5 are deferred to the Appendices.
A. Assumption on the high-dimensional data

From the viewpoint of feature selection, random projection is expected to maximize the difference between any two samples v and w drawn from the two different datasets V and W, respectively. The difference is usually measured with the Euclidean distance ‖Rz^T‖_2, z = v − w. Then, owing to the mutual independence of the rows of R, the search for a good random projection is equivalent to seeking the row vector r̂ such that

r̂ = arg max_r |⟨r, z⟩|.   (5)

Thus, in the following we only need to evaluate the row vectors of R. For the convenience of analysis, the two classes of high-dimensional data are further ideally divided into two parts, v = [v^f v^r] and w = [w^f w^r], where v^f and w^f denote the feature elements containing the discriminative information between v and w, such that E(v_i^f − w_i^f) ≠ 0, while v^r and w^r represent the redundant elements, such that E(v_i^r − w_i^r) = 0 with a tiny variance. Accordingly, r = [r^f r^r] and z = [z^f z^r] are also separated into two parts corresponding to the coordinates of the feature elements and of the redundant elements, respectively. The task of random projection then reduces to maximizing |⟨r^f, z^f⟩|, which implies that the redundant elements have no impact on feature selection. Therefore, for simpler exposition, in the following the high-dimensional data is assumed to contain only feature elements unless otherwise specified, and the superscript f is dropped. As for intra-class samples, we can simply assume that all their elements are redundant, so that the expected value of their difference equals 0, as derived above. This means that the problem of minimizing the intra-class distance need not be studied further. Hence, in the following we only consider maximizing the inter-class distance, as described in formula (5).
To find the desired r̂ in formula (5), it is necessary to know the distribution of z. In practice, however, this distribution is hard to characterize, since the locations of the feature elements are usually unknown. As a result, we have to make a relaxed assumption on the distribution of z. For a given real dataset, the values of v_i and w_i are bounded, which allows us to assume that their difference z_i is also bounded in amplitude and follows some unknown distribution. For the sake of generality, in this paper z_i is regarded as approximately Gaussian in magnitude with a random binary sign. The distribution of z_i can then be formulated as

z_i = { x with probability 1/2;  −x with probability 1/2 }   (6)

where x ∼ N(μ, σ²), μ is a positive number, and Pr(x > 0) = 1 − ε, with ε = Φ(−μ/σ) a small positive number.
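The model in formula (6) is easy to sample, as in the following short sketch (ours; the values of μ and σ are illustrative only):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma, n = 2.0, 0.5, 100_000
x = rng.normal(mu, sigma, size=n)          # magnitude part, x ~ N(mu, sigma^2)
sign = rng.choice([-1.0, 1.0], size=n)     # random binary sign
z = sign * x                               # z_i as in formula (6)
print(np.mean(x <= 0), norm.cdf(-mu / sigma))   # both close to epsilon = Phi(-mu/sigma)
print(np.mean(z), np.mean(np.abs(z)))           # E(z_i) ~ 0, E|z_i| ~ mu when epsilon is small
```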
B. Product between high-dimensional vector and random sampling vector with varying sparsity
This subsection mainly evaluates the feature selection performance of random row vectors with varying sparsity. For the sake of comparison, Gaussian random vectors are also evaluated. Recall that, under the basic requirement of the JL lemma, namely E(r_ij) = 0 and E(r_ij^2) = 1, the Gaussian matrix has elements i.i.d. drawn from N(0, 1), and the sparse random matrix has elements distributed as in formula (3) with q ∈ {d/s : 1 ≤ s ≤ d, s ∈ ℕ}.
From the following Lemmas 3-5, we then obtain two crucial results on random projection for high-dimensional data whose feature difference elements z_i are distributed as in formula (6):
• a random matrix achieves its best feature selection performance when only one feature element is sampled by each row vector; in other words, the solution to formula (5) is obtained when r randomly has s = 1 nonzero element;
• the desired sparse random matrix mentioned above also obtains better feature selection performance than Gaussian random matrices.
Note that, for better understanding, we first prove a relatively simple case of zi ∈ {±µ} in Lemma
3, and then in Lemma 4 generalize to a more complicated case of zi distributed as in formula (6). The
performance of Gaussian matrices on zi ∈ {±µ} is obtained in Lemma 5.
Lemma 3. Let r = [r_1, ..., r_d] randomly have 1 ≤ s ≤ d nonzero elements taking values ±√(d/s) with equal probability, and let z = [z_1, ..., z_d] have elements being ±μ equiprobably, where μ is a positive constant. Given f(r, z) = |⟨r, z⟩|, there are three results regarding the expected value of f(r, z):
1) E(f) = 2μ √(d/s) (1/2^s) ⌈s/2⌉ C_s^{⌈s/2⌉};
2) E(f)|_{s=1} = μ√d > E(f)|_{s>1};
3) lim_{s→∞} (1/√d) E(f) → μ √(2/π).
Fig. 1: (a) The convergence of E(f)/(μ√d) towards √(2/π) (≈ 0.7979) with increasing s; (b) the average of E(f)/(μ√d) over two adjacent values of s (> 1), namely (E(f)|_s + E(f)|_{s+1})/(2μ√d), shown to be very close to √(2/π). Note that E(f) is calculated with the formula provided in Lemma 3.
Remark on Lemma 3: This lemma discloses that the best feature selection performance is obtained when only one feature element is sampled by each row vector. In contrast, the performance tends to converge to a lower level as the number of sampled feature elements increases. In practice, however, the desired sampling process is hard to implement, due to the lack of knowledge about feature locations. As detailed in the next section, what we can really implement is to sample only one feature element with high probability. Note that, following the proof of this lemma, it can also be shown that if s is odd, E(f) decreases quickly towards μ√(2d/π) with increasing s; in contrast, if s is even, E(f) increases quickly towards μ√(2d/π) as s increases. But for any two adjacent values of s larger than 1, the average of their E(f), namely (E(f)|_s + E(f)|_{s+1})/2, is very close to μ√(2d/π). For clarity, the values of E(f) over varying s are shown in Figure 1, where E(f)/(μ√d) is plotted instead of E(f), since only the dependence on s is of concern. This specific behavior of E(f) ensures that one can still achieve better performance than other choices by sampling s = 1 element with a relatively high probability, even if values of s slightly larger than 1 also occur.
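The behavior described above can be checked directly from the closed form in Lemma 3. The snippet below (our own check, mirroring Figure 1) evaluates E(f)/(μ√d) for several values of s:

```python
from math import comb, ceil, sqrt, pi

# E(f) = 2*mu*sqrt(d/s)*(1/2^s)*ceil(s/2)*C(s, ceil(s/2)), normalized by mu*sqrt(d)
def normalized_Ef(s):
    return 2.0 * ceil(s / 2) * comb(s, ceil(s / 2)) / (2 ** s * sqrt(s))

print(normalized_Ef(1))                                      # 1.0: the maximum, at s = 1
print([round(normalized_Ef(s), 4) for s in (2, 3, 10, 11, 100, 101)])
print(sqrt(2 / pi))                                          # ~0.7979, the large-s limit
```

Odd values of s approach √(2/π) from above and even values from below, while s = 1 stays the clear maximum.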
Lemma 4. Let r = [r_1, ..., r_d] randomly have 1 ≤ s ≤ d nonzero elements taking values ±√(d/s) with equal probability, and let z = [z_1, ..., z_d] have elements distributed as in formula (6). Given f(r, z) = |⟨r, z⟩|, it is derived that E(f)|_{s=1} > E(f)|_{s>1}, if

(9/8)^{3/2} [√(2/π) + (1 + √3/4) · 2σ/(πμ)] + 2Φ(−μ/σ) ≤ 1.
Lemma 5. Let r = [r_1, ..., r_d] have elements i.i.d. drawn from N(0, 1), and let z = [z_1, ..., z_d] have elements being ±μ equiprobably, where μ is a positive constant. Given f(r, z) = |⟨r, z⟩|, its expected value is E(f) = μ √(2d/π).
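A quick Monte Carlo comparison (ours, with illustrative parameter values) of Lemma 5 against the s = 1 case of Lemma 3:

```python
import numpy as np

rng = np.random.default_rng(0)
d, mu, trials = 1000, 1.0, 5000
z = mu * rng.choice([-1.0, 1.0], size=d)            # z with entries +/-mu

gauss = np.abs(rng.standard_normal((trials, d)) @ z)
print(gauss.mean(), mu * np.sqrt(2 * d / np.pi))    # Gaussian rows: ~25.2 for d = 1000

idx = rng.integers(0, d, size=trials)               # sparse rows with a single nonzero +/-sqrt(d)
sign = rng.choice([-1.0, 1.0], size=trials)
sparse = np.abs(np.sqrt(d) * sign * z[idx])
print(sparse.mean(), mu * np.sqrt(d))               # ~31.6 > 25.2, as claimed by Lemmas 3 and 5
```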
IV. THE PROPOSED SPARSEST RANDOM MATRIX

The lemmas of the previous section have proved that the best feature selection performance is obtained if only one feature element is sampled by each row vector of the random matrix. It is now interesting to know whether this condition can be satisfied in the practical setting, where the high-dimensional data consists of both feature elements and redundant elements, namely v = [v^f v^r] and w = [w^f w^r]. According to the theoretical condition above, the row vector r = [r^f r^r] obtains the best feature selection only when ‖r^f‖_0 = 1, where the quasi-norm ℓ_0 counts the number of nonzero elements in r^f. Let r^f ∈ ℝ^{d_f} and r^r ∈ ℝ^{d_r}, where d = d_f + d_r. Then the desired row vector should have d/d_f uniformly distributed nonzero elements, such that E(‖r^f‖_0) = 1. In practice, however, the desired distribution of the row vectors is often hard to determine, since for a real dataset the number of feature elements is usually unknown.

In this sense, we are motivated to propose a general distribution for the matrix elements, such that ‖r^f‖_0 = 1 holds with high probability even when the feature distribution is unknown. In other words, the random matrix should hold the distribution maximizing the ratio Pr(‖r^f‖_0 = 1)/Pr(‖r^f‖_0 ∈ {2, 3, ..., d_f}). In practice, the desired distribution implies that the random matrix has exactly one nonzero position per column, which can be derived as follows. Assume a random matrix R ∈ ℝ^{k×d} randomly holding 1 ≤ s_0 ≤ k nonzero elements per column, equivalently s_0 d/k nonzero elements per row; then the ratio above can be derived as in formula (7). From the last equation in formula (7), it can be observed that increasing s_0 d/k reduces the value of formula (7). In order to maximize this value, we have to set s_0 = 1. This indicates that the desired random matrix has only one nonzero element per column.
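A possible construction of such a matrix is sketched below (our own code; the ±√k scaling follows the parametrization q = k/s of Lemma 2 with s = 1, so that E(r_ij^2) = 1, and is an assumption rather than a prescription from the text):

```python
import numpy as np

def sparsest_random_matrix(k, d, rng=None):
    """Proposed sparsest matrix (StM): exactly one nonzero per column,
    placed in a uniformly random row, with a random sign and magnitude sqrt(k)."""
    rng = np.random.default_rng(rng)
    R = np.zeros((k, d))
    rows = rng.integers(0, k, size=d)                       # one random row index per column
    R[rows, np.arange(d)] = np.sqrt(k) * rng.choice([-1.0, 1.0], size=d)
    return R

R = sparsest_random_matrix(k=100, d=2000, rng=0)
print((R != 0).sum(axis=0).min(), (R != 0).sum(axis=0).max())   # every column: exactly 1 nonzero
print((R != 0).sum(axis=1).mean())                              # each row: about d/k = 20 nonzeros
```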
The proposed random matrix with exactly one nonzero element per column presents two obvious advantages, as detailed below.
• In complexity, the proposed matrix clearly presents much higher sparsity than existing random projection matrices. Note that, theoretically, the very sparse random matrix with q = √d [8] has higher sparsity than the proposed matrix when k < √d. However, the case k < √d is usually not of practical interest, due to the weak performance caused by the large compression ratio d/k (> √d).
• In performance, it can be derived that the proposed matrix outperforms other denser matrices if the projection dimension k is not much smaller than the number d_f of feature elements contained in the high-dimensional vector. To be specific, from Figure 1 it can be observed that the denser matrices with column weight s_0 > 1 share comparable feature selection performance, because as s_0 increases they tend to sample more than one feature element (namely ‖r^f‖_0 > 1) with higher probability. The proposed matrix with s_0 = 1 will then present better performance than them, if k ensures ‖r^f‖_0 = 1 with high probability, or equivalently if the ratio Pr(‖r^f‖_0 = 1)/Pr(‖r^f‖_0 ∈ {2, 3, ..., d_f}) is relatively large. As shown in formula (7), this condition is better satisfied as k increases. Conversely, as k decreases, the feature selection advantage of the proposed matrix degrades. Recall that the proposed matrix is weaker than other denser matrices in distance preservation, as demonstrated in section II-B. This means that the proposed matrix will perform worse than the others when its feature selection advantage is not obvious. In other words, there should exist a lower bound on k that ensures the performance advantage of the proposed matrix, which is also verified in the following experiments. It can be roughly estimated that the lower bound of k should be on the order of d_f, since for the proposed matrix with column weight s_0 = 1, setting k = d_f leads to E(‖r^f‖_0) = (d/k) × (d_f/d) = 1. In practice, the performance advantage can seemingly be maintained for a relatively small k (< d_f). For instance, in the following experiments on synthetic data, the lower bound of k is as small as d_f/20. This phenomenon can be explained by the fact that, to obtain the performance advantage, the probability Pr(‖r^f‖_0 = 1) is only required to be relatively large rather than equal to 1, as demonstrated in the remark on Lemma 3 (a short simulation after this list illustrates the point).
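The following small simulation (ours; the values of d, d_f, k and s_0 are arbitrary choices) illustrates the point made in the second item: the ratio Pr(‖r^f‖_0 = 1)/Pr(‖r^f‖_0 ≥ 2) for a row holding s_0·d/k uniformly placed nonzeros grows as s_0 decreases and as k increases.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_f, trials = 2000, 1000, 10000        # first d_f coordinates are the feature elements

for k in (1000, 500, 100):
    for s0 in (1, 3):
        n_row = s0 * d // k               # nonzeros per row when the column weight is s0
        counts = np.array([(rng.choice(d, size=n_row, replace=False) < d_f).sum()
                           for _ in range(trials)])
        p1, p2 = np.mean(counts == 1), np.mean(counts >= 2)
        print(f"k={k:5d} s0={s0}: Pr(=1)={p1:.3f}  Pr(>=2)={p2:.3f}  ratio={p1 / max(p2, 1e-12):.2f}")
```

For k close to d_f and s_0 = 1 the ratio is largest, consistent with the lower bound on k discussed above.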
V. EXPERIMENTS
A. Setup
This section verifies the feature selection advantage of the proposed currently sparsest matrix (StM) over other popular matrices by conducting binary classification on both synthetic data and real data. The synthetic data, with labeled feature elements, is provided to specifically observe the relation between the projection dimension and the number of features, as well as the impact of redundant elements. The real data involves three typical dataset types in the area of dimensionality reduction: face images, DNA microarrays and text documents. As for the binary classifier, the classical support vector machine (SVM) based on Euclidean distance is adopted. For comparison, we test three popular random matrices: the Gaussian random matrix (GM), the sparse random matrix (SM) as in formula (3) with q = 3 [7], and the very sparse random matrix (VSM) with q = √d [8].

The simulation parameters are as follows. It is known that repeated random projection tends to improve feature selection, so each classification decision is obtained by voting over 5 independent random projections [13]. The classification accuracy at each projection dimension k is derived by averaging over 100000 simulation runs. In each simulation, the four matrices are tested with the same samples. The projection dimension k decreases uniformly from the high dimension d. Moreover, it should be noted that, for datasets containing more than two classes of samples, the SVM classifier randomly selects two classes to conduct binary classification in each simulation. For each class of data, one half of the samples are randomly selected for training, and the rest for testing.
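The evaluation protocol can be summarized by the following sketch (our own code; the use of scikit-learn's linear SVC, the 0/1 label convention and all names are assumptions, since the paper's experimental code is not given):

```python
import numpy as np
from sklearn.svm import SVC

def vote_classify(X_train, y_train, X_test, k, make_matrix, n_votes=5, rng=None):
    """Each test decision is a majority vote over n_votes independent random
    projections to dimension k, each followed by a linear SVM (labels in {0, 1})."""
    rng = np.random.default_rng(rng)
    d = X_train.shape[1]
    votes = np.zeros((n_votes, X_test.shape[0]))
    for t in range(n_votes):
        R = make_matrix(k, d, rng)                      # e.g. a StM / GM / SM / VSM generator
        P_train, P_test = X_train @ R.T, X_test @ R.T   # project both sets to dimension k
        clf = SVC(kernel="linear").fit(P_train, y_train)
        votes[t] = clf.predict(P_test)
    return (votes.mean(axis=0) >= 0.5).astype(int)      # majority vote over the projections
```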
TABLE I: Classification accuracies on the synthetic data, which has d = 2000 and redundant elements generated under three different noise levels σ_r. The best performance is highlighted in bold. The lower bound of projection dimension k that ensures the proposal outperforming others in all datasets is highlighted in bold as well. Recall that the acronyms GM, SM, VSM and StM represent Gaussian random matrix, sparse random matrix with q = 3, very sparse random matrix with q = √d, and the proposed sparsest random matrix, respectively.
1) Data generation: The synthetic data is developed to evaluate the two factors as follows:
• the relation between the lower bound of projection dimension k and the feature dimension df ;
• the negative impact of redundant elements, which are ideally assumed to be zero in the previous
theoretical proofs.
To this end, two classes of synthetic data with df feature elements and d − df redundant elements are
generated in two steps:
• randomly build a vector ṽ ∈ {±1}^d, then define a vector w̃ with w̃_i = −ṽ_i if 1 ≤ i ≤ d_f, and w̃_i = ṽ_i if d_f < i ≤ d;
• generate the two datasets V and W by i.i.d. sampling v_i^f ∼ N(ṽ_i, σ_f²) and w_i^f ∼ N(w̃_i, σ_f²) for 1 ≤ i ≤ d_f, and v_i^r ∼ N(ṽ_i, σ_r²) and w_i^r ∼ N(w̃_i, σ_r²) for d_f < i ≤ d.
The distributions of the pointwise differences can then be approximated as |v_i^f − w_i^f| ∼ N(2, 2σ_f²) for feature elements and (v_i^r − w_i^r) ∼ N(0, 2σ_r²) for redundant elements, respectively. To be closer to reality, we introduce some unreliability into both feature and redundant elements by adopting relatively large variances. Precisely, in the simulation σ_f is fixed to 8 and σ_r varies in the set {8, 12, 16}. Note that the probability of (v_i^r − w_i^r) being close to zero decreases as σ_r increases; thus an increasing σ_r challenges our previous theoretical conjecture, which was derived under the assumption (v_i^r − w_i^r) = 0. As for the size of the dataset, the data dimension d is set to 2000, and the feature dimension d_f = 1000.
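The generation procedure can be written compactly as follows (our own sketch, with the parameter values stated above):

```python
import numpy as np

def make_synthetic(n_per_class, d=2000, d_f=1000, sigma_f=8.0, sigma_r=8.0, rng=None):
    """Two classes V and W: opposite +/-1 templates on the first d_f (feature)
    coordinates, identical templates on the remaining (redundant) coordinates,
    plus Gaussian noise with std sigma_f and sigma_r respectively."""
    rng = np.random.default_rng(rng)
    v_bar = rng.choice([-1.0, 1.0], size=d)
    w_bar = v_bar.copy()
    w_bar[:d_f] = -v_bar[:d_f]
    sig = np.r_[np.full(d_f, sigma_f), np.full(d - d_f, sigma_r)]
    V = v_bar + sig * rng.standard_normal((n_per_class, d))
    W = w_bar + sig * rng.standard_normal((n_per_class, d))
    return V, W

V, W = make_synthetic(2000, rng=0)
diff = np.abs(V.mean(axis=0) - W.mean(axis=0))
print(diff[:1000].mean())    # feature coordinates: class means differ by about 2
print(diff[1000:].mean())    # redundant coordinates: difference close to 0 (noise only)
```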
Three types of representative high-dimensional datasets are tested for random projection over an evenly varying projection dimension k. The datasets are first briefly introduced, and then the results are illustrated and analyzed. Note that the simulation is designed to compare the feature selection performance of different random projections, rather than to obtain the best absolute performance. To reduce the simulation load, the original high-dimensional data is therefore uniformly downsampled to a relatively low dimension: the face images, DNA data and text data are reduced to dimensions 1200, 2000 and 3000, respectively. Note that, in terms of the JL lemma, the original high dimension may be reduced to arbitrary values (not limited to 1200, 2000 or 3000), since theoretically the distance preservation of random projection is independent of the size of the high-dimensional data [7].
1) Datasets:
• Face image
– AR [14]: As in [15], a subset of 2600 frontal faces from 50 males and 50 females is examined. For some subjects, the faces were taken at different times, with varying lighting, facial expressions (open/closed eyes, smiling/not smiling) and facial details (glasses/no glasses). Among the 26 faces per person, 6 are taken with dark glasses and 6 are partially disguised by scarves.
– Extended Yale B [16], [17]: This dataset includes about 2414 frontal faces of 38 persons, taken under strongly varying illumination conditions.
– FERET [18]: This dataset consists of more than 10000 faces from more than 1000 persons, taken in largely varying circumstances. The database is further divided into several sets formed for different evaluations. Here we evaluate the 1984 frontal faces of 992 persons, each contributing 2 faces extracted from the fa and fb sets, respectively.
TABLE II: Classification accuracies on five face datasets with dimension d = 1200. For each projection dimension k, the best performance is highlighted in bold. The lower bound of projection dimension k that ensures the proposal outperforming others in all datasets is highlighted in bold as well. Recall that the acronyms GM, SM, VSM and StM represent Gaussian random matrix, sparse random matrix with q = 3, very sparse random matrix with q = √d, and the proposed sparsest random matrix, respectively.
– GTF [19]: In this dataset, 750 images from 50 persons were captured at different scales and
orientations under variations in illumination and expression. So the cropped faces suffer from
serious pose variation.
– ORL [20]: It contains 40 persons, each with 10 faces. Besides slightly varying lighting and expressions, the faces also undergo slight pose changes.
• DNA microarray
– Colon [21]: This dataset consists of 40 colon tumor and 22 normal colon tissue samples. The 2000 genes with the highest intensity across the samples are considered.
– ALML [22]: This dataset contains 25 samples taken from patients suffering from acute myeloid leukemia (AML) and 47 samples from patients suffering from acute lymphoblastic leukemia (ALL). Each sample is described by 7129 genes.
– Lung [23]: This dataset contains 86 lung tumor and 10 normal lung samples. Each sample is described by 7129 genes.
TABLE III: Classification accuracies on three DNA datasets with dimension d = 2000. For each projection dimension k, the best performance is highlighted in bold. The lower bound of projection dimension k that ensures the proposal outperforming others in all datasets is highlighted in bold as well. Recall that the acronyms GM, SM, VSM and StM represent Gaussian random matrix, sparse random matrix with q = 3, very sparse random matrix with q = √d, and the proposed sparsest random matrix, respectively.
1 Publicly available at http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html
TABLE IV: Classification accuracies on three Text datasets with dimension d = 3000. For each projection dimension k, the best performance is highlighted in bold. The lower bound of projection dimension k that ensures the proposal outperforming others in all datasets is highlighted in bold as well. Recall that the acronyms GM, SM, VSM and StM represent Gaussian random matrix, sparse random matrix with q = 3, very sparse random matrix with q = √d, and the proposed sparsest random matrix, respectively.
data, and k > 600 (k/d > 1/5) for all text data. Note that, for some individual datasets, smaller thresholds than the uniform thresholds described above can in fact be obtained, which means that for these datasets the performance advantage of our proposal is ensured at lower projection dimensions. It is worth noting that the performance gain usually varies across data types. For most data, the gain is on the level of around 1%, except for some special cases where it can be as large as around 5%. Moreover, it should be noted that the proposed matrix still presents performance comparable to the others (usually inferior to the best results by no more than 1%), even when k is smaller than the lower threshold described above. This implies that, regardless of the value of k, the proposed matrix is always valuable due to its lower complexity and competitive performance. In short, the extensive experiments on real data sufficiently verify the performance advantage of the theoretically proposed random matrix, as well as the conjecture that this performance advantage holds only when the projection dimension k is large enough.
VI. CONCLUSION

This paper has proved that random projection achieves its best feature selection performance when only one feature element of the high-dimensional data is considered at each sampling. In practice, however, the number of feature elements is usually unknown, and so the best sampling process described above is hard to implement. Based on the principle of achieving the best sampling process with high probability, we propose a class of sparse random matrices with exactly one nonzero element per column, which is expected to outperform other denser random projection matrices if the projection dimension is not much smaller than the number of feature elements. Recall that, to make the theoretical analysis tractable, we have assumed that the elements of the high-dimensional data are mutually independent, which obviously cannot be fully satisfied by real data, especially for the redundant elements. Although the impact of the redundant elements is reasonably avoided in our analysis, we cannot ensure that all analyzed feature elements are exactly independent in practice. This defect might affect the applicability of our theoretical proposal to some extent, whereas empirically the negative impact seems to be negligible, as shown by the experiments on synthetic data. To validate the feasibility of the theoretical proposal, extensive classification experiments were conducted on various real data, including face images, DNA microarrays and text documents. As expected, the proposed random matrix shows better performance than other denser matrices when the projection dimension is sufficiently large; otherwise, it presents performance comparable to the others. This result suggests that, for random projection applied to classification, the proposed currently sparsest random matrix is much more attractive than other denser random matrices in terms of both complexity and performance.
APPENDIX A.
Proof of Lemma 3
Proof: Due to the sparsity of r and the symmetric properties of both r_j and z_j, the function f(r, z) can be equivalently transformed into a simpler form, namely f(x) = μ √(d/s) |Σ_{i=1}^{s} x_i| with x_i being ±1 equiprobably. With this simplified form, the three results of the lemma are proved in sequence below.
1) Since each x_i is ±1 equiprobably, E(f) = μ √(d/s) (1/2^s) Σ_{i=0}^{s} C_s^i |s − 2i|, where i counts the number of x_i equal to −1. Note that
C_s^i |s − 2i| = s C_{s−1}^{0}                                if i = 0,
C_s^i |s − 2i| = s C_{s−1}^{s−i−1} − s C_{s−1}^{i−1}          if 1 ≤ i ≤ s/2,
C_s^i |s − 2i| = s C_{s−1}^{i−1} − s C_{s−1}^{s−i−1}          if s/2 < i < s,
C_s^i |s − 2i| = s C_{s−1}^{s−1}                              if i = s.
Further, with C_{s−1}^{i−1} = (i/s) C_s^i, it can be deduced that
Σ_{i=0}^{s} C_s^i |s − 2i| = 2 ⌈s/2⌉ C_s^{⌈s/2⌉},
and therefore
E(f) = 2μ √(d/s) (1/2^s) ⌈s/2⌉ C_s^{⌈s/2⌉}.
2) Following the proof above, it is clear that E(f(x))|_{s=1} = f(x)|_{s=1} = μ√d. As for E(f(x))|_{s>1}, it is evaluated in two cases:
– if s is odd,
E(f(x))|_s / E(f(x))|_{s−2} = [√(d/s) (1/2^s) (s+1) C_s^{(s+1)/2}] / [√(d/(s−2)) (1/2^{s−2}) (s−1) C_{s−2}^{(s−1)/2}] = √(s(s−2)) / (s−1) < 1,
namely, E(f(x)) decreases monotonically over odd s. Clearly, in this case E(f(x))|_{s=1} > E(f(x))|_{s>1};
– if s is even,
E(f(x))|_s / E(f(x))|_{s−1} = [√(d/s) (1/2^s) s C_s^{s/2}] / [√(d/(s−1)) (1/2^{s−1}) s C_{s−1}^{s/2}] = √((s−1)/s) < 1,
which also means E(f(x))|_{s=1} > E(f(x))|_{s>1}, since s − 1 is an odd number, for which E(f(x)) decreases monotonically.
3) With Stirling's formula n! = √(2πn) (n/e)^n e^{λ_n}, where λ_n → 0 as n → ∞:
– if s is even,
E(f(x)) = μ √(d/s) (1/2^s) s · s! / ((s/2)! (s/2)!) = μ √(2d/π) e^{λ_s − 2λ_{s/2}};
– if s is odd,
E(f(x)) = μ √(d/s) (1/2^s) (s+1) · s! / (((s+1)/2)! ((s−1)/2)!) = μ √(2d/π) (s²/(s²−1))^{s/2} e^{λ_s − λ_{(s+1)/2} − λ_{(s−1)/2}}.
Clearly, lim_{s→∞} (1/√d) E(f(x)) → μ √(2/π) holds, whether s is even or odd.
APPENDIX B.
Proof of Lemma 4
Proof: Due to the sparsity of r and the symmetric properties of both r_j and z_j, it is easy to derive that f(r, z) = |⟨r, z⟩| = √(d/s) |Σ_{j=1}^{s} z_j|. This simplified formula will be studied in the following proof. For readability, we first recall the distribution given in formula (6):
z_j ∼ N(μ, σ²) with probability 1/2, and z_j ∼ N(−μ, σ²) with probability 1/2,
where, for x ∼ N(μ, σ²), Pr(x > 0) = 1 − ε and ε = Φ(−μ/σ) is a tiny positive number. For notational simplicity, the subscript of the random variable z_j is dropped in the following proof. To ease the proof of the lemma, we first derive the expected value of |x| with x ∼ N(μ, σ²):
E(|x|) = ∫_{−∞}^{∞} |x| / (√(2π) σ) · e^{−(x−μ)²/(2σ²)} dx
= ∫_{−∞}^{0} (−x) / (√(2π) σ) · e^{−(x−μ)²/(2σ²)} dx + ∫_{0}^{∞} x / (√(2π) σ) · e^{−(x−μ)²/(2σ²)} dx
= −∫_{−∞}^{0} (x−μ) / (√(2π) σ) · e^{−(x−μ)²/(2σ²)} dx + ∫_{0}^{∞} (x−μ) / (√(2π) σ) · e^{−(x−μ)²/(2σ²)} dx
  + μ ∫_{0}^{∞} 1 / (√(2π) σ) · e^{−(x−μ)²/(2σ²)} dx − μ ∫_{−∞}^{0} 1 / (√(2π) σ) · e^{−(x−μ)²/(2σ²)} dx
= σ/√(2π) · e^{−(x−μ)²/(2σ²)} |_{−∞}^{0} − σ/√(2π) · e^{−(x−μ)²/(2σ²)} |_{0}^{∞} + μ (1 − Φ(−μ/σ)) − μ Φ(−μ/σ)
= √(2/π) σ e^{−μ²/(2σ²)} + μ (1 − 2Φ(−μ/σ)),
which will be used repeatedly in the following proof. The proof of the lemma is then separated into the two parts below.
1) This part presents the expected value of f(r, z) for the cases s = 1 and s > 1.
– if s = 1, f(r, z) = √d |z|, and the probability density function of z is
p(z) = (1/2) · 1/(√(2π) σ) · e^{−(z−μ)²/(2σ²)} + (1/2) · 1/(√(2π) σ) · e^{−(z+μ)²/(2σ²)},
so that
E(|z|) = (1/2) ∫_{−∞}^{∞} |z| / (√(2π) σ) · e^{−(z−μ)²/(2σ²)} dz + (1/2) ∫_{−∞}^{∞} |z| / (√(2π) σ) · e^{−(z+μ)²/(2σ²)} dz.
With the previous result on E(|x|), it is further deduced that
E(|z|) = √(2/π) σ e^{−μ²/(2σ²)} + μ (1 − 2Φ(−μ/σ)),
and hence
E(f) = √d E(|z|) = √(2d/π) σ e^{−μ²/(2σ²)} + μ√d (1 − 2Φ(−μ/σ)) ≈ μ√d.
– if s > 1, let t = Σ_{j=1}^{s} z_j. Conditioned on the number i of the z_j drawn from N(−μ, σ²), 0 ≤ i ≤ s,
t ∼ N((s − 2i)μ, sσ²) with probability (1/2^s) C_s^i.
The PDF of t can then be written as
p(t) = (1/2^s) Σ_{i=0}^{s} C_s^i · 1/(√(2πs) σ) · e^{−(t−(s−2i)μ)²/(2sσ²)},
and therefore
E(|t|) = ∫_{−∞}^{∞} |t| p(t) dt
= (1/2^s) Σ_{i=0}^{s} C_s^i ∫_{−∞}^{∞} |t| / (√(2πs) σ) · e^{−(t−(s−2i)μ)²/(2sσ²)} dt
= (1/2^s) Σ_{i=0}^{s} C_s^i { √(2s/π) σ e^{−(s−2i)²μ²/(2sσ²)} + μ |s − 2i| [1 − 2Φ(−|s − 2i|μ/(√s σ))] },
so that E(f)|_{s>1} = √(d/s) E(|t|).
2) This part derives an upper bound on the aforementioned E(f)|_{s>1}. For simpler expression, the three terms of the above expression for E(f)|_{s>1} are denoted by f1, f2 and f3, and are analyzed in turn.
– for f1 = μ √(d/s) (1/2^s) Σ_{i=0}^{s} C_s^i |s − 2i|, it can be rewritten as
f1 = 2μ √(d/s) (1/2^s) ⌈s/2⌉ C_s^{⌈s/2⌉};
– for f2 = σ √(2d/π) (1/2^s) Σ_{i=0}^{s} C_s^i e^{−(s−2i)²μ²/(2sσ²)}, we first bound the exponentials:
e^{−(s−2i)²μ²/(2sσ²)} < e^{−μ²/(2σ²)} if i < α or i > s − α,
e^{−(s−2i)²μ²/(2sσ²)} ≤ 1 if α ≤ i ≤ s − α,
where α = ⌈(s − √s)/2⌉. Substituting into f2,
f2 < σ √(2d/π) (1/2^s) Σ_{i=0}^{α−1} C_s^i e^{−μ²/(2σ²)} + σ √(2d/π) (1/2^s) Σ_{i=s−α+1}^{s} C_s^i e^{−μ²/(2σ²)} + σ √(2d/π) (1/2^s) Σ_{i=α}^{s−α} C_s^i
< σ √(2d/π) e^{−μ²/(2σ²)} + σ √(2d/π) (1/2^s) Σ_{i=α}^{s−α} C_s^i.
Since C_s^i ≤ C_s^{⌈s/2⌉} and the middle range contains at most ⌊√s⌋ + 1 terms,
f2 < σ √(2d/π) e^{−μ²/(2σ²)} + σ √(2d/π) (1/2^s) (⌊√s⌋ + 1) C_s^{⌈s/2⌉}
≤ σ √(2d/π) e^{−μ²/(2σ²)} + σ √(2d/π) (1/2^s) √s C_s^{⌈s/2⌉} + σ √(2d/π) (1/2^s) C_s^{⌈s/2⌉}
≤ σ √(2d/π) e^{−μ²/(2σ²)} + σ √(2d/π) (1/2^s) (2/√s) ⌈s/2⌉ C_s^{⌈s/2⌉} + σ √(2d/π) (1/2^s) C_s^{⌈s/2⌉}.
With Stirling's approximation, this gives
f2 < √(2d/π) σ e^{−μ²/(2σ²)} + (2σ/π) √d e^{λ_s − 2λ_{s/2}} + (2σ/π) √(d/s) e^{λ_s − 2λ_{s/2}}   if s is even,
f2 < √(2d/π) σ e^{−μ²/(2σ²)} + (2σ/π) √d (s²/(s²−1))^{s/2} e^{λ_s − λ_{(s+1)/2} − λ_{(s−1)/2}}
     + (2σ/π) √d (√s/(s+1)) (s²/(s²−1))^{s/2} e^{λ_s − λ_{(s+1)/2} − λ_{(s−1)/2}}   if s is odd;
– for f3 = −2μ √(d/s) (1/2^s) Σ_{i=0}^{s} C_s^i |s − 2i| Φ(−|s − 2i|μ/(√s σ)), with the previously defined α (every summand is nonpositive, and |s − 2i| ≤ √s for α ≤ i ≤ s − α),
f3 ≤ −2μ √(d/s) (1/2^s) Σ_{i=α}^{s−α} C_s^i |s − 2i| Φ(−|s − 2i|μ/(√s σ))
≤ −2μ √(d/s) (1/2^s) Φ(−μ/σ) Σ_{i=α}^{s−α} C_s^i |s − 2i|
= −2μ √(d/s) (1/2^s) Φ(−μ/σ) (2s C_{s−1}^{⌈s/2⌉−1} − 2s C_{s−1}^{α−1})
= −4μ √(ds) (1/2^s) Φ(−μ/σ) (C_{s−1}^{⌈s/2⌉−1} − C_{s−1}^{α−1})
≤ 0.
Collecting the above results,

E(f)|_{s>1} = f1 + f2 + f3
< (√(2d/π) μ + (2σ/π) √d) e^{λ_s − 2λ_{s/2}} + (2σ/π) √(d/s) e^{λ_s − 2λ_{s/2}} + √(2d/π) σ e^{−μ²/(2σ²)}
  − 4μ √(ds) (1/2^s) Φ(−μ/σ) (C_{s−1}^{⌈s/2⌉−1} − C_{s−1}^{α−1})   if s is even,
< (√(2d/π) μ + (2σ/π) √d) (s²/(s²−1))^{s/2} e^{λ_s − λ_{(s+1)/2} − λ_{(s−1)/2}}
  + (2σ/π) √d (√s/(s+1)) (s²/(s²−1))^{s/2} e^{λ_s − λ_{(s+1)/2} − λ_{(s−1)/2}} + √(2d/π) σ e^{−μ²/(2σ²)}
  − 4μ √(ds) (1/2^s) Φ(−μ/σ) (C_{s−1}^{⌈s/2⌉−1} − C_{s−1}^{α−1})   if s is odd,

where f1 has been replaced by its Stirling form derived in Appendix A. It remains to show that

E(f)|_{s>1} < E(f)|_{s=1} = √(2d/π) σ e^{−μ²/(2σ²)} + μ√d (1 − 2Φ(−μ/σ)),

which is done separately for the two parities of s.
– if s is even, since f3 ≤ 0,
E(f)|_{s>1} < (√(2d/π) μ + (2σ/π) √d) e^{λ_s − 2λ_{s/2}} + (2σ/π) √(d/s) e^{λ_s − 2λ_{s/2}} + √(2d/π) σ e^{−μ²/(2σ²)}
≤ (√(2d/π) μ + (2σ/π) √d) + (2σ/π) √(d/s) + √(2d/π) σ e^{−μ²/(2σ²)}
= μ√d [√(2/π) + (1 + 1/√s) · 2σ/(πμ)] + √(2d/π) σ e^{−μ²/(2σ²)},
where the second inequality uses e^{λ_s − 2λ_{s/2}} ≤ 1, and 1/√s ≤ 1/√2 since s ≥ 2. Clearly E(f)|_{s>1} < E(f)|_{s=1}, if √(2/π) + (1 + 1/√2) · 2σ/(πμ) ≤ 1 − 2Φ(−μ/σ). This condition is well satisfied when μ ≫ σ, since Φ(−μ/σ) decreases monotonically with increasing μ/σ.
– if s is odd, with f3 ≤ 0 and e^{λ_s − λ_{(s+1)/2} − λ_{(s−1)/2}} ≤ 1,
E(f)|_{s>1} < (√(2d/π) μ + (2σ/π) √d) (s²/(s²−1))^{s/2} + (2σ/π) √d (√s/(s+1)) (s²/(s²−1))^{s/2} + √(2d/π) σ e^{−μ²/(2σ²)}.
It can be proved that (s²/(s²−1))^{s/2} decreases monotonically with respect to s, and √s/(s+1) ≤ √3/4 for odd s ≥ 3. This yields
E(f)|_{s>1} < (√(2d/π) μ + (1 + √3/4)(2σ/π) √d) (9/8)^{3/2} + √(2d/π) σ e^{−μ²/(2σ²)}
= (9/8)^{3/2} μ√d [√(2/π) + (1 + √3/4) · 2σ/(πμ)] + √(2d/π) σ e^{−μ²/(2σ²)};
in this case E(f)|_{s>1} < E(f)|_{s=1}, if (9/8)^{3/2} [√(2/π) + (1 + √3/4) · 2σ/(πμ)] ≤ 1 − 2Φ(−μ/σ).
APPENDIX C.
Proof of Lemma 5
Proof: Since the entries of r are i.i.d. N(0, 1) and the entries of z are ±μ, the inner product ⟨r, z⟩ = Σ_{i=1}^{d} r_i z_i follows N(0, dμ²). Hence f(r, z) = |⟨r, z⟩| is half-normal, with E(f) = √(2/π) · μ√d = μ √(2d/π).
REFERENCES
[1] N. Goel, G. Bebis, and A. Nefian, “Face recognition experiments with random projection,” in Proceedings of SPIE,
Bellingham, WA, pp. 426–437, 2005.
[2] W. B. Johnson and J. Lindenstrauss, “Extensions of Lipschitz mappings into a Hilbert space,” Contemp. Math., vol. 26,
pp. 189–206, 1984.
[3] R. J. Durrant and A. Kaban, “Random projections as regularizers: Learning a linear discriminant ensemble from fewer
observations than data dimensions,” Proceedings of the 5th Asian Conference on Machine Learning (ACML 2013). JMLR
W&CP, vol. 29, pp. 17–32, 2013.
[4] R. Calderbank, S. Jafarpour, and R. Schapire, “Compressed learning: Universal sparse dimensionality reduction and learning
in the measurement domain,” Technical Report, 2009.
[5] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin, “A simple proof of the restricted isometry property for random
matrices,” Constructive Approximation, vol. 28, no. 3, pp. 253–263, 2008.
[6] P. Indyk and R. Motwani, “Approximate nearest neighbors: Towards removing the curse of dimensionality,” in Proceedings
of the 30th Annual ACM Symposium on Theory of Computing, pp. 604–613, 1998.
[7] D. Achlioptas, “Database-friendly random projections: Johnson–Lindenstrauss with binary coins,” J. Comput. Syst. Sci.,
vol. 66, no. 4, pp. 671–687, 2003.
[8] P. Li, T. J. Hastie, and K. W. Church, “Very sparse random projections,” in Proceedings of the 12th ACM SIGKDD
international conference on Knowledge discovery and data mining, 2006.
[9] A. Dasgupta, R. Kumar, and T. Sarlos, “A sparse Johnson–Lindenstrauss transform,” in Proceedings of the 42nd ACM
Symposium on Theory of Computing, 2010.
[10] S. Dasgupta and A. Gupta, “An elementary proof of the Johnson–Lindenstrauss lemma,” Technical Report, UC Berkeley,
no. 99–006, 1999.
[11] J. Matoušek, “On variants of the Johnson–Lindenstrauss lemma,” Random Struct. Algorithms, vol. 33, no. 2, pp. 142–156,
2008.
[12] R. Arriaga and S. Vempala, “An algorithmic theory of learning: Robust concepts and random projection,” Journal of
Machine Learning, vol. 63, no. 2, pp. 161–182, 2006.
[13] X. Z. Fern and C. E. Brodley, “Random projection for high dimensional data clustering: A cluster ensemble approach,” in
Proceedings of the 20th International Conference on Machine Learning, 2003.
[14] A. Martinez and R. Benavente, “The AR face database,” Technical Report 24, CVC, 1998.
[15] A. Martinez, “PCA versus LDA,” IEEE Trans. PAMI, vol. 23, no. 2, pp. 228–233, 2001.
[16] A. Georghiades, P. Belhumeur, and D. Kriegman, “From few to many: Illumination cone models for face recognition under
variable lighting and pose,” IEEE Trans. PAMI, vol. 23, no. 6, pp. 643–660, 2001.
[17] K. Lee, J. Ho, and D. Kriegman, “Acquiring linear subspaces for face recognition under variable lighting,” IEEE Trans.
PAMI, vol. 27, no. 5, pp. 684–698, 2005.
[18] P. J. Phillips, H.Wechsler, and P. Rauss, “The FERET database and evaluation procedure for face-recognition algorithms,”
Image and Vision Computing, vol. 16, no. 5, pp. 295–306, 1998.
[19] A. V. Nefian and M. H. Hayes, “Maximum likelihood training of the embedded HMM for face detection and recognition,”
IEEE International Conference on Image Processing, 2000.
[20] F. Samaria and A. Harter, “Parameterisation of a stochastic model for human face identification,” In 2nd IEEE Workshop
on Applications of Computer Vision, Sarasota, FL, 1994.
[21] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine, “Broad patterns of gene expression
revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays,” Proceedings of the
National Academy of Sciences, vol. 96, no. 12, pp. 6745–6750, 1999.
[22] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A.
Caligiuri, C. D. Bloomfield, and E. S. Lander, “Molecular classification of cancer: Class discovery and class prediction by
gene expression monitoring,” Science, vol. 286, no. 5439, pp. 531–537, 1999.
[23] D. G. Beer, S. L. Kardia, C.-C. C. Huang, T. J. Giordano, A. M. Levin, D. E. Misek, L. Lin, G. Chen, T. G. Gharib, D. G.
Thomas, M. L. Lizyness, R. Kuick, S. Hayasaka, J. M. Taylor, M. D. Iannettoni, M. B. Orringer, and S. Hanash, “Gene-
expression profiles predict survival of patients with lung adenocarcinoma,” Nature medicine, vol. 8, no. 8, pp. 816–824,
Aug. 2002.
[24] D. Cai, X. Wang, and X. He, “Probabilistic dyadic data analysis with local and global consistency,” in Proceedings of the
26th Annual International Conference on Machine Learning (ICML’09), 2009, pp. 105–112.