Report No.

UIUCDCS-R-2006-2747 UILU-ENG-2006-1787
Isometric Projection
by
Deng Cai, Xiaofei He, and Jiawei Han
July 2006
Isometric Projection

Deng Cai

Xiaofei He

Jiawei Han

Department of Computer Science, University of Illinois at Urbana-Champaign

Yahoo! Research Labs


Abstract
Recently the problem of dimensionality reduction has received a lot of interest in many fields of information processing, including data mining, information retrieval, and pattern recognition. We consider the case where data is sampled from a low-dimensional manifold which is embedded in a high-dimensional Euclidean space. The most popular manifold learning algorithms include Locally Linear Embedding, ISOMAP, and Laplacian Eigenmap. However, these algorithms are nonlinear and only provide the embedding results of training samples. In this paper, we propose a novel linear dimensionality reduction algorithm, called Isometric Projection. Isometric Projection constructs a weighted data graph where the weights are discrete approximations of the geodesic distances on the data manifold. A linear subspace is then obtained by preserving the pairwise distances. Our algorithm can be performed in either the original space or a reproducing kernel Hilbert space, which leads to Kernel Isometric Projection. In this way, Isometric Projection can be defined everywhere. Compared to Principal Component Analysis (PCA), which is widely used in data processing, our algorithm is more capable of discovering the intrinsic geometrical structure. Specifically, PCA is optimal only when the data space is linear, while our algorithm makes no such assumption and can therefore handle more complex data spaces. We present experimental results of the algorithm applied to synthetic data sets as well as real-life data. These examples illustrate the effectiveness of the proposed method.
1 Introduction
Dimensionality reduction has been a key problem in many fields of information processing, such as data mining, information retrieval, and pattern recognition. When data is represented as points in a high-dimensional space, one is often confronted with tasks like nearest neighbor search. Many methods have been proposed to index the data for fast query response, such as the K-D tree, R tree, R* tree, etc. [7]. However, these methods can only operate with small dimensionality, typically less than 100. The effectiveness and efficiency of these methods drop exponentially as the dimensionality increases, which is commonly referred to as the curse of dimensionality.

The work was supported in part by the U.S. National Science Foundation, NSF IIS-03-08215/IIS-05-13678. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.
During the last decade, with the advances in computer technologies and the advent of the World Wide Web, there has been an explosion in the amount and complexity of digital data being generated, stored, analyzed, and accessed. Much of this information is multimedia in nature, including text, image, and video data. Multimedia data is typically of very high dimensionality, ranging from several thousand to several hundreds of thousands. Learning in such a high-dimensional space is in many cases almost infeasible. Thus, learnability necessitates dimensionality reduction. Once the high-dimensional data is mapped into a lower-dimensional space, conventional indexing schemes can then be applied.
One of the most popular dimensionality reduction algorithms is probably Principal Component Analysis (PCA) [12]. PCA performs dimensionality reduction by projecting the original $n$-dimensional data onto the $d$ ($d \ll n$)-dimensional linear subspace spanned by the leading eigenvectors of the data's covariance matrix. Its goal is to find a set of mutually orthogonal basis functions that capture the directions of maximum variance in the data so that the pairwise Euclidean distances can be best preserved. If the data is embedded in a linear subspace, PCA is guaranteed to discover the dimensionality of the subspace and produce a compact representation. PCA has been widely applied in data mining [16], information retrieval [5], multimedia [14], etc.
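As a concrete illustration of this procedure (not part of the original report), the following minimal NumPy sketch computes a PCA projection from the leading eigenvectors of the covariance matrix; the function name and the random example data are ours.

```python
import numpy as np

def pca_project(X, d):
    """Project rows of X (m samples x n features) onto the top-d principal directions."""
    X_centered = X - X.mean(axis=0)                  # remove the mean
    cov = np.cov(X_centered, rowvar=False)           # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:d]]  # leading d eigenvectors
    return X_centered @ top                          # m x d low-dimensional coordinates

# Example: reduce 1000 random 50-dimensional points to 2 dimensions.
Y = pca_project(np.random.rand(1000, 50), 2)
```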
In many real-world databases, however, there is no evidence that the data is sampled from a linear subspace. For example, it is commonly believed that face images are sampled from a nonlinear low-dimensional manifold which is embedded in the high-dimensional ambient space [9]. This motivates us to consider manifold-based techniques for dimensionality reduction. Recently, various manifold learning techniques, such as ISOMAP [21], Locally Linear Embedding (LLE) [18], and Laplacian Eigenmap [2], have been proposed which reduce the dimensionality of a fixed training set in a way that maximally preserves certain inter-point relationships. LLE and Laplacian Eigenmap are local methods which attempt to preserve the local geometry of the data; essentially, they seek to map nearby points on the manifold to nearby points in the low-dimensional representation. ISOMAP is a global method which attempts to preserve geometry at all scales, mapping nearby points on the manifold to nearby points in low-dimensional space, and faraway points to faraway points. One of the major limitations of these methods is that they do not generally provide a functional mapping between the high- and low-dimensional spaces that is valid both on and off the training data. Moreover, these methods are computationally expensive and may not be able to handle large-scale databases.
In this paper, we propose a novel dimensionality reduction algorithm called Isometric Projection (IsoProjection), which explicitly takes into account the manifold structure. To model the manifold structure, we first construct a nearest neighbor graph of the observed data. We then compute shortest paths in the graph for all pairs of data points. The shortest-paths computation gives an estimate of the global metric structure. Using techniques from Multi-Dimensional Scaling (MDS) and requiring the mapping function to be linear, we finally obtain Isometric Projection. IsoProjection can operate in either the original data space or a reproducing kernel Hilbert space (RKHS), which leads to Kernel IsoProjection. With a nonlinear kernel, kernel IsoProjection is capable of discovering the nonlinear structure of the data manifold. More crucially, kernel IsoProjection is defined everywhere.
The points below highlight several aspects of the paper:
1. IsoProjection provides an optimal linear approximation to the true isometric embedding of the underlying data manifold. It tends to give a more faithful representation of the data's global structure than PCA does.
2. IsoProjection is linear. It is computationally tractable. It can be obtained by solving an eigenvector problem.
3. IsoProjection, as well as its nonlinear extension, is defined everywhere. Therefore, query points can also be mapped into the low-dimensional representation space, in which retrieval, clustering, and classification may be performed.
4. IsoProjection is fundamentally based on ISOMAP [21], but ISOMAP does not have properties (2) and (3) above.
The remainder of the paper is organized as follows. In Section 2, we provide some background material for manifold-based dimensionality reduction. Section 3 introduces our proposed IsoProjection algorithm. Section 4 describes its nonlinear extension, kernel IsoProjection. Extensive experimental results are presented in Section 5. Finally, we provide some concluding remarks and suggestions for future work in Section 6.
2 Background
In this section, we provide the mathematical background of manifold-based dimensionality reduction, as well as its implications for some potential applications like retrieval, clustering, and classification. For a detailed treatment of manifolds, please see [10].
2.1 Manifold based Dimensionality Reduction
Data are generally represented as points in a high-dimensional vector space. For example, a 32 × 32 image can be represented by a 1024-dimensional vector, where every element of the vector corresponds to a pixel. A text document can be represented by a term vector. In many cases of interest, the data may not fill the whole ambient space, but reside on or near a submanifold embedded in the ambient space. One then hopes to estimate geometrical and topological properties of the submanifold from random samples (scattered data) lying on this unknown submanifold. The formal definition of a manifold is as follows.

Definition A $p$-dimensional manifold (denoted by $\mathcal{M}^p$) is a topological space that is locally Euclidean. That is, around every point, there is a neighborhood that is topologically the same as the open unit ball in $\mathbb{R}^p$.

Figure 1: Examples of a one-dimensional manifold (a) and a two-dimensional manifold (b). Both of them are embedded in the three-dimensional ambient space.
Figure 1 gives examples of manifolds with dimensionality 1 and 2. In order to compute distances on the manifold, one needs to equip the topological manifold with a metric. A manifold possessing a metric is called a Riemannian manifold, and the metric is commonly referred to as a Riemannian metric.

Definition Suppose that for every point $x$ in a manifold $\mathcal{M}$, an inner product $\langle \cdot, \cdot \rangle_x$ is defined on the tangent space $T_x\mathcal{M}$ of $\mathcal{M}$ at $x$. Then the collection of all these inner products is called the Riemannian metric.

Once the Riemannian metric is defined, one is allowed to measure the lengths of the tangent vectors $v \in T_x\mathcal{M}$:
$$\|v\|^2 = \langle v, v \rangle$$
For every smooth curve $r : [a, b] \to \mathcal{M}$, we have tangent vectors
$$r'(t) = \frac{dr}{dt} \in T_{r(t)}\mathcal{M}$$
and can therefore use the Riemannian metric (the inner product on the tangent spaces) to define their lengths. We can then define the length of $r$ from $a$ to $b$:
$$\mathrm{length}(r) = \int_a^b \left\| \frac{dr}{dt} \right\| dt = \int_a^b \|r'(t)\|\, dt$$
Note that a Riemannian metric is not a distance metric on $\mathcal{M}$. However, for a connected manifold, it is the case that every Riemannian metric induces a distance metric on $\mathcal{M}$, namely the geodesic distance.

Definition The geodesic distance $d_{\mathcal{M}}(a, b)$ is defined as the length of the shortest curve connecting $a$ and $b$.
In the plane, the geodesics are straight lines. On the sphere, the geodesics are great circles (like the equator). Suppose $\mathcal{M}^p$ is embedded in an $n$-dimensional Euclidean space $\mathbb{R}^n$ ($p \le n$). Let us consider a low-dimensional map $f : \mathbb{R}^n \to \mathbb{R}^d$ ($d \ll n$), where $f$ has support on the submanifold $\mathcal{M}^p$, i.e., $\mathrm{supp}(f) = \mathcal{M}^p$. Note that $p \le d \ll n$, and $p$ is generally unknown. Let $d_{\mathbb{R}^d}$ denote the standard Euclidean distance measure in $\mathbb{R}^d$. In order to preserve the intrinsic (invariant) geometrical structure of the data manifold, we seek a function $f$ such that:
$$d_{\mathcal{M}^p}(x, y) = d_{\mathbb{R}^d}(f(x), f(y)) \quad (1)$$
In this paper, we are particularly interested in linear mappings, i.e., projections. The reason is their simplicity. More crucially, the same derivation can be performed in a reproducing kernel Hilbert space (RKHS), which naturally leads to the nonlinear extension.
2.2 Potential Applications
Dimensionality reduction is often considered a data pre-processing step. After that, retrieval, clustering, and classification can be performed in the lower-dimensional subspace.

In information retrieval, the most commonly used strategy is query by example. Given a dataset $\mathcal{X} = \{x_1, \ldots, x_m\}$, the query-by-example process can be formally stated as follows:
1. The user submits a query $q$.
2. Compute the distance between $x_i$ and $q$ according to some pre-defined distance measure $d$, $i = 1, \ldots, m$. Sort $d(q, x_i)$ in increasing order. Let $r(x_i)$ be the rank of $x_i$.
3. Return the top $k$ matches, $\mathcal{R}(k, q, \mathcal{X}) = \{x_i \mid r(x_i) \le k\}$.
As can be seen, a key step of the above process is the distance measure. In practice, one can only consider simple distance measures for fast query response, such as the Euclidean distance, although it may not reflect the intrinsic geometrical structure. In this paper, however, by using Isometric Projection, the Euclidean distances in the low-dimensional subspace provide a faithful approximation to the geodesic distances on the intrinsic data manifold.
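To make the three steps above concrete, here is a minimal NumPy sketch of query-by-example retrieval in a reduced subspace (an illustration of ours, not code from the report; variable names are hypothetical):

```python
import numpy as np

def query_by_example(Y, y_query, k=10):
    """Return the indices of the top-k matches for a query, ranked by Euclidean distance.

    Y       : m x d matrix of (dimensionality-reduced) database points
    y_query : d-dimensional query vector in the same subspace
    """
    dists = np.linalg.norm(Y - y_query, axis=1)   # distance to every database point
    ranking = np.argsort(dists)                   # increasing distance = increasing rank
    return ranking[:k]

# Example with random data: 500 database items in a 10-dimensional subspace.
Y = np.random.rand(500, 10)
print(query_by_example(Y, np.random.rand(10), k=5))
```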
Clustering is an unsupervised learning problem. It aims at grouping objects with some common properties. For example, document clustering aims at grouping documents sharing the same topics. The K-means algorithm is one of the most popular iterative descent clustering methods. Let $C(i)$ be the cluster assignment of $x_i$, $i = 1, \ldots, m$. K-means tries to minimize the following objective function:
$$\min \sum_{k=1}^{K} \sum_{C(i)=k} d(x_i, m_k)$$
where $m_k$ is the center of the $k$-th cluster. The performance of K-means is essentially determined by the choice of the distance measure. Recently, there has been considerable interest in spectrally based techniques for data clustering due to their good performance [15], [20]. Spectral clustering has very close ties to spectral dimensionality reduction. In fact, spectral clustering can be thought of as a combination of spectral dimensionality reduction and a traditional clustering algorithm such as K-means. The rationale behind spectral clustering resides in the fact that, after dimensionality reduction, Euclidean distances in the subspace can better describe the intrinsic relationships between objects than those in the original ambient space. Therefore, it is expected that good clustering performance can be achieved in the subspace obtained by our Isometric Projection algorithm.
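As a rough illustration of the reduce-then-cluster pipeline discussed above (our sketch, not the authors' code), the example below assumes a projection matrix W has already been learned, for instance by the Isometric Projection algorithm derived in Section 3, and uses scikit-learn's KMeans as the traditional clustering step:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_in_subspace(X, W, n_clusters):
    """Cluster data in the low-dimensional subspace y = W^T x.

    X : n x m data matrix (columns are samples), W : n x l projection matrix.
    Returns a cluster label for each of the m samples.
    """
    Y = W.T @ X                                   # l x m low-dimensional representation
    km = KMeans(n_clusters=n_clusters, n_init=10)
    return km.fit_predict(Y.T)                    # KMeans expects samples as rows

# Example with a random projection standing in for a learned one.
X = np.random.rand(100, 300)                      # 100 features, 300 samples
W = np.random.rand(100, 5)
labels = cluster_in_subspace(X, W, n_clusters=3)
```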
3 Isometric Projection
In this section, we introduce a novel dimensionality reduction algorithm, called Isometric Projection. We begin with a formal definition of the dimensionality reduction problem.
3.1 The Problem
The generic problem of dimensionality reduction is the following. Given a set of points $x_1, \ldots, x_m$ in $\mathbb{R}^n$, find a mapping function that maps these $m$ points to a set of points $y_1, \ldots, y_m$ in $\mathbb{R}^d$ ($d \ll n$), such that $y_i$ represents $x_i$, where $y_i = f(x_i)$. Our method is of particular applicability in the special case where $x_1, x_2, \ldots, x_m \in \mathcal{M}$ and $\mathcal{M}$ is a nonlinear manifold embedded in $\mathbb{R}^n$.

In this section, we consider $f$ to be linear. In the next section, we will describe its nonlinear extension using kernel techniques.
3.2 The Objective Function of Isometric Projection
We define $X = (x_1, x_2, \ldots, x_m)$ and $f(x) = a^T x$. Sometimes the rank of $X$ is less than the number of dimensions ($n$). In this case, we can apply Singular Value Decomposition (SVD) to project the data into a lower-dimensional subspace without losing any information. We have:
$$X = U \Sigma V^T$$
where $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ and $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$ are the singular values of $X$, $U = [u_1, \ldots, u_r]$ with the $u_i$ called left singular vectors, and $V = [v_1, \ldots, v_r]$ with the $v_i$ called right singular vectors. We project the data points $x_i$ ($i = 1, \ldots, m$) into the SVD subspace by throwing away the components corresponding to zero singular values. We denote by $W_{SVD}$ the transformation matrix of SVD, $W_{SVD} = U$. After the SVD projection, the rank of the new data matrix is equal to the number of features (dimensions). Note that this step is used to guarantee that the matrix $XX^T$ is nonsingular. When the number of data points ($m$) is larger than the number of features ($n$), $XX^T$ is usually nonsingular; in that case, this step is not necessary. For the sake of simplicity, we still use $X$ to denote the data in the SVD subspace in the following.
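A minimal NumPy sketch of this SVD projection step might look as follows (illustrative only; the tolerance used to decide which singular values are treated as zero is an assumption of the example):

```python
import numpy as np

def svd_projection(X, tol=1e-10):
    """Project columns of X (n features x m samples) into the SVD subspace.

    Returns (X_reduced, W_svd) with X_reduced = W_svd^T X, where W_svd = U
    contains the left singular vectors whose singular values exceed tol.
    """
    U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
    r = np.sum(sigma > tol)            # effective rank of X
    W_svd = U[:, :r]                   # n x r transformation matrix
    return W_svd.T @ X, W_svd          # r x m data in the SVD subspace

X = np.random.rand(1024, 100)          # e.g., 100 images with 1024 pixels each
X_reduced, W_svd = svd_projection(X)
```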
Let $d_{\mathcal{M}}$ be the geodesic distance measure on $\mathcal{M}$, and $d$ the standard Euclidean distance measure in $\mathbb{R}^d$. Isometric Projection aims to find a Euclidean embedding such that the Euclidean distances in $\mathbb{R}^d$ can provide a good approximation to the geodesic distances on $\mathcal{M}$. That is,
$$f_{opt} = \arg\min_f \sum_{i,j} \Big( d_{\mathcal{M}}(x_i, x_j) - d\big(f(x_i), f(x_j)\big) \Big)^2 \quad (2)$$
In real-life data sets, the underlying manifold $\mathcal{M}$ is often unknown, and hence the geodesic distance measure is also unknown. In order to discover the intrinsic geometrical structure of $\mathcal{M}$, we first construct a graph $G$ over all data points to model the local geometry. There are two choices:
1. $\epsilon$-graph: we put an edge between $i$ and $j$ if $d(x_i, x_j) < \epsilon$.
2. kNN-graph: we put an edge between $i$ and $j$ if $x_i$ is among the $k$ nearest neighbors of $x_j$ or $x_j$ is among the $k$ nearest neighbors of $x_i$.
(In a supervised situation, more restrictions can be imposed that require an edge to be put only between data points which share the same label.)
Once the graph is constructed, the geodesic distances $d_{\mathcal{M}}(i, j)$ between all pairs of points on the manifold can be estimated by computing their shortest-path distances $d_G(i, j)$ on the graph $G$. The procedure is as follows: initialize $d_G(x_i, x_j) = d(x_i, x_j)$ if $x_i$ and $x_j$ are linked by an edge, and $d_G(x_i, x_j) = \infty$ otherwise. Then, for each value of $p = 1, 2, \ldots, m$ in turn, replace all entries $d_G(x_i, x_j)$ by
$$\min\big( d_G(x_i, x_j),\; d_G(x_i, x_p) + d_G(x_p, x_j) \big).$$
The matrix of final values $D_G = \{ d_G(x_i, x_j) \}$ will contain the shortest-path distances between all pairs of points in $G$. This procedure is known as the Floyd-Warshall algorithm [4]. More efficient algorithms exploiting the sparse structure of the neighborhood graph can be found in [8].
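In code, the neighborhood graph and the all-pairs shortest-path matrix D_G can be computed with a sparse-graph routine instead of the dense Floyd-Warshall recursion; the sketch below is an illustration using SciPy, not the authors' implementation, and the choice of Dijkstra's method is ours:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def geodesic_distance_matrix(X, k=5):
    """Estimate pairwise geodesic distances for the samples in the rows of X (m x n).

    Builds a symmetric kNN graph with Euclidean edge lengths and returns the
    m x m matrix of shortest-path distances D_G (np.inf for disconnected pairs).
    """
    dist = cdist(X, X)                         # dense Euclidean distance matrix
    m = X.shape[0]
    W = np.zeros((m, m))
    for i in range(m):
        nn = np.argsort(dist[i])[1:k + 1]      # k nearest neighbors, skipping the point itself
        W[i, nn] = dist[i, nn]
    W = np.maximum(W, W.T)                     # edge if either point is a kNN of the other
    return shortest_path(csr_matrix(W), method='D', directed=False)

D_G = geodesic_distance_matrix(np.random.rand(200, 3), k=7)
```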
In the following, we apply techniques from Multi-Dimensional Scaling (MDS) to convert distances to inner products, which uniquely characterize the geometry of the data in a form that supports efficient optimization [12]. We have the following theorem:

Theorem 1 Let $D$ be the distance matrix such that $D_{ij}$ is the distance between $x_i$ and $x_j$. Define the matrix $S$ by $S_{ij} = D_{ij}^2$, and let $H = I - \frac{1}{m} e e^T$, where $I$ is the identity matrix and $e$ is the vector of all ones. Then $\tau(D) = -HSH/2$ is the inner product matrix. That is,
$$D_{ij}^2 = \tau(D)_{ii} + \tau(D)_{jj} - 2\,\tau(D)_{ij}, \quad \forall\, i, j.$$
Proof We have:
$$\tau(D) = -\frac{1}{2} HSH = -\frac{1}{2}\Big(I - \frac{1}{m} e e^T\Big) S \Big(I - \frac{1}{m} e e^T\Big) = -\frac{1}{2}\Big(S - \frac{1}{m} e e^T S - \frac{1}{m} S e e^T + \frac{1}{m^2} e e^T S e e^T\Big)$$
and
$$\tau(D)_{ij} = -\frac{1}{2}\Big( S_{ij} - \frac{1}{m}\sum_{i=1}^{m} S_{ij} - \frac{1}{m}\sum_{j=1}^{m} S_{ij} + \frac{1}{m^2}\sum_{i=1}^{m}\sum_{j=1}^{m} S_{ij} \Big).$$
Since
$$S_{ij} = D_{ij}^2 = \|x_i - x_j\|^2 = (x_i - x_j)^T (x_i - x_j) = x_i^T x_i - 2 x_i^T x_j + x_j^T x_j,$$
we have
$$\sum_{i=1}^{m} S_{ij} = \sum_{i=1}^{m} x_i^T x_i - 2 x_j^T \sum_{i=1}^{m} x_i + m\, x_j^T x_j$$
$$\sum_{j=1}^{m} S_{ij} = m\, x_i^T x_i - 2 x_i^T \sum_{j=1}^{m} x_j + \sum_{j=1}^{m} x_j^T x_j$$
$$\sum_{i=1}^{m}\sum_{j=1}^{m} S_{ij} = m \sum_{i=1}^{m} x_i^T x_i - 2 \sum_{j=1}^{m} x_j^T \sum_{i=1}^{m} x_i + m \sum_{j=1}^{m} x_j^T x_j.$$
Noting that
$$\sum_{i=1}^{m} x_i^T x_i = \sum_{j=1}^{m} x_j^T x_j \quad \text{and} \quad \bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i = \frac{1}{m}\sum_{j=1}^{m} x_j,$$
we have:
$$\tau(D)_{ij} = x_i^T x_j - x_i^T \bar{x} - x_j^T \bar{x} + \bar{x}^T \bar{x} = (x_i - \bar{x})^T (x_j - \bar{x}).$$
Thus, we have:
$$D_{ij}^2 = \tau(D)_{ii} + \tau(D)_{jj} - 2\,\tau(D)_{ij}, \quad \forall\, i, j.$$
The matrix $H$ is often called the centering matrix. Let $D_Y$ denote the Euclidean distance matrix in the reduced subspace, and $\tau(D_Y)$ the corresponding inner product matrix. Thus, the objective function (2) becomes minimizing the following:
$$\| \tau(D_G) - \tau(D_Y) \|_{L^2} \quad (3)$$
where $\|A\|_{L^2}$ is the $L^2$ matrix norm $\sqrt{\sum_{i,j} A_{i,j}^2}$.
3.3 Learning Isometric Projections
Consider a linear function $f(x) = a^T x$. Let $y_i = f(x_i)$ and $Y = (y_1, \ldots, y_m) = a^T X$. Thus, we have
$$\tau(D_Y) = Y^T Y = X^T a a^T X.$$
The optimal projection is given by solving the following minimization problem:
$$a^* = \arg\min_a \| \tau(D_G) - X^T a a^T X \|^2 \quad (4)$$
Following some algebraic steps and noting $\mathrm{tr}(A) = \mathrm{tr}(A^T)$, we see that:
$$\| \tau(D_G) - X^T a a^T X \|^2 = \mathrm{tr}\Big( \big(\tau(D_G) - X^T a a^T X\big)\big(\tau(D_G) - X^T a a^T X\big) \Big)$$
$$= \mathrm{tr}\Big( \tau(D_G)\tau(D_G)^T - X^T a a^T X \tau(D_G)^T - \tau(D_G) X^T a a^T X + X^T a a^T X X^T a a^T X \Big).$$
Note that the magnitude of $a$ is of no real significance, because it merely scales $y_i$. Therefore, we can impose the following constraint:
$$a^T X X^T a = 1.$$
Thus, we have
$$\mathrm{tr}\big( X^T a a^T X X^T a a^T X \big) = \mathrm{tr}\big( a^T X X^T a\; a^T X X^T a \big) = 1,$$
and
$$\| \tau(D_G) - X^T a a^T X \|^2 = \mathrm{tr}\big( \tau(D_G)\tau(D_G)^T \big) - 2\,\mathrm{tr}\big( a^T X \tau(D_G) X^T a \big) + 1.$$
Now, the minimization problem (4) can be written as follows:
$$\arg\max_{a,\; a^T X X^T a = 1} \; a^T X\, \tau(D_G)\, X^T a \quad (5)$$
We now switch to a Lagrangian formulation of the problem. The Lagrangian is as follows:
$$L = a^T X \tau(D_G) X^T a - \lambda\, a^T X X^T a.$$
Requiring that the gradient of $L$ vanish gives the following eigenvector problem:
$$X \tau(D_G) X^T a = \lambda\, X X^T a \quad (6)$$
It is easy to show that the matrices $X \tau(D_G) X^T$ and $X X^T$ are both symmetric and positive semi-definite. The vectors $a_i$ ($i = 1, 2, \ldots, l$) that minimize the objective function are given by the eigenvectors corresponding to the largest eigenvalues of this generalized eigen-problem. Let $A = [a_1, \ldots, a_l]$; the linear embedding is then:
$$x \to y = W^T x, \qquad W = W_{SVD}\, A,$$
where $y$ is an $l$-dimensional representation of the high-dimensional data point $x$, and $W$ is the transformation matrix.
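Putting the pieces together, a minimal sketch of the linear IsoProjection step is given below. It solves the generalized eigen-problem (6) with scipy.linalg.eigh, assuming the data matrix X (columns are samples, already in the SVD subspace) and the graph distance matrix D_G are available; the small ridge added to XX^T is a numerical safeguard of ours, not part of the algorithm as stated.

```python
import numpy as np
from scipy.linalg import eigh

def isoprojection(X, D_G, l, reg=1e-6):
    """Solve X tau(D_G) X^T a = lambda X X^T a and return the top-l projection directions.

    X   : n x m data matrix (columns are samples, already in the SVD subspace)
    D_G : m x m matrix of graph shortest-path distances
    l   : number of projection directions
    """
    m = X.shape[1]
    S = D_G ** 2
    H = np.eye(m) - np.ones((m, m)) / m
    T = -H @ S @ H / 2.0                          # tau(D_G)
    left = X @ T @ X.T
    right = X @ X.T + reg * np.eye(X.shape[0])    # small ridge keeps X X^T well conditioned
    vals, vecs = eigh(left, right)                # generalized symmetric eigen-problem
    A = vecs[:, np.argsort(vals)[::-1][:l]]       # eigenvectors of the largest eigenvalues
    return A                                      # embed a point x as y = A.T @ x

# Example usage with random stand-ins for X and D_G.
X = np.random.rand(10, 80)
D_G = np.random.rand(80, 80); D_G = D_G + D_G.T; np.fill_diagonal(D_G, 0)
A = isoprojection(X, D_G, l=2)
Y = A.T @ X                                       # 2 x 80 embedding of the training data
```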
4 Kernel IsoProjection
In this section, we describe a method to conduct IsoProjection in the reproducing kernel Hilbert space into which the data points are mapped. This gives rise to Kernel IsoProjection.

Suppose $X = \{x_1, x_2, \ldots, x_m\} \subset \mathcal{X}$ is the training sample set. We consider the problem in a feature space $\mathcal{F}$ induced by some nonlinear mapping
$$\phi : \mathcal{X} \to \mathcal{F}.$$
For a properly chosen $\phi$, an inner product $\langle \cdot, \cdot \rangle$ can be defined on $\mathcal{F}$, which makes $\mathcal{F}$ a so-called reproducing kernel Hilbert space (RKHS). More specifically,
$$\langle \phi(x), \phi(y) \rangle = \mathcal{K}(x, y)$$
holds, where $\mathcal{K}(\cdot, \cdot)$ is a positive semi-definite kernel function. Several popular kernel functions are: the Gaussian kernel $\mathcal{K}(x, y) = \exp(-\|x - y\|^2 / \sigma^2)$; the polynomial kernel $\mathcal{K}(x, y) = (1 + \langle x, y \rangle)^d$; and the sigmoid kernel $\mathcal{K}(x, y) = \tanh(\langle x, y \rangle + \alpha)$.

Given a set of vectors $\{v_i \in \mathcal{F} \mid i = 1, 2, \ldots, d\}$ which are orthonormal ($\langle v_i, v_j \rangle = \delta_{i,j}$), the projection of $\phi(x_i) \in \mathcal{F}$ onto these $v_1, \ldots, v_d$ leads to a mapping from $\mathcal{X}$ to the Euclidean space $\mathbb{R}^d$ through
$$y_i = \big( \langle v_1, \phi(x_i) \rangle, \langle v_2, \phi(x_i) \rangle, \ldots, \langle v_d, \phi(x_i) \rangle \big)^T.$$
We look for vectors $v_i \in \mathcal{F}$ ($i = 1, 2, \ldots, d$) such that the $y_i$ ($i = 1, \ldots, m$) preserve the geodesic distances on the data manifold. A typical scenario is $\mathcal{X} = \mathbb{R}^n$, $\mathcal{F} = \mathbb{R}^{\infty}$ with $d \ll n < \infty$.


Let $\phi(X)$ denote the data matrix in the RKHS:
$$\phi(X) = [\phi(x_1), \phi(x_2), \ldots, \phi(x_m)].$$
Now, the eigenvector problem in the RKHS can be written as follows:
$$\big[ \phi(X)\, \tau(D_G)\, \phi^T(X) \big] v = \lambda \big[ \phi(X)\, \phi^T(X) \big] v \quad (7)$$
Because the eigenvectors of (7) are linear combinations of $\phi(x_1), \phi(x_2), \ldots, \phi(x_m)$, there exist coefficients $\alpha_i$, $i = 1, 2, \ldots, m$, such that
$$v = \sum_{i=1}^{m} \alpha_i \phi(x_i) = \phi(X)\alpha,$$
where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_m)^T \in \mathbb{R}^m$.
Following some algebraic formulations, we get:
$$\big[ \phi(X)\, \tau(D_G)\, \phi^T(X) \big] v = \lambda \big[ \phi(X)\, \phi^T(X) \big] v$$
$$\Rightarrow\; \big[ \phi(X)\, \tau(D_G)\, \phi^T(X) \big] \phi(X)\alpha = \lambda \big[ \phi(X)\, \phi^T(X) \big] \phi(X)\alpha$$
$$\Rightarrow\; \phi^T(X)\big[ \phi(X)\, \tau(D_G)\, \phi^T(X) \big] \phi(X)\alpha = \lambda\, \phi^T(X)\big[ \phi(X)\, \phi^T(X) \big] \phi(X)\alpha$$
$$\Rightarrow\; K\, \tau(D_G)\, K \alpha = \lambda K K \alpha \quad (8)$$
where $K$ is the kernel matrix, $K_{ij} = \mathcal{K}(x_i, x_j)$. Let the column vectors $\alpha^1, \alpha^2, \ldots, \alpha^m$ be the solutions of equation (8). For a test point $x$, we compute projections onto the eigenvectors $v^k$ according to
$$\langle v^k, \phi(x) \rangle = \sum_{i=1}^{m} \alpha_i^k \langle \phi(x), \phi(x_i) \rangle = \sum_{i=1}^{m} \alpha_i^k\, \mathcal{K}(x, x_i),$$
where $\alpha_i^k$ is the $i$-th element of the vector $\alpha^k$. For the original training points, the map can be obtained by $y = K\alpha$, where the $i$-th element of $y$ is the one-dimensional representation of $x_i$.
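A corresponding sketch for the kernel case (ours, illustrative only) solves equation (8) with a Gaussian kernel and maps new points through the kernel expansion above; the kernel width, the regularizer, and all variable names are assumptions of the example:

```python
import numpy as np
from scipy.linalg import eigh

def gaussian_kernel(A, B, sigma=1.0):
    """K_ij = exp(-||a_i - b_j||^2 / sigma^2) for the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma ** 2)

def kernel_isoprojection(X_train, D_G, d, sigma=1.0, reg=1e-6):
    """Solve K tau(D_G) K alpha = lambda K K alpha; return (alphas, projector)."""
    m = X_train.shape[0]
    K = gaussian_kernel(X_train, X_train, sigma)
    H = np.eye(m) - np.ones((m, m)) / m
    T = -H @ (D_G ** 2) @ H / 2.0
    left = K @ T @ K
    right = K @ K + reg * np.eye(m)               # small ridge for numerical stability
    vals, vecs = eigh(left, right)
    alphas = vecs[:, np.argsort(vals)[::-1][:d]]  # m x d coefficient vectors

    def project(X_new):
        """Map new points via sum_i alpha_i^k K(x, x_i)."""
        return gaussian_kernel(X_new, X_train, sigma) @ alphas

    return alphas, project

# Example usage with random stand-ins.
X_train = np.random.rand(100, 3)
D_G = np.random.rand(100, 100); D_G = D_G + D_G.T; np.fill_diagonal(D_G, 0)
alphas, project = kernel_isoprojection(X_train, D_G, d=2)
Y_test = project(np.random.rand(5, 3))            # 5 x 2 embedding of test points
```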
In some situations, IsoProjection, kernel IsoProjection, and Isomap [21] may give the same embedding results. We have the following proposition.

Proposition 2 If $X$ in equation (6) is a full-rank square matrix, then IsoProjection and Isomap have the same embedding results on the training points; and if $K$ in equation (8) is positive definite, then kernel IsoProjection and Isomap have the same embedding results on the training points.
Proof Recall that the eigen-problem of IsoProjection is as follows:
$$X\, \tau(D_G)\, X^T w = \lambda\, X X^T w. \quad (9)$$
For the original training points, the embedding results can be obtained by $y = X^T w$, where the $i$-th element of $y$ is the one-dimensional embedding of $x_i$. Replacing $X^T w$ by $y$, equation (9) can be rewritten as follows:
$$X\, \tau(D_G)\, y = \lambda X y \quad (10)$$
Since $X$ is a full-rank square matrix, the inverse of $X$ exists. Thus, the above equation can be changed to
$$X^{-1} X\, \tau(D_G)\, y = \lambda X^{-1} X y. \quad (11)$$
Finally, we get
$$\tau(D_G)\, y = \lambda y \quad (12)$$
which is just the eigen-problem of Isomap.
In kernel IsoProjection, the map of the training points can be obtained by $y = K\alpha$, where the $i$-th element of $y$ is the one-dimensional embedding of $x_i$. Replacing $K\alpha$ by $y$, equation (8) can be rewritten as:
$$K\, \tau(D_G)\, y = \lambda K y. \quad (13)$$
Similarly, if $K$ is positive definite, the above equation can be reduced to
$$\tau(D_G)\, y = \lambda y \quad (14)$$
which again is the eigen-problem of Isomap.
This proposition illustrates three interesting points:
1. When the number of features ($n$) is larger than the number of samples ($m$), $X$ will be a full-rank square matrix after the SVD transformation if all the data vectors are linearly independent. In this case, IsoProjection provides the same embedding result on the training points as Isomap. However, IsoProjection has projection functions which can be applied to testing data. In many real-world applications such as information retrieval, the dimensionality of the document space is typically much larger than the number of documents. This case applies if the document vectors are linearly independent.
2. Kernel IsoProjection with a positive definite kernel matrix yields the same results as Isomap on the training points. Moreover, kernel IsoProjection is defined everywhere, while Isomap is only defined on the training samples. In reality, when the number of samples is much larger than the number of features (such as the data in Figure 2), kernel IsoProjection might have more power than IsoProjection to discover the nonlinear manifold structure.
3. Based on (1) and (2), a general guideline for choosing between IsoProjection and kernel IsoProjection could be: when the number of features ($n$) is larger than the number of samples ($m$), IsoProjection is preferred; otherwise, kernel IsoProjection is preferred.
5 Experimental Results
5.1 A Toy Problem
We first take the synthetic Swiss roll data to examine our algorithm. The 1000 data points are sampled from a 2-dimensional manifold which is embedded in the 3-dimensional ambient space (Figure 2(a)). Since the number of data points ($m = 1000$) is much larger than the number of features ($n = 3$), we use kernel IsoProjection with a Gaussian kernel. The kernel matrix is positive definite, so the embedding result (Figure 2(c)) of the training data (1000 points) is the same as that of Isomap [21]. However, kernel IsoProjection provides a mapping function that we can use to project new testing data (Figure 2(d)(e)). Kernel IsoProjection correctly recovers the intrinsic dimensionality and geometric structure of the data. The Euclidean distance in the embedding space (Figure 2(c)) accurately approximates the geodesic distance on the manifold. For comparison, we also demonstrate the embedding result of kernel PCA [19] on the same data, as shown in Figure 2(f). Clearly, kernel PCA fails to illustrate the low-dimensional manifold structure.
5.2 Experiments on Clustering
In this subsection, we investigate the use of dimensionality reduction algorithms for document clustering. Latent Semantic Indexing (LSI) [5] is the most popular dimensionality reduction algorithm for document analysis. LSI is essentially equivalent to PCA provided that the data points have zero mean. In this experiment, we compared our IsoProjection with LSI.
5.2.1 Data Corpora
The Reuters-21578 corpus (available at http://www.daviddlewis.com/resources/testcollections/reuters21578/), which contains 21578 documents in 135 categories, was used in our experiments. We discarded those documents with multiple category labels and selected the largest 30 categories. This left us with 8,067 documents, as described in Table 1. Each document is represented as a term-frequency vector, and each document vector is normalized to unit length. We simply removed the stop words, and no further preprocessing was done.
5.2.2 2-D Visualization of Document Set
As described previously, LSI and IsoProjection are different dimensionality reduction algorithms. In this subsection, we use them to project the documents into a 2-dimensional subspace for visualization. We randomly selected four classes for this test. Figure 3 shows the 2D embedding results. As can be seen, LSI fails to distinguish the different classes, and the four classes are mixed together. The four classes can be easily separated in the IsoProjection embedding. This illustrative example shows that IsoProjection can have more discriminating power than LSI.
5.2.3 Evaluation Metric of Clustering
We chose K-means as our clustering algorithm and compared three methods:
K-means on the original term-document matrix (Baseline)
K-means after LSI (LSI)
K-means after IsoProjection (IsoP)
In IsoProjection, the parameter $k$ (the number of nearest neighbors) was set to 15.
We tested these algorithms on several cases. For each case, $K$ (ranging from 2 to 10) classes were randomly selected from the document corpus. The documents and the cluster number $K$ are provided to the clustering algorithms. The clustering result is evaluated by comparing the obtained label of each document with the label provided by the document corpus. Two metrics, the accuracy (AC) and the normalized mutual information (MI), are used to measure the clustering performance [3]. Given a document $x_i$, let $r_i$ and $s_i$ be the obtained cluster label and the label provided by the corpus, respectively. The AC is defined as follows:
$$AC = \frac{\sum_{i=1}^{n} \delta(s_i, \mathrm{map}(r_i))}{n}$$
where $n$ is the total number of documents, $\delta(x, y)$ is the delta function that equals one if $x = y$ and zero otherwise, and $\mathrm{map}(r_i)$ is the permutation mapping function that maps each cluster label $r_i$ to the equivalent label from the data corpus. The best mapping can be found by using the Kuhn-Munkres algorithm [11].
Let $C$ denote the set of clusters obtained from the ground truth and $C'$ the set obtained from our algorithm. Their mutual information metric $MI(C, C')$ is defined as follows:
$$MI(C, C') = \sum_{c_i \in C,\, c'_j \in C'} p(c_i, c'_j) \log_2 \frac{p(c_i, c'_j)}{p(c_i)\, p(c'_j)}$$
where $p(c_i)$ and $p(c'_j)$ are the probabilities that a document arbitrarily selected from the corpus belongs to the clusters $c_i$ and $c'_j$, respectively, and $p(c_i, c'_j)$ is the joint probability that the arbitrarily selected document belongs to both clusters $c_i$ and $c'_j$ at the same time. In our experiments, we use the normalized mutual information $\overline{MI}$:
$$\overline{MI}(C, C') = \frac{MI(C, C')}{\max(H(C), H(C'))}$$
where $H(C)$ and $H(C')$ are the entropies of $C$ and $C'$, respectively. It is easy to check that $\overline{MI}(C, C')$ ranges from 0 to 1. $\overline{MI} = 1$ if the two sets of clusters are identical, and $\overline{MI} = 0$ if the two sets are independent.
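For reference, both evaluation metrics can be computed in a few lines; the sketch below (ours, not from the report) uses SciPy's Hungarian-algorithm routine for the best label mapping required by AC and an explicit entropy computation for the normalized mutual information:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, cluster_labels):
    """AC: fraction of documents whose mapped cluster label matches the corpus label."""
    true_ids = np.unique(true_labels); clus_ids = np.unique(cluster_labels)
    cost = np.zeros((len(clus_ids), len(true_ids)))
    for i, c in enumerate(clus_ids):
        for j, t in enumerate(true_ids):
            cost[i, j] = -np.sum((cluster_labels == c) & (true_labels == t))
    rows, cols = linear_sum_assignment(cost)      # Kuhn-Munkres (Hungarian) algorithm
    return -cost[rows, cols].sum() / len(true_labels)

def normalized_mutual_information(true_labels, cluster_labels):
    """MI(C, C') divided by max(H(C), H(C'))."""
    n = len(true_labels)
    mi = 0.0
    for t in np.unique(true_labels):
        for c in np.unique(cluster_labels):
            p_tc = np.sum((true_labels == t) & (cluster_labels == c)) / n
            p_t = np.sum(true_labels == t) / n
            p_c = np.sum(cluster_labels == c) / n
            if p_tc > 0:
                mi += p_tc * np.log2(p_tc / (p_t * p_c))
    def entropy(labels):
        p = np.unique(labels, return_counts=True)[1] / n
        return -np.sum(p * np.log2(p))
    return mi / max(entropy(true_labels), entropy(cluster_labels))

true = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([1, 1, 0, 0, 2, 2])
print(clustering_accuracy(true, pred), normalized_mutual_information(true, pred))
```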
5.2.4 Results
The evaluations were conducted with different numbers of clusters. For each given class number $K$, $K$ classes were randomly selected from the database. This process was repeated 50 times, and the average performance was computed. For each single test (given $K$ classes of documents), we applied the above three methods. For each method, the K-means step was repeated 10 times with different initializations, and the best result in terms of the K-means objective function was recorded. Both IsoProjection and LSI need an estimate of the dimensionality of the subspace, and in general their performance varies with this dimensionality. Figure 4 shows the clustering performance of these algorithms as a function of the dimensionality of the subspace. Table 2 shows the best performance obtained by each algorithm. The paired t-test results on the 50 random tests are reported in Table 3.
As can be seen, our clustering algorithm consistently outperformed LSI and the baseline. LSI learned a compact representation for the documents; however, there is no significant performance improvement over the baseline. This shows that LSI fails to discover the intrinsic class structure of the document corpus.
5.3 Experiments on Classification
In this subsection, we investigate the performance of our proposed IsoProjection algorithm for the classification task, specifically face recognition. In classification, the label information of the training data is available and can be incorporated into the graph construction of our algorithm. The most well-known supervised dimensionality reduction method is Linear Discriminant Analysis (LDA) [6]. Both PCA [22] and LDA [1] are popular linear methods for subspace learning in face recognition. Thus, our algorithm is compared with these two algorithms.
5.3.1 Dataset and Experimental Design
In this study, we use the Yale face database (http://cvc.yale.edu/projects/yalefaces/yalefaces.html). The Yale face database was constructed at the Yale Center for Computational Vision and Control. It contains 165 gray-scale images of 15 individuals. The images demonstrate variations in lighting condition and facial expression (normal, happy, sad, sleepy, surprised, and wink). Figure 6 shows the 11 images of one individual in the Yale database.
In the experiments, preprocessing to locate the faces was applied. Original images were manually aligned (the two eyes were aligned at the same position), cropped, and then resized to 32 × 32 pixels, with 256 gray levels per pixel. Each image is represented by a 1,024-dimensional vector in image space. Different pattern classifiers have been applied to face recognition, such as nearest-neighbor [1], Bayesian [13], and Support Vector Machine [17]. In this paper, we apply the nearest-neighbor classifier for its simplicity. The Euclidean metric is used as our distance measure.
In short, the recognition process has three steps. First, we calculate the face subspace from the training samples; then the new face image to be identified is projected into the d-dimensional subspace by our algorithm; finally, the new face image is identified by a nearest-neighbor classifier.
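A minimal sketch of this three-step recognition process is given below (our illustration; it assumes a projection matrix W has already been learned from the training images, and all variable names are hypothetical):

```python
import numpy as np

def recognize(W, train_faces, train_labels, test_face):
    """Project the gallery and a probe face into the subspace and return the
    label of the nearest training face (1-NN with the Euclidean metric).

    W           : n x d projection matrix learned from the training samples
    train_faces : m x n matrix of vectorized training images
    test_face   : n-dimensional vectorized probe image
    """
    gallery = train_faces @ W                  # m x d embedded training faces
    probe = test_face @ W                      # d-dimensional embedded probe
    nearest = np.argmin(np.linalg.norm(gallery - probe, axis=1))
    return train_labels[nearest]

# Example with random stand-ins for 1024-pixel (32 x 32) face images.
W = np.random.rand(1024, 14)
train_faces = np.random.rand(30, 1024)
train_labels = np.repeat(np.arange(15), 2)     # 15 subjects, 2 images each
print(recognize(W, train_faces, train_labels, np.random.rand(1024)))
```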
5.3.2 Results
A random subset with $l$ ($= 2, 3, 4, 5, 6, 7, 8$) images per individual was taken with labels to form the training set, and the rest of the database was considered to be the testing set. For each given $l$, we average the results over 50 random splits. Note that for LDA there are at most $c - 1$ nonzero generalized eigenvalues, so an upper bound on the dimension of the reduced space is $c - 1$, where $c$ is the number of individuals [1]. The graph in IsoProjection is built based on the label information.
In general, the performance of all these methods varies with the number of dimensions. We show the best results and the optimal dimensionality obtained by PCA, LDA, IsoProjection, and the baseline method in Table 4. The paired t-test results on the 50 random splits are reported in Table 5. For the baseline method, recognition is simply performed in the original 1024-dimensional image space without any dimensionality reduction.
As can be seen, our algorithm performed the best in all cases. There is no improvement over the baseline for the PCA method. The performance of LDA is very sensitive to the training size. When the training size is small, LDA can be even worse than PCA. As the training sample size increases, LDA achieves performance similar to IsoProjection.
6 Concluding Remarks and Future Work
In this paper, we propose a new linear dimensionality reduction algorithm called Isometric Projection. It can be performed in either the original space or a reproducing kernel Hilbert space, which leads to Kernel Isometric Projection. Both IsoProjection and kernel IsoProjection are based on the same variational principle that gives rise to Isomap [21]. As a result, they are capable of discovering the nonlinear degrees of freedom that underlie complex natural observations. Our approach has a major advantage over recent nonparametric techniques for global nonlinear dimensionality reduction such as [18][21][2] in that the functional mapping between the high- and low-dimensional spaces is valid both on and off the training data. Performance improvements of this method over Principal Component Analysis and Linear Discriminant Analysis are demonstrated through several experiments.
There are several interesting problems that we are going to explore in future work:
1. In this paper, the geodesic distance between two points is approximated by the length of the shortest path on the nearest neighbor graph. It is unclear whether there are more efficient and better ways to do this.
2. In most previous algorithms on manifold learning, either an eigen-problem or a generalized eigen-problem needs to be solved. Thus, all these algorithms will fail to handle extremely large data sets. A tradeoff between effectiveness and efficiency might be needed in such cases. It is interesting to develop a flexible algorithm which can make this tradeoff under different situations.
References
[1] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman. Eigenfaces vs. fisherfaces: recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711-720, 1997.
[2] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems 14, pages 585-591. MIT Press, Cambridge, MA, 2001.
[3] D. Cai, X. He, and J. Han. Document clustering using locality preserving indexing. IEEE Transactions on Knowledge and Data Engineering, 17(12):1624-1637, December 2005.
[4] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2nd edition, 2001.
[5] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391-407, 1990.
[6] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, Hoboken, NJ, 2nd edition, 2000.
[7] V. Gaede and O. Günther. Multidimensional access methods. ACM Computing Surveys, 30(2):170-231, 1998.
[8] A. Grama, G. Karypis, V. Kumar, and A. Gupta. An Introduction to Parallel Computing: Design and Analysis of Algorithms. Addison Wesley, 2nd edition, 2003.
[9] X. He, S. Yan, Y. Hu, P. Niyogi, and H.-J. Zhang. Face recognition using Laplacianfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3):328-340, 2005.
[10] J. M. Lee. Introduction to Smooth Manifolds. Springer-Verlag, New York, 2002.
[11] L. Lovász and M. Plummer. Matching Theory. Akadémiai Kiadó, North Holland, Budapest, 1986.
[12] K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis. Academic Press, 1980.
[13] B. Moghaddam and A. Pentland. Probabilistic visual learning for object representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):696-710, 1997.
[14] B. Moghaddam, Q. Tian, N. Lesh, C. Shen, and T. S. Huang. Visualization and user-modeling for browsing personal photo libraries. International Journal of Computer Vision, 56, 2004.
[15] A. Y. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, pages 849-856. MIT Press, Cambridge, MA, 2001.
[16] S. Papadimitriou, J. Sun, and C. Faloutsos. Streaming pattern discovery in multiple time-series. In VLDB '05: Proceedings of the 31st International Conference on Very Large Data Bases, pages 697-708, Trondheim, Norway, 2005.
[17] P. J. Phillips. Support vector machines applied to face recognition. Advances in Neural Information Processing Systems, 11:803-809, 1998.
[18] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, 2000.
[19] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.
[20] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.
[21] J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000.
[22] M. Turk and A. P. Pentland. Face recognition using eigenfaces. In IEEE Conference on Computer Vision and Pattern Recognition, Maui, Hawaii, 1991.
The IsoProjection Algorithm
Input: Data matrix $X = [x_1, \ldots, x_m]$, $x_i \in \mathbb{R}^n$; $\epsilon$ for the $\epsilon$-graph or $k$ for the kNN-graph.
Output: Transformation matrix $W = [w_1, \ldots, w_l]$, $w_j \in \mathbb{R}^n$.
Step 1: SVD projection: $X = U \Sigma V^T$, $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ with $\sigma_1 \ge \cdots \ge \sigma_r > 0$ and $r \le m$. Set $X \leftarrow U^T X$.
Step 2: Construct the neighborhood graph: Define a graph $G$ over all data points by connecting points $i$ and $j$ if $\|x_i - x_j\| < \epsilon$, or if $i$ and $j$ are among the $k$ nearest neighbors of each other. Set edge lengths equal to $\|x_i - x_j\|$.
Step 3: Compute shortest paths: Calculate the shortest-path distances $d_G(i, j)$ between all pairs of points in $G$; let $D_G = \{d_G(i, j)\}$.
Step 4: Isometric projection: Define $S$ with $S_{ij} = (D_G)_{ij}^2$ and $H = I - \frac{1}{m} e e^T$. Define $\tau(D_G) = -\frac{1}{2} H S H$. Solve the generalized eigen-problem $X \tau(D_G) X^T a = \lambda X X^T a$. Suppose $a_1, \ldots, a_l$ are the eigenvectors corresponding to the largest $l$ eigenvalues. Let $A = [a_1, \ldots, a_l]$. Then $W = U A$.

Figure 2: The Swiss roll data set, illustrating how Kernel IsoProjection exploits geodesic distance for nonlinear dimensionality reduction and provides a projection function which is defined everywhere. (a) For two arbitrary points (circled) on a nonlinear manifold, their Euclidean distance in the high-dimensional input space (length of dashed line) may not accurately reflect their intrinsic similarity, as measured by the geodesic distance along the low-dimensional manifold (length of solid blue curve). (b) The neighborhood graph G constructed in IsoProjection (with k = 7) allows an approximation (black segments) to the true geodesic path to be efficiently computed as the shortest path in G. (c) The two-dimensional embedding of Kernel IsoProjection best preserves the shortest-path distances in the neighborhood graph. The straight dashed line (red) in the lower-dimensional Euclidean space is a good approximation to the geodesic on the data manifold. (d) Three new points (test points) are injected into the system. Similarly, their Euclidean distances in the high-dimensional input space (lengths of dashed lines) cannot accurately reflect their intrinsic similarity. (e) Using the mapping function learned by Kernel IsoProjection, we map these three test points into the two-dimensional space, where the Euclidean distances can accurately reflect their intrinsic relationship. (f) The embedding results of Kernel PCA on the same data set. Clearly, Kernel PCA fails to capture the low-dimensional manifold structure.
Table 1: 30 semantic categories from Reuters-21578 used in our experiments
category num of doc category num of doc
earn 3713 grain 45
acq 2055 copper 44
crude 321 jobs 42
trade 298 reserves 38
money-fx 245 rubber 38
interest 197 iron-steel 37
ship 142 ipi 36
sugar 114 nat-gas 33
coffee 110 veg-oil 30
gold 90 tin 27
money-supply 87 cotton 24
gnp 63 bop 23
cpi 60 wpi 20
cocoa 53 pet-chem 19
alum 45 livestock 18

Figure 3: 2D visualization of a document set: (a) LSI; (b) IsoProjection.

Figure 4: The average accuracy over different numbers of classes (panels (a) through (h) correspond to 2 through 9 classes). The clustering performance was evaluated at different dimensionalities. As can be seen, the clustering performance of both IsoProjection and LSI is not sensitive to the reduced dimensionality. Clustering performance after IsoProjection is in all cases consistently better than the baseline, while clustering after LSI does not show any significant improvement over the baseline.

Figure 5: The average accuracy on 10 classes.
Table 2: Clustering Results on Reuters-21578
Accuracy (%) Mutual Information (%)
k Baseline LSI IsoP Baseline LSI IsoP
2 87.13 87.50 93.91 59.98 60.75 73.64
3 77.53 77.83 81.50 56.68 56.98 61.60
4 73.23 74.01 76.89 59.80 60.35 62.38
5 67.11 67.34 69.55 56.31 56.22 58.18
6 65.48 65.95 68.54 57.90 58.09 59.63
7 62.31 62.53 66.28 57.27 57.40 60.20
8 58.19 58.75 61.55 55.61 55.89 57.41
9 55.26 55.62 59.02 54.88 55.20 56.87
10 54.50 55.15 57.39 55.16 55.45 56.81
Ave. 66.75 67.18 70.51 57.07 57.37 60.75
Table 3: T-test on clustering
LSI vs. Baseline IsoP vs. LSI
k Accuracy Mutual Info. Accuracy Mutual Info.
2
3
4
5
6
7
8
9
10
≪ or ≫ means P-value ≤ 0.01
> or < means 0.01 < P-value ≤ 0.05
∼ means P-value > 0.05
Figure 6: Sample face images from the Yale database. For each subject, there are 11 face images under different lighting conditions and facial expressions.
Table 4: Recognition accuracy on the Yale database (numbers in parentheses are the corresponding optimal dimensionalities)
Train Num Baseline PCA LDA IsoProjection
2 0.46 0.46 (29) 0.44 (9) 0.56 (14)
3 0.52 0.52 (44) 0.61 (14) 0.67 (14)
4 0.55 0.55 (59) 0.69 (14) 0.73 (14)
5 0.58 0.58 (74) 0.74 (14) 0.77 (14)
6 0.61 0.61 (89) 0.77 (14) 0.79 (14)
7 0.62 0.62 (36) 0.80 (14) 0.81 (14)
8 0.65 0.65 (116) 0.81 (14) 0.82 (14)
Table 5: T-test on classication
Train Num LDA vs. Baseline IsoProjection vs. LDA
2
3
4
5
6
7
8