Graph Regularized Non-Negative Matrix Factorization For Data Representation
Abstract: Matrix factorization techniques have been frequently applied in information retrieval, computer vision and pattern recognition. Among them, Non-negative Matrix Factorization (NMF) has received considerable attention due to its psychological and
physiological interpretation of naturally occurring data whose representation may be parts-based in the human brain. On the other hand,
from the geometric perspective, the data is usually sampled from a low dimensional manifold embedded in a high dimensional ambient
space. One hopes then to find a compact representation which uncovers the hidden semantics and simultaneously respects the intrinsic
geometric structure. In this paper, we propose a novel algorithm, called Graph Regularized Non-negative Matrix Factorization (GNMF),
for this purpose. In GNMF, an affinity graph is constructed to encode the geometrical information, and we seek a matrix factorization
which respects the graph structure. Our empirical study shows encouraging results of the proposed algorithm in comparison to the
state-of-the-art algorithms on real world problems.
Index Terms: Non-negative Matrix Factorization, Graph Laplacian, Manifold Regularization, Clustering.
INTRODUCTION
Non-negative Matrix Factorization (NMF) [26] is a matrix factorization algorithm that focuses on the analysis
of data matrices whose elements are nonnegative.
Given a data matrix $X = [x_1, \ldots, x_N] \in \mathbb{R}^{M \times N}$, each column of $X$ is a sample vector. NMF aims to find two non-negative matrices $U = [u_{ik}] \in \mathbb{R}^{M \times K}$ and $V = [v_{jk}] \in \mathbb{R}^{N \times K}$ whose product can well approximate the original matrix:
\[
X \approx U V^{T}.
\]
There are two commonly used cost functions that quantify the quality of the approximation. The first one is the square of the Euclidean distance between the two matrices (i.e., the square of the Frobenius norm of their difference) [33]:
\[
O_1 = \|X - UV^{T}\|^{2} = \sum_{i,j}\Big(x_{ij} - \sum_{k=1}^{K} u_{ik} v_{jk}\Big)^{2}. \tag{1}
\]
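For concreteness, here is a minimal NumPy sketch of plain NMF with multiplicative updates for the cost in Eq. (1); it is not taken from the paper, and the random initialization and the small constant `eps` in the denominators are illustrative assumptions rather than part of the derivation.

```python
import numpy as np

def nmf(X, K, n_iter=200, eps=1e-9, seed=0):
    """Plain NMF, multiplicative updates for O1 = ||X - U V^T||^2.

    X is an (M, N) non-negative data matrix whose columns are samples.
    Returns non-negative factors U (M, K) and V (N, K).
    """
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.random((M, K))                       # basis matrix
    V = rng.random((N, K))                       # coefficient matrix
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)     # u_ik <- u_ik (XV)_ik / (U V^T V)_ik
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)   # v_jk <- v_jk (X^T U)_jk / (V U^T U)_jk
    return U, V
```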
In this factorization, each data point $x_j$ is approximated by a linear combination of the column vectors of $U$, weighted by the entries of the $j$-th row of $V$:
\[
x_j \approx \sum_{k=1}^{K} u_k v_{jk}. \tag{3}
\]
Writing $z_j = [v_{j1}, \ldots, v_{jK}]^{T}$ for the low dimensional representation of $x_j$, one can use $\|z_j - z_l\|^{2}$ to measure the dissimilarity between the low dimensional representations of two data points with respect to the new basis. We construct a $p$-nearest neighbor graph on the data points and let $W$ denote its symmetric weight matrix, where $W_{jl}$ measures the closeness of $x_j$ and $x_l$.
With the above defined weight matrix W, we can use
the following two terms to measure the smoothness of
the low dimensional representation.
One term is based on the divergence between $z_j$ and $z_l$:
\[
R_2 = \frac{1}{2}\sum_{j,l=1}^{N}\big(D(z_j\|z_l) + D(z_l\|z_j)\big) W_{jl}
    = \frac{1}{2}\sum_{j,l=1}^{N}\sum_{k=1}^{K}\Big(v_{jk}\log\frac{v_{jk}}{v_{lk}} + v_{lk}\log\frac{v_{lk}}{v_{jk}}\Big) W_{jl}, \tag{4}
\]
and the other is based on the Euclidean distance:
\[
R_1 = \frac{1}{2}\sum_{j,l=1}^{N}\|z_j - z_l\|^{2} W_{jl}
    = \sum_{j=1}^{N} z_j^{T} z_j D_{jj} - \sum_{j,l=1}^{N} z_j^{T} z_l W_{jl}
    = \mathrm{Tr}(V^{T} D V) - \mathrm{Tr}(V^{T} W V) = \mathrm{Tr}(V^{T} L V), \tag{5}
\]
where $D$ is the diagonal matrix whose entries are the row sums of $W$, $D_{jj} = \sum_{l} W_{jl}$, and $L = D - W$ is the graph Laplacian. Combining these regularizers with the NMF cost functions and introducing a regularization parameter $\lambda \ge 0$, GNMF minimizes
\[
O_1 = \|X - UV^{T}\|^{2} + \lambda\,\mathrm{Tr}(V^{T} L V) \tag{6}
\]
for the F-norm formulation, and
\[
O_2 = \sum_{i=1}^{M}\sum_{j=1}^{N}\Big(x_{ij}\log\frac{x_{ij}}{\sum_{k=1}^{K} u_{ik} v_{jk}} - x_{ij} + \sum_{k=1}^{K} u_{ik} v_{jk}\Big)
    + \frac{\lambda}{2}\sum_{j,l=1}^{N}\sum_{k=1}^{K}\Big(v_{jk}\log\frac{v_{jk}}{v_{lk}} + v_{lk}\log\frac{v_{lk}}{v_{jk}}\Big) W_{jl} \tag{7}
\]
for the divergence formulation.
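To make the geometric ingredients concrete, the sketch below (NumPy assumed; the function names are hypothetical) constructs a p-nearest-neighbor graph with 0-1 weighting, forms the degree matrix D and the graph Laplacian L = D - W, and evaluates the smoothness term R1 = Tr(V^T L V). A small R1 means that points joined by an edge have similar low dimensional representations.

```python
import numpy as np

def knn_graph(X, p=5):
    """0-1 weighted p-nearest-neighbor graph; X (M, N) has samples as columns."""
    N = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)   # squared Euclidean distances
    W = np.zeros((N, N))
    for j in range(N):
        nn = np.argsort(dist[j])[1:p + 1]                # p nearest neighbors, self excluded
        W[j, nn] = 1.0
    W = np.maximum(W, W.T)                               # symmetrize
    D = np.diag(W.sum(axis=1))                           # degree matrix
    L = D - W                                            # graph Laplacian
    return W, D, L

def smoothness_R1(V, L):
    """R1 = Tr(V^T L V) = (1/2) sum_{j,l} ||z_j - z_l||^2 W_jl, z_j = j-th row of V."""
    return np.trace(V.T @ L @ V)
```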
The partial derivatives of $O_1$ with respect to $U$ and $V$ are
\[
\frac{\partial O_1}{\partial U} = -2XV + 2UV^{T}V, \tag{10}
\]
\[
\frac{\partial O_1}{\partial V} = -2X^{T}U + 2VU^{T}U + 2\lambda LV. \tag{11}
\]
Together with the non-negativity constraints on $u_{ik}$ and $v_{jk}$, these lead to the following multiplicative updating rules:
\[
u_{ik} \leftarrow u_{ik}\,\frac{(XV)_{ik}}{(UV^{T}V)_{ik}}, \tag{14}
\]
\[
v_{jk} \leftarrow v_{jk}\,\frac{(X^{T}U + \lambda WV)_{jk}}{(VU^{T}U + \lambda DV)_{jk}}. \tag{15}
\]
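A minimal sketch of the resulting F-norm GNMF iteration, assuming NumPy; the random initialization and the small `eps` guard in the denominators are practical assumptions, not part of Eqs. (14)-(15).

```python
import numpy as np

def gnmf_fnorm(X, W, K, lam=100.0, n_iter=200, eps=1e-9, seed=0):
    """GNMF with multiplicative updates, Eqs. (14)-(15).

    X: (M, N) non-negative data matrix; W: (N, N) symmetric affinity matrix;
    lam: regularization parameter lambda.
    """
    rng = np.random.default_rng(seed)
    M, N = X.shape
    D = np.diag(W.sum(axis=1))
    U = rng.random((M, K))
    V = rng.random((N, K))
    for _ in range(n_iter):
        # Eq. (14): u_ik <- u_ik (XV)_ik / (U V^T V)_ik
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        # Eq. (15): v_jk <- v_jk (X^T U + lam W V)_jk / (V U^T U + lam D V)_jk
        V *= (X.T @ U + lam * (W @ V)) / (V @ (U.T @ U) + lam * (D @ V) + eps)
    return U, V
```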
Our GNMF also adopts the normalization strategy commonly used for NMF: after the multiplicative updating procedure converges, we set the Euclidean length of each column vector of $U$ to 1 and rescale $V$ accordingly so that $UV^{T}$ does not change.
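A sketch of this normalization step (NumPy assumed; the `eps` guard is an illustrative addition):

```python
import numpy as np

def normalize_factors(U, V, eps=1e-12):
    """Give each column of U unit Euclidean length and rescale V so that
    the product U V^T is unchanged."""
    norms = np.linalg.norm(U, axis=0) + eps
    return U / norms, V * norms
```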
3.3 Connection with Gradient Descent

The multiplicative updates can also be viewed as gradient descent steps
\[
u_{ik} \leftarrow u_{ik} + \eta_{ik}\,\frac{\partial O_1}{\partial u_{ik}}, \tag{16}
\]
\[
v_{jk} \leftarrow v_{jk} + \delta_{jk}\,\frac{\partial O_1}{\partial v_{jk}}. \tag{17}
\]
Here $\eta_{ik}$ and $\delta_{jk}$ are usually referred to as step size parameters. As long as $\eta_{ik}$ and $\delta_{jk}$ are sufficiently small, the above updates should reduce $O_1$ unless $U$ and $V$ are at a stationary point.
Generally speaking, it is relatively difficult to set these step size parameters while still maintaining the non-negativity of $u_{ik}$ and $v_{jk}$. However, with the special form of the partial derivatives, we can use some tricks to set the step size parameters automatically. Let $\eta_{ik} = -u_{ik}/\big(2(UV^{T}V)_{ik}\big)$; then we have
\[
u_{ik} + \eta_{ik}\frac{\partial O_1}{\partial u_{ik}}
 = u_{ik} - \frac{u_{ik}}{2(UV^{T}V)_{ik}}\big(-2(XV)_{ik} + 2(UV^{T}V)_{ik}\big)
 = u_{ik}\,\frac{(XV)_{ik}}{(UV^{T}V)_{ik}}. \tag{18}
\]
Similarly, let $\delta_{jk} = -v_{jk}/\big(2(VU^{T}U + \lambda DV)_{jk}\big)$; then we have
\[
v_{jk} + \delta_{jk}\frac{\partial O_1}{\partial v_{jk}}
 = v_{jk} - \frac{v_{jk}}{2(VU^{T}U + \lambda DV)_{jk}}\big(-2(X^{T}U)_{jk} + 2(VU^{T}U)_{jk} + 2\lambda(LV)_{jk}\big)
 = v_{jk}\,\frac{(X^{T}U + \lambda WV)_{jk}}{(VU^{T}U + \lambda DV)_{jk}}. \tag{19}
\]
Now it is clear that the multiplicative updating rules in Eq. (14) and Eq. (15) are special cases of gradient descent with an automatic step size selection. The advantage of the multiplicative updating rules is the guarantee of non-negativity of $U$ and $V$. Theorem 1 also guarantees that the multiplicative updating rules in Eqs. (14) and (15) converge to a local optimum.
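This equivalence is easy to check numerically. The sketch below, which uses small random non-negative matrices and NumPy purely as a sanity check, verifies that the gradient step of Eq. (17) with the step size of Eq. (19) coincides with the multiplicative rule of Eq. (15).

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K, lam = 20, 30, 5, 10.0
X = rng.random((M, N))
U = rng.random((M, K))
V = rng.random((N, K))
W = rng.random((N, N))
W = (W + W.T) / 2                                 # toy symmetric affinity matrix
D = np.diag(W.sum(axis=1))
L = D - W

A = V @ (U.T @ U) + lam * (D @ V)                 # (V U^T U + lam D V)
B = X.T @ U + lam * (W @ V)                       # (X^T U + lam W V)
grad = -2 * (X.T @ U) + 2 * V @ (U.T @ U) + 2 * lam * (L @ V)   # dO1/dV, Eq. (11)
delta = -V / (2 * A)                              # element-wise step sizes, Eq. (19)

V_grad = V + delta * grad                         # gradient step, Eq. (17)
V_mult = V * B / A                                # multiplicative rule, Eq. (15)
assert np.allclose(V_grad, V_mult)
```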
3.4 Updating Rules Minimizing Eq. (7)

For the divergence formulation in Eq. (7), the updating rule for $U$ is
\[
u_{ik} \leftarrow u_{ik}\,\frac{\sum_{j}\big(x_{ij} v_{jk} / \sum_{k} u_{ik} v_{jk}\big)}{\sum_{j} v_{jk}}, \tag{20}
\]
while each column $v_k$ of $V$ is obtained by solving a linear system:
\[
\Big(\sum_{i} u_{ik}\, I + \lambda L\Big) v_k = b_k,
\qquad
b_k =
\begin{bmatrix}
v_{1k}\sum_{i} x_{i1} u_{ik} / \sum_{k} u_{ik} v_{1k}\\
v_{2k}\sum_{i} x_{i2} u_{ik} / \sum_{k} u_{ik} v_{2k}\\
\vdots\\
v_{Nk}\sum_{i} x_{iN} u_{ik} / \sum_{k} u_{ik} v_{Nk}
\end{bmatrix},
\qquad
v_k \leftarrow \Big(\sum_{i} u_{ik}\, I + \lambda L\Big)^{-1} b_k. \tag{21}
\]
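The sketch below illustrates this V update, assuming NumPy and, for simplicity, a dense Laplacian (the paper relies on L being sparse). The plain conjugate gradient routine and the helper `update_V_divergence` are illustrative, not the authors' implementation.

```python
import numpy as np

def cg(matvec, b, x0, n_iter=20, tol=1e-10):
    """Plain conjugate gradient for a symmetric positive-definite operator."""
    x, r = x0.copy(), b - matvec(x0)
    p, rs = r.copy(), r @ r
    for _ in range(n_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def update_V_divergence(X, U, V, L, lam=100.0, q=20):
    """V update of Eq. (21): solve (sum_i u_ik I + lam L) v_k = b_k for each column."""
    N, K = V.shape
    R = X / (U @ V.T)                  # x_ij / sum_k u_ik v_jk, shape (M, N)
    Bmat = V * (R.T @ U)               # b_jk = v_jk sum_i x_ij u_ik / sum_k u_ik v_jk
    V_new = np.empty_like(V)
    for k in range(K):
        s = U[:, k].sum()
        V_new[:, k] = cg(lambda v, s=s: s * v + lam * (L @ v), Bmat[:, k], V[:, k], n_iter=q)
    return V_new
```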
TABLE 1
Computational operation counts for each iteration in NMF and GNMF

F-norm formulation
|      | fladd                      | flmlt                                | fldiv  | overall |
| NMF  | 2MNK + 2(M+N)K^2           | 2MNK + 2(M+N)K^2 + (M+N)K            | (M+N)K | O(MNK)  |
| GNMF | 2MNK + 2(M+N)K^2 + N(p+3)K | 2MNK + 2(M+N)K^2 + (M+N)K + N(p+1)K  | (M+N)K | O(MNK)  |

Divergence formulation
|      | fladd                     | flmlt                         | fldiv        | overall           |
| NMF  | 4MNK + (M+N)K             | 4MNK + (M+N)K                 | 2MN + (M+N)K | O(MNK)            |
| GNMF | 4MNK + (M+2N)K + q(p+4)NK | 4MNK + (M+N)K + Np + q(p+4)NK | 2MN + MK     | O((M + q(p+4))NK) |

fladd: a floating-point addition; flmlt: a floating-point multiplication; fldiv: a floating-point division.
N: the number of sample points; M: the number of features; K: the number of factors; p: the number of nearest neighbors; q: the number of iterations in CG.
Since the matrix $\sum_{i} u_{ik} I + \lambda L$ is symmetric, positive-definite and sparse, we can use the iterative Conjugate Gradient (CG) algorithm [20] to solve this linear system of equations very efficiently. In each iteration, CG needs to compute matrix-vector products of the form $(\sum_{i} u_{ik} I + \lambda L)p$; since $L$ has roughly $pN$ non-zero entries, each such product costs about $pN$ flam. The remaining work load of CG in each iteration is $4N$ flam. Thus, the time cost of CG in each iteration is $pN + 4N$. If CG stops after $q$ iterations, the total time cost is $q(p+4)N$. CG converges very fast, usually within 20 iterations. Since we need to solve $K$ linear systems of equations, the total time cost is $q(p+4)NK$.
Besides the multiplicative updates, GNMF also needs $O(N^{2}M)$ to construct the $p$-nearest neighbor graph. Suppose the multiplicative updates stop after $t$ iterations; the overall cost for NMF (both formulations) is
\[
O(tMNK), \tag{22}
\]
while the overall cost for GNMF with the F-norm formulation is
\[
O(tMNK + N^{2}M). \tag{23}
\]
EXPERIMENTAL RESULTS
TABLE 2
Statistics of the three data sets

| dataset | size (N) | dimensionality (M) | # of classes (K) |
| COIL20  | 1440     | 1024               | 20               |
| PIE     | 2856     | 1024               | 68               |
| TDT2    | 9394     | 36771              | 30               |
TABLE 3
Clustering performance on COIL20

Accuracy (%)
| K    | Kmeans    | PCA       | NCut      | NMF       | GNMF      |
| 4    | 83.0±15.2 | 83.1±15.0 | 89.4±11.1 | 81.0±14.2 | 93.5±10.1 |
| 6    | 74.5±10.3 | 75.5±12.2 | 83.6±11.3 | 74.3±10.1 | 92.4±6.1  |
| 8    | 68.6±5.7  | 70.4±9.3  | 79.1±7.7  | 69.3±8.6  | 84.0±9.6  |
| 10   | 69.6±8.0  | 70.8±7.2  | 79.4±7.6  | 69.4±7.6  | 84.4±4.9  |
| 12   | 65.0±6.8  | 64.3±4.6  | 74.9±5.5  | 69.0±6.3  | 81.0±8.3  |
| 14   | 64.0±4.9  | 67.3±6.2  | 71.5±5.6  | 67.6±5.6  | 79.2±5.2  |
| 16   | 64.0±4.9  | 64.1±4.9  | 70.7±4.1  | 66.0±6.0  | 76.8±4.1  |
| 18   | 62.7±4.7  | 62.3±4.3  | 67.2±4.1  | 62.8±3.7  | 76.0±3.0  |
| 20   | 63.7      | 64.3      | 69.6      | 60.5      | 75.3      |
| Avg. | 68.3      | 69.1      | 76.2      | 68.9      | 82.5      |

Normalized mutual information (%)
| K    | Kmeans    | PCA       | GNMF      |
| 4    | 74.6±18.3 | 74.4±18.2 | 90.9±12.7 |
| 6    | 73.2±11.4 | 73.1±12.1 | 91.1±5.6  |
| 8    | 71.8±6.8  | 72.8±8.3  | 89.0±6.5  |
| 10   | 75.0±6.2  | 75.1±5.2  | 89.2±3.3  |
| 12   | 73.1±5.6  | 72.5±4.6  | 88.0±4.9  |
| 14   | 73.3±4.2  | 74.9±4.9  | 87.3±3.0  |
| 16   | 74.6±3.1  | 74.5±2.7  | 86.5±2.0  |
| 18   | 73.7±2.6  | 73.9±2.5  | 85.8±1.8  |
| 20   | 73.4      | 74.5      | 87.5      |
| Avg. | 73.6      | 74.0      | 88.4      |
TABLE 4
Clustering performance on PIE

Accuracy (%)
| K    | Kmeans   | PCA      | NCut     | NMF      | GNMF     |
| 10   | 29.0±3.7 | 29.8±3.3 | 82.5±8.6 | 57.8±6.3 | 80.3±8.7 |
| 20   | 27.9±2.2 | 27.7±2.4 | 75.9±4.4 | 62.0±3.5 | 79.5±5.2 |
| 30   | 26.1±1.3 | 26.5±1.7 | 74.4±3.6 | 63.3±3.7 | 78.9±4.5 |
| 40   | 25.4±1.4 | 25.6±1.6 | 70.4±2.9 | 63.7±2.4 | 77.1±3.2 |
| 50   | 25.0±0.8 | 24.6±1.0 | 68.2±2.2 | 65.2±2.9 | 75.7±3.0 |
| 60   | 24.2±0.8 | 24.6±0.7 | 67.7±2.1 | 65.1±1.4 | 74.6±2.7 |
| 68   | 23.9     | 25.0     | 65.9     | 66.2     | 75.4     |
| Avg. | 25.9     | 26.3     | 73.6     | 63.3     | 77.4     |

Normalized mutual information (%)
| K    | Kmeans   | PCA      | GNMF     |
| 10   | 34.8±4.1 | 35.8±3.9 | 86.1±5.5 |
| 20   | 44.9±2.4 | 44.7±2.8 | 88.0±2.8 |
| 30   | 48.4±1.8 | 48.8±1.5 | 89.1±1.6 |
| 40   | 50.9±1.7 | 50.9±1.8 | 88.6±1.2 |
| 50   | 52.6±0.8 | 51.9±1.3 | 88.8±1.1 |
| 60   | 53.0±1.0 | 53.4±0.9 | 88.7±0.9 |
| 68   | 55.1     | 54.7     | 88.6     |
| Avg. | 48.5     | 48.6     | 88.3     |
TABLE 5
Clustering performance on TDT2

Accuracy (%)
| K    | Kmeans    | SVD       | NCut      | NMF       | GNMF     |
| 5    | 80.8±17.5 | 82.7±16.0 | 96.4±0.7  | 95.5±10.2 | 98.5±2.8 |
| 10   | 68.5±15.3 | 68.2±13.6 | 88.2±10.8 | 83.6±12.2 | 91.4±7.6 |
| 15   | 64.9±8.7  | 65.3±7.2  | 82.1±11.2 | 79.9±11.7 | 93.4±2.7 |
| 20   | 63.9±4.2  | 63.4±5.5  | 79.0±8.1  | 76.3±5.6  | 91.2±2.6 |
| 25   | 61.5±4.3  | 60.8±4.0  | 74.3±4.8  | 75.0±4.5  | 88.6±2.1 |
| 30   | 61.2      | 65.9      | 71.2      | 71.9      | 88.6     |
| Avg. | 66.8      | 67.7      | 81.9      | 80.4      | 92.0     |

Normalized mutual information (%)
| K    | Kmeans    | GNMF     |
| 5    | 78.1±19.0 | 94.2±8.9 |
| 10   | 73.1±13.5 | 85.6±9.2 |
| 15   | 74.0±7.9  | 88.0±5.7 |
| 20   | 75.7±4.5  | 85.9±4.1 |
| 25   | 74.6±2.4  | 83.9±2.6 |
| 30   | 74.7      | 83.7     |
| Avg. | 75.0      | 86.9     |
Fig. 1. The performance of GNMF vs. the parameter λ on (a) COIL20, (b) PIE, and (c) TDT2 (accuracy (%), compared with Kmeans, PCA/SVD, NCut, and NMF). GNMF is stable with respect to the parameter λ: it achieves consistently good performance when λ varies from 10 to 1000.
[Figure: accuracy (%) vs. the number of nearest neighbors p on (a) COIL20, (b) PIE, and (c) TDT2, comparing GNMF with Kmeans, PCA/SVD, NCut, and NMF.]
9
90
92
88
Normalized mutual information (%)
94
Accuracy (%)
90
88
86
84
82
80
78
Dotproduct weighting
01 weighting
3
86
84
82
80
78
76
Dotproduct weighting
01 weighting
74
72
11 13 15 17 19 21 23 25
p
(a) AC
11 13 15 17 19 21 23 25
p
(b) NMI
[Figure: (a) accuracy (%) and (b) normalized mutual information (%) for heat kernel weighting (σ = 0.2 and σ = 0.5) compared with 0-1 weighting.]
[Figure: basis vectors (column vectors of U) learned by (a) NMF and (b) GNMF.]
[Figure: objective function value vs. iteration number for NMF and GNMF on (a) COIL20, (b) PIE, and (c) TDT2.]
CONCLUSION

We have presented a novel method for matrix factorization, called Graph regularized Non-negative Matrix Factorization (GNMF). GNMF models the data space as a submanifold embedded in the ambient space and performs non-negative matrix factorization on this manifold. As a result, GNMF can have more discriminating power than the ordinary NMF approach, which only considers the Euclidean structure of the data. Experimental results on document and image clustering show that GNMF provides a better representation in the sense of semantic structure.
Several questions remain to be investigated in our future work:
1) There is a regularization parameter λ which controls the smoothness of our GNMF model. GNMF boils down to the original NMF when λ = 0. Thus, a suitable value of λ is critical to our algorithm. It remains unclear how to do model selection theoretically and efficiently.
ACKNOWLEDGMENTS
This work was supported in part by National Natural
Science Foundation of China under Grants 60905001 and
90920303, National Key Basic Research Foundation of
China under Grant 2009CB320801, NSF IIS-09-05215 and
the U.S. Army Research Laboratory under Cooperative
Agreement Number W911NF-09-2-0053 (NS-CTA). Any
opinions, findings, and conclusions expressed here are
those of the authors and do not necessarily reflect the
views of the funding agencies.
Considering any element $v_{ab}$ in $V$, we use $F_{ab}$ to denote the part of $O_1$ that is only relevant to $v_{ab}$, where
\[
O_1 = \sum_{i=1}^{M}\sum_{j=1}^{N}\Big(x_{ij} - \sum_{k=1}^{K} u_{ik} v_{jk}\Big)^{2} + \lambda \sum_{k=1}^{K}\sum_{j=1}^{N}\sum_{l=1}^{N} v_{jk} L_{jl} v_{lk}. \tag{25}
\]
Its first- and second-order derivatives with respect to $v_{ab}$ are
\[
F'_{ab} = \Big(\frac{\partial O_1}{\partial V}\Big)_{ab} = \big(-2X^{T}U + 2VU^{T}U + 2\lambda LV\big)_{ab}, \tag{27}
\]
\[
F''_{ab} = 2\big(U^{T}U\big)_{bb} + 2\lambda L_{aa}. \tag{28}
\]
Since our update is essentially element-wise, it is sufficient to show that each $F_{ab}$ is nonincreasing under the update step of Eq. (15).

Lemma 4: The function
\[
G\big(v, v_{ab}^{(t)}\big) = F_{ab}\big(v_{ab}^{(t)}\big) + F'_{ab}\big(v_{ab}^{(t)}\big)\big(v - v_{ab}^{(t)}\big)
 + \frac{\big(VU^{T}U\big)_{ab} + \lambda\big(DV\big)_{ab}}{v_{ab}^{(t)}}\big(v - v_{ab}^{(t)}\big)^{2} \tag{29}
\]
is an auxiliary function for $F_{ab}$.

Proof: Since $G(v, v) = F_{ab}(v)$ is obvious, we only need to show that $G\big(v, v_{ab}^{(t)}\big) \ge F_{ab}(v)$. To do this, we compare the Taylor series expansion of $F_{ab}(v)$,
\[
F_{ab}(v) = F_{ab}\big(v_{ab}^{(t)}\big) + F'_{ab}\big(v_{ab}^{(t)}\big)\big(v - v_{ab}^{(t)}\big)
 + \big[\big(U^{T}U\big)_{bb} + \lambda L_{aa}\big]\big(v - v_{ab}^{(t)}\big)^{2}, \tag{30}
\]
with Eq. (29) to find that $G\big(v, v_{ab}^{(t)}\big) \ge F_{ab}(v)$ is equivalent to
\[
\frac{\big(VU^{T}U\big)_{ab} + \lambda\big(DV\big)_{ab}}{v_{ab}^{(t)}} \ge \big(U^{T}U\big)_{bb} + \lambda L_{aa}. \tag{31}
\]
We have
\[
\big(VU^{T}U\big)_{ab} = \sum_{l=1}^{K} v_{al}^{(t)}\big(U^{T}U\big)_{lb} \ge v_{ab}^{(t)}\big(U^{T}U\big)_{bb} \tag{32}
\]
and
\[
\lambda\big(DV\big)_{ab} = \lambda\sum_{j=1}^{N} D_{aj} v_{jb}^{(t)} = \lambda D_{aa} v_{ab}^{(t)}
 \ge \lambda\big(D - W\big)_{aa} v_{ab}^{(t)} = \lambda L_{aa} v_{ab}^{(t)}. \tag{33}
\]
Thus, Eq. (31) holds and $G\big(v, v_{ab}^{(t)}\big) \ge F_{ab}(v)$.

Replacing $G\big(v, v_{ab}^{(t)}\big)$ in the update $v_{ab}^{(t+1)} = \arg\min_{v} G\big(v, v_{ab}^{(t)}\big)$ by Eq. (29) results in
\[
v_{ab}^{(t+1)} = v_{ab}^{(t)} - v_{ab}^{(t)}\,\frac{F'_{ab}\big(v_{ab}^{(t)}\big)}{2\big(VU^{T}U\big)_{ab} + 2\lambda\big(DV\big)_{ab}}
 = v_{ab}^{(t)}\,\frac{\big(X^{T}U + \lambda WV\big)_{ab}}{\big(VU^{T}U + \lambda DV\big)_{ab}}. \tag{34}
\]
Since Eq. (29) is an auxiliary function, $F_{ab}$ is nonincreasing under this update rule.
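The nonincreasing behavior established here can also be observed numerically. The following sketch, assuming NumPy and small random non-negative data (a toy affinity matrix rather than a nearest-neighbor graph), checks that the objective of Eq. (6) does not increase under the updates of Eqs. (14)-(15).

```python
import numpy as np

def gnmf_objective(X, U, V, L, lam):
    return np.linalg.norm(X - U @ V.T, 'fro') ** 2 + lam * np.trace(V.T @ L @ V)

rng = np.random.default_rng(0)
M, N, K, lam = 15, 25, 4, 5.0
X = rng.random((M, N))
U = rng.random((M, K))
V = rng.random((N, K))
W = rng.random((N, N))
W = (W + W.T) / 2
D = np.diag(W.sum(axis=1))
L = D - W

for _ in range(100):
    before = gnmf_objective(X, U, V, L, lam)
    U = U * (X @ V) / (U @ (V.T @ V))                                    # Eq. (14)
    V = V * (X.T @ U + lam * (W @ V)) / (V @ (U.T @ U) + lam * (D @ V))  # Eq. (15)
    after = gnmf_objective(X, U, V, L, lam)
    assert after <= before + 1e-8
```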
Lemma 5: The function
\[
\begin{aligned}
G\big(V, V^{(t)}\big) ={}& \sum_{i,j}\Big(x_{ij}\log x_{ij} - x_{ij} + \sum_{k=1}^{K} u_{ik} v_{jk}\Big)\\
 &- \sum_{i,j,k} x_{ij}\,\frac{u_{ik} v_{jk}^{(t)}}{\sum_{k} u_{ik} v_{jk}^{(t)}}
   \Big(\log u_{ik} v_{jk} - \log\frac{u_{ik} v_{jk}^{(t)}}{\sum_{k} u_{ik} v_{jk}^{(t)}}\Big)\\
 &+ \frac{\lambda}{2}\sum_{j,l,k}\Big(v_{jk}\log\frac{v_{jk}}{v_{lk}} + v_{lk}\log\frac{v_{lk}}{v_{jk}}\Big) W_{jl} \tag{35}
\end{aligned}
\]
is an auxiliary function for the objective function of Eq. (7).

Minimizing $G\big(V, V^{(t)}\big)$ with respect to $v_{jk}$ leads to the following system of equations:
\[
\sum_{i=1}^{M} u_{ik} - \frac{1}{v_{jk}}\sum_{i=1}^{M} x_{ij}\,\frac{u_{ik} v_{jk}^{(t)}}{\sum_{k} u_{ik} v_{jk}^{(t)}}
 + \frac{\lambda}{v_{jk}}\sum_{l=1}^{N}\big(v_{jk} - v_{lk}\big) W_{jl} = 0,
 \qquad 1 \le j \le N,\ 1 \le k \le K. \tag{36}
\]
Multiplying Eq. (36) by $v_{jk}$ and collecting, for each $k$, the $N$ equations in matrix form, we obtain
\[
\Big(\sum_{i} u_{ik}\, I + \lambda L\Big) v_k =
\begin{bmatrix}
v_{1k}^{(t)}\sum_{i} x_{i1} u_{ik} / \sum_{k} u_{ik} v_{1k}^{(t)}\\
\vdots\\
v_{Nk}^{(t)}\sum_{i} x_{iN} u_{ik} / \sum_{k} u_{ik} v_{Nk}^{(t)}
\end{bmatrix},
\qquad 1 \le k \le K,
\]
so that
\[
v_k^{(t+1)} = \Big(\sum_{i} u_{ik}\, I + \lambda L\Big)^{-1}
\begin{bmatrix}
v_{1k}^{(t)}\sum_{i} x_{i1} u_{ik} / \sum_{k} u_{ik} v_{1k}^{(t)}\\
\vdots\\
v_{Nk}^{(t)}\sum_{i} x_{iN} u_{ik} / \sum_{k} u_{ik} v_{Nk}^{(t)}
\end{bmatrix},
\qquad 1 \le k \le K,
\]
which is exactly the update rule of Eq. (21).
\[
O = \sum_{j=1}^{N}\big(x_j - U z_j\big)^{T}\big(x_j - U z_j\big)
  = \mathrm{Tr}\Big(\big(X - UV^{T}\big)\big(X - UV^{T}\big)^{T}\Big),
\]
and, with per-sample weights $\gamma_j$ collected in the diagonal matrix $\Gamma = \mathrm{diag}(\gamma_1, \ldots, \gamma_N)$,
\[
\begin{aligned}
\sum_{j=1}^{N}\gamma_j\big(x_j - U z_j\big)^{T}\big(x_j - U z_j\big) + \lambda\,\mathrm{Tr}\big(V^{T} L V\big)
 &= \mathrm{Tr}\Big(\big(X - UV^{T}\big)\,\Gamma\,\big(X - UV^{T}\big)^{T}\Big) + \lambda\,\mathrm{Tr}\big(V^{T} L V\big)\\
 &= \mathrm{Tr}\Big(\big(X\Gamma^{1/2} - UV^{T}\Gamma^{1/2}\big)\big(X\Gamma^{1/2} - UV^{T}\Gamma^{1/2}\big)^{T}\Big) + \lambda\,\mathrm{Tr}\big(V^{T} L V\big).
\end{aligned}
\]