

Integrative Clustering of Multi-View Data:
Subspace Clustering, Graph Approximation to Manifold Learning

A thesis submitted to Indian Statistical Institute


in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computer Science

by
Aparajita Khan
Senior Research Fellow

Under the supervision of


Dr. Pradipta Maji, Professor

Machine Intelligence Unit


Indian Statistical Institute, Kolkata
November 2021
To my parents,
who gave me hope. Always.
Acknowledgements

More than a thesis, a PhD is a journey. As I look back to the years that led
to the completion of this thesis, I realize how fortunate I am to find such kind
people around me whose consistent support, supervision, and confidence in me
made this journey possible.
It is a pleasure for me to express my heartfelt gratitude to my thesis supervi-
sor Prof. Pradipta Maji at the very outset for his endless support, guidance,
training, and teaching throughout my PhD tenure. I thank him for giving me
the opportunity to carry out my thesis work in his Biomedical Imaging and
Bioinformatics Laboratory and introducing me to the world of front-line re-
search in pattern recognition and machine learning. The thesis would not have
been completed without his vision and technical support. Customary thanks
always stand inadequate against the priceless effort he has put in to develop my
technical knowledge and to inspire me to become an independent and creative
researcher.
My indebtedness to Prof. C. A. Murthy is also beyond words. I consider
myself lucky to have him as a teacher for the ‘Advanced Pattern Recognition’
course. His expositions of several topics in statistics and computer science were
immensely helpful to me throughout my research career. I am thankful to
Prof. Rajat K. De, Prof. Rana Barua, Prof. Palash Sarkar, Prof. Subhash C.
Nandy, and Prof. Bhargab B. Bhattacharya for their teaching and technical
insights on some of the core subjects of computer science and mathematics. All
the discussions with them and their valuable inputs helped me a lot in several
parts of my thesis.
I wish to extend my earnest gratitude to my senior, Dr. Manjari Pradhan, for
always being a sweet big sister and for her help in several aspects of my hostel
life. Without her companionship, this journey would not have been so smooth
and cherishable. I express my sincere gratitude to Dr. Sanchayan Santra and
Mr. Avisek Gupta for their constant support during different phases of my life.
The presence of all three of you has made my life graceful, and I never felt that I
was away from home. I am fortunate that I met you all. I am thankful to my
lab members, especially, Ekta Shah, Ankita Mandal, Sankar Mondal, Suman
Mahapatra, Debamita Kumar, Monalisa Pal, Sampa Misra, and Rishika Sen for
creating such a pleasant atmosphere for research and collaboration within the
lab. I am delighted to acknowledge my friends, Indrani, Arundhati, Sukriti, and
Banani with whom I developed strong connections during my undergraduate
years and whose presence gives me strength and happiness.
I am indebted to the Dean of Studies and the Director of Indian Statistical
Institute for providing me the fellowship and grants, and an incomparable in-
frastructure for research. I would also like to thank all the faculty members of
Machine Intelligence Unit, Indian Statistical Institute, for their helpful sugges-
tions and continued support throughout my PhD tenure. I express my sincere
thanks to the authorities of the institute for the facilities extended to carry out
my research. I also thank the office staff for their efforts to help me with
official matters and to conduct lab work smoothly.
Last but certainly not least, I would like to take this opportunity to express my
sincere gratitude to my beloved parents for being the most substantial support
not only during my PhD career but also throughout my life. Both of you are
my source of inspiration. I am equally thankful for all the support I received from
my other family members. Finally, I owe a debt of gratitude to a very special
person, my husband, Abir, for his unfailing love, support and understanding
that made the completion of the thesis possible. He was always around at times
when I found it hard to continue and helped me to keep things in perspective.
I am moved by his simplicity and deeply appreciate his belief in me.
I thank everyone who directly and indirectly supported me throughout my
thesis work.

Aparajita Khan
Abstract

Multi-view data clustering explores the consistency and complementary properties of
different views to uncover the natural groups present in a data set. While multiple
views are expected to provide more information for an improved learning performance,
they pose their own set of unique challenges. The most important problems of multi-
view clustering are the high-dimensional heterogeneous nature of different views, selec-
tion of relevant and complementary views while discarding noisy and redundant ones,
preventing the propagation of noise from individual views during data integration, and
capturing the lower dimensional non-linear geometry of each view.
In this regard, the thesis addresses the problem of multi-view data clustering, in
the presence of high-dimensional, noisy, and redundant views. In order to select the
appropriate views for data clustering, some new quantitative measures are introduced
to evaluate the quality of each view. While the relevance measures evaluate the com-
pactness and separability of the clusters within each view, the redundancy measures
compute the amount of information shared between two views. These measures are
used to select a set of relevant and non-redundant views during multi-view data inte-
gration.
The “high-dimension low-sample size” nature of different views makes the feature
space geometrically sparse and the clustering computationally expensive. The thesis
addresses these challenges by performing the clustering in the low-rank joint subspaces,
extracted by feature-space, graph, and manifold based approaches. In the feature-space
based approach, the problem of incremental update of relevant eigenspaces is addressed
for multi-view data sets. This formulation makes the extraction of the joint subspace
computationally less expensive than principal component analysis. The graph
based approaches, on the other hand, inherently take care of the data heterogeneity
of different views, by modelling each view using a separate similarity graph. In order
to filter out the background noise embedded in each view, a novel concept of approxi-
mate graph Laplacian is introduced, which captures the de-noised relevant information
using the most informative eigenpairs of the graph Laplacian.
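As a rough illustration of keeping only a few eigenpairs of a graph Laplacian, the following Python sketch builds a Gaussian similarity graph for a single view and retains its r leading eigenpairs. The function name `low_rank_affinity`, the Gaussian kernel, the bandwidth `sigma`, and the choice of the r largest eigenpairs of the normalized affinity (equivalently, the r smallest eigenpairs of the normalized Laplacian) are assumptions made for illustration only; the thesis gives its own definition of the approximate graph Laplacian in Chapter 5.

```python
import numpy as np

def low_rank_affinity(X, r, sigma=1.0):
    """Hypothetical helper: rank-r eigen-approximation of one view's graph."""
    # Pairwise squared Euclidean distances between samples (rows of X).
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    # Gaussian similarity graph without self-loops (assumed kernel choice).
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetrically normalized affinity S = D^{-1/2} W D^{-1/2};
    # the normalized Laplacian is L = I - S.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; keep the r largest of S,
    # which correspond to the r smallest eigenvalues of L.
    vals, vecs = np.linalg.eigh(S)
    vals_r = vals[::-1][:r]
    vecs_r = vecs[:, ::-1][:, :r]
    # Rank-r reconstruction built only from the retained eigenpairs.
    S_r = (vecs_r * vals_r) @ vecs_r.T
    return S_r, vals_r, vecs_r
```

The columns of `vecs_r` could then be passed to k-means, as in standard spectral clustering; how the retained eigenpairs of several views are combined is not covered by this sketch.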
In order to utilize the underlying non-linear geometry of different views, the
graph-based approach is judiciously integrated with manifold optimization techniques.
The optimization over Stiefel and k-means manifolds is able to capture the non-
linearity and orthogonality of the cluster indicator subspaces. Finally, the problem
of simultaneous optimization of the graph connectivity and clustering subspaces is
addressed by exploiting the geometry and structure preserving properties of Grass-
mannian and symmetric positive definite manifolds.
Contents

1 Introduction 1
1.1 Multi-View Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Challenges in Multi-View Analysis . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Scope and Organization of Thesis . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Survey on Multi-View Clustering 11


2.1 Multi-View Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Multi-View Clustering Approaches . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Early Integration Approaches . . . . . . . . . . . . . . . . . . . . . . 13
2.2.2 Two-Stage Late Integration Approaches . . . . . . . . . . . . . . . . 14
2.2.3 Subspace Clustering Approaches . . . . . . . . . . . . . . . . . . . . 16
2.2.3.1 Matrix Factorization Based Approaches . . . . . . . . . . . 17
2.2.3.2 Tensor Based Approaches . . . . . . . . . . . . . . . . . . . 18
2.2.3.3 Self-Representation Based Subspace Learning Approaches . 19
2.2.3.4 Canonical Correlation Analysis Based Approaches . . . . 19
2.2.4 Co-Training and Co-Regularization Approaches . . . . . . . . . . . . 20
2.2.5 Multiple Kernel Learning Approaches . . . . . . . . . . . . . . . . . 20
2.2.6 Statistical Model Based Approaches . . . . . . . . . . . . . . . . . . 21
2.2.7 Graph Based Approaches . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.8 Manifold Based Approaches . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.9 Deep Clustering Approaches . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Multi-View Classification Approaches . . . . . . . . . . . . . . . . . . . . . . 24
2.3.1 Subspace Learning Approaches . . . . . . . . . . . . . . . . . . . . . 24
2.3.2 Co-Training Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.3 Multi-View Support Vector Machines . . . . . . . . . . . . . . . . . . 25
2.3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3 Multivariate Normality Based Analysis for Low-Rank Joint Subspace Construction 27
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 NormS: Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.1 Principal Subspace Model . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.2 Rank Estimation of Individual Modality . . . . . . . . . . . . . . . . 30
3.2.3 Relevance and Dependency Measures . . . . . . . . . . . . . . . . . . 33
3.2.3.1 Relevance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.3.2 Dependency . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.2.4 Proposed Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.4.1 Computational Complexity of Proposed Algorithm . . . . . 38
3.3 Experimental Results and Discussion . . . . . . . . . . . . . . . . . . . . . . 40
3.3.1 Data Sets and Experimental Setup . . . . . . . . . . . . . . . . . . . 40
3.3.2 Illustration of Proposed Algorithm . . . . . . . . . . . . . . . . . . . 41
3.3.3 Effectiveness of Proposed Algorithm . . . . . . . . . . . . . . . . . . 44
3.3.3.1 Importance of Relevance . . . . . . . . . . . . . . . . . . . 44
3.3.3.2 Importance of Rank Estimation . . . . . . . . . . . . . . . 46
3.3.3.3 Significance of Dependency . . . . . . . . . . . . . . . . . . 46
3.3.3.4 Importance of Selecting Non-normal Residuals . . . . . . . 47
3.3.4 Comparative Performance Analysis . . . . . . . . . . . . . . . . . . . 48
3.3.4.1 Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.4.2 Survival Analysis . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.4.3 Execution Efficiency . . . . . . . . . . . . . . . . . . . . . . 51
3.3.5 Robustness and Stability Analysis . . . . . . . . . . . . . . . . . . . 52
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

4 Selective Update of Relevant Eigenspaces for Integrative Clustering of Multi-View Data 57
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 SVD Eigenspace Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3 SURE: Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.1 Eigenspace Updation . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.2 Evaluation of Individual Modality . . . . . . . . . . . . . . . . . . . 65
4.3.2.1 Relevance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3.2.2 Concordance . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3.3 Proposed Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3.4 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4 Accuracy of Eigenspace Construction . . . . . . . . . . . . . . . . . . . . . . 69
4.4.1 Error Bound on Principal Sines . . . . . . . . . . . . . . . . . . . . . 70
4.4.2 Accuracy of Singular Triplets . . . . . . . . . . . . . . . . . . . . . . 74
4.4.2.1 Mean Relative Difference of Singular Values . . . . . . . . . 74
4.4.2.2 Relative Dimension of Intersection Space . . . . . . . . . . 74
4.5 Experimental Results and Discussion . . . . . . . . . . . . . . . . . . . . . . 75
4.5.1 Optimum Value of Concordance Threshold . . . . . . . . . . . . . . 75
4.5.2 Accuracy of Subspace Representation . . . . . . . . . . . . . . . . . . 76
4.5.3 Execution Efficiency of SURE . . . . . . . . . . . . . . . . . . . . . . 78
4.5.4 Importance of Data Integration and Modality Selection . . . . . . . 78
4.5.5 Importance of Relevance . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.5.6 Significance of Concordance . . . . . . . . . . . . . . . . . . . . . . . 81
4.5.7 Performance Analysis of Different Algorithms . . . . . . . . . . . . . 83
4.5.8 Survival Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

5 Approximate Graph Laplacians for Multi-View Data Clustering 91
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2 Basics of Graph Laplacian . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.3 CoALa: Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.3.1 Convex Combination of Graph Laplacians . . . . . . . . . . . . . . . 95
5.3.2 Construction of Joint Eigenspace . . . . . . . . . . . . . . . . . . . . 97
5.3.3 Proposed Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.3.4 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . 102
5.3.5 Choice of Convex Combination . . . . . . . . . . . . . . . . . . . . . 102
5.4 Quality of Eigenspace Approximation . . . . . . . . . . . . . . . . . . . . . . 103
5.5 Experimental Results and Discussion . . . . . . . . . . . . . . . . . . . . . . 110
5.5.1 Description of Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.5.2 Optimum Value of Rank . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.5.3 Difference Between Eigenspaces . . . . . . . . . . . . . . . . . . . . . 113
5.5.4 Effectiveness of Proposed CoALa Algorithm . . . . . . . . . . . . . . 114
5.5.4.1 Importance of Data Integration . . . . . . . . . . . . . . . . 114
5.5.4.2 Importance of the choice of Convex Combination . . . . . . 117
5.5.4.3 Importance of Noise-Free Approximation . . . . . . . . . . 119
5.5.4.4 Advantage of Averting Row-normalization . . . . . . . . . . 120
5.5.5 Comparative Performance Analysis on Multi-Omics Data Sets . . . . 121
5.5.6 Comparative Performance Analysis on Benchmark Data Sets . . . . 126
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

6 Multi-Manifold Optimization for Multi-View Subspace Clustering 133


6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.2 Basics of Manifold Based Clustering . . . . . . . . . . . . . . . . . . . . . . 135
6.3 MiMIC: Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.3.1 Multi-View Integration . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.3.2 Manifold Optimization Based Solution . . . . . . . . . . . . . . . . . 139
6.3.2.1 Optimization of UJoint . . . . . . . . . . . . . . . . . . . . . 139
6.3.2.2 Optimization of Uj . . . . . . . . . . . . . . . . . . . . . . . 143
6.3.3 Proposed Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.3.3.1 Choice of Initial Iterates . . . . . . . . . . . . . . . . . . . . 145
6.3.3.2 Convergence Criterion . . . . . . . . . . . . . . . . . . . . . 146
6.3.3.3 Computational Complexity . . . . . . . . . . . . . . . . . . 146
6.4 Asymptotic Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . 148
6.4.1 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.4.2 Asymptotic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.5 Experimental Results and Discussion . . . . . . . . . . . . . . . . . . . . . . 156
6.5.1 Description of Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . 157
6.5.1.1 Synthetic Data Sets . . . . . . . . . . . . . . . . . . . . . . 157
6.5.1.2 Benchmark Data Sets . . . . . . . . . . . . . . . . . . . . . 157
6.5.1.3 Multi-Omics Cancer Data Sets . . . . . . . . . . . . . . . . 159
6.5.2 Performance on Synthetic Data Sets . . . . . . . . . . . . . . . . . . 159
6.5.3 Significance of Asymptotic Convergence Bound . . . . . . . . . . . . 162
6.5.4 Choice of Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

6.5.5 Choice of Damping Factor in Joint Laplacian . . . . . . . . . . . . . 164
6.5.6 Importance of Data Integration . . . . . . . . . . . . . . . . . . . . . 167
6.5.7 Importance of k-Means and Stiefel Manifolds . . . . . . . . . . . . . 170
6.5.8 Comparative Performance Analysis . . . . . . . . . . . . . . . . . . . 171
6.5.8.1 Results on Benchmark Data Sets . . . . . . . . . . . . . . . 171
6.5.8.2 Results on Multi-Omics Cancer Data Sets . . . . . . . . . . 173
6.5.8.3 Results on Social Network and General Image Data Sets . . 173
6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

7 Geometry Aware Multi-View Clustering over Riemannian Manifolds 179


7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
7.2 GeARS: Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
7.2.1 Geometry Aware Multi-View Integration . . . . . . . . . . . . . . . . 182
7.2.2 Updation of Graph Connectivity . . . . . . . . . . . . . . . . . . . . 185
7.3 Optimization Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
7.3.1 Optimization over Grassmannian Manifold . . . . . . . . . . . . . . . 187
7.3.2 Optimization over SPD Manifold . . . . . . . . . . . . . . . . . . . . 190
7.3.3 Optimization of Graph Weights . . . . . . . . . . . . . . . . . . . . . 192
7.3.4 Proposed Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
7.3.4.1 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . 194
7.3.4.2 Computational Complexity . . . . . . . . . . . . . . . . . . 194
7.3.4.3 Asymptotic Convergence Bound . . . . . . . . . . . . . . . 196
7.4 Grassmannian Disagreement Bounds . . . . . . . . . . . . . . . . . . . . . . 197
7.5 Experimental Results and Discussion . . . . . . . . . . . . . . . . . . . . . . 199
7.5.1 Description of Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . 199
7.5.2 Significance of Asymptotic Convergence Bound . . . . . . . . . . . . 200
7.5.3 Empirical Study on Subspace Disagreement Bound . . . . . . . . . . 203
7.5.4 Choice of Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
7.5.5 Effectiveness of Proposed Algorithm . . . . . . . . . . . . . . . . . . 205
7.5.5.1 Importance of Joint Subspace Optimization . . . . . . . . . 205
7.5.5.2 Importance of Individual Subspace Optimization . . . . . . 205
7.5.5.3 Importance of Pairwise Distance Reduction . . . . . . . . . 206
7.5.5.4 Importance of Laplacian Optimization . . . . . . . . . . . . 207
7.5.5.5 Importance of Weight Updation . . . . . . . . . . . . . . . 209
7.5.6 Comparison with Existing Approaches . . . . . . . . . . . . . . . . . 209
7.5.6.1 Performance Analysis on Benchmark Data Sets . . . . . . . 209
7.5.6.2 Performance Analysis on Cancer Data Sets . . . . . . . . . 212
7.5.6.3 Performance Analysis on Social Network and General Im-
age Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . 212
7.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

8 Conclusion and Future Directions 215


8.1 Major Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
8.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216

A Description of Data Sets 219
A.1 Multi-Omics Cancer Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . 219
A.1.1 Pre-Processing of Multi-Omics Data Sets . . . . . . . . . . . . . . . 221
A.2 Multi-View Benchmark Data Sets . . . . . . . . . . . . . . . . . . . . . . . . 222
A.2.1 Social Network Data Sets . . . . . . . . . . . . . . . . . . . . . . . . 222
A.2.1.1 Twitter Data Sets . . . . . . . . . . . . . . . . . . . . . . . 222
A.2.1.2 Citation Network Data Set . . . . . . . . . . . . . . . . . . 224
A.2.2 Image Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
A.2.3 Multi-Source News Article Data Sets . . . . . . . . . . . . . . . . . . 225

B Cluster Evaluation Indices 227


B.1 External Cluster Evaluation Measures . . . . . . . . . . . . . . . . . . . . . 227
B.2 Internal Cluster Evaluation Measures . . . . . . . . . . . . . . . . . . . . . . 229

C Basics of Matrix Perturbation Theory 231

D Background on Manifold Optimization 235

List of Publications 239

References 241

List of Figures

1.1 Different application areas of multi-view data analysis. . . . . . . . . . . . . 3


1.2 Outline of the thesis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1 Different views of multi-omics data analysis. . . . . . . . . . . . . . . . . . . 12


2.2 Different types of multi-view clustering approaches. . . . . . . . . . . . . . . 14
2.3 Early integration based multi-view clustering. . . . . . . . . . . . . . . . . . 15
2.4 Two-stage consensus clustering. . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Multi-view subspace clustering. . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6 Graph based multi-view clustering. . . . . . . . . . . . . . . . . . . . . . . . 22

3.1 $\chi^2$ distributions for H-statistic of three modalities. . . . . . . . . . . . . . . 34


3.2 Dependency of modality Xj on Xi : (a) Orthogonal subspaces (b) Linearly
dependent subspaces (c) Arbitrary subspaces. . . . . . . . . . . . . . . . . 36
3.3 Two different cases of residual component Qj after the projection of Uj
on the current joint subspace: (a) Residual follows normal distribution (b)
Residual shows divergence from normal distribution. . . . . . . . . . . . . . 37
3.4 Density and Q-Q plots for first five principal components of RNA and mDNA
modalities of CESC data set. . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5 Density and Q-Q plots for the residual components of mDNA for CESC data. 42
3.6 Density and Q-Q plots for first four principal components of miRNA and
RPPA for CESC data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.7 Density and Q-Q plots for the residual components of miRNA and RPPA
for CESC data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.8 Kaplan-Meier survival plots for proposed subtypes of CESC, LGG, OV, and
BRCA data sets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.9 Distribution of p-values obtained from robustness analysis on different data. 53

4.1 (a) Projected and residual components of subspace $U(X_{m+1})$ with respect
to $U(\tilde{X}_m)$; (b) Intersection between $U(X_{m+1})$ and $U(\tilde{X}_m)$ is empty; (c)
$U(X_{m+1})$ is a subspace of $U(\tilde{X}_m)$. . . . . . . . . . . . . . . . . . . . . . . . 62
4.2 Variation of PVE and F-measure for different values of threshold τ for CESC,
GBM, and LGG data sets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3 Different quantitative indices for the evaluation of gap between true and
approximate eigenspaces. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.4 Comparison of execution time for PCA computed using EVD (top row) and
SVD (bottom row) and the proposed SURE approach on LGG, LUNG, and
KIDNEY data sets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

4.5 Kaplan-Meier survival plots for subtypes identified by SURE on different data. 87

5.1 Variation of Silhouette index and F-measure for different values of rank
parameter r on omics data sets. . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.2 Variation of Silhouette index and F-measure for different values of rank
parameter r on benchmark data sets. . . . . . . . . . . . . . . . . . . . . . . 113
5.3 Variation of difference between full-rank and approximate eigenspaces with
respect to rank r. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.4 Scatter plots using first two components of different low-rank based ap-
proaches on LGG data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.5 Scatter plots using first two components of different low-rank based ap-
proaches on STAD data set. . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.6 Scatter plots using first two components of different low-rank based sub-
spaces for Politics-UK data set. . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.7 Scatter plots using first two components of different low-rank based sub-
spaces for Digits data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

6.1 Optimization of UJoint over k-means manifold. . . . . . . . . . . . . . . . . . 140


6.2 Two-dimensional scatter plots of three synthetic shape data sets: ground
truth clustering (top two rows: (a)-(h)) and MiMIC clustering (bottom two
rows: (i)-(p)). The numbers in (i)-(p) denote the clustering accuracy ob-
tained using the MiMIC algorithm. . . . . . . . . . . . . . . . . . . . . . . . 158
6.3 Asymptotic convergence analysis for Spiral data set: scatter plot of data
with varying Gaussian noise (top row) and variation of convergence ratio
and objective function with increase in iteration number t (bottom row). . 160
6.4 Asymptotic convergence analysis for Jain data set: scatter plot of data with
varying Gaussian noise (top row) and variation of convergence ratio and
objective function with increase in iteration number t (bottom row). . . . . 161
6.5 Asymptotic convergence analysis for R15 data set: scatter plot of data with
varying Gaussian noise (top row) and variation of convergence ratio and
objective function with increase in iteration number t (bottom row). . . . . 161
6.6 Asymptotic convergence analysis for Compound data set: scatter plot of
data with varying Gaussian noise (top row) and variation of convergence
ratio and objective function with increase in iteration number t (bottom row). 162
6.7 Variation of Silhouette index and F-measure for different values of rank r
on Digits and LGG data sets. . . . . . . . . . . . . . . . . . . . . . . . . . . 165
6.8 Two-dimensional scatter plots of individual views and proposed algorithm
for BBC data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
6.9 Two-dimensional scatter plots of individual views and proposed MiMIC al-
gorithm for 3Sources data set. . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.10 Two-dimensional scatter plots of three individual views and proposed MiMIC
algorithm for multi-omics cancer data sets: LGG (top row) and STAD (bot-
tom row). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

7.1 Effect of basis rotation on the cluster structure of a data set. . . . . . . . . 183
7.2 The Grassmannian manifold. . . . . . . . . . . . . . . . . . . . . . . . . . . 184
7.3 Optimization of UJoint over the Grassmannian manifold. . . . . . . . . . . . 188

7.4 Asymptotic convergence analysis for Spiral data set: scatter plot of data
with varying Gaussian noise (top row) and variation of convergence ratio
and objective function with increase in iteration number t (bottom row). . 200
7.5 Asymptotic convergence analysis for Jain data set: scatter plot of data with
varying Gaussian noise (top row) and variation of convergence ratio and
objective function with increase in iteration number t (bottom row). . . . . 201
7.6 Asymptotic convergence analysis for Aggregation data set: scatter plot of
data with varying Gaussian noise (top row) and variation of convergence
ratio and objective function with increase in iteration number t (bottom
row). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
7.7 Asymptotic convergence analysis for Flame data set: scatter plot of data
with varying Gaussian noise (top row) and variation of convergence ratio
and objective function with increase in iteration number t (bottom row). . 202
7.8 Variation of the theoretical upper bound Γm and the observed Grassman-
nian distance between UJoint and Um with increase in iteration number t for
3Sources, BBC, LGG, and STAD data sets. Sub-figures in each row show
the variation for different views of the corresponding data set. . . . . . . . . 204
7.9 Variation of Silhouette index and F-measure for different values of rank r
on LGG, OV, and Digits data sets. . . . . . . . . . . . . . . . . . . . . . . . 205

D.1 Armijo condition for the choice of step size. . . . . . . . . . . . . . . . . . . 236

List of Tables

3.1 Relevance and Rank of Each Modality and Modalities Selected by the Pro-
posed Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Importance of Relevance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 Importance of Rank Estimation, Dependency Measure, and Selection of
Non-normal Residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4 Comparative Performance Analysis of Proposed and Existing Approaches . 48
3.5 Survival p-values and Execution Times of Proposed and Existing Approaches 50
3.6 Survival Analysis of Cancer Subtypes Identified by Proposed Algorithm . . 51
3.7 Stability Analysis of Each Cluster . . . . . . . . . . . . . . . . . . . . . . . . 54

4.1 Comparative Performance Analysis of Individual Modalities, PCA Combi-


nations, and SURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.2 Importance of Relevance Based Ordering of Views . . . . . . . . . . . . . . 82
4.3 Importance of Concordance in SURE . . . . . . . . . . . . . . . . . . . . . . 83
4.4 Comparative Performance Analysis of SURE and Existing Approaches . . . 84
4.5 Comparative Performance Analysis of SURE and Existing Approaches . . . 85
4.6 Survival p-values and Execution Times of Proposed and Existing Approaches 86
4.7 Survival Analysis for Subtypes Identified by SURE on Different Data Sets . 88

5.1 Comparative Performance Analysis of Spectral Clustering on Individual


Modalities and Proposed Approach on Omics Data Sets . . . . . . . . . . . 115
5.2 Comparative Performance of Spectral Clustering on Individual Modalities
and Proposed Approach on Twitter Data Sets . . . . . . . . . . . . . . . . . 116
5.3 Comparative Performance of Spectral Clustering on Individual Modalities
and Proposed Approach on Digits Data Set . . . . . . . . . . . . . . . . . . 116
5.4 Comparative Performance Analysis of Equally and Damped Weighted Com-
bination on Omics Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.5 Comparative Performance Analysis of Equally and Damped Weighted Com-
bination on Benchmark Data Sets . . . . . . . . . . . . . . . . . . . . . . . . 118
5.6 Comparative Performance Analysis of Full-Rank and Approximate Sub-
spaces of Omics Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.7 Effect of Row-normalization on Different Subspaces on Omics Data . . . . . 121
5.8 Effect of Row-Normalization on Benchmark Data Sets . . . . . . . . . . . . 121
5.9 Comparative Performance Analysis of CoALa and Existing Approaches Based
on External Indices on Omics Data Sets . . . . . . . . . . . . . . . . . . . . 122
5.10 Comparative Performance Analysis of CoALa and Existing Approaches Based
on External Indices on Omics Data Sets . . . . . . . . . . . . . . . . . . . . 123

5.11 Comparative Performance Analysis of CoALa and Existing Approaches Based
on Internal Indices and Execution Time on Omics Data Sets . . . . . . . . . 125
5.12 Comparative Performance Analysis on Benchmark Data Sets: Football,
Politics-UK, Rugby, Digits . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.13 Comparative Performance Analysis on Benchmark Data Sets: ORL, Cal-
tech7, CORA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

6.1 Performance Analysis of Proposed Algorithms on Synthetic Clustering Data
Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
6.2 Performance Analysis of Proposed Algorithm at Rank k and Optimal Rank r* 164
6.3 Performance of the MiMIC Algorithm for Different Values of Damping Fac-
tor ∆ on Benchmark and Multi-Omics Data Sets . . . . . . . . . . . . . . . 166
6.4 Performance Analysis of Spectral Clustering on Individual Views and Pro-
posed MiMIC Algorithm for BBC and ALOI Data Sets . . . . . . . . . . . . 167
6.5 Performance Analysis of Spectral Clustering on Individual Views and Pro-
posed MiMIC Algorithm for 100Leaves and 3Sources Data Sets . . . . . . . 168
6.6 Performance Analysis of Spectral Clustering on Individual Views and Pro-
posed MiMIC Algorithm for Multi-Omics Data Sets . . . . . . . . . . . . . 169
6.7 Performance Analysis of Individual Manifolds and Proposed Algorithm . . . 170
6.8 Comparative Performance Analysis of Proposed and Existing Integrative
Clustering Algorithms on Benchmark Data Sets . . . . . . . . . . . . . . . . 172
6.9 Comparative Performance Analysis of Proposed and Existing Integrative
Clustering Algorithms on Multi-Omics Data Sets: BRCA, LGG, STAD,
LUNG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
6.10 Comparative Performance Analysis of Proposed and Existing Integrative
Clustering Algorithms on Multi-Omics Data Sets: CRC, CESC, KIDNEY,
OV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
6.11 Comparative Performance Analysis of Proposed and Existing Algorithms on
Twitter Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
6.12 Comparative Performance Analysis of Proposed and Existing Algorithms on
ORL, Caltech7, and CORA Data Sets . . . . . . . . . . . . . . . . . . . . . 177

7.1 Performance Analysis of Proposed Algorithm at Rank k and Optimal Rank r* 206
7.2 Importance of Different Components of the Proposed Algorithm on Bench-
mark Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
7.3 Importance of Different Components of the Proposed Algorithm on Omics
Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
7.4 Comparative Performance Analysis of Proposed and Existing Multi-View
Clustering Algorithms on Benchmark Data Sets . . . . . . . . . . . . . . . . 210
7.5 Comparative Performance Analysis of Proposed and Existing Subtype Iden-
tification Algorithms on Multi-Omics Cancer Data Sets: OV, LGG, BRCA,
and STAD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
7.6 Comparative Performance Analysis of Proposed and Existing Subtype Iden-
tification Algorithms on Multi-Omics Cancer Data Sets: CRC, CESC, KID-
NEY, and LUNG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213

7.7 Comparative Performance Analysis of Proposed and Existing Algorithms on
ORL, Caltech7, and CORA Data Sets . . . . . . . . . . . . . . . . . . . . . 214

A.1 Summary of Data Sets with Feature Space based Representation . . . . . . 226

Chapter 1

Introduction

Data, the seemingly abundant yet elusive entity, has been the driving force behind the
growth of science over the last couple of decades. Nowadays, our daily interactions have
largely shifted from the physical domain to the digital domain, and as a consequence,
every little action generates data. Data pours in from every experiment performed, every
file saved, every picture taken, every social media interaction, and every search query
submitted to Google. A rough estimate of the amount of data, collected over the last few
millennia up to the last decade, is about five exabytes. Nowadays, this same amount of
data gets generated and stored every single day. And it is not only the volume of the data
that has grown drastically, but also its variety. However, just having an abundance
of data is not enough; it is also essential to analyze the data and make sense of it.
Data analysis refers to the process of cleaning, organizing, interpreting, and visualiz-
ing the data in order to transform it into useful information [88]. Information adds meaning
to the data. It is obtained by looking for interesting and non-trivial patterns within the
several bytes of numbers, letters, and characters collected as raw data. A pattern refers to
a segment of the data that follows an identifiable trend or repeats itself in a discernible way.
The huge volume and variety of data available these days necessitate pattern
recognition, which is the automated process of discovering patterns and regularities in
data [220]. The massive real-life data sets, along with informative patterns, may also con-
tain measurement errors, imprecision, redundancy, and so on. Machine learning [15, 67]
plays a significant role in discovering the natural structures within these massive and often
noisy data sets. It is the systematic study and design of algorithms for learning useful and
non-obvious patterns, and making inference from the data. It addresses the computational
aspect of data driven knowledge discovery and decision making.
Depending on the learning strategy, pattern recognition and machine learning algorithms
can be broadly classified into the following three categories.

• Supervised learning algorithms aim to learn a function that maps a set of at-
tributes describing a data instance, also known as object, sample or observation, to a
set of labels or target attribute, using a collection of annotated training instances. For
example, spam filters used in e-mail servers identify new incoming e-mails as “spam”
or “not-spam” based on previously seen annotated instances. In supervised learning,
each instance in the set of training examples must be labeled with the corresponding
value of the target attribute. This requires a great deal of time and effort to create
a data set with labeled instances.
• In unsupervised learning, the learner's task is to make inferences from a set of data
instances in the absence of labeled training samples. An example of unsupervised learning
is to cluster COVID-19 affected patients based on demographics, mortality, and
incidence rates, in order to identify vulnerable zones that would benefit from the allocation
of additional resources by the governing authorities. Another example is grouping
news articles, hosted on different online news portals, into different categories, such as
sports, politics, business, science, etc. Unsupervised learning saves the time and effort
invested in labeling the data instances; but lack of supervision makes the problem
more challenging to solve.
• Semi-supervised learning forms the third category of machine learning algorithms.
In several application domains, acquiring data is cheap, but acquiring labeled data
turns out to be expensive. For example, in the problem of web page classification, all
the web pages hosted on the World Wide Web are at our disposal, but creating a training
set of annotated web pages is a tedious job. In semi-supervised learning, an initial
model is developed based on the limited labeled training data and then unlabeled
data is used to refine the model.
The learning algorithms traditionally work with two types of data set representations:
feature vector based data and relational data [148]. Feature vector based representation
consists of numerical, categorical, textual, or binary set of d features for n samples in a d-
dimensional measurement space. For example, image data sets can be represented by global
or local features (like color histogram) extracted from images or their raw pixel intensities.
The relational data, on the other hand, is represented by $n^2$ pairwise relationships between
n samples. For example, in news article categorization, two articles can be considered
to be similar or on a related topic if there is a hyperlink connecting the two articles. A set
of n samples, represented by d-dimensional feature vectors or by pairwise relationships, is
referred to as a “view” or “modality” of a data set.
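
The two representations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is random, and the relational view is derived from the feature vectors with a Gaussian kernel, which is one common choice and not necessarily how a relational view (for example, hyperlinks between articles) arises in practice. The names `X`, `W`, and `sigma` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature vector based view: n = 5 samples, d = 3 numerical features.
n, d = 5, 3
X = rng.normal(size=(n, d))

# Hypothetical relational view of the same samples: an n x n matrix of pairwise
# similarities. Here it is computed from X with a Gaussian kernel on squared
# Euclidean distances; the bandwidth sigma = 1.0 is an arbitrary illustrative value.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
sigma = 1.0
W = np.exp(-sq_dists / (2.0 * sigma ** 2))

print(X.shape)  # one row of d features per sample
print(W.shape)  # one similarity value per pair of samples
```

Either `X` or `W` alone would count as a single view of these five samples in the terminology used here.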
In several applications, a single type of information may not be sufficient to characterize
the nature and dynamics of the problem completely. A hyperlink connecting two
news articles is not sufficient to claim that both of them belong to the same news category.
Similarity between their content profiles also needs to be evaluated for making such a claim.
Diverse information can be captured via multiple views for the same set of observations or
samples. The thesis addresses the unsupervised learning problem for multi-view data sets.

1.1 Multi-View Data Analysis


Multi-view learning is an emerging machine learning paradigm that focuses on discover-
ing patterns in data represented by multiple distinct views [203]. These data sets are nearly
omnipresent in modern real-life applications due to an upsurge in different data collection,
measurement, and representation techniques. In image and video processing, color, shape,
and texture information generate three different kinds of feature sets, and each of them
can be considered as a single view of the given data set. Similarly, in cross-language text
classification, the same article can be written in multiple languages. This kind of data
set is known as multi-view data, where each type of feature set or affinity/distance based
representation corresponds to a single view.

Figure 1.1: Different application areas of multi-view data analysis. [Panels: multi-omics cancer subtyping (DNA methylation, copy number variation, gene expression, protein expression); multi-source news article classification (BBC News, The Guardian, The Times, The Daily Telegraph); multi-platform social network analysis (Twitter, Facebook); multimodal medical imaging (MRI, X-Ray, PET scan, CT scan).]

One of the important unsupervised learning paradigms is clustering. It aims to find
coherent groups of samples in the data, such that samples within a group are as similar
as possible, while samples in different groups are as dissimilar as possible. Multi-view
clustering groups the samples, based on the information of multi-view representation of the
data set. Multi-view classification, on the other hand, tries to learn decision boundaries,
separating different classes, from the labeled training examples having more than one view.
Figure 1.1 shows some of the application areas of multi-view learning. While clustering
or classification on each view separately may somewhat reveal the patterns present in the
data set, multi-view analysis that utilizes the diverse information of multiple views has the
potential to expose more fine-tuned structures that are not revealed by examining only a
single view.
The idea of fusing information from multiple sources or views gained importance over
traditional single view learning models during the last decade. Although relatively recent,
multi-view learning has become an active area of research due to its remarkable success in a
wide range of real-world applications, such as multi-camera face recognition, multi-source
news article clustering, action recognition, multi-omics cancer stratification, biomedical
imaging, and imaging genetics, to name just a few [130, 174, 288].
There are several reasons behind the prominent success of multi-view learning over its
single view counterpart. Some of them are listed below.

1. Comprehensive View of the System: Different views can reveal different aspects,
each giving a glimpse of the underlying dynamics of the whole system. For example,
in face recognition, multiple cameras alleviate the limitations of a single camera since
there are higher chances of a person being in a favorable frontal pose. Moreover, the
facial appearance and features often vary significantly due to variation in lighting
conditions, light angles, facial expressions, and head poses. In such a scenario, a
multi-camera network obtains multiple images of a face in different poses and lighting
conditions, and gives more accurate and robust face recognition results compared to
single-camera/single-view analysis.

2. Complementary Information: Each view may contain complementary information
that is not present in other views, even when all of them capture the same aspect
of a system. In multi-omics study, both gene expression and copy number variation
data contain genetic information of an individual. The gene expression conveys the
overexpression or underexpression of a gene, while copy number variation gives the
number of times a gene sequence has been repeated within the DNA of the individual.
Utilization of both consistent and complementary information of different views can
significantly improve the learning performance.

3. Resilience to Noise: Multi-view observations can reduce the effect of experimental
or measurement noise in the data. Noisy observations in one view can be compensated
by the corresponding observations of other views.

4. Cross-Platform Analysis: Due to the availability of multiple views, it is possible
to perform a variety of additional analyses, such as drawing associations between
variables observed in different views. In imaging genetic studies, given functional
magnetic resonance imaging and single nucleotide polymorphism (SNP), it is possible
to identify brain region alterations triggered by corresponding SNP changes in genes.
Other analyses, such as estimating the noise content or significance of one view given
the other views, are also possible owing to the availability of multiple views.

Despite these advantages, the abundance of data in multi-view learning comes with
several challenges as well [288], which are discussed in the next section.

1.2 Challenges in Multi-View Analysis


The traditional machine learning algorithms, such as artificial neural networks, support
vector machines, kernel machines, discriminant analysis, and spectral clustering, are de-
vised to work on single view data. These algorithms do not trivially adapt to the multi-view
setting, as the multi-view nature of the data set poses its own set of challenges. Some of
them, focused more towards clustering, are listed as follows.

1. Data Heterogeneity: The simplest approach to handle the multi-view data sets
using the conventional machine learning algorithms is to concatenate the feature sets
of all the views to construct a single view. However, this concatenation is not mean-
ingful as each view has its own specific statistical property, and the data in different
views is usually measured in different units which are not necessarily compatible.
Different views vary immensely in terms of scale, unit, and variance. For instance, in
multi-omics study, RNA sequence based gene expression data is measured in RPM
(reads per million) consisting of real values on the order of $10^5$, while DNA methylation
data consists of β-values which lie in [0, 1]. The concatenation of features from
these heterogeneous views is likely to reflect only the properties of views having high
variance. Unbiased integration of multiple views requires extracting a transformed
feature space or a uniform platform, so that intrinsic properties are equally preserved
in all views. Clustering can be performed separately on each heterogeneous view,
but manual integration of clustering solutions from different views can be tedious
and may fail to capture cross-platform correlations.
2. High-Dimension Low-Sample Size Nature: In real-life data analysis, data sets
usually have a large number of observed variables, such as several thousands of words
in documents, nearly $10^6$ pixels in images, 20K genes in DNA microarrays, and so
on. The number of samples in these data sets typically ranges within a few thousand.
Due to the lack of sufficient training samples, the learning models tend to overfit the
data, thus reducing the generalization performance. The multicollinearity issue is
also commonly observed in high dimensional settings, in which two or more features
are highly correlated. This degrades the consistency properties of the eigenvalues and
eigenvectors of the rank deficient sample covariance matrix [109]. In high dimensions,
the feature space becomes geometrically sparse; and most of the clustering algorithms
become computationally expensive and prone to degraded performance.
3. Noisy and Redundant Views: In real-world settings, the observations in different
views are often corrupted by noise due to measurement errors. The noise in
different views gets propagated or even exaggerated during the data fusion process, if
not explicitly taken care of. Furthermore, most of the multi-view algorithms consider
all the available views for learning, under the assumption that each view is informative
and provides homogeneous and consistent information about the underlying patterns
in the data. However, some views may provide disparate, redundant, or even worse
information. Due to the presence of noisy and redundant views, integration of all the
available views can degrade the quality of cluster structures and decision boundaries
learned from the data.
4. View Disagreement: In the multi-view learning paradigm, the views are expected to
uniformly agree upon an underlying global class/cluster structure. This implies that
two samples belonging to a class in one view should belong to the same class in other
views as well. However, in realistic settings, data sets are often corrupted by noise
and each view is likely to be corrupted by an independent noise process. In such a
situation, a set of observations in some views gets corrupted while the corresponding
observations in other views may remain unaffected. For example, in multi-sensory
data sets, a sensor may temporarily enter an erroneous state before returning to its
normal condition. This may lead to disagreement between different views. In case of
severe disagreement or corruption, the clusters identified in different views would not
conform with each other, and hence arriving at a global consensus becomes hard [43].
5. Low-Rank Non-Linear Geometry of Views: In several real-life data sets, most of
the views have a large number of features. Although the data in these views may appear
as point clouds in a high-dimensional feature space, their meaningful structures
often reside on a lower dimensional subspace or manifold embedded in the high-
dimensional space. Moreover, in high dimensions, the ratio of the distances to the nearest
and farthest points approaches one, that is, the points tend to become uniformly distant
from each other [4]. Consequently, the problem of clustering points based on their
nearest neighborhood becomes ill-posed, since the contrast between the distances to
different data points ceases to exist. Hence, clustering in the high-dimensional original
feature space usually gives poor performance compared to a transformed space. Even
in transformed space learning, extracting a single subspace or manifold for a multi-
view data set might not be sufficient. Each view has its own underlying, possibly
non-linear, geometry that needs to be captured separately.

6. Incomplete Views: Most of the multi-view learning algorithms assume that all
samples can be successfully observed on all the views. However, due to measurement
and pre-processing errors, the data sets are prone to having incomplete views, where a
sample is not observed in one view (missing view), or the sample is partially observed
(missing variables). Consideration of only the samples observed in all the views
reduces the sample size and makes the model prone to overfitting. The presence
of incomplete views necessitates utilization of the connection between the views and
restoration of samples in the incomplete views with the help of corresponding samples
in the complete views [257].
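
The loss of distance contrast mentioned in item 5 can be observed numerically. The sketch below is a hedged illustration, not an experiment from the thesis: it assumes points drawn uniformly from the unit hypercube and measures the relative contrast (farthest distance minus nearest distance, divided by nearest distance) from one query point. The function name `relative_contrast` and the chosen dimensions are hypothetical.

```python
import numpy as np

def relative_contrast(n_points, dim, seed=0):
    """Return (d_max - d_min) / d_min for the Euclidean distances from one
    random query point to n_points points drawn uniformly from [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, dim))
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The contrast shrinks as the dimension grows, so nearest and farthest
# neighbors become harder to tell apart.
for dim in (2, 20, 200, 2000):
    print(dim, relative_contrast(500, dim))
```

Under these assumptions the printed contrast drops sharply with the dimension, which is the effect that makes neighborhood based clustering unreliable in the original feature space.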

Some of these challenges like data heterogeneity and incomplete views are inherent to
multi-view data, while other challenges like high-dimension low-sample size nature and
low-rank geometry exist in single-view data as well. However, the presence of multiple
heterogeneous views escalates the complexity of these problems. Hence, some new advanced
algorithms need to be designed that can efficiently address these challenges and mine mean-
ingful patterns embedded in multi-view data sets.
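
The data heterogeneity issue raised above can be made concrete with a small numerical sketch. It assumes two synthetic views on very different scales, loosely mimicking expression values on the order of $10^5$ and β-values in [0, 1]; the variable names and distributions are hypothetical. The per-feature standardization at the end is only a simple baseline for comparison, not one of the methods proposed in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Two hypothetical views of the same n samples on very different scales.
X_expr = rng.normal(loc=1e5, scale=2e4, size=(n, 50))  # expression-like values
X_meth = rng.uniform(0.0, 1.0, size=(n, 50))           # beta-value-like values

# Naive integration: concatenate the feature sets column-wise.
X_concat = np.hstack([X_expr, X_meth])

# Share of the total variance contributed by each view after concatenation.
var = X_concat.var(axis=0)
share_expr = var[:50].sum() / var.sum()
share_meth = var[50:].sum() / var.sum()
print(share_expr, share_meth)

# Baseline remedy: standardize every feature to zero mean and unit variance,
# after which both views contribute equally to the total variance.
X_std = (X_concat - X_concat.mean(axis=0)) / X_concat.std(axis=0)
var_std = X_std.var(axis=0)
share_expr_std = var_std[:50].sum() / var_std.sum()
print(share_expr_std)
```

In this toy setting almost all of the variance of the concatenated matrix comes from the large-scale view, so any variance-driven analysis of `X_concat` would effectively ignore the second view.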

1.3 Scope and Organization of Thesis


In this regard, the thesis aims at designing a set of algorithms to address some of the
problems of multi-view data integration and clustering. One of the major challenges in
multi-view clustering is the high-dimension low-sample size nature of each view. For a
high-dimensional view, the standard approach is to extract a lower dimensional transformed
space that captures the cluster structure better than the high-dimensional input space
and to perform clustering in that space. The transformed space can be a linear subspace
or a general non-linear manifold embedded within the ambient input feature space. Fur-
thermore, depending on the objective function, there can be numerous lower dimensional
subspaces and manifolds of the same high dimensional space. The main contribution of this
thesis is to design some novel algorithms to extract informative subspaces and manifolds
for multi-view data analysis and clustering, and theoretically analyze important properties
of these transformed spaces and new algorithms therein.
The outline of the thesis is presented in Figure 1.2. The thesis consists of eight chapters.
Chapter 1 provides an introduction to multi-view data analysis and outlines some of its
application areas. It also discusses the major challenges encountered during integrative
analysis of multi-view data. Chapter 2 describes the problem of multi-view clustering and
its basic principles. A brief survey on existing multi-view clustering approaches is also
covered in this chapter.

Figure 1.2: Outline of the thesis. [Diagram labels: Multi-View Clustering; Feature-Space Based Integration; Graph Based Integration; Subspace Based; Manifold Based. Chapter 1: Introduction. Chapter 2: Survey on Multi-View Clustering. Chapter 3: Multivariate Normality Based Analysis for Low-Rank Joint Subspace Construction. Chapter 4: Selective Update of Relevant Eigenspaces for Integrative Clustering of Multi-View Data. Chapter 5: Approximate Graph Laplacians for Multi-View Data Clustering. Chapter 6: Multi-Manifold Optimization for Multi-View Subspace Clustering. Chapter 7: Geometry Aware Multi-View Clustering over Riemannian Manifolds. Chapter 8: Conclusion and Future Directions.]

One of the important challenges in multi-view data integration is the appropriate se-
lection of relevant and complementary views over noisy and redundant ones. Another
challenge is the high-dimension low-sample size nature of each view. Chapter 3 addresses
these two challenges by proposing a novel algorithm, which constructs a low-rank joint
subspace from the low-rank subspaces of individual high-dimensional views. Statistical
hypothesis testing is introduced to effectively estimate the rank of each view by separating
the signal component from its noise counterpart. Two quantitative indices are proposed to
evaluate the quality of different views. While the first one assesses the degree of relevance
of the cluster structure embedded within each view, the second measure evaluates the
amount of cluster information shared between two views. To construct the joint subspace,
the algorithm selects the most relevant views with maximum shared information. During
data integration, the intersection between two subspaces is also considered to select cluster
information and filter out the noise from different subspaces. The efficacy of clustering on
the joint subspace, extracted by the proposed approach, is compared with that of several
existing integrative clustering approaches on real-life multi-omics cancer data sets. Survival
analysis is performed to reveal the significant differences between survival profiles of the
identified subtypes, while robustness analysis shows that the identified subtypes are not
sensitive towards perturbation of the data sets.
Due to the high-dimensional nature of the multi-view data sets, extracting a low-
dimensional subspace often becomes computationally very expensive. Extraction of the
principal subspace by performing principal component analysis (PCA) on the integrated
data set requires eigendecomposition of a considerably higher order covariance matrix. In
this regard, Chapter 4 addresses the problem of incrementally updating the singular value
decomposition of a higher order data matrix in the context of the multi-view data inte-
gration. This analytical formulation enables efficient construction of the joint subspace of
integrated data from low-rank subspaces of the individual views. Construction of the joint
subspace by the proposed method is shown to be computationally more efficient than
PCA on the integrated data matrix. New quantitative indices are introduced to theo-
retically quantify the gap between the joint subspace extracted by the proposed approach
and the principal subspace extracted by performing PCA on the integrated data matrix, in
terms of the principal angles between these subspaces. Finally, clustering is performed on
the extracted joint subspace to identify meaningful clusters. The clustering performance of
the proposed approach is studied and compared with that of existing integrative clustering
approaches on several real-life multi-view cancer data sets.
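
Since the gap between two subspaces is quantified above through principal angles, a minimal sketch of how such angles can be computed may help. It uses the standard construction (singular values of the product of two orthonormal bases are the cosines of the principal angles) and is not the thesis's own index; the function name `principal_angles` and the test matrices are hypothetical, and both inputs are assumed to have full column rank.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (in radians) between the column spaces of A and B.
    Both matrices are assumed to have full column rank."""
    Qa, _ = np.linalg.qr(A)  # orthonormal basis of span(A)
    Qb, _ = np.linalg.qr(B)  # orthonormal basis of span(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))

# Same subspace as A, expressed in a different (non-orthogonal) basis.
M = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])  # invertible, determinant 7
B_same = A @ M

# Two mutually orthogonal coordinate planes in R^4.
E = np.eye(4)
angles_same = principal_angles(A, B_same)
angles_orth = principal_angles(E[:, :2], E[:, 2:])
print(angles_same)
print(angles_orth)
```

Angles near zero indicate that the two subspaces coincide, while angles near π/2 indicate that they share no common direction.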
Different views of a multi-view data set vary immensely in terms of unit and scale. One
of the important approaches to handling data heterogeneity in multi-view data clustering
is modeling each modality or view using a separate similarity graph. Information from
the multiple graphs is then integrated by combining them into a unified graph. A major
challenge here is how to preserve cluster information while removing noise from individual
graphs. In this regard, Chapter 5 presents a novel graph-based algorithm that integrates
noise-free approximations of multiple similarity graphs. The proposed method first ap-
proximates a graph using the most informative eigenpairs of its Laplacian which contain
cluster information. The approximate Laplacians are then integrated for the construction
of a low-rank subspace that best preserves overall cluster information of multiple graphs.
However, this approximate subspace differs from the full-rank subspace which integrates
information from all the eigenpairs of each Laplacian. Matrix perturbation theory is
used to theoretically evaluate how far the approximate subspace deviates from the full-rank
one for a given value of approximation rank. Finally, spectral clustering is performed on
the approximate subspace to identify the clusters. Extensive experiments are performed
on several real-life cancer as well as benchmark multi-view data sets to study and compare
the performance of the proposed approach.
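
The idea of approximating a similarity graph by a few informative eigenpairs can be sketched on a toy graph. The code below is an assumption-laden illustration rather than the algorithm of Chapter 5: it uses the symmetrically normalized affinity $S = D^{-1/2} W D^{-1/2}$, whose largest eigenpairs correspond to the smallest eigenpairs of the normalized Laplacian $I - S$, and keeps the top r of them. The function name and the toy graph are hypothetical.

```python
import numpy as np

def rank_r_normalized_affinity(W, r):
    """Rank-r approximation of S = D^{-1/2} W D^{-1/2} built from its r largest
    eigenpairs. Returns the approximation and the retained eigenvectors."""
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(S)       # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:r]        # indices of the r largest
    U, lam = eigvecs[:, top], eigvals[top]
    return (U * lam) @ U.T, U

# Toy graph: two blocks of 4 nodes with strong within-block similarity (1.0)
# and weak between-block similarity (0.05).
n, k = 8, 2
W = np.full((n, n), 0.05)
W[:4, :4] = 1.0
W[4:, 4:] = 1.0

S_r, U = rank_r_normalized_affinity(W, r=k)

# For two clusters, the sign pattern of the second eigenvector already
# separates the blocks; in general, k-means would be run on the rows of U.
labels = (U[:, 1] > 0).astype(int)
print(labels)
```

On this toy graph the two retained eigenpairs carry all of the block structure, which is the intuition behind discarding the remaining eigenpairs as noise.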
The meaningful patterns embedded in high-dimensional multi-view data sets typically
tend to have a much more compact representation that often lies close to a low-dimensional
manifold. Identification of hidden structures in such data mainly depends on the proper
modeling of the geometry of low-dimensional manifolds. In this regard, Chapter 6 presents
a manifold optimization based integrative clustering algorithm for multi-view data. To
identify consensus clusters, the algorithm uses the approximate joint graph Laplacian,
proposed in Chapter 5, to integrate de-noised cluster information from individual views.
It then optimizes a joint clustering objective, while reducing the disagreement between
the cluster structures conveyed by the joint and individual views. The optimization is
performed alternately over k-means and Stiefel manifolds. The Stiefel manifold helps
to model the non-linearities and differential clusters within the individual views, while the
k-means manifold tries to elucidate the best-fit joint cluster structure of the data. A gradient
based movement is performed separately on the manifold of each view, so that individual
non-linearity is preserved while looking for shared cluster information. The convergence
of the proposed algorithm is established over the manifold, and an asymptotic convergence
bound is obtained to quantify theoretically how fast the sequence of iterates generated
by the algorithm converges to an optimal solution. The performance of the proposed
approach, along with a comparison with state-of-the-art multi-view clustering approaches,
is demonstrated on synthetic, benchmark and multi-omics cancer data sets.
Simultaneous optimization of the individual graph structures, their weights, and the
joint and individual subspaces, is likely to give a more comprehensive idea of the clusters
present in the data set. In this regard, Chapter 7 presents another manifold optimization
algorithm that harnesses the geometry and structure preserving properties of symmet-
ric positive definite (SPD) manifold and Grassmannian manifold for efficient multi-view
clustering. The SPD manifold is used to optimize the graph Laplacians corresponding
to the individual views while preserving their symmetry, positive definiteness, and related
properties. The Grassmannian manifold, on the other hand, is used to optimize and
reduce the disagreement between the joint and individual clustering subspaces. The geom-
etry preserving property of Grassmannian optimization additionally enforces the clustering
solutions to be basis invariant cluster indicator subspaces, such that all cluster indicator
matrices whose columns span the same subspace map to the same clustering solution. A
gradient based line-search algorithm, that alternates between different manifolds, is proposed
to optimize the subspaces and Laplacians. Matrix perturbation theory is used
to theoretically bound the disagreement or Grassmannian distance between the joint and
individual subspaces at any given iteration of the proposed algorithm. The disagreement is
empirically shown to decrease as the algorithm progresses and converges to a local minimum.
The comparative clustering performance of the proposed and existing approaches is
demonstrated on several benchmark and multi-omics cancer data sets.
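
The basis invariance property mentioned above can be checked numerically. The sketch below assumes a random orthonormal basis and a fixed rotation, and uses the projection distance as one of several possible distances between subspaces; it is not claimed to be the specific Grassmannian distance used in Chapter 7. The names `U`, `V`, `Q`, and `projection_distance` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2

# Orthonormal basis U of a k-dimensional subspace of R^n.
U, _ = np.linalg.qr(rng.normal(size=(n, k)))

# A different orthonormal basis of the same subspace: U times a k x k rotation.
theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
V = U @ Q

def projection_distance(A, B):
    """Distance between span(A) and span(B) for orthonormal A and B, computed
    from their orthogonal projectors; it depends only on the subspaces."""
    return np.linalg.norm(A @ A.T - B @ B.T, ord="fro") / np.sqrt(2.0)

# U and V differ as matrices, yet they represent the same point on the
# Grassmannian, so the distance between them is numerically zero.
print(np.allclose(U, V))
print(projection_distance(U, V))
```

Working with subspaces instead of particular basis matrices means that two cluster indicator matrices related by such a rotation are treated as the same clustering solution.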
Finally, Chapter 8 concludes the thesis, and discusses the future scope and possible
improvements of the proposed research work.

Chapter 2

Survey on Multi-View Clustering

This chapter presents the basics of the multi-view clustering problem. A brief literature
survey focused primarily on multi-view clustering, along with its classification counterpart,
is also covered in this chapter.

2.1 Multi-View Clustering


A multi-view data set of $n$ samples, $\{x_1, x_2, \ldots, x_n\}$, consists of $M$ views, where $M \geq 2$.
The term “view” is used interchangeably with the term “modality” throughout the thesis,
and accordingly, a multi-view data set is also referred to as a “multimodal data set”. The
views or modalities can be represented by feature vector based data or by relational data. In
case of feature vector based representation, an $M$-view data set is given by a set of $M$ data
matrices $X_1, X_2, \ldots, X_m, \ldots, X_M$, each corresponding to one of the $M$ views. Each $X_m$ is
an $(n \times d_m)$ matrix consisting of $d_m$ features for each of the $n$ samples, observed in a
$d_m$-dimensional measurement space. The most commonly encountered space is the Euclidean
space, in which case, $X_m$ contains numeric values in $\mathbb{R}^{n \times d_m}$. The views can contain other
types of data as well, like textual, categorical, binary, and so on. The measurement space,
as well as the number of observed variables, $d_m$, need not be the same across different
views. The matrices $X_1, \ldots, X_M$ may vary in terms of their scale, unit, variance, dimension
(column-wise), and data distribution. In case of relational data, the $M$ views are typically
represented by $M$ similarity (distance) matrices $W_1, W_2, \ldots, W_m, \ldots, W_M$. Each $W_m$ is an
$(n \times n)$ matrix given by
$$W_m = [w_m(i, j)]_{n \times n},$$
where $w_m(i, j) \geq 0$ is the similarity (distance) between samples $x_i$ and $x_j$ in the $m$-th view.
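
The two representations can be related by a small sketch that derives a relational view from a feature vector based one. The Gaussian kernel, the bandwidth `sigma`, the toy data, and the function name `gaussian_similarity` are assumptions made only for illustration; the text does not prescribe a particular similarity measure.

```python
import numpy as np

def gaussian_similarity(X, sigma=1.0):
    """Return an (n x n) matrix W with w(i, j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b, clipped at 0 for round-off.
    sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# Hypothetical two-view data set: n = 6 samples, d_1 = 4 and d_2 = 10 features.
rng = np.random.default_rng(0)
n = 6
views = [rng.normal(size=(n, 4)), rng.normal(size=(n, 10))]   # X_1, X_2
graphs = [gaussian_similarity(X) for X in views]              # W_1, W_2
```

Each view yields an $(n \times n)$ matrix regardless of its own dimension $d_m$, which is why relational representations are convenient when the views differ in dimension.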
Figure 2.1 shows an example of a multimodal omics data set with feature vector based
representation. The advent of whole genome sequencing technologies has led to the generation
of different types of “omics” data from different levels of the genome. As shown
in Figure 2.1, the DNA methylation, copy number variation, gene expression, and protein
expression data can be observed from the epigenomic, genomic, transcriptomic, and pro-
teomic levels of the genome, respectively. In a multimodal data set, these observations can
be made for a common set of n samples or patients whose genome is being sequenced. The
resulting data set is a collection of $M$ views, denoted by $X_1, X_2, \ldots, X_m, \ldots, X_M$. Each
$X_m$, in this case, is an $(n \times d_m)$ data matrix consisting of the expression levels of $d_m$ genes,
or micro-RNAs, or proteins for those $n$ samples.

[Figure 2.1: Different views of multi-omics data analysis. Modalities/views shown for a common set of samples: DNA methylation (epigenomic), copy number variation (genomic), gene expression (transcriptomic), and protein expression (proteomic).]

Clustering is an unsupervised learning approach, which discovers the natural groups
present in a data set. Multi-view clustering aims at partitioning the $n$ samples, $\{x_i\}_{i=1}^{n}$,
into $k$ subsets $A_1, A_2, \ldots, A_k$ based on the feature/similarity information of multiple views,
such that the following three conditions are met:

• $A_j \neq \emptyset$, for $j = 1, 2, \ldots, k$.

• $\bigcup_{j=1}^{k} A_j = \{x_1, \ldots, x_n\}$.

• $A_j \cap A_l = \emptyset$, $\forall j \neq l$, and $j, l = 1, 2, \ldots, k$.

In addition, the samples contained in a cluster Aj are “more similar" to each other and
“less similar" to those in other clusters.
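
The three set-theoretic conditions can be checked mechanically. The sketch below is only an illustration; the function name `is_crisp_partition` and the representation of clusters as lists of sample indices are assumptions of this example.

```python
def is_crisp_partition(clusters, samples):
    """Check the three conditions of a hard clustering A_1, ..., A_k of `samples`."""
    sets = [set(A) for A in clusters]
    if not sets:
        return False
    union = set().union(*sets)
    non_empty = all(len(A) > 0 for A in sets)           # A_j is not empty
    covers_all = union == set(samples)                  # union of A_j is the sample set
    # Pairwise disjoint sets have sizes that add up to the size of their union.
    disjoint = sum(len(A) for A in sets) == len(union)  # A_j and A_l do not overlap
    return non_empty and covers_all and disjoint

valid = is_crisp_partition([[0, 1], [2, 3, 4]], range(5))
overlapping = is_crisp_partition([[0, 1], [1, 2, 3, 4]], range(5))
```

The similarity requirement stated after the three conditions is not a set-theoretic property and is therefore not covered by this check.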
According to the above definition of clustering, each sample can belong to a single
cluster. Hence, this type of clustering is termed as “crisp", “hard" or “partitional" clustering.
An alternate formulation of clustering, termed as “fuzzy clustering", was introduced by
Zadeh [270]. A fuzzy clustering of the samples $\{x_1, \ldots, x_n\}$ into $k$ clusters is characterized
by $k$ membership functions $u_j$, where

$$u_j : \{x_i\}_{i=1}^{n} \longrightarrow [0, 1], \quad \text{for } j = 1, 2, \ldots, k,$$

such that

$$\sum_{j=1}^{k} u_j(x_i) = 1, \ \text{for } i = 1, 2, \ldots, n, \quad \text{and} \quad 0 < \sum_{i=1}^{n} u_j(x_i) < n, \ \text{for } j = 1, 2, \ldots, k.$$

Under fuzzy clustering, each sample may belong to more than one cluster “up to some
degree”. The membership function $u_j(x_i)$ gives the degree of belongingness of sample $x_i$
to the j-th cluster. Fuzzy multi-view clustering is relatively less explored compared to its
hard counterpart. This thesis is focused on the design and analysis of hard multi-view
clustering algorithms and Section 2.2 primarily covers a brief survey of the same.
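
The constraints on the membership functions, and the way a fuzzy clustering reduces to a hard one, can be illustrated with a short sketch. The nested-list layout `U[j][i]` for $u_j(x_i)$, the tolerance `tol`, and the maximum-membership rule in `harden` are assumptions made for this example.

```python
def is_fuzzy_partition(U, tol=1e-9):
    """U[j][i] holds u_j(x_i); check the fuzzy clustering constraints."""
    k, n = len(U), len(U[0])
    in_range = all(0.0 <= u <= 1.0 for row in U for u in row)
    # Memberships of every sample sum to one over the k clusters.
    sums_to_one = all(abs(sum(U[j][i] for j in range(k)) - 1.0) <= tol for i in range(n))
    # No cluster is empty and no cluster fully absorbs all n samples.
    non_trivial = all(0.0 < sum(row) < n for row in U)
    return in_range and sums_to_one and non_trivial

def harden(U):
    """Assign each sample to the cluster with the largest membership (ties go to the lowest index)."""
    k, n = len(U), len(U[0])
    return [max(range(k), key=lambda j: U[j][i]) for i in range(n)]

U = [[0.9, 0.2, 0.5],
     [0.1, 0.8, 0.5]]
hard_labels = harden(U)
```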
The area of multi-view learning is relatively new. However, owing to its state-of-the-
art performance in several application areas it quickly came into the limelight of machine
learning research and developed a rich literature over the past decade. The literature on
multi-view learning can broadly be divided into multi-view clustering and multi-view
classification. Since the thesis focuses on multi-view clustering, the next section describes
different multi-view clustering approaches, and then Section 2.3 briefly touches upon its
classification counterpart.

2.2 Multi-View Clustering Approaches


Multi-view clustering algorithms can roughly be classified into seven categories based on
their algorithmic approaches, as shown in Figure 2.2. These categories are outlined in the
following subsections.

2.2.1 Early Integration Approaches


An early integration approach first concatenates the feature based raw data matrices from
all the views and then applies a single-view based clustering algorithm on the concatenated
data matrix. This straightforward integration enables the direct application of traditional
clustering algorithms to multi-view data. Given the feature based representation of views,
$X_1, \ldots, X_M$, the concatenated data matrix is formed by
$$X = [X_1 \;\; X_2 \;\; \ldots \;\; X_m \;\; \ldots \;\; X_M],$$
where $X_m \in \mathbb{R}^{n \times d_m}$ and $X \in \mathbb{R}^{n \times d}$ such that $d = \sum_{m=1}^{M} d_m$.

Then, any single-view clustering algorithm like k-means [139], spectral clustering [230], or
Gaussian mixture models [50] can be applied on the raw concatenated matrix X. Figure 2.3
shows a diagrammatic representation of the early integration approach, where the k-means
clustering algorithm is applied on X to obtain the clusters.
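
A minimal sketch of early integration is given here, assuming a plain Lloyd-style k-means with a deterministic farthest-first initialization. The initialization, the toy two-group data, and all function names are assumptions of this example, not part of any cited algorithm.

```python
import numpy as np

def farthest_first_centers(X, k):
    """Deterministic initialization: start from X[0], then repeatedly add the farthest sample."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers)

def kmeans(X, k, n_iter=100):
    """Plain Lloyd iterations; returns a label in {0, ..., k-1} for every row of X."""
    centers = farthest_first_centers(X, k)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def early_integration(views, k):
    """Concatenate the views column-wise and cluster the concatenated matrix."""
    X = np.hstack(views)                 # X = [X_1 X_2 ... X_M], shape (n, d_1 + ... + d_M)
    return X, kmeans(X, k)

# Hypothetical data: 20 samples in two groups, observed in two views.
rng = np.random.default_rng(0)
group_mean = np.repeat([0.0, 5.0], 10)[:, None]
X1 = group_mean + rng.normal(scale=0.5, size=(20, 3))   # view 1, d_1 = 3
X2 = group_mean + rng.normal(scale=0.5, size=(20, 5))   # view 2, d_2 = 5
X, labels = early_integration([X1, X2], k=2)
```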
The naive concatenation, however, increases the dimension or number of features in
the data set, which is a major challenge for some of the single views as well. One baseline
solution to the problem of early integration is to perform PCA on the concatenated data X
and then perform single-view clustering, like k-means clustering, on the top few principal
components of X. Another approach to handling the high dimension and its subsequent
problem of overfitting is to add regularization to induce data sparsity [221]. In
high-dimensional multi-view data integration, even though a majority of features in one view
may not be discriminative for a group of samples, a small number of features in the same
view can still be highly discriminative. In [235], sparsity inducing $\ell_{2,1}$-norm regularization
is imposed on X in order to obtain discriminative features from different views. The
$\ell_2$-norm regularization is imposed within each view to emphasize view-specific feature
weight learning corresponding to each cluster, while the $\ell_1$-norm is used to enforce sparsity
between different views and learn features that are discriminative across multiple clusters.

[Figure 2.2: Different types of multi-view clustering approaches. Categories shown: early integration, two-stage late integration, subspace clustering, multiple kernel learning, statistical models, graph integration, and deep clustering, together with matrix factorization, higher-order tensors, self-representation learning, canonical correlation analysis, and co-training and co-regularization.]
Although PCA and regularization can somewhat address the curse of dimensionality,
there are two more issues with naive concatenation. Firstly, the lack of appropriate
normalization is likely to give higher weight to views with a larger number of features or higher
variance, which may not necessarily yield the best possible cluster structure. Secondly,
naive feature concatenation does not take into account the difference in the distribution,
scale, and unit of measurement of the data in different views. Hence, the concatenation
may not be meaningful.
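
One simple way to reduce both effects before applying the PCA baseline is sketched here. Scaling each standardized view by $1/\sqrt{d_m}$, so that every view contributes the same total variance, is a heuristic assumed for this example and is not taken from the text; the data and function names are likewise hypothetical.

```python
import numpy as np

def standardize(X):
    """Center each feature and scale it to unit variance (constant features stay at zero)."""
    X = X - X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0
    return X / std

def pca_scores(X, r):
    """Project the rows of X onto its top-r principal components using the SVD."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :r] * s[:r]

# Hypothetical views with very different scales and dimensions.
rng = np.random.default_rng(0)
views = [rng.normal(scale=100.0, size=(30, 50)), rng.normal(scale=0.01, size=(30, 5))]

# After this step every view has total variance 1, irrespective of d_m or its original scale.
blocks = [standardize(Xm) / np.sqrt(Xm.shape[1]) for Xm in views]
X = np.hstack(blocks)
scores = pca_scores(X, r=3)      # low-dimensional input for a single-view clustering algorithm
```

Such normalization equalizes scale and variance only; it does not address differences in the underlying data distributions of the views.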

2.2.2 Two-Stage Late Integration Approaches


In the late integration approach, each view is first clustered separately using a single-view
clustering algorithm. The per-view clustering solutions are integrated at a second stage
to identify the integrative clusters [23, 93, 140, 214]. Figure 2.4 shows an example of the
two-stage clustering approach, where the final clusters are obtained by taking a global
consensus on the individual view-specific clusterings.

[Figure 2.3: Early integration based multi-view clustering. The views (panel labels include segment mean, log2 reads per million, and log2 fold change) are concatenated, and k-means clustering is applied to obtain the clusters.]

In the cluster of cluster assignments (COCA) algorithm [93], the clustering solution of
a sample xi , corresponding to view Xm , is encoded by a binary cluster indicator vector
that contains a 1 at index j indicating the belongingness of sample xi to cluster j, and 0
otherwise. The binary vectors for all samples across all views are combined to obtain a
multi-view cluster indicator matrix. Consensus clustering [160] on the multi-view indicator
matrix reveals the final clustering of the samples. The COCA algorithm has been applied
for pan-cancer analysis of multiple genomic modalities. It investigates how tumors in
different types of tissues cluster together, and whether the obtained tumor clusters resemble
the tissue of the site of cancer [93].
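
The construction of the multi-view cluster indicator matrix can be sketched as follows. The toy labels are hypothetical, and the co-association matrix computed at the end is only a simple stand-in that shows what information the indicator matrix carries; it is not the consensus clustering procedure of [160].

```python
import numpy as np

def one_hot(labels, k):
    """Binary (n x k) indicator matrix: entry (i, j) is 1 iff sample i is in cluster j."""
    labels = np.asarray(labels)
    Z = np.zeros((labels.size, k), dtype=int)
    Z[np.arange(labels.size), labels] = 1
    return Z

# Hypothetical view-specific clusterings of n = 6 samples.
view_labels = [
    [0, 0, 1, 1, 2, 2],   # view 1, three clusters
    [0, 0, 0, 1, 1, 1],   # view 2, two clusters
]

# Multi-view cluster indicator matrix: one block of binary columns per view.
indicator = np.hstack([one_hot(l, max(l) + 1) for l in view_labels])

# Fraction of views in which two samples are assigned to the same cluster.
coassoc = indicator @ indicator.T / len(view_labels)
```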
Among the probabilistic approaches, Bruno and Maillet [23] used latent semantic analysis
to obtain the final clusters from the multi-view cluster indicator matrix. In Bayesian
consensus clustering [140], a Bayesian framework driven by Dirichlet mixture model is
developed for simultaneous estimation of the consensus as well as view-specific clusterings.
The Dirichlet distribution based modelling has the advantage of incorporating uncertainty
within both view-specific and consensus clustering. More flexible methods allow for
more general consensus strategies and dependence models. Kirk et al. [115] proposed the
Bayesian correlated clustering algorithm, which uses a statistical framework to cluster each
view while simultaneously modeling the pairwise inter-dependence between two clusterings.
In Bregman consensus clustering [126], the disagreement between the consensus clustering
result and the input view-specific clusterings is generalized from the traditional Euclidean
distance to a more general Bregman loss.

[Figure 2.4: Two-stage consensus clustering.]


The advantage of these late integration approaches is that any clustering algorithm can
be used in the single-view stage. Certain clustering algorithms that are known to work well on
certain views can be independently used on those views without having to find a unified
model or algorithm that works for all views. However, the major drawback is that the late
integration of the view-specific clustering solutions often becomes cumbersome and may
fail to capture joint structures shared by different views.

2.2.3 Subspace Clustering Approaches


In several real-world applications, although the observed data is high-dimensional, the
essential information of the data can be represented in a much lower dimension. For
instance, there can be a large number of pixels in a given image, yet the appearance,
geometry, objects, and dynamics of a scene can be described using only a few parameters.
Subspace based approaches seek to find a unified latent space from multiple low-dimensional
subspaces and afterwards perform clustering in the latent space [137, 141, 289]. Figure 2.5
illustrates the general approach of multi-view subspace clustering. Low-rank subspaces
from individual views are merged to obtain a joint subspace, clustering on which gives the
final clusters. The mapping of the high-dimensional views to low-dimensional subspaces can
be achieved by a variety of methods such as matrix factorization, low-rank approximation,
and tensor decomposition. Apart from these methods, subspaces have also been extracted
with the idea of preserving different desirable properties in the latent space, like locality
and neighborhood, self-representativeness, non-negativity, sparsity, correlation, and so on.

[Figure 2.5: Multi-view subspace clustering. The subspaces of the individual views are merged, and clustering is performed in the merged subspace.]

A few mainstream subspace based approaches are briefly described as follows.

2.2.3.1 Matrix Factorization Based Approaches


Matrix factorization approaches use factorization algorithms to obtain low-rank factors
that act as representations of the high-dimensional points in a lower dimensional subspace.
One of the earliest and most widely used factorization algorithms is non-negative matrix
factorization (NMF) [123]. For a single view $X_m$, NMF assumes that $X_m$ has an intrinsic
lower dimensional non-negative representation, and tries to approximate each $X_m$ as
a product of two low-rank non-negative factors $Z_m \in \mathbb{R}_{+}^{n \times r}$ and $H_m \in \mathbb{R}_{+}^{r \times d_m}$, such that
$X_m \approx Z_m H_m$. Among the multi-view extensions of NMF, the MultiNMF [137] algorithm
integrates the views by constraining their corresponding low-rank representations $Z_m$ to be
close to a global consensus representation $\bar{Z} \in \mathbb{R}_{+}^{n \times r}$. The rows of $\bar{Z}$ are treated as
representations of the samples in an $r$-dimensional subspace and the joint clusters are identified
using a standard clustering algorithm like k-means on the consensus matrix $\bar{Z}$. In another
algorithm, termed as JointNMF [137, 281], instead of obtaining a common factor $\bar{Z}$ that is
close to all the $Z_m$ through a two-stage optimization, each view $X_m$ itself is approximated
by a common $\bar{Z}$ and a view-specific factor $H_m$, that is, $X_m \approx \bar{Z} H_m$. Locality preserving
NMF models have also been proposed, which constrain the pairwise similarities between
the samples in the latent representation $\bar{Z}$ to be proportional to those in the original space
$X_m$ [101, 284]. Other variants of NMF based multi-view clustering algorithms have also
been proposed, such as semiNMF [287], graph regularized NMF [146], manifold regularized
NMF [283, 299], local patch alignment based NMF [169], and robust neighboring constraint
NMF [36].
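
The single-view approximation $X_m \approx Z_m H_m$ can be illustrated with a short sketch that uses the standard multiplicative update rules for the Frobenius-norm objective. This is basic single-view NMF only, not MultiNMF or JointNMF; the random initialization, the constant `eps`, and the toy data are assumptions of this example.

```python
import numpy as np

def nmf(X, r, n_iter=200, seed=0, eps=1e-10):
    """Approximate a non-negative X (n x d) as Z @ H with Z (n x r) >= 0 and H (r x d) >= 0."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Z = rng.random((n, r)) + eps
    H = rng.random((r, d)) + eps
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep both factors non-negative at every step.
        H *= (Z.T @ X) / (Z.T @ Z @ H + eps)
        Z *= (X @ H.T) / (Z @ H @ H.T + eps)
        errors.append(np.linalg.norm(X - Z @ H))
    return Z, H, errors

# Hypothetical non-negative view of exact rank 2.
rng = np.random.default_rng(1)
X = rng.random((20, 2)) @ rng.random((2, 12))
Z, H, errors = nmf(X, r=2)
```

The rows of `Z` play the role of the low-dimensional sample representations on which a clustering algorithm such as k-means would subsequently be applied. A different `seed` generally leads to different factors.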

The reason behind the popularity of NMF is its ability to extract sparse and easily
interpretable factors. The NMF latent factors can be viewed as part-based representations
because the non-negativity constraint allows only additive, and not subtractive combina-
tions. For instance, in case of an image Xm , the columns of the Zm factor can be interpreted
as basis images, and Hm states how to sum up the basis images in order to reconstruct
an approximation of a given image Xm . In the case of facial images, the basis images are
features such as eyes, noses, moustaches, and lips, while the columns of Hm indicate which
feature is present in which image. The eigenvector based factors in other matrix decompo-
sitions like SVD contain both positive and negative entries, which lack the additive factor
based meaningful interpretation. However, NMF in its basic form [123] is applicable only
for non-negative data. Also, the factors Zm and Hm are not unique as they are obtained
by alternating optimization, which is sensitive to initialization.
Multi-view clustering has also been addressed using other factorization algorithms,
such as tri-matrix factorization [298], anchor graph based non-negative orthogonal fac-
torization [261], bilinear factorization [293], and SVD [91, 141]. Among the SVD based
approaches, the joint and individual variance explained (JIVE) [141], and angle based
JIVE [63] (A-JIVE) algorithms use SVD to partition each view into a common joint factor
and a view-specific individual factor. Clustering on the rows of the joint factor gives the
consistent global clustering of the data set, while that on the individual factors gives the
view-specific clusterings. The different factorization based approaches differ in terms of
the interpretation of the low-rank factors, properties preserved in them, and their compu-
tational complexity.

2.2.3.2 Tensor Based Approaches


A natural extension of matrix factorization methods for multi-view analysis is the use
of tensors, which are higher-order generalizations of matrices. While conventional matrices can capture
pairwise correlations within a view, tensors are capable of capturing high-order correlations
among multiple views and extracting information from a multidimensional perspective [39,
245, 253]. Zhang et al. [276] proposed the construction of third-order tensor with low-rank
constraint to model the cross information among different views and reduce redundancy in
the learned subspace. Jia et al. [107] imposed structured sparsity and symmetric low-rank
constraints on the horizontal and frontal slices of higher-order tensors to model both inter
and intra-view relationships. Tensors are also fused with graph and kernel learning to
improve multi-view clustering performance. Wu et al. [245] proposed to learn a low-rank
tensor for spectral clustering [230] directly from multiple similarity graphs. Tensors and
affinity graphs are learned simultaneously for multi-view spectral clustering in [41,42]. Xie
et al. [252] kernelized the high-dimensional input features using a tensor learning framework
to capture non-linear relationships between the samples. Multi-view clustering based on
tensor-SVD and its derived tensor nuclear norm are extensively explored in [244, 248, 253,
285].
The low-rank tensor learning approaches generally work by decomposing the input
tensor across multiple dimensions into lower order factors. The multi-dimensional decom-
position considers excessive combinations of all input features. One major challenge here
is to extract useful information from the decomposition and discard useless feature combi-
nations [105]. Moreover, the higher order relations learned from the inherently noisy views
can often give misleading information.

2.2.3.3 Self-Representation Based Subspace Learning Approaches


The self-representation based multi-view subspace clustering approaches emerged from two
popular baseline approaches, namely, sparse subspace clustering (SSC) [59,60] and low-rank
representation (LRR) [135]. Both of these approaches are based on the assumption that
high-dimensional data belonging to multiple classes or categories often lies in a union of
low-dimensional subspaces. This assumption implies that each sample or data point lying
in a union of multiple low-rank subspaces can always be expressed as an affine or linear
combination of a few other points belonging to that subspace, referred to as the ‘self-
representative’ property. These algorithms look for the sparsest combination in order to
learn an appropriate basis to fit each group and automatically determine other points lying
in the same subspace. The sparse combination coefficients are used to build a similarity
matrix from which clusters are identified by spectral clustering.
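
The self-representation idea can be illustrated with a simplified sketch. The sparse ($\ell_1$) and low-rank (nuclear norm) penalties of SSC and LRR require iterative solvers, so the example below swaps in a ridge (squared Frobenius norm) penalty, which admits a closed-form solution. It is therefore neither SSC nor LRR, and it does not enforce a zero diagonal on the coefficient matrix. The toy data, the parameter `lam`, and the function name are assumptions of this example.

```python
import numpy as np

def ridge_self_representation(X, lam=0.1):
    """Coefficients C minimizing ||X - C X||_F^2 + lam ||C||_F^2 (rows of X are samples)."""
    G = X @ X.T                                   # Gram matrix of the samples
    # Closed form C = G (G + lam I)^{-1}; G is symmetric, so solve (G + lam I) C^T = G.
    return np.linalg.solve(G + lam * np.eye(len(X)), G).T

# Hypothetical data: 4 samples in span(e1, e2) and 4 samples in span(e3, e4) of R^4.
rng = np.random.default_rng(0)
X = np.zeros((8, 4))
X[:4, :2] = rng.normal(size=(4, 2))
X[4:, 2:] = rng.normal(size=(4, 2))

C = ridge_self_representation(X)
W = (np.abs(C) + np.abs(C.T)) / 2     # symmetric similarity matrix for spectral clustering
```

For these two orthogonal subspaces the similarity matrix `W` is block diagonal, so samples are connected only to samples from their own subspace.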
Multi-view extensions of subspace clustering have been proposed, which directly re-
construct the data points in the original views using the self-representative property and
generate view-specific subspace representation [26, 27, 71, 240, 241, 275, 276]. Among them,
Zhang et al. [276] extended LRR for the multi-view setting using generalized tensor nuclear
norm, while Cao et al. [27] introduced a diversity term based on Hilbert Schmidt indepen-
dence criterion, and Wang et al. [240] added an exclusivity term to seek complementarity
and consistency of the multi-view subspace representations. These approaches reconstruct
the data points in each view separately using self-representation. However, a general as-
sumption is that multiple views are generated from a single underlying latent distribution.
Based on this assumption, latent multi-view subspace clustering (LMSC) [277] generates a
common latent representation for all views rather than that of each individual one. Zhang
et al. [275] extended LMSC by introducing neural networks to explore more general rela-
tionships between the views. Extensions are also proposed to enable subspace learning in
presence of missing samples and features in different views [167, 250, 255].
A majority of this category of subspace clustering approaches learn view-specific self-
representations. However, each view is likely to give only partial information about the
overall structure of the multi-view data set. Hence, subspaces and clusters reconstructed
from view-specific self-representations of samples may not give a complete or even accurate
picture of the multi-view data set [212].

2.2.3.4 Canonical Correlation Analysis Based Approaches


Canonical correlation analysis (CCA) gained importance in multi-view learning as it
naturally extracts two projection vectors from any two views such that the projected data
along those vectors is maximally correlated [97]. However, real-world data sets usually
have complex structure which is hard to capture via a single pair of covariates. Hence,
multiple pairs are extracted sequentially: the r-th pair of covariates is obtained by
maximizing the correlation between the new pair while constraining it to be orthogonal
to the previous ones. Chaudhuri et al. [35] theoretically
established that when the data is drawn from a mixture of Gaussians or a mixture of log
concave distributions, the canonical covariates can be used to cluster the data.
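
A minimal sketch of two-view CCA is given here, computing the projection vectors from the SVD of the whitened cross-covariance matrix. The small ridge term `reg`, added to keep the covariance matrices invertible, as well as the toy data and function names, are assumptions of this example.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs / np.sqrt(vals)) @ vecs.T

def cca(X1, X2, r, reg=1e-6):
    """Return projection matrices A, B (r pairs of vectors) and the canonical correlations."""
    n = len(X1)
    X1 = X1 - X1.mean(axis=0)
    X2 = X2 - X2.mean(axis=0)
    S11 = X1.T @ X1 / (n - 1) + reg * np.eye(X1.shape[1])
    S22 = X2.T @ X2 / (n - 1) + reg * np.eye(X2.shape[1])
    S12 = X1.T @ X2 / (n - 1)
    K1, K2 = inv_sqrt(S11), inv_sqrt(S22)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, Vt = np.linalg.svd(K1 @ S12 @ K2)
    return K1 @ U[:, :r], K2 @ Vt[:r].T, s[:r]

# Hypothetical two-view data driven by one shared latent variable z.
rng = np.random.default_rng(0)
z = rng.normal(size=200)
X1 = np.outer(z, [1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=(200, 3))
X2 = np.outer(z, [0.5, -1.0]) + 0.1 * rng.normal(size=(200, 2))
A, B, rho = cca(X1, X2, r=2)
```

The columns of `X1 @ A` and `X2 @ B` are the successive pairs of covariates, and `rho` lists their correlations in decreasing order.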
The conventional CCA fails in the high-dimensional low-sample size setting due to the
non-invertibility of covariance matrix, multi-collinearity of features, and computational
difficulty. To address these issues, sparse CCA [44, 61] is proposed to incorporate feature
selection into the CCA model and maximize correlation between only a small subset of
features. Other variants of CCA are also proposed for multi-view data integration and
clustering, for example, group sparse CCA [133], kernel CCA [17], and cluster CCA [175].
Blaschko and Lampert [17] used kernel CCA to learn non-linear relationships and proposed
a generalized spectral clustering algorithm for two-view data.
A major drawback of CCA is that it can extract correlated features from only two
feature sets or views. A typical way of generalizing CCA to multiple views is to maximize
the sum of pairwise correlations between all pairs of views [229]. However, this approach
is unable to capture higher-order correlations obtained by simultaneous examination of all
views. In this regard, some generalizations of CCA have been proposed to simultaneously
handle an arbitrary number of views. Examples include multi-set CCA [95], generalized
regularized CCA [219], graph multi-view CCA [37], and tensor CCA [144], among others.

2.2.4 Co-Training and Co-Regularization Approaches


The co-training and co-regularization approaches are based on the idea that the true un-
derlying clustering of multiple views would assign corresponding points in each view to the
same cluster. Based on this assumption, Kumar and Daumé [119] proposed a two-view
clustering algorithm that uses clustering result from one view to guide the other view, and
vice versa. Specifically, the spectral embedding from graph Laplacian [230] of one view is
used to refine the similarity graph used for the other view. By alternately iterating this
approach between two views, the clusterings of two views tend to be close to each other.
Multi-view extensions of the two-view co-regularization have also been proposed in [120].
Yu et al. [268] introduced co-regularization into the self-representation based multi-
view subspace clustering framework [60, 71]. Zhao et al. [291] combined the simplicity
of linear discriminant analysis (LDA) and k-means clustering, along with the co-training
approach, to extract discriminative subspaces from one view based on the clustering labels
learned in another view. Xu et al. [258] imposed tensor nuclear norm constraints on the co-
regularization model to capture higher-order relations while looking for consistent clustering
across different views. One drawback of the co-training based algorithms is that they
co-regularize each view equally, which is inappropriate when one view is informative while
the other is noisy.

2.2.5 Multiple Kernel Learning Approaches


Kernel functions implicitly map data points into a high (possibly infinite) dimensional
space and compute inner-product between images of points without explicitly computing
their coordinates in the transformed space [190]. Multiple kernel multi-view clustering
approaches pre-define a group of base kernels corresponding to different views and then
combine those kernels using a linear or non-linear combination to improve the clustering
performance [25, 81, 136, 199, 225].
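
The linear combination of base kernels can be sketched as follows. The choice of a linear and a Gaussian base kernel, the fixed weights, and the function names are assumptions of this example; the cited methods learn or tune the weights in different ways.

```python
import numpy as np

def linear_kernel(X):
    return X @ X.T

def gaussian_kernel(X, sigma=1.0):
    sq = np.sum(X ** 2, axis=1)
    sq_dists = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def combine_kernels(kernels, weights):
    """Convex combination of base kernels; non-negative weights are normalized to sum to one."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("kernel weights must be non-negative")
    weights = weights / weights.sum()
    return sum(w * K for w, K in zip(weights, kernels))

# Hypothetical two-view data set with one base kernel per view.
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(10, 3)), rng.normal(size=(10, 6))
kernels = [linear_kernel(X1), gaussian_kernel(X2)]
K = combine_kernels(kernels, weights=[1.0, 3.0])
```

A non-negative combination of positive semi-definite kernels is again positive semi-definite, so `K` remains a valid kernel for any kernel based clustering algorithm.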
Equally weighting kernels from different views can degrade the common clustering result
due to the presence of low-quality views. Hence, several kernel weighting schemes have been
proposed in the multi-view literature. For example, the algorithms proposed in [129, 225,
247] determine the distribution of weights based on some heuristic or model dependent
hyper-parameters. Self-weight optimization schemes are proposed in [136, 199, 266] which
automatically learn kernel weights without involving extra parameters. Cai and Li [25]
used Hilbert Schmidt independence criterion to measure the agreement between a pair of
kernels and obtained consensus kernels by agreement maximization. To filter redundant
information from views, Yao et al. [265] proposed to select a diverse subset of representative
kernels from a pre-specified set of kernels corresponding to different views. Trivedi et
al. [223] used Kernel CCA, while Zhang et al. [279] extended fuzzy c-means algorithm [13]
to address the problem of multi-kernel clustering in the presence of incomplete views.
A challenge in multi-kernel learning is choosing an appropriate kernel function
(for example, linear kernel, polynomial kernel, or Gaussian kernel), which maps the
input feature space to a high-dimensional Hilbert space.

2.2.6 Statistical Model Based Approaches


Statistical approaches [115, 156, 192, 243] aim to model the probability distribution of the
data. These approaches usually assume that the observed data is generated from a mixture
of distributions and use expectation maximization [161] to estimate the parameters of the
distribution by maximizing the likelihood of the observed data. The statistical approaches
have the advantage that they allow incorporating prior knowledge regarding the views while
modelling the distribution functions. This can be done using Bayesian priors or by specifying
the choice of the distribution. Among the statistical approaches, the iCluster algorithm
[192] assumes that the views are generated from a mixture of Gaussians, MDI [115] assumes
a Dirichlet mixture model, while LRAcluster [243] and iCluster+ [156] algorithms allow
modeling different views with different distributions (like Gaussian distribution for real
data, Poisson distribution for integer count data).
The other advantage of the statistical approaches is that they can model the uncertainty
in the data and make ‘soft’ probabilistic decisions, like probability of a sample belonging
to a cluster. Zhuang et al. [297] used probabilistic latent semantic analysis to model the
co-occurrence of samples and features in different views and determined cluster assignment
based on conditional probability of the samples belonging to different clusters. In spite
of the advantages, in most of the statistical formulations, the parameter estimation part
turns out to be computationally very expensive on high-dimensional real-life data sets.
Moreover, the heuristics and assumptions made regarding the distribution of the data do
not always conform with the diverse and noisy real-life data sets, resulting in poor model
fitting.

2.2.7 Graph Based Approaches


Graph based models form the most common category of multi-view clustering algorithms
[98, 164, 165, 213, 234, 236, 237, 246, 272, 273]. These methods typically take input graphs of
all views and find a fused graph or a low-dimensional spectral embedding of the graphs,
and then employ an additional clustering algorithm, like k-means or spectral clustering, to
produce the final clusters. Figure 2.6 shows an example of clustering based on multi-view
graph fusion. Consideration of a separate graph for each view inherently takes care of the
heterogeneity within the views in terms of unit, variance, and scale. However, the quality of
the cluster structure reflected in the graphs varies from one view to another. To incorporate
the differences in importance of different views, weighted multi-view graph clustering is
proposed in [98, 164, 165]. These approaches first weight each input graph so that different
views can have different impact on the unified representation. Several other advanced
weight optimization schemes are proposed in [236, 272, 273]. In order to impose consistency
between the clusterings reflected in different views, Nie et al. [166] proposed construction
of a common nearest neighbor graph shared between all the views. The edge weights in
the common graph are assigned based on the similarities between the corresponding
samples in all the views.

[Figure 2.6: Graph based multi-view clustering. The view-specific graphs are combined into a fused graph, on which spectral clustering is applied.]
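
The general pipeline of weighting the view-specific graphs, fusing them, and extracting a spectral embedding can be sketched as follows. The fixed view weights, the use of the symmetric normalized Laplacian, the toy two-block graphs, and the function names are assumptions of this example and do not correspond to any specific cited algorithm.

```python
import numpy as np

def normalized_laplacian(W):
    """Symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def fused_spectral_embedding(graphs, weights, k):
    """Fuse the graphs by a weighted sum and return the k smallest Laplacian eigenvectors."""
    W = sum(a * Wm for a, Wm in zip(weights, graphs))
    vals, vecs = np.linalg.eigh(normalized_laplacian(W))   # eigenvalues in ascending order
    return W, vecs[:, :k]

def two_block_graph(within, between, size=4):
    """Toy similarity graph with two groups of `size` samples and zero diagonal."""
    W = between * np.ones((2 * size, 2 * size))
    W[:size, :size] = within
    W[size:, size:] = within
    np.fill_diagonal(W, 0.0)
    return W

# View 1 reflects the two groups clearly, view 2 only weakly.
graphs = [two_block_graph(1.0, 0.05), two_block_graph(0.6, 0.3)]
W, embedding = fused_spectral_embedding(graphs, weights=[0.7, 0.3], k=2)

# For two clusters, the sign of the second eigenvector already separates the groups.
labels = (embedding[:, 1] > 0).astype(int)
```

For more than two clusters, the rows of the embedding would be passed to a clustering algorithm such as k-means instead of thresholding a single eigenvector.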
Most of the graph based approaches perform multi-view clustering on a set of fixed input
graphs, and the results are dependent on the quality of input graphs. In this regard, Tao
et al. [213] proposed an adaptive graph learning strategy where in addition to assigning
the importance of the graphs from view level, sample-pair-specific weights are assigned
within the views depending on the sample connection across different views. Another
set of approaches is based on the hypothesis that each view has a consistent part shared
between different views and an inconsistent part that does not appear in other views due
to view-specific characteristic traits [19,94,132]. These approaches first separate the graph
adjacency matrices into consistent and inconsistent parts by orthogonality constraints, and
then construct a unified matrix for clustering by fusing the consistent parts. Several hybrid
approaches have also been proposed that impose graph regularization on NMF [24, 146],
CCA [37, 38], tensor [40, 245], and self-representation subspace [267, 292] based multi-view
clustering approaches.

One major drawback of the graph based approaches is that the graphs constructed
from the inherently noisy real-life views may not be ideal. The noise and misleading edge
weights in the individual graphs may propagate during the graph fusion process and distort
the cluster structure of the unified graph [113].

2.2.8 Manifold Based Approaches


In several real-world data sets, the features are observed in a high-dimensional Euclidean
space but the meaningful structures often lie on a low-dimensional manifold embedded
within the input feature space [188]. Intuitively, in case of multi-view data, each view can
be regarded as lying on a separate manifold, and the intrinsic structure of the whole data
set can be treated as a mixture of manifolds. Based on this hypothesis, several multi-view
algorithms have been proposed to identify clusters lying on lower dimensional possibly
non-linear manifolds [24, 82, 108, 178, 269, 283, 295]. Cai et al. [24] and Zhang et al. [283]
performed NMF based multi-view clustering with manifold regularization to preserve the
local geometrical structures in each view. Zhou et al. [295] and Tao et al. [108] proposed ex-
tensions of this framework to incorporate sparsity and missing data prediction, respectively.
Xie et al. [249] learned the local manifold structure of each view using Laplacian embedding
[73, 180], which preserves the neighborhood relationships between the high-dimensional
points in the lower dimensional space as well.
In a separate line of approach, a prior assumption is made regarding the form or struc-
ture of the manifold and then optimization is performed on that specific manifold to identify
the clusters. Yu et al. [269] assumed that the lower dimensional representations corresponding
to different views belong to the Stiefel manifold [57], while the approaches proposed
in [82, 178] assumed that the representations belong to the Grassmannian manifold [3, 57].
The algorithms then optimize and merge the representations on Stiefel or Grassmannian
manifolds, using their manifold specific non-Euclidean distance measures. Most of these
approaches use manifold induced norm regularizations, embeddings, and distance mea-
sures to capture the manifold structure, but they perform optimization over the Euclidean
space [24, 178, 249, 283]. This indirect use of the manifold may fail to capture the true structure
of the underlying manifold. A few approaches [269] which optimize directly on the
manifolds can better exploit their inherent structure and properties. However, this opti-
mization is computationally more expensive compared to Euclidean optimization as general
non-linear manifolds do not satisfy the vector space assumptions of the Euclidean space.
The standard optimization algorithms like gradient descent, Newton’s method, conjugate
directions, etc., fail to work on non-linear manifolds unless generalized depending on the
geometry of the manifold.
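
To illustrate what generalizing gradient descent to the geometry of a manifold involves, the following is a minimal sketch of Riemannian gradient descent on the Stiefel manifold. It is not the method of [269]: the objective (maximizing trace(X^T A X)), the helper name `stiefel_gradient_descent`, the step-size rule, and the QR-based retraction are all illustrative assumptions.

```python
import numpy as np


def stiefel_gradient_descent(a, p, steps=500, seed=0):
    """Maximize trace(X^T A X) subject to X^T X = I, with X of size n x p.

    Hypothetical helper: A is assumed symmetric, and the step size is an
    illustrative choice tied to the spectral norm of A.
    """
    rng = np.random.default_rng(seed)
    n = a.shape[0]
    lr = 0.25 / np.linalg.norm(a, 2)
    # Random starting point with orthonormal columns (a point on the manifold).
    x, _ = np.linalg.qr(rng.normal(size=(n, p)))
    for _ in range(steps):
        g = -2.0 * a @ x                      # Euclidean gradient of -trace(X^T A X)
        sym = 0.5 * (x.T @ g + g.T @ x)
        rgrad = g - x @ sym                   # projection onto the tangent space at X
        x, _ = np.linalg.qr(x - lr * rgrad)   # retraction back onto the manifold
    return x


a = np.diag([3.0, 2.0, 0.5, 0.2, 0.1])
x = stiefel_gradient_descent(a, p=2)
objective = np.trace(x.T @ a @ x)
```

The two extra steps compared to Euclidean gradient descent are the tangent-space projection and the retraction; without them the iterate would leave the set of matrices with orthonormal columns after the first update.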

2.2.9 Deep Clustering Approaches


The classical machine learning approaches mostly use a shallow and linear (or linear
approximation of a non-linear) embedding function to capture the intrinsic structures of multiple
views in a lower dimensional subspace. Inspired by the powerful non-linear representation
ability of deep neural networks [77, 104], several deep multi-view clustering
approaches have recently been proposed [7, 33, 131, 211]. The deep clustering approaches are
primarily based on CCA [7, 12, 35, 80], matrix factorization [33, 103, 260, 287], self-representation
learning [1, 211, 238], and generative adversarial networks (GAN) [131, 239]. Andrew et
al. [7] proposed DeepCCA, a deep neural network extension of CCA, which extracts
non-linear feature embeddings corresponding to each view. The correlation between the feature
embeddings is maximized using CCA in the last layer.
In the self-representation category of multi-view clustering, Abavisani and Patel [1]
proposed a network consisting of a multi-view encoder, a self-expressive layer, and a multi-
view decoder. The encoder constructs a latent representation of the multi-view data,
the self-expressive layer enforces the self-representation property to construct a subspace
preserving affinity matrix, and the decoder block minimizes the sample reconstruction
loss. Spectral clustering on the affinity matrix learned from the self-expressive layer generates
the clusters. Gao et al. [72] proposed a hybrid network combining DeepCCA and the
self-representation layer. Among the GAN based approaches, Li et al. [131] fused an
autoencoder network with an adversarial network consisting of generator and discriminator
components for multi-view clustering. The factorization based approaches proposed in [103,
260,287] learn hierarchical semantics of multi-view data by performing matrix factorization
like NMF in a layer-wise fashion. Clustering is performed on the representation learned at
the final layer.
The deep multi-view models in general require massive amounts of training data to learn
the millions of network weights and to tune the hyper-parameters. Also, in several
approaches [1, 131, 238], the network architecture is heavily data dependent and is hard to
optimize, given the very large number of possible combinations of layers and activation functions.

2.3 Multi-View Classification Approaches


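In the problem of multi-view classification, there are $n$ training samples or instances, given
by
\[ \left\{ \left( x_i^{(1)}, \ldots, x_i^{(m)}, \ldots, x_i^{(M)}, y_i \right) \right\}_{i=1}^{n}, \]
where $y_i \in \mathcal{Y}$ (the label set) and $x_i^{(m)} \in \mathcal{X}^{(m)}$ (the domain or measurement space
corresponding to the $m$-th view). Each training instance is an $(M+1)$-tuple sampled from an
unknown underlying joint distribution over
$\mathcal{X}^{(1)} \times \mathcal{X}^{(2)} \times \cdots \times \mathcal{X}^{(M)} \times \mathcal{Y}$. The aim is to find a
function
\[ f : \mathcal{X}^{(1)} \times \mathcal{X}^{(2)} \times \cdots \times \mathcal{X}^{(M)} \longrightarrow \mathcal{Y} \]
in a hypothesis space $\mathcal{F}$ that can predict the label associated with an unknown instance
$x \in \mathcal{X}^{(1)} \times \mathcal{X}^{(2)} \times \cdots \times \mathcal{X}^{(M)}$ by $f(x)$.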
The supervised multi-view learning approaches can be divided into three major
categories: subspace learning, co-training, and multi-view support vector machines. These three
categories are briefly outlined in the next three subsections.

2.3.1 Subspace Learning Approaches


Subspace learning approaches aim to extract a common latent space from all the views to
perform classification in that subspace. CCA, as discussed in Section 2.2.3.4, is a typical
subspace learning approach that maximizes the correlation between pairs of projections
from different views. To incorporate the label information in classification problems, Sun
et al. [208] proposed discriminant CCA that considers inter-class and intra-class similarities
of different views during subspace extraction. Elmadany et al. [58] optimized discriminant
CCA using deep neural networks in order to obtain a nonlinear supervised dimensionality
reduction model. Yang and Sun [263] proposed MLDA that combines LDA with CCA to
ensure discriminative ability within a single view, while maximizing the correlation between
different views. However, high correlation between the canonical vectors computed by
MLDA implies that they contain redundant information. To address this, Sun et al. [206]
further proposed MULDA that combines CCA with uncorrelated or orthogonal LDA. The
algorithm proposed in [206] also generalized MULDA to the nonlinear case by replacing
CCA with kernel CCA and kernel discriminant CCA. Benton et al. [11] incorporated the
class information by treating the one-hot encoding matrix of the labels as an additional
view. Mandal and Maji [149] integrated supervised information into CCA through the
concept of rough hypercuboid. In this work, rough sets [170] are used to handle vagueness
within the classes and extract features that maximize the relevance and significance with
respect to the given class labels.

2.3.2 Co-Training Approaches


The co-training based approaches typically train a separate classifier for each view, and then
attempt to minimize the disagreement between the decision boundaries and prediction
functions learned in different views by alternating training. The approaches proposed
in [121, 196] use co-regularized least squares to perform a joint regularization over the
views and minimize the disagreement between the view-specific prediction functions in a
least squared sense. Guo and Xiao [83] proposed a subspace based co-regularized training
for cross language text classification. Their algorithm jointly minimizes the training error
of each classifier in each language while penalizing the distance between the subspace
representations of parallel documents. All of these approaches learn a separate prediction
function corresponding to each view. Sindhwani and Rosenberg [197] proposed to learn a
single prediction function with a data-dependent co-regularization norm that reduces the
problem to a standard single-view classification problem.
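
The idea of penalizing disagreement between view-specific prediction functions in a least squared sense can be sketched as follows. This is a simplified, fully supervised two-view variant written only for illustration; it is not the exact formulation of [121, 196], and the function name `coregularized_least_squares`, the parameters `lam` and `mu`, and the synthetic data are assumptions.

```python
import numpy as np


def coregularized_least_squares(x1, x2, y, lam=1.0, mu=1.0):
    """Jointly fit w1 and w2 by minimizing
        ||y - X1 w1||^2 + ||y - X2 w2||^2
        + lam * (||w1||^2 + ||w2||^2) + mu * ||X1 w1 - X2 w2||^2,
    solved in closed form through the normal equations."""
    d1, d2 = x1.shape[1], x2.shape[1]
    top = np.hstack([(1 + mu) * x1.T @ x1 + lam * np.eye(d1), -mu * x1.T @ x2])
    bottom = np.hstack([-mu * x2.T @ x1, (1 + mu) * x2.T @ x2 + lam * np.eye(d2)])
    rhs = np.concatenate([x1.T @ y, x2.T @ y])
    w = np.linalg.solve(np.vstack([top, bottom]), rhs)
    return w[:d1], w[d1:]


# Synthetic two-view data generated from a shared latent factor (illustrative only).
rng = np.random.default_rng(0)
n = 60
latent = rng.normal(size=(n, 3))
x1 = latent @ rng.normal(size=(3, 5)) + 0.5 * rng.normal(size=(n, 5))
x2 = latent @ rng.normal(size=(3, 4)) + 0.5 * rng.normal(size=(n, 4))
y = np.sign(latent[:, 0])


def disagreement(mu):
    """Norm of the difference between the two view-specific predictions."""
    w1, w2 = coregularized_least_squares(x1, x2, y, mu=mu)
    return np.linalg.norm(x1 @ w1 - x2 @ w2)
```

Setting `mu=0` decouples the two views into independent ridge regressions; increasing `mu` pulls the two prediction functions toward agreement on the training samples.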

2.3.3 Multi-View Support Vector Machines


The support vector machine (SVM) is a kernelized classification algorithm that utilizes labeled
samples to learn a decision boundary that maximizes the width of the margin between two classes [47].
Due to the popularity and effectiveness of SVMs in classification tasks, several multi-view
extensions of SVM have been proposed [62, 79, 100, 124]. Farquhar et al. [62] proposed
an SVM-2K model that performs two-view classification by combining kernel CCA and
SVM into a single optimization model instead of a two-stage model (kernel CCA followed
by SVM). Li et al. [124] proposed a two-view transductive SVM that utilizes multi-view
features to improve the performance of classifiers trained on individual views. An incremental
multi-view SVM is proposed in [296], which integrates the views one after another in an
incremental way instead of processing all views simultaneously. This incremental algorithm
is shown to be scalable and is specifically applicable for scenarios with streaming views.
Tang et al. [210] incorporated the ‘learning using privileged information’ paradigm [226,
227] into multi-view SVMs to target the complementary information of different views
while training. Several co-regularization and co-training style multi-view SVMs have also
been proposed. Typical approaches include multi-view Laplacian SVMs [254], generalized
eigenvalue proximal SVMs [205], sparse multi-view SVMs [204], multi-training SVMs [125],
and manifold regularized multi-view vector valued SVMs [145].

Apart from these three major categories, multi-view classification has also been
addressed using probabilistic models [207], multiple kernel learning [274], and deep
learning [110, 187, 259] based approaches.

2.3.4 Conclusion
The advancement of information acquisition technologies like various sensors, medical and
imaging devices, multimedia and networking platforms, and new feature extraction tech-
niques has made multi-view data increasingly common in several real-world applications.
The ubiquitous multi-view data has made multi-view learning an active area of research and
several algorithms have thus been proposed to understand the natural structure of these
data sets. However, these algorithms do not come without limitations. Most algorithms
fail to give satisfactory performance in the presence of noisy, redundant, and misleading views,
even though such views are fairly common in real-life data sets. Furthermore, the algorithms
are yet to harness the full non-linear geometry of multiple views and identify the best possi-
ble clustering of a data set. This gives a scope for improvement of the multi-view clustering
literature by designing algorithms that can truly understand the non-linear dynamics of a
multi-view system and be resilient to noisy and redundant views.
As mentioned in Section 1.2 of Chapter 1, one of the important challenges in multi-
view data integration is the appropriate selection of relevant and complementary views over
noisy and redundant ones. In this regard, the next chapter presents a novel algorithm for
constructing a low-rank joint subspace of the multi-view data, taking into consideration the
relevance or quality of the cluster structure embedded within each view and the redundancy
or amount of cluster information shared between two views.

Chapter 3

Multivariate Normality Based Analysis for Low-Rank Joint Subspace Construction

3.1 Introduction
Advanced high-throughput technologies have expanded the breadth of available omics data
from genome sequence data to transcriptomic, methylomic, and proteomic data. Each type
of omics data reflects the biological variation at a specific molecular level. However, dis-
eases like cancer involve complex interactions among biological components like genes,
microRNAs, and proteins across multiple molecular levels. Therefore, integrative analysis
of data from multiple omic modalities, like gene expression and DNA methylation, is likely
to capture a more accurate picture of dynamic molecular systems. One major objective
of integrative analysis is to understand the taxonomy of cancer by identifying the latent
disease subtypes. Cancer subtyping provides deeper understanding of disease pathogene-
sis as well as helps in designing personalized treatments. Data driven subtype discovery
is most popularly achieved by clustering data from one or more omic modalities. How-
ever, clustering multimodal or multi-view data sets has two major challenges. The main
challenge is the selection of appropriate modalities or views, which can provide relevant
and shared cluster information over noisy ones. Another challenge is to efficiently handle the
‘high-dimension low-sample size’ nature of the data sets, which reduces the signal-to-noise
ratio and makes clustering computationally expensive.
Separate clustering followed by manual integration is frequently used to analyze
multiple omics data sets because of its simplicity. Cluster-of-cluster assignment (COCA) [93]
and Bayesian consensus clustering (BCC) [140] are two such approaches, which first
cluster each modality separately and then combine the individual clustering solutions
to get the final cluster assignments. However, the integration of separate clustering
solutions fails to capture cross-platform correlations and shared joint structure. On the other
hand, some of the direct integrative approaches, like super k-means [282], iCluster [192],
iCluster+ [156], LRAcluster [243], joint and individual variance explained (JIVE) [141], and
angle-based JIVE (A-JIVE) [63], proceed by concatenating the individual modalities to get
the integrated data which is then used for clustering. As the naive concatenation of different
modalities may degrade the signal-to-noise ratio of the data, most of the direct integrative
approaches first extract a low-rank subspace representation of the high dimensional inte-
grated data and then clustering is performed in the reduced subspace [141,156,192,243]. A
brief survey of two-stage consensus and subspace based multimodal clustering approaches
is provided in Sections 2.2.2 and 2.2.3, respectively, of Chapter 2.
An important parameter of the low-rank based approaches is the rank or dimension of
the low-rank subspace to be extracted. The relation between the number of clusters k in
a data set and rank r of the low-rank subspace has already been established in literature
[52, 271]. The $k$ centroids of $k$ clusters in a $d$-dimensional input space lie in an affine
subspace of dimension at most $(k-1)$ [69]; and when the data is well clustered, the affine
subspace determined by the $k$ centroids is parallel to the $k$ principal components of the
data [151]. Zha et al. [271] showed that the top $k$ principal components are the continuous
solutions to the discrete cluster membership indicators in the k-means clustering problem.
However, given the indicators for $(k-1)$ clusters, the cluster membership indicators for
the $k$-th cluster can be retrieved. This indicates the presence of redundancy in the top $k$
principal components. Consequently, Ding and He [52] showed that the continuous solution
to the discrete k-means cluster indicators is given by the $(k-1)$ principal components.
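
The relation between the number of clusters and the rank of the centroid subspace can be checked numerically. The following is a small sketch on synthetic, well-separated Gaussian clusters; the data generation settings and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, n_per = 4, 50, 30                       # clusters, features, samples per cluster
centers = rng.normal(scale=5.0, size=(k, d))
x = np.vstack([c + rng.normal(size=(n_per, d)) for c in centers])
labels = np.repeat(np.arange(k), n_per)

# Centroids, centered at their own mean: the rows sum to zero, so the rank
# (the dimension of the affine subspace through the centroids) is at most k - 1.
centroids = np.vstack([x[labels == j].mean(axis=0) for j in range(k)])
centered_centroids = centroids - centroids.mean(axis=0)
affine_dim = np.linalg.matrix_rank(centered_centroids)

# Fraction of the centroid scatter captured by the top (k - 1) principal directions.
_, _, vt = np.linalg.svd(x - x.mean(axis=0), full_matrices=False)
top = vt[: k - 1]
captured = (np.linalg.norm(centered_centroids @ top.T) ** 2
            / np.linalg.norm(centered_centroids) ** 2)
```

For data of this kind, `captured` is expected to be close to 1, which is the sense in which the centroid subspace is aligned with the leading principal components.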
When the number of clusters in a data set is known, low-rank approaches like iCluster
[192], sparse iCluster [193], iCluster2 (iCluster with variance-weighted shrinkage) [191], and
iCluster+ [156] use the relation between the cluster subspace spanned by the k centroids
and the $(k-1)$ principal components to estimate the required rank parameter. Other
low-rank based approaches like LRAcluster [243] and JIVE [141] do not use the relation
between the number of clusters and rank parameter. While LRAcluster uses a likelihood
based index, JIVE uses permutation tests to estimate the rank of the reduced subspace.
In general, the existing integrative clustering algorithms, after estimating the rank, use
all the available modalities to construct the final joint subspace. The relevance of the
individual modalities as well as the amount of shared information contained within them
are not considered explicitly for the selection of modalities. However, some of the omic
modalities may provide only noisy information [278], which can degrade the underlying
cluster structure.
In this regard, this chapter presents a novel algorithm, termed as NormS (Normality
based Subspace), to extract a low-rank joint subspace of the integrated data from the
low-rank subspaces of the individual modalities. The proposed algorithm uses Royston's
H-statistic [182] for multivariate normality to estimate the ranks of the individual
subspaces. A normality based measure of relevance of an individual modality and an
orthogonality based measure of shared information or dependency between two modalities are
introduced in this work. The relevance measure gives a linear ordering of the modalities,
indicating the quality of cluster information embedded within them, while the dependency
measure is used to assess the overlap between the information provided by two modalities.
The modalities with maximum relevance and shared cluster structure are used to construct
the joint subspace. Furthermore, during integration of low-rank individual subspaces, in-
tersection between the subspaces is considered to select the cluster information only and
filter out the noise from each subspace. The performance of the clustering on the joint
subspace extracted by the proposed method is studied and compared with the existing
low-rank and consensus based approaches on several real-life multimodal omics data sets.
The efficiency of rank estimation and appropriate modality selection, based on the pro-
posed relevance and dependency measures, is established over existing approaches of naive
integration of all the modalities. Finally, the identified clusters are shown to be robust and
stable against perturbation of the data set. Some of the results of this chapter are reported
in [111].
The rest of the chapter is organized as follows: Section 3.2 describes the proposed mul-
tivariate normality based approach for multi-view data clustering. It also introduces two
quantitative measures proposed to evaluate the quality of different modalities. Experimen-
tal results on different multimodal cancer data sets and comparative performance analysis
with existing approaches are presented in Section 3.3. Finally, Section 3.4 concludes the
chapter.

3.2 NormS: Proposed Method


This section presents a new algorithm, based on multivariate normality, to construct the
joint subspace of the integrated data from the low-rank subspaces of individual modalities.
A multimodal or multi-view data set consists of $M \geq 2$ different sets of observations
corresponding to the same set of $n$ samples. Let the $M$ different modalities or views be given
by $X_1, \ldots, X_m, \ldots, X_M$, where each $X_m \in \mathbb{R}^{n \times d_m}$ and $d_m$ is the number of
features in $X_m$. The proposed algorithm assumes a signal-plus-noise model of the data [262],
where each $X_m$ can be decomposed as
\[ X_m = \Xi_m + Z_m, \tag{3.1} \]
where $\Xi_m$ is the signal component and $Z_m$ is the noise component consisting of independent
error terms. $Z_m$ is assumed to follow the $N(0, \Lambda)$ distribution, where
$\Lambda = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_{d_m}^2)$ and $\sigma_i^2$ is the variance of noise along the $i$-th
feature. The signal component $\Xi_m$ represents the inherent structure of the data. For a data
set having embedded cluster structure, the signal component $\Xi_m$ is considered to be a
mixture of Gaussians. If the rank of the latent $\Xi_m$ matrix is $r_m$, then the $r_m$-dimensional
principal subspace of $X_m$ represents the structural information embedded in $\Xi_m$. The
$r_m$-dimensional principal subspace of a modality $X_m$ is a linear subspace of $\mathbb{R}^n$ spanned by
the first $r_m$ left singular vectors of $X_m$ or, equivalently, the first $r_m$ eigenvectors of
$X_m X_m^T$ as basis. The principal subspace has the advantage of explaining the maximal
possible variance of the data using $r_m$ components.

3.2.1 Principal Subspace Model


The principal subspace of modality $X_m$ is generally extracted using the SVD of the mean
centered $X_m$, since $n \ll d_m$. Let $\mu(X_m) \in \mathbb{R}^{d_m}$ be the mean of $X_m$ and $\mathbf{1}$ be a
column vector of length $n$ of all ones. Let the SVD of the mean centered $X_m$ be given by

\[ X_m - \mathbf{1}\mu(X_m)^T = U(X_m)\,\Sigma(X_m)\,V(X_m)^T, \tag{3.2} \]

where $U(X_m)$ and $V(X_m)$, in their columns, contain the left and right singular vectors,
respectively, and $\Sigma(X_m)$ is a diagonal matrix of corresponding singular values arranged in
decreasing order. The principal components of $X_m$ are obtained by scaling the projections
in the columns of $U(X_m)$ by the corresponding spread values in $\Sigma(X_m)$, given by

\[ Y(X_m) = U(X_m)\,\Sigma(X_m). \tag{3.3} \]

Therefore, the $r_m$-dimensional principal subspace representation of $X_m$ is given by the
two-tuple:

\[ \Psi(X_m) = \langle U(X_m), \Sigma(X_m) \rangle, \tag{3.4} \]

where $U(X_m)$ is truncated to store the top $r_m$ left singular vectors and $\Sigma(X_m)$ contains
the corresponding $r_m$ largest singular values.
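
The extraction of the principal subspace in (3.2) to (3.4) can be sketched with NumPy as follows. The helper name `principal_subspace` and the random test matrix are assumptions for illustration; the rank `r` is taken as given here.

```python
import numpy as np


def principal_subspace(x, r):
    """Return (U, sigma): the top-r left singular vectors and singular values
    of the mean-centered data matrix x (n samples by d features)."""
    centered = x - x.mean(axis=0)             # X - 1 mu(X)^T, as in (3.2)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :r], s[:r]                    # truncated two-tuple, as in (3.4)


rng = np.random.default_rng(0)
x = rng.normal(size=(20, 100))                # n = 20 samples, d = 100 features
u, sigma = principal_subspace(x, r=3)
y = u * sigma                                 # principal components Y = U Sigma, as in (3.3)
```

Using `full_matrices=False` keeps the decomposition economical when n is much smaller than d, since only min(n, d) singular vectors are computed.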

3.2.2 Rank Estimation of Individual Modality


The proposed algorithm assumes that the data in a modality $X_m$ is generated from a
mixture of Gaussians. It uses a statistical hypothesis test to estimate the rank $r_m$ of its
principal subspace. The estimation of $r_m$ proceeds as follows: for each possible value of
$r = 1, 2, 3, \ldots$, it is tested whether the $r$-dimensional principal subspace encodes better
cluster structure compared to the $(r-1)$-dimensional subspace. Under the assumption
of normally distributed noise, that is $Z_m \sim N(0, \Lambda)$, a subspace of dimension $r$ would be
normally distributed only if it does not reflect cluster structure. On the other hand, if the
$r$-dimensional subspace encodes better cluster structure compared to the $(r-1)$-dimensional
subspace, then the $r$-dimensional subspace would deviate more from normality as compared
to the $(r-1)$-dimensional subspace. However, once all the meaningful variations due to the
clusters are summarized in the principal subspace of dimension $r$, the remaining variation
can be attributed to the Gaussian noise of the $Z_m$ component, which
gets reflected in the subspace of dimension $(r+1)$. In this case, the $(r+1)$-dimensional
subspace has higher normality compared to the $r$-dimensional subspace, and the rank $r_m$
of modality $X_m$ is considered to be $r$. In this regard, it should be noted that Hamerly and
Elkan [87] proposed to use the normality test for estimating the number of clusters in a
data set.
In the proposed algorithm, normality of a subspace is tested using Royston's multivariate
normality test [182]. It is an extension of the Shapiro-Wilk test [189] of univariate
normality, which has been reported to be the most powerful normality test across different
types of distributions and sample sizes [153, 176]. Moreover, the H-statistic of Royston's
normality test is found to have good power properties [150] against many alternative
distributions. For a certain value of rank $r$, the first $r$ left singular vectors in $U(X_m)$ span
the $r$-dimensional principal subspace of $X_m$. Let the $r$ left singular vectors be given by
$U(X_m) = [U^1, \ldots, U^i, \ldots, U^r]$, where $U(X_m) \in \mathbb{R}^{n \times r}$. Let the univariate data
corresponding to the $i$-th left singular vector be $U^i = (u_{i1}, \ldots, u_{ij}, \ldots, u_{in})^T$. The Shapiro-Wilk
$W$-statistic for univariate normality computes the correlation between the order statistics
of the given data and the expected standard normal order statistics. It has the following
form:

\[ W^i = \left( \sum_{j=1}^{n} a_j u_{i(j)} \right)^2 \Bigg/ \left( \sum_{j=1}^{n} \left( u_{ij} - \bar{u} \right)^2 \right); \tag{3.5} \]

where $u_{i(1)}, \ldots, u_{i(n)}$ are the order statistics of $u_{i1}, \ldots, u_{in}$ and $\bar{u}$ is the mean of $U^i$. The
$a_j$'s are given by [189]:

\[ (a_1, \ldots, a_n) = f^T V^{-1} \left( f^T V^{-1} V^{-1} f \right)^{-1/2}, \tag{3.6} \]

where $f = (f_1, \ldots, f_n)^T$. The $f_j$'s are the expected values of the order statistics of
independent and identically distributed random variables sampled from the standard normal
distribution and $V$ is the covariance matrix of those order statistics. The value of the
$W$-statistic lies in $(0, 1]$ and a value close to 1 suggests a good fit to normality.
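
As a rough illustration of how the $W$-statistic separates a cluster-bearing singular vector from a noise-like one, the sketch below applies SciPy's `scipy.stats.shapiro` to left singular vectors of synthetic two-cluster data. The data generation settings and variable names are assumptions, and SciPy's implementation is used in place of a hand-coded version of (3.5) and (3.6).

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
n, d = 100, 500
labels = np.repeat([0, 1], n // 2)

# Signal-plus-noise data: two clusters with means -1 and +1 along every feature.
signal = np.where(labels[:, None] == 0, -1.0, 1.0) * np.ones((n, d))
x = signal + rng.normal(size=(n, d))

u, _, _ = np.linalg.svd(x - x.mean(axis=0), full_matrices=False)

# First left singular vector: carries the two-cluster (bimodal) structure.
w_first, p_first = shapiro(u[:, 0])
# A later left singular vector: expected to reflect mostly Gaussian noise.
w_later, p_later = shapiro(u[:, 4])
```

In this setting `w_first` is expected to be clearly smaller than `w_later`, since the first singular vector is bimodal while the later one is close to normally distributed.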
Royston [181] showed that $W^i$ could be transformed to an approximately standard
normal variate $T^i$, using the following transformation:

\[ T^i = \sigma^{-1} \left[ \left( 1 - W^i \right)^{\lambda} - \mu \right], \tag{3.7} \]

where $\lambda$, $\mu$, and $\sigma$ are functions of $n$, calculated based on the polynomial approximations
given by [183]. Let $T^1, \ldots, T^i, \ldots, T^r$ be obtained using (3.7), where $T^i$ is the
transformation of the $i$-th component $U^i$ of the principal subspace of modality $X_m$. Let $R^i$ be defined
by

\[ R^i = \left\{ \Phi^{-1} \left[ \frac{1}{2} \Phi\left( -T^i \right) \right] \right\}^2, \quad i = 1, \ldots, r, \tag{3.8} \]

where $\Phi(\cdot)$ denotes the cumulative distribution function of the standard normal
distribution. Since $T^i$ is approximately standard normal, $R^i \sim \chi^2_1$ individually. Here $\chi^2_d$
denotes the $\chi^2$-distribution with $d$ degrees of freedom. For any $r$-dimensional subspace,
where the variables $R^i$'s are not necessarily uncorrelated, Royston's H-statistic is given by

\[ H_r = \frac{e}{r} \sum_{i=1}^{r} R^i. \tag{3.9} \]

The $H_r$-statistic follows approximately the $\chi^2_e$ distribution, where $e \leq r$ is called the effective
degrees of freedom of the $\chi^2$-distribution. The parameter $e$ is estimated, using the method
of moments as described by [182], as follows:

\[ e = \frac{r}{1 + (r-1)\bar{c}}, \quad \text{where } \bar{c} = \frac{1}{r^2 - r} \sum\sum_{i \neq j} c_{ij} \tag{3.10} \]

and $c_{ij}$ is the correlation between variables $R^i$ and $R^j$.
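
The computations in (3.8) to (3.10) can be sketched as below. The sketch takes the normal variates $T^i$ and the correlation matrix of the $R^i$'s as inputs; it does not reproduce the polynomial approximations of [183] needed for (3.7). The function names `royston_r`, `effective_dof`, and `royston_h` are hypothetical.

```python
import numpy as np
from scipy.stats import chi2, norm


def royston_r(t):
    """R^i = {Phi^{-1}[0.5 * Phi(-T^i)]}^2, as in (3.8)."""
    t = np.asarray(t, dtype=float)
    return norm.ppf(0.5 * norm.cdf(-t)) ** 2


def effective_dof(c):
    """e = r / (1 + (r - 1) * c_bar), as in (3.10).

    c is the r x r correlation matrix of the R^i's (assumed to be supplied).
    """
    c = np.asarray(c, dtype=float)
    r = c.shape[0]
    if r == 1:
        return 1.0
    c_bar = (c.sum() - np.trace(c)) / (r * r - r)   # mean off-diagonal correlation
    return r / (1.0 + (r - 1) * c_bar)


def royston_h(t, c):
    """Return (H_r, e, p-value) with H_r = (e / r) * sum_i R^i, as in (3.9)."""
    r_vals = royston_r(t)
    e = effective_dof(c)
    h = e * r_vals.mean()
    return h, e, chi2.sf(h, df=e)


# Three uncorrelated components with T^i = 0 (illustrative values only).
h, e, p = royston_h([0.0, 0.0, 0.0], np.eye(3))
```

With an identity correlation matrix the effective degrees of freedom equal the dimension, so `h` reduces to the plain sum of the `R` values; with perfectly correlated components `e` collapses to 1.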


In order to estimate the rank of modality $X_m$, for each possible value of rank $r$, the
difference between the normality of the $r$- and $(r-1)$-dimensional subspaces is evaluated. The
null and alternative hypotheses are as follows:

$H_0$: the $r$-dimensional subspace does not deviate more from normality compared to the
$(r-1)$-dimensional one.
$H_1$: the $r$-dimensional subspace deviates more from normality compared to the
$(r-1)$-dimensional one.

The $H_r$-statistic in (3.9) measures the normality of the $r$-dimensional principal subspace
of a modality $X_m$. The above hypothesis is tested using the following statistic:

\[ \gamma_r = H_r - H_{r-1}, \tag{3.11} \]

where $\gamma_r$ measures the difference between the normalities of the $r$- and $(r-1)$-dimensional
principal subspaces. However, in a principal subspace, the left singular vectors are
orthogonal to each other. So, the correlation $c_{ij} = 0$ in (3.10) for $i \neq j$. Therefore, for a principal
subspace, the effective degrees of freedom $e$ of the $H_r$-statistic is equal to the dimension $r$
of the subspace, and $H_r \sim \chi^2_r$. Hence, the $\gamma_r$-statistic reduces to

\[ \gamma_r = \sum_{i=1}^{r} R^i - \sum_{j=1}^{r-1} R^j = R^r \quad \text{and} \quad \gamma_r \sim \chi^2_1. \tag{3.12} \]

The relation in (3.12) signifies that if the value of $R^r$ corresponding to the $r$-th left singular
vector itself shows deviation from normality, the $r$-dimensional subspace deviates more
from normality compared to the $(r-1)$-dimensional subspace. This is similar to the
use of univariate normality for identification of significant principal components as in [91].
Acceptance of the null hypothesis $H_0$ at rank $r$ implies that the $r$-th left singular vector reflects
the noise from the normally distributed noise component $Z_m$ of modality $X_m$. On the other
hand, rejection of $H_0$ implies that the $r$-th singular vector reflects structural variation
from the mixture of Gaussians component $\Xi_m$ representing the clusters. To estimate the
rank $r_m$ of the signal component $\Xi_m$, the hypothesis $H_0$ is tested sequentially for each value
of $r$ starting from one. The minimum value of $r$ for which $H_0$ is accepted implies that the
$(r-1)$-dimensional principal subspace has summarized all the meaningful variations due
to the structural component $\Xi_m$, while the $r$-dimensional subspace additionally includes
noisy variation from the $Z_m$ component. Following this argument, for a given significance
level $\alpha$, the rank $r_m$ is determined by the relation:

\[ r_m = \min\{ r : p_r \geq \alpha \} - 1, \tag{3.13} \]

where $p_r$ is the p-value of the test of hypothesis $H_0$ at rank $r$. This implies that the rank of a
modality $X_m$ is estimated to be the smallest integer $r_m$ such that the $(r_m+1)$-th principal
component follows a normal distribution. However, real-life omics data sets are often
corrupted with high proportions of noise. Consequently, an isolated principal component may
follow a normal distribution, depicting the high noise content, while being preceded and
succeeded by components that deviate from normality. To make the rank estimation robust
to such noisy artifacts, the rank is estimated to be the smallest integer $r_m$ such that its
two consecutive components $(r_m+1)$ and $(r_m+2)$ are normally distributed, indicating
that the $r_m$-th component is the last meaningful one depicting the clusters.
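
The sequential rule in (3.13) and its robust two-consecutive-components variant can be sketched as follows. The function name `estimate_rank`, the `consecutive` parameter, the `None` return value for an undetermined rank, and the example p-values are assumptions for illustration.

```python
def estimate_rank(p_values, alpha=0.05, consecutive=1):
    """Estimate the rank from per-component normality p-values.

    p_values[r - 1] is the p-value p_r of the normality test on the r-th
    component. Returns the smallest r_m such that components
    r_m + 1, ..., r_m + consecutive are all judged normal (p >= alpha),
    or None if no such run exists among the tested components.
    """
    is_normal = [p >= alpha for p in p_values]
    for r in range(1, len(is_normal) - consecutive + 2):
        if all(is_normal[r - 1 : r - 1 + consecutive]):
            return r - 1
    return None


# Hypothetical p-values: the third component looks normal in isolation,
# while the fifth and sixth are normal consecutively.
p_values = [0.001, 0.01, 0.20, 0.003, 0.30, 0.40]
simple_rank = estimate_rank(p_values)                  # rule in (3.13): 2
robust_rank = estimate_rank(p_values, consecutive=2)   # two-consecutive rule: 4
```

The example shows why the robust rule matters: the isolated normal-looking third component would truncate the subspace at rank 2 under (3.13), even though the fourth component still deviates from normality.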
For a certain modality, if the first two components $U^1$ and $U^2$ are normally distributed,
then the null hypothesis $H_0$ gets accepted at rank 1. This implies that the rank of the
signal component $\Xi_m$ is 0 and the modality does not encode any relevant structural
information other than the random Gaussian noise. This gives the advantage of automatically
filtering out noisy modalities, where the underlying subtype structure is not reflected at
all. For the remaining modalities with rank $r_m > 0$, the cluster structure, encoded by
the modalities, varies from one modality to another. Some modalities may reflect compact
and well separated clusters, while others may reflect poor cluster structure. Moreover,
two modalities may either provide shared cluster information or completely disjoint noisy
information. Thus, appropriate choice of modalities, providing relevant and shared cluster
information, is expected to provide better cluster structure in the final low-rank subspace.
The relevance and dependency measures, proposed in this work, to evaluate the quality of
each modality are described next.

3.2.3 Relevance and Dependency Measures


Let $r_m$ be the rank of a modality $X_m$ estimated using the hypothesis testing as described
in Section 3.2.2. The $r_m$-dimensional principal subspace of $X_m$ consists of the top $r_m$ left
singular vectors of $X_m$ in $U(X_m)$ and their corresponding singular values in $\Sigma(X_m)$. The
left subspaces $U(X_m)$'s from different modalities have varying ranks and are not directly
comparable. So, a uniform measure of relevance is introduced next to assess the quality
of cluster information provided by different modalities with varying ranks. This measure
is based on the distribution of the principal components and the spread of the data along
them.

3.2.3.1 Relevance
Let $H_{r_m}$ be the value of Royston's H-statistic for the $r_m$-dimensional principal subspace of
$X_m$. It approximately follows the $\chi^2$-distribution with $r_m$ degrees of freedom. However, the
H-statistic values across different modalities are not comparable as the degrees of freedom of
the $\chi^2$-distribution vary for different modalities. Let $F^{-1}_{\chi^2}(p, df)$ be the inverse of the $\chi^2$
cumulative distribution function with $df$ degrees of freedom for the corresponding probability
$p$. At the significance level of $\alpha$, $F^{-1}_{\chi^2}((1-\alpha), r_m)$ gives the minimum value of the H-statistic
for which the multivariate normality assumption of the $r_m$-dimensional principal subspace
of $X_m$ can be rejected. Therefore, the difference between $H_{r_m}$ and $F^{-1}_{\chi^2}((1-\alpha), r_m)$
evaluates how far the H-statistic of the principal subspace of $X_m$ lies from the minimum
threshold for rejection of normality. Moreover, the singular values in $\Sigma(X_m)$ give the
spread of the data in the principal subspace of $X_m$. Amongst different modalities, the higher
the value of spread, the better is the separability of clusters reflected in the corresponding
principal subspace. The relevance of a modality $X_m$ is defined as the product of two factors:

\[ Rl(X_m) = \Phi(H_{r_m}, r_m, \alpha) \times \Theta_m, \tag{3.14} \]

where
\[ \Theta_m = \mathrm{tr}(\Sigma(X_m)) \Bigg/ \sum_{j=1}^{M} \mathrm{tr}(\Sigma(X_j)), \]
\[ \Phi(H_{r_m}, r_m, \alpha) = \frac{1}{2} \left[ 1 + \frac{H_{r_m} - F^{-1}_{\chi^2}((1-\alpha), r_m)}{\max\{ H_{r_m}, F^{-1}_{\chi^2}((1-\alpha), r_m) \}} \right], \]

and $\mathrm{tr}(A)$ denotes the trace of a matrix $A$. The first factor $\Phi(H_{r_m}, r_m, \alpha)$ evaluates the
distribution of the data along the principal subspace of $X_m$, while the second factor $\Theta_m$
measures the fraction of variance/spread explained by the modality $X_m$ out of the total
variance explained by the principal subspaces of all modalities.
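
The relevance measure in (3.14) can be computed directly from the H-statistics, ranks, and singular values of the modalities. The sketch below uses `scipy.stats.chi2.ppf` for the inverse $\chi^2$ cumulative distribution function; the function names and the numerical inputs are hypothetical and chosen only to exercise the formula.

```python
import numpy as np
from scipy.stats import chi2


def normality_factor(h, r, alpha=0.05):
    """First factor of (3.14): 0.5 * [1 + (H - tau) / max(H, tau)],
    where tau is the chi-square rejection threshold with r degrees of freedom."""
    tau = chi2.ppf(1.0 - alpha, df=r)
    return 0.5 * (1.0 + (h - tau) / max(h, tau))


def relevance(h_stats, ranks, singular_values, alpha=0.05):
    """Relevance of each modality: normality factor times the fraction of
    total spread, i.e. tr(Sigma(X_m)) divided by the sum of all traces."""
    traces = np.array([np.sum(s) for s in singular_values], dtype=float)
    theta = traces / traces.sum()
    phi = np.array([normality_factor(h, r, alpha) for h, r in zip(h_stats, ranks)])
    return phi * theta


# Three hypothetical modalities with ranks 3, 5, and 4.
rl = relevance(
    h_stats=[25.0, 40.0, 3.0],
    ranks=[3, 5, 4],
    singular_values=[[9.0, 4.0, 2.0], [8.0, 6.0, 3.0, 2.0, 1.0], [3.0, 2.0, 1.0, 0.5]],
)
```

In this example the third modality has an H-statistic below its rejection threshold, so its normality factor falls below 0.5 and it would be treated as irrelevant.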
In $\Phi(H_{r_m}, r_m, \alpha)$, the $H_{r_m}$ values are not comparable for different modalities $X_m$'s. So,
the difference between $H_{r_m}$ and its own minimum threshold $F^{-1}_{\chi^2}((1-\alpha), r_m)$ is
considered for significance analysis. This makes the H-statistic values $H_{r_m}$'s comparable across
different modalities. This is illustrated in Figure 3.1. Let there be three modalities $X_1$, $X_2$,
and $X_3$ with ranks $r_1$, $r_2$, and $r_3$, respectively. Without loss of generality, let $r_1 \neq r_2 \neq r_3$.
For modality $X_m$, $m = 1, 2$, and $3$, Royston's H-statistic $H_{r_m}$ for the $r_m$-dimensional
principal subspace of $X_m$ follows the $\chi^2$ distribution with $r_m$ degrees of freedom. Figure 3.1
shows the $\chi^2$ distributions for the three modalities. The shaded areas in the three $\chi^2$ curves
show the regions where the H-statistic shows statistically significant deviation from
multivariate normality. At a significance level of $\alpha$, $\tau_m$ gives the minimum value for which
the H-statistic of the respective $\chi^2_{r_m}$-distribution can be called statistically significant.
Let $\delta_m$ be the difference between $H_{r_m}$ and $\tau_m$. A value of $\delta_m \geq 0$
implies that the corresponding $H_{r_m}$ is statistically significant. Figure 3.1 shows that $H_{r_1}$
and $H_{r_2}$ are statistically significant, whereas $H_{r_3}$ is not. Moreover, as $\delta_2 > \delta_1 > \delta_3$,
$\Phi(H_{r_2}, r_2, \alpha) > \Phi(H_{r_1}, r_1, \alpha) > \Phi(H_{r_3}, r_3, \alpha)$. This implies that principal components of
modality $X_2$ deviate further away from normality as compared to those of $X_1$, so $X_2$ has
better cluster structure compared to $X_1$.

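[Figure 3.1 (plot not reproduced): probability density curves of the $\chi^2_{r_1}$, $\chi^2_{r_2}$, and $\chi^2_{r_3}$
distributions against $H_{r_m}$, with $r_1 \neq r_2 \neq r_3$, $\alpha = 0.05$, thresholds
$\tau_m = F^{-1}_{\chi^2}(0.95, r_m)$, and differences $\delta_m = H_{r_m} - \tau_m$, where $\delta_1 > 0$, $\delta_2 > 0$, and $\delta_3 < 0$.]
Figure 3.1: $\chi^2$ distributions for H-statistic of three modalities.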

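The following properties can be stated about the relevance measure $Rl(X_m)$:

1. $0 \leq Rl(X_m) \leq 1$.

2. $Rl(X_m) = 0$ if $H_{r_m} = 0$ or $\mathrm{tr}(\Sigma(X_m)) = 0$.
For $H_{r_m} = 0$,
\[ \Phi(0, r_m, \alpha) = \frac{1}{2} \left[ 1 + \frac{0 - F^{-1}_{\chi^2}((1-\alpha), r_m)}{\max\{ 0, F^{-1}_{\chi^2}((1-\alpha), r_m) \}} \right] = 0. \]

3. $Rl(X_m) \to 1$ when $H_{r_m} \to \infty$ and $\mathrm{tr}(\Sigma(X_m)) \to \sum_{j=1}^{M} \mathrm{tr}(\Sigma(X_j))$.
This is because
\[ \begin{aligned}
\lim_{H_{r_m} \to \infty} \Phi(H_{r_m}, r_m, \alpha)
  &= \lim_{H_{r_m} \to \infty} \frac{1}{2} \left[ 1 + \frac{H_{r_m} - F^{-1}_{\chi^2}((1-\alpha), r_m)}{\max\{ H_{r_m}, F^{-1}_{\chi^2}((1-\alpha), r_m) \}} \right] \\
  &= \frac{1}{2} \left[ 1 + \lim_{H_{r_m} \to \infty} \frac{H_{r_m} \left( 1 - F^{-1}_{\chi^2}((1-\alpha), r_m) / H_{r_m} \right)}{H_{r_m}} \right] \\
  &= \frac{1}{2} \left[ 2 - \lim_{H_{r_m} \to \infty} \frac{F^{-1}_{\chi^2}((1-\alpha), r_m)}{H_{r_m}} \right] = 1,
\end{aligned} \]
and
\[ \lim_{\mathrm{tr}(\Sigma(X_m)) \to \sum_{j=1}^{M} \mathrm{tr}(\Sigma(X_j))} \Theta_m = 1. \]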

When $H_{r_m}$ equals $F^{-1}_{\chi^2}((1-\alpha), r_m)$, the modality $X_m$ is at the minimum threshold
for rejecting the null hypothesis of normality and the value of $\Phi(H_{r_m}, r_m, \alpha)$ is 0.5.
Therefore, a value of $\Phi(H_{r_m}, r_m, \alpha) < 0.5$ means that multivariate normality of the principal
subspace of $X_m$ cannot be rejected, which is taken to reflect the presence of only random Gaussian
noise from the $Z_m$ component. On the other hand, a value of $\Phi(H_{r_m}, r_m, \alpha) \ge 0.5$ indicates
statistically significant deviation from multivariate normality and the presence of the signal
component $\Xi_m$. Hence, a modality $X_m$ is considered irrelevant if $\Phi(H_{r_m}, r_m, \alpha) < 0.5$, and
is not considered for joint subspace construction, whereas $\Phi(H_{r_m}, r_m, \alpha) \ge 0.5$
implies that $X_m$ has relevant cluster information. For a modality $X_m$, the first factor
$\Phi(H_{r_m}, r_m, \alpha)$ tends to 1 when its H-statistic $H_{r_m}$ tends to $\infty$, while the second factor
$\Theta_m$ tends to 1 when only $X_m$ has non-zero variance and all the other modalities have
variance close to 0. A higher value of $\Phi(H_{r_m}, r_m, \alpha)$ implies further deviation from the
normally distributed noise component, while a higher value of $\Theta_m$ implies a larger fraction of
explained variance. Taking both factors together, a higher value of $R_l$ implies better
cluster information.
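The behaviour of the first factor can be checked numerically. The following is a minimal sketch, assuming SciPy is available and that $\Phi$ has the form used in the limit computations above; the function name `phi` and the default `alpha` are illustrative choices, not taken from the thesis code.

```python
from scipy.stats import chi2


def phi(h_stat, rank, alpha=0.05):
    """Sketch of Phi(H, r, alpha): 0 at H = 0, 0.5 at the threshold, tends to 1 as H grows."""
    # tau is the chi-square quantile F^{-1}_{chi^2}((1 - alpha), r), assuming rank >= 1
    tau = chi2.ppf(1.0 - alpha, df=rank)
    return 0.5 * (1.0 + (h_stat - tau) / max(h_stat, tau))


# Threshold for a rank-3 principal subspace at alpha = 0.05
tau_3 = chi2.ppf(0.95, df=3)
at_zero = phi(0.0, 3)
at_threshold = phi(tau_3, 3)
far_above = phi(1e9, 3)
```

Here `at_zero`, `at_threshold`, and `far_above` correspond to the three regimes discussed above: no deviation, the rejection boundary, and a very large H-statistic.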

3.2.3.2 Dependency
Let $X_i$ and $X_j$ be two modalities with left subspaces $U(X_i)$ and $U(X_j)$, and ranks $r_i$ and $r_j$,
respectively. The dependency of $X_j$ on $X_i$ is measured by the proportion of the subspace
$U(X_j)$ that can be spanned by the subspace $U(X_i)$. This is obtained by projecting the
singular vectors in $U(X_j)$ onto the column space of $U(X_i)$. As the columns of $U(X_i)$ are
orthonormal, the projection matrix onto the column space of $U(X_i)$ is given by $U(X_i)U(X_i)^T$.
Using this projection matrix, the projection of $U(X_j)$ onto the subspace $U(X_i)$ is given by

$$P = U(X_i)\,U(X_i)^T\,U(X_j). \tag{3.15}$$

The proportion of $U(X_j)$ that can be spanned by $U(X_i)$ is given by the ratio of the squared norm of
the projection $P$ to the squared norm of $U(X_j)$ itself. So, the dependency of modality $X_j$ on $X_i$ is
given by

$$D(X_j|X_i) = \frac{\|P\|_F^2}{\|U(X_j)\|_F^2}, \tag{3.16}$$

where $\|A\|_F^2$ is the squared Frobenius norm of the matrix $A$. Some properties of the
dependency measure can be stated as follows:

1. $0 \le D(X_j|X_i) \le 1$.

2. If $U(X_i)$ and $U(X_j)$ are orthogonal to each other, then the projection $P = 0$ and
dependency $D(X_j|X_i) = 0$ (Figure 3.2(a)).

3. If all the left singular vectors in $U(X_j)$ are linear combinations of those in $U(X_i)$,
then $P = U(X_j)$ and dependency $D(X_j|X_i) = 1$ (Figure 3.2(b)).

4. In general, $D(X_j|X_i) \neq D(X_i|X_j)$ (asymmetric).

Dependency $D$, thus, measures the amount of shared information present in modality $X_j$,
given the information in modality $X_i$. The possible cases of dependency of a modality
$X_j$ on a modality $X_i$ are depicted in Figure 3.2. In Figure 3.2(a), the subspace $U(X_j)$ is
orthogonal to $U(X_i)$, so its dependency on $U(X_i)$ is 0. On the other hand, if $U(X_j)$ is
linearly dependent on $U(X_i)$ as in Figure 3.2(b), its dependency on $U(X_i)$ is 1. For any
other arbitrary orientation of the two subspaces (Figure 3.2(c)), the dependency $D(X_j|X_i)$ lies
between 0 and 1.
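The dependency measure in (3.15) and (3.16) can be sketched in a few lines. The snippet below is an illustration under stated assumptions, not the thesis implementation: it uses NumPy, the function name `dependency` is hypothetical, and the toy subspaces are built from the standard basis of $\mathbb{R}^3$ so that the three cases of Figure 3.2 are easy to see.

```python
import numpy as np


def dependency(U_i, U_j):
    """D(X_j | X_i): fraction of span(U_j) captured by span(U_i).

    Both arguments are n x r matrices whose columns are assumed orthonormal.
    """
    P = U_i @ (U_i.T @ U_j)  # projection of U_j onto span(U_i), as in (3.15)
    return np.linalg.norm(P, "fro") ** 2 / np.linalg.norm(U_j, "fro") ** 2


# Toy subspaces of R^3 (illustrative only)
e = np.eye(3)
xy_plane = e[:, [0, 1]]  # spanned by the x and y axes
x_axis = e[:, [0]]
z_axis = e[:, [2]]

d_orthogonal = dependency(xy_plane, z_axis)  # z axis is orthogonal to the plane
d_contained = dependency(xy_plane, x_axis)   # x axis lies inside the plane
d_reverse = dependency(x_axis, xy_plane)     # only half of the plane lies on the x axis
```

The last two values differ because the denominator in (3.16) depends on which subspace is being projected, which is the source of the asymmetry in property 4.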

[Figure 3.2 (diagram): three panels showing the subspaces $U(X_i)$ and $U(X_j)$ in x-y-z coordinates, with
(a) $D(X_j|X_i) = 0$, (b) $D(X_j|X_i) = 1$, and (c) $0 < D(X_j|X_i) < 1$.]

Figure 3.2: Dependency of modality $X_j$ on $X_i$: (a) Orthogonal subspaces (b) Linearly
dependent subspaces (c) Arbitrary subspaces.

3.2.4 Proposed Algorithm


The proposed algorithm is described next for the construction of joint subspace from the
principal subspaces of individual modalities. For each $X_m$, its rank $r_m$ is estimated. A
modality $X_m$ with rank $r_m = 0$ consists of only the noisy component $Z_m$ and gets
automatically filtered out at the first stage. The relevance $R_l(X_m)$ is computed according to
(3.14) for each modality $X_m$ having rank $r_m > 0$. Let

$$\Psi(X_m) = \langle U(X_m), \Sigma(X_m) \rangle \tag{3.17}$$

denote the joint subspace obtained at step $m$ of the proposed algorithm. The process of
joint subspace construction is initiated from the modality $X_\pi$ having maximum relevance
value. Thus, at step 1, the initial joint subspace is given by

$$\Psi(X_1) = \Psi(X_\pi) = \langle U(X_\pi), \Sigma(X_\pi) \rangle. \tag{3.18}$$

At step $(m+1)$, each remaining modality $X_j$ may have some shared cluster information with
respect to the current joint subspace $\Psi(X_m)$. To assess the amount of shared information,
the dependency $D(X_j|X_m)$ of $X_j$ on the current joint subspace $U(X_m)$ is computed, following
(3.16), for each of the remaining modalities. The higher the dependency, the stronger the presence
of shared structure in that modality. So, the modality $X_\omega$ having maximum dependency
on the current subspace $U(X_m)$ is chosen for integration. At step $(m+1)$, the principal
subspace $\Psi(X_\omega)$ is integrated with the joint subspace $\Psi(X_m)$ obtained at step $m$.

[Figure 3.3 (diagram): two panels, (a) and (b), each decomposing a component $U^j$ into its
projection $P^j$ and its residual $Q^j$.]

Figure 3.3: Two different cases of residual component $Q^j$ after the projection of $U^j$ on
the current joint subspace: (a) Residual follows normal distribution (b) Residual shows
divergence from normal distribution.

Both $U(X_m)$ and $U(X_\omega)$ are subspaces of $\mathbb{R}^n$ with their column vectors as their basis.
The intersection between these two subspaces reflects common structures encoded by both
the subspaces. The intersection is computed by projecting the columns of $U(X_\omega)$ onto the basis
spanned by the columns of $U(X_m)$, and is given by

$$I = U(X_m)^T\,U(X_\omega). \tag{3.19}$$

The projection $P$ of $U(X_\omega)$, lying in the subspace $U(X_m)$, is obtained by the product of
the basis $U(X_m)$ and the projection magnitudes in $I$, and is given by

$$P = U(X_m)\,I. \tag{3.20}$$

The residual $Q$ of $U(X_\omega)$ is obtained by subtracting the projection $P$ from $U(X_\omega)$ itself,
which is given by

$$Q = U(X_\omega) - P. \tag{3.21}$$

The projection $P$ reflects the shared structure, while the residual $Q$ contains the extra
information of $X_\omega$. Let the columns of $Q$ be given by $Q = [Q^1, \ldots, Q^j, \ldots, Q^{r_\omega}]$. If a residual
vector $Q^j$ contains cluster information, then the data in $Q^j$ shows divergence from
normality. But, if $Q^j$ contains only noisy information, then it should be normally distributed.
The main idea is to incorporate only the remaining cluster information of modality $X_\omega$. So,
the singular vectors of $U(X_\omega)$ whose residuals show significant divergence from normality
are considered for the construction of the joint subspace $U(X_{m+1})$. This is given by the set

$$S = \{U^j : p_{Q^j} < \alpha\}, \tag{3.22}$$

where $U^j$ denotes the $j$-th column of $U(X_\omega)$ and $p_{Q^j}$ denotes the p-value corresponding
to the Shapiro-Wilk normality test on the residual column $Q^j$ of $Q$. Therefore, for each
component $U^j \in U(X_\omega)$, there are two possible cases: either its residual component $Q^j$
is normally distributed (Figure 3.3(a)), or the residual shows significant divergence from
normality (Figure 3.3(b)). In Figure 3.3, suppose the component $U^j$ consists of two clusters
and the noise. In Figure 3.3(a), the projected
component $P^j$ of $U^j$ reflects both the clusters, and the residual component $Q^j$ depicts only
the noise. Thus, $Q^j$ is normally distributed. On the other hand, in Figure 3.3(b), the
residual $Q^j$ depicts some cluster information along with noise. So, $Q^j$ shows divergence
from normality. The set $S$ is formed using only those $U^j$'s having cluster information
in their residuals. Finally, the components of the joint subspace $\Psi(X_{m+1})$ at step $(m+1)$
are obtained as follows:

$$U(X_{m+1}) = \left[\, U(X_m) \;\; S \,\right] \quad \text{and} \tag{3.23}$$
$$\Sigma(X_{m+1}) = \mathrm{diag}\left(\Sigma(X_m), \Sigma(S)\right), \tag{3.24}$$

where $U(X_{m+1})$ is formed by column-wise concatenation of $U(X_m)$ and the vectors of $S$, and
$\Sigma(S)$ is the diagonal matrix of singular values corresponding to the vectors in $S$.
For M modalities having rank greater than 0, the final joint subspace ΨpXM q is ob-
tained in M steps using the above procedure. The principal components Y pXM q are then
obtained from ΨpXM q using (3.3). Finally, k-means clustering is performed on the rows
of Y pXM q to get the sample clusters or the cancer subtypes. The proposed algorithm to
extract a joint subspace of a multimodal data set is given in Algorithm 3.1.
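A single update step, covering (3.19) to (3.23), might be sketched as follows. This is a hedged illustration rather than the NormS source (which the thesis states is written in R): it assumes NumPy and SciPy, the function name `integrate_modality` and the toy data are hypothetical, and the update of the singular values in (3.24) is omitted for brevity.

```python
import numpy as np
from scipy.stats import shapiro


def integrate_modality(U_joint, U_new, alpha=0.05):
    """Append to U_joint the columns of U_new whose residuals look non-normal."""
    I = U_joint.T @ U_new  # intersection, as in (3.19)
    P = U_joint @ I        # projection onto the joint subspace, as in (3.20)
    Q = U_new - P          # residual, as in (3.21)
    # Shapiro-Wilk p-value of every residual column (index 1 of the result is the p-value)
    p_values = np.array([shapiro(Q[:, j])[1] for j in range(Q.shape[1])])
    keep = p_values < alpha  # membership in the set S of (3.22)
    return np.hstack([U_joint, U_new[:, keep]]), keep, p_values


# Toy example: one clearly two-cluster direction and one pure-noise direction.
rng = np.random.default_rng(0)
n = 200
U_joint, _ = np.linalg.qr(rng.standard_normal((n, 2)))  # stand-in for the current joint subspace
clustered = np.repeat([-1.0, 1.0], n // 2) + 0.1 * rng.standard_normal(n)
noise = rng.standard_normal(n)
U_new, _ = np.linalg.qr(np.column_stack([clustered, noise]))  # orthonormal columns

U_next, keep, p_values = integrate_modality(U_joint, U_new)
```

In this toy setting the first column of `U_new` carries a strongly bimodal signal, so its residual is rejected by the normality test and the column is appended; whether the noise column is retained depends on its p-value, as in the thesis description.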

3.2.4.1 Computational Complexity of Proposed Algorithm


The proposed algorithm begins by performing SVD on each of the modalities to extract
its left subspace and singular values. For a single modality, this complexity is bounded by
$O(n^2 d_{\max})$, where $d_{\max}$ is the maximum number of features among the modalities. This
is followed by the computation of rank. The rank is computed by consecutively performing
normality tests on the left singular vectors. Each normality test has a complexity of
$O(n \log n)$ attributed to the computation of order statistics in (3.5). For a modality, the
complexity of rank estimation is bounded by $O(r_{\max}\, n \log n)$, where $r_{\max}$ is the maximum
rank among the modalities. The computation of relevance $R_l$ takes $O(M)$ time. Therefore,
for $M$ modalities, the time complexity of individual subspace construction followed
by rank and relevance estimation is bounded by
$O\left(M\left(n^2 d_{\max} + r_{\max}\, n \log n + M\right)\right) = O(M n^2 d_{\max})$, considering $M, r_{\max} \ll n$.
The subspaces can also be constructed in parallel for different modalities, as the problems
are independent of each other. Then, the most relevant modality is selected in $O(M)$ time
and the initial joint subspace is constructed in $O(1)$ time.

Algorithm 3.1 Proposed Algorithm: NormS
Input: $X_1, \ldots, X_m, \ldots, X_M$, $X_m \in \mathbb{R}^{n \times d_m}$
Output: Joint subspace $\Psi(X)$
1: for $m \leftarrow 1$ to $M$ do
2:     Estimate rank $r_m$ of each modality $X_m$ using hypothesis test.
3:     Compute the principal subspace $\Psi(X_m)$ of rank $r_m$ using (3.4).
4:     Compute relevance $R_l(X_m)$ for each modality $X_m$ using (3.14).
5: end for
6: Let $\Gamma$ be the set of $\tilde{M} \le M$ modalities having rank greater than 0.
7: Find the modality $X_\pi$ having maximum relevance.
8: Set $\Psi(X_1) = \Psi(X_\pi)$.
9: Remove modality $X_\pi$ from $\Gamma$, that is, $\Gamma = \Gamma \setminus \{X_\pi\}$.
10: for $m \leftarrow 1$ to $(\tilde{M} - 1)$ do
11:     Compute the dependency $D(X_j|X_m)$ of each of the remaining modalities
        $X_j \in \Gamma$ on the current joint subspace $X_m$.
12:     Select $X_\omega$ having the maximum dependency or shared structure.
13:     Compute intersection $I$, projection $P$, and residual $Q$ using (3.19), (3.20),
        and (3.21), respectively.
14:     Test the normality of each residual vector $Q^j \in Q$ using Shapiro-Wilk test.
15:     Compute, using (3.22), the set $S$ of singular vectors whose residuals show
        significant divergence from normality.
16:     Update the current joint subspace as follows:
        $U(X_{m+1}) = \left[\, U(X_m) \;\; S \,\right]$
        $\Sigma(X_{m+1}) = \mathrm{diag}(\Sigma(X_m), \Sigma(S))$
        $\Psi(X_{m+1}) = \langle U(X_{m+1}), \Sigma(X_{m+1}) \rangle$
17:     Remove modality $X_\omega$ from $\Gamma$, that is, $\Gamma = \Gamma \setminus \{X_\omega\}$.
18: end for
19: Set $\Psi(X) = \Psi(X_{\tilde{M}})$.
20: Return $\Psi(X)$.
At each step of the joint subspace construction (steps 10-18), the dependency of the
remaining left subspaces on the current joint subspace is computed. This computation
is upper bounded by $O(M n^2 r_{\max})$. For the modality with maximum shared structure,
the projection $P$ and the residual $Q$ are computed in $O(n^2 r_{\max})$ time. Evaluation of
normality of the residuals is bounded by $O(r_{\max}\, n \log n)$. Depending on the residuals,
the next selected modality is updated into the joint subspace in $O(1)$ time by concatenation.
Hence, the time complexity of a single update step of the algorithm is
$O\left(M n^2 r_{\max} + n^2 r_{\max} + r_{\max}\, n \log n + 1\right) = O(M n^2 r_{\max})$. Finally, the overall complexity
of the proposed algorithm is bounded by $O\left(M n^2 d_{\max} + M n^2 r_{\max}\right) = O(M n^2 d_{\max})$. This shows that the
computational complexity of the proposed algorithm is dominated by the initial steps of
individual subspace construction and rank estimation.

3.3 Experimental Results and Discussion


This section presents the clustering performance of the joint subspace extracted by the
proposed algorithm and its comparison with the existing integrative clustering algorithms.

3.3.1 Data Sets and Experimental Setup


The multimodal omics data for four types of cancer, namely, cervical carcinoma (CESC),
lower grade glioma (LGG), ovarian carcinoma (OV), and breast invasive carcinoma (BRCA),
are obtained from The Cancer Genome Atlas (TCGA) (https://cancergenome.nih.gov/).
By comprehensive integrated analysis, TCGA research network has identified three sub-
types of CESC [218] and LGG [217], and four subtypes of OV [215] and BRCA [214].
These tumor subtypes have been shown to be clinically relevant and reveal new potential
therapeutic targets for the cancer. The four data sets CESC, LGG, OV, and BRCA consist
of 124, 267, 334, and 398 samples, respectively. All the data sets have four different omic
modalities, namely, gene expression (RNA), DNA methylation (mDNA), microRNA expres-
sion (miRNA), and reverse phase protein array expression (RPPA). These four modalities
are measured on different platforms and represent different biological information. The
details of the data sets, their subtypes, and the pre-processing steps are described in the
Appendix A.
The performance of the proposed method is compared with that of two existing consensus
based approaches, namely, cluster of cluster analysis (COCA) [93], and Bayesian consensus
clustering (BCC) [140], two statistical model based low-rank approaches, namely,
LRAcluster [243], and iCluster [192], and three SVD based low-rank approaches, namely, PCA
on concatenated data (PCA-con) [6], joint and individual variance explained (JIVE) [141]
with rank estimation based on permutation tests (JIVE-Perm) and Bayesian information
criteria (JIVE-BIC), and angle based JIVE, termed as A-JIVE [63]. The details of the
experimental setup and parameter tuning used for the existing algorithms are specified in
the supplementary material of [111]. The performance of different algorithms is evaluated
by comparing their identified subtypes with the clinically established TCGA subtypes us-
ing six external cluster validity indices, namely, clustering accuracy, normalized mutual
information (NMI), adjusted Rand index (ARI), F-measure, Rand index, and purity. The
definitions of these external indices are provided in Appendix B. Experimental results cor-
responding to Jaccard and Dice coefficients are provided in [111]. To study the clinical
implications of the identified subtypes, survival analysis is performed using Cox log-rank
test [96] and Peto and Peto's modification of Gehan-Wilcoxon test [172]. These tests determine
the statistical significance of differences in survival profiles of the identified subtypes.
The source code of the proposed NormS algorithm, written in R language, is available at
https://github.com/Aparajita-K/NormS.
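Two of the external validity indices named above, purity and the Rand index, can be sketched using their standard textbook definitions. This is an illustrative sketch only: the definitions actually used are those of Appendix B and may differ in detail, and the function names and toy labels below are hypothetical.

```python
from collections import Counter
from itertools import combinations


def purity(true_labels, pred_labels):
    """Fraction of samples that belong to the majority true class of their cluster."""
    clusters = {}
    for t, p in zip(true_labels, pred_labels):
        clusters.setdefault(p, []).append(t)
    majority = sum(Counter(members).most_common(1)[0][1] for members in clusters.values())
    return majority / len(true_labels)


def rand_index(true_labels, pred_labels):
    """Fraction of sample pairs on which the two partitions agree."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = sum(
        (true_labels[a] == true_labels[b]) == (pred_labels[a] == pred_labels[b])
        for a, b in pairs
    )
    return agree / len(pairs)


# Toy labels: six samples, one of which is assigned to the wrong cluster.
truth = [0, 0, 0, 1, 1, 1]
found = [0, 0, 1, 1, 1, 1]
toy_purity = purity(truth, found)    # 5 of 6 samples match their cluster's majority class
toy_rand = rand_index(truth, found)  # 10 of 15 pairs are treated consistently
```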

Table 3.1: Relevance and Rank of Each Modality and Modalities Selected by the Proposed
Algorithm

Data Set   Modality   Relevance   Rank   Selected
CESC       mDNA       0.1884817   3      RNA, mDNA, miRNA, RPPA
           RNA        0.2921399   2
           miRNA      0.1990886   5
           RPPA       0.2006048   4
LGG        mDNA       0.4320317   10     mDNA, RPPA
           RNA        0.0289518   0
           miRNA      0.0056958   0
           RPPA       0.2428867   6
OV         mDNA       0.0230986   0      RNA, miRNA, RPPA
           RNA        0.4936741   3
           miRNA      0.2474369   5
           RPPA       0.0579902   2
BRCA       mDNA       0.2373227   5      RNA, mDNA, miRNA, RPPA
           RNA        0.2947759   3
           miRNA      0.1602746   4
           RPPA       0.2464338   6

3.3.2 Illustration of Proposed Algorithm


The proposed algorithm uses multivariate normality to estimate the rank and relevance of
the individual modalities. The rank and relevance of different modalities, as well as the
modalities selected by the proposed algorithm on different data sets are reported in Table
3.1. Table 3.1 shows that the relevance and rank of the modalities vary among the data sets,
and hence different subsets of modalities are selected for different data sets. A modality
having zero rank indicates that its first two principal components are normally distributed,
and the modality contains only the noise component. This automatically eliminates noisy
modalities having zero rank and low relevance values (like RNA and miRNA modalities of
LGG, and mDNA modality of OV) from integrating into the joint subspace. For CESC
data set, initially all the modalities have non-zero ranks and all are considered for joint
subspace construction. However, during integration, a majority of the residual components
from different modalities turn out to be normal with respect to the existing joint subspace.
Hence, they are not integrated into the final subspace, thus performing a second level of
noise removal.
[Figure 3.4 (plots): for each of the first five principal components, a density plot and a normal
Q-Q plot (sample quantiles against theoretical quantiles), annotated with the p-value of the
normality test.
(a) Components of RNA: p-values 2.17e-07, 2.08e-04, 2.03e-01, 9.07e-01, and 2.48e-01 for
components 1 to 5, respectively.
(b) Components of mDNA: p-values 1.79e-05, 5.74e-01, 2.03e-02, 2.50e-01, and 8.88e-01 for
components 1 to 5, respectively.]

Figure 3.4: Density and Q-Q plots for first five principal components of RNA and mDNA
modalities of CESC data set.

The working principle of the proposed algorithm is illustrated using the CESC data set
as an example. Table 3.1 shows that for the CESC data set, the ranks of mDNA, RNA,
miRNA, and RPPA are 3, 2, 5, and 4, respectively. Figures 3.4(a) and 3.4(b) show density
plots, quantile-quantile (Q-Q) plots, and p-values for the first 5 principal components of
the RNA and mDNA modalities, respectively, of the CESC data set. These figures show that the third,
fourth, and fifth components of RNA, and the fourth and fifth components of mDNA are
normally distributed, depicting the random Gaussian noise component of these modalities.
On the other hand, the first two components of RNA in Figure 3.4(a) show deviation from
normality, indicating the presence of clusters. For mDNA, Figure 3.4(b) shows that the
second principal component follows a normal distribution, while both the first and
third components show deviation from normality. Additionally, the remaining components
from 4 onwards are normally distributed. So, the rank of mDNA is estimated to be 3. The
density plots in Figures 3.4(a) and 3.4(b) also show that the first component of both RNA
and mDNA has a bimodal distribution, indicating multiple clusters. According to the
relevance values in Table 3.1, the four modalities of the CESC data set can be ordered as RNA
followed by RPPA, miRNA, and mDNA. Therefore, the joint subspace construction begins
with RNA. Although mDNA is the modality with the lowest relevance, it has the maximum
shared information with RNA, according to the dependency measure. So, mDNA is selected
next for integration. Figure 3.5 shows the density and Q-Q plots of the residuals of mDNA
with respect to the current joint subspace of RNA. The figure shows that the residuals
of the first and second components of mDNA are normally distributed with p-values 0.284
and 0.246, respectively, while the residual of the third component deviates from normality
(p-value is 0.0348). Therefore, only the third principal component of mDNA is integrated into the
joint subspace.
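The rank estimates quoted above can be reproduced from the p-values reported in Figure 3.4 with a short sketch. The stopping rule used here, namely that scanning stops after two consecutive normally distributed components and the rank is the index of the last non-normal component seen, is an assumption inferred from the description of zero-rank modalities and from the mDNA example; it is not the formal definition given earlier in the chapter. The function name `estimate_rank` and its `patience` parameter are hypothetical.

```python
def estimate_rank(p_values, alpha=0.05, patience=2):
    """Index of the last non-normal component before `patience` consecutive normal ones."""
    rank, normal_run = 0, 0
    for index, p in enumerate(p_values, start=1):
        if p < alpha:
            # Component deviates from normality: it counts towards the rank.
            rank, normal_run = index, 0
        else:
            normal_run += 1
            if normal_run == patience:
                break
    return rank


# p-values of the first five principal components, as reported in Figure 3.4
rna_p = [2.17e-07, 2.08e-04, 2.03e-01, 9.07e-01, 2.48e-01]
mdna_p = [1.79e-05, 5.74e-01, 2.03e-02, 2.50e-01, 8.88e-01]

rna_rank = estimate_rank(rna_p)    # components 1 and 2 are non-normal
mdna_rank = estimate_rank(mdna_p)  # component 2 is normal, component 3 is not
```

Under this assumed rule the sketch yields ranks 2 and 3 for RNA and mDNA, matching Table 3.1, and a modality whose first two components are both normal receives rank 0.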

[Figure 3.5 (plots): density and normal Q-Q plots (sample quantiles against theoretical quantiles)
for the residuals of components 1, 2, and 3 of mDNA, with p-values 2.84e-01, 2.46e-01, and
3.48e-02, respectively.]

Figure 3.5: Density and Q-Q plots for the residual components of mDNA for CESC data.

[Figure 3.6 (plots): for each of the first four principal components, a density plot and a normal
Q-Q plot (sample quantiles against theoretical quantiles), annotated with the p-value of the
normality test.
(a) Components of miRNA: p-values 4.25e-02, 4.74e-01, 1.99e-03, and 4.51e-04 for components
1 to 4, respectively.
(b) Components of RPPA: p-values 3.54e-09, 4.77e-03, 1.67e-02, and 7.66e-05 for components
1 to 4, respectively.]

Figure 3.6: Density and Q-Q plots for first four principal components of miRNA and RPPA
for CESC data set.

The modality selected next for integration is miRNA, whose estimated rank is 5. The
density and Q-Q plots for the principal components of miRNA and their residuals with respect
to the current joint subspace are given in Figures 3.6(a) and 3.7(b), respectively. The plots
of the residuals in Figure 3.7(b) show that only the residual of the fourth principal
component of miRNA shows significant divergence from normality, so this component is selected
for integration into the joint subspace. Finally, RPPA is selected for integration, whose
estimated rank is 4. The density and Q-Q plots for the principal components of RPPA and their
residuals are given in Figures 3.6(b) and 3.7(a), respectively. Figure 3.7(a) shows that out of
the top four principal components of RPPA, the residuals of only the first and second components
show deviation from normality. Thus, only these two components of the RPPA modality are
integrated into the joint subspace and the rest are eliminated as noisy ones. This yields
a six dimensional joint subspace for the CESC data set: two components from RNA, one each
from mDNA and miRNA, and two from RPPA.

[Figure 3.7 (plots): for each residual component, a density plot and a normal Q-Q plot (sample
quantiles against theoretical quantiles), annotated with the p-value of the normality test.
(a) Residuals of RPPA: p-values 8.62e-09, 1.90e-03, 1.57e-01, and 5.07e-02 for the residuals of
components 1 to 4, respectively.
(b) Residuals of miRNA: p-values 8.42e-02, 9.87e-01, 9.75e-01, 1.79e-04, and 8.96e-02 for the
residuals of components 1 to 5, respectively.]

Figure 3.7: Density and Q-Q plots for the residual components of the miRNA and RPPA for
CESC data set.

3.3.3 Effectiveness of Proposed Algorithm


This subsection illustrates the importance of rank estimation, relevance and dependency
measures introduced in this work. It also highlights the significance of selecting non-normal
residuals only during data integration.

3.3.3.1 Importance of Relevance


The proposed relevance measure $R_l(X_m)$ estimates the relevance of a modality $X_m$ based
on the distribution of its $r_m$ principal components and the spread of the data along those
components. The relevance measure provides an ordering of the modalities, and the
process of integration starts with the most relevant one. To establish the importance of the
proposed relevance measure and the ordering, the performance of clustering is studied for
three different cases where the process of integration is initiated with the second, third, and
fourth most relevant modalities, keeping all other components of the algorithm fixed. The
starting modality for the three different cases and its comparative performance with the
proposed algorithm are reported in Table 3.2 for different data sets. The rank is estimated
to be 0 for both RNA and miRNA modalities of LGG and the mDNA modality
of OV. Hence, the comparative performance of starting with these three modalities is not
reported in Table 3.2. The results in Table 3.2 show that the proposed algorithm gives
better performance than the cases where integration begins with any modality
other than the most relevant one, for both LGG and OV data sets. For BRCA data set,
when integration is initiated with the most relevant modality, that is RNA, then miRNA
is the least redundant one that is chosen at the next step of integration. Conversely, when
integration is initiated with miRNA, then RNA is the least redundant one that is selected
next. So, for BRCA data set, the performance of the proposed algorithm is the same as the case
where integration begins with miRNA. For the CESC and BRCA data sets, the proposed
algorithm gives the best performance for F-measure compared to the other three cases. For
Rand index, ARI, and NMI, the proposed algorithm gives the second best performance.
However, the best performance for the majority of the indices is obtained in the case where
integration begins with the second most relevant modality, that is RPPA. In brief, the
proposed method of integrating modalities based on their relevance ordering gives better
performance compared to that of some arbitrary ordering in the majority of the cases.

Table 3.2: Importance of Relevance

Data Set   Different Measures   2nd Most    3rd Most    4th Most    Proposed
                                Relevant    Relevant    Relevant    Algorithm
CESC       Starting Modality    RPPA        miRNA       mDNA        RNA
           Rank of Modality     4           5           3           2
           Relevance            0.2006048   0.1990886   0.1884817   0.2921399
           Accuracy             0.8870968   0.5403226   0.7016129   0.8870968
           NMI                  0.7085741   0.1488846   0.2518516   0.6854921
           ARI                  0.7118294   0.1282493   0.2987082   0.7004411
           F-measure            0.8781377   0.5677140   0.6852036   0.8801172
           Rand                 0.8644112   0.5774980   0.6556517   0.8587726
           Purity               0.8870968   0.6129032   0.7016129   0.8870968
LGG        Starting Modality    RPPA        RNA         miRNA       mDNA
           Rank of Modality     6           0           0           10
           Relevance            0.2428867   0.0289518   0.0056958   0.4320317
           Accuracy             0.7228464   -           -           0.7940075
           NMI                  0.4739360   -           -           0.5325030
           ARI                  0.3557794   -           -           0.4649223
           F-measure            0.7236789   -           -           0.7916535
           Rand                 0.6978401   -           -           0.7465292
           Purity               0.7228464   -           -           0.7940075
OV         Starting Modality    miRNA       RPPA        mDNA        RNA
           Rank of Modality     5           2           0           3
           Relevance            0.2474369   0.0579902   0.0230986   0.4936741
           Accuracy             0.5568862   0.5568862   -           0.6976048
           NMI                  0.2717504   0.2717504   -           0.4504552
           ARI                  0.2015805   0.2015805   -           0.4142200
           F-measure            0.5552212   0.5552212   -           0.6910392
           Rand                 0.6949524   0.6949524   -           0.7766269
           Purity               0.5568862   0.5568862   -           0.6976048
BRCA       Starting Modality    RPPA        mDNA        miRNA       RNA
           Rank of Modality     6           5           4           3
           Relevance            0.2464338   0.2373227   0.1602746   0.2947759
           Accuracy             0.7688442   0.7160804   0.7688442   0.7688442
           NMI                  0.5540359   0.4142157   0.5437267   0.5437267
           ARI                  0.5179618   0.4007992   0.5090183   0.5090183
           F-measure            0.7692316   0.7228568   0.7699789   0.7699789
           Rand                 0.8033746   0.7593636   0.7999063   0.7999063
           Purity               0.7688442   0.7185930   0.7688442   0.7688442

3.3.3.2 Importance of Rank Estimation


The proposed method uses normality tests to separately estimate the rank of each
individual modality, independent of the number of clusters in the data set. The estimated
ranks of the individual modalities are given in Table 3.2 for different data sets. Existing
low-rank based approaches like iCluster [192], iCluster2 [191], and iCluster+ [156] use the fixed
relation between the number of clusters $k$ and the $(k-1)$ principal components [52] to
determine the rank of their respective subspaces. To establish the importance of the
proposed method of rank estimation, the performance of clustering in the subspace extracted
by the proposed algorithm is compared with that of the subspace formed by concatenating
$(k-1)$ principal components of each modality. The comparative results reported in Table
3.3 for all the data sets show that the proposed algorithm, with a variable number of selected
components from the individual modalities, gives better performance compared to the case
where fixed $(k-1)$ components are selected from each modality, except for NMI and Rand
index in BRCA data set. This shows that for real life data sets where the clusters are not
well-separated, the strict relation between the number of clusters $k$ and rank $(k-1)$ does
not necessarily hold.
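The fixed-rank baseline used in this comparison can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for Table 3.3: it assumes NumPy, column-centred data, and principal components taken as left singular vectors scaled by their singular values; the function name `fixed_rank_baseline` and the random toy modalities are hypothetical.

```python
import numpy as np


def fixed_rank_baseline(modalities, k):
    """Concatenate the first (k - 1) principal components of every modality."""
    blocks = []
    for X in modalities:
        Xc = X - X.mean(axis=0)  # column-centre each modality (assumed preprocessing)
        U, s, _ = np.linalg.svd(Xc, full_matrices=False)
        blocks.append(U[:, : k - 1] * s[: k - 1])  # component scores U * Sigma
    return np.hstack(blocks)


# Toy data: two modalities measured on the same 20 samples, k = 3 clusters assumed.
rng = np.random.default_rng(0)
toy_modalities = [rng.standard_normal((20, 50)), rng.standard_normal((20, 30))]
Y_fixed = fixed_rank_baseline(toy_modalities, k=3)
```

With two modalities and $k = 3$, the resulting subspace always has $2 \times (3 - 1) = 4$ dimensions regardless of how much cluster structure each modality carries, which is the rigidity that the variable-rank approach avoids.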

3.3.3.3 Significance of Dependency


At each iteration of the data integration, the proposed algorithm selects a modality that
has maximum dependency or shared information with respect to the current joint subspace.
To assess the significance of prioritizing modalities having maximum shared structure, all
the modalities are naively integrated based on their relevance ordering, and the clustering
performance of the resulting subspace is studied. The comparative performance of this
relevance-based subspace (without dependency) and the proposed one is reported in Table
3.3. The results imply that for CESC and BRCA data sets, the proposed approach of
considering dependency or shared structure along with relevance extracts better overall
cluster information compared to relevance alone. For LGG and OV data sets however,
both approaches have the same performance. This is because, for LGG data set, apart from
the most relevant modality, mDNA, RPPA is the only other modality with non-zero rank
and is also the one with maximum shared structure. Similarly, for OV data set, mDNA
is automatically eliminated due to its zero-rank. Amongst the remaining modalities, the
ordering obtained considering relevance and dependency together is identical to the one
obtained based on relevance alone. Hence, both approaches have the same performance for
LGG and OV data sets. Thus, when a large number of modalities having non-zero ranks are
available, considering relevance and dependency together during integration gives better
performance compared to only relevance based integration.

Table 3.3: Importance of Rank Estimation, Dependency Measure, and Selection of
Non-normal Residuals

Data Set  Measure    Fixed Rank $(k - 1)$  Without Dependency  Taking All Residuals  Proposed Algorithm
CESC      Accuracy   0.8387097             0.8790323           0.8387097             0.8870968
CESC      NMI        0.6579328             0.6707970           0.6579328             0.6854921
CESC      ARI        0.6040195             0.6875915           0.6040195             0.7004411
CESC      F-measure  0.8181978             0.8706213           0.8181978             0.8801172
CESC      Rand       0.8080252             0.8526095           0.8080252             0.8587726
CESC      Purity     0.8387097             0.8790323           0.8387097             0.8870968
LGG       Accuracy   0.6479401             0.7940075           0.7228464             0.7940075
LGG       NMI        0.3181501             0.5325030           0.4739360             0.5325030
LGG       ARI        0.2801154             0.4649223           0.3557794             0.4649223
LGG       F-measure  0.6471171             0.7916535           0.7236789             0.7916535
LGG       Rand       0.6512630             0.7465292           0.6978401             0.7465292
LGG       Purity     0.6479401             0.7940075           0.7228464             0.7940075
OV        Accuracy   0.6916168             0.6976048           0.6976048             0.6976048
OV        NMI        0.4431124             0.4504552           0.4504552             0.4504552
OV        ARI        0.4089271             0.4142200           0.4142200             0.4142200
OV        F-measure  0.6834049             0.6910392           0.6910392             0.6910392
OV        Rand       0.7742893             0.7766269           0.7766269             0.7766269
OV        Purity     0.5568862             0.6976048           0.6976048             0.6976048
BRCA      Accuracy   0.7638191             0.7638191           0.7613065             0.7688442
BRCA      NMI        0.5556963             0.5231492           0.5517217             0.5437267
BRCA      ARI        0.5082782             0.4958426           0.5052503             0.5090183
BRCA      F-measure  0.7651760             0.7642758           0.7626970             0.7699789
BRCA      Rand       0.8002101             0.7928053           0.7989950             0.7999063
BRCA      Purity     0.7638191             0.7638191           0.7613065             0.7688442

3.3.3.4 Importance of Selecting Non-normal Residuals


Once the modality with maximum shared information gets selected, the proposed method
examines the distribution of the residuals of the selected modality with respect to the cur-
rent subspace. Out of all the components, only the components whose residuals depict the
presence of cluster structure are integrated with the current subspace. The residuals follow-
ing normal distribution depict noise and are eliminated from the integration. To establish
the importance of integrating only non-normal residual components, a joint subspace is
constructed where all the components of the selected modality are integrated irrespective
of the distribution of their residuals. The comparative performance of this subspace and
the proposed one is studied and reported in Table 3.3. The comparative analysis in Table 3.3
shows that for CESC, LGG, and BRCA data sets, selection of only non-normal residual
components yields a lower-dimensional subspace with better cluster structure compared to
the one which integrates all the residuals. For the OV data set, however, all the residuals at
each iteration of the proposed algorithm show deviance from normality, and hence all the
residuals are selected for integration. Hence, for OV data set, the two subspaces are identical,
giving the same performance. Thus, elimination of components having noisy residuals, as used
in the proposed method, preserves equal or better cluster structure in all the data sets, the
only exception being NMI for the BRCA data set.
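The idea of keeping only components whose residuals deviate from normality can be illustrated with the short sketch below. It is not the test used by the proposed algorithm, which is specified earlier in the chapter and not reproduced here; the Jarque-Bera statistic, the significance level `alpha = 0.05`, the function names, and the synthetic residuals are all assumptions made purely for illustration.

```python
import math
import random

def jarque_bera_pvalue(values):
    """Jarque-Bera normality test p-value (chi-square with 2 degrees of freedom)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    return math.exp(-jb / 2.0)   # survival function of chi-square with 2 dof

def select_non_normal(residual_columns, alpha=0.05):
    """Indices of residual components for which normality is rejected."""
    return [j for j, column in enumerate(residual_columns)
            if jarque_bera_pvalue(column) < alpha]

# Synthetic residuals (assumed): one noise-like, one with two separated groups.
random.seed(1)
noise = [random.gauss(0.0, 1.0) for _ in range(200)]
bimodal = ([random.gauss(-3.0, 0.5) for _ in range(100)]
           + [random.gauss(3.0, 0.5) for _ in range(100)])
selected = select_non_normal([noise, bimodal])
print(selected)
```

A residual that splits into two groups has a strongly non-Gaussian shape and is retained, whereas a residual that looks like Gaussian noise is a candidate for elimination.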

3.3.4 Comparative Performance Analysis


This section compares the performance of the proposed algorithm with that of eight exist-
ing integrative clustering approaches, namely, COCA [93], BCC [140], LRAcluster [243],
iCluster [192], JIVE-Perm and JIVE-BIC [141], A-JIVE [63], and PCA-con [6].

Table 3.4: Comparative Performance Analysis of Proposed and Existing Approaches

Data Set  Algorithm   Rank of Subspace  Accuracy   NMI        ARI        F-measure   Rand       Purity
CESC      COCA        -                 0.6693548  0.4172592  0.3677157  0.6870510   0.6971282  0.6774194
CESC      BCC         -                 0.6895161  0.2854917  0.3144526  0.6795619   0.6687779  0.6935484
CESC      JIVE-Perm   24                0.7177419  0.4425848  0.3860367  0.7097880   0.7164962  0.7177419
CESC      JIVE-BIC    4                 0.8064516  0.5296325  0.5229385  0.8011385   0.7791765  0.8064516
CESC      A-JIVE      48                0.6500000  0.3700238  0.3355826  0.6511586   0.6857724  0.6814516
CESC      iCluster    2                 0.5483871  0.1737526  0.1017765  0.5568753   0.5731707  0.5645161
CESC      LRAcluster  1                 0.8145161  0.5176602  0.5384740  0.8123256   0.7867821  0.8145161
CESC      PCA-con     3                 0.8548387  0.6750978  0.6333073  0.8390298   0.8237608  0.8548387
CESC      NormS       6                 0.8870968  0.6854921  0.7004411  0.8801172   0.8587726  0.8870968
LGG       COCA        -                 0.6591760  0.2772248  0.2533847  0.6608123   0.6454901  0.6591760
LGG       BCC         -                 0.6340824  0.2737596  0.248606   0.63111660  0.6382755  0.6355805
LGG       JIVE-Perm   8                 0.5617978  0.2299551  0.1606599  0.5757978   0.6056715  0.5730337
LGG       JIVE-BIC    8                 0.6741573  0.3441747  0.3050874  0.6679019   0.6642730  0.6741573
LGG       A-JIVE      48                0.7168539  0.4267241  0.3376560  0.7172792   0.6869055  0.7168539
LGG       iCluster    2                 0.4382022  0.1379678  0.0996867  0.5187438   0.5821858  0.5355805
LGG       LRAcluster  2                 0.4719101  0.1240057  0.1030798  0.5137382   0.5831714  0.5280899
LGG       PCA-con     3                 0.6666667  0.3438738  0.3031312  0.6574834   0.6616823  0.6666667
LGG       NormS       14                0.7940075  0.5325030  0.4649223  0.7916535   0.7465292  0.7940075
OV        COCA        -                 0.5943114  0.3131466  0.2810761  0.6068513   0.7039183  0.5943114
OV        BCC         -                 0.4610778  0.1567582  0.1254690  0.4755846   0.6268706  0.4622754
OV        JIVE-Perm   32                0.5718563  0.2629523  0.2027605  0.5653910   0.6885005  0.5718563
OV        A-JIVE      64                0.5191617  0.2124862  0.1981556  0.5111353   0.6942997  0.5221557
OV        iCluster    3                 0.5089820  0.2249889  0.2005886  0.4808256   0.6916078  0.5119760
OV        LRAcluster  2                 0.6287425  0.3745173  0.2999204  0.6384046   0.7322472  0.6287425
OV        PCA-con     4                 0.6946108  0.4424701  0.4068449  0.6868295   0.7734621  0.6946108
OV        Proposed    10                0.6976048  0.4504552  0.4142200  0.6910392   0.7766269  0.6976048
BRCA      COCA        -                 0.7434673  0.5002408  0.4864778  0.7457304   0.7905295  0.7434673
BRCA      BCC         -                 0.6251256  0.3169187  0.3049874  0.6242493   0.7055783  0.6334171
BRCA      JIVE-Perm   12                0.6859296  0.4287142  0.3772649  0.6889363   0.7464906  0.6859296
BRCA      JIVE-BIC    4                 0.6608040  0.4372675  0.3603942  0.6678438   0.7286432  0.6608040
BRCA      A-JIVE      64                0.6140704  0.4482479  0.3710317  0.6707575   0.7363682  0.6841709
BRCA      iCluster    3                 0.7638191  0.5176193  0.4745746  0.7658865   0.7842867  0.7638191
BRCA      LRAcluster  2                 0.7110553  0.4368520  0.4035040  0.7101385   0.7521740  0.7110553
BRCA      PCA-con     4                 0.7587940  0.5506612  0.5038795  0.7601317   0.7984380  0.7587940
BRCA      Proposed    11                0.7688442  0.5437267  0.5090183  0.7699789   0.7999063  0.7688442

3.3.4.1 Cluster Analysis
Table 3.4 compares the performance of clustering on the joint subspace extracted by the
proposed algorithm with that of existing integrative clustering approaches, in terms of the
external cluster evaluation indices. The results in Table 3.4 show that the proposed method
outperforms all the existing algorithms for CESC, LGG, and OV data sets, in terms of all
six external evaluation indices. For BRCA data set, the proposed method gives the best
results for accuracy, ARI, Rand, F-measure, and purity and second-best performance for
NMI. PCA-con gives the best performance for NMI. PCA-con also has the second best
performance in CESC and OV data sets for all the external indices. For LGG, A-JIVE has
the second best performance across all the indices. The JIVE algorithm gives better
performance with BIC based rank estimation compared to permutation test based approach
for both CESC and LGG data sets. For LGG, the joint rank estimated by JIVE is the
same using both BIC and permutation tests. However, the overall performance differs due
to difference in rank of the individual modalities estimated by the two criteria. For the OV
data set, permutation test based JIVE algorithm extracts an 8-dimensional joint structure
for each modality, which are concatenated to form the 32-dimensional final joint structure.
However, BIC based JIVE algorithm estimates the rank of joint structure to be 0, which
implies that the four different modalities do not share any correlated information among
them. The iCluster algorithm uses regularized joint Gaussian latent variable model with
standard lasso penalty to estimate the low-rank subspace. The performance of iCluster
is heavily dependent on the choice of the penalty parameter. The lower performance of
iCluster for all data sets, except BRCA, is attributed to poor model fitting and penalty
parameter tuning. LRAcluster has relatively good performance for CESC, OV, and BRCA
data sets, but for LGG its performance is very poor. This is primarily due to error in esti-
mation of optimal rank, as better performance of LRAcluster is observed for ranks higher
than the optimal one selected by the algorithm. The performance of the Bayesian consensus
based approach, BCC, is relatively poor compared to the low-rank based approaches for
CESC, OV, and BRCA data sets. This is mainly due to the poor estimation of distribution
parameters based on Gibbs sampling approach, used by the algorithm. The results also
show that the proposed approach gives better performance compared to all the existing
low-rank approaches like JIVE, A-JIVE, iCluster, LRAcluster, and PCA-con for all data
sets, with the single exception of NMI on the BRCA data set, where PCA-con is marginally
better. This implies that the proposed method of rank estimation, and selection of modalities
with high relevance and shared information preserve better cluster structure in the joint
subspace compared to the low-rank subspaces extracted by the existing algorithms, which
consider all the available modalities irrespective of their information content.
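Two of the external evaluation indices used in Table 3.4, purity and the Rand index, can be computed from a pair of label vectors as sketched below. The standard textbook definitions are assumed; the function names and the six-sample toy labelling are hypothetical and do not correspond to any data set in this chapter.

```python
from collections import Counter
from itertools import combinations

def purity(true_labels, pred_labels):
    """Fraction of samples that belong to the majority class of their cluster."""
    clusters = {}
    for t, p in zip(true_labels, pred_labels):
        clusters.setdefault(p, []).append(t)
    majority = sum(Counter(members).most_common(1)[0][1]
                   for members in clusters.values())
    return majority / len(true_labels)

def rand_index(true_labels, pred_labels):
    """Fraction of sample pairs on which the two partitions agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        agree += int(same_true == same_pred)
        total += 1
    return agree / total

# Toy labelling (assumed): known subtypes versus clusters found by an algorithm.
truth = [0, 0, 0, 1, 1, 1]
found = [1, 1, 0, 0, 0, 0]
print(purity(truth, found), rand_index(truth, found))
```

Both indices are invariant to how the cluster labels are named, which is why the clusters `found` do not need to be matched to the classes in `truth` beforehand.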

3.3.4.2 Survival Analysis


The log-rank and Wilcoxon test p-values from survival analysis of proposed and existing
approaches are reported in Table 3.5. The survival difference of the proposed subtypes
is compared with that of the previously identified TCGA subtypes. The Kaplan-Meier
survival plots for the subtypes identified by the proposed approach are given in Figure
3.8 for different data sets. The median survival time and change in survival rate of the
subtypes are observed over 2, 5, and 7 years of diagnosis of cancer. Median survival time
is a statistic that refers to how long patients are expected to survive with a disease. The
median survival time for a disease subtype is given by the time period where the Kaplan-Meier
curve for the subtype crosses the survival probability of 0.5, and it is not available for
subtypes whose survival curves end before the survival probability of 0.5 due to low sample
count or presence of censored samples. The observations are reported in Table 3.6. For
OV and BRCA data sets, the p-values from both log-rank and Wilcoxon tests on proposed
subtypes are lower than those of the previously identified TCGA subtypes. This implies
that subtypes identified by the proposed method have larger difference in survival profiles
compared to the already identified subtypes of these data sets. For the LGG data set,
the already identified subtypes have lower p-values as compared to those of the proposed
subtypes. However, the survival difference is statistically significant for both proposed and
TCGA subtypes.

Table 3.5: Survival p-values and Execution Times of Proposed and Existing Approaches

Data Set  Algorithm   Log-Rank p-value  Wilcoxon p-value  Time (in sec)
CESC      COCA        5.563e-02         3.126e-02         6.01
CESC      BCC         5.318e-01         4.572e-01         10.33
CESC      JIVE-Perm   4.074e-02         2.479e-02         575.95
CESC      JIVE-BIC    8.295e-02         8.341e-02         69.08
CESC      A-JIVE      3.463e-01         2.469e-01         251.77
CESC      iCluster    1.448e-01         1.212e-01         1054.89
CESC      LRAcluster  2.404e-01         2.418e-01         9.29
CESC      PCA-con     1.243e-01         9.175e-02         0.23
CESC      Proposed    1.352e-01         1.064e-01         1.09
LGG       COCA        1.166e-04         2.805e-05         18.61
LGG       BCC         3.721e-06         3.434e-07         12.78
LGG       JIVE-Perm   3.736e-04         1.310e-04         622.28
LGG       JIVE-BIC    3.156e-08         5.134e-10         1636.94
LGG       A-JIVE      3.784e-07         1.922e-08         462.65
LGG       iCluster    4.201e-03         7.864e-03         1241.97
LGG       LRAcluster  9.278e-02         1.682e-01         25.09
LGG       PCA-con     3.144e-08         9.196e-10         1.61
LGG       Proposed    2.473e-07         6.000e-09         1.05
OV        COCA        6.042e-02         2.699e-01         25.40
OV        BCC         2.333e-01         3.957e-01         40.38
OV        JIVE-Perm   7.982e-03         8.471e-03         1491.21
OV        JIVE-BIC    -                 -                 -
OV        A-JIVE      1.825e-01         2.489e-01         557.67
OV        iCluster    5.831e-01         6.338e-01         2076.36
OV        LRAcluster  1.583e-01         2.305e-01         15.35
OV        PCA-con     7.744e-02         2.583e-01         1.07
OV        Proposed    4.296e-02         1.516e-01         1.72
BRCA      COCA        1.159e-02         7.210e-03         26.90
BRCA      BCC         5.174e-01         5.433e-01         17.94
BRCA      JIVE-Perm   1.137e-02         1.435e-02         934.13
BRCA      JIVE-BIC    5.314e-01         4.693e-01         734.10
BRCA      A-JIVE      2.358e-01         2.206e-01         761.76
BRCA      iCluster    1.409e-02         4.282e-03         511.87
BRCA      LRAcluster  1.513e-01         2.320e-01         23.53
BRCA      PCA-con     2.765e-02         2.047e-02         1.06
BRCA      Proposed    6.887e-02         5.397e-02         1.47
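The median survival time described in this section can be read off a Kaplan-Meier curve as in the following sketch. It assumes the standard product-limit estimator; the function names and the small follow-up records are hypothetical, and the sketch returns `None` when the curve never reaches 0.5, which corresponds to the entries marked "-" in Table 3.6.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct death time.

    times  : follow-up time of each patient
    events : 1 if death was observed, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve

def median_survival(times, events):
    """First time at which survival drops to 0.5 or below, else None."""
    for t, s in kaplan_meier(times, events):
        if s <= 0.5:
            return t
    return None

# Hypothetical follow-up records (time in years, 1 = death, 0 = censored).
print(median_survival([1, 2, 3, 4, 5, 6], [1, 0, 1, 1, 0, 1]))
print(median_survival([1, 2, 3, 4], [1, 0, 0, 0]))
```

In the second call only one death is observed before the remaining patients are censored, so the estimated curve stays above 0.5 and no median is available.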
For the CESC data set, the p-value of pairwise log-rank test, comparing subtypes 1
and 2, is 0.04706108, which implies a significant difference between their survival profiles.
This is also visible from their distinctly separate survival curves reported in Figure 3.8(a).
However, the p-values from log-rank test are not significant for the other two pairs. Table
3.6 shows that proposed subtype 1 for CESC data set has a median survival time of 5.57
years, and its survival probability drops to 0.547 after 5 years of diagnosis of the cancer.
However, for subtypes 2 and 3, the survival probability is as high as 0.804 and 0.771,
respectively, after 5 years of diagnosis. This implies that subtype 1 shows poor prognosis
compared to subtypes 2 and 3, and its survival probability is also less than 0.5 after 7
years of diagnosis. For the LGG data set, Figure 3.8(b) shows that subtypes 1 and 2
have close and intersecting survival curves, which implies lower difference between their
survival profiles. However, subtype 3 is significantly different from subtypes 1 and 2, as
the p-values from pairwise log-rank test on subtypes 1 and 3, and subtypes 2 and 3 are
4.162e-06 and 4.359e-05, respectively. Moreover, subtype 3 has a very low median survival
time of only 1.66 years, as compared to 6.26 and 6.67 years, respectively, for subtypes 1
and 2. This implies a higher risk for patients belonging to subtype 3. For the OV
data set, Figure 3.8(c) shows that the survival curves of the subtypes are close to each
other, and the survival durations are not significantly different, which is also the case for
the established TCGA subtypes [215]. However, the change in survival rate of the subtypes
over the years shows that subtype 1 of the OV data set has the lowest survival probability
after 7 years of diagnosis compared to the other three subtypes. For BRCA, Figure 3.8(d)
shows that proposed subtypes 1 and 3 have fairly high median survival time. However, for
subtype 4 the survival probability drops sharply from 0.855 at 5 years of diagnosis to 0.376
after 7 years of diagnosis, implying its poor prognosis.

Table 3.6: Survival Analysis of Cancer Subtypes Identified by Proposed Algorithm

                              Survival Probability After
Data Set  Subtype   No. of    2 Years  5 Years  7 Years  Median Survival
                    Samples                              Time (Years)
CESC      Subtype1  32        0.875    0.547    0.410    5.57
CESC      Subtype2  68        0.957    0.804    0.721    -
CESC      Subtype3  24        0.771    0.771    -        -
CESC      p-values: log-rank = 1.352176e-01, Wilcoxon = 1.064875e-01
LGG       Subtype1  139       0.950    0.774    0.430    6.26
LGG       Subtype2  77        0.889    0.709    0.489    6.67
LGG       Subtype3  51        0.343    0.343    0.343    1.66
LGG       p-values: log-rank = 2.473056e-07, Wilcoxon = 6.06793e-09
OV        Subtype1  106       0.748    0.224    0.079    3.37
OV        Subtype2  70        0.814    0.333    0.333    4.25
OV        Subtype3  56        0.809    0.254    0.158    3.69
OV        Subtype4  100       0.790    0.402    0.246    3.98
OV        p-values: log-rank = 4.296905e-02, Wilcoxon = 1.516501e-01
BRCA      Subtype1  84        0.908    0.670    0.670    12.2
BRCA      Subtype2  77        0.922    0.725    0.725    -
BRCA      Subtype3  150       0.988    0.884    0.746    10.8
BRCA      Subtype4  87        0.971    0.855    0.376    6.8
BRCA      p-values: log-rank = 6.887602e-02, Wilcoxon = 5.397618e-02

3.3.4.3 Execution Efficiency


The execution times reported in Table 3.5 show that the proposed approach is computation-
ally much faster than other statistical and SVD based low-rank approaches like iCluster,
JIVE, A-JIVE, and LRAcluster. The reduced computation time of the proposed approach
is due to the iterative update of the joint subspace from the individual subspaces, instead
of solving an SVD from scratch at each step of joint subspace construction. The execution
time of the proposed algorithm is slightly higher than the PCA-con approach for CESC,
OV, and BRCA data sets. The lower execution time of PCA-con is achieved due to direct
PCA on the large integrated data. However, the results using external evaluation indices
show that such naive integration methods fail to capture the true cluster structure of the
data. For model fitting, iCluster and LRAcluster use expectation maximization algorithm,
while JIVE uses alternate optimization. These iterative algorithms have slow convergence
on the high-dimensional multimodal data sets. This leads to huge execution time and poor
scalability of these algorithms.

Figure 3.8: Kaplan-Meier survival plots for proposed subtypes of CESC, LGG, OV, and
BRCA data sets. (Panels: (a) CESC, (b) LGG, (c) OV, (d) BRCA; x-axis: Time (Days);
y-axis: Survival Probability; one curve per subtype.)

3.3.5 Robustness and Stability Analysis


To assess the robustness of the clusters identified by the proposed method to small per-
turbation in the data set, a bootstrap approach is undertaken. For each data set, 1000
bootstrap samples are generated by sampling with replacement from the original data set.
The two-stage pipeline of the proposed joint subspace construction and subsequent clustering
is then performed on each bootstrap sample. The quality of clustering of the bootstrap
samples is assessed using Davies-Bouldin (DB) index [48], which is an internal cluster evaluation
metric. For each bootstrap sample, 1000 different permutations of the cluster labels
are obtained and the DB index is computed for each permuted labelling. Depending on the
mean and standard deviation of the DB index obtained over different permutations, the
Z-score and p-value of the observed DB score are evaluated. The distribution of p-values
obtained from the bootstrap samples is given in Figure 3.9 for the four data sets. The
distributions show that the majority of the p-values lie in the range of 1e-14 to 1e-06 for
the BRCA data set, while for the OV and LGG data sets, the range is between 1e-10
and 1e-05. These p-value ranges imply that the observed DB scores for the BRCA data
set deviate more from the DB scores obtained under random permutation of cluster labels
compared to the LGG and OV data sets. On the other hand, for the CESC data set,
more than 90% of the p-values lie between 1e-06 and 1e-04, indicating the lowest deviation
from random clustering compared to the other three data sets. However, for all four data
sets, more than 99% of the p-values are less than 0.001, which implies that the clusters in
the bootstrap samples show significant deviation from randomly assigned clusters. Thus,
clusters identified in all four data sets are robust against random perturbation of the data
sets.
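The permutation step of this robustness analysis can be sketched for a single bootstrap sample as follows. This is only a sketch under stated assumptions: the DB index follows its standard definition with Euclidean distances, the p-value is taken as the lower tail of a normal distribution at the observed Z-score (lower DB values indicate better clustering), and the function names and synthetic two-cluster data are hypothetical.

```python
import math
import random

def davies_bouldin(points, labels):
    """Davies-Bouldin index; lower values mean compact, well-separated clusters."""
    groups = {}
    for p, l in zip(points, labels):
        groups.setdefault(l, []).append(p)
    centroids = {l: [sum(c) / len(g) for c in zip(*g)] for l, g in groups.items()}
    scatter = {l: sum(math.dist(p, centroids[l]) for p in g) / len(g)
               for l, g in groups.items()}
    keys = list(groups)
    worst = []
    for a in keys:
        ratios = [(scatter[a] + scatter[b]) / math.dist(centroids[a], centroids[b])
                  for b in keys if b != a]
        worst.append(max(ratios))
    return sum(worst) / len(keys)

def permutation_p_value(points, labels, n_perm=1000, seed=0):
    """Z-score and lower-tail p-value of the observed DB index under label permutation."""
    rng = random.Random(seed)
    observed = davies_bouldin(points, labels)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)                      # random reassignment of labels
        null.append(davies_bouldin(points, shuffled))
    mean = sum(null) / n_perm
    std = math.sqrt(sum((v - mean) ** 2 for v in null) / (n_perm - 1))
    z = (observed - mean) / std
    p = 0.5 * math.erfc(-z / math.sqrt(2.0))       # normal lower-tail probability
    return z, p

# Synthetic sample (assumed): two well-separated groups of 30 points each.
rng = random.Random(42)
points = ([(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
          + [(rng.gauss(8, 1), rng.gauss(8, 1)) for _ in range(30)])
labels = [0] * 30 + [1] * 30
observed_db = davies_bouldin(points, labels)
z, p = permutation_p_value(points, labels, n_perm=200)
print(observed_db, z, p)
```

In the analysis reported here this computation is repeated for every bootstrap sample, and the resulting p-values are collected into the distribution of Figure 3.9.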

Figure 3.9: Distribution of p-values obtained from robustness analysis on different data.
(Bar chart with one series each for CESC, LGG, OV, and BRCA; x-axis: p-value, binned as
0 - 1e-14, 1e-14 - 1e-12, 1e-12 - 1e-10, 1e-10 - 1e-08, 1e-08 - 1e-06, 1e-06 - 1e-05,
1e-05 - 5e-05, 5e-05 - 1e-04, 1e-04 - 5e-04, 5e-04 - 0.001, and 0.001 - 0.1;
y-axis: % of Bootstrap Samples, from 0 to 80.)

For any clustering method, it is necessary to analyze whether the patterns identified
by cluster analysis are meaningful or not. Stability means that a meaningful
valid cluster should not disappear if the data set is changed in a “non-essential” way.
Statistically, this means that data sets drawn from the same underlying distribution should
give rise to more or less the same clustering. Hennig [92] proposed a method for cluster-wise
stability analysis where clusters found in a data set are treated as “true” clusters and several
bootstrap replications of the original data set are generated. For each true cluster, the most
similar cluster in the bootstrap samples is identified using Jaccard similarity coefficient [84].
For a cluster, a summary statistic like mean Jaccard coefficient over the bootstrap samples
is a measure of its stability. Jaccard coefficient value lies between 0 and 1, and a higher
value is indicative of better stability. Hennig [92] also suggested other summary statistics
for stability assessment, namely, number of dissolutions and number of good recoveries.
Dissolution refers to those cases for which the Jaccard coefficient value is less than 0.5, and
the cluster is said to be “dissolved” or lost in the bootstrap sample. Lower value of the
number of dissolutions for a cluster indicates that it represents a meaningful pattern which
is not easily lost in the perturbed bootstrap samples. Number of good recoveries measures
the number of times the Jaccard coefficient value is greater than 0.75 in the bootstrap
samples. It indicates how well a cluster can be recovered from the perturbed bootstrap
samples.
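The three summary statistics described above can be computed as in the sketch below. It assumes that each cluster is represented as a set of sample identifiers and that the "true" cluster has already been restricted to the samples present in each bootstrap replicate; the function names and the toy clusterings are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient of two sets of sample identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_stability(true_cluster, bootstrap_clusterings):
    """Stability summary of one cluster over several bootstrap clusterings.

    bootstrap_clusterings : list of clusterings, each a list of sets of sample ids
    """
    # For every bootstrap clustering keep the most similar cluster.
    best = [max(jaccard(true_cluster, c) for c in clustering)
            for clustering in bootstrap_clusterings]
    return {
        "mean_jaccard": sum(best) / len(best),
        "dissolutions": sum(s < 0.5 for s in best),       # cluster lost
        "good_recoveries": sum(s > 0.75 for s in best),   # cluster well recovered
    }

# Toy example (assumed): one true cluster and three bootstrap clusterings.
true_cluster = {1, 2, 3, 4}
bootstrap_clusterings = [
    [{1, 2, 3, 4}, {5, 6}],
    [{1, 2}, {3, 4, 5, 6}],
    [{1, 2, 3}, {4, 5, 6}],
]
stats = cluster_stability(true_cluster, bootstrap_clusterings)
print(stats)
```

The thresholds 0.5 and 0.75 are the ones stated in the text; the mean Jaccard coefficient, the number of dissolutions, and the number of good recoveries are the quantities of the kind reported per subtype in Table 3.7.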

Table 3.7: Stability Analysis of Each Cluster

Data Set  Subtype   Mean Jaccard Coefficient  No. of Dissolutions  No. of Good Recoveries
CESC      Subtype1  0.7479388                 295                  669
CESC      Subtype2  0.7845754                 151                  692
CESC      Subtype3  0.5590394                 356                  215
LGG       Subtype1  0.5396389                 690                  130
LGG       Subtype2  0.4145083                 860                  103
LGG       Subtype3  0.7359946                 12                   230
OV        Subtype1  0.3325994                 685                  0
OV        Subtype2  0.4312147                 137                  0
OV        Subtype3  0.6631375                 11                   352
OV        Subtype4  0.5888738                 8                    144
BRCA      Subtype1  0.7839652                 45                   734
BRCA      Subtype2  0.9477903                 0                    982
BRCA      Subtype3  0.8666537                 7                    913
BRCA      Subtype4  0.8300830                 7                    911

To assess the stability of the clusters identified by the proposed approach, these summary
statistics over 1000 bootstrap samples are reported in Table 3.7. The results in Table
3.7 show that all the identified subtypes of CESC and BRCA show high stability (mean Jaccard
coefficient > 0.5). The subtypes of BRCA data set have maximum stability amongst
all the data sets and subtype 2 shows the highest stability value of 0.9477903 among all
the identified subtypes. The dissolution and recovery values in Table 3.7 also indicate that
subtypes 2, 3, and 4 of BRCA data set are stable meaningful patterns which are dissolved
in fewer than 10 bootstrap samples and could be recovered successfully in more than 900
of the 1000 bootstrap samples. Subtypes 1 and 2 of CESC and subtype 3 of LGG
also have very high stability values of 0.7479388, 0.7845754, and 0.7359946, respectively.
Subtypes 1 and 2 of LGG data set have poor stability and also fewer recoveries compared
to the number of dissolutions. This is also evident from their close and intersecting survival
curves in Figure 3.8(b). Subtypes 1 and 2 of OV have poor stability, but the other two
subtypes 3 and 4 have moderate stability values. In 9 out of 14 cases, the identified clusters
have higher recoveries compared to dissolutions. This implies that most of the identified
subtypes indicate stable and meaningful patterns that are less likely to be dissolved when
the data set is subjected to small perturbations.

3.4 Conclusion
The chapter presents a new algorithm for the extraction of a low-rank joint subspace from
the high-dimensional multimodal data sets. The algorithm uses hypothesis testing to esti-
mate efficiently the rank of each individual modality by separating its signal or structural
component from the noise component. In order to address the major challenge of appro-
priate modality selection during data integration, two modality evaluation measures are
proposed. One evaluates the relevance of a modality in terms of the quality of cluster
structure embedded within it, while the other measures the amount of shared information contained
within the modalities. The modalities with highest relevance and maximum shared
information are selected for integration. Moreover, intersection between two subspaces is
considered to extract only the residual cluster information of different modalities, while
removing the noisy components. Extensive experimental results show that the proposed
method of rank estimation, modality selection, and joint subspace construction provides
better clustering performance as compared to several existing integrative clustering ap-
proaches on several real-life multimodal omics data sets. The results also show that the
subtypes identified by clustering on the extracted joint subspace have close resemblance
with the previously established TCGA subtypes and have statistically significant difference
in survival profiles. Finally, robustness analysis demonstrates that the identified subtypes
indicate stable and meaningful patterns that are robust against small perturbation in the
data set.
One of the major problems in multi-view data analysis is the high dimensional nature
of the modalities. It makes sample clustering computationally expensive. In this regard,
a novel algorithm is proposed in Chapter 4 to construct a low-rank joint subspace of
integrated data from the low-rank subspaces of individual modalities. The problem of
incrementally updating the singular value decomposition of a data matrix is formulated for
the multimodal data framework.

Chapter 4

Selective Update of Relevant Eigenspaces for
Integrative Clustering of Multi-View Data

4.1 Introduction
Integrative genomic data analysis refers to the design of algorithms to combine, infer, and
analyze data from multiple genomic modalities like gene expression, DNA methylation,
copy number variation, etc. Data from a single modality reflects biological patterns and
variations within a specific molecular level. Integrative analysis allows modeling of intrinsic
patterns of the individual modalities or views, and also captures correlated patterns across
multiple modalities. One major objective of integrative genomic data analysis is cancer
subtype discovery. Cancer subtyping provides insight into disease pathogenesis and design
of personalized therapies. Data driven subtype discovery is most popularly achieved by
clustering data from one or more genomic modalities [90]. Several integrative clustering
algorithms exist for cancer subtyping [93, 140, 141, 156, 192, 214, 234, 243, 278]. A brief
survey of existing integrative clustering algorithms is provided in Chapter 2. Clustering
multimodal genomic data has three major challenges.
1. The main challenge is the appropriate selection of modalities that provide relevant
and shared subtype information, over modalities that provide noisy and inconsistent
information.
2. Another challenge is handling the highly heterogeneous nature, in terms of scale,
unit, and variance, of different genomic modalities.
3. The third challenge is that, due to the high dimensional nature of the genomic modalities,
the feature space becomes geometrically sparse, and most of the clustering methods
become computationally expensive and prone to performance degradation [46].
The existing integrative clustering approaches, as mentioned in Chapter 3, do not
address all these challenges together. In general, direct integrative clustering approaches

concatenate data matrices obtained from multiple modalities into a single matrix, which
is used to get the joint clusters. However, the curse of dimensionality gets amplified due
to the concatenation of several modalities. Most of the existing integrative clustering
approaches assume that all the available modalities provide homogeneous and consistent
cluster information; and thus consider all of them for integrative clustering. However, some
modalities may provide disparate or even worse information [278]. Due to the presence of
such noisy modalities, naive integration of information from all the available modalities
can degrade the final cluster structure. During data integration, relevant modalities with
shared cluster information should be chosen, instead of considering noisy and inconsistent
ones. Therefore, one of the important problems in multimodal data clustering is how to
select a subset of relevant modalities.
Another major challenge in clustering high-dimensional data is how to extract a lower
dimensional subspace that best preserves the underlying cluster structure. PCA is an
extensively used dimensionality reduction method for large-scale genomic data sets [5, 6].
It extracts the principal subspace that maximizes the variance along the projected axes
and also minimizes the reconstruction error for any given rank. The principal subspace
can be effectively represented using eigenspaces, which are widely used in various pattern
recognition and image processing applications [32, 157, 184, 185]. Eigenspaces can be com-
puted using SVD which has high computational complexity. This motivates the use of
eigenspace update algorithms to prevent re-computation of eigenspace from scratch every
time new observations are added to the data set. Such strategies include incremental up-
date [32, 85] where eigenspace is updated on addition of every new observation, and block
update [21, 29, 86] where update occurs on addition of new sets of observations. However,
these algorithms have been proposed for a framework where data sets are incrementally
updated with new observations. Eigenspace update model for multimodal data sets, where
new modalities are being added for the same set of samples, has not been proposed to the
best of our knowledge.
In this regard, this chapter introduces a novel algorithm, termed as SURE (Selective
Update of Relevant Eigenspaces), to construct a low-rank joint subspace of the integrated
data. The joint subspace is constructed from the low-rank subspaces of the individual
modalities, such that it best preserves the underlying cluster structure. A theoretical
formulation for updating the eigenspace is introduced for multimodal data sets, where
new modalities are added for the same set of samples. The formulation enables efficient
construction of the joint subspace compared to performing PCA on the concatenated data
matrix. Moreover, the algorithm evaluates the quality of each modality before integrating
it into the joint subspace. This allows the proposed algorithm to select the most relevant
modalities with maximum shared information, and hence addresses the problem of modality
selection. Some new quantitative indices are proposed to measure theoretically the gap
between the joint subspace extracted by the proposed SURE algorithm and the principal
subspace obtained by PCA on the concatenated data. Finally, clustering is performed on
the extracted joint subspace to identify the tumor subtypes. The efficiency of clustering
by the proposed algorithm is extensively studied and compared with existing integrative
clustering approaches on real-life multimodal cancer data sets. Some of the results of this
chapter are reported in [112].
The rest of the chapter is organized as follows: Section 4.2 describes the basics of the
SVD based eigenspace model of a data set. Section 4.3 presents the proposed multimodal
clustering algorithm based on the update of relevant eigenspaces, while Section 4.4 introduces
some quantitative indices proposed in order to theoretically measure the gap between
the full-rank eigenspace and the approximate eigenspace extracted by the proposed algo-
rithm. Section 4.5 presents the experimental results on different multimodal cancer data
sets and comparative performance analysis with existing approaches. Concluding remarks
are provided in Section 4.6.

4.2 SVD Eigenspace Model


The basic assumption of the eigenspace model is that the data follows a multivariate Gaus-
sian distribution. Under this assumption, the eigenspace model of the data set refers to
the statistical description of a set of n observations in d-dimensional space in the form
of a hyper-ellipsoid [85]. The hyper-ellipsoid is centered at the mean of the observations,
and its axes point in directions where spread of the observations is maximized, subject to
orthogonality. The hyper-ellipsoid is flat in the directions where the spread is negligible.
This indicates a lower dimensional embedding of the hyper-ellipsoid considering only the
top few axes along which the spread is significantly high. Eigenspace models can be com-
puted either by eigenvalue decomposition of the covariance matrix of the data or by SVD
of the mean centered data matrix itself. As $n \ll d$ for omics data, computing the large
$d \times d$ covariance matrix is expensive in both space and time. Also, multicollinearity of omic
features often leads to a singular covariance matrix. Hence, the SVD eigenspace model is
used in this work.
Let $X \in \Re^{n \times d}$ be a data matrix of $n$ observations or samples, each having $d$ features,
and $\mathrm{rank}(X) = r$. As stated previously in Section 3.2.1, the SVD of the mean-centered
data matrix $X$ is given by

$$X - \mathbf{1}\mu(X)^T = U(X)\Sigma(X)V(X)^T, \tag{4.1}$$

where $\mu(X) \in \Re^d$ is the mean of the data, $A^T$ denotes the transpose of a matrix $A$, and $\mathbf{1}$
denotes a column vector of length $n$ of all ones. The matrix $U(X)$ contains the $r$ left singular
vectors of $X$ in its columns, which give the $r$-dimensional principal subspace projection
of the $n$ samples of $X$. $\Sigma(X)$ is a diagonal matrix with entries $\mathrm{diag}\{\sigma_1, \ldots, \sigma_i, \ldots, \sigma_r\}$,
where $\sigma_1 \geq \ldots \geq \sigma_i \geq \ldots \geq \sigma_r > 0$. The $\sigma_i$'s are the singular values of $X$, which give the
spread of the projections along the singular vectors in $U(X)$. The matrix $V(X)$ contains the
$r$ right singular vectors of $X$ in its columns, which are the loadings of the $d$ variables of $X$
corresponding to the projections in $U(X)$. The principal components of $X$ are obtained by
multiplying the projections in $U(X)$ with the corresponding spread values in $\Sigma(X)$, given
by $Y = U(X)\Sigma(X)$. The SVD eigenspace of $X$ is given by a four-tuple as follows [86]:

$$\Psi(X) = \langle \mu(X), U(X), \Sigma(X), V(X) \rangle. \tag{4.2}$$

The SVD eigenspace model defined above in (4.2) differs from the principal subspace
model defined in (3.4) of Chapter 3 in the sense that the SVD eigenspace contains two
additional terms: the data mean $\mu(X)$ and the right singular subspace $V(X)$. This is because
the left subspace $U(X)$ and the singular values $\Sigma(X)$ alone are sufficient to extract the principal
components of a data matrix; the other two components are not required for that purpose.
Moreover, in Chapter 3 the joint subspace is constructed by simply concatenating the $U$ and
$\Sigma$ components from the relevant modalities. However, this chapter focuses on entirely
reconstructing the SVD of the integrated data from the SVDs of individual modalities, which
requires contribution from all four components, $U$, $\Sigma$, $V$, and $\mu$. Hence, the SVD eigenspace
is defined as a four-tuple as in (4.2).
Zha et al. [271] showed that the continuous relaxation of the discrete cluster member-
ship indicators in k-means clustering problem is given by the top k principal components.
So, the rank k truncated eigenspace, containing only top k singular vectors and corre-
sponding singular values, sufficiently represents the cluster information of X. The noisy
information, embedded in the remaining $(n - k)$ singular triplets, gets eliminated from the
truncated eigenspace. So, the rank $r$ of the eigenspace is considered to be $k$ in the current
work.
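The rank $k$ eigenspace described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in the thesis: the function name `svd_eigenspace` and the random test matrix are assumptions made for the example.

```python
import numpy as np

def svd_eigenspace(X, k):
    """Return the rank-k SVD eigenspace <mu, U, sigma, V> of an (n x d) matrix X.

    Hypothetical helper for illustration; sigma is returned as a 1-D array
    holding the diagonal of Sigma(X).
    """
    mu = X.mean(axis=0)                                   # data mean, length d
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False) # SVD of centered data, as in (4.1)
    return mu, U[:, :k], s[:k], Vt[:k].T                  # keep the top-k singular triplets

# Toy data with n << d, mimicking the high-dimension low-sample-size setting.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 200))
mu, U, s, V = svd_eigenspace(X, k=3)

# Principal components Y = U Sigma; each column of U is scaled by its singular value.
Y = U * s
```

Here `Y` coincides with the projection of the centered data onto the top-$k$ right singular vectors, which is why only $U$ and $\Sigma$ are needed to extract the principal components.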

4.3 SURE: Proposed Method


This section presents the proposed SURE algorithm to construct a low-rank joint subspace
of the integrated data. Prior to describing the proposed algorithm, theoretical formulation
for the eigenspace update problem is described next.

4.3.1 Eigenspace Updation


Let $X_1, \ldots, X_m, \ldots, X_M$, where $X_m \in \Re^{n \times d_m}$, be $M$ different modalities or views of a
multimodal data set, all measured on the same set of $n$ samples. The $M$ data matrices can
be concatenated to form an integrated data matrix as follows:

$$X = [\, X_1 \; \ldots \; X_m \; \ldots \; X_M \,]. \tag{4.3}$$

The eigenspace of $X$ gives a low-rank representation of the integrated data matrix $X$.
However, computation of the eigenspace of $X$ from scratch involves solving the SVD of $X$,
which is computationally expensive due to the large size of $X$. A theoretical formulation
is developed next for updating the SVD of a multimodal data set. This formulation allows
construction of the eigenspace of $X$ from the low-rank eigenspaces of the individual
modalities.
Let the rank $k$ eigenspace for modality $X_m$ be given by

$$\Psi(X_m) = \langle \mu(X_m), U(X_m), \Sigma(X_m), V(X_m) \rangle. \tag{4.4}$$

Let the data matrix, formed by column-wise concatenation of the first $m$ modalities, be given by

$$\tilde{X}_m = [\, X_1 \; X_2 \; \ldots \; X_m \,].$$

The eigenspace of $X$ is constructed sequentially in $M$ steps by constructing the eigenspace
of $\tilde{X}_m$ at each step for $m = 1, \ldots, M$. Let the eigenspace of $\tilde{X}_m$ obtained at the $m$-th step be
given by

$$\Psi(\tilde{X}_m) = \langle \mu(\tilde{X}_m), U(\tilde{X}_m), \Sigma(\tilde{X}_m), V(\tilde{X}_m) \rangle. \tag{4.5}$$

At the $(m+1)$-th step, the matrix $\tilde{X}_{m+1}$ is formed by concatenation of the matrix $\tilde{X}_m$ of the $m$-th
step and the $(m+1)$-th modality $X_{m+1}$, given by

$$\tilde{X}_{m+1} = [\, \tilde{X}_m \; X_{m+1} \,]. \tag{4.6}$$

Let the eigenspace of $\tilde{X}_{m+1}$ be given by

$$\Psi(\tilde{X}_{m+1}) = \langle \mu(\tilde{X}_{m+1}), U(\tilde{X}_{m+1}), \Sigma(\tilde{X}_{m+1}), V(\tilde{X}_{m+1}) \rangle. \tag{4.7}$$

The idea of eigenspace construction is as follows: at the $(m+1)$-th step, the eigenspace
$\Psi(\tilde{X}_{m+1})$ is not constructed by solving the SVD of the data matrix $\tilde{X}_{m+1}$ from scratch; rather,
it is constructed from the eigenspace $\Psi(\tilde{X}_m)$ obtained at the $m$-th step and the eigenspace
$\Psi(X_{m+1})$ of modality $X_{m+1}$. The initial eigenspace is given by

$$\Psi(\tilde{X}_1) = \Psi(X_1).$$

Let $\oplus$ be the operator denoting the addition of two eigenspaces. The eigenspace at the
$(m+1)$-th step is given by

$$\Psi(\tilde{X}_{m+1}) = \Psi(\tilde{X}_m) \oplus \Psi(X_{m+1}). \tag{4.8}$$

For the integrated data matrix $X$, the final eigenspace is iteratively obtained as follows:

Step 1 : $\Psi(\tilde{X}_1) = \Psi(X_1)$
Step 2 : $\Psi(\tilde{X}_2) = \Psi(\tilde{X}_1) \oplus \Psi(X_2)$
...
Step $(m+1)$ : $\Psi(\tilde{X}_{m+1}) = \Psi(\tilde{X}_m) \oplus \Psi(X_{m+1})$
...
Step $M$ : $\Psi(\tilde{X}_M) = \Psi(\tilde{X}_{M-1}) \oplus \Psi(X_M) = \Psi(X)$.

The operator $\oplus$ in (4.8), for the construction of components of the new eigenspace $\Psi(\tilde{X}_{m+1})$,
is described next.
The relation between a data matrix $X$ and the components of its eigenspace $\Psi(X)$ is
obtained by SVD using (4.1). Applying (4.1) to data matrices $\tilde{X}_m$ and $X_{m+1}$, the following
relations are obtained:

$$\begin{aligned}
\tilde{X}_m - \mathbf{1}\mu(\tilde{X}_m)^T &= U(\tilde{X}_m)\Sigma(\tilde{X}_m)V(\tilde{X}_m)^T; \\
X_{m+1} - \mathbf{1}\mu(X_{m+1})^T &= U(X_{m+1})\Sigma(X_{m+1})V(X_{m+1})^T.
\end{aligned} \tag{4.9}$$

The matrix $\tilde{X}_{m+1}$ is constructed by column-wise concatenation of $X_{m+1}$ to $\tilde{X}_m$. So, the
mean component $\mu(\tilde{X}_{m+1})$ of $\Psi(\tilde{X}_{m+1})$ is also obtained directly by column-wise
concatenation of mean vectors $\mu(\tilde{X}_m)$ and $\mu(X_{m+1})$, that is,

$$\mu(\tilde{X}_{m+1}) = [\, \mu(\tilde{X}_m) \;\; \mu(X_{m+1}) \,]. \tag{4.10}$$

The left singular subspace $U(\tilde{X}_{m+1})$ consists of unit vectors corresponding to the
principal subspace projection of the data in $\tilde{X}_{m+1}$. The matrix $\tilde{X}_{m+1}$ has $\tilde{X}_m$ and $X_{m+1}$ as its
constituent block matrices. Therefore, the left subspace of $\tilde{X}_{m+1}$ must be constructed in
such a way that it contains the information of projection of data from the $m$ modalities in
$\tilde{X}_m$ and also the projection information from the $(m+1)$-th modality $X_{m+1}$. So, both $U(\tilde{X}_m)$
and $U(X_{m+1})$ must be subspaces of $U(\tilde{X}_{m+1})$. The subspace $U(\tilde{X}_{m+1})$ can be obtained by
constructing a basis sufficient to span both the subspaces $U(\tilde{X}_m)$ and $U(X_{m+1})$. $U(\tilde{X}_m)$
is itself a basis for the left subspace of $\tilde{X}_m$. Let $\Gamma$ be the basis for the subspace lying
orthogonal to the left subspace spanned by $U(\tilde{X}_m)$. Therefore, a sufficient basis for $U(\tilde{X}_{m+1})$
can be formed by augmenting the basis $U(\tilde{X}_m)$ with the basis $\Gamma$ of the orthogonal space.

Figure 4.1: (a) Projected and residual components of subspace $U(X_{m+1})$ with respect
to $U(\tilde{X}_m)$; (b) Intersection between $U(X_{m+1})$ and $U(\tilde{X}_m)$ is empty; (c) $U(X_{m+1})$ is a
subspace of $U(\tilde{X}_m)$.

In Figure 4.1(a), the gray plane represents the subspace $U(X_{m+1})$ and the dotted
plane represents the subspace $U(\tilde{X}_m)$. The basis $\Gamma$ has to span the subspace orthogonal
to the dotted plane $U(\tilde{X}_m)$. It is constructed by projecting $U(X_{m+1})$ onto $U(\tilde{X}_m)$ and then
obtaining an orthogonal basis for the residual matrix. The projection is given by

$$\mathcal{I} = U(\tilde{X}_m)^T U(X_{m+1}). \tag{4.11}$$

The component of $U(X_{m+1})$ lying in the subspace spanned by $U(\tilde{X}_m)$ is obtained by
multiplying the projection $\mathcal{I}$ with the corresponding basis $U(\tilde{X}_m)$. The projected component
$P$ is given by

$$P = U(\tilde{X}_m)\,\mathcal{I}. \tag{4.12}$$

Finally, the residual component $Q$ is obtained by subtracting the projected component $P$
from $U(X_{m+1})$ itself, given by

$$Q = U(X_{m+1}) - P. \tag{4.13}$$

The residual $Q$ lies in the subspace orthogonal to the one spanned by $U(\tilde{X}_m)$. In Figure
4.1(a), $P$ denotes the projection of the gray plane $U(X_{m+1})$ onto the dotted plane $U(\tilde{X}_m)$
and the striped plane denotes the residual component $Q$, which is orthogonal to the dotted
plane $U(\tilde{X}_m)$. An orthonormal basis $\Gamma$ for the residual component can be obtained by
Gram-Schmidt orthogonalization of $Q$. However, if the intersection between the subspaces
$U(\tilde{X}_m)$ and $U(X_{m+1})$ is non-empty, the rank of the residual space reduces. So, the rank
of the intersection space is evaluated in order to choose the right number of basis vectors
required for the residual space. This can be evaluated using the following theorem.

Theorem 4.1. Let $\mathcal{A}$ and $\mathcal{B}$ be two subspaces of $\Re^n$. Let the columns of matrices $A \in \Re^{n \times r_1}$
and $B \in \Re^{n \times r_2}$ be orthonormal bases for the subspaces $\mathcal{A}$ and $\mathcal{B}$, respectively. Let the SVD
of $A^T B$ be $U \Sigma V^T$, where $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r)$ and $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r$. Then, the
dimension of the intersection subspace $\mathcal{A} \cap \mathcal{B}$ is $\omega$ iff $\sigma_1 = \sigma_2 = \ldots = \sigma_\omega = 1 > \sigma_{\omega+1}$ [16].
The above theorem states that the number of singular values of $A^T B$ that are equal
to 1 gives the dimension of the intersection subspace of $\mathcal{A}$ and $\mathcal{B}$. For subspaces $U(\tilde{X}_m)$
and $U(X_{m+1})$, the matrices themselves form orthonormal bases. Therefore, according to
Theorem 4.1, the number of singular values of the matrix $\mathcal{I}$ of (4.11) that are equal to 1
gives the dimension of the intersection space. Let $t$ be the number of such singular values
of $\mathcal{I}$ that are equal to 1. Then, the dimension of the residual space is $(r - t)$, where $r$ is
the dimension of the subspace $U(X_{m+1})$. Let $G$ be the orthonormal basis obtained from
Gram-Schmidt orthogonalization of $Q$. If the rank of the residual space is $(r - t)$, then exactly
$t$ column vectors of $G$ would have norm zero. Finding the rank $t$ of the intersection space
through the SVD of $\mathcal{I}$ has complexity of $O(r^3)$. Alternatively, $t$ can be computed by finding
the number of vectors in $G$ having norm zero. For $t > 0$, the $(r - t)$ non-zero vectors of $G$ are
used to form $\Gamma$, which spans the residual space. The following two special cases arise while
considering the intersection between the subspaces.

• Case 1 - Intersection between two subspaces $U(\tilde{X}_m)$ and $U(X_{m+1})$ is empty:
This case arises when $U(X_{m+1})$ lies entirely in the subspace orthogonal to $U(\tilde{X}_m)$,
as shown in Figure 4.1(b). Therefore, when $U(X_{m+1})$ is projected onto $U(\tilde{X}_m)$, the
projection magnitudes are all zeros, that is, $\mathcal{I} = \mathbf{0}$, where $\mathbf{0}$ denotes a zero matrix of
appropriate dimension. Hence, the projected component $P$ in (4.12) is also $\mathbf{0}$. So,
the residual $Q$ in (4.13) is the whole subspace $U(X_{m+1})$, that is, $Q = U(X_{m+1})$,
which is itself an orthonormal basis. Therefore, the basis for the residual space is
$\Gamma = U(X_{m+1})$.
• Case 2 - Subspace $U(X_{m+1})$ is a subspace of $U(\tilde{X}_m)$: This case arises when
$U(X_{m+1})$ is itself a subspace of $U(\tilde{X}_m)$, as shown in Figure 4.1(c) where the subspaces
are parallel to each other. This implies that all the column vectors of $U(X_{m+1})$ can
be expressed as a linear combination of those in $U(\tilde{X}_m)$. So, the projected component
$P$ in (4.12) is $U(X_{m+1})$ itself, and the residual $Q$ in (4.13) is $\mathbf{0}$. Since the residual
space is empty, the basis $U(\tilde{X}_m)$ is sufficient to span both the subspaces $U(\tilde{X}_m)$ and
$U(X_{m+1})$. Therefore, $\Gamma = \emptyset$.
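The construction of the residual basis, including the two special cases, can be illustrated with a small NumPy sketch. The function name `residual_basis` and the tolerance `tol` are assumptions made for this example. In place of the Gram-Schmidt orthogonalization used in the text, the sketch orthonormalizes $Q$ through its SVD, which identifies the same residual space while handling numerically zero directions explicitly.

```python
import numpy as np

def residual_basis(U_joint, U_new, tol=1e-10):
    """Return the projection matrix of (4.11) and an orthonormal basis Gamma
    for the part of span(U_new) that is orthogonal to span(U_joint).

    Hypothetical helper; both inputs are assumed to have orthonormal columns.
    """
    I_proj = U_joint.T @ U_new      # projection, as in (4.11)
    P = U_joint @ I_proj            # projected component, as in (4.12)
    Q = U_new - P                   # residual component, as in (4.13)
    # SVD-based orthonormalization (swapped in for Gram-Schmidt): left singular
    # vectors with non-negligible singular values span the residual space.
    Uq, sq, _ = np.linalg.svd(Q, full_matrices=False)
    rank = int(np.sum(sq > tol))
    return I_proj, Uq[:, :rank]

E = np.eye(6)
U_joint = E[:, :2]

# Case 1: the new subspace is orthogonal to the joint one, so Gamma keeps both directions.
_, G_disjoint = residual_basis(U_joint, E[:, 2:4])

# Case 2: the new subspace lies inside the joint one, so Gamma is empty.
_, G_nested = residual_basis(U_joint, U_joint[:, ::-1])

# General case: a one-dimensional intersection leaves one residual direction.
I_part, G_partial = residual_basis(U_joint, E[:, 1:3])
# t = number of unit singular values of the projection matrix (Theorem 4.1).
t = int(np.sum(np.isclose(np.linalg.svd(I_part, compute_uv=False), 1.0)))
```

In the general case, the number of columns of `G_partial` equals the subspace dimension minus `t`, matching the count given by Theorem 4.1.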

After constructing the appropriate basis $\Gamma$ for the residual space, it is appended to the
basis $U(\tilde{X}_m)$. Thus, $[\, U(\tilde{X}_m) \;\; \Gamma \,]$ spans both the subspaces $U(\tilde{X}_m)$ and $U(X_{m+1})$. This
basis differs from the required basis $U(\tilde{X}_{m+1})$ by a rotation $R(\tilde{X}_{m+1})$. Hence, $U(\tilde{X}_{m+1})$
is obtained as follows:

$$U(\tilde{X}_{m+1}) = [\, U(\tilde{X}_m) \;\; \Gamma \,]\, R(\tilde{X}_{m+1}), \tag{4.14}$$

where $R(\tilde{X}_{m+1})$ is an orthonormal rotation matrix. The $\Sigma(\tilde{X}_{m+1})$ and $V(\tilde{X}_{m+1})$
components of the eigenspace of $\tilde{X}_{m+1}$ and the rotation matrix $R(\tilde{X}_{m+1})$ are computed as
follows. The SVD of $\tilde{X}_{m+1}$ gives the following relation:

$$\tilde{X}_{m+1} - \mathbf{1}\mu(\tilde{X}_{m+1})^T = U(\tilde{X}_{m+1})\Sigma(\tilde{X}_{m+1})V(\tilde{X}_{m+1})^T. \tag{4.15}$$

Substituting $U(\tilde{X}_{m+1})$ from (4.14) in (4.15), we get

$$\tilde{X}_{m+1} - \mathbf{1}\mu(\tilde{X}_{m+1})^T = [\, U(\tilde{X}_m) \;\; \Gamma \,]\, R(\tilde{X}_{m+1})\Sigma(\tilde{X}_{m+1})V(\tilde{X}_{m+1})^T; \tag{4.16}$$

$$\Rightarrow \; R(\tilde{X}_{m+1})\Sigma(\tilde{X}_{m+1})V(\tilde{X}_{m+1})^T = [\, U(\tilde{X}_m) \;\; \Gamma \,]^T \left( \tilde{X}_{m+1} - \mathbf{1}\mu(\tilde{X}_{m+1})^T \right) \tag{4.17}$$

as $U(\tilde{X}_m)$ and $\Gamma$ are orthonormal matrices, and $[\, U(\tilde{X}_m) \;\; \Gamma \,]^T [\, U(\tilde{X}_m) \;\; \Gamma \,] = I_s$, where
$I_s$ is the identity matrix of order $s$.
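Substituting the values of $\tilde{X}_{m+1}$ and $\mu(\tilde{X}_{m+1})$ from (4.6) and (4.10), respectively, in
(4.17), we get

$$\begin{aligned}
R(\tilde{X}_{m+1})\Sigma(\tilde{X}_{m+1})V(\tilde{X}_{m+1})^T
&= [\, U(\tilde{X}_m) \;\; \Gamma \,]^T \left( [\, \tilde{X}_m \;\; X_{m+1} \,] - \mathbf{1}\,[\, \mu(\tilde{X}_m)^T \;\; \mu(X_{m+1})^T \,] \right) \\
&= [\, U(\tilde{X}_m) \;\; \Gamma \,]^T \, [\, \tilde{X}_m - \mathbf{1}\mu(\tilde{X}_m)^T \;\;\; X_{m+1} - \mathbf{1}\mu(X_{m+1})^T \,].
\end{aligned} \tag{4.18}$$

Using (4.9) in (4.18), we get

$$R(\tilde{X}_{m+1})\Sigma(\tilde{X}_{m+1})V(\tilde{X}_{m+1})^T =
\begin{bmatrix}
M_{11}\,\Sigma(\tilde{X}_m)V(\tilde{X}_m)^T & M_{12}\,\Sigma(X_{m+1})V(X_{m+1})^T \\
M_{21}\,\Sigma(\tilde{X}_m)V(\tilde{X}_m)^T & M_{22}\,\Sigma(X_{m+1})V(X_{m+1})^T
\end{bmatrix}; \tag{4.19}$$

where $M_{11} = U(\tilde{X}_m)^T U(\tilde{X}_m) = I_k$; $M_{21} = \Gamma^T U(\tilde{X}_m) = \mathbf{0}$;
$M_{12} = U(\tilde{X}_m)^T U(X_{m+1}) = \mathcal{I}$; $M_{22} = \Gamma^T U(X_{m+1})$.

Substituting the values of $M_{ij}$, $\forall i, j = 1, 2$, in (4.19), we get

$$R(\tilde{X}_{m+1})\Sigma(\tilde{X}_{m+1})V(\tilde{X}_{m+1})^T =
\begin{bmatrix}
I_k\,\Sigma(\tilde{X}_m)V(\tilde{X}_m)^T & \mathcal{I}\,\Sigma(X_{m+1})V(X_{m+1})^T \\
\mathbf{0} & M_{22}\,\Sigma(X_{m+1})V(X_{m+1})^T
\end{bmatrix}. \tag{4.20}$$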

Solving the SVD problem for the matrix of (4.20), the components $R(\tilde{X}_{m+1})$, $\Sigma(\tilde{X}_{m+1})$,
and $V(\tilde{X}_{m+1})$ are obtained. The left subspace $U(\tilde{X}_{m+1})$ is obtained by substituting the
value of $R(\tilde{X}_{m+1})$ in (4.14). Finally, the matrices $U(\tilde{X}_{m+1})$ and $V(\tilde{X}_{m+1})$ are truncated to
store only the top $k$ singular vectors and $\Sigma(\tilde{X}_{m+1})$ is truncated to store the corresponding
$k$ largest singular values in the eigenspace of $\tilde{X}_{m+1}$.
For Case 1, where the intersection between the two left subspaces is empty, substituting the
values $\mathcal{I} = \mathbf{0}$ and $\Gamma = U(X_{m+1})$ in the SVD of (4.20), we get

$$R(\tilde{X}_{m+1})\Sigma(\tilde{X}_{m+1})V(\tilde{X}_{m+1})^T =
\begin{bmatrix}
I_k\,\Sigma(\tilde{X}_m)V(\tilde{X}_m)^T & \mathbf{0} \\
\mathbf{0} & I_k\,\Sigma(X_{m+1})V(X_{m+1})^T
\end{bmatrix}. \tag{4.21}$$

The SVD of (4.21) is a block-diagonal SVD problem whose solution is given by

$$R(\tilde{X}_{m+1}) = I_{2k}; \quad
\Sigma(\tilde{X}_{m+1}) = \begin{bmatrix} \Sigma(\tilde{X}_m) & \mathbf{0} \\ \mathbf{0} & \Sigma(X_{m+1}) \end{bmatrix}; \quad
V(\tilde{X}_{m+1})^T = \begin{bmatrix} V(\tilde{X}_m)^T & \mathbf{0} \\ \mathbf{0} & V(X_{m+1})^T \end{bmatrix}.$$

Substituting $R(\tilde{X}_{m+1})$ in (4.14),

$$U(\tilde{X}_{m+1}) = [\, U(\tilde{X}_m) \;\; U(X_{m+1}) \,].$$

This signifies that for non-intersecting subspaces $U(\tilde{X}_m)$ and $U(X_{m+1})$, the bases for the
joint left and right singular subspaces are formed by the union of the individual bases and
have rank $2k$. In the context of integrative clustering, this implies that the cluster structure
reflected in modality $X_{m+1}$ is completely disparate with respect to the cluster structure
embedded in the joint modality $\tilde{X}_m$. So, incorporation of modality $X_{m+1}$ can introduce totally
inconsistent cluster information into the joint cluster structure embedded in the eigenspace
of $\tilde{X}_m$. Therefore, careful evaluation of a modality is necessary before updating it into the
joint eigenspace.
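The eigenspace addition of (4.8), following (4.10) to (4.20), can be sketched as below. This is an illustrative sketch under stated assumptions rather than the thesis implementation: the names `eigenspace` and `merge` are hypothetical, singular values are stored as 1-D arrays, and the residual is orthonormalized by an SVD instead of Gram-Schmidt. The check at the end uses low-rank toy modalities, for which the individual eigenspaces carry no truncation error, so the merged result can be compared with a direct SVD of the concatenated data.

```python
import numpy as np

def eigenspace(X, k):
    """Rank-k SVD eigenspace <mu, U, sigma, V> of X (hypothetical helper)."""
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, U[:, :k], s[:k], Vt[:k].T

def merge(psi_a, psi_b, k, tol=1e-10):
    """Sketch of the eigenspace addition: joint eigenspace of [X_a X_b]."""
    mu_a, U_a, s_a, V_a = psi_a
    mu_b, U_b, s_b, V_b = psi_b
    mu = np.concatenate([mu_a, mu_b])              # (4.10)
    I_proj = U_a.T @ U_b                           # (4.11)
    Q = U_b - U_a @ I_proj                         # (4.12) and (4.13)
    Uq, sq, _ = np.linalg.svd(Q, full_matrices=False)
    Gamma = Uq[:, sq > tol]                        # orthonormal residual basis
    M22 = Gamma.T @ U_b
    A = s_a[:, None] * V_a.T                       # Sigma_a V_a^T
    B = s_b[:, None] * V_b.T                       # Sigma_b V_b^T
    top = np.hstack([A, I_proj @ B])               # first block row of (4.20)
    bottom = np.hstack([np.zeros((Gamma.shape[1], A.shape[1])), M22 @ B])
    R, s, Vt = np.linalg.svd(np.vstack([top, bottom]), full_matrices=False)
    U = np.hstack([U_a, Gamma]) @ R                # (4.14)
    return mu, U[:, :k], s[:k], Vt[:k].T           # truncate at rank k

# Two toy modalities of rank 2 on the same 30 samples.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 50))
X2 = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 40))
mu, U, s, V = merge(eigenspace(X1, 2), eigenspace(X2, 2), k=4)

# Reference: direct SVD of the centered, concatenated data.
X_cat = np.hstack([X1, X2])
X_centered = X_cat - X_cat.mean(axis=0)
s_direct = np.linalg.svd(X_centered, compute_uv=False)[:4]
```

For real data the individual eigenspaces are truncated, so the merged eigenspace only approximates the direct SVD; Section 4.4 quantifies that gap.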

4.3.2 Evaluation of Individual Modality


This section introduces two modality evaluation measures, namely, relevance and con-
cordance. While relevance assesses the quality of cluster information provided by each
modality, the concordance measures the amount of cluster information shared between two
modalities. Let $X_i \in \Re^{n \times d_i}$ and $X_j \in \Re^{n \times d_j}$ be two modalities of a multimodal data set
whose rank $k$ eigenspaces are given by

$$\Psi(X_i) = \langle \mu(X_i), U(X_i), \Sigma(X_i), V(X_i) \rangle; \tag{4.22}$$

$$\Psi(X_j) = \langle \mu(X_j), U(X_j), \Sigma(X_j), V(X_j) \rangle. \tag{4.23}$$

4.3.2.1 Relevance
The relevance of a modality is defined in terms of the compactness of the cluster structure
embedded in its eigenspace. The compactness is evaluated in the left subspace, which con-
tains principal subspace projection of the samples. The relevance measure is independent
of the difference in scale, unit, and variance of the modalities, as the left subspace of each
modality contains $k$ unit vectors. The compactness of the cluster structure of a modality $X_i$
is given by the percentage of variance explained (PVE) by a partition of its left subspace
$U(X_i)$. Let $C^i = \{C^i_1, \ldots, C^i_j, \ldots, C^i_k\}$ be a partition of the left subspace $U(X_i)$ into $k$
clusters.
The PVE in $U(X_i)$ by partition $C^i$ is given by the ratio of the between-cluster variance in
$C^i$ to the total variance of $U(X_i)$. The total variance is the total sum-of-squared distance
of each sample from the mean, given by

$$T(U(X_i)) = \sum_{p=1}^{n} \| x^i_p - \bar{x}^i \|^2 \tag{4.24}$$

where $x^i_p$ denotes the $p$-th row of $U(X_i)$ and $\bar{x}^i$ is the mean of $U(X_i)$. Since $U(X_i)$ contains
the principal subspace projection of data in $X_i$, the projection values in $U(X_i)$ also have zero
mean. Hence, $\bar{x}^i = 0$. Moreover, the columns of $U(X_i)$ are orthonormal to each other, therefore,

$$T(U(X_i)) = \sum_{p=1}^{n} \| x^i_p \|^2 = \| U(X_i) \|_F^2 = \mathrm{trace}(U(X_i)^T U(X_i)) = \mathrm{trace}(I_k) = k, \tag{4.25}$$

where $\| A \|_F^2$ denotes the squared Frobenius norm of matrix $A$. The within-cluster variance of
partition $C^i$ is the sum-of-squared distance of each data point from its cluster centroid,
given by

$$W_{C^i}(U(X_i)) = \sum_{j=1}^{k} \sum_{x^i_p \in C^i_j} \| x^i_p - m_j \|^2 \tag{4.26}$$

where $m_j$ is the centroid of cluster $C^i_j$. The between-cluster variance in $C^i$ is obtained by
subtracting the within-cluster variance in $C^i$ from the total variance of $U(X_i)$. Thus, the
PVE in $U(X_i)$ by the partition $C^i$ is given by

$$\mathrm{PVE}(U(X_i)) = \frac{T(U(X_i)) - W_{C^i}(U(X_i))}{T(U(X_i))}. \tag{4.27}$$

The relevance of a modality $X_i$ is given by

$$\mathrm{Rel}(X_i) = \mathrm{PVE}(U(X_i)). \tag{4.28}$$

The relevance measure gives a value between 0 and 1, with a higher value indicating
better cluster structure. So, the modality $X_i$ has higher relevance than modality $X_j$ if
$\mathrm{PVE}(U(X_i)) > \mathrm{PVE}(U(X_j))$. The relevance measure gives an ordering of the modalities,
based on the quality of their cluster structures.
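A minimal sketch of the relevance computation in (4.24) to (4.28) is given below. The function name `relevance` is an assumption, and the cluster labels are taken as given; in the proposed method they would come from k-means on the left subspace, which is not run here. The toy matrix has a single zero-mean, unit-norm column so that the total variance equals $k = 1$, as in (4.25).

```python
import numpy as np

def relevance(U, labels):
    """Percentage of variance explained by a partition of the rows of U.

    Hypothetical helper: `labels[p]` is the cluster index of the p-th row of U.
    """
    U = np.asarray(U, dtype=float)
    labels = np.asarray(labels)
    total = np.sum((U - U.mean(axis=0)) ** 2)                    # total variance, (4.24)
    within = 0.0
    for c in np.unique(labels):
        members = U[labels == c]
        within += np.sum((members - members.mean(axis=0)) ** 2)  # within-cluster variance, (4.26)
    return (total - within) / total                              # PVE, (4.27)

# Toy left subspace with k = 1: one orthonormal, zero-mean column.
U_toy = np.array([[-0.5], [-0.5], [0.5], [0.5]])

rel_good = relevance(U_toy, [0, 0, 1, 1])  # clusters match the two groups
rel_bad = relevance(U_toy, [0, 1, 0, 1])   # clusters mix the two groups
```

A partition aligned with the structure in the subspace explains all of the variance, while a partition that mixes the groups explains none of it.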

4.3.2.2 Concordance
The construction of joint eigenspace begins with the most relevant modality, having the
best inherent cluster structure. Updating this eigenspace with a modality having very dis-
cordant cluster structure may degrade the final cluster solution. Therefore, a concordance
measure, based on normalized mutual information (NMI) [68] between the cluster assign-
ments of two modalities, is used to capture the joint cluster information shared between
two modalities. Let $C^i$ and $C^j$ be $k$-partitions of the subspaces $U(X_i)$ and $U(X_j)$,
respectively. The concordance $\mathscr{A}$ between $X_i$ and $X_j$ is given by the NMI between the cluster
solutions $C^i$ and $C^j$:

$$\mathscr{A}(X_i, X_j) = \mathrm{NMI}(C^i, C^j). \tag{4.29}$$

NMI is defined as follows:

$$\mathrm{NMI}(C^i, C^j) = \frac{2\, I(C^i, C^j)}{\left[ H(C^i) + H(C^j) \right]}; \tag{4.30}$$

where $H(C^i)$ is the entropy of $C^i$ and $I(C^i, C^j)$ is the mutual information between $C^i$ and
$C^j$, which are as follows:

$$H(C^i) = - \sum_{p=1}^{k} Pr(C^i_p) \log Pr(C^i_p);$$

$$I(C^i, C^j) = \sum_{p=1}^{k} \sum_{q=1}^{k} Pr(C^i_p \cap C^j_q) \log \left[ \frac{Pr(C^i_p \cap C^j_q)}{Pr(C^i_p)\, Pr(C^j_q)} \right];$$

where $Pr(S)$ denotes the probability of the set $S$. The value of concordance $\mathscr{A}$ lies in
the range $[0, 1]$, with a larger value being indicative of more shared information between
two modalities. While selecting a modality, the average concordance between a candidate
modality and all the previously integrated ones is computed. A candidate modality is
selected for update only if its average concordance is at least some threshold $\tau$.
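The concordance of (4.29) and (4.30) can be sketched directly from two label vectors. The function name `concordance` is an assumption, probabilities are estimated as relative frequencies over the $n$ samples, and the value returned when both partitions are trivial (zero entropy) is a convention chosen for this sketch, not something specified in the text.

```python
import numpy as np

def concordance(labels_i, labels_j):
    """NMI between two cluster assignments of the same n samples (hypothetical helper)."""
    _, inv_i = np.unique(np.asarray(labels_i), return_inverse=True)
    _, inv_j = np.unique(np.asarray(labels_j), return_inverse=True)
    inv_i, inv_j = inv_i.ravel(), inv_j.ravel()
    # Joint distribution Pr(C_p^i and C_q^j), estimated from co-occurrence counts.
    joint = np.zeros((inv_i.max() + 1, inv_j.max() + 1))
    np.add.at(joint, (inv_i, inv_j), 1.0)
    joint /= inv_i.size
    p_i, p_j = joint.sum(axis=1), joint.sum(axis=0)   # marginals
    h_i = -np.sum(p_i * np.log(p_i))                  # entropy H(C^i)
    h_j = -np.sum(p_j * np.log(p_j))                  # entropy H(C^j)
    nz = joint > 0                                    # skip empty cells (0 log 0 = 0)
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(p_i, p_j)[nz]))
    if h_i + h_j == 0.0:
        return 1.0  # assumed convention: two single-cluster partitions agree
    return 2.0 * mi / (h_i + h_j)                     # (4.30)

same = concordance([0, 0, 1, 1], [1, 1, 0, 0])         # identical up to relabeling
independent = concordance([0, 0, 1, 1], [0, 1, 0, 1])  # statistically independent
```

Because NMI depends only on co-occurrence counts, relabeling the clusters of either partition leaves the concordance unchanged.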

4.3.3 Proposed Algorithm


The relevance and concordance measures together help to select relevant modalities during
data integration. The main steps of the proposed SURE algorithm are reported next.
Let $X_1, \ldots, X_m, \ldots, X_M$, where $X_m \in \Re^{n \times d_m}$, be $M$ modalities, and let $S$ be the set of
selected modalities, with $S = \emptyset$ initially. The SURE algorithm, which constructs the joint
eigenspace $\Psi(\tilde{X}_M)$, is given in Algorithm 4.1.
After the construction of $\Psi(\tilde{X}_M)$, the principal components of the integrated data are
obtained by

$$Y = U(\tilde{X}_M)\Sigma(\tilde{X}_M).$$

Finally, the rows of the $(n \times k)$ matrix $Y$ are clustered using the k-means algorithm to obtain the
cancer subtypes.

Algorithm 4.1 Proposed Algorithm: SURE
 1: for $m \leftarrow 1$ to $M$ do in parallel
 2:     Compute eigenspace $\Psi(X_m)$ using SVD of $X_m$.
 3:     Perform k-means on the left subspace $U(X_m)$ of $X_m$.
 4:     Compute relevance $\mathrm{Rel}(X_m)$ using (4.28).
 5: end for
 6: Compute pairwise concordance $\mathscr{A}(X_i, X_j)$, $\forall i \neq j$.
 7: $X_\pi \leftarrow$ modality with maximum relevance.
 8: $m \leftarrow 1$; $S = \{X_\pi\}$.
 9: $\tilde{X}_1 = X_\pi$; Initial eigenspace: $\Psi(\tilde{X}_1) \leftarrow \Psi(X_\pi)$.
10: for $m \leftarrow 1$ to $(M - 1)$ do
11:     for each $X_j$ not added to joint eigenspace $\Psi(\tilde{X}_m)$ do
12:         Compute average concordance of $X_j$ with previously integrated modalities:
            $\bar{\mathscr{A}}(X_j) = \frac{1}{|S|} \sum_{\omega \in S} \mathscr{A}(X_\omega, X_j)$.
13:     end for
14:     $X_l \leftarrow X_j$ with maximum average concordance.
15:     if $\bar{\mathscr{A}}(X_l) \geq \tau$ then
16:         Update $\Psi(\tilde{X}_{m+1}) = \Psi(\tilde{X}_m) \oplus \Psi(X_l)$ as follows:
17:         $\tilde{X}_{m+1} = [\, \tilde{X}_m \;\; X_l \,]$.
18:         $m \leftarrow m + 1$; $S = S \cup \{X_l\}$.
19:         Compute $\mu(\tilde{X}_{m+1})$ using (4.10).
20:         Compute $\mathcal{I}$, $P$, and $Q$ using (4.11), (4.12), and (4.13), respectively.
21:         $G \leftarrow$ Gram-Schmidt orthogonalization of $Q$.
22:         $t \leftarrow$ number of columns of $G$ having norm zero.
23:         $\Gamma \leftarrow$ first $(k - t)$ basis vectors of $G$.
24:         Compute $R(\tilde{X}_{m+1})$, $\Sigma(\tilde{X}_{m+1})$, and $V(\tilde{X}_{m+1})$ using SVD of (4.20).
25:         Compute $U(\tilde{X}_{m+1})$ from (4.14).
26:         Truncate the matrices $U(\tilde{X}_{m+1})$, $\Sigma(\tilde{X}_{m+1})$, and $V(\tilde{X}_{m+1})$ at rank $k$.
27:         $\Psi(\tilde{X}_{m+1}) = \langle \mu(\tilde{X}_{m+1}), U(\tilde{X}_{m+1}), \Sigma(\tilde{X}_{m+1}), V(\tilde{X}_{m+1}) \rangle$.
28:     else
29:         break
30:     end if
31: end for
32: $\Psi(\tilde{X}_M) \leftarrow \Psi(\tilde{X}_m)$.
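The modality selection logic of Algorithm 4.1 (steps 7 to 15, with the eigenspace update itself omitted) can be sketched as follows. The function name `select_modalities` and the toy relevance and concordance values are assumptions made for illustration; the relevance vector and the pairwise concordance matrix are taken as precomputed inputs.

```python
import numpy as np

def select_modalities(rel, conc, tau):
    """Greedy selection order used to build the joint eigenspace (hypothetical helper).

    rel  : length-M relevance values, one per modality.
    conc : (M x M) symmetric matrix of pairwise concordance values.
    tau  : concordance threshold.
    """
    rel = np.asarray(rel, dtype=float)
    conc = np.asarray(conc, dtype=float)
    selected = [int(np.argmax(rel))]                 # start from the most relevant modality
    remaining = [j for j in range(len(rel)) if j != selected[0]]
    while remaining:
        # Average concordance of each candidate with the modalities integrated so far.
        avg = [conc[selected, j].mean() for j in remaining]
        best = int(np.argmax(avg))
        if avg[best] < tau:
            break                                    # no candidate is concordant enough
        selected.append(remaining.pop(best))         # the eigenspace update would happen here
    return selected

rel = [0.6, 0.9, 0.5, 0.4]
conc = [[1.00, 0.70, 0.20, 0.50],
        [0.70, 1.00, 0.10, 0.60],
        [0.20, 0.10, 1.00, 0.15],
        [0.50, 0.60, 0.15, 1.00]]
order = select_modalities(rel, conc, tau=0.3)
```

In this toy setting the third modality is never integrated, because its average concordance with the selected set stays below the threshold.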

4.3.4 Computational Complexity


In the proposed algorithm, for each modality $X_m \in \Re^{n \times d_m}$, an SVD problem of size $(n \times d_m)$
is solved in step 2. Let $d_{max} = \max\{d_m\}$ and $d = \sum_{m=1}^{M} d_m$. The SVD problems on the
individual modalities are independent of each other and can be computed in parallel for
all the modalities. This time complexity is bounded by the time required for the largest
modality, that is, $O(\min\{n d_{max}^2, n^2 d_{max}\}) = O(n^2 d_{max})$, assuming $n < d_{max}$ due to the
high dimension low sample size nature of the data sets. Similarly, performing k-means
on the left subspace $U(X_m)$ of $X_m$ and computation of its relevance $\mathrm{Rel}(X_m)$ from the
clustering solution, in steps 3 and 4, can be done for all the modalities in parallel. The
k-means clustering on the $(n \times k)$ matrix $U(X_m)$ has time complexity of $O(t_{max} n k^2)$, where
$t_{max}$ is the maximum number of iterations the k-means algorithm runs and $k \ll n$.
Computation of $\mathrm{Rel}(X_m)$ takes $O(n)$ time, owing to the computation of within-cluster
variance in $U(X_m)$. Thus, for $M$ modalities, the time complexity of steps 1-5 is bounded
by $O\left(M \left(n^2 d_{max} + t_{max} n k^2 + n\right)\right) = O(M n^2 d_{max})$, considering sequential construction
of eigenspaces for different modalities.


After computation of individual eigenspaces in steps 1-5, concordance $\mathscr{A}$ between every
pair of modalities is computed in step 6. This involves computation of normalized mutual
information, which takes $O(k^2)$ time. Step 7 has time complexity of $O(M)$ to find the
modality with maximum relevance. Steps 8 and 9 are assignment operations which take
$O(1)$ time. For the remaining modalities, the loop in step 10 can execute at most $(M - 1)$
times. On the $m$-th execution of the loop, there are $(M - m)$ candidate modalities for
the eigenspace update. For each candidate modality, its average concordance $\bar{\mathscr{A}}$ with the
formerly updated ones is computed in step 12. This has a complexity of $O(m)$. For $(M - m)$
candidate modalities, the total complexity of steps 11-13 is $O(m(M - m))$. The one with
maximum average concordance is chosen in $O(M - m)$ time. If its average concordance $\bar{\mathscr{A}}$
is at least the threshold $\tau$, then the eigenspace is updated in steps 16-27.
During eigenspace update, steps 17-19 consist of concatenation and union operations
which take at most $O(d_{max})$ time. Step 20 takes $O(nk^2)$ time to compute the matrices
$\mathcal{I}$, $P$, and $Q$. The Gram-Schmidt orthogonalization in step 21 has complexity of
$O(nk^2)$ for the $(n \times k)$ matrix $Q$. To find $t$ in step 22, the norm of the columns of $Q$ is
computed, which takes $O(nk)$ time. Step 24 requires solving the SVD problem of (4.20),
which is of size at most $(2k \times d)$ and has time complexity of $O(k^2 d)$.
$U(\tilde{X}_{m+1})$ in step 25 is computed in $O(nk^2)$ time. Steps 26 and 27 have constant
complexity of $O(1)$. Hence, the total complexity of steps 16-27 for updating the eigenspace
is $O(d_{max} + nk^2 + nk + k^2 d + nk^2) = O(k^2 d)$. Therefore, the time complexity of updating
the eigenspace in the $m$-th iteration of the loop in step 10 is $O(m(M - m) + k^2 d) =
O(k^2 d)$. Step 10 is executed at most $(M - 1)$ times, which gives a total complexity of
$O(M k^2 d)$. The overall computational complexity of the proposed SURE algorithm is
$O(M n^2 d_{max} + M k^2 d) = O(M n^2 d_{max})$, assuming $M, k \ll n < d_{max}$. Thus the time
complexity is bounded by that of individual eigenspace construction in steps 1-5.

4.4 Accuracy of Eigenspace Construction


This section introduces some quantitative indices to measure the gap between the "full-rank"
eigenspace of the integrated data and its approximate eigenspace constructed by the
proposed SURE algorithm. Let $X$ be the integrated data given by (4.3). The full-rank
eigenspace of $X$ contains the full-rank information of all its component modalities and is
constructed by the SVD of $X$ using (4.1). Let the rank $r$ eigenspace of $X$ be given by

$$\Psi(X) = \langle \mu(X), U(X^r), \Sigma(X^r), V(X^r) \rangle. \tag{4.31}$$

The superscript $r$ denotes that the $r$ largest singular values and corresponding singular vectors
are considered in the eigenspace. This full-rank eigenspace representation is also the same as
the principal subspace extracted by PCA on the integrated data $X$. Let $\tilde{\Psi}(X)$ be the
approximate rank $r$ eigenspace of $X$ obtained by the proposed SURE algorithm, that is,

$$\tilde{\Psi}(X) = \bigoplus_{m=1}^{M} \Psi(X_m),$$

where $\Psi(X_m)$ is the rank $r$ eigenspace for modality $X_m$. It is further assumed that all the
$M$ modalities are used during the eigenspace update. Let $\tilde{\Psi}(X)$ be given by

$$\tilde{\Psi}(X) = \langle \mu(X), U(\tilde{X}^r), \Sigma(\tilde{X}^r), V(\tilde{X}^r) \rangle. \tag{4.32}$$

Here, $\tilde{\Psi}(X)$ is an approximate eigenspace of $X$ as it is constructed from truncated rank
$r$ individual eigenspaces. The truncation errors, inherent in individual eigenspaces, get
propagated onto the joint eigenspace during the updating process. This results in a gap between
the full-rank eigenspace $\Psi(X)$ and the approximate eigenspace $\tilde{\Psi}(X)$. However, as $r$ increases, the
truncation errors in the individual eigenspaces reduce and the gap decreases. So, the gap
between two eigenspaces can be computed for different values of rank $r$. For any $r' > r$, an
eigenspace of rank $r'$ has more singular values and vectors in its $\Sigma$, $U$, and $V$ components
than an eigenspace of rank $r$. So, for different values of $r$, the gap is always measured
between a fixed number of singular values and vectors of two eigenspaces.

4.4.1 Error Bound on Principal Sines


The gap between left and right subspaces can be measured using the principal angles
between subspaces (PABS) [16]. PABS generalizes the concept of angle between two lines
to a set of angles between two subspaces, defined next.
Definition 4.1. Let $\mathcal{A}$ and $\mathcal{B}$ be two subspaces of $\Re^n$ of dimension $r_1$ and $r_2$, respectively.
Let $t = \min(r_1, r_2)$. The principal angles between subspaces $\mathcal{A}$ and $\mathcal{B}$ are given by a
sequence of $t$ angles, $\Theta(\mathcal{A}, \mathcal{B}) = [\theta_1, \ldots, \theta_j, \ldots, \theta_t]$, where $0 \leq \theta_1 \leq \ldots \leq \theta_t \leq \pi/2$. The
angle $\theta_j$ is defined by

$$\theta_j = \arccos \left( \max_{a \in \mathcal{A}} \max_{b \in \mathcal{B}} |a^T b| \right);$$

subject to $\|a\| = \|b\| = 1$, $a_i^T a = 0$, $b_i^T b = 0$, for $i = 1, 2, \ldots, j - 1$, where $a_i$ and $b_i$ denote
the vectors attaining $\theta_i$ [16].

The principal sines $\sin(\theta_j)$'s of the angles can be computed using singular values as follows.
Theorem 4.2. Let the columns of matrices $A \in \Re^{n \times r_1}$ and $B \in \Re^{n \times r_2}$ be orthonormal bases
for subspaces $\mathcal{A}$ and $\mathcal{B}$, respectively. Let $[\, A \;\; A^{\perp} \,]$ be a unitary matrix such that the columns
of $A^{\perp}$ span the subspace orthogonal to $\mathcal{A}$. Also, let the singular values of $(A^{\perp})^T B$ be given by
the elements of the diagonal matrix $\Xi = \mathrm{diag}(\nu_1, \ldots, \nu_t)$, where $\nu_1 \geq \ldots \geq \nu_j \geq \ldots \geq \nu_t$.
The principal sine $\sin(\theta_{t+1-j}) = \nu_j$ [106, 116].
Thus, the principal sines between subspaces $\mathcal{A}$ and $\mathcal{B}$ are given by the singular values of
$(A^{\perp})^T B$. The principal sines can be used to define a notion of difference between two
subspaces.

Definition 4.2. Let $\mathcal{A}$ and $\mathcal{B}$ be two subspaces of $\Re^n$. Let the diagonal matrix $\Xi$ contain
the singular values of $(A^{\perp})^T B$ as in Theorem 4.2. The measure of difference between two
subspaces $\mathcal{A}$ and $\mathcal{B}$ is defined by $\sin \Theta(\mathcal{A}, \mathcal{B}) \overset{\mathrm{def}}{=} \Xi$ [202].
The squared Frobenius norm of a matrix, denoted by $\| \cdot \|_F^2$, is the sum of squares of
its singular values. So, using Theorem 4.2 and Definition 4.2, we get

$$\| \sin \Theta(\mathcal{A}, \mathcal{B}) \|_F^2 = \| \Xi \|_F^2 = \sum_{j=1}^{t} \nu_j^2 = \sum_{j=1}^{t} \sin^2(\theta_{t+1-j}). \tag{4.33}$$

Hence, (4.33) implies that the sum of squares of the principal sines between two subspaces
$\mathcal{A}$ and $\mathcal{B}$ is given by $\| \sin \Theta(\mathcal{A}, \mathcal{B}) \|_F^2$.
The gaps between the two left subspaces $U(X^r)$ and $U(\tilde{X}^r)$ and the two right subspaces $V(X^r)$
and $V(\tilde{X}^r)$ are computed using the sum of squares of the principal sines between the two
sets of subspaces. The matrices $U(X^r)$ and $U(\tilde{X}^r)$ are themselves orthonormal bases of
rank $r$ for the corresponding left subspaces. Let the principal angles between subspaces
$U(X^r)$ and $U(\tilde{X}^r)$ be given by $\theta_1, \ldots, \theta_r$ and the singular values of $U(X^{r\perp})^T U(\tilde{X}^r)$ be given
by $\gamma_1, \ldots, \gamma_r$, arranged in decreasing order, where the columns of $U(X^{r\perp})$ span the subspace
orthogonal to the one spanned by $U(X^r)$. Then, following Theorem 4.2 and Definition 4.1, the
sum of squared principal sines between the two left subspaces $U(X^r)$ and $U(\tilde{X}^r)$ is given by

$$\| \sin \Theta(U(X^r), U(\tilde{X}^r)) \|_F^2 = \sum_{i=1}^{r} \gamma_i^2 = \sum_{i=1}^{r} \sin^2(\theta_{r+1-i}).$$

Similarly, for the two right subspaces $V(X^r)$ and $V(\tilde{X}^r)$, let the principal angles between them
be given by $\phi_1, \ldots, \phi_r$ and the singular values of $V(X^{r\perp})^T V(\tilde{X}^r)$ be given by $\omega_1, \ldots, \omega_r$,
arranged in decreasing order, where the columns of $V(X^{r\perp})$ span the subspace orthogonal to
$V(X^r)$. Then, the sum of squared principal sines between the two right subspaces is given by

$$\| \sin \Theta(V(X^r), V(\tilde{X}^r)) \|_F^2 = \sum_{j=1}^{r} \omega_j^2 = \sum_{j=1}^{r} \sin^2(\phi_{r+1-j}).$$

The cumulative gap between the full-rank and approximate pairs of left and right subspaces is
given by the root mean squared principal sines between them, which is given by

$$\mathrm{Gap}_\Theta\left(X^r, \tilde{X}^r\right) = \left[ \frac{1}{2r} \left\{ \| \sin \Theta(U(X^r), U(\tilde{X}^r)) \|_F^2 + \| \sin \Theta(V(X^r), V(\tilde{X}^r)) \|_F^2 \right\} \right]^{\frac{1}{2}}. \tag{4.34}$$

Since the principal angles $\theta_i$'s and $\phi_j$'s lie in $[0, \pi/2]$, $\sin^2 \theta_i$'s and $\sin^2 \phi_j$'s lie in $[0, 1]$ and
$\mathrm{Gap}_\Theta$ also lies in $[0, 1]$. If the approximate left and right subspaces $U(\tilde{X}^r)$ and $V(\tilde{X}^r)$ are
close approximations of the full-rank ones, then $\theta_i$'s and $\phi_j$'s are close to 0. This implies
that a value of $\mathrm{Gap}_\Theta$ close to 0 indicates a better approximation.
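The gap of (4.34) can be sketched as follows. The function names are assumptions, and the bases are assumed to have orthonormal columns. Rather than forming an explicit basis for the orthogonal complement, the sketch uses the projector identity $\|(A^{\perp})^T B\|_F = \|(I - AA^T)B\|_F$, which gives the same sum of squared principal sines.

```python
import numpy as np

def sum_sq_principal_sines(A, B):
    """Sum of squared principal sines between span(A) and span(B).

    Hypothetical helper; A and B are (n x r) matrices with orthonormal columns.
    """
    residual = B - A @ (A.T @ B)   # part of B outside span(A), i.e. (I - A A^T) B
    return float(np.sum(residual ** 2))

def gap_theta(U_full, U_approx, V_full, V_approx):
    """Root mean squared principal sines over left and right subspaces, as in (4.34)."""
    r = U_full.shape[1]
    total = sum_sq_principal_sines(U_full, U_approx) + sum_sq_principal_sines(V_full, V_approx)
    return float(np.sqrt(total / (2 * r)))

E = np.eye(6)
U_a, U_b = E[:, :2], E[:, 2:4]      # two mutually orthogonal 2-D subspaces

gap_identical = gap_theta(U_a, U_a, U_a, U_a)   # all principal angles are 0
gap_orthogonal = gap_theta(U_a, U_b, U_a, U_b)  # all principal angles are pi/2
```

The two extreme cases reproduce the endpoints of the range $[0, 1]$ stated above.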
¯
Next, upper bound on the value of GapΘ Xr , X r r is derived as a function of rank r
of the singular subspaces. Without loss of generality, let us assume that the individual

71
modalities Xm ’s are mean centered and have dimension pn ˆ dm q, where n ď dm . The SVD
of a modality Xm can be partitioned as:

Xm “ U pXm qΣpXm qV pXm qT


„ r q
„ r qT

“ r rK
‰ ΣpXm 0 V pXm
“ U pXm q U pXm q rK q rK qT
0 ΣpXm V pXm
r r r T rK rK rK T
“ U pXm qΣpXm qV pXm q ` U pXm qΣpXm qV pXm q
r rK
“ Xm ` Xm , (4.35)

where $\Sigma(X_m^r) = \mathrm{diag}(\lambda_1^m, \ldots, \lambda_r^m)$ consists of the $r$ largest singular values of $X_m$, and $U(X_m^r)$
and $V(X_m^r)$ contain the corresponding $r$ left and right singular vectors in their columns,
respectively. Similarly, $\Sigma(X_m^{r\perp})$ contains the remaining $(n-r)$ singular values $\lambda_{r+1}^m, \ldots, \lambda_n^m$,
while $U(X_m^{r\perp})$ and $V(X_m^{r\perp})$ contain the corresponding singular vectors. Thus, $X_m^r$ is the
rank $r$ approximation of $X_m$ using the $r$ largest singular triplets, and $X_m^{r\perp}$ is the
approximation using the remaining $(n - r)$ singular triplets. Using (4.35), the integrated data
matrix $X$ in (4.3) can be decomposed as

$$\begin{aligned}
X &= \begin{bmatrix} X_1 & \ldots & X_m & \ldots & X_M \end{bmatrix} \\
&= \begin{bmatrix} \left(X_1^r + X_1^{r\perp}\right) & \ldots & \left(X_m^r + X_m^{r\perp}\right) & \ldots & \left(X_M^r + X_M^{r\perp}\right) \end{bmatrix} \\
&= \begin{bmatrix} X_1^r & \ldots & X_M^r \end{bmatrix} + \begin{bmatrix} X_1^{r\perp} & \ldots & X_M^{r\perp} \end{bmatrix} \\
&= X^r + X^{r\perp}.
\end{aligned} \tag{4.36}$$

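Thus, $X$ is the full-rank integrated data and $X^r$ is its approximation using rank $r$
approximations of the individual modalities. The SVD of $X$ is used to obtain the full-rank
eigenspace $\Psi(X)$ in (4.31). On the other hand, the proposed algorithm constructs
the approximate eigenspace $\Psi(\tilde{X}^r)$ for data matrix $X^r$ by iteratively updating the rank $r$
eigenspaces of the individual modalities $\Psi(X_m)$'s. Let the SVD of $X$ be partitioned as

$$\begin{aligned}
X &= U(X)\Sigma(X)V(X)^T \\
&= \begin{bmatrix} U(X^r) & U(X^{r\perp}) \end{bmatrix}
\begin{bmatrix} \Sigma(X^r) & 0 \\ 0 & \Sigma(X^{r\perp}) \end{bmatrix}
\begin{bmatrix} V(X^r)^T \\ V(X^{r\perp})^T \end{bmatrix}
\end{aligned} \tag{4.37}$$

and the SVD of $X^r$ obtained by eigenspace update be partitioned as

$$\begin{aligned}
X^r &= U(\tilde{X})\Sigma(\tilde{X})V(\tilde{X})^T \\
&= \begin{bmatrix} U(\tilde{X}^r) & U(\tilde{X}^{r\perp}) \end{bmatrix}
\begin{bmatrix} \Sigma(\tilde{X}^r) & 0 \\ 0 & \Sigma(\tilde{X}^{r\perp}) \end{bmatrix}
\begin{bmatrix} V(\tilde{X}^r)^T \\ V(\tilde{X}^{r\perp})^T \end{bmatrix}
\end{aligned} \tag{4.38}$$

where $U(X^r), U(\tilde{X}^r) \in \Re^{n \times r}$, $V(X^r), V(\tilde{X}^r) \in \Re^{d \times r}$, and

$$\Sigma(X^r) = \mathrm{diag}(\sigma_1, \ldots, \sigma_r), \qquad \Sigma(X^{r\perp}) = \mathrm{diag}(\sigma_{r+1}, \ldots, \sigma_n),$$

$$\Sigma(\tilde{X}^r) = \mathrm{diag}(\tilde{\sigma}_1, \ldots, \tilde{\sigma}_r), \qquad \Sigma(\tilde{X}^{r\perp}) = \mathrm{diag}(\tilde{\sigma}_{r+1}, \ldots, \tilde{\sigma}_n).$$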

According to (4.36), $X = X^r + X^{r\perp}$; therefore, using matrix perturbation theory [202], the
integrated data matrix $X$ can be viewed as a perturbation of its rank $r$ approximation $X^r$
due to the presence of the error component $X^{r\perp}$. Next, Wedin's $\sin \Theta$ theorem [242] can be
used to bound the principal angles between the rank $r$ left and right singular subspaces of
a matrix and its perturbation. Let the residuals of left and right subspaces be

$$R_L = X^r V(X^r) - U(X^r)\Sigma(X^r); \quad \text{and} \quad R_R = (X^r)^T U(X^r) - V(X^r)\Sigma(X^r).$$

Let $\delta$ be defined as

$$\delta \overset{\mathrm{def}}{=} \min \left\{ \min_{1 \le i \le r,\; 1 \le j \le (n-r)} |\sigma_i - \tilde{\sigma}_{r+j}|, \; \min_{1 \le i \le r} \sigma_i \right\}.$$

Wedin's $\sin \Theta$ theorem states that if $\delta > 0$, then

$$\sqrt{ \| \sin \Theta(U(X^r), U(\tilde{X}^r)) \|_F^2 + \| \sin \Theta(V(X^r), V(\tilde{X}^r)) \|_F^2 } \le \frac{\sqrt{ \| R_L \|_F^2 + \| R_R \|_F^2 }}{\delta}.$$

$$\text{So,} \quad \mathrm{Gap}\Theta\left(X^r, \tilde{X}^r\right) \le \frac{\sqrt{ \| R_L \|_F^2 + \| R_R \|_F^2 }}{\sqrt{2r}\, \delta}. \tag{4.39}$$
The above relation states that the root mean square of the principal sines between
the full-rank and approximate left and right subspaces is bounded in terms of the Frobenius
norms of the residual matrices $R_L$ and $R_R$, and the minimum difference between full-rank
and approximate sets of singular values, $\delta$.
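As a brief worked step, and assuming only the definition in (4.34) together with Wedin's inequality as stated above, the bound in (4.39) follows by pulling the factor $1/\sqrt{2r}$ out of the root and applying the inequality to what remains:

```latex
\begin{align*}
\mathrm{Gap}\Theta\left(X^r, \tilde{X}^r\right)
  &= \frac{1}{\sqrt{2r}}
     \sqrt{ \| \sin \Theta(U(X^r), U(\tilde{X}^r)) \|_F^2
          + \| \sin \Theta(V(X^r), V(\tilde{X}^r)) \|_F^2 } \\
  &\le \frac{1}{\sqrt{2r}} \cdot
       \frac{\sqrt{ \| R_L \|_F^2 + \| R_R \|_F^2 }}{\delta}
   = \frac{\sqrt{ \| R_L \|_F^2 + \| R_R \|_F^2 }}{\sqrt{2r}\, \delta}.
\end{align*}
```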
As the value of rank $r$ approaches the full rank $n$, the residual component $X^{r\perp} \to 0$ and
$X^r \to X$. Similarly, the components $U(X^r)$, $\Sigma(X^r)$, and $V(X^r)$ also tend towards $U(X)$,
$\Sigma(X)$, and $V(X)$, respectively. So,

$$\begin{aligned}
\lim_{r \to n} R_L &= \lim_{r \to n} \left[ X^r V(X^r) - U(X^r)\Sigma(X^r) \right] \\
&= \lim_{r \to n} \left[ X V(X) - U(X)\Sigma(X) \right] \\
&= \lim_{r \to n} \left[ U(X)\Sigma(X)V(X)^T V(X) - U(X)\Sigma(X) \right] = 0.
\end{aligned}$$

Similarly, $\lim_{r \to n} R_R = 0$. Substituting the limiting values of $R_L$ and $R_R$ in (4.39), we get

$$\lim_{r \to n} \mathrm{Gap}\Theta\left(X^r, \tilde{X}^r\right) = 0.$$

This implies that as the approximation rank r approaches the full rank n, the principal
angles between full-rank and approximate pairs of left and right subspaces reduce to 0.
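
The gap index of (4.34) can be illustrated with a minimal Python sketch. It assumes that the principal angles between the two pairs of subspaces have already been computed; the function name `gap_theta` and the example angles are hypothetical and are not taken from the SURE source code.

```python
import math

def gap_theta(left_angles, right_angles):
    """Root mean squared principal sine over left and right subspaces, as in (4.34).

    left_angles / right_angles: principal angles in radians, each in [0, pi/2],
    between the full-rank and approximate rank-r left / right subspaces.
    """
    r = len(left_angles)
    if r == 0 or len(right_angles) != r:
        raise ValueError("expected r > 0 principal angles on each side")
    sq_left = sum(math.sin(t) ** 2 for t in left_angles)
    sq_right = sum(math.sin(p) ** 2 for p in right_angles)
    return math.sqrt((sq_left + sq_right) / (2 * r))

# Hypothetical rank-2 examples (illustrative angles only).
identical = gap_theta([0.0, 0.0], [0.0, 0.0])                   # subspaces coincide
orthogonal = gap_theta([math.pi / 2] * 2, [math.pi / 2] * 2)    # subspaces orthogonal
mixed = gap_theta([0.0, math.pi / 6], [0.0, math.pi / 2])
```

Coinciding subspaces give a gap of 0 and mutually orthogonal ones give 1, which matches the range $[0, 1]$ stated for the index.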

4.4.2 Accuracy of Singular Triplets
This subsection introduces two more quantitative indices to evaluate the difference between
U , Σ, and V components of full-rank and approximate eigenspaces.

4.4.2.1 Mean Relative Difference of Singular Values


For any rank $r$, both $\Sigma(X^r)$ and $\Sigma(\tilde{X}^r)$ consist of the $r$ largest singular values. The relative
difference between the singular values in $\Sigma(X^r)$ and $\Sigma(\tilde{X}^r)$, with respect to the singular
values of $\Sigma(X^r)$, is given by a sequence $H = [\lambda_1, \ldots, \lambda_i, \ldots, \lambda_r]$, where

$$\lambda_i = \frac{\Sigma(X^r)_i - \Sigma(\tilde{X}^r)_i}{\Sigma(X^r)_i}; \tag{4.40}$$

$\Sigma(.)_i$ is the $i$-th largest singular value of the respective eigenspace. The singular values
capture the spread of the data along the principal axes. The maximum spread, captured
by the singular values in $\Sigma(\tilde{X}^r)$, is bounded by the spread captured by the top $r$ components of
the individual eigenspaces. This is much less than the actual spread of samples in $X$, which
is reflected in $\Sigma(X^r)$. Hence, $\Sigma(X^r)_i \ge \Sigma(\tilde{X}^r)_i$, so the value of $\lambda_i$ lies in $[0, 1]$. A value of
$\lambda_i$ close to 0 indicates less difference between the $i$-th component of the two eigenspaces.
A cumulative measure of the gap between $\Sigma(X^r)$ and $\Sigma(\tilde{X}^r)$ is given by the mean of the first
$h$ values of $H$ as follows:

$$\mathrm{DiffSV}\left(X^r, \tilde{X}^r\right) = \frac{1}{h} \sum_{i=1}^{h} \lambda_i. \tag{4.41}$$

The value of DiffSV also lies in $[0, 1]$, with a value closer to 0 indicating a better
approximation.
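
The computation in (4.40) and (4.41) can be sketched in a few lines of Python. This is a sketch under the assumption that both lists of singular values are sorted in decreasing order; the name `diff_sv` and the numbers are hypothetical.

```python
def diff_sv(full_sv, approx_sv, h):
    """Mean relative difference of the top-h singular values, as in (4.40)-(4.41).

    full_sv, approx_sv: singular values in decreasing order (assumed).
    """
    if h < 1 or h > min(len(full_sv), len(approx_sv)):
        raise ValueError("h must lie between 1 and the number of singular values")
    # Relative difference of each pair, taken with respect to the full-rank value.
    rel = [(s - s_approx) / s for s, s_approx in zip(full_sv[:h], approx_sv[:h])]
    return sum(rel) / h

# Hypothetical singular values, for illustration only.
full = [10.0, 8.0, 5.0, 2.0]
approx = [9.0, 8.0, 4.0, 1.0]
value = diff_sv(full, approx, h=3)   # mean of 0.1, 0.0, and 0.2
```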

4.4.2.2 Relative Dimension of Intersection Space


Let us assume that $r'$ is the dimension of the space lying in the intersection of two left
subspaces $\mathcal{U}(X^r)$ and $\mathcal{U}(\tilde{X}^r)$. According to Theorem 4.1, reported in Section 4.3.1, $r'$ is
the number of singular values of $U(X^r)^T U(\tilde{X}^r)$ having value 1. The relative dimension of
intersection space between two left subspaces is defined as the ratio of the dimension of the
intersection space to that of the left subspace $\mathcal{U}(X^r)$, which is as follows:

$$\mathrm{DimIS}\left(X^r, \tilde{X}^r\right) = \frac{r'}{r}; \tag{4.42}$$

where $r' \le r$. So, the value of DimIS lies in $[0, 1]$. If the overlap between two left subspaces
is high, the dimension of the intersection subspace $r'$ is close to $r$. Thus, a value of DimIS
close to 1 indicates a lower gap between the two left subspaces. Similarly, DimIS between two
right subspaces $V(X^r)$ and $V(\tilde{X}^r)$ can be calculated using the number of singular values
of $V(X^r)^T V(\tilde{X}^r)$ having value 1.
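
A small Python sketch of (4.42) follows. It assumes the singular values of $U(X^r)^T U(\tilde{X}^r)$ are already available; the numerical tolerance `tol` used to decide that a singular value equals 1 is an assumption introduced here for floating-point data and is not part of the definition above.

```python
def dim_is(cosines, r, tol=1e-8):
    """Relative dimension of the intersection space, as in (4.42).

    cosines: singular values of U(X^r)^T U(~X^r), i.e. cosines of principal angles.
    r:       rank of the subspaces being compared.
    tol:     tolerance for treating a singular value as exactly 1 (assumption).
    """
    r_prime = sum(1 for c in cosines if abs(c - 1.0) <= tol)
    return r_prime / r

# Hypothetical singular values for a rank-4 pair of left subspaces.
example = dim_is([1.0, 1.0 - 1e-12, 0.93, 0.4], r=4)   # two values count as 1
```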

4.5 Experimental Results and Discussion
The proposed SURE algorithm is used to extract a low-rank joint subspace of the integrated
data. The clustering performance of the extracted subspace is studied and compared with
several existing integrative clustering approaches. The approaches compared are Bayesian
consensus clustering (BCC) [140], cluster of cluster analysis (COCA) [93], PCA on naively
concatenated data (PCA-Con) [6], joint and individual variance explained (JIVE) [141],
A-JIVE [63], iCluster [192], LRAcluster [243], and NormS [111] (proposed in Chapter 3).
The performance of JIVE is reported considering both permutation test (JIVE-Perm) and
Bayesian information criterion (JIVE-BIC) for rank selection. The experimental setup for
the existing approaches is the same as that of Chapter 3. It is also described in the
supplementary material of [111]. The source code of the proposed SURE algorithm, written
in the R language, is available at https://github.com/Aparajita-K/SURE.
To evaluate the performance of different clustering algorithms, six external cluster eval-
uation indices, namely, accuracy, normalized mutual information (NMI), adjusted Rand
index (ARI), F-measure, Rand index, and purity, are used, which compare the identified
subtypes with the established subtypes. The indices are described in Appendix B. For all
six indices, a value close to one indicates that the identified subtypes have close resemblance
with the previously established ones. Two other performance measures, namely, p-value of
Cox log-rank test [96] and p-value of Peto & Peto’s modification of the Gehan-Wilcoxon
test [172], are also considered to evaluate the significance of the differences in survival
profiles of the identified subtypes.
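
Two of the external indices named above, purity and the Rand index, can be sketched as follows. The code uses the standard textbook definitions, which are assumed (not confirmed here) to match those given in Appendix B; the labels are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def purity(true_labels, pred_labels):
    """Fraction of samples that belong to the majority true class of their cluster."""
    clusters = {}
    for t, p in zip(true_labels, pred_labels):
        clusters.setdefault(p, []).append(t)
    majority = sum(Counter(members).most_common(1)[0][1] for members in clusters.values())
    return majority / len(true_labels)

def rand_index(true_labels, pred_labels):
    """Fraction of sample pairs on which the two partitions agree."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = sum(
        (true_labels[i] == true_labels[j]) == (pred_labels[i] == pred_labels[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical established subtypes and identified clusters for six samples.
truth = ["A", "A", "A", "B", "B", "B"]
found = [0, 0, 1, 1, 1, 1]
p = purity(truth, found)        # 5 of 6 samples sit in their cluster's majority class
ri = rand_index(truth, found)   # 10 of 15 pairs are treated consistently
```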
Multimodal omics data for seven types of cancers, namely, cervical carcinoma (CESC),
glioblastoma multiforme (GBM), lower grade glioma (LGG), lung carcinoma (LUNG),
kidney carcinoma (KIDNEY), ovarian carcinoma (OV), and breast invasive carcinoma
(BRCA), are obtained from TCGA (http://cancergenome.nih.gov/), having 124, 168,
267, 671, 737, 334, and 398 samples, respectively. Through comprehensive integrated analysis,
the TCGA Research Network has identified three molecular subtypes of both CESC [218] and
LGG [217], and four subtypes of OV [215] and BRCA [214]. Four subtypes of GBM were
identified by Verhaak et al. [228]. The samples of the LUNG and KIDNEY data sets are
divided into two and three subtypes, respectively, based on the tissue of origin. The CESC,
LGG, KIDNEY, and LUNG data sets have four different modalities, namely, gene
expression (RNA), DNA methylation (mDNA), miRNA expression (miRNA), and reverse phase
protein array expression (RPPA), while the GBM data set has three modalities, namely,
RNA, miRNA, and copy number variation (CNV). A brief description of these data sets is
provided in Appendix A.

4.5.1 Optimum Value of Concordance Threshold


The threshold parameter τ of the proposed SURE algorithm (in step 15 of Algorithm 4.1)
decides whether a remaining individual eigenspace will be considered for updating the
current joint eigenspace. At each iteration of joint eigenspace construction, the modality
having maximum average concordance Ā, with respect to the pre-selected modalities, is taken
into consideration. The joint eigenspace is updated only if the value of Ā exceeds the
threshold τ. This threshold prevents modalities having low concordance, or shared
information, with the previously updated ones from being integrated into the joint eigenspace.
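
The selection rule described above can be sketched as a greedy loop. This is only an illustration under stated assumptions: the relevance scores and pairwise concordance values are taken as precomputed inputs, the average concordance Ā is assumed to be the plain mean of pairwise values with the already selected modalities, and the function name and all numbers are hypothetical rather than taken from Algorithm 4.1 or the SURE source code.

```python
def select_modalities(relevance, concordance, tau):
    """Greedy modality selection driven by relevance and average concordance.

    relevance:   dict mapping modality name -> relevance score (precomputed).
    concordance: dict mapping frozenset({a, b}) -> pairwise concordance (precomputed).
    tau:         concordance threshold; a candidate is added only if its average
                 concordance with the selected modalities exceeds tau.
    """
    remaining = set(relevance)
    first = max(sorted(remaining), key=lambda m: relevance[m])  # most relevant modality
    selected = [first]
    remaining.remove(first)
    while remaining:
        def avg_conc(m):
            # Assumption: average concordance is the mean over selected modalities.
            return sum(concordance[frozenset((m, s))] for s in selected) / len(selected)
        best = max(sorted(remaining), key=avg_conc)
        if avg_conc(best) <= tau:
            break  # the best candidate fails, so every other candidate fails too
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical scores for three modalities (illustrative values only).
relevance = {"RNA": 0.9, "miRNA": 0.7, "RPPA": 0.5}
concordance = {
    frozenset(("RNA", "miRNA")): 0.8,
    frozenset(("RNA", "RPPA")): 0.3,
    frozenset(("miRNA", "RPPA")): 0.4,
}
subset = select_modalities(relevance, concordance, tau=0.5)
```

With these toy values, a threshold of 0.5 admits miRNA after RNA and then stops, because the average concordance of RPPA with the two selected modalities is 0.35.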

Given M modalities, different subsets of modalities get selected for different values of
threshold τ. For each data set, the value of τ is varied in the range $[0, 0.95]$ at an interval
of 0.05. For each value of threshold τ, the PVE by a k partition of the final joint subspace
is evaluated, which is denoted by PVE$_\tau$. The optimum value $\tau^*$ for each data set is chosen
using the following relation:

$$\tau^* = \arg\max_{\tau} \left\{ \mathrm{PVE}_\tau \right\}. \tag{4.43}$$

It is worth noting that the upper bound for varying τ is 0.95 instead of 1.00. For τ = 1.00,
a candidate modality has to have full concordance or agreement in cluster structure with
all the previously integrated ones. For real-life omics data sets, this is highly unlikely,
and hence no candidate modality will ever get selected for updating the eigenspace. So,
for τ = 1.00, a unimodal solution, consisting of only the most relevant modality, will
always be considered. As integration of multiple modalities can capture the biological
variations across multiple genomic levels, the threshold τ is upper bounded at 0.95 in order
to prefer multiple modalities.
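
The grid search behind (4.43) can be sketched in Python as below. The callable `pve_for_tau` is a hypothetical stand-in for running the full pipeline at a given threshold and returning the PVE of the resulting k partition; the toy curve is invented purely so that the sketch runs.

```python
def choose_tau(pve_for_tau, lo=0.0, hi=0.95, step=0.05):
    """Pick the threshold on a regular grid that maximises PVE, as in (4.43).

    pve_for_tau: callable tau -> PVE of the final joint subspace (hypothetical).
    Returns the best threshold and the grid that was searched.
    """
    n_steps = int(round((hi - lo) / step))
    # Rounding keeps grid points such as 0.15 free of floating-point drift.
    grid = [round(lo + i * step, 2) for i in range(n_steps + 1)]
    scores = {tau: pve_for_tau(tau) for tau in grid}
    best_tau = max(grid, key=lambda t: scores[t])
    return best_tau, grid

# Toy PVE curve peaking at tau = 0.4 (illustrative only, not measured data).
best_tau, grid = choose_tau(lambda tau: 0.8 - (tau - 0.4) ** 2)
```

If several thresholds tie for the maximum, this sketch returns the smallest one; how ties are broken in the actual experiments is not stated in the text.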
[Figure 4.2 plots: three panels (CESC, GBM, LGG), each showing F-Measure and PVE (y-axis: Evaluation Index) against τ (x-axis: 0.1 to 0.9).]

Figure 4.2: Variation of PVE and F-measure for different values of threshold τ for CESC,
GBM, and LGG data sets.

Figure 4.2 shows the variation of F-measure and PVE for different values of τ for CESC,
GBM, and LGG data sets, as examples. From Figure 4.2, it is seen that the values of
F-measure and PVE vary in a similar fashion with the change in τ. The PVE is calculated
based on the generated clusters, while the F-measure is computed based on the ground
truth subtype information. Since these two indices are found to vary similarly, the optimal
value of τ inferred from PVE also gives the optimal value of F-measure, thus giving good
clustering performance. For each data set, the best value of F-measure, obtained from all
possible values of threshold τ, is compared with that obtained for the optimal threshold $\tau^*$.
For all data sets, the best F-measure is exactly the same as the F-measure corresponding to
$\tau^*$.

4.5.2 Accuracy of Subspace Representation


The proposed SURE algorithm constructs the joint subspace of the integrated data from
individual principal subspaces, using an eigenspace update approach. The extracted joint
subspace is an approximation of the principal subspace extracted by PCA on the
integrated data matrix. Three quantitative indices, namely, GapΘ, DiffSV, and DimIS, are
proposed in Section 4.4 to evaluate the gap between full-rank and approximate eigenspaces.

[Figure 4.3 plots: (a) Gap Between Principal Angles (GapΘ), (b) Gap Between Singular Values (DiffSV), (c) Intersection of Left Subspaces (DimIS), (d) Intersection of Right Subspaces (DimIS); x-axis: Fraction of Full Rank; one curve each for CESC, GBM, LGG, LUNG, and KIDNEY.]

Figure 4.3: Different quantitative indices for the evaluation of gap between true and ap-
proximate eigenspaces.

To observe the variation in gap between these two eigenspaces with the increase in rank
parameter r, the three proposed indices are evaluated for different values of rank r. Due
to the high dimension and low sample size nature of the data sets, the full rank of the
integrated data matrix is always bounded by the number of samples. So, for each data set,
the indices are evaluated for different fractions of the full rank of the integrated data. The
value of the h parameter for the DiffSV is set to be 10, which implies that the gap between
singular values is measured between the top 10 components of the two eigenspaces. The
variation of these quantitative indices, with increase in rank, is shown in Figure 4.3 for
different data sets.
While Figure 4.3(a) shows the root mean squared principal sines between the left and
right subspaces of full-rank and approximate eigenspaces, Figure 4.3(b) shows the difference
between their singular values. Figure 4.3(b) shows that the difference between singular
values decreases monotonically to 0 with the increase in rank, for all data sets. Figure
4.3(a) shows that the difference between the singular subspaces, in terms of their principal
sines, also converges to 0. However, unlike the singular values in Figure 4.3(b), the gap
between the singular subspaces does not decrease monotonically. For some of the
smaller values of rank r, the gap increases between two consecutive values. This
is due to the fact that, for a given value of r, there can be infinitely many rank r subspaces
of an n-dimensional vector space. For smaller values of r, the rank r singular subspaces
of individual modalities can be very different from each other due to the large number
of possibilities. Consequently, the approximate singular subspace, constructed from these
individual subspaces, tends to vary a lot from the full-rank subspace. However, as r
approaches the full rank n, the number of possible subspaces reduces and the difference
between them converges to 0.
Figure 4.3(c) shows that the intersection between two left subspaces increases gradually
and uniformly with the increase in rank r. But, for the right subspaces, as seen in Figure
4.3(d), the intersection remains almost 0 for all data sets until the rank considered
for the eigenspace is more than 70% of the full rank. This implies that the gap between
the pair of right subspaces is larger than that between the left ones. This difference in
gap arises because the right subspaces consist of loadings from different sets of variables
in different modalities, while the left subspaces consist of the projections of the same set of
samples across all the modalities. The disjointness of variables in the right subspaces leads
to a larger gap between the pair of right subspaces.

4.5.3 Execution Efficiency of SURE


One major advantage of the proposed algorithm is that it extracts the principal subspace of
the integrated data matrix by iteratively updating the principal subspaces of the individual
modalities, and its time complexity is $O(n^2 d_{max})$. On the other hand, the time complexity
of performing PCA on the integrated data matrix using eigenvalue decomposition (EVD)
of the covariance matrix is $O(d^3)$, while that using SVD of the mean-centered data matrix is
$O(n^2 d)$, where $n \ll d_{max} \ll d$. This makes the proposed algorithm particularly efficient
for PCA based dimensionality reduction of large multimodal data sets. Figure 4.4 compares
the execution time of the proposed SURE algorithm with that for extracting the principal
components using EVD and SVD for LGG, LUNG, and KIDNEY data sets. The RNA and
mDNA modalities have a large number of features, 20,502 and 25,978, respectively.
The variation in execution time for extracting top k principal components using these
three algorithms is observed by gradually increasing the number of features from RNA and
mDNA modalities. The plots in Figure 4.4(a) - (c) show that the execution time of PCA
computed using EVD increases quadratically relative to that of the proposed SURE approach.
This is because PCA using EVD takes $O(d^3)$ time, which is significantly higher compared
to $O(n^2 d_{max})$. Figure 4.4(d) - (f) show that the execution time of PCA using SVD as
well as of the proposed SURE algorithm increases linearly with increase in number of
features. However, SURE takes significantly less time to extract the principal components
as compared to PCA using SVD, especially for large data sets like LUNG and KIDNEY
with 671 and 737 samples, respectively.
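
To give a feel for the three growth rates quoted above, the following Python sketch evaluates the leading terms for the LGG sample size and the RNA and mDNA feature counts mentioned in the text. It ignores constant factors and lower-order terms, and it assumes for illustration that d is the sum of the RNA and mDNA feature counts only, so the numbers indicate orders of magnitude rather than predicted run times.

```python
def leading_terms(n, d_max, d):
    """Leading-order operation counts for the three approaches (constants ignored)."""
    return {
        "SURE": n ** 2 * d_max,    # O(n^2 d_max)
        "PCA-SVD": n ** 2 * d,     # O(n^2 d)
        "PCA-EVD": d ** 3,         # O(d^3)
    }

# n = 267 LGG samples; d_max = 25,978 mDNA features; d = RNA + mDNA features
# (an assumption that leaves out the smaller miRNA and RPPA modalities).
counts = leading_terms(n=267, d_max=25978, d=20502 + 25978)
svd_over_sure = counts["PCA-SVD"] / counts["SURE"]   # equals d / d_max
evd_over_sure = counts["PCA-EVD"] / counts["SURE"]
```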

4.5.4 Importance of Data Integration and Modality Selection


To establish the importance of data integration, the clustering performance on top k
principal components of individual modalities is compared with that of the rank k joint subspace
extracted by the proposed algorithm. There can be a total of $(2^M - M - 1)$ possible
combinations of two or more modalities from M modalities. Each multimodal combination gives
a different clustering solution. Therefore, the clustering performance of the top k
principal components of each multimodal combination is evaluated using Silhouette index [179].
[Figure 4.4 plots: panels (a)-(c) compare PCA (Using EVD) with SURE and panels (d)-(f) compare PCA (Using SVD) with SURE, on LGG, LUNG, and KIDNEY respectively; x-axis: Number of Features (5000 to 45000); y-axis: Execution Time (sec).]

Figure 4.4: Comparison of execution time for PCA computed using EVD (top row) and
SVD (bottom row) and the proposed SURE approach on LGG, LUNG, and KIDNEY data
sets.

The best combination is chosen to be the one with the maximum value of the Silhouette index.
To evaluate the strength of the proposed SURE algorithm in selecting an appropriate subset
of modalities, its performance is compared with that of PCA on the best combination of
modalities, henceforth termed PCA_Combine (listed as PCA_subset in Table 4.1). The
comparative performance of the individual modalities, the best multimodal combination, and
the proposed SURE approach is reported in Table 4.1 for CESC, GBM, LGG, LUNG, and KIDNEY
data sets, as examples.
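
The exhaustive search over multimodal combinations can be sketched with the Python standard library. The scoring callable is a hypothetical stand-in for clustering the top k principal components of a combination and computing its Silhouette index; the toy scores are invented so that the sketch runs.

```python
from itertools import combinations

def multimodal_combinations(modalities):
    """All subsets containing two or more modalities: 2^M - M - 1 of them."""
    combos = []
    for size in range(2, len(modalities) + 1):
        combos.extend(combinations(modalities, size))
    return combos

def best_combination(modalities, score):
    """Combination with the highest score (e.g. Silhouette index of its clustering)."""
    return max(multimodal_combinations(modalities), key=score)

modalities = ["RNA", "mDNA", "miRNA", "RPPA"]
combos = multimodal_combinations(modalities)   # 11 combinations for M = 4

# Toy scores (not real Silhouette values); unlisted combinations score 0.
toy_scores = {frozenset(("RNA", "miRNA")): 0.42, frozenset(("RNA", "mDNA")): 0.31}
best = best_combination(modalities, lambda c: toy_scores.get(frozenset(c), 0.0))
```

For M = 4 this enumerates 6 pairs, 4 triples, and 1 quadruple, which agrees with $2^4 - 4 - 1 = 11$.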
Table 4.1 shows that the joint subspace extracted by the SURE algorithm gives better
performance compared to all the unimodal solutions for four data sets, namely, CESC,
GBM, LUNG, and KIDNEY, in terms of all six external evaluation indices. This establishes
the significance of integrative analysis over unimodal analysis. For the LGG data set,
mDNA gives the best performance among all possible unimodal and multimodal
combinations. The SURE algorithm also chooses only mDNA to construct the final
eigenspace. For GBM and LUNG data sets, the modalities selected by the SURE algorithm
are the same as the best combination of modalities obtained for PCA. The combination differs
for CESC and KIDNEY data sets; however, the performance of SURE is always better
than that of PCA_Combine. This is due to the fact that the individual eigenspaces
in the proposed algorithm are truncated at rank k, thus filtering out the noisy
information present in them. The joint subspace constructed from these informative truncated
eigenspaces preserves better cluster structure compared to PCA_Combine, which considers
the complete information of each eigenspace. The results in Table 4.1 also show that the
performance of SURE is at least as good as that of PCA on the best combination of modalities
for all data sets. This establishes that the proposed SURE approach is able to select the
best subset of modalities among all possible $(2^M - 1)$ combinations.

Table 4.1: Comparative Performance Analysis of Individual Modalities, PCA Combinations, and SURE

Data Set  Modality/Algorithm  Accuracy   NMI        ARI        F-Measure  Rand       Purity
CESC      mDNA                0.5241935  0.2420431  0.1175554  0.5453798  0.5819565  0.5806452
CESC      RNA                 0.8467742  0.6242327  0.6168352  0.8310850  0.8164175  0.8467742
CESC      miRNA               0.5564516  0.1589298  0.1512301  0.5697384  0.6087071  0.5887097
CESC      RPPA                0.5000000  0.0803494  0.0917321  0.5166786  0.5847102  0.5322581
CESC      PCA_subset          0.8145161  0.5868124  0.5579264  0.7882956  0.7844217  0.8145161
CESC      SURE                0.8629032  0.6461946  0.6507274  0.8512028  0.833989   0.8629032
          Best PCA subset: RNA, miRNA, RPPA
          Subset selected by SURE: RNA, miRNA
GBM       RNA                 0.7619048  0.5636125  0.4870354  0.7775749  0.8029655  0.7619048
GBM       miRNA               0.6071429  0.3636915  0.3329748  0.6408620  0.7343171  0.6547619
GBM       CNV                 0.4166667  0.1207564  0.1061846  0.4678243  0.5688623  0.4464286
GBM       PCA_subset          0.7916667  0.5729951  0.5441936  0.8072570  0.8244226  0.7916667
GBM       SURE                0.797619   0.5815764  0.5588514  0.8120413  0.8300542  0.7976190
          Best PCA subset: RNA, miRNA, CNV
          Subset selected by SURE: RNA, miRNA, CNV
LGG       mDNA                0.7940075  0.5335888  0.4668931  0.7904750  0.7465292  0.7940075
LGG       RNA                 0.659176   0.2782794  0.2558892  0.6600498  0.6461660  0.6591760
LGG       miRNA               0.4007491  0.0318103  0.0251035  0.4425295  0.5499986  0.5018727
LGG       RPPA                0.5767790  0.1808821  0.1435186  0.5820448  0.5910563  0.5767790
LGG       PCA_subset          0.6554307  0.3414426  0.2968495  0.6576214  0.6572893  0.6554307
LGG       SURE                0.7940075  0.5335888  0.4668931  0.7904750  0.7465292  0.7940075
          Best PCA subset: mDNA, RNA, miRNA
          Subset selected by SURE: mDNA
LUNG      mDNA                0.8077496  0.2949746  0.3778454  0.8065767  0.6889561  0.8077496
LUNG      RNA                 0.9344262  0.6580123  0.7545235  0.9342231  0.8772694  0.9344262
LUNG      miRNA               0.8226528  0.3547613  0.4155187  0.8222429  0.7077741  0.8226528
LUNG      RPPA                0.5305514  0.0005234  0.0011007  0.6089374  0.5011233  0.5365127
LUNG      PCA_subset          0.9388972  0.6773549  0.7701654  0.9386955  0.8850902  0.9388972
LUNG      SURE                0.9418778  0.6878184  0.7806842  0.9417093  0.8903486  0.9418778
          Best PCA subset: mDNA, RNA, miRNA
          Subset selected by SURE: mDNA, RNA, miRNA
KIDNEY    mDNA                0.6716418  0.3889985  0.3406900  0.7217190  0.6741896  0.8317503
KIDNEY    RNA                 0.9457259  0.7483180  0.8308028  0.9462649  0.9156687  0.9457259
KIDNEY    miRNA               0.8493894  0.4923203  0.6068730  0.8573787  0.8044216  0.8493894
KIDNEY    RPPA                0.4261872  0.0020027  0.0061810  0.4639016  0.5078609  0.6241520
KIDNEY    PCA_subset          0.9511533  0.7670505  0.8489024  0.9516854  0.9246800  0.9511533
KIDNEY    SURE                0.9525102  0.7726162  0.8534490  0.9530685  0.9269512  0.9525102
          Best PCA subset: mDNA, RNA, miRNA, RPPA
          Subset selected by SURE: mDNA, RNA, miRNA

The results corresponding to survival analysis show that the subtypes identified by the
SURE algorithm have statistically significant differences in survival profiles, considering
a 5% significance level for both log-rank and generalized Wilcoxon tests, for the LGG and
KIDNEY data sets. For different data sets, different combinations of modalities achieve the
lowest p-values in survival analysis. However, their performance with respect to external
indices is considerably poorer than that of SURE. In brief, the relevance and concordance
measures of the proposed algorithm appropriately select the best subset of modalities and
the eigenspace update approach efficiently integrates their cluster information. In effect,
the subtypes identified by SURE have the closest resemblance with the previously established
cancer subtypes.

4.5.5 Importance of Relevance


The proposed algorithm first evaluates the relevance of each modality based on compactness
of the cluster structure embedded within its left subspace. The relevance measure provides a
linear ordering of the modalities, and the process of integration starts with the most relevant
one. To establish the importance of relevance based ordering in data integration, the
performance of clustering is studied for three other cases where the process of integration
is initiated with the second, third, and fourth most relevant modalities, keeping all other
components of the algorithm fixed. For different initiating modalities, different subsets of
modalities are selected during the construction of the joint subspace, giving rise to different
clustering solutions. The starting modality for the other three cases, the corresponding subsets
of selected modalities, and their comparative performance with the proposed approach are
reported in Table 4.2 for different data sets.
The results in Table 4.2 show that for the LGG data set, only the proposed relevance
ordering gives the best performance, while for other orderings the performance is degraded
drastically. For the other data sets, however, one or more orderings have the same perfor-
mance as that of the proposed algorithm. This is due to the presence of the concordance
measure and the value of threshold τ selected for each of those orderings. For example, for
the CESC data set, if the process starts with RNA, miRNA has the highest concordance
and the remaining modalities have concordance below the optimal threshold τ selected for
CESC. Again, starting with miRNA, only RNA has the highest concordance that exceeds
the optimal threshold. Hence, same subsets of modalities are selected for both the cases of
CESC, giving rise to identical clustering performance. Similar cases occur for both GBM
and KIDNEY data sets. For the LUNG data set, for each different ordering, all four
modalities get selected without degrading final clustering performance. However, the proposed
ordering gives the best performance with a smaller subset of modalities. So, the performance
of the proposed relevance based ordering is at least as good as that of the other orderings.

4.5.6 Significance of Concordance


At each iteration of eigenspace update, the proposed algorithm considers the modality
having maximum average concordance Ā, or shared information, with respect to the previously
updated ones. However, if the value of Ā is below the optimal threshold τ, then that modality
is not integrated into the current joint eigenspace. To assess the significance of the concordance
measure for modality selection, all the modalities are naively integrated based on their
relevance ordering, and the clustering performance of the resulting subspace is studied.
The comparative performance of this relevance-based subspace (without concordance Ā)
and the proposed SURE algorithm is reported in Table 4.3. The results in Table 4.3 show
that for CESC and LGG data sets, selection of a subset of modalities gives better
performance compared to the naive integration of all modalities. For GBM, there are only three
modalities and the proposed algorithm selects all of them. So, the performance on GBM is
identical with or without concordance. For LUNG and KIDNEY data sets, the proposed
algorithm selects only three modalities out of four using the concordance measure.
However, the results in Table 4.2 show that for these data sets, selection of all four modalities
does not degrade the clustering performance. But, the concordance measure for modality
selection gives better performance with a smaller subset of modalities compared to relevance
alone.

Table 4.2: Importance of Relevance Based Ordering of Views

Data Set  Integration  Starting  Relevance  Selected views               External evaluation index
          starts with  view      of view    (in order)                   NMI      ARI      F-Measure  Purity
CESC      2nd best     mDNA      0.47007    {mDNA}                       0.24204  0.11755  0.54537    0.58064
CESC      3rd best     RPPA      0.45550    {RPPA, mDNA, miRNA, RNA}     0.67509  0.63330  0.83902    0.85483
CESC      4th best     miRNA     0.44951    {miRNA, RNA}                 0.64619  0.65072  0.85120    0.86290
CESC      SURE         RNA       0.47533    {RNA, miRNA}                 0.64619  0.65072  0.85120    0.86290
GBM       2nd best     CNV       0.48196    {CNV}                        0.12075  0.10618  0.46782    0.44642
GBM       3rd best     miRNA     0.43332    {miRNA, RNA, CNV}            0.58157  0.55885  0.81204    0.79761
GBM       SURE         RNA       0.50859    {RNA, miRNA, CNV}            0.58157  0.55885  0.81204    0.79761
LGG       2nd best     RNA       0.44798    {RNA, RPPA, mDNA, miRNA}     0.34387  0.30313  0.65748    0.66666
LGG       3rd best     RPPA      0.43591    {RPPA, RNA, mDNA, miRNA}     0.34387  0.30313  0.65748    0.66666
LGG       4th best     miRNA     0.42871    {miRNA}                      0.03181  0.02510  0.44252    0.50187
LGG       SURE         mDNA      0.50396    {mDNA}                       0.53358  0.46689  0.79047    0.79400
LUNG      2nd best     mDNA      0.35353    {mDNA, RNA, miRNA}           0.68781  0.78068  0.94170    0.94187
LUNG      3rd best     miRNA     0.35334    {miRNA, RNA, mDNA}           0.68781  0.78068  0.94170    0.94187
LUNG      4th best     RPPA      0.31047    {RPPA, RNA, miRNA, mDNA}     0.68781  0.78068  0.94170    0.94187
LUNG      SURE         RNA       0.43179    {RNA, miRNA, mDNA}           0.68781  0.78068  0.94170    0.94187
KIDNEY    2nd best     miRNA     0.53257    {miRNA, RNA, mDNA}           0.77261  0.85344  0.95306    0.95251
KIDNEY    3rd best     mDNA      0.50915    {mDNA, RNA}                  0.76805  0.84888  0.95172    0.95115
KIDNEY    4th best     RPPA      0.39006    {RPPA, mDNA, RNA, miRNA}     0.77261  0.85344  0.95306    0.95251
KIDNEY    SURE         RNA       0.58383    {RNA, miRNA, mDNA}           0.77261  0.85344  0.95306    0.95251

Table 4.3: Importance of Concordance in SURE

Data Set  Algorithm Settings  Accuracy   NMI        ARI        F-Measure  Rand       Purity
CESC      Without Ā           0.8548387  0.6750978  0.6333073  0.8390298  0.8237608  0.8548387
CESC      SURE                0.8629032  0.6461946  0.6507274  0.8512028  0.833989   0.8629032
GBM       Without Ā           0.797619   0.5815764  0.5588514  0.8120413  0.8300542  0.797619
GBM       SURE                0.797619   0.5815764  0.5588514  0.8120413  0.8300542  0.797619
LGG       Without Ā           0.6666667  0.3438738  0.3031312  0.6574834  0.6616823  0.6666667
LGG       SURE                0.7940075  0.5335888  0.4668931  0.7904750  0.7465292  0.7940075
LUNG      Without Ā           0.9418778  0.6878184  0.7806842  0.9417093  0.8903486  0.9418778
LUNG      SURE                0.9418778  0.6878184  0.7806842  0.9417093  0.8903486  0.9418778
KIDNEY    Without Ā           0.9525102  0.7726162  0.8534490  0.9530685  0.9269512  0.9525102
KIDNEY    SURE                0.9525102  0.7726162  0.8534490  0.9530685  0.9269512  0.9525102

4.5.7 Performance Analysis of Different Algorithms


Finally, the performance of the proposed SURE algorithm is compared with that of eight
existing integrative clustering approaches, namely, BCC [140], COCA [93], JIVE [141],
A-JIVE [63], iCluster [192], LRAcluster [243], PCA-Con [6], and NormS [111] (proposed
in Chapter 3). Comparative results with respect to six external indices are reported in
Tables 4.4 and 4.5, while survival analysis and execution times are reported in Table 4.7.
The results in Tables 4.4 and 4.5 show that the SURE approach performs better than
all the existing approaches with respect to most of the external indices, on four data
sets, namely, GBM, LGG, LUNG, and OV. For LGG data set, the performance of SURE
algorithm is significantly better compared to all the existing algorithms, except NormS.
The better performance is attributed to the efficient selection of relevant modalities only
during joint subspace construction, which is also applicable in case of NormS algorithm.
For the KIDNEY data set, LRAcluster gives the best performance. However, the performance
of SURE on the KIDNEY data set, considering only three modalities, is close to the
best results. JIVE, A-JIVE, iCluster, LRAcluster, and PCA-Con are low-rank based
approaches. The results in Tables 4.4 and 4.5 show that the joint subspace extracted by
the proposed algorithm preserves better cluster structure compared to the ones extracted
by these existing low-rank based approaches. This is because the proposed algorithm
first truncates the individual eigenspaces at rank k, and then considers only the cluster
information of top k singular triplets for further integration; thus filtering out the inherent
noise present in the pn´kq remaining components. The existing low-rank based approaches,
however, consider cluster as well as noisy information of all the modalities; thus giving poor
cluster structure in the extracted subspace.
For GBM data, the BIC based JIVE algorithm estimates the rank of the joint structure to be
0, which implies that the three different modalities do not share any correlated information
among them. On the other hand, for LGG and KIDNEY data, the joint rank estimated by
JIVE is the same using both BIC and permutation tests. However, the overall performance
differs due to the difference in rank of the individual modalities estimated by these two criteria.
The survival analysis results of Table 4.6 show that the subtypes identified by all the

Table 4.4: Comparative Performance Analysis of SURE and Existing Approaches

Data Set  Algorithm   Rank of   Accuracy   NMI        ARI        F-Measure   Rand       Purity
                      Subspace
CESC      COCA        -         0.6693548  0.4172592  0.3677157  0.6870510   0.6971282  0.6774194
CESC      BCC         -         0.6895161  0.2854917  0.3144526  0.6795619   0.6687779  0.6935484
CESC      JIVE-Perm   24        0.7177419  0.4425848  0.3860367  0.7097880   0.7164962  0.7177419
CESC      JIVE-BIC    4         0.8064516  0.5296325  0.5229385  0.8011385   0.7791765  0.8064516
CESC      A-JIVE      48        0.6500000  0.3700238  0.3355826  0.6511586   0.6857724  0.6814516
CESC      iCluster    2         0.5483871  0.1737526  0.1017765  0.5568753   0.5731707  0.5645161
CESC      LRAcluster  1         0.8145161  0.5176602  0.5384740  0.8123256   0.7867821  0.8145161
CESC      PCA-Con     3         0.8548387  0.6750978  0.6333073  0.8390298   0.8237608  0.8548387
CESC      NormS       6         0.8870968  0.6854921  0.7004411  0.8801172   0.8587726  0.8870968
CESC      SURE        3         0.8629032  0.6461946  0.6507274  0.8512028   0.8339890  0.8629032
GBM       COCA        -         0.6863095  0.3682423  0.3367219  0.6771487   0.7251354  0.6863095
GBM       BCC         -         0.4113095  0.1273042  0.0578081  0.4363617   0.5873253  0.4511905
GBM       JIVE-Perm   12        0.6666667  0.3802445  0.3664034  0.6909252   0.7566296  0.6726190
GBM       A-JIVE      36        0.6940476  0.4829471  0.4580430  0.7211722   0.7907898  0.7208333
GBM       iCluster    3         0.7678571  0.5441494  0.5298306  0.7850480   0.8182207  0.7678571
GBM       LRAcluster  3         0.7678571  0.5421434  0.5201970  0.7894569   0.8152267  0.7678571
GBM       PCA-Con     4         0.7916667  0.5729951  0.5441936  0.8072570   0.8244226  0.7916667
GBM       NormS       13        0.6964286  0.4610496  0.4593267  0.7190554   0.7931993  0.7023810
GBM       SURE        4         0.7976190  0.5815764  0.5588514  0.8120413   0.8300542  0.7976190
LGG       COCA        -         0.6591760  0.2772248  0.2533847  0.6608123   0.6454901  0.6591760
LGG       BCC         -         0.6340824  0.2737596  0.248606   0.63111660  0.6382755  0.6355805
LGG       JIVE-Perm   8         0.5617978  0.2299551  0.1606599  0.5757978   0.6056715  0.5730337
LGG       JIVE-BIC    8         0.6741573  0.3441747  0.3050874  0.6679019   0.6642730  0.6741573
LGG       A-JIVE      48        0.7168539  0.4267241  0.3376560  0.7172792   0.6869055  0.7168539
LGG       iCluster    2         0.4382022  0.1379678  0.0996867  0.5187438   0.5821858  0.5355805
LGG       LRAcluster  2         0.4719101  0.1240057  0.1030798  0.5137382   0.5831714  0.5280899
LGG       PCA-Con     3         0.6666667  0.3438738  0.3031312  0.6574834   0.6616823  0.6666667
NormS 14 0.7940075 0.5325030 0.4649223 0.7916535 0.7465292 0.7940075
SURE 3 0.7940075 0.5335888 0.4668931 0.790475 0.7465292 0.7940075
COCA - 0.9284650 0.6287671 0.7339231 0.9283705 0.8669662 0.9284650
BCC - 0.9372578 0.6648076 0.7645295 0.9371445 0.8822697 0.9372578
JIVE-Perm 8 0.9269747 0.6333526 0.7288041 0.9266709 0.8644127 0.9269747
LUNG

JIVE-BIC 8 0.9388972 0.6883994 0.7701592 0.9385860 0.8850902 0.9388972


A-JIVE 32 0.9478390 0.7192028 0.8019299 0.9476450 0.9009720 0.9478390
iCluster 1 0.6333830 0.0627751 0.0696293 0.6299231 0.5348889 0.6333830
LRAcluster 1 0.9344262 0.6535038 0.7545277 0.9342966 0.8772694 0.9344262
PCA-Con 2 0.9388972 0.6773549 0.7701654 0.9386955 0.8850902 0.9388972
NormS 27 0.9359165 0.6650183 0.7597192 0.9357050 0.8798674 0.9359165
SURE 2 0.9418778 0.6878184 0.7806842 0.9417093 0.8903486 0.9418778

algorithms for LGG and KIDNEY data have significantly different survival profiles. On
the other hand, for the CESC and LUNG data sets, most of the algorithms fail to give
statistically significant results at 5% significance level.
Table 4.5: Comparative Performance Analysis of SURE and Existing Approaches

                      Rank of   External Evaluation Index
Data Set  Algorithm   Subspace  Accuracy   NMI        ARI        F-Measure  RAND       Purity
KIDNEY    COCA        -         0.9408280  0.7493140  0.8393954  0.9477422  0.9199568  0.9470828
          BCC         -         0.9122117  0.6783448  0.7299573  0.9139998  0.8657292  0.9122117
          JIVE-Perm   12        0.9308005  0.6955325  0.7786981  0.9300085  0.8893944  0.9308005
          JIVE-BIC    12        0.9253731  0.6777835  0.7724587  0.9250073  0.8863305  0.9253731
          A-JIVE      48        0.9582090  0.7902576  0.8695284  0.9585611  0.9349404  0.9582090
          iCluster    2         0.6065129  0.2547010  0.1717458  0.6514716  0.5842023  0.6811398
          LRAcluster  2         0.9538670  0.7862018  0.8579391  0.9545717  0.9292298  0.9538670
          PCA-Con     3         0.9511533  0.7670505  0.8489024  0.9516854  0.9246800  0.9511533
          NormS       35        0.9525102  0.7726162  0.8534490  0.9530685  0.9269512  0.9525102
          SURE        3         0.9525102  0.7726162  0.8534490  0.9530685  0.9269512  0.9525102
OV        COCA        -         0.5943114  0.3131466  0.2810761  0.6068513  0.7039183  0.5943114
          BCC         -         0.4610778  0.1567582  0.1254690  0.4755846  0.6268706  0.4622754
          JIVE-Perm   32        0.5718563  0.2629523  0.2027605  0.5653910  0.6885005  0.5718563
          A-JIVE      64        0.5191617  0.2124862  0.1981556  0.5111353  0.6942997  0.5221557
          iCluster    3         0.5089820  0.2249889  0.2005886  0.4808256  0.6916078  0.5119760
          LRAcluster  2         0.6287425  0.3745173  0.2999204  0.6384046  0.7322472  0.6287425
          PCA-Con     4         0.6946108  0.4424701  0.4068449  0.6868295  0.7734621  0.6946108
          NormS       10        0.6976048  0.4504552  0.4142200  0.6910392  0.7766269  0.6976048
          SURE        4         0.7215569  0.4680312  0.4372574  0.7148805  0.7857258  0.7215569
BRCA      COCA        -         0.7434673  0.5002408  0.4864778  0.7457304  0.7905295  0.7434673
          BCC         -         0.6251256  0.3169187  0.3049874  0.6242493  0.7055783  0.6334171
          JIVE-Perm   12        0.6859296  0.4287142  0.3772649  0.6889363  0.7464906  0.6859296
          JIVE-BIC    4         0.6608040  0.4372675  0.3603942  0.6678438  0.7286432  0.6608040
          A-JIVE      64        0.6140704  0.4482479  0.3710317  0.6707575  0.7363682  0.6841709
          iCluster    3         0.7638191  0.5176193  0.4745746  0.7658865  0.7842867  0.7638191
          LRAcluster  2         0.7110553  0.4368520  0.4035040  0.7101385  0.7521740  0.7110553
          PCA-Con     4         0.7587940  0.5506612  0.5038795  0.7601317  0.7984380  0.7587940
          NormS       11        0.7688442  0.5437267  0.5090183  0.7699789  0.7999063  0.7688442
          SURE        4         0.7663317  0.5528011  0.5104814  0.7683344  0.8010455  0.7663317

Comparing the execution time of different algorithms in Table 4.6, it is seen that SURE
has the minimum execution time compared to all existing algorithms on three larger data
sets, namely, LGG, LUNG, and KIDNEY, having 267, 671, and 737 samples, respectively.
For two smaller data sets, namely, CESC and GBM, having 124 and 168 samples, respectively,
PCA-Con achieves the minimum execution time. Comparing the execution time
of SURE with the state-of-the-art low-rank approaches such as iCluster, JIVE, A-JIVE,
and LRAcluster in Table 4.6, it is evident that SURE extracts the low-rank subspace
in significantly less time than all these approaches for five data sets. Hence,
the proposed algorithm is computationally more efficient than all the existing
approaches considered in this work.

4.5.8 Survival Analysis


Clinical information of the samples, retrieved from the RTCGA.clinical package [117], is
used to analyze the survival profiles of the subtypes identified by the proposed SURE
algorithm on different data sets. The survival profiles of the subtypes are compared using
Kaplan-Meier survival plots, median survival times, survival probability of the samples
within a subtype after two, five, and seven years of diagnosis of the disease, and log-rank
test p-value from pairwise comparison of subtypes. Median survival time is a statistic that
refers to how long patients are expected to survive with a disease. It is the time expressed
in months or years, when half of the patients in a group of patients diagnosed with the
disease are still alive. It gives an approximate indication of the survival as well as the
prognosis of a group of patients with the disease. The median survival time for a disease
subtype is given by the time period where the Kaplan-Meier curve for the subtype crosses
the survival probability of 0.5, and it is not available for subtypes whose survival curves
end before the survival probability of 0.5 due to low sample count or presence of censored
samples. The total number of deaths in each subtype, the number of samples at risk and
the number of events of death at two, five, and seven years of diagnosis are also observed
to study the prognosis of respective cancer with time. The survival results are reported in
Figure 4.5 and Table 4.7.

Table 4.6: Survival p-values and Execution Times of Proposed and Existing Approaches
(Log-Rank and Wilcoxon columns give survival analysis p-values; Time is the execution time in seconds.)

            CESC                              LGG
Algorithm   Log-Rank   Wilcoxon   Time (sec)  Log-Rank   Wilcoxon   Time (sec)
COCA        5.563e-02  3.126e-02  6.01        1.166e-04  2.805e-05  18.61
BCC         5.318e-01  4.572e-01  10.33       3.721e-06  3.434e-07  12.78
JIVE-Perm   4.074e-02  2.479e-02  575.95      3.736e-04  1.310e-04  622.28
JIVE-BIC    8.295e-02  8.341e-02  69.08       3.156e-08  5.134e-10  1636.94
A-JIVE      3.463e-01  2.469e-01  251.77      3.784e-07  1.922e-08  462.65
iCluster    1.448e-01  1.212e-01  1054.89     4.201e-03  7.864e-03  1241.97
LRAcluster  2.404e-01  2.418e-01  9.29        9.278e-02  1.682e-01  25.09
PCA-Con     1.243e-01  9.175e-02  0.23        3.144e-08  9.196e-10  1.61
NormS       1.352e-01  1.064e-01  1.09        2.473e-07  6.000e-09  1.05
SURE        1.370e-01  1.079e-01  0.321       2.125e-07  5.901e-09  0.87

            BRCA                              OV
Algorithm   Log-Rank   Wilcoxon   Time (sec)  Log-Rank   Wilcoxon   Time (sec)
COCA        1.159e-02  7.210e-03  26.90       6.042e-02  2.699e-01  25.40
BCC         5.174e-01  5.433e-01  17.94       2.333e-01  3.957e-01  40.38
JIVE-Perm   1.137e-02  1.435e-02  934.13      7.982e-03  8.471e-03  1491.21
JIVE-BIC    5.314e-01  4.693e-01  734.10      -          -          -
A-JIVE      2.358e-01  2.206e-01  761.76      1.825e-01  2.489e-01  557.67
iCluster    1.409e-02  4.282e-03  511.87      5.831e-01  6.338e-01  2076.36
LRAcluster  1.513e-01  2.320e-01  23.53       1.583e-01  2.305e-01  15.35
PCA-Con     2.765e-02  2.047e-02  1.06        7.744e-02  2.583e-01  1.07
NormS       6.887e-02  5.397e-02  1.47        4.296e-02  1.516e-01  1.72
SURE        4.845e-02  3.442e-02  6.32        3.872e-02  1.650e-01  4.62

            LUNG                              KIDNEY
Algorithm   Log-Rank   Wilcoxon   Time (sec)  Log-Rank   Wilcoxon   Time (sec)
COCA        4.796e-01  7.407e-01  45.49       1.650e-04  5.035e-04  69.50
BCC         3.970e-01  5.827e-01  1156.40     5.607e-03  1.705e-02  1404.64
JIVE-Perm   3.691e-01  5.893e-01  10814.40    3.087e-07  6.640e-07  5510.53
JIVE-BIC    6.254e-02  1.145e-01  14510.26    3.298e-07  8.795e-07  23494.11
A-JIVE      2.185e-01  4.442e-01  1348.23     1.265e-06  3.515e-06  1235.29
iCluster    1.523e-01  3.375e-01  2666.65     4.408e-11  4.711e-10  2152.22
LRAcluster  6.971e-01  9.892e-01  358.09      4.131e-04  1.485e-03  386.15
PCA-Con     4.292e-01  6.561e-01  5.61        1.725e-04  6.006e-04  7.85
NormS       4.702e-01  7.274e-01  3.85        1.709e-04  5.946e-04  2.53
SURE        5.102e-01  7.778e-01  5.26        1.709e-04  5.946e-04  6.19

[Figure 4.5 panels, each plotting Survival Probability against Time (Days): (a) LGG, subtypes 1-3, p < 0.0001; (b) CESC, subtypes 1-3, p = 0.14; (c) GBM, subtypes 1-4, p = 0.013; (d) LUNG, subtypes 1-2, p = 0.51; (e) KIDNEY, subtypes 1-3, p = 0.00017.]

Figure 4.5: Kaplan-Meier survival plots for subtypes identified by SURE on different data.
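The Kaplan-Meier estimate and the median survival time described in this section can be sketched as follows. This is a minimal illustration under the standard product-limit definition; the function names and the toy follow-up times are hypothetical, and it is not the software used to produce the reported results.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where a death is observed.

    times  : follow-up time of each sample
    events : 1 if death was observed at that time, 0 if the sample is censored
    """
    curve, surv = [], 1.0
    at_risk = len(times)
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)   # deaths plus censored at t
        if deaths:
            surv *= 1.0 - deaths / at_risk            # product-limit update
            curve.append((t, surv))
        at_risk -= leaving
    return curve


def median_survival(curve):
    """First time at which the curve reaches 0.5 or below; None if it never does."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical subtype whose curve crosses 0.5.
crossing = kaplan_meier([1, 2, 3, 4, 5, 6], [1, 0, 1, 1, 0, 1])
# Hypothetical subtype dominated by censored samples: the median is not reached.
censored = kaplan_meier([1, 2, 3, 4], [1, 0, 0, 0])
print(median_survival(crossing), median_survival(censored))
```

The second case corresponds to the "NA" entries of Table 4.7, where the survival curve ends above the probability of 0.5.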
The Kaplan-Meier plot for the subtypes of the LGG data set is given in Figure 4.5(a). The
p-values for the log-rank test and the generalized Wilcoxon test are 2.125e-07 and 5.901e-09,
respectively. These p-values show that there is a statistically significant difference in
survival profiles of the subtypes of LGG identified by the SURE algorithm. Table 4.7
shows that subtype 2 and subtype 3 have median survival times of 7.96 and 5.62 years,
respectively. Hence, subtypes 2 and 3 have a much better prognosis than subtype
1, which has a median survival time of 1.66 years. The survival risk is also very high for subtype 1,
as the number of deaths is 15 out of 51 samples and the survival probability is only 0.343
after two years of diagnosis. The p-value from the pairwise log-rank test comparing subtypes 1
and 2 is 5.117e-05, that comparing subtypes 1 and 3 is 5.915e-06, while the p-value between
subtypes 2 and 3 is 0.32947. Thus, the differences between the survival profiles of subtypes 1 and
2 and of subtypes 1 and 3 are statistically significant, while the difference is not statistically
significant between subtypes 2 and 3. Subtypes 2 and 3 have similar survival
probabilities at two and five years of diagnosis. However, the survival probability for
subtype 3 after seven years of diagnosis is 0.370, which is very low compared to the
probability of 0.551 for subtype 2.
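The pairwise log-rank comparisons reported in this section can be illustrated with the following sketch of the standard two-group log-rank statistic. It is an assumption-laden illustration: the function name and the toy data are hypothetical, and the p-value uses the chi-square distribution with one degree of freedom, whose tail probability equals erfc(sqrt(x/2)).

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    times = list(times_a) + list(times_b)
    events = list(events_a) + list(events_b)
    group = [0] * len(times_a) + [1] * len(times_b)
    obs_minus_exp, var = 0.0, 0.0
    for t in sorted({ti for ti, ei in zip(times, events) if ei == 1}):
        n = sum(1 for ti in times if ti >= t)                        # at risk overall
        n_a = sum(1 for ti, g in zip(times, group) if ti >= t and g == 0)
        d = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        d_a = sum(1 for ti, ei, g in zip(times, events, group)
                  if ti == t and ei == 1 and g == 0)
        obs_minus_exp += d_a - d * n_a / n                           # observed - expected
        if n > 1:                                                    # hypergeometric variance
            var += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if var == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / var
    # Tail probability of a chi-square variable with one degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical subtypes: one with early deaths, one with late deaths.
early = ([1, 2, 3, 4, 5], [1, 1, 1, 1, 1])
late = ([10, 11, 12, 13, 14], [1, 1, 1, 1, 1])
chi2, p_value = logrank_test(*early, *late)
print(round(chi2, 3), p_value < 0.05)
```

With more than two subtypes, applying this test to every pair of subtypes gives pairwise p-values of the kind quoted in the text.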
Table 4.7: Survival Analysis for Subtypes Identified by SURE on Different Data Sets

Data Set  Subtype    No. of   Total No.  Median Survival  Time     No. at  No. of Events  Survival
                     Samples  of Deaths  Time (Years)     (Years)  Risk    of Death       Probability
LGG       Subtype 1  51       15         1.66             2        3       14             0.343
                                                          5        1       0              0.343
                                                          7        1       0              0.343
          Subtype 2  73       14         7.96             2        28      4              0.906
                                                          5        15      4              0.741
                                                          7        8       3              0.551
          Subtype 3  143      17         5.62             2        43      4              0.933
                                                          5        10      5              0.740
                                                          7        5       5              0.370
CESC      Subtype 1  33       6          5.57             2        8       2              0.877
                                                          5        4       3              0.548
                                                          7        2       1              0.411
          Subtype 2  70       7          NA               2        21      1              0.957
                                                          5        11      3              0.794
                                                          7        9       1              0.721
          Subtype 3  21       2          NA               2        6       2              0.771
                                                          5        4       0              0.771
GBM       Subtype 1  37       34         1.726            2        16      20             0.455
                                                          5        6       8              0.209
                                                          7        4       2              0.139
          Subtype 2  48       45         0.944            2        6       42             0.125
                                                          5        1       3              0.055
                                                          7        1       0              0.055
          Subtype 3  50       46         0.921            2        7       42             0.147
                                                          5        1       3              0.046
          Subtype 4  33       33         0.984            2        4       29             0.1212
                                                          5        1       3              0.0303
LUNG      Subtype 1  285      86         5.08             2        92      50             0.717
                                                          5        31      21             0.501
                                                          7        13      9              0.323
          Subtype 2  363      105        3.45             2        98      56             0.703
                                                          5        22      39             0.329
                                                          7        13      3              0.274
KIDNEY    Subtype 1  214      28         NA               2        73      16             0.864
                                                          5        27      10             0.677
                                                          7        13      1              0.648
          Subtype 2  74       12         NA               2        55      6              0.909
                                                          5        35      5              0.818
                                                          7        18      1              0.789
          Subtype 3  445      140        6.3              2        263     70             0.811
                                                          5        91      55             0.591
                                                          7        14      12             0.449

The survival plot for the CESC data is given in Figure 4.5(b). Figure 4.5(b) and Table
4.7 show that the median survival time is not reached for subtypes 2 and 3, while for
subtype 1, the median survival time is 5.57 years. Moreover, subtypes 2 and 3 have 7
and 2 deaths out of 70 and 21 samples, respectively. On the other hand, subtype 1 has 6
deaths out of 33 samples. The survival probability after seven years of diagnosis is
only 0.411 for subtype 1, while the probabilities are 0.721 and 0.771 for subtypes 2 and 3,
respectively. These results show that subtypes 2 and 3 have a better prognosis than
subtype 1. The pairwise log-rank test p-value for subtypes 1 and 2 is 0.04712, that for
subtypes 1 and 3 is 0.29749, and that for subtypes 2 and 3 is 0.78188. The difference in
survival profiles is statistically significant only for subtypes 1 and 2 and is not significant
for other pairs.
Table 4.7 reports the survival analysis results for the GBM data set, and the Kaplan-
Meier plot for the GBM subtypes identified by the proposed SURE approach is given in
Figure 4.5(c). For the GBM data set, the overall log-rank p-value is 0.0137, which shows
that the subtypes have a significant difference in their survival profiles. The median survival
times for subtypes 1, 2, 3, and 4 are 1.726, 0.944, 0.921, and 0.984 years, respectively.
Comparative results from survival analysis of other data sets in Table 4.7 show that the
GBM subtypes have a much poorer prognosis than the subtypes of other cancers.
Moreover, across all the subtypes, the number of deaths is very close to the total number
of samples. The death rate is most severe for subtype 4, where death occurs for all the 33
samples of the subtype. The pairwise log-rank test p-value for subtypes 1 and 2 is
0.01413, that for subtypes 1 and 3 is 0.00743, and that for subtypes 1 and 4 is 0.00290.
The pairwise survival difference between subtype 1 and each of the other subtypes is thus
statistically significant. On the other hand, the pairwise log-rank test p-value for subtypes 2 and 3 is
0.95869, that for subtypes 2 and 4 is 0.71164, and that for subtypes 3 and 4 is 0.86016, which
show no significant difference among the survival profiles of subtypes 2, 3, and 4.
The Kaplan-Meier plot and survival analysis results for the LUNG data set are given
in Figure 4.5(d) and Table 4.7, respectively. The median survival time for subtype 1 is
5.08 years, while for subtype 2 the median survival time is worse, that is, 3.45 years. The
log-rank p-value for survival difference is 0.51, which does not show statistical significance.
However, the survival probabilities for subtype 1 and subtype 2 after five years of diagnosis
are 0.501 and 0.329, respectively, and after seven years of diagnosis, the survival proba-
bilities are 0.323 and 0.274, respectively. This shows increased survival risk for subtype 2
compared to subtype 1.
For the KIDNEY data set, the survival curves are plotted in Figure 4.5(e) and
the results are reported in Table 4.7. In the KIDNEY data set, for both subtypes 1
and 2, the survival curves end before the median survival probability of 0.5. Moreover,
the survival probabilities for subtypes 1 and 2 after seven years of diagnosis are 0.648 and
0.789, respectively, while for subtype 3, this probability drops to 0.449. This indicates that
subtypes 1 and 2 have a better prognosis than subtype 3, which has a median survival
time of 6.3 years. The p-value from the pairwise log-rank test comparing subtypes 1 and
2 is 0.124566, that comparing subtypes 1 and 3 is 0.01657646, and that for subtypes 2 and 3 is
0.0001816. The differences are statistically significant between subtypes 3
and 1 and between subtypes 3 and 2. The overall log-rank p-value is 0.00017 when the
profiles of all the three subtypes are compared together, which is statistically significant.

4.6 Conclusion
The chapter presents a novel algorithm to extract a low-rank joint subspace of the high di-
mensional multimodal data. The sample clustering is performed on the extracted subspace
to find the subtypes of respective cancer. The problem of updating the SVD of a data
matrix is formulated for multimodal data, where new modalities are added for the same
set of samples. The theoretical formulation introduced here enables the proposed SURE
algorithm to extract the principal components in less time than performing PCA
on the concatenated data. Some new quantitative indices are proposed to theoretically evaluate
the gap between the joint subspace extracted by the proposed algorithm and the principal
subspace extracted by PCA. Theoretical analysis also shows that the extracted subspace
converges to the full-rank subspace extracted by PCA, as the rank approaches full rank
of the integrated data. Unlike the existing integrative clustering approaches, the proposed
approach considers that each modality may not provide relevant and consistent information
about the true subtypes; hence, it evaluates the quality of each modality before integra-
tion. The evaluation measures and eigenspace update based approach allow the proposed
algorithm to efficiently select only relevant modalities, discarding the noisy and inconsis-
tent ones. The effectiveness of the proposed algorithm for cancer subtype identification
has been studied and compared with existing integrative clustering approaches on several
real-life multimodal cancer data sets. The experimental results show that the proposed
algorithm performs better than unimodal and multimodal approaches in identification of
cancer subtypes.
One of the important approaches of handling data heterogeneity in multimodal data
clustering is modeling each modality using a separate similarity graph. Information from
the multiple graphs is integrated by combining them into a unified graph. A major chal-
lenge here is how to preserve cluster information while removing noise from individual
graphs. In this regard, Chapter 5 introduces a novel algorithm that integrates noise-free
approximations of multiple similarity graphs.

Chapter 5

Approximate Graph Laplacians for Multi-View Data Clustering

5.1 Introduction
Advancement in information acquisition technologies has made multimodal data ubiquitous
in numerous real-world application domains like social networking [78], image processing
[54,127], 3D modeling [171], and cancer biology [199], to name a few. Whole-genome sequencing
projects have given rise to a wide variety of “omics” data, which include genomic, epigenomic,
transcriptomic, and proteomic data. The system-level insight provided by different omics
data has led to numerous scientific discoveries and clinical applications over the past decade
[89]. Cancer subtype identification has emerged as a major clinical application of
multi-omics studies. It can provide a deeper understanding of disease pathogenesis and aid the design
of targeted therapies. While each type of omic data reflects the characteristic traits of
a specific molecular level, integrative analysis of multi-omics data, which considers the
biological variations across multiple molecular levels, can reveal novel cancer subtypes.
Multi-view clustering is the primary tool for identification of disease subtypes from
multi-omics data [31, 102]. A brief survey on different multi-view clustering algorithms
is reported in Chapter 2. The main challenge is how to appropriately integrate the information
obtained from different modalities. Naive integration of different modalities with
varying scales may give inconsistent results. Another challenge is to handle efficiently the
‘high dimension-low sample size’ nature of the individual data sets, which degrades the
signal-to-noise ratio in the data and makes clustering computationally expensive.
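The effect of naive integration across modalities with very different scales can be seen in a small numerical sketch. The values below are hypothetical: one modality is on a large expression-like scale and the other lies in [0, 1]; the variable names are chosen only for this illustration.

```python
# Hypothetical toy values: two samples measured in two modalities.
expr_a, expr_b = [120000.0, 350.0], [118000.0, 900.0]   # large-scale modality
meth_a, meth_b = [0.05, 0.90], [0.95, 0.10]             # modality confined to [0, 1]


def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


d_expr = sq_dist(expr_a, expr_b)
d_meth = sq_dist(meth_a, meth_b)

# After concatenation the squared distance is the sum of the two parts,
# so the share contributed by the [0, 1] modality is negligible.
share = d_meth / (d_expr + d_meth)
print(f"share of the [0, 1] modality in the concatenated distance: {share:.2e}")
```

Although the two samples are almost opposite in the second modality, the distance on the concatenated features is determined almost entirely by the first one, which is why each modality is better modeled separately before integration.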
In multi-omics data, different modalities vary immensely in terms of unit and scale. For
instance, RNA sequence based gene expression data consists of RPM (reads per million)
values spanning six orders of magnitude, while DNA methylation data consists of β values
that lie in [0, 1]. So, concatenation of features from these heterogeneous modalities
would reflect only the properties of features having high variance. In order to capture the
inherent properties of different modalities, it is essential to model the variations within
each modality separately and then integrate them using a common platform. One widely
used approach is to model each individual modality using a separate similarity graph. The
individual similarity graphs are constructed in such a way that their vertices represent the
samples, while their edges are weighted by the pairwise affinities between the samples of
the respective modalities. The challenge is then how to integrate information efficiently
from multiple similarity graphs. This comes under the paradigm of graph based multi-
view learning [120,134,142,247,256,286,294], where the main objective is to learn a unified
graph that is sufficiently “close” to all the graphs in some sense. In most multi-view
learning algorithms, spectral clustering [147, 152, 230] is performed on the similarity graph
corresponding to the unified view to identify the clusters of a given data set. The spectral
clustering uses spectrum of the graph Laplacian [45] to identify the clusters in a data set.
It has been shown in [230] that the relaxed solution to the k cluster indicators of a data
set is given by the eigenvectors corresponding to the k smallest eigenvalues of its graph
Laplacian. Hence, spectral clustering algorithms perform simple k-means on the k smallest
eigenvectors of the graph Laplacian. However, it also implies that only a few eigenvectors of
the Laplacian contain the cluster discriminatory information of the data set. The remaining
eigenvectors may not necessarily encode cluster information and may reflect background
noise. As a consequence, a major drawback of these multi-view algorithms is that both
similarity graphs and their Laplacians, constructed from different views, inherently contain
noisy information. This unwanted noise of the individual views may get propagated into
the unified view during integration. This can degrade the quality of the cluster structure
inferred from the unified view. Therefore, it is essential to prevent the noise in the individual
views from being propagated into the unified view.
In this regard, the chapter presents a novel algorithm, termed as CoALa (Convex-
combination of Approximate Laplacians), which integrates noise-free approximations of
multiple similarity graphs. The proposed method models each modality using a separate
similarity graph, as different modalities are highly heterogeneous in nature and are mea-
sured in different scales. The noise in each individual graph is eliminated by approximating
it using the most informative eigenpairs of its Laplacian which contain cluster information.
The approximate Laplacians are then integrated and a low-rank subspace is constructed
that best preserves the overall cluster information of multiple graphs. The graphs are in-
tegrated using a convex combination, where they are weighted according to the quality
of their inherent cluster structure. Hence, noisy graphs have lower impact on the final
subspace compared to the ones with good cluster structure. However, the approximate
subspace constructed by the proposed method differs from the full-rank subspace that in-
tegrates information from all the eigenpairs of each Laplacian. The matrix perturbation
theory is used to theoretically upper bound the difference between the full-rank and approx-
imate subspaces, as a function of the approximation rank. It is shown, both theoretically
and experimentally, that the approximate subspace converges to the full-rank one as the
rank of approximation approaches the full rank of the individual Laplacians. Finally, the
efficacy of clustering in the approximate subspace is extensively studied and compared with
different existing integrative clustering approaches, on several real-life multi-omics cancer
data sets. The results on benchmark data sets from other domains like image processing
and social networks are also provided to establish the generality of the proposed approach.
Some of the results of this chapter are reported in [113].
The rest of this chapter is organized as follows: Section 5.2 introduces the basics of
graph Laplacian and its properties, while Section 5.3 presents the proposed graph based
algorithm for multi-view data clustering. Section 5.4 presents theoretical upper bounds
on the difference between full-rank and approximate subspaces. Experimental results and
comparison with existing approaches on multi-omics cancer and benchmark data sets are
presented in Section 5.5. Section 5.6 concludes the chapter.

5.2 Basics of Graph Laplacian


Given a set of samples or objects $X = \{x_1, \ldots, x_i, \ldots, x_n\}$, and a similarity matrix
$W = [w(i,j)]_{n \times n}$, where $x_i \in \Re^d$ and $w(i,j) = w(j,i) \geq 0$ is the similarity between objects $x_i$
and $x_j$, the intuitive goal of clustering is to partition the objects into several groups such
that objects in the same group are similar to each other, while those in different groups
are dissimilar. The problem of clustering can also be approached from a graph theoretic
point of view, where the data set $X$ can be represented as an undirected similarity graph
$G = (V, E)$ having vertex set $V = \{v_1, \ldots, v_i, \ldots, v_n\}$, where each vertex $v_i$ represents the
object $x_i$, and the edge between vertices $v_i$ and $v_j$ is weighted by the similarity $w(i,j)$.
The degree $\tilde{d}_i$ of vertex $v_i$ is given by $\tilde{d}_i = \sum_{j=1}^{n} w(i,j)$, and the degree matrix $D$ is given by
the diagonal matrix
$$D = \mathrm{diag}(\tilde{d}_1, \ldots, \tilde{d}_i, \ldots, \tilde{d}_n). \tag{5.1}$$
Given the number of clusters $k$, clustering can be viewed as partitioning the graph $G$ into
$k$ subgraphs such that edges between different subgraphs have lower weights, while edges
within a subgraph have higher weights. For a subset of vertices $A \subset V$, let its complement
$\bar{A}$ be given by $\bar{A} = V \setminus A$. A measure of size of subset $A$ can be given by
$\mathrm{vol}(A) = \sum_{v_i \in A} \tilde{d}_i$. For two not necessarily disjoint subsets $A, B \subset V$, let
$$C(A, B) = \sum_{v_i \in A,\, v_j \in B} w(i,j). \tag{5.2}$$

For a subset $A$ of vertices, $C(A, \bar{A})$ gives the weight of the cut that separates the vertices
in $A$ from the rest of vertices in $G$. So, given the number of subsets $k$, the graph parti-
tioning problem finds a partition $A_1, \ldots, A_k$ of $V$ such that it minimizes the cut weight
$C(A_i, \bar{A}_i)$ for each $A_i$. However, minimizing only $C(A_i, \bar{A}_i)$ can lead to singleton subsets
$A_i$'s. In clustering, it is desirable to achieve clusters with reasonably large set of points.
So, minimizing $C(A_i, \bar{A}_i)/\mathrm{vol}(A_i)$, instead of $C(A_i, \bar{A}_i)$, would constrain each subset $A_i$ to be fairly
large. The most common optimization problem in this regard is the normalized cut or
$Ncut$ [194], defined as

$$\underset{A_1, \ldots, A_k}{\text{minimize}} \;\; Ncut(A_1, \ldots, A_k) = \frac{1}{2} \sum_{i=1}^{k} \frac{C(A_i, \bar{A}_i)}{\mathrm{vol}(A_i)} \quad \text{such that } A_i \cap A_j = \emptyset \text{ and } \bigcup_{i=1}^{k} A_i = V. \tag{5.3}$$
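As a small illustration of (5.1)-(5.3), the following sketch computes vertex degrees, volumes, cut weights, and the Ncut objective, and then finds the optimal 2-way partition of a five-vertex graph by exhaustive search. The similarity values and the helper names are hypothetical and chosen only for this example; the search enumerates every labeling and is therefore feasible only for very small graphs.

```python
from itertools import product

# Hypothetical similarity matrix: vertices 0-2 and 3-4 form two tight groups.
W = [
    [0.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 0.0, 0.7, 0.0, 0.1],
    [0.8, 0.7, 0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1, 0.0, 0.9],
    [0.0, 0.1, 0.0, 0.9, 0.0],
]
n = len(W)
degree = [sum(row) for row in W]            # diagonal entries of D in (5.1)


def vol(A):
    """Volume of a vertex subset: sum of its vertex degrees."""
    return sum(degree[i] for i in A)


def cut(A, B):
    """C(A, B) in (5.2): total similarity between subsets A and B."""
    return sum(W[i][j] for i in A for j in B)


def ncut(parts):
    """Objective of (5.3) for a partition given as a list of vertex lists."""
    total = 0.0
    for A in parts:
        complement = [v for v in range(n) if v not in A]
        total += cut(A, complement) / vol(A)
    return 0.5 * total


def best_partition(k):
    """Exhaustive search over all k**n labelings; skips labelings with empty parts."""
    best = None
    for labels in product(range(k), repeat=n):
        parts = [[v for v in range(n) if labels[v] == c] for c in range(k)]
        if any(not p for p in parts):
            continue
        score = ncut(parts)
        if best is None or score < best[0]:
            best = (score, parts)
    return best


score, parts = best_partition(2)
print(parts, round(score, 4))
```

The number of candidate labelings grows as $k^n$, so this brute-force search does not scale beyond toy graphs.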

However, the above optimization problem is NP-hard [231]. The spectral clustering [230]
provides a computationally tractable solution to this Ncut problem. It analyzes the spec-
trum or eigenspace of graph Laplacian to find the solution [158]. The graph Laplacian and
several of its variants are described next.
Let $G = (V, E)$ be a graph with similarity matrix $W$ and degree matrix $D$ as given by
(5.1). The matrix $(D - W)$ is called the Laplacian of graph $G$ [158], and the normalized
Laplacian of $G$ is given by [45]

$$L = D^{-1/2} (D - W) D^{-1/2} = I - D^{-1/2} W D^{-1/2}, \tag{5.4}$$

where $I$ is the identity matrix of appropriate order. Two important properties of the normalized
Laplacian are as follows [45]:

Property 5.1. $L$ is symmetric and positive semi-definite.

Property 5.2. The eigenvalues of $L$ lie in $[0, 2]$.
Let the $k$ clusters in a data set $X$ be represented by the indicator matrix

$$E = [e_1 \ldots e_j \ldots e_k] \in \Re^{n \times k}, \tag{5.5}$$

where $e_j$ is the indicator vector in $\Re^n$ for the $j$-th cluster, that is, $e_j \in \{0, 1\}^n$, such that
$e_j$ has a nonzero component only for the points in the $j$-th cluster. Let the $r$ largest
eigenvectors of a matrix correspond to its $r$ largest eigenvalues. It is shown in [230] that if
the constraint on the cluster indicators $e_j$'s is relaxed such that $e_j \in [0, 1]^n$, then the real-
valued solution to the indicators $e_1, \ldots, e_k$ is given by the $k$ smallest eigenvectors of the
normalized Laplacian $L$. The normalized spectral clustering algorithm by Ng et al. [162] is
described in Algorithm 5.1. The spectral clustering algorithm [162, 194] first computes the
graph Laplacian and then k-means clustering is performed on its $k$ smallest eigenvectors.
The main advantage of spectral clustering is that it transforms the representations of the
objects $\{x_i\}$ from their original space to an indicator subspace where the cluster charac-
teristics are more prominent. As the cluster properties are enhanced in this new subspace,
even simple clustering algorithms, such as k-means, have no difficulty in distinguishing the
clusters.

Algorithm 5.1 Normalized Spectral Clustering [162]

Input: Similarity matrix $W$, number of clusters $k$.
Output: Clusters $A_1, \ldots, A_k$.
1: Construct degree matrix $D$ and normalized Laplacian $L$ as in (5.1) and (5.4), respectively.
2: Find eigenvectors $U = [u_1 \ldots u_k]$ corresponding to $k$ smallest eigenvalues of $L$.
3: Normalize the rows of $U$, i.e. $U = \mathrm{diag}(U U^T)^{-1/2} U$.
4: Perform clustering on the rows of $U$ using k-means algorithm.
5: Return clusters $A_1, \ldots, A_k$ from k-means clustering.
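The steps of Algorithm 5.1 can be sketched in NumPy as follows. This is an illustrative sketch rather than the implementation used in this work: the function names are hypothetical, every vertex is assumed to have a positive degree, and the k-means step is a plain Lloyd iteration with a deterministic farthest-point initialization chosen here only to keep the example reproducible.

```python
import numpy as np


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dists = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(dists))])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def normalized_spectral_clustering(W, k):
    """Cluster the vertices of a similarity graph following Algorithm 5.1."""
    W = np.asarray(W, dtype=float)
    deg = W.sum(axis=1)                                   # step 1: degrees, (5.1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)                       # assumes positive degrees
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]   # (5.4)
    eigvals, eigvecs = np.linalg.eigh(L)                  # eigenvalues in ascending order
    U = eigvecs[:, :k]                                    # step 2: k smallest eigenvectors
    U = U / np.linalg.norm(U, axis=1, keepdims=True)      # step 3: unit-length rows
    return kmeans(U, k)                                   # steps 4-5


# Hypothetical graph: two groups of three vertices with weak links between them.
W = np.full((6, 6), 0.01)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
labels = normalized_spectral_clustering(W, 2)
print(labels)
```

Dividing each row of $U$ by its Euclidean norm is the same operation as the multiplication by $\mathrm{diag}(U U^T)^{-1/2}$ in step 3.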

In a Laplacian matrix, the necessary cluster information is embedded in its k smallest


eigenvectors. However, based on Eckart-Young theorem [56], the best low-rank approxi-
mation of a symmetric matrix can be constructed from its few largest eigenpairs. So, the
best low-rank approximation of a Laplacian matrix primarily encodes noise, rather than

94
cluster information. In the proposed work, the final subspace of a multimodal data set is
constructed from low-rank approximations of individual graph Laplacians. So, in order to
reflect the cluster information in the low-rank approximations, the shifted Laplacian [51]
is used, which is defined as

L “ 2I ´ L “ I ` D´1{2 W D´1{2 . (5.6)

The following property of shifted Laplacian makes it feasible to reflect the cluster informa-
tion in its best low-rank approximation.

Property 5.3. If $(\lambda, v)$ is an eigenvalue-eigenvector pair of the normalized Laplacian $L$, then $(2 - \lambda, v)$ is an eigenpair of the shifted Laplacian $\mathcal{L}$ [51].

Property 5.3 implies that the $k$ smallest eigenvalues and eigenvectors of the normalized Laplacian $L$ correspond to the $k$ largest eigenvalues and eigenvectors of the shifted Laplacian $\mathcal{L}$. Therefore, the relaxed solution to the cluster indicators $e_1, \ldots, e_k$ in (5.5) is given by the $k$ largest eigenvectors of $\mathcal{L}$. So, the best rank $k$ approximation of $\mathcal{L}$ also encodes its cluster information. As the eigenvalues of $L$ lie in $[0, 2]$, the eigenvalues of $\mathcal{L}$ also lie in $[0, 2]$. Moreover, $\mathcal{L}$ is symmetric and positive semi-definite [51].
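Property 5.3 can be checked numerically on a small example. The sketch below is only illustrative: the similarity matrix is random, and the normalized Laplacian is assumed to be $I - D^{-1/2} W D^{-1/2}$, consistent with (5.6).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 6))
W = (A + A.T) / 2                          # symmetric, non-negative similarities
d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
N = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]   # D^{-1/2} W D^{-1/2}

L_norm = np.eye(6) - N                     # normalized Laplacian
L_shift = np.eye(6) + N                    # shifted Laplacian, as in (5.6)

lam, vecs = np.linalg.eigh(L_norm)         # ascending eigenvalues of L
mu = np.linalg.eigvalsh(L_shift)           # ascending eigenvalues of shifted L

# Property 5.3: an eigenvalue lam of L maps to 2 - lam of the shifted
# Laplacian, with the same eigenvector.
shift_matches = np.allclose(np.sort(2.0 - lam), mu)
same_vector = np.allclose(L_shift @ vecs[:, 0], (2.0 - lam[0]) * vecs[:, 0])
```

The smallest eigenvalue of `L_norm` (close to 0) therefore becomes the largest eigenvalue of `L_shift` (close to 2), which is why the cluster information moves to the top of the spectrum.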

5.3 CoALa: Proposed Method


This section presents a novel algorithm to extract a low-rank joint subspace from multiple
graph Laplacians. Some analytical formulations, required for subspace construction, are
reported next, prior to describing the proposed algorithm.

5.3.1 Convex Combination of Graph Laplacians


Let a multimodal data set, consisting of $M$ modalities or views, be given by $X_1, \ldots, X_m, \ldots, X_M$. Each modality $X_m \in \Re^{n \times d_m}$ represents the observations for the same set of $n$ samples from the $m$-th data source. Let $X_m$ be encoded by the similarity graph $G_m$ having similarity matrix $W_m$ and degree matrix $D_m$. The shifted Laplacian for modality $X_m$ is given by

$$ \mathcal{L}_m = I + D_m^{-1/2} W_m D_m^{-1/2}. \tag{5.7} $$

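Let the eigen-decomposition of $\mathcal{L}_m$ be given by

$$ \mathcal{L}_m = U_m \Sigma_m U_m^T, \tag{5.8} $$

where $U_m = [u_1^m, \ldots, u_n^m] \in \Re^{n \times n}$ contains the eigenvectors of $\mathcal{L}_m$ in its columns, $B^T$ denotes the transpose of $B$, and $\Sigma_m = \mathrm{diag}(\lambda_1^m, \ldots, \lambda_n^m)$, where $2 \ge \lambda_1^m \ge \ldots \ge \lambda_n^m \ge 0$. For a given rank $r$, the eigen-decomposition of shifted Laplacian $\mathcal{L}_m$ in (5.8) can be partitioned as follows:

$$
\begin{aligned}
\mathcal{L}_m &= U_m \Sigma_m U_m^T \\
&= \begin{bmatrix} U_m^r & U_m^{r\perp} \end{bmatrix}
   \begin{bmatrix} \Sigma_m^r & 0 \\ 0 & \Sigma_m^{r\perp} \end{bmatrix}
   \begin{bmatrix} U_m^r & U_m^{r\perp} \end{bmatrix}^T \\
&= U_m^r \Sigma_m^r (U_m^r)^T + U_m^{r\perp} \Sigma_m^{r\perp} (U_m^{r\perp})^T \\
&= \mathcal{L}_m^r + \mathcal{L}_m^{r\perp},
\end{aligned} \tag{5.9}
$$

where $0$ denotes a matrix of all zeros of appropriate order, $\Sigma_m^r = \mathrm{diag}(\lambda_1^m, \ldots, \lambda_r^m)$ consists of the $r$ largest eigenvalues and $U_m^r$ contains the corresponding $r$ eigenvectors in its columns. Similarly, $\Sigma_m^{r\perp}$ and $U_m^{r\perp}$ contain the remaining $(n - r)$ eigenvalues $\lambda_{r+1}^m, \ldots, \lambda_n^m$ and eigenvectors, respectively. Thus, $\mathcal{L}_m^r$ is the rank $r$ approximation of $\mathcal{L}_m$ using the $r$ largest eigenpairs, and $\mathcal{L}_m^{r\perp}$ is the approximation using the remaining $(n - r)$ eigenpairs.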
Given the number of clusters $k$, the properties of the shifted Laplacian imply that the relaxed solution to the cluster indicators is given by the $k$ largest eigenvectors of $\mathcal{L}_m$. Therefore, for each modality $X_m$, a rank $r$ eigenspace representation is constructed, where $k \le r \ll n$, which encodes the cluster information of its shifted Laplacian $\mathcal{L}_m$. Choosing the rank $r$ to be greater than $k$ allows extra information from each Laplacian at the initial stage.
The rank $r$ eigenspace of shifted Laplacian $\mathcal{L}_m$ for modality $X_m$ is defined by a two-tuple:

$$ \Psi(\mathcal{L}_m^r) = \langle U_m^r, \Sigma_m^r \rangle. \tag{5.10} $$

The individual graph Laplacians contain the cluster information of their respective modalities. Multiple modalities are integrated using a convex combination $\alpha = [\alpha_1, \ldots, \alpha_m, \ldots, \alpha_M]$ of individual shifted Laplacians, defined by

$$ \mathbf{L} = \sum_{m=1}^{M} \alpha_m \mathcal{L}_m, \quad \text{such that } \alpha_m \ge 0 \text{ and } \sum_{m=1}^{M} \alpha_m = 1. \tag{5.11} $$

The matrix $\mathbf{L}$ is called the joint shifted Laplacian and it has the following properties.

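Property 5.4. $\mathbf{L}$ is symmetric and positive semi-definite.

Proof. Each shifted Laplacian $\mathcal{L}_m$ is symmetric for $m = 1, 2, \ldots, M$. So,

$$ \mathbf{L}^T = \left( \sum_{m=1}^{M} \alpha_m \mathcal{L}_m \right)^T = \sum_{m=1}^{M} \alpha_m \mathcal{L}_m^T = \sum_{m=1}^{M} \alpha_m \mathcal{L}_m = \mathbf{L}. $$

Therefore, $\mathbf{L}$ is symmetric. By Property 5.3, each $\mathcal{L}_m$ is positive semi-definite, so, for any vector $a \in \Re^n$, $a^T \mathcal{L}_m a \ge 0$. Therefore,

$$ a^T \mathbf{L} a = a^T \left( \sum_{m=1}^{M} \alpha_m \mathcal{L}_m \right) a = \sum_{m=1}^{M} \alpha_m \left( a^T \mathcal{L}_m a \right) \ge 0, $$

as $\alpha_m \ge 0$. Therefore, $\mathbf{L}$ is positive semi-definite.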

Property 5.5. $\mathbf{L}$ has $n$ eigenvalues $\gamma_1 \ge \ldots \ge \gamma_i \ge \ldots \ge \gamma_n$, where $\gamma_i \in [0, 2]$.

Proof. By Property 5.3, the eigenvalues of each individual shifted Laplacian $\mathcal{L}_m$ lie in $[0, 2]$ for $m = 1, 2, \ldots, M$. So, the maximum eigenvalues of $\mathcal{L}_m$ and $\alpha_m \mathcal{L}_m$ satisfy $\lambda_1^m \le 2$ and $\alpha_m \lambda_1^m \le 2\alpha_m$, respectively. Since each Laplacian $\mathcal{L}_m$ is a real symmetric matrix, it is also Hermitian as it is equal to its own conjugate transpose. Now, $\mathbf{L}$ is the sum of $M$ Hermitian matrices. So, applying Weyl's inequality [202], which bounds the eigenvalues of the sum of two Hermitian matrices, successively to the $M$ terms of the sum, we get

$$ \gamma_1 \le \sum_{m=1}^{M} \alpha_m \lambda_1^m \le \sum_{m=1}^{M} 2\alpha_m = 2. \tag{5.12} $$

$\mathbf{L}$ is positive semi-definite, so all of its eigenvalues $\gamma_i \ge 0$. Therefore, $\gamma_i \in [0, 2]$.

Hence, the joint shifted Laplacian $\mathbf{L}$ has properties similar to those of the individual shifted Laplacians $\mathcal{L}_m$. In the rest of the chapter, the term joint Laplacian is used to refer to the joint shifted Laplacian.
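Properties 5.4 and 5.5 can be illustrated on random data. The following is a sketch under stated assumptions: the similarity matrices are random stand-ins, the weights in `alpha` are arbitrary illustrative values, and the helper name `shifted_laplacian` is hypothetical.

```python
import numpy as np

def shifted_laplacian(W):
    """Shifted Laplacian I + D^{-1/2} W D^{-1/2} of a similarity matrix W."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(len(W)) + d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

rng = np.random.default_rng(1)
n, M = 8, 3
laplacians = []
for _ in range(M):
    A = rng.random((n, n))
    laplacians.append(shifted_laplacian((A + A.T) / 2))

alpha = np.array([0.5, 0.3, 0.2])          # illustrative weights, sum to 1
L_joint = sum(a * Lm for a, Lm in zip(alpha, laplacians))   # as in (5.11)

gamma = np.linalg.eigvalsh(L_joint)[::-1]  # descending eigenvalues
lam1 = [np.linalg.eigvalsh(Lm)[-1] for Lm in laplacians]    # largest per view
weyl_upper = float(np.dot(alpha, lam1))    # middle term of (5.12)
```

Here `gamma[0]` stays below `weyl_upper`, which in turn does not exceed 2, and `gamma[-1]` is non-negative, matching the range stated in Property 5.5.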

5.3.2 Construction of Joint Eigenspace


This subsection describes the construction of the eigenspace of the joint Laplacian from low-rank eigenspaces of individual shifted Laplacians. Let the eigen-decomposition of $\mathbf{L}$ be given by

$$ \mathbf{L} = \mathbf{Z} \Gamma \mathbf{Z}^T, \tag{5.13} $$

where $\mathbf{Z}$ consists of the eigenvectors of $\mathbf{L}$ in its columns and $\Gamma = \mathrm{diag}(\gamma_1, \ldots, \gamma_n)$ is the diagonal matrix of eigenvalues arranged in descending order of magnitude. The "full-rank" eigenspace of $\mathbf{L}$ is given by the two-tuple

$$ \Psi(\mathbf{L}^r) = \langle \mathbf{Z}^r, \Gamma^r \rangle, \tag{5.14} $$

where $\Gamma^r = \mathrm{diag}(\gamma_1, \ldots, \gamma_r)$ and $\mathbf{Z}^r$ contains the eigenvectors corresponding to the eigenvalues in $\Gamma^r$. The term "full-rank" is used to imply that in $\mathbf{L}$, the complete information of all the eigenpairs of each Laplacian is considered during convex combination. The superscript $r$ in $\Psi(\mathbf{L}^r)$ indicates that the eigenspace has rank $r$. The "approximate" joint Laplacian is defined as

$$ \mathbf{L}^{r*} = \sum_{m=1}^{M} \alpha_m \mathcal{L}_m^r. \tag{5.15} $$

Thus, $\mathbf{L}^{r*}$ is the convex combination of the best rank $r$ approximations of the individual shifted Laplacians. For each shifted Laplacian $\mathcal{L}_m$, instead of storing its complete eigen-decomposition, only the $r$ largest eigenpairs are stored in its eigenspace $\Psi(\mathcal{L}_m^r)$. Given these eigenspaces $\Psi(\mathcal{L}_m^r)$, the proposed method aims at constructing the rank $r$ eigenspace $\Psi(\mathbf{L}^{r*})$ of the approximate joint Laplacian $\mathbf{L}^{r*}$. The main advantage of this construction is that it finds the joint eigenspace from the $r$ largest eigenpairs of individual Laplacians. The cluster information of individual modalities is expected to be embedded in the $k$ largest eigenpairs of their respective shifted Laplacians. Hence, storing $r \ge k$ eigenpairs allows for some extra information from each Laplacian and also discards the noisy information in the remaining $(n - r)$ eigenpairs. Thus, the approximate eigenspace $\Psi(\mathbf{L}^{r*})$, constructed from the $r$ largest eigenpairs, is expected to preserve better cluster information compared to the full-rank eigenspace $\Psi(\mathbf{L}^r)$.
One straightforward approach for the construction of the eigenspace of $\mathbf{L}^{r*}$ is to first solve the eigen-decomposition of the individual $\mathcal{L}_m$'s, reconstruct the $\mathcal{L}_m^r$'s from the top $r$ eigenpairs of the respective $\mathcal{L}_m$'s, combine the reconstructed $\mathcal{L}_m^r$'s using the convex combination, and then perform another eigen-decomposition on the combination $\mathbf{L}^{r*}$. This requires solving a total of $(M + 1)$ eigen-decompositions of size $(n \times n)$. However, in the proposed method, the eigenspaces $\Psi(\mathcal{L}_m^r)$ of the individual Laplacians are used to construct a smaller eigenvalue problem of size $(Mr \times Mr)$ whose solution gives the required eigenspace $\Psi(\mathbf{L}^{r*})$. So, it requires solving $M$ eigen-decompositions of size $(n \times n)$ and one of size $(Mr \times Mr)$, where $Mr \ll n$. This makes the proposed approach computationally more efficient.
The block decomposition of $\mathcal{L}_m$ in (5.9) gives $\mathcal{L}_m^r = U_m^r \Sigma_m^r (U_m^r)^T$. So,

$$ \mathbf{L}^{r*} = \sum_{m=1}^{M} \alpha_m \mathcal{L}_m^r = \sum_{m=1}^{M} \alpha_m U_m^r \Sigma_m^r (U_m^r)^T. \tag{5.16} $$

The expansion of $\mathbf{L}^{r*}$ in (5.16) implies that the subspace spanned by its columns is the same as the one spanned by the union of the columns of $U_m^r$ for $m = 1, \ldots, M$. Let that subspace be given by

$$ \mathcal{J}^r = \mathrm{span}\left( \bigcup_{m=1}^{M} C(U_m^r) \right), \tag{5.17} $$

where $C(B)$ denotes the column space of matrix $B$. To compute the eigenspace of $\mathbf{L}^{r*}$, the first step is to construct a sufficient basis that spans the subspace $\mathcal{J}^r$. Since $\mathcal{J}^r$ is the span of the union of $M$ subspaces, its basis is constructed iteratively in $M$ steps. At step 1, the initial basis $\mathbf{U}_1$ is given by

$$ \mathbf{U}_1 = U_1^r, \tag{5.18} $$

which spans the subspace $C(U_1^r)$. At step $m$, let the union of $m$ subspaces be given by the subspace

$$ \mathcal{J}_m^r = \mathrm{span}\left( \bigcup_{j=1}^{m} C(U_j^r) \right) \tag{5.19} $$

and let its orthonormal basis be given by $\mathbf{U}_m \in \Re^{n \times mr}$. Given the basis $\mathbf{U}_m$ obtained at step $m$, and the basis $U_{m+1}^r$ for $\mathcal{L}_{m+1}^r$, the basis $\mathbf{U}_{m+1}$ at step $(m + 1)$ is constructed as follows.
The basis $\mathbf{U}_{m+1}$ has to span both the subspaces $\mathcal{J}_m^r$ and $C(U_{m+1}^r)$. The column vectors of $\mathbf{U}_m$ themselves form a basis for the subspace $\mathcal{J}_m^r$. Therefore, a sufficient basis for the subspace $\mathcal{J}_{m+1}^r$ can be constructed by appending a basis $\Upsilon_{m+1}$ that spans the subspace orthogonal to $\mathcal{J}_m^r$. The construction of basis $\Upsilon_{m+1}$ begins by computing the residue of each basis vector in $U_{m+1}^r$ with respect to the basis $\mathbf{U}_m$. To compute the residues, each vector in $U_{m+1}^r$ is projected on each of the basis vectors in $\mathbf{U}_m$. In matrix notation, this is given by

$$ S_{m+1} = \mathbf{U}_m^T U_{m+1}^r. \tag{5.20} $$

The matrix $S_{m+1}$ gives the magnitude of projection of the columns of $U_{m+1}^r$ onto the orthonormal basis $\mathbf{U}_m$. The projected component $P_{m+1}$ of $U_{m+1}^r$, lying in the subspace $\mathcal{J}_m^r$, is obtained by multiplying the projection magnitudes in $S_{m+1}$ by the corresponding basis vectors in $\mathbf{U}_m$, given by

$$ P_{m+1} = \mathbf{U}_m S_{m+1}. \tag{5.21} $$

The residual component $Q_{m+1}$ of $U_{m+1}^r$ is obtained by subtracting the projected component $P_{m+1}$ from $U_{m+1}^r$, given by

$$ Q_{m+1} = U_{m+1}^r - P_{m+1}. \tag{5.22} $$

An orthogonal basis $\Upsilon_{m+1}$ for the residual space, spanned by the columns of $Q_{m+1}$, can be obtained by Gram-Schmidt orthogonalization of $Q_{m+1}$. The basis $\Upsilon_{m+1}$ spans the subspace orthogonal to $\mathcal{J}_m^r$. Therefore, a sufficient basis for the subspace $\mathcal{J}_{m+1}^r$ is obtained by appending $\Upsilon_{m+1}$ to $\mathbf{U}_m$, given by

$$ \mathbf{U}_{m+1} = \begin{bmatrix} \mathbf{U}_m & \Upsilon_{m+1} \end{bmatrix}. \tag{5.23} $$

Let $\Upsilon_1 = \mathbf{U}_1$. After $M$ steps, the basis $\mathbf{U}_M$, for the subspace $\mathcal{J}^r$ in (5.17), is given by

$$ \mathbf{U}_M = \begin{bmatrix} \Upsilon_1 & \Upsilon_2 & \ldots & \Upsilon_M \end{bmatrix}. \tag{5.24} $$
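The iterative basis construction of (5.18) to (5.24) can be sketched as follows. This is an illustrative sketch, not the thesis implementation: the function names `extend_basis` and `joint_basis` are hypothetical, the inputs are random orthonormal matrices standing in for the $U_m^r$, and the residual is orthonormalized with an SVD rather than Gram-Schmidt (same span, but numerically dependent directions can be dropped by a tolerance, which is an added assumption).

```python
import numpy as np

def extend_basis(U_basis, U_next, tol=1e-10):
    """One step of (5.20)-(5.23): append the part of U_next outside span(U_basis)."""
    S = U_basis.T @ U_next                 # projection magnitudes, (5.20)
    P = U_basis @ S                        # projected component, (5.21)
    Q = U_next - P                         # residual component, (5.22)
    # Orthonormal basis of the residual space. An SVD replaces Gram-Schmidt
    # here; left singular vectors with non-negligible singular values span C(Q).
    Y, s, _ = np.linalg.svd(Q, full_matrices=False)
    return np.hstack([U_basis, Y[:, s > tol]])          # (5.23)

def joint_basis(U_list):
    """Orthonormal basis of the span of all column spaces in U_list."""
    basis = U_list[0]                      # initial basis, (5.18)
    for U_next in U_list[1:]:
        basis = extend_basis(basis, U_next)
    return basis

rng = np.random.default_rng(2)
n, r, M = 20, 3, 3
# Stand-ins for the U_m^r: random orthonormal n x r matrices.
U_list = [np.linalg.qr(rng.standard_normal((n, r)))[0] for _ in range(M)]
U_M = joint_basis(U_list)
```

For generic inputs the result has $Mr$ orthonormal columns, and projecting any $U_m^r$ onto `U_M` reproduces it, which confirms that every individual subspace is contained in the joint one.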

Let the eigen-decomposition of $\mathbf{L}^{r*}$ be given by

$$ \mathbf{L}^{r*} = \mathbf{V} \Pi \mathbf{V}^T, \tag{5.25} $$

where $\mathbf{V} \in \Re^{n \times n}$ contains the eigenvectors of $\mathbf{L}^{r*}$ in its columns, and $\Pi = \mathrm{diag}(\pi_1, \ldots, \pi_n)$ contains the eigenvalues arranged in descending order. The eigenvectors in $\mathbf{V}$ span the column space of $\mathbf{L}^{r*}$, which from (5.17) is the subspace $\mathcal{J}^r$. $\mathbf{U}_M$ is also a basis for $\mathcal{J}^r$. These two bases $\mathbf{V}$ and $\mathbf{U}_M$ span the same subspace $\mathcal{J}^r$ and they differ by a rotation. So,

$$ \mathbf{V} = \mathbf{U}_M R, \tag{5.26} $$

where $R$ is an orthogonal rotation matrix. The eigenvalues $\Pi$ in (5.25) and the rotation matrix $R$ in (5.26) are obtained as follows.

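$$
\begin{aligned}
& \mathbf{L}^{r*} = \sum_{m=1}^{M} \alpha_m U_m^r \Sigma_m^r (U_m^r)^T, && \text{[from (5.16)]} \\
\Rightarrow\;& \mathbf{V} \Pi \mathbf{V}^T = \sum_{m=1}^{M} \alpha_m U_m^r \Sigma_m^r (U_m^r)^T, && \text{[from (5.25)]} \\
\Rightarrow\;& (\mathbf{U}_M R)\, \Pi\, (\mathbf{U}_M R)^T = \sum_{m=1}^{M} \alpha_m U_m^r \Sigma_m^r (U_m^r)^T, && \text{[from (5.26)]} \\
\Rightarrow\;& R \Pi R^T = \mathbf{U}_M^T \left( \sum_{m=1}^{M} \alpha_m U_m^r \Sigma_m^r (U_m^r)^T \right) \mathbf{U}_M, \\
\Rightarrow\;& R \Pi R^T = \sum_{m=1}^{M} \alpha_m \mathbf{U}_M^T U_m^r \Sigma_m^r (U_m^r)^T \mathbf{U}_M, \\
\Rightarrow\;& R \Pi R^T = \sum_{m=1}^{M} \alpha_m
   \begin{bmatrix} \Upsilon_1^T \\ \vdots \\ \Upsilon_M^T \end{bmatrix}
   U_m^r \Sigma_m^r (U_m^r)^T
   \begin{bmatrix} \Upsilon_1 & \ldots & \Upsilon_M \end{bmatrix}, \\
\Rightarrow\;& R \Pi R^T = \sum_{m=1}^{M} \alpha_m H_m,
\end{aligned} \tag{5.27}
$$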
m“1

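where $H_m \in \Re^{Mr \times Mr}$ is given by

$$ H_m = \begin{bmatrix} \Upsilon_1 & \ldots & \Upsilon_M \end{bmatrix}^T U_m^r \Sigma_m^r (U_m^r)^T \begin{bmatrix} \Upsilon_1 & \ldots & \Upsilon_M \end{bmatrix}. \tag{5.28} $$

While constructing the basis $\mathbf{U}_M$, the $\Upsilon_p$'s are appended iteratively such that whenever $p > m$, $\Upsilon_p$ is orthogonal to $U_m^r$ and $\Upsilon_p^T U_m^r = 0$. Thus, the matrix $H_m$ can be partitioned into $M^2$ blocks, each of size $(r \times r)$, and the $(i, j)$-th block of $H_m$ is given by

$$
H_m(i, j) =
\begin{cases}
\Upsilon_i^T U_m^r \Sigma_m^r (U_m^r)^T \Upsilon_j & \text{if } i \le m \text{ and } j \le m, \\
0 & \text{if } i > m \text{ or } j > m.
\end{cases}
$$

$$ \text{Let } \mathbf{H} = \sum_{m=1}^{M} \alpha_m H_m; \;\Rightarrow\; \mathbf{H} = R \Pi R^T. \tag{5.29} $$

This implies that by solving the eigen-decomposition of the $(Mr \times Mr)$ matrix $\mathbf{H}$, the eigenvalues $\Pi$ of $\mathbf{L}^{r*}$ and the rotation matrix $R$ are obtained. Then, $R$ is substituted in (5.26) to get the eigenvectors of $\mathbf{L}^{r*}$ in the columns of $\mathbf{V}$. The rank $r$ eigenspace of $\mathbf{L}^{r*}$ is then given by the two-tuple

$$ \Psi(\mathbf{L}^{r*}) = \langle \mathbf{V}^r, \Pi^r \rangle, \tag{5.30} $$

where $\Pi^r = \mathrm{diag}(\pi_1, \ldots, \pi_r)$ consists of the $r$ largest eigenvalues of $\Pi$ arranged in descending order, and $\mathbf{V}^r$ contains the corresponding $r$ eigenvectors in its columns.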
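The reduction to the small eigenvalue problem in (5.27) to (5.29) can be checked against a direct eigen-decomposition of $\mathbf{L}^{r*}$. The sketch below makes several assumptions: the similarity matrices are random, `alpha` holds arbitrary illustrative weights, the helper names are hypothetical, and the orthonormal basis is obtained from a single QR factorization of the stacked $U_m^r$ as a shortcut for the iterative construction described above.

```python
import numpy as np

def shifted_laplacian(W):
    """Shifted Laplacian I + D^{-1/2} W D^{-1/2}."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(len(W)) + d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def top_eigenpairs(L, r):
    """r largest eigenvalues (descending) and their eigenvectors."""
    vals, vecs = np.linalg.eigh(L)
    return vals[::-1][:r], vecs[:, ::-1][:, :r]

rng = np.random.default_rng(3)
n, r, M = 30, 4, 3
alpha = np.array([0.5, 0.3, 0.2])          # illustrative convex weights
eigenspaces = []
for _ in range(M):
    A = rng.random((n, n))
    eigenspaces.append(top_eigenpairs(shifted_laplacian((A + A.T) / 2), r))

# Orthonormal basis of the joint subspace (QR shortcut, see lead-in).
U_M, _ = np.linalg.qr(np.hstack([U for _, U in eigenspaces]))

# H = sum_m alpha_m (U_M^T U_m^r) Sigma_m^r (U_M^T U_m^r)^T, as in (5.28)-(5.29)
H = np.zeros((M * r, M * r))
for a, (sig, U) in zip(alpha, eigenspaces):
    T = U_M.T @ U                          # (M r) x r projection coefficients
    H += a * (T * sig) @ T.T

pi, R = np.linalg.eigh(H)
pi, R = pi[::-1], R[:, ::-1]               # descending order
V = U_M @ R                                # eigenvectors of L^{r*}, (5.26)

# Reference: form the n x n matrix L^{r*} explicitly and decompose it.
L_approx = sum(a * (U * sig) @ U.T for a, (sig, U) in zip(alpha, eigenspaces))
pi_direct = np.linalg.eigvalsh(L_approx)[::-1][: M * r]
```

The eigenvalues `pi` of the $(Mr \times Mr)$ matrix agree with the leading eigenvalues `pi_direct` of the $(n \times n)$ matrix, and the columns of `V` satisfy the eigenvector equation for `L_approx`, so the large decomposition is never needed.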

5.3.3 Proposed Algorithm


Given similarity matrices $W_1, \ldots, W_M$ corresponding to $M$ modalities $X_1, \ldots, X_M$, convex combination vector $\alpha = [\alpha_1, \ldots, \alpha_M]$ and rank $r$, the proposed algorithm, termed as CoALa, extracts a rank $r$ eigenspace for the approximate joint Laplacian $\mathbf{L}^{r*}$. For each modality $X_m$, the proposed algorithm first computes the eigen-decomposition of its shifted Laplacian $\mathcal{L}_m$ and then stores the $r \ge k$ largest eigenpairs in its eigenspace. Next, it iteratively computes the basis $\mathbf{U}_M$ and the eigen-decomposition of the new eigenvalue problem $\mathbf{H}$. The eigenvalues of $\mathbf{L}^{r*}$ are given by the eigenvalues of $\mathbf{H}$, while the eigenvectors of $\mathbf{H}$ are used to rotate the basis $\mathbf{U}_M$ and get the eigenvectors of $\mathbf{L}^{r*}$. Finally, k-means clustering is performed on the $k$ largest eigenvectors of $\mathbf{L}^{r*}$ to get the clusters of the multimodal data set. The proposed algorithm is described in Algorithm 5.2.

Algorithm 5.2 Proposed Algorithm: CoALa


Input: Similarity matrices $W_1, \ldots, W_M$, combination vector $\alpha = [\alpha_1, \ldots, \alpha_M]$, number of clusters $k$, and rank $r \ge k$.
Output: Clusters $A_1, \ldots, A_k$.
1: for $m \leftarrow 1$ to $M$ do
2:     Construct degree matrix $D_m$ and shifted normalized Laplacian $\mathcal{L}_m$ as in (5.1) and (5.7), respectively.
3:     Compute the eigen-decomposition of $\mathcal{L}_m$.
4:     Store the $r$ largest eigenvalues in $\Sigma_m^r$ and the corresponding eigenvectors in $U_m^r$ in the rank $r$ eigenspace of $X_m$.
5: end for
6: Compute initial basis $\mathbf{U}_1 \leftarrow U_1^r$.
7: for $m \leftarrow 1$ to $M - 1$ do
8:     Compute $S_{m+1}$, projected component $P_{m+1}$, and residual component $Q_{m+1}$ according to (5.20), (5.21), and (5.22), respectively.
9:     $\Upsilon_{m+1} \leftarrow$ Gram-Schmidt orthogonalization of $Q_{m+1}$.
10:     Update basis $\mathbf{U}_{m+1} \leftarrow [\mathbf{U}_m \;\; \Upsilon_{m+1}]$.
11: end for
12: For each modality $X_m$, compute $H_m$ as in (5.28).
13: Compute the new eigenvalue problem $\mathbf{H}$ as in (5.29).
14: Solve the eigen-decomposition of $\mathbf{H}$ to get $R$ and $\Pi$.
15: Compute eigenvectors $\mathbf{V} \leftarrow \mathbf{U}_M R$.
16: Compute joint eigenspace $\Psi(\mathbf{L}^{r*}) \leftarrow \langle \mathbf{V}^r, \Pi^r \rangle$ as in (5.30).
17: Find $k$ largest eigenvectors $\mathbf{V}^k = [v_1 \ldots v_k]$.
18: Perform clustering on the rows of $\mathbf{V}^k$ using the k-means algorithm.
19: Return clusters $A_1, \ldots, A_k$ from k-means clustering.

In the normalized spectral clustering by Ng et al. [162], the eigenvectors are row normalized (step 3 of Algorithm 5.1) before clustering. The advantage of this additional normalization has been shown for the ideal case where the similarity is zero between points belonging to different clusters and strictly positive between points in the same cluster. In such a situation, the eigenvalue 0 has multiplicity $k$, and the eigenvectors are given by the columns of $D^{1/2} E$, where $E$ is the ideal cluster indicator matrix as in (5.5). By normalizing each row by its norm, the eigenvector matrix coincides with the indicator matrix $E$, and the points become trivial to cluster. Ng et al. [162] have also shown that when the similarity matrix is "close" to the ideal case, properly normalized rows tend to cluster tightly around an orthonormal basis. However, in real-life data sets, the clusters are generally not well-separated due to the high dimension and heterogeneous nature of different modalities. As a result, the similarity matrices deviate far from the ideal block diagonal ones. So, additional row normalization may lead to undesirable scaling which is not advantageous for the subsequent k-means clustering step. Therefore, row normalization is not recommended in the proposed algorithm.

5.3.4 Computational Complexity
In the proposed algorithm, the first step is to compute the eigenspace of each modality $X_m$. Given the similarity matrix $W_m$ for modality $X_m$, its degree matrix $D_m$ and shifted Laplacian $\mathcal{L}_m$ are computed in step 2 in $O(n^2)$ and $O(n^3)$ time, respectively. Then, the eigen-decomposition of $\mathcal{L}_m$ is computed in step 3, which takes $O(n^3)$ time for the $(n \times n)$ matrix. Therefore, for $M$ modalities, the total complexity of initial eigenspace construction is $O(Mn^3)$. Next, the basis $\mathbf{U}_M$ is constructed in $M$ steps. At each step of basis construction, the matrices $S_{m+1}$, $P_{m+1}$, and $Q_{m+1}$ are computed in step 8 of the algorithm, which takes $O(nr^2)$ time. The Gram-Schmidt orthogonalization in step 9 also has complexity of $O(nr^2)$ for the $(n \times r)$ matrix $Q_{m+1}$. The total complexity of basis construction in steps 7-11 is $O(nr^2)$. The new eigenvalue problem $\mathbf{H}$ of size $(Mr \times Mr)$ is formulated in steps 12-13, which takes $O(M^3 r^3)$ time, owing to matrix multiplications. The subsequent eigen-decomposition of $\mathbf{H}$ in step 14 also takes $O(M^3 r^3)$ time. The rotation of $\mathbf{U}_M$ in step 15 has complexity of $O(nr^2)$. Finally, after the construction of the joint eigenspace $\Psi(\mathbf{L}^{r*})$, k-means clustering is performed on the $(n \times k)$ matrix $\mathbf{V}^k$, which has time complexity of $O(t_{max}\, n k^2)$, where $t_{max}$ is the maximum number of iterations the k-means algorithm runs.
Hence, the overall computational complexity of the proposed CoALa algorithm, to extract the joint eigenspace and perform spectral clustering on a multimodal data set, is $O(Mn^3 + nr^2 + M^3 r^3 + nr^2 + t_{max}\, n k^2) = O(Mn^3)$, assuming $M, r, k \ll n$. It implies that the overall complexity of the proposed algorithm is dominated by the individual eigenspace construction of the initial stage.

5.3.5 Choice of Convex Combination


The convex combination vector $\alpha$ determines the weight of the influence of each Laplacian on the final eigenspace. According to Fiedler's theory of spectral graph partitioning [66], the algebraic connectivity or the Fiedler value of a graph $G$ is the second smallest eigenvalue of the Laplacian of $G$. The Fiedler value represents the weight of the minimum cut that partitions the corresponding graph into two subgraphs. Moreover, by Property 5.3, the lower the eigenvalue or cut-weight of the normalized Laplacian $L$, the higher is the corresponding eigenvalue of its shifted Laplacian $\mathcal{L}$. The smallest eigenvalue of $L$ is 0, which corresponds to the largest eigenvalue, $\lambda_1$, of $\mathcal{L}$, which is 2, and the second largest eigenvalue, $\lambda_2$, reflects how separable the graph $G$ is. The corresponding eigenvector $u_2$, known as the Fiedler vector, can be used to partition the vertices of $G$ [200]. For example, if the Fiedler vector is $u_2 = (u_{21}, \ldots, u_{2j}, \ldots, u_{2n})$, spectral partitioning finds a splitting value $s$ such that the objects with $u_{2j} \le s$ belong to one set, while those with $u_{2j} > s$ belong to the other. Several popular choices for $s$ have been proposed, or the standard 2-means algorithm can be applied on $u_2$ to obtain a 2-partition. Once a 2-partition is obtained, the Silhouette index [179] can internally assess the quality of the partition. The Silhouette index lies in $[-1, 1]$ and a higher value indicates a better partition. A modality with good inherent cluster information is expected to have a higher Fiedler value as well as a higher Silhouette index on the Fiedler vector. Let $S(u_2^m)$ denote the Silhouette index computed based on a 2-partition of the Fiedler vector corresponding to the $m$-th modality. Thus, a measure of "relevance" of a modality $X_m$ is defined as

$$ \chi_m = \frac{1}{4} \lambda_2^m \left[ S(u_2^m) + 1 \right], \tag{5.31} $$

where $\lambda_2^m$ is the second largest eigenvalue of the shifted Laplacian $\mathcal{L}_m$ of $X_m$ and $u_2^m$ is the corresponding eigenvector. The term $(S(u_2^m) + 1)$ lies in $[0, 2]$, while the value of $\lambda_2^m$ can be at most 2. The factor $1/4$ acts as a normalizing factor which upper bounds the value of $\chi$ to 1. Hence, the value of the relevance measure $\chi$ lies in $[0, 1]$. A higher value of $\chi_m$ implies higher relevance and better cluster structure. Hence, $\chi$ can be used to obtain a linear ordering of the modalities $X_1, \ldots, X_M$. Let $X_{(1)}, \ldots, X_{(m)}, \ldots, X_{(M)}$ be the ordering of $X_1, \ldots, X_m, \ldots, X_M$ based on decreasing value of relevance $\chi$. In the convex combination vector $\alpha$, the component $\alpha_{(m)}$ corresponding to the weighting factor of modality $X_{(m)}$ is given by

$$ \alpha_{(m)} = \chi_{(m)} \beta^{-m}, \quad \text{where } \beta > 1. \tag{5.32} $$

This implies that based on the index of $X_{(m)}$ in the ordering $X_{(1)}, \ldots, X_{(M)}$, the relevance value of $X_{(m)}$ is damped by a factor of $\beta^m$ and then used as its contribution in the convex combination $\alpha$. Thus, in $\alpha$, the most relevant modality has a contribution of $\chi_{(1)} / \beta$, while the second most relevant one contributes $\chi_{(2)} / \beta^2$, and so on. This assignment of $\alpha$ upweights modalities with better cluster structure, while damping the effect of irrelevant ones having poor structure.
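The weighting scheme of (5.31) and (5.32) amounts to a few lines of arithmetic, sketched below. The function name `combination_weights`, the value of `beta`, and the input numbers are all made up for illustration. The final rescaling so that the weights sum to one is an added assumption: (5.32) as stated does not normalize, whereas (5.11) asks for weights that sum to one.

```python
def combination_weights(fiedler_values, silhouettes, beta=1.5):
    """Relevance (5.31) and damped weights (5.32) for each modality.

    fiedler_values[m]: second largest eigenvalue of the m-th shifted Laplacian.
    silhouettes[m]:    Silhouette index of a 2-partition of its Fiedler vector.
    """
    chi = [0.25 * lam * (s + 1.0) for lam, s in zip(fiedler_values, silhouettes)]
    # Rank modalities by decreasing relevance; position 1 is the most relevant.
    order = sorted(range(len(chi)), key=lambda m: chi[m], reverse=True)
    alpha = [0.0] * len(chi)
    for position, m in enumerate(order, start=1):
        alpha[m] = chi[m] * beta ** (-position)
    return chi, alpha

# Hypothetical inputs for three modalities (values made up for illustration).
chi, alpha = combination_weights([1.8, 1.2, 1.6], [0.6, 0.1, -0.2], beta=2.0)

# Assumed extra step: rescale so that the weights sum to one, as (5.11) requires.
total = sum(alpha)
alpha_normalized = [a / total for a in alpha]
```

With these inputs the relevance values are 0.72, 0.33, and 0.32, so the ordering follows the input order and the damped weights are 0.36, 0.0825, and 0.04 before rescaling.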

5.4 Quality of Eigenspace Approximation


The proposed algorithm constructs the eigenspace $\Psi(\mathbf{L}^{r*})$ from a convex combination of rank $r$ approximations of the individual Laplacians $\mathcal{L}_m$. This eigenspace differs from the full-rank eigenspace $\Psi(\mathbf{L}^r)$, which is obtained from the convex combination of the complete or full-rank information of the individual Laplacians. In real-life multimodal data sets, the individual modalities inherently contain noisy information. The approximation approach prevents propagation of noise from the individual modalities into the final approximate eigenspace $\Psi(\mathbf{L}^{r*})$. As a consequence, the approximate subspace is expected to preserve better cluster structure compared to the full-rank one. However, in the ideal case, where the clusters in the individual modalities are well-separated, the approximation approach may lose some important information. So, the difference between the two eigenspaces $\Psi(\mathbf{L}^r)$ and $\Psi(\mathbf{L}^{r*})$ is evaluated as a function of the approximation rank $r$, and can be quantified in terms of their eigenvalues and eigenvectors. The difference between the eigenvalues can be measured directly in terms of their magnitude, while the difference between the eigenvectors is measured in terms of the difference between the subspaces spanned by the two sets of eigenvectors. Similar to Chapter 4, the principal angles between subspaces (PABS) [16, 70] are used here to measure the difference between two subspaces. The PABS is a generalization of the concept of angle between two vectors to a set of angles between two subspaces. The principal angles between two subspaces $A$ and $B$ and their corresponding principal sines, denoted by $\sin\Theta(A, B)$, are defined in Definition 4.1 and Definition 4.2 of Chapter 4.
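Principal sines between two subspaces can be computed as sketched below. Definitions 4.1 and 4.2 are not reproduced in this section, so the sketch assumes the standard characterization in which the cosines of the principal angles are the singular values of $Q_A^T Q_B$ for orthonormal bases $Q_A$ and $Q_B$; the function name `principal_sines` is hypothetical and both inputs are assumed to have full column rank.

```python
import numpy as np

def principal_sines(A, B):
    """Sines of the principal angles between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)                # orthonormal basis of C(A)
    Qb, _ = np.linalg.qr(B)                # orthonormal basis of C(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)   # guard against round-off above 1
    return np.sqrt(1.0 - cosines ** 2)

I4 = np.eye(4)
# Same subspace described by two different bases: all sines are (nearly) 0.
same = principal_sines(I4[:, :2], I4[:, :2] @ np.array([[1.0, 1.0], [0.0, 1.0]]))
# Mutually orthogonal subspaces: all sines are 1.
orthogonal = principal_sines(I4[:, :2], I4[:, 2:])
```

Computing sines as $\sqrt{1 - \cos^2}$ loses accuracy for very small angles, so a more careful routine would be preferable when the two subspaces are nearly identical.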
In order to bound the difference between the eigenvectors of the two eigenspaces $\Psi(\mathbf{L}^r)$ and $\Psi(\mathbf{L}^{r*})$, the theory of perturbation of invariant subspaces [202] and the Davis-Kahan theorem [49] are used. The eigenvalues and eigenvectors of the full-rank eigenspace $\Psi(\mathbf{L}^r)$ are given by $\Gamma^r = \mathrm{diag}(\gamma_1, \ldots, \gamma_r)$ and $\mathbf{Z}^r$, respectively, as in (5.14), where $\gamma_r \ne \gamma_{r+1}$, while those for the approximate eigenspace $\Psi(\mathbf{L}^{r*})$ are given by $\Pi^r = \mathrm{diag}(\pi_1, \ldots, \pi_r)$ and $\mathbf{V}^r$, respectively, as in (5.30). The columns of $\mathbf{Z}^r$ span the full-rank subspace formed by the convex combination of the full-rank $\mathcal{L}_m$'s, while those of $\mathbf{V}^r$ span the approximate subspace formed by the rank $r$ approximations of the $\mathcal{L}_m$'s. The difference between the subspaces spanned by the column vectors of $\mathbf{Z}^r$ and $\mathbf{V}^r$ is given by the following theorem.

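Theorem 5.1. For any unitarily invariant norm $\|\cdot\|$, the following bound holds on the principal angles between the subspaces defined by $C(\mathbf{Z}^r)$ and $C(\mathbf{V}^r)$:

$$ \left\| \sin\Theta\left( C(\mathbf{Z}^r), C(\mathbf{V}^r) \right) \right\| \le \frac{\left\| \left( \sum_{m=1}^{M} \alpha_m \mathcal{L}_m^{r\perp} \right) \mathbf{V}^r \right\|}{\left( \pi_r - \pi_{r+1} - \sum_{m=1}^{M} \alpha_m \lambda_{r+1}^m \right)}, \tag{5.33} $$

assuming $\pi_r > \pi_{r+1} + \sum_{m=1}^{M} \alpha_m \lambda_{r+1}^m$.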

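Proof. The matrices $\mathbf{Z}$ and $\Gamma$ contain the eigenpairs of $\mathbf{L}$. For the given $r$, let $\mathbf{Z}$ and $\Gamma$ be partitioned as

$$ \mathbf{Z} = \begin{bmatrix} \mathbf{Z}^r & \mathbf{Z}^{r\perp} \end{bmatrix} \quad \text{and} \quad \Gamma = \begin{bmatrix} \Gamma^r & 0 \\ 0 & \Gamma^{r\perp} \end{bmatrix}. \tag{5.34} $$

Since $\mathbf{Z}^r$ and $\mathbf{Z}^{r\perp}$ contain eigenvectors of $\mathbf{L}$,

$$ \mathbf{L} \mathbf{Z}^r = \mathbf{Z}^r \Gamma^r \subset C(\mathbf{Z}^r). \tag{5.35} $$

This implies that the transformation of any vector $v \in C(\mathbf{Z}^r)$ lies in $C(\mathbf{Z}^r)$ itself. So, $\mathbf{Z}^r$ spans an invariant subspace of the matrix $\mathbf{L}$ [202]. Similarly,

$$ \mathbf{L} \mathbf{Z}^{r\perp} = \mathbf{Z}^{r\perp} \Gamma^{r\perp} \subset C(\mathbf{Z}^{r\perp}). \tag{5.36} $$

So, $\mathbf{Z}^{r\perp}$ also spans an invariant subspace of $\mathbf{L}$. Moreover, the columns of $\mathbf{Z}^{r\perp}$ span the subspace orthogonal to the one spanned by the columns of $\mathbf{Z}^r$. Now, let

$$ B_1 = (\mathbf{Z}^r)^T \mathbf{L} \mathbf{Z}^r = \Gamma^r \quad \text{and} \quad B_2 = (\mathbf{Z}^{r\perp})^T \mathbf{L} \mathbf{Z}^{r\perp} = \Gamma^{r\perp}. \tag{5.37} $$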

According to the theory of invariant subspaces [202], $B_1$ and $B_2$ are called the representations of $\mathbf{L}$ with respect to the bases $\mathbf{Z}^r$ and $\mathbf{Z}^{r\perp}$, respectively. The matrix $B_1 = \Gamma^r$ contains eigenvalues $\gamma_1, \ldots, \gamma_r$, while $B_2 = \Gamma^{r\perp}$ contains eigenvalues $\gamma_{r+1}, \ldots, \gamma_n$. Let $\Omega(B)$ denote the set of eigenvalues of a matrix $B$. Under the assumption that $\gamma_r \ne \gamma_{r+1}$, we have

$$ \Omega(B_1) \cap \Omega(B_2) = \emptyset. \tag{5.38} $$

It follows from (5.38) that the eigenvalues of $B_1$ and $B_2$ are non-intersecting. So, $\mathbf{Z}^r$ spans a simple invariant subspace of $\mathbf{L}$ with its complementary subspace being spanned by $\mathbf{Z}^{r\perp}$. Also, $\begin{bmatrix} \mathbf{Z}^r & \mathbf{Z}^{r\perp} \end{bmatrix}$ is unitary and $\mathbf{L}$ can be decomposed as

$$ \mathbf{L} = \mathbf{Z}^r B_1 (\mathbf{Z}^r)^T + \mathbf{Z}^{r\perp} B_2 (\mathbf{Z}^{r\perp})^T. \tag{5.39} $$

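The decomposition in (5.39) is called the spectral resolution of $\mathbf{L}$ along $\mathbf{Z}^r$ and $\mathbf{Z}^{r\perp}$. Now, let $\mathbf{L}$ be written as

$$
\begin{aligned}
& \mathbf{L} = \sum_{m=1}^{M} \alpha_m \mathcal{L}_m = \sum_{m=1}^{M} \alpha_m \left( \mathcal{L}_m^r + \mathcal{L}_m^{r\perp} \right), \\
\Rightarrow\;& \mathbf{L} = \sum_{m=1}^{M} \alpha_m \mathcal{L}_m^r + \sum_{m=1}^{M} \alpha_m \mathcal{L}_m^{r\perp}, \\
\Rightarrow\;& \mathbf{L} = \mathbf{L}^{r*} + \mathbf{L}^{r\perp *}, \quad \text{where } \mathbf{L}^{r\perp *} = \sum_{m=1}^{M} \alpha_m \mathcal{L}_m^{r\perp}.
\end{aligned} \tag{5.40}
$$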

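Let the eigenvectors and eigenvalues of $\mathbf{L}^{r*}$ be partitioned as

$$ \mathbf{V} = \begin{bmatrix} \mathbf{V}^r & \mathbf{V}^{r\perp} \end{bmatrix} \quad \text{and} \quad \Pi = \begin{bmatrix} \Pi^r & 0 \\ 0 & \Pi^{r\perp} \end{bmatrix}. \tag{5.41} $$

Since $\mathbf{V}^r$ contains eigenvectors of $\mathbf{L}^{r*}$,

$$ \mathbf{L}^{r*} \mathbf{V}^r = \mathbf{V}^r \Pi^r \subset C(\mathbf{V}^r). \tag{5.42} $$

This implies that $\mathbf{V}^r$ spans an invariant subspace of $\mathbf{L}^{r*}$ and $\Pi^r$ is a Hermitian matrix of order $r$ which gives the representation of $\mathbf{L}^{r*}$ with respect to the basis $\mathbf{V}^r$. According to (5.40), $\mathbf{L}$ can be written as the sum of $\mathbf{L}^{r*}$ and a perturbation $\mathbf{L}^{r\perp *}$. The perturbation theory [202] analyzes how near the perturbed subspace $C(\mathbf{V}^r)$ is to an invariant subspace $C(\mathbf{Z}^r)$ of $\mathbf{L}$, in terms of the perturbation matrix $\mathbf{L}^{r\perp *}$. So, the residual $\mathbf{R}$ of the matrix $\mathbf{L}$, with respect to a perturbed basis $\mathbf{V}^r$ and the Hermitian matrix $\Pi^r$, is given by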

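$$
\begin{aligned}
\mathbf{R} &= \mathbf{L} \mathbf{V}^r - \mathbf{V}^r \Pi^r \\
&= \left( \mathbf{L}^{r*} + \sum_{m=1}^{M} \alpha_m \mathcal{L}_m^{r\perp} \right) \mathbf{V}^r - \mathbf{V}^r \Pi^r && \text{[from (5.40)]} \\
&= \mathbf{L}^{r*} \mathbf{V}^r + \left( \sum_{m=1}^{M} \alpha_m \mathcal{L}_m^{r\perp} \right) \mathbf{V}^r - \mathbf{V}^r \Pi^r \\
&= \mathbf{V}^r \Pi^r + \left( \sum_{m=1}^{M} \alpha_m \mathcal{L}_m^{r\perp} \right) \mathbf{V}^r - \mathbf{V}^r \Pi^r \\
&= \left( \sum_{m=1}^{M} \alpha_m \mathcal{L}_m^{r\perp} \right) \mathbf{V}^r.
\end{aligned} \tag{5.43}
$$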

The matrices $\Pi^r$ and $B_2 = \Gamma^{r\perp}$ consist of the eigenvalues of the perturbed subspace $C(\mathbf{V}^r)$ and the complementary invariant subspace $C(\mathbf{Z}^{r\perp})$, respectively. According to the Davis-Kahan theorem [49], the bound on the difference between an invariant subspace $C(\mathbf{Z}^r)$ and its perturbation $C(\mathbf{V}^r)$ holds only if the eigenvalues of the perturbed subspace and the complementary invariant subspace are non-intersecting. So, the ranges in which the eigenvalues of $\Pi^r$ and $B_2$ lie are derived.
The matrix $\Pi$ contains the eigenvalues of $\mathbf{L}^{r*}$, given by $\Pi = \mathrm{diag}(\pi_1, \ldots, \pi_r, \pi_{r+1}, \ldots, \pi_n)$, which can be partitioned into $\Pi^r$ and $\Pi^{r\perp}$ as in (5.41). So, the eigenvalues of $\Pi^r$ satisfy

$$ \Omega(\Pi^r) \in [\pi_r, \pi_1]. \tag{5.44} $$

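The range of the eigenvalues of $B_2$ is derived next. Since each $\mathcal{L}_m$ is a real symmetric matrix, its low-rank approximations $\mathcal{L}_m^r$ and $\mathcal{L}_m^{r\perp}$ are also real symmetric matrices. So, each $\mathcal{L}_m^r$ and $\mathcal{L}_m^{r\perp}$ have the Hermitian property and $\mathbf{L}^{r\perp *}$ is the sum of $M$ Hermitian matrices according to (5.40). The eigenvalues of $\mathcal{L}_m^{r\perp}$ lie in $[\lambda_{r+1}^m, \lambda_n^m]$, and those of $\alpha_m \mathcal{L}_m^{r\perp}$ lie in $[\alpha_m \lambda_{r+1}^m, \alpha_m \lambda_n^m]$. Applying Weyl's inequality [202] for the eigenvalues of the sum of Hermitian matrices to $\mathbf{L}^{r\perp *}$, we get

$$ \Omega(\mathbf{L}^{r\perp *}) \in \left[ \sum_{m=1}^{M} \alpha_m \lambda_{r+1}^m, \; \sum_{m=1}^{M} \alpha_m \lambda_n^m \right]. \tag{5.45} $$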

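The eigenvalues of $\mathbf{L}$ lie in $[\gamma_n, \gamma_1]$, while those of $\mathbf{L}^{r*}$ lie in $[\pi_n, \pi_1]$. The range of eigenvalues of $\mathbf{L}^{r\perp *}$ is given by (5.45). Again, $\mathbf{L}$ $(= \mathbf{L}^{r*} + \mathbf{L}^{r\perp *})$ is the sum of two Hermitian matrices $\mathbf{L}^{r*}$ and $\mathbf{L}^{r\perp *}$. So, using Weyl's inequality, the eigenvalues of $\mathbf{L}$ satisfy

$$ \pi_j + \sum_{m=1}^{M} \alpha_m \lambda_n^m \le \gamma_j \le \pi_j + \sum_{m=1}^{M} \alpha_m \lambda_{r+1}^m, \tag{5.46} $$

for $j = 1, \ldots, n$. As stated previously, $B_2 = \Gamma^{r\perp}$ consists of eigenvalues $\gamma_{r+1}, \ldots, \gamma_n$ of $\mathbf{L}$.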

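Thus, the maximum eigenvalue of $B_2$ is $\gamma_{r+1}$, which using (5.46) is bounded by

$$ \gamma_{r+1} \le \pi_{r+1} + \sum_{m=1}^{M} \alpha_m \lambda_{r+1}^m. \tag{5.47} $$

According to (5.44), the minimum eigenvalue of $\Pi^r$ is $\pi_r$. Let $\delta$ be the minimum of the separation between the eigenvalues of $\Pi^r$ and $B_2$, which is given by

$$
\begin{aligned}
\delta &= \min\{\Omega(\Pi^r)\} - \max\{\Omega(B_2)\} \\
&= \pi_r - \pi_{r+1} - \sum_{m=1}^{M} \alpha_m \lambda_{r+1}^m > 0.
\end{aligned} \tag{5.48}
$$

$$ \text{So, } \pi_r - \delta = \pi_{r+1} + \sum_{m=1}^{M} \alpha_m \lambda_{r+1}^m. \tag{5.49} $$

From (5.47) and (5.49), we get $\gamma_{r+1} \le (\pi_r - \delta)$. Moreover, as $\gamma_n \le \gamma_{r+1}$, $\gamma_n \le (\pi_r - \delta)$. Also, $(\pi_1 + \delta) \ge (\pi_r - \delta)$, as $\pi_1 \ge \pi_r$. This implies that the eigenvalues of $B_2$, that is, $\gamma_{r+1}, \ldots, \gamma_n$, satisfy

$$ \Omega(B_2) \in \mathbb{R} \setminus [\pi_r - \delta, \pi_1 + \delta]. \tag{5.50} $$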

The constraints in (5.44) and (5.50) imply that the eigenvalues of $\Pi^r$ are included in the interval $[\pi_r, \pi_1]$, while those of $B_2$ are excluded from the interval $[\pi_r - \delta, \pi_1 + \delta]$, where $\delta > 0$. So, for an invariant subspace $C(\mathbf{Z}^r)$, the eigenvalues of its complementary subspace $C(\mathbf{Z}^{r\perp})$ and those of its perturbed subspace $C(\mathbf{V}^r)$ are non-intersecting. Finally, according to the Davis-Kahan theorem [49], which bounds the difference between an invariant subspace and its perturbation, for any unitarily invariant norm $\|\cdot\|$,

$$ \left\| \sin\Theta\left( C(\mathbf{Z}^r), C(\mathbf{V}^r) \right) \right\| \le \frac{\|\mathbf{R}\|}{\delta}. \tag{5.51} $$

Substituting the values of $\mathbf{R}$ and $\delta$ from (5.43) and (5.48), respectively, in (5.51), we get

$$ \left\| \sin\Theta\left( C(\mathbf{Z}^r), C(\mathbf{V}^r) \right) \right\| \le \frac{\left\| \left( \sum_{m=1}^{M} \alpha_m \mathcal{L}_m^{r\perp} \right) \mathbf{V}^r \right\|}{\left( \pi_r - \pi_{r+1} - \sum_{m=1}^{M} \alpha_m \lambda_{r+1}^m \right)}. \tag{5.52} $$

This concludes the proof.

The above theorem holds for any set of $M$ symmetric positive semi-definite matrices and their convex combination.

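Corollary 5.1. Let $\mathrm{tr}(B)$ denote the trace of matrix $B$. Then,

$$ \left\| \sin\Theta\left( C(\mathbf{Z}^r), C(\mathbf{V}^r) \right) \right\|_F^2 \le \frac{\mathrm{tr}\left( (\mathbf{V}^r)^T \left( \sum_{m=1}^{M} \alpha_m \mathcal{L}_m^{r\perp} \right)^2 \mathbf{V}^r \right)}{\left( \pi_r - \pi_{r+1} - \sum_{m=1}^{M} \alpha_m \lambda_{r+1}^m \right)^2}. \tag{5.53} $$

Proof. The Frobenius norm of a matrix $B$, given by $\|B\|_F = \sqrt{\mathrm{tr}(B^T B)}$, is a unitarily invariant norm. The squared Frobenius norm of $\mathbf{R}$ in (5.43) is given by

$$ \|\mathbf{R}\|_F^2 = \mathrm{tr}\left( (\mathbf{V}^r)^T \left( \sum_{m=1}^{M} \alpha_m \mathcal{L}_m^{r\perp} \right)^2 \mathbf{V}^r \right). \tag{5.54} $$

The Davis-Kahan theorem holds for any unitarily invariant norm. So, substituting the value of $\delta$ and the Frobenius norm of $\mathbf{R}$ in (5.51) and squaring both sides, the required bound in (5.53) is obtained.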

For a given value of $r$, $\left\| \sin\Theta\left( C(\mathbf{Z}^r), C(\mathbf{V}^r) \right) \right\|_F^2$ measures the difference between the full-rank and approximate subspaces, in terms of the sum of squares of the $r$ principal sines between them. To make the differences comparable across different values of $r$, the mean squared principal sine is considered, which is given by

$$ \Phi_r = \frac{1}{r} \left\| \sin\Theta\left( C(\mathbf{Z}^r), C(\mathbf{V}^r) \right) \right\|_F^2 \le \frac{\mathrm{tr}\left( (\mathbf{V}^r)^T \left( \sum_{m=1}^{M} \alpha_m \mathcal{L}_m^{r\perp} \right)^2 \mathbf{V}^r \right)}{r \left( \pi_r - \pi_{r+1} - \sum_{m=1}^{M} \alpha_m \lambda_{r+1}^m \right)^2}. \tag{5.55} $$

The matrix $\mathcal{L}_m^{r\perp}$ denotes the approximation of $\mathcal{L}_m$ using eigenpairs $(r + 1)$ to $n$. As $r$ approaches the full rank $n$, the approximation of $\mathcal{L}_m$ using the remaining $(n - r)$ eigenpairs approaches 0, that is, $\mathcal{L}_m^{r\perp} \to 0$. Hence,

$$ \lim_{r \to n} \sum_{m=1}^{M} \alpha_m \mathcal{L}_m^{r\perp} = 0. \tag{5.56} $$

Taking limits in (5.55) and then substituting the value of (5.56) in the right hand side of (5.55), we get

$$ \lim_{r \to n} \Phi_r = 0. \tag{5.57} $$

This implies that, as the rank $r$ approaches the full rank of the individual $\mathcal{L}_m$, the difference between the full-rank and approximate subspaces converges to 0, that is, the approximate subspace converges to the full-rank subspace.
The eigenvalues of $\mathbf{L}$ and $\mathbf{L}^{r*}$ are given by the elements of the diagonal matrices $\Gamma$ and $\Pi$, respectively. The bound on the difference between the eigenvalues is given as follows.

Theorem 5.2. The eigenvalues of $\mathbf{L}$ and $\mathbf{L}^{r*}$ satisfy the following bound:

$$ \sum_{j=1}^{n} (\gamma_j - \pi_j)^2 \le \sum_{j=r+1}^{n} \sum_{m=1}^{M} \alpha_m (\lambda_j^m)^2. \tag{5.58} $$

Proof. The decomposition of $\mathbf{L}$ in (5.40) gives $\mathbf{L} = \mathbf{L}^{r*} + \mathbf{L}^{r\perp *}$. Both $\mathbf{L}^{r*}$ and $\mathbf{L}^{r\perp *}$ are low-rank approximations of the real-symmetric matrix $\mathbf{L}$ using its eigenpairs. So, $\mathbf{L}^{r*}$ and $\mathbf{L}^{r\perp *}$ are also real and symmetric. The eigenvalues of $\mathbf{L}^{r*}$ are given by $\pi_1, \ldots, \pi_n$, while those of $\mathbf{L}^{r\perp *}$ are given by

$$ \sum_{m=1}^{M} \alpha_m \lambda_{r+1}^m, \; \ldots, \; \sum_{m=1}^{M} \alpha_m \lambda_n^m, \tag{5.59} $$

according to (5.45). $\mathbf{L}$ is the sum of two real-symmetric matrices and has eigenvalues $\gamma_1, \ldots, \gamma_n$. The squared Frobenius norm of $\mathbf{L}^{r\perp *}$, given by the sum of squares of its eigenvalues, is

$$ \left\| \mathbf{L}^{r\perp *} \right\|_F^2 = \sum_{j=r+1}^{n} \sum_{m=1}^{M} \alpha_m (\lambda_j^m)^2. \tag{5.60} $$

According to the Wielandt-Hoffman theorem [76], the sum of squares of the difference between the eigenvalues of $\mathbf{L}$ and $\mathbf{L}^{r*}$ is bounded by the squared Frobenius norm of the residual $\mathbf{L}^{r\perp *}$. Therefore,

$$ \sum_{j=1}^{n} (\gamma_j - \pi_j)^2 \le \sum_{j=r+1}^{n} \sum_{m=1}^{M} \alpha_m (\lambda_j^m)^2. \tag{5.61} $$

This proves the bound on the eigenvalues.
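The bound of Theorem 5.2 can be examined numerically. The sketch below uses random similarity matrices and arbitrary illustrative weights, with a hypothetical helper `shifted_laplacian`; it compares the left-hand side of (5.58) with the squared Frobenius norm of the residual $\mathbf{L} - \mathbf{L}^{r*}$ and with the right-hand side of (5.58).

```python
import numpy as np

def shifted_laplacian(W):
    """Shifted Laplacian I + D^{-1/2} W D^{-1/2}."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(len(W)) + d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

rng = np.random.default_rng(4)
n, r, M = 25, 5, 3
alpha = np.array([0.5, 0.3, 0.2])          # illustrative convex weights

L_full = np.zeros((n, n))                  # joint Laplacian L
L_approx = np.zeros((n, n))                # approximate joint Laplacian L^{r*}
bound = 0.0                                # right-hand side of (5.58)
for a in alpha:
    A = rng.random((n, n))
    Lm = shifted_laplacian((A + A.T) / 2)
    vals, vecs = np.linalg.eigh(Lm)
    vals, vecs = vals[::-1], vecs[:, ::-1]             # descending order
    L_full += a * Lm
    L_approx += a * (vecs[:, :r] * vals[:r]) @ vecs[:, :r].T
    bound += a * np.sum(vals[r:] ** 2)

gamma = np.linalg.eigvalsh(L_full)[::-1]   # eigenvalues of L, descending
pi = np.linalg.eigvalsh(L_approx)[::-1]    # eigenvalues of L^{r*}, descending
lhs = np.sum((gamma - pi) ** 2)            # left-hand side of (5.58)
residual_norm = np.linalg.norm(L_full - L_approx, "fro") ** 2
```

In this experiment `lhs` does not exceed `residual_norm`, and `residual_norm` does not exceed `bound`, so the chain of inequalities behind (5.58) is respected on the sampled data.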

Following analysis establishes that the difference between the eigenvalues of L and Lr˚
approaches to 0 as the rank r approaches to the full rank of L. Let
n
1 1 ÿ
∆r “ trtpΓ ´ Πq2 u “ pγj ´ πj q2 . (5.62)
n n j“1

According to (5.58),
n M
1 ÿ ÿ 2
∆r ď αm pλm
j q . (5.63)
n j“r`1 m“1

So, (5.63) gives an upper bound on $\Delta_r$, the mean of the squared differences between the
eigenvalues of $L$ and $L^{r*}$. For $m = 1, \ldots, M$, each $L_m$ is a positive semi-definite matrix
with $n$ eigenvalues $\lambda_1^m \ge \ldots \ge \lambda_n^m \ge 0$. As the value of $r$ approaches $n$, the eigenvalue
$\lambda_r^m$ approaches the smallest eigenvalue $\lambda_n^m$. Moreover, as there are only $n$ eigenvalues, the
value of $\lambda_j^m$ is 0 for any $j > r$ when $r$ tends to $n$. Therefore,

\[
\lim_{r \to n} \Delta_r = \lim_{r \to n} \frac{1}{n} \sum_{j=r+1}^{n} \sum_{m=1}^{M} \alpha_m (\lambda_j^m)^2 = 0. \tag{5.64}
\]

The limits in (5.57) and (5.64) imply that as the approximation rank $r$ approaches
the full rank, the approximate eigenspace $\Psi(L^{r*})$ converges to the full-rank one $\Psi(L^{r})$, in
terms of both eigenvectors and eigenvalues.
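
The behaviour of the bound in (5.63) as $r$ grows can be illustrated with a small numerical sketch. The function names, the weights, and the eigenvalue lists below are made-up assumptions used only for illustration; they are not taken from the thesis or its R implementation.

```python
def delta_r(gamma, pi):
    """Mean squared difference between two eigenvalue lists, as in (5.62)."""
    if len(gamma) != len(pi):
        raise ValueError("both spectra must contain n eigenvalues")
    return sum((g - p) ** 2 for g, p in zip(gamma, pi)) / len(gamma)


def delta_bound(alphas, spectra, r):
    """Right-hand side of (5.63) for a given rank r.

    alphas  : M convex-combination weights (hypothetical values below)
    spectra : M lists, each holding the n eigenvalues of one L_m in
              non-increasing order
    """
    n = len(spectra[0])
    tail = 0.0
    for alpha, eigenvalues in zip(alphas, spectra):
        # eigenvalues[r:] are lambda_{r+1}, ..., lambda_n in 1-based indexing
        tail += alpha * sum(lam ** 2 for lam in eigenvalues[r:])
    return tail / n


# Toy input: M = 2 modalities with n = 4 eigenvalues each (made-up numbers).
alphas = [0.6, 0.4]
spectra = [[2.0, 1.0, 0.5, 0.0], [1.5, 1.0, 0.2, 0.0]]
bounds = [delta_bound(alphas, spectra, r) for r in range(1, 5)]
```

In this toy setting the list `bounds` is non-increasing in $r$ and its last entry, for $r = n$, is zero because the tail sum is empty, which mirrors the limit in (5.64).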

5.5 Experimental Results and Discussion


The performance of the proposed CoALa algorithm is compared with that of ten existing
integrative clustering approaches, namely, cluster of cluster analysis (COCA) [93],
LRAcluster [243], joint and individual variance explained (JIVE) [141], angle-based JIVE
(A-JIVE) [63], iCluster [192], principal component analysis (PCA) on the concatenated
data (PCA-con) [6], similarity network fusion (SNF) [234], normality based low-rank
subspace (NormS) [111] (proposed in Chapter 3), and selective update of relevant eigenspaces
(SURE) [112] (proposed in Chapter 4). The experimental setup for the existing approaches
is the same as that of Chapter 3. The performance of the JIVE algorithm, from this
chapter onwards, is reported for permutation test based rank estimation (JIVE-Perm),
as that is the default choice mentioned in [141]. The clustering performance
corresponding to Bayesian information criterion based rank estimation (JIVE-BIC)
is reported in Tables 4.4 and 4.5 of Chapter 4. These tables also show that, of the
two consensus clustering approaches, namely, COCA and Bayesian consensus clustering
(BCC) [140], COCA has better clustering performance on the majority of the data sets. Hence,
only the performance of COCA is reported from this chapter onwards. The results of the BCC
algorithm are available in Tables 4.4 and 4.5 of the previous chapter. The R implementation of
the proposed algorithm is available at https://github.com/Aparajita-K/CoALa.
The performance of different algorithms is evaluated using six external cluster evalua-
tion indices, namely, accuracy, normalized mutual information (NMI), adjusted Rand index
(ARI), F-measure, Rand index, and purity, which compare the identified clusters with the
clinically established cancer subtypes and the ground truth class information for the bench-
mark data sets. Experimental results corresponding to Jaccard and Dice coefficients are
reported in [113]. For the low-rank based approaches, where clustering is performed in a
subspace, four internal cluster validity indices, namely, Silhouette, Dunn, Davies-Bouldin
(DB), and Xie-Beni indices are used to evaluate the compactness and separability of the
clusters in the extracted subspace. The evaluation indices are described in Appendix B.
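
As an illustration of how external indices compare identified clusters with ground truth labels, the sketch below computes purity and the Rand index using their standard textbook definitions. It is assumed, not verified here, that Appendix B uses the same definitions; the function names and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def purity(true_labels, pred_labels):
    """Fraction of objects that fall in the majority true class of their cluster."""
    members_by_cluster = {}
    for truth, cluster in zip(true_labels, pred_labels):
        members_by_cluster.setdefault(cluster, []).append(truth)
    matched = sum(
        Counter(members).most_common(1)[0][1]
        for members in members_by_cluster.values()
    )
    return matched / len(true_labels)


def rand_index(true_labels, pred_labels):
    """Fraction of object pairs on which the two partitions agree."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agreements = sum(
        (true_labels[i] == true_labels[j]) == (pred_labels[i] == pred_labels[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Toy example: six objects, two true classes, two identified clusters.
truth = [0, 0, 0, 1, 1, 1]
found = [1, 1, 0, 0, 0, 0]
toy_purity = purity(truth, found)      # 5 of 6 objects match their cluster majority
toy_rand = rand_index(truth, found)    # 10 of 15 pairs are treated consistently
```

Both indices are invariant to how the cluster labels are named, which is why the swapped label names in `found` do not matter.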

5.5.1 Description of Data Sets


In this work, the clustering performance is extensively studied on eight real-life cancer
data sets, obtained from The Cancer Genome Atlas (TCGA) (https://cancergenome.nih.gov/).
The data sets considered here are colorectal carcinoma (CRC), lower grade glioma (LGG),
stomach adenocarcinoma (STAD), breast adenocarcinoma (BRCA), ovarian carcinoma (OV),
cervical carcinoma (CESC), lung carcinoma (LUNG), and kidney carcinoma (KIDNEY).
The CRC has two subtypes: colon and rectum carcinoma, depending on their site of origin.
The LUNG and KIDNEY cancers have two and three histological
subtypes, respectively, based on the tissue of origin. For the other cancers, TCGA research
network has identified three subtypes in LGG [217] and CESC [218], and four subtypes
in STAD [216], BRCA [214], and OV [215], by comprehensive integrated analysis. The
CRC, LGG, STAD, BRCA, CESC, OV, LUNG, and KIDNEY data sets have 464, 267,
242, 398, 124, 334, 671, and 737 samples, respectively. For each of these data sets, four
different omic modalities are considered, namely, DNA methylation (mDNA), gene expression
(RNA), microRNA expression (miRNA), and reverse phase protein array expression
(RPPA). The pairwise similarity $w_m(i, j)$ between samples $x_i$ and $x_j$ of the modality $X_m$
is computed using the Gaussian similarity kernel

\[
w_m(i, j) = \exp\left\{ -\frac{\rho_m^2(x_i, x_j)}{2\sigma_m^2} \right\}, \tag{5.65}
\]

where $\rho_m(x_i, x_j)$ denotes the Euclidean distance between samples $x_i$ and $x_j$ in $X_m$ and $\sigma_m$
is the standard deviation of the Gaussian kernel. The value of $\sigma_m$ is empirically set to be
half of the maximum pairwise distance between any two points of the modality. The choice of
this similarity function results in a completely connected graph for each modality.
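
The kernel in (5.65), with $\sigma_m$ set to half of the maximum pairwise distance, can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name and the three toy samples are assumptions and are not taken from the thesis code.

```python
import math


def gaussian_similarity(samples):
    """Similarity matrix of one modality using the kernel in (5.65).

    samples : list of equal-length numeric tuples (one per object).
    sigma is set to half of the maximum pairwise Euclidean distance.
    """
    n = len(samples)
    dist = [[math.dist(samples[i], samples[j]) for j in range(n)] for i in range(n)]
    sigma = 0.5 * max(max(row) for row in dist)
    if sigma == 0:
        raise ValueError("all samples coincide, so sigma would be zero")
    return [
        [math.exp(-dist[i][j] ** 2 / (2 * sigma ** 2)) for j in range(n)]
        for i in range(n)
    ]


# Toy modality with three samples; pairwise distances are 5, 10, and 5.
toy_samples = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
W = gaussian_similarity(toy_samples)
```

Every entry of `W` is strictly positive, which reflects the statement that this choice of similarity yields a completely connected graph.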
Seven other data sets from different application domains like social networks and general
images are also employed in this study to compare the clustering performance of the
proposed and existing algorithms. Among the social network data sets, Football, Politics-UK,
and Rugby are Twitter data sets which consist of social connection information among
Twitter users, while CORA is a citation network data set of machine learning papers. Each
Twitter data set has a heterogeneous collection of nine network and content-based modalities,
namely, follows, followed-by, mentions, mentioned-by, retweets, retweeted-by, lists500,
tweets500, and listmerged500. The CORA data set consists of two modalities: one
represents content information and the other represents inbound/outbound citation relation.
The cosine similarity is used to compute the pairwise similarities between the samples in
the social network data sets. Among the general image data sets, ORL is a face clustering
data set and Digits is a handwritten numeral identification data set, while Caltech7 is an
object recognition data set. The modalities of Digits, ORL, and Caltech7 data sets are
constructed from different types of features extracted from the sample images. The Gaussian
similarity kernel described above is used to construct the similarity matrices for the image
data sets. A brief description of the omics and benchmark data sets and pre-processing
steps is provided in Appendix A.
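
For the social network modalities, a pairwise cosine similarity matrix can be sketched as below. The function names and the toy vectors are hypothetical, and returning 0 for an all-zero vector is a convention assumed here rather than one stated in the thesis.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for users with an empty profile
    return dot / (norm_u * norm_v)


def cosine_similarity_matrix(rows):
    """Pairwise cosine similarities between the rows of one modality."""
    return [[cosine_similarity(u, v) for v in rows] for u in rows]


# Toy 'follows'-style indicator vectors for three users (made-up data).
toy_rows = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
S = cosine_similarity_matrix(toy_rows)
```

Users with identical link patterns obtain a similarity close to 1, while users with no links in common obtain 0.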

5.5.2 Optimum Value of Rank


For each multi-view data set having $M$ views and $k$ clusters, the proposed algorithm selects
$r$ eigenpairs from each of the $M$ individual Laplacians and constructs a joint eigenspace
of rank $rM$. Similar to the existing spectral clustering algorithms [162, 194], the proposed
CoALa algorithm also performs k-means clustering on $k$ eigenvectors of the final eigenspace.
Since the clustering is performed in a $k$-dimensional subspace, the rank $r$ of the individual
Laplacians should be $r \ge \lceil k/M \rceil$. To find out the optimal value of rank $r$, the Silhouette
index [179] is used. It lies in $[-1, 1]$ and a higher value implies better clustering. In
order to choose the rank parameter, the value of $r$ is varied from $\lceil k/M \rceil$ to 50 and for each
value of $r$, the Silhouette index $S(r)$ is evaluated for clustering on the $k$ largest eigenvectors
of the final eigenspace. The optimal value of $r$, that is $r^{\star}$, is obtained using the following
relation:

\[
r^{\star} = \arg\max_{r} \{ S(r) \}. \tag{5.66}
\]

[Figure 5.1 panels: CRC, LGG, STAD, BRCA; each panel plots F-measure and Silhouette index against rank r.]

Figure 5.1: Variation of Silhouette index and F-measure for different values of rank
parameter r on omics data sets.
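
The rank selection rule in (5.66) can be sketched as follows. The Silhouette computation uses the standard definition with Euclidean distance; `silhouette_for_rank` is a hypothetical callable standing in for the full pipeline (building the rank-$r$ eigenspace, running k-means, and scoring the result), which is not reproduced here. All names and toy values are assumptions.

```python
import math


def silhouette_index(points, labels):
    """Mean silhouette width over all points (Euclidean distance)."""
    n = len(points)
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("the Silhouette index needs at least two clusters")
    total = 0.0
    for i in range(n):
        # Mean distance from point i to the other members of every cluster.
        mean_dist = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(
                    math.dist(points[i], points[j]) for j in members
                ) / len(members)
        own = labels[i]
        if own not in mean_dist:
            continue  # singleton cluster: width taken as 0 by convention
        a = mean_dist[own]
        b = min(d for c, d in mean_dist.items() if c != own)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n


def select_rank(k, num_views, silhouette_for_rank, max_rank=50):
    """Return the rank in [ceil(k / M), max_rank] with the highest Silhouette."""
    start = math.ceil(k / num_views)
    return max(range(start, max_rank + 1), key=silhouette_for_rank)


# Toy check of the index on two well-separated one-dimensional clusters.
toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
toy_labels = [0, 0, 1, 1]
toy_score = silhouette_index(toy_points, toy_labels)

# Toy rank selection using made-up Silhouette values for r = 1, 2, 3.
toy_scores = {1: 0.21, 2: 0.48, 3: 0.35}
best_rank = select_rank(k=3, num_views=4, silhouette_for_rank=toy_scores.get, max_rank=3)
```

Because the Silhouette index needs no ground truth, this selection can be run on unlabeled data; the external indices are used only afterwards to check the choice.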

The variation of both Silhouette index and F-measure with respect to the rank $r$ is
shown in Figure 5.1 for different omics data sets and in Figure 5.2 for the benchmark
data sets. The plots in Figures 5.1 and 5.2 show that the values of Silhouette index and
F-measure vary in a similar fashion. The Silhouette index is an internal cluster validity
measure computed based on the generated clusters, while F-measure is an external index
which compares the generated clusters with the ground truth class information. Since these
two indices are found to vary similarly, the value of $r$ that maximizes the Silhouette index
is expected to give a close-to-optimum value of F-measure as well. Using this
criterion, the optimal values of rank for CRC, LGG, STAD, BRCA, CESC, OV, LUNG, and
KIDNEY data sets are 3, 48, 23, 4, 2, 20, 4, and 46, respectively, while those for the benchmark
data sets Football, Politics-UK, Rugby, and Digits are 22, 45, 7, and 6, respectively. It
is also observed that for BRCA, CRC, Football, Politics-UK, and Digits data sets, the
F-measure corresponding to $r^{\star}$ coincides with the best value of F-measure obtained over
different values of rank $r$. The similarly varying curves of Silhouette and F-measure in
Figures 5.1 and 5.2 justify the use of Silhouette index to find out the optimal rank.

[Figure 5.2 panels: Football, Politics-UK, Rugby, Digits; each panel plots F-measure and Silhouette index against rank r.]

Figure 5.2: Variation of Silhouette index and F-measure for different values of rank
parameter r on benchmark data sets.

5.5.3 Difference Between Eigenspaces


The proposed method constructs an eigenspace from low-rank approximations of individual
graph Laplacians. This eigenspace is an approximation of the full-rank eigenspace which
considers the complete or full rank information of all the Laplacians. As defined in Section
5.4, for a given rank $r$, the difference between the full-rank and approximate eigenspaces,
in terms of their eigenvalues and eigenvectors, is given by $\Delta_r$ and $\Phi_r$, respectively. Here,
the variation in the difference between these two eigenspaces is observed with the increase
in rank $r$. For each omic data set, $\Delta_r$ and $\Phi_r$ are computed for different fractions of the
full rank of that data set. The variation in the values of $\Delta_r$ and $\Phi_r$, with the increase in
rank $r$, is shown in Figures 5.3(a) and 5.3(b), respectively, for different data sets. Figure
5.3(a) shows that the difference between eigenvalues of the two eigenspaces monotonically
decreases to 0 with the increase in rank, for all the data sets. Figure 5.3(b), on the other
hand, shows that the difference between the subspaces, spanned by the eigenvectors of the
two eigenspaces, also converges to 0 as the value of rank $r$ approaches the full rank of the
data set. However, unlike the eigenvalue difference, the eigenvector difference does not
decrease monotonically. For some of the smaller values of rank $r$, the
difference also increases between two consecutive values. This is due to the fact that for
a given value of $r$, there can be infinitely many possible rank-$r$ subspaces of an $n$-dimensional
vector space. For small values of $r$, the rank-$r$ subspaces of individual modalities can be
very different from each other due to the large number of possibilities. Consequently, the
approximate subspace constructed from these subspaces tends to vary a lot from the
full-rank subspace. Hence, the difference between the full-rank and approximate
sets of eigenvectors fluctuates for small values of rank $r$. However, as $r$ approaches the full
rank, the number of possible subspaces reduces and the difference between the eigenvectors
monotonically decreases to 0.

[Figure 5.3 panels: (a) Difference in Eigenvalues ($\Delta_r$), (b) Difference in Eigenvectors ($\Phi_r$); x-axis: Fraction of Full Rank; curves: BRCA, CRC, LGG, STAD.]

Figure 5.3: Variation of difference between full-rank and approximate eigenspaces with
respect to rank r.

5.5.4 Effectiveness of Proposed CoALa Algorithm


This subsection illustrates the significance of different aspects of the proposed algorithm
such as integration of multiple modalities over individual ones, use of approximate Lapla-
cians as opposed to full-rank ones, choice of the convex combination α, and so on, for four
omics data sets: CRC, LGG, STAD, and BRCA, and four benchmark data sets: Football,
Politics-UK, Rugby, and Digits, as examples.

5.5.4.1 Importance of Data Integration


The proposed CoALa algorithm performs clustering on the k largest eigenvectors of the
approximate eigenspace constructed by integrating multiple low-rank Laplacians. To es-
tablish the importance of this integration, the performance of the proposed algorithm is
compared with spectral clustering on the individual modalities in Tables 5.1, 5.2, and 5.3.

5.5.4.1.1 Omics Data Sets The results in Table 5.1 show that the proposed algorithm
performs better than all four individual modalities for CRC, LGG, and STAD data sets,
in terms of all external indices, except for the purity measure on the CRC data set. The
performance is equal for the purity measure on the CRC data set across all modalities.
Since the highest value of the F-measure on CRC data set is obtained for the proposed
algorithm, it identifies the smaller cluster better than all the individual Laplacians. For the
BRCA data set, RNA outperforms the proposed algorithm, albeit by a very small margin.
Among the individual modalities, mDNA gives the best performance for CRC, LGG, and
STAD data sets. For LGG and STAD data sets, the performance of the proposed CoALa
algorithm is significantly higher than that of their best modality, mDNA.

Table 5.1: Comparative Performance Analysis of Spectral Clustering on Individual Modalities
and Proposed Approach on Omics Data Sets

Data Set  Index      mDNA       RNA        miRNA      RPPA       CoALa
CRC       Accuracy   0.5043103  0.5107759  0.5409483  0.5301724  0.6400862
CRC       NMI        0.0231494  0.0023715  0.0131094  0.0004871  0.0185660
CRC       ARI        -0.018933  -0.001914  0.0041107  -0.004704  0.0548748
CRC       F-measure  0.5849894  0.5397796  0.5673758  0.5741394  0.6529565
CRC       Rand       0.4989573  0.4991528  0.5022809  0.5007448  0.5382531
CRC       Purity     0.7370690  0.7370690  0.7370690  0.7370690  0.7370690
LGG       Accuracy   0.8352060  0.5917603  0.4307116  0.3970037  0.9737828
LGG       NMI        0.5734568  0.2176187  0.0498676  0.0254500  0.8689965
LGG       ARI        0.5567870  0.1801875  0.0510240  0.0238319  0.9199392
LGG       F-measure  0.8269248  0.5875701  0.4717221  0.4326018  0.9737835
LGG       Rand       0.7861508  0.6149925  0.5593760  0.5476050  0.9622089
LGG       Purity     0.8352060  0.5917603  0.5318352  0.5280899  0.9737828
STAD      Accuracy   0.5413223  0.4793388  0.3719008  0.4173554  0.768595
STAD      NMI        0.2282198  0.1779419  0.0771419  0.0831100  0.510726
STAD      ARI        0.1927570  0.1047749  0.0514998  0.0460928  0.4559866
STAD      F-measure  0.5469686  0.4781377  0.3998266  0.4469459  0.7778227
STAD      Rand       0.6509722  0.6239155  0.5989164  0.5883543  0.7661946
STAD      Purity     0.5867769  0.5495868  0.4917355  0.4917355  0.7685950
BRCA      Accuracy   0.5804020  0.7688442  0.4623116  0.4798995  0.7613065
BRCA      NMI        0.3408150  0.5277072  0.1947561  0.3140984  0.5281849
BRCA      ARI        0.3047769  0.5130244  0.1663564  0.2359641  0.4874579
BRCA      F-measure  0.5982526  0.7690661  0.5105008  0.5630781  0.7660191
BRCA      Rand       0.7193018  0.7995519  0.6455071  0.6689493  0.7922357
BRCA      Purity     0.6532663  0.7688442  0.5703518  0.5879397  0.7613065
The scatter plots of the first two dimensions for the best modality, mDNA, and the
proposed CoALa algorithm are given in Figures 5.4 and 5.5 for LGG and STAD data sets,
respectively. The objects in Figures 5.4 and 5.5 are colored according to the previously
established TCGA subtypes of LGG [217] and STAD [216]. For the LGG data set, Figure
5.4(a) shows that in the two-dimensional Laplacian subspace of mDNA, one of the subtypes
is compact and well-separated, while the other two are intermingled with each other. On
the other hand, Figure 5.4(h) for LGG shows that in the proposed subspace all the three
clusters are compact and separated from each other. For STAD, Figure 5.5(a) shows that
a major part of the two-dimensional subspace consists of points randomly scattered from
all the four clusters. However, Figure 5.5(h) shows that although the clusters are not
well separated, the proposed subspace can be partitioned into regions where most of the
data points belong to a single cluster. The scatter plots for the remaining data sets are
provided in the supplementary material. The distinct omic modalities together cover a
wide spectrum of biological information and the results in Table 5.1 show that integration
of multiple modalities leads to better identification of the disease subtypes compared to
unimodal analysis.

Table 5.2: Comparative Performance of Spectral Clustering on Individual Modalities and Proposed Approach on Twitter Data Sets

Data Set     Index      Followed-By  Follows    Mentioned-By  Mentions   Retweeted-By  Retweets   Tweets500  ListMerged500  Lists500   CoALa
Football     Accuracy   0.7419354    0.6584677  0.6673387     0.6737903  0.5427419     0.4834677  0.1895161  0.6649193      0.6241935  0.8500000
Football     NMI        0.7910368    0.6899672  0.7399262     0.7432407  0.6169596     0.5489673  0.2404490  0.7199201      0.7123953  0.8625365
Football     ARI        0.5814725    0.3702998  0.4422042     0.4531746  0.2413006     0.1239701  0.0244924  0.4588554      0.3908937  0.7278994
Football     F-measure  0.7747023    0.7042013  0.7241344     0.7109046  0.5537196     0.5202768  0.2022110  0.7232265      0.6606393  0.8683491
Football     Rand       0.9472965    0.9197825  0.9356405     0.9384256  0.8593378     0.7958926  0.7691328  0.9322776      0.9147218  0.9739682
Football     Purity     0.7282258    0.6766129  0.7362903     0.7092741  0.5447580     0.5008064  0.2072580  0.6931451      0.6399193  0.8584677
Politics-UK  Accuracy   0.8902148    0.9140811  0.8310262     0.7878281  0.7248209     0.8377088  0.5011933  0.7016706      0.7517900  0.9665871
Politics-UK  NMI        0.8382287    0.7343181  0.5463468     0.4936030  0.4158612     0.5270268  0.1308606  0.6834820      0.7278979  0.9434825
Politics-UK  ARI        0.8375676    0.7843860  0.6325506     0.4746045  0.4046677     0.5552591  0.1496182  0.6300209      0.6998277  0.9633130
Politics-UK  F-measure  0.9175316    0.8836935  0.8660595     0.7619363  0.8346957     0.7991772  0.5804394  0.8635673      0.8464556  0.9736129
Politics-UK  Rand       0.9196880    0.8728323  0.8429422     0.7114181  0.7991423     0.7510534  0.6330178  0.8562195      0.8346941  0.9826084
Politics-UK  Purity     0.9713604    0.9021479  0.8778042     0.7823389  0.8477326     0.8138425  0.6658711  0.9021480      0.8782816  0.9785203
Rugby        Accuracy   0.6679156    0.6797423  0.5955503     0.6004683  0.7121779     0.6860655  0.3621779  0.3223653      0.6499999  0.8305621
Rugby        NMI        0.6395692    0.6301835  0.6135364     0.6059998  0.6151681     0.5863386  0.2623035  0.2283989      0.5733881  0.7093834
Rugby        ARI        0.4204121    0.4827426  0.3977919     0.3758964  0.5461666     0.5022541  0.1373920  0.0130416      0.4969606  0.6627701
Rugby        F-measure  0.7113898    0.6643790  0.6873041     0.6705410  0.7078636     0.6856623  0.3737361  0.3460789      0.7426962  0.8349647
Rugby        Rand       0.8609769    0.8580120  0.8562299     0.8482375  0.8560331     0.8406967  0.7177268  0.5223523      0.8672685  0.9067597
Rugby        Purity     0.8474238    0.8435597  0.8274004     0.8121780  0.7915691     0.7816159  0.4871194  0.4566745      0.7796253  0.8606557

Table 5.3: Comparative Performance of Spectral Clustering on Individual Modalities and Proposed Approach on Digits Data Set

Data Set  Index      Fac        Fou        Kar        Mor        Pix        Zer        CoALa
Digits    Accuracy   0.5614000  0.7096000  0.6638000  0.5109000  0.6520000  0.5350000  0.8835000
Digits    NMI        0.6192075  0.6443707  0.6407076  0.5361673  0.6385617  0.4766979  0.7981981
Digits    ARI        0.4731459  0.5416071  0.5383434  0.3723571  0.5216976  0.3286005  0.7645096
Digits    F-measure  0.6451628  0.7209662  0.7022988  0.5651531  0.6829546  0.5545294  0.8839913
Digits    Rand       0.8994301  0.9173923  0.9156842  0.8655854  0.9108559  0.8757654  0.9576618
Digits    Purity     0.6223000  0.7100000  0.7027000  0.5414000  0.6890000  0.5350500  0.8835000

[Figure 5.4 panels: (a) Best Modality, (b) $L^{r}$, (c) $L^{r*}$_Eqw, (d) $L^{r*}$_RNrm, (e) JIVE, (f) iCluster, (g) SNF, (h) $L^{r*}$_Damp (CoALa).]

Figure 5.4: Scatter plots using first two components of different low-rank based approaches
on LGG data set.

5.5.4.1.2 Benchmark Data Sets Three Twitter data sets, namely, Football, Politics-UK,
and Rugby, have nine different modalities, while the image data set, Digits, has six.
Tables 5.2 and 5.3 compare the performance of clustering on the $k$ largest eigenvectors
of the individual shifted Laplacians with that of the proposed approximate subspace for
the Twitter and Digits data sets, respectively. From the results of Tables 5.2 and 5.3, it is
evident that the proposed CoALa algorithm consistently and significantly outperforms all
the individual modalities across all four benchmark data sets. For the Digits data set, Table
5.3 shows that all six modalities have significantly lower performance than the proposed
approach. In brief, for all the benchmark data sets, integration of multiple modalities
always beats the performance of individual Laplacians by a wide margin.

5.5.4.2 Importance of the choice of Convex Combination


In order to establish the effectiveness of the proposed weighting factor (termed as
$L^{r*}$_Damp) described in Section 5.3.5, the clustering performance of the resulting subspace
obtained using $L^{r*}$_Damp is compared with that of the one where all the modalities are
equally weighted (termed as $L^{r*}$_Eqw). The damping factor $\beta$ in (5.32) is empirically set
to 1.25 for all data sets.

5.5.4.2.1 Omics Data Sets The scatter plots for the first two components of $L^{r*}$_Eqw
and $L^{r*}$_Damp (CoALa) subspaces are given in Figures 5.4 and 5.5 for LGG and STAD
data sets, respectively. For LGG, Figure 5.4(c) for $L^{r*}$_Eqw shows that two of the three
clusters are highly compact; however, they also lack inter-cluster separability. In case of the
proposed $L^{r*}$_Damp subspace, in Figure 5.4(h), these two clusters have lower compactness
but are well-separated from each other. For STAD, scatter plots for $L^{r*}$_Eqw and
$L^{r*}$_Damp (CoALa) in Figures 5.5(c) and 5.5(h), respectively, are of similar nature,
although $L^{r*}$_Eqw shows slightly better inter-cluster separability compared to $L^{r*}$_Damp.

Table 5.4: Comparative Performance Analysis of Equally and Damped Weighted Combination
on Omics Data

Data Set  Index      L^{r*}_Eqw  L^{r*}_Damp
CRC       Accuracy   0.6163793   0.6400862
CRC       NMI        0.0084103   0.0185660
CRC       ARI        0.0317469   0.0548748
CRC       F-measure  0.6309431   0.6529565
CRC       Rand       0.5260669   0.5382531
CRC       Purity     0.7370690   0.7370690
LGG       Accuracy   0.9625468   0.9737828
LGG       NMI        0.8509861   0.8689965
LGG       ARI        0.8806075   0.9199392
LGG       F-measure  0.9625844   0.9737835
LGG       Rand       0.9437921   0.9622089
LGG       Purity     0.9625468   0.9737828
STAD      Accuracy   0.7727273   0.768595
STAD      NMI        0.5150229   0.510726
STAD      ARI        0.4639222   0.4559866
STAD      F-measure  0.7788198   0.7778227
STAD      Rand       0.7703782   0.7661946
STAD      Purity     0.7727273   0.7685950
BRCA      Accuracy   0.6733668   0.7613065
BRCA      NMI        0.4531777   0.5281849
BRCA      ARI        0.3964856   0.4874579
BRCA      F-measure  0.6834253   0.7660191
BRCA      Rand       0.7523132   0.7922357
BRCA      Purity     0.6783920   0.7613065

Table 5.5: Comparative Performance Analysis of Equally and Damped Weighted Combination
on Benchmark Data Sets

Data Set     Index      L^{r*}_Eqw  L^{r*}_Damp
Football     Accuracy   0.8758064   0.8500000
Football     NMI        0.8908789   0.8625365
Football     ARI        0.7841728   0.7278994
Football     F-measure  0.8848290   0.8683491
Football     Rand       0.9760741   0.9739682
Football     Purity     0.8778225   0.8584677
Politics-UK  Accuracy   0.9785203   0.9665871
Politics-UK  NMI        0.9345135   0.9434825
Politics-UK  ARI        0.9637864   0.9633130
Politics-UK  F-measure  0.9735519   0.9736129
Politics-UK  Rand       0.9828368   0.9826084
Politics-UK  Purity     0.9785203   0.9785203
Rugby        Accuracy   0.8196721   0.8305621
Rugby        NMI        0.7038184   0.7093834
Rugby        ARI        0.6454866   0.6627701
Rugby        F-measure  0.8288040   0.8349647
Rugby        Rand       0.8972515   0.9067597
Rugby        Purity     0.8621780   0.8606557
Digits       Accuracy   0.8170000   0.8835000
Digits       NMI        0.8288566   0.7981981
Digits       ARI        0.7616410   0.7645096
Digits       F-measure  0.8746977   0.8839913
Digits       Rand       0.9564677   0.9576618
Digits       Purity     0.8310000   0.8835000

The quantitative results for this comparison are reported in Table 5.4, which show that
for CRC, LGG, and BRCA data sets, the damping strategy $L^{r*}$_Damp performs better
than $L^{r*}$_Eqw, in terms of all external indices. Only for the STAD data set, weighting all
the modalities equally gives slightly better performance. This is also evident from the
increased inter-cluster separability in Figure 5.5(c) compared to Figure 5.5(h). However, the
results in Table 5.4 show that assigning maximum weight to the most relevant modality
and gradually damping it by a factor $\beta$, based on its relevance, preserves better cluster
information in the majority of the cases.
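
The contrast between equal and damped weighting can be illustrated with the sketch below. The exact form of (5.32) is not reproduced in this section, so the scheme here is only an assumed reading of the description above: modalities are ordered by relevance, the most relevant one gets the largest weight, each subsequent one is damped by a further factor of $\beta$, and the weights are normalized to sum to one. The function names and relevance values are hypothetical.

```python
def equal_weights(num_modalities):
    """Equal convex-combination weights (the Eqw setting)."""
    return [1.0 / num_modalities] * num_modalities


def damped_weights(relevance, beta=1.25):
    """Illustrative damped weights; an assumption, not the thesis's (5.32).

    relevance : one score per modality, higher meaning more relevant.
    The modality at position p in the relevance ordering (p = 0 is the most
    relevant) receives a raw weight beta ** (-p); weights are then normalized.
    """
    if beta <= 1:
        raise ValueError("beta is expected to be greater than 1")
    order = sorted(range(len(relevance)), key=lambda m: relevance[m], reverse=True)
    raw = [0.0] * len(relevance)
    for position, modality in enumerate(order):
        raw[modality] = beta ** (-position)
    total = sum(raw)
    return [w / total for w in raw]


# Made-up relevance scores for three modalities.
weights = damped_weights([0.9, 0.5, 0.7], beta=1.25)

# Two nearly equally relevant modalities are still separated by a factor beta.
near_tie = damped_weights([0.50, 0.49], beta=1.25)
```

Under this assumed scheme the weights depend only on the relevance ordering, so even a marginal difference in relevance changes the weight ratio by the full factor $\beta$.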

5.5.4.2.2 Benchmark Data Sets The comparative results for the benchmark data
sets are reported in Table 5.5. It can be observed from Table 5.4 that for a majority of
omics data sets, damped weighting of modalities based on relevance outperforms the equally
weighted one. On the contrary, the results in Table 5.5 show that the equally weighted
strategy gives better performance than the damped one on the Football and Politics-UK
data sets. One possible explanation is that most of the component modalities of the Twitter
data sets are similar to each other and have close performances. For instance, ‘follows’ and
‘followed-by’ are both network based modalities, where ‘follows’ captures the outgoing links
from the nodes, while ‘followed-by’ captures the incoming links to the nodes. Other pairs
of modalities like ‘mentions’ and ‘mentioned-by’, and ‘retweets’ and ‘retweeted-by’ are also
very similar to each other. In the damped weighting introduced in Section 5.3.5, slight
differences in the relevance values of these similar modalities would dampen the effect of
the one with lower relevance by a factor of $\beta$. This leads to degraded cluster structure in
the eigenspace of the joint Laplacian for two Twitter data sets when using the damped weighted
strategy. For Rugby and Digits data sets, the damped weighted strategy $L^{r*}$_Damp has better
performance compared to the equally weighted $L^{r*}$_Eqw one for the majority of the external
indices.

Table 5.6: Comparative Performance Analysis of Full-Rank and Approximate Subspaces
of Omics Data

Data Set  Index      L^{r}      CoALa (L^{r*})
CRC       Accuracy   0.5301724  0.6400862
CRC       NMI        0.0134459  0.0185660
CRC       ARI        -0.025277  0.0548748
CRC       F-measure  0.6052757  0.6529565
CRC       Rand       0.5007448  0.5382531
CRC       Purity     0.7370690  0.7370690
LGG       Accuracy   0.6441948  0.9737828
LGG       NMI        0.3597365  0.8689965
LGG       ARI        0.2844081  0.9199392
LGG       F-measure  0.6577440  0.9737835
LGG       Rand       0.6524739  0.9622089
LGG       Purity     0.6441948  0.9737828
STAD      Accuracy   0.5619835  0.768595
STAD      NMI        0.2605140  0.510726
STAD      ARI        0.2248704  0.4559866
STAD      F-measure  0.6158419  0.7778227
STAD      Rand       0.6706560  0.7661946
STAD      Purity     0.6157025  0.768595
BRCA      Accuracy   0.5477387  0.7613065
BRCA      NMI        0.4315712  0.5281849
BRCA      ARI        0.3507615  0.4874579
BRCA      F-measure  0.6197007  0.7660191
BRCA      Rand       0.7403390  0.7922357
BRCA      Purity     0.7185930  0.7613065

5.5.4.3 Importance of Noise-Free Approximation


The proposed eigenspace is an approximate one, as it is constructed from de-noised
approximations of the individual eigenspaces. This approximate eigenspace is expected to
preserve better cluster structure compared to the full-rank eigenspace constructed from
the complete set of eigenpairs of the individual Laplacians. In order to establish this, the
performance of clustering on the $k$ largest eigenvectors of the full-rank eigenspace $L^{r}$ is
compared with that of the approximate eigenspace $L^{r*}$ (CoALa) in Table 5.6. From the
results of Table 5.6, it can be observed that the proposed CoALa algorithm outperforms
the full-rank subspace $L^{r}$ for all the data sets. The performance is significantly better for
BRCA, LGG, and STAD data sets. The full-rank information of individual Laplacians in
$L^{r}$ inherently contains the noisy information of the $(n - r)$ smallest eigenvectors of each
Laplacian. However, in the proposed algorithm, each individual Laplacian is truncated at
rank $r$, to contain mostly the cluster discriminatory information, where $r \ll n$. So, the
approximate eigenspace automatically eliminates the noise present in the $(n - r)$ remaining
eigenvectors. The results of Table 5.6 show that these truncated de-noised Laplacians
preserve better cluster structure in the resulting eigenspace compared to the full-rank one.
The scatter plots for the full-rank subspaces of LGG and STAD data sets are given in
Figures 5.4(b) and 5.5(b), respectively. For LGG, Figure 5.4(b) shows that only one cluster
is well-separated. On the other hand, data points from the other two clusters of LGG and
all the four clusters of STAD in Figure 5.5(b) are cluttered amongst each other, exhibiting
poor separability. The optimal ranks, $r^{\star}$, for the LGG and STAD data sets are 48 and 39,
respectively, while their full ranks are 267 and 242, respectively. The scatter plots for the rank
$r^{\star}$ approximation in Figures 5.4(h) and 5.5(h) show that filtering out the noise in the
remaining 219 and 203 eigenpairs of the individual Laplacians preserves significantly better
cluster structure for these data sets.

[Figure 5.5 panels: (a) Best Modality, (b) $L^{r}$, (c) $L^{r*}$_Eqw, (d) $L^{r*}$_RNrm, (e) JIVE, (f) iCluster, (g) SNF, (h) $L^{r*}$_Damp (CoALa).]

Figure 5.5: Scatter plots using first two components of different low-rank based approaches
on STAD data set.

5.5.4.4 Advantage of Averting Row-normalization


In normalized spectral clustering (Algorithm 5.1), row-normalization tends to shift the
objects in the projected subspace in such a way that they cluster tightly around an orthog-
onal basis. This is primarily justified when the objects lie close to the ideal case where
the clusters are infinitely apart [162]. However, row-normalization may not necessarily
give better performance on real-life data sets. The two-dimensional scatter plots for the
row-normalized subspaces of LGG and STAD data sets are given in Figures 5.4(d) and
5.5(d), respectively. For both data sets, as expected, row-normalization pushes objects
from different clusters further away from the origin in different directions of the subspace,
which increases the inter-cluster separability. However, points lying in the boundaries of
different clusters are not necessarily pushed away and are projected around the origin of
the subspaces, which in turn reduces the compactness of the clusters. When the number of
boundary points is relatively large, row-normalization tends to give degraded performance.
To study this quantitatively, the clustering performance of the row-normalized subspace
(termed as L^{r⋆}_RNrm) is compared with that of the non-normalized one in Table 5.7. The
results reported in Table 5.7 show that for all four data sets, the proposed subspace performs
better than its row-normalized counterpart L^{r⋆}_RNrm, with the two tied only on the purity
measure for the CRC data set.

Table 5.7: Effect of Row-normalization on Different Subspaces on Omics Data

Data Set   Index       L^{r⋆}_RNrm   CoALa
CRC        Accuracy    0.5991379     0.6400862
           NMI         0.0056913     0.0185660
           ARI         0.0220924     0.0548748
           F-measure   0.6169586     0.6529565
           Rand        0.5186192     0.5382531
           Purity      0.7370690     0.7370690
LGG        Accuracy    0.8951311     0.9737828
           NMI         0.6857991     0.8689965
           ARI         0.7359314     0.9199392
           F-measure   0.9010565     0.9737835
           Rand        0.8771367     0.9622089
           Purity      0.8951311     0.9737828
STAD       Accuracy    0.7355372     0.768595
           NMI         0.4582469     0.5107260
           ARI         0.4012421     0.4559866
           F-measure   0.7389739     0.7778227
           Rand        0.7474024     0.7661946
           Purity      0.7355372     0.7685950
BRCA       Accuracy    0.6859296     0.7613065
           NMI         0.4806899     0.5281849
           ARI         0.4012943     0.4874579
           F-measure   0.6946324     0.7660191
           Rand        0.7588193     0.7922357
           Purity      0.6859296     0.7613065

Table 5.8: Effect of Row-Normalization on Benchmark Data Sets

Data Set      Index       L^{r⋆}_RNrm   CoALa
Football      Accuracy    0.8669354     0.8500000
              NMI         0.8856647     0.8625365
              ARI         0.7775054     0.7278994
              F-measure   0.8679092     0.8683491
              Rand        0.9785490     0.9739682
              Purity      0.8911290     0.8584677
Politics-UK   Accuracy    0.9465394     0.9665871
              NMI         0.8414139     0.9434825
              ARI         0.9075232     0.9633130
              F-measure   0.9571452     0.9736129
              Rand        0.9746971     0.9826084
              Purity      0.9715990     0.9785203
Rugby         Accuracy    0.6224824     0.8305621
              NMI         0.6527087     0.7093834
              ARI         0.3970370     0.6627701
              F-measure   0.6737320     0.8349647
              Rand        0.8606810     0.9067597
              Purity      0.8545667     0.8606557
Digits        Accuracy    0.8600000     0.8835000
              NMI         0.8528484     0.7981981
              ARI         0.7970693     0.7645096
              F-measure   0.8629902     0.8839913
              Rand        0.9629410     0.9576618
              Purity      0.8565000     0.8835000

Table 5.8 compares the performance of the proposed approximate subspace with and
without the row-normalization step for the benchmark data. For the Politics-UK and Rugby
data sets, avoiding row-normalization gives better performance with respect to all the
external indices. On the other hand, row-normalization gives better performance for five of
the six external indices on the Football data set and for three of the six (NMI, ARI, and
Rand) on the Digits data set. Scatter plots for the first two dimensions of L^{r⋆}_RNrm and
the proposed CoALa algorithm are given in Figures 5.6 and 5.7 for the Politics-UK and the
Digits data sets, respectively.
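
The row-normalization step compared above can be illustrated with a minimal sketch. It assumes an n x r embedding matrix U whose rows are the objects in the projected subspace; the helper name `row_normalize` and the toy matrix are hypothetical and are not taken from the thesis.

```python
import numpy as np

def row_normalize(U, eps=1e-12):
    """Scale every row of the embedding U to unit Euclidean length."""
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    return U / np.maximum(norms, eps)  # eps guards against all-zero rows

# Hypothetical toy embedding: two objects aligned with different axes
# and one object with small, ambiguous coordinates.
U = np.array([[0.90, 0.05],
              [0.04, 0.80],
              [0.02, 0.03]])
U_rn = row_normalize(U)
row_lengths = np.linalg.norm(U_rn, axis=1)
```

After the step, every row has unit length in the full r-dimensional subspace, so only the direction of each object is retained. Skipping the step, as in the non-normalized columns of Tables 5.7 and 5.8, keeps the original magnitudes.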

5.5.5 Comparative Performance Analysis on Multi-Omics Data Sets


The performance of the proposed algorithm is compared with that of the existing ones
in Tables 5.9 and 5.10 in terms of the external cluster evaluation indices. The COCA
and BCC algorithms are consensus clustering based approaches, while the other existing
algorithms are subspace based approaches for which the optimal rank of the clustering
subspace is reported in Tables 5.9 and 5.10. The optimal ranks are selected using the
selection criteria suggested by the authors for the respective approaches. The results in
Tables 5.9 and 5.10 show that the proposed algorithm performs better than all the existing
approaches for CRC, LGG, and STAD data sets in terms of the external indices, except for
the purity measure on the CRC data set. However, F-measure and other external indices
indicate that the proposed algorithm identifies the smaller sized cluster better than the
existing ones. For the rest of the omics data sets, the algorithms proposed in the two
previous chapters, namely, NormS (Chapter 3) and SURE (Chapter 4), are among the best
performing algorithms in terms of external indices, while the proposed algorithm achieves
third or fourth best performance. The iCluster algorithm has comparable performance for
BRCA and CRC data sets; however, its degraded performance on the remaining data sets is
due to the poor selection of its optimal lasso penalty parameter from the high-dimensional
parameter space.

Table 5.9: Comparative Performance Analysis of CoALa and Existing Approaches Based
on External Indices on Omics Data Sets

Data Set  Algorithm    Subspace Rank  Accuracy   NMI        ARI        F-Measure   Rand       Purity
CRC       COCA         -              0.5323276  0.0120929  0.0007663  0.5586055   0.5010706  0.7370690
          BCC          -              0.5745690  0.0070894  0.0074889  0.5973067   0.5158300  0.7370690
          JIVE(Perm)   16             0.6034483  0.0071359  0.0256478  0.6210774   0.5203694  0.7370690
          A-JIVE       32             0.6034483  0.0064720  0.0246106  0.6206032   0.5203694  0.7370690
          iCluster     1              0.6163793  0.0069992  0.0293081  0.6298050   0.5260669  0.7370690
          LRAcluster   1              0.5129310  0.0030437  -0.001822  0.5410661   0.4992552  0.7370690
          PCA-con      2              0.5366379  0.0057828  0.0036971  0.5641984   0.5016106  0.7370690
          SNF          2              0.5991379  0.0069730  0.0240692  0.6178576   0.5186192  0.7370690
          NormS        16             0.6206897  0.0093881  0.0347351  0.6345375   0.5281150  0.7370690
          SURE         2              0.5107759  0.0027977  -0.002148  0.5416716   0.4991528  0.7370690
          CoALa        2              0.6400862  0.0185660  0.0548748  0.6529565   0.5382531  0.7370690
LGG       COCA         -              0.6591760  0.2772248  0.2533847  0.6608123   0.6454901  0.6591760
          BCC          -              0.6340824  0.2737596  0.248606   0.63111660  0.6382755  0.6355805
          JIVE         8              0.5617978  0.2299551  0.1606599  0.5757978   0.6056715  0.5730337
          A-JIVE       48             0.7168539  0.4267241  0.3376560  0.7172792   0.6869055  0.7168539
          iCluster     2              0.4382022  0.1379678  0.0996867  0.5187438   0.5821858  0.5355805
          LRAcluster   2              0.4719101  0.1240057  0.1030798  0.5137382   0.5831714  0.5280899
          PCA-con      3              0.6666667  0.3438738  0.3031312  0.6574834   0.6616823  0.6666667
          SNF          3              0.8689139  0.6253254  0.6331662  0.8720595   0.8268142  0.8689139
          NormS        14             0.7940075  0.5325030  0.4649223  0.7916535   0.7465292  0.7940075
          SURE         3              0.7940075  0.5335888  0.4668931  0.7904750   0.7465292  0.7940075
          CoALa        4              0.9737828  0.8689965  0.9199392  0.9737835   0.9622089  0.9737828
STAD      COCA         -              0.4450413  0.1309746  0.0740987  0.4558087   0.5981242  0.5173554
          BCC          -              0.5392562  0.1500351  0.1421471  0.5520075   0.6081204  0.5673554
          JIVE(Perm)   8              0.4049587  0.1288122  0.0657955  0.4487487   0.5981619  0.5165289
          A-JIVE       64             0.4148760  0.1234864  0.0763413  0.4458621   0.6086142  0.5227273
          iCluster     3              0.3512397  0.0650589  0.0288255  0.3832114   0.5855423  0.4917355
          LRAcluster   1              0.4256198  0.1259879  0.0912460  0.4746753   0.6122218  0.5619835
          PCA-con      2              0.6900826  0.3654109  0.3204142  0.6959782   0.7110524  0.6900826
          SNF          2              0.5661157  0.3216270  0.2694201  0.6333622   0.6945235  0.6363636
          NormS        27             0.5702479  0.1805281  0.1625013  0.5770884   0.6435993  0.5950413
          SURE         2              0.6983471  0.3511439  0.3445607  0.7056674   0.7216145  0.6983471
          CoALa        4              0.7685950  0.5107260  0.4559866  0.7778227   0.7661946  0.768595
BRCA      COCA         -              0.7434673  0.5002408  0.4864778  0.7457304   0.7905295  0.7434673
          BCC          -              0.6251256  0.3169187  0.3049874  0.6242493   0.7055783  0.6334171
          JIVE         12             0.6859296  0.4287142  0.3772649  0.6889363   0.7464906  0.6859296
          A-JIVE       64             0.6140704  0.4482479  0.3710317  0.6707575   0.7363682  0.6841709
          iCluster     3              0.7638191  0.5176193  0.4745746  0.7658865   0.7842867  0.7638191
          LRAcluster   2              0.7110553  0.4368520  0.4035040  0.7101385   0.7521740  0.7110553
          PCA-con      4              0.7587940  0.5506612  0.5038795  0.7601317   0.7984380  0.7587940
          SNF          4              0.6783920  0.4558955  0.4111794  0.6865447   0.7602370  0.6959799
          NormS        11             0.7688442  0.5437267  0.5090183  0.7699789   0.7999063  0.7688442
          SURE         4              0.7663317  0.5528011  0.5104814  0.7683344   0.8010455  0.7663317
          CoALa        4              0.7613065  0.5281849  0.4874579  0.7660191   0.7922357  0.7613065

Table 5.10: Comparative Performance Analysis of CoALa and Existing Approaches Based
on External Indices on Omics Data Sets

Data Set  Algorithm    Subspace Rank  Accuracy   NMI        ARI        F-Measure   Rand       Purity
KIDNEY    COCA         -              0.9408280  0.7493140  0.8393954  0.9477422   0.9199568  0.9470828
          BCC          -              0.9122117  0.6783448  0.7299573  0.9139998   0.8657292  0.9122117
          JIVE(Perm)   12             0.9308005  0.6955325  0.7786981  0.9300085   0.8893944  0.9308005
          A-JIVE       48             0.9582090  0.7902576  0.8695284  0.9585611   0.9349404  0.9582090
          iCluster     2              0.6065129  0.2547010  0.1717458  0.6514716   0.5842023  0.6811398
          LRAcluster   2              0.9538670  0.7862018  0.8579391  0.9545717   0.9292298  0.9538670
          PCA-con      3              0.9511533  0.7670505  0.8489024  0.9516854   0.9246800  0.9511533
          SNF          3              0.9579376  0.7946083  0.8796762  0.9590236   0.9400330  0.9579376
          NormS        35             0.9525102  0.7726162  0.8534490  0.9530685   0.9269512  0.9525102
          SURE         3              0.9525102  0.7726162  0.8534490  0.9530685   0.9269512  0.9525102
          CoALa        3              0.9294437  0.6987468  0.7786424  0.9285111   0.8893207  0.9294437
OV        COCA         -              0.5943114  0.3131466  0.2810761  0.6068513   0.7039183  0.5943114
          BCC          -              0.4610778  0.1567582  0.1254690  0.4755846   0.6268706  0.4622754
          JIVE         32             0.5718563  0.2629523  0.2027605  0.5653910   0.6885005  0.5718563
          A-JIVE       64             0.5191617  0.2124862  0.1981556  0.5111353   0.6942997  0.5221557
          iCluster     3              0.5089820  0.2249889  0.2005886  0.4808256   0.6916078  0.5119760
          LRAcluster   2              0.6287425  0.3745173  0.2999204  0.6384046   0.7322472  0.6287425
          PCA-con      4              0.6946108  0.4424701  0.4068449  0.6868295   0.7734621  0.6946108
          SNF          4              0.5269461  0.2753886  0.2058407  0.5642052   0.6557695  0.5389222
          NormS        10             0.6976048  0.4504552  0.4142200  0.6910392   0.7766269  0.6976048
          SURE         4              0.7215569  0.4680312  0.4372574  0.7148805   0.7857258  0.7215569
          CoALa        4              0.6736527  0.3381426  0.3199015  0.6700606   0.7379295  0.6736527
LUNG      COCA         -              0.9284650  0.6287671  0.7339231  0.9283705   0.8669662  0.9284650
          BCC          -              0.9372578  0.6648076  0.7645295  0.9371445   0.8822697  0.9372578
          JIVE(Perm)   8              0.9269747  0.6333526  0.7288041  0.9266709   0.8644127  0.9269747
          A-JIVE       32             0.9478390  0.7192028  0.8019299  0.9476450   0.9009720  0.9478390
          iCluster     1              0.6333830  0.0627751  0.0696293  0.6299231   0.5348889  0.6333830
          LRAcluster   1              0.9344262  0.6535038  0.7545277  0.9342966   0.8772694  0.9344262
          PCA-con      2              0.9388972  0.6773549  0.7701654  0.9386955   0.8850902  0.9388972
          SNF          2              0.9493294  0.7152672  0.8072916  0.9492292   0.9036502  0.9493294
          NormS        27             0.9359165  0.6650183  0.7597192  0.9357050   0.8798674  0.9359165
          SURE         2              0.9418778  0.6878184  0.7806842  0.9417093   0.8903486  0.9418778
          CoALa        2              0.9403875  0.6970004  0.7754083  0.9400693   0.8877149  0.9403875
CESC      COCA         -              0.6693548  0.4172592  0.3677157  0.6870510   0.6971282  0.6774194
          BCC          -              0.6895161  0.2854917  0.3144526  0.6795619   0.6687779  0.6935484
          JIVE(Perm)   24             0.7177419  0.4425848  0.3860367  0.7097880   0.7164962  0.7177419
          A-JIVE       48             0.6500000  0.3700238  0.3355826  0.6511586   0.6857724  0.6814516
          iCluster     2              0.5483871  0.1737526  0.1017765  0.5568753   0.5731707  0.5645161
          LRAcluster   1              0.8145161  0.5176602  0.5384740  0.8123256   0.7867821  0.8145161
          PCA-con      3              0.8548387  0.6750978  0.6333073  0.8390298   0.8237608  0.8548387
          SNF          3              0.6693548  0.4927941  0.4239905  0.7073802   0.7043011  0.6935484
          NormS        6              0.8870968  0.6854921  0.7004411  0.8801172   0.8587726  0.8870968
          SURE         3              0.8629032  0.6461946  0.6507274  0.8512028   0.8339890  0.8629032
          CoALa        3              0.8225806  0.5479227  0.5637070  0.8139970   0.7951744  0.8225806

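
The behavior of the purity measure on the CRC data set, where all algorithms in Table 5.9 report the same value, can be related to how purity is computed. The following is a minimal sketch with hypothetical toy labels; it is not the evaluation code used for the tables. It illustrates that purity equals the share of the largest class whenever every cluster is dominated by that class, while the Rand index still distinguishes such partitions. Whether this is what happens on CRC is only an assumption consistent with the constant purity value.

```python
from collections import Counter
from itertools import combinations

def purity(true_labels, cluster_labels):
    """Fraction of objects that belong to the majority class of their cluster."""
    members = {}
    for t, c in zip(true_labels, cluster_labels):
        members.setdefault(c, []).append(t)
    majority = sum(Counter(m).most_common(1)[0][1] for m in members.values())
    return majority / len(true_labels)

def rand_index(true_labels, cluster_labels):
    """Fraction of object pairs on which the two partitions agree."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = sum(
        (true_labels[i] == true_labels[j]) == (cluster_labels[i] == cluster_labels[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical imbalanced ground truth: 7 objects of class 0, 3 of class 1.
truth = [0] * 7 + [1] * 3
# A partition in which both clusters are dominated by class 0.
split = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]

purity_split = purity(truth, split)       # equals the share of class 0
purity_ideal = purity(truth, truth)
rand_split = rand_index(truth, split)
```

In this toy case the purity of `split` is 7/10, the share of the larger class, although the partition recovers little of the true structure; this is why indices such as F-measure and Rand are read alongside purity.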
Due to the heterogeneous nature of the individual modalities, LRAcluster models each
modality using a separate probability distribution having its own set of parameters. The
proposed algorithm handles data heterogeneity by considering separate similarity matri-
ces for separate modalities. Moreover, the modalities are integrated using their shifted
Laplacians whose elements always lie in [0, 2] as opposed to the raw data format. So, the
difference in unit and scale of the individual modalities does not affect the final eigenspace.
Similar to the proposed algorithm, the SNF approach also uses spectral clustering on a
unified similarity graph to identify the clusters. However, in terms of the external indices,
the proposed algorithm outperforms SNF on all data sets, except KIDNEY and LUNG
data sets. In SNF, the unified graph is iteratively made similar to the individual graphs.
This can often lead to propagation of unwanted information from noisy graphs into the final
unified one. On the other hand, the proposed algorithm amplifies the effect of the most rel-
evant graph, as well as dampens the effect of the irrelevant ones in the convex combination.
Moreover, truncation of individual Laplacians at rank r ≪ n helps in propagating mostly
cluster discriminatory information into the final subspace and automatically filters out the
noise. These two aspects of the proposed CoALa algorithm are primarily responsible for
its significantly better performance, especially for the LGG and STAD data sets.
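
The two aspects discussed above, rank-r truncation of the individual Laplacians and their convex combination, can be sketched as follows. This is a minimal illustration under stated assumptions, not the CoALa implementation: the shifted Laplacian is assumed to be I + D^{-1/2} W D^{-1/2}, the function names are hypothetical, and the weights `alpha` are fixed by hand here rather than computed by the relevance-based weighting of the thesis.

```python
import numpy as np

def shifted_laplacian(W):
    """Assumed form I + D^{-1/2} W D^{-1/2} for a symmetric non-negative W."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(W.shape[0]) + d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def rank_r_approximation(L, r):
    """Keep only the r largest eigenpairs of the symmetric matrix L."""
    vals, vecs = np.linalg.eigh(L)
    top = np.argsort(vals)[::-1][:r]
    return (vecs[:, top] * vals[top]) @ vecs[:, top].T

def joint_subspace(similarities, alpha, r):
    """Eigenvectors of the convex combination of truncated Laplacians."""
    alpha = np.asarray(alpha, dtype=float)
    assert np.all(alpha >= 0) and np.isclose(alpha.sum(), 1.0)
    L_joint = sum(a * rank_r_approximation(shifted_laplacian(W), r)
                  for a, W in zip(alpha, similarities))
    vals, vecs = np.linalg.eigh(L_joint)
    return vecs[:, np.argsort(vals)[::-1][:r]]

# Hypothetical toy data: two modalities over six objects in two clusters.
rng = np.random.default_rng(0)
block = np.kron(np.eye(2), np.ones((3, 3)))
noise = rng.random((6, 6))
noise = (noise + noise.T) / 2
W1 = block + 0.05 * noise        # informative modality
W2 = 0.2 * block + 0.5 * noise   # noisier modality
U = joint_subspace([W1, W2], alpha=[0.7, 0.3], r=2)
```

Because every entry and every eigenvalue of the assumed shifted Laplacian lies in [0, 2] regardless of the scale of W, modalities measured in different units contribute on a comparable footing. The rows of `U` would then be clustered with k-means.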
Different low-rank based approaches extract subspaces of different ranks. Tables 5.9
and 5.10 show that the ranks vary from 1 to as high as 64. Comparing cluster
compactness and separability across subspaces of such varying dimensions is not reasonable.
So, the goodness of clustering is evaluated using internal cluster validity indices by
performing k-means clustering on the first two dimensions of each subspace. This makes the
internal evaluation results comparable and also easy to visualize. Four internal cluster
evaluation measures are used: Silhouette and Dunn, which are maximization based indices,
and Davies-Bouldin (DB) and Xie-Beni, which are minimization based. The internal
cluster evaluation results are reported in Table 5.11 for four omics data sets, namely,
CRC, LGG, STAD, and BRCA, as examples. The results show that the proposed algorithm
has the best performance for Silhouette, DB, and Xie-Beni indices for the LGG data set
and the second best for Silhouette and Dunn indices for the BRCA data set. SNF has the
best performance for two or more internal indices for the CRC, STAD, and BRCA data sets.
This implies that on these three data sets, the cluster structure reflected in the first two
dimensions of SNF is more compact and well-separated compared to the proposed and
other existing algorithms. The scatter plots for the first two dimensions of some low-rank
based approaches are given in Figures 5.4(e)-(g) and 5.5(e)-(g) for the LGG and STAD
data sets, respectively. The data points are labeled in different colors based on the
previously established TCGA subtypes. Although SNF has the best performance for all
the internal indices for STAD data set, the scatter plot of SNF for LGG, in Figure 5.4(g),
shows that compact and well-separated clusters do not necessarily conform with the
clinically established TCGA labellings. In brief, out of 16 cases of internal evaluation
reported in Table 5.11, the proposed CoALa algorithm ranks among the top three in 8
cases.

Table 5.11: Comparative Performance Analysis of CoALa and Existing Approaches Based
on Internal Indices and Execution Time on Omics Data Sets

Data Set  Algorithm    Silhouette  Dunn       DB         Xie-Beni   Time (in sec)
CRC       JIVE(Perm)   0.4199826   0.0120740  0.8821177  348.50660  3098.75
          A-JIVE       0.5016133   0.0043986  0.6872426  2314.1590  946.18
          iCluster     0.6229586   0.3317529  0.5770987  0.3629792  337.51
          LRAcluster   0.4337712   0.0160840  0.8751325  202.80470  104.12
          PCA-con      0.3417350   0.0190144  1.1650270  155.15390  2.62
          SNF          0.7834208   0.0549104  0.2980235  17.069770  9.66
          NormS        0.3640602   0.0185685  1.0995640  116.69010  1.45
          SURE         0.3732788   0.0050670  1.0689000  1622.504   5.41
          CoALa        0.3483722   0.0179209  1.1021510  115.98920  32.77
LGG       JIVE(Perm)   0.4138221   0.0355064  0.8684623  51.054660  665.82
          A-JIVE       0.3375023   0.0241153  0.9444459  87.842080  364.43
          iCluster     0.3952103   0.0252834  0.9330074  93.144060  3230.52
          LRAcluster   0.3921144   0.0344110  0.8593495  43.233820  37.71
          PCA-con      0.4624043   0.0322859  0.7439401  58.96720   1.08
          SNF          0.4441981   0.0149314  0.7388554  318.54730  1.33
          NormS        0.4305583   0.0218683  0.8441603  175.06670  0.96
          SURE         0.3709216   0.0378629  1.0097820  59.552690  3.71
          CoALa        0.6273401   0.0287595  0.4905286  12.563470  17.02
STAD      JIVE(Perm)   0.3618677   0.0257650  0.9526717  84.992880  734.70
          A-JIVE       0.3365825   0.0203049  0.9617136  101.30600  302.98
          iCluster     0.3790058   0.0357959  0.9584001  54.286930  1138.88
          LRAcluster   0.4015128   0.0304117  0.7928001  40.097030  49.36
          PCA-con      0.3862858   0.0182291  0.8355266  227.84070  1.02
          SNF          0.4477905   0.0596324  0.7872797  19.297210  1.14
          NormS        0.3395181   0.0181344  0.9157146  181.9440   0.80
          SURE         0.2679056   0.0613939  1.2882420  17.74966   3.67
          CoALa        0.4102003   0.0325467  0.8490579  58.722830  13.79
BRCA      JIVE(Perm)   0.4429883   0.0134063  0.7463430  277.15980  866.00
          A-JIVE       0.3148863   0.0142913  0.9765342  187.56970  686.85
          iCluster     0.4400869   0.0258263  0.7819524  77.708790  511.87
          LRAcluster   0.4300455   0.0369472  0.8211325  43.223840  88.32
          PCA-con      0.4232505   0.0241363  0.8269517  86.81890   0.93
          SNF          0.5005988   0.0189055  0.6814998  112.0742   1.91
          NormS        0.4218991   0.0090550  0.8069696  504.69590  1.47
          SURE         0.3408657   0.0544491  1.0923470  19.350910  6.32
          CoALa        0.4478377   0.0253506  0.7873740  81.048340  14.36

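
The internal evaluation protocol described above, in which indices are computed on the first two dimensions of each subspace, can be sketched for two of the four indices. This is a minimal illustration with hypothetical toy embeddings and hand-assigned labels (the thesis obtains the labels by k-means, which is omitted here); the formulas are the standard Silhouette and Davies-Bouldin definitions, and the function names are illustrative.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette width (higher is better)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:          # singleton cluster: width defined as 0
            scores.append(0.0)
            continue
        a = D[i, same].sum() / (same.sum() - 1)
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

def davies_bouldin(X, labels):
    """Davies-Bouldin index (lower is better)."""
    ids = np.unique(labels)
    cents = np.array([X[labels == c].mean(axis=0) for c in ids])
    scatter = np.array([np.linalg.norm(X[labels == c] - cents[k], axis=1).mean()
                        for k, c in enumerate(ids)])
    worst = [max((scatter[k] + scatter[m]) / np.linalg.norm(cents[k] - cents[m])
                 for m in range(len(ids)) if m != k)
             for k in range(len(ids))]
    return float(np.mean(worst))

labels = np.array([0, 0, 0, 1, 1, 1])
# Hypothetical rank-4 embeddings; only the first two columns are evaluated.
offsets = np.array([[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]], float)
centers = np.array([[0, 0]] * 3 + [[10, 10]] * 3, float)
tight = np.hstack([centers + 0.1 * (offsets - centers), np.ones((6, 2))])
loose = np.hstack([centers + 3.0 * (offsets - centers), np.ones((6, 2))])

sil_tight, sil_loose = silhouette(tight[:, :2], labels), silhouette(loose[:, :2], labels)
db_tight, db_loose = davies_bouldin(tight[:, :2], labels), davies_bouldin(loose[:, :2], labels)
```

Restricting every method to two columns makes the indices comparable across subspaces of rank 1 to 64, at the cost of ignoring structure carried by the remaining dimensions.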
The execution times reported in Table 5.11 show that the proposed CoALa algorithm
is computationally much faster than the consensus based COCA approach and other
low-rank approaches like LRAcluster, JIVE, A-JIVE, and iCluster. However, PCA-con, SNF,
NormS, and SURE have lower execution time compared to the proposed algorithm across
all the data sets. For model fitting, iCluster uses the expectation maximization algorithm,
while JIVE uses alternating optimization. These iterative algorithms converge slowly
on the high-dimensional multimodal data sets. This leads to long execution times and
poor scalability of these algorithms, as seen in Table 5.11. According to Table 5.11,
PCA-con achieves the lowest execution time on the BRCA data set, as it performs SVD on
the concatenated data only once. On the other hand, NormS achieves the lowest execution
time on the CRC, LGG, and STAD data sets.
NormS achieves this computational advantage by simply concatenating relevant principal
components from different modalities, at the cost of constructing a relatively much higher
dimensional subspace. However, the external evaluation indices show that such naive
concatenation in PCA-con and NormS often fails to capture the true cluster structure of the
multimodal data.

[Figure 5.6 panels: (a) Best Modality, (b) L^r, (c) L^{r⋆}_Eqw, (d) L^{r⋆}_RNrm, (e) SNF, (f) L^{r⋆}_Damp (CoALa)]

Figure 5.6: Scatter plots using first two components of different low-rank based subspaces
for Politics-UK data set.

5.5.6 Comparative Performance Analysis on Benchmark Data Sets


Finally, the performance of different algorithms is studied on seven benchmark
multimodal data sets, namely, Football, Politics-UK, Rugby, Digits, ORL, Caltech7, and CORA.
Among them, Football, Politics-UK, Rugby, and CORA are social network data sets, while
Digits, ORL, and Caltech7 are general image data sets. For the social network data sets,
most of the component modalities have graph based representation. However, apart from
SNF, all other existing algorithms require feature based representations of the component
modalities, so their performance could not be evaluated on social network data sets. The
comparative performance of the best modality (in terms of external indices), the full-rank
subspace L^r, SNF, and the proposed CoALa algorithm is reported in Tables 5.12 and 5.13
for different data sets. The scatter plots for the first two dimensions of the corresponding
subspaces are given in Figures 5.6 and 5.7 for the Politics-UK and Digits data sets,
respectively. The convex combination α and the optimal rank r⋆ are assigned as described
previously in Sections 5.3.5 and 5.5.2, respectively.

[Figure 5.7 panels: (a) Best Modality, (b) L^r, (c) L^{r⋆}_Eqw, (d) L^{r⋆}_RNrm, (e) SNF, (f) L^{r⋆}_Damp (CoALa)]

Figure 5.7: Scatter plots using first two components of different low-rank based subspaces
for Digits data set.

The comparative results of Tables 5.12 and 5.13 show that the proposed algorithm
has the best performance in terms of the majority of the external indices for all four social
network data sets, namely, Football, Politics-UK, Rugby, and CORA, and two image data
sets, namely, ORL and Caltech7. The SNF algorithm has the second best performance
on the three Twitter data sets and the best modality always outperforms the full-rank
subspace L^r. For the Digits data set, SNF outperforms the proposed algorithm in four
external indices. The proposed algorithm has the second best performance and is followed
by the full-rank subspace L^r. The Football data set has recently been used for the
performance evaluation of the latent multi-view subspace clustering (LMSC) [275] algorithm.
LMSC has two formulations, namely, linear (lLMSC) and generalized (gLMSC). For the
Football data set, the aggregate F-measure values for lLMSC and gLMSC are 0.7082 and
0.7940, respectively, and the aggregate Rand index values are 0.9714 and 0.9797,
respectively. The F-measure and Rand index for CoALa are 0.8852 and 0.9780, respectively,
which shows that CoALa outperforms both lLMSC and gLMSC in terms of F-measure. In terms
of Rand index, the performance of LMSC and CoALa is competitive. Also, the Digits data set
has been used for the evaluation of the multiple kernel learning based late fusion incomplete
multi-view clustering (LF-IMVC) [138] algorithm and the spectral clustering based algorithm
of Wang et al. [234]. The aggregate purity and normalized mutual information (NMI) values
on the Digits data set for LF-IMVC are 0.7980 and 0.6899, respectively, while the NMI
achieved by the algorithm of Wang et al. is 0.785. For CoALa, the aggregate purity and NMI
obtained are 0.8835 and 0.797659, respectively. The results imply that CoALa outperforms
both these algorithms on the Digits data set.

Table 5.12: Comparative Performance Analysis on Benchmark Data Sets: Football,
Politics-UK, Rugby, Digits

Data Set                           Measure    Index          Best View  L^r        SNF        CoALa
Football                                      Subspace Rank  20         20         20         20
[n = 248; k = 20; M = 9]           External   Accuracy       0.7757224  0.6564516  0.8145161  0.8500000
                                              NMI            0.7910368  0.7748572  0.8829152  0.8625365
                                              ARI            0.5814725  0.3777853  0.7458860  0.7278994
                                              F-measure      0.7747023  0.6616297  0.8431825  0.8683491
                                              Rand           0.9472965  0.8843737  0.9735862  0.9739682
                                              Purity         0.7282258  0.6572580  0.8266129  0.8584677
                                   Internal   Silhouette     0.5565601  0.4392812  0.4750064  0.5170209
                                              Dunn           0.0122200  0.0304905  0.0496361  0.0506094
                                              DB             0.4087806  0.5388078  0.6463104  0.5318746
                                              Xie-Beni       181.35320  36.629720  16.878340  15.47080
                                              Time (in sec)  0.68       1.13       1.05       1.34
Politics-UK                                   Subspace Rank  5          5          5          5
[n = 419; k = 5; M = 9]            External   Accuracy       0.8902148  0.7591885  0.9737470  0.9665871
                                              NMI            0.8382287  0.6777684  0.9194125  0.9434825
                                              ARI            0.8375676  0.7205330  0.9608391  0.9633130
                                              F-measure      0.9175316  0.8192186  0.9701235  0.9736129
                                              Rand           0.9196880  0.8603076  0.9814665  0.9826084
                                              Purity         0.9713604  0.8591885  0.9761337  0.9785203
                                   Internal   Silhouette     0.7877163  0.5531584  0.7599383  0.6165161
                                              Dunn           0.0691656  0.0082616  0.0121941  0.0216676
                                              DB             0.5042173  0.4124179  0.4971371  0.6299340
                                              Xie-Beni       4.1253610  544.13860  66.892230  68.551380
                                              Time (in sec)  0.95       1.86       3.83       3.68
Rugby                                         Subspace Rank  15         15         15         15
[n = 854; k = 15; M = 9]           External   Accuracy       0.7121779  0.7283372  0.7611241  0.8305621
                                              NMI            0.6151681  0.6526318  0.6768068  0.7093834
                                              ARI            0.5461666  0.6416748  0.5485665  0.6627701
                                              F-measure      0.7426962  0.6845209  0.7778990  0.8349647
                                              Rand           0.8672685  0.8578210  0.8818113  0.9067597
                                              Purity         0.7796253  0.6803279  0.8454333  0.8606557
                                   Internal   Silhouette     0.5444214  0.5195532  0.4713082  0.4123312
                                              Dunn           0.0012972  0.0085216  0.0051843  0.0086649
                                              DB             0.4727219  0.4603219  0.5856659  0.7474256
                                              Xie-Beni       780.66640  212.29020  827.27610  328.6280
                                              Time (in sec)  4.77       7.21       22.94      27.42
Digits                                        Subspace Rank  10         10         10         10
[n = 2000; k = 10; M = 6]          External   Accuracy       0.7096000  0.8500000  0.8835000  0.8835000
                                              NMI            0.6443707  0.7951372  0.8904675  0.7981981
                                              ARI            0.5416071  0.7267166  0.8435742  0.7645096
                                              F-measure      0.7209662  0.8481826  0.8932872  0.8839913
                                              Rand           0.9173923  0.9503602  0.9715983  0.9576618
                                              Purity         0.7100000  0.8500000  0.8835000  0.8835000
                                   Internal   Silhouette     0.4860050  0.5265748  0.4452352  0.4269673
                                              Dunn           0.0050409  0.0064673  0.0031041  0.0071841
                                              DB             0.5722576  0.5331665  0.8063785  0.7470644
                                              Xie-Beni       1275.5800  830.76950  1166.0330  659.67560
                                              Time (in sec)  80.71      135.65     189.03     154.57

Table 5.13: Comparative Performance Analysis on Benchmark Data Sets: ORL, Caltech7,
CORA

Data Set                           Measure    Index          Best View  L^r        SNF        CoALa
ORL                                           Subspace Rank  40         40         40         40
[n = 400; k = 40; M = 3]           External   Accuracy       0.7167500  0.7155000  0.6907500  0.7715000
                                              NMI            0.8800011  0.8804695  0.8616789  0.8980924
                                              ARI            0.6232466  0.6225230  0.6054544  0.6932679
                                              F-measure      0.7643606  0.7609769  0.7257119  0.7962088
                                              Rand           0.9807581  0.9803784  0.9804474  0.9850088
                                              Purity         0.7740000  0.7722500  0.7450000  0.8090000
                                   Internal   Silhouette     0.3613618  0.3631277  0.5106249  0.2952385
                                              Dunn           0.1980555  0.1887661  0.0642730  0.2695878
                                              DB             1.2051272  1.2094668  1.0163360  1.4134120
                                              Xie-Beni       2.4075381  2.7465589  102.50140  1.978858
                                              Time (in sec)  12.93      62.32      16.36      13.54
Caltech7                                      Subspace Rank  7          7          7          7
[n = 1474; k = 7; M = 6]           External   Accuracy       0.5468114  0.4789688  0.5440299  0.5685210
                                              NMI            0.3222844  0.4730254  0.5676032  0.5650165
                                              ARI            0.3202251  0.3476278  0.4126422  0.4397484
                                              F-measure      0.6273082  0.6023481  0.6363390  0.6689529
                                              Rand           0.7038184  0.7242808  0.7482112  0.7583422
                                              Purity         0.7761194  0.8127544  0.8516282  0.8548168
                                   Internal   Silhouette     0.2790141  0.325057   0.5682432  0.3631100
                                              Dunn           0.0207478  0.040731   0.0357273  0.0361922
                                              DB             1.1931470  1.082995   0.8238205  0.9804257
                                              Xie-Beni       77.491480  27.67097   83.675470  44.50850
                                              Time (in sec)  12.18      20.85      26.76      21.64
CORA                                          Subspace Rank  7          7          7          7
[n = 2708; k = 7; M = 2]           External   Accuracy       0.4090472  0.4237444  0.5450517  0.5896233
                                              NMI            0.2276971  0.2739455  0.3829834  0.4364573
                                              ARI            0.0676929  0.1337145  0.2941402  0.3256322
                                              F-measure      0.3924244  0.4335790  0.5957978  0.5844190
                                              Rand           0.4164793  0.5395755  0.7936103  0.7460456
                                              Purity         0.4333456  0.4362629  0.6012186  0.6206425
                                   Internal   Silhouette     0.6649553  0.3931360  0.3874047  0.3951749
                                              Dunn           0.0006289  0.0027002  0.0371434  0.0111024
                                              DB             0.8543087  0.7333716  0.9988864  0.8347002
                                              Xie-Beni       172.76125  77.745998  74.81975   116.997500
                                              Time (in sec)  27.97      7.12       14.36      11.49

In terms of internal cluster evaluation indices, the results in Tables 5.12 and 5.13 show
that out of 28 cases, the proposed algorithm achieves the best performance in nine cases
and the second best in ten cases. For the Twitter data sets, the best modality achieves
superior performance for the majority of the internal indices. The execution times reported
in Tables 5.12 and 5.13 indicate that the proposed method is computationally more efficient
than SNF for five out of seven data sets. Although for the omics data sets in Table
5.11, SNF needs lower execution time compared to CoALa, CoALa demonstrates higher
computational efficiency compared to SNF for the benchmark data sets with a larger number
of samples or component modalities.

5.6 Conclusion
This chapter presents a novel algorithm, for the integration of multiple similarity graphs,
that prevents the noise of the individual graphs from being propagated into the unified one.
The proposed method first approximates each graph using the most informative eigenpairs
of its Laplacian, which contain its cluster information. Thus, the noise in the individual
graphs is not reflected in their approximations. These de-noised approximations are then
integrated for the construction of a low-rank subspace that best preserves the overall cluster
structure of multiple graphs. However, this approximate subspace differs from the full-rank
one, which integrates information of all the eigenpairs of each Laplacian. Using the concept
of matrix perturbation, theoretical bounds are derived as a function of the approximation
rank, in order to precisely evaluate how far the approximate subspace deviates from the
full-rank one. The clusters in the data set are identified by performing k-means clustering on the
approximate de-noised subspace. The effectiveness of the proposed approximation based
approach is established by showing that the approximate subspace encodes better cluster
structure compared to the full-rank one. The clustering performance of the approximate
subspace is compared with that of existing integrative clustering approaches on multiple
real-life cancer data sets as well as on several benchmark data sets from varying application
domains. Experimental results show that the clusters identified by the proposed approach
have closest resemblance with the clinically established cancer subtypes and also with
the ground-truth class information, when compared with individual modalities as well as
existing algorithms.
The meaningful patterns embedded in high-dimensional multi-view data sets typically
tend to have a much more compact representation that often lies close to a low-dimensional
manifold. Identification of hidden structures in such data mainly depends on the proper
modeling of the geometry of low-dimensional manifolds. In this regard, Chapter 6 presents
a manifold optimization based integrative clustering algorithm for multi-view data. The
optimization is performed alternately over k-means and Stiefel manifolds. The Stiefel
manifold helps to model the non-linearities and differential clusters within the individual
views, while the k-means manifold tries to elucidate the best-fit joint cluster structure of the
data.

Chapter 6

Multi-Manifold Optimization for Multi-View Subspace Clustering

6.1 Introduction
Multi-view clustering explores the consistency and complementary properties of different
views to improve clustering performance. It has been extensively used over the last decade
[264,288,289] in various applications like face detection [241], action recognition [168], social
networking [78,275], information retrieval [236], cancer biology [113,192,199], to name just a
few. The observations in different views can convey similar or even differential information.
In multi-view clustering paradigm, these views are expected to agree upon an underlying
global cluster structure [141, 199]. Therefore, during data integration, it is essential to
capture the inherent (dis)similarities in the individual views as well as elucidate the global
cluster structure reflected across different views. Identification of cancer subtypes, from
multiple omic data types like gene expression, DNA methylation, and protein expression,
is one of the important application areas of multi-view clustering. The integrative multi-
omic study can provide a comprehensive view of cancer mechanisms, and complement the
diagnosis and therapeutic choices.
Different components of real-world systems, for instance, genes, micro RNAs, and other
biomolecules in aggressive diseases like cancers, often share non-linear relationships [163].
These non-linearities tend to generate observations that lie on or close to a low-dimensional
manifold. Identification of hidden structures and patterns in data crucially depends on
modeling the geometry of the low-dimensional manifolds. Several popular machine learning
approaches like principal component analysis, independent component analysis, and ma-
trix approximation, can be given a geometric framework and modeled as an optimization
problem whose natural domain is a manifold [3]. For example, the ubiquitous eigenvalue
problem, imposed with norm equality constraints, results in a spherical search space which
is an embedded submanifold of the original vector space. In manifold optimization frame-
work, subspaces simply become single points on the manifold and search algorithms do not
have to rely on the Euclidean vector space assumption of the search space. Two important
submanifolds of the Euclidean space are the Stiefel and k-means manifolds. While the Stiefel
manifold is used to model the geometry of an algorithm with orthonormality constraints [57],
the k-means manifold generalizes spectral clustering over manifolds [28]. The current work
judiciously integrates the merits of these two manifolds to develop a multi-view clustering
algorithm.
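
The eigenvalue example mentioned above can be made concrete with a minimal sketch. It maximizes the Rayleigh quotient x^T A x over the unit sphere by projecting the Euclidean gradient onto the tangent space and renormalizing after each step, and it shows a QR-based retraction onto the Stiefel manifold. The function names, the step size, and the test matrix are illustrative assumptions; the retractions are common textbook choices and are not claimed to be those used by the proposed algorithm.

```python
import numpy as np

def sphere_gradient_ascent(A, steps=500, lr=0.1, seed=0):
    """Maximize x^T A x subject to ||x|| = 1 by Riemannian gradient ascent."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(steps):
        egrad = 2.0 * A @ x                   # Euclidean gradient
        rgrad = egrad - (x @ egrad) * x       # projection onto the tangent space at x
        x = x + lr * rgrad                    # step along the tangent direction
        x /= np.linalg.norm(x)                # retraction back onto the sphere
    return x

def qr_retraction(Y):
    """Map a full-column-rank n x k matrix to one with orthonormal columns."""
    Q, _ = np.linalg.qr(Y)
    return Q

# Hypothetical symmetric test matrix.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
x = sphere_gradient_ascent(A)
rayleigh = float(x @ A @ x)

Y = np.array([[1.0, 2.0], [0.0, 1.0], [1.0, 0.0]])
Q = qr_retraction(Y)
```

The search never leaves the constraint set: each iterate is a point on the sphere, and in the matrix case each iterate would be a point on the Stiefel manifold of matrices with orthonormal columns.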
Manifold optimization has been used in contemporary applications such as face
recognition [180], computer vision [154], object detection [30], and social networking [290],
to identify non-linear patterns in data. Previous efforts have also resulted in tools that
combine manifold learning and gene expression analysis to uncover non-linear structures
among gene networks [163]. Ding et al. [53] identified breast cancer subtypes by merging
linear subspaces on a Grassmannian manifold. A brief survey on manifold based multi-view
clustering approaches is reported in Section 2.2.8 of Chapter 2. Nevertheless, discovering
the structure of the manifold from a set of data points, sampled from the manifold possibly
with noise, still remains a challenging problem, which is also unsupervised in nature. The
problem gets aggravated in the presence of multiple views. Although different views of the
same data set are expected to conform to the same underlying manifold structure, even
subtle behavioral differences can give rise to different non-linear manifolds corresponding
to different views. To identify meaningful clusters, it is not only essential to model the
individual non-linearities, but also to identify the common structures conveyed by different
manifolds.
In this regard, this chapter presents a novel manifold optimization algorithm, termed
as MiMIC (Multi-Manifold Integrative Clustering), to perform multi-view data clustering.
The proposed algorithm extracts a manifold representation for each view, which is intended
to capture the individual non-linearities. It also constructs a joint graph Laplacian that
contains the de-noised cluster information of the individual views. A joint optimization
objective is proposed, comprising a clustering component and a disagreement minimization
component, to look into the consistent cluster information in the individual views with
respect to the joint one. While the clustering component attempts to identify the joint
cluster structure, the other component minimizes the disagreement between the manifold
representation of the joint and individual views. The proposed joint objective is optimized
over the k-means manifold for the clustering component, and Stiefel manifold for the dis-
agreement component. During optimization, a gradient based movement is performed
separately on the individual manifold corresponding to each view, so that the inherent
individual non-linearity is preserved while looking for common cluster information. This
multi-manifold approach is expected to model the individual differential cluster informa-
tion, as well as infer the best-fit global cluster structure of the data set. The convergence
analysis of the proposed algorithm is theoretically established. The asymptotic convergence
bound theoretically quantifies how fast the sequence of iterates generated by the proposed
algorithm converges to an optimal solution, if one exists. Moreover, the bound is used to make
inference regarding the separability of clusters present in the data set. The efficacy of the
proposed algorithm is studied and compared with that of existing approaches on several
synthetic and benchmark multi-view data sets. The algorithm is also applied for cancer
patient stratification using multi-omics data sets. Some of the results of this chapter are
reported in [114].
The rest of the chapter is organized as follows: Section 6.2 outlines the basic principles
of manifold based data clustering. Section 6.3 presents the proposed multi-view clustering
algorithm based on alternating optimization over multiple non-linear manifolds. In Section
6.4 the asymptotic convergence bound of the proposed algorithm is derived in order to
theoretically quantify how fast the algorithm converges to a local minimum. Case studies on
different multi-view benchmark data sets and multi-omics cancer data sets, along with a
comparative performance analysis with existing approaches, are presented in Section 6.5.
Concluding remarks are provided in Section 6.6.

6.2 Basics of Manifold Based Clustering


Clustering aims at partitioning a finite set of $n$ samples $\{x_i\}_{i=1}^{n}$ into multiple subsets
such that a dissimilarity based cost is minimized. Assuming that the samples lie in the
$d$-dimensional Euclidean space and the number of subsets is $k$, the cost function for the
k-means clustering problem is the sum of squared distance of each sample from the centroid
of the cluster it is assigned to. An equivalent formulation of the k-means objective is given
by [28]
\[
\min_{U \in \Re^{n \times k}} \; -\operatorname{tr}(U^T W U)
\quad \text{such that } U \ge 0;\; U^T U = I_k;\; U U^T \mathbf{1} = \mathbf{1},
\tag{6.1}
\]

where $\operatorname{tr}$ denotes the trace of a matrix, $U^T$ denotes the transpose of matrix $U$, $I_k$ denotes the
identity matrix of order $k$, $\mathbf{1}$ denotes a column vector of all ones, $U$ denotes the real-valued
relaxation of the discrete cluster indicator matrix, and $W = [w(i,j)]$ is an $(n \times n)$ symmetric
positive semi-definite Gram matrix or "affinity" matrix. The affinity matrix $W$ can be replaced
by the normalized affinity matrix $D^{-1/2} W D^{-1/2}$ [162], where $D$ is the degree matrix
given by $D = \operatorname{diag}(\bar{d}_1, \ldots, \bar{d}_i, \ldots, \bar{d}_n)$ with $\bar{d}_i = \sum_{j=1}^{n} w(i,j)$. Replacing the affinity matrix
$W$ in (6.1) by the normalized affinity $D^{-1/2} W D^{-1/2}$ and adding the constant identity matrix
$I_n$, the objective of (6.1) becomes the minimization of $\operatorname{tr}\big(U^T (I_n - D^{-1/2} W D^{-1/2}) U\big)$.
The matrix $\mathcal{L} = (I_n - D^{-1/2} W D^{-1/2})$ is known as the normalized graph Laplacian
corresponding to the affinity matrix $W$. Let the eigenvectors corresponding to the $k$ smallest
eigenvalues of a matrix be referred to as the $k$ smallest eigenvectors in the rest of the chapter.
The minimization of $\operatorname{tr}(U^T \mathcal{L} U)$, subject to the orthogonality constraint $U^T U = I_k$,
is actually the spectral clustering objective [230], [45] for which the optimal $U$ is given
by the $k$ smallest eigenvectors of $\mathcal{L}$. The $k$ smallest eigenvectors of $\mathcal{L}$ thus contain the
cluster information of $\mathcal{L}$. However, the best rank $k$ approximation of $\mathcal{L}$ is obtained using
the $k$ largest eigenvectors and their corresponding eigenvalues. In order to merge the best
low-rank approximation of $\mathcal{L}$ with the cluster information contained in it, the shifted
normalized Laplacian [113], [51] is used instead of the normalized Laplacian $\mathcal{L}$, which is defined
as
\[
L = 2 I_n - \mathcal{L} = I_n + D^{-1/2} W D^{-1/2}. \tag{6.2}
\]
The $k$ smallest eigenvectors of the normalized Laplacian $\mathcal{L}$ correspond to the $k$ largest
eigenvectors of the shifted normalized Laplacian $L$ [113], [51]. So, the minimization of $\operatorname{tr}(U^T \mathcal{L} U)$
becomes the maximization of $\operatorname{tr}(U^T L U)$ in terms of the shifted normalized Laplacian.
In this chapter, however, a gradient descent based approach is developed, for which the
minimization objective in terms of the shifted normalized Laplacian becomes
\[
\min_{U \in \Re^{n \times k}} \; -\operatorname{tr}(U^T L U)
\quad \text{such that } U \ge 0;\; U^T U = I_k;\; U U^T \mathbf{1} = \mathbf{1}.
\tag{6.3}
\]

A relaxation of (6.3) is to include the non-negativity constraint $U \ge 0$ as a penalty in the
objective, which is given by
\[
\min_{U \in \Re^{n \times k}} \; -\operatorname{tr}(U^T L U) + \xi \, \| U_{-} \|_F^2
\quad \text{such that } U^T U = I_k;\; U U^T \mathbf{1} = \mathbf{1},
\tag{6.4}
\]
where $U_{-}$ denotes the negative entries of $U$, $\| \cdot \|_F$ denotes the Frobenius norm of a matrix,
and $\xi$ is a non-negative parameter. The constraint set in (6.4) is given by
\[
\mathrm{Km}(n, k) := \{ U \in \Re^{n \times k} : U^T U = I_k,\; U U^T \mathbf{1} = \mathbf{1} \}, \tag{6.5}
\]
which is a submanifold of the Euclidean space $\Re^{n \times k}$, and known as the k-means manifold
[28]. Thus, the NP-hard k-means clustering objective can be relaxed to the constrained
optimization in (6.4), where the constraint set is a manifold. The problem now falls under
the elegant theory of manifold optimization [3], which allows us to model the problem as
the following unconstrained optimization problem
\[
\min_{U \in \mathrm{Km}(n, k)} \; -\operatorname{tr}(U^T L U) + \xi \, \| U_{-} \|_F^2
\]
over the manifold $\mathrm{Km}(n, k)$. In the current work, a manifold optimization based algorithm
is designed to efficiently integrate cluster information from different views of a multi-view
data set. In the rest of the chapter, the term 'Laplacian' refers to the shifted normalized
Laplacian $L$ as defined in (6.2), unless stated otherwise.
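
The relation between the two Laplacians can be checked numerically. The following is a minimal NumPy sketch, not part of the original experiments: the affinity matrix `W` is a hypothetical toy example with two groups, and the helper name `shifted_normalized_laplacian` is an illustrative assumption. It builds the shifted normalized Laplacian of (6.2) and compares the subspace of the k smallest eigenvectors of the normalized Laplacian with that of the k largest eigenvectors of the shifted one.

```python
import numpy as np

def shifted_normalized_laplacian(W):
    """Return I_n + D^{-1/2} W D^{-1/2} for a symmetric affinity matrix W, as in (6.2)."""
    d = W.sum(axis=1)                      # degrees: d_i = sum_j w(i, j)
    d_inv_sqrt = 1.0 / np.sqrt(d)          # assumes every degree is strictly positive
    W_norm = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    return np.eye(W.shape[0]) + W_norm

# Toy affinity matrix (hypothetical data): two tight groups {0, 1, 2} and {3, 4}.
W = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.7, 0.0, 0.1],
    [0.8, 0.7, 1.0, 0.1, 0.0],
    [0.1, 0.0, 0.1, 1.0, 0.9],
    [0.0, 0.1, 0.0, 0.9, 1.0],
])
n = W.shape[0]
L_shifted = shifted_normalized_laplacian(W)
L_normalized = 2.0 * np.eye(n) - L_shifted   # equals I_n - D^{-1/2} W D^{-1/2}

# numpy.linalg.eigh returns eigenvalues in ascending order.
vals_norm, vecs_norm = np.linalg.eigh(L_normalized)
vals_shift, vecs_shift = np.linalg.eigh(L_shifted)

k = 2
U_small = vecs_norm[:, :k]      # k smallest eigenvectors of the normalized Laplacian
U_large = vecs_shift[:, -k:]    # k largest eigenvectors of the shifted Laplacian

# The two bases span the same subspace, so their orthogonal projectors coincide.
same_subspace = np.allclose(U_small @ U_small.T, U_large @ U_large.T)
```

Comparing projectors rather than the eigenvector matrices themselves avoids spurious mismatches caused by sign flips or column ordering.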

6.3 MiMIC: Proposed Method


Given a set of $n$ samples or objects $\{x_i\}_{i=1}^{n}$, a multi-view data set, consisting of $M$ views,
is given by $M$ matrices $X_1, \ldots, X_m, \ldots, X_M$. Each view $X_m \in \Re^{n \times d_m}$ represents the
observations for a common set of $n$ samples from the $m$-th data source. Let $X_m$ be
encoded by the similarity graph $G_m$ having similarity matrix $W_m = [w_m(i,j)]_{n \times n}$, where
$w_m(i,j) = w_m(j,i) \ge 0$ is the similarity between objects $x_i$ and $x_j$ in the $m$-th view $X_m$.
The degree matrix $D_m$, corresponding to affinity matrix $W_m$, is given by the diagonal
matrix $D_m = \operatorname{diag}(\bar{d}^{\,m}_1, \ldots, \bar{d}^{\,m}_i, \ldots, \bar{d}^{\,m}_n)$, where $\bar{d}^{\,m}_i = \sum_{j=1}^{n} w_m(i,j)$. The shifted normalized
Laplacian $L_m$ for the corresponding view $X_m$ is given by
\[
L_m = I_n + D_m^{-1/2} W_m D_m^{-1/2}. \tag{6.6}
\]
Each $X_m$ is expected to provide a different viewpoint for understanding the true nature of
the data set. A truly integrative approach should be able to leverage the cluster information
in different views to uncover the structure of the data set.

6.3.1 Multi-View Integration


An efficient way of integrating information from multiple views is to consider a convex
combination of the corresponding graph Laplacians $L_m$'s [199], where the views are weighted
according to the quality of their cluster information. In [113], it has also been shown that
the "approximate" graph Laplacian $L^r_m$, constructed from the $r \,(\ge k)$ largest eigenpairs of
$L_m$, encodes better cluster information compared to that of the "full-rank" Laplacian $L_m$.
The approximate Laplacian $L^r_m$ is inherently free from the noise embedded in the $(n - r)$
smallest eigenpairs of $L_m$. Let the approximate joint Laplacian be given by
\[
L^r_{\mathrm{Joint}} = \sum_{m=1}^{M} \alpha_m L^r_m,
\quad \text{such that } \alpha_m \ge 0 \text{ and } \sum_{m=1}^{M} \alpha_m = 1. \tag{6.7}
\]
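
As an illustration of (6.7), the sketch below builds rank-r approximate Laplacians from their r largest eigenpairs and combines them convexly. It is only a sketch under stated assumptions: the two views are random symmetric affinities generated for the example, the weights 0.7 and 0.3 are arbitrary placeholders rather than the relevance-based weights used in the chapter, and all function names are hypothetical.

```python
import numpy as np

def shifted_laplacian(W):
    """I_n + D^{-1/2} W D^{-1/2} for a symmetric, non-negative affinity matrix W."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(W.shape[0]) + d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def rank_r_approximation(L, r):
    """Rebuild a symmetric matrix from its r largest eigenpairs only."""
    vals, vecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    top_vals, top_vecs = vals[-r:], vecs[:, -r:]
    return (top_vecs * top_vals) @ top_vecs.T   # sum_i lambda_i u_i u_i^T

def joint_laplacian(laplacians, alphas, r):
    """Convex combination of rank-r approximate Laplacians, as in (6.7)."""
    alphas = np.asarray(alphas, dtype=float)
    if np.any(alphas < 0) or not np.isclose(alphas.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(a * rank_r_approximation(L, r) for a, L in zip(alphas, laplacians))

# Two hypothetical views built from random symmetric affinities.
rng = np.random.default_rng(0)
views = []
for _ in range(2):
    A = rng.random((6, 6))
    views.append(shifted_laplacian((A + A.T) / 2.0))

L_joint_r = joint_laplacian(views, alphas=[0.7, 0.3], r=2)
```

Each approximate Laplacian has rank at most r, so the joint matrix has rank at most Mr; it is still symmetric because every term in the sum is.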

The joint Laplacian $L^r_{\mathrm{Joint}}$ encodes the de-noised cluster information of all the views. Thus,
the relaxed k-means objective of (6.4) can be optimized using $L^r_{\mathrm{Joint}}$. Note that the
optimization of (6.4) using $L^r_{\mathrm{Joint}}$ would produce an $(n \times k)$ cluster indicator matrix, say $U_{\mathrm{Joint}}$.
However, for real-life data sets, an indicator subspace of rank $r$ ($\ge k$) is generally considered
in order to retain more cluster information from the Laplacian. The relaxed clustering
objective corresponding to $L^r_{\mathrm{Joint}}$ is given by
\[
\min_{U_{\mathrm{Joint}} \in \Re^{n \times r}} \;
-\frac{1}{2} \operatorname{tr}\big(U_{\mathrm{Joint}}^T L^r_{\mathrm{Joint}} U_{\mathrm{Joint}}\big)
+ \frac{\xi}{2} \, \| U_{\mathrm{Joint}-} \|_F^2
\quad \text{such that } U_{\mathrm{Joint}}^T U_{\mathrm{Joint}} = I_r;\;
U_{\mathrm{Joint}} U_{\mathrm{Joint}}^T \mathbf{1} = \mathbf{1}, \tag{6.8}
\]
where $U_{\mathrm{Joint}-}$ denotes the negative entries of $U_{\mathrm{Joint}}$. The matrix $U_{\mathrm{Joint}}$ can alternatively
be thought of as a low-rank orthonormal subspace representation of the joint cluster
information in $L^r_{\mathrm{Joint}}$. Under this new representation of $U_{\mathrm{Joint}}$, the pairwise similarities between
the samples can be computed using their inner product in $U_{\mathrm{Joint}}$, given by
\[
S_{\mathrm{Joint}} = U_{\mathrm{Joint}} U_{\mathrm{Joint}}^T \in \Re^{n \times n}.
\]
Similarly, for each view $X_j$, for $j \in \{1, \ldots, M\}$, let $U_j \in \Re^{n \times r}$ denote its rank $r$ orthonormal
subspace representation, such that $U_j^T U_j = I_r$. The pairwise similarity matrix for $X_j$ using
subspace $U_j$ is given by $S_j = U_j U_j^T$.
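Different views of a multi-view data set are expected to convey similar information.
Therefore, during integration, it is intended to reduce the disagreement between the joint
and individual views. The disagreement between view $X_j$ and the joint view is given by
\[
D(U_{\mathrm{Joint}}, U_j) = \| S_{\mathrm{Joint}} - S_j \|_F^2 .
\]
Substituting the values of $S_{\mathrm{Joint}}$ and $S_j$, we get
\[
\begin{aligned}
D(U_{\mathrm{Joint}}, U_j)
&= \big\| U_{\mathrm{Joint}} U_{\mathrm{Joint}}^T - U_j U_j^T \big\|_F^2 \\
&= \operatorname{tr}\big(U_{\mathrm{Joint}}^T U_{\mathrm{Joint}}\big) + \operatorname{tr}\big(U_j^T U_j\big)
 - 2 \operatorname{tr}\big(U_{\mathrm{Joint}} U_{\mathrm{Joint}}^T U_j U_j^T\big) \\
&= 2r - 2 \operatorname{tr}\big(U_{\mathrm{Joint}} U_{\mathrm{Joint}}^T U_j U_j^T\big).
\end{aligned}
\]
Hence, disagreement minimization reduces to the minimization of $-\operatorname{tr}\big(U_{\mathrm{Joint}} U_{\mathrm{Joint}}^T U_j U_j^T\big)$.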
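
The identity derived above can be verified numerically. The following sketch assumes two random orthonormal bases (generated with a reduced QR factorization purely for illustration; the helper `random_orthonormal` is hypothetical) and compares the Frobenius-norm disagreement with its closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 8, 3

def random_orthonormal(n, r, rng):
    """Return an n x r matrix with orthonormal columns via reduced QR."""
    q, _ = np.linalg.qr(rng.standard_normal((n, r)))
    return q

U_joint = random_orthonormal(n, r, rng)
U_j = random_orthonormal(n, r, rng)

S_joint = U_joint @ U_joint.T          # pairwise similarities in the joint subspace
S_j = U_j @ U_j.T                      # pairwise similarities in view j

disagreement = np.linalg.norm(S_joint - S_j, "fro") ** 2
closed_form = 2 * r - 2 * np.trace(S_joint @ S_j)
```

Because the constant 2r does not depend on the subspaces, only the trace term matters during optimization.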
For each view $X_j$, the aim is to find an orthonormal subspace $U_j$ that optimizes the spectral
clustering objective $\operatorname{tr}(U_j^T L^r_j U_j)$ as well as minimizes its disagreement with the joint view.
Here also, the approximate Laplacian $L^r_j$ is used because of its de-noising properties as
mentioned in [113]. Hence, in the proposed approach, the integrative clustering objective
is given by
\[
\begin{aligned}
f(U_{\mathrm{Joint}}, U_1, \ldots, U_j, \ldots, U_M) ={}
& -\frac{1}{2r} \operatorname{tr}\big(U_{\mathrm{Joint}}^T L^r_{\mathrm{Joint}} U_{\mathrm{Joint}}\big)
  + \frac{\xi}{2r} \, \| U_{\mathrm{Joint}-} \|_F^2 \\
& -\frac{1}{2rM} \sum_{j=1}^{M} \Big[ \operatorname{tr}\big(U_{\mathrm{Joint}} U_{\mathrm{Joint}}^T U_j U_j^T\big)
  + \operatorname{tr}\big(U_j^T L^r_j U_j\big) \Big].
\end{aligned}
\tag{6.9}
\]
The Laplacians $L_{\mathrm{Joint}}$ and $L_j$'s have maximum eigenvalue of 2 for the corresponding
eigenvector $\mathbf{1}$. In the ideal case where all the individual graph Laplacians have identical $r$
disconnected components, $L_{\mathrm{Joint}}$ and $L_j$'s have eigenvalue 2 with multiplicity $r$. Then, the
$L_{\mathrm{Joint}}$ and $L_j$ based trace terms of (6.9) reduce to $2r$, the squared norm term reduces to 0,
while the disagreement based trace term reduces to $r$. So, the final optimization problem
is given by
\[
\min_{U_{\mathrm{Joint}},\, U_j \in \Re^{n \times r}} \; f(U_{\mathrm{Joint}}, U_1, \ldots, U_j, \ldots, U_M)
\quad \text{such that } U_{\mathrm{Joint}}^T U_{\mathrm{Joint}} = I_r,\;
U_{\mathrm{Joint}} U_{\mathrm{Joint}}^T \mathbf{1} = \mathbf{1},\; U_j^T U_j = I_r. \tag{6.10}
\]
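
A direct transcription of the objective (6.9) is sketched below, together with the ideal case described above. The function name `mimic_objective` is hypothetical, and for simplicity the example passes the same full Laplacian for the joint and individual views (an assumption made only for this illustration; in the ideal block-structured case it agrees with the rank-r approximation on the subspace used).

```python
import numpy as np

def mimic_objective(U_joint, U_views, L_joint_r, L_views_r, xi):
    """Evaluate the integrative clustering objective f of (6.9)."""
    r = U_joint.shape[1]
    M = len(U_views)
    U_neg = np.minimum(U_joint, 0.0)                      # negative entries of U_joint
    clustering = -np.trace(U_joint.T @ L_joint_r @ U_joint) / (2 * r)
    penalty = xi * np.sum(U_neg ** 2) / (2 * r)
    S_joint = U_joint @ U_joint.T
    view_terms = sum(
        np.trace(S_joint @ U @ U.T) + np.trace(U.T @ L @ U)
        for U, L in zip(U_views, L_views_r)
    )
    return clustering + penalty - view_terms / (2 * r * M)

# Ideal case: r = 3 disconnected components of size 2, identical in every view.
r, size, M = 3, 2, 2
W = np.kron(np.eye(r), np.ones((size, size)))             # block-diagonal affinity
L = np.eye(r * size) + W / size                           # shifted normalized Laplacian
U = np.kron(np.eye(r), np.ones((size, 1))) / np.sqrt(size)  # normalized indicators

f_ideal = mimic_objective(U, [U] * M, L, [L] * M, xi=1.0)
```

In this ideal configuration the trace terms equal 2r, the penalty vanishes, and the disagreement trace equals r, so the objective evaluates to -1 - 3/2 = -2.5 regardless of r and M.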

The above-mentioned constrained optimization problem can be solved by formulating an
unconstrained optimization problem over the Euclidean space $\Re^{n \times r}$ using Lagrange
multipliers. In such case, the second constraint, that is, $U_{\mathrm{Joint}} U_{\mathrm{Joint}}^T \mathbf{1} = \mathbf{1}$, imposes a row sum to
1 criterion on $U_{\mathrm{Joint}}$ and introduces a set of $n$ Lagrange multipliers, while the orthonormality
constraint on the subspaces $U_{\mathrm{Joint}}$ and $U_j$'s introduces $r(r+1)/2$ Lagrange multipliers
each. These add up to a total of $\big(n + (M+1)\frac{r(r+1)}{2}\big)$ Lagrange multipliers. Instead of
solving a large set of partial derivatives for those multipliers in the Euclidean space, the
problem is mapped to an unconstrained optimization problem over manifolds. Moreover,
manifold optimization has the advantage of capturing low-dimensional non-linear manifold
structure of the high-dimensional views.
The constraints on $U_{\mathrm{Joint}}$ indicate that $U_{\mathrm{Joint}}$ must belong to $\mathrm{Km}(n, r)$, the k-means
manifold of rank $r$, given by (6.5). On the other hand, the orthonormality constraint on
$U_j$ implies that $U_j$ must be an element of the Stiefel manifold [3] of rank $r$, which is given
by
\[
\mathrm{St}(n, r) := \{ U \in \Re^{n \times r} : U^T U = I_r \}. \tag{6.11}
\]
Thus, the constrained optimization problem of (6.10) boils down to the optimization of
$f$ over two different types of manifolds. The unconstrained multi-manifold optimization
problem is, therefore, given by
\[
\min_{\substack{U_{\mathrm{Joint}} \in \mathrm{Km}(n, r) \\ U_j \in \mathrm{St}(n, r)}} \;
f(U_{\mathrm{Joint}}, U_1, \ldots, U_j, \ldots, U_M). \tag{6.12}
\]
A line-search based multi-manifold optimization algorithm, for the proposed objective
function of (6.12), is described next.

6.3.2 Manifold Optimization Based Solution


The solution space for $U_{\mathrm{Joint}}$, in the optimization problem of (6.12), is the k-means
manifold $\mathrm{Km}(n, r)$, while for each $U_j$, it is the Stiefel manifold $\mathrm{St}(n, r)$. The parameters $n$ and
$r$ are kept fixed for both of these manifolds, and are dropped for notational simplicity.
The k-means and Stiefel manifolds are, henceforth, referred to as Km and St, respectively,
both having parameters $(n, r)$. Both k-means and Stiefel manifolds are non-linear
submanifolds of the Euclidean space $\Re^{n \times r}$, which are not necessarily endowed with a vector space
structure. Consequently, the standard gradient descent, where the iterates are obtained
based on vector operations, cannot be applied on these manifolds. Line-search generalizes
the concept of gradient descent on non-linear manifolds [3]. It implements the following three
steps iteratively until convergence: (i) project the gradient of the objective function onto
the tangent space of the manifold; (ii) move along the direction of negative gradient in
the tangent space; and (iii) project the point obtained in step (ii) back to the manifold.
The optimization objective $f$ in (6.9) is a continuously differentiable scalar field over both
the manifolds. Given initializations $U^{(0)}_{\mathrm{Joint}}$ and $U^{(0)}_j$'s, for $j \in \{1, \ldots, M\}$, the line-search
optimization of $f$ over multiple manifolds proceeds as follows.
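
Before turning to the two manifolds used in this chapter, the three line-search steps can be illustrated on a simpler manifold. The sketch below is not the proposed algorithm: it assumes the unit sphere as the manifold, a fixed step length, and a hypothetical diagonal matrix `A`, and minimizes $-x^T A x$, which is the norm-constrained eigenvalue problem mentioned earlier in the chapter.

```python
import numpy as np

def sphere_line_search(A, x0, step=0.1, iters=200):
    """Minimize -x^T A x over the unit sphere with project / move / retract steps."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        neg_grad = 2.0 * A @ x                       # negative Euclidean gradient of -x^T A x
        tangent = neg_grad - (x @ neg_grad) * x      # (i) project onto the tangent space at x
        y = x + step * tangent                       # (ii) move within the tangent space
        x = y / np.linalg.norm(y)                    # (iii) retract back onto the sphere
    return x

A = np.diag([3.0, 1.0, 0.5])                         # hypothetical symmetric matrix
x_star = sphere_line_search(A, np.ones(3))
```

On the sphere the retraction is a simple renormalization; on the k-means and Stiefel manifolds it is replaced by the matrix-exponential and SVD based operations described in the following subsections.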

6.3.2.1 Optimization of UJoint


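Given $U^{(t)}_{\mathrm{Joint}}$ obtained at iteration $t$, and a set of fixed $U_j$'s, for $j \in \{1, \ldots, M\}$, let
\[
\mathcal{U} = \sum_{j=1}^{M} U_j U_j^T .
\]
Substituting the value of $\mathcal{U}$ in (6.9), the direction of negative gradient of $f$ at $U^{(t)}_{\mathrm{Joint}}$ is
given by
\[
\begin{aligned}
-\nabla_{U^{(t)}_{\mathrm{Joint}}} f
&= -\nabla_{U^{(t)}_{\mathrm{Joint}}} \Big[ -\frac{1}{2} \operatorname{tr}\big(U_{\mathrm{Joint}}^T (L^r_{\mathrm{Joint}} + \mathcal{U}) U_{\mathrm{Joint}}\big)
 + \frac{\xi}{2} \, \| U^{(t)}_{\mathrm{Joint}-} \|_F^2 \Big] \\
&= (L^r_{\mathrm{Joint}} + \mathcal{U}) \, U^{(t)}_{\mathrm{Joint}} - \xi \, U^{(t)}_{\mathrm{Joint}-} = Q^{(t)}_{\mathrm{Joint}} \text{ (say)}.
\end{aligned}
\tag{6.13}
\]
Let the tangent space of the k-means manifold rooted at the current iterate $U^{(t)}_{\mathrm{Joint}} \in \mathrm{Km}$
be denoted by $T_{U^{(t)}_{\mathrm{Joint}}} \mathrm{Km}$. Unlike the non-linear manifold Km, its tangent space is a vector
space [3]. This makes the movement along the tangent space feasible using vector addition
and scalar multiplication. The first step of line-search is to project the negative gradient
$Q^{(t)}_{\mathrm{Joint}}$ of (6.13) onto the tangent space. This is done using the projection operator $\Pi$ [57].
Let $\Pi_{T_Y \mathrm{Km}}(W)$ denote the projection of $W \in \Re^{n \times r}$ onto the tangent space of Km rooted
at $Y$. In the present case, the root of the tangent is the current iterate $U^{(t)}_{\mathrm{Joint}}$, while the
point to be projected is the negative gradient $Q^{(t)}_{\mathrm{Joint}}$. At iteration $t$, this projection is given
by [28]
\[
\Pi_{T_{U^{(t)}_{\mathrm{Joint}}} \mathrm{Km}}\big(Q^{(t)}_{\mathrm{Joint}}\big)
= Q^{(t)}_{\mathrm{Joint}} - 2 U^{(t)}_{\mathrm{Joint}} \Omega - (z \mathbf{1}^T + \mathbf{1} z^T) U^{(t)}_{\mathrm{Joint}}
= Z^{(t)}_{\mathrm{Joint}} \text{ (say)}, \tag{6.14}
\]
where $z = \frac{1}{n} Q^{(t)}_{\mathrm{Joint}} \big(U^{(t)}_{\mathrm{Joint}}\big)^T \mathbf{1}$ and
\[
\Omega = \frac{1}{4} \Big( \big(Q^{(t)}_{\mathrm{Joint}}\big)^T U^{(t)}_{\mathrm{Joint}}
+ \big(U^{(t)}_{\mathrm{Joint}}\big)^T Q^{(t)}_{\mathrm{Joint}}
- 2 \big(U^{(t)}_{\mathrm{Joint}}\big)^T (z \mathbf{1}^T + \mathbf{1} z^T) U^{(t)}_{\mathrm{Joint}} \Big).
\]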
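
The projection (6.14) can be transcribed term by term. The sketch below does exactly that and nothing more: the function name `km_tangent_projection` is hypothetical, the root point is a normalized cluster indicator matrix chosen because it satisfies both constraints of (6.5), and the matrix `Q` is random rather than an actual gradient. The check performed is limited to one tangent-space property, namely that $U^T Z + Z^T U = 0$, which follows algebraically from the definition of $\Omega$ whenever $U^T U = I_r$.

```python
import numpy as np

def km_tangent_projection(U, Q):
    """Transcription of (6.14): project Q onto the tangent space of Km rooted at U."""
    n = U.shape[0]
    ones = np.ones((n, 1))
    z = (Q @ U.T @ ones) / n                 # z = (1/n) Q U^T 1, an n x 1 vector
    S = z @ ones.T + ones @ z.T              # z 1^T + 1 z^T, a symmetric n x n matrix
    Omega = 0.25 * (Q.T @ U + U.T @ Q - 2.0 * U.T @ S @ U)
    return Q - 2.0 * U @ Omega - S @ U

# A point on Km(6, 3): normalized indicators of three clusters of size two.
U = np.kron(np.eye(3), np.ones((2, 1))) / np.sqrt(2.0)
ones = np.ones((U.shape[0], 1))

rng = np.random.default_rng(1)
Q = rng.standard_normal(U.shape)             # stand-in for the negative gradient
Z = km_tangent_projection(U, Q)
G = U.T @ Z                                  # skew-symmetric for a tangent vector
```

The second tangent-space condition, which is tied to the row-sum constraint, is not checked here.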

Figure 6.1: Optimization of $U_{\mathrm{Joint}}$ over k-means manifold.

In Figure 6.1, the curved surface is used to denote the k-means manifold, while the
plane denotes its tangent space. The root of the tangent space is the current iterate $U^{(t)}_{\mathrm{Joint}}$,
denoted by the point lying on both the tangent plane and the manifold. The vector moving
out of the tangent plane points towards the negative gradient direction, while its projection
lies on the tangent plane. Given the tangent vector $Z^{(t)}_{\mathrm{Joint}}$ of (6.14) and step length $\eta_K > 0$,
the next step is to move in the direction of $Z^{(t)}_{\mathrm{Joint}}$ within the tangent space and then project
the obtained point from the tangent space $T_{U^{(t)}_{\mathrm{Joint}}} \mathrm{Km}$ back to the manifold Km. This is
achieved using the retraction operator $R$ [3]. Given a manifold $\mathcal{M}$, a point $y \in \mathcal{M}$, and
$\xi \in T_y \mathcal{M}$, the retraction $R_y(\xi)$ has two steps: (i) move along $\xi$ to get the point $y + \xi$ in
the linear embedding space; and (ii) "project" the point $y + \xi$ to the manifold $\mathcal{M}$. For
the current problem, retraction is performed on the tangent vector $Z^{(t)}_{\mathrm{Joint}}$ at point $U^{(t)}_{\mathrm{Joint}}$.
Retraction on Km is performed as follows. Let
\[
Z^{(t+1)}_{\mathrm{Joint}} = U^{(t)}_{\mathrm{Joint}} + \eta_K Z^{(t)}_{\mathrm{Joint}}
\]
be the point obtained by moving along $Z^{(t)}_{\mathrm{Joint}}$ in the tangent space. Since the tangent
space is a vector space, $Z^{(t+1)}_{\mathrm{Joint}}$ belongs to the tangent space itself. The next step is to
project $Z^{(t+1)}_{\mathrm{Joint}}$ from the tangent space to the manifold Km. Retraction on Km involves the
matrix exponential operation [159]. A projection $P(x)$ is called retractive projection of $x$
if $P : x \to y$ is a retraction. Let $P_{\mathrm{Km}_Y}(Z)$ denote the retractive projection of $Z$ from the
tangent space $T_Y \mathrm{Km}$ rooted at $Y$ back to Km. Here, the retractive projection is performed
on $Z^{(t+1)}_{\mathrm{Joint}}$. This is given by [28]
\[
P_{\mathrm{Km}_{U^{(t)}_{\mathrm{Joint}}}}\big(Z^{(t+1)}_{\mathrm{Joint}}\big) = \exp(B) \exp(Q') \, U^{(t)}_{\mathrm{Joint}}, \tag{6.15}
\]
where
\[
\begin{aligned}
Q &= \big(U^{(t)}_{\mathrm{Joint}}\big)^T Z^{(t+1)}_{\mathrm{Joint}} \in \Re^{r \times r}; \\
Q' &= U^{(t)}_{\mathrm{Joint}} \, Q \, \big(U^{(t)}_{\mathrm{Joint}}\big)^T \in \Re^{n \times n}; \text{ and} \\
B &= Z^{(t+1)}_{\mathrm{Joint}} \big(U^{(t)}_{\mathrm{Joint}}\big)^T - U^{(t)}_{\mathrm{Joint}} \big(Z^{(t+1)}_{\mathrm{Joint}}\big)^T - 2Q' \in \Re^{n \times n}.
\end{aligned}
\]
Finally, the retracted point in (6.15) is the $U_{\mathrm{Joint}}$ obtained at iteration $(t+1)$, that is,
\[
U^{(t+1)}_{\mathrm{Joint}} = P_{\mathrm{Km}_{U^{(t)}_{\mathrm{Joint}}}}\big(Z^{(t+1)}_{\mathrm{Joint}}\big).
\]

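Theorem 6.1. $U^{(t+1)}_{\mathrm{Joint}}$ belongs to the k-means manifold.

Proof. For $U^{(t+1)}_{\mathrm{Joint}}$ to belong to the k-means manifold, denoted by Km, it must satisfy its
properties given in (6.5). So, $U^{(t+1)}_{\mathrm{Joint}}$ must have orthonormal columns:
\[
\begin{aligned}
\big(U^{(t+1)}_{\mathrm{Joint}}\big)^T U^{(t+1)}_{\mathrm{Joint}}
&= \big(\exp(B) \exp(Q') U^{(t)}_{\mathrm{Joint}}\big)^T \big(\exp(B) \exp(Q') U^{(t)}_{\mathrm{Joint}}\big) \quad \text{(from (6.15))} \\
&= \big(U^{(t)}_{\mathrm{Joint}}\big)^T \exp(Q')^T \exp(B)^T \exp(B) \exp(Q') \, U^{(t)}_{\mathrm{Joint}} \\
&= \big(U^{(t)}_{\mathrm{Joint}}\big)^T \exp(-Q') \exp(-B) \exp(B) \exp(Q') \, U^{(t)}_{\mathrm{Joint}} \\
&= \big(U^{(t)}_{\mathrm{Joint}}\big)^T U^{(t)}_{\mathrm{Joint}} = I_r .
\end{aligned}
\]
It can be shown that $U^{(t)}_{\mathrm{Joint}} \big(U^{(t)}_{\mathrm{Joint}}\big)^T$ commutes with $\exp(Q')$ [28] (see Lemma 6.1 for
details). Hence,
\[
\exp(Q') \, U^{(t)}_{\mathrm{Joint}} \big(U^{(t)}_{\mathrm{Joint}}\big)^T = U^{(t)}_{\mathrm{Joint}} \big(U^{(t)}_{\mathrm{Joint}}\big)^T \exp(Q').
\]
Also, $\exp(B)\mathbf{1} = \exp(-B)\mathbf{1} = \mathbf{1}$. So,
\[
\begin{aligned}
U^{(t+1)}_{\mathrm{Joint}} \big(U^{(t+1)}_{\mathrm{Joint}}\big)^T \mathbf{1}
&= \big(\exp(B) \exp(Q') U^{(t)}_{\mathrm{Joint}}\big) \big(\exp(B) \exp(Q') U^{(t)}_{\mathrm{Joint}}\big)^T \mathbf{1} \\
&= \exp(B) \exp(Q') \, U^{(t)}_{\mathrm{Joint}} \big(U^{(t)}_{\mathrm{Joint}}\big)^T \exp(Q')^T \exp(B)^T \mathbf{1} \\
&= \exp(B) \exp(Q') \, U^{(t)}_{\mathrm{Joint}} \big(U^{(t)}_{\mathrm{Joint}}\big)^T \exp(-Q') \exp(-B) \mathbf{1} \\
&= \exp(B) \, U^{(t)}_{\mathrm{Joint}} \big(U^{(t)}_{\mathrm{Joint}}\big)^T \exp(Q') \exp(-Q') \exp(-B) \mathbf{1} \\
&= \exp(B) \, U^{(t)}_{\mathrm{Joint}} \big(U^{(t)}_{\mathrm{Joint}}\big)^T \mathbf{1} = \exp(B) \mathbf{1} = \mathbf{1}.
\end{aligned}
\]
Thus, the next iterate $U^{(t+1)}_{\mathrm{Joint}}$ satisfies both the properties of Km, and therefore, belongs to
it.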
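In Theorem 6.1, the commutative property of $U^{(t)}_{\mathrm{Joint}} \big(U^{(t)}_{\mathrm{Joint}}\big)^T$ and $\exp(Q')$ is used
to prove that $U^{(t+1)}_{\mathrm{Joint}}$ belongs to the k-means manifold. The following lemma proves the
commutative property [28].

Lemma 6.1. $U^{(t)}_{\mathrm{Joint}} \big(U^{(t)}_{\mathrm{Joint}}\big)^T$ commutes with $\exp(Q')$.

Proof. The $t$-th iterate of $U_{\mathrm{Joint}}$ belongs to the k-means manifold. So, from the properties
of the k-means manifold (defined in (6.5)), it satisfies
\[
\big(U^{(t)}_{\mathrm{Joint}}\big)^T U^{(t)}_{\mathrm{Joint}} = I_r . \tag{6.16}
\]
From (6.15), we have
\[
Q' = U^{(t)}_{\mathrm{Joint}} \, Q \, \big(U^{(t)}_{\mathrm{Joint}}\big)^T \in \Re^{n \times n}, \tag{6.17}
\]
where $Q \in \Re^{r \times r}$. The exponential of $Q'$ is given by [159]
\[
\exp(Q') = I_n + Q' + \frac{Q'^2}{2!} + \frac{Q'^3}{3!} + \ldots = \sum_{j=0}^{\infty} \frac{Q'^j}{j!} .
\]
Now,
\[
\begin{aligned}
Q' \, U^{(t)}_{\mathrm{Joint}} \big(U^{(t)}_{\mathrm{Joint}}\big)^T
&= U^{(t)}_{\mathrm{Joint}} \, Q \, \big(U^{(t)}_{\mathrm{Joint}}\big)^T U^{(t)}_{\mathrm{Joint}} \big(U^{(t)}_{\mathrm{Joint}}\big)^T \quad \text{(from (6.17))} \\
&= U^{(t)}_{\mathrm{Joint}} \big(U^{(t)}_{\mathrm{Joint}}\big)^T U^{(t)}_{\mathrm{Joint}} \, Q \, \big(U^{(t)}_{\mathrm{Joint}}\big)^T \quad \text{(from (6.16))} \\
&= U^{(t)}_{\mathrm{Joint}} \big(U^{(t)}_{\mathrm{Joint}}\big)^T Q' .
\end{aligned}
\tag{6.18}
\]
Therefore,
\[
\begin{aligned}
\exp(Q') \, U^{(t)}_{\mathrm{Joint}} \big(U^{(t)}_{\mathrm{Joint}}\big)^T
&= \Big( I_n + Q' + \frac{Q'^2}{2!} + \frac{Q'^3}{3!} + \ldots \Big) U^{(t)}_{\mathrm{Joint}} \big(U^{(t)}_{\mathrm{Joint}}\big)^T \\
&= U^{(t)}_{\mathrm{Joint}} \big(U^{(t)}_{\mathrm{Joint}}\big)^T \Big( I_n + Q' + \frac{Q'^2}{2!} + \frac{Q'^3}{3!} + \ldots \Big)
 \quad \text{(applying (6.18) repeatedly)} \\
&= U^{(t)}_{\mathrm{Joint}} \big(U^{(t)}_{\mathrm{Joint}}\big)^T \exp(Q').
\end{aligned}
\]
Hence, $U^{(t)}_{\mathrm{Joint}} \big(U^{(t)}_{\mathrm{Joint}}\big)^T$ commutes with $\exp(Q')$.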
The algorithm for a single update of $U_{\mathrm{Joint}}$ is given in Algorithm 6.1. Figure 6.1 shows
the diagrammatic representation of the gradient computation, tangent space projection,
and retraction operation on the k-means manifold. As shown in Figure 6.1, the point
$Z^{(t+1)}_{\mathrm{Joint}}$ obtained by moving along the tangent plane lies on the tangent plane itself, while
the retracted point $U^{(t+1)}_{\mathrm{Joint}}$ lies only on the curved surface (manifold). The variable $U_{\mathrm{Joint}}$
in the objective function $f$ is optimized over the k-means manifold. There are $M$ other
variables $U_j$'s, each corresponding to one of the views. The solution space for $U_j$, for
$j \in \{1, \ldots, M\}$, is the Stiefel manifold St.

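Algorithm 6.1 Optimize_k-means

▷ Optimization of $U_{\mathrm{Joint}}$ over k-means manifold
Input: Joint Laplacian $L^r_{\mathrm{Joint}}$, subspaces $U_j$ for $j = 1, \ldots, M$, $U^{(t)}_{\mathrm{Joint}}$ of iteration $t$, step
length $\eta_K > 0$, $\xi > 0$.
Output: $U^{(t+1)}_{\mathrm{Joint}}$.
1: Compute negative gradient $Q^{(t)}_{\mathrm{Joint}} \leftarrow \big[ -\nabla_{U^{(t)}_{\mathrm{Joint}}} f \big]$ by (6.13).
2: Project negative gradient onto tangent space:
   $Z^{(t)}_{\mathrm{Joint}} \leftarrow \Pi_{T_{U^{(t)}_{\mathrm{Joint}}} \mathrm{Km}}\big(Q^{(t)}_{\mathrm{Joint}}\big)$ using (6.14).
3: $Z^{(t+1)}_{\mathrm{Joint}} \leftarrow U^{(t)}_{\mathrm{Joint}} + \eta_K Z^{(t)}_{\mathrm{Joint}}$.
4: Find retractive projection $P_{\mathrm{Km}_{U^{(t)}_{\mathrm{Joint}}}}\big(Z^{(t+1)}_{\mathrm{Joint}}\big)$ using (6.15).
5: $U^{(t+1)}_{\mathrm{Joint}} \leftarrow P_{\mathrm{Km}_{U^{(t)}_{\mathrm{Joint}}}}\big(Z^{(t+1)}_{\mathrm{Joint}}\big)$.
6: Return $U^{(t+1)}_{\mathrm{Joint}}$.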

6.3.2.2 Optimization of Uj
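Let $U^{(t)}_j$ denote the $U_j$ obtained at iteration $t$. For a specific $j \in \{1, \ldots, M\}$, considering
$U_{\mathrm{Joint}}$ and all other $U_i$'s to be fixed for $i \in \{1, \ldots, M\}$ and $i \ne j$, the direction of negative
gradient of $f$ at $U^{(t)}_j$ is given by
\[
\begin{aligned}
-\nabla_{U^{(t)}_j} f &= -\nabla_{U^{(t)}_j} \Big[ -\frac{1}{2} \operatorname{tr}\Big( \big(U^{(t)}_j\big)^T \big( U_{\mathrm{Joint}} U_{\mathrm{Joint}}^T + L^r_j \big) U^{(t)}_j \Big) \Big], \\
\Rightarrow \; -\nabla_{U^{(t)}_j} f &= \big( U_{\mathrm{Joint}} U_{\mathrm{Joint}}^T + L^r_j \big) U^{(t)}_j = Q^{(t)}_j \text{ (say)}.
\end{aligned}
\tag{6.19}
\]
To optimize $U_j$, first the negative gradient direction $Q^{(t)}_j$ of (6.19) is projected onto the
tangent space $T_{U^{(t)}_j} \mathrm{St}$ of the Stiefel manifold at $U^{(t)}_j$. The operator $\Pi$ for projecting $Q^{(t)}_j$ onto
the tangent space of St rooted at the current iterate $U^{(t)}_j$ is given by [57]
\[
\begin{aligned}
\Pi_{T_{U^{(t)}_j} \mathrm{St}}\big(Q^{(t)}_j\big)
&= Q^{(t)}_j - \frac{1}{2} U^{(t)}_j \Big( \big(U^{(t)}_j\big)^T Q^{(t)}_j + \big(Q^{(t)}_j\big)^T U^{(t)}_j \Big) \\
&= \Big( I_n - U^{(t)}_j \big(U^{(t)}_j\big)^T \Big) Q^{(t)}_j = Z^{(t)}_j \text{ (say)}.
\end{aligned}
\tag{6.20}
\]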
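
The second equality in (6.20) relies on $\big(U^{(t)}_j\big)^T Q^{(t)}_j$ being symmetric, which holds because $Q^{(t)}_j$ in (6.19) is a symmetric matrix applied to $U^{(t)}_j$. The sketch below checks this numerically; the orthonormal bases are random and `L_j` is a random symmetric stand-in for $L^r_j$, both assumptions made only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 8, 3

U_joint, _ = np.linalg.qr(rng.standard_normal((n, r)))   # orthonormal n x r basis
U_j, _ = np.linalg.qr(rng.standard_normal((n, r)))
A = rng.standard_normal((n, n))
L_j = (A + A.T) / 2.0                                     # symmetric stand-in for L_j^r

Q_j = (U_joint @ U_joint.T + L_j) @ U_j                   # negative gradient, as in (6.19)

# General Stiefel tangent projection and its simplified form from (6.20).
general = Q_j - 0.5 * U_j @ (U_j.T @ Q_j + Q_j.T @ U_j)
simplified = (np.eye(n) - U_j @ U_j.T) @ Q_j
```

The simplified form also makes the geometry explicit: the projected direction is orthogonal to the column space of the current iterate.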

Given the step length $\eta_S > 0$ and the tangent vector $Z^{(t)}_j$, the current iterate $U^{(t)}_j$ is moved
in the direction of the tangent $Z^{(t)}_j$ to obtain
\[
Z^{(t+1)}_j = U^{(t)}_j + \eta_S Z^{(t)}_j .
\]
The point $Z^{(t+1)}_j$, which lies in the tangent space, is retracted back to the manifold St to
obtain the next iterate $U^{(t+1)}_j$. Retraction on St is performed using the singular value
decomposition (SVD) of $Z^{(t+1)}_j$ [2]. Let the SVD of $Z^{(t+1)}_j$ be given by
\[
Z^{(t+1)}_j = E^{(t+1)}_j \, \Xi^{(t+1)}_j \, \big(V^{(t+1)}_j\big)^T ,
\]
where $E^{(t+1)}_j$ and $V^{(t+1)}_j$ contain the left and right singular vectors of $Z^{(t+1)}_j$ in their
columns, respectively, and $\Xi^{(t+1)}_j$ is a diagonal matrix containing the singular values stored
in non-increasing order. Following the notation of (6.15), the retractive projection of $Z^{(t+1)}_j$
onto St is given by
\[
P_{\mathrm{St}_{U^{(t)}_j}}\big(Z^{(t+1)}_j\big) = E^{(t+1)}_j \big(V^{(t+1)}_j\big)^T . \tag{6.21}
\]
The retracted point in (6.21) is the next iterate of $U_j$, that is,
\[
U^{(t+1)}_j = P_{\mathrm{St}_{U^{(t)}_j}}\big(Z^{(t+1)}_j\big).
\]

Theorem 6.2. $U^{(t+1)}_j$ belongs to the Stiefel manifold.

Proof. For $U^{(t+1)}_j$ to belong to the Stiefel manifold, it must satisfy its properties given by
(6.11), that is, it must have orthonormal columns. The matrices $E^{(t+1)}_j$ and $V^{(t+1)}_j$, given
by (6.21), contain the left and right singular vectors of $Z^{(t+1)}_j$, respectively, which have
orthonormal columns. Therefore,
\[
\big(U^{(t+1)}_j\big)^T U^{(t+1)}_j
= V^{(t+1)}_j \big(E^{(t+1)}_j\big)^T E^{(t+1)}_j \big(V^{(t+1)}_j\big)^T = I_r .
\]
Thus, the next iterate of $U_j$ belongs to the Stiefel manifold.

The algorithm for a single update of $U_j$ is given in Algorithm 6.2. The optimization
in $U_j$ is performed for each of the $M$ views separately, considering $U_{\mathrm{Joint}}$ and all $U_i$'s for
$i \in \{1, \ldots, M\}$ and $i \ne j$ to be fixed.

Algorithm 6.2 Optimize_Stiefel

▷ Optimization of $U_j$ over Stiefel manifold
Input: Laplacian $L^r_j$, subspace $U_{\mathrm{Joint}}$, $U^{(t)}_j$ of iteration $t$, step length $\eta_S$.
Output: $U^{(t+1)}_j$.
1: Compute negative gradient $Q^{(t)}_j \leftarrow -\nabla_{U^{(t)}_j} f$ using (6.19).
2: Project negative gradient onto tangent space:
   $Z^{(t)}_j \leftarrow \Pi_{T_{U^{(t)}_j} \mathrm{St}}\big(Q^{(t)}_j\big)$ using (6.20).
3: $Z^{(t+1)}_j \leftarrow U^{(t)}_j + \eta_S Z^{(t)}_j$.
4: Find retractive projection $P_{\mathrm{St}_{U^{(t)}_j}}\big(Z^{(t+1)}_j\big)$ using (6.21).
5: $U^{(t+1)}_j \leftarrow P_{\mathrm{St}_{U^{(t)}_j}}\big(Z^{(t+1)}_j\big)$.
6: Return $U^{(t+1)}_j$.
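
The steps of Algorithm 6.2 map closely onto a few NumPy calls. The following is a minimal sketch under stated assumptions rather than the reference implementation: the names `optimize_stiefel` and `view_objective` are hypothetical, the example passes a full shifted Laplacian built from a random affinity in place of the approximate Laplacian $L^r_j$, and the step length 0.01 is an arbitrary small value.

```python
import numpy as np

def optimize_stiefel(L_j_r, U_joint, U_j, eta_s):
    """One update of U_j over the Stiefel manifold, following Algorithm 6.2."""
    n = U_j.shape[0]
    Q = (U_joint @ U_joint.T + L_j_r) @ U_j                 # step 1: negative gradient (6.19)
    Z = (np.eye(n) - U_j @ U_j.T) @ Q                       # step 2: tangent projection (6.20)
    Z_next = U_j + eta_s * Z                                # step 3: move along the tangent
    E, _, Vt = np.linalg.svd(Z_next, full_matrices=False)   # thin SVD: E is n x r, Vt is r x r
    return E @ Vt                                           # steps 4-5: retraction (6.21)

def view_objective(L_j_r, U_joint, U_j):
    """Part of f that depends on U_j, up to the positive scaling used in (6.19)."""
    return -0.5 * np.trace(U_j.T @ (U_joint @ U_joint.T + L_j_r) @ U_j)

# Hypothetical view: shifted normalized Laplacian of a random symmetric affinity.
rng = np.random.default_rng(0)
n, r = 8, 2
A = rng.random((n, n))
W = (A + A.T) / 2.0
d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
L_j = np.eye(n) + d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

U_joint, _ = np.linalg.qr(rng.standard_normal((n, r)))
U_j, _ = np.linalg.qr(rng.standard_normal((n, r)))
U_next = optimize_stiefel(L_j, U_joint, U_j, eta_s=0.01)
```

Using `full_matrices=False` keeps the left singular vectors at size n x r, so the product of the singular vector matrices has the same shape as the iterate and has orthonormal columns, as Theorem 6.2 requires.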

6.3.3 Proposed Algorithm


Given $M$ affinity matrices $W_1, \ldots, W_M$, corresponding to $M$ views $X_1, \ldots, X_M$ and a fixed
rank $r$, the proposed method extracts a low-rank joint subspace representation $U_{\mathrm{Joint}}$ that
best preserves the cluster structure of a multi-view data set. The clusters embedded within
the data set are identified by performing k-means clustering on the first $k$ columns of $U_{\mathrm{Joint}}$.
The convex combination $\alpha$ of (6.7) assigns the importance of the individual graphs during
data integration. In the proposed algorithm, the weights $\alpha_m$'s are assigned according to
the quality of cluster structure reflected in the individual views, which is determined by
the eigenvalues and eigenvectors of their corresponding Laplacians (see Section 5.3.5 of
Chapter 5).

6.3.3.1 Choice of Initial Iterates


For each view $X_m$, where $m \in \{1, \ldots, M\}$, spectral clustering, in terms of its shifted
normalized Laplacian $L_m$, solves the following optimization problem [162], [45]:
\[
\min_{U_m \in \Re^{n \times k}} \; -\operatorname{tr}\big(U_m^T L_m U_m\big)
\quad \text{such that } U_m^T U_m = I_k, \tag{6.22}
\]
where $k$ is the number of clusters. The solution to (6.22) is given by the $k$ largest
eigenvectors of $L_m$. The rank $r$ spectral clustering solution is chosen as the initial iterate for
the subspaces corresponding to joint and individual views. Let the eigen-decomposition of
the graph Laplacians, corresponding to joint and individual views, be given by
\[
L_{\mathrm{Joint}} = U_{\mathrm{Joint}} \Sigma_{\mathrm{Joint}} U_{\mathrm{Joint}}^T
\quad \text{and} \quad L_j = U_j \Sigma_j U_j^T ,
\]
for $j \in \{1, \ldots, M\}$. Here, $U_{\mathrm{Joint}}$ and $U_j$'s contain eigenvectors while $\Sigma_{\mathrm{Joint}}$ and $\Sigma_j$'s contain
corresponding eigenvalues in non-increasing order. The initial iterates for the proposed
algorithm are given by
\[
U^{(0)}_{\mathrm{Joint}} = U^r_{\mathrm{Joint}} \quad \text{and} \quad U^{(0)}_j = U^r_j ,
\]
where $U^r_{\mathrm{Joint}}$ and $U^r_j$ contain the $r$ largest eigenvectors in $U_{\mathrm{Joint}}$ and $U_j$, respectively.
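
The initialization amounts to a symmetric eigen-decomposition followed by selecting the leading r eigenvectors. A minimal sketch follows; the helper name `top_r_eigenvectors` and the random affinity used to build the Laplacian are assumptions for the example only.

```python
import numpy as np

def top_r_eigenvectors(L, r):
    """Return the r largest eigenvectors (as columns) and eigenvalues of symmetric L."""
    vals, vecs = np.linalg.eigh(L)               # ascending eigenvalues
    order = np.argsort(vals)[::-1][:r]           # indices of the r largest, non-increasing
    return vecs[:, order], vals[order]

# Hypothetical view: shifted normalized Laplacian of a random symmetric affinity.
rng = np.random.default_rng(0)
n, r = 7, 3
A = rng.random((n, n))
W = (A + A.T) / 2.0
d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
L = np.eye(n) + d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

U0, top_vals = top_r_eigenvectors(L, r)          # initial iterate and its eigenvalues
```

The columns returned by `numpy.linalg.eigh` are orthonormal, so the initial iterate already lies on the Stiefel manifold.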

6.3.3.2 Convergence Criterion


Let $f^{(t)}$ denote the value of the objective function $f$ evaluated using $U^{(t)}_{\mathrm{Joint}}$ and $U^{(t)}_j$'s,
obtained at iteration $t$. For the proposed algorithm, the step lengths for optimization on
both the manifolds are chosen to be identical, that is, $\eta_K = \eta_S = \eta$. The direction of
movement at each iteration is always along the negative gradient (as in (6.13) and (6.19)),
which should lead to a reduction in the objective function. To ensure convergence, the
proposed algorithm moves to the next iterate only when there is a sufficient reduction in
the value of the objective function (according to the Armijo criterion [9], see Section S2 of
supplementary). Otherwise, both the step sizes are reduced by a factor $\delta \in (0, 1)$.
The proposed algorithm converges when, even with very small step sizes $\eta_K$ and $\eta_S$, the
difference in the objective function $f$ in two consecutive iterations falls below the threshold
$\epsilon > 0$, that is,
\[
f^{(t)} - f^{(t+1)} < \epsilon . \tag{6.23}
\]
The proposed algorithm is described in Algorithm 6.3.
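
The accept-or-shrink logic described above can be sketched independently of the manifolds. The example below is a simplified illustration, not the Armijo rule itself: it accepts a candidate only when the objective drops by more than a threshold, and otherwise multiplies the step length by a reduction factor until the step becomes negligible. The function `backtracking_descent`, the quadratic test function, and all default values are assumptions for the example.

```python
def backtracking_descent(f, update, x0, eta=1.0, delta=0.5, eps=1e-8, eta_min=1e-6):
    """Accept a candidate only if f drops by more than eps; otherwise shrink the step."""
    x, fx = x0, f(x0)
    while eta > eta_min:
        x_new = update(x, eta)          # candidate iterate for the current step length
        f_new = f(x_new)
        if fx - f_new > eps:            # sufficient reduction: move to the next iterate
            x, fx = x_new, f_new
        else:                           # no sufficient reduction: reduce the step length
            eta *= delta
    return x, fx

# Hypothetical one-dimensional example: minimize (x - 3)^2 by gradient steps.
def quadratic(x):
    return (x - 3.0) ** 2

def gradient_step(x, eta):
    return x - eta * 2.0 * (x - 3.0)

x_final, f_final = backtracking_descent(quadratic, gradient_step, x0=0.0)
```

With the initial step length of 1 the first candidate overshoots and is rejected, the halved step lands on the minimizer, and every later candidate is rejected until the step length falls below the lower limit, which ends the loop.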

6.3.3.3 Computational Complexity


Let $X_1, \ldots, X_m, \ldots, X_M$, where $X_m \in \Re^{n \times d_m}$, be $M$ different views of a multi-view data
set, all measured on the same set of $n$ samples. The number of clusters in the data set is
assumed to be known and is denoted by $k$, and let $r$ be the rank of joint and individual
subspaces, $U_{\mathrm{Joint}}$ and $U_j$'s, which is given as input to the proposed Algorithm 6.3. Given
the similarity matrix $W_m$ for view $X_m$, its graph Laplacian $L_m$ is computed in step 2
in $O(n^2)$ time. Then, the eigen-decomposition of $L_m$ is computed in step 3 which takes
$O(n^3)$ time for the $(n \times n)$ matrix. The computation of relevance $\chi_m$ in step 6 involves
computation of Silhouette index which has pair-wise distance computation and takes $O(n^2)$
time. For $M$ views, the total complexity of steps 1-6 is bounded by $O(M n^3)$. The
computation of joint Laplacian and its eigen-decomposition in steps 7 and 8, respectively,
takes at most $O(n^3)$ time. Steps 9-10 are initializations, which take constant time. For
a fixed $j$, optimization of $U_j$ over the Stiefel manifold takes $O(n^2 r)$ time. The loop for $j$
in step 12 runs once for each of the $M$ views, which contributes to a total complexity
of $O(M n^2 r)$ for steps 12-14. The optimization of $U_{\mathrm{Joint}}$ over the k-means manifold in step
15 has $O(n^3)$ time complexity due to the matrix exponential based retraction operation.
The computation of the joint objective in step 16 takes $O(M n^2 r)$ time. The evaluation
of convergence criteria and variable update in steps 17-21 takes $O(1)$ time. Assuming
that the algorithm takes $t$ iterations to converge, the overall complexity of steps 11-22
is bounded by $O(t \max\{n^3, M n^2 r\})$. The clustering on the final solution $U^{\star}_{\mathrm{Joint}}$ in step 24
takes $O(t_{km} n k^2)$ time, where $t_{km}$ is the maximum number of iterations k-means clustering
executes.
Hence, the overall computational complexity of the proposed MiMIC algorithm, to extract
the subspace $U^{\star}_{\mathrm{Joint}}$ and perform clustering, is
$O\big(M n^3 + t \max\{n^3, M n^2 r\} + t_{km} n k^2\big) = O(t n^3)$, assuming $M, r, k \ll n$.

Algorithm 6.3 Proposed Algorithm: MiMIC

Input: Similarity matrices $W_1, \ldots, W_M$, number of clusters $k$, rank $r \ge k$, step lengths
$\eta_K$ and $\eta_S$, Km parameter $\xi > 0$, convergence parameter $\epsilon > 0$ and $\delta \in (0, 1)$.
Output: Subspace $U^{\star}_{\mathrm{Joint}}$ and clusters $A_1, \ldots, A_k$.
1: for $m \leftarrow 1$ to $M$ do
2:    Construct degree matrix $D_m$ and Laplacian $L_m$ as in (6.6).
3:    Compute eigen-decomposition of $L_m$.
4:    Store $r$ largest eigenvectors of $L_m$ in columns of $U^r_m$.
5:    Compute weight $\alpha_m$ of $X_m$ (using (34) of the supplementary document).
6: end for
7: Compute joint Laplacian $L^r_{\mathrm{Joint}}$ using (6.7).
8: Compute eigen-decomposition of $L^r_{\mathrm{Joint}}$.
9: Initialize: $U^{(0)}_{\mathrm{Joint}} \leftarrow U^r_{\mathrm{Joint}}$, $U^{(0)}_j \leftarrow U^r_j$, $j = 1, \ldots, M$.
10: $t \leftarrow 0$; $f^{(0)} \leftarrow f\big(U^{(0)}_{\mathrm{Joint}}, U^{(0)}_1, \ldots, U^{(0)}_M\big)$.
11: do
12:    for each $j \in \{1, \ldots, M\}$ do
13:       $U^{(t+1)}_j \leftarrow$ Optimize_Stiefel$\big(L^r_j, U^{(t)}_{\mathrm{Joint}}, U^{(t)}_j, \eta_S\big)$
14:    end for
15:    $U^{(t+1)}_{\mathrm{Joint}} \leftarrow$ Optimize_k-means$\big(L^r_{\mathrm{Joint}}, U^{(t)}_{\mathrm{Joint}}, U^{(t)}_1, \ldots, U^{(t)}_M, \eta_K, \xi\big)$.
16:    Compute $f^{(t+1)} \leftarrow f\big(U^{(t+1)}_{\mathrm{Joint}}, U^{(t+1)}_1, \ldots, U^{(t+1)}_M\big)$.
17:    if $\big(f^{(t)} - f^{(t+1)}\big) > \epsilon$ then
18:       Update to next iteration: $t = t + 1$.
19:    else
20:       Reduce step length: $\eta_S = \delta \eta_S$, $\eta_K = \delta \eta_K$.
21:    end if
22: while ($\eta_K > 10^{-6}$ and $\eta_S > 10^{-6}$)
23: Optimal solution: $U^{\star}_{\mathrm{Joint}} \leftarrow U^{(t+1)}_{\mathrm{Joint}}$.
24: Perform k-means clustering on first $k$ columns of $U^{\star}_{\mathrm{Joint}}$.
25: Return $U^{\star}_{\mathrm{Joint}}$ and clusters $A_1, \ldots, A_k$ from k-means.

6.4 Asymptotic Convergence Analysis
This section presents the convergence analysis of the proposed MiMIC algorithm for the
set of given Uj ’s under Armijo constraints [9] on the choice of step length. The asymptotic
behavior of the algorithm is also studied to derive a bound on the difference between the
objective function f evaluated at some iteration t and at the optimal solution, for suffi-
ciently large values of t. The bound can be used to make inference about the compactness
and separability of the clusters in the data set.
The proposed MiMIC algorithm for multi-view data clustering is provided in Algorithm
6.3. Before discussing the convergence result and analyzing its asymptotic behavior, the
notation for the retraction operation on a manifold is re-stated next. Given a manifold $\mathcal{M}$ and a point $y \in \mathcal{M}$, let $T_y\mathcal{M}$ denote the tangent space of the manifold rooted at point $y$. Given a tangent $\xi \in T_y\mathcal{M}$, the retraction operation $R_y(\xi)$ denotes the combination of two steps. First, movement along $\xi$ to get the point $y + \xi$ in the tangent space. Second, projection of the point $y + \xi$ back to the manifold $\mathcal{M}$. For minimization of a function $f(y)$ over $\mathcal{M}$, given the current iterate $y^{(t)}$ at iteration $t$, the update equation for line-search [3] on $\mathcal{M}$ is given by
$$y^{(t+1)} = R_{y^{(t)}}\left(\eta d^{(t)}\right),$$
where $d^{(t)}$ is a descent direction and $\eta$ is the step length. For the proposed MiMIC algorithm, while optimizing the joint objective $f$ with respect to $U_{Joint}$ over the k-means manifold $Km$, the set of update equations is given by (Section 6.3.2.1)
$$
\begin{aligned}
Q_{Joint}^{(t)} &= -\nabla_{U_{Joint}^{(t)}} f \\
Z_{Joint}^{(t)} &= \Pi_{T_{U_{Joint}^{(t)}} Km}\left(Q_{Joint}^{(t)}\right) \\
Z_{Joint}^{(t+1)} &= U_{Joint}^{(t)} + \eta Z_{Joint}^{(t)} \\
U_{Joint}^{(t+1)} &= P_{Km,\,U_{Joint}^{(t)}}\left(Z_{Joint}^{(t+1)}\right),
\end{aligned}
\tag{6.24}
$$
where $U_{Joint}^{(t)}$ denotes the value of $U_{Joint}$ at iteration $t$. The set of equations in (6.24) can be coupled using the retraction operation $R$ and written as
$$U_{Joint}^{(t+1)} = R_{Km,\,U_{Joint}^{(t)}}\left(-\eta \nabla_{U_{Joint}^{(t)}} f\right), \tag{6.25}$$
where $R_{Km}$ denotes retraction on the k-means manifold. To prove the convergence of the proposed algorithm, certain restrictions are imposed on the descent direction and choice of step size during optimization. These are discussed in Appendix D.

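The two-step retraction described above (move along a tangent direction, then project back onto the manifold) can be illustrated on a much simpler manifold. The following is a minimal sketch, assuming the unit sphere as a stand-in for the k-means manifold and a toy quadratic cost; the function names, the cost, and the step length are illustrative assumptions and are not part of the MiMIC implementation.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def project_to_tangent(y, g):
    """Project g onto the tangent space of the unit sphere at y
    by removing the component of g along y."""
    dot = sum(a * b for a, b in zip(y, g))
    return [b - dot * a for a, b in zip(y, g)]

def retract(y, xi):
    """Retraction on the sphere: move to y + xi, then normalise
    to project the point back onto the manifold."""
    moved = [a + b for a, b in zip(y, xi)]
    n = norm(moved)
    return [a / n for a in moved]

def line_search_step(y, euclid_grad, eta):
    """One update y_next = R_y(eta * d), where d is the negative
    gradient projected onto the tangent space at y."""
    d = project_to_tangent(y, [-g for g in euclid_grad])
    return retract(y, [eta * a for a in d])

# Toy cost (an assumption for illustration): f(y) = y^T A y with A = diag(1, 2, 3).
A_DIAG = [1.0, 2.0, 3.0]

def cost(y):
    return sum(a * v * v for a, v in zip(A_DIAG, y))

def grad(y):
    return [2.0 * a * v for a, v in zip(A_DIAG, y)]

y0 = [1.0 / math.sqrt(3.0)] * 3
y1 = line_search_step(y0, grad(y0), eta=0.05)
```

In this sketch the new iterate stays on the sphere and the cost decreases for the chosen step length; on the k-means manifold the projection and retraction operators differ, but the structure of the update is the same.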
6.4.1 Convergence
The following convergence result in Theorem 6.3 for line-search optimization over manifolds is motivated by its classical counterpart in $\mathbb{R}^n$ [3].

Theorem 6.3 (Convergence). Every limit point of the sequence $\{U_{Joint}^{(t)}\}_{t=0,1,2,\ldots}$, generated by the proposed algorithm for a set of given $U_j$'s for $j \in \{1, \ldots, M\}$, is a critical point of the cost function $f$.

Proof. (By contradiction) Let there be a subsequence of iterations $\{U_{Joint}^{(t)}\}_{t \in \tau}$ that converges to some $U_{Joint}^{\star}$ which is not a critical point of $f$, that is, $\nabla_{U_{Joint}^{\star}} f \neq 0$. The direction of movement at each iteration is the negative gradient along which the reduction of cost $f$ is maximum. It follows that the whole sequence $\{f(U_{Joint}^{(t)})\}$ is non-increasing and converges to $f(U_{Joint}^{\star})$. So, the difference $f(U_{Joint}^{(t)}) - f(U_{Joint}^{(t+1)})$ goes gradually to zero. The Armijo criterion $C_A$, given by (D.2), is evaluated at each iteration of the proposed MiMIC algorithm. The algorithm proceeds to the next iteration only if $C_A \geq 0$. The k-means manifold, over which $U_{Joint}$ is optimized, is a Riemannian manifold with the inner product given by $\langle Z_1, Z_2 \rangle = \mathrm{tr}(Z_1^T Z_2)$ [28]. This relation can be used to replace the trace term in (D.2). Furthermore, for a set of given $U_j$'s for $j \in \{1, \ldots, M\}$, $f$ becomes a function of $U_{Joint}$ only. In that case, the negative gradient becomes $-\nabla_{U_{Joint}^{(t)}} f = Q_{Joint}^{(t)}$ (see (6.13)). Using (6.24), the Armijo criterion $C_A$ can be written as
$$C_A = f\left(U_{Joint}^{(t)}\right) - f\left(R_{Km,\,U_{Joint}^{(t)}}\left(\eta Q_{Joint}^{(t)}\right)\right) + \sigma \eta^{(t)} \left\langle \nabla_{U_{Joint}^{(t)}} f, Q_{Joint}^{(t)} \right\rangle. \tag{6.26}$$

The proposed MiMIC algorithm proceeds to the next iteration only if $C_A \geq 0$, else it reduces the step size and checks again. Now, $C_A \geq 0$ implies that at each iteration the proposed algorithm satisfies
$$f\left(U_{Joint}^{(t)}\right) - f\left(R_{Km,\,U_{Joint}^{(t)}}\left(\eta Q_{Joint}^{(t)}\right)\right) \geq -\sigma \eta^{(t)} \left\langle \nabla_{U_{Joint}^{(t)}} f, Q_{Joint}^{(t)} \right\rangle.$$
The direction of movement at each iteration is
$$Q_{Joint}^{(t)} = -\nabla_{U_{Joint}^{(t)}} f,$$
which implies that
$$\left\langle \nabla_{U_{Joint}^{(t)}} f, Q_{Joint}^{(t)} \right\rangle = -\left\| \nabla_{U_{Joint}^{(t)}} f \right\|_F^2 < 0, \tag{6.27}$$

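where $\| \cdot \|_F$ denotes the Frobenius norm of a matrix. Thus, the sequence of movement directions $\{Q_{Joint}^{(t)}\}$ is gradient related. Moreover, as $\{f(U_{Joint}^{(t)})\}$ is a convergent sequence, this implies that the step lengths $\{\eta^{(t)}\}_{t \in \tau} \to 0$. As the step lengths $\eta^{(t)}$'s are determined from the Armijo rule, it follows that for all $t$ greater than some $\bar{t}$, $\eta^{(t)} = \beta^{\omega_t} \eta$, where $\omega_t$ is an integer greater than zero. Therefore, the update $\frac{\eta^{(t)}}{\beta} = \beta^{(\omega_t - 1)} \eta$ does not satisfy the Armijo condition. So,
$$f\left(U_{Joint}^{(t)}\right) - f\left(R_{Km,\,U_{Joint}^{(t)}}\left(\frac{\eta^{(t)}}{\beta} Q_{Joint}^{(t)}\right)\right) < -\sigma \frac{\eta^{(t)}}{\beta} \left\langle \nabla_{U_{Joint}^{(t)}} f, Q_{Joint}^{(t)} \right\rangle, \quad \forall t \in \tau,\ t \geq \bar{t}. \tag{6.28}$$
Let
$$\hat{Q}_{Joint}^{(t)} = \frac{Q_{Joint}^{(t)}}{\left\| Q_{Joint}^{(t)} \right\|} \quad \text{and} \quad \hat{\eta}^{(t)} = \frac{\eta^{(t)} \left\| Q_{Joint}^{(t)} \right\|}{\beta}.$$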

For the function $f$ over the manifold $Km$ equipped with the retraction $R_{Km}$, let $\hat{f} = f \circ R_{Km}$ denote the pullback of $f$ through $R_{Km}$. For $U \in Km$, let
$$\hat{f}_U = f \circ R_{Km,\,U}$$
denote the restriction of $\hat{f}$ to the tangent space $T_U Km$. Denoting the zero element of tangent space $T_U Km$ by $0_U$, the inequality in (6.28) can be written as
$$\frac{\hat{f}_{U_{Joint}^{(t)}}\left(0_{U_{Joint}^{(t)}}\right) - \hat{f}_{U_{Joint}^{(t)}}\left(\hat{\eta}^{(t)} \hat{Q}_{Joint}^{(t)}\right)}{\hat{\eta}^{(t)}} < -\sigma \left\langle \nabla_{U_{Joint}^{(t)}} f, \hat{Q}_{Joint}^{(t)} \right\rangle, \tag{6.29}$$
$\forall t \in \tau$, where $t \geq \bar{t}$. The mean-value theorem is used to replace the left-hand side of (6.29) by the directional derivative of $\hat{f}$ at point $U_{Joint}^{(t)}$ in the direction of $\hat{Q}_{Joint}^{(t)}$ (see Chapter 3, [3]). So, for some $c \in [0, \hat{\eta}^{(t)}]$, (6.29) can be written as
$$-D\hat{f}_{U_{Joint}^{(t)}}\left(c \hat{Q}_{Joint}^{(t)}\right)\left[\hat{Q}_{Joint}^{(t)}\right] < -\sigma \left\langle \nabla_{U_{Joint}^{(t)}} f, \hat{Q}_{Joint}^{(t)} \right\rangle, \tag{6.30}$$

$\forall t \in \tau$, where $t \geq \bar{t}$. Since $\{\eta^{(t)}\}_{t \in \tau} \to 0$ and $Q_{Joint}^{(t)}$ is gradient-related, hence bounded, it follows that $\{\hat{\eta}^{(t)}\}_{t \in \tau}$ also tends to 0. Moreover, as $\hat{Q}_{Joint}^{(t)}$ has unit norm, the set of unit norm vectors $\{\hat{Q}_{Joint}^{(t)}\}$ belongs to a compact set. Every sequence in a compact set has a subsequence that converges to an element contained within the set. So, there must exist an index set $\hat{\tau} \subset \tau$ such that $\{\hat{Q}_{Joint}^{(t)}\}_{t \in \hat{\tau}} \to \hat{Q}_{Joint}^{\star}$ for some $\hat{Q}_{Joint}^{\star}$ having $\|\hat{Q}_{Joint}^{\star}\| = 1$. Taking limits in (6.30) over $\hat{\tau}$, $\hat{\eta}^{(t)} \to 0$, which implies that $c \to 0$ and $\hat{Q}_{Joint}^{(t)} \to \hat{Q}_{Joint}^{\star}$. Also, $f$ is a continuous and differentiable scalar field over the Riemannian manifold $Km$. Therefore, from the definition of directional derivative $D$ (see (3.31) in Chapter 3, [3]), it satisfies
$$D\hat{f}_{U_{Joint}^{(t)}}(0)\left[\hat{Q}_{Joint}^{(t)}\right] = \left\langle \nabla_{U_{Joint}^{(t)}} f, \hat{Q}_{Joint}^{(t)} \right\rangle.$$
Taking limits, (6.30) becomes
$$-\left\langle \nabla_{U_{Joint}^{\star}} f, \hat{Q}_{Joint}^{\star} \right\rangle < -\sigma \left\langle \nabla_{U_{Joint}^{\star}} f, \hat{Q}_{Joint}^{\star} \right\rangle. \tag{6.31}$$

Since $0 < \sigma < 1$, it follows from (6.31) that
$$\left\langle \nabla_{U_{Joint}^{\star}} f, \hat{Q}_{Joint}^{\star} \right\rangle > 0.$$
However, as $\{Q_{Joint}^{(t)}\}$ is gradient related, $\left\langle \nabla_{U_{Joint}^{\star}} f, \hat{Q}_{Joint}^{\star} \right\rangle < 0$ (from (6.27)), which is a contradiction. Therefore, the subsequence of iterates $\{U_{Joint}^{(t)}\}_{t \in \tau}$ converges to some critical point of the objective function $f$.

Theorem 6.3 states that only critical points of the cost function $f$ can be accumulation points of sequences $\{U_{Joint}^{(t)}\}$ generated by the MiMIC algorithm. It does not specify whether the accumulation points are local minimizers, local maximizers, or saddle points. Nevertheless, at each iteration, since the movement is always in the direction of negative gradient, unless the initial point $U_{Joint}^{(0)}$ is carefully crafted, Algorithm 6.3 produces sequences whose accumulation points are local minima of the cost function.

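The accept-or-shrink logic behind the Armijo criterion $C_A \geq 0$ can be sketched in a few lines. The example below is an illustrative assumption, not the MiMIC code: it works in ordinary Euclidean space (so the retraction reduces to $x + \eta d$), uses a toy quadratic cost, and the names `armijo_step`, `eta_bar`, `beta`, and `sigma` are hypothetical.

```python
def armijo_step(f, grad_f, x, eta_bar=1.0, beta=0.5, sigma=0.1, max_halvings=50):
    """Backtracking along the negative gradient: start from eta_bar and
    multiply the step by beta until the Armijo criterion C_A >= 0 holds."""
    g = grad_f(x)
    direction = [-gi for gi in g]
    # <grad, direction> = -||grad||^2, which is non-positive.
    slope = sum(gi * di for gi, di in zip(g, direction))
    eta = eta_bar
    for _ in range(max_halvings):
        candidate = [xi + eta * di for xi, di in zip(x, direction)]
        c_a = f(x) - f(candidate) + sigma * eta * slope
        if c_a >= 0:
            return eta, candidate
        eta *= beta  # criterion violated: reduce the step and check again
    return 0.0, list(x)

# Toy cost (an assumption for illustration): f(x) = x0^2 + 10 * x1^2.
def toy_cost(x):
    return x[0] ** 2 + 10.0 * x[1] ** 2

def toy_grad(x):
    return [2.0 * x[0], 20.0 * x[1]]

x0 = [1.0, 1.0]
eta, x1 = armijo_step(toy_cost, toy_grad, x0)
```

For this toy cost the first four trial steps violate the criterion, and the step $\eta = 0.5^4 = 0.0625$ is the first one accepted.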
6.4.2 Asymptotic Analysis


The asymptotic convergence describes how fast the sequence of iterates generated by a search algorithm could arrive at an optimal solution, if one exists. For a sufficiently large value of iteration number $t$, the properties of cost function $f$ and the line-search nature of Algorithm 6.3 are used to upper bound the difference between the cost function at $U_{Joint}^{(t+1)}$ and at the optimal solution $U_{Joint}^{\star}$ in terms of the difference when evaluated at $U_{Joint}^{(t)}$ and $U_{Joint}^{\star}$. The result invokes the smallest and largest eigenvalues of the Hessian of $f$ at the critical point.
Let $\{U_{Joint}^{(t)}\}_{t=0,1,2,\ldots}$ be an infinite sequence of iterates generated by the proposed Algorithm 6.3, for a set of given $\{U_j\}_{j=1}^{M}$. With the direction of movement being $Q_{Joint}^{(t)} := -\nabla_{U_{Joint}^{(t)}} f$, let the sequence $\{U_{Joint}^{(t)}\}_{t=0,1,\ldots}$ converge to a point $U_{Joint}^{\star}$, which is a critical point of $f$ according to Theorem 6.3. Let the Hessian of the cost function at the converged solution be denoted by $H_{U_{Joint}^{\star}} f$, and $\lambda_{H,min}$ and $\lambda_{H,max}$ be the smallest and largest eigenvalues of $H_{U_{Joint}^{\star}} f$. Assume that $\lambda_{H,min} > 0$ (hence $U_{Joint}^{\star}$ is a local minimizer of $f$). The asymptotic bound is given by the following theorem.

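Theorem 6.4. There exists an integer $t' \geq 0$ such that
$$f\left(U_{Joint}^{(t+1)}\right) - f\left(U_{Joint}^{\star}\right) \leq c \left( f\left(U_{Joint}^{(t)}\right) - f\left(U_{Joint}^{\star}\right) \right),$$
for all $t \geq t'$, where
$$c = 1 - 2\sigma \lambda_{H,min} \min\left( \eta, \frac{2\beta(1 - \sigma)}{\lambda_{H,max}} \right), \tag{6.32}$$
where $\eta$ is the step length, and $\sigma$ and $\beta$ are Armijo criterion parameters.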


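Proof. Let $(\mathcal{U}, \varphi)$ be a chart of the manifold $\mathcal{M} := Km(n, k)$, with $U_{Joint}^{\star} \in \mathcal{U}$. Let the negative gradient of $f$ at any point $U \in \mathcal{M}$ be given by $\zeta_U := -\nabla f(U)$, where $\zeta_U$ belongs to the tangent space $T_U\mathcal{M}$. Let coordinate expressions for different elements in the corresponding Euclidean space $\mathbb{R}^{n \times k}$ be denoted with a hat. The following notations are used for Euclidean space representations.

$\hat{U} := \varphi(U)$ ▷ indicates that coordinate map $\hat{U}$ in $\mathbb{R}^{n \times k}$ is equal to $\varphi$ of $U$ in $\mathcal{M}$,
$\hat{\mathcal{U}} := \varphi(\mathcal{U})$ ▷ similar to above notation, but for the whole set $\mathcal{U}$,
$\hat{f}(\hat{U}) := f(U)$ ▷ indicates that the value of $\hat{f}$ at $\hat{U} \in \mathbb{R}^{n \times k}$ is equal to the value of $f$ at $U \in \mathcal{M}$,
$\hat{\zeta}_{\hat{U}} := D\varphi(U)[\zeta_U]$ ▷ $\hat{\zeta}_{\hat{U}}$ is the coordinate expression corresponding to the directional derivative in manifold $\mathcal{M}$,
$\hat{R}_{\hat{U}}(\hat{\zeta}) := \varphi(R_U(\zeta))$ ▷ the coordinate expression for the retracted point in $\mathbb{R}^{n \times k}$ is given by the mapping $\varphi$ of the retracted point $R_U(\zeta)$ in $\mathcal{M}$.

Let $y_{\hat{U}}$ denote the Euclidean gradient of $\hat{f}$ at $\hat{U}$, given by
$$
y_{\hat{U}} =
\begin{bmatrix}
\partial_{11} \hat{f}(\hat{U}) & \cdots & \partial_{1k} \hat{f}(\hat{U}) \\
\partial_{21} \hat{f}(\hat{U}) & \cdots & \partial_{2k} \hat{f}(\hat{U}) \\
\vdots & & \vdots \\
\partial_{n1} \hat{f}(\hat{U}) & \cdots & \partial_{nk} \hat{f}(\hat{U})
\end{bmatrix}_{n \times k}
\tag{6.33}
$$
Let $G_{\hat{U}}$ denote the matrix representation of the Riemannian metric $g$ of $\mathcal{M}$, in the coordinate space. Without loss of generality, we assume that the coordinate map of the critical point is $\hat{U}_{Joint}^{\star} = 0$ (the zero vector) and $G_{\hat{U}_{Joint}^{\star}} = I_n$.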
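The main aim is to obtain, at a current iterate $U$, a suitable upper bound on $f(R_U(t^A \zeta_U))$, where $t^A$ is the Armijo step and $t^A \zeta_U$ is the Armijo point in tangent space $T_U\mathcal{M}$. The Armijo condition implies that
$$
\begin{aligned}
& f(U) - f(R_U(t^A \zeta_U)) \geq -\sigma \left\langle \nabla f(U), t^A \zeta_U \right\rangle, \\
\Rightarrow\ & f(R_U(t^A \zeta_U)) \leq f(U) - \sigma \left\langle \zeta_U, t^A \zeta_U \right\rangle \\
& \phantom{f(R_U(t^A \zeta_U))} \leq f(U) - \sigma t^A \left\langle \zeta_U, \zeta_U \right\rangle.
\end{aligned}
\tag{6.34}
$$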

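First a lower bound is obtained on $\langle \zeta_U, \zeta_U \rangle$ in terms of $f(U)$. Given a smooth scalar field $f$ on Riemannian manifold $\mathcal{M}$, $\zeta_U$ denotes the negative gradient of $f$ at $U$, given by $\zeta_U := -\nabla f(U)$. The coordinate expression for $\zeta_U$ in $\mathbb{R}^{n \times k}$ is given in terms of the Euclidean gradient $y_{\hat{U}}$ and the matrix representation of Riemannian metric $G$ as follows (Section 3.6 in [3]):
$$\hat{\zeta}_{\hat{U}} = G_{\hat{U}}^{-1}(-y_{\hat{U}}).$$
Also, from (3.29) in [3],
$$\left\langle \zeta_U, \zeta_U \right\rangle = \hat{\zeta}_{\hat{U}}^T G_{\hat{U}} \hat{\zeta}_{\hat{U}} = y_{\hat{U}}^T G_{\hat{U}}^{-1} y_{\hat{U}} = \| y_{\hat{U}} \|^2 \left( 1 + O(\hat{U}) \right), \tag{6.35}$$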


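as $G_{\hat{U}}$ is assumed to be the identity matrix at the critical point $\hat{U}_{Joint}^{\star}$. From Taylor expansion of the Euclidean gradient $y_{\hat{U}}$, we have
$$
\begin{aligned}
& \nabla \hat{f}(\hat{U}_{Joint}^{\star} + \hat{U}) = \nabla \hat{f}(\hat{U}_{Joint}^{\star}) + H_{\hat{U}_{Joint}^{\star}} \hat{U} + O(\| \hat{U} \|^2), \\
\Rightarrow\ & y_{\hat{U}} = \nabla \hat{f}(\hat{U}) = H_0 \hat{U} + O(\| \hat{U} \|^2)
\end{aligned}
\tag{6.36}
$$
(as $\hat{U}_{Joint}^{\star} = 0$, so $\nabla \hat{f}(\hat{U}_{Joint}^{\star}) = 0$, and from (6.33)).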

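On the other hand, from the Taylor expansion of $\hat{f}$, we have
$$
\begin{aligned}
& \hat{f}(\hat{U}_{Joint}^{\star} + \hat{U}) = \hat{f}(\hat{U}_{Joint}^{\star}) + \left( \nabla \hat{f}(\hat{U}_{Joint}^{\star}) \right)^T \hat{U} + \frac{1}{2} \hat{U}^T H_{\hat{U}_{Joint}^{\star}} \hat{U} + O(\| \hat{U} \|^3), \\
\Rightarrow\ & \hat{f}(\hat{U}) = \hat{f}(0) + \frac{1}{2} \hat{U}^T H_0 \hat{U} + O(\| \hat{U} \|^3)
\end{aligned}
\tag{6.37}
$$
(applying $\hat{U}_{Joint}^{\star} = 0$ and $\nabla \hat{f}(\hat{U}_{Joint}^{\star}) = 0$).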

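It follows from (6.36) and (6.37) that
$$
\begin{aligned}
\hat{f}(\hat{U}) - \hat{f}(0) &= \frac{1}{2} y_{\hat{U}}^T H_0^{-1} y_{\hat{U}} + O(\| \hat{U} \|^3) \\
&\leq \frac{1}{2} \frac{1}{\lambda_{H,min}} \| y_{\hat{U}} \|^2,
\end{aligned}
\tag{6.38}
$$
holds for all $\hat{U}$ sufficiently close to $\hat{U}_{Joint}^{\star}$. This is because, in (6.38) above, $\lambda_{H,min}$ denotes the minimum eigenvalue of the Hessian of $\hat{f}$ at $\hat{U}_{Joint}^{\star}$, that is $H_0$, and from the properties of eigenvalues, we have that, for any vector $v$, $v^T H_0^{-1} v \leq (\lambda_{H,min})^{-1} \| v \|^2$. Therefore, from (6.35) and (6.38), it can be concluded that
$$
\begin{aligned}
& f(U) - f(U_{Joint}^{\star}) \leq \frac{1}{2} \frac{1}{\lambda_{H,min}} \left\langle \zeta_U, \zeta_U \right\rangle, \\
\Rightarrow\ & 2 \lambda_{H,min} \left( f(U) - f(U_{Joint}^{\star}) \right) \leq \left\langle \zeta_U, \zeta_U \right\rangle.
\end{aligned}
\tag{6.39}
$$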

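Thus, (6.39) gives the desired lower bound on $\langle \zeta_U, \zeta_U \rangle$. Using the bound (6.39) in the Armijo condition (6.34) gives us that
$$
\begin{aligned}
& f(R_U(t^A \zeta_U)) \leq f(U) - \sigma t^A\, 2 \lambda_{H,min} \left( f(U) - f(U_{Joint}^{\star}) \right), \\
\Rightarrow\ & f(R_U(t^A \zeta_U)) - f(U_{Joint}^{\star}) \leq \left( 1 - 2 \lambda_{H,min} \sigma t^A \right) \left( f(U) - f(U_{Joint}^{\star}) \right).
\end{aligned}
\tag{6.40}
$$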

Next a lower bound is obtained on the Armijo step size $t^A$ to substitute in (6.40). Using the retraction operator $R$ and the negative gradient $\zeta_U$, we can define a smooth curve on the manifold, from $\mathbb{R}$ to $\mathcal{M}$, given by $t \mapsto R_U(t \zeta_U)$. This mapping can be further used to define a smooth function $h_U$ from $\mathbb{R}$ to $\mathbb{R}$ with a well-defined classical derivative, given by
$$h_U(t) = f(R_U(t \zeta_U)). \tag{6.41}$$
The derivative of $h_U$ is given by (Sections 3.5.1, 3.5.2, and 3.6 of [3])
$$
\begin{aligned}
\dot{h}_U(t = 0) &= \frac{d}{dt} f(R_U(t \zeta_U)) \Big|_{t=0} = Df(U)[\zeta_U] \\
&= \left\langle \nabla f(U), \zeta_U \right\rangle = -\left\langle \zeta_U, \zeta_U \right\rangle.
\end{aligned}
\tag{6.42}
$$
Using (6.41) and (6.42) the Armijo condition (6.34) reads
$$h_U(t^A) \leq h_U(0) + \sigma t^A \dot{h}_U(0). \tag{6.43}$$

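The Taylor expansion of $h_U$ gives us that
$$h_U(t) = h_U(0) + t \dot{h}_U(0) + t^2 \frac{\ddot{h}_U(0)}{2}.$$
The $t$ at which the left- and right-hand sides of (6.43) are equal is given by
$$
\begin{aligned}
& h_U(0) + t \dot{h}_U(0) + t^2 \frac{\ddot{h}_U(0)}{2} = h_U(0) + \sigma t \dot{h}_U(0), \\
\Rightarrow\ & t \frac{\ddot{h}_U(0)}{2} = -\dot{h}_U(0) + \sigma \dot{h}_U(0), \quad \Rightarrow\ t = \frac{-2(1 - \sigma) \dot{h}_U(0)}{\ddot{h}_U(0)}.
\end{aligned}
\tag{6.44}
$$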

Using $t$ in (6.44) and the definition of Armijo point (Definition D.3 and Section 4.2 of [3]), the step size $t^A$ that satisfies (6.40) has the following lower bound
$$t^A \geq \min\left( \eta, \frac{-2\beta(1 - \sigma) \dot{h}_U(0)}{\ddot{h}_U(0)} \right), \tag{6.45}$$
where $\eta$ and $\beta$ are Armijo step size parameters. The second derivative $\ddot{h}_U$ is given by
$$
\begin{aligned}
\ddot{h}_U(t = 0) &= \frac{d^2}{dt^2} f(R_U(t \zeta_U)) \Big|_{t=0} = D^2 f(U)[-\zeta_U] \\
&= (-\zeta_U)^T H_0 (-\zeta_U) = \zeta_U^T H_0 \zeta_U.
\end{aligned}
\tag{6.46}
$$
From properties of eigenvalues, we have that for any vector $v$, $v^T H_0 v \leq \lambda_{H,max} \| v \|^2$. Therefore, using (6.42) and (6.46) in (6.45) gives that
$$t^A \geq \min\left( \eta, \frac{2\beta(1 - \sigma)}{\lambda_{H,max}} \right), \tag{6.47}$$


for all $U$ sufficiently close to $U_{Joint}^{\star}$. Using the lower bound (6.47) in (6.40) gives
$$f(R_U(t^A \zeta_U)) - f(U_{Joint}^{\star}) \leq c \left( f(U) - f(U_{Joint}^{\star}) \right) \tag{6.48}$$
where
$$c = 1 - 2\sigma \lambda_{H,min} \min\left( \eta, \frac{2\beta(1 - \sigma)}{\lambda_{H,max}} \right). \tag{6.49}$$
In (6.48), $t^A$ is the Armijo step size corresponding to the Armijo point. When the next iterate $U_{Joint}^{(t+1)}$ is the Armijo point, then the decrease in the value of the objective function from $U_{Joint}^{(t)}$ to $U_{Joint}^{(t+1)}$ is $\sigma$ times the directional derivative at $U_{Joint}^{(t)}$. In Algorithm 6.3, the next iterate is
$$U_{Joint}^{(t+1)} = R_{U_{Joint}^{(t)}}\left( t\, \zeta_{U_{Joint}^{(t)}} \right), \tag{6.50}$$
where $t$ satisfies the Armijo condition, that is, with step length $t$, the decrease in the value of the objective function is greater than or equal to $\sigma$ times the directional derivative at $U_{Joint}^{(t)}$. Hence using (6.50) in (6.48), we get
$$f(U_{Joint}^{(t+1)}) - f(U_{Joint}^{\star}) \leq c \left( f(U_{Joint}^{(t)}) - f(U_{Joint}^{\star}) \right),$$
where $c$ is given by (6.49).

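Let the rank $r$ of subspaces $U_{Joint}$ and $U_j$'s be set to $k$, the number of clusters in the data set. Given a set of $\{U_j\}_{j=1}^{M}$, ignoring the constant factors, the objective function $f$ in terms of $U_{Joint}$ is given by
$$f(U_{Joint}) = -\mathrm{tr}\left( U_{Joint}^T \left( L_{Joint}^{k} + \frac{1}{M} \sum_{j=1}^{M} U_j U_j^T \right) U_{Joint} \right) + \xi \| U_{Joint} \|_F^2.$$
The cost function is equivalent to the Rayleigh quotient function, given by
$$f(U_{Joint}) = \mathrm{tr}\left( U_{Joint}^T\, \Xi\, U_{Joint} \right), \quad \text{where} \quad \Xi = \xi I_n - L_{Joint}^{k} - \frac{1}{M} \sum_{j=1}^{M} U_j U_j^T. \tag{6.51}$$
Let $\lambda_1 \leq \lambda_2 \leq \ldots \leq \lambda_n$ be the eigenvalues of $\Xi$. The extreme eigenvalues of the Hessian $H_{U_{Joint}^{\star}} f$ are given by (Section 4.9 of [3])
$$\lambda_{H,min} = \lambda_{k+1} - \lambda_k \quad \text{and} \quad \lambda_{H,max} = \lambda_n - \lambda_1. \tag{6.52}$$
Using (6.52) in (6.32), the convergence bound states that for all $t$ greater than some $t'$
$$f\left(U_{Joint}^{(t+1)}\right) - f\left(U_{Joint}^{\star}\right) \leq c \left( f\left(U_{Joint}^{(t)}\right) - f\left(U_{Joint}^{\star}\right) \right),$$
where $c$ is the convergence factor given by
$$c = 1 - 2\sigma (\lambda_{k+1} - \lambda_k) \min\left\{ \eta, \frac{2\beta(1 - \sigma)}{(\lambda_n - \lambda_1)} \right\}. \tag{6.53}$$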

The convergence factor $c$ determines how fast the proposed algorithm converges to an optimal solution of a given data set. A smaller value of $c$ indicates greater decrease in value of the cost function from iteration $t$ to $(t + 1)$, while $c$ close to 1 indicates minimal decrease. Also, matrix $\Xi$ has a form equivalent to that of the normalized graph Laplacian. For a data set with $k$ well-separated clusters, the matrix $\Xi$ tends to have close to block diagonal structure and each of the $k$ smallest eigenvalues of $\Xi$ is indicative of one of the $k$ clusters. In this case, it is expected to have a greater gap between the eigenvalues $\lambda_k$ and $\lambda_{k+1}$, which leads to a smaller value of $c$, indicating faster convergence. For poorly separated clusters, the difference $(\lambda_{k+1} - \lambda_k)$ tends to be very small and $c$ is close to 1, indicating longer time to reach the optimal solution. Hence, $c$ can be used to infer about the cluster structure of the data set.

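The dependence of the convergence factor on the eigengap in (6.53) can be checked numerically. The sketch below is an illustration under stated assumptions: the function name `convergence_factor` and the two toy spectra are hypothetical, and the eigenvalues of $\Xi$ are assumed to be already available (no eigen-decomposition is performed here).

```python
def convergence_factor(eigenvalues, k, eta, sigma, beta):
    """Evaluate c = 1 - 2*sigma*(lambda_{k+1} - lambda_k)
    * min(eta, 2*beta*(1 - sigma) / (lambda_n - lambda_1)),
    with the eigenvalues sorted in ascending order (1-based in the text)."""
    lam = sorted(eigenvalues)
    n = len(lam)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    lam_h_min = lam[k] - lam[k - 1]   # lambda_{k+1} - lambda_k
    lam_h_max = lam[-1] - lam[0]      # lambda_n - lambda_1 (assumed non-zero)
    return 1.0 - 2.0 * sigma * lam_h_min * min(eta, 2.0 * beta * (1.0 - sigma) / lam_h_max)

# Two toy spectra (assumed values): a large gap after the k-th eigenvalue
# versus a small one, with all other settings identical.
well_separated = [0.0, 0.1, 0.2, 1.0, 1.5, 2.0]
poorly_separated = [0.0, 0.1, 0.2, 0.25, 1.5, 2.0]

c_well = convergence_factor(well_separated, k=3, eta=0.05, sigma=0.1, beta=0.5)
c_poor = convergence_factor(poorly_separated, k=3, eta=0.05, sigma=0.1, beta=0.5)
```

With these toy values the larger gap gives $c = 0.992$ and the smaller gap gives $c = 0.9995$, which matches the qualitative statement that a small eigengap pushes $c$ towards 1.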
6.5 Experimental Results and Discussion


In this work, experiments are conducted to study and compare the performance of the
proposed MiMIC algorithm on several real-world and synthetic data sets. The clustering
performance of the MiMIC algorithm is compared with that of eight multi-view clustering
algorithms on five benchmark data sets, and nine cancer subtype identification algorithms
on eight multi-omics data sets. The performance of different algorithms is evaluated using
six external cluster evaluation indices, namely, accuracy, adjusted Rand index (ARI), normalized mutual information (NMI), F-measure, Rand index, and purity, which compare
the identified clusters with the clinically established cancer subtypes and the ground truth
class information of the benchmark data sets.
In order to randomize the experiments, each algorithm is executed 10 times, and the
means and standard deviations of each measure are reported. In all the tables, the numbers
within parentheses are the standard deviations, $\approx 0$ means that the value is close to
zero (approximately $10^{-17}$), while 0.00 denotes exactly zero. For the proposed MiMIC
algorithm, the step lengths $\eta_K$ and $\eta_S$ are set to 0.05. The value of convergence parameter
$\epsilon$ is empirically set to 0.001 for benchmark data sets and 0.005 for omics data sets. The
step reduction factor $\delta$ is set to 0.5, and $\xi$ of (6.9) is set to 0.01. The R implementation of
the proposed algorithm is available at https://github.com/Aparajita-K/MiMIC.

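Two of the external indices named above, purity and the Rand index, have short direct definitions, sketched below for illustration. This is a plain standard-library sketch and not the evaluation code used for the reported results; the function names and the toy label vectors are assumptions.

```python
from collections import Counter
from itertools import combinations

def purity(true_labels, pred_labels):
    """Fraction of samples that belong to the majority true class
    of the cluster they were assigned to."""
    clusters = {}
    for t, p in zip(true_labels, pred_labels):
        clusters.setdefault(p, []).append(t)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in clusters.values()
    )
    return majority_total / len(true_labels)

def rand_index(true_labels, pred_labels):
    """Fraction of sample pairs on which the two partitions agree
    (both place the pair together, or both place it apart)."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = sum(
        (true_labels[i] == true_labels[j]) == (pred_labels[i] == pred_labels[j])
        for i, j in pairs
    )
    return agree / len(pairs)

truth = [0, 0, 1, 1]
same_partition = [1, 1, 0, 0]    # identical grouping, cluster names swapped
mixed_partition = [0, 1, 0, 1]   # every cluster mixes both classes
```

Both indices depend only on the grouping and not on the cluster names, so `same_partition` scores 1.0 on each, while `mixed_partition` has purity 0.5 and agrees with the ground truth on only 2 of the 6 sample pairs.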
6.5.1 Description of Data Sets


In this work, experiments are performed on three different types of data sets as follows:

6.5.1.1 Synthetic Data Sets


Experiments are conducted on eight two-dimensional shape data sets (http://cs.joensuu.fi/sipu/datasets/) to give a visual illustration of the capability of the proposed MiMIC
algorithm. The data sets are Jain, Spiral, Aggregation, Compound, D31, Pathbased,
Flame, and R15 consisting of 373, 312, 788, 399, 3100, 300, 240, and 600 samples, respectively, with the number of clusters varying between 2 and 31. These are single-view data
sets for which two views are generated from two graphs constructed using k-nearest neighbors and Gaussian kernel.

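The construction of two graph views from a single-view data set, one from a Gaussian kernel and one from k-nearest neighbors, can be sketched as follows. This is only an illustration: the kernel width `sigma`, the neighborhood size `k`, the symmetrisation rule, and the four toy points are assumptions, since the text does not state the settings used in the experiments.

```python
import math

def pairwise_sq_dists(points):
    """Matrix of squared Euclidean distances between all pairs of points."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]

def gaussian_kernel_view(points, sigma=1.0):
    """Dense similarity matrix W[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    d2 = pairwise_sq_dists(points)
    return [[math.exp(-v / (2.0 * sigma ** 2)) for v in row] for row in d2]

def knn_view(points, k=2):
    """Binary k-nearest-neighbor graph, symmetrised so that i and j are
    connected if either one is among the k nearest neighbors of the other."""
    d2 = pairwise_sq_dists(points)
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = sorted((j for j in range(n) if j != i), key=lambda j: d2[i][j])[:k]
        for j in neighbours:
            w[i][j] = w[j][i] = 1.0
    return w

# Four toy points forming two tight pairs that are far from each other.
toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
w_gauss = gaussian_kernel_view(toy_points, sigma=1.0)
w_knn = knn_view(toy_points, k=1)
```

Both matrices describe the same two groups, but the Gaussian view is dense with smoothly decaying weights while the k-nearest-neighbor view is sparse and binary, which is the sense in which the two views share cluster structure yet differ in graph connectivity.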
6.5.1.2 Benchmark Data Sets


In this work, several publicly available data sets from a variety of application domains
are considered. Among them, 3Sources (http://mlg.ucd.ie/datasets/3sources.html)
and BBC (http://mlg.ucd.ie/datasets/segment.html) are multi-source news article
clustering data sets, consisting of 169 and 685 news articles, annotated with 6 and 5 topics, respectively. Five benchmark image data sets are also considered, namely, Digits,
100Leaves, ALOI, ORL, and Caltech7. The Digits data set (https://archive.ics.uci.edu/ml/datasets/Multiple+Features) consists of 2000 images of handwritten numerals (‘0’–‘9’). The ALOI (http://elki.dbs.ifi.lmu.de/wiki/DataSets/MultiView) and
100Leaves (https://archive.ics.uci.edu/ml/datasets/One-hundred+plant+species+leaves+data+set) data sets are both 100 cluster data sets, where ALOI consists of 11,025 images
of 100 small objects, while 100Leaves consists of 1,600 samples from 100 plant species.
The ORL data set (https://cam-orl.co.uk/facedatabase.html) consists of 400 facial
images of 40 subjects taken under varying lighting conditions and facial expressions. The
Caltech7 data set (https://github.com/yeqinglee/mvdata) is a seven class object recognition data set. Apart from these data sets, six multi-view social network data sets, namely,
Football, Olympics, Politics-UK, Politics-IE, Rugby, and CORA are also considered in this
study.
[Figure 6.2 panels. Ground-truth partition: (a) Jain, (b) Spiral, (c) Aggregation, (d) Compound, (e) Flame, (f) Pathbased, (g) R15, (h) D31. MiMIC algorithm partition: (i) 1.00, (j) 1.00, (k) 0.99619, (l) 0.89473, (m) 1.00, (n) 0.83, (o) 0.995, (p) 0.82032.]

Figure 6.2: Two-dimensional scatter plots of eight synthetic shape data sets: ground truth
clustering (top two rows: (a)-(h)) and MiMIC clustering (bottom two rows: (i)-(p)). The
numbers in (i)-(p) denote the clustering accuracy obtained using the MiMIC algorithm.
Table 6.1: Performance Analysis of Proposed Algorithms on Synthetic Clustering Data Sets

Data Sets        Aggregation  Compound  Pathbased  Spiral  Jain  Flame    R15      D31
No. of Samples   788          399       300        312     373   240      600      3100
No. of Clusters  7            6         3          3       2     2        15       31
Accuracy         0.99619      0.89473   0.83000    1.00    1.00  0.98750  0.99500  0.82032
NMI              0.98839      0.89671   0.63926    1.00    1.00  0.89905  0.99135  0.91780
ARI              0.99198      0.92926   0.58486    1.00    1.00  0.95014  0.98921  0.68092
F-measure        0.99622      0.91264   0.81500    1.00    1.00  0.98748  0.99497  0.85341
Rand             0.99728      0.97343   0.81190    1.00    1.00  0.97520  0.99868  0.97419
Purity           0.99619      0.89724   0.83000    1.00    1.00  0.98750  0.99500  0.85096

6.5.1.3 Multi-Omics Cancer Data Sets


The subtype analysis is studied on eight types of cancers, namely, lower grade glioma
(LGG), stomach adenocarcinoma (STAD), breast adenocarcinoma (BRCA), lung carcinoma (LUNG), ovarian carcinoma (OV), cervical carcinoma (CESC), colorectal carcinoma
(CRC), and kidney carcinoma (KIDNEY), and the corresponding data sets are obtained from
The Cancer Genome Atlas (TCGA) (https://cancergenome.nih.gov/). The colorectal,
lung, and kidney cancer data sets have two, two, and three histological subtypes, respectively, identified by the World Health Organization. For the other five cancers, the TCGA research
network has identified four clinically relevant subtypes for each of BRCA, STAD, and OV,
and three subtypes for each of LGG and CESC. For all the cancer data sets, four different omic
data types, namely, DNA methylation (mDNA), gene expression (RNA), microRNA expression (miRNA), and reverse phase protein array expression (RPPA), are considered.
All the real-world data sets are summarized in Table A.1 and are briefly described in Appendix A. For the data sets with feature vector based representation, the pairwise similarity
matrices $W_m$'s are computed using the Gaussian kernel.

6.5.2 Performance on Synthetic Data Sets


Figure 6.2 shows the scatter plots for the eight two-dimensional shape data sets. The objects in
Figure 6.2 are colored according to their ground truth partition information (top two rows:
Figures 6.2(a)-(h)) and the partition obtained by the proposed MiMIC algorithm (bottom
two rows: Figures 6.2(i)-(p)). For these data sets, two views, generated using the Gaussian
kernel and k-nearest neighbors, have related cluster structure, but they differ in their
graph connectivity. The quantitative results on the synthetic data sets, in terms of the
external indices, are reported in Table 6.1. The qualitative results in Figure 6.2 show that
the MiMIC algorithm obtains almost perfect clustering for the Jain, Spiral, Aggregation,
Flame, and R15 data sets. For the Compound, D31, and Pathbased data sets, the clustering
performance is also good, with accuracy 0.89473, 0.82032, and 0.83, respectively. The
scatter plots in Figure 6.2 show that the Spiral, Compound, Jain, and Pathbased data
sets have non-linearly separable clusters, while the D31 data set has 3,100 samples and
31 clusters. All the results reported in Figure 6.2 show that the proposed algorithm can
efficiently identify both non-linearly separable clusters and a large number of clusters.
[Figure 6.3 panels. Top row, scatter plots: (a) Original Data Set, (b) Noise with Std_Dev = 0.5, (c) Noise with Std_Dev = 1, (d) Noise with Std_Dev = 1.5. Bottom row, plots of Ratio γ, Bound c, and objective f(UJoint(t)) against Iteration t, with the threshold t′ marked on the iteration axis: (a) c = 0.5432315, (b) c = 0.8995148, (c) c = 0.9298075, (d) c = 0.9569399.]

Figure 6.3: Asymptotic convergence analysis for Spiral data set: scatter plot of data with
varying Gaussian noise (top row) and variation of convergence ratio and objective function
with increase in iteration number t (bottom row).
[Figure 6.4 panels. Top row, scatter plots: (a) Original Data Set, (b) Noise Std_Dev = 0.5, (c) Noise Std_Dev = 1, (d) Noise Std_Dev = 1.5. Bottom row, plots of Ratio γ, Bound c, and objective f(UJoint(t)) against Iteration t, with the threshold t′ marked on the iteration axis: (a) c = 0.5669898, (b) c = 0.7558949, (c) c = 0.8887072, (d) c = 0.9117535.]

Figure 6.4: Asymptotic convergence analysis for Jain data set: scatter plot of data with
varying Gaussian noise (top row) and variation of convergence ratio and objective function
with increase in iteration number t (bottom row).

[Figure 6.5 panels. Top row, scatter plots: (a) Original Data Set, (b) Noise Std_Dev = 0.5, (c) Noise Std_Dev = 1, (d) Noise Std_Dev = 1.5. Bottom row, plots of Ratio γ, Bound c, and objective f(UJoint(t)) against Iteration t, with the threshold t′ marked on the iteration axis: (a) c = 0.6819649, (b) c = 0.8589394, (c) c = 0.9477831, (d) c = 0.9849866.]

Figure 6.5: Asymptotic convergence analysis for R15 data set: scatter plot of data with
varying Gaussian noise (top row) and variation of convergence ratio and objective function
with increase in iteration number t (bottom row).
[Figure 6.6 panels. Top row, scatter plots: (a) Original Data Set, (b) Noise Std_Dev = 0.5, (c) Noise Std_Dev = 1, (d) Noise Std_Dev = 1.5. Bottom row, plots of Ratio γ, Bound c, and objective f(UJoint(t)) against Iteration t, with the threshold t′ marked on the iteration axis: (a) c = 0.7194825, (b) c = 0.7725400, (c) c = 0.9311562, (d) c = 0.9583102.]

Figure 6.6: Asymptotic convergence analysis for Compound data set: scatter plot of data
with varying Gaussian noise (top row) and variation of convergence ratio and objective
function with increase in iteration number t (bottom row).

6.5.3 Significance of Asymptotic Convergence Bound


The asymptotic convergence bound obtained in Theorem 6.4 indicates how fast the sequence of iterates generated by the proposed algorithm converges to an optimal solution of
a given data set. For a sufficiently large value of iteration number $t$, Theorem 6.4 bounds
the difference between the cost function $f$ evaluated at $U_{Joint}^{(t+1)}$ and at the optimal solution
$U_{Joint}^{\star}$ in terms of the difference between that evaluated at $U_{Joint}^{(t)}$ and $U_{Joint}^{\star}$. Let $\gamma_t$ be
given by the ratio
$$\gamma_t = \frac{f\left(U_{Joint}^{(t+1)}\right) - f\left(U_{Joint}^{\star}\right)}{f\left(U_{Joint}^{(t)}\right) - f\left(U_{Joint}^{\star}\right)}. \tag{6.54}$$

Theorem 6.4 states that for all $t$ greater than or equal to some $t'$, $\gamma_t \leq c$, where $c$ is given
by (6.32). The convergence factor $c$ can be used to make inference about the underlying
cluster structure of the data set. As discussed in Section 6.4.2, a value of $c$ close to 1
indicates poor separation between the clusters present in the data set, while a value much
lower than 1 indicates well-separated clusters. To experimentally establish this, multiple
noisy data sets are generated from the synthetic shape data sets used in this work, by
adding Gaussian noise of mean 0 and standard deviations 0.5, 1, and 1.5. Experiments are
performed on noise-free and noisy variations of four shape data sets, namely, Spiral, Jain,
R15, and Compound. The scatter plots for the noise-free and noisy variants of Spiral, Jain,
R15, and Compound data sets are provided in the top rows of Figures 6.3, 6.4, 6.5, and 6.6,
respectively. As stated in Section 6.5.1.1, for each variant of each data set, two views are
generated using k-nearest neighbors and Gaussian kernel. Starting from a random initial
iterate, the variation of $\gamma_t$ and the cost function $f\left(U_{Joint}^{(t)}\right)$ is observed for different values
of $t = 1, 2, 3, \ldots$, until convergence. The variation of $\gamma_t$ and $f\left(U_{Joint}^{(t)}\right)$ along with the
corresponding value of convergence factor $c$ is provided in the bottom rows of Figures 6.3,
6.4, 6.5, and 6.6 for Spiral, Jain, R15, and Compound data sets, respectively. The value of
the bound $c$ is marked by a horizontal dashed green line in these figures.
For all the data sets, the top rows of Figures 6.3, 6.4, 6.5, and 6.6 show that the cluster
structure and their separability degrades with the increase in noise, as expected. The
bottom rows of these figures in turn show that with increase in noise in the data sets, the
value of the convergence factor c increases and goes close to 1. For instance, for the Spiral
data set, the value of c for the noise-free original data set in Figure 6.3(a) is 0.5432315,
while the values for the three increasingly noisy variants in Figures 6.3(b), 6.3(c), and 6.3(d) are
0.8995148, 0.9298075, and 0.9569399, respectively. Similar observations can be made for
Jain, R15, and Compound data sets as well from the bottom rows of Figures 6.4, 6.5, and
6.6, respectively. Although the results are sensitive to the added noise and the choice of
the random initial iterate, in general, it can be observed that lower values of c imply faster
convergence. For instance, the bottom rows of Figures 6.3, 6.4, 6.5, and 6.6 show that for
all four data sets, the proposed algorithm converges in fewer iterations in the
noise-free case than in the noisy ones. The value of the iteration threshold $t_1$, above
which the asymptotic bound is satisfied by all the iterations until convergence, is marked
by a dashed vertical line in the figures. In general, it can be observed from Figures 6.3, 6.4,
6.5, and 6.6 that for all data sets, as noise increases, the value of $t_1$ decreases, implying a
longer path until convergence. In brief, the results show that the convergence bound c can
be used to make inference about the quality of the clusters and the speed of convergence
of the proposed algorithm, for a given data set.
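
To see why a smaller c suggests faster convergence, note that if $\gamma_t \le c$ holds for every iteration after $t_1$, the gap to the optimal cost shrinks by a factor of at most c per iteration, so m further iterations reduce it by at most a factor of $c^m$. The sketch below computes the smallest m with $c^m \le \epsilon$. It is only an illustration of the bound under that assumption: the function name and the tolerance $\epsilon = 10^{-3}$ are chosen here and do not come from the thesis.

```python
import math


def iterations_for_tolerance(c, eps):
    """Smallest integer m with c**m <= eps, for a contraction factor 0 < c < 1."""
    if not (0.0 < c < 1.0) or not (0.0 < eps < 1.0):
        raise ValueError("expected 0 < c < 1 and 0 < eps < 1")
    return math.ceil(math.log(eps) / math.log(c))


# Values of c quoted above for the noise-free and the noisiest Spiral variants.
m_clean = iterations_for_tolerance(0.5432315, 1e-3)
m_noisy = iterations_for_tolerance(0.9569399, 1e-3)
```

For these two values of c, the bound guarantees the same relative reduction after 12 iterations in the noise-free case, against 157 in the noisiest case. These are worst-case counts implied by the bound, not observed iteration numbers.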

6.5.4 Choice of Rank


The proposed algorithm identifies the set of k clusters by performing clustering on the first
k columns of $U_{Joint}$. Although clustering is performed in a k-dimensional subspace, the
proposed algorithm works with rank r subspaces, where r is generally greater than or equal to
k, in order to incorporate better information from the individual views. The optimal value
of rank r is obtained using the same procedure as described in Section 5.5.2 of Chapter 5.
The value of r is varied from k to $\max\{50, 2k\}$ and for each value of r, the Silhouette index
$S(r)$ is evaluated for clustering using the first k columns of $U_{Joint}$. The optimal rank, $r^{\star}$,
is the one that maximizes $S(r)$ over different values of r.
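
The rank search described above can be sketched as a simple loop. This is an illustrative outline only: `select_rank` and the callable `silhouette_at_rank` are hypothetical names, and the callable is assumed to run the algorithm at rank r, cluster the first k columns of the joint subspace, and return the Silhouette index. Breaking ties in favour of the smaller rank is also an assumption.

```python
def select_rank(k, silhouette_at_rank, upper=None):
    """Return (best_rank, best_score) over r = k, ..., max(50, 2k)."""
    if upper is None:
        upper = max(50, 2 * k)
    best_rank, best_score = None, float("-inf")
    for r in range(k, upper + 1):
        score = silhouette_at_rank(r)
        if score > best_score:  # strict comparison: ties keep the smaller rank
            best_rank, best_score = r, score
    return best_rank, best_score


# Toy stand-in for S(r): a made-up curve peaking at r = 12, not real Silhouette values.
def toy_score(r):
    return 1.0 - 0.01 * (r - 12) ** 2


best_rank, best_score = select_rank(10, toy_score)
```

With k = 10 the search covers r = 10, ..., 50, and the toy curve is maximized at r = 12.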
In order to validate the choice of rank, based on the Silhouette index, the variation of
both $S(r)$ and F-measure is observed for different values of rank r. Figure 6.7 shows the
variation of these two indices for Digits and LGG data sets, as examples. Similar to Figures
5.1 and 5.2 of Chapter 5, Figure 6.7 shows that $S(r)$ and F-measure values tend to vary in
a similar fashion for the data sets. The optimal values of rank for three image data sets,
namely, Digits, 100Leaves, and ALOI are 12, 180, and 150, respectively. For both news
article data sets, namely, 3Sources and BBC, the optimal rank is 21, while for eight omics
data sets, namely, LGG, STAD, BRCA, LUNG, CRC, CESC, OV, and KIDNEY, the ranks
are 43, 16, 4, 3, 8, 3, 5, and 5, respectively. For BRCA, LGG, STAD, LUNG, 100Leaves,
and ALOI data sets, it is also observed that the F-measure corresponding to $r^{\star}$ coincides
with the best F-measure obtained over different values of rank. In order to establish the
importance of considering the optimal rank $r^{\star}$, Table 6.2 compares the performance of

Table 6.2: Performance Analysis of Proposed Algorithm at Rank k and Optimal Rank r*

Data set   Digits           Digits           3Sources         3Sources
Measure    Rank k           Rank r*          Rank k           Rank r*
Rank       10               12               6                21
Accuracy   0.7905(0.0)      0.9207(4.21e-4)  0.6153(6.23e-3)  0.7360(5.92e-2)
NMI        0.7556(0)        0.8597(4.88e-4)  0.5721(1.38e-2)  0.6433(3.59e-2)
ARI        0.6754(0.0)      0.8352(8.18e-4)  0.4635(1.88e-2)  0.5957(6.69e-2)
F-measure  0.8070(0.0)      0.9209(4.15e-4)  0.6786(5.28e-3)  0.7581(5.04e-2)
Rand       0.9409(0.0)      0.9703(1.49e-4)  0.8162(5.25e-3)  0.8514(2.61e-2)
Purity     0.7980(0.0)      0.9207(4.21e-4)  0.7455(6.23e-3)  0.7946(2.28e-2)

Data set   BBC              BBC              100Leaves        100Leaves
Measure    Rank k           Rank r*          Rank k           Rank r*
Rank       5                21               100              180
Accuracy   0.7275(8.35e-2)  0.8715(0.0)      0.6976(1.80e-2)  0.8185(1.55e-2)
NMI        0.6123(7.39e-2)  0.7182(0)        0.8976(6.07e-3)  0.9302(4.12e-3)
ARI        0.5844(1.43e-1)  0.7273(0.0)      0.6148(2.12e-2)  0.7431(2.53e-2)
F-measure  0.7539(7.55e-2)  0.8613(0.0)      0.7524(1.55e-2)  0.8492(1.13e-2)
Rand       0.8253(7.10e-2)  0.8959(0.0)      0.9907(7.31e-4)  0.9913(1.17e-3)
Purity     0.7284(8.28e-2)  0.8715(0.0)      0.7380(1.62e-2)  0.7772(1.53e-2)

Data set   BRCA             BRCA             LGG              LGG
Measure    Rank k           Rank r*          Rank k           Rank r*
Rank       4                4                3                43
Accuracy   0.7964(0.0)      0.7964(0.0)      0.6292(0.0)      0.9625(0.0)
NMI        0.5553(0)        0.5553(0)        0.4106(0)        0.8543(0)
ARI        0.5474(0.0)      0.5474(0.0)      0.2765(0.0)      0.8790(0.0)
F-measure  0.7997(0.0)      0.7997(0.0)      0.6316(0.0)      0.9623(0.0)
Rand       0.8152(0.0)      0.8152(0.0)      0.6590(0.0)      0.9424(0.0)
Purity     0.7964(0.0)      0.7964(0.0)      0.6741(0.0)      0.9625(0.0)

Data set   STAD             STAD             LUNG             LUNG
Measure    Rank k           Rank r*          Rank k           Rank r*
Rank       4                16               2                3
Accuracy   0.5123(0.0)      0.7727(0.0)      0.9388(0.0)      0.9463(0.0)
NMI        0.2905(0)        0.5220(0)        0.6822(0.0)      0.7173(0.0)
ARI        0.1520(0.0)      0.4650(0.0)      0.7701(0.0)      0.7965(0.0)
F-measure  0.5239(0.0)      0.7830(0.0)      0.9386(0.0)      0.9461(0.0)
Rand       0.6362(0.0)      0.7698(0.0)      0.8850(0.0)      0.8983(0.0)
Purity     0.5909(0.0)      0.7727(0.0)      0.9388(0.0)      0.9463(0.0)

the proposed MiMIC algorithm when the rank is k with that at the optimal rank $r^{\star}$ for four
benchmark and four omics data sets. Table 6.2 shows that for LGG, STAD, LUNG, and
all four benchmark data sets, there is a significant improvement in performance when
considering rank $r^{\star}$ instead of k. For BRCA data, the optimal rank coincides with k, so the
performance is exactly the same in both cases.

6.5.5 Choice of Damping Factor in Joint Laplacian


The joint Laplacian $L_{Joint}^{r}$, defined in (6.7), is a convex combination of the individual
approximate graph Laplacians. The convex combination is set according to Section 5.3.5
of Chapter 5. In the convex combination, the Laplacians are weighted according to the
relevance of the cluster information provided by the corresponding views. The relevance
measure χ in (5.31) gives a linear ordering of the views based on the quality of their
underlying cluster structure. Based on this ordering, the relevance values are damped by
powers of ∆ and then used in the convex combination. This damping strategy upweights

Figure 6.7: Variation of Silhouette index and F-measure for different values of rank r on
Digits and LGG data sets. (Two panels: Digits and LGG; horizontal axis: rank r; vertical axes: F-measure and Silhouette.)

the contribution of views with better cluster structure, while damping the effect of those
having poorer structure. With damping factor ∆ = 1, the individual contributions are
relatively close to each other depending upon their Fiedler values and Fiedler vectors. On
the other hand, with ∆ = 2, the contributions of the views in decreasing order of relevance
are $\chi_{(1)}/2$, $\chi_{(2)}/4$, $\chi_{(3)}/8$, and so on. This indicates heavier damping, resulting in larger differences
between the individual contributions. The effect of the two damping factors is studied in
Table 6.3 for different data sets.
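
The damping scheme can be illustrated with a small sketch. The exact weighting is defined in Section 5.3.5 of Chapter 5 and is not reproduced here, so the following rests on assumptions: the function name is hypothetical, the relevance values are made-up positive numbers, and the damped values are normalized to sum to one so that they form a convex combination.

```python
def damped_weights(relevance, delta):
    """Weights proportional to chi_(j) / delta**j, where j is the rank of a view by relevance."""
    # Order views by decreasing relevance; position j = 1 is the most relevant view.
    ranked = sorted(enumerate(relevance), key=lambda pair: pair[1], reverse=True)
    damped = [0.0] * len(relevance)
    for position, (view, chi) in enumerate(ranked, start=1):
        damped[view] = chi / (delta ** position)
    total = sum(damped)
    return [w / total for w in damped]  # normalization is an assumption of this sketch


chi = [0.6, 0.3, 0.1]                # made-up relevance values for three views
weights_1 = damped_weights(chi, 1)   # delta = 1: weights proportional to relevance
weights_2 = damped_weights(chi, 2)   # delta = 2: the most relevant view is upweighted
```

With ∆ = 1 the weights are simply proportional to the relevance values, while with ∆ = 2 the share of the most relevant view grows (from 0.6 to roughly 0.77 in this toy case) at the expense of the others.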

Figure 6.8: Two-dimensional scatter plots of individual views and proposed algorithm for
BBC data set. Panels: (a) Segment1, (b) Segment2, (c) Segment3, (d) Segment4, (e) MiMIC.

Table 6.3 shows that for four benchmark data sets, namely, Digits, 3Sources, BBC, and

Table 6.3: Performance of the MiMIC Algorithm for Different Values of Damping Factor
∆ on Benchmark and Multi-Omics Data Sets

Benchmark data sets
Data set   Digits           Digits           3Sources         3Sources
Measure    ∆ = 1            ∆ = 2            ∆ = 1            ∆ = 2
Rank       12               42               21               26
Accuracy   0.9207(4.21e-4)  0.7860(0.0)      0.7360(5.92e-2)  0.6520(3.74e-3)
NMI        0.8597(4.88e-4)  0.8275(0)        0.6433(3.59e-2)  0.6224(8.33e-3)
ARI        0.8352(8.18e-4)  0.7367(0.0)      0.5957(6.69e-2)  0.5225(1.34e-2)
F-measure  0.9209(4.15e-4)  0.8428(0.0)      0.7581(5.04e-2)  0.6941(3.91e-3)
Rand       0.9703(1.49e-4)  0.9500(0.0)      0.8514(2.61e-2)  0.8191(7.10e-3)
Purity     0.9207(4.21e-4)  0.8225(0.0)      0.7946(2.28e-2)  0.7763(3.74e-3)

Data set   BBC              BBC              100Leaves        100Leaves
Measure    ∆ = 1            ∆ = 2            ∆ = 1            ∆ = 2
Rank       21               5                180              50
Accuracy   0.8715(0.0)      0.7976(3.04e-2)  0.8185(1.55e-2)  0.6765(1.80e-2)
NMI        0.7182(0)        0.6658(4.01e-2)  0.9302(4.12e-3)  0.8499(6.61e-3)
ARI        0.7273(0.0)      0.7027(6.04e-2)  0.7431(2.53e-2)  0.5715(2.10e-2)
F-measure  0.8613(0.0)      0.8127(3.42e-2)  0.8492(1.13e-2)  0.7067(1.46e-2)
Rand       0.8959(0.0)      0.8874(2.89e-2)  0.9913(1.17e-3)  0.9910(5.93e-4)
Purity     0.8715(0.0)      0.7991(3.04e-2)  0.7772(1.53e-2)  0.7120(1.32e-2)

Multi-omics data sets
Data set   BRCA             BRCA             LGG              LGG
Measure    ∆ = 1            ∆ = 2            ∆ = 1            ∆ = 2
Rank       40               4                45               43
Accuracy   0.6683(0.0)      0.7964(0.0)      0.9700(0.0)      0.9625(0.0)
NMI        0.4503(0)        0.5553(0)        0.8646(0)        0.8543(0)
ARI        0.3894(0.0)      0.5474(0.0)      0.9097(0.0)      0.8790(0.0)
F-measure  0.6800(0.0)      0.7997(0.0)      0.9700(0.0)      0.9623(0.0)
Rand       0.7499(0.0)      0.8152(0.0)      0.9574(0.0)      0.9424(0.0)
Purity     0.6733(0.0)      0.7964(0.0)      0.9700(0.0)      0.9625(0.0)

Data set   STAD             STAD             LUNG             LUNG
Measure    ∆ = 1            ∆ = 2            ∆ = 1            ∆ = 2
Rank       25               16               4                3
Accuracy   0.7727(0.0)      0.7727(0.0)      0.9388(0.0)      0.9463(0.0)
NMI        0.5183(0)        0.5220(0)        0.6920(0.0)      0.7173(0.0)
ARI        0.4658(0.0)      0.4650(0.0)      0.7701(0.0)      0.7965(0.0)
F-measure  0.7791(0.0)      0.7830(0.0)      0.9385(0.0)      0.9461(0.0)
Rand       0.4591(0.0)      0.7698(0.0)      0.8850(0.0)      0.8983(0.0)
Purity     0.7727(0.0)      0.7727(0.0)      0.9388(0.0)      0.9463(0.0)

100Leaves, lower damping (∆ = 1) gives better performance compared to higher damping
(∆ = 2). The individual views of the benchmark data sets are relatively similar to each
other, for instance, different segments of the same news article for the BBC data set,
and RGB and HSV colour histograms of the same image for the ALOI data set. As a result,
lower damping works better for the benchmark data sets. For the multi-omics data sets,
however, Table 6.3 shows that heavier damping with ∆ = 2 gives better performance.
Table 6.6 shows that there is a significant difference between the clustering performance of
the most and the second most relevant views of LGG, BRCA, and LUNG data sets. Hence,
significantly upweighting the most relevant view with ∆ = 2 gives better performance for
the multi-omics data sets. Therefore, in this work, the damping factor ∆ is chosen to be 2
for the multi-omics data sets, and 1 for the benchmark data sets.

Table 6.4: Performance Analysis of Spectral Clustering on Individual Views and Proposed
MiMIC Algorithm for BBC and ALOI Data Sets

BBC
Views      Segment1         Segment2         Segment3         Segment4         MiMIC
Accuracy   0.6202(2.1e-3)   0.6202(3.6e-2)   0.6102(3.6e-2)   0.5550(3.0e-3)   0.8715(0.0)
NMI        0.4312(1.7e-3)   0.4459(5.7e-2)   0.4097(1.3e-3)   0.4033(8.2e-3)   0.7182(0)
ARI        0.3405(6.6e-2)   0.3895(8.9e-2)   0.3429(7.0e-3)   0.2518(1.1e-2)   0.7273(0.0)
F-measure  0.6514(1.2e-2)   0.6363(3.9e-2)   0.6435(1.2e-2)   0.6205(3.0e-3)   0.8613(0.0)
Rand       0.7256(1.7e-2)   0.7174(6.3e-2)   0.7425(7.4e-3)   0.6671(9.1e-3)   0.8959(0.0)
Purity     0.6212(3.5e-3)   0.6218(3.7e-2)   0.6120(3.6e-2)   0.5565(3.0e-3)   0.8715(0.0)

ALOI
Views      RGB              HSV              Haralick         ColorSimilarity  MiMIC
Accuracy   0.4215(1.1e-2)   0.4433(7.0e-3)   0.1001(2.3e-3)   0.5191(1.1e-2)   0.5742(7.4e-3)
NMI        0.7179(3.9e-3)   0.7093(5.1e-3)   0.3659(4.1e-3)   0.7683(4.9e-3)   0.7805(2.3e-3)
ARI        0.2915(1.4e-2)   0.2979(1.9e-2)   0.0550(6.8e-4)   0.3745(2.2e-2)   0.4233(6.6e-3)
F-measure  0.4789(1.0e-2)   0.5136(7.9e-3)   0.1209(1.5e-3)   0.5843(1.1e-2)   0.6221(4.9e-3)
Rand       0.9745(1.8e-3)   0.9759(2.2e-3)   0.8938(7.0e-3)   0.9797(1.9e-3)   0.9840(3.7e-4)
Purity     0.4717(9.9e-3)   0.4876(7.3e-3)   0.1094(2.4e-3)   0.5547(8.6e-3)   0.6119(5.7e-3)

Figure 6.9: Two-dimensional scatter plots of individual views and proposed MiMIC algorithm
for 3Sources data set. Panels: (a) BBC, (b) Reuters, (c) The Guardian, (d) MiMIC.

6.5.6 Importance of Data Integration


The proposed algorithm integrates information by optimizing a joint clustering objective
while reducing the disagreement between the joint and individual subspaces. To study the
importance of integration, the performance of the proposed algorithm is compared with

Table 6.5: Performance Analysis of Spectral Clustering on Individual Views and Proposed
MiMIC Algorithm for 100Leaves and 3Sources Data Sets

100Leaves
Views      Shape            Texture          Margin           MiMIC
Accuracy   0.3095(9.1e-3)   0.4777(1.4e-2)   0.5786(1.1e-2)   0.8185(1.5e-2)
NMI        0.6479(6.7e-3)   0.7327(5.6e-3)   0.7940(4.4e-3)   0.9302(4.1e-3)
ARI        0.1820(5.8e-3)   0.3265(1.4e-2)   0.4478(9.8e-3)   0.7431(2.5e-2)
F-measure  0.3525(7.4e-3)   0.5139(1.1e-2)   0.6113(9.7e-3)   0.8492(1.1e-2)
Rand       0.9699(1.5e-3)   0.9839(7.2e-4)   0.9880(2.9e-4)   0.9913(1.17e-3)
Purity     0.3696(7.5e-3)   0.5216(1.2e-2)   0.6203(7.6e-3)   0.7772(1.53e-2)

3Sources
Views      BBC              Guardian         Reuters          MiMIC
Accuracy   0.7159(0.0)      0.6508(0.0)      0.5562(0.0)      0.7360(5.9e-2)
NMI        0.6390(0.0)      0.5270(0.0)      0.5347(0.0)      0.6433(3.5e-2)
ARI        0.6082(0.0)      0.4119(0.0)      0.41434(0.0)     0.5957(6.6e-2)
F-measure  0.7656(0.0)      0.7036(0.0)      0.6482(0.0)      0.7581(5.0e-2)
Rand       0.8624(0.0)      0.7983(0.0)      0.7982(0.0)      0.8514(2.6e-2)
Purity     0.7869(0.0)      0.6982(0.0)      0.6982(0.0)      0.7946(2.2e-2)

Figure 6.10: Two-dimensional scatter plots of three individual views and proposed MiMIC
algorithm for multi-omics cancer data sets: LGG (top row) and STAD (bottom row).
Panels: (a) mDNA, (b) RNA, (c) miRNA, (d) MiMIC.

that of spectral clustering on the individual views. The comparative results are reported
in Tables 6.4 and 6.5 for the benchmark data sets, and in Table 6.6 for the multi-omics
cancer data sets. The results in Tables 6.4 and 6.5 clearly show that for four benchmark
data sets, namely, Digits, BBC, ALOI, and 100Leaves, there is significant improvement in
performance of the proposed MiMIC algorithm considering multiple views over any single
view clustering. For the 3Sources data set, the improvement in terms of NMI
and accuracy is smaller, and the single view BBC news source gives the best performance in terms
of ARI and F-measure. In case of the multi-omics data sets, Table 6.6 shows that for all
four data sets, the proposed algorithm achieves the best clustering performance across all
the evaluation indices. The performance gain is most evident for LGG and STAD data

Table 6.6: Performance Analysis of Spectral Clustering on Individual Views and Proposed
MiMIC Algorithm for Multi-Omics Data Sets

Views      mDNA       RNA        miRNA      RPPA       MiMIC

LGG
Accuracy   0.8352060  0.5917603  0.4307116  0.3970037  0.9625468
NMI        0.5734568  0.2176187  0.0498676  0.0254500  0.8543905
ARI        0.5567870  0.1801875  0.0510240  0.0238319  0.8790253
F-measure  0.8269248  0.5875701  0.4717221  0.4326018  0.9623406
Rand       0.7861508  0.6149925  0.5593760  0.5476050  0.9424967
Purity     0.8352060  0.5917603  0.5318352  0.5280899  0.9625468

STAD
Accuracy   0.5413223  0.4793388  0.3719008  0.4173554  0.7727273
NMI        0.2282198  0.1779419  0.0771419  0.0831100  0.5220123
ARI        0.1927570  0.1047749  0.0514998  0.0460928  0.4650334
F-measure  0.5469686  0.4781377  0.3998266  0.4469459  0.7830757
Rand       0.6509722  0.6239155  0.5989164  0.5883543  0.7698296
Purity     0.5867769  0.5495868  0.4917355  0.4917355  0.7727273

BRCA
Accuracy   0.5804020  0.7688442  0.4623116  0.4798995  0.7964824
NMI        0.3408150  0.5277072  0.1947561  0.3140984  0.5553836
ARI        0.3047769  0.5130244  0.1663564  0.2359641  0.5474472
F-measure  0.5982526  0.7690661  0.5105008  0.5630781  0.7997020
Rand       0.7193018  0.7995519  0.6455071  0.6689493  0.8152728
Purity     0.6532663  0.7688442  0.5703518  0.5879397  0.7964824

LUNG
Accuracy   0.8107303  0.9359165  0.8241431  0.5037258  0.9463487
NMI        0.2980508  0.6631276  0.3575188  0.0001449  0.7173075
ARI        0.3852741  0.7597207  0.4193820  -0.001743  0.7965891
F-measure  0.8104506  0.9357307  0.8237679  0.5630053  0.9461134
Rand       0.6926485  0.8798674  0.7097048  0.4992815  0.8983028
Purity     0.8107303  0.9359165  0.8241431  0.5365127  0.9463487

sets. Gene or RNA expression is the most relevant view for BRCA and LUNG data sets,
while for LGG and STAD data sets it is DNA-methylation. For BRCA and LUNG data
sets, the clustering performance of RNA expression is very close to that of the proposed
multi-view algorithm. This is consistent with the fact that most of the initial works on cancer
subtype identification were based on gene expression studies [75, 198].
The scatter plots of the first two dimensions of the subspaces extracted by the individual
views and the proposed algorithm are given in Figures 6.9 and 6.8 for two benchmark data
sets: 3Sources and BBC, and in Figure 6.10 for two multi-omics data sets: LGG and
STAD, as examples. The objects in these figures are colored according to the ground truth
or previously established TCGA cancer subtypes. The scatter plots for the individual views
in Figures 6.9, 6.8, and 6.10 demonstrate the diversity of cluster structures exhibited by the
views. The scatter plots for the proposed algorithm in Figures 6.9(d), 6.8(e), and 6.10(d)
(top row) demonstrate significantly higher cluster separability compared to any of their
individual views for 3Sources, BBC, and LGG data sets, respectively. The distinct omic
views may exhibit disparate cluster structures, but Tables 6.4-6.6 and Figures 6.8-6.10
indicate that proper integration gives a much better idea about the overall cluster structure
of the data set.

Table 6.7: Performance Analysis of Individual Manifolds and Proposed Algorithm

Data set   Digits           Digits           Digits           BBC              BBC              BBC
Manifold   k-Means          Stiefel          MiMIC            k-Means          Stiefel          MiMIC
Accuracy   0.8205(0.0)      0.6480(2.2e-2)   0.9207(4.2e-4)   0.7345(8.5e-2)   0.7732(1.3e-3)   0.8715(0.0)
NMI        0.8350(0)        0.6535(8.8e-3)   0.8597(4.8e-4)   0.6167(6.1e-2)   0.5983(2.4e-3)   0.7182(0)
ARI        0.7687(0.0)      0.5155(1.5e-2)   0.8352(8.1e-4)   0.5910(1.2e-1)   0.6382(3.2e-3)   0.7273(0.0)
F-measure  0.8756(0.0)      0.6922(1.8e-2)   0.9209(4.1e-4)   0.7614(6.1e-1)   0.7806(1.3e-2)   0.8613(0.0)
Rand       0.9564(0.0)      0.9005(5.7e-03)  0.9703(1.4e-4)   0.8315(6.1e-2)   0.8659(1.1e-3)   0.8959(0.0)
Purity     0.8315(0.0)      0.6665(1.8e-02)  0.9207(4.2e-4)   0.7356(8.5e-2)   0.7747(1.3e-3)   0.8715(0.0)

Data set   3Sources         3Sources         3Sources         LGG              LGG              LGG
Manifold   k-Means          Stiefel          MiMIC            k-Means          Stiefel          MiMIC
Accuracy   0.6497(2.4e-3)   0.6798(7.2e-2)   0.7360(5.9e-2)   0.9288(0.0)      0.6292(0.0)      0.9625(0.0)
NMI        0.6221(4.6e-3)   0.6020(7.2e-2)   0.6433(3.5e-2)   0.7949(0)        0.4305(0)        0.8543(0)
ARI        0.5173(2.1e-3)   0.5226(1.2e-1)   0.5957(6.6e-2)   0.7790(0.0)      0.2842(0.0)      0.8790(0.0)
F-measure  0.6927(2.7e-4)   0.7330(6.3e-2)   0.7581(5.0e-2)   0.9269(0.0)      0.6313(0.0)      0.9623(0.0)
Rand       0.8170(1.7e-4)   0.8310(4.1e-2)   0.8514(2.61e-2)  0.8940(0.0)      0.6632(0.0)      0.9424(0.0)
Purity     0.7739(2.4e-3)   0.7455(5.9e-2)   0.7946(2.28e-2)  0.9288(0.0)      0.6779(0.0)      0.9625(0.0)

Data set   100Leaves        100Leaves        100Leaves        STAD             STAD             STAD
Manifold   k-Means          Stiefel          MiMIC            k-Means          Stiefel          MiMIC
Accuracy   0.7139(2.2e-2)   0.6457(1.8e-2)   0.8185(1.5e-2)   0.6867(3.3e-2)   0.5165(0.0)      0.7727(0.0)
NMI        0.8887(5.4e-3)   0.8278(5.9e-3)   0.9302(4.1e-3)   0.4412(4.2e-2)   0.2985(0)        0.5220(0)
ARI        0.6374(2.1e-2)   0.5323(1.5e-2)   0.7431(2.5e-2)   0.3615(5.1e-2)   0.1616(0.0)      0.4650(0.0)
F-measure  0.7543(1.7e-2)   0.6741(1.5e-2)   0.8492(1.1e-2)   0.6930(3.1e-2)   0.5241(0.0)      0.7830(0.0)
Rand       0.9920(6.4e-4)   0.9903(3.8e-4)   0.9913(1.1e-3)   0.6938(1.1e-2)   0.6433(0.0)      0.7698(0.0)
Purity     0.7502(1.8e-2)   0.6764(1.5e-2)   0.7772(1.5e-2)   0.6884(2.9e-2)   0.5909(0.0)      0.7727(0.0)

Data set   ALOI             ALOI             ALOI             BRCA             BRCA             BRCA
Manifold   k-Means          Stiefel          MiMIC            k-Means          Stiefel          MiMIC
Accuracy   0.5044(1.4e-2)   0.5068(1.8e-2)   0.5742(7.4e-3)   0.7085(0.0)      0.7889(0.0)      0.7964(0.0)
NMI        0.7461(4.5e-3)   0.7462(5.2e-3)   0.7805(2.3e-3)   0.4964(0)        0.5373(0)        0.5553(0)
ARI        0.3874(1.6e-2)   0.3850(1.7e-2)   0.4233(6.6e-3)   0.4291(0.0)      0.5331(0.0)      0.5474(0.0)
F-measure  0.5739(1.0e-2)   0.5748(1.4e-2)   0.6221(4.9e-3)   0.7072(0.0)      0.7905(0.0)      0.7997(0.0)
Rand       0.9828(1.1e-3)   0.9826(1.2e-3)   0.9840(3.7e-4)   0.7670(0.0)      0.8075(0.0)      0.8152(0.0)
Purity     0.5461(1.1e-2)   0.5483(1.5e-2)   0.6119(5.7e-3)   0.7085(0.0)      0.7889(0.0)      0.7964(0.0)

6.5.7 Importance of k-Means and Stiefel Manifolds


The proposed objective function f in (6.9) is optimized over two different manifolds,
namely, the k-means and Stiefel manifolds. Optimization over the k-means manifold handles the
joint clustering component, while optimization over the Stiefel manifold minimizes the
disagreement component. To establish the importance of the k-means manifold, only the
disagreement minimization component corresponding to the $U_j$'s is optimized over the Stiefel
manifold. However, in this optimization problem, the joint subspace $U_{Joint}$ does not get updated. So, to evaluate the
performance of Stiefel manifold, the final clustering is performed on the subspace corre-
sponding to the most relevant view (according to the relevance measure defined in Section
5.3.5 of Chapter 5). The comparative performance of Stiefel manifold optimization and the
proposed MiMIC algorithm is presented in Table 6.7 for different benchmark and omics
data sets. Table 6.7 shows that optimization over only the Stiefel manifold leads to
significantly poorer performance compared to the proposed algorithm for all data sets, except for
BRCA and LUNG. For BRCA and LUNG data sets, the proposed algorithm outperforms
the Stiefel manifold, but by a lower margin. This is attributed to the fact that for these
two data sets, the most relevant view (that is, RNA) has performance close to the proposed
algorithm (see Table 6.6), and the final clustering is also performed on the most relevant
subspace. The significant difference in performance establishes the importance of k-means
manifold in the proposed approach.
In order to study the importance of the Stiefel manifold, the performance of the proposed
algorithm is compared with that of the case where only the joint clustering component
corresponding to $U_{Joint}$ is optimized over the k-means manifold. The comparative performance
is reported in Table 6.7. For all the data sets, the proposed MiMIC algorithm optimized
over two manifolds outperforms the joint clustering component optimized over only the
k-means manifold. This establishes the importance of the Stiefel manifold. Table 6.7 also indicates
that apart from the 3Sources and BRCA data sets, k-means manifold optimization gives
better performance compared to that of the Stiefel manifold. For ALOI and LUNG data sets,
the average performance of the two individual manifolds is comparable. However, the best
results are obtained when both manifolds are considered. This establishes the importance
of considering two different manifolds.

6.5.8 Comparative Performance Analysis


Finally, the performance of the proposed MiMIC algorithm is extensively compared with
that of several existing multi-view clustering algorithms on benchmark and multi-omics
cancer data sets. Corresponding results are reported in Tables 6.8, 6.9, and 6.10, where best
performance is highlighted in bold, while italicized values indicate second best performance.

6.5.8.1 Results on Benchmark Data Sets


For five benchmark data sets, the performance of MiMIC is compared with that of eight
state-of-the-art methods, namely, multi-view k-means clustering (MKC) [26], co-regularized
spectral clustering (CoregSC) [120], multi-view spectral clustering (MSC) [246], adaptive
structure-based multi-view clustering (ASMV) [272], multiple graph learning (MGL) [164],
multi-view clustering with graph learning (MCGL) [273], graph-based multi-view clustering
(GMC) [236], and convex combination of approximate graph Laplacians (CoALa) [113].
Among these algorithms, MKC and CoregSC are subspace clustering and co-training based
approaches, respectively, while others are graph-based approaches. The proposed MiMIC
algorithm is compared with these eight existing approaches based on four external indices,
namely, accuracy, NMI, ARI, and F-measure, similar to [236].
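
As a reminder of how pair-counting indices such as ARI behave, the sketch below computes the Rand index and the adjusted Rand index from two label vectors, using the standard contingency-table formulation. It is a generic illustration and not the evaluation code behind the reported tables: the function name and the toy labels are assumptions, and degenerate inputs (for example, both partitions consisting of a single cluster) are mapped to an ARI of 1 by convention.

```python
from collections import Counter
from math import comb


def rand_and_adjusted_rand(truth, pred):
    """Return (Rand index, adjusted Rand index) for two labelings of the same objects."""
    n = len(truth)
    pair_counts = Counter(zip(truth, pred))  # contingency table entries n_ij
    row_counts = Counter(truth)              # sizes of ground-truth classes
    col_counts = Counter(pred)               # sizes of predicted clusters

    same_both = sum(comb(v, 2) for v in pair_counts.values())
    same_truth = sum(comb(v, 2) for v in row_counts.values())
    same_pred = sum(comb(v, 2) for v in col_counts.values())
    total = comb(n, 2)

    # Rand: fraction of object pairs treated the same way by both partitions.
    rand = (total + 2 * same_both - same_truth - same_pred) / total

    expected = same_truth * same_pred / total  # chance-level agreement
    max_index = (same_truth + same_pred) / 2
    if max_index == expected:                  # degenerate case, see lead-in
        ari = 1.0
    else:
        ari = (same_both - expected) / (max_index - expected)
    return rand, ari


# Toy example: one object out of six is assigned to the wrong cluster.
rand, ari = rand_and_adjusted_rand([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1])
```

A single misplaced object already pulls the ARI well below the Rand index in this toy case, which is one reason the two indices can rank algorithms differently.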
The comparative performance is provided in Table 6.8 for benchmark data sets. The
results in Table 6.8 show that the proposed MiMIC algorithm gives the best performance
on all five benchmark data sets across all measures, except for four cases, that is, ARI
and NMI on Digits data set, accuracy on 100Leaves, and ARI on ALOI data set. In these
four cases, MiMIC remains among the top three performers. The graph based algorithms
like ASMV, MCGL, and GMC have standard deviations zero or close to zero as they do
not require an additional k-means clustering step to determine the partition. The MiMIC
algorithm performs robustly on Digits and BBC data sets, with standard deviations close
to zero. Small standard deviations are observed for 3Sources, 100Leaves, and ALOI data
sets, among which the latter two have as many as one hundred clusters, while 3Sources exhibits
poor separability in its joint subspace (as seen in Figure 6.9(d)). In general, algorithms like
MKC, CoregSC, and MSC perform poorly compared to recently proposed graph algorithms
like MCGL, GMC, and CoALa, all of which are again outperformed by the proposed MiMIC
algorithm in 16 out of 20 cases. Similar to the proposed MiMIC algorithm, the CoALa
algorithm is also based on Laplacian approximation, but CoALa obtains a closed form
solution over the Euclidean space. The better performance of the proposed algorithm across

Table 6.8: Comparative Performance Analysis of Proposed and Existing Integrative Clustering Algorithms on Benchmark Data Sets

Algorithm  MKC              CoregSC          MSC              ASMV             MGL              MCGL             GMC              CoALa            MiMIC

Digits
Accuracy   0.4924(2.77e-1)  0.7556(5.96e-2)  0.7918(8.21e-2)  0.5745(0)        0.7440(8.19e-2)  0.8530(0.0)      0.8820(0)        0.8835(0.0)      0.9207(4.21e-4)
NMI        0.5325(3.68e-1)  0.7421(3.27e-2)  0.7560(3.24e-2)  0.6709(0)        0.8264(4.73e-2)  0.9055(0.0)      0.9050(0)        0.7981(0)        0.8597(4.88e-4)
ARI        0.4280(2.99e-1)  0.6885(5.73e-2)  0.6803(6.28e-2)  0.4047(0)        0.6888(1.07e-1)  0.8313(0)        0.8502(0)        0.7645(0.0)      0.8352(8.18e-4)
F-measure  0.5130(2.33e-2)  0.6934(5.11e-2)  0.7129(5.58e-2)  0.4852(0)        0.7238(9.37e-2)  0.8493(0)        0.8658(0)        0.8839(0.0)      0.9209(4.15e-4)

3Sources
Accuracy   0.4663(1.06e-1)  0.5479(2.99e-2)  0.4751(2.97e-2)  0.3373(0)        0.6751(6.67e-2)  0.3077(0)        0.6923(0)        0.6508(0.0)      0.7360(5.92e-2)
NMI        0.3665(1.00e-1)  0.5238(1.98e-2)  0.3850(2.27e-2)  0.0896(0)        0.5768(8.61e-2)  0.1034(0)        0.6216(0.0)      0.6198(0)        0.6433(3.59e-2)
ARI        0.2461(1.40e-1)  0.3339(2.85e-2)  0.2618(3.81e-2)  -0.021(0)        0.4431(1.17e-1)  -0.033(0)        0.4431(0.0)      0.5183(0.0)      0.5957(6.69e-2)
F-measure  0.4114(1.08e-1)  0.4775(1.91e-2)  0.4087(3.05e-2)  0.3528(0)        0.5966(7.12e-2)  0.3417(0.0)      0.6047(0.0)      0.6929(0.0)      0.7581(5.92e-2)

BBC
Accuracy   0.6034(1.10e-1)  0.4701(0.0)      0.6732(4.94e-2)  0.3372(0.0)      0.5396(1.10e-1)  0.3533(0)        0.6934(0)        0.8108(4.36e-3)  0.8715(0.0)
NMI        0.4786(8.51e-2)  0.2863(0.0)      0.5531(1.44e-2)  0.0348(0.0)      0.3697(1.89e-1)  0.0741(0)        0.5628(0.0)      0.6536(1.96e-2)  0.7182(0)
ARI        0.3450(1.21e-1)  0.2727(0.0)      0.4658(2.20e-2)  0.0018(0)        0.3153(1.66e-1)  0.0053(0)        0.4789(0)        0.7102(2.78e-2)  0.7273(0.0)
F-measure  0.5018(9.03e-2)  0.4879(0.0)      0.5877(1.83e-2)  0.3781(0.0)      0.5402(8.53e-2)  0.3762(0.0)      0.6333(0)        0.8138(9.93e-4)  0.8613(0.0)

100Leaves
Accuracy   0.0100(0.0)      0.7706(2.58e-2)  0.7379(2.21e-2)  0.7906(0)        0.6904(2.42e-2)  0.8106(0)        0.8238(0)        0.7384(1.34e-2)  0.8185(1.56e-2)
NMI        0.0000(0.0)      0.9165(5.90e-3)  0.9014(7.60e-3)  0.9009(0)        0.8753(7.60e-3)  0.9130(0.0)      0.9292(0.0)      0.8893(4.06e-3)  0.9302(4.12e-3)
ARI        0.0000(0.0)      0.7229(1.92e-2)  0.6788(2.26e-2)  0.6104(0)        0.3858(5.65e-2)  0.5155(0)        0.4974(0)        0.6550(1.41e-2)  0.7431(2.53e-2)
F-measure  0.0186(0.0)      0.7257(1.90e-2)  0.6821(2.23e-2)  0.6148(0)        0.3944(5.53e-2)  0.5217(0.0)      0.5042(0)        0.7672(1.19e-2)  0.8492(1.13e-2)

ALOI
Accuracy   0.0101(0)        0.5217(2.13e-2)  0.4738(7.65e-2)  0.4555(0)        0.4807(1.51e-2)  0.4625(0)        0.5705(0)        0.5594(1.44e-2)  0.5742(7.44e-3)
NMI        0.0000(0.0)      0.6993(1.32e-2)  0.6358(5.44e-2)  0.6767(0)        0.7052(7.00e-3)  0.6657(0)        0.7350(0.0)      0.7654(3.72e-3)  0.7805(2.39e-3)
ARI        0.0000(0)        0.4097(4.52e-2)  0.3305(4.81e-2)  0.0533(0.0)      0.1987(4.37e-2)  0.0441(0.0)      0.4305(0)        0.4352(1.18e-2)  0.4233(6.66e-3)
F-measure  0.0196(0)        0.4051(2.38e-2)  0.3366(3.68e-2)  0.0712(0)        0.2112(4.22e-2)  0.0621(0)        0.4366(0)        0.6213(1.15e-2)  0.6221(4.90e-3)

all four benchmark data sets in Table 6.8 establishes the importance of iterative line-search
based manifold optimization in the proposed formulation compared to Euclidean space
optimization in CoALa.

6.5.8.2 Results on Multi-Omics Cancer Data Sets


For the cancer data sets, the performance of MiMIC is compared with nine integrative
cancer subtype identification algorithms, namely, cluster of cluster analysis (COCA) [93],
multivariate normality based joint subspace clustering (NormS) [111], LRAcluster [243],
iCluster [192], principal component analysis on naively concatenated data (PCA-con),
selective update of relevant eigenspaces (SURE) [112], joint and individual variance
explained (JIVE) [141], similarity network fusion (SNF) [234], and CoALa [113]. COCA
is a two-stage consensus clustering based approach, LRAcluster, NormS, and iCluster are
probabilistic model based approaches, while JIVE and SURE are low-rank subspace based
approaches. The SNF and CoALa algorithms are graph based approaches. The experimental
setup followed for the existing multi-omics cancer subtyping algorithms is the same as
that followed in Chapter 3.
The comparative performance analysis is reported in Table 6.9. The results in Table
6.9 show that for BRCA and STAD data sets, the proposed MiMIC algorithm has the
closest resemblance with the previously established TCGA and WHO subtypes of these
cancers, in terms of all external indices. For LGG, LUNG, CRC, and KIDNEY data sets,
although the best performance is obtained by either CoALa or SNF, the performance of
MiMIC is very competitive. Among the three existing probabilistic model based approaches,
NormS has superior performance in the majority of the cases. The iCluster algorithm has
poor performance on CESC, KIDNEY, STAD, and LUNG data sets, and LRAcluster has
comparatively poor performance on STAD and LGG data sets. This poor performance is
attributed to the poor fitting of their probabilistic model on the real-life data sets. The
PCA-con, JIVE, and SURE algorithms are SVD based low-rank approaches, among which
PCA-con and SURE have comparable performance. The results reported in Tables
6.8, 6.9, and 6.10 show that the proposed MiMIC algorithm performs significantly better
than the existing ones on the majority of the benchmark data sets and on some omics data sets.

6.5.8.3 Results on Social Network and General Image Data Sets


Apart from the results on various data sets reported in Section 6.5.8.1 and Section 6.5.8.2,
experiments are also carried out on five Twitter data sets: Football, Olympics, Politics-IE,
Politics-UK, and Rugby; two general image data sets: Caltech7 and ORL; and one citation
network data set: CORA. These data sets mostly have graph/network based views [78].
The performance of the proposed MiMIC algorithm on these eight data sets is compared
with that of the two individual manifolds, namely, k-means and Stiefel manifolds, and with
that of two graph based approaches, namely, SNF and CoALa (proposed in Chapter 5).
The comparative results are provided in Tables 6.11 and 6.12.
The results in Tables 6.11 and 6.12 show that for most of the external indices, the
proposed algorithm has better performance compared to both individual manifolds, namely, the
k-means and Stiefel manifolds, for all social network and image data sets, except Politics-UK.
For the Politics-UK data set, better clustering performance is achieved when considering

Table 6.9: Comparative Performance Analysis of Proposed and Existing Integrative Clustering Algorithms on Multi-Omics Data
Sets: BRCA, LGG, STAD, LUNG
Approach categories: Consensus (COCA); Statistical Model Based (NormS, LRAcluster, iCluster); Subspace Based (PCA-con, SURE, JIVE); Graph Based (SNF, CoALa); Manifold (MiMIC)

Algorithm  COCA             NormS        LRAcluster   iCluster     PCA-con      SURE         JIVE         SNF          CoALa        MiMIC

BRCA
Accuracy   0.7434(7.94e-4)  0.7688(0.0)  0.7110(0.0)  0.7638(0.0)  0.7587(0.0)  0.7663(0.0)  0.6859(0.0)  0.6783(0.0)  0.7613(0.0)  0.7964(0.0)
NMI        0.5002(3.48e-4)  0.4287(0)    0.5437(0)    0.5176(0)    0.5506(0)    0.4558(0.0)  0.4368(0.0)  0.5528(0)    0.5281(0)    0.5553(0)
ARI        0.4864(4.50e-4)  0.5090(0.0)  0.4035(0.0)  0.4745(0.0)  0.5038(0.0)  0.5104(0.0)  0.3772(0.0)  0.4111(0.0)  0.4874(0.0)  0.5474(0.0)
F-measure  0.7457(8.13e-4)  0.7699(0.0)  0.7101(0.0)  0.7658(0.0)  0.7601(0.0)  0.7683(0.0)  0.6889(0.0)  0.6865(0.0)  0.7660(0.0)  0.7997(0.0)
Rand       0.7905(1.92e-4)  0.7999(0.0)  0.7521(0.0)  0.7842(0.0)  0.7984(0.0)  0.8010(0.0)  0.7464(0.0)  0.7602(0.0)  0.7922(0.0)  0.8152(0.0)
Purity     0.7434(7.95e-4)  0.7688(0.0)  0.7110(0.0)  0.7638(0.0)  0.7587(0.0)  0.7663(0.0)  0.6859(0.0)  0.6959(0.0)  0.7613(0.0)  0.7964(0.0)

LGG
Accuracy   0.6591(0.0)      0.7940(0.0)  0.4719(0.0)  0.4382(0.0)  0.6666(0.0)  0.7940(0.0)  0.5617(0.0)  0.8689(0.0)  0.9737(0.0)  0.9625(0.0)
NMI        0.2772(0.0)      0.5325(0.0)  0.1240(0)    0.1379(0)    0.3438(0.0)  0.5335(0.0)  0.2299(0)    0.6253(0.0)  0.8689(0)    0.8543(0)
ARI        0.2533(0.0)      0.4649(0.0)  0.1030(0.0)  0.0996(0.0)  0.3031(0.0)  0.4668(0.0)  0.1606(0.0)  0.6331(0.0)  0.9199(0.0)  0.8790(0.0)
F-measure  0.6608(0.0)      0.7916(0.0)  0.5137(0.0)  0.5187(0.0)  0.6574(0.0)  0.7904(0.0)  0.5757(0.0)  0.8720(0.0)  0.9737(0.0)  0.9623(0.0)
Rand       0.6454(0.0)      0.7465(0.0)  0.5831(0.0)  0.5821(0.0)  0.6616(0.0)  0.7465(0.0)  0.6056(0.0)  0.8268(0.0)  0.9622(0.0)  0.9424(0.0)
Purity     0.6591(0.0)      0.7940(0.0)  0.5280(0.0)  0.5355(0.0)  0.6666(0.0)  0.7940(0.0)  0.5730(0.0)  0.8689(0.0)  0.9737(0.0)  0.9625(0.0)

Accuracy 0.4450(3.34e-2) 0.5702(0.0) 0.4256(0.0) 0.3512(0.0) 0.6900(0.0) 0.6983(0.0) 0.4049(0.0) 0.5661(0.0) 0.7685 (0.0) 0.7727(0.0)
NMI 0.1309(4.77e-3) 0.1805(0) 0.1259(0) 0.0650(0) 0.3654(0.0) 0.3511(0) 0.1288(0) 0.3216(0.0) 0.5107 (0.0) 0.5220(0)
ARI 0.0740(1.02e-2) 0.1625(0.0) 0.0912(0.0) 0.0288(0.0) 0.3204(0.0) 0.3445(0.0) 0.0657(0.0) 0.2694(0.0) 0.4559 (0.0) 0.4650(0.0)
F-measure 0.4558(2.50e-2) 0.5770(0.0) 0.4746(0.0) 0.3832(0.0) 0.6959(0.0) 0.7056(0.0) 0.4487(0.0) 0.6333(0.0) 0.7778 (0.0) 0.7830(0.0)

STAD
Rand 0.5981(1.32e-2) 0.6435(0.0) 0.6122(0.0) 0.5855(0.0) 0.7110(0.0) 0.7216(0.0) 0.5981(0.0) 0.6945(0.0) 0.7661 (0.0) 0.7698(0.0)
Purity 0.5173(9.50e-3) 0.5950(0.0) 0.5619(0.0) 0.4917(0.0) 0.6900(0.0) 0.6983(0.0) 0.5165(0.0) 0.6363(0.0) 0.7685 (0.0) 0.7727(0.0)
Accuracy 0.9284(0.0) 0.9359(0.0) 0.9344(0.0) 0.6333(0.0) 0.9388(0.0) 0.9418(0.0) 0.9269(0.0) 0.9493(0.0) 0.9403(0.0) 0.9463 (0.0)
NMI 0.6287(0.0) 0.6650(0.0) 0.6535(0.0) 0.0627(0.0) 0.6773(0) 0.6878(0.0) 0.6333(0.0) 0.7152 (0.0) 0.6970(0.0) 0.7173(0.0)
ARI 0.7339(0.0) 0.7597(0.0) 0.7545(0.0) 0.0696(0.0) 0.7701(0.0) 0.7806(0.0) 0.7288(0.0) 0.8072(0.0) 0.7754(0.0) 0.7965 (0.0)
F-measure 0.9283(0.0) 0.9357(0.0) 0.9342(0.0) 0.6299(0.0) 0.9386(0.0) 0.9417(0.0) 0.9266(0.0) 0.9492(0.0) 0.9400(0.0) 0.9461 (0.0)

LUNG
Rand 0.8669(0.0) 0.8798(0.0) 0.8772(0.0) 0.5348(0.0) 0.8850(0.0) 0.8903(0.0) 0.8644(0.0) 0.9036(0.0) 0.8877(0.0) 0.8983 (0.0)
Purity 0.9284(0.0) 0.9359(0.0) 0.9344(0.0) 0.6333(0.0) 0.9388(0.0) 0.9418(0.0) 0.9269(0.0) 0.9493(0.0) 0.9403(0.0) 0.9463 (0.0)
Table 6.10: Comparative Performance Analysis of Proposed and Existing Integrative Clustering Algorithms on Multi-Omics Data Sets: CRC, CESC, KIDNEY, OV
Algorithm categories: Consensus (COCA); Statistical Model Based (NormS, LRAcluster, iCluster); Subspace Based (PCA-con, SURE, JIVE); Graph Based (SNF, CoALa); Manifold (MiMIC)

Data Set | Index | COCA | NormS | LRAcluster | iCluster | PCA-con | SURE | JIVE | SNF | CoALa | MiMIC
CRC | Accuracy | 0.5323(5.56e-3) | 0.6206(0.0) | 0.5129(0.0) | 0.6163(0.0) | 0.5366(0.0) | 0.5107(0.0) | 0.6034(0.0) | 0.5991(0.0) | 0.6400(0.0) | 0.6228(0.0)
CRC | NMI | 0.0120(1.27e-3) | 0.0093(0.0) | 0.0030(0.0) | 0.0070(0.0) | 0.0057(0.0) | 0.0028(0.0) | 0.0071(0.0) | 0.0069(0.0) | 0.0185(0.0) | 0.0069(0.0)
CRC | ARI | 0.0007(1.86e-3) | 0.0347(0.0) | -0.001(0.0) | 0.0293(0.0) | 0.0037(0.0) | -0.002(0.0) | 0.0256(0.0) | 0.0240(0.0) | 0.0548(0.0) | 0.0310(0.0)
CRC | F-measure | 0.5586(5.56e-3) | 0.6345(0.0) | 0.5410(0.0) | 0.6298(0.0) | 0.5642(0.0) | 0.5416(0.0) | 0.6210(0.0) | 0.6178(0.0) | 0.6529(0.0) | 0.6338(0.0)
CRC | Rand | 0.5010(6.97e-4) | 0.5281(0.0) | 0.4992(0.0) | 0.5260(0.0) | 0.5016(0.0) | 0.4991(0.0) | 0.5203(0.0) | 0.5186(0.0) | 0.5382(0.0) | 0.5291(0.0)
CRC | Purity | 0.7370(0.0) | 0.7370(0.0) | 0.7370(0.0) | 0.7370(0.0) | 0.7370(0.0) | 0.7370(0.0) | 0.7370(0.0) | 0.7370(0.0) | 0.7370(0.0) | 0.7370(0.0)
CESC | Accuracy | 0.6693(0.0) | 0.8870(0.0) | 0.8145(0.0) | 0.5483(0.0) | 0.8548(0.0) | 0.8629(0.0) | 0.7177(0.0) | 0.6693(0.0) | 0.8225(0.0) | 0.8548(0.0)
CESC | NMI | 0.4172(4.77e-3) | 0.6854(0) | 0.5176(0) | 0.1737(0) | 0.6750(0) | 0.6461(0.0) | 0.4425(0.0) | 0.4927(0.0) | 0.5479(0.0) | 0.6451(0)
CESC | ARI | 0.3677(8.95e-4) | 0.7004(0.0) | 0.5384(0.0) | 0.1017(0.0) | 0.6333(0.0) | 0.6507(0.0) | 0.3860(0.0) | 0.4239(0.0) | 0.5637(0.0) | 0.6236(0.0)
CESC | F-measure | 0.6865(2.49e-3) | 0.8801(0.0) | 0.8123(0.0) | 0.5568(0.0) | 0.8390(0.0) | 0.8512(0.0) | 0.7097(0.0) | 0.7073(0.0) | 0.8139(0.0) | 0.8418(0.0)
CESC | Rand | 0.6971(6.33e-5) | 0.8587(0.0) | 0.7867(0.0) | 0.5731(0.0) | 0.8237(0.0) | 0.8339(0.0) | 0.7164(0.0) | 0.7043(0.0) | 0.7951(0.0) | 0.8193(0.0)
CESC | Purity | 0.6774(0.0) | 0.8870(0.0) | 0.8145(0.0) | 0.5645(0.0) | 0.8548(0.0) | 0.8629(0.0) | 0.7177(0.0) | 0.6935(0.0) | 0.8225(0.0) | 0.8548(0.0)
KIDNEY | Accuracy | 0.9470(0.0) | 0.9525(0.0) | 0.9538(0.0) | 0.6065(0.0) | 0.9511(0.0) | 0.9525(0.0) | 0.9308(0.0) | 0.9579(0.0) | 0.9294(0.0) | 0.9552(0.0)
KIDNEY | NMI | 0.7493(0.0) | 0.7726(0) | 0.7862(0.0) | 0.2547(0) | 0.7670(0) | 0.7726(0) | 0.6955(0) | 0.7946(0.0) | 0.6987(0) | 0.7767(0.0)
KIDNEY | ARI | 0.8393(0.0) | 0.8534(0.0) | 0.8579(0.0) | 0.1717(0.0) | 0.8489(0.0) | 0.8534(0.0) | 0.7786(0.0) | 0.8796(0.0) | 0.7786(0.0) | 0.8534(0.0)
KIDNEY | F-measure | 0.9477(0.0) | 0.9530(0.0) | 0.9545(0.0) | 0.6514(0.0) | 0.9516(0.0) | 0.9530(0.0) | 0.9300(0.0) | 0.9590(0.0) | 0.9285(0.0) | 0.9551(0.0)
KIDNEY | Rand | 0.9199(0.0) | 0.9269(0.0) | 0.9292(0.0) | 0.5842(0.0) | 0.9246(0.0) | 0.9269(0.0) | 0.8893(0.0) | 0.9400(0.0) | 0.8893(0.0) | 0.9268(0.0)
KIDNEY | Purity | 0.9470(0.0) | 0.9525(0.0) | 0.9538(0.0) | 0.6811(0.0) | 0.9511(0.0) | 0.9525(0.0) | 0.9308(0.0) | 0.9579(0.0) | 0.9294(0.0) | 0.9552(0.0)
OV | Accuracy | 0.5943(7.09e-3) | 0.6976(0.0) | 0.6287(0.0) | 0.5089(0.0) | 0.6946(0.0) | 0.7215(0.0) | 0.5718(7.73e-3) | 0.5269(0.0) | 0.6736(0.0) | 0.6595(2.84e-3)
OV | NMI | 0.3131(1.22e-2) | 0.4504(0) | 0.3745(0) | 0.2249(0) | 0.4424(0.0) | 0.4680(0) | 0.2629(8.42e-4) | 0.2753(0.0) | 0.3381(0) | 0.3271(3.97e-4)
OV | ARI | 0.2810(6.83e-3) | 0.4142(0.0) | 0.2999(0.0) | 0.2005(0.0) | 0.4068(0.0) | 0.4372(0.0) | 0.2027(4.21e-3) | 0.2058(0.0) | 0.3199(0.0) | 0.3112(4.28e-3)
OV | F-measure | 0.6068(4.28e-3) | 0.6910(0.0) | 0.6384(0.0) | 0.4808(0.0) | 0.6868(0.0) | 0.7148(0.0) | 0.5653(7.84e-3) | 0.5642(0.0) | 0.6700(0.0) | 0.6611(2.54e-3)
OV | Rand | 0.7039(2.64e-3) | 0.7766(0.0) | 0.7322(0.0) | 0.6916(0.0) | 0.7734(0.0) | 0.7857(0.0) | 0.6885(2.80e-3) | 0.6557(0.0) | 0.7379(0.0) | 0.7383(1.92e-3)
OV | Purity | 0.5943(7.09e-3) | 0.6976(0.0) | 0.6287(0.0) | 0.5119(0.0) | 0.6946(0.0) | 0.7215(0.0) | 0.5718(7.73e-3) | 0.5389(0.0) | 0.6736(0.0) | 0.6595(2.84e-3)
only the k-means manifold. Compared to the graph based approaches SNF and CoALa,
the proposed algorithm outperforms both of them for the Football, Politics-IE, Caltech7, and
CORA data sets on the majority of external indices. For the Olympics, Politics-UK, and ORL
data sets, the performance of CoALa and the proposed MiMIC algorithm is comparable.
For the Rugby data set, the CoALa algorithm of Chapter 5 has the best performance.

Table 6.11: Comparative Performance Analysis of Proposed and Existing Algorithms on Twitter Data Sets

Data Set | Index | k-Means Manifold | Stiefel Manifold | SNF (Graph Based) | CoALa (Graph Based) | MiMIC
Football | Accuracy | 0.8673(1.14e-2) | 0.7366(1.09e-2) | 0.8145(0.0) | 0.8500(2.58e-2) | 0.8846(2.27e-2)
Football | NMI | 0.8804(9.42e-3) | 0.7742(6.64e-3) | 0.8829(0.0) | 0.8625(1.90e-2) | 0.8958(1.24e-2)
Football | ARI | 0.7566(2.51e-2) | 0.5584(1.63e-2) | 0.7458(0.0) | 0.7278(4.15e-2) | 0.7841(4.61e-2)
Football | F-measure | 0.8792(9.90e-2) | 0.7610(9.99e-3) | 0.8431(0.0) | 0.8683(1.81e-2) | 0.8941(1.74e-2)
Football | Rand | 0.9756(3.02e-3) | 0.9538(2.93e-3) | 0.9735(0.0) | 0.9739(4.74e-3) | 0.9781(5.73e-3)
Football | Purity | 0.8713(1.15e-2) | 0.7576(1.03e-2) | 0.8266(0.0) | 0.8548(2.42e-2) | 0.8879(2.13e-2)
Olympics | Accuracy | 0.8228(2.88e-2) | 0.7390(2.04e-2) | 0.9051(0.0) | 0.8443(1.49e-2) | 0.8844(2.60e-2)
Olympics | NMI | 0.9141(1.08e-2) | 0.8075(1.05e-2) | 0.9381(0.0) | 0.9197(5.78e-3) | 0.9394(9.10e-3)
Olympics | ARI | 0.7890(6.58e-2) | 0.5474(2.64e-2) | 0.9090(0.0) | 0.8071(2.40e-2) | 0.8699(3.52e-2)
Olympics | F-measure | 0.8520(2.87e-2) | 0.7699(1.82e-2) | 0.9121(0.0) | 0.8682(1.46e-2) | 0.9006(2.35e-2)
Olympics | Rand | 0.9787(8.20e-3) | 0.9391(5.51e-3) | 0.9911(0.0) | 0.9812(2.83e-3) | 0.9871(3.69e-3)
Olympics | Purity | 0.8782(1.66e-2) | 0.7605(1.97e-2) | 0.9137(0.0) | 0.8991(1.03e-2) | 0.9112(1.70e-2)
Politics-IE | Accuracy | 0.8764(0.0) | 0.8048(3.30e-2) | 0.9252(0.0) | 0.8735(0.0) | 0.9436(1.45e-2)
Politics-IE | NMI | 0.8246(0) | 0.6884(2.63e-2) | 0.8938(0.0) | 0.8170(0.0) | 0.8573(1.88e-2)
Politics-IE | ARI | 0.7408(0.0) | 0.7096(3.17e-2) | 0.9409(0.0) | 0.8284(0.0) | 0.8693(2.81e-2)
Politics-IE | F-measure | 0.8662(0.0) | 0.7988(3.18e-2) | 0.9258(0.0) | 0.8583(0.0) | 0.9447(1.21e-2)
Politics-IE | Rand | 0.8910(0.0) | 0.8780(1.55e-2) | 0.9772(0.0) | 0.9305(0.0) | 0.9499(1.01e-2)
Politics-IE | Purity | 0.8850(0.0) | 0.8275(2.02e-2) | 0.9310(0.0) | 0.8793(0.0) | 0.9436(1.45e-2)
Politics-UK | Accuracy | 0.9785(0.0) | 0.9245(7.65e-3) | 0.9737(0.0) | 0.9665(0.0) | 0.9727(2.01e-3)
Politics-UK | NMI | 0.93331(0) | 0.7365(1.69e-2) | 0.9194(0.0) | 0.9434(0.0) | 0.9225(5.69e-3)
Politics-UK | ARI | 0.9640(0.0) | 0.8175(1.58e-2) | 0.9608(0.0) | 0.9633(0.0) | 0.9522(4.83e-3)
Politics-UK | F-measure | 0.9735(0.0) | 0.9182(8.98e-3) | 0.9701(0.0) | 0.9736(0.0) | 0.9692(3.97e-3)
Politics-UK | Rand | 0.9829(0.0) | 0.9133(7.73e-3) | 0.9814(0.0) | 0.9826(0.0) | 0.9774(2.32e-3)
Politics-UK | Purity | 0.9785(0.0) | 0.9274(5.30e-3) | 0.9761(0.0) | 0.9785(0.0) | 0.9727(2.01e-3)
Rugby | Accuracy | 0.6822(2.11e-2) | 0.6001(3.63e-2) | 0.7611(0.0) | 0.8305(2.41e-3) | 0.6841(2.40e-2)
Rugby | NMI | 0.6513(1.11e-2) | 0.6283(1.56e-2) | 0.6768(0.0) | 0.7093(3.13e-3) | 0.6552(9.61e-3)
Rugby | ARI | 0.4345(3.03e-2) | 0.4057(1.91e-2) | 0.5485(0.0) | 0.6627(1.88e-3) | 0.4344(2.72e-2)
Rugby | F-measure | 0.7320(2.21e-2) | 0.6629(3.27e-2) | 0.7778(0.0) | 0.8349(1.03e-3) | 0.7331(2.50e-2)
Rugby | Rand | 0.8622(7.46e-3) | 0.8591(7.48e-3) | 0.8818(0.0) | 0.9067(4.61e-4) | 0.8631(4.87e-3)
Rugby | Purity | 0.8512(1.39e-2) | 0.8418(1.91e-2) | 0.8454(0.0) | 0.8606(2.26e-3) | 0.8600(8.94e-3)

6.6 Conclusion
This chapter presents a novel manifold optimization based algorithm for integrative clus-
tering of high dimensional multi-view data sets. A joint objective is proposed, consisting

Table 6.12: Comparative Performance Analysis of Proposed and Existing Algorithms on ORL, Caltech7, and CORA Data Sets

Data Set | Index | k-Means Manifold | Stiefel Manifold | SNF (Graph Based) | CoALa (Graph Based) | MiMIC
ORL | Accuracy | 0.6602(4.30e-2) | 0.7127(2.97e-2) | 0.6907(2.57e-2) | 0.7715(2.18e-2) | 0.7307(2.36e-2)
ORL | NMI | 0.8396(1.87e-2) | 0.8756(1.07e-2) | 0.8616(1.00e-2) | 0.8980(1.15e-2) | 0.8814(1.35e-2)
ORL | ARI | 0.5141(5.14e-2) | 0.6027(2.97e-2) | 0.6054(3.04e-2) | 0.6932(2.82e-2) | 0.6208(3.83e-2)
ORL | F-measure | 0.7047(3.40e-2) | 0.7620(2.22e-2) | 0.7257(2.44e-2) | 0.7962(1.78e-2) | 0.7677(2.29e-2)
ORL | Rand | 0.9728(4.29e-3) | 0.9792(2.19e-3) | 0.9804(2.04e-3) | 0.9850(1.63e-3) | 0.9802(2.65e-3)
ORL | Purity | 0.7197(3.40e-2) | 0.7635(1.91e-2) | 0.7450(2.26e-2) | 0.8090(1.75e-2) | 0.7737(1.78e-2)
Caltech7 | Accuracy | 0.5655(5.72e-4) | 0.4181(3.27e-4) | 0.5440(3.42e-2) | 0.5685(0.0) | 0.5773(0.0)
Caltech7 | NMI | 0.5730(1.17e-3) | 0.3141(7.64e-5) | 0.5676(2.41e-2) | 0.5650(0) | 0.5880(0)
Caltech7 | ARI | 0.4463(1.12e-3) | 0.2831(2.27e-4) | 0.4126(2.86e-2) | 0.4397(0.0) | 0.4608(0.0)
Caltech7 | F-measure | 0.6471(4.11e-4) | 0.5290(3.63e-4) | 0.6363(4.12e-2) | 0.6689(0.0) | 0.6600(0.0)
Caltech7 | Rand | 0.7613(4.63e-4) | 0.6958(8.45e-5) | 0.7482(1.13e-2) | 0.7583(0.0) | 0.7674(0.0)
Caltech7 | Purity | 0.8654(5.72e-4) | 0.7648(6.55e-4) | 0.8516(1.09e-2) | 0.8548(0.0) | 0.8751(0.0)
CORA | Accuracy | 0.4823(4.46e-3) | 0.3980(3.45e-2) | 0.5450(2.79e-2) | 0.5896(3.41e-3) | 0.6120(2.46e-3)
CORA | NMI | 0.3284(4.92e-3) | 0.2573(2.67e-2) | 0.3829(1.14e-2) | 0.4364(2.81e-3) | 0.4686(6.17e-3)
CORA | ARI | 0.1275(3.02e-3) | 0.0885(3.29e-3) | 0.2941(1.86e-2) | 0.3256(2.87e-3) | 0.3479(3.73e-3)
CORA | F-measure | 0.4726(8.67e-3) | 0.3932(2.36e-2) | 0.5957(1.96e-2) | 0.5844(4.98e-3) | 0.6373(3.50e-3)
CORA | Rand | 0.5253(4.25e-2) | 0.4862(4.36e-2) | 0.7936(9.31e-3) | 0.7460(2.14e-3) | 0.7709(9.35e-4)
CORA | Purity | 0.5031(6.78e-3) | 0.4468(1.60e-2) | 0.6012(1.62e-2) | 0.6206(3.41e-3) | 0.6423(2.46e-3)

of two components, namely, a joint clustering component to identify compact and well-
separated clusters, and a disagreement minimization component to look for consistent
clusters across different views. The joint objective is optimized over two different mani-
folds, namely, k-means and Stiefel manifolds. The Stiefel manifold models the differential
clusters in the individual views, while the k-means manifold tries to infer the best-fit
global cluster structure in the data. The optimization is performed separately along the
manifolds of each view, so that individual non-linearity within each view is not lost while
looking for the shared cluster information. The convergence of the proposed algorithm is
theoretically established over the manifold, while the analysis of its asymptotic behavior
quantifies how fast it converges to an optimal solution. The derived asymptotic bound is
used to make inference regarding the separability of the clusters present in the data set.
The clustering performance of the proposed algorithm is studied and compared with several
state-of-the-art integrative clustering approaches on several multi-omics cancer data sets
and benchmark data sets. Comparative studies demonstrate that the proposed algorithm
can efficiently leverage information from multiple views, and for the majority of the data sets,
it reveals clusters that have the closest resemblance to the previously established cancer
subtypes and the ground-truth class information.
The MiMIC algorithm proposed in this chapter optimizes only the joint and individ-
ual clustering subspaces to capture the underlying structure of the data set. However,
simultaneous optimization of the individual graphs, their corresponding weightage in the
joint view, as well as the joint and individual subspaces, is likely to give a more compre-
hensive idea of the clusters present in the data set. In this regard, Chapter 7 presents

another manifold optimization algorithm that harnesses the geometry and structure pre-
serving properties of symmetric positive definite manifold and Grassmannian manifold for
efficient multi-view clustering.

Chapter 7

Geometry Aware Multi-View Clustering over Riemannian Manifolds

7.1 Introduction
Multi-view clustering, now a major hot spot in unsupervised machine learning, aims to
gather similar subjects in the same group and dissimilar ones in different groups, utilizing
the information of multiple views, instead of just one. The extensive literature on multi-
view clustering can be classified into several categories [34,174], which are briefly described
in Chapter 2. Among them, graph based models form the most common category, which fuse
graphs from different views and extract a lower dimensional subspace or spectral embedding
of the fused graph to perform clustering [128,164,234,236,246,272,273]. The weighted graph
fusion has been proposed in several approaches [128,164,273]. Among them, the algorithms
proposed in Chapters 5 and 6, and in [199] fix graph weights a priori, while those proposed
in [128, 164, 236, 272, 273] use adaptive weight optimization techniques. A major issue
with graph-driven approaches is that the real-world views inherently contain measurement
errors, redundancy, and noise, which propagate during the graph fusion process distorting
the learned cluster structure. In this regard, Chapter 5 introduces the fusion of de-noised
approximations of view-specific graph Laplacians to obtain better cluster separation in
the subsequent approximate subspace. However, the approach focuses on extracting only
the consistent information of different views via a joint subspace of the fused graph. The
complementary information of individual graphs is ignored during the fusion process.
In several real-world applications, data appears as a point cloud, but its meaningful
structure resides on a lower dimensional manifold embedded in the higher dimensional
space [180, 186, 232]. The conventional manifold learning algorithms exploit the property
that a manifold, although non-linear, has a locally linear geometry that resembles the
Euclidean space. The locally linear property is used to identify neighboring points in a
cluster [180,186]. In these algorithms, however, the form of the manifold is unknown. As a
result, the metric and properties of the space are generally not defined. In a separate line

of approach, the data is assumed to originate from a clearly known manifold, preferably a
Riemannian one, as they are endowed with a smoothly varying inner product [74,232,233].
Some widely used Riemannian manifolds are Stiefel [57], Grassmannian [3], and symmetric
positive definite (SPD) [74] manifolds. The Stiefel manifold is used in the optimization of
cost functions with orthogonality constraints, where, in addition to the subspace structure,
the specific choice of basis vectors is also important [143]. The Grassmannian manifold’s
geometric properties have been utilized in vision problems involving subspace constraints.
Examples include affine invariant shape clustering [10], subspace tracking [195, 201], and
face recognition from image sets and video clustering [224, 232, 233]. The covariance
matrices of features, used as region descriptors, have been viewed as points on the SPD
manifold [209]. Nevertheless, the use of these manifolds is primarily restricted to image/video
based applications. Their strength in general multi-view graph and spectral clustering
applications is yet to be fully explored.
The MiMIC algorithm proposed in Chapter 6 uses Stiefel manifold to extract spectral
embeddings of joint and individual views for clustering. However, the graphs constructed
from inherently noisy real-life views may not be ideal to extract the best fit cluster structure
of the data set. Although the algorithm proposed in Chapter 6 extracts both consistent
information of fused graph and complementary information of individual graphs, it does
not address the issue of graph refinement. Furthermore, the spectral embeddings extracted
by the algorithm are sensitive to the choice of basis. So, it does not take into account
the intrinsic geometry of the solution space. Simultaneous optimization of the individual
graph structures, their fusion weights, and the joint and individual subspaces, is likely to
give a more comprehensive idea of the clusters present in the data set.
In this regard, the current chapter presents a manifold based multi-view clustering
algorithm, termed as GeARS (Geometry Aware Riemannian Spectral clustering). The
proposed algorithm harnesses the geometry and structure preserving properties of Grass-
mannian and SPD manifolds for efficient multi-view clustering. It optimizes the spectral
clustering objective separately on de-noised approximations of joint and individual views
to extract the shared as well as view-specific complementary cluster structures. To impose
consistency between the clustering in different views, it also minimizes the distance be-
tween the cluster solutions of joint and individual views, as well as that between pairwise
individual views. The optimization is performed using a gradient based line-search that
alternates between the SPD and Grassmannian manifolds. The SPD manifold is used to
optimize the graph Laplacians corresponding to the individual views while preserving their
symmetry, positive definiteness, and related properties. The Grassmannian manifold,
on the other hand, is used to optimize and reduce the disagreement between different
clustering subspaces. Grassmannian modeling additionally enforces the clustering solutions to
be basis invariant cluster indicator subspaces. The basis invariance property takes into
account the geometry of the space and maps multiple orthonormal cluster indicators spanning
the same subspace into a single solution, as they are merely rotations of each other, which
do not essentially change the cluster structure conveyed by the subspace. The graph
weights are also optimized at each iteration of the algorithm to obtain the optimal com-
bination of the views. The asymptotic convergence behavior of the proposed algorithm
is studied to obtain an upper bound that quantifies how fast the algorithm converges to
a local optimal solution. The matrix perturbation theory is used to theoretically bound
the disagreement or Grassmannian distance between the joint and individual subspaces at

any given iteration of the proposed algorithm. The disagreement is empirically shown to
decrease as the algorithm progresses and converges to a local minimum. The multi-view
clustering performance of the GeARS algorithm is extensively studied and compared with
that of existing ones on diverse benchmark data sets. Its application in cancer subtype
identification from multiple omics data types is also established.
The rest of the chapter is organized as follows: Section 7.2 presents the proposed model
of multi-view data integration and clustering. Section 7.3 introduces the proposed line-
search optimization technique over the Grassmannian and SPD manifolds and the proposed
GeARS algorithm, and analyzes its convergence behavior. In Section 7.4, an upper bound
on the Grassmannian distance between the joint and the individual subspaces is derived
using matrix perturbation theory. Case studies on different multi-view benchmark data
sets and multi-omics cancer data sets, along with a comparative performance analysis with
existing approaches, are presented in Section 7.5. Concluding remarks are provided in
Section 7.6.

7.2 GeARS: Proposed Method


A multi-view data set is a collection of $M\ (\geq 2)$ views for a common set of $n$ samples,
$\{x_i\}_{i=1}^{n}$. Each view is usually represented by a matrix $X_m \in \Re^{n \times d_m}$, for $m = 1, \ldots, M$,
consisting of $d_m$-dimensional observations from the $m$-th data source for the common $n$
samples. The view $X_m$ can be encoded as an $n$-node similarity graph $G_m$ whose vertices
represent the samples and edges represent the pairwise similarities between the samples.
Let its affinity matrix be given by $W_m = [w_m(i,j)]_{n \times n}$. Its $(i,j)$-th element $w_m(i,j) \geq 0$
represents the affinity between samples $x_i$ and $x_j$ in view $X_m$. Given affinity $W_m$, the
degree matrix $D_m$ represents the total affinity at each vertex of the graph. It is given
by $D_m = \mathrm{diag}(\bar{d}^m_1, \ldots, \bar{d}^m_i, \ldots, \bar{d}^m_n)$, where $\bar{d}^m_i = \sum_{j=1}^{n} w_m(i,j)$. The shifted normalized
Laplacian of graph $G_m$, as defined in Chapter 5, is given by

$$L_m = I_n + D_m^{-1/2} W_m D_m^{-1/2}, \tag{7.1}$$

where $I_n$ denotes the $(n \times n)$ identity matrix. The advantage of the shifted Laplacian over
the conventional definition [45] of $I_n - D_m^{-1/2} W_m D_m^{-1/2}$ is that it merges the best rank
$k$ approximation of $L_m$ as well as its cluster information into the same eigenspace. The
spectral clustering problem in terms of $L_m$ is a negative trace minimization problem given
by

$$\underset{U_m \in \Re^{n \times k}}{\text{minimize}} \; -\mathrm{tr}\left(U_m^T L_m U_m\right) \quad \text{such that } U_m^T U_m = I_k, \tag{7.2}$$

where $\mathrm{tr}(\cdot)$ denotes the matrix trace function. The Laplacian and its spectrum provide
insight into the edge-connectivity of the graph. This connectivity differs in each network,
thus conveying varying cluster information. A truly integrative approach should (i) capture
the joint or consistent clustering across different views while preserving the complementary
cluster pattern of each view, (ii) refine the connectivity of the graphs based on joint
and individual cluster structures, (iii) automatically estimate the weight or contribution of
individual views during construction of the joint view, and (iv) be resilient to noise and
heterogeneity of the high-dimensional views. A manifold based multi-view clustering approach
that captures all these properties is described next. The term ‘Laplacian’ in the following
sections refers to its shifted normalized variant $L$ as defined in (7.1), unless explicitly
specified.
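
The construction in (7.1) and the trace problem in (7.2) can be illustrated with a small NumPy sketch. It assumes a symmetric affinity matrix in which every vertex has positive degree, and it uses the standard fact that the trace objective under orthonormality constraints is minimized by the eigenvectors of the k largest eigenvalues. The function names and the toy affinity matrix are illustrative choices, not taken from the thesis.

```python
import numpy as np


def shifted_normalized_laplacian(W):
    """Return L = I + D^{-1/2} W D^{-1/2} for a symmetric affinity matrix W, as in (7.1)."""
    W = np.asarray(W, dtype=float)
    degrees = W.sum(axis=1)                 # total affinity at each vertex
    inv_sqrt = 1.0 / np.sqrt(degrees)       # assumes every vertex has positive degree
    normalized = W * inv_sqrt[:, None] * inv_sqrt[None, :]
    return np.eye(W.shape[0]) + normalized


def spectral_indicator(L, k):
    """Solve (7.2): return the k eigenvectors of L with the largest eigenvalues."""
    eigenvalues, eigenvectors = np.linalg.eigh(L)   # eigenvalues in ascending order
    return eigenvectors[:, -k:][:, ::-1]            # reorder so the largest comes first


# Toy affinity (hypothetical): two groups of three samples,
# strong affinity within a group and weak affinity across groups.
W = np.full((6, 6), 0.05)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0

L = shifted_normalized_laplacian(W)
U = spectral_indicator(L, k=2)
```

In this toy case the second column of `U` takes one sign on the first three samples and the opposite sign on the last three, so the two groups are separated by the embedding.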

7.2.1 Geometry Aware Multi-View Integration


The solution $U_m$ to the spectral clustering problem in (7.2) gives the best fit cluster
indicator for view $X_m$ given Laplacian $L_m$. Solving this for each of the $M$ views gives $M$
different cluster indicators that preserve the complementary cluster structure of different
views. The shared or joint cluster structure, on the other hand, can be obtained by
constructing a joint view and then solving its corresponding spectral clustering problem. The
joint view can be constructed by integrating the individual Laplacians using a convex
combination with weights proportional to the separability of the clusters in the views. However,
the Laplacians constructed from the high-dimensional real-world views invariably contain
noise, which gets reflected in the joint view during the combination process. To prevent
this noise propagation, Chapter 5 proposes to combine de-noised approximations of the
Laplacians. Specifically, let the best rank $r$ approximation of Laplacian $L_m$ be given by its
eigenvalue decomposition as follows:

$$L_m^r = V_m^r \Sigma_m^r (V_m^r)^T, \tag{7.3}$$

where $V_m^r \in \Re^{n \times r}$ contains, in its columns, the eigenvectors of $L_m$ corresponding to its $r$
largest eigenvalues and $\Sigma_m^r$ is a diagonal matrix consisting of the corresponding $r$ largest
non-zero eigenvalues. Conventionally, spectral clustering uses the eigenvectors of $L_m$
corresponding to its $k$ largest eigenvalues to obtain a clustering of view $X_m$. However, during
Laplacian approximation, the rank $r$ in (7.3) is considered to be greater than or equal to $k$,
the number of clusters, in order to capture more information from each view. To make the
integration step resilient to noise, the “approximate” Laplacians $L_m^r$ are integrated in the
weighted combination, as opposed to the “full-rank” Laplacians $L_m$, as presented in Chapter 5.
The approximation tends to preserve the stronger pairwise similarities as opposed to the
weaker ones. The approximate joint Laplacian corresponding to the fused network is given by

$$L^r_{Joint} = \sum_{m=1}^{M} \alpha_m L_m^r, \quad \text{such that } \alpha_m \geq 0 \text{ and } \sum_{m=1}^{M} \alpha_m = 1. \tag{7.4}$$

The above approximation automatically prevents the noise in the $(n - r)$ least significant
eigenpairs of the individual $L_m$'s from propagating into the joint network. Solving the
spectral clustering problem on $L^r_{Joint}$ gives a joint cluster indicator, say $U_{Joint}$, that captures
the consistent clustering across different views.
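
The rank-r approximation of (7.3) and the convex combination of (7.4) can be sketched as follows. This is a minimal illustration under stated assumptions: the weights are supplied by the caller (how they are chosen is not modeled here), the input matrices are random symmetric positive semi-definite stand-ins rather than Laplacians of real views, and the function names are hypothetical.

```python
import numpy as np


def rank_r_approximation(L, r):
    """Best rank-r approximation of a symmetric PSD matrix L, following (7.3)."""
    eigenvalues, eigenvectors = np.linalg.eigh(L)   # eigenvalues in ascending order
    V = eigenvectors[:, -r:]                        # eigenvectors of the r largest eigenvalues
    S = eigenvalues[-r:]
    return (V * S) @ V.T                            # V diag(S) V^T


def joint_laplacian(laplacians, alphas, r):
    """Convex combination of the rank-r approximations, following (7.4)."""
    alphas = np.asarray(alphas, dtype=float)
    if np.any(alphas < 0) or not np.isclose(alphas.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(a * rank_r_approximation(L, r) for a, L in zip(alphas, laplacians))


# Toy stand-ins for two view-specific Laplacians (hypothetical data).
rng = np.random.default_rng(0)
views = []
for _ in range(2):
    A = rng.standard_normal((5, 5))
    views.append(A @ A.T)                           # symmetric positive semi-definite

L_joint = joint_laplacian(views, alphas=[0.7, 0.3], r=2)
```

Each approximate Laplacian has rank r, so the joint matrix built from M views has rank at most M times r; the discarded eigenpairs never enter the combination.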
The spectral clustering solutions $U_{Joint}$ and $U_m$ are all $(n \times k)$ orthonormal matrices,
which are treated as projections of the $n$ points onto some $k$-dimensional subspace. The $k$
columns act as a set of $k$ orthonormal basis vectors for the corresponding subspace. However,
any $k$-dimensional subspace can be represented by an infinite number of orthonormal
bases. A change in the orthonormal basis for the same subspace amounts to a linear
transformation of the projected points which does not essentially change the cluster structure

reflected in that subspace. Figure 7.1 shows an illustrative example in two dimensions.

Figure 7.1: Effect of basis rotation on the cluster structure of a data set. (a) Axes aligned basis. (b) Rotated basis.

The data points in Figure 7.1(b) are rotations of those in Figure 7.1(a). In Figure 7.1(a),
the basis is aligned with the standard $x$-$y$ axes, while that in Figure 7.1(b) is rotated by
some angle $\theta$. However, the geometry of the data points in Figures 7.1(a) and 7.1(b) shows
that this rotation does not change the cluster structure of the data set. This motivates
the search for basis invariant solutions. The basis invariance property implies that the
solution is a cluster indicator subspace instead of a representative cluster indicator matrix.
Indicator subspaces are obtained by optimizing the spectral clustering objective in (7.2)
over $\mathrm{span}(U_m)$, as opposed to a particular $U_m$, where $\mathrm{span}(A)$ denotes the linear subspace
spanned by the columns of matrix $A$. Optimization over the column space, $\mathrm{span}(U_m)$,
restricts the search space to be a Riemannian quotient space, known as the Grassmannian
manifold [3], defined by

$$Gr(n,k) := \{\mathrm{span}(U) \mid U \in \Re^{n \times k},\ U^T U = I_k\}. \tag{7.5}$$

The Grassmannian manifold, $Gr(n,k)$, with integers $n \geq k > 0$, is the space formed
by all $k$-dimensional linear subspaces embedded in the $n$-dimensional Euclidean space.
A point on $Gr(n,k)$ is represented by any orthonormal basis for a subspace. Clearly, the
choice of representative basis is not unique. Hence, a Grassmannian point is an equivalence
class $[U]$ of the set of all orthonormal matrices whose columns span the same subspace as
those of $U$. For any matrix $U \in \Re^{n \times k}$, its column span is rotation invariant, that is,
$\mathrm{span}(U) = \mathrm{span}(UR)$ for any $R \in O(k)$, where $O(k)$ is the set of $k \times k$ orthogonal rotation
matrices. Hence, the equivalence classes of the Grassmannian manifold are obtained by the
action of $(k \times k)$ orthogonal rotation matrices over the set of $(n \times k)$ orthonormal matrices,
denoted by

$$Gr(n,k) := \{U \in \Re^{n \times k} \mid U^T U = I_k\} / O(k). \tag{7.6}$$
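
The equivalence-class view in (7.6) can be checked numerically. The sketch below assumes random orthonormal matrices obtained from QR factorizations (an illustrative choice) and compares the orthogonal projectors onto the column spans, which coincide exactly when the two bases span the same subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3

# An orthonormal basis U (n x k) from the QR factorization of a random matrix.
U, _ = np.linalg.qr(rng.standard_normal((n, k)))
# A random k x k orthogonal matrix R, an element of O(k).
R, _ = np.linalg.qr(rng.standard_normal((k, k)))

UR = U @ R                       # a different orthonormal basis for the same subspace
P_U = U @ U.T                    # orthogonal projector onto span(U)
P_UR = UR @ UR.T                 # orthogonal projector onto span(UR)

same_subspace = np.allclose(P_U, P_UR)        # the projectors agree
different_basis = not np.allclose(U, UR)      # the representative matrices differ
```

Both `U` and `UR` are therefore representatives of the same Grassmannian point, even though they differ as matrices.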

The curved surface in Figure 7.2 shows the Grassmannian manifold, while the points on
the manifold, marked in black, are linear subspaces represented by rectangular planes. For
instance, $\mathrm{span}(U_1)$ in Figure 7.2 is a Grassmann point and the equivalence class $[U_1]$ consists
of all orthonormal matrices whose columns span the same subspace as those of $U_1$ (denoted
by points $U_{1a}$, $U_{1b}$, $U_{1c}$, and $U_{1d}$ in Figure 7.2). The quotient geometry of the Grassmannian
manifold in (7.6) enables the search for subspaces as opposed to representative matrices.

Figure 7.2: The Grassmannian manifold.
In the proposed formulation, the spectral clustering problem of (7.2) is solved for the
approximate joint view as well as the approximate individual ones, in order to obtain the global
clustering while preserving the complementary cluster patterns of individual views.
Furthermore, to incorporate geometry awareness, the solutions are made basis invariant by
considering the search space to be the Grassmannian manifold $Gr(n,k)$ of $k$-dimensional
subspaces, as opposed to representative matrices in the Euclidean space, $\Re^{n \times k}$. This
reduces the problem to an unconstrained optimization over the Grassmannian manifold as
the orthonormality constraints on indicator subspaces are inherently incorporated into the
manifold structure. The problem is given by

$$\underset{\mathrm{span}(U_{Joint}),\, \mathrm{span}(U_m) \in Gr(n,k)}{\text{minimize}} \; -\mathrm{tr}\left(U_{Joint}^T L^r_{Joint} U_{Joint}\right) - \frac{1}{M} \sum_{m=1}^{M} \mathrm{tr}\left(U_m^T L_m^r U_m\right). \tag{7.7}$$

In this optimization framework, the $k$-dimensional linear subspaces, $\mathrm{span}(U_{Joint})$ and $\mathrm{span}(U_m)$,
simply reduce to Grassmannian points. To impose consistency between the global clustering
and the clustering reflected in different views, the Grassmannian distance between the joint
subspace and each of the individual subspaces, as well as that between pairwise individual
subspaces, is minimized. The distance between two Grassmann points can be computed
in terms of the $k$ principal angles $\{\theta_i\}_{i=1}^{k}$ between the subspaces [280]. The projection
distance between points $\mathrm{span}(U_{Joint})$ and $\mathrm{span}(U_m)$ is given by

$$d_\theta(U_{Joint}, U_m) = \sum_{i=1}^{k} \sin^2 \theta_i^m = \frac{1}{2} \left\| U_{Joint} U_{Joint}^T - U_m U_m^T \right\|_F^2, \tag{7.8}$$

where $\theta_i^m$ denotes the $i$-th largest principal angle between the corresponding subspaces. This
distance can be reduced to

$$d_\theta(U_{Joint}, U_m) = k - \mathrm{tr}\left(U_{Joint} U_{Joint}^T U_m U_m^T\right). \tag{7.9}$$
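
The projection distance can be checked numerically with a short sketch. It relies on the standard fact that the cosines of the principal angles between two subspaces are the singular values of the product of their orthonormal bases, and it compares the sum of squared sines with the trace form of (7.9) and with half the squared Frobenius norm of the projector difference (the factor of one half is what makes the three quantities coincide). The random subspaces and function names are illustrative assumptions.

```python
import numpy as np


def principal_angles(U1, U2):
    """Principal angles between span(U1) and span(U2) for orthonormal U1, U2."""
    cosines = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))


def projection_distance(U1, U2):
    """Trace form of the projection distance, as in (7.9)."""
    k = U1.shape[1]
    return k - np.trace(U1 @ U1.T @ U2 @ U2.T)


# Two random 3-dimensional subspaces of a 10-dimensional space (hypothetical data).
rng = np.random.default_rng(1)
U1, _ = np.linalg.qr(rng.standard_normal((10, 3)))
U2, _ = np.linalg.qr(rng.standard_normal((10, 3)))

d_trace = projection_distance(U1, U2)
d_angles = np.sum(np.sin(principal_angles(U1, U2)) ** 2)
d_frobenius = 0.5 * np.linalg.norm(U1 @ U1.T - U2 @ U2.T, "fro") ** 2
```

The trace form is the cheapest to evaluate inside an optimization loop, since it avoids both the singular value decomposition and the explicit difference of two projectors.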

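Incorporating the individual and pairwise distance minimization terms, the optimization
problem becomes

$$\underset{\mathrm{span}(U_{Joint}),\, \mathrm{span}(U_m) \in Gr(n,k)}{\text{minimize}} \; f(U_{Joint}, U_1, \ldots, U_M) = -\frac{1}{2} \mathrm{tr}\left(U_{Joint}^T L^r_{Joint} U_{Joint}\right) + \frac{1}{2M(M-1)} \sum_{\substack{i,j=1 \\ i \neq j}}^{M} d_\theta(U_i, U_j) + \frac{1}{2M} \sum_{m=1}^{M} \left[ -\mathrm{tr}\left(U_m^T L_m^r U_m\right) + d_\theta(U_{Joint}, U_m) \right]. \tag{7.10}$$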

In the above problem, only the cluster indicator subspaces are optimized. However, the
connectivity of the individual graphs also plays a crucial role in determining the global
and view-specific clustering of the data set. Therefore, for better understanding of the true
nature of the data set, it is essential to use the cluster information of indicator subspaces
to modify the graph connectivity, as well as use the modified graph connections to update
the cluster assignments.
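
As a worked step, substituting the trace form (7.9) into the two distance terms of (7.10) separates a constant from the trace terms. The derivation below only uses the fact that there are M(M-1) ordered pairs with i ≠ j; it is a sketch of the algebra, not a reproduction of a derivation given in the thesis.

```latex
\begin{align*}
\frac{1}{2M(M-1)} \sum_{\substack{i,j=1 \\ i \neq j}}^{M} d_\theta(U_i, U_j)
  &= \frac{k}{2} - \frac{1}{2M(M-1)} \sum_{\substack{i,j=1 \\ i \neq j}}^{M}
     \mathrm{tr}\left(U_i U_i^T U_j U_j^T\right), \\
\frac{1}{2M} \sum_{m=1}^{M} d_\theta(U_{Joint}, U_m)
  &= \frac{k}{2} - \frac{1}{2M} \sum_{m=1}^{M}
     \mathrm{tr}\left(U_{Joint} U_{Joint}^T U_m U_m^T\right).
\end{align*}
```

Hence the objective in (7.10) equals the constant k plus a sum of negative trace terms, so minimizing it amounts to maximizing those traces, and the constant can be dropped without changing the minimizer.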

7.2.2 Updation of Graph Connectivity


The graph Laplacian $L_m$ in (7.1) contains the pairwise connectivity information of the
similarity matrix $W_m$, while its spectrum contains the graph partition information. This
information is pivotal in identifying the clusters in the data set. The Laplacian $L_m$ is
symmetric and positive semi-definite with $n$ eigenvalues in $[0, 2]$. Its approximation, $L_m^r$,
is constructed using the $r$ largest non-zero eigenvalues and corresponding eigenvectors.
Hence, $L_m^r$ is symmetric and positive definite (SPD). Due to this property, the approximate
Laplacians become elements of the Riemannian manifold of symmetric and positive definite
matrices, known as the SPD manifold [74], which is defined as follows:

$$S^n_{++} = \{A \in \Re^{n \times n} \mid A = A^T \text{ and } v^T A v > 0 \text{ for all } v \in \Re^n,\ v \neq 0\}.$$

In order to update the similarity graphs based on the information present in the cluster
indicator subspaces, the approximate Laplacians $L_m^r$ are optimized over the SPD manifold,
along with the subspaces $\mathrm{span}(U_{Joint})$ and $\mathrm{span}(U_m)$, which are optimized over the
Grassmannian manifold. Modification of the Laplacians changes their edge connectivity, which
in turn changes their inherent cluster structure. Accordingly, the contribution of the
individual graphs in the fused network should also change. This motivates the optimization of
the graph weights $\alpha_m$ of (7.4) as well. Incorporating these variables, the final optimization
problem over two different manifolds is given by

$$
\begin{aligned}
\underset{\substack{\mathrm{span}(U_{Joint}),\ \mathrm{span}(U_m) \in \mathrm{Gr}(n,k) \\ L_m^r \in S_{++}^n,\ \alpha_m \in \Re}}{\text{minimize}} \quad
& -\frac{1}{2}\,\mathrm{tr}\left( U_{Joint}^T \Big( \sum_{m=1}^{M} \alpha_m^{\kappa} L_m^r \Big) U_{Joint} \right)
- \frac{1}{2M} \sum_{m=1}^{M} \left[ \mathrm{tr}\left( U_m^T L_m^r U_m \right) + \mathrm{tr}\left( U_{Joint} U_{Joint}^T U_m U_m^T \right) \right] \\
& - \frac{1}{2M(M-1)} \sum_{\substack{i,j=1 \\ i \neq j}}^{M} \mathrm{tr}\left( U_i U_i^T U_j U_j^T \right), \\
\text{such that} \quad & \alpha_m \geq 0, \quad \sum_{m=1}^{M} \alpha_m = 1.
\end{aligned}
\tag{7.11}
$$

In the above problem, while combining the Laplacians $L_m^r$, their weights $\alpha_m$ are
raised to the power $\kappa$ with $\kappa > 1$, as $\kappa = 1$ favors the trivial solution in which
the single view $v$ with minimum loss has $\alpha_v = 1$ and all other weights are 0. The problem
in (7.11) is solved iteratively by alternating optimization over the Grassmannian manifold, the
SPD manifold, and the real-valued space. Note that $U_{Joint}$ and $U_m$ are optimized over the
Grassmannian manifold, which is a quotient space formed by the action of $(k \times k)$ orthogonal
rotation matrices. Hence, for a set of given $L_1^r, \ldots, L_M^r$ and $\alpha_1, \ldots, \alpha_M$, the
objective in (7.10) should be rotation invariant in terms of $U_{Joint}$ and $U_m$.

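Theorem 7.1. $f(U_{Joint}, U_1, \ldots, U_M)$ is rotation invariant.

Proof. Let $R_{Joint}, R_1, \ldots, R_M \in O(k)$, the set of $(k \times k)$ orthogonal rotation matrices. So,
they satisfy

$$
R_{Joint}^T R_{Joint} = I_k \quad \text{and} \quad R_m^T R_m = I_k, \quad \text{for } m = 1, \ldots, M.
$$

Using the cyclic property of the trace,

$$
\mathrm{tr}\left( (U_m R_m)^T L_m^r (U_m R_m) \right) = \mathrm{tr}\left( U_m^T L_m^r U_m R_m R_m^T \right) = \mathrm{tr}\left( U_m^T L_m^r U_m \right). \tag{7.12}
$$

$$
\text{Similarly,} \quad \mathrm{tr}\left( (U_{Joint} R_{Joint})^T L_{Joint}^r (U_{Joint} R_{Joint}) \right) = \mathrm{tr}\left( U_{Joint}^T L_{Joint}^r U_{Joint} \right). \tag{7.13}
$$

$$
\text{Also,} \quad U_m R_m (U_m R_m)^T = U_m U_m^T \quad \text{and} \quad U_{Joint} R_{Joint} (U_{Joint} R_{Joint})^T = U_{Joint} U_{Joint}^T. \tag{7.14}
$$

Substituting (7.12), (7.13), and (7.14) in the function f of (7.10) gives

$$
f(U_{Joint} R_{Joint}, U_1 R_1, \ldots, U_M R_M) = f(U_{Joint}, U_1, \ldots, U_M).
$$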
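
The two identities used in the proof, invariance of the trace term and of the projector $U U^T$ under a right rotation, can be verified on random data. This is an illustrative sketch only, assuming NumPy; the matrix `L` below is a random symmetric stand-in and not an actual graph Laplacian.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 7, 3

U, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal basis of a subspace
R, _ = np.linalg.qr(rng.standard_normal((k, k)))   # random (k x k) orthogonal matrix
A = rng.standard_normal((n, n))
L = A @ A.T                                        # symmetric stand-in for a Laplacian

# Trace term before and after replacing U by U R.
trace_before = np.trace(U.T @ L @ U)
trace_after = np.trace((U @ R).T @ L @ (U @ R))

# Projector onto the subspace before and after the rotation.
proj_before = U @ U.T
proj_after = (U @ R) @ (U @ R).T
```

Since every term of the objective depends on the subspaces only through such traces and projectors, the whole function is unchanged by the choice of basis.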

7.3 Optimization Strategy
The manifold based formulation in (7.11) has two major advantages. First, manifolds are
expected to better capture the underlying non-linear geometry of complex real-world data
sets. The other advantage is that optimization over non-linear manifolds like Grassmannian
and SPD manifolds does not require vector space assumptions on the search space. The
gradient descent algorithm, which adds a multiple of descent direction to the previous
iterate, obviously requires the structure of a vector space and is not possible on general
manifolds. Optimization over manifolds is performed using line search [3] which substitutes
the standard linear step in gradient descent by more general paths based on retractions [3].
It proceeds by projecting the negative gradient onto the tangent space of the manifold. The
tangent space is essentially a vector space, which allows linear movement. Hence, a linear
step is taken in the tangent space from the current iterate towards the projected gradient.
Finally, retraction maps the updated point from the tangent space back to the manifold. Let f
denote the proposed joint objective function in (7.11), and $U_{Joint}^{(0)}$, $U_m^{(0)}$, $L_m^{r(0)}$, and
$\alpha_m^{(0)}$, for $m \in \{1, \ldots, M\}$, denote the initial iterates for the respective variables. The
optimization of (7.11) based on line-search is performed as follows.

7.3.1 Optimization over Grassmannian Manifold


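Given a set of fixed $U_m$, $L_m^r$, and $\alpha_m$, for $m = 1, \ldots, M$, and the t-th iterate of
$U_{Joint}$, denoted by $U_{Joint}^{(t)}$, let $\mathcal{U} = \sum_{m=1}^{M} U_m U_m^T$. The negative gradient
of f in (7.11) at $U_{Joint}^{(t)}$ is given by

$$
\begin{aligned}
-\nabla_{U_{Joint}^{(t)}} f
&= -\nabla_{U_{Joint}^{(t)}} \left[ -\frac{1}{2} \left( \mathrm{tr}\Big( U_{Joint}^T \Big( \sum_{m=1}^{M} \alpha_m^{\kappa} L_m^r \Big) U_{Joint} \Big) + \mathrm{tr}\Big( U_{Joint}^T \Big( \sum_{m=1}^{M} U_m U_m^T \Big) U_{Joint} \Big) \right) \right] \\
&= \Big( \sum_{m=1}^{M} \alpha_m^{\kappa} L_m^r + \mathcal{U} \Big) U_{Joint}^{(t)} = Q_{Joint}^{(t)} \ \text{(say)}.
\end{aligned}
\tag{7.15}
$$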
In the notation of the Grassmannian manifold, $\mathrm{Gr}(n, k)$, the parameters n and k are always
fixed and are dropped for notational simplicity. Let $T_{U_{Joint}^{(t)}} \mathrm{Gr}$ denote the tangent space
of the Grassmannian manifold. The subscript $U_{Joint}^{(t)}$ implies that the tangent space has
its origin at the Grassmannian point $\mathrm{span}(U_{Joint}^{(t)})$. Let $\Pi_X(Y)$ denote the projection of Y
onto the tangent space of a manifold rooted at point X. The negative gradient $Q_{Joint}^{(t)}$ is
orthogonally projected on the tangent space $T_{U_{Joint}^{(t)}} \mathrm{Gr}$ as follows [3]:

$$
\Pi_{U_{Joint}^{(t)}}\left( Q_{Joint}^{(t)} \right) = \left( I_n - U_{Joint}^{(t)} \big( U_{Joint}^{(t)} \big)^T \right) Q_{Joint}^{(t)} = Z_{Joint}^{(t)} \ \text{(say)}. \tag{7.16}
$$

Given the tangent $Z_{Joint}^{(t)} \in T_{U_{Joint}^{(t)}} \mathrm{Gr}$ and step size $\eta_G > 0$, a linear step is taken within the
tangent space from $U_{Joint}^{(t)}$ in the direction of $Z_{Joint}^{(t)}$ as follows:

$$
Z_{Joint}^{(t+1)} = U_{Joint}^{(t)} + \eta_G Z_{Joint}^{(t)}. \tag{7.17}
$$

Then the point obtained in (7.17) is mapped from the tangent space back to the manifold.
This is done by projective retraction, denoted by PGr, of the point $Z_{Joint}^{(t+1)} \in T_{U_{Joint}^{(t)}} \mathrm{Gr}$ back
to the manifold Gr. For the Grassmannian manifold, retraction is performed using the singular
value decomposition (SVD) of the representative matrix $Z_{Joint}^{(t+1)}$ [3]. Let the SVD of $Z_{Joint}^{(t+1)}$
be given by

$$
Z_{Joint}^{(t+1)} = E_{Joint}^{(t+1)} \, \Xi_{Joint}^{(t+1)} \, \big( V_{Joint}^{(t+1)} \big)^T,
$$

where $E_{Joint}^{(t+1)}$ and $V_{Joint}^{(t+1)}$ are orthonormal matrices of left and right singular vectors of
$Z_{Joint}^{(t+1)}$, respectively, while $\Xi_{Joint}^{(t+1)}$ contains the singular values on its diagonal. The retractive
projection is given by

$$
\mathrm{PGr}_{U_{Joint}^{(t)}}\left( Z_{Joint}^{(t+1)} \right) = \mathrm{span}\left( E_{Joint}^{(t+1)} \big( V_{Joint}^{(t+1)} \big)^T \right). \tag{7.18}
$$

The point obtained after retraction in (7.18) becomes the next iterate of $U_{Joint}$, that is

$$
\mathrm{span}\left( U_{Joint}^{(t+1)} \right) = \mathrm{PGr}_{U_{Joint}^{(t)}}\left( Z_{Joint}^{(t+1)} \right). \tag{7.19}
$$

Figure 7.3: Optimization of $U_{Joint}$ over the Grassmannian manifold.

Figure 7.3 shows the diagrammatic representation of a single step of line-search optimization
over the Grassmannian manifold. In Figure 7.3, the curved surface denotes the manifold.
The transparent striped plane denotes the tangent space of the manifold, rooted at the
current iterate $\mathrm{span}(U_{Joint}^{(t)})$, denoted by the point lying at the intersection of the plane
and the manifold. That point is actually a k-dimensional linear subspace denoted by the
shaded plane connected to the point using dotted lines. The vector pointing outwards from
the tangent plane is the negative gradient direction, and its perpendicular projection $Z_{Joint}^{(t)}$
lies on the tangent plane. A small step is taken in the tangent plane in the direction of
the projected gradient, marked by the horizontal dotted line in Figure 7.3. The obtained
point is then retracted back to the manifold. The retracted point $\mathrm{span}(U_{Joint}^{(t+1)})$ lies on
the curved surface. It is also a linear subspace denoted by the second shaded plane. The
following theorem proves that the next iterate obtained by retraction in (7.19) belongs to
the Grassmannian manifold.

Theorem 7.2. $\mathrm{span}(U_{Joint}^{(t+1)})$ belongs to the Grassmannian manifold.

Proof. According to (7.5), for $\mathrm{span}(U_{Joint}^{(t+1)})$ to belong to the Grassmannian manifold, $U_{Joint}^{(t+1)}$
should be orthonormal. From (7.18) and (7.19) we get

$$
\big( U_{Joint}^{(t+1)} \big)^T U_{Joint}^{(t+1)} = V_{Joint}^{(t+1)} \big( E_{Joint}^{(t+1)} \big)^T E_{Joint}^{(t+1)} \big( V_{Joint}^{(t+1)} \big)^T = I_k,
$$

as $E_{Joint}^{(t+1)}$ and $V_{Joint}^{(t+1)}$ contain the left and right singular vectors of $Z_{Joint}^{(t+1)}$, respectively, and are
therefore orthonormal. Hence, $U_{Joint}^{(t+1)}$ is also orthonormal.

The algorithm for computing a single iteration of UJoint over the Grassmannian manifold
is given in Algorithm 7.1.

Algorithm 7.1 Optimize_UJoint

▷ Optimization of $U_{Joint}$ over Grassmannian manifold $\mathrm{Gr}(n, k)$
Input: Cluster indicator subspaces $U_m$ and weights $\alpha_m$, for $m = 1, \ldots, M$, joint Laplacian
$L_{Joint}^r$, joint subspace $U_{Joint}^{(t)}$ of iteration t, step size $\eta_G > 0$.
Output: $U_{Joint}^{(t+1)}$.
1: Compute negative gradient $Q_{Joint}^{(t)} \leftarrow -\nabla_{U_{Joint}^{(t)}} f$ using (7.15).
2: Project negative gradient onto tangent space:
   $Z_{Joint}^{(t)} \leftarrow \Pi_{U_{Joint}^{(t)}}\big( Q_{Joint}^{(t)} \big)$ using (7.16).
3: $Z_{Joint}^{(t+1)} \leftarrow U_{Joint}^{(t)} + \eta_G Z_{Joint}^{(t)}$.
4: Find retractive projection $\mathrm{PGr}_{U_{Joint}^{(t)}}\big( Z_{Joint}^{(t+1)} \big)$ using (7.18).
5: Next iterate: $\mathrm{span}\big( U_{Joint}^{(t+1)} \big) \leftarrow \mathrm{PGr}_{U_{Joint}^{(t)}}\big( Z_{Joint}^{(t+1)} \big)$.
6: Return $U_{Joint}^{(t+1)}$.
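
The four operations of Algorithm 7.1 (gradient, tangent projection, linear step, SVD-based retraction) can be sketched in a few lines. This is a minimal illustration, assuming NumPy and a single symmetric positive semi-definite matrix `A` standing in for the matrix that multiplies $U_{Joint}^{(t)}$ in (7.15); the function name `grassmann_step` and the random test data are hypothetical, and this is not the authors' R implementation.

```python
import numpy as np

def grassmann_step(U, A, eta):
    """One retraction-based step for minimising -0.5 * tr(U^T A U) over Gr(n, k)."""
    n = U.shape[0]
    Q = A @ U                                  # negative Euclidean gradient, as in (7.15)
    Z = (np.eye(n) - U @ U.T) @ Q              # projection onto the tangent space, (7.16)
    step = U + eta * Z                         # linear step in the tangent space, (7.17)
    E, _, Vt = np.linalg.svd(step, full_matrices=False)
    return E @ Vt                              # SVD-based retraction, as in (7.18)

rng = np.random.default_rng(0)
n, k = 10, 3
B = rng.standard_normal((n, n))
A = B @ B.T / n                                # symmetric PSD stand-in matrix
U0, _ = np.linalg.qr(rng.standard_normal((n, k)))
U1 = grassmann_step(U0, A, eta=0.01)

f0 = -0.5 * np.trace(U0.T @ A @ U0)            # objective before the step
f1 = -0.5 * np.trace(U1.T @ A @ U1)            # objective after the step
```

The returned matrix has orthonormal columns by construction, so no separate re-orthogonalisation is needed, and for a small enough step size the objective does not increase.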

Similar to the joint subspace $\mathrm{span}(U_{Joint})$, the individual subspaces $\mathrm{span}(U_m)$ are also
elements of the Grassmannian manifold. For a specific $m \in \{1, \ldots, M\}$, let $U_{Joint}$, the weights
$\alpha_j$ and Laplacians $L_j^r$, $\forall j \in \{1, \ldots, M\}$, and all other $U_i$, for $i \in \{1, \ldots, M\}$ such
that $i \neq m$, be fixed. For that view $X_m$, let $U_m^{(t)}$ denote the representative matrix corresponding to
the cluster indicator subspace obtained at iteration t. The negative gradient of f at point $U_m^{(t)}$
is given by

$$
\begin{aligned}
-\nabla_{U_m^{(t)}} f
&= -\nabla_{U_m^{(t)}} \left[ -\frac{1}{2}\,\mathrm{tr}\left( U_m^T \Big( L_m^r + U_{Joint} U_{Joint}^T + \sum_{\substack{j=1 \\ j \neq m}}^{M} U_j U_j^T \Big) U_m \right) \right] \\
&= \Big( L_m^r + U_{Joint} U_{Joint}^T + \sum_{\substack{j=1 \\ j \neq m}}^{M} U_j U_j^T \Big) U_m^{(t)} = Q_m^{(t)} \ \text{(say)}.
\end{aligned}
\tag{7.20}
$$

Similar to (7.16), the projection of the negative gradient onto the tangent space of $\mathrm{span}(U_m^{(t)})$ is given
by

$$
\Pi_{U_m^{(t)}}\left( Q_m^{(t)} \right) = \left( I_n - U_m^{(t)} \big( U_m^{(t)} \big)^T \right) Q_m^{(t)} = Z_m^{(t)} \ \text{(say)}. \tag{7.21}
$$

Then, a linear step in the tangent space in the direction of the projected negative gradient is
taken as follows:

$$
Z_m^{(t+1)} = U_m^{(t)} + \eta_G Z_m^{(t)}.
$$

After the linear step, retractive projection maps the obtained point $Z_m^{(t+1)}$ back to the
manifold, as follows:

$$
\mathrm{PGr}_{U_m^{(t)}}\left( Z_m^{(t+1)} \right) = \mathrm{span}\left( E_m^{(t+1)} \big( V_m^{(t+1)} \big)^T \right), \tag{7.22}
$$

where $E_m^{(t+1)}$ and $V_m^{(t+1)}$ are orthonormal matrices containing the left and right singular vectors
of $Z_m^{(t+1)}$, respectively. As before, the retracted point in (7.22) becomes the next iterate of
$U_m$, that is,

$$
\mathrm{span}\left( U_m^{(t+1)} \right) = \mathrm{PGr}_{U_m^{(t)}}\left( Z_m^{(t+1)} \right).
$$

It follows from Theorem 7.2 that $\mathrm{span}(U_m^{(t+1)})$ belongs to the Grassmannian manifold.
The pseudocode for a single iteration of $U_m$ over the Grassmannian manifold is given in
Algorithm 7.2. The same follows for each of the M views.

Algorithm 7.2 Optimize_Um

▷ Optimization of $U_m$ over Grassmannian manifold $\mathrm{Gr}(n, k)$
Input: Joint subspace $U_{Joint}$, other individual subspaces $U_j$ for $j = 1, \ldots, M$, $j \neq m$,
Laplacian $L_m^r$, subspace $U_m^{(t)}$ of iteration t, step size $\eta_G$.
Output: $U_m^{(t+1)}$.
1: Compute negative gradient $Q_m^{(t)} \leftarrow -\nabla_{U_m^{(t)}} f$ by (7.20).
2: Project negative gradient onto tangent space:
   $Z_m^{(t)} \leftarrow \Pi_{U_m^{(t)}}\big( Q_m^{(t)} \big)$ using (7.21).
3: $Z_m^{(t+1)} \leftarrow U_m^{(t)} + \eta_G Z_m^{(t)}$.
4: Find retractive projection $\mathrm{PGr}_{U_m^{(t)}}\big( Z_m^{(t+1)} \big)$ using (7.22).
5: Next iterate: $\mathrm{span}\big( U_m^{(t+1)} \big) \leftarrow \mathrm{PGr}_{U_m^{(t)}}\big( Z_m^{(t+1)} \big)$.
6: Return $U_m^{(t+1)}$.

7.3.2 Optimization over SPD Manifold


Apart from the indicator subspaces, there are M other variables $L_m^r$, each corresponding
to the similarity graph of one of the views. These $L_m^r$ are optimized over the SPD manifold
$S_{++}^n$. For a specific m, assume that the indicator subspaces $\mathrm{span}(U_{Joint})$, $\mathrm{span}(U_j)$, weights
$\alpha_j$, for $j = 1, \ldots, M$, and the shifted Laplacians $L_i^r$ are fixed for $i = 1, \ldots, M$, and $i \neq m$.
Let $L_m^{r(t)}$ denote the t-th iterate of $L_m^r$. The negative gradient of f with respect to $L_m^{r(t)}$ is
given by

$$
\begin{aligned}
-\nabla_{L_m^{r(t)}} f
&= -\nabla_{L_m^{r(t)}} \frac{1}{2} \left[ -\alpha_m^{\kappa}\,\mathrm{tr}\left( U_{Joint}^T L_m^r U_{Joint} \right) - \mathrm{tr}\left( U_m^T L_m^r U_m \right) \right] \\
&= \left( \alpha_m^{\kappa} U_{Joint} U_{Joint}^T + U_m U_m^T \right) = Q_{L_m}^{(t)} \ \text{(say)}.
\end{aligned}
\tag{7.23}
$$

The tangent space of the manifold of SPD matrices is the set of symmetric matrices.
Therefore, the projection of $Q_{L_m}^{(t)}$ onto the tangent space of the SPD manifold is given by [74]

$$
\Pi_{L_m^{r(t)}}\left( Q_{L_m}^{(t)} \right) = L_m^{r(t)} \, \mathrm{symm}\left( Q_{L_m}^{(t)} \right) L_m^{r(t)} = Z_{L_m}^{(t)} \ \text{(say)}, \tag{7.24}
$$

where $\mathrm{symm}(A) = \frac{(A + A^T)}{2}$. The symmetrization of $Q_{L_m}^{(t)}$ is mathematically unnecessary as
its structure in (7.23) implies that it would always be symmetric. Next, a linear step is
taken in the tangent space of the SPD manifold from $L_m^{r(t)}$ towards the projected gradient $Z_{L_m}^{(t)}$
with step size $\eta_S > 0$ as follows:

$$
Z_{L_m}^{(t+1)} = L_m^{r(t)} + \eta_S Z_{L_m}^{(t)}. \tag{7.25}
$$

It can be shown from the following theorem that the obtained point $Z_{L_m}^{(t+1)}$ in the tangent
space itself belongs to the SPD manifold. This property is attributed to the form of the
negative gradient in (7.23).

Theorem 7.3. $Z_{L_m}^{(t+1)}$ belongs to the SPD manifold.

Proof. To prove that $Z_{L_m}^{(t+1)}$ belongs to the SPD manifold, we first show that the
tangent space point $Z_{L_m}^{(t)}$ is symmetric and positive semi-definite. $U_{Joint}$ and $U_m$ are
points on the Grassmannian manifold and are both orthonormal matrices of order $(n \times k)$.
So, we can write

$$
U_m U_m^T = U_m I_k U_m^T \quad \text{and} \quad U_{Joint} U_{Joint}^T = U_{Joint} I_k U_{Joint}^T.
$$

It follows from the above that both $U_m U_m^T$ and $U_{Joint} U_{Joint}^T$ have k eigenvalues equal to
1, while their remaining $(n - k)$ eigenvalues are 0. Hence, they are symmetric positive
semi-definite matrices. For any $z \neq 0$ in $\Re^n$, from (7.23), we have

$$
\begin{aligned}
& z^T Q_{L_m}^{(t)} z = z^T U_m U_m^T z + \alpha_m^{\kappa}\, z^T U_{Joint} U_{Joint}^T z, \\
& \text{where } z^T U_m U_m^T z \geq 0, \quad z^T U_{Joint} U_{Joint}^T z \geq 0, \quad \text{and } \alpha_m \geq 0. \\
& \text{Hence, } z^T Q_{L_m}^{(t)} z \geq 0.
\end{aligned}
\tag{7.26}
$$

So, the negative gradient $Q_{L_m}^{(t)}$ is symmetric positive semi-definite. Again, for any $z \neq 0$ in $\Re^n$,
from (7.24), we have that

$$
z^T Z_{L_m}^{(t)} z = z^T L_m^{r(t)} Q_{L_m}^{(t)} L_m^{r(t)} z = y^T Q_{L_m}^{(t)} y \geq 0,
\quad \text{where } y = L_m^{r(t)} z \neq 0 \Leftrightarrow z \neq 0.
$$

So, the projected gradient is also positive semi-definite. While taking the linear step in the
tangent space, the obtained point $Z_{L_m}^{(t+1)}$ in (7.25) is the sum of the symmetric positive
definite matrix $L_m^{r(t)}$ and the symmetric positive semi-definite matrix $\eta_S Z_{L_m}^{(t)}$, which is
symmetric positive definite. Therefore, $Z_{L_m}^{(t+1)}$ belongs to the SPD manifold.

As $Z_{L_m}^{(t+1)} \in S_{++}^n$, the retraction of $Z_{L_m}^{(t+1)}$ from the tangent space of $S_{++}^n$
to the manifold $S_{++}^n$ is not required. This avoids the computationally expensive matrix
exponential based retraction [74] step in the case of SPD manifold optimization. Hence, $Z_{L_m}^{(t+1)}$
itself is the next iterate of $L_m^r$, that is

$$
L_m^{r(t+1)} = Z_{L_m}^{(t+1)}.
$$

The algorithm for a single update of Lrm over the SPD manifold is given in Algorithm 7.3.
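
The retraction-free SPD update of (7.23)-(7.25) can be sketched as follows. This is a minimal illustration, assuming NumPy, a randomly generated SPD starting matrix, and arbitrarily chosen values for the weight, $\kappa$, and step size; the function names `symm` and `spd_step` are hypothetical and not taken from the GeARS source code.

```python
import numpy as np

def symm(A):
    """Symmetric part of a square matrix: (A + A^T) / 2."""
    return 0.5 * (A + A.T)

def spd_step(L, U_joint, U_m, alpha, kappa, eta):
    """One update of an SPD matrix L following (7.23)-(7.25), without retraction."""
    Q = (alpha ** kappa) * (U_joint @ U_joint.T) + U_m @ U_m.T   # negative gradient (7.23)
    Z = L @ symm(Q) @ L                                          # tangent projection (7.24)
    return L + eta * Z                                           # linear step (7.25)

rng = np.random.default_rng(0)
n, k = 8, 2
B = rng.standard_normal((n, n))
L0 = B @ B.T + 0.1 * np.eye(n)          # hypothetical SPD starting point
U_joint, _ = np.linalg.qr(rng.standard_normal((n, k)))
U_m, _ = np.linalg.qr(rng.standard_normal((n, k)))

L1 = spd_step(L0, U_joint, U_m, alpha=0.5, kappa=2.0, eta=0.001)
min_eig = np.linalg.eigvalsh(symm(L1)).min()   # smallest eigenvalue of the new iterate
```

The smallest eigenvalue of the updated matrix stays strictly positive, so the iterate remains on the SPD manifold without a matrix-exponential retraction.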

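Algorithm 7.3 Optimize_Lm

▷ Optimization of $L_m^r$ over SPD manifold
Input: Indicators $U_{Joint}$ and $U_m$, Laplacian $L_m^{r(t)}$ of iteration t, step size $\eta_S > 0$, weight
factor $\alpha_m$.
Output: $L_m^{r(t+1)}$.
1: Compute negative gradient $Q_{L_m}^{(t)} \leftarrow -\nabla_{L_m^{r(t)}} f$ using (7.23).
2: Project negative gradient onto tangent space:
   $Z_{L_m}^{(t)} \leftarrow L_m^{r(t)} \, \mathrm{symm}\big( Q_{L_m}^{(t)} \big) L_m^{r(t)}$ using (7.24).
3: Take linear step $Z_{L_m}^{(t+1)} \leftarrow L_m^{r(t)} + \eta_S Z_{L_m}^{(t)}$.
4: Next iterate: $L_m^{r(t+1)} \leftarrow Z_{L_m}^{(t+1)}$.
5: Return $L_m^{r(t+1)}$.

7.3.3 Optimization of Graph Weights


Considering all other variables fixed, the optimization problem of (7.11) in terms of the
network weights $\alpha_m$ is given by

$$
\underset{\alpha_m \in \Re}{\text{minimize}} \ \sum_{m=1}^{M} -\alpha_m^{\kappa} g_m \quad \text{such that} \quad \alpha_m \geq 0, \ \sum_{m=1}^{M} \alpha_m = 1,
\quad \text{where } g_m = \frac{1}{2}\,\mathrm{tr}\left( U_{Joint}^T L_m^r U_{Joint} \right). \tag{7.27}
$$

Ignoring the non-negativity constraints, the Lagrangian of the above problem is given by

$$
\sum_{m=1}^{M} -\alpha_m^{\kappa} g_m + \xi \Big( \sum_{m=1}^{M} \alpha_m - 1 \Big),
$$

where $\xi$ is the Lagrange multiplier. Taking the derivative of the Lagrangian with respect to
$\alpha_m$ and setting it to 0 gives

$$
-\kappa \alpha_m^{\kappa - 1} g_m + \xi = 0
\ \Rightarrow \ \alpha_m = \left( \frac{\xi}{\kappa g_m} \right)^{\frac{1}{\kappa - 1}} = \xi^{\frac{1}{\kappa - 1}} \left( \kappa g_m \right)^{\frac{1}{1 - \kappa}}. \tag{7.28}
$$

Since $\sum_{m=1}^{M} \alpha_m = 1$, we get

$$
\sum_{m=1}^{M} \xi^{\frac{1}{\kappa - 1}} \left( \kappa g_m \right)^{\frac{1}{1 - \kappa}} = 1
\ \Rightarrow \ \xi^{\frac{1}{\kappa - 1}} = \frac{1}{\sum_{j=1}^{M} \left( \kappa g_j \right)^{\frac{1}{1 - \kappa}}}.
$$

Substituting the value of $\xi$ in (7.28) gives

$$
\alpha_m = \frac{\left( g_m \right)^{\frac{1}{1 - \kappa}}}{\sum_{j=1}^{M} \left( g_j \right)^{\frac{1}{1 - \kappa}}}. \tag{7.29}
$$

In the above deduction, the non-negativity constraints on $\alpha_m$ are neglected. Nevertheless,
the positive semi-definite property of $L_m^r$ implies that $g_m \geq 0$. So, the derived expression
of $\alpha_m$ in (7.29) automatically satisfies the non-negativity constraint.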
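
The closed-form weight update in (7.29) is a normalised power of the per-view quantities $g_m$. The sketch below, in plain Python, evaluates it for made-up values of $g_m$; the function name `update_weights` and the input values are hypothetical. It also assumes every $g_m$ is strictly positive, since the exponent $1/(1-\kappa)$ is negative for $\kappa > 1$ and a zero $g_m$ would cause a division by zero.

```python
def update_weights(g, kappa):
    """Closed-form weights of (7.29): alpha_m proportional to g_m ** (1 / (1 - kappa))."""
    if kappa <= 1:
        raise ValueError("kappa must be greater than 1")
    if any(value <= 0 for value in g):
        # Assumption of this sketch: g_m > 0, so the negative power is well defined.
        raise ValueError("each g_m must be strictly positive")
    exponent = 1.0 / (1.0 - kappa)
    powered = [value ** exponent for value in g]
    total = sum(powered)
    return [p / total for p in powered]

# Illustrative values only: three views with g = (1, 2, 4) and kappa = 2.
weights = update_weights([1.0, 2.0, 4.0], kappa=2.0)
```

With $\kappa = 2$ the exponent is $-1$, so the weights are proportional to $1/g_m$ and evaluate to $(4/7, 2/7, 1/7)$ for this input; they are non-negative and sum to one, as required by the constraints of (7.27).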

7.3.4 Proposed Algorithm
Given M graphs with affinity matrices $W_1, \ldots, W_M$, corresponding to M views $X_1, \ldots, X_M$,
and rank parameter $r \geq k$, the proposed GeARS algorithm extracts a low-rank subspace
$\mathrm{span}(U_{Joint})$ that reflects the multi-view consensus clusters of the data set. Then, k-means
on the rows of $U_{Joint}$ identifies the final clusters. Similar to Chapter 6, the $L_m^r$ are
initialized to the rank r approximations of the individual graph Laplacians, while the
individual cluster indicator subspaces $U_m$ are initialized to the orthonormal matrices
containing the k largest eigenvectors of the corresponding Laplacians. The initial view weights,
$\alpha_{(m)}^{(0)}$, are determined based on the eigenvalues of the corresponding Laplacians, as done
previously in Chapters 5 and 6. The initial weights are given by
$\alpha_{(m)}^{(0)} = \frac{\mathrm{tr}\left( \Sigma_{(m)}^k \right)}{\Delta^m}$, with $\Delta \geq 1$,
where $\Sigma_{(m)}^k$ denotes the m-th largest order statistic of the sequence $\Sigma_1^k, \ldots, \Sigma_M^k$, and
$\alpha_{(m)}$ denotes the weight corresponding to the view having the m-th largest order statistic. Given
initializations of the Laplacians, $L_m^{r(0)}$, their weights, $\alpha_m^{(0)}$, and $\kappa > 1$, the initial iterate
for the joint cluster indicator, $U_{Joint}^{(0)}$, is set to the k largest eigenvectors of the matrix
$\sum_{m=1}^{M} \big( \alpha_m^{(0)} \big)^{\kappa} L_m^{r(0)}$.
The proposed GeARS algorithm is described in Algorithm 7.4.

7.3.4.1 Convergence
Let the proposed objective function f in (7.11), evaluated at the t-th iterate, be denoted by
$f^{(t)}$. Since the movement on each manifold is directed towards the negative gradient, which
is a descent direction, the line-search ensures that there would be a reduction in the objective
function at each iteration. However, similar to Chapter 6, the algorithm proceeds to the
next set of iterates only when the reduction is sufficient, as determined by the Armijo
convergence criterion [9]. If not, then the step lengths $\eta_G$ and $\eta_S$ are iteratively decreased
by a factor of $\delta \in (0, 1)$ until an iterate is obtained with sufficient reduction. The proposed
algorithm converges to a local optimum when the difference between the values of f in two
consecutive iterations t and $(t + 1)$ falls below a given threshold $\epsilon$, that is, $f^{(t)} - f^{(t+1)} < \epsilon$.
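
The accept-or-shrink logic described above can be illustrated on a one-dimensional toy problem. The sketch below, in plain Python, mirrors the control flow of steps 12-23 of Algorithm 7.4 in a simplified form: it accepts a step only when the objective decreases by more than a fixed threshold and otherwise multiplies the step size by $\delta$. It is not the full Armijo condition, and the function name `backtracking_descent`, the toy objective, and all parameter values are assumptions made for illustration.

```python
def backtracking_descent(f, grad, x0, eta=0.5, eps=1e-6, delta=0.5, eta_min=1e-3):
    """Accept a step only if it lowers f by more than eps; otherwise shrink eta by delta."""
    x, fx = x0, f(x0)
    history = [fx]
    while eta > eta_min:
        candidate = x - eta * grad(x)
        f_candidate = f(candidate)
        if fx - f_candidate > eps:
            x, fx = candidate, f_candidate   # sufficient reduction: move to next iterate
            history.append(fx)
        else:
            eta *= delta                     # insufficient reduction: decrease step size
    return x, history

# Toy objective (hypothetical): f(x) = (x - 3)^2 with gradient 2 (x - 3).
x_final, history = backtracking_descent(
    lambda x: (x - 3.0) ** 2,
    lambda x: 2.0 * (x - 3.0),
    x0=0.0,
)
```

The recorded objective values are strictly decreasing, and the loop terminates once the step size has been shrunk below its lower limit, which plays the role of the stopping test in step 23.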
The proposed GeARS algorithm performs line search optimization over Riemannian
manifolds. The convergence of line search over Riemannian manifolds is theoretically
established in Chapter 6 and in [3]. The convergence result reported in Theorem 6.3 of
Chapter 6 holds analogously for the proposed GeARS algorithm as well. It states that
only critical points of the cost function, where the gradient of f vanishes, can be
accumulation points of the sequence of iterates generated by the proposed algorithm. However, it
does not necessarily guarantee that the obtained solution is a local minimizer, and
not a saddle point. Nevertheless, as the line-search at each iteration is directed towards
the negative gradient, unless the initial iterate $U_{Joint}^{(0)}$ is specifically designed, Algorithm 7.4
is unlikely to produce sequences whose accumulation points are not local minima of the
cost function.

7.3.4.2 Computational Complexity


Similar to the MiMIC algorithm of Chapter 6, the GeARS algorithm also starts by
extracting a low-rank cluster indicator subspace corresponding to each view. This involves
computing the $(n \times n)$ graph Laplacian of each view and its $O(n^3)$ eigendecomposition.

Algorithm 7.4 Proposed Algorithm: GeARS
Input: Similarity matrices $W_1, \ldots, W_M$, number of clusters k, rank parameter $r \geq k$, $\kappa > 1$,
step sizes $\eta_G, \eta_S > 0$, convergence parameters $\epsilon > 0$ and $\delta \in (0, 1)$.
Output: Multi-view clusters $C_1, \ldots, C_k$.
1: for each view $m \leftarrow 1$ to M do
2:    Construct graph Laplacian $L_m$ as in (7.1).
3:    Compute eigen-decomposition of $L_m$.
4:    $L_m^r \leftarrow$ rank r approximation of $L_m$.
5:    $V_m^k \leftarrow$ k largest eigenvectors of $L_m$.
6:    Compute graph weight $\alpha_m$ based on the k largest eigenvalues of $L_m$.
7: end for
8: Compute $L_{Joint}^{r\kappa} \leftarrow \sum_{m=1}^{M} \alpha_m^{\kappa} L_m^r$.
9: Compute eigen-decomposition of $L_{Joint}^{r\kappa}$ and store the k largest eigenvectors in $U_{Joint}^k$.
10: Initialize variables: $U_{Joint}^{(0)} \leftarrow U_{Joint}^k$, $U_m^{(0)} \leftarrow V_m^k$, $L_m^{r(0)} \leftarrow L_m^r$, $\alpha_m^{(0)} \leftarrow \alpha_m$,
    for each $m = 1, \ldots, M$.
11: $t \leftarrow 0$; $f^{(0)} \leftarrow f\big( U_{Joint}^{(0)}, U_m^{(0)}, L_m^{r(0)}, \alpha_m^{(0)} \big)$, $\forall m = 1, \ldots, M$.
12: do
13:    $U_{Joint}^{(t+1)} \leftarrow$ Optimize_UJoint$\big( U_{Joint}^{(t)}, L_j^{r(t)}, U_j^{(t)}, \alpha_j^{(t)}, \forall j = 1, \ldots, M, \eta_G \big)$.
14:    $U_m^{(t+1)} \leftarrow$ Optimize_Um$\big( L_m^{r(t)}, U_{Joint}^{(t+1)}, \alpha_m^{(t)}, U_m^{(t)}, U_j^{(t)}, j \neq m, \eta_G \big)$
       for each $m \in \{1, \ldots, M\}$.
15:    $L_m^{r(t+1)} \leftarrow$ Optimize_Lm$\big( L_m^{r(t)}, \alpha_m^{(t)}, U_{Joint}^{(t+1)}, U_m^{(t+1)}, \eta_S \big)$
       for each $m \in \{1, \ldots, M\}$.
16:    Update $\alpha_m^{(t+1)}$ according to (7.29), $\forall m \in \{1, \ldots, M\}$.
17:    Compute $f^{(t+1)} \leftarrow f\big( U_{Joint}^{(t+1)}, U_m^{(t+1)}, L_m^{r(t+1)}, \alpha_m^{(t+1)} \big)$, $\forall m = 1, \ldots, M$.
18:    if $\big( f^{(t)} - f^{(t+1)} \big) > \epsilon$ then
19:       Move to next set of iterates: $t = t + 1$.
20:    else
21:       Decrease step size by $\delta$: $\eta_G = \delta \eta_G$, $\eta_S = \delta \eta_S$.
22:    end if
23: while ($\eta_G > 10^{-3}$ and $\eta_S > 10^{-3}$)
24: Optimal joint subspace: $U_{Joint}^{\star} \leftarrow U_{Joint}^{(t+1)}$.
25: Perform k-means clustering on the rows of $U_{Joint}^{\star}$.
26: Return clustering $C_1, \ldots, C_k$ from k-means.

For M views, the individual subspace construction in steps 1-7 takes $O(M n^3)$ time with
sequential operation. The computation of the joint Laplacian and its eigendecomposition in
steps 8 and 9, respectively, takes at most $O(n^3)$ time. The initialization steps 10 and 11
take constant $O(1)$ time.
In each iteration of the gradient based line-search in steps 12-23, the indicator subspaces
$U_{Joint}$ and $U_j$ are optimized over the Grassmannian manifold. The Grassmannian manifold
is a quotient manifold of the Stiefel manifold. Hence, computing a single iterate over
the Grassmannian manifold has the same complexity as that for the Stiefel manifold, as
in Chapter 6. The SVD based retraction over Grassmannian and Stiefel manifolds takes
$O(n^2 k)$ time for steps 13 and 14. The $L_m^r$ optimization in step 15 does not involve the
matrix exponential based retraction operation, as established in Theorem 7.3. So, it has a
time complexity of $O(n^2)$ instead of $O(n^3)$. The graph weight updation in step 16 takes
$O(1)$ time. The computation of the joint objective in step 17 takes $O(M n^2 r)$ time. The
evaluation of convergence criteria and variable updation in steps 18-22 takes $O(1)$ time.
Assuming that the algorithm takes t iterations to converge, the overall complexity of steps
12-23 is bounded by $O\big( t ( n^2 k + M n^2 k + M n^2 + 1 ) \big) = O\big( t M n^2 k \big)$. The clustering on the
final solution $U_{Joint}^{\star}$ in step 25 takes $O(t_{km} n k^2)$ time, where $t_{km}$ is the maximum number
of iterations k-means clustering executes.
Hence, the overall computational complexity of the proposed GeARS algorithm, to
extract the subspace $U_{Joint}^{\star}$ and perform clustering, is
$O(M n^3 + t M n^2 k + t_{km} n k^2) = O\big( \max\{ M n^3, t M n^2 k \} \big)$, assuming $M, r, k \ll n$.
7.3.4.3 Asymptotic Convergence Bound


The asymptotic behavior of the proposed algorithm is analyzed to obtain a convergence
bound that indicates how fast the algorithm arrives at a local optimum starting from a
random initial iterate. For a sufficiently large value of the iteration number t, the difference
between the cost function f evaluated at $U_{Joint}^{(t+1)}$ and at the optimal solution $U_{Joint}^{\star}$ can
be upper bounded in terms of the difference in f evaluated at $U_{Joint}^{(t)}$ and $U_{Joint}^{\star}$. The
bound involves eigenvalues of the Hessian of f at the optimal solution. Given a set of
fixed subspaces $U_m$, Laplacians $L_m^r$, and weights $\alpha_m$, for $m = 1, \ldots, M$, the proposed
objective function f becomes a function of only $U_{Joint}$, given by

$$
f(U_{Joint}) = -\mathrm{tr}\left( U_{Joint}^T \Big( \sum_{m=1}^{M} \big( \alpha_m^{\kappa} L_m^r + U_m U_m^T \big) \Big) U_{Joint} \right).
$$

The above function has its form equivalent to that of the Rayleigh quotient function, given
by $f(U_{Joint}) = \mathrm{tr}\left( U_{Joint}^T \, \Xi \, U_{Joint} \right)$, where

$$
\Xi = -\sum_{m=1}^{M} \big( \alpha_m^{\kappa} L_m^r + U_m U_m^T \big) \quad \text{and} \quad U_{Joint} \in \Re^{n \times k}. \tag{7.30}
$$

Let $\lambda_1 \leq \ldots \leq \lambda_k \leq \lambda_{k+1} \leq \ldots \leq \lambda_n$ be the eigenvalues of $\Xi$. Also, let the Hessian of f
at the optimal solution be denoted by $H_{U_{Joint}^{\star}} f$, and $\lambda_{H,max}$ and $\lambda_{H,min}$, respectively, be
the maximum and minimum eigenvalues of the Hessian matrix $H_{U_{Joint}^{\star}} f$. For the Rayleigh
quotient form, these two eigenvalues are given by (Section 4.9 of [3])

$$
\lambda_{H,max} = \lambda_n - \lambda_1 \quad \text{and} \quad \lambda_{H,min} = \lambda_{k+1} - \lambda_k. \tag{7.31}
$$

Similar to the asymptotic analysis reported in Section 6.4.2 of Chapter 6, the asymptotic
bound for the proposed model is given as follows. There exists an iteration number $t' \geq 0$
such that

$$
f\big( U_{Joint}^{(t+1)} \big) - f\big( U_{Joint}^{\star} \big) \leq c \left( f\big( U_{Joint}^{(t)} \big) - f\big( U_{Joint}^{\star} \big) \right),
$$

for all $t \geq t'$, where

$$
c = 1 - 2 \sigma (\lambda_{k+1} - \lambda_k) \min\left\{ \eta_G, \frac{2 \beta (1 - \sigma)}{(\lambda_n - \lambda_1)} \right\}. \tag{7.32}
$$

Here, $\sigma$ and $\beta$ are Armijo parameters, and $\eta_G$ is the Grassmannian step size.
The bound c in (7.32) determines the relative decrease in the value of the cost function
f from iteration number t to $(t + 1)$, for large values of t. In the case of negligible reduction
in f, the value of c is close to 1, while smaller values indicate a higher reduction in the cost
function. The $\Xi$ matrix in (7.30), whose eigenvalues determine the convergence factor c,
has its form similar to a Laplacian matrix. Hence, similar to the graph Laplacian, in the case
of k well-separated clusters, the $\Xi$ matrix is expected to have a greater gap between the
eigenvalues $\lambda_k$ and $\lambda_{k+1}$. This results in a smaller value of c and faster convergence to the
local minimum. In the case of poor inter-cluster separation, the gap $(\lambda_{k+1} - \lambda_k)$ also reduces,
resulting in c being close to 1 and slower convergence. Hence, the convergence factor c can
indicate the separation between the clusters in a data set.
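
The dependence of the convergence factor on the eigengap can be made concrete by evaluating (7.32) for two small spectra. The following is an illustrative sketch, assuming NumPy; the function name `convergence_factor`, the diagonal test matrices, and the values chosen for $\sigma$, $\beta$, and $\eta_G$ are hypothetical and only serve to show the trend.

```python
import numpy as np

def convergence_factor(Xi, k, eta_g, sigma, beta):
    """Evaluate c of (7.32) from the eigenvalues of a symmetric matrix Xi."""
    lam = np.linalg.eigvalsh(Xi)               # ascending: lam[0] <= ... <= lam[n-1]
    gap = lam[k] - lam[k - 1]                  # lambda_{k+1} - lambda_k
    spread = lam[-1] - lam[0]                  # lambda_n - lambda_1
    return 1.0 - 2.0 * sigma * gap * min(eta_g, 2.0 * beta * (1.0 - sigma) / spread)

# Hypothetical spectra with the same spread but different gaps after the k-th eigenvalue.
well_separated = np.diag([-3.0, -2.9, -1.0, -0.5, 0.0])
poorly_separated = np.diag([-3.0, -2.9, -2.8, -0.5, 0.0])

c_good = convergence_factor(well_separated, k=2, eta_g=0.01, sigma=0.1, beta=0.5)
c_poor = convergence_factor(poorly_separated, k=2, eta_g=0.01, sigma=0.1, beta=0.5)
```

For these inputs the larger gap yields the smaller factor, c = 0.9962 against c = 0.9998, in line with the argument that well-separated clusters lead to faster asymptotic convergence.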

7.4 Grassmannian Disagreement Bounds


The Grassmannian distance $d_\theta(U_{Joint}, U_m)$ in (7.8) quantifies the disagreement between the
joint cluster indicator subspace $\mathrm{span}(U_{Joint})$ and the indicator subspace $\mathrm{span}(U_m)$
corresponding to the m-th view. The disagreement is given by the sum of the squared sines of the
k principal angles between the two subspaces. In order to impose consistency between the
clusterings reflected in different views, the disagreement between the joint and each of the
individual indicator subspaces is minimized in the proposed formulation. This section
uses matrix perturbation theory [202] to derive an upper bound on the Grassmannian
distance $d_\theta(U_{Joint}, U_m)$ between the joint and an individual subspace, at any given iteration t
of the proposed GeARS algorithm.
Let $U_{Joint}^{(t-1)}$, $U_m^{(t-1)}$, and $L_m^{r(t-1)}$ denote the values of the corresponding variables at
iteration $(t - 1)$ of the proposed algorithm (Algorithm 7.4). Without loss of generality, it
is assumed that the views are equally weighted, that is, $\alpha_m^{(t-1)} = \frac{1}{M}$ and $\kappa = 1$,
$\forall m = 1, 2, \ldots, M$. The proposed objective function, in (7.11), is given by

$$
\begin{aligned}
\underset{U_{Joint},\, U_m,\, L_m^r}{\text{minimize}} \quad
& -\mathrm{tr}\left( U_{Joint}^T \Big( \sum_{m=1}^{M} \frac{1}{M} L_m^r \Big) U_{Joint} \right)
- \frac{1}{M} \sum_{m=1}^{M} \left[ \mathrm{tr}\left( U_m^T L_m^r U_m \right) + \mathrm{tr}\left( U_{Joint} U_{Joint}^T U_m U_m^T \right) \right] \\
& - \frac{1}{M(M-1)} \sum_{\substack{i,j=1 \\ i \neq j}}^{M} \mathrm{tr}\left( U_i U_i^T U_j U_j^T \right),
\end{aligned}
\tag{7.33}
$$

with orthonormality constraints on $U_{Joint}$ and $U_m$, and symmetric positive definite
constraints on $L_m^r$. The constant factor 2 in the denominator of (7.11) is not taken into
consideration as it does not affect the angles between the subspaces. Given the $(t - 1)$-th
iterates of $U_m$ and $L_m^r$, the subproblem of (7.33) with respect to $U_{Joint}$, at iteration t, is
given by

$$
\underset{U_{Joint}}{\text{minimize}} \ -\mathrm{tr}\left( U_{Joint}^T \, \frac{1}{M} \left[ \sum_{j=1}^{M} \Big( L_j^{r(t-1)} + U_j^{(t-1)} U_j^{(t-1)T} \Big) \right] U_{Joint} \right). \tag{7.34}
$$

The solution to the negative trace minimization problem in (7.34) is the t-th iterate of
$U_{Joint}$ and is given by the k largest eigenvectors of

$$
V_{Joint}^{(t)} = \frac{1}{M} \left[ \sum_{j=1}^{M} \Big( L_j^{r(t-1)} + U_j^{(t-1)} U_j^{(t-1)T} \Big) \right]. \tag{7.35}
$$

Similarly, for a fixed m, the subproblem with respect to the individual subspace $U_m$, keeping
all other variables fixed to their $(t - 1)$-th iterates, is given by

$$
\underset{U_m}{\text{minimize}} \ -\mathrm{tr}\left( U_m^T \left[ \frac{1}{M} L_m^{r(t-1)} + \frac{1}{M} U_{Joint}^{(t-1)} U_{Joint}^{(t-1)T} + \frac{1}{M(M-1)} \sum_{\substack{j=1 \\ j \neq m}}^{M} U_j^{(t-1)} U_j^{(t-1)T} \right] U_m \right). \tag{7.36}
$$

The solution to (7.36) is the t-th iterate of $U_m$ and is given by the k largest eigenvectors
of

$$
W_m^{(t)} = \frac{1}{M} L_m^{r(t-1)} + \frac{1}{M} U_{Joint}^{(t-1)} U_{Joint}^{(t-1)T} + \frac{1}{M(M-1)} \sum_{\substack{j=1 \\ j \neq m}}^{M} U_j^{(t-1)} U_j^{(t-1)T}. \tag{7.37}
$$

The t-th iterates of $U_{Joint}$ and $U_m$ are given by the k largest eigenvectors of $V_{Joint}^{(t)}$ and
$W_m^{(t)}$, respectively. The bound on the Grassmannian distance between subspaces $U_{Joint}^{(t)}$
and $U_m^{(t)}$ is obtained by computing the distance between the k-dimensional eigenspaces of
$V_{Joint}^{(t)}$ and $W_m^{(t)}$. This is done using matrix perturbation theory [202], which analyzes the
difference between the eigenspaces of a matrix and its perturbation. From the expressions
of $V_{Joint}^{(t)}$ and $W_m^{(t)}$ in (7.35) and (7.37), respectively, $V_{Joint}^{(t)}$ can be written as a perturbation
of $W_m^{(t)}$ by $\mathbf{E}_m^{(t)}$, given by

$$
V_{Joint}^{(t)} = W_m^{(t)} + \mathbf{E}_m^{(t)}, \tag{7.38}
$$

where

$$
\mathbf{E}_m^{(t)} = \sum_{\substack{j=1 \\ j \neq m}}^{M} \left( \frac{1}{M} L_j^{r(t-1)} + \frac{M - 2}{M(M-1)} U_j^{(t-1)} U_j^{(t-1)T} \right)
+ \frac{1}{M} \left( U_m^{(t-1)} U_m^{(t-1)T} - U_{Joint}^{(t-1)} U_{Joint}^{(t-1)T} \right). \tag{7.39}
$$

Applying the Davis-Kahan theorem [49] (see Appendix C) on the perturbation relation in
(7.38), the sum of squared principal sines between the k largest eigenvectors of $V_{Joint}^{(t)}$ and $W_m^{(t)}$ is
bounded as follows.

Result 7.1. The Grassmannian distance between the t-th iterates of the joint subspace, $U_{Joint}$,
and the individual subspace corresponding to the m-th view, $U_m$, is bounded by

$$
d_\theta\big( U_{Joint}^{(t)}, U_m^{(t)} \big) \leq \frac{\left\| \mathbf{E}_m^{(t)} U_m^{(t)} \right\|_F^2}{\Psi_{V_{Joint}^{(t)}}(k) - \Phi_{W_m^{(t)}}(k + 1)}, \tag{7.40}
$$

where $\Psi_{V_{Joint}^{(t)}}(k)$ and $\Phi_{W_m^{(t)}}(k + 1)$ denote the k-th and $(k + 1)$-th largest eigenvalues of
$V_{Joint}^{(t)}$ and $W_m^{(t)}$, respectively.
In multi-view clustering, the views are expected to agree upon a uniform global
clustering of the data set. Hence, the distance $d_\theta\big( U_{Joint}^{(t)}, U_m^{(t)} \big)$ between the joint and individual
clustering subspaces, for each $m = 1, 2, \ldots, M$, is expected to decrease as the proposed
algorithm progresses from a random initial iterate to a local minimum.

7.5 Experimental Results and Discussion


The clustering performance of the proposed GeARS algorithm is studied and compared
with that of the existing approaches on several real-world data sets. Among them, four are
benchmark data sets on which GeARS is compared with nine multi-view clustering algo-
rithms, while eight others are multi-omics cancer data sets on which GeARS is evaluated
against ten cancer subtyping algorithms. The clustering results are evaluated by measuring
the closeness of the identified clusters with ground truth class information for the bench-
mark data sets, and with the clinically established cancer subtypes for the multi-omics
data sets. Six indices, namely, clustering accuracy, adjusted Rand index (ARI), normal-
ized mutual information (NMI), F-measure, Rand index, and purity are used to evaluate
the clustering performance.
To account for randomness in the clustering results, each algorithm is executed 10 times,
and the evaluation indices are reported in the mean $\pm$ standard deviation form. Similar to
Chapter 6, in Tables 7.1-7.6, the numbers within brackets denote the standard deviations,
'$\approx 0$' denotes a value close to zero ($\sim 10^{-16}$), while '0.0' denotes the exact zero. For the
proposed GeARS algorithm, the step size $\eta_G$ for the Grassmannian manifold is set to 0.01,
while $\eta_S$ for the SPD manifold is set to 0.001 for all data sets. The values of $\epsilon$ and $\delta$ are
empirically set to 0.01 and 0.5, respectively. The weight initialization parameter $\Delta$ is set
to 1 for benchmark data sets and 2 for multi-omics cancer data sets, as in Chapter 6. The
value of $\kappa$ in the weighted combination $\alpha$ is set to 2 for all data sets. The source code of
the proposed algorithm in R is available at https://github.com/Aparajita-K/GeARS.

7.5.1 Description of Data Sets


In this work, multi-view clustering is performed on the following benchmark and multi-
omics cancer data.
Benchmark Data Sets: Seven publicly available benchmark data sets, namely, 3Sources,
BBC, Digits, 100Leaves, ORL, Caltech7, and CORA from diverse application domains are
considered in this study. The BBC and 3Sources are multi-source document clustering
data sets, Digits, 100Leaves, ORL, and Caltech7 are image data sets, while CORA is a
social network data set.

Multi-Omics Cancer Data Sets: Disease subtype identification is performed on eight
different cancers, namely, ovarian carcinoma (OV), breast adenocarcinoma (BRCA), lower
grade glioma (LGG), stomach adenocarcinoma (STAD), colorectal carcinoma (CRC),
cervical carcinoma (CESC), lung carcinoma (LUNG), and kidney carcinoma (KIDNEY).
All the cancer data sets are obtained from The Cancer Genome Atlas (TCGA).
The number of cancer subtypes for CRC and LUNG is two, for LGG, CESC, and KIDNEY
is three, while for OV, BRCA, and STAD is four. For each of these cancer data sets, four
genomic views are considered, namely, microRNA expression (miRNA), gene expression
(RNA), DNA methylation (mDNA), and reverse phase protein array expression (RPPA).
The benchmark and multi-omics data sets are described in Appendix A, while the cluster
evaluation indices are described in Appendix B.

[Figure 7.4 panels. Top row, scatter plots: (a) Original Data Set, (b) Noise with STD = 0.5, (c) Noise with STD = 1, (d) Noise with STD = 1.5. Bottom row, plots of Ratio γ, Bound c, and Objective f(UJoint(t)) against Iteration t: (a) c = 0.53687, (b) c = 0.87032, (c) c = 0.94897, (d) c = 0.96944.]

Figure 7.4: Asymptotic convergence analysis for Spiral data set: scatter plot of data with
varying Gaussian noise (top row) and variation of convergence ratio and objective function
with increase in iteration number t (bottom row).

7.5.2 Significance of Asymptotic Convergence Bound


The convergence factor c in (7.32) bounds the difference between the cost function f
evaluated at point $U_{Joint}^{(t+1)}$ and at the optimal solution $U_{Joint}^{\star}$ in terms of the difference
between that evaluated at $U_{Joint}^{(t)}$ and $U_{Joint}^{\star}$. Let this ratio be given by

$$\gamma_t = \frac{f\left(U_{Joint}^{(t+1)}\right) - f\left(U_{Joint}^{\star}\right)}{f\left(U_{Joint}^{(t)}\right) - f\left(U_{Joint}^{\star}\right)}. \tag{7.41}$$
¹ https://cancergenome.nih.gov/

[Figure 7.5 panels. Top row, scatter plots: (a) Original Data Set, (b) Noise with STD = 0.5, (c) Noise with STD = 1, (d) Noise with STD = 1.5. Bottom row, plots of Ratio γ, Bound c, and Objective f(UJoint(t)) against Iteration t: (a) c = 0.56594, (b) c = 0.74822, (c) c = 0.81057, (d) c = 0.96658.]

Figure 7.5: Asymptotic convergence analysis for Jain data set: scatter plot of data with
varying Gaussian noise (top row) and variation of convergence ratio and objective function
with increase in iteration number t (bottom row).

[Figure 7.6 panels. Top row, scatter plots: (a) Original Data Set, (b) Noise with STD = 0.5, (c) Noise with STD = 1, (d) Noise with STD = 1.5. Bottom row, plots of Ratio γ, Bound c, and Objective f(UJoint(t)) against Iteration t: (a) c = 0.52430, (b) c = 0.59971, (c) c = 0.99300, (d) c = 0.99940.]

Figure 7.6: Asymptotic convergence analysis for Aggregation data set: scatter plot of data
with varying Gaussian noise (top row) and variation of convergence ratio and objective
function with increase in iteration number t (bottom row).

[Figure 7.7 panels. Top row, scatter plots: (a) Original Data Set, (b) Noise with STD = 0.5, (c) Noise with STD = 1, (d) Noise with STD = 1.5. Bottom row, plots of Ratio γ, Bound c, and Objective f(UJoint(t)) against Iteration t: (a) c = 0.60062, (b) c = 0.68514, (c) c = 0.90555, (d) c = 0.94944.]

Figure 7.7: Asymptotic convergence analysis for Flame data set: scatter plot of data with
varying Gaussian noise (top row) and variation of convergence ratio and objective function
with increase in iteration number t (bottom row).

Similar to the results and analysis presented in Section 6.5.3 of Chapter 6, the scatter plots
for the noise-free and noisy variants of four synthetic shape data sets, namely, Spiral, Jain,
Aggregation, and Flame, are provided in the top rows of Figures 7.4, 7.5, 7.6, and 7.7,
respectively. The variation of $\gamma_t$ and the cost function $f(U_{Joint}^{(t)})$ for different values of
iteration number $t = 1, 2, 3, \ldots$, along with the corresponding value of the convergence factor c
for the data sets, is provided in the bottom rows of Figures 7.4, 7.5, 7.6, and 7.7. In these
figures, the value of the bound c is marked by a horizontal dashed green line, while the
vertical dashed line denotes the iteration threshold $t'$ above which the asymptotic bound
is satisfied by all the iterations until convergence.
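The quantities plotted in the bottom rows can be sketched in Python as follows. The sketch computes the ratio of (7.41) from a recorded sequence of objective values and then locates the iteration threshold, interpreted here as the smallest iteration from which the ratio stays at or below c until the last recorded iteration. The optimal value of f is not known in practice, so it is passed in by the caller (for instance, the value at convergence); this choice, the function names, and the example numbers are assumptions made for illustration.

```python
def convergence_ratios(objective_values, optimum):
    """Return gamma_t = (f_{t+1} - f*) / (f_t - f*) for t = 1, 2, ...

    objective_values[0] is taken to be the objective at iteration t = 1.
    """
    ratios = []
    for current, following in zip(objective_values, objective_values[1:]):
        gap = current - optimum
        if gap == 0:
            break  # already at the optimum; the ratio is undefined from here on
        ratios.append((following - optimum) / gap)
    return ratios


def iteration_threshold(ratios, c):
    """Smallest 1-based t' such that gamma_t <= c for every recorded t >= t'."""
    threshold = None
    # Walk backwards from the last iteration while the bound keeps holding.
    for t in range(len(ratios), 0, -1):
        if ratios[t - 1] <= c:
            threshold = t
        else:
            break
    return threshold


# Hypothetical objective values of a minimization run, with assumed optimum -10.
values = [-2.0, -4.0, -4.5, -7.0, -8.5, -9.25]
gammas = convergence_ratios(values, optimum=-10.0)
print(gammas, iteration_threshold(gammas, c=0.6))
```

The function returns None when the bound is violated at the last recorded iteration, which signals that the run has not yet entered the asymptotic regime for the given c.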
For all four data sets, the top rows of Figures 7.4, 7.5, 7.6, and 7.7 show that the
cluster structure and its separability degrade with the increase in noise, as expected.
The bottom rows of these figures in turn show that, as the noise in the data sets increases,
the value of the convergence factor c increases and approaches 1. For instance, in the case
of the Spiral data set, the value of c for the noise-free original data set in Figure 7.4(a) is
0.53687, while the values for the three increasingly noisy variants in Figures 7.4(b), 7.4(c), and
7.4(d) are 0.87032, 0.94897, and 0.96944, respectively. A similar pattern in the values of c can
be observed for the Jain, Aggregation, and Flame data sets from the bottom rows of Figures
7.5, 7.6, and 7.7, respectively. Although the results are sensitive to the added noise and
the choice of the random initial iterate, in general, it can be observed that lower values of
c imply faster convergence. For instance, the bottom row of Figure 7.4 shows that the
proposed algorithm converges in far fewer iterations (≤ 30) in the noise-free
case (Figure 7.4(a)) than in the noisy ones (≥ 65 iterations in Figures 7.4(c) and
7.4(d)). It can also be observed that the value of the iteration threshold $t'$, above which
the asymptotic bound is satisfied by all the iterations until convergence, decreases as the
amount of noise increases (from Figure 7.4(a) to Figure 7.4(d)), implying a longer path
until convergence due to noise. The value of the minimization-based objective function f at
the optimal solution also increases from −11.6 in the noise-free case (Figure 7.4(a)) to −7.02
in the heavily noisy case (Figure 7.4(d)), implying degradation of the cluster structure. Similar
observations can be made from Figures 7.5, 7.6, and 7.7 for the Jain, Aggregation, and Flame
data sets, respectively. These results show that, similar to Chapter 6, the convergence
factor c in (7.32) can be used to draw inferences about the quality of the clusters and the
speed of convergence of the proposed algorithm for a given data set.

7.5.3 Empirical Study on Subspace Disagreement Bound


The relation established in (7.40) gives an upper bound on the Grassmannian distance
$d_\theta\left(U_{Joint}^{(t)}, U_m^{(t)}\right)$ between the joint subspace span($U_{Joint}$) and the individual one span($U_m$)
corresponding to the m-th view, at iteration t of the proposed algorithm. Let the bound
for view m at iteration t be denoted by

$$\Gamma_m^{(t)} = \frac{\left\| E_m^{(t)} U_m^{(t)} \right\|_F^2}{\Psi_{V_{Joint}^{(t)}}(k) - \Phi_{W_m^{(t)}}(k+1)}. \tag{7.42}$$

The variation of the upper bound $\Gamma_m^{(t)}$ and the actual distance between those subspaces is
empirically studied for two omics data sets, namely, LGG and STAD, and two benchmark
data sets, namely, 3Sources and BBC, with different values of iteration number t. The
variations are reported in Figure 7.8 for each view of the LGG, STAD, 3Sources, and BBC
data sets. Figure 7.8 shows that for all views of each data set the theoretical bound
(marked in red) is satisfied by the actual distance (marked in blue) between the joint and
the individual subspaces. At each iteration of the proposed GeARS algorithm, the next
set of iterates is computed by taking a small step in the direction of the negative gradient
of the objective function f. Since the Grassmannian bound in Section 7.4 is computed from
the closed form solutions of UJoint and Um, the upper bound $\Gamma_m^{(t)}$, as seen in Figure 7.8,
is satisfied by the actual distance observed between those subspaces at each iteration t of
the proposed algorithm. The bottom two rows of Figure 7.8 for the LGG and STAD data sets
show that the theoretical bound is closer to the observed distance for the omics data sets
than for the 3Sources and BBC benchmark data sets, shown in the top two rows
of the same figure.
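The two observations drawn from Figure 7.8, namely that the bound holds at every iteration and that it is tighter for some data sets than for others, can be checked numerically with a sketch of the following kind. It assumes that the per-iteration values of the bound and of the observed distance have already been recorded for one view; the function name `bound_report` and the numbers are hypothetical and do not come from the reported experiments.

```python
def bound_report(bound, actual):
    """Check actual <= bound at every iteration and measure the mean slack.

    bound[t] and actual[t] are the theoretical upper bound and the observed
    Grassmannian distance for one view at the same iteration.
    """
    if len(bound) != len(actual):
        raise ValueError("bound and actual must cover the same iterations")
    satisfied = all(a <= b for a, b in zip(actual, bound))
    # Mean slack: a smaller value means the bound is closer to the observed distance.
    mean_slack = sum(b - a for a, b in zip(actual, bound)) / len(bound)
    return satisfied, mean_slack


# Hypothetical values for one view over three iterations.
print(bound_report(bound=[5.0, 4.0, 3.0], actual=[1.0, 1.5, 2.0]))
```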

7.5.4 Choice of Rank


The optimal approximation rank, $r^\star$, of the individual Laplacians is determined based
on the Silhouette index, as described in Section 6.5.4 of Chapter 6. The variation of the
Silhouette index and F-measure is shown in Figure 7.9 for the LGG, OV, and Digits data sets,
as examples. Figure 7.9 shows that both F-measure and Silhouette index tend to vary
similarly for the data sets. Based on this criterion, the optimal ranks for the benchmark data
sets, namely, 3Sources, BBC, Digits, and 100Leaves, are 45, 46, 13, and 160, respectively.
For the eight cancer data sets, namely, OV, LGG, BRCA, STAD, CRC, CESC, LUNG,
and KIDNEY, the ranks are 5, 31, 4, 19, 10, 3, 33, and 5, respectively. For the OV, BRCA,
LGG, Digits, and 3Sources data sets, the F-measure corresponding to the selected rank $r^\star$

[Figure 7.8 panels: Bound and Actual values of dθ(UJoint, Um) against iteration t. (a) Grassmannian bounds on 3Sources data set (views: BBC, Guardian, Reuters); (b) Grassmannian bounds on BBC data set (views: Segment1, Segment2, Segment3, Segment4); (c) Grassmannian bounds on LGG data set (views: mDNA, RNA, miRNA, RPPA); (d) Grassmannian bounds on STAD data set (views: mDNA, RNA, miRNA, RPPA).]

Figure 7.8: Variation of the theoretical upper bound Γm and the observed Grassmannian
distance between UJoint and Um with increase in iteration number t for the 3Sources, BBC,
LGG, and STAD data sets. Sub-figures in each row show the variation for different views
of the corresponding data set.

coincides with the best F-measure obtained over different values of r. The importance of
choosing the approximation rank $r^\star$ to be greater than k, the number of clusters, is
established by comparing the clustering performance of the proposed algorithm at rank $r^\star$
with that at rank k in Table 7.1. Table 7.1 shows that for all benchmark data sets and three
cancer data sets, namely, OV, LGG, and STAD, the clustering performance improves significantly
when the optimal rank $r^\star$ is used instead of k. For the BRCA data set, $r^\star = k$, so the
performance is the same in both cases.
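A Silhouette-based rank selection of this kind can be sketched in Python as follows. The sketch assumes the standard definition of the Silhouette index with Euclidean distances and assumes that the rank with the highest index is the one selected; the exact procedure is the one described in Section 6.5.4. Here `points` stands for the rows of a cluster indicator subspace, and the function names, the toy data, and the per-rank scores are all hypothetical.

```python
import math


def silhouette_index(points, labels):
    """Mean silhouette width over all samples, using Euclidean distances."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("the Silhouette index needs at least two clusters")
    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a width of 0
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def select_rank(silhouette_by_rank):
    """Pick the candidate rank with the largest Silhouette index (assumed criterion)."""
    return max(silhouette_by_rank, key=silhouette_by_rank.get)


# Toy data: two well separated groups, one good and one poor labelling.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
print(silhouette_index(points, [0, 0, 1, 1]), silhouette_index(points, [0, 1, 0, 1]))

# Hypothetical Silhouette values obtained at three candidate ranks.
print(select_rank({5: 0.41, 13: 0.62, 20: 0.55}))
```

Because the Silhouette index needs no ground truth, it can be evaluated for every candidate rank, whereas the F-measure shown alongside it in Figure 7.9 is available only when class labels are known.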

[Figure 7.9 panels: F-measure and Silhouette index against rank r for the LGG, OV, and Digits data sets.]

Figure 7.9: Variation of Silhouette index and F-measure for different values of rank r on
LGG, OV, and Digits data sets.

7.5.5 Effectiveness of Proposed Algorithm


This subsection illustrates the significance of different components of the proposed for-
mulation, such as, optimization of joint and individual cluster indicator subspaces over
Grassmannian manifold, optimization of the graph Laplacians over SPD manifold, Lapla-
cian eigenvalue based initialization of graph weights, and so on. The results are studied
on four benchmark data sets: 3Sources, BBC, Digits, and 100Leaves, and four omics data
sets: BRCA, STAD, LGG, and OV, as examples, in Tables 7.2 and 7.3.

7.5.5.1 Importance of Joint Subspace Optimization


The proposed algorithm extracts a global cluster indicator subspace, UJoint , that optimizes
the spectral clustering objective over the joint view, while reducing the discrepancy between
the joint and the individual clustering subspaces. To establish the importance of optimizing
UJoint in the proposed formulation, the clustering performance of the proposed algorithm is
compared with that of the case where the joint clustering subspace is not optimized, while
all other components, such as the individual cluster indicator subspaces, individual Laplacians,
and their corresponding weights, are optimized. Since UJoint is not optimized, the clusters
are identified by performing k-means clustering on the indicator subspace corresponding
to the highest weighted view according to the eigenvalue based measure. The clustering
performance for this case is reported under the ‘UJoint optimize’ component in Tables 7.2
and 7.3. These two tables show that for all data sets, except BRCA, clustering on the joint
subspace having multi-view information gives a significant improvement in performance as
opposed to even the most relevant single view. For the BRCA data set, RNA expression
is the most relevant view according to α, and its spectral clustering result is considerably
high (as shown in the result corresponding to the ‘Best view’ in Table 7.3), hence, the
improvement in performance considering UJoint is comparatively lower. Nevertheless, the
increased performance of the proposed GeARS algorithm across all data sets establishes
the importance of UJoint optimization.

7.5.5.2 Importance of Individual Subspace Optimization


In order to establish the importance of optimizing the individual cluster indicator sub-
spaces, the performance of the proposed algorithm is compared with that of the case where

Table 7.1: Performance Analysis of Proposed Algorithm at Rank k and Optimal Rank $r^\star$

Data set    Measure     Rank k            Rank $r^\star$
Digits      Rank        10                13
Digits      Accuracy    0.7925(0.0)       0.9321(3.16e-4)
Digits      NMI         0.7610(→ 0)       0.8683(5.59e-4)
Digits      ARI         0.6792(0.0)       0.8557(6.27e-4)
Digits      F-measure   0.8089(0.0)       0.9325(3.21e-4)
Digits      Rand        0.9416(0.0)       0.9740(1.12e-4)
Digits      Purity      0.8000(0.0)       0.9321(3.16e-4)

3Sources    Rank        6                 45
3Sources    Accuracy    0.6153(7.88e-3)   0.7786(7.48e-3)
3Sources    NMI         0.5723(1.62e-2)   0.6823(1.06e-2)
3Sources    ARI         0.4537(1.28e-2)   0.6591(2.09e-2)
3Sources    F-measure   0.6786(5.43e-3)   0.8063(1.07e-2)
3Sources    Rand        0.8127(4.48e-3)   0.8770(6.61e-3)
3Sources    Purity      0.7455(7.89e-3)   0.8142(7.48e-3)

BBC         Rank        5                 46
BBC         Accuracy    0.7132(6.52e-2)   0.8804(1.74e-3)
BBC         NMI         0.5856(5.70e-2)   0.7335(5.84e-3)
BBC         ARI         0.5371(1.00e-1)   0.7566(8.23e-3)
BBC         F-measure   0.7293(5.31e-2)   0.8710(1.98e-3)
BBC         Rand        0.8032(5.52e-2)   0.9078(3.35e-3)
BBC         Purity      0.7143(6.47e-2)   0.8804(1.75e-3)

100Leaves   Rank        100               160
100Leaves   Accuracy    0.7893(1.69e-2)   0.8372(1.79e-2)
100Leaves   NMI         0.9145(6.09e-3)   0.9346(3.71e-3)
100Leaves   ARI         0.7042(1.94e-2)   0.7643(1.97e-2)
100Leaves   F-measure   0.8206(1.48e-2)   0.8618(1.11e-2)
100Leaves   Rand        0.9936(5.17e-4)   0.9950(5.35e-4)
100Leaves   Purity      0.8137(1.41e-2)   0.8545(1.41e-2)

OV          Rank        4                 5
OV          Accuracy    0.6497(0.0)       0.7023(2.89e-3)
OV          NMI         0.3287(0.0)       0.3687(1.71e-3)
OV          ARI         0.3070(0.0)       0.3735(4.88e-3)
OV          F-measure   0.6503(0.0)       0.7035(3.02e-3)
OV          Rand        0.7392(0.0)       0.7621(2.29e-3)
OV          Purity      0.6497(0.0)       0.7023(2.89e-3)

LGG         Rank        3                 31
LGG         Accuracy    0.6329(0.0)       0.9887(0.0)
LGG         NMI         0.4312(→ 0)       0.9397(→ 0)
LGG         ARI         0.2861(0.0)       0.9655(0.0)
LGG         F-measure   0.6347(0.0)       0.9887(0.0)
LGG         Rand        0.6639(0.0)       0.9837(0.0)
LGG         Purity      0.6779(0.0)       0.9887(0.0)

BRCA        Rank        4                 4
BRCA        Accuracy    0.7914(0.0)       0.7914(0.0)
BRCA        NMI         0.5444(→ 0)       0.5444(→ 0)
BRCA        ARI         0.5376(0.0)       0.5376(0.0)
BRCA        F-measure   0.7936(0.0)       0.7936(0.0)
BRCA        Rand        0.8104(0.0)       0.8104(0.0)
BRCA        Purity      0.7914(0.0)       0.7914(0.0)

STAD        Rank        4                 19
STAD        Accuracy    0.4214(0.0)       0.7933(0.0)
STAD        NMI         0.0982(→ 0)       0.4970(→ 0)
STAD        ARI         0.0560(0.0)       0.5059(0.0)
STAD        F-measure   0.4530(0.0)       0.7942(0.0)
STAD        Rand        0.5953(0.0)       0.7847(0.0)
STAD        Purity      0.5000(0.0)       0.7933(0.0)

the individual subspaces Um, for m = 1, ..., M, are set to their initial iterates and not
optimized, but all other variables are optimized. This result is provided under the
‘Um optimize’ component in Tables 7.2 and 7.3. The proposed algorithm outperforms this
restricted case across all data sets, as shown in Tables 7.2 and 7.3. The difference is more
significant in the case of the 3Sources, LGG, BRCA, and STAD data sets. When the individual
subspaces Um are not optimized, the information in the joint subspace and the individual
Laplacians does not update the individual subspaces. In the absence of this information
propagation, the joint subspace UJoint is unable to reach a better consensus about the global
cluster structure, which results in poorer performance, as shown in Tables 7.2 and 7.3.
This establishes the importance of optimizing the variables Um corresponding to each
individual view.

7.5.5.3 Importance of Pairwise Distance Reduction


Apart from reducing the Grassmannian distance between the joint and individual sub-
spaces, the proposed model also reduces that between every pair of individual subspaces.
Tables 7.2 and 7.3 report the clustering performance of the model where the pairwise

Table 7.2: Importance of Different Components of the Proposed Algorithm on Benchmark
Data Sets

Module       Accuracy         NMI              ARI              F-measure        Rand             Purity

Digits
Best view    0.7096(7.7e-4)   0.6443(3.9e-4)   0.5416(9.2e-4)   0.7206(6.9e-4)   0.9173(2.1e-4)   0.7096(7.7e-4)
UJoint       0.7735(0.0)      0.7256(→ 0)      0.6401(0.0)      0.7843(0.0)      0.9347(0.0)      0.7775(0.0)
Um           0.9055(0.0)      0.8392(→ 0)      0.8064(0.0)      0.9067(0.0)      0.9650(0.0)      0.9055(0.0)
dθ(Ui, Uj)   0.9315(0.0)      0.8668(→ 0)      0.8543(0.0)      0.9318(0.0)      0.9738(0.0)      0.9315(0.0)
Lm           0.9060(0.0)      0.8395(→ 0)      0.8072(0.0)      0.9072(0.0)      0.9652(0.0)      0.9060(0.0)
α_Equal      0.8965(0.0)      0.8363(→ 0)      0.7973(0.0)      0.8971(0.0)      0.9634(0.0)      0.8965(0.0)
α(0)_Eigen   0.9055(0.0)      0.8385(→ 0)      0.8060(0.0)      0.9067(0.0)      0.9650(0.0)      0.9055(0.0)
GeARS        0.9321(3.1e-4)   0.8683(5.5e-4)   0.8555(6.2e-4)   0.9325(3.2e-4)   0.9740(1.1e-4)   0.9321(3.1e-4)

3Sources
Best view    0.7159(0.0)      0.6390(0.0)      0.6082(0.0)      0.7656(0.0)      0.8624(0.0)      0.7869(0.0)
UJoint       0.6343(1.3e-2)   0.5535(7.3e-3)   0.4432(1.1e-2)   0.6844(1.0e-2)   0.8136(2.8e-3)   0.7236(9.2e-3)
Um           0.7165(4.3e-2)   0.6292(2.2e-2)   0.5635(4.9e-2)   0.7560(3.4e-2)   0.8455(1.9e-2)   0.7621(1.9e-2)
dθ(Ui, Uj)   0.7526(3.3e-2)   0.6615(3.7e-2)   0.6176(6.7e-2)   0.7855(3.8e-2)   0.8627(2.5e-2)   0.7881(3.3e-2)
Lm           0.7526(3.3e-2)   0.6590(3.4e-2)   0.6149(6.4e-2)   0.7838(3.7e-2)   0.5464(2.4e-2)   0.7881(3.3e-2)
α_Equal      0.7289(4.93e-2)  0.6434(3.44e-2)  0.5832(5.57e-2)  0.7629(3.78e-2)  0.8509(2.11e-2)  0.7745(2.53e-2)
α(0)_Eigen   0.7526(3.3e-2)   0.6590(3.4e-2)   0.6149(6.4e-2)   0.7838(3.7e-2)   0.8617(2.4e-2)   0.7881(3.3e-2)
GeARS        0.7786(7.4e-3)   0.6823(1.0e-2)   0.6591(2.0e-2)   0.8063(1.0e-2)   0.8770(6.6e-3)   0.8142(7.4e-3)

BBC
Best view    0.6202(2.1e-3)   0.4312(1.7e-3)   0.3405(6.6e-2)   0.6514(1.2e-2)   0.7256(1.7e-2)   0.6212(3.5e-3)
UJoint       0.7868(0.0)      0.6402(→ 0)      0.6846(0.0)      0.8067(0.0)      0.8824(0.0)      0.7883(0.0)
Um           0.8759(0.0)      0.7282(→ 0)      0.7499(0.0)      0.8663(0.0)      0.9053(0.0)      0.8759(0.0)
dθ(Ui, Uj)   0.8797(1.4e-3)   0.7308(5.2e-3)   0.7529(7.8e-3)   0.8701(1.7e-3)   0.9062(3.2e-3)   0.8797(1.4e-3)
Lm           0.8786(1.7e-3)   0.7293(4.6e-3)   0.7521(6.2e-3)   0.8692(1.7e-3)   0.9061(2.5e-3)   0.8786(1.7e-3)
α_Equal      0.8658(3.4e-2)   0.7120(3.2e-2)   0.7162(9.0e-2)   0.8575(3.0e-2)   0.8914(3.7e-2)   0.8658(3.4e-2)
α(0)_Eigen   0.8786(1.2e-3)   0.7291(3.5e-3)   0.7520(5.3e-3)   0.8692(1.2e-3)   0.9060(2.2e-3)   0.8786(1.2e-3)
GeARS        0.8804(1.7e-3)   0.7335(5.8e-3)   0.7566(8.2e-3)   0.8710(1.9e-3)   0.9078(3.3e-3)   0.8804(1.7e-3)

100Leaves
Best view    0.5786(1.1e-2)   0.7940(4.4e-3)   0.4478(9.8e-3)   0.6113(9.7e-3)   0.9880(2.9e-4)   0.6203(7.6e-3)
UJoint       0.6338(1.3e-2)   0.8146(5.4e-3)   0.5059(1.3e-2)   0.6565(1.1e-2)   0.9898(3.1e-4)   0.6614(1.2e-2)
Um           0.8228(1.7e-2)   0.9340(3.4e-3)   0.7553(1.4e-2)   0.8546(1.1e-2)   0.9948(4.0e-4)   0.8456(1.4e-2)
dθ(Ui, Uj)   0.8313(1.6e-2)   0.9357(3.8e-3)   0.7530(1.7e-2)   0.8585(1.0e-2)   0.9947(6.4e-4)   0.8512(1.0e-2)
Lm           0.8232(1.2e-2)   0.9336(4.8e-3)   0.7542(2.4e-2)   0.8546(1.0e-2)   0.9948(3.8e-4)   0.8457(1.4e-2)
α_Equal      0.8186(1.5e-2)   0.9307(3.9e-3)   0.7414(1.8e-2)   0.8502(9.8e-3)   0.9944(5.1e-4)   0.8421(1.1e-2)
α(0)_Eigen   0.8223(1.9e-2)   0.9328(4.4e-3)   0.7510(1.9e-2)   0.8530(1.3e-2)   0.9947(5.1e-4)   0.8449(1.5e-2)
GeARS        0.8372(1.7e-2)   0.9346(3.7e-3)   0.7643(1.9e-2)   0.8618(1.1e-2)   0.9950(5.3e-4)   0.8545(1.4e-2)

distance term $\sum d_\theta(U_i, U_j)$, $\forall i \neq j$, in (7.11) is not included in the optimization
framework. The results are provided under the ‘dθ(Ui, Uj)’ component in Tables 7.2 and 7.3. They
demonstrate that for the OV and 3Sources data sets, there is a substantial decrease in clustering
performance when the pairwise distance minimization term is not considered. For the
other data sets, the performance is reduced by a smaller margin. The small margin is due
to the fact that the final k-means is performed on UJoint and the pairwise distance term
has only an indirect effect on the cluster structure reflected in UJoint.

7.5.5.4 Importance of Laplacian Optimization


Optimization of Laplacians Lm ’s plays an important role in updating the connectivity of
the graphs based on the information reflected in the joint and individual cluster indicator

Table 7.3: Importance of Different Components of the Proposed Algorithm on Omics Data
Sets

Module       Accuracy         NMI              ARI              F-measure        Rand             Purity

OV
Best view    0.6497(0.0)      0.3748(→ 0)      0.3548(0.0)      0.6444(0.0)      0.7536(0.0)      0.6497(0.0)
UJoint       0.6556(0.0)      0.3297(→ 0)      0.3109(0.0)      0.6566(0.0)      0.7409(0.0)      0.6556(0.0)
Um           0.6766(0.0)      0.3360(0.0)      0.3369(0.0)      0.6775(0.0)      0.7484(0.0)      0.6766(0.0)
dθ(Ui, Uj)   0.5889(8.0e-2)   0.3075(4.5e-2)   0.2684(7.6e-2)   0.5963(7.6e-2)   0.7208(3.0e-2)   0.6035(7.0e-2)
Lm           0.6485(6.5e-2)   0.3257(3.2e-2)   0.3152(5.7e-2)   0.6510(6.1e-2)   0.7397(2.3e-2)   0.6526(5.6e-2)
α_Equal      0.5598(0.0)      0.2588(→ 0)      0.2155(0.0)      0.5691(0.0)      0.7030(0.0)      0.5628(0.0)
α(0)_Eigen   0.5658(0.0)      0.2673(→ 0)      0.2443(0.0)      0.5699(0.0)      0.7162(0.0)      0.5658(0.0)
GeARS        0.7023(2.8e-3)   0.3687(1.7e-3)   0.3735(4.8e-3)   0.7035(3.0e-3)   0.7621(2.2e-3)   0.7023(2.8e-3)

LGG
Best view    0.8352(0.0)      0.5734(→ 0)      0.5567(0.0)      0.8269(0.0)      0.7861(0.0)      0.8352(0.0)
UJoint       0.6292(0.0)      0.4106(→ 0)      0.2765(0.0)      0.6316(0.0)      0.6590(0.0)      0.6741(0.0)
Um           0.8764(0.0)      0.6502(→ 0)      0.6328(0.0)      0.8742(0.0)      0.8194(0.0)      0.8764(0.0)
dθ(Ui, Uj)   0.9812(0.0)      0.9001(→ 0)      0.9449(0.0)      0.9812(0.0)      0.9740(0.0)      0.9812(0.0)
Lm           0.9565(3.9e-2)   0.8406(9.5e-2)   0.8689(1.2e-1)   0.9561(4.0e-2)   0.9367(6.0e-2)   0.9565(3.9e-2)
α_Equal      0.9101(0.0)      0.7167(→ 0)      0.7541(0.0)      0.9107(0.0)      0.8834(0.0)      0.9101(0.0)
α(0)_Eigen   0.9850(0.0)      0.9189(→ 0)      0.9527(0.0)      0.9850(0.0)      0.9776(0.0)      0.9850(0.0)
GeARS        0.9887(0.0)      0.9397(→ 0)      0.9655(0.0)      0.9887(0.0)      0.9837(0.0)      0.9887(0.0)

BRCA
Best view    0.7688(0.0)      0.5277(→ 0)      0.5130(0.0)      0.7690(0.0)      0.7995(0.0)      0.7688(0.0)
UJoint       0.7788(0.0)      0.5331(→ 0)      0.5169(0.0)      0.7812(0.0)      0.8023(0.0)      0.7788(0.0)
Um           0.7060(0.0)      0.4698(→ 0)      0.4094(0.0)      0.7134(0.0)      0.7611(0.0)      0.7060(0.0)
dθ(Ui, Uj)   0.7814(0.0)      0.5361(→ 0)      0.5210(0.0)      0.7840(0.0)      0.8041(0.0)      0.7814(0.0)
Lm           0.7085(0.0)      0.4795(→ 0)      0.4144(0.0)      0.7158(0.0)      0.7628(0.0)      0.7085(0.0)
α_Equal      0.6783(0.0)      0.4535(→ 0)      0.3793(0.0)      0.6858(0.0)      0.7495(0.0)      0.6783(0.0)
α(0)_Eigen   0.7788(0.0)      0.5341(→ 0)      0.5162(0.0)      0.7817(0.0)      0.8023(0.0)      0.7788(0.0)
GeARS        0.7914(0.0)      0.5444(→ 0)      0.5376(0.0)      0.7936(0.0)      0.8104(0.0)      0.7914(0.0)

STAD
Best view    0.5413(0.0)      0.2282(0.0)      0.1927(0.0)      0.5469(0.0)      0.6509(0.0)      0.5867(0.0)
UJoint       0.4297(0.0)      0.1178(0.0)      0.0636(0.0)      0.4597(0.0)      0.5971(0.0)      0.5000(0.0)
Um           0.7148(0.0)      0.4548(→ 0)      0.3591(0.0)      0.7164(0.0)      0.7264(0.0)      0.7148(0.0)
dθ(Ui, Uj)   0.7768(0.0)      0.4648(→ 0)      0.4710(0.0)      0.7766(0.0)      0.7682(0.0)      0.7768(0.0)
Lm           0.7685(0.0)      0.4537(→ 0)      0.4508(0.0)      0.7687(0.0)      0.7615(0.0)      0.7685(0.0)
α_Equal      0.7933(0.0)      0.5038(→ 0)      0.5041(0.0)      0.7956(0.0)      0.7864(0.0)      0.7933(0.0)
α(0)_Eigen   0.7636(1.7e-3)   0.4452(6.4e-4)   0.4566(3.6e-3)   0.7650(4.8e-3)   0.7635(7.5e-4)   0.7636(1.7e-3)
GeARS        0.7933(0.0)      0.4970(→ 0)      0.5059(0.0)      0.7942(0.0)      0.7847(0.0)      0.7933(0.0)

subspaces. To study the significance of this, the performance of the proposed algorithm
is compared with that of the case where the Laplacians are fixed to the original ones
obtained from the input graphs, and only the indicator subspaces are optimized. The ‘Lm
optimize’ component reports the results of this case in Tables 7.2 and 7.3. Similar to the
other cases, the proposed GeARS algorithm outperforms this case on all data sets.
The difference in performance is significant for the BRCA and OV data sets, and marginal
for the BBC and 100Leaves data sets. In this case also, the effect of Lm optimization on
global clustering performance is indirect, as a change in the Lm's induces a change in the
individual indicator subspaces Um's, which in turn can influence the global cluster structure
reflected in UJoint. Even so, there is a decrease in performance when the
Lm's are not included in the optimization model.

7.5.5.5 Importance of Weight Updation
The proposed algorithm initializes the weight of each view based on its Laplacian eigenvalues,
which capture a notion of cluster separability. To study the importance of this weight
initialization and its subsequent optimization, the performance of the proposed algorithm
is compared with two cases: one where all the views are equally weighted, and another where
the weights are fixed to their eigenvalue based initial weights. In both of these cases, the
weights are kept fixed throughout the optimization procedure, in order to study the impact
of weight updation. The former case is denoted by ‘α_Equal’ in Tables 7.2 and 7.3,
while the latter is denoted by ‘α(0)_Eigen’. Tables 7.2 and 7.3 show that for all data sets,
except STAD, the eigen weighted combination of views gives better clustering performance
compared to the equally weighted combination. However, the proposed GeARS algorithm,
which iteratively optimizes the eigenvalue based initial weights, outperforms both these
cases on each data set, except STAD. For the STAD data set, the equally weighted combination
marginally outperforms GeARS on three indices, NMI, F-measure, and Rand index.

7.5.6 Comparison with Existing Approaches


The multi-view clustering performance of the proposed GeARS algorithm is compared with
that of several existing approaches on benchmark and cancer data sets. The comparative
results are provided in Tables 7.4, 7.5, and 7.6.

7.5.6.1 Performance Analysis on Benchmark Data Sets


For the benchmark data sets, the performance of GeARS is compared with that of nine
state-of-the-art algorithms, namely, co-regularized spectral clustering (CoregSC) [120],
multi-view k-means clustering (MKC) [26], adaptive structure-based multi-view cluster-
ing (ASMV) [272], multiple graph learning (MGL) [164], multi-view clustering with graph
learning (MCGL) [273], multi-view spectral clustering (MSC) [246], convex combination of
approximate graph Laplacians (CoALa) (proposed in Chapter 5) [113], graph-based multi-
view clustering (GMC) [236], and multi-manifold integrative clustering (MiMIC) (proposed
in Chapter 6) [114]. Among these approaches, CoregSC is a co-training based approach,
MKC is multi-subspace based, MiMIC is based on manifold clustering, while all others are
graph based approaches.
The comparative results are provided in Table 7.4 for the benchmark data sets. It
shows that the proposed GeARS algorithm gives the best performance on all benchmark
data sets across all measures, except for the NMI measure on Digits data set. The proposed
algorithm performs third best in NMI after the graph based MCGL and GMC approaches.
Among the existing approaches, two recently proposed graph based approaches, namely,
GMC and CoALa outperform the co-training, multi-subspace clustering, and other graph
based approaches. Nevertheless, the MiMIC algorithm mostly outperforms both GMC
and CoALa on the data sets. The GMC is a graph fusion based approach that automatically
computes graph weights and produces clusters without any additional clustering step. The
CoALa algorithm on the other hand fuses together approximate Laplacians, but all the
graph weights are fixed a priori during fusion. Furthermore, both these algorithms perform
Euclidean space optimization focusing only on constructing a fused graph, not aiming to
preserve the complementary cluster structure of individual graphs. Although the manifold

Table 7.4: Comparative Performance Analysis of Proposed and Existing Multi-View Clustering
Algorithms on Benchmark Data Sets

Algorithm   Accuracy          NMI               ARI               F-measure

Digits
MKC         0.4924(2.77e-1)   0.5325(3.68e-1)   0.4280(2.99e-1)   0.5130(2.33e-2)
CoregSC     0.7556(5.96e-2)   0.7421(3.27e-2)   0.6885(5.73e-2)   0.6934(5.11e-2)
MSC         0.7918(8.21e-2)   0.7560(3.24e-2)   0.6803(6.28e-2)   0.7129(5.58e-2)
ASMV        0.5745(→ 0)       0.6709(→ 0)       0.4047(→ 0)       0.4852(→ 0)
MGL         0.7440(8.19e-2)   0.8264(4.73e-2)   0.6888(1.07e-1)   0.7238(9.37e-2)
MCGL        0.8530(0.0)       0.9055(0.0)       0.8313(→ 0)       0.8493(→ 0)
CoALa       0.8835(0.0)       0.7981(→ 0)       0.7645(0.0)       0.8839(0.0)
GMC         0.8820(→ 0)       0.9050(→ 0)       0.8502(→ 0)       0.8658(→ 0)
MiMIC       0.9207(4.21e-4)   0.8597(4.88e-4)   0.8352(8.18e-4)   0.9209(4.15e-4)
GeARS       0.9321(3.16e-4)   0.8683(5.59e-4)   0.8557(6.27e-4)   0.9325(3.21e-4)

3Sources
MKC         0.4663(1.06e-1)   0.3665(1.00e-1)   0.2461(1.40e-1)   0.4114(1.08e-1)
CoregSC     0.5479(2.99e-2)   0.5238(1.98e-2)   0.3339(2.85e-2)   0.4775(1.91e-2)
MSC         0.4751(2.97e-2)   0.3850(2.27e-2)   0.2618(3.81e-2)   0.4087(3.05e-2)
ASMV        0.3373(→ 0)       0.0896(→ 0)       -0.021(→ 0)       0.3528(→ 0)
MGL         0.6751(6.67e-2)   0.5768(8.61e-2)   0.4431(1.17e-1)   0.5966(7.12e-2)
MCGL        0.3077(→ 0)       0.1034(→ 0)       -0.033(→ 0)       0.3417(0.0)
CoALa       0.6508(0.0)       0.6198(→ 0)       0.5183(0.0)       0.6929(0.0)
GMC         0.6923(→ 0)       0.6216(0.0)       0.4431(0.0)       0.6047(0.0)
MiMIC       0.7360(5.92e-2)   0.6433(3.59e-2)   0.5957(6.69e-2)   0.7581(5.92e-2)
GeARS       0.7786(7.48e-3)   0.6823(1.06e-2)   0.6591(2.09e-2)   0.8063(1.07e-2)

BBC
MKC         0.6034(1.10e-1)   0.4786(8.51e-2)   0.3450(1.21e-1)   0.5018(9.03e-2)
CoregSC     0.4701(0.0)       0.2863(0.0)       0.2727(0.0)       0.4879(0.0)
MSC         0.6732(4.94e-2)   0.5531(1.44e-2)   0.4658(2.20e-2)   0.5877(1.83e-2)
ASMV        0.3372(0.0)       0.0348(0.0)       0.0018(→ 0)       0.3781(0.0)
MGL         0.5396(1.10e-1)   0.3697(1.89e-1)   0.3153(1.66e-1)   0.5402(8.53e-2)
MCGL        0.3533(→ 0)       0.0741(→ 0)       0.0053(→ 0)       0.3762(0.0)
CoALa       0.8108(4.36e-3)   0.6536(1.96e-2)   0.7102(2.78e-2)   0.8138(9.93e-4)
GMC         0.6934(→ 0)       0.5628(0.0)       0.4789(→ 0)       0.6333(→ 0)
MiMIC       0.8715(0.0)       0.7182(→ 0)       0.7273(0.0)       0.8613(0.0)
GeARS       0.8804(1.74e-3)   0.7335(5.84e-3)   0.7566(8.23e-3)   0.8710(1.98e-3)

100Leaves
MKC         0.0100(0.0)       0.0000(0.0)       0.0000(0.0)       0.0186(0.0)
CoregSC     0.7706(2.58e-2)   0.9165(5.90e-3)   0.7229(1.92e-2)   0.7257(1.90e-2)
MSC         0.7379(2.21e-2)   0.9014(7.60e-3)   0.6788(2.26e-2)   0.6821(2.23e-2)
ASMV        0.7906(→ 0)       0.9009(→ 0)       0.6104(→ 0)       0.6148(→ 0)
MGL         0.6904(2.42e-2)   0.8753(7.60e-3)   0.3858(5.65e-2)   0.3944(5.53e-2)
MCGL        0.8106(→ 0)       0.9130(0.0)       0.5155(→ 0)       0.5217(0.0)
CoALa       0.7384(1.34e-2)   0.8893(4.06e-3)   0.6550(1.41e-2)   0.7672(1.19e-2)
GMC         0.8238(→ 0)       0.9292(0.0)       0.4974(→ 0)       0.5042(→ 0)
MiMIC       0.8185(1.56e-2)   0.9302(4.12e-3)   0.7431(2.53e-2)   0.8492(1.13e-2)
GeARS       0.8372(1.79e-2)   0.9346(3.71e-3)   0.7643(1.97e-2)   0.8618(1.11e-2)

based algorithm, MiMIC, optimizes over the non-linear Steifel manifold to better capture
the lower-dimensional non-linear geometry of complex data sets, it does not optimize the
Laplacians based on the indicator subspaces. The graph weights, like CoALa, are also fixed

210
a priori. However, the increased performance of the proposed GeARS algorithm for all data
sets in Table 7.4 establishes the importance of joint and individual subspace optimization,
as well as graph and its corresponding weight updation in context of multi-view clustering.
Table 7.5: Comparative Performance Analysis of Proposed and Existing Subtype Identification
Algorithms on Multi-Omics Cancer Data Sets: OV, LGG, BRCA, and STAD
Data Set  Algorithm   Accuracy         NMI              ARI              F-measure        Rand             Purity
OV        COCA        0.5943 (7.0e-3)  0.3131 (1.2e-2)  0.2810 (6.8e-3)  0.6068 (4.2e-3)  0.7039 (2.6e-3)  0.5943 (7.0e-3)
OV        NormS       0.6976 (0.0)     0.4504 (0.0)     0.4142 (0.0)     0.6910 (0.0)     0.7766 (0.0)     0.6976 (0.0)
OV        LRAcluster  0.6287 (0.0)     0.3745 (→ 0)     0.2999 (0.0)     0.6384 (0.0)     0.7322 (0.0)     0.6287 (0.0)
OV        iCluster    0.5089 (0.0)     0.2249 (→ 0)     0.2005 (0.0)     0.4808 (0.0)     0.6916 (0.0)     0.5119 (0.0)
OV        PCA-con     0.6946 (0.0)     0.4424 (→ 0)     0.4068 (0.0)     0.6868 (0.0)     0.7734 (0.0)     0.6946 (0.0)
OV        SURE        0.7215 (0.0)     0.4680 (→ 0)     0.4372 (0.0)     0.7148 (0.0)     0.7857 (0.0)     0.7215 (0.0)
OV        JIVE        0.5718 (7.7e-3)  0.2629 (8.4e-3)  0.2027 (4.2e-3)  0.5653 (7.8e-3)  0.6885 (2.8e-3)  0.5718 (7.7e-3)
OV        SNF         0.5269 (0.0)     0.2753 (0.0)     0.2058 (0.0)     0.5642 (0.0)     0.6557 (0.0)     0.5389 (0.0)
OV        CoALa       0.6736 (0.0)     0.3381 (→ 0)     0.3199 (0.0)     0.6700 (0.0)     0.7379 (0.0)     0.6736 (0.0)
OV        MiMIC       0.6595 (2.8e-3)  0.3271 (3.9e-4)  0.3112 (4.2e-3)  0.6611 (2.5e-3)  0.7383 (1.9e-3)  0.6595 (2.8e-3)
OV        GeARS       0.7023 (2.8e-3)  0.3687 (1.7e-3)  0.3735 (4.8e-3)  0.7035 (3.0e-3)  0.7621 (2.2e-3)  0.7023 (2.8e-3)

LGG       COCA        0.6591 (0.0)     0.2772 (0.0)     0.2533 (0.0)     0.6608 (0.0)     0.6454 (0.0)     0.6591 (0.0)
LGG       NormS       0.7940 (0.0)     0.5325 (0.0)     0.4649 (0.0)     0.7916 (0.0)     0.7465 (0.0)     0.7940 (0.0)
LGG       LRAcluster  0.4719 (0.0)     0.1240 (→ 0)     0.1030 (0.0)     0.5137 (0.0)     0.5831 (0.0)     0.5280 (0.0)
LGG       iCluster    0.4382 (0.0)     0.1379 (→ 0)     0.0996 (0.0)     0.5187 (0.0)     0.5821 (0.0)     0.5355 (0.0)
LGG       PCA-con     0.6666 (0.0)     0.3438 (0.0)     0.3031 (0.0)     0.6574 (0.0)     0.6616 (0.0)     0.6666 (0.0)
LGG       SURE        0.7940 (0.0)     0.5335 (0.0)     0.4668 (0.0)     0.7904 (0.0)     0.7465 (0.0)     0.7940 (0.0)
LGG       JIVE        0.5617 (0.0)     0.2299 (→ 0)     0.1606 (0.0)     0.5757 (0.0)     0.6056 (0.0)     0.5730 (0.0)
LGG       SNF         0.8689 (0.0)     0.6253 (0.0)     0.6331 (0.0)     0.8720 (0.0)     0.8268 (0.0)     0.8689 (0.0)
LGG       CoALa       0.9737 (0.0)     0.8689 (→ 0)     0.9199 (0.0)     0.9737 (0.0)     0.9622 (0.0)     0.9737 (0.0)
LGG       MiMIC       0.9625 (0.0)     0.8543 (→ 0)     0.8790 (0.0)     0.9623 (0.0)     0.9424 (0.0)     0.9625 (0.0)
LGG       GeARS       0.9887 (0.0)     0.9397 (→ 0)     0.9655 (0.0)     0.9887 (0.0)     0.9837 (0.0)     0.9887 (0.0)

BRCA      COCA        0.7434 (7.9e-4)  0.5002 (3.4e-4)  0.4864 (4.5e-4)  0.7457 (8.1e-4)  0.7905 (1.9e-4)  0.7434 (7.9e-4)
BRCA      NormS       0.7688 (0.0)     0.4287 (→ 0)     0.5090 (0.0)     0.7699 (0.0)     0.7999 (0.0)     0.7688 (0.0)
BRCA      LRAcluster  0.7110 (0.0)     0.5437 (→ 0)     0.4035 (0.0)     0.7101 (0.0)     0.7521 (0.0)     0.7110 (0.0)
BRCA      iCluster    0.7638 (0.0)     0.5176 (→ 0)     0.4745 (0.0)     0.7658 (0.0)     0.7842 (0.0)     0.7638 (0.0)
BRCA      PCA-con     0.7587 (0.0)     0.5506 (→ 0)     0.5038 (0.0)     0.7601 (0.0)     0.7984 (0.0)     0.7587 (0.0)
BRCA      SURE        0.7663 (0.0)     0.4558 (0.0)     0.5104 (0.0)     0.7683 (0.0)     0.8010 (0.0)     0.7663 (0.0)
BRCA      JIVE        0.6859 (0.0)     0.4368 (0.0)     0.3772 (0.0)     0.6889 (0.0)     0.7464 (0.0)     0.6859 (0.0)
BRCA      SNF         0.6783 (0.0)     0.5528 (→ 0)     0.4111 (0.0)     0.6865 (0.0)     0.7602 (0.0)     0.6959 (0.0)
BRCA      CoALa       0.7613 (0.0)     0.5281 (→ 0)     0.4874 (0.0)     0.7660 (0.0)     0.7922 (0.0)     0.7613 (0.0)
BRCA      MiMIC       0.7964 (0.0)     0.5553 (→ 0)     0.5474 (0.0)     0.7997 (0.0)     0.8152 (0.0)     0.7964 (0.0)
BRCA      GeARS       0.7914 (0.0)     0.5444 (→ 0)     0.5376 (0.0)     0.7936 (0.0)     0.8104 (0.0)     0.7914 (0.0)

STAD      COCA        0.4450 (3.3e-2)  0.1309 (4.7e-3)  0.0740 (1.0e-2)  0.4558 (2.5e-2)  0.5981 (1.3e-2)  0.5173 (9.5e-3)
STAD      NormS       0.5702 (0.0)     0.1805 (→ 0)     0.1625 (0.0)     0.5770 (0.0)     0.6435 (0.0)     0.5950 (0.0)
STAD      LRAcluster  0.4256 (0.0)     0.1259 (→ 0)     0.0912 (0.0)     0.4746 (0.0)     0.6122 (0.0)     0.5619 (0.0)
STAD      iCluster    0.3512 (0.0)     0.0650 (→ 0)     0.0288 (0.0)     0.3832 (0.0)     0.5855 (0.0)     0.4917 (0.0)
STAD      PCA-con     0.6900 (0.0)     0.3654 (0.0)     0.3204 (0.0)     0.6959 (0.0)     0.7110 (0.0)     0.6900 (0.0)
STAD      SURE        0.6983 (0.0)     0.3511 (→ 0)     0.3445 (0.0)     0.7056 (0.0)     0.7216 (0.0)     0.6983 (0.0)
STAD      JIVE        0.4049 (0.0)     0.1288 (→ 0)     0.0657 (0.0)     0.4487 (0.0)     0.5981 (0.0)     0.5165 (0.0)
STAD      SNF         0.5661 (0.0)     0.4558 (0.0)     0.1522 (0.0)     0.5521 (0.0)     0.6945 (0.0)     0.6363 (0.0)
STAD      CoALa       0.7685 (0.0)     0.5107 (0.0)     0.4559 (0.0)     0.7778 (0.0)     0.7661 (0.0)     0.7685 (0.0)
STAD      MiMIC       0.7727 (0.0)     0.5220 (→ 0)     0.4650 (0.0)     0.7830 (0.0)     0.7698 (0.0)     0.7727 (0.0)
STAD      GeARS       0.7933 (0.0)     0.4970 (→ 0)     0.5059 (0.0)     0.7942 (0.0)     0.7847 (0.0)     0.7933 (0.0)

7.5.6.2 Performance Analysis on Cancer Data Sets
In case of the cancer data sets, the performance of proposed GeARS algorithm is com-
pared with ten multi-omics cancer subtyping algorithms, namely, LRAcluster [243], iClus-
ter [192], multivariate normality based joint subspace clustering (NormS) (proposed in
Chapter 3) [111], cluster of cluster analysis (COCA) [93], joint and individual variance ex-
plained (JIVE) [141], selective update of relevant eigenspaces (SURE) (proposed in Chapter
4) [112], principal component analysis on naively concatenated data (PCA-con), similarity
network fusion (SNF) [234], CoALa (proposed in Chapter 5) [113], and MiMIC (proposed
in Chapter 6) [114]. The experimental setup followed for the existing multi-omics cancer
subtyping algorithms is the same as that followed in Chapter 3.
The comparative results are reported in Tables 7.5 and 7.6. All the results show that for
LGG and STAD data sets, the proposed algorithm outperforms all the existing ones with
respect to all the clustering indices, except for the NMI measure on STAD data set. For
the OV data set, SVD based SURE algorithm of Chapter 4 has the highest performance,
while the MiMIC algorithm, proposed in Chapter 6, has that for the BRCA data set,
outperforming the proposed one by a small margin. Apart from COCA, which is a two-stage
consensus clustering approach, all the approaches studied in Tables 7.5 and 7.6 perform
clustering on a low-rank subspace. For NormS, LRAcluster, and iCluster algorithms, the
subspace is based on a probabilistic model, for JIVE, SURE, and PCA-con, the subspace is
SVD based variance maximization subspace, while for CoALa, SNF, MiMIC, and GeARS,
it is the graph-cut minimization based spectral clustering subspace. The results in Tables
7.5 and 7.6 show that except for the CESC and OV data sets, in general, the spectral
clustering subspace outperforms the probabilistic and the variance maximization based
subspaces. A possible explanation for this is that the probabilistic model often fits poorly
in real-life data sets, while the variance maximization property tends to reflect the variance
due to the cluster pattern as well as noise in its principal subspace. The combined results of
Tables 7.5 and 7.6 show that the proposed GeARS algorithm has the best performance for
LGG and STAD data sets, and has the second or third best performance in the remaining
six cancer data sets, thus achieving competitive results with respect to the state-of-the-art
in all data sets.
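
For readers who wish to reproduce such comparisons, the following is a minimal sketch of
how three of the external indices reported in Tables 7.5 and 7.6 (accuracy, Rand index,
and purity) are commonly computed from a ground-truth labelling and a predicted partition.
The function names and the toy labels are illustrative assumptions, and the definitions
follow the standard textbook forms, which may differ in detail from those used to obtain
the reported results.

```python
from itertools import permutations

import numpy as np


def contingency(y_true, y_pred):
    """Count matrix C[i, j]: samples in true class i that fall in cluster j."""
    classes, ti = np.unique(y_true, return_inverse=True)
    clusters, pj = np.unique(y_pred, return_inverse=True)
    C = np.zeros((classes.size, clusters.size), dtype=int)
    np.add.at(C, (ti, pj), 1)
    return C


def purity(y_true, y_pred):
    """Fraction of samples that belong to the majority class of their cluster."""
    C = contingency(y_true, y_pred)
    return C.max(axis=0).sum() / C.sum()


def rand_index(y_true, y_pred):
    """Fraction of sample pairs on which the two partitions agree."""
    C = contingency(y_true, y_pred)
    n = C.sum()
    pairs = n * (n - 1) / 2
    same_both = (C * (C - 1) / 2).sum()
    same_true = (C.sum(axis=1) * (C.sum(axis=1) - 1) / 2).sum()
    same_pred = (C.sum(axis=0) * (C.sum(axis=0) - 1) / 2).sum()
    # Agreements = pairs together in both + pairs apart in both.
    return (pairs + 2 * same_both - same_true - same_pred) / pairs


def accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one cluster-to-class matching.

    Brute force over permutations, so intended only for a small number of clusters.
    """
    C = contingency(y_true, y_pred)
    k = max(C.shape)
    P = np.zeros((k, k), dtype=int)
    P[: C.shape[0], : C.shape[1]] = C
    best = max(sum(P[i, p[i]] for i in range(k)) for p in permutations(range(k)))
    return best / C.sum()


# Hypothetical toy labels, not taken from any data set in the thesis.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([1, 1, 0, 0, 0, 0])
print(accuracy(y_true, y_pred), rand_index(y_true, y_pred), purity(y_true, y_pred))
```

For a larger number of clusters, the permutation search in the accuracy function would
normally be replaced by an assignment solver.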

7.5.6.3 Performance Analysis on Social Network and General Image Data Sets
The performance of the proposed GeARS algorithm is also studied on the CORA social
network data set and two general image data sets, namely, Caltech7 and ORL. These data
sets consist of mostly network or graph based views; hence, the proposed algorithm is
compared with the SNF, CoALa, and MiMIC algorithms, which can work with graphical
representations of views. The comparative results are reported in Table 7.7. The results in Table
7.7 show that the proposed algorithm has the best clustering performance for the ORL
data set and the second best performance for Caltech7 and CORA data sets. For Caltech7
and CORA data sets, the MiMIC algorithm proposed in Chapter 6 has the best performance
for the majority of the external indices. The competitive performance of the proposed
GeARS algorithm on these data sets indicates that the algorithm can correctly identify
the community structure in large-scale social networks and recognize faces or objects from
multi-feature image data sets.

Table 7.6: Comparative Performance Analysis of Proposed and Existing Subtype Identification
Algorithms on Multi-Omics Cancer Data Sets: CRC, CESC, KIDNEY, and LUNG
Data Set  Algorithm   Accuracy          NMI               ARI               F-measure         Rand              Purity
CRC       COCA        0.5323 (5.56e-3)  0.0120 (1.27e-3)  0.0007 (1.86e-3)  0.5586 (5.56e-3)  0.5010 (6.97e-4)  0.7370 (0.0)
CRC       NormS       0.6206 (0.0)      0.0093 (0.0)      0.0347 (0.0)      0.6345 (0.0)      0.5281 (0.0)      0.7370 (0.0)
CRC       LRAcluster  0.5129 (0.0)      0.0030 (0.0)      -0.001 (0.0)      0.5410 (0.0)      0.4992 (0.0)      0.7370 (0.0)
CRC       iCluster    0.6163 (0.0)      0.0070 (0.0)      0.0293 (0.0)      0.6298 (0.0)      0.5260 (0.0)      0.7370 (0.0)
CRC       PCA-con     0.5366 (0.0)      0.0057 (0.0)      0.0037 (0.0)      0.5642 (0.0)      0.5016 (0.0)      0.7370 (0.0)
CRC       SURE        0.5107 (0.0)      0.0028 (0.0)      -0.002 (0.0)      0.5416 (0.0)      0.4991 (0.0)      0.7370 (0.0)
CRC       JIVE        0.6034 (0.0)      0.0071 (0.0)      0.0256 (0.0)      0.6210 (0.0)      0.5203 (0.0)      0.7370 (0.0)
CRC       SNF         0.5991 (0.0)      0.0069 (0.0)      0.0240 (0.0)      0.6178 (0.0)      0.5186 (0.0)      0.7370 (0.0)
CRC       CoALa       0.6400 (0.0)      0.0185 (0.0)      0.0548 (0.0)      0.6529 (0.0)      0.5382 (0.0)      0.7370 (0.0)
CRC       MiMIC       0.6228 (0.0)      0.0069 (0.0)      0.0310 (0.0)      0.6338 (0.0)      0.5291 (0.0)      0.7370 (0.0)
CRC       GeARS       0.6206 (0.0)      0.0065 (0.0)      0.0295 (0.0)      0.6321 (0.0)      0.5281 (0.0)      0.7370 (0.0)

CESC      COCA        0.6693 (0.0)      0.4172 (4.77e-3)  0.3677 (8.95e-4)  0.6865 (2.49e-3)  0.6971 (6.33e-5)  0.6774 (0.0)
CESC      NormS       0.8870 (0.0)      0.6854 (0)        0.7004 (0.0)      0.8801 (0.0)      0.8587 (0.0)      0.8870 (0.0)
CESC      LRAcluster  0.8145 (0.0)      0.5176 (0)        0.5384 (0.0)      0.8123 (0.0)      0.7867 (0.0)      0.8145 (0.0)
CESC      iCluster    0.5483 (0.0)      0.1737 (0)        0.1017 (0.0)      0.5568 (0.0)      0.5731 (0.0)      0.5645 (0.0)
CESC      PCA-con     0.8548 (0.0)      0.6750 (0)        0.6333 (0.0)      0.8390 (0.0)      0.8237 (0.0)      0.8548 (0.0)
CESC      SURE        0.8629 (0.0)      0.6461 (0.0)      0.6507 (0.0)      0.8512 (0.0)      0.8339 (0.0)      0.8629 (0.0)
CESC      JIVE        0.7177 (0.0)      0.4425 (0.0)      0.3860 (0.0)      0.7097 (0.0)      0.7164 (0.0)      0.7177 (0.0)
CESC      SNF         0.6693 (0.0)      0.4927 (0.0)      0.4239 (0.0)      0.7073 (0.0)      0.7043 (0.0)      0.6935 (0.0)
CESC      CoALa       0.8225 (0.0)      0.5479 (0.0)      0.5637 (0.0)      0.8139 (0.0)      0.7951 (0.0)      0.8225 (0.0)
CESC      MiMIC       0.8548 (0.0)      0.6451 (0)        0.6236 (0.0)      0.8418 (0.0)      0.8193 (0.0)      0.8548 (0.0)
CESC      GeARS       0.8548 (0.0)      0.6451 (0)        0.6236 (0.0)      0.8418 (0.0)      0.8193 (0.0)      0.8548 (0.0)

KIDNEY    COCA        0.9470 (0.0)      0.7493 (0.0)      0.8393 (0.0)      0.9477 (0.0)      0.9199 (0.0)      0.9470 (0.0)
KIDNEY    NormS       0.9525 (0.0)      0.7726 (0)        0.8534 (0.0)      0.9530 (0.0)      0.9269 (0.0)      0.9525 (0.0)
KIDNEY    LRAcluster  0.9538 (0.0)      0.7862 (0.0)      0.8579 (0.0)      0.9545 (0.0)      0.9292 (0.0)      0.9538 (0.0)
KIDNEY    iCluster    0.6065 (0.0)      0.2547 (0)        0.1717 (0.0)      0.6514 (0.0)      0.5842 (0.0)      0.6811 (0.0)
KIDNEY    PCA-con     0.9511 (0.0)      0.7670 (0)        0.8489 (0.0)      0.9516 (0.0)      0.9246 (0.0)      0.9511 (0.0)
KIDNEY    SURE        0.9525 (0.0)      0.7726 (0)        0.8534 (0.0)      0.9530 (0.0)      0.9269 (0.0)      0.9525 (0.0)
KIDNEY    JIVE        0.9308 (0.0)      0.6955 (0)        0.7786 (0.0)      0.9300 (0.0)      0.8893 (0.0)      0.9308 (0.0)
KIDNEY    SNF         0.9579 (0.0)      0.7946 (0.0)      0.8796 (0.0)      0.9590 (0.0)      0.9400 (0.0)      0.9579 (0.0)
KIDNEY    CoALa       0.9294 (0.0)      0.6987 (0)        0.7786 (0.0)      0.9285 (0.0)      0.8893 (0.0)      0.9294 (0.0)
KIDNEY    MiMIC       0.9552 (0.0)      0.7767 (0.0)      0.8534 (0.0)      0.9551 (0.0)      0.9268 (0.0)      0.9552 (0.0)
KIDNEY    GeARS       0.9565 (0.0)      0.7797 (0)        0.8580 (0.0)      0.9566 (0.0)      0.9291 (0.0)      0.9565 (0.0)

LUNG      COCA        0.9284 (0.0)      0.6287 (0.0)      0.7339 (0.0)      0.9283 (0.0)      0.8669 (0.0)      0.9284 (0.0)
LUNG      NormS       0.9359 (0.0)      0.6650 (0.0)      0.7597 (0.0)      0.9357 (0.0)      0.8798 (0.0)      0.9359 (0.0)
LUNG      LRAcluster  0.9344 (0.0)      0.6535 (0.0)      0.7545 (0.0)      0.9342 (0.0)      0.8772 (0.0)      0.9344 (0.0)
LUNG      iCluster    0.6333 (0.0)      0.0627 (0.0)      0.0696 (0.0)      0.6299 (0.0)      0.5348 (0.0)      0.6333 (0.0)
LUNG      PCA-con     0.9388 (0.0)      0.6773 (0)        0.7701 (0.0)      0.9386 (0.0)      0.8850 (0.0)      0.9388 (0.0)
LUNG      SURE        0.9418 (0.0)      0.6878 (0.0)      0.7806 (0.0)      0.9417 (0.0)      0.8903 (0.0)      0.9418 (0.0)
LUNG      JIVE        0.9269 (0.0)      0.6333 (0.0)      0.7288 (0.0)      0.9266 (0.0)      0.8644 (0.0)      0.9269 (0.0)
LUNG      SNF         0.9493 (0.0)      0.7152 (0.0)      0.8072 (0.0)      0.9492 (0.0)      0.9036 (0.0)      0.9493 (0.0)
LUNG      CoALa       0.9403 (0.0)      0.6970 (0.0)      0.7754 (0.0)      0.9400 (0.0)      0.8877 (0.0)      0.9403 (0.0)
LUNG      MiMIC       0.9463 (0.0)      0.7173 (0.0)      0.7965 (0.0)      0.9461 (0.0)      0.8983 (0.0)      0.9463 (0.0)
LUNG      GeARS       0.9433 (0.0)      0.7035 (0.0)      0.7859 (0.0)      0.9431 (0.0)      0.8929 (0.0)      0.9433 (0.0)

Table 7.7: Comparative Performance Analysis of Proposed and Existing Algorithms on
ORL, Caltech7, and CORA Data Sets
                      Graph Based                         Manifold Based
Data Set  Index       SNF               CoALa             MiMIC             GeARS
ORL       Accuracy    0.6907 (2.57e-2)  0.7715 (2.18e-2)  0.7307 (2.36e-2)  0.8052 (3.06e-2)
ORL       NMI         0.8616 (1.00e-2)  0.8980 (1.15e-2)  0.8814 (1.35e-2)  0.9118 (1.73e-2)
ORL       ARI         0.6054 (3.04e-2)  0.6932 (2.82e-2)  0.6208 (3.83e-2)  0.7201 (4.22e-2)
ORL       F-measure   0.7257 (2.44e-2)  0.7962 (1.78e-2)  0.7677 (2.29e-2)  0.8297 (2.61e-2)
ORL       Rand        0.9804 (2.04e-3)  0.9850 (1.63e-3)  0.9802 (2.65e-3)  0.9864 (2.28e-3)
ORL       Purity      0.7450 (2.26e-2)  0.8090 (1.75e-2)  0.7737 (1.78e-2)  0.8315 (2.66e-2)

Caltech7  Accuracy    0.5440 (3.42e-2)  0.5685 (0.0)      0.5773 (0.0)      0.5852 (2.31e-2)
Caltech7  NMI         0.5676 (2.41e-2)  0.5650 (0)        0.5880 (0)        0.5734 (2.69e-2)
Caltech7  ARI         0.4126 (2.86e-2)  0.4397 (0.0)      0.4608 (0.0)      0.4582 (3.65e-2)
Caltech7  F-measure   0.6363 (4.12e-2)  0.6689 (0.0)      0.6600 (0.0)      0.6761 (2.99e-2)
Caltech7  Rand        0.7482 (1.13e-2)  0.7583 (0.0)      0.7674 (0.0)      0.7666 (1.49e-2)
Caltech7  Purity      0.8516 (1.09e-2)  0.8548 (0.0)      0.8751 (0.0)      0.8603 (7.65e-3)

CORA      Accuracy    0.5450 (2.79e-2)  0.5896 (3.41e-3)  0.6120 (2.46e-3)  0.5801 (3.35e-2)
CORA      NMI         0.3829 (1.14e-2)  0.4364 (2.81e-3)  0.4686 (6.17e-3)  0.4416 (1.64e-2)
CORA      ARI         0.2941 (1.86e-2)  0.3256 (2.87e-3)  0.3479 (3.73e-3)  0.3079 (4.17e-2)
CORA      F-measure   0.5957 (1.96e-2)  0.5844 (4.98e-3)  0.6373 (3.50e-3)  0.6140 (1.96e-2)
CORA      Rand        0.7936 (9.31e-3)  0.7460 (2.14e-3)  0.7709 (9.35e-4)  0.7736 (9.66e-3)
CORA      Purity      0.6012 (1.62e-2)  0.6206 (3.41e-3)  0.6423 (2.46e-3)  0.6217 (2.38e-2)

7.6 Conclusion
This chapter presents a multi-view clustering algorithm based on line-search optimization
of two different Riemannian manifolds, namely, Grassmannian and SPD manifolds. While
the Grassmannian manifold is used to optimize the lower dimensional cluster indicator sub-
spaces corresponding to different views, the SPD manifold optimizes the graph structure
represented by the corresponding Laplacians. The SPD manifold automatically preserves
the symmetry and positive definiteness of the Laplacians during optimization. Additionally,
the basis invariance property of the Grassmannian manifold finds cluster indicator
subspaces as opposed to representative indicator matrices. The convergence and asymptotic
properties of the proposed line-search algorithm are analyzed in order to predict noise and
separability of the clusters in the data set. The matrix perturbation theory is used to derive
a theoretical upper bound on the Grassmannian distance between the joint and individual
clustering subspaces. The distance is also empirically shown to decrease as the algorithm
converges to a local minimum of the objective function. The clustering performance of the
proposed algorithm is studied and compared with that of several state-of-the-art multi-view
clustering approaches on four benchmark and eight multi-omics cancer data sets. Exper-
imental results show that simultaneous optimization of the clustering subspaces, graph
Laplacians, and their corresponding weights, in the proposed manifold algorithm, has su-
perior performance in several data sets, compared to existing algorithms that optimize over
the Euclidean space or only a subset of the variables.
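
To make the notions of basis invariance and Grassmannian distance used above concrete,
the following is a minimal sketch, assuming the geodesic distance defined through
principal angles; it is not the implementation used in this chapter. The function name
grassmann_distance and the random test matrices are illustrative assumptions.

```python
import numpy as np


def grassmann_distance(A, B):
    """Geodesic distance between span(A) and span(B), from principal angles."""
    Qa, _ = np.linalg.qr(A)  # orthonormal basis of span(A)
    Qb, _ = np.linalg.qr(B)  # orthonormal basis of span(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    return np.linalg.norm(angles)


rng = np.random.default_rng(0)
U = rng.standard_normal((50, 3))
R = rng.standard_normal((3, 3))  # change of basis, invertible with probability one
V = rng.standard_normal((50, 3))

d_same = grassmann_distance(U, U @ R)  # same subspace, different basis
d_diff = grassmann_distance(U, V)      # generically distinct subspaces
print(d_same, d_diff)
```

The first distance is zero up to floating point error, because U and U @ R span the same
subspace even though the two matrices differ; this is the sense in which a point on the
Grassmannian manifold is a subspace rather than a particular indicator matrix.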

Chapter 8

Conclusion and Future Directions

This chapter summarizes the major contributions of the research reported in different chap-
ters of the thesis. It also provides future research directions, including possible extensions
and applications of the proposed research work, in multi-view clustering.

8.1 Major Contributions


The thesis presents different approaches for multi-view data clustering. Primarily, there are
four major challenges in multi-view clustering: (i) the high-dimensional low-sample size nature
of views, (ii) the selection of relevant and informative views over noisy and redundant ones
during data integration, (iii) preventing the propagation of noise from the real-life individual
views to the joint one during information fusion, and (iv) modelling the lower dimensional
non-linear geometry of views. The algorithms proposed in this thesis address these issues
using three different baseline strategies, namely, subspace based approach (Chapters 3 and
4), graph based approach (Chapter 5), and manifold based approach (Chapters 6 and 7).
A brief summary, highlighting the key attributes of the proposed approaches, is given
below.
Chapter 3 presents a new algorithm for the extraction of a low-rank joint subspace from
high-dimensional multi-view data sets. The algorithm uses hypothesis testing to estimate
efficiently the rank of each individual view by separating its signal or structural component
from the noise component. In order to address the major challenge of appropriate view
selection during data integration, two evaluation measures are proposed. One evaluates the
relevance of a view in terms of the quality of cluster structure embedded within it, while
the other measures the amount of shared information contained within the views. The
views with highest relevance and maximum shared information are selected for integration.
Next, in Chapter 4, in order to reduce the computational complexity of joint subspace
construction, the problem of updating the SVD of a data matrix is formulated and solved
for multi-view data sets. The theoretical formulation introduced in this chapter enables
the proposed algorithm to extract the principal components in less time compared to
performing PCA on the concatenated data. Some new quantitative indices are proposed to
theoretically evaluate the gap between the joint subspace extracted by the proposed algorithm
and the principal subspace extracted by PCA. Similar to the previous chapter, the algorithm
proposed in Chapter 4 also evaluates and then integrates only the relevant views for
joint eigenspace construction. The effectiveness of the algorithms proposed in Chapters 3
and 4 is studied and compared with several existing integrative clustering approaches on
real-life multi-omics cancer data sets.
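
As a generic illustration of what updating the SVD for a newly added view involves, the
sketch below applies the standard column-append update: the SVD of the first view is
reused, and only a small core matrix is decomposed when the second view arrives. This is
a textbook construction shown under stated assumptions, not the specific algorithm of
Chapter 4; the function name append_view_svd and the random views are hypothetical.

```python
import numpy as np


def append_view_svd(U, s, Vt, X_new):
    """Update a thin SVD X = U diag(s) Vt after appending the columns X_new."""
    r, d_new = s.size, X_new.shape[1]
    P = U.T @ X_new                      # component of X_new inside span(U)
    Q, Rq = np.linalg.qr(X_new - U @ P)  # orthonormal basis of the residual
    # Small core matrix whose SVD yields the updated factors.
    K = np.block([[np.diag(s), P],
                  [np.zeros((Rq.shape[0], r)), Rq]])
    Uk, sk, Vkt = np.linalg.svd(K, full_matrices=False)
    U_upd = np.hstack([U, Q]) @ Uk
    V_old = np.block([[Vt.T, np.zeros((Vt.shape[1], d_new))],
                      [np.zeros((d_new, r)), np.eye(d_new)]])
    Vt_upd = (V_old @ Vkt.T).T
    return U_upd, sk, Vt_upd


rng = np.random.default_rng(1)
X1 = rng.standard_normal((30, 6))  # first view: 30 samples, 6 features
X2 = rng.standard_normal((30, 4))  # second view: same samples, 4 features
U, s, Vt = np.linalg.svd(X1, full_matrices=False)
U2, s2, Vt2 = append_view_svd(U, s, Vt, X2)
```

Here no truncation is applied, so the updated factors reproduce the SVD of the
concatenated matrix up to numerical error. In practice the saving comes from keeping only
the leading components of each factor, at the cost of an approximation error.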
Chapter 5 presents a novel algorithm, for the integration of multiple similarity graphs,
that prevents the noise of the individual graphs from being propagated into the unified
one. The algorithm first approximates each graph using the most informative eigenpairs
of its Laplacian, which contain its cluster information. Thus, the noise in the individual
graphs is not reflected in their approximations. These denoised approximations are then
integrated for the construction of a low-rank subspace that best preserves the overall cluster
structure of multiple graphs. Using the matrix perturbation theory, theoretical bounds are
derived as a function of the approximation rank, in order to precisely evaluate how far the
approximate subspace deviates from the full-rank subspace. The clustering performance of
the approximate subspace is compared with that of different existing integrative clustering
approaches on several real-life cancer data sets as well as benchmark data sets from varying
application domains.
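
The general idea of approximating each graph by a few eigenpairs before fusion can be
sketched as follows. Several choices here are assumptions made only for illustration and
are not taken from Chapter 5: the normalized affinity (identity minus the symmetric
normalized Laplacian) is used so that the informative eigenpairs are the largest ones,
the convex weights are fixed to be equal, and the toy graphs are synthetic.

```python
import numpy as np


def normalized_affinity(W):
    """D^{-1/2} W D^{-1/2}, i.e. identity minus the symmetric normalized Laplacian."""
    inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return W * inv_sqrt[:, None] * inv_sqrt[None, :]


def low_rank_approx(S, r):
    """Keep the r eigenpairs of the symmetric matrix S with the largest eigenvalues."""
    vals, vecs = np.linalg.eigh(S)  # eigenvalues in ascending order
    vals, vecs = vals[-r:], vecs[:, -r:]
    return (vecs * vals) @ vecs.T


def joint_subspace(graphs, weights, r, k):
    """Convex combination of rank-r approximations, then its top-k eigenvectors."""
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)
    S_joint = sum(a * low_rank_approx(normalized_affinity(W), r)
                  for a, W in zip(weights, graphs))
    _, vecs = np.linalg.eigh(S_joint)
    return vecs[:, -k:]


# Two synthetic views of 20 samples with two clusters and different noise levels.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 10)


def toy_graph(noise):
    W = np.where(labels[:, None] == labels[None, :], 1.0, 0.1)
    E = noise * rng.random((20, 20))
    return W + (E + E.T) / 2  # keep the similarity matrix symmetric


graphs = [toy_graph(0.05), toy_graph(0.10)]
U_joint = joint_subspace(graphs, weights=[0.5, 0.5], r=2, k=2)
```

The rows of U_joint would then be clustered, for example with k-means, as in standard
spectral clustering.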
Chapter 6 presents a novel manifold optimization-based algorithm for integrative clus-
tering of high-dimensional multi-view data sets. A joint clustering objective is optimized
over two different manifolds, namely, k-means and Stiefel manifolds. The Stiefel manifold
models the differential clusters of the individual views, whereas the k-means manifold tries
to infer the best-fit global cluster structure in the data. The optimization is performed
separately along the manifolds of each view so that individual non-linearity within each
view is not lost while looking for the shared cluster information. The convergence of the
proposed algorithm is theoretically established over the manifold, while the analysis of its
asymptotic behavior quantifies how fast it converges to an optimal solution. Chapter 7,
on the other hand, demonstrates that simultaneous optimization of the individual graph
structures, their weights, and the joint and individual subspaces, is likely to give a more
comprehensive idea of the clusters present in the data set. It presents another manifold
optimization algorithm that harnesses the geometry and structure preserving properties
of the symmetric positive definite (SPD) manifold and the Grassmannian manifold for efficient
multi-view clustering. The SPD manifold is used to optimize the graph Laplacians corresponding
to the individual views while preserving their symmetry, positive definiteness,
and related properties. The Grassmannian manifold is used to optimize and reduce the
disagreement between the joint and individual clustering subspaces. The clustering per-
formance of the manifold optimization algorithms proposed in Chapters 6 and 7 is studied
and compared with several state-of-the-art integrative clustering approaches on various
multi-omics cancer and benchmark data sets.
The concept of approximate graph Laplacians proposed in this thesis is unique.

8.2 Future Directions


There are many important aspects of the research reported in this thesis that can be
extended for the advancement of multi-view data analysis. Future directions are listed
as follows:

1. Eigenspace model when data does not follow a mixture of Gaussians: The
basic assumption of the signal-plus-noise model proposed in Section 3.2 of Chapter 3 is
that the data in each view is drawn from a mixture of Gaussian distributions. Under
this assumption, the signal component of each view and its corresponding rank is
estimated by those principal components which show deviance from the multivariate
normal distribution. However, real-life data may sometimes fail to satisfy the mixture
of Gaussian assumption, for which generalized models of signal and noise component
estimation may be developed.

2. Parallel computation of joint eigenspace: The joint eigenspace of the integrated
data, proposed in Section 4.3.1 of Chapter 4, is constructed sequentially in M steps
for M views, X_1, ..., X_M. A possible extension of this model is to reduce the
computational complexity by constructing the joint eigenspace in parallel, in a single step,
from the individual eigenspaces. Parallel computation would involve solving a single
SVD problem of larger size compared to those solved in each of the M steps of
sequential eigenspace construction.

3. Non-linear combination of graph Laplacians: In Chapter 5, the joint graph
Laplacian is constructed by taking a convex combination of individual Laplacians.
The convex combination weights are determined by a heuristic, based on the eigen-
values and eigenvectors of the individual Laplacians. This model can be extended by
considering a non-linear combination of individual Laplacians. The non-linear combi-
nation coefficients may also be determined automatically by solving an optimization
problem.

4. Tensor spectral clustering: The graph based multi-view clustering approaches
proposed in Chapters 5, 6, and 7 work with similarity graphs represented by pairwise
similarity between the samples. This model may be extended by considering higher-order
relationships between the samples represented by p-th order tensors, where
p > 2. The higher-order relationships may better capture the non-linear distribution
or neighborhood of the samples, resulting in better clustering. Once the higher-order
relationships are modelled using tensors, a joint spectral clustering objective can
be optimized over the Euclidean space (as in Chapter 5) or over manifolds (as in
Chapters 6 and 7).

5. Second-order geometry in manifold optimization: The line-search optimization
proposed in Chapters 6 and 7 obtains a local optimal solution by always moving
in the negative gradient direction starting from an initial iterate. The gradient only
considers the first order geometry of the manifold while optimizing at a given iterate.
Other optimization techniques, such as Newton's method or the trust-region method [3], can be
developed that consider the second order geometry of the space during optimization
and obtain solutions with global convergence properties.

6. Data drawn from a union of overlapping manifolds: The algorithms proposed
in Chapters 6 and 7 extract a single lower dimensional manifold corresponding to each
view of a multi-view data set. Generalizations of this model can be proposed where
the data in each view is considered to be lying on a union of multiple, possibly
overlapping, non-linear manifolds. This generic model is expected to better capture
non-linear cluster patterns embedded in non-Euclidean spaces.

7. Incomplete views: All the multi-view clustering algorithms proposed in the thesis
assume that all the samples are completely observed in all the views. However, due
to measurement and pre-processing errors, the data sets often have incomplete views,
where some of the samples are not observed in one view (missing view), or only some
of the variables are observed corresponding to a sample in some view (missing vari-
ables). Multi-view clustering algorithms can be developed that can work in presence
of incomplete views. It would require utilizing the connection between the views to
restore the samples in the incomplete views with the help of corresponding samples
in the complete views.

8. Views observed in heterogeneous measurement spaces: The approaches proposed
in Chapters 3 and 4 assume that in case of feature space based representation,
the views X_1, ..., X_m, ..., X_M are all observed in a real-valued Euclidean space, that
is, X_m ∈ ℝ^(n × d_m). However, some of the views may not be observed in the real-valued
space. For example, the single nucleotide polymorphism (SNP) data is binary, with a
one if a nucleotide has undergone mutation in a sample, and zero otherwise. Similarly,
the views can also consist of categorical, integer count, or textual data. The proposed
clustering algorithms can be extended to work with heterogeneous multi-view data
where different views are observed in different measurement spaces.

9. Deep network based optimization: All the algorithms, proposed in different
chapters of this thesis, perform shallow optimization and obtain either eigenvalue-eigenvector
or gradient based solutions. However, the multi-view clustering objective
proposed especially in Chapters 6 and 7 can also be optimized using a network based
on a deep learning framework. The eigenvector based solutions can be used to initialize
or guide the deep optimization model.

10. Weak supervision model: The algorithms proposed in Chapters 3, 4, 5, 6, and 7
are designed for an unsupervised setting, which does not consider the label information
during the learning process. In large-scale real-life data sets, although it may
not be possible to annotate all the samples in a data set, it may be possible to obtain
labels of only a subset of the samples. New approaches can be designed that can
improve the learning performance by allowing the multi-view clustering algorithms
to be supervised by a small number of labelled samples.
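
As a point of reference for item 5 above, a first-order line search of the kind described
there can be sketched as follows for the simple objective tr(U^T L U) over matrices with
orthonormal columns. The objective, the Armijo backtracking rule, the QR-based retraction,
and the random graph are all assumptions chosen for illustration; they are not the
objectives or step-size rules of Chapters 6 and 7.

```python
import numpy as np


def stiefel_descent(L, k, steps=200, seed=0):
    """Minimize f(U) = tr(U^T L U) subject to U^T U = I by Riemannian descent."""
    def f(X):
        return np.trace(X.T @ L @ X)

    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((L.shape[0], k)))
    history = [f(U)]
    for _ in range(steps):
        G = 2.0 * L @ U                          # Euclidean gradient (L symmetric)
        grad = G - U @ (U.T @ G + G.T @ U) / 2   # projection onto the tangent space
        g2 = np.sum(grad * grad)
        if g2 < 1e-20:
            break
        t = 1.0
        while True:                               # backtracking (Armijo) line search
            U_new, _ = np.linalg.qr(U - t * grad)  # QR-based retraction
            if f(U_new) <= history[-1] - 1e-4 * t * g2 or t < 1e-12:
                break
            t *= 0.5
        if f(U_new) >= history[-1]:               # no further decrease possible
            break
        U = U_new
        history.append(f(U))
    return U, history


# Hypothetical small graph: random symmetric weights and its unnormalized Laplacian.
rng = np.random.default_rng(1)
W = rng.random((12, 12))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W
U_opt, history = stiefel_descent(L, k=3)
```

Each iteration uses only the gradient, that is, first-order information. A Newton or
trust-region variant would additionally use the Riemannian Hessian of the objective at
the current iterate.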

Appendix A

Description of Data Sets

The appendix presents a brief description of the multi-omics cancer and multi-view bench-
mark data sets used in the thesis for comparative analysis of the proposed and the existing
multi-view clustering algorithms.

A.1 Multi-Omics Cancer Data Sets


Nine real-life multi-omics cancer data sets from TCGA are studied extensively in Chapters 3-7
of the thesis. The cancer data sets and their genomic views are described
as follows.
1. Cervical Carcinoma (CESC): The cervical cancer data set consists of 124 sam-
ples. The recent integrative study by TCGA Research Network [218] has identified
three molecular subtypes of cervical cancer, namely, Keratin-low Squamous subgroup,
Keratin-high Squamous subgroup, and Adenocarcinoma-rich subgroup. The data set
consists of 37 samples of Keratin-low Squamous subgroup, 58 samples of Keratin-high
Squamous subgroup, and 29 samples of Adenocarcinoma-rich subgroup.
2. Colorectal Carcinoma (CRC): It is the third most commonly diagnosed cancer in
both men and women and accounts for nine percent of all cancer deaths [65]. The colon
and rectum are parts of the digestive system, and the cancer forms in the colon and/or
the rectum. There are 307 samples in the CRC data set. Depending on the site of
origin, the samples of CRC are divided into two subtypes, namely, colon carcinoma
and rectum carcinoma, having 236 and 71 samples, respectively.
3. Lower Grade Glioma (LGG): Diffuse low-grade and intermediate-grade gliomas
which together make up the lower-grade gliomas have highly variable clinical be-
haviour that is not adequately predicted on the basis of histological class. Integrative
analysis of data from RNA, DNA-copy-number, and DNA-methylation platforms has
uncovered three prognostically significant subtypes of lower-grade glioma [217]. The
LGG data set consists of 267 samples. The first subtype has 134 samples which
exhibit IDH mutation and no 1p/19q codeletion. The second subtype exhibits both
IDH mutation and 1p/19q codeletion and has 84 samples. The third one is called
the wild-type IDH subtype and has 49 samples.

4. Breast Invasive Carcinoma (BRCA): Breast cancer is one of the most common
cancers with greater than 1,300,000 cases and 450,000 deaths each year worldwide
[214]. During the last 15 years, four intrinsic molecular subtypes of breast cancer,
namely, Luminal A, Luminal B, HER2-enriched, and Basal-like, have been identified
and intensively studied [99], [198], [214]. The BRCA data set consists of 398 samples,
comprising 171, 98, 49, and 80 samples of the Luminal A, Luminal B, HER2-enriched,
and Basal-like subtypes, respectively.

5. Ovarian Carcinoma (OV): Ovarian cancer is the eighth most commonly occurring
cancer in women and there were nearly 300,000 new cases in 2018 [22]. Ovarian cancer
encompasses a heterogeneous group of malignancies that vary in etiology, molecular
biology, and numerous other characteristics. TCGA researchers have identified four
robust expression subtypes of high-grade serous ovarian cancer [215]. The OV data
set consists of 334 samples. The four subtypes are termed as immunoreactive, dif-
ferentiated, proliferative, and mesenchymal, consisting of 74, 91, 90, and 79 samples,
respectively.

6. Stomach Adenocarcinoma (STAD): Stomach/Gastric cancer was the world's
third leading cause of cancer mortality in 2012, responsible for 723,000 deaths [64].
TCGA research network has proposed a molecular classification dividing gastric cancer
into four subtypes [216]. The STAD data set has 242 samples, consisting
of 54 samples from microsatellite unstable tumours, which show elevated mutation
rates, 21 samples of tumours showing positivity for Epstein-Barr virus, 119 samples
of tumours having chromosomal instability, and 48 samples of genomically stable
tumours.

7. Glioblastoma Multiforme (GBM): GBM is the most common and malignant
form of brain cancer and has four subtypes identified in the study by Verhaak et
al. [228]. The subtypes are Proneural, Neural, Classical, and Mesenchymal. The data
set consists of 168 samples from three genomic modalities, namely, Gene, miRNA,
and CNV, as the DNA and the Protein modalities are available only for a small number
of samples. The data set contains 51, 24, 37, and 56 samples of the Proneural, Neural,
Classical, and Mesenchymal subtypes, respectively.

8. Lung Carcinoma (LUNG): Based on the primary site of origin, lung cancer
can be categorized into two subtypes, namely, adenocarcinoma and squamous cell
carcinoma. These were also the two major subtypes of lung cancer in the 2015 WHO
classification [222]. The LUNG data set consists of 671 samples with 360 samples of lung
adenocarcinoma and 311 samples of lung squamous cell carcinoma.

9. Kidney Carcinoma (KIDNEY): There are three subtypes of kidney cancer in
TCGA, based on the tissue type of the site of origin, namely, kidney
renal clear cell carcinoma, kidney renal papillary cell carcinoma, and kidney
chromophobe. The data set consists of 737 samples of kidney cancer with 460 samples of
kidney renal clear cell carcinoma, 214 samples of kidney renal papillary cell
carcinoma, and 63 samples of kidney chromophobe.

A.1.1 Pre-Processing of Multi-Omics Data Sets
For all the data sets except GBM, four different omic modalities are considered, namely,
DNA methylation (mDNA), gene expression (RNA), microRNA expression (miRNA), and
reverse phase protein array expression (RPPA). For the GBM data set, three modalities,
namely, RNA, miRNA, and copy number variation (CNV), are considered, as the mDNA and
RPPA modalities are not available for a majority of the samples in the data set. In order
to avoid considering features with too many missing values, for all the omic modalities,
features whose expression value is missing in more than 5% of the total number of
samples are excluded. For the remaining features, missing values are replaced by 0.
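The filtering and imputation rule above can be sketched as follows. This is a minimal illustration, not the code used in the thesis: the function name `filter_and_impute`, the row-per-sample list layout, and the use of `None`/NaN to mark missing values are all assumptions made for the example.

```python
import math


def filter_and_impute(data, max_missing_frac=0.05):
    """Drop features missing in more than `max_missing_frac` of the samples,
    then replace the remaining missing entries by 0.

    `data` is a list of samples (rows); each row is a list of feature values.
    Missing values are assumed to be encoded as None or NaN (hypothetical layout).
    Returns the cleaned rows and the indices of the retained features.
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    n_samples = len(data)
    n_features = len(data[0])

    # Keep a feature only if its missing count does not exceed the 5% threshold.
    kept = []
    for j in range(n_features):
        n_missing = sum(is_missing(row[j]) for row in data)
        if n_missing <= max_missing_frac * n_samples:
            kept.append(j)

    # Zero-impute whatever is still missing in the retained features.
    cleaned = [
        [0.0 if is_missing(row[j]) else float(row[j]) for j in kept]
        for row in data
    ]
    return cleaned, kept
```

With 40 samples, for instance, a feature missing in one sample (2.5%) is retained and zero-filled, while a feature missing in three samples (7.5%) is dropped.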

• RNA and miRNA pre-processing: For all data sets except GBM and OV,
sequence based RNA and miRNA expression data from Illumina HiSeq and Illumina
GA platforms are used. The RNA and miRNA modalities contain expression signals
for 20,502 annotated genes and 1046 miRNAs, respectively. However, filtering out
miRNAs with more than 5% missing values reduced the number of miRNAs for
these data sets to around 300. The underlying assumption of the proposed work
is that the data follows a multivariate Gaussian distribution. However, the sequence
based RNA and miRNA expression modalities of the data sets contain normalized
RPKM (reads per kilobase of exon per million) counts for the genes. Count data
are known to follow a skewed distribution and have the property that the variance
depends on the mean value [300]. It is observed that genes having larger mean
expression values also tend to have larger variances and are not normally distributed.
Log transformation is generally performed on the sequence based expression data
to make the data approximately normally distributed [300]. The degree of normality
attained depends on the skewness of the data before transformation. Therefore, for
modalities with sequence based count data, the 0 entries are replaced by 1, and then
the data is log-transformed using base 10. On the other hand, for the OV and GBM
data sets, array based RNA and miRNA expression data from the AgilentG4502A_07_3
and H-miRNA_8x15Kv2 platforms are used. As the RNA and miRNA expression
data for the OV data set is observed on microarray based platforms, which contain
log-ratio based expression data, the data is not log-transformed as in the case of the
sequence based data sets. The RNA modality of the OV data set consists of expression for
17,814 genes, amongst which the 2,000 most variable genes are considered. The miRNA
expression data is available for 799 microRNAs.
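A small sketch of the two steps described for count data may help: zeros are replaced by 1 before the base-10 log transform, and the most variable features are then retained. The helper names are hypothetical, and ranking features by population variance is an assumption, since the text only says "most variable".

```python
import math


def log10_transform(data):
    """Replace 0 entries by 1, then take log base 10 of every entry."""
    return [[math.log10(1.0 if v == 0 else v) for v in row] for row in data]


def top_variable_features(data, n_top):
    """Keep the `n_top` features with the largest variance across samples.

    Variance is used as the variability criterion (an assumption).
    Returns the reduced rows and the retained feature indices in original order.
    """
    n_samples = len(data)
    n_features = len(data[0])
    variances = []
    for j in range(n_features):
        column = [row[j] for row in data]
        mean = sum(column) / n_samples
        variances.append(sum((v - mean) ** 2 for v in column) / n_samples)

    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    kept = sorted(ranked[:n_top])
    return [[row[j] for j in kept] for row in data], kept
```

Replacing 0 by 1 maps zero counts to 0 after the transform, since log10(1) = 0, and avoids taking the logarithm of zero.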

• mDNA pre-processing: For the DNA methylation modality, methylation β-values
from the Illumina HumanMethylation27 and HumanMethylation450 beadarray
platforms are used. The HumanMethylation450 beadarray gives methylation β-values of
485,577 CpG sites, while the HumanMethylation27 beadarray covers 27,578 CpG sites.
These two platforms share a common set of 25,978 CpG locations. Over 94% of the loci
present on the HumanMethylation27 array are included in the HumanMethylation450
array content. Moreover, the β-value measurements across the two platforms were
compared in [14], which showed a strong correlation of $R^2 > 0.97$.
Therefore, for all the data sets, methylation data across those common 25,978 CpG
locations are considered from both the platforms. Additionally, CpG locations with
missing gene information were filtered out from the study. The top 2,000 most
variable CpG sites are used for clustering.

• RPPA pre-processing: For protein modality, reverse phase protein array data from
the MDA_RPPA_Core platform is used. The protein expression data is available
in log-ratio form with values ranging between $[-10, 10]$. Taking intersections of the
protein IDs available for different samples, expression levels of around 200 proteins
are obtained for different data sets.

• CNV pre-processing: For the GBM data set, CNV data from the Affymetrix SNP
array 6.0 platform is used. The raw copy number segmented data is processed using
the CNregions function of the iCluster+ [155] R-package to reduce the redundant copy
number regions. The CNregions function has an epsilon parameter which denotes
the maximum Euclidean distance between adjacent probes tolerated for defining a
non-redundant region. The number of non-redundant copy number regions extracted
for a data set depends on the value of the epsilon parameter and is proportional to
the number of samples in the data set. It is recommended in [155] to choose a value
of epsilon such that the reduced dimension is less than 10,000. The default value of
0.005 is considered for the epsilon parameter of the CNregions function for all the
data sets.

These five modalities, measured on different platforms, represent a wide variety of
biological information. The summary of the data sets in terms of their sample size, dimension
of their individual modalities, and their number of clusters is provided in Table A.1.

A.2 Multi-View Benchmark Data Sets


Benchmark data sets from different application domains like social networking, information
retrieval, handwritten digits identification, and object detection are considered in this work.
The data sets are briefly described as follows.

A.2.1 Social Network Data Sets


Two types of social network data sets are studied in the thesis: Twitter network data sets
and citation network data sets.

A.2.1.1 Twitter Data Sets


Brief descriptions of the five Twitter data sets used in this work are as follows.

1. Football: This data set is a collection of 248 English Premier League football players
and clubs active on Twitter. The disjoint ground truth communities correspond to
the 20 individual clubs in the league.

2. Politics-UK: This data set consists of 419 Members of Parliament (MPs) in the
United Kingdom. The ground truth consists of five groups, corresponding to political
parties.

3. Rugby: The Rugby data set is a collection of 854 international Rugby Union players,
clubs, and organizations currently active on Twitter. The ground truth consists of
overlapping communities corresponding to 15 countries. In the case of players, these
user accounts can potentially be assigned to both their home nation and the nation
in which they play club rugby. As the full names or screen names of the Twitter
users are not available, the overlapping Rugby players are assigned either to their
country or their club.

4. Olympics: A dataset of 464 users, covering athletes and organizations that were
involved in the London 2012 Summer Olympics. The disjoint ground truth commu-
nities correspond to 28 different sports.

5. Politics-IE: A collection of Irish politicians and political organisations, assigned to
seven disjoint ground truth groups, according to their affiliation.

Views of Twitter Data Sets


For each Twitter data set, a heterogeneous collection of nine network and content-based
modalities is available. In all cases, cosine similarity is used to compute the pairwise
similarities between the Twitter users. All the Twitter data sets are publicly available at
http://mlg.ucd.ie/aggregation/. Description of the nine different modalities of each
Twitter data set is given below:

1. Tweets500: User content profiles, constructed from the concatenation of the 500
most recently-posted tweets for each user.

2. Lists500: List content profiles, constructed from the concatenation of both the
names and the descriptions of the 500 Twitter lists to which each user has most
recently been assigned.

3. Follows: From the unweighted directed follower graph, construct binary user profile
vectors based on the users whom they follow (that is, out-going links).

4. Followed-by: From the unweighted directed follower graph, construct binary user
profile vectors based on the users that follow them (that is, incoming links). A pair
of users are deemed to be similar if they are frequently “co-followed” by the
same users.

5. Mentions: From the weighted directed mention graph, construct user profile vectors
based on the users whom they mention.

6. Mentioned-by: From the weighted directed mention graph, construct binary user
profile vectors based on the users that mention them. A pair of users are deemed to
be similar if they are frequently “co-mentioned” by the same users.

7. Retweets: From the weighted directed retweet graph, construct user profile vectors
based on the users whom they retweet.

8. Retweeted-by: From the weighted directed retweet graph, construct user profile
vectors based on the users that retweet them. Users are deemed to be similar if they are
frequently “co-retweeted” by the same users.

9. ListMerged500: Based on Twitter user list memberships, construct an unweighted
bipartite graph, such that an edge between a list and a user indicates that the list
contains the specified user. A pair of users are deemed to be similar if they are
frequently linked to the same lists. Again, we only consider the 500 lists to which
each user has been assigned most recently.
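As an illustration of how a view's pairwise similarities could be obtained from user profile vectors, the sketch below computes cosine similarities for a toy "Follows"-style binary profile matrix. The function names and the convention of returning 0 for an all-zero profile are assumptions made for this example, not details taken from the data set documentation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two profile vectors.

    Returns 0.0 when either vector is all zeros (an assumed convention).
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(profiles):
    """Symmetric matrix of pairwise cosine similarities between users."""
    n = len(profiles)
    return [[cosine_similarity(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]


# Toy binary profiles: entry j is 1 if the user follows account j.
follows = [
    [1, 1, 0, 0],  # user A
    [1, 1, 0, 0],  # user B follows the same accounts as A
    [0, 0, 1, 1],  # user C shares no followed account with A
    [1, 0, 1, 0],  # user D shares one followed account with A
]
S = similarity_matrix(follows)
```

In this toy example, users A and B have similarity close to 1, A and C have similarity 0, and A and D have similarity close to 0.5.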

A.2.1.2 Citation Network Data Set


The CORA citation network data set consists of 2708 machine learning papers. The data
set has two views. The citation relation view consists of 5429 links indicating inbound
and outbound citations among the papers. The other view is a content based view where
each publication in the data set is described by a 0/1-valued word vector indicating the
absence/presence of the corresponding word from a dictionary of 1433 unique machine
learning keywords. The machine learning articles are classified into seven topics, namely,
neural networks, rule learning, reinforcement learning, probabilistic methods, theory, ge-
netic algorithms, and case based study. The pre-processed citation and content views are
publicly available at https://github.com/KunyuLin/Multi-view-Datasets.

A.2.2 Image Data Sets


1. Digits: This data set consists of features of handwritten numerals (‘0’-‘9’) extracted
from a collection of Dutch utility maps. There are 200 patterns per class (for a total
of 2,000 patterns), which have been digitized in binary images. The data set is publicly
available at https://archive.ics.uci.edu/ml/datasets/Multiple+Features. The samples
are represented in terms of the following six feature sets:

(a) mfeat-fou: 76 Fourier coefficients of the character shapes.


(b) mfeat-fac: 216 profile correlations.
(c) mfeat-kar: 64 Karhunen-Loève coefficients.
(d) mfeat-pix: 240 pixel averages in 2 x 3 windows.
(e) mfeat-zer: 47 Zernike moments.
(f) mfeat-mor: 6 morphological features.

2. 100Leaves: It is a one-hundred plant species leaves data set
https://archive.ics.uci.edu/ml/datasets/One-hundred+plant+species+leaves+data+set. The data
set consists of 1,600 samples, with sixteen samples of each type of leaf for each of
the one-hundred plant species. Each sample is represented by three sets of image
features: shape descriptors, fine scale margin, and texture histogram.

3. ALOI: This is the Amsterdam Library of Object Image data set
http://elki.dbs.ifi.lmu.de/wiki/DataSets/MultiView. The data set is from the work of [18]. The
data set consists of 11,025 images of 100 small objects. Each image is represented
with four types of features, that is, RGB color histogram, HSV color histogram, color
similarity, and Haralick features.

4. ORL: The ORL database of faces contains 400 face images. There are ten
different images of 40 distinct subjects. For some subjects, the images were taken at
different times, varying the lighting conditions, facial expressions (open/closed eyes,
smiling/not smiling) and facial details (glasses/no glasses). All the images were taken
against a dark homogeneous background with the subjects in an upright, frontal
position (with tolerance for some side movement). The size of each image is $(92 \times 112)$
pixels, with 256 grey levels per pixel. Following [276], the images in the data set are
resized to $(48 \times 48)$ and three types of image features are extracted: View1 intensity
(4096 dimensions), View2 local binary pattern (LBP) (3304 dimensions), and View3
Gabor (6750 dimensions). The standard LBP feature is extracted from $(72 \times 80)$
loosely cropped images with a histogram size of 59 over $9 \times 10$ pixel patches. The Gabor
feature is extracted with one scale $\lambda = 4$ at four orientations $\theta = 0^\circ, 45^\circ, 90^\circ, 135^\circ$
with a loose face crop at the resolution of $(25 \times 30)$ pixels. The ORL data set is
available at https://cam-orl.co.uk/facedatabase.html.

5. Caltech7: The Caltech7 is a subset of the Caltech 101 data set
http://www.vision.caltech.edu/Image_Datasets/Caltech101/ for the image based object
recognition problem. The data set consists of 1474 images from seven widely used classes,
namely, Face, Motorbikes, Dollar-Bill, Garfield, Snoopy, Stop-Sign, and
Windsor-Chair. Six types of image features are extracted from all the images: 48 dimensional
Gabor features, 40 dimensional wavelet moments features, 254 dimensional centrist
features, 1984 dimensional histogram of oriented gradients (HOG) features, 512
dimensional GIST descriptors, and 928 dimensional LBP features. The processed image
features for the Caltech7 data set are available at https://github.com/yeqinglee/mvdata.

A.2.3 Multi-Source News Article Data Sets


1. 3Sources: This is a multi-view text data set available at
http://mlg.ucd.ie/datasets/3sources.html. It consists of 169 news articles collected from three
well-known online news sources: BBC, Reuters, and The Guardian, from the period
February to April 2009. Each news article story was manually annotated with one
or more of the six topical labels: business, entertainment, health, politics, sport,
and technology. The labels roughly correspond to the primary section headings used
across the three news sources. The data set has three views, one corresponding to
each of the three news sources.

2. BBC: This is also a multi-view news article clustering data set constructed from
the single-view BBC news corpora http://mlg.ucd.ie/datasets/segment.html. It
consists of 685 news documents. Each raw document was split into four segments
by separating the documents into paragraphs, and merging sequences of consecutive
paragraphs. The segments for each document were then randomly assigned to views.
Each document is annotated with one of the five topical labels: business,
entertainment, politics, sport, and technology. The data set has four views corresponding to
the four segments.

Table A.1: Summary of Data Sets with Feature Space based Representation

Category      Data Set    Sample  Cluster  View  d1    d2    d3    d4    d5   d6
Benchmark     Digits      2000    10       6     216   76    64    6     240  47
Benchmark     3Sources    169     6        3     3560  3631  3068  -     -    -
Benchmark     BBC         685     5        4     4659  4633  4665  4684  -    -
Benchmark     100Leaves   1600    100      3     64    64    64    -     -    -
Benchmark     ALOI        11025   100      4     64    64    13    77    -    -
Benchmark     ORL         400     40       3     4096  3304  6750  -     -    -
Benchmark     Caltech7    1474    7        6     48    40    254   1984  512  928
Multi-Omics   BRCA        398     4        4     2000  2000  278   212   -    -
Multi-Omics   LGG         267     3        4     2000  2000  333   209   -    -
Multi-Omics   STAD        242     4        4     2000  2000  291   218   -    -
Multi-Omics   LUNG        671     2        4     2000  2000  296   180   -    -
Multi-Omics   KIDNEY      737     3        4     2000  2000  261   174   -    -
Multi-Omics   CESC        124     3        4     2000  2000  311   219   -    -
Multi-Omics   CRC         464     2        4     2000  2000  291   178   -    -
Multi-Omics   OV          737     4        4     2000  2000  334   192   -    -
Multi-Omics   GBM         169     4        3     2000  2000  534   -     -    -

Appendix B

Cluster Evaluation Indices

B.1 External Cluster Evaluation Measures


Four external cluster evaluation measures are used to compare the performance of different
approaches, namely, accuracy, adjusted Rand index (ARI), normalized mutual information
(NMI), and F-measure. Since there are different definitions of some of the measures, like
accuracy and NMI, in clustering, the definitions used in this work are described next. A
higher value indicates a better performance for each metric. Let $T = \{t_1, \ldots, t_j, \ldots, t_k\}$ be
the true partition of $n$ samples of a data set into $k$ clusters. Let $C = \{c_1, \ldots, c_i, \ldots, c_k\}$ be
the $k$ clusters returned by a clustering algorithm. The external evaluation indices measure
how close the clustering $C$ is to the true partition $T$. Also, let ↑ denote that a higher
value of that index means a “better” clustering, while ↓ means the exact opposite. These four
indices, along with purity and the Rand index, are defined as follows.

1. Accuracy↑ [275]: Given a sample $x_p$, let its cluster and class labels be denoted by
$c_p$ and $t_p$, respectively. The clustering accuracy is given by

$$\mathrm{Accuracy} = \frac{1}{n} \sum_{p=1}^{n} \delta(t_p, \mathrm{map}(c_p)),$$

where $\delta(a, b) = 1$ when $a = b$, otherwise $\delta(a, b) = 0$. The function $\mathrm{map}(c_p)$ is the
permutation map function, which maps the cluster labels into class labels. The best
map can be obtained by the Kuhn-Munkres algorithm [118].
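The accuracy definition can be illustrated with a short sketch. Note that this example does not implement the Kuhn-Munkres algorithm: it finds the best label map by exhaustive search over permutations, which yields the same optimum but has factorial cost and is only practical for a small number of clusters. The function name is hypothetical.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_labels):
    """Accuracy under the best one-to-one map from cluster labels to class labels.

    Brute-force search over permutations (a stand-in for Kuhn-Munkres,
    suitable only for small k). Labels are assumed to be sortable.
    """
    n = len(true_labels)
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))

    # Contingency counts: number of samples with cluster label c and class label t.
    counts = {}
    for t, c in zip(true_labels, cluster_labels):
        counts[(c, t)] = counts.get((c, t), 0) + 1

    # Pad with None so that every cluster can be mapped even if there are
    # more clusters than classes; a cluster mapped to None matches nothing.
    size = max(len(classes), len(clusters))
    padded = classes + [None] * (size - len(classes))

    best = 0
    for perm in permutations(padded, len(clusters)):
        matched = sum(counts.get((c, t), 0) for c, t in zip(clusters, perm))
        best = max(best, matched)
    return best / n
```

For example, with true labels `[0,0,0,1,1,1,2,2,2]` and cluster labels `[1,1,1,2,2,0,0,0,0]`, the best map sends cluster 1 to class 0, cluster 2 to class 1, and cluster 0 to class 2, so eight of the nine samples are matched.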

2. NMI↑ [68] measures the concordance of cluster assignments in $T$ and $C$. NMI is
defined as follows:

$$\mathrm{NMI} = \frac{2\, I(T, C)}{\left[H(T) + H(C)\right]}; \tag{B.1}$$

where $H(C)$ is the entropy of $C$ and $I(T, C)$ is the mutual information between $T$ and
$C$, which are as follows:

$$H(C) = -\sum_{i=1}^{k} Pr(c_i) \log Pr(c_i);$$

$$I(T, C) = \sum_{i=1}^{k} \sum_{j=1}^{k} Pr(c_i \cap t_j) \log \left[ \frac{Pr(c_i \cap t_j)}{Pr(c_i)\, Pr(t_j)} \right];$$

where $Pr(S)$ denotes the probability of the set $S$.
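A minimal sketch of (B.1) follows, assuming that each probability is estimated by the relative frequency of the corresponding set of samples. The function names, and the convention of returning 1 when both partitions consist of a single cluster (so that both entropies are zero), are assumptions for this illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a partition, with Pr estimated by relative frequency."""
    n = len(labels)
    return -sum((m / n) * math.log(m / n) for m in Counter(labels).values())


def mutual_information(true_labels, cluster_labels):
    """Mutual information between the true partition and the clustering."""
    n = len(true_labels)
    p_true = Counter(true_labels)
    p_clus = Counter(cluster_labels)
    joint = Counter(zip(cluster_labels, true_labels))
    # Only non-empty intersections contribute to the double sum.
    return sum(
        (m / n) * math.log((m / n) / ((p_clus[c] / n) * (p_true[t] / n)))
        for (c, t), m in joint.items()
    )


def nmi(true_labels, cluster_labels):
    """NMI = 2 I(T, C) / (H(T) + H(C)); returns 1.0 if both entropies are zero."""
    denom = entropy(true_labels) + entropy(cluster_labels)
    if denom == 0.0:
        return 1.0
    return 2.0 * mutual_information(true_labels, cluster_labels) / denom
```

Two partitions that agree up to a relabeling give an NMI of 1 (up to rounding), while statistically independent partitions such as `[0,0,1,1]` and `[0,1,0,1]` give 0.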

3. ARI↑ [8] is an adjustment of the Rand index, given by

$$\mathrm{ARI} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{k} \binom{|c_i \cap t_j|}{2} - n_3}{\frac{1}{2}(n_1 + n_2) - n_3},$$

where $n_1 = \sum_{i=1}^{k} \binom{|c_i|}{2}$, $n_2 = \sum_{j=1}^{k} \binom{|t_j|}{2}$, and $n_3 = \frac{2 n_1 n_2}{n(n-1)}$.
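The ARI formula can be evaluated directly from the contingency counts, as in the sketch below. The function name is hypothetical, and the value returned when the denominator vanishes (both partitions trivial) is an assumed convention.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, cluster_labels):
    """ARI computed from pair counts, following the n1, n2, n3 notation above."""
    n = len(true_labels)
    # Sum over all (cluster, class) intersections of C(|c_i ∩ t_j|, 2).
    sum_joint = sum(comb(m, 2) for m in Counter(zip(cluster_labels, true_labels)).values())
    n1 = sum(comb(m, 2) for m in Counter(cluster_labels).values())
    n2 = sum(comb(m, 2) for m in Counter(true_labels).values())
    n3 = 2.0 * n1 * n2 / (n * (n - 1))  # expected value of sum_joint under chance
    denom = 0.5 * (n1 + n2) - n3
    if denom == 0.0:
        return 1.0  # assumed convention for degenerate partitions
    return (sum_joint - n3) / denom
```

For the partitions `[0,0,1,1]` and `[0,1,0,1]`, no pair of samples is grouped together in both, so the numerator is $0 - 2/3$ and the denominator is $2 - 2/3$, giving an ARI of $-0.5$. This shows that ARI, unlike the other indices in this section, can be negative.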

4. F-measure↑ [122] of a cluster $c_i$ with respect to a class $t_j$ evaluates how well
cluster $c_i$ describes class $t_j$ and is given by the harmonic mean of precision and recall.

$$\text{Precision } P_{ij} = \frac{|c_i \cap t_j|}{|c_i|}.$$

$$\text{Recall } R_{ij} = \frac{|c_i \cap t_j|}{|t_j|}.$$

$$\text{F-measure } F(t_j, c_i) = \frac{2 P_{ij} R_{ij}}{P_{ij} + R_{ij}} = \frac{2 |c_i \cap t_j|}{|c_i| + |t_j|}.$$

The overall F-measure is given by the weighted average of the maximum F-measure
over the clusters in $C$.

$$\text{F-measure} = \frac{1}{n} \sum_{j=1}^{k} n_j \max_{i} \{F(t_j, c_i)\},$$

where $n_j$ denotes the number of points in class $t_j$.
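The overall F-measure can be sketched as follows, using the closed form $2|c_i \cap t_j| / (|c_i| + |t_j|)$ for each class and cluster pair. The function name is hypothetical.

```python
from collections import Counter


def overall_f_measure(true_labels, cluster_labels):
    """Class-size weighted average of the best per-class F-measure."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    joint = Counter(zip(cluster_labels, true_labels))

    total = 0.0
    for t, n_t in class_sizes.items():
        # Best-matching cluster for class t; joint[...] is 0 for empty intersections.
        best = max(
            2.0 * joint[(c, t)] / (cluster_sizes[c] + n_t)
            for c in cluster_sizes
        )
        total += n_t * best
    return total / n
```

For true labels `[0,0,0,1,1,1]` and cluster labels `[0,0,1,1,1,1]`, class 0 is best described by cluster 0 with F = 4/5, and class 1 by cluster 1 with F = 6/7, so the overall value is (3 · 4/5 + 3 · 6/7) / 6.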

5. Purity↑ [177]: It measures the extent to which each cluster contains samples
primarily from one class. Each cluster is first assigned the true class which is most
frequent in the cluster, and then the purity of the clustering solution is assessed by
the proportion of correctly assigned samples. Formally, it is given by

$$\mathrm{Purity} = \frac{1}{n} \sum_{i=1}^{k} \max_{j} \{|c_i \cap t_j|\}. \tag{B.2}$$

In general, the higher the value of purity, the better the cluster solution. However, purity
does not penalize a large number of clusters.
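A short sketch of (B.2) is given below; the function name is hypothetical. The second call in the usage note illustrates the caveat stated above, namely that purity does not penalize a large number of clusters.

```python
from collections import Counter


def purity(true_labels, cluster_labels):
    """Fraction of samples that belong to the majority class of their cluster."""
    n = len(true_labels)
    per_cluster = {}
    for t, c in zip(true_labels, cluster_labels):
        per_cluster.setdefault(c, Counter())[t] += 1
    # For each cluster, count the samples of its most frequent class.
    return sum(max(counts.values()) for counts in per_cluster.values()) / n
```

With true labels `[0,0,0,1,1,1]`, the clustering `[0,0,1,1,1,1]` has purity 5/6, whereas placing every sample in its own cluster, `[0,1,2,3,4,5]`, yields a purity of 1 even though the solution is uninformative.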

6. Rand↑ [173]: The Rand index is a pair-counting based cluster evaluation index which
measures the pairs of points on which the two clusterings agree or disagree. In an $n$
sample data set, the $\binom{n}{2}$ pairs of points can be divided into four categories. Let $a$
represent the number of pairs that are in the same cluster both in $C$ and $T$, $b$ represent
the number of pairs that are in the same cluster in $C$ but in different clusters in $T$,
$c$ represent the number of pairs that are in different clusters in $C$ but in the same
cluster in $T$, and $d$ represent the number of pairs that are in different clusters both in
$C$ and $T$. The values $a$ and $d$ count the agreements, while $b$ and $c$ count the disagreements.
The Rand index is defined as the ratio of the total number of agreements to the total
number of pairs, given by

$$\mathrm{Rand} = \frac{a + d}{a + b + c + d}. \tag{B.3}$$
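The four pair categories can be counted explicitly, as in this sketch. The function name is hypothetical, and the quadratic loop over all pairs is chosen for clarity rather than speed.

```python
from itertools import combinations


def rand_index(true_labels, cluster_labels):
    """Rand index from explicit counts of the pair categories a, b, c, d."""
    a = b = c = d = 0
    for p, q in combinations(range(len(true_labels)), 2):
        same_cluster = cluster_labels[p] == cluster_labels[q]
        same_class = true_labels[p] == true_labels[q]
        if same_cluster and same_class:
            a += 1  # together in both C and T
        elif same_cluster:
            b += 1  # together in C, apart in T
        elif same_class:
            c += 1  # apart in C, together in T
        else:
            d += 1  # apart in both
    return (a + d) / (a + b + c + d)
```

For `[0,0,1,1]` against `[0,1,0,1]`, the six pairs split into a = 0, b = 2, c = 2, and d = 2, so the Rand index is 2/6.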

Except for ARI, which can take negative values, all the external cluster validation indices
lie in [0,1], and a higher value indicates better clustering.

B.2 Internal Cluster Evaluation Measures


Internal cluster validity indices evaluate the quality of clustering based on the information
intrinsic to data like compactness and separation of the identified clusters. The information
of the correct partition of the data is not used during internal cluster evaluation. In the
proposed CoALa algorithm, k-means clustering is performed in an approximate subspace of
rank k, where k is the number of clusters in the data set. To establish the effectiveness of the
proposed algorithm, the quality of clustering in the k-dimensional approximate subspace
is compared with that of the rank k subspaces of the individual modalities and the rank k
true subspace using seven internal cluster evaluation indices. These indices are described
as follows.
Let $X = \{x_1, \ldots, x_i, \ldots, x_n\}$ be the set of $n$ samples, where $x_i \in \mathbb{R}^k$ represents the $i$-th
sample in a $k$-dimensional subspace. Let the Euclidean distance between samples $x_i$ and $x_j$
be denoted as $d_e(x_i, x_j)$. The $k$ clusters are represented as $\mathcal{C} = \{C_1, \ldots, C_k\}$, and the centroids
of each of the $k$ clusters are $v_1, \ldots, v_k$. Let the centroid of the data set be given by
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.

1. Silhouette ↑ [179]: It is a normalized summation-type index. The compactness
within the clusters is measured based on the distance between all the samples in the
same cluster and the separation between the clusters is based on the nearest neighbor
distance. It is defined as

$$\mathrm{Silhouette} = \frac{1}{n} \sum_{C_j \in \mathcal{C}} \sum_{x_i \in C_j} \frac{b(x_i, C_j) - a(x_i, C_j)}{\max\{b(x_i, C_j),\, a(x_i, C_j)\}}, \tag{B.4}$$

where

$$a(x_i, C_j) = \frac{1}{|C_j| - 1} \sum_{x_m \in C_j,\ x_i \neq x_m} d_e(x_i, x_m); \tag{B.5}$$

$$b(x_i, C_j) = \min_{C_l \in \mathcal{C} \setminus C_j} \left\{ \frac{1}{|C_l|} \sum_{x_m \in C_l} d_e(x_i, x_m) \right\}. \tag{B.6}$$
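Equations (B.4) to (B.6) translate into the following sketch. The function name is hypothetical, and two conventions are assumed because the definition leaves them open: a sample in a singleton cluster contributes 0, and a sample with a = b = 0 also contributes 0. At least two clusters are required.

```python
import math


def silhouette(points, labels):
    """Average silhouette width; points are coordinate tuples of equal length."""
    clusters = {}
    for x, label in zip(points, labels):
        clusters.setdefault(label, []).append(x)

    total = 0.0
    for label, members in clusters.items():
        if len(members) == 1:
            continue  # singleton cluster: contribution taken as 0 (assumption)
        for x in members:
            # a: mean distance to the other members (the self-distance adds 0).
            a = sum(math.dist(x, y) for y in members) / (len(members) - 1)
            # b: smallest mean distance to the members of any other cluster.
            b = min(
                sum(math.dist(x, y) for y in other) / len(other)
                for other_label, other in clusters.items()
                if other_label != label
            )
            if max(a, b) > 0.0:
                total += (b - a) / max(a, b)
    return total / len(points)
```

For the one-dimensional points 0, 1, 10, 11 with clusters {0, 1} and {10, 11}, the sample at 0 has a = 1 and b = 10.5, and the sample at 1 has a = 1 and b = 9.5; by symmetry the index equals (9.5/10.5 + 8.5/9.5) / 2.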

2. Dunn Index ↑ [55]: It is a ratio-type index where separation is estimated by the
nearest neighbor distance and compactness by the maximum cluster diameter. It
is defined as

$$\mathrm{Dunn} = \frac{\min_{C_j \in \mathcal{C}} \left\{ \min_{C_l \in \mathcal{C} \setminus C_j} \{\delta(C_j, C_l)\} \right\}}{\max_{C_j \in \mathcal{C}} \{\Delta(C_j)\}}, \tag{B.7}$$

$$\text{where } \delta(C_j, C_l) = \min_{x_i \in C_j} \min_{x_m \in C_l} \{d_e(x_i, x_m)\}, \tag{B.8}$$

$$\text{and } \Delta(C_j) = \max_{x_i, x_m \in C_j} \{d_e(x_i, x_m)\}. \tag{B.9}$$

3. Davies-Bouldin (DB) Index ↓ [48]: This index estimates the compactness based
on the distance from the samples in a cluster to its centroid and separation based on
the distance between centroids. It is defined as

$$\mathrm{DB} = \frac{1}{k} \sum_{C_j \in \mathcal{C}} \max_{C_l \in \mathcal{C} \setminus C_j} \left\{ \frac{S(C_j) + S(C_l)}{d_e(v_j, v_l)} \right\}, \tag{B.10}$$

$$\text{where } S(C_j) = \frac{1}{|C_j|} \sum_{x_i \in C_j} d_e(x_i, v_j). \tag{B.11}$$

4. Xie-Beni Index ↓ [251]: It is an index of fuzzy clustering, but it is also applicable
to crisp clustering. For crisp clustering it is defined as

$$\text{Xie-Beni} = \frac{1}{n} \, \frac{\sum_{C_j \in \mathcal{C}} \sum_{x_i \in C_j} d_e^2(x_i, v_j)}{\min_{C_j, C_l \in \mathcal{C},\ C_j \neq C_l} \{d_e^2(v_j, v_l)\}}. \tag{B.12}$$
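The three ratio-type indices in (B.7) to (B.12) can be sketched together, since they share the same building blocks of clusters, centroids, and Euclidean distances. All function names are hypothetical, and the minimum in the Xie-Beni denominator is taken over distinct clusters. The sketch assumes at least two clusters with distinct centroids and, for the Dunn index, at least one cluster with more than one distinct point.

```python
import math


def _group(points, labels):
    """Split the points into a list of clusters according to their labels."""
    clusters = {}
    for x, label in zip(points, labels):
        clusters.setdefault(label, []).append(x)
    return list(clusters.values())


def _centroid(members):
    """Coordinate-wise mean of a cluster."""
    return tuple(sum(col) / len(members) for col in zip(*members))


def dunn_index(points, labels):
    """Minimum inter-cluster nearest-neighbor distance over maximum diameter."""
    cl = _group(points, labels)
    separation = min(
        math.dist(x, y)
        for i, a in enumerate(cl) for j, b in enumerate(cl) if i != j
        for x in a for y in b
    )
    diameter = max(math.dist(x, y) for a in cl for x in a for y in a)
    return separation / diameter


def davies_bouldin_index(points, labels):
    """Average over clusters of the worst scatter-to-separation ratio."""
    cl = _group(points, labels)
    cents = [_centroid(m) for m in cl]
    scatter = [sum(math.dist(x, v) for x in m) / len(m) for m, v in zip(cl, cents)]
    k = len(cl)
    return sum(
        max((scatter[j] + scatter[l]) / math.dist(cents[j], cents[l])
            for l in range(k) if l != j)
        for j in range(k)
    ) / k


def xie_beni_index(points, labels):
    """Within-cluster squared error over n times the minimum squared centroid gap."""
    cl = _group(points, labels)
    cents = [_centroid(m) for m in cl]
    k = len(cl)
    within = sum(math.dist(x, v) ** 2 for m, v in zip(cl, cents) for x in m)
    min_gap = min(math.dist(cents[j], cents[l]) ** 2
                  for j in range(k) for l in range(j + 1, k))
    return within / (len(points) * min_gap)
```

For the one-dimensional points 0, 1, 10, 11 with clusters {0, 1} and {10, 11}, the Dunn index is 9/1 = 9, the DB index is (0.5 + 0.5)/10 = 0.1, and the Xie-Beni index is 1/(4 · 100) = 0.0025.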

Appendix C

Basics of Matrix Perturbation Theory

This section provides a brief description of the theory of invariant subspaces and the related
theorems that are used in the main paper. Let A be a matrix of order n. An invariant
subspace of the matrix A and some of its properties are as follows [202].

Definition C.1. A subspace $\mathcal{Z}$ is an invariant subspace of $A$ if $A\mathcal{Z} \subset \mathcal{Z}$, that is,
$\forall x \in \mathcal{Z},\ Ax \in \mathcal{Z}$.

Property C.1. Let $\mathcal{Z}$ be an invariant subspace of $A$, and let the columns of matrix $Z$ form
a basis for subspace $\mathcal{Z}$. Then there exists a unique matrix $B$ such that

$$AZ = ZB. \tag{C.1}$$

The matrix $B$ is considered to be the representation of $A$ on subspace $\mathcal{Z}$ with respect to the
basis $Z$.

Property C.2. Let $\mathcal{Z}_1$ be an invariant subspace of $A$ and let the columns of $Z_1$ form an
orthonormal basis for $\mathcal{Z}_1$. Let $[Z_1 \ W_2]$ be unitary, where the columns of $W_2$ span the subspace
orthogonal to $\mathcal{Z}_1$. Then we can write

$$[Z_1 \ W_2]^T A \, [Z_1 \ W_2] = \begin{bmatrix} B_1 & W \\ 0 & B_2 \end{bmatrix}, \tag{C.2}$$

where $B_1 = Z_1^T A Z_1$, $B_2 = W_2^T A W_2$, and $W = Z_1^T A W_2$.

The equation in (C.2) is called the reduced form of A. The eigenvalues of B1 are the
eigenvalues of A associated with the basis Z1 . The complementary set of eigenvalues are
those of the matrix B2 . This leads to the notion of a simple invariant subspace, defined as
follows.

Definition C.2. Let $\mathcal{Z}_1$ be an invariant subspace of $A$, and let the reduced form of $A$
with respect to the unitary matrix $[Z_1 \ W_2]$ be given by (C.2). Then $\mathcal{Z}_1$ is called a simple
invariant subspace of $A$ if

$$\Omega(B_1) \cap \Omega(B_2) = \emptyset, \tag{C.3}$$

where $\Omega(B)$ denotes the set of all eigenvalues of matrix $B$.

A simple invariant subspace has a complementary subspace defined using the spectral
resolution of A as follows.

Theorem C.1. Let the simple invariant subspace $\mathcal{Z}_1$ of $A$ with respect to the unitary matrix
$[Z_1 \ W_2]$ have the reduced form as given by (C.2). Then there exist matrices $Z_2$ and $W_1$
such that $[Z_1 \ Z_2]^{-1} = [W_1 \ W_2]^T$ and

$$A = Z_1 B_1 W_1^T + Z_2 B_2 W_2^T, \tag{C.4}$$

where $B_j = W_j^T A Z_j$, for $j = 1, 2$. Also, $AZ_1 = Z_1 B_1$ and $AZ_2 = Z_2 B_2$ [202].

Since $AZ_2 = Z_2 B_2$, this implies that $\mathcal{Z}_2 = \mathcal{C}(Z_2)$ is an invariant subspace of $A$. Thus,
if $\mathcal{Z}_1$ is an invariant subspace of $A$, then $A$ also has a complementary invariant subspace
$\mathcal{Z}_2$ [202]. The form of $A$ in (C.4) is called the spectral resolution of $A$ along $\mathcal{Z}_1$ and $\mathcal{Z}_2$.
In the matrix perturbation problem, let $\mathcal{Z}_1$ be a simple invariant subspace of a matrix
$A$, and let $\tilde{A} = A + E$ be a perturbation of $A$. If $E$ is sufficiently small, then there is an
invariant subspace $\tilde{\mathcal{Z}}_1$ of $\tilde{A}$, such that $\tilde{\mathcal{Z}}_1$ approaches $\mathcal{Z}_1$ as $E$ approaches zero. The
Davis-Kahan theorem [202] is used to bound the difference between $\tilde{\mathcal{Z}}_1$ and $\mathcal{Z}_1$ in terms of the
residual $E$.

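Theorem C.2. Davis-Kahan sin Θ theorem [202] Let $A$ be a matrix of order $n$. Let $A$
have a spectral resolution given by

$$A = Z_1 B_1 Z_1^T + Z_2 B_2 Z_2^T, \tag{C.5}$$

where $[Z_1 \ Z_2]$ is unitary with $Z_1 \in \mathbb{C}^{n \times r}$. Let $\tilde{Z} \in \mathbb{C}^{n \times r}$ have orthonormal columns, and
for any Hermitian matrix $B$ of order $r$, let the residual be

$$R = A\tilde{Z} - \tilde{Z}B. \tag{C.6}$$

If $\Omega(B) \subset [a, b]$ and, for some $\delta > 0$, $\Omega(B_2) \subset \mathbb{R} \setminus [a - \delta, b + \delta]$, then for any unitarily
invariant norm $\|\cdot\|$,

$$\left\| \sin\Theta\left(\mathcal{C}(Z_1), \mathcal{C}(\tilde{Z})\right) \right\| \leq \frac{\|R\|}{\delta}. \tag{C.7}$$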
Weyl’s theorem and the Wielandt-Hoffman theorem, which bound the eigenvalues of the
sum of two Hermitian matrices, are as follows.

Theorem C.3. Weyl [202] Let $A$ and $B$ be $n \times n$ Hermitian matrices with eigenvalues
$a_1 \geq \ldots \geq a_n$ and $b_1 \geq \ldots \geq b_n$, respectively. The eigenvalues $\tilde{a}_1 \geq \ldots \geq \tilde{a}_n$ of the
Hermitian matrix $\tilde{A} = A + B$ satisfy

$$a_i + b_n \leq \tilde{a}_i \leq a_i + b_1. \tag{C.8}$$

Theorem C.4. Wielandt-Hoffman theorem [76] Let $A$ and $B$ be $n \times n$ real symmetric
matrices with eigenvalues $a_1 \geq \ldots \geq a_n$ and $b_1 \geq \ldots \geq b_n$, respectively. Let the Hermitian
matrix $\tilde{A} = A + B$ have eigenvalues $\tilde{a}_1 \geq \ldots \geq \tilde{a}_n$. Then the following bound holds

$$\sum_{i=1}^{n} (\tilde{a}_i - a_i)^2 \leq \|B\|_F^2 = \sum_{i=1}^{n} b_i^2. \tag{C.9}$$
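Both eigenvalue bounds can be checked numerically on a small example. The sketch below is only an illustration on $2 \times 2$ real symmetric matrices, for which the eigenvalues have a closed form; the function names and the example matrices are chosen for this note and do not come from the thesis.

```python
import math


def sym2_eigenvalues(m):
    """Eigenvalues (largest first) of a 2x2 real symmetric matrix [[p, q], [q, r]]."""
    (p, q), (_, r) = m
    mid = (p + r) / 2.0
    rad = math.sqrt(((p - r) / 2.0) ** 2 + q * q)
    return (mid + rad, mid - rad)


def check_weyl_and_wielandt_hoffman(A, B, tol=1e-12):
    """Return (weyl_holds, wielandt_hoffman_holds) for 2x2 symmetric A and B."""
    a = sym2_eigenvalues(A)
    b = sym2_eigenvalues(B)
    S = [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]
    s = sym2_eigenvalues(S)

    # Weyl: a_i + b_n <= s_i <= a_i + b_1 for every i.
    weyl = all(a[i] + b[1] - tol <= s[i] <= a[i] + b[0] + tol for i in range(2))

    # Wielandt-Hoffman: sum_i (s_i - a_i)^2 <= ||B||_F^2.
    frob_sq = sum(B[i][j] ** 2 for i in range(2) for j in range(2))
    wielandt_hoffman = sum((s[i] - a[i]) ** 2 for i in range(2)) <= frob_sq + tol
    return weyl, wielandt_hoffman


A = [[2.0, 0.0], [0.0, 1.0]]   # eigenvalues 2 and 1
B = [[0.0, 1.0], [1.0, 0.0]]   # eigenvalues 1 and -1
result = check_weyl_and_wielandt_hoffman(A, B)
```

Here $A + B$ has eigenvalues $(3 \pm \sqrt{5})/2 \approx 2.618, 0.382$, which lie within $[1, 3]$ and $[0, 2]$ respectively as (C.8) requires, and the squared eigenvalue shifts sum to about 0.764, below $\|B\|_F^2 = 2$ as in (C.9).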

Appendix D

Background on Manifold
Optimization

Definition D.1 (Gradient-related sequence). Given a cost function $f$ on a Riemannian
manifold $\mathcal{M}$, a sequence $\{\xi^{(t)}\}$, where $\xi^{(t)} \in T_{y^{(t)}}\mathcal{M}$, is gradient-related if, for any
subsequence $\{y^{(t)}\}_{t \in \tau}$ of $\{y^{(t)}\}$ that converges to a non-critical point of $f$, the corresponding
subsequence $\{\xi^{(t)}\}_{t \in \tau}$ is bounded and satisfies

$$\limsup_{t \to \infty,\ t \in \tau} \left\langle \nabla f(y^{(t)}), \xi^{(t)} \right\rangle < 0.$$

Here, $\langle \cdot, \cdot \rangle$ denotes the inner product. For a function $f$, a descent direction at a point $y$
is a vector along which moving leads to a reduction of the function. A direction $\xi$ is a
descent direction if the directional derivative along $\xi$ is negative, that is,

$$\langle \nabla f(y), \xi \rangle < 0.$$

Definition D.1 implies that a sequence of directions $\{\xi^{(t)}\}$ on the tangent space of $\mathcal{M}$ is
gradient related if it contains a subsequence of descent directions of $f$. Thus, moving along
a gradient-related sequence at each iteration would lead to a reduction of the function $f$.
To ensure the convergence of the proposed algorithm, the Armijo condition [9] is im-
posed on the choice of step size during the optimization. The condition is defined as
follows:

Definition D.2 (Armijo criterion). Given a cost function $f$ on a Riemannian manifold
$\mathcal{M}$ with retraction $R$, a point $y \in \mathcal{M}$, a tangent vector $\xi \in T_y\mathcal{M}$, and scalars $\bar{\eta} > 0$ and
$\sigma \in (0, 1)$, the step length $\bar{\eta}$ is said to satisfy the Armijo condition restricted to the direction
$\xi$ if the following inequality holds:

$$f(y) - f\big(R_y(\bar{\eta}\xi)\big) \geq -\sigma \bar{\eta} \left\langle \nabla f(y), \xi \right\rangle. \tag{D.1}$$

Figure D.1: Armijo condition for the choice of step size.

The Armijo condition is a popular line-search condition that states that the reduction
in $f$, given by $f(y) - f\big(R_y(\bar{\eta}\xi)\big)$, should be proportional to both the step length $\bar{\eta}$ and the
directional derivative $\langle \nabla f(y), \xi \rangle$, where $\sigma \in (0, 1)$ is the constant of proportionality.
Let $f^{(t)}$ denote the value of the objective function $f$ evaluated using $U_{Joint}^{(t)}$ and the $U_j^{(t)}$’s,
obtained at iteration $t$ of the proposed algorithm. For the proposed algorithm, the step
lengths for optimization on both the manifolds are chosen to be identical, that is,
$\eta_K = \eta_S = \eta$. Also, the direction of movement on the tangent space is always the negative
gradient $-\nabla f$ (as in (6.13) and (6.19)), and the retracted point from the tangent space
gives the next iterate. Between two consecutive iterations, the reduction in the objective
function $f$ is given by $f^{(t)} - f^{(t+1)}$. In order to satisfy the Armijo criterion, this reduction
must be proportional to the directional derivative. This is evaluated using

$$C_A = f^{(t)} - f^{(t+1)} + \sigma\eta \left\langle \nabla f, -\nabla f \right\rangle. \tag{D.2}$$

Here, $C_A \geq 0$ implies that the Armijo condition is satisfied and there has been a sufficient
reduction in the value of the objective function. The proposed algorithm moves to the next
iterate only when the Armijo criterion is satisfied. The value of the Armijo parameter $\sigma$ is set
to 1e−05 following [20].

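Definition D.3 (Armijo point). Given a cost function $f$ on a Riemannian manifold
$\mathcal{M}$ with retraction $R$, a point $y \in \mathcal{M}$, a tangent vector $\xi \in T_y\mathcal{M}$, and scalars $\bar{\eta} > 0$,
$\beta, \sigma \in (0, 1)$, the Armijo point is $\xi^A = \eta^A \xi = \beta^\omega \bar{\eta} \xi$, where $\omega$ is the smallest non-negative
integer such that

$$f(y) - f\big(R_y(\beta^\omega \bar{\eta} \xi)\big) \geq -\sigma \left\langle \nabla f(y), \beta^\omega \bar{\eta} \xi \right\rangle.$$

The real number $\eta^A$ is called the Armijo step size [3].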

The largest step size of the form $\beta^\omega \bar{\eta}$ that satisfies the Armijo condition is called the Armijo
step size $\eta^A$. It is given by $\eta^A = \beta^\omega \bar{\eta}$, such that $\omega$ is the smallest non-negative integer to achieve
this for a given $\bar{\eta} > 0$ and $\beta \in (0, 1)$. Figure D.1 shows an example of the Armijo condition
for choosing the step size. To choose a step size that satisfies the Armijo condition, we
start with a step length $\bar{\eta} > 0$ and then check for the choices $\beta\bar{\eta}, \beta^2\bar{\eta}, \ldots$, until $\beta^\omega \bar{\eta}$ falls
under the set of acceptable step sizes that satisfy (D.1). This choice of step size would give
a sufficient decrease in the value of the function $f$.
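The backtracking procedure can be sketched in a few lines. This illustration is deliberately simplified and is not the proposed algorithm: it works in Euclidean space, so the retraction is assumed to be $R_y(v) = y + v$ and the inner product is the ordinary dot product. The function name, the initial step `eta_bar = 1.0`, and the contraction factor `beta = 0.5` are assumptions; only $\sigma$ = 1e−05 is taken from the text.

```python
def armijo_step(f, grad, y, xi, eta_bar=1.0, beta=0.5, sigma=1e-5, max_iter=50):
    """Backtracking search for the Armijo step size along direction xi.

    Euclidean sketch: the retraction is y + eta * xi (an assumption).
    Returns (eta_A, omega), where eta_A = beta**omega * eta_bar.
    """
    f_y = f(y)
    # Directional derivative <grad f(y), xi>; negative for a descent direction.
    slope = sum(g * d for g, d in zip(grad(y), xi))

    eta = eta_bar
    for omega in range(max_iter):
        candidate = [yi + eta * di for yi, di in zip(y, xi)]
        # Armijo condition (D.1): the decrease must be at least -sigma * eta * slope.
        if f_y - f(candidate) >= -sigma * eta * slope:
            return eta, omega
        eta *= beta
    raise RuntimeError("no step size satisfying the Armijo condition was found")


def quadratic(y):
    return sum(v * v for v in y)


def quadratic_grad(y):
    return [2.0 * v for v in y]


start = [1.0, 1.0]
direction = [-g for g in quadratic_grad(start)]  # negative gradient
eta_A, omega = armijo_step(quadratic, quadratic_grad, start, direction)
```

For $f(y) = y_1^2 + y_2^2$ at $y = (1, 1)$ with the negative gradient as direction, the full step $\bar{\eta} = 1$ overshoots to $(-1, -1)$ and gives no decrease, so it is rejected; the next candidate $\beta\bar{\eta} = 0.5$ lands on the minimizer and is accepted, giving $\omega = 1$.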

List of Publications

Published/Accepted:

J1. Aparajita Khan and Pradipta Maji. Multi-Manifold Optimization for Multi-View
Subspace Clustering. IEEE Transactions on Neural Networks and Learning
Systems, pages 1-13, 2021. DOI: 10.1109/TNNLS.2021.3054789.

J2. Aparajita Khan and Pradipta Maji. Approximate Graph Laplacians for Multimodal
Data Clustering. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 43(3):798-813, 2021. DOI: 10.1109/TPAMI.2019.2945574.

J3. Aparajita Khan and Pradipta Maji. Selective Update of Relevant Eigenspaces for
Integrative Clustering of Multimodal Data. IEEE Transactions on Cybernetics,
pages 1-13, 2020. DOI: 10.1109/TCYB.2020.2990112.

J4. Aparajita Khan and Pradipta Maji. Low-Rank Joint Subspace Construction for Can-
cer Subtype Discovery. IEEE/ACM Transactions on Computational Biology
and Bioinformatics, 17(4):1290-1302, 2020. DOI: 10.1109/TCBB.2019.2894635.

Submitted:

J5. Aparajita Khan and Pradipta Maji. Geometry Aware Multi-View Clustering over
Riemannian Manifolds. IEEE Transactions on Pattern Analysis and Machine
Intelligence, pages 1-13, 2021 (Manuscript ID: TPAMI-2021-08-1458).

References

[1] M. Abavisani and V. M. Patel. Deep Multimodal Subspace Clustering Networks.
IEEE Journal of Selected Topics in Signal Processing, 12(6):1601–1614, 2018.
[2] P. Absil and J. Malick. Projection-like Retractions on Matrix Manifolds. SIAM
Journal on Optimization, 22(1):135–158, 2012.
[3] P. A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Man-
ifolds. Princeton University Press, Princeton, New Jersey, 2008. ISBN:978-0-691-
13298-3.
[4] C. C. Aggarwal, A. Hinneburg, and D. A. Keim. On the Surprising Behavior of
Distance Metrics in High Dimensional Spaces. In Proceedings of the 8th International
Conference on Database Theory, ICDT ’01, pages 420–434, Berlin, Heidelberg, 2001.
[5] G. Alexe, G. S. Dalgin, S. Ganesan, C. DeLisi, and G. Bhanot. Analysis of Breast
Cancer Progression Using Principal Component Analysis and Clustering. Journal of
Biosciences, 32:1027–1039, 2007.
[6] O. Alter, P. O. Brown, and D. Botstein. Singular Value Decomposition for Genome-
Wide Expression Data Processing and Modeling. Proceedings of the National
Academy of Sciences USA, 97(18):10101–10106, 2000.
[7] G. Andrew, R. Arora, J. Bilmes, and K. Livescu. Deep Canonical Correlation Anal-
ysis. In Proceedings of the 30th International Conference on Machine Learning, vol-
ume 28, pages 1247–1255, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
[8] P. Arabie and L. Hubert. Comparing Partitions. Journal of Classification, 2:193–218,
1985.
[9] L. Armijo. Minimization of Functions Having Lipschitz Continuous First Partial
Derivatives. Pacific Journal of Mathematics, 16(1):1–3, 1966.
[10] E. Begelfor and M. Werman. Affine Invariance Revisited. In Proceedings of the
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR’06), volume 2, pages 2087–2094, 2006.
[11] A. Benton, R. Arora, and M. Dredze. Learning Multiview Embeddings of Twitter
Users. In Proceedings of the 54th Annual Meeting of the Association for Compu-
tational Linguistics, pages 14–19, Berlin, Germany, August 2016. Association for
Computational Linguistics.

[12] A. Benton, H. Khayrallah, B. Gujral, D. A. Reisinger, S. Zhang, and R. Arora. Deep
Generalized Canonical Correlation Analysis. In Proceedings of the 4th Workshop
on Representation Learning for NLP (RepL4NLP-2019), pages 1–6, Florence, Italy,
August 2019. Association for Computational Linguistics.

[13] J. C. Bezdek, R. Ehrlich, and W. Full. FCM: The Fuzzy c-Means Clustering Algo-
rithm. Computers & Geosciences, 10(2):191–203, 1984.

[14] M. Bibikova, B. Barnes, C. Tsan, V. Ho, B. Klotzle, J. M. Le, D. Delano, L. Zhang,
G. P. Schroth, K. L. Gunderson, J.-B. Fan, and R. Shen. High Density DNA Methylation
Array with Single CpG Site Resolution. Genomics, 98(4):288–295, 2011. New
Genomic Technologies and Applications.

[15] C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, New
York, 2006. ISBN: 978-0-387-31073-2.

[16] A. Bjorck and G. H. Golub. Numerical Methods for Computing the Angles Between
Linear Subspaces. Mathematics of Computation, 27:579–594, 1973.

[17] M. B. Blaschko and C. H. Lampert. Correlational Spectral Clustering. In Proceedings
of the IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR 2008), pages 1–8, Los Alamitos, CA, USA, June 2008.
Max-Planck-Gesellschaft, IEEE Computer Society.

[18] A. Blum and T. Mitchell. Combining Labeled and Unlabeled Data with Co-Training.
In Proceedings of the 11th Annual Conference on Computational Learning Theory,
COLT’ 98, pages 92–100, New York, NY, USA, 1998. Association for Computing
Machinery.

[19] A. Bojchevski, Y. Matkovic, and S. Günnemann. Robust Spectral Clustering for
Noisy Data: Modeling Sparse Corruptions Improves Latent Embeddings. In Proceedings
of the 23rd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, KDD ’17, pages 737–746, New York, NY, USA, 2017. Association
for Computing Machinery.

[20] Nicolas Boumal. Optimization and Estimation on Manifolds. PhD thesis, Université
catholique de Louvain, 2014.

[21] M. Brand. Incremental Singular Value Decomposition of Uncertain Data with Missing
Values. In Proceedings of the European Conference on Computer Vision, pages 707–
720. Springer, 2002.

[22] F. Bray, J. Ferlay, I. Soerjomataram, R. L. Siegel, L. A. Torre, and A. Jemal. Global
Cancer Statistics 2018: GLOBOCAN Estimates of Incidence and Mortality Worldwide
for 36 Cancers in 185 Countries. CA: A Cancer Journal for Clinicians, Sep
2018.

[23] E. Bruno and S. Marchand-Maillet. Multiview Clustering: A Late Fusion Approach
Using Latent Models. In Proceedings of the 32nd International ACM SIGIR Conference
on Research and Development in Information Retrieval, pages 736–737, 2009.

[24] D. Cai, X. He, J. Han, and T. S. Huang. Graph Regularized Nonnegative Matrix
Factorization for Data Representation. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 33(8):1548–1560, 2011.

[25] M. Cai and L. Li. Subtype Identification from Heterogeneous TCGA Datasets on a
Genomic Scale by Multi-View Clustering with Enhanced Consensus. BMC Medical
Genomics, 10(4):75, December 2017.

[26] X. Cai, F. Nie, and H. Huang. Multi-View K-Means Clustering on Big Data. In
Proceedings of the Twenty-Third International Joint Conference on Artificial Intelli-
gence, pages 2598–2604, Beijing, China, 2013.

[27] X. Cao, C. Zhang, H. Fu, Si Liu, and Hua Zhang. Diversity-Induced Multi-view
Subspace Clustering. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pages 586–594, Boston, Massachusetts, 2015.

[28] T. Carson, D. G. Mixon, and S. Villar. Manifold Optimization for k-means Clustering.
In 2017 International Conference on Sampling Theory and Applications (SampTA),
pages 73–77, July 2017.

[29] Y. Chahlaoui, K. Gallivan, and P. Van Dooren. Recursive Calculation of Dominant
Singular Subspaces. SIAM Journal on Matrix Analysis and Applications, 25(2):445–463,
2003.

[30] M. A. Z. Chahooki and N. M. Charkari. Learning the Shape Manifold to Improve
Object Recognition. Machine Vision and Applications, 1(24):33–46, 2013.

[31] P. Chalise, D. C. Koestler, M. Bimali, Q. Yu, and B. L. Fridley. Integrative Clustering
Methods for High-Dimensional Molecular Data. Translational Cancer Research,
3(3):202, 2014.

[32] S. Chandrasekaran, B. S. Manjunath, Y. F. Wang, J. Winkeler, and H. Zhang. An
Eigenspace Update Algorithm for Image Analysis. Graphical Models and Image Processing,
59(5):321–332, 1997.

[33] S. Chang, J. Hu, T. Li, H. Wang, and B. Peng. Multi-View Clustering via Deep
Concept Factorization. Knowledge-Based Systems, 217:106807, 2021.

[34] G. Chao, S. Sun, and J. Bi. A Survey on Multi-View Clustering. arXiv e-prints, page
arXiv:1712.06246, December 2017.

[35] K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan. Multi-View Clustering
via Canonical Correlation Analysis. In Proceedings of the 26th Annual International
Conference on Machine Learning, ICML ’09, pages 129–136, New York, NY, USA,
2009. Association for Computing Machinery.

[36] F. Chen, G. Li, S. Wang, and Z. Pan. Multiview Clustering via Robust Neighboring
Constraint Nonnegative Matrix Factorization. Mathematical Problems in Engineer-
ing, 2019:1–10, November 2019.

[37] J. Chen, G. Wang, and G. B. Giannakis. Graph Multiview Canonical Correlation
Analysis. IEEE Transactions on Signal Processing, 67(11):2826–2838, 2019.

[38] J. Chen, G. Wang, and G. B. Giannakis. Multiview Canonical Correlation Analysis
over Graphs. In Proceedings of the 2019 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 2947–2951, 2019.

[39] Y. Chen, S. Wang, C. Peng, Z. Hua, and Y. Zhou. Generalized Nonconvex Low-Rank
Tensor Approximation for Multi-View Subspace Clustering. IEEE Transactions on
Image Processing, 30:4022–4035, 2021.

[40] Y. Chen, X. Xiao, and Y. Zhou. Multi-view Clustering via Simultaneously Learning
Graph Regularized Low-Rank Tensor Representation and Affinity Matrix. In Proceed-
ings of the 2019 IEEE International Conference on Multimedia and Expo (ICME),
pages 1348–1353, 2019.

[41] Y. Chen, X. Xiao, and Y. Zhou. Jointly Learning Kernel Representation Tensor
and Affinity Matrix for Multi-View Clustering. IEEE Transactions on Multimedia,
22(8):1985–1997, 2020.

[42] Y. Chen, X. Xiao, and Y. Zhou. Multi-view Subspace Clustering via Simultane-
ously Learning the Representation Tensor and Affinity Matrix. Pattern Recognition,
106:107441, 2020.

[43] C. M. Christoudias, R. Urtasun, and T. Darrell. Multi-View Learning in the Presence
of View Disagreement. In Proceedings of the Twenty-Fourth Conference on Uncertainty
in Artificial Intelligence, UAI’08, pages 88–96, Arlington, Virginia, USA, 2008.

[44] D. Chu, L. Liao, M. K. Ng, and X. Zhang. Sparse Canonical Correlation Analy-
sis: New Formulation and Algorithm. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 35(12):3050–3065, 2013.

[45] F. R. K. Chung. Spectral Graph Theory. Number 92. American Mathematical Society,
Providence, Rhode Island, 1997. ISBN: 0-8218-0315-8.

[46] P. Coretto, A. Serra, and R. Tagliaferri. Robust Clustering of Noisy High-Dimensional
Gene Expression Data for Patients Subtyping. Bioinformatics, 34(23):4064–4072,
2018.

[47] Corinna Cortes and Vladimir Vapnik. Support-Vector Networks. Machine Learning,
20(3):273–297, September 1995.

[48] D. L. Davies and D. W. Bouldin. A Cluster Separation Measure. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 1(2):224–227, 1979.

[49] C. Davis and W. Kahan. The Rotation of Eigenvectors by a Perturbation. III. SIAM
Journal on Numerical Analysis, 7(1):1–46, 1970.

[50] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete
Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B
(Methodological), 39(1):1–38, 1977.

[51] C. Dhanjal, R. Gaudel, and S. Clémençon. Efficient Eigen-Updating for Spectral
Graph Clustering. Neurocomputing, 131:440–452, 2014.

[52] C. Ding and X. He. K-means Clustering via Principal Component Analysis. In
Proceedings of the 21st International Conference on Machine learning, page 29. ACM,
2004.

[53] H. Ding, M. Sharpnack, C. Wang, K. Huang, and R. Machiraju. Integrative Cancer
Patient Stratification via Subspace Merging. Bioinformatics, 35(10):1653–1659, May
2019.

[54] A. Djelouah, J. Franco, E. Boyer, F. Le Clerc, and P. Pérez. Sparse Multi-View
Consistency for Object Segmentation. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 37(9):1890–1903, 2015.

[55] J. C. Dunn. A Fuzzy Relative of the ISODATA Process and Its Use in Detecting
Compact Well-Separated Clusters. Journal of Cybernetics, 3(3):32–57, 1973.

[56] C. Eckart and G. Young. The Approximation of One Matrix by Another of Lower
Rank. Psychometrika, 1(3):211–218, Sep 1936.

[57] A. Edelman, T. A. Arias, and S. T. Smith. The Geometry of Algorithms with Orthog-
onality Constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–
353, April 1999.

[58] Nour El Din Elmadany, Yifeng He, and Ling Guan. Multiview Learning via Deep
Discriminative Canonical Correlation Analysis. In Proceedings of the 2016 IEEE In-
ternational Conference on Acoustics, Speech and Signal Processing (ICASSP), pages
2409–2413, 2016.

[59] E. Elhamifar and R. Vidal. Sparse Subspace Clustering. In Proceedings of the 2009
IEEE Conference on Computer Vision and Pattern Recognition, pages 2790–2797,
2009.

[60] E. Elhamifar and R. Vidal. Sparse Subspace Clustering: Algorithm, Theory, and
Applications. IEEE Transactions on Pattern Analysis and Machine Intelligence,
35(11):2765–2781, 2013.

[61] J. Fang, D. Lin, S. C. Schulz, Z. Xu, V. D. Calhoun, and Y. P. Wang. Joint Sparse
Canonical Correlation Analysis for Detecting Differential Imaging Genetics Modules.
Bioinformatics, 32(22):3480–3488, November 2016.

[62] J. Farquhar, D. Hardoon, H. Meng, J. Shawe-Taylor, and S. Szedmák. Two View
Learning: SVM-2K, Theory and Practice. In Proceedings of the Advances in Neural
Information Processing Systems, volume 18. MIT Press, 2006.

[63] Q. Feng, M. Jiang, J. Hannig, and J.S. Marron. Angle-Based Joint and Individual
Variation Explained. Journal of Multivariate Analysis, 166:241–265, 2018.

[64] J. Ferlay, I. Soerjomataram, R. Dikshit, S. Eser, C. Mathers, M. Rebelo, D. M.
Parkin, D. Forman, and F. Bray. Cancer Incidence and Mortality Worldwide:
Sources, Methods and Major Patterns in GLOBOCAN 2012. International Jour-
nal of Cancer, 136(5):E359–386, Mar 2015.

[65] J. Ferlay, I. Soerjomataram, M. Ervik, R. Dikshit, S. Eser, C. Mathers, M. Rebelo,
D. M. Parkin, D. Forman, and F. Bray. GLOBOCAN 2012 v1.0, Cancer Incidence
and Mortality Worldwide: IARC CancerBase No. 11, 2013.

[66] M. Fiedler. A Property of Eigenvectors of Nonnegative Symmetric Matrices and its
Application to Graph Theory. Czechoslovak Mathematical Journal, 25(4):619–633,
1975.

[67] P. Flach. Machine Learning: The Art and Science of Algorithms that Make Sense of
Data. Cambridge University Press, New York, 2012. ISBN: 978-1-107-09639-4.

[68] A. L. N. Fred and A. K. Jain. Robust Data Clustering. In Proceedings of the IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, vol-
ume 3, pages 128–136, 2003.

[69] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning,
volume 1. Springer Series in Statistics, New York, 2001. ISBN: 978-0-387-84857-0.

[70] K. Fukui and A. Maki. Difference Subspace and Its Generalization for Subspace-
Based Methods. IEEE Transactions on Pattern Analysis and Machine Intelligence,
37(11):2164–2177, Nov 2015.

[71] H. Gao, F. Nie, X. Li, and H. Huang. Multi-view subspace clustering. In 2015 IEEE
International Conference on Computer Vision (ICCV), pages 4238–4246, 2015.

[72] Q. Gao, H. Lian, Q. Wang, and G. Sun. Cross-Modal Subspace Clustering via Deep
Canonical Correlation Analysis. Proceedings of the AAAI Conference on Artificial
Intelligence, 34(04):3938–3945, Apr 2020.

[73] Q. Gao, J. Ma, H. Zhang, X. Gao, and Y. Liu. Stable orthogonal local discrimi-
nant embedding for linear dimensionality reduction. IEEE Transactions on Image
Processing, 22(7):2521–2531, 2013.

[74] Z. Gao, Y. Wu, Y. Jia, and M. Harandi. Learning to Optimize on SPD Manifolds. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition (CVPR), pages 7697–7706, 2020.

[75] M. E. Garber, O. G. Troyanskaya, K. Schluens, S. Petersen, Z. Thaesler,
M. Pacyna-Gengelbach, M. van de Rijn, G. D. Rosen, C. M. Perou, R. I. Whyte,
R. B. Altman, P. O. Brown, D. Botstein, and I. Petersen. Diversity of Gene Expression in
Adenocarcinoma of the Lung. Proceedings of the National Academy of Sciences,
98(24):13784–13789, 2001.

[76] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University
Press, Baltimore, MD, USA, 1996. ISBN:0-8018-5414-8.

[77] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
ISBN:978-0262035613.

[78] D. Greene and P. Cunningham. Producing a Unified Graph Representation from
Multiple Social Network Views. In Proceedings of the 5th Annual ACM Web Science
Conference, WebSci ’13, pages 118–121, New York, NY, USA, 2013. ACM.

[79] Z. Gu, Z. Zhang, J. Sun, and B. Li. Robust Image Recognition by L1-norm Twin-
Projection Support Vector Machine. Neurocomputing, 223:1–11, 2017.

[80] C. Guo and D. Wu. Canonical Correlation Analysis (CCA) Based Multi-View Learn-
ing: An Overview. CoRR, abs/1907.01693, 2019.

[81] D. Guo, J. Zhang, X. Liu, Y. Cui, and C. Zhao. Multiple Kernel Learning Based
Multi-view Spectral Clustering. In Proceedings of the 22nd International Conference
on Pattern Recognition, pages 3774–3779, 2014.

[82] J. Guo, Y. Sun, J. Gao, Y. Hu, and B. Yin. Low Rank Representation on Product
Grassmann Manifolds for Multi-View Subspace Clustering. In Proceedings of the 25th
International Conference on Pattern Recognition (ICPR 2020), August 2020.

[83] Y. Guo and M. Xiao. Cross Language Text Classification via Subspace Co-Regularized
Multi-View Learning. In Proceedings of the 29th International Conference
on Machine Learning, ICML’12, pages 915–922, Madison, WI, USA, 2012. Omnipress.

[84] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On Clustering Validation Techniques.
Journal of Intelligent Information Systems, 17(2-3):107–145, 2001.

[85] P. Hall, D. Marshall, and R. Martin. Merging and Splitting Eigenspace Models. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 22(9):1042–1049, 2000.

[86] P. Hall, D. Marshall, and R. Martin. Adding and Subtracting Eigenspaces with
Eigenvalue Decomposition and Singular Value Decomposition. Image and Vision
Computing, (20):1009–1016, 2002.

[87] G. Hamerly and C. Elkan. Learning the k in k-means. In Proceedings of the Advances
in Neural Information Processing Systems, pages 281–288, 2004.

[88] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan
Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition, 2011.
ISBN: 978-0123814791.

[89] Y. Hasin, M. Seldin, and A. Lusis. Multi-Omics Approaches to Disease. Genome
Biology, 18(1):83, May 2017.

[90] K. A. Heller and Z. Ghahramani. Bayesian Hierarchical Clustering. In Proceedings
of the 22nd International Conference on Machine Learning, pages 297–304, 2005.

[91] K. H. Hellton and M. Thoresen. Integrative Clustering of High-Dimensional Data
with Joint and Individual Clusters. Biostatistics, 17(3):537–548, February 2016.

[92] C. Hennig. Cluster-Wise Assessment of Cluster Stability. Computational Statistics
& Data Analysis, 52(1):258–271, 2007.

[93] K. A. Hoadley, C. Yau, et al. Multiplatform Analysis of 12 Cancer Types Reveals
Molecular Classification Within and Across Tissues of Origin. Cell, 158(4):929–944,
2014.

[94] M. Horie and H. Kasai. Consistency-Aware and Inconsistency-Aware Graph-Based
Multi-View Clustering. In Proceedings of the 28th European Signal Processing Conference
(EUSIPCO), pages 1472–1476, 2021.

[95] Paul Horst. Relations Among m Sets of Measures. Psychometrika, 26:129–149, 1961.

[96] D. W. Hosmer, S. Lemeshow, and S. May. Applied Survival Analysis: Regression
Modeling of Time to Event Data. Wiley-Interscience, New York, NY, USA, 2nd
edition, 2008. ISBN: 9780471754992.

[97] Harold Hotelling. Relations Between Two Sets of Variates. Biometrika, 28(3/4):321–
377, 1936.

[98] C. Hou, F. Nie, H. Tao, and D. Yi. Multi-View Unsupervised Feature Selection with
Adaptive Similarity and View Weight. IEEE Transactions on Knowledge and Data
Engineering, 29(9):1998–2011, 2017.

[99] Z. Hu et al. The Molecular Portraits of Breast Tumors are Conserved Across Mi-
croarray Platforms. BMC Genomics, 7:96, April 2006.

[100] C. Huang, F. Chung, and S. Wang. Multi-View L2-SVM and Its Multi-View Core
Vector Machine. Neural Networks, 75(C):110–125, March 2016.

[101] J. Huang, F. Nie, H. Huang, and C. Ding. Robust Manifold Nonnegative Matrix
Factorization. ACM Transactions on Knowledge Discovery from Data, 8(3), June
2014.

[102] S. Huang, K. Chaudhary, and L. X. Garmire. More is Better: Recent Progress in
Multi-omics Data Integration Methods. Frontiers in Genetics, 8:84, 2017.

[103] S. Huang, Z. Kang, and Z. Xu. Auto-weighted Multi-View Clustering via Deep
Matrix Decomposition. Pattern Recognition, 97:107015, 2020.

[104] Y. Huang, W. Wang, L. Wang, and T. Tan. A General Nonlinear Embedding Frame-
work Based on Deep Neural Network. In Proceedings of the 22nd International Con-
ference on Pattern Recognition, pages 732–737, 2014.

[105] Y. Ji, Q. Wang, X. Li, and J. Liu. A Survey on Tensor Techniques and Applications
in Machine Learning. IEEE Access, 7:162950–162990, 2019.

[106] S. Ji-guang. Perturbation of Angles Between Linear Subspaces. Journal of Computational
Mathematics, 5(1):58–61, 1987.

[107] Y. Jia, H. Liu, J. Hou, S. Kwong, and Q. Zhang. Multi-View Spectral Clustering Tai-
lored Tensor Low-Rank Representation. IEEE Transactions on Circuits and Systems
for Video Technology, pages 1–1, 2021.

[108] S. Jing-Tao and Z. Qiu-Yu. Completion of Multiview Missing Data based on Multi-
manifold Regularised Non-negative Matrix Factorisation. Artificial Intelligence Re-
view, 53(7):5411–5428, 2020.

[109] S. Jung and J. S. Marron. PCA Consistency in High Dimension, Low Sample Size
Context. The Annals of Statistics, 37(6B):4104–4130, 2009.

[110] M. Kan, S. Shan, and X. Chen. Multi-View Deep Network for Cross-View Classi-
fication. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 4847–4855, 2016.

[111] A. Khan and P. Maji. Low-rank joint subspace construction for cancer subtype
discovery. IEEE/ACM Transactions on Computational Biology and Bioinformatics,
17(4):1290–1302, 2020. DOI: 10.1109/TCBB.2019.2894635.

[112] A. Khan and P. Maji. Selective Update of Relevant Eigenspaces for Integrative
Clustering of Multimodal Data. IEEE Transactions on Cybernetics, pages 1–13,
2020. DOI: 10.1109/TCYB.2020.2990112.

[113] A. Khan and P. Maji. Approximate Graph Laplacians for Multimodal Data Cluster-
ing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(3):798–
813, 2021. DOI: 10.1109/TPAMI.2019.2945574.

[114] A. Khan and P. Maji. Multi-Manifold Optimization for Multi-View Subspace Clus-
tering. IEEE Transactions on Neural Networks and Learning Systems, pages 1–13,
2021. DOI: 10.1109/TNNLS.2021.3054789.

[115] P. Kirk, J. E. Griffin, R. S. Savage, Z. Ghahramani, and D. L. Wild. Bayesian
Correlated Clustering to Integrate Multiple Datasets. Bioinformatics, 28(24):3290–3297,
Dec 2012.

[116] A. V. Knyazev and P. Zhu. Principal Angles Between Subspaces and their Tangents.
Technical Report TR2012-058, Mitsubishi Electric Research Laboratories, September
2012.

[117] M. Kosinski. RTCGA.clinical: Clinical Datasets from The Cancer Genome Atlas
Project, 2016. R package version 20151101.6.0.

[118] H. W. Kuhn. The Hungarian Method for the Assignment Problem. Naval Research
Logistics Quarterly, 2(1-2):83–97, 1955.

[119] A. Kumar and H. Daume III. A Co-Training Approach for Multi-View Spectral
Clustering. In Proceedings of the 28th International Conference on Machine Learning,
ICML’11, pages 393–400, Madison, WI, USA, 2011. Omnipress.

[120] A. Kumar, P. Rai, and H. Daumé. Co-Regularized Multi-View Spectral Clustering. In
Proceedings of the 24th International Conference on Neural Information Processing
Systems, NIPS’11, pages 1413–1421, Red Hook, NY, USA, 2011. Curran Associates
Inc.

[121] C. Lan, Y. Deng, X. Li, and J. Huan. Co-regularized Least Square Regression for
Multi-view Multi-class Classification. In Proceedings of the International Joint Con-
ference on Neural Networks (IJCNN), pages 342–347, 2016.

[122] B. Larsen and C. Aone. Fast and effective text mining using linear time document
clustering. In Proc. Knowledge Discovery and Data Mining, pages 16–22, San
Diego, USA, 1999.

[123] D. D. Lee and H. S. Seung. Algorithms for Non-Negative Matrix Factorization. In
Proceedings of the 13th International Conference on Neural Information Processing
Systems, NIPS’00, pages 535–541, Cambridge, MA, USA, 2000. MIT Press.

[124] G. Li, S. C. H. Hoi, and K. Chang. Two-View Transductive Support Vector Machines.
In Proceedings of the SIAM International Conference on Data Mining, SDM 2010,
April 29 - May 1, 2010, Columbus, Ohio, USA, pages 235–244. SIAM, 2010.

[125] J. Li, N. Allinson, D. Tao, and X. Li. Multitraining Support Vector Machine for
Image Retrieval. IEEE Transactions on Image Processing, 15(11):3597–3601, 2006.

[126] J. Li, L. Xie, Y. Xie, and F. Wang. Bregmannian Consensus Clustering for Cancer
Subtypes Analysis. Computer Methods and Programs in Biomedicine, 189:105337,
June 2020.

[127] J. Li, C. Xu, W. Yang, C. Sun, and D. Tao. Discriminative Multi-View Interactive
Image Re-Ranking. IEEE Transactions on Image Processing, 26(7):3113–3127, July
2017.

[128] X. Li, H. Zhang, R. Wang, and F. Nie. Multi-view clustering: A scalable and
parameter-free bipartite graph fusion method. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, pages 1–1, 2020.

[129] Y. Li, F. Nie, H. Huang, and J. Huang. Large-Scale Multi-View Spectral Clustering
via Bipartite Graph. In Proceedings of the 29th AAAI Conference on Artificial
Intelligence, AAAI’15, pages 2750–2756. AAAI Press, 2015.

[130] Y. Li, M. Yang, and Z. Zhang. A Survey of Multi-View Representation Learning.
IEEE Transactions on Knowledge and Data Engineering, 31(10):1863–1883, 2019.

[131] Z. Li, Q. Wang, Z. Tao, Q. Gao, and Z. Yang. Deep Adversarial Multi-View Clustering
Network. In Proceedings of the 28th International Joint Conference on Artificial
Intelligence, IJCAI-19, pages 2952–2958, July 2019.

[132] Y. Liang, D. Huang, and C. Wang. Consistency Meets Inconsistency: A Unified
Graph Learning Framework for Multi-View Clustering. In Proceedings of the IEEE
International Conference on Data Mining (ICDM), pages 1204–1209, 2019.

[133] D. Lin, J. Zhang, J. Li, V. D. Calhoun, H. W. Deng, and Y. P. Wang. Group Sparse
Canonical Correlation Analysis for Genomic Data Integration. BMC Bioinformatics,
14:245, Aug 2013.

[134] Y. Lin, T. Liu, and C. Fuh. Multiple Kernel Learning for Dimensionality Reduction.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(6):1147–1160,
June 2011.

[135] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma. Robust Recovery of Subspace
Structures by Low-Rank Representation. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 35(1):171–184, 2013.

[136] J. Liu, F. Cao, X.-Z. Gao, L. Yu, and J. Liang. A Cluster-Weighted Kernel K-Means
Method for Multi-View Clustering. In Proceedings of the 34th AAAI Conference on
Artificial Intelligence, AAAI 2020, pages 4860–4867. AAAI Press, 2020.

[137] J. Liu, C. Wang, J. Gao, and J. Han. Multi-View Clustering via Joint Nonnegative
Matrix Factorization. In Proceedings of the 2013 SIAM International Conference on
Data Mining, pages 252–260, 2013.

[138] X. Liu, X. Zhu, M. Li, L. Wang, C. Tang, J. Yin, D. Shen, H. Wang, and W. Gao. Late
Fusion Incomplete Multi-View Clustering. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 41(10):2410–2423, 2019.

[139] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information
Theory, 28(2):129–137, 1982.

[140] E. F. Lock and D. B. Dunson. Bayesian consensus clustering. Bioinformatics,
29(20):2610–2616, 2013.

[141] E. F. Lock, K. A. Hoadley, J. S. Marron, and A. B. Nobel. Joint and Individual
Variation Explained (JIVE) for Integrated Analysis of Multiple Data Types. The
Annals of Applied Statistics, 7(1):523–542, 2013.

[142] B. Long, P. S. Yu, and Z. Zhang. A General Model for Multiple View Unsupervised
Learning. In Proceedings of the 2008 SIAM International Conference on Data Mining,
pages 822–833. SIAM, 2008.

[143] Y. M. Lui, J. R. Beveridge, and M. Kirby. Canonical Stiefel Quotient and its Ap-
plication to Generic Face Recognition in Illumination Spaces. In Proceedings of the
IEEE 3rd International Conference on Biometrics: Theory, Applications, and Sys-
tems, pages 1–8, 2009.

[144] Y. Luo, D. Tao, K. Ramamohanarao, C. Xu, and Y. Wen. Tensor Canonical Correla-
tion Analysis for Multi-View Dimension Reduction. IEEE Transactions on Knowledge
and Data Engineering, 27(11):3111–3124, 2015.

[145] Y. Luo, D. Tao, C. Xu, C. Xu, H. Liu, and Y. Wen. Multiview Vector-Valued
Manifold Regularization for Multilabel Image Classification. IEEE Transactions on
Neural Networks and Learning Systems, 24(5):709–722, 2013.

[146] Y. Ma, X. Hu, T. He, and X. Jiang. Clustering and Integrating of Heterogeneous Mi-
crobiome Data by Joint Symmetric Nonnegative Matrix Factorization with Laplacian
Regularization. IEEE/ACM Transactions on Computational Biology and Bioinfor-
matics, 17(3):788–795, 2020.

[147] M. Meila and J. Shi. A Random Walks View of Spectral Segmentation. In Proceedings
of the Eighth International Workshop on Artificial Intelligence and Statistics,
volume R3 of Proceedings of Machine Learning Research, pages 203–208. PMLR,
04–07 Jan 2001. Reissued by PMLR on 31 March 2021.

[148] P. Maji and S. Paul. Scalable Pattern Recognition Algorithms: Applications in Com-
putational Biology and Bioinformatics. Springer-Verlag, London, April 2014. ISBN:
978-3-319-05629-6.

[149] A. Mandal and P. Maji. FaRoC: Fast and Robust Supervised Canonical Correlation
Analysis for Multimodal Omics Data. IEEE Transactions on Cybernetics, 48(4):1229–
1241, 2018.

[150] C. J. Mecklin. A Comparison of the Power of Classical and Newer Tests of Multi-
variate Normality. PhD thesis, University of Northern Colorado, 2000.

[151] M. Meilă. The Uniqueness of a Good Optimum for k-means. In Proceedings of the
23rd International Conference on Machine learning, pages 625–632. ACM, 2006.

[152] M. Meila and J. Shi. Learning Segmentation by Random Walks. In Proceedings of the
Advances in Neural Information Processing Systems 13, pages 873–879. MIT Press,
2001.

[153] M. Mendes and A. Pala. Type I Error Rate and Power of Three Normality Tests.
Pakistan Journal of Information and Technology, 2(2):135–139, 2003.

[154] G. F. Miranda, C. E. Thomaz, and G. A. Giraldi. Geometric Data Analysis Based
on Manifold Learning with Applications for Image Understanding. In Proceedings
of the 30th SIBGRAPI Conference on Graphics, Patterns and Images Tutorials
(SIBGRAPI-T), pages 42–62, Oct 2017.

[155] Q. Mo and R. Shen. iClusterPlus: Integrative clustering of multi-type genomic data,
2016. R package version 1.12.1.

[156] Q. Mo, S. Wang, V. E. Seshan, A. B. Olshen, N. Schultz, C. Sander, R. S. Powers,
M. Ladanyi, and R. Shen. Pattern Discovery and Cancer Gene Identification in
Integrated Cancer Genomic Data. Proceedings of the National Academy of Sciences,
110(11):4245–4250, 2013.

[157] B. Moghaddam and A. Pentland. Probabilistic Visual Learning for Object Represen-
tation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):696–
710, 1997.

[158] B. Mohar, Y. Alavi, G. Chartrand, and O. R. Oellermann. The Laplacian Spectrum
of Graphs. Graph Theory, Combinatorics, and Applications, 2(871-898):12, 1991.

[159] C. Moler and C. Van Loan. Nineteen Dubious Ways to Compute the Exponential of a
Matrix. SIAM Review, 20:801–836, October 1978.

[160] Stefano Monti, Pablo Tamayo, Jill Mesirov, and Todd Golub. Consensus Clustering:
a Resampling-based Method for Class Discovery and Visualization of Gene Expres-
sion Microarray Data. Machine Learning, 52:91–118, 2003.

[161] T. K. Moon. The Expectation-Maximization Algorithm. IEEE Signal Processing
Magazine, 13(6):47–60, 1996.

[162] A. Y. Ng, M. I. Jordan, and Y. Weiss. On Spectral Clustering: Analysis and an Al-
gorithm. In Proceedings of the 14th International Conference on Neural Information
Processing Systems: Natural and Synthetic, NIPS’01, pages 849–856, 2001.

[163] N. D. Nguyen, I. K. Blaby, and D. Wang. ManiNetCluster: a Novel Manifold Learning
Approach to Reveal the Functional Links Between Gene Networks. BMC Genomics,
20(Suppl 12):1003, December 2019.

[164] F. Nie, J. Li, and X. Li. Parameter-Free Auto-Weighted Multiple Graph Learning: A
Framework for Multiview Clustering and Semi-Supervised Classification. In Proceed-
ings of the 25th International Joint Conference on Artificial Intelligence, IJCAI’16,
pages 1881–1887. AAAI Press, 2016.

[165] F. Nie, J. Li, and X. Li. Self-weighted Multiview Clustering with Multiple Graphs.
In Proceedings of the 26th International Joint Conference on Artificial Intelligence,
IJCAI-17, pages 2564–2570, 2017.

[166] Feiping Nie, Guohao Cai, and Xuelong Li. Multi-view clustering and semi-supervised
classification with adaptive neighbours. In Proceedings of the Thirty-First AAAI Conference
on Artificial Intelligence, AAAI’17, pages 2408–2414. AAAI Press, 2017.

[167] G. Niu, Y. Yang, and L. Sun. One-Step Multi-View Subspace Clustering with In-
complete Views. Neurocomputing, 438:290–301, 2021.

[168] L. Niu, W. Li, D. Xu, and J. Cai. An Exemplar-Based Multi-View Domain General-
ization Framework for Visual Recognition. IEEE Transactions on Neural Networks
and Learning Systems, 29(2):259–272, 2018.

[169] W. Ou, S. Yu, G. Li, J. Lu, K. Zhang, and G. Xie. Multi-View Non-negative Matrix
Factorization by Patch Alignment Framework with View Consistency. Neurocomput-
ing, 204:116–124, 2016. Big Learning in Social Media Analytics.

[170] Z. Pawlak, J. Grzymala-Busse, R. Slowinski, and W. Ziarko. Rough Sets.
Communications of the ACM, 38(11):88–95, November 1995.

[171] B. Pepik, M. Stark, P. Gehler, and B. Schiele. Multi-View and 3D Deformable
Part Models. IEEE Transactions on Pattern Analysis and Machine Intelligence,
37(11):2232–2245, November 2015.

[172] R. Peto and J. Peto. Asymptotically Efficient Rank Invariant Test Procedures. Jour-
nal of the Royal Statistical Society. Series A (General), 135(2):185–207, 1972.

[173] W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of
the American Statistical association, 66(336):846–850, 1971.
[174] N. Rappoport and R. Shamir. Multi-Omic and Multi-View Clustering Algorithms:
Review and Cancer Benchmark. Nucleic Acids Research, 46(20):10546–10562,
November 2018.
[175] N. Rasiwasia, D. Mahajan, V. Mahadevan, and G. Aggarwal. Cluster Canonical
Correlation Analysis. In Proceedings of the 17th International Conference on Artificial
Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research,
pages 823–831, Reykjavik, Iceland, 22–25 Apr 2014. PMLR.
[176] N. M. Razali and Y. B. Wah. Power Comparisons of Shapiro-Wilk, Kolmogorov-
Smirnov, Lilliefors and Anderson-Darling Tests. Journal of statistical modeling and
analytics, 2(1):21–33, 2011.
[177] E. Rendón, I. M. Abundez, C. Gutierrez, S. Zagal, A. Arizmendi, E. M. Quiroz, and
H. E. Arzate. A Comparison of Internal and External Cluster Validation Indexes. In
Proceedings of the 2011 American Conference on Applied Mathematics and the 5th
WSEAS International Conference on Computer Engineering and Applications, pages
158–163, 2011.
[178] W. Rong, E. Zhuo, H. Peng, J. Chen, H. Wang, C. Han, and H. Cai. Learning
a Consensus Affinity Matrix for Multi-View Clustering via Subspaces Merging on
Grassmann Manifold. Information Sciences, 547:68–87, 2021.
[179] P. J. Rousseeuw. Silhouettes: a Graphical Aid to the Interpretation and Validation
of Cluster Analysis. Journal of Computational and Applied Mathematics, 20:53–65,
1987.
[180] S. T. Roweis and L. K. Saul. Nonlinear Dimensionality Reduction by Locally Linear
Embedding. Science, 290(5500):2323–2326, December 2000.
[181] J. P. Royston. An Extension of Shapiro and Wilk’s W Test for Normality to Large
Samples. Applied Statistics, pages 115–124, 1982.
[182] J. P. Royston. Some Techniques for Assessing Multivariate Normality Based on the
Shapiro-Wilk W. Applied Statistics, pages 121–133, 1983.
[183] J. P. Royston. Approximating the Shapiro-Wilk W-test for Non-normality. Statistics
and Computing, 2(3):117–119, 1992.
[184] J. Ruiz-del-Solar and P. Navarrete. Recursive Estimation of Motion Parameters.
Computer Vision and Image Understanding, 64(3):434–442, 1996.
[185] J. Ruiz-del-Solar and P. Navarrete. Eigenspace-Based Face Recognition: A Com-
parative Study of Different Approaches. IEEE Transactions on Systems, Man, and
Cybernetics, 35(2):315–325, 2006.
[186] L. K. Saul and S. T. Roweis. Think Globally, Fit Locally: Unsupervised Learning
of Low Dimensional Manifolds. Journal of Machine Learning Research, 4:119–155,
December 2003.

[187] M. Seeland and P. Mäder. Multi-View Classification with Convolutional Neural
Networks. PLOS ONE, 16(1):1–17, January 2021.

[188] H. S. Seung and D. D. Lee. Cognition. The Manifold Ways of Perception. Science,
290(5500):2268–2269, December 2000.

[189] S. S. Shapiro and M. B. Wilk. An Analysis of Variance Test for Normality (Complete
Samples). Biometrika, 52(3/4):591–611, 1965.

[190] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge
University Press, Cambridge, 2004. DOI: 10.1017/CBO9780511809682.

[191] R. Shen, Q. Mo, N. Schultz, V. E. Seshan, A. B. Olshen, J. Huse, M. Ladanyi, and
C. Sander. Integrative Subtype Discovery in Glioblastoma using iCluster. PloS One,
7(4):e35236, 2012.

[192] R. Shen, A. B. Olshen, and M. Ladanyi. Integrative Clustering of Multiple Genomic
Data Types using a Joint Latent Variable Model with Application to Breast and
Lung Cancer Subtype Analysis. Bioinformatics, 25(22):2906–2912, 2009.

[193] R. Shen, S. Wang, and Q. Mo. Sparse Integrative Clustering of Multiple Omics Data
Sets. The Annals of Applied Statistics, 7(1):269–294, 2013.

[194] J. Shi and J. Malik. Normalized Cuts and Image Segmentation. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 22(8):888–905, Aug 2000.

[195] S. Shirazi, M. T. Harandi, B. C. Lovell, and C. Sanderson. Object Tracking via Non-
Euclidean Geometry: A Grassmann Approach. In Proceedings of the IEEE Winter
Conference on Applications of Computer Vision, pages 901–908, 2014.

[196] V. Sindhwani and P. Niyogi. A Co-regularized Approach to Semi-supervised Learning
with Multiple Views. In Proceedings of the ICML Workshop on Learning with Multiple
Views, 2005.

[197] V. Sindhwani and D. S. Rosenberg. An RKHS for Multi-View Learning and Manifold
Co-Regularization. In Proceedings of the 25th International Conference on Machine
Learning, ICML ’08, pages 976–983, New York, NY, USA, 2008. Association for
Computing Machinery.

[198] T. Sorlie, C. M. Perou, R. Tibshirani, T. Aas, S. Geisler, H. Johnsen, T. Hastie,
M. B. Eisen, M. van de Rijn, S. S. Jeffrey, T. Thorsen, H. Quist, J. C. Matese, P. O.
Brown, D. Botstein, P. E. Lønning, and A. L. Borresen-Dale. Gene Expression
Patterns of Breast Carcinomas Distinguish Tumor Subclasses with Clinical
Implications. Proceedings of the National Academy of Sciences U.S.A., 98(19):10869–10874,
September 2001.

[199] N. K. Speicher and N. Pfeifer. Integrating Different Data Types by Regularized Un-
supervised Multiple Kernel Learning with Application to Cancer Subtype Discovery.
Bioinformatics, 31(12):i268–i275, 06 2015.

[200] D. A. Spielman and S.-H. Teng. Spectral Partitioning Works: Planar Graphs and
Finite Element Meshes. Linear Algebra and its Applications, 421(2):284–305, 2007.
Special Issue in honor of Miroslav Fiedler.

[201] A. Srivastava and E. Klassen. Bayesian and Geometric Subspace Tracking. Advances
in Applied Probability, 36(1):43–56, 2004.

[202] G. W. Stewart and J. Sun. Matrix Perturbation Theory. Academic press, New York,
1990. ISBN: 9780126702309.

[203] S. Sun, L. Mao, Z. Dong, and L. Wu. Multiview Machine Learning. Springer Singa-
pore, Singapore, 2019. ISBN: 978-981-13-3029-2.

[204] S. Sun and J. Shawe-Taylor. Sparse Semi-supervised Learning Using Conjugate Func-
tions. Journal of Machine Learning Research, 11(84):2423–2455, 2010.

[205] S. Sun, X. Xie, and C. Dong. Multiview Learning With Generalized Eigenvalue
Proximal Support Vector Machines. IEEE Transactions on Cybernetics, 49(2):688–
697, 2019.

[206] S. Sun, X. Xie, and M. Yang. Multiview Uncorrelated Discriminant Analysis. IEEE
Transactions on Cybernetics, 46(12):3272–3284, 2016.

[207] S. Sun and D. Zong. LCBM: A Multi-View Probabilistic Model for Multi-label
Classification. IEEE Transactions on Pattern Analysis and Machine Intelligence,
pages 1–1, 2020.

[208] T. Sun, S. Chen, J. Yang, and P. Shi. A Novel Method of Combined Feature Ex-
traction for Recognition. In Proceedings of the 8th IEEE International Conference
on Data Mining, pages 1043–1048, 2008.

[209] H. Tabia and H. Laga. Covariance-Based Descriptors for Efficient 3D Shape Match-
ing, Retrieval, and Classification. IEEE Transactions on Multimedia, 17(9):1591–
1603, 2015.

[210] J. Tang, Y. Tian, P. Zhang, and X. Liu. Multiview Privileged Support Vector Ma-
chines. IEEE Transactions on Neural Networks and Learning Systems, 29(8):3463–
3477, 2018.

[211] X. Tang, X. Tang, W. Wang, L. Fang, and X. Wei. Deep Multi-View Sparse Sub-
space Clustering. In Proceedings of the 7th International Conference on Network,
Communication and Computing, ICNCC 2018, pages 115–119, New York, NY, USA,
2018. Association for Computing Machinery.

[212] H. Tao, C. Hou, Y. Qian, J. Zhu, and D. Yi. Latent Complete Row Space Recovery for
Multi-View Subspace Clustering. IEEE Transactions on Image Processing, 29:8083–
8096, 2020.

[213] H. Tao, C. Hou, J. Zhu, and D. Yi. Multi-View Clustering with Adaptively Learned
Graph. In Proceedings of the 9th Asian Conference on Machine Learning, volume 77
of Proceedings of Machine Learning Research, pages 113–128. PMLR, November 2017.

[214] TCGA Network. Comprehensive Molecular Portraits of Human Breast Tumours.
Nature, 490(7418):61–70, October 2012.

[215] TCGA Research Network. Integrated Genomic Analyses of Ovarian Carcinoma. Na-
ture, 474(7353):609–615, Jun 2011.

[216] TCGA Research Network. Comprehensive Molecular Characterization of Gastric
Adenocarcinoma. Nature, 513(7517):202–209, 2014.

[217] TCGA Research Network. Comprehensive, Integrative Genomic Analysis of Diffuse
Lower-Grade Gliomas. The New England Journal of Medicine, 372(26):2481–2498,
2015.

[218] TCGA Research Network. Integrated Genomic and Molecular Characterization of
Cervical Cancer. Nature, 543(7645):378–384, 2017.

[219] A. Tenenhaus and M. Tenenhaus. Regularized Generalized Canonical Correlation
Analysis. Psychometrika, 76(2):257–284, April 2011.

[220] S. Theodoridis and K. Koutroumbas. Pattern Recognition. Academic Press, Inc.,
USA, 4th edition, 2008. ISBN: 9781597492720.

[221] R. Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal
Statistical Society. Series B (Methodological), 58(1):267–288, 1996.

[222] W. D. Travis, E. Brambilla, A. P. Burke, A. Marx, and A. G. Nicholson. Introduction
to The 2015 World Health Organization Classification of Tumors of the Lung, Pleura,
Thymus, and Heart. Journal of Thoracic Oncology, 10(9):1240–1242, September
2015.

[223] A. Trivedi, P. Rai, H. Daume III, and S. L. DuVall. Multiview Clustering with
Incomplete Views. In Proceedings of the Neural Information Processing Systems
Workshop, volume 224, pages 1–8, 2010.

[224] P. Turaga, A. Veeraraghavan, A. Srivastava, and R. Chellappa. Statistical
Computations on Grassmann and Stiefel Manifolds for Image and Video-Based Recognition.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(11):2273–2286,
2011.

[225] G. Tzortzis and A. Likas. Kernel-Based Weighted Multi-View Clustering. In
Proceedings of the IEEE 12th International Conference on Data Mining, pages 675–684,
2012.

[226] V. Vapnik and R. Izmailov. Learning Using Privileged Information: Similarity Con-
trol and Knowledge Transfer. Journal of Machine Learning Research, 16(61):2023–
2049, 2015.

[227] V. Vapnik and A. Vashist. A New Learning Paradigm: Learning using Privileged
Information. Neural Networks, 22(5):544–557, 2009. Advances in Neural Networks
Research: IJCNN2009.

[228] R. G. Verhaak, K. A. Hoadley, E. Purdom, V. Wang, Y. Qi, M. D. Wilkerson,
C. R. Miller, L. Ding, T. Golub, J. P. Mesirov, G. Alexe, M. Lawrence, M. O’Kelly,
P. Tamayo, B. A. Weir, S. Gabriel, W. Winckler, S. Gupta, L. Jakkula, H. S. Feiler,
J. G. Hodgson, C. D. James, J. N. Sarkaria, C. Brennan, A. Kahn, P. T. Spellman,
R. K. Wilson, T. P. Speed, J. W. Gray, M. Meyerson, G. Getz, C. M. Perou, and
D. N. Hayes. Integrated Genomic Analysis Identifies Clinically Relevant Subtypes of
Glioblastoma Characterized by Abnormalities in PDGFRA, IDH1, EGFR, and NF1.
Cancer Cell, 17(1):98–110, 2010.

[229] J. Via, I. Santamaria, and J. Perez. A Learning Algorithm for Adaptive Canonical
Correlation Analysis of Several Data Sets. Neural Networks, 20(1):139–152, 2007.

[230] U. Von Luxburg. A Tutorial on Spectral Clustering. Statistics and computing,
17(4):395–416, 2007.

[231] D. Wagner and F. Wagner. Between Min Cut and Graph Bisection. In Proceedings
of the International Symposium on Mathematical Foundations of Computer Science,
pages 744–750. Springer, 1993.

[232] B. Wang, Y. Hu, J. Gao, Y. Sun, F. Ju, and B. Yin. Adaptive Fusion of Heterogeneous
Manifolds for Subspace Clustering. IEEE Transactions on Neural Networks and
Learning Systems, pages 1–14, 2020.

[233] B. Wang, Y. Hu, J. Gao, Y. Sun, F. Ju, and B. Yin. Learning Adaptive Neighborhood
Graph on Grassmann Manifolds for Video/Image-Set Subspace Clustering. IEEE
Transactions on Multimedia, 23:216–227, 2021.

[234] B. Wang, A. M. Mezlini, F. Demir, M. Fiume, Z. Tu, M. Brudno, B. Haibe-Kains,
and A. Goldenberg. Similarity Network Fusion for Aggregating Data Types on a
Genomic Scale. Nature Methods, 11:333–337, 2014.

[235] H. Wang, F. Nie, and H. Huang. Multi-View Clustering and Feature Learning via
Structured Sparsity. In Proceedings of the 30th International Conference on Machine
Learning, volume 28 of Proceedings of Machine Learning Research, pages 352–360,
Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.

[236] H. Wang, Y. Yang, and B. Liu. GMC: Graph-Based Multi-View Clustering. IEEE
Transactions on Knowledge and Data Engineering, 32(6):1116–1129, 2020.

[237] H. Wang, Y. Yang, B. Liu, and H. Fujita. A Study of Graph-Based System for
Multi-View Clustering. Knowledge-Based Systems, 163:1009–1019, 2019.

[238] Q. Wang, J. Cheng, Q. Gao, G. Zhao, and L. Jiao. Deep Multi-View Subspace Clus-
tering with Unified and Discriminative Learning. IEEE Transactions on Multimedia,
pages 1–1, 2020.

[239] Q. Wang, Z. Ding, Z. Tao, Q. Gao, and Y. Fu. Partial Multi-View Clustering via
Consistent GAN. In Proceedings of the IEEE International Conference on Data Mining
(ICDM), pages 1290–1295, 2018.

[240] X. Wang, X. Guo, Z. Lei, C. Zhang, and S. Z. Li. Exclusivity-Consistency Regu-
larized Multi-View Subspace Clustering. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pages 1–9, Los Alamitos, CA,
USA, July 2017. IEEE Computer Society.

[241] Y. Wang, X. Lin, L. Wu, W. Zhang, Q. Zhang, and X. Huang. Robust Subspace Clus-
tering for Multi-View Data by Exploiting Correlation Consensus. IEEE Transactions
on Image Processing, 24(11):3939–3949, 2015.

[242] P. Wedin. Perturbation Bounds in Connection with Singular Value Decomposition.
BIT Numerical Mathematics, 12(1):99–111, 1972.

[243] D. Wu, D. Wang, M. Q. Zhang, and J. Gu. Fast Dimension Reduction and Integra-
tive Clustering of Multi-omics Data using Low-rank Approximation: Application to
Cancer Molecular Classification. BMC genomics, 16(1):1022, 2015.

[244] J. Wu, Z. Lin, and H. Zha. Essential Tensor Learning for Multi-View Spectral Clus-
tering. IEEE Transactions on Image Processing, 28(12):5910–5922, 2019.

[245] J. Wu, X. Xie, L. Nie, Z. Lin, and H. Zha. Unified Graph and Low-Rank Tensor
Learning for Multi-View Clustering. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 34, 2020.

[246] R. Xia, Y. Pan, L. Du, and J. Yin. Robust Multi-View Spectral Clustering via Low-
Rank and Sparse Decomposition. In Proceedings of the 28th AAAI Conference on
Artificial Intelligence, pages 2149–2155, 2014.

[247] T. Xia, D. Tao, T. Mei, and Y. Zhang. Multiview Spectral Embedding. IEEE
Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 40(6):1438–
1446, Dec 2010.

[248] D. Xie, Q. Gao, S. Deng, X. Yang, and X. Gao. Multiple Graphs Learning with a
New Weighted Tensor Nuclear Norm. Neural Networks, 133:57–68, 2021.

[249] D. Xie, W. Xia, Q. Wang, Q. Gao, and S. Xiao. Multi-View Clustering by Joint
Manifold Learning and Tensor Nuclear Norm. Neurocomputing, 380:105–114, 2020.

[250] M. Xie, Z. Ye, G. Pan, and X. Liu. Incomplete Multi-View Subspace Clustering with
Adaptive Instance-sample Mapping and Deep Feature Fusion. Applied Intelligence,
01 2021.

[251] X. L. Xie and G. Beni. A Validity Measure for Fuzzy Clustering. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 13(8):841–847, 1991.

[252] Y. Xie, J. Liu, Y. Qu, D. Tao, W. Zhang, L. Dai, and L. Ma. Robust Kernelized
Multiview Self-Representation for Subspace Clustering. IEEE Transactions on Neural
Networks and Learning Systems, 32(2):868–881, 2021.

[253] Y. Xie, D. Tao, W. Zhang, Y. Liu, L. Zhang, and Y. Qu. On Unifying Multi-View Self-
Representations for Clustering by Tensor Multi-Rank Minimization. International
Journal of Computer Vision, 126:1157–1179, 11 2018.

[254] Xie Xijiong and Shiliang Sun. Multi-view laplacian twin support vector machines.
Intelligent Data Analysis, 19:701–712, 07 2015.

[255] C. Xu, H. Liu, Z. Guan, X. Wu, J. Tan, and B. Ling. Adversarial Incomplete
Multiview Subspace Clustering Networks. IEEE Transactions on Cybernetics, pages
1–14, 2021.

[256] C. Xu, D. Tao, and C. Xu. Large-margin multi-view information bottleneck. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 36(8):1559–1572, August
2014.

[257] C. Xu, D. Tao, and C. Xu. Multi-view learning with incomplete views. IEEE Trans-
actions on Image Processing, 24(12):5812–5825, 2015.

[258] H. Xu, X. Zhang, W. Xia, Q. Gao, and X. Gao. Low-rank Tensor Constrained
Co-regularized Multi-view Spectral Clustering. Neural Networks, 132:245–252, 2020.

[259] J. Xu, X. Zhang, W. Li, X. Liu, and J. Han. Joint Multi-view 2D Convolutional
Neural Networks for 3D Object Classification. In Proceedings of the 29th International
Joint Conference on Artificial Intelligence, IJCAI-20, pages 3202–3208, July 2020.

[260] Z. Xue, J. Du, D. Du, and S. Lyu. Deep Low-rank Subspace Ensemble for Multi-view
Clustering. Information Sciences, 482:210–227, 2019.

[261] B. Yang, X. Zhang, F. Nie, F. Wang, W. Yu, and R. Wang. Fast Multi-View Clus-
tering via Nonnegative and Orthogonal Factorization. IEEE Transactions on Image
Processing, 30:2575–2586, 2021.

[262] D. Yang, Z. Ma, and A. Buja. A Sparse Singular Value Decomposition Method
for High-Dimensional Data. Journal of Computational and Graphical Statistics,
23(4):923–942, 2014.

[263] Mo Yang and Shiliang Sun. Multi-view Uncorrelated Linear Discriminant Analysis
with Applications to Handwritten Digit Recognition. In Proceedings of the Interna-
tional Joint Conference on Neural Networks (IJCNN), pages 4175–4181, 2014.

[264] Y. Yang and H. Wang. Multi-view Clustering: A Survey. Big Data Mining and
Analytics, 01(02):83, 2018.

[265] Y. Yao, Y. Li, B. Jiang, and H. Chen. Multiple Kernel k-Means Clustering by Select-
ing Representative Kernels. IEEE Transactions on Neural Networks and Learning
Systems, pages 1–14, 2020.

[266] Y. Ye, X. Liu, J. Yin, and E. Zhu. Co-regularized Kernel k-means for Multi-view
Clustering. In Proceedings of the 23rd International Conference on Pattern Recogni-
tion (ICPR), pages 1583–1588, 2016.

[267] C. Z. You, H. H. Fan, and Z. Q. Shu. Non-negative Sparse Laplacian regularized
Latent Multi-view Subspace Clustering. In Proceedings of the 19th International
Symposium on Distributed Computing and Applications for Business Engineering and
Science (DCABES), pages 210–213, 2020.

[268] H. Yu, T. Zhang, Y. Lian, and Y. Cai. Co-regularized Multi-view Subspace Clustering.
In Proceedings of the 10th Asian Conference on Machine Learning, volume 95
of Proceedings of Machine Learning Research, pages 17–32. PMLR, 14–16 Nov 2018.

[269] Y. Yu, L. Zhang, and S. Zhang. Simultaneous Clustering of Multiview Biomedical
Data using Manifold Optimization. Bioinformatics, 35(20):4029–4037, 03 2019.

[270] L. A. Zadeh. Fuzzy Sets. Information and Control, 8(3):338–353, 1965.

[271] H. Zha, X. He, C. Ding, H. Simon, and M. Gu. Spectral Relaxation for k-means
Clustering. In Proceedings of the Neural Information Processing Systems, volume 14,
pages 1057–1064, Vancouver, Canada, 2001.

[272] K. Zhan, X. Chang, J. Guan, L. Chen, Z. Ma, and Y. Yang. Adaptive Structure
Discovery for Multimedia Analysis Using Multiple Features. IEEE Transactions on
Cybernetics, 49(5):1826–1834, 2019.

[273] K. Zhan, C. Zhang, J. Guan, and J. Wang. Graph Learning for Multiview Clustering.
IEEE Transactions on Cybernetics, 48(10):2887–2895, 2018.

[274] C. Zhang, E. Adeli, T. Zhou, X. Chen, and D. Shen. Multi-Layer Multi-View Clas-
sification for Alzheimer’s Disease Diagnosis. 2018:4406–4413, February 2018.

[275] C. Zhang, H. Fu, Q. Hu, X. Cao, Y. Xie, D. Tao, and D. Xu. Generalized La-
tent Multi-View Subspace Clustering. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 42(1):86–99, 2020.

[276] C. Zhang, H. Fu, S. Liu, G. Liu, and X. Cao. Low-Rank Tensor Constrained Multi-
view Subspace Clustering. In Proceedings of the IEEE International Conference on
Computer Vision (ICCV), pages 1582–1590, 2015.

[277] C. Zhang, Q. Hu, H. Fu, P. Zhu, and X. Cao. Latent Multi-view Subspace Clustering.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 4333–4341, 2017.

[278] C. Zhang, J. Liu, Q. Shi, X. Yu, T. Zeng, and L. Chen. Integration of Multiple
Heterogeneous Omics Data. In IEEE International Conference on Bioinformatics
and Biomedicine (BIBM), pages 564–569, December 2016.

[279] D. Zhang and S. Chen. Clustering Incomplete Data Using Kernel-Based Fuzzy C-
means Algorithm. Neural Processing Letters, 18(3):155–162, 2003.

[280] J. Zhang, G. Zhu, R. W. Heath Jr., and K. Huang. Grassmannian Learning: Em-
bedding Geometry Awareness in Shallow and Deep Learning. Computing Research
Repository (CoRR), abs/1808.02229, 2018.

[281] S. Zhang, C. C. Liu, W. Li, H. Shen, P. W. Laird, and X. J. Zhou. Discovery of Multi-
dimensional Modules by Integrative Analysis of Cancer Genomic Data. Nucleic Acids
Research, 40(19):9379–9391, October 2012.

[282] W. Zhang, Y. Liu, N. Sun, D. Wang, J. Boyd-Kirkup, X. Dou, and J. D. Han.
Integrating Genomic, Epigenomic, and Transcriptomic Features Reveals Modular
Signatures Underlying Poor Prognosis in Ovarian Cancer. Cell Reports, 4(3):542–
553, 2013.

[283] X. Zhang, L. Zhao, L. Zong, X. Liu, and H. Yu. Multi-view Clustering via Multi-
manifold Regularized Nonnegative Matrix Factorization. In Proceedings of the IEEE
International Conference on Data Mining, pages 1103–1108, 2014.

[284] X. Zhang, L. Zong, X. Liu, and H. Yu. Constrained NMF-Based Multi-View Clus-
tering on Unmapped Data. In Proceedings of the 29th AAAI Conference on Artificial
Intelligence, AAAI’15, pages 3174–3180. AAAI Press, 2015.

[285] Y. Zhang, W. Yang, B. Liu, G. Ke, Y. Pan, and J. Yin. Multi-view Spectral Cluster-
ing via Tensor-SVD Decomposition. In Proceedings of the IEEE 29th International
Conference on Tools with Artificial Intelligence (ICTAI), pages 493–497, 2017.

[286] Z. Zhang, Z. Zhai, and L. Li. Uniform Projection for Multi-View Learning. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 39(8):1675–1689, August
2017.

[287] H. Zhao, Z. Ding, and Y. Fu. Multi-View Clustering via Deep Matrix Factorization.
In Proceedings of the 31st AAAI Conference on Artificial Intelligence, AAAI’17, pages
2921–2927. AAAI Press, 2017.

[288] J. Zhao, X. Xie, X. Xu, and S. Sun. Multi-View Learning Overview: Recent Progress
and New Challenges. Information Fusion, 38(C):43–54, November 2017.

[289] L. Zhao, T. Yang, J. Zhang, Z. Chen, Y. Yang, and Z. J. Wang. Co-Learning
Non-Negative Correlated and Uncorrelated Features for Multi-View Data. IEEE
Transactions on Neural Networks and Learning Systems, pages 1–11, 2020.

[290] W. Zhao, S. Tan, Z. Guan, B. Zhang, M. Gong, Z. Cao, and Q. Wang. Learning
to Map Social Network Users by Unified Manifold Alignment on Hypergraph. IEEE
Transactions on Neural Networks and Learning Systems, 29(12):5834–5846, Decem-
ber 2018.

[291] X. Zhao, N. Evans, and J. Dugelay. A Subspace Co-training Framework for Multi-
view Clustering. Pattern Recognition Letters, 41:73–82, 2014. Supervised and Unsu-
pervised Classification Techniques and their Applications.

[292] Q. Zheng, J. Zhu, Z. Li, S. Pang, Jun Wang, and Lei Chen. Consistent and Com-
plementary Graph Regularized Multi-view Subspace Clustering. arXiv, 2004.03106,
2020.

[293] Q. Zheng, J. Zhu, Z. Tian, Z. Li, S. Pang, and X. Jia. Constrained Bilinear Fac-
torization Multi-view Subspace Clustering. Knowledge-Based Systems, 194:105514,
2020.

[294] D. Zhou and C. J. C. Burges. Spectral Clustering and Transductive Learning with
Multiple Views. In Proceedings of the 24th International Conference on Machine
Learning, pages 1159–1166. ACM, 2007.

[295] L. Zhou, G. Du, K. Lü, and L. Wang. A Network-based Sparse and Multi-manifold
Regularized Multiple Non-negative Matrix Factorization for Multi-view Clustering.
Expert Systems with Applications, 174:114783, 2021.

[296] P. Zhou, Y. Shen, L. Du, and F. Ye. Incremental Multi-view Support Vector Machine.
In Proceedings of the 2019 SIAM International Conference on Data Mining (SDM),
pages 1–9, 2019.

[297] F. Zhuang, G. Karypis, X. Ning, Q. He, and Z. Shi. Multi-view Learning via Proba-
bilistic Latent Semantic Analysis. Information Sciences, 199:20–30, 2012.

[298] M. Zitnik and B. Zupan. Data Fusion by Matrix Factorization. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 37(1):41–53, 2015.

[299] L. Zong, X. Zhang, L. Zhao, H. Yu, and Q. Zhao. Multi-view Clustering via Multi-
manifold Regularized Non-negative Matrix Factorization. Neural Networks, 88:74–89,
2017.

[300] I. Zwiener, B. Frisch, and H. Binder. Transforming RNA-Seq Data to Improve the
Performance of Prognostic Gene Signatures. PloS One, 9(1):e85150, 2014.
