Machine Learning and Statistical Methods For Clustering Single-Cell RNA-sequencing Data
doi: 10.1093/bib/bbz063
Advance Access Publication Date: 27 June 2019
Review article
Abstract
Single-cell RNA sequencing (scRNA-seq) technologies have enabled the large-scale whole-transcriptome profiling of each
individual single cell in a cell population. A core analysis of the scRNA-seq transcriptome profiles is to cluster the single
cells to reveal cell subtypes and infer cell lineages based on the relations among the cells. This article reviews the machine
learning and statistical methods for clustering scRNA-seq transcriptomes developed in the past few years. The review
focuses on how conventional clustering techniques such as hierarchical clustering, graph-based clustering, mixture models,
k-means, ensemble learning, neural networks and density-based clustering are modified or customized to tackle the unique
challenges in scRNA-seq data analysis, such as the dropout of low-expression genes, low and uneven read coverage of
transcripts, highly variable total mRNAs from single cells and ambiguous cell markers in the presence of technical biases
and irrelevant confounding biological variations. We review how cell-specific normalization, the imputation of dropouts and
dimension reduction methods can be applied with new statistical or optimization strategies to improve the clustering of
single cells. We also introduce more advanced approaches to cluster scRNA-seq transcriptomes in time series
data and multiple cell populations and to detect rare cell types. Several software packages developed to support the cluster
analysis of scRNA-seq data are also reviewed and experimentally compared to evaluate their performance and efficiency.
Finally, we conclude with useful observations and possible future directions in scRNA-seq data analytics.
Availability: All the source code and data are available at https://github.com/kuanglab/single-cell-review.
Introduction
Transcriptome profiling of cells can capture gene transcriptional activities to reveal cell identity and function. In conventional bulk gene expression analysis, a transcriptome is measured as an average of the transcription levels in a bulk population of cells collected from a biological sample and the bulk gene expressions are clustered to detect gene coexpression modules and sample clusters [1, 2]. Because bulk analyses ignore individual cell
Raphael Petegrosso is currently a PhD candidate in Computer Science at the University of Minnesota Twin Cities. He received his BS in Computer
Engineering from University of Sao Paulo, Brazil. His research interests include network-based learning, semisupervised learning and phenome-genome
association analysis.
Zhuliu Li is currently a PhD candidate in Computer Science at University of Minnesota Twin Cities. He received his BE in Electric Engineering from
Xidian University, China. His research interests include statistical learning, semisupervised learning, network-based learning and applications in biological
networks.
Rui Kuang is an associate professor with Computer Science and Engineering Department at University of Minnesota Twin Cities with joint appointment in
Bioinformatics and Computational Biology. His research interests are broadly in biological network analysis, cancer genomics, phenome predictions and
machine learning.
Submitted: 25 January 2019; Received (in revised form): 4 April 2019
© The Author(s) 2019. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
2 Petegrosso et al.
variable genes before clustering in the presence of a high level of technical noise and dropouts. Similarly, the analysis in [21] shows that global normalization by median library size or through spike-ins would not resolve the dropouts and might remove biological stochasticity specific to each cell type, both of which result in improper clustering and characterization of the latent cell types.

for inference by Gibbs sampling. Thus, BISCUIT imputes the dropouts along with clustering by a Dirichlet process mixture model (DPMM).

Dimension reduction
Dimension reduction is commonly used to project high-
5. LDA [33] was originally proposed in natural language processing. LDA assumes that a document is generated by first sampling topics from a multinomial distribution over the topics with a Dirichlet prior, followed by sampling of the words in the documents from the multinomial distribution over the words conditioned on each topic with a Dirichlet prior. Each document can then be represented in a lower-dimensional

Clustering techniques
In this section we review the application of eight categories of clustering methods to scRNA-seq data. The methods are summarized with their strengths, limitations and time complexity in Table 1. Some scRNA-seq clustering algorithms use multiple clustering techniques and are thus listed in multiple categories.
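The generative story of LDA described above can be sketched in a few lines. This is a minimal illustration only, not the implementation used by any scRNA-seq tool; the function names, the toy sizes (3 topics, 8 words) and the Dirichlet parameters are assumptions chosen for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw one index according to a probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(n_words, topic_word, doc_alpha, rng):
    """Generate one document following the LDA generative process."""
    theta = sample_dirichlet(doc_alpha, rng)        # topic proportions of the document
    words = []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)          # sample a topic for this position
        w = sample_categorical(topic_word[z], rng)  # sample a word from that topic
        words.append(w)
    return theta, words

rng = random.Random(0)
n_topics, vocab_size = 3, 8  # toy sizes (assumed)
# Each topic is a distribution over words, drawn from a symmetric Dirichlet prior.
topic_word = [sample_dirichlet([0.5] * vocab_size, rng) for _ in range(n_topics)]
theta, words = generate_document(20, topic_word, [1.0] * n_topics, rng)
```

Here `theta` holds the topic proportions of the generated document and `words` holds word indices into the toy vocabulary.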
Clustering methods for scRNA-seq data

| Category | Technique | Strengths | Limitations | Time complexity | Algorithm (year) |
|---|---|---|---|---|---|
| Partition | k-Means | Low time complexity; scalable to large datasets | Sensitive to outliers; user must know the number of clusters | O(KND) | pcaReduce [25] (2016); SAIC [30] (2017); SC3 [26] (2017); SCUBA [41] (2014); scVDMC [9] (2018) |
| Partition | k-Medoids | Centers are original datapoints (medoids); suitable for discrete data | Sensitive to outliers; user must know the number of clusters | O(K(N − K)^2) | RaceID2 [42] (2016) |
| Graph-based | Spectral clustering | No assumption about data distribution | Computationally intensive for large datasets | O(N^3) | TCC [13] (2016); SIMLR [39] (2017) |
| Graph-based | Clique detection | Intuitive and clear definition of clusters as cliques | NP-hard; reliant on heuristic solutions; no cluster detection in sparse graph | O(2^N) | SNN-Cliq [18] (2015) |
| Density-based | DBSCAN | High efficiency; flexible definition of clusters in arbitrary shape | Sensitive to parameters | O(N log N) | GiniClust [14] (2016) |
| Density-based | Density peak clustering | Does not require threshold parameter | High time complexity | O(N^2) | Monocle 2 [51] (2017) |

(Continued)
[Table 1 (continued), fragments only; row alignment not recoverable. Columns: Algorithm, Year. Algorithms: conCluster [31], SCRAT [54], SIMLR [39], TCC [13], SC3 [26]. Years: 2016, 2015, 2017, 2017, 2018, 2017, 2016, 2017.]

the clusters are merged together at the root of the hierarchical structure. Agglomerative clustering using the CURE algorithm [43], for example, has the time complexity of O(N^2 log N). Divisive clustering with exhaustive search has complexity O(2^N). Moreover, the hierarchical relationship does not provide the optimal partition of the data points into clusters. An additional step is needed to derive a target number K of clusters from the hierarchical tree.
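The additional step of deriving K clusters from the tree can be illustrated as follows. This is a minimal sketch under stated assumptions: it uses naive single-linkage agglomeration (not CURE, and far slower than O(N^2 log N)), and the function names `agglomerate` and `cut_tree` as well as the toy points are hypothetical.

```python
import math

def agglomerate(points):
    """Naive single-linkage agglomerative clustering; returns the merge history."""
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for ai in range(len(ids)):
            for bi in range(ai + 1, len(ids)):
                a, b = ids[ai], ids[bi]
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[p], points[q])
                        for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, next_id, d))
        next_id += 1
    return merges

def cut_tree(merges, n_points, k):
    """Replay the first n_points - k merges to obtain k flat clusters."""
    clusters = {i: [i] for i in range(n_points)}
    for a, b, new_id, _ in merges[: n_points - k]:
        clusters[new_id] = clusters.pop(a) + clusters.pop(b)
    labels = [0] * n_points
    for label, members in enumerate(clusters.values()):
        for p in members:
            labels[p] = label
    return labels

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.3, 4.9), (10.0, 0.0)]
merges = agglomerate(points)
labels = cut_tree(merges, len(points), k=3)
```

The full merge history is the hierarchy; `cut_tree` only decides where to stop, so different values of K can be read from the same tree without re-clustering.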
BackSPIN [29] is a two-way biclustering algorithm that applies
matrix with SPIN [44] until the split criteria are no longer met

cell sample onto the bulk and the scRNA-seq profiles. SC3 [26]

[Table 1 (continued), fragments only; row alignment not recoverable. Column: Time complexity. Techniques: Mixture models; Affinity propagation. Notes: Sensitive to outliers; Scalable stochastic gradient descent for; clusters.]
spatial clustering of the single cells. The scRNA-seq data are integrated with binarized in situ RNA data in a bimodal mixture model for a set of selected landmark genes, and then each single cell can be assigned to the spatial cluster regions by the posterior probability of the scRNA-seq expression profile in the bimodal mixture models. DTWScore [38] selects highly divergent genes from time-series scRNA-seq gene expression data with

datapoints inside the sphere is larger than a threshold. The process is repeated for each datapoint to expand the cluster. It has the advantages of high efficiency and suitability for data with any shape. However, clustering by density is very sensitive to the parameters and can exhibit poor results if the densities of the clusters are unbalanced. The time complexity of the DBSCAN clustering is O(N log N). Density-based clustering is
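The DBSCAN procedure described above (count the datapoints inside a sphere around a point, and expand the cluster from points whose count passes a threshold) can be sketched as follows. This is an illustration under assumptions: the neighborhood search is brute force, so this version is O(N^2) and the O(N log N) bound needs a spatial index; the threshold is applied as "at least `min_pts`"; the names `eps` and `min_pts` and the toy points are chosen for the example.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within radius eps of point i (including i itself)."""
    return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

def dbscan(points, eps, min_pts):
    """Brute-force DBSCAN; returns one label per point, with -1 marking noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        labels[i] = cluster            # i is a core point: start a new cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins but is not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
        cluster += 1
    return labels

points = [(0, 0), (0, 0.5), (0.5, 0), (0.4, 0.4),
          (10, 10), (10, 10.5), (10.5, 10), (50, 50)]
labels = dbscan(points, eps=1.0, min_pts=3)
# labels == [0, 0, 0, 0, 1, 1, 1, -1]
```

Both `eps` and `min_pts` have to be set by the user, which is where the sensitivity to parameters noted above enters.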
where $X^{(d)}$ represents the gene expression profile of population $d$; $D$ is the number of populations; $B \in \{0, 1\}^m$ is an indicator vector of gene selection; $D_B$ is a diagonal matrix with $B$ on the diagonal and $Y_{i,j} = [U^{(1)}_{i,j}, \ldots, U^{(D)}_{i,j}]^T$.

by PCA and Laplacian transformations. Then, six different kinds of projections are clustered by k-means to allow the construction of a consensus matrix with CSPA consensus function [26]. Finally, the consensus matrix is used for hierarchical clustering.
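The consensus step described above for SC3 can be illustrated with a small co-association matrix in the spirit of CSPA: entry (i, j) is the fraction of partitions in which cells i and j fall in the same cluster. This is a sketch only; the three hard-coded partitions stand in for k-means results on different projections and are made up for the example, and `consensus_matrix` is a hypothetical name.

```python
def consensus_matrix(partitions):
    """Fraction of partitions in which each pair of cells shares a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(partitions) for c in row] for row in counts]

# Three hypothetical partitions of four cells (label values are arbitrary per partition).
partitions = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
consensus = consensus_matrix(partitions)
```

Because only co-membership is counted, the matrix is unaffected by how each partition happens to number its clusters, and it can be passed to a hierarchical clustering step as a similarity matrix.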
conCluster [31] is another consensus clustering method that combines several partitions by t-SNE and k-means with different parameters. The partitions are then concatenated as the

Seurat 2.0 [12] identifies cell subpopulations by integrating multiple data sets with a common source of variation with multiple CCA (multi-CCA) [59] to learn shared gene correlation
fate decisions by integrating signals from other cells and executing complex gene-regulatory changes. scRNA-seq data have been analyzed to reconstruct the lineage trees of the cell differentiation processes and to sort cells according to their biological stage, also called pseudotime. Although the details of the methods specifically designed for this problem are beyond the scope of this article, we noticed that several of the methods

for iterative clustering. SAIC [30] iterates two steps, applying k-means to cluster the cells and ANOVA to select signature genes, for simultaneous clustering and marker gene detection. scVDMC [9] embeds the marker gene selection and the multitask clustering in the optimization framework.
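The two-step iteration described above for SAIC (k-means on the current genes, then ANOVA to pick signature genes) can be sketched as below. This is not SAIC itself: the farthest-first initialization, the rule of keeping the top `n_genes` genes by F statistic, the stopping test and the toy expression matrix are all assumptions made to keep the example short and deterministic.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, n_iter=20):
    """Lloyd's k-means with deterministic farthest-first initialization."""
    centers = [list(rows[0])]
    while len(centers) < k:
        far = max(rows, key=lambda r: min(sq_dist(r, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(rows)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: sq_dist(r, centers[c])) for r in rows]
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def anova_f(values, labels):
    """One-way ANOVA F statistic of one gene across the current clusters."""
    groups = {}
    for v, l in zip(values, labels):
        groups.setdefault(l, []).append(v)
    n, g = len(values), len(groups)
    if g < 2 or n <= g:
        return 0.0
    grand = sum(values) / n
    between = sum(len(m) * (sum(m) / len(m) - grand) ** 2 for m in groups.values())
    within = sum((v - sum(m) / len(m)) ** 2 for m in groups.values() for v in m)
    if within == 0.0:
        return float("inf") if between > 0.0 else 0.0
    return (between / (g - 1)) / (within / (n - g))

def cluster_and_select(cells, k, n_genes, n_rounds=10):
    """Alternate clustering and gene selection until the gene set stops changing."""
    n_all = len(cells[0])
    genes = list(range(n_all))
    labels = []
    for _ in range(n_rounds):
        rows = [[cell[g] for g in genes] for cell in cells]
        labels = kmeans(rows, k)
        scores = [anova_f([cell[g] for cell in cells], labels) for g in range(n_all)]
        ranked = sorted(range(n_all), key=lambda g: scores[g], reverse=True)
        selected = sorted(ranked[:n_genes])
        if selected == genes:
            break
        genes = selected
    return labels, genes

# Toy data (made up): 6 cells x 5 genes; genes 0 and 1 separate two cell groups.
cells = [
    [0.0, 0.2, 0.3, 0.9, 0.1], [0.1, 0.0, 0.8, 0.2, 0.6], [0.2, 0.1, 0.5, 0.4, 0.9],
    [10.0, 9.8, 0.4, 0.8, 0.2], [9.9, 10.1, 0.7, 0.3, 0.5], [10.2, 10.0, 0.6, 0.5, 0.8],
]
labels, genes = cluster_and_select(cells, k=2, n_genes=2)
```

On this toy matrix the loop stops after the second round, once the selected genes no longer change between rounds.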
from only two different cell types to evaluate whether the clustering method can indeed reliably distinguish the cell types.

Runtime and scalability
To measure the efficiency and scalability, the clustering methods can be evaluated by the runtime and the computational resources required for running the implementation. High

genes and 4 nested splits identified the number of clusters closest to 10.
• cellTree [34] first applies LDA to embed the single cells as a mixture of topics and then builds a hierarchical clustering by constructing a minimal spanning tree on the topic distributions. To run cellTree, we first fit the LDA model with the default method (joint MAP estimation) to choose the
15. Jiang L, Schlesinger F, Davis CA, et al. Synthetic spike-in standards for RNA-seq experiments. Genome Res 2011;21(9):1543–51.
16. Lin P, Troup M, Ho JWK. CIDR: ultrafast and accurate clustering through imputation for single-cell RNA-seq data. Genome Biol 2017;18(1):59.
17. Zhang JM, Fan J, Christina Fan H, et al. An interpretable
37. Murtagh F, Hernández-Pajares M. The Kohonen self-organizing map method: an assessment. J Classification 1995;12(2):165–90.
38. Wang Z, Jin S, Liu G, et al. DTWscore: differential expression and cell clustering analysis for time-series single-cell RNA-seq data. BMC Bioinformatics 2017;18(1):270.
39. Wang B, Zhu J, Pierson E, et al. Visualization and analysis of
58. Hicks SC, Teng M, Irizarry RA. On the widespread and critical impact of systematic bias and batch effects in single-cell RNA-seq data. bioRxiv, 2015.
59. Kettenring JR. Canonical analysis of several sets of variables. Biometrika 1971;58(3):433–51.
60. Waltman L, Van Eck NJ. A smart local moving algorithm for large-scale modularity-based community detection. Eur
66. Zheng GXY, Terry JM, Belgrader P, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun 2017;8:14049.
67. Chung W, Eum HH, Lee H-O, et al. Single-cell RNA-seq enables comprehensive tumour and immune cell profiling in primary breast cancer. Nat Commun 2017;8:15081.