Machine Learning and Statistical Methods For Clustering Single-Cell RNA-sequencing Data

Briefings in Bioinformatics, 00(0), 2019, 1–15
doi: 10.1093/bib/bbz063
Advance Access Publication Date: 27 June 2019
Review article

Machine learning and statistical methods for
clustering single-cell RNA-sequencing data
Raphael Petegrosso, Zhuliu Li, Rui Kuang
Corresponding author: Rui Kuang, Department of Computer Science and Engineering, University of Minnesota Twin Cities, Minneapolis, MN, USA.
Tel.: (612) 624-7820; Fax: (612) 625-0572; E-mail: [email protected]
Abstract
Single-cell RNA sequencing (scRNA-seq) technologies have enabled the large-scale whole-transcriptome profiling of each
individual single cell in a cell population. A core analysis of the scRNA-seq transcriptome profiles is to cluster the single
cells to reveal cell subtypes and infer cell lineages based on the relations among the cells. This article reviews the machine
learning and statistical methods for clustering scRNA-seq transcriptomes developed in the past few years. The review
focuses on how conventional clustering techniques such as hierarchical clustering, graph-based clustering, mixture models,
k-means, ensemble learning, neural networks and density-based clustering are modified or customized to tackle the unique
challenges in scRNA-seq data analysis, such as the dropout of low-expression genes, low and uneven read coverage of
transcripts, highly variable total mRNAs from single cells and ambiguous cell markers in the presence of technical biases
and irrelevant confounding biological variations. We review how cell-specific normalization, the imputation of dropouts and
dimension reduction methods can be applied with new statistical or optimization strategies to improve the clustering of
single cells. We also introduce more advanced approaches to cluster scRNA-seq transcriptomes in time series
data and multiple cell populations and to detect rare cell types. Several software packages developed to support the cluster
analysis of scRNA-seq data are also reviewed and experimentally compared to evaluate their performance and efficiency.
Finally, we conclude with useful observations and possible future directions in scRNA-seq data analytics.
Availability: All the source code and data are available at https://github.com/kuanglab/single-cell-review.
Key words: scRNA sequencing; machine learning; clustering; single-cell technology.
Introduction
Transcriptome profiling of cells can capture gene transcriptional activities to reveal cell identity and
function. In conventional bulk gene expression analysis, a transcriptome is measured as an average of the
transcription levels in a bulk population of cells collected from a biological sample and the bulk gene
expressions are clustered to detect gene coexpression modules and sample clusters [1, 2]. Because bulk
analyses ignore individual cell
Raphael Petegrosso is currently a PhD candidate in Computer Science at the University of Minnesota Twin Cities. He received his BS in Computer
Engineering from University of Sao Paulo, Brazil. His research interests include network-based learning, semisupervised learning and phenome-genome
association analysis.
Zhuliu Li is currently a PhD candidate in Computer Science at University of Minnesota Twin Cities. He received his BE in Electric Engineering from
Xidian University, China. His research interests include statistical learning, semisupervised learning, network-based learning and applications in biological
networks.
Rui Kuang is an associate professor with Computer Science and Engineering Department at University of Minnesota Twin Cities with joint appointment in
Bioinformatics and Computational Biology. His research interests are broadly in biological network analysis, cancer genomics, phenome predictions and
machine learning.
Submitted: 25 January 2019; Received (in revised form): 4 April 2019
© The Author(s) 2019. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
identities, they cannot investigate important biological problems at the single-cell resolution such as
distinct functional roles of cells during early development, distinct cell types in complex tissues, cell
lineage relationships and stochastic gene expression among cells. Single-cell RNA sequencing (scRNA-seq) has
emerged and is now widely used to quantify mRNA expression in individual cells [3, 4]. In scRNA-seq
protocols, single cells are isolated with a capture method such as fluorescence-activated cell sorting
(FACS), Fluidigm C1 or microdroplet microfluidics and then the RNAs are captured, reverse transcribed and
amplified for sequencing [4]. The applications of scRNA-seq have already led to important biological
insights and discoveries, for example, understanding of tumor heterogeneity in cancer [5].
Although clustering is also a necessary step to identify the cell subpopulation structure in scRNA-seq data,
there are several unique challenges in the clustering analysis. First, technical noise and biases are
introduced by cell-specific characteristics such as cell-cycle stages or cell size, as well as by
technical/systematic sources such as capture inefficiency, amplification biases and sequencing depth. For
example, the heavy polymerase chain reaction (PCR) amplification required by the tiny amount of RNA material
in a single cell [6] also exponentially amplifies the biases. These biases and noise cause uneven coverage
of the entire transcriptome and result in an abundance of zero-coverage regions and many ‘dropout’ genes
[7, 8]. In addition, when multiple single-cell populations from a cohort of samples are analyzed together,
the technical biases and biological variance across the populations dominate the clustering of the single
cells, resulting in clustering by the sample of origin rather than by cells of similar types [9].
In this article, we review the recently developed statistical and machine learning methods for improving the
clustering of scRNA-seq data. These new methods include (1) new data processing statistical methods for
cell-specific normalization, the imputation of ‘dropouts’, projection and dimension reduction and cell
marker identifications; (2) the conventional clustering methods modified or customized for scRNA-seq data,
including partitioning-based clustering, hierarchical clustering, mixture models, graph-based clustering,
density-based clustering, neural networks, ensemble clustering and affinity propagation; (3) new approaches
to cluster scRNA-seq transcriptomes in time series data and multiple cell populations and to detect rare
cell types. We also discuss several additional important computational aspects in scRNA-seq data clustering
including similarity measures, feature representations and evaluations of the single-cell clustering
results. In addition, we performed experiments comparing more than ten software packages to evaluate their
clustering performance and efficiency on a large-scale scRNA-seq dataset. Finally, we conclude the review
with discussions of several remaining computational challenges in single-cell clustering analysis.

Data preprocessing for clustering
In the clustering analysis of scRNA-seq data, data preprocessing is essential to reduce technical variations
and noise such as capture inefficiency, amplification biases, GC content, difference in the total RNA
content and sequence depth, in addition to dropouts in reverse transcription [8]. High-dimensional scRNA-seq
data are typically normalized and projected to a lower-dimensional space by dimension reduction. Several
computational methods have also been developed to address dropout events with imputation or better
similarity measures.

Normalization
The raw scRNA-seq read libraries are usually normalized in two ways: cell normalization and gene
normalization. Cell normalization is done to remove the amplification biases and other cell-specific effects
inherent in the experimental protocols and can be achieved with commonly used read count normalization
methods such as fragments per kilobase million (FPKM), reads per kilobase million (RPKM) and transcripts per
million (TPM), which normalize each cell by the total number of short reads and a scaling factor. Unique
molecular identifier (UMI)-based protocols, in principle, already avoid biases related to amplification and
sequence depth since multiple reads associated with the same UMI are collapsed into a unique count [10].
However, since libraries are usually not sequenced to saturation (i.e. the point at which each uniquely
tagged molecule is observed at least once), normalization has also been shown useful for this type of data
[11–14]. Another alternative for cell normalization is to use spike-in sequences such as the External RNA
Controls Consortium (ERCC) molecules [15] based on the assumption that technical effects affect the
intrinsic and extrinsic genes equally [10]. Note that it is also common to use log-transformed read counts
after adding a pseudocount of 1 [9, 12, 16–19].
Gene normalization is also performed across samples to prevent the highly expressed genes from dominating
the analysis. For example, z-score normalization (standardization) [9, 12, 17] can be used, as in principal
component analysis (PCA). Empirically, standardization of the features may improve the convergence and
clustering. It is important to note that the standardized data will lose the relative scale of the genes and
become less sparse due to the expression shift, which might influence clustering performance on large-scale
sparse scRNA-seq data.
The SINCERA pipeline [20] provides a normalization component for preprocessing scRNA-seq data. The package
performs gene normalization by z-score and cell normalization by trimmed mean. To determine if trimmed mean
should be performed, SINCERA also provides a quality-control tool for visualizing MA plot, Q-Q plot,
intersample correlation and distance measures.
Other methods perform more specialized normalization on scRNA-seq data. For example, BISCUIT [21] uses
iterative normalization during the clustering procedure by learning parameters that represent the technical
variations. Rare cell type identification (RaceID) [22] normalizes the total transcript count within each
cell to the median transcript number across cells. Transcript-compatibility counts (TCC)-based clustering
[13] uses equivalence classes instead of genes as features and normalizes each feature by dividing by the
total count across all the cells.
Moreover, it is typical to remove genes and cells from the library if they exhibit extremely low expression
because of the assumption that they represent spurious signals in the data. Previous studies established
different thresholds for the removal of low-expressed genes and cells, which might vary according to the
total number of cells and genes in the analysis. For instance, in the analysis of droplet-based peripheral
blood mononuclear cell (PBMC) data in single-cell variance-driven multitask clustering (scVDMC) [9], genes
that are expressed in fewer than three cells, and cells with a total UMI count of less than 200 are removed
from the analysis.
Although the global normalization of genes and cells is common in most of the current clustering workflows,
there is still some debate regarding the effect on the clustering results. The analysis in [10] shows that
the application of bulk-based cell normalization methods can have serious adverse consequences for the
analysis of scRNA-seq data such as detection of highly
variable genes before clustering in the presence of a high level of technical noise and dropouts. Similarly,
the analysis in [21] shows that global normalization by median library size or through spike-ins would not
resolve the dropouts and might remove biological stochasticity specific to each cell type, both of which
result in improper clustering and characterization of the latent cell types.

Dropout imputation
A significant technical artifact in scRNA-seq data is known as ‘dropout’. Dropout events refer to the false
quantification of a gene as unexpressed due to missing or low-expressed transcripts during the
reverse-transcription step [3]. Previous studies also suggested that simple normalization will not address
the dropout effects in scRNA-seq data analysis [10, 21]. Thus, several clustering algorithms include
specific mechanisms for the correction of dropouts, e.g. Seurat [11] uses coexpression patterns across cells
in the scRNA-seq profiles to impute the expression of the landmark genes from the coexpressed genes before
clustering.
Dropouts can also be imputed while computing the pairwise similarity or distance for clustering. Clustering
through imputation and dimensionality reduction (CIDR) [16] imputes the expression of the dropout genes
before clustering. First, the occurrence of possible dropouts among the single cells is analyzed to identify
the dropout candidates in each cell and calculate the dropout rate of each gene. The dropout rates of the
candidates are then used to estimate the imputed expression levels of the dropout candidates between each
pair of samples, i.e. when a dropout event is identified with high probability, the algorithm performs a
weighted imputation of the expression from the expression profile of the other sample. Finally, cell–cell
dissimilarity is calculated using the imputed values for hierarchical clustering. The new version of Seurat
[12] and shared nearest neighbors (SNN)-Cliq [18] are based on SNN as an alternative similarity measure. It
has been demonstrated that in sparse high-dimensional data, SNN is more suitable for clustering analysis in
the presence of dropouts because it takes into account the surrounding neighbor datapoints. Therefore, these
methods are also expected to perform better even without explicit imputation of the dropouts.
Zero-inflated factor analysis (ZIFA) [23] implements a modified probabilistic PCA to incorporate a
zero-inflated model to account for the dropout events. ZIFA projects single cells to a low-dimensional space
in which dropouts can happen with a probability specified by an exponential decay associated with the
expression levels. The zero-inflated negative binomial-based wanted variation extraction (ZINB-WaVE) [24]
uses a zero-inflated negative binomial model to extract low-dimensional signals from the data to account for
zero inflation (dropouts), overdispersion and the nature of the count data. ZINB-WaVE models the cells’
expression density function as an affine combination of a Dirac function, which accounts for the existence
of dropouts, and a negative binomial distribution over the observed counts. ZINB-WaVE fits the UMI counts
better than ZIFA does without assuming an exponential decay of the expression values.
In a more sophisticated probabilistic graphical model, BISCUIT [21] explicitly estimates the imputed gene
expressions in each single cell as well as the parameters of the assumed data distributions and prior
distributions to represent technical and biological variations. In particular, random variables representing
the unobserved true expression levels without the cell-specific rescaling are introduced in the graphical
model for inference by Gibbs sampling. Thus, BISCUIT imputes the dropouts along with clustering by a
Dirichlet process mixture model (DPMM).

Dimension reduction
Dimension reduction is commonly used to project high-dimensional gene expression data to a lower-dimensional
space to allow the analysis to focus on relevant signals in the low-dimensional space for better data
visualization, clustering and other interpretations. Dimension reduction also helps partially resolve the
statistical issues of insufficient samples when the number of dimensions is larger than the number of
samples. Many dimension reduction methods have been applied with scRNA-seq clustering algorithms including
PCA, multidimensional scaling (MDS), t-distributed stochastic neighbor embedding (t-SNE), canonical
correlation analysis (CCA), latent Dirichlet allocation (LDA) and dimension reduction embedded in other
models.
1. PCA projects the datapoints with the eigenvectors (principal components) associated with the largest
eigenvalues of the covariance matrix to preserve most of the variance in the original data. For example,
pcaReduce [25] projects an expression matrix with the top K-1 principal components before clustering. SC3
[26] applies PCA and Laplacian transformations to the distance matrices to obtain inputs for its consensus
clustering. PCA has also been widely used for data visualization in 2 or 3 dimensions after scRNA-seq data
clustering [14, 16, 17, 19, 25]. PCA is a linear projection method based on assuming the data are Gaussian.
To capture nonlinear structure in the data, kernel PCA can be applied with nonlinear kernel mapping.
2. MDS [27], also known as principal coordinate analysis (PCoA), is a dimension reduction algorithm based on
distance-preserving techniques. MDS projects the data points to a lower-dimensional space to preserve the
distance among the data points in the original higher-dimensional space by minimizing the difference between
the distance in the original space and the distance in the projected space in all pairs of datapoints. For
example, CIDR [16] applies MDS on a dissimilarity matrix and then takes the top principal coordinates for
hierarchical clustering. MDS has the advantage of preserving the original pairwise distance in the
low-dimensional projection and easily allowing nonlinear feature embedding. However, MDS is not scalable to
large-scale data since pairwise distances must be computed to minimize the objective function.
3. t-SNE [28] is a probabilistic distance-preserving approach. t-SNE constructs a probability distribution
associated with the similarities among the datapoints in the original space and the lower-dimensional space
after projection and then minimizes the Kullback–Leibler divergence between the two distributions with
respect to the locations of the data points in the map. t-SNE is widely used for data visualization in
single-cell data analysis [12–14, 17, 19, 21, 22, 29–31].
4. CCA [32] is a dimension reduction method based on the cross-covariance of datasets. Given two or more
datasets, the method finds projections of each dataset to maximize the correlation among the projected
datasets. In scRNA-seq data analysis, CCA is suitable for the integration of data from multiple sources. For
example, Seurat 2.0 [12] applies CCA on multiple single-cell datasets to identify the shared components.
5. LDA [33] was originally proposed in natural language processing. LDA assumes that a document is generated
by first sampling topics from a multinomial distribution over the topics with a Dirichlet prior, followed by
sampling of the words in the documents from the multinomial distribution over the words conditioned on each
topic with a Dirichlet prior. Each document can then be represented in a lower-dimensional space of k
topics. cellTree [34] uses LDA to learn ‘topics’ as latent features to represent cells, where words are gene
expression levels conditioned to the selected latent features. The generative process of LDA produces an
interpretable set of latent features.
6. Self-organizing map (SOM) [35] or Kohonen neural network is an unsupervised competitive learning
algorithm that can be used for both clustering and dimension reduction by the number and arrangement of the
output units of the neural network [36, 37]. When used for visualization, SOM organizes the output units of
the neural network in a 2D grid to allow direct visualization of the clusters of datapoints.
7. Model-embedded dimension reduction combines dimension reduction within the models for data processing.
ZIFA [23] and ZINB-WaVE [24] are two such examples that model dropout events by zero-inflated data for
dimension reduction, as discussed in the Dropout imputation section.

Similarity and kernel functions
Instead of using dimension reduction, many clustering methods use a kernel function or a similarity function
to compute pairwise similarity between individual cells for clustering. The kernel strategy will compute an
N × N similarity matrix from an N × M expression profile matrix expecting that smart design of the kernel
mapping or the similarity function will reduce the variability in the original feature space in an implicit
feature mapping with the function (if a valid kernel function is used). SNN-Cliq [18] and Seurat [11] use
the SNN as the similarity graph. cellTree [34] finds a pairwise distance between cells by chi-square on the
topic histograms found with LDA. DTWscore [38] finds the dynamic time warping (DTW) distance between pairs
of cells for each gene using time series scRNA-seq data to select highly variable genes where the DTW
distance is calculated based on the alignment of two time series in the optimal warping path. TCC-based
clustering [13] uses Jensen–Shannon distance between cells as input for spectral clustering or affinity
propagation. SIMLR [39] combines multiple kernels to learn a cell similarity matrix and address dropout
issues with a rank constraint and graph diffusion.
Most other methods use more standard correlation or distance functions. BackSPIN [29], DendroSplit [17],
ICGS [40] and SINCERA [20] use a Pearson correlation matrix to find the best splitting point in their
hierarchical clustering strategy. GiniClust [14] and RaceID [22] also use a correlation matrix for DBSCAN
and k-means clustering, respectively. Reference component analysis (RCA) [19] calculates the correlation
between the expression profiles of single cells and reference bulk cells as new features for clustering to
minimize technical variation and batch effect. CIDR [16] uses pairwise Euclidean dissimilarity on the
expression profiles with the imputation of dropouts. SC3 [26] calculates cell–cell pairwise similarity or
distance using Spearman, Pearson and Euclidean distances as multiple scenarios for consensus clustering.

Clustering techniques
In this section we review the application of eight categories of clustering methods to scRNA-seq data. The
methods are summarized with their strengths, limitations and time complexity in Table 1. Some scRNA-seq
clustering algorithms use multiple clustering techniques and are thus listed in multiple categories.

Partitioning-based clustering
Partitioning-based clustering methods identify the best K centers to partition the datapoints into K
clusters where the centers are either centroids (means), as in k-means, or medoids, as in k-medoids.
The k-means approach finds the centroids to minimize the sum of the squares of the Euclidean distance
between each datapoint and its closest centroid. It has the advantage of low time complexity. However, it is
sensitive to outliers, and the user must specify the number of clusters K a priori. The time complexity of
k-means using Lloyd’s algorithm is O(KND) per iteration for clustering N datapoints of dimension D into K
classes.
Several methods for analyzing scRNA-seq data employ k-means. SAIC [30] combines k-means and ANOVA in
iterations of clustering single cells followed by signature gene identification. SCUBA [41] uses k-means to
divide cells at each time point into two groups and uses gap statistics to identify bifurcation events. One
of the steps of SC3 [26] is to use k-means on the projections of the cell pairwise distance matrices and
combine the individual k-means clustering results with a consensus function. pcaReduce [25] and scVDMC [9]
use k-means to initialize their algorithm.
The k-medoids approach identifies K data points among the original N examples as medoids to minimize the sum
of distance of data points to their medoid. It is most suitable for discrete data with meaningful medoids as
clustering centers. However, similar to k-means, it is sensitive to outliers, and the user must specify the
number of clusters K a priori. The time complexity of k-medoids using the partitioning around medoids
algorithm is O(K(N − K)^2) for solving the combinatorial problem of choosing the optimal K points from the N
data points.
RaceID2 [42], proposed for the identification of rare cell types with scRNA-seq data, showed that replacing
k-means clustering with k-medoids leads to improved clustering results.

Hierarchical clustering
Hierarchical clustering is the most widely used clustering method in gene expression data analysis.
Hierarchical clustering builds a hierarchical structure among the data points, which naturally defines
clusters by the branches in the hierarchical tree. Many scRNA-seq data clustering algorithms are based on
hierarchical clustering or adopt hierarchical clustering as one of the steps in the analysis.
Hierarchical clustering makes few assumptions regarding the overall distribution of the data points. Thus,
it is suitable for datasets of many different shapes. Another important advantage is the representation with
hierarchical relationships among all the datapoints for interpretation of the results. There are two main
implementations of hierarchical clustering: agglomerative clustering and divisive clustering.
Agglomerative clustering starts with all the N datapoints as N initial clusters, and at each step, the
clusters are merged
Table 1. Clustering techniques. The table shows the main categories of clustering algorithms applied to
clustering scRNA-seq data. For each category, we include a list of scRNA-seq data clustering algorithms with
their strengths, limitations and time complexity.

Category / Subcategory | Strengths | Limitations | Time complexity | Algorithm (year)
Partition / k-Means | Low time complexity; scalable to large datasets | Sensitive to outliers; user must know the number of clusters | O(KND) | pcaReduce [25] (2016); SAIC [30] (2017); SC3 [26] (2017); SCUBA [41] (2014); scVDMC [9] (2018)
Partition / k-Medoids | Centers are original datapoints (medoids); suitable for discrete data | Sensitive to outliers; user must know the number of clusters | O(K(N − K)^2) | RaceID2 [42] (2016)
Hierarchical | Allow fitting to flexible cluster shapes; hierarchical relationship among datapoints | High time complexity; no explicit clusters given | Agglomerative: O(N^2 log(N)); divisive: O(2^N) | BackSPIN [29] (2015); cellTree [34] (2016); CIDR [16] (2017); DendroSplit [17] (2018); ICGS [40] (2016); RCA [19] (2017); SC3 [26] (2017)
Graph-based / Spectral clustering | No assumption about data distribution | Computationally intensive for large datasets | O(N^3) | TCC [13] (2016); SIMLR [39] (2017)
Graph-based / Clique detection | Intuitive and clear definition of clusters as cliques | NP-hard; reliant on heuristic solutions; no cluster detection in sparse graph | O(2^N) | SNN-Cliq [18] (2015)
Graph-based / Louvain | Relatively low time complexity | Heuristic can lead to bad results; iterative process can hide small communities | O(N log(N)) | SCANPY [49] (2018); Seurat 1.0 [11] (2015)
Mixture models | Incorporating prior knowledge as assumptions of distributions | Computational difficulties in inference of graphical models | O(N^2 K) (GMM) | BISCUIT [21] (2016); DTWScore [38] (2017); Seurat 1.0 [11] (2015); TSCAN [46] (2016)
Density-based / DBSCAN | High efficiency; flexible definition of clusters in arbitrary shape | Sensitive to parameters | O(N log N) | GiniClust [14] (2016)
Density-based / Density peak clustering | Does not require threshold parameter | High time complexity | O(N^2) | Monocle 2 [51] (2017)
(Continued)
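As an illustration of the k-means entry in Table 1, the following is a minimal sketch of Lloyd's algorithm in plain Python. It is not taken from any of the reviewed packages: the function names `squared_distance` and `lloyd_kmeans`, the parameters `max_iter` and `seed`, and the toy data are assumptions made for this example. Each iteration compares all N points with all K centroids in D dimensions, which is where the O(KND) per-iteration cost comes from.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(points, k, max_iter=100, seed=0):
    """Hypothetical helper: cluster `points` (a list of tuples) into k groups."""
    rng = random.Random(seed)
    # Initialize centroids with k distinct datapoints chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: N * K distance evaluations, each costing O(D).
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy "expression profiles" of six cells over two genes (made-up values).
cells = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
         (100.0, 100.0), (100.0, 101.0), (101.0, 100.0)]
labels, centroids = lloyd_kmeans(cells, k=2)
print(labels)
```

The number of clusters has to be passed in as `k`, which reflects the limitation listed in the table that the user must know the number of clusters in advance.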

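Table 1 lists "no explicit clusters given" as a limitation of hierarchical clustering. The sketch below, again in plain Python, shows one simple way to obtain K explicit clusters from an agglomerative procedure: stop merging once K clusters remain. Average linkage and Euclidean distance are assumptions chosen for this example, and the names `euclidean`, `average_linkage` and `agglomerate` are hypothetical. The exhaustive pair search used here is a naive implementation that is slower than the O(N^2 log(N)) bound quoted in the table.

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def average_linkage(points, cluster_a, cluster_b):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(points, k):
    """Hypothetical helper: merge clusters bottom-up until k remain.

    Returns a list of clusters, each a list of indices into `points`.
    """
    # Start with every datapoint as its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        # Exhaustive search for the closest pair of clusters.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(points, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a; b > a, so index a stays valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy data: two well-separated groups of three cells (made-up values).
cells = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
         (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
print(agglomerate(cells, k=2))
```

Recording the sequence of merges instead of stopping at `k` would give the full hierarchical tree, from which clusters could be derived afterwards.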
according to distance measures, called linkage distance, until all
2016
2015
2017
2017
2018
2017
2016
2017
Year
the clusters are merged together at the root of the hierarchical
structure. Agglomerative clustering using the CURE algorithm
[43], for example, has the time complexity of O(N2 log N). Divisive
Kim, Daniel, et al. [53]

Lv, Dekang, et al. [55]
clustering, in contrast, starts with all the datapoints as a single
cluster, and at each step the clusters are recursively divided.
conCluster [31]
Divisive clustering with exhaustive search has complexity O(2N ).

SOMSC [56]
Thus, the time complexity of hierarchical clustering is high.
SCRAT [54]
SIMLR [39]
Algorithm
TCC [13]
Moreover, the hierarchical relationship does not provide the
SC3 [26]
optimal partition of the data points into clusters. An additional
step is needed to derive a target number K of clusters from the
hierarchical tree.
BackSPIN [29] is a two-way biclustering algorithm that applies
algorithm in the ensemble

hierarchical clustering on both single-cell and gene dimensions.
BackSPIN iteratively splits the gene expression correlation
- Complexity of each
matrix with SPIN [44] until the split criteria are no longer met
Time complexity
at a branch. cellTree [34] builds a hierarchical structure among

the single cells by constructing a minimal spanning tree on the
topic distributions obtained by modeling single-cell data as a
O(KND)
mixture of topics with LDA. CIDR [16] uses hierarchical clustering

O(N2 )
on the top coordinates obtained with PCoA on a dissimilarity

matrix obtained with imputation of dropouts. ICGS [40] applies
hierarchical clustering to cluster the expression data of a set
clustering algorithms for ensemble
of guide genes selected by filtering genes by expression level

and dynamic range and performs pairwise correlation analysis.
- Reliant on combining other
RCA introduced in [19] applies hierarchical clustering on the

correlation matrix obtained from the projections of each single-
- Sensitive to parameters
cell sample onto the bulk and the scRNA-seq profiles. SC3 [26]
- Sensitive to outliers
also applies hierarchical clustering on the consensus matrix

obtained by combining results of each k-means clustering in
the ensemble. To derive the actual clusters in the hierarchy,
Limitations
DendroSplit [17] detects clusters in the constructed tree with

dynamic splits and merges of the tree branches by measuring a
separation score from the original expression data.
-Incorporation of relation among clusters
Mixture models
- Scalable stochastic gradient decentfor
- Automatic detection of the number of
Clustering by mixture models assumes that the datapoints are

- Robust clustering by integration of
sampled from a mixture of several probability distributions,

each of which represents a cluster. The clustering of a sample
is inferred by learning the probability of its generation from
each distribution. The common choices of mixture models for
clustering are the Gaussian mixture model (GMM) for continuous
multiple methods
data and the categorical mixture model for count data.

The advantage of mixture models include rigorous prob-
abilistic modeling and the flexibility of introducing prior
Strengths
knowledge in the model. However, solving mixture models

training
clusters
requires advanced optimization or sampling techniques with

high computational complexity and relies on the accuracy
of the assumption about the data distributions. Mixture
models are usually learned with expectation maximization,
which alternatively infers the mixture parameters and class
assignment likelihoods or sampling and variational methods
for learning graphical probabilistic models. The time complexity
of mixture models depends on the distribution of the mixture.
In GMM clustering, the time complexity is O(N2 K) [45].
Category / Subcategory
BISCUIT [21] is based on a hierarchical Dirichlet process

Table 1. (continued)
Affinity propagation
mixture model (HDMM) with additional cell-specific scaling and

dropout imputation. The HDMM models cells as a Gaussian mix-
Neural network
ture with Dirichlet prior on mixture coefficients, normal prior

on the means and Wishart prior on the covariance matrices,
Ensemble
and the cell-specific scaling accounts for cell-specific technical

variances. BISCUIT is inferred with Gibbs sampling. Seurat 1.0
[11] combines scRNA-seq data with in situ RNA patterns for
spatial clustering of the single cells. The scRNA-seq data are
integrated with binarized in situ RNA data in a bimodal mixture
model for a set of selected landmark genes, and then each
single cell can be assigned to the spatial cluster regions by the
posterior probability of the scRNA-seq expression profile in the
bimodal mixture models. DTWScore [38] selects highly divergent
genes from time-series scRNA-seq gene expression data with
a DTWscore and then applies GMM to cluster cells with the
selected genes. TSCAN [46] clusters cells using GMM and builds a
minimum spanning tree (MST) to discover pseudotime ordering.

Graph-based clustering

In graph-based clustering, datapoints are represented as nodes
in a graph, and the edges are weighted by the pairwise
similarities between the datapoints. Graph-based clustering is based
on the simple assumption of dense communities in the graph
represented as either a dense subgraph or spectral components,
and thus relies less on other assumptions about the data
distributions. However, the computational requirement is a major
limitation. The two most common algorithms for graph-based
clustering are spectral clustering and clique detection.

In spectral clustering [47], an affinity matrix and its graph
Laplacian are built by a similarity function, such as the RBF kernel.
The top eigenvectors of the graph Laplacian are computed for
subsequent clustering by k-means. The time complexity of finding
all the eigenvectors is O(N^3), although more efficient methods
can be used to find a certain number of top eigenvectors.
Thus, spectral clustering is often not directly applicable to large
datasets. TCC-based clustering [13] builds an affinity matrix
with transcript compatibility counts using the Jensen–Shannon
distance between cells for spectral clustering when the number
of cell types is known a priori; otherwise affinity propagation is
applied. SIMLR [39] is a framework for learning a cell similarity
measure using rank constraint and graph diffusion, by which the
learned latent components can be used for spectral clustering.

In graph theory, a clique is defined as a subgraph in which
every pair of nodes is adjacent. The cliques, therefore, represent
clusters of datapoints in the graph. Since finding cliques in a
graph is an NP-hard problem, heuristic approaches are often
used. SNN-Cliq [18] utilizes clique detection to cluster cells with
scRNA-seq data. Since cliques are often rare in sparse graphs,
SNN-Cliq detects dense but not fully connected quasi-cliques in
an SNN graph.

Another graph-based clustering algorithm for single-cell
analysis is the Louvain algorithm [48]. Louvain is a community
detection algorithm that is more scalable than other graph-based
algorithms; it uses a greedy approach to assign nodes
to communities and then updates the network to obtain a more
coarse-grained representation. The time complexity of Louvain
is O(N log N). SCANPY [49] is a pipeline that integrates the
Louvain algorithm to provide a tool capable of analyzing
large-scale scRNA-seq datasets. Seurat [11] also utilizes the Louvain
algorithm on the SNN graph of cells to discover cell types.

Density-based clustering

Density-based clustering defines clusters as regions with a
high density of datapoints in the input space. Two examples
of density-based clustering are DBSCAN and density peak
clustering.

DBSCAN [50] reports a cluster if, for a given datapoint
taken as the center of a sphere of radius ε, the number of
datapoints inside the sphere is larger than a threshold. The
process is repeated for each datapoint to expand the cluster. It
has the advantages of high efficiency and suitability for data
with any shape. However, clustering by density is very sensitive
to the parameters and can exhibit poor results if the densities
of the clusters are unbalanced. The time complexity of
DBSCAN clustering is O(N log N). Density-based clustering is
typically used for identification of outlier cells in scRNA-seq
data analysis, as in GiniClust [14] and Monocle 2 [51].

GiniClust [14] is based on DBSCAN to discover rare
subpopulations of cells. GiniClust uses the Gini index as a measure of the
variability of expression values to select the genes that are then
used by DBSCAN to cluster cells.

Density peak clustering [52] takes into account the distance
between datapoints instead of a density threshold as in DBSCAN
and assumes that cluster centers are local maxima in the density
of datapoints in the cluster. The time complexity of density
peak clustering is O(N^2). Monocle 2 [51] performs density peak
clustering [52] on cells in the space obtained by t-SNE.

Neural networks

The Kohonen neural networks, also known as SOMs [35],
perform competitive learning for clustering; each training
datapoint is used iteratively to update the cluster centers
weighted by the similarity (distance) between the training
datapoint and each center with stochastic gradient descent.
The cluster centers are initialized with predefined structures,
such as a grid. SOM is quite scalable since stochastic gradient
descent does not require keeping all the datapoints in memory.
In addition, the predefined structures among the centers
can introduce prior knowledge and provide interpretable
relationships among the clusters. SOM is, however, sensitive
to parameter tuning, such as the learning rate used to update
the weights. It can be solved with a similar algorithm to that of
k-means in O(NKD).

SOM has been used for visualizing and clustering scRNA-seq
data. Several studies [53–55] applied SOM to intuitively visualize
similarity relationships in a 2D heat map in which the spatial
proximity reflects the expression pattern similarity. The software
package single-cell R-analysis tools (SCRAT) [54] provides
users with options to visualize a 2D heat map representing
correlations among genes across single-cell profiles. SOMSC [56]
utilizes SOM to collapse high-dimensional gene expression data
into two dimensions for cellular state transition identification
and pseudotemporal ordering of cells.

Ensemble clustering

Ensemble clustering, also called consensus clustering, is a widely
used strategy in which clustering is performed using several
different scenarios (e.g. different clustering algorithms,
similarity measures and feature selections/projections) with the same
dataset, and the individual results are later merged based on
the agreement among them by a consensus function. Ensemble
learning can capture the diversity in different data
representations and clustering models and has been shown to be more
robust and to lead to better results than single models. The
limitation of ensemble clustering is the reliance on other techniques
for data transformation and the base clustering methods.

SC3 [26] is a consensus clustering method applied to
scRNA-seq data clustering. SC3 first finds cell pairwise distance
matrices by Euclidean, Pearson and Spearman distance followed
by PCA and Laplacian transformations. Then, six different kinds
of projections are clustered by k-means to allow the construction
of a consensus matrix with the CSPA consensus function [26].
Finally, the consensus matrix is used for hierarchical clustering.
conCluster [31] is another consensus clustering method that
combines several partitions by t-SNE and k-means with different
parameters. The partitions are then concatenated as the
consensus for final k-means clustering.

Affinity propagation

Affinity propagation [57] is a clustering algorithm based on
message passing between two kinds of log-probabilities to find
exemplar datapoints (cluster centers): responsibility, which
indicates how suitable a datapoint x_k is to represent a datapoint x_i
relative to other candidates x_k' ≠ x_k, and availability, which
measures how appropriate datapoint x_i is for representation by
datapoint x_k, considering other datapoints x_i' ≠ x_i also represented
by x_k. The main advantage of affinity propagation is that there
is no requirement that the number of clusters be known. The
disadvantages are the relatively high time complexity and the
sensitivity to outliers. The time complexity of affinity
propagation is O(N^2). TCC-based clustering [13] clusters single cells with
affinity propagation when the number of cell types is unknown.
SIMLR [39] also has the option to apply affinity propagation
directly on the similarity matrix learned from multiple kernels
instead of spectral clustering on the latent space.

Clustering multiple cell populations

When multiple single-cell populations collected from multiple
biological samples are sequenced, more complex batch effects
and specific biological variations in each individual cell
population are introduced into the clustering analysis. Batch effects
occur when cells from one biological group or condition are
cultured, captured and sequenced separately from cells in a
second condition [58]. If each cell population is collected from one
individual in a group of samples such as a patient cohort, each
individual single-cell population will carry distinct
population-specific characteristics. The technical biases and irrelevant
biological variance among the samples will be significant
confounding factors causing the cells from each individual population
to cluster together. For example, when the scRNA-seq profiles from
multiple patients are pooled together for clustering, the clusters
will simply assign the single cells to the sample origin [9].

scVDMC [9] is designed to cluster multiple populations of
scRNA-seq data from biological replicates or different samples
simultaneously with a multitask clustering approach. scVDMC
assumes that the individual cell populations consist of similar
cell types with similar markers but possibly varying expression
patterns across the datasets due to some population-specific
biological variation. The mathematical optimization framework
uses embedded feature selection to look for a small set of shared
cell markers while allowing varying expression of the markers in
different populations with a controlled variance as follows:

$$
\min_{U^{(d)},\,V^{(d)},\,B}\ \frac{1}{2}\sum_{d=1}^{D}\left\| D_B\left(X^{(d)} - U^{(d)}V^{(d)}\right)\right\|_F^2
- w\sum_{d=1}^{D} B^{T}\,\mathrm{Var}\left(U^{(d)}\right)
+ \alpha\sum_{i,j} B_i\,\mathrm{Var}\left(Y_{i,j}\right) \qquad (1)
$$

$$
\text{subject to}\quad B = \lambda,\quad \forall i = 1,\ldots,n^{(d)},\ \forall d = 1,\ldots,D,
$$

where $X^{(d)}$ represents the gene expression profile of population
$d$; $D$ is the number of populations; $B \in \{0,1\}^m$ is an indicator
vector of gene selection; $D_B$ is a diagonal matrix with $B$ on the
diagonal and $Y_{i,j} = [U^{(1)}_{i,j}, \ldots, U^{(D)}_{i,j}]^T$.

Seurat 2.0 [12] identifies cell subpopulations by integrating
multiple data sets with a common source of variation with
multiple CCA (multi-CCA) [59] to learn shared gene correlation
structures conserved across the multiple datasets. Similar to
the CCA discussed in the Dimension reduction section, multi-CCA
combines pairwise CCA to find the optimal coprojection of each
dataset to maximize the total correlation between each pair
of projections. The cells projected into the lower-dimensional
space are then used to find cell–cell distance by SNN, and cell
types are discovered by a graph-based clustering method, smart
local moving [60].

These advanced data integration models explored important
general frameworks for cross-dataset studies in single-cell data
analysis that enable future studies from consortia to integrate
datasets from multiple laboratories and technologies, aiming
to define, for example, all human cell types.

Rare cell types and singleton clusters

In single-cell clustering analysis, the detection of rare cell types
is an important problem since cell types that play an important
role in development or disease progression often have low
abundance [14]. Due to their small population size, rare cell types are
often difficult to detect in standard clustering analysis.

RaceID [22] is a clustering algorithm specifically designed to
identify rare cell types in scRNA-seq data. The algorithm first
computes Pearson’s correlation distance between pairs of cells
used for k-means clustering. In each cluster, outlier cells are
screened according to the variability of genes compared to a
background noise model. Finally, the outlier cells are merged into
outlier clusters if their correlation exceeds a threshold of the
cell–cell correlation in their original cluster.

GiniClust [14] is another clustering strategy focused on the
discovery of rare subpopulations. The algorithm uses the Gini
index as a measure of the variability in expression values for
gene feature selection. This approach is shown to be more
sensitive to the proportion of cells with high versus low expression
values than the commonly used Fano factor. The genes with
the highest Gini index are then used as features for density
clustering by DBSCAN to detect the rare cell types.

Cells belonging to rare cell types can also be viewed as
outliers in the clustering process. Most published single-cell
clustering algorithms can result in small clusters, or even
singletons. Although this may occur due to poor initialization
or convergence of the clustering algorithm, it can also be
interpreted as outlier cells from rare cell types. Several
algorithms have specific techniques, and in most cases parameter
tuning, to carefully select these singletons for rare cell-type
detection. SINCERA [20], for example, instead of requiring the
user to specify the minimum distance between the clusters
in hierarchical clustering, uses a threshold on the number
of allowed singletons. Similarly, DendroSplit [17] has three
parameters that control the number of detected singletons:
minimum cluster size; disband size, which evaluates the size
of subtrees resulting from a cluster split; and a threshold to
determine singleton merging to its nearest neighbor.

Cell differentiation and pseudotime ordering

Cell differentiation is governed by complex gene-regulatory
processes. During differentiation, each cell makes independent
fate decisions by integrating signals from other cells and
executing complex gene-regulatory changes. scRNA-seq data
have been analyzed to reconstruct the lineage trees of the cell
differentiation processes and to sort cells according to their
biological stage, also called pseudotime. Although the details of
the methods specifically designed for this problem are beyond
the scope of this article, we note that several of the methods
are also either applicable to single-cell clustering or based on
some clustering strategies. These methods usually find some
specific projections by dimension reduction for tree construction
or clustering in low-dimensional space.

Monocle [61] utilizes independent component analysis to
find a low-dimensional projection of the cells, which are then
used to construct an MST. The more recently proposed Monocle
2 [51] reconstructs single-cell trajectories with reverse graph
embedding, utilizing only genes differentially expressed in cell
clusters identified by t-SNE and density peak clustering. TSCAN
finds clusters using the GMM and builds an MST based on
the clusters for pseudotime ordering. cellTree [34] applies
LDA to project the individual cells into the topics dimension
to represent individual cells as a mixture of topics. The cell
hierarchical structure can then be found by finding the MST on
a chi-square distance matrix computed with topics histograms.
SLICER [62] uses locally linear embedding to project the cells
in a lower-dimensional space to build a new neighbor graph for
sorting the cells based on their shortest path distances from
a user-specified starting cell. Then, a geodesic entropy is
computed using the shortest path distances to detect branches in
the cellular trajectory. SCUBA [41] uses k-means to cluster cells
along a binary tree detailing bifurcation events for time-series
data. SOMSC [56] utilizes SOM to reduce the dimension of gene
expression data to identify cellular states, and the pseudotime
ordering of the cells is obtained from the state transitions.

Discovery of cell marker genes

One of the most important goals in the clustering analysis of the
scRNA-seq data is the discovery of new marker genes to
characterize the gene expression patterns and functions of each cell
type found by clustering for future biological interpretation and
experimental validation. Most methods identify marker genes
after clustering by differential gene expression analysis between
the clusters with statistical tests. Seurat [11], for example, uses
the Wilcoxon rank-sum test, a nonparametric test based on the
order statistics in the sorted expression values. SINCERA [20]
also uses the rank-sum test when the sample size is small and
Welch’s t-test otherwise. Welch’s t-test does not assume the
same variance in the two groups as opposed to Student’s t-test.
SC3 [26] uses the Kruskal–Wallis test, an extension of the
Wilcoxon rank-sum test to test more than two groups. There
are also existing software packages for differential expression
analysis, such as MAST [63], SAMseq [64] and scde [65].

Rather than performing differential expression analysis as a
postprocessing step of clustering, some other methods identify
the marker genes simultaneously with the clustering process.
BackSPIN [29] calculates the average gene expression in each
cluster after each split and assigns each gene to the cluster with
the highest expression. DendroSplit [17] identifies the marker
genes with the most significant P-values by Welch’s t-test as a
clustering separation score to decide whether a branch needs
to be split further in hierarchical clustering. ICGS [40] performs
pairwise correlation analysis to identify gene modules and select
the most intracorrelated genes in the modules as the guide genes
for iterative clustering. SAIC [30] iterates two steps, applying
k-means to cluster the cells and ANOVA to select signature
genes, for simultaneous clustering and marker gene detection.
scVDMC [9] embeds the marker gene selection and the multitask
clustering in the optimization framework.

Evaluations of clustering

Since the clustering of scRNA-seq data is an unsupervised
learning task in most studies, reliable evaluations are critical for the
validation of the clustering method and the clustering results.
While some studies prepare ‘gold-standard datasets’ annotated
with high confidence labels such as cell stages, conditions or
lines for the evaluation, some other studies rely on experimental
validation and examination of the biological implications of the
clustering. Below are the common strategies used for evaluation.

Adjusted Rand index

When the true clusters are available, the Rand index (RI) can be
used to measure the level of agreement between the clustering
partition and the true clusters. It is most commonly used in
its adjusted form with a correction by the index that would be
expected by chance. Given two partitions $X = \{X_1, \ldots, X_r\}$ and
$Y = \{Y_1, \ldots, Y_s\}$, the adjusted RI (ARI) is defined as follows:

$$
\mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{n}{2}}
{\frac{1}{2}\left[\sum_{i}\binom{a_i}{2} + \sum_{j}\binom{b_j}{2}\right] - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{n}{2}}, \qquad (2)
$$

where $n_{ij} = |X_i \cap Y_j|$ is the number of objects in common between
$X_i$ and $Y_j$, $a_i = \sum_j n_{ij}$, $b_j = \sum_i n_{ij}$ and $n$ is the total
number of objects. ARI = 1 indicates a
perfect agreement between the compared clusters, and ARI = 0
indicates random clustering. The adjusted form can also take
negative values when the index is less than the expected index.
ARI is widely used in the evaluation of clustering on scRNA-seq
data [9, 16–19, 25, 26] for its convenient interpretability and
implementation.

Validation of marker genes

After clustering of the single cells, it is believed that each cluster
should exhibit coherent expression on a subset of signature
genes that distinguish the cluster from the other clusters. These
selected signature genes can be compared to known markers
from the literature for association with the tissues or cell types
being analyzed, providing an indication of consistent clustering
[9, 11, 13, 14, 17, 19, 22, 25, 26, 29, 30, 38]. In some studies, FACS
sorting or flow cytometry staining by the detected marker genes
was applied to sort single cells from new samples to further
validate that the markers indeed separate a subpopulation from
the whole cell population [9].

Downsampling evaluation

Downsampling is a statistical approach to evaluate the
robustness of clustering results as the number of samples for clustering
is reduced. In the evaluation of clustering by SC3 [26], the cells
are downsampled with a binomial distribution with P = 0.1 and
n = round(M_ij), for each gene i and cell j. In the evaluation
of BISCUIT [21], the counts for each cell j are downsampled
with a different rate r_j ∼ Unif(0.1, 1). In the evaluation of
TCC-based clustering [13], cells are also randomly subsampled
from only two different cell types to evaluate whether the
clustering method can indeed reliably distinguish the cell types.

Runtime and scalability

To measure the efficiency and scalability, the clustering methods
can be evaluated by the runtime and the computational
resources required for running the implementation. High
efficiency is a highly desirable feature since the sizes of the new
scRNA-seq datasets, especially those generated from
droplet-based platforms, are typically on the scale of hundreds of
thousands or larger [66]. Runtime and scalability have become
important issues. Several previous studies evaluated runtime
of the clustering algorithms on scRNA-seq datasets of different
sizes [16, 17]. It has been demonstrated that the runtime required
by several widely used tools to analyze datasets of less than
10 000 cells can range from tens of seconds to several days
[16]. The large variations suggest that efficiency is a concern
even on datasets of moderate sizes. The implementation of
SC3 [26], to reduce the computational requirement, adopts a
two-stage approach for clustering large scRNA-seq datasets: in
the first stage, only up to 5000 cells can be clustered with the
software; and in the second stage, classifiers are trained with
the samples clustered in the first stage to classify the remaining
samples in the dataset and thus obtain the cluster assignment.
A previous study in [13] also demonstrated the efficiency of
TCC-based clustering, which replaces full short-read alignment
with pseudoalignment tools. In particular, the short reads are
grouped together as an equivalence class if they are mapped
to the same set of transcripts in the reference transcriptome.
Then, the read counts of the equivalence classes can be used as
features for clustering. In this scenario, the algorithm needs to
know only the potential transcripts of origin for computing the
read counts in the equivalence classes, which can be derived by
pseudoalignment without the full alignment of the reads to the
transcripts.

Experimental evaluation

We conducted two experimental evaluations of the scRNA-seq
clustering methods. In the first experiment, we compared
several widely used scRNA-seq clustering methods to identify the
strengths and limitations in clustering performance and their
scalability to a dataset of more than 100 000 PBMCs. In the second
experiment, we performed clustering on 212 breast cancer cells
from 5 individuals to evaluate the clustering performance of
multiple cell populations.

Clustering performance and scalability on PBMC data

We downloaded PBMC data from the 10x Genomics website [66].
In the original data, there are 10 bead-enriched subpopulations
of PBMC from a fresh donor (Donor A) with 103 887 cells in total.
In addition to evaluating the compared methods using the entire
dataset, we also performed downsampling with sizes of 100, 1000
and 10 000 to measure the scalability. The dataset originally
contains mRNA expressions of 32 739 genes, from which we
selected 19 630 genes that are expressed in at least 3 cells. We
compared the following methods:

• BackSPIN [29] is a divisive hierarchical biclustering method
that simultaneously clusters genes and cells based on sorting
points into neighborhoods. In the experiment, BackSPIN was
run using feature selection for {1000, 5000, 10 000} genes
and nested splits parameters in {3, 4, 5}. The choice of 5000
genes and 4 nested splits identified the number of clusters
closest to 10.
• cellTree [34] first applies LDA to embed the single cells as
a mixture of topics and then builds a hierarchical clustering
by constructing a minimal spanning tree on the topic
distributions. To run cellTree, we first fit the LDA model with
the default method (joint MAP estimation) to choose the
number of topics, followed by learning a pairwise Euclidean
distance for all cells. Then we ran hierarchical clustering
using Ward, complete, single and average linkage,
obtaining the best results with Ward linkage. Ward distance
was also successfully used as linkage distance in the
single-cell context in [9, 16, 20].
• CIDR [16] first performs cell dropout imputation by the
expected expression value calculated using a dropout
probability distribution. After imputation, PCoA is applied
on the dissimilarity matrix for dimension reduction followed
by hierarchical clustering. CIDR has only one parameter, the
desired number of clusters, which is set to 10.
• DendroSplit [17] reports clusters with dynamic splits and
merges of the hierarchical tree branches by measuring a
separation score from the original expression data. In the
experiment, DendroSplit was run with split and merge thresholds
between 1 and 20 to identify the best results. The authors of
DendroSplit recommend the merge threshold to be half of the
split threshold for good results.
• ICGS [40] applies hierarchical clustering to cluster the
expression data of a set of selected genes. In the experiment, ICGS
was run with gene correlation threshold ρ between 0.05 and
0.35, with a step size of 0.05 for selecting the best ρ.
• Monocle 2 [51] applies density peak clustering in the
lower-dimensional space obtained by applying t-SNE to the single
cells for reconstructing a single-cell trajectory. We ran
Monocle with one, two or three t-SNE components.
• pcaReduce [25] is based on PCA and k-means clustering. The
algorithm also has an additional step to construct a
cell-type tree by merging pairs of clusters based on analyzing
the probability density function associated with the pair of
clusters. pcaReduce was run using the number of dimensions
q = 10, which is the number of cell types.
• SC3 [26] uses PCA and Laplacian transformation on multiple
distance matrices using different metrics. k-Means clustering
is then applied to cluster each different representation of the
data. Finally, a consensus matrix is constructed and clustered
with hierarchical clustering. To run SC3, we used the
recommended setting by which clustering is performed using
5000 cells to obtain the clusters for training a support vector
machine (SVM), which is then used to assign the remaining
cells to the clusters.
• SCRAT [54] uses a SOM to cluster and visualize single cells
in a 2D map in which the units represent single cells that
have correlated gene expression. The algorithm was run with
20, 30 and 40 units in the first layer of the neural network to
obtain the best results.
• Seurat [11, 12] was initially proposed to infer cellular
localization by integrating scRNA-seq data with in situ
hybridization patterns. To cluster cells, an updated version of the
package constructs the SNN graph of cells and utilizes
Louvain clustering. In Seurat 2.0, multiple
single-cell datasets can be integrated using CCA to identify shared
components for pooled clustering. Seurat was run using the
LogNormalize parameter, with a scale factor of 100, 1000 and
10 000 and a resolution between 1 and 1.2 with a step size of
0.01.
• SNN-Cliq [18] constructs an SNN graph among the cells
and applies clique detection on the graph to discover cell
types. SNN-Cliq was run using the k parameter of k-nearest
neighbors between 3 and 25 to select the best k.
• TSCAN [46] utilizes a GMM to cluster single cells,
and the clusters are then used to build an MST for pseudotime ordering.
We tested TSCAN with and without PCA, and obtained better
results with PCA, setting the number of clusters to 10.
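
Several of the methods listed above (Seurat and SNN-Cliq) start from a shared nearest neighbor (SNN) graph built with a k-nearest-neighbor parameter. The sketch below shows one simple way such a graph could be constructed, under the assumption that the edge weight is the plain count of shared neighbors; the published weighting schemes, the quasi-clique search of SNN-Cliq and the Louvain step of Seurat are not reproduced, and the function names `knn_indices` and `snn_graph` are hypothetical.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors of each row of X (Euclidean, excluding itself)."""
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]

def snn_graph(X, k):
    """Weighted adjacency matrix: weight(i, j) = number of k-nearest neighbors shared by i and j."""
    n = len(X)
    nn = knn_indices(X, k)
    member = np.zeros((n, n), dtype=int)
    member[np.arange(n)[:, None], nn] = 1  # member[i, j] = 1 if j is a neighbor of i
    shared = member @ member.T             # counts of common neighbors
    np.fill_diagonal(shared, 0)
    return shared

# Toy data (hypothetical): two distant groups of 10 cells with 4 features each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(10, 4)),
               rng.normal(10.0, 0.5, size=(10, 4))])
W = snn_graph(X, k=3)
```

In this toy example, cells from different groups share no neighbors, so the graph splits into two disconnected blocks. The dense pairwise distance matrix limits the sketch to small inputs; a neighbor index would be needed for realistic dataset sizes.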
In addition, we included k-means clustering using standard
Euclidean distance. No method using affinity propagation was
compared since TCC-based clustering [13] uses
transcript-compatibility counts and is not applicable to the UMI counts
in the PBMC dataset, whereas the available SIMLR package [39]
includes only spectral clustering but not affinity propagation.
Since the PBMC dataset contains UMI counts for each gene
by cell, we did not perform any further normalization unless
required by a compared method. Each method was run 10 times
to obtain the mean and the standard deviation of the ARI. When
multiple parameters were tested, we report the best results, as
in previous studies [9, 25, 26]. We also report the mean and the
standard deviation of the runtime of all the compared methods,
measured by 10 runs on a server with Intel Xeon E5-2687W v3
3.10 GHz, 25 M Cache and 256 GB of RAM.
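
As an illustration of the evaluation protocol described above, the following sketch computes the ARI from the contingency table of two partitions and then summarizes it over repeated runs by mean and standard deviation. The function name `adjusted_rand_index` and the toy label vectors are hypothetical and are not taken from the experiments; at least two objects are assumed.

```python
import numpy as np
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two partitions given as label vectors of equal length (n >= 2)."""
    labels_true = np.asarray(labels_true).ravel()
    labels_pred = np.asarray(labels_pred).ravel()
    n = labels_true.size
    # Contingency table: table[i, j] = number of objects in true cluster i and predicted cluster j
    _, ti = np.unique(labels_true, return_inverse=True)
    _, pj = np.unique(labels_pred, return_inverse=True)
    table = np.zeros((ti.max() + 1, pj.max() + 1), dtype=int)
    np.add.at(table, (ti, pj), 1)
    sum_ij = sum(comb(int(v), 2) for v in table.ravel())
    sum_a = sum(comb(int(v), 2) for v in table.sum(axis=1))  # row sums a_i
    sum_b = sum(comb(int(v), 2) for v in table.sum(axis=0))  # column sums b_j
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions put all objects in one cluster
    return (sum_ij - expected) / (max_index - expected)

# Toy example (hypothetical labels): one perfect run and one partially correct run
truth = [0, 0, 0, 1, 1, 1]
runs = [
    [1, 1, 1, 0, 0, 0],  # same partition with permuted cluster names
    [0, 0, 1, 1, 2, 2],  # partially correct partition
]
scores = [adjusted_rand_index(truth, run) for run in runs]
mean_ari, std_ari = float(np.mean(scores)), float(np.std(scores))
```

Because the ARI depends only on the contingency table, renaming the clusters does not change the score, which is why the first toy run counts as a perfect match.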
Figure 1 shows the ARI and runtime comparison among
the methods by the mean and the standard deviation of 10
runs. The results show that Monocle, cellTree, Seurat and SC3
exhibit the best ARI performance among the methods. However,
Monocle, cellTree and Seurat do not scale to all the samples due
to the memory issue. The SC3 software package clusters only
up to 5000 cells and classifies the remaining cells. Without the
supervised step, SC3 has similar scalability to that of cellTree
and Seurat. pcaReduce was able to cluster all the cells; however,
the running time was more than 2 days, as shown in Figure 1B
and the clustering result was not improved by clustering more
cells together, as shown in Figure 1A. ICGS did not perform
well on this dataset, being the slowest method scalable up to
only 1000 cells. Nevertheless, the pipeline reports additional important information
along with clustering, such as marker identification prior to clustering, plots using
t-SNE and gene ontology annotations. The SCRAT package performed well on clustering
100 cells but became unstable when 40 units were used for clustering 1000 cells. SCRAT
requires at least 3 days to process 5000 cells and thus is not scalable to the larger
datasets. Note that SCRAT also reports important additional information about lineage
relationship and gene enrichment analysis. The standard k-means shows very stable
results up to 10 000 cells, with an ARI of approximately 0.15, but when tested on all
cells, the performance drops to only 0.033.

Figure 1. Comparison of clustering performance and scalability. (A) The y-axis is the
ARI of the clustering results on the PBMC dataset. The x-axis is labeled by the size of
the (downsampled) datasets. (B) The y-axis shows the runtime of clustering the PBMC
dataset. The x-axis is also labeled by the size of the (downsampled) datasets. The
curves are truncated if a method is not scalable to a certain size of the dataset.

Figure 1A also shows that k-means, SC3 and pcaReduce, all of which use k-means as one
of the steps in the clustering, have the largest variance among the multiple runs,
while the hierarchical clustering methods cellTree, CIDR and DendroSplit, the
graph-based clustering method SNN-Cliq and the density-based clustering method Monocle
always return the same clustering output in the multiple runs. The mixture models,
TSCAN and Seurat, and the neural network method SCRAT also always return the same
clustering results, indicating that some strategy for obtaining a fixed initialization
is used in the implementation.

A further analysis of the results obtained by the clustering techniques shows that
hierarchical clustering-based methods exhibit very close mean ARI results. When
clustering 1000 cells, we can see that BackSPIN, CIDR, DendroSplit and ICGS have ARIs
between approximately 0.25 and 0.3. cellTree, though also based on hierarchical
clustering, applies LDA projection of the data, which appears more suitable to the read
count data. In terms of partition-based methods, we can see that even though pcaReduce
utilizes k-means as part of its framework, it is able to improve the clustering results
with proper use of PCA and the clustering merge strategy. SC3 consensus clustering
appears to be a very promising method that combines the advantages of several distance
measures and projections. However, the results seem to be unstable when SC3 depends on
the SVM to classify more cells, e.g. the result of clustering 10 000 cells is worse
than that of clustering 1000 cells. TSCAN using GMM shows better results than k-means
when using all cells (P = 0.001 by t-test), which suggests that the Gaussian modeling
may play a positive role in clustering. The implementation of SOM in SCRAT appears to
have poor scalability, probably due to the large number of gene expression features in
the network, even though SOM can be trained with stochastic gradient descent. For
density-based clustering, Monocle outperforms the other methods by a large

margin for clustering 10 000 cells. Moreover, Monocle is relatively scalable with an
efficient implementation of density peak clustering [52]. Finally, even though both
Seurat and SNN-Cliq build SNN as the foundation for clustering, Seurat performs better
by using the Louvain algorithm instead of clique detection as in SNN-Cliq.
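
As a rough illustration of what a shared-nearest-neighbor (SNN) graph encodes, the sketch below weights each pair of cells by the Jaccard overlap of their k-nearest-neighbor sets. This is a simplified stand-in and not the exact definition used in SNN-Cliq or Seurat; the function names, the Euclidean distance and the toy coordinates are assumptions made for illustration only.

```python
import math


def knn_sets(points, k):
    """For each point, return the index set of its k nearest neighbors (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append({j for _, j in dists[:k]})
    return neighbors


def snn_graph(points, k):
    """Return {(i, j): weight} where weight is the Jaccard overlap of the two
    cells' neighborhoods (each neighborhood includes the cell itself)."""
    nn = knn_sets(points, k)
    closed = [s | {i} for i, s in enumerate(nn)]
    edges = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            shared = len(closed[i] & closed[j])
            if shared:  # keep only pairs that share at least one neighbor
                edges[(i, j)] = shared / len(closed[i] | closed[j])
    return edges


# Toy data (hypothetical): two well-separated groups of three cells each.
cells = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
graph = snn_graph(cells, k=2)
```

A community detection step such as Louvain, or a clique search, would then operate on the weighted edges in `graph`; in this toy case the edges only connect cells within the same group.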

This experiment shows that, even though there is a large number of clustering methods
specifically designed for scRNA-seq analysis, they show considerably varying results
for clustering thousands of cells, and there is still a need for methods that can scale
to a large number of cells, such as hundreds of thousands or possibly more, rather than
relying on a supervised step.
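
For readers who want to reproduce the evaluation metric, the adjusted Rand index (ARI) reported above can be computed from the contingency table of true and predicted labels. The following is a minimal pure-Python sketch of the standard formula; the function name and the handling of degenerate inputs are choices made here for illustration and are not taken from the benchmarked packages.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same cells")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treated as perfect agreement here

    # Pair counts within contingency cells, true clusters and predicted clusters.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Cluster names do not matter, only the grouping does.
same_grouping = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

An ARI of 1 means the two partitions are identical up to relabeling, values near 0 correspond to chance-level agreement, and negative values indicate agreement below chance.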
Clustering multiple cell populations in breast cancer

We downloaded the original dataset from [67] containing 515
cells of 11 patients with breast cancer. The dataset reports TPM
values of 25 636 genes, from which we extracted the top 5000
genes with the largest variance in expression. The dataset labels
each cell in one of the three groups: immune, stromal or tumor.
Because some of the patients do not contain cells of all three
types, we utilized 212 cells from 5 patients.
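
The gene filtering step described above can be sketched as follows. This is an illustrative pure-Python version that assumes a cells-by-genes matrix stored as a list of rows; the function name and the toy values are hypothetical, and the text does not state whether the TPM values were transformed before the variance was computed, so no transformation is applied here.

```python
from statistics import pvariance


def top_variable_genes(expression, gene_names, n_top):
    """Keep the n_top genes with the largest variance across cells.

    expression: list of rows, one row per cell, one value per gene.
    Returns the kept gene names and the filtered matrix, in original gene order.
    """
    n_genes = len(gene_names)
    variances = [pvariance([row[g] for row in expression]) for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    keep = sorted(ranked[:n_top])  # restore the original gene order
    filtered = [[row[g] for g in keep] for row in expression]
    return [gene_names[g] for g in keep], filtered


# Toy example (hypothetical values): three cells, three genes.
tpm = [
    [1.0, 10.0, 5.0],
    [1.0, 0.0, 6.0],
    [1.0, 20.0, 4.0],
]
genes = ["g1", "g2", "g3"]
names, filtered = top_variable_genes(tpm, genes, 2)
```

In the experiment above, `n_top` would correspond to 5000 and the input to the 25 636-gene TPM matrix.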
This dataset was used mainly to compare the two methods that are designed to cluster
multiple populations, scVDMC [9] and Seurat 2.0, a new version of Seurat [11]. Seurat
2.0 applies pairwise CCA to integrate multiple datasets in a space that maximizes the
correlation between their projections. Seurat 2.0 was run with the number of selected
genes in {3000, 3200, . . . , 5000}, the number of canonical correlation components in
{2, . . . , 10}, and resolution in {0.2, 0.3, 0.4, 0.5}. The best result is obtained
with the three parameters, 1600, 2 and 0.2, respectively. scVDMC assumes the
single-cell populations consist of similar cell types with similar markers but possibly
varying expression patterns across the datasets due to some population-specific
biological variation. The mathematical optimization framework uses embedded feature
selection to look for a small set of shared cell markers while allowing varying
expressions of the markers in different populations with a controlled variance. scVDMC
was run using initialization by separated k-means with the parameters λ in
{100, 200, . . . , 1000}, α in {1, 2, . . . , 6} and w in {1, 2, . . . , 6} (see
Equation 1 for the definition of the parameters). The best result was obtained by
λ = 1000, α = 3, w = 3.

The two methods were also compared to k-means and the best performing single-dataset
clustering algorithms Monocle, SC3 [26] and cellTree [34] in two scenarios: separated
clustering, in which data from each patient is clustered separately; and pooled
clustering, in which all the data are combined in a single dataset. Monocle was run
with the perplexity = 3 option to avoid an error with the t-SNE. k-Means was used with
Pearson's correlation distance, which gives better results than using Euclidean
distance on this dataset.

Figure 2 shows the clustering results measured by ARI and running time, where each
method was run 10 times to obtain the mean and the standard deviation. To measure the
ARI, we combined the data of all populations together to consider the agreement between
the overall clustering and the true clusters in each population. It is interesting to
observe in Figure 2A that the pooled versions of SC3, k-means and cellTree perform much
worse than the separated versions, strongly indicating that simple pooling is not
applicable to the integration of multiple scRNA-seq datasets. We also noticed that
scVDMC and Seurat 2.0 both achieved better ARIs, with a mean of 0.681 for scVDMC
against a mean of 0.675 for Seurat 2.0, than separated k-means with an ARI of 0.4742.
Even though the mean of scVDMC is higher, its variance is also higher due to the
k-means initialization in scVDMC, so the difference from Seurat 2.0 is not
statistically significant (P = 0.3511 by t-test). scVDMC has also shown better mean
runtime performance than Seurat 2.0 (P = 2E−14 by t-test), with a mean of 56 s against
151 s. Overall, the results in this experiment clearly demonstrate the advantage of
applying advanced learning methods such as multitask clustering or multi-CCA to
integrative clustering of multiple cell populations.

Figure 2. Clustering multiple BRCA cell populations. Comparison of the methods by
clustering multiple BRCA cell populations with (A) ARI and (B) running time.

Discussions and conclusions

In the past 6 years, there has been substantial development of clustering algorithms
specifically for the analysis of scRNA-seq data. These algorithms aim to tackle
challenges inherent in scRNA-seq data, such as cell-specific biases, dropouts and
technical noise. Some algorithms have been developed to solve the tasks involving
multiple populations of cells [9, 11], detection of rare cell types [14, 22] and
pseudotime ordering of cells [34, 61]. Moreover, there is substantial attention given
to the development of data preprocessing techniques, such as normalization, dropout
imputation, dimension reduction and similarity measures, which contribute to reducing
the technical variations before clustering is performed. Together, these advances in
computational methods have provided a wide variety of useful
tools for clustering analysis of scRNA-seq data.

We also observed that an increasing number of studies are in need of more scalable
clustering algorithms to transfer the success of single-cell clustering algorithms to
larger datasets. The more scalable new scRNA-seq platforms have tremendously reduced
the cost and time for cell capture and sequencing and have enabled new studies to
utilize a much larger number of single cells, e.g. droplet-based platforms from 10x
Genomics can capture and sequence one million cells for each study [66]. This advance
brings new challenges. We observed that most existing tools do not scale well to tens
of thousands of cells, which limits the applicability of the algorithms in future
studies.

Another limitation of the current methods is related to the opportunities for data
integration. The fast growing number of single-cell datasets becoming available in the
past few years shows that, soon enough, the vast amount of single-cell data will allow
the curation of specific knowledge bases of cell types, cell markers, their expression
patterns or even epigenomic features. In addition, there will be single-cell resolution
profiling of large patient cohorts such as those in The Cancer Genome Atlas. We have
shown that there exists only a limited number of methods for performing single-cell
clustering when multiple datasets are combined in a meta-analysis. As the number and
size of single-cell datasets continue to grow, advanced data integration methods will
be in great need.

In addition to the unsupervised learning methods described in this review, we also
noticed an alternative problem formulation that utilizes supervised or semisupervised
ideas to perform cell type identification. For example, the SC3 [26] package uses
supervised learning to assign the remaining cells to the clusters found by consensus
clustering, improving the scalability to a larger number of cells. Scmap [68] applies
to a scenario where the cell types of a subset of cells are known a priori in a
reference dataset and then the cells of unknown cell type from some other dataset are
mapped to the most similar group of cell types with the nearest neighbor classification
after gene feature selection. In a more general formulation of the problem, a model
needs to classify the cells into the known cell types and identify cells of new types
to detect new clusters. Clearly, a multistage approach or more advanced modeling
technique is necessary.

Finally, further types of single-cell data have now been collected in addition to RNA
expression, such as single-cell epigenomic data [69], single-cell Hi-C genome
structures [70, 71] and single-cell DNA sequencing [72]. While some of the clustering
methods developed for scRNA-seq data could also be applicable to some of the new
single-cell data types, we expect there will also be substantially new computational
development for clustering analysis of the new data types.

Key Points

• The new computational challenges for clustering scRNA-seq data include dropout
  imputation, rare cell-type detection, integration of multiple single-cell
  populations, inference of cell developmental trajectory and the evaluation and
  interpretation of single-cell clusters.
• There has been substantial new computational development dedicated to the clustering
  analysis of scRNA-seq data, including clustering techniques, normalization and
  imputation methods, dimension reduction methods and more advanced learning methods
  for time series, multiple dataset integration and small cluster detection.
• Current clustering methods scale only to scRNA-seq datasets with tens of thousands of
  cells. More scalable algorithms are necessary to allow applications to target larger
  scRNA-seq datasets such as possibly as many as 1 000 000 cells generated from
  droplet-based platforms.
• Clustering algorithms, which integrate multiple cell populations and are applicable
  to the clustering of other types of single-cell (epi)genomic data, are also in great
  demand to support future analysis of scRNA-seq data from patient cohorts and new
  types of single-cell data.

Funding

RP is partially supported by the CAPES Foundation, Ministry of Education of Brazil
(BEX 13250/13-2).

References

1. Ben-Dor A, Shamir R, Yakhini Z. Clustering gene expression patterns. J Comput Biol 1999;6(3–4):281–97.
2. Jiang D, Tang C, Zhang A. Cluster analysis for gene expression data: a survey. IEEE Trans Knowl Data E 2004;16(11):1370–86.
3. Stegle O, Teichmann SA, Marioni JC. Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet 2015;16(3):133.
4. Kolodziejczyk AA, Kim JK, Svensson V, et al. The technology and biology of single-cell RNA sequencing. Mol Cell 2015;58(4):610–20.
5. Tsoucas D, Yuan G-C. Recent progress in single-cell cancer genomics. Curr Opin Genet Dev 2017;42:22–32.
6. Shintaku H, Nishikii H, Marshall LA, et al. On-chip separation and analysis of RNA and DNA from single cells. Anal Chem 2014;86(4):1953–7.
7. Hebenstreit D. Methods, challenges and potentials of single cell RNA-seq. Biology 2012;1(3):658–67.
8. Bacher R, Kendziorski C. Design and computational analysis of single-cell RNA-sequencing experiments. Genome Biol 2016;17(1):63.
9. Zhang H, Lee C-AA, Li Z, et al. A multitask clustering approach for single-cell RNA-seq analysis in recessive dystrophic epidermolysis bullosa. PLoS Comput Biol 2018;14(4):e1006053.
10. Vallejos CA, Risso D, Scialdone A, et al. Normalizing single-cell RNA sequencing data: challenges and opportunities. Nat Methods 2017;14(6):565.
11. Satija R, Farrell JA, Gennert D, et al. Spatial reconstruction of single-cell gene expression data. Nat Biotechnol 2015;33(5):495.
12. Butler A, Hoffman P, Smibert P, et al. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol 2018;36(5):411.
13. Ntranos V, Kamath GM, Zhang JM, et al. Fast and accurate single-cell RNA-seq analysis by clustering of transcript-compatibility counts. Genome Biol 2016;17(1):112.
14. Jiang L, Chen H, Pinello L, et al. GiniClust: detecting rare cell types from single-cell gene expression data with Gini index. Genome Biol 2016;17(1):144.
15. Jiang L, Schlesinger F, Davis CA, et al. Synthetic spike-in standards for RNA-seq experiments. Genome Res 2011;21(9):1543–51.
16. Lin P, Troup M, Ho JWK. CIDR: ultrafast and accurate clustering through imputation for single-cell RNA-seq data. Genome Biol 2017;18(1):59.
17. Zhang JM, Fan J, Fan HC, et al. An interpretable framework for clustering single-cell RNA-Seq datasets. BMC Bioinformatics 2018;19(1):93.
18. Xu C, Su Z. Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics 2015;31(12):1974–80.
19. Li H, Courtois ET, Sengupta D, et al. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat Genet 2017;49(5):708.
20. Guo M, Wang H, Potter SS, et al. SINCERA: a pipeline for single-cell RNA-Seq profiling analysis. PLoS Comput Biol 2015;11(11):e1004575.
21. Prabhakaran S, Azizi E, Carr A, et al. Dirichlet process mixture model for correcting technical variation in single-cell gene expression data. In: International Conference on Machine Learning. New York, NY, USA: JMLR.org. 2016, pp. 1070–9.
22. Grün D, Lyubimova A, Kester L, et al. Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature 2015;525(7568):251.
23. Pierson E, Yau C. ZIFA: dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol 2015;16(1):241.
24. Risso D, Perraudeau F, Gribkova S, et al. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat Commun 2018;9(1):284.
25. Yau C, et al. pcaReduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics 2016;17(1):140.
26. Kiselev VY, Kirschner K, Schaub MT, et al. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods 2017;14(5):483.
27. Torgerson WS. Multidimensional scaling: I. Theory and method. Psychometrika 1952;17(4):401–19.
28. van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res 2008;9(Nov):2579–605.
29. Zeisel A, Muñoz-Manchado AB, Codeluppi S, et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 2015;347(6226):1138–42.
30. Yang L, Liu J, Lu Q, et al. SAIC: an iterative clustering approach for analysis of single cell RNA-seq data. BMC Genomics 2017;18(6):689.
31. Gan Y, Li N, Zou G, et al. Identification of cancer subtypes from single-cell RNA-seq data using a consensus clustering method. BMC Med Genomics 2018;11(6):117.
32. Hotelling H. Relations between two sets of variates. Biometrika 1936;28(3/4):321–77.
33. Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res 2003;3(Jan):993–1022.
34. Yotsukura S, Nomura S, Aburatani H, et al. CellTree: an R/bioconductor package to infer the hierarchical structure of cell populations from single-cell RNA-seq data. BMC Bioinformatics 2016;17(1):363.
35. Kohonen T. The self-organizing map. Proc IEEE 1990;78(9):1464–80.
36. Flexer A. On the use of self-organizing maps for clustering and visualization. Intell Data Anal 2001;5(5):373–84.
37. Murtagh F, Hernández-Pajares M. The Kohonen self-organizing map method: an assessment. J Classification 1995;12(2):165–90.
38. Wang Z, Jin S, Liu G, et al. DTWscore: differential expression and cell clustering analysis for time-series single-cell RNA-seq data. BMC Bioinformatics 2017;18(1):270.
39. Wang B, Zhu J, Pierson E, et al. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat Methods 2017;14(4):414.
40. Olsson A, Venkatasubramanian M, Chaudhri VK, et al. Single-cell analysis of mixed-lineage states leading to a binary cell fate choice. Nature 2016;537(7622):698.
41. Marco E, Karp RL, Guo G, et al. Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape. Proc Natl Acad Sci 2014;111(52):E5643–50.
42. Grün D, Muraro MJ, Boisset J-C, et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell 2016;19(2):266–77.
43. Guha S, Rastogi R, Shim K. CURE: an efficient clustering algorithm for large databases. In: ACM Sigmod Record, Vol. 27. New York, NY, USA: ACM, 1998, 73–84.
44. Tsafrir D, Tsafrir I, Ein-Dor L, et al. Sorting points into neighborhoods (SPIN): data analysis and visualization by ordering distance matrices. Bioinformatics 2005;21(10):2301–8.
45. Xu D, Tian Y. A comprehensive survey of clustering algorithms. Ann Data Sci 2015;2(2):165–93.
46. Ji Z, Ji H. TSCAN: pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res 2016;44(13):e117–7.
47. Ng AY, Jordan MI, Weiss Y. On spectral clustering: analysis and an algorithm. In: Advances in Neural Information Processing Systems. Vancouver, British Columbia, Canada: MIT Press. 2002, 849–56.
48. Blondel VD, Guillaume J-L, Lambiotte R, et al. Fast unfolding of communities in large networks. J Statist Mech Theory Experiment 2008;2008(10):P10008.
49. Wolf FA, Angerer P, Theis FJ. Large-scale single-cell gene expression data analysis. Genome Biol 2018;19(1):15.
50. Ester M, Kriegel H-P, Sander J, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Kdd, Vol. 96. Portland, Oregon: AAAI Press, 1996, 226–31.
51. Qiu X, Mao Q, Tang Y, et al. Reversed graph embedding resolves complex single-cell trajectories. Nat Methods 2017;14(10):979.
52. Rodriguez A, Laio A. Clustering by fast search and find of density peaks. Science 2014;344(6191):1492–6.
53. Kim DH, Marinov GK, Pepke S, et al. Single-cell transcriptome analysis reveals dynamic changes in lncRNA expression during reprogramming. Cell Stem Cell 2015;16(1):88–101.
54. Camp JG, Sekine K, Gerber T, et al. Multilineage communication regulates human liver bud development from pluripotency. Nature 2017;546(7659):533.
55. Lv D, Wang X, Dong J, et al. Systematic characterization of lncRNAs’ cell-to-cell expression heterogeneity in glioblastoma cells. Oncotarget 2016;7(14):18403.
56. Peng T, Nie Q. SOMSC: self-organization-map for high-dimensional single-cell data of cellular states and their transitions. bioRxiv, 2017, 124693.
57. Frey BJ, Dueck D. Clustering by passing messages between data points. Science 2007;315(5814):972–6.
58. Hicks SC, Teng M, Irizarry RA. On the widespread and critical impact of systematic bias and batch effects in single-cell RNA-seq data. bioRxiv, 2015.
59. Kettenring JR. Canonical analysis of several sets of variables. Biometrika 1971;58(3):433–51.
60. Waltman L, Van Eck NJ. A smart local moving algorithm for large-scale modularity-based community detection. Eur Phys J B 2013;86(11):471.
61. Trapnell C, Cacchiarelli D, Grimsby J, et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol 2014;32(4):381.
62. Welch JD, Hartemink AJ, Prins JF. SLICER: inferring branched, nonlinear cellular trajectories from single cell RNA-seq data. Genome Biol 2016;17(1):106.
63. Finak G, McDavid A, Yajima M, et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol 2015;16(1):278.
64. Li J, Tibshirani R. Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-Seq data. Stat Methods Med Res 2013;22(5):519–36.
65. Kharchenko PV, Silberstein L, Scadden DT. Bayesian approach to single-cell differential expression analysis. Nat Methods 2014;11(7):740.
66. Zheng GXY, Terry JM, Belgrader P, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun 2017;8:14049.
67. Chung W, Eum HH, Lee H-O, et al. Single-cell RNA-seq enables comprehensive tumour and immune cell profiling in primary breast cancer. Nat Commun 2017;8:15081.
68. Kiselev VY, Yiu A, Hemberg M. Scmap: projection of single-cell RNA-seq data across data sets. Nat Methods 2018;15(5):359.
69. Kelsey G, Stegle O, Reik W. Single-cell epigenomics: recording the past and predicting the future. Science 2017;358(6359):69–75.
70. Liu J, Lin D, Yardimci G, et al. Unsupervised embedding of single-cell Hi-C data. Bioinformatics 2018;34(13):i96–104.
71. Cusanovich DA, Daza R, Adey A, et al. Multiplex single-cell profiling of chromatin accessibility by combinatorial cellular indexing. Science 2015;348(6237):910–4.
72. Pellegrino M, Sciambi A, Treusch S, et al. High-throughput single-cell DNA sequencing of acute myeloid leukemia tumors with droplet microfluidics. Genome Res 2018;28(9):1345–52.


Key words: scRNA sequencing; machine learning; clustering; single-cell technology.
