GCLUTO - An Interactive Clustering, Visualization, and Analysis System
There are a number of different systems that are specifically designed for the interactive exploration/visualization of document collections and query results that use document clustering and cluster visualization as one of the mechanisms to facilitate this type of analysis [3, 18, 8, 33, 49, 42, 34]. The primary goal of these systems is to aid the user in navigating through the document col-

These schemes differ in how the similarity between the individual objects in the various clusters is combined to determine the similarity between the clusters themselves. The single-link criterion function measures the similarity of two clusters by the maximum similarity between any pair of objects from each cluster, whereas the complete-link uses the minimum similarity. In general, both the single- and the complete-link approaches do not work very well because they either base their decisions on a limited amount of information (single-link), or assume that all the objects in the cluster are very similar to each other (complete-link). On the other hand, the group average approach measures the similarity of two clusters by the average of the pairwise similarity of the objects from each cluster and does not suffer from the problems arising with single- and complete-link. The remaining seven schemes take an entirely different approach and treat the clustering process as an optimization problem by selecting the cluster-pairs that optimize various aspects of intra-cluster similarity, inter-cluster dissimilarity, and their combinations. The advantage of these schemes is that they lead to more natural clusters and agglomerative trees that are more balanced than the more traditional schemes. A precise description of these schemes is beyond the scope of this paper, and the reader should refer to [54, 56] for a detailed description and comparative evaluation.

3.1.2 Partitional Clustering
Partitional clustering algorithms find the clusters by partitioning the entire dataset into a predetermined number of disjoint sets, each corresponding to a single cluster. This partitioning is achieved by treating the clustering process as an optimization procedure that tries to create high quality clusters according to a particular function that reflects the underlying definition of the “goodness” of the clusters. This function is referred to as the clustering criterion function, and gCLUTO implements seven such criterion functions, which are similar to the I1, I2, E1, H1, H2, G1, and G2 schemes used by agglomerative clustering and have been shown to produce high-quality clusters in low- and high-dimensional datasets [56].

gCLUTO uses two different methods for computing the partitioning clustering solution. The first method computes a k-way clustering solution via a sequence of repeated bisections, whereas the second method computes the solution directly (in a fashion similar to traditional K-means-based algorithms). These methods are often referred to as repeated bisecting and direct k-way clustering, respectively. In both cases, the clustering solution is computed using a randomized incremental optimization algorithm that is greedy in nature, has low computational requirements, and produces high-quality solutions [56].

3.1.3 Graph Partitional
gCLUTO’s graph-partitioning-based clustering algorithms use a sparse graph to model the affinity relations between the different objects, and then discover the desired clusters by partitioning this graph [23]. To some extent, this approach is similar in spirit to that used by the partitional clustering algorithms described earlier; however, unlike those algorithms, the graph-partitioning-based approaches consider only the affinity relations between an object and a small number of its most similar other objects. As will be discussed later, this enables them to find clusters that have inherently different characteristics than those discovered by partitional methods.

gCLUTO provides different methods for constructing this affinity graph and various post-processing schemes that are designed to help in identifying the natural clusters in the dataset. The actual graph partitioning is computed using an efficient multilevel graph-partitioning algorithm [24] that leads to high-quality partitionings and clustering solutions.

3.1.4 Bootstrap Clustering
Although the clustering problem is a well-defined optimization problem, there can still be uncertainty associated with its result due to uncertainties in the input data. For example, in the case of clustering data generated from measurements, each measurement will have a margin of error. Without additional analysis, a clustering algorithm will cluster the data under the assumption that the data is completely accurate. However, it would be more appropriate if the algorithm could incorporate the uncertainties associated with the data to produce a clustering result that could portray its level of statistical significance given the uncertainties.

Bootstrap clustering is a technique introduced by [27] that adds the statistical technique of bootstrapping to clustering algorithms. Bootstrapping simulates the multiple sampling of a distribution by randomly selecting values from a known sample with replacement. By sampling with replacement, new hypothetical datasets can be produced from the original dataset that exhibit the same distribution of values and uncertainties. This allows clustering algorithms to explore what results would occur if the same measurements were taken again.

gCLUTO implements two methods of bootstrapping data: resampling features and resampling residuals. To resample the features of a dataset, gCLUTO randomly selects with replacement columns from the dataset to produce a new set of features. This resampling tests to what extent the clustering algorithm may be relying on any particular feature. The resampling of residuals is performed by first supplying a residual matrix for the dataset. A residual matrix contains the errors associated with a dataset, which can be found by fitting the data to a linear model. In [27], residuals for microarray data are found by fitting the data to an ANOVA model. gCLUTO can accept residual matrices stored in character delimited file formats. With the residual matrix, gCLUTO performs bootstrapping to generate a new residual matrix, which is then added to the original dataset to produce a new hypothetical dataset.

Bootstrap clustering uses these hypothetical datasets to estimate the significance of a clustering solution by clustering each hypothetical dataset and comparing all of their clustering solutions. gCLUTO provides three statistics for reporting a clustering solution’s significance: solution stability, cluster stability, and object stability [36]. The term stability refers to the level of consistency observed between the various clustering results. These stability measurements range from zero (no consistency between solutions) to one (complete consistency between solutions). Solution stability represents the significance of the solution as a whole, whereas cluster and object stability portray a significance level on a per cluster and per object basis.

gCLUTO compares the various solutions generated by bootstrap clustering by computing a consensus solution and finding a mapping between it and each of the other solutions. This star-like mapping arrangement allows comparisons to be made between any pair of solutions while also requiring the fewest mappings to be found. The consensus solution is found by generating a solution that is most similar to most of the solutions. This is done by clustering the objects using a similarity graph built from information about the multiple bootstrap solutions. gCLUTO builds a similarity graph by defining the similarity of two objects as the percentage of bootstrap solutions that assign the two objects to the same cluster. The consensus solution is also the final solution that gCLUTO presents to the user in the solution report.
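The feature resampling and the consensus similarity described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not gCLUTO's implementation: the function names `resample_features` and `co_association` are hypothetical, a dataset is a plain list of rows, a clustering solution is a list of cluster labels (one per object), and the similarity is returned as a fraction between zero and one rather than a percentage.

```python
import random
from itertools import combinations


def resample_features(matrix, rng):
    """Build a hypothetical dataset by drawing columns with replacement.

    The same randomly chosen column indices are applied to every row,
    so each object keeps a consistent (resampled) feature set.
    """
    n_cols = len(matrix[0])
    picked = [rng.randrange(n_cols) for _ in range(n_cols)]
    return [[row[j] for j in picked] for row in matrix]


def co_association(solutions):
    """Similarity of two objects = fraction of solutions that put them
    in the same cluster (hypothetical helper, labels are arbitrary ids)."""
    n = len(solutions[0])
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sim[i][i] = 1.0
    for a, b in combinations(range(n), 2):
        together = sum(1 for labels in solutions if labels[a] == labels[b])
        sim[a][b] = sim[b][a] = together / len(solutions)
    return sim


# Three bootstrap solutions over four objects. Label values need not
# agree across solutions; only co-membership matters.
solutions = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
sim = co_association(solutions)
```

Clustering the objects on `sim` (for example with a graph-partitioning method) would then yield a consensus solution. Because the measure depends only on co-membership, no label matching between solutions is needed at this step.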
3.1.5 Characteristics of the Various Clustering Algorithms
The various clustering algorithms provided by gCLUTO have been designed, and are well-suited, for finding different types of clusters, allowing for different types of analysis. There are two general types of clusters that often arise in different application domains and different analysis requirements. What differentiates them is the similarity relations among the objects assigned to the various clusters. The first type contains clusters in which the similarity between all pairs of objects assigned to the same cluster will be high. On the other hand, the second type contains clusters in which the direct pairwise similarity between the various objects of the same cluster may be quite low, but within each cluster there exists a sufficiently large number of other objects that eventually “connect” these low similarity objects. That is, if we consider the object-to-object similarity graph, then these objects will be connected by many paths that stay within the cluster and traverse high similarity edges. The names of these two cluster types have been inspired by this similarity-based view, and they are referred to as globular and transitive clusters, respectively. gCLUTO’s partitional and agglomerative algorithms are able to find clusters that are primarily globular, whereas its graph-partitioning and some of its agglomerative algorithms (e.g., single-link) are capable of finding transitive clusters.

3.1.6 Similarity Measures
gCLUTO’s feature-based clustering algorithms treat the objects to be clustered as vectors in a multi-dimensional space and measure the degree of similarity between these objects using either the cosine function, Pearson’s correlation coefficient, the extended Jaccard coefficient [48], or a similarity measure derived from the Euclidean distance of these vectors. The first two similarity measures can be used by all clustering algorithms, whereas the last two can be used only by the graph-partitioning-based algorithms.

By using the cosine and correlation coefficient measures, two objects are similar if their corresponding vectors¹ point in the same direction (i.e., they have roughly the same set of features and in the same proportion), regardless of their actual length. On the other hand, the Euclidean distance does take into account both direction and magnitude. Finally, similarity based on the extended Jaccard coefficient accounts for both angle and magnitude. These are some of the most widely used measures, and have been used extensively to cluster a variety of different datasets.

3.1.7 Computational Requirements
gCLUTO’s algorithms have been optimized for operating on very large datasets both in terms of the number of objects as well as the number of dimensions. Nevertheless, the various clustering algorithms have different memory and computational scalability characteristics. The agglomerative based schemes can cluster datasets containing 2,000–5,000 objects in under a minute, but due to their memory requirements they should not be used to cluster datasets with over 10,000 objects. The partitional algorithms are very scalable both in terms of memory and computational complexity, and can cluster datasets containing several tens of thousands of objects in a few minutes. Finally, the complexity of the graph-based schemes is usually between that of agglomerative and partitional methods, and they maintain the low memory requirements of the partitional schemes.

3.2 Clustering Work-flow and Organization
The main strength of gCLUTO is its ability to organize the user’s data and work-flow in a way that eases the process of data analysis. This work-flow often consists of a sequence of stages, such as importing and preparing data, selecting clustering options, interpreting solution reports, and concluding with visualization. Each stage of the process demands decisions to be made by the user that can alter the course of data analysis. Consequently, the user may want to backtrack to previous stages and create a new branch of analysis. An overview of this work-flow with examples of branching is depicted in Figure 1.

Figure 1: Overview of gCLUTO’s work-flow with example screen-shots for each stage.

gCLUTO assists these types of work-flows by introducing the concept of a project. A project manages the various data files, solution reports, and visualizations that the user generates by storing and presenting them in a single container. Figure 2 illustrates how gCLUTO uses a tree to represent a project as it progresses through the stages of data analysis.

The work-flow of a user begins by creating a new project. gCLUTO will create a new directory to hold all project related files as well as a new empty project tree. Next the user imports one or more related datasets. These datasets are represented by icons that appear directly beneath the project tree’s root. After importation, the user can cluster a dataset to produce a clustering solution. For each solution, a solution report is generated which contains statistics about the clustering. Clustering solutions are represented by an ‘S’ icon and are placed beneath the clustered dataset’s icon in the project tree. As more clustering solutions are generated, the project tree will continue to organize them by their corresponding datasets. Lastly, the work-flow concludes with interpreting a solution using one or more visualizations. Again, the project tree will reflect which solutions have generated visualizations by placing visualization icons beneath them.
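The project tree described in this section (datasets beneath the root, solutions beneath their dataset, visualizations beneath their solution) can be illustrated with a small sketch. Everything here is hypothetical and chosen only for illustration: the `Node` class, the `render` helper, and the example names are assumptions, and nothing is implied about how gCLUTO stores its tree internally.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str  # "project", "dataset", "solution", or "visualization"
    name: str
    children: list = field(default_factory=list)

    def add(self, kind, name):
        """Attach a child node and return it so branches can be extended."""
        child = Node(kind, name)
        self.children.append(child)
        return child


def render(node, depth=0):
    """Return the tree as indented text lines, one node per line."""
    lines = ["  " * depth + f"[{node.kind}] {node.name}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


project = Node("project", "sports-analysis")
docs = project.add("dataset", "sports.mat")
first = docs.add("solution", "10-way repeated bisection")
first.add("visualization", "matrix")
# Backtracking to the dataset and clustering again creates a new branch.
docs.add("solution", "8-way direct")
```

Calling `"\n".join(render(project))` gives an indented listing in which each branch of analysis stays grouped under the dataset it came from.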
¹ In the case of Pearson’s correlation coefficient, the vectors are obtained by first subtracting their average value.

3.2.1 Creating a New Project
The user begins their analysis by first creating a new project. A project is intended to hold one or more related datasets. The project tree provides an easy interface for switching between datasets and comparing their results.

When a project is saved, all of the project information is saved under a single project directory specified by the user. Within the project directory, directories and text files are used to capture the same tree structure seen in the gCLUTO project tree. This straightforward format is used so that third party applications can access gCLUTO’s project data. In addition, gCLUTO allows exporting of solutions and printing of visualizations to standard formats for external use.

3.2.2 Importing Data
Datasets can be imported into gCLUTO in a variety of formats. Currently the supported formats include (i) sparse and dense vectors, (ii) object-to-object similarity matrices, and (iii) character delimited files.

The vector format contains a matrix written in either dense or sparse form. Each row of the matrix represents an object, whereas each column represents a feature. The value in the ith row and jth column represents the strength of feature j in object i. With this matrix, gCLUTO will compare objects by comparing their vectors. This format can be used to directly represent a wide range of datasets, including the document-term matrices commonly used in information retrieval, customer purchasing transactions, gene expression measurements, or any other dataset that is represented as a rectangular matrix whose rows are the objects and columns are the various dimensions.

If such vectors are not available, but information about object pairwise similarities is available, then gCLUTO’s similarity format can be used. This format consists of a square matrix with the same number of rows and columns as the number of objects. The value in the ith row and jth column represents the similarity of the ith and jth object. Note that the user can specify either a dense or a sparse similarity matrix. The similarity entries that are not supplied are assumed to be zero.

Character delimited files contain the same information as gCLUTO’s vector format except in a more common and flexible form. Most spreadsheet applications can export data in character delimited formats.

3.2.3 Clustering
Once a dataset is imported into gCLUTO, clustering (using the various algorithms described in Section 3.1) can be initiated by selecting the desired options from the clustering options dialog pictured in Figure 3.

Figure 3: Several screen-shots of the clustering dialog.

A full listing of the available options is shown in Table 1. These options have been organized into four sections: General, Preprocess, Bootstrap, and Miscellaneous. The most general options include specifying the number of desired clusters, the clustering method, and similarity and criterion functions. The preprocess options allow the user to prepare their data before clustering. This is accomplished by using model and pruning functions. The models scale various portions of the dataset, whereas the pruning options generate a more descriptive subset of the dataset. These options are necessary for datasets that have value distributions that may skew clustering algorithms. In addition, these pre-processing options can be used to implement a number of object and feature weighting schemes that are used within the context of document clustering including tf-(no)idf, maxtf-(no)idf, and logtf-(no)idf [39].

Table 1: Clustering options available in gCLUTO. Some options are only available for certain clustering methods.

  General
    # of Clusters     Number of clusters that the algorithm should find
    Method            Clustering algorithm to use
    Similarity        Function to measure the similarity between two objects
    Criterion         Function to guide algorithm by evaluating intermediate
  Preprocess
    Row               Scales the values of each row in data matrix
    Column            Globally scales the values of each row across rows
    Graph             Determines when an edge will exist between two vertices
    Column            Remove columns that do not contribute to similarity
    Vertex            Remove vertices that tend to be outliers
    Edge              Remove edges that tend to connect clusters
  Bootstrap
    Perform           Whether to perform bootstrap clustering
    # of Iterations   Number of solutions to create
    Features          Whether to resample the features in each iteration
    Residuals         Whether to resample data by adding residuals
  Miscellaneous
    Graph Options
      Components      Remove small connected components before clustering
      Neighbors       # of nearest neighbors used in graph-partitioning
    # of Trials       # of clusterings to create to search for best clustering
    # of Iterations   # of refinement iterations in partitioning
    Selection         Determines how to select next cluster for bisection
    K-way refine      Whether to k-way refine a repeated bisectioning solution

3.3 Solution Reports
Solution reports are generated for each dataset that is clustered. Solution reports contain information about the clustering options used and statistics about the discovered clusters. These statistics include the number of clusters, cluster sizes, the average internal and external similarities (ISim and ESim), the average internal and external standard deviations of these similarities (ISdev and ESdev), and a list of the most descriptive and discriminating features for each cluster. For each of these features, gCLUTO also displays the percentage of the within cluster similarity and across cluster difference that these features account for, respectively. If known classes are specified for the objects, then the entropy, purity, and class conservation statistics are also displayed.

In Figure 4 an example solution report is given for a dataset containing documents about sports. Each object is a document, represented by the words it contains. A class file has been specified for this dataset that allows gCLUTO to compare its clustering to the known classes. With this information gCLUTO can calculate the purity and entropy of a cluster by noting how many different classes are associated with the objects of the cluster. The class distribution matrix shows how many objects of each cluster belong to each class. From the class distribution, we can see that clusters 0 through 6 associate strongly with a single class. Cluster 7, however, appears to contain objects from many classes.
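To make the purity and entropy statistics concrete, the sketch below computes them from one row of a class distribution matrix. The paper does not state the formulas, so the definitions used here are an assumption: purity is taken as the fraction of a cluster's objects that belong to its largest class, and entropy as the Shannon entropy of the class proportions normalized by the logarithm of the number of classes. The counts are made up for illustration and are not taken from Figure 4.

```python
import math


def purity(counts):
    """Fraction of the cluster's objects that fall in its dominant class."""
    return max(counts) / sum(counts)


def entropy(counts, n_classes):
    """Normalized entropy of the class mix: 0 = one class, 1 = uniform mix."""
    total = sum(counts)
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return h / math.log(n_classes)


# Rows are clusters, columns are classes (illustrative counts only).
class_distribution = [
    [48, 1, 1],    # dominated by a single class
    [10, 10, 10],  # evenly mixed across classes
]

stats = [(purity(row), entropy(row, len(row))) for row in class_distribution]
```

Under these definitions the first cluster has high purity and low entropy, whereas the evenly mixed cluster has a purity of one third and an entropy of one, which matches the qualitative reading of a cluster that "contains objects from many classes".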
Figure 4: Example solution report of a clustering of sports related documents. The sections of this report in order are clustering options, cluster statistics, class distribution, and descriptive and discriminating features.

3.4 Visualization
gCLUTO can generate two different visualizations that can be used to gain insight into the relationships between the clusters and the relationships between the objects within each cluster. Both of these visualizations are entirely interactive and can be easily customized and modified by the user.

This display allows the user to visually inspect their data for patterns. In an ideal clustering solution, rows belonging to the same cluster should have relatively similar patterns of red and green. The visualization emphasizes these patterns for the user by displaying them in contiguous blocks. If the features represent a sequence, for example measurements in a time-course experiment, then the user can identify trends that occur across the features. The user may also be able to identify more questionable clusters by observing stark dissimilarities between rows within a cluster.

Figure 5: A screen-shot of the Matrix visualization.

In addition to the color matrix, the visualization also includes labels and hierarchical trees located at the edges of the matrix. If the user supplies labels with their data, then the rows of the matrix will be labeled with object names and the columns with feature names. If the user clusters their data with an agglomerative algorithm, then the agglomerative tree will be displayed on the left-hand side of the visualization. The user may also generate a hierarchical tree even if a partitional clustering algorithm was used. In such cases, gCLUTO performs additional agglomerative clustering within each partitional cluster and a single agglomerative clustering of the clusters themselves. Using the trees generated from these additional clusterings, gCLUTO constructs a single hierarchical tree that conforms to the same cluster structure found with the partitional algorithm. Lastly, the Matrix Visualization can also display a hierarchical tree called the feature tree, which is generated by performing agglomerative clustering on the transpose of the data matrix.

Similar to visualizations in other clustering applications, the hierarchical tree depicts relationships between objects by displaying the order in which objects were merged in the agglomerative process. Since merging is performed by descending pair-wise similarity, objects that are near each other in the tree are more similar than objects placed in distant locations. However, if users want to draw conclusions about object similarities using the hierarchical tree, they must keep in mind that a two-dimensional drawing of a hierarchical tree is not unique. That is, for every parent node in the tree, the two children nodes and their sub-trees can be drawn in one of two possible orientations: top or bottom (note that gCLUTO draws the hierarchical tree with children placed to the upper-right or lower-right of their parents). gCLUTO removes this ambiguity by explicitly ordering the visual position of each subtree by choosing the set of orientations that maximizes the similarity between objects placed in consecutive rows in the Matrix Visualization.

Manipulating the Matrix Visualization. Once the Matrix visualization is generated, users can further explore their results by manipulating the visualization in several ways. First, the user may collapse any set of rows or columns in the matrix by collapsing the corresponding nodes in the hierarchical trees located above and to the left of the matrix. By collapsing a node of the tree, the user can hide all of the node’s descendants. In the matrix, the corresponding rows that belong to the leaves of the collapsed sub-tree are replaced by a single representative row. The representative row contains the average vector of all of the hidden rows and, thus, summarizes the data in a condensed form. This feature is especially useful for large datasets that are difficult to fully display on a computer monitor. Columns can also be collapsed in a similar manner. When a representative row crosses a representative column, the intersection is a representative cell, which contains the average value of the cells contained within the collapsed rectangular region.

A frequent use of row averaging is to view the cluster mid-point vectors. This can be done either by collapsing the appropriate nodes in the object hierarchical tree, or by selecting the “Show Only Clusters” option from the “Matrix” menu. The user may also quickly expand all collapsed nodes by choosing the “Show All Objects” option from the “Matrix” menu.

The last manipulation that is available to the user is scaling. A common problem with viewing similar visualizations in other applications is that it is difficult to represent a large dataset on a relatively small display. One solution is to only display a portion of the visualization at any one time and allow the user to scroll to view other portions. The downside to this solution is that the user has a narrow view of their data, which makes it difficult to compare local details to the global trends. Another solution is to shrink the graphics until they fit within the viewable area. In cases where the matrix has more rows and columns than the number of pixels available, it becomes difficult to appropriately represent the matrix without excessive distortion. gCLUTO implements a unique compromise by allowing the user to zoom in on portions of the matrix that are of interest, while zooming away from portions that are less important but are still needed for context.
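The row and column collapsing described under "Manipulating the Matrix Visualization" amounts to replacing a block of rows (or columns) by its average. The following sketch shows that arithmetic on a small matrix. It is only an illustration: the helper names `collapse_rows` and `collapse_cols` are hypothetical, the rows to hide are passed as plain indices rather than derived from a tree node, and nothing is implied about how gCLUTO implements the feature.

```python
def mean_row(rows):
    """Average vector of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def collapse_rows(matrix, hidden):
    """Replace the rows indexed by `hidden` with one representative row,
    placed where the first hidden row used to be."""
    hidden = sorted(set(hidden))
    representative = mean_row([matrix[i] for i in hidden])
    out = []
    for i, row in enumerate(matrix):
        if i == hidden[0]:
            out.append(representative)
        elif i not in hidden:
            out.append(list(row))
    return out


def collapse_cols(matrix, hidden):
    """Collapse columns by transposing, collapsing rows, transposing back."""
    transposed = [list(col) for col in zip(*matrix)]
    return [list(row) for row in zip(*collapse_rows(transposed, hidden))]


matrix = [
    [1, 3, 5],
    [3, 5, 7],
    [10, 20, 30],
]
rows_collapsed = collapse_rows(matrix, [0, 1])          # rows 0 and 1 hidden
both_collapsed = collapse_cols(rows_collapsed, [1, 2])  # then columns 1 and 2
```

In `both_collapsed` the top-right entry is the representative cell: it equals the average of the four original cells (3, 5, 5, 7) covered by the collapsed rectangular region.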
∇yk Error =
2 d k,j− δk,j yk − yj
