K-means clustering
1. Partitional clustering
• In contrast to hierarchical clustering, in partitional clustering you decide in advance the number of groups into
which the data will be split. These are called disjoint groups because the clusters are not nested inside one larger
final cluster (as they are in hierarchical clustering).
• From hierarchical clustering:
§ Draw a cut line across the hierarchical clustering tree to limit how far the clustering process goes.
§ By varying the cut height, we can produce an arbitrary number of clusters (K parameter).
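The effect of the cut height can be sketched with a small Python example. It assumes single-linkage clustering on one-dimensional values, where cutting the tree at a given height is equivalent to splitting the sorted values wherever the gap between neighbours exceeds that height. The function name `cut_single_linkage_1d` and the toy values are hypothetical and are not part of these notes.

```python
def cut_single_linkage_1d(values, cut_height):
    """Clusters obtained by cutting a single-linkage tree of 1-D values.

    Assumption: single linkage in one dimension, so two neighbouring
    values stay together when their gap is at most cut_height.
    """
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > cut_height:
            clusters.append([current])    # gap too large: start a new cluster
        else:
            clusters[-1].append(current)  # small gap: same cluster
    return clusters


# Toy expression values (illustrative only).
expression = [1.0, 1.2, 1.5, 4.0, 4.3, 9.0]

for height in (0.25, 1.0, 3.0, 6.0):
    clusters = cut_single_linkage_1d(expression, height)
    print(f"cut height {height}: K = {len(clusters)} -> {clusters}")
```

A lower cut height gives more, smaller clusters; a higher one gives fewer, larger clusters, so K is controlled indirectly through the height.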
3. Calculate the mean position of the points in each cluster and define it as the new centroid for this cluster.
4. Again, calculate the distance from every gene to each newly calculated centroid, and assign each event to
the closest centroid (group).
5. Repeat until the assignments stabilize. That is, recalculate the centroids as the mean of the points in each
cluster, recalculate the distance from each gene to each new centroid, and reassign each event to the closest
centroid, until the clusters no longer change from the previous iteration.
• When a cluster contains an outlier, its centroid is no longer representative of the cluster. The shifted centroid
may then attract, by distance, an event that was correctly assigned to another cluster, so that event ends up in
the wrong cluster.
• To avoid problems with outliers or noise, you can do K-means-style clustering by partitioning around medoids:
§ Pick one real data point (the one closest to all the events of a cluster), instead of the mean of the
points clustered together, as the center of the cluster.
§ More robust in the presence of noise and outliers.
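A minimal Python sketch of why a medoid is more robust than a mean centroid, assuming Euclidean distance and a toy cluster with one outlier. The helper names `mean_point` and `medoid` and the data are hypothetical.

```python
import math


def mean_point(points):
    """Centroid as used by K-means: the coordinate-wise mean."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def medoid(points):
    """The real data point with the smallest total distance to all others."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))


# Four nearby points plus one outlier (illustrative data only).
cluster = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (20.0, 20.0)]

print(mean_point(cluster))  # (5.2, 5.2): dragged toward the outlier
print(medoid(cluster))      # (2.0, 2.0): still inside the dense group
```

The mean lands in empty space between the dense group and the outlier, while the medoid is always one of the observed points.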
6. K-means clustering (clustering by random assignment of all the observations)
1. Once you have fixed the number of clusters (K parameter), each event (point) is randomly assigned to one
of the K clusters.
2. Calculate the centroid of each cluster as the mean of the points in that cluster.
3. Regroup by proximity: calculate the distance from each point to the centroids and assign each point to the
closest cluster.
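The procedure can be sketched in plain Python as follows. This is a minimal illustration, assuming Euclidean distance and a fixed random seed; the function name `kmeans_random_partition`, the handling of empty clusters, and the toy points are assumptions, not part of these notes.

```python
import math
import random


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def kmeans_random_partition(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly assign every point to one of the k clusters.
    labels = [rng.randrange(k) for _ in points]
    centers = []
    for _ in range(max_iter):
        # Step 2: centroid of each cluster = mean of its points.
        centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random data point.
            centers.append(centroid(members) if members else rng.choice(points))
        # Step 3: reassign each point to the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Stop when the assignment no longer changes between iterations.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


# Two well-separated toy groups (illustrative data only).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
labels, centers = kmeans_random_partition(points, k=2)
print(labels)
print(centers)
```

The cluster numbers themselves are arbitrary and depend on the seed; only which points end up together is meaningful.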
• You start with K=2 since it is the minimum, and you can gradually increase K.
• Then you analyze the improvement of increasing K:
§ Reduced within-cluster distance.
§ Increased between-cluster distance.
• Then you analyze the cost with each increase in K.
• Compare cost with improvement, stop when it’s not worth it.
• There are different mathematical approaches:
§ W(k) → total sum of squares within clusters. Within-cluster distance (it should be small).
§ B(k) → total sum of squares between cluster means. Between-cluster distance (it should be big).
§ n → total number of data points.
§ Typically, k should be less than 10; stop at that point.
• In the Calinski method, you stop wherever you have the maximum increase.
• In the Hartigan method, you stop wherever you have the peak.
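The quantities W(k), B(k) and n can be combined as in the sketch below. The notes do not spell out the formulas, so this assumes the usual textbook forms: Calinski-Harabasz CH(k) = [B(k)/(k-1)] / [W(k)/(n-k)] and Hartigan H(k) = [W(k)/W(k+1) - 1](n-k-1). The function names, the toy points and the hand-written partitions (they are not the output of a K-means run) are all hypothetical.

```python
def mean_point(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def within_between(points, labels):
    """Return (W, B): within- and between-cluster sums of squares."""
    overall = mean_point(points)
    w = b = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        center = mean_point(members)
        w += sum(sq_dist(p, center) for p in members)
        b += len(members) * sq_dist(center, overall)
    return w, b


def calinski_harabasz(points, labels):
    # Assumed form: CH(k) = [B(k) / (k - 1)] / [W(k) / (n - k)]
    n, k = len(points), len(set(labels))
    w, b = within_between(points, labels)
    return (b / (k - 1)) / (w / (n - k))


def hartigan(w_k, w_k_plus_1, n, k):
    # Assumed form: H(k) = [W(k) / W(k + 1) - 1] * (n - k - 1)
    return (w_k / w_k_plus_1 - 1) * (n - k - 1)


# Three toy groups of three points each (illustrative data only).
points = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0), (5, 8), (5, 9), (6, 8)]
partitions = {
    2: [0, 0, 0, 0, 0, 0, 1, 1, 1],  # two groups merged
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],  # one cluster per group
    4: [0, 0, 0, 1, 1, 1, 2, 2, 3],  # one group split further
}

w = {k: within_between(points, labs)[0] for k, labs in partitions.items()}
for k, labs in partitions.items():
    print(k, "W =", round(w[k], 3), "CH =", round(calinski_harabasz(points, labs), 3))
for k in (2, 3):
    print(k, "H =", round(hartigan(w[k], w[k + 1], len(points), k), 3))
```

On this toy data, W(k) drops sharply from k=2 to k=3 and only slightly from k=3 to k=4, which is the "improvement versus cost" trade-off described above.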
• Another option when there is a lot of noise is consensus clustering. In RNA-seq, even if you select only the
most differentially expressed genes, there are still many noisy genes (or noisy samples, if you are
clustering by samples).
• Consensus clustering is based on cluster ensembles (the combination or aggregation of multiple clustering results
(partitions) into a single consolidated solution).
• It reconciles clustering information about the same data set coming from different sources or from different runs
of the same algorithm.
• One type of consensus clustering is tight clustering:
1. Random sample X’ from original data X → select only a subset of genes/samples from the original dataset.
2. Run K-means clustering with an initially reasonable K (try 7).
3. Based on the generated clusters, use the cluster centers to classify the original data by K-means clustering.
4. Fill the co-membership matrix: once the clustering has converged for the first time, record which genes belong
to the same cluster. This is a big matrix of pairwise genes (it remembers which genes are together and whether
they have been clustered together).
5. Start again:
a. Random sample X’ from original data X → select only a subset of genes/samples from the original
dataset. This must be a different selection from the first one (some genes can be selected
repeatedly, but not all of them).
b. Run K-means clustering with an initially reasonable K (try 7).
c. Based on the generated clusters, use the cluster centers to classify the original data by K-means clustering.
d. Fill the co-membership matrix: record how many times the clustering has converged and how
many times each pair of genes has belonged to the same cluster. This is a big matrix of pairwise genes
(it remembers which genes are together and whether they have been clustered together).
6. Repeat the process until the algorithm finds sets based on the average co-membership over all the
co-membership matrices. Do not force clustering of all the data points: some of them will only occasionally, or
never, be part of a cluster. This is why the method is noise tolerant and can find biologically robust clusters.
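The co-membership bookkeeping in steps 4 to 6 can be sketched as below. This is a simplified illustration and not the published tight clustering procedure: it assumes the cluster labels from each resampling run are already available, and it uses a hypothetical greedy rule with a hypothetical threshold of 0.9 to extract the sets. All function names and label vectors are made up for the example.

```python
from itertools import combinations


def average_comembership(runs):
    """Fraction of runs in which each pair of genes shared a cluster."""
    n = len(runs[0])
    counts = [[0] * n for _ in range(n)]
    for labels in runs:
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                counts[i][j] += 1
                counts[j][i] += 1
    return [[c / len(runs) for c in row] for row in counts]


def tight_sets(avg, threshold=0.9):
    """Greedy grouping (assumption): a gene joins a set only if its average
    co-membership with every current member reaches the threshold."""
    n = len(avg)
    unused = set(range(n))
    sets = []
    for i in range(n):
        if i not in unused:
            continue
        group = [i]
        for j in range(i + 1, n):
            if j in unused and all(avg[j][m] >= threshold for m in group):
                group.append(j)
        if len(group) > 1:  # singletons are left unclustered
            sets.append(group)
            unused -= set(group)
    return sets


# Cluster labels of 6 genes in 4 resampling runs (illustrative only).
runs = [
    [0, 0, 0, 1, 1, 2],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [2, 2, 2, 0, 0, 1],
]

avg = average_comembership(runs)
print(tight_sets(avg))  # [[0, 1, 2], [3, 4]]
```

Gene 5 joins genes 3 and 4 in only half of the runs, so it is left out of every set rather than being forced into a cluster.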
• Another way to do consensus clustering is iCluster:
§ It reconciles the clustering information about the same samples obtained with different profiling techniques.
§ If you cluster the same samples/genes using each of the different layers (methylation, gene
expression, copy number, mutation), you will obtain different clusters for each layer. This
algorithm instead takes all of the matrix layers into account, building the co-membership matrices of each
layer and comparing them.
§ Typically used with the TCGA database.