
CONGLOMERATE ANALYSIS (CLUSTER ANALYSIS)

K-means clustering

1. Partitional clustering

• In contrast to hierarchical clustering, in partitional clustering you decide in advance the number of groups that you want to find in your data. These are called disjoint groups, since the clusters are not nested inside a larger final cluster (as they are in hierarchical clustering).
• From hierarchical clustering:
§ Cut the dendrogram at a given height to delimit the clusters.
§ By varying the cut height, we can produce an arbitrary number of clusters (the K parameter), as the sketch below shows.
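As a hedged illustration (not part of the original notes), this is how a hierarchical clustering can be cut so that it yields a chosen number of clusters K; the data matrix X and all parameters are made up.

```python
# Cut a hierarchical clustering into K groups using scipy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                 # 20 genes/samples x 5 features (toy data)

Z = linkage(X, method="average")             # build the hierarchical clustering (dendrogram)
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree so that K = 3 clusters result
print(labels)
```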

2. K-means clustering (clustering by random centroids)


1. Choose K (the number of groups you want) centroids at random positions.
2. You want to cluster genes/samples (points) based on distance. Calculate the distance from every gene to each centroid and assign each event to the closest centroid (group).

3. Calculate the average (mean position) of the points in each cluster and define it as the new centroid of that group.
4. Again, calculate the distance from every gene to each newly calculated centroid, and assign each event to the closest centroid (group).
5. Repeat until the assignments stabilize: recompute the centroids as the mean of the points in each cluster, recalculate the distance from each gene to each new centroid, reassign each event to the closest centroid, and stop when the clusters do not change from the last iteration (see the sketch below).
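A from-scratch sketch of the loop above, assuming Euclidean distance; the function name, toy data and parameters are illustrative, not from the notes.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-means: random initial centroids, assign, re-center, repeat."""
    rng = np.random.default_rng(seed)
    # step 1: choose K centroids at random (here: K random data points)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # steps 2/4: distance from every point (gene/sample) to each centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)              # closest centroid wins
        if labels is not None and np.array_equal(new_labels, labels):
            break                                      # step 5: assignments stabilized
        labels = new_labels
        # step 3: the new centroid of a group is the mean of its points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# toy usage with made-up data
X = np.random.default_rng(1).normal(size=(30, 2))
labels, centers = kmeans(X, k=3)
```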

3. K-means algorithm generalities (clustering by random centroids)

• K-means is an iterative process.


• Deterministic method given the initial centroids (the first centroids are chosen at random):
§ The choice of the initial cluster centers is important: if you restart the algorithm with the same original three centroids, the result will be the same. On the other hand, if you randomly choose different initial centroids, the clustering result may be different.
§ Can be trapped in a local optimum: a situation in optimization where an algorithm finds a solution that is better than nearby solutions (in its local neighborhood) but is not the best possible solution overall (the global optimum).
• How to pick initial cluster centers (see the sketch after this list):
§ Run hierarchical clustering and find a cut line.
§ Randomly restart many times with different initial centroids.
• Might not be tolerant to outliers or noise.
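A hedged sketch of the random-restart idea, assuming scikit-learn is available: n_init runs K-means from many different random initial centroids and keeps the best result, which reduces the risk of ending up in a poor local optimum. The data here are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(2).normal(size=(50, 4))    # toy data matrix

# init="random" picks random starting centroids; n_init=20 restarts 20 times
# and keeps the run with the lowest within-cluster sum of squares.
km = KMeans(n_clusters=3, init="random", n_init=20, random_state=0).fit(X)
print(km.labels_[:10], km.inertia_)   # inertia_ = within-cluster sum of squares
```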
4. K-means clustering has problems with outliers or noise

• When there is an outlier point in a cluster, the centroid of that cluster will not be representative of the cluster, and it may attract, by distance, an event that was correctly assigned to another cluster, which would then end up in the wrong cluster.

5. K-means clustering (clustering around medoids)

• To avoid problems with outliers or noise, you can do K-means clustering by partitioning around medoids (PAM), illustrated below:
§ Pick one real data point (the one closest to all the other events of the cluster), instead of the average of the points clustered together, as the center (medoid) of the cluster.
§ More robust in the presence of noise and outliers.
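A small illustrative sketch (not from the notes) of the medoid idea: within one cluster, the medoid is the real point with the smallest total distance to the others, so an outlier drags the centroid but barely moves the medoid. The toy points are assumptions.

```python
import numpy as np

def cluster_medoid(points):
    """Return the point of `points` closest (in total distance) to all the others."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return points[dists.sum(axis=1).argmin()]

cluster = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [10.0, 10.0]])  # last point is an outlier
print(cluster.mean(axis=0))        # centroid is dragged toward the outlier
print(cluster_medoid(cluster))     # medoid stays on a representative real point
```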
6. K-means clustering (clustering by random assignment of all the observations)

1. Once you have fixed the number of clusters (the K parameter), each event (point) is randomly assigned to one of the K clusters.

2. Calculate the centroid of each cluster as the average of the points of that cluster.
3. Regroup by proximity: calculate the distance of each point to the centers and assign each point to the closest cluster.

4. Recalculate the centers and regroup the events by proximity.

5. Repeat until stabilization (see the sketch below).
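A minimal sketch of this random-assignment variant; only the initialization differs from the earlier kmeans() sketch (Euclidean distance and all names are assumptions).

```python
import numpy as np

def kmeans_random_assignment(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))      # step 1: random cluster for each point
    centroids = None
    for _ in range(max_iter):
        # steps 2/4: centroid of each cluster = mean of its points (reseed empty clusters)
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else X[rng.integers(len(X))]
                              for j in range(k)])
        # step 3: regroup by proximity to the new centers
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                  # step 5: stabilization
        labels = new_labels
    return labels, centroids
```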

7. How to pick K parameter (number of clusters)

• You start with K = 2, since it is the minimum, and you gradually increase K.
• Then you analyze the improvement of increasing K:
§ Reduced within-cluster distance.
§ Increased between-cluster distance.
• Then you analyze the cost of each increase in K.
• Compare cost with improvement, and stop when it is not worth it.
• There are different mathematical approaches:
§ W(k) → total sum of squares within clusters. Within-cluster distance (it should be small).
§ B(k) → total sum of squares between cluster means. Between-cluster distance (it should be big).
§ n → total number of data points.
§ Typically, k should be less than 10; stop at that point.

• In the Calinski method, you stop wherever you have the maximum increase.
• In the Hartigan method, you stop wherever you have the peak.
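The notes only name these criteria; a hedged reconstruction of the formulas usually attached to them, using the W(k), B(k) and n defined above, is:

```latex
% Calinski-Harabasz index (pick the k with the largest value / largest increase):
CH(k) = \frac{B(k)/(k-1)}{W(k)/(n-k)}

% Hartigan statistic (a common rule of thumb stops adding clusters
% once H(k) falls below about 10):
H(k) = \left( \frac{W(k)}{W(k+1)} - 1 \right)(n - k - 1)
```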

• Another option is to choose the k parameter visually with the knee method:

§ Axes:
o X-axis: k parameter (number of clusters).
o Y-axis: total intergroup (between-group) variability.
§ Choose the k parameter that explains the maximum between-group variability with the minimum number of clusters (k) possible, since higher values of k are not worth it: they cannot explain much more variability between groups.
§ Choose the knee point of the graph (the protuberance), as in the sketch below.
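A hedged sketch of the knee inspection, assuming scikit-learn: W(k) is taken from KMeans's inertia_ (total within-cluster sum of squares) for k = 2..10, and the knee is then picked by eye from the plot or printout. Data and parameters are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(3).normal(size=(100, 5))           # toy data
ks = range(2, 11)                                             # k typically kept below 10
W = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
for k, w in zip(ks, W):
    print(k, round(w, 1))    # plot k vs W(k) and pick the knee where the curve flattens
```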

• In RNA-seq, if you cluster too many genes you end up with a huge K:

§ To avoid it, only cluster the genes that are differentially expressed between samples.
§ The magic number would be 7 at maximum (try 7 if you are clustering by genes).
§ In addition to describing whether a gene cluster is upregulated or downregulated, other features can be described, such as gene ontology.
8. Consensus clustering

• Another thing you can use if there is a lot of noise is consensus clustering. In RNA-seq, even if you only selected the most differentially expressed genes, there are still a lot of noisy genes, or even noisy samples if you are clustering by samples.
• Consensus clustering is based on cluster ensembles (the combination or aggregation of multiple clustering results (partitions) into a single consolidated solution).
• It reconciles clustering information about the same data set coming from different sources or from different runs of the same algorithm.
• One type of consensus clustering is tight clustering (sketched below):
1. Take a random sample X' from the original data X → select only a subset of genes/samples from the original dataset.
2. Run K-means clustering with an initially reasonable K (try 7).
3. Based on the generated clusters, use the cluster centers to classify the original data by K-means clustering.
4. Fill the co-membership matrix: since this is the first run, record which genes belong to the same cluster. It is a big matrix of pairwise genes (remember which genes are together, i.e., whether they have been clustered together).
5. Start again:
a. Take a random sample X' from the original data X → select only a subset of genes/samples from the original dataset; this must be a different selection from the first one (some genes can be selected repeatedly, just not all of them).
b. Run K-means clustering with an initially reasonable K (try 7).
c. Based on the generated clusters, use the cluster centers to classify the original data by K-means clustering.
d. Update the co-membership matrix: now you ask, across the runs so far, how many times each pair of genes has belonged to the same cluster. It is a big matrix of pairwise genes (remember which genes are together and whether they have been clustered together).
6. Repeat the process until the algorithm finds sets with high average co-membership based on all the co-membership matrices. Do not force the clustering of all the data points; some of them will only occasionally or never be part of a cluster, and this is why this method is noise tolerant and you can find biologically robust clusters.
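A deliberately simplified, assumption-laden sketch of the co-membership idea (not the published tight-clustering algorithm): subsample, cluster with K-means, classify all of the original data with the resulting centers, and count how often every pair of points lands in the same cluster. All names, sizes and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def co_membership(X, k=7, n_runs=50, frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))                      # pairwise co-membership counts
    for _ in range(n_runs):
        idx = rng.choice(n, size=int(frac * n), replace=False)   # random sample X'
        km = KMeans(n_clusters=k, n_init=5,
                    random_state=int(rng.integers(1_000_000))).fit(X[idx])
        labels = km.predict(X)                       # classify ALL original data with these centers
        together += labels[:, None] == labels[None, :]
    return together / n_runs                         # fraction of runs each pair co-clustered

X = np.random.default_rng(4).normal(size=(60, 10))   # toy gene-expression-like matrix
C = co_membership(X)
print(C.shape)   # pairs with values near 1 are robustly clustered together
```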
• Another way to do consensus clustering is iCluster:
§ It reconciles the clustering information about the same samples obtained with different profiling techniques.
§ If you cluster the same samples/genes using each of the different layers (methylation, gene expression, copy number, mutation) separately, you will obtain different sets of clusters. This algorithm instead takes all of the matrix layers into account, building the co-membership matrices of each layer and comparing them.
§ It is typically used with the TCGA database.
