Hierarchical clustering: New bounds and objective

M Rahgoshay, MR Salavatipour - arXiv preprint arXiv:2111.06863, 2021 - arxiv.org
Hierarchical Clustering has been studied and used extensively as a method for analysis of data. More recently, Dasgupta [2016] defined a precise objective function. Given a set of $n$ data points with a weight function $w_{i,j}$ for each two items $i$ and $j$ denoting their similarity/dissimilarity, the goal is to build a recursive (tree-like) partitioning of the data points (items) into successively smaller clusters. He defined the cost of a tree $T$ to be $Cost(T)=\sum_{i,j\in [n]}\big(w_{i,j}\times |T_{i,j}|\big)$, where $T_{i,j}$ is the subtree rooted at the least common ancestor of $i$ and $j$, and presented the first approximation algorithm for such clustering.
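As a concrete illustration of this cost function, here is a minimal Python sketch (not taken from the paper; the nested-tuple tree encoding and the toy weights are assumptions made for illustration) that computes $Cost(T)$ for a small hierarchy over four items:

    from itertools import combinations

    # A hierarchy over items 0..3 encoded as nested pairs: internal nodes
    # are 2-tuples of subtrees, leaves are item indices. (This encoding is
    # an assumption made here for illustration; the paper fixes none.)
    T = ((0, 1), (2, 3))

    def leaves(t):
        """Set of items (leaves) under subtree t."""
        return {t} if isinstance(t, int) else leaves(t[0]) | leaves(t[1])

    def lca_subtree(t, i, j):
        """T_{i,j}: the subtree rooted at the least common ancestor of i and j."""
        for child in t:
            if not isinstance(child, int) and {i, j} <= leaves(child):
                return lca_subtree(child, i, j)
        return t

    def dasgupta_cost(t, w):
        """Cost(T) = sum over pairs {i,j} of w[i,j] * |T_{i,j}|."""
        return sum(w[i, j] * len(leaves(lca_subtree(t, i, j)))
                   for i, j in combinations(sorted(leaves(t)), 2))

    # Toy similarity weights (hypothetical values).
    w = {(0, 1): 3, (2, 3): 3, (0, 2): 1, (0, 3): 1, (1, 2): 1, (1, 3): 1}
    print(dasgupta_cost(T, w))  # 3*2 + 3*2 + 4*(1*4) = 28

Separating the heavy pairs $(0,1)$ and $(2,3)$ only at the bottom of the tree keeps their $|T_{i,j}|$ small; this is exactly what minimizing $Cost(T)$ rewards, since similar items should be split as late as possible.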
Then Moseley and Wang [2017] considered the dual of Dasgupta's objective function for similarity-based weights and showed that both random partitioning and average linkage have approximation ratio $1/3$, which has been improved in a series of works to $0.585$ [Alon et al. 2020]. Later, Cohen-Addad et al. [2019] considered the same objective function as Dasgupta's but for dissimilarity-based metrics, where the objective is maximized rather than minimized. It is shown that both random partitioning and average linkage have ratio $2/3$, which has been only slightly improved to $0.667078$ [Charikar et al. SODA2020].

Our first main result is to consider this dissimilarity objective and present a more delicate algorithm and careful analysis that achieves a $0.71604$ approximation. We also introduce a new objective function for dissimilarity-based clustering. For any tree $T$ and two items $i$ and $j$, let $H_{i,j}$ be the number of common ancestors of $i$ and $j$. Intuitively, items that are similar are expected to remain within the same cluster as deep as possible. So, for dissimilarity-based metrics, we suggest the cost of each tree $T$, which we want to minimize, to be $Cost_H(T)=\sum_{i,j\in [n]}\big(w_{i,j}\times H_{i,j}\big)$. We present a $1.3977$-approximation algorithm for this objective.
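Since the common ancestors of two leaves are exactly the nodes on the path from the root down to their least common ancestor, $H_{i,j}$ is just the depth of the LCA of $i$ and $j$ (with the root counted at depth 1). The following sketch, reusing the toy encoding above (again an illustration with hypothetical dissimilarity values, not the paper's code), computes the proposed cost:

    from itertools import combinations

    # Same nested-pair encoding and toy tree as in the previous sketch.
    T = ((0, 1), (2, 3))

    def leaves(t):
        """Set of items (leaves) under subtree t."""
        return {t} if isinstance(t, int) else leaves(t[0]) | leaves(t[1])

    def common_ancestors(t, i, j, depth=1):
        """H_{i,j}: number of common ancestors of leaves i and j.
        With the root counted at depth 1, this equals the depth of their LCA."""
        for child in t:
            if not isinstance(child, int) and {i, j} <= leaves(child):
                return common_ancestors(child, i, j, depth + 1)
        return depth

    def new_cost(t, w):
        """The proposed objective: sum over pairs of w[i,j] * H_{i,j}, minimized."""
        return sum(w[i, j] * common_ancestors(t, i, j)
                   for i, j in combinations(sorted(leaves(t)), 2))

    # Toy dissimilarity weights: (0,1) and (2,3) are the similar pairs.
    d = {(0, 1): 1, (2, 3): 1, (0, 2): 3, (0, 3): 3, (1, 2): 3, (1, 3): 3}
    print(new_cost(T, d))  # 1*2 + 1*2 + 4*(3*1) = 16

Each pair pays once per common ancestor, so minimizing this cost pushes dissimilar items to split near the root while similar items stay together deep in the tree, matching the intuition stated above.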